An End-to-End RAG example using a FAISS retriever with LangChain and OpenAI GPT-3.5 for QA#

This notebook walks through an end-to-end example of the library. It encapsulates a small dataset, exposes it as a LangChain retriever, and uses a RAG (Retrieval-Augmented Generation) chain powered by GPT-3.5 to answer questions about it, first without redaction and then with read context rules that redact PII.

[ ]:
!pip install "antimatter[langchain]"
!pip install python-dotenv openai

Load the OpenAI key from a .env file#

[13]:
import dotenv
import os
# expanduser("~") also resolves the home directory when HOME is unset
dotenv.load_dotenv(os.path.join(os.path.expanduser("~"), '.openai_env'))
[13]:
True
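
Before moving on, it can help to confirm that the key actually reached the environment without printing it in full. The following is a minimal sketch, assuming the key is stored under the name OPENAI_API_KEY (the variable the openai client reads by default). The helper names `require_env` and `mask` are hypothetical and not part of any library used here.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name, "")
    if not value:
        raise RuntimeError(f"{name} is not set; check the .env file path")
    return value


def mask(secret: str, visible: int = 4) -> str:
    """Hide all but the last few characters so the value is safe to print."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]
```

A typical call would be `print(mask(require_env("OPENAI_API_KEY")))`, which fails early with a clear message if the .env file was not found.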

Register a domain and create a read/write context#

[15]:
import os
from antimatter import new_domain, Session
from antimatter.builders import *
from antimatter.datatype.datatypes import Datatype
[65]:
# Either create a new domain or use an existing one
if True:
    sess = new_domain("[email protected]")
    print(f"domain: {sess.domain_id}")
    # print ("api key: %s" % (sess.api_key))
    # print(f"sess = Session(domain='{sess.domain_id}', api_key='{sess.api_key}')")
else:
    sess = Session(domain='<domain_id>', api_key='<api_key>')

file_name = "/tmp/testdata.capsule"
domain: dm-TWWpmXTE35r

Add some facts to this domain#

Create a fact type called is_project_member with the arguments email and project, then add two facts of this type:

- is_project_member(email="test@test.com", project="project1")
- is_project_member(email="test2@test2.com", project="project2")

[66]:
sess.add_fact_type(
    "is_project_member",
    description="Team membership",
    arguments={"email": "email of the member", "project": "name of the project"},
)

sess.add_fact(
    "is_project_member",
    "test@test.com",
    "project1",
)

sess.add_fact(
    "is_project_member",
    "test2@test2.com",
    "project2",
)
[66]:
{'id': 'ft-12sadzxf1ghsysim',
 'name': 'is_project_member',
 'arguments': ['test2@test2.com', 'project2']}
[67]:
sess.list_facts('is_project_member')
[67]:
[{'id': 'ft-12sadzxf1ghsysim',
  'name': 'is_project_member',
  'arguments': ['test2@test2.com', 'project2']},
 {'id': 'ft-i3hq2mh61rrhtx64',
  'name': 'is_project_member',
  'arguments': ['test@test.com', 'project1']}]
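
To make the idea of a fact type concrete, here is a small in-memory model of it: a named schema with ordered arguments, plus a set of argument tuples. This is a sketch for illustration only and says nothing about how the service stores facts. The `FactType` class and its `add` and `holds` methods are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class FactType:
    """A named relation with ordered argument names and a set of stored facts."""
    name: str
    arguments: tuple
    facts: set = field(default_factory=set)

    def add(self, *values: str) -> None:
        """Store one fact; the number of values must match the schema."""
        if len(values) != len(self.arguments):
            raise ValueError(
                f"{self.name} expects {len(self.arguments)} arguments, got {len(values)}"
            )
        self.facts.add(tuple(values))

    def holds(self, **query: str) -> bool:
        """Return True if some stored fact matches every argument given in the query."""
        unknown = set(query) - set(self.arguments)
        if unknown:
            raise KeyError(f"unknown arguments: {sorted(unknown)}")
        for fact in self.facts:
            row = dict(zip(self.arguments, fact))
            if all(row[key] == value for key, value in query.items()):
                return True
        return False


members = FactType("is_project_member", ("email", "project"))
members.add("test@test.com", "project1")
members.add("test2@test2.com", "project2")
```

With this model, `members.holds(email="test@test.com", project="project1")` is true, while the same email paired with `project2` is not.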

Open a dataset#

[130]:
# Load dataset
import pandas as pd

data = [
    {"id":1,"first_name":"Amanda","last_name":"Jordan","email":"[email protected]","gender":"Female","ip_address":"1.197.201.2","cc":"6759521864920116","country":"Indonesia","birthdate":"3\\/8\\/1971","salary":49756.53,"title":"Internal Auditor","comments":"Hello friends, my name is Alice Johnson and I just turned 29 years old! \\ud83c\\udf89 I am looking forward to connecting with all of you. Feel free to drop me a line at [email protected] or call me at 415-123-4567."},
    {"id":2,"first_name":"Albert","last_name":"Freeman","email":"[email protected]","gender":"Male","ip_address":"218.111.175.34","cc":"","country":"Canada","birthdate":"1\\/16\\/1968","salary":150280.17,"title":"Accountant IV","comments":"Customer feedback: I recently visited your store at 5678 Pine Avenue, Dallas, TX 75201. My name is Jane Doe, age 43. I had a wonderful experience and the staff was very friendly. You can reach out to me at [email protected] for any further details."},
    {"id":3,"first_name":"Evelyn","last_name":"Morgan","email":"[email protected]","gender":"Female","ip_address":"7.161.136.94","cc":"6767119071901597","country":"Russia","birthdate":"2\\/1\\/1960","salary":144972.51,"title":"Structural Engineer","comments":"Booking Confirmation: Thank you, David Smith (DOB: 01\\/12\\/1978) for booking with us. We have received your payment through the credit card ending with 1234. Your booking ID is #67890. Please save this email for your records. For any queries, contact us at [email protected]."},
]

df = pd.DataFrame(data)
df.head()
[130]:
id first_name last_name email gender ip_address cc country birthdate salary title comments
0 1 Amanda Jordan [email protected] Female 1.197.201.2 6759521864920116 Indonesia 3\/8\/1971 49756.53 Internal Auditor Hello friends, my name is Alice Johnson and I ...
1 2 Albert Freeman [email protected] Male 218.111.175.34 Canada 1\/16\/1968 150280.17 Accountant IV Customer feedback: I recently visited your sto...
2 3 Evelyn Morgan [email protected] Female 7.161.136.94 6767119071901597 Russia 2\/1\/1960 144972.51 Structural Engineer Booking Confirmation: Thank you, David Smith (...

List and create write contexts#

[69]:
sess.list_write_context()
[69]:
[]
[70]:
# Create a new write context
sess.add_write_context(
    "write_ctx", WriteContextBuilder().\
        set_summary("Sample write context").\
        set_description("Sample description").\
        add_hook("fast-pii", ">1.0.0", WriteContextHookMode.Sync)
)
[71]:
sess.list_write_context()
[71]:
[{'name': 'write_ctx',
  'summary': 'Sample write context',
  'description': 'Sample description',
  'config': {'required_hooks': [{'hook': 'fast-pii',
     'constraint': '>1.0.0',
     'mode': 'sync'}]},
  'imported': False,
  'source_domain_id': None,
  'source_domain_name': None}]

Encapsulate data using the write context#

[72]:
df_capsule = sess.encapsulate(data=df, write_context="write_ctx", path=file_name)
[73]:
!ls -lrtha /tmp/testdata.capsule
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
-rw-rw-r-- 1 ajay ajay 9.3K Mar 25 12:48 /tmp/testdata.capsule

List and create read contexts#

[108]:
sess.list_read_context()
[108]:
[]
[110]:
sess.add_read_context("read_ctx",
    ReadContextBuilder().\
        set_summary("Sample read context").\
        set_description("Sample description").\
        add_required_hook("fast-pii", ">1.0.0").\
        add_read_parameter("key", True, "description")
)
[111]:
sess.list_read_context()
[111]:
[{'name': 'read_ctx',
  'summary': 'Sample read context',
  'description': 'Sample description',
  'read_parameters': [{'key': 'key',
    'required': True,
    'description': 'description'}],
  'imported': False,
  'source_domain_id': None,
  'source_domain_name': None}]

Open and read data based on read context#

[112]:
capsule = sess.load_capsule(path=file_name, read_context="read_ctx")

Retrieve the data as a LangChain retriever#

[113]:
retriever = capsule.data_as(dt=Datatype.LangchainRetriever)

Retrieve some data from the retriever#

[114]:
retriever._get_relevant_documents(query="Amanda Jordan", run_manager=None)
[114]:
[Document(page_content="{'id': '2', 'first_name': 'Albert', 'last_name': 'Freeman', 'email': '[email protected]', 'gender': 'Male', 'ip_address': '218.111.175.34', 'cc': '', 'country': 'Canada', 'birthdate': '1/16/1968', 'salary': '150280.17', 'title': 'Accountant IV', 'comments': 'Customer feedback: I recently visited your store at 5678 Pine Avenue, Dallas, TX 75201. My name is Jane Doe, age 43. I had a wonderful experience and the staff was very friendly. You can reach out to me at [email protected] for any further details.'}"),
 Document(page_content="{'id': '1', 'first_name': 'Amanda', 'last_name': 'Jordan', 'email': '[email protected]', 'gender': 'Female', 'ip_address': '1.197.201.2', 'cc': '6759521864920116', 'country': 'Indonesia', 'birthdate': '3/8/1971', 'salary': '49756.53', 'title': 'Internal Auditor', 'comments': 'Hello friends, my name is Alice Johnson and I just turned 29 years old! 🎉 I am looking forward to connecting with all of you. Feel free to drop me a line at [email protected] or call me at 415-123-4567.'}"),
 Document(page_content="{'id': '3', 'first_name': 'Evelyn', 'last_name': 'Morgan', 'email': '[email protected]', 'gender': 'Female', 'ip_address': '7.161.136.94', 'cc': '6767119071901597', 'country': 'Russia', 'birthdate': '2/1/1960', 'salary': '144972.51', 'title': 'Structural Engineer', 'comments': 'Booking Confirmation: Thank you, David Smith (DOB: 01/12/1978) for booking with us. We have received your payment through the credit card ending with 1234. Your booking ID is #67890. Please save this email for your records. For any queries, contact us at [email protected].'}")]
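
Each document's page_content in the output above is a string that looks like a Python dict literal. If that holds in general (an assumption based only on the output shown here), the records can be parsed back into dicts with the standard library and filtered by field. The helper names `parse_page_content` and `find_records` are hypothetical.

```python
import ast


def parse_page_content(page_content: str) -> dict:
    """Parse a dict-style page_content string back into a dict."""
    record = ast.literal_eval(page_content)
    if not isinstance(record, dict):
        raise ValueError("page_content is not a dict literal")
    return record


def find_records(page_contents, **wanted):
    """Return parsed records whose fields equal every wanted value."""
    records = [parse_page_content(text) for text in page_contents]
    return [
        record
        for record in records
        if all(record.get(key) == value for key, value in wanted.items())
    ]
```

With real retriever output, the input would be `[doc.page_content for doc in docs]`. Since the retriever returned all three rows for this query, filtering on `first_name` and `last_name` is one way to pick out the exact record.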

Create a GPT-3.5 QA chain and test it with the LangChain retriever#

[46]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain
chatbot = ConversationalRetrievalChain.from_llm(ChatOpenAI(model='gpt-3.5-turbo'), retriever=retriever)
resp = chatbot({'question': "Give me details about Amanda Jordan", 'chat_history': []})
print(resp["answer"])
Amanda Jordan is a female from Indonesia. Her email is [email protected], her IP address is 1.197.201.2, and her credit card number is 6759521864920116. She was born on March 8, 1971, and her job title is Internal Auditor with a salary of $49,756.53. If you need more specific details, please let me know.

Add rules to the read context to redact data#

[115]:
email_redaction_rule = sess.add_read_context_rules("read_ctx", rule_builder=ReadContextRuleBuilder().add_match_expression(
    source=Source.Tags,
    key="tag.antimatter.io/pii/email_address",
    operator=Operator.Exists
).set_action(Action.Redact).set_priority(20))
[116]:
sess.add_read_context_rules("read_ctx", rule_builder=ReadContextRuleBuilder().add_match_expression(
    source=Source.Tags,
    key="tag.antimatter.io/pii/credit_card",
    operator=Operator.Exists
).set_action(Action.Redact).set_priority(30))
[116]:
'rl-agg15cu1nd5c5jrj'
[117]:
sess.describe_read_context("read_ctx")
[117]:
{'name': 'read_ctx',
 'summary': 'Sample read context',
 'description': 'Sample description',
 'required_hooks': [{'hook': 'fast-pii',
   'constraint': '>1.0.0',
   'write_context': None}],
 'read_parameters': [{'key': 'key',
   'required': True,
   'description': 'description'}],
 'rules': [{'id': 'rl-47hp45t7hahz5v2l',
   'match_expressions': [{'source': 'tags',
     'key': 'tag.antimatter.io/pii/email_address',
     'operator': 'Exists',
     'values': None,
     'value': None}],
   'action': 'Redact',
   'token_scope': None,
   'token_format': None,
   'facts': [],
   'priority': 20,
   'imported': False,
   'source_domain_id': None,
   'source_domain_name': None},
  {'id': 'rl-agg15cu1nd5c5jrj',
   'match_expressions': [{'source': 'tags',
     'key': 'tag.antimatter.io/pii/credit_card',
     'operator': 'Exists',
     'values': None,
     'value': None}],
   'action': 'Redact',
   'token_scope': None,
   'token_format': None,
   'facts': [],
   'priority': 30,
   'imported': False,
   'source_domain_id': None,
   'source_domain_name': None}],
 'imported': False,
 'source_domain_id': None,
 'source_domain_name': None}
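
The rules above can be pictured with a small in-memory model: each value carries a set of tags, and a rule whose tag key exists on the value triggers its action. This is a sketch only, not Antimatter's implementation. The `Rule` class, the `apply_rules` function, and the choice to evaluate lower priority numbers first with the first match winning are all assumptions made for illustration.

```python
from dataclasses import dataclass

REDACTED = "{redacted}"


@dataclass(frozen=True)
class Rule:
    tag: str       # tag key that must exist on the value for the rule to match
    action: str    # only "Redact" changes the value in this toy model
    priority: int  # assumption: lower numbers are evaluated first


def apply_rules(value: str, tags: set, rules: list) -> str:
    """Apply the first matching rule (by ascending priority) to a tagged value."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.tag in tags:
            return REDACTED if rule.action == "Redact" else value
    return value  # no rule matched, so the value passes through unchanged


rules = [
    Rule("tag.antimatter.io/pii/credit_card", "Redact", 30),
    Rule("tag.antimatter.io/pii/email_address", "Redact", 20),
]
```

In this model, a value tagged as an email address comes back as `{redacted}`, an untagged value such as a country name passes through, and dropping the email rule from the list lets email values through again.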

Materialize the data with the new rules for redaction#

[118]:
capsule = sess.load_capsule(path=file_name, read_context="read_ctx")
[119]:
df = capsule.data_as(dt=Datatype.PandasDataframe)
[120]:
df
[120]:
id first_name last_name email gender ip_address cc country birthdate salary title comments
0 1 Amanda Jordan {redacted} Female 1.197.201.2 {redacted} Indonesia 3/8/1971 49756.53 Internal Auditor Hello friends, my name is Alice Johnson and I ...
1 2 Albert Freeman {redacted} Male 218.111.175.34 Canada 1/16/1968 150280.17 Accountant IV Customer feedback: I recently visited your sto...
2 3 Evelyn Morgan {redacted} Female 7.161.136.94 {redacted} Russia 2/1/1960 144972.51 Structural Engineer Booking Confirmation: Thank you, David Smith (...

Use RAG QA with the redacted read context and its materialized retriever#

[121]:
# Build the retriever from the redacted capsule first, then hand it to the chain
retriever = capsule.data_as(dt=Datatype.LangchainRetriever)
chatbot = ConversationalRetrievalChain.from_llm(ChatOpenAI(model='gpt-3.5-turbo'), retriever=retriever)
retriever._get_relevant_documents(query="Amanda Jordan", run_manager=None)
[121]:
[Document(page_content="{'id': '2', 'first_name': 'Albert', 'last_name': 'Freeman', 'email': '{redacted}', 'gender': 'Male', 'ip_address': '218.111.175.34', 'cc': '', 'country': 'Canada', 'birthdate': '1/16/1968', 'salary': '150280.17', 'title': 'Accountant IV', 'comments': 'Customer feedback: I recently visited your store at 5678 Pine Avenue, Dallas, TX 75201. My name is Jane Doe, age 43. I had a wonderful experience and the staff was very friendly. You can reach out to me at {redacted} for any further details.'}"),
 Document(page_content="{'id': '1', 'first_name': 'Amanda', 'last_name': 'Jordan', 'email': '{redacted}', 'gender': 'Female', 'ip_address': '1.197.201.2', 'cc': '{redacted}', 'country': 'Indonesia', 'birthdate': '3/8/1971', 'salary': '49756.53', 'title': 'Internal Auditor', 'comments': 'Hello friends, my name is Alice Johnson and I just turned 29 years old! 🎉 I am looking forward to connecting with all of you. Feel free to drop me a line at {redacted} or call me at 415-123-4567.'}"),
 Document(page_content="{'id': '3', 'first_name': 'Evelyn', 'last_name': 'Morgan', 'email': '{redacted}', 'gender': 'Female', 'ip_address': '7.161.136.94', 'cc': '{redacted}', 'country': 'Russia', 'birthdate': '2/1/1960', 'salary': '144972.51', 'title': 'Structural Engineer', 'comments': 'Booking Confirmation: Thank you, David Smith (DOB: 01/12/1978) for booking with us. We have received your payment through the credit card ending with 1234. Your booking ID is #67890. Please save this email for your records. For any queries, contact us at {redacted}.'}")]
[122]:
resp = chatbot({'question': "Give me details about Amanda Jordan", 'chat_history': []})
print(resp["answer"])
Amanda Jordan is a female Internal Auditor from Indonesia. She was born on March 8, 1971. Her email address and credit card information have been redacted for privacy reasons. Her IP address is 1.197.201.2, and her salary is $49,756.53. She left a comment saying that her name is Alice Johnson and she just turned 29 years old, inviting friends to connect with her through email or phone.
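
The retrieved documents for this step still carry a phone number inside the free-text comments field, since only email and credit card rules were added. A quick scan of the text can flag such leftovers before it is passed to the model. The sketch below uses rough regular expressions as heuristics; the patterns, the `PATTERNS` table, and the `scan_for_pii` name are hypothetical and are not a substitute for the fast-pii hook.

```python
import re

# Rough heuristics only: a simple email shape and a US-style NNN-NNN-NNNN phone number
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def scan_for_pii(text: str) -> dict:
    """Return a mapping of pattern label to the matches found in the text."""
    found = {}
    for label, pattern in PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[label] = matches
    return found
```

Running `scan_for_pii` over each document's page_content gives a simple report of which documents would need an additional redaction rule.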

Remove the email redaction rule#

[123]:
sess.delete_read_context_rule('read_ctx', email_redaction_rule)
[124]:
sess.describe_read_context("read_ctx")
[124]:
{'name': 'read_ctx',
 'summary': 'Sample read context',
 'description': 'Sample description',
 'required_hooks': [{'hook': 'fast-pii',
   'constraint': '>1.0.0',
   'write_context': None}],
 'read_parameters': [{'key': 'key',
   'required': True,
   'description': 'description'}],
 'rules': [{'id': 'rl-agg15cu1nd5c5jrj',
   'match_expressions': [{'source': 'tags',
     'key': 'tag.antimatter.io/pii/credit_card',
     'operator': 'Exists',
     'values': None,
     'value': None}],
   'action': 'Redact',
   'token_scope': None,
   'token_format': None,
   'facts': [],
   'priority': 30,
   'imported': False,
   'source_domain_id': None,
   'source_domain_name': None}],
 'imported': False,
 'source_domain_id': None,
 'source_domain_name': None}

Read the data with the updated redaction rules#

[125]:
capsule = sess.load_capsule(path=file_name, read_context="read_ctx")
[128]:
retriever = capsule.data_as(dt=Datatype.LangchainRetriever)
retriever._get_relevant_documents(query="Amanda Jordan", run_manager=None)
[128]:
[Document(page_content="{'id': '2', 'first_name': 'Albert', 'last_name': 'Freeman', 'email': '[email protected]', 'gender': 'Male', 'ip_address': '218.111.175.34', 'cc': '', 'country': 'Canada', 'birthdate': '1/16/1968', 'salary': '150280.17', 'title': 'Accountant IV', 'comments': 'Customer feedback: I recently visited your store at 5678 Pine Avenue, Dallas, TX 75201. My name is Jane Doe, age 43. I had a wonderful experience and the staff was very friendly. You can reach out to me at [email protected] for any further details.'}"),
 Document(page_content="{'id': '1', 'first_name': 'Amanda', 'last_name': 'Jordan', 'email': '[email protected]', 'gender': 'Female', 'ip_address': '1.197.201.2', 'cc': '{redacted}', 'country': 'Indonesia', 'birthdate': '3/8/1971', 'salary': '49756.53', 'title': 'Internal Auditor', 'comments': 'Hello friends, my name is Alice Johnson and I just turned 29 years old! 🎉 I am looking forward to connecting with all of you. Feel free to drop me a line at [email protected] or call me at 415-123-4567.'}"),
 Document(page_content="{'id': '3', 'first_name': 'Evelyn', 'last_name': 'Morgan', 'email': '[email protected]', 'gender': 'Female', 'ip_address': '7.161.136.94', 'cc': '{redacted}', 'country': 'Russia', 'birthdate': '2/1/1960', 'salary': '144972.51', 'title': 'Structural Engineer', 'comments': 'Booking Confirmation: Thank you, David Smith (DOB: 01/12/1978) for booking with us. We have received your payment through the credit card ending with 1234. Your booking ID is #67890. Please save this email for your records. For any queries, contact us at [email protected].'}")]
[129]:
chatbot = ConversationalRetrievalChain.from_llm(ChatOpenAI(model='gpt-3.5-turbo'), retriever=retriever)
resp = chatbot({'question': "Give me details about Amanda Jordan", 'chat_history': []})
print(resp["answer"])
Amanda Jordan is a female from Indonesia. She was born on March 8, 1971. She works as an Internal Auditor and has an email address [email protected]. Her salary is $49,756.53. If you need more specific information, feel free to ask!