A pipeline for building a Retrieval-Augmented Generation (RAG) system using document embedding.
The pipeline consists of four stages:
- (1) Collect Data: Gather a corpus of PDF, HTML, and txt documents for a specified subject and place them in the `data/{subject}` folder.
- (2) Preprocessing: Extract text from the documents, chunk it, and embed the chunks to build a database of embedded text chunks.
- (3) Retrieval: Retrieve relevant information from the database using cosine similarity.
- (4) Generation: Generate responses using the retrieved information with a large language model (LLM).
Manually place the PDF, HTML, and/or txt documents in the `data/{subject}` folder. The folder structure should look like this:
data/
└── subject/
├── document1.pdf
├── document2.html
├── document3.txt
└── ...
Note: Use meaningful document names for the PDF and HTML files, since the document name is used as source information in the RAG system (so we can reference the actual document later).
The module `database_builder.py` preprocesses a corpus of PDF and HTML documents to build a database of embedded text chunks for a specified subject. This is part of the initial preprocessing step in the RAG pipeline.
- Extracts text from PDF and HTML documents located in the `data/{subject}` folder.
- Chunks the text into manageable pieces with a specified size and overlap (see the sketch after this list).
- Embeds text chunks using the specified embedding model (e.g. `text-embedding-3-large` from OpenAI).
- Saves the embedded chunks to a Parquet file at `embedding_database/{subject}.parquet` for further use. Note: this should be moved to an actual vector database in the future.
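To make `chunk_size`, `overlap_size`, and `min_text_length` concrete, here is a minimal sliding-window sketch. The actual module chunks text with a recursive text splitter (see the TODOs below), so this helper is illustrative only:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap_size: int = 50,
               min_text_length: int = 0) -> list[str]:
    """Split text into chunks of `chunk_size` characters where consecutive
    chunks share `overlap_size` characters; drop chunks shorter than
    `min_text_length`."""
    step = chunk_size - overlap_size
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if len(chunk) >= min_text_length:
            chunks.append(chunk)
    return chunks

# 1000-character chunks with a 50-character overlap
example_chunks = chunk_text("some extracted document text " * 200)
print(len(example_chunks), len(example_chunks[0]))
```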
- Install dependencies: `pip install -r requirements.txt`
- Ensure the OpenAI API key is set in a `.env` file.
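The `.env` file typically contains a single line such as `OPENAI_API_KEY=...` (the exact variable name depends on the embedding service; `OPENAI_API_KEY` is the OpenAI default). Assuming the project loads it with python-dotenv, a quick check looks like this:

```python
import os
from dotenv import load_dotenv

load_dotenv()  # reads variables from the .env file in the working directory
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is missing from the environment"
```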
from database_builder import DatabaseBuilder
from util.embedding_service.embedding_service_factory import EmbeddingServiceFactory
# Create an embedding service
embedding_service = EmbeddingServiceFactory().get_embedding_service(provider="openai", model_name="text-embedding-3-large")
# Initialize the DatabaseBuilder
db_builder = DatabaseBuilder(
subject="test", # Replace with actual subject (e.g. "biology")
embedding_service=embedding_service,
chunk_size=1000,
overlap_size=50,
min_text_length=0
)
# Build the database
db_builder.build_database(root_folder_path="data/")
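Once the database has been built, you can sanity-check the output by loading the Parquet file, e.g. with pandas. The snippet only prints whatever columns `DatabaseBuilder` writes, so nothing about the schema is assumed:

```python
import pandas as pd

# Inspect the embedded chunks written by DatabaseBuilder
df = pd.read_parquet("embedding_database/test.parquet")  # matches subject="test" above
print(df.shape)               # number of chunks x number of columns
print(df.columns.tolist())    # column layout as written by DatabaseBuilder
print(df.head(3))             # a few chunks with their source metadata
```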
- Determine the optimal values for `chunk_size`, `overlap_size`, and `min_text_length`. Many different values are used in research papers.
- Implement a database to store the embedded text chunks. For now, we save the embedded chunks to a Parquet file.
- Which embedding model to use? We currently use the `text-embedding-3-large` model from OpenAI.
- Consider semantic chunking strategies instead of the `RecursiveTextSplitter`.
- Add support for batch embedding to save costs (see the sketch below).
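One simple form of batching (not yet implemented here) is to send many chunks in a single embeddings request; the OpenAI API accepts a list of inputs per call. A rough sketch, assuming the standard `openai` Python client:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_batch(chunks: list[str], model: str = "text-embedding-3-large") -> list[list[float]]:
    """Embed many chunks in one request instead of one request per chunk."""
    response = client.embeddings.create(model=model, input=chunks)
    return [item.embedding for item in response.data]

vectors = embed_batch(["first chunk of text", "second chunk of text"])
print(len(vectors), len(vectors[0]))
```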
In this stage, we retrieve relevant information from the embedded database using the user's query. The retrieval process involves:
- Embedding the User Query: The query (or list of queries) is embedded using the same embedding model used to create the database.
- Computing Similarities: We compute the cosine similarity between the query embeddings and the document embeddings in the database.
- Retrieving Top-K Documents: The documents with the highest similarity scores above a specified threshold are retrieved.
- Thresholding: We can set a threshold to filter out documents with low similarity scores.
We support both single queries and lists of queries, which can be useful when you want to capture different aspects or phrasings of a question.
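Conceptually, the retrieval step boils down to a cosine-similarity search over the stored embeddings. The numpy sketch below shows the underlying math only (it is not the `Retriever` implementation); with multiple queries, each chunk is scored here by its best-matching query:

```python
import numpy as np

def top_k_by_cosine(query_vecs: np.ndarray, doc_vecs: np.ndarray,
                    top_k: int = 5, threshold: float = 0.2):
    """Return (index, score) pairs of the top_k documents above the threshold."""
    # Normalize rows so that a dot product equals cosine similarity
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q.T                  # shape: (n_docs, n_queries)
    best = sims.max(axis=1)         # best score per document across all queries
    order = np.argsort(-best)[:top_k]
    return [(int(i), float(best[i])) for i in order if best[i] >= threshold]

# Toy example with random vectors standing in for real embeddings
rng = np.random.default_rng(0)
docs = rng.normal(size=(100, 8))
queries = rng.normal(size=(2, 8))
print(top_k_by_cosine(queries, docs))
```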
from retriever import Retriever
# Initialize the Retriever
retriever = Retriever(
subject="test", # Replace with actual subject (e.g., "biology")
embedding_service=embedding_service,
debug=False # Set to True to enable debug information
)
# Query the database
query = "What is the capital of France?"
retrieved_docs, total_tokens = retriever.retrieve(
user_queries=[query], # List of user queries
top_k=5, # Number of top documents to retrieve
threshold=0.2 # Cosine similarity threshold
)
# Process the retrieved documents
for idx, doc in enumerate(retrieved_docs):
print(f"\nDocument {idx + 1}")
print(f"Content: {doc.content}")
print(f"Metadata: {doc.metadata}")
print(f"Similarity: {doc.similarity:.4f}")
In this optional final stage, we generate a response to the user's query using the retrieved documents. The Generator class uses a language model to produce an answer informed by the retrieved chunks of information. We use a generic factory (`LLMServiceFactory`) to interact with different language models (LLMs); for more details, see the `README.md` in the `llm_service` folder.
This step is optional. If you only need to retrieve relevant chunks, you can omit this part.
from generator import Generator
from retriever import Retriever
# Assume we have already retrieved documents using the Retriever
retriever = Retriever(
subject="test",
embedding_service=embedding_service
)
user_query = "What is the capital of France?"
retrieved_docs, _ = retriever.retrieve(
user_queries=[user_query],
top_k=5,
threshold=0.2
)
# Initialize the Generator
generator = Generator(
model_provider="HuggingFace",
model_name="meta-llama/Meta-Llama-3-8B-Instruct",
temperature=0.1,
max_tokens=2048
)
# Generate a response
response = generator.generate_response(
user_query=user_query,
retrieved_data=retrieved_docs
)
# Output the generated response
print("Generated Response:")
print(response)
The retrieved documents are formatted and included in the prompt to provide context to the language model. You can customize the prompt in the `generate_response` method to better suit your use case or to optimize the model's output.
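As an illustration of what that prompt construction can look like, here is a sketch of one way to assemble a context block. The actual template lives in `generate_response`; this helper only reuses the `content` and `metadata` attributes and the `user_query` and `retrieved_docs` variables from the examples above:

```python
def build_prompt(user_query: str, retrieved_docs) -> str:
    """Assemble retrieved chunks into a context block followed by the question."""
    context = "\n\n".join(
        f"[Source: {doc.metadata}]\n{doc.content}" for doc in retrieved_docs
    )
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {user_query}\nAnswer:"
    )

print(build_prompt(user_query, retrieved_docs))
```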