
Retrieval-Augmented Generation (RAG) Using Hugging Face Embeddings

This project demonstrates how to implement a Retrieval-Augmented Generation (RAG) pipeline using Hugging Face embeddings and ChromaDB for efficient semantic search. It reads, processes, and embeds textual data, enabling fast, accurate semantic queries over that data.

Features

  • Dataset Integration: Load and process datasets from Hugging Face.
  • Text Chunking: Split large text into manageable chunks for embedding.
  • Embeddings Generation: Utilize Hugging Face embeddings (BAAI/bge-base-en-v1.5) to convert text chunks into vector representations.
  • ChromaDB Storage: Store embeddings in ChromaDB for easy retrieval.
  • Semantic Search: Query the stored data for relevant text based on a provided prompt using semantic similarity.

Installation

Before running the notebook, ensure the necessary libraries are installed:

pip install chromadb
pip install llama-index

You also need to clone the required datasets from Hugging Face if you want to try it out as-is:

git clone https://huggingface.co/datasets/NahedAbdelgaber/evaluating-student-writing
git clone https://huggingface.co/datasets/transformersbook/emotion-train-split

How It Works

  1. Load Datasets: The notebook loads the "Evaluating Student Writing" dataset and splits the text into chunks for embedding.
  2. Embedding Creation: Using the BAAI/bge-base-en-v1.5 model, text chunks are converted into vector embeddings. You can substitute any embedding model you like.
  3. ChromaDB Integration: The generated embeddings, along with their corresponding text chunks, are stored in ChromaDB for persistence and later querying, as shown in the sketch after this list.
  4. Semantic Search: A query function searches the vector database with a given input query and returns the most relevant chunks by similarity.
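
The sketch below ties steps 1–3 together. It is a minimal illustration rather than the notebook verbatim: the file path inside the cloned dataset, the naive fixed-size chunking, and the collection name student_writing are all assumptions, and the HuggingFaceEmbedding import path differs across llama-index versions (newer releases ship it in the separate llama-index-embeddings-huggingface package).

import chromadb
# Import path varies by llama-index version; newer versions need
# `pip install llama-index-embeddings-huggingface` first.
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Load the dataset text (hypothetical path inside the cloned repo).
with open("evaluating-student-writing/train.csv", encoding="utf-8") as f:
    text = f.read()

# Naive fixed-size chunking; the notebook may use a smarter splitter.
chunk_size = 500
chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# Convert each chunk into a vector with BAAI/bge-base-en-v1.5.
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5")
embeddings = [embed_model.get_text_embedding(chunk) for chunk in chunks]

# Persist chunks and their embeddings in a local ChromaDB collection.
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("student_writing")
collection.add(
    ids=[str(i) for i in range(len(chunks))],
    documents=chunks,
    embeddings=embeddings,
)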

Usage

To use the code, simply run the notebook after installing the dependencies and cloning the required datasets. The following command can be used to query the stored embeddings:

query_collection("Your search query here", n_results=1)

This will return the most relevant text chunk based on the provided query.
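
query_collection is defined in the notebook; a minimal version, assuming the collection and embed_model objects from the sketch under "How It Works", might look like this:

def query_collection(query, n_results=1):
    # Embed the query with the same model used for the documents,
    # then retrieve the closest chunks by vector similarity.
    query_embedding = embed_model.get_text_embedding(query)
    results = collection.query(query_embeddings=[query_embedding], n_results=n_results)
    return results["documents"][0]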

Example

query_collection(
  "Even though the planet is very similar to Earth, there are challenges to get accurate data because of the harsh conditions on the planet.", 
  n_results=1
)

Files

This repository contains two notebooks. The simple one builds a vector database from a single file; the advanced one handles multiple files with different extensions, builds a vector database from them, and also lets you test the retrieved context with a text-generation model.

Dependencies

  • chromadb
  • llama-index

Future Enhancements

  • Improve the chunking mechanism for more flexible handling of overlapping sentences.
  • Fine-tune the embedding model for more specific domain applications.
  • Add support for multiple datasets.

License

This repository is licensed under the MIT License.

Thanks for checking it out :)