
Retrieval-Augmented Generation (RAG) Using Hugging Face Embeddings

This project demonstrates how to implement a Retrieval-Augmented Generation (RAG) pipeline using Hugging Face embeddings and ChromaDB for efficient semantic search. It reads, processes, and embeds textual data, enabling fast, accurate semantic queries over that data.

Features

  • Dataset Integration: Load and process datasets from Hugging Face.
  • Text Chunking: Split large text into manageable chunks for embedding.
  • Embeddings Generation: Utilize Hugging Face embeddings (BAAI/bge-base-en-v1.5) to convert text chunks into vector representations.
  • ChromaDB Storage: Store embeddings in ChromaDB for easy retrieval.
  • Semantic Search: Query the stored data for relevant text based on a provided prompt using semantic similarity.

Installation

Before running the notebook, ensure the necessary libraries are installed:

pip install chromadb
pip install llama-index

You also need to clone the required datasets from Hugging Face if you want to try it out as-is:

git clone https://huggingface.co/datasets/NahedAbdelgaber/evaluating-student-writing
git clone https://huggingface.co/datasets/transformersbook/emotion-train-split

How It Works

  1. Load Datasets: The notebook loads the "Evaluating Student Writing" dataset and splits the text into chunks for embedding.
  2. Embedding Creation: Using the BAAI/bge-base-en-v1.5 model, text chunks are converted into vector embeddings. You can substitute any embedding model you like.
  3. ChromaDB Integration: The generated embeddings, along with their corresponding text chunks, are stored in ChromaDB for persistence and later querying, as shown in the sketch after this list.
  4. Semantic Search: A query function searches the vector database with a given input query and returns the most relevant chunks by similarity.
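
The sketch below ties steps 1–3 together. It is a minimal illustration rather than the notebook verbatim: the file path inside the cloned dataset, the naive fixed-size chunking, and the collection name student_writing are all assumptions, and the HuggingFaceEmbedding import path differs across llama-index versions (newer releases ship it in the separate llama-index-embeddings-huggingface package).

import chromadb
# Import path varies by llama-index version; newer versions need
# `pip install llama-index-embeddings-huggingface` first.
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Load the dataset text (hypothetical path inside the cloned repo).
with open("evaluating-student-writing/train.csv", encoding="utf-8") as f:
    text = f.read()

# Naive fixed-size chunking; the notebook may use a smarter splitter.
chunk_size = 500
chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# Convert each chunk into a vector with BAAI/bge-base-en-v1.5.
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5")
embeddings = [embed_model.get_text_embedding(chunk) for chunk in chunks]

# Persist chunks and their embeddings in a local ChromaDB collection.
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("student_writing")
collection.add(
    ids=[str(i) for i in range(len(chunks))],
    documents=chunks,
    embeddings=embeddings,
)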

Usage

To use the code, simply run the notebook after installing the dependencies and cloning the required datasets. The following command can be used to query the stored embeddings:

query_collection("Your search query here", n_results=1)

This will return the most relevant text chunk based on the provided query.
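
query_collection is defined in the notebook; a minimal version, assuming the collection and embed_model objects from the sketch under "How It Works", might look like this:

def query_collection(query, n_results=1):
    # Embed the query with the same model used for the documents,
    # then retrieve the closest chunks by vector similarity.
    query_embedding = embed_model.get_text_embedding(query)
    results = collection.query(query_embeddings=[query_embedding], n_results=n_results)
    return results["documents"][0]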

Example

query_collection(
  "Even though the planet is very similar to Earth, there are challenges to get accurate data because of the harsh conditions on the planet.", 
  n_results=1
)

Files

This repository contains two notebooks. The simple one builds a vector database from a single file; the advanced one handles multiple files with different extensions, builds a vector database from them, and also lets you test the retrieved context with a text-generation model.

Dependencies

  • chromadb
  • llama-index

Future Enhancements

  • Improve the chunking mechanism for more flexible handling of overlapping sentences.
  • Fine-tune the embedding model for more specific domain applications.
  • Add support for multiple datasets.

License

This repository is licensed under the MIT License.

Thanks for checking it out :)