This project focuses on improving the accuracy and correctness of LLM responses by supplying web page content as context, using a RAG (Retrieval-Augmented Generation) approach. The LLM and embedding model used here are Llama 3.1 8B and `mxbai-embed-large`, respectively, both of which are launched with Ollama.
This project has three components: the web crawler, the database builder, and the RAG application.
The script `crawler.py` is a web crawler that scrapes all child links sharing the same base URL, up to a specified depth. It converts the metadata and page content of each scraped web page to Markdown and writes them to a JSON file. It also writes a text file listing all the links it has scraped.
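For illustration, here is a minimal sketch of this crawling approach, assuming `requests`, `BeautifulSoup`, and `markdownify`; the actual `crawler.py` may use different libraries, output fields, and defaults.

```python
# Minimal sketch of a depth-limited crawler that stays on one base URL and
# converts pages to Markdown. The libraries and output fields here are
# assumptions; the real crawler.py may differ.
import json
from urllib.parse import urljoin, urldefrag

import requests
from bs4 import BeautifulSoup
from markdownify import markdownify as md


def crawl(start_url: str, base_url: str, max_depth: int = 2) -> list[dict]:
    seen, pages, frontier = set(), [], [(start_url, 0)]
    while frontier:
        url, depth = frontier.pop()
        if url in seen or depth > max_depth:
            continue
        seen.add(url)
        soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        # Keep simple metadata and the page body converted to Markdown.
        pages.append({
            "url": url,
            "title": soup.title.string if soup.title else "",
            "content": md(str(soup.body)) if soup.body else "",
        })
        # Follow only child links that share the same base URL.
        for a in soup.find_all("a", href=True):
            link = urldefrag(urljoin(url, a["href"])).url
            if link.startswith(base_url) and link not in seen:
                frontier.append((link, depth + 1))
    return pages


if __name__ == "__main__":
    pages = crawl("https://example.com/docs/", "https://example.com/docs/")
    with open("output.json", "w") as f:
        json.dump(pages, f, indent=2)
    with open("links.txt", "w") as f:
        f.write("\n".join(p["url"] for p in pages))
```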
For experimental purposes, the script `crawler_multi.py` is also provided. It has the same functionality as `crawler.py`, but it is a multithreaded crawler that can scrape web pages concurrently, starting from different subdirectories of the same base URL.
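The concurrent variant can be pictured roughly as below, reusing the hypothetical `crawl()` helper from the previous sketch with a thread pool; the actual coordination in `crawler_multi.py` may differ.

```python
# Sketch: crawl several subdirectories of the same base URL concurrently.
# Assumes the crawl() helper from the previous sketch is defined above;
# the real crawler_multi.py may coordinate its threads differently.
import json
from concurrent.futures import ThreadPoolExecutor

BASE = "https://example.com/docs/"
START_URLS = [BASE + "guide/", BASE + "api/", BASE + "reference/"]

with ThreadPoolExecutor(max_workers=len(START_URLS)) as pool:
    batches = list(pool.map(lambda u: crawl(u, BASE, max_depth=2), START_URLS))

pages = [page for batch in batches for page in batch]
with open("output.json", "w") as f:
    json.dump(pages, f, indent=2)
```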
The script `build_database.py` reads the content of all the JSON files produced by `crawler.py` or `crawler_multi.py` and inserts them into a Chroma database saved on disk. It splits the page content of each document (web page) into smaller chunks and stores vector embeddings, generated by the embedding model, for each chunk in the database. Each chunk is assigned a custom embedding ID of the form `https://[some url].com#[seq num]`, where the sequence number indicates that this is the i-th chunk of the document; the first chunk of every document has sequence number 0. The purpose of assigning custom embedding IDs is to know which documents are passed to the LLM during generation and to pinpoint exactly which section of each document it is referring to.
Similar to `crawler_multi.py`, this program is also multithreaded; since Chroma is thread-safe, it can add multiple documents to the database simultaneously.
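To make the ID scheme and the concurrent inserts concrete, here is a rough sketch using `chromadb` and the `ollama` Python client; the chunking logic, field names, and collection name are assumptions rather than the script's exact behaviour.

```python
# Sketch: chunk each crawled page, embed the chunks with mxbai-embed-large,
# and add them to an on-disk Chroma collection with IDs such as
# "https://example.com/docs/page#3" (the fourth chunk of that page).
# Chunk size, field names, and the collection name are assumptions.
import json
from concurrent.futures import ThreadPoolExecutor

import chromadb
import ollama

client = chromadb.PersistentClient(path="chroma_db")
collection = client.get_or_create_collection(name="web_pages")


def chunk_text(text: str, size: int = 1000) -> list[str]:
    """Naive fixed-size chunking; the real script's splitter may differ."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def add_document(page: dict) -> None:
    chunks = chunk_text(page["content"])
    if not chunks:
        return
    ids = [f'{page["url"]}#{i}' for i in range(len(chunks))]  # seq num starts at 0
    embeddings = [
        ollama.embeddings(model="mxbai-embed-large", prompt=chunk)["embedding"]
        for chunk in chunks
    ]
    # Chroma is thread-safe, so concurrent add() calls from the pool are fine.
    collection.add(ids=ids, documents=chunks, embeddings=embeddings)


with open("output.json") as f:
    pages = json.load(f)

with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(add_document, pages))
```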
The Jupyter notebook `local_rag.ipynb` sets up the entire RAG pipeline and creates the workflow using the LangGraph library. The function `generate_answer()` returns not only the generated response but also the chunk IDs that were used as context.
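Conceptually, the retrieve-and-generate step can be pictured as in the sketch below, which queries Chroma directly and calls Llama 3.1 through the `ollama` client; the notebook structures this as a LangGraph workflow, so the actual nodes and prompt differ.

```python
# Conceptual sketch of generate_answer(): retrieve the most relevant chunks
# from the Chroma database, answer with Llama 3.1, and report which chunk IDs
# were used as context. The prompt and collection name are assumptions; the
# notebook implements this flow as a LangGraph workflow.
import chromadb
import ollama

client = chromadb.PersistentClient(path="chroma_db")
collection = client.get_or_create_collection(name="web_pages")


def generate_answer(question: str, k: int = 4) -> tuple[str, list[str]]:
    query_emb = ollama.embeddings(model="mxbai-embed-large", prompt=question)["embedding"]
    results = collection.query(query_embeddings=[query_emb], n_results=k)
    chunk_ids = results["ids"][0]              # e.g. ["https://example.com/docs/page#3", ...]
    context = "\n\n".join(results["documents"][0])

    response = ollama.chat(
        model="llama3.1",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response["message"]["content"], chunk_ids


answer, chunk_ids = generate_answer("How do I configure the crawler?")
print(answer)
print("Chunks used as context:", chunk_ids)
```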
First, download and install Ollama from the official website (https://ollama.com).
To install the LLM and embedding models, run

`ollama pull llama3.1`

`ollama pull mxbai-embed-large`
Next, install the required packages:

`pip install -r requirements.txt`
To start the crawler, run

`python crawler.py [-h] --url URL --base BASE --out OUT [--depth DEPTH] [--css [CSS ...]] [--exclude [EXCLUDE ...]] [--no-update]`

or, for the multithreaded crawler, run

`python crawler_multi.py [-h] --url URL [URL ...] --base BASE --out OUT [--depth DEPTH] [--css [CSS ...]] [--exclude [EXCLUDE ...]] [--no-update]`
See `python crawler.py -h` or `python crawler_multi.py -h` for more information about the argument options.
To build the database, run

`python build_database.py [-h] -d DIR [DIR ...] -o OUT -n NAME [-u]`
See `python build_database.py -h` for more information about the argument options.
Finally, the RAG application can be run from `local_rag.ipynb`.
- Recently, I've made improvements in preprocessing the web page content: the crawler now filters the HTML with CSS selectors and outputs the data as Markdown text.
- However, the chunking strategy is still not robust: useful information is sometimes cut off at chunk boundaries, resulting in incomplete responses. Opting for a larger chunk size also makes it harder for the LLM to absorb all the context, despite Llama's large context window. One possible mitigation, overlapping chunks, is sketched after this list.
- I'm also looking for a better open-source embedding model that supports more tokens but is still manageable to run on my system. Of all the embedding models I've tried, `mxbai-embed-large` is still the best one so far.
- I'm still conflicted about adding a condition to perform a web search (if no relevant documents are found) because I want to keep the entire RAG pipeline in a local environment.
- Like many other RAG examples, I might add a final hallucination-checking stage to confirm that the LLM grounds its answer only in the provided documents.
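As one possible mitigation for the chunk-boundary issue noted above, overlapping chunks let adjacent chunks share some text so that information near a boundary is less likely to be lost. A minimal sketch, assuming LangChain's `RecursiveCharacterTextSplitter` (not necessarily what `build_database.py` uses today):

```python
# Sketch: overlapping chunks so that text near a chunk boundary also appears
# at the start of the next chunk. The chunk size, overlap, and sample page
# below are arbitrary example values, not settings from this project.
from langchain_text_splitters import RecursiveCharacterTextSplitter

page_url = "https://example.com/docs/page"      # hypothetical crawled page
page_markdown = "# Title\n\nSome long Markdown content... " * 100

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_text(page_markdown)

# The same URL#seq-num ID scheme still works with overlapping chunks.
ids = [f"{page_url}#{i}" for i in range(len(chunks))]
print(len(chunks), ids[:3])
```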