This project demonstrates a Retrieval-Augmented Generation (RAG) implementation designed to answer questions from medical literature. The RAG model combines retrieval-based techniques with generative AI to produce accurate and contextually relevant responses.
The model is built and tested using the following medical books:
- Medical Book: General medical reference.
- Gray's Anatomy for Students: Detailed anatomy reference.
- Harrison's Principles of Internal Medicine: Comprehensive internal medicine resource.
- Oxford Handbook of Clinical Medicine: Clinical reference for healthcare professionals.
- Where There Is No Doctor: Health care guide for rural and remote areas.
- Current Medical Diagnosis & Treatment: Diagnostic and treatment guidelines.
- Davidson’s Principles and Practice of Medicine: Principles of modern medicine.
- Harrison’s Pulmonary and Critical Care Medicine: Specialized reference for pulmonary medicine.
Each book is processed to extract relevant information, chunked into smaller text segments, and stored for retrieval during question answering.
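The walkthrough later in this README loads a single book; as a minimal sketch (assuming all eight books sit as PDFs under `data/`), the whole shelf can be loaded in one pass:

```python
from pathlib import Path

from langchain_community.document_loaders import PyPDFLoader

# Load every book under data/ into one list of page-level Documents
all_pages = []
for pdf_path in sorted(Path("data").glob("*.pdf")):
    loader = PyPDFLoader(str(pdf_path))
    all_pages.extend(loader.load())  # PyPDFLoader yields one Document per page
```

The resulting `all_pages` list can then be fed to the same splitting and vectorstore steps described below.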
- Python 3.10.0
- LangChain: Framework for building language model applications.
- Chroma: Vectorstore for document embeddings.
- Google Generative AI: For embedding generation and answer generation.
- FAISS: For efficient similarity search.

The full tech stack can be installed from [requirements.txt](https://github.com/eeshan15/Vaidya-GPT-Prototype/blob/main/requirements.txt), and the complete code is in [test.ipynb](https://github.com/eeshan15/Vaidya-GPT-Prototype/blob/main/test.ipynb).
- `langchain`
- `langchain_community`
- `langchain_chroma`
- `langchain_google_genai`
- `PyPDFLoader`: For loading PDF data.
- `dotenv`: For managing environment variables (see the sketch after this list).
- `FAISS`: For vector-based similarity search.
- `tqdm`: For progress visualization.
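As a usage note for `dotenv`: the Google Generative AI clients read the API key from the environment. A minimal sketch, assuming the key is stored as `GOOGLE_API_KEY` in a local `.env` file:

```python
import os

from dotenv import load_dotenv

load_dotenv()  # copies key=value pairs from .env into os.environ
assert os.getenv("GOOGLE_API_KEY"), "Set GOOGLE_API_KEY in .env"
```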
- Document Loader: PDFs are loaded using `PyPDFLoader`.
- Text Splitting: Documents are split into chunks of 1000 characters using `RecursiveCharacterTextSplitter`.
- Embeddings: Generated using the Google Generative AI embeddings model (`embedding-001`).
- Vectorstore: Chunks and their embeddings are stored in Chroma or FAISS for efficient retrieval (a FAISS sketch follows this list).
- Generative AI: Uses Gemini 1.5 Pro for natural language response generation.
- Retrieval Chain: Combines retrieved documents with a generative model to answer questions.
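The walkthrough below persists to Chroma; storing the same chunks in FAISS instead is a small variation. A minimal sketch, assuming the `docs` and `embeddings` objects from the vectorstore-generation code below (the index path `faiss_index` is illustrative):

```python
from langchain_community.vectorstores import FAISS

# Build an in-memory FAISS index from the chunks and save it to disk
faiss_store = FAISS.from_documents(docs, embeddings)
faiss_store.save_local("faiss_index")

# Reload it later; recent langchain_community versions require opting in
# to pickle deserialization for local indexes you trust
faiss_store = FAISS.load_local(
    "faiss_index", embeddings, allow_dangerous_deserialization=True
)
hits = faiss_store.similarity_search("What is Myopia?", k=10)
```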
- Multiple Document Support: The model is designed to handle multiple books and provide consolidated answers.
- Customizable Prompting: The system uses dynamic prompts to tailor responses based on retrieved content (see the sketch after this list).
- High Accuracy: Tested to deliver accurate and concise answers to medical queries.
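The exact prompt lives in `test.ipynb`; the sketch below only illustrates the pattern, with placeholder wording, of a prompt whose `{context}` slot is filled dynamically with retrieved content:

```python
from langchain_core.prompts import ChatPromptTemplate

# {context} receives the retrieved chunks; {input} receives the user question
prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are a medical assistant. Answer using only the context below. "
     "If the context does not contain the answer, say so.\n\n{context}"),
    ("human", "{input}"),
])
```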
- Clone the repository and navigate to the project directory.
- Install dependencies (note the actual pip package names: `pypdf` backs `PyPDFLoader`, `python-dotenv` provides `dotenv`, and `faiss-cpu` provides FAISS):

```bash
pip install langchain langchain_community langchain_chroma langchain_google_genai pypdf python-dotenv faiss-cpu tqdm
```
- Place your medical books in the `data/` directory.
- Open `test.ipynb` in Jupyter Notebook or an equivalent editor.
- Run all the cells sequentially.
Run the following code to generate and save the vectorstore:

```python
from dotenv import load_dotenv
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_chroma import Chroma

load_dotenv()  # exposes GOOGLE_API_KEY from .env to the Google clients

# Load and process documents
loader = PyPDFLoader("data/Medical_book.pdf")
data = loader.load()

# Split into 1000-character chunks for embedding
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
docs = text_splitter.split_documents(data)

# Create the vectorstore; with persist_directory set, langchain_chroma
# writes the collection to disk automatically (no explicit persist() call)
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
vectorstore = Chroma.from_documents(docs, embeddings, persist_directory="vectorstore")
```
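On subsequent runs the persisted store can be reopened instead of rebuilt; a minimal sketch, assuming the same `vectorstore` directory:

```python
# Reopen the persisted collection without re-embedding the books
vectorstore = Chroma(
    persist_directory="vectorstore",
    embedding_function=GoogleGenerativeAIEmbeddings(model="models/embedding-001"),
)
```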
After generating the vectorstore, run the following snippet to query. It assumes a `rag_chain` built as in `test.ipynb`; a sketch of that construction follows below.

```python
# Retrieve the 10 most similar chunks for each query
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 10})

question = "What is Myopia?"
response = rag_chain.invoke({"input": question})
print(response["answer"])
```
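The `rag_chain` object above is defined in `test.ipynb`. A minimal sketch of constructing one, assuming Gemini 1.5 Pro via `ChatGoogleGenerativeAI` and LangChain's stuff-documents pattern (the prompt wording here is illustrative):

```python
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_google_genai import ChatGoogleGenerativeAI

# Gemini 1.5 Pro as the generator, per the architecture notes above
llm = ChatGoogleGenerativeAI(model="gemini-1.5-pro")

prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer the question using only this context:\n\n{context}"),
    ("human", "{input}"),
])

# Stuff the retrieved chunks into {context}, then generate the answer
question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)
```

`create_retrieval_chain` returns a dict containing both the retrieved `context` and the generated `answer`, which is why the query snippet reads `response["answer"]`.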
- Extend the model to include more specialized datasets.
- Optimize retrieval and generation for faster response times.
- Integrate a user-friendly interface with Streamlit.
This RAG model effectively answers complex medical questions by leveraging retrieval and generation, offering a valuable tool for students, practitioners, and researchers in medicine.