This project provides an interactive Streamlit-based web application that allows users to upload PDF and CSV files, store their content in a vector database using LangChain and Chroma, and query the uploaded documents using OpenAI's LLMs (e.g., GPT-3.5-turbo). The app intelligently retrieves relevant information from the documents and provides citations for the sources.
- Upload and Process Documents:
  - Upload multiple PDF and CSV files.
  - Extract content using LangChain's document loaders (the sketch after this feature list shows the overall flow).
- Vector Database Storage:
  - Store document embeddings in a persistent Chroma vector database.
- Interactive Query System:
  - Ask questions about the uploaded documents.
  - Retrieve answers along with source citations.
- Download Cited Files:
  - Easily download files cited in the query response.
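The snippet below sketches how the first three features typically fit together with LangChain, Chroma, and OpenAI. It is not lifted from app.py: the file names are placeholders, and the imports follow the classic langchain package layout (newer releases move the loaders and vector store into langchain_community), so the actual code in this repository may differ.

```python
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import CSVLoader, PyPDFLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# 1. Load uploaded documents (file names are placeholders).
docs = []
docs += PyPDFLoader("uploaded_files/report.pdf").load()
docs += CSVLoader("uploaded_files/data.csv").load()

# 2. Embed the documents and persist them in a local Chroma database.
vectordb = Chroma.from_documents(
    docs,
    embedding=OpenAIEmbeddings(),
    persist_directory="chromadb",
)

# 3. Answer a question and keep the source documents for citations.
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0),
    retriever=vectordb.as_retriever(),
    return_source_documents=True,
)
result = qa({"query": "What does the report conclude?"})
print(result["result"])
for source in result["source_documents"]:
    print("cited:", source.metadata.get("source"))
```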
- Streamlit: For creating the web interface.
- LangChain: For document processing and retrieval.
- Chroma: As the vector database for storing embeddings.
- OpenAI API: For LLM-based query answering.
- Python: The core language for building the application.
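These tools are installed from requirements.txt (see the installation steps below). As a rough guide only, a requirements file for this stack often looks something like the following; the exact package set and version pins in the repository are authoritative and may differ:

```text
streamlit
langchain
chromadb
openai
pypdf            # PDF parsing for LangChain's PDF loader (assumed)
python-dotenv    # loading OPENAI_API_KEY from .env (assumed)
```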
- Python 3.10 or later
- OpenAI API Key
- Clone the Repository:
  git clone git@github.com:stacksapien/smart-doc-search.git
  cd smart-doc-search
- Set Up a Virtual Environment:
  python3 -m venv env
  source env/bin/activate  # On Windows: .\env\Scripts\activate
- Install Dependencies:
  pip install -r requirements.txt
- Configure Environment Variables: Create a file named .env in the root directory and add your OpenAI API key (a sketch of how the app can load it follows these steps):
  OPENAI_API_KEY=your_openai_api_key
- Run the Application:
  streamlit run app.py
- Access the App: Open your browser and navigate to:
  http://localhost:8501
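As referenced in the environment-variable step above, the key in .env has to reach the OpenAI client at runtime. A minimal way to do that is with python-dotenv; whether app.py actually uses this package or reads the environment some other way is an assumption here:

```python
import os

from dotenv import load_dotenv  # provided by the python-dotenv package

# Pull OPENAI_API_KEY (and anything else in .env) into the process environment.
load_dotenv()

if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set; add it to the .env file.")
```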
- Upload one or more PDF or CSV files using the file uploader.
- Uploaded files are processed and stored in the uploaded_files directory.
- Enter your query in the text box provided.
- The app retrieves relevant answers from the uploaded documents and displays the sources.
- Files cited in the response are available for download (see the sketch below).
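The Streamlit pieces behind this flow typically look like the sketch below; the widget labels, the placeholder file name, and the exact wiring in app.py are assumptions rather than the project's actual code:

```python
import os

import streamlit as st

UPLOAD_DIR = "uploaded_files"
os.makedirs(UPLOAD_DIR, exist_ok=True)

# Accept one or more PDF/CSV files and persist them to uploaded_files/.
uploads = st.file_uploader(
    "Upload PDF or CSV files", type=["pdf", "csv"], accept_multiple_files=True
)
for uploaded in uploads or []:
    with open(os.path.join(UPLOAD_DIR, uploaded.name), "wb") as out:
        out.write(uploaded.getbuffer())

# Offer a cited file back for download (the path would come from the
# sources returned with the answer; "example.pdf" is a placeholder).
cited_path = os.path.join(UPLOAD_DIR, "example.pdf")
if os.path.exists(cited_path):
    with open(cited_path, "rb") as fh:
        st.download_button(
            "Download cited file",
            data=fh.read(),
            file_name=os.path.basename(cited_path),
        )
```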
smart-doc-search/
│
├── app.py # Main Streamlit application
├── requirements.txt # List of Python dependencies
├── .env # Environment variables (not included in Git)
├── uploaded_files/ # Directory for storing uploaded files
├── chromadb/ # Directory for persistent Chroma vector database
└── README.md # Project documentation
- Launch an Ubuntu EC2 instance and configure security groups to allow inbound traffic on ports 22 and 8501.
- SSH into the instance and set up Python, Streamlit, and the application as per the installation instructions.
- Use a terminal multiplexer such as tmux or screen to keep the app running (see the example after this list).
- Configure a reverse proxy (e.g., Nginx) to serve the Streamlit app under your domain.
- Enable HTTPS using Certbot for SSL certificates.
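As an example of the tmux approach mentioned above, the app can be started in a detachable session and bound to all interfaces so the reverse proxy (or the instance's public IP) can reach it; the session name is arbitrary:

```bash
tmux new -s smart-doc-search          # start a detachable session
streamlit run app.py --server.address 0.0.0.0 --server.port 8501
# Detach with Ctrl-b then d; reattach later with: tmux attach -t smart-doc-search
```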
Contributions are welcome! Please follow these steps:
- Fork the repository.
- Create a new branch:
  git checkout -b feature-name
- Commit your changes:
  git commit -m "Add feature-name"
- Push to the branch:
  git push origin feature-name
- Submit a pull request.
This project is licensed under the MIT License. See the LICENSE file for details.
If you encounter any issues or have feature requests, please open an issue.
- Vishal Verma - LinkedIn
Feel free to reach out with any questions or feedback!