# ScraperWizard

ScraperWizard is an AI-powered application that automates information retrieval from the web based on user-defined prompts. It lets users upload datasets, define search queries dynamically, and extract relevant information using LLMs. The extracted data is displayed in a user-friendly dashboard and can be downloaded as structured files.

This video describes the demo and a few other important points: Loom Video
## Features

- **File Upload & Google Sheets Integration**
  - Upload CSV files or connect Google Sheets for data input.
  - Select a primary column (e.g., company names) for the search query.
  - Preview uploaded data within the dashboard.
- **Dynamic Prompt Input**
  - Define custom search prompts using placeholders like `{entity}` (see the templating sketch after this list).
  - Placeholders are dynamically replaced with each entity from the selected column.
- **Automated Web Search**
  - Perform searches using ScraperAPI or similar services (see the search sketch below).
  - Handle rate limits and API constraints effectively.
  - Collect and store search results (e.g., URLs, snippets).
- **LLM Integration for Data Parsing**
  - Use Groq's LLM or OpenAI's GPT API to extract precise information from search results (see the extraction sketch below).
  - Customize backend prompts for detailed extraction.
- **Data Display & Download**
  - Visualize extracted data in a structured table format.
  - Download results as CSV files or update the connected Google Sheet.
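To make the placeholder mechanics concrete, here is a minimal sketch of how per-entity queries could be built from the selected column; the column name `Company`, the sample values, and the prompt text are illustrative, not taken from the project:

```python
import pandas as pd

# Illustrative input: the primary column selected in the dashboard.
df = pd.DataFrame({"Company": ["OpenAI", "Groq", "Anthropic"]})

# A user-defined prompt with the {entity} placeholder.
prompt_template = "Find the official website and headquarters of {entity}"

# Replace the placeholder with each value from the selected column.
queries = [prompt_template.format(entity=value) for value in df["Company"]]
# -> ["Find the official website and headquarters of OpenAI", ...]
```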
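The search step could look roughly like the sketch below, which routes a Google query through ScraperAPI's proxy endpoint and backs off on rate-limit or server errors. The endpoint shape follows ScraperAPI's documented proxy API, but the retry policy and parameters are assumptions, not the project's actual code:

```python
import os
import time
import requests

def search_web(query: str, max_retries: int = 3) -> str:
    """Fetch Google search results for `query` via ScraperAPI's proxy endpoint."""
    target = "https://www.google.com/search?q=" + requests.utils.quote(query)
    params = {"api_key": os.environ["SCRAPER_API_KEY"], "url": target}
    for attempt in range(max_retries):
        resp = requests.get("http://api.scraperapi.com", params=params, timeout=60)
        # Assumed policy: retry with exponential backoff on rate limits (429)
        # and transient server errors (5xx).
        if resp.status_code == 429 or resp.status_code >= 500:
            time.sleep(2 ** attempt)
            continue
        resp.raise_for_status()
        return resp.text  # raw HTML, to be parsed for URLs and snippets
    raise RuntimeError(f"Search failed after {max_retries} retries: {query}")
```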
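Similarly, the extraction step might pass the collected results to Groq's chat completions API, as in this hedged sketch; the model name, system prompt, and message layout are assumptions rather than the project's backend prompts:

```python
import os
from groq import Groq  # pip install groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

def extract_info(entity: str, prompt: str, search_results: str) -> str:
    """Ask the LLM to pull the requested fact out of raw search results."""
    response = client.chat.completions.create(
        model="llama-3.1-8b-instant",  # assumed model; any Groq-hosted model works
        messages=[
            {"role": "system",
             "content": "Extract only the requested information from the search results."},
            {"role": "user",
             "content": f"Entity: {entity}\nRequest: {prompt}\nResults:\n{search_results}"},
        ],
    )
    return response.choices[0].message.content
```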
## Prerequisites

- Python 3.8+
- API keys for ScraperAPI (or an equivalent search service), the Groq API, Google Cloud OAuth, and a Google Cloud API key.
- A Google Cloud account with access to the Google Sheets API.
## Project Structure

```
AI Based Webscraper
├── backend
│   ├── results
│   │   └── result_input.csv
│   ├── uploads
│   │   └── input.csv
│   ├── .env                  # Backend environment variables
│   ├── .gitignore
│   ├── app.py                # Backend server code
│   ├── requirements.txt      # Python dependencies
│   └── Test.csv
├── frontend
│   ├── public
│   │   ├── favicon.svg
│   │   ├── index.html
│   │   ├── logo192.png
│   │   ├── logo512.png
│   │   ├── manifest.json
│   │   └── robots.txt
│   ├── src
│   │   ├── components
│   │   │   └── CSVProcessor.tsx   # Main data processor component
│   │   ├── App.css
│   │   ├── App.js
│   │   ├── App.test.js
│   │   ├── index.css
│   │   ├── index.js
│   │   ├── logo.svg
│   │   ├── reportWebVitals.js
│   │   └── setupTests.js
│   ├── .env                  # Frontend environment variables
│   ├── .gitignore
│   ├── package-lock.json
│   ├── package.json
│   ├── postcss.config.js
│   ├── README.md
│   └── tailwind.config.js
└── README.md                 # Main project readme
```
## Backend Setup

1. Navigate to the backend directory:

   ```bash
   cd backend
   ```

2. Create and activate a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Configure environment variables in `.env` (a loading sketch follows these steps):

   ```
   SCRAPER_API_KEY=<Scraper API Key>
   GROQ_API_KEY=<Groq API Key>
   ```

5. Start the server:

   ```bash
   python app.py
   ```

   The backend server will be available at http://localhost:5000.
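For reference, here is a minimal sketch of how `app.py` might read these variables, assuming `python-dotenv` is among the dependencies:

```python
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads backend/.env into the process environment
SCRAPER_API_KEY = os.environ["SCRAPER_API_KEY"]
GROQ_API_KEY = os.environ["GROQ_API_KEY"]
```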
## Frontend Setup

Prerequisite: Node.js 16+

1. Navigate to the frontend directory:

   ```bash
   cd frontend
   ```

2. Install dependencies:

   ```bash
   npm install
   ```

3. Configure environment variables in `.env`:

   ```
   REACT_APP_CLIENT_ID=<Google OAuth Client ID>
   REACT_APP_API_KEY=<Google Cloud API Key>
   ```

4. Start the development server:

   ```bash
   npm start
   ```

   The frontend will be available at http://localhost:3000.
## Usage

1. Upload your data: upload a CSV file or connect a Google Sheet, then select the primary column to search on.
2. Define your prompt: enter a search prompt containing the `{entity}` placeholder, which is filled in for each value in the selected column.
3. Retrieve and process data: ScraperWizard performs automated searches and processes the results through the integrated LLM.
4. View and download results: inspect the extracted data in the dashboard table, download it as a CSV file, or update the connected Google Sheet.

Additional highlights:

- Real-time Google Sheets updates with the extracted data (see the write-back sketch below).
- Robust error handling for failed queries.
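As an illustration of the Sheets update path, the backend could write results back via the Google Sheets API. This sketch uses the `gspread` client with a service account, which is one common approach rather than necessarily the project's actual implementation; the key file path, spreadsheet key, and row data are placeholders:

```python
import gspread  # pip install gspread

# Authenticate with a service-account JSON key (path is a placeholder).
gc = gspread.service_account(filename="service_account.json")

# Open the connected spreadsheet by its key and select the first worksheet.
sheet = gc.open_by_key("<spreadsheet-key>").sheet1

# Write a header row plus one row per processed entity (illustrative data).
rows = [["Entity", "Extracted Info"], ["OpenAI", "openai.com"]]
sheet.update(values=rows, range_name="A1")
```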
## Tech Stack

- Backend: Python, Flask
- Data Handling: Pandas, Google Sheets API
- Search API: ScraperAPI
- LLM API: Groq
- Frontend: ReactJS, Tailwind CSS
Made by Srikar Veluvali.