ScraperWizard

ScraperWizard is an AI-powered application designed to automate information retrieval from the web based on user-defined prompts. This tool allows users to upload datasets, define search queries dynamically, and extract relevant information using advanced LLM capabilities. The extracted data can be displayed in a user-friendly dashboard and downloaded as structured files.

Loom Video

This video walks through the demo and covers a few other important points: Loom Video

Key Features

  • File Upload & Google Sheets Integration:

    • Upload CSV files or connect Google Sheets for data input.
    • Select a primary column (e.g., company names) for the search query.
    • Preview uploaded data within the dashboard.
  • Dynamic Prompt Input:

    • Define custom search prompts using placeholders like {entity}.
    • Prompts are dynamically replaced with each entity from the selected column.
  • Automated Web Search:

    • Perform searches using ScraperAPI or similar services.
    • Handle rate limits and API constraints effectively.
    • Collect and store search results (e.g., URLs, snippets).
  • LLM Integration for Data Parsing:

    • Use Groq’s LLM or OpenAI’s GPT API to extract precise information from search results.
    • Customize backend prompts for detailed extraction (a sketch of the full pipeline follows this list).
  • Data Display & Download:

    • Visualize extracted data in a structured table format.
    • Download results as CSV files or update the connected Google Sheet.
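
Below is a minimal sketch of that end-to-end loop, assuming ScraperAPI's structured Google Search endpoint and the official groq Python SDK. The endpoint response shape, model name, and file/column names are illustrative assumptions, not taken from app.py:

    # Hedged sketch of the core pipeline: read the uploaded CSV, expand the
    # {entity} placeholder per row, search the web, and let the LLM extract
    # the requested value. Endpoint shape, model, and names are assumed.
    import os
    import time

    import pandas as pd
    import requests
    from groq import Groq

    SCRAPER_API_KEY = os.getenv("SCRAPER_API_KEY")
    groq_client = Groq(api_key=os.getenv("GROQ_API_KEY"))

    def web_search(query: str) -> str:
        # ScraperAPI's structured Google Search endpoint (assumed response keys).
        resp = requests.get(
            "https://api.scraperapi.com/structured/google/search",
            params={"api_key": SCRAPER_API_KEY, "query": query},
            timeout=60,
        )
        resp.raise_for_status()
        hits = resp.json().get("organic_results", [])[:5]
        return "\n".join(f"{h.get('title', '')}: {h.get('snippet', '')}" for h in hits)

    def llm_extract(prompt: str, snippets: str) -> str:
        # The backend prompt is customizable; this system message is one example.
        chat = groq_client.chat.completions.create(
            model="llama-3.3-70b-versatile",  # assumed model name
            messages=[
                {"role": "system", "content": "Answer with only the requested value."},
                {"role": "user", "content": f"{prompt}\n\nSearch results:\n{snippets}"},
            ],
        )
        return chat.choices[0].message.content.strip()

    df = pd.read_csv("uploads/input.csv")          # the uploaded dataset
    template = "Find the email address of {entity}."
    rows = []
    for entity in df["Company"]:                   # "Company": the chosen primary column
        query = template.replace("{entity}", str(entity))
        rows.append({"entity": entity, "result": llm_extract(query, web_search(query))})
        time.sleep(1)                              # naive spacing to respect rate limits
    pd.DataFrame(rows).to_csv("results/result_input.csv", index=False)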

Setup Instructions

Backend Setup

Prerequisites

  • Python 3.8+
  • API keys for ScraperAPI (or an equivalent service) and Groq, plus a Google Cloud OAuth client ID and a Google Cloud API key.
  • Google Cloud account for accessing Google Sheets API.

Project Structure

AI Based Webscraper
├── backend
│   ├── results
│   │   └── result_input.csv
│   ├── uploads
│   │   └── input.csv
│   ├── .env               # Backend environment variables
│   ├── .gitignore
│   ├── app.py             # Backend server code
│   ├── requirements.txt   # Python dependencies
│   ├── Test.csv
├── frontend
│   ├── public
│   │   ├── favicon.svg
│   │   ├── index.html
│   │   ├── logo192.png
│   │   ├── logo512.png
│   │   ├── manifest.json
│   │   ├── robots.txt
│   ├── src
│   │   ├── components
│   │   │   └── CSVProcessor.tsx  # Main data processor component
│   │   ├── App.css
│   │   ├── App.js
│   │   ├── App.test.js
│   │   ├── index.css
│   │   ├── index.js
│   │   ├── logo.svg
│   │   ├── reportWebVitals.js
│   │   ├── setupTests.js
│   ├── .env                # Frontend environment variables
│   ├── .gitignore
│   ├── package-lock.json
│   ├── package.json
│   ├── postcss.config.js
│   ├── README.md
│   ├── tailwind.config.js
├── README.md               # Main project readme

Installation

  1. Navigate to the backend directory:

    cd backend
  2. Create a virtual environment:

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:

    pip install -r requirements.txt
  4. Configure environment variables in .env:

    SCRAPER_API_KEY=<Scraper API Key>
    GROQ_API_KEY=<Groq API Key>
    
  5. Start the server:

    python app.py

The backend server will be available at http://localhost:5000.
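
app.py is expected to read these keys at startup; a minimal sketch of that, assuming python-dotenv is among the Python dependencies:

    # Hedged sketch: load backend/.env into the environment at startup
    # (assumes python-dotenv is listed in requirements.txt).
    import os
    from dotenv import load_dotenv

    load_dotenv()  # picks up the .env file in the current directory
    SCRAPER_API_KEY = os.environ["SCRAPER_API_KEY"]
    GROQ_API_KEY = os.environ["GROQ_API_KEY"]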

Frontend Setup

Prerequisites

  • Node.js 16+

Installation

  1. Navigate to the frontend directory:

    cd frontend
  2. Install dependencies:

    npm install
  3. Configure environment variables in .env:

    REACT_APP_CLIENT_ID=<Google OAuth Client ID>
    REACT_APP_API_KEY=<Google Cloud API Key>
    
  4. Start the development server:

    npm start

The frontend will be available at http://localhost:3000.

Usage Guide

  1. Upload your data:

    • Upload a CSV file or connect to a Google Sheet.
    • Select the column containing entities for the search query.
  2. Define your prompt:

    • Input a query template like: "Find the email address of {entity}."
    • The placeholder {entity} is dynamically replaced for each row.
  3. Retrieve and process data:

    • ScraperWizard performs automated searches and processes results through the integrated LLM.
  4. View and download results:

    • Extracted data is displayed in a table format.
    • Download the results as a CSV.

Optional Features

  • Real-time Google Sheets updates with the extracted data (a server-side sketch follows).
  • Robust error handling for failed queries (see the retry sketch below).
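
For the Sheets update, this project wires Google access through the frontend OAuth client; a server-side equivalent is a different auth route, swapped in here purely for illustration. This hedged sketch uses the third-party gspread library with a service account:

    # Hedged, swapped-in alternative: push the results to a sheet with
    # gspread + a service account (NOT the frontend OAuth flow used here).
    import gspread
    import pandas as pd

    gc = gspread.service_account(filename="service_account.json")  # assumed creds file
    ws = gc.open_by_key("<spreadsheet-id>").sheet1

    df = pd.read_csv("results/result_input.csv")
    # One header row followed by the data rows.
    ws.update([df.columns.tolist()] + df.astype(str).values.tolist())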
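
The error handling could take the shape of a small retry wrapper with exponential backoff, so one failed query yields an empty cell instead of aborting the whole batch. web_search below refers to the hypothetical helper sketched under Key Features:

    # Hedged sketch: retry a failed search with exponential backoff, then
    # degrade to an empty result. web_search is the assumed helper above.
    import time
    import requests

    def search_with_retry(query: str, attempts: int = 3) -> str:
        for i in range(attempts):
            try:
                return web_search(query)
            except requests.RequestException:
                time.sleep(2 ** i)  # wait 1s, 2s, 4s between attempts
        return ""  # give up gracefully so the batch keeps going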

Technologies Used

  • Backend: Python, Flask
  • Data Handling: Pandas, Google Sheets API
  • Search API: ScraperAPI
  • LLM API: Groq
  • Frontend: ReactJS, Tailwind CSS

Made by Srikar Veluvali.
