ScraperWizard

ScraperWizard is an AI-powered application designed to automate information retrieval from the web based on user-defined prompts. This tool allows users to upload datasets, define search queries dynamically, and extract relevant information using advanced LLM capabilities. The extracted data can be displayed in a user-friendly dashboard and downloaded as structured files.

Loom Video

This video walks through the demo and covers a few other important points: Loom Video

Key Features

  • File Upload & Google Sheets Integration:

    • Upload CSV files or connect Google Sheets for data input.
    • Select a primary column (e.g., company names) for the search query.
    • Preview uploaded data within the dashboard.
  • Dynamic Prompt Input:

    • Define custom search prompts using placeholders like {entity}.
    • Prompts are dynamically replaced with each entity from the selected column.
  • Automated Web Search:

    • Perform searches using ScraperAPI or similar services.
    • Handle rate limits and API constraints effectively.
    • Collect and store search results (e.g., URLs, snippets).
  • LLM Integration for Data Parsing:

    • Use Groq’s LLM or OpenAI’s GPT API to extract precise information from search results.
    • Customize backend prompts for detailed extraction (a sketch of the full pipeline follows this list).
  • Data Display & Download:

    • Visualize extracted data in a structured table format.
    • Download results as CSV files or update the connected Google Sheet.
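
Below is a minimal sketch of that end-to-end loop, assuming ScraperAPI's structured Google Search endpoint and the official groq Python SDK. The endpoint response shape, model name, and file/column names are illustrative assumptions, not taken from app.py:

    # Hedged sketch of the core pipeline: read the uploaded CSV, expand the
    # {entity} placeholder per row, search the web, and let the LLM extract
    # the requested value. Endpoint shape, model, and names are assumed.
    import os
    import time

    import pandas as pd
    import requests
    from groq import Groq

    SCRAPER_API_KEY = os.getenv("SCRAPER_API_KEY")
    groq_client = Groq(api_key=os.getenv("GROQ_API_KEY"))

    def web_search(query: str) -> str:
        # ScraperAPI's structured Google Search endpoint (assumed response keys).
        resp = requests.get(
            "https://api.scraperapi.com/structured/google/search",
            params={"api_key": SCRAPER_API_KEY, "query": query},
            timeout=60,
        )
        resp.raise_for_status()
        hits = resp.json().get("organic_results", [])[:5]
        return "\n".join(f"{h.get('title', '')}: {h.get('snippet', '')}" for h in hits)

    def llm_extract(prompt: str, snippets: str) -> str:
        # The backend prompt is customizable; this system message is one example.
        chat = groq_client.chat.completions.create(
            model="llama-3.3-70b-versatile",  # assumed model name
            messages=[
                {"role": "system", "content": "Answer with only the requested value."},
                {"role": "user", "content": f"{prompt}\n\nSearch results:\n{snippets}"},
            ],
        )
        return chat.choices[0].message.content.strip()

    df = pd.read_csv("uploads/input.csv")          # the uploaded dataset
    template = "Find the email address of {entity}."
    rows = []
    for entity in df["Company"]:                   # "Company": the chosen primary column
        query = template.replace("{entity}", str(entity))
        rows.append({"entity": entity, "result": llm_extract(query, web_search(query))})
        time.sleep(1)                              # naive spacing to respect rate limits
    pd.DataFrame(rows).to_csv("results/result_input.csv", index=False)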

Setup Instructions

Backend Setup

Prerequisites

  • Python 3.8+
  • API keys for ScraperAPI (or an equivalent service) and Groq, plus a Google Cloud OAuth client ID and a Google Cloud API key.
  • Google Cloud account for accessing Google Sheets API.

Project Structure

AI Based Webscraper
├── backend
│   ├── results
│   │   └── result_input.csv
│   ├── uploads
│   │   └── input.csv
│   ├── .env               # Backend environment variables
│   ├── .gitignore
│   ├── app.py             # Backend server code
│   ├── requirements.txt   # Python dependencies
│   ├── Test.csv
├── frontend
│   ├── public
│   │   ├── favicon.svg
│   │   ├── index.html
│   │   ├── logo192.png
│   │   ├── logo512.png
│   │   ├── manifest.json
│   │   ├── robots.txt
│   ├── src
│   │   ├── components
│   │   │   └── CSVProcessor.tsx  # Main data processor component
│   │   ├── App.css
│   │   ├── App.js
│   │   ├── App.test.js
│   │   ├── index.css
│   │   ├── index.js
│   │   ├── logo.svg
│   │   ├── reportWebVitals.js
│   │   ├── setupTests.js
│   ├── .env                # Frontend environment variables
│   ├── .gitignore
│   ├── package-lock.json
│   ├── package.json
│   ├── postcss.config.js
│   ├── README.md
│   ├── tailwind.config.js
├── README.md               # Main project readme

Installation

  1. Navigate to the backend directory:

    cd backend
  2. Create a virtual environment:

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:

    pip install -r requirements.txt
  4. Configure environment variables in .env:

    SCRAPER_API_KEY=<Scraper API Key>
    GROQ_API_KEY=<Groq API Key>
    
  5. Start the server:

    python app.py

The backend server will be available at http://localhost:5000.
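
app.py is expected to read these keys at startup; a minimal sketch of that, assuming python-dotenv is among the Python dependencies:

    # Hedged sketch: load backend/.env into the environment at startup
    # (assumes python-dotenv is listed in requirements.txt).
    import os
    from dotenv import load_dotenv

    load_dotenv()  # picks up the .env file in the current directory
    SCRAPER_API_KEY = os.environ["SCRAPER_API_KEY"]
    GROQ_API_KEY = os.environ["GROQ_API_KEY"]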

Frontend Setup

Prerequisites

  • Node.js 16+

Installation

  1. Navigate to the frontend directory:

    cd frontend
  2. Install dependencies:

    npm install
  3. Configure environment variables in .env:

    REACT_APP_CLIENT_ID=<Google OAuth Client ID>
    REACT_APP_API_KEY=<Google Cloud API Key>
    
  4. Start the development server:

    npm start

The frontend will be available at http://localhost:3000.

Usage Guide

  1. Upload your data:

    • Upload a CSV file or connect to a Google Sheet.
    • Select the column containing entities for the search query.
  2. Define your prompt:

    • Input a query template like: "Find the email address of {entity}."
    • The placeholder {entity} is dynamically replaced for each row.
  3. Retrieve and process data:

    • ScraperWizard performs automated searches and processes results through the integrated LLM.
  4. View and download results:

    • Extracted data is displayed in a table format.
    • Download the results as a CSV.

Optional Features

  • Real-time Google Sheets updates with the extracted data (a server-side sketch follows).
  • Robust error handling for failed queries (see the retry sketch below).
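
For the Sheets update, this project wires Google access through the frontend OAuth client; a server-side equivalent is a different auth route, swapped in here purely for illustration. This hedged sketch uses the third-party gspread library with a service account:

    # Hedged, swapped-in alternative: push the results to a sheet with
    # gspread + a service account (NOT the frontend OAuth flow used here).
    import gspread
    import pandas as pd

    gc = gspread.service_account(filename="service_account.json")  # assumed creds file
    ws = gc.open_by_key("<spreadsheet-id>").sheet1

    df = pd.read_csv("results/result_input.csv")
    # One header row followed by the data rows.
    ws.update([df.columns.tolist()] + df.astype(str).values.tolist())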
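
The error handling could take the shape of a small retry wrapper with exponential backoff, so one failed query yields an empty cell instead of aborting the whole batch. web_search below refers to the hypothetical helper sketched under Key Features:

    # Hedged sketch: retry a failed search with exponential backoff, then
    # degrade to an empty result. web_search is the assumed helper above.
    import time
    import requests

    def search_with_retry(query: str, attempts: int = 3) -> str:
        for i in range(attempts):
            try:
                return web_search(query)
            except requests.RequestException:
                time.sleep(2 ** i)  # wait 1s, 2s, 4s between attempts
        return ""  # give up gracefully so the batch keeps going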

Technologies Used

  • Backend: Python, Flask
  • Data Handling: Pandas, Google Sheets API
  • Search API: ScraperAPI
  • LLM API: Groq
  • Frontend: ReactJS, Tailwind CSS

Made by Srikar Veluvali.
