ClaudeCrawl is a clone of FireCrawl's web scraping feature.

ClaudeCrawl

A web content scraper utilizing Playwright and Claude LLM through AWS Bedrock to intelligently extract and structure data from web pages.

Simple Overview

Features

  • Intelligent web content extraction using Claude LLM
  • Structured data output in JSONL format
  • Configurable schema-based parsing
  • AWS Bedrock integration
  • Logging support
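Each scraped record is written as one JSON object per line (JSONL). A minimal stdlib sketch of reading such a file back; the field names below are illustrative assumptions, not the project's actual schema:

```python
import json

def read_jsonl(path):
    """Yield one parsed record per line of a JSONL file, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Illustrative records -- the actual fields depend on your OutputSchema
records = [
    {"title": "First article", "url": "https://example.com/1"},
    {"title": "Second article", "url": "https://example.com/2"},
]
with open("output.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

print(list(read_jsonl("output.jsonl")))
```

One object per line keeps the output streamable: a partially written file is still parseable up to the last complete line.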

Prerequisites

  • Docker
  • Python 3.13+
  • AWS Account with Bedrock access
  • AWS CLI configured

Installation

  1. Clone the repository:
git clone https://github.com/haandol/claude-web-scraper.git
cd claude-web-scraper
  2. Install uv:
pip install uv
  3. Install dependencies:
uv sync
  4. Install Playwright dependencies:
uv run playwright install
  5. Configure environment variables: create a .env file in the project root with:
MODEL_ID=us.anthropic.claude-3-5-haiku-20241022-v1:0
AWS_PROFILE_NAME=your_profile_name
AWS_REGION=your_aws_region
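The app presumably loads these with a dotenv-style helper. A minimal stdlib sketch of what such parsing looks like (the helper name and file name here are hypothetical, not part of the project):

```python
import os

def load_dotenv_simple(path=".env"):
    """Parse KEY=VALUE lines into os.environ, ignoring comments and blanks."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # setdefault: values already in the real environment take precedence
            os.environ.setdefault(key.strip(), value.strip())

# Demo with a sample file so we don't clobber a real .env
with open(".env.example", "w", encoding="utf-8") as f:
    f.write("MODEL_ID=us.anthropic.claude-3-5-haiku-20241022-v1:0\n")

load_dotenv_simple(".env.example")
print(os.environ["MODEL_ID"])
```

Using `setdefault` means an exported shell variable always wins over the file, which is the convention most dotenv libraries follow.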

Usage

  1. Open app.py and set the target url, the OutputSchema, and the extraction instruction.

  2. Run the scraper:

uv run -- python app.py

The script will:

  • Crawl the specified webpage
  • Extract article information using Claude LLM
  • Save the results to output/output.jsonl, unless you specify a different path
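A sketch of what the edited top of app.py might look like. The field names and instruction text are illustrative assumptions, and a plain dataclass stands in for whatever schema class the project actually uses:

```python
from dataclasses import dataclass, asdict

# Hypothetical schema -- adjust the fields to match the data you want extracted
@dataclass
class OutputSchema:
    title: str
    author: str
    published_at: str

url = "https://example.com/articles"  # placeholder target page
instruction = "Extract each article's title, author, and publication date."

# The scraper would emit one OutputSchema-shaped record per extracted item
record = OutputSchema(title="Sample", author="Jane Doe", published_at="2024-01-01")
print(asdict(record))
```

Defining the schema up front lets the LLM be prompted for exactly those fields, so every line of the JSONL output has a predictable shape.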

License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
