ClaudeCrawl is a clone of FireCrawl's web scraping feature.

ClaudeCrawl

A web content scraper utilizing Playwright and Claude LLM through AWS Bedrock to intelligently extract and structure data from web pages.

Simple Overview

Features

  • Intelligent web content extraction using Claude LLM
  • Structured data output in JSONL format
  • Configurable schema-based parsing
  • AWS Bedrock integration
  • Logging support
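Each scraped record is written as one JSON object per line (JSONL). A minimal stdlib sketch of reading such a file back; the field names below are illustrative assumptions, not the project's actual schema:

```python
import json

def read_jsonl(path):
    """Yield one parsed record per line of a JSONL file, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Illustrative records -- the actual fields depend on your OutputSchema
records = [
    {"title": "First article", "url": "https://example.com/1"},
    {"title": "Second article", "url": "https://example.com/2"},
]
with open("output.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

print(list(read_jsonl("output.jsonl")))
```

One object per line keeps the output streamable: a partially written file is still parseable up to the last complete line.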

Prerequisites

  • Docker
  • Python 3.13+
  • AWS Account with Bedrock access
  • AWS CLI configured

Installation

  1. Clone the repository:
git clone https://github.com/haandol/claude-web-scraper.git
cd claude-web-scraper
  2. Install uv:
pip install uv
  3. Install dependencies:
uv sync
  4. Install Playwright dependencies:
uv run playwright install
  5. Configure environment variables: create a .env file in the project root with:
MODEL_ID=us.anthropic.claude-3-5-haiku-20241022-v1:0
AWS_PROFILE_NAME=your_profile_name
AWS_REGION=your_aws_region
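The app presumably loads these with a dotenv-style helper. A minimal stdlib sketch of what such parsing looks like (the helper name and file name here are hypothetical, not part of the project):

```python
import os

def load_dotenv_simple(path=".env"):
    """Parse KEY=VALUE lines into os.environ, ignoring comments and blanks."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # setdefault: values already in the real environment take precedence
            os.environ.setdefault(key.strip(), value.strip())

# Demo with a sample file so we don't clobber a real .env
with open(".env.example", "w", encoding="utf-8") as f:
    f.write("MODEL_ID=us.anthropic.claude-3-5-haiku-20241022-v1:0\n")

load_dotenv_simple(".env.example")
print(os.environ["MODEL_ID"])
```

Using `setdefault` means an exported shell variable always wins over the file, which is the convention most dotenv libraries follow.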

Usage

  1. Open app.py and set the target url, the OutputSchema, and the extraction instruction.

  2. Run the scraper:

uv run -- python app.py

The script will:

  • Crawl the specified webpage
  • Extract article information using Claude LLM
  • Save the results to output/output.jsonl, unless you specify a different path
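A sketch of what the edited top of app.py might look like. The field names and instruction text are illustrative assumptions, and a plain dataclass stands in for whatever schema class the project actually uses:

```python
from dataclasses import dataclass, asdict

# Hypothetical schema -- adjust the fields to match the data you want extracted
@dataclass
class OutputSchema:
    title: str
    author: str
    published_at: str

url = "https://example.com/articles"  # placeholder target page
instruction = "Extract each article's title, author, and publication date."

# The scraper would emit one OutputSchema-shaped record per extracted item
record = OutputSchema(title="Sample", author="Jane Doe", published_at="2024-01-01")
print(asdict(record))
```

Defining the schema up front lets the LLM be prompted for exactly those fields, so every line of the JSONL output has a predictable shape.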

License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
