Web Crawler

A Python-based web crawler that maps website structure and extracts content. This tool can generate both text and Excel outputs of crawled pages along with visual sitemaps.

Features

  • Crawls websites and extracts content
  • Generates visual sitemaps in DOT format
  • Supports both TXT and XLSX output formats
  • Configurable crawl depth and page limits
  • Handles both internal and external links
  • Normalizes URLs and strips unwanted query parameters (see the sketch below)
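
As a rough illustration of the normalization step, here is a minimal sketch using only the standard library; the utm_ parameter filter and the exact rules are assumptions, not necessarily what crawler.py does:

from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

def normalize_url(url):
    # Lowercase the scheme and host, drop the fragment, and filter out
    # tracking-style query parameters (the utm_ prefix is an assumed example).
    parts = urlparse(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if not k.startswith("utm_")]
    return urlunparse((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", "", urlencode(query), ""))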

Installation

  1. Clone the repository
  2. Install dependencies:
pip install -r requirements.txt
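
If you need to recreate requirements.txt, the Dependencies section below suggests it contains at least the following (version pins are omitted here as an assumption):

requests
beautifulsoup4
xlsxwriter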

Usage

Basic usage:

python crawler.py <URL>

Options:

  • --depth: Maximum crawl depth (default: unlimited)
  • --max-pages: Maximum number of pages to crawl (default: unlimited)
  • --output-format: Output format, either 'txt' or 'xlsx' (default: txt)

Example:

python crawler.py https://example.com --depth 2 --max-pages 10 --output-format xlsx
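
To make the flag semantics concrete, here is a minimal sketch of a breadth-first crawl that honors the depth and page limits; it illustrates the documented behavior and is not crawler.py's actual implementation:

from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(start_url, depth=None, max_pages=None):
    # Visit each URL once, breadth-first, stopping at the configured limits.
    seen = {start_url}
    queue = deque([(start_url, 0)])
    pages = []
    while queue:
        url, d = queue.popleft()
        if max_pages is not None and len(pages) >= max_pages:
            break
        html = requests.get(url, timeout=10).text
        pages.append((url, html))
        if depth is not None and d >= depth:
            continue  # page is recorded, but its links are not followed
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link not in seen:
                seen.add(link)
                queue.append((link, d + 1))
    return pages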

Output

The crawler generates two types of output:

  1. Content files (TXT or XLSX) containing extracted text from crawled pages
  2. A sitemap.dot file visualizing the website structure

Output files are organized in folders by domain name in the output directory.
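
Assuming Graphviz is installed, the sitemap can be rendered to an image from the domain's folder in the output directory, for example:

dot -Tpng sitemap.dot -o sitemap.png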

Dependencies

  • requests: For making HTTP requests
  • beautifulsoup4: For HTML parsing
  • xlsxwriter: For Excel file generation
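
As a rough sketch of how these libraries fit together for the XLSX output (the one-row-per-page column layout is an assumption, not necessarily crawler.py's):

import xlsxwriter

def write_xlsx(pages, path="output.xlsx"):
    # pages is a list of (url, extracted_text) tuples.
    workbook = xlsxwriter.Workbook(path)
    sheet = workbook.add_worksheet("pages")
    sheet.write_row(0, 0, ["url", "text"])
    for row, (url, text) in enumerate(pages, start=1):
        sheet.write_row(row, 0, [url, text])
    workbook.close()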
