This project is a web crawler designed to scrape and crawl links within a specified base URL. The tool is configured to work on https://books.toscrape.com/, a demo website intended for web scraping practice, so all crawling stays within ethical and permitted bounds.
- Multi-threaded crawling: Efficiently processes multiple URLs concurrently using worker threads (see the sketch after this list).
- Queue management: Maintains a list of URLs to crawl (`queue.txt`) and a list of already visited URLs (`crawled.txt`).
- Domain restriction: Ensures that only links within the specified base URL are crawled, preventing unintended external requests.
- Customizable ChromeDriver setup: Allows flexibility in using Selenium for handling dynamic pages.
- Static website support: Efficiently handles crawling for websites with limited dynamic content.
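
The sketch below shows, under stated assumptions, how these pieces could fit together: a pool of worker threads pulls URLs from an in-memory queue, a domain check keeps the crawl on the base URL, and `queue.txt` / `crawled.txt` persist progress between runs. The thread count and the helper names (`in_domain`, `load`, `save`, `worker`) are illustrative and may not match the project's actual code.

```python
# Minimal sketch of the queue/threading/domain-restriction logic.
# The file names queue.txt and crawled.txt come from the project; the
# thread count and all function names here are illustrative assumptions.
import threading
from queue import Queue
from urllib.parse import urlparse

BASE_URL = "https://books.toscrape.com/"
NUMBER_OF_THREADS = 8          # assumed default
QUEUE_FILE = "queue.txt"       # URLs still to crawl
CRAWLED_FILE = "crawled.txt"   # URLs already visited

job_queue = Queue()
lock = threading.Lock()
queued, crawled = set(), set()

def in_domain(url: str) -> bool:
    """Domain restriction: only follow links on the base URL's host."""
    return urlparse(url).netloc == urlparse(BASE_URL).netloc

def load(path: str) -> set:
    """Read a newline-separated URL list, tolerating a missing file."""
    try:
        with open(path, encoding="utf-8") as f:
            return {line.strip() for line in f if line.strip()}
    except FileNotFoundError:
        return set()

def save(urls: set, path: str) -> None:
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(sorted(urls)))

def worker() -> None:
    while True:
        url = job_queue.get()
        if in_domain(url) and url not in crawled:
            # The real fetch/parse step would run here; see the
            # Selenium/BeautifulSoup sketch after the requirements list.
            with lock:
                queued.discard(url)
                crawled.add(url)
                save(queued, QUEUE_FILE)
                save(crawled, CRAWLED_FILE)
        job_queue.task_done()

if __name__ == "__main__":
    queued = load(QUEUE_FILE) or {BASE_URL}
    crawled = load(CRAWLED_FILE)
    for url in list(queued):              # snapshot before workers mutate it
        job_queue.put(url)
    for _ in range(NUMBER_OF_THREADS):
        threading.Thread(target=worker, daemon=True).start()
    job_queue.join()
```

Writing both files after every page is simple but I/O-heavy; batching the writes is a common refinement.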
- Python 3.x
- Selenium
- BeautifulSoup4
- ChromeDriver (update the `CHROMEDRIVER_PATH` variable in `main.py` with the path to your ChromeDriver executable; a setup sketch follows this list).
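
As a minimal sketch of how the ChromeDriver requirement might be wired up, the snippet below builds a headless Selenium driver from `CHROMEDRIVER_PATH` and uses BeautifulSoup to collect in-domain links from a rendered page. The helper names (`make_driver`, `extract_links`) and the headless option are assumptions for illustration, not the project's actual implementation.

```python
# Sketch of fetching a page with Selenium (ChromeDriver) and extracting
# links with BeautifulSoup. CHROMEDRIVER_PATH must point to your local
# ChromeDriver executable; the other names are illustrative assumptions.
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

CHROMEDRIVER_PATH = "/path/to/chromedriver"   # update for your machine
BASE_URL = "https://books.toscrape.com/"

def make_driver() -> webdriver.Chrome:
    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")    # no visible browser window
    service = Service(executable_path=CHROMEDRIVER_PATH)
    return webdriver.Chrome(service=service, options=options)

def extract_links(driver: webdriver.Chrome, url: str) -> set:
    """Load a page and return all links that stay on the base domain."""
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, "html.parser")
    links = set()
    for anchor in soup.find_all("a", href=True):
        absolute = urljoin(url, anchor["href"])
        if urlparse(absolute).netloc == urlparse(BASE_URL).netloc:
            links.add(absolute)
    return links

if __name__ == "__main__":
    driver = make_driver()
    try:
        print(extract_links(driver, BASE_URL))
    finally:
        driver.quit()
```

For a mostly static site like books.toscrape.com a plain HTTP client would be faster, but the Selenium path keeps dynamic pages working as described in the features above.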