BookScraper is a Scrapy project designed to scrape book information from the website books.toscrape.com. This project demonstrates various Scrapy features, including custom item pipelines, middleware for rotating user agents and proxies, and database integration.
Project structure:

```
bookscraper/
│
├── bookscraper/
│   ├── spiders/
│   │   └── bookspider.py
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   └── settings.py
│
├── .env
└── README.md
```
Features:

- Scrapes book details including title, price, availability, and more
- Custom item pipeline for data cleaning and processing (see the sketch after this list)
- Fake user-agent rotation using the ScrapeOps API
- Proxy rotation using scrapy-rotating-proxies
- Option to save data to a MySQL database
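As an example of the kind of cleaning the pipeline performs, prices on books.toscrape.com are strings like '£51.77' that need converting to numbers. A minimal sketch, with hypothetical class and field names, not the project's exact code:

```python
# pipelines.py (sketch) -- illustrative price-cleaning step
class BookscraperPipeline:
    def process_item(self, item, spider):
        # e.g. '£51.77' -> 51.77; 'price' field name is an assumption
        price = item.get("price")
        if isinstance(price, str):
            item["price"] = float(price.replace("£", "").strip())
        return item
```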
Requirements:

- Python 3.x
- Scrapy
- MySQL (optional, for database storage)
Installation:

- Clone this repository:

  ```
  git clone https://github.com/yourusername/bookscraper.git
  cd bookscraper
  ```

- Install the required packages:

  ```
  pip install scrapy mysql-connector-python python-dotenv requests scrapy-rotating-proxies
  ```

- Set up your environment variables by creating a `.env` file in the project root:

  ```
  DB_HOST=your_db_host
  DB_USER=your_db_user
  DB_PASSWORD=your_db_password
  DB_DATABASE=your_db_name
  SCRAPEOPS_API_KEY=your_scrapeops_api_key
  ```
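These variables can then be read into `settings.py` with python-dotenv. A minimal sketch of the loading step, assuming the project follows the usual pattern:

```python
# settings.py (sketch) -- load credentials from the .env file
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root

DB_HOST = os.getenv("DB_HOST")
DB_USER = os.getenv("DB_USER")
DB_PASSWORD = os.getenv("DB_PASSWORD")
DB_DATABASE = os.getenv("DB_DATABASE")
SCRAPEOPS_API_KEY = os.getenv("SCRAPEOPS_API_KEY")
```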
To run the spider:

```
scrapy crawl bookspider
```

This will start the scraping process and save the results to `booksdata.json` in the project root.
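How the JSON output is wired up is not shown here; with recent Scrapy versions it is typically done through the `FEEDS` setting. A minimal sketch, assuming this project uses feed exports rather than a `-o` flag:

```python
# settings.py (sketch) -- write scraped items to booksdata.json
FEEDS = {
    "booksdata.json": {
        "format": "json",
        "overwrite": True,  # replace the file on each run
    },
}
```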
You can modify the following settings in `settings.py`:

- `SCRAPEOPS_FAKE_USER_AGENTS_ACTIVE`: enable/disable fake user-agent rotation
- `SCRAPEOPS_NUM_RESULTS`: number of fake user agents to fetch
- `ROTATING_PROXY_LIST`: list of proxy servers to use
- `ITEM_PIPELINES`: enable/disable specific item pipelines
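Put together, these settings might look like the following. This is a sketch with placeholder values; the pipeline class names other than `SaveToMySQLPipeline` are assumptions:

```python
# settings.py (sketch) -- illustrative values only
SCRAPEOPS_FAKE_USER_AGENTS_ACTIVE = True  # toggle user-agent rotation
SCRAPEOPS_NUM_RESULTS = 50                # how many fake user agents to fetch

ROTATING_PROXY_LIST = [
    "proxy1.com:8000",  # see the proxy rotation section below
]

ITEM_PIPELINES = {
    # "BookscraperPipeline" is a hypothetical cleaning-pipeline name
    "bookscraper.pipelines.BookscraperPipeline": 300,
    # "bookscraper.pipelines.SaveToMySQLPipeline": 400,  # uncomment for MySQL
}
```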
This project uses the scrapy-rotating-proxies package for proxy rotation. To set it up:

- Install the package:

  ```
  pip install scrapy-rotating-proxies
  ```

- Add the following to your `settings.py`:

  ```python
  ROTATING_PROXY_LIST = [
      'proxy1.com:8000',
      'proxy2.com:8031',
      # Add more proxies here
  ]

  DOWNLOADER_MIDDLEWARES = {
      # ...
      'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
      'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
  }
  ```
Note: While using free proxies, you may encounter issues such as slow response times, non-functioning proxies, or proxies that have been blacklisted. For more reliable scraping, consider using paid proxy services.
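For longer proxy lists, scrapy-rotating-proxies can also read proxies from a file via its `ROTATING_PROXY_LIST_PATH` option, one proxy per line (the filename here is illustrative):

```python
# settings.py -- load proxies from a file instead of an inline list
ROTATING_PROXY_LIST_PATH = "proxies.txt"
```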
By default, the scraped data is saved to a JSON file. To enable MySQL storage:

- Uncomment `SaveToMySQLPipeline` in `ITEM_PIPELINES` in `settings.py`
- Ensure your MySQL credentials are correctly set in the `.env` file
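For reference, a pipeline of this kind usually follows the shape below. This is a simplified sketch using mysql-connector-python, not the project's exact implementation; the `books` table and its columns are illustrative:

```python
# pipelines.py (sketch) -- simplified MySQL pipeline, illustrative schema
import os
import mysql.connector

class SaveToMySQLPipeline:
    def open_spider(self, spider):
        # Credentials come from the .env file loaded in settings.py
        self.conn = mysql.connector.connect(
            host=os.getenv("DB_HOST"),
            user=os.getenv("DB_USER"),
            password=os.getenv("DB_PASSWORD"),
            database=os.getenv("DB_DATABASE"),
        )
        self.cur = self.conn.cursor()

    def process_item(self, item, spider):
        # The 'books' table and its columns are assumptions for illustration
        self.cur.execute(
            "INSERT INTO books (title, price, availability) VALUES (%s, %s, %s)",
            (item.get("title"), item.get("price"), item.get("availability")),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cur.close()
        self.conn.close()
```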
Customization:

- To modify the scraping behavior, edit `bookspider.py` in the `spiders` directory (a skeleton is sketched below)
- To change the data processing logic, modify the pipelines in `pipelines.py`
- To adjust middleware behavior, edit `middlewares.py`
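For orientation, a spider for books.toscrape.com typically has the shape below. This is a minimal sketch based on the site's known markup, not necessarily the project's exact `bookspider.py`:

```python
# spiders/bookspider.py (sketch) -- minimal spider for books.toscrape.com
import scrapy

class BookSpider(scrapy.Spider):
    name = "bookspider"
    start_urls = ["https://books.toscrape.com/"]

    def parse(self, response):
        # Each book on a listing page sits in an <article class="product_pod">
        for book in response.css("article.product_pod"):
            yield {
                "title": book.css("h3 a::attr(title)").get(),
                "price": book.css(".product_price .price_color::text").get(),
                "availability": book.css(".availability::text").getall()[-1].strip(),
            }

        # Follow the pagination link, if present
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```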
Contributions are welcome! Please feel free to submit a Pull Request.
This project is open source and available under the MIT License.