This repository contains data scraped from a collection of websites for learning and experimentation purposes. The scraped data is organized into subfolders, one per website. The websites were scraped using different techniques: Beautiful Soup (bs4) for static content, Selenium for dynamic content, and a mix of both for certain cases.
- Main Folder: Contains subfolders, each representing a scraped website.
- Subfolders: Named after the website they were scraped from. Each subfolder contains:
  - The Python code used to scrape the website, in two formats: `.py` and `.ipynb`.
  - The CSV file containing the scraped data.
Static Websites:
- Scraped using Beautiful Soup (bs4).
- These websites have static HTML content that can be directly accessed and parsed.
-
Dynamic Websites:
- Scraped using Selenium.
- These websites load data dynamically through JavaScript, requiring a browser simulation to fetch the content.
-
Mixed Approach:
- Some websites required a combination of Selenium and bs4.
- Selenium was used to render the dynamic content, and Beautiful Soup was used for parsing the HTML.
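As a minimal sketch of the static approach, assuming a placeholder URL and selector (the real targets live in each subfolder's script):

```python
# Static scraping sketch: fetch the page with requests, parse with Beautiful Soup.
# The URL and the <h2> selector are placeholders, not taken from this repo's scripts.
import requests
from bs4 import BeautifulSoup

url = "https://example.com"  # placeholder URL
response = requests.get(url, timeout=10)
response.raise_for_status()  # fail loudly on HTTP errors

# Parse the static HTML using the lxml parser backend
soup = BeautifulSoup(response.text, "lxml")

# Example extraction: print the text of every <h2> heading
for heading in soup.find_all("h2"):
    print(heading.get_text(strip=True))
```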
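And a minimal sketch of the mixed approach, again with placeholder URL and selector: Selenium renders the JavaScript-driven page, then Beautiful Soup parses the rendered HTML.

```python
# Mixed-approach sketch: Selenium renders the page, Beautiful Soup parses it.
# The URL and the table-row selector are placeholders for illustration only.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

options = Options()
options.add_argument("--headless")  # run Chrome without a visible window

# Requires ChromeDriver to be available (recent Selenium versions can
# download it automatically via Selenium Manager)
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com")  # placeholder URL
    html = driver.page_source  # HTML after JavaScript has run
finally:
    driver.quit()

soup = BeautifulSoup(html, "lxml")
for row in soup.select("table tr"):  # example selector
    print(row.get_text(" ", strip=True))
```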
The scraped websites fall into the categories above; each subfolder's scripts show which technique was used for that site.
To replicate or run the scraping scripts in this project, the following Python libraries are required:

- Beautiful Soup (`bs4`)
- Selenium
- Requests
- lxml (a fast parser backend for Beautiful Soup)
- `html.parser` (built into Python's standard library; no separate installation needed)
Ensure you have Python installed, along with the necessary libraries. For Selenium, download the appropriate browser driver (e.g., ChromeDriver for Google Chrome).
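For example, the third-party libraries can be installed with pip (these are the standard PyPI package names):

```bash
pip install beautifulsoup4 selenium requests lxml
```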
- Clone this repository to your local machine:

  ```bash
  git clone https://github.com/chouaib-629/WebScraping.git
  ```

- Navigate to the desired subfolder to inspect the scraped data or the associated scripts.
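For example, assuming a subfolder named `example-site` containing a script named `scraper.py` (placeholder names; the actual ones match each scraped website):

```bash
cd WebScraping/example-site
python scraper.py   # or open the .ipynb version in Jupyter
```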
- The data scraped from these websites is for educational purposes only. Please adhere to the terms and conditions of the websites before scraping.
- The scripts and data are provided "as is" without warranty of any kind.
This project is managed by a data science enthusiast and full-stack developer experimenting with web scraping techniques.
For questions or support, please contact me.