This repository contains a Python-based project that demonstrates web scraping and data analysis. The project involves extracting book-related data from the Books to Scrape website, followed by exploratory data analysis (EDA) and visualizations to gain insights from the collected data.
The repository includes the following Jupyter Notebooks:
- books-website-scraping.ipynb
  - Extracts book data such as titles, ratings, prices, and availability from the website.
  - Saves the scraped data into a CSV file for further analysis.
- books-data-analysis.ipynb
  - Loads the scraped data from the CSV file.
  - Cleans and preprocesses the dataset (e.g., converting ratings to numerical values).
  - Performs EDA and visualizations to analyze pricing, ratings, and other trends.
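
The cleaning step mentioned above (converting ratings to numerical values) can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the function names are hypothetical, and the raw string formats (`'Three'`, `'£51.77'`, `'In stock (22 available)'`) are assumptions about what the scraped text looks like.

```python
import re

# Assumed encoding: ratings arrive as English words "One" to "Five".
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def rating_to_int(word: str) -> int:
    """Convert a rating word such as 'Three' to the integer 3."""
    return RATING_WORDS[word.strip().title()]


def price_to_float(text: str) -> float:
    """Extract the numeric part of a price string such as '£51.77'."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no price found in {text!r}")
    return float(match.group())


def parse_availability(text: str) -> tuple[bool, int]:
    """Split text like 'In stock (22 available)' into (in_stock, quantity)."""
    in_stock = "in stock" in text.lower()
    match = re.search(r"\((\d+)\s+available\)", text)
    # Assumption: when no count is present, treat the quantity as 0.
    quantity = int(match.group(1)) if match else 0
    return in_stock, quantity
```

With pandas, the same dictionary could be passed to `Series.map` to convert a whole rating column at once.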
Key features:
- Web Scraping:
  - Extracts book details, including:
    - Book ID (UPC)
    - Title
    - Category
    - Rating
    - Price
    - Stock availability (stock status)
    - Quantity available
- Exploratory Data Analysis (EDA):
  - Visualizes key metrics such as price distributions and rating trends.
  - Identifies relationships between features like price and rating.
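
One simple way to look at the relationship between price and rating is to average the price within each rating level. The sketch below uses only the standard library and a handful of made-up `(rating, price)` pairs; the function name and the sample values are hypothetical and are not results from the dataset.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (rating, price in GBP) pairs standing in for cleaned rows.
rows = [(1, 20.0), (1, 30.0), (3, 45.0), (5, 10.0), (5, 50.0)]


def mean_price_by_rating(pairs):
    """Group prices by rating and return {rating: mean price}, sorted by rating."""
    grouped = defaultdict(list)
    for rating, price in pairs:
        grouped[rating].append(price)
    return {rating: mean(prices) for rating, prices in sorted(grouped.items())}


summary = mean_price_by_rating(rows)
for rating, avg in summary.items():
    print(f"{rating} stars: GBP {avg:.2f}")
```

On a pandas DataFrame the equivalent would be a `groupby` on the rating column followed by `mean()` on the price column, and the resulting table is the kind of summary that a bar plot would visualize.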
The project uses the following libraries:
- Web Scraping:
  - requests
  - BeautifulSoup
- Data Manipulation:
  - pandas
- Data Visualization:
  - matplotlib
  - seaborn
  - squarify
The scraped dataset is available on Kaggle (Books Data on Kaggle) and contains the following columns:
- ID: Unique Product Code (UPC) for each book.
- Title: The title of the book.
- Category: Genre or category of the book.
- Price [£]: Price in GBP (£).
- Rating: Star rating (One to Five) based on customer reviews.
- Availability: Whether the book is in stock.
- Quantity: The number of available copies.
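
As a rough sketch of how a file with these columns could be read into typed records, the example below uses the standard `csv` module. The header names are taken from the column list above; the `Book` class, the `load_books` function, and the two sample rows are made up for illustration, and the exact value formats in the real CSV are an assumption.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Book:
    upc: str
    title: str
    category: str
    price_gbp: float
    rating: str          # assumed to be stored as a word, "One" to "Five"
    availability: str
    quantity: int


# Made-up sample rows; only the header names come from the column list.
SAMPLE_CSV = """ID,Title,Category,Price [£],Rating,Availability,Quantity
abc123,Example Book A,Fiction,12.50,Three,In stock,4
def456,Example Book B,Poetry,30.00,Five,In stock,1
"""


def load_books(handle) -> list[Book]:
    """Read rows from an open CSV file and convert the numeric columns."""
    books = []
    for row in csv.DictReader(handle):
        books.append(
            Book(
                upc=row["ID"],
                title=row["Title"],
                category=row["Category"],
                price_gbp=float(row["Price [£]"]),
                rating=row["Rating"],
                availability=row["Availability"],
                quantity=int(row["Quantity"]),
            )
        )
    return books


books = load_books(io.StringIO(SAMPLE_CSV))
```

For a file on disk, the same function could be called with `open(path, newline="", encoding="utf-8")` in place of the `io.StringIO` object.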