GrabIT is a simple FastAPI-based web scraping tool designed to scrape and store IT product data from various e-commerce websites. It simplifies data collection for market research and for managing an inventory of IT products.
Use Cases:
- Market Analysis: Collect comprehensive product data to gain insights into market trends and competitor offerings.
- Product Management: Automate data entry for product catalogs and inventory management.
Features:
- Easy configuration of product page structure: specify element CSS selectors for multiple websites in the sites.yaml file.
- Data extraction from individual product URLs or by crawling a sitemap.xml.
- React frontend with interactive Swagger API documentation.
- JWT authentication
- Docker ready
Under the hood, GrabIT combines several technologies to fetch, parse, validate, store, and serve data. Here’s a breakdown of its components:
- Requests: Used for fetching web pages.
- BeautifulSoup: Used for parsing data from HTML.
- Pydantic: Used for data validation.
- SQLAlchemy and PostgreSQL: Used for storing data.
- FastAPI: Serves the data as an API.
The CSS selectors used to parse data from the raw HTML are defined in the sites.yaml file. See Site Configuration below for more details.
Clone the project
git clone https://github.com/rnium/grabit.git
Go to the project directory
cd grabit
Create a .env file with the following variables
SECRET_KEY=
DB_URL=
CORS_ALLOW_HOST=
USER_AGENT=
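As a rough aid for filling in these variables, the sketch below builds the text of a .env file with a randomly generated SECRET_KEY. The source does not state the expected format of any value, so everything passed in here is a placeholder assumption: the PostgreSQL URL follows the common SQLAlchemy form, and the CORS host and user agent strings are purely illustrative. The helper name build_env is hypothetical and not part of GrabIT.

```python
import secrets


def build_env(db_url: str, cors_host: str, user_agent: str) -> str:
    """Return .env file text with a freshly generated SECRET_KEY (hypothetical helper)."""
    # token_urlsafe returns a random URL-safe string; 48 bytes of entropy is an arbitrary choice.
    secret_key = secrets.token_urlsafe(48)
    lines = [
        f"SECRET_KEY={secret_key}",
        f"DB_URL={db_url}",
        f"CORS_ALLOW_HOST={cors_host}",
        f"USER_AGENT={user_agent}",
    ]
    return "\n".join(lines) + "\n"


# All three values below are placeholders; replace them with your own settings.
env_text = build_env(
    db_url="postgresql://grabit:change-me@db:5432/grabit",
    cors_host="http://localhost:3000",
    user_agent="Mozilla/5.0 (compatible; GrabIT)",
)
print(env_text)
```

Redirect the printed text into a file named .env in the project directory, then adjust the values to match your database and frontend.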
Create your sites.yaml (see Site Configuration below for how to configure it) and run Docker Compose
docker-compose up -d
Create a user
docker exec -it grabit python manage.py createuser
Site Configuration:
To scrape data from a website using GrabIT, you need to configure each target website in the sites.yaml file. GrabIT focuses on extracting data from the product pages of websites. For example, consider Ryans, the leading IT product retailer in Bangladesh. Below are the sections that GrabIT can scrape from a product page on Ryans:
- Product page container
- Product title
- Product regular/actual price
- Product current/discounted price
- Key features container
- Key feature item
- Product images container
- Specification table
- Specification container
- Specification container heading
- Specification item
- Specification key
- Specification value
- Product description
name: Ryans
urlpatterns:
  - http.*ryans.com
  - http.*ryanscomputers.com
product:
  main_selector: div[itemtype="http://schema.org/Product"]
  title_selector: .product_content h1[itemprop="name"]
  price_selector_actual: .rp-block .new-reg-text | .rp-block
  price_selector_current: .rp-block .new-sp-text | .rp-block
  description_selector: .spec-details .card .card-body
  key_feature:
    container_selector: .short-desc-attr ul.category-info
    item_selector: li
    attribute: null
    item_splitter: '-'
  spec_table:
    table_selector: "#add-spec-div | .specification-table"
    container_selector: .grid-container.for-last-hr
    item_selector: div.row[itemprop="description"]
    heading_selector: h6.fw-bold
    item_key_selector: .att-title
    item_value_selector: .att-value
    attribute: null
  images:
    container_selector: '#slideshow-items-container | .modal-product-img .side_view'
    item_selector: img
    attribute: src
- name: The name of the website.
- urlpatterns: A list of regex patterns for matching the website’s hostname.
- product: This section contains all the necessary selectors for scraping data from the product page. Refer to the Selector Configuration section for details.
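To illustrate how urlpatterns could be used to pick the right site configuration for a given product URL, here is a minimal sketch. It assumes each pattern is applied to the full URL with re.match; the source does not say which matching function GrabIT actually uses. The function name find_site, the SITES list, and the second site entry are all hypothetical.

```python
import re
from typing import Optional

# Site entries reduced to the two fields needed for matching.
SITES = [
    {"name": "Ryans", "urlpatterns": [r"http.*ryans.com", r"http.*ryanscomputers.com"]},
    # Hypothetical second site, included only to show a non-matching entry.
    {"name": "Example Shop", "urlpatterns": [r"http.*example-shop\.test"]},
]


def find_site(url: str, sites: list) -> Optional[dict]:
    """Return the first site whose urlpatterns match the URL, or None."""
    for site in sites:
        # re.match anchors at the start of the string, so patterns begin with "http".
        if any(re.match(pattern, url) for pattern in site["urlpatterns"]):
            return site
    return None


site = find_site("https://www.ryans.com/some-product", SITES)
print(site["name"] if site else "no matching site")
```

Note that an unescaped dot in a pattern matches any character, so a stricter pattern would escape it (for example, ryans\.com).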
The sites.yaml file is a multi-document YAML file, so you can define configurations for multiple websites in the same file, separating each site's document with a line of three hyphens (---).
Note: After adding or altering a site configuration, restart the grabit container or the Gunicorn server. Then open the Swagger API docs and scrape a product link from that website to confirm it produces the expected result.
Selector Configuration:
To scrape data accurately, you need to specify appropriate and specific CSS selectors for each site. Below is a guide to configuring these selectors. Multiple selectors for a single item can be defined using a pipe (|) character. GrabIT will attempt to use each selector from left to right until it finds a matching element.
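The left-to-right fallback described above can be sketched roughly as follows. This is not GrabIT's actual implementation: the function name select_first is hypothetical, and the lookup is passed in as a callable so the example runs without any third-party library. A plain dictionary stands in for a parsed page here.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def select_first(selector_spec: str, select_one: Callable[[str], Optional[T]]) -> Optional[T]:
    """Try each pipe-separated selector in order and return the first match."""
    for selector in selector_spec.split("|"):
        selector = selector.strip()
        if not selector:
            continue  # ignore empty pieces such as a trailing pipe
        match = select_one(selector)
        if match is not None:
            return match
    return None


# Stand-in for a parsed page: maps a selector to the element it would match.
FAKE_PAGE = {".rp-block": "<span>45,000</span>"}

# The first selector finds nothing, so the second one is used.
result = select_first(".rp-block .new-sp-text | .rp-block", FAKE_PAGE.get)
print(result)
```

With BeautifulSoup, the select_one method of a parsed document could be passed as the callable instead of FAKE_PAGE.get. One caveat of a simple split like this: selectors that themselves contain a pipe, such as the |= attribute operator, would be broken apart.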
- Product Page Main Selector:
  - Directive: main_selector
  - Description: The top-level wrapper that encloses all product information.
- Product Title:
  - Directive: title_selector
  - Expected Output: String
- Product Actual Price:
  - Directive: price_selector_actual
  - Description: The regular or listed price of the product.
  - Expected Output: Float
  - Note: The initial extraction is a string, which is then converted to a float using a price formatter. This function can be defined in the /app/scraping/price_formatter.py file and specified using the price_formatter directive. If not specified, a default formatter is used. If the default formatter is insufficient, you can define a custom one in price_formatter.py, add it to the FORMATTER_MAPPING, and reference it in the price_formatter directive.
- Product Current Price:
  - Directive: price_selector_current
  - Description: The discounted price of the product.
  - Expected Output: Float
- Key Features:
  - Directive: key_feature
  - Description: A list of key features, typically presented as an unordered list with 5-10 items.
  - Configuration:
    - Container: Specify the container_selector for the key features.
    - Items: Define the item_selector for list items.
    - Splitter: If the extracted string needs to be split into key-value pairs (e.g., 'Processor Name - Apple M3' into 'Processor Name' and 'Apple M3'), specify the item_splitter. If not specified, GrabIT will treat the string as a single item.
- Images:
  - Directive: images
  - Description: A list of images. In some cases, there are two different image containers: one for thumbnails and one for high-resolution display images. Be sure to look for and specify the container that holds the high-resolution images.
  - Configuration:
    - Container: Specify the container_selector.
    - Items: Define the item_selector.
    - Attribute: Specify the attribute (e.g., src for <img> tags) from which to extract the image URLs.
- Specification Table:
  - Directive: spec_table
  - Description: Contains multiple containers with specifications on different aspects of the product.
  - Configuration:
    - Table: Specify the table_selector.
    - Containers: Define container_selector for each specification set (e.g., 'Processor', 'Memory').
    - Heading: Specify the heading_selector for container titles.
    - Items: Define the item_selector for selecting all items within a container.
    - Key-Value Pairs: Specify item_key_selector and item_value_selector to extract key-value pairs.
- Product Description:
  - Directive: description_selector
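Two of the directives above post-process extracted strings: price_formatter converts a price string to a float, and item_splitter splits a key feature into a key-value pair. The sketch below shows one way such helpers might look. Only the name FORMATTER_MAPPING comes from the text above; its structure (a dictionary from name to function), the mapping keys, and every function name here are assumptions for illustration, not GrabIT's actual code.

```python
import re
from typing import Callable, Dict, Optional, Tuple, Union


def default_price_formatter(raw: str) -> float:
    """Parse prices that use commas as grouping and a dot as the decimal mark."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", raw)
    if match is None:
        raise ValueError(f"no price found in {raw!r}")
    return float(match.group().replace(",", ""))


def european_price_formatter(raw: str) -> float:
    """Hypothetical custom formatter: dots as grouping, comma as the decimal mark."""
    match = re.search(r"\d[\d.]*(?:,\d+)?", raw)
    if match is None:
        raise ValueError(f"no price found in {raw!r}")
    return float(match.group().replace(".", "").replace(",", "."))


# Assumed shape: the name used in the price_formatter directive maps to a function.
FORMATTER_MAPPING: Dict[str, Callable[[str], float]] = {
    "default": default_price_formatter,
    "european": european_price_formatter,
}


def split_key_feature(text: str, splitter: Optional[str]) -> Union[str, Tuple[str, str]]:
    """Split on the first occurrence of splitter, or return the text as a single item."""
    if splitter and splitter in text:
        key, value = text.split(splitter, 1)
        return key.strip(), value.strip()
    return text.strip()


print(FORMATTER_MAPPING["default"]("Tk. 45,000.50"))
print(FORMATTER_MAPPING["european"]("1.299,00 €"))
print(split_key_feature("Processor Name - Apple M3", "-"))
print(split_key_feature("Backlit keyboard", None))
```

Splitting on the first occurrence keeps values that contain the splitter intact, but it would mis-split keys that contain it (for example, a key such as Wi-Fi with '-' as the splitter), so a more distinctive splitter such as ' - ' may be safer where the site allows it.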