Welcome to the Al Jazeera News Scraper Automation project!
This project is an RPA (Robotic Process Automation) bot built using Python and Robocorp's automation tools. It automates the process of extracting news articles from the Al Jazeera website based on specific search terms and date ranges. The bot scrapes articles, generates a CSV report, and archives the collection for easy access.
- Project Overview
- Key Features
- Installation
- How to Use
- Project Structure
- Technologies Used
- License
- Contact
This RPA project is designed to automate the process of searching and collecting news articles from the Al Jazeera website. The bot:
- Searches for articles based on user-provided search terms.
- Collects articles within a specific date range.
- Generates an Excel report with article details (title, date, description, etc.).
- Archives all collected data, including images, in a ZIP file.
This project aims to demonstrate the implementation of web scraping and RPA as part of a technical portfolio. 📊
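The archiving step described above can be sketched with Python's standard library alone. This is a minimal illustration, not the project's actual code: the function name `archive_directory` and the example file names are assumptions made for the sketch.

```python
import tempfile
import zipfile
from pathlib import Path


def archive_directory(source_dir: Path, zip_path: Path) -> list[str]:
    """Zip every file under source_dir; return the archived relative paths."""
    archived = []
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for file in sorted(source_dir.rglob("*")):
            # Skip directories and the archive itself if it lives in source_dir.
            if file.is_file() and file.resolve() != zip_path.resolve():
                relative = file.relative_to(source_dir).as_posix()
                zf.write(file, arcname=relative)
                archived.append(relative)
    return archived


if __name__ == "__main__":
    # Hypothetical layout: one report plus one downloaded image.
    out = Path(tempfile.mkdtemp())
    (out / "images").mkdir()
    (out / "report.csv").write_text("title,date,description\n")
    (out / "images" / "example.jpg").write_bytes(b"\xff\xd8")
    print(archive_directory(out, out / "articles.zip"))
```

Storing paths relative to the source directory keeps the ZIP portable, since it unpacks the same way regardless of where the bot ran.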
- 🌐 Web Scraping: Automatically scrapes news articles from the Al Jazeera website.
- 📝 Customizable Searches: Define search terms and date ranges for precise article collection.
- 📁 Report Generation: Outputs a detailed CSV report of collected articles.
- 🗄️ Archiving: Archives the scraped data (articles and images) into a ZIP file for easy sharing.
- 🖥️ Headless Chrome Browser: Uses Chrome in headless mode for faster and seamless scraping.
- ⌛ Explicit Waits: Element lookups go through a class method that supports explicit waits, so the bot can wait for elements to load fully before interacting with them.
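To illustrate the idea behind explicit waits, here is a minimal polling sketch using only the standard library. It is a stand-in for the concept, not the project's `CustomSelenium` method: the name `wait_until` and its parameters are assumptions. In Selenium itself, the same pattern is normally provided by `WebDriverWait` together with `expected_conditions`.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5, ignored=(LookupError,)):
    """Call condition() repeatedly until it returns a truthy value.

    Raises TimeoutError if no truthy value is returned within `timeout` seconds.
    Exceptions listed in `ignored` are treated as "not ready yet".
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition()
        except ignored:
            result = None
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)


if __name__ == "__main__":
    # Simulated page: the "element" only appears on the third lookup.
    attempts = {"count": 0}

    def find_element():
        attempts["count"] += 1
        return "search-results" if attempts["count"] >= 3 else None

    print(wait_until(find_element, timeout=2, poll=0.1))
```

Polling with a deadline avoids fixed `sleep` calls: the bot proceeds as soon as the element is available and fails with a clear error when it never appears.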
To get started, follow these steps:
- Clone the Repository:
    git clone https://github.com/giovanirech/aljazeera-news-scraper.git
    cd aljazeera-news-scraper
- Set Up the Environment: Ensure you have rcc (Robocorp CLI) installed, and run:
    rcc environment create --path .
  This command reads the conda.yaml file and sets up the environment with the required dependencies.
- Install Required Libraries: In addition to Robocorp dependencies, the bot uses Selenium for web automation:
    pip install -r requirements.txt
The bot can be executed directly using Robocorp's rcc or by running the task script.
- Define Your Input: You can customize the bot's behavior by providing inputs through Robocorp work items. Each input can define:
  - search_phrase: The term you want to search for (e.g., "Technology").
  - number_of_months: The number of months back to limit the article search.
- Run the Bot: Run the task using rcc:
    rcc run
- Output:
  - The bot will generate an Excel report and save it in the output directory.
  - It will also archive the scraped data, including images, into a ZIP file.
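As a rough illustration of how number_of_months could translate into a date range, the sketch below computes the earliest date to accept. The exact semantics are an assumption, not taken from the project: here 0 or 1 means the current month only, 2 means the current and previous month, and so on. The function names `earliest_date` and `is_in_range` are hypothetical.

```python
from datetime import date
from typing import Optional


def earliest_date(number_of_months: int, today: Optional[date] = None) -> date:
    """Return the first day of the oldest month included in the search."""
    today = today or date.today()
    # Assumption: 0 and 1 both mean "current month only".
    months_back = max(number_of_months, 1) - 1
    # Count months from year 0 so that year boundaries are handled by arithmetic.
    total = today.year * 12 + (today.month - 1) - months_back
    return date(total // 12, total % 12 + 1, 1)


def is_in_range(article_date: date, number_of_months: int,
                today: Optional[date] = None) -> bool:
    """True if article_date falls between the earliest date and today."""
    today = today or date.today()
    return earliest_date(number_of_months, today) <= article_date <= today


if __name__ == "__main__":
    print(earliest_date(3, date(2024, 1, 10)))
    print(is_in_range(date(2023, 10, 31), 3, date(2024, 1, 10)))
```

Passing `today` explicitly makes the cutoff reproducible, which helps when checking the filter against a known set of articles.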
├── libs
│ ├── CustomSelenium.py # Custom Selenium wrapper for browser control
│ ├── NewsScraper.py # Scraper class to collect articles
│ ├── Article.py # Article class to process articles
├── output # Directory where reports and archives are saved
├── tasks.py # Main task script
├── conda.yaml # Environment setup file for dependencies
├── robot.yaml # Robot configuration for Robocorp
├── README.md # Project documentation (this file)
- CustomSelenium.py: A custom wrapper around Selenium to handle browser automation.
- NewsScraper.py: Contains the logic for scraping articles, downloading images, and generating reports.
- Article.py: Class to represent a single news article collected by the scraper.
- tasks.py: The main entry point of the bot that orchestrates the scraping process.
- output/: The directory where the bot saves its CSV reports and archives.
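To show how an article class and a report writer might fit together, here is a small sketch using only the standard library. It does not reproduce the project's Article.py or NewsScraper.py: the field names (`published`, `image_filename`) and the `write_report` function are assumptions, and the `csv` module is used here in place of whatever the project uses to build its report.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from datetime import date


@dataclass
class Article:
    """One collected news article (hypothetical field set)."""
    title: str
    published: date
    description: str
    image_filename: str


def write_report(articles, stream) -> None:
    """Write one header row plus one row per article to a text stream."""
    writer = csv.DictWriter(stream, fieldnames=[f.name for f in fields(Article)])
    writer.writeheader()
    for article in articles:
        row = asdict(article)
        row["published"] = article.published.isoformat()  # e.g. 2024-03-01
        writer.writerow(row)


if __name__ == "__main__":
    buffer = io.StringIO()
    sample = Article("Example title", date(2024, 3, 1), "Example text", "example.jpg")
    write_report([sample], buffer)
    print(buffer.getvalue())
```

Deriving the column names from the dataclass fields keeps the report header in sync with the class whenever a field is added or renamed.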
- Python: The core language used to build the bot.
- Robocorp: Automation platform to run and orchestrate the bot.
- Selenium: Used for browser-based automation (Chrome).
- Pandas: For generating the Excel report.
- Logging: To capture runtime logs and errors.
- Headless Chrome: For faster, UI-less web scraping.
This project is licensed under the Apache License - see the LICENSE file for details.
Feel free to reach out for collaboration or if you have any questions! 🚀
- Your Name
- GitHub: giovanirech
- Email: gio.pi.rech@gmail.com
Thank you for checking out my page! 😊