🤖 Al Jazeera News Scraper Automation

Welcome to the Al Jazeera News Scraper Automation project!

This project is an RPA (Robotic Process Automation) bot built using Python and Robocorp's automation tools. It automates the process of extracting news articles from the Al Jazeera website based on specific search terms and date ranges. The bot scrapes articles, generates a CSV report, and archives the collection for easy access.

Project Overview

This RPA project is designed to automate the process of searching and collecting news articles from the Al Jazeera website. The bot:

- Searches for articles based on user-provided search terms.
- Collects articles within a specific date range.
- Generates an Excel report with article details (title, date, description, etc.).
- Archives all collected data, including images, in a ZIP file.

This project aims to demonstrate the implementation of web scraping and RPA as part of a technical portfolio. 📊

Key Features

🌐 Web Scraping: Automatically scrapes news articles from the Al Jazeera website.
📝 Customizable Searches: Define search terms and date ranges for precise article collection.
📁 Report Generation: Outputs a detailed CSV report of collected articles.
🗄️ Archiving: Archives the scraped data (articles and images) into a ZIP file for easy sharing.
🖥️ Headless Chrome Browser: Uses Chrome in headless mode for faster and seamless scraping.
⌛ Element Lookup with Explicit Waits: A class method fetches elements and can optionally wait until they have fully loaded before the bot interacts with them.
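
The explicit-wait idea can be sketched without any browser at all. The snippet below is a minimal illustration, not the code in CustomSelenium.py: the name `wait_for` and its parameters are hypothetical, and it simply polls a lookup function until it returns something or a timeout expires.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_for(find: Callable[[], Optional[T]], timeout: float = 10.0, poll: float = 0.5) -> T:
    """Call `find` repeatedly until it returns a truthy value or `timeout` seconds pass.

    Hypothetical helper for illustration; the real wrapper may work differently.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = find()
        except Exception:
            result = None  # treat lookup errors as "not there yet"
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)
```

With Selenium, `find` could be something like `lambda: driver.find_elements(By.CSS_SELECTOR, "article")`. Selenium also ships its own `WebDriverWait(driver, timeout).until(...)`, which covers the same ground and is probably the better choice in real code.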

Installation

To get started, follow these steps:

Clone the Repository
```
git clone https://github.com/giovanirech/aljazeera-news-scraper.git
cd aljazeera-news-scraper
```

Set Up the Environment

Ensure you have rcc (Robocorp CLI) installed, then run:
```
rcc environment create --path .
```
This command reads the conda.yaml file and sets up the environment with the required dependencies.

Install Required Libraries

In addition to the Robocorp dependencies, the bot uses Selenium for web automation:
```
pip install -r requirements.txt
```

How to Use

The bot can be executed directly using Robocorp's rcc or by running the task script.

Define Your Input

You can customize the bot's behavior by providing inputs through Robocorp work items. Each input can define:
- search_phrase: The term you want to search for (e.g., "Technology").
- number_of_months: The number of months back to limit the article search.
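
To make the two inputs concrete, here is a small sketch of how a work item payload might be validated and turned into a cutoff date. The function names are hypothetical, and the interpretation of `number_of_months` (same day of the month, N calendar months earlier) is an assumption; the bot itself may count months differently.

```python
import calendar
from datetime import date
from typing import Tuple


def parse_inputs(payload: dict) -> Tuple[str, int]:
    """Extract and validate the two documented fields from a work item payload."""
    phrase = str(payload["search_phrase"]).strip()
    months = int(payload.get("number_of_months", 0))
    if not phrase:
        raise ValueError("search_phrase must not be empty")
    if months < 0:
        raise ValueError("number_of_months must be zero or positive")
    return phrase, months


def cutoff_date(today: date, months_back: int) -> date:
    """Return the date `months_back` calendar months before `today`.

    Assumption: the day of the month is kept and clamped to the length
    of the target month (e.g. 31 March minus one month -> 29 February 2024).
    """
    total = today.year * 12 + (today.month - 1) - months_back
    year, month_index = divmod(total, 12)
    month = month_index + 1
    day = min(today.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)
```

For example, a payload of `{"search_phrase": "Technology", "number_of_months": 2}` evaluated on 15 January 2024 would, under this assumption, keep articles dated 15 November 2023 or later.
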

Run the Bot

Run the task using rcc:
```
rcc run
```

Output

- The bot will generate an Excel report and save it in the output directory.
- It will also archive the scraped data, including images, into a ZIP file.
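
The report-and-archive step could look roughly like the sketch below. It uses only the standard library and writes a CSV file; the column names, file names, and function names are all assumptions made for illustration, not the project's actual code.

```python
import csv
import zipfile
from pathlib import Path

# Hypothetical column set, based on the fields this README mentions.
FIELDS = ["title", "date", "description", "image_file"]


def write_report(rows, output_dir: Path) -> Path:
    """Write one CSV row per article and return the report path."""
    output_dir.mkdir(parents=True, exist_ok=True)
    report = output_dir / "articles.csv"
    with report.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return report


def archive_output(output_dir: Path, archive_path: Path) -> Path:
    """Zip every file under output_dir, skipping the archive itself."""
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(output_dir.rglob("*")):
            if path.is_file() and path.resolve() != archive_path.resolve():
                zf.write(path, arcname=path.relative_to(output_dir).as_posix())
    return archive_path
```

Calling `write_report(rows, Path("output"))` followed by `archive_output(Path("output"), Path("output/articles.zip"))` would leave both the report and a ZIP containing the report and any downloaded images in the output directory.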

Project Structure

```
├── libs
│   ├── CustomSelenium.py      # Custom Selenium wrapper for browser control
│   ├── NewsScraper.py         # Scraper class to collect articles
│   └── Article.py             # Article class to process articles
├── output                     # Directory where reports and archives are saved
├── tasks.py                   # Main task script
├── conda.yaml                 # Environment setup file for dependencies
├── robot.yaml                 # Robot configuration for Robocorp
└── README.md                  # Project documentation (this file)
```

CustomSelenium.py: A custom wrapper around Selenium to handle browser automation.
NewsScraper.py: Contains the logic for scraping articles, downloading images, and generating reports.
Article.py: Class to represent a single news article collected by the scraper.
tasks.py: The main entry point of the bot that orchestrates the scraping process.
output/: The directory where the bot saves its CSV reports and archives.
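
As a rough picture of what a class like the one in Article.py might hold, the sketch below models a single article as a dataclass. The field names and both methods are hypothetical, chosen to match the report columns mentioned earlier (title, date, description) plus an image file name.

```python
from dataclasses import asdict, dataclass
from datetime import date


@dataclass
class Article:
    """Illustrative stand-in for one scraped article; not the project's class."""

    title: str
    published: date
    description: str = ""
    image_file: str = ""

    def is_on_or_after(self, cutoff: date) -> bool:
        """True when the article falls inside the requested date range."""
        return self.published >= cutoff

    def to_row(self) -> dict:
        """Flatten the article into a plain dict suitable for a report row."""
        row = asdict(self)
        row["published"] = self.published.isoformat()
        return row
```

Keeping the article as a small value object like this lets the scraper filter by date and build report rows without passing raw browser elements around.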

Technologies Used

Python: The core language used to build the bot.
Robocorp: Automation platform to run and orchestrate the bot.
Selenium: Used for browser-based automation (Chrome).
Pandas: For generating the Excel report.
Logging: To capture runtime logs and errors.
Headless Chrome: For faster, UI-less web scraping.

License

This project is licensed under the Apache License - see the LICENSE file for details.

Contact

Feel free to reach out for collaboration or if you have any questions! 🚀

Your Name
GitHub: giovanirech
Email: gio.pi.rech@gmail.com

Thank you for checking out my page! 😊
