This project aims to automate the process of scraping house listing data from buyrentkenya.com (or any other website of your choice) using Apache Airflow. The project focuses on workflow automation and scheduling, utilizing the following tools:
- Apache Airflow
- Python
- Pandas
- Numpy
- Install Apache Airflow following the official installation instructions.
- Clone this repository:
git clone https://github.com/Mutai-Gilbert/Automating-Data-Scrapers-Apache-Airflow
pip install pandas numpy
- Configure Apache Airflow according to your environment.
- Copy the DAG (Directed Acyclic Graph) file from the dags directory to your Apache Airflow DAGs folder.
- Start the Apache Airflow scheduler and web server.
- Access the Apache Airflow web interface and enable the DAG.
- Trigger the DAG to start the data scraping and analytical processes.
- DAG Structure
- The DAG is structured to perform the following tasks:
- The DAG is structured to perform the following tasks:
- Scrape house listing data from buyrentkenya.com (or another website).
- Preprocess the scraped data using Pandas and Numpy.
- Perform analytical processes on the preprocessed data.
- Store the results in a desired format (e.g., CSV, database).
- Customization
- You can customize the scraping process by modifying the URL or the website to scrape.
- Modify the preprocessing and analytical processes according to your requirements.
- Adjust the schedule and frequency of the DAG to meet your needs.
- Contributions
- Contributions to this project are welcome. Feel free to fork the repository and submit pull requests with your improvements.
License This project is licensed under the MIT License. See the LICENSE file for more information.
Feel free to customize this README file further based on your specific project details and requirements.