This project implements an automated data engineering workflow that collects, processes, and stores randomly generated user data. It uses Apache Airflow for orchestration, Apache Kafka for real-time data streaming, Apache Spark for stream processing, and Cassandra for persistent storage.
- Docker
- Docker Compose
1. Clone the repository
git clone https://github.com/pablogzalez/Realtime-Data-Streaming.git
cd Realtime-Data-Streaming
2. Start the services
Use Docker Compose to build and start the necessary services (Airflow, Kafka, Spark, Cassandra).
docker-compose up -d
3. Execution
- Apache Airflow: Access the Airflow UI at http://localhost:8080 and trigger the user_automation DAG (a sketch of what this DAG might look like follows this list).
- Verify execution: Check the logs in Airflow to ensure data is being processed and stored correctly.
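The user_automation DAG itself is not shown in this README, but a minimal sketch of what it might look like is below: a single PythonOperator that fetches one randomly generated user and publishes it to Kafka. The API endpoint (randomuser.me), broker address, topic name, and task id are illustrative assumptions, not values taken from the project.

```python
# Minimal sketch of a user_automation DAG; endpoint, broker, topic, and
# task id below are assumptions for illustration, not the project's code.
import json
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator
from kafka import KafkaProducer


def stream_random_user():
    # Fetch one randomly generated user (assumed data source).
    user = requests.get("https://randomuser.me/api/").json()["results"][0]

    # Publish the raw user record to Kafka as JSON.
    producer = KafkaProducer(
        bootstrap_servers="broker:29092",  # assumed broker address
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send("users_created", user)  # assumed topic name
    producer.flush()


with DAG(
    dag_id="user_automation",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="stream_data_from_api",
        python_callable=stream_random_user,
    )
```

With the scheduler and a Kafka broker running, triggering the DAG from the UI (or with `airflow dags trigger user_automation`) would publish one user record per run to the assumed topic.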
📋 Architecture

This project follows a data flow architecture involving the following components:
- Apache Airflow: Orchestrates the workflow of data collection, processing, and storage.
- Apache Kafka: Acts as a messaging system to handle real-time data.
- Apache Spark: Processes the real-time data read from Kafka (see the sketch after this list).
- Cassandra: Stores the processed data for future queries and analysis.
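To make the Kafka → Spark → Cassandra leg of this flow concrete, here is a minimal PySpark Structured Streaming sketch that subscribes to the topic, parses the JSON payload, and writes each micro-batch to Cassandra. The schema fields, topic, keyspace, table, and host names are assumptions for illustration, and the job expects the Kafka and Cassandra connector packages to be available on the Spark classpath.

```python
# Sketch of the Kafka -> Spark -> Cassandra path; topic, keyspace, table,
# schema fields, and host names are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = (
    SparkSession.builder
    .appName("user-stream-processor")
    .config("spark.cassandra.connection.host", "cassandra")  # assumed host
    .getOrCreate()
)

# Assumed shape of the user events published by the Airflow task.
schema = StructType([
    StructField("id", StringType()),
    StructField("first_name", StringType()),
    StructField("last_name", StringType()),
    StructField("email", StringType()),
])

# Read the raw Kafka stream and parse the JSON value column.
users = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:29092")  # assumed broker
    .option("subscribe", "users_created")                # assumed topic
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("user"))
    .select("user.*")
)

# Persist each micro-batch to Cassandra for later querying.
query = (
    users.writeStream
    .format("org.apache.spark.sql.cassandra")
    .option("keyspace", "spark_streams")   # assumed keyspace
    .option("table", "created_users")      # assumed table
    .option("checkpointLocation", "/tmp/checkpoint")
    .start()
)
query.awaitTermination()
```

In this split of responsibilities, Airflow decides when data enters Kafka, Spark only transforms and persists the stream, and Cassandra remains the single query surface for downstream analysis.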
🛠 Technologies Used
- Apache Airflow
- Apache Kafka
- Apache Spark
- Cassandra
- Docker