This project implements an automated data engineering workflow that collects, processes, and stores randomly generated user data. It uses Apache Airflow for orchestration, Apache Kafka for real-time data streaming, Apache Spark for stream processing, and Cassandra for persistent storage.
- Docker
- Docker Compose
1. Clone the repository
git clone https://github.com/pablogzalez/Realtime-Data-Streaming.git
cd Realtime-Data-Streaming
2. Start the services
Use Docker Compose to build and start the necessary services (Airflow, Kafka, Spark, Cassandra).
docker-compose up -d
3. Execution
- Apache Airflow: Access the Airflow UI at http://localhost:8080 and trigger the user_automation DAG (a sketch of what this DAG might look like follows this list).
- Verify execution: Check the logs in Airflow to ensure data is being processed and stored correctly.
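The user_automation DAG itself is not shown in this README, but a minimal sketch of what it might look like is below: a single PythonOperator that fetches one randomly generated user and publishes it to Kafka. The API endpoint (randomuser.me), broker address, topic name, and task id are illustrative assumptions, not values taken from the project.

```python
# Minimal sketch of a user_automation DAG; endpoint, broker, topic, and
# task id below are assumptions for illustration, not the project's code.
import json
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator
from kafka import KafkaProducer


def stream_random_user():
    # Fetch one randomly generated user (assumed data source).
    user = requests.get("https://randomuser.me/api/").json()["results"][0]

    # Publish the raw user record to Kafka as JSON.
    producer = KafkaProducer(
        bootstrap_servers="broker:29092",  # assumed broker address
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send("users_created", user)  # assumed topic name
    producer.flush()


with DAG(
    dag_id="user_automation",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="stream_data_from_api",
        python_callable=stream_random_user,
    )
```

With the scheduler and a Kafka broker running, triggering the DAG from the UI (or with `airflow dags trigger user_automation`) would publish one user record per run to the assumed topic.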
📋 Architecture

This project follows a data flow architecture involving the following components:
- Apache Airflow: Orchestrates the workflow of data collection, processing, and storage.
- Apache Kafka: Acts as a messaging system to handle real-time data.
- Apache Spark: Processes the real-time data read from Kafka (see the sketch after this list).
- Cassandra: Stores the processed data for future queries and analysis.
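To make the Kafka → Spark → Cassandra leg of this flow concrete, here is a minimal PySpark Structured Streaming sketch that subscribes to the topic, parses the JSON payload, and writes each micro-batch to Cassandra. The schema fields, topic, keyspace, table, and host names are assumptions for illustration, and the job expects the Kafka and Cassandra connector packages to be available on the Spark classpath.

```python
# Sketch of the Kafka -> Spark -> Cassandra path; topic, keyspace, table,
# schema fields, and host names are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = (
    SparkSession.builder
    .appName("user-stream-processor")
    .config("spark.cassandra.connection.host", "cassandra")  # assumed host
    .getOrCreate()
)

# Assumed shape of the user events published by the Airflow task.
schema = StructType([
    StructField("id", StringType()),
    StructField("first_name", StringType()),
    StructField("last_name", StringType()),
    StructField("email", StringType()),
])

# Read the raw Kafka stream and parse the JSON value column.
users = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:29092")  # assumed broker
    .option("subscribe", "users_created")                # assumed topic
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("user"))
    .select("user.*")
)

# Persist each micro-batch to Cassandra for later querying.
query = (
    users.writeStream
    .format("org.apache.spark.sql.cassandra")
    .option("keyspace", "spark_streams")   # assumed keyspace
    .option("table", "created_users")      # assumed table
    .option("checkpointLocation", "/tmp/checkpoint")
    .start()
)
query.awaitTermination()
```

In this split of responsibilities, Airflow decides when data enters Kafka, Spark only transforms and persists the stream, and Cassandra remains the single query surface for downstream analysis.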
🛠 Technologies Used
- Apache Airflow
- Apache Kafka
- Apache Spark
- Cassandra
- Docker