This project implements a data pipeline to analyze employee performance. The pipeline follows an end-to-end architecture, starting from data ingestion and progressing to data visualization. It uses containerized components orchestrated by Apache Airflow to ensure scalability, modularity, and automation.
The pipeline is composed of the following key stages:
- Data Source: Raw data is provided in CSV format.
- Data Ingestion: Python scripts are used to extract, clean, and load data into the storage layer.
- Data Storage & Transformation:
  - PostgreSQL: Acts as the central database to store raw and processed data.
  - dbt (Data Build Tool): Used for data transformations and modeling.
- Data Visualization: The transformed data is visualized using Metabase.
- Orchestration: Apache Airflow schedules and manages the execution of the entire pipeline.
| Component | Technology |
| --- | --- |
| Containerization | Docker |
| Data Ingestion | Python |
| Database | PostgreSQL |
| Data Transformation | dbt |
| Visualization | Metabase |
| Orchestration | Apache Airflow |
Ensure the following are installed on your machine:
- Docker and Docker Compose
- Python 3.8 or above
- **Clone the Repository**

  ```bash
  git clone <repository-url>
  cd employee-performance-analysis-pipeline
  ```
- **Configure Environment Variables**

  Create a `.env` file in the root directory with the following variables (a sketch of how these values are consumed appears after this list):

  ```
  POSTGRES_USER=<your_postgres_user>
  POSTGRES_PASSWORD=<your_postgres_password>
  POSTGRES_DB=employee_performance
  METABASE_USER=<your_metabase_user>
  METABASE_PASSWORD=<your_metabase_password>
  ```
- **Build and Start Containers**

  ```bash
  docker-compose up --build
  ```
- **Run Airflow DAGs**

  - Access the Airflow web UI at `http://localhost:8080`.
  - Enable and trigger the DAG for the pipeline.
- **Access Visualization**

  - Open Metabase at `http://localhost:3000`.
  - Log in using your credentials.
  - Explore dashboards and reports.
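The environment-variables step above assumes that the pipeline's Python scripts read these credentials at runtime. As a rough, illustrative sketch only: the host name `postgres` and port `5432` are assumed Docker Compose values, not taken from the repository, and a database connection could be built like this:

```python
import os

from sqlalchemy import create_engine, text

# The variables below are assumed to come from the .env file created above.
user = os.environ["POSTGRES_USER"]
password = os.environ["POSTGRES_PASSWORD"]
database = os.environ["POSTGRES_DB"]

# "postgres" and 5432 are assumed Docker Compose service name and port.
engine = create_engine(f"postgresql://{user}:{password}@postgres:5432/{database}")

# Quick connectivity check before running the pipeline.
with engine.connect() as conn:
    conn.execute(text("SELECT 1"))
```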
- Employee performance data is provided as a CSV file.
- Python scripts:
  - Extract and validate CSV data.
  - Load validated data into PostgreSQL.
  - Key libraries used: `pandas`, `sqlalchemy` (a minimal sketch follows this list).
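Below is a minimal, illustrative version of such an ingestion script. The file path, table name, and required column names are placeholders chosen for the example, not values from the repository:

```python
import os

import pandas as pd
from sqlalchemy import create_engine

# Assumed file, table, and column names -- placeholders for illustration only.
CSV_PATH = "data/employee_performance.csv"
TARGET_TABLE = "raw_employee_performance"
REQUIRED_COLUMNS = {"employee_id", "department", "performance_score"}


def extract_and_validate(csv_path: str) -> pd.DataFrame:
    """Read the CSV and apply basic validation before loading."""
    df = pd.read_csv(csv_path)

    # Fail fast if expected columns are missing.
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"CSV is missing required columns: {missing}")

    # Drop rows without an employee identifier and remove duplicates.
    return df.dropna(subset=["employee_id"]).drop_duplicates()


def load_to_postgres(df: pd.DataFrame) -> None:
    """Write the validated data into PostgreSQL via SQLAlchemy."""
    # The engine is built from the .env variables, as in the connection sketch above.
    user = os.environ["POSTGRES_USER"]
    password = os.environ["POSTGRES_PASSWORD"]
    database = os.environ["POSTGRES_DB"]
    engine = create_engine(f"postgresql://{user}:{password}@postgres:5432/{database}")
    df.to_sql(TARGET_TABLE, engine, if_exists="replace", index=False)


if __name__ == "__main__":
    load_to_postgres(extract_and_validate(CSV_PATH))
```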
- PostgreSQL:
  - Stores both raw and processed data.
- dbt:
  - Implements SQL models to clean and aggregate data.
  - Executes transformations as part of the Airflow pipeline.
- Metabase:
  - Connects to PostgreSQL to visualize transformed data.
  - Dashboards provide insights into employee performance metrics.
- Apache Airflow:
  - Automates the execution of data ingestion, transformation, and visualization tasks.
  - DAG structure (see the sketch below):
    - Ingest CSV data.
    - Run dbt transformations.
    - Trigger Metabase refresh.
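A minimal Airflow DAG matching this structure might look like the sketch below. The DAG id, schedule, dbt project path, and task callables are assumptions for illustration, not the repository's actual DAG:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def ingest_csv():
    """Placeholder for the ingestion step (extract, validate, load to PostgreSQL)."""


def refresh_metabase():
    """Placeholder for refreshing Metabase dashboards after new data lands."""


with DAG(
    dag_id="employee_performance_pipeline",  # assumed DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",              # assumed schedule
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest_csv", python_callable=ingest_csv)

    # The dbt project location inside the container is an assumption.
    transform = BashOperator(
        task_id="run_dbt",
        bash_command="dbt run --project-dir /opt/dbt --profiles-dir /opt/dbt",
    )

    refresh = PythonOperator(task_id="refresh_metabase", python_callable=refresh_metabase)

    # Mirror the DAG structure listed above: ingest -> transform -> refresh.
    ingest >> transform >> refresh
```

The task order mirrors the listed DAG structure: ingestion first, then the dbt transformations, then the Metabase refresh.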