This project implements a data pipeline to analyze employee performance. The pipeline follows an end-to-end architecture, starting from data ingestion and progressing to data visualization. It uses containerized components orchestrated by Apache Airflow to ensure scalability, modularity, and automation.
The pipeline is composed of the following key stages:
- Data Source: Raw data is provided in CSV format.
- Data Ingestion: Python scripts are used to extract, clean, and load data into the storage layer.
- Data Storage & Transformation:
  - PostgreSQL: Acts as the central database to store raw and processed data.
  - dbt (Data Build Tool): Used for data transformations and modeling.
- Data Visualization: The transformed data is visualized using Metabase.
- Orchestration: Apache Airflow schedules and manages the execution of the entire pipeline.
| Component | Technology |
| --- | --- |
| Containerization | Docker |
| Data Ingestion | Python |
| Database | PostgreSQL |
| Data Transformation | dbt |
| Visualization | Metabase |
| Orchestration | Apache Airflow |
Ensure the following are installed on your machine:
- Docker and Docker Compose
- Python 3.8 or above
- **Clone the Repository**

  ```bash
  git clone <repository-url>
  cd employee-performance-analysis-pipeline
  ```
- **Configure Environment Variables**

  Create a `.env` file in the root directory with the following variables (a sketch of how these values are consumed appears after this list):

  ```
  POSTGRES_USER=<your_postgres_user>
  POSTGRES_PASSWORD=<your_postgres_password>
  POSTGRES_DB=employee_performance
  METABASE_USER=<your_metabase_user>
  METABASE_PASSWORD=<your_metabase_password>
  ```
- **Build and Start Containers**

  ```bash
  docker-compose up --build
  ```
- **Run Airflow DAGs**

  - Access the Airflow web UI at `http://localhost:8080`.
  - Enable and trigger the DAG for the pipeline.
- **Access Visualization**

  - Open Metabase at `http://localhost:3000`.
  - Log in using your credentials.
  - Explore dashboards and reports.
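The environment-variables step above assumes that the pipeline's Python scripts read these credentials at runtime. As a rough, illustrative sketch only: the host name `postgres` and port `5432` are assumed Docker Compose values, not taken from the repository, and a database connection could be built like this:

```python
import os

from sqlalchemy import create_engine, text

# The variables below are assumed to come from the .env file created above.
user = os.environ["POSTGRES_USER"]
password = os.environ["POSTGRES_PASSWORD"]
database = os.environ["POSTGRES_DB"]

# "postgres" and 5432 are assumed Docker Compose service name and port.
engine = create_engine(f"postgresql://{user}:{password}@postgres:5432/{database}")

# Quick connectivity check before running the pipeline.
with engine.connect() as conn:
    conn.execute(text("SELECT 1"))
```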
- Employee performance data is provided as a CSV file.
- Python scripts:
  - Extract and validate CSV data.
  - Load validated data into PostgreSQL.
  - Key libraries used: `pandas`, `sqlalchemy` (a minimal sketch follows this list).
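Below is a minimal, illustrative version of such an ingestion script. The file path, table name, and required column names are placeholders chosen for the example, not values from the repository:

```python
import os

import pandas as pd
from sqlalchemy import create_engine

# Assumed file, table, and column names -- placeholders for illustration only.
CSV_PATH = "data/employee_performance.csv"
TARGET_TABLE = "raw_employee_performance"
REQUIRED_COLUMNS = {"employee_id", "department", "performance_score"}


def extract_and_validate(csv_path: str) -> pd.DataFrame:
    """Read the CSV and apply basic validation before loading."""
    df = pd.read_csv(csv_path)

    # Fail fast if expected columns are missing.
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"CSV is missing required columns: {missing}")

    # Drop rows without an employee identifier and remove duplicates.
    return df.dropna(subset=["employee_id"]).drop_duplicates()


def load_to_postgres(df: pd.DataFrame) -> None:
    """Write the validated data into PostgreSQL via SQLAlchemy."""
    # The engine is built from the .env variables, as in the connection sketch above.
    user = os.environ["POSTGRES_USER"]
    password = os.environ["POSTGRES_PASSWORD"]
    database = os.environ["POSTGRES_DB"]
    engine = create_engine(f"postgresql://{user}:{password}@postgres:5432/{database}")
    df.to_sql(TARGET_TABLE, engine, if_exists="replace", index=False)


if __name__ == "__main__":
    load_to_postgres(extract_and_validate(CSV_PATH))
```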
- PostgreSQL:
  - Stores both raw and processed data.
- dbt:
  - Implements SQL models to clean and aggregate data.
  - Executes transformations as part of the Airflow pipeline.
- Metabase:
  - Connects to PostgreSQL to visualize transformed data.
  - Dashboards provide insights into employee performance metrics.
- Apache Airflow:
  - Automates the execution of data ingestion, transformation, and visualization tasks.
  - DAG structure (see the sketch below):
    - Ingest CSV data.
    - Run dbt transformations.
    - Trigger Metabase refresh.
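A minimal Airflow DAG matching this structure might look like the sketch below. The DAG id, schedule, dbt project path, and task callables are assumptions for illustration, not the repository's actual DAG:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def ingest_csv():
    """Placeholder for the ingestion step (extract, validate, load to PostgreSQL)."""


def refresh_metabase():
    """Placeholder for refreshing Metabase dashboards after new data lands."""


with DAG(
    dag_id="employee_performance_pipeline",  # assumed DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",              # assumed schedule
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest_csv", python_callable=ingest_csv)

    # The dbt project location inside the container is an assumption.
    transform = BashOperator(
        task_id="run_dbt",
        bash_command="dbt run --project-dir /opt/dbt --profiles-dir /opt/dbt",
    )

    refresh = PythonOperator(task_id="refresh_metabase", python_callable=refresh_metabase)

    # Mirror the DAG structure listed above: ingest -> transform -> refresh.
    ingest >> transform >> refresh
```

The task order mirrors the listed DAG structure: ingestion first, then the dbt transformations, then the Metabase refresh.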