Data about songs and the user events generated by listening to them is stored in two separate S3 locations.

- Log events - `s3://sparkify/log_data`
- Song data - `s3://sparkify/song_data`
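To get a feel for the raw files before building anything, you can list a few objects under each prefix. The following is a minimal sketch using boto3, assuming AWS credentials are already configured in your environment; it is not part of the project code.

```python
import boto3

# Both datasets live under the same bucket, in the prefixes listed above.
s3 = boto3.client("s3")

for prefix in ("log_data", "song_data"):
    response = s3.list_objects_v2(Bucket="sparkify", Prefix=prefix, MaxKeys=5)
    print(f"Sample keys under s3://sparkify/{prefix}:")
    for obj in response.get("Contents", []):
        print("  " + obj["Key"])
```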
The user event log files contain information about user listening activity (users, sessions, and the songs played).
The song data files contain the following fields:
- `num_songs`
- `artist_id`
- `artist_latitude`
- `artist_longitude`
- `artist_location`
- `artist_name`
- `song_id`
- `title`
- `duration`
- `year`
To load this data into a data warehouse (Amazon Redshift), a data model was designed using a star schema consisting of fact and dimension tables.
Fact table

- `songplays` - `songplay_id`, `start_time`, `user_id`, `level`, `song_id`, `artist_id`, `session_id`, `location`, `user_agent`

Dimension tables

- `users` - `user_id`, `first_name`, `last_name`, `gender`, `level`
- `songs` - `song_id`, `title`, `artist_id`, `year`, `duration`
- `artists` - `artist_id`, `name`, `location`, `latitude`, `longitude`
- `time` - `start_time`, `hour`, `day`, `week`, `month`, `year`, `weekday`
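As a concrete example, the fact table above could be created in Redshift with DDL along these lines. This is a sketch written as a Python SQL string (the style an Airflow task or a `sql_queries` module might use); the column types and the `IDENTITY` surrogate key are assumptions, not code taken from the repository.

```python
# Sketch of a possible CREATE TABLE for the songplays fact table.
# Column names come from the star schema above; the data types are assumptions.
songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id  INT IDENTITY(0, 1),
    start_time   TIMESTAMP NOT NULL,
    user_id      INT NOT NULL,
    level        VARCHAR,
    song_id      VARCHAR,
    artist_id    VARCHAR,
    session_id   INT,
    location     VARCHAR,
    user_agent   VARCHAR
);
"""
```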
- S3 bucket - Unstructured data in the form of log files is stored in S3.
- Redshift - Data warehouse used as the destination of the structured data after the ETL process, and for staging purposes.
- Airflow - Automation of the ETL pipeline on a daily schedule.
- Initially, data from all log files (songs and user events) is copied to the staging tables `staging_events` and `staging_songs` in Redshift using the `COPY` command, which loads data in parallel.
- Using SQL, data is extracted from these staging tables, transformed to match the star schema design, and loaded into the appropriate fact and dimension tables (a SQL sketch follows this list).
- After the ETL process, data integrity is verified by querying the fact and dimension tables for the number of records.
- The entire workflow is automated with Apache Airflow and executed in a particular order to prevent data discrepancies.
- The order of execution follows the steps above: load the staging tables, load the fact and dimension tables, then run the data-quality checks (a DAG sketch also follows this list).
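The staging and loading steps boil down to plain SQL. Below is a minimal sketch, again as Python SQL strings, using the song data as an example; the `JSON 'auto'` option, the access-key placeholders, and the assumption that `staging_songs` mirrors the raw song data fields are mine, not code taken from the repository.

```python
# COPY loads the raw files from S3 into the staging table in parallel.
# The IAM user keys correspond to the "aws_credentials" connection described
# in the configuration section below.
stage_songs_copy = """
COPY staging_songs
FROM 's3://sparkify/song_data'
ACCESS_KEY_ID '<iam-user-access-key>'
SECRET_ACCESS_KEY '<iam-user-secret-access-key>'
JSON 'auto';
"""

# INSERT ... SELECT transforms staging rows into the songs dimension table,
# assuming staging_songs carries the same fields as the raw song data.
song_table_insert = """
INSERT INTO songs (song_id, title, artist_id, year, duration)
SELECT DISTINCT song_id, title, artist_id, year, duration
FROM staging_songs
WHERE song_id IS NOT NULL;
"""
```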
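And here is a minimal Airflow DAG sketch of that execution order, ending with a record-count data-quality check. Airflow 1.10-style imports are assumed (consistent with the `airflow initdb` command used in the setup below); the task names and placeholder operators are illustrative, while the DAG id `Songs_Events_ETL` and the connection id `redshift_connection` come from this README.

```python
from datetime import datetime

from airflow import DAG
from airflow.hooks.postgres_hook import PostgresHook
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator


def check_record_counts():
    """Fail the run if any fact/dimension table ended up empty."""
    redshift = PostgresHook(postgres_conn_id="redshift_connection")
    # "time" is quoted because it can clash with a reserved word in Redshift.
    for table in ("songplays", "users", "songs", "artists", '"time"'):
        records = redshift.get_records(f"SELECT COUNT(*) FROM {table}")
        if not records or records[0][0] < 1:
            raise ValueError(f"Data quality check failed: {table} is empty")


dag = DAG(
    "Songs_Events_ETL",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",  # the pipeline runs on a daily schedule
)

# Placeholder tasks; in the real project these would run the COPY and
# INSERT ... SELECT statements sketched above.
stage_to_redshift = DummyOperator(task_id="stage_events_and_songs", dag=dag)
load_fact_table = DummyOperator(task_id="load_songplays_fact", dag=dag)
load_dimensions = DummyOperator(task_id="load_dimension_tables", dag=dag)
quality_checks = PythonOperator(
    task_id="run_data_quality_checks",
    python_callable=check_record_counts,
    dag=dag,
)

# Execution order: staging -> fact -> dimensions -> data-quality checks.
stage_to_redshift >> load_fact_table >> load_dimensions >> quality_checks
```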
To recreate this project and run the workflow, follow the instructions below from the command line.

- Clone the repository: `git clone https://github.com/command7/SQL_SQL_ETL_Airflow.git`
- Navigate to the project folder: `cd SQL_SQL_ETL_Airflow`
- Initialize the virtual environment: `pipenv shell`
- Install the required packages from the `Pipfile`: `pipenv install`
- Create a `.env` file as follows to direct Airflow to load DAGs from the project directory: `echo "AIRFLOW_HOME=${PWD}/airflow" >> .env`
- Initialize the Airflow database: `airflow initdb`
- Start the Airflow scheduler: `airflow scheduler`
- Start the Airflow webserver: `airflow webserver -p 8080`
- Turn on the `Songs_Events_ETL` DAG in the webserver GUI.
Certain credentials need to be stored in Apache Airflow's connections menu.
Open the Airflow GUI -> Admin -> Connections and add the following connections.

Redshift connection

- Conn Id -> `redshift_connection`
- Conn Type -> Postgres
- Host -> Redshift endpoint address
- Schema -> Name of the database in Redshift
- Login -> Username used to log in to the database
- Password -> Password used to log in to the database
- Port -> 5439 (Redshift listens on this port by default)

AWS credentials

- Conn Id -> `aws_credentials`
- Conn Type -> Amazon Web Services
- Login -> IAM user access key ID
- Password -> IAM user secret access key
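For reference, this is roughly how Airflow 1.10-era tasks pick up those two connections at runtime; a sketch under that assumption, not code lifted from the repository.

```python
from airflow.contrib.hooks.aws_hook import AwsHook
from airflow.hooks.postgres_hook import PostgresHook

# The "aws_credentials" connection holds the IAM user's keys; they feed the
# ACCESS_KEY_ID / SECRET_ACCESS_KEY clause of the COPY command.
aws_hook = AwsHook("aws_credentials")
credentials = aws_hook.get_credentials()

# The "redshift_connection" connection provides a ready-to-use database handle.
redshift = PostgresHook(postgres_conn_id="redshift_connection")
row_count = redshift.get_first("SELECT COUNT(*) FROM songplays")
```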