
Automated Extract Transform Load Pipeline using Amazon Redshift & Apache Airflow

Source Data

Data about songs and the user events generated by listening to them is stored in two separate S3 locations:

  • Log events - s3://sparkify/log_data
  • Song data - s3://sparkify/song_data

The user event log files contain the fields shown in the Event logs image.

The song data files contain the following information:

  • num_songs
  • artist_id
  • artist_latitude
  • artist_longitude
  • artist_location
  • artist_name
  • song_id
  • title
  • duration
  • year
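
For illustration, a single song data record could look like the following. This is a hypothetical record with placeholder values, shown only to make the field list above concrete; it is not taken from the actual dataset.

# Hypothetical song data record; all values are placeholders, not real data.
song_record = {
    "num_songs": 1,
    "artist_id": "AR_EXAMPLE_ID",
    "artist_latitude": 40.71,
    "artist_longitude": -74.01,
    "artist_location": "New York, NY",
    "artist_name": "Example Artist",
    "song_id": "SO_EXAMPLE_ID",
    "title": "Example Song Title",
    "duration": 215.3,
    "year": 2004,
}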

Data Warehouse Design

To load this data into a data warehouse (Amazon Redshift), a data model was designed using a star schema with fact and dimension tables.

Fact Table

songplays

  • songplay_id
  • start_time
  • user_id
  • level
  • song_id
  • artist_id
  • session_id
  • location
  • user_agent

Dimension Tables

users

  • user_id
  • first_name
  • last_name
  • gender
  • level

songs

  • song_id
  • title
  • artist_id
  • year
  • duration

artists

  • artist_id
  • name
  • location
  • latitude
  • longitude

time

  • start_time
  • hour
  • day
  • week
  • month
  • year
  • weekday
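
As a concrete illustration of this design, the fact table and the users dimension could be created with DDL along the following lines. This is a minimal sketch, not the repository's actual statements; the column types, keys, and Python string wrapping are assumptions.

# Illustrative Redshift DDL for part of the star schema; types and keys are
# assumptions. In an Airflow project these strings would typically live in a
# helper module and be executed against Redshift by an operator.
CREATE_SONGPLAYS_TABLE = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id INT IDENTITY(0, 1) PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL,
    user_id     INT NOT NULL,
    level       VARCHAR,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""

CREATE_USERS_TABLE = """
CREATE TABLE IF NOT EXISTS users (
    user_id    INT PRIMARY KEY,
    first_name VARCHAR,
    last_name  VARCHAR,
    gender     VARCHAR,
    level      VARCHAR
);
"""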

Pipeline Components

  • S3 - Unstructured log data (songs and user events) is stored in separate S3 locations
  • Redshift - Data warehouse used for staging and as the destination for the structured data produced by the ETL process
  • Airflow - Automates the ETL pipeline on a daily schedule

ETL Process

  • First, data from the log files (songs and user events) is copied into the staging tables staging_events and staging_songs in Redshift using the COPY command, which loads data in parallel.
  • Using SQL, data is extracted from these staging tables, transformed to match the star schema design, and loaded into the appropriate fact and dimension tables.
  • After the ETL process, data integrity is verified by querying the fact and dimension tables for their record counts.
  • The entire workflow is automated with Apache Airflow and executed in a fixed order to prevent data discrepancies.
  • The order of workflow execution is illustrated in the sketch after this list.
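
The sketch below illustrates these steps with a COPY statement for the song data, an INSERT ... SELECT for one dimension table, and a DAG skeleton encoding the execution order. It is a minimal illustration assuming Airflow 1.x; the SQL details, task IDs, and DummyOperator placeholders are assumptions rather than the repository's actual code.

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

# Illustrative COPY: loads the raw JSON song files from S3 into the staging
# table in parallel. The credential placeholders would be filled in from the
# aws_credentials connection at runtime.
STAGE_SONGS_COPY = """
COPY staging_songs
FROM 's3://sparkify/song_data'
ACCESS_KEY_ID '{access_key}'
SECRET_ACCESS_KEY '{secret_key}'
FORMAT AS JSON 'auto';
"""

# Illustrative transform: an INSERT ... SELECT from staging into a dimension.
LOAD_SONGS_DIM = """
INSERT INTO songs (song_id, title, artist_id, year, duration)
SELECT DISTINCT song_id, title, artist_id, year, duration
FROM staging_songs;
"""

# DAG skeleton showing the execution order; DummyOperator stands in for the
# project's custom staging, load, and data quality operators.
dag = DAG(
    "Songs_Events_ETL",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)

stage_events = DummyOperator(task_id="stage_events_to_redshift", dag=dag)
stage_songs = DummyOperator(task_id="stage_songs_to_redshift", dag=dag)
load_songplays = DummyOperator(task_id="load_songplays_fact_table", dag=dag)
load_users = DummyOperator(task_id="load_users_dim_table", dag=dag)
load_songs = DummyOperator(task_id="load_songs_dim_table", dag=dag)
load_artists = DummyOperator(task_id="load_artists_dim_table", dag=dag)
load_time = DummyOperator(task_id="load_time_dim_table", dag=dag)
quality_checks = DummyOperator(task_id="run_data_quality_checks", dag=dag)

# Staging runs first, then the fact table, then the dimensions, then the checks.
[stage_events, stage_songs] >> load_songplays
load_songplays >> [load_users, load_songs, load_artists, load_time]
[load_users, load_songs, load_artists, load_time] >> quality_checks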

How to run

To recreate this project and run the workflow, follow the instructions below from the command line.

  • Clone the repository using the following command

git clone https://github.com/command7/SQL_SQL_ETL_Airflow.git

  • Navigate to the project folder using

cd SQL_SQL_ETL_Airflow

  • Initialize a virtual environment using

pipenv shell

  • Install the required packages from the Pipfile

pipenv install

  • Create a .env file as follows so that Airflow loads DAGs from the project directory

echo "AIRFLOW_HOME=${PWD}/airflow" >> .env

  • Initialize the Airflow database using

airflow initdb

  • Start the Airflow scheduler

airflow scheduler

  • Start the Airflow webserver

airflow webserver -p 8080

  • Turn on the Songs_Events_ETL DAG in the Airflow web UI.

Configurations required

Certain credentials need to be stored in Apache Airflow's connections menu.

Open the Airflow GUI -> Admin -> Connections and add the following connections.

Redshift Connection

Conn Id -> redshift_connection

Conn Type -> Postgres

Host -> Redshift endpoint address

Schema -> Name of the database in Redshift

Login -> Username used to log in to the database

Password -> Password used to log in to the database

Port -> 5439 (Redshift listens on this port by default)

AWS Credentials

Conn Id -> aws_credentials

Conn Type -> Amazon Web Services

Login -> IAM user access key ID

Password -> IAM user secret access key
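
Inside the pipeline's operators, these connections would typically be retrieved through Airflow hooks. The following is a minimal sketch assuming Airflow 1.10.x; the surrounding code is illustrative and not taken from the repository.

# Minimal sketch of reading the two connections inside an operator (Airflow 1.10.x).
from airflow.contrib.hooks.aws_hook import AwsHook
from airflow.hooks.postgres_hook import PostgresHook

# aws_credentials: provides the IAM access key and secret for the COPY command.
aws_hook = AwsHook("aws_credentials")
credentials = aws_hook.get_credentials()
access_key, secret_key = credentials.access_key, credentials.secret_key

# redshift_connection: used to run SQL against the Redshift cluster.
redshift = PostgresHook(postgres_conn_id="redshift_connection")
records = redshift.get_records("SELECT COUNT(*) FROM songplays;")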
