Data about songs and the user events generated by listening to them is stored in two separate S3 locations.

- Log events - `s3://sparkify/log_data`
- Song data - `s3://sparkify/song_data`
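To get a feel for the raw files before building anything, you can list a few objects under each prefix. The following is a minimal sketch using boto3, assuming AWS credentials are already configured in your environment; it is not part of the project code.

```python
import boto3

# Both datasets live under the same bucket, in the prefixes listed above.
s3 = boto3.client("s3")

for prefix in ("log_data", "song_data"):
    response = s3.list_objects_v2(Bucket="sparkify", Prefix=prefix, MaxKeys=5)
    print(f"Sample keys under s3://sparkify/{prefix}:")
    for obj in response.get("Contents", []):
        print("  " + obj["Key"])
```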
The user event log files contain information about user listening activity (users, sessions, and the songs played).
The song data files contain the following fields:
- `num_songs`
- `artist_id`
- `artist_latitude`
- `artist_longitude`
- `artist_location`
- `artist_name`
- `song_id`
- `title`
- `duration`
- `year`
To load this data into a data warehouse (Amazon Redshift), a data model was designed using a star schema consisting of fact and dimension tables.
Fact table

- `songplays` - `songplay_id`, `start_time`, `user_id`, `level`, `song_id`, `artist_id`, `session_id`, `location`, `user_agent`

Dimension tables

- `users` - `user_id`, `first_name`, `last_name`, `gender`, `level`
- `songs` - `song_id`, `title`, `artist_id`, `year`, `duration`
- `artists` - `artist_id`, `name`, `location`, `latitude`, `longitude`
- `time` - `start_time`, `hour`, `day`, `week`, `month`, `year`, `weekday`
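As a concrete example, the fact table above could be created in Redshift with DDL along these lines. This is a sketch written as a Python SQL string (the style an Airflow task or a `sql_queries` module might use); the column types and the `IDENTITY` surrogate key are assumptions, not code taken from the repository.

```python
# Sketch of a possible CREATE TABLE for the songplays fact table.
# Column names come from the star schema above; the data types are assumptions.
songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id  INT IDENTITY(0, 1),
    start_time   TIMESTAMP NOT NULL,
    user_id      INT NOT NULL,
    level        VARCHAR,
    song_id      VARCHAR,
    artist_id    VARCHAR,
    session_id   INT,
    location     VARCHAR,
    user_agent   VARCHAR
);
"""
```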
- S3 bucket - Unstructured data in the form of log files is stored in S3.
- Redshift - Data warehouse used as the destination of the structured data after the ETL process, and for staging purposes.
- Airflow - Automation of the ETL pipeline on a daily schedule.
- Initially, data from all log files (songs and user events) is copied to the staging tables `staging_events` and `staging_songs` in Redshift using the `COPY` command, which loads data in parallel.
- Using SQL, data is extracted from these staging tables, transformed to match the star schema design, and loaded into the appropriate fact and dimension tables (a SQL sketch follows this list).
- After the ETL process, data integrity is verified by querying the fact and dimension tables for the number of records.
- The entire workflow is automated with Apache Airflow and executed in a particular order to prevent data discrepancies.
- The order of execution follows the steps above: load the staging tables, load the fact and dimension tables, then run the data-quality checks (a DAG sketch also follows this list).
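The staging and loading steps boil down to plain SQL. Below is a minimal sketch, again as Python SQL strings, using the song data as an example; the `JSON 'auto'` option, the access-key placeholders, and the assumption that `staging_songs` mirrors the raw song data fields are mine, not code taken from the repository.

```python
# COPY loads the raw files from S3 into the staging table in parallel.
# The IAM user keys correspond to the "aws_credentials" connection described
# in the configuration section below.
stage_songs_copy = """
COPY staging_songs
FROM 's3://sparkify/song_data'
ACCESS_KEY_ID '<iam-user-access-key>'
SECRET_ACCESS_KEY '<iam-user-secret-access-key>'
JSON 'auto';
"""

# INSERT ... SELECT transforms staging rows into the songs dimension table,
# assuming staging_songs carries the same fields as the raw song data.
song_table_insert = """
INSERT INTO songs (song_id, title, artist_id, year, duration)
SELECT DISTINCT song_id, title, artist_id, year, duration
FROM staging_songs
WHERE song_id IS NOT NULL;
"""
```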
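And here is a minimal Airflow DAG sketch of that execution order, ending with a record-count data-quality check. Airflow 1.10-style imports are assumed (consistent with the `airflow initdb` command used in the setup below); the task names and placeholder operators are illustrative, while the DAG id `Songs_Events_ETL` and the connection id `redshift_connection` come from this README.

```python
from datetime import datetime

from airflow import DAG
from airflow.hooks.postgres_hook import PostgresHook
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator


def check_record_counts():
    """Fail the run if any fact/dimension table ended up empty."""
    redshift = PostgresHook(postgres_conn_id="redshift_connection")
    # "time" is quoted because it can clash with a reserved word in Redshift.
    for table in ("songplays", "users", "songs", "artists", '"time"'):
        records = redshift.get_records(f"SELECT COUNT(*) FROM {table}")
        if not records or records[0][0] < 1:
            raise ValueError(f"Data quality check failed: {table} is empty")


dag = DAG(
    "Songs_Events_ETL",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",  # the pipeline runs on a daily schedule
)

# Placeholder tasks; in the real project these would run the COPY and
# INSERT ... SELECT statements sketched above.
stage_to_redshift = DummyOperator(task_id="stage_events_and_songs", dag=dag)
load_fact_table = DummyOperator(task_id="load_songplays_fact", dag=dag)
load_dimensions = DummyOperator(task_id="load_dimension_tables", dag=dag)
quality_checks = PythonOperator(
    task_id="run_data_quality_checks",
    python_callable=check_record_counts,
    dag=dag,
)

# Execution order: staging -> fact -> dimensions -> data-quality checks.
stage_to_redshift >> load_fact_table >> load_dimensions >> quality_checks
```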
To recreate this project and run the workflow, follow the instructions below from the command line.

- Clone the repository: `git clone https://github.com/command7/SQL_SQL_ETL_Airflow.git`
- Navigate to the project folder: `cd SQL_SQL_ETL_Airflow`
- Initialize the virtual environment: `pipenv shell`
- Install the required packages from the `Pipfile`: `pipenv install`
- Create a `.env` file as follows to direct Airflow to load DAGs from the project directory: `echo "AIRFLOW_HOME=${PWD}/airflow" >> .env`
- Initialize the Airflow database: `airflow initdb`
- Start the Airflow scheduler: `airflow scheduler`
- Start the Airflow webserver: `airflow webserver -p 8080`
- Turn on the `Songs_Events_ETL` DAG in the webserver GUI.
Certain credentials need to be stored in Apache Airflow's connections menu.
Open the Airflow GUI -> Admin -> Connections and add the following connections.

Redshift connection

- Conn Id -> `redshift_connection`
- Conn Type -> Postgres
- Host -> Redshift endpoint address
- Schema -> Name of the database in Redshift
- Login -> Username used to log in to the database
- Password -> Password used to log in to the database
- Port -> 5439 (Redshift listens on this port by default)

AWS credentials

- Conn Id -> `aws_credentials`
- Conn Type -> Amazon Web Services
- Login -> IAM user access key ID
- Password -> IAM user secret access key
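For reference, this is roughly how Airflow 1.10-era tasks pick up those two connections at runtime; a sketch under that assumption, not code lifted from the repository.

```python
from airflow.contrib.hooks.aws_hook import AwsHook
from airflow.hooks.postgres_hook import PostgresHook

# The "aws_credentials" connection holds the IAM user's keys; they feed the
# ACCESS_KEY_ID / SECRET_ACCESS_KEY clause of the COPY command.
aws_hook = AwsHook("aws_credentials")
credentials = aws_hook.get_credentials()

# The "redshift_connection" connection provides a ready-to-use database handle.
redshift = PostgresHook(postgres_conn_id="redshift_connection")
row_count = redshift.get_first("SELECT COUNT(*) FROM songplays")
```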