The purpose of this project is to analyze the TLC Trip Record Data. Trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts. To understand the data and its fields, please refer to the data dictionary provided by the TLC (data_dictionary_trip_records_yellow.pdf).
The analysis of the TLC Trip Record Data will be conducted using the following technologies and tools:
- Cloud Storage 🪣 to store and manage the trip record data
- Compute Engine 💽 to host and run Mage
- Mage 🧙‍♂️ for Extract, Transform, Load (ETL) processes
- BigQuery 🔍 as our data warehouse for storing and querying the transformed trip record data
- Looker Studio 📈 as our business intelligence and data visualization platform
💡 Additionally, a key objective of this project is to gain familiarity with the Mage tool. Mage offers an alternative approach to Airflow for managing ETL workflows and will be the primary tool used to orchestrate the data pipeline.
Create a new bucket and upload the CSV file. Change the bucket's access control to fine-grained (Permissions -> Edit access control -> Fine-grained), then edit the access of the CSV file to make it publicly available.
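If you prefer to script this step instead of using the console, the google-cloud-storage client can upload the file and make it public. The bucket and file names below are placeholders, so this is only a sketch of the same step:

```python
from google.cloud import storage  # pip install google-cloud-storage

# Placeholder names -- replace with your own bucket and file.
BUCKET_NAME = "de-uber-project-bucket"
SOURCE_FILE = "uber_data.csv"

client = storage.Client()
bucket = client.bucket(BUCKET_NAME)

# Upload the CSV and make the object publicly readable
# (requires fine-grained, non-uniform access control on the bucket).
blob = bucket.blob(SOURCE_FILE)
blob.upload_from_filename(SOURCE_FILE)
blob.make_public()

print("Public URL:", blob.public_url)
```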
Create a new Compute Engine instance (e.g. machine type: e2-standard-4).
SSH into your VM and run the following commands to install Python tooling, pip, wget, pandas, and the Google Cloud and Google Cloud BigQuery client libraries:
sudo apt-get update
sudo apt-get install python3-distutils
sudo apt-get install python3-apt
sudo apt-get install wget
wget https://bootstrap.pypa.io/get-pip.py
sudo python3 get-pip.py
sudo pip3 install pandas
sudo pip3 install google-cloud
sudo pip3 install google-cloud-bigquery
Run sudo pip3 install mage-ai to install Mage on your VM.
To start a new Mage project, run mage start de-uber-project (you will see Checking port 6789...; if you restart Mage, the port may change, e.g. Checking port 6790...).
Create a new firewall rule that allows ingress traffic on port 6789.
To open the Mage UI, go to External-IP-address:6789 in your browser.
In the Mage UI, select Data loader -> Python -> API.
Copy the URL of your CSV file and paste it into the url variable inside the @data_loader function.
Run the block to load the data from Cloud Storage.
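The generated API loader template looks roughly like the sketch below; the URL is a placeholder for the public link to your CSV file in Cloud Storage:

```python
import io

import pandas as pd
import requests

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader
if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test


@data_loader
def load_data_from_api(*args, **kwargs):
    # Placeholder: public URL of the CSV file uploaded to Cloud Storage.
    url = 'https://storage.googleapis.com/<your-bucket>/uber_data.csv'
    response = requests.get(url)
    return pd.read_csv(io.StringIO(response.text), sep=',')


@test
def test_output(output, *args) -> None:
    assert output is not None, 'The output is undefined'
```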
Transform the data: select Transformer -> Python -> Generic (no template).
Run the uber_transformation block.
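The project's transformation logic lives in the uber_transformation block itself; purely as an illustration of the transformer shape, a generic block could look like this (the datetime column names are assumed from the TLC yellow-trip data dictionary):

```python
import pandas as pd

if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer
if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test


@transformer
def transform(df: pd.DataFrame, *args, **kwargs) -> pd.DataFrame:
    # Illustrative cleanup only -- the real uber_transformation block
    # contains the project's full transformation logic.
    df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'])
    df['tpep_dropoff_datetime'] = pd.to_datetime(df['tpep_dropoff_datetime'])
    df = df.drop_duplicates().reset_index(drop=True)
    return df


@test
def test_output(output, *args) -> None:
    assert output is not None, 'The output is undefined'
```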
Select Data exporter -> Python -> Google BigQuery
Go to APIs & Services on GCP, create a new service account from the Credentials section, and assign the BigQuery Admin role to the service account. Create a new key in JSON format for this service account.
Copy and paste the information from your JSON key into the io_config.yaml file.
Create a dataset in BigQuery, then run the uber_bigquery_load block to load all tables into BigQuery.
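The BigQuery exporter template reads the credentials you placed in io_config.yaml; a sketch, with a placeholder table_id (project.dataset.table):

```python
from os import path

from mage_ai.data_preparation.repo_manager import get_repo_path
from mage_ai.io.bigquery import BigQuery
from mage_ai.io.config import ConfigFileLoader
from pandas import DataFrame

if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter


@data_exporter
def export_data_to_big_query(df: DataFrame, **kwargs) -> None:
    # Placeholder: replace with your own project, dataset, and table names.
    table_id = 'your-project.uber_dataset.uber_trips'
    config_path = path.join(get_repo_path(), 'io_config.yaml')
    config_profile = 'default'

    BigQuery.with_config(ConfigFileLoader(config_path, config_profile)).export(
        df,
        table_id,
        if_exists='replace',  # overwrite the table if it already exists
    )
```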
Create a new table by running the SQL code from sql_query.sql.
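You can run that script in the BigQuery console; alternatively, a small sketch using the google-cloud-bigquery client installed earlier simply executes whatever statements sql_query.sql contains:

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses the credentials configured on the VM

# Read the project's SQL script and run it as a single query job.
with open('sql_query.sql') as f:
    sql = f.read()

client.query(sql).result()  # .result() waits for the job to finish
print("sql_query.sql executed")
```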
Open lookerstudio.google.com and connect Looker Studio to your BigQuery dataset.
Create a dashboard. Here is an example of my Looker Dashboard.