Skip to content

Latest commit

 

History

History
132 lines (58 loc) · 4.85 KB

File metadata and controls

132 lines (58 loc) · 4.85 KB

GCP Data Engineering project

The purpose of this project is to analyze the TLC Trip Record Data. Trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts. To understand the data and its fields, please refer to the data dictionary provided by the TLC (data_dictionary_trip_records_yellow.pdf).

The analysis of the TLC Trip Record Data will be conducted using the following technologies and tools:

  • Cloud Storage 🪣 to store and manage the trip record data

  • Compute Engine 💽 to host and run the Mage

  • Mage 🧙‍♂️ for Extract, Transform, Load (ETL) processes

  • BigQuery 🔍 as our data warehouse for storing and querying the transformed trip record data

  • Looker📈 as our business intelligence and data visualization platform

💡 Additionally, a key objective of this project is to gain familiarity with the Mage tool. Mage offers an alternative approach to Airflow for managing ETL workflows and will be the primary tool used to orchestrate the data pipeline.

image

GCS 🪣

Create a new bucket, upload the csv file. Change the permissions of the bucket (permissions -> edit access control -> Fine-grained), edit the access of the csv file to make it publicly available

image

image

Compute Engine 💽

Create a new instance, e.g. Machine type: e2-standard-4

SSH into your VM and run these commands to install Python, pip, wget, pandas, Google Cloud Library, Google Cloud BigQuery library

sudo apt-get install update

sudo apt-get install python3-distutils

sudo apt-get install python3-apt

sudo apt-get install wget

wget https://bootstrap.pypa.io/get-pip.py

sudo python3 get-pip.py

sudo pip3 install pandas

sudo pip3 install google-cloud

sudo pip3 install google-cloud-bigquery

Run sudo pip3 install mage-ai to install Mage on your VM.

To start a new Mage project: mage start de-uber-project (you will see: Checking port 6789... if you restart Mage, the port may change e.g. Checking port 6790...)

Create a new firewall rule for the port 6789

image

To open Mage UI: External-IP-address:6789

Mage 🧙

load_uber_data

Open Mage UI, select Data loader -> Python -> API

Copy URL of your csv file, add it to the Mage code: @data_loader url = ' '

image

Run the block, load the data from GCS

image

uber_transformation

Transform the data: select Transformer -> Python -> Generic (no template)

image

Run uber_transformation block

image

uber_bigquery_load

Select Data exporter -> Python -> Google BigQuery

image

Go to API & Services on GCP, create a new Service account from the Credentials section, assign the BigQuery Admin role to the SA. Create a new key in JSON format for this SA. Copy and paste the information from your JSON key into the io_config.yaml file.

image

BigQuery 🔍

Create a Dataset in BQ, run uber_bigquery_load block. Load all tables to BQ.

image

image

image

Create a new table by running sql code from sql_query.sql

Looker 📈

Open lookerstudio.google.com. Connect Looker to your BQ.

Create a dashboard. Here is an example of my Looker Dashboard.