This repo showcases working with datasets using Python, pandas, SQL, BigQuery and Looker Studio working as a team to develop an ETL pipeline.
- BigQuery
- Looker Studio
- Python
- SQL
- Pandas
- Git
- Markdown
- JSON
- NumPy
.gitignore
requirements.txt
- World Energy Consumption
- Emissions by Country
- Latitude and Longitude for Every Country and State
- Land area (sq. km)
This repo includes an ETL pipeline leveraging the different datasets highlighted above. The team began by narrowing down our datasets and then outlining 1) Architectural structure 2) data model and 3) the full ETL pipeline to then construct each respective portion as a team:
The team debated the merits of leveraging a snowflake schema vs star schema, while we acknowledged the efficiencies of a star schema that would be achieved by consolidating fact tables, we landed on a snowflake schema based on the size of the data sets not having a meaningful impact on performance (resulting from need for additional joins) and opting for the organization of data being more user friendly and intuitive.
Reed constructed the dim_country
table in the REED.ipynb
notebook. REED performed profiling, cleaning and transformations on the Latitude and Longitude for Every Country and State and Land area (sq. km) datasets. Upon completion it was loaded to BigQuery.
Chloe worked on profiling, cleaning and transformations for the Emissions by Country data set create the fct_emissions
table. Upon completion it was loaded to BigQuery.
Ruben performed profiling, cleaning and transformations on the World Energy Consumption to compile both the fct_gdp
and fct_consump
tables. Upon completion it was loaded to BigQuery. He also owned and authored the README.md
.
Once the datasets were cleaned and consolidated, the team created data visualizations and analysis (using Looker Studio) leveraging the constructed dimension and fact tables outlined above.
Below is a combo chart that was put together by Ruben that shows GDP compared to Population and Energy Consumption (click on image of chart to use dashboard):
While Github disables iframe, which allows embedding of the report on markdown files, I have included the code below for users that clone the project.
<iframe width="600" height="450" src="https://datastudio.google.com/embed/reporting/dbe92c8b-ccd3-41d9-b269-5964eb9717c3/page/f94CD" frameborder="0" style="border:0" allowfullscreen></iframe>
The scale of GDP (trillions of dollars), population (billions), and Energy consumption (thousands of terawatt-hours) posed an issue for the visualization, but by embedding this report the user is able to see the individual values for energy consumption which shows the consistent trend that population and energy consumption growth align with the growth of GDP. The chart is dynamic in that it allows users to filter for country and timeframe. Also the timeframe of 1965 through 2016 was chosen as consumption data prior to 1965 was missing and GDP data goes up to 2016.
Below is a line chart by Ruben that shows total global energy consumption compared to CO2 emissions (click on image of chart to use dashboard):
As called out above Github disables iframe, but I have included the code below for reference:
<iframe width="600" height="450" src="https://datastudio.google.com/embed/reporting/b7d972c6-7faf-4c78-948c-614945f42350/page/Io6CD" frameborder="0" style="border:0" allowfullscreen></iframe>
The chart leverages different scales for the left and right y-axis to better show the correlation between emissions and energy consumption over time (1965 - 2019). The chart is dynamic in that it allows users to filter for country and timeframe.
Chloe put together two line graphs that 1) plots the global CO2 emissions over time with total emissions and each type of emission producer and 2) total consumption compared to renewable consumption (click on the image of either chart to use dashboard):
Reed put together a line graph plotting the world GDP compared to coastline size and population:
Overall, the team was able to limit the amount of merge conflicts by working on independent notebooks and assigning different tasks (e.g. Each focused on constructing specific dimension and fact tables, etc.).
- Go to https://github.com/Reed-Carter/Emissions-by-Country.git to find the specific repository for this website.
- Then open your terminal. I recommend going to your Desktop directory:
cd Desktop
- Then clone the repository by inputting:
git clone https://github.com/Reed-Carter/Emissions-by-Country.git
- Go to the new directory or open the directory folder on your desktop:
cd Emissions-by_Country
- Once in the directory you will need to set up a virtual environment in your terminal:
python3.7 -m venv venv
- Then activate the environment:
source venv/bin/activate
- Install the necessary items with requirements.txt:
pip install -r requirements.txt
- Download the necessary csv files listed in the Datasets Used section
- With your virtual environment now enabled with proper requirements, open the directory:
code .
- Upon launch please update the Google Cloud client and project details to configure it to load to your project
- No known bugs
MIT License
Copyright (c) 2022 Ruben Giosa, Reed Carter, Chloe (Yen Chi) Le
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.