This repository contains data extraction, cleaning, preprocessing, and analysis of New York City motor vehicle collision data. We utilized the Socrata Open Data API (SODA) in Python to extract the raw data and the Polars library for data cleaning and preprocessing. The cleaned data was uploaded to the Google Cloud Platform (GCP) and analyzed using BigQuery. Finally, we created visualizations in Tableau to gain insights from the data.
- Introduction
- Data Extraction
- Data Cleaning and Preprocessing
- Uploading Cleaned Data to GCP
- Data Analysis with BigQuery
- Data Visualizations in Tableau
- Future Works
The NYPD Motor Vehicle Collision dataset is a valuable resource for analyzing traffic accidents in New York City. This project aims to extract, clean, and analyze this data to gain insights into patterns and trends related to motor vehicle collisions.
We used the Socrata Open Data API (SODA) in Python to retrieve the NYPD Motor Vehicle Collision dataset. SODA exposes datasets published on Socrata-powered open data portals over HTTP, which makes it straightforward to query and download them programmatically. You can find the extraction code in the data_extraction.py file.
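As a rough illustration of paged retrieval through SODA, here is a minimal sketch using only the Python standard library. It is not the code in data_extraction.py. The endpoint URL (dataset id h9gi-nx95 on NYC Open Data), the page size, and the helper names `build_page_url` and `fetch_rows` are assumptions made for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint for the collision dataset on NYC Open Data.
BASE_URL = "https://data.cityofnewyork.us/resource/h9gi-nx95.json"


def build_page_url(limit: int, offset: int, base_url: str = BASE_URL) -> str:
    """Build a SODA query URL for one page of rows."""
    # $limit and $offset page through the data; $order keeps paging stable.
    params = {"$limit": limit, "$offset": offset, "$order": ":id"}
    return f"{base_url}?{urlencode(params)}"


def fetch_rows(page_size: int = 1000, max_pages: int = 1):
    """Yield rows page by page until a short page or max_pages is reached."""
    for page in range(max_pages):
        with urlopen(build_page_url(page_size, page * page_size)) as resp:
            rows = json.load(resp)
        yield from rows
        if len(rows) < page_size:
            break  # last page reached
```

Calling `list(fetch_rows(page_size=1000, max_pages=2))` would request at most two pages over the network. Larger pulls may need an app token, which this sketch leaves out.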
We used the Polars library in Python for data cleaning and preprocessing. Polars is a DataFrame library with an API broadly similar to pandas, but it is built for performance, which makes it well suited to handling large datasets efficiently.
Some key features of Polars include:
- High-performance query execution
- Support for complex data operations
- Easy integration with various data sources
- Memory-efficient operations
- Columnar data storage
You can view the complete data cleaning and preprocessing code in the data_cleaning.ipynb Jupyter Notebook.
After cleaning and preprocessing the data, we uploaded it to Google Cloud Platform (GCP) for further analysis. GCP provides scalable and reliable cloud infrastructure for storing and processing large datasets.
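The upload mechanism is not specified here, so the following is only one possible approach: copying a cleaned CSV file into a Cloud Storage bucket with the google-cloud-storage client. The bucket name, file names, and the helper `upload_file` are hypothetical.

```python
def upload_file(bucket, local_path: str, blob_name: str) -> str:
    """Upload one local file to a Cloud Storage bucket and return its gs:// URI."""
    blob = bucket.blob(blob_name)
    blob.upload_from_filename(local_path)
    return f"gs://{bucket.name}/{blob_name}"


def main() -> None:
    # Imported lazily so the helper above can be used without the GCP SDK installed.
    from google.cloud import storage

    client = storage.Client()  # relies on Application Default Credentials
    bucket = client.bucket("my-collisions-bucket")  # hypothetical bucket name
    uri = upload_file(bucket, "cleaned_collisions.csv", "clean/cleaned_collisions.csv")
    print(uri)
```

Running `main()` requires valid GCP credentials and an existing bucket. From there, the file could be loaded into a BigQuery table or queried as an external table.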
We used BigQuery, a fully managed, serverless data warehouse on GCP, to analyze the data in depth. BigQuery lets us run SQL queries on large datasets quickly and efficiently, making it well suited to exploring the NYPD Motor Vehicle Collision data.
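As an example of the kind of query this enables, the sketch below counts collisions per borough from Python using the google-cloud-bigquery client. The table id, the `borough` column name, and the helper names are assumptions for illustration and do not reflect the project's actual queries.

```python
def collisions_by_borough_sql(table: str) -> str:
    """Return a SQL query that counts collisions per borough."""
    return (
        "SELECT borough, COUNT(*) AS collisions\n"
        f"FROM `{table}`\n"
        "WHERE borough IS NOT NULL\n"
        "GROUP BY borough\n"
        "ORDER BY collisions DESC"
    )


def run_query(client, sql: str) -> list:
    """Run a query and return the result rows as plain dictionaries."""
    return [dict(row.items()) for row in client.query(sql).result()]


def main() -> None:
    # Imported lazily so the helpers above can be used without the GCP SDK installed.
    from google.cloud import bigquery

    client = bigquery.Client()  # relies on Application Default Credentials
    sql = collisions_by_borough_sql("my-project.collisions.crashes")  # hypothetical table
    for row in run_query(client, sql):
        print(row["borough"], row["collisions"])
```

Keeping the SQL in a small function makes it easy to reuse the same query against a test table before pointing it at the full dataset.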
We used Tableau to create interactive data visualizations to gain insights from the dataset. You can find the Tableau visualization files in the tableau_visualizations directory.
- Add Insights
- Add Conclusion