In this repository, you will find some showcased projects from my Master of Information and Data Science (MIDS) program at UC Berkeley. Additional projects can be found on my website:
As part of my Machine Learning at Scale course for the Master of Information and Data Science (MIDS) program at UC Berkeley, my team (Rachel Gao, Ray Cao, Jenna Sparks, Dhyuti Ramadas and myself) worked on predicting delayed flights at least two hours before departure. We leveraged a dataset of flight and weather variables from 2015 to 2019, containing 56 columns and 23.2M records. I was in charge of exploratory data analysis and feature engineering for the weather variables, sourcing new flight and weather data for 2020 to 2023, and performing a gap analysis to identify where our best-performing ensemble model was struggling. The image above shows the percentage of false positives by state in the training, validation and test sets for our best-performing model.
Tech stack:
- Databricks
- Spark
Libraries:
- Matplotlib
- Geopandas
- Folium
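A minimal sketch of how a per-state false-positive percentage like the one in the gap analysis could be computed, written in plain Python rather than the project's Spark pipeline. The function name, the record layout, and the sample rows are all hypothetical. It also assumes that "percentage of false positives" means the share of a state's predictions that were flagged as delayed but were actually on time; the project may have used a different denominator.

```python
from collections import defaultdict


def false_positive_pct_by_state(records):
    """Return {state: percentage of predictions that are false positives}.

    `records` is an iterable of (state, actual_delayed, predicted_delayed)
    tuples. This layout is an assumption made for illustration only.
    """
    totals = defaultdict(int)
    false_positives = defaultdict(int)
    for state, actual_delayed, predicted_delayed in records:
        totals[state] += 1
        # A false positive: predicted delayed, but the flight was on time.
        if predicted_delayed and not actual_delayed:
            false_positives[state] += 1
    return {
        state: 100.0 * false_positives[state] / totals[state]
        for state in totals
    }


# Hypothetical sample rows, not taken from the real dataset.
sample = [
    ("CA", False, True),
    ("CA", True, True),
    ("CA", False, False),
    ("CA", False, False),
    ("NY", False, True),
    ("NY", True, False),
]

result = false_positive_pct_by_state(sample)
print(result)
```

Running the same function separately on the training, validation and test predictions would give one set of per-state values per split, which could then be joined to state geometries for mapping.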
Speech-to-text systems show bias against low-resource languages. My team (Rachel Gao, Erica Nakabayashi and myself) decided to tackle this problem in our final project for the Natural Language Processing with Deep Learning course. I came up with the research idea and shared the research design with the rest of the team. I was also in charge of gathering some of the data, doing EDA on the Grammatical Error Corrector, experimenting with part-of-speech tagging and spellchecking for our model, running RoBERTa models and hyperparameter tuning, and writing the final report.
Tech stack:
- Python
- Google Colab
- Deep neural networks
- Transformers
- TensorFlow
Libraries:
- NLTK
- spaCy
For our final statistics project, our team (Henry Caldera, Eunice Ngai and myself) chose to estimate the effect of air pollution on the rate of asthma in Los Angeles County using a linear regression analysis. I came up with the project idea and was responsible for selecting important variables and designing the presentation. The whole team worked collaboratively on the modeling and on writing the report.
Tech stack:
- R
Libraries:
- ggplot2
- tidyverse
- stargazer
- caret
- lmtest
- sandwich
I collaborated with PhD candidates Silvia Barbereschi (UC Berkeley) and Beatrice Montano (Columbia) as a Research Assistant on their Gender Equity and Climate Change in Tanzania project. The goal of the project was to determine whether climate change had driven changes in gender norms in Tanzania. I built a dataset from the land cover dataset in the Copernicus Climate Data Store for the years 2020 to 2022, and generated summary statistics for gender-related variables in Tanzania and other East African countries (Kenya, Uganda, Mozambique). I also generated visualizations of the number of consecutive dry days.
Tech stack:
- Python
- QGIS
- R