This is a tool that uses data from the 2018 NYC Business Savings dataset and the 2015 Tree Census to build a simple predictive model estimating how much a business might save in utility costs if it invested in green initiatives like tree planting or enrolling in a business association.
This project is part of the 2019 NYC Open Data Competition.
We hypothesize that the number of trees on or near a business's property is correlated with the utility savings granted by the City of New York.
First, the NYC Business Savings data is small enough that we were able to load it directly for our analysis with the following code:
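A minimal sketch of the one-shot load with pandas. A small inline sample stands in for the real file here, and the column names are assumptions, not the dataset's actual headers:

```python
import io
import pandas as pd

# Inline stand-in for the 2018 NYC Business Savings CSV
# (column names are hypothetical).
sample_csv = io.StringIO(
    "company,address,total_savings\n"
    "Acme Corp,1 Main St,12000.50\n"
    "Beta LLC,2 Broad St,8300.00\n"
)

# The dataset is small enough to load in a single pass.
savings = pd.read_csv(sample_csv)
```

In practice, the `io.StringIO` stand-in would be replaced with the path to the downloaded CSV.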
Next, we attempted to load the NYC Tree Census, but with over 680,000 rows a direct load was not feasible, so we used an optimized function instead. This function downloads the large CSV in 100,000-row increments that can be processed more quickly on laptops with limited memory and processing power.
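The incremental load can be sketched with pandas' built-in chunked reader. An inline sample stands in for the real Tree Census file, and a tiny chunk size is used so the example runs; in practice the chunk size would be 100,000:

```python
import io
import pandas as pd

# Inline stand-in for the 2015 Tree Census CSV (columns are assumptions).
tree_csv = io.StringIO(
    "tree_id,status,latitude,longitude\n"
    + "\n".join(f"{i},Alive,40.7,-73.9" for i in range(5))
)

# Read the file in fixed-size increments instead of all at once,
# then concatenate the pieces into one DataFrame.
chunks = pd.read_csv(tree_csv, chunksize=2)  # 100_000 in practice
trees = pd.concat(chunks, ignore_index=True)
```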
The next step was to clean the data: dropping null values, removing uninformative records (for instance, tree stumps in the tree dataset), and discarding any unused categories.
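A sketch of that cleaning step on a toy tree DataFrame. The column names (`status`, `spc_common`) and the `"Stump"` label are assumptions about the Tree Census schema:

```python
import pandas as pd

# Toy stand-in for the tree dataset (column names assumed).
trees = pd.DataFrame({
    "status": ["Alive", "Stump", None, "Alive"],
    "spc_common": ["red maple", None, "pin oak", "ginkgo"],
})

# Drop rows with nulls, then drop uninformative records such as stumps.
clean = trees.dropna()
clean = clean[clean["status"] != "Stump"]
```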
Importantly, savings were reported in aggregate over the entire existence of each business, so we needed to standardize the savings by address. We created a normalized calculated column giving the savings of a particular business for a single month ("Periodic Savings Over Months"). This normalized monthly savings became our prediction target.
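The normalization amounts to dividing lifetime savings by the number of months the business received benefits. A sketch with hypothetical column names (`total_savings`, `months_receiving_benefits` are assumptions):

```python
import pandas as pd

# Toy business data; column names are assumptions.
biz = pd.DataFrame({
    "address": ["1 Main St", "2 Broad St"],
    "total_savings": [12000.0, 8400.0],
    "months_receiving_benefits": [24, 12],
})

# Lifetime savings divided by months of benefits gives a
# per-month figure comparable across businesses of any age.
biz["Periodic Savings Over Months"] = (
    biz["total_savings"] / biz["months_receiving_benefits"]
)
```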
Lastly, we created two more features: (1) the monthly savings of addresses within a 0.5-mile radius and (2) the tree count within a half-mile radius of the address. Our prepared data consisted of both categorical and continuous variables.
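The half-mile tree count can be sketched with a great-circle distance check. The coordinates below are made up for illustration, and the project's actual distance method may differ:

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lon points."""
    r = 3958.8  # Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Hypothetical business location and tree coordinates.
business = (40.7128, -74.0060)
trees = [(40.7130, -74.0055), (40.7500, -74.0000), (40.7120, -74.0070)]

# Count trees within a half-mile radius of the business address.
tree_count = sum(
    1 for lat, lon in trees
    if haversine_miles(business[0], business[1], lat, lon) <= 0.5
)
```

The same distance test, applied to other businesses' locations instead of trees, yields the neighborhood monthly-savings feature.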
There are two directions this analysis could take.
- I experimented with regression on the continuous and coded categorical variables together, but that proved unwieldy and out of scope for this project.
- I instead opted to do regression on our three continuous variables. Our training data consisted of:
From there, we fit our data to a few algorithms to compare accuracy, and ultimately went with Bayesian Ridge regression.
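A sketch of the Bayesian Ridge fit with scikit-learn, using synthetic stand-ins for the continuous features; the real feature matrix and any hyperparameter tuning are not shown here:

```python
import numpy as np
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the continuous features (e.g. nearby tree
# count and nearby monthly savings) and the monthly-savings target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 2))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1, size=200)

# Hold out a test split, fit, and score on unseen data.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = BayesianRidge()
model.fit(X_train, y_train)
r2 = model.score(X_test, y_test)
```

Bayesian Ridge also exposes predictive uncertainty via `model.predict(X, return_std=True)`, which is a natural fit for a savings-estimate tool.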
We applied the model to some training data (link).
- If we had more time, we would have liked to optimize the model further, as we suspect there is a model that drastically reduces variance and bias.
- I originally worked on encoding the different categorical variables into our model, which would be an interesting way to make our tool more powerful and relevant.
- William Zeng
- Paloma (Haige) Cui
- Paige (Zipei) Wang
- Roy Hyujin Han of Cross Compute
- Ying Zhou of QC Tech Incubator