-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathbike_counters_final_report.rmd
468 lines (332 loc) · 29 KB
/
bike_counters_final_report.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
---
title: "Ottawa Bike Counter Machine Learning Project"
author: "Nick Marum"
date: "28/10/2020"
output:
pdf_document: default
word_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
if(!require(lubridate)) install.packages("lubridate")
if(!require(caret)) install.packages("caret")
if(!require(rpart)) install.packages("rpart")
if(!require(arm)) install.packages("arm")
if(!require(randomForest)) install.packages("randomForest")
if(!require(rpart.plot)) install.packages("rpart.plot")
library(tidyverse)
library(lubridate) #for parsing of dates
library(caret) #includes several ML algorithms, including KNN3.
library(rpart) #provide a regression tree ML algorithm
library(arm) #for naive bayesian ML algorithm
library(randomForest) #for random forest ML algorithm
library(rpart.plot) #rtree visualization
url1 <- "https://raw.githubusercontent.com/nmarum/ottawa_bike_counters/master/ml_friendly_bike_counters.csv"
url2 <- "https://raw.githubusercontent.com/nmarum/ottawa_bike_counters/master/ottawa_bike_counters.csv"
download.file(url1, destfile = "ml_friendly_bike_counters.csv") #ML friendly csv with weather info
download.file(url2, destfile = "ottawa_bike_counters.csv")#bike ride count data only BUT with date
dat <- read_csv("ml_friendly_bike_counters.csv")
alt <- read_csv("ottawa_bike_counters.csv")
df <- as_tibble(unique(dat$day))#single col tibble of "days" from ML friendly csv
df2 <- as_tibble(unique(as.Date(alt$Date)))#single col tibble of "dates"
df <- cbind(df, df2)#binding two together
colnames(df) <- c("day", "date")
head(df)
dat <- dat %>% left_join(df, by="day") #adding a actual date field to ML friendly data.
#I am hoping this will make data exploration easier.
head(dat)
dat <- dat[1:28776,] #selecting entries which contain weather data (to end of 2018) .
rm("df", "df2")
```
*INTRODUCTION*
This report documents the exploratory analysis and machine learning model building process for a HarvardX Data Science Professional Certificate capstone project. It uses publicly available City of Ottawa bike ride counter data from public pathways in Ottawa, CANADA as well as publicly available weather data from Environment Canada.
The data used in this project were found in zip format on Kaggle at the following address: <https://www.kaggle.com/m7homson/ottawa-bike-counters/download> To access it on Kaggle you have to create an account and login. For ease of access, I have uploaded a copy of the data set to my github repo. A copy of the final script, including data import and wrangling is also included: <https://github.com/nmarum/ottawa_bike_counters>
The reason why I selected this dataset for this project is that I live in Ottawa and am a regular bike rider who commutes to and from work on my bike about 8 months of the year. There is an extensive network of multi-use pathways in Ottawa, with several bike counters that are prominently placed. It appealed to me to use a data set to make predictions about something I am very familiar with and in my home town.
*EXPLORATORY ANALYSIS*
```{r}
summary(dat)
```
A quick summary of the wrangled data set shows that there are a total of 14 columns reflecting 28,776 total entries. The "location_id" and "location_name" are obviously identifiers of bike ride counter locations. The dependent variable for this machine learning project is "count" which is the number of two-way rides past a bike counter recorded on a daily basis. The "day" variable is a unique identifier per day from the first entry in January 2010 until the last entry on Dec 31, 2018; "day_of_week" identifies the day of the week by a numeric identifier starting with Sunday as 0 and Saturday as 6, and "day_of_year" being a count from 0 to 365 of each day in the calendar year (including an extra day for leap years). I added a "date" field using the lubridate package during data wrangling to be able to more clearly associate an entry with a day of the year. Finally, there is a time series of weather observations that have been married with the bike counter ride entries which include: "MaxTemp" for maximum daily temperature, "Totalprecipmm" for total precipitation recorded that day in millimetres, among others.
```{r message=FALSE, warning=FALSE}
dat %>%
group_by(location_name) %>%
summarize(mean_count = mean(count),
prop_zerocount = mean(count==0),
entries = n()) %>%
arrange(desc(entries))
```
Examining the bike counter ride count data, we see that some counters have more entries than others. While some counters seem to cease operations through the winter months, a number of counters which are identified as "winter" counters operate year round. These are a more interesting set of counters with which to make weather-based predictions. Also having more entries should improve the accuracy of these predictions.
The table above shows the bike counter along Colonel By Drive (COBY) near the Corktown footbridge crossing over the Rideau Canal as the counter with the greatest number of entries and a low proportion of "0" count/no bike entries (1.7%). While not the busiest counter, the consistent observations seem promising from a predictive perspective.
A City of Ottawa legend for the bike counter data provides the following description for the COBY counter:
"3_COBY: National Capital Commission (NCC) Eastern Canal Pathway approximately 100m north of the Corktown Bridge. \#WINTER counter
"The data provides counts of bike trips (both directions summed unless otherwise noted)" There is no additional note for the COBY data in the City of Ottawa description.
This counter also appeals to me since it is close to my workplace and I have passed by this stretch of the Col By pathway near the Corktown Bridge hundreds of times.
```{r}
coby <- dat %>% filter(location_name == "COBY")
dat %>% filter(location_name == c("COBY", "LMET", "LLYN", "LBAY", "SOMO", "ADAWE BIKE")) %>%
ggplot(aes(count, col=location_name)) + geom_density() +
ggtitle("Density plot of Ottawa 'Winter' bike ride counters")
```
Selecting for "winter" counters (i.e., counters that operate year round even when there is snow on the ground.), and using a density plot to see the frequency of bike ride counts per entry (i.e., day) we see that not all counters follow the same distribution. There seems to be one group of counters that follow a similar distribution with a wide distribution of count entries, while a couple of other entries have a high number of days with relatively few riders and relatively few busy days. We can see that COBY follows the distribution of most of the winter counters.
```{r}
coby_rides <- c(median(coby$count), mean(coby$count))
total_rides <- c(median(dat$count), mean(dat$count))
diff <- c(median(coby$count)-median(dat$count), mean(coby$count)-mean(dat$count))
data.frame(coby_rides, total_rides, diff, row.names = c("median", "mean"))
```
Looking at the mean and median of the COBY counter and all of the City of Ottawa bike ride counters, we can see that the mean and median are similar to the overall mean and median of bike ride counts.
```{r}
tibble(sd(coby$count), sd(dat$count))
```
Looking at standard deviation, we can see it is similar as well at 814 rides. This would suggest that a predictive model created by COBY data should have applicability to other winter counters. That being said, overall it appears that ride counts overall between different counters can vary widely, which could be a challenge to predict.
We will start to look to identify what kind of patterns may exist in the COBY bike counter over time.
```{r message=FALSE, warning=FALSE}
coby %>% group_by(day_of_week) %>%
summarize(avg_count = mean(count)) %>%
ggplot(aes(day_of_week, avg_count)) + geom_col() +
ggtitle("Col By Counter - Average Count of Rides by Day of Week")
```
Looking at the average number of rides by the day of the week (with 0 meaning Sunday and 6 meaning Saturday), we can see a trend of Tuesday to Friday of of higher traffic - likely from commuters heading towards downtown Ottawa along the Colonel By drive pathway. Traffic drops off a little, by about 20%, on the weekends and Mondays.
```{r}
coby %>% mutate(year = year(date)) %>%
ggplot(aes(day_of_year, count)) +
geom_point(alpha=.5) +
geom_smooth(method = "loess") +
ggtitle("Col By Counter - Rides by Day of the Year (2010-2018)")
```
Looking at the ride count data by day of the year, with 0 being New Year's Day running to 364 as New Year's Eve of the same year, we see a clear trend in terms of the number of rides per day over course of the year. The first 100 days of year (January to early April) are the dead of winter in Ottawa - not many bike rides are recorded before the 80 to 90 day mark with the start of spring. From there, bike rides per day rise and peak around June and remain elevated through the summer months and then starts to gradually drop off in the fall and returns to winter levels by the month of December. This trends appears relatively consistent year over year and we can see the pattern by fitting a bin-smoothing line using locally weighted regression.
Obviously, the weather in the summer months in Ottawa is very different from the weather in the winter, so this pattern is not surprising.
```{r}
coby %>% mutate(year = year(date), month = month(date)) %>%
ggplot(aes(day_of_year, count, col=MaxTemp)) +
geom_point(alpha=.5) +
geom_smooth(method = "loess") +
ggtitle("Col By Counter - Rides by Day of the Year and Max Temp")
```
When we take the same graph, but colour each dot to represent the maximum temperature on the day in question, we can see that the higher temperature counts (lighter blue dots) are generally found in the summer months and appear associated with more rides. There is however a great deal of variability, some relatively warmer days did not have a very high number of rides recorded that day while there are some colder days in the summer months which did. This may be due to precipitation levels or other factors not seen in this graph.
```{r}
coby %>% mutate(year = year(date), month = month(date)) %>%
ggplot(aes(day_of_year, count, col=TotalPrecipmm)) +
geom_point(alpha=.5) +
geom_smooth(method = "loess")+
ggtitle("Col By - Rides by Day of the Year and Total Precipmm")
```
Looking once again at the same graph, but this time overlaying precipitation levels, we see that there are a few days of high precipitation appear to be less busy. However, there are also a few days with high levels of precipitation (light blue dots) that had a higher than normal (i.e., above the trend line), however a clear pattern is not easily discernible.
Since it is hard to identify with confidence a clear pattern with respect to precipitation and its impact on bike rides, I decided to stratify data by precipitation amounts to have a closer look.
```{r message=FALSE, warning=FALSE}
#creating a factor with levels
precip_levels <- coby %>%
mutate(Precipmm_levels = factor(round(TotalPrecipmm/10, digits=0)))
#labelling factor levels
levels(precip_levels$Precipmm_levels) <- c("<=5", ">5-14.9", "15-24.9", "25-34.9", "35-44.9", "45-54.9", "65-74.9", "75-84.9", "NaN")
#Visualizing by stratified precipitation levels
precip_levels %>% mutate(year = year(date), month = month(date)) %>%
filter(TotalPrecipmm<55) %>%
ggplot(aes(day_of_year, count)) +
geom_point(alpha=.5) +
geom_smooth(method = "loess") +
facet_wrap(facets = Precipmm_levels~.) +
ggtitle("Col By - Rides by Day of the Year Stratified by TotalPrecipmm")
```
By stratifying the ride entries by the "Totalprecipmm" variable and overlaying a bin smoothing line, we see that only at very high precipitation levels there seems to be an effect on the number of rides per day. The apparent visual difference between the low/no precipitation days and high precipitation days may simply be due to random variation. Day of the year or temperature seem to be stronger indicators - with the warm summer months have more rides even in the rain than during the Ottawa winter months (Dec-March).
```{r}
coby %>% mutate(year = year(date), month = month(date)) %>%
filter(SnowonGrndcm > 0) %>%
ggplot(aes(SnowonGrndcm, count, col=SnowonGrndcm)) +
geom_point(alpha=.5) +
geom_smooth(method = "loess") +
ggtitle("Col By Counter - Rides with Snow on Ground > 0cm")
```
Comparing the amount of recorded snow on the ground and the number of rides shows a clear trend that *any* recorded snow on the ground reduces the number of rides significantly, though there are still a handful of hardy Ottawa bike riders that seem to ride their bikes year round along the Colonel By Drive pathway no matter how much snow is on the ground.
```{r}
temp_levels <- coby %>%
mutate(maxtemp_levels = factor(round(MaxTemp/10, digits=0)))
#labelling factor levels
levels(temp_levels$maxtemp_levels) <- c("-20", "-10", "0", "10", "20", "30", "40","NaN")
temp_levels %>% mutate(year = year(date), month = month(date)) %>%
filter(maxtemp_levels %in% c("-20", "-10", "0", "10", "20", "30", "40")) %>%
ggplot(aes(MaxTemp, count)) +
geom_point(alpha=.5) +
geom_smooth(method = "loess") +
ggtitle("Col By Bike Counter - Count of rides by Max Temp")
```
Taking a closer look at max daily temperature's impact on rides, we see a smooth relationship between an increase in temperature and more rides however the effect seems to tail off at higher temperatures - presumably too hot for some riders.
```{r}
temp_levels %>% mutate(year = year(date), month = month(date)) %>%
filter(maxtemp_levels %in% c("-20", "-10", "0", "10", "20", "30")) %>%
ggplot(aes(day_of_year, count)) +
geom_point(alpha=.5) +
geom_smooth(method = "loess") +
facet_wrap(facets = maxtemp_levels~.) +
ggtitle("Col By - Rides by Day of the Year Stratified by Max Temp")
```
Visualizing ride counts over the course of the year stratified by max temperature we can see that warmer days are clustered in summer months and have more rides. However, days in the same broad temperature range still show wide variability.
```{r}
#running a series of pairwise correlations by various predictors
day_of_year <- cor(coby$count, coby$day_of_year, use = "pairwise.complete.obs", method = "spearman")
day_of_week <- cor(coby$count, coby$day_of_week, use="pairwise.complete.obs", method = "spearman")
Max_Temp <- cor(coby$MaxTemp, coby$count, use="pairwise.complete.obs", method = "spearman")
Mean_Temp <- cor(coby$MeanTemp, coby$count, use = "pairwise.complete.obs", method = "spearman")
Min_Temp <- cor(coby$MinTemp, coby$count, use="pairwise.complete.obs", method = "spearman")
Total_Precipmm <- cor(coby$TotalPrecipmm, coby$count, use="pairwise.complete.obs", method = "spearman")
Total_Rainmm <- cor(coby$TotalRainmm, coby$count, use = "pairwise.complete.obs", method = "spearman")
Total_Snowcm <- cor(coby$TotalSnowcm, coby$count, use = "pairwise.complete.obs", method = "spearman")
Snow_on_Grd <- cor(coby$SnowonGrndcm, coby$count, use = "pairwise.complete.obs", method = "spearman")
#create a data.frame with all the Spearman correlations
data.frame(day_of_year, day_of_week, Max_Temp, Mean_Temp, Min_Temp, Total_Precipmm, Total_Rainmm, Total_Snowcm, Snow_on_Grd, row.names = "Spear Cor with Rides")
```
While visualizing the data shows that day of the year is a strong predictor looking at pairwise correlations of the various features against rides suggests that using temperature, max temp in particular, and snow on the ground are most closely associated with rides. The Spearman correlation is used to help control for outliers by computing correlation based on the ranks of values, of which there seem to be many in the data.
Higher temps are strongly correlated with more rides (.87), while increased snow on the ground is negatively correlated with rides (-.76). Precipitation, rain or snow, seems less of a predictor. Day of the year is not strongly correlated with rides however the relationship we saw in the visualization was not linear so this would make sense.
While the causation is not clear (i.e., Is it higher temperature or the fact it is the summer months that drive additional ridership along the pathway?) the pattern with respect to rides is.
Using the insights gained from this exploratory analysis, we will make a Machine Learning model that will focus on using Temperature, Snow on Ground and Day of year as key features to predict number of rides.
*MACHINE LEARNING MODEL*
For this ML model building exercise, I used the functionality provided in the caret package to leverage a number of different machine learning algorithms. The first step in the ML model building is to partition the COBY dataset into training and test sets. Then I established a training control protocol using 10-fold cross-validation and establishing training grids for tuning for those algorithms that included tuning parameters. The machine learning algorithms I examined included, K Nearest Neighbours (KNN), randomForest (rf), a Naive Bayesian linear algorithm (bayes_glm), linear regression (lm), and two regression tree algorithms (rpart and rpart2). For the first round of trials, I used three features: "MaxTemp", "day_of_year", and "SnowonGrndcm."
```{r warning=FALSE}
#select for COBY data and remove some na entries to avoid errors for some algorithms
coby <- dat %>% filter(location_name == "COBY", !is.na(SnowonGrndcm))
#create data partition for training and evaluating of model
set.seed(20201021, sample.kind = "Rounding")
ind <- createDataPartition(coby$count, p=.9, list = FALSE)
train <- coby[ind,]
test <- coby[-ind,]
control <- trainControl(method = "cv", number = 10, p = .9) #establish train control
rf_cv <- train(count ~ MaxTemp + day_of_year + SnowonGrndcm, method = "rf", #randomForest algorithm
data = train, na.action = na.omit, #omit NAs
tuneGrid = data.frame(mtry = seq(1:2)),
trControl = control)
ggplot(rf_cv, highlight = TRUE)
rf_cv$bestTune #mtry of 1 - best is about RMSE of 415
bayes_glm_cv <- train(count ~ MaxTemp + day_of_year + SnowonGrndcm, method = "bayesglm", #bayesian algorithm
data = train, na.action = na.omit, #omit NAs
trControl = control)
bayes_glm_cv$results #no tuning parameters = RMSE 470
lm_cv <- train(count ~ MaxTemp + day_of_year + SnowonGrndcm, method = "lm", #linear regression
data = train, na.action = na.omit, #omit NAs
trControl = control)
lm_cv$results #no tuning - simple linear regression - RMSE 470
rpart2_cv <- train(count ~ MaxTemp + day_of_year + SnowonGrndcm, method = "rpart2", #regression tree
data = train, na.action = na.omit, #omit NAs
tuneGrid = data.frame(maxdepth = seq(1:10)),
trControl = control)
ggplot(rpart2_cv, highlight = TRUE)
rpart2_cv$results #best tune is Max depth of 4 and RMSE of 436
knn_cv <- train(count ~ day_of_year + MaxTemp + SnowonGrndcm, method = "knn",
data = train, na.action = na.omit, #omit NAs
tuneGrid = data.frame(k = seq(1, 50, 2)),
trControl = control)
knn_cv$bestTune #K of 29 and RMSE of about 400
ggplot(knn_cv, highlight = TRUE) +
ggtitle("KNN with day_of_year, MaxTemp and SnowonGrndcm predictors")
#graph of turning based on lowest RMSE for continuous data
```
Intuitively, KNN seemed to be the the best model due to the very different relationships that the three features seem to have on rides this proved to be the case in my first set of algorithm trials. KNN provided the best results with a root mean square error (RMSE) of just under 400 using a k of 25 nearest neighbours. randomForest performed well with an RMSE of 415, but was very computationally intensive and so I decided against using it further. The regression tree algorithm performed well with rpart2 provide an RMSE of 439, while the linear models (lm and bayes_glm) performed the worst at an RMSE of 470 each.
While KNN was the best performing algorithm in terms of predictive power on the training data, the regression tree approach did provide very attractive interpretability and insights into the relationships between the variables in the data.
```{r message=FALSE, warning=FALSE}
#visualizing a regression tree for this.
fit <- rpart(count ~ MaxTemp + day_of_year + SnowonGrndcm, data = coby)
plot(fit, compress = TRUE, margin = .1)
text(fit)
```
As we can see from the above visualization of a regression tree for the COBY data using the three predictors, the results are easily interpretable based upon temperature and snow on the ground. The day of year varialbe is not used in the model. Given clear pattern from day of year we see in the data, this would seem to impact the effectiveness of the regression tree as a predictor and likely helps explain the better performance of KNN.
The KNN approach using primary predictors provides for the best results likely because it is better able to handle the non linear relationship it has with ridership. We can easily imagine that regardless of temperature or weather, someone may be more inclined to keep their bike out and go for a ride during the warmer seasons, even on a relatively cold or rainy day, as opposed to when their bike is put away for long periods of time (i.e., winter) and there is a stretch of fair weather.
We will take a closer look to see if we can optimize the results of the regression tree and KNN approaches on the data by looking at more features.
```{r message=FALSE, warning=FALSE}
knn_all_cv <- train(count ~ day_of_year + MaxTemp + SnowonGrndcm + MeanTemp +MinTemp + TotalRainmm + TotalSnowcm + TotalPrecipmm, method = "knn",
data = train, na.action = na.omit, #omit NAs
tuneGrid = data.frame(k = seq(1, 50, 2)),
trControl = control)
knn_all_cv$bestTune
#K of 33 and RMSE of about 392 - not much of an improvement with precip predictors
ggplot(knn_all_cv, highlight = TRUE,) + ggtitle("KNN with all predictors")
```
Using KNN, we can see there is an improvement using all the features as predictors however it is not a terribly significant one - only by a couple of points.
```{r}
#lets do the same with rtree - all predictors
fit_all <- rpart(count ~ day_of_year + MaxTemp + SnowonGrndcm + MeanTemp +MinTemp + TotalRainmm + TotalSnowcm + TotalPrecipmm, data = coby)
plot(fit_all, compress = TRUE, margin = .1)
text(fit_all)
```
Looking at a regression tree using all predictors, we can see the results are similar to the previous tree, however there is an extra branch added at lower temperatures and we can see that the algorithm is using mean temperature rather than maximum temperature.
```{r}
variables <- varImp(fit_all)
variables %>% arrange(desc(Overall))
```
The rpart regression tree has the advantage of being able to provide the importance of variables within the model using a simple function, varImp(). Using this function we see that the mean and max temp features are top predictors followed by snow on ground. day_of_year is a predictor as well but less important than the temperature predictors. Precipitation had very little efffect whatsoever.
```{r}
min <- cor(train$MeanTemp, train$MaxTemp, use = "pairwise.complete.obs")
max <-cor(train$MeanTemp, train$MinTemp, use = "pairwise.complete.obs")
data.frame(min, max, row.names = "Cor with MeanTemp")
```
If we look at correlations between min and max temperature and mean temperature, we naturally the temps are highly correlated. Since regression tree suggested MeanTemp is the best predictor we will swap that in for the regression tree and examine it for the KNN approach.
```{r}
dayofyear <- cor(train$day_of_year, train$MeanTemp, use = "pairwise.complete.obs")
data.frame(dayofyear, row.names = "Cor with MeanTemp")
```
There is a relatively low correlation with day or year and mean temperature: .369. While weaker in terms of predictive power, the day of year can make a algorithm more robust as it is not highly correlated with temperature overall. Particularly this provide an advantage to the distance based KNN method.
```{r message=FALSE, warning=FALSE}
knn_meantemp <- train(count ~ day_of_year + MeanTemp + SnowonGrndcm, method = "knn",
data = train, na.action = na.omit, #omit NAs
tuneGrid = data.frame(k = seq(1, 50, 2)),
trControl = control)
knn_meantemp$bestTune
ggplot(knn_meantemp, highlight = TRUE) +ggtitle("KNN with MeanTemp")
```
While close, Maxtemp performs better than MeanTemp using KNN. We will use KNN and the rpart on the test set to get a sense of the algorithms true accuracy.
*FINAL MODELS AND RESULTS*
```{r}
fit_final <- rpart(count ~ MaxTemp + day_of_year + SnowonGrndcm, data = coby)
rtree_res <- predict(fit_final, newdata = test, type = "vector")
rtree <- RMSE(rtree_res, test$count)
knn_final <- train(count ~ day_of_year + MaxTemp + SnowonGrndcm, method="knn",
tune.grid = data.frame(k=25),
data = train, na.action = na.omit)
knn_res <- predict(knn_final, newdata=test, type = "raw")
knn_3feat <- RMSE(knn_res, test$count)
knn_final2 <- train(count ~ day_of_year + MaxTemp + SnowonGrndcm + MeanTemp +MinTemp + TotalRainmm + TotalSnowcm + TotalPrecipmm, method="knn",
tune.grid = data.frame(k=33),
data = train, na.action = na.omit)
knn_res2 <- predict(knn_final2, newdata=test, type = "raw")
knn_allfeat <- RMSE(knn_res2, test$count)
data.frame(rtree, knn_3feat, knn_allfeat, row.names = "RMSE")
```
The best performing model, KNN with all features, has an RMSE of 397.98. The regression tree does not have the same predictive power as KNN but its performance is better than a lot of other algorithms (bayesglm, rf, and lm) and has very attractive interpretability. From a City of Ottawa operations perspective, being able to provide staff a simple heuristic to estimate how many riders one can expect on a given day using the regression tree model would be very attractive and its performance is not that far off (+/-10%) from more sophisticated algorithms.
This model is based on one pathway bike counter in Ottawa. However we can see how well it applies on another winter bike counter in the Ottawa area to validate this approach.
*APPLYING MODEL TO ANOTHER BIKE RIDE COUNTER*
The Ottawa River pathway bike counter (ORPY) had the second most number of entries of all the bike counters in Ottawa.
```{r}
ORPY <- dat %>% filter(location_name == "ORPY", !is.na(SnowonGrndcm), !is.na(TotalSnowcm))
statistic <- c("min", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max.")
tibble(statistic, summary(ORPY$count), summary(coby$count))
```
We can see that on average the Ottawa River pathway has more riders, with higher median, mean and max number of rides than the Col By pathway. However rides appear more skewed with a greater distance between the mean and median.
```{r}
tibble(sd(ORPY$count), sd(coby$count))
```
The Ottawa river pathway also has a higher standard deviation, suggesting greater variability.
```{r}
set.seed(20201021, sample.kind = "Rounding")
ind <- createDataPartition(ORPY$count, p=.9, list = FALSE)#Partitioning ORPY ride data
train_o <- ORPY[ind,]
test_o <- ORPY[-ind,]
fit_final_o <- rpart(count ~ MaxTemp + day_of_year + SnowonGrndcm, data = train_o)#fitting ORPY regression tree
rtree_res_o <- predict(fit_final_o, newdata = test_o, type = "vector")
rtree_o <- RMSE(rtree_res_o, test_o$count) #RMSE of 618 - about 200 higher RMSE than the COBY,
knn_final_o <- train(count ~ day_of_year + MaxTemp + SnowonGrndcm + MeanTemp +MinTemp + TotalRainmm + TotalSnowcm + TotalPrecipmm, method="knn",
data = train_o, na.action = na.omit,
tuneGrid = data.frame(k = seq(1, 50, 2)), #best tune from all feature model
trControl = control)
knn_res_o <- predict(knn_final_o, newdata=test_o, type = "raw")
knn_o <- RMSE(knn_res_o, test_o$count)
data.frame(rtree_o, knn_o, row.names = "RMSE")
```
Applying the regression tree and KNN algorithms to ORPY results in a higher RMSE score for both algorithms than the results for COBY. However, results are similar in that the RMSE is equal to about about 1/2 a standard deviation of ORPY rides. (ORPY's standard deviation of ride counts was about 1200 and COBY's was about 800.) KNN remains the superior algorithm, however the regression tree is relatively close.
```{r}
plot(fit_final_o, compress = TRUE, margin = .1)
text(fit_final_o)
#snow on ground is no longer a determinant on the regression tree.
```
Plotting the regression tree for the Ottawa River pathway we can see that snow on the ground is no longer a determinant on the tree. Maximum temperature is the only predictor.
*CONCLUSION*
The results of applying the model approach to another counter suggests that while the overall approach for forecasting rides based on the weather could still be used across other counters, each counter has slightly different factors that impact the number of rides. Therefore some tuning and selection of optimized parameters would be required for each.
Potential ways to further improve the performance would be to find some way to model the effects of holidays or other significant annual festivals and events would have on bike traffic.
Also, a more complex model using previous year data as a baseline but would adjust for new urban developments that feed traffic to pathways, adjustments to public transit, etc., would likely provide improved accuracy - but is well beyond my capacity to do at this time.