-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathforest-fires-neportugal.Rmd
458 lines (342 loc) · 19.8 KB
/
forest-fires-neportugal.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
---
title: "Forest Fires NE Portugal"
author: "Shaurya Srivastava, Ishas Kekre, Anjali Viramgama"
date: "5/18/2020"
knit: (function(inputFile, encoding) { rmarkdown::render(inputFile, encoding = encoding, output_file = file.path(dirname(inputFile), 'docs/index.html')) })
output:
html_document:
theme: united
highlight: breezedark
number_sections: true
---
CMSC320 Final Project
![](forestfire.jpg)
# Introduction
Every year, there are many forest fires in regions all over the world. Recently, Australia had major fires, killing many different ecosystems and making some animals functionally extinct. Forest fires have the ablity to change the forest ecosystem vastly, for better or for worse. If we were able to predict forest fires and how much land they would burn, we can be better prepared. It would potentially be easier to control and people can take more precautions.
In this tutorial, we attempt to predict the burned area of forest fires, by using meteorological data in the Forest Fires Dataset provided by UCI's Machine Learning Repository. We also perform other data analysis to learn more about forest fires in Northeast Portugal.
# Dataset Used
The dataset used is from http://archive.ics.uci.edu/ml/datasets/Forest+Fires. It contains data about forest fires that have occurred at Northeast Portugal.
# Dataset Information
Dataset characteristics: Multivariate <br />
Attribute characteristics: Real <br />
Number of Instances: 517 <br />
Number of Attributes: 13 <br />
Area: Physical <br />
# Understanding the Dataset
The dataset has the following attributes: <br />
1. X - x-axis spatial coordinate <br />
2. Y - y-axis spatial coordinate <br />
3. month - month of the year: 'jan' to 'dec' <br />
4. day - day of the week: 'mon' to 'sun' <br />
5. FFMC - FFMC index from the FWI system: 18.7 to 96.20 <br />
6. DMC - DMC index from the FWI system: 1.1 to 291.3 <br />
7. DC - DC index from the FWI system: 7.9 to 860.6 <br />
8. ISI - ISI index from the FWI system: 0.0 to 56.10 <br />
9. temp - temperature in Celsius degrees: 2.2 to 33.30 <br />
10. RH - relative humidity in %: 15.0 to 100 <br />
11. wind - wind speed in km/h: 0.40 to 9.40 <br />
12. rain - outside rain in mm/m2 : 0.0 to 6.4 <br />
13. area - the burned area of the forest (in ha): 0.00 to 1090.84 <br />
(this output variable is very skewed towards 0.0, thus it may make
sense to model with the logarithm transform). <br /> <br />
Here is what different components of the FWI system mean: <br />
FFMC- Fine Fuel Moisture Code, calculated using temperature, humidity, wind and rain. <br />
DMC - Duff Moisture Code, calculated using temperature, relative humidity and rain. <br />
DC - Drought Code, calculated using temperature and rain <br />
ISI - Initial Spread Index, calculated using wind. <br />
# Necessary Libraries for this Tutorial
These are the necessary libraries needed for reading and handling data, performing exploratory data analysis, hypothesis testing, and machine learning.
```{r libs, message=FALSE}
library(dplyr)
library(tidyr)
library(lubridate)
library(tidyverse)
library(caret)
```
# Data Preparation
## Load the Dataset
```{r load}
csv_file <- "forestfires.csv"
forest_fires <- read.csv(csv_file, stringsAsFactors = FALSE)
forest_fires %>% head()
```
## Modify Dataset as Needed
Normally, this means dropping any columns that are unnecessary and adding columns that will be useful for further data analysis. We add a column with log of areas for better naalysis and convert all the months, which are from Jan to Dec to their respective numbers (1 to 12). We also add a season column (Fall, Spring, Summer and Winter) based on the months, which we will use for Hypothesis testing, and we divide the burn area level into two parts, big and small, depending on it's log area.
### Add a column with the log of the areas. Drop the day column. Convert the month abbreviations to numbers
```{r modconvert}
forest_fires_mod <- forest_fires %>%
mutate(log_area = log(forest_fires$area + 1)) %>%
select(-day)
forest_fires_mod$month <- sapply(forest_fires_mod$month,function(x) grep(paste("(?i)",x,sep=""),month.abb))
```
### Add a column for the season
The season column is based on the month column.
```{r modseason}
forest_fires_mod <- forest_fires_mod %>%
mutate(season="winter")
forest_fires_mod$season[forest_fires_mod$month>2 & forest_fires_mod$month<6] <- "spring"
forest_fires_mod$season[forest_fires_mod$month>5 & forest_fires_mod$month<9] <- "summer"
forest_fires_mod$season[forest_fires_mod$month>8 & forest_fires_mod$month<12] <- "fall"
forest_fires_mod$season <- as.factor(forest_fires_mod$season)
```
### Add a column for the burn area level
There are two burn area levels, small and large. Burn area has a small level when log(area) <= 2 and a large level when log(area) > 2.
```{r modelevel}
forest_fires_mod <- forest_fires_mod %>%
mutate(burn_area_level="small")
forest_fires_mod$burn_area_level[forest_fires_mod$log_area>2] <- "large"
forest_fires_mod$burn_area_level <- as.factor(forest_fires_mod$burn_area_level)
```
### Display the first few rows of the new modified dataframe.
```{r head}
forest_fires_mod %>% head()
```
# Exploratory Data Analysis
## Histogram Showing Density of Forest Fires occurring over Months
We first try to find if there is any correlation between density of forest fires occurring over months.
``` {r hist}
forest_fires_mod %>%
ggplot(mapping=aes(x=month), breaks=100) +
ggtitle(" Histogram of Months") +
geom_histogram(aes(y=..density..),binwidth=1.5) +
geom_density(alpha=.2, fill="#FF6666") +
scale_x_continuous(name ="Months",
breaks=seq(1,12,1))
```
There seems to be a higher density between months July to October, i.e. in summer and fall at Northeast Portugal.
## Pie Chart Displaying the Percent of Total Fires per Season
To further confirm that forest fires occur more in the summer and fall, we plot a pie chat showing the percent of total fires per season.
``` {r pie}
t = nrow(forest_fires_mod)
s = nrow(forest_fires_mod %>% filter(season == 'summer'))
sp = nrow(forest_fires_mod %>% filter(season == 'spring'))
f = nrow(forest_fires_mod %>% filter(season == 'fall'))
w = nrow(forest_fires_mod %>% filter(season == 'winter'))
seasons <- data.frame(
group = c("Summer", "Spring", "Fall", "Winter"),
value = c(s/t, sp/t, f/t,w/t)
)
seasons %>%
ggplot(aes(x="", y=value, fill=group)) + geom_bar(stat="identity", width=1) +
coord_polar("y", start=0) + geom_text(aes(label = paste0(round(value*100), "%")), position = position_stack(vjust = 0.5)) +
labs(x = NULL, y = NULL, fill = NULL, title = " Seasonal Fires") +
theme_classic() + theme(axis.line = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank())
```
This clearly shows that the tendency of a forest fire breaking out in Summer and Fall is higher than the other 2 seasons for Northeast Portugal.
## Scatter Plot showing Log Burn Area vs Season
``` {r scatter-season}
forest_fires_mod %>%
ggplot(aes(x=season,y=log_area)) +
ggtitle(" Log Burn Area vs Season") +
labs(x="Season", y="Log Burn Area") +
geom_point()
```
This shows that summer and fall have high log burn areas that spring and winter. There is a very slight correlation with season and log burn area for Northeast Portugal.
## Scatter Plots showing Burn Area Level vs the Different FWI Indices colored by Season
We know that the FWI indices are calculated using combinations of other indices and Tempurature, Wind Speed, Relative Humidity, and Rain. We would like to observe whether any of these indices have a correlation with the Burn Area Level grouped by Season.
### Scatter plot for Burn Level vs FFMC colored by Season
``` {r scatter-ffmc}
forest_fires_mod %>%
ggplot(aes(x=FFMC,y=burn_area_level, color = season)) +
ggtitle(" Burn Level vs FFMC colored by Season") +
labs(x="FFMC", y="Burn Area Level") +
geom_point()
```
From this plot, it is observed that low FFMC correlates with small burn area level. FFMC has a very slight correlation with burn area level. Grouping by season doesn't change the correlation, so season is not an interaction for FFMC.
### Scatter plot for Burn Level vs DMC colored by Season
``` {r scatter-dmc}
forest_fires_mod %>%
ggplot(aes(x=DMC,y=burn_area_level, color = season)) +
ggtitle(" Burn Level vs DMC colored by Season") +
labs(x="DMC", y="Burn Area Level") +
geom_point()
```
We observe that DMC grouped by season has no noticeable correlation with burn area level. Grouping by season doesn't change the correlation, so season is not an interaction for DMC.
### Scatter plot for Burn Level vs DC colored by Season
``` {r scatter-dc}
forest_fires_mod %>%
ggplot(aes(x=DC,y=burn_area_level, color = season)) +
ggtitle(" Burn Level vs DC colored by Season") +
labs(x="DC", y="Burn Area Level") +
geom_point()
```
We observe that DC grouped by season has no noticeable correlation with burn area level. Grouping by season doesn't change the correlation, so season is not an interaction for DC.
### Scatter plot for Burn Level vs ISI colored by Season
``` {r scatter-isi}
forest_fires_mod %>%
ggplot(aes(x=ISI,y=burn_area_level, color = season)) +
ggtitle(" Burn Level vs ISI colored by Season") +
labs(x="ISI", y="Burn Area Level") +
geom_point()
```
We observe that ISI grouped by season has no noticeable correlation with burn area level. Grouping by season doesn't change the correlation, so season is not an interaction for ISI.
## Areas of Forest Most Affected by Forest Fires
Just for fun, we plot another graph to show what areas of the forest are most affected by forest fires, in X and Y coordinates.
``` {r map}
forest_fires_mod %>%
ggplot(aes(x=X, y=Y) ) +
ggtitle(" Forest Fire Map") +
stat_density_2d(aes(fill = ..level..), geom = "polygon") +
scale_x_continuous(name ="X-coord",
breaks=seq(1,9,1)) +
scale_y_continuous(name ="Y-coord",
breaks=seq(2,9,1))
```
# Hypothesis Testing
## First Hypothesis Test
For our first hypothesis test, we assumed that the probability of a forest fire occuring in every season is equal, i.e. it is 0.25 for summer, spring, winter and fall.
We test it out for Fall specifically, assuming p = 0.25
### H0: p = 0.25 Ha: p > 0.25 for fall
```{r hfall}
fall_sample <- forest_fires_mod %>% filter(season == 'fall')
xbar = nrow(fall_sample)/nrow(forest_fires_mod)
u = 0.25
sigma = 0.25*(1-0.25)/nrow(forest_fires_mod)
z = (xbar-u)/(sqrt(sigma))
pval = pnorm(z, lower.tail=FALSE)
print(paste0("xbar: ", xbar), quote = FALSE)
print(paste0("mean: ", u), quote = FALSE)
print(paste0("std: ", sqrt(sigma)), quote = FALSE)
print(paste0("z-score: ", z), quote = FALSE)
print(paste0("p-value: ", pval), quote = FALSE)
```
Based on the results, we reject $H_o$. $pvalue = 1.208e-9 < \alpha = 0.05$, meaning that the probability is within rejection level $\alpha$. Therefore, there is enough evidence to reject $H_o$ and show that the true proportion of fires occuring in fall per year is greater than 0.25.
## Second Hypothesis Test
For our second hypothesis test, we ran a test to see if the true mean temperature of forest fires in the area is 20 degrees. Since the sample mean was less than 20, we conducted a one-tailed lower t-test to see if the true mean could be 20.
### H0: u = 20 Ha: u < 20 for temp
```{r htemp}
xbar = mean(forest_fires_mod$temp)
u = 20
s = sd(forest_fires_mod$RH)
n = nrow(forest_fires_mod)
t = (xbar-u)/(s/sqrt(n))
pval = pt(t, df=n-1)
print(paste0("xbar: ", xbar), quote = FALSE)
print(paste0("mean: ", u), quote = FALSE)
print(paste0("std: ", s/sqrt(n)), quote = FALSE)
print(paste0("t-score: ", t), quote = FALSE)
print(paste0("p-value: ", pval), quote = FALSE)
```
Based on the results, there is not enough evidence to reject $H_o$. $pvalue = 0.06... > \alpha = 0.05$, meaning that the probability is greater than the rejection level $\alpha$. Therefore it is possible enough that the sample mean would occur, assuming the true mean temperature is 20 degrees.
## Third Hypothesis Test
For our third hypothesis test, we ran a test to see if the true mean area of forest fires in the area is 20 hectare units. For this test we are showing how to conduct a two-tailed t-test to see if the true mean could be 20.
### H0: u = 20 Ha: u != 20 for area
```{r harea}
xbar = mean(forest_fires_mod$area)
u = 20
s = sd(forest_fires_mod$area)
n = nrow(forest_fires_mod)
t = (xbar-u)/(s/sqrt(n))
pval = 2*pt(t, df=n-1)
print(paste0("xbar: ", xbar), quote = FALSE)
print(paste0("mean: ", u), quote = FALSE)
print(paste0("std: ", s/sqrt(n)), quote = FALSE)
print(paste0("t-score: ", t), quote = FALSE)
print(paste0("p-value: ", pval), quote = FALSE)
```
Based on the results, there is enough evidence to reject $H_o$. $pvalue = 0.01... < \alpha = 0.05$, meaning that the probability is less than the rejection level $\alpha$. Therefore it is extremely unlikey that the sample mean would occur, assuming the true mean temperature is 20 degrees. We have enough evidence to show that the true mean area is not 20.
# Machine Learning
Based on the attributes of the dataset, we know that the different FWI Indices provide different information of Forest Fires, and when combined, they tell fire intensity. There is no noticeable or strong correlation with any of the indices and burn area level. However, as the indices are used to calculate the fire intensity, we try to see if using these indices can predict burn area level.
## What are we trying to predict
Will a forest fire burn area level be large or small?
## Prepare the data set for the machine learning prediction task
Drop all the columns except for the FWI Indices, Season, and the Burn Area Level.
```{r ml-setup}
forest_learning <- forest_fires_mod %>% select(-month,-X,-Y,-area,-log_area,-rain,-temp,-RH,-wind)
forest_learning %>% head()
```
## Prediction Algorithm
This code will set up the cross validation experiment with knn algorithm from the Caret Library. It utilizes repeatedcv, creating 5 folds and repeating the experiment 5 times. The experiment provides ROC values and a method to calculate the AUROC values.
``` {r ml-alg, message=FALSE, warning=FALSE}
set.seed(1234, sample.kind = "Rounding")
# create the cross-validation partition
cv_partition <- createFolds(forest_learning$burn_area_level,k=5)
# setup training parameters
fit_control <- trainControl(
method = "repeatedcv",
number = 5,
repeats = 5,
#indexOut = cv_partition,
summaryFunction=twoClassSummary,
classProbs=TRUE,
savePredictions=TRUE)
# a function to obtain performance data
# (tpr and fpr) over the given cross validation
# partitions, for the number of trees in the
# random forest
get_roc_data <- function(cv_partition, final_df) {
mean_fpr <- seq(0, 1, len=100)
aucs <- numeric(length(cv_partition))
# iterate over folds
res <- lapply(seq_along(cv_partition), function(i) {
# train the random forest
fit <- train(burn_area_level~.,
data = final_df[-cv_partition[[i]],], # all but the holdout set
method = "knn",
trControl = fit_control,
preProcess = c("center","scale"),
tuneLength = 20,
metric="ROC")
# make predictions on the holdout set
preds <- predict(fit, newdata=final_df[cv_partition[[i]],],type="prob")$large
# compute tpr and fpr from the hold out set
perf <- ROCR::prediction(preds, final_df$burn_area_level[cv_partition[[i]]]) %>%
ROCR::performance(measure="tpr", x.measure="fpr")
fpr <- unlist(perf@x.values)
tpr <- unlist(perf@y.values)
# interpolate the roc curve over 0, 1 range
interp_tpr <- approxfun(fpr, tpr)(mean_fpr)
interp_tpr[1] <- 0.0
# collect values for this fold
data_frame(fold=rep(i, length(mean_fpr)), fpr=mean_fpr, tpr=interp_tpr)
})
# combine values across all folds
# into a single data frame
do.call(rbind, res)
}
# calculate area under the ROC curve
# from tpr and fpr values across folds
compute_auc <- function(curve_df) {
curve_df %>%
group_by(fold) %>%
summarize(auc=pracma::trapz(fpr, tpr))
}
```
## Execute the prediction
We will get performance data from using our machine learning algorithm.
``` {r ml-execution, message=FALSE, warning=FALSE}
curve_df <- get_roc_data(cv_partition,forest_learning)
auc_df <- compute_auc(curve_df)
```
## Plot AUROC
```{r}
ggplot(auc_df, aes(x=fold, y=auc)) +
geom_point()
```
This tells us the area under the ROC curve for each fold. The AUROC values are on the lower end for each fold, mostly less than 0.5. The low values convey that the model used isn't good for the prediction of the Burn Area Level.
## Plot ROC Curve
```{r}
curve_df %>%
group_by(fpr) %>%
summarize(tpr = mean(tpr)) %>%
ggplot(aes(x=fpr, y=tpr)) +
geom_line() +
labs(title = " ROC curves",
x = "False positive rate",
y = "True positive rate")
```
## Machine Learning Conclusion
The ROC Model is a straight line, meaning that this model/algorithm is not good for predicting the Burn Area Level.
## Tutorial Conclusion
In this tutorial, we went over how to perform data analysis on the Forest Fires Northeast Portugal dataset, provided by the UCI database. We performed Exploratory Data Analysis, Hypothesis Testing, and Machine Learning. These are all important tools of analysis to get a better understanding of the data.
From the Exploratory Data Analysis, we learn that the FWI indices do not have much of a correlation with Burn Area Level, except for FFMC which has a very small correlation. It is also learned that season has a slight correlation with Burn Area Level. We also learn about what areas of Northeast Portugal do Forest Fires occur and that they mostly happen in Spring and Summer.
From the Hypothesis Testing, we showed different techniques to perform hypothesis tests on samples of data. We showed how to come up with null and alternate hypothesis, conduct tests on the sample data, and how to understand and analyse those tests. In real life samples, there are a lot of unknowns, and these hypothesis tests are important in understanding data and making assumptions about your data.
From the Machine Learning, we learn that our model isn't good at predicting the Burn Area Level. In fact, with further analysis of the data, we realize that possibly no traditional machine learning algorithm can accurately predict Burn Area Level. We should look into Neural Networks or other machine learning algorithms. We could probably use this dataset and traditional machine learning algorithms to predict another attribute other than Burn Area Level, such as what season did a forest fire occur based on the FWI Indices and other meteorological data.
# References
Dataset - [Cortez and Morais, 2007] P. Cortez and A. Morais. A Data Mining Approach to Predict Forest Fires using Meteorological Data. In J. Neves, M. F. Santos and J. Machado Eds., New Trends in Artificial Intelligence, Proceedings of the 13th EPIA 2007 - Portuguese Conference on Artificial Intelligence, December, Guimarães, Portugal, pp. 512-523, 2007. APPIA, ISBN-13 978-989-95618-0-9. Available at: [http://archive.ics.uci.edu/ml/datasets/Forest+Fires]
To Learn Additional Data Science Concepts - https://www.hcbravo.org/IntroDataSci/bookdown-notes/
Tutorial on ggplot by r-statistics.co - http://r-statistics.co/Complete-Ggplot2-Tutorial-Part1-With-R-Code.html
Machine Learning using caret - http://www.rebeccabarter.com/blog/2017-11-17-caret_tutorial/
Forest Fire Image - https://e360.yale.edu/assets/site/_1500x1500_fit_center-center_80/Washington-DNR-Chiwaukum-WildFires-2014-2015_WA-DNR_cropped.jpg