---
title: "<span style='color: #982568; font-size:42px'>OpenAirQ Toolit</style>"
author: "Developed for the Partnership for Healthy Cities with support from Bloomberg Philathropies. Last Updated : `r Sys.Date()`"
output: bookdown::gitbook
documentclass: book
---
# Introduction {-}
This toolkit provides an **introduction to GIS and spatial data analysis** for air quality analysis applications, allowing researchers, policymakers, analysts, and practitioners to gain data-driven air quality insights in their local community.
## Software Basics {-}
Tutorials assume that R and RStudio are already installed on your device. Luckily, this toolkit is compatible with Windows, macOS, and Linux systems. Basic familiarity with R is required for these toolkits. You should know how to change the working directory, install new packages, load packages with `library()`, and comfortably navigate between folders on your computer. Additionally, an internet connection will be required for some tutorials.
If you are new to R, we recommend these <a href="https://learn.datacamp.com/courses/free-introduction-to-r">intro-level tutorials</a> and this <a href="https://rspatial.org/intr/1-introduction.html">introduction with installation guidance</a>. You can also refer to this <a href="https://datacarpentry.org/r-socialsci/">R for Social Scientists</a> tutorial developed by Data Carpentry for a refresher.
Before we begin, install the following packages for data wrangling and spatial data analysis; a one-line install command is shown after this list.
* `sf`
* `sp`
* `tmap`
* `dplyr`
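All four packages are available on CRAN; a minimal install sketch (run once, skipping any packages you already have) might look like this:
```{r, eval = FALSE}
# Install the packages used throughout this toolkit (only needs to be run once)
install.packages(c("sf", "sp", "tmap", "dplyr"))
```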
## Author Team {-}
This toolkit was developed for the [Partnership for Healthy Cities](https://partnershipforhealthycities.bloomberg.org/) by Marynia Kolak, Isaac Kamber, Lorenz Menendez, Haowen Shang, Yuming Liu, and Jizhou Wang at the [Center for Spatial Data Science](https://spatial.uchicago.edu/) at the University of Chicago with support from [Bloomberg Philanthropies](https://www.bloomberg.org/).
## Acknowledgements {-}
This research was supported by TBD, add any legal disclaimers or other sponsor verbiage here.
***
<!--chapter:end:index.Rmd-->
# Vector Data Mapping
## Required Packages
* `tmap`: Flexible thematic mapping
* `sf`: Spatial vector data manipulation
* `dplyr`: `data.frame` manipulation
```{r, include = FALSE}
library(tmap)
library(sf)
library(dplyr)
```
`tmap` will be set to interactive mode for this tutorial.
```{r}
tmap_mode('view')
```
## Political Boundaries
Air quality modeling occurred over the Chicago metropolitan area. This geography consists of 21 individual counties within the states of Illinois, Indiana, and Wisconsin.
**Read in County Boundaries**
```{r}
counties = sf::st_read('./data/LargeAreaCounties/LargeAreaCounties.shp')
```
**Plot with `tmap`**
```{r}
tm_shape(counties) +
tm_borders() +
tm_text("COUNTYNAME", size = 0.7, along.lines = TRUE) +
tm_fill(col = "STATE", alpha = 0.5)
```
Air modeling data, however, is collected at a larger spatial scale to account for region-wide effects that could affect the 21-county study area. Four midwestern states (Illinois, Indiana, Michigan, and Wisconsin) were chosen as the data collection area.
```{r}
states = sf::st_read('./data/FourStates/FourStates.shp')
```
```{r}
tm_shape(states) +
tm_borders() +
tm_text("NAME", size = 0.8, auto.placement = TRUE) +
tm_fill(col = "NAME", alpha = 0.5) +
tm_shape(counties) +
tm_fill(col = "black", alpha = 0.25) + tm_borders(col = "black", alpha = 0.25)
```
```{r, include = FALSE}
basemap = tm_shape(states) +
tm_borders() +
tm_text("NAME", size = 0.8, auto.placement = TRUE) +
tm_fill(col = "NAME", alpha = 0.5) +
tm_shape(counties) +
tm_fill(col = "black", alpha = 0.25) + tm_borders(col = "black", alpha = 0.25)
```
## Ground-based Data Sensor Locations
### EPA Particulate Matter Sensors
Over 127 PM2.5 pollution monitoring stations are located across the four-state data collection area. The map below shows the distribution of these point locations. More information on the data collection methods and the output data from each sensor will be discussed later on.
```{r}
sensors = sf::st_read('./data/PM25_4States_2014.2018_POINTLOCS.geojson') %>% dplyr::mutate(rec_duration = as.numeric(lastRec - firstRec))
```
```{r}
basemap +
tm_shape(sensors) +
tm_markers(shape = tmap_icons('https://github.com/GeoDaCenter/OpenAirQ-toolkit/blob/master/data/assets/EPA_logo.png?raw=true'))
```
### Weather Stations
High temporal resolution weather data was sourced from a large network of ground-based weather stations co-located at many airports. They provide data at regular one hour intervals on variables such as temperature, pressure, wind velocity, and wind direction. The map below describes the distribution of sensors in the four-state data collection area.
```{r}
asos = sf::st_read('./data/4States_ASOS_2018_Locations.geojson')
```
```{r}
basemap +
tm_shape(asos) +
tm_markers(shape = tmap_icons('https://github.com/GeoDaCenter/OpenAirQ-toolkit/blob/master/data/assets/airport_icon.png?raw=true'))
```
## Point Sources of Pollution
The EPA National Emissions Inventory (NEI) dataset (2014) describes the location of large sources of pollution, such as powerplants, factories, and other industrial buildings.
```{r}
points.pollution = sf::st_read('./data/Point_Source_Emissions_4States_2014.geojson')
```
```{r}
basemap +
tm_shape(points.pollution) +
tm_markers()
```
<!--chapter:end:02-vectormapping.Rmd-->
# Data Prep & Management
This project uses weather and pollution data from remotely sensed satellite imagery as well as ground-based sensors maintained by the EPA and FAA to model air quality in the Midwest. Using the ground sensors, the team can attempt to predict pollution levels based on satellite data. This chapter focuses on how weather and pollution data from ground sensors were downloaded and prepared for use in refining the prediction.
## EPA Pollution Data
```{r, echo=FALSE, fig.cap="EPA Pollution Monitoring Site (EPA.gov)"}
knitr::include_graphics("https://archive.epa.gov/pesticides/region4/sesd/pm25/web/jpg/air-monitoring-site.jpg")
```
EPA data was seamlessly imported into R using the **aqsr** package by [Joshua P. Keller](https://github.com/jpkeller) at Colorado State University. The package takes advantage of the [EPA AQS DataMart API](https://aqs.epa.gov/aqsweb/documents/data_mart_welcome.html) to load data into R as data.frame objects with only a couple of lines of code. It allows users to query for sensor data across multiple air quality variables, geographies, and timeframes. Let's get started by downloading the package.
```{r download.epadata, message = FALSE, warning = FALSE}
# devtools::install_github("jpkeller/aqsr")
library(aqsr)
```
### Getting Started
This section describes the process for querying EPA sensor data using the **aqsr** package. For more information on how each function works, please reference the package documentation.
#### Obtaining an API Key {-}
For first-time users of the AQS DataMart API, you must first register your email to receive an API key. (Users who already have a DataMart API key, please skip to the next step.) The API key is a required input for all querying functions in the **aqsr** package. Obtaining a key is as simple as calling the ```aqs_signup()``` function and supplying your own email address.
```{r API.signup, eval = FALSE}
aqs_signup('YourEmailHere@uchicago.edu')
```
Save your API key from the email confirmation for future reference. In case you don't receive an email, verify that your email address was typed correctly, and check your spam folder.
#### Using your API Key in **aqsr**
Set up your API key with the **aqsr** package using the ```create_user()``` function. This way, you won't have to keep typing your email and API key each time you query for data.
```{r eval=FALSE}
myuser = create_user(email = 'YourEmailHere@uchicago.edu', key = 'apikey123')
```
```{r API.details, include=FALSE}
myuser = create_user('lmenendez@uchicago.edu', 'tealmouse67')
```
### PM2.5 Data Query
This section describes how to query for PM2.5 concentration data from EPA pollution sensors. For our project, we are looking for PM2.5 data in Wisconsin, Illinois, and Indiana between 2014 and 2018. First, let's start small and query only for Illinois data for the first week of 2018.
```{r}
IL.data = aqs_dailyData_byState(aqs_user = myuser, # Previously defined user email and API key
param = 88101, # EPA AQS Parameter Code for PM2.5
bdate = "20180101", # Starting Date (Jan 1st, 2018)
edate = "20180107", # Ending Date (Jan 7th, 2018)
state = "17") # State FIPS Code for Illinois
```
```{r echo=FALSE}
knitr::kable(IL.data[1:5, 1:10])
```
The output data frame includes many fields describing each PM2.5 observation, including spatial data for the sensor's location. We will return to these details later in our data wrangling process. The next code chunk shows how to query PM2.5 data across our three states and five years (2014-2018).
```{r, eval = FALSE}
library(dplyr)
# List of States to Iterate Through
states = c("17", "18", "55")
# Matrix of Start Dates and End Dates to Iterate Through
dates = matrix(c("20140101", "20141231", "20150101", "20151231", "20160101", "20161231", "20170101", "20171231", "20180101", "20181231"), ncol = 2, byrow = TRUE)
# Leveraging apply functions to iterate through both states and dates
full.data = lapply(states, function(x){
mapply(aqs_dailyData_byState,
bdate = dates[,1],
edate = dates[,2],
MoreArgs = list(aqs_user = myuser,
param = 88101,
state = x),
SIMPLIFY = FALSE
) %>%
do.call("rbind", .)
}) %>%
do.call("rbind", .)
```
***
## FAA Weather Data
```{r, echo=FALSE, fig.cap="An ASOS Observation Station in Elko, NV. (Wikimedia Commons)"}
knitr::include_graphics("https://upload.wikimedia.org/wikipedia/commons/thumb/1/19/2008-07-01_Elko_ASOS_viewed_from_the_south_cropped.jpg/1280px-2008-07-01_Elko_ASOS_viewed_from_the_south_cropped.jpg")
```
FAA weather data gathered from the [Automated Surface Observing System (ASOS)](https://en.wikipedia.org/wiki/Automated_airport_weather_station) can be imported using the **riem** package. This package, created by [ROpenSci](https://ropensci.org/), queries weather data from the [Iowa Environmental Mesonet](https://mesonet.agron.iastate.edu/ASOS/), an online portal for international ASOS data maintained by Iowa State University. First, let's load the package.
```{r download.riem}
# devtools::install_github('ropensci/riem')
library(riem, quietly = TRUE)
```
### Sample Query
Below is an R code snippet that performs the simplest weather data query possible in the **riem** package. It specifies a particular weather station using an airport code and a date range to query for. The output is a tibble table of raw ASOS weather data. The code snippet below extracts sensor data at the San Francisco International Airport.
``` {r simple.riem.query}
SFO.weather = riem_measures(station = 'KSFO', date_start = "2014-01-01", date_end = '2014-01-02')
```
```{r echo=FALSE}
knitr::kable(SFO.weather[1:5, c(1,2,5,6,7, 8)])
knitr::kable(head(SFO.weather)[c('station', 'valid', 'tmpf', 'dwpf', 'alti', 'vsby')])
```
The output table shows weather data for a 24-hour period on January 1st, 2014 at the San Francisco International Airport. The `valid` column specifies when each weather report was generated, typically at 1-hour intervals. The `tmpf` and `dwpf` columns give the ambient air temperature and dew point in Fahrenheit (ºF). Other important variables in our project include air pressure (`alti`), measured in inches of mercury (in. Hg), and visibility (`vsby`) in miles. For more information on all available variables, see Iowa State's [Documentation](https://mesonet.agron.iastate.edu/request/download.phtml).
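As a quick illustration of working with these columns (the Celsius conversion here is our own addition, not part of the ASOS output), you could keep just the key variables and add a Celsius temperature field with `dplyr`:
```{r, eval = FALSE}
library(dplyr)
# Keep the key weather variables and convert temperature from Fahrenheit to Celsius
SFO.weather %>%
  select(station, valid, tmpf, dwpf, alti, vsby) %>%
  mutate(tmpc = (tmpf - 32) * 5 / 9) %>%
  head()
```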
Next, we will apply this function at a larger scale across multiple sensors and timescales.
### Finding ASOS Sensors
The FAA collects weather data at hourly intervals for each meteorological station, with some stations providing half-hour intervals. Even querying for short periods of time can yield large amounts of data. To optimise performance, we want to only query data from stations in our study area.
#### Finding Sensors by State {-}
In our project, we focus on certain counties in Illinois, Indiana, and Wisconsin, so we are interested in finding the sensors within that study area. The first step is to query the locations of all weather stations in the three states using the **riem** package. In the example below, we query for sensors in the Illinois ASOS sensor network.
```{r IL.query}
IL.stations = riem_stations(network = 'IL_ASOS')
```
```{r echo=FALSE}
knitr::kable(head(IL.stations))
```
To query for data across multiple states, we are going to apply the `riem_stations` function to a list of weather station networks, as shown below.
```{r IL.IN.WI.query, message = FALSE}
networks = list('IL_ASOS', 'IN_ASOS', 'WI_ASOS')
library(dplyr, quietly = TRUE)
station.locs = lapply(networks, riem::riem_stations) %>%
do.call(rbind, .) # Creates a single data table as output
```
Note: You can find a list of state abbreviations by typing `state.abb` in your R console.
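Because network names follow the pattern `XX_ASOS`, they can also be built directly from state abbreviations; a minimal sketch:
```{r, eval = FALSE}
# Build ASOS network names for our three states of interest
paste0(c("IL", "IN", "WI"), "_ASOS")
#> [1] "IL_ASOS" "IN_ASOS" "WI_ASOS"
```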
#### Converting Latitude and Longitude Coordinates to Spatial Data {-}
The data tables returned by the **riem** package must be converted to spatial data to determine which sensors are located in the study area. Since the lon/lat coordinates are already provided, the data table is easily converted to a spatial `sf` object.
```{r csv.to.spatial}
station.locs.sf = sf::st_as_sf(station.locs, coords = c("lon", "lat"), crs = 4326)
# Plot stations and study area boundaries to verify that the correct sensors were selected
plot(station.locs.sf$geometry)
plot(sf::st_read('https://uchicago.box.com/shared/static/uw0srt8nyyjfqo6l0dv07cyskwmv6r50.geojson', quiet = TRUE)$geometry, border = 'red', add = TRUE)
```
We plot the results to verify that our query and data conversion process worked correctly. For reference, the boundary of the study area is outlined in red.
#### Selecting Sensors within our Study Area {-}
Next, we perform a spatial join to only keep the points located within the boundaries of our study area polygons. The spatial join is completed by the **sf** package, as shown below. For more information regarding spatial joins and spatial predicates, please see [this](https://gisgeography.com/spatial-join/) helpful blog post by GISgeography.com.
```{r sensor.join, message = FALSE}
# Loading study area boundaries
study.area = sf::st_read('https://uchicago.box.com/shared/static/uw0srt8nyyjfqo6l0dv07cyskwmv6r50.geojson', quiet = TRUE)
study.sensors = sf::st_join(station.locs.sf, study.area, left = FALSE)
# Verify Spatial Join by Plotting
plot(study.area$geometry, border = 'red')
plot(study.sensors$geometry, add = TRUE)
title('Weather Stations Within the Study Area')
```
Now that we have a dataset of the weather stations we are interested in, we can query for the weather data associated with each station.
### Weather Data Query
Again we use the `lapply` function in base R to execute the `riem_measures` function on a list of sensor IDs. This allows us to iteratively query for weather data from each individual sensor in a list. In the code snippet below, we take the study sensors obtained previously and query for a single day's worth of weather data.
```{r multi.locs.query, message=FALSE, warning=FALSE}
library(dplyr, quietly = TRUE)
weather.data = lapply(study.sensors$id, function(x){riem::riem_measures(x, date_start = "2014-01-01", date_end = "2014-01-02")}) %>%
do.call(rbind, .) # Creates a single data table as output
```
```{r echo=FALSE}
knitr::kable(weather.data[1:5, 1:5])
```
#### Querying Full Weather Dataset {-}
Use caution when querying for a large amount of data. Data tables can easily become unwieldy after querying for a large number of weather stations across a wide time scale. The code snippet below downloads all ASOS weather data for sensors in our study area from January 1st, 2014 to December 31st, 2018, which is our study time period. It has approximately 4.8 million records and takes 6-10 minutes to download.
```{r large.query, eval = FALSE}
weather.data = lapply(study.sensors$id, function(x){riem::riem_measures(x, date_start = "2014-01-01", date_end = "2018-12-31")}) %>%
do.call(rbind, .) # Creates a single data table as output
```
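If you do run the full query, it is worth caching the result on disk so it only has to be downloaded once. A minimal sketch using base R serialization (the file path here is just an example, not part of the original workflow):
```{r, eval = FALSE}
# Save the downloaded weather data so it does not need to be re-queried
saveRDS(weather.data, "./data/asos_weather_2014_2018.rds")
# Reload it in a later session
weather.data <- readRDS("./data/asos_weather_2014_2018.rds")
```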
<!--chapter:end:03-weatherdata.Rmd-->
# Point Sensors to Surfaces
## Introduction
This chapter will introduce how to convert point sensors to surfaces. In it, we will work with the CSV file for the 2017 National Emissions Inventory, downloadable from the EPA's website [here](ftp://newftp.epa.gov/air/nei/2017/data_summaries/2017v1/2017neiApr_facility_process_byregions.zip). If you wish to follow along with this chapter, please download the dataset now. If you have a specific area of interest and would like to skip the data wrangling in R, you can download state-specific and pollutant-specific summaries from the NEI website.
We will begin with a brief review of the basics of data wrangling and filter the relatively large CSV file down to the considerably smaller subset of the data with which we are concerned. Then, we will reinforce the data visualization skills covered in a previous chapter by mapping the point locations of emissions sources in our study area. Finally, we will transition into the process of creating a continuous surface in the form of a Kernel Density Estimation (KDE) of PM2.5 point emission source density in Cook County, Illinois.
By the end of this tutorial you will be able to:
* Understand and wrangle National Emissions Inventory data
* Use the sp package in R
* Generate a Kernel Density Estimation using real data
## Loading the Required Packages
To process our data and create a Kernel Density Estimation, we will need the following packages:
* tidyverse (data wrangling)
* sp (Spatial data manipulation/analysis)
* rgdal (Spatial data)
* tmap (Spatial data visualization)
* spatialEco (Creating KDE)
If you do not already have any of these packages installed, you will want to install them using `install.packages("*PackageName*")`. Once they are installed, we are going to load the required packages:
```{r, nei.xpackage.load, results='hide', warning=FALSE, message=FALSE}
library(tidyverse)
library(sp)
library(rgdal)
library(spatialEco)
library(tmap)
library(readr)
```
## Read and Examine the Data
Now that we have loaded our required packages, we will read in our National Emissions Inventory CSV. After unzipping the folder downloaded from the EPA, you will have two files: "process_12345.csv" and "process678910.csv". For the purposes of this chapter, we will only need "process_12345.csv". This file is quite large, so be aware that it may take 30 seconds or so to load.
```{r, nei.load, warning=FALSE, message=FALSE}
nei.data <- readr::read_csv("./data/process_12345.csv.zip")
```
Having successfully read our data into the R environment, let's take a second to examine it.
```{r, nei.examine, warning=FALSE, message=FALSE}
nrow(nei.data)
names(nei.data)
```
As we can see, the dataset is huge, with over 3 million observations and 53 attributes. None of the existing spatial data packages in R are well equipped to handle a dataset of this size. Luckily, we are only interested in a small subset of the data -- PM2.5 emissions sources in Illinois, Michigan, Wisconsin, and Indiana.
## Data Wrangling
As a reminder, this dataset contains data for many pollutants across the entire United States. Looking at the code snippet above, we can see that the tibble contains columns for state abbreviations and pollutant descriptions, the two fields on which we want to filter. First, let's filter our tibble to only those observations within our state, Illinois. We are going to use the `filter()` function from the dplyr package (included in the tidyverse).
```{r, nei.state.filter}
state.abbr <- c("IL")
state.nei <- nei.data %>%
filter(state %in% state.abbr)
nrow(state.nei)
```
With that, we're already down to *just* 386,338 emissions sources. While we still have a ways to go with our filtering, this is certainly progress. Let's take a second to look back over what we just did.
The second line of this code uses the pipe (`%>%`) operator to *pipe* the complete NEI dataset into the `filter()` function covered in an earlier chapter.
`%in%` is an infix operator that matches the items of the first vector (the complete list of state abbreviations for all point emissions sources) against those of the second (the state abbreviations for the states of interest).
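A tiny standalone example of `%in%`:
```{r, eval = FALSE}
# Returns TRUE for each element of the first vector that appears in the second
c("IL", "CA", "WI") %in% c("IL", "IN", "WI")
#> [1]  TRUE FALSE  TRUE
```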
This code is written this way to allow this code to be used for larger, multistate study areas. If you are interested in examining points from multiple states, simply add their abbreviations to the `state.abbr` vector. If you are only using one state, feel free to simplify the code to your liking. We are next going to filter our data down further to include only those points within Cook County, IL.
```{r, nei.county.filter}
county.names <- c("Cook")
county.nei <- state.nei %>%
filter(county %in% county.names)
nrow(county.nei)
```
Let's finish filtering our data by restricting our results to only those emissions sources emitting PM2.5. We will first examine the different labels for pollution descriptions using the `unique()` function. We will then filter our dataset for only those labels that seem related to PM2.5 using the same process as above.
```{r, nei.pm.filter}
unique(county.nei$`pollutant desc`)
pm25.names <- c("PM2.5 Filterable", "PM2.5 Primary (Filt + Cond)")
county.pm25 <- county.nei %>%
filter(`pollutant desc` %in% pm25.names)
nrow(county.pm25)
```
Now, with a manageable number of observations in our area of interest, we are going to start looking at our data spatially.
## Creating a Spatial Object using sp
We first want to use our filtered tibble to create an sp Spatial Points object.
```{r, nei.sp.create}
#Assign the proper coordinates
coordinates(county.pm25) <- county.pm25[,c("site longitude","site latitude")]
#Assign the proper projection for this data source (EPSG 4326)
proj4string(county.pm25) <- CRS("+init=epsg:4326")
#Check data with basic plot
plot(county.pm25)
```
With everything looking as it should, let's look back on what we just did. We initialized the Spatial Points object using the `coordinates()` function, assigning the proper longitude and latitude from the dataset. We then used the `proj4string()` function to assign the correct Coordinate Reference System (CRS) to our data. Be careful not to use the wrong projection (check your data source). If you need to transform the projection of your dataset, use the `spTransform()` function. Let's now briefly review data visualization with the tmap package using this point data.
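For instance, if your analysis required a projected CRS, a reprojection might look like the sketch below (the target EPSG code, 3435 for Illinois East State Plane, is only an illustrative choice and not part of this workflow):
```{r, eval = FALSE}
# Illustrative only: reproject the points to Illinois East State Plane (EPSG 3435)
county.pm25.proj <- spTransform(county.pm25, CRS("+init=epsg:3435"))
proj4string(county.pm25.proj)
```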
## Data Visualization Review
Here, we will use the spatial data visualization skills learned in an earlier chapter to visualize the point locations of PM2.5 sources in Cook County.
```{r, nei.point.dataviz}
#Read in Cook County boundaries using rgdal's readOGR function
cook.county <- readOGR("./data/CookCounty.geojson")
#Check projection
proj4string(cook.county)
#Create tmap plot
tm_shape(cook.county) +
tm_borders() +
tm_shape(county.pm25) +
tm_dots()
```
This is clearly a very basic plot of the data. We can get a rough idea of where the point density may be highest, but we cannot tell much else about the data. Let's now create an interactive map with the dots colored and sized based on the volume of (self-reported) emissions at each point location.
```{r, nei.point.interactive.dataviz, warning=FALSE, message=FALSE}
#Set tmap mode to view
tmap_mode("view")
tm_shape(cook.county) +
tm_borders() +
tm_shape(county.pm25) +
tm_bubbles(col = "total emissions",
alpha = 0.3,
size = "total emissions",
style = "fisher")
```
Here, we used the `tmap_mode()` function to change the map style to interactive viewing and changed the arguments of the `tm_bubbles()` function to change the appearance of the point locations. Let's now construct a continuous surface Kernal Density Estimation from our point data.
## Constructing a Kernel Density Estimation
A Kernel Density Estimation (KDE) map at its most basic level is, as the name suggests, a means of representing the density of features over a given area. The term heatmap is often used interchangeably with KDE. Constructing a KDE gives us a continuous surface from discrete point data, which is useful both as an end product and as an input to a model that requires continuous surfaces. Each cell of the constructed raster is assigned a value based on the estimated density of points in that part of the map. This value can either be entirely unweighted (based solely on the number of points in an area) or weighted on a given variable (points with higher values for that variable will make an area appear denser). There are countless online resources available for learning more about the mathematics and history of KDE.
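For reference, one common form of a two-dimensional weighted kernel density estimate at a location $s$ is

$$\hat{f}(s) = \frac{1}{\sum_{i=1}^{n} w_i} \sum_{i=1}^{n} \frac{w_i}{h^2} \, K\!\left(\frac{\lVert s - s_i \rVert}{h}\right),$$

where the $s_i$ are the observed point locations, the $w_i$ are optional weights (all equal to 1 in the unweighted case), $h$ is the bandwidth, and $K$ is the kernel function; exact normalization conventions vary between implementations.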
Let's now create our first KDE from the point data we've been using. We are going to be using the `sp.kde()` function from the spatialEco package; however, there are several other R packages that achieve a more or less identical outcome.
```{r, nei.kde.unweighted, message=F, warning=F}
#Construct KDE
county.kde <- sp.kde(county.pm25, nr=500, nc=500)
plot(county.kde)
```
We've now produced a continuous surface representing the density of PM2.5 emissions sources across Cook County. Let's look over the `sp.kde()` function in a little more detail. In addition to inputting our sp object, we also input values of 500 for the `nr` and `nc` arguments. These abbreviations are short for "number of rows" and "number of columns" respectively. The `sp.kde` function creates a grid on which to map the results of the KDE, and these arguments tell the function what the dimensions of this grid should be. Let's look at how changing these two arguments changes the appearance of our KDE map:
```{r, nei.kde.resolution.changes, message=F, warning=F}
#10x10 grid
county.kde.10 <- sp.kde(county.pm25, nr=10, nc=10)
plot(county.kde.10)
#100x100 grid
county.kde.100 <- sp.kde(county.pm25, nr=100, nc=100)
plot(county.kde.100)
#500x500 grid
county.kde.500 <- sp.kde(county.pm25, nr=500, nc=500)
plot(county.kde.500)
```
Note the changes in the plots as the resolution of the grid is increased. Let's now look at the `y` argument, which adds a weight to the KDE. Say we wanted to weight our KDE by the amount of total emissions from individual sites. Here's how you would do that:
```{r, nei.kde.weighted, message=F, warning=F}
#Construct weighted KDE
county.kde.weighted <- sp.kde(county.pm25, y=county.pm25$`total emissions`, nr=500, nc=500)
plot(county.kde.weighted)
```
As you can see, weighting the KDE by the total emissions amount dramatically changes the map. These changes can be de-emphasized or accentuated if you transform the weighting variable.
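For example, one way to dampen the influence of a few very large emitters is to log-transform the weighting variable before passing it to `sp.kde()`; this is a sketch of the idea, not part of the original workflow:
```{r, eval = FALSE}
# Sketch: log-transform the weight so extreme emitters dominate the surface less
county.kde.log <- sp.kde(county.pm25, y = log1p(county.pm25$`total emissions`), nr = 500, nc = 500)
plot(county.kde.log)
```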
If you are interested in reading more about the arguments of this function check out its [R Documentation](https://www.rdocumentation.org/packages/spatialEco/versions/1.3-2/topics/sp.kde) page.
<!--chapter:end:04-ToolkitPointsToSurfaces.Rmd-->
# Interpolation Models
This tutorial demonstrates how to compare common interpolation models empirically to select the one that seems most appropriate for a given spatial extent. It follows the steps provided in an R-Spatial [tutorial](https://rspatial.org/raster/analysis/4-interpolation.html#calfornia-air-pollution-data) on interpolating pollution variables. The interpolation models considered are Voronoi polygons, nearest neighbor interpolation, inverse distance weighting (IDW), and finally kriging. The optimal model is the one with the lowest RMSE compared to all other models. The models are also evaluated against a "null model", where the mean value is assigned to all grid cells.
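As a reminder, the RMSE of a set of predictions is defined as

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2},$$

where $\hat{y}_i$ is the predicted value and $y_i$ the observed value at location $i$; this matches the `RMSE()` helper function defined later in this chapter.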
## Example: Interpolating Average Temperature across the 21-county study area.
This section describes how an interpolation model was selected to interpolate average temperature from airport weather stations in the 21-county area.
## Wrangling Data
Reading in monthly temperature averages
```{r message=FALSE, warning=FALSE}
tmpf = readr::read_csv('./data/ASOS_tmpf_2014.2018.csv')
head(tmpf)
```
Let's filter for August 2018 Data
```{r}
tmpf = dplyr::filter(tmpf, moyr == '2018-08')
```
Mapping the station values
```{r}
library(tmap)
tmap_mode("view")
counties = sf::st_read('./data/LargeAreaCounties/LargeAreaCounties.shp')
sensors = sf::st_as_sf(tmpf, coords = c("longitude", "latitude"), crs = 4326)
tm_shape(counties) +
tm_borders() +
tm_shape(sensors) +
tm_dots(col = "tmpf",
palette = "-RdBu",
title = "Average August 2018 Temperature (ºF)",
popup.vars = c("Temp" = "tmpf",
"Airport Code" = "site_num"))
```
## Exploring the Null Model
This model follows the null hypothesis, that there is no variation in temperature across space, by taking the mean temperature and assigning that value across the entire area.
```{r}
RMSE <- function(observed, predicted) {
sqrt(mean((predicted - observed)^2, na.rm=TRUE))
}
null <- RMSE(mean(tmpf$tmpf), tmpf$tmpf)
null
```
The Root Mean Square Error (RMSE) for the null model is 2.09.
## Model 1: Voronoi Model
This model takes each sensor location and generates a prediction area for that sensor: the 21-county area is divided into polygons representing the area closest to each sensor.
```{r include=FALSE}
library(raster)
# Loading AOD raster for the 21 counties
AOD.raster = raster::raster('./data/AOD_21Counties_MasterGrid/AOD_21Counties_MasterGrid.grd')
# Create Blank Raster with same properties as AOD raster
blank.raster = raster()
crs(blank.raster) = sf::st_crs(AOD.raster)$proj4string
extent(blank.raster) = extent(AOD.raster)
res(blank.raster) = res(AOD.raster)
crs(blank.raster) = st_crs(sensors)$proj4string
# Replacing with NA Values
values(blank.raster) = NA
# Converting sf object to sp
dsp = as_Spatial(sensors)
IL = as_Spatial(counties)
```
Creating Voronoi Polygons
```{r message=FALSE, warning = FALSE}
library(dismo)
v <- voronoi(dsp)
tm_shape(v) +
tm_borders() +
tm_shape(sensors) +
tm_dots(popup.vars = c("Temp" = "tmpf",
"Airport Code" = "site_num"))
```
Temperature values can then be assigned to each Voronoi polygon based on temperature readings from the sensor located in the given polygon.
```{r message=FALSE, warning=FALSE}
# Assigning values to polygons
il <- aggregate(IL)
vil <- raster::intersect(v, il)
tm_shape(vil) +
tm_fill('tmpf',
alpha = 0.5,
palette = "-RdBu",
title = "Average August 2018 Temperature (ºF)",
popup.vars = c("Temp" = "tmpf")) +
tm_borders()
```
Rasterizing voronoi polygons
```{r}
r <- blank.raster
vr <- rasterize(vil, r, 'tmpf')
```
### Validating the Voronoi Model
We will use 5-fold cross-validation to determine the improvement compared to our baseline, the NULL model.
```{r warning= FALSE, message = FALSE}
set.seed(5132015)
# Randomly partition the Dataset into 5 groups (1 through 5)
kf <- kfold(nrow(dsp))
# Initialize a vector of length 5
vorrmse <- rep(NA, 5)
# Validate
for (k in 1:5) {
test <- dsp[kf == k, ] # Test on group k
train <- dsp[kf != k, ] # Train on the remaining groups
v <- voronoi(train)
p1 <- raster::extract(v, test)$tmpf
vorrmse[k] <- RMSE(test$tmpf, p1) # Save the RMSE
}
print("RMSE for each of the five folds")
vorrmse
# Take the mean RMSE and get percentage improvement
print("Mean RMSE")
mean(vorrmse)
print("Improvement over NULL model")
1 - (mean(vorrmse) / null)
```
The RMSE for the Voronoi model is 1.33, which represents a 36% improvement over the null model. We will use this percentage improvement metric to compare models and choose the most useful one.
## Model 2: Nearest Neighbor Interpolation
The next model to test is a nearest neighbor interpolation. The interpolation takes into account the nearest 5 sensors when determining the temperature value at a given grid cell. The decay parameter is set to zero.
```{r warning = FALSE, message = FALSE}
set.seed(5132015)
library(gstat)
gs <- gstat(formula=tmpf~1, locations=dsp, nmax=5, set=list(idp = 0))
nn <- interpolate(r, gs)
nnmsk <- mask(nn, vr)
tm_shape(nnmsk) +
tm_raster(n = 5,
alpha = 0.5,
palette = "-RdBu",
title = "Average August 2018 Temperature (ºF)")
```
Overall, the interpolation provides a similar estimate of temperature to the Voronoi polygons. This is expected, as Voronoi polygons are a form of nearest neighbor interpolation. The difference here is that this model takes into account the temperature values at the five nearest sensors, not just the single nearest sensor.
Cross-validating the result using the `gstat` ```predict()``` function.
```{r message = FALSE, warning = F}
nnrmse <- rep(NA, 5)
for (k in 1:5) {
test <- dsp[kf == k, ]
train <- dsp[kf != k, ]
gscv <- gstat(formula=tmpf~1, locations=train, nmax=5, set=list(idp = 0))
p2 <- predict(gscv, test)$var1.pred
nnrmse[k] <- RMSE(test$tmpf, p2)
}
print("RMSE for each of the five folds")
nnrmse
print("Mean RMSE")
mean(nnrmse)
print("Improvement over NULL model")
1 - (mean(nnrmse) / null)
```
The average RMSE after 5-fold cross validation is 1.80. This model is 14% more accurate than the null model. This suggests that it might be a less effective estimate of temperature in our case.
## Model 3: IDW Interpolation using baseline parameters
IDW stands for Inverse Distance Weighted interpolation. This model estimates temperature at a given cell by taking into account the temperature values at nearby sensors and each sensor's straight-line distance to the grid cell. Data from sensors located closer to the target grid cell are given more weight in the final estimate for that cell. This model is a direct application of Tobler's first law of geography: "everything is related to everything else, but near things are more related than distant things."
```{r warning = F, message = F}
set.seed(5132015)
library(gstat)
gs <- gstat(formula=tmpf~1, locations=dsp)
idw <- interpolate(r, gs)
idwr <- mask(idw, vr)
plot(idwr)
tm_shape(idwr) +
tm_raster(n = 10,
alpha = 0.5,
palette = "-RdBu",
title = "Average August 2018 Temperature (ºF)")
```
The IDW model creates a smoother temperature surface compared to Voronoi polygons and nearest neighbor interpolation. Hard breaks between individual sensor regions are reduced to a minimum. However, IDW also introduces its own distortion: the 'bullseye' effect, which occurs when a sensor value is significantly different from the rest, an artifact that is clearly visible around almost all sensor locations in our map.
```{r warning = F, message = F}
rmse <- rep(NA, 5)
for (k in 1:5) {
test <- dsp[kf == k, ]
train <- dsp[kf != k, ]
gs <- gstat(formula=tmpf~1, locations=train)
p <- predict(gs, test)
rmse[k] <- RMSE(test$tmpf, p$var1.pred)
}
print("RMSE for each of the five folds")
rmse
print("Mean RMSE")
mean(rmse)
print("Improvement over NULL model")
1 - (mean(rmse) / null)
```
The IDW model has an RMSE of 1.52, a 27% improvement over the null model.
## Model 4: Optimized IDW Interpolation
IDW models are highly sensitive to two user-defined parameters: (1) the maximum number of sensors to take into account and (2) a decay, or friction-of-distance, parameter. Since models are evaluated using RMSE, an optimization algorithm can be used to find the number of sensors and decay parameter that minimize RMSE. This optimization is performed below.
```{r message = F, warning = F}
f1 <- function(x, test, train) {
nmx <- x[1]
idp <- x[2]
if (nmx < 1) return(Inf)
if (idp < .001) return(Inf)
m <- gstat(formula=tmpf~1, locations=train, nmax=nmx, set=list(idp=idp))
p <- predict(m, newdata=test, debug.level=0)$var1.pred
RMSE(test$tmpf, p)
}
set.seed(20150518)
i <- sample(nrow(dsp), 0.2 * nrow(dsp))
tst <- dsp[i,]
trn <- dsp[-i,]
opt <- optim(c(8, .5), f1, test=tst, train=trn)
opt
```
The optimal IDW parameters can be read from the `opt$par` variable: the number of sensors to consider should be about 4.90, while the decay parameter should be about 8.44.
Performing the IDW interpolation with these parameters yields the following results.
```{r message = F, warning = F}
m <- gstat::gstat(formula=tmpf~1, locations=dsp, nmax=opt$par[1], set=list(idp=opt$par[2]))
idw <- interpolate(r, m)
idw <- mask(idw, il)
tm_shape(idw) +
tm_raster(n = 10,
alpha = 0.5,
palette = "-RdBu",
title = "Average August 2018 Temperature (ºF)")
```
The output from this model is striking: it looks similar to the Voronoi polygons, but as if someone took a paintbrush to the edges of each polygon and blended the colors together, leaving a thin gradient between neighboring polygons.
Let's cross-validate and get the RMSE.
```{r warning = F, message = F}
idwrmse <- rep(NA, 5)
for (k in 1:5) {
test <- dsp[kf == k, ]
train <- dsp[kf != k, ]
m <- gstat(formula=tmpf~1, locations=train, nmax=opt$par[1], set=list(idp=opt$par[2]))
p4 <- predict(m, test)$var1.pred
idwrmse[k] <- RMSE(test$tmpf, p4)
}
print("RMSE for each of the five folds")
idwrmse
print("Mean RMSE")
mean(idwrmse)
print("Improvement over NULL model")
1 - (mean(idwrmse) / null)
```
The RMSE is 1.42, which represents an improvement of 32% over the null model.
## Model 5: Thin Plate Spline Model
Originally a non-spatial interpolation method, this model seeks to "smooth" the temperature from each sensor across grid cells. Its name comes from the model's ability to penalize non-smooth fits, similar to how a thin but rigid sheet resists bending.
```{r}
library(fields)
m <- Tps(coordinates(dsp), tmpf$tmpf)
tps <- interpolate(r, m)
tps <- mask(tps, idw)
tm_shape(tps) +
tm_raster(n = 5,
alpha = 0.5,
palette = "-RdBu",
title = "Average August 2018 Temperature (ºF)")
```
This model produces evenly spaced temperature bands, as expected given the rigidity of the model. Compared to other models, it might seem less correct or representative of the real world. However, only the RMSE will tell.
```{r}
tpsrmse <- rep(NA, 5)
for (k in 1:5) {
test <- dsp[kf == k, ]
train <- dsp[kf != k, ]
m <- Tps(coordinates(train), train$tmpf)
p5 <- predict(m, coordinates(test))
tpsrmse[k] <- RMSE(test$tmpf, p5)
}
print("RMSE for each of the five folds")
tpsrmse
print("Mean RMSE")
mean(tpsrmse)
print("Improvement over NULL model")
1 - (mean(tpsrmse) / null)
```
The RMSE for this model is 1.60, a 24% improvement over the null model.
## Model 6: Ordinary Kriging
Kriging is a complex interpolation method that seeks to find the best linear predictor of intermediate values. In our spatial context, this means it seeks the best linear unbiased prediction of temperature at unobserved locations, using a variogram to model how similarity between observations decays with distance.
The first step is to fit a variogram to the temperature data.
```{r}
library(gstat)
gs <- gstat(formula=tmpf~1, locations=dsp)
v <- variogram(gs, width=20)
head(v)
plot(v)
```
We notice that there are only five points below the mean, which is very few points to fit a model to properly. But let's continue and see what happens. Next, we fit the variogram; this time, we use the ```autofitVariogram``` function from the ```automap``` package.
```{r}
fve = automap:::autofitVariogram(formula = tmpf~1, input_data = dsp)
fve
plot(fve)
```
The ```autofitVariogram``` function fitted a Gaussian variogram to our small sample of datapoints.
Executing an ordinary kriging model
```{r message = F, warning = F}
kp = krige(tmpf~1, dsp, as(blank.raster, 'SpatialGrid'), model=fve$var_model)
spplot(kp)
```
Plotting this on the 21 counties
```{r}
ok <- brick(kp)
ok <- mask(ok, il)
names(ok) <- c('prediction', 'variance')
plot(ok)
tm_shape(ok[[1]]) +
tm_raster(n = 5,
alpha = 0.5,
palette = "-RdBu",
title = "Average August 2018 Temperature (ºF)")
```
Cross-Validating the Kriging Model
```{r warning = F, message = F}
set.seed(20150518)
krigrmse = rep(NA, 5)
for (i in 1:5) {
test <- dsp[kf == i,]
train <- dsp[kf != i, ]
fve = automap:::autofitVariogram(formula = tmpf~1, input_data = train)
kp = krige(tmpf~1, train, as(blank.raster, 'SpatialGrid'), model=fve$var_model)
p6 = raster::extract(as(kp, 'RasterLayer'), test)
krigrmse[i] <- RMSE(test$tmpf, p6)
}
print("RMSE for each of the five folds")
krigrmse
print("Mean RMSE")
mean(krigrmse)
print("Improvement over NULL model")
1 - (mean(krigrmse) / null)
```
After 5-fold cross-validation, the RMSE is 1.80. This is a 14% improvement over the null model.
## Model 7: Blending all models
Next, we will attempt to create a blended model that takes a weighted average of the predicted values from each model, weighted by their RMSE. This ensures that the more accurate models have more influence on the blended model than the models with poor predictions.
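One simple way to form such weights (a sketch of the idea, not necessarily the exact scheme used in the final ensemble) is to make each model's weight inversely proportional to its mean cross-validated RMSE, so lower-error models contribute more:
```{r, eval = FALSE}
# Illustrative only: mean RMSE values reported earlier in this chapter
rms <- c(voronoi = 1.33, nn = 1.80, idw = 1.42, tps = 1.60, krig = 1.80)
# Weights inversely proportional to RMSE, normalized to sum to 1
w <- (1 / rms) / sum(1 / rms)
round(w, 3)
```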
This code chunk re-runs each model, creating an ensemble model, while cross-validating the results.
```{r message=FALSE, warning=FALSE}
set.seed(20150518)
# Initialize rmse vectors
vorrmse <- nnrmse <- idwrmse <- krigrmse <- tpsrmse <- ensrmse <- rep(NA, 5)
for (i in 1:5) {
# Creating Test & Training Data
test <- dsp[kf == i, ] # Test on group i
train <- dsp[kf != i, ] # Train on the remaining groups
# Voronoi
v <- voronoi(train)
p1 <- raster::extract(v, test)$tmpf
vorrmse[i] <- RMSE(test$tmpf, p1) # Save the RMSE
# Nearest Neighbor
gscv <- gstat(formula=tmpf~1, locations=train, nmax=5, set=list(idp = 0))
p2 <- predict(gscv, test)$var1.pred
nnrmse[i] <- RMSE(test$tmpf, p2)
# Optimized IDW
m <- gstat(formula=tmpf~1, locations=train, nmax=opt$par[1], set=list(idp=opt$par[2]))
p3 <- predict(m, test)$var1.pred
idwrmse[i] <- RMSE(test$tmpf, p3)
# Thin Plate Spline
tpsm <- Tps(coordinates(train), train$tmpf)
p4 <- predict(tpsm, coordinates(test))[,1]
tpsrmse[i] <- RMSE(test$tmpf, p4)
# Kriging
fve = automap:::autofitVariogram(formula = tmpf~1, input_data = train)
kp = krige(tmpf~1, train, as(blank.raster, 'SpatialGrid'), model=fve$var_model)
p5 = raster::extract(as(kp, 'RasterLayer'), test)
krigrmse[i] <- RMSE(test$tmpf, p5)