F-Roads_Data.Rmd

---
output: 
  html_document:
    toc: TRUE
    toc_float: TRUE
---

# Appendix F: Roads Data {-}

## Overview

Transportation system is a significant source of air pollution, and roads are without doubt the most important component of transportation system. The ability to manipulate roads data is therefore of critical importance. In this tutorial, we will take a look at the major roads in several large Midwest counties, and we will introduce other techniques, such as how to handle data with physical units, along the way. Our objectives are to:

* perform spatial data manipulation using common functions such as `st_intersection` and `st_join`
* learn to work with data of type "units"
* summarize data by groups and output results

## Environment Setup {-}

### Input/Output {-}

Our input includes the shapefile of several large counties in the Midwest, a grid data file, and the shapefile of the major roads in Illinois, Indiana, Wisconsin, and Michigan. The files are available for download on this project's [GitHub repo](https://github.com/GeoDaCenter/OpenAirQ-toolkit/tree/master/data).

Our output is a shapefile that records the lengths of primary raods, secondary roads, and motorways in each 1 km \times 1 km grid. 

### Load Libraries {-}

We start by loading the following packages:

* `raster`: to manipulate and analyze raster data
* `tidyverse`: to perform simple statistical analyses
* `sf`: to conduct basic spatial data manipulation
* `velox`: to manipulate raster data in time efficient manner
* `units`: to handle numeric values with physical measurement units

It should be mentioned here that the "velox" package, which is great at fast raster data manipulation, is sadly no longer available. Detailed instructions for installing this package from R archive are included in other parts of this tutorial book and can also be found [here](https://support.rstudio.com/hc/en-us/articles/219949047-Installing-older-versions-of-packages).

```{r F.library, message = F, warning = F}
library(raster)
library(tidyverse)
library(sf)
library(velox)
library(units)
```

### Load Data {-}

With the packages ready, we now load the data needed for this tutorial - `roads` is the shapefile for the major roads in Illinois, Indiana, Wisconsin, and Michigan, `km.grid` is the grid data, and `counties` is the shapefile for 21 large counties in the Midwest. All of the data can be loaded using the `st_read` function.  

```{r F.load.data}
roads <- st_read("./data/4StateMajorRoads")
km.grid <- st_read("./data/Km_Grid")
counties <- st_read("./data/LargeAreaCounties")
```

### Data Preparation {-}

The roads data we use in this tutorial were prepared in advance. For educational purposes, we will show here how to download the roads data from the OpenStreetMap(OSM) project. There are of course a large number of methods to download the data, and each has its own merits. However, if we would like to accomplish the task within R, then the `osmdata` is going to be your first choice. 

```{r F.library2, message = F}
library(osmdata)
```

There are numerous types of features contained in the OSM project which can be easily downloaded. If you are not sure what features are available for use, you can check it by calling the function `available_features()`.

```{r F.check.features}
available_features()
```

The feature in which we are interested in is "highway." However, it should be pointed out that within each feature, there might be multiple tags. To see them, we call another function, `available_tags()`.

```{r F.check.tags}
available_tags("highway")
```

For our study and for this tutorial, we will focus on six of the tages, namely "primary," "primary_link," "secondary," "secondary_link," "motorway," "motorway_link." However, there are so many interesting tags out there for you to explore in your spare time. How exciting!

Our next step is to build a query. This can be done using the code below. `opq()` is the main function that will build the query, where you can also specify the geographical region of interest. For illustration purpose, we will use the state of Illinois as an example. Pay attention to the "timeout" argument. It is optional but is very helpful. When this argument is not specified, a timeout error may be raised if the query takes too long. `add_osm_feature()`, does exactly what its name suggests - it allows you to filter the features as well as the tags that you are interested in. The "features" correspond to the "key" argument, and the "tags" correspond to the "value" argument. Lastly, `osmdata_xml` allows you to store the downloaded data in the `xml` format.

```{r F.query, eval = F}
opq(bbox = 'Illinois', timeout = 100) %>%
  add_osm_feature(key = 'highway', 
                  value = c('primary', 'primary_link', 'secondary', 'secondary_link', 'motorway', 'motorway_link')) %>%
  osmdata_xml(filename = 'data/roads_IL.osm')
```

If we would like to conduct spatial data analysis in R, it is easier to work with simple feature object. We can read the roads data, which are lines object, by using the `st_read()` function.

```{r F.read.roads, eval = F}
r_IL <- sf::st_read('data/roads_IL.osm', layer = 'lines')
```

We will not be using this data file in our tutorial; we will instead use `roads`, which we have loaded earlier.

## Data Manipulation {-}

### Prelimnary Work {-}

First, we create a new field in `km.grid` called `DATA`. This field is defined as a sequence of positive integers. Basically, we are assigning a unqie ID number of each features in `km.grid`. It might be puzzling at this time what function the `DATA` field serves, but it will make sense later when we perform spatial joins. 

```{r F.DATA.seq}
km.grid$DATA <- seq(1:length(km.grid$DATA)) 
```

Our next step is to take the spatial intersection of `roads` and `km.grid`. To accomplish this, we simply call the `st_intersection` function, which does exactly what its name suggests.

```{r F.intersection, warning = F, message = F}
roads.intersection <- st_intersection(roads, km.grid)
```

To get the length of the roads, we use the `st_length` function. The geometry of raods is of the type `lingstring`, whose length is easily computed by the `st_length`function. We then store the output (i.e. the lengths of roads) in the `sf` object `roads.intersection` by creating a new field called `length`.

```{r F.length, warning = F, message = F}
roads.intersection$length <- st_length(roads.intersection)
```

Now, let's pause for a second and inspect the `DATA` field of `roads.intersection`. Recall that we created the `DATA` field as an identifier of the grid. 

Other than that the values of `DATA` is right-skewed, the summary statistics do not tell us much useful information. However, it is nevertheless a good habit to inspect your data from time to time so that you are aware of what data you are working with. 

```{r F.summary.roads}
summary(roads.intersection$DATA)
```

### Primary Roads {-}

Next, we want to take a look at all roads that are in the class "primary" or "primary_link." More technically, we would like to extract all features of `roads.intersection` that has value "primary" or "primary_link" in the field `fclass`. To achieve this, we take advantage of the `which` function, which returns the positions of all features that satisfy the criterion. Then we simply bond `pri.1` and `pri.2` into a new vector, and this new vector `pri.id` has the positions of all the features that we want.

```{r F.primary.selection}
pri.1 <- which(roads.intersection$fclass == "primary")
pri.2 <- which(roads.intersection$fclass == "primary_link")
pri.id <- c(pri.1, pri.2)
```

To get the roads that are classified as either "primary" or primary_link", we simply use the `pri.id` we defined above to subset `roads.intersection`. It should be noted that there are many ways to extract the features that we want from `roads.intersection`. The method we use here is one of the most straightforward ones, but there are many alternatives, some of which are potentially better than the one shown here. In most cases, deciding which method to use is simply a matter of personal preference. 

```{r F.subset.primary}
primary.int <- roads.intersection[pri.id,]
```

It is now a time to do a spatial join, which can be done using the `st_join` function. In plain English, we basically combine the `sf` objects `km.grid` and `primary.int` into a new `sf` object called `primary.merged`.

```{r F.primary.merge, warning = F, message = F}
primary.merged <- st_join(km.grid, primary.int)
```

Now, let's take a look at `primary`. This is just an `sf` object with 14 fields and 68769 features.

```{r F.primary.inspect}
primary.merged
```

The next step is to aggregate the lengths of the roads. Recall that we created the `DATA` field in `km.grid` at the beginning of this tutorial as a unique identifier of the grid. Now, we group the data by `DATA.x` (notice the name of the field changed after the spatial join), and we summarize the data using the `sum` function. We store the result of the this summation into the field `length`. 

```{r F.primary.summarize, warning = F, message = F}
primary.roadlengths <- primary.merged %>%
  group_by(DATA.x) %>%
  summarize(length = sum(length))
```

What we have obtained above is the total length of "primary" and "primary_link" roads in each grid. The new `sf` object, grouped by grids, is named `primary.roadlengths`. Let's take a quick look at what information it contains. 

```{r F.primary.result, message = F, warning =F}
primary.roadlengths
```

We see that only certain features has numeric values in the field `length`, which makes perfect sense as roads only cross a few grids. It would be quite strange if we see every feature as a large numeric value in the field `length` - that would suggest that the entire land is covered with roads, which is the case in the real world. 

The next thing we want to deal with is the unit. Note that the values in the field `length` is of the type "units" i.e. a numeric value with a physical measurement unit. The original unit of `length` is a little complicated. We will simplify things by just assigning it the unit "meter".  

```{r F.units.1}
units(primary.roadlengths$length) <- with(ud_units, m)
```

What we want to do is to replace all missing values in `length` with "0 meter." To achieve that, we first create a variable with value 0 and then assign it the unit "meter." This newly created `x0` is what will replace the missing values. 

```{r F.units.2}
x0 <- 0
units(x0) <- with(ud_units, m)
```

We use the `which` function to return the positions of all missing values, and on those exact position, we replace the value with `x0`. Hence, all missing values now read "0 meter."

```{r F.units.3}
primary.roadlengths[which(is.na(primary.roadlengths$length)),2] <- x0
```

As always, it is critical that we inspect our data from time to time. If we print out `primary.roadlengths`, we see that all missing values have indeed by replaced by "0 meter."

```{r F.units.4}
primary.roadlengths
```

### Secondary Roads {-}

From this point on, things start to get a little repetitive. We, in this section, repeat lots of code from the last section. The only difference is that we now want to extract the information - basically the lengths of the roads - of all roads that are classified as "secondary" or "secondary_link." For illustrative purposes, we show all the code below, but we shall not comment much on the code as it is almost identical to the code in the last section. 

```{r warning = F, message = F}
sec.1 <- which(roads.intersection$fclass == "secondary")
sec.2 <- which(roads.intersection$fclass == "secondary_link")
sec.id <- c(sec.1, sec.2)

secondary.int <- roads.intersection[sec.id,]

secondary.merged <- st_join(km.grid, secondary.int)

secondary.roadlengths = secondary.merged %>%
  group_by(DATA.x) %>%
  summarize(length = sum(length))

secondary.roadlengths[which(is.na(secondary.roadlengths$length)),2] <- x0
```

We inspect `secondary.roadlengths`, and it looks fine in terms of both data format and data value. 

```{r F.secondary.inspect}
secondary.roadlengths
```

### Motorway {-}

The code in this section is exactly the same as in the last two section. We include the code here for the sake of completeness, but you should feel free to skip this section. 

```{r warning = F, message = F}
mot.1 <- which(roads.intersection$fclass == "motorway")
mot.2 <- which(roads.intersection$fclass == "motorway_link")
mot.id <- c(mot.1, mot.2)

motorway.int <- roads.intersection[mot.id,]

motorway.merged <- st_join(km.grid, motorway.int)

motorway.roadlengths = motorway.merged %>%
  group_by(DATA.x) %>%
  summarize(length = sum(length))

motorway.roadlengths[which(is.na(motorway.roadlengths$length)),2] <- x0
```

Before we proceed, we take a galnce at `motorway.roadlengths`.

```{r F.motorway.inspect}
motorway.roadlengths
```

## Outputting Data {-}

We now enter the last section of this tutorial, where we would like to save all the work we have done so far. This is not difficult - we just create three fields in `km.grid` (`primary`, `secondary`, and `motorway`), where we store the lengths of the roads.

```{r F.store.length}
km.grid$primary <- primary.roadlengths$length
km.grid$secondary <- secondary.roadlengths$length
km.grid$motorway <- motorway.roadlengths$length
```

Then, we output a new shapefile with the modified `km.grid` using the `st_write` function. 

```{r F.write, eval = F}
st_write(km.grid, "Road_Grid.shp")
```

This concludes our tutorial. In this exercise, we showed how roads data can be handled. Of course, some of the techniques introduced here can be applied in many other instances. The brief introduction on dealing with data of the type "units" should be  particularly helpful if you have not been previously exposed to this data type. You should feel free to explore more on your own and apply what you have learned to your own research.