063-mapping-rf.Rmd

\clearpage
## Data mining: random forest {#rf}
*M. Guevara, C. Thine, G.F. Olmedo & R.R. Vargas*

### Overview

Data mining uses different forms of statistics, such as machine learning, to explore data matrices for a particular situation, from specific information sources, and with a specific objective. Data mining is used in DSM frameworks to generate spatial and temporal predictions of soil properties or classes in places where no information is available.

Under a data mining-based DSM framework, the exploration of statistical relationships (linear and non-linear) between soil observational data and soil environmental predictors is generally performed by the means of machine learning. Machine learning methods represent a branch of statistics that can be used to automatically extract information from available data, including the non-linear and hidden relationships of high dimensional spaces or hyper-volumes of information when high performance or distributed computing resources are available. Machine learning methods do not rely on statistical assumptions about the spatial structure of soil variability or the empirical relationship of soil available data and its environmental predictors. Therefore machine learning methods are also suitable for digital soil mapping under limited and sparse scenarios of data availability, although in practice the statistical performance of machine learning (or any statistical method) is reduced by a low representativeness of a soil property or class in the statistical space given available data. Machine learning methods can be used for (supervised and unsupervised) regression (e.g., predicting soil organic carbon) or classification (e.g., predicting soil type classes) on digital soil mapping. Machine learning methods can be roughly divided into four main groups: linear-based (e.g., MLR), kernel-based (e.g., kernel weighted nearest neighbors or SVM), probabilistic-based (e.g., Bayesian statistics) and tree-based (e.g., classification and regression trees).

Random forest is a tree-based machine learning algorithm that is popular in DSM because it has proven to be efficient mapping soil properties across a wide range of data scenarios and scales of soil variability. RF can be implemented using open source platforms and this Chapter is devoted to provide a reproducible example of this machine learning algorithm applied to SOC mapping across FYROM.

### Random forest

Random forest is a decision-tree-based machine learning method used in DSM for uncovering the statistical relationship between a dependent variable (e.g., soil property) and its predictors. Decision-tree-based models (also known as classifiers) are literally like trees (e.g., with stem, many branches, and leaves). The leaves are the prediction outcomes (final decisions) that flow from higher levels based on decision rules through the stem and the branches [@breiman1984classification]. The decision tree model recursively splits the data into final uniform groups (classes) or unique values based on a set of rules (e.g., based on probability values and hypothesis testing). In random forest, there are many decision trees and each tree recursively splits randomly selected sub-samples from the data (see Figure \@ref(fig:rfschema)). The name for RF originates from the fact that the original data is first randomly split into sub-samples, and many decision trees (or forest) are used to model the sub-samples.


```{r rfschema, fig.cap="Schematical representation of data splitting to generate the random subsets used to train regression trees within a random forest model (ensemble of regression trees)" , out.width='80%', echo=FALSE, fig.align='center'}
knitr::include_graphics("images/randomForestconcept.png")
```


Random forest has been tested by many researchers on DSM (@poggio2013regional; @rad2014updating, and references therein). For SOC mapping (which for large areas usually rely on sparse datasets), it holds a lot of promises when compared to other prediction models, because it is practically free of assumptions. RF has shown accuracy of spatial predictions and, given the random selection of subsets and prediction factors, reduce potential over-fitting and data noise [@wiesmeier2011digital]. Over-fitting and data noise are important uncertainty sources across high dimensionally spaces used to represent the soil forming environment. Thus, the advent of open-source platforms and freely downloadable ancillary data (e.g. http://worldgrids.org/) to represent the soil forming environment makes of RF and other such models increasingly appealing for DSM.

To predict continuous data (such as carbon density), RF generates an averaged ensemble of regression trees based on bagging, which is the statistical term for the random selection of subsets and predictors to generate each regression tree. Bagging is a bootstrapping aggregation technique where each sample is different from the original data set but resembles it in distribution and variability [@Breiman2001; @Breiman1996]. Each tree contributes to weight the statistical relationship between a dependent variable (e.g. soil property) and its prediction factors (e.g., terrain attributes, remote sensing, climate layers and/or legacy maps). Each tree (generated using a different subset of available data and random combinations of the prediction factors) is internally evaluated by an out-of-bag cross validation form which allows assessing the relative importance of the available prediction factors. Thus, higher weight is given to the most accurate trees (which use the most informative prediction factors). The final prediction of new data is the weighted average of all generated trees.

This method has been used to generate accurate predictions of SOC from the plot to the global scale and also in a country-specific basis [@hengl2014soilgrids1km; @hengl2017; @Bonfatti2016; @guevara_2018]. RF can be implemented for DSM using open source code [e.g., **R** package **randomForest**, see @breiman2017cutler] and public sources of environmental information.

The objective of this Chapter is to demonstrate a reproducible framework for the implementation of the RF algorithm applied for SOC predictive mapping, including the uncertainty of model estimates and using open source platforms for statistical computing. 

### 	Conceptual model and data preparation	

To use RF for digital soil organic carbon mapping the SCORPAN (Soils, Climate, Organisms, Relief, Parent material, Age and N space) conceptual model [@mcbratney2003digital; @Florinsky2012] will have take the following form:

$\ \text{SOC}_{x,y~t}  \sim  \text{randomForest}   (E_{x,y~t})$ 

where soil organic carbon estimates ($SOC$) for a specific site ($x,y$) and for a specific period of time ($t$) can be modeled as a Random forest (randomForest) function of the soil forming environment ($\ Ex,y~t$), which is represented by the SOC prediction factors (e.g., terrain attributes, remote sensing, climate layers and/or legacy maps).

To feed the right side of the equation, $SOC_{x,y~t}$ is usually represented in a tabular form or a geospatial object (e.g., shapefile) with three fundamental columns. Two columns represent the spatial coordinates $x$ and $y$ (e.g. latitude and longitude) that are used to extract the values of the prediction factors for the representative locations of the $SOC$ estimates. $SOC$ estimates are represented in a third column (see previous Chapters of this book dealing with the transformation of soil carbon density to mass units). The left side of the equation is generally represented by gridded (raster) files, so all available sources of information should be first harmonized into a common pixel size and coordinate reference system.

### Software

For the RF implementation, we will use the platform for statistical computing **R**. This open source object-oriented software relies on specific-contributor libraries. There are several libraries for the implementation of the RF algorithm in **R** as well as several variants of the method that can be used to solve DSM problems.

In this Section, we will show the use of Random forest using the **randomForest**, the **quantregForest**, the **raster** and the **caret** **R** packages. The quantile regression forest [**quantregForest**; @meinshausen2006quantile] has two main advantages. First, it can be used to extract the variance of all the trees generated by RF, not just the mean (as in the original **randomForest** package), and therefore we can calculate the dispersion of the full conditional distribution of SOC as a function of the prediction factors, which  given available data, represent the RF model uncertainty. Second, the quantile regression forest approach can also run in parallel using all available computational resources, in a way that we can predict and estimate the uncertainty of predictions at reasonable time frames whit large datasets.

### Tunning random forest model parameters

Two important parameters of RF are *mtry* and *ntree*. The *mtry* parameter controls the number of prediction factors that are randomly used on each tree, while the *ntree* parameter controls the number of trees generated by RF. These two parameters can be selected by the means of cross-validation to maximize the prediction capacity of RF.

We will use the **caret** package to select the most appropriate values for these parameters using 10-fold cross-validation [@kuhn2017caret]. Tunning the main parameters of RF (or any other model) can be time-consuming in computational terms because implies the need to run and internally validate an independent model for each possible combination of parameter values. Thus, tunning the RF parameters would be relevant, given available data, to achieve the best possible accuracy of predictions. 

### Technical steps - Random forest

We will use the FYSOM dataset for this exercise (see previous Chapters of this book dealing with data preparation). The first dataset contains in a tabular form the *OCSKGM* values and the values of the prediction factors for the same locations (e.g., *x*, *y*, *OCSKGM*, covariate~1~, covariate~2~ etc.) while the second database is represented by a stack of raster files containing prediction factors across all the area of interest at the spatial resolution of 0.0083º (approx. 1 km). Import the datasets and load in **R** all our libraries of interest.  

**Step 1 - Data preparation**

In the Chapter \@ref(covariates), we presented and prepared several global and continental datasets. In addition to these datasets, numerous covariate layers have been prepared by ISRIC for the GSOCmap project. These are GIS raster layers of various biophysical earth surface properties for each country in the world. Some of these layers will be used as predictors in this Section. Please download the covariates for your own study area from GSOCmap Data Repository as explained in Section \@ref(GSOCDataRepo).

In Section \@ref(overlay-soil-covariates), a table with the points values after data preparation and the values of our spatial predictors was prepared. This step involves loading this table.

Now we will import our point dataset using `read.csv()` function. The easiest way to create a data frame is to read in data from a file. This is done using the function `read.csv()`, which works with comma delimited files. Data can be read in from other file formats as well, using different functions, but `read.csv()` is the most commonly used approach. **R** is very flexible in how it reads in data from text files (`read.table()`, `read.csv()`, `read.csv2()`, `read.delim()`, `read.delim2()`). Please type `?read.table()` for help.

```{r}
# Load data
dat <- read.csv("data/MKD_RegMatrix.csv")

dat$LCEE10 <- as.factor(dat$LCEE10)
dat$soilmap <- as.factor(dat$soilmap)

# Explore the data structure
str(dat)
```

Since we will be working with spatial data we need to define the coordinates for the imported data. Using the `coordinates()` function from the **sp** package we can define the columns in the data frame to refer to spatial coordinates. Here the coordinates are listed in columns `X` and `Y`.

```{r}
library(sp)

# Promote to spatialPointsDataFrame
coordinates(dat) <- ~ X + Y

class(dat)
```

`SpatialPointsDataFrame` structure is essentially the same as a data frame, except that additional *spatial* elements have been added or partitioned into slots. Some important ones being the bounding box (sort of like the spatial extent of the data), and the coordinate reference system `proj4string()`, which we need to define for the sample dataset. To define the CRS, we must know where our data are from, and what was the corresponding CRS used when recording the spatial information in the field. For this data set, the CRS used was WGS84 (EPSG:4326).

To clearly tell **R** this information we define the CRS which describes a reference system in a way understood by the [PROJ.4](http://trac.osgeo.org/proj/) projection library. An interface to the PROJ.4 library is available in the **rgdal** package. As an alternative to using PROJ.4 character strings, we can use the corresponding yet simpler EPSG code. **rgdal** also recognizes these codes. If you are unsure of the PROJ.4 or EPSG code for the spatial data that you have but know the CRS, you should consult http://spatialreference.org/ for assistance.

> **CRS**: Please note that, when working with spatial data, it is very important that the CRS of the point data and covariates are the same.

```{r}
dat@proj4string <- CRS(projargs = "+init=epsg:4326")

dat@proj4string
```

Now we will import the covariates. When the covariate layers are in common resolution and extent, rather than working with individual rasters it is better to stack them all into a single **R** object. In this example, we use 13 covariates from the GSOCmap Data Repository and a rasterized version of the soil type map. The rasterization of vectorial data was covered in [Technical Steps - Rasterizing a vector layer in R]. The file containing all the covariates was prepared at the end of Chapter \@ref(covariates).

```{r}
load(file = "covariates.RData")

names(covs)
```

Random forest does not have assumptions about the statistical distribution of the response variable, but it is a good practice prior to model building to analyze the statistical distribution of the response variable (e.g., if is normal or not) and its relationships with the prediction factors. Soil organic carbon tends to have a log-normal distribution with a right-skew, and transforming the original values to its natural logarithm would generate a normal distribution of soil organic carbon values.

For further analysis, we will use the dataset transformed to its natural logarithm (*OCSKGMlog*) because this transformation, given this dataset, increases the correlation of the response variable and the covariate space. 

Keep in mind that selecting the most appropriate prediction factors is required to generate an interpretable model and high accuracy of prediction in places where no information is available. Variable selection ideally should incorporate expert soil knowledge about the study area and statistical criteria (e.g., just to use the best-correlated predictors). Multivariate analysis (e.g., principal component analysis) is a widely used approach to identify informative predictors. Here we use this combination of prediction factors to be consistent with other Chapters of this book and because they were previously selected for this exercise using expert knowledge about the spatial variability of soil organic carbon.

Now, we will build a working hypothesis from our conceptual model, using all the continuous prediction factors for *OCSKGMlog*:

*OCSKGMlog* ~ randomForest *B04CHE3 + B07CHE3 + B13CHE3 + B14CHE3 + DEMENV5 + LCEE10 + PRSCHE3 + SLPMRG5 + TMDMOD3 + TMNMOD3 + TWIMRG5 + VBFMRG5 + VDPMRG5* 
 
 
```{r}
# For its use on R we need to define a model formula
fm = as.formula(paste("log(OCSKGM) ~", paste0(names(covs[[-14]]),
                                            collapse = "+"))) 

```

This is the **R** syntax to define a model formula required for the model structure, where soil organic carbon transformed to its natural logarithm (*OCSKGMlog*) can be predicted as a function of the available prediction factors, each explained in Chapter \@ref(covariates) of this book, e.g., B04CHE3, B07CHE3, B13CHE3, B14CHE3, DEMENV5 , LCEE10, PRSCHE3, SLPMRG5, TMDMOD3, TMNMOD3, TWIMRG5, VBFMRG5, VDPMRG5. 

Note that the variable soil map is categorical, so is not included in the correlation analysis. In fact, although soil type polygon maps are in theory powerful predictors for *OCSKGM* we will not use this map for this exercise, because not all categories in the map are represented by available *OCSKGM* estimates, therefore this map requires a generalization of soil type units in function of the classes represented by the sites of *OCSKGM* estimates, which is beyond the scope of this Chapter.

Ideally, the number of observations across all the categories of soil type or any other factorial variable should be balanced. Another alternative to using an unbalanced categorical map is by generating dummy variables, where each category in the map becomes an independent binomial predictor variable (e.g., only 0 and 1 values) as is explained in the following Chapter. The risk of doing so rely upon the potential underestimation of the spatial variability of the target variable under each category with a low density of available data.  

**Step 2 - Tuning parameters**

Now we will use the cross-validation strategy implemented in the train function of the **caret** package  [@kuhn2017caret], which default is 10-fold. The result of this function includes information to select the best *mtry* parameter and to decide the appropriate number of trees. The out-of-bag RMSE will be used to select the optimal *mtry* model. To analyze the *ntree* parameter we will plot the number of trees against the out-of-bag RMSE, an optimal *ntree* can be selected with the number of trees when these relationships stabilizes at the minimum possible RMSE (in the y axis). Reduncing the number of trees will reduce the computational demand, which is specially important when dealing with large databases. In the presence of multidimensional and highly correlated prediction factors, avoiding an excessive number of trees will also reduce the risk of model overfitting. 

```{r tunning rf}
library(randomForest)
library(caret)

# Default 10-fold cross-validation
ctrl <- trainControl(method = "cv", savePred=T)

# Search for the best mtry parameter
rfmodel <- train(fm, data=dat@data, method = "rf", trControl = ctrl, 
             importance=TRUE)

# This is a very useful function to compare and test different 
# prediction algorithms.
# Type names(getModelInfo()) to see all the 
# possibilitites implemented on this function.
```

The object derived from the train function can be used to generate predictions of *OCSKGMlog* at the spatial resolution of the prediction factors. Before generating predictions, we will plot the most important predictors sorted in decreasing order of importance. From the variable importance plot, MSE represent an informative measure for variable selection. It is the increase in error (mean squared error, MSE) of predictions which was estimated with out-of-bag-cross validation as a result of prediction factor being permuted with values randomly shuffled. This is one of the strategies that RF uses to reduce overfitting. 

```{r, fig.cap='Model Decreasing Error and Node Purity for the RF model'}
# Variable importance plot, compare with the correlation matrix
# Select the best prediction factors and repeat  
varImpPlot(rfmodel[11][[1]])
```

```{r, fig.cap='Select ntree'}
# Check if the error stabilizes 
plot(rfmodel[11][[1]])
```

Random forest users are encouraged to compare and test the prediction capacity of different combinations of prediction factors in order to reduce the complexity of the model and the statistical redundancy of environmental information on further applications of predicted OCSKGM maps (e.g., quantifying the carbon dynamics). The resulting map of our RF model needs to be validated using the independent dataset to complement the results of the cross-validation (e.g., RMSE and explained variance) derived using the train function and to have a more comprehensive interpretation of accuracy and bias. Note how the RMSE and the explained variance derived from the independent validations are slightly lower than the values obtained using cross-validation. 

```{r rf-pred, fig.cap='SOC prediction map for FYROM using a random forest model'}
# Make a prediction across all FYROM
# Note that the units are still in log
pred <- predict(covs, rfmodel)

# Back transform predictions log transformed
pred <- exp(pred)

# Save the result as a tiff file
writeRaster(pred, filename = "results/MKD_OCSKGM_rf.tif",
            overwrite=TRUE)

plot(pred)
```


### Technical steps - Using quantile regression forest to estimate uncertainty

Ideally, a digital soil map should include a spatial explicit metric of uncertainty. The uncertainty can be roughly divided into four main components, uncertainty in soil data, uncertainty in soil covariates, uncertainty in the model and uncertainty in variations of available data. Here, we show an approach to estimate the sensitivity of the model to available data and the uncertainty of the model. The first two are beyond of the aim of this Chapter. For the third and fourth we will generate a reproducible example.

**Step 1 - Split the data in into training and testing subsets**

To analyze the sensitivity of the model to available data we need to randomly split the data several times (e.g., 10 or more, is possible until the variance stabilizes) in training and testing subsets. A model generation and prediction are made on each split, in a way that the dispersion of the predicted values at the pixel level will represent the uncertainty and the sensitivity of the model to variations in available data.

This process increases computational demand and memory since it will repeat n times (10 in this example) the model and the prediction using each time a different random combination of data for training and testing the models. As larger the sample and the number of realizations the more robust our validation strategy. For this example, we will use only 10 realizations and random splits of 25% of available data. 


```{r, eval=TRUE, warning=FALSE}
library(Metrics)

# Generate an empty dataframe
validation <- data.frame(rmse=numeric(), r2=numeric())

# Sensitivity to the dataset
# Start a loop with 10 model realizations
for (i in 1:10){
  # We will build 10 models using random samples of 25%  
  smp_size <- floor(0.25 * nrow(dat))
  train_ind <- sample(seq_len(nrow(dat)), size = smp_size)
  train <- dat[train_ind, ]
  test <- dat[-train_ind, ]
  modn <- train(fm, data=train@data, method = "rf", 
                trControl = ctrl)
  pred <- stack(pred, predict(covs, modn))
  test$pred <- extract(pred[[i+1]], test)
  # Store the results in a dataframe
  validation[i, 1] <- rmse(test$OCSKGMlog, test$pred)
  validation[i, 2] <- cor(test$OCSKGMlog, test$pred)^2
}
```

**Step 2 - Build a sensitivity map**

```{r, eval=TRUE, fig.cap='Sensitivity based on 10 realizations using 25% samples'}
# The sensitivity map is the dispersion of all individual models
sensitivity <- calc(pred[[-1]], sd)

plot(sensitivity, col=rev(topo.colors(10)), 
     main='Sensitivity based on 10 realizations using 25% samples')
```

The RMSE and the explained variance of the models is stored in the object validation (type `summary(validation)`). The standard deviation of all the ten predictions allows generating a map of model sensitivity to available data.

```{r, eval=TRUE}
# Sensitivity of validation metrics
summary(validation)
```

**Step 3 - Analyze the sensitivity of the map**

```{r, eval=TRUE, fig.cap='OCSKGM prediction based on 75% of data', echo=TRUE}
# Plot of the map based on 75% of data and the sensitivity to data 
# variations
prediction75 <- exp(pred[[1]])

plot(prediction75, main='OCSKGM prediction based on 75% of data', 
     col=rev(topo.colors(10)))
```

**Step 4 - Estimate the full conditional distribution of OCSKGMlog**

Finally, we will estimate the model uncertainty, represented by the full conditional distribution of the response variable (*OCSKGMlog*) as a function of the selected prediction factors using the quantile regression forest package **quantregForest** of **R**. This approach has proven to be efficient for DSM across large areas [@vaysse2017using]. 

```{r fit quantreg, eval=TRUE}
# Use quantile regression forest to estimate the full conditional 
# distribution of OCSKGMlog, note that we are using the mtry 
# parameter that was selected by the train function of the caret 
# package, assuming that the 75% of data previously used well 
# resembles the statistical distribution of the entire data 
# population. Otherwise, repeat the train function with all 
# available data (using the object dat that instead of train) 
# to select mtry.
library(quantregForest)

model <- quantregForest(y=dat@data$OCSKGMlog, x=dat@data[,8:20], 
                        ntree=500, keep.inbag=TRUE, 
                        mtry = as.numeric(rfmodel$bestTune))                        
```

**Step 5 - Estimate the probability distribution function for each pixel**

This method will calculate a probability distribution function for each pixel and therefore can be time-consuming. Therefore we will run it using parallel computing. Note that the code to run in parallel this analysis can also be passed to the previous predictions (predict function). The result will be a map of the standard deviation of the distribution calculated for each pixel, which represents the extreme values that a prediction can take for a specific site (e.g., pixel) given available data and predictors. Note that this analysis is performed using all available data and a second map of OCSKGM is created. 

```{r run quantregForest, eval=TRUE}
library(snow)

# Estimate model uncertainty at the pixel level using parallel 
# computing
# Define number of cores to use
beginCluster()

# Estimate model uncertainty
unc <- clusterR(covs, predict, args=list(model=model,what=sd))

# OCSKGMlog prediction based in all available data
mean <- clusterR(covs, predict, 
                 args=list(model=model, what=mean))

# The total uncertainty is the sum of sensitivity and model 
# uncertainty
unc <- unc + sensitivity

# Express the uncertainty in percent % (divide by the mean)
Total_unc_Percent <- exp(unc)/exp(mean)
endCluster()
```

Our final prediction uses all available data, while the total uncertainty (in percent) is represented by the sum of the quantile regression forest standard deviation and the sensitivity map from the previous Section. The total uncertainty is then divided by the prediction to obtain a percent map, which is easier to interpret.  


```{r, eval=TRUE, fig.cap='OCSKGM quantregForest prediction'}
# Plot both maps (the predicted OCSKGM and associated uncertainty)
plot(exp(mean), main='OCSKGM based in all data')
```


```{r, eval=TRUE, fig.cap='Total uncertainty'}
plot(Total_unc_Percent, zlim=c(0, 5), main='Total uncertainty')
```

**Step 6 - Save the results as raster format**

Finally, the predicted OCSKGM and the total uncertainty can be saved in the working directory in a generic (*.tif) raster format.

```{r, eval=TRUE}
# Save the resulting maps in separated *.tif files
writeRaster(exp(mean), file='results/MKD_OCSKGM_quantrf.tif', 
            overwrite=TRUE)
writeRaster(Total_unc_Percent, file='results/MKD_OCSKGM_rf_unc.tif',
            overwrite=TRUE)
```

We have created two maps in the working directory, one represents the predicted OCSKGM and the second one its uncertainty, which is the sum of the model sensitivity to data variations and the full conditional distribution of the response variable as a function of available prediction factors. The following Chapters of this book will show you how to prepare a stock report based on this soil carbon digital soil maps.