---
output:
pdf_document: default
html_document: default
---
# Validation {#chvalidation}
*B. Kempen, D.J. Brus & G.B.M. Heuvelink*
The code in this Chapter uses several **R** packages. To run it, the following packages need to be installed in the **R** user library. If the packages are not yet installed, the `install.packages()` function can be used.
```{r, eval=FALSE}
# Installing packages for Chapter 'Validation'
install.packages(c("sp", "raster", "caret"))
```
## What is validation?
No map is perfect. All maps, including soil maps, are representations of reality that are often based on an underlying model. This means that there will always be a deviation between the phenomenon depicted on the map and the phenomenon observed in the real world, i.e. each map will contain errors. The magnitude of the errors determines the quality of the map. If a map matches reality well (the error is small), the quality or accuracy of the map is high. On the other hand, if a map does not match reality well, map accuracy is low.
Soil maps are used for many purposes, for example, to report on (changes in) SOC stocks, as input to agro-environmental models, to determine land use suitability, or for decision- and policy-making. It is, therefore, important that the quality of a map is determined and quantified. This is achieved through (statistical) validation.
Validation is defined here as an activity in which the soil map predictions are compared with observed values. From this comparison, the map quality can be quantified and summarized using map quality measures. These measures indicate how accurate the map is on average for the mapping area, i.e. what is the expected error at a randomly selected location in the mapping area. This means that map quality measures obtained through validation are global measures: each quality measure gives one value for the entire map. Note that this is different from results obtained through uncertainty assessment. Such assessment provides local, location-specific (i.e. for each individual grid cell) estimates of map quality as we saw in the previous Sections. Another important difference between validation and uncertainty assessment is that validation can be done using a model-free approach. Uncertainty assessment takes a model-based approach by defining a geostatistical model of the soil property of interest and deriving an interpolated map and the associated uncertainty from that, or by constructing a geostatistical model of the error in an existing map. The approach yields a complete probabilistic characterization of the map uncertainty, but such characterization is only valid under the assumptions made; for instance, the stationarity assumptions required for kriging. Validation, when done properly as explained hereafter, does not assume a geostatistical model of the error, and hence is model- or assumption-free. This is an important property of validation since we do not want to question the objectivity and validity of the validation results.
We distinguish internal and external map accuracy. Statistical methods typically produce direct estimates of map quality, for instance, the kriging variance or the coefficient of determination $R^2$ of a linear regression model. These we refer to as internal accuracy measures since they rely on model assumptions and are computed from the data used for model calibration. Preferably, validation is done with an independent dataset not used in map making. Using such a dataset gives the external map accuracy. One will often see that the external accuracy is poorer than the internal accuracy.
In Section \@ref(QualMeasures) we will present the most common accuracy measures used to quantify map quality of quantitative (continuous) soil maps and qualitative (categorical) soil maps. In Section \@ref(validationMeth) we will introduce three commonly used validation methods and show how to estimate the map quality measures from a sample. This Chapter is largely based on @brus2011sampling. For details, please refer to this paper.
## Map quality measures {#QualMeasures}
### Quality measures for quantitative soil maps
All map quality measures considered here are computed from the prediction error. For quantitative soil maps of continuous soil properties (e.g. organic carbon content, pH, clay content) the prediction error is defined as the difference between the predicted value at a location and the true value at that location (which is the value that would be observed or measured by a preferably errorless measurement instrument) [@brus2011sampling]:
\begin{equation}
e(s) = \hat{Z}(s) - Z(s)
\end{equation}
where $\hat{Z}(s)$ is the predicted soil property at validation location $s$, and $Z(s)$ is the true value of the soil property at that location. Here we consider six map quality measures computed from the prediction error: the mean error, the mean absolute error, the mean squared error, the root mean squared error, the model efficiency (amount of variance explained), and the mean squared deviation ratio.
Before we introduce the map quality measures and show how to estimate these, it is important to understand the difference between the population and a sample taken from the population. The population is the set of all locations in a mapping area. For digital soil maps, this is the set of all pixels or grid cells of a map. A sample is a subset of locations, selected in some way from the set of all locations in the mapping area. With validation we want to assess the map accuracy for the entire population, i.e. for the map as a whole; we are not interested in the accuracy at the sample of locations only.
For instance, we would like to know the prediction error averaged over all locations of a map and not merely the average prediction error at a sample of locations. Map quality measures are, therefore, defined as population means. Because we cannot afford to determine the prediction error at each location (grid cell) of the mapping area to calculate the population means, we have to take a sample of a limited number of locations in the mapping area. This sample is then used to estimate the population means. It is important to realize that we are uncertain about the population means because we estimate it from a sample. Ideally, this uncertainty is quantified and reported together with the estimated map quality measures.
In this Section, we will introduce the definitions of the map quality measures. In the next Section, we show how we can estimate these measures from a sample.
**Mean error**
The mean error (${ME}$) measures bias in the predictions. The ${ME}$ is defined as the population mean (spatial mean) of the prediction errors:
\begin{equation}
ME = \underline{e} = \frac{1}{N} \sum_{i=1}^{N} e(s_i)
\end{equation}
where $i$ indicates the location, $i = 1, 2, \dots, N$, and $N$ is the total number of locations or grid cells/pixels in the mapping area. The mean error should be (close to) zero, which indicates that the predictions are unbiased, i.e. there is no systematic over- or under-prediction of the soil property of interest.
**Mean absolute error and (root) mean squared error**
The mean absolute error (${MAE}$) and mean squared error (${MSE}$) are measures of map accuracy and indicate the magnitude of error we make on average. The ${MAE}$ is defined by the population mean of the absolute errors:
\begin{equation}
MAE = \underline{|e|} = \frac{1}{N} \sum_{i=1}^{N} |e(s_i)|
\end{equation}
and the ${MSE}$ by the population mean of the squared errors:
\begin{equation}
MSE = \underline{e^2} = \frac{1}{N} \sum_{i=1}^{N} e^2(s_i)
\end{equation}
Many authors report the root mean squared error ($RMSE$) instead of the $MSE$, which is computed by taking the square root of the $MSE$. The $RMSE$ can be a more appealing quality measure since it has the same unit of measurement as the mapped property and can, therefore, more easily be compared to it. If the squared error distribution is strongly skewed, for instance when several very large errors are present, then this can severely inflate the $(R)MSE$. In such a case, the (root) median squared error is a more robust statistic for the *average* error [@kempen2012efficiency].
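As a minimal numerical sketch of this effect, the code below compares the $RMSE$ with the root median squared error for a hypothetical error vector `e` (not taken from the case study data):
```{r, eval=FALSE}
# Hypothetical prediction errors; the single large error inflates the RMSE
e <- c(-0.2, 0.1, 0.3, -0.1, 2.5)
sqrt(mean(e^2))    # RMSE, dominated by the outlier
sqrt(median(e^2))  # root median squared error, more robust
```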
@brus2011sampling argue that instead of using a single summary statistic (the mean) to quantify map quality, one should preferably express quality measures for quantitative soil maps through cumulative distribution functions (CDFs). Such functions provide a full description of the quality measures from which various parameters can be reported, such as the mean, median or percentiles. Furthermore, they argue that it can be of interest to define CDFs or their parameters for sub-areas, for instance, geomorphic units, soil classes or land cover classes. @brus2011sampling give examples of estimating CDFs for validation of digital soil maps.
**Amount of variance explained**
The model efficiency, or Amount of Variance Explained ($AVE$) [@angelini2016mapping; @samuel2015more], quantifies the fraction of the variation in the data that is explained by the prediction model. It measures the improvement of the model prediction over using the mean of the data set as predictor and is defined as follows [@krause2005comparison]:
\begin{equation}
A V E = 1 - \frac{\sum_{i=1}^{N} (\hat{Z}(s_i) - Z(s_i))^2}{\sum_{i=1}^{N} (Z(s_i) - \underline{Z})^2}
\end{equation}
where $\underline{Z}$ is the population mean of soil property $Z$. The quantity in the numerator is the sum of the squared prediction errors (for each location the prediction error is computed and squared; the squared prediction errors are summed over all locations in the area). In linear regression, this quantity is known as the residual sum of squares ($RSS$). The quantity in the denominator is also a sum of squared prediction errors, but here the mean of the area is used as the predictor. In linear regression, this quantity is known as the total sum of squares ($TSS$). Note that if we divided the quantity in the denominator by the number of locations in the mapping area $N$, we would obtain the population variance (spatial variance) of the soil property $Z$.
If the numerator and denominator are equal, meaning the ${AVE}$ is zero, then the model predictions are no improvement over using the mean of the data set as a predictor for any location in the mapping area. An ${AVE}$ value larger than zero (${RSS}$ smaller than ${TSS}$) means that the model predictions are an improvement over using the mean as a predictor (this is what we hope for). In case the ${AVE}$ is negative, then the mean of the data set is a better predictor than the prediction model.
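The following small sketch computes the $AVE$ directly from its definition for hypothetical vectors `obs` and `pred` (not part of the case study data):
```{r, eval=FALSE}
# Hypothetical observed and predicted values at five locations
obs  <- c(1.2, 0.8, 2.1, 1.5, 0.9)
pred <- c(1.0, 1.1, 1.8, 1.6, 1.2)
RSS <- sum((pred - obs)^2)       # residual sum of squares
TSS <- sum((obs - mean(obs))^2)  # total sum of squares
AVE <- 1 - RSS / TSS
AVE
```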
**Mean squared deviation ratio**
Finally, we introduce the mean squared deviation ratio (${MSDR}$) as a map quality measure [@kempen2010pedometric; @lark2000comparison; @voltz1990comparison; @webster_2007]. Contrary to the quality measures discussed so far, the ${MSDR}$ assesses how well the prediction model estimates the prediction uncertainty (expressed as the prediction error variance). The ${MSDR}$ is defined as:
\begin{equation}
M S D R = \frac{1}{N} \sum_{i=1}^{N} \frac{(\hat{Z}(s_i) - Z(s_i))^2}{\sigma^2(s_i)}
\end{equation}
where $\sigma^2(s_i)$ is the prediction error variance at location $s_i$, $i = 1, 2, \dots, N$. The numerator is the squared error at location $s_i$, so each fraction represents a squared $Z$-score. In case of kriging, the prediction error variance is the kriging variance. In case of linear regression, it is the prediction variance of the linear regression predictions, which can be obtained in **R** by running the `predict()` function with argument `se.fit = TRUE`. This function returns for each prediction location the standard error of the predicted value as well as the residual standard deviation (the `residual.scale` value). By squaring both values and summing the squares, the prediction error variance is obtained. If the prediction model estimates the error variance well, then the $MSDR$ should be close to one. A value smaller than one suggests that the prediction error variance is overestimated; a value larger than one suggests that it is underestimated.
@lark2000comparison notes that outliers in the prediction data will influence the squared $Z$-scores and suggests using the median squared $Z$-score instead of the mean, since it is a more robust estimator. A median squared $Z$-score close to 0.455 (the median of a chi-squared distribution with one degree of freedom) suggests that the prediction model estimates the prediction uncertainty well.
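A minimal sketch of how the $MSDR$ could be computed for a linear regression model is given below; `model` (a fitted `lm` object) and `val` (a validation data frame with the observed property in column `OCSKGM`) are hypothetical objects, not created elsewhere in this Chapter:
```{r, eval=FALSE}
# Predictions with standard errors at the validation locations
pred <- predict(model, newdata = val, se.fit = TRUE)
# Prediction error variance = squared standard error + residual variance
pred.var <- pred$se.fit^2 + pred$residual.scale^2
# Squared Z-scores
z2 <- (pred$fit - val$OCSKGM)^2 / pred.var
mean(z2)    # MSDR, should be close to 1
median(z2)  # robust alternative, should be close to 0.455
```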
### Quality measures for qualitative soil maps
Like the quality measures for quantitative soil maps, the quality measures for qualitative or categorical soil maps (e.g. soil classes) are defined for the population, i.e. all locations in the mapping area. The basis for map quality assessment of qualitative maps is the error matrix [@brus2011sampling; @lark1995components]. This matrix is constructed by tabulating the observed and predicted class for all locations in the mapping area in a two-way contingency table. The population error matrix is a square matrix of order $U$, with $U$ being the number of soil classes observed and mapped. The columns of the matrix correspond to the observed soil classes and the rows to the predicted soil classes (the map units). $N$ is the total number of locations in the mapping area. Element $N_{ij}$ is the number of locations mapped as class $i$ with observed class $j$. The row margin $N_{i+}$ is the number of locations mapped as class $i$, and the column margin $N_{+j}$ the number of locations for which the observed soil class is $j$. Note that the elements of the population error matrix can also be interpreted as surface areas. In that case, element $N_{ij}$ is the surface area mapped as class $i$ with observed class $j$.
From the population error matrix, several quality measures can be derived; it is nevertheless strongly recommended to include the error matrix itself in a validation assessment. @brus2011sampling follow the suggestion by @stehman1997selecting that quality measures for categorical maps should be directly interpretable in terms of the probability of a misclassification and therefore recommend the use of three map quality measures: the overall purity, the map unit purity, and the class representation. We follow this recommendation here. Note that the map unit purity is often referred to as user's accuracy, and the class representation as producer's accuracy [@stehman1997selecting; @adhikari2014constructing]. @lark1995components, however, questions the appropriateness of these terms since both quality measures can be important for users as well as producers. He proposes to use map unit purity and class representation instead, which is adopted by @brus2011sampling and followed here.
A fourth frequently used group of quality measures are the Kappa indices, which adjust the overall purity measure for hypothetical chance agreement [@stehman1997selecting]. How this chance agreement is defined differs between the various indices. Some authors, however, conclude that Kappa indices are difficult to interpret, not informative, misleading and/or flawed and suggest abandoning their use [@pontius2011death]. These authors argue that Kappa indices attempt to compare accuracy to a baseline of randomness, but randomness is not a reasonable alternative for map construction. We, therefore, do not consider Kappa indices here.
**Overall purity**
The overall purity is the fraction of locations for which the mapped soil class equals the observed soil class and is defined as [@brus2011sampling]:
\begin{equation}
\rho = \frac{1}{N} \sum_{u=1}^{U} N_{uu}
\end{equation}
which is the sum of the principal diagonal of the error matrix divided by the total number of locations in the mapping area. The overall purity can be interpreted as the areal proportion of the mapping area that is correctly classified.
Alternatively, an indicator approach can be used to compute the overall purity. A validation site gets a **1** if the observed soil class is correctly predicted and a **0** otherwise. The overall purity is then computed by taking the average of the indicators.
**Map unit purity**
The map unit purity is calculated from the row marginals of the error matrix. It is the fraction of validation locations with mapped class $u$ for which the observed class is also $u$. The map unit purity for class $u$ is defined as [@brus2011sampling]:
\begin{equation}
\rho_u = \frac{ N_{uu}}{N_{u+}}
\end{equation}
The map unit purity can be interpreted as the proportion of the area of the map unit that is correctly classified. The complement of $\rho_u$, $1 - \rho_u$, is referred to as the error of commission for mapped class $u$.
**Class representation**
The class representation is calculated from the column marginals of the error matrix. It is the fraction of validation locations with observed class $u$ for which the mapped class is $u$. The class representation for class $u$ is defined as [@brus2011sampling]:
\begin{equation}
r_u = \frac{ N_{uu}}{N_{+u}}
\end{equation}
The class representation can be interpreted as the proportion of the area where in reality class $u$ occurs that is also mapped as class $u$. The complement of $r_u$, $1 - r_u$, is referred to as the error of omission for mapped class $u$.
### Estimating the map quality measures and associated uncertainty
In validation, we estimate the population means of the map quality measures from a sample taken from a limited number of locations in the mapping area. After all, we cannot afford to sample all locations, i.e. each grid cell of our soil map. Because the map quality measures are estimates, we are uncertain about these: we infer the quality measures from only a limited number of observations taken from the population. We do not know the true population means. The estimation uncertainty can be quantified with the sampling variance. From the variance, the lower and upper boundary of a confidence interval, typically the $95\%$, can be computed using basic statistical theory:
\begin{equation}
CI = \left( \underline{\hat{x}} - 1.96 \frac{\sigma}{\sqrt{n}} \; ; \; \underline{\hat{x}} + 1.96 \frac{\sigma}{\sqrt{n}} \right)
\end{equation}
where $\underline{\hat{x}}$ is the estimated map quality measure, for instance the $ME$, $MSE$ or overall purity, $\sigma$ is the estimated standard deviation of the quantity from which the quality measure is computed (e.g. the prediction errors) and $n$ is the validation sample size.
Quantified information about the uncertainty associated with map quality measures is useful and required for statistical testing. For instance, if one wants to test whether one mapping method performs better than another, one needs quantified information about uncertainty. Because we are uncertain about the estimated quality measures, an observed difference in map quality between two methods does not necessarily mean that one method is better than the other, even when the difference is substantial. The difference might be attributed to chance because we infer the quality measures from a limited sample from the population. With statistical hypothesis testing, we can calculate the probability that the observed difference is caused by chance. Based on the outcome we can accept or reject the hypothesis that there is no difference between the performance of the two mapping methods (this would be the null hypothesis for statistical testing) for a given significance level, usually 0.05.
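As a minimal sketch of both ideas, the code below computes a $95\%$ confidence interval for the mean error and a paired test comparing the squared errors of two methods; `pe1` and `pe2` are hypothetical vectors of prediction errors of two mapping methods at the same validation locations (selected by simple random sampling) and are not objects created elsewhere in this Chapter:
```{r, eval=FALSE}
# 95% confidence interval for the mean error of method 1
ME <- mean(pe1)
se <- sd(pe1) / sqrt(length(pe1))
c(lower = ME - 1.96 * se, upper = ME + 1.96 * se)
# Paired t-test on the squared errors: do the two methods differ in MSE?
t.test(pe1^2, pe2^2, paired = TRUE)
```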
## Graphical map quality measures
In addition to quantifying map accuracy statistically, one can also present validation results obtained from a sample graphically. This can be done by creating scatter plots of predicted against observed values and spatial bubble plots of validation errors.
Figure \@ref(fig:rwandaval) shows an example of a scatterplot and bubble plot. Both plots can be easily made with **R** [@rcore]. Use the function `plot(x,y)` to generate a scatter plot. The 1:1 line (black line in Figure \@ref(fig:rwandaval)) can be added to the plot with the command `abline(0, 1)`. The spatial bubble plot can be generated with the `bubble()` function of the **sp** package [@pebesma2005classes].
```{r rwandaval, fig.cap="Scatterplot of predicted versus observed SOM content for Rwanda (left) and spatial bubble plot of cross-validation error for SOM (right) (Kempen et al., 2015). The black line in the scatter plot represents the 1:1 line of prediction versus observed, the blue line represents the regression between observed and predicted values" , out.width='80%', echo=FALSE}
knitr::include_graphics("images/Validation_Rwanda.png")
```
## Validation methods and statistical inference {#validationMeth}
Following @brus2011sampling, we introduce and discuss three common validation methods: additional probability sampling, data-splitting and cross-validation, and show how to estimate the map quality measures introduced in the previous Section from a sample.
With additional probability sampling, an independent dataset is collected from the sampling population (all grid cells of a digital soil map) for the purpose of validation. This dataset is used in addition to the dataset that is used to calibrate the prediction model. The calibration dataset is often a legacy dataset collected with a purposive sampling design.
Data-splitting and cross-validation are applied in situations where one has only one data set available for prediction model calibration and validation. This can be a dataset collected with probability sampling, but in practice this typically is a legacy dataset collected with some purposive sampling design.
We warn here that if one uses data-splitting or cross-validation with a dataset collected with purposive sampling, then this has severe implications for the validity and interpretation of the estimated map quality measures, as we will explain below.
### Additional probability sampling {#addProbSampling}
The most appropriate approach for validation is additional probability sampling. This means that an independent validation dataset is collected in the field on the basis of a probability sampling design. Validation based on probability sampling ensures that one obtains unbiased and valid estimates of the map quality measures [@brus2011sampling; @stehman1999basic].
Additional probability sampling has several **advantages** compared to data-splitting and cross-validation using non-probability sample data. These are:
* No model is needed for estimating map quality estimates. We can apply design-based estimation, meaning that model-free unbiased and valid estimates of the map quality measures can be obtained;
* Discussions on the validity of the estimated map quality are avoided;
* Model-free, valid estimates of the variance of the map quality measures can be obtained that allows for hypothesis testing, e.g. for comparison of model performance.
**Disadvantages** can be extra costs involved in collecting an additional sample or terrain conditions that make it difficult to access all locations in the mapping area.
Probability sampling is random sampling such that:
* All locations in the mapping area have a probability larger than 0 of being selected;
* The inclusion probabilities are known but need not be equal.
It should be noted that the term random sampling is often used for arbitrary or haphazard sampling. Such sampling is not probability sampling because the inclusion probabilities are not known, and design-based, model-free estimation of map quality measures is not possible in this case. All probability samples are random samples, but not all random samples are probability samples. The term probability sampling should therefore only be used for random sampling with known inclusion probabilities.
There are many different probability sampling designs: simple, stratified, systematic, two-stage, clustered random sampling. We will not give an exhaustive overview here of all these designs. A good resource is @de2006sampling. For reasons of simplicity, we focus here on simple random sampling.
**Simple random sampling**: In simple random sampling, no restrictions are imposed on the random selection of sampling sites except that the sample size is fixed and chosen prior to sampling [@de2006sampling]. All sampling locations are selected with equal probability and independently from each other.
This can, for instance, be done as follows [@de2006sampling]:
1. Determine the minimum and maximum $X$ and $Y$ coordinates of the mapping area (the bounding box).
2. Generate two independent random coordinates $X$ and $Y$ from uniform probability distributions on the intervals ($x_{min}$, $x_{max}$) and ($y_{min}$, $y_{max}$), respectively.
3. Check if the selected sampling site falls within the mapping area. Accept the sampling site if it does; discard the sampling site if it does not.
4. Repeat steps 2 and 3 until the $n$ locations have been selected.
If a sampling location cannot be visited because of inaccessibility for instance, then this location should be discarded and be replaced by a location chosen from a reserve list. Always the location at the top of the list should be selected for this purpose; not an arbitrarily chosen location from the list such as the closest one. It is not allowed to shift an inaccessible sampling location to a location nearby that can be accessed. Irregularity, clustering and open spaces characterize the simple random sampling design [@de2006sampling].
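A minimal sketch of this selection procedure is given below. It assumes the mapping area is defined by the non-NA cells of a raster; the file name used here is the regression-kriging map produced later in this Chapter and serves only as an example:
```{r, eval=FALSE}
library(raster)
# Any raster covering the mapping area; non-NA cells define the area
r <- raster("results/MKD_OCSKGM_RK.tif")
n <- 100  # required sample size
set.seed(42)
xy <- matrix(nrow = 0, ncol = 2)
while (nrow(xy) < n) {
  # Step 2: draw a random coordinate pair within the bounding box
  x <- runif(1, xmin(r), xmax(r))
  y <- runif(1, ymin(r), ymax(r))
  # Step 3: keep the location only if it falls inside the mapping area
  if (!is.na(extract(r, cbind(x, y)))) xy <- rbind(xy, c(x, y))
}
# Alternatively, raster::sampleRandom() selects non-NA cells directly:
# pts <- sampleRandom(r, size = n, xy = TRUE, sp = TRUE)
```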
**Estimation of quantitative map quality measures:** For each validation location $s_i$ we compute the prediction error, the absolute error or the squared error:
\begin{equation}
e(s_i) = \hat{Z}(s_i) - Z(s_i)
\end{equation}
\begin{equation}
|e|(s_i) = |\hat{Z}(s_i) - Z(s_i)|
\end{equation}
\begin{equation}
e^2(s_i) = (\hat{Z}(s_i) - Z(s_i))^2
\end{equation}
The spatial mean of the mapping area for map quality measure $x$ is then estimated by:
\begin{equation}
\underline{\hat{x}} = \frac{1}{n} \sum_{i=1}^{n} x(s_i)
\end{equation}
where $i$ indicates the validation location, $i = 1, 2, \dots, n$, $n$ is the validation sample size, $x(s_i)$ is the value of quantity $x$ at location $s_i$ and $\underline{\hat{x}}$ is the estimated population mean of map quality measure $x$. Here $x$ is the prediction error in case of the $ME$, the absolute error in case of the $MAE$, and the squared prediction error in case of the $MSE$. Note that the estimator is the unweighted sample mean. This unweighted mean is an unbiased estimator because all sampling locations were selected with equal probability.
The $MSDR$ is estimated by:
\begin{equation}
\widehat{MSDR} = \frac{1}{n} \sum_{i=1}^{n} \frac{(\hat{Z}(s_i) - Z(s_i))^2}{\sigma^2(s_i)}
\end{equation}
and the $AVE$ by:
\begin{equation}
\widehat{A V E} = 1 - \frac{\sum_{i=1}^{n} (\hat{Z}(s_i) - Z(s_i))^2}{ \sum_{i=1}^{n} (Z(s_i) - \underline{\hat{Z}})^2}
\end{equation}
where $\underline{\hat{Z}}$ is the mean of the target soil property estimated from the validation sample.
One should be careful when assessing the proportion of variance explained by computing the $R^2$ of a linear regression of the predicted values on the observed values [@krause2005comparison], as is often done in practice. The $R^2$ quantifies the dispersion around the regression line, not the dispersion around the 1:1 line in which we are interested in validation. It therefore does not directly compare the predicted with the observed values, as the $AVE$ does; i.e. it is not based on the prediction error. A high $R^2$ value, therefore, does not automatically mean a high $AVE$. For instance, in the case of strongly biased predictions the $R^2$ can be high while the $AVE$ is low. The blue line in Figure \@ref(fig:rwandaval) is the regression line that one obtains when regressing the observed values on the predicted values. This line slightly differs from the 1:1 line. In this example, the $R^2$ of the regression is 0.42 while the $AVE$ is 0.40.
The uncertainty associated to the estimated map quality measures is quantified with the sampling variance, which for the $ME$, $MAE$ and $MSE$ is estimated by:
\begin{equation}
Var(\underline{\hat{x}}) = \frac{1}{n(n-1)} \sum_{i=1}^{n} (x(s_i) - \underline{\hat{x}})^2
\end{equation}
and the $95\%$ confidence interval ($CI$) of $x$ is given by:
\begin{equation}
CI_{95} = \underline{\hat{x}} \pm 1.96 \sqrt{Var(\underline{\hat{x}})}
\end{equation}
We should warn here that the calculation of the $CI$ is based on the assumption that the estimated map quality measure means have a normal distribution (the central limit theorem). For the squared errors this assumption can be unrealistic, especially for small sample sizes.
**Estimation of qualitative map quality measures:** For validation of qualitative soil maps, a sample error matrix is constructed from the validation data (Figure \@ref(fig:errormatrix)). $n$ is the total number of validation locations in the sample. Element $n_{ij}$ of the matrix corresponds to the number of validation locations that have been predicted as class $i$, $i = 1, 2, \dots U$ and belong to class $j$, $j = 1, 2, \dots U$ [@lark1995components]. The matrix summarizes correct predictions and incorrect predictions within the validation data.
```{r errormatrix, fig.cap="Sample error matrix" , out.width='80%', echo=FALSE, fig.align='center'}
knitr::include_graphics("images/Validation_error_matrix.png")
```
From the sample error matrix the overall purity, map unit purity and class representation are estimated by:
\begin{equation}
\hat{\rho} = \frac{1}{n} \sum_{u=1}^{U} n_{uu}
\end{equation}
\begin{equation}
\hat{\rho}_u = \frac{n_{uu}}{n_{u+}}
\end{equation}
\begin{equation}
\hat{r}_u = \frac{n_{uu}}{n_{+u}}
\end{equation}
Alternatively, the overall purity can be estimated by defining a purity indicator variable for each validation location that takes value **1** if the mapped soil class equals the observed soil class at that location, and **0** else. The overall purity is then estimated by:
\begin{equation}
\hat{\rho} = \frac{1}{n} \sum_{i=1}^{n} \delta(s_i)
\end{equation}
where ${\delta(s_i)}$ is the indicator variable at validation location $s_i$. The variance of the estimated overall purity is estimated by:
\begin{equation}
Var(\hat{\rho}) = \frac{1}{n(n-1)} \sum_{i=1}^{n} (\delta(s_i) - \hat{\rho})^2
\end{equation}
Alternatively, the variance is estimated by:
\begin{equation}
Var(\hat{\rho}) = \frac{\hat{\rho}(1 - \hat{\rho})}{n - 1}
\end{equation}
which is the variance of a binomial probability distribution. The $95\%$ confidence interval of $\hat{\rho}$ is given by:
\begin{equation}
CI_{95} = \hat{\rho} \pm 1.96 \sqrt{Var(\hat{\rho})}
\end{equation}
We warn that the $CI$ as calculated here is a rough approximation which only holds when $n \times \hat{\rho}$ and $n \times (1 - \hat{\rho})$ are sufficiently large (at least 5 as a rule of thumb). Otherwise, the binomial distribution should be used to compute the $CI$.
Figure \@ref(fig:errormatrix2) shows a hypothetical example of a sample error matrix for a soil class map. For this example, the overall purity is estimated by:
\begin{equation}
\hat{\rho} = \frac{(19 + 33 + 25 + 42 + 19)}{240} = 0.575
\end{equation}
meaning that for an estimated $57.5\%$ of the mapping area the mapped soil class is equal to the true soil class.
```{r errormatrix2, fig.cap="Sample error matrix for a hypothetical soil class map" , out.width='80%', echo=FALSE, fig.align='center'}
knitr::include_graphics("images/Validation_error_matrix2.png")
```
Table \@ref(tab:purity) gives the map unit purities and class representations for this example. The map unit purity of the Gleysol is 0.581, meaning that at $58.1\%$ of the validation locations for which a Gleysol is predicted, a Gleysol is observed. Assuming the validation data were collected by simple random sampling, we conclude that for $58.1\%$ of the area mapped as Gleysol we would find a Gleysol in the field. The class representation of the Gleysol is 0.463, meaning that for $46.3\%$ of the validation locations where a Gleysol is observed, a Gleysol is also mapped. The majority of the Gleysol locations are thus mapped as a different soil class. Again, assuming the validation data were collected by probability sampling, we estimate that $22.5\%$ ($\frac{54}{240}$ $\times$ $100\%$) of our mapping area is covered by Gleysols, while we map Gleysols for only $17.9\%$ of the area ($\frac{43}{240}$ $\times$ $100\%$). It can happen that a soil class has a high map unit purity and a low class representation. This means that if we map a Gleysol we will likely find a Gleysol there, but that a large part of the true Gleysol area is not mapped as such.
```{r purity, echo=FALSE}
units <- c('Anthrosol', 'Cambisol', 'Gleysol', 'Luvisol', 'Podzol')
purity <- c(0.679, 0.508, 0.508, 0.592, 0.576)
classrep <- c(0.633, 0.516, 0.463, 0.700, 0.594)
df <- data.frame(unit = units, "map unit purity" = purity, "class representation" = classrep)
knitr::kable(df, caption = "Map unit purity and class representation statistics for an hypothetical example", booktabs = TRUE)
```
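The quality measures in this Table can also be computed directly from validation data. The sketch below assumes two hypothetical factor vectors, `predicted` and `observed`, holding the mapped and observed soil class at each validation location (these objects are not created elsewhere in this Chapter):
```{r, eval=FALSE}
# Sample error matrix: rows are mapped classes, columns observed classes
tab <- table(predicted, observed)
overall.purity <- sum(diag(tab)) / sum(tab)
map.unit.purity <- diag(tab) / rowSums(tab)       # per mapped class
class.representation <- diag(tab) / colSums(tab)  # per observed class
# Exact binomial confidence interval for the overall purity,
# e.g. 138 correct predictions out of 240 validation locations
binom.test(x = 138, n = 240)$conf.int
```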
### Data-splitting
In data-splitting, the sample data set is split into two subsets. One subset is used to calibrate the prediction model. The other subset is used for validation. A frequently used splitting criterion is 70 - 30, where $70\%$ of the sample data are used for calibration and $30\%$ for validation. The choice of a splitting criterion, however, is arbitrary and it is not evident how to split a data set in such a way that unbiased and valid estimates of the map accuracy can be obtained. For sparse data sets, data-splitting can be inefficient since the information in the data set is not fully exploited for both calibration and validation.
It is important to note here that a random subsample of (legacy) data that are collected with a purposive (non-probability) design, is not a probability sample of the study area. This means that design-based estimation of map quality measures is not possible.
If a validation (sub)sample is a non-probability sample of the mapping area, then we must account for possible spatial autocorrelation of the prediction errors when estimating the map quality measures. One can imagine that when two validation locations are close together and the prediction errors are correlated, there is less information in these two locations (there is information redundancy because of autocorrelation) than in two isolated locations. This information redundancy has to be accounted for when estimating map quality measures and implies that we have to rely on model-based estimation: a model for the spatially autocorrelated prediction error has to be assumed. Thus, we will not obtain model-free, unbiased and valid estimates of the quality measures from non-probability sample validation data. In a case study, @knotters2013purposive showed that model-based predictions of producer's accuracies from two models differed strongly, indicating that with the model-based approach the validation results strongly depend on model assumptions.
In most studies, however, spatial correlation is not accounted for: the estimators for simple random sampling presented above in Section \@ref(addProbSampling) are simply applied to the non-probability sample data. In such a case, the quality measures cannot be considered unbiased and valid estimates of the population means of the map quality measures. In addition, the estimated variance of the map quality measures is not valid, and statistical testing of mapping methods to assess which method gives the most accurate predictions cannot be done.
In other words, if the simple random sampling estimators are used to estimate map quality measures, then these are only valid for the validation data points. The map quality measures do not give a valid estimate of the quality of the map as a whole (the population). For instance, the overall purity cannot be interpreted as an areal proportion of correctly mapped soil classes, only as the proportion of the validation data points for which the soil class is correctly predicted.
### Cross-validation {#xval}
In $K$-fold cross-validation, the dataset is split into $K$ roughly equal sets. One of these sets is set aside for validation. The model is then calibrated using the data from the remaining $K-1$ sets and used to predict the target variable at the data points set aside. From these predictions, the prediction errors are calculated. This procedure is repeated $K$ times, each time setting a different set aside for validation. In this way, we obtain a prediction error for each sampling site, and all data are used for both validation and model calibration. Cross-validation is thus much more efficient than data-splitting.
$K$ is typically chosen as 5 or 10, or as the number of data points $n$. The latter is referred to as leave-one-out cross-validation, in which only one validation site is set aside in each iteration. The model is then calibrated with $n-1$ observations. Some authors repeat $K$-fold cross-validation a number of times and average the results to obtain a more robust estimate of the map quality measures.
Note that the problem of spatially correlated errors remains when data are non-probability sample data. Cross-validation using a non-probability sampling dataset suffers from the same drawbacks with respect to unbiasedness and validity of the estimates of the map quality measures as data-splitting. The estimates cannot be interpreted as being valid for the mapping area, but only for the validation locations.
In **R**, the **caret** package [@kuhn2016short] offers functionality for data-splitting and cross-validation.
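A minimal sketch of $K$-fold cross-validation with **caret** is given below; `dat` and the covariate names are placeholders for a calibration dataset and its predictors, not objects defined in this Chapter:
```{r, eval=FALSE}
library(caret)
# 10-fold cross-validation of a simple linear model (illustrative only)
ctrl <- trainControl(method = "cv", number = 10)
fit <- train(OCSKGM ~ covariate1 + covariate2, data = dat,
             method = "lm", trControl = ctrl)
fit$results  # cross-validated RMSE, R squared and MAE
# createFolds() returns the indices of the K test sets if the folds are
# needed explicitly, e.g. folds <- createFolds(dat$OCSKGM, k = 10)
```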
## Technical steps - Validation {#TS:validation}
*G.F. Olmedo*
**Step 1 - Prediction error**
First, we will load the validation dataset. This dataset was measured at the same time as the modeling dataset. After data preparation in Chapter \@ref(preparation), we used the code in Section \@ref(dataSplit) to split the database: 75% of the points were used in the models from Chapter \@ref(mappingMethods) and now we will use the remaining 25% of the points to test those results.
```{r}
library(sp)
dat <- read.csv("data/dat_test.csv")
# Promote to SpatialPointsDataFrame
coordinates(dat) <- ~ X + Y
dat@proj4string <- CRS(projargs = "+init=epsg:4326")
```
Now, we will load the predicted layers from Chapter \@ref(mappingMethods), extract the predicted value for every point and then, estimate the prediction error.
```{r ,warning=FALSE}
library(raster)
OCSKGM_GM <- raster("results/MKD_OCSKGM_GM.tif")
OCSKGM_RK <- raster("results/MKD_OCSKGM_RK.tif")
OCSKGM_rf <- raster("results/MKD_OCSKGM_rf.tif")
OCSKGM_svm <- raster("results/MKD_OCSKGM_svm.tif")
dat <- extract(x = OCSKGM_GM, y = dat, sp = TRUE)
dat <- extract(x = OCSKGM_RK, y = dat, sp = TRUE)
dat <- extract(x = OCSKGM_rf, y = dat, sp = TRUE)
dat <- extract(x = OCSKGM_svm, y = dat, sp = TRUE)
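# Compute the prediction error: predicted OCS minus observed OCS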
dat$PE_GM <- dat$MKD_OCSKGM_GM - dat$OCSKGM
dat$PE_RK <- dat$MKD_OCSKGM_RK - dat$OCSKGM
dat$PE_rf <- dat$MKD_OCSKGM_rf - dat$OCSKGM
dat$PE_svm <- dat$MKD_OCSKGM_svm - dat$OCSKGM
# Save the validation results
write.csv(dat, "results/validation.csv", row.names = F)
```
```{r prederrors, echo=FALSE}
prederrors <- as.data.frame(cbind(summary(dat$PE_GM), summary(dat$PE_RK), summary(dat$PE_rf), summary(dat$PE_svm)))
names(prederrors) <- c('GM', 'RK', 'randomForest', 'svm')
knitr::kable(prederrors, caption = "Summary of prediction errors for four different mapping methods", digits = 2, align = "rrrr", booktabs = TRUE)
```
**Step 2 - Estimating the map quality measures**
In this Section, we present the code needed to estimate the map quality measures for quantitative soil maps.
```{r}
# Mean Error
ME_GM <- mean(dat$PE_GM, na.rm=TRUE)
ME_RK <- mean(dat$PE_RK, na.rm=TRUE)
ME_rf <- mean(dat$PE_rf, na.rm=TRUE)
ME_svm <- mean(dat$PE_svm, na.rm=TRUE)
# Mean Absolute Error (MAE)
MAE_GM <- mean(abs(dat$PE_GM), na.rm=TRUE)
MAE_RK <- mean(abs(dat$PE_RK), na.rm=TRUE)
MAE_rf <- mean(abs(dat$PE_rf), na.rm=TRUE)
MAE_svm <- mean(abs(dat$PE_svm), na.rm=TRUE)
# Mean Squared Error (MSE)
MSE_GM <- mean(dat$PE_GM^2, na.rm=TRUE)
MSE_RK <- mean(dat$PE_RK^2, na.rm=TRUE)
MSE_rf <- mean(dat$PE_rf^2, na.rm=TRUE)
MSE_svm <- mean(dat$PE_svm^2, na.rm=TRUE)
# Root Mean Squared Error (RMSE): the square root of the MSE
RMSE_GM <- sqrt(MSE_GM)
RMSE_RK <- sqrt(MSE_RK)
RMSE_rf <- sqrt(MSE_rf)
RMSE_svm <- sqrt(MSE_svm)
# Amount of Variance Explained (AVE)
AVE_GM <- 1 - sum(dat$PE_GM^2, na.rm=TRUE) /
sum( (dat$OCSKGM - mean(dat$OCSKGM, na.rm = TRUE))^2,
na.rm = TRUE)
AVE_RK <- 1 - sum(dat$PE_RK^2, na.rm=TRUE) /
sum( (dat$OCSKGM - mean(dat$OCSKGM, na.rm = TRUE))^2,
na.rm = TRUE)
AVE_rf <- 1 - sum(dat$PE_rf^2, na.rm=TRUE) /
sum( (dat$OCSKGM - mean(dat$OCSKGM, na.rm = TRUE))^2,
na.rm = TRUE)
AVE_svm <- 1 - sum(dat$PE_svm^2, na.rm=TRUE) /
sum( (dat$OCSKGM - mean(dat$OCSKGM, na.rm = TRUE))^2,
na.rm = TRUE)
```
```{r validation, echo=FALSE}
validation <- data.frame(ME = c(ME_GM, ME_RK, ME_rf, ME_svm),
MAE= c(MAE_GM, MAE_RK, MAE_rf, MAE_svm),
MSE= c(MSE_GM, MSE_RK, MSE_rf, MSE_svm),
RMSE= c(RMSE_GM, RMSE_RK, RMSE_rf, RMSE_svm),
AVE= c(AVE_GM, AVE_RK, AVE_rf, AVE_svm)
)
validation <- as.data.frame(t(validation))
names(validation) <- c('GM', 'RK', 'randomForest', 'svm')
```
```{r, echo=FALSE}
knitr::kable(validation, caption = "Summary of map quality measures for four different mapping methods", digits = 3, align = "rrrr", booktabs = TRUE)
```
**Step 3 - Graphical map quality measures**
In this Section, we will apply the proposed graphical quality measures to the results of Chapter \@ref(mappingMethods).
```{r, fig.asp=2, fig.cap='Scatter plots of predicted against observed values'}
# Four scatter plots in a 2 x 2 layout
par(mfrow=c(2,2))
# Scatter plot
plot(dat$MKD_OCSKGM_GM, dat$OCSKGM, main="GM", xlab="predicted",
ylab='observed')
# 1:1 line in black
abline(0,1, lty=2, col='black')
# Regression line between predicted and observed in blue
abline(lm(dat$OCSKGM ~ dat$MKD_OCSKGM_GM), col = 'blue', lty=2)
# Scatter plot
plot(dat$MKD_OCSKGM_RK, dat$OCSKGM, main="RK", xlab="predicted",
ylab='observed')
# 1:1 line in black
abline(0,1, lty=2, col='black')
# Regression line between predicted and observed in blue
abline(lm(dat$OCSKGM ~ dat$MKD_OCSKGM_RK), col = 'blue', lty=2)
# Scatter plot
plot(dat$MKD_OCSKGM_rf, dat$OCSKGM, main="rf", xlab="predicted",
ylab='observed')
# 1:1 line in black
abline(0,1, lty=2, col='black')
# Regression line between predicted and observed in blue
abline(lm(dat$OCSKGM ~ dat$MKD_OCSKGM_rf), col = 'blue', lty=2)
# Scatter plot
plot(dat$MKD_OCSKGM_svm, dat$OCSKGM, main="svm", xlab="predicted",
ylab='observed')
# 1:1 line in black
abline(0,1, lty=2, col='black')
# Regression line between predicted and observed in blue
abline(lm(dat$OCSKGM ~ dat$MKD_OCSKGM_svm), col = 'blue', lty=2)
par(mfrow=c(1,1))
```
```{r, fig.cap='Spatial bubble of the prediction errors for GM', out.width='60%'}
# Spatial bubbles for prediction errors
bubble(dat[!is.na(dat$PE_GM),], "PE_GM", pch = 21,
col=c('red', 'green'))
```
```{r, fig.cap='Spatial bubble of the prediction errors for RK', out.width='60%'}
# Spatial bubbles for prediction errors
bubble(dat[!is.na(dat$PE_RK),], "PE_RK", pch = 21,
col=c('red', 'green'))
```
```{r, fig.cap='Spatial bubble of the prediction errors for RF', out.width='60%'}
# Spatial bubbles for prediction errors
bubble(dat[!is.na(dat$PE_rf),], "PE_rf", pch = 21,
col=c('red', 'green'))
```
```{r, fig.cap='Spatial bubble of the prediction errors for SVM', out.width='60%'}
# Spatial bubbles for prediction errors
bubble(dat[!is.na(dat$PE_svm),], "PE_svm", pch = 21,
col=c('red', 'green'))
```
## Technical steps - Data-splitting {#dataSplit}
As explained before, when no independent validation dataset is available, the data are often split before fitting the models: one part is used for calibration and the other for validation. In this Section, we present the code needed to achieve this using the **caret** package.
```{r, echo=F}
# Fix randomness for data splitting to produce a complete train
# dataset for soilmap class 1, where we have only 1 sample
set.seed(1234)
```
```{r, eval=T, fig.cap='Statistical distribution of train and test datasets'}
library(caret)
dat <- read.csv("data/dataproc.csv")
train.ind <- createDataPartition(1:nrow(dat), p = .75, list = FALSE)
train <- dat[ train.ind,]
test <- dat[-train.ind,]
plot(density (log(train$OCSKGM)), col='red', main="")
lines(density(log(test$OCSKGM)), col='blue')
legend('topright', legend=c("train", "test"),
col=c("red", "blue"), lty=1, cex=1.5)
```
```{r, eval=F}
write.csv(train, file="data/dat_train.csv", row.names = FALSE)
write.csv(test, file="data/dat_test.csv", row.names = FALSE)
```