
# Uncertainty {#uncertainty}
*G.B.M. Heuvelink*

Soil mapping involves making predictions at locations where no soil measurements were taken. This inevitably leads to prediction errors because soil spatial variation is complex and cannot be modeled perfectly. It also implies that we are uncertain about the true soil class or true soil property at prediction locations. We only have the predictions, which differ from the true values in an unpredictable way, and hence we are uncertain about the true value. In fact, we may even be uncertain about the soil at the measurement locations because no measurement method is perfect and uncertainty also arises from measurement errors.

This Chapter describes how uncertainty may be characterized by probability distributions. It also explains how the parameters of these distributions may be derived, leading to quantification of uncertainty. We will see that this can become quite complex, because soil properties vary in space and are often cross-correlated, which the uncertainty model must take into account. A further complication is that there are many different sources of uncertainty. In some cases, it may be too difficult to arrive at a spatially explicit characterization of uncertainty, and in such cases statistical validation may be used to derive summary measures of the accuracy of soil maps. We begin this Chapter with a description of uncertainty sources.

## Sources of uncertainty {#sourcesuncert}

Consider a case in which soil samples were taken from a large number of measurement locations in a study area, taken to the laboratory and analyzed for various soil properties. Let us further assume that the measurement locations were indicated on a topographic map and that the soil was also classified at each measurement location. Next, the soil property and soil type observations were used to create maps of soil properties and soil type using DSM techniques. These techniques not only make use of the soil observations but also benefit from maps of environmental variables that are correlated with the soil, and hence help explain the soil spatial variation. Which sources of uncertainty contribute to uncertainty about the final soil maps? We distinguish four main categories.

### Attribute uncertainty of soil measurements

Soil measurements suffer from measurement errors in the field and laboratory. Perhaps the soil was not sampled at the right depth, perhaps the organic layer was not removed completely before collecting soil material, or perhaps by accident bags were interchanged or numbered wrongly. Field estimates of soil type and soil properties are also not error-free, especially when estimation is difficult, such as estimation of SOC content or texture. Field estimates may also be subjective because soil scientists may be trained differently and so there may be systematic differences between their field estimates of soil properties. Similarly, it is also not uncommon for soil scientists to disagree about the soil type when classifying a soil in the field.

Laboratory analysis adds error too. Soil samples may not be perfectly mixed before a much smaller subsample is taken for the actual measurement; instruments have limited precision and may have systematic errors; climate conditions in the laboratory vary; and procedures may differ between laboratory personnel. Differences between laboratories are even bigger and may be of the same order of magnitude as the soil variation itself. It is strongly advised to always take sufficient duplicates and to randomize the order in which soil samples are analyzed in the laboratory. This allows quantifying the combined field and laboratory measurement errors.
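
As a minimal sketch of how duplicates may be used, assume that each sample is measured twice and that both measurements carry independent errors with the same standard deviation. The measurement error standard deviation can then be estimated from the paired differences. The data below are simulated and all object names are illustrative.

```r
# Simulated example: 40 samples, each analysed in duplicate (hypothetical data)
set.seed(1)
n        <- 40
true_val <- rnorm(n, mean = 2.0, sd = 0.6)     # unknown true values
sd_meas  <- 0.15                               # measurement error SD used in the simulation
dup1     <- true_val + rnorm(n, sd = sd_meas)  # first measurement
dup2     <- true_val + rnorm(n, sd = sd_meas)  # duplicate measurement

# Var(dup1 - dup2) = 2 * sd_meas^2, because the true value cancels in the difference
d           <- dup1 - dup2
sd_meas_hat <- sqrt(sum(d^2) / (2 * n))
sd_meas_hat
```

Note that duplicates taken only in the laboratory capture laboratory error alone; to capture the combined error, the duplication has to start in the field.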

### Positional uncertainty of soil measurements

When collecting soil data in the field we would generally note the geographic coordinates of the measurement locations. Nowadays this is easy with GPS instruments and, depending on the device, modest to high positional accuracy can be achieved. But the positional error may still be too large to be negligible. For instance, consider the case where the soil data are used to train a DSM model that predicts soil properties from covariates. Let these covariates be available at high spatial resolution and have substantial fine-scale spatial variation. Then it is clear that positional uncertainty in the soil measurements may link these measurements to the wrong covariates, which will weaken the strength of the relationship between the soil variable and covariates and deteriorate the quality of the final soil map.

Many soil legacy data suffer from large positional uncertainty. Locations may only be traced from vague descriptions such as *near village A* or *east of the road from B to C*. In such cases, researchers should consider whether using such data for calibration of a DSM model and for spatial prediction using the calibrated DSM model is wise. It may do more harm than good. This depends on the specific DSM model used and the degree of spatial variation of the covariates. It also depends on the degree of spatial variation of the soil property itself. If it has negligible fine-scale spatial variation and hence has similar values at the registered and actual geographic location, then little harm is done. For instance, in the Sahara desert many soil properties will show little spatial variation over distances of hundreds or perhaps thousands of meters, so in such cases poor geographic positional accuracy will not seriously affect DSM predictions.

### Uncertainty in covariates

Maps of covariates that are used in DSM can also suffer from errors and uncertainties. For instance, a DEM is a major source of geomorphological covariates but DEMs are only approximations of the real elevation. DEM errors will propagate and cause uncertainty in geomorphological properties such as slope, aspect and topographic wetness index. As a result, the DSM model must be trained on covariate data that are merely approximations of the intended covariates, which will generally lead to weakened relationships and larger DSM prediction errors. Land cover is another example; soil properties may be strongly influenced by land cover, but such relationship may come out quite weak if the DSM model is trained with a land cover map that represents land cover wrongly for a large part of the study area.

Covariates also come at a specific spatial resolution, which may be quite coarse. To use such a covariate in a fine-scale DSM model, the coarse-scale grid cell value is copied to all fine-scale grid cells contained in it, but any fine-scale spatial variation within the coarse cell is then ignored, which introduces uncertainty. A possible solution might be to smooth the coarse-scale covariate before using it in DSM calibration, but this will not remedy all problems.

Uncertainty in covariates leads to weaker DSM models, but this weakening is not hidden from the developer because the deterioration of predictive power is implicitly included in the DSM model. For instance, the amount of variance explained by a DSM model that uses the true land cover as measured on sampling sites may be much higher than that of a model that uses a land cover map. Users may then be tempted to calibrate the DSM model with the true land cover data, but if they next apply that model using the land cover map to predict the soil at non-measurement locations they would systematically underestimate the uncertainty of the resulting map.
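
The following simulation is a rough sketch of this pitfall. For simplicity it uses a continuous covariate instead of land cover, and it assumes that the mapped covariate equals the true covariate plus an independent normal map error; all names and numbers are illustrative. Both models are applied with the mapped covariate, and we compare how often the nominal 95% prediction intervals contain the true value at independent validation points.

```r
set.seed(5)
n      <- 2000
X_true <- rnorm(n)                          # true covariate, observed at sampling sites
X_map  <- X_true + rnorm(n, sd = 1)         # mapped covariate = truth + map error (assumed)
Z      <- 10 + 2 * X_true + rnorm(n, sd = 0.5)

cal <- 1:1000                               # calibration set
val <- 1001:2000                            # independent validation set

fit_true <- lm(Z ~ X, data = data.frame(Z = Z[cal], X = X_true[cal]))
fit_map  <- lm(Z ~ X, data = data.frame(Z = Z[cal], X = X_map[cal]))

# Both models are applied with the mapped covariate, as when producing the soil map
new_map <- data.frame(X = X_map[val])
coverage <- function(fit) {
  pint <- predict(fit, newdata = new_map, interval = "prediction", level = 0.95)
  mean(Z[val] >= pint[, "lwr"] & Z[val] <= pint[, "upr"])
}
cov_truth <- coverage(fit_true)   # calibrated on the true covariate
cov_map   <- coverage(fit_map)    # calibrated on the mapped covariate
c(calibrated_on_truth = cov_truth, calibrated_on_map = cov_map)
```

Under these assumptions the model calibrated on the mapped covariate should give intervals with close to nominal coverage, whereas the model calibrated on the true covariate yields intervals that are far too narrow.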

### Uncertainty in models predicting soil properties from covariates and soil point data

Even if the soil point data and covariate data were error-free, the resulting DSM predictions would still deviate from the true soil properties. This is because the DSM model itself also introduces uncertainties. Models are merely simplified representations of the real world. The real world is too complex and approximations are needed. For instance, even though we know that physical, chemical and biological processes determine the soil as given by the state equation of soil formation $soil$ = $f$($cl$, $o$, $r$, $p$, $t$), the function $f$ is too complex to be fully understood and implemented in a  computer model. Instead, we use crude approximations such as multiple linear regression and machine-learning algorithms. These empirical models have the additional burden that extrapolation beyond conditions represented by the calibration data is difficult and risky. For extrapolation purposes, it is advised to use DSM models that better represent the mechanisms behind soil formation, but again it is practically impossible to build mechanistic models that represent the real world perfectly. This is not only because we may not understand all processes and their interactions well, but also because dynamic mechanistic models need much information, such as the initial state, boundary conditions, and driving forces. Such detailed information is generally lacking.

Model uncertainty is generally subdivided into model parameter uncertainty and model structural uncertainty. The former can be reduced by using models with fewer parameters or by using a larger calibration data set. The latter can be reduced by using a more complex model, but this will only work if there are enough data to calibrate such a model. Thus, in general, a compromise has to be sought by choosing a level of model complexity that matches the amount of information available.

## Uncertainty and spatial data quality

Research into spatial accuracy in geographic information science has listed five main elements of spatial data quality:

* Lineage;
* Positional accuracy;
* Attribute accuracy;
* Logical consistency; and
* Completeness.

We have already discussed positional and attribute accuracy. Lineage refers to documenting the original sources for the data and the processing steps. This is strongly related to the principle of reproducible research. Logical consistency addresses whether there are any contradictory relationships in the database. For instance, it checks whether all data have the same geographic projection and that measurement units are consistent. Completeness refers to whether there are any missing data. For instance, covariate maps must cover the entire study area if they are to be used as explanatory variables in a DSM model. Soil profile data need not capture all relevant soil properties and tend to have fewer soil measurements at greater depths.

In summary, there are many sources of uncertainty that affect the quality of DSM products. This Section has reviewed these sources but was purposely descriptive. The next Section selects a few major uncertainty sources and works out quantitatively how these cause uncertainty in the resulting soil map. Perhaps it is useful to mention that focussing attention on errors and uncertainties may give the wrong impression that soil maps are generally inaccurate and of poor quality. This is not the message that we wish to convey here. But producers and users of soil maps should be aware of the sources of uncertainty and should ideally identify how these uncertainties affect the final product. Thus, quantification of the uncertainty in DSM maps, be it through explicit modeling or independent validation, is important.

## Quantifying prediction uncertainty

Uncertainties in soil measurements, covariates and DSM models propagate to resulting soil maps. The uncertainty propagation can fairly easily be traced provided that the uncertainty sources are characterized adequately. The most appropriate way of doing that is by making use of statistics and probability distributions. This Section also takes that approach and starts by providing a brief overview of probability distributions and how these may be used to represent uncertainty. Next, it analyses how the four sources of uncertainty distinguished in Section \@ref(sourcesuncert) lead to uncertainty in soil maps produced using DSM.

### Uncertainty characterised by probability distributions

If we are uncertain about the value of a soil property at some location and depth this means that we cannot identify one single, true value for that soil property [@goovaerts2001geostatistical; @arrouays_uncertainty_2014]. Instead, we may be able to provide a list of all possible values for it and attach a probability to each. In other words, we represent the true but unknown soil property by a probability distribution.

For instance, suppose that we estimate the sand content of a soil sample in the field as $35\%$, while recognizing that a field estimate is quite crude and that the true sand content may very well be less or more than the estimated $35\%$. We might be confident that the estimation error is unlikely to be greater than $8\%$, and hence it would be reasonable to represent the sand content by a normal distribution with a mean of $35\%$ and a standard deviation of $4\%$. For the normal distribution, $95\%$ of the probability mass lies within two standard deviations from the mean, so we would claim that there is a $5\%$ probability that the sand content is smaller than $27\%$ or greater than $43\%$.
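
The numbers in this example can be checked with the normal distribution functions in R. This is only a small illustration; the object names are ours.

```r
m <- 35   # field estimate of sand content (%)
s <- 4    # standard deviation (%)

# Exact central 95% interval (the factor is 1.96 rather than 2)
ci95 <- qnorm(c(0.025, 0.975), mean = m, sd = s)
ci95

# Probability that the true sand content lies outside the interval 27-43%
p_outside <- pnorm(27, mean = m, sd = s) + (1 - pnorm(43, mean = m, sd = s))
p_outside
```

The probability of falling outside two standard deviations is about 4.6%, which is why the $5\%$ quoted above should be read as a convenient approximation.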

In the example above we chose the normal distribution because it is the most common probability distribution, but we might as well have used a different distribution, such as the uniform or lognormal distribution. Indeed many soil properties, such as soil nutrient concentrations, are better described by lognormal distributions, because values below zero cannot occur and because very high positive values (i.e., outliers) are not unlikely. For instance, we may estimate the organic carbon concentration (OC) of a soil sample as $1.2\%$ and identify with it an asymmetric $95\%$ credibility interval ranging from $0.8\%$ to $2.5\%$. In general, statistical modeling is easier if the variables under study can be described by normal distributions. This explains why we usually apply a transformation to skewed variables prior to statistical modeling. For instance, when building a DSM model of OC, it may be wise to develop such a model for the logarithm of OC and back-transform the DSM predictions.
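
To illustrate how a symmetric interval on the log scale becomes asymmetric after back-transformation, consider the sketch below. The values of `mu` and `sigma` are assumptions chosen for illustration; they are not meant to reproduce the interval quoted above.

```r
mu    <- log(1.2)  # assumed mean of log(OC); exp(mu) is the median of OC
sigma <- 0.3       # assumed standard deviation of log(OC)

median_oc <- exp(mu)
mean_oc   <- exp(mu + sigma^2 / 2)             # mean of a lognormal distribution
ci95_oc   <- exp(mu + c(-1.96, 1.96) * sigma)  # asymmetric interval on the original scale

# The same limits follow from the lognormal quantile function
ci95_q <- qlnorm(c(0.025, 0.975), meanlog = mu, sdlog = sigma)
rbind(ci95_oc, ci95_q)
```

Note that simply exponentiating a prediction made on the log scale yields the median rather than the mean on the original scale; the mean additionally requires the variance term `sigma^2 / 2`.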

There are many different soil properties that in addition vary in space and possibly time. Thus, the characterization of uncertainty about soil properties needs to be extended and include cross- and space-time correlations. It is beyond the scope of this Chapter to explain this in detail; for this we refer to standard textbooks such as @goovaerts1997geostatistics and @webster_2007. If we assume a joint normal distribution, then a vector of soil properties (be it different soil properties or the same soil property at multiple locations, depths or times) $Z$ is fully characterized by the vector of means $m$ and variance-covariance matrix $C$.

Figure \@ref(fig:pairedsoils) shows three examples of 500 paired soil property values that were simulated from different bivariate normal distributions. The left panel shows an uncorrelated case with equal standard deviations for both properties. The centre and right panels show a case where soil property 2 has a greater standard deviation than soil property 1. The difference between these two cases is that the centre panel has a zero correlation between the two soil properties while it is positive in the right panel.


```{r pairedsoils, fig.cap="Scatter plots of 500 paired soil property values drawn from a two-dimensional normal distribution" , out.width='80%', echo=FALSE, fig.align='center'}
knitr::include_graphics("images/pairedsoilpropierties.png")
```
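
Paired values of this kind can be simulated from a vector of means $m$ and a variance-covariance matrix $C$ as sketched below. The code uses a Cholesky factorization in base R; the means, standard deviations and correlation are illustrative and are not necessarily those used for the figure.

```r
set.seed(42)
n   <- 500
m   <- c(0, 0)     # vector of means (illustrative)
sd1 <- 1           # standard deviation of soil property 1
sd2 <- 2           # standard deviation of soil property 2
rho <- 0.7         # correlation between the two properties
C   <- matrix(c(sd1^2,           rho * sd1 * sd2,
                rho * sd1 * sd2, sd2^2), nrow = 2)

# chol() returns U with t(U) %*% U = C, so rows of E %*% U have covariance C
U <- chol(C)
E <- matrix(rnorm(2 * n), nrow = n)   # n x 2 independent standard normal values
Z <- sweep(E %*% U, 2, m, "+")        # n x 2 correlated draws with mean m

cor(Z[, 1], Z[, 2])
```

Setting `rho <- 0` gives the uncorrelated cases, and `plot(Z)` produces a scatter plot of the simulated pairs.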


### Propagation of model uncertainty

Now that we have clarified how uncertainty in soil properties may be characterized by probability distributions, let us consider what these distributions look like in DSM and how these are influenced by the uncertainty sources described in Section \@ref(sourcesuncert). We begin with uncertainty source 4, uncertainty in DSM models.
We noted before that uncertainty in DSM models may be separated into model parameter and model structural uncertainty. A typical example of this is a multiple linear regression model:

\begin{equation}
\label{eq:DSM model}
Z(s) = \beta_0 + \beta_1  \cdot X_1 (s) + \beta_2 \cdot X_2 (s) + \varepsilon(s)
\end{equation}

Note that here for simplicity we assumed two environmental covariates ${X_1}$ and ${X_2}$  while in practice we are likely to use many more. Parameter uncertainty of this model occurs because the parameters ${\beta_0}$ , ${\beta_1}$ and ${\beta_2}$ are merely estimated using calibration data. Under the assumptions made by the linear regression model, these estimation errors are normally distributed and have zero mean, while their standard deviations and cross-correlations can also be computed [@snedecor1989stadistical, Section 17.5]. The standard deviations become smaller as the size of the calibration dataset increases. Both the standard deviations and cross-correlations are standard output of statistical software packages. Thus, we could sample from the joint distribution of the parameter estimation errors in a similar way as displayed in Figure \@ref(fig:pairedsoils).
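
As a small sketch of what this looks like in R, the code below fits the regression model to simulated data (the data and coefficient values are made up for illustration), extracts the standard deviations and cross-correlations of the parameter estimates, and draws parameter vectors from their joint normal distribution.

```r
set.seed(123)
n   <- 100
dat <- data.frame(X1 = runif(n, 0, 10), X2 = rnorm(n, 5, 2))
dat$Z <- 2 + 0.5 * dat$X1 - 0.3 * dat$X2 + rnorm(n, sd = 1)  # simulated soil property

fit <- lm(Z ~ X1 + X2, data = dat)

beta_hat <- coef(fit)        # estimated parameters
V        <- vcov(fit)        # variance-covariance matrix of the estimates
se_beta  <- sqrt(diag(V))    # standard deviations of the estimation errors
cor_beta <- cov2cor(V)       # cross-correlations of the estimation errors

# Draw 1000 parameter vectors from N(beta_hat, V) using a Cholesky factor of V
U     <- chol(V)
draws <- sweep(matrix(rnorm(1000 * 3), ncol = 3) %*% U, 2, beta_hat, "+")
colnames(draws) <- names(beta_hat)
head(draws)
```

The estimates of the intercept and slopes are typically negatively correlated when the covariates have positive means, which is visible in `cor_beta`.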

The model structural uncertainty associated with the multiple linear regression model Eq. \ref{eq:DSM model} is represented by the stochastic residual ${\varepsilon}$. It too is normally distributed and has zero mean, while its standard deviation depends on the (spatial) variation of the soil property $Z$ and the strength of the relationship between $Z$ and the covariates ${X_1}$ and ${X_2}$. This strength is expressed by the goodness-of-fit characteristic ${R^2}$, also termed Amount of Variance Explained (AVE) (see Section \@ref(addProbSampling)). ${R^2}$ will be close to 1 in case of a strong linear relationship between soil property and covariates. In that case, the standard deviation of the stochastic residual will be much smaller than that of the soil property, because a large part of the variation is explained by the model. If the covariates bear no linear relationship with the soil property (i.e., ${R^2}$ = 0), the stochastic residual will have the same standard deviation as the soil property.
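
The link between ${R^2}$ and the residual standard deviation can be verified numerically. The sketch below uses simulated data with made-up coefficients and relies on the identity that, for a least-squares fit with an intercept, the residual sum of squares equals $(1 - R^2)$ times the total sum of squares.

```r
set.seed(7)
n  <- 200
X1 <- rnorm(n)
X2 <- rnorm(n)
Z  <- 1 + 2 * X1 + 1 * X2 + rnorm(n, sd = 1)   # simulated soil property

fit <- lm(Z ~ X1 + X2)
R2  <- summary(fit)$r.squared

ss_res <- sum(residuals(fit)^2)      # residual sum of squares
ss_tot <- sum((Z - mean(Z))^2)       # total sum of squares of the soil property

# Ratio of residual to total variation, and the same quantity derived from R2
c(R2 = R2, sd_ratio = sqrt(ss_res / ss_tot), from_R2 = sqrt(1 - R2))
```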

Since the joint probability distributions of the parameter estimation errors and the stochastic residual can be computed analytically and are routinely provided by statistical software, it is not difficult to analyse how these uncertainties propagate through the DSM model Eq. \ref{eq:DSM model}. This can be done analytically, because Eq. \ref{eq:DSM model} is linear in the stochastic arguments (note that the covariates are treated as known and deterministic). If we predict the soil property $Z$ at a prediction location ${s_0}$ using the calibrated regression model as:

\begin{equation}
\label{eq:cal regmodel}
\hat{Z}(s_0) = \hat{\beta}_0 + \hat{\beta}_1  \cdot X_1 (s_0) +  \hat{\beta}_2 \cdot X_2 (s_0)
\end{equation}

then the prediction error will be normally distributed with zero mean and variance (i.e., the square of the standard deviation) given by:

\begin{equation}
\begin{aligned}
\label{eq:pred err}
Var(\hat{Z}(s_0) - Z(s_0)) = & \ Var(\hat{\beta}_0) + Var(\hat{\beta}_1)  \cdot X_1 (s_0)^2 +  Var(\hat{\beta}_2) \cdot X_2 (s_0)^2 + \\
                      & \ 2 Cov(\hat{\beta}_0,\hat{\beta}_1) \cdot  X_1 (s_0) + 2 Cov(\hat{\beta}_0,\hat{\beta}_2) \cdot X_2 (s_0) + \\
                      & \ 2 Cov(\hat{\beta}_1,\hat{\beta}_2) \cdot  X_1 (s_0) \cdot  X_2 (s_0) + Var(\varepsilon(s_0))
\end{aligned}
\end{equation}

This is a complicated expression but all entries are known and hence it can be easily calculated. 
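
By way of illustration, the sketch below evaluates Eq. \ref{eq:pred err} for a regression model fitted to simulated data, first term by term and then in matrix form, and cross-checks the result against the output of `predict()`. The data, coefficients and covariate values at ${s_0}$ are made up for this example.

```r
set.seed(123)
n   <- 100
dat <- data.frame(X1 = runif(n, 0, 10), X2 = rnorm(n, 5, 2))
dat$Z <- 2 + 0.5 * dat$X1 - 0.3 * dat$X2 + rnorm(n, sd = 1)
fit <- lm(Z ~ X1 + X2, data = dat)

x1_0 <- 4                       # covariate X1 at the prediction location s0
x2_0 <- 6                       # covariate X2 at the prediction location s0
V    <- vcov(fit)               # variances and covariances of the parameter estimates
s2   <- summary(fit)$sigma^2    # estimate of Var(epsilon)

# Prediction error variance written out term by term
var_terms <- V[1, 1] + V[2, 2] * x1_0^2 + V[3, 3] * x2_0^2 +
  2 * V[1, 2] * x1_0 + 2 * V[1, 3] * x2_0 +
  2 * V[2, 3] * x1_0 * x2_0 + s2

# The same quantity in matrix form: x0' V x0 + Var(epsilon)
x0       <- c(1, x1_0, x2_0)
var_pred <- as.numeric(t(x0) %*% V %*% x0) + s2

# Cross-check with predict(): se.fit covers the parameter uncertainty only
p         <- predict(fit, newdata = data.frame(X1 = x1_0, X2 = x2_0), se.fit = TRUE)
var_check <- as.numeric(p$se.fit^2 + p$residual.scale^2)
c(var_terms, var_pred, var_check)
```

In practice one would usually call `predict(fit, newdata, interval = "prediction")`, which is based on the same variance and uses a t-distribution to account for the fact that the residual variance is itself estimated.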

In many DSM applications, an additional step will be included that makes use of the fact that the stochastic residual ${\varepsilon}$ in Eq. \ref{eq:DSM model} is spatially autocorrelated, as characterized by a semivariogram. If this is the case, the residual spatial correlation can be exploited by incorporating a kriging step [@hengl2004generic]. Kriging has been explained in Chapter \@ref(mappingMethods), where it was also explained that the uncertainty in the predictions is quantified by the kriging variance. We will not repeat the theory here, but simply note that the kriging variance computes the prediction error variance just as was done in Eq. \ref{eq:pred err}, but that in case of kriging the ${Var(\varepsilon(s_0))}$ term in Eq. \ref{eq:pred err} is replaced by a smaller term, because kriging benefits from residual spatial correlation. In fact, in case of a pure nugget variogram, the kriging variance would be identical to Eq. \ref{eq:pred err}, because in such a case there is no spatial autocorrelation that one can benefit from. Note also that here we refer to kriging with external drift because we included a non-constant mean (i.e., covariates ${X_1}$ and ${X_2}$). If no covariates were included, Eq. \ref{eq:pred err} would simplify dramatically, leaving only uncertainty in the estimated (constant) mean and the stochastic residual. This might then be compared with the ordinary kriging variance.
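
The effect of residual spatial correlation on the residual term can be illustrated with a deliberately simplified sketch. It computes the simple kriging variance of the residual at ${s_0}$, assuming a known zero mean and a known exponential covariance function, so it ignores the parameter uncertainty terms of Eq. \ref{eq:pred err}. The function `sk_variance()`, the sampling locations and the variogram parameters are all hypothetical.

```r
# Simple kriging variance of the residual at s0 for an exponential covariance model
sk_variance <- function(coords, s0, nugget, psill, range_par) {
  cov_fun <- function(h) psill * exp(-h / range_par)      # spatially correlated part
  D  <- as.matrix(dist(coords))                           # distances among observations
  K  <- cov_fun(D) + diag(nugget, nrow(coords))           # covariance matrix of observations
  d0 <- sqrt(rowSums(sweep(coords, 2, s0)^2))             # distances from observations to s0
  c0 <- cov_fun(d0)                                       # covariances with the residual at s0
  (nugget + psill) - as.numeric(t(c0) %*% solve(K, c0))   # C(0) - c0' K^-1 c0
}

set.seed(1)
coords <- cbind(x = runif(50, 0, 100), y = runif(50, 0, 100))  # hypothetical sampling locations
s0     <- c(50, 50)                                            # prediction location

# Same total residual variance (1.0), with and without spatial correlation
v_corr   <- sk_variance(coords, s0, nugget = 0.2, psill = 0.8, range_par = 30)
v_nugget <- sk_variance(coords, s0, nugget = 1.0, psill = 0.0, range_par = 30)
c(correlated = v_corr, pure_nugget = v_nugget)
```

With a pure nugget model the result equals the full residual variance, whereas with spatial correlation it is smaller, in line with the reasoning above. A full kriging with external drift implementation, such as provided by dedicated geostatistical packages, would add the uncertainty about the regression coefficients.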

So far we considered uncertainty in DSM models that are linear in the covariates and that represent the model structural uncertainty by an additive stochastic term. This was relatively easy because tracing how uncertainty in model parameters and model structure propagates to the model output could be done analytically. However, using linear models also poses serious restrictions. The relationships between soil properties and covariates are typically not linear but much more complex. This has led to the development and use of complex non-linear DSM models, such as regression trees, artificial neural networks, support vector machines and random forests, all summarised under the term *machine learning* [e.g., @hengl2015mapping]. These more complex models typically yield more accurate soil predictions but quantification of the associated uncertainty is more difficult. In most cases, one resorts to validation and cross-validation statistics that summarise the prediction accuracy over the entire study area. How this is done will be explained in detail in Section \@ref(propUncertainty). Such summary validation measures are very valuable but are no substitute for spatially explicit uncertainties such as the kriging variance and the prediction error variance presented in Eq. \ref{eq:pred err}. Research into quantification of location-specific uncertainties when using machine learning algorithms is therefore important. However, it is beyond the scope of this Chapter to review this area of ongoing research. One particular approach makes use of quantile regression forests. We refer to @meinshausen2006quantile for a general text and to @vaysse2017using for a DSM application of this promising, albeit computationally challenging approach.
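
A minimal sketch of how cross-validation statistics may be computed is given below. It uses simulated data and, to stay within base R, a linear model as a stand-in for whichever DSM model is being evaluated; the fold assignment is purely random, which assumes that the observations are not spatially clustered.

```r
set.seed(99)
n   <- 150
dat <- data.frame(X1 = runif(n, 0, 10), X2 = rnorm(n, 5, 2))
dat$Z <- 2 + 0.5 * dat$X1 - 0.3 * dat$X2 + rnorm(n, sd = 1)   # simulated soil property

k    <- 5
fold <- sample(rep(1:k, length.out = n))    # random assignment of observations to folds
pred <- numeric(n)
for (j in 1:k) {
  test       <- fold == j
  fit_j      <- lm(Z ~ X1 + X2, data = dat[!test, ])      # stand-in for any DSM model
  pred[test] <- predict(fit_j, newdata = dat[test, ])     # predict the held-out fold
}

err  <- pred - dat$Z
ME   <- mean(err)                                          # mean error (bias)
RMSE <- sqrt(mean(err^2))                                  # root mean squared error
AVE  <- 1 - sum(err^2) / sum((dat$Z - mean(dat$Z))^2)      # amount of variance explained
c(ME = ME, RMSE = RMSE, AVE = AVE)
```

These statistics summarise accuracy over all observations and therefore say nothing about how the uncertainty varies from one location to another.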

### Propagation of attribute, positional and covariate uncertainty {#propUncertainty}

In Section \@ref(sourcesuncert) we noted that next to uncertainties in model parameters and model structure there may also be uncertainties in the attribute values and positions of the soil point data and in the covariates. These sources of uncertainty will also affect the outcome of DSM model predictions.

Uncertainties in soil attribute values effectively mean that the DSM model is calibrated with error-contaminated observations of the dependent variable. Let us consider the multiple linear regression model Eq. \ref{eq:DSM model} again. True values of the dependent variable $Z$ (i.e., the target soil property, such as pH, clay content or total nitrogen concentration) are no longer available for calibration of this model. Instead, we must make do with measurements $Y$ of $Z$:

\begin{equation}
Y(s_i) = Z(s_i) + \delta(s_i), \quad i = 1 \dots n
\end{equation}

where $n$ is the number of measurement locations and ${\delta(s_i)}$ is a random variable representing measurement error. It is customary to assume that all ${\delta(s_i)}$ are normally distributed, have zero mean and are mutually independent, although these assumptions are not strictly necessary. Their standard deviations may vary between cases and depend on the accuracy and precision of the measurement method. For instance, field estimates tend to be more uncertain than laboratory measurements and so the corresponding measurement errors will have a larger standard deviation. The consequence of the presence of measurement errors is that the estimates of the model parameters will be more uncertain. This is no surprise because the calibration data are of poorer quality. The prediction error variance will be greater too, for the same reason. If spatial correlation of the model residual ${\varepsilon}$ is included and an extension to kriging with external drift is made, measurement errors increase the uncertainty further, because conditioning the predictions on the observations is less effective than when the observations are error-free. For mathematical details we refer to @cressie1993statistics. Finally, we should also note that if different observations have different degrees of measurement error, then this will influence the weights that each measurement gets in calibration and prediction. Measurements with larger measurement errors get smaller weights. This is readily accommodated in multiple linear regression (through weighted least squares) and in kriging with external drift, but how it can be incorporated in machine-learning approaches is less clear.
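
The simulation below sketches both effects for a regression with a single covariate. It assumes that the measurement error standard deviations are known and that the residual variance equals 1 (in practice the latter would have to be estimated); half of the observations mimic precise laboratory measurements and half mimic cruder field estimates. All names and values are illustrative.

```r
set.seed(2024)
n  <- 200
X1 <- runif(n, 0, 10)
Z  <- 1 + 0.8 * X1 + rnorm(n, sd = 1)        # true soil property, Var(epsilon) = 1

sd_delta <- rep(c(0.2, 2.0), each = n / 2)   # assumed known measurement error SDs
Y        <- Z + rnorm(n, sd = sd_delta)      # measurements: Y = Z + delta

fit_true <- lm(Z ~ X1)                       # error-free data (not available in practice)
fit_ols  <- lm(Y ~ X1)                       # ignores differences in data quality
w        <- 1 / (1 + sd_delta^2)             # weights 1 / (Var(epsilon) + Var(delta_i))
fit_wls  <- lm(Y ~ X1, weights = w)          # noisier measurements get smaller weights

# Standard error of the estimated slope under the three calibrations
se_slope <- function(f) unname(sqrt(diag(vcov(f)))["X1"])
se <- c(true = se_slope(fit_true), ols = se_slope(fit_ols), wls = se_slope(fit_wls))
se
```

Under these assumptions the slope is estimated most precisely from the error-free data, and the weighted fit recovers part of the precision that is lost when all measurements are treated as equally reliable.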

Positional uncertainty of soil point observations will also deteriorate the quality of the predictions of calibrated DSM models. However, it is difficult to predict how much the prediction accuracy is affected. It largely depends on the degree of fine-scale spatial variation of the soil property and covariates. For instance, if both the soil property of interest and the covariates are spatially smooth and hardly change over distances within the range of spatial displacement due to positional uncertainty, then little damage is inflicted by positional uncertainty. But otherwise much harm can be done because the soil observations will be paired with covariate values from displaced locations that can be very different. So far, this interesting and important topic has received little attention in the DSM literature. @grimm2010uncertainty and @nelson2011error are two examples of studies that assessed the effect of positional error on the accuracy of digital soil maps.
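
A simple one-dimensional simulation can make this dependence on fine-scale variation tangible. In the sketch below the covariate is a 21-cell moving average of white noise, the soil property depends on the covariate at the true position, and the registered position is displaced by a normally distributed error expressed in grid cells. All settings are assumptions chosen for illustration.

```r
set.seed(11)
grid_n <- 5000
raw    <- rnorm(grid_n + 20)
smooth <- as.numeric(stats::filter(raw, rep(1 / 21, 21), sides = 2))  # 21-cell moving average
x_grid <- smooth[11:(grid_n + 10)]            # drop the NA values at both ends

n      <- 300
true_i <- sample(101:(grid_n - 100), n)                 # true sampling positions (grid indices)
Z      <- 5 + 3 * x_grid[true_i] + rnorm(n, sd = 0.1)   # soil property at the true positions

r2_with_shift <- function(sd_shift) {
  reg_i <- true_i + round(rnorm(n, sd = sd_shift))      # registered (erroneous) positions
  reg_i <- pmin(pmax(reg_i, 1), grid_n)                 # keep indices inside the grid
  summary(lm(Z ~ x_grid[reg_i]))$r.squared              # covariate taken at registered position
}

# R2 of the calibrated model for no, small and large positional error (in grid cells)
r2 <- sapply(c(none = 0, small = 2, large = 30), r2_with_shift)
r2
```

When the positional error is small relative to the distance over which the covariate varies, the relationship is only mildly weakened; when it exceeds that distance, the relationship largely disappears.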

Finally, there are also uncertainties in covariates that affect the accuracy of DSM predictions. In fact, these uncertainties are already incorporated in the model structural uncertainty discussed before, because offering covariates that are poor approximations of the true soil forming factors will explain little of the spatial variation and lead to low goodness-of-fit statistics. From a statistical point of view, the covariates used in Eq. \ref{eq:DSM model} need not be the *true* soil forming factors but could as well be proxies of those. This does not harm the theory, and quantification of the prediction error variance, such as through Eq. \ref{eq:pred err} in the multiple linear regression case or using the kriging variance in a kriging with external drift (KED) approach, remains perfectly valid. This does not mean that digital soil mappers should not look for the most accurate and informative covariates, because clearly weak covariates lead to poor predictions of the soil [@samuel2015more].