# Linear regression {#appendix-regression}
## Simple linear regression
This section derives the theoretical results of Section \@ref(differential-abundance) for the effect of taxonomic bias on regression-based DA analyses.
As in Section \@ref(differential-abundance), we restrict our attention to DA analyses that can be expressed in terms of the simple linear regression model, in which the response is the log absolute abundance or log proportion of an individual species or the log ratio of a pair of species.
### Review of simple linear regression
Table: (\#tab:statistical-notation) Statistical notation used in this section.
| Notation | Mathematical definition | Description |
|:--|:----|:----|
| $x$, $y$, $z$, $d$ | | Random variables, defined in text |
| $x(a)$ | | Value of $x$ in sample $a$ |
| $\bar x$ | $\frac{1}{n}\sum_a x(a)$ | Sample mean of $x$ |
| $s(x,y)$ | $\frac{1}{n}\sum_a (x(a) - \bar x) (y(a) - \bar y)$ | Sample covariance between $x$ and $y$ |
| $s^{2}(x)$ | $s(x,x)$ | Sample variance of $x$ |
| $s(x)$ | $\sqrt{s^{2}(x)}$ | Sample standard deviation of $x$ |
| $r(x,y)$ | $\frac{s(x,y)}{s(x) s(y)}$ | Sample correlation between $x$ and $y$ |
| $\alpha$, $\beta$, $\sigma$ | | Parameters of the regression model |
| $\hat \alpha$, $\hat \beta$, $\hat \sigma$ | | Parameter estimates |
| $\varepsilon(a)$ | $y(a) - (\alpha + \beta x(a))$ | Error for sample $a$ |
| $\hat \varepsilon(a)$ | $y(a) - (\hat \alpha + \hat \beta x(a))$ | Residual for sample $a$ |
| $\text{se}(\hat \beta)$, $\hat{\text{se}}(\hat \beta)$ | | Standard error of $\hat \beta$ and its estimate |
We begin by reviewing definitions, notation, and results for the analysis of the simple linear regression model.
Our presentation follows @wasserman2004allo Chapter 13 with additional interpretation of the estimated regression coefficients.
<!-- to support our subsequent consideration of measurement error caused by bias. -->
Statistical notation is defined in Table \@ref(tab:statistical-notation).
Consider a response variable $y$ and a covariate $x$.
Following @wasserman2004allo, we define the _simple linear regression model_ by
\begin{align}
(\#eq:lm)
y(a) &= \alpha + \beta x(a) + \varepsilon(a),
\end{align}
where $E[\varepsilon(a) \mid x(a)] = 0$ and $V[\varepsilon(a) \mid x(a)] = \sigma^2$.
That is, conditional on the value of $x$ in a sample $a$, the response $y$ is given by $\alpha + \beta x$ with a random error $\varepsilon$ that has an expected value of $0$ and a constant variance $\sigma^2$.
<!-- Note that we distinguish between the statistical model being fit to the data, and the actual data generating process. -->
<!-- These results for the regression model estimates hold regardless of whether the simple linear model is a good one for the data. -->
Given data $(y, x)$, we _fit_ the model by finding estimates for the parameters $\alpha$, $\beta$, and $\sigma^2$ that best reflect the data under the assumption that the model is correct and according to our chosen criterion that defines 'best'.
Here we consider the least-squares estimates for the coefficients $\alpha$ and $\beta$, which are also the maximum likelihood estimates if we further assume that the errors $\varepsilon$ are normally distributed.
<!-- Defining the residuals as $\hat \varepsilon(a) = y(a) - (\hat \alpha + \hat \beta x(a))$, -->
The least-squares estimates of $\alpha$ and $\beta$ are the values that minimize the residual sum of squares, $\text{RSS} = \sum_{a} \hat \varepsilon(a)^2$.
These estimates can be conveniently expressed in terms of the sample means and covariances or correlations of $x$ and $y$,
\begin{align}
(\#eq:lm-hat)
\hat \alpha &= \bar y - \hat \beta \bar x \\
\hat \beta &= \frac{s(y,x)}{s^2(x)} = r(y,x) \cdot \frac{s(y)}{s(x)}.
\end{align}
<!-- (@wasserman2004allo Chapter 13), -->
The estimate for the slope coefficient, $\hat \beta$, equals the ratio of the sample covariance of $y$ with $x$ relative to the variance in $x$ or, equivalently, the correlation of $y$ with $x$ multiplied by the ratio of their standard deviations.
The estimate for the intercept, $\hat \alpha$, is the value such that the sample means follow the regression relationship, $\bar y = \hat \alpha + \hat \beta \bar x$.
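As a concrete illustration, the following R snippet (a minimal sketch using simulated data, not data from our case studies) checks Equation \@ref(eq:lm-hat) against the estimates returned by `lm()`; note that the $1/n$ versus $1/(n-1)$ convention cancels in the ratio defining $\hat \beta$.

```r
# Minimal sketch: verify Equation (lm-hat) against lm() on simulated data.
set.seed(1)
n <- 50
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)
fit <- lm(y ~ x)
beta_hat <- cov(y, x) / var(x)             # s(y, x) / s^2(x)
alpha_hat <- mean(y) - beta_hat * mean(x)  # bar(y) - beta_hat * bar(x)
all.equal(unname(coef(fit)), c(alpha_hat, beta_hat))  # TRUE
```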
<!-- We are most interested in the estimate of the slope coefficient, $\hat \beta$ -->
The standard unbiased and maximum likelihood estimates of the variance $\sigma^{2}$ are both approximately given by the sample variance of the residuals,
\begin{align}
(\#eq:lm-var)
\hat{\sigma}^2 \approx s^2(\hat \varepsilon).
\end{align}
We are most interested in the estimated slope coefficient, $\hat \beta$.
Besides the point estimate indicating our best guess at the value, we also wish to understand the uncertainty or precision of the estimate.
This uncertainty is quantified by the standard error, $\text{se}(\hat \beta)$, estimated by
\begin{align}
(\#eq:lm-se)
\hat{\text{se}}(\hat \beta)
= \frac{\hat \sigma}{s(x) \sqrt{n}}
\approx \frac{s(\hat \varepsilon)}{s(x) \sqrt{n}}.
\end{align}
<!-- (@wasserman2004allo Equation 13.13). -->
Approximate 95% confidence intervals for $\beta$ are given by $\hat \beta \pm 2\; \hat{\text{se}}(\hat \beta)$.
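Continuing the simulated-data sketch above, $\hat{\text{se}}(\hat \beta)$ can be computed directly from the residuals; `lm()` applies an $n - 2$ degrees-of-freedom correction, so the two values agree only approximately.

```r
# Standard error of the slope via Equation (lm-se), using 1/n-denominator
# standard deviations as defined in the notation table.
sd_n <- function(v) sqrt(mean((v - mean(v))^2))
se_hat <- sd_n(residuals(fit)) / (sd_n(x) * sqrt(n))
se_lm <- summary(fit)$coefficients["x", "Std. Error"]  # uses n - 2 df
c(se_hat = se_hat, se_lm = se_lm)
coef(fit)["x"] + c(-2, 2) * se_hat  # approximate 95% CI for beta
```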
### Measurement error in the response
Now consider a researcher who would like to regress a biological quantity $y$ (the response) on a second quantity $x$ (the covariate).
But instead of $y$, they only know a proxy $z$, which is related to $y$ via the difference $d = z - y$.
For instance, $y$ might be the log abundance of a species, $z$ the log abundance measured via MGS, and $x$ a numerical quantity like pH or a boolean variable indicating case versus control.
Lacking a direct measurement of $y$, the researcher instead regresses $z$ on $x$ and interprets the fitted model as informing them about the relationship between $y$ and $x$.
We wish to understand the accuracy of their conclusions.
To understand how the measurement error in $y$ impacts the researcher's regression analysis, consider the three linear regression equations for $y$, $z$, and $d$,
\begin{align}
(\#eq:lm-y)
y(a) &= \alpha_y + \beta_y x(a) + \varepsilon_y(a),
\end{align}
\begin{align}
(\#eq:lm-z)
z(a) &= \alpha_z + \beta_z x(a) + \varepsilon_z(a),
\end{align}
\begin{align}
(\#eq:lm-d)
d(a) &= \alpha_d + \beta_d x(a) + \varepsilon_d(a).
\end{align}
The researcher would like to fit the model for $y$ (Equation \@ref(eq:lm-y)), but instead can only fit the model for $z$ (Equation \@ref(eq:lm-z)).
The two are related via Equation \@ref(eq:lm-d) for the difference $d$.
It is helpful to imagine that $y$ and $d$ are known and so we can fit all three models \@ref(eq:lm-y), \@ref(eq:lm-z), and \@ref(eq:lm-d).
From Equation \@ref(eq:lm-hat) and the linearity of sample means and covariances it follows that
\begin{align}
(\#eq:lm-hat-rel)
\hat \alpha_z &= \hat \alpha_y + \hat \alpha_d \\
\hat \beta_z &= \hat \beta_y + \hat \beta_d.
\end{align}
That is, the estimated coefficients for $y$, $z$, and $d$ mirror the relationship between the variables themselves, $z = y + d$.
Consequently, we see that the measurement error $d$ creates absolute errors in the regression coefficients $\hat \alpha$ and $\hat \beta$ equal to
\begin{align}
(\#eq:lm-hat-err)
\text{abs. error}(\hat \alpha) &\equiv \hat \alpha_z - \hat \alpha_y = \hat \alpha_d \\
\text{abs. error}(\hat \beta) &\equiv \hat \beta_z - \hat \beta_y = \hat \beta_d.
\end{align}
<!-- These absolute errors correspond to the _statistical bias_ in the coefficient estimates. -->
Expressed in terms of sample means and covariances, these errors are
\begin{align}
(\#eq:lm-hat-err-1)
\text{abs. error}(\hat \alpha) &= \bar d - \hat \beta_d \bar x \\
\text{abs. error}(\hat \beta) &= \frac{s(d,x)}{s^2(x)} = r(d,x) \cdot \frac{s(d)}{s(x)}.
\end{align}
We are mainly interested in the estimated slope coefficient, $\hat \beta$.
We see that the absolute error is large when the covariance of $d$ with $x$ is large.
This error is large in a practical sense when $\hat \beta_{d}$ is large (in magnitude) relative to $\hat \beta_y$ (Equation \@ref(eq:lm-hat)), which occurs when $s(d,x)$ is large relative to $s(y,x)$.
<!-- Thus the relative error in $\hat \beta$ is large when the covariance of $d$ with $x$ is large relative to the covariance of $y$ with $x$. -->
A sign error occurs when $s(d,x)$ is larger in magnitude than $s(y,x)$ and of opposite sign.
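A small simulation (hypothetical values chosen purely for illustration) makes Equation \@ref(eq:lm-hat-rel) and the possibility of a sign error concrete.

```r
# Sketch: the slope for the proxy z = y + d equals the sum of the slopes for
# y and d; an error d covarying strongly and negatively with x flips the sign.
set.seed(2)
n <- 100
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n, sd = 0.3)  # true slope beta_y = 0.5
d <- -1.5 * x + rnorm(n, sd = 0.3)     # s(d, x) opposite in sign, larger than s(y, x)
z <- y + d
coef(lm(y ~ x))["x"] + coef(lm(d ~ x))["x"]  # equals ...
coef(lm(z ~ x))["x"]                         # ... the proxy slope (a sign error)
```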
We can also consider how the measurement error affects the residual variation and the standard error of the slope estimate.
The residuals of $y$, $z$, and $d$ are related through $\hat \varepsilon_z = \hat \varepsilon_y + \hat \varepsilon_d$.
It follows that the sample variance of the $z$ residuals equals
\begin{align}
s^2(\hat \varepsilon_{z})
= s^2(\hat \varepsilon_y + \hat \varepsilon_{d})
= s^2(\hat \varepsilon_y) + s^2(\hat \varepsilon_{d}) + 2 s(\hat \varepsilon_y, \hat \varepsilon_{d}).
\end{align}
The variance of the $z$ residuals is increased above that of the $y$ residuals when the $d$ residuals are either uncorrelated or positively correlated with the $y$ residuals, but may be decreased when the $y$ and $d$ residuals are negatively correlated.
An increased residual variance leads to a larger estimate of $\sigma$ and hence a larger estimated standard error for the slope estimate $\hat \beta$ (Equation \@ref(eq:lm-se)).
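Continuing the simulation above, the residual-variance decomposition can be checked directly.

```r
# The z residuals are exactly the sum of the y and d residuals, so the
# decomposition of the residual variance holds identically.
e_y <- residuals(lm(y ~ x))
e_d <- residuals(lm(d ~ x))
e_z <- residuals(lm(z ~ x))
all.equal(var(e_z), var(e_y) + var(e_d) + 2 * cov(e_y, e_d))  # TRUE
```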
### Specific application to taxonomic bias
These general results can be used to understand how taxonomic bias affects DA analyses that can be expressed as linear regression of log (relative or absolute) abundance.
First, we illustrate for the particular example of absolute DA analysis, where species absolute abundance is measured with the total-abundance normalization method and we assume that the total abundances have been accurately measured.
Consider the regression of log abundance of species $i$ on a covariate $x$.
In this case, $y(a)= \log \text{abun}_{i}(a)$ is the actual log abundance of species $i$ and $z(a)=\log \widehat{\text{abun}}_i(a)$ is the log abundance we've measured.
From Equation \@ref(eq:density-prop-error), the measurement error $d(a)$ due to taxonomic bias in the MGS measurement is
\begin{align}
d(a)
\equiv z(a) - y(a)
= \log \text{efficiency}_i - \log \text{efficiency}_S(a).
\end{align}
The absolute error in $\hat \beta$ equals the scaled covariance of $d$ with $x$ (Equation \@ref(eq:lm-hat-err-1)).
<!-- The first term, $\log \text{efficiency}_i$, is constant across samples; thus it affects $\hat \alpha_d$ (via its effect on $\bar d$) but does not affect $\hat \beta_d$. -->
In the above expression for $d(a)$, only the second term, $- \log \text{efficiency}_S(a)$, varies across samples and thus affects the covariance, leading to
\begin{align}
\text{abs. error}(\hat \beta)
&= - \frac{s[\log \text{efficiency}_S, x]}{s^2(x)}.
% = - \frac{r[\log \text{efficiency}_S, x] \cdot s[\log \text{efficiency}_S]}{s(x)}, \\
\end{align}
Thus the absolute error in $\hat \beta$ is the negative of the scaled covariance of the log mean efficiency with $x$.
Although the absolute error is the same for each species, its practical significance varies.
Recall from Equation \@ref(eq:lm-hat) that the correct value for $\hat \beta$ (i.e., the estimate without measurement error) is
\begin{align}
\hat \beta &= \frac{s[\log \text{abun}_i, x]}{s^2(x)}.
\end{align}
For species whose log abundance covaries with $x$ much more than the log mean efficiency, the error will be relatively small.
This situation can occur either because the mean efficiency varies relatively little across samples or because its variation is relatively uncorrelated with $x$.
For species whose log abundance covaries with $x$ to a similar or lesser degree than the log mean efficiency, the error will be substantial.
A sign error occurs when the species covariance is of _equal sign and smaller in magnitude_ than the covariance of the log mean efficiency.
We can also consider how the standard errors in the slope estimate are affected by bias.
In our case, the $d$ residuals equal the negative residuals of the log mean efficiency.
It is plausible that for most species, the residual variation will covary only weakly with the log mean efficiency, so that the net effect of variation in the mean efficiency is to increase the estimated standard errors, as occurs for most species in Figure \@ref(fig:regression-example).
However, high-efficiency species that vary substantially in proportion across samples may be strongly positively correlated with log mean efficiency such that the estimated standard errors decrease, as we see with Species 9 in Figure \@ref(fig:regression-example).
Now, suppose we were instead performing a DA analysis of a response whose fold error is constant across samples, such as the log ratio of two species.
For the log ratio of species $i$ and $j$, the error is $d(a) = \log \text{efficiency}_i / \text{efficiency}_j$ and is constant across samples.
Thus the covariance of $d$ with $x$ is 0, and taxonomic bias causes no error in the slope estimate $\hat \beta$, only in the intercept estimate $\hat \alpha$.
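The following simulation (hypothetical abundances and efficiencies) illustrates both results: the absolute error in $\hat \beta$ for a species' log abundance equals the negative scaled covariance of the log mean efficiency with $x$, while the slope for the log ratio of two species is unaffected.

```r
# Sketch: slope errors induced by taxonomic bias.
set.seed(3)
n <- 30; K <- 5  # samples and species
x <- rnorm(n)
log_abun <- sapply(1:K, function(i) rnorm(n, mean = 0.2 * i * x))
efficiency <- exp(rnorm(K))  # fixed per-species efficiencies
prop <- exp(log_abun) / rowSums(exp(log_abun))
log_mean_eff <- log(as.vector(prop %*% efficiency))
# Log abundance of species 1: actual vs. measured.
y1 <- log_abun[, 1]
z1 <- y1 + log(efficiency[1]) - log_mean_eff
coef(lm(z1 ~ x))["x"] - coef(lm(y1 ~ x))["x"]  # absolute error in beta-hat ...
-cov(log_mean_eff, x) / var(x)                 # ... matches this expression
# Log ratio of species 1 and 2: the error is constant, so the slope is unaffected.
lr_actual <- log_abun[, 1] - log_abun[, 2]
lr_measured <- lr_actual + log(efficiency[1] / efficiency[2])
coef(lm(lr_measured ~ x))["x"] - coef(lm(lr_actual ~ x))["x"]  # zero
```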
## Gamma-Poisson regression
This section describes how gamma-Poisson regression can, with appropriate choice of offsets, be used to estimate log fold changes (LFCs) in proportions from MGS count data either with or without bias correction.
### Background
The gamma-Poisson distribution, also known as the negative binomial distribution, is commonly used to model sequencing count data (@holmes2018mode).
Its key advantage over the simple linear regression model is that it directly models the random sampling of reads that occurs during a sequencing experiment and, in this way, naturally accounts for the imprecision associated with an observation of zero or a small positive count for a given species and sample.
The gamma-Poisson distribution has two parameters, which jointly determine the mean and the standard deviation.
We use the parameterization used by the rstanarm R package (@rstanarm), which we use for fitting gamma-Poisson GLMs in our case-study analyses.
Suppose that $y^{(a)}$ has a gamma-Poisson distribution conditional on a covariate $x$.
Define two parameters $\mu$ and $\phi$ such that
\begin{align}
E[y^{(a)} \mid x] &= \mu^{(a)} \\
\text{sd}[y^{(a)} \mid x] &= \sqrt{
E[y^{(a)} \mid x] + E[y^{(a)} \mid x]^2 / \phi
};
\end{align}
thus $\mu$ equals the mean, while $\phi$, known as the 'reciprocal dispersion', controls the excess variation: the standard deviation increases as $\phi$ approaches 0 and tends to $\sqrt{\mu}$ (that of a Poisson distribution) as $\phi$ approaches infinity.
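As a quick numerical check (simulated draws), base R's `rnbinom()` uses this same mean parameterization, with its `size` argument playing the role of $\phi$.

```r
# Empirical sd of gamma-Poisson draws versus the formula above.
mu <- 10; phi <- 2
draws <- rnbinom(1e5, mu = mu, size = phi)
c(empirical = sd(draws), formula = sqrt(mu + mu^2 / phi))
```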
For use in gamma-Poisson GLM regression of MGS data, we partition the mean into two factors, $\mu^{(a)} = u^{(a)} \theta^{(a)}$ (@gelman2020regr).
Here, $\theta^{(a)} = e^{\alpha + \beta x^{(a)}}$ is the quantity of primary interest, whose average LFC equals $\beta$.
The factor $u^{(a)}$ is a sample-specific _exposure_ parameter that determines the overall scale of counts.
We can equivalently write $\mu^{(a)} = e^{\alpha + \beta x^{(a)} + \log u^{(a)}}$;
the logarithm of the exposure, $\log u^{(a)}$, is known as the _offset_ of the GLM.
Performing gamma-Poisson regression with standard statistical software requires that the offsets are provided as known values.
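A minimal sketch of such a fit with rstanarm (using a hypothetical data frame `dat` with columns `counts` and `x`, and a vector `log_u` of known log exposures) looks like the following.

```r
library(rstanarm)
# Gamma-Poisson (negative binomial) GLM with known offsets; the coefficient
# on x estimates beta, the average LFC in theta.
fit <- stan_glm(
  counts ~ x,
  family = neg_binomial_2(link = "log"),
  offset = log_u,
  data = dat
)
coef(fit)["x"]  # posterior median estimate of beta
```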
### Inferring LFCs in proportions with and without bias correction
Our goal is to use gamma-Poisson regression to estimate the LFC in a species proportion from the observed read counts.
We therefore equate the counts $M_i^{(a)}$ with $y^{(a)}$ and the proportion $P_{i}^{(a)}$ with $\theta^{(a)}$; it remains to determine the offsets that are consistent with our MGS model.
To do so, we relax the deterministic assumption of our original model and instead suppose that the right-hand side of Equation \@ref(eq:mgs-model) equals the _expected_ read count of species $i$ in sample $a$,
\begin{align}
(\#eq:mgs-model-gp)
E[M_i^{(a)}] = A_i^{(a)} B_i F^{(a)}.
\end{align}
The total expected count is
\begin{align}
(\#eq:total-reads-gp)
E[M_S^{(a)}] &= \sum_{i \in S} E[M_i^{(a)}]
\\&= \left( \sum_{i \in S} A_i^{(a)} B_i \right) F^{(a)}
\\&= A_S^{(a)} B_S^{(a)} F^{(a)}.
\end{align}
Under the gamma-Poisson model for $M_i^{(a)}$, we have that $E[M_i^{(a)}] = u_i^{(a)} P_i^{(a)}$.
Equating this expression for $E[M_i^{(a)}]$ with Equation \@ref(eq:mgs-model-gp) lets us solve for the exposure,
\begin{align}
(\#eq:exposure)
u_i^{(a)}
&= \frac{A_i^{(a)} B_i F^{(a)}}{P_i^{(a)}}
\\&= B_i A_S^{(a)} F^{(a)}
\\&= \frac{B_i}{B_S^{(a)}} \cdot E[M_S^{(a)}].
\end{align}
The second line follows from $P_i^{(a)} \equiv A_i^{(a)}/ A_S^{(a)}$, and the third line from Equation \@ref(eq:total-reads-gp).
The final expression in Equation \@ref(eq:exposure) can be used to compute offsets for the regression.
The three terms $B_i$, $B_S^{(a)}$, and $E[M_S^{(a)}]$ are each unknown, so we instead substitute estimates for these terms to obtain an estimated exposure $\hat u_{i}^{(a)}$; we then set the offset in the linear model to $\log \hat u_{i}^{(a)}$.
We can estimate $E[M_S^{(a)}]$ by the observed total count, $M_S^{(a)}$.
In the absence of bias correction, we set $\hat B_{i} = \hat B_{S}^{(a)} = 1$; our estimate of the exposure is then just
\begin{align}
\hat u_i^{(a)} = M_S^{(a)}.
\end{align}
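In code, for a hypothetical $n \times K$ count matrix `M` (samples by species), this uncorrected offset is simply the log total read count per sample.

```r
# Uncorrected analysis: the offset is the log total count per sample.
offset_uncorrected <- log(rowSums(M))
```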
We can apply bias correction using estimates of the efficiencies, $\hat B_{i}$, derived from community control measurements.
From these estimates and the observed counts, we can estimate the mean efficiency in the sample by
\begin{align}
\hat B_S^{(a)} = \sum_{i\in S} \hat B_i \hat P_i^{(a)},
\end{align}
where $\hat P_i^{(a)}$ are the calibrated proportions obtained from a simple plug-in procedure,
\begin{align}
\hat P_i^{(a)} = \frac{M_i^{(a)} / \hat B_i}{\sum_j M_j^{(a)} / \hat B_j}.
\end{align}
Equivalently, we can calculate $\hat B_S^{(a)}$ directly from the observed counts,
\begin{align}
\hat B_S^{(a)} =
\left[\sum_j \frac{M_j^{(a)}}{M_S^{(a)} \hat B_j}\right]^{-1}.
\end{align}
We then estimate the exposure by
\begin{align}
\hat u_i^{(a)} = \frac{\hat B_i}{\hat B_S^{(a)}} \cdot M_S^{(a)}.
\end{align}
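The sketch below assembles these pieces for a focal species (all objects hypothetical: the count matrix `M` from above and a length-$K$ vector `B_hat` of efficiency estimates from control measurements); the resulting `log(u_hat)` is supplied as the offset in the `stan_glm()` call sketched in the Background subsection.

```r
# Bias-corrected exposures for focal species i.
i <- 1
M_S <- rowSums(M)  # total count per sample
# Estimated mean efficiency per sample, computed directly from the counts:
B_S_hat <- 1 / rowSums(sweep(M, 2, B_hat, "/") / M_S)
u_hat <- B_hat[i] / B_S_hat * M_S  # estimated exposure for species i
# Setting B_hat to a vector of ones recovers u_hat = M_S, the uncorrected case.
```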