---
title: 'Regression Diagnostics'
author: "[Jaron Arbet]{style='color: steelblue;'}"
date: '`r Sys.Date()`'
date-format: short
format:
  revealjs:
    output-file: regression_diagnostics.html
    scrollable: TRUE
    slide-number: c/t
bibliography: references.bib
embed-resources: true
---
```{r}
library(BoutrosLab.plotting.general);
library(mplot);
library(GGally);
library(sjPlot);
library(car);
seed <- 1234;
source('utilities.R');
```
```{r prepare fev data}
data(fev, package = 'mplot');
colnames(fev)[colnames(fev) == 'height'] <- 'height.inches';
fev$sex <- factor(fev$sex, levels = c(0,1), labels = c('Female', 'Male'));
fev$smoke <- factor(fev$smoke, levels = c(0, 1), labels = c('No', 'Yes'));
```
## Overview
* What is linear regression?
* Assumptions
* Diagnostics and remedies for failed assumptions
## Linear regression
:::: {.columns}
::: {.column width="50%"}
* Continuous response: $\big\{Y_i\big\}_{i=1}^N = (Y_1, \ldots, Y_N)$
* Predictors: $\boldsymbol{X}_i = (X_{i1}, \ldots, X_{iP})$
![](figures/linear.png)
<font size='2'> https://www.mathbootcamps.com/reading-scatterplots/ </font>
:::
::: {.column width="50%"}
**Model**:
\begin{equation}
Y_i = \beta_0 + \sum_{j=1}^P \beta_j X_{ij} + \epsilon_i
\end{equation}
Independent normal errors with constant variance:
\begin{equation}
\epsilon_i \overset{\text{iid}}{\sim} \text{Normal}(0, \sigma^2)
\end{equation}
:::
::::
## Ordinary Least Squares (OLS)
* Estimate $\hat{\boldsymbol{\beta}}$ such that the "sum of squared residuals" $\sum_{i=1}^N (y_i - \hat{y}_i)^2$ is minimized, where $\hat{y}_i = \hat{\beta}_0 + \sum_{j=1}^P \hat{\beta}_j X_{ij}$ (closed-form sketch below)
![](./figures/OLS.png){height=450px width=500px fig-align="center"}
<font size='2'> https://medium.com/analytics-vidhya/ordinary-least-square-ols-method-for-linear-regression-ef8ca10aadfc</font>
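* The minimizer has the closed form $\hat{\boldsymbol{\beta}} = (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{y}$. A minimal sketch comparing it against `lm()`, which computes the same estimate via a more numerically stable QR decomposition (not evaluated here):
```{r, echo = TRUE, eval = FALSE}
# sketch: compute the OLS closed-form solution by hand and compare to lm()
X <- model.matrix(fev ~ age + height.inches + sex + smoke, data = fev);
y <- fev$fev;
beta.hat <- drop(solve(crossprod(X), crossprod(X, y)));  # (X'X)^{-1} X'y
cbind(closed.form = beta.hat, lm = coef(lm(fev ~ ., data = fev)));
```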
## Assumptions
<font size='6'>
1. **Linearity**: relationship between $\boldsymbol{X}$ and $\boldsymbol{Y}$ is *approximately* linear
2. **Normally distributed *residuals***
    + OR the sample size is large (Central Limit Theorem)
> ...simulation studies show that “sufficiently large” is often under 100, and even for our extremely non-Normal medical cost data it is less than 500. [@lumley-normality]
3. **Homoscedasticity (equal variance)**: the residuals have equal variance at every value of $\boldsymbol{X}$
4. **Independence**: residuals are independent (not correlated)
**`r colorize('Other issues', 'steelblue')`:**
* **Multicollinearity**: highly correlated predictors can greatly increase $Var(\hat{\beta})$
* **Influential observations/outliers** can bias results
* **Additivity**: by default, the model assumes no interactions between predictors; interaction terms must be added manually.
</font>
## Example dataset
* Outcome = forced expiratory volume in liters (**fev**)
* N = `r nrow(fev)` youths aged 3-19 from East Boston during the 1970s
:::{.column-body-outset}
```{r, echo = TRUE, cache = TRUE, fig.align = 'center'}
GGally::ggpairs(fev, showStrips = TRUE);
```
:::
## Initial model with all variables
<font size='6'>
```{r, echo = TRUE}
fit <- lm(formula = fev ~ ., data = fev);
sjPlot::tab_model(fit, ci.hyphen = ', ', p.style = 'scientific', digits.p = 2);
```
</font>
## `r colorize('Linearity', 'steelblue')` {.scrollable}
<font size='6'>
* Plot residuals ($y - \hat{y}$) *vs.* each predictor and $\hat{y}$. Want to see horizontal band around 0 with no patterns.
* Curvature in the plots suggests non-linear relationship
</font>
:::{.column-body-outset}
```{r, echo = TRUE, fig.align = 'center', fig.height = 10}
car::residualPlots(fit);
```
:::
<font size='6'>
* `car::residualPlots` also outputs a table of curvature tests: for each continuous predictor, $X_j^2$ is added to the model and its p-value reported; a significant result suggests non-linearity.
</font>
## Remedies for non-linearity
* **Data transformation**: transforming the outcome and/or predictors (e.g. log) can make the relationship approximately linear
* **GAM**: a Generalized Additive Model automatically models linear/non-linear effects using smoothing splines (minimal sketch below); see previous lab talks: [2022-12-08](https://uclahs.app.box.com/file/1085939909995) and [2018-04-27](https://uclahs.app.box.com/file/538943163238)
* **Polynomials**: e.g. $Age + Age^2 + Age^3 + ...$
    + use `stats::poly()` for orthogonal polynomials; raw polynomial terms are usually highly correlated
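A minimal GAM sketch (assumes the `mgcv` package, which is not loaded above):
```{r, echo = TRUE, eval = FALSE}
# each s() term fits a smoothing spline; factor effects stay linear
library(mgcv);
fit.gam <- gam(fev ~ s(age) + s(height.inches) + sex + smoke, data = fev);
plot(fit.gam, pages = 1);  # plot the estimated smooth effects
```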
:::{.column-body-outset}
```{r, echo = TRUE, fig.align = 'center', fig.height = 10}
fit.poly <- lm(fev ~ poly(age, 2) + poly(height.inches, 2) + sex + smoke, data = fev);
car::residualPlots(fit.poly, tests = FALSE);
```
:::
* Note: the adjusted $R^2$ was `r round(summary(fit)$adj, 2)` for the linear model and `r round(summary(fit.poly)$adj, 2)` for the polynomial model
## `r colorize('Normality of residuals or large N', 'steelblue')`
* Simulations and references in [@lumley-normality] suggest our N = `r nrow(fev)` is sufficient. Nevertheless, you can visually check normality below.
* P-value tests of normality are not recommended: they have low power in the scenario you care about (small N) and are usually significant in the scenario you don't care about (large N)
:::{.column-body-outset}
```{r}
par(mfrow = c(2, 2));
hist(residuals(fit), main = 'Linear fit', xlim = c(-2, 2));
hist(residuals(fit.poly), main = 'Polynomial fit', xlim = c(-2, 2));
qqnorm(residuals(fit), main = 'Linear: Normal QQ plot');
qqline(residuals(fit));
qqnorm(residuals(fit.poly), main = 'Polynomial: Normal QQ plot');
qqline(residuals(fit.poly));
```
:::
## Remedies for non-normal residuals
* Large $N$
* Data transformation of outcome and/or predictors
* Make sure linearity assumption is met
* Bootstrap confidence intervals for robust inference:
```{r, echo = TRUE, cache = TRUE}
set.seed(123);
bootstrap <- car::Boot(fit, method = 'case');
round(confint(bootstrap), 3);
```
## `r colorize('Homoscedasticity (equal variance)', 'steelblue')`
<font size='6'>
* **Plot residuals vs fitted values**: want flat/horizontal band around 0
* Funnel pattern like < or > indicates heteroscedasticity
:::{.column-body-outset}
```{r echo = TRUE}
par(mfrow = c(1, 2));
car::residualPlot(fit, main = 'Linear fit');
car::residualPlot(fit.poly, main = 'Polynomial fit');
```
:::
* `car::spreadLevelPlot`: plots $\log(|\text{studentized residuals}|)$ *vs.* $\log(\hat{y})$
* A flat horizontal line indicates equal variance
```{r, echo = FALSE}
par(mfrow = c(1, 2));
car::spreadLevelPlot(fit, main = 'Linear fit');
car::spreadLevelPlot(fit.poly, main = 'Polynomial fit');
```
```{r, echo = TRUE}
# score test for non-constant variance; a small p-value suggests heteroscedasticity
car::ncvTest(fit);
car::ncvTest(fit.poly);
```
</font>
## Remedies for unequal variance
* **Transform $\boldsymbol{Y}$**: `car::spreadLevelPlot()` prints a "Suggested power transformation" $\tau$; refit the model with $\boldsymbol{Y}^\tau$ (hypothetical sketch at the end of this slide)
* Unequal variance affects $Var(\hat{\beta})$ but not $\hat{\beta}$; thus you can use robust standard errors or the bootstrap for CIs/p-values:
```{r, echo = TRUE, eval = FALSE}
# robust standard errors
robust.se <- lmtest::coeftest(
x = fit,
vcov = sandwich::vcovHC(fit)
);
confint(robust.se);
```
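For the transformation remedy, a hypothetical sketch (the power 0.9 is made up; substitute the value `spreadLevelPlot()` prints for your data):
```{r, echo = TRUE, eval = FALSE}
tau <- 0.9;  # hypothetical suggested power transformation
fit.transformed <- lm(
    formula = I(fev^tau) ~ age + height.inches + sex + smoke,
    data = fev
    );
car::ncvTest(fit.transformed);  # re-check the equal-variance assumption
```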
## `r colorize('Independent residuals', 'steelblue')`
* Correlated residuals can occur from repeated measures, or when patients cluster by some group (e.g. family, hospital)
* Plot residuals vs time or other suspected clustering variables
* **Remedies**: robust sandwich standard errors that account for the cluster effect, or linear mixed models (minimal sketch below)
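A minimal sketch of both remedies, assuming a hypothetical clustering variable `family.id` (not present in the fev data):
```{r, echo = TRUE, eval = FALSE}
# cluster-robust (sandwich) standard errors
lmtest::coeftest(
    x = fit,
    vcov = sandwich::vcovCL(fit, cluster = ~ family.id)
    );
# linear mixed model with a random intercept per cluster
lme4::lmer(fev ~ age + height.inches + sex + smoke + (1 | family.id), data = fev);
```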
## `r colorize('Multicollinearity', 'steelblue')`
* Highly correlated predictors can increase $Var(\hat{\beta})$, producing unreliable results
* `caret::findCorrelation` flags predictors to drop when pairwise correlation > cutoff (sketch at the end of this slide)
* **Variance inflation factors**: $\text{VIF}(X_j) = \frac{1}{1 - R^2_j}$, where $R^2_j$ is the proportion of variance in $X_j$ explained by all other predictors
* VIF > 10 is a common cutoff
```{r, echo = TRUE}
car::vif(fit);
```
* Other remedies include PCA and regularized regression
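A minimal sketch of `caret::findCorrelation` (the 0.8 cutoff is arbitrary; assumes the `caret` package is installed):
```{r, echo = TRUE, eval = FALSE}
# flag numeric predictors whose pairwise correlation exceeds the cutoff
predictor.cors <- cor(fev[, c('age', 'height.inches')]);
caret::findCorrelation(predictor.cors, cutoff = 0.8, names = TRUE);
```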
## `r colorize('Influential observations', 'steelblue')`
* Extreme values in $\boldsymbol{Y}$ and/or $\boldsymbol{X}$ can highly influence results
* $|\text{Standardized residuals}| > 3$ flags a potential outlier (under normality, 99.7% of standardized residuals fall within $\pm 3$)
:::{.column-body-outset}
```{r, echo = TRUE, fig.align = 'center'}
plot(
x = fitted(fit),
y = rstandard(fit),
ylab = 'Standardized Residuals',
xlab = 'Fitted values'
);
abline(h = -3, lty = 2);
abline(h = 3, lty = 2);
```
:::
## Cook's distance
<font size='6'>
* $D_i = \frac{\sum_j (\hat{y}_j - \hat{y}_{j(i)})^2}{(p+1)\hat{\sigma}^2}$
* $D_i$ is proportional to the distance that the predicted values would move if the $i$th patient were excluded
* Various cutoffs have been used in the literature, e.g. 1, $\frac{4}{N}$, or $\frac{4}{N - p - 1}$; alternatively, visually identify patients with relatively large values
</font>
:::{.column-body-outset}
```{r, echo = TRUE, fig.align = 'center'}
car::influenceIndexPlot(fit, vars = 'Cook');
```
:::
## Remedies for influential observations
* **Sensitivity analysis**: fit two models, one including and one excluding the influential observations. Do the results substantially change? (minimal sketch below)
* **Robust regression**: include all patients but downweight influential observations
+ R: [https://www.john-fox.ca/Companion/appendices/Appendix-Robust-Regression.pdf](https://www.john-fox.ca/Companion/appendices/Appendix-Robust-Regression.pdf)
+ `quantreg::rq()` for median regression
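A minimal sketch of both remedies (the $\frac{4}{N}$ cutoff is one convention among several; assumes the `MASS` and `quantreg` packages):
```{r, echo = TRUE, eval = FALSE}
# sensitivity analysis: refit after excluding high-influence patients
influential <- which(cooks.distance(fit) > 4 / nrow(fev));
fit.excluded <- update(fit, data = fev[-influential, ]);
cbind(all = coef(fit), excluded = coef(fit.excluded));
# robust alternatives: keep all patients but downweight extreme ones
fit.huber <- MASS::rlm(fev ~ ., data = fev);                  # Huber M-estimation
fit.median <- quantreg::rq(fev ~ ., tau = 0.5, data = fev);   # median regression
```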
## `r colorize('Interactions', 'steelblue')`
* `r colorize('Someone should do a short talk on interactions!', 'steelblue')`
* By default, linear regression assumes no interactions between predictors.
* You can manually add interaction terms to the model to investigate: `A*B` in an R formula expands to `A + B + A:B`
<font size='5'>
```{r, echo = TRUE}
# allow all variables to interact with Sex
fit.interaction <- lm(fev ~ (age + height.inches + smoke) * sex, data = fev);
sjPlot::tab_model(fit.interaction, ci.hyphen = ', ', p.style = 'scientific', digits.p = 2);
```
</font>
* If you suspect many interactions, it may be better to use a machine learning model that handles interactions automatically, such as decision trees or random forests (minimal sketch below)
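A minimal random forest sketch (assumes the `randomForest` package):
```{r, echo = TRUE, eval = FALSE}
# tree ensembles capture interactions without specifying them by hand
fit.rf <- randomForest::randomForest(fev ~ ., data = fev, importance = TRUE);
randomForest::varImpPlot(fit.rf);  # variable importance
```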
## `r colorize('Summary', 'steelblue')`
<font size='5'>
| `r colorize('Assumption', 'steelblue')` | `r colorize('Assessment', 'steelblue')` | `r colorize('Solution', 'steelblue')` |
|------------------------|----------------------------------------------------------------|----------------------------------------------------------------------|
| **Linearity** | `car::residualPlots`, want horizontal band around 0 for each predictor | - Transform $Y$ or $X$ <br />- GAM to automate linear/non-linear <br />- Polynomials |
| **Normality of residuals** | - Histogram/density plot <br />- Normal QQ plot | - Large N <br />- Transform $Y$ or $X$ <br />- Make sure linearity assumption is met <br />- Bootstrap CIs: `confint(car::Boot(fit))` |
| **Equal variance** | - `car::residualPlot` and `car::spreadLevelPlot`, want horizontal band around 0 <br />- `car::ncvTest` | - Transform $Y$ using exponent from `spreadLevelPlot` <br />- Robust standard errors or bootstrap CIs |
| **Independent residuals** | Plot residuals vs time or other suspected clustering variables | - Robust sandwich standard errors for cluster effect <br />- Linear mixed models |
| **Multicollinearity** | - Check correlation between predictors <br />- `car::vif`, want VIF < 10 | - Given 2 highly correlated predictors, keep only 1 <br />- `caret::findCorrelation` to flag predictors with corr > cutoff <br />- PCA, regularized regression |
| **Influential obs** | - Plot standardized residuals vs fitted values; \|r\| > 3 suggests an outlier <br />- Cook's distance, `car::influenceIndexPlot` | - Sensitivity analysis fitting models with/without influential obs <br />- Robust regression to downweight influential obs: `quantreg::rq` |
| **Interactions** | Manually add interaction terms; are they significant? | - Manually add interaction terms <br />- Stratify model by potential interaction variables <br />- ML models that automatically handle interactions |
</font>
## `r colorize('Extensions to Generalized Linear Models', 'steelblue')`
* Generalized linear models (GLMs): for binary, multi-category, ordinal, or count outcomes
* `car::residualPlots` to assess linearity of each predictor. Want to see horizontal band around 0 with no patterns. For non-linearity, use polynomials or GAM.
* `car::vif` to assess multicollinearity
* `car::influenceIndexPlot` to assess influential observations
* `car::Boot` for robust bootstrap confidence intervals
* GLMs also make assumptions about the variance; if violated, you can use the bootstrap or robust standard errors:
```{r, eval = FALSE, echo = TRUE}
lmtest::coeftest(fit, vcov = sandwich::vcovHC(fit))
```
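* A hypothetical end-to-end illustration using smoking status from the fev data as a binary outcome:
```{r, eval = FALSE, echo = TRUE}
# logistic regression, checked with the same car diagnostics as above
fit.logistic <- glm(smoke ~ age + fev + sex, data = fev, family = binomial);
car::residualPlots(fit.logistic);                      # linearity
car::vif(fit.logistic);                                # multicollinearity
car::influenceIndexPlot(fit.logistic, vars = 'Cook');  # influential obs
```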
## References