generated from statOmics/Rmd-website
-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy path06_2_FEV_sol.Rmd
316 lines (238 loc) · 9.68 KB
/
06_2_FEV_sol.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
---
title: "Exeercise 6.2: Linear regression on the FEV dataset - solution"
author: "Lieven Clement and Jeroen Gilis"
date: "statOmics, Ghent University (https://statomics.github.io)"
---
As an exercise on linear regression, we will analyse the FEV dataset.
# The FEV dataset
The FEV, which is an acronym for forced expiratory volume,
is a measure of how much air a person can exhale (in liters)
during a forced breath. In this dataset, the FEV of 606 children,
between the ages of 6 and 17, were measured. The dataset
also provides additional information on these children:
their `age`, their `height`, their `gender` and, most
importantly, whether the child is a smoker or a non-smoker.
The overarching goal of this experiment was to find out if
smoking has an effect on the FEV of children.
# Load the required libraries
```{r, message = FALSE}
library(tidyverse)
```
# Import data
```{r}
fev <- read_tsv("https://raw.githubusercontent.com/statOmics/PSLSData/main/fev.txt")
head(fev)
```
There are a few things in the formatting of the
data that can be improved upon:
1. Both the `gender` and `smoking` can be transformed to
factors.
2. The `height` variable is written in inches. Assuming that
this audience is mainly European, inches are hard to
interpret. Let's add a new column, `height_cm`, with the values
converted to centimeter using the `mutate` function.
```{r}
fev <- fev %>%
mutate(gender = as.factor(gender)) %>%
mutate(smoking = as.factor(smoking)) %>%
mutate(height_cm = height * 2.54)
head(fev)
```
# Data Exploration
Now, let's make a first explorative boxplot, showing
only the FEV for both smoking categories.
```{r}
fev %>%
ggplot(aes(x = smoking, y = fev, fill = smoking)) +
scale_fill_manual(values = c("dimgrey", "firebrick")) +
theme_bw() +
geom_boxplot(outlier.shape = NA) +
geom_jitter(width = 0.2, size = 0.1) +
ggtitle("Boxplot of FEV versus smoking") +
ylab("fev (l)") +
xlab("smoking status")
```
Did you expect these results?
It appears that children that smoke have a higher
median FEV than children that do not smoke.
Should we change legislations worldwide and make
smoking obligatory for children?
Maybe there is something else going on in the data.
Now, we will generate a similar plot, but we will
stratify the data based on age (age as factor).
```{r}
fev %>%
ggplot(aes(
x = age,
y = fev,
color = smoking,
pch = smoking
)) +
facet_grid(cols = vars(smoking)) +
theme_bw() +
scale_color_manual(values = c("dimgrey", "firebrick")) +
geom_point(fill = "white") +
ggtitle("Boxplot of FEV versus smoking, stratified on age") +
ylab("fev (l)") +
xlab("smoking status")
```
This plot seems to already give us a more
plausible picture. First, it seems that we do not have
any smoking children of ages 6, 7 or 8. Second, when
looking at the results per age "category", it seems
no longer the case that smokers have a much higher FEV
than non-smokers; for the higher ages, the contrary
seems true.
This shows that taking into account confounders
(in this case) is crucial! If we simply analyse the dataset based on
the smoking status and FEV values only, our inference might
be incorrect.
Can we provide an even better visualization of the data, taking
into account more useful explanatory variables with respect
to the FEV?
```{r}
fev %>%
ggplot(aes(
x = age,
y = fev,
color = smoking,
pch = smoking
)) +
facet_grid(cols = vars(smoking), rows = vars(gender)) +
theme_bw() +
scale_color_manual(values = c("dimgrey", "firebrick")) +
geom_point(fill = "white") +
ggtitle("Boxplot of FEV versus smoking, stratified on age") +
ylab("fev (l)") +
xlab("smoking status")
```
This plot holds one extra level of information, the gender
of the child. Especially for higher ages, the median FEV
is higher for males as compared to females.
The only source of information that is lacking is `height`.
To look at the effect of height, we could simply make a
scatterplot displaying the FEV in function of a child's
height (in cm). Additionally, we could color the dots based
on gender.
```{r}
fev %>%
ggplot(aes(x = height_cm, y = fev, color = gender)) +
geom_point() +
scale_color_manual(values = c("darkorchid", "olivedrab4")) +
theme_bw() +
ggtitle("Boxplot of FEV versus height") +
ylab("fev (l)") +
xlab("height (cm)")
```
There is a clear relationship between height and FEV.
In addition, we see that for the large height values
(>175cm), we mainly find male subjects.
# Linear regression analysis
## Assumptions of linear regression
List the assumptions:
1. The observations are independent
2. Linearity between the response and predictor variable
3. The residues of the model must be normally distributed
4. Homoscedasticity of the data
The subjects are sampled randomly from the population and can be assumed to be
independent.
```{r}
model <- lm(fev ~ smoking, data = fev)
## display the diagnostic plots of the model
plot(model)
```
We have four diagnostic plots:
### Linearity with the Residuals vs fitted plot
- predictor of predictions $\hat\beta_0+\hat\beta_1 x$ on $X$-axis
- *residuals* on $Y$-as
$$e_i=y_i-\hat{g}(x_i)=y_i-\hat\beta_0-\hat\beta_1\times x_i,$$
However, because smoking is a factor, the predictor variable $x$ in the model is
a dummy variable that can only take two values (0: non-smoking or 1: smoking).
### Normal Q-Q
- QQ-plot of the residuals $e_i$.
The residuals of the linear regression model should be normally
distributed. Based on the second diagnostic plot, the normality
assumption is not met.
### Homoscedasticity
- Square-root of the absolute value of standardized residuals
in function of the fitted values
To meet the third assumption of linear regression, the variance
on the _Square-root of the absolute value of standardized residuals_
must be similar over the entire range of fitted values.
However, because smoking is a factor the predictor variable $x$ in the model is
a dummy variable that can only take two values (0: non-smoking or 1: smoking).
So it is better to check this assumption using boxplots. In the data exploration
we noticed that there is no big difference in the IQR range between smokers and
non smokers.
## Log transformation
The normality assumption was not met. One possible way resolve this is by
relying on the central limit theorem; given the large number of observations in
this dataset, we would still be allowed to perform that assume normality.
Another option is to try a log-transformation of the data. In this example,
log-transforming the data makes perfect sense; we can then interpret difference
in terms of log transformed fold changes, which is sensible for volume data.
Here, we fit the log-linear model and generate diagnostic plots in order to
assess the assumptions for linear regression.
```{r}
log.model <- lm(fev %>% log2() ~ smoking, data = fev)
plot(log.model)
```
Upon log-transformation, all the required assumption are met.
Look at the output of the log-linear model:
```{r}
summary(log.model)
```
Compute the 95% confidence interval on the model parameters:
```{r}
confint(log.model)
```
# Conclusion
## Interpretation on the log scale
Currently, all the outcomes should be interpreted on the log-scale.
Indeed, since we are now modelling _the $\log_{2}$ FEV_
We may interpret the output as follows:
- The slope:
Smoking pupils on average have a log$_{2}$ FEV that is on average
`r log.model$coef[2] %>% round(.,2)` higher than non-smoking pupils (95% CI [`r confint(log.model)[2,] %>% round(.,2)`]).
This difference is extremely significant on the 5% significance level (p << 0.001).
- The intercept:
The average log$_{2}$ of FEV for children that
do not smoke is `r log.model$coef[2] %>% round(.,2)` (95% CI [`r confint(log.model)[1,] %>% round(.,2)`]).
## Interpretation on the original scale
The interpretation the log-scale is quite difficult. However, if we backtransform we get an interpretation in terms of geometric means and fold changes.
```{r}
2^(log.model$coefficients)
```
and of their confidence intervals:
```{r}
2^(confint(log.model))
```
Now, we can interpret the results in terms of the geometric mean:
- The slope:
The geometric mean of the forced expiratory volume is a factor
`r log.model$coefficients[2] %>% abs %>% 2^. %>% round(2)` larger for smoking
pupil than for non-smoking pupil
(95% CI [`r confint(log.model)[2,] %>% 2^. %>% abs %>% round(.,2)`]).
This difference is extremely significant (p << 0.001).
Note that we can also interpret as follows because the factor is between 1
and 2.
The the forced expiratory volume is on average
`r ((log.model$coefficients[2] %>% abs %>% 2^. %>% round(2))-1)*100`% larger for
smoking pupils than for non-smoking pupils
(95% CI [`r ((confint(log.model)[2,] %>% 2^. %>% abs %>% round(.,2))-1)*100`]).
**NOTE THAT THE CONCLUSION IS WRONG BECAUSE THE TWO GROUPS OF PUPILS ARE NOT**
**COMPARABLE. THERE IS CONFOUNDING!**
Indeed, the FEV study is an observational study: smoking pupils are on average
older than non-smoking pupils and age is associated with both the smoking
behavior and long capacity. If we do not account for age the age effect is
partially captured in the estimated effect for smoking!
- The intercept:
The geometric mean of the forced experatory volume for non-smoking pupils
is `r log.model$coefficients[1] %>% 2^. %>% round(2)`l
(95% CI [`r confint(log.model)[1,] %>% 2^. %>% abs %>% round(.,2)`]l).
This value is extermely significantly larger than 1l (p <<< 0.001). Note, that
the statistical test on the intercept is typically not reported.
In the tutorial of multiple regression, we will revisit this exercise and do
an analysis where we correct for this confounder. More specifically, we will
there study the association between smoking and FEV while accounting for
differences in the age, height and gender of the children.