06_2_FEV_sol.Rmd

---
title: "Exeercise 6.2: Linear regression on the FEV dataset - solution"
author: "Lieven Clement and Jeroen Gilis"
date: "statOmics, Ghent University (https://statomics.github.io)"
---

As an exercise on linear regression, we will analyse the FEV dataset.

# The FEV dataset

The FEV, which is an acronym for forced expiratory volume,
is a measure of how much air a person can exhale (in liters)
during a forced breath. In this dataset, the FEV of 606 children,
between the ages of 6 and 17, were measured. The dataset
also provides additional information on these children:
their `age`, their `height`, their `gender` and, most
importantly, whether the child is a smoker or a non-smoker.

The overarching goal of this experiment was to find out if
smoking has an effect on the FEV of children.

# Load the required libraries

```{r, message = FALSE}
library(tidyverse)
```

# Import data

```{r}
fev <- read_tsv("https://raw.githubusercontent.com/statOmics/PSLSData/main/fev.txt")
head(fev)
```

There are a few things in the formatting of the
data that can be improved upon:

1. Both the `gender` and `smoking` can be transformed to
factors.

2. The `height` variable is written in inches. Assuming that
this audience is mainly European, inches are hard to
interpret. Let's add a new column, `height_cm`, with the values
converted to centimeter using the `mutate` function.

```{r}
fev <- fev %>%
  mutate(gender = as.factor(gender)) %>%
  mutate(smoking = as.factor(smoking)) %>%
  mutate(height_cm = height * 2.54)

head(fev)
```

# Data Exploration

Now, let's make a first explorative boxplot, showing
only the FEV for both smoking categories.

```{r}
fev %>%
  ggplot(aes(x = smoking, y = fev, fill = smoking)) +
  scale_fill_manual(values = c("dimgrey", "firebrick")) +
  theme_bw() +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.2, size = 0.1) +
  ggtitle("Boxplot of FEV versus smoking") +
  ylab("fev (l)") +
  xlab("smoking status")
```

Did you expect these results?

It appears that children that smoke have a higher
median FEV than children that do not smoke.
Should we change legislations worldwide and make
smoking obligatory for children?

Maybe there is something else going on in the data.
Now, we will generate a similar plot, but we will
stratify the data based on age (age as factor).

```{r}
fev %>%
  ggplot(aes(
    x = age,
    y = fev,
    color = smoking,
    pch = smoking
  )) +
  facet_grid(cols = vars(smoking)) +
  theme_bw() +
  scale_color_manual(values = c("dimgrey", "firebrick")) +
  geom_point(fill = "white") +
  ggtitle("Boxplot of FEV versus smoking, stratified on age") +
  ylab("fev (l)") +
  xlab("smoking status")
```

This plot seems to already give us a more
plausible picture. First, it seems that we do not have
any smoking children of ages 6, 7 or 8. Second, when
looking at the results per age "category", it seems
no longer the case that smokers have a much higher FEV
than non-smokers; for the higher ages, the contrary
seems true.

This shows that taking into account confounders
(in this case) is crucial! If we simply analyse the dataset based on
the smoking status and FEV values only, our inference might
be incorrect.

Can we provide an even better visualization of the data, taking
into account more useful explanatory variables with respect
to the FEV?

```{r}
fev %>%
  ggplot(aes(
    x = age,
    y = fev,
    color = smoking,
    pch = smoking
  )) +
  facet_grid(cols = vars(smoking), rows = vars(gender)) +
  theme_bw() +
  scale_color_manual(values = c("dimgrey", "firebrick")) +
  geom_point(fill = "white") +
  ggtitle("Boxplot of FEV versus smoking, stratified on age") +
  ylab("fev (l)") +
  xlab("smoking status")
```

This plot holds one extra level of information, the gender
of the child. Especially for higher ages, the median FEV
is higher for males as compared to females.

The only source of information that is lacking is `height`.
To look at the effect of height, we could simply make a
scatterplot displaying the FEV in function of a child's
height (in cm). Additionally, we could color the dots based
on gender.

```{r}
fev %>%
  ggplot(aes(x = height_cm, y = fev, color = gender)) +
  geom_point() +
  scale_color_manual(values = c("darkorchid", "olivedrab4")) +
  theme_bw() +
  ggtitle("Boxplot of FEV versus height") +
  ylab("fev (l)") +
  xlab("height (cm)")
```

There is a clear relationship between height and FEV.
In addition, we see that for the large height values
(>175cm), we mainly find male subjects.

# Linear regression analysis

## Assumptions of linear regression

List the assumptions:

1. The observations are independent
2. Linearity between the response and predictor variable
3. The residues of the model must be normally distributed
4. Homoscedasticity of the data

The subjects are sampled randomly from the population and can be assumed to be
independent.

```{r}
model <- lm(fev ~ smoking, data = fev)

## display the diagnostic plots of the model
plot(model)
```

We have four diagnostic plots:

### Linearity with the Residuals vs fitted plot

- predictor of predictions $\hat\beta_0+\hat\beta_1 x$ on $X$-axis
- *residuals* on $Y$-as
$$e_i=y_i-\hat{g}(x_i)=y_i-\hat\beta_0-\hat\beta_1\times x_i,$$
However, because smoking is a factor, the predictor variable $x$ in the model is
a dummy variable that can only take two values (0: non-smoking or 1: smoking).

### Normal Q-Q

- QQ-plot of the residuals $e_i$.

The residuals of the linear regression model should be normally
distributed. Based on the second diagnostic plot, the normality
assumption is not met.

### Homoscedasticity

- Square-root of the absolute value of standardized residuals
in function of the fitted values

To meet the third assumption of linear regression, the variance
on the _Square-root of the absolute value of standardized residuals_
must be similar over the entire range of fitted values.

However, because smoking is a factor the predictor variable $x$ in the model is
a dummy variable that can only take two values (0: non-smoking or 1: smoking).
So it is better to check this assumption using boxplots. In the data exploration
we noticed that there is no big difference in the IQR range between smokers and
non smokers.

## Log transformation

The normality assumption was not met. One possible way resolve this is by
relying on the central limit theorem; given the large number of observations in
this dataset, we would still be allowed to perform that assume normality.
Another option is to try a log-transformation of the data. In this example,
log-transforming the data makes perfect sense; we can then interpret difference
in terms of log transformed fold changes, which is sensible for volume data.

Here, we fit the log-linear model and generate diagnostic plots in order to
assess the assumptions for linear regression.

```{r}
log.model <- lm(fev %>% log2() ~ smoking, data = fev)
plot(log.model)
```

Upon log-transformation, all the required assumption are met.

Look at the output of the log-linear model:

```{r}
summary(log.model)
```

Compute the 95% confidence interval on the model parameters:

```{r}
confint(log.model)
```

# Conclusion

## Interpretation on the log scale

Currently, all the outcomes should be interpreted on the log-scale.
Indeed, since we are now modelling  _the $\log_{2}$ FEV_

We may interpret the output as follows:

- The slope:

Smoking pupils on average have a log$_{2}$ FEV that is on average
`r log.model$coef[2] %>% round(.,2)` higher than non-smoking pupils (95% CI [`r confint(log.model)[2,] %>% round(.,2)`]).
This difference is extremely significant on the 5% significance level (p << 0.001).

- The intercept:

The average log$_{2}$ of FEV for children that
do not smoke is `r log.model$coef[2] %>% round(.,2)` (95% CI [`r confint(log.model)[1,] %>% round(.,2)`]).

## Interpretation on the original scale

The interpretation the log-scale is quite difficult. However, if we backtransform we get an interpretation in terms of geometric means and fold changes.

```{r}
2^(log.model$coefficients)
```

and of their confidence intervals:

```{r}
2^(confint(log.model))
```

Now, we can interpret the results in terms of the geometric mean:

- The slope:

The geometric mean of the forced expiratory volume is a factor
`r log.model$coefficients[2] %>% abs %>% 2^. %>% round(2)` larger for smoking
pupil than for non-smoking pupil
(95% CI [`r confint(log.model)[2,] %>% 2^. %>% abs %>% round(.,2)`]).
This difference is extremely significant (p << 0.001).

Note that we can also interpret as follows because the factor is between 1
and 2.

The the forced expiratory volume is on average
`r ((log.model$coefficients[2] %>% abs %>% 2^. %>% round(2))-1)*100`% larger for
smoking pupils than for non-smoking pupils
(95% CI [`r ((confint(log.model)[2,] %>% 2^. %>% abs %>% round(.,2))-1)*100`]).

**NOTE THAT THE CONCLUSION IS WRONG BECAUSE THE TWO GROUPS OF PUPILS ARE NOT**
**COMPARABLE. THERE IS CONFOUNDING!**

Indeed, the FEV study is an observational study: smoking pupils are on average
older than non-smoking pupils and age is associated with both the smoking
behavior and long capacity. If we do not account for age the age effect is
partially captured in the estimated effect for smoking!

- The intercept:

The geometric mean of the forced experatory volume for non-smoking pupils
is  `r log.model$coefficients[1] %>% 2^. %>% round(2)`l
(95% CI [`r confint(log.model)[1,] %>% 2^. %>% abs %>% round(.,2)`]l).
This value is extermely significantly larger than 1l (p <<< 0.001). Note, that
the statistical test on the intercept is typically not reported.


In the tutorial of multiple regression, we will revisit this exercise and do
an analysis where we correct for this confounder. More specifically, we will
there study  the association between smoking and FEV while accounting for
differences in the age, height and gender of the children.