---
title: "Week 7 - Homework"
author: "STAT 420, Summer 2018, BALAJI SATHYAMURTHY (BALAJIS2)"
date: ''
output:
  html_document:
    toc: yes
  pdf_document: default
urlcolor: cyan
---
***
```{r setup, echo = FALSE, message = FALSE, warning = FALSE}
options(scipen = 1, digits = 4, width = 80)
knitr::opts_chunk$set(fig.align = "center")
```
## Exercise 1 (EPA Emissions Data)
For this exercise, we will use the data stored in [`epa2015.csv`](epa2015.csv). It contains detailed descriptions of 4,411 vehicles manufactured in 2015 that were used for fuel economy testing [as performed by the Environmental Protection Agency](https://www3.epa.gov/otaq/tcldata.htm). The variables in the dataset are:
- `Make` - Manufacturer
- `Model` - Model of vehicle
- `ID` - Manufacturer defined vehicle identification number within EPA's computer system (not a VIN number)
- `disp` - Cubic inch displacement of test vehicle
- `type` - Car, truck, or both (for vehicles that meet specifications of both car and truck, like smaller SUVs or crossovers)
- `horse` - Rated horsepower, in foot-pounds per second
- `cyl` - Number of cylinders
- `lockup` - Vehicle has transmission lockup; N or Y
- `drive` - Drivetrain system code
- A = All-wheel drive
- F = Front-wheel drive
- P = Part-time 4-wheel drive
- R = Rear-wheel drive
- 4 = 4-wheel drive
- `weight` - Test weight, in pounds
- `axleratio` - Axle ratio
- `nvratio` - n/v ratio (engine speed versus vehicle speed at 50 mph)
- `THC` - Total hydrocarbons, in grams per mile (g/mi)
- `CO` - Carbon monoxide (a regulated pollutant), in g/mi
- `CO2` - Carbon dioxide (the primary byproduct of all fossil fuel combustion), in g/mi
- `mpg` - Fuel economy, in miles per gallon
We will attempt to model `CO2` using both `horse` and `type`. In practice, we would use many more predictors, but limiting ourselves to these two, one numeric and one factor, will allow us to create a number of plots.
Load the data, and check its structure using `str()`. Verify that `type` is a factor; if not, coerce it to be a factor.
```{r}
epa2015_data = read.csv("epa2015.csv")
str(epa2015_data)
# ensure `type` is a factor (read.csv may import it as character)
epa2015_data$type = as.factor(epa2015_data$type)
```
`type` is now a factor with levels `Both`, `Car`, and `Truck`.
**(a)** Do the following:
- Make a scatterplot of `CO2` versus `horse`. Use a different color point for each vehicle `type`.
```{r}
plot(CO2~horse,col=type,pch = as.numeric(type),cex = 2,data = epa2015_data)
legend("topright",c("Both","Car","Truck"),pch = c(1,2,3),col = c(1,2,3))
```
- Fit a simple linear regression model with `CO2` as the response and only `horse` as the predictor.
```{r}
epa_simple_model = lm(CO2~horse,data = epa2015_data)
```
- Add the fitted regression line to the scatterplot. Comment on how well this line models the data.
```{r}
plot(CO2~horse,col=type,pch = as.numeric(type),cex = 2,data = epa2015_data)
legend("topright",c("Both","Car","Truck"),pch = c(1,2,3),col = c(1,2,3))
abline(epa_simple_model,lwd=2)
```
This single line does not model the data well. Because the model ignores `type`, the points for `Both` and `Truck` are not separated from the rest; in particular, the `Truck` observations sit mostly above the fitted line.
- Give an estimate for the average change in `CO2` for a one foot-pound per second increase in `horse` for a vehicle of type `car`.
The estimate for the average change in `CO2` for a one foot-pound per second increase in `horse` for a vehicle of type `car` is `r summary(epa_simple_model)$coefficients["horse", "Estimate"]`. (This model ignores `type`, so the estimate is the same for every vehicle type.)
- Give a 90% prediction interval using this model for the `CO2` of a Subaru Impreza Wagon, which is a vehicle with 148 horsepower and is considered type `Both`. (Interestingly, the dataset gives the wrong drivetrain for most Subarus in this dataset, as they are almost all listed as `F`, when they are in fact all-wheel drive.)
```{r}
new_obs = data.frame(horse = 148)
pred_int = predict(epa_simple_model, newdata = new_obs, interval = "prediction", level = 0.90)
lwr = pred_int[1, "lwr"]
upr = pred_int[1, "upr"]
```
We are 90% confident that the `CO2` of a new vehicle with 148 horsepower will fall between `r lwr` and `r upr` g/mi. (This model cannot make use of the `Both` type information.)
**(b)** Do the following:
- Make a scatterplot of `CO2` versus `horse`. Use a different color point for each vehicle `type`.
```{r}
plot(CO2~horse,col=type,pch = as.numeric(type),cex = 2,data = epa2015_data)
legend("topright",c("Both","Car","Truck"),pch = c(1,2,3),col = c(1,2,3))
```
- Fit an additive multiple regression model with `CO2` as the response and `horse` and `type` as the predictors.
```{r}
epa_add_model = lm(CO2~horse+type,data = epa2015_data)
coef(epa_add_model)
int_both = coef(epa_add_model)[1]
int_car = coef(epa_add_model)[1] + coef(epa_add_model)[3]
int_truck = coef(epa_add_model)[1] + coef(epa_add_model)[4]
slope = coef(epa_add_model)[2]
```
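In terms of the coefficients above, the additive model fits one common slope with a separate intercept for each vehicle type (with `Both` as the reference level), which is exactly what the `int_*` and `slope` variables compute:

\[
\hat{\text{CO2}} =
\begin{cases}
\hat{\beta}_0 + \hat{\beta}_1 \cdot \text{horse}, & \text{type} = \text{Both} \\
(\hat{\beta}_0 + \hat{\beta}_2) + \hat{\beta}_1 \cdot \text{horse}, & \text{type} = \text{Car} \\
(\hat{\beta}_0 + \hat{\beta}_3) + \hat{\beta}_1 \cdot \text{horse}, & \text{type} = \text{Truck}
\end{cases}
\]

where $\hat{\beta}_2$ and $\hat{\beta}_3$ are the coefficients of the `typeCar` and `typeTruck` dummy variables.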
- Add the fitted regression "lines" to the scatterplot with the same colors as their respective points (one line for each vehicle type). Comment on how well this line models the data.
```{r}
plot(CO2~horse,col=type,pch = as.numeric(type),cex = 2,data = epa2015_data)
legend("topright",c("Both","Car","Truck"),pch = c(1,2,3),col = c(1,2,3))
abline(int_both,slope,col = 1,lwd = 2,lty = 1)
abline(int_car,slope,col = 2,lwd = 2,lty = 2)
abline(int_truck,slope,col = 3,lwd = 2,lty = 3)
```
This model is an improvement, but it still has issues: because all three lines are forced to share the same slope, the points for types `Both` and `Truck` are still not separated well, especially at higher horsepower.
- Give an estimate for the average change in `CO2` for a one foot-pound per second increase in `horse` for a vehicle of type `car`.
```{r}
coef(epa_add_model)
```
The estimate for the average change in `CO2` for a one foot-pound per second increase in `horse` for a vehicle of type `car` is `r coef(epa_add_model)[2]`. In the additive model this slope is shared by all three vehicle types.
- Give a 90% prediction interval using this model for the `CO2` of a Subaru Impreza Wagon, which is a vehicle with 148 horsepower and is considered type `Both`.
```{r}
new_obs = data.frame(horse = 148, type = "Both")
pred_int = predict(epa_add_model, newdata = new_obs, level = 0.90, interval = "prediction")
lwr = pred_int[1, "lwr"]
upr = pred_int[1, "upr"]
```
We are 90% confident that the `CO2` of a new vehicle of type `Both` with 148 horsepower will fall between `r lwr` and `r upr` g/mi.
**(c)** Do the following:
- Make a scatterplot of `CO2` versus `horse`. Use a different color point for each vehicle `type`.
```{r}
plot(CO2~horse,col=type,pch = as.numeric(type),cex = 2,data = epa2015_data)
legend("topright",c("Both","Car","Truck"),pch = c(1,2,3),col = c(1,2,3))
```
- Fit an interaction multiple regression model with `CO2` as the response and `horse` and `type` as the predictors.
```{r}
epa_int_model = lm(CO2~horse*type,data = epa2015_data)
coef(epa_int_model)
int_both = coef(epa_int_model)[1]
int_car = coef(epa_int_model)[1] + coef(epa_int_model)[3]
int_truck = coef(epa_int_model)[1] + coef(epa_int_model)[4]
slope_both = coef(epa_int_model)[2]
slope_car = coef(epa_int_model)[2] + coef(epa_int_model)[5]
slope_truck = coef(epa_int_model)[2] + coef(epa_int_model)[6]
```
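The interaction model allows each vehicle type its own intercept *and* its own slope, matching the `int_*` and `slope_*` variables above:

\[
\hat{\text{CO2}} =
\begin{cases}
\hat{\beta}_0 + \hat{\beta}_1 \cdot \text{horse}, & \text{type} = \text{Both} \\
(\hat{\beta}_0 + \hat{\beta}_2) + (\hat{\beta}_1 + \hat{\beta}_4) \cdot \text{horse}, & \text{type} = \text{Car} \\
(\hat{\beta}_0 + \hat{\beta}_3) + (\hat{\beta}_1 + \hat{\beta}_5) \cdot \text{horse}, & \text{type} = \text{Truck}
\end{cases}
\]

where $\hat{\beta}_4$ and $\hat{\beta}_5$ are the `horse:typeCar` and `horse:typeTruck` interaction coefficients.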
- Add the fitted regression "lines" to the scatterplot with the same colors as their respective points (one line for each vehicle type). Comment on how well this line models the data.
```{r}
plot(CO2~horse,col=type,pch = as.numeric(type),cex = 2,data = epa2015_data)
legend("topright",c("Both","Car","Truck"),pch = c(1,2,3),col = c(1,2,3))
abline(int_both,slope_both,col = 1,lwd = 2,lty = 1)
abline(int_car,slope_car,col = 2,lwd = 2,lty = 2)
abline(int_truck,slope_truck,col = 3,lwd = 2,lty = 3)
```
This model fits better than the other two: because each type has its own slope, the fitted lines for `Both` and `Truck` now follow their respective points well.
- Give an estimate for the average change in `CO2` for a one foot-pound per second increase in `horse` for a vehicle of type `car`.
```{r}
coef(epa_int_model)
```
The average change in `CO2` for a one foot-pound per second increase in `horse` for a vehicle of type `car` is `r coef(epa_int_model)[2] + coef(epa_int_model)[5]`, the sum of the `horse` slope and the `horse:typeCar` interaction coefficient.
- Give a 90% prediction interval using this model for the `CO2` of a Subaru Impreza Wagon, which is a vehicle with 148 horsepower and is considered type `Both`.
```{r}
new_obs = data.frame(horse = 148, type = "Both")
pred_int = predict(epa_int_model, newdata = new_obs, level = 0.90, interval = "prediction")
lwr = pred_int[1, "lwr"]
upr = pred_int[1, "upr"]
```
We are 90% confident that the `CO2` of a new vehicle of type `Both` with 148 horsepower will fall between `r lwr` and `r upr` g/mi.
**(d)** Based on the previous plots, you probably already have an opinion on the best model. Now use an ANOVA $F$-test to compare the additive and interaction models. Based on this test and a significance level of $\alpha = 0.10$, which model is preferred?
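Writing the interaction model as $\text{CO2} = \beta_0 + \beta_1 \cdot \text{horse} + \beta_2 \cdot \text{typeCar} + \beta_3 \cdot \text{typeTruck} + \beta_4 \cdot (\text{horse} \times \text{typeCar}) + \beta_5 \cdot (\text{horse} \times \text{typeTruck}) + \epsilon$, the ANOVA $F$-test compares

\[
H_0: \beta_4 = \beta_5 = 0
\]

against the alternative that at least one of $\beta_4, \beta_5$ is not zero.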
```{r}
fstatistic = anova(epa_add_model, epa_int_model)[2, "F"]
pvalue = anova(epa_add_model, epa_int_model)[2, "Pr(>F)"]
```
The test statistic is `r fstatistic` with a p-value of `r pvalue`. Since the p-value is less than $\alpha = 0.10$, we reject the null hypothesis (the additive model); the interaction model is preferred, in agreement with the plots above.
***
## Exercise 2 (Hospital SUPPORT Data, White Blood Cells)
For this exercise, we will use the data stored in [`hospital.csv`](hospital.csv). It contains a random sample of 580 seriously ill hospitalized patients from a famous study called "SUPPORT" (Study to Understand Prognoses Preferences Outcomes and Risks of Treatment). As the name suggests, the purpose of the study was to determine what factors affected or predicted outcomes, such as how long a patient remained in the hospital. The variables in the dataset are:
- `Days` - Days to death or hospital discharge
- `Age` - Age on day of hospital admission
- `Sex` - Female or male
- `Comorbidity` - Patient diagnosed with more than one chronic disease
- `EdYears` - Years of education
- `Education` - Education level; high or low
- `Income` - Income level; high or low
- `Charges` - Hospital charges, in dollars
- `Care` - Level of care required; high or low
- `Race` - Non-white or white
- `Pressure` - Blood pressure, in mmHg
- `Blood` - White blood cell count, in gm/dL
- `Rate` - Heart rate, in bpm
For this exercise, we will use `Age`, `Education`, `Income`, and `Sex` in an attempt to model `Blood`. Essentially, we are attempting to model white blood cell count using only demographic information.
**(a)** Load the data, and check its structure using `str()`. Verify that `Education`, `Income`, and `Sex` are factors; if not, coerce them to be factors. What are the levels of `Education`, `Income`, and `Sex`?
```{r}
hosp_data = read.csv("hospital.csv")
str(hosp_data)
# ensure `Education`, `Income`, and `Sex` are factors (read.csv may import them as character)
hosp_data$Education = as.factor(hosp_data$Education)
hosp_data$Income = as.factor(hosp_data$Income)
hosp_data$Sex = as.factor(hosp_data$Sex)
```
- The levels of `Education` are `r levels(hosp_data$Education)`
- The levels of `Income` are `r levels(hosp_data$Income)`
- The levels of `Sex` are `r levels(hosp_data$Sex)`
**(b)** Fit an additive multiple regression model with `Blood` as the response using `Age`, `Education`, `Income`, and `Sex` as predictors. What does `R` choose as the reference level for `Education`, `Income`, and `Sex`?
```{r}
hosp_add_model = lm(Blood~Age+Education+Income+Sex,data = hosp_data)
```
`R` chooses the first factor level (alphabetically) as the reference level:
- For `Education` it is `high`
- For `Income` it is `high`
- For `Sex` it is `female`
**(c)** Fit a multiple regression model with `Blood` as the response. Use the main effects of `Age`, `Education`, `Income`, and `Sex`, as well as the interaction of `Sex` with `Age` and the interaction of `Sex` and `Income`. Use a statistical test to compare this model to the additive model using a significance level of $\alpha = 0.10$. Which do you prefer?
```{r}
hosp_mult_regr_model1 = lm(Blood~Age+Education+Income+Sex+Sex:Age+Sex:Income,data = hosp_data)
pvalue = anova(hosp_add_model,hosp_mult_regr_model1)[2,"Pr(>F)"]
```
Based on the p-value of `r pvalue` at a significance level of $\alpha = 0.10$, we fail to reject the null hypothesis, so the additive model is preferred.
**(d)** Fit a model similar to that in **(c)**, but additionally add the interaction between `Income` and `Age` as well as a three-way interaction between `Age`, `Income`, and `Sex`. Use a statistical test to compare this model to the preferred model from **(c)** using a significance level of $\alpha = 0.10$. Which do you prefer?
```{r}
hosp_mult_regr_model2 = lm(Blood~Age+Education+Income+Sex+Sex:Age+Sex:Income+Income:Age+Age:Income:Sex,data = hosp_data)
pvalue = anova(hosp_mult_regr_model1,hosp_mult_regr_model2)[2,"Pr(>F)"]
```
Based on the p-value of `r pvalue` at a significance level of $\alpha = 0.10$, we fail to reject the smaller model, so the model from **(c)** is preferred.
**(e)** Using the model in **(d)**, give an estimate of the change in average `Blood` for a one-unit increase in `Age` for a highly educated, low income, male patient.
The change in average `Blood` per one-unit increase in `Age` is a slope, not a fitted value, so we should not simply call `predict()` at `Age = 1`. Because the model interacts `Age` with `Income` and `Sex`, the slope depends on the group; differencing two predictions one year apart isolates it:
```{r}
grp = data.frame(Age = c(30, 31), Education = "high", Income = "low", Sex = "male")
preds = predict(hosp_mult_regr_model2, newdata = grp)
estimate = preds[2] - preds[1]
```
The estimate of the change in average `Blood` for a one-unit increase in `Age` for a highly educated, low income, male patient is `r estimate`. (The model is linear in `Age` within each group, so this difference is the same at any starting age; 30 is used only for illustration.)
***
## Exercise 3 (Hospital SUPPORT Data, Stay Duration)
For this exercise, we will again use the data stored in [`hospital.csv`](hospital.csv). It contains a random sample of 580 seriously ill hospitalized patients from a famous study called "SUPPORT" (Study to Understand Prognoses Preferences Outcomes and Risks of Treatment). As the name suggests, the purpose of the study was to determine what factors affected or predicted outcomes, such as how long a patient remained in the hospital. The variables in the dataset are:
- `Days` - Days to death or hospital discharge
- `Age` - Age on day of hospital admission
- `Sex` - Female or male
- `Comorbidity` - Patient diagnosed with more than one chronic disease
- `EdYears` - Years of education
- `Education` - Education level; high or low
- `Income` - Income level; high or low
- `Charges` - Hospital charges, in dollars
- `Care` - Level of care required; high or low
- `Race` - Non-white or white
- `Pressure` - Blood pressure, in mmHg
- `Blood` - White blood cell count, in gm/dL
- `Rate` - Heart rate, in bpm
For this exercise, we will use `Blood`, `Pressure`, and `Rate` in an attempt to model `Days`. Essentially, we are attempting to model the time spent in the hospital using only health metrics measured at the hospital.
Consider the model
\[
Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1 x_3 + \beta_6 x_2 x_3 + \beta_7 x_1 x_2 x_3 + \epsilon,
\]
where
- $Y$ is `Days`
- $x_1$ is `Blood`
- $x_2$ is `Pressure`
- $x_3$ is `Rate`.
**(a)** Fit the model above. Also fit a smaller model using the provided `R` code.
```{r}
days_add_mdl = lm(Days ~ Blood + Pressure + Rate, data = hosp_data)
days_int_mdl = lm(Days ~ Blood * Pressure * Rate,data = hosp_data)
fstatistic = anova(days_add_mdl,days_int_mdl)[2,"F"]
pvalue = anova(days_add_mdl,days_int_mdl)[2,"Pr(>F)"]
```
Use a statistical test to compare the two models. Report the following:
- The null and alternative hypotheses in terms of the model given in the exercise description
The null hypothesis is
\[
H_0: \beta_4 = \beta_5 = \beta_6 = \beta_7 = 0
\]
The alternative hypothesis is that at least one of $\beta_4, \beta_5, \beta_6, \beta_7$ is not zero.
- The value of the test statistic is `r fstatistic`
- The p-value of the test is `r pvalue`
- A statistical decision using a significance level of $\alpha = 0.10$
The p-value `r pvalue` is less than $\alpha = 0.10$, so we reject the null hypothesis.
- Which model you prefer
The interaction model is preferred:
\[
Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1 x_3 + \beta_6 x_2 x_3 + \beta_7 x_1 x_2 x_3 + \epsilon
\]
**(b)** Give an expression based on the model in the exercise description for the true change in length of hospital stay in days for a 1 bpm increase in `Rate` for a patient with a `Pressure` of 139 mmHg and a `Blood` of 10 gm/dL. Your answer should be a linear function of the $\beta$s.
Differentiating the model with respect to $x_3$ (`Rate`) and substituting $x_1 = 10$ (`Blood`) and $x_2 = 139$ (`Pressure`) gives
\[
\frac{\partial Y}{\partial x_3} = \beta_3 + \beta_5 x_1 + \beta_6 x_2 + \beta_7 x_1 x_2 = \beta_3 + 10 \beta_5 + 139 \beta_6 + 1390 \beta_7
\]
This is the true change in length of hospital stay in days for a 1 bpm increase in `Rate` for a patient with a `Pressure` of 139 mmHg and a `Blood` of 10 gm/dL.
**(c)** Give an expression based on the additive model in part **(a)** for the true change in length of hospital stay in days for a 1 bpm increase in `Rate` for a patient with a `Pressure` of 139 mmHg and a `Blood` of 10 gm/dL. Your answer should be a linear function of the $\beta$s.
For the additive model $Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon$, the true change in length of hospital stay in days for a 1 bpm increase in `Rate` does not depend on `Pressure` or `Blood`; it is simply
\[
\beta_3
\]
***
## Exercise 4 ($t$-test Is a Linear Model)
In this exercise, we will try to convince ourselves that a two-sample $t$-test assuming equal variance is the same as a $t$-test for the coefficient in front of a single two-level factor variable (dummy variable) in a linear model.
First, we set up the data frame that we will use throughout.
```{r}
n = 30
sim_data = data.frame(
  groups = c(rep("A", n / 2), rep("B", n / 2)),
  values = rep(0, n))
str(sim_data)
```
We will use a total sample size of `30`, `15` for each group. The `groups` variable splits the data into two groups, `A` and `B`, which will be the grouping variable for the $t$-test and a factor variable in a regression. The `values` variable will store simulated data.
We will repeat the following process a number of times.
```{r}
set.seed(19830502)
sim_data$values = rnorm(n, mean = 42, sd = 3.5) # simulate response data
summary(lm(values ~ groups, data = sim_data))
t.test(values ~ groups, data = sim_data, var.equal = TRUE)
```
We use `lm()` to test
\[
H_0: \beta_1 = 0
\]
for the model
\[
Y = \beta_0 + \beta_1 x_1 + \epsilon
\]
where $Y$ is the values of interest, and $x_1$ is a dummy variable that splits the data in two. We will let `R` take care of the dummy variable.
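Concretely, with `A` as the reference level, `R` creates the dummy variable

\[
x_1 =
\begin{cases}
0, & \text{group } A \\
1, & \text{group } B
\end{cases}
\]

so that $\beta_0 = \mu_A$ and $\beta_1 = \mu_B - \mu_A$, and testing $H_0: \beta_1 = 0$ is equivalent to testing $H_0: \mu_A = \mu_B$.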
We use `t.test()` to test
\[
H_0: \mu_A = \mu_B
\]
where $\mu_A$ is the mean for the `A` group, and $\mu_B$ is the mean for the `B` group.
The following code sets up some variables for storage.
```{r}
num_sims = 300
lm_t = rep(0, num_sims)
lm_p = rep(0, num_sims)
tt_t = rep(0, num_sims)
tt_p = rep(0, num_sims)
```
- `lm_t` will store the test statistic for the test $H_0: \beta_1 = 0$.
- `lm_p` will store the p-value for the test $H_0: \beta_1 = 0$.
- `tt_t` will store the test statistic for the test $H_0: \mu_A = \mu_B$.
- `tt_p` will store the p-value for the test $H_0: \mu_A = \mu_B$.
The variable `num_sims` controls how many times we will repeat this process, which we have chosen to be `300`.
**(a)** Set a seed equal to your birthday. Then write code that repeats the above process `300` times. Each time, store the appropriate values in `lm_t`, `lm_p`, `tt_t`, and `tt_p`. Specifically, each time you should use `sim_data$values = rnorm(n, mean = 42, sd = 3.5)` to update the data. The grouping will always stay the same.
```{r}
set.seed(19830502)
for (i in 1:num_sims) {
  sim_data$values = rnorm(n, mean = 42, sd = 3.5)
  lm_t[i] = summary(lm(values ~ groups, data = sim_data))$coef[2, 3]
  lm_p[i] = summary(lm(values ~ groups, data = sim_data))$coef[2, 4]
  tt_t[i] = t.test(values ~ groups, data = sim_data, var.equal = TRUE)$statistic
  tt_p[i] = t.test(values ~ groups, data = sim_data, var.equal = TRUE)$p.value
}
```
**(b)** Report the value obtained by running `mean(lm_t == tt_t)`, which tells us what proportion of the test statistics is equal. The result may be extremely surprising!
```{r}
mean(lm_t == tt_t)
```
**(c)** Report the value obtained by running `mean(lm_p == tt_p)`, which tells us what proportion of the p-values is equal. The result may be extremely surprising!
```{r}
mean(lm_p == tt_p)
```
**(d)** If you have done everything correctly so far, your answers to the last two parts won't indicate the equivalence we want to show! What the heck is going on here? The first issue is one of using a computer to do calculations. When a computer checks for equality, it demands **equality**; nothing can be different. However, when a computer performs calculations, it can only do so with a certain level of precision. So, if we calculate two quantities we know to be analytically equal, they can differ numerically. Instead of `mean(lm_p == tt_p)` run `all.equal(lm_p, tt_p)`. This will perform a similar calculation, but with a very small error tolerance for each equality. What is the result of running this code? What does it mean?
```{r}
all.equal(lm_p, tt_p)
```
`all.equal()` returns `TRUE`: the p-values from the two tests are equal up to a small numerical tolerance, so the two procedures produce the same p-value.
**(e)** Your answer in **(d)** should now make much more sense. Then what is going on with the test statistics? Look at the values stored in `lm_t` and `tt_t`. What do you notice? Is there a relationship between the two? Can you explain why this is happening?
```{r}
all.equal(abs(lm_t), abs(tt_t))
all.equal(lm_t, -tt_t)
```
The test statistics are equal in magnitude but opposite in sign: `lm_t = -tt_t`. This happens because `t.test(values ~ groups)` computes its statistic from $\mu_A - \mu_B$, while the regression uses `A` as the reference level, so $\hat{\beta}_1$ estimates $\mu_B - \mu_A$. Apart from this sign convention, the two tests are identical.
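The sign flip can be demonstrated on a small, hypothetical dataset: with `A` as the reference level, the dummy coefficient estimates $\mu_B - \mu_A$, while `t.test()` bases its statistic on $\mu_A - \mu_B$.
```{r}
set.seed(1)
d = data.frame(g = factor(rep(c("A", "B"), each = 5)), y = rnorm(10))
# dummy coefficient: mean of group B minus mean of group A
coef(lm(y ~ g, data = d))["gB"]
mean(d$y[d$g == "B"]) - mean(d$y[d$g == "A"])
# t statistics: same magnitude, opposite sign
summary(lm(y ~ g, data = d))$coef["gB", "t value"]
t.test(y ~ g, data = d, var.equal = TRUE)$statistic
```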