---
title: "Week 7 - Homework"
author: "STAT 420, Summer 2018, BALAJI SATHYAMURTHY (BALAJIS2)"
date: ''
output:
  html_document:
    toc: yes
  pdf_document: default
urlcolor: cyan
---
***
```{r setup, echo = FALSE, message = FALSE, warning = FALSE}
options(scipen = 1, digits = 4, width = 80)
knitr::opts_chunk$set(fig.align = "center")
```
## Exercise 1 (EPA Emissions Data)
For this exercise, we will use the data stored in [`epa2015.csv`](epa2015.csv). It contains detailed descriptions of 4,411 vehicles manufactured in 2015 that were used for fuel economy testing [as performed by the Environmental Protection Agency](https://www3.epa.gov/otaq/tcldata.htm). The variables in the dataset are:
- `Make` - Manufacturer
- `Model` - Model of vehicle
- `ID` - Manufacturer defined vehicle identification number within EPA's computer system (not a VIN number)
- `disp` - Cubic inch displacement of test vehicle
- `type` - Car, truck, or both (for vehicles that meet specifications of both car and truck, like smaller SUVs or crossovers)
- `horse` - Rated horsepower, in foot-pounds per second
- `cyl` - Number of cylinders
- `lockup` - Vehicle has transmission lockup; N or Y
- `drive` - Drivetrain system code
- A = All-wheel drive
- F = Front-wheel drive
- P = Part-time 4-wheel drive
- R = Rear-wheel drive
- 4 = 4-wheel drive
- `weight` - Test weight, in pounds
- `axleratio` - Axle ratio
- `nvratio` - n/v ratio (engine speed versus vehicle speed at 50 mph)
- `THC` - Total hydrocarbons, in grams per mile (g/mi)
- `CO` - Carbon monoxide (a regulated pollutant), in g/mi
- `CO2` - Carbon dioxide (the primary byproduct of all fossil fuel combustion), in g/mi
- `mpg` - Fuel economy, in miles per gallon
We will attempt to model `CO2` using both `horse` and `type`. In practice, we would use many more predictors, but limiting ourselves to these two, one numeric and one factor, will allow us to create a number of plots.
Load the data, and check its structure using `str()`. Verify that `type` is a factor; if not, coerce it to be a factor.
```{r}
epa2015_data = read.csv("epa2015.csv")
str(epa2015_data)
# ensure `type` is a factor (read.csv may import it as character)
epa2015_data$type = as.factor(epa2015_data$type)
```
`type` is now a factor with levels `Both`, `Car`, and `Truck`.
**(a)** Do the following:
- Make a scatterplot of `CO2` versus `horse`. Use a different color point for each vehicle `type`.
```{r}
plot(CO2~horse,col=type,pch = as.numeric(type),cex = 2,data = epa2015_data)
legend("topright",c("Both","Car","Truck"),pch = c(1,2,3),col = c(1,2,3))
```
- Fit a simple linear regression model with `CO2` as the response and only `horse` as the predictor.
```{r}
epa_simple_model = lm(CO2~horse,data = epa2015_data)
```
- Add the fitted regression line to the scatterplot. Comment on how well this line models the data.
```{r}
plot(CO2~horse,col=type,pch = as.numeric(type),cex = 2,data = epa2015_data)
legend("topright",c("Both","Car","Truck"),pch = c(1,2,3),col = c(1,2,3))
abline(epa_simple_model,lwd=2)
```
This single line does not model the data well. Because the model ignores `type`, the points for `Both` and `Truck` are not separated from the rest; in particular, the `Truck` observations sit mostly above the fitted line.
- Give an estimate for the average change in `CO2` for a one foot-pound per second increase in `horse` for a vehicle of type `car`.
The estimate for the average change in `CO2` for a one foot-pound per second increase in `horse` for a vehicle of type `car` is `r summary(epa_simple_model)$coefficients["horse", "Estimate"]`. (This model ignores `type`, so the estimate is the same for every vehicle type.)
- Give a 90% prediction interval using this model for the `CO2` of a Subaru Impreza Wagon, which is a vehicle with 148 horsepower and is considered type `Both`. (Interestingly, the dataset gives the wrong drivetrain for most Subarus in this dataset, as they are almost all listed as `F`, when they are in fact all-wheel drive.)
```{r}
new_obs = data.frame(horse = 148)
pred_int = predict(epa_simple_model, newdata = new_obs, interval = "prediction", level = 0.90)
lwr = pred_int[1, "lwr"]
upr = pred_int[1, "upr"]
```
We are 90% confident that the `CO2` of a new vehicle with 148 horsepower will fall between `r lwr` and `r upr` g/mi. (This model cannot make use of the `Both` type information.)
**(b)** Do the following:
- Make a scatterplot of `CO2` versus `horse`. Use a different color point for each vehicle `type`.
```{r}
plot(CO2~horse,col=type,pch = as.numeric(type),cex = 2,data = epa2015_data)
legend("topright",c("Both","Car","Truck"),pch = c(1,2,3),col = c(1,2,3))
```
- Fit an additive multiple regression model with `CO2` as the response and `horse` and `type` as the predictors.
```{r}
epa_add_model = lm(CO2~horse+type,data = epa2015_data)
coef(epa_add_model)
int_both = coef(epa_add_model)[1]
int_car = coef(epa_add_model)[1] + coef(epa_add_model)[3]
int_truck = coef(epa_add_model)[1] + coef(epa_add_model)[4]
slope = coef(epa_add_model)[2]
```
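In terms of the coefficients above, the additive model fits one common slope with a separate intercept for each vehicle type (with `Both` as the reference level), which is exactly what the `int_*` and `slope` variables compute:

\[
\hat{\text{CO2}} =
\begin{cases}
\hat{\beta}_0 + \hat{\beta}_1 \cdot \text{horse}, & \text{type} = \text{Both} \\
(\hat{\beta}_0 + \hat{\beta}_2) + \hat{\beta}_1 \cdot \text{horse}, & \text{type} = \text{Car} \\
(\hat{\beta}_0 + \hat{\beta}_3) + \hat{\beta}_1 \cdot \text{horse}, & \text{type} = \text{Truck}
\end{cases}
\]

where $\hat{\beta}_2$ and $\hat{\beta}_3$ are the coefficients of the `typeCar` and `typeTruck` dummy variables.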
- Add the fitted regression "lines" to the scatterplot with the same colors as their respective points (one line for each vehicle type). Comment on how well this line models the data.
```{r}
plot(CO2~horse,col=type,pch = as.numeric(type),cex = 2,data = epa2015_data)
legend("topright",c("Both","Car","Truck"),pch = c(1,2,3),col = c(1,2,3))
abline(int_both,slope,col = 1,lwd = 2,lty = 1)
abline(int_car,slope,col = 2,lwd = 2,lty = 2)
abline(int_truck,slope,col = 3,lwd = 2,lty = 3)
```
This model is an improvement, but it still has issues: because all three lines are forced to share the same slope, the points for types `Both` and `Truck` are still not separated well, especially at higher horsepower.
- Give an estimate for the average change in `CO2` for a one foot-pound per second increase in `horse` for a vehicle of type `car`.
```{r}
coef(epa_add_model)
```
The estimate for the average change in `CO2` for a one foot-pound per second increase in `horse` for a vehicle of type `car` is `r coef(epa_add_model)[2]`. In the additive model this slope is shared by all three vehicle types.
- Give a 90% prediction interval using this model for the `CO2` of a Subaru Impreza Wagon, which is a vehicle with 148 horsepower and is considered type `Both`.
```{r}
new_obs = data.frame(horse = 148, type = "Both")
pred_int = predict(epa_add_model, newdata = new_obs, level = 0.90, interval = "prediction")
lwr = pred_int[1, "lwr"]
upr = pred_int[1, "upr"]
```
We are 90% confident that the `CO2` of a new vehicle of type `Both` with 148 horsepower will fall between `r lwr` and `r upr` g/mi.
**(c)** Do the following:
- Make a scatterplot of `CO2` versus `horse`. Use a different color point for each vehicle `type`.
```{r}
plot(CO2~horse,col=type,pch = as.numeric(type),cex = 2,data = epa2015_data)
legend("topright",c("Both","Car","Truck"),pch = c(1,2,3),col = c(1,2,3))
```
- Fit an interaction multiple regression model with `CO2` as the response and `horse` and `type` as the predictors.
```{r}
epa_int_model = lm(CO2~horse*type,data = epa2015_data)
coef(epa_int_model)
int_both = coef(epa_int_model)[1]
int_car = coef(epa_int_model)[1] + coef(epa_int_model)[3]
int_truck = coef(epa_int_model)[1] + coef(epa_int_model)[4]
slope_both = coef(epa_int_model)[2]
slope_car = coef(epa_int_model)[2] + coef(epa_int_model)[5]
slope_truck = coef(epa_int_model)[2] + coef(epa_int_model)[6]
```
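The interaction model allows each vehicle type its own intercept *and* its own slope, matching the `int_*` and `slope_*` variables above:

\[
\hat{\text{CO2}} =
\begin{cases}
\hat{\beta}_0 + \hat{\beta}_1 \cdot \text{horse}, & \text{type} = \text{Both} \\
(\hat{\beta}_0 + \hat{\beta}_2) + (\hat{\beta}_1 + \hat{\beta}_4) \cdot \text{horse}, & \text{type} = \text{Car} \\
(\hat{\beta}_0 + \hat{\beta}_3) + (\hat{\beta}_1 + \hat{\beta}_5) \cdot \text{horse}, & \text{type} = \text{Truck}
\end{cases}
\]

where $\hat{\beta}_4$ and $\hat{\beta}_5$ are the `horse:typeCar` and `horse:typeTruck` interaction coefficients.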
- Add the fitted regression "lines" to the scatterplot with the same colors as their respective points (one line for each vehicle type). Comment on how well this line models the data.
```{r}
plot(CO2~horse,col=type,pch = as.numeric(type),cex = 2,data = epa2015_data)
legend("topright",c("Both","Car","Truck"),pch = c(1,2,3),col = c(1,2,3))
abline(int_both,slope_both,col = 1,lwd = 2,lty = 1)
abline(int_car,slope_car,col = 2,lwd = 2,lty = 2)
abline(int_truck,slope_truck,col = 3,lwd = 2,lty = 3)
```
This model fits better than the other two: because each type has its own slope, the fitted lines for `Both` and `Truck` now follow their respective points well.
- Give an estimate for the average change in `CO2` for a one foot-pound per second increase in `horse` for a vehicle of type `car`.
```{r}
coef(epa_int_model)
```
The average change in `CO2` for a one foot-pound per second increase in `horse` for a vehicle of type `car` is `r coef(epa_int_model)[2] + coef(epa_int_model)[5]`, the sum of the `horse` slope and the `horse:typeCar` interaction coefficient.
- Give a 90% prediction interval using this model for the `CO2` of a Subaru Impreza Wagon, which is a vehicle with 148 horsepower and is considered type `Both`.
```{r}
new_obs = data.frame(horse = 148, type = "Both")
pred_int = predict(epa_int_model, newdata = new_obs, level = 0.90, interval = "prediction")
lwr = pred_int[1, "lwr"]
upr = pred_int[1, "upr"]
```
We are 90% confident that the `CO2` of a new vehicle of type `Both` with 148 horsepower will fall between `r lwr` and `r upr` g/mi.
**(d)** Based on the previous plots, you probably already have an opinion on the best model. Now use an ANOVA $F$-test to compare the additive and interaction models. Based on this test and a significance level of $\alpha = 0.10$, which model is preferred?
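Writing the interaction model as $\text{CO2} = \beta_0 + \beta_1 \cdot \text{horse} + \beta_2 \cdot \text{typeCar} + \beta_3 \cdot \text{typeTruck} + \beta_4 \cdot (\text{horse} \times \text{typeCar}) + \beta_5 \cdot (\text{horse} \times \text{typeTruck}) + \epsilon$, the ANOVA $F$-test compares

\[
H_0: \beta_4 = \beta_5 = 0
\]

against the alternative that at least one of $\beta_4, \beta_5$ is not zero.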
```{r}
fstatistic = anova(epa_add_model, epa_int_model)[2, "F"]
pvalue = anova(epa_add_model, epa_int_model)[2, "Pr(>F)"]
```
The test statistic is `r fstatistic` with a p-value of `r pvalue`. Since the p-value is less than $\alpha = 0.10$, we reject the null hypothesis (the additive model); the interaction model is preferred, in agreement with the plots above.
***
## Exercise 2 (Hospital SUPPORT Data, White Blood Cells)
For this exercise, we will use the data stored in [`hospital.csv`](hospital.csv). It contains a random sample of 580 seriously ill hospitalized patients from a famous study called "SUPPORT" (Study to Understand Prognoses Preferences Outcomes and Risks of Treatment). As the name suggests, the purpose of the study was to determine what factors affected or predicted outcomes, such as how long a patient remained in the hospital. The variables in the dataset are:
- `Days` - Days to death or hospital discharge
- `Age` - Age on day of hospital admission
- `Sex` - Female or male
- `Comorbidity` - Patient diagnosed with more than one chronic disease
- `EdYears` - Years of education
- `Education` - Education level; high or low
- `Income` - Income level; high or low
- `Charges` - Hospital charges, in dollars
- `Care` - Level of care required; high or low
- `Race` - Non-white or white
- `Pressure` - Blood pressure, in mmHg
- `Blood` - White blood cell count, in gm/dL
- `Rate` - Heart rate, in bpm
For this exercise, we will use `Age`, `Education`, `Income`, and `Sex` in an attempt to model `Blood`. Essentially, we are attempting to model white blood cell count using only demographic information.
**(a)** Load the data, and check its structure using `str()`. Verify that `Education`, `Income`, and `Sex` are factors; if not, coerce them to be factors. What are the levels of `Education`, `Income`, and `Sex`?
```{r}
hosp_data = read.csv("hospital.csv")
str(hosp_data)
# ensure `Education`, `Income`, and `Sex` are factors (read.csv may import them as character)
hosp_data$Education = as.factor(hosp_data$Education)
hosp_data$Income = as.factor(hosp_data$Income)
hosp_data$Sex = as.factor(hosp_data$Sex)
```
- The levels of `Education` are `r levels(hosp_data$Education)`
- The levels of `Income` are `r levels(hosp_data$Income)`
- The levels of `Sex` are `r levels(hosp_data$Sex)`
**(b)** Fit an additive multiple regression model with `Blood` as the response using `Age`, `Education`, `Income`, and `Sex` as predictors. What does `R` choose as the reference level for `Education`, `Income`, and `Sex`?
```{r}
hosp_add_model = lm(Blood~Age+Education+Income+Sex,data = hosp_data)
```
`R` chooses the first factor level (alphabetically) as the reference level:
- For `Education` it is `high`
- For `Income` it is `high`
- For `Sex` it is `female`
**(c)** Fit a multiple regression model with `Blood` as the response. Use the main effects of `Age`, `Education`, `Income`, and `Sex`, as well as the interaction of `Sex` with `Age` and the interaction of `Sex` and `Income`. Use a statistical test to compare this model to the additive model using a significance level of $\alpha = 0.10$. Which do you prefer?
```{r}
hosp_mult_regr_model1 = lm(Blood~Age+Education+Income+Sex+Sex:Age+Sex:Income,data = hosp_data)
pvalue = anova(hosp_add_model,hosp_mult_regr_model1)[2,"Pr(>F)"]
```
Based on the p-value of `r pvalue` at a significance level of $\alpha = 0.10$, we fail to reject the null hypothesis, so the additive model is preferred.
**(d)** Fit a model similar to that in **(c)**, but additionally add the interaction between `Income` and `Age` as well as a three-way interaction between `Age`, `Income`, and `Sex`. Use a statistical test to compare this model to the preferred model from **(c)** using a significance level of $\alpha = 0.10$. Which do you prefer?
```{r}
hosp_mult_regr_model2 = lm(Blood~Age+Education+Income+Sex+Sex:Age+Sex:Income+Income:Age+Age:Income:Sex,data = hosp_data)
pvalue = anova(hosp_mult_regr_model1,hosp_mult_regr_model2)[2,"Pr(>F)"]
```
Based on the p-value of `r pvalue` at a significance level of $\alpha = 0.10$, we fail to reject the smaller model, so the model from **(c)** is preferred.
**(e)** Using the model in **(d)**, give an estimate of the change in average `Blood` for a one-unit increase in `Age` for a highly educated, low income, male patient.
The change in average `Blood` per one-unit increase in `Age` is a slope, not a fitted value, so we should not simply call `predict()` at `Age = 1`. Because the model interacts `Age` with `Income` and `Sex`, the slope depends on the group; differencing two predictions one year apart isolates it:
```{r}
grp = data.frame(Age = c(30, 31), Education = "high", Income = "low", Sex = "male")
preds = predict(hosp_mult_regr_model2, newdata = grp)
estimate = preds[2] - preds[1]
```
The estimate of the change in average `Blood` for a one-unit increase in `Age` for a highly educated, low income, male patient is `r estimate`. (The model is linear in `Age` within each group, so this difference is the same at any starting age; 30 is used only for illustration.)
***
## Exercise 3 (Hospital SUPPORT Data, Stay Duration)
For this exercise, we will again use the data stored in [`hospital.csv`](hospital.csv). It contains a random sample of 580 seriously ill hospitalized patients from a famous study called "SUPPORT" (Study to Understand Prognoses Preferences Outcomes and Risks of Treatment). As the name suggests, the purpose of the study was to determine what factors affected or predicted outcomes, such as how long a patient remained in the hospital. The variables in the dataset are:
- `Days` - Days to death or hospital discharge
- `Age` - Age on day of hospital admission
- `Sex` - Female or male
- `Comorbidity` - Patient diagnosed with more than one chronic disease
- `EdYears` - Years of education
- `Education` - Education level; high or low
- `Income` - Income level; high or low
- `Charges` - Hospital charges, in dollars
- `Care` - Level of care required; high or low
- `Race` - Non-white or white
- `Pressure` - Blood pressure, in mmHg
- `Blood` - White blood cell count, in gm/dL
- `Rate` - Heart rate, in bpm
For this exercise, we will use `Blood`, `Pressure`, and `Rate` in an attempt to model `Days`. Essentially, we are attempting to model the time spent in the hospital using only health metrics measured at the hospital.
Consider the model
\[
Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1 x_3 + \beta_6 x_2 x_3 + \beta_7 x_1 x_2 x_3 + \epsilon,
\]
where
- $Y$ is `Days`
- $x_1$ is `Blood`
- $x_2$ is `Pressure`
- $x_3$ is `Rate`.
**(a)** Fit the model above. Also fit a smaller model using the provided `R` code.
```{r}
days_add_mdl = lm(Days ~ Blood + Pressure + Rate, data = hosp_data)
days_int_mdl = lm(Days ~ Blood * Pressure * Rate,data = hosp_data)
fstatistic = anova(days_add_mdl,days_int_mdl)[2,"F"]
pvalue = anova(days_add_mdl,days_int_mdl)[2,"Pr(>F)"]
```
Use a statistical test to compare the two models. Report the following:
- The null and alternative hypotheses in terms of the model given in the exercise description
The null hypothesis is
\[
H_0: \beta_4 = \beta_5 = \beta_6 = \beta_7 = 0
\]
The alternative hypothesis is that at least one of $\beta_4, \beta_5, \beta_6, \beta_7$ is not zero.
- The value of the test statistic is `r fstatistic`
- The p-value of the test is `r pvalue`
- A statistical decision using a significance level of $\alpha = 0.10$
The p-value `r pvalue` is less than $\alpha = 0.10$, so we reject the null hypothesis.
- Which model you prefer
The interaction model is preferred:
\[
Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1 x_3 + \beta_6 x_2 x_3 + \beta_7 x_1 x_2 x_3 + \epsilon
\]
**(b)** Give an expression based on the model in the exercise description for the true change in length of hospital stay in days for a 1 bpm increase in `Rate` for a patient with a `Pressure` of 139 mmHg and a `Blood` of 10 gm/dL. Your answer should be a linear function of the $\beta$s.
Differentiating the model with respect to $x_3$ (`Rate`) and substituting $x_1 = 10$ (`Blood`) and $x_2 = 139$ (`Pressure`) gives
\[
\frac{\partial Y}{\partial x_3} = \beta_3 + \beta_5 x_1 + \beta_6 x_2 + \beta_7 x_1 x_2 = \beta_3 + 10 \beta_5 + 139 \beta_6 + 1390 \beta_7
\]
This is the true change in length of hospital stay in days for a 1 bpm increase in `Rate` for a patient with a `Pressure` of 139 mmHg and a `Blood` of 10 gm/dL.
**(c)** Give an expression based on the additive model in part **(a)** for the true change in length of hospital stay in days for a 1 bpm increase in `Rate` for a patient with a `Pressure` of 139 mmHg and a `Blood` of 10 gm/dL. Your answer should be a linear function of the $\beta$s.
For the additive model $Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon$, the true change in length of hospital stay in days for a 1 bpm increase in `Rate` does not depend on `Pressure` or `Blood`; it is simply
\[
\beta_3
\]
***
## Exercise 4 ($t$-test Is a Linear Model)
In this exercise, we will try to convince ourselves that a two-sample $t$-test assuming equal variance is the same as a $t$-test for the coefficient in front of a single two-level factor variable (dummy variable) in a linear model.
First, we set up the data frame that we will use throughout.
```{r}
n = 30
sim_data = data.frame(
  groups = c(rep("A", n / 2), rep("B", n / 2)),
  values = rep(0, n))
str(sim_data)
```
We will use a total sample size of `30`, `15` for each group. The `groups` variable splits the data into two groups, `A` and `B`, which will be the grouping variable for the $t$-test and a factor variable in a regression. The `values` variable will store simulated data.
We will repeat the following process a number of times.
```{r}
set.seed(19830502)
sim_data$values = rnorm(n, mean = 42, sd = 3.5) # simulate response data
summary(lm(values ~ groups, data = sim_data))
t.test(values ~ groups, data = sim_data, var.equal = TRUE)
```
We use `lm()` to test
\[
H_0: \beta_1 = 0
\]
for the model
\[
Y = \beta_0 + \beta_1 x_1 + \epsilon
\]
where $Y$ is the values of interest, and $x_1$ is a dummy variable that splits the data in two. We will let `R` take care of the dummy variable.
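Concretely, with `A` as the reference level, `R` creates the dummy variable

\[
x_1 =
\begin{cases}
0, & \text{group } A \\
1, & \text{group } B
\end{cases}
\]

so that $\beta_0 = \mu_A$ and $\beta_1 = \mu_B - \mu_A$, and testing $H_0: \beta_1 = 0$ is equivalent to testing $H_0: \mu_A = \mu_B$.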
We use `t.test()` to test
\[
H_0: \mu_A = \mu_B
\]
where $\mu_A$ is the mean for the `A` group, and $\mu_B$ is the mean for the `B` group.
The following code sets up some variables for storage.
```{r}
num_sims = 300
lm_t = rep(0, num_sims)
lm_p = rep(0, num_sims)
tt_t = rep(0, num_sims)
tt_p = rep(0, num_sims)
```
- `lm_t` will store the test statistic for the test $H_0: \beta_1 = 0$.
- `lm_p` will store the p-value for the test $H_0: \beta_1 = 0$.
- `tt_t` will store the test statistic for the test $H_0: \mu_A = \mu_B$.
- `tt_p` will store the p-value for the test $H_0: \mu_A = \mu_B$.
The variable `num_sims` controls how many times we will repeat this process, which we have chosen to be `300`.
**(a)** Set a seed equal to your birthday. Then write code that repeats the above process `300` times. Each time, store the appropriate values in `lm_t`, `lm_p`, `tt_t`, and `tt_p`. Specifically, each time you should use `sim_data$values = rnorm(n, mean = 42, sd = 3.5)` to update the data. The grouping will always stay the same.
```{r}
set.seed(19830502)
for (i in 1:num_sims) {
  sim_data$values = rnorm(n, mean = 42, sd = 3.5)
  lm_t[i] = summary(lm(values ~ groups, data = sim_data))$coef[2, 3]
  lm_p[i] = summary(lm(values ~ groups, data = sim_data))$coef[2, 4]
  tt_t[i] = t.test(values ~ groups, data = sim_data, var.equal = TRUE)$statistic
  tt_p[i] = t.test(values ~ groups, data = sim_data, var.equal = TRUE)$p.value
}
```
**(b)** Report the value obtained by running `mean(lm_t == tt_t)`, which tells us what proportion of the test statistics is equal. The result may be extremely surprising!
```{r}
mean(lm_t == tt_t)
```
**(c)** Report the value obtained by running `mean(lm_p == tt_p)`, which tells us what proportion of the p-values is equal. The result may be extremely surprising!
```{r}
mean(lm_p == tt_p)
```
**(d)** If you have done everything correctly so far, your answers to the last two parts won't indicate the equivalence we want to show! What the heck is going on here? The first issue is one of using a computer to do calculations. When a computer checks for equality, it demands **equality**; nothing can be different. However, when a computer performs calculations, it can only do so with a certain level of precision. So, if we calculate two quantities we know to be analytically equal, they can differ numerically. Instead of `mean(lm_p == tt_p)` run `all.equal(lm_p, tt_p)`. This will perform a similar calculation, but with a very small error tolerance for each equality. What is the result of running this code? What does it mean?
```{r}
all.equal(lm_p, tt_p)
```
`all.equal()` returns `TRUE`: the p-values from the two tests are equal up to a small numerical tolerance, so the two procedures produce the same p-value.
**(e)** Your answer in **(d)** should now make much more sense. Then what is going on with the test statistics? Look at the values stored in `lm_t` and `tt_t`. What do you notice? Is there a relationship between the two? Can you explain why this is happening?
```{r}
all.equal(abs(lm_t), abs(tt_t))
all.equal(lm_t, -tt_t)
```
The test statistics are equal in magnitude but opposite in sign: `lm_t = -tt_t`. This happens because `t.test(values ~ groups)` computes its statistic from $\mu_A - \mu_B$, while the regression uses `A` as the reference level, so $\hat{\beta}_1$ estimates $\mu_B - \mu_A$. Apart from this sign convention, the two tests are identical.
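The sign flip can be demonstrated on a small, hypothetical dataset: with `A` as the reference level, the dummy coefficient estimates $\mu_B - \mu_A$, while `t.test()` bases its statistic on $\mu_A - \mu_B$.
```{r}
set.seed(1)
d = data.frame(g = factor(rep(c("A", "B"), each = 5)), y = rnorm(10))
# dummy coefficient: mean of group B minus mean of group A
coef(lm(y ~ g, data = d))["gB"]
mean(d$y[d$g == "B"]) - mean(d$y[d$g == "A"])
# t statistics: same magnitude, opposite sign
summary(lm(y ~ g, data = d))$coef["gB", "t value"]
t.test(y ~ g, data = d, var.equal = TRUE)$statistic
```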