# The Classical Linear Model
```{r}
```
## Learning Objectives
At the end of this week's learning, students will be able to:
1. **Describe** the assumptions of the classical linear model (sometimes referred to as the Gauss-Markov Assumptions) and what each assumption contributes to the estimator.
2. **Evaluate**, using empirical methods, whether each of the assumptions is likely to be true of the population data generating process.
3. **Assess** whether the guarantees provided by the classical linear model's requirements are likely to *ever* hold, including in data the student is likely to encounter.
## Class Announcements
- Lab 2 Deliverable and Dates
  - Research Proposal (Today)
  - Within-Team Review (Week 14)
  - Final Report (Week 14)
  - Final Presentation (Week 14)
## Roadmap
**Rearview Mirror**
- Statisticians create a population model to represent the world.
- The BLP is a useful summary for a relationship among random variables.
- OLS regression is an estimator for the Best Linear Predictor (BLP).
- For a large sample, we only need two mild assumptions to work with OLS:
  - to know that the coefficients are consistent, and
  - to have valid standard errors and hypothesis tests.
**Today**
- The Classical Linear Model (CLM) allows us to apply regression to smaller samples.
- The CLM requires more to be true of the data generating process to make coefficients, standard errors, and tests *meaningful* in small samples.
- Understanding whether the data meet these requirements (often called assumptions) requires considerable care.
**Looking Ahead**
- The CLM -- and the methods that we use to evaluate the CLM -- are the basis of more advanced models (*inter alia*, time series).
- (Week 13) In regression studies (and other studies), false discovery is a widespread problem. Understanding its causes can make you a better member of the scientific community.
## The Classical Linear Model
Comparing the Large Sample Model and the CLM
### Part 1
- We say that in small samples, more needs to be true of our data for OLS regression to "work."
- What do we mean when we say "work"?
- If our goals are descriptive, how is a "working" estimator useful?
- If our goals are explanatory, how is a "working" estimator useful?
- If our goals are predictive, are the requirements the same?
### Part 2
::: {.discussion-question}
- Suppose that you're interested in understanding how subsidized school meals benefit under-resourced students in the San Francisco East Bay region.
  - Using the tools from DATASCI 201, refine this question into a data science question.
- Suppose that there exist two possible data sources to answer the question you have formed:
  - A large amount (e.g. 10,000 data points) of individual-level data about income, nutrition, and test scores, self-reported by individual families who have opted in to the study.
  - A relatively smaller amount (e.g. 500 data points) of government data about school district characteristics, including district-level college achievement, county-level home prices, and state-level tax receipts.
- **What are the tradeoffs to using one or the other data source?**
:::
### Part 3
::: {.discussion-question}
- Suppose you elect to use the relatively larger sample of individual-level data.
  - Which of the large-sample assumptions do you expect are valid, and which are problematic?
- Or, suppose that you elect to use the relatively smaller sample of school-district-level data.
  - Which of the CLM assumptions do you expect are valid, and which do you expect are most problematic?
- **What was the research question that you identified?**
- **What would a successful answer accomplish?**
:::
### Part 4
::: {.discussion-question}
- **Which data source, the individual or the district-level, do you think is more likely to produce a successful answer?**
:::
### Part 5
Problems with the CLM Requirements
- There are five requirements for the CLM:
  1. IID Sampling
  2. Linear Conditional Expectation
  3. No Perfect Collinearity
  4. Homoskedastic Errors
  5. Normally Distributed Errors
- For each of these requirements:
  - Identify one **concrete** way that the data might not satisfy the requirement.
  - Identify what the consequence of failing to satisfy the requirement would be.
  - Identify a path forward to satisfy the requirement.
## R Exercise
```{r, message=FALSE}
library(tidyverse)
library(wooldridge)
library(car)
library(lmtest)
library(sandwich)
library(stargazer)
```
If you haven't used the `mtcars` dataset, you haven't been through an intro applied stats class!
In this analysis, we will use the `mtcars` dataset, which was extracted from the 1974 *Motor Trend* US magazine and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models). The dataset is automatically available when you start R.
For more information about the dataset, use the R command: `help(mtcars)`
```{r load-data, message=FALSE}
data(mtcars)
glimpse(mtcars)
```
### Questions:
0. Using the `mtcars` data, quickly reason about the variables that we're interested in studying:
```{r}
mtcars %>%
  ggplot() +
  aes(x = mpg) +
  geom_histogram(bins = 10)

mtcars %>%
  select(mpg, disp, hp, wt, drat) %>%
  pairs(pch = 19)
```
1. Using the `mtcars` data, run a linear regression to find the relationship between miles per gallon (`mpg`) on the left-hand-side as a function of displacement (`disp`), gross horsepower (`hp`), weight (`wt`), and rear axle ratio (`drat`) on the right-hand-side. That is, fit a regression of the following form:
$$
\widehat{mpg} = \hat{\beta}_{0} + \hat{\beta}_{1} disp + \hat{\beta}_{2} horse\_power + \hat{\beta}_{3} weight + \hat{\beta}_{4} drive\_ratio
$$
```{r}
```
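One minimal way to fit this model is with `lm()`; the object name `model_main` is our own choice, and it is reused in the sketches that follow.
```{r fit-main-model-sketch}
# Regress mpg on displacement, gross horsepower, weight, and rear axle ratio.
model_main <- lm(mpg ~ disp + hp + wt + drat, data = mtcars)
summary(model_main)
```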
2. For **each** of the following CLM assumptions, assess whether the assumption holds. Where possible, demonstrate multiple ways of assessing an assumption. When an assumption appears violated, state what steps you would take in response.
- I.I.D. data
- Linear conditional expectation
- No perfect collinearity
- Homoskedastic errors
- Normally distributed errors
```{r evaluate-iid}
# goal:
# consequence if violated:
```
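IID is mostly a question about how the data were collected rather than something a single test can confirm. As a small starting point (a sketch): the rows of `mtcars` are car models reviewed by *Motor Trend*, and several come from the same manufacturer, which already casts doubt on independence across observations.
```{r iid-sketch}
# Count models per manufacturer (first word of the row name); repeated
# manufacturers suggest dependence across rows of this convenience sample.
table(word(rownames(mtcars), 1))
```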
```{r evaluate-linear-conditional-expectation}
# goal:
# consequence if violated:
```
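One common diagnostic (a sketch, assuming `model_main` from the earlier sketch) is to plot residuals against fitted values and against individual predictors, looking for systematic curvature.
```{r linearity-sketch}
# Residuals vs. fitted values; curvature suggests a non-linear conditional
# expectation.
plot(model_main, which = 1)

# Residuals against one predictor (horsepower shown as an example).
mtcars %>%
  mutate(model_resid = resid(model_main)) %>%
  ggplot(aes(x = hp, y = model_resid)) +
  geom_point() +
  geom_smooth()
```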
```{r evaluate-no-perfect-collinearity}
# goal:
# consequence if violated:
```
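Perfect collinearity is comparatively easy to detect, because `lm()` drops an offending variable. As a sketch (again assuming `model_main`):
```{r perfect-collinearity-sketch}
# A dropped (NA) coefficient would signal perfect collinearity; alias() reports
# any exact linear dependencies among the predictors.
coef(model_main)
alias(model_main)
```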
```{r evaluate-homoskedastic-errors}
# goal:
# consequence if violated:
```
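Two common checks (a sketch, assuming `model_main`): the scale-location plot and the Breusch-Pagan test from `lmtest`.
```{r homoskedasticity-sketch}
# Scale-location plot: a trend in the spread of residuals suggests
# heteroskedasticity.
plot(model_main, which = 3)

# Breusch-Pagan test; a small p-value is evidence against homoskedasticity.
bptest(model_main)
```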
```{r evaluate-normally-distributed-errors}
# goal:
# consequence if violated:
```
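Two common checks (a sketch, assuming `model_main`): a Q-Q plot of the residuals and a Shapiro-Wilk test.
```{r normality-sketch}
# Q-Q plot of standardized residuals against normal quantiles.
plot(model_main, which = 2)

# Shapiro-Wilk test on the residuals; a small p-value is evidence against
# normality.
shapiro.test(resid(model_main))
```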
3. In addition to the above, assess to what extent (imperfect) collinearity is affecting your inference.
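A common way to gauge imperfect collinearity (a sketch, assuming `model_main`) is with variance inflation factors from `car`:
```{r vif-sketch}
# Variance inflation factors; values well above ~5 suggest that collinearity is
# substantially inflating the standard errors of the affected coefficients.
vif(model_main)
```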
4. Interpret the coefficient on horsepower.
5. Perform a hypothesis test to assess whether rear axle ratio has an effect on mpg. What assumptions need to be true for this hypothesis test to be informative? Are they?
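One way to carry out the test (a sketch, assuming `model_main`, and using heteroskedasticity-robust standard errors in case the homoskedasticity check above is unconvincing):
```{r drat-test-sketch}
# t-tests of each coefficient using HC-robust standard errors (sandwich +
# lmtest); the drat row addresses the rear-axle-ratio question.
coeftest(model_main, vcov = vcovHC(model_main))
```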
6. Choose variable transformations (if any) for each variable, and try to better meet the assumptions of the CLM (while also maintaining the readability of your model).
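As one illustration (not the only reasonable choice), logging `mpg` and `hp` is a candidate specification; the object name `model_logged` is our own.
```{r transformed-model-sketch}
# A candidate transformed specification: log fuel economy and log horsepower.
model_logged <- lm(log(mpg) ~ disp + log(hp) + wt + drat, data = mtcars)
summary(model_logged)
```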
7. (As time allows) report the results of both models in a nicely formatted regression table.
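A minimal sketch with `stargazer`, assuming `model_main` and `model_logged` from the sketches above:
```{r regression-table-sketch}
# Side-by-side text table of the untransformed and transformed models.
stargazer(model_main, model_logged, type = "text")
```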