## Module 13 Lab: Big Mac Index
\setstretch{1}
### Learning outcomes
* Given a research question involving two quantitative variables, construct the null and alternative hypotheses
in words and using appropriate statistical symbols.
* Assess the conditions to determine if theory-based or simulation-based methods should be used.
* Find, interpret, and evaluate the p-value for a hypothesis test for a slope or correlation.
* Create and interpret a confidence interval for a slope or correlation.
### Big Mac Index
Can the relative cost of a Big Mac in a country be used to predict that country's Gross Domestic Product (GDP) per person? The GDP per person and the adjusted dollar equivalent to purchase a Big Mac were recorded for a random sample of 55 countries in January of 2022. The cost of a Big Mac in each country was adjusted to US dollars based on current exchange rates. Is there evidence of a positive relationship between Big Mac cost and the GDP per person?
Upload and open the R script file for the Week 13 lab. Upload and import the csv file, `big_mac_adjusted_index_S22.csv`. Enter the name of the data set (see the environment tab) in place of `datasetname` in line 9 of the R script file. Highlight and run lines 1--9 to load the data.
```{r, echo = TRUE, fig.width=7,fig.height=6, collapse=FALSE, eval=FALSE}
# Read in data set
mac <- datasetname
```
#### Summarize and visualize the data {-}
To find the correlation between the variables `GDP_dollar` and `dollar_price`, highlight and run lines 13--16 in the R script file.
```{r, echo=TRUE, eval=FALSE}
mac %>%
select(c("GDP_dollar", "dollar_price")) %>%
cor(use="pairwise.complete.obs") %>%
round(3)
```
1. Report the value of the correlation between the variables.
\vspace{0.2in}
2. **Calculate the value of the coefficient of determination between `GDP_dollar` and `dollar_price`.**
\vspace{0.4in}
3. Interpret the value of the coefficient of determination in context of the problem.
\vspace{0.6in}
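In simple linear regression, the coefficient of determination is the square of the correlation. The sketch below checks that identity on a small made-up data set; the vectors `x` and `y` are hypothetical values chosen for illustration and are not the Big Mac data.

```{r, echo=TRUE, eval=FALSE}
# Hypothetical toy data (NOT the Big Mac data)
x <- c(1.5, 2.0, 2.8, 3.1, 4.0, 4.6, 5.2)
y <- c(3.0, 5.5, 6.1, 9.0, 10.2, 14.8, 15.1)

r <- cor(x, y)                 # Sample correlation
r_squared <- r^2               # Coefficient of determination
fit <- lm(y ~ x)               # Least squares line for the toy data

# Compare r^2 to the R-squared reported by the fitted model
all.equal(r_squared, summary(fit)$r.squared)
```

The same squaring step applies to the correlation reported in question 1.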
In the next part of the activity we will assess the linear model between Big Mac cost and GDP. Enter the variable `GDP_dollar` for `response` and the variable `dollar_price` for `explanatory` in line 22. Highlight and run lines 22--23 to get the linear model output.
```{r, echo=TRUE, collapse = FALSE, eval=FALSE}
# Fit linear model: y ~ x
bigmacLM <- lm(response~explanatory, data=mac)
summary(bigmacLM)$coefficients # Display coefficient summary
```
4. Give the value of the slope of the regression line. Interpret this value in context of the problem.
\vspace{0.6in}
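As a rough illustration of what a slope measures, the sketch below fits a least squares line to hypothetical toy data (the vectors `x` and `y` are assumptions, not the Big Mac data). It shows that the slope equals the correlation times the ratio of standard deviations, and that the predicted response changes by the slope for each one-unit increase in the explanatory variable.

```{r, echo=TRUE, eval=FALSE}
# Hypothetical toy data (NOT the Big Mac data)
x <- c(1.5, 2.0, 2.8, 3.1, 4.0, 4.6, 5.2)
y <- c(3.0, 5.5, 6.1, 9.0, 10.2, 14.8, 15.1)

fit <- lm(y ~ x)                         # Fit linear model: y ~ x
b1 <- unname(coef(fit)["x"])             # Estimated slope

# Slope written in terms of the correlation and standard deviations
b1_from_r <- cor(x, y) * sd(y) / sd(x)
all.equal(b1, b1_from_r)

# Predicted change in y when x increases by one unit (from 3 to 4)
pred <- predict(fit, newdata = data.frame(x = c(3, 4)))
slope_from_pred <- unname(pred[2] - pred[1])
all.equal(b1, slope_from_pred)
```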
#### Conditions for the least squares line {-}
Highlight and run lines 27--40 to produce the diagnostic plots needed to assess conditions to use theory-based methods. Use the scatterplot and the residual plots to assess the validity conditions for approximating the sampling distribution of the standardized statistic with the $t$-distribution.
```{r, echo=TRUE, eval=FALSE, warning=FALSE}
# Scatterplot
mac %>% # Pipe data set into...
  ggplot(aes(x = dollar_price, y = GDP_dollar)) + # Specify variables
  geom_point() + # Add scatterplot of points
  labs(x = "Big Mac Cost", # Label x-axis
       y = "GDP", # Label y-axis
       title = "Scatterplot of Big Mac Cost vs. GDP per person") + # Be sure to title your plots
  geom_smooth(method = "lm", se = FALSE) # Add regression line
# Diagnostic plots
bigmacLM <- lm(GDP_dollar~dollar_price, data = mac) # Fit linear regression model
par(mfrow=c(1,2)) # Set graphics parameters to plot 2 plots in 1 row
plot(bigmacLM, which=1) # Residual vs fitted values
hist(bigmacLM$residuals, xlab="Residuals", ylab="Frequency",
     main = "Histogram of Residuals") # Histogram of residuals
```
5. **Are the conditions met to use the $t$-distribution to approximate the sampling distribution of the standardized statistic? Justify your answer.**
\vspace{1.5in}
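When the conditions are judged reasonable, the standardized statistic for a slope is the estimated slope divided by its standard error, compared to a $t$-distribution with $n - 2$ degrees of freedom. The sketch below computes it by hand for hypothetical toy data (`x` and `y` are assumptions, not the Big Mac data) and checks it against the `t value` column of the coefficient summary.

```{r, echo=TRUE, eval=FALSE}
# Hypothetical toy data (NOT the Big Mac data)
x <- c(1.5, 2.0, 2.8, 3.1, 4.0, 4.6, 5.2)
y <- c(3.0, 5.5, 6.1, 9.0, 10.2, 14.8, 15.1)

fit <- lm(y ~ x)                          # Fit linear model: y ~ x
coefs <- summary(fit)$coefficients        # Coefficient summary table

b1 <- coefs["x", "Estimate"]              # Estimated slope
se_b1 <- coefs["x", "Std. Error"]         # Standard error of the slope
t_stat <- b1 / se_b1                      # Standardized statistic
df <- length(x) - 2                       # Degrees of freedom: n - 2

# One-sided ("greater") p-value from the t-distribution
p_greater <- pt(t_stat, df = df, lower.tail = FALSE)

all.equal(t_stat, coefs["x", "t value"])  # Matches the summary output
```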
\newpage
#### Ask a research question {-}
6. Write out the null and alternative hypotheses in notation to test *correlation* between Big Mac cost and country GDP.
| $H_0:$
| $H_A:$
#### Use statistical inferential methods to draw inferences from the data {-}
#### Hypothesis test {-}
Use the `regression_test()` function in R (in the `catstats` package) to simulate the null distribution of sample **correlations** and compute a p-value. We will need to enter the response variable name and the explanatory variable name for the formula, the data set name (identified above as `mac`), the summary measure used for the test, number of repetitions, the sample statistic (value of correlation), and the direction of the alternative hypothesis.
The response variable name is `GDP_dollar` and the explanatory variable name is `dollar_price`.
7. What inputs should be entered for each of the following to create the simulation to test correlation?
\vspace{.5 mm}
* Direction (`"greater"`, `"less"`, or `"two-sided"`):
\vspace{.2in}
* Summary measure (choose `"slope"` or `"correlation"`):
\vspace{.2in}
* As extreme as (enter the value for the sample correlation):
\vspace{0.2in}
* Number of repetitions:
\vspace{.2in}
Using the R script file for this activity, enter your answers for question 7 in place of the `xx`'s to produce the null distribution with 1000 simulations. Highlight and run lines 45--51. **Upload a copy of your plot showing the p-value to Gradescope for your group.**
```{r, echo=TRUE, eval=FALSE}
regression_test(GDP_dollar~dollar_price, # response ~ explanatory
data = mac, # Name of data set
direction = "xx", # Sign in alternative ("greater", "less", "two-sided")
summary_measure = "xx", # "slope" or "correlation"
as_extreme_as = xx, # Observed slope or correlation
number_repetitions = 1000) # Number of simulated samples for null distribution
```
8. Report the p-value from the R output.
\vspace{0.3in}
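To build intuition for a simulated null distribution, here is a minimal base R sketch of a generic permutation test for a correlation. It is not the `catstats` implementation and may differ in detail from what `regression_test()` does. The data are simulated (hypothetical), and the sketch assumes a `"greater"` alternative.

```{r, echo=TRUE, eval=FALSE}
set.seed(13)                              # For reproducibility

# Hypothetical simulated data (NOT the Big Mac data)
x <- runif(30, min = 2, max = 7)
y <- 10 + 5 * x + rnorm(30, sd = 8)

obs_r <- cor(x, y)                        # Observed sample correlation
n_reps <- 1000                            # Number of shuffles

# Shuffling y breaks any link between x and y, as the null hypothesis claims
null_r <- replicate(n_reps, cor(x, sample(y)))

# Proportion of shuffled correlations at least as large as the observed one
p_value <- mean(null_r >= obs_r)
p_value
```

For a `"less"` alternative the comparison would be `null_r <= obs_r`.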
#### Simulation-based confidence interval {-}
We will use the `regression_bootstrap_CI()` function in R (in the `catstats` package) to simulate the bootstrap distribution of sample **correlations** and calculate a confidence interval. Fill in the `xx`'s in the provided R script file to find a 90\% confidence interval. Highlight and run lines 56--60.
```{r, echo=TRUE, eval=FALSE}
regression_bootstrap_CI(GDP_dollar~dollar_price, # response ~ explanatory
data = mac, # Name of data set
confidence_level = xx, # Confidence level as decimal
summary_measure = "xx", # Slope or correlation
number_repetitions = 1000) # Number of simulated samples for bootstrap distribution
```
9. Report the bootstrap 90\% confidence interval in interval notation.
\vspace{0.5in}
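The idea behind a bootstrap percentile interval can be sketched in base R as follows. This is a generic illustration under stated assumptions, not the `catstats` implementation of `regression_bootstrap_CI()`. The data are simulated (hypothetical), and (x, y) pairs are resampled together with replacement.

```{r, echo=TRUE, eval=FALSE}
set.seed(13)                              # For reproducibility

# Hypothetical simulated data (NOT the Big Mac data)
x <- runif(30, min = 2, max = 7)
y <- 10 + 5 * x + rnorm(30, sd = 8)
n <- length(x)

# Resample rows with replacement and recompute the correlation each time
boot_r <- replicate(1000, {
  idx <- sample(n, replace = TRUE)        # Row indices for one bootstrap sample
  cor(x[idx], y[idx])
})

# 90% percentile interval: middle 90% of the bootstrap correlations
ci_90 <- unname(quantile(boot_r, probs = c(0.05, 0.95)))
ci_90
```

Changing `probs` to `c(0.025, 0.975)` would give a 95\% interval instead.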
#### Communicate the results and answer the research question {-}
10. Using a significance level of 0.1, what decision would you make?
\vspace{0.2in}
11. What type of error is possible?
\vspace{0.3in}
12. Interpret this error in context of the problem.
\vspace{0.8in}
13. Write a paragraph summarizing the results of the study as if you are reporting these results in your local newspaper. **Upload a copy of your paragraph to Gradescope for your group.** Be sure to describe:
* Summary statistic and interpretation
* P-value and interpretation
    * Statement about probability or proportion of samples
    * Statistic (summary measure and value)
    * Direction of the alternative
    * Null hypothesis (in context)
* Confidence interval and interpretation
    * How confident you are (e.g., 90%, 95%, 98%, 99%)
    * Parameter of interest
    * Calculated interval
    * Order of subtraction when comparing two groups
* Conclusion (written to answer the research question)
    * Amount of evidence
    * Parameter of interest
    * Direction of the alternative hypothesis
* Scope of inference
    * To what group of observational units do the results apply (target population or observational units similar to the sample)?
    * What type of inference is appropriate (causal or non-causal)?
\vspace{3in}
\newpage