```{r, echo=FALSE, message=FALSE}
library(tidyverse)
library(magrittr)
library(ggthemes)
library(patchwork)
library(data.table)
```
# Comparing Two Groups
![](./images/goin_left.jpeg)
## Learning Objectives
This week, we introduce the idea of comparing two groups to evaluate whether the data we have sampled lead us to believe that the population distributions of the random variables are different. Of course, because we don't get access to the function that describes a random variable, we can't *actually* know whether the populations are different. It is for this reason that we call it statistical inference -- we are inferring from a sample some belief about the populations.
At the conclusion of this week, students will be able to:
1. *Recognize* the similarities between all frequentist hypothesis tests.
2. *Evaluate* the conditions that surround the data, and choose a test that is best-powered and justifiable.
3. *Perform* and *Interpret* the results of the most common statistical tests.
## Class Announcements
- Great work completing your final w203 test!
- Unit 7 Homework is Group Homework, due next week.
- The Hypothesis Testing Lab is released today!
- The lab is due at the Unit 09 live session (two weeks from now): apply these fundamentals to analyze 2022 election data and write a single, three-page analysis
## Roadmap
### Rearview Mirror
- Statisticians create a population model to represent the world
- A population model has parameters we are interested in
- Ex: A parameter might represent the effect that a vitamin has on test performance
- A null hypothesis is a specific statement about a parameter
- Ex: The vitamin has zero effect on performance
- A hypothesis test is a procedure for rejecting or not rejecting a null, such that the probability of a type I error is constrained.
### Today
- There are often multiple hypothesis tests you can apply to a scenario.
- Our primary concern is choosing a test with assumptions we can defend.
- Secondarily, we want to maximize power.
### Looking ahead
- Next week, we start working with models for linear regression
- We will see how hypothesis testing is also used for regression parameters.
## Teamwork Discussion
### Working on Data Science Teams
Data science is a *beautiful* combination of teamwork and individual work. It provides the opportunity to collaborate on a data pipeline with people from all over the organization, and to deal with technical, statistical, and social questions that are always interesting. While we expect everyone on a team to be a professional, there is so much range within the pursuit of data science that projects are nearly always collaborative exercises.
Together as teams, we
- Define research ambitions and scope
- Imagine/envision the landscape of what is possible
- Support, discuss, review and integrate individual contributions
As individuals, we conduct the heads-down work that moves question-answering forward. This might be reading papers to determine the most appropriate method to bring to bear on the question, researching the data that is available, or understanding the technical requirements we have to meet for this answer to be useful to our organization in real time.
What is your instructor *uniquely* capable of? Let them tell you! But, at the same time, what would they acknowledge that others do better than they do?
The thing is, because there is so much that has to be done, there are very, very few people who are one-stop data science shops. Instead, teams rely on collaboration and joint expertise in order to get good work done.
### The Problematic Psychology of Data Science
People talk about the *impostor syndrome*: a feeling of inadequacy or interloping that is sometimes also associated with a fear of under-performing relative to the expectations of others on the team. These emotions are common throughout data science, academia, and computer science. But these types of emotions are also commonplace in journalism, film-making, and public speaking.
Has anybody ever had the dream that they're late to a test? Or that they've got to give a speech they're unprepared for? Does anybody remember playing an instrument as a kid and having to go to recitals? Or playing for a championship on a youth sports team? Or walking into a big test?
What are the feelings associated with those events? What might be generating these feelings?
![](images/among_us.jpeg){width=25%}
### What Makes an Effective Team?
- This reading on [*effective* teams](https://rework.withgoogle.com/print/guides/5721312655835136/) summarizes academic research to argue:
> What really matters to creating an effective team is less about who is on the team, and more about how the team works together.
In your live session, your section might take 7 minutes to read this brief. If so, please read the following sections:
- The problem statement;
- The proposed solution;
- The framework for team effectiveness, stopping at the section titled *"Tool: Help teams determine their own needs."*
> "Psychological safety refers to an individual's perception of the consequences of taking an interpersonal risk. It is a belief that a team is safe for risk taking in the face of being seen as ignorant, incompetent, negative, or disruptive."
>
> "In a team with high psychological safety, teammates feel safe to take risks around their team members. They feel confident that no one on the team will embarrass or punish anyone else for admitting a mistake, asking a question, or offering a new idea."
### We All Belong
- From your experience, can you give an example of taking a personal risk as part of a team?
- Can you describe your emotions when contemplating this risk?
- If you did take the risk, how did the reactions of your teammates affect you?
- Knowing the circumstances that generate feelings of anxiety -- what steps can we take as a section, or a team, to recognize and respond to these circumstances?
> How can you add to the psychological safety of your peers in this section and of your lab teammates?
## Team Kick-Off
**Lab 1 Teams**
- Here are teams for Lab 1!
**Team Kick-Off Conversation**
- In a 10 minute breakout with your team, please discuss the following questions:
1. How much time will you invest in the lab each week?
2. How often will you meet and for how long?
3. How will you discuss, review, and integrate individual work into the team deliverable?
4. What do you see as the biggest risks when working on a team? How can you contribute to an effective team dynamic?
## A Quick Review
**Review of Key Terms**
- Define each of the following:
- Population Parameter
- Null Hypothesis
- Test Statistic
- Null Distribution
**Comparing Groups Review**
Take a moment to recall the tests you learned this week. Here is a quick cheat sheet of their key assumptions.
| paired/unpaired | parametric | non-parametric |
|-----------------|---------------------------------------------------------------------------------|---------------------------------------------------------|
| unpaired | **unpaired t-test** <br> - metric variable <br> - i.i.d. <br> - (not too un-)normal | **Wilcoxon rank-sum** <br> - ordinal variable <br> - i.i.d. |
| paired | **paired t-test** <br> - metric variable <br> - i.i.d. <br> - (not too un-)normal | **Wilcoxon signed-rank** <br> - metric variable <br> - i.i.d. <br> - differences are symmetric <br><br> **sign test** <br> - ordinal variable <br> - i.i.d. |
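For reference, these map onto a small set of R calls. A minimal sketch on made-up vectors (`x` and `y` here are simulated, not GSS variables):
```{r cheat sheet calls}
x <- rnorm(30)
y <- rnorm(30)

t.test(x, y)                            # unpaired t-test
wilcox.test(x, y)                       # Wilcoxon rank-sum
t.test(x, y, paired = TRUE)             # paired t-test
wilcox.test(x, y, paired = TRUE)        # Wilcoxon signed-rank
binom.test(sum(x > y), sum(x != y))     # sign test on the paired differences
```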
## Rank-Based Tests
Darrin Speegle has a nice walk-through of rank-based testing procedures, linked [here](https://bookdown.org/speegled/foundations-of-statistics/RBT.html#two-sample-test). We'll talk through a few examples of this, and then move on to running these tests against data for the class.
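Before clicking through, it is worth seeing how little machinery a rank-sum test needs. A small sketch on made-up data, checking a hand computation against `wilcox.test`:
```{r rank sum by hand}
## pool both groups, rank the pooled values, and sum the ranks
## that belong to group one
group_one <- c(1.2, 3.4, 2.2, 5.1)
group_two <- c(0.8, 1.9, 2.0)

pooled_ranks <- rank(c(group_one, group_two))
rank_sum     <- sum(pooled_ranks[seq_along(group_one)])

## R reports W: the rank sum minus its smallest possible value, n1 * (n1 + 1) / 2
rank_sum - length(group_one) * (length(group_one) + 1) / 2
wilcox.test(group_one, group_two)$statistic
```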
## Comparing Groups R Exercise
The General Social Survey (GSS) is one of the longest-running and most extensive survey projects in the US. The full dataset includes over 1000 variables spanning demographics, attitudes, and behaviors. The file `GSS_w203.RData` contains a small selection of variables from the 2018 GSS.
To learn about each variable, you can enter it into the search bar at the [GSS data explorer](https://gssdataexplorer.norc.org/variables/vfilter)
```{r}
load('data/GSS_w203.RData')
summary(GSS)
```
You have a set of questions that you would like to answer with a statistical test. **For each question**:
1. Choose the most appropriate test.
2. List and evaluate the assumptions for your test.
3. Conduct your test.
4. Discuss statistical and practical significance.
## The Questions
### Set 1
- Do economics majors watch more or less TV than computer science majors?
```{r plot relationship}
GSS %>%
filter(major1 %in% c('computer science', 'economics')) %>%
ggplot() +
aes(x = tvhours, fill = major1) +
geom_histogram(bins = 10, position = 'dodge')
```
What kinds of tests *could* be reasonable to conduct? For what part of the data would we conduct these tests?
```{r}
## The assumptions about the data drive us to the correct test.
## But, let's ask all the tests that could *possibly* make sense, and see how
## matching or mis-matching assumptions changes what we learn.
## Answers are in the next chunk... but don't jump to them right away.
```
```{r echo=FALSE, eval=FALSE}
GSS %>%
filter(major1 %in% c('computer science', 'economics')) %$%
t.test(tvhours ~ major1)
GSS %>%
filter(major1 %in% c('computer science', 'economics')) %$%
wilcox.test(tvhours ~ major1)
GSS %>%
filter(major1 %in% c('computer science', 'economics')) %>%
mutate(tv_category = case_when(
tvhours < 2 ~ 1,
tvhours >= 2 & tvhours < 5 ~ 2,
tvhours >= 5 ~ 3)) %$%
# t.test(tv_category ~ major1)
wilcox.test(tv_category ~ major1)
```
- Do Americans with pets watch more or less TV than Americans without pets?
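We could take the same approach here. A sketch only: this chunk assumes the extract includes a yes/no pet indicator named `pets`, which is a guess -- check `summary(GSS)` for the actual variable name.
```{r pets and tv, eval=FALSE}
## sketch: `pets` is an assumed variable name, not confirmed in this extract
GSS %>% 
  filter(pets %in% c('yes', 'no')) %$% 
  wilcox.test(tvhours ~ pets)
```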
### Set 2
- Do Americans spend more time emailing or using the web?
```{r}
## each respondent answers both questions, so the data are paired:
## drop rows missing either response, then run a paired t-test
GSS %>% 
  select(wwwhr, emailhr) %>% 
  drop_na() %$% 
  t.test(x = wwwhr, y = emailhr, paired = TRUE)

## look at the two distributions side by side
GSS %>% 
  ggplot() + 
  geom_histogram(aes(x = wwwhr), fill = 'darkblue', alpha = 0.5) + 
  geom_histogram(aes(x = emailhr), fill = 'darkred', alpha = 0.5)

## for comparison, the same test run as though the samples were
## independent -- notice how ignoring the pairing changes the result
t.test(
  x = GSS$wwwhr, 
  y = GSS$emailhr,
  paired = FALSE
)
```
- Do Americans spend more evenings with neighbors or with relatives?
```{r start by building data}
wilcox_test_data <- GSS %>% 
  select(socrel, socommun) %>% 
  mutate(
    family_ordered = factor(
      x = socrel, 
      levels = c('almost daily', 'sev times a week', 
                 'sev times a mnth', 'once a month', 
                 'sev times a year', 'once a year', 'never')),
    ## socommun records evenings spent with neighbors
    neighbors_ordered = factor(
      x = socommun, 
      levels = c('almost daily', 'sev times a week', 
                 'sev times a mnth', 'once a month', 
                 'sev times a year', 'once a year', 'never')))
```
To begin this investigation, we've got to look at the data and see what is in it. If you look below, you'll note that it sure seems that people are spending more evenings with their family... erp, actually no. They're chatting with the neighbors rather than taking their mother out to dinner.
```{r produce descriptive plot}
wilcox_test_data %>% 
  select(neighbors_ordered, family_ordered) %>% 
  rename(
    Neighbors = neighbors_ordered,
    Family    = family_ordered
  ) %>% 
  drop_na() %>% 
  pivot_longer(cols = c(Neighbors, Family)) %>% 
  ggplot() + 
  aes(x = value, fill = name) + 
  geom_bar(position = 'dodge', alpha = 0.7) + 
  scale_fill_manual(values = c('#003262', '#FDB515')) + 
  labs(
    title    = 'Do Americans Spend More Evenings With Neighbors or Family?', 
    subtitle = 'A cutting analysis.', 
    fill     = 'Neighbors or Family', 
    x        = 'Amount of Time Spent') + 
  scale_x_discrete(guide = guide_axis(n.dodge = 2)) + 
  theme_minimal()
```
With this plot created, we can ask whether what we observe is just the product of sampling error, or whether it would have been unlikely to arise if the null hypothesis were true. What is the null hypothesis? Well, let's suppose that if we didn't know anything about the data, we would expect there to be no difference between the amount of time spent with neighbors or family.
```{r run the appropriate test}
## risky choice -- casting the factor to a numeric without checking what happens.
## also note: these are paired responses, but we run an unpaired test here.
wilcox_test_data %$% 
  wilcox.test(
    x = as.numeric(family_ordered), 
    y = as.numeric(neighbors_ordered), 
    paired = FALSE
  )
```
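One more option: because each respondent answers both questions, these are paired, ordinal responses, and the cheat sheet above points to a sign test. A minimal sketch (recall that levels run from 'almost daily' down to 'never', so a smaller numeric code means more frequent):
```{r sign test alternative}
sign_test_data <- wilcox_test_data %>% 
  drop_na(family_ordered, neighbors_ordered) %>% 
  mutate(
    family_num    = as.numeric(family_ordered),
    neighbors_num = as.numeric(neighbors_ordered))

## under the null, family outranks neighbors in half of the non-tied pairs
binom.test(
  x = sum(sign_test_data$family_num < sign_test_data$neighbors_num), 
  n = sum(sign_test_data$family_num != sign_test_data$neighbors_num))
```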
### Set 3
- Are Americans who own guns more likely to have pets than Americans who don't? (A sketch for this one follows the list.)
- Are Americans with pets happier than Americans without pets?
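For the first question, both variables are binary, which suggests a two-sample test of proportions. A sketch, assuming gun ownership is coded in `owngun` and that the extract includes a hypothetical yes/no `pets` indicator:
```{r guns and pets, eval=FALSE}
## sketch: `pets` is an assumed variable name
gun_pet_table <- GSS %>% 
  filter(owngun %in% c('yes', 'no'), pets %in% c('yes', 'no')) %>% 
  droplevels() %$% 
  table(owngun, pets)

## prop.test treats the rows as groups and compares the share
## falling in the first column between them
prop.test(gun_pet_table)
```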
### Apply to a New Type of Data
- Is there a relationship between college major and gun ownership? (One possible approach is sketched after the empty chunk below.)
```{r}
```
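Both variables here are categorical with unordered levels, so none of the tests in the cheat sheet fits neatly; a chi-squared test of independence is a natural choice. A sketch, again assuming gun ownership is coded in `owngun` -- left unevaluated so you can work the empty chunk above first:
```{r major and gun ownership, eval=FALSE}
major_gun_table <- GSS %>% 
  filter(owngun %in% c('yes', 'no')) %>% 
  droplevels() %$% 
  table(major1, owngun)

chisq.test(major_gun_table)
```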
## Simulating the Effects of Test Choices
```{r set colors}
theme_set(theme_minimal())
berkeley_blue <- '#003262'
berkeley_gold <- '#FDB515'
berkeley_sather <- '#B9D3B6'
```
### Should we use a t-test or a Wilcoxon rank-sum?
There is some open discussion in the applied statistics literature about whether we should *ever* be using a t-test. In particular, if the underlying distribution that generates the data is **not** normal, then the assumptions of a t-test are not, technically, satisfied, and the test does not produce results that have nominal p-value coverage. This is both *technically* and *theoretically* true; and yet researchers, data scientists, your instructors, and the entire world run t-tests as the "test of first recourse."
What is the alternative to conducting a t-test as the test of first recourse? It might be the Wilcoxon family of tests, which replace the normality assumption with weaker ones: symmetry around the median for the signed-rank test, and no distributional shape assumption at all for the rank-sum test.
Additional points of argument, which you will investigate in this worksheet:
- If the underlying data **is** normal, then the Wilcoxon test is *nearly* as well powered as the t-test.
- If the underlying data **is not** normal, then the Wilcoxon test still maintains nominal p-value coverage, whereas the t-test might lose this guarantee.
### The Poisson Distribution
The Poisson distribution has the following PMF:
$$
f_X(x) = \frac{\lambda^x e^{-\lambda}}{x!}
$$
The key parameter of a Poisson distribution is $\lambda$, which is both its mean and its variance; we show three different distributions, setting this parameter to 1, 3, and 30 respectively. Notice that the limits on these plots are not set to be the same; for example, the range in the third plot is considerably larger than in the first.
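As a quick check on the formula, compute $P(X = 2)$ for $\lambda = 3$ by hand and compare it to R's built-in `dpois`:
```{r check the poisson pmf}
lambda <- 3
(lambda^2 * exp(-lambda)) / factorial(2)   # by the formula above
dpois(x = 2, lambda = lambda)              # by R's built-in PMF
```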
```{r show some from poisson distribution}
pois_lambda_1 <- rpois(n=1000, lambda=1)
pois_lambda_3 <- rpois(n=1000, lambda=3)
pois_lambda_30 <- rpois(n=1000, lambda=30)
plot_1 <- ggplot() + aes(x=pois_lambda_1) + geom_histogram(bins=6, fill = berkeley_blue)
plot_3 <- ggplot() + aes(x=pois_lambda_3) + geom_histogram(bins=10, fill = berkeley_gold)
plot_30 <- ggplot() + aes(x=pois_lambda_30) + geom_histogram(bins=30, fill = berkeley_sather)
plot_1 / plot_3 / plot_30
```
What does this changing distribution do to the p-values?
### Write a Simulation
```{r write poisson simulation}
pois_sim <- function(num_sims=10000, num_observations, lambda_one, lambda_two) {
  t_test_result <- rep(NA, num_sims)
  wilcox_result <- rep(NA, num_sims)
  
  for(i in 1:num_sims) {
    ## draw two independent poisson samples, then record both tests' p-values
    group_one <- rpois(n=num_observations, lambda=lambda_one)
    group_two <- rpois(n=num_observations, lambda=lambda_two)
    
    t_test_result[i] <- t.test(group_one, group_two)$p.value
    wilcox_result[i] <- wilcox.test(x=group_one, y=group_two)$p.value
  }
  
  df <- data.table(
    p_value = c(t_test_result, wilcox_result), 
    test = rep(c('t_test', 'wilcox_test'), each = num_sims)
  )
  return(df)
}
```
```{r run simulation, warning=FALSE}
foo <- pois_sim(num_observations = 20, lambda_one = 1, lambda_two = 2)
```
```{r plot results of simulation}
foo %>%
ggplot() +
geom_density(aes(x=p_value, color = test)) +
scale_color_manual(values = c(berkeley_blue, berkeley_gold))
```
And so, the simulation rejects the null (at $\alpha = 0.05$) at the following rates:
- For the t-test, at a rate of `r foo[test=='t_test', mean(p_value < 0.05)]`
- For the Wilcoxon test, at a rate of `r foo[test=='wilcox_test', mean(p_value < 0.05)]`
### What if a distribution is much *more* skewed?
```{r}
skewed_sim <- function(num_sims=1000, num_observations, alpha_1, beta_1, alpha_2, beta_2) {
  t_test_result <- rep(NA, num_sims)
  wilcox_result <- rep(NA, num_sims)
  
  for(i in 1:num_sims) {
    ## draw two (skewed) beta-distributed samples, then record both tests' p-values
    group_one <- rbeta(n=num_observations, shape1 = alpha_1, shape2 = beta_1)
    group_two <- rbeta(n=num_observations, shape1 = alpha_2, shape2 = beta_2)
    
    t_test_result[i] <- t.test(group_one, group_two)$p.value
    wilcox_result[i] <- wilcox.test(x=group_one, y=group_two)$p.value
  }
  
  dt <- data.table(
    p_value = c(t_test_result, wilcox_result), 
    test = rep(c('t_test', 'wilcox_test'), each = num_sims)
  )
  return(dt)
}
```
### False Rejection Rates
Start with both groups drawn from the same distribution, with parameters `alpha=2`, `beta=7`. Because the null is true by construction, every rejection is a false rejection.
```{r}
ggplot(data.frame(x=c(0,1)), aes(x)) +
stat_function(fun = dbeta, n=100, args=list(shape1=2, shape2=7))
```
```{r}
same_dist_small_data <- skewed_sim(
num_observations=10,
alpha_1=2, beta_1=7,
alpha_2=2, beta_2=7
)
same_dist_med_data <- skewed_sim(
num_observations=50,
alpha_1=2, beta_1=7,
alpha_2=2, beta_2=7
)
same_dist_big_data <- skewed_sim( # haha, "big data"
num_observations=100,
alpha_1=2, beta_1=7,
alpha_2=2, beta_2=7
)
```
```{r}
plot_1 <- same_dist_small_data %>%
ggplot() +
geom_density(aes(x=p_value, color = test), bounds=c(0,1)) +
scale_color_manual(values = c(berkeley_blue, berkeley_gold))
plot_2 <- same_dist_med_data %>%
ggplot() +
geom_density(aes(x=p_value, color = test), bounds=c(0,1)) +
scale_color_manual(values = c(berkeley_blue, berkeley_gold))
plot_3 <- same_dist_big_data %>%
ggplot() +
geom_density(aes(x=p_value, color = test), bounds=c(0,1)) +
scale_color_manual(values = c(berkeley_blue, berkeley_gold))
plot_1 / plot_2 / plot_3
```
Under a true null, a well-calibrated test should reject about 5% of the time. The false rejection rates, for n = 10, 50, and 100 respectively:
- T-tests
    - `r same_dist_small_data[test=='t_test', mean(p_value < 0.05)]`
    - `r same_dist_med_data[test=='t_test', mean(p_value < 0.05)]`
    - `r same_dist_big_data[test=='t_test', mean(p_value < 0.05)]`
- Wilcoxon Tests
    - `r same_dist_small_data[test=='wilcox_test', mean(p_value < 0.05)]`
    - `r same_dist_med_data[test=='wilcox_test', mean(p_value < 0.05)]`
    - `r same_dist_big_data[test=='wilcox_test', mean(p_value < 0.05)]`
### What About Power to Reject?
Now draw the two groups from *different* distributions -- Beta(2, 7) versus Beta(2, 5) -- so the null is false and rejecting it is the correct decision. The question becomes which test rejects more often.
```{r}
small_diff_small_data <- skewed_sim(
num_observations=10,
alpha_1=2, beta_1=7,
alpha_2=2, beta_2=5
)
small_diff_med_data <- skewed_sim(
num_observations=50,
alpha_1=2, beta_1=7,
alpha_2=2, beta_2=5
)
small_diff_big_data <- skewed_sim( # haha, "big data"
num_observations=100,
alpha_1=2, beta_1=7,
alpha_2=2, beta_2=5
)
```
```{r}
plot_1 <- small_diff_small_data %>%
ggplot() +
geom_density(aes(x=p_value, color = test), bounds=c(0,1)) +
scale_color_manual(values = c(berkeley_blue, berkeley_gold))
plot_2 <- small_diff_med_data %>%
ggplot() +
geom_density(aes(x=p_value, color = test), bounds=c(0,1)) +
scale_color_manual(values = c(berkeley_blue, berkeley_gold))
plot_3 <- small_diff_big_data %>%
ggplot() +
geom_density(aes(x=p_value, color = test), bounds=c(0,1)) +
scale_color_manual(values = c(berkeley_blue, berkeley_gold))
plot_1 / plot_2 / plot_3
```
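Mirroring the false-rejection summaries above, we can compute each test's rejection rate at each sample size -- with a true difference between the groups, this rate is the test's power:
```{r power summaries}
## share of simulations rejecting at alpha = 0.05, by test and sample size
rbind(
  small_diff_small_data[, .(n = 10,  power = mean(p_value < 0.05)), by = test],
  small_diff_med_data[,   .(n = 50,  power = mean(p_value < 0.05)), by = test],
  small_diff_big_data[,   .(n = 100, power = mean(p_value < 0.05)), by = test]
)
```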
### Paired compared to unpaired tests
```{r write the paired simulation}
paired_sim <- function(num_sims=10000, num_observations, mean_one, mean_two, paired_diff, sd_one, sd_two) {
unpaired_test_unpaired_data <- rep(NA, num_sims)
unpaired_test_paired_data <- rep(NA, num_sims)
paired_test_unpaired_data <- rep(NA, num_sims)
paired_test_paired_data <- rep(NA, num_sims)
for(i in 1:num_sims) {
observation_a1 <- rnorm(n = num_observations, mean = mean_one, sd = sd_one)
## first create unpaired data
observation_b <- rnorm(n = num_observations, mean = mean_two, sd = sd_two)
## then, create paired data
observation_a2 <- observation_a1 + rnorm(n = num_observations, mean = paired_diff, sd=sd_two)
## run tests
unpaired_test_unpaired_data[i] <- t.test(x=observation_a1, y=observation_b, paired=FALSE)$p.value
unpaired_test_paired_data[i] <- t.test(x=observation_a1, y=observation_a2, paired=FALSE)$p.value
paired_test_unpaired_data[i] <- t.test(x=observation_a1, y=observation_b, paired=TRUE)$p.value
paired_test_paired_data[i] <- t.test(x=observation_a1, y=observation_a2, paired=TRUE)$p.value
}
dt <- data.table(
p_value = c(unpaired_test_unpaired_data, unpaired_test_paired_data,
paired_test_unpaired_data, paired_test_paired_data),
test = rep(c('unpaired data, unpaired test', 'paired data, unpaired test',
'unpaired data, paired test', 'paired data, paired test'), each = num_sims)
)
return(dt)
}
```
```{r run paired simulation}
bar <- paired_sim(num_observations = 30, mean_one = 10, mean_two = 11, paired_diff = 1, sd_one = 4, sd_two = 5)
```
```{r plot results of paired simulation}
paired_data_plot <- bar[grep('unpaired data', test, invert=TRUE)] %>%
ggplot() +
aes(x=p_value, color = test, fill = test) +
geom_density(alpha=0.25) +
labs(title = 'Paired Data')
unpaired_data_plot <- bar[grep('unpaired data', test, invert=FALSE)] %>%
ggplot() +
aes(x=p_value, color = test, fill = test) +
geom_density(alpha=0.25) +
labs(title = 'Unpaired Data')
paired_data_plot / unpaired_data_plot
```
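A numeric summary to pair with the densities: the rejection rate at $\alpha = 0.05$ for each data and test combination.
```{r summarize paired simulation}
## expect the paired test to gain considerable power when the data
## really are paired, and to lose a little when they are not
bar[, .(rejection_rate = mean(p_value < 0.05)), by = test]
```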