```{r, echo=FALSE, message=FALSE}
library(tidyverse)
library(magrittr)
library(ggthemes)
library(patchwork)
library(data.table)
```
# Comparing Two Groups
![](./images/goin_left.jpeg)
## Learning Objectives
This week, we introduce the idea of comparing two groups to evaluate whether the data we have sampled lead us to believe that the population distributions of the random variables are different. Of course, because we don't get access to the function that describes a random variable, we can't *actually* know whether the populations are different. It is for this reason that we call it statistical inference -- we are inferring from a sample some belief about the populations.
At the conclusion of this week, students will be able to:
1. *Recognize* the similarities between all frequentist hypothesis tests.
2. *Evaluate* the conditions that surround the data, and choose a test that is best-powered and justifiable.
3. *Perform* and *Interpret* the results of the most common statistical tests.
## Class Announcements
- Great work completing your final w203 test!
- Unit 7 Homework is Group Homework, due next week.
- The Hypothesis Testing Lab is released today!
- The lab is due at the Unit 09 live session (two weeks from now): apply these fundamentals to analyze 2022 election data and write a single, three-page analysis
## Roadmap
### Rearview Mirror
- Statisticians create a population model to represent the world
- A population model has parameters we are interested in
- Ex: A parameter might represent the effect that a vitamin has on test performance
- A null hypothesis is a specific statement about a parameter
- Ex: The vitamin has zero effect on performance
- A hypothesis test is a procedure for rejecting or not rejecting a null, such that the probability of a type I error is constrained.
### Today
- There are often multiple hypothesis tests you can apply to a scenario.
- Our primary concern is choosing a test with assumptions we can defend.
- Secondarily, we want to maximize power.
### Looking ahead
- Next week, we start working with models for linear regression
- We will see how hypothesis testing is also used for regression parameters.
## Teamwork Discussion
### Working on Data Science Teams
Data science is a *beautiful* combination of teamwork and individual work. It provides the opportunity to collaborate on a data pipeline with people from all over the organization, and to deal with technical, statistical, and social questions that are always interesting. While we expect everyone on a team to be a professional, there is so much range within the pursuit of data science that projects are nearly always collaborative exercises.
Together as teams, we
- Define research ambitions and scope
- Imagine/envision the landscape of what is possible
- Support, discuss, review and integrate individual contributions
As individuals, we conduct the heads-down work that moves question-answering forward. This might be reading papers to determine the most appropriate method to bring to bear on the question, researching the data that is available, or understanding the technical requirements we have to meet for this answer to be useful to our organization in real time.
What is your instructor *uniquely* capable of? Let them tell you! But, at the same time, what would they acknowledge that others do better than they do?
The thing is, because there is so much that has to be done, there are very, very few people who are one-stop data science shops. Instead, teams rely on collaboration and joint expertise in order to get good work done.
### The Problematic Psychology of Data Science
People talk about the *impostor syndrome*: a feeling of inadequacy or interloping that is sometimes also associated with a fear of under-performing relative to the expectations of others on the team. These emotions are common throughout data science, academia, and computer science. But these types of emotions are also commonplace in journalism, film-making, and public speaking.
Has anybody ever had the dream that they're late to a test? Or that they've got to give a speech they're unprepared for? Does anybody remember playing an instrument as a kid and having to go to recitals? Or playing for a championship on a youth sports team? Or walking into a big test?
What are the feelings associated with those events? What might be generating these feelings?
![](images/among_us.jpeg){width=25%}
### What Makes an Effective Team?
- This reading on [*effective* teams](https://rework.withgoogle.com/print/guides/5721312655835136/) summarizes academic research to argue:
> What really matters to creating an effective team is less about who is on the team, and more about how the team works together.
In your live session, your section might take 7 minutes to read this brief. If so, please read the following sections:
- The problem statement;
- The proposed solution;
- The framework for team effectiveness, stopping at the section titled *"Tool: Help teams determine their own needs."*
> "Psychological safety refers to an individual's perception of the consequences of taking an interpersonal risk. It is a belief that a team is safe for risk taking in the face of being seen as ignorant, incompetent, negative, or disruptive."
>
> "In a team with high psychological safety, teammates feel safe to take risks around their team members. They feel confident that no one on the team will embarrass or punish anyone else for admitting a mistake, asking a question, or offering a new idea."
### We All Belong
- From your experience, can you give an example of taking a personal risk as part of a team?
- Can you describe your emotions when contemplating this risk?
- If you did take the risk, how did the reactions of your teammates affect you?
- Knowing the circumstances that generate feelings of anxiety -- what steps can we take as a section, or a team, to recognize and respond to these circumstances?
> How can you add to the psychological safety of your peers in this section and of your lab teammates?
## Team Kick-Off
**Lab 1 Teams**
- Here are teams for Lab 1!
**Team Kick-Off Conversation**
- In a 10 minute breakout with your team, please discuss the following questions:
1. How much time will you invest in the lab each week?
2. How often will you meet and for how long?
3. How will you discuss, review, and integrate individual work into the team deliverable?
4. What do you see as the biggest risks when working on a team? How can you contribute to an effective team dynamic?
## A Quick Review
**Review of Key Terms**
- Define each of the following:
- Population Parameter
- Null Hypothesis
- Test Statistic
- Null Distribution
**Comparing Groups Review**
Take a moment to recall the tests you learned this week. Here is a quick cheat sheet of their key assumptions.
| paired/unpaired | parametric | non-parametric |
|-----------------|---------------------------------------------------------------------------------|---------------------------------------------------------|
| unpaired | **unpaired t-test** <br> - metric variable <br> - i.i.d. <br> - (not too un-)normal | **Wilcoxon rank-sum** <br> - ordinal variable <br> - i.i.d. |
| paired | **paired t-test** <br> - metric variable <br> - i.i.d. <br> - (not too un-)normal | **Wilcoxon signed-rank** <br> - metric variable <br> - i.i.d. <br> - differences are symmetric <br><br> **sign test** <br> - ordinal variable <br> - i.i.d. |
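For reference, these map onto a small set of R calls. A minimal sketch on made-up vectors (`x` and `y` here are simulated, not GSS variables):
```{r cheat sheet calls}
x <- rnorm(30)
y <- rnorm(30)

t.test(x, y)                            # unpaired t-test
wilcox.test(x, y)                       # Wilcoxon rank-sum
t.test(x, y, paired = TRUE)             # paired t-test
wilcox.test(x, y, paired = TRUE)        # Wilcoxon signed-rank
binom.test(sum(x > y), sum(x != y))     # sign test on the paired differences
```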
## Rank-Based Tests
Darrin Speegle has a nice walk-through of rank-based testing procedures, linked [here](https://bookdown.org/speegled/foundations-of-statistics/RBT.html#two-sample-test). We'll talk through a few examples of this, and then move on to running these tests against data for the class.
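Before clicking through, it is worth seeing how little machinery a rank-sum test needs. A small sketch on made-up data, checking a hand computation against `wilcox.test`:
```{r rank sum by hand}
## pool both groups, rank the pooled values, and sum the ranks
## that belong to group one
group_one <- c(1.2, 3.4, 2.2, 5.1)
group_two <- c(0.8, 1.9, 2.0)

pooled_ranks <- rank(c(group_one, group_two))
rank_sum     <- sum(pooled_ranks[seq_along(group_one)])

## R reports W: the rank sum minus its smallest possible value, n1 * (n1 + 1) / 2
rank_sum - length(group_one) * (length(group_one) + 1) / 2
wilcox.test(group_one, group_two)$statistic
```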
## Comparing Groups R Exercise
The General Social Survey (GSS) is one of the longest-running and most extensive survey projects in the US. The full dataset includes over 1000 variables spanning demographics, attitudes, and behaviors. The file `GSS_w203.RData` contains a small selection of variables from the 2018 GSS.
To learn about each variable, you can enter it into the search bar at the [GSS data explorer](https://gssdataexplorer.norc.org/variables/vfilter)
```{r}
load('data/GSS_w203.RData')
summary(GSS)
```
You have a set of questions that you would like to answer with a statistical test. **For each question**:
1. Choose the most appropriate test.
2. List and evaluate the assumptions for your test.
3. Conduct your test.
4. Discuss statistical and practical significance.
## The Questions
### Set 1
- Do economics majors watch more or less TV than computer science majors?
```{r plot relationship}
GSS %>%
filter(major1 %in% c('computer science', 'economics')) %>%
ggplot() +
aes(x = tvhours, fill = major1) +
geom_histogram(bins = 10, position = 'dodge')
```
What kinds of tests *could* be reasonable to conduct? For what part of the data would we conduct these tests?
```{r}
## The assumptions about the data drive us to the correct test.
## But, let's ask all the tests that could *possibly* make sense, and see how
## matching or mis-matching assumptions changes what we learn.
## Answers are in the next chunk... but don't jump to them right away.
```
```{r echo=FALSE, eval=FALSE}
GSS %>%
filter(major1 %in% c('computer science', 'economics')) %$%
t.test(tvhours ~ major1)
GSS %>%
filter(major1 %in% c('computer science', 'economics')) %$%
wilcox.test(tvhours ~ major1)
GSS %>%
filter(major1 %in% c('computer science', 'economics')) %>%
mutate(tv_category = case_when(
tvhours < 2 ~ 1,
tvhours >= 2 & tvhours < 5 ~ 2,
tvhours >= 5 ~ 3)) %$%
# t.test(tv_category ~ major1)
wilcox.test(tv_category ~ major1)
```
- Do Americans with pets watch more or less TV than Americans without pets?
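We could take the same approach here. A sketch only: this chunk assumes the extract includes a yes/no pet indicator named `pets`, which is a guess -- check `summary(GSS)` for the actual variable name.
```{r pets and tv, eval=FALSE}
## sketch: `pets` is an assumed variable name, not confirmed in this extract
GSS %>% 
  filter(pets %in% c('yes', 'no')) %$% 
  wilcox.test(tvhours ~ pets)
```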
### Set 2
- Do Americans spend more time emailing or using the web?
```{r}
## each respondent answers both questions, so the data are paired:
## drop rows missing either response, then run a paired t-test
GSS %>% 
  select(wwwhr, emailhr) %>% 
  drop_na() %$% 
  t.test(x = wwwhr, y = emailhr, paired = TRUE)

## look at the two distributions side by side
GSS %>% 
  ggplot() + 
  geom_histogram(aes(x = wwwhr), fill = 'darkblue', alpha = 0.5) + 
  geom_histogram(aes(x = emailhr), fill = 'darkred', alpha = 0.5)

## for comparison, the same test run as though the samples were
## independent -- notice how ignoring the pairing changes the result
t.test(
  x = GSS$wwwhr, 
  y = GSS$emailhr,
  paired = FALSE
)
```
- Do Americans spend more evenings with neighbors or with relatives?
```{r start by building data}
wilcox_test_data <- GSS %>% 
  select(socrel, socommun) %>% 
  mutate(
    family_ordered = factor(
      x = socrel, 
      levels = c('almost daily', 'sev times a week', 
                 'sev times a mnth', 'once a month', 
                 'sev times a year', 'once a year', 'never')),
    ## socommun records evenings spent with neighbors
    neighbors_ordered = factor(
      x = socommun, 
      levels = c('almost daily', 'sev times a week', 
                 'sev times a mnth', 'once a month', 
                 'sev times a year', 'once a year', 'never')))
```
To begin this investigation, we've got to look at the data and see what is in it. If you look below, you'll note that it sure seems that people are spending more evenings with their family... erp, actually no. They're chatting with the neighbors rather than taking their mother out to dinner.
```{r produce descriptive plot}
wilcox_test_data %>% 
  select(neighbors_ordered, family_ordered) %>% 
  rename(
    Neighbors = neighbors_ordered,
    Family    = family_ordered
  ) %>% 
  drop_na() %>% 
  pivot_longer(cols = c(Neighbors, Family)) %>% 
  ggplot() + 
  aes(x = value, fill = name) + 
  geom_bar(position = 'dodge', alpha = 0.7) + 
  scale_fill_manual(values = c('#003262', '#FDB515')) + 
  labs(
    title    = 'Do Americans Spend More Evenings With Neighbors or Family?', 
    subtitle = 'A cutting analysis.', 
    fill     = 'Neighbors or Family', 
    x        = 'Amount of Time Spent') + 
  scale_x_discrete(guide = guide_axis(n.dodge = 2)) + 
  theme_minimal()
```
With this plot created, we can ask whether what we observe is just the product of sampling error, or whether it would have been unlikely to arise if the null hypothesis were true. What is the null hypothesis? Well, let's suppose that if we didn't know anything about the data, we would expect there to be no difference between the amount of time spent with neighbors or family.
```{r run the appropriate test}
## risky choice -- casting the factor to a numeric without checking what happens.
## also note: these are paired responses, but we run an unpaired test here.
wilcox_test_data %$% 
  wilcox.test(
    x = as.numeric(family_ordered), 
    y = as.numeric(neighbors_ordered), 
    paired = FALSE
  )
```
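One more option: because each respondent answers both questions, these are paired, ordinal responses, and the cheat sheet above points to a sign test. A minimal sketch (recall that levels run from 'almost daily' down to 'never', so a smaller numeric code means more frequent):
```{r sign test alternative}
sign_test_data <- wilcox_test_data %>% 
  drop_na(family_ordered, neighbors_ordered) %>% 
  mutate(
    family_num    = as.numeric(family_ordered),
    neighbors_num = as.numeric(neighbors_ordered))

## under the null, family outranks neighbors in half of the non-tied pairs
binom.test(
  x = sum(sign_test_data$family_num < sign_test_data$neighbors_num), 
  n = sum(sign_test_data$family_num != sign_test_data$neighbors_num))
```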
### Set 3
- Are Americans who own guns more likely to have pets than Americans who don't? (A sketch for this one follows the list.)
- Are Americans with pets happier than Americans without pets?
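For the first question, both variables are binary, which suggests a two-sample test of proportions. A sketch, assuming gun ownership is coded in `owngun` and that the extract includes a hypothetical yes/no `pets` indicator:
```{r guns and pets, eval=FALSE}
## sketch: `pets` is an assumed variable name
gun_pet_table <- GSS %>% 
  filter(owngun %in% c('yes', 'no'), pets %in% c('yes', 'no')) %>% 
  droplevels() %$% 
  table(owngun, pets)

## prop.test treats the rows as groups and compares the share
## falling in the first column between them
prop.test(gun_pet_table)
```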
### Apply to a New Type of Data
- Is there a relationship between college major and gun ownership? (One possible approach is sketched after the empty chunk below.)
```{r}
```
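Both variables here are categorical with unordered levels, so none of the tests in the cheat sheet fits neatly; a chi-squared test of independence is a natural choice. A sketch, again assuming gun ownership is coded in `owngun` -- left unevaluated so you can work the empty chunk above first:
```{r major and gun ownership, eval=FALSE}
major_gun_table <- GSS %>% 
  filter(owngun %in% c('yes', 'no')) %>% 
  droplevels() %$% 
  table(major1, owngun)

chisq.test(major_gun_table)
```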
## Simulating the Effects of Test Choices
```{r set colors}
theme_set(theme_minimal())
berkeley_blue <- '#003262'
berkeley_gold <- '#FDB515'
berkeley_sather <- '#B9D3B6'
```
### Should we use a t-test or a Wilcoxon rank-sum?
There is some open discussion in the applied statistics literature about whether we should *ever* be using a t-test. In particular, if the underlying distribution that generates the data is **not** normal, then the assumptions of a t-test are not, technically, satisfied, and the test does not produce results that have nominal p-value coverage. This is both *technically* and *theoretically* true; and yet researchers, data scientists, your instructors, and the entire world run t-tests as the "test of first recourse."
What is the alternative to conducting a t-test as the test of first recourse? It might be the Wilcoxon family of tests, which replace the normality assumption with weaker ones: symmetry around the median for the signed-rank test, and no distributional shape assumption at all for the rank-sum test.
Additional points of argument, which you will investigate in this worksheet:
- If the underlying data **is** normal, then the Wilcoxon test is *nearly* as well powered as the t-test.
- If the underlying data **is not** normal, then the Wilcoxon test still maintains nominal p-value coverage, whereas the t-test might lose this guarantee.
### The Poisson Distribution
The Poisson distribution has the following PMF:
$$
f_X(x) = \frac{\lambda^x e^{-\lambda}}{x!}
$$
The key parameter of a Poisson distribution is $\lambda$, which is both its mean and its variance; we show three different distributions, setting this parameter to 1, 3, and 30 respectively. Notice that the limits on these plots are not set to be the same; for example, the range in the third plot is considerably larger than in the first.
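As a quick check on the formula, compute $P(X = 2)$ for $\lambda = 3$ by hand and compare it to R's built-in `dpois`:
```{r check the poisson pmf}
lambda <- 3
(lambda^2 * exp(-lambda)) / factorial(2)   # by the formula above
dpois(x = 2, lambda = lambda)              # by R's built-in PMF
```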
```{r show some from poisson distribution}
pois_lambda_1 <- rpois(n=1000, lambda=1)
pois_lambda_3 <- rpois(n=1000, lambda=3)
pois_lambda_30 <- rpois(n=1000, lambda=30)
plot_1 <- ggplot() + aes(x=pois_lambda_1) + geom_histogram(bins=6, fill = berkeley_blue)
plot_3 <- ggplot() + aes(x=pois_lambda_3) + geom_histogram(bins=10, fill = berkeley_gold)
plot_30 <- ggplot() + aes(x=pois_lambda_30) + geom_histogram(bins=30, fill = berkeley_sather)
plot_1 / plot_3 / plot_30
```
What does this changing distribution do to the p-values?
### Write a Simulation
```{r write poisson simulation}
pois_sim <- function(num_sims=10000, num_observations, lambda_one, lambda_two) {
  t_test_result <- rep(NA, num_sims)
  wilcox_result <- rep(NA, num_sims)
  
  for(i in 1:num_sims) {
    ## draw two independent poisson samples, then record both tests' p-values
    group_one <- rpois(n=num_observations, lambda=lambda_one)
    group_two <- rpois(n=num_observations, lambda=lambda_two)
    
    t_test_result[i] <- t.test(group_one, group_two)$p.value
    wilcox_result[i] <- wilcox.test(x=group_one, y=group_two)$p.value
  }
  
  df <- data.table(
    p_value = c(t_test_result, wilcox_result), 
    test = rep(c('t_test', 'wilcox_test'), each = num_sims)
  )
  return(df)
}
```
```{r run simulation, warning=FALSE}
foo <- pois_sim(num_observations = 20, lambda_one = 1, lambda_two = 2)
```
```{r plot results of simulation}
foo %>%
ggplot() +
geom_density(aes(x=p_value, color = test)) +
scale_color_manual(values = c(berkeley_blue, berkeley_gold))
```
And so, the simulation rejects the null (at $\alpha = 0.05$) at the following rates:
- For the t-test, at a rate of `r foo[test=='t_test', mean(p_value < 0.05)]`
- For the Wilcoxon test, at a rate of `r foo[test=='wilcox_test', mean(p_value < 0.05)]`
### What if a distribution is much *more* skewed?
```{r}
skewed_sim <- function(num_sims=1000, num_observations, alpha_1, beta_1, alpha_2, beta_2) {
  t_test_result <- rep(NA, num_sims)
  wilcox_result <- rep(NA, num_sims)
  
  for(i in 1:num_sims) {
    ## draw two (skewed) beta-distributed samples, then record both tests' p-values
    group_one <- rbeta(n=num_observations, shape1 = alpha_1, shape2 = beta_1)
    group_two <- rbeta(n=num_observations, shape1 = alpha_2, shape2 = beta_2)
    
    t_test_result[i] <- t.test(group_one, group_two)$p.value
    wilcox_result[i] <- wilcox.test(x=group_one, y=group_two)$p.value
  }
  
  dt <- data.table(
    p_value = c(t_test_result, wilcox_result), 
    test = rep(c('t_test', 'wilcox_test'), each = num_sims)
  )
  return(dt)
}
```
### False Rejection Rates
Start with both groups drawn from the same distribution, with parameters `alpha=2`, `beta=7`. Because the null is true by construction, every rejection is a false rejection.
```{r}
ggplot(data.frame(x=c(0,1)), aes(x)) +
stat_function(fun = dbeta, n=100, args=list(shape1=2, shape2=7))
```
```{r}
same_dist_small_data <- skewed_sim(
num_observations=10,
alpha_1=2, beta_1=7,
alpha_2=2, beta_2=7
)
same_dist_med_data <- skewed_sim(
num_observations=50,
alpha_1=2, beta_1=7,
alpha_2=2, beta_2=7
)
same_dist_big_data <- skewed_sim( # haha, "big data"
num_observations=100,
alpha_1=2, beta_1=7,
alpha_2=2, beta_2=7
)
```
```{r}
plot_1 <- same_dist_small_data %>%
ggplot() +
geom_density(aes(x=p_value, color = test), bounds=c(0,1)) +
scale_color_manual(values = c(berkeley_blue, berkeley_gold))
plot_2 <- same_dist_med_data %>%
ggplot() +
geom_density(aes(x=p_value, color = test), bounds=c(0,1)) +
scale_color_manual(values = c(berkeley_blue, berkeley_gold))
plot_3 <- same_dist_big_data %>%
ggplot() +
geom_density(aes(x=p_value, color = test), bounds=c(0,1)) +
scale_color_manual(values = c(berkeley_blue, berkeley_gold))
plot_1 / plot_2 / plot_3
```
Under a true null, a well-calibrated test should reject about 5% of the time. The false rejection rates, for n = 10, 50, and 100 respectively:
- T-tests
    - `r same_dist_small_data[test=='t_test', mean(p_value < 0.05)]`
    - `r same_dist_med_data[test=='t_test', mean(p_value < 0.05)]`
    - `r same_dist_big_data[test=='t_test', mean(p_value < 0.05)]`
- Wilcoxon Tests
    - `r same_dist_small_data[test=='wilcox_test', mean(p_value < 0.05)]`
    - `r same_dist_med_data[test=='wilcox_test', mean(p_value < 0.05)]`
    - `r same_dist_big_data[test=='wilcox_test', mean(p_value < 0.05)]`
### What About Power to Reject?
Now draw the two groups from *different* distributions -- Beta(2, 7) versus Beta(2, 5) -- so the null is false and rejecting it is the correct decision. The question becomes which test rejects more often.
```{r}
small_diff_small_data <- skewed_sim(
num_observations=10,
alpha_1=2, beta_1=7,
alpha_2=2, beta_2=5
)
small_diff_med_data <- skewed_sim(
num_observations=50,
alpha_1=2, beta_1=7,
alpha_2=2, beta_2=5
)
small_diff_big_data <- skewed_sim( # haha, "big data"
num_observations=100,
alpha_1=2, beta_1=7,
alpha_2=2, beta_2=5
)
```
```{r}
plot_1 <- small_diff_small_data %>%
ggplot() +
geom_density(aes(x=p_value, color = test), bounds=c(0,1)) +
scale_color_manual(values = c(berkeley_blue, berkeley_gold))
plot_2 <- small_diff_med_data %>%
ggplot() +
geom_density(aes(x=p_value, color = test), bounds=c(0,1)) +
scale_color_manual(values = c(berkeley_blue, berkeley_gold))
plot_3 <- small_diff_big_data %>%
ggplot() +
geom_density(aes(x=p_value, color = test), bounds=c(0,1)) +
scale_color_manual(values = c(berkeley_blue, berkeley_gold))
plot_1 / plot_2 / plot_3
```
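Mirroring the false-rejection summaries above, we can compute each test's rejection rate at each sample size -- with a true difference between the groups, this rate is the test's power:
```{r power summaries}
## share of simulations rejecting at alpha = 0.05, by test and sample size
rbind(
  small_diff_small_data[, .(n = 10,  power = mean(p_value < 0.05)), by = test],
  small_diff_med_data[,   .(n = 50,  power = mean(p_value < 0.05)), by = test],
  small_diff_big_data[,   .(n = 100, power = mean(p_value < 0.05)), by = test]
)
```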
### Paired compared to unpaired tests
```{r write the paired simulation}
paired_sim <- function(num_sims=10000, num_observations, mean_one, mean_two, paired_diff, sd_one, sd_two) {
unpaired_test_unpaired_data <- rep(NA, num_sims)
unpaired_test_paired_data <- rep(NA, num_sims)
paired_test_unpaired_data <- rep(NA, num_sims)
paired_test_paired_data <- rep(NA, num_sims)
for(i in 1:num_sims) {
observation_a1 <- rnorm(n = num_observations, mean = mean_one, sd = sd_one)
## first create unpaired data
observation_b <- rnorm(n = num_observations, mean = mean_two, sd = sd_two)
## then, create paired data
observation_a2 <- observation_a1 + rnorm(n = num_observations, mean = paired_diff, sd=sd_two)
## run tests
unpaired_test_unpaired_data[i] <- t.test(x=observation_a1, y=observation_b, paired=FALSE)$p.value
unpaired_test_paired_data[i] <- t.test(x=observation_a1, y=observation_a2, paired=FALSE)$p.value
paired_test_unpaired_data[i] <- t.test(x=observation_a1, y=observation_b, paired=TRUE)$p.value
paired_test_paired_data[i] <- t.test(x=observation_a1, y=observation_a2, paired=TRUE)$p.value
}
dt <- data.table(
p_value = c(unpaired_test_unpaired_data, unpaired_test_paired_data,
paired_test_unpaired_data, paired_test_paired_data),
test = rep(c('unpaired data, unpaired test', 'paired data, unpaired test',
'unpaired data, paired test', 'paired data, paired test'), each = num_sims)
)
return(dt)
}
```
```{r run paired simulation}
bar <- paired_sim(num_observations = 30, mean_one = 10, mean_two = 11, paired_diff = 1, sd_one = 4, sd_two = 5)
```
```{r plot results of paired simulation}
paired_data_plot <- bar[grep('unpaired data', test, invert=TRUE)] %>%
ggplot() +
aes(x=p_value, color = test, fill = test) +
geom_density(alpha=0.25) +
labs(title = 'Paired Data')
unpaired_data_plot <- bar[grep('unpaired data', test, invert=FALSE)] %>%
ggplot() +
aes(x=p_value, color = test, fill = test) +
geom_density(alpha=0.25) +
labs(title = 'Unpaired Data')
paired_data_plot / unpaired_data_plot
```
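A numeric summary to pair with the densities: the rejection rate at $\alpha = 0.05$ for each data and test combination.
```{r summarize paired simulation}
## expect the paired test to gain considerable power when the data
## really are paired, and to lose a little when they are not
bar[, .(rejection_rate = mean(p_value < 0.05)), by = test]
```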