---
title: "Probability of Winning Jeopardy"
author: "Bena"
date: "`r Sys.Date()`"
output: html_document
---
### Background
Jeopardy is a popular TV show in the US where participants answer trivia questions to win money. Participants choose from a set of categories, each containing questions that increase in difficulty; the harder the question, the more money a correct answer earns.
Suppose we're also interested in breaking the show's earnings record. In this project, we'll work with a dataset of Jeopardy questions to figure out patterns in the questions that could help us win. The dataset is a subset of 20,000 rows from a much larger collection of Jeopardy questions.
Concepts applied in this project include:

- probability
- hypothesis testing
### Data Familiarization
Setting up
```{r}
library(tidyverse)
```
Importing data
```{r}
jeopardy <- read_csv("jeopardy.csv")
```
Data exploration
```{r}
# View(jeopardy) # interactive viewer only; it doesn't render when knitting
glimpse(jeopardy)
```
Cleaning column names
```{r}
library(janitor)
```
```{r}
jeopardy <- jeopardy %>%
janitor::clean_names()
```
```{r}
glimpse(jeopardy)
```
```{r}
sapply(jeopardy, typeof)
```
### Fixing Data Types
```{r}
unique(jeopardy$value)
```
The value column is stored as character because some entries are "None" and the dollar amounts contain "$" signs and comma separators. We'll drop the "None" rows and strip the symbols before converting to numeric.
```{r}
jeopardy2 <- jeopardy %>%
filter(value != "None") %>%
mutate(value = str_replace_all(value, "[$,]",""),
value = as.numeric(value))
```
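A quick check that the conversion worked: value should now be numeric, with the "$" signs and commas gone.
```{r}
# All entries should now be plain numbers
unique(jeopardy2$value)
```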
### Normalizing Text
We want to normalize the text by lowercasing all words and removing any punctuation.
We'll do this for the category, question & answer columns.
```{r}
jeopardy2 <- jeopardy2 %>%
  mutate(category = str_to_lower(category),
         category = str_replace_all(category, "[^[:alnum:] ]", ""),
         question = str_to_lower(question),
         question = str_replace_all(question, "[^[:alnum:] ]", ""),
         answer = str_to_lower(answer),
         answer = str_replace_all(answer, "[^[:alnum:] ]", "")
  )
```
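A quick peek to confirm the normalization (the categories should now be lowercase with no punctuation):
```{r}
# Sample a few cleaned category names
head(unique(jeopardy2$category))
```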
### Making Dates More Accessible
We'll separate the air_date column into year, month and day columns to make filtering easier in the future.
We also need them to be numeric.
```{r}
library(lubridate)
```
```{r}
jeopardy2$air_date <- as.Date(jeopardy2$air_date, format = "%m/%d/%Y")
# air_date is now a Date, so we can format it directly
jeopardy2$year  <- as.numeric(format(jeopardy2$air_date, "%Y"))
jeopardy2$month <- as.numeric(format(jeopardy2$air_date, "%m"))
jeopardy2$day   <- as.numeric(format(jeopardy2$air_date, "%d"))
```
Alternatively, the same fields can be extracted with lubridate helpers:
```{r}
jeopardy2 <- jeopardy2 %>%
mutate(
day2 = day(air_date),
weekday = as.character(wday(air_date, label=TRUE)),
month2 = month(air_date),
year2 = year(air_date)
)
```
Re-ordering columns
```{r}
jeopardy2 <- jeopardy2[,c("show_number","air_date","day","month","year","weekday","round","category","value","question","answer")]
```
```{r}
glimpse(jeopardy2)
```
### Focusing On Particular Subject Areas
Many people seem to think that science and history facts are the most common categories to appear in Jeopardy episodes. Others feel that Shakespeare questions get an awful lot of attention from Jeopardy.
With the chi-squared test, we can actually test these hypotheses! Let's assess whether science, history and Shakespeare have a higher prevalence in the data set.
```{r}
n_categories <- jeopardy2 %>%
count(category) %>%
summarise(no_of_categories = n())
print(n_categories)
```
There are 3368 unique categories in the Jeopardy data set after doing all of our cleaning.
If we suppose that no category stood out, every category would be equally likely: the probability that a randomly chosen question comes from any particular category would be 1/3368, and the probability that it doesn't would be 3367/3368.
```{r}
n_questions <- nrow(jeopardy2)
p_category_expected <- 1/3368
p_not_category_expected <- 3367/3368
p_expected <- c(p_category_expected, p_not_category_expected)
```
We'll conduct a hypothesis test to see if these 3 categories are more likely to appear than the others.<br/>
> H0: Our null hypothesis states that science, history and Shakespeare appear no more often than any other category, i.e. every category is equally likely. <br/>
> H1: The alternative hypothesis states that science, history and Shakespeare appear more often than they would if every category were equally likely.
First, we'll count how many times the word "science" appears in the category column.
```{r}
categories <- pull(jeopardy2, category)
n_science_categories <- 0
for (c in categories) {
  # str_detect catches any category containing the word "science",
  # not just categories named exactly "science"
  if (str_detect(c, "science")) {
    n_science_categories <- n_science_categories + 1
  }
}
```
```{r}
science_obs <- c(n_science_categories, n_questions - n_science_categories)
p_expected = c(1/3368, 3367/3368)
chisq.test(science_obs, p = p_expected) # chisq.test() carries out the goodness-of-fit test
```
The p-value is below 0.05, so we reject the null hypothesis and conclude that science appears more often in the Jeopardy data than it would if every category were equally likely.
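As a sanity check, we can reproduce the X-squared statistic by hand from the observed and expected counts (a minimal sketch, reusing science_obs, n_questions and p_expected from the chunks above):
```{r}
# Expected counts under H0: n_questions split as 1/3368 vs 3367/3368
expected <- n_questions * p_expected
# Goodness-of-fit statistic: sum of (observed - expected)^2 / expected
sum((science_obs - expected)^2 / expected) # should match X-squared above
```
Next, we repeat the same count for the word "history".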
```{r}
n_history_categories <- 0
for (c in categories) {
  if (str_detect(c, "history")) {
    n_history_categories <- n_history_categories + 1
  }
}
```
```{r}
history_obs <- c(n_history_categories, n_questions - n_history_categories)
p_expected = c(1/3368, 3367/3368)
chisq.test(history_obs, p = p_expected)
```
The p-value is below 0.05, so we reject the null hypothesis and conclude that history appears more often in the Jeopardy data than it would if every category were equally likely.
```{r}
n_shakespear_categories <- 0
for (c in categories) {
  # "shakespear" (without the final e) also matches "shakespeare"
  if (str_detect(c, "shakespear")) {
    n_shakespear_categories <- n_shakespear_categories + 1
  }
}
```
```{r}
shakespear_obs <- c(n_shakespear_categories, n_questions - n_shakespear_categories)
p_expected = c(1/3368, 3367/3368)
chisq.test(shakespear_obs, p = p_expected)
```
The p-value is below 0.05, so we reject the null hypothesis and conclude that Shakespeare appears more often in the Jeopardy data than it would if every category were equally likely.
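Since the three loops above follow an identical pattern, here is a sketch (reusing categories, n_questions and p_expected) that collapses them into a single vectorized pass:
```{r}
# For each term, count matching categories and rerun the same test
sapply(c("science", "history", "shakespear"), function(term) {
  obs <- sum(str_detect(categories, term))
  chisq.test(c(obs, n_questions - obs), p = p_expected)$p.value
})
```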
### Unique Terms In Questions
We'd like to investigate how often new questions are repeats of older ones. As a first step, we'll collect every distinct term of six or more characters that appears in a question.
```{r}
questions <- pull(jeopardy2, question)
terms_used <- character(0)
for (q in questions) {
  # Split the sentence into distinct words
  split_sentence <- str_split(q, " ")[[1]]
  # Keep each word that is at least 6 characters long and not already collected
  for (term in split_sentence) {
    if (!term %in% terms_used & nchar(term) >= 6) {
      terms_used <- c(terms_used, term)
    }
  }
}
```
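Growing terms_used one element at a time is slow on 20,000 questions. A vectorized sketch that builds the same set of terms (order aside):
```{r}
# Split every question at once, then keep the unique long words
all_words <- unlist(str_split(questions, " "))
terms_used_fast <- unique(all_words[nchar(all_words) >= 6])
length(terms_used_fast)
```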
### Terms In Low and High Value Questions
We're more interested in studying terms that have high values associated with them rather than low values. <br/>
This optimization will help us earn more money when we're on Jeopardy while reducing the number of questions we have to study.<br/>
We'll define low and high values as follows:<br/>
>Low value: Any row where value is less than 800.<br/>
>High value: Any row where value is greater than or equal to 800.
Below is an image of what the question board looks like at the start of every round
```{r echo=FALSE, out.width="100%"}
knitr::include_graphics("winning_board.png", error = FALSE)
```
For each category on the board above, two of the five questions are high value (the $800 and $1000 rows) and three are low value, giving a 2:3 ratio of high to low value questions.
If the number of high and low value questions containing a term is appreciably different from this 2:3 ratio, we have reason to believe that the term is more prevalent in either the low or the high value questions.
We can use the chi-squared test to test the null hypothesis that each term is not distributed more to either high or low value questions.
```{r message=FALSE, warning=FALSE}
# Going only through the first 20 terms for shortness
# But you can remove the indexing to perform this code on all the terms
values <- pull(jeopardy2, value) # jeopardy2 has the cleaned numeric values
value_count_data <- NULL
for (term in terms_used[1:20]) {
  n_high_value <- 0
  n_low_value <- 0
  for (i in seq_along(questions)) {
    # Split the sentence into a new vector
    split_sentence <- str_split(questions[i], " ")[[1]]
    # Detect if the term is in the question and its value status
    if (term %in% split_sentence & values[i] >= 800) {
      n_high_value <- n_high_value + 1
    } else if (term %in% split_sentence & values[i] < 800) {
      n_low_value <- n_low_value + 1
    }
  }
  # Test if the high/low counts deviate from the expected 2:3 split
  test <- chisq.test(c(n_high_value, n_low_value), p = c(2/5, 3/5))
  new_row <- c(term, n_high_value, n_low_value, test$p.value)
  # Append this new row to our results matrix
  value_count_data <- rbind(value_count_data, new_row)
}
```
```{r}
# Take the value count data and put it in a better format
tidy_value_count_data <- as_tibble(value_count_data)
colnames(tidy_value_count_data) <- c("term", "n_high", "n_low", "p_value")
# rbind() produced character columns, so convert the numbers back
tidy_value_count_data <- tidy_value_count_data %>%
  mutate(across(c(n_high, n_low, p_value), as.numeric))
head(tidy_value_count_data)
```
We can see from the output that some of the counts are less than 5. <br/>
Recall that the chi-squared test is prone to error when the expected counts in any cell are less than 5. <br/>
We should discard these terms and only look at terms where both counts are at least 5.
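A sketch of that filter, assuming tidy_value_count_data from the chunk above with its count columns converted to numeric; sorting by p-value surfaces the most lopsided terms first:
```{r}
# Keep only terms with usable counts, then rank by p-value
tidy_value_count_data %>%
  filter(n_high >= 5, n_low >= 5) %>%
  arrange(p_value)
```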