###
#DATA-2100
#Adefoluke Shemsu
#Week 6
###
# PROBLEM 1
setwd("~/Documents/Education/Penn/Classes/DATA 210/Week 6")
library(tidyverse)
library(weights)
library(anesrake)
# You will need to load several different data sets to complete this problem set. All of the data are available on Canvas.
# Don’t forget that all data we provide you for this class can only be used for class purposes.
# In this question, we’re going to explore a variation of the birthday problem.
# Ultimately we want you to answer the following question: in a large group of people,
# what is the probability that you pick a random day of the year and nobody in the group has that birthday?
# For these questions you can assume that all years have 365 days.
# Throughout this question, we’ll only worry about birth month and day (i.e. you can ignore birth year)
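# Before simulating, note the closed-form benchmark: with n independent uniform birthdays,
# each person misses a fixed day with probability 364/365, so the chance that nobody
# is born on that day is (364/365)^n.
(364/365)^500 # ~0.254 for a group of 500; the simulations below should land near this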
#a) We’ll begin with the basics. Write a line or two of code that randomly draws birthdays for
# 500 people and checks whether nobody has a birthday on the first day of the year (i.e. January 1).
# The sample() function should be helpful here. Hint: use the numbers 1 through 365 to represent birthdays,
# and sample from those numbers.
birthdays.vec <- sample(1:365, 500, replace = TRUE) # Drawing a random birthday (1-365) for each of 500 people
birthdays <- birthdays.vec # Turning the sample into a more workable data set
birthdays <- as.data.frame(table(birthdays)) # Columns: birthdays (day of year) and Freq (count)
subset(birthdays, birthdays == "1") # Rows for day 1; an empty result means 0 January 1 birthdays
!any(birthdays.vec == 1) # Direct check: TRUE if nobody in the sample has a January 1 birthday
# b) The previous question allowed us to calculate whether or not somebody had a January 1 birthday in the sample we drew.
# That’s not especially interesting, since our answer could change if we drew a new sample.
# It would be more interesting to know (before we draw the sample) what the probability is that nobody in
# our group had a birthday on January 1. We can use for-loops to estimate these types of probabilities.
# The idea is that if we repeat what we did in part A a large number of times (each time drawing a new random sample),
# the proportion of the samples in which nobody had a January 1 birthday equals the probability that nobody has a
# January 1 birthday (for a group this size). Write a for-loop that repeats the calculation you did in part A 1,000 times.
# Each time, you should store the result (i.e. whether or not there was anybody with a January 1 birthday) in a vector
# so that you can see all 1,000 results after the loop is completed.
birthdays <- rename(birthdays, #Cleaning up for easier analysis
freq = "Freq")
birthdays <- mutate(birthdays, #Creating variable to represent what pct those numbers make up of the total 500
pct.of.total = freq/(sum(freq))*100) #Sum of pct.of.total = 100%
bdays.loop <- rep(0, 1000) # Vector to store one result per simulated sample
for (x in 1:1000) {
  draw <- sample(1:365, 500, replace = TRUE) # A fresh random sample of 500 birthdays each iteration
  bdays.loop[x] <- !any(draw == 1) # Stored as 1 if nobody has a January 1 birthday, else 0
}
# Each of the 1,000 entries records whether that simulated group of 500 was free of January 1 birthdays
# c) Find the mean of the vector of results you created in part B.
# What is the probability that, in a group of 500 people, nobody in the group has a birthday on January 1?
# Hint: Your answer should be somewhere between 0 and 1 (i.e. 0% and 100%).
# If your answer equals exactly 0 or 1, you’ve done something wrong.
mean(bdays.loop) # The mean of the 0/1 results estimates the probability; should be near (364/365)^500, ~0.25
# Double-checking with a tally-style loop that draws a fresh sample each iteration
hits <- 0
for (n in 1:1000){
  bdays <- sample(1:365, 500, replace = TRUE)
  if (!(1 %in% bdays)) hits <- hits + 1 # Counting samples with no January 1 birthday
}
perc <- hits / 1000
perc # Estimated probability that nobody in a group of 500 was born on January 1
# d) Now let’s use what we learned about writing our own functions to make our code a little bit more general.
# Write a function where you specify the size of the group (rather than fixing it at 500),
# and the function tells you the probability that nobody in a group that size has a January 1 birthday.
# Use the function to estimate that probability for a group of 750 people. What is that probability?
# Got mixed up with the plain for-loop version, so tried wpct() to look at day-by-day shares
for (x in 1:1000) {
  pct.jan.1 <- wpct(sample(1:365, 750, replace = TRUE)) # Share of each birthday in one draw of 750
}
# Stuck here; the sketch below generalizes parts B/C into the requested function
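# A minimal sketch of the requested function (the names no.jan1.prob, group.size, and
# n.sims are my own): re-run the part B/C simulation with the group size as a parameter
# and return the share of simulated groups containing no January 1 birthday.
no.jan1.prob <- function(group.size, n.sims = 1000) {
  results <- logical(n.sims)
  for (i in 1:n.sims) {
    results[i] <- !any(sample(1:365, group.size, replace = TRUE) == 1) # TRUE if nobody born Jan 1
  }
  mean(results)
}
no.jan1.prob(750) # For 750 people, expect roughly (364/365)^750, i.e. about 0.13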
# e) Bonus question: Use your function and another for-loop to calculate the probability for every group size between 500 and 1500.
# Use the plot() function to make a simple scatter plot of your results. What do you notice about the results?
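# One possible approach to the bonus, relying on the no.jan1.prob() sketch above:
# estimate the probability at each group size from 500 to 1500 and plot the results.
group.sizes <- 500:1500 # Every group size in range (use a coarser step if this runs slowly)
probs <- sapply(group.sizes, no.jan1.prob, n.sims = 500)
plot(group.sizes, probs, xlab = "Group size", ylab = "P(no Jan 1 birthday)")
# The points should fall along a curve decaying roughly like (364/365)^n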
# PROBLEM 2
# 2. Typically when we’re working with survey data, our goal is to use the responses of the survey
# to understand the larger group of people that the survey respondents represent (usually called the sample frame).
# If we’re interested in understanding the political attitudes of everybody who lives in the United States,
# then we could use a survey of Americans to further our understanding. It’s almost impossible, however,
# that a sample of a few thousand (or even a hundred thousand) people would have demographic attributes that are exactly
# identical to the US population as a whole. Because of this fact, we use survey weights to slightly adjust the
# composition of our sample to match the sampling frame.
# In an ideal world, every respondent’s survey weight would equal exactly 1. In practice, this is never really
# possible in any real-world scenario. The best we can do is to draw a survey that’s as representative of the sampling
# frame as possible, and then use weights to make small adjustments to account for the imperfection and randomness
# in the sampling process. In this question, we’re going to use survey data from a July 2019 survey about American
# politics (“july-2019-sm-poll.sav”). To keep things a bit simpler, the data we’ve provided for this problem set
# doesn’t include every question on the survey.
poll.data <- rio::import("/Users/ade/Documents/Education/Penn/Classes/DATA 210/Week 6/july-2019-sm-poll.sav")
# a. We’ll begin by looking at the weight variable in the dataset.
# What is the average weight given to people in the dataset?
# Why is it generally a good idea for survey weights to have this average?
mean(poll.data$weight) # Mean weight is 1.0
# An average of 1 means the weights only redistribute influence among respondents:
# the weighted sample size stays equal to the actual sample size, so weighting
# neither inflates nor deflates the effective number of observations behind our estimates.
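# Equivalent check: when the mean weight is 1, the weights sum to the number of respondents
sum(poll.data$weight) # Should match nrow(poll.data)
nrow(poll.data)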
# b. Sometimes we trim survey weights so that no single respondent has a huge amount of influence on the
# conclusions we draw from the data. Does it appear that the survey weights in the anger data have been trimmed?
# If so, what value were they trimmed to?
# Trimming would show up as many weights piled up at a single capped maximum value.
# There doesn't appear to be any trimming here: the weights don't cluster at a cap.
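# A quick way to check (a sketch): inspect the top of the weight distribution directly.
summary(poll.data$weight)
sort(poll.data$weight, decreasing = TRUE)[1:10] # Many identical maximum values would suggest a trim cap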
# c. Identify the person in the data that has the highest survey weight.
# What is the race and gender of this person? Why might this person have such a high survey weight?
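# Locating the highest-weight respondent directly (a sketch; these columns are all
# referenced elsewhere in this script):
poll.data[which.max(poll.data$weight), c("race", "gender", "state", "party", "weight")]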
attributes(poll.data$race) # Viewing attributes to see how numeric values under the race/gender column translate to characters.
attributes(poll.data$gender)
# The respondent's race is Hispanic and their gender is female.
attributes(poll.data$party) # Pulling some more attributes for better context
attributes(poll.data$state)
# This respondent likely has such a large weight because her combination of traits is rare
# in the sample: a Hispanic female Republican in Idaho. Her weight is inflated to offset
# what is presumably a largely white respondent base from that region, so that people like
# her count in proportion to their share of the sampling frame.
# We can also check this in the data:
idaho <- subset(poll.data, state == "13")
count(idaho, weight > 1) # 28 of the 101 Idaho respondents received a weight over 1
count(idaho, weight > 4.5) # Only 1 of the other 100 Idaho respondents has a weight this high, so a large offset for her makes sense
# d. What is the unweighted average age of people in the dataset?
# Using the survey weights, what is the weighted average age of people in the data?
# The weighted.mean() function might be helpful here. You can ignore the people who did not
# provide their age in the survey, and leave them out of the denominators.
# What does the difference between these two numbers indicate to you?
mean(poll.data$age, na.rm = TRUE) # Leaving non-responses out of the denominator
# The unweighted average age is 52
weighted.mean(poll.data$age, poll.data$weight, na.rm = TRUE)
# The weighted average is 45: the raw sample skews older than the population the weights
# target, so without weighting, older respondents' values, income, and generational views
# would have a disproportionate impact on the survey estimates.
# e. Compare the unweighted versus weighted percentages of people in the survey who are white, black, Hispanic, and Asian.
# When you apply the weights, which groups increase in size and which decrease?
# By how much do each of the group sizes change (in terms of percentage points)?
# You might find the wpct() function in the ‘weights’ package to be useful.
# First going to segment this data to make it easier to analyze without impacting the larger set
race.data <- select(poll.data, c(race, weight))
# Pulling specific numbers to validate with manual math
count(race.data, race == "1") # = 11,448 white
count(race.data, race == "2") # = 1,397 black
count(race.data, race == "3") # = 1,009 hispanic
count(race.data, race == "4") # = 530 asian
count(race.data, race == "5") # = 1,145 other
unweighted.race <- wpct(race.data$race)
# Unweighted = 0.73720137 white, 0.08996072 black, 0.06497521 hispanic, 0.03412969 asian, 0.07373302 other
weighted.race <- wpct(race.data$race, race.data$weight)
# Weighted = 0.61709576 white, 0.12891352 black, 0.16001156 hispanic, 0.03031985 asian, 0.06365931 other
# When applying the weights, the white, asian, and "other" shares shrank while the black and hispanic shares grew.
# Getting exact difference
tibble(weighted.race - unweighted.race)
# White -12.0, black +3.9, hispanic +9.5, asian -0.38, other -1.01 (all in percentage points)
# f. What percentage of people said that they would be somewhat or very willing to pay higher taxes to pay for
# infrastructure improvements? What is this percentage when you only look at Republicans?
# What about when you only look at Democrats?
# You should use the survey weights and omit people who did not answer the question from your calculations.
# Segmenting for analysis
infra.support <- select(poll.data, c("taxes_improve_infrastructure", "party", "weight"))
# Removing data points based on criteria
# First, focusing in on 'somewhat willing' (1) and 'very willing' (2);
# %in% is needed here, since == 1:2 would recycle the comparison vector row by row
infra.support <- infra.support %>%
  filter(taxes_improve_infrastructure %in% c(1, 2))
# Also removing NAs for the sake of calculation
infra.support <- na.omit(infra.support)
count(poll.data, taxes_improve_infrastructure == "5") # 353 respondents chose the "no response" code
nrow(infra.support) / 15109 # Willing respondents over the 15,109 left after removing 353 "no response" + 67 NAs
# 43.5% of people are somewhat or very willing to pay taxes toward infrastructure projects
# Then, since our focus is dems/repubs, removing independents before splitting by party
infra.support <- infra.support %>%
  filter(party != 2)
wpct(infra.support$party) # Unweighted = 42.22% republican, 57.77% democrat
wpct(infra.support$party, infra.support$weight) # Weighted = 43.46% republican, 56.53% democrat
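# Note: the wpct() calls above give the party split *among* willing respondents.
# To get the share willing *within* each party, condition on party instead.
# A sketch, assuming party codes 1 = Republican and 3 = Democrat (matching the order
# of the wpct() outputs above; confirm with attributes(poll.data$party)):
answered <- poll.data %>%
  filter(!is.na(taxes_improve_infrastructure), taxes_improve_infrastructure != 5)
for (p in c(1, 3)) {
  party.rows <- filter(answered, party == p)
  print(weighted.mean(party.rows$taxes_improve_infrastructure %in% c(1, 2),
                      party.rows$weight, na.rm = TRUE))
}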
# g. Analyze the questions about trust in the state and federal governments.
# Use those variables, and others in the dataset, to find some interesting pattern or result in the data.
# What did you find?
# The pattern I'm looking for is a relationship between party affiliation and trust in the
# federal versus state governments: the conventional dem-vs-repub narrative predicts that
# more democrats trust the national government while republicans skew toward states' rights
# and smaller government.
# First, going to focus this data on individuals with strong support ("Just about always" and "Most of the time").
state.trust <- select(poll.data, c("trust_state_gov", "party", "weight"))
fed.trust <- select(poll.data, c("trust_fed_gov", "party", "weight"))
state.trust <- na.omit(state.trust)
fed.trust <- na.omit(fed.trust)
state.trust <- state.trust %>%
  filter(trust_state_gov %in% c(1, 2)) # %in% avoids the recycling bug of == 1:2
fed.trust <- fed.trust %>%
  filter(trust_fed_gov %in% c(1, 2))
# Comparing the party composition of high-trust respondents for each level of government
wpct(state.trust$party) # 47.27% republican, 11.31% independent, 41.4% democrat
wpct(state.trust$party, state.trust$weight) # 44.66% republican, 15.7% independent, 39.62% democrat
wpct(fed.trust$party)# 56.31% republican, 13.9% independent, 29.74% democrat
wpct(fed.trust$party, fed.trust$weight) # 51.89% republican, 18.6% independent, 29.47% democrat
# The result was surprising: the overall pattern is that republicans appear to trust the
# government more strongly at both the state and federal levels. Democrats made up a notably
# smaller share of those who "mostly" or "just about always" trust the federal government in
# particular, nearly 10 percentage points below their share among state-government trusters.
# So, despite the many tropes on social media and messages touted by pundits, this data
# indicates that republicans express less skepticism toward government than democrats do.
# It would also be useful to have data on cities (since states can be politically skewed,
# and republicans in cities may be less open about their affiliation) and on education as it
# relates to political alignment, since dems are historically more college-educated. That
# might paint a clearer picture of why these percentages came out as they did.
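# As a final robustness check (a sketch, using the same assumed party codes as above:
# 1 = Republican, 2 = independent, 3 = Democrat): compare trust rates *within* each party
# rather than the party makeup of high-trust respondents, since the two can differ when
# the parties contribute different numbers of respondents to the sample.
trust.rates <- poll.data %>%
  filter(!is.na(trust_fed_gov), !is.na(party)) %>%
  group_by(party) %>%
  summarize(pct.high.trust = weighted.mean(trust_fed_gov %in% c(1, 2), weight, na.rm = TRUE))
trust.rates # One weighted high-trust share per party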