---
title: "problem_set_2"
author: "Lydia Holley"
date: "2024-01-17"
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
# load needed libraries
library(naniar)
library(dplyr)
library(stringr)
```
# Step 1 - Import the Data
Import the two Star Wars data files: one tab-separated and one comma-separated.
```{r data import}
# load in data
sw1 <- read.delim("star_wars_data.1.tsv", header = TRUE, sep = "\t")
sw2 <- read.delim("star_wars_data.2.txt", header = TRUE, sep = ",")
```
# Step 2 - Clean up the data
Replace the placeholder string "unknown" with a proper `NA` in each column of the first dataset.
```{r clean data}
# count NAs before cleaning
sum(is.na(sw1))
# replace "unknown" with a proper NA in each column
sw1 <- replace_with_na(sw1, replace = list(
  hair_color = "unknown",
  height = "unknown",
  mass = "unknown",
  skin_color = "unknown",
  eye_color = "unknown",
  birth_year = "unknown",
  sex = "unknown",
  gender = "unknown",
  homeworld_species = "unknown"
))
# count NAs after cleaning
sum(is.na(sw1))
```
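As a compact alternative to listing each column, the same kind of replacement can be sketched in base R. The `toy` data frame and its values are made up for illustration; only the "unknown" placeholder comes from the chunk above.

```r
# Toy data (hypothetical values) with "unknown" placeholders and one true NA
toy <- data.frame(
  name       = c("A", "B", "C"),
  hair_color = c("brown", "unknown", NA),
  mass       = c("77", "unknown", "32"),
  stringsAsFactors = FALSE
)

na_before <- sum(is.na(toy))

# %in% returns FALSE for NA, so existing NAs are left alone
toy[] <- lapply(toy, function(col) replace(col, col %in% "unknown", NA))

na_after <- sum(is.na(toy))
c(before = na_before, after = na_after)
```

Comparing the two counts shows how many placeholders were converted, which is the same check the chunk above performs with `sum(is.na(sw1))`.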
## **Question 1 - How many \_NA_s are in the cleaned dataset**
There are 95 NAs in the cleaned dataset.

# Step 3 - Clean up the last column
Split the `homeworld_species` column into separate `homeworld` and `species` columns.
```{r split column}
# get homeworld and species columns
sw1$homeworld<-str_split_i(sw1$homeworld_species, "_", 1)
sw1$species<-str_split_i(sw1$homeworld_species, "_", 2)
# remove combined column
sw1<-subset(sw1, select = -homeworld_species)
# find answer to question 2 by indexing and counting rows
de_sp <- sw1[sw1$species %in% c('Droid', 'Ewok'), ]
nrow(de_sp)
```
## **Question 2 - How many species are either a Droid or an Ewok**
There are 7 characters whose species is either Droid or Ewok.
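The next chunk replaces the text "NA" with a real missing value. A minimal sketch of why that is needed, assuming the combined column stores missing pairs as text such as `NA_NA` (the example strings are made up):

```r
library(stringr)

# Hypothetical combined values; the last one encodes missing data as text
combined <- c("Tatooine_Human", "Naboo_Droid", "NA_NA")

homeworld <- str_split_i(combined, "_", 1)
species   <- str_split_i(combined, "_", 2)

# The pieces of "NA_NA" are ordinary strings, so is.na() does not flag them
is.na(homeworld)

# Convert the text "NA" to a real missing value
homeworld[homeworld %in% "NA"] <- NA
is.na(homeworld)
```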
```{r replace NA}
# replace improper NA with proper NAs
sw1<-replace_with_na(sw1,replace = list(homeworld="NA"))
sw1<-replace_with_na(sw1,replace = list(species="NA"))
# find number of distinct homeworlds, ignoring NAs
n_distinct(sw1$homeworld, na.rm = TRUE)
```
## **Question 3 - How many different home worlds are there? Do not include missing data**
There are 48 different home worlds.

# Step 4 - Edit the mass column
The "unknown" values in this column were already replaced with `NA` in Step 2.
```{r median}
# find the median mass, ignoring NAs
median(as.numeric(sw1$mass), na.rm = TRUE)
```
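The `as.numeric()` call matters when a column was read as text, which can happen when it originally contained entries such as "unknown". A small sketch with made-up values:

```r
# Hypothetical mass values stored as text, with one missing entry
mass_chr <- c("77", NA, "32", "140")

# Convert text to numbers so the median is computed numerically
mass_num <- as.numeric(mass_chr)

# na.rm = TRUE drops the missing entry before taking the median
med <- median(mass_num, na.rm = TRUE)
med
```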
## **Question 4 - What is the median mass of all the characters?**
The median mass of all the characters is 79.

# Step 5 - Edit star_wars_data.2.txt
Split the `films` column into one column per film, in the order the films are listed.
```{r film}
#split column
sw2$first<-str_split_i(sw2$films, "AND", 1)
sw2$second<-str_split_i(sw2$films, "AND", 2)
sw2$third<-str_split_i(sw2$films, "AND", 3)
sw2$fourth<-str_split_i(sw2$films, "AND", 4)
sw2$fifth<-str_split_i(sw2$films, "AND", 5)
sw2$sixth<-str_split_i(sw2$films, "AND", 6)
sw2$seventh<-str_split_i(sw2$films, "AND", 7)
# remove film column
sw2<-subset(sw2, select = -films)
#answer question 5 by indexing and counting rows
sec_film <- sw2[sw2$second %in% c(' Return of the Jedi',' Return of the Jedi '), ]
nrow(sec_film)
```
## **Question 5 - How many characters' 2nd film was "Return of the Jedi"?**
For 5 characters, the 2nd film was "Return of the Jedi".
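The comparison in the chunk above lists the title with different surrounding spaces because splitting on "AND" leaves whitespace around each piece. A hedged sketch of trimming that whitespace first, using a made-up `films` string:

```r
library(stringr)

# Hypothetical films entry in the same "AND"-separated layout
films <- "A New Hope AND Return of the Jedi AND The Force Awakens"

second_raw  <- str_split_i(films, "AND", 2)
second_trim <- str_trim(second_raw)

second_raw   # keeps the spaces on both sides
second_trim  # spaces removed, so one exact comparison is enough
second_trim == "Return of the Jedi"
```

Trimming once after the split would let later comparisons use a single spelling of each title, at the cost of changing the strings that the existing chunks match against.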
```{r one film}
# how many characters have NA in the "second" column
sum(is.na(sw2$second))
```
## **Question 6 - How many characters were only in 1 film?**
46 characters were only in 1 film.
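Another way to check this count, sketched here on made-up data, is to count the pieces produced by splitting each `films` entry. This assumes no title itself contains the text "AND".

```r
# Hypothetical films column in the "AND"-separated layout
films <- c(
  "A New Hope",
  "A New Hope AND Return of the Jedi",
  "The Phantom Menace"
)

# Number of films per character = number of pieces after splitting
n_films <- lengths(strsplit(films, "AND", fixed = TRUE))

# Characters that appear in exactly one film
sum(n_films == 1)
```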
# Step 6 - Merge the datasets
Combine the two datasets to answer the questions below
```{r combine datasets}
# join datasets
joined<-full_join(sw1, sw2, join_by(name == names))
# subset groups
one_movie<- joined[ which(is.na(joined$second)),]
more_movies<- joined[ which(!is.na(joined$second)),]
# find mean height of each subset (as.numeric guards against a text column)
mean(as.numeric(one_movie$height), na.rm = TRUE)
mean(as.numeric(more_movies$height), na.rm = TRUE)
```
## **Question 7 - Which group has a greater mean height - characters in 1 movie or characters in more than 1 movie?**
Characters in 1 movie have a greater mean height, but only by a very small amount (174.95 vs. 174.2683).
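Because `full_join()` keeps rows from both tables, a character missing from one file ends up with `NA` in the other file's columns and so lands in the one-movie group. A small sketch with made-up tables (the column names mirror the ones used above):

```r
library(dplyr)

# Hypothetical stand-ins for sw1 and sw2
people <- data.frame(name = c("A", "B", "C"), height = c(170, 180, 96))
movies <- data.frame(names = c("A", "B"), second = c(NA, " Return of the Jedi"))

joined_toy <- full_join(people, movies, join_by(name == names))

# "C" has no row in movies, so its second film is NA after the join
joined_toy

# anti_join() lists rows of people with no match in movies
unmatched <- anti_join(people, movies, join_by(name == names))
unmatched$name
```

Running `anti_join()` in both directions before the join is one way to see whether any names fail to match between the two files.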
```{r empire}
# subset the 2nd movie criteria
empire2<-joined[joined$second %in% c(' The Empire Strikes Back',' The Empire Strikes Back '), ]
# subset human criteria
emp_hum<-empire2[empire2$species %in% c('Human'),]
# find answer to question 8
nrow(emp_hum)
```
## **Question 8 - How many Human characters had their 2nd movie "The Empire Strikes Back"?**
6 human characters had their 2nd movie "The Empire Strikes Back".
```{r ratio}
# only keep characters that appear in a second movie
# we already have this from earlier, it is more_movies
# titles from the original trilogy, with the spacing left over from the split
original_trilogy <- c(' The Empire Strikes Back', ' The Empire Strikes Back ',
                      ' Return of the Jedi', ' Return of the Jedi ',
                      ' A New Hope', ' A New Hope ')
# subset into early movies
early <- more_movies[more_movies$second %in% original_trilogy, ]
# subset into later movies
later <- more_movies[!(more_movies$second %in% original_trilogy), ]
#find early ratio
emass<-median(as.numeric(early$mass), na.rm = T)
eheight<-median(as.numeric(early$height), na.rm = T)
emass/eheight
#find later ratio
lmass<-median(as.numeric(later$mass), na.rm = T)
lheight<-median(as.numeric(later$height), na.rm = T)
lmass/lheight
```
## **Question 9 - You hypothesize that the ratio of mass/height is smaller in characters from the earlier movies. Which group has a smaller median mass/height ratio - characters whose second movie was in the original trilogy ("The Empire Strikes Back," "Return of the Jedi," "A New Hope") or those who had a second movie from the later series?**
Characters who had a second movie from the later series had a smaller median mass/height ratio.
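The chunk above divides the median mass by the median height. Another possible reading of the question is the median of each character's own mass/height ratio. The two can differ, as this sketch with made-up numbers shows:

```r
# Hypothetical values for three characters
mass   <- c(50, 80, 120)
height <- c(100, 200, 150)

# Ratio of medians (the approach used in the chunk above)
ratio_of_medians <- median(mass) / median(height)

# Median of per-character ratios (an alternative reading)
median_of_ratios <- median(mass / height)

c(ratio_of_medians = ratio_of_medians, median_of_ratios = median_of_ratios)
```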