forked from tbruefach/CAnD3-Data-Activity
---
title: "Rachel's Replication: Clean Data & Handle Missing"
author: "Rachel Ganly"
date: "9/27/2021"
output: html_document
---
Clear the workspace, then:

1. Load required packages for analyses.
2. Convert data from the GSS into a tibble.
3. Clean and code variables to be used in the analyses.
4. Drop all variables but the ones used in analyses.
5. Label those variables.

```{r}
# Follow Tyler's instructions from the shell file
# 1. Load required packages for analyses
# Clear the workspace
rm(list = ls())
library(tidyverse)
library(haven)
library(skimr)
library(naniar)
library(Hmisc)
library(sjlabelled)
library(gt)
library(gtsummary)
# Rachel's comment: found these packages in the README list on GitHub
```
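
Before loading the packages, it can help to check which of them are actually installed. The following is a minimal sketch, not part of the original instructions; the names `pkgs` and `missing_pkgs` are made up for illustration.

```r
# Packages loaded in the chunk above
pkgs <- c("tidyverse", "haven", "skimr", "naniar",
          "Hmisc", "sjlabelled", "gt", "gtsummary")

# requireNamespace() returns FALSE (instead of raising an error) when a package is absent
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}
```

Any package reported here would need `install.packages()` before the `library()` calls can succeed.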
```{r}
# 2. Convert data from the GSS into a tibble.
gss_rachel <- read.csv("gss_2017.csv") # Loaded file gss_2017 per the instructions
gss_rachel <- as_tibble(gss_rachel)
# 3. Clean and code variables to be used in the analyses.
# Rachel's comment: could not find a list of the variables to use, except by reading the code
gss_rachel <- gss_rachel %>%
  select(c(SRH_110,    # Self-Rated Health
           SRH_115,    # Self-Rated Mental Health
           EHG3_01B,   # Highest Educational Attainment
           AGEC,       # Respondent's Age in Years
           SEX,        # Respondent's Sex
           MARSTAT,    # Marital Status
           FAMINCG2,   # Family Income
           CHRINHDC,   # Number of Children in Household
           VISMIN,     # Visible Minority
           WGHT_PER))  # Individual Survey Weight
# Create new variables
# Self-Rated Health (Ordinal)
#   Coded Excellent (1) - Poor (5)
# Self-Rated Mental Health (Ordinal)
#   Coded Excellent (1) - Poor (5)
# Educational Attainment (Ordinal)
#   Coded HS or Less (1) | Some College (2) | Bachelor's + (3)
# Age in Years (Continuous; capped at 80)
# Sex (Binary)
#   Coded Male (0) | Female (1)
# Marital Status (Changing from Nominal to Dichotomous)
#   Married/Common Law = Married (1) | All Else = Not Married (0)
# Family Income (Ordinal)
#   Values are: 1) Less than $25k; 2) $25k to $49.999k; 3) $50k to $74.999k;
#   4) $75k to $99.999k; 5) $100k to $124.999k; 6) $125k or more
# Number of Children in Household (Discrete)
# Visible Minority (Binary)
#   Coded Yes (1) | No (0)
gss_rachel <- gss_rachel %>%
  mutate(srh = SRH_110,
         mental_srh = SRH_115,
         education = case_when(EHG3_01B <= 2 ~ 1,
                               EHG3_01B >= 3 & EHG3_01B <= 5 ~ 2,
                               EHG3_01B >= 6 & EHG3_01B <= 7 ~ 3),
         age = AGEC,
         sex = case_when(SEX == 1 ~ 0,
                         SEX == 2 ~ 1),
         marital_status = case_when(MARSTAT == 1 | MARSTAT == 2 ~ 1,
                                    MARSTAT >= 3 & MARSTAT <= 6 ~ 0),
         family_income = as.integer(FAMINCG2),    # Family Income
         number_children = as.integer(CHRINHDC),  # Number of Children in Household
         visible_minority = case_when(VISMIN == 1 ~ 1,
                                      VISMIN == 2 ~ 0),  # Visible Minority
         individual_survey_weight = WGHT_PER)     # Individual Survey Weight
# LABELLING VARIABLES AND VALUES
# 4. Drop all variables but the ones used in analyses.
# Rachel's note: I could only find the list of variables to use by looking at the code.
# I had already dropped most variables earlier, so I initially skipped this step and
# went back to add it later.
# 5. Label variables.
# No notes on labelling of variables except by looking at the code
# Save the cleaned data. save() writes R's binary format regardless of the
# file extension, so write_csv() is used here to produce an actual CSV file.
write_csv(gss_rachel, "GSS_Cleaned_Rachel.csv")
```
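
The recoding above relies on `case_when()` returning `NA` for any value that matches none of its conditions, which is how non-response codes end up missing during cleaning. The sketch below illustrates that on toy data. The codes 97 and 98 are assumed stand-ins for non-response, not values confirmed from the GSS codebook, and `toy` and `toy_recoded` are hypothetical names.

```r
library(dplyr)

# Toy data: hypothetical raw codes, not real GSS records
toy <- tibble(
  EHG3_01B = c(1, 4, 7, 97),  # 97 is an assumed "not stated" code
  MARSTAT  = c(1, 3, 6, 98)   # 98 is an assumed non-response code
)

toy_recoded <- toy %>%
  mutate(
    education = case_when(
      EHG3_01B <= 2                 ~ 1,
      EHG3_01B >= 3 & EHG3_01B <= 5 ~ 2,
      EHG3_01B >= 6 & EHG3_01B <= 7 ~ 3
    ),
    marital_status = case_when(
      MARSTAT %in% c(1, 2) ~ 1,
      MARSTAT %in% 3:6     ~ 0
    )
  ) %>%
  # Step 4 in miniature: keep only the recoded analysis variables
  select(education, marital_status)

toy_recoded$education       # 1 2 3 NA
toy_recoded$marital_status  # 1 0 0 NA
```

Values outside every condition fall through to `NA`, so no explicit missing-value recode is needed for these variables.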
```{r}
# Handle Missing Data
# 1. Recode missing values of self-rated health and self-rated mental health
#    (other measures were assigned missing values during the cleaning phase).
# 2. Create an index called "sampmiss" that counts how many variables each
#    respondent has missing values on.
# 3. Create a dataset called "sample" that only contains cases with no missing values.
# Rachel's comment: I could not work out how to do this from the instructions or
# comments (for example, how the values were recoded in step 1), so I had to read
# and copy the code.
# CODING DATA VALUES AS MISSING
# Note: Variables not included have no missing data or were already coded
# in the cleaning stage
gss_rachel <- gss_rachel %>%
  replace_with_na(replace = list(srh = c(7, 8, 9),
                                 mental_srh = c(7, 8, 9)))
# CREATING SAMPLE VARIABLE FOR ANALYSES
# Will use this variable to filter out cases missing any data
gss_rachel <- gss_rachel %>% mutate(sampmiss = rowSums(is.na(.)))
gss_rachel %>% count(sampmiss)
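# SAMPLE DATASET WHICH EXCLUDES MISSING DATA (LISTWISE DELETION)
sample <- gss_rachel %>% filter(sampmiss == 0)
# Save the complete-case dataset (not gss_rachel, which still contains missing
# values) as an actual CSV file; save() would write R's binary format instead.
write_csv(sample, "GSS_NoMissing_Rachel.csv")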
```
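
The three missing-data steps can be traced on a small toy dataset. This is a sketch only. It swaps in a dplyr-only recode (`if_else()` inside `across()`) in place of `naniar::replace_with_na()` so that it runs with dplyr alone, and `toy` and `complete_cases` are hypothetical names.

```r
library(dplyr)

# Toy data: codes 7, 8 and 9 are treated as non-response, as in the chunk above
toy <- tibble(
  srh        = c(1, 7, 3, 5),
  mental_srh = c(2, 2, 9, 4),
  education  = c(1, NA, NA, 3)  # already set to NA during cleaning
)

toy <- toy %>%
  # Step 1: recode the non-response codes to NA
  mutate(across(c(srh, mental_srh),
                ~ if_else(.x %in% c(7, 8, 9), NA_real_, .x))) %>%
  # Step 2: count the missing values in each row
  mutate(sampmiss = rowSums(is.na(.)))

toy$sampmiss  # 0 2 2 0

# Step 3: listwise deletion keeps only rows with no missing values
complete_cases <- toy %>% filter(sampmiss == 0)
nrow(complete_cases)  # 2
```

Because `rowSums(is.na(.))` counts across every column present at that point, any column left in the data contributes to `sampmiss`, including variables that are not part of the analysis.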