-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path01-preparing-data.qmd
3225 lines (2463 loc) · 120 KB
/
01-preparing-data.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
# Preparing Data {#sec-chap01}
```{r}
#| label: setup
#| include: false
base::source(file = "R/helper.R")
```
## Achievements to unlock
:::::: {#obj-chap01}
::::: my-objectives
::: my-objectives-header
Objectives for chapter 01
:::
::: my-objectives-container
**SwR Achievements**
- (~~Observations and variables~~)
- Using reproducible research practices (@sec-chap01-reproducibility)
- (~~Understanding and changing data types~~)
- Entering or loading data into R (@sec-chap01-import-data)
- Identifying and treating missing values (Data wrangling) (@sec-chap01-data-wrangling)
- Building a basic bar chart (Replicate Figure 1.1 and 1.2) (@sec-chap01-repr-bar-charts)
I will skip the crossed out learning objectives in parenthesis as I know already these procedures. However I will modify and elaborate some of these achievements as mentioned in the parentheses.
I will add other objectives that resulted from questions that arose during reading the book.
**Questions that resulted to additional objectives**
- How to download data directly from the `r glossary("General Social Survey")` (GSS) web page? {@sec-chap01-get-gss-data}
- How to work with labelled data? {@sec-chap01-labelled-data}
- Reviewing different packages for creating tables (@sec-chap01-review-table-packages)
- Introducing {**gtsummary**}, a packages for presentation-ready data summary and analytic result tables (see: @sec-gtsummary @sec-chap01-experiments-gtsummary)
- Developing timeline graph (1973-2022) about opinions of respondents from `r glossary("GSS")` surveys about legalizing marijuana (@sec-chap01-grass-timeline)
:::
:::::
Achievements for chapter 01
::::::
## Using reproducible research practices {#sec-chap01-reproducibility}
### Script files
SwR explains writing script files, but I am using `r glossary("Literate Programming")` with Quarto. This has the consequence that in addition to short comments inside code cells I have the possibility to write extensively in the other parts of the file about approach, code, results etc.
A practical advice for scripts is to include a `r glossary("prolog")`. Possible prolog sections:
- Project name
- Project purpose
- Name(s) of data set(s) used in the project
- Location(s) of data set(s) used in the project
- Code author name (you!)
- Date code created
- Date last time code was edited
Most of these information are naturally occurring in the writing process of Quarto books.
:::::: my-resource
:::: my-resource-header
::: {#lem-chap01-literate-programming}
: Literate Statistical Programming
:::
::::
::: my-resource-container
- Literate Programming: ([Wikipedia](https://en.wikipedia.org/wiki/Literate_programming))
- Introduction to Literate Programming with Quarto ([Online Slides](https://gesiscss.github.io/quarto-workshop/material/slides/01_introduction.html#/title-slide))
- Reproducibility and literate programming in R ([bookdown course](https://exeter-data-analytics.github.io/LitProg/index.html))
- Introduction to Data Science in R for Biologists (Module on [Literate Statistical Programming and Quarto](https://mbutler808.github.io/rclass/posts/2023-01-26-intro-quarto/index.html))
- Let’s build a blog with Quarto [Literate programming in Quarto](https://ivelasq.quarto.pub/building-a-blog-with-quarto/workflow/write-docs/)) by Isabella Velásquez. The site has other material (for Quarto blogs) as well: [Migrate from R Markdown](https://ivelasq.quarto.pub/building-a-blog-with-quarto/learn-more/migrate-blog/), [Additional resources](https://ivelasq.quarto.pub/building-a-blog-with-quarto/learn-more/resources/)
- Introduction to literate programming with Quarto and Markdown by Gesis ([Slides](https://gesiscss.github.io/quarto-workshop/material/slides/01_introduction.html#/title-slide))
:::
::::::
### Naming objects
I am used to apply the [tidyverse style guide](https://style.tidyverse.org/). It requires to use underlines ("snake_code") as separators in object names. (In contrast to "camelCase" code style). But reading the book I thought it might be a good idea to use special additional styles for certain specific objects.
- **Naming constants**: Prefix name of constants with `k_`.
- **Naming variables**: Standard snake code.
- **Naming functions**: Prefix name of private functions with a dot `.`. I had already experienced that didn't know from which package a function was. Only to learn after looking around for minutes that it was a function I wrote myself!
- **Naming data frames**: Prefix name with `df_` for data.frame and `dt_` for tibble. I might also use a suffix to refer to the status e.g., `_raw` (raw data), `_clean` (cleaned data), `_v2` (version number).
- **Naming files**: It could be helpful to add at the start the chapter number e.g. `chap02_`. And maybe also --- as in naming data frames --- the status as suffix.
## Import data frames from outside resources {#sec-chap01-import-data}
R has many possibilities to import data from other statistical packages.
### Some common file extensions
- **.csv**: comma separated values
- **.txt**: text file
- **.xls or .xlsx**: Excel file
- **.sav**: SPSS file
- **.sasb7dat**: SAS file
- **.xpt**: SAS transfer file
- **.dta**: Stata file
### Some packages for import data sources
- {**readr**}: Read Rectangular Text Data (see: @sec-readr), it is part of {**tidyverse**}
- {**vroom**}: Read and Write Rectangular Text Data Quickly
- {**haven**}: Import and Export 'SPSS', 'Stata' and 'SAS' Files (see: @sec-haven)
- {**foreign**}: Read Data Stored by 'Minitab', 'S', 'SAS', 'SPSS', 'Stata', 'Systat', 'Weka', 'dBase', ...
- {**readxl**}: Read Excel Files
- {**openxslx**}: Read, Write and Edit xslx Files
- {**readODS**}: Read and Write ODS Files (e.g. LibreOffice)
- {**clipr**}: Read and Write from the System Clipboard
I will not go into the import details of all the different packages here, because my focus is on the `r glossary("General Social Survey")` (GSS) data.
### Import data with a .csv file
> “While the GSS data can be read into R directly from the GSS website, Kiara had experienced this and knew that it could be frustrating.” ([Harris, 2020](zotero://select/groups/5254842/items/9N29QMJB)) ([pdf](zotero://open-pdf/groups/5254842/items/3NDRGBBW?page=107&annotation=SFD9FHQD))
The author therefore offers for this chapter a `r glossary("CSV", ".csv")` file with the data. In later chapters learner can choose to use the provided files from the [SAGE webpage](https://edge.sagepub.com/harris1e/student-resources/datasets-and-r-code). Even if these data files are not yet cleaned, it is a kind of cheating, because it bypasses downloading data from the original source.
The `pb_create_folder()` function in the following code chunk checks if the folder already exists. If this is the case it leaves it untouched, otherwise it creates the folder at the desired path. The code for my private functions are in the file `helper.R` inside the `R` folder at root level of this working directory.
:::::::::::::::::: my-example
:::: my-example-header
::: {#exm-chap01-read-csv-show-data}
: Read data from a .csv file into R
:::
::::
::::::::::::::: my-example-container
:::::::::::::: panel-tabset
###### summary()
::::::: my-r-code
:::: my-r-code-header
::: {#cnj-code-name-a}
: Read .csv file and show data summary
:::
::::
:::: my-r-code-container
::: {#lst-chap01-read-csv-show-summary}
```{r}
#| label: read-csv-show-summary
#| cache: true
pb_create_folder(base::paste0(here::here(), "/data/"))
pb_create_folder(base::paste0(here::here(), "/data/chap01"))
gss_2016_book <- readr::read_csv(
file = "data/chap01/legal_weed_age_GSS2016_ch1.csv",
show_col_types = FALSE)
summary(gss_2016_book)
```
Read the provided dataset as a .csv file into R and show `base::summary()`
:::
------------------------------------------------------------------------
**Some comments**
1. In contrast to `base::read.csv()` in the book I used with `readr::read_csv()` a function from the {**tidyverse**} package collection.
2. I added the `show_col_types = FALSE` argument to prevent a message about the column specification.
::::
:::::::
The output of `base::summary` is in this case not very informative. Looking around for appropriate reporting function I developed with `my_glance_data()` my private command. (See @lst-chap01-read-csv-glance-data in next tab.)
###### my_glance_data()
::::::: my-r-code
:::: my-r-code-header
::: {#cnj-code-name-b}
Look at data with my private function `my_glance_data()`
:::
::::
:::: my-r-code-container
::: {#lst-chap01-read-csv-glance-data}
```{r}
#| label: glance-data
gss_2016_book |>
dplyr::select(c(age, grass)) |>
my_glance_data(N = 8, seed = 2016)
```
Display 8 randomly chosen rows in addition to the first and last row
:::
::::
:::::::
I developed a private function `my_glance_data()` to provide a first impression about the data. The function prints the first and last row of the dataset, so you know implicitly how many rows the dataset has. Between the first and last row the function adds randomly `N` other rows (default is 8). Additionally it provides the row number of the data in a separate column. (The column header `obs` stands for "observation".) For reproducibility purposes you can also add a number for the `set.seed()` command.
The idea of `my_glance_data()` is to get a first impression of the dataset. Other printing methods show only the first (or last) rows of the dataset. This could be misleading, giving a wrong impression about typical data.
###### my_glance_data
Maybe you are interested to use `my_glance_data()` yourself? It isn't available through a package, but you can copy the source code from the next R chunk.
I have saved the function in a `.qmd` file one level higher as this book (and all my other R projects). With `{{< include "../_common.qmd" >}}` I have the code integrated in this book. (For the `include` shortcode see the section [Includes](https://quarto.org/docs/authoring/includes.html) of the Quarto documentation.)
::: {#lst-chap01-show-function-glance-data}
```{r}
#| label: show-function-glance-data
#| code-fold: show
#| eval: false
### function my_glance_data ##############
my_glance_data <- function(df, N = 8, seed = 42){
df_temp <- first_and_last_row(df)
set.seed(seed)
df |>
dplyr::mutate(obs = dplyr::row_number()) |>
dplyr::relocate(obs) |>
dplyr::slice_sample(n = N) |>
dplyr::bind_rows(df_temp) |>
dplyr::arrange(obs)
}
first_and_last_row <- function(df) {
df |>
dplyr::mutate(obs = dplyr::row_number()) |>
dplyr::filter(dplyr::row_number() %in% base::c(1, dplyr::n()))
}
```
Code of my private function `glance_date()`
:::
::::::::::::::
:::::::::::::::
::::::::::::::::::
------------------------------------------------------------------------
## Data Wrangling {#sec-chap01-data-wrangling}
### ToDo List
After we saved the data we need to do some data wrangling. To replicate the data structure for the book Figure 1.2 we need to:
- filter the dataset to the year 2016 (in the case you work with the full GSS dataset 1972-2022, which we won't)
- select only the variables `age` and `grass` to work with
- drop all NA’s
- convert `grass` into factor
- recode `grass` labels
- convert `age` from double to numeric
- divide `age` into appropriate age intervals and label them accordingly
------------------------------------------------------------------------
::: {#fig-chap01-bar-charts layout-ncol="2"}
![Support for marijuana legalization among participants in the 2016 General Social Survey (Figure 1.1)](img/chap01/fig-01-01-min.png){#fig-01-01}
![Support for marijuana legalization by age group among participants in the 2016 General Social Survey (Figure 1.2)](img/chap01/fig-01-02-min.png){#fig-01-02}
We will replicate these two bar charts (Figure 1.1 and Figure 1.2 in the book)
:::
------------------------------------------------------------------------
### Replicating data structure for bar charts
:::::::::::::: my-example
:::: my-example-header
::: {#exm-chap01-repl-data-bar-charts}
: Replicate data structure for bar charts
:::
::::
::::::::::: my-example-container
::::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-repl-data-fig-book}
: Replicate data structure for Figures 1.1 and 1.2 (Book)
:::
::::
:::: my-r-code-container
::: {#lst-repl-data-fig-book}
```{r}
#| label: replicate-data-fig-book
#| results: hold
gss_2016 <- readr::read_csv(
file = "data/chap01/legal_weed_age_GSS2016_ch1.csv",
show_col_types = FALSE)
gss_2016_clean <- gss_2016 |>
#### (A) rework grass ##################
## convert grass to factor
dplyr::mutate(grass = forcats::as_factor(grass)) |>
## change DK and IAP to NA
dplyr::mutate(grass = dplyr::na_if(x = grass, y = "DK")) |>
dplyr::mutate(grass = dplyr::na_if(x = grass, y = "IAP")) |>
## drop unused levels "DK" and "IAP"
dplyr::mutate(grass = forcats::fct_drop(grass)) |>
## convert to factor and recode it
dplyr::mutate(grass = forcats::fct_recode(
grass, "Yes" = "LEGAL", "No" = "NOT LEGAL")) |>
#### (B) rework age #########################
## change data types and recode
## turn character type of age into factor and recode "89 OR OLDER" to "89"
# dplyr::mutate(age = dplyr::recode(age, "89 OR OLDER" = "89")) |>
dplyr::mutate(age = dplyr::case_match(age,
"89 OR OLDER" ~ "89",
.default = age)) |>
## convert data type of age from factor to numeric
dplyr::mutate(age = base::as.numeric(age)) |>
## cut age into several defined age groups
dplyr::mutate(age_cat = base::cut(age,
breaks = c(-Inf, 29, 59, 74, Inf),
labels = c("< 30", "30 - 59", "60 - 74", "75+" ))) |>
## drop NA's
tidyr::drop_na()
## save cleaned data gss_2016
save_data_file("chap01", gss_2016_clean, "gss_2016_clean.rds")
## (C) check result #########
## summarize
print("************************ SUMMARY ****************************")
base::summary(gss_2016_clean)
## glance at the data
print("******************* GLANCE AT SOME DATA **********************")
gss_2016_clean |>
dplyr::select(c(age_cat, grass)) |>
my_glance_data(N = 8, seed = 2016)
```
Replicate data structure for Figures 1.1 and 1.2 of the book)
:::
------------------------------------------------------------------------
The somewhat strange cut decision for age groups was motivated by the question if there is a difference between young and old voters about marijuana legalization.
Compare the result of the recoding with the original data structure in @lst-chap01-read-csv-show-summary.
::::
:::::::
**Some comments**
1. I have included all data wrangling changes for the Figure 1.1 even if they appeared in the book section where the graph is prepared.
2. I have used `|>` form the native R pipe instead of `%>%` exported into tidyverse from the {**magrittr**} package.
3. Otherwise whenever possible I have used {**tidyverse**} code. For `base::as.numeric()` and `base::cut()` I couldn't find {**tidyverse**} equivalents.
::::: my-watch-out
::: my-watch-out-header
WATCH OUT! Difference between `forcats::fct_recode()` and `dplyr::recode()`. Use `dplyr::case_match()`!
:::
::: my-watch-out-container
To recode `age` from "89 OR OLDER" to "89" I used at first `forcats::fct_recode()`. But this function changes the 72 *levels* of the factor `age` and resulted --- after I changed `age` to numeric in the next line --- in a data frame where `age` ranged from 1 to 72!
After I noticed the problem and wanted to replace the {**forcats**}-line with `dplyr::mutate(age = dplyr::recode(age, "89 OR OLDER" = "89"))` I saw in the help file that
> `recode()` is superseded in favor of `case_match()`, which handles the most important cases of `recode()` with a more elegant interface. ([Recode values](https://dplyr.tidyverse.org/reference/recode.html)) and
> \[`dplyr::case_match()`\] allows you to vectorise multiple switch() statements. Each case is evaluated sequentially and the first match for each element determines the corresponding value in the output vector. If no cases match, the .default is used. ([A general verctorized switch()](https://dplyr.tidyverse.org/reference/case_match.html))
:::
:::::
@lst-repl-data-fig-book used some important packages:
I used {**dplyr**} in the above code heavily. {**dplyr**} (@sec-dplyr) is part of the {**tidyverse**} core collection (@sec-tidyverse), together with {**forcats**} (@sec-forcats), {**ggplot2**} (@sec-ggplot2), {**readr**} (@sec-ggplot2), {**tibble**} (@sec-tibble), {**tidyr**} (@sec-tidyr), {**stringr**} (@sec-stringr) and {**purrr**} (@sec-purrr).
:::::::::::
::::::::::::::
## Reproducing bar charts {#sec-chap01-repr-bar-charts}
::::::::::::::: my-example
:::: my-example-header
::: {#exm-chap01-repr-bar-charts}
: Reproducing bar charts
:::
::::
:::::::::::: my-example-container
::::::::::: panel-tabset
###### Figure 1.1
:::::: my-r-code
:::: my-r-code-header
::: {#cnj-fig-chap01-repr-fig-1-1}
a: Reproducing bar chart Figure 1.1 using `geom_bar()`
:::
::::
::: my-r-code-container
```{r}
#| label: fig-chap01-repr-fig-1-1
#| fig-cap: Support for marijuana legalization among participants in the 2016 General Social Survey (Figure 1.1)
#| fig-width: 4
#| fig-height: 4
gss_2016_clean <- base::readRDS("data/chap01/gss_2016_clean.rds")
## create figure 1.1 #######
gss_2016_clean |>
ggplot2::ggplot(ggplot2::aes(x = grass,
y = 100 * ggplot2::after_stat(count) /
base::sum(ggplot2::after_stat(count)),
fill = grass)) +
ggplot2::geom_bar() +
ggplot2::theme_minimal() +
ggplot2::scale_fill_manual(values = c("#79a778", '#7662aa'),
guide = "none") +
ggplot2::labs(x = "Should marijuana be legal?",
y = "Percent of responses")
```
------------------------------------------------------------------------
I changed the code slightly because of two warnings. Newer versions of {**ggplot2**} have deprecated some older functions:
- Warning: The dot-dot notation (`..count..`) was deprecated in {**ggplot2**} version 3.4.0. Please use `after_stat(count)` instead.
- Warning: The `guide` argument in `scale_*()` cannot be `FALSE`. This was deprecated in {**ggplot2**} version 3.3.4. Please use "none" instead. (I have slightly edited both warnings)
:::
::::::
Compare the reproduction with the original @fig-01-01.
###### Figure 1.2
:::::: my-r-code
:::: my-r-code-header
::: {#cnj-fig-chap01-repr-fig-1-2}
b: Reproducing bar chart Figure 1.2 using `geom_col()`
:::
::::
::: my-r-code-container
```{r}
#| label: fig-chap01-repr-fig-1-2
#| fig-cap: Support for marijuana legalization by age group among participants in the 2016 General Social Survey (Figure 1.2)
gss_2016_clean <- base::readRDS("data/chap01/gss_2016_clean.rds")
## create figure 1.2 ########
gss_2016_clean |>
dplyr::group_by(grass, age_cat) |>
dplyr::count() |>
dplyr::group_by(age_cat) |>
dplyr::mutate(perc_grass = 100 * n / base::sum(n)) |>
ggplot2::ggplot(ggplot2::aes(x = age_cat, fill = grass,
y = perc_grass)) +
ggplot2::geom_col(position = 'dodge') +
ggplot2::theme_minimal() +
ggplot2::scale_fill_manual(values = c('#79a778', '#7662aa'),
name = "Should marijuana\nbe legal?") +
ggplot2::labs(x = "Age group (in years)",
y = "Percent of responses in age group")
```
------------------------------------------------------------------------
@fig-chap01-repr-fig-1-1 uses `ggplot2::geom_bar()` whereas this figure here calls the `ggplot2::geom_col()` function.
> There are two types of bar charts: `geom_bar()` and `geom_col()`. `geom_bar()` makes the height of the bar proportional to the number of cases in each group … . If you want the heights of the bars to represent values in the data, use `geom_col()` instead. `geom_bar()` uses `stat_count()` by default: it counts the number of cases at each x position. `geom_col()` uses `stat_identity()`: it leaves the data as is. ([Bar charts](https://ggplot2.tidyverse.org/reference/geom_bar.html) with {**ggplot2**})
:::
::::::
Compare the reproduction with the original @fig-01-02.
:::::::::::
::::::::::::
:::::::::::::::
------------------------------------------------------------------------
## Exercises
::::: my-objectives
::: my-objectives-header
Exercises for the Coder & Hacker edition
:::
::: my-objectives-container
**SwR objectives**
1. Visit the NHANES website and find the online codebook for the 2013–2014 data (@sec-visit-nhanes-website)
2. Open the 2013–2014 NHANES data file saved as `nhanes_2013_ch1.csv` with the book materials at [edge.sagepub.com/harris1e](https://edge.sagepub.com/harris1e) (Achievement 1) (@sec-chap01-import-clean-data in tab "read_csv()")
3. Examine the data types for DUQ200, RIDAGEYR, and RIAGENDR, and fix data types if needed based on the NHANES codebook (Achievements 2 and 3) (@sec-chap01-import-clean-data in tab "View data" and tab "Codebook")
4. Based on the online NHANES codebook, code missing values appropriately for DUQ200, RIDAGEYR, and RIAGENDR (Achievement 4) (@sec-chap01-import-clean-data in tab "Data cleaning")
5. Create a bar chart showing the percentage of NHANES participants answering yes and no to marijuana use (Achievement 5) (@sec-chap01-bar-chart-grass in tab "Solution")
6. Recode age into a new variable called `age_cat` with 4 categories: 18–29, 30–39, 40–49, 50–59 (Achievement 7) (@sec-chap01-grass-age-sex in tab "Numbers")
7. Create a bar chart of marijuana use by age group (Achievement 6) (@sec-chap01-grass-age-sex in tab "Age")
8. ~~Add a prolog and comments to your code (Achievement 1)~~ (I am using `r glossary("literate programming")` in writing this book.)
9. Create a bar chart of marijuana use by age group and sex with side-by-side bars (Achievement 5) (@sec-chap01-grass-age-sex in tab "Age and sex")
10. Following the R code in your code file, use comments to describe what you found. Given what you found and the information in the chapter, what do you predict will happen with marijuana legalization in the next 10 years? Discuss how the omission of older people from the marijuana use question for NHANES influenced your prediction. Write your prediction and discussion in comments at the end of your code file (@sec-chap01-data-interpretation).
**My own additional exercises**
11. Download and work with SAS .xpt file data (Demo) {@sec-download-xpt-file}
12. Show table of frequencies and percentages (@sec-chap01-bar-chart-grass in Tab "Numbers")
13. Make bars thinner (@sec-chap01-bar-chart-grass in Tab "Thin bars")
14. Make bars thinner and adjust spacing accordingly (@sec-chap01-bar-chart-grass in Tab "Solution")
:::
:::::
### Visiting the NHANES website {#sec-visit-nhanes-website}
Use the National Health and Nutrition Examination Survey (`r glossary("NHANES")`) data to examine marijuana use in the United States. Spend a few minutes looking through the NHANES website (<https://www.cdc.gov/nchs/nhanes/index.htm>) before you begin, including finding the online codebook for the 2013–2014 data. Complete the following tasks to explore whether age is related to marijuana use in the United States.
The 2013–2014 codebook is on the page [NHANES 2013-2014](https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=2013). On this page are the links to different types of data, documentations or codebooks.
You can choose from
:::::: {#bul-NAHNES-data-sections}
::::: my-bullet-list
::: my-bullet-list-header
Bullet List: NHANES has six different data sections
:::
::: my-bullet-list-container
- Demographics Data
- Dietary Data
- Examination Data
- Laboratory Data
- Questionnaire Data
- Limited Access Data
:::
:::::
NHANES has six different data sections
::::::
------------------------------------------------------------------------
Clicking of one of these links opens a page where a table contains the following information:
:::::: {#bul-NHANES-data-pages}
::::: my-bullet-list
::: my-bullet-list-header
Bullet List: NHANES data pages has download links for documentation and data files
:::
::: my-bullet-list-container
- Data File Name
- Doc File
- Data File
- Data Published
:::
:::::
NHANES data pages has download links for documentation and data files
::::::
------------------------------------------------------------------------
[Datasets and Documentation](https://wwwn.cdc.gov/nchs/nhanes/tutorials/Datasets.aspx) and [Frequently Asked Questions (FAQs)](https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/faq.aspx) are important introductory pages.
NHANES data are saved in a [SAS transport (.XPT) file](https://www.loc.gov/preservation/digital/formats/fdd/fdd000464.shtml). These file can be read with `foreign::read.xport()` and `haven::read_xpt()`. There is also a [page with example code](https://wwwn.cdc.gov/nchs/nhanes/tutorials/Datasets.aspx) to download/import NHANES data files. I will use {**haven**} instead of the recommended {**foreign**} package.
### Download NHANES SAS .xpt file data (Demo) {#sec-download-xpt-file}
:::::::::::::::::::::::::::::::: my-experiment
:::: my-experiment-header
::: {#def-chap01-xpt-file-download}
: Download and work with NHANES SAS .xpt file data (Demo)
:::
::::
::::::::::::::::::::::::::::: my-experiment-container
:::::::::::::::::::::::::::: panel-tabset
###### Get .xpt file
::::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-save-xpt-file-demo}
: Save a SAS transport (.xpt) file from the NHANES site for demo purposes and display the structure of the file
:::
::::
:::: my-r-code-container
::: {#lst-chap01-downloadfile}
```{r}
#| label: save-xpt-file-demo
#| cache: true
#| eval: false
## download only once (manually)
utils::download.file(
"https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/P_BPXO.XPT",
tf <- base::tempfile(),
mode = "wb"
)
blood_pressure_demo <- haven::read_xpt(tf)
save_data_file("chap01", blood_pressure_demo, "blood_pressure_demo.rds")
```
Download a file and save it as R object
:::
(*This code chunk has no visible output*)
With the exception of using (**haven**) instead of {**foreign**}, I followed the instruction from the [NHANES recommendation](https://wwwn.cdc.gov/nchs/data/tutorials/file_download_import_R.R). We see labelled data as it was expected from the import with {**haven**}.
In @lst-chap01-get-gss2016-automated I have already used the `utils::download.file()` function: There are some differences:
- @lst-chap01-get-gss2016-automated downloaded a STATA `.zip` file, therefore I used `unz()` to unzip the file and to read `haven::read_dta()` instead `haven::read_xpt()`.
- More important however was the `base::unlink(temp)` command to unlink the temporary file, which is not done here.
- Another difference was using the argument `mode = "wb"` for writing binary files. As I understood this is only important for Windows user. On my Mac I could stick with the default value `mode = "w"`. But for reproducibility it would be better to use in the future the more general option with `mode = "wb"`.
::::
:::::::
:::::: {#imp-chap01-nhanesA-pkg}
::::: my-important
::: my-important-header
Package {**nhanesA**} facilitates downloading NHANES data
:::
::: my-important-container
I just learned that there is a {**nhanesA**} package that facilitates downloading data to R. (2024-01-20) The latest package release is brand-new (2024-01-09).
I am planning to use this package when I am working with again with NHANES data in @sec-chap06. Just for me to remember:
- The packages is [on CRAN](https://cran.r-project.org/web/packages/nhanesA/index.html)
- There is a [GitHub repository](https://github.com/cjendres1/nhanes).
- A very informative [introductory vignette](https://cran.r-project.org/web/packages/nhanesA/vignettes/Introducing_nhanesA.html) does not only explain the package but also the structure of the NHANES data and codebook.
:::
:::::
Package {**nhanesA**} facilitates downloading NHANES data
::::::
------------------------------------------------------------------------
###### str()
:::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-str-blood-pressure-demo}
: Show data summary of three blood pressure variables with `utils::str()` (Demo)
:::
::::
::: my-r-code-container
```{r}
#| label: str-blood-pressure-demo
blood_pressure_demo <- base::readRDS("data/chap01/blood_pressure_demo.rds")
blood_pressure_demo |>
dplyr::select(c(2, 4, 5)) |>
utils::str()
```
:::
::::::
###### summary()
:::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-summary-blood-pressure-demo}
: Show data summary of three blood pressure variables with `summary()` (Demo)
:::
::::
::: my-r-code-container
```{r}
#| label: summary-blood-pressure-demo
blood_pressure_demo |>
dplyr::select(c(2, 4, 5)) |>
base::summary()
```
:::
::::::
###### skim()
:::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-skim-blood-pressure-demo}
: Show data summary of three blood pressure variables with `skim()` (Demo)
:::
::::
::: my-r-code-container
```{r}
#| label: skim-blood-pressure-demo
blood_pressure_demo |>
dplyr::select(c(2, 4, 5)) |>
skimr::skim()
```
:::
::::::
###### my_glance_data()
:::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-show-random-blood-pressure-demo}
: Show some blood pressure data with my private function `my_glance_data()`
:::
::::
::: my-r-code-container
```{r}
#| label: show-random-blood-pressure-demo
blood_pressure_demo <- base::readRDS("data/chap01/blood_pressure_demo.rds")
## display some values
blood_pressure_demo |>
dplyr::select(c(2, 4, 5)) |>
dplyr::rename(arm_selected = BPAOARM) |>
dplyr::rename(systolic = BPXOSY1) |>
dplyr::rename(diastolic = BPXODI1) |>
my_glance_data()
```
:::
::::::
::::::::::::::::::::::::::::
:::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::
### Import & clean data {#sec-chap01-import-clean-data}
:::::::::::::::::::::::::: my-exercise
:::: my-exercise-header
::: {#exr-chap01-read-nhanes-csv}
: Coder exercises: Import & clean data
:::
::::
::::::::::::::::::::::: my-exercise-container
:::::::::::::::::::::: panel-tabset
###### read_csv()
:::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-read-nhanes-csv}
: Open the 2013–2014 NHANES data file that comes with the book
:::
::::
::: my-r-code-container
```{r}
#| label: read-nhanes-csv
#| cache: true
#| eval: false
## load and save it only once (manually) ########
nhanes_2013 <- readr::read_csv("data/chap01/nhanes_2013_ch1.csv")
save_data_file("chap01", nhanes_2013, "nhanes_2013.rds")
```
(*This R Code chunk has no visible output.*)
:::
::::::
###### View data
::::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-NHANES-view-data}
: Look at the data with `utils::str()` and `skimr::skim()`
:::
::::
:::: my-r-code-container
::: {#lst-chap01-NHANES-view-data}
```{r}
#| label: NHANES-view-data
nhanes_2013 <- base::readRDS("data/chap01/nhanes_2013.rds")
base::summary(nhanes_2013)
utils::str(nhanes_2013)
skimr::skim(nhanes_2013)
```
Look at the data with `utils::str()` and `skimr::skim()`
:::
As one of the packages for displaying summaries I am using the {**skimr**} package (see @sec-skimr).
::::
:::::::
###### Codebook
Unfortunately the .csv files does not include labelled data. Therefore I had to look for the variable names in all different NHANES sections as outlined in @bul-NAHNES-data-sections. Inspecting the accompanying NHANES download pages (see: @bul-NHANES-data-pages) I could look for the variable names in the download links for the documentation.
- [DUQ200](https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/DUQ_H.htm#DUQ200): Ever used marijuana or hashish
- [RIAGENDR](https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/DEMO_H.htm#RIAGENDR): Gender
- [RIDAGEYR](https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/DEMO_H.htm#RIDAGEYR): Age in years at screening
There is an additional column `...1` (see @lst-chap01-NHANES-view-data) providing row numbers. Maybe this column works as an ID, a kind sequence number, so that one could merge other data files identified by this number. (See [SEQN](https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/DEMO_H.htm#SEQN): Respondent Sequence Number). Why this column has lost its name, I do not know. Perhaps it has to do with the data import done by the book author?
To facilitate the work I made screenshots from these four variables:
![](img/chap01/codebook-NHANES-DUQ200-marijuana-min.png)
![](img/chap01/codebook-NHANES-RIAGENDR-RIDAGEYR-gender-age-min.png)
![](img/chap01/codebook-NHANES-SEQN-sequence-number-min.png)
###### Data cleaning
:::::::::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-NHANES-data-cleaning}
: Look at the data after data cleaning
:::
::::
::::::::: my-r-code-container
```{r}
#| label: NHANES-data-cleaning
nhanes_2013 <- base::readRDS("data/chap01/nhanes_2013.rds")
nhanes_2013_clean <- nhanes_2013 |>
dplyr::rename(nr = ...1) |>
dplyr::rename(
sex = RIAGENDR,
age = RIDAGEYR,
grass = DUQ200
) |>
dplyr::mutate(grass =
dplyr::na_if(x = grass, y = 7)) |>
dplyr::mutate(grass =
dplyr::na_if(x = grass, y = 9)) |>
tidyr::drop_na() |>
dplyr::mutate(sex = forcats::as_factor(sex)) |>
dplyr::mutate(grass = forcats::as_factor(grass)) |>
dplyr::mutate(grass =
forcats::fct_recode(grass,
"Yes" = "1", "No" = "2")) |>
dplyr::mutate(sex =
forcats::fct_recode(sex,
"Male" = "1", "Female" = "2"))
## save cleaned data ###############
save_data_file("chap01", nhanes_2013_clean, "nhanes_2013_clean.rds")
base::summary(nhanes_2013_clean)
skimr::skim(nhanes_2013_clean)
```
::::: my-important
::: my-important-header
Recoded `gender` to `sex`
:::
::: my-important-container
In all R code chunks and text passages I have changed the `gender` category to `sex`. "Sex" refers only to biology but "gender" is a broader concepts: It refers to identity construction, that include biological, psycho-social, and cultural factors.
> Gender refers to the characteristics of women, men, girls and boys that are socially constructed. This includes norms, behaviours and roles associated with being a woman, man, girl or boy, as well as relationships with each other. As a social construct, gender varies from society to society and can change over time. ... Gender interacts with but is different from sex, which refers to the different biological and physiological characteristics of females, males and intersex persons, such as chromosomes, hormones and reproductive organs. Gender and sex are related to but different from gender identity. Gender identity refers to a person’s deeply felt, internal and individual experience of gender, which may or may not correspond to the person’s physiology or designated sex at birth. ([WHO](https://www.who.int/health-topics/gender#tab=tab_1))
:::
:::::
::::: my-remark
::: my-remark-header
My learning from the above procedure of data cleaning
:::
::: my-remark-container
1. It is wrong to put the column name `...1` into accents.
2. There exists a shortcut for several `dplyr::rename()` but not for `dplyr::na_if()` and `forcats::as_factor()` because here we need to change the column values with the command.
3. Sequence of commands is important:
- Start with the renaming of variables, This is not mandatory but it helps to address the correct column in the following commands.
- Recode different missing values to NA’s with `dplyr::na_if()`
- Then drop (all) NA’s with `tidyr::drop_na()`.
4. `forcats::as_factor()` needs to rewrite the column as factor with `dplyr::mutate()`
:::
:::::
:::::::::
::::::::::::
::::::::::::::::::::::
:::::::::::::::::::::::
::::::::::::::::::::::::::
### Bar chart for marijuana use {#sec-chap01-bar-chart-grass}
:::::::::::::::::::::::::::: my-exercise
:::: my-exercise-header
::: {#exr-chap01-bar-chart-grass-percentage}
: Bar chart (with different bar width ans space) showing the percentage of NHANES participants answering yes and no to marijuana use
:::
::::
::::::::::::::::::::::::: my-exercise-container
:::::::::::::::::::::::: panel-tabset
###### Numbers
:::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-compute-grass-percentage}
: Show frequencies and percentages of NHANES participants answering yes and no to "Ever used marijuana or hashish"
:::
::::
::: my-r-code-container
```{r}
#| label: compute-grass-percentage
#| cache: true
nhanes_2013_clean <- base::readRDS("data/chap01/nhanes_2013_clean.rds")
tibble::enframe(table(nhanes_2013_clean$grass)) |>
dplyr::rename(`Ever used grass or hash` = name, Freq = value) |>
dplyr::mutate(Perc = round(100 * Freq / sum(Freq), 2))
```
------------------------------------------------------------------------
Instead of using `name` and `value` I could have used the position of the column numbers `1` and `2` (all names without the \`-sign.)
:::
::::::
This is a manual calculation using the {**tidyverse**} approach: I am sure there are some packages that may facilitate this computation (e.g., {**Janitor**} or {**DescTools**}), but I had difficulties to apply the appropriate functions.
###### Standard
:::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-grass-bar-chart-percentage}
: Percentage of NHANES participants answering yes and no to marijuana use
:::
::::
::: my-r-code-container
```{r}
#| label: fig-grass-bar-chart-percentage-standard
#| fig-cap: "Proportion of marijuana use of the participant in the NHANES 2013 survey"
#| fig-width: 4
#| fig-height: 4
nhanes_2013_clean <- base::readRDS("data/chap01/nhanes_2013_clean.rds")
ggplot2::ggplot(nhanes_2013_clean,
ggplot2::aes(grass,
y = 100 * ggplot2::after_stat(count) /
base::sum(ggplot2::after_stat(count)),
fill = grass
)) +
ggplot2::geom_bar() +
ggplot2::theme_minimal() +
ggplot2::scale_fill_manual(values = c("darkgreen", 'darkred'),
guide = "none") +
ggplot2::scale_y_continuous(
breaks = base::seq(0, 50, by = 10),
labels = scales::label_percent(scale = 1)) +
ggplot2::labs(x = "Did you ever use marijuana or hashish?",
y = "Percent of responses")
```
:::
::::::
The bars are very thick. I tried with `ggplot2::geom_bar(width = 0.5)` to make them thinner. But I failed. (See @lst-chap01-grass-bar-chart-width)
###### Thin bars
::::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-grass-bar-chart-thin-bars}
: Percentage of NHANES participants answering yes and no to marijuana use
:::
::::
:::: my-r-code-container
::: {#lst-chap01-grass-bar-chart-width}
```{r}
#| label: fig-grass-bar-chart-thin-bars
#| fig-cap: "Proportion of marijuana use of the participant in the NHANES 2013 survey"
#| fig-width: 4
#| fig-height: 4
nhanes_2013_clean <- base::readRDS("data/chap01/nhanes_2013_clean.rds")
ggplot2::ggplot(nhanes_2013_clean,
ggplot2::aes(grass,
y = 100 * ggplot2::after_stat(count) /
base::sum(ggplot2::after_stat(count)),