01-preparing-data.qmd

# Preparing Data {#sec-chap01}

```{r}
#| label: setup
#| include: false

base::source(file = "R/helper.R")
```

## Achievements to unlock

:::::: {#obj-chap01}
::::: my-objectives
::: my-objectives-header
Objectives for chapter 01
:::

::: my-objectives-container
**SwR Achievements**

-   (~~Observations and variables~~)
-   Using reproducible research practices (@sec-chap01-reproducibility)
-   (~~Understanding and changing data types~~)
-   Entering or loading data into R (@sec-chap01-import-data)
-   Identifying and treating missing values (Data wrangling) (@sec-chap01-data-wrangling)
-   Building a basic bar chart (Replicate Figure 1.1 and 1.2) (@sec-chap01-repr-bar-charts)

I will skip the crossed out learning objectives in parenthesis as I know already these procedures. However I will modify and elaborate some of these achievements as mentioned in the parentheses.

I will add other objectives that resulted from questions that arose during reading the book.

**Questions that resulted to additional objectives**

-   How to download data directly from the `r glossary("General Social Survey")` (GSS) web page? {@sec-chap01-get-gss-data}
-   How to work with labelled data? {@sec-chap01-labelled-data}
-   Reviewing different packages for creating tables (@sec-chap01-review-table-packages)
-   Introducing {**gtsummary**}, a packages for presentation-ready data summary and analytic result tables (see: @sec-gtsummary @sec-chap01-experiments-gtsummary)
-   Developing timeline graph (1973-2022) about opinions of respondents from `r glossary("GSS")` surveys about legalizing marijuana (@sec-chap01-grass-timeline)
:::
:::::

Achievements for chapter 01
::::::

## Using reproducible research practices {#sec-chap01-reproducibility}

### Script files

SwR explains writing script files, but I am using `r glossary("Literate Programming")` with Quarto. This has the consequence that in addition to short comments inside code cells I have the possibility to write extensively in the other parts of the file about approach, code, results etc.

A practical advice for scripts is to include a `r glossary("prolog")`. Possible prolog sections:

-   Project name
-   Project purpose
-   Name(s) of data set(s) used in the project
-   Location(s) of data set(s) used in the project
-   Code author name (you!)
-   Date code created
-   Date last time code was edited

Most of these information are naturally occurring in the writing process of Quarto books.

:::::: my-resource
:::: my-resource-header
::: {#lem-chap01-literate-programming}
: Literate Statistical Programming
:::
::::

::: my-resource-container
-   Literate Programming: ([Wikipedia](https://en.wikipedia.org/wiki/Literate_programming))
-   Introduction to Literate Programming with Quarto ([Online Slides](https://gesiscss.github.io/quarto-workshop/material/slides/01_introduction.html#/title-slide))
-   Reproducibility and literate programming in R ([bookdown course](https://exeter-data-analytics.github.io/LitProg/index.html))
-   Introduction to Data Science in R for Biologists (Module on [Literate Statistical Programming and Quarto](https://mbutler808.github.io/rclass/posts/2023-01-26-intro-quarto/index.html))
-   Let’s build a blog with Quarto [Literate programming in Quarto](https://ivelasq.quarto.pub/building-a-blog-with-quarto/workflow/write-docs/)) by Isabella Velásquez. The site has other material (for Quarto blogs) as well: [Migrate from R Markdown](https://ivelasq.quarto.pub/building-a-blog-with-quarto/learn-more/migrate-blog/), [Additional resources](https://ivelasq.quarto.pub/building-a-blog-with-quarto/learn-more/resources/)
-   Introduction to literate programming with Quarto and Markdown by Gesis ([Slides](https://gesiscss.github.io/quarto-workshop/material/slides/01_introduction.html#/title-slide))
:::
::::::

### Naming objects

I am used to apply the [tidyverse style guide](https://style.tidyverse.org/). It requires to use underlines ("snake_code") as separators in object names. (In contrast to "camelCase" code style). But reading the book I thought it might be a good idea to use special additional styles for certain specific objects.

-   **Naming constants**: Prefix name of constants with `k_`.
-   **Naming variables**: Standard snake code.
-   **Naming functions**: Prefix name of private functions with a dot `.`. I had already experienced that didn't know from which package a function was. Only to learn after looking around for minutes that it was a function I wrote myself!
-   **Naming data frames**: Prefix name with `df_` for data.frame and `dt_` for tibble. I might also use a suffix to refer to the status e.g., `_raw` (raw data), `_clean` (cleaned data), `_v2` (version number).
-   **Naming files**: It could be helpful to add at the start the chapter number e.g. `chap02_`. And maybe also --- as in naming data frames --- the status as suffix.

## Import data frames from outside resources {#sec-chap01-import-data}

R has many possibilities to import data from other statistical packages.

### Some common file extensions

-   **.csv**: comma separated values
-   **.txt**: text file
-   **.xls or .xlsx**: Excel file
-   **.sav**: SPSS file
-   **.sasb7dat**: SAS file
-   **.xpt**: SAS transfer file
-   **.dta**: Stata file

### Some packages for import data sources

-   {**readr**}: Read Rectangular Text Data (see: @sec-readr), it is part of {**tidyverse**}
-   {**vroom**}: Read and Write Rectangular Text Data Quickly
-   {**haven**}: Import and Export 'SPSS', 'Stata' and 'SAS' Files (see: @sec-haven)
-   {**foreign**}: Read Data Stored by 'Minitab', 'S', 'SAS', 'SPSS', 'Stata', 'Systat', 'Weka', 'dBase', ...
-   {**readxl**}: Read Excel Files
-   {**openxslx**}: Read, Write and Edit xslx Files
-   {**readODS**}: Read and Write ODS Files (e.g. LibreOffice)
-   {**clipr**}: Read and Write from the System Clipboard

I will not go into the import details of all the different packages here, because my focus is on the `r glossary("General Social Survey")` (GSS) data.

### Import data with a .csv file

> “While the GSS data can be read into R directly from the GSS website, Kiara had experienced this and knew that it could be frustrating.” ([Harris, 2020](zotero://select/groups/5254842/items/9N29QMJB)) ([pdf](zotero://open-pdf/groups/5254842/items/3NDRGBBW?page=107&annotation=SFD9FHQD))

The author therefore offers for this chapter a `r glossary("CSV", ".csv")` file with the data. In later chapters learner can choose to use the provided files from the [SAGE webpage](https://edge.sagepub.com/harris1e/student-resources/datasets-and-r-code). Even if these data files are not yet cleaned, it is a kind of cheating, because it bypasses downloading data from the original source.

The `pb_create_folder()` function in the following code chunk checks if the folder already exists. If this is the case it leaves it untouched, otherwise it creates the folder at the desired path. The code for my private functions are in the file `helper.R` inside the `R` folder at root level of this working directory.

:::::::::::::::::: my-example
:::: my-example-header
::: {#exm-chap01-read-csv-show-data}
: Read data from a .csv file into R
:::
::::

::::::::::::::: my-example-container
:::::::::::::: panel-tabset
###### summary()

::::::: my-r-code
:::: my-r-code-header
::: {#cnj-code-name-a}
: Read .csv file and show data summary
:::
::::

:::: my-r-code-container
::: {#lst-chap01-read-csv-show-summary}
```{r}
#| label: read-csv-show-summary
#| cache: true

pb_create_folder(base::paste0(here::here(), "/data/"))
pb_create_folder(base::paste0(here::here(), "/data/chap01"))

gss_2016_book <- readr::read_csv(
    file = "data/chap01/legal_weed_age_GSS2016_ch1.csv",
    show_col_types = FALSE)

summary(gss_2016_book)
```

Read the provided dataset as a .csv file into R and show `base::summary()`
:::

------------------------------------------------------------------------

**Some comments**

1.  In contrast to `base::read.csv()` in the book I used with `readr::read_csv()` a function from the {**tidyverse**} package collection.
2.  I added the `show_col_types = FALSE` argument to prevent a message about the column specification.
::::
:::::::

The output of `base::summary` is in this case not very informative. Looking around for appropriate reporting function I developed with `my_glance_data()` my private command. (See @lst-chap01-read-csv-glance-data in next tab.)

###### my_glance_data()

::::::: my-r-code
:::: my-r-code-header
::: {#cnj-code-name-b}
Look at data with my private function `my_glance_data()`
:::
::::

:::: my-r-code-container
::: {#lst-chap01-read-csv-glance-data}
```{r}
#| label: glance-data

gss_2016_book |>
    dplyr::select(c(age, grass)) |> 
    my_glance_data(N = 8, seed = 2016)
```

Display 8 randomly chosen rows in addition to the first and last row
:::
::::
:::::::

I developed a private function `my_glance_data()` to provide a first impression about the data. The function prints the first and last row of the dataset, so you know implicitly how many rows the dataset has. Between the first and last row the function adds randomly `N` other rows (default is 8). Additionally it provides the row number of the data in a separate column. (The column header `obs` stands for "observation".) For reproducibility purposes you can also add a number for the `set.seed()` command.

The idea of `my_glance_data()` is to get a first impression of the dataset. Other printing methods show only the first (or last) rows of the dataset. This could be misleading, giving a wrong impression about typical data.

###### my_glance_data

Maybe you are interested to use `my_glance_data()` yourself? It isn't available through a package, but you can copy the source code from the next R chunk.

I have saved the function in a `.qmd` file one level higher as this book (and all my other R projects). With `{{< include "../_common.qmd" >}}` I have the code integrated in this book. (For the `include` shortcode see the section [Includes](https://quarto.org/docs/authoring/includes.html) of the Quarto documentation.)

::: {#lst-chap01-show-function-glance-data}
```{r}
#| label: show-function-glance-data
#| code-fold: show
#| eval: false

### function my_glance_data ##############
my_glance_data <- function(df, N = 8, seed = 42){
    df_temp <- first_and_last_row(df)
    
    set.seed(seed)
    df |> 
    dplyr::mutate(obs = dplyr::row_number()) |> 
    dplyr::relocate(obs) |> 
    dplyr::slice_sample(n = N) |> 
    dplyr::bind_rows(df_temp) |> 
    dplyr::arrange(obs)
} 

first_and_last_row <-  function(df) {
    df |> 
    dplyr::mutate(obs = dplyr::row_number()) |>
    dplyr::filter(dplyr::row_number() %in% base::c(1, dplyr::n()))
}

```

Code of my private function `glance_date()`
:::
::::::::::::::
:::::::::::::::
::::::::::::::::::

------------------------------------------------------------------------

## Data Wrangling {#sec-chap01-data-wrangling}

### ToDo List

After we saved the data we need to do some data wrangling. To replicate the data structure for the book Figure 1.2 we need to:

-   filter the dataset to the year 2016 (in the case you work with the full GSS dataset 1972-2022, which we won't)
-   select only the variables `age` and `grass` to work with
-   drop all NA’s
-   convert `grass` into factor
-   recode `grass` labels
-   convert `age` from double to numeric
-   divide `age` into appropriate age intervals and label them accordingly

------------------------------------------------------------------------

::: {#fig-chap01-bar-charts layout-ncol="2"}
![Support for marijuana legalization among participants in the 2016 General Social Survey (Figure 1.1)](img/chap01/fig-01-01-min.png){#fig-01-01}

![Support for marijuana legalization by age group among participants in the 2016 General Social Survey (Figure 1.2)](img/chap01/fig-01-02-min.png){#fig-01-02}

We will replicate these two bar charts (Figure 1.1 and Figure 1.2 in the book)
:::

------------------------------------------------------------------------

### Replicating data structure for bar charts

:::::::::::::: my-example
:::: my-example-header
::: {#exm-chap01-repl-data-bar-charts}
: Replicate data structure for bar charts
:::
::::

::::::::::: my-example-container
::::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-repl-data-fig-book}
: Replicate data structure for Figures 1.1 and 1.2 (Book)
:::
::::

:::: my-r-code-container
::: {#lst-repl-data-fig-book}
```{r}
#| label: replicate-data-fig-book
#| results: hold

gss_2016 <- readr::read_csv(
    file = "data/chap01/legal_weed_age_GSS2016_ch1.csv",
    show_col_types = FALSE)


gss_2016_clean <- gss_2016 |> 

#### (A) rework grass ##################
## convert grass to factor
  dplyr::mutate(grass = forcats::as_factor(grass))  |> 

## change DK and IAP to NA
  dplyr::mutate(grass = dplyr::na_if(x = grass, y = "DK")) |> 
  dplyr::mutate(grass = dplyr::na_if(x = grass, y = "IAP")) |> 

## drop unused levels "DK" and "IAP"
  dplyr::mutate(grass = forcats::fct_drop(grass)) |> 
    
## convert to factor and recode it
  dplyr::mutate(grass = forcats::fct_recode(
      grass, "Yes" = "LEGAL", "No" = "NOT LEGAL")) |> 

#### (B) rework age #########################
## change data types and recode

## turn character type of age into factor and recode "89 OR OLDER" to "89"
  # dplyr::mutate(age = dplyr::recode(age, "89 OR OLDER" = "89")) |> 
  dplyr::mutate(age = dplyr::case_match(age, 
                                 "89 OR OLDER" ~ "89",
                                 .default = age)) |> 

## convert data type of age from factor to numeric
  dplyr::mutate(age = base::as.numeric(age)) |> 

## cut age into several defined age groups
  dplyr::mutate(age_cat = base::cut(age, 
                       breaks = c(-Inf, 29, 59, 74, Inf),
                       labels = c("< 30", "30 - 59", "60 - 74", "75+" ))) |>  

## drop NA's
    tidyr::drop_na()

## save cleaned data gss_2016
save_data_file("chap01", gss_2016_clean, "gss_2016_clean.rds")

## (C) check result #########

## summarize
print("************************   SUMMARY   ****************************")
base::summary(gss_2016_clean)

## glance at the data
print("*******************   GLANCE AT SOME DATA  **********************")
gss_2016_clean |>
    dplyr::select(c(age_cat, grass)) |> 
    my_glance_data(N = 8, seed = 2016)
```

Replicate data structure for Figures 1.1 and 1.2 of the book)
:::

------------------------------------------------------------------------

The somewhat strange cut decision for age groups was motivated by the question if there is a difference between young and old voters about marijuana legalization.

Compare the result of the recoding with the original data structure in @lst-chap01-read-csv-show-summary.
::::
:::::::

**Some comments**

1.  I have included all data wrangling changes for the Figure 1.1 even if they appeared in the book section where the graph is prepared.
2.  I have used `|>` form the native R pipe instead of `%>%` exported into tidyverse from the {**magrittr**} package.
3.  Otherwise whenever possible I have used {**tidyverse**} code. For `base::as.numeric()` and `base::cut()` I couldn't find {**tidyverse**} equivalents.

::::: my-watch-out
::: my-watch-out-header
WATCH OUT! Difference between `forcats::fct_recode()` and `dplyr::recode()`. Use `dplyr::case_match()`!
:::

::: my-watch-out-container
To recode `age` from "89 OR OLDER" to "89" I used at first `forcats::fct_recode()`. But this function changes the 72 *levels* of the factor `age` and resulted --- after I changed `age` to numeric in the next line --- in a data frame where `age` ranged from 1 to 72!

After I noticed the problem and wanted to replace the {**forcats**}-line with `dplyr::mutate(age = dplyr::recode(age, "89 OR OLDER" = "89"))` I saw in the help file that

> `recode()` is superseded in favor of `case_match()`, which handles the most important cases of `recode()` with a more elegant interface. ([Recode values](https://dplyr.tidyverse.org/reference/recode.html)) and

> \[`dplyr::case_match()`\] allows you to vectorise multiple switch() statements. Each case is evaluated sequentially and the first match for each element determines the corresponding value in the output vector. If no cases match, the .default is used. ([A general verctorized switch()](https://dplyr.tidyverse.org/reference/case_match.html))
:::
:::::

@lst-repl-data-fig-book used some important packages:

I used {**dplyr**} in the above code heavily. {**dplyr**} (@sec-dplyr) is part of the {**tidyverse**} core collection (@sec-tidyverse), together with {**forcats**} (@sec-forcats), {**ggplot2**} (@sec-ggplot2), {**readr**} (@sec-ggplot2), {**tibble**} (@sec-tibble), {**tidyr**} (@sec-tidyr), {**stringr**} (@sec-stringr) and {**purrr**} (@sec-purrr).
:::::::::::
::::::::::::::

## Reproducing bar charts {#sec-chap01-repr-bar-charts}

::::::::::::::: my-example
:::: my-example-header
::: {#exm-chap01-repr-bar-charts}
: Reproducing bar charts
:::
::::

:::::::::::: my-example-container
::::::::::: panel-tabset
###### Figure 1.1

:::::: my-r-code
:::: my-r-code-header
::: {#cnj-fig-chap01-repr-fig-1-1}
a: Reproducing bar chart Figure 1.1 using `geom_bar()`
:::
::::

::: my-r-code-container
```{r}
#| label: fig-chap01-repr-fig-1-1
#| fig-cap: Support for marijuana legalization among participants in the 2016 General Social Survey (Figure 1.1)
#| fig-width: 4
#| fig-height: 4

gss_2016_clean <-  base::readRDS("data/chap01/gss_2016_clean.rds")

## create figure 1.1 #######
gss_2016_clean |> 
  ggplot2::ggplot(ggplot2::aes(x = grass,
             y = 100 * ggplot2::after_stat(count) / 
                 base::sum(ggplot2::after_stat(count)),
             fill = grass)) +
  ggplot2::geom_bar() +
  ggplot2::theme_minimal() +
  ggplot2::scale_fill_manual(values = c("#79a778", '#7662aa'),
                    guide = "none") +
  ggplot2::labs(x = "Should marijuana be legal?",
       y = "Percent of responses")

```

------------------------------------------------------------------------

I changed the code slightly because of two warnings. Newer versions of {**ggplot2**} have deprecated some older functions:

-   Warning: The dot-dot notation (`..count..`) was deprecated in {**ggplot2**} version 3.4.0. Please use `after_stat(count)` instead.
-   Warning: The `guide` argument in `scale_*()` cannot be `FALSE`. This was deprecated in {**ggplot2**} version 3.3.4. Please use "none" instead. (I have slightly edited both warnings)
:::
::::::

Compare the reproduction with the original @fig-01-01.

###### Figure 1.2

:::::: my-r-code
:::: my-r-code-header
::: {#cnj-fig-chap01-repr-fig-1-2}
b: Reproducing bar chart Figure 1.2 using `geom_col()`
:::
::::

::: my-r-code-container
```{r}
#| label: fig-chap01-repr-fig-1-2
#| fig-cap: Support for marijuana legalization by age group among participants in the 2016 General Social Survey (Figure 1.2)

gss_2016_clean <-  base::readRDS("data/chap01/gss_2016_clean.rds")

## create figure 1.2 ########
gss_2016_clean |> 
  dplyr::group_by(grass, age_cat) |> 
  dplyr::count() |> 
  dplyr::group_by(age_cat) |> 
  dplyr::mutate(perc_grass = 100 * n / base::sum(n)) |> 
  ggplot2::ggplot(ggplot2::aes(x = age_cat, fill = grass,
             y = perc_grass)) +
  ggplot2::geom_col(position = 'dodge') +
  ggplot2::theme_minimal() +
  ggplot2::scale_fill_manual(values = c('#79a778', '#7662aa'),
                    name = "Should marijuana\nbe legal?") +
  ggplot2::labs(x = "Age group (in years)",
       y = "Percent of responses in age group")
```

------------------------------------------------------------------------

@fig-chap01-repr-fig-1-1 uses `ggplot2::geom_bar()` whereas this figure here calls the `ggplot2::geom_col()` function.

> There are two types of bar charts: `geom_bar()` and `geom_col()`. `geom_bar()` makes the height of the bar proportional to the number of cases in each group … . If you want the heights of the bars to represent values in the data, use `geom_col()` instead. `geom_bar()` uses `stat_count()` by default: it counts the number of cases at each x position. `geom_col()` uses `stat_identity()`: it leaves the data as is. ([Bar charts](https://ggplot2.tidyverse.org/reference/geom_bar.html) with {**ggplot2**})
:::
::::::

Compare the reproduction with the original @fig-01-02.
:::::::::::
::::::::::::
:::::::::::::::

------------------------------------------------------------------------

## Exercises

::::: my-objectives
::: my-objectives-header
Exercises for the Coder & Hacker edition
:::

::: my-objectives-container
**SwR objectives**

1.  Visit the NHANES website and find the online codebook for the 2013–2014 data (@sec-visit-nhanes-website)
2.  Open the 2013–2014 NHANES data file saved as `nhanes_2013_ch1.csv` with the book materials at [edge.sagepub.com/harris1e](https://edge.sagepub.com/harris1e) (Achievement 1) (@sec-chap01-import-clean-data in tab "read_csv()")
3.  Examine the data types for DUQ200, RIDAGEYR, and RIAGENDR, and fix data types if needed based on the NHANES codebook (Achievements 2 and 3) (@sec-chap01-import-clean-data in tab "View data" and tab "Codebook")
4.  Based on the online NHANES codebook, code missing values appropriately for DUQ200, RIDAGEYR, and RIAGENDR (Achievement 4) (@sec-chap01-import-clean-data in tab "Data cleaning")
5.  Create a bar chart showing the percentage of NHANES participants answering yes and no to marijuana use (Achievement 5) (@sec-chap01-bar-chart-grass in tab "Solution")
6.  Recode age into a new variable called `age_cat` with 4 categories: 18–29, 30–39, 40–49, 50–59 (Achievement 7) (@sec-chap01-grass-age-sex in tab "Numbers")
7.  Create a bar chart of marijuana use by age group (Achievement 6) (@sec-chap01-grass-age-sex in tab "Age")
8.  ~~Add a prolog and comments to your code (Achievement 1)~~ (I am using `r glossary("literate programming")` in writing this book.)
9.  Create a bar chart of marijuana use by age group and sex with side-by-side bars (Achievement 5) (@sec-chap01-grass-age-sex in tab "Age and sex")
10. Following the R code in your code file, use comments to describe what you found. Given what you found and the information in the chapter, what do you predict will happen with marijuana legalization in the next 10 years? Discuss how the omission of older people from the marijuana use question for NHANES influenced your prediction. Write your prediction and discussion in comments at the end of your code file (@sec-chap01-data-interpretation).

**My own additional exercises**

11. Download and work with SAS .xpt file data (Demo) {@sec-download-xpt-file}
12. Show table of frequencies and percentages (@sec-chap01-bar-chart-grass in Tab "Numbers")
13. Make bars thinner (@sec-chap01-bar-chart-grass in Tab "Thin bars")
14. Make bars thinner and adjust spacing accordingly (@sec-chap01-bar-chart-grass in Tab "Solution")
:::
:::::

### Visiting the NHANES website {#sec-visit-nhanes-website}

Use the National Health and Nutrition Examination Survey (`r glossary("NHANES")`) data to examine marijuana use in the United States. Spend a few minutes looking through the NHANES website (<https://www.cdc.gov/nchs/nhanes/index.htm>) before you begin, including finding the online codebook for the 2013–2014 data. Complete the following tasks to explore whether age is related to marijuana use in the United States.

The 2013–2014 codebook is on the page [NHANES 2013-2014](https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=2013). On this page are the links to different types of data, documentations or codebooks.

You can choose from

:::::: {#bul-NAHNES-data-sections}
::::: my-bullet-list
::: my-bullet-list-header
Bullet List: NHANES has six different data sections
:::

::: my-bullet-list-container
-   Demographics Data
-   Dietary Data
-   Examination Data
-   Laboratory Data
-   Questionnaire Data
-   Limited Access Data
:::
:::::

NHANES has six different data sections
::::::

------------------------------------------------------------------------

Clicking of one of these links opens a page where a table contains the following information:

:::::: {#bul-NHANES-data-pages}
::::: my-bullet-list
::: my-bullet-list-header
Bullet List: NHANES data pages has download links for documentation and data files
:::

::: my-bullet-list-container
-   Data File Name
-   Doc File
-   Data File
-   Data Published
:::
:::::

NHANES data pages has download links for documentation and data files
::::::

------------------------------------------------------------------------

[Datasets and Documentation](https://wwwn.cdc.gov/nchs/nhanes/tutorials/Datasets.aspx) and [Frequently Asked Questions (FAQs)](https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/faq.aspx) are important introductory pages.

NHANES data are saved in a [SAS transport (.XPT) file](https://www.loc.gov/preservation/digital/formats/fdd/fdd000464.shtml). These file can be read with `foreign::read.xport()` and `haven::read_xpt()`. There is also a [page with example code](https://wwwn.cdc.gov/nchs/nhanes/tutorials/Datasets.aspx) to download/import NHANES data files. I will use {**haven**} instead of the recommended {**foreign**} package.

### Download NHANES SAS .xpt file data (Demo) {#sec-download-xpt-file}

:::::::::::::::::::::::::::::::: my-experiment
:::: my-experiment-header
::: {#def-chap01-xpt-file-download}
: Download and work with NHANES SAS .xpt file data (Demo)
:::
::::

::::::::::::::::::::::::::::: my-experiment-container
:::::::::::::::::::::::::::: panel-tabset
###### Get .xpt file

::::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-save-xpt-file-demo}
: Save a SAS transport (.xpt) file from the NHANES site for demo purposes and display the structure of the file
:::
::::

:::: my-r-code-container
::: {#lst-chap01-downloadfile}
```{r}
#| label: save-xpt-file-demo
#| cache: true
#| eval: false


## download only once (manually)
utils::download.file(
    "https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/P_BPXO.XPT",
    tf <- base::tempfile(),
    mode = "wb"
)
blood_pressure_demo <- haven::read_xpt(tf)

save_data_file("chap01", blood_pressure_demo, "blood_pressure_demo.rds")
```

Download a file and save it as R object
:::

(*This code chunk has no visible output*)

With the exception of using (**haven**) instead of {**foreign**}, I followed the instruction from the [NHANES recommendation](https://wwwn.cdc.gov/nchs/data/tutorials/file_download_import_R.R). We see labelled data as it was expected from the import with {**haven**}.

In @lst-chap01-get-gss2016-automated I have already used the `utils::download.file()` function: There are some differences:

-   @lst-chap01-get-gss2016-automated downloaded a STATA `.zip` file, therefore I used `unz()` to unzip the file and to read `haven::read_dta()` instead `haven::read_xpt()`.
-   More important however was the `base::unlink(temp)` command to unlink the temporary file, which is not done here.
-   Another difference was using the argument `mode = "wb"` for writing binary files. As I understood this is only important for Windows user. On my Mac I could stick with the default value `mode = "w"`. But for reproducibility it would be better to use in the future the more general option with `mode = "wb"`.
::::
:::::::

:::::: {#imp-chap01-nhanesA-pkg}
::::: my-important
::: my-important-header
Package {**nhanesA**} facilitates downloading NHANES data
:::

::: my-important-container
I just learned that there is a {**nhanesA**} package that facilitates downloading data to R. (2024-01-20) The latest package release is brand-new (2024-01-09).

I am planning to use this package when I am working with again with NHANES data in @sec-chap06. Just for me to remember:

-   The packages is [on CRAN](https://cran.r-project.org/web/packages/nhanesA/index.html)
-   There is a [GitHub repository](https://github.com/cjendres1/nhanes).
-   A very informative [introductory vignette](https://cran.r-project.org/web/packages/nhanesA/vignettes/Introducing_nhanesA.html) does not only explain the package but also the structure of the NHANES data and codebook.
:::
:::::

Package {**nhanesA**} facilitates downloading NHANES data
::::::

------------------------------------------------------------------------

###### str()

:::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-str-blood-pressure-demo}
: Show data summary of three blood pressure variables with `utils::str()` (Demo)
:::
::::

::: my-r-code-container
```{r}
#| label: str-blood-pressure-demo

blood_pressure_demo <- base::readRDS("data/chap01/blood_pressure_demo.rds")

blood_pressure_demo |> 
    dplyr::select(c(2, 4, 5)) |> 
    utils::str()
```
:::
::::::

###### summary()

:::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-summary-blood-pressure-demo}
: Show data summary of three blood pressure variables with `summary()` (Demo)
:::
::::

::: my-r-code-container
```{r}
#| label: summary-blood-pressure-demo

blood_pressure_demo |> 
    dplyr::select(c(2, 4, 5)) |> 
    base::summary()
```
:::
::::::

###### skim()

:::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-skim-blood-pressure-demo}
: Show data summary of three blood pressure variables with `skim()` (Demo)
:::
::::

::: my-r-code-container
```{r}
#| label: skim-blood-pressure-demo

blood_pressure_demo |> 
    dplyr::select(c(2, 4, 5)) |> 
    skimr::skim()
```
:::
::::::

###### my_glance_data()

:::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-show-random-blood-pressure-demo}
: Show some blood pressure data with my private function `my_glance_data()`
:::
::::

::: my-r-code-container
```{r}
#| label: show-random-blood-pressure-demo

blood_pressure_demo <- base::readRDS("data/chap01/blood_pressure_demo.rds")

## display some values
blood_pressure_demo |> 
    dplyr::select(c(2, 4, 5)) |> 
    dplyr::rename(arm_selected = BPAOARM) |> 
    dplyr::rename(systolic = BPXOSY1) |> 
    dplyr::rename(diastolic = BPXODI1) |> 
    my_glance_data()
```
:::
::::::
::::::::::::::::::::::::::::
:::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::

### Import & clean data {#sec-chap01-import-clean-data}

:::::::::::::::::::::::::: my-exercise
:::: my-exercise-header
::: {#exr-chap01-read-nhanes-csv}
: Coder exercises: Import & clean data
:::
::::

::::::::::::::::::::::: my-exercise-container
:::::::::::::::::::::: panel-tabset
###### read_csv()

:::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-read-nhanes-csv}
: Open the 2013–2014 NHANES data file that comes with the book
:::
::::

::: my-r-code-container
```{r}
#| label: read-nhanes-csv
#| cache: true
#| eval: false

## load and save it only once (manually) ########
nhanes_2013 <- readr::read_csv("data/chap01/nhanes_2013_ch1.csv")
save_data_file("chap01", nhanes_2013, "nhanes_2013.rds")
```

(*This R Code chunk has no visible output.*)
:::
::::::

###### View data

::::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-NHANES-view-data}
: Look at the data with `utils::str()` and `skimr::skim()`
:::
::::

:::: my-r-code-container
::: {#lst-chap01-NHANES-view-data}
```{r}
#| label: NHANES-view-data

nhanes_2013 <- base::readRDS("data/chap01/nhanes_2013.rds")

base::summary(nhanes_2013)
utils::str(nhanes_2013)
skimr::skim(nhanes_2013)
```

Look at the data with `utils::str()` and `skimr::skim()`
:::

As one of the packages for displaying summaries I am using the {**skimr**} package (see @sec-skimr).
::::
:::::::

###### Codebook

Unfortunately the .csv files does not include labelled data. Therefore I had to look for the variable names in all different NHANES sections as outlined in @bul-NAHNES-data-sections. Inspecting the accompanying NHANES download pages (see: @bul-NHANES-data-pages) I could look for the variable names in the download links for the documentation.

-   [DUQ200](https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/DUQ_H.htm#DUQ200): Ever used marijuana or hashish
-   [RIAGENDR](https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/DEMO_H.htm#RIAGENDR): Gender
-   [RIDAGEYR](https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/DEMO_H.htm#RIDAGEYR): Age in years at screening

There is an additional column `...1` (see @lst-chap01-NHANES-view-data) providing row numbers. Maybe this column works as an ID, a kind sequence number, so that one could merge other data files identified by this number. (See [SEQN](https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/DEMO_H.htm#SEQN): Respondent Sequence Number). Why this column has lost its name, I do not know. Perhaps it has to do with the data import done by the book author?

To facilitate the work I made screenshots from these four variables:

![](img/chap01/codebook-NHANES-DUQ200-marijuana-min.png)

![](img/chap01/codebook-NHANES-RIAGENDR-RIDAGEYR-gender-age-min.png)

![](img/chap01/codebook-NHANES-SEQN-sequence-number-min.png)

###### Data cleaning

:::::::::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-NHANES-data-cleaning}
: Look at the data after data cleaning
:::
::::

::::::::: my-r-code-container
```{r}
#| label: NHANES-data-cleaning

nhanes_2013 <- base::readRDS("data/chap01/nhanes_2013.rds")

nhanes_2013_clean <- nhanes_2013 |> 
    dplyr::rename(nr = ...1) |> 
    dplyr::rename(
        sex = RIAGENDR,
        age = RIDAGEYR,
        grass = DUQ200
        ) |> 
    dplyr::mutate(grass =
          dplyr::na_if(x = grass, y = 7)) |> 
    dplyr::mutate(grass =
          dplyr::na_if(x = grass, y = 9)) |>
    tidyr::drop_na() |> 
    dplyr::mutate(sex = forcats::as_factor(sex)) |> 
    dplyr::mutate(grass = forcats::as_factor(grass)) |> 
    dplyr::mutate(grass = 
          forcats::fct_recode(grass, 
          "Yes" = "1", "No" = "2")) |> 
    dplyr::mutate(sex = 
          forcats::fct_recode(sex, 
          "Male" = "1", "Female" = "2"))

## save cleaned data ###############
save_data_file("chap01", nhanes_2013_clean, "nhanes_2013_clean.rds")

base::summary(nhanes_2013_clean)
skimr::skim(nhanes_2013_clean)
```

::::: my-important
::: my-important-header
Recoded `gender` to `sex`
:::

::: my-important-container
In all R code chunks and text passages I have changed the `gender` category to `sex`. "Sex" refers only to biology but "gender" is a broader concepts: It refers to identity construction, that include biological, psycho-social, and cultural factors.

> Gender refers to the characteristics of women, men, girls and boys that are socially constructed. This includes norms, behaviours and roles associated with being a woman, man, girl or boy, as well as relationships with each other. As a social construct, gender varies from society to society and can change over time. ... Gender interacts with but is different from sex, which refers to the different biological and physiological characteristics of females, males and intersex persons, such as chromosomes, hormones and reproductive organs. Gender and sex are related to but different from gender identity. Gender identity refers to a person’s deeply felt, internal and individual experience of gender, which may or may not correspond to the person’s physiology or designated sex at birth. ([WHO](https://www.who.int/health-topics/gender#tab=tab_1))
:::
:::::

::::: my-remark
::: my-remark-header
My learning from the above procedure of data cleaning
:::

::: my-remark-container
1.  It is wrong to put the column name `...1` into accents.
2.  There exists a shortcut for several `dplyr::rename()` but not for `dplyr::na_if()` and `forcats::as_factor()` because here we need to change the column values with the command.
3.  Sequence of commands is important:
    -   Start with the renaming of variables, This is not mandatory but it helps to address the correct column in the following commands.
    -   Recode different missing values to NA’s with `dplyr::na_if()`
    -   Then drop (all) NA’s with `tidyr::drop_na()`.
4.  `forcats::as_factor()` needs to rewrite the column as factor with `dplyr::mutate()`
:::
:::::
:::::::::
::::::::::::
::::::::::::::::::::::
:::::::::::::::::::::::
::::::::::::::::::::::::::

### Bar chart for marijuana use {#sec-chap01-bar-chart-grass}

:::::::::::::::::::::::::::: my-exercise
:::: my-exercise-header
::: {#exr-chap01-bar-chart-grass-percentage}
: Bar chart (with different bar width ans space) showing the percentage of NHANES participants answering yes and no to marijuana use
:::
::::

::::::::::::::::::::::::: my-exercise-container
:::::::::::::::::::::::: panel-tabset
###### Numbers

:::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-compute-grass-percentage}
: Show frequencies and percentages of NHANES participants answering yes and no to "Ever used marijuana or hashish"
:::
::::

::: my-r-code-container
```{r}
#| label: compute-grass-percentage
#| cache: true

nhanes_2013_clean <- base::readRDS("data/chap01/nhanes_2013_clean.rds")

tibble::enframe(table(nhanes_2013_clean$grass)) |> 
    dplyr::rename(`Ever used grass or hash` = name, Freq = value) |> 
    dplyr::mutate(Perc = round(100 * Freq / sum(Freq), 2))
```

------------------------------------------------------------------------

Instead of using `name` and `value` I could have used the position of the column numbers `1` and `2` (all names without the \`-sign.)
:::
::::::

This is a manual calculation using the {**tidyverse**} approach: I am sure there are some packages that may facilitate this computation (e.g., {**Janitor**} or {**DescTools**}), but I had difficulties to apply the appropriate functions.

###### Standard

:::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-grass-bar-chart-percentage}
: Percentage of NHANES participants answering yes and no to marijuana use
:::
::::

::: my-r-code-container
```{r}
#| label: fig-grass-bar-chart-percentage-standard
#| fig-cap: "Proportion of marijuana use of the participant in the NHANES 2013 survey"
#| fig-width: 4
#| fig-height: 4

nhanes_2013_clean <- base::readRDS("data/chap01/nhanes_2013_clean.rds")

ggplot2::ggplot(nhanes_2013_clean,
    ggplot2::aes(grass, 
         y = 100 * ggplot2::after_stat(count) /
             base::sum(ggplot2::after_stat(count)),
             fill = grass
        )) +
    ggplot2::geom_bar() + 
    ggplot2::theme_minimal() +
    ggplot2::scale_fill_manual(values = c("darkgreen", 'darkred'),
                    guide = "none") +
    ggplot2::scale_y_continuous(
        breaks = base::seq(0, 50, by = 10),
        labels = scales::label_percent(scale = 1)) +
    ggplot2::labs(x = "Did you ever use marijuana or hashish?",
       y = "Percent of responses") 
    
```
:::
::::::

The bars are very thick. I tried with `ggplot2::geom_bar(width = 0.5)` to make them thinner. But I failed. (See @lst-chap01-grass-bar-chart-width)

###### Thin bars

::::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-grass-bar-chart-thin-bars}
: Percentage of NHANES participants answering yes and no to marijuana use
:::
::::

:::: my-r-code-container
::: {#lst-chap01-grass-bar-chart-width}
```{r}
#| label: fig-grass-bar-chart-thin-bars
#| fig-cap: "Proportion of marijuana use of the participant in the NHANES 2013 survey"
#| fig-width: 4
#| fig-height: 4

nhanes_2013_clean <- base::readRDS("data/chap01/nhanes_2013_clean.rds")

ggplot2::ggplot(nhanes_2013_clean,
    ggplot2::aes(grass, 
         y = 100 * ggplot2::after_stat(count) /
             base::sum(ggplot2::after_stat(count)),
             fill = grass
        )) +
    
    ## provide thinner bars, standard = 0.9
    ggplot2::geom_bar(width = 0.5) + 
    
    ggplot2::theme_minimal() +
    ggplot2::scale_fill_manual(values = c("darkgreen", 'darkred'),
                    guide = "none") +
    ggplot2::scale_y_continuous(
        breaks = base::seq(0, 50, by = 10),
        labels = scales::label_percent(scale = 1)) +
    ggplot2::labs(x = "Did you ever use marijuana or hashish?",
       y = "Percent of responses") 
    
```

Reducing the thickness of the bars
:::
::::
:::::::

The bars are now thin but the space between didn't change. I tried several options with the position argument to [adjust bar width *and* spacing](https://r-graphics.org/recipe-bar-graph-adjust-width):

> Dodging preserves the vertical position of an geom while adjusting the horizontal position. position_dodge() requires the grouping variable to be be specified in the global or geom\_\* layer. Unlike position_dodge(), position_dodge2() works without a grouping variable in a layer. position_dodge2() works with bars and rectangles, but is particulary useful for arranging box plots, which can have variable widths.

All my trials with `position_dodge()` and `position_dodge2()` failed. The only two strategies I could reduce the spacing was:

-   reducing the width of the graphic (and adapting the title of the graphic to fit into the smaller space)
-   using ggplot2::theme(aspect.ratio = 3/1) as shown in @lst-fig-chap01-grass-bar-chart-solution.

###### Solution

:::::::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-grass-bar-chart-solution}
: Percentage of NHANES participants answering yes and no to marijuana use
:::
::::

::::::: my-r-code-container
::: {#lst-fig-chap01-grass-bar-chart-solution}
```{r}
#| label: fig-grass-bar-chart-solution
#| fig-cap: "Proportion of marijuana use of the participant in the NHANES 2013 survey"
#| fig-width: 4
#| fig-height: 4

nhanes_2013_clean <- base::readRDS("data/chap01/nhanes_2013_clean.rds")

ggplot2::ggplot(nhanes_2013_clean,
    ggplot2::aes(grass, 
         y = 100 * ggplot2::after_stat(count) /
             base::sum(ggplot2::after_stat(count)),
             fill = grass
        )) +
    ggplot2::geom_bar() + 
    ggplot2::theme_minimal() +
    ggplot2::scale_fill_manual(values = c("darkgreen", 'darkred'),
                    guide = "none") +
    ggplot2::scale_y_continuous(
        # n.breaks = 10, 
        breaks = base::seq(0, 50, by = 5),
        labels = scales::label_percent(scale = 1)) +
    ggplot2::labs(x = "Did you ever use marijuana or hashish?",
       y = "Percent of responses") +
    ggplot2::theme(aspect.ratio = 3/1)
    
```

My solution for creating a bar chart showing the percentage of NHANES participants answering yes and no to marijuana use
:::

::::: my-remark
::: my-remark-header
What I learned from creating the bar chart
:::

::: my-remark-container
Even with this simple example of a bar chart I learned some new options:

1.  Use `ggplot2::after_stat()` inside the aesthetics of the first layer to convert frequencies to percentages.
2.  Adding breaks with `n.breaks = <number>` or `breaks = base::seq(from, to, by = <number>)` inside the appropriate scale.
3.  Use the {**scales**} package to add the `%`-sign after the values. As I have already calculated the percentages I need to reduce the default value of `scale = 100` to `scale = 1`.
4.  Use `ggplot2::theme(aspect.ratio = <number>/1)` creating thinner bars *and* reducing the spacing between the bars accordingly. In my case I changed the aspect ratio to `3/1`. A bigger front number compresses the graph, a smaller front number extends it. Aspect ratio of `1/1` is the standard (= no change).
:::
:::::
:::::::
::::::::::
::::::::::::::::::::::::
:::::::::::::::::::::::::
::::::::::::::::::::::::::::

### Marijuana use per age and sex {#sec-chap01-grass-age-sex}

::::::::::::::::::::::::::::::: my-exercise
:::: my-exercise-header
::: {#exr-chap01-bar-chart-grass-age-group}
: Exploring the percentage of NHANES participants answering yes or no to marijuana use by age category and sex
:::
::::

:::::::::::::::::::::::::::: my-exercise-container
::::::::::::::::::::::::::: panel-tabset
###### Summary Table

:::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-compute-grass-age-group-percentage}
: Show frequencies and percentages of NHANES participants answering yes and no to "Ever used marijuana or hashish"
:::
::::

::: my-r-code-container
```{r}
#| label: compute-grass-age-group-percentage
#| results: hold

nhanes_2013_clean <- base::readRDS("data/chap01/nhanes_2013_clean.rds")

nhanes_2013_grouped <- nhanes_2013_clean |> 
    ## cut age into several defined age groups
    dplyr::mutate(age_cat = base::cut(age,
                   breaks = c(-Inf, 29, 39, 49, 59),
                   labels = c("18 - 29", "30 - 39", "40 - 49", "50 - 59"))) |> 
    dplyr::select(grass, sex, age_cat)


## older trial, without the knowledge of {gtsummary}
# tibble::enframe(table(nhanes_2013_grouped$age_cat)) |>  
#     dplyr::rename(`Ever used grass or hash` = 1, Freq = 2) |> 
#     dplyr::mutate(Perc = round(100 * Freq / sum(Freq), 2))

nhanes_2013_grouped |> 
    labelled::set_variable_labels(
        grass = "Ever used grass or hash?"
    ) |> 
    gtsummary::tbl_summary(
        label = list(
            age_cat = "Age Category",
            sex = "Sex"
        ),
        type = grass ~ "categorical"
    ) |> 
    gtsummary::bold_labels()

save_data_file("chap01", nhanes_2013_grouped, "nhanes_2013_grouped.rds")
```

------------------------------------------------------------------------

Instead of using `1` and `2` I could have used the position of the column numbers `name` and `value` (all names without the \`-sign.)
:::
::::::

This is a manual calculation using the {**tidyverse**} approach: I am sure there are some packages that may facilitate this computation (e.g., {**Janitor**} or {**DescTools**}), but I had difficulties to apply the appropriate functions.

###### Age 1

:::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-grass-age-group1}
: Table: Percentage of NHANES participants by age group answering yes and no to marijuana use in percent of total responses
:::
::::

::: my-r-code-container
```{r}
#| label: tbl-grass-age-group1
#| tbl-cap: "Proportion of marijuana use of the participant by age group in percent of total responses in the NHANES 2013 survey"

nhanes_2013_grouped <- base::readRDS("data/chap01/nhanes_2013_grouped.rds")

nhanes_2013_grouped |> 
    labelled::set_variable_labels(
        grass = "**Ever used grass or hash?**"
    ) |> 
    gtsummary::tbl_cross(
        label = list(
            age_cat = "Age Category",
            sex = "Sex"
        ),
        row = age_cat,
        col = grass,
        percent = "cell"
    ) |> 
    gtsummary::modify_header(
      stat_1 = '**Yes**',
      stat_2 = '**No**',
      stat_0 = '**Total**'
    ) |> 
    gtsummary::bold_labels()

```

------------------------------------------------------------------------

I have required {**gtsummary**} to calculate the percentage with one decimal place. The reason was that the graphic did not match the table. For instance: Without decimal places the "No"-proportion of the age group 18-29 was with 13% equal to the "Yes"-proportion of the age group 30-39. But in the figure these two bars have a different height. The reason was a rounding error: Both values 13.3% and 12.6% were rounded to 13%. But in the figure one could see the difference of 0.7% in the length of the bars.
:::
::::::

:::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-fig-grass-age-group1}
: Figure: Percentage of NHANES participants by age group answering yes and no to marijuana use in percent of total responses
:::
::::

::: my-r-code-container
```{r}
#| label: fig-grass-age-group1
#| fig-cap: "Percentage of NHANES participants by age group answering yes and no to marijuana use in percent of total responses"


## age groups bar chart 1 ########
nhanes_2013_grouped |> 
    ggplot2::ggplot(ggplot2::aes(
        x = age_cat, 
        fill = grass,
         y = 100 * ggplot2::after_stat(count) / 
             base::sum(ggplot2::after_stat(count))
        )
    ) + 
    ggplot2::geom_bar(position = 'dodge') +    
    ggplot2::theme_minimal() +
    ggplot2::scale_fill_manual(values = c('darkgreen', 'darkred'),
                    name = "Ever used\ngrass or hash?") +
    ggplot2::labs(x = "Age group (in years)",
       y = "Percent of total responses")
```
:::
::::::

###### Age 2

:::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-grass-age-group2}
: Table:: Percentage of NHANES 2013 survey participants by age group answering yes and no to marijuana use in percent of age group
:::
::::

::: my-r-code-container
```{r}
#| label: tbl-grass-age-group2
#| tbl-cap: "Proportion of marijuana use of the participant by age group in the NHANES 2013 survey"

nhanes_2013_grouped <- base::readRDS("data/chap01/nhanes_2013_grouped.rds")

nhanes_2013_grouped |> 
    labelled::set_variable_labels(
            grass = "**Ever used grass or hash?**"
        ) |> 
    gtsummary::tbl_cross(
        label = list(
            age_cat = "Age Category",
            sex = "Sex"
        ),
        row = age_cat,
        col = grass,
        percent = "row",
        margin = "column",
        digits = c(0, 1)
    ) |> 
    gtsummary::modify_header(
      stat_1 = '**Yes**',
      stat_2 = '**No**',
      stat_0 = '**Total**'
    ) |> 
    gtsummary::bold_labels()


```
:::
::::::

:::::: my-r-code
:::: my-r-code-header
::: {#cnj-fig-grass-age-group2}
: Figure: Percentage of NHANES participants of the 2013 survey answering yes and no to marijuana use in percent of age group
:::
::::

::: my-r-code-container
```{r}
#| label: fig-grass-age-group2
#| fig-cap: "Percentage of NHANES participants of the 2013 survey answering yes and no to marijuana use in percent of age group"

## age groups bar chart 2 ########
nhanes_2013_grouped |> 
  dplyr::group_by(grass, age_cat) |>
  dplyr::count() |> 
  dplyr::group_by(age_cat) |> 
  dplyr::mutate(perc_grass = 100 * n / base::sum(n)) |> 
  ggplot2::ggplot(ggplot2::aes(x = age_cat, fill = grass,
             y = perc_grass)) +
  ggplot2::geom_col(position = 'dodge') +
  ggplot2::theme_minimal() +
  ggplot2::scale_fill_manual(values = c('darkgreen', 'darkred'),
                    name = "Ever used\ngrass or hash?") +
  ggplot2::labs(x = "Age group (in years)",
       y = "Percent of responses in age group")
```
:::
::::::

###### Age & Sex

:::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-grass-age-sex}
: Percentage of NHANES participants 2013 survey by age group and sex answering yes and no to marijuana use
:::
::::

::: my-r-code-container
```{r}
#| label: fig-grass-age-sex
#| fig-cap: "Proportion of marijuana use of the participant by age group and sex in the NHANES 2013 survey"

nhanes_2013_grouped <- 
    base::readRDS("data/chap01/nhanes_2013_grouped.rds")

## age and sex bar chart ########
nhanes_2013_grouped |> 
  dplyr::group_by(grass, age_cat, sex) |> 
  dplyr::count() |> 
  dplyr::group_by(age_cat, sex) |> 
  dplyr::mutate(perc_grass_2 = 100 * n / base::sum(n)) |> 
  ggplot2::ggplot(ggplot2::aes(
      x = age_cat, 
      y = perc_grass_2,
      fill = sex)) +
  ggplot2::geom_col(position = 'dodge') +
  ggplot2::theme_minimal() +
  ggplot2::scale_fill_manual(values = c('#f98e31', '#730b6d'),
                             name = "sex") +
  ggplot2::labs(
          x = "Age group (in years)",
          y = "Percent of males/females that used marijuana or hashish"
      )
```

------------------------------------------------------------------------

With the exception of the oldest age group the graphic shows that the proportion of marijuana / hashish user increases with younger people. One possible explanation for the untypical high use in the oldest age group could be that people between 50-59 of the 2013 survey were born between 1954-1963 and were young adults during the time of the [Flower Power](https://en.wikipedia.org/wiki/Flower_power) generation part of the [counterculture of the 1960s](Counterculture%20of%20the%201960s)-1970s.
:::
::::::

@fig-grass-age-sex is a little difficult to interpret. Both bars colors shows the proportion of "Yes"-respondents in the appropriate age group. For instance: About 62% of males and about 51% of women of the age group 18-29 said "Yes".

Therefore the bars to not sum up to 100%. The figures shows only the "Yes" part of the dichotomous `grass` variable for men and women in a certain age group. The other "No" part is missing.

To distinguish the sex colors in @fig-grass-age-sex from the Yes/No colors in @fig-grass-age-group1 and @fig-grass-age-group2 I chose at first some kind of blue for men and red for females. But I learned from [An alternative to pink & blue: Colors for gender data](https://blog.datawrapper.de/gendercolor/) that one should avoid this stereotypical colors.

> In our western culture, these colors come with the whole gender stereotype baggage. Pink means weak, shy girls who play with dolls and don’t deserve as much as boys. Blue stands for boys who need to be strong & rough. **When we create a chart with pink & blue, we endorse gender stereotypes**.
:::::::::::::::::::::::::::
::::::::::::::::::::::::::::
:::::::::::::::::::::::::::::::

### Data interpretation {#sec-chap01-data-interpretation}

::: {#rep-chap01-gss-nhanes .callout-tip}
###### Marijuana usage and opinions to legalize it (`r glossary("GSS")` and `r glossary("NHANES")` data)

@fig-chap01-repr-fig-1-2 with `r glossary("GSS")` data 2016 shows that younger people support the legalizing of marijuana. Therefore we can expect that in the future the proportion of people that approve legalizing will increase. With a small project on my own I have backed up this thesis. @fig-gss-grass-timeline demonstrates that the percentage of supporters has increased very strongly in the last years and has now hit the 70% mark. This trend is very clear even I have not distinguished in the longitudinal study between different age groups.

The graphs of the `r glossary("NHANES")` 2013 data in @fig-grass-age-group1 and fig-grass-age-group2 show that the highest proportion of people ever used marijuana or hashish lies in youngest age. The age trend is not very clear, because we have only figures for person equal or under 59 years. (Here would be necessary another small project with longitudinal analysis including newer data.)

With the exception of the age group 40-49 the proportion of males ever used in their respective age group is higher than the percentage of women. (See @fig-grass-age-sex). But this is only a snapshot and difficult to interpret. Another possible thesis could be that there is a tiny upward trend from old top young in marijuana use. One possible explanation for the untypical higher use in the oldest age group could be that people between 50-59 of the 2013 survey were born between 1954-1963 and were young adults during the time of the [Flower Power](https://en.wikipedia.org/wiki/Flower_power) generation part of the [counterculture of the 1960s](Counterculture%20of%20the%201960s)-1970s.
:::

------------------------------------------------------------------------

## Get the GSS data {#sec-chap01-get-gss-data}

### Working with the GSS

I am very interested how to get `r glossary("GSS")` data directly from the GSS website, so that I could work on interesting research questions myself.

I have found several resources helping to work with the `r glossary("GSS")`. The most important one is the {**gssr**} package (see: @sec-gssr)

:::::: my-resource
:::: my-resource-header
::: {#lem-chap01-asdf}
: asdfree: Analyze Survey Data for Free
:::
::::

::: my-resource-container
[Analyze Survey Data for Free](http://asdfree.com/) is a bookdown website by [Anthony Damico](https://www.youtube.com/@anthonyjosephdamico/playlists) with currently 64 locations to grab free survey data. As expected it features also a [description of the GSS](http://asdfree.com/general-social-survey-gss.html) including analysis examples with the {**survey**} package and --- especially important for my purpose here --- {**lodown**}, a [package on GitHub](https://github.com/ajdamico/lodown) that is not maintained anymore.
:::
::::::

@def-chap01-get-gss-data features five different strategies to download GSS data:

1.  Download extract by using the GSS Data Explorer --- Tab: "Explorer"
2.  Download files manually --- Tab: "by hand"
3.  Download files programmatically --- Tab: "automated"
4.  Download via the {**lodown**} package --- Tab: "lodown"
5.  Download via the {**gssr**} package --- Tab: "gssr"

### Five strategies to get GSS data

::::::::::::::::::::::::::::::::::::::::::: my-experiment
:::: my-experiment-header
::: {#def-chap01-get-gss-data}
: How to get the General Social Survey (GSS) data
:::
::::

:::::::::::::::::::::::::::::::::::::::: my-experiment-container
::::::::::::::::::::::::::::::::::::::: panel-tabset
###### Explore

To use all the facilities of the GSS Data Explorer (tagging, tabulating, data extracting) you need to register for a free account. The good thing is: This is a onetime procedure.

:::::: my-procedure
:::: my-procedure-header
::: {#prp-chap01-get-gss-data-explorer}
: Downloading data extracts with the GSS Data Explorer
:::
::::

::: my-procedure-container
1.  Create a free account for the [GSS Data Explorer](https://gssdataexplorer.norc.org/), a tool that allows to browse the data that have been collected in the surveys.
    -   Fill out the form
    -   Wait for an email with the verification code
    -   Confirm the registration with the verification code
2.  Go for the tour to learn the interface (Link "Tour Guide")
3.  Now you are ready to follow the advises in the slides. If you prefer you can view the slide show in a [standalone browser](https://petzi53.quarto.pub/gss-data-explorer/#/title-slide).
:::
::::::

<iframe width="650" height="400" class="slide-deck" src="https://petzi53.quarto.pub/gss-data-explorer/#/title-slide">

</iframe>

As one can see this is a somewhat cumbersome procedure to download the desired data. Following the proposed strategies in the other tabs are much easier for importing GSS data. But using the GSS Data Explorer is very helpful to *explore* the dataset. Apply the first three steps of the above list to find the correct variable names, to read the exact wording of the question asked and to inspect the different codes that are used for the variable. Otherwise you have to skim the more than 700 pages of the GSS codebook.😭

###### by hand

Another approach is to download the complete dataset (or all variables of those years you are interested in) and manage the data in such a way that it can be easily used for your research question. (See @sec-chap01-data-wrangling)

:::::: my-procedure
:::: my-procedure-header
::: {#prp-chap01-get-gss-data-manually}
: Download GSS individual year data sets (cross-section only)
:::
::::

::: my-procedure-container
1.  Visit <https://gss.norc.org/Get-The-Data> and choose under the section "Download the Data" the "STATA" format. I read elsewhere that this is the preferred format to convert the data into R with the {**haven**} package.
2.  From the [STATA-page](https://gss.norc.org/get-the-data/stata) choose the appropriate link (`2016` in our case) under the section "Individual Year Data Sets (cross-section only)" and download the file `2016_stata.zip` (994 MB) into your preferred folder on your hard disk. After you extract the .zip file you will get the STAT file `GSS2016.dta` (4.4 MB).
3.  You can now apply @cnj-chap01-import-stata-data.
:::
::::::

:::::::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-import-stata-data}
: Import STATA GSS 2016 file into R using (**haven**)
:::
::::

::::::: my-r-code-container
::: {#lst-chap01-import-stata-2016-file}
```{r}
#| label: import-stata-2016-file
#| cache: true
#| eval: false

## run only once (manually) ###########
gss_2016_man <- haven::read_dta("data/chap01/GSS2016.dta")
save_data_file("chap01", gss_2016_man, "gss_2016_man.rds")

```

(*This R code chunk has no visible output*)

```{r}
#| label: glance-data-stata-file

gss_2016_man <- 
    base::readRDS("data/chap01/gss_2016_man.rds")

gss_2016_man |>
    dplyr::select(c(age, grass)) |> 
    my_glance_data(N = 8, seed = 2016)
```

Import STATA GSS 2016 file into R using (**haven**) and glance at the data
:::

::::: my-important
::: my-important-header
{**haven**} imports data as labelled vectors
:::

::: my-important-container
The data structure we have found here is very different from the Excel data file provided with the book.
:::
:::::

Labelled vectors is a completely new feature for me. I learned that value labels and other metadata tags that are commonly seen when working with other statistical software like SAS, STATA or SPSS (cf. [Survey Research and Datasets in R](https://socialresearchcentre.github.io/r_survey_datasets/), here section 3 [Labelled Data](https://socialresearchcentre.github.io/r_survey_datasets/labelled-data.html))

> A labelled vector is a common data structure in other statistical environments, allowing you to assign text labels to specific values. This class makes it possible to import such labelled vectors in to R without loss of fidelity. ([Create a labelled vector](https://haven.tidyverse.org/reference/labelled.html))

I will go into more details in @sec-chap01-labelled-data. The important thing here is to notice that the variable `grass` has labelled values that explain the short code. Code `1` represents the respondent option that marijuana should be legalized and `2` the opposite. We also learn that there is with `NA i` a special kind of `NA` value:

> .i: Inapplicable (IAP). Respondents who are not asked to answer a specific question are assigned to IAP. (See [Alert on the STATA download page](https://gss.norc.org/get-the-data/stata))

On the website we see under the "Alert" section that there other kind of NA’s as well. And the 2022 GSS Codebook describes still other, less common missing values.
:::::::
::::::::::

**Additional comments**

I chose for file saving the `base::saveRDS()` option (and not `base::save()`) because when later reading into R again with `base::readRDS()` it does not overwrite a variable with the same name respectively I can assign the file to another variable name.

###### automated

:::::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-get-gss2016-data}
: Get year 2016 of GSS data set with `base::download.file()`
:::
::::

::::: my-r-code-container
::: {#lst-chap01-get-gss2016-automated}
```{r}
#| label: get-gss2016-automated
#| code-fold: show
#| cache: true
#| eval: false

## run only once (manually) ###########
temp <- base::tempfile()
utils::download.file("https://gss.norc.org/documents/stata/2016_stata.zip",temp)
gss_2016_aut <- haven::read_dta(base::unz(temp, "GSS2016.dta"))
base::unlink(temp)
save_data_file("chap01", gss_2016_aut, "gss_2016_aut.rds")
```

Get year 2016 of GSS data set with `base::download.file()`
:::

(*This R code chunk has no visible output*)

This time we have the file downloaded programmatically which is much better in term of reproducibility. We don't need now to import the data {**haven**} but can call base::readRDS().

::: {#lst-chap01-read-rds-data}
```{r}
#| label: read-rds-data

gss_2016_aut <- base::readRDS("data/chap01/gss_2016_aut.rds")

gss_2016_aut |> 
    dplyr::select(c(age, grass)) |> 
    my_glance_data(N = 8, seed = 2016)

```

Read previously saved `.rds` file into R and glance at the data
:::
:::::
::::::::

Data now have a more R like appearance, even if the variable classes with <hvn_lbll> "*`r class(gss_2016_aut$age)`*" are unfamiliar. But we have now lost some information, especially we have to consult the codebook to know what the codes `1` and `2` mean.

###### lodown

The following strategy I have taken from the bookdown book [Analyze Survey Data for Free](http://asdfree.com/) (asdf.com) It gives step by step instructions to explore public `r glossary("Microdata")`. Here I refer to the `r glossary("General Social Survey")` (GSS) section of the book.

:::::: my-procedure
:::: my-procedure-header
::: {#prp-chap01-get-gss2016-lodown}
: Get the GSS data with the {**lodown**} package and glance at the data
:::
::::

::: my-procedure-container
Working with {**lodown**} is a three step procedure:

1.  Retrieve a listing of all available extracts for the GSS data.
2.  Choose what files you want to download. In our case data for the year 2016.
3.  Download the specified dataset in the offered SPSS file format, but {**lodown**} produces with `.rds` a native R file format with the name `2016.rds`.
:::
::::::

The second step has to be done manually but I have the result of my inspection already integrated in @lst-chap01-get-gss2016-lodown.

As additional steps I renamed the downloaded file, so that I can distinguish it from similar files of the other approaches. Finally I glanced at the `grass` and `age` data.

::::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-get-gss2016-lodown}
: Get GSS data via {**lodown**} package
:::
::::

:::: my-r-code-container
::: {#lst-chap01-get-gss2016-lodown}
```{r}
#| label: get-gss2016-lodown
#| eval: false
#| cache: true
#| results: hold

## can't suppress messages, tried several options

## run only once (manually) ############
my_folder <- base::paste0(here::here(), "/data-raw")

# (1) retrieve a listing of all available extracts for the GSS data
gss_cat <- lodown::get_catalog(data_name = "gss",
                               output_dir = my_folder,
                               "GSS") |> 
## (2) choose the catalog part to download
    dplyr::filter(
        output_filename == base::paste0(my_folder, "/2016.rds"))

## (3) download the GSS microdata as 2016.rds
lodown::lodown("gss" , gss_cat)

## rename dataset to distinguish from other download approaches
old_filename <- base::paste0(my_folder, "/2016.rds")
new_filename <- base::paste0(my_folder, "/gss_2016_cat.rds")
base::file.rename(from = old_filename, to = new_filename)
```

(*This R code chunk has no visible output beside the warning message.*)

```{r}
#| label: glance-data-gss-cat

## load and glance at data
gss_2016_cat <- base::readRDS("data/chap01/gss_2016_cat.rds")
gss_2016_cat |> 
    dplyr::select(c(age, grass)) |> 
    my_glance_data(N = 8, seed = 2016)
```

Get GSS data for the year 2016 via the {**lodown**} package
:::
::::
:::::::

The result is a pure `.rds` file where the columns are still of class "*`r class(gss_2016_cat$grass)`*" as in @lst-chap01-read-rds-data.

###### gssr

Finally I will download the 2016 data with the help of the {**gssr**} package (see @sec-gssr). This takes some minutes. At first I missed the vignette, so I had to download the package again with the additional argument `build_vignettes = TRUE`. Whereas the vignette explains how to analyse data the GitHub is very helpful how to get the desired data.

> You can quickly get the data for any single GSS year by using `gssr::gss_get_yr()` to download the data file from NORC and put it directly into a tibble.

After downloaded the file we can --- as in the other tabs already done --- load the file and glance at the grass/age data.

::::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-get-gss2016-gssr}
: Get GSS 2016 Year Data Set (cross-section only) and glance at the data
:::
::::

:::: my-r-code-container
::: {#lst-chap01-get-gss2016-gssr-and-glance-at-data}
```{r}
#| label: get-gss2016-gssr
#| eval: false
#| cache: true
#| results: hold

## run only once (manually) ####################
gss_2016_gssr <- gssr::gss_get_yr(year = 2016)
gss_2016_gssr <- gss_2016_gssr |>
   dplyr::select(c(age, grass))

save_data_file("chap01", gss_2016_gssr, "gss_2016_gssr.rds")
```

```{r}
#| label: glance-data-gss2016

## load data ####################
gss_2016_gssr <- base::readRDS("data/chap01/gss_2016_gssr.rds")

## glance at data
gss_2016_gssr |> 
    my_glance_data(N = 8, seed = 2016)
```

Get GSS 2016 Year Data Set (cross-section only) and glance at the data
:::
::::
:::::::

Downloading data with {**gssr**} results in exactly the same format as in listing @lst-chap01-import-stata-2016-file from the manual download. But it has now the advantages from the {**gssr**} package. For instance with the integrated help it is much easier to

-   find the variables
-   to read the question text of the variable
-   to see in which year the questions was asked
-   what the code - including the different types of NA’s mean
:::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::::::
:::::::::::::::::::::::::::::::::::::::::::

::::: my-important
::: my-important-header
Six different approaches to get the GSS data
:::

::: my-important-container
Using the {**gssr**} packages seems to me by far the best approach.
:::
:::::

## Working with Labelled Data {#sec-chap01-labelled-data}

### How to work with labelled data? --- Resources

During downloading GSS data I mentioned that {**haven**} imports data as labelled vectors. As a completely new feature for me I looked at the Internet for their advantages and how to work with them. I found four important resources:

::::::::: my-resource
:::: my-resource-header
::: {#lem-chap01-survey-research-data}
: How to work with labelled data?
:::
::::

:::::: my-resource-container
[Survey Research Datasets and R](https://socialresearchcentre.github.io/r_survey_datasets/)

A book accompanying a workshop delivered for the 7th Biennial `r glossary("ACSPRI")` Social Science Methodology Conference. It provides a practical demonstration of several packages for accessing and working with survey data, associated metadata and official statistics in R. The short book demonstrates

-   Working with external data sources from common statistical packages (SPSS, SAS, Stata, Excel) and their quirks
-   Easily working with categorical data in R with the {**labelled**} R package
-   Accessing external databases in an R native way using {**DBI**} and {**dbplyr**} (The [DBI package](https://dbi.r-dbi.org/) helps connecting R to database management systems (DBMS). [dbplyr](https://dbplyr.tidyverse.org/) is a {**dplyr**} backend for data bases)
-   Accessing publicly available data in R scripts via the web
-   Resources for accessing official statistics data in R

Here I am interested only in the first two demonstration. I will refer especially to the section on [Labelled data](https://socialresearchcentre.github.io/r_survey_datasets/labelled-data.html).

------------------------------------------------------------------------

::::: my-watch-out
::: my-watch-out-header
WATCH OUT! {**labelled**} and {**sjlabelled**} have similar purposes and functions
:::

::: my-watch-out-container
{**labelled**} and {**sjlabelled**} are very similar. As far as I understand is the main difference that {**sjlabelled**} supports not only working with labelled data but also offers functions to benefit from these features. {**labelled**} in contrast manipulates labelled data only as an intermediate data structure that should be converted to common R classes, loosing these kind of meta-data.
:::
:::::
::::::
:::::::::

### Displaying data including labelled vectors {#sec-chap01-labelled-data-2}

To get comfortable with labelled data @def-chap01-show-labelled-data I will show how labelled data appear in different viewing functions. To inspect the possibilities of the {**labelled**} package see @def-chap01-use-labelled-haven-package.

:::::::::::::::::::::::::::::::::::::: my-experiment
:::: my-experiment-header
::: {#def-chap01-show-labelled-data}
: Different types of data views with labelled data
:::
::::

::::::::::::::::::::::::::::::::::: my-experiment-container
:::::::::::::::::::::::::::::::::: panel-tabset
###### str()

:::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-str-labelled-data}
: Display the internal structure of labelled data
:::
::::

::: my-r-code-container
```{r}
#| label: str-labelled-data

utils::str(gss_2016_gssr)
```
:::
::::::

###### attributes()

:::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-attributes-labelled-data}
: Show attributes of labelled data
:::
::::

::: my-r-code-container
```{r}
#| label: attributes-labelled-data

base::attributes(gss_2016_gssr$grass)
```

This works only with vectors (variables), but not with data frames. You can display specific list values with numbers (e.g., `attributes(gss_2016_gssr$grass)[[4]]`) or names (e.g., `attributes(gss_2016_gssr$grass)[["labels"]]`).
:::
::::::

###### head()

::::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-head-labelled-data}
: Show first six records of the labelled variable
:::
::::

:::: my-r-code-container
::: {#lst-chap01-head-labelled-data}
```{r}
#| label: head-labelled-data

head(gss_2016_gssr$grass)
```

Use `head()` to show the first six records of the labelled variable.
:::

------------------------------------------------------------------------

This works only with labelled vectors (variables), but not with data frames.
::::
:::::::

###### tibble()

> One immediate advantage of labelled vectors is that value labels are used in data frame printing when using [tibble](https://tibble.tidyverse.org/) (and by extension the wider tidyverse) and other packages using the [pillar](https://cran.r-project.org/web/packages/pillar/index.html) printing methods. ([Survey Research Datasets and R](https://socialresearchcentre.github.io/r_survey_datasets/what-is-labelled-data.html))

::::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-tibble-labelled-data}
: Using `tibble` resp. `pillar` printing methods
:::
::::

:::: my-r-code-container
::: {#lst-chap01-tibble-labelled-data}
```{r}
#| label: tibble-labelled-data

gss_2016_gssr |> 
    dplyr::count(grass)
```

Using `tibble` resp. `pillar` printing methods to display NA's
:::
::::
:::::::

###### glimpse()

:::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-glimpse-labelled-data}
: Get a glimpse of labelled data
:::
::::

::: my-r-code-container
```{r}
#| label: glimpse-labelled-data

dplyr::glimpse(gss_2016_gssr)
```
:::
::::::

###### print()

:::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-print-labelled-data}
: Print values as `tibble()` with labelled data
:::
::::

::: my-r-code-container
```{r}
#| label: print-labelled-data

base::print(gss_2016_gssr)
```
:::
::::::

###### my_glance_data()

:::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-glance-labelled-data}
: Glance randomly at some records with labelled data
:::
::::

::: my-r-code-container
```{r}
#| label: glance-labelled-data

gss_2016_gssr |>
    my_glance_data(N = 8, seed = 2016)

```
:::
::::::

###### RStudio

In RStudio you can see part of the labels by supplying additional information about NAs in R code chunks and variable labels in the RStudio viewer.

::: {#fig-rstudio-labelled-data layout-ncol="2"}
![Labelled data as a R code chunk result in RStudio](img/chap01/rstudio-r-chunk-labelled-data-min.png){#fig-rstudio-chunk}

![Labelled data in RStudio data viewer](img/chap01/rstudio-viewer-labelled-data-min.png){#fig-rstudio-viewer}

RStudio display of labelled data
:::
::::::::::::::::::::::::::::::::::
:::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::::

`base::summary()` with data frames that contain labelled data generates an error when {**haven**} is not loaded:

> ! `summary.haven_labelled()` not implemented.

::::: my-important
::: my-important-header
Labelled meta data not visible in most base R viewing functions
:::

::: my-important-container
With the exception of `base::str()` labelled meta data are not visible when viewing at the printed data.

And even more important: You can't see the different types of NA's without specialized functions. "Tagged" (labelled) missing values behaves exactly like regular R missing values. For more information see @def-chap01-use-labelled-haven-package.
:::
:::::

### Working with labelled data {#sec-chap01-labelled-data-3}

#### Two different approaches

> The purpose of the labelled package is to provide functions to manipulate metadata as variable labels, value labels and defined missing values using the `haven_labelled` and `haven_labelled_spss` classes introduced in haven package. ([Introduction to labelled](https://cran.r-project.org/web/packages/labelled/vignettes/intro_labelled.html))

Until now I have not worked in this book with `haven_labelled_spss` classes. This class comes from imports of SPSS data files via the {**haven**} package.

> This class is only used when `user_na = TRUE` in `read_sav()`. It is similar to the `labelled()` class but it also models SPSS's user-defined missings, which can be up to three distinct values, or for numeric vectors a range. ([Labelled vector for SPSS](https://haven.tidyverse.org/reference/labelled_spss.html)) [^01-preparing-data-1]

[^01-preparing-data-1]: To prevent conflict with the labelled class from the {**Hmisc**} package {**haven**} has changed with version 2.0.0 its label class names from `labelled` and `labelled_spss` to `haven_labelled` and `haven_labelled_spss`. (See [GitHub Issue #329](https://github.com/tidyverse/haven/issues/329).)

As user-defined missing values are not important here I will stick with STATA imports.

My main interest with @def-chap01-use-labelled-haven-package is to prepare and/or use the labelled data to work with R. There are principal two approaches:

:::::: my-procedure
:::: my-procedure-header
::: {#prp-chap01-labelled-data-in-r}
: Two different procedures to work in R with labelled data
:::
::::

::: my-procedure-container
**Approach A**

> In approach A, `haven_labelled` vectors are converted into factors or into numeric/character vectors just after data import, using `unlabelled()`, `to_factor()` or `unclass()`. Then, data cleaning, recoding and analysis are performed using classic R vector types.

1.  Data import
2.  *Conversion to factor / numeric*
3.  Data cleaning / data recoding
4.  Data analysis

**Approach B**

> In approach B, `haven_labelled` vectors are kept for data cleaning and coding, allowing to preserved original recoding, in particular if data should be reexported after that step. Functions provided by labelled will be useful for managing value labels. However, as in approach A, `haven_labelled` vectors will have to be converted into classic factors or numeric vectors before data analysis (in particular modeling) as this is the way categorical and continuous variables should be coded for analysis functions. ([Introduction to labelled](https://cran.r-project.org/web/packages/labelled/vignettes/intro_labelled.html))

1.  Data import
2.  Data cleaning / data recoding (-\> Data export to SPSS / SAS / STATA )
3.  *Conversion to factor / numeric*
4.  Data analysis

[![Two different approaches to work in R with labelled data](img/chap01/labelled-approaches-min.png){#fig-chap01-two-approaches-with-labelled-data}](https://cran.r-project.org/web/packages/labelled/vignettes/intro_labelled.html)
:::
::::::

#### Inspecting labelled data

:::::::::::::::::::::::::::::::::::::::::::: my-experiment
:::: my-experiment-header
::: {#def-chap01-use-labelled-haven-package}
: Inspecting labelled data
:::
::::

::::::::::::::::::::::::::::::::::::::::: my-experiment-container
:::::::::::::::::::::::::::::::::::::::: panel-tabset
###### var_label()

::::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-labelled-var-label}
: Get variable labels of a labelled data frame
:::
::::

:::: my-r-code-container
::: {#lst-chap01-labelled-var-label}
```{r}
#| label: labelled-var-label

labelled::var_label(gss_2016_gssr) 
```

Use `labelled::var_label()` to get variables labels of a data frame
:::

The labels of the two variable specifies in more detail their content. For the `grass` variable we get the question asked in the survey. This is very helpful and saves consulting the codebook.

------------------------------------------------------------------------
::::
:::::::

I am using here the short form with `labelled::var_label()` but there exist also with `labelled::get_variable_labels()` a long form which has with it plural form a more consistent syntax.

###### val_labels()

::::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-labelled-val-labels}
: Get value labels of data frame variables
:::
::::

:::: my-r-code-container
::: {#lst-chap01-labelled-val-labels}
```{r}
#| label: labelled-val-labels

labelled::val_labels(gss_2016_gssr)
```

Use `labelled::val_labels()` to get value labels of data frame variables
:::
::::
:::::::

Important is the start of the list of value labels.

-   **age**: It shows that the value `89` of variable `age` includes values of `89 and older`. This is important for the analysis and saves the work of recoding as done in
    1.  
-   **grass**: For the `grass` variable we learn that `1` stands for the opinion that Marijuana "should be legal" and two for the opposite.

What follows is a comprehensive list of NA values used in the survey, even if many of these values are not used for the two variables considered here.

We know from @cnj-chap01-tibble-labelled-data that the `grass` variable has 1024 NA's, but we do not know their composition of different NA values. See @lst-chap01-haven-na-tag how to get this information.

------------------------------------------------------------------------

::::: my-watch-out
::: my-watch-out-header
WATCH OUT! Inconsistency in the function names
:::

::: my-watch-out-container
There is an inconsistency in the singular vs. plural form between `labelled::var_label()` and `labelled::val_labels()`. Both functions show a *list of values* (= plural) if there are more than one label available.

For value labels exist also a singular version with `labelled::val_label()` to get a specific value label of a variable.
:::
:::::

###### head()

::::::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-labelled-head}
: Using `utils::head()` on a labelled variable after {**labelled**} is loaded
:::
::::

:::::: my-r-code-container
```{r}
#| label: labelled-head

## {**labelled**} is loaded through a call in the previous chunk
utils::head(gss_2016_gssr$grass, 10)
```

------------------------------------------------------------------------

Using `utils::head()` on a variable when {**labelled**} is loaded prints a nicely formatted summary of the attached metadata, excluding formats. It shows all NA value labels in the data frame used in the survey, even only some of them is used here. (See @lst-chap01-haven-na-tag to get type of NA's used for a variable.)

We see in the first 10 records of the variable `grass`

-   three `1` ("should be legal")
-   three `2` ("should not be legal") and
-   four `NA(i)` ("iap")

Until now it is not clear what "iap" for the `NA(i)` type means. If we want [download (STATA) files directly from the GSS website](https://gss.norc.org/get-the-data/stata) we see on the right side a note about the most important `NA` types. It explains the the abbreviation "iap" stands for "inapplicable".

[![GSS Information about the most important NA types](img/chap01/na-gss-website-note-min.png){#fig-chap01-gss-na-note fig-alt="An \"Alerts\" explains that there are three main missing values in the data; \".i\": Inapplicable (IAP). Respondents whor are not asked to answer a specific question are assigned to IAP.\".d\": Don't know (DK). \".n\": No answer (NA)." fig-align="center"}](https://gss.norc.org/get-the-data/stata)

::::: my-note
::: my-note-header
Two different meanings of "NA"
:::

::: my-note-container
With the `NA(n)` type of missing value "NA" does not mean "Not Applicable" but "No answer".
:::
:::::

Returning only the `NA(i)` type of a missing value does not mean that there aren't any other NA types in the data. (See @lst-chap01-labelled-print-unique-tagged to get the types of tagged NA's of a variable.)
::::::
:::::::::

This formatted result of labelled data is better readable as with @lst-chap01-labelled-val-labels, because category and value are in the same line. Compare the result of this code ({**labelled**} is loaded) with @lst-chap01-head-labelled-data ({**labelled**} is not loaded).

::::: my-note
::: my-note-header
haven::print_labels() as equivalent of head() with {**labelled**} loaded
:::

::: my-note-container
With the exception of the specified rows in the `head()` function, you can get the same nicely formatted list of NA types with `haven::print_labels()`. (See next tab "Print labels".)
:::
:::::

###### Print labels

:::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-haven-print-labels}
: Print labels
:::
::::

::: my-r-code-container
```{r}
#| label: haven-print-labels

haven::print_labels(gss_2016_gssr$grass)
```
:::
::::::

For a detailed explication see previous tab "head()".

###### Unique NA's

::::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-labelled-print-unique-tagged}
: Print unique tagged NA's
:::
::::

:::: my-r-code-container
::: {#lst-chap01-labelled-print-unique-tagged}
```{r}
#| label: labelled-print-unique-tagged

labelled::unique_tagged_na(gss_2016_gssr$grass) |> 
    labelled::sort_tagged_na() |> 
    haven::print_tagged_na()
```

Show the unique types of NA's of a variable
:::

------------------------------------------------------------------------

We see that variable `grass` has three types of NA's:

`NA(d)`: Don't know. `NA(i)`: Inapplicable, e.g., these respondents were not asked the marijuana question. `NA(n)`: No answer

Here I have used `haven::print_tagged_na()` to format tagged NA's as `NA(a)`, `NA(b)`, etc. (See ["Tagged" missing values](https://haven.tidyverse.org/reference/tagged_na.html))
::::
:::::::

It would be interesting to know the composition of these different NA types. One could reason that there is a big difference between "question not asked" and "no answer". (See @lst-chap01-haven-na-tag to get the composition of the NA types.)

###### NA composition

::::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-haven-na-tag}
: Get the composition of the different types of tagged NA's
:::
::::

:::: my-r-code-container
::: {#lst-chap01-haven-na-tag}
```{r}
#| label: haven-na-tagged

gss_2016_gssr |>
  dplyr::count(
    grass,
    haven::na_tag(grass)
  )
```

Get the composition of the different types of tagged NA's
:::

------------------------------------------------------------------------

We already know from @lst-chap01-tibble-labelled-data that the variable `grass` has 1024 NA's. Now we know also the type composition of these NA's.
::::
:::::::

I could not find a similar function in {**labelled**}, so I have used here `haven::na_tag()`.
::::::::::::::::::::::::::::::::::::::::
:::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::::::::::

#### Working with labelled data

There are two ways to work with labelled data in R:

> The goal of haven is not to provide a labelled vector that you can use everywhere in your analysis. The goal is to provide an intermediate datastructure that you can convert into a regular R data frame. You can do this by either converting to a factor or stripping the labels ({**Haven**} vignette [Conversion semantics](https://haven.tidyverse.org/articles/semantics.html))

::::::::::::::::::::::: my-experiment
:::: my-experiment-header
::: {#def-conversion-labelled-data}
: How to convert labelled data for data analysis in R
:::
::::

:::::::::::::::::::: my-experiment-container
The second variable `grass` is the labelled original variable. The third variable (last column to the right) is the modified `grass` variable.

::::::::::::::::::: panel-tabset
###### as_factor()

:::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-haven-as-factor}
: Convert labelled vectors to factors
:::
::::

::: my-r-code-container
```{r}
#| label: haven-as-factor

gss_2016_grass <- gss_2016_gssr |> 
    dplyr::select(grass)

gss_2016_grass |> 
    dplyr::mutate(grass_factor = haven::as_factor(grass)) |> 
    my_glance_data(N = 8, seed = 42)
```

Here I am using `haven::as_factor()`. This function also knows how to re-label missing values.
:::
::::::

###### zap_labels()

:::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-haven-zap-labels}
: Remove labels
:::
::::

::: my-r-code-container
```{r}
#| label: haven-zap-labels

gss_2016_grass |> 
    dplyr::mutate(grass_unlabelled = haven::zap_labels(grass)) |> 
    my_glance_data(N = 8, seed = 42)

```
:::
::::::

###### zap_missing()

:::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-haven-zap-missing}
: Convert special missings to regular R missings
:::
::::

::: my-r-code-container
```{r}
#| label: haven-zap-missing

gss_2016_grass |> 
    dplyr::mutate(grass_rgular_na = haven::zap_missing(grass)) |> 
    my_glance_data(N = 8, seed = 42)
```
:::
::::::

###### zap_label()

:::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-haven-zap-label}
: Remove variable labels from a data frame
:::
::::

::: my-r-code-container
```{r}
#| label: haven-zap-label
#| results: hold

print("*******   Original data frame   ********")
labelled::var_label(gss_2016_gssr)

print("*******   Data frame after variable zapped   *******")
gss_2016_gssr |> 
    haven::zap_label() |> 
    labelled::var_label()

```
:::
::::::
:::::::::::::::::::

You can combine all three zap-commands.
::::::::::::::::::::
:::::::::::::::::::::::

## Table excursion

### Review several table packages {#sec-chap01-review-table-packages}

Creating tables is an important result of preparing data --- the main subject of this section. In my research I came up with eight packages that seems to me the most important ones. What follows is a short introduction into these packages.

::::::::::::::::::::::::::::::::::::::: my-experiment
:::: my-experiment-header
::: {#def-chap01-review-table-packages}
: Review table packages: {**DT**}, {**flextable**}, {**gt**}, {**gtsummary**}, {**htmlTable**}, {**kableExtra**}, {**tableone**}, {**xtable**}
:::
::::

:::::::::::::::::::::::::::::::::::: my-experiment-container
::::::::::::::::::::::::::::::::::: panel-tabset
###### DT

[{**DT**}](https://rstudio.github.io/DT/)

DT is an abbreviation of 'DataTables.' Data objects in R can be rendered as HTML tables using the JavaScript library 'DataTables' (typically via R Markdown or Shiny).

This R package provides an enhanced version of data.frames with more options for authors and users:

**Authors**: DT supports data wrangling beside subsetting rows and selecting columns (e.g., computing on columns, sort, grouping). I am not sure if this is really an advantage because it requires the somewhat complex square brackets notation. Using {**dplyr**} with other {**tidyverse**} packages seems --- for me at least --- easier to use. **Users**: The really advantage of {**DT**} in my opinion is that the interface allows users to change values and display formats interactively. There are some option in the standard installations and much more with extensions.

-   **Standard**: Out of the box {**DT**} allows user to
    -   search,
    -   sort,
    -   choose the number of rows shown per page,
    -   filter columns
    -   add/remove row names and even
    -   edit cells individually.
-   **Extensions**: There are also some extension to enhance the interactive options: Extensions allow users to
    -   edit groups of cells,
    -   provide buttons for copying or downloading the table in different formats,
    -   to re-order and/or hide columns,
    -   to fix columns and/or header if the table is too wide and/or to long and you have to scroll,
    -   moving around with arrow keys like in Excel tables,
    -   making the table responsive to smaller screens by collapsing columns,
    -   grouping values by a column,
    -   reordering rows,
    -   scrolling through the rows and
    -   selecting parts of variables interactively.

:::::: my-remark
:::: my-remark-header
::: {#rem-chap01-dt}
My conclusion to {**DT**}
:::
::::

::: my-remark-container
{**DT**} is my first choice for interactive data exploring and very useful for Explorative Data Analysis (`r glossary("EDA")`). I have already some experiences with this package but haven't use it lastly.

Compared with the other table packages {**DT**} is not in the same category: The following packages are developed as `r glossary("Display Tables")` (in contrast to `r glossary("Data Tables")`. Beside of the interactive use {**DT**}s are Data Tables.
:::
::::::

###### flextable

[{**flextable**}](https://davidgohel.github.io/flextable/)

This package helps you to create reporting table from a data frame easily. You can

-   merge cells,
-   add headers,
-   add footers,
-   change formatting,
-   set how data in cells is displayed
-   Table content can also contain mixed types of text and image content.
-   Tables can be embedded from R Markdown documents into HTML, PDF, Word, and PowerPoint documents and can be embedded using Package Officer for Microsoft Word or PowerPoint documents.
-   Tables can also be exported as R plots or graphic files, e.g., png, pdf, and jpeg.

It is said that {**flextable**} --- together with the sparsely used [{**huxtable**}](https://hughjonesd.github.io/huxtable/) --- "supports the widest range of output formats ... \[both\] support HTML, {{< latex >}}, and Office formats, and contain most common table features (e.g., conditional formatting)." (Yihui: [R Markdown Coobook](https://bookdown.org/yihui/rmarkdown-cookbook/table-other.html))

:::::: my-remark
:::: my-remark-header
::: {#rem-chap01-flextable}
My conclusion to {**flextable**}
:::
::::

::: my-remark-container
{**flextable**} seems pretty well suited to create sophisticated tables for publications. It is often especially mentioned for exporting into Word files. Maybe I will look into this package in the near future, but currently my priority is {**gtsummary**}.
:::
::::::

###### gt

[{**gt**}](https://gt.rstudio.com/)

{**gt**} stands for "grammar of tables". It aims to do for tables what {**ggplot2**} did for graphics. ([Hadley Wickham on X](https://twitter.com/hadleywickham/status/1247997744469295104?ref_src=twsrc%5Etfw%7Ctwcamp%5Etweetembed%7Ctwterm%5E1247997744469295104%7Ctwgr%5Edb446e88f95e2f0e4a5e791da84f82d626ceee1c%7Ctwcon%5Es1_&ref_url=https%3A%2F%2Fcdn.embedly.com%2Fwidgets%2Fmedia.html%3Ftype%3Dtext2Fhtmlkey%3Da19fcc184b9711e1b4764040d3dc5c07schema%3Dtwitterurl%3Dhttps3A%2F%2Ftwitter.com%2Fhadleywickham%2Fstatus%2F1247997744469295104image%3D))

{**gt**} offers a different and easy-to-use set of functions that helps to build display tables from tabular data. The {**gt**} philosophy states that a comprehensive collection of table parts can be used to create a broad range of functional tables.

It allows you to compose a table by putting together different parts of the table, such as

-   the table header (title and subtitle),
-   the column labels,
-   the table body,
-   row group labels,
-   and the table footer.

Some parts are optional. You can also format numbers and add background shading to cells.

[![Part of a {gt} table](https://gt.rstudio.com/reference/figures/gt_parts_of_a_table.svg){#fig-gt-table-parts fig-alt="The table shows all the different parts of a gt table: The TABLE HEADER consists of TITLE and SUBTITLE. The next part is called STUB HEAD and consists of STUBHEAD LABEL on the left, above the row names and SPANNER COLUMN LABEL with COLUMN LABELs below. The third part is called STUB and consists of ROW GROUP LABEL, ROW LABELs and SUMMARY LABEL on the left. On the right side are the values of the table in cells respectively summary cells. The last part is called TABLE FOOTER and consists of FOOTNOTES and SOURCE NOTES."}](https://gt.rstudio.com/index.html).

At the beginning of the development process {**gt**} supported only HTML, but now you can also use it for LATEX and RTF outputs.

{**gt**} is developed by the RStudio/Posit team, that stands not only for high quality but also for {**tidyverse**} compatibility. It was a little surprise for me that the download statistics does not reflect the normally very high figures of RStudio/Posit products. After trying to work with {**gt**} I believe I know the reasons:

-   With {**gt**} you have big flexibility but you have to provide all the details. This could be cumbersome, especially when other table packages are providing not only theses details but incorporate also many other features as calculations, tests etc. {**gt**} is therefore better used in combination with other packages like {**gtsummary**}.
-   {**gt**} is still very early in the development process. The issue tracker show this in the relation of open / closed issue: 272 open / 633 closed (2024-01-18).

:::::: my-remark
:::: my-remark-header
::: {#rem-chap01-gt}
My conclusion to {**gt**}
:::
::::

::: my-remark-container
I tried to work with {**gt**} but it is cumbersome. I noticed a [bug even following the official online tutorial](https://github.com/rstudio/gt/issues/1535). Maybe {**gt**} (together with {**gtsummary**} is better suited for creating easily complex tables in many formats.
:::
::::::

###### gtsummary

The {**gtsummary**} package provides an efficient way to create publication-ready analytical and summary tables using the R programming language. {**gtsummary**} summarizes data sets, regression models, and more, using sensible defaults with highly customizable capabilities.

{**gtsummary**} "was written to be a companion to the {**gt**} package from RStudio. But not all output types are supported by the {**gt**} package. Therefore, \[the developers\] have made it possible to print {**gtsummary**} tables with various engines." ([gtsummary + Quarto/R Markdown](https://www.danieldsjoberg.com/gtsummary/articles/rmarkdown.html))

[![Summary of the various Quarto and R Markdown output types and the print engines supporting them](img/chap01/gtsummary-output-formats-min.png){#fig-chap01-gtsummary-outputs fig-alt="Table of summaries for output formats: The table header reads from left to the right: Print Engine, Function, HTML, Word, PDF, RTF.  As print engines feature in this table: gt, flextable, huxtable, kableExtra, kable and tibble. All functions start with 'as_' followed by the name of the package, e.g., as_gt(), as_flex_table(), as_kable_extra() etc. The table states that gt is still working on RTF output, flextable and kableExtra does not support RTF, kableExtra also has no output format for Word. Kable has all four options in a very reduced form (missing indentation, footnotes and spanning headers) and tibble fails on all four formats." fig-align="center"}](https://www.danieldsjoberg.com/gtsummary/articles/rmarkdown.html)

:::::: my-remark
:::: my-remark-header
::: {#rem-chap01-gtsummary}
My conclusion to {**gtsummary**}
:::
::::

::: my-remark-container
I have tried out {**gtsummary**} and I am enthusiastic about it! I wrote an introduction with some examples (@sec-chap01-experiments-gtsummary) Now that I understand how it works I will definitely use it. (By the way: Besides the extensive documentation I found that the most effective introduction for me was watching the [Youtube video presentation](https://www.youtube.com/watch?v=tANo9E1SYJE) by the developer Daniel D. Sjoberg).

{{< video https://www.youtube.com/watch?v=tANo9E1SYJE >}}
:::
::::::

###### htmlTable

[{**htmlTable**}](https://cran.r-project.org/web/packages/htmlTable/vignettes/complex_tables.html)

I found {**htmlTable**} per accident searching the Internet with "r creating complex table". This package is not mentioned in the literature I have read so far, but it seems not only able to produce complex tables but is also used quite often. Maybe it is the HTML limitation which reduces it application for Online journals because it requires a separate step to produce {{< latex >}} tables ready for printing.

It is a pretty old package (version 1.0 appeared already in 2014) but still currently in active development. The [last change](https://cran.r-project.org/web/packages/htmlTable/index.html) today (2024-01-18) was version 2.4.2 from 2023-10-29.

From the package documentation:

**Advanced Tables for Markdown/HTML**

> Tables with state-of-the-art layout elements such as row spanners, column spanners, table spanners, zebra striping, and more. While allowing advanced layout, the underlying css-structure is simple in order to maximize compatibility with common word processors. The package also contains a few text formatting functions that help outputting text compatible with HTML/LaTeX.

Beside the extensive vignette documentation there are also some blog articles:

-   [Introducing the htmlTable-package](https://gforge.se/2015/04/introducing-the-htmltable-package/)
-   [Tables form R into Word](https://gforge.se/2013/02/tables-from-r-into-word/)
-   [Fast-track publishing using knitr: table mania (part IV)](https://gforge.se/2014/01/fast-track-publishing-using-knitr-part-iv/)

:::::: my-remark
:::: my-remark-header
::: {#rem-chap01-htmltable}
My conclusion to {**htmlTable**}
:::
::::

::: my-remark-container
{**htmlTable**} seems an interesting package not only for creating complex tables but also for exporting them into Word. But to learn this package is at the moment not on my task list.
:::
::::::

###### kableExtra

[{**kableExtra**}](https://haozhu233.github.io/kableExtra/)

{**kableExtra**} extends the basic functionality of `knitr::kable()` tables. Although `knitr::kable()` is simple by design, it has many features missing which are usually available in other packages. {**kableExtra**} has filled the gap nicely. One of the best thing about {**kableExtra**} is that most of its table capabilities work for both HTML and PDF formats.

:::::: my-remark
:::: my-remark-header
::: {#rem-chap01-kableextra}
My conclusion to {\*\*kableExtra\*}
:::
::::

::: my-remark-container
I have already some experiences with {**kableExtra**}. I used it quasi as an add-on to `knitr::kable()` during personal HTML publications (Blog, bookdown, Quarto). I believe that other "Table 1" packages like {**gtsummary**} or maybe also {**tableOne**) are better suited for display and publication.
:::
::::::

###### tableone

[{**tableone**}](https://cran.r-project.org/web/packages/tableone/index.html)

The first time I learned about the existence of {**tableone**} was in @sec-chap02 of Statistics with R. Consulting the online documentation it says:

> Creates "Table 1", i.e., description of baseline patient characteristics, which is essential in every medical research. Supports both continuous and categorical variables, as well as p-values and standardized mean differences. Weighted data are supported via the {**survey**} package. … Most important functions are `CreateTableOne()` and `svyCreateTableOne()`.

:::::: my-remark
:::: my-remark-header
::: {#rem-chap01-tableone}
My conclusion to {**tableone**}
:::
::::

::: my-remark-container
The package is completely new for me. I have to check it out to see how and when to use. But my priority now is {**gtsummary**} because

-   {**gtsummary**} is much more used than {**tableone**}: over 2000 versus under 500 downloads per day
-   last release of {**gtsummary**} was version 1.7.2 from July 15, 2023 versus release 26 from July 26, 2020.
:::
::::::

###### xtable

[{**xtable**}](https://cran.r-project.org/web/packages/xtable/vignettes/xtableGallery.pdf)

{**xtable**} is the oldest package and still the table package most used! From the documentation it is clear that it is very useful for the creation of publication ready tables. But I am little hesitant to learn it, because it is not mentioned in current literature and the last change was in April 2019.

:::::: my-remark
:::: my-remark-header
::: {#rem-chap01-xtable}
My conclusion to {**xtable**}
:::
::::

::: my-remark-container
Learning {**xtable**} is not on my agenda: It is not clear for me if {**xtable**} is still in developments and if it is supporting newer approaches. Learning other table packages, especially {**gtsummary**} probabily together with {**gt**} has priority for me.
:::
::::::
:::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::
:::::::::::::::::::::::::::::::::::::::

There are other packages as well. But the above eight packages are a first starting point for everybody to learn creating and displaying sophisticated data tables in R. It is good to know about different packages but lastly one has to decide which one to learn and use in daily practice. At the moment my favorites is {**gtsummary**}.

:::::: my-resource
:::: my-resource-header
::: {#lem-chap01-table-packages}
: Other interesting table packages, but currently not considered for my personal purposes
:::
::::

::: my-resource-container
Just for a possible later reference I will add some other table packages I know of.

-   [{**reactable**}](https://glin.github.io/reactable/): `reactable()` creates a data table from tabular data with sorting and pagination by default. The data table is an HTML widget that can be used in R Markdown documents and Shiny applications or viewed from an R console. It is based on the React Table library and made with reactR. Features are:

    -   It creates a data table with sorting, filtering, and pagination
    -   It has built-in column formatting
    -   It supports custom rendering via R or JavaScript
    -   It works seamlessly within R Markdown documents and the Shiny app

-   [{**ractablefmtr**}](https://kcuilla.github.io/reactablefmtr/index.html): The package improves the appearance and formatting of tables created using the reactable R library. It includes many conditional formatters that are highly customizable and easy to use.

The authors of R Markdown Cookbook (Yihui Xie, Christophe Dervieux, Emily Riederer) mention also several other table packages in the section [Other packages for creating tables](https://bookdown.org/yihui/rmarkdown-cookbook/table-other.html):

-   {**rhandsontable**}: [Interface to the Handsontable.js Library.](https://jrowen.github.io/rhandsontable/)). Similar to {**DT**}, and has an Excel feel (e.g., you can edit data directly in the table). Visit <https://jrowen.github.io/rhandsontable/> to learn more about it.
-   {**pixiedust**}: [Tables so Beautifully Fine-Tuned You Will Believe It’s Magic](https://github.com/nutterb/pixiedust): Features creating tables for models (such as linear models) converted through the {**broom**} package ([Broom: Convert Statistical Objects into Tidy Tibbles](https://broom.tidymodels.org/)). It supports Markdown, HTML, and {{< latex >}} output formats.
-   **stargazer**: ([beautiful {{< latex >}}, HTML and ASCII tables from R statistical output](https://cran.r-project.org/web/packages/stargazer/vignettes/stargazer.pdf)): Features formatting regression models and summary statistics tables.
-   **xtable** ([Dahl et al. 2019](https://bookdown.org/yihui/rmarkdown-cookbook/table-other.html#ref-R-xtable)): Perhaps the oldest package for creating tables---the first release was made in 2000. It supports both {{< latex >}} and HTML formats.

I'm not going to introduce the rest of packages, but will just list them here:

-   **tables**: [Formula-Driven Table Generation](https://dmurdoch.github.io/tables/)),
-   **pander** ([Daróczi and Tsegelskyi 2022](https://bookdown.org/yihui/rmarkdown-cookbook/table-other.html#ref-R-pander)),
-   **tangram** ([S. Garbett 2023](https://bookdown.org/yihui/rmarkdown-cookbook/table-other.html#ref-R-tangram)),
-   **ztable** ([Moon 2021](https://bookdown.org/yihui/rmarkdown-cookbook/table-other.html#ref-R-ztable)), and
-   **condformat** ([Introduction to condformat](https://zeehio.github.io/condformat/)).
:::
::::::

::::: my-important
::: my-important-header
Summary of my table packages review
:::

::: my-important-container
It does not make sense to learn every package. SO I will concentrate on five packages with different usage:

-   {**DT**}, see: @sec-DT: An important package for interactive HTML tables. It also manages quite good big datasets
-   {**gt**}, see @sec-gt: Although this package is still lacking many features in follows a systematic approach ("Grammar of Tables") that allows to make fine grained changes. It is there somewhat cumbersome to apply it in everyday use. But it is especially valuable when you have another package for the standard features and you need some specialized change. I plan to use {**gt**} together with {**gtsummary**}.
-   {**gtsummary**}, see: @sec-gtsummary: I just learned this package and I am enthusiastic about it! I will use it as replacement for {**tableone**} that is used by Harris for the book to prepare publishing ready `r glossary("Table 1")` tables.
-   {**kableExtra**}, see: @sec-kableExtra: It uses the `kable()` function from {**knitr**} (see: @sec-knitr), which is an essential part of the publishing infrastructure for `r glossary("literate programming")`. It is pretty flexible and could be used in tandem with {**gtsummary**}.

For simple tables one can also use `knitr::kable()` or just printing the tibble.
:::
:::::

As a consequence of my table packages review I experimented with the {**gtsummary**}. I wanted to create a table where I have information by the variables grass, age in four categories and sex with all their levels in absolute numbers and in percents.

### Experiments with the gtsummary package {#sec-chap01-experiments-gtsummary}

::::::::::::::::::::::::::::::::::::::::::: my-experiment
:::: my-experiment-header
::: {#def-chap01-gtsummary-intro}
: Experiments with {**gtsummary**}
:::
::::

:::::::::::::::::::::::::::::::::::::::: my-experiment-container
0.  Display dataset used for the {**gtsummary**} introduction.
1.  A very basic summary table without any adaptions or modifications.
2.  Changing dichotomous to a categorical variable.
3.  Ditto with a different method.
4.  Changing variable labels with two different methods.
5.  Output grouped by the `grass` variable.
6.  Header modified.
7.  two cross tables of summary statistics stacked.
8.  Table stratified by the `grass` variable.

::::::::::::::::::::::::::::::::::::::: panel-tabset
###### 0

:::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-gtsummary-data}
: Data used in this {**gtsummary**} introduction
:::
::::

::: my-r-code-container
```{r}
#| label: gtsummary-data
#| cache: true

nhanes_2013_clean <- 
    base::readRDS("data/chap01/nhanes_2013_clean.rds")

gtsummary_basic <- nhanes_2013_clean |> 
 
    ## cut age into several defined age groups
    dplyr::mutate(age_cat = base::cut(age,
                   breaks = c(-Inf, 29, 39, 49, 59),
                   labels = c("18 - 29", "30 - 39", "40 - 49", "50 - 59"))) |>
    dplyr::select(-c(nr, age))

save_data_file("chap01", gtsummary_basic, "gtsummary_clean.rds")

base::summary(gtsummary_basic)
gtsummary_basic
```
:::
::::::

I am going to use this dataset to experiment with {**gtsummary**}.

###### 1

:::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-gtsummary-basic}
: A very basic summary without any adaptions or modifications
:::
::::

::: my-r-code-container
```{r}
#| label: gts-basic
#| cache: true

gtsummary_basic <- 
    base::readRDS("data/chap01/gtsummary_clean.rds")

gtsummary_basic |> 
    gtsummary::tbl_summary()
```
:::
::::::

This is a very basic summary without any adaptions or modifications. It is strange, that there are for `grass` not separated levels presented. From a [video of an introduction to {**gtsummary**}](https://www.youtube.com/watch?v=tANo9E1SYJE) by the package author Daniel D. Sjoberg I learned that a categorical variable with two levels is interpreted as dichotomous whenever the two levels are labelled `1` and `0`, `true` and `false` or `Yes` and `No`. And dichotomous variable are different as other categorical variables as they are reported by {**gtsummary**} only with one number (the `1`, `true` or `yes` value).

###### 2

Here I am going to change level names of the `grass` variable to prevent that it is interpreted as a dichotomous variable.

:::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-gtsummary-no-dichotomous1}
: Numbered R Code Title
:::
::::

::: my-r-code-container
```{r}
#| label: gts-not-dichotomous1
#| cache: true

gtsummary_basic2 <- 
    base::readRDS("data/chap01/gtsummary_clean.rds") |> 
    dplyr::mutate(grass = 
      forcats::fct_recode(grass,
                          "used" = "Yes", 
                          "not used" = "No")
            ) 

gtsummary_basic2 |>
    gtsummary::tbl_summary()
```
:::
::::::

This time we can see both levels of `grass`. But we loose the nice Yes/No answer of the survey.

###### 3

Another strategy to prevent a dichotomous interpretation is to tell {**gtsummary**} expressively with the `type =` argument how a variable has to be interpreted.

:::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-gtsummary-no-dichotomous2}
: Preventing a dichotomous interpretation
:::
::::

::: my-r-code-container
```{r}
#| label: gts-not-dichotomous2
#| cache: true

gtsummary_basic |> 
    gtsummary::tbl_summary(
        type = grass ~ "categorical"
    )
```
:::
::::::

The same output as in the previous tab, but now with the desired Yes/No answer categories.

###### 4

If there are labelled data {gtsummary} uses them as variable labels. This is a big advantages, especially if the data are already labelled and imported with {**haven**}. But you can use the {**labelled**} package to label the variables.

If there are no labelled metadata then the variable name is taken, e.g., in most of the cases you have to provide a better name for the table display.

I am using both methods in the following code chunk: I provided labelled metadata for the `grass` variable but used new variable names for `age_cat` and `sex`.

:::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-gtsummary-label}
: Changing variable labels
:::
::::

::: my-r-code-container
```{r}
#| label: gts-label
#| cache: true

## variant 1 to label variables
# labelled::var_label(
#     gtsummary_basic$grass) <- "Ever used grass or hash?"

## variant 2 to label variables
gtsummary_basic |> 
    labelled::set_variable_labels(
        grass = "Ever used grass or hash?"
    ) |> 
    gtsummary::tbl_summary(
        # by = grass,
        type = grass ~ "categorical",
        label = list(
            age_cat = "Age Category",
            sex = "Sex"
        )
    )
```
:::
::::::

###### 5

:::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-gtsummary-group}
: Summarize data by a grouped variable
:::
::::

::: my-r-code-container
```{r}
#| label: gts-by-grass
#| cache: true

gtsummary_group <-  gtsummary_basic |> 
    labelled::set_variable_labels(
        grass = "Ever used grass or hash?"
    ) |>
    gtsummary::tbl_summary(
        # by = grass,
        type = grass ~ "categorical",
        label = list(
            age_cat = "Age Category",
            sex = "Sex"
        )
    )

gtsummary_group
```

The variable `grass` is not grouped `dplyr::group_by()` but inside `gtsummary::tbl_summary()`.
:::
::::::

###### 6

:::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-gtsummary-header}
: Change table header
:::
::::

::: my-r-code-container
```{r}
#| label: gts-header
#| cache: true


gtsummary_group |> 
  gtsummary::modify_header(label ~ "**Variable**") |> 
  gtsummary::modify_spanning_header(
      gtsummary::show_header_names(gtsummary_group) ~ "**Ever used grass or hash?**")
```
:::
::::::

###### 7

:::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-gtsummary-cross-table}
: Prepare cross tables and stack them
:::
::::

::: my-r-code-container
```{r}
#| label: gts-cross-table
#| cache: true

gtsummary_basic <- 
    base::readRDS("data/chap01/gtsummary_clean.rds")

gtsummary_cross1 <- gtsummary_basic |> 
    labelled::set_variable_labels(
            grass = "**Ever used grass or hash?**"
        ) |> 
    gtsummary::tbl_cross(
        label = list(
            age_cat = "Age Category",
            sex = "Sex"
        ),
        row = age_cat,
        col = grass,
        percent = "cell"
    ) |> 
    gtsummary::modify_header(
      stat_1 = '**Yes**',
      stat_2 = '**No**',
      stat_0 = '**Total**'
    ) 

gtsummary_cross2 <- gtsummary_basic |> 
    labelled::set_variable_labels(
            grass = "**Ever used grass or hash?**"
        ) |> 
    gtsummary::tbl_cross(
        label = list(
            age_cat = "Age Category",
            sex = "Sex"
        ),
        row = sex,
        col = grass,
        percent = "cell"
    ) |> 
    gtsummary::modify_header(
      stat_1 = '**Yes**',
      stat_2 = '**No**',
      stat_0 = '**Total**'
    )

gtsummary::tbl_stack(
    list(gtsummary_cross1,
         gtsummary_cross2)
    ) |> 
    gtsummary::bold_labels() 
```
:::
::::::

###### 8

:::::: my-r-code
:::: my-r-code-header
::: {#cnj-ID-chap01-gtsummary-stratify}
: Stratify tables
:::
::::

::: my-r-code-container
```{r}
#| label: tbl-gts-stratify
#| tbl-cap: "Did you ever use marijuana or hashish?"
#| cache: true

gtsummary_basic <- 
    base::readRDS("data/chap01/gtsummary_clean.rds")

gtsummary_basic |> 
    gtsummary::tbl_strata(
        strata = grass,
        ~ .x |> 
    gtsummary::tbl_summary(
        by = sex,
        label = age_cat ~ "Age Category"
        ) |> 
    gtsummary::add_overall() |> 
    gtsummary::modify_header(
        label = "**Variables**",
        gtsummary::all_stat_cols() ~ "**{level}**"
        )
    ) |> 
  gtsummary::bold_labels() 


```
:::
::::::
:::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::::::
:::::::::::::::::::::::::::::::::::::::::::

------------------------------------------------------------------------

## Opinions about legalizing marijuana (1973-2022) {#sec-chap01-grass-timeline}

:::::::::::::::::::::::::::::: my-experiment
:::: my-experiment-header
::: {#def-chap01-timeline-grass}
: Developing a graph showing the timeline of respondents opinion about legalizing marijuana (1973-2022)
:::
::::

::::::::::::::::::::::::::: my-experiment-container
:::::::::::::::::::::::::: panel-tabset
###### Load data

:::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-load-gss-grass-legal}
: Load the GSS data from 1972-2022 and select appropriate variables
:::
::::

::: my-r-code-container
```{r}
#| label: load-gss-grass-legal
#| cache: true
#| eval: false

## run only once (manually) #########

library(gssr)   ## load package
data(gss_all)   ## attach GSS data 
data(gss_dict)  ## attach codebook
gss_grass_legal <- gss_all |> 
    dplyr::select(year, age, grass) 

save_data_file("chap01", gss_grass_legal, "gss_grass_legal.rds")

base::unloadNamespace("gssr")
```
:::
::::::

(*This R code chunk has no visible output*)

###### Clean data

:::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-clean-gss-grass-legal}
: Clean `grass-legal` dataset
:::
::::

::: my-r-code-container
```{r}
#| label: clean-gss-grass-legal
#| results: hold

## run only once (manually) ##########
gss_grass_legal <- 
    base::readRDS("data/chap01/gss_grass_legal.rds")

gss_grass_legal_clean <- gss_grass_legal |> 
    tidyr::drop_na() |> 
    dplyr::mutate(
        grass = forcats::as_factor(grass),
        grass = forcats::fct_drop(grass),
        year = base::as.numeric(year),
        age = base::as.numeric(age)
    )

save_data_file("chap01", gss_grass_legal_clean, "gss_grass_legal_clean.rds")
```
:::
::::::

(*This R code chunk has no visible output*)

###### Show summary

:::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-gss-grass-clean-data}
: Show summary of cleaned dataset
:::
::::

::: my-r-code-container
```{r}
#| label: tbl-gtsummary-gss-grass-legal
#| tbl-cap: "Summary of GSS surveys about opinion about legalizing marijuana between 1972-2022" 
#| warning: false

grass_legal <- 
    base::readRDS("data/chap01/gss_grass_legal_clean.rds")

## remove comma from big numbers
gtsummary::theme_gtsummary_language("en", big.mark = "")


grass_legal |> 
  gtsummary::tbl_summary(
    label = list(
        year ~ "Year",
        age ~ "Age",
        grass ~ "Should marijuana be legal?"
    ),
    type = list(age ~ "continuous2", year ~ "continuous2"),
    statistic = 
         list(
            age ~ c(
      "{mean} ({sd})",
      "{median} ({p25}, {p75})",
      "{min}, {max}"
         ), 
            year ~ c("{min}, {max}")
    )
  ) |> 
  gtsummary::italicize_levels() |> 
  gtsummary::bold_labels()
    
```
:::
::::::

::::: my-watch-out
::: my-watch-out-header
WATCH OUT! Summary with labelled data
:::

::: my-watch-out-container
I used for the summary the labelled data from the {**gssr**} package. This has the big advantage that I didn't recode the variable names and could use the variable and value metadata!

But {**gtsummary**} warned me to see that this is only "an intermediate data structure not meant for analysis".

> Column(s) age are class "haven_labelled". This is an intermediate datastructure not meant for analysis. Convert columns with `haven::as_factor()`, `labelled::to_factor()`, `labelled::unlabelled()`, and `unclass()`. "haven_labelled" value labels are ignored when columns are not converted. Failure to convert may have unintended consequences or result in error.
>
> -   <https://haven.tidyverse.org/articles/semantics.html>
> -   [https://larmarange.github.io/labelled/articles/intro_labelled.html#unlabelled](https://haven.tidyverse.org/articles/semantics.html)
:::
:::::

###### Prepare data

:::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-gss-grass-prepare-data}
: Prepare data for timeline graph
:::
::::

::: my-r-code-container
```{r}
#| label: gss-grass-prepare-data

grass_legal <- 
    base::readRDS("data/chap01/gss_grass_legal_clean.rds")


grass_legal2 <- grass_legal |> 
    # labelled::remove_labels() |> 
    dplyr::group_by(year) |> 
    dplyr::add_count(name = "N") |> 
    dplyr::add_count(grass) |> 
    dplyr::mutate(perc = n / N * 100) |> 
    dplyr::ungroup()

```
:::
::::::

(*This R code chunk has no visible output*)

###### Timeline

:::::: my-r-code
:::: my-r-code-header
::: {#cnj-chap01-gss-grass-timeline}
: Timeline of respondents opinion about legalizing marijuana (1973-2022)
:::
::::

::: my-r-code-container
```{r}
#| label: fig-gss-grass-timeline
#| fig-cap: "Timeline of respondents opinion about legalizing marijuana between 1972-2022 of the GSS data"

grass_legal2 |> 
    # dplyr::group_by(year) |> 
    ggplot2::ggplot(ggplot2::aes(
        x = year, 
        y = perc,
        color = grass)) +
    ggplot2::geom_point() + 
    ggplot2::theme_bw() +
    ggplot2::scale_color_manual(
        values = c("#79a778", '#7662aa')
        ) +
    ggplot2::scale_x_continuous(
        breaks = c(1973, seq(1978, 2022, 4), 2022)
        ) +
    ggplot2::theme(axis.text.x =
          ggplot2::element_text(angle = 45, vjust = 0.5)) +
    ggplot2::labs(
        x = "Should marijuana be legal?",
        y = "Percent of responses",
        color = "Marijuana") +
    ggplot2::theme(legend.position = c(.82, .85))

```

@fig-gss-grass-timeline shows that the proportion of people that support legalizing marijuana is (with the exception of 1978-1990) continuously rising. Starting from 19% support in 1973 approval hits 70% in 2022, the last date where data are available. The turning point where the proportion of affirmation was the first time greater than the disapproval was in 2014 (55/45).
:::
::::::
::::::::::::::::::::::::::
:::::::::::::::::::::::::::
::::::::::::::::::::::::::::::

## Glossary {.unnumbered}

```{r}
#| label: glossary-table
#| echo: false

glossary_table()
```

## Session Info {.unnumbered}

::::: my-r-code
::: my-r-code-header
Session Info
:::

::: my-r-code-container
```{r}
#| label: session-info

sessioninfo::session_info()
```
:::
:::::