---
title: "Hands-On Example: Mapping from SDTM to ADaM Using R and Handling Missing Data with Multiple Imputation"
author: "Joshua J. Cook"
date: "`r Sys.Date()`"
format: html
embed-resources: true
toc: true
---
## Objective
The goal of this section is to transform our SDTM dataset into an ADaM (Analysis Data Model) dataset that is CDISC-compliant while [**handling missing data with multiple imputation**]{.underline}. This ensures that our clinical data is ready for [statistical analysis and reporting]{.underline}, following CDISC ADaM standards.
## Key Points for Presentation
- **Overview**: Demonstrate mapping SDTM-compliant data to ADaM format using R and handling missing data with multiple imputation.
- **Required Libraries**: Use `{tidyverse}` for effective data manipulation and `{mice}` for handling missing data. The `{admiral}` package can facilitate the creation of ADaM datasets, although the code in this section performs the derivations directly with `{tidyverse}` functions.
- **Input Data**: Use the SDTM DM dataset we previously created.
- **Data Understanding**: Ensure proper derivation of analysis-ready variables and adherence to ADaM standards while addressing missing values.
- **Transformation Process**:
    - Generate missing values for `AGE` and `SEX`.
    - Attempt to map to ADaM format with missing values.
    - Apply multiple imputation using the `{mice}` package to handle missing data.
    - Recreate ADaM dataset after imputation.
- **Compare Results**: Understand the impact of handling missing data on the ADaM dataset.
- **Metadata Creation**: Document ADaM variables to align with Define XML standards for submissions.
## Step-by-Step Transformation from SDTM to ADaM with Handling Missing Data
### Step 1: Load Required Libraries & Data
```{r}
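# Install any required packages that are not yet available
# (requireNamespace() checks a single package, so each name is tested in turn)
required_pkgs <- c("tidyverse", "mice")
missing_pkgs <- required_pkgs[!vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)]
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}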
# Load necessary libraries
library(tidyverse) # Tidyverse for data manipulation
library(mice) # Mice for handling missing data
# Loading the dataset in the third document
sdtm_dm <- readRDS("sdtm_dm.rds") # Loads our SDTM data
```
**Explanation**: We start by loading the `{tidyverse}` package for general data manipulation and `{mice}` to handle missing data through imputation. The ADaM derivations in this section are written with `{tidyverse}` functions, so `{admiral}` is not loaded here. We also load in our previously generated SDTM dataset.
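Before going further, it can help to confirm that the loaded data contains the columns the later steps rely on. The following is a minimal sketch rather than part of the original workflow: `check_columns()` is a hypothetical helper and `toy_dm` is made-up stand-in data; the column names are the ones referenced later in this document.

```r
# Hypothetical helper: return the required columns that are absent from a data frame
check_columns <- function(data, required) {
  setdiff(required, names(data))
}

# Made-up stand-in for the SDTM DM dataset (illustration only)
toy_dm <- data.frame(
  STUDYID = "STUDY01",
  USUBJID = c("STUDY01-001", "STUDY01-002"),
  AGE     = c(34, 71),
  SEX     = c("F", "M")
)

# Columns referenced later in this document
required_cols <- c("STUDYID", "USUBJID", "AGE", "SEX", "ETHNIC", "RACERECOD")

missing_cols <- check_columns(toy_dm, required_cols)
print(missing_cols)
```

An empty result would mean every expected column is present; here the toy data deliberately lacks `ETHNIC` and `RACERECOD`.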
### Step 2: Generate Missing Data in the SDTM DM Dataset
```{r}
# Expand the SDTM dataset to 150 observations and introduce missing values
set.seed(123) # Set seed for reproducibility
# Expand the original mock dataset to 150 observations
sdtm_dm_expanded <- sdtm_dm %>%
  slice(rep(1:n(), length.out = 150)) %>% # Repeat rows to expand the dataset to 150 observations
  mutate(SUBJID = sprintf("SUB%03d", 1:150)) # Update SUBJID to have unique identifiers
# Introduce missing values into AGE and SEX
sdtm_dm_missing <- sdtm_dm_expanded %>%
  mutate(
    AGE = ifelse(runif(n()) < 0.2, NA, AGE), # Set 20% of AGE values to NA
    SEX = ifelse(runif(n()) < 0.1, NA, SEX) # Set 10% of SEX values to NA
  )
# View the dataset with missing values
head(sdtm_dm_missing)
```
**Explanation**: In this step, we expand the dataset to 150 observations by repeating rows with `slice()` and assigning new `SUBJID` values with `mutate()`. We then introduce missing values into the `AGE` and `SEX` columns: `ifelse()` combined with `runif()` sets each `AGE` value to `NA` with probability 0.2 and each `SEX` value to `NA` with probability 0.1, so roughly 20% and 10% of the values end up missing, simulating incomplete data.
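Because each value is dropped independently, the realised share of missing values only approximates the target. The sketch below contrasts this with a fixed-count alternative based on `sample()`. The alternative is an assumption for illustration, not the method used above, and the `age` vector is made up.

```r
set.seed(123) # Set seed for reproducibility

n <- 150
age <- rep(40, n) # Made-up constant ages, for illustration only

# Approach used above: each value becomes NA independently with probability 0.2,
# so the realised share of missing values varies around 20%
age_random <- ifelse(runif(n) < 0.2, NA, age)

# Alternative (an assumption, not the method above): blank out exactly 20% of rows
idx_missing <- sample(n, size = round(0.2 * n))
age_fixed <- age
age_fixed[idx_missing] <- NA

mean(is.na(age_random)) # Close to, but not necessarily equal to, 0.2
mean(is.na(age_fixed))  # Exactly 30 of 150 values are missing
```

The random approach is closer to how missingness arises in practice, while the fixed-count version makes simulated examples easier to compare across runs.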
### Step 3: Attempt to Map SDTM DM with Missing Data to ADaM Format
```{r}
# Attempt to derive ADaM Dataset from SDTM DM Dataset with missing values
adam_dm_missing <- sdtm_dm_missing %>%
  mutate(
    # Deriving the Age Group Variable (AGEGR1)
    # 'case_when()' is used to create conditional logic for assigning age groups
    AGEGR1 = case_when(
      AGE < 18 ~ "<18",
      AGE >= 18 & AGE <= 65 ~ "18-65",
      AGE > 65 ~ ">65"
    ),
    # Creating the Safety Population Flag (SAFFL)
    # 'ifelse()' is used to create a flag for inclusion in the safety population
    SAFFL = ifelse(SEX %in% c("M", "F"), "Y", "N")
  )
# View the derived dataset with missing values
head(adam_dm_missing)
md.pattern(adam_dm_missing) # mice function for visualizing missingness; matrix and heatmap!
```
**Explanation**: Here, we attempt to map the SDTM DM dataset to an ADaM format while the dataset contains missing values. `AGEGR1` is derived using `case_when()`, and `SAFFL` is flagged using `ifelse()`. Subjects with a missing `AGE` receive a missing `AGEGR1`, and subjects with a missing `SEX` are flagged `"N"`, which highlights the challenges of creating ADaM datasets when data is incomplete.
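The way missing values flow through these derivations follows from how R treats `NA` in comparisons and in `%in%`. The sketch below reproduces the logic with base R on made-up vectors; the nested `ifelse()` is a stand-in for the `case_when()` call above, not a replacement for it.

```r
# Made-up values, for illustration only
age <- c(12, 40, 70, NA)
sex <- c("M", "F", NA, "M")

# Comparisons with NA return NA, so no age-group condition is TRUE for a missing AGE
age < 18

# Base-R stand-in for the case_when() logic above
agegr1 <- ifelse(age < 18, "<18", ifelse(age <= 65, "18-65", ">65"))

# %in% never returns NA: a missing SEX simply fails the membership test
saffl <- ifelse(sex %in% c("M", "F"), "Y", "N")

data.frame(age, sex, agegr1, saffl)
```

The fourth subject ends up with a missing `agegr1`, and the third is flagged `"N"` only because `sex` is missing, which is easy to overlook in a larger dataset.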
- **Matrix Interpretation**: Each row represents a pattern of missingness. `1` indicates observed data, `0` indicates missing data. The row names on the left show how many rows of the dataset have that specific pattern, the last column counts the variables that are missing in that pattern, and the last row gives the number of missing values per column.
- **Heatmap Interpretation**: The heatmap visually depicts the patterns, helping identify if there are specific variables with more missing data or if there's a monotone pattern, which might simplify imputation.
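To make the reading of the pattern matrix concrete, the same counts can be cross-checked with base R. This is a minimal sketch on a made-up data frame (`toy`), not a reimplementation of `md.pattern()`; the digits of each pattern key follow the column order `AGE`, `SEX`, `ETHNIC`.

```r
# Made-up data with missing values in two columns (illustration only)
toy <- data.frame(
  AGE    = c(34, NA, 51, NA, 28),
  SEX    = c("F", "M", NA, NA, "M"),
  ETHNIC = c("A", "B", "A", "B", "A")
)

# 1 = observed, 0 = missing, mirroring the coding used by md.pattern()
observed <- 1 * !is.na(toy)

# Number of rows sharing each missingness pattern
pattern_key <- apply(observed, 1, paste, collapse = "")
pattern_counts <- table(pattern_key)
print(pattern_counts)

# Number of missing values per column (the bottom row of the md.pattern() matrix)
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)
```

Here the key `111` marks fully observed rows, while `001` marks rows where both `AGE` and `SEX` are missing.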
### Step 4: Apply Multiple Imputation Using `{mice}`
The methodology behind the `mice()` function in the `{mice}` package for R is based on **Multivariate Imputation by Chained Equations**. The idea is to create plausible values for missing data in a dataset by treating each variable as a target to be imputed, conditioned on the other variables. The method uses a **Fully Conditional Specification (FCS)**, meaning that each incomplete variable is imputed by its own model, conditional on the other variables, in an iterative cycle.
**Steps in the Process**:
1. **Imputation**: Replace missing values for each variable using predictive models, specific to the type of data (e.g., predictive mean matching (`pmm`) for numeric variables, logistic regression (`logreg`) for binary categorical variables).
2. **Iteration**: The function cycles through each incomplete variable multiple times to improve the imputation estimates.
3. **Convergence**: After the iterations, a stable solution is expected. Multiple imputations are generated to reflect the uncertainty of the missing data.
The **predictor matrix** specifies which variables to use as predictors for each target variable, and the **method argument** determines the imputation method used, depending on the type of variable (e.g., `pmm`, `logreg`, `polyreg`).
The end result is a `mids` object, containing several imputed datasets, each reflecting different plausible imputations for the missing values, allowing researchers to conduct repeated analyses and account for uncertainty due to missing data.
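The pieces described above can be inspected directly on a small example. The sketch below uses `nhanes`, a small example dataset shipped with `{mice}`, rather than our SDTM data, and the settings (`m = 5`, `maxit = 10`) are chosen only for illustration.

```r
library(mice)

# nhanes is a small example dataset shipped with {mice}
imp <- mice(nhanes, m = 5, maxit = 10, method = "pmm", seed = 500, printFlag = FALSE)

class(imp)          # The result is a "mids" object
imp$m               # Number of imputed datasets
imp$method          # Imputation method recorded for each column
imp$predictorMatrix # Rows are targets, columns are predictors (1 = used)

# Extract two of the completed datasets; they differ only in the imputed cells
completed_1 <- complete(imp, 1)
completed_2 <- complete(imp, 2)
anyNA(completed_1)
```

Comparing `completed_1` and `completed_2` shows how the imputed cells vary between datasets while the observed values stay fixed.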
#### Age
```{r}
# Add a temporary variable to ensure sufficient columns for imputation
sdtm_dm_missing$temp_id <- seq_len(nrow(sdtm_dm_missing))
# Apply multiple imputation for AGE using the mice package
imp_age <- mice(
  sdtm_dm_missing %>% select(AGE, temp_id, ETHNIC, RACERECOD),
  m = 5, maxit = 50, method = "pmm", seed = 500
)
# Complete the AGE column using the first imputation
sdtm_dm_imputed_age <- sdtm_dm_missing
sdtm_dm_imputed_age$AGE <- complete(imp_age, 1)$AGE
# Remove the temporary variable
sdtm_dm_imputed_age$temp_id <- NULL
# View the dataset after AGE imputation
head(sdtm_dm_imputed_age)
```
#### Sex
```{r}
str(sdtm_dm_imputed_age)
# Add a temporary variable to ensure sufficient columns for imputing SEX
sdtm_dm_imputed_age$temp_id <- seq_len(nrow(sdtm_dm_imputed_age))
# Apply multiple imputation for SEX using logistic regression, using different columns as predictors
imp_sex <- mice(
  sdtm_dm_imputed_age %>% select(SEX, temp_id, ETHNIC, RACERECOD, AGE),
  m = 5, maxit = 50, method = "logreg", seed = 500
)
# Complete the SEX column using the first imputation
sdtm_dm_imputed <- sdtm_dm_imputed_age
sdtm_dm_imputed$SEX <- complete(imp_sex, 1)$SEX
# Remove the temporary variable
sdtm_dm_imputed$temp_id <- NULL
# View the dataset after SEX imputation
head(sdtm_dm_imputed)
```
**Explanation**: We use the `mice()` function to perform multiple imputation on the dataset, once for `AGE` using predictive mean matching (`pmm`) and once for `SEX` using logistic regression (`logreg`). Each call creates five imputed datasets (`m = 5`), and we select the first one for further use with the `complete()` function. This fills the missing values to create a complete dataset, but keeping a single completed dataset does not carry the between-imputation uncertainty into later analyses. Limitations can also be encountered if there is not enough data to fit the imputation models, which was the case for the `SEX` variable!
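When an analysis should reflect the uncertainty across all imputed datasets rather than only the first, `{mice}` provides `with()` and `pool()`. The following is a minimal sketch on the `nhanes` example data shipped with `{mice}`; the model `bmi ~ age` is an arbitrary illustration and not an analysis of our DM data.

```r
library(mice)

# Example data shipped with {mice}; stands in for a real analysis dataset
imp <- mice(nhanes, m = 5, maxit = 10, method = "pmm", seed = 500, printFlag = FALSE)

# Fit the same model in each of the five completed datasets
fits <- with(imp, lm(bmi ~ age))

# Combine the five sets of estimates using Rubin's rules
pooled <- pool(fits)
summary(pooled)
```

The pooled standard errors include the variation between imputations, which a single completed dataset cannot show.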
### Step 5: Recreate the ADaM Dataset After Imputation
```{r}
# Derive ADaM Dataset from the imputed SDTM DM Dataset
adam_dm_imputed <- sdtm_dm_imputed %>%
  mutate(
    # Re-create Age Group Variable (AGEGR1)
    AGEGR1 = case_when(
      AGE < 18 ~ "<18",
      AGE >= 18 & AGE <= 65 ~ "18-65",
      AGE > 65 ~ ">65"
    ),
    # Re-create the Safety Population Flag (SAFFL)
    SAFFL = ifelse(SEX %in% c("M", "F"), "Y", "N")
  ) %>%
  select(
    STUDYID, USUBJID, AGE, AGEGR1, SEX, SAFFL, ETHNIC, RACERECOD
  )
# View the final derived ADaM DM dataset after imputation
head(adam_dm_imputed)
md.pattern(adam_dm_imputed) # mice function for visualizing missingness; matrix and heatmap!
```
**Explanation**: After imputing missing values, we derive the ADaM dataset using the same transformations as before and keep only the analysis variables with `select()`. Wherever the imputation succeeded, `AGEGR1` and `SAFFL` are now derived from complete values.
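Since the same derivations appear in Step 3 and Step 5, they could be wrapped in a single function so both datasets are guaranteed to use identical rules. This is a sketch under that assumption: `derive_adam_vars()` is a hypothetical helper name and `toy_dm` is made-up input.

```r
library(dplyr)

# Hypothetical helper wrapping the derivations used in Steps 3 and 5
derive_adam_vars <- function(dm) {
  dm %>%
    mutate(
      AGEGR1 = case_when(
        AGE < 18 ~ "<18",
        AGE >= 18 & AGE <= 65 ~ "18-65",
        AGE > 65 ~ ">65"
      ),
      SAFFL = ifelse(SEX %in% c("M", "F"), "Y", "N")
    )
}

# Made-up input, for illustration only
toy_dm <- tibble(
  USUBJID = c("S-001", "S-002", "S-003"),
  AGE     = c(17, 65, 80),
  SEX     = c("F", "M", "F")
)

toy_adam <- derive_adam_vars(toy_dm)
toy_adam
```

The boundary ages 17 and 65 are included on purpose to show which group each one falls into.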
### Step 6: Compare Results
```{r}
# Compare original ADaM dataset with missing values to imputed dataset
md.pattern(adam_dm_missing) # mice function for visualizing missingness; matrix and heatmap!
anyNA(adam_dm_imputed)
md.pattern(adam_dm_imputed) # mice function for visualizing missingness; matrix and heatmap!
# Save the datasets for future use
# RDS is an R-specific, compressed binary file format that retains object structure
saveRDS(sdtm_dm_imputed, "sdtm_dm_imputed.rds")
saveRDS(adam_dm_imputed, "adam_dm_imputed.rds")
```
**Explanation**: We compare the original ADaM dataset containing missing values (`adam_dm_missing`) to the imputed dataset (`adam_dm_imputed`). This comparison helps in understanding the impact of handling missing data on the completeness and reliability of the final analysis-ready dataset.
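Beyond the missingness patterns, a simple numeric comparison can show what imputation changed. The sketch below uses two made-up vectors standing in for the `AGEGR1` column before and after imputation; it is an illustration of the idea, not output from the datasets above.

```r
# Made-up age-group vectors standing in for AGEGR1 before and after imputation
agegr1_before <- c("18-65", NA, ">65", "18-65", NA, "<18")
agegr1_after  <- c("18-65", "18-65", ">65", "18-65", ">65", "<18")

# Frequency tables, keeping NA as its own category
table(agegr1_before, useNA = "ifany")
table(agegr1_after, useNA = "ifany")

# How many values were filled in by imputation
n_filled <- sum(is.na(agegr1_before) & !is.na(agegr1_after))
n_filled

# Values that were observed originally should be unchanged after imputation
was_observed <- !is.na(agegr1_before)
unchanged <- all(agegr1_before[was_observed] == agegr1_after[was_observed])
unchanged
```

A check like `unchanged` is a quick guard against accidentally overwriting observed values during imputation.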
## Justification for Using `{mice}` for More Complex Analyses
- **Handling Missing Data**: `{mice}` is a powerful package for handling missing data through multiple imputation. In clinical trial data, this supports valid statistical inference as long as the assumptions behind the imputation model (such as data being missing at random) are reasonable.
## Important CDISC Concepts of SDTM and ADaM that We Are Adhering To During These Processes
- **Standardization**: Standardizing datasets (e.g., DM) across different clinical trials facilitates easier regulatory review and analysis.
- **Traceability**: The derivation of variables like `AGEGR1` and `SAFFL` from SDTM data ensures traceability, which is crucial for CDISC compliance.
- **Handling Missing Data**: Proper handling of missing data is critical in ensuring the accuracy and reliability of analysis results.
- **Metadata Documentation**: Variables in the ADaM dataset must be well-documented with metadata to comply with CDISC standards.
- **Population Flagging**: `SAFFL` helps in defining specific subpopulations for safety analysis, which is a requirement for regulatory submissions.
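As an illustration of the metadata documentation point, variable-level metadata for the two derived variables could be kept in a small table alongside the dataset. This is a sketch only: the column layout and the file name are assumptions, and the labels should be checked against the ADaM Implementation Guide before use in a submission.

```r
# Sketch of variable-level metadata for the derived ADaM variables
adam_metadata <- data.frame(
  variable = c("AGEGR1", "SAFFL"),
  label = c("Pooled Age Group 1", "Safety Population Flag"),
  type = c("Char", "Char"),
  derivation = c(
    '"<18" if AGE < 18; "18-65" if 18 <= AGE <= 65; ">65" if AGE > 65',
    '"Y" if SEX is "M" or "F"; otherwise "N"'
  ),
  stringsAsFactors = FALSE
)

adam_metadata

# Hypothetical file name; written to a temporary directory for this example
metadata_path <- file.path(tempdir(), "adam_dm_metadata.csv")
write.csv(adam_metadata, metadata_path, row.names = FALSE)
```

Keeping the derivation text next to the variable name makes it easier to trace each ADaM variable back to its SDTM source.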
## Summary of ADaM Creation with Missing Data Handling
- **Generating Missing Data**: Simulated missing values in `AGE` and `SEX` to reflect real-world scenarios.
- **Initial Mapping**: Attempted to create ADaM dataset from incomplete data to illustrate challenges.
- **Multiple Imputation**: Used `{mice}` to impute missing values, creating a complete dataset.
- **Age Group Derivation (`AGEGR1`)**: Classified subjects into predefined age groups.
- **Safety Population Flag (`SAFFL`)**: Flagged subjects for inclusion in the safety population based on available demographic data.
- **Compliance with ADaM Standards**: Ensured derived variables and dataset structure align with CDISC ADaM guidelines, leveraging `{mice}` for imputation.
This hands-on section helps participants understand how to transform SDTM data into analysis-ready ADaM datasets while handling missing data, emphasizing compliance with CDISC standards for regulatory submissions.