This section replicates the analyses in Lee et al. (2021) (the paper) using synthetic data. It includes two sets of files:
- Application
- Simulation
The Application part provides scripts that use the simulated dataset ("simulated_dataset.csv") to conduct all the analyses in the paper. Figures 2 and 3 and Table 4 can be replicated in a similar fashion.
The Simulation part provides guidelines for replicating the simulation results, such as Table 1 in the manuscript and Table 5 in the online supplementary materials.
- analysis_conti.R: application of the denovo method with continuous outcomes (including sensitivity analyses)
- analysis_binary.R: application of the denovo method with binary outcomes
- analysis_binary_sensi.R: sensitivity analysis when outcomes are binary
- simulated_dataset.csv: this dataset contains 110,091 matched pairs with 4 individual-level covariates (age, sex, race, Medicaid eligibility) and 11 zip code-level covariates. The zip code-level covariates are within-pair averages. The outcomes are simulated by the authors.
- simulated_dataset_continuous.csv: this dataset is similar to the previous one, but the outcomes are generated as continuous outcomes. It is used with "analysis_conti.R".
- discovery_set_index.csv: an index vector indicating which subjects were used as the discovery subsample. The same index vector was used in analyzing the actual Medicare data in the paper.
- pvalue-discovery_continues.R: for continuous-outcome simulations
- pvalue_discovery_binary.R: for binary-outcome simulations
- test_all.R: runs all of the tests mentioned above and can be used by developers as a functional test.
Since handling continuous outcomes is much simpler than handling binary outcomes, it is recommended to start with analysis_conti.R. This script demonstrates how to implement the denovo method to discover a tree structure and conduct hypothesis tests. analysis_binary.R and analysis_binary_sensi.R handle binary outcomes. These two R scripts demonstrate the denovo method discussed in Section 3.5 of the paper.
The data-generating process is discussed in Section 4 of the paper. We considered five covariates, two of which are effect modifiers. Three splitting ratios were considered: (10%, 90%), (25%, 75%), and (50%, 50%).
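As an illustration of the sample-splitting step, here is a minimal sketch in R. It assumes that the first number in each ratio is the share of matched pairs assigned to the discovery subsample and the second is the share left for inference. The number of pairs, the seed, and all object names (`n.pairs`, `disc.frac`, `splits`) are hypothetical and are not taken from the simulation scripts.

```r
# Hypothetical sketch: randomly split matched pairs into discovery and
# inference subsamples at each of the three splitting ratios.
set.seed(1)
n.pairs <- 1000                      # illustrative number of matched pairs
disc.frac <- c(0.10, 0.25, 0.50)     # assumed discovery fractions

splits <- lapply(disc.frac, function(f) {
  # Draw the discovery pairs without replacement; the rest are for inference
  discovery <- sort(sample.int(n.pairs, size = round(f * n.pairs)))
  inference <- setdiff(seq_len(n.pairs), discovery)
  list(discovery = discovery, inference = inference)
})
names(splits) <- c("10/90", "25/75", "50/50")

# Size of each discovery subsample
sapply(splits, function(s) length(s$discovery))
```

Splitting at the pair level keeps both units of a matched pair in the same subsample, which is why the indices above refer to pairs and not to individual subjects.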
The outputs of both simulation R scripts consist of two matrices: (1) pval.matrix and (2) check.matrix. pval.matrix contains the p-values that were used for power computation, as in Table 1. check.matrix contains the discovery ratio of each covariate; it was used for Table 5 in the supplementary materials.
- First, pval.matrix has 12 columns, four per splitting ratio. Of these four columns, the first two are from CART and the next two are from CT; within each pair, the columns hold the p-values from the truncated product method and the denovo method, in that order. Therefore, for each splitting ratio, the columns are (1) truncated product with CART, (2) denovo with CART, (3) truncated product with CT, and (4) denovo with CT.
- Second, check.matrix has 30 columns, ten per splitting ratio. Of these ten columns, the first five are from CART and the next five are from CT. The first five columns indicate whether each covariate x_i is used for a split in CART; the next five are defined in the same way for CT.
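The column layout described above can be made explicit by labelling the columns. The sketch below is only illustrative: it uses randomly generated placeholder matrices in place of real script output, and it assumes that each row is one simulation replicate, that check.matrix stores 0/1 indicators, and that power is evaluated at a 0.05 significance level. The labels and the helper objects (`pg`, `cg`, `alpha`) are hypothetical.

```r
# Hypothetical sketch: label the columns of pval.matrix and check.matrix
# and summarise them. The matrices here are random placeholders.
set.seed(2)
n.sim <- 200                                   # illustrative number of replicates
pval.matrix  <- matrix(runif(n.sim * 12), nrow = n.sim)
check.matrix <- matrix(rbinom(n.sim * 30, 1, 0.5), nrow = n.sim)

ratios  <- c("10/90", "25/75", "50/50")
trees   <- c("CART", "CT")
methods <- c("truncatedP", "denovo")
covs    <- paste0("x", 1:5)

# expand.grid varies its first argument fastest, which matches the layout:
# method within tree within splitting ratio
pg <- expand.grid(method = methods, tree = trees, ratio = ratios,
                  stringsAsFactors = FALSE)
colnames(pval.matrix) <- paste(pg$ratio, pg$tree, pg$method, sep = ".")

# covariate within tree within splitting ratio
cg <- expand.grid(cov = covs, tree = trees, ratio = ratios,
                  stringsAsFactors = FALSE)
colnames(check.matrix) <- paste(cg$ratio, cg$tree, cg$cov, sep = ".")

alpha <- 0.05                                  # assumed significance level
power <- colMeans(pval.matrix < alpha)         # rejection rate per column
discovery.ratio <- colMeans(check.matrix)      # share of replicates using x_i
```

With labelled columns, a single setting can be pulled out by name, for example `power["25/75.CT.denovo"]`.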
Both functions_binary.R and functions_binary_sensi.R are modified from the R scripts provided by the supplementary materials of Fogarty et al. (2016) and Fogarty et al. (2017).