Merge branch 'main' of https://github.com/salbalkus/CausalTables.jl

salbalkus · Jan 28, 2025 · 3d3a8a3
2 parents c61c22f + c2ed453
commit 3d3a8a3
Showing 14 changed files with 690 additions and 236 deletions.
diff --git a/README.md b/README.md
@@ -3,6 +3,7 @@
 [![Build Status](https://github.com/salbalkus/CausalTables.jl/actions/workflows/CI.yml/badge.svg?branch=main)](https://github.com/salbalkus/CausalTables.jl/actions/workflows/CI.yml?query=branch%3Amain)
 [![Coverage Status](https://coveralls.io/repos/github/salbalkus/CausalTables.jl/badge.svg?branch=main)](https://coveralls.io/github/salbalkus/CausalTables.jl?branch=main)
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+[![JOSS Status](https://joss.theoj.org/papers/68c43e832d063050a4e67528191e8148/status.svg)](https://joss.theoj.org/papers/68c43e832d063050a4e67528191e8148)
 
 *A package for storing and simulating data for causal inference in Julia.*
 

diff --git a/docs/src/index.md b/docs/src/index.md
@@ -16,15 +16,15 @@ CausalTables.jl has three main functionalities:
 
 1. Generating simulation data using a `StructuralCausalModel`.
 2. Computing "ground truth" conditional distributions, moments, counterfactuals, and counterfactual functionals from a `StructuralCausalModel` and a `CausalTable`. These include, for instance, counterfactual means and average treatment effects.
-3. Wrapping an existing Table as a `CausalTable` object for use by external packages.
+3. Wrapping an existing Table as a `CausalTable` object for use by external packages, which provides several utility functions for extracting causal-relevant variables from a dataset. 
 
 The examples below illustrate each of these three functionalities.
 
 ### Simulating Data from a DataGeneratingProcess
 
 To set up a statistical simulation using CausalTables.jl, we first define a `StructuralCausalModel` (SCM). This consists of two parts: a `DataGeneratingProcess` (DGP) that controls how the data is generated, and a list of variables to define the basic structure of the underlying causal diagram.
 
-A DataGeneratingProcess can be constructed using the `@dgp` macro, which takes a sequence of conditional distributions of the form `[variable name] ~ Distribution(args...)` and returns a `DataGeneratingProcess` object. Then, one can construct an StructuralCausalModel by passing the DGP to its construct, along with labels of the treatment and response variables. Note that `using Distributions` is almost always required before defining a DGP, since the package [Distributions.jl](https://juliastats.org/Distributions.jl/stable/) is used to define the conditional distribution of random components at each step.
+A DataGeneratingProcess can be constructed using the `@dgp` macro, which takes a sequence of conditional distributions of the form `[name] ~ Distribution(args...)` or auxiliary variables `[name] = some code...` and returns a `DataGeneratingProcess` object. Then, one can construct a `StructuralCausalModel` by passing the DGP to its constructor, along with labels of the treatment and response variables. Note that `using Distributions` is almost always required before defining a DGP, since the package [Distributions.jl](https://juliastats.org/Distributions.jl/stable/) is used to define the conditional distribution of random components at each step.
 
 ```jldoctest quicktest; output = false, filter = r"(?<=.{21}).*"s
 using CausalTables
@@ -40,8 +40,7 @@ dgp = @dgp(
 scm = StructuralCausalModel(
     dgp;
     treatment = :X,
-    response = :Y,
-    confounders = [:W]
+    response = :Y
 )
 
 # output
@@ -101,10 +100,9 @@ X_distribution = condensity(scm, data, :X)
  Distributions.Normal{Float64}(μ=5.0, σ=1.0)
 ```
 
-For convenience, there also exists `conmean` and `convar` functions that extracts the true conditional mean and variance of a specific variable the CausalTable. One can apply this to an "intervened" version of data to obtain the conditional mean of the outcome under intervention. 
+For convenience, there also exist functions like `conmean`, `convar`, and `propensity` that extract the true conditional mean, variance, and (generalized) propensity score of a specific variable in the CausalTable. One can apply these to an "intervened" version of the data to obtain functionals of the outcome under intervention:
 
 ```jldoctest quicktest
-Y_var = convar(scm, data_intervened, :Y)
 Y_mean = conmean(scm, data_intervened, :Y)
 
 # output
@@ -124,7 +122,17 @@ If you have a table of data that you would like to use with CausalTables.jl with
 
 ```jldoctest quicktest; output = false, filter = r"(?<=.{11}).*"s
 tbl = (W = rand(1:5, 10), X = randn(10), Y = randn(10))
-ctbl = CausalTable(tbl; treatment = :X, response = :Y, confounders = [:W])
+ctbl = CausalTable(tbl; treatment = :X, response = :Y, 
+                        causes = (X = [:W], Y = [:W, :X]))
+
+# output
+CausalTable
+```
+
+Doing this is often convenient, as it allows you to use the utility functions provided by CausalTables.jl to extract causal-relevant variables from the dataset, such as the treatment, response, confounders, mediators, or instruments. For example, the following subsets the data to include only confounders:
+
+```jldoctest quicktest; output = false, filter = r"(?<=.{11}).*"s
+confounders(ctbl)
 
 # output
 CausalTable
@@ -134,4 +142,4 @@ For a more detailed guide of how to wrap an existing table as a CausalTable plea
 
 # Contributing
 
-Have questions? Spot a bug or issue in the documentation? Want to request a new feature or add one yourself? Please do not hesitate to open an issue or pull request on the [CausalTables.jl GitHub repository](https://github.com/salbalkus/CausalTables.jl). We welcome all contributions and feedback!
+Have questions? Spot a bug or issue in the documentation? Want to request a new feature or add one yourself? Please don't hesitate to open an issue or pull request on the [CausalTables.jl GitHub repository](https://github.com/salbalkus/CausalTables.jl). We welcome all contributions and feedback!
diff --git a/docs/src/man/formatting.md b/docs/src/man/formatting.md
@@ -2,7 +2,7 @@
 
 In Julia, most datasets are stored in a Table: a data structure with a [Tables.jl](https://tables.juliadata.org/stable/)-compatible interface. One of the main purposes of CausalTables.jl is to wrap a Table of data in Julia in order to provide it as input to some other causal inference package. Given a Table of some data, we can turn it into a `CausalTable` by specifying the treatment, response, and control variables. 
 
-## Tables with Causally Independent Units
+## Constructing the `CausalTable`
 
 The code below provides an example of how to wrap the Boston Housing dataset as a `CausalTable` to answer causal questions of the form "How would changing nitrous oxide air pollution (`NOX`) within Boston-area towns affect median home value (`MEDV`)?" Any dataset in a [Tables.jl](https://tables.juliadata.org/stable/)-compliant format can be wrapped as a `CausalTable`. In this example, we turn a `DataFrame` from [DataFrames.jl](https://dataframes.juliadata.org/stable/) into a `CausalTable` object.
 
@@ -15,11 +15,21 @@ using DataFrames
 tbl = BostonHousing().dataframe
 
 # Wrapping the dataset in a CausalTable
-ctbl = CausalTable(tbl; treatment = :NOX, response = :MEDV, confounders = [:CRIM, :ZN, :INDUS, :CHAS, :B, :DIS, :LSTAT])
+ctbl = CausalTable(tbl; treatment = :NOX, response = :MEDV)
 
 nothing # hide
 ```
 
+When only `treatment` and `response` are specified, all other variables are assumed to be confounders. However, one can also explicitly specify the causes of both treatment and response by passing them as a `NamedTuple` of lists to the `CausalTable` constructor. In the example below, we specify the causes of the treatment `NOX` only as `[:CRIM, :INDUS]`, and the causes of the response `MEDV` are specified as `[:CRIM, :INDUS, :NOX]`.
+
+```@example bostonhousing
+ctbl = CausalTable(tbl; treatment = :NOX, response = :MEDV, 
+                        causes = (NOX = [:CRIM, :INDUS], MEDV = [:CRIM, :INDUS, :NOX]))
+
+```
+
+Note that a full representation of the causes of each variable (often referred to as a "[directed acyclic graph](https://www.nature.com/articles/s41390-018-0071-3)") is **not** required, though it can be specified. Only the causes of the treatment and response are necessary as input; `CausalTables.jl` can automatically compute other types of variables one might be interested in, like confounders or mediators.
+
 After wrapping a dataset in a `CausalTable` object, the [Tables.jl](https://tables.juliadata.org/stable/) interface is available to call on the `CausalTable` as well. Below, we demonstrate a few of these functions, as well as additional utility functions for causal inference tasks made available by CausalTables.jl.
 
 ```@example bostonhousing
@@ -29,16 +39,35 @@ using Tables
 Tables.getcolumn(ctbl, :NOX) # extract specific column
 Tables.subset(ctbl, 1:5)     # extract specific rows
 Tables.columnnames(ctbl)     # obtain all column names
+```
+
+In addition, CausalTables.jl provides several utility functions for extracting the different types of variables relevant to causal inference from a `CausalTable` object.
 
+```@example bostonhousing
 # Additional utility functions for CausalTables
 treatment(ctbl)              # get CausalTable of treatment variables
 response(ctbl)               # get CausalTable of response variables
-confounders(ctbl)            # get CausalTable of confounders
+treatmentparents(ctbl)       # get CausalTable of parents of the treatment
 responseparents(ctbl)        # get CausalTable of treatment and confounders
-data(ctbl)                   # get underlying wrapped dataset
 
-# replace one or more attributes of the CausalTable
-CausalTables.replace(ctbl; response = :CRIM, confounders = [:MEDV, :ZN, :INDUS, :CHAS, :B, :DIS, :LSTAT]) 
+parents(ctbl, :NOX)          # get CausalTable of parents of a particular variable
+
+confounders(ctbl)            # get CausalTable of confounders
+mediators(ctbl)              # get CausalTable of mediators
+instruments(ctbl)            # get CausalTable of instruments
+
+data(ctbl)                   # get underlying wrapped dataset of a CausalTable
+
+nothing # hide
+```
+
+Although the `CausalTable` object is immutable, one can replace the values of its attributes with new ones using the `replace` function. The code below demonstrates how to replace the response variable of the `CausalTable` object `ctbl` with `:CRIM` and set `causes` to `nothing`. Setting `causes = nothing` is a quick shortcut to specify that all unlabeled variables are confounders of the treatment-response relationship.
+
+```@example bostonhousing
+# Replace one or more attributes of the CausalTable.
+# Setting `causes = nothing` is a quick shortcut to specify
+# that all unlabeled variables are confounders of the treatment-response relationship
+CausalTables.replace(ctbl; response = :CRIM, causes = nothing) 
 
 nothing # hide
 ```
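
The claim above that confounders and similar variable sets can be computed from the causes of treatment and response alone can be illustrated with a minimal sketch in plain Julia. This is not the package's implementation: the helper names `sketch_confounders` and `sketch_instruments` are hypothetical, and the definitions used (a confounder directly causes both treatment and response; an instrument directly causes the treatment but not the response) are simplifying assumptions for illustration.

```julia
# Illustrative sketch only; NOT CausalTables.jl's implementation.
# `causes` mirrors the NamedTuple passed to the `CausalTable` constructor:
# each key is a variable, each value lists that variable's direct causes.
causes = (NOX = [:CRIM, :INDUS], MEDV = [:CRIM, :INDUS, :NOX])
treatment, response = :NOX, :MEDV

# Assumed definition: confounders directly cause both treatment and response.
sketch_confounders(causes, t, r) = intersect(causes[t], causes[r])

# Assumed definition: instruments directly cause the treatment
# but do not directly cause the response.
sketch_instruments(causes, t, r) = setdiff(causes[t], causes[r])

println(sketch_confounders(causes, treatment, response))
println(sketch_instruments(causes, treatment, response))

# A second, hypothetical labeling in which :ZN affects MEDV only through NOX.
causes_iv = (NOX = [:CRIM, :ZN], MEDV = [:CRIM, :NOX])
println(sketch_instruments(causes_iv, treatment, response))
```

Under these assumptions, the first labeling yields `:CRIM` and `:INDUS` as confounders and no instruments, while the second yields `:ZN` as an instrument. Identifying mediators would additionally require the causes of variables other than treatment and response, so they are left out of the sketch.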

diff --git a/docs/src/man/generating-data.md b/docs/src/man/generating-data.md
@@ -4,7 +4,7 @@ When evaluating a causal inference method, we often want to test it on data from
 
 ## Defining a DataGeneratingProcess
 
-A data generating process describes a mechanism by which draws from random variables are simulated. It typically takes the form of a sequence of conditional distributions. CausalTables allows us to define a DGP as a DataGeneratingProcess object, which takes three arguments: the `names` of variables generated at each step, the `types` of these variables, and `funcs`, an array of functions of the form `(; O...) -> *some code`. 
+A data generating process describes a mechanism by which draws from random variables are simulated. It typically takes the form of a sequence of conditional distributions. CausalTables allows us to define a DGP as a DataGeneratingProcess object, which takes three arguments: the `names` of variables generated at each step, the `types` of these variables, and `funcs`, an array of functions of the form `O -> *some code`. 
 
 Suppose, for example, that we wanted to simulate data from the following DGP:
 
@@ -24,52 +24,113 @@ using CausalTables
 
 DataGeneratingProcess(
     [:W, :X, :Y],
-    [:distribution, :distribution, :distribution],
     [
-        (; O...) -> DiscreteUniform(1, 5), 
-        (; O...) -> (@. Normal(O.W, 1)),
-        (; O...) -> (@. Normal(O.X + 0.2 * O.W, 1))
+        O -> DiscreteUniform(1, 5), 
+        O -> (@. Normal(O.W, 1)),
+        O -> (@. Normal(O.X + 0.2 * O.W, 1))
     ]
 )
 
 # output
 DataGeneratingProcess
 ```
-where `; O...` syntax is a shorthand for a function that takes keyword arguments corresponding to the names of the variables in the DGP. 
+where `O` is an object that stores the output of each previous function in the sequence as a field with a name corresponding to its order in the sequence (i.e. in this example, the first function's output is stored as `O.W`, the second function's output is stored as `O.X`, and so on).
 
 However, a much more convenient way to define this DGP is using the `@dgp` macro, which takes a sequence of conditional distributions of the form `[variable name] ~ Distribution(args...)` and deterministic variable assignments of the form `[variable name] = f(...)` and automatically generates a valid DataGeneratingProcess. For example, the *easier* way to define the DGP above is as follows:
 
 ```jldoctest generation; output = false, filter = r"(?<=.{21}).*"s
 using CausalTables
 distributions = @dgp(
         W ~ DiscreteUniform(1, 5),
-        X ~ (@. Normal(W, 1)),
+        X ~ Normal.(W, 1),
         Y ~ (@. Normal(X + 0.2 * W, 1))
     )
 
 # output
 DataGeneratingProcess
 ```
 
-Note that with the `@dgp` macro, any symbol (that is, any string of characters prefixed by a colon, as in `:W` or `:X`) is automatically replaced with the corresponding previously-defined variable in the process. For instance, in `Normal(:W, 1)`, the `:W` will be replaced automatically with the distribution we defined as `W` earlier in the sequence. 
+Note that when using the `@dgp` macro, any symbol defined on the left side of an equation in the sequence can be used to pass in the output of a previous step on the right side. For example, in the above code, the symbol `W` is used to pass in the output of the first step to the second step. This works by metaprogramming which replaces `W` with `O.W` when the function is constructed by `@dgp`. 
+
+In this way, we can define virtually any DGP that can be expressed as a sequence of conditional distributions. For ease of use, one can still use the `O` object in the `@dgp` macro to pass in the output of all previous steps, which is especially useful for programmatically-defined DGPs. For example, the following code is equivalent to the above code:
+
+```jldoctest generation; output = false, filter = r"(?<=.{21}).*"s
+distributions = @dgp(
+        W ~ DiscreteUniform(1, 5),
+        X ~ Normal.(O[1], 1),
+        Y ~ Normal.(hcat(values(O)...) * [1, 0.2], 1)
+    )
+
+# output
+DataGeneratingProcess
+```
+
+In the second step, previous variables are accessed by index using `O[1]`, and in the third step, all previous variables are combined into a matrix by `hcat(values(O)...)`. Be careful when using these constructions, however, as they can make the code harder to read and understand. In some cases, it may be better to construct a `DataGeneratingProcess` manually using the constructor, for which several additional utilities are available. 
+
+For instance, if one wanted to generate a large number of variables with the same distribution, one could use the `DataGeneratingProcess` constructor without specifying variable names, in which case names will be automatically generated:
+
+```jldoctest generation; output = false, filter = r"(?<=.{75}).*"s
+
+many_distributions = DataGeneratingProcess(
+    [O -> Normal(0, 1) for _ in 1:100]
+)
+
+# output
+DataGeneratingProcess([:X1, :X2, :X3, :X4, :X5, :X6, :X7, :X8, :X9, :X10  …
+```
+
+In addition, the `merge` function can be used to combine two separate DGP sequences into one:
+
+```jldoctest generation; output = false, filter = r"(?<=.{21}).*"s
+# Define a new distribution whose mean is the mean of previous draws
+output_distribution = @dgp(
+    Y ~ Normal.(reduce(+, values(O)) ./ n, 1)
+)
+# Merge our previous `many_distributions` with the new `output_distribution`
+new_distributions = merge(many_distributions, output_distribution)
+
+# output
+DataGeneratingProcess
+```
 
 ## Defining a StructuralCausalModel
 
-In CausalTables.jl, a StructuralCausalModel is a data generating process endowed with some causal interpretation. Constructing a StructuralCausalModel allows users to randomly draw a CausalTable with the necessary components from the DataGeneratingProcess they've defined. With the above DataGeneratingProcess in hand, we can define a `StructuralCausalModel` object like so -- treatment, response, and confounder variables in the causal model are specified as keyword arguments to the `DataGeneratingProcess` constructor:
+In CausalTables.jl, a StructuralCausalModel is a data generating process endowed with some causal interpretation. Constructing a StructuralCausalModel allows users to randomly draw a CausalTable with the necessary components from the DataGeneratingProcess they've defined. With the previous DataGeneratingProcess in hand, we can define a `StructuralCausalModel` object like so -- treatment and response in the causal model are specified as keyword arguments to the `StructuralCausalModel` constructor:
+
+
+```jldoctest generation; output = false, filter = r"(?<=.{21}).*"s
+scm = StructuralCausalModel(
+    distributions;
+    treatment = :X,
+    response = :Y
+)
+
+# output
+StructuralCausalModel
+```
 
+When a `StructuralCausalModel` is constructed with only treatment and response specified, all other variables are assumed to be confounders. However, one can also explicitly specify the causes of both treatment and response by passing them as a `NamedTuple` of lists to the `StructuralCausalModel` constructor:
 
 ```jldoctest generation; output = false, filter = r"(?<=.{21}).*"s
-dgp = StructuralCausalModel(
+
+scm = StructuralCausalModel(
     distributions;
     treatment = :X,
     response = :Y,
-    confounders = [:W]
+    causes = (X = [:W], Y = [:X, :W])
 )
 
 # output
 StructuralCausalModel
 ```
 
+In the above, the keys of `causes` denote the variables whose causes are being specified, and the values are lists of variables that cause the key variable. In this case, the causes of the treatment `X` are specified as `[:W]`, and the causes of the response `Y` are specified as `[:X, :W]`, identical to how they are defined in a [CausalTable object](formatting.md). Just like for a `CausalTable`, while causes of other variables besides treatment and response can be specified, they are not necessary: only the causes of treatment and response are required as input. 
+
+**Important note**: `causes` must be specified manually unless the user is assuming that all unlabeled variables cause both `treatment` and `response`. This is the default assumption of a `StructuralCausalModel`, but it may not factually match the model encoded by the `DataGeneratingProcess`. This behavior is allowed for two reasons: (1) to permit a random draw of a `CausalTable` with an 'incorrect' causal model, which can be useful for benchmarking the robustness of different causal inference methods to model misspecification, and (2) to simulate causal models that implicitly condition on a particular set of variables by leaving them out of the `causes` argument. Otherwise, ensure that labels in `causes` do not contradict the data generating process! 
+
+Finally, in some cases it may be convenient to define intermediate variables within a DGP
+
+
 ## Networks of Causally-Connected Units
 
 In some cases, we might work with data in which units may *not* be causally independent, but rather, in which one unit's variables could depend on some summary function of its neighbors. Generating data from such a model can be done by adding lines of the form `Xs $ NetworkSummary` to the `@dgp` macro.
@@ -94,8 +155,7 @@ dgp = @dgp(
 scm = StructuralCausalModel(
     dgp;
     treatment = :X,
-    response = :Y,
-    confounders = [:W, :Ws]
+    response = :Y
 )
 
 # output