salbalkus committed Jan 28, 2025
2 parents c61c22f + c2ed453 commit 3d3a8a3
Showing 14 changed files with 690 additions and 236 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -3,6 +3,7 @@
[![Build Status](https://github.com/salbalkus/CausalTables.jl/actions/workflows/CI.yml/badge.svg?branch=main)](https://github.com/salbalkus/CausalTables.jl/actions/workflows/CI.yml?query=branch%3Amain)
[![Coverage Status](https://coveralls.io/repos/github/salbalkus/CausalTables.jl/badge.svg?branch=main)](https://coveralls.io/github/salbalkus/CausalTables.jl?branch=main)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![JOSS Status](https://joss.theoj.org/papers/68c43e832d063050a4e67528191e8148/status.svg)](https://joss.theoj.org/papers/68c43e832d063050a4e67528191e8148)

*A package for storing and simulating data for causal inference in Julia.*

24 changes: 16 additions & 8 deletions docs/src/index.md
@@ -16,15 +16,15 @@ CausalTables.jl has three main functionalities:

1. Generating simulation data using a `StructuralCausalModel`.
2. Computing "ground truth" conditional distributions, moments, counterfactuals, and counterfactual functionals from a `StructuralCausalModel` and a `CausalTable`. These include, for instance, counterfactual means and average treatment effects.
3. Wrapping an existing Table as a `CausalTable` object for use by external packages.
3. Wrapping an existing Table as a `CausalTable` object for use by external packages, which provides several utility functions for extracting causal-relevant variables from a dataset.

The examples below illustrate each of these three functionalities.

### Simulating Data from a DataGeneratingProcess

To set up a statistical simulation using CausalTables.jl, we first define a `StructuralCausalModel` (SCM). This consists of two parts: a `DataGeneratingProcess` (DGP) that controls how the data is generated, and a list of variables to define the basic structure of the underlying causal diagram.

A DataGeneratingProcess can be constructed using the `@dgp` macro, which takes a sequence of conditional distributions of the form `[variable name] ~ Distribution(args...)` and returns a `DataGeneratingProcess` object. Then, one can construct a StructuralCausalModel by passing the DGP to its constructor, along with labels of the treatment and response variables. Note that `using Distributions` is almost always required before defining a DGP, since the package [Distributions.jl](https://juliastats.org/Distributions.jl/stable/) is used to define the conditional distribution of random components at each step.
A DataGeneratingProcess can be constructed using the `@dgp` macro, which takes a sequence of conditional distributions of the form `[name] ~ Distribution(args...)` or auxiliary variables `[name] = some code...` and returns a `DataGeneratingProcess` object. Then, one can construct a StructuralCausalModel by passing the DGP to its constructor, along with labels of the treatment and response variables. Note that `using Distributions` is almost always required before defining a DGP, since the package [Distributions.jl](https://juliastats.org/Distributions.jl/stable/) is used to define the conditional distribution of random components at each step.

```jldoctest quicktest; output = false, filter = r"(?<=.{21}).*"s
using CausalTables
@@ -40,8 +40,7 @@ dgp = @dgp(
scm = StructuralCausalModel(
dgp;
treatment = :X,
response = :Y,
confounders = [:W]
response = :Y
)
# output
@@ -101,10 +100,9 @@ X_distribution = condensity(scm, data, :X)
Distributions.Normal{Float64}(μ=5.0, σ=1.0)
```

For convenience, there also exist `conmean` and `convar` functions that extract the true conditional mean and variance of a specific variable in the CausalTable. One can apply these to an "intervened" version of the data to obtain the conditional mean of the outcome under intervention.
For convenience, there also exist functions like `conmean`, `convar`, and `propensity` that extract the true conditional mean, variance, and (generalized) propensity score of a specific variable in the CausalTable. One can apply these to an "intervened" version of the data to obtain functionals of the outcome under intervention:

```jldoctest quicktest
Y_var = convar(scm, data_intervened, :Y)
Y_mean = conmean(scm, data_intervened, :Y)
# output
@@ -124,7 +122,17 @@ If you have a table of data that you would like to use with CausalTables.jl with

```jldoctest quicktest; output = false, filter = r"(?<=.{11}).*"s
tbl = (W = rand(1:5, 10), X = randn(10), Y = randn(10))
ctbl = CausalTable(tbl; treatment = :X, response = :Y, confounders = [:W])
ctbl = CausalTable(tbl; treatment = :X, response = :Y,
causes = (X = [:W], Y = [:W, :X]))
# output
CausalTable
```

Doing this is often convenient, as it allows you to use the utility functions provided by CausalTables.jl to extract causal-relevant variables from the dataset. For instance, you can extract the treatment, response, confounders, mediators, or instruments from the dataset using the corresponding functions. For example, the following subsets the data to include only confounders:

```jldoctest quicktest; output = false, filter = r"(?<=.{11}).*"s
confounders(ctbl)
# output
CausalTable
@@ -134,4 +142,4 @@ For a more detailed guide of how to wrap an existing table as a CausalTable plea

# Contributing

Have questions? Spot a bug or issue in the documentation? Want to request a new feature or add one yourself? Please do not hesitate to open an issue or pull request on the [CausalTables.jl GitHub repository](https://github.com/salbalkus/CausalTables.jl). We welcome all contributions and feedback!
Have questions? Spot a bug or issue in the documentation? Want to request a new feature or add one yourself? Please don't hesitate to open an issue or pull request on the [CausalTables.jl GitHub repository](https://github.com/salbalkus/CausalTables.jl). We welcome all contributions and feedback!
41 changes: 35 additions & 6 deletions docs/src/man/formatting.md
@@ -2,7 +2,7 @@

In Julia, most datasets are stored in a Table: a data structure with a [Tables.jl](https://tables.juliadata.org/stable/)-compatible interface. One of the main purposes of CausalTables.jl is to wrap a Table of data in Julia in order to provide it as input to some other causal inference package. Given a Table of some data, we can turn it into a `CausalTable` by specifying the treatment, response, and control variables.

## Tables with Causally Independent Units
## Constructing the `CausalTable`

The code below provides an example of how to wrap the Boston Housing dataset as a `CausalTable` to answer causal questions of the form "How would changing nitrous oxide air pollution (`NOX`) within Boston-area towns affect median home value (`MEDV`)?" Any dataset in a [Tables.jl](https://tables.juliadata.org/stable/)-compliant format can be wrapped as a `CausalTable`. In this example, we turn a `DataFrame` from [DataFrames.jl](https://dataframes.juliadata.org/stable/) into a `CausalTable` object.

@@ -15,11 +15,21 @@ using DataFrames
tbl = BostonHousing().dataframe
# Wrapping the dataset in a CausalTable
ctbl = CausalTable(tbl; treatment = :NOX, response = :MEDV, confounders = [:CRIM, :ZN, :INDUS, :CHAS, :B, :DIS, :LSTAT])
ctbl = CausalTable(tbl; treatment = :NOX, response = :MEDV)
nothing # hide
```

When only `treatment` and `response` are specified, all other variables are assumed to be confounders. However, one can also explicitly specify the causes of both treatment and response by passing them as a `NamedTuple` of lists to the `CausalTable` constructor. In the example below, we specify the causes of the treatment `NOX` only as `[:CRIM, :INDUS]`, and the causes of the response `MEDV` are specified as `[:CRIM, :INDUS, :NOX]`.

```@example bostonhousing
ctbl = CausalTable(tbl; treatment = :NOX, response = :MEDV,
causes = (NOX = [:CRIM, :INDUS], MEDV = [:CRIM, :INDUS, :NOX]))
```

Note that a full representation of the causes of each variable (often referred to as a "[directed acyclic graph](https://www.nature.com/articles/s41390-018-0071-3)") is **not** required, though it can be specified. Only the causes of the treatment and response are necessary as input; `CausalTables.jl` can automatically compute other types of variables one might be interested in, such as confounders or mediators.
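To make the idea concrete, here is a minimal plain-Julia sketch of how confounders could be read off a `causes` specification like the one above. It is only an illustration of the concept, not the CausalTables.jl implementation: the helper names `common_causes` and `treatment_only_causes` are hypothetical, and the "instrument candidates" rule is a deliberate simplification.

```julia
# Conceptual sketch only; helper names are hypothetical and this is
# NOT how CausalTables.jl computes these sets internally.
treatment_name = :NOX
response_name = :MEDV
causes = (NOX = [:CRIM, :INDUS], MEDV = [:CRIM, :INDUS, :NOX])

# Confounders: variables listed as causes of both treatment and response.
common_causes(causes, trt, resp) = intersect(causes[trt], causes[resp])

# Instrument candidates (simplified): causes of the treatment that are
# not listed as direct causes of the response.
treatment_only_causes(causes, trt, resp) = setdiff(causes[trt], causes[resp])

confs = common_causes(causes, treatment_name, response_name)
insts = treatment_only_causes(causes, treatment_name, response_name)

println("confounders: ", confs)
println("instrument candidates: ", insts)
```

In this specification both `:CRIM` and `:INDUS` cause `NOX` and `MEDV`, so they are the common causes, and no variable causes the treatment alone.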

After wrapping a dataset in a `CausalTable` object, the [Tables.jl](https://tables.juliadata.org/stable/) interface is available to call on the `CausalTable` as well. Below, we demonstrate a few of these functions, as well as additional utility functions for causal inference tasks made available by CausalTables.jl.

```@example bostonhousing
@@ -29,16 +39,35 @@ using Tables
Tables.getcolumn(ctbl, :NOX) # extract specific column
Tables.subset(ctbl, 1:5) # extract specific rows
Tables.columnnames(ctbl) # obtain all column names
```

In addition, the `CausalTable` object has several utility functions that can be used to extract different types of variables relevant to causal inference from the `CausalTable` object.

```@example bostonhousing
# Additional utility functions for CausalTables
treatment(ctbl) # get CausalTable of treatment variables
response(ctbl) # get CausalTable of response variables
confounders(ctbl) # get CausalTable of confounders
treatmentparents(ctbl) # get CausalTable of the treatment's parents
responseparents(ctbl) # get CausalTable of treatment and confounders
data(ctbl) # get underlying wrapped dataset
# replace one or more attributes of the CausalTable
CausalTables.replace(ctbl; response = :CRIM, confounders = [:MEDV, :ZN, :INDUS, :CHAS, :B, :DIS, :LSTAT])
parents(ctbl, :NOX) # get CausalTable of parents of a particular variable
confounders(ctbl) # get CausalTable of confounders
mediators(ctbl) # get CausalTable of mediators
instruments(ctbl) # get CausalTable of instruments
data(ctbl) # get underlying wrapped dataset of a CausalTable
nothing # hide
```

Although the `CausalTable` object is immutable, one can replace the values of its attributes with new ones using the `replace` function. The code below demonstrates how to replace the response variable of the `CausalTable` object `ctbl` with `:CRIM` and its `causes` with `nothing`. Setting `causes = nothing` is a quick shortcut to specify that all unlabeled variables are confounders of the treatment-response relationship.

```@example bostonhousing
# Replace one or more attributes of the CausalTable.
# Setting `causes = nothing` is a quick shortcut to specify
# that all unlabeled variables are confounders of the treatment-response relationship
CausalTables.replace(ctbl; response = :CRIM, causes = nothing)
nothing # hide
```
86 changes: 73 additions & 13 deletions docs/src/man/generating-data.md
@@ -4,7 +4,7 @@ When evaluating a causal inference method, we often want to test it on data from

## Defining a DataGeneratingProcess

A data generating process describes a mechanism by which draws from random variables are simulated. It typically takes the form of a sequence of conditional distributions. CausalTables allows us to define a DGP as a DataGeneratingProcess object, which takes three arguments: the `names` of variables generated at each step, the `types` of these variables, and `funcs`, an array of functions of the form `(; O...) -> *some code`.
A data generating process describes a mechanism by which draws from random variables are simulated. It typically takes the form of a sequence of conditional distributions. CausalTables allows us to define a DGP as a DataGeneratingProcess object, which takes three arguments: the `names` of variables generated at each step, the `types` of these variables, and `funcs`, an array of functions of the form `O -> *some code`.

Suppose, for example, that we wanted to simulate data from the following DGP:

@@ -24,52 +24,113 @@ using CausalTables
DataGeneratingProcess(
[:W, :X, :Y],
[:distribution, :distribution, :distribution],
[
(; O...) -> DiscreteUniform(1, 5),
(; O...) -> (@. Normal(O.W, 1)),
(; O...) -> (@. Normal(O.X + 0.2 * O.W, 1))
O -> DiscreteUniform(1, 5),
O -> (@. Normal(O.W, 1)),
O -> (@. Normal(O.X + 0.2 * O.W, 1))
]
)
# output
DataGeneratingProcess
```
where `; O...` syntax is a shorthand for a function that takes keyword arguments corresponding to the names of the variables in the DGP.
where `O` is an object that stores the output of each previous function in the sequence as a field named after the corresponding entry of `names` (i.e., in this example, the first function's output is stored as `O.W`, the second function's output is stored as `O.X`, and so on).
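The sequential evaluation described here can be sketched in plain Julia, without CausalTables.jl or Distributions.jl. This is an assumption-laden illustration of the idea, not the package's actual internals: `draw_sequentially`, `step_names`, and `step_funcs` are hypothetical names, and each step returns random draws directly rather than a distribution object.

```julia
# Plain-Julia sketch of evaluating a sequence of `O -> ...` steps.
# All names here are hypothetical; CausalTables.jl may work differently.
step_names = [:W, :X, :Y]
step_funcs = [
    O -> rand(1:5, 10),                    # W: discrete uniform draws on 1..5
    O -> O.W .+ randn(10),                 # X: centered on the earlier draw of W
    O -> O.X .+ 0.2 .* O.W .+ randn(10),   # Y: depends on both X and W
]

function draw_sequentially(step_names, step_funcs)
    O = NamedTuple()                       # no variables generated yet
    for (name, f) in zip(step_names, step_funcs)
        value = f(O)                       # each step sees all earlier outputs
        O = merge(O, NamedTuple{(name,)}((value,)))
    end
    return O
end

O = draw_sequentially(step_names, step_funcs)
println(keys(O))
```

Accumulating results in a `NamedTuple` means each step can refer to earlier variables by name (`O.W`), by position (`O[1]`), or all at once (`values(O)`).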

However, a much more convenient way to define this DGP is using the `@dgp` macro, which takes a sequence of conditional distributions of the form `[variable name] ~ Distribution(args...)` and deterministic variable assignments of the form `[variable name] = f(...)` and automatically generates a valid DataGeneratingProcess. For example, the *easier* way to define the DGP above is as follows:

```jldoctest generation; output = false, filter = r"(?<=.{21}).*"s
using CausalTables
distributions = @dgp(
W ~ DiscreteUniform(1, 5),
X ~ (@. Normal(W, 1)),
X ~ Normal.(W, 1),
Y ~ (@. Normal(X + 0.2 * W, 1))
)
# output
DataGeneratingProcess
```

Note that with the `@dgp` macro, any symbol (that is, any string of characters prefixed by a colon, as in `:W` or `:X`) is automatically replaced with the corresponding previously-defined variable in the process. For instance, in `Normal(:W, 1)`, the `:W` will be replaced automatically with the distribution we defined as `W` earlier in the sequence.
Note that when using the `@dgp` macro, any symbol defined on the left side of an equation in the sequence can be used to pass in the output of a previous step on the right side. For example, in the above code, the symbol `W` is used to pass in the output of the first step to the second step. This works by metaprogramming which replaces `W` with `O.W` when the function is constructed by `@dgp`.

In this way, we can define virtually any DGP that can be expressed as a sequence of conditional distributions. For ease of use, one can still use the `O` object in the `@dgp` macro to pass in the output of all previous steps, which is especially useful for programmatically-defined DGPs. For example, the following code is equivalent to the above code:

```jldoctest generation; output = false, filter = r"(?<=.{21}).*"s
distributions = @dgp(
W ~ DiscreteUniform(1, 5),
X ~ Normal.(O[1], 1),
Y ~ Normal.(hcat(values(O)...) * [0.2, 1], 1)
)
# output
DataGeneratingProcess
```

In the second step, the previous variable is accessed by index using `O[1]`, and in the third step, all previous variables are combined into a matrix by `hcat(values(O)...)`. Be careful when using these constructions, however, as they can make the code harder to read and understand. In some cases, it may be better to construct a `DataGeneratingProcess` manually using the constructor, for which several additional utilities are available.
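One detail worth checking when using `hcat(values(O)...)` is column order. The short sketch below assumes, purely for illustration, that `O` behaves like a `NamedTuple` holding `W` and `X` in the order they were defined; the values are made up. Under that assumption the columns of the matrix follow definition order, so a weight vector must list the weight for `W` first and the weight for `X` second.

```julia
# Illustration with a plain NamedTuple standing in for `O` (an assumption);
# the numbers are arbitrary example values.
O = (W = [1, 2, 3], X = [0.5, 1.5, 2.5])

first_var = O[1]            # positional access: the same array as O.W
M = hcat(values(O)...)      # 3×2 matrix with columns ordered as (W, X)

# To reproduce the mean `X + 0.2 * W`, W's weight comes first.
mean_by_matrix = M * [0.2, 1]
mean_by_name = O.X .+ 0.2 .* O.W

println(isapprox(mean_by_matrix, mean_by_name))
```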

For instance, if one wanted to generate a large number of variables with the same distribution, one could use the `DataGeneratingProcess` constructor without specifying variable names, in which case names will be automatically generated:

```jldoctest generation; output = false, filter = r"(?<=.{75}).*"s
many_distributions = DataGeneratingProcess(
[O -> Normal(0, 1) for _ in 1:100]
)
# output
DataGeneratingProcess([:X1, :X2, :X3, :X4, :X5, :X6, :X7, :X8, :X9, :X10 …
```

In addition, the `merge` function can be used to combine two separate DGP sequences into one:

```jldoctest generation; output = false, filter = r"(?<=.{21}).*"s
# Define a new distribution whose mean is the mean of previous draws
output_distribution = @dgp(
Y ~ Normal.(reduce(+, values(O)) ./ n, 1)
)
# Merge our previous `many_distributions` with the new `output_distribution`
new_distributions = merge(many_distributions, output_distribution)
# output
DataGeneratingProcess
```

## Defining a StructuralCausalModel

In CausalTables.jl, a StructuralCausalModel is a data generating process endowed with some causal interpretation. Constructing a StructuralCausalModel allows users to randomly draw a CausalTable with the necessary components from the DataGeneratingProcess they've defined. With the above DataGeneratingProcess in hand, we can define a `StructuralCausalModel` object like so -- treatment, response, and confounder variables in the causal model are specified as keyword arguments to the `StructuralCausalModel` constructor:
In CausalTables.jl, a StructuralCausalModel is a data generating process endowed with some causal interpretation. Constructing a StructuralCausalModel allows users to randomly draw a CausalTable with the necessary components from the DataGeneratingProcess they've defined. With the previous DataGeneratingProcess in hand, we can define a `StructuralCausalModel` object like so -- treatment and response in the causal model are specified as keyword arguments to the `StructuralCausalModel` constructor:


```jldoctest generation; output = false, filter = r"(?<=.{21}).*"s
scm = StructuralCausalModel(
distributions;
treatment = :X,
response = :Y
)
# output
StructuralCausalModel
```

When a `StructuralCausalModel` is constructed with only treatment and response specified, all other variables are assumed to be confounders. However, one can also explicitly specify the causes of both treatment and response by passing them as a `NamedTuple` of lists to the `StructuralCausalModel` constructor:

```jldoctest generation; output = false, filter = r"(?<=.{21}).*"s
dgp = StructuralCausalModel(
scm = StructuralCausalModel(
distributions;
treatment = :X,
response = :Y,
confounders = [:W]
causes = (X = [:W], Y = [:X, :W])
)
# output
StructuralCausalModel
```

In the above, the keys of `causes` denote the variables whose causes are being specified, and the values are lists of variables that cause the key variable. In this case, the causes of the treatment `X` are specified as `[:W]`, and the causes of the response `Y` are specified as `[:X, :W]`, identical to how they are defined in a [CausalTable object](formatting.md). Just like for a `CausalTable`, while causes of other variables besides treatment and response can be specified, they are not necessary: only the causes of treatment and response are required as input.

**Important note**: `causes` must be specified manually unless the user is assuming that all unlabeled variables cause both `treatment` and `response`. This is the default assumption of a `StructuralCausalModel`, but it may not match the model actually encoded by the `DataGeneratingProcess`. This behavior is allowed for two reasons: (1) to permit a random draw of a `CausalTable` with an 'incorrect' causal model, which can be useful for benchmarking the robustness of different causal inference methods to model misspecification, and (2) to simulate causal models that implicitly condition on a particular set of variables by leaving them out of the `causes` argument. Otherwise, ensure that labels in `causes` do not contradict the data generating process!

Finally, in some cases it may be convenient to define intermediate variables within a DGP


## Networks of Causally-Connected Units

In some cases, we might work with data in which units may *not* be causally independent, but rather, in which one unit's variables could depend on some summary function of its neighbors. Generating data from such a model can be done by adding lines of the form `Xs $ NetworkSummary` to the `@dgp` macro.
@@ -94,8 +155,7 @@ dgp = @dgp(
scm = StructuralCausalModel(
dgp;
treatment = :X,
response = :Y,
confounders = [:W, :Ws]
response = :Y
)
# output