Skip to content

Commit

Permalink
update readme.md, cleanup docs
Browse files Browse the repository at this point in the history
  • Loading branch information
vandenman committed Jan 13, 2025
1 parent 42e0cef commit fa2ff5f
Show file tree
Hide file tree
Showing 14 changed files with 267 additions and 198 deletions.
271 changes: 139 additions & 132 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -49,186 +49,193 @@ Both use [Turing.jl](github.com/TuringLang/Turing.jl) under the hood and return
## Independent Binomials
`proportion_test` can be used to explore equality constraints across a series of independent Binomials.
```julia
using EqualitySampler, EqualitySampler.Simulations, Distributions, Statistics
import Random, AbstractMCMC, MCMCChains
using EqualitySampler, Distributions
import Random

# simulate some data
Random.seed!(42) # on julia 1.7.3
Random.seed!(42)
n_groups = 5 # no. binomials.
true_partition = rand(UniformPartitionDistribution(n_groups))
# [1, 2, 2, 1, 1]
temp_probabilities = rand(n_groups)
true_probabilities = average_equality_constraints(temp_probabilities, true_partition)
true_probabilities = temp_probabilities[true_partition]
# [0.165894, 0.613478, 0.613478, 0.165894, 0.165894]

# total no. trials
observations = rand(100:200, n_groups)
# [166, 164, 134, 127, 152]

# no. successes
successes = rand(product_distribution(Binomial.(observations, true_probabilities)))
# [27, 101, 90, 16, 23]

obs_proportions = successes ./ observations
[true_probabilities obs_proportions]
# 5×2 Matrix{Float64}:
# 0.30743 0.301205
# 0.640909 0.640244
# 0.640909 0.701493
# 0.30743 0.275591
# 0.30743 0.296053

# specify no mcmc iterations, no chains, parallelization. burnin and thinning can also be specified.
mcmc_settings = MCMCSettings(;iterations = 15_000, chains = 4, parallel = AbstractMCMC.MCMCThreads)

# nothing indicates no equality sampling is done and instead the full model is sampled from
chn_full = proportion_test(successes, observations, nothing; mcmc_settings = mcmc_settings)
# use a BetaBinomial(1, k) over the partitions
# 0.165894 0.162651
# 0.613478 0.615854
# 0.613478 0.671642
# 0.165894 0.125984
# 0.165894 0.151316

partition_prior = BetaBinomialPartitionDistribution(n_groups, 1, n_groups)
chn_eqs = proportion_test(successes, observations, partition_prior; mcmc_settings = mcmc_settings)

# extract posterior mean of full model and averaged across equality constraints
estimated_probabilities_full = mean(chn_full).nt.mean
estimated_probabilities_eqs = mean(MCMCChains.group(chn_eqs, :p_constrained)).nt.mean
no_iter = 100_000

alg_full = EqualitySampler.SampleRJMCMC(;iter = no_iter, fullmodel_only = true)
alg_eq = EqualitySampler.EnumerateThenSample(;iter = no_iter)

prop_samples_full = proportion_test(observations, successes, alg_full, partition_prior)
prop_samples_eqs = proportion_test(observations, successes, alg_eq, partition_prior)


estimated_probabilities_full = vec(mean(prop_samples_full.parameter_samples.θ_p_samples, dims = 2))
estimated_probabilities_eqs = vec(mean(prop_samples_eqs.parameter_samples.θ_p_samples, dims = 2))
[true_probabilities obs_proportions estimated_probabilities_full estimated_probabilities_eqs]
# 5×4 Matrix{Float64}:
# 0.30743 0.301205 0.303421 0.296359
# 0.640909 0.640244 0.638429 0.662943
# 0.640909 0.701493 0.698563 0.666477
# 0.30743 0.275591 0.278896 0.295154
# 0.30743 0.296053 0.298635 0.296189
# 0.165894 0.162651 0.166634 0.150795
# 0.613478 0.615854 0.614287 0.638377
# 0.613478 0.671642 0.669006 0.641879
# 0.165894 0.125984 0.131644 0.149105
# 0.165894 0.151316 0.155732 0.150185

# posterior probabilities of equalities among the probabilities
compute_post_prob_eq(chn_eqs)
# note the shrinkage towards the group mean. Row 4 has a small observed proportion.
# Since rows 1 and 5 have a high posterior probability of being equal to row 4,

compute_post_prob_eq(prop_samples_eqs)
# 5×5 Matrix{Float64}:
# 0.0 0.0 0.0 0.0 0.0
# 0.0 0.0 0.0 0.0 0.0
# 0.0 0.94185 0.0 0.0 0.0
# 0.930683 0.0 0.0 0.0 0.0
# 0.937 0.0 0.0 0.931667 0.0
# true matrix
[i > j && p == q for (i, p) in enumerate(true_partition), (j, q) in enumerate(true_partition)]
# 0.0 0.0 0.0 0.0 0.0
# 3.33618e-18 0.0 0.0 0.0 0.0
# 3.09553e-20 0.93484 0.0 0.0 0.0
# 0.934021 1.05727e-18 1.0493e-20 0.0 0.0
# 0.944021 1.88964e-18 1.78492e-20 0.9391 0.0

# true equalities
[true_partition[i] == true_partition[j] && i < j for j in eachindex(true_partition), i in eachindex(true_partition)]
# 5×5 Matrix{Bool}:
# 0 0 0 0 0
# 0 0 0 0 0
# 0 1 0 0 0
# 1 0 0 0 0
# 1 0 0 1 0

# list the visited models (use compute_model_probs to obtain their posterior probability)
compute_model_counts(chn_eqs, false)
# OrderedCollections.OrderedDict{String, Int64} with 10 entries:
# "12211" => 51488
# "12215" => 1512
# "12241" => 1808
# "12244" => 1562
# "12245" => 141
# "12311" => 2596
# "12315" => 245
# "12341" => 328
# "12344" => 254
# "12345" => 66

# convert true partition to normalized form and print as string
join(EqualitySampler.reduce_model(true_partition))
# "12211"
compute_model_counts(prop_samples_eqs, false)
# OrderedCollections.OrderedDict{Vector{Int8}, Int64} with 10 entries:
# [1, 2, 2, 1, 1] => 86102
# [1, 2, 3, 1, 1] => 4938
# [1, 2, 2, 3, 1] => 2777
# [1, 2, 2, 3, 3] => 2466
# [1, 2, 2, 1, 3] => 1993
# [1, 2, 3, 4, 1] => 542
# [1, 2, 3, 4, 4] => 442
# [1, 2, 3, 1, 4] => 370
# [1, 2, 2, 3, 4] => 264
# [1, 2, 3, 4, 5] => 106

# and it so happens to be that the most visited model is also the true model
true_partition
# 5-element Vector{Int64}:
# 1
# 2
# 2
# 1
# 1
```
## Post hoc tests in One-Way ANOVA
`anova_test` can be used to explore equality constraints across the levels of a single categorical predictor.
The setup uses a grand mean $\mu$ and offsets $\theta_i$ for every level of the categorical predictor.
To identify the model, the constraint $\sum_i\theta_i = 1$ is imposed.
```julia
using EqualitySampler, EqualitySampler.Simulations, Distributions, Statistics
import Random, AbstractMCMC, MCMCChains
using EqualitySampler, Distributions
import Random

# Simulate some data
Random.seed!(42)
n_groups = 5
n_obs_per_group = 1000
Random.seed!(31415)
n_groups = 7
n_obs_per_group = 50
true_partition = rand(UniformPartitionDistribution(n_groups))
temp_θ = randn(n_groups)
temp_θ .-= mean(temp_θ) # center temp_θ to avoid identification constraints
true_θ = average_equality_constraints(temp_θ, true_partition)

g = repeat(1:n_groups; inner = n_obs_per_group)
μ, σ = 0.0, 1.0
temp_θ = randn(n_groups)
true_θ = temp_θ[true_partition]
true_θ .-= mean(true_θ) # center temp_θ to avoid identification constraints

# Important: this is the same parametrization as used by the model!
Dy = MvNormal.+ σ .* true_θ[g], σ)
y = rand(Dy)
data_obj = simulate_data_one_way_anova(n_groups, n_obs_per_group, true_θ, true_partition)
data = data_obj.data

# observed cell offsets
obs_offset = ([mean(y[g .== i]) for i in unique(g)] .- mean(y)) / std(y)
obs_offset = ([mean(data.y[data.g[i]]) for i in eachindex(data.g)] .- mean(data.y)) / std(data.y)
[true_θ obs_offset]
# 5×2 Matrix{Float64}:
# 0.191185 0.24118
# -0.286777 -0.290348
# -0.286777 -0.243936
# 0.191185 0.142092
# 0.191185 0.151012

# specify no mcmc iterations, no chains, parallelization. burnin and thinning can also be specified.
mcmc_settings = MCMCSettings(;iterations = 15_000, chains = 4, parallel = AbstractMCMC.MCMCThreads)

# nothing indicates no equality sampling is done and instead the full model is sampled from
chn_full = anova_test(y, g, nothing; mcmc_settings = mcmc_settings)
# use a BetaBinomial(1, k) over the partitions
# 7×2 Matrix{Float64}:
# 0.253441 0.289433
# 0.253441 0.231883
# 0.253441 0.0709005
# -0.197943 -0.297703
# 0.253441 0.339342
# 0.204135 0.277509
# -1.01996 -0.911364

iter = 100_000

alg_full = EqualitySampler.SampleRJMCMC(;iter = no_iter, fullmodel_only = true)
alg_eq = EqualitySampler.EnumerateThenSample(;iter = no_iter)

partition_prior = BetaBinomialPartitionDistribution(n_groups, 1, n_groups)
chn_eqs = anova_test(y, g, partition_prior; mcmc_settings = mcmc_settings)
anova_samples_full = anova_test(data.y, data.g, alg_full, partition_prior)
anova_samples_eqs = anova_test(data.y, data.g, alg_eq, partition_prior)

estimated_θ_full = Statistics.mean(MCMCChains.group(chn_full, :θ_cs)).nt.mean
estimated_θ_eqs = Statistics.mean(MCMCChains.group(chn_eqs , :θ_cs)).nt.mean
estimated_θ_full = vec(mean(anova_samples_full.parameter_samples.θ_cp, dims = 2))
estimated_θ_eqs = vec(mean( anova_samples_eqs.parameter_samples.θ_cp, dims = 2))
[true_θ obs_offset estimated_θ_full estimated_θ_eqs]
# 5×4 Matrix{Float64}:
# 0.191185 0.24118 0.245577 0.194745
# -0.286777 -0.290348 -0.296143 -0.252687
# -0.286777 -0.243936 -0.248339 -0.25913
# 0.191185 0.142092 0.145534 0.165273
# 0.191185 0.151012 0.153371 0.1518
# 7×4 Matrix{Float64}:
# 0.253441 0.289433 0.287095 0.250173
# 0.253441 0.231883 0.23036 0.237068
# 0.253441 0.0709005 0.0710826 0.155485
# -0.197943 -0.297703 -0.296033 -0.280944
# 0.253441 0.339342 0.337171 0.258582
# 0.204135 0.277509 0.275538 0.247956
# -1.01996 -0.911364 -0.905213 -0.86832


# posterior probabilities of equalities among the probabilities
compute_post_prob_eq(chn_eqs)
# 5×5 Matrix{Float64}:
# 0.0 0.0 0.0 0.0 0.0
# 0.00858333 0.0 0.0 0.0 0.0
# 0.0 0.874517 0.0 0.0 0.0
# 0.772967 0.0256833 0.0428167 0.0 0.0
# 0.73465 0.10215 0.0279333 0.8664 0.0
compute_post_prob_eq(anova_samples_eqs)
# 7×7 Matrix{Float64}:
# 0.0 0.0 0.0 0.0 0.0 0.0 0.0
# 0.73923 0.0 0.0 0.0 0.0 0.0 0.0
# 0.591839 0.618516 0.0 0.0 0.0 0.0 0.0
# 0.0366705 0.0481045 0.177259 0.0 0.0 0.0 0.0
# 0.762704 0.736683 0.57209 0.0316826 0.0 0.0 0.0
# 0.753677 0.739104 0.597044 0.0384047 0.75827 0.0 0.0
# 5.46569e-8 2.93418e-7 5.64801e-5 0.129007 1.40949e-8 7.66761e-8 0.0

# true matrix
[i > j && p == q for (i, p) in enumerate(true_partition), (j, q) in enumerate(true_partition)]
# 5×5 Matrix{Bool}:
# 0 0 0 0 0
# 0 0 0 0 0
# 0 1 0 0 0
# 1 0 0 0 0
# 1 0 0 1 0
# 7×7 Matrix{Bool}:
# 0 0 0 0 0 0 0
# 1 0 0 0 0 0 0
# 1 1 0 0 0 0 0
# 0 0 0 0 0 0 0
# 1 1 1 0 0 0 0
# 0 0 0 0 0 0 0
# 0 0 0 0 0 0 0

# note that the model mistakenly identifies groups 5 and 6 as equal
# these had 0.253441 and 0.339342 as their sample values and 0.204135 and 0.277509 as their true means.

# list the visited models (use compute_model_probs to obtain their posterior probability)
compute_model_counts(chn_eqs, false)
# OrderedCollections.OrderedDict{String, Int64} with 21 entries:
# "11311" => 421
# "11341" => 94
# "12211" => 41409
# "12212" => 346
# "12215" => 1161
# "12221" => 3
# "12222" => 949
# "12241" => 427
# "12242" => 381
# "12244" => 7699
# "12245" => 96
# "12311" => 723
# "12312" => 2261
# "12315" => 57
# "12322" => 168
# "12331" => 963
# "12335" => 654
# "12341" => 39
# "12342" => 1509
# "12344" => 615
# "12345" => 25

# note that there is more uncertainty in the results here, probably because this model is more compelex than the previous.

# convert true partition to normalized form and print as string
join(EqualitySampler.reduce_model(true_partition))
# "12211"
# and it so happens to be that the most visited model is also the true model
compute_model_counts(anova_samples_eqs, false)
# OrderedCollections.OrderedDict{Vector{Int8}, Int64} with 242 entries:
# [1, 1, 1, 2, 1, 1, 3] => 32242
# [1, 1, 1, 2, 1, 1, 2] => 10256
# [1, 1, 2, 2, 1, 1, 3] => 9223
# [1, 1, 2, 3, 1, 1, 4] => 4421
# [1, 2, 2, 3, 1, 1, 4] => 2511
# [1, 1, 1, 1, 1, 1, 2] => 2469
# [1, 2, 2, 3, 1, 2, 4] => 1807
# [1, 1, 2, 3, 1, 2, 4] => 1784
# [1, 1, 1, 2, 3, 1, 4] => 1772
# [1, 1, 1, 2, 3, 3, 4] => 1666
# [1, 2, 1, 3, 2, 2, 4] => 1637
# [1, 2, 2, 3, 2, 2, 4] => 1419
# ⋮ => ⋮

```

# Supplementary Analyses
Expand Down
5 changes: 4 additions & 1 deletion docs/src/distributions over partitions.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,10 @@ AbstractPartitionDistribution
UniformPartitionDistribution
BetaBinomialPartitionDistribution
CustomInclusionPartitionDistribution
RandomProcessPartitionDistribution
DirichletProcessPartitionDistribution
PitmanYorProcessPartitionDistribution
PrecomputedCustomInclusionPartitionDistribution
DuplicatedPartitionDistribution
```

Aside from the interface for multivariate distributions, the following methods are also defined.
Expand Down
17 changes: 9 additions & 8 deletions docs/src/index.md
Original file line number Diff line number Diff line change
@@ -1,18 +1,22 @@
# EqualitySampler

[*EqualitySampler*](https://github.com/vandenman/EqualitySampler.jl) defines four distributions over [partitions of a set](https://en.wikipedia.org/wiki/Partition_of_a_set):
[*EqualitySampler*](https://github.com/vandenman/EqualitySampler.jl) defines distributions over [partitions of a set](https://en.wikipedia.org/wiki/Partition_of_a_set):
- [`UniformPartitionDistribution`](@ref)
- [`BetaBinomialPartitionDistribution`](@ref)
- [`CustomInclusionPartitionDistribution`](@ref)
- [`RandomProcessPartitionDistribution`](@ref)
- [`DirichletProcessPartitionDistribution`](@ref)
- [`PitmanYorProcessPartitionDistribution`](@ref)

These distributions can be as prior distributions over equality constraints among a set of variables.

## Type Hierarchy

Each of the four distributions is a subtype of `AbstractPartitionDistribution` which is a subtype of [`Distributions.DiscreteMultivariateDistribution`](https://juliastats.org/Distributions.jl/stable/multivariate/#multivariates).
Each of the distributions is a subtype of `AbstractPartitionDistribution` which is a subtype of [`Distributions.DiscreteMultivariateDistribution`](https://juliastats.org/Distributions.jl/stable/multivariate/#multivariates).

Thus, each of these types can be called with e.g., `rand` and `logpdf`.
There are multiple abstract types, in particular there is `AbstractSizePartitionDistribution` which is a subtype of `AbstractPartitionDistribution` and is used to represent distributions over partitions where the logpdf of a partition is entirely determined by the size of partition (e.g., ``\{\{\theta_1, \theta_2\}, \{\theta_3\}\}`` has size 2).
When using subtypes of this distribution, it may be convenient to use `PrecomputedCustomInclusionPartitionDistribution` which caches all the computations.
There is also `AbstractProcessPartitionDistribution` for distributions that are based on a stochastic process, such as the Pitman-Yor process and the Dirichlet Process.

## Representation of Partitions

Expand All @@ -27,8 +31,5 @@ For example, given 3 parameters ``(\theta_1, \theta_2, \theta_3)`` there exist 5
| ``\{\{\theta_1\}, \{\theta_2, \theta_3\}\}`` | ``\theta_1 \neq \theta_2 = \theta_3`` | `[1, 2, 2]` |
| ``\{\{\theta_1\}, \{\theta_2\}, \{\theta_3\}\}`` | ``\theta_1 \neq \theta_2 \neq \theta_3`` | `[1, 2, 3]` |

However, we also consider `[2, 2, 2]` and `[3, 3, 3]` to be valid and identical to `[1, 1, 1]`.
The main reason for this is that it has pragmatic benefits when doing Gibbs sampling.
For example, consider that the current partition is `[1, 2, 2]` and the sampler proposes an update for the first element.
A natural proposal would be `[1, 1, 1]`, however, without duplicated partitions this would be impossible (as `[2, 2, 2]` would not exist).
The default [`logpdf`](@ref) accounts for duplicated partitions, use [`logpdf_model_distinct`](@ref) to evaluate the logpdf without duplicated partitions.
We do not consider `[2, 2, 2]` and `[3, 3, 3]` to be valid partitions identical to `[1, 1, 1]`.
When this is desired, one can wrap a partition distribution using `DuplicatedPartitionDistribution` to allow duplicated representation for a partition.
16 changes: 16 additions & 0 deletions docs/src/tests.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,3 +9,19 @@ proportion_test
```@docs
anova_test
```

## Sampling Methods
```@docs
Enumerate
EnumerateThenSample
SampleIntegrated
SampleRJMCMC
```

## Extractor Functions
```@docs
compute_post_prob_eq
compute_model_probs
get_mpm_partition
get_hpm_partition
```
Loading

0 comments on commit fa2ff5f

Please sign in to comment.