update readme.md, cleanup docs

vandenman · Jan 13, 2025 · fa2ff5f · fa2ff5f
1 parent 42e0cef
commit fa2ff5f
Show file tree

Hide file tree

Showing 14 changed files with 267 additions and 198 deletions.
diff --git a/README.md b/README.md
@@ -49,186 +49,193 @@ Both use [Turing.jl](github.com/TuringLang/Turing.jl) under the hood and return
 ## Independent Binomials
 `proportion_test` can be used to explore equality constraints across a series of independent Binomials.
 ```julia
-using EqualitySampler, EqualitySampler.Simulations, Distributions, Statistics
-import Random, AbstractMCMC, MCMCChains
+using EqualitySampler, Distributions
+import Random
 
-# simulate some data
-Random.seed!(42) # on julia 1.7.3
+Random.seed!(42)
 n_groups = 5 # no. binomials.
 true_partition     = rand(UniformPartitionDistribution(n_groups))
+# [1, 2, 2, 1, 1]
 temp_probabilities = rand(n_groups)
-true_probabilities = average_equality_constraints(temp_probabilities, true_partition)
+true_probabilities = temp_probabilities[true_partition]
+# [0.165894, 0.613478, 0.613478, 0.165894, 0.165894]
+
 # total no. trials
 observations = rand(100:200, n_groups)
+# [166, 164, 134, 127, 152]
+
 # no. successes
 successes = rand(product_distribution(Binomial.(observations, true_probabilities)))
+# [27, 101, 90, 16, 23]
 
 obs_proportions = successes ./ observations
 [true_probabilities obs_proportions]
 # 5×2 Matrix{Float64}:
-#  0.30743   0.301205
-#  0.640909  0.640244
-#  0.640909  0.701493
-#  0.30743   0.275591
-#  0.30743   0.296053
-
-# specify no mcmc iterations, no chains, parallelization. burnin and thinning can also be specified.
-mcmc_settings = MCMCSettings(;iterations = 15_000, chains = 4, parallel = AbstractMCMC.MCMCThreads)
-
-# nothing indicates no equality sampling is done and instead the full model is sampled from
-chn_full = proportion_test(successes, observations, nothing; mcmc_settings = mcmc_settings)
-# use a BetaBinomial(1, k) over the partitions
+#  0.165894  0.162651
+#  0.613478  0.615854
+#  0.613478  0.671642
+#  0.165894  0.125984
+#  0.165894  0.151316
+
 partition_prior = BetaBinomialPartitionDistribution(n_groups, 1, n_groups)
-chn_eqs  = proportion_test(successes, observations, partition_prior; mcmc_settings = mcmc_settings)
 
-# extract posterior mean of full model and averaged across equality constraints
-estimated_probabilities_full = mean(chn_full).nt.mean
-estimated_probabilities_eqs = mean(MCMCChains.group(chn_eqs, :p_constrained)).nt.mean
+no_iter = 100_000
+
+alg_full = EqualitySampler.SampleRJMCMC(;iter = no_iter, fullmodel_only = true)
+alg_eq   = EqualitySampler.EnumerateThenSample(;iter = no_iter)
+
+prop_samples_full   = proportion_test(observations, successes, alg_full, partition_prior)
+prop_samples_eqs    = proportion_test(observations, successes, alg_eq,   partition_prior)
+
+
+estimated_probabilities_full = vec(mean(prop_samples_full.parameter_samples.θ_p_samples, dims = 2))
+estimated_probabilities_eqs  = vec(mean(prop_samples_eqs.parameter_samples.θ_p_samples, dims = 2))
 [true_probabilities obs_proportions estimated_probabilities_full estimated_probabilities_eqs]
 # 5×4 Matrix{Float64}:
-#  0.30743   0.301205  0.303421  0.296359
-#  0.640909  0.640244  0.638429  0.662943
-#  0.640909  0.701493  0.698563  0.666477
-#  0.30743   0.275591  0.278896  0.295154
-#  0.30743   0.296053  0.298635  0.296189
+#  0.165894  0.162651  0.166634  0.150795
+#  0.613478  0.615854  0.614287  0.638377
+#  0.613478  0.671642  0.669006  0.641879
+#  0.165894  0.125984  0.131644  0.149105
+#  0.165894  0.151316  0.155732  0.150185
 
-# posterior probabilities of equalities among the probabilities
-compute_post_prob_eq(chn_eqs)
+# note the shrinkage towards the group mean. Row 4 has a small observed proportion.
+# Since rows 1 and 5 have a high posterior probability of being equal to row 4,
+
+compute_post_prob_eq(prop_samples_eqs)
 # 5×5 Matrix{Float64}:
-#  0.0       0.0      0.0  0.0       0.0
-#  0.0       0.0      0.0  0.0       0.0
-#  0.0       0.94185  0.0  0.0       0.0
-#  0.930683  0.0      0.0  0.0       0.0
-#  0.937     0.0      0.0  0.931667  0.0
-# true matrix
-[i > j && p == q for (i, p) in enumerate(true_partition), (j, q) in enumerate(true_partition)]
+#  0.0          0.0          0.0          0.0     0.0
+#  3.33618e-18  0.0          0.0          0.0     0.0
+#  3.09553e-20  0.93484      0.0          0.0     0.0
+#  0.934021     1.05727e-18  1.0493e-20   0.0     0.0
+#  0.944021     1.88964e-18  1.78492e-20  0.9391  0.0
+
+# true equalities
+[true_partition[i] == true_partition[j] && i < j for j in eachindex(true_partition), i in eachindex(true_partition)]
 # 5×5 Matrix{Bool}:
 #  0  0  0  0  0
 #  0  0  0  0  0
 #  0  1  0  0  0
 #  1  0  0  0  0
 #  1  0  0  1  0
 
-# list the visited models (use compute_model_probs to obtain their posterior probability)
-compute_model_counts(chn_eqs, false)
-# OrderedCollections.OrderedDict{String, Int64} with 10 entries:
-# "12211" => 51488
-# "12215" => 1512
-# "12241" => 1808
-# "12244" => 1562
-# "12245" => 141
-# "12311" => 2596
-# "12315" => 245
-# "12341" => 328
-# "12344" => 254
-# "12345" => 66
-
-# convert true partition to normalized form and print as string
-join(EqualitySampler.reduce_model(true_partition))
-# "12211"
+compute_model_counts(prop_samples_eqs, false)
+# OrderedCollections.OrderedDict{Vector{Int8}, Int64} with 10 entries:
+#   [1, 2, 2, 1, 1] => 86102
+#   [1, 2, 3, 1, 1] => 4938
+#   [1, 2, 2, 3, 1] => 2777
+#   [1, 2, 2, 3, 3] => 2466
+#   [1, 2, 2, 1, 3] => 1993
+#   [1, 2, 3, 4, 1] => 542
+#   [1, 2, 3, 4, 4] => 442
+#   [1, 2, 3, 1, 4] => 370
+#   [1, 2, 2, 3, 4] => 264
+#   [1, 2, 3, 4, 5] => 106
+
 # and it so happens to be that the most visited model is also the true model
+true_partition
+# 5-element Vector{Int64}:
+#  1
+#  2
+#  2
+#  1
+#  1
 ```
 ## Post hoc tests in One-Way ANOVA
 `anova_test` can be used to explore equality constraints across the levels of a single categorical predictor.
 The setup uses a grand mean $\mu$ and offsets $\theta_i$ for every level of the categorical predictor.
 To identify the model, the constraint $\sum_i\theta_i = 1$ is imposed.
 ```julia
-using EqualitySampler, EqualitySampler.Simulations, Distributions, Statistics
-import Random, AbstractMCMC, MCMCChains
+using EqualitySampler, Distributions
+import Random
 
 # Simulate some data
-Random.seed!(42)
-n_groups = 5
-n_obs_per_group = 1000
+Random.seed!(31415)
+n_groups = 7
+n_obs_per_group = 50
 true_partition = rand(UniformPartitionDistribution(n_groups))
-temp_θ = randn(n_groups)
-temp_θ .-= mean(temp_θ) # center temp_θ to avoid identification constraints
-true_θ = average_equality_constraints(temp_θ, true_partition)
 
-g = repeat(1:n_groups; inner = n_obs_per_group)
-μ, σ = 0.0, 1.0
+temp_θ = randn(n_groups)
+true_θ = temp_θ[true_partition]
+true_θ .-= mean(true_θ) # center temp_θ to avoid identification constraints
 
-# Important: this is the same parametrization as used by the model!
-Dy = MvNormal(μ .+ σ .* true_θ[g], σ)
-y = rand(Dy)
+data_obj = simulate_data_one_way_anova(n_groups, n_obs_per_group, true_θ, true_partition)
+data = data_obj.data
 
-# observed cell offsets
-obs_offset = ([mean(y[g .== i]) for i in unique(g)] .- mean(y)) / std(y)
+obs_offset = ([mean(data.y[data.g[i]]) for i in eachindex(data.g)] .- mean(data.y)) / std(data.y)
 [true_θ obs_offset]
-# 5×2 Matrix{Float64}:
-#   0.191185   0.24118
-#  -0.286777  -0.290348
-#  -0.286777  -0.243936
-#   0.191185   0.142092
-#   0.191185   0.151012
-
-# specify no mcmc iterations, no chains, parallelization. burnin and thinning can also be specified.
-mcmc_settings = MCMCSettings(;iterations = 15_000, chains = 4, parallel = AbstractMCMC.MCMCThreads)
-
-# nothing indicates no equality sampling is done and instead the full model is sampled from
-chn_full = anova_test(y, g, nothing; mcmc_settings = mcmc_settings)
-# use a BetaBinomial(1, k) over the partitions
+# 7×2 Matrix{Float64}:
+#   0.253441   0.289433
+#   0.253441   0.231883
+#   0.253441   0.0709005
+#  -0.197943  -0.297703
+#   0.253441   0.339342
+#   0.204135   0.277509
+#  -1.01996   -0.911364
+
+iter = 100_000
+
+alg_full = EqualitySampler.SampleRJMCMC(;iter = no_iter, fullmodel_only = true)
+alg_eq   = EqualitySampler.EnumerateThenSample(;iter = no_iter)
+
 partition_prior = BetaBinomialPartitionDistribution(n_groups, 1, n_groups)
-chn_eqs  = anova_test(y, g, partition_prior; mcmc_settings = mcmc_settings)
+anova_samples_full   = anova_test(data.y, data.g, alg_full, partition_prior)
+anova_samples_eqs    = anova_test(data.y, data.g, alg_eq,   partition_prior)
 
-estimated_θ_full = Statistics.mean(MCMCChains.group(chn_full, :θ_cs)).nt.mean
-estimated_θ_eqs  = Statistics.mean(MCMCChains.group(chn_eqs , :θ_cs)).nt.mean
+estimated_θ_full = vec(mean(anova_samples_full.parameter_samples.θ_cp, dims = 2))
+estimated_θ_eqs  = vec(mean( anova_samples_eqs.parameter_samples.θ_cp, dims = 2))
 [true_θ obs_offset estimated_θ_full estimated_θ_eqs]
-# 5×4 Matrix{Float64}:
-#   0.191185   0.24118    0.245577   0.194745
-#  -0.286777  -0.290348  -0.296143  -0.252687
-#  -0.286777  -0.243936  -0.248339  -0.25913
-#   0.191185   0.142092   0.145534   0.165273
-#   0.191185   0.151012   0.153371   0.1518
+# 7×4 Matrix{Float64}:
+#   0.253441   0.289433    0.287095    0.250173
+#   0.253441   0.231883    0.23036     0.237068
+#   0.253441   0.0709005   0.0710826   0.155485
+#  -0.197943  -0.297703   -0.296033   -0.280944
+#   0.253441   0.339342    0.337171    0.258582
+#   0.204135   0.277509    0.275538    0.247956
+#  -1.01996   -0.911364   -0.905213   -0.86832
+
 
 # posterior probabilities of equalities among the probabilities
-compute_post_prob_eq(chn_eqs)
-# 5×5 Matrix{Float64}:
-#  0.0         0.0        0.0        0.0     0.0
-#  0.00858333  0.0        0.0        0.0     0.0
-#  0.0         0.874517   0.0        0.0     0.0
-#  0.772967    0.0256833  0.0428167  0.0     0.0
-#  0.73465     0.10215    0.0279333  0.8664  0.0
+compute_post_prob_eq(anova_samples_eqs)
+# 7×7 Matrix{Float64}:
+#  0.0         0.0         0.0         0.0        0.0         0.0         0.0
+#  0.73923     0.0         0.0         0.0        0.0         0.0         0.0
+#  0.591839    0.618516    0.0         0.0        0.0         0.0         0.0
+#  0.0366705   0.0481045   0.177259    0.0        0.0         0.0         0.0
+#  0.762704    0.736683    0.57209     0.0316826  0.0         0.0         0.0
+#  0.753677    0.739104    0.597044    0.0384047  0.75827     0.0         0.0
+#  5.46569e-8  2.93418e-7  5.64801e-5  0.129007   1.40949e-8  7.66761e-8  0.0
+
 # true matrix
 [i > j && p == q for (i, p) in enumerate(true_partition), (j, q) in enumerate(true_partition)]
-# 5×5 Matrix{Bool}:
-#  0  0  0  0  0
-#  0  0  0  0  0
-#  0  1  0  0  0
-#  1  0  0  0  0
-#  1  0  0  1  0
+# 7×7 Matrix{Bool}:
+#  0  0  0  0  0  0  0
+#  1  0  0  0  0  0  0
+#  1  1  0  0  0  0  0
+#  0  0  0  0  0  0  0
+#  1  1  1  0  0  0  0
+#  0  0  0  0  0  0  0
+#  0  0  0  0  0  0  0
+
+# note that the model mistakenly identifies groups 5 and 6 as equal
+# these had 0.253441 and 0.339342 as their sample values and 0.204135 and 0.277509 as their true means.
 
 # list the visited models (use compute_model_probs to obtain their posterior probability)
-compute_model_counts(chn_eqs, false)
-# OrderedCollections.OrderedDict{String, Int64} with 21 entries:
-#   "11311" => 421
-#   "11341" => 94
-#   "12211" => 41409
-#   "12212" => 346
-#   "12215" => 1161
-#   "12221" => 3
-#   "12222" => 949
-#   "12241" => 427
-#   "12242" => 381
-#   "12244" => 7699
-#   "12245" => 96
-#   "12311" => 723
-#   "12312" => 2261
-#   "12315" => 57
-#   "12322" => 168
-#   "12331" => 963
-#   "12335" => 654
-#   "12341" => 39
-#   "12342" => 1509
-#   "12344" => 615
-#   "12345" => 25
-
-# note that there is more uncertainty in the results here, probably because this model is more compelex than the previous.
-
-# convert true partition to normalized form and print as string
-join(EqualitySampler.reduce_model(true_partition))
-# "12211"
-# and it so happens to be that the most visited model is also the true model
+compute_model_counts(anova_samples_eqs, false)
+# OrderedCollections.OrderedDict{Vector{Int8}, Int64} with 242 entries:
+#   [1, 1, 1, 2, 1, 1, 3] => 32242
+#   [1, 1, 1, 2, 1, 1, 2] => 10256
+#   [1, 1, 2, 2, 1, 1, 3] => 9223
+#   [1, 1, 2, 3, 1, 1, 4] => 4421
+#   [1, 2, 2, 3, 1, 1, 4] => 2511
+#   [1, 1, 1, 1, 1, 1, 2] => 2469
+#   [1, 2, 2, 3, 1, 2, 4] => 1807
+#   [1, 1, 2, 3, 1, 2, 4] => 1784
+#   [1, 1, 1, 2, 3, 1, 4] => 1772
+#   [1, 1, 1, 2, 3, 3, 4] => 1666
+#   [1, 2, 1, 3, 2, 2, 4] => 1637
+#   [1, 2, 2, 3, 2, 2, 4] => 1419
+#   ⋮                     => ⋮
+
 ```
 
 # Supplementary Analyses

diff --git a/docs/src/distributions over partitions.md b/docs/src/distributions over partitions.md
@@ -6,7 +6,10 @@ AbstractPartitionDistribution
 UniformPartitionDistribution
 BetaBinomialPartitionDistribution
 CustomInclusionPartitionDistribution
-RandomProcessPartitionDistribution
+DirichletProcessPartitionDistribution
+PitmanYorProcessPartitionDistribution
+PrecomputedCustomInclusionPartitionDistribution
+DuplicatedPartitionDistribution
 ```
 
 Aside from the interface for multivariate distributions, the following methods are also defined.

diff --git a/docs/src/index.md b/docs/src/index.md
@@ -1,18 +1,22 @@
 # EqualitySampler
 
-[*EqualitySampler*](https://github.com/vandenman/EqualitySampler.jl) defines four distributions over [partitions of a set](https://en.wikipedia.org/wiki/Partition_of_a_set):
+[*EqualitySampler*](https://github.com/vandenman/EqualitySampler.jl) defines distributions over [partitions of a set](https://en.wikipedia.org/wiki/Partition_of_a_set):
 - [`UniformPartitionDistribution`](@ref)
 - [`BetaBinomialPartitionDistribution`](@ref)
 - [`CustomInclusionPartitionDistribution`](@ref)
-- [`RandomProcessPartitionDistribution`](@ref)
+- [`DirichletProcessPartitionDistribution`](@ref)
+- [`PitmanYorProcessPartitionDistribution`](@ref)
 
 These distributions can be as prior distributions over equality constraints among a set of variables.
 
 ## Type Hierarchy
 
-Each of the four distributions is a subtype of `AbstractPartitionDistribution` which is a subtype of [`Distributions.DiscreteMultivariateDistribution`](https://juliastats.org/Distributions.jl/stable/multivariate/#multivariates).
+Each of the distributions is a subtype of `AbstractPartitionDistribution` which is a subtype of [`Distributions.DiscreteMultivariateDistribution`](https://juliastats.org/Distributions.jl/stable/multivariate/#multivariates).
 
 Thus, each of these types can be called with e.g., `rand` and `logpdf`.
+There are multiple abstract types, in particular there is `AbstractSizePartitionDistribution` which is a subtype of `AbstractPartitionDistribution` and is used to represent distributions over partitions where the logpdf of a partition is entirely determined by the size of partition (e.g., ``\{\{\theta_1, \theta_2\}, \{\theta_3\}\}`` has size 2).
+When using subtypes of this distribution, it may be convenient to use `PrecomputedCustomInclusionPartitionDistribution` which caches all the computations.
+There is also `AbstractProcessPartitionDistribution` for distributions that are based on a stochastic process, such as the Pitman-Yor process and the Dirichlet Process.
 
 ## Representation of Partitions
 
@@ -27,8 +31,5 @@ For example, given 3 parameters ``(\theta_1, \theta_2, \theta_3)`` there exist 5
 | ``\{\{\theta_1\}, \{\theta_2, \theta_3\}\}``     | ``\theta_1 \neq \theta_2 = \theta_3``     | `[1, 2, 2]`             |
 | ``\{\{\theta_1\}, \{\theta_2\}, \{\theta_3\}\}`` | ``\theta_1 \neq \theta_2 \neq \theta_3``  | `[1, 2, 3]`             |
 
-However, we also consider `[2, 2, 2]` and `[3, 3, 3]` to be valid and identical to `[1, 1, 1]`.
-The main reason for this is that it has pragmatic benefits when doing Gibbs sampling.
-For example, consider that the current partition is `[1, 2, 2]` and the sampler proposes an update for the first element.
-A natural proposal would be `[1, 1, 1]`, however, without duplicated partitions this would be impossible (as `[2, 2, 2]` would not exist).
-The default [`logpdf`](@ref) accounts for duplicated partitions, use [`logpdf_model_distinct`](@ref) to evaluate the logpdf without duplicated partitions.
+We do not consider `[2, 2, 2]` and `[3, 3, 3]` to be valid partitions identical to `[1, 1, 1]`.
+When this is desired, one can wrap a partition distribution using `DuplicatedPartitionDistribution` to allow duplicated representation for a partition.
diff --git a/docs/src/tests.md b/docs/src/tests.md
@@ -9,3 +9,19 @@ proportion_test
 ```@docs
 anova_test
 ```
+
+## Sampling Methods
+```@docs
+Enumerate
+EnumerateThenSample
+SampleIntegrated
+SampleRJMCMC
+```
+
+## Extractor Functions
+```@docs
+compute_post_prob_eq
+compute_model_probs
+get_mpm_partition
+get_hpm_partition
+```