ABforge is an A/B testing library designed for a variety of use cases. It takes a Bayesian approach and implements binary, Poisson, normal, delta-lognormal and discrete tests on different metrics. It's especially useful for analysing key marketplace metrics, such as conversion rates, ticket size and ARPU differences between test variants.
The main idea of this library is to create an all-in-one tool for running A/B tests separately from closed-source code and business logic.
Clone & navigate:
git clone git@github.com:avrtt/ABforge.git
cd ABforge
Using a virtual environment is optional, but recommended. If you'd like to create one, run:
python3 -m venv venv
Then activate the virtual environment:
- On Linux/macOS:
source venv/bin/activate
- On Windows:
venv\Scripts\activate
⚠️ This project hasn't been tested on Windows.
Executing the Makefile will install the required dependencies and the library itself:
make
Once installed, you can import the abforge library in your Python code with a simple import abforge.
Besides importing directly, you can also interact with some parts of the library engine through a Streamlit-based web interface. To run it, simply execute:
streamlit run Home.py
Below you can find all the implemented methods of the library.
This engine measures statistics for 3 very important variables at once:
- conversion rate (e.g., percentage of visits that turn into sales)
- monetary value for conversions (e.g., revenue per transaction)
- average value per impression (e.g., Average Revenue per User)
Sometimes conversion rate isn't the best metric for your test; what matters most is whether you're bringing more money to the table. That's where ARPU helps a lot. Revenue per conversion also helps you understand how your ticket size differs between variants.
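As a purely illustrative sketch (not part of the library API), here is how the three quantities relate for a single variant, assuming per-visit revenue data where 0 means the visit didn't convert:
import numpy as np
# hypothetical per-visit revenue for one variant; 0 means no conversion
revenue = np.array([0.0, 12.5, 0.0, 0.0, 7.9, 0.0, 23.1, 0.0, 0.0, 5.4])
conversion_rate = np.mean(revenue > 0)                 # share of visits that convert
revenue_per_conversion = revenue[revenue > 0].mean()   # average ticket size
arpu = revenue.mean()                                  # average revenue per visit (user)
print(conversion_rate, revenue_per_conversion, arpu)   # 0.4, 12.225, 4.89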
In the frequentist approach:
- the p-value is difficult to understand and has no business value;
- we can't draw an informative conclusion from an insignificant test;
- the test isn't valid without fixing the sample size in advance;
- a bigger sample size is required.
In the Bayesian approach, by contrast:
- results have clear business value and are easy to understand;
- we always get valid results and can make at least an informed decision;
- fixing the sample size in advance isn't required;
- a smaller sample size is sufficient.
There are also five classes for experiments:
BinaryDataTest
PoissonDataTest
NormalDataTest
DeltaLognormalDataTest
DiscreteDataTest
For each class, there are two methods for inserting data:
- add_variant_data - add raw data for a variant as a list of observations (or a numpy 1-D array)
- add_variant_data_agg - add aggregated variant data (this can be practical for a larger data set, as the aggregation can be done outside the package)
Both methods for adding data allow the user to specify a prior distribution (see details in respective docstrings). The default priors are non-informative priors and should be sufficient for most use cases, and in particular when the number of samples or observations is large.
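As a quick sketch using BinaryDataTest (described in detail below), the two methods look like this; the counts in the aggregated call are made up for illustration:
from abforge.experiments import BinaryDataTest
test = BinaryDataTest()
# raw observations (zeros and ones) with the default non-informative priors:
test.add_variant_data("A", [0, 1, 0, 0, 1, 0, 0, 0, 1, 0])
# the same kind of information passed as pre-aggregated counts:
test.add_variant_data_agg("B", total=1000, positives=52)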
To get the results of the test, simply call the evaluate method; to access the evaluation metrics as well as the simulated random samples, call the data instance variable.
Chance to beat all and expected loss are approximated using Monte Carlo simulation, so evaluate may return slightly different values for different runs. To decrease variation, you can set the sim_count parameter of evaluate to a higher value (the default is 200K); to fix values, set the seed parameter.
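For example, assuming a test object has already been populated with variants (as in the examples below), a lower-variance, reproducible evaluation could look like this:
# raise the simulation count above the 200K default and fix the seed so that
# repeated calls return identical metric values
test.evaluate(sim_count=500_000, seed=314)
# evaluation metrics and the simulated posterior samples are then available
# via the data instance variable
data = test.data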
Evaluation metrics are calculated using Monte Carlo simulations from posterior distributions.
- Chance to beat all - probability of beating all other variants;
- Expected loss - risk associated with choosing a given variant over the other variants, measured in the same units as the tested measure (e.g. positive rate or average value);
- Uplift vs. 'A' - relative uplift of a given variant compared to the first variant added;
- 95% HDI - the central interval containing 95% of the probability. The Bayesian approach allows us to say that, 95% of the time, the 95% HDI will contain the true value.
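To make these definitions concrete, here is a small, self-contained sketch of how such quantities can be computed from simulated posterior samples; the Beta posteriors below are made up for illustration, and this is not the library's internal implementation:
import numpy as np
rng = np.random.default_rng(0)
# hypothetical posterior samples of the tested measure for three variants
samples = {
    "A": rng.beta(81, 1421, size=200_000),
    "B": rng.beta(81, 1121, size=200_000),
    "C": rng.beta(51, 951, size=200_000),
}
names = list(samples)
stacked = np.column_stack([samples[n] for n in names])
best = stacked.max(axis=1)
for i, name in enumerate(names):
    chance_to_beat_all = np.mean(stacked[:, i] == best)  # P(variant is the best)
    expected_loss = np.mean(best - stacked[:, i])         # average shortfall vs. the best draw
    uplift_vs_a = stacked[:, i].mean() / stacked[:, 0].mean() - 1
    hdi = np.percentile(stacked[:, i], [2.5, 97.5])       # central 95% interval
    print(f"{name}: {chance_to_beat_all:.2%}, {expected_loss:.5f}, {uplift_vs_a:+.2%}, {hdi}")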
For tests between two variants with binary, Poisson, and normal data, abforge can additionally provide a continuation recommendation, that is, a recommendation as to which variant to select, or whether to continue testing. See the docstrings and examples for usage guidelines.
The decision method makes use of the following concepts:
- Region of Practical Equivalence (ROPE) - a region [-t, t] of the distribution of differences B - A which is practically equivalent to no uplift. E.g., you may be indifferent between an uplift of +/- 0.1% and no change, in which case the ROPE would be [-0.1, 0.1];
- 95% HDI - the central interval containing 95% of the probability for the distribution of differences B - A.
The recommendation output has three elements (summarized in the sketch after this list):
- Decision
  - Select either variant if the ROPE is fully contained within the 95% HDI
  - Select the better variant if the ROPE and the 95% HDI do not overlap
  - Continue testing if the ROPE partially overlaps the 95% HDI
  - Note: there are high-confidence and low-confidence variations of the first two messages
- Confidence
  - High if the width of the 95% HDI is less than or equal to 0.8*rope
  - Low if the width of the 95% HDI is greater than 0.8*rope
- Bounds
  - The 95% HDI
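The rules above can be summarized in a small sketch; this is only an illustration of the logic described, not the library's actual implementation, and it assumes rope is the half-width t:
def recommend(hdi_low: float, hdi_high: float, rope: float) -> tuple[str, str]:
    """Illustrative continuation decision for the distribution of differences B - A."""
    rope_low, rope_high = -rope, rope
    if hdi_low <= rope_low and rope_high <= hdi_high:
        decision = "Stop and implement either variant."      # ROPE fully inside the HDI
    elif hdi_high < rope_low or rope_high < hdi_low:
        decision = "Stop and implement the better variant."  # ROPE and HDI do not overlap
    else:
        decision = "Continue testing."                        # partial overlap
    confidence = "High" if (hdi_high - hdi_low) <= 0.8 * rope else "Low"
    return decision, confidence

# e.g., with the bounds from the binary example below and a hypothetical rope of 0.1%:
print(recommend(-0.0084, 0.0285, rope=0.001))
# ('Stop and implement either variant.', 'Low')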
For smaller binary and Poisson samples, metrics calculated from Monte Carlo simulation can be checked against the closed-form solutions by passing closed_form=True to the evaluate() method. Larger samples generate warnings, and samples larger than a predetermined threshold will raise an error. The larger the sample, however, the closer the simulated value will be to the true value, so closed-form comparisons are recommended for validating metrics on smaller samples only.
Binary tests with small sample sizes will raise a warning when the error for the expected loss estimate surpasses a set tolerance. To reduce the error, increase the simulation count. For more detail, see the docstring for expected_loss_accuracy_bernoulli in evaluation.py.
- Input data: binary ([0, 1, 0, ...])
- Designed for binary data, such as conversions
Class for Bayesian A/B testing of binary-like data (e.g. conversions, successes, etc.).
Example:
import numpy as np
from abforge.experiments import BinaryDataTest
# generating some random data
rng = np.random.default_rng(52)
# random 1x1500 array of 0/1 data with 5.2% probability for 1:
data_a = rng.binomial(n=1, p=0.052, size=1500)
# random 1x1200 array of 0/1 data with 6.7% probability for 1:
data_b = rng.binomial(n=1, p=0.067, size=1200)
# initialize a test:
test = BinaryDataTest()
# add variant using raw data (arrays of zeros and ones) and specifying priors:
test.add_variant_data("A", data_a, a_prior=10, b_prior=17)
test.add_variant_data("B", data_b, a_prior=5, b_prior=30)
# the default priors are a=b=1
# test.add_variant_data("C", data_c)
# add variant using aggregated data:
test.add_variant_data_agg("C", total=1000, positives=50)
# evaluate the test:
test.evaluate(seed=314)
# access simulation samples and evaluation metrics
data = test.data
# generate plots
test.plot_distributions(control='A', fname='binary_distributions_example.png')
+---------+--------+-----------+-------------+----------------+--------------------+---------------+----------------+----------------+
| Variant | Totals | Positives | Sample rate | Posterior rate | Chance to beat all | Expected loss | Uplift vs. "A" | 95% HDI |
+---------+--------+-----------+-------------+----------------+--------------------+---------------+----------------+----------------+
| B | 1200 | 80 | 6.67% | 6.88% | 83.82% | 0.08% | 16.78% | [5.74%, 8.11%] |
| C | 1000 | 50 | 5.00% | 5.09% | 2.54% | 1.87% | -13.64% | [4.00%, 6.28%] |
| A | 1500 | 80 | 5.33% | 5.89% | 13.64% | 1.07% | 0.00% | [4.94%, 6.92%] |
+---------+--------+-----------+-------------+----------------+--------------------+---------------+----------------+----------------+
For smaller samples, such as the above, it is also possible to check the modeled chance to beat all against the closed-form equivalent by passing closed_form=True:
test.evaluate(closed_form=True, seed=314)
+---------+-------------------------+--------------------------+--------+
| Variant | Est. chance to beat all | Exact chance to beat all | Delta |
+---------+-------------------------+--------------------------+--------+
| B | 83.82% | 83.58% | 0.28% |
| C | 2.54% | 2.56% | -0.66% |
| A | 13.64% | 13.86% | -1.59% |
+---------+-------------------------+--------------------------+--------+
Removing variant 'C' (this feature is currently implemented for two variants only) and passing a value to control additionally returns a test-continuation recommendation:
test.delete_variant("C")
test.evaluate(control='A', seed=314)
Decision: Stop and implement either variant. Confidence: Low. Bounds: [-0.84%, 2.85%].
Finally, we can plot the prior and posterior distributions, as well as the distribution of differences.
- Input data: integer counts
- Designed for count data (e.g., number of sales per salesman, deaths per zip code)
Class for Bayesian A/B testing of count data. This can be used to compare, e.g., the number of sales per day from different salesmen, or the number of deaths from a given disease per zip code.
Example:
import numpy as np
from abforge.experiments import PoissonDataTest
# generating some random data
rng = np.random.default_rng(21)
data_a = rng.poisson(43, size=20)
data_b = rng.poisson(39, size=25)
data_c = rng.poisson(37, size=15)
# initialize a test:
test = PoissonDataTest()
# add variant using raw data (arrays of integer counts) and specifying priors:
test.add_variant_data("A", data_a, a_prior=30, b_prior=7)
test.add_variant_data("B", data_b, a_prior=5, b_prior=5)
# test.add_variant_data("C", data_c)
# add variant using aggregated data:
test.add_variant_data_agg("C", total=len(data_c), obs_mean=np.mean(data_c), obs_sum=sum(data_c))
# evaluate the test:
test.evaluate(seed=314)
# access simulation samples and evaluation metrics
data = test.data
# generate plots
test.plot_distributions(control='A', fname='poisson_distributions_example.png')
+---------+--------------+-------------+----------------+--------------------+---------------+----------------+--------------+
| Variant | Observations | Sample mean | Posterior mean | Chance to beat all | Expected loss | Uplift vs. "A" | 95% HDI |
+---------+--------------+-------------+----------------+--------------------+---------------+----------------+--------------+
| C | 15 | 38.6 | 36.2 | 74.06% | 0.28 | 4.01% | [33.8, 38.8] |
| B | 25 | 40.4 | 33.9 | 5.09% | 2.66 | -2.83% | [32.1, 35.6] |
| A | 20 | 45.6 | 34.9 | 20.85% | 1.68 | 0.00% | [33.0, 36.7] |
+---------+--------------+-------------+----------------+--------------------+---------------+----------------+--------------+
For samples smaller than the above, it is also possible to check the modeled chance to beat all against the closed-form equivalent by passing closed_form=True:
test.evaluate(closed_form=True, seed=314)
+---------+-------------------------+--------------------------+--------+
| Variant | Est. chance to beat all | Exact chance to beat all | Delta |
+---------+-------------------------+--------------------------+--------+
| C | 74.06% | 73.91% | 0.20% |
| B | 5.09% | 5.24% | -2.84% |
| A | 20.85% | 20.85% | -0.01% |
+---------+-------------------------+--------------------------+--------+
Removing variant 'C' (this feature is currently implemented for two variants only) and passing control and rope additionally returns a test-continuation recommendation:
test.delete_variant("C")
test.evaluate(control='A', rope=0.5, seed=314)
Decision: Stop and implement either variant. Confidence: Low. Bounds: [-4.0, 2.1].
Finally, we can plot the posterior distributions as well as the distribution of differences (returning now to the original number of observations rather than the smaller sample used to show the closed-form validation).
- Input data: normal data with unknown variance
- Designed for normal data
Class for Bayesian A/B testing of normal data.
Example:
import numpy as np
from abforge.experiments import NormalDataTest
# generating some random data
rng = np.random.default_rng(314)
data_a = rng.normal(6.9, 2, 500)
data_b = rng.normal(6.89, 2, 800)
data_c = rng.normal(7.0, 4, 500)
# initialize a test:
test = NormalDataTest()
# add variant using raw data:
test.add_variant_data("A", data_a)
test.add_variant_data("B", data_b, m_prior=5, n_prior=11, v_prior=10, s_2_prior=4)
# test.add_variant_data("C", data_c)
# add variant using aggregated data:
test.add_variant_data_agg("C", len(data_c), sum(data_c), sum((data_c - np.mean(data_c)) ** 2), sum(np.square(data_c)))
# evaluate the test:
test.evaluate(sim_count=200000, seed=314)
# access simulation samples and evaluation metrics
data = test.data
# generate plots
test.plot_joint_prior(variant='B', fname='normal_prior_distribution_B_example.png')
test.plot_distributions(control='A', fname='normal_distributions_example.png')
+---------+--------------+-------------+----------------+-----------+-----------+--------------------+---------------+----------------+----------------+-----------------+
| Variant | Observations | Sample mean | Posterior mean | Precision | Std. dev. | Chance to beat all | Expected loss | Uplift vs. "A" | 95% HDI (mean) | 95% HDI (stdev) |
+---------+--------------+-------------+----------------+-----------+-----------+--------------------+---------------+----------------+----------------+-----------------+
| A | 500 | 6.89 | 6.89 | 0.257 | 1.97 | 91.31% | 0.0 | 0.00% | [6.72, 7.07] | [1.86, 2.10] |
| B | 800 | 6.91 | 6.89 | 0.258 | 1.97 | 8.69% | 0.01 | -0.09% | [6.75, 7.02] | [1.88, 2.07] |
| C | 500 | 6.75 | 6.75 | 0.065 | 3.91 | 0.00% | 0.14 | -2.01% | [6.41, 7.10] | [3.68, 4.17] |
+---------+--------------+-------------+----------------+-----------+-----------+--------------------+---------------+----------------+----------------+-----------------+
We can also plot the joint prior distribution for a chosen variant, as done above for variant 'B'.
- Input data: lognormal data with zeros
- Designed for lognormal data, such as revenue per conversion
Class for Bayesian A/B testing of delta-lognormal data (log-normal with zeros). Delta-lognormal data is the typical case of revenue-per-session data, where many sessions have zero revenue but the non-zero values are positive and may follow a log-normal distribution. To handle this data, the calculation combines a binary Bayes model for zero vs. non-zero "conversions" with a log-normal model for the non-zero values.
Example:
import numpy as np
from abforge.experiments import DeltaLognormalDataTest
test = DeltaLognormalDataTest()
data_a = [7.1, 0.3, 5.9, 0, 1.3, 0.3, 0, 1.2, 0, 3.6, 0, 1.5, 2.2, 0, 4.9, 0, 0, 1.1, 0, 0, 7.1, 0, 6.9, 0]
data_b = [4.0, 0, 3.3, 19.3, 18.5, 0, 0, 0, 12.9, 0, 0, 0, 10.2, 0, 0, 23.1, 0, 3.7, 0, 0, 11.3, 10.0, 0, 18.3, 12.1]
# adding variant using raw data:
test.add_variant_data("A", data_a)
# test.add_variant_data("B", data_b)
# alternatively, a variant can also be added using aggregated data:
# (it looks more complicated, but for large data sets it can be quite handy to pass around only these sums)
test.add_variant_data_agg(
name="B",
total=len(data_b),
positives=sum(x > 0 for x in data_b),
sum_values=sum(data_b),
sum_logs=sum([np.log(x) for x in data_b if x > 0]),
sum_logs_2=sum([np.square(np.log(x)) for x in data_b if x > 0])
)
# evaluate the test:
test.evaluate(seed=21)
# access simulation samples and evaluation metrics
data = test.data
[{'variant': 'A',
'totals': 24,
'positives': 13,
'sum_values': 43.4,
'avg_values': 1.80833,
'avg_positive_values': 3.33846,
'prob_being_best': 0.04815,
'expected_loss': 4.0941101},
{'variant': 'B',
'totals': 25,
'positives': 12,
'sum_values': 146.7,
'avg_values': 5.868,
'avg_positive_values': 12.225,
'prob_being_best': 0.95185,
'expected_loss': 0.1588627}]
- Input data: categorical data with numerical categories
- Designed for discrete data (e.g. dice rolls, star ratings, 1-10 ratings)
Class for Bayesian A/B testing of discrete data having a finite number of numerical categories (states). This test can be used, e.g., to find the biases of different dice and decide which of them is the "best", or to compare rating data (e.g. 1-5 stars or a 1-10 scale).
Example:
from abforge.experiments import DiscreteDataTest
# dice rolls data for 3 dice - A, B, C
data_a = [2, 5, 1, 4, 6, 2, 2, 6, 3, 2, 6, 3, 4, 6, 3, 1, 6, 3, 5, 6]
data_b = [1, 2, 2, 2, 2, 3, 2, 3, 4, 2]
data_c = [1, 3, 6, 5, 4]
# initialize a test with all possible states (i.e. numerical categories):
test = DiscreteDataTest(states=[1, 2, 3, 4, 5, 6])
# add variant using raw data:
test.add_variant_data("A", data_a)
test.add_variant_data("B", data_b)
test.add_variant_data("C", data_c)
# add variant using aggregated data:
# test.add_variant_data_agg("C", [1, 0, 1, 1, 1, 1]) # equivalent to the rolls in data_c
# evaluate the test:
test.evaluate(sim_count=200000, seed=52)
# access simulation samples and evaluation metrics
data = test.data
+---------+------------------------------------+-------------+----------------+------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------+--------------------+---------------+----------------+----------------+
| Variant | Concentrations | Sample mean | Posterior mean | Relative prob. | 95% HDI (relative prob.) | Chance to beat all | Expected loss | Uplift vs. "A" | 95% HDI (mean) |
+---------+------------------------------------+-------------+----------------+------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------+--------------------+---------------+----------------+----------------+
| A | 1: 2, 2: 4, 3: 4, 4: 2, 5: 2, 6: 6 | 3.8 | 3.73 | 1: 11.54%, 2: 19.23%, 3: 19.23%, 4: 11.54%, 5: 11.54%, 6: 26.92% | 1: [2.55%, 26.02%], 2: [6.82%, 36.06%], 3: [6.85%, 36.12%], 4: [2.54%, 25.96%], 5: [2.59%, 26.09%], 6: [12.10%, 45.17%] | 55.21% | 19.71% | 0.00% | [3.07, 4.40] |
| C | 1: 1, 2: 0, 3: 1, 4: 1, 5: 1, 6: 1 | 3.8 | 3.64 | 1: 18.18%, 2: 9.09%, 3: 18.18%, 4: 18.18%, 5: 18.18%, 6: 18.18% | 1: [2.50%, 44.45%], 2: [0.26%, 30.78%], 3: [2.51%, 44.54%], 4: [2.47%, 44.48%], 5: [2.53%, 44.57%], 6: [2.52%, 44.54%] | 44.02% | 29.09% | -2.53% | [2.64, 4.58] |
| B | 1: 1, 2: 6, 3: 2, 4: 1, 5: 0, 6: 0 | 2.3 | 2.75 | 1: 12.50%, 2: 43.75%, 3: 18.75%, 4: 12.50%, 5: 6.25%, 6: 6.25% | 1: [1.66%, 31.97%], 2: [21.33%, 67.67%], 3: [4.31%, 40.47%], 4: [1.65%, 31.96%], 5: [0.17%, 21.78%], 6: [0.17%, 21.84%] | 0.78% | 117.81% | -26.29% | [2.18, 3.45] |
+---------+------------------------------------+-------------+----------------+------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------+--------------------+---------------+----------------+----------------+
Finally, we can plot the posterior distribution for each state for each variant.
- Documentation page
- Unit tests (methods)
- Integration tests (Linux)
- Integration tests (Windows)
- Python 3.10 > 3.12 migration (upd: fix numpy==1.26.4)
- Remove setup.py
- Add usage example(s)
- Build Streamlit app
- Add references
- Add images
- Add logo (Streamlit page & README)
- Merge with test/metrics classes (not a part of the base engine yet)
- Create toy dfs (tests/make_data.py)
- Add test continuation assessment (decision, confidence, bounds) to DeltaLognormalDataTest
- Implement sample size/reverse posterior calculation
- Implement Markov Chain Monte Carlo instead of Monte Carlo
- Add new tests and metrics to Streamlit app
Feel free to open PRs and issues.
Distributed under the MIT License. See LICENSE.txt for more information.