This repository contains the source code of the paper "Efficient Exploration of the Rashomon Set of Rule Set Models" (KDD 2024)
The Rashomon set of an ML problem refers to the set of models of near-optimal predictive performance.
Why study it? Models with similar predictive performance may exhibit drastically different properties (such as fairness), so a single model does not adequately represent reality.
An example showcasing the Rashomon set of rule set models for the COMPAS dataset.
- Each rule set is plotted as a point, whose position is determined by the statistical parity (`SP`) of the rule set on race and gender (on the X and Y axes, respectively).
- Statistical parity quantifies the fairness of classification models.
- You can see that the two highlighted models have very different `SP[race]` scores, though their accuracy scores are close.
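For intuition, statistical parity can be computed from a model's predictions and a binary protected attribute. A minimal standalone sketch (toy data, not this repo's implementation):

```python
import numpy as np

def statistical_parity(y_pred, group):
    """Absolute difference in positive-prediction rates between the two groups."""
    y_pred = np.asarray(y_pred, dtype=bool)
    group = np.asarray(group, dtype=bool)
    return abs(y_pred[group].mean() - y_pred[~group].mean())

# toy predictions for 8 individuals; `group` marks the protected attribute
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
group = [1, 1, 1, 1, 0, 0, 0, 0]
print(statistical_parity(y_pred, group))  # |2/4 - 1/4| = 0.25
```

A score of 0 means both groups receive positive predictions at the same rate; larger values indicate a larger disparity.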
- We designed efficient algorithms to explore the Rashomon set of rule-set models for binary classification problems.
- Our focus is on rule set models, due to their inherent interpretability 💡.
- We investigated two exploration modes -- counting and uniform sampling from the Rashomon set.
- Instead of tackling exact counting and uniform sampling, we study their approximate versions, which drastically reduces the search space.
- For both problems, we devised theoretically-sound algorithms sped up by effective pruning bounds, together with an efficient implementation powered by Numba and Ray.
- Compared to off-the-shelf tools (such as Google OR-tools), our implementation is often >1000x faster ⚡⚡⚡
The source code is tested against Python 3.8 on macOS 14.2.1.
pip install -r requirements.txt
Verify that unit tests pass
pytest tests
We illustrate the usage of the approximate counter and the almost-uniform sampler on synthetic data.
Set up a Ray cluster for parallel computing, e.g.,
import ray
ray.init()
from bds.rule_utils import generate_random_rules_and_y
from bds.meel import approx_mc2
ub = 0.9 # upper bound on the rule set objective function
lmbd = 0.1 # complexity penalty term
eps = 0.8 # error parameter related to estimation accuracy
delta = 0.8 # the estimation confidence parameter
num_pts, num_rules = 100, 10
# generate the input data
random_rules, random_y = generate_random_rules_and_y(
num_pts, num_rules, rand_seed=42
)
# get an approximate estimation of the number of good rule set models
estimated_count = approx_mc2(
random_rules,
random_y,
lmbd=lmbd,
ub=ub,
delta=delta,
eps=eps,
rand_seed=42,
parallel=True, # run in parallel
)
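For intuition on what `ub` and `lmbd` bound: a rule set's objective combines its empirical loss with a per-rule complexity penalty `lmbd`, and the Rashomon set collects all rule sets whose objective stays below `ub`. A minimal self-contained sketch of this membership test (toy data, not the repo's implementation):

```python
import numpy as np

def objective(covered, y, lmbd):
    """Objective of a rule set: 0/1 loss of its predictions plus
    a complexity penalty of lmbd per rule."""
    y_pred = covered.any(axis=0)           # a point is positive if any rule covers it
    loss = np.mean(y_pred != y)            # empirical 0/1 loss
    return loss + lmbd * covered.shape[0]  # penalize the number of rules

y = np.array([1, 1, 0, 0, 1, 0], dtype=bool)
# coverage matrix: row i marks the points covered by rule i
rules = np.array([[1, 1, 0, 0, 0, 0],
                  [0, 0, 0, 0, 1, 1]], dtype=bool)

obj = objective(rules, y, lmbd=0.1)
# Rashomon-set membership: keep the rule set iff its objective <= ub
print(obj, obj <= 0.9)  # 1/6 + 0.1 * 2 ≈ 0.367, True
```

The counter estimates how many subsets of the candidate rules pass this test, without enumerating them all.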
from bds.rule_utils import generate_random_rules_and_y
from bds.meel import UniGen
num_pts, num_rules = 100, 10
random_rules, random_y = generate_random_rules_and_y(
num_pts, num_rules, rand_seed=42
)
ub = 0.9
eps = 8 # epsilon parameter that controls the closeness between the sampled distribution and uniform distribution
lmbd = 0.1 # complexity penalty term
sampler = UniGen(random_rules, random_y, lmbd, ub, eps, rand_seed=42)
sampler.prepare() # collect necessary statistics required for sampling
# sample 10 rule sets almost uniformly from the Rashomon set
samples = sampler.sample(10, exclude_none=True)
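"Almost uniform" means each Rashomon-set member is sampled with probability within a multiplicative factor of exact uniform, controlled by `eps`. A tiny arithmetic illustration (assuming the UniGen-style guarantee form):

```python
# If the Rashomon set has N members, exact uniform sampling gives p = 1/N.
# An almost-uniform sampler with parameter eps guarantees, for every member m:
#   1 / ((1 + eps) * N) <= Pr[m is sampled] <= (1 + eps) / N
N, eps = 1000, 8
lower = 1 / ((1 + eps) * N)
upper = (1 + eps) / N
print(lower, upper)  # each member's sampling probability lies in this band
```

Smaller `eps` tightens the band around 1/N but makes sampling more expensive.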
When working with real-world datasets, the first step is often to extract a list of candidate rules.
For this purpose, you may rely on `extract_rules_with_min_support` to extract a list of rules with support above a given threshold.
import pandas as pd
from bds.candidate_generation import extract_rules_with_min_support
dataset = "compas"
data = pd.read_csv('data/compas_train-binary.csv') # the features are binary
X = data.to_numpy()[:,:-2] # extract the feature matrix
attribute_names = list(data.columns[:-2])
candidate_rules = extract_rules_with_min_support(X, attribute_names, min_support=70)
# then you may apply the sampler or count estimator on the candidate rules
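A rule's support is the number of training points it covers, and min-support filtering discards rules below the threshold. A standalone sketch of the idea (not the repo's `extract_rules_with_min_support`), using single-attribute rules over a binary feature matrix:

```python
import numpy as np

def rules_with_min_support(X, attribute_names, min_support):
    """Keep single-attribute rules 'attr == 1' covering >= min_support points."""
    support = X.sum(axis=0)  # column sums = number of points each rule covers
    return [name for name, s in zip(attribute_names, support) if s >= min_support]

X = np.array([[1, 0, 1],
              [1, 0, 0],
              [1, 1, 0],
              [0, 1, 0]])
print(rules_with_min_support(X, ["a", "b", "c"], min_support=2))  # ['a', 'b']
```

The threshold trades off candidate-pool size against the risk of keeping rules that fit only a handful of points.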
- Han Xiao: xiaohan2012@gmail.com
- Martino Ciaperoni: martino.ciaperoni@aalto.fi
If you find this work useful, please consider citing it.
Bibtex entry
@inproceedings{ciaperoni2024efficient,
title={Efficient Exploration of the Rashomon Set of Rule-Set Models},
author={Ciaperoni, Martino and Xiao, Han and Gionis, Aristides},
booktitle={Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
pages={478--489},
year={2024}
}
- rename package to `ers`
- packaging
- maybe add a logo?