Exploring the impact of data imbalance on ε-Differential Privacy
Differential Privacy (DP) has emerged as a rigorous mathematical framework for privacy-preserving data mining. While the effects of imbalanced data have been extensively studied in classical Machine Learning (ML), their implications for DP remain largely unexplored. This study examines the impact of class imbalance on three differentially private ML algorithms: Gaussian Naive Bayes (NB), Logistic Regression (LR), and Random Forest (RF). It evaluates the performance and privacy sensitivity of these learners at levels of imbalance ranging from 0.6% to 20%. The findings reveal that while DP-LR achieves the best performance, it does so at a privacy budget so large as to provide no meaningful protection; DP-RF is less sensitive to privacy but is the most affected by data imbalance; and DP-NB allows for the highest levels of privacy, paired with very good performance and low computational overhead. This research concludes that, of the approaches studied, DP-NB offers the best combination of privacy, performance, and computational efficiency when trained on highly imbalanced data.
This project explores the impact of data imbalance on ε-Differential Privacy in machine learning algorithms using the diffprivlib library developed by IBM [1]. It includes a set of Jupyter notebooks that demonstrate the findings and methods used in the study.
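To make the experimental setup more concrete, the following is a minimal sketch of how a dataset might be reduced to a chosen level of imbalance and then used to fit a differentially private Gaussian Naive Bayes model. It is not taken from the notebooks: `subsample_minority` and `fit_dp_naive_bayes` are hypothetical helper names, and the subsampling strategy (keep every majority row, drop minority rows at random) is an assumption. The only diffprivlib call used is `diffprivlib.models.GaussianNB` with its `epsilon` and `bounds` parameters.

```python
import numpy as np


def subsample_minority(X, y, minority_fraction, minority_label=1, seed=0):
    """Return (X, y) with the minority class reduced to roughly
    `minority_fraction` of all rows. Hypothetical helper, not from the notebooks."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)

    # Solve n_min / (n_min + n_maj) = f for n_min, keeping every majority row.
    n_min = int(round(minority_fraction * len(majority_idx) / (1.0 - minority_fraction)))
    n_min = max(1, min(n_min, len(minority_idx)))

    chosen = rng.choice(minority_idx, size=n_min, replace=False)
    keep = np.concatenate([majority_idx, chosen])
    rng.shuffle(keep)  # avoid leaving the rows grouped by class
    return X[keep], y[keep]


def fit_dp_naive_bayes(X_train, y_train, epsilon, bounds):
    """Fit diffprivlib's Gaussian Naive Bayes with privacy budget `epsilon`.

    `bounds` is a (lower, upper) pair describing the feature ranges.
    """
    # Imported here so the subsampling helper can be used without diffprivlib.
    from diffprivlib.models import GaussianNB

    model = GaussianNB(epsilon=epsilon, bounds=bounds)
    return model.fit(X_train, y_train)
```

A call such as `subsample_minority(X, y, 0.006)` would correspond to the most severe imbalance level mentioned above. The `bounds` passed to the model should come from domain knowledge rather than from the data itself; otherwise diffprivlib raises a `PrivacyLeakWarning`.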
This study revealed the impact of data imbalance on three differentially private ML algorithms commonly used in binary classification. The results indicate that the level of imbalance affects both the performance of an algorithm and its sensitivity to the privacy mechanism of DP, but that the extent of this effect varies with the algorithm involved. The severity of the imbalance significantly impacted the performance of both the non-private RF (
The notebooks were developed and tested in the following environment:
- Jupyter® Notebook running on a Python 3 kernel
The notebooks require the following packages to be installed:
- Python® == 3.11.9
- pandas == 2.2.1
- numpy == 1.26.4
- matplotlib == 3.8.4
- joblib == 1.4.2
- scikit-learn (sklearn) == 1.4.2
- diffprivlib == 0.6.4
- lilac-arff == 2.5.0
- scipy == 1.11.4
- tqdm == 4.66.4
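Once the environment is set up, the installed versions can be compared against the pins above. The snippet below is a small sketch using only the standard library's `importlib.metadata`; `PINNED` and `check_versions` are illustrative names, and the dictionary lists only some of the packages, so extend it as needed.

```python
from importlib import metadata

# Subset of the pinned versions listed above (illustrative; extend as needed).
PINNED = {
    "pandas": "2.2.1",
    "numpy": "1.26.4",
    "matplotlib": "3.8.4",
    "joblib": "1.4.2",
    "scikit-learn": "1.4.2",
    "diffprivlib": "0.6.4",
    "scipy": "1.11.4",
    "tqdm": "4.66.4",
}


def check_versions(pinned, get_version=metadata.version):
    """Map each package name to (wanted, found, matches).

    `found` is None when the package is not installed.
    """
    report = {}
    for name, wanted in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        report[name] = (wanted, found, found == wanted)
    return report


if __name__ == "__main__":
    for name, (wanted, found, ok) in check_versions(PINNED).items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name}: wanted {wanted}, found {found} [{status}]")
```

Run it from a terminal opened in the new environment, or paste it into a notebook cell.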
To set up the required environment, follow these steps:
- Create a new execution environment:
  - In Anaconda Navigator, select Environments from the main menu bar on the left-hand side.
  - Then select Create from the Environments sub-menu at the bottom.
  - This will launch a Create new environment pop-up, which will require you to name the environment (e.g. IRP) and select 3.11.9 from the Python dropdown menu.
  - Hit Create.
- Install the public packages using Anaconda:
  - Select the newly created environment and ensure the dropdown box next to Channels is set to Not installed.
  - Scroll down (or use the search field) to locate the row for pandas version 2.2.1. Click on the ... to the left of the package, then click Apply in the bottom right.
  - Repeat for the following publicly available and required packages:
    - numpy version 1.26.4 or higher
    - matplotlib version 3.8.4 or higher
    - joblib version 1.4.2 or higher
    - scikit-learn version 1.4.2 or higher
    - lilac-arff version 2.5.0 or higher
    - scipy version 1.11.4 or higher
    - tqdm version 4.66.4 or higher
- Install the additional required packages using pip:
  - Left click on the green arrow to the right of the execution environment and select Open Terminal.
  - Then type:
    pip install diffprivlib
To run the notebooks:
- Launch Anaconda Navigator
- Select the created execution environment.
- Select Jupyter Notebook.
- Navigate to the file location where this artefact directory is stored.
- Open and run the notebooks in the following order:
  - Diffprivlib Experiment-NB.ipynb, approx. processing time: 1.4 hours
  - Diffprivlib Experiment-LR.ipynb, approx. processing time: 4.6 hours
  - Diffprivlib Experiment-RF.ipynb, approx. processing time: 11.5 hours
Note: Running each of these notebooks will overwrite the figures in the local figures folder with new, identical versions.
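Given the long processing times, it may be convenient to run the notebooks unattended rather than through the browser. The sketch below is one possible approach and not part of the original workflow: it assumes the `jupyter nbconvert` command is available in the active environment and that the script is started from the directory containing the notebooks. `build_command` and `run_all` are illustrative names.

```python
import subprocess
from pathlib import Path

# Notebooks in the order given above.
NOTEBOOKS = [
    "Diffprivlib Experiment-NB.ipynb",
    "Diffprivlib Experiment-LR.ipynb",
    "Diffprivlib Experiment-RF.ipynb",
]


def build_command(notebook):
    """Build the nbconvert command that executes one notebook in place."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute",
        "--inplace",
        # Disable the per-cell timeout, since single cells may run for hours.
        "--ExecutePreprocessor.timeout=-1",
        str(notebook),
    ]


def run_all(directory="."):
    """Execute every notebook in order, stopping at the first failure."""
    for name in NOTEBOOKS:
        subprocess.run(build_command(Path(directory) / name), check=True)
```

Calling `run_all()` executes the three notebooks back to back; with the processing times listed above, expect the full run to take well over half a day.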
- Diffprivlib Experiment-NB.ipynb: A Jupyter® notebook containing the NB analysis.
- Diffprivlib Experiment-LR.ipynb: A Jupyter® notebook containing the LR analysis.
- Diffprivlib Experiment-RF.ipynb: A Jupyter® notebook containing the RF analysis.
- datasets: A folder containing all of the datasets required for reproducibility.
- figures: A folder containing all of the figures output by this code.
If you use this for research, please consider citing the following reference paper:
@mastersthesis{data-imbalance-dp,
author = {Krauskopf, Alison},
title = {Exploring the impact of data imbalance on ε-Differential Privacy},
school = {University of York},
year = {2024},
note = {[Unpublished]}
}
[1] N. Holohan, S. Braghin, P. Mac Aonghusa, and K. Levacher, "Diffprivlib: The IBM differential privacy library," arXiv e-prints 1907.02444 [cs.CR], 2019. doi: 10.48550/arXiv.1907.02444