This is the repository for TabWak: A watermark for Tabular Diffusion Models.
TabWak uses Tabsyn as its backbone model, so installation and usage closely follow Tabsyn's. The installation steps below are adapted from Tabsyn's instructions.
Python version: 3.10
conda create -n tabsyn python=3.10
conda activate tabsyn
Using pip:
pip install torch torchvision torchaudio
Or via conda:
conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.7 -c pytorch -c nvidia
pip install -r requirements.txt
pip install dgl -f https://data.dgl.ai/wheels/cu117/repo.html
pip install torch_geometric
pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.0.1+cu117.html
Create another environment for the quality metric:
conda create -n synthcity python=3.10
conda activate synthcity
pip install synthcity
pip install category_encoders
Download the raw dataset:
python download_dataset.py
Process the dataset:
python process_dataset.py
For Tabsyn, use the following commands for training:
- Train the VAE model first:
python main.py --dataname [NAME_OF_DATASET] --method vae --mode train
- After the VAE is trained, train the diffusion model:
python main.py --dataname [NAME_OF_DATASET] --method tabsyn --mode train
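The two training stages above depend on each other, so it can help to chain them in a small script that stops on the first failure. The following is a minimal sketch, not part of the repository: the `run` helper, the `DRY_RUN` switch, and the example dataset name `adult` are assumptions introduced here for illustration. Only the `python main.py` commands themselves come from the steps above.

```shell
#!/usr/bin/env bash
# Sketch: run VAE training, then diffusion training, aborting on any error.
set -eu

# Assumed example dataset name; replace with your own [NAME_OF_DATASET].
DATANAME="${DATANAME:-adult}"
# Hypothetical switch: with DRY_RUN=1 (the default) commands are only printed.
DRY_RUN="${DRY_RUN:-1}"

# Print a command and, unless in dry-run mode, execute it.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" != "1" ]; then
    "$@"
  fi
}

train_all() {
  # Stage 1: the VAE must be trained first.
  run python main.py --dataname "$DATANAME" --method vae --mode train
  # Stage 2: the diffusion model is trained after the VAE.
  run python main.py --dataname "$DATANAME" --method tabsyn --mode train
}

train_all
```

Run it once as-is to check the printed commands, then set `DRY_RUN=0` to launch the actual training.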
To watermark the data during the sampling process, run:
python main.py --dataname [NAME_OF_DATASET] --method tabsyn --mode sample --steps 1000 --with_w [Name_of_Watermark]
[Name_of_Watermark] options: treering, GS, TabWak, TabWak*
For watermark detection, use:
python main.py --dataname [NAME_OF_DATASET] --method tabsyn --mode detect --steps 1000 --with_w [Name_of_Watermark]
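To compare watermarks, sampling and detection can be run back to back for each option. The sketch below assumes the same hypothetical `run` helper and `DRY_RUN` switch as a convenience, and `adult` is again only an assumed dataset name. Note that `TabWak*` contains a shell glob character, so it is quoted to make sure it reaches `main.py` unchanged.

```shell
#!/usr/bin/env bash
# Sketch: sample and then detect for every watermark option listed above.
set -eu

# Assumed example dataset name; replace with your own [NAME_OF_DATASET].
DATANAME="${DATANAME:-adult}"
# Hypothetical switch: with DRY_RUN=1 (the default) commands are only printed.
DRY_RUN="${DRY_RUN:-1}"

# Print a command (for logging only) and, unless in dry-run mode, execute it.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" != "1" ]; then
    "$@"
  fi
}

sample_and_detect() {
  # 'TabWak*' is quoted so the shell does not expand the asterisk as a glob.
  for wm in treering GS TabWak 'TabWak*'; do
    for mode in sample detect; do
      run python main.py --dataname "$DATANAME" --method tabsyn \
        --mode "$mode" --steps 1000 --with_w "$wm"
    done
  done
}

sample_and_detect
```

When typing the command by hand, the same quoting applies: write `--with_w 'TabWak*'`.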
To run attacks on watermarked data, use:
python main.py --dataname [NAME_OF_DATASET] --method tabsyn --mode detect --steps 1000 --with_w [Name_of_Watermark] --attack [Name_of_Attack_Options] --attack_percentage [0 to 1]
[Name_of_Attack_Options]: rowdeletion, celldeletion, celldeletetion, noise, shuffle
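A robustness check usually sweeps several attacks and attack strengths. The following sketch loops over a subset of the attack options listed above with two arbitrarily chosen percentages (0.1 and 0.2). The `run` helper, the `DRY_RUN` switch, the dataset name `adult`, and the default watermark choice are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch: run detection under several attacks and attack percentages.
set -eu

# Assumed example values; replace with your own dataset and watermark.
DATANAME="${DATANAME:-adult}"
WATERMARK="${WATERMARK:-TabWak}"
# Hypothetical switch: with DRY_RUN=1 (the default) commands are only printed.
DRY_RUN="${DRY_RUN:-1}"

# Print a command (for logging only) and, unless in dry-run mode, execute it.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" != "1" ]; then
    "$@"
  fi
}

attack_sweep() {
  for attack in rowdeletion celldeletion noise shuffle; do
    # Percentages are illustrative; any value from 0 to 1 is accepted.
    for pct in 0.1 0.2; do
      run python main.py --dataname "$DATANAME" --method tabsyn \
        --mode detect --steps 1000 --with_w "$WATERMARK" \
        --attack "$attack" --attack_percentage "$pct"
    done
  done
}

attack_sweep
```

Set `WATERMARK` and `DATANAME` in the environment to sweep a different configuration, and `DRY_RUN=0` to execute the commands instead of printing them.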