Zechen Bai¹, Jianxiong Gao², Ziteng Gao¹, Pichao Wang³, Zheng Zhang³, Tong He³, Mike Zheng Shou¹
arXiv 2024
¹Show Lab, National University of Singapore  ²Fudan University  ³Amazon
News
- [2024-12-26] We released our code!
- [2024-11-26] We released our paper on arXiv.
FQGAN is a state-of-the-art visual tokenizer with a novel factorized tokenization design, surpassing VQ- and LFQ-based methods on discrete image reconstruction.
FQGAN addresses the low codebook utilization of large codebooks by decomposing a single large codebook into multiple independent sub-codebooks. By leveraging a disentanglement regularization and representation-learning objectives, the sub-codebooks learn hierarchical, structured, and semantically meaningful representations. As a result, FQGAN achieves state-of-the-art performance on discrete image reconstruction, surpassing both VQ and LFQ methods.
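To make the factorization concrete, here is a minimal sketch of quantizing with independent sub-codebooks (hypothetical module and parameter names; it omits the straight-through gradient estimator and the disentanglement/representation losses used in the actual method):

```python
import torch
import torch.nn as nn

class FactorizedQuantizer(nn.Module):
    """Sketch: split the latent channels across independent sub-codebooks."""

    def __init__(self, num_codebooks=2, codebook_size=8192, embed_dim=8):
        super().__init__()
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, embed_dim) for _ in range(num_codebooks)
        )
        self.embed_dim = embed_dim

    def forward(self, z):
        # z: (B, C, H, W) with C = num_codebooks * embed_dim; each sub-codebook
        # quantizes its own slice of the latent channels independently.
        chunks = z.chunk(len(self.codebooks), dim=1)
        quantized, indices = [], []
        for chunk, codebook in zip(chunks, self.codebooks):
            b, c, h, w = chunk.shape
            flat = chunk.permute(0, 2, 3, 1).reshape(-1, self.embed_dim)
            # Nearest-neighbor lookup restricted to this sub-codebook.
            dist = torch.cdist(flat, codebook.weight)
            idx = dist.argmin(dim=1)
            q = codebook(idx).view(b, h, w, c).permute(0, 3, 1, 2)
            quantized.append(q)
            indices.append(idx.view(b, h, w))
        # Concatenating the sub-codes restores the full latent. The effective
        # vocabulary is codebook_size ** num_codebooks, while only
        # num_codebooks * codebook_size embeddings are stored.
        return torch.cat(quantized, dim=1), indices
```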
| Method | Downsample | rFID (256×256) | Weight |
|---|---|---|---|
| FQGAN-Dual | 16 | 0.94 | fqgan_dual_ds16.pt |
| FQGAN-Triple | 16 | 0.76 | fqgan_triple_ds16.pt |
| FQGAN-Dual | 8 | 0.32 | fqgan_dual_ds8.pt |
| FQGAN-Triple | 8 | 0.24 | fqgan_triple_ds8_c2i.pt |
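A minimal sketch for inspecting a downloaded checkpoint (assuming a standard `torch.save` archive; the exact top-level keys depend on the training script):

```python
import torch

# Load on CPU first; checkpoints may bundle optimizer state with the weights.
ckpt = torch.load("fqgan_dual_ds16.pt", map_location="cpu")
if isinstance(ckpt, dict):
    # Common layouts: a raw state_dict, or a dict with "model"/"ema" entries.
    print(list(ckpt.keys()))
```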
The main dependencies of this project are PyTorch and transformers; you may be able to reuse an existing Python environment. Otherwise, set one up as follows:
```bash
git clone https://github.com/showlab/FQGAN.git
cd FQGAN
conda create -n fqgan python=3.10 -y
conda activate fqgan
pip3 install torch==2.1.1+cu121 torchvision==0.16.1+cu121 --extra-index-url https://download.pytorch.org/whl/cu121
pip3 install -r requirements.txt
```
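A quick sanity check that the CUDA build of PyTorch was installed correctly:

```python
import torch

print(torch.__version__)          # expect 2.1.1+cu121
print(torch.cuda.is_available())  # expect True on a GPU machine
```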
First, prepare the ImageNet dataset.
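The `--data-path` flags below suggest the usual class-per-folder convention (torchvision `ImageFolder` style); this layout is an assumption, so adapt the paths to your setup:

```
ImageNet/
├── train/
│   ├── n01440764/
│   │   ├── n01440764_10026.JPEG
│   │   └── ...
│   └── ...
└── val/
    ├── n01440764/
    └── ...
```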
```bash
# Train the FQGAN-Dual tokenizer (16x downsampling by default)
bash train_fqgan_dual.sh

# Train the FQGAN-Triple tokenizer (16x downsampling by default)
bash train_fqgan_triple.sh
```
To train the FAR Generation Model, please follow the instructions in train_far_dual.sh.
Download the pre-trained tokenizer weights above, or train the tokenizer yourself. First, generate the reference `.npz` file of the validation set; you only need to run this command once:
```bash
torchrun --nnodes=1 --nproc_per_node=8 --node_rank=0 \
  --master_port=12343 \
  tokenizer/val_ddp.py \
  --data-path /home/ubuntu/DATA/ImageNet/val \
  --image-size 256 \
  --per-proc-batch-size 128
```
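To verify the reference file, here is a sketch assuming the ADM-style evaluation format, i.e. a single uint8 image array per `.npz`:

```python
import numpy as np

data = np.load("reconstructions/val_imagenet.npz")
print(data.files)                  # typically a single array key
images = data[data.files[0]]
print(images.shape, images.dtype)  # expect roughly (50000, 256, 256, 3) uint8
```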
Evaluate the FQGAN-Dual model:
```bash
torchrun \
  --nnodes=1 --nproc_per_node=8 --node_rank=0 \
  --master_port=12344 \
  tokenizer/reconstruction_vq_ddp_dual.py \
  --data-path /home/ubuntu/DATA/ImageNet/val \
  --image-size 256 \
  --vq-model VQ-16 \
  --vq-ckpt results_tokenizer_image/fqgan_dual_ds16.pt \
  --codebook-size 16384 \
  --codebook-embed-dim 8 \
  --per-proc-batch-size 128 \
  --with_clip_supervision \
  --folder-name FQGAN_Dual_DS16

python3 evaluations/evaluator.py \
  reconstructions/val_imagenet.npz \
  reconstructions/FQGAN_Dual_DS16.npz
```
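For a qualitative spot check alongside the metric run, you can tile a few images from the reference and reconstruction files (a minimal sketch; pairing the rows assumes both files store images in the same order, which may not hold exactly across different DDP shardings):

```python
import numpy as np
from PIL import Image

ref = np.load("reconstructions/val_imagenet.npz")
rec = np.load("reconstructions/FQGAN_Dual_DS16.npz")
ref_imgs = ref[ref.files[0]][:4]  # first row: reference images
rec_imgs = rec[rec.files[0]][:4]  # second row: reconstructions

# Stack the four examples side by side, reference above reconstruction.
grid = np.concatenate(
    [np.concatenate(list(ref_imgs), axis=1),
     np.concatenate(list(rec_imgs), axis=1)],
    axis=0,
)
Image.fromarray(grid).save("spot_check.png")
```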
Evaluate the FQGAN-Triple model:
```bash
torchrun \
  --nnodes=1 --nproc_per_node=8 --node_rank=0 \
  --master_port=12344 \
  tokenizer/reconstruction_vq_ddp_triple.py \
  --data-path /home/ubuntu/DATA/ImageNet/val \
  --image-size 256 \
  --vq-model VQ-16 \
  --vq-ckpt results_tokenizer_image/fqgan_triple_ds16.pt \
  --codebook-size 16384 \
  --codebook-embed-dim 8 \
  --per-proc-batch-size 64 \
  --with_clip_supervision \
  --folder-name FQGAN_Triple_DS16

python3 evaluations/evaluator.py \
  reconstructions/val_imagenet.npz \
  reconstructions/FQGAN_Triple_DS16.npz
```
To evaluate the FAR Generation Model, please follow the instructions in eval_far.sh.
To cite the paper and model, please use the BibTeX entry below:
```bibtex
@article{bai2024factorized,
  title={Factorized Visual Tokenization and Generation},
  author={Bai, Zechen and Gao, Jianxiong and Gao, Ziteng and Wang, Pichao and Zhang, Zheng and He, Tong and Shou, Mike Zheng},
  journal={arXiv preprint arXiv:2411.16681},
  year={2024}
}
```
This work is based on Taming-Transformers, Open-MAGVIT2, and LlamaGen. Thanks to all the authors for their great work!
The code is released under the CC-BY-NC-4.0 license for research purposes only.