Factorized Visual Tokenization and Generation

Zechen Bai 1  Jianxiong Gao 2  Ziteng Gao 1 

Pichao Wang 3  Zheng Zhang 3  Tong He 3  Mike Zheng Shou 1 

arXiv 2024

1 Show Lab, National University of Singapore   2 Fudan University  3 Amazon 


News

  • [2024-12-26] We released our code!
  • [2024-11-26] We released our paper on arXiv.

TL;DR

FQGAN is a state-of-the-art visual tokenizer with a novel factorized tokenization design, surpassing VQ- and LFQ-based methods on discrete image reconstruction.

Method Overview

FQGAN addresses the low codebook utilization issue of large codebooks by decomposing a single large codebook into multiple independent sub-codebooks. Guided by a disentanglement regularization term and representation-learning objectives, the sub-codebooks learn hierarchical, structured, and semantically meaningful representations. FQGAN achieves state-of-the-art performance on discrete image reconstruction, surpassing VQ and LFQ methods.
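
The factorization idea can be illustrated with a short sketch. The snippet below is a minimal, hypothetical illustration of quantizing a latent with multiple independent sub-codebooks; names such as FactorizedQuantizer are not the repository's actual API, and the disentanglement and representation-learning losses are omitted.

import torch
import torch.nn as nn

class FactorizedQuantizer(nn.Module):
    """Hypothetical sketch: one latent split across k independent sub-codebooks."""

    def __init__(self, num_codebooks=2, codebook_size=8192, embed_dim=8):
        super().__init__()
        self.num_codebooks = num_codebooks
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, embed_dim) for _ in range(num_codebooks)
        )

    def forward(self, z):
        # z: (B, num_codebooks * embed_dim, H, W) continuous encoder features.
        quantized, indices = [], []
        for chunk, codebook in zip(z.chunk(self.num_codebooks, dim=1), self.codebooks):
            flat = chunk.permute(0, 2, 3, 1).reshape(-1, codebook.embedding_dim)
            idx = torch.cdist(flat, codebook.weight).argmin(dim=1)  # nearest code per position
            q = codebook(idx).view(chunk.shape[0], chunk.shape[2], chunk.shape[3], -1)
            quantized.append(q.permute(0, 3, 1, 2))
            indices.append(idx.view(chunk.shape[0], chunk.shape[2], chunk.shape[3]))
        # Concatenate the per-sub-codebook quantized features back along channels.
        return torch.cat(quantized, dim=1), indices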

Getting Started

Pre-trained Models

Method        Downsample  rFID (256x256)  Weight
FQGAN-Dual    16          0.94            fqgan_dual_ds16.pt
FQGAN-Triple  16          0.76            fqgan_triple_ds16.pt
FQGAN-Dual    8           0.32            fqgan_dual_ds8.pt
FQGAN-Triple  8           0.24            fqgan_triple_ds8_c2i.pt
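
As a rough illustration, a downloaded checkpoint can be inspected with plain PyTorch. The "model" key below is an assumption about the checkpoint format; check the repository's tokenizer loading code for the exact structure.

import torch

ckpt = torch.load("results_tokenizer_image/fqgan_dual_ds16.pt", map_location="cpu")
# Weights may be stored directly or wrapped under a "model" key (assumption).
state_dict = ckpt["model"] if isinstance(ckpt, dict) and "model" in ckpt else ckpt
print(len(state_dict), "parameter tensors")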

Setup

The main dependencies of this project are PyTorch and Transformers. You may use your existing Python environment, or create a new one as shown below.

git clone https://github.com/showlab/FQGAN.git

conda create -n fqgan python=3.10 -y
conda activate fqgan

pip3 install torch==2.1.1+cu121 torchvision==0.16.1+cu121 --extra-index-url https://download.pytorch.org/whl/cu121
pip3 install -r requirements.txt

Training

First, please prepare the ImageNet dataset.
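
A quick, hypothetical sanity check of the dataset layout. This assumes the standard class-subfolder format readable by torchvision's ImageFolder and a train split under /home/ubuntu/DATA/ImageNet/train (only the val path appears in the commands below), so adjust the path to your setup.

from torchvision.datasets import ImageFolder

# Assumed layout: ImageNet/train/<class>/<image>.JPEG
train_set = ImageFolder("/home/ubuntu/DATA/ImageNet/train")
print(len(train_set.classes), "classes,", len(train_set), "images")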

# Train FQGAN-Dual Tokenizer (downsample 16x by default)
bash train_fqgan_dual.sh

# Train FQGAN-Triple Tokenizer (downsample 16x by default)
bash train_fqgan_triple.sh

To train the FAR Generation Model, please follow the instructions in train_far_dual.sh.

Evaluation

Download the pre-trained tokenizer weights, or train the models yourself.

First, generate the reference .npz file of the validation set. You only need to run this command once:

torchrun \
  --nnodes=1 --nproc_per_node=8 --node_rank=0 \
  --master_port=12343 \
  tokenizer/val_ddp.py \
  --data-path /home/ubuntu/DATA/ImageNet/val \
  --image-size 256 \
  --per-proc-batch-size 128

Evaluate the FQGAN-Dual model:

torchrun \
  --nnodes=1 --nproc_per_node=8 --node_rank=0 \
  --master_port=12344 \
  tokenizer/reconstruction_vq_ddp_dual.py \
  --data-path /home/ubuntu/DATA/ImageNet/val \
  --image-size 256 \
  --vq-model VQ-16 \
  --vq-ckpt results_tokenizer_image/fqgan_dual_ds16.pt \
  --codebook-size 16384 \
  --codebook-embed-dim 8 \
  --per-proc-batch-size 128 \
  --with_clip_supervision \
  --folder-name FQGAN_Dual_DS16

python3 evaluations/evaluator.py \
  reconstructions/val_imagenet.npz \
  reconstructions/FQGAN_Dual_DS16.npz

Evaluate the FQGAN-Triple model:

torchrun \
  --nnodes=1 --nproc_per_node=8 --node_rank=0 \
  --master_port=12344 \
  tokenizer/reconstruction_vq_ddp_triple.py \
  --data-path /home/ubuntu/DATA/ImageNet/val \
  --image-size 256 \
  --vq-model VQ-16 \
  --vq-ckpt results_tokenizer_image/fqgan_triple_ds16.pt \
  --codebook-size 16384 \
  --codebook-embed-dim 8 \
  --per-proc-batch-size 64 \
  --with_clip_supervision \
  --folder-name FQGAN_Triple_DS16

python3 evaluations/evaluator.py \
  reconstructions/val_imagenet.npz \
  reconstructions/FQGAN_Triple_DS16.npz

To evaluate the FAR Generation Model, please follow the instructions in eval_far.sh.

Comparison with previous visual tokenizers

What has each sub-codebook learned?

Can this tokenizer be used for downstream image generation?

Citation

To cite the paper and model, please use the following BibTeX entry:

@article{bai2024factorized,
  title={Factorized Visual Tokenization and Generation},
  author={Bai, Zechen and Gao, Jianxiong and Gao, Ziteng and Wang, Pichao and Zhang, Zheng and He, Tong and Shou, Mike Zheng},
  journal={arXiv preprint arXiv:2411.16681},
  year={2024}
}

Acknowledgement

This work is based on Taming-Transformers, Open-MAGVIT2, and LlamaGen. Thanks to all the authors for their great work!

License

The code is released under the CC-BY-NC-4.0 license for research purposes only.
