Official Implementation of "From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models"
Our paper "From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models" has been accepted to CVPR 2024.
- Create a conda environment and install PyTorch:

```bash
conda create -n pix2sgg python=3.8
conda activate pix2sgg

# CUDA 11.8
conda install pytorch==2.0.0 torchvision==0.15.0 pytorch-cuda=11.8 -c pytorch -c nvidia
# or CUDA 10.2
conda install pytorch==1.10.2 torchvision==0.11.3 cudatoolkit=10.2 -c pytorch
```
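A quick sanity check that the install can see your GPU (a generic PyTorch check, not specific to this project):

```bash
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```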
- Install the other dependencies:

```bash
pip install -r requirements_pix2sgg.txt
# Hugging Face transformers version: v4.29.2
```
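If pip resolved a different transformers release, it can be pinned to the version noted above:

```bash
pip install transformers==4.29.2
```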
Our work is built upon LAVIS and shares most of its requirements.
- Build the project:

```bash
python setup.py build develop
```
See DATASET.md for dataset preprocessing instructions.
The model weights can be downloaded from: https://huggingface.co/rj979797/PGSG-CVPR2024/tree/main
Open-vocabulary SGG:

| Datasets | Novel+base mR50/100 | Novel+base R50/100 | Novel mR50/100 | Checkpoint |
|---|---|---|---|---|
| VG | 6.2/8.3 | 15.1/18.4 | 3.7/5.2 | vg_ov_sgg.pth |
| VG-SGCls | 9.7/13.8 | 26.8/33.2 | 5.1/7.7 | vg_ov_sgg.pth |
| PSG | 15.3/17.7 | 23.7/25.4 | 6.7/9.6 | psg_ov_sgg.pth |
Standard SGG:

| Datasets | mR50/100 | R50/100 | Checkpoint |
|---|---|---|---|
| VG | 9.0/11.5 | 17.7/20.7 | vg_sgg.pth |
| PSG | 14.5/17.6 | 25.8/28.9 | psg_sgg.pth |
| VG-c | 10.4/12.7 | 20.3/23.6 | vg_sgg_close_clser.pth |
| PSG-c | 21.2/22.0 | 34.9/36.1 | psg_sgg_close_clser.pth |
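The checkpoints can be fetched in the usual Hugging Face ways, for example (the ./checkpoints target directory is just an example, and the CLI route needs a recent huggingface_hub):

```bash
# option 1: clone the whole model repo (requires git-lfs)
git lfs install
git clone https://huggingface.co/rj979797/PGSG-CVPR2024

# option 2: fetch a single checkpoint with the huggingface_hub CLI
huggingface-cli download rj979797/PGSG-CVPR2024 vg_ov_sgg.pth --local-dir ./checkpoints
```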
Our PGSG is trained using the BLIP pre-trained weights, accessible here.
Before training or evaluation, make sure the checkpoint paths in the configuration file (*.yaml) are correct: training loads the checkpoint specified by `model.pretrained`, while evaluation loads the checkpoint specified by `model.finetuned`.
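As a rough sketch, the relevant configuration fields look like this (the paths below are placeholders, not the repository's actual defaults):

```yaml
model:
  # loaded for training (e.g. the BLIP pre-trained weights)
  pretrained: "/path/to/blip_pretrained.pth"
  # loaded for evaluation (e.g. a checkpoint from the model zoo above)
  finetuned: "/path/to/vg_ov_sgg.pth"
```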
Open-vocabulary SGG on VG

Training
```bash
python -m torch.distributed.run --master_port 13919 --nproc_per_node=4 train.py --cfg-path lavis/projects/blip/train/vrd_vg_ft_pgsg_ov.yaml --job-name VG-pgsg_ovsgg
```
Evaluation
```bash
python -m torch.distributed.run --master_port 13958 --nproc_per_node=4 evaluate.py --cfg-path lavis/projects/blip/eval/rel_det_vg_pgsg_eval_ov.yaml --job-name VG-pgsg_ovsgg-eval
```
Standard SGG on VG

Training
```bash
python -m torch.distributed.run --master_port 13919 --nproc_per_node=4 train.py --cfg-path lavis/projects/blip/train/vrd_vg_ft_pgsg.yaml --job-name VG-pgsg_stdsgg
```
Evaluation
```bash
python -m torch.distributed.run --master_port 13958 --nproc_per_node=4 evaluate.py --cfg-path lavis/projects/blip/eval/rel_det_vg_pgsg_eval.yaml --job-name VG-pgsg_stdsgg-eval
```
Open-vocabulary SGG on PSG

Training
```bash
python -m torch.distributed.run --master_port 13919 --nproc_per_node=4 train.py --cfg-path lavis/projects/blip/train/vrd_psg_ft_pgsg_ov.yaml --job-name psg-pgsg_ovsgg
```
Evaluation
```bash
python -m torch.distributed.run --master_port 13958 --nproc_per_node=4 evaluate.py --cfg-path lavis/projects/blip/eval/rel_det_psg_ov.yaml --job-name psg-pgsg_ovsgg-eval
```
Standard SGG on PSG

Training
```bash
python -m torch.distributed.run --master_port 13919 --nproc_per_node=4 train.py --cfg-path lavis/projects/blip/train/vrd_psg_ft_pgsg.yaml --job-name psg-pgsg_stdsgg
```
Evaluation
```bash
python -m torch.distributed.run --master_port 13958 --nproc_per_node=4 evaluate.py --cfg-path lavis/projects/blip/eval/rel_det_psg_eval.yaml --job-name psg-pgsg_stdsgg-eval
```
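The commands above assume 4 GPUs; adjust --nproc_per_node (and --master_port, if it clashes with another job) to match your machine. For example, a single-GPU training run with otherwise identical flags:

```bash
python -m torch.distributed.run --master_port 13919 --nproc_per_node=1 train.py --cfg-path lavis/projects/blip/train/vrd_vg_ft_pgsg.yaml --job-name VG-pgsg_stdsgg
```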
If you find this project helpful for your research, please consider citing our paper:
```bibtex
@misc{li2024pixels,
      title={From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models},
      author={Rongjie Li and Songyang Zhang and Dahua Lin and Kai Chen and Xuming He},
      year={2024},
      eprint={2404.00906},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```
This repository is built on LAVIS and borrows code from the scene graph benchmark framework of SGTR.