
chunhuizng/mllm-video-captioner


Pretrained Image-Text Models are Secretly Video Captioners

Authors: Chunhui Zhang*, Yiren Jian*, Zhongyu Ouyang, Soroush Vosoughi

Accepted to the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL 2025). The first two authors contributed equally and were listed in random order.

With minimal resource usage compared to other open-source models, we rank among the top two performers on the Papers with Code leaderboard.

Overview

This fully open-source and reproducible study demonstrates how a pretrained image-text model, BLIP-2, can be adapted to video captioning via reinforcement learning (RL). By leveraging BLIP-2's visual and language understanding, originally developed for image-text tasks, we show that it generates effective video captions without architectural changes or additional parameters. The approach achieves high-quality captioning on multi-frame video data, with reinforcement learning used to further boost performance. As a result, we reach one of the top positions on the Papers with Code leaderboard, surpassing industry-level captioning models.
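
As a rough illustration of why no architectural changes are needed, the sketch below follows a common BLIP-2 video adaptation: each frame is encoded independently by the frozen ViT and Q-Former, and the per-frame query tokens are concatenated along the sequence dimension before being fed to the language model. The callables `vit`, `qformer`, and `llm_proj` are hypothetical stand-ins, not this repository's modules.

```python
# Rough sketch (hypothetical stand-ins, not this repository's code): a multi-frame
# clip is handled by concatenating per-frame query tokens, adding no new parameters.
import torch

def encode_video(frames, vit, qformer, llm_proj):
    """frames: (B, T, 3, H, W) clip; returns (B, T * num_query, llm_dim) tokens for the LLM."""
    B, T = frames.shape[:2]
    feats = vit(frames.flatten(0, 1))         # (B*T, num_patches, vit_dim), frozen image encoder
    queries = qformer(feats)                  # (B*T, num_query, q_dim) learned query tokens
    queries = llm_proj(queries)               # linear projection into the LLM embedding space
    return queries.reshape(B, -1, queries.shape[-1])  # concatenate frames along the sequence axis
```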

Installation

conda create -n video_cap python=3.8
conda activate video_cap
pip install -e .

pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117
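
Before moving on, an optional sanity check (generic PyTorch, nothing repository-specific) confirms that the cu117 build installed above is visible to the GPUs:

```python
# Optional environment check: verifies the CUDA build of PyTorch before launching
# the distributed training commands further below.
import torch

print(torch.__version__)          # expected to report 1.13.1+cu117
print(torch.cuda.is_available())  # should be True on a GPU machine
print(torch.cuda.device_count())  # the value to pass to --nproc_per_node
```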

Data Preparation

Please follow the download scripts under LAVIS/lavis/datasets/download_scripts to download the MSRVTT and MSVD datasets. The VATEX dataset can be downloaded from its official website.
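
The LAVIS data loaders handle video decoding internally; purely for illustration, the sketch below shows how a fixed number of frames can be sampled uniformly from a clip, which is how multi-frame inputs are typically formed. It assumes the `decord` package and a hypothetical local file `video.mp4`.

```python
# Illustrative only: uniform sampling of a fixed number of frames from a raw video.
# The repository's data loaders do this internally; this is not its actual code.
import numpy as np
from decord import VideoReader, cpu

def sample_frames(path, num_frames=8):
    vr = VideoReader(path, ctx=cpu(0))
    indices = np.linspace(0, len(vr) - 1, num_frames).astype(int)  # uniform temporal stride
    return vr.get_batch(indices).asnumpy()  # (num_frames, H, W, 3) uint8 array

frames = sample_frames("video.mp4", num_frames=8)  # hypothetical file
print(frames.shape)
```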

Cross-Entropy Training

We provide example scripts for training (MSRVTT and MSVD):

python -m torch.distributed.run --nproc_per_node=8 train.py --cfg-path lavis/projects/blip2/train/caption_msrvtt_flant5xl_ft.yaml
python -m torch.distributed.run --nproc_per_node=8 train.py --cfg-path lavis/projects/blip2/train/caption_msvd_flant5xl_ft.yaml
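
For intuition about what this stage optimizes, here is a minimal, self-contained sketch of the teacher-forcing cross-entropy objective, using a plain FLAN-T5 from Hugging Face as a stand-in. The repository computes this loss inside LAVIS/BLIP-2; the text prompt below merely approximates the projected visual tokens.

```python
# Conceptual sketch of the cross-entropy (teacher-forcing) captioning objective.
# A small FLAN-T5 stands in for the BLIP-2 + FLAN-T5-XL stack used by the repository.
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-small")

video_prompt = "Describe the video:"              # placeholder for the projected visual features
caption = "a man is playing guitar on stage"      # ground-truth caption
inputs = tokenizer(video_prompt, return_tensors="pt")
labels = tokenizer(caption, return_tensors="pt").input_ids

loss = model(**inputs, labels=labels).loss        # token-level cross entropy
loss.backward()
```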

SCST Training

We provide example scripts for training (MSRVTT and MSVD):

python -m torch.distributed.run --nproc_per_node=8 train.py --cfg-path lavis/projects/blip2/train/caption_msrvtt_flant5xl_scst.yaml
python -m torch.distributed.run --nproc_per_node=8 train.py --cfg-path lavis/projects/blip2/train/caption_msvd_flant5xl_scst.yaml
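
For reference, the sketch below shows the usual self-critical sequence training (SCST) update: a sampled caption is rewarded (typically with CIDEr) against a greedy-decoded baseline, and the advantage weights the sampled caption's log-probability. The helpers `sample_caption`, `greedy_caption`, and `cider_reward` are hypothetical stand-ins, not functions from this repository.

```python
# Minimal sketch of the standard SCST update (REINFORCE with a greedy baseline).
# All helpers are hypothetical; only the loss structure is the point here.
import torch

def scst_loss(model, video_feats, references, sample_caption, greedy_caption, cider_reward):
    sampled, log_probs = sample_caption(model, video_feats)   # stochastic decoding + (B, L) token log-probs
    with torch.no_grad():
        baseline = greedy_caption(model, video_feats)         # greedy decoding, no gradient
    reward = cider_reward(sampled, references)                # (B,) CIDEr of sampled captions
    baseline_reward = cider_reward(baseline, references)      # (B,) CIDEr of greedy captions
    advantage = reward - baseline_reward                      # self-critical baseline
    return -(advantage * log_probs.sum(dim=-1)).mean()        # maximize expected reward
```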

Full training configurations can be found in lavis/projects/blip2/train.

caption_msrvtt_flant5xl_ft.yaml
caption_msrvtt_flant5xl_scst.yaml
caption_msrvtt_opt2.7b_ft.yaml
caption_msrvtt_opt2.7b_scst.yaml
caption_msrvtt_opt6.7b_ft.yaml
caption_msrvtt_vicuna7b_ft.yaml
caption_msvd_flant5xl_ft.yaml
caption_msvd_flant5xl_scst.yaml
caption_vatex_flant5xl_ft.yaml
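
If you want to inspect or adjust any of the configs listed above before launching a run, they are plain YAML files, so OmegaConf (a LAVIS dependency) can load them. The key `run.batch_size_train` below is illustrative and may not match the actual schema.

```python
# Quick way to inspect or tweak a training config; key names are illustrative.
from omegaconf import OmegaConf

cfg = OmegaConf.load("lavis/projects/blip2/train/caption_msrvtt_flant5xl_ft.yaml")
print(OmegaConf.to_yaml(cfg))                  # dump the full configuration
# cfg.run.batch_size_train = 8                 # example override; verify the key exists first
# OmegaConf.save(cfg, "my_caption_config.yaml")
```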

Evaluation

We provide example scripts for evaluation (MSRVTT and MSVD):

python -m torch.distributed.run --nproc_per_node=8 evaluate.py --cfg-path lavis/projects/blip2/eval/caption_msrvtt_flant5xl_eval.yaml
python -m torch.distributed.run --nproc_per_node=8 evaluate.py --cfg-path lavis/projects/blip2/eval/caption_msvd_flant5xl_eval.yaml
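
evaluate.py reports the captioning metrics itself. If you want to recompute CIDEr offline from saved predictions, a minimal sketch with the pycocoevalcap package (its PTB tokenizer also needs a Java runtime) looks like the following; the example captions are made up.

```python
# Offline CIDEr computation from generated captions, assuming pycocoevalcap.
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.tokenizer.ptbtokenizer import PTBTokenizer

gts = {"video0": [{"caption": "a man is playing guitar"}, {"caption": "someone plays a guitar"}]}
res = {"video0": [{"caption": "a man plays the guitar on stage"}]}

tokenizer = PTBTokenizer()
gts, res = tokenizer.tokenize(gts), tokenizer.tokenize(res)   # lowercase + PTB tokenization
score, per_video = Cider().compute_score(gts, res)
print(f"CIDEr: {score:.3f}")
```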

More Experimental Details

We hope these results are useful for your experimental explorations. Feel free to discuss them with us (especially Yiren and me)!

| LLM Decoder | Description | Train ViT | Train LLM (LoRA) | Num. of Frames | Video Resolution | SCST | CIDEr |
|---|---|---|---|---|---|---|---|
| FLAN-T5-XL | No MSRVTT training; BLIP-2 zero-shot results. | - | - | - | - | - | 50.8 (val) |
| FLAN-T5-XL | Train only the Q-Former. | No | No | 8 | 224 | No | 71.4 |
| FLAN-T5-XL | Train only the Q-Former, better hyper-parameters. | No | No | 8 | 224 | No | 73.2 |
| FLAN-T5-XL | Train only the Q-Former; 4-frame results are worse than 8-frame. | No | No | 4 | 224 | No | 70.4 |
| FLAN-T5-XL | Train only the Q-Former; 4-frame results are worse than 8-frame. | No | No | 4 | 224 | No | 71.7 |
| FLAN-T5-XL | Best 8-frame result; these weights were later used to initialize 8-frame RL training. | No | No | 8 | 224 | No | 73.6 |
| FLAN-T5-XL | SCST reinforcement learning, initialized from the pretrained checkpoint 20231210062. | - | - | - | - | Yes | 77.4 |
| FLAN-T5-XL | SCST with a different set of hyper-parameters, same initialization (20231210062). | - | - | - | - | Yes | 77.4 |
| FLAN-T5-XL | Another set of hyper-parameters; trained the 8-frame Q-Former only once. | No | No | 8 | 224 | No | 71.8 |
| FLAN-T5-XL | First LoRA LLM run, 16 frames. | No | Yes | 16 | 224 | No | 73.0 |
| FLAN-T5-XL | LoRA LLM, 8 frames (seemingly not as good as 16 frames); mixed with WebVid data pseudo-labeled on MSRVTT. | No | Yes | 8 | 224 | No | 71.9 |
| FLAN-T5-XL | LoRA LLM, 8 frames (epoch-based → iteration-based schedule); tuning LoRA hyper-parameters can reach 73+. Pure WebVid data. | No | Yes | 8 | 224 | No | 73.7 |
| FLAN-T5-XL | LoRA LLM, 8 frames (epoch-based → iteration-based schedule), WebVid, hyper-parameters changed again. WebVid did not help as expected, if at all, so it was dropped in later experiments. | No | Yes | 8 | 224 | No | 73.0 |
| FLAN-T5-XL | LoRA, 16 frames. | No | Yes | 16 | 224 | No | 72.6 |
| FLAN-T5-XL | LoRA, 8 frames (fine-tuned ViT; completely failed). | Yes | Yes | 8 | 224 | No | 61.8 |
| FLAN-T5-XL | LoRA, 8 frames (fine-tuned ViT briefly, only 5 epochs; not too bad). | Yes | Yes | 8 | 224 | No | 71.4 |
| FLAN-T5-XL | LoRA, 8 frames (fine-tuned ViT; poor result). | Yes | Yes | 8 | 224 | No | 68.4 |
| FLAN-T5-XL | No LoRA, 8 frames at 336-pixel input (higher resolution does not help: training becomes slower, and interpolated position embeddings give an even worse result). | No | No | 8 | 364 | No | 70.7 |
| FLAN-T5-XL | LoRA + SCST. | No | No | 16 | 224 | Yes | 79.2 |
| FLAN-T5-XL | LoRA + SCST (changed hyper-parameters). | No | No | 16 | 224 | Yes | 79.1 |
| FLAN-T5-XL | LoRA + SCST (changed hyper-parameters). | No | No | 16 | 224 | Yes | 76.8 |
| FLAN-T5-XL | LoRA + SCST (changed hyper-parameters). | No | No | 16 | 224 | Yes | 79.5 |
| FLAN-T5-XL | LoRA + SCST (changed hyper-parameters). | No | No | 16 | 224 | Yes | 78.7 |
| FLAN-T5-XL | LoRA + SCST (changed hyper-parameters). | No | No | 16 | 224 | Yes | 79.0 |
| FLAN-T5-XL | LoRA + 32 frames, no RL (32 frames are not very helpful). | No | No | 32 | 224 | No | 72.2 |
| FLAN-T5-XL | LoRA + SCST + 32 frames (32 frames are not very helpful). | No | No | 32 | 224 | Yes | 79.1 |
| FLAN-T5-XL | LoRA + SCST + 32 frames (new hyper-parameters). | No | No | 32 | 224 | Yes | 79.3 |
| FLAN-T5-XL | LoRA + 32 frames, no RL (re-trained once). | No | No | 32 | 224 | No | 72.5 |
| FLAN-T5-XL | LoRA + 32 frames, re-trained for RL; seems worse. | No | No | 32 | 224 | Yes | 77.8 |
| FLAN-T5-XL | LoRA + 32 frames, re-trained for RL; seems ineffective. | No | No | 32 | 224 | Yes | 78.3 |
| FLAN-T5-XL | LoRA + 32 frames, re-trained for RL; seems ineffective. | No | No | 32 | 224 | Yes | 76.2 |
| OPT-2.7b | Changed the LLM; not as good as FLAN-T5 and very sensitive to hyper-parameters. | - | - | - | - | - | 71.3 |
| OPT-2.7b | BLIP-2 pretrained with only 4M image-text pairs. | - | - | - | - | - | B1: 85.0, B2: 71.9, B3: 58.3, B4: 45.6, M: 31.8, R: 64.6, C: 65.7 |
| OPT-6.7b | Switched to the 6.7B model; failed three times (CIDEr around 63). | - | - | - | - | - | 63.x |
| Vicuna-7b | A larger model; results did not improve. Conclusion: Vicuna-7b is unnecessary, too slow and too large. | No | No | - | - | No | 72.7 |
| Vicuna-7b | Also tried SCST; Vicuna with SCST (RL) barely improves (e.g., it does not reach ~79), possibly because the model is already aligned with human preferences, which reduces RL efficiency. | Yes | Yes | - | - | Yes | 74.8 |
| Vicuna-7b | Also tried SCST; Vicuna with SCST (RL) barely improves (e.g., it does not reach ~79), possibly because the model is already aligned with human preferences, which reduces RL efficiency. | Yes | Yes | - | - | Yes | 75.8 |
| SimVLG | A quick SimVLG run (the simplest setup, using only the LLM without image-text pretraining); poor results, indicating that image-text alignment is needed in addition to the LLM. | No | No | - | - | No | 57.3 |
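
For the "Train LLM (LoRA)" rows above, the sketch below shows one common way to attach LoRA adapters to a FLAN-T5 decoder with Hugging Face peft. The rank, alpha, and target modules are illustrative defaults, not the repository's actual settings, and the wiring inside LAVIS may differ.

```python
# Illustrative LoRA setup for a FLAN-T5 decoder using peft; hyper-parameters are
# placeholders, not the values used in the experiments above.
from transformers import T5ForConditionalGeneration
from peft import LoraConfig, get_peft_model

llm = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xl")
lora_cfg = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q", "v"],          # T5 attention projections; an illustrative choice
    task_type="SEQ_2_SEQ_LM",
)
llm = get_peft_model(llm, lora_cfg)
llm.print_trainable_parameters()        # only the low-rank adapters remain trainable
```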