Authors: Yihe Deng, Hritik Bansal, Fan Yin, Nanyun Peng, Wei Wang, Kai-Wei Chang
Our study investigates whether R1-like reasoning capabilities can be successfully integrated into large vision-language models (LVLMs) and assesses their impact on challenging multimodal reasoning tasks. We consider an approach that iteratively applies supervised fine-tuning (SFT) on lightweight training data and reinforcement learning (RL) to further improve model generalization.
As an early result, we present OpenVLThinker, an LVLM exhibiting consistently improved reasoning performance on challenging benchmarks such as MathVista, MathVerse, and MathVision.
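To make the recipe concrete, here is a minimal sketch of that iterative loop. It assumes hypothetical helpers (`generate_reasoning_data`, `sft_train`, `rl_train`) that stand in for real training code; none of these names come from the released codebase, and the number of iterations is illustrative.

```python
# Hedged sketch of the iterative SFT -> RL loop described above. Every
# helper here is a hypothetical placeholder, not code from this repository;
# each one marks where real training code would plug in.

def generate_reasoning_data(model, examples):
    """Distill lightweight SFT data: keep reasoning traces whose final
    answers verify as correct."""
    raise NotImplementedError("placeholder for trace distillation")

def sft_train(model, sft_data):
    """Supervised fine-tuning on the distilled traces
    (e.g., with LLaMA-Factory)."""
    raise NotImplementedError("placeholder for SFT")

def rl_train(model, examples):
    """Reinforcement learning on top of the SFT checkpoint
    (e.g., with EasyR1)."""
    raise NotImplementedError("placeholder for RL")

def self_improve(model, examples, num_iterations=3):
    """One SFT step followed by one RL step per iteration."""
    for _ in range(num_iterations):
        sft_data = generate_reasoning_data(model, examples)
        model = sft_train(model, sft_data)
        model = rl_train(model, examples)
    return model
```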
Our model has been evaluated on several challenging benchmarks:
- MathVista
- MathVerse
- MathVision
We provide two evaluation scripts to handle the different answer formats:

- For OpenVLThinker evaluation:

  ```bash
  python evaluation/eval_openvlthinker.py --dataset mathvista
  ```

- For Qwen2.5-VL evaluation:

  ```bash
  python evaluation/eval_qwen.py --dataset mathvista
  ```
The scripts differ in how they extract answers:

- `eval_openvlthinker.py` expects answers in the format `<answer>...</answer>`
- `eval_qwen.py` expects answers in the format `\boxed{...}`
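For illustration, both formats can be extracted with simple regular expressions. The sketch below is an assumption about the extraction logic, not the exact code in either script:

```python
import re

def extract_answer_openvlthinker(text):
    """Pull the final answer from an <answer>...</answer> span."""
    match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return match.group(1).strip() if match else None

def extract_answer_qwen(text):
    """Pull the final answer from a \\boxed{...} span (no nested braces)."""
    match = re.search(r"\\boxed\{([^{}]*)\}", text)
    return match.group(1).strip() if match else None

print(extract_answer_openvlthinker("Reasoning... <answer>42</answer>"))  # 42
print(extract_answer_qwen(r"The result is \boxed{42}."))                 # 42
```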
For MathVerse evaluation, the model's responses are verified with GPT-4V, since the benchmark's free-form questions produce more diverse response formats:

```bash
python evaluation/verify_mathverse_gpt4.py \
    --responses_file ./evaluation/outputs/mathverse_OpenVLThinker-7B.json \
    --output_dir ./evaluation/outputs
```
Note: This requires an OpenAI API key to be set in your environment variables.
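For reference, here is a minimal sketch of what such a GPT-4-based verification call can look like with the official `openai` Python client. The judge model name (`gpt-4o`), the prompt, and the `verify_answer` helper are illustrative assumptions, not the exact logic of `verify_mathverse_gpt4.py`:

```python
import os
from openai import OpenAI

# Reads OPENAI_API_KEY from the environment, as the note above requires.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def verify_answer(question, ground_truth, model_response):
    """Ask a GPT-4 judge whether a free-form response matches the
    ground-truth answer. Prompt and model name are assumptions."""
    prompt = (
        f"Question: {question}\n"
        f"Ground-truth answer: {ground_truth}\n"
        f"Model response: {model_response}\n"
        "Does the model response arrive at the ground-truth answer? "
        "Reply with exactly 'yes' or 'no'."
    )
    completion = client.chat.completions.create(
        model="gpt-4o",  # assumption; substitute the judge model you use
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return completion.choices[0].message.content.strip().lower().startswith("yes")
```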
Optional arguments for both evaluation scripts:

- `--cuda`: CUDA device number (default: 0)
- `--model_path`: Path to the model (default: `ydeng9/OpenVLThinker-7B` for OpenVLThinker, `Qwen/Qwen2.5-VL-7B-Instruct` for Qwen2.5-VL)
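These flags correspond to a standard `argparse` interface; the sketch below shows how such a parser is plausibly set up, though the actual parsers in the repository may define additional options:

```python
import argparse

# Hedged sketch of the shared CLI surface; the real scripts may differ.
parser = argparse.ArgumentParser(description="Benchmark evaluation")
parser.add_argument("--dataset", type=str, default="mathvista",
                    help="Benchmark to evaluate (e.g., mathvista)")
parser.add_argument("--cuda", type=int, default=0,
                    help="CUDA device number")
parser.add_argument("--model_path", type=str, default="ydeng9/OpenVLThinker-7B",
                    help="Model to load")
args = parser.parse_args()
```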
The evaluation results will be saved in the `./evaluation/outputs` directory. Note: evaluation results may fluctuate slightly across different GPUs.
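For quick inspection, a saved output file can be loaded like any other JSON file. The path below mirrors the MathVerse filename from the command above; the structure of the loaded object depends on the evaluation script:

```python
import json
from pathlib import Path

# Path mirrors the MathVerse output file used in the command above.
results_path = Path("./evaluation/outputs/mathverse_OpenVLThinker-7B.json")
with results_path.open() as f:
    results = json.load(f)

# Inspect the top-level structure; the exact schema depends on the script.
print(type(results))
```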
Our detailed evaluation results are listed below.
| Benchmark | GPT-4o (reported) | Qwen2.5-VL-7B | OpenVLThinker-7B |
|---|---|---|---|
| MathVista | 63.8 | 68.5 | 70.2 |
| MathVerse | 50.2 | 46.8 | 47.9 |
| MathVision (testmini) | - | 27.6 | 29.6 |
| MathVision (full) | 30.4 | 24.0 | 25.3 |
Table 1: Evaluation results across multimodal reasoning benchmarks, including MathVista, MathVerse, and MathVision. We include the reported performance of GPT-4o as a reference. OpenVLThinker-7B consistently improves upon Qwen2.5-VL-7B across all benchmarks, surpassing GPT-4o on MathVista.
```bibtex
@misc{deng2025openvlthinker,
      title={OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement},
      author={Yihe Deng and Hritik Bansal and Fan Yin and Nanyun Peng and Wei Wang and Kai-Wei Chang},
      year={2025},
      eprint={2503.17352},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.17352},
}
```
We thank LLaMA-Factory and EasyR1 for open-sourcing the model training frameworks that we used in this work.