OpenVLThinker: An Early Exploration to Vision-Language Reasoning via Iterative Self-Improvement

🤗 Model | 📝 Blog | 📄 Paper

Authors: Yihe Deng, Hritik Bansal, Fan Yin, Nanyun Peng, Wei Wang, Kai-Wei Chang

Our study investigates whether R1-like reasoning capabilities can be successfully integrated into large vision-language models (LVLMs) and assesses their impact on challenging multimodal reasoning tasks. We consider an approach that iteratively applies supervised fine-tuning (SFT) on lightweight training data and reinforcement learning (RL) to further improve model generalization.
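To make the iteration concrete, here is a minimal conceptual sketch of such an SFT-then-RL loop. It is not the training code from this repository; sft_train and rl_train are hypothetical placeholders for the fine-tuning and reinforcement-learning stages.

def iterative_self_improvement(model, sft_data, rl_data, num_iterations=3):
    """Alternate lightweight SFT and RL, reusing the improved model each round."""
    for _ in range(num_iterations):
        # Supervised fine-tuning on lightweight reasoning-trace data.
        model = sft_train(model, sft_data)
        # Reinforcement learning to further improve generalization.
        model = rl_train(model, rl_data)
    return model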

As an early result, we present OpenVLThinker, an LVLM that exhibits consistently improved reasoning performance on challenging benchmarks such as MathVista, MathVerse, and MathVision.

Evaluation

Our model has been evaluated on several challenging benchmarks:

  • MathVista
  • MathVerse
  • MathVision

We provide two evaluation scripts to handle different answer formats:

  1. For OpenVLThinker evaluation:
     python evaluation/eval_openvlthinker.py --dataset mathvista
  2. For Qwen2.5-VL evaluation:
     python evaluation/eval_qwen.py --dataset mathvista

The scripts differ in how they handle answer extraction, as sketched below:

  • eval_openvlthinker.py expects answers in the format <answer>...</answer>
  • eval_qwen.py expects answers in the format \boxed{...}
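For illustration, the two formats can be parsed roughly as follows. This is a minimal sketch of the extraction logic under the formats described above, not the exact code in the scripts.

import re

def extract_openvlthinker_answer(response):
    # Final answer wrapped in <answer>...</answer> tags.
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return match.group(1).strip() if match else None

def extract_qwen_answer(response):
    # Final answer wrapped in \boxed{...} (non-nested braces assumed).
    match = re.search(r"\\boxed\{([^{}]*)\}", response)
    return match.group(1).strip() if match else None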

For MathVerse evaluation, we verify the model's responses with GPT-4V, since its free-form questions produce more diverse response formats:

python evaluation/verify_mathverse_gpt4.py \
    --responses_file ./evaluation/outputs/mathverse_OpenVLThinker-7B.json \
    --output_dir ./evaluation/outputs

Note: This requires an OpenAI API key to be set in your environment variables.
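The verification step amounts to asking an OpenAI judge model whether a free-form response matches the reference answer. The sketch below only illustrates the idea; the prompt, judge model, and function name are assumptions, and the actual logic lives in evaluation/verify_mathverse_gpt4.py.

import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def judge_answer(question, ground_truth, model_response):
    # Ask the judge model whether the response agrees with the reference answer.
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {ground_truth}\n"
        f"Model response: {model_response}\n"
        "Does the model response reach the same final answer as the reference? Answer Yes or No."
    )
    reply = client.chat.completions.create(
        model="gpt-4o",  # assumption; substitute the judge model used in the script
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content.strip().lower().startswith("yes")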

Optional arguments for both evaluation scripts (an example invocation is shown after the list):

  • --cuda: Specify CUDA device number (default: 0)
  • --model_path: Path to the model (default: "ydeng9/OpenVLThinker-7B" for OpenVLThinker, "Qwen/Qwen2.5-VL-7B-Instruct" for Qwen)
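For example, to evaluate OpenVLThinker on MathVista using GPU 1 with an explicit model path (the values shown are only illustrative):

python evaluation/eval_openvlthinker.py --dataset mathvista --cuda 1 --model_path ydeng9/OpenVLThinker-7B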

The evaluation results will be saved in the ./evaluation/outputs directory. Note: results may vary slightly across different GPUs.

Our detailed evaluation results are listed below:

Benchmark                GPT-4o (reported)   Qwen2.5-VL-7B   OpenVLThinker-7B
MathVista                63.8                68.5            70.2
MathVerse                50.2                46.8            47.9
MathVision (testmini)    -                   27.6            29.6
MathVision (full)        30.4                24.0            25.3

Table 1: Evaluation results across multimodal reasoning benchmarks, including MathVista, MathVerse, and MathVision. We include the reported performance of GPT-4o as a reference. OpenVLThinker-7B consistently improves upon the performance of Qwen2.5-VL-7B and surpasses the reported GPT-4o result on MathVista.

Citation

@misc{deng2025openvlthinker,
      title={OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement}, 
      author={Yihe Deng and Hritik Bansal and Fan Yin and Nanyun Peng and Wei Wang and Kai-Wei Chang},
      year={2025},
      eprint={2503.17352},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.17352}, 
}

Acknowledgments

We thank LLaMA-Factory and EasyR1 for open-sourcing the model training frameworks that we used in this work.
