E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding

Ye Liu^1,2, Zongyang Ma^2,3, Zhongang Qi^2, Yang Wu^4, Ying Shan^2, Chang Wen Chen^1

^1 The Hong Kong Polytechnic University  ^2 ARC Lab, Tencent PCG
^3 Institute of Automation, Chinese Academy of Sciences  ^4 Tencent AI Lab

E.T. Bench (Event-Level & Time-Sensitive Video Understanding Benchmark) is a comprehensive suite for open-ended event-level video-language understanding. The project makes three contributions:

  • E.T. Bench: A large-scale, high-quality benchmark for event-level and time-sensitive video understanding, comprising 7.3K samples across 12 tasks, built on 7K videos (251.4 hours in total) spanning 8 domains.
  • E.T. Chat: A multi-modal large language model (MLLM) that specializes in time-sensitive, video-conditioned chatting. It reformulates timestamp prediction as a novel embedding matching problem (see the sketch after this list).
  • E.T. Instruct 164K: A meticulously collected instruction-tuning dataset tailored for time-sensitive video understanding scenarios.
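
For intuition, the sketch below illustrates one way "timestamp prediction as embedding matching" can work: instead of decoding a timestamp as literal text, the language model emits a query embedding, which is matched against per-frame video features, and the best-matching frame index is converted back to seconds. All names, shapes, and the cosine-similarity choice here are illustrative assumptions, not the released E.T. Chat implementation.

```python
# Minimal sketch of timestamp prediction via embedding matching.
# Hypothetical interface and shapes; not the released E.T. Chat code.
import torch
import torch.nn.functional as F

def match_timestamp(query_emb: torch.Tensor,
                    frame_embs: torch.Tensor,
                    fps: float = 1.0) -> float:
    """Map a model-predicted query embedding to a timestamp.

    query_emb:  (D,)   embedding the LLM emits in place of a literal
                       timestamp token (assumed interface).
    frame_embs: (T, D) one feature vector per sampled video frame.
    fps:        frame-sampling rate used to index the video (assumed).
    """
    # Cosine similarity between the query and every frame feature.
    sims = F.cosine_similarity(query_emb.unsqueeze(0), frame_embs, dim=-1)  # (T,)
    # The best-matching frame index maps back to a time in seconds.
    return int(sims.argmax()) / fps

# Toy usage: 32 sampled frames with 256-dim features; the query is a
# slightly perturbed copy of frame 10, so the match lands near 10 s.
frames = torch.randn(32, 256)
query = frames[10] + 0.01 * torch.randn(256)
print(match_timestamp(query, frames, fps=1.0))  # ~10.0
```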

We focus on four essential capabilities for time-sensitive video understanding: referring, grounding, dense captioning, and complex understanding. [Figure: examples of each capability, categorized by background color]
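
Capabilities such as grounding and dense captioning require predicting time spans, which are commonly scored by temporal IoU. The sketch below shows the standard definition of that metric; it is illustrative only, not necessarily E.T. Bench's exact evaluation protocol (see the Benchmark page for that).

```python
# Temporal IoU: overlap between a predicted and a ground-truth
# [start, end] span in seconds (common metric; illustrative only).
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

print(temporal_iou((10.0, 20.0), (12.0, 25.0)))  # 8 / 15 ≈ 0.533
```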

🔥 News

  • 2025.01.17 📚 We release the inference code for E.T. Chat on Charades-STA. See here for details.
  • 2024.09.28 ⭐️ Code, model, and dataset release.
  • 2024.09.27 🎉 E.T. Bench has been accepted to NeurIPS 2024 (Datasets and Benchmarks Track).

🏆 Leaderboard

Our online leaderboard is under construction. Stay tuned!

🔮 Benchmark

Please refer to the Benchmark page for details about E.T. Bench.

🛠️ Model

Please refer to the Model page for training and testing E.T. Chat.

📦 Dataset

Please refer to the Dataset page for downloading E.T. Instruct 164K.

📖 Citation

Please cite our paper if you find this project helpful.

@inproceedings{liu2024etbench,
  title={E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding},
  author={Liu, Ye and Ma, Zongyang and Qi, Zhongang and Wu, Yang and Chen, Chang Wen and Shan, Ying},
  booktitle={Neural Information Processing Systems (NeurIPS)},
  year={2024}
}

💡 Acknowledgements

This project is built upon the following repositories; many thanks to their authors:

LLaVA, LAVIS, EVA, LLaMA-VID, TimeChat, densevid_eval