E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding

Ye Liu^1,2, Zongyang Ma^2,3, Zhongang Qi^2, Yang Wu^4, Ying Shan^2, Chang Wen Chen^1

^1 The Hong Kong Polytechnic University  ^2 ARC Lab, Tencent PCG
^3 Institute of Automation, Chinese Academy of Sciences  ^4 Tencent AI Lab

E.T. Bench (Event-Level & Time-Sensitive Video Understanding Benchmark) is a comprehensive suite for open-ended event-level video-language understanding. The project makes three contributions:

  • E.T. Bench: A large-scale, high-quality benchmark for event-level and time-sensitive video understanding, comprising 7.3K samples across 12 tasks, built on 7K videos (251.4 hours in total) spanning 8 domains.
  • E.T. Chat: A multi-modal large language model (MLLM) that specializes in time-sensitive, video-conditioned chatting. It reformulates timestamp prediction as a novel embedding matching problem (see the sketch after this list).
  • E.T. Instruct 164K: A meticulously collected instruction-tuning dataset tailored for time-sensitive video understanding scenarios.
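
For intuition, the sketch below illustrates one way "timestamp prediction as embedding matching" can work: instead of decoding a timestamp as literal text, the language model emits a query embedding, which is matched against per-frame video features, and the best-matching frame index is converted back to seconds. All names, shapes, and the cosine-similarity choice here are illustrative assumptions, not the released E.T. Chat implementation.

```python
# Minimal sketch of timestamp prediction via embedding matching.
# Hypothetical interface and shapes; not the released E.T. Chat code.
import torch
import torch.nn.functional as F

def match_timestamp(query_emb: torch.Tensor,
                    frame_embs: torch.Tensor,
                    fps: float = 1.0) -> float:
    """Map a model-predicted query embedding to a timestamp.

    query_emb:  (D,)   embedding the LLM emits in place of a literal
                       timestamp token (assumed interface).
    frame_embs: (T, D) one feature vector per sampled video frame.
    fps:        frame-sampling rate used to index the video (assumed).
    """
    # Cosine similarity between the query and every frame feature.
    sims = F.cosine_similarity(query_emb.unsqueeze(0), frame_embs, dim=-1)  # (T,)
    # The best-matching frame index maps back to a time in seconds.
    return int(sims.argmax()) / fps

# Toy usage: 32 sampled frames with 256-dim features; the query is a
# slightly perturbed copy of frame 10, so the match lands near 10 s.
frames = torch.randn(32, 256)
query = frames[10] + 0.01 * torch.randn(256)
print(match_timestamp(query, frames, fps=1.0))  # ~10.0
```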

We focus on four essential capabilities for time-sensitive video understanding: referring, grounding, dense captioning, and complex understanding. [Figure: examples of each capability, categorized by background color]
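
Capabilities such as grounding and dense captioning require predicting time spans, which are commonly scored by temporal IoU. The sketch below shows the standard definition of that metric; it is illustrative only, not necessarily E.T. Bench's exact evaluation protocol (see the Benchmark page for that).

```python
# Temporal IoU: overlap between a predicted and a ground-truth
# [start, end] span in seconds (common metric; illustrative only).
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

print(temporal_iou((10.0, 20.0), (12.0, 25.0)))  # 8 / 15 ≈ 0.533
```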

🔥 News

  • 2025.01.17 📚 We release the inference code for E.T. Chat on Charades-STA. See here for details.
  • 2024.09.28 ⭐️ Code, model, and dataset release.
  • 2024.09.27 🎉 E.T. Bench has been accepted to NeurIPS 2024 (Datasets and Benchmarks Track).

🏆 Leaderboard

Our online leaderboard is under construction. Stay tuned!

🔮 Benchmark

Please refer to the Benchmark page for details about E.T. Bench.

🛠️ Model

Please refer to the Model page for training and testing E.T. Chat.

📦 Dataset

Please refer to the Dataset page for downloading E.T. Instruct 164K.

📖 Citation

Please cite our paper if you find this project helpful.

@inproceedings{liu2024etbench,
  title={E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding},
  author={Liu, Ye and Ma, Zongyang and Qi, Zhongang and Wu, Yang and Chen, Chang Wen and Shan, Ying},
  booktitle={Neural Information Processing Systems (NeurIPS)},
  year={2024}
}

💡 Acknowledgements

This project is built upon the following repositories; many thanks to their authors:

LLaVA, LAVIS, EVA, LLaMA-VID, TimeChat, densevid_eval