🌐 Homepage | 🏆 Leaderboard | 📖 Arxiv Paper | 🤗 Paper | 🤗 Dataset | 🦜 Tweets
The project introduces OmniBench, a novel benchmark designed to rigorously evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. We define models capable of such tri-modal processing as omni-language models (OLMs).
This table shows the omni-language models in the full evaluation setting in OmniBench, with the "Image & Audio", "Audio", and "Image" as input contexts and accuracy as metric. More results could be found at the live leaderboard.
Input Context | Image & Audio | Audio | Image |
---|---|---|---|
AnyGPT (7B) | 18.04% | 16.20% | 20.05% |
video-SALMONN (13B) | 35.64% | 35.90% | 34.94% |
UnifiedIO2-large (1.1B) | 27.06% | 29.07% | 29.07% |
UnifiedIO2-xlarge (3.2B) | 38.00% | 31.17% | 34.76% |
UnifiedIO2-xxlarge (6.8B) | 33.98% | 32.49% | 33.45% |
Gemini-1.5-Pro | 47.56% | 38.53% | 34.68% |
Reka-core-20240501 | 36.10% | 35.07% | 34.39% |
python inference/demo_api_call.py --output-file your_model_inference_output.json
Run the ablation setting without image (audio+text) or without audio (image+text).
python inference/demo_api_call.py --no-image --output-file your_model_inference_output.no-image.json
python inference/demo_api_call.py --no-audio --output-file your_model_inference_output.no-image.json
python inference/calculate_metrics.py --input-file dataset/batch-5_1142_20240817.jsonl --inference-file your_model_inference_output.jsonl
The dataset consists of the following keys:
"index"
: an integer suggests the question id."task type"
: a string suggests one of the 7 task types."audio type"
: a string suggests one of the 3 audio types (speech, sound event and music)."question"
: a string suggests the question."options"
: a list of four strings for multi-choice questions."answer"
: a string suggesting the correct response, must appear in"options"
."audio_path"
: the basename of the audio file, need to prependmm_data/audio
before using."image_path"
: the basename of the image file, need to prependmm_data/image
before using."audio"
(for HF version only): contains the numpy array for the wavfile."image"
(for HF version only): contains thePIL.Image()
object for the image."audio content"
: the human-annotated audio transcripts, used in text alternative experiments."image content"
: the VLM-generated caption for the image, used in text alternative experiments.
from datasets import load_dataset
dataset = load_dataset("m-a-p/OmniBench")
# check on the data samples
print(dataset)
print(dataset['train'][0])
The local version data is placed at dataset/batch-5_1142_20240817.jsonl
. You will need to use git lfs
to pull the folder mm_data
for the images and audios.
@misc{li2024omnibench,
title={OmniBench: Towards The Future of Universal Omni-Language Models},
author={Yizhi Li and Ge Zhang and Yinghao Ma and Ruibin Yuan and Kang Zhu and Hangyu Guo and Yiming Liang and Jiaheng Liu and Jian Yang and Siwei Wu and Xingwei Qu and Jinjie Shi and Xinyue Zhang and Zhenzhu Yang and Xiangzhou Wang and Zhaoxiang Zhang and Zachary Liu and Emmanouil Benetos and Wenhao Huang and Chenghua Lin},
year={2024},
eprint={2409.15272},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.15272},
}