LVOmniBench

Pioneering Long Audio-Video Understanding Evaluation for Omnimodal LLMs


LVOmniBench Team

Examples of the LVOmniBench dataset.

🔔 News

🌟 [2026.03.19] We are proud to launch LVOmniBench, the first comprehensive benchmark for evaluating OmniLLMs on long audio-video understanding!


Introduction

Recent advancements in omnimodal large language models (OmniLLMs) have significantly improved the comprehension of audio and video inputs. However, current evaluations primarily focus on short audio and video clips ranging from 10 seconds to 5 minutes, failing to reflect the demands of real-world applications, where videos typically run for tens of minutes.

To address this critical gap, we introduce LVOmniBench, a new benchmark designed specifically for cross-modal comprehension of long-form audio and video. We curated a diverse collection of long videos with durations ranging from 10 to 90 minutes and an average duration of 2,069 seconds (about 34.5 minutes). The dataset comprises 275 high-quality videos and 1,014 manually constructed QA pairs, each explicitly designed to require joint reasoning across the audio and visual modalities.
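To make the dataset composition concrete, here is a minimal sketch of what a long audio-video QA record might look like. The field names and the duration buckets below are illustrative assumptions, not LVOmniBench's official schema; only the 10–90 minute range and the 2,069 s average come from the text above.

```python
from dataclasses import dataclass, field

# Hypothetical record layout for a long audio-video QA item.
# Field names are illustrative; this is NOT the official LVOmniBench schema.
@dataclass
class AVQAPair:
    video_id: str
    duration_s: int              # full-video duration in seconds
    question: str
    options: list[str] = field(default_factory=list)  # multiple-choice candidates
    answer: str = ""             # ground-truth option label, e.g. "B"

def duration_bucket(duration_s: int) -> str:
    """Bucket a video by length; the 10-90 min span matches the stated range."""
    minutes = duration_s / 60
    if minutes < 30:
        return "short"
    elif minutes < 60:
        return "medium"
    return "long"

# The reported average duration of 2,069 s is roughly 34.5 minutes,
# which lands in the middle bucket under this (assumed) split.
print(duration_bucket(2069))  # -> medium
```

The 30/60-minute cut points are arbitrary choices for the sketch; the benchmark's own difficulty or duration tiers may differ.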

Question Distribution

Benchmark Statistics & Results

Leaderboard

| # | Model | Modality | Low | Medium | High | Understanding | Perception | Inference | Logical | Avg. |
|---|-------|----------|-----|--------|------|---------------|------------|-----------|---------|------|
| 1 | Gemini-3.0-Pro | A + V | 79.3 | 68.1 | 45.0 | 73.3 | 60.1 | 65.4 | 67.5 | 65.8 |
| 2 | Gemini-3.0-Flash | A + V | 76.6 | 63.0 | 31.0 | 67.7 | 54.7 | 60.6 | 51.8 | 59.0 |
| 3 | Gemini-3.0-Flash | V | 55.6 | 49.3 | 30.6 | 51.4 | 42.9 | 42.3 | 48.5 | 46.2 |
| 4 | Gemini-2.0-Flash | A + V | 57.0 | 48.9 | 29.8 | 41.4 | 38.5 | 49.1 | 42.1 | 42.9 |
| 5 | Qwen3-VL-30B | V | 42.9 | 35.2 | 30.1 | 37.4 | 39.9 | 32.5 | 30.9 | 36.3 |
| 6 | Qwen3-Omni-30B | A + V | 41.0 | 36.3 | 28.6 | 33.0 | 35.8 | 40.5 | 29.1 | 35.8 |
| 7 | Qwen3-VL-8B | V | 37.1 | 36.5 | 32.1 | 32.2 | 37.1 | 36.7 | 34.6 | 35.6 |
| 8 | MiniCPM-o 4.5 | A + V | 43.4 | 34.1 | 25.1 | 35.7 | 31.9 | 39.1 | 32.7 | 34.8 |
| 9 | Ming-Omni-2.0-100B | A + V | 41.3 | 32.9 | 29.3 | 30.0 | 36.5 | 33.9 | 39.1 | 34.6 |
| 10 | video-SALMONN 2+ 7B | A + V | 40.9 | 30.2 | 26.7 | 30.0 | 36.3 | 30.3 | 31.8 | 32.7 |
| 11 | Qwen2.5-Omni-7B | A + V | 37.7 | 29.9 | 28.3 | 29.1 | 34.9 | 31.8 | 28.2 | 32.0 |
| 12 | VideoLLaMA2-7B | A + V | 27.0 | 26.8 | 28.2 | 23.9 | 28.2 | 29.4 | 24.6 | 27.2 |
| 13 | Qwen2-Audio | A | 27.0 | 25.2 | 21.2 | 25.2 | 22.0 | 26.3 | 29.1 | 24.7 |
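Leaderboard scores like those above are typically exact-match accuracies over multiple-choice QA pairs, reported as percentages. A minimal sketch of such scoring follows; the function and its comparison rule are illustrative assumptions, not LVOmniBench's official evaluation code.

```python
def accuracy(predictions: list[str], references: list[str]) -> float:
    """Percentage of questions where the predicted option label matches
    the ground-truth label (case-insensitive). Illustrative only."""
    assert len(predictions) == len(references), "mismatched lengths"
    correct = sum(p.strip().upper() == r.strip().upper()
                  for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)

# Toy example: 3 of 4 answers correct.
print(accuracy(["A", "b", "C", "D"], ["A", "B", "C", "A"]))  # -> 75.0
```

Real evaluation harnesses usually also need an answer-extraction step to pull the option letter out of a model's free-form response before this comparison.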

Dataset Statistics

Performance Across Tasks





Cite LVOmniBench


@article{tao2026lvomnibench,
title={LVOmniBench: Pioneering Long Audio-Video Understanding Evaluation for Omnimodal LLMs},
author={Keda Tao and Yuhua Zheng and Jia Xu and Wenjie Du and Kele Shao and Hesong Wang and Xueyi Chen and Xin Jin and Junhan Zhu and Bohan Yu and Weiqiang Wang and Jian Liu and Can Qin and Yulun Zhang and Ming-Hsuan Yang and Huan Wang},
journal={arXiv preprint arXiv:2603.19217},
year={2026}
}