
OmniAgent: Audio-Guided Active Perception Agent for Omnimodal
Audio-Video Understanding

¹Zhejiang University   ²Westlake University   ³Ant Group
*Corresponding Authors.

Abstract

Omnimodal large language models have made significant strides in unifying the audio and visual modalities; however, they often lack fine-grained cross-modal understanding and struggle with multimodal alignment. To address these limitations, we introduce OmniAgent, a fully audio-guided active perception agent that dynamically orchestrates specialized tools to achieve finer-grained audio-visual reasoning. Unlike prior work that relies on rigid, static workflows and dense frame captioning, OmniAgent demonstrates a paradigm shift from passive response generation to active multimodal inquiry. It employs dynamic planning to autonomously orchestrate tool invocation on demand, strategically concentrating perceptual attention on task-relevant cues. Central to our approach is a novel coarse-to-fine audio-guided perception paradigm, which leverages audio cues to localize temporal events and guide subsequent reasoning. Extensive empirical evaluations on three audio-video understanding benchmarks demonstrate that OmniAgent achieves state-of-the-art performance, surpassing leading open-source and proprietary models by substantial margins of 10-20% in accuracy.
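To make the coarse-to-fine audio-guided paradigm concrete, the sketch below illustrates the loop in miniature: audio cues first localize candidate temporal events, then a planner invokes one specialized tool per event instead of densely captioning every frame. This is an illustrative sketch only, not OmniAgent's actual implementation: every name in it (localize_audio_events, plan_tool, the tool stubs) is a hypothetical placeholder, and the stubs return canned values.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class AudioEvent:
    start: float  # event start time, in seconds
    end: float    # event end time, in seconds
    label: str    # coarse audio tag, e.g. "speech" or "glass breaking"

# Hypothetical tool stubs; a real system would back these with models.

def localize_audio_events(video: str, query: str) -> List[AudioEvent]:
    """Coarse stage: temporally localize query-relevant audio events."""
    return [AudioEvent(12.0, 15.5, "speech")]  # stubbed output

def caption_segment(video: str, ev: AudioEvent) -> str:
    """Fine-stage tool: caption only the frames inside the localized window."""
    return f"[caption of {video}, {ev.start:.1f}s-{ev.end:.1f}s]"

def transcribe_segment(video: str, ev: AudioEvent) -> str:
    """Fine-stage tool: transcribe speech inside the localized window."""
    return f"[transcript of {video}, {ev.start:.1f}s-{ev.end:.1f}s]"

def plan_tool(query: str, ev: AudioEvent) -> Callable[[str, AudioEvent], str]:
    """Dynamic planning: choose a tool on demand based on the audio cue."""
    return transcribe_segment if ev.label == "speech" else caption_segment

def answer_query(video: str, query: str) -> str:
    """Active perception loop: audio guides where and how to look."""
    observations = [plan_tool(query, ev)(video, ev)
                    for ev in localize_audio_events(video, query)]
    # A real agent would reason over these observations with an LLM;
    # here we simply join them.
    return " ".join(observations)

print(answer_query("demo.mp4", "What is said after the knock on the door?"))

The design choice this mirrors is that perception is active: the planner decides which tool to run and restricts it to the audio-localized window, concentrating computation on task-relevant cues.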

Methodology

Figure: Overview of the OmniAgent framework.

Experimental Results

Figure: Experimental results of OmniAgent on three audio-video understanding benchmarks.

Contact

This work was produced by the Westlake ENCODE LAB.

For questions, please contact KD.TAO@outlook.com.

BibTeX


@article{omniagent,
  title={OmniAgent: Audio-Guided Active Perception Agent for Omnimodal Audio-Video Understanding},
  author={Tao, Keda and Du, Wenjie and Yu, Bohan and Wang, Weiqiang and Liu, Jian and Wang, Huan},
  journal={arXiv preprint arXiv:2512.23646},
  year={2025}
}