Bridging Audio and Vision: Zero-Shot Audiovisual Segmentation by Connecting Pretrained Models

Interspeech 2025
Seung-jae Lee, Paul Hongsuck Seo
Korea University, South Korea

Abstract

Audiovisual segmentation (AVS) aims to identify visual regions corresponding to sound sources, playing a vital role in video understanding, surveillance, and human-computer interaction. Traditional AVS methods depend on large-scale pixel-level annotations, which are costly and time-consuming to obtain. To address this, we propose a novel zero-shot AVS framework that eliminates task-specific training by leveraging multiple pretrained models. Our approach integrates audio, vision, and text representations to bridge modality gaps, enabling precise sound source segmentation without AVS-specific annotations. We systematically explore different strategies for connecting pretrained models and evaluate their efficacy across multiple datasets. Experimental results demonstrate that our framework achieves state-of-the-art zero-shot AVS performance, highlighting the effectiveness of multimodal model integration for fine-grained audiovisual segmentation.

Sample Videos

Example 1

Example 2

Example 3

Method

Overview of the proposed zero-shot AVS approaches. Each subfigure illustrates a strategy for converting audiovisual inputs into textual queries for the RIS model.
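To make the audio-to-text-to-segmentation connection concrete, here is a minimal Python sketch of one such strategy under stated assumptions: the `AudioTagger` and `RISModel` interfaces, and the prompt template, are illustrative placeholders, not the specific models or prompts used in the paper.

```python
# Minimal sketch of one "audio -> text -> referring image segmentation (RIS)"
# connection strategy for zero-shot AVS. The AudioTagger / RISModel interfaces
# below are hypothetical wrappers, not the actual pretrained models in the paper.

from typing import List, Protocol


class AudioTagger(Protocol):
    def tag(self, audio_clip) -> List[str]:
        """Return text labels describing the sounds, e.g. ['dog barking']."""
        ...


class RISModel(Protocol):
    def segment(self, image, text_query: str):
        """Return a binary mask for the region referred to by text_query."""
        ...


def segment_sounding_object(audio_clip, frame, tagger: AudioTagger, ris: RISModel):
    # 1) Describe the sound with a pretrained audio model (no AVS-specific training).
    labels = tagger.tag(audio_clip)

    # 2) Convert the audio description into a textual query for the RIS model.
    #    A simple prompt template; the paper compares several such strategies.
    query = f"the {labels[0]} that is making the sound"

    # 3) Let the pretrained RIS model localize the queried object in the frame.
    return ris.segment(frame, query)
```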

Results

Performance comparison with state-of-the-art unsupervised AVS methods under a zero-shot setting. Our approach outperforms prior methods without requiring additional training.

Qualitative comparisons of segmentation results across datasets. Our method produces fine-grained masks for sounding objects, outperforming prior methods in visual accuracy and boundary alignment.

BibTeX

@article{lee2025bridging,
  title={Bridging Audio and Vision: Zero-Shot Audiovisual Segmentation by Connecting Pretrained Models},
  author={Lee, Seung-jae and Seo, Paul Hongsuck},
  journal={arXiv preprint arXiv:2506.06537},
  year={2025}
}