Conditional Flow Matching for Visually-Guided Acoustic Highlighting
Abstract
Visually-guided acoustic highlighting seeks to rebalance audio in alignment with the accompanying video, creating a coherent audio-visual experience. While visual saliency and enhancement have been widely studied, acoustic highlighting remains underexplored, often leading to misalignment between visual and auditory focus. Existing approaches use discriminative models, which struggle with the inherent ambiguity in audio remixing, where no natural one-to-one mapping exists between poorly-balanced and well-balanced audio mixes. To address this limitation, we reframe the task as a generative problem and introduce a Conditional Flow Matching (CFM) framework. A key challenge in iterative flow-based generation is that early prediction errors, particularly in selecting the correct source to enhance, compound over steps and push trajectories off-manifold. To counter this drift, we introduce a rollout loss that penalizes the error at the final step, encouraging self-correcting trajectories and stabilizing long-range flow integration. We further propose a conditioning module that fuses audio and visual cues before vector field regression, enabling explicit cross-modal source selection. Extensive quantitative and qualitative evaluations show that our method consistently surpasses the previous state-of-the-art discriminative approach, establishing that visually-guided audio remixing is best addressed through generative modeling.
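The two training signals named in the abstract, vector field regression and a rollout penalty on the final step, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the abstract does not specify the probability path, the integrator, or the loss form, so the straight-line path, Euler integration, and mean-squared error used here are all assumptions. The names `cfm_loss`, `rollout`, `rollout_loss`, `v_field`, `cond`, and `n_steps` are hypothetical, with `cond` standing in for the fused audio-visual conditioning.

```python
import numpy as np


def cfm_loss(v_field, x0, x1, cond, t):
    """Flow-matching regression loss on a straight-line path (an assumption).

    x0: poorly-balanced input features, shape (batch, dim)
    x1: well-balanced target features, shape (batch, dim)
    t:  per-sample times in [0, 1], shape (batch,)
    """
    t_b = t[:, None]                       # broadcast time over the feature axis
    x_t = (1.0 - t_b) * x0 + t_b * x1      # point on the interpolation path
    target = x1 - x0                       # constant velocity of that path
    pred = v_field(x_t, t, cond)           # conditioned vector field prediction
    return float(np.mean((pred - target) ** 2))


def rollout(v_field, x0, cond, n_steps):
    """Euler-integrate the vector field from t=0 to t=1 (assumed integrator)."""
    x = x0.copy()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = np.full(x.shape[0], k * dt)    # same time for every sample in the batch
        x = x + dt * v_field(x, t, cond)
    return x


def rollout_loss(v_field, x0, x1, cond, n_steps=8):
    """Penalize the gap between the integrated endpoint and the target."""
    x_end = rollout(v_field, x0, cond, n_steps)
    return float(np.mean((x_end - x1) ** 2))
```

In this sketch the regression loss checks the field locally at sampled times, while the rollout loss checks where the full integration ends up, so a field with a small systematic bias is penalized by its accumulated endpoint error rather than only by its per-step error. In training one would presumably combine the two terms with a weighting factor, which the abstract does not specify.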