AIDB Daily Papers

A Spatio-Temporal Audio Language Model That Captures Sound Source Motion

Original title: Spatio-Temporal Audio Language Modeling for Dynamic Sound Sources

Authors: Oh Hyun-Bin, Kazuki Shimada, Yuhta Takida, Kim Sung-Bin, Toshimitsu Uesaka, Takashi Shibuya, Kyeongyoon Lee, Tae-Hyun Oh, Yuki Mitsufuji

Published: 2026-06-12 | Fields: LLM, NLP, Vision-Language-Action, cs.CL, cs.AI, cs.SD

Note: The Japanese title and key points were generated automatically by AI. Refer to the original paper for the exact content.

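Key points

Built a spatio-temporal audio QA dataset and benchmark for understanding where sound sources are and how they move.
Proposed a new encoder and language model that jointly learn the motion and the semantics of sound sources.
The proposed method outperformed existing methods at understanding both the motion and the semantics of sound sources.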

Abstract

Sound events are entities with semantic identities, locations, and trajectories, but current audio-language models usually reason about clips as global event content. Conversely, sound event localization models track source directions over time but offer limited semantic coverage for language reasoning. To address this gap, we introduce ST-AudioQA, a spatio-temporal audio QA dataset and benchmark built from first-order ambisonic (FOA) renderings of static and moving sound sources. Each scene provides source identity, activity, direction, distance, and motion metadata, enabling dense trajectory supervision and questions about what is sounding, where it is, how it moves, and how sources relate. We further propose ST-Audio Encoder, a time-resolved FOA audio encoder that learns event semantics together with source trajectories, and ST-AudioLM, which connects the audio tokens from the encoder to an LLM for spatio-temporal audio QA. Experiments show that this representation improves the semantic-localization tradeoff and yields stronger reasoning performance than static spatial and localization-oriented baselines.
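The abstract mentions first-order ambisonic (FOA) renderings of moving sources with direction metadata. As a rough illustration only, the sketch below encodes a mono signal into four FOA channels along a per-sample direction trajectory, then reads the azimuth back from the channels. The channel order and gains (ACN order W, Y, Z, X with SN3D normalization) are an assumption here, and the helper names `encode_foa`, `linear_trajectory`, and `estimate_azimuth_deg` are hypothetical. This is not the paper's rendering pipeline, which the abstract does not describe, and it ignores distance and room acoustics.

```python
import math


def linear_trajectory(start_deg, end_deg, n):
    """Return n angles (degrees) moving linearly from start_deg to end_deg."""
    if n == 1:
        return [start_deg]
    step = (end_deg - start_deg) / (n - 1)
    return [start_deg + i * step for i in range(n)]


def encode_foa(mono, azimuth_deg, elevation_deg):
    """Encode a mono signal into FOA channels W, Y, Z, X.

    Assumes ACN channel order and SN3D gains. The three inputs are
    equal-length sequences, so the direction can change on every sample,
    which is how a moving source is represented in this sketch.
    """
    if not (len(mono) == len(azimuth_deg) == len(elevation_deg)):
        raise ValueError("signal and trajectory must have the same length")
    w, y, z, x = [], [], [], []
    for s, az_d, el_d in zip(mono, azimuth_deg, elevation_deg):
        az, el = math.radians(az_d), math.radians(el_d)
        w.append(s)                                # omnidirectional
        y.append(s * math.sin(az) * math.cos(el))  # left-right
        z.append(s * math.sin(el))                 # up-down
        x.append(s * math.cos(az) * math.cos(el))  # front-back
    return w, y, z, x


def estimate_azimuth_deg(w, y, x):
    """Per-sample azimuth estimate from the products W*Y and W*X."""
    return [math.degrees(math.atan2(wi * yi, wi * xi))
            for wi, yi, xi in zip(w, y, x)]


# A source sweeping from straight ahead (0 degrees) to the left (90 degrees).
signal = [1.0] * 5
azimuth = linear_trajectory(0.0, 90.0, 5)
elevation = [0.0] * 5
W, Y, Z, X = encode_foa(signal, azimuth, elevation)
recovered = estimate_azimuth_deg(W, Y, X)
```

For a single noiseless source the recovered azimuths match the input trajectory. Real recordings with overlapping sources and reverberation need far more than this, which is the gap that learned encoders such as the one proposed in the paper aim to address.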

Metadata

arXiv ID: 2606.14141
Categories: cs.SD, cs.AI, cs.CL
