AIDB Daily Papers

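Can AI Agents Read the Room? A Benchmark of Visual Social Intelligence in Multimodal Simulation

Original title: Can Agents Read the Room? Benchmarking Visual Social Intelligence in Multimodal Simulation

Authors: Shijun Wan, Xuehai Wu, Jiwen Zhang, Siyuan Wang, Zhongyu Wei

Published: 2026-06-13 | Fields: LLM, Multimodal, Computer Vision, cs.CL, AI Agents, AI Evaluation

Note: The Japanese title and key points were generated automatically by AI. Consult the original paper for the exact content.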

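Key Points

The authors developed AgentViSS, a new benchmark that evaluates visual social intelligence in multimodal social simulation.
Unlike existing text-based benchmarks, it measures the interaction ability of AI agents using visual information such as facial expressions and posture.
The evaluation shows that AI agents handle individual role enactment well, but still struggle with interaction regulation and with achieving outcomes grounded in visual information.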

Abstract

Social interaction depends on both language and visible social signals, such as facial expressions, posture, gaze, and emotional shifts. Yet existing social-agent benchmarks are largely text-based and rarely test whether multimodal agents can use visual cues to guide interaction. We introduce AgentViSS, a benchmark evaluating visual social intelligence in multimodal social simulation. It contains 240 scenarios, 585 role instances, and 2,340 role-task instances, combining aligned textual-visual evidence, structured role profiles, and four role-level tasks: expression task, characteristic task, interaction regulation task, and interaction outcome task. Evaluating seven recent MLLMs under verbalized-vision and direct-vision reveals a clear gap between local role enactment and interaction management: role-specific expression and conflict handling are near saturation, whereas interaction regulation and visually grounded outcome achievement remain substantially more difficult. The code is released at https://github.com/JunsWan/AgentViSS, and the dataset is available at https://huggingface.co/datasets/JunsWan/AgentViSS.
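
The counts in the abstract fit together if every role instance is paired with each of the four role-level tasks (585 × 4 = 2,340). The abstract does not state this pairing explicitly, so the sketch below treats it as an assumption. The class name `RoleTaskInstance`, the integer `role_id`, and the task identifiers are hypothetical and are not taken from the released dataset schema.

```python
from dataclasses import dataclass
from itertools import product

# The four role-level tasks named in the abstract (identifiers are illustrative).
TASKS = (
    "expression",
    "characteristic",
    "interaction_regulation",
    "interaction_outcome",
)


@dataclass(frozen=True)
class RoleTaskInstance:
    role_id: int  # hypothetical identifier for one role instance
    task: str     # one of TASKS


def enumerate_role_tasks(num_roles: int) -> list[RoleTaskInstance]:
    """Pair every role instance with every task (assumed, not confirmed by the source)."""
    return [RoleTaskInstance(r, t) for r, t in product(range(num_roles), TASKS)]


instances = enumerate_role_tasks(585)
print(len(instances))  # 585 roles x 4 tasks
```

Under this assumption the enumeration yields the same total the abstract reports for role-task instances.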

Metadata

arXiv ID: 2606.15152
Category: cs.CL
