AIDB Daily Papers
CrowdVLA: Embodied Vision-Language-Action Agents for Situation-Aware Crowd Simulation
※ The title and key points are AI-generated summaries; please consult the original paper for the exact content.
Key Points
- Proposes CrowdVLA, a new approach that models each pedestrian in a crowd simulation as a Vision-Language-Action (VLA) agent.
- Moves beyond conventional navigation confined to geometry and collision avoidance: agents interpret scene semantics and select actions based on visual observations and language instructions.
- Achieves perception-driven, consequence-aware decision making through environment reconstruction, LoRA fine-tuning, and exploration-based question answering, generating meaningful crowd behavior.
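The "motion skill" action space described in the abstract, a symbolic decision layer bridged to continuous locomotion, can be sketched as follows. This is a minimal illustration: the skill names, speeds, and heading parameters are assumptions for the sketch, not values from the paper.

```python
import math
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass(frozen=True)
class MotionSkill:
    """A symbolic skill mapped to continuous locomotion parameters."""
    name: str
    speed: float          # forward speed, m/s (illustrative)
    heading_delta: float  # change of heading per tick, radians (illustrative)

# Hypothetical discrete skill set an agent's language-level policy could choose from.
SKILLS: Dict[str, MotionSkill] = {
    "stop":       MotionSkill("stop", 0.0, 0.0),
    "walk":       MotionSkill("walk", 1.3, 0.0),
    "turn_left":  MotionSkill("turn_left", 1.0, +0.4),
    "turn_right": MotionSkill("turn_right", 1.0, -0.4),
    "cross_road": MotionSkill("cross_road", 1.6, 0.0),
}

def step(pos: Tuple[float, float], heading: float, skill: MotionSkill,
         dt: float = 0.5) -> Tuple[Tuple[float, float], float]:
    """Advance a pedestrian one tick using the skill's continuous parameters."""
    heading += skill.heading_delta
    x = pos[0] + skill.speed * math.cos(heading) * dt
    y = pos[1] + skill.speed * math.sin(heading) * dt
    return (x, y), heading
```

The symbolic layer picks a skill name; the continuous layer (`step`) turns it into motion, which is the kind of bridge the abstract attributes to the motion-skill action space.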
Abstract
Crowds do not merely move; they decide. Human navigation is inherently contextual: people interpret the meaning of space, social norms, and potential consequences before acting. Sidewalks invite walking, crosswalks invite crossing, and deviations are weighed against urgency and safety. Yet most crowd simulation methods reduce navigation to geometry and collision avoidance, producing motion that is plausible but rarely intentional. We introduce CrowdVLA, a new formulation of crowd simulation that models each pedestrian as a Vision-Language-Action (VLA) agent. Instead of replaying recorded trajectories, CrowdVLA enables agents to interpret scene semantics and social norms from visual observations and language instructions, and to select actions through consequence-aware reasoning. CrowdVLA addresses three key challenges (limited agent-centric supervision in crowd datasets, unstable per-frame control, and success-biased datasets) through: (i) agent-centric visual supervision via semantically reconstructed environments and Low-Rank Adaptation (LoRA) fine-tuning of a pretrained vision-language model, (ii) a motion skill action space that bridges symbolic decision making and continuous locomotion, and (iii) exploration-based question answering that exposes agents to counterfactual actions and their outcomes through simulation rollouts. Our results shift crowd simulation from motion-centric synthesis toward perception-driven, consequence-aware decision making, enabling crowds that move not just realistically, but meaningfully.
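Point (iii) of the abstract, exposing agents to counterfactual actions and their outcomes through simulation rollouts, can be sketched as a consequence-aware action selector. The toy simulator, the scoring terms (goal progress minus a collision penalty), and the candidate skills below are illustrative assumptions, not the paper's implementation.

```python
import math
from typing import Dict, Tuple

Vec = Tuple[float, float]

def rollout_score(pos: Vec, goal: Vec, obstacle: Vec,
                  velocity: Vec, steps: int = 4, dt: float = 0.5) -> float:
    """Simulate a short constant-velocity rollout and score its outcome:
    a heavy penalty on collision, otherwise negative distance to goal."""
    x, y = pos
    for _ in range(steps):
        x += velocity[0] * dt
        y += velocity[1] * dt
        if math.dist((x, y), obstacle) < 0.5:  # collision radius (assumed)
            return -100.0
    return -math.dist((x, y), goal)            # closer to goal = better

def choose_skill(pos: Vec, goal: Vec, obstacle: Vec,
                 skills: Dict[str, Vec]) -> str:
    """Pick the candidate skill whose simulated consequences score best."""
    return max(skills, key=lambda s: rollout_score(pos, goal, obstacle, skills[s]))

# Candidate skills as velocity vectors (m/s); "forward" walks into the obstacle,
# so the rollout reveals its consequence before the agent commits.
skills = {"forward": (1.3, 0.0), "veer_left": (1.0, 0.6), "stop": (0.0, 0.0)}
best = choose_skill(pos=(0.0, 0.0), goal=(5.0, 0.0), obstacle=(2.0, 0.0),
                    skills=skills)
# Here "forward" collides during its rollout, so the agent veers left instead.
```

Evaluating each candidate by rolling it forward, rather than imitating recorded trajectories, is what makes the resulting choice consequence-aware in the sense the abstract describes.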