AIDB Daily Papers
CrowdVLA: Embodied Vision-Language-Action Agents for Situation-Aware Crowd Simulation
※ The title and key points are AI-generated summaries; please consult the original paper for the exact content.
Key Points
- Proposes CrowdVLA, a new approach that models each pedestrian in a crowd simulation as a Vision-Language-Action (VLA) agent.
- Moves beyond conventional navigation confined to geometry and collision avoidance: agents interpret scene semantics and select actions based on visual observations and language instructions.
- Achieves perception-driven, consequence-aware decision making through environment reconstruction, LoRA fine-tuning, and exploration-based question answering, generating meaningful crowd behavior.
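The "motion skill" action space described in the abstract, a symbolic decision layer bridged to continuous locomotion, can be sketched as follows. This is a minimal illustration: the skill names, speeds, and heading parameters are assumptions for the sketch, not values from the paper.

```python
import math
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass(frozen=True)
class MotionSkill:
    """A symbolic skill mapped to continuous locomotion parameters."""
    name: str
    speed: float          # forward speed, m/s (illustrative)
    heading_delta: float  # change of heading per tick, radians (illustrative)

# Hypothetical discrete skill set an agent's language-level policy could choose from.
SKILLS: Dict[str, MotionSkill] = {
    "stop":       MotionSkill("stop", 0.0, 0.0),
    "walk":       MotionSkill("walk", 1.3, 0.0),
    "turn_left":  MotionSkill("turn_left", 1.0, +0.4),
    "turn_right": MotionSkill("turn_right", 1.0, -0.4),
    "cross_road": MotionSkill("cross_road", 1.6, 0.0),
}

def step(pos: Tuple[float, float], heading: float, skill: MotionSkill,
         dt: float = 0.5) -> Tuple[Tuple[float, float], float]:
    """Advance a pedestrian one tick using the skill's continuous parameters."""
    heading += skill.heading_delta
    x = pos[0] + skill.speed * math.cos(heading) * dt
    y = pos[1] + skill.speed * math.sin(heading) * dt
    return (x, y), heading
```

The symbolic layer picks a skill name; the continuous layer (`step`) turns it into motion, which is the kind of bridge the abstract attributes to the motion-skill action space.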
Abstract
Crowds do not merely move; they decide. Human navigation is inherently contextual: people interpret the meaning of space, social norms, and potential consequences before acting. Sidewalks invite walking, crosswalks invite crossing, and deviations are weighed against urgency and safety. Yet most crowd simulation methods reduce navigation to geometry and collision avoidance, producing motion that is plausible but rarely intentional. We introduce CrowdVLA, a new formulation of crowd simulation that models each pedestrian as a Vision-Language-Action (VLA) agent. Instead of replaying recorded trajectories, CrowdVLA enables agents to interpret scene semantics and social norms from visual observations and language instructions, and to select actions through consequence-aware reasoning. CrowdVLA addresses three key challenges (limited agent-centric supervision in crowd datasets, unstable per-frame control, and success-biased datasets) through: (i) agent-centric visual supervision via semantically reconstructed environments and Low-Rank Adaptation (LoRA) fine-tuning of a pretrained vision-language model, (ii) a motion skill action space that bridges symbolic decision making and continuous locomotion, and (iii) exploration-based question answering that exposes agents to counterfactual actions and their outcomes through simulation rollouts. Our results shift crowd simulation from motion-centric synthesis toward perception-driven, consequence-aware decision making, enabling crowds that move not just realistically, but meaningfully.
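Point (iii) of the abstract, exposing agents to counterfactual actions and their outcomes through simulation rollouts, can be sketched as a consequence-aware action selector. The toy simulator, the scoring terms (goal progress minus a collision penalty), and the candidate skills below are illustrative assumptions, not the paper's implementation.

```python
import math
from typing import Dict, Tuple

Vec = Tuple[float, float]

def rollout_score(pos: Vec, goal: Vec, obstacle: Vec,
                  velocity: Vec, steps: int = 4, dt: float = 0.5) -> float:
    """Simulate a short constant-velocity rollout and score its outcome:
    a heavy penalty on collision, otherwise negative distance to goal."""
    x, y = pos
    for _ in range(steps):
        x += velocity[0] * dt
        y += velocity[1] * dt
        if math.dist((x, y), obstacle) < 0.5:  # collision radius (assumed)
            return -100.0
    return -math.dist((x, y), goal)            # closer to goal = better

def choose_skill(pos: Vec, goal: Vec, obstacle: Vec,
                 skills: Dict[str, Vec]) -> str:
    """Pick the candidate skill whose simulated consequences score best."""
    return max(skills, key=lambda s: rollout_score(pos, goal, obstacle, skills[s]))

# Candidate skills as velocity vectors (m/s); "forward" walks into the obstacle,
# so the rollout reveals its consequence before the agent commits.
skills = {"forward": (1.3, 0.0), "veer_left": (1.0, 0.6), "stop": (0.0, 0.0)}
best = choose_skill(pos=(0.0, 0.0), goal=(5.0, 0.0), obstacle=(2.0, 0.0),
                    skills=skills)
# Here "forward" collides during its rollout, so the agent veers left instead.
```

Evaluating each candidate by rolling it forward, rather than imitating recorded trajectories, is what makes the resulting choice consequence-aware in the sense the abstract describes.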