AIDB Daily Papers
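Mirror, Mirror, Who Am I? Can VLM Agents Recognize Themselves?
Note: The title and key points here were generated automatically by AI (originally in Japanese). Please check the original paper for the exact content.
Key points
- This study evaluates whether VLM agents in a 3D environment can infer their own body attributes from a mirror reflection, that is, whether they show a self-recognition capability.
- Self-recognition is measured by whether the agent correctly identifies the reflection as its own under controlled conditions, such as the presence or absence of a mirror and occlusion of the reflection.
- In the experiments, stronger VLM agents extracted the information needed for action from the reflection, whereas weaker models failed to extract self-relevant information and tended to confuse self and other.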
Abstract
In the animal kingdom, mirror self-recognition is a canonical probe of higher-order cognition, emerging only in some species. We ask whether an analogous functional capability emerges in embodied vision-language model (VLM) agents: can they recognize themselves in a mirror? We introduce a controlled 3D benchmark where a first-person VLM agent must infer a hidden body attribute from its reflection and select the matching target, while avoiding self-other misattribution. To separate mirror-grounded self-identification from shortcuts, we test mirror removal, misleading cues, and occluded reflections. We also evaluate the decision process through mirror seeking, temporal ordering, self-attribution, and reasoning-action consistency. Our experiments show that mirror-based self-identification emerges mainly in stronger VLMs. These models can use reflected evidence for action, whereas weaker models often inspect the mirror but fail to extract self-relevant information or misattribute their reflection. Language-vision conflict further shows that self-referential language alone is not evidence of grounded self-identification. Overall, mirror-based evaluation provides a diagnostic for whether embodied self-grounding is causally rooted in perception and action rather than priors, prompt compliance, or confabulation.
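To illustrate how the control conditions in the abstract could separate mirror-grounded self-identification from shortcuts, here is a minimal sketch in Python. It is not the paper's evaluation code: the `Episode` record, its field names, the condition labels, and the `mirror_gain` helper are all hypothetical, introduced only for illustration. The idea is that if accuracy with the mirror is no higher than accuracy without it, the agent is probably relying on priors or prompt cues rather than its reflection.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Episode:
    """One hypothetical benchmark episode (all field names are illustrative)."""
    condition: str       # assumed labels, e.g. "mirror", "no_mirror", "occluded"
    true_attribute: str  # the agent's hidden body attribute, e.g. a colour
    chosen_target: str   # attribute of the target the agent selected


def success_rate_by_condition(episodes):
    """Return, per condition, the fraction of episodes with a matching target."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for ep in episodes:
        totals[ep.condition] += 1
        if ep.chosen_target == ep.true_attribute:
            hits[ep.condition] += 1
    return {cond: hits[cond] / totals[cond] for cond in totals}


def mirror_gain(rates, baseline="no_mirror", treatment="mirror"):
    """Accuracy difference between the mirror condition and the ablation."""
    return rates[treatment] - rates[baseline]


# Toy data, invented for this example only.
episodes = [
    Episode("mirror", "red", "red"),
    Episode("mirror", "blue", "blue"),
    Episode("no_mirror", "red", "blue"),
    Episode("no_mirror", "blue", "blue"),
]
rates = success_rate_by_condition(episodes)
gain = mirror_gain(rates)
```

A gain near zero in such a comparison would suggest that the mirror is not doing causal work in the agent's decision. The same per-condition breakdown could be extended to the misleading-cue and occluded-reflection conditions mentioned in the abstract.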