AIDB Daily Papers

Limits of Imagery Reasoning in Frontier LLMs: The Absence of Visual Cognition

Original title: Limits of Imagery Reasoning in Frontier LLM Models

Authors: Sergio Y. Hayashi, Nina S. T. Hirata

Published: 2026-03-25 | Fields: LLM, computer vision, reasoning, AI, image, cognition, 3D, model, space

Note: The Japanese title and key points were generated automatically by AI. Refer to the original paper for the exact content.

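Key points

This study equips an LLM with an external "Imagery Module" that renders and rotates 3D models, in an attempt to overcome its difficulty with spatial tasks.
LLMs lack the visual-spatial primitives needed to extract spatial signals and make dynamic predictions, so the system still failed even when image manipulation was externalized.
In experiments on 3D model rotation tasks with a dual-module architecture, accuracy reached at most 62.5%, pointing to limits in visual cognition.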

Abstract

Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, yet they struggle with spatial tasks that require mental simulation, such as mental rotation. This paper investigates whether equipping an LLM with an external "Imagery Module" -- a tool capable of rendering and rotating 3D models -- can bridge this gap, functioning as a "cognitive prosthetic." We conducted experiments using a dual-module architecture in which a reasoning module (an MLLM) interacts with an imagery module on 3D model rotation tasks. Performance was lower than expected, with accuracy reaching at most 62.5%. Further investigation suggests that even when the burden of maintaining and manipulating a holistic 3D state is outsourced, the system still fails. This reveals that current frontier models lack the foundational visual-spatial primitives required to interface with imagery. Specifically, they lack: (1) the low-level sensitivity to extract spatial signals such as (a) depth, (b) motion, and (c) short-horizon dynamic prediction; and (2) the capacity to reason contemplatively over images, dynamically shifting visual focus and balancing imagery with symbolic and associative information.
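
To make the idea of an external imagery module concrete, the following is a minimal sketch and not the paper's implementation. It assumes the module only needs to hold a set of 3D vertices, rotate them about the z axis, and return a 2D view. The names `ImageryModule`, `rotate_z`, and `project_orthographic` are hypothetical, rendering is reduced to an orthographic projection, and the reasoning module (the MLLM) is left out entirely.

```python
import math


def rotate_z(vertices, degrees):
    """Rotate 3D points about the z axis by the given angle in degrees."""
    theta = math.radians(degrees)
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in vertices]


def project_orthographic(vertices):
    """Drop the depth (z) coordinate to get a 2D view along the z axis."""
    return [(x, y) for x, y, _ in vertices]


class ImageryModule:
    """Hypothetical stand-in for an external tool that holds and rotates a 3D model."""

    def __init__(self, vertices):
        # The module, not the reasoning model, keeps the full 3D state.
        self.vertices = list(vertices)

    def rotate(self, degrees):
        """Apply a rotation to the stored model and return the new 2D view."""
        self.vertices = rotate_z(self.vertices, degrees)
        return self.render()

    def render(self):
        """Return the current 2D view, rounded to suppress floating-point noise."""
        return [
            (round(x, 6), round(y, 6))
            for x, y in project_orthographic(self.vertices)
        ]


# Assumed toy model: two vertices, rotated by a quarter turn.
module = ImageryModule([(1.0, 0.0, 0.0), (0.0, 1.0, 2.0)])
view = module.rotate(90)
print(view)
```

In a setup like the one the abstract describes, the reasoning module would issue calls such as `rotate(90)` and inspect the returned view. Note that the projection discards depth, which is one of the spatial signals the abstract says current models fail to extract from images.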

Metadata

arXiv ID: 2603.26779
Categories: cs.CV, cs.AI
