AIDB Daily Papers

視線制御による能動的視覚でマルチモーダル推論を革新するGazeVLM

原題: GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning

著者: Brown Ebouky, Gabriele Carrino, Niccolo Avogaro, Christoph Studer, Andrea Bartezzaghi, Mattia Rigotti

公開日: 2026-05-08 | 分野: マルチモーダルコンピュータビジョン推論 AI 自然言語処理 cs.CL cs.AI cs.CV 能動的視覚注意機構

※ 日本語タイトル・ポイントはAIによる自動生成です。正確な内容は原論文をご確認ください。

ポイント

人間の能動的な視覚メカニズムを模倣し、注意リソースの配分を内部制御するマルチモーダル推論アーキテクチャを提案した。
従来の受動的な情報処理とは異なり、関連性の低い情報を抑制し、局所的な詳細への集中と全体像の把握を動的に切り替える。
40億パラメータのGazeVLMは、高解像度マルチモーダル推論タスクにおいて、既存の最先端モデルを大幅に上回る性能を示した。

Abstract

Human visual reasoning is governed by active vision, a process where metacognitive control drives top-down goal-directed attention, dynamically routing foveal focus toward task-relevant details while maintaining peripheral awareness of the global scene. In contrast, modern Vision-Language Models (VLMs) process visual information passively, relying on the static accumulation of massive token contexts that dilute spatial reasoning and induce linguistic hallucinations. Here we propose the following paradigm shift: GazeVLM, a multimodal architecture that internalizes this metacognitive oversight over its deployment of attention resources directly into the reasoning loop. By empowering the VLM to autonomously generate gaze tokens ($texttt{<LOOK>}$), GazeVLM establishes a top-down control mechanism over its own causal attention mask. The model dynamically dictates its focal intent, triggering a continuous suppression bias that dampens irrelevant visual features, implementing spatial selective attention and simulating foveal fixation. Once local reasoning concludes, the bias lifts, seamlessly restoring the global view. This architecture enables the model to fluidly transition between global spatial awareness and localized focal reasoning without relying on external agentic contraptions like cropping tools, or inflating the context window with additional visual tokens derived from localized visual patches. Trained with a bespoke Group Relative Policy Optimization (GRPO) procedure that rewards valid grounding, our 4B-parameter GazeVLM delivers strong high-resolution multimodal reasoning performance, surpassing state-of-the-art VLMs in its parameter class by nearly 4% and agentic multimodal pipelines built around thinking with images by more than 5% on HRBench-4k and HRBench-8k.

Paper AI Chat

この論文のPDF全文を対象にAIに質問できます。

質問の例:

AIチャット機能を利用するには、ログインまたは会員登録（無料）が必要です。

会員登録 / ログイン

arXivで読む PDFを開く

メタ情報

arXiv ID: 2605.07817
カテゴリ: cs.CV, cs.AI, cs.CL

ポイント

Abstract

Paper AI Chat

関連するAIDB記事

メタ情報