AIDB Daily Papers

視覚的根拠に基づいた思考：VLMの推論能力を画像領域と結びつける

原題: Thinking with Visual Grounding

著者: Junkai Zhang, Yihe Deng, Kai-Wei Chang, Wei Wang

公開日: 2026-06-15 | 分野: コンピュータビジョン AI 自然言語処理 cs.AI AI支援 AI評価

※ 日本語タイトル・ポイントはAIによる自動生成です。正確な内容は原論文をご確認ください。

ポイント

本研究では、言語的推論と画像領域の明示的な根拠付けを組み合わせた「視覚的根拠付け思考」を提案する。
この手法は、既存のVLMの推論過程を検証可能にし、より正確な学習を可能にする点で重要である。
視覚的根拠付け思考を導入したモデルは、特に空間推論タスクにおいて、非根拠付けモデルやより大規模なモデルを凌駕する性能を示した。

Abstract

Visual thinking should not only sound right; it should show its evidence. While recent vision-language models (VLMs) can produce natural-language reasoning traces, these traces often leave the supporting image regions implicit, making them hard to verify and difficult to supervise. We introduce visually grounded thinking, a reasoning process in which models interleave natural-language thoughts with explicit point or box groundings of the visual evidence used at each step. This lets the model express intermediate reasoning in language while grounding key objects in the image regions they refer to. To train this behavior, we construct a scalable synthesis pipeline that distills correct visual reasoning traces, extracts the visual objects required by the traces, grounds them with a SAM3-based agent, and derives aligned point and box supervision from the resulting masks. We further propose grounding-aware reinforcement learning, which combines answer correctness rewards with dense grounding rewards that score whether generated object references match the correct image evidence. Across two counting benchmarks and four spatial reasoning benchmarks, adding visually grounded thinking to Gemma3-4B-IT consistently improves performance over the original model and the non-grounded thinking baseline. On spatial reasoning, the visually grounded thinking 4B models match, and in some cases surpass, Gemma3-27B-IT from the same model family. Our analysis shows that point grounding is well suited to counting, while box grounding benefits most from explicit grounding rewards on spatial tasks. Overall, our results show that VLMs think better when their intermediate thoughts are tied to the image regions that make them true.

Paper AI Chat

この論文のPDF全文を対象にAIに質問できます。

質問の例:

AIチャット機能を利用するには、ログインまたは会員登録（無料）が必要です。

会員登録 / ログイン

arXivで読む PDFを開く

メタ情報

arXiv ID: 2606.16122
カテゴリ: cs.AI

ポイント

Abstract

Paper AI Chat

関連するAIDB記事

メタ情報