AIDB Daily Papers

Deepening Multimodal Understanding through Deep Pre-Alignment

Original title: Deep Pre-Alignment for VLMs

Authors: Tianyu Yu, Kechen Fang, Zihao Wan, Kaidong Zhang, Yicheng Zhang, Jun Song, Bo Zheng, Yuan Yao

Published: 2026-05-14 | Fields: LLM, Computer Vision, Natural Language Processing, VLM, cs.CV, cs.LG

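Note: The Japanese-language title and key points on this page were generated automatically by AI. Consult the original paper for the exact content.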

Key Points

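The paper proposes DPA, a new architecture that replaces the standard ViT encoder with a small VLM acting as a perceiver, so that visual features are deeply aligned with the text space of the target LLM.
The approach matters because it targets the gap between visual features and the text space in existing VLMs, freeing LLM depth for deep understanding and complex reasoning.
DPA outperforms baselines by 1.9 points across 8 multimodal benchmarks at the 4B scale and by 3.0 points at the 32B scale, reduces language capability forgetting by 32.9% over 3 text benchmarks, and shows consistent gains across LLM families including Qwen3 and LLaMA 3.2.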

Abstract

Most Vision Language Models (VLMs) directly map outputs from ViT encoders to the LLM via a lightweight projector. While effective, recent analysis suggests this architecture suffers from an alignment challenge: visual features remain distant from the text space in the initial layers of the LLM, forcing the model to waste critical depth (Zhang et al., 2024; Artzy and Schwartz, 2024) on superficial modality alignment rather than deep understanding and complex reasoning. In this work, we propose Deep Pre-Alignment (DPA), a novel architecture that replaces the standard ViT encoder with a small VLM as perceiver, ensuring visual features are deeply aligned with the text space of the target large language model. Comprehensive experiments demonstrate the effectiveness of DPA. On the 4B parameter scale, DPA outperforms baselines by 1.9 points across 8 multimodal benchmarks, with gains widening to 3.0 points at the 32B scale. Moreover, by offloading alignment to the perceiver, DPA achieves a 32.9% reduction in language capability forgetting over 3 text benchmarks. We further demonstrate that these gains are consistent across different LLM families including Qwen3 and LLaMA 3.2, highlighting the generality of our approach. Beyond performance, DPA also offers a seamless upgrade path for current VLM development, requiring only a modular replacement for the visual encoder with marginal computation overhead.
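
The structural difference described in the abstract can be illustrated with a toy, shape-level sketch. It is not the authors' implementation: every name, width, and stand-in layer below is hypothetical. In particular, the sketch assumes that the small VLM perceiver is itself a ViT plus projector plus small LLM, and that a lightweight projector still connects the perceiver to the target LLM. The abstract states neither of these details.

```python
"""Toy sketch of a ViT-based visual path versus a perceiver-based one.

All widths, names, and layers are illustrative assumptions, not the paper's code.
"""
from typing import Callable, List

Token = List[float]
Tokens = List[Token]
Module = Callable[[Tokens], Tokens]


def make_stub_layer(in_dim: int, out_dim: int) -> Module:
    """Stand-in for a learned module: maps each token from in_dim to out_dim."""
    def apply(tokens: Tokens) -> Tokens:
        out: Tokens = []
        for tok in tokens:
            assert len(tok) == in_dim, "unexpected feature width"
            mean = sum(tok) / in_dim
            out.append([mean] * out_dim)  # placeholder for a real transformation
        return out
    return apply


def compose(*modules: Module) -> Module:
    """Chain modules so the output of one feeds the next."""
    def apply(tokens: Tokens) -> Tokens:
        for module in modules:
            tokens = module(tokens)
        return tokens
    return apply


# Hypothetical feature widths, chosen only to make the shapes easy to follow.
PATCH_DIM, VIT_DIM, SMALL_LLM_DIM, TARGET_LLM_DIM = 12, 8, 6, 16

vit_encoder = make_stub_layer(PATCH_DIM, VIT_DIM)

# Conventional design: ViT encoder -> lightweight projector -> target LLM.
baseline_visual_path = compose(
    vit_encoder,
    make_stub_layer(VIT_DIM, TARGET_LLM_DIM),
)

# DPA-style design (assumed layout): a small VLM serves as the perceiver,
# so visual tokens pass through language-model layers before the target LLM.
perceiver = compose(
    vit_encoder,
    make_stub_layer(VIT_DIM, SMALL_LLM_DIM),        # perceiver's own projector
    make_stub_layer(SMALL_LLM_DIM, SMALL_LLM_DIM),  # stands in for small-LLM layers
)
dpa_visual_path = compose(
    perceiver,
    make_stub_layer(SMALL_LLM_DIM, TARGET_LLM_DIM),
)


def build_llm_input(visual_path: Module, patches: Tokens, text: Tokens) -> Tokens:
    """Prepend visual tokens to text embeddings, as the target LLM would see them."""
    return visual_path(patches) + text


patches = [[0.5] * PATCH_DIM for _ in range(4)]    # 4 image patches
text = [[0.0] * TARGET_LLM_DIM for _ in range(3)]  # 3 text token embeddings

baseline_input = build_llm_input(baseline_visual_path, patches, text)
dpa_input = build_llm_input(dpa_visual_path, patches, text)
```

In this sketch both paths hand the target LLM a sequence of the same length and width, which mirrors the abstract's description of DPA as a modular replacement for the visual encoder. Only the depth of processing before the target LLM differs.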


Metadata

arXiv ID: 2605.15300
Category: cs.CV
