AIDB Daily Papers
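Deepening Multimodal Understanding through Deep Pre-Alignment
Note: The title and key points were generated automatically by AI (originally in Japanese). Please refer to the original paper for the accurate content.
Key points
- Proposes DPA, a new architecture that deeply integrates visual features into the language space by replacing the standard ViT encoder with a VLM perceiver.
- The approach matters because it addresses the gap between visual features and the text space in existing VLMs, enabling deep understanding and complex reasoning.
- DPA outperforms baselines on multiple multimodal benchmarks, reduces the loss of language capability, and proves effective across several LLM families.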
Abstract
Most Vision Language Models (VLMs) directly map outputs from ViT encoders to the LLM via a lightweight projector. While effective, recent analysis suggests this architecture suffers from an alignment challenge: visual features remain distant from the text space in the initial layers of the LLM, forcing the model to waste critical depth \cite{zhang-etal-2024-investigating,artzy-schwartz-2024-attend} on superficial modality alignment rather than deep understanding and complex reasoning. In this work, we propose Deep Pre-Alignment (DPA), a novel architecture that replaces the standard ViT encoder with a small VLM as perceiver, ensuring visual features are deeply aligned with the text space of the target large language model. Comprehensive experiments demonstrate the effectiveness of DPA. On the 4B parameter scale, DPA outperforms baselines by 1.9 points across 8 multimodal benchmarks, with gains widening to 3.0 points at the 32B scale. Moreover, by offloading alignment to the perceiver, DPA achieves a 32.9% reduction in language capability forgetting over 3 text benchmarks. We further demonstrate that these gains are consistent across different LLM families including Qwen3 and LLaMA 3.2, highlighting the generality of our approach. Beyond performance, DPA also offers a seamless upgrade path for current VLM development, requiring only a modular replacement for the visual encoder with marginal computation overhead.
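The structural difference described in the abstract can be illustrated with a toy, shape-only sketch. It is not the authors' implementation: every function name, dimension, and weight below is hypothetical, each "layer" is a plain linear map, and it assumes (the abstract does not say) that the perceiver's final hidden states are passed through a projector before reaching the target LLM.

```python
import random


def linear(x, w):
    """Multiply a (tokens x d_in) list of rows by a (d_in x d_out) weight matrix."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def rand_matrix(rows, cols, rng):
    """Random matrix used as a placeholder for trained weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def toy_vit_encoder(patches, w_vit):
    # Hypothetical stand-in for a ViT encoder: one linear map per patch.
    return linear(patches, w_vit)


def toy_vlm_perceiver(patches, w_vit, w_layers):
    # Hypothetical stand-in for a small VLM used as perceiver:
    # its own vision encoder followed by several language-model layers.
    hidden = linear(patches, w_vit)
    for w in w_layers:
        hidden = linear(hidden, w)
    return hidden


def build_llm_input(visual_feats, w_proj, text_embeds):
    # Project visual features to the LLM width and prepend them to the text tokens.
    return linear(visual_feats, w_proj) + text_embeds


# Hypothetical toy sizes.
N_PATCHES, PATCH_DIM, FEAT_DIM, LLM_DIM, N_TEXT = 4, 6, 5, 8, 3

rng = random.Random(0)
patches = rand_matrix(N_PATCHES, PATCH_DIM, rng)
text_embeds = rand_matrix(N_TEXT, LLM_DIM, rng)

# Baseline: ViT encoder -> lightweight projector -> LLM.
baseline_input = build_llm_input(
    toy_vit_encoder(patches, rand_matrix(PATCH_DIM, FEAT_DIM, rng)),
    rand_matrix(FEAT_DIM, LLM_DIM, rng),
    text_embeds,
)

# DPA-style: small VLM perceiver -> projector (assumed) -> LLM.
dpa_input = build_llm_input(
    toy_vlm_perceiver(
        patches,
        rand_matrix(PATCH_DIM, FEAT_DIM, rng),
        [rand_matrix(FEAT_DIM, FEAT_DIM, rng) for _ in range(2)],
    ),
    rand_matrix(FEAT_DIM, LLM_DIM, rng),
    text_embeds,
)

print(len(baseline_input), len(baseline_input[0]))
print(len(dpa_input), len(dpa_input[0]))
```

Both pipelines hand the LLM a sequence of the same shape, which is consistent with the abstract's claim that DPA only requires a modular replacement of the visual encoder. The sketch says nothing about alignment quality, which depends on training.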