AIDB Daily Papers
The Cost of "Seeing": Achieving Trustworthy Multimodal Reasoning under a Single-Architecture Paradigm
Note: the title and key points below are AI-generated. Please consult the original paper for accurate details.
Key Points
- This work questions the assumption that current Vision-Language Models (VLMs) faithfully integrate multimodal data, and identifies a resulting crisis of trustworthiness.
- To overcome the limitations of conventional evaluation methods, it proposes an information-theoretic "Modality Translation Protocol" and introduces new metrics that quantify the cost of visual information.
- It posits a "Divergence Law of Multimodal Scaling", under which the visual knowledge bottleneck grows as models scale, and aims to enable genuinely multimodal reasoning.
Abstract
The rapid proliferation of Vision-Language Models (VLMs) is widely celebrated as the dawn of unified multimodal knowledge discovery, but its foundation operates on a dangerous, unquestioned axiom: that current VLMs faithfully synthesise multimodal data. We argue they do not. Instead, a profound crisis of trustworthiness underlies the dominant Vision Encoder-Projector-LLM paradigm. Rather than extracting grounded knowledge from visual inputs, state-of-the-art models frequently exhibit functional blindness, i.e., exploiting strong language priors to bypass severe visual representation bottlenecks. In this work, we challenge the conventional methodology of multimodal evaluation, which relies on data ablation or new dataset creation and therefore fatally conflates dataset biases with architectural incapacity. We propose a radical, information-theoretic departure: the Modality Translation Protocol, designed to quantifiably unmask the Expense of Seeing. By translating semantic payloads rather than ablating them, we formulate three novel metrics -- the Toll (ToS), Curse (CoS), and Fallacy (FoS) of Seeing -- culminating in the Semantic Sufficiency Criterion (SSC). Furthermore, we posit a provocative Divergence Law of Multimodal Scaling, hypothesising that as the underlying language engines scale to unprecedented reasoning capabilities, the mathematical penalty of the visual knowledge bottleneck paradoxically increases. We challenge the KDD community to abandon the illusory pursuit of "multimodal gain". By elevating the SSC from a passive diagnostic constraint to an active architectural blueprint, we provide the rigorous, trustworthy foundation required to force the next generation of AI systems to truly see the data, achieving true multimodal reasoning.
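The abstract describes the Modality Translation Protocol as translating the visual semantic payload into text rather than ablating it, then measuring what "seeing" costs the model. The paper's exact metric definitions are not given here, so the sketch below is purely illustrative: it assumes a "Toll of Seeing"-style gap computed as the accuracy difference between answering from a textual translation of an image and answering from the raw image. The function names and the formula are assumptions for exposition, not the paper's definitions.

```python
# Illustrative sketch of the modality-translation idea: evaluate the same
# questions twice -- once with the raw image input, once with a faithful
# textual "translation" of the image -- and compare accuracy.
# The ToS formula below is an assumed stand-in, not the paper's definition.

def accuracy(predictions, gold):
    """Fraction of predictions that match the gold answers."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

def toll_of_seeing(acc_from_text, acc_from_image):
    """Assumed ToS: performance lost when the model must 'see' the payload
    instead of reading an equivalent textual rendering of it."""
    return acc_from_text - acc_from_image

# Toy example with hard-coded model outputs (hypothetical data).
gold = ["a", "b", "c", "d"]
preds_from_text = ["a", "b", "c", "d"]    # answers given the translated text
preds_from_image = ["a", "b", "x", "d"]   # answers given the raw image

acc_text = accuracy(preds_from_text, gold)    # 1.0
acc_image = accuracy(preds_from_image, gold)  # 0.75
print(toll_of_seeing(acc_text, acc_image))    # 0.25
```

A positive gap under this reading would indicate the visual pathway loses information that the language engine could otherwise exploit, which is the "expense of seeing" the protocol is designed to expose.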