AIDB Daily Papers
Adversarial Robustness of Multimodal LLM Judges: Vulnerabilities and a New Attack Method
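Note: The Japanese title and key points on the source page were generated automatically by AI. Please check the original paper for the exact content.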
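Key points
- Proposes RobustMLLMJudge, a framework for evaluating the adversarial robustness of multimodal large language models (MLLMs) when they are used as judges.
- MLLM judges are vulnerable to attacks that illegitimately inflate scores, while existing attack methods are limited in effectiveness by constraints in the judges' evaluation protocols.
- The new attack method, MGSIA, combines guidance toward high-score regions with semantic alignment, which lets it deceive MLLM judges effectively.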
Abstract
Multimodal Large Language Models (MLLMs) are increasingly used as automated judges, e.g., for image quality and safety assessment. However, their adversarial robustness remains largely unexplored, threatening the fairness and reliability of automated judging. To bridge this gap, we introduce RobustMLLMJudge, the first general framework for evaluating the adversarial robustness of general-purpose MLLMs when functioning as judges. It covers diverse attacks against popular judge approaches across quality and safety evaluation scenarios. Using RobustMLLMJudge, we reveal that i) different MLLM judges are highly vulnerable to score-inflating adversarial attacks; and ii) although effective, these attack methods face a critical challenge due to unique constraints in the evaluation protocols of MLLM judges. We further propose MGSIA, namely Manifold-Guided Semantic Induction Attack, a novel method that bypasses these constraints to enable more effective and transferable attacks on MLLM judges. The core idea of MGSIA is to combine affirmative semantic induction with high-score manifold alignment: it maximizes the probability that judges yield affirmative responses (e.g., "Yes") to binary semantic queries, while regularizing adversarial representations toward high-score centers estimated from proxy protocols. Together, these objectives yield transferable score-inflating perturbations. Extensive experiments demonstrate the superiority and generalizability of MGSIA in deceiving advanced MLLM judges under different evaluation scenarios, highlighting the need for robust MLLM judges. Code and data will be made available at https://github.com/mala-lab/RobustMLLMJudge.
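The two-part objective described in the abstract (raise the probability of a "Yes" answer while pulling the adversarial representation toward a high-score center) can be illustrated with a toy sketch. This is not the authors' implementation. The linear encoder `W`, the sigmoid "Yes" head `w_yes`, the weight `lam`, the L-infinity budget `eps`, and the sign-gradient update are all assumptions chosen only to keep the example small and runnable.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sign(v):
    return 0.0 if v == 0 else math.copysign(1.0, v)


def encode(W, x):
    """Hypothetical linear 'encoder': h = W x (stands in for a judge's features)."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def backproject(W, g):
    """Compute W^T g, i.e. map a gradient w.r.t. h back to a gradient w.r.t. x."""
    return [sum(W[i][j] * g[i] for i in range(len(W))) for j in range(len(W[0]))]


def objective(x_adv, W, w_yes, center, lam):
    """Toy objective: log p("Yes") minus lam * squared distance to the center."""
    h = encode(W, x_adv)
    logit = sum(a * b for a, b in zip(w_yes, h))
    dist2 = sum((a - c) ** 2 for a, c in zip(h, center))
    return math.log(sigmoid(logit)) - lam * dist2


def toy_attack(x, W, w_yes, center, eps=0.1, step=0.01, iters=50, lam=0.1):
    """Sign-gradient ascent on the toy objective under an L-infinity budget eps."""
    delta = [0.0] * len(x)
    for _ in range(iters):
        h = encode(W, [a + d for a, d in zip(x, delta)])
        p_yes = sigmoid(sum(a * b for a, b in zip(w_yes, h)))
        # Gradient w.r.t. h: induction term plus alignment term.
        g_h = [(1.0 - p_yes) * wy - 2.0 * lam * (hi - c)
               for wy, hi, c in zip(w_yes, h, center)]
        g_x = backproject(W, g_h)
        # Take a signed step, then clip back into the perturbation budget.
        delta = [max(-eps, min(eps, d + step * sign(g)))
                 for d, g in zip(delta, g_x)]
    return delta


# Illustrative inputs only: identity encoder, all-ones "Yes" head and center.
x = [0.0, 0.0, 0.0, 0.0]
W = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
w_yes = [1.0, 1.0, 1.0, 1.0]
center = [1.0, 1.0, 1.0, 1.0]

delta = toy_attack(x, W, w_yes, center)
x_adv = [a + d for a, d in zip(x, delta)]
print("objective before:", objective(x, W, w_yes, center, 0.1))
print("objective after: ", objective(x_adv, W, w_yes, center, 0.1))
```

In this toy setting both terms push the perturbation in the same direction, so every coordinate of `delta` reaches the budget `eps`. With a real MLLM judge the two gradients would come from backpropagation through the model, and the center would have to be estimated from data, which the abstract says is done with proxy protocols.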