AIDB Daily Papers

The Prompt Injection Defense Trilemma: Why Do Defense Wrappers Fail?

Original title: The Defense Trilemma: Why Prompt Injection Defense Wrappers Fail?

Authors: Manish Bhatt, Sarthak Munshi, Vineeth Sai Narajala, Idan Habler, Ammar Al-Kahfah, Ken Huang, Joel Webb, Blake Gatto

Published: 2026-04-07 | Topics: LLM, safety, security, AI, verification, prompts, natural language processing, adversaries, large language models

Note: The translated title and key points were generated automatically by AI. Consult the original paper for the exact content.

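Key Points

This study proves that a continuous, utility-preserving defense wrapper cannot guarantee safe outputs.
The result establishes a "defense trilemma": continuity, utility preservation, and completeness cannot all hold at once.
The theory was verified in Lean 4 and tested empirically on three LLMs, and the experimental results agreed with the theoretical predictions.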

Abstract

We prove that no continuous, utility-preserving wrapper defense (a function $D: X \to X$ that preprocesses inputs before the model sees them) can make all outputs strictly safe for a language model with connected prompt space, and we characterize exactly where every such defense must fail. We establish three results under successively stronger hypotheses: boundary fixation: the defense must leave some threshold-level inputs unchanged; an $\varepsilon$-robust constraint: under Lipschitz regularity, a positive-measure band around fixed boundary points remains near-threshold; and a persistent unsafe region: under a transversality condition, a positive-measure subset of inputs remains strictly unsafe. These constitute a defense trilemma: continuity, utility preservation, and completeness cannot coexist. We prove parallel discrete results requiring no topology, and extend to multi-turn interactions, stochastic defenses, and capacity-parity settings. The results do not preclude training-time alignment, architectural changes, or defenses that sacrifice utility. The full theory is mechanically verified in Lean 4 and validated empirically on three LLMs.
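
To make the boundary-fixation claim concrete, here is a minimal one-dimensional sketch that is not taken from the paper. It assumes a prompt space X = [0, 1], a hypothetical risk score `risk(x) = x`, a hypothetical threshold `TAU = 0.5`, and a toy wrapper `defense(x) = min(x, TAU)`; all of these names and values are illustrative assumptions. The wrapper is continuous and leaves every strictly safe input unchanged, yet the threshold point is a fixed point, so the worst-case output risk equals the threshold instead of falling strictly below it.

```python
# Toy illustration only: every name and value here is a hypothetical assumption,
# not the construction or the experiments from the paper.

TAU = 0.5  # hypothetical threshold: x is strictly safe iff risk(x) < TAU


def risk(x: float) -> float:
    """Hypothetical continuous risk score on the prompt space X = [0, 1]."""
    return x


def defense(x: float) -> float:
    """Toy continuous wrapper: identity on safe inputs, clamps the rest to TAU."""
    return min(x, TAU)


def grid(n: int = 1001) -> list[float]:
    """Evenly spaced sample of X = [0, 1], including the threshold point 0.5."""
    return [i / (n - 1) for i in range(n)]


def preserves_utility() -> bool:
    """Check that the wrapper leaves every strictly safe grid point unchanged."""
    return all(defense(x) == x for x in grid() if risk(x) < TAU)


def worst_case_risk() -> float:
    """Largest risk of a defended input over the grid."""
    return max(risk(defense(x)) for x in grid())


if __name__ == "__main__":
    print("utility preserved:", preserves_utility())
    print("threshold point is fixed:", defense(TAU) == TAU)
    print("worst-case risk is strictly below TAU:", worst_case_risk() < TAU)
```

In this toy setting, any continuous map that is the identity on [0, 0.5) must send 0.5 to itself by continuity, which is the one-dimensional shadow of the general argument: the wrapper can push outputs down to the threshold but not strictly below it everywhere.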

Metadata

arXiv ID: 2604.06436
Categories: cs.CR, cs.AI
