AIDB Daily Papers
RouteGuard: Detecting Skill Poisoning in LLM Agents via Internal Signals
Note: the title and key points below were generated automatically by AI. Please consult the original paper for the precise content.
Key Points
- The paper studies skill-poisoning attacks, which hide malicious instructions inside an LLM agent's skills.
- Successful attacks shift the model's attention from trusted context to the malicious portion of the skill.
- The proposed method, RouteGuard, detects these attacks from internal signals with higher accuracy than existing methods.
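The attention-shift effect described above can be illustrated with a toy score. The sketch below is not the paper's metric: the function name, span representation, and the use of head/layer-averaged attention are all assumptions made for illustration. It simply measures what fraction of response-time attention mass lands on the skill span rather than the trusted context.

```python
import numpy as np

def attention_shift_score(attn, skill_span, trusted_span):
    """Illustrative attention-hijacking score (not the paper's exact metric).

    attn: (num_response_tokens, num_context_tokens) attention weights,
          assumed averaged over heads and layers.
    skill_span / trusted_span: (start, end) token index ranges.
    Returns the mean fraction of attention mass on the skill span
    relative to the combined skill + trusted mass.
    """
    skill_mass = attn[:, skill_span[0]:skill_span[1]].sum(axis=1)
    trusted_mass = attn[:, trusted_span[0]:trusted_span[1]].sum(axis=1)
    # A value above 0.5 means the response attends more to the skill
    # than to the trusted context -- the "hijacking" signature.
    return float(np.mean(skill_mass / (skill_mass + trusted_mass + 1e-9)))

# Toy example: uniform attention over 10 context tokens;
# trusted context is tokens 0..6, the skill occupies tokens 6..10.
attn = np.full((4, 10), 0.1)
score = attention_shift_score(attn, skill_span=(6, 10), trusted_span=(0, 6))
```

With uniform attention the score simply reflects span length (0.4 here); a poisoned skill that hijacks attention would push it well above 0.5.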
Abstract
Agent skills introduce a new and more severe form of indirect injection for LLM agents: unlike traditional indirect prompt injection, attackers can hide malicious instructions inside a dense, action-oriented skill that already functions as a legitimate instruction source. We study pre-execution skill-poison detection and show that successful skill poisoning induces a structured internal effect, attention hijacking, in which response-time attention shifts from trusted context to malicious skill spans and drives harmful behavior. Motivated by this mechanism, we propose RouteGuard, a frozen-backbone detector that combines response-conditioned attention and hidden-state alignment through reliability-gated late fusion. Across both real and synthetic open-source skill benchmarks, RouteGuard is consistently the strongest or most robust detector; on the critical Skill-Inject channel slice, it reaches 0.8834 F1 and recovers 90.51% of description attacks missed by lexical screening, showing that defending against skill poisoning requires internal-signal detection rather than text-only filtering.
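The abstract describes RouteGuard as fusing two detector scores (attention-based and hidden-state-based) via reliability-gated late fusion. The paper's exact gating function is not given here, so the following is a minimal sketch under the assumption that each branch produces a scalar score and a scalar reliability weight; the function and parameter names are hypothetical.

```python
def gated_late_fusion(attn_score, hidden_score, attn_reliability, hidden_reliability):
    """Illustrative reliability-gated late fusion (not the paper's exact rule).

    Each branch's score in [0, 1] is weighted by its reliability in [0, 1];
    the fused score is the reliability-weighted average.
    """
    total = attn_reliability + hidden_reliability
    if total == 0.0:
        # Neither branch is reliable: fall back to an uninformative prior.
        return 0.5
    return (attn_reliability * attn_score
            + hidden_reliability * hidden_score) / total

# If the hidden-state branch is judged unreliable (weight 0),
# the fused score falls back entirely on the attention branch.
fused = gated_late_fusion(0.9, 0.2, attn_reliability=1.0, hidden_reliability=0.0)
```

The point of the gating is that a branch whose signal is degraded on a given input contributes less, so a single noisy detector cannot drag down an otherwise confident decision.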