AIDB Daily Papers
RouteGuard: Detecting Skill Poisoning in LLM Agents via Internal Signals
Note: the title and key points below were generated automatically by AI. Please consult the original paper for the precise content.
Key Points
- The paper studies skill-poisoning attacks, which hide malicious instructions inside an LLM agent's skills.
- Successful attacks shift the model's attention from trusted context to the malicious portion of the skill.
- The proposed method, RouteGuard, detects these attacks from internal signals with higher accuracy than existing methods.
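The attention-shift effect described above can be illustrated with a toy score. The sketch below is not the paper's metric: the function name, span representation, and the use of head/layer-averaged attention are all assumptions made for illustration. It simply measures what fraction of response-time attention mass lands on the skill span rather than the trusted context.

```python
import numpy as np

def attention_shift_score(attn, skill_span, trusted_span):
    """Illustrative attention-hijacking score (not the paper's exact metric).

    attn: (num_response_tokens, num_context_tokens) attention weights,
          assumed averaged over heads and layers.
    skill_span / trusted_span: (start, end) token index ranges.
    Returns the mean fraction of attention mass on the skill span
    relative to the combined skill + trusted mass.
    """
    skill_mass = attn[:, skill_span[0]:skill_span[1]].sum(axis=1)
    trusted_mass = attn[:, trusted_span[0]:trusted_span[1]].sum(axis=1)
    # A value above 0.5 means the response attends more to the skill
    # than to the trusted context -- the "hijacking" signature.
    return float(np.mean(skill_mass / (skill_mass + trusted_mass + 1e-9)))

# Toy example: uniform attention over 10 context tokens;
# trusted context is tokens 0..6, the skill occupies tokens 6..10.
attn = np.full((4, 10), 0.1)
score = attention_shift_score(attn, skill_span=(6, 10), trusted_span=(0, 6))
```

With uniform attention the score simply reflects span length (0.4 here); a poisoned skill that hijacks attention would push it well above 0.5.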
Abstract
Agent skills introduce a new and more severe form of indirect injection for LLM agents: unlike traditional indirect prompt injection, attackers can hide malicious instructions inside a dense, action-oriented skill that already functions as a legitimate instruction source. We study pre-execution skill-poison detection and show that successful skill poisoning induces a structured internal effect, attention hijacking, in which response-time attention shifts from trusted context to malicious skill spans and drives harmful behavior. Motivated by this mechanism, we propose RouteGuard, a frozen-backbone detector that combines response-conditioned attention and hidden-state alignment through reliability-gated late fusion. Across both real and synthetic open-source skill benchmarks, RouteGuard is consistently the strongest or most robust detector; on the critical Skill-Inject channel slice, it reaches 0.8834 F1 and recovers 90.51% of description attacks missed by lexical screening, showing that defending against skill poisoning requires internal-signal detection rather than text-only filtering.
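The abstract describes RouteGuard as fusing two detector scores (attention-based and hidden-state-based) via reliability-gated late fusion. The paper's exact gating function is not given here, so the following is a minimal sketch under the assumption that each branch produces a scalar score and a scalar reliability weight; the function and parameter names are hypothetical.

```python
def gated_late_fusion(attn_score, hidden_score, attn_reliability, hidden_reliability):
    """Illustrative reliability-gated late fusion (not the paper's exact rule).

    Each branch's score in [0, 1] is weighted by its reliability in [0, 1];
    the fused score is the reliability-weighted average.
    """
    total = attn_reliability + hidden_reliability
    if total == 0.0:
        # Neither branch is reliable: fall back to an uninformative prior.
        return 0.5
    return (attn_reliability * attn_score
            + hidden_reliability * hidden_score) / total

# If the hidden-state branch is judged unreliable (weight 0),
# the fused score falls back entirely on the attention branch.
fused = gated_late_fusion(0.9, 0.2, attn_reliability=1.0, hidden_reliability=0.0)
```

The point of the gating is that a branch whose signal is degraded on a given input contributes less, so a single noisy detector cannot drag down an otherwise confident decision.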