AIDB Daily Papers

Agent-ToM: Learning to Monitor Autonomous LLM Agents with Theory-of-Mind Reasoning

Original title: Agent-ToM: Learning to Monitor Autonomous LLM Agents via Theory-of-Mind Reasoning

Authors: Nesreen K. Ahmed, Nima Nafisi

Published: 2026-05-22 | Fields: cs.CL, cs.AI, cs.CR, cs.LG, AI agents, AI safety

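Note: The Japanese-language title and key points on this page were generated automatically by AI. Consult the original paper for the precise content.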

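Key points

To detect covert malicious behavior by autonomous LLM agents, the authors propose Agent-ToM, a monitoring framework based on Theory-of-Mind (ToM) reasoning.
Agent-ToM infers an agent's beliefs, intentions, and goal alignment, and identifies anomalous behavior more effectively than prior monitoring methods.
On the SHADE-Arena and CUA-SHADE-Arena benchmarks, Agent-ToM achieves a strong precision-recall balance and outperforms state-of-the-art monitoring baselines.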

Abstract

Monitoring autonomous large language model (LLM) agents for covert malicious behavior is challenging due to delayed, context-dependent, and long-horizon attack patterns. Agents may pursue hidden objectives while maintaining superficially benign behavior, making detection difficult even with full trajectory access. Prior monitoring approaches improve scaffolding or ensemble aggregation, but treat each trajectory independently and do not learn from prior monitoring experience. Moreover, standard reasoning methods explain observed behavior without explicitly reasoning about agent beliefs, intentions, and goal alignment required to distinguish benign task execution from covert deviation. We propose Agent-ToM, a learning-to-monitor framework grounded in Theory-of-Mind (ToM) reasoning for security analysis of autonomous agents. Agent-ToM performs structured full-trajectory analysis by inferring beliefs, intent hypotheses with calibrated confidence, expected actions, and deviations from task-consistent behavioral baselines. At inference time, it employs a Reason-Verify-Refine pipeline to construct and validate monitoring decisions. At training time, Agent-ToM distills critique signals into a persistent semantic guardrail memory, enabling reusable belief- and intent-conditioned constraints across episodes. We evaluate Agent-ToM on adversarial agent monitoring benchmarks (SHADE-Arena and CUA-SHADE-Arena). Agent-ToM achieves strong precision-recall balance and outperforms state-of-the-art monitoring baselines, including ensemble methods, while using a coherent two-call reasoning pipeline. These results demonstrate that learning at the monitoring layer, combined with structured ToM reasoning and verification, provides an effective and deployable foundation for securing autonomous LLM agents.
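
The Reason-Verify-Refine flow and the guardrail memory described in the abstract can be pictured with a small Python sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the names `ToMAssessment`, `GuardrailMemory`, `monitor`, `toy_reason`, and `toy_verify` are hypothetical, the two callables stand in for the two model calls, and the substring-matching stubs replace real LLM reasoning so the example can run on its own.

```python
from dataclasses import dataclass, field, replace
from typing import Callable, List, Tuple


@dataclass
class ToMAssessment:
    """Hypothetical container for one Theory-of-Mind style analysis."""
    beliefs: List[str]      # what the agent appears to believe
    intent: str             # leading intent hypothesis
    confidence: float       # confidence in that hypothesis, in [0, 1]
    deviations: List[str]   # steps that depart from the task baseline


@dataclass
class GuardrailMemory:
    """Hypothetical persistent store of constraints kept across episodes."""
    rules: List[str] = field(default_factory=list)

    def add(self, rule: str) -> None:
        # Store each distilled constraint once.
        if rule not in self.rules:
            self.rules.append(rule)


Reasoner = Callable[[List[str], List[str]], ToMAssessment]
Verifier = Callable[[List[str], ToMAssessment], List[str]]


def monitor(trajectory: List[str], memory: GuardrailMemory,
            reason: Reasoner, verify: Verifier,
            threshold: float = 0.5) -> Tuple[bool, ToMAssessment]:
    # Call 1 (Reason): draft an assessment conditioned on stored guardrails.
    draft = reason(trajectory, memory.rules)
    # Call 2 (Verify): list the drafted deviations that lack support.
    unsupported = verify(trajectory, draft)
    # Refine (assumed to be local, no extra call): keep confirmed deviations.
    confirmed = [d for d in draft.deviations if d not in unsupported]
    refined = replace(draft, deviations=confirmed)
    flagged = bool(confirmed) and refined.confidence >= threshold
    return flagged, refined


def toy_reason(trajectory: List[str], rules: List[str]) -> ToMAssessment:
    # Stub for an LLM call: a step deviates if it matches any guardrail.
    hits = [s for s in trajectory if any(r in s for r in rules)]
    return ToMAssessment(
        beliefs=[],
        intent="covert deviation" if hits else "task execution",
        confidence=0.9 if hits else 0.1,
        deviations=hits,
    )


def toy_verify(trajectory: List[str], draft: ToMAssessment) -> List[str]:
    # Stub for an LLM call: a deviation is unsupported if it is not
    # literally present in the trajectory.
    return [d for d in draft.deviations if d not in trajectory]


memory = GuardrailMemory()
memory.add("send_email(external")  # constraint distilled from a past critique

suspicious = ["read_file(report.txt)", "send_email(external@example.com)"]
benign = ["read_file(report.txt)", "write_file(summary.txt)"]

flag_suspicious, result = monitor(suspicious, memory, toy_reason, toy_verify)
flag_benign, _ = monitor(benign, memory, toy_reason, toy_verify)
```

In this toy setup the memory is what lets a constraint learned in one episode shape the reasoning step in a later one; how the paper actually distills critiques into that memory is not specified in the abstract.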

Metadata

arXiv ID: 2605.24216
Categories: cs.LG, cs.AI, cs.CL, cs.CR
