AIDB Daily Papers

Recovering from AI Agent Mistakes: Human-Guided Safety Measures

Original title: Human-Guided Harm Recovery for Computer Use Agents

Authors: Christy Li, Sky CH-Wang, Andi Peng, Andreea Bobu

Published: 2026-04-20 | Fields: Robotics, Safety, AI, Interfaces, Human-Centered Design, Explainability, cs.CL, cs.AI, Reliability

Note: The localized title and key points were generated automatically by AI. Please refer to the original paper for the exact content.

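Key Points

Proposes a method for returning a system to a safe state after an AI agent performs an erroneous operation on a computer.
Identifies the aspects of recovery that humans value and generates recovery plans aligned with those preferences.
Shows through human evaluation that the proposed method achieves higher-quality recovery than conventional AI agents.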

Abstract

As LM agents gain the ability to execute actions on real computer systems, we need ways to not only prevent harmful actions at scale but also effectively remediate harm when prevention fails. We formalize a solution to this neglected challenge in post-execution safeguards as harm recovery: the problem of optimally steering an agent from a harmful state back to a safe one in alignment with human preferences. We ground preference-aligned recovery through a formative user study that identifies valued recovery dimensions and produces a natural language rubric. Our dataset of 1,150 pairwise judgments reveals context-dependent shifts in attribute importance, such as preferences for pragmatic, targeted strategies over comprehensive long-term approaches. We operationalize these learned insights in a reward model, re-ranking multiple candidate recovery plans generated by an agent scaffold at test time. To evaluate recovery capabilities systematically, we introduce BackBench, a benchmark of 50 computer-use tasks that test an agent's ability to recover from harmful states. Human evaluation shows our reward model scaffold yields higher-quality recovery trajectories than base agents and rubric-based scaffolds. Together, these contributions lay the foundation for a new class of agent safety methods -- ones that confront harm not only by preventing it, but by navigating its aftermath with alignment and intent.
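
The test-time re-ranking step mentioned in the abstract can be pictured with a minimal Python sketch. This is an illustration only, not the authors' implementation: `rerank_plans`, `toy_reward`, and the candidate plans are hypothetical names and data, and `toy_reward` is a hand-written stand-in that simply favors plans with fewer steps, where the paper uses a reward model learned from human pairwise judgments.

```python
from typing import Callable, List, Tuple


def rerank_plans(
    plans: List[str], reward_fn: Callable[[str], float]
) -> List[Tuple[float, str]]:
    """Score every candidate plan and return (score, plan) pairs, best first."""
    scored = [(reward_fn(plan), plan) for plan in plans]
    # Sort on the score only, highest first; ties keep their input order.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def toy_reward(plan: str) -> float:
    """Hypothetical stand-in for a learned reward model (assumption).

    Counts the semicolon-separated steps and prefers shorter plans.
    """
    steps = [step for step in plan.split(";") if step.strip()]
    return -float(len(steps))


# Hypothetical candidate recovery plans, as an agent scaffold might propose.
candidates = [
    "restore deleted file from trash",
    "reinstall OS; restore all backups; audit every account",
    "restore deleted file from trash; verify contents",
]

ranked = rerank_plans(candidates, toy_reward)
best_score, best_plan = ranked[0]
print(best_plan)
```

Swapping `toy_reward` for a call to a trained preference model would leave the selection logic unchanged, which is why the scorer is passed in as a function.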

Metadata

arXiv ID: 2604.18847
Categories: cs.AI, cs.CL
