AIDB Daily Papers
ClawsBench: Evaluating the Capability and Safety of LLM Productivity Agents in a Simulated Workspace
Note: The title and key points were automatically generated by AI. Please refer to the original paper for accurate details.
Key Points
- ClawsBench was developed to evaluate LLM agents' ability to automate productivity tasks while also assessing their safety.
- It addresses the limitations of existing evaluation environments by providing a high-fidelity mock environment that captures realistic workflows spanning multiple services.
- Experiments show that while agents achieve a certain level of task success, unsafe actions were also observed, revealing remaining challenges.
Abstract
Large language model (LLM) agents are increasingly deployed to automate productivity tasks (e.g., email, scheduling, document management), but evaluating them on live services is risky due to potentially irreversible changes. Existing benchmarks rely on simplified environments and fail to capture realistic, stateful, multi-service workflows. We introduce ClawsBench, a benchmark for evaluating and improving LLM agents in realistic productivity settings. It includes five high-fidelity mock services (Gmail, Slack, Google Calendar, Google Docs, Google Drive) with full state management and deterministic snapshot/restore, along with 44 structured tasks covering single-service, cross-service, and safety-critical scenarios. We decompose agent scaffolding into two independent levers (domain skills that inject API knowledge via progressive disclosure, and a meta prompt that coordinates behavior across services) and vary both to measure their separate and combined effects. Experiments across 6 models, 4 agent harnesses, and 33 conditions show that with full scaffolding, agents achieve task success rates of 39-64% but exhibit unsafe action rates of 7-33%. On OpenClaw, the top five models fall within a 10 percentage-point band on task success (53-63%), with unsafe action rates from 7% to 23% and no consistent ordering between the two metrics. We identify eight recurring patterns of unsafe behavior, including multi-step sandbox escalation and silent contract modification. We release the trajectories and future dataset at https://clawsbench.com.
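The abstract mentions that each mock service has full state management with deterministic snapshot/restore, so tasks can be reset and replayed without side effects. The sketch below is purely illustrative and not the ClawsBench implementation; the class and method names (MockMailService, snapshot, restore) are hypothetical, showing one minimal way such a stateful mock service could work.

```python
# Illustrative sketch only -- not the ClawsBench implementation.
# A minimal stateful mock service with deterministic snapshot/restore
# (all names here are hypothetical assumptions).
import copy
from dataclasses import dataclass, field

@dataclass
class MockMailService:
    """In-memory stand-in for an email service that an agent can act on."""
    inbox: list = field(default_factory=list)
    sent: list = field(default_factory=list)

    def send(self, to: str, subject: str, body: str) -> None:
        # An agent action that mutates service state.
        self.sent.append({"to": to, "subject": subject, "body": body})

    def snapshot(self) -> dict:
        # Deep-copy the entire state so a later restore is exact.
        return copy.deepcopy({"inbox": self.inbox, "sent": self.sent})

    def restore(self, snap: dict) -> None:
        # Reset to a prior snapshot before re-running a task.
        state = copy.deepcopy(snap)
        self.inbox, self.sent = state["inbox"], state["sent"]

# Usage: snapshot before the agent acts, restore to replay deterministically.
svc = MockMailService()
snap = svc.snapshot()
svc.send("alice@example.com", "Hello", "Testing an agent action.")
svc.restore(snap)
assert svc.sent == []
```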