AIDB Daily Papers
ClawBench: Can AI Agents Handle Everyday Online Tasks?
Key Points
- The authors build ClawBench, a framework for evaluating everyday online tasks, and use it to assess the capabilities of AI agents.
- Because tasks are executed on real-world websites, the benchmark poses more realistic and complex challenges than existing offline-environment evaluations.
- Even frontier models complete only a small fraction of the tasks, indicating that further progress is needed before general-purpose assistants become practical.
Abstract
AI agents may be able to automate your inbox, but can they automate other routine aspects of your life? Everyday online tasks offer a realistic yet unsolved testbed for evaluating the next generation of AI agents. To this end, we introduce ClawBench, an evaluation framework of 153 simple tasks that people need to accomplish regularly in their lives and work, spanning 144 live platforms across 15 categories, from completing purchases and booking appointments to submitting job applications. These tasks demand capabilities beyond those tested by existing benchmarks, such as obtaining relevant information from user-provided documents, navigating multi-step workflows across diverse platforms, and performing write-heavy operations like filling in many detailed forms correctly. Unlike existing benchmarks that evaluate agents in offline sandboxes with static pages, ClawBench operates on production websites, preserving the full complexity, dynamic nature, and challenges of real-world web interaction. A lightweight interception layer captures and blocks only the final submission request, ensuring safe evaluation without real-world side effects. Our evaluations of 7 frontier models show that both proprietary and open-source models can complete only a small portion of these tasks. For example, Claude Sonnet 4.6 achieves only 33.3%. Progress on ClawBench brings us closer to AI agents that can function as reliable general-purpose assistants.
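The abstract's most concrete mechanism is the interception layer that captures and blocks only the final submission request. The paper's actual implementation is not shown here; the following is a minimal sketch of how such a layer might be built with Playwright, under the hypothetical assumption that a task's final submission is an HTTP POST whose URL matches a known per-task endpoint pattern. All endpoint patterns and names below are illustrative, not ClawBench's code.

```python
# Sketch of a submission-interception layer in the spirit of the abstract.
# Assumption (hypothetical): the final submission is a POST to a URL that
# matches one of a task's known submission-endpoint substrings.
from playwright.sync_api import sync_playwright, Route

SUBMISSION_PATTERNS = ("/checkout", "/submit", "/apply")  # illustrative only
captured_submissions = []  # payloads saved for offline grading

def handle_route(route: Route) -> None:
    request = route.request
    is_submission = request.method == "POST" and any(
        p in request.url for p in SUBMISSION_PATTERNS
    )
    if is_submission:
        # Record the would-be submission for evaluation, then block it so
        # the live website never receives the side-effecting request.
        captured_submissions.append({"url": request.url, "body": request.post_data})
        route.fulfill(status=200, body="intercepted")  # fake a success response
    else:
        route.continue_()  # all other traffic reaches the production site

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.route("**/*", handle_route)  # inspect every request the agent makes
    # ... the agent drives `page` normally; only the final submission is blocked ...
    browser.close()
```

Because everything except the terminal POST passes through untouched, the agent still faces the full dynamic behavior of the live site, which is what distinguishes this setup from static offline sandboxes.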