AIDB Daily Papers
LiveClawBench: A Benchmark for LLM Agents on Complex Real-World Assistant Tasks
Note: The title and key points below were automatically generated by AI. Please refer to the original paper for accurate details.
Key Points
- Proposes LiveClawBench, which evaluates LLM agents' ability to carry out real-world tasks along three axes: environment, cognition, and adaptability.
- Its significance is that it comprehensively evaluates tasks with the kind of practical complexity that existing benchmarks fail to capture.
- Analyzes diverse real OpenClaw usage cases to build a complexity framework, establishing a foundation for evaluation in realistic assistant settings.
Abstract
LLM-based agents are increasingly expected to handle real-world assistant tasks, yet existing benchmarks typically evaluate them under isolated sources of difficulty, such as a single environment or fully specified instructions. This leaves a substantial gap between current evaluation settings and the compositional challenges that arise in practical deployment. To address this gap, we introduce LiveClawBench, a benchmark to evaluate LLM agents on real-world assistant tasks. Based on an analysis of various real OpenClaw usage cases, we derive a Triple-Axis Complexity Framework that characterizes task difficulty along three dimensions: Environment Complexity, Cognitive Demand, and Runtime Adaptability. Guided by this framework, we construct a pilot benchmark with explicit complexity-factor annotations, covering real-world assistant tasks with compositional difficulty. Together, the framework and benchmark provide a principled foundation for evaluating LLM agents in realistic assistant settings, and establish a basis for future expansion across task domains and complexity axes. We are continuing to enrich our case collections to achieve more comprehensive domain and complexity coverage. The project page is at https://github.com/Mosi-AI/LiveClawBench.
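The abstract states that each benchmark task carries explicit complexity-factor annotations along the three axes of the Triple-Axis Complexity Framework. As a rough illustration only, here is a minimal Python sketch of what such an annotated task record might look like; the field names, `Level` scale, and example task are all assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical ordinal rating for a single complexity factor."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class ComplexityAnnotation:
    """Ratings along the framework's three axes (field names are
    assumptions, not taken from the paper)."""
    environment_complexity: Level
    cognitive_demand: Level
    runtime_adaptability: Level


@dataclass
class BenchmarkTask:
    """A benchmark entry: a natural-language assistant task plus its
    explicit complexity-factor annotation."""
    task_id: str
    instruction: str
    annotation: ComplexityAnnotation


# Hypothetical example: a task spanning multiple tools (high
# environment complexity) whose goal is fully specified up front
# (low runtime adaptability).
task = BenchmarkTask(
    task_id="travel-001",
    instruction="Book the cheapest refundable flight and add it to my calendar.",
    annotation=ComplexityAnnotation(
        environment_complexity=Level.HIGH,
        cognitive_demand=Level.MEDIUM,
        runtime_adaptability=Level.LOW,
    ),
)
print(task.annotation)
```

A schema like this would let compositional difficulty be expressed directly, since a single task can score high on several axes at once rather than isolating one source of difficulty, which is the gap the abstract identifies in existing benchmarks.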