AIDB Daily Papers
LiveClawBench: A Benchmark for LLM Agents on Complex Real-World Assistant Tasks
Note: The title and key points below were automatically generated by AI. Please refer to the original paper for accurate details.
Key Points
- Proposes LiveClawBench, which evaluates LLM agents' ability to carry out real-world tasks along three axes: environment, cognition, and adaptability.
- Its significance is that it comprehensively evaluates tasks with the kind of practical complexity that existing benchmarks fail to capture.
- Analyzes diverse real OpenClaw usage cases to build a complexity framework, establishing a foundation for evaluation in realistic assistant settings.
Abstract
LLM-based agents are increasingly expected to handle real-world assistant tasks, yet existing benchmarks typically evaluate them under isolated sources of difficulty, such as a single environment or fully specified instructions. This leaves a substantial gap between current evaluation settings and the compositional challenges that arise in practical deployment. To address this gap, we introduce LiveClawBench, a benchmark to evaluate LLM agents on real-world assistant tasks. Based on an analysis of various real OpenClaw usage cases, we derive a Triple-Axis Complexity Framework that characterizes task difficulty along three dimensions: Environment Complexity, Cognitive Demand, and Runtime Adaptability. Guided by this framework, we construct a pilot benchmark with explicit complexity-factor annotations, covering real-world assistant tasks with compositional difficulty. Together, the framework and benchmark provide a principled foundation for evaluating LLM agents in realistic assistant settings, and establish a basis for future expansion across task domains and complexity axes. We are continuing to enrich our case collections to achieve more comprehensive domain and complexity coverage. The project page is at https://github.com/Mosi-AI/LiveClawBench.
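The abstract states that each benchmark task carries explicit complexity-factor annotations along the three axes of the Triple-Axis Complexity Framework. As a rough illustration only, here is a minimal Python sketch of what such an annotated task record might look like; the field names, `Level` scale, and example task are all assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical ordinal rating for a single complexity factor."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class ComplexityAnnotation:
    """Ratings along the framework's three axes (field names are
    assumptions, not taken from the paper)."""
    environment_complexity: Level
    cognitive_demand: Level
    runtime_adaptability: Level


@dataclass
class BenchmarkTask:
    """A benchmark entry: a natural-language assistant task plus its
    explicit complexity-factor annotation."""
    task_id: str
    instruction: str
    annotation: ComplexityAnnotation


# Hypothetical example: a task spanning multiple tools (high
# environment complexity) whose goal is fully specified up front
# (low runtime adaptability).
task = BenchmarkTask(
    task_id="travel-001",
    instruction="Book the cheapest refundable flight and add it to my calendar.",
    annotation=ComplexityAnnotation(
        environment_complexity=Level.HIGH,
        cognitive_demand=Level.MEDIUM,
        runtime_adaptability=Level.LOW,
    ),
)
print(task.annotation)
```

A schema like this would let compositional difficulty be expressed directly, since a single task can score high on several axes at once rather than isolating one source of difficulty, which is the gap the abstract identifies in existing benchmarks.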