AIDB Daily Papers
DailyReport: An Open-Ended Benchmark for Evaluating Everyday Tasks
Note: The Japanese title and key points were generated automatically by AI. Please consult the original paper for the exact content.
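Key Points
- Proposes DailyReport, a new benchmark for evaluating everyday information-seeking tasks.
- Because existing benchmarks are unrealistic and offer limited evaluation interpretability, this work emphasizes practicality and fine-grained evaluation.
- The benchmark comprises 150 tasks with 3,546 rubric items, and results show that current agentic systems fall short of user expectations.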
Abstract
Search Agents (SAs) typically leverage large language models (LLMs) to support complex information-seeking tasks by autonomously exploring web sources and synthesizing information into comprehensive responses. For SA evaluation, prior benchmarks mainly focus on specialized tasks that are unlikely to arise in real-world user scenarios. Moreover, their reliance on coarse task-level rubrics often limits evaluation interpretability. To bridge this gap, we introduce DailyReport, an open-ended benchmark to evaluate SA capabilities on daily search tasks. It contains 150 open-ended tasks with 3,546 associated rubrics, capturing widely discussed and timely information demands of real-world users. Each task is decomposed into subtasks and evaluated with cascade rubrics across disentangled dimensions. Through cascade performance attribution and user-centric aggregation, we derive highly interpretable scores for each dimension, along with a user preference score. Our results on 17 agentic systems show that current systems still fall short of users' expectations. To facilitate future research, our dataset and code are made publicly available at https://github.com/AGI-Eval-Official/DailyReport.