AIDB Daily Papers

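"Odysseys": A Realistic Benchmark for Web Agents on Long-Horizon Tasks

Original title: Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks

Authors: Lawrence Keunho Jang, Jing Yu Koh, Daniel Fried, Ruslan Salakhutdinov

Published: 2026-04-27 | Fields: LLM, benchmarks, machine learning, natural language processing, cs.CL, cs.LG, AI agents

Note: The Japanese-language title and key points for this entry were generated automatically by AI. Please check the original paper for the accurate content.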

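Key Points

The authors developed Odysseys, a benchmark that evaluates long-horizon tasks spanning multiple sites and modeled on real-world web use.
Because conventional binary pass/fail evaluation is inadequate for such tasks, they introduce rubric-based evaluation, which agrees more closely with human judgments and gives a finer-grained signal.
Even the strongest frontier models reach a success rate of only 44.5%, and efficiency remains a challenge, which points to the need for agents that can sustain long-horizon web operation.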

Abstract

Existing web agent benchmarks have largely converged on short, single-site tasks that frontier models are approaching saturation on. However, real world web use consists of long-horizon, multi-site workflows. Common web navigation tasks, such as comparing products across different domains, planning trips across multiple services, or summarizing information from multiple search queries, require sustained context and cross-site reasoning over potentially hours of browsing. To capture and evaluate such behaviors, we introduce Odysseys: a benchmark of 200 long-horizon web tasks derived from real world browsing sessions evaluated on the live Internet. We find that binary pass/fail evaluation is inadequate for long-horizon settings and introduce a rubric-based evaluation, annotating each Odysseys task with an average of 6.1 graded rubrics. We demonstrate that this yields higher agreement with humans and provides a more fine-grained signal than commonly used trajectory-level LLM-as-a-judge evaluation metrics. We tested several leading frontier models and find that the strongest models achieve a success rate of 44.5%, which leaves substantial room for future improvements. Beyond task success, we argue that efficiency is a first-class concern for long-horizon agents. We introduce a Trajectory Efficiency metric (rubric score per step) and find that even frontier agents achieve only 1.15%, marking an evident need for agents that can succeed efficiently and not simply eventually. Odysseys isolates the critical evaluation of long-horizon proficiency in open-web environments, providing a realistic benchmark to measure progress towards computer-use agents that can potentially productively operate for hours. We release our tasks, evaluation scripts, and other results at https://odysseys-website.pages.dev
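
As a rough illustration of the Trajectory Efficiency idea described in the abstract ("rubric score per step"), the sketch below computes a rubric score and divides it by the number of steps. The abstract does not specify how rubrics are weighted or aggregated, so the unweighted pass/fail rubric list, the `Trajectory` record, and the function names are all hypothetical assumptions for illustration, not the paper's evaluation scripts.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    """Hypothetical record of one agent run; not the paper's data format."""
    rubric_passed: list[bool]  # one entry per graded rubric for the task
    num_steps: int             # number of actions the agent took


def rubric_score(traj: Trajectory) -> float:
    """Fraction of rubrics satisfied (assumption: unweighted, binary rubrics)."""
    if not traj.rubric_passed:
        raise ValueError("task has no rubrics")
    return sum(traj.rubric_passed) / len(traj.rubric_passed)


def trajectory_efficiency(traj: Trajectory) -> float:
    """Rubric score divided by step count, i.e. rubric score per step."""
    if traj.num_steps <= 0:
        raise ValueError("num_steps must be positive")
    return rubric_score(traj) / traj.num_steps


# Made-up example: 4 of 6 rubrics satisfied after 40 steps.
run = Trajectory(rubric_passed=[True, True, False, True, False, True], num_steps=40)
print(f"rubric score: {rubric_score(run):.3f}")
print(f"trajectory efficiency: {trajectory_efficiency(run):.4f}")
```

Under these assumptions, an agent that satisfies the same rubrics in fewer steps gets a higher efficiency, which matches the abstract's emphasis on succeeding "efficiently and not simply eventually".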

Metadata

arXiv ID: 2604.24964
Categories: cs.LG, cs.CL
