AIDB Daily Papers

Benchmark Design and Reporting Standards for Evaluating Knowledge-Work AI

Original title: Design and Report Benchmarks for Knowledge Work

Authors: Yining Hua, Hongbin Na, Cyrus Ayubcha, Levi Lian

Published: 2026-05-22 | Topics: LLM, benchmarks, AI, cs.AI, AI agents, AI evaluation

Note: The Japanese-language title and the key points were generated automatically by AI. Refer to the original paper for the exact content.

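Key points

For evaluating knowledge-work AI, the paper moves away from the logic of traditional NLP tasks and proposes more practical benchmark design and reporting standards.
By explicitly defining the work activity under evaluation, the tested setting, and the work product, the paper examines how far a benchmark score reflects actual knowledge-work capability.
Applying the proposed approach in three case studies, the paper shows how benchmark design affects claims about work capability and where gaps arise between benchmark tasks and actual work.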

Abstract

The development of LLM agents has led to a growing body of work on knowledge-work AI, including coding, research, and healthcare. However, current knowledge-work evaluation and benchmark design still largely follow the logic of traditional NLP tasks. As a result, higher benchmark performance does not reliably show that a system can carry out knowledge work in real-world deployment settings. This paper contributes a three-step approach for making explicit how benchmarked tasks represent the work claims attached to their scores: defining the work activity under evaluation, specifying the tested setting, and scoring the appropriate work product. We review work studies showing that knowledge work is organized through roles and responsibilities, local materials and tools, and artifacts that must remain usable in downstream workflows. We then translate these concerns into benchmark design and reporting guidance, covering how tasks should be mapped to work activities, how tested settings should specify materials, tools, roles, and constraints, and how scoring should focus on the work product left by the system. To name the work activity being evaluated and distinguish it from common benchmark tasks, we derive an inventory of 18 work activities from the O*NET occupational task database. We demonstrate the approach through three benchmark case analyses: GDPval, a non-code occupational deliverable benchmark; OfficeQA Pro, a grounded document-analysis benchmark scored by final answers; and APEX-SWE, a software-engineering benchmark with executable scored products. These cases show how benchmark design choices shape the strongest work claim a score can support, and where gaps arise between the benchmarked task, tested setting, scored product, and broader work claim.
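
The three reporting elements named in the abstract (work activity, tested setting, scored work product) could be captured as a small record that flags anything left unspecified. The following is only an illustrative sketch, not the authors' schema: the class names `SettingSpec` and `BenchmarkReport`, the helper `missing_fields`, and the benchmark "ExampleBench" are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class SettingSpec:
    """Tested setting: the four aspects the abstract lists (hypothetical schema)."""
    materials: list[str] = field(default_factory=list)
    tools: list[str] = field(default_factory=list)
    roles: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)


@dataclass
class BenchmarkReport:
    """One benchmark's reporting record (hypothetical schema)."""
    benchmark: str
    work_activity: str      # step 1: the work activity under evaluation
    setting: SettingSpec    # step 2: the tested setting
    scored_product: str     # step 3: the work product that is scored


def missing_fields(report: BenchmarkReport) -> list[str]:
    """Return the names of reporting elements that were left empty."""
    missing = []
    if not report.work_activity.strip():
        missing.append("work_activity")
    for name in ("materials", "tools", "roles", "constraints"):
        if not getattr(report.setting, name):
            missing.append(f"setting.{name}")
    if not report.scored_product.strip():
        missing.append("scored_product")
    return missing


# "ExampleBench" and its field values are invented for illustration only.
report = BenchmarkReport(
    benchmark="ExampleBench",
    work_activity="Analyze documents to answer a factual query",
    setting=SettingSpec(materials=["PDF corpus"], tools=["search"]),
    scored_product="final answer string",
)
print(missing_fields(report))
```

In this sketch, a report that omits roles and constraints is flagged, which mirrors the abstract's point that a score supports a work claim only as far as the tested setting is actually specified.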

Metadata

arXiv ID: 2605.23262
Category: cs.AI
