AIDB Daily Papers

Apples-to-Apples AI Evaluations: From Real-World Use Cases to Evaluation Scenarios

Original title: Towards Apples to Apples for AI Evaluations: From Real-World Use Cases to Evaluation Scenarios

Authors: Yee-Yin Choong, Kristen Greene, Alice Qian, Meryem Marasli, Ziqi Yang, Sophia Chen, Laura Dabbish, Anand Rao, Hong Shen

Published: 2026-05-08 | Fields: LLM, human-centered design, cs.AI, cs.CY, cs.HC, AI evaluation, use cases, scenario generation

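Note: The Japanese title and key points on the source page were generated automatically by AI. Refer to the original paper for the exact content.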

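Key Points

To avoid "apples-to-oranges" comparisons in AI evaluation, the paper advocates methodological transparency in evaluation scenarios, operational grounding, and human-centered design.
It proposes a method that elicits AI use cases from subject matter experts and generates 107 scenarios through a three-stage expansion process combining LLM prompting with human review.
The approach supports a more consistent and meaningful paradigm for human-centered AI evaluation that reflects real-world usage and human needs.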

Abstract

AI measurement science has a wide variety of methodologies and measurements for comparing AI systems, resulting in what often appear to be "apples-to-oranges" comparisons across AI evaluations. To move toward "apples-to-apples" comparisons in real-world AI evaluations, this work advocates for methodological transparency in evaluation scenarios, operational grounding, and human-centered design (HCD) principles. We propose a repeatable process for transforming high-level use cases to detailed scenarios by eliciting use cases from subject matter experts (SMEs) via a structured AI Use Case Worksheet with six key elements: use case, sector, user (direct and indirect), intended outcomes, expected impacts (positive and negative), and KPIs and metrics. We demonstrate utility of the worksheet and process in the U.S. financial services sector. This paper reports on example high-level AI use cases identified by financial services sector SMEs: cyber defense enablement, developer productivity, financial crime aggregation, suspicious activity report (SAR) filing, credit memo generation, and internal call center support. These AI use cases provided are illustrative of the process and not exhaustive. Central to our work is a three-stage expansion pipeline combining LLM prompting with human reviews to generate 107 scenarios from those use cases elicited from SMEs. This process integrates iterative human reviews at every juncture to ensure operational grounding: for scenario titles and descriptions; for core scenario elements like users, benefits and risks, and metrics; and for scenario narratives and evaluation objectives. Human checkpoints ensure scenarios remain reflective of real-world usage and human needs. We describe a validation rubric to assess scenario quality. By defining key scenario components, this work supports a more consistent and meaningful paradigm for human-centered AI evaluations.
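
The six worksheet elements and the three human review points named in the abstract can be pictured as a small data structure. The following is a minimal sketch only, not the authors' worksheet format: the class name, field names, and the `missing_elements` helper are hypothetical, and the "compliance analyst" value is an illustrative placeholder rather than data from the paper.

```python
from dataclasses import dataclass, field, fields


@dataclass
class UseCaseWorksheet:
    """Hypothetical record mirroring the six worksheet elements in the abstract."""
    use_case: str
    sector: str
    # "user (direct and indirect)" is split into two lists here (an assumption).
    direct_users: list[str] = field(default_factory=list)
    indirect_users: list[str] = field(default_factory=list)
    intended_outcomes: list[str] = field(default_factory=list)
    # "expected impacts (positive and negative)" is likewise split in two.
    positive_impacts: list[str] = field(default_factory=list)
    negative_impacts: list[str] = field(default_factory=list)
    kpis_and_metrics: list[str] = field(default_factory=list)

    def missing_elements(self) -> list[str]:
        """Return the names of fields that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# The three human review points listed in the abstract, in order.
REVIEW_STAGES = (
    "scenario titles and descriptions",
    "core scenario elements (users, benefits and risks, metrics)",
    "scenario narratives and evaluation objectives",
)

worksheet = UseCaseWorksheet(
    use_case="suspicious activity report (SAR) filing",
    sector="U.S. financial services",
    direct_users=["compliance analyst"],  # illustrative value, not from the paper
)
print(worksheet.missing_elements())
```

A completeness check of this kind is one way a reviewer might confirm that a worksheet is fully filled in before it is expanded into scenarios; the paper itself may structure this step differently.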

Metadata

arXiv ID: 2605.07986
Categories: cs.HC, cs.AI, cs.CY
