AIDB Daily Papers
Formalizing LLM "Vibe-Testing": Systematizing Evaluation Based on User Experience
Note: The Japanese title and key points are automatically generated by AI. Please refer to the original paper for accurate details.
Key Points
- This study surveys and analyzes how "vibe-testing" — the informal evaluation of LLMs' real-world usefulness — is actually practiced.
- To bridge the gap between benchmark scores and real-world user experience, the authors propose a formalization that enables systematic analysis of vibe-testing.
- Experiments show that combining personalized prompts with user-aware evaluation criteria can change which model is preferred.
Abstract
Evaluating LLMs is challenging, as benchmark scores often fail to capture models' real-world usefulness. Instead, users often rely on "vibe-testing": informal experience-based evaluation, such as comparing models on coding tasks related to their own workflow. While prevalent, vibe-testing is often too ad hoc and unstructured to analyze or reproduce at scale. In this work, we study how vibe-testing works in practice and then formalize it to support systematic analysis. We first analyze two empirical resources: (1) a survey of user evaluation practices, and (2) a collection of in-the-wild model comparison reports from blogs and social media. Based on these resources, we formalize vibe-testing as a two-part process: users personalize both what they test and how they judge responses. We then introduce a proof-of-concept evaluation pipeline that follows this formulation by generating personalized prompts and comparing model outputs using user-aware subjective criteria. In experiments on coding benchmarks, we find that combining personalized prompts and user-aware evaluation can change which model is preferred, reflecting the role of vibe-testing in practice. These findings suggest that formalized vibe-testing can serve as a useful approach for bridging benchmark scores and real-world experience.
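To make the two-part formulation concrete, here is a minimal sketch of what such a pipeline could look like: personalizing *what* is tested (the prompt) and *how* responses are judged (user-aware criteria). All function names, the judge prompt, and the model/judge interfaces below are hypothetical placeholders for illustration, not the paper's actual implementation.

```python
# Illustrative sketch of the two-part vibe-testing formulation (assumed structure,
# not the paper's actual pipeline). Models and the judge are passed in as callables
# that wrap whatever LLM API the user prefers.
from dataclasses import dataclass

@dataclass
class UserProfile:
    """Describes a user's workflow and what they subjectively care about."""
    workflow: str        # e.g. "backend Python developer building data pipelines"
    criteria: list[str]  # e.g. ["concise code", "explains trade-offs"]

def personalize_prompt(base_task: str, user: UserProfile) -> str:
    """Part 1: personalize WHAT is tested by grounding a generic task in the user's workflow."""
    return f"As a {user.workflow}, {base_task}"

def user_aware_judgment(prompt: str, response_a: str, response_b: str,
                        user: UserProfile, judge) -> str:
    """Part 2: personalize HOW responses are judged, using user-aware subjective criteria."""
    rubric = "; ".join(user.criteria)
    judge_prompt = (
        f"User context: {user.workflow}\n"
        f"Criteria: {rubric}\n"
        f"Task: {prompt}\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        "Which response better serves this user? Answer 'A' or 'B'."
    )
    return judge(judge_prompt)

def vibe_test(base_task: str, user: UserProfile, model_a, model_b, judge) -> str:
    """Run one personalized comparison; aggregating many such comparisons across
    tasks and user profiles yields a model preference for that user."""
    prompt = personalize_prompt(base_task, user)
    return user_aware_judgment(prompt, model_a(prompt), model_b(prompt), user, judge)
```

The sketch's point mirrors the abstract's finding: because both the prompt and the judging rubric depend on the user profile, the preferred model can differ from what generic benchmark scores would suggest.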