AIDB Daily Papers
Formalizing LLM "Vibe-Testing": Systematizing Evaluation Based on User Experience
Note: The Japanese title and key points are automatically generated by AI. Please refer to the original paper for accurate details.
Key Points
- This study surveys and analyzes how "vibe-testing" — the informal evaluation of LLMs' real-world usefulness — is actually practiced.
- To bridge the gap between benchmark scores and real-world user experience, the authors propose a formalization that enables systematic analysis of vibe-testing.
- Experiments show that combining personalized prompts with user-aware evaluation criteria can change which model is preferred.
Abstract
Evaluating LLMs is challenging, as benchmark scores often fail to capture models' real-world usefulness. Instead, users often rely on "vibe-testing": informal experience-based evaluation, such as comparing models on coding tasks related to their own workflow. While prevalent, vibe-testing is often too ad hoc and unstructured to analyze or reproduce at scale. In this work, we study how vibe-testing works in practice and then formalize it to support systematic analysis. We first analyze two empirical resources: (1) a survey of user evaluation practices, and (2) a collection of in-the-wild model comparison reports from blogs and social media. Based on these resources, we formalize vibe-testing as a two-part process: users personalize both what they test and how they judge responses. We then introduce a proof-of-concept evaluation pipeline that follows this formulation by generating personalized prompts and comparing model outputs using user-aware subjective criteria. In experiments on coding benchmarks, we find that combining personalized prompts and user-aware evaluation can change which model is preferred, reflecting the role of vibe-testing in practice. These findings suggest that formalized vibe-testing can serve as a useful approach for bridging benchmark scores and real-world experience.
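To make the two-part formulation concrete, here is a minimal sketch of what such a pipeline could look like: personalizing *what* is tested (the prompt) and *how* responses are judged (user-aware criteria). All function names, the judge prompt, and the model/judge interfaces below are hypothetical placeholders for illustration, not the paper's actual implementation.

```python
# Illustrative sketch of the two-part vibe-testing formulation (assumed structure,
# not the paper's actual pipeline). Models and the judge are passed in as callables
# that wrap whatever LLM API the user prefers.
from dataclasses import dataclass

@dataclass
class UserProfile:
    """Describes a user's workflow and what they subjectively care about."""
    workflow: str        # e.g. "backend Python developer building data pipelines"
    criteria: list[str]  # e.g. ["concise code", "explains trade-offs"]

def personalize_prompt(base_task: str, user: UserProfile) -> str:
    """Part 1: personalize WHAT is tested by grounding a generic task in the user's workflow."""
    return f"As a {user.workflow}, {base_task}"

def user_aware_judgment(prompt: str, response_a: str, response_b: str,
                        user: UserProfile, judge) -> str:
    """Part 2: personalize HOW responses are judged, using user-aware subjective criteria."""
    rubric = "; ".join(user.criteria)
    judge_prompt = (
        f"User context: {user.workflow}\n"
        f"Criteria: {rubric}\n"
        f"Task: {prompt}\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        "Which response better serves this user? Answer 'A' or 'B'."
    )
    return judge(judge_prompt)

def vibe_test(base_task: str, user: UserProfile, model_a, model_b, judge) -> str:
    """Run one personalized comparison; aggregating many such comparisons across
    tasks and user profiles yields a model preference for that user."""
    prompt = personalize_prompt(base_task, user)
    return user_aware_judgment(prompt, model_a(prompt), model_b(prompt), user, judge)
```

The sketch's point mirrors the abstract's finding: because both the prompt and the judging rubric depend on the user profile, the preferred model can differ from what generic benchmark scores would suggest.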