AIDB Daily Papers

Don't Handpick Judges, Calibrate Them: Label-Efficient Estimation from Noisy LLM Judges

Original title: Calibrate, Don't Curate: Label-Efficient Estimation from Noisy LLM Judges

Author: Yanran Li

Published: 2026-05-10 | Topics: LLM, AI, evaluation, cs.CL, stat.ME, calibration

Note: The translated title and the key points were generated automatically by AI. Refer to the original paper for the exact content.

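Key points

For LLM evaluation, the paper proposes treating low-accuracy judges not as noise to discard but as signals whose biases can be learned.
In contrast to the usual practice of selecting only the most accurate judges, it keeps the full judge panel and calibrates it on labeled data, which yields better-calibrated probabilistic evaluation than accuracy-ranked top-k selection.
In the experiments, the full panel including low-accuracy judges reached an NLL of 0.006 on RewardBench2 versus 0.013 under top-5 selection, roughly halving the calibration error.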

Abstract

Multi-judge evaluation is increasingly used to assess LLMs and reward models, and the prevailing heuristic is to curate: keep the most accurate judges and discard weaker ones. We show that this heuristic can reverse when the target is not point accuracy, but calibrated probabilistic evaluation from a labeled calibration set. Holding the aggregation and calibration procedures fixed, we compare accuracy-ranked top-$k$ judge selection with using the full judge panel. Across four labeled pairwise-evaluation benchmarks spanning LLM-as-judge and reward-model settings, the calibrated full panel consistently outperforms accuracy-based selection. On RewardBench2, retaining all judges achieves negative log-likelihood (NLL) of $0.006$ versus $0.013$ under top-5 selection, halving the calibration error. This advantage persists after judge-family deduplication and against stronger same-pipeline subset search. We explain this reversal with oracle analyses showing that the optimal calibrated risk under proper scoring rules cannot increase when additional judge signals are made available, and that even below-chance judges can be useful when their biases are learnable and their signals are non-redundant. The resulting operating principle is simple: in multi-judge evaluation with labeled calibration data, do not discard weak judges by accuracy alone; keep them when they are parseable, non-redundant, and calibratable.
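
The comparison described in the abstract can be illustrated with a small synthetic sketch. This is not the paper's pipeline: it assumes conditionally independent binary judges with made-up accuracies (two of them below chance) and uses a ridge-regularised logistic regression as a stand-in calibrator. The function names and all numeric settings are hypothetical. The same calibrator is fitted once on the accuracy-ranked top-k judges and once on the full panel, and both are scored by NLL on held-out data.

```python
import numpy as np


def simulate_votes(n, judge_acc, rng):
    """Sample true labels and binary judge votes.

    Judge j agrees with the true label with probability judge_acc[j],
    independently of the other judges (a simplifying assumption).
    """
    y = rng.integers(0, 2, size=n)
    agree = rng.random((n, len(judge_acc))) < np.asarray(judge_acc)
    votes = np.where(agree, y[:, None], 1 - y[:, None])
    return votes.astype(float), y.astype(float)


def _with_intercept(X):
    """Prepend a constant column so the calibrator can learn an offset."""
    return np.hstack([np.ones((X.shape[0], 1)), X])


def fit_logistic(X, y, l2=1e-3, iters=25):
    """Ridge-regularised logistic regression fitted with Newton's method.

    Stand-in calibrator (assumption); the penalty also covers the intercept.
    """
    Xb = _with_intercept(X)
    w = np.zeros(Xb.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        grad = Xb.T @ (p - y) / len(y) + l2 * w
        hess = (Xb * (p * (1.0 - p))[:, None]).T @ Xb / len(y)
        hess += l2 * np.eye(len(w))
        w -= np.linalg.solve(hess, grad)
    return w


def nll(w, X, y):
    """Mean negative log-likelihood of labels y under the fitted calibrator."""
    p = 1.0 / (1.0 + np.exp(-_with_intercept(X) @ w))
    p = np.clip(p, 1e-12, 1.0 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))


def compare_top_k_vs_full(seed=0, k=3, n_cal=2000, n_test=5000):
    """Fit the same calibrator on top-k judges and on the full panel."""
    rng = np.random.default_rng(seed)
    # Hypothetical judge accuracies; the last two judges are below chance.
    judge_acc = [0.85, 0.80, 0.75, 0.70, 0.65, 0.30, 0.25]
    X_cal, y_cal = simulate_votes(n_cal, judge_acc, rng)
    X_test, y_test = simulate_votes(n_test, judge_acc, rng)

    # Curate: rank judges by accuracy on the calibration set, keep the top k.
    cal_acc = (X_cal == y_cal[:, None]).mean(axis=0)
    top_k = np.argsort(cal_acc)[::-1][:k]

    w_top = fit_logistic(X_cal[:, top_k], y_cal)
    w_full = fit_logistic(X_cal, y_cal)
    return {
        "top_k_nll": nll(w_top, X_test[:, top_k], y_test),
        "full_panel_nll": nll(w_full, X_test, y_test),
    }
```

Calling `compare_top_k_vs_full()` returns the held-out NLL for each panel. Under these assumptions the below-chance judges would be expected to receive negative weights, so their votes still sharpen the calibrated probability instead of being thrown away.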


Metadata

arXiv ID: 2605.09702
Categories: stat.ME, cs.CL
