AIDB Daily Papers
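Temperature settings are key to reproducibility in LLM evaluation, but they are not sufficient on their own
Note: The Japanese title and key points were automatically generated by AI. Please consult the original paper for the exact content.
Key points
- The paper points out that when an LLM is used as a grader, setting the temperature to 0 does not guarantee reproducible evaluation results.
- Because of misconfigured evaluation harnesses and the nature of LLMs themselves, evaluation results vary even under identical conditions.
- Reporting a single result while ignoring the variance in evaluation outcomes risks misrepresenting safety; grader disagreement should be treated as a first-class metric.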
Abstract
LLM-as-judge ("grader") components are now standard in evaluation harnesses, including safety evaluations where a pass/fail verdict may gate downstream deployment decisions. A widespread assumption is that setting the grader's sampling temperature to 0 makes grading deterministic. We test this assumption against a real safety-evaluation codebase (Japan AISI's open-source aisev) and show it fails on two levels. First, the harness invokes its grader without setting temperature or seed; the underlying provider silently applies its default of 1.0, so items near the decision boundary flip pass/fail across identical runs (per-item disagreement up to ~50% over 20 runs). Second, pinning temperature=0 reduces but does not eliminate flips: across 690 API calls spanning two providers, three model tiers, and five sampling configurations, 1-2 of 7 borderline items remain non-reproducible even under forced greedy decoding (top_k=1). Claude Opus 4.7/4.8 has since deprecated temperature entirely, rendering the primary mitigation inapplicable to newer model generations. These findings expose a structural gap: evaluation harnesses that report single-run verdicts without variance or grader-disagreement metrics can present noise as a safety property. We release a reproduction harness (690 calls, 7 conditions) and recommend that harnesses treat grader disagreement as a first-class health metric alongside the scores themselves.
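The recommendation to treat grader disagreement as a first-class metric can be illustrated with a minimal sketch. It is not taken from aisev or from the paper's reproduction harness. The function names, the `"pass"`/`"fail"` labels, and the definition of disagreement used here (the share of repeated runs whose verdict differs from the item's majority verdict) are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List


def item_disagreement(verdicts: List[str]) -> float:
    """Share of runs whose verdict differs from the item's majority verdict.

    Assumed definition (not from the paper): 0.0 means every run agreed;
    0.5 is the maximum for a binary pass/fail grader.
    """
    if not verdicts:
        raise ValueError("need at least one verdict per item")
    majority_count = Counter(verdicts).most_common(1)[0][1]
    return 1.0 - majority_count / len(verdicts)


def disagreement_report(runs: Dict[str, List[str]]) -> Dict[str, object]:
    """Summarise grader stability over repeated, identically configured runs.

    `runs` maps a hypothetical item id to the verdicts it received.
    """
    if not runs:
        raise ValueError("need at least one item")
    per_item = {item: item_disagreement(v) for item, v in runs.items()}
    unstable = sorted(item for item, d in per_item.items() if d > 0.0)
    return {
        "per_item": per_item,
        "unstable_items": unstable,
        "max_disagreement": max(per_item.values()),
        "share_unstable": len(unstable) / len(per_item),
    }


if __name__ == "__main__":
    # Illustrative verdicts only; these are not results from the paper.
    example_runs = {
        "item_a": ["pass"] * 20,
        "item_b": ["pass"] * 10 + ["fail"] * 10,
        "item_c": ["pass"] * 19 + ["fail"],
    }
    report = disagreement_report(example_runs)
    print(report["unstable_items"], report["max_disagreement"])
```

A harness could publish a report like this next to its scores, so that a verdict backed by 20 agreeing runs is distinguishable from one that flips on half of them.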