AIDB Daily Papers

Asking Is Not Enough: Protocol Sensitivity in LLM Confidence Calibration

Authors: Hankyeol Kim, Pilsung Kang

Published: 2026-05-26 | Fields: LLM, cs.AI, Reliability, Prompt Engineering, AI Evaluation

Note: The Japanese title and key points were generated automatically by AI. Please refer to the original paper for the exact content.

Key Points

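LLM confidence calibration is commonly evaluated by comparing token-probability scores with verbalized confidence, but the comparison depends on measurement choices that are rarely made explicit.
Varying the measurement axes (which answer string is scored, how the token-probability score is read from the answer tokens, and which conditioning context is used) shifts the result: conditioning context changes the sign or magnitude of the ECE gap, while token readout produces smaller but still sign-moving changes.
Verbalized confidence reflects not only correctness but also answer plausibility and provenance, so the authors argue that both signals should be treated as protocol-dependent behavioral measurements.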
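
To make the "token readout" axis concrete, here is a minimal sketch of how the same answer tokens can yield different token-probability scores depending on how they are collapsed into one number. The function name `token_readouts`, the four readout variants, and the example log-probabilities are illustrative assumptions; the summary does not say which readouts the paper actually compares.

```python
import math


def token_readouts(logprobs):
    """Collapse per-token log-probabilities of one answer into scalar scores.

    `logprobs` is a list of natural-log probabilities, one per answer token.
    The variants below are common choices, assumed here for illustration.
    """
    if not logprobs:
        raise ValueError("need at least one answer token")
    total = sum(logprobs)
    return {
        # Probability of the whole answer string (product of token probabilities).
        "joint": math.exp(total),
        # Geometric mean of token probabilities, which removes the length penalty.
        "length_normalized": math.exp(total / len(logprobs)),
        # Probability of the first answer token only.
        "first_token": math.exp(logprobs[0]),
        # Probability of the least likely token in the answer.
        "min_token": math.exp(min(logprobs)),
    }


# Hypothetical log-probabilities for a three-token answer (not from the paper).
example_logprobs = [math.log(0.9), math.log(0.5), math.log(0.8)]
scores = token_readouts(example_logprobs)

for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Because the readouts disagree on the same tokens, any comparison against verbalized confidence inherits that choice, which is why it is worth reporting explicitly.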

Abstract

LLM confidence calibration is often evaluated by comparing two signals: token-probability scores and verbalized confidence. These signals are sometimes treated as direct readouts of model uncertainty, but their comparison depends on measurement choices that are rarely made explicit. In the main analysis, we hold the verbalized-confidence elicitation fixed: a single prompt template, probability scale, and output format. We then vary the measurement axes that define the verbalized-vs-token comparison: which answer string receives the token-probability score, how that score is read from the answer tokens, and under which conditioning context it is measured. We evaluate this design on four QA benchmarks across three open 7-8B base/Instruct model families, with larger Qwen2.5 variants as same-family robustness checks. The resulting comparison is sensitive to these choices: conditioning context changes the sign or magnitude of the ECE gap across settings, token readout produces smaller but still sign-moving changes, and changing the ECE estimator has little effect. Under the default generated-answer, bare-context protocol, Instruct settings are close to parity rather than showing a large calibration gain for verbalized confidence. In a separate supplied-answer analysis, surface-plausible wrong answers receive nearly the same confidence as supplied gold answers, suggesting that verbalized confidence also reflects answer plausibility and provenance rather than correctness alone. We argue that both confidence signals should be treated as protocol-dependent behavioral measurements, and provide a reporting checklist covering elicitation provenance, scored answer, token-probability readout, and conditioning context.
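
As a rough illustration of the "ECE gap" the abstract refers to, the sketch below computes a standard equal-width-bin expected calibration error for each confidence signal and takes their difference. The sign convention (verbalized minus token), the bin count, and the toy data are assumptions made for this example; the abstract does not specify the paper's estimator or convention.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE for confidences in [0, 1] and 0/1 correctness labels."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one example")

    # Assign each example to a bin; a confidence of exactly 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, label in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, label))

    # Weighted average of |mean confidence - accuracy| over non-empty bins.
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(label for _, label in members) / len(members)
        ece += (len(members) / n) * abs(mean_conf - accuracy)
    return ece


# Toy data (hypothetical, not from the paper): four questions, three answered correctly.
correct = [1, 0, 1, 1]
verbalized = [0.95, 0.95, 0.95, 0.95]  # stated confidence per answer
token_prob = [0.55, 0.35, 0.75, 0.65]  # token-probability score per answer

ece_verbalized = expected_calibration_error(verbalized, correct)
ece_token = expected_calibration_error(token_prob, correct)

# Assumed convention: a negative gap means verbalized confidence is better calibrated.
ece_gap = ece_verbalized - ece_token
print(f"verbalized={ece_verbalized:.3f} token={ece_token:.3f} gap={ece_gap:.3f}")
```

Swapping in a different `token_prob` list (for example, one produced by another readout or conditioning context) can move `ece_gap` across zero, which is the kind of protocol sensitivity the abstract describes.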

Metadata

arXiv ID: 2605.27752
Category: cs.AI
