AIDB Daily Papers

大規模言語モデルの自信度と信頼性の乖離を解消する

原題: Closing the Confidence-Faithfulness Gap in Large Language Models

著者: Miranda Muqing Miao, Lyle Ungar

公開日: 2026-03-26 | 分野: LLM NLP 解釈性評価精度自然言語処理

※ 日本語タイトル・ポイントはAIによる自動生成です。正確な内容は原論文をご確認ください。

ポイント

大規模言語モデル（LLM）の自信度スコアと実際の精度が乖離している問題を分析しました。
線形プローブと対照的な活性化操作により、キャリブレーションと自信度信号が直交することを発見しました。
モデル内部の精度推定を読み取り、出力に反映させる二段階適応型操作でキャリブレーションを改善しました。

Abstract

Large language models (LLMs) tend to verbalize confidence scores that are largely detached from their actual accuracy, yet the geometric relationship governing this behavior remain poorly understood. In this work, we present a mechanistic interpretability analysis of verbalized confidence, using linear probes and contrastive activation addition (CAA) steering to show that calibration and verbalized confidence signals are encoded linearly but are orthogonal to one another -- a finding consistent across three open-weight models and four datasets. Interestingly, when models are prompted to simultaneously reason through a problem and verbalize a confidence score, the reasoning process disrupts the verbalized confidence direction, exacerbating miscalibration. We term this the "Reasoning Contamination Effect." Leveraging this insight, we introduce a two-stage adaptive steering pipeline that reads the model's internal accuracy estimate and steers verbalized output to match it, substantially improving calibration alignment across all evaluated models.

Paper AI Chat

この論文のPDF全文を対象にAIに質問できます。

質問の例:

AIチャット機能を利用するには、ログインまたは会員登録（無料）が必要です。

会員登録 / ログイン

💬 ディスカッション

ディスカッションに参加するにはログインが必要です。

ログイン / アカウント作成 →

arxivで読む PDFを開く

メタ情報

arxiv ID: 2603.25052
カテゴリ: cs.CL, cs.AI

ポイント

Abstract

Paper AI Chat

💬 ディスカッション

関連するAIDB記事

メタ情報