AIDB Daily Papers
How are LLMs swayed by the presence of reasoning chains? Blind spots in judging answer correctness
※ The title and key points are AI-generated. Please consult the original paper for accurate details.
Key Points
- The study examines how the presence or absence of reasoning chains affects judgments when large language models (LLMs) evaluate answers.
- With the rise of reasoning-capable models, exposing the reasoning process to a judge is expected to improve evaluation accuracy, but its actual impact had been unclear.
- The study finds that weak judges are easily swayed by fluent reasoning, and even strong judges are misled by reasoning chains that merely appear high-quality.
Abstract
Large language models (LLMs) have been widely adopted as a scalable surrogate for human evaluation, yet such judges remain imperfect and susceptible to surface-level biases. One possible reason is that these judges lack sufficient information when assessing answer correctness. With the rise of reasoning-capable models, exposing a generator's reasoning content to the judge provides richer information and is a natural candidate for improving judgment accuracy. However, its actual impact on judge behavior remains understudied. In this paper, we systematically investigate how access to reasoning chains affects LLM-based judgment across factual question answering (QA) and mathematical reasoning benchmarks. We find that weak judges are easily swayed by reasoning presence, frequently accepting incorrect answers accompanied by fluent reasoning, while strong judges can partially leverage reasoning as informative evidence. Nevertheless, even strong judges are misled by seemingly high-quality reasoning chains. Controlled experiments further reveal that both fluency and factuality of reasoning chains are critical signals driving judge decisions. These findings highlight the need for more robust LLM judges that can distinguish genuine reasoning quality from superficial fluency when evaluating modern reasoning models.
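To make the experimental contrast concrete, below is a minimal sketch of the two judging conditions the abstract compares: the same judge prompt is issued with and without the generator's reasoning chain. This is not the authors' code; `call_llm`, the prompt wording, and the CORRECT/INCORRECT response protocol are illustrative assumptions.

```python
"""Minimal sketch (assumed setup, not the authors' code) of judging an
answer with vs. without the generator's reasoning chain exposed."""
from typing import Optional


def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for any chat-completion client."""
    raise NotImplementedError("wire up your preferred LLM API here")


def build_judge_prompt(question: str, answer: str,
                       reasoning: Optional[str] = None) -> str:
    """Compose the judge prompt; optionally expose the reasoning chain."""
    parts = [
        "You are grading a proposed answer for correctness.",
        f"Question: {question}",
    ]
    if reasoning is not None:
        # The manipulated condition: the judge also sees the generator's
        # reasoning chain, which may be fluent yet factually wrong.
        parts.append(f"Generator's reasoning: {reasoning}")
    parts.extend([
        f"Proposed answer: {answer}",
        "Reply with exactly one word: CORRECT or INCORRECT.",
    ])
    return "\n".join(parts)


def judge_accepts(question: str, answer: str,
                  reasoning: Optional[str] = None) -> bool:
    """Return True iff the judge labels the answer CORRECT."""
    verdict = call_llm(build_judge_prompt(question, answer, reasoning))
    return verdict.strip().upper().startswith("CORRECT")
```

Comparing acceptance rates between the two conditions on answers with known gold labels is the kind of controlled measurement that would surface the reasoning-presence bias the paper reports.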