AIDB Daily Papers
Why Does Dialogue Make LLM Reasoning Harder? An Investigation with the BOULDER Benchmark
※ The title and key points above were generated automatically by AI. Please refer to the original paper for accurate details.
Key Points
- The paper introduces BOULDER, a new benchmark for evaluating LLM reasoning in the more realistic setting of task-oriented dialogue.
- BOULDER covers eight travel-related tasks requiring arithmetic, spatial, and temporal reasoning; its novelty is that each problem can be compared across dialogue-based and isolated formats.
- Experiments show that LLM performance drops substantially in the dialogue setting, with the multi-turn nature of dialogue, role conditioning, and tool-use requirements identified as contributing factors.
Abstract
Large Language Models (LLMs) achieve strong performance on many reasoning benchmarks, yet these evaluations typically focus on isolated tasks that differ from real-world usage in task-oriented dialogue (TOD). In this setting, LLMs must reason while simultaneously generating text and adhering to instructions on role, format, and style. This mismatch raises concerns about whether benchmark performance accurately reflects models' reasoning robustness in the TOD setting. We investigate how framing reasoning tasks within TOD affects LLM performance by introducing BOULDER, a new dynamic benchmark covering eight travel-related tasks that require arithmetic, spatial, and temporal reasoning with both commonsense and formal aspects. Each problem is presented in both isolated and dialogue-based variants, enabling controlled comparison while mitigating data contamination. Experiments on eight LLMs reveal a substantial and consistent performance gap between isolated and dialogue settings. Through ablations and qualitative analysis, we show that this gap is largely driven by the multi-turn nature of dialogue, with additional effects from role conditioning and tool-use requirements. Our results highlight the need to evaluate LLM reasoning in realistic interactive scenarios.
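To make the benchmark design concrete, here is a minimal sketch of how a single travel-related arithmetic problem could be instantiated in both an isolated and a dialogue-based variant, with numbers generated dynamically to mitigate contamination. All names and the dialogue template are illustrative assumptions; this is not BOULDER's actual implementation.

```python
import random

def make_problem(seed: int) -> dict:
    """Generate one travel-related arithmetic problem with fresh numbers
    (dynamic generation helps mitigate data contamination)."""
    rng = random.Random(seed)
    nights = rng.randint(2, 7)
    rate = rng.choice([90, 110, 140, 180])  # hypothetical hotel rate per night
    answer = nights * rate

    question = (f"A hotel costs ${rate} per night. "
                f"What is the total for a {nights}-night stay?")

    # Isolated variant: the bare question, as in a standard reasoning benchmark.
    isolated = question

    # Dialogue variant: the same question embedded in a multi-turn,
    # role-conditioned task-oriented exchange (illustrative only).
    dialogue = [
        {"role": "system", "content": "You are a helpful travel agent."},
        {"role": "user", "content": f"I'm planning a {nights}-night trip."},
        {"role": "assistant", "content": "Happy to help with the budget."},
        {"role": "user", "content": question},
    ]
    return {"isolated": isolated, "dialogue": dialogue, "answer": answer}

problem = make_problem(seed=42)
print(problem["isolated"])
```

Because both variants share the exact same underlying computation and gold answer, any accuracy gap between them can be attributed to the dialogue framing rather than to the reasoning task itself, which is the controlled comparison the abstract describes.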