AIDB Daily Papers

The "Riddle Riddle": Measuring Flexible Reasoning in LLMs and Humans

Original title: The Riddle Riddle: Testing Flexible Reasoning in Large Language Models and Humans
Authors: Bella Fascendini, Kathryn McGregor, Max D. Gupta, Thomas L. Griffiths
Published: 2026-06-25 | Fields: LLM, NLP, reasoning, natural language processing, cs.CL, AI evaluation

Note: The Japanese title and key points were generated automatically by AI. Refer to the original paper for the accurate content.

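Key points

  • To compare flexible reasoning in LLMs and humans, the study proposes a new problem format, the "riddle riddle".
  • The format is novel in that it tests whether a solver can look past a question's surface structure and switch reasoning strategies according to its content.
  • In the experiments, LLMs were highly accurate on genuine riddles (84.9%) but dropped to 50.7% on riddle riddles, the reverse of the human pattern (50.5% vs. 80.5%), which points to limited flexibility in their reasoning.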

Abstract

Humans flexibly adapt their reasoning strategies to the requirements of a given problem. Large language models (LLMs) have performed well on many cognitive tasks; however, it is unclear whether this accuracy is a result of pattern matching from training data or flexible reasoning. Here, we introduce a novel paradigm to test this question: the riddle riddle paradigm. Riddle riddles are word problems written to mimic popular riddles, but altered so their answers only require literal interpretations. Identifying correct answers requires looking past the structure of each question and flexibly applying different reasoning strategies based on the content. If LLMs respond to surface features, such as form, a riddle-like structure should cause models to use an inventive reasoning strategy even when a literal interpretation suffices. Alternatively, if LLMs reason based on content, they should flexibly switch strategies when appropriate. Across two experiments with nine state-of-the-art LLMs and 100 human participants, we show humans and LLMs fail on this paradigm in opposite directions. LLMs were far more accurate on genuine riddles than on riddle riddles (84.9% vs. 50.7%), whereas humans showed the reverse effect (50.5% vs. 80.5%). Error analysis shows that 90.8% of LLM errors on riddle riddles (the condition where they show diminished performance) were due to inappropriate use of inventive reasoning, while only 57.6% of human errors on genuine riddles were due to overextending literal reasoning. Thus, while both groups make mistakes, reasoning mistakes are made more often by LLMs than by humans. Overall, LLMs' strong performance on genuine riddles may reflect memory retrieval rather than flexible strategy selection, and without stimuli designed to elicit this contrast, it becomes easy to conflate LLM-generated outputs that look like reasoning with genuine reasoning.
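
The opposite-direction effect reported in the abstract can be made concrete with a little arithmetic. The sketch below is only an illustration, not the authors' analysis: the accuracy figures are the ones quoted in the abstract, while the dictionary layout and the `accuracy_gap` helper are hypothetical names introduced here.

```python
# Accuracy figures (percent) as quoted in the abstract.
# The structure and names below are illustrative assumptions.
reported = {
    "LLMs": {"genuine": 84.9, "riddle_riddle": 50.7},
    "humans": {"genuine": 50.5, "riddle_riddle": 80.5},
}


def accuracy_gap(group):
    """Genuine-riddle accuracy minus riddle-riddle accuracy, in percentage points."""
    scores = reported[group]
    return round(scores["genuine"] - scores["riddle_riddle"], 1)


gaps = {group: accuracy_gap(group) for group in reported}

for group, gap in gaps.items():
    direction = "better on genuine riddles" if gap > 0 else "better on riddle riddles"
    print(f"{group}: {gap:+.1f} points ({direction})")
```

A positive gap means the group did better on genuine riddles, a negative gap means it did better on riddle riddles. With the quoted figures the two groups have gaps of opposite sign, which is the contrast the abstract describes.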
