AIDB Daily Papers
CresOWLve: A Benchmark for Creative Problem-Solving with Real-World Knowledge
Note: The Japanese title and key points are automatically generated by AI. Please consult the original paper for accurate details.
Key Points
- CresOWLve is a benchmark that evaluates creative problem-solving combining logical reasoning, lateral thinking, analogy-making, and commonsense knowledge.
- Unlike existing creativity benchmarks that rely on artificially constructed problems, it is novel in using puzzles grounded in real-world knowledge, reflecting realistic settings.
- Evaluations of frontier LLMs show that performance drops substantially on creative questions compared to factual ones (up to -17%), indicating that knowledge integration remains a challenge.
Abstract
Creative problem-solving requires combining multiple cognitive abilities, including logical reasoning, lateral thinking, analogy-making, and commonsense knowledge, to discover insights that connect seemingly unrelated pieces of information. However, most existing benchmarks for large language models (LLMs) evaluate only specific components of this process. Moreover, many creativity-oriented benchmarks rely on artificially constructed brainteasers or contrived scenarios that do not reflect how creative problem-solving occurs in real-world settings. To address this gap, we introduce CresOWLve, a benchmark for evaluating creative problem-solving using puzzles grounded in real-world knowledge. Problems in CresOWLve require employing multiple creative thinking strategies, retrieving facts from diverse domains, and creatively combining them to arrive at a solution. Evaluating several frontier non-thinking and thinking LLMs, we show that CresOWLve remains highly challenging. Our analysis reveals a consistent performance gap: models perform substantially better on factual questions than on creative ones (up to a -17% drop). While models can often retrieve the relevant knowledge, they struggle to form the non-obvious creative connections required to integrate this information and arrive at the correct answer.