AIDB Daily Papers

大規模言語モデルは本当に人間より賢いのか？汚染検証による実力評価

原題: Are Large Language Models Truly Smarter Than Humans?

著者: Eshwar Reddy M, Sourav Karmakar

公開日: 2026-03-17 | 分野: LLM ベンチマーク機械学習 AI 評価言語テキスト研究精度モデル

※ 日本語タイトル・ポイントはAIによる自動生成です。正確な内容は原論文をご確認ください。

ポイント

大規模言語モデル(LLM)の性能を、汚染検証によって厳密に評価し、その実力を明らかにすることを目的とした研究。
既存のベンチマークはインターネット上に公開されており、LLMが学習データで評価データに触れている可能性があり、性能評価の信頼性が課題。
GPT-4oなど6つのLLMを対象に、汚染率の検出、間接参照による精度低下、記憶の兆候の検出を行い、汚染が性能に影響を与えていることを示唆。

Abstract

Public leaderboards increasingly suggest that large language models (LLMs) surpass human experts on benchmarks spanning academic knowledge, law, and programming. Yet most benchmarks are fully public, their questions widely mirrored across the internet, creating systematic risk that models were trained on the very data used to evaluate them. This paper presents three complementary experiments forming a rigorous multi-method contamination audit of six frontier LLMs: GPT-4o, GPT-4o-mini, DeepSeek-R1, DeepSeek-V3, Llama-3.3-70B, and Qwen3-235B. Experiment 1 applies a lexical contamination detection pipeline to 513 MMLU questions across all 57 subjects, finding an overall contamination rate of 13.8% (18.1% in STEM, up to 66.7% in Philosophy) and estimated performance gains of +0.030 to +0.054 accuracy points by category. Experiment 2 applies a paraphrase and indirect-reference diagnostic to 100 MMLU questions, finding accuracy drops by an average of 7.0 percentage points under indirect reference, rising to 19.8 pp in both Law and Ethics. Experiment 3 applies TS-Guessing behavioral probes to all 513 questions and all six models, finding that 72.5% trigger memorization signals far above chance, with DeepSeek-R1 displaying a distributed memorization signature (76.6% partial reconstruction, 0% verbatim recall) that explains its anomalous Experiment 2 profile. All three experiments converge on the same contamination ranking: STEM > Professional > Social Sciences > Humanities.

Paper AI Chat

この論文のPDF全文を対象にAIに質問できます。

質問の例:

AIチャット機能を利用するには、ログインまたは会員登録（無料）が必要です。

会員登録 / ログイン

💬 ディスカッション

ディスカッションに参加するにはログインが必要です。

ログイン / アカウント作成 →

arxivで読む PDFを開く

メタ情報

arxiv ID: 2603.16197
カテゴリ: cs.AI, cs.CL

ポイント

Abstract

Paper AI Chat

💬 ディスカッション

関連するAIDB記事

メタ情報