AIDB Daily Papers
Stagnation in AI Benchmarks: A Systematic Study of Saturation
※ The title and key points below are AI-generated summaries. Please consult the original paper for accurate details.
Key Points
- The authors analyze 60 Large Language Model benchmarks to characterize the extent and drivers of saturation.
- Benchmark saturation is a serious obstacle to measuring progress in AI models, and countermeasures are needed.
- Expert-curated benchmarks resist saturation better than crowdsourced ones, whereas hiding test data shows no protective effect.
Abstract
Artificial Intelligence (AI) benchmarks play a central role in measuring progress in model development and guiding deployment decisions. However, many benchmarks quickly become saturated, meaning that they can no longer differentiate between the best-performing models, diminishing their long-term value. In this study, we analyze benchmark saturation across 60 Large Language Model (LLM) benchmarks selected from technical reports by major model developers. To identify factors driving saturation, we characterize benchmarks along 14 properties spanning task design, data construction, and evaluation format. We test five hypotheses examining how each property contributes to saturation rates. Our analysis reveals that nearly half of the benchmarks exhibit saturation, with rates increasing as benchmarks age. Notably, hiding test data (i.e., public vs. private) shows no protective effect, while expert-curated benchmarks resist saturation better than crowdsourced ones. Our findings highlight which design choices extend benchmark longevity and inform strategies for more durable evaluation.
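The abstract defines saturation as the point where a benchmark "can no longer differentiate between the best-performing models." As a rough illustration (not the paper's actual criterion), one could flag a benchmark as saturated when the top models' scores are tightly bunched near the score ceiling; the threshold values below are hypothetical:

```python
def is_saturated(top_scores, ceiling=100.0, spread_tol=1.0, headroom_tol=2.0):
    """Hypothetical saturation check, not the paper's method.

    top_scores: scores of the current best-performing models on the benchmark.
    A benchmark is flagged when the leading models are nearly tied (small
    spread) and close to the maximum score (little headroom left).
    """
    best = max(top_scores)
    spread = best - min(top_scores)   # gap among the top models
    headroom = ceiling - best         # distance to a perfect score
    return spread <= spread_tol and headroom <= headroom_tol

# Near-ceiling cluster: the benchmark no longer separates the leaders.
print(is_saturated([98.2, 97.9, 97.6]))  # True
# Wide spread: the benchmark still differentiates models.
print(is_saturated([88.0, 82.5, 79.0]))  # False
```

Real analyses like the paper's would also need to account for score noise and track how this gap evolves as benchmarks age.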