AIDB Daily Papers
Rethinking "Human-Level" AI Evaluation: Comparison Against a World-Population Scale
Note: the title and key points are automatically generated by AI. Please consult the original paper for accurate details.
Key Points
- Proposes a framework that evaluates AI model performance based on success probabilities over the world population.
- Improves on existing "human-level" comparisons, whose narrow reference populations fail to measure AI's true capabilities.
- Calibrates the scales using data from education and reasoning benchmarks, enabling a fairer evaluation of AI performance.
Abstract
Comparing AI models to "human level" is often misleading when benchmark scores are incommensurate or human baselines are drawn from a narrow population. To address this, we propose a framework that calibrates items against the world population and reports performance on a common, human-anchored scale. Concretely, we build on a set of multi-level scales for different capabilities, where each level represents a probability of success of the whole world population on a logarithmic scale with base $B$. We calibrate each scale for each capability (reasoning, comprehension, knowledge, volume, etc.) by compiling publicly released human test data spanning education and reasoning benchmarks (PISA, TIMSS, ICAR, UKBioBank, and ReliabilityBench). The base $B$ is estimated by extrapolating between samples with two demographic profiles using LLMs, under the hypothesis that LLMs condense rich information about human populations. We evaluate the quality of different mappings using group slicing and post-stratification. These techniques allow scales to be recalibrated and standardized relative to the whole-world population.
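The abstract's multi-level scale can be sketched as a simple mapping: if each level $l$ corresponds to a world-population success probability of $B^{-l}$, then an item solved with probability $p$ sits at level $-\log_B(p)$. This reading of "probability of success ... on a logarithmic scale with base $B$" is an assumption, and the function name below is hypothetical, not from the paper.

```python
import math

def level_from_success_prob(p: float, base: float) -> float:
    """Map a world-population success probability p to a scale level,
    assuming level l corresponds to success probability base**(-l)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -math.log(p) / math.log(base)

# With an assumed base B = 10: an item everyone solves (p = 1.0) is at
# level 0, and an item solved by 1% of the population is at level 2.
print(level_from_success_prob(1.0, 10))   # 0.0
print(level_from_success_prob(0.01, 10))  # 2.0 (up to floating point)
```

Under this reading, estimating $B$ from demographic extrapolation fixes how quickly item difficulty grows from one level to the next.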