A Factor Analysis Evaluation Method for Measuring the Latent Abilities of LLMs

2025.07.29

Evaluation and Benchmarks (model evaluation, benchmarks, performance measurement)

📝 This is a "short note": casual, easy-to-read content in which the AIDB research team introduces a paper from its own perspective.

IQ Test for LLMs: An Evaluation Framework for Uncovering Core Skills in LLMs

https://doi.org/10.48550/arXiv.2507.20208

Aviya Maimon, Amir DN Cohen, Gal Vishne, Shauli Ravfogel, Reut Tsarfaty
(Bar-Ilan University, OriginAI, Columbia University, New York University)

Current evaluations of large language models (LLMs) rely on benchmark scores, but it is difficult to interpret what these individual scores reveal about a model’s overall skills. Specifically, as a community we lack understanding of how tasks relate to one another, what they measure in common, how they differ, or which ones are redundant. As a result, models are often assessed via a single score averaged across benchmarks, an approach that fails to capture the models’ wholistic strengths and limitations. Here, we propose a new evaluation paradigm that uses factor analysis to identify latent skills driving performance across benchmarks. We apply this method to a comprehensive new leaderboard showcasing the performance of 60 LLMs on 44 tasks, and identify a small set of latent skills that largely explain performance. Finally, we turn these insights into practical tools that identify redundant tasks, aid in model selection, and profile models along each latent skill.
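
To make the idea in the abstract concrete, here is a minimal sketch of how latent skills might be pulled out of a leaderboard. It is not the authors' procedure: it uses principal-component extraction on the task correlation matrix (power iteration with deflation) as a simple stand-in for a full factor-analysis fit, with no rotation and no estimation of task-specific variance. The task names, the 8x5 score table, and all function names are hypothetical and invented for illustration.

```python
import math

# Hypothetical toy leaderboard: rows are models, columns are tasks.
# Task names and scores are invented for illustration, not taken from the paper.
TASKS = ["math_1", "math_2", "math_3", "reading_1", "reading_2"]
SCORES = [
    [0.83, 0.68, 0.76, 0.73, 0.78],
    [0.77, 0.68, 0.70, 0.67, 0.72],
    [0.83, 0.62, 0.76, 0.27, 0.42],
    [0.77, 0.62, 0.70, 0.33, 0.48],
    [0.43, 0.32, 0.34, 0.73, 0.72],
    [0.37, 0.32, 0.40, 0.67, 0.78],
    [0.43, 0.38, 0.34, 0.27, 0.48],
    [0.37, 0.38, 0.40, 0.33, 0.42],
]


def correlation_matrix(rows):
    """Pearson correlations between columns (tasks) across rows (models)."""
    n, k = len(rows), len(rows[0])
    unit_cols = []
    for j in range(k):
        col = [row[j] for row in rows]
        mean = sum(col) / n
        dev = [x - mean for x in col]
        norm = math.sqrt(sum(d * d for d in dev))
        unit_cols.append([d / norm for d in dev])
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in unit_cols]
            for ci in unit_cols]


def top_eigenpair(matrix, iterations=500):
    """Largest eigenvalue and unit eigenvector of a symmetric matrix."""
    k = len(matrix)
    vec = [1.0 / math.sqrt(k)] * k
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    value = sum(vec[i] * matrix[i][j] * vec[j]
                for i in range(k) for j in range(k))
    return value, vec


def extract_factors(corr, n_factors):
    """Return (eigenvalues, loadings); loadings[f][t] is task t on factor f."""
    k = len(corr)
    residual = [row[:] for row in corr]
    values, loadings = [], []
    for _ in range(n_factors):
        value, vec = top_eigenpair(residual)
        values.append(value)
        loadings.append([v * math.sqrt(value) for v in vec])
        # Deflate: remove the variance this factor already accounts for.
        for i in range(k):
            for j in range(k):
                residual[i][j] -= value * vec[i] * vec[j]
    return values, loadings


corr = correlation_matrix(SCORES)
eigenvalues, loadings = extract_factors(corr, n_factors=2)
explained = sum(eigenvalues) / len(TASKS)

for f, row in enumerate(loadings, start=1):
    print(f"factor {f}:", {t: round(l, 2) for t, l in zip(TASKS, row)})
print(f"share of variance explained by 2 factors: {explained:.2f}")
```

In this toy table the three math tasks load almost entirely on one factor and the two reading tasks on the other, so five benchmark columns collapse into two skill axes. A real analysis would more likely use a dedicated factor-analysis routine with rotation, and would need a principled way to choose the number of factors.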
