AIDB Daily Papers
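Can In-Context Learning Support Intrinsic Curiosity?
Note: The Japanese title and key points were generated automatically by AI. Please consult the original paper for the exact content.
Key Points
- This study tests whether the in-context learning (ICL) ability of large sequence models can evaluate "learning progress" without costly inner loops of gradient descent.
- Learning-progress rewards have traditionally been too computationally expensive to use at scale; ICL, acting as an update-free world model, may remove this bottleneck.
- Theory and experiments show that in non-temporal settings, such as active learning and Bayesian Experimental Design, ICL-derived rewards approximate true learning progress and can be used to train curious data-collection policies. In general Markov decision processes, the paper proves that an unbiased version of this is impossible.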
Abstract
Effective machine learning depends not only on how we model data, but also on what data we choose to collect. While large sequence models have revolutionized data modeling, the problem of automated data selection, or "intrinsic curiosity", remains a significant challenge. Classic approaches incentivize exploration by rewarding an agent based on its "learning progress", which measures how much a newly acquired observation improves a world model's predictive ability. However, evaluating these rewards traditionally requires expensive inner loops of gradient descent updates within each trajectory, rendering them computationally impractical at scale. In this work, we investigate whether the emergent in-context learning (ICL) capabilities of sequence models can eliminate this bottleneck by serving as immediate, update-free world models. Specifically, we evaluate whether an exploration policy can be trained to maximize learning progress, using solely the prediction errors and counterfactual context manipulations of an in-context learner. We first prove that in general Markov decision processes, this is in fact impossible in an unbiased way: the resulting intrinsic rewards either suffer from nuisance terms that bias their estimation of true learning progress, or they cannot be implemented using an in-context learner's prediction errors. Conversely, we prove a positive result for a broad subclass of non-temporal settings, encompassing active learning and Bayesian Experimental Design: here, ICL-derived rewards successfully bound and asymptotically converge to the true learning progress. We corroborate our theory with controlled experiments across continuous and symbolic environments, demonstrating that our ICL-driven framework successfully trains curious data-collection policies that explore optimally.
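To make the idea of an ICL-derived learning-progress reward concrete, here is a minimal sketch for a non-temporal, symbolic setting. It is not the paper's implementation: the abstract does not give the exact reward formula, so the reward below (probe loss without the new observation in context, minus probe loss with it) is only one plausible reading of "prediction errors and counterfactual context manipulations". The Laplace-smoothed frequency predictor is a hypothetical stand-in for a trained sequence model, and all function names are illustrative.

```python
import math

def predict_prob(context, arm):
    """Stand-in in-context learner (assumption, not a real sequence model).

    Returns a Laplace-smoothed estimate of P(outcome = 1 | arm) computed
    only from the (arm, outcome) pairs in `context`; no parameters are updated.
    """
    outcomes = [y for a, y in context if a == arm]
    return (sum(outcomes) + 1) / (len(outcomes) + 2)

def log_loss(context, probe):
    """Prediction error (negative log-likelihood) of the learner on one probe."""
    arm, y = probe
    p = predict_prob(context, arm)
    return -math.log(p if y == 1 else 1.0 - p)

def icl_learning_progress(context, new_obs, probe):
    """Hypothetical reward: how much adding `new_obs` to the context
    lowers the learner's prediction error on `probe`.

    The counterfactual manipulation is simply evaluating the same probe
    with and without `new_obs` in the context.
    """
    return log_loss(context, probe) - log_loss(context + [new_obs], probe)

context = [("a", 1), ("a", 1)]
probe = ("a", 1)

# An observation about the probed arm changes the prediction, so the reward is non-zero.
informative = icl_learning_progress(context, ("a", 1), probe)

# An observation about an unrelated arm leaves the prediction unchanged.
irrelevant = icl_learning_progress(context, ("b", 0), probe)
```

In this toy case the informative observation moves the predicted probability from 0.75 to 0.8, so its reward is log(0.8 / 0.75), while the irrelevant observation earns exactly zero. An exploration policy trained against such a reward would be pushed toward observations that change the learner's predictions, with no gradient step on the learner itself.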