AIDB Daily Papers

Can In-Context Learning Support Intrinsic Curiosity?

Original title: Can In-Context Learning Support Intrinsic Curiosity?

Authors: Eric Elmoznino, Sangnie Bhardwaj, Johannes von Oswald, Rajai Nasser, Blaise Agüera y Arcas, João Sacramento, Rif A. Saurous, Guillaume Lajoie

Published: 2026-06-17 | Fields: LLM, reinforcement learning, learning, cs.AI, cs.LG, AI agents

Note: The translated title and key points were generated automatically by AI. Please consult the original paper for the exact content.

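Key Points

This study examines whether the in-context learning ability of large sequence models can be used to evaluate "learning progress" without costly inner loops of gradient descent.
Conventional learning-progress evaluation has been too computationally expensive to be practical at scale; in-context learning may remove this bottleneck by acting as an update-free world model.
Through theory and experiments, the authors show that in non-temporal settings such as active learning and Bayesian Experimental Design, rewards derived from in-context learning bound and asymptotically converge to the true learning progress, and that they can be used to train curious data-collection policies. In general Markov decision processes, by contrast, they prove that no unbiased reward of this kind is possible.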

Abstract

Effective machine learning depends not only on how we model data, but also on what data we choose to collect. While large sequence models have revolutionized data modeling, the problem of automated data selection, or "intrinsic curiosity", remains a significant challenge. Classic approaches incentivize exploration by rewarding an agent based on its "learning progress", which measures how much a newly acquired observation improves a world model's predictive ability. However, evaluating these rewards traditionally requires expensive inner loops of gradient descent updates within each trajectory, rendering them computationally impractical at scale. In this work, we investigate whether the emergent in-context learning (ICL) capabilities of sequence models can eliminate this bottleneck by serving as immediate, update-free world models. Specifically, we evaluate whether an exploration policy can be trained to maximize learning progress, using solely the prediction errors and counterfactual context manipulations of an in-context learner. We first prove that in general Markov decision processes, this is in fact impossible in an unbiased way: the resulting intrinsic rewards either suffer from nuisance terms that bias their estimation of true learning progress, or they cannot be implemented using an in-context learner's prediction errors. Conversely, we prove a positive result for a broad subclass of non-temporal settings, encompassing active learning and Bayesian Experimental Design: here, ICL-derived rewards successfully bound and asymptotically converge to the true learning progress. We corroborate our theory with controlled experiments across continuous and symbolic environments, demonstrating that our ICL-driven framework successfully trains curious data-collection policies that explore optimally.
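The idea of "learning progress" described in the abstract can be illustrated with a toy sketch. It is not the paper's reward definition or implementation. It assumes a one-dimensional Bayesian linear regressor as a stand-in for an in-context learner, because its predictions depend only on the context it is conditioned on and it needs no gradient updates. The names `predict`, `nll`, `learning_progress`, and the `eval_set` of held-out queries are hypothetical and chosen for illustration. The reward compares the learner's prediction error with and without a candidate observation in its context.

```python
import math

def predict(context, x, noise_var=0.1):
    """Predictive mean and variance of y at x, conditioned on the context only.

    Assumed model: y = w * x + noise, prior w ~ N(0, 1). No parameters are
    updated by gradient descent; the context alone determines the prediction.
    """
    precision = 1.0 + sum(cx * cx for cx, _ in context) / noise_var
    w_mean = sum(cx * cy for cx, cy in context) / noise_var / precision
    return w_mean * x, x * x / precision + noise_var

def nll(context, x, y, noise_var=0.1):
    """Prediction error: Gaussian negative log-likelihood of y at x."""
    mean, var = predict(context, x, noise_var)
    return 0.5 * (math.log(2 * math.pi * var) + (y - mean) ** 2 / var)

def learning_progress(context, new_obs, eval_set, noise_var=0.1):
    """Average drop in prediction error when new_obs is added to the context."""
    before = sum(nll(context, x, y, noise_var) for x, y in eval_set)
    after = sum(nll(context + [new_obs], x, y, noise_var) for x, y in eval_set)
    return (before - after) / len(eval_set)

# Hypothetical environment with true slope 2.0 and a small held-out query set.
true_w = 2.0
eval_set = [(x, true_w * x) for x in (-1.0, -0.5, 0.5, 1.0)]

# Observing at x = 1.0 reveals the slope; observing at x = 0.0 reveals nothing.
informative = learning_progress([], (1.0, true_w * 1.0), eval_set)
uninformative = learning_progress([], (0.0, 0.0), eval_set)
print(informative, uninformative)
```

In this sketch the observation at `x = 1.0` earns a positive reward and the observation at `x = 0.0` earns none, so a policy maximizing the reward would prefer the informative query. Scoring against a held-out set with known targets is a simplification made here for clarity.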

Metadata

arXiv ID: 2606.19476
Categories: cs.LG, cs.AI
