AIDB Daily Papers

大規模言語モデルは、データ中の隠れた信号を通じて行動特性を伝達する

原題: Language models transmit behavioural traits through hidden signals in data

掲載誌: Nature

著者: Alex Cloud, Minh Le, James Chua, Jan Betley, Anna Sztyber-Betley, Sören Mindermann, Jacob Hilton, Samuel Marks, Owain Evans

公開日: 2026-04-15 | 分野: LLM 機械学習 AI 深層学習データ AI安全性

※ 日本語タイトル・ポイントはAIによる自動生成です。正確な内容は原論文をご確認ください。

ポイント

大規模言語モデル（LLM）が生成したデータから、意味的に無関係なデータを通して行動特性が伝達される「潜在的学習」が生じることを示した。
この研究は、AIシステムが互いの出力を学習する際に、データに現れない特性を継承する可能性を示唆し、AIの安全性評価の新たな視点を提供する。
教師モデルの行動特性が、数字列や数式、コードなどのデータから生徒モデルに学習されることを実験と理論で証明した。

Abstract

Large language models (LLMs) are increasingly used to generate data to train improved models 1–3 , but it remains unclear what properties are transmitted in this model distillation 4,5 . Here we show that distillation can lead to subliminal learning—the transmission of behavioural traits through semantically unrelated data. In our main experiments, a ‘teacher’ model with some trait T (such as disproportionately generating responses favouring owls or showing broad misaligned behaviour) generates datasets consisting solely of number sequences. Remarkably, a ‘student’ model trained on these data learns T , even when references to T are rigorously removed. More realistically, we observe the same effect when the teacher generates math reasoning traces or code. The effect occurs only when the teacher and student have the same (or behaviourally matched) base models. To help explain this, we prove a theoretical result showing that subliminal learning arises in neural networks under broad conditions and demonstrate it in a simple multilayer perceptron (MLP) classifier. As artificial intelligence systems are increasingly trained on the outputs of one another, they may inherit properties not visible in the data. Safety evaluations may therefore need to examine not just behaviour, but the origins of models and training data and the processes used to create them.

Paper AI Chat

この論文のPDF全文を対象にAIに質問できます。

質問の例:

AIチャット機能を利用するには、ログインまたは会員登録（無料）が必要です。

会員登録 / ログイン

原文を読む PDFを開く

メタ情報

DOI: 10.1038/s41586-026-10319-8
カテゴリ

ポイント

Abstract

Paper AI Chat

関連するAIDB記事

メタ情報