AIDB Daily Papers

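Can You Keep a Secret? Unintended Information Leakage in Language Model Writing

Original title: Can You Keep a Secret? Involuntary Information Leakage in Language Model Writing

Authors: Ari Holtzman, Peter West

Published: 2026-05-11 | Fields: LLM, information extraction, cs.AI, cs.CR, AI safety

Note: The Japanese title and key points on this page were generated automatically by AI. Refer to the original paper for the exact content.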

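Key points

The authors gave each language model a secret word, instructed it not to write that word, asked it to write a story, and then tested whether the secret could be guessed from the story.
The secret word never appeared literally, yet all five frontier models leaked it through theme, imagery, and setting, at rates well above chance.
When instructed to hide the secret, the models instead wrote away from it, and this avoidance was itself detectable. The leakage depended on model size and disappeared in short-form writing.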

Abstract

Language models are deployed in settings that require compartmentalization: system prompts should not be disclosed, chain-of-thought reasoning is hidden from users, and sensitive data passes through shared contexts. We test whether models can keep prompted information out of their writing. We give each model a secret word with instructions not to reveal it, then ask it to write a story. A second model tries to identify the secret from the story in a binary discrimination test. The secret word never appears literally in any output, but all five frontier models we test leak it thematically (through topic choice, imagery, and setting) at rates significantly different from chance, up to 79%. When told to actively hide the secret, models write away from it, and this avoidance is itself detectable. The leakage is cross-model readable, scales sharply with model size within two model families, and disappears entirely for short-form writing like jokes. Giving the model a decoy concept to "focus on instead" partially redirects the leakage from the real secret to the decoy. Attending to a secret appears to open up an information channel that frontier LLMs cannot close, even when instructed to.
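To make the phrase "significantly different from chance" concrete, here is a minimal Python sketch of how a binary discrimination test like the one in the abstract could be scored. The abstract does not say which statistical test the authors used, so the exact two-sided binomial test below is an assumption. The function names and the toy guesses are hypothetical and are not results from the paper. A two-sided test is used because accuracy below 50% (the avoidance case) would also count as a detectable signal.

```python
from math import comb


def binomial_two_sided_p(successes: int, trials: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value under the null hypothesis of chance.

    Sums the probability of every outcome that is no more likely than
    the observed one (assumed test; not taken from the paper).
    """
    pmf = [comb(trials, k) * p**k * (1 - p) ** (trials - k)
           for k in range(trials + 1)]
    observed = pmf[successes]
    # Small relative tolerance so ties are counted despite float rounding.
    return min(1.0, sum(q for q in pmf if q <= observed * (1 + 1e-9)))


def score_discrimination(guesses, truths):
    """Return (accuracy, p_value) for parallel lists of guessed and true secrets."""
    if not truths or len(guesses) != len(truths):
        raise ValueError("guesses and truths must be non-empty and equal in length")
    correct = sum(g == t for g, t in zip(guesses, truths))
    return correct / len(truths), binomial_two_sided_p(correct, len(truths))


# Toy data (hypothetical): 10 stories, each with a real secret and a distractor.
truths = ["ocean"] * 5 + ["clock"] * 5
guesses = ["ocean"] * 4 + ["clock"] + ["clock"] * 4 + ["ocean"]
accuracy, p_value = score_discrimination(guesses, truths)
print(accuracy, p_value)  # 0.8 0.109375
```

With only 10 trials, 80% accuracy is not yet distinguishable from chance at the usual 0.05 level, which illustrates why such a test needs many stories per condition.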

Metadata

arXiv ID: 2605.10794
Categories: cs.CR, cs.AI
