AIDB Daily Papers

Self-Recognition Ability of LLMs: Manipulating and Retrieving Activation Signatures

Original title: LLM Self-Recognition: Steering and Retrieving Activation Signatures

Authors: Thibaud Ardoin, Jonas Schäfer, Gerhard Wunder

Published: 2026-06-04 | Fields: LLM, AI, natural language processing, XAI, cs.AI, AI evaluation

Note: the Japanese title and key points on the source page were generated automatically by AI. Consult the original paper for the exact content.

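Key points

The paper demonstrates that LLMs can recognize text they generated themselves, and develops a method to steer and amplify that ability.
Steering the internal state with a random sparse vector during generation creates a detectable fingerprint that can be attributed to a specific LLM.
The method attributes text with over 98% accuracy while preserving generation quality, offering a practical alternative to embedding an external signal.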

Abstract

Recent advances in interpretability suggest that large language models (LLMs) implicitly encode signals in their generated text that enable self-recognition of their outputs. We demonstrate that this capability is reliable, even in low-entropy scenarios, and that it can be amplified through targeted intervention. By steering the internal residual stream during generation with a random sparse vector, we create a detectable fingerprint that enables attribution of a given text to a specific LLM. This signal is recoverable from the activations of an LLM used as a detector, achieving over 98% accuracy across multiple detection settings while preserving the quality of generated text. As AI-generated content proliferates, this approach offers a practical alternative to traditional detectors by leveraging the model's natural representation structure for attribution rather than embedding a signal externally. Our contributions include: (i) establishing reliable self-recognition capabilities in LLMs, (ii) a simple steering mechanism enabling multi-LLM identification with no quality degradation, (iii) demonstrating that activation spaces contain exploitable structure for encoding signals without semantic interference.
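
The steering and retrieval idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it uses no LLM, the activations are synthetic Gaussian vectors, and every name and constant (`random_sparse_unit_vector`, `steer`, `signature_score`, `DIM`, `NONZEROS`, `STRENGTH`) is hypothetical. The linear projection used as a detector is also an assumption. In the paper the signal is recovered from the activations of a detector LLM reading the generated text, which this sketch does not model, and the 98% figure is not reproduced here.

```python
import math
import random


def random_sparse_unit_vector(dim, nonzeros, rng):
    """Return a unit-norm vector with `nonzeros` random +/- entries, zero elsewhere."""
    vec = [0.0] * dim
    for idx in rng.sample(range(dim), nonzeros):
        vec[idx] = rng.choice((-1.0, 1.0)) / math.sqrt(nonzeros)
    return vec


def steer(hidden, direction, strength):
    """Add `strength * direction` to a hidden-state vector (the steering step)."""
    return [h + strength * d for h, d in zip(hidden, direction)]


def signature_score(hidden, direction):
    """Project a hidden state onto the signature direction."""
    return sum(h * d for h, d in zip(hidden, direction))


def detect(hidden, direction, threshold):
    """Flag a hidden state as steered when its projection exceeds the threshold."""
    return signature_score(hidden, direction) > threshold


# All constants below are illustrative assumptions, not values from the paper.
rng = random.Random(0)
DIM, NONZEROS, STRENGTH = 256, 8, 8.0
signature = random_sparse_unit_vector(DIM, NONZEROS, rng)

# Synthetic stand-ins for residual-stream activations: i.i.d. standard normal noise.
plain = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(200)]
steered = [steer(h, signature, STRENGTH) for h in plain]

# Threshold halfway between the expected scores of plain (0) and steered (STRENGTH).
threshold = STRENGTH / 2
correct = sum(not detect(h, signature, threshold) for h in plain)
correct += sum(detect(h, signature, threshold) for h in steered)
accuracy = correct / (len(plain) + len(steered))
print(f"toy detection accuracy: {accuracy:.3f}")
```

Because the signature has unit norm, steering shifts the projection by exactly `STRENGTH`, so the toy detector separates the two groups whenever the shift is large relative to the noise along that direction. A sparse vector touches only a few coordinates, which is one plausible reading of why such a signal could avoid semantic interference, though the sketch does not test that claim.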

Metadata

arXiv ID: 2606.06315
Category: cs.AI
