AIDB Daily Papers

Uncovering Persona-Dependent Preferences in Language Models

Original title: Probing Persona-Dependent Preferences in Language Models

Authors: Oscar Gilg, Pierre Beckmann, Daniel Paleka, Patrick Butlin

Published: 2026-05-13 | Fields: LLM, cs.CL, cs.AI, Prompt Engineering, AI Agents

Note: The Japanese title and the key points were generated automatically by AI. Consult the original paper for accurate details.

Key Points

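The authors analyzed language-model preferences with linear probes and found a preference vector that is shared across personas.
The work matters because it suggests that a model's internal preference machinery may be shared across personas rather than rebuilt for each one.
On Gemma-3-27B, steering along the preference vector was shown to causally control which task the model chooses.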
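The key points describe a linear probe that predicts which of two tasks the model picks. This page does not give the authors' exact probe objective, so what follows is only a minimal sketch under stated assumptions. It uses synthetic Gaussian vectors as hypothetical stand-ins for residual-stream activations (one vector per task) and a Bradley-Terry-style logistic objective, P(choose A over B) = sigmoid(w · (h_A − h_B)). Under that objective the learned weight vector w plays the role of a preference direction. All function names are illustrative, and this is not the paper's training setup.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def cosine(u, v):
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))


def make_synthetic_pairs(n_tasks, dim, n_pairs, seed=0):
    """Hypothetical data: one Gaussian 'activation' per task plus noisy pairwise choices."""
    rng = random.Random(seed)
    true_w = [rng.gauss(0, 1) for _ in range(dim)]  # hidden preference direction
    acts = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_tasks)]
    pairs = []
    for _ in range(n_pairs):
        a, b = rng.sample(range(n_tasks), 2)
        p_a = sigmoid(dot(true_w, acts[a]) - dot(true_w, acts[b]))
        pairs.append((a, b, 1 if rng.random() < p_a else 0))  # 1 = task a was chosen
    return acts, pairs, true_w


def train_pairwise_probe(acts, pairs, dim, lr=0.1, epochs=200):
    """Full-batch gradient descent on the logistic loss of sigmoid(w . (h_a - h_b))."""
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for a, b, y in pairs:
            diff = [x - z for x, z in zip(acts[a], acts[b])]
            err = sigmoid(dot(w, diff)) - y  # d(loss)/d(logit)
            for i in range(dim):
                grad[i] += err * diff[i]
        w = [wi - lr * g / len(pairs) for wi, g in zip(w, grad)]
    return w


def pairwise_accuracy(w, acts, pairs):
    """Fraction of pairs where the higher probe score matches the observed choice."""
    hits = sum((dot(w, acts[a]) > dot(w, acts[b])) == (y == 1) for a, b, y in pairs)
    return hits / len(pairs)


if __name__ == "__main__":
    acts, pairs, true_w = make_synthetic_pairs(n_tasks=50, dim=8, n_pairs=400)
    train, test = pairs[:300], pairs[300:]
    w = train_pairwise_probe(acts, train, dim=8)
    print("held-out pair accuracy:", round(pairwise_accuracy(w, acts, test), 3))
    print("cosine(probe, true direction):", round(cosine(w, true_w), 3))
```

With real activations, you would replace `make_synthetic_pairs` with cached per-task activations and the model's observed choices. Only the synthetic setting lets you compare the probe against a known ground-truth direction the way `cosine` does here.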

Abstract

Large language models (LLMs) can be said to have preferences: they reliably pick certain tasks and outputs over others, and preferences shaped by post-training and system prompts appear to shape much of their behaviour. But models can also adopt different personas which have radically different preferences. How is this implemented internally? Does each persona run on its own preference machinery, or is something shared underneath? We train linear probes on residual-stream activations of Gemma-3-27B and Qwen-3.5-122B to predict revealed pairwise task choices, and identify a genuine preference vector: it tracks the model's preferences as they shift across a range of prompts and situations, and on Gemma-3-27B steering along it causally controls pairwise choice. This preference representation is largely shared across personas: a probe trained on the helpful assistant predicts and steers the choices of qualitatively different personas, including an evil persona whose preferences anti-correlate with those of the Assistant.
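The abstract's causal claim is that, on Gemma-3-27B, steering along the probe direction shifts pairwise choice. The sketch below is a toy illustration only. It assumes the model's choice is read out linearly from one activation vector per task through a hypothetical `readout`, and that the probe recovered that direction exactly. It then adds α times the unit probe direction to task A's activation and compares the effect against a control direction orthogonal to the probe. In a real transformer, the steered activation passes through nonlinear later layers. The layer and token positions used for steering are also not specified here.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def unit(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


def choice_prob(readout, h_a, h_b):
    """Toy stand-in for the model: P(pick task A over task B), read out linearly."""
    return sigmoid(dot(readout, h_a) - dot(readout, h_b))


def steer(h, direction, alpha):
    """Activation addition: h + alpha * (unit-norm direction)."""
    d = unit(direction)
    return [x + alpha * di for x, di in zip(h, d)]


def orthogonal_control(direction, rng):
    """Random direction with its component along `direction` projected out."""
    u = unit(direction)
    r = [rng.gauss(0, 1) for _ in u]
    proj = dot(r, u)
    return [ri - proj * ui for ri, ui in zip(r, u)]


def sweep(seed=1, dim=8, alphas=(-4.0, -2.0, 0.0, 2.0, 4.0)):
    rng = random.Random(seed)
    readout = [rng.gauss(0, 1) for _ in range(dim)]  # hypothetical downstream readout
    probe_dir = list(readout)  # assumption: the probe recovered the readout exactly
    control_dir = orthogonal_control(probe_dir, rng)
    h_a = [rng.gauss(0, 1) for _ in range(dim)]
    h_b = [rng.gauss(0, 1) for _ in range(dim)]
    rows = []
    for alpha in alphas:
        p_probe = choice_prob(readout, steer(h_a, probe_dir, alpha), h_b)
        p_control = choice_prob(readout, steer(h_a, control_dir, alpha), h_b)
        rows.append((alpha, p_probe, p_control))
    return rows


if __name__ == "__main__":
    for alpha, p_probe, p_control in sweep():
        print(f"alpha={alpha:+.1f}  probe P(A)={p_probe:.3f}  control P(A)={p_control:.3f}")
```

Under these assumptions, P(choose A) rises monotonically as α increases along the probe direction, while the orthogonal control leaves the choice unchanged. A steering experiment looks for this kind of contrast. The toy guarantees the result by construction, so it illustrates the test itself, not the paper's evidence.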

Metadata

arXiv ID: 2605.13339
Categories: cs.CL, cs.AI