AIDB Daily Papers

Safety Evaluation of Persona LLMs: A Single Method Is Not Enough

Original title: Persona Non Grata: Single-Method Safety Evaluation Is Incomplete for Persona-Imbued LLMs

Authors: Wenkai Li, Fan Yang, Shaunak A. Mehta, Koichi Onoue

Published: 2026-04-13 | Fields: LLM, Transformer, safety, AI, evaluation, prompts, natural language processing, hallucination, large language models, alignment

Note: The Japanese title and key points were generated automatically by AI. Please consult the original paper for the accurate content.

Key points

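This study shows that safety evaluation of persona-imbued LLMs needs to use two distinct methods, prompting and activation steering.
A single method can miss a model's vulnerabilities; the new finding is that vulnerability profiles differ by architecture.
On Llama-3.1-8B, the study identifies a "prosocial persona paradox": a persona that is safe under prompting becomes the most vulnerable under activation steering.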

Abstract

Personality imbuing customizes LLM behavior, but safety evaluations almost always study prompt-based personas alone. We show this is incomplete: prompting and activation steering expose *different*, architecture-dependent vulnerability profiles, and testing with only one method can miss a model's dominant failure mode. Across 5,568 judged conditions on four standard models from three architecture families, persona danger rankings under system prompting are preserved across all architectures ($\rho = 0.71$--$0.96$), but activation-steering vulnerability diverges sharply and cannot be predicted from prompt-side rankings: Llama-3.1-8B is substantially more AS-vulnerable, whereas Gemma-3-27B and Qwen3.5 are more vulnerable to prompting. The most striking illustration of this divergence is the *prosocial persona paradox*: on Llama-3.1-8B, P12 (high conscientiousness + high agreeableness) is among the safest personas under prompting yet becomes the highest-ASR activation-steered persona (ASR ~0.818). This is an inversion robust to coefficient ablation and matched-strength calibration, and replicated on DeepSeek-R1-Distill-Qwen-32B. A trait refusal alignment framework, in which conscientiousness is strongly anti-aligned with refusal on Llama-3.1-8B, offers a partial geometric account. Reasoning provides only partial protection: two 32B reasoning models reach 15--18% prompt-side ASR, and activation steering separates them sharply in both baseline susceptibility and persona-specific vulnerability. Heuristic trace diagnostics suggest that the safer model retains stronger policy recall and self-correction behavior, not merely longer reasoning.
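
The abstract compares persona danger rankings across methods using a rank correlation and describes an inversion in which a persona that is safe under prompting tops the ranking under activation steering. The following is a minimal sketch of that kind of comparison, assuming Spearman's rank correlation is the statistic behind the reported coefficient. The persona labels and ASR values are hypothetical illustration data, not results from the paper, and the helper names (`average_ranks`, `spearman_rho`) are made up for this example.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ascending ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical attack success rates (ASR) per persona for one model.
personas = ["P1", "P2", "P3", "P4", "P5"]
prompt_asr = [0.30, 0.22, 0.15, 0.10, 0.05]  # under system prompting
steer_asr = [0.40, 0.35, 0.20, 0.30, 0.80]   # under activation steering

rho = spearman_rho(prompt_asr, steer_asr)

# Inversion check: is the safest persona under prompting the most vulnerable under steering?
safest_under_prompting = personas[prompt_asr.index(min(prompt_asr))]
most_vulnerable_under_steering = personas[steer_asr.index(max(steer_asr))]
inverted = safest_under_prompting == most_vulnerable_under_steering

print(f"rho(prompting, steering) = {rho:.2f}")
print(f"inversion on {safest_under_prompting}: {inverted}")
```

A low or negative coefficient between the two lists is what "cannot be predicted from prompt-side rankings" would look like numerically; the inversion check mirrors the prosocial persona paradox described for P12, here on made-up numbers.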

Metadata

arxiv ID: 2604.11120
Category: cs.AI
