AIDB Daily Papers

表現選択による頑健なLLM忘却手法「RepSelect」

原題: RepSelect: Robust LLM Unlearning via Representation Selectivity

著者: Filip Sondej, Yushi Yang, Adam Mahdi

公開日: 2026-06-15 | 分野: LLM cs.CL AI安全性

※ 日本語タイトル・ポイントはAIによる自動生成です。正確な内容は原論文をご確認ください。

ポイント

LLMから特定の知識や価値観を深く忘却させる手法「RepSelect」を提案した。
既存手法は忘却が浅く、ファインチューニングで容易に復元される課題があった。
RepSelectは、忘却対象に特化した表現を選択的に消去し、頑健な忘却を実現した。

Abstract

Making large language models (LLMs) deeply forget specific knowledge and values without sacrificing general capabilities remains a central challenge in unlearning. However, current methods are easily reversed by fine-tuning or few-shot prompting, suggesting their forgetting is only shallow. We identify the root cause. Existing methods target representations shared with both the retain set and the subspace recovered by a fine-tuning attacker, making unlearning both disruptive to general capabilities and easy to reverse. We propose RepSelect (Representation Selectivity), isolates forget-set-specific representations by collapsing top principal components of weight gradients before each update, leaving general capabilities intact while limiting what fine-tuning can recover. We evaluate across two forget categories, biohazardous knowledge and abusive tendencies, and four model families spanning dense and Mixture-of-Experts architectures (Llama 3, Qwen 3.5, Gemma 4 E4B, DeepSeek V2 Lite). Compared to five popular baselines (GradDiff, NPO, SimNPO, RMU, UNDIAL), RepSelect achieves a 4-50x larger reduction in post-relearning answer accuracy than the strongest baseline, and is near-perfectly robust to few-shot prompting attacks. Targeting selective representations is thus an important step towards deep and robust LLM forgetting.

Paper AI Chat

この論文のPDF全文を対象にAIに質問できます。

質問の例:

AIチャット機能を利用するには、ログインまたは会員登録（無料）が必要です。

会員登録 / ログイン

arXivで読む PDFを開く

メタ情報

arXiv ID: 2606.17168
カテゴリ: cs.CL

ポイント

Abstract

Paper AI Chat

関連するAIDB記事

メタ情報