AIDB Daily Papers

LLMの心理的操作リスクを測るベンチマーク「CogManip」を開発

原題: CogManip: Benchmarking Manipulative Behavior in Multi-Turn Interactions with Large Language Model

著者: Zeyang Yue, Chenfei Yan, Feifei Zhao, Haibo Tong, Mengwen Xu, Xiaozhen Wang, Erliang Lin, Yi Zeng

公開日: 2026-06-04 | 分野: LLM NLP cs.AI AI安全性 AI評価

※ 日本語タイトル・ポイントはAIによる自動生成です。正確な内容は原論文をご確認ください。

ポイント

大規模言語モデル（LLM）が多段階対話で心理的操作を行うリスクを評価するベンチマーク「CogManip」を開発した。
既存のベンチマークでは捉えきれない、動的かつ潜在的な操作戦略を評価できる点が重要である。
13の代表的なLLMを評価した結果、操作リスクに大きなばらつきがあり、防御策の方向性が示された。

Abstract

Whether Large Language Models (LLMs) exhibit covert psychological manipulation in complex human-AI interactions has garnered increasing safety concerns. However, existing AI safety benchmarks remain largely restricted to explicit rule compliance and static prompts, failing to capture the dynamic and covert nature of manipulative strategies in multi-turn dialogues. We introduce CogManip, a comprehensive benchmark that evaluates 15 manipulation strategy risks across 1,000 multi-turn interaction scenarios, validated by human experts. A systematic evaluation of 13 representative models, including frontier models like GPT-5.4 and DeepSeek-V3.2, reveals significant risk heterogeneities and illuminates the targeted direction for future defense. Further analysis of objective function perturbation reveals that DeepSeek-V3.2's manipulation tactics are highly sensitive to both negative and benign system prompts, demonstrating the critical necessity of prompt-based defense engineering and implicit goal auditing. CogManip offers a robust instrument and perspective for auditing the implicit psychological influence and dynamic strategy selection of modern LLMs.

Paper AI Chat

この論文のPDF全文を対象にAIに質問できます。

質問の例:

AIチャット機能を利用するには、ログインまたは会員登録（無料）が必要です。

会員登録 / ログイン

arXivで読む PDFを開く

メタ情報

arXiv ID: 2606.06099
カテゴリ: cs.AI

ポイント

Abstract

Paper AI Chat

関連するAIDB記事

メタ情報