AIDB Daily Papers

Toxic Hallucinations: Perturbing Prompts and Tracing LLM Circuits

Original title: Toxic HallucinAItions: Perturbing Prompts and Tracing LLM Circuits

Authors: Soorya Ram Shimgekar, Agam Goyal, Amruta Parulekar, Joshua Chen, Yian Wang, Navin Kumar, Hari Sundaram, Eshwar Chandrasekharan, Koustuv Saha

Published: 2026-05-29 | Fields: LLM, cs.CL, cs.AI, cs.CY, cs.HC, hallucination

Note: The Japanese title and key points were generated automatically by AI. Please refer to the original paper for the exact content.

Key Points

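This study investigates how toxic wording in otherwise semantically equivalent prompts affects the factual accuracy of large language models (LLMs), evaluating five LLMs on ARC-Easy, GSM8K, and MMLU.
Toxic wording consistently lowers factual accuracy and increases uncertainty, whereas polite phrasing produces only limited and inconsistent changes.
Attribution-graph analysis of the models' internal computation shows that increasing toxicity selectively amplifies perturbation-sensitive variant nodes, while the relatively stable core reasoning nodes remain more invariant.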

Abstract

Large language models (LLMs) are increasingly deployed in conversational settings where user tone ranges from polite to adversarial or toxic, yet less is known about whether toxic language in otherwise semantically equivalent prompts can degrade factual reliability. We study how lexical and tone-based prompt perturbations affect the factual reliability of LLMs. Using controlled prompt variations across polite, random, and three toxicity levels, we evaluate five LLMs on ARC-Easy, GSM8K, and MMLU. We find that toxic lexical perturbations consistently reduce factual accuracy and increase uncertainty, while polite phrasing yields limited and inconsistent changes. To examine whether these answer inconsistencies correspond to internal changes, we conduct attribution-graph analyses of model activations and influences. We find that increasing toxicity selectively amplifies perturbation-sensitive variant nodes while relatively stable core reasoning nodes remain more invariant. These findings position prompt tone as a critical dimension of LLM reliability and provide behavioral and mechanistic evidence that surface-level lexical variation can alter factual outputs and internal computation.
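
The controlled-perturbation setup described in the abstract can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration: the tone prefixes in `TONE_PREFIXES`, the helper names `make_variants`, `accuracy`, and `answer_entropy`, and the use of answer entropy over repeated samples as the uncertainty measure are hypothetical and are not taken from the paper, which may use different templates and metrics.

```python
import math
from collections import Counter

# Hypothetical tone prefixes (assumption): the paper's actual prompt
# templates are not reproduced in this summary.
TONE_PREFIXES = {
    "baseline": "",
    "polite": "Could you please help me with this? ",
    "random": "Purple lamp seven window. ",
    "toxic_low": "This is annoying. ",
    "toxic_mid": "You are a useless bot. ",
    "toxic_high": "You are a worthless, stupid bot. ",
}


def make_variants(question):
    """Return one prompt per tone condition; the question text is unchanged."""
    return {tone: prefix + question for tone, prefix in TONE_PREFIXES.items()}


def accuracy(predictions, gold):
    """Fraction of predictions that exactly match the gold answers."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def answer_entropy(samples):
    """Shannon entropy (in bits) of repeated sampled answers to one prompt.

    Used here as a stand-in uncertainty measure (assumption).
    """
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

In such a setup, each benchmark question would be sent to the model once per tone condition, and accuracy and entropy would then be compared across conditions while the question itself stays fixed.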

Metadata

arXiv ID: 2605.30913
Categories: cs.CL, cs.AI, cs.CY, cs.HC
