AIDB Daily Papers
方言か属性か?:明示的プロフィール vs 暗黙的言語信号によるLLMバイアスの定量化
※ 日本語タイトル・ポイントはAIによる自動生成です。正確な内容は原論文をご確認ください。
ポイント
- 本研究では、LLMが明示的な属性情報と暗黙的な言語信号のどちらにバイアスを持つかを、2つのLLMと24,000以上の応答を用いて検証した。
- 明示的な属性提示は安全フィルターを過剰に起動させる一方、方言(AAVE、Singlishなど)は「方言ジェイルブレイク」を引き起こし、安全性が低下する逆説が発見された。
- 現在の安全対策は明示的キーワードに依存しすぎており、言語的多様性と公平性の両立という課題と、より汎用的な安全機構の必要性が示唆された。
Abstract
As state-of-the-art Large Language Models (LLMs) have become ubiquitous, ensuring equitable performance across diverse demographics is critical. However, it remains unclear whether these disparities arise from the explicitly stated identity itself or from the way identity is signaled. In real-world interactions, users' identity is often conveyed implicitly through a complex combination of various socio-linguistic factors. This study disentangles these signals by employing a factorial design with over 24,000 responses from two open-weight LLMs (Gemma-3-12B and Qwen-3-VL-8B), comparing prompts with explicitly announced user profiles against implicit dialect signals (e.g., AAVE, Singlish) across various sensitive domains. Our results uncover a unique paradox in LLM safety where users achieve ``better'' performance by sounding like a demographic than by stating they belong to it. Explicit identity prompts activate aggressive safety filters, increasing refusal rates and reducing semantic similarity compared to our reference text for Black users. In contrast, implicit dialect cues trigger a powerful ``dialect jailbreak,'' reducing refusal probability to near zero while simultaneously achieving a greater level of semantic similarity to the reference texts compared to Standard American English prompts. However, this ``dialect jailbreak'' introduces a critical safety trade-off regarding content sanitization. We find that current safety alignment techniques are brittle and over-indexed on explicit keywords, creating a bifurcated user experience where ``standard'' users receive cautious, sanitized information while dialect speakers navigate a less sanitized, more raw, and potentially a more hostile information landscape and highlights a fundamental tension in alignment--between equitable and linguistic diversity--and underscores the need for safety mechanisms that generalize beyond explicit cues.
Paper AI Chat
この論文のPDF全文を対象にAIに質問できます。
質問の例: