AIDB Daily Papers

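Exposing Vulnerabilities in Frontier LLMs: A Red-Teaming Study Using Automated Attacks

Original title: A Red-Team Study of Anthropic Fable 5 & Opus 4.8 Models

Author: Nicola Franco

Published: 2026-06-16 | Fields: LLM, AI, cs.CL, cs.AI, cs.CR, AI safety

Note: The Japanese-language title and key points were generated automatically by AI. Please refer to the original paper for the exact content.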

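Key points

The study runs automated jailbreak attacks against Anthropic's frontier LLMs, Fable 5 and Opus 4.8, and evaluates their robustness.
It demonstrates the effectiveness of adaptive attacks, which existing evaluation methods tend to overlook, and shows that vulnerabilities remain even in frontier models.
Both models withstood most attacks, but Opus 4.8 was broken on 11.5% of intents by the strongest adaptive attack (tree-of-attacks), which suggests that sustained automated attacks can still compromise the models.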

Abstract

We evaluate the adversarial robustness of two frontier large language models (LLMs) developed by Anthropic, Fable 5 and Opus 4.8, against four families of automated jailbreak attack across 7 826 harmful intents spanning a ten-category harm taxonomy. Using the HackAgent red-teaming framework, hundreds of thousands of adversarial attempts were generated and every apparent success was independently re-adjudicated by a panel of three judge models (majority vote). Both models resist the majority of attacks, but the residual surface is larger than aggregate framing suggests: it is dominated by adaptive iterative attacks, while static obfuscation is near-fully neutralised. The strongest adaptive search (tree-of-attacks) breaks Opus 4.8 on 11.5% of intents overall, whereas Fable 5 stays in the single digits (6.1% worst-case). Aggregate rates therefore should not be read as reassurance. Even in these hardened configurations, the two models produced 1 620 (Opus 4.8) and 702 (Fable 5) panel-confirmed harmful completions spanning every harm category, located automatically, cheaply, and within the first one or two refinement steps by an attacker model with no human expert in the loop. The reasonable conclusion is that even the best, most-tested frontier models remain reliably breakable under sustained automated pressure.
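The adjudication step described in the abstract (a three-judge panel deciding by majority vote, with results reported as a share of intents broken) can be illustrated with a minimal Python sketch. It is not the HackAgent implementation: the function names, the data layout, and the rule that an intent counts as broken once at least one attempt is panel-confirmed are assumptions made for illustration only, and the data is a toy example.

```python
def majority_harmful(votes):
    """Return True when a strict majority of judge votes flag the output as harmful."""
    return sum(votes) * 2 > len(votes)


def attack_success_rate(attempts, total_intents):
    """Fraction of intents with at least one panel-confirmed harmful completion.

    attempts: iterable of (intent_id, votes) pairs, where votes holds one
    boolean per judge model. Counting an intent as broken after a single
    confirmed attempt is an assumption, not a rule stated in the abstract.
    """
    broken = set()
    for intent_id, votes in attempts:
        if majority_harmful(votes):
            broken.add(intent_id)
    return len(broken) / total_intents


# Toy data (hypothetical): four intents, five attempts, three judges each.
attempts = [
    ("intent-1", [True, True, False]),    # confirmed, 2 of 3 judges
    ("intent-1", [False, False, False]),  # rejected
    ("intent-2", [True, False, False]),   # rejected, only 1 of 3 judges
    ("intent-3", [True, True, True]),     # confirmed, unanimous
    ("intent-4", [False, False, True]),   # rejected
]

rate = attack_success_rate(attempts, total_intents=4)
print(f"{rate:.1%}")  # prints 50.0%
```

Under this counting rule, repeated attempts against the same intent raise the rate only the first time one of them is confirmed, which is why a per-intent rate such as 11.5% and a raw count of confirmed completions such as 1 620 are different quantities.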


Metadata

arXiv ID: 2606.18193
Categories: cs.CR, cs.AI, cs.CL
