AIDB Daily Papers

Self-Review Failures in LLM-Based Code Modernization: Seemingly Correct, Actually Wrong

Original title: Articulate but Wrong: Self-Review Failures in LLM-Based Code Modernization

Authors: Gokul Chandra Purnachandra Reddy, Aditya Lolla, Harsha Sanku

Published: 2026-05-20 | Fields: LLM, code generation, cs.SE, software engineering, AI assistance, AI evaluation

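Note: The Japanese-language title and key points on this page were generated automatically by AI. Consult the original paper for the exact content.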

Key Points

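The study examines whether LLM agents that migrate legacy code to modern stacks can recognize when their own output has changed the code's observable behavior.
Across 1,980 modernization calls checked with a type-strict behavioral oracle, self-review by the producing model proved unreliable, particularly on snippets containing semantic traps.
Models silently endorsed 31.7% of their own semantic drift cases (83/262). Drift rates did not track model capability or price, which indicates that the failure stems from task structure rather than model scale.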

Abstract

Large language model (LLM) agents are increasingly used to migrate legacy code to modern stacks. We ask a deceptively simple question: when an LLM modernizes legacy code, can the same model be relied upon to recognize when its own output silently changes observable behavior? We run 1,980 real modernization calls across 11 production LLMs from 7 distinct families on a balanced 60-snippet legacy-Python-2 corpus, evaluate every output with a type-strict behavioral oracle, and then ask each model to judge whether its own output preserves behavior. We report four findings. (1) Semantic-preservation drift is prevalent and sharply separable from a cleanly-controlled baseline: semantic-trap snippets drift in 39.7% of attempts versus 7.0% on benign-control code that requires no real modernization (+32.7 percentage points; n=660 each). (2) Drift concentrates on specific snippets that fail across models: pairwise model agreement on which snippets are hard is high (mean Pearson r=0.52), and a small core of numeric-semantics snippets fails for nearly every model and every prompt phrasing. (3) Self-review by the producing model is not a reliable safety net: across all semantic drift cases, 31.7% are silently endorsed by the same model that produced them (83/262), and the per-model self-miss rate is strongly bimodal -- ranging from 0% on five models to 100% on one widely deployed model -- with several models explicitly articulating the very Py2/Py3 semantic distinction that broke their output, then declaring behavior preserved. (4) Drift rate is non-monotone in model capability and price: per-model rates range 5.6%-46.7% and do not track model capability cleanly, indicating the failure is task-structural rather than driven by model scale. All code, prompts, the 60-snippet corpus, the behavioral oracle, the output extractor, and the raw model outputs are released.
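The kind of numeric-semantics drift and type-strict checking the abstract describes can be illustrated with a minimal sketch. This is not the paper's oracle or corpus: the function names (`average_py2_semantics`, `average_naive_port`, `type_strict_equal`, `drifted`) and the inputs are hypothetical, chosen only to show how integer division changes between Python 2 and Python 3 and why comparing values alone can miss it.

```python
# Hypothetical illustration only; not the released oracle from the paper.

def average_py2_semantics(total, count):
    # In Python 2, `/` on two ints floors the result; `//` reproduces that in Python 3.
    return total // count


def average_naive_port(total, count):
    # A naive port keeps `/`, which is true division in Python 3 and returns a float.
    return total / count


def type_strict_equal(expected, actual):
    # Require the same concrete type as well as equal values, so 3 and 3.0 differ.
    return type(expected) is type(actual) and expected == actual


def drifted(reference, candidate, inputs):
    # Collect the inputs on which the candidate's result differs from the reference.
    return [
        args
        for args in inputs
        if not type_strict_equal(reference(*args), candidate(*args))
    ]


inputs = [(7, 2), (6, 2), (-7, 2)]
print(drifted(average_py2_semantics, average_naive_port, inputs))
```

With these inputs, all three cases are flagged. The pair `(6, 2)` is the instructive one: `3 == 3.0` is true in Python, so a value-only comparison would accept it, while the type check catches the int-to-float change.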

Metadata

arXiv ID: 2605.21537
Category: cs.SE
