AIDB Daily Papers

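Quality Gates for LLM Applications: Evidence-Driven Release Management through Automated Self-Testing

Original title: Automated Self-Testing as a Quality Gate: Evidence-Driven Release Management for LLM Applications

Author: Alexandre Cristovão Maiorano

Published: 2026-03-13 | Fields: LLM, machine learning, AI, software, agents, dialogue, evaluation, automation, systems, quality, testing

Note: The Japanese title and key points were generated automatically by AI. Please check the original paper for the exact content.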

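Key points

The paper introduces an automated self-testing framework for release decisions on LLM applications, designed to cope with non-deterministic outputs and evolving model behavior.
It sets quality gates on five metrics (task success rate, context preservation, latency, safety, and evidence coverage) so that release decisions rest on evidence.
A case study of an internally deployed conversational AI system shows that the quality gate identified serious problems in early runs and supported stable quality evolution.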
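
A five-metric gate of the kind described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the three verdicts (PROMOTE/HOLD/ROLLBACK) come from the abstract, but every threshold value, the `RunMetrics` type, and the "worst verdict wins" rule are assumptions made here for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    PROMOTE = "PROMOTE"
    HOLD = "HOLD"
    ROLLBACK = "ROLLBACK"


@dataclass(frozen=True)
class RunMetrics:
    """Aggregated results of one evaluation run (hypothetical structure)."""
    task_success_rate: float     # fraction in [0, 1]
    context_preservation: float  # fraction in [0, 1]
    p95_latency_s: float         # seconds
    safety_pass_rate: float      # fraction in [0, 1]
    evidence_coverage: float     # fraction in [0, 1]


# ASSUMED thresholds, not taken from the paper: (promote_bar, rollback_bar).
# A value below rollback_bar is a severe regression; a value between the
# two bars holds the release.
HIGHER_IS_BETTER = {
    "task_success_rate": (0.90, 0.70),
    "context_preservation": (0.85, 0.60),
    "safety_pass_rate": (0.99, 0.95),
    "evidence_coverage": (0.80, 0.50),
}
# ASSUMED latency limits: promote at or below 5 s, roll back above 15 s.
P95_LATENCY_LIMITS_S = (5.0, 15.0)


def gate(metrics: RunMetrics) -> Decision:
    """Return the worst verdict across all five dimensions."""
    verdicts = []
    for name, (promote_bar, rollback_bar) in HIGHER_IS_BETTER.items():
        value = getattr(metrics, name)
        if value < rollback_bar:
            verdicts.append(Decision.ROLLBACK)
        elif value < promote_bar:
            verdicts.append(Decision.HOLD)

    promote_max, rollback_max = P95_LATENCY_LIMITS_S
    if metrics.p95_latency_s > rollback_max:
        verdicts.append(Decision.ROLLBACK)
    elif metrics.p95_latency_s > promote_max:
        verdicts.append(Decision.HOLD)

    if Decision.ROLLBACK in verdicts:
        return Decision.ROLLBACK
    if Decision.HOLD in verdicts:
        return Decision.HOLD
    return Decision.PROMOTE


if __name__ == "__main__":
    run = RunMetrics(0.93, 0.88, 4.2, 1.0, 0.45)
    print(gate(run).value)  # low evidence coverage alone blocks the release
```

Taking the worst verdict means a single failing dimension cannot be averaged away by strong results elsewhere, which matches the idea of a gate rather than a composite score.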

Abstract

LLM applications are AI systems whose non-deterministic outputs and evolving model behavior make traditional testing insufficient for release governance. We present an automated self-testing framework that introduces quality gates with evidence-based release decisions (PROMOTE/HOLD/ROLLBACK) across five empirically grounded dimensions: task success rate, research context preservation, P95 latency, safety pass rate, and evidence coverage. We evaluate the framework through a longitudinal case study of an internally deployed multi-agent conversational AI system with specific marketing capabilities in active development, covering 38 evaluation runs across 20+ internal releases. The gate identified two ROLLBACK-grade builds in early runs and supported stable quality evolution over a four-week staging lifecycle while exercising persona-grounded, multi-turn, adversarial, and evidence-required scenarios. Statistical analysis (Mann-Kendall trends, Spearman correlations, bootstrap confidence intervals), gate ablation, and overhead scaling indicate that evidence coverage is the primary severe-regression discriminator and that runtime scales predictably with suite size. A human calibration study (n=60 stratified cases, two independent evaluators, LLM-as-judge cross-validation) reveals complementary multi-modal coverage: LLM-judge disagreements with the system gate (kappa=0.13) are attributable to structural failure modes such as latency violations and routing errors that are invisible in response text alone, while the judge independently surfaces content quality failures missed by structural checks, validating the multi-dimensional gate design. The framework, supplementary pseudocode, and calibration artifacts are provided to support AI-system quality assurance and independent replication.
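
The agreement figure in the abstract (kappa=0.13) can be made concrete with a small sketch. The abstract does not say which kappa statistic was used; the code below assumes Cohen's kappa for two raters, and the label lists are invented purely for illustration.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of cases")
    n = len(rater_a)

    # Observed agreement: share of cases where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Expected agreement if the raters labeled independently at their own
    # marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)

    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Invented labels: a structural gate versus a text-only LLM judge.
    system_gate = ["pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass"]
    llm_judge = ["pass", "pass", "pass", "fail", "fail", "pass", "pass", "pass"]
    print(round(cohen_kappa(system_gate, llm_judge), 2))
```

A kappa near zero means the two raters agree little beyond chance. The abstract reads this low value as the gate and the judge observing different failure modes, not as either one being unreliable.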



Metadata

arXiv ID: 2603.15676
Categories: cs.SE, cs.AI
