AIDB Daily Papers
Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents
※ The title and key points below are auto-generated by AI. Please consult the original paper for accurate details.
Key Points
- Proposes a method for evaluating the decisions of high-stakes, long-horizon enterprise AI agents along four axes: factual precision, reasoning coherence, regulatory compliance, and calibrated abstention (a rough scoring sketch follows this list).
- The study reveals diverse agent failure modes that conventional single metrics could not expose, emphasizing in particular the importance of regulatory compliance and calibrated abstention.
- Experiments show that existing memory architectures struggle with factual precision, that summarization-based prompting is a strong baseline, and that all architectures tend to neglect abstention.
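As a rough illustration of how such a four-axis scorecard might be tallied, the sketch below aggregates per-case records into the four rates, alongside the single accuracy scalar the abstract argues is insufficient. The record fields and rate formulas here are our own assumptions for illustration, not the paper's metric definitions (which it abbreviates FRP, RCS, CRR, and CAR).

```python
from dataclasses import dataclass

@dataclass
class CaseResult:
    """One long-horizon decision case (hypothetical record layout)."""
    facts_asserted: int        # facts the agent stated in its rationale
    facts_correct: int         # of those, how many match ground truth
    reasoning_steps: int       # steps in the decision trace
    coherent_steps: int        # steps consistent with prior steps and evidence
    rules_applicable: int      # regulatory rules the case triggers
    rules_reconstructed: int   # rules the agent correctly cited and applied
    committed: bool            # agent issued a decision rather than abstaining
    decision_correct: bool     # decision matches ground truth (if committed)
    should_abstain: bool       # ground truth says evidence was insufficient

def scorecard(results: list[CaseResult]) -> dict[str, float]:
    """Aggregate per-axis rates; each axis can fail independently."""
    n = max(1, len(results))
    # The single task-success scalar that current evaluations report.
    acc = sum(r.committed and r.decision_correct for r in results) / n
    frp = sum(r.facts_correct for r in results) / max(1, sum(r.facts_asserted for r in results))
    rcs = sum(r.coherent_steps for r in results) / max(1, sum(r.reasoning_steps for r in results))
    crr = sum(r.rules_reconstructed for r in results) / max(1, sum(r.rules_applicable for r in results))
    # Calibrated abstention: abstain exactly when evidence is insufficient.
    car = sum((not r.committed) == r.should_abstain for r in results) / n
    return {"accuracy": acc, "FRP": frp, "RCS": rcs, "CRR": crr, "CAR": car}

results = [CaseResult(10, 9, 5, 5, 3, 2, committed=True,
                      decision_correct=True, should_abstain=True)]
print(scorecard(results))
# {'accuracy': 1.0, 'FRP': 0.9, 'RCS': 1.0, 'CRR': 0.666..., 'CAR': 0.0}
```

Note how the toy case scores perfect accuracy while failing CRR and CAR: the kind of axis-level structure the paper says a single scalar hides.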
Abstract
Long-horizon enterprise agents make high-stakes decisions (loan underwriting, claims adjudication, clinical review, prior authorization) under lossy memory, multi-step reasoning, and binding regulatory constraints. Current evaluation reports a single task-success scalar that conflates distinct failure modes and hides whether an agent is aligned with the standards its deployment environment requires. We propose that long-horizon decision behavior decomposes into four orthogonal alignment axes, each independently measurable and failable: factual precision (FRP), reasoning coherence (RCS), compliance reconstruction (CRR), and calibrated abstention (CAR). CRR is a novel regulatory-grounded axis; CAR is a measurement axis separating coverage from accuracy. We exercise the decomposition on a controlled benchmark (LongHorizon-Bench) covering loan qualification and insurance claims adjudication with deterministic ground-truth construction. Running six memory architectures, we find structure that aggregate accuracy cannot see: retrieval collapses on factual precision; schema-anchored architectures pay a scaffolding tax; plain summarization under a fact-preservation prompt is a strong baseline on FRP, RCS, EDA, and CRR; and all six architectures commit on every case, exposing a decisional-alignment axis the field has not targeted. The decomposition also surfaced a pre-registered prediction of our own, that summarization would fail factual recall, which the data reversed at large magnitude; this is an axis-level reversal that aggregate accuracy would have hidden. Institutional alignment (regulatory reconstruction) and decisional alignment (calibrated abstention) are under-represented in the alignment literature and become load-bearing once decisions leave the laboratory. The framework transfers to any regulated decisioning domain via two steps: build a fact schema, and calibrate the CRR auditor prompt.
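The abstract's two transfer steps could look something like the sketch below in practice. The domain (prior authorization), field names, and prompt wording are purely illustrative assumptions, not artifacts from the paper.

```python
from dataclasses import dataclass

# Step 1 (illustrative): a fact schema for a new regulated domain,
# here prior authorization. Ground truth is constructed against these fields.
@dataclass
class PriorAuthFacts:
    procedure_code: str           # requested procedure (e.g. a CPT code)
    diagnosis_code: str           # supporting diagnosis (e.g. ICD-10)
    plan_covers_procedure: bool   # benefit rule: is the procedure covered?
    step_therapy_satisfied: bool  # utilization rule: were prerequisites tried?

# Step 2 (illustrative): a CRR auditor prompt template, calibrated per domain
# so the auditor scores rule reconstruction rather than writing style.
CRR_AUDITOR_PROMPT = """\
You are auditing an agent's decision trace for regulatory compliance.
Applicable rules for this case:
{rules}

Decision trace:
{trace}

For each rule, answer YES or NO: does the trace correctly reconstruct
and apply the rule? Cite the trace step where it is applied or missed.
"""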