AIDB Daily Papers

Self-Evolving LLM Agents: Achieving Long-Horizon Task Execution through In-Distribution Optimization

Original title: Self-evolving LLM agents with in-distribution Optimization

Authors: Yudi Zhang, Meng Fang, Zhenfang Chen, Mykola Pechenizkiy

Published: 2026-06-05 | Fields: LLM, Reinforcement Learning, AI, cs.LG, AI Agents, AI Assistance

Note: The Japanese title and key points were generated automatically by AI. Please check the original paper for the exact content.

Key Points

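To address the challenges LLM agents face in long-horizon decision making, the authors propose Q-Evolve, a self-evolving framework that unifies process-reward labeling and policy learning.
The method learns an in-distribution critic from off-policy data that combines expert demonstrations with agent-generated trajectories, and stabilizes learning in sparse-reward environments with a weighted Implicit Q-Learning objective.
In evaluations on AlfWorld, WebShop, and ScienceWorld, Q-Evolve outperforms existing methods in sample efficiency, robustness, and task performance.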
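
For orientation, the standard (unweighted) Implicit Q-Learning objectives are written out below. This is background from the general IQL formulation, not the paper's weighted variant: the weighting scheme is not described in this summary, and the symbols (the dataset D, expectile τ, discount γ, parameters ψ and θ, and target parameters θ̂) are notation assumed here for illustration.

```latex
% Expectile regression for the state-value function
L_V(\psi) = \mathbb{E}_{(s,a)\sim\mathcal{D}}
  \left[ L_2^{\tau}\!\left( Q_{\hat{\theta}}(s,a) - V_{\psi}(s) \right) \right],
\qquad
L_2^{\tau}(u) = \left| \tau - \mathbb{1}(u < 0) \right| u^{2}

% TD regression for the action-value function, bootstrapping from V
% so that no out-of-distribution action is ever queried
L_Q(\theta) = \mathbb{E}_{(s,a,r,s')\sim\mathcal{D}}
  \left[ \left( r + \gamma V_{\psi}(s') - Q_{\theta}(s,a) \right)^{2} \right]
```

Because both losses evaluate Q only at state-action pairs that appear in the dataset, the critic stays in-distribution, which is the property the second point above relies on.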

Abstract

Large Language Models (LLMs) have recently emerged as powerful controllers for interactive agents in complex environments, yet training them to perform reliable long-horizon decision making remains a fundamental challenge. A key difficulty lies in credit assignment: agents often receive delayed rewards only at the end of episodes. In this paper, we propose Q-Evolve, a self-evolving framework for LLM agents that unifies automatic process-reward labeling and policy learning within a principled in-distribution reinforcement learning paradigm. In each evolving iteration, our method learns an in-distribution critic from a hybrid off-policy dataset that combines expert demonstrations with agent-generated trajectories, stabilizing Bellman backups in sparse-reward settings via a weighted Implicit Q-Learning objective. The learned value function is then used to derive step-wise process rewards through advantage estimation, enabling dense and reliable supervision without environment backtracking or human annotation. Leveraging these signals, we perform behavior-proximal policy optimization that evolves the agent over the data used for process reward labeling, allowing iterative self-improvement without exacerbating distribution shift. We evaluate our method on AlfWorld, WebShop, and ScienceWorld, showing Q-Evolve outperforms strong baselines in sample efficiency, robustness, and overall task performance. Our results demonstrate that stable agent self-evolution is achievable through the co-evolution of process-level supervision and policy, both grounded within a shared in-distribution learning loop.
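
The abstract's step of deriving step-wise process rewards from a learned value function can be illustrated with a minimal sketch. It assumes a one-step TD advantage, A_t = r_t + γ·V(s_{t+1}) − V(s_t), as the advantage estimator; the paper's exact estimator is not given here, and the function name `td_advantages` and the critic values below are hypothetical.

```python
from typing import List, Sequence


def td_advantages(
    rewards: Sequence[float],
    values: Sequence[float],
    gamma: float = 0.99,
) -> List[float]:
    """Return one-step TD advantages A_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    `rewards[t]` is the environment reward received after step t and
    `values[t]` is the critic's estimate V(s_t). The state reached after
    the final step is treated as terminal, so its value is taken as 0.
    """
    if len(rewards) != len(values):
        raise ValueError("rewards and values must have the same length")

    advantages: List[float] = []
    for t, (reward, value) in enumerate(zip(rewards, values)):
        # Bootstrap from the next state's value, or 0 at the episode end.
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        advantages.append(reward + gamma * next_value - value)
    return advantages


if __name__ == "__main__":
    # Sparse-reward episode: only the last step is rewarded.
    episode_rewards = [0.0, 0.0, 1.0]
    # Hypothetical critic outputs for the three visited states.
    critic_values = [0.2, 0.5, 0.9]
    process_rewards = td_advantages(episode_rewards, critic_values, gamma=1.0)
    print(process_rewards)
```

With these made-up values the per-step signals come out to roughly 0.3, 0.4, and 0.1, so every step receives a dense score even though the environment rewards only the final one.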

Metadata

arXiv ID: 2606.07367
Category: cs.LG
