AIDB Daily Papers
人間の行動シミュレーションを言語フィードバックで強化する
※ 日本語タイトル・ポイントはAIによる自動生成です。正確な内容は原論文をご確認ください。
ポイント
- 人間の行動シミュレーション能力を高めるため、言語によるフィードバックを直接学習に活用する手法を提案した。
- 従来の数値報酬ではなく、主観的で多面的な言語フィードバックを第一級信号として扱うことで、より人間らしい振る舞いを学習させる。
- 提案手法は10のタスクからなるベンチマークでベースモデルを平均36%上回り、GPT-5.4を超える性能を示した。
Abstract
Humans learn social norms and behaviors from verbal feedback (e.g., a parent saying "that was rude" or a friend explaining "here's why that hurt"). Yet, learning from feedback for LLMs has largely focused on domains like code and math, where RL rewards are directly verifiable and condensed into scalar values. As LLMs are increasingly used to simulate human behavior, e.g., standing in for users, patients, students, and other personas, there is a pressing need to make them more human-like, which requires embracing a fundamentally different kind of signal: feedback that is verbal, subjective, and multi-faceted. We present DITTO, a model trained by treating verbal feedback as a first-class signal in reinforcement learning. After each rollout, DITTO receives verbal feedback and generates a feedback-conditioned improved rollout; both outputs are jointly optimized with GRPO, distilling verbal guidance into the base policy without requiring feedback at test time. We also introduce SOUL (Simulation gym Of hUman-Like behavior), a unified benchmark and training data suite spanning 10 tasks across six categories: Theory of Mind, character role play, social skill, learner simulation, user simulation, and persona simulation. DITTO achieves an average 36% improvement over the base model and exceeds GPT-5.4 on 6 of 10 SOUL benchmarks, demonstrating that RL with verbal feedback is a promising direction for training LLMs to simulate human behavior.
Paper AI Chat
この論文のPDF全文を対象にAIに質問できます。
質問の例: