AIDB Daily Papers

On-Policy Co-Distillation for Mutual LLM Improvement via Peer Feedback

Original title: Be My Tutor: On-Policy Co-Distillation for Mutual LLM Improvement via Peer Feedback

Authors: Woohyeon Byeon, Jiwon Jeon, Jeonghye Kim, Youngchul Sung

Published: 2026-06-12 | Fields: LLM training, cs.CL, cs.LG, AI-assisted

Note: The Japanese title and key points were generated automatically by AI. Please consult the original paper for accurate details.

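Key Points

The authors study multi-domain training in which two LLMs, each strong in a different domain, co-evolve by tutoring each other.
Their goal is mutual Pareto improvement: each model gains performance across all domains without losing its original strength. To achieve this, they propose On-Policy Co-Distillation (OPCoD).
OPCoD makes effective use of peer feedback. On science QA tasks it outperforms the baselines and achieves Pareto improvement across every evaluated domain pair.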

Abstract

We study multi-domain LLM training in which two models, each stronger in a different domain, co-evolve by tutoring each other through on-policy feedback. Unlike one-way distillation or single-model fine-tuning, our goal is mutual Pareto improvement: each model improves across domains without losing its original strength. To this end, we propose On-Policy Co-Distillation (OPCoD), where each student's self-distillation is conditioned on its own correct rollout and feedback from its peer. To make feedback exchange effective, OPCoD uses cognizance-based gating to decide when to give feedback and feedback anchoring to ground feedback in the problem. On Science Q&A tasks, OPCoD consistently outperforms baselines and achieves Pareto improvement across all evaluated domain pairs and students.
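The "mutual Pareto improvement" goal in the abstract can be made concrete with a small check. The sketch below is an illustration only and is not the authors' code. It assumes the standard weak definition: no domain score decreases and at least one strictly increases. It also assumes that "mutual" means the condition holds for every student model. The function names, domain labels, and scores are hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical type: maps a domain name to a score (e.g. accuracy) for one model.
Scores = Dict[str, float]


def is_pareto_improvement(before: Scores, after: Scores) -> bool:
    """Weak Pareto improvement: no domain gets worse and at least one gets strictly better."""
    if before.keys() != after.keys():
        raise ValueError("before/after must cover the same domains")
    no_regression = all(after[d] >= before[d] for d in before)
    some_gain = any(after[d] > before[d] for d in before)
    return no_regression and some_gain


def is_mutual_pareto_improvement(students: Dict[str, Tuple[Scores, Scores]]) -> bool:
    """Mutual: every student model individually achieves a Pareto improvement."""
    return all(is_pareto_improvement(before, after) for before, after in students.values())


if __name__ == "__main__":
    # Illustrative numbers only; these are not results from the paper.
    students = {
        "model_a": ({"domain_A": 0.70, "domain_B": 0.40}, {"domain_A": 0.71, "domain_B": 0.52}),
        "model_b": ({"domain_A": 0.45, "domain_B": 0.68}, {"domain_A": 0.55, "domain_B": 0.66}),
    }
    for name, (before, after) in students.items():
        print(name, is_pareto_improvement(before, after))
    print("mutual:", is_mutual_pareto_improvement(students))
```

With these made-up scores, model_a counts as a Pareto improvement. model_b does not, because it gains in domain_A but loses part of its original strength in domain_B. So the pair as a whole fails the mutual criterion. A stricter reading, where every domain must strictly improve, would only change `no_regression` to use `>`.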


Metadata

arXiv ID: 2606.14368
Categories: cs.LG, cs.CL
