RACE-bench: A New Benchmark for Measuring the Reasoning Ability of Repository-Level Code Agents
Note: The Japanese title and key points below were automatically generated by AI. Please refer to the original paper for accurate details.
Key Points
- RACE-bench, a benchmark that emphasizes the reasoning process, was developed to evaluate the capabilities of AI agents that work with code across an entire repository.
- It addresses a limitation of existing evaluation methods, which reveal only final pass/fail outcomes and make it difficult to pinpoint an agent's thought process or weaknesses.
- Evaluation with RACE-bench showed that existing agents are good at understanding high-level intent but struggle to translate that intent into concrete implementation steps.
Abstract
Repository-level code agents have shown strong promise in real-world feature addition tasks, making reliable evaluation of their capabilities increasingly important. However, existing benchmarks primarily evaluate these agents as black boxes based on final test correctness, providing limited insight into how they reason and where failures arise. To address this limitation, we introduce RACE-bench, a reasoning-augmented benchmark for evaluating code agents on repository-level feature addition tasks. RACE-bench contains 528 real-world feature addition instances from 12 open-source repositories. Each instance is paired with executable patch verification and structured intermediate reasoning ground truth covering issue understanding, file localization, implementation tasks, and step decomposition. Based on this design, we introduce a dual-track evaluation framework that jointly measures patch correctness and intermediate reasoning quality. We evaluate three representative repository-level code agents on RACE-bench. On the full benchmark, Resolved Rates range from 29% to 70% across different agents. Our reasoning-level analysis further shows that while current agents perform well at understanding high-level intent, their performance degrades substantially when translating intent into concrete implementation steps. We also find that apply-success but test-fail cases exhibit lower reasoning recall (35.7% decrease) and higher over-prediction (94.1% increase) compared to successful cases. These findings highlight the importance of evaluating repository-level code agents beyond final patch correctness by examining the quality of their reasoning processes.
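The abstract does not define the reasoning-track metrics precisely, but as a rough illustration of the dual-track idea, the sketch below shows one plausible way to compute a Resolved Rate on the patch track and reasoning recall / over-prediction on the reasoning track by matching an agent's predicted reasoning items against ground-truth items. The exact-match comparison, the function names, and the toy data are assumptions made for illustration, not the paper's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class ReasoningEval:
    recall: float           # fraction of ground-truth reasoning items the agent covered
    over_prediction: float  # fraction of predicted items not grounded in the reference


def evaluate_reasoning(predicted: set[str], reference: set[str]) -> ReasoningEval:
    """Reasoning track: score predicted reasoning items (e.g. files to localize,
    implementation steps) against ground-truth items via exact-match overlap."""
    matched = predicted & reference
    recall = len(matched) / len(reference) if reference else 1.0
    over_prediction = (len(predicted) - len(matched)) / len(predicted) if predicted else 0.0
    return ReasoningEval(recall, over_prediction)


def resolved_rate(instance_passed: list[bool]) -> float:
    """Patch track: fraction of instances whose patch applies and passes the tests."""
    return sum(instance_passed) / len(instance_passed) if instance_passed else 0.0


# Toy example (hypothetical reasoning items, not taken from RACE-bench data)
reference_steps = {"locate config loader", "add new CLI flag", "update parser tests"}
predicted_steps = {"locate config loader", "add new CLI flag", "refactor logging"}
print(evaluate_reasoning(predicted_steps, reference_steps))  # recall ~0.67, over-prediction ~0.33
print(resolved_rate([True, False, True, True]))              # 0.75
```

In practice, a benchmark like this would likely match reasoning items semantically rather than by exact string equality, but the recall vs. over-prediction contrast mirrors the abstract's finding that apply-success-but-test-fail cases show lower recall and higher over-prediction than fully successful ones.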