RACE-bench: A New Benchmark for Measuring the Reasoning Ability of Repository-Level Code Agents
Note: The Japanese title and key points below were automatically generated by AI. Please refer to the original paper for accurate details.
Key Points
- RACE-bench, a benchmark that emphasizes the reasoning process, was developed to evaluate the capabilities of AI agents that work with code across an entire repository.
- It addresses a limitation of existing evaluation methods, which reveal only final pass/fail outcomes and make it difficult to pinpoint an agent's thought process or weaknesses.
- Evaluation with RACE-bench showed that existing agents are good at understanding high-level intent but struggle to translate that intent into concrete implementation steps.
Abstract
Repository-level code agents have shown strong promise in real-world feature addition tasks, making reliable evaluation of their capabilities increasingly important. However, existing benchmarks primarily evaluate these agents as black boxes based on final test correctness, providing limited insight into how they reason and where failures arise. To address this limitation, we introduce RACE-bench, a reasoning-augmented benchmark for evaluating code agents on repository-level feature addition tasks. RACE-bench contains 528 real-world feature addition instances from 12 open-source repositories. Each instance is paired with executable patch verification and structured intermediate reasoning ground truth covering issue understanding, file localization, implementation tasks, and step decomposition. Based on this design, we introduce a dual-track evaluation framework that jointly measures patch correctness and intermediate reasoning quality. We evaluate three representative repository-level code agents on RACE-bench. On the full benchmark, Resolved Rates range from 29% to 70% across different agents. Our reasoning-level analysis further shows that while current agents perform well at understanding high-level intent, their performance degrades substantially when translating intent into concrete implementation steps. We also find that apply-success but test-fail cases exhibit lower reasoning recall (35.7% decrease) and higher over-prediction (94.1% increase) compared to successful cases. These findings highlight the importance of evaluating repository-level code agents beyond final patch correctness by examining the quality of their reasoning processes.
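The abstract does not define the reasoning-track metrics precisely, but as a rough illustration of the dual-track idea, the sketch below shows one plausible way to compute a Resolved Rate on the patch track and reasoning recall / over-prediction on the reasoning track by matching an agent's predicted reasoning items against ground-truth items. The exact-match comparison, the function names, and the toy data are assumptions made for illustration, not the paper's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class ReasoningEval:
    recall: float           # fraction of ground-truth reasoning items the agent covered
    over_prediction: float  # fraction of predicted items not grounded in the reference


def evaluate_reasoning(predicted: set[str], reference: set[str]) -> ReasoningEval:
    """Reasoning track: score predicted reasoning items (e.g. files to localize,
    implementation steps) against ground-truth items via exact-match overlap."""
    matched = predicted & reference
    recall = len(matched) / len(reference) if reference else 1.0
    over_prediction = (len(predicted) - len(matched)) / len(predicted) if predicted else 0.0
    return ReasoningEval(recall, over_prediction)


def resolved_rate(instance_passed: list[bool]) -> float:
    """Patch track: fraction of instances whose patch applies and passes the tests."""
    return sum(instance_passed) / len(instance_passed) if instance_passed else 0.0


# Toy example (hypothetical reasoning items, not taken from RACE-bench data)
reference_steps = {"locate config loader", "add new CLI flag", "update parser tests"}
predicted_steps = {"locate config loader", "add new CLI flag", "refactor logging"}
print(evaluate_reasoning(predicted_steps, reference_steps))  # recall ~0.67, over-prediction ~0.33
print(resolved_rate([True, False, True, True]))              # 0.75
```

In practice, a benchmark like this would likely match reasoning items semantically rather than by exact string equality, but the recall vs. over-prediction contrast mirrors the abstract's finding that apply-success-but-test-fail cases show lower recall and higher over-prediction than fully successful ones.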