AIDB Daily Papers

How Many Tools Should an LLM Agent See? A Chance-Corrected Answer

Authors: Vyzantinos Repantis, Ameya Gawde, Harshvardhan Singh, Joey Blackwell

Published: 2026-05-23 | Fields: LLM, cs.AI, cs.IR, cs.LG, AI agents, AI evaluation

Note: The translated title and the key points were generated automatically by AI. Consult the original paper for the exact content.

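Key Points

The paper applies Bits-over-Random (BoR), a chance-corrected metric, to evaluate how many candidate tools should be shown to an LLM agent.
Rather than fixing the shortlist size for every query, it turns the same principle into a reinforcement learning reward, so that a deliberately simple agent chooses the shortlist depth per query.
On BFCL (370 tools), showing 7 tools on average nearly matches the coverage of showing 50 (90.3% vs 90.8%). On ToolBench (3,251 tools), the agent finds the correct tool on 16.7% of hard queries where a fixed shortlist of 5 finds none.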

Abstract

Before an LLM agent can use a tool, a retrieval system must decide which candidate tools to show to the agent. How long should that shortlist be? Show too many tools and the model struggles to choose. Show too few and the correct tool may not appear. Most systems apply a fixed shortlist size to every query, but no standard metric exists to evaluate whether that size was appropriate. We treat the number of tools shown to an LLM agent as the object of evaluation and we apply Bits-over-Random (BoR), a chance-corrected metric that asks whether success at a given depth is better than what random selection would achieve at that same depth. We evaluate BoR across three tool-selection benchmarks, multiple scorers, and registries ranging from 20 to 3,251 tools. We then turn the same principle into a reinforcement learning (RL) reward for choosing tool shortlist depth per query. The RL agent is deliberately simple, serving as a probe of the metric rather than a proposed system. As the shortlist grows, random chance of including the correct tool rises, so the reward naturally decreases, reducing the need for an engineered depth penalty. On BFCL (370 tools), the learned policy nearly matches the coverage of showing 50 tools ($90.3\%$ vs $90.8\%$) while presenting only 7 on average. On ToolBench (3,251 tools), a fixed shortlist of 5 tools achieves higher aggregate coverage ($64.7\%$ vs $61.9\%$) but finds nothing on hard queries (correct tool ranked 6th-20th). The BoR agent finds $16.7\%$ on those same queries by searching deeper. Downstream validation with Claude Sonnet 4.6 indicates that shorter adaptive lists also improve the LLM's ability to select the right tool: $93.1\%$ versus $87.1\%$ when always shown 5 tools, widening to $76.8\%$ vs $60.9\%$ on medium-difficulty queries where the correct tool is present but not ranked first.
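
The chance-correction idea in the abstract can be illustrated with a small sketch. The abstract does not give the exact BoR formula, so the code below is an assumption based on the metric's name: it takes the base-2 log ratio between an observed hit rate at depth k and the hit rate a uniformly random shortlist of the same depth would achieve, assuming exactly one correct tool per query. The function names `random_hit_rate` and `bits_over_random` are hypothetical, and the paper's actual definition may differ.

```python
import math


def random_hit_rate(k: int, n_tools: int) -> float:
    """Chance that a uniformly random shortlist of k distinct tools
    contains the correct tool, ASSUMING exactly one correct tool
    in a registry of n_tools (an assumption, not from the paper)."""
    if not 1 <= k <= n_tools:
        raise ValueError("k must satisfy 1 <= k <= n_tools")
    return k / n_tools


def bits_over_random(observed_hit_rate: float, k: int, n_tools: int) -> float:
    """Hypothetical chance-corrected score: log2 of the observed hit
    rate over the random baseline at the same depth k.
    Positive means better than random; 0 means no better than random."""
    if observed_hit_rate <= 0.0:
        return float("-inf")
    return math.log2(observed_hit_rate / random_hit_rate(k, n_tools))


if __name__ == "__main__":
    # Even with a perfect hit rate, the score shrinks as the shortlist
    # grows, because the random baseline k / n_tools rises with k.
    for k in (1, 5, 7, 20, 50, 370):
        print(k, round(bits_over_random(1.0, k, 370), 3))
```

Under this reading, a deeper shortlist earns fewer bits for the same hit rate, which matches the abstract's remark that the reward decreases as the shortlist grows without a separately engineered depth penalty. Plugging in the BFCL coverage figures quoted in the abstract (treating the learned policy's average depth of 7 as if it were a fixed depth, which is a simplification) gives a higher score for depth 7 than for depth 50.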

Metadata

arXiv ID: 2605.24660
Categories: cs.IR, cs.AI, cs.LG
