AIDB Daily Papers
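How Many Tools Should an LLM Agent Be Shown? A Chance-Corrected Answer
Note: The title and key points were generated automatically by AI (originally in Japanese). Refer to the original paper for the exact content.
Key Points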
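- The paper applies Bits-over-Random (BoR), a chance-corrected metric, to evaluate how many candidate tools should be shown to an LLM agent when it uses tools.
- Instead of showing a fixed number of tools, the work trains a deliberately simple reinforcement learning agent that chooses the shortlist depth per query, using the agent as a probe of the metric.
- On BFCL, showing 7 tools on average nearly matches the coverage of showing 50 (90.3% vs 90.8%). On ToolBench, the agent finds the correct tool on 16.7% of hard queries, where a fixed shortlist of 5 finds none.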
Abstract
Before an LLM agent can use a tool, a retrieval system must decide which candidate tools to show to the agent. How long should that shortlist be? Show too many tools and the model struggles to choose. Show too few and the correct tool may not appear. Most systems apply a fixed shortlist size to every query, but no standard metric exists to evaluate whether that size was appropriate. We treat the number of tools shown to an LLM agent as the object of evaluation and we apply Bits-over-Random (BoR), a chance-corrected metric that asks whether success at a given depth is better than what random selection would achieve at that same depth. We evaluate BoR across three tool-selection benchmarks, multiple scorers, and registries ranging from 20 to 3,251 tools. We then turn the same principle into a reinforcement learning (RL) reward for choosing tool shortlist depth per query. The RL agent is deliberately simple, serving as a probe of the metric rather than a proposed system. As the shortlist grows, random chance of including the correct tool rises, so the reward naturally decreases, reducing the need for an engineered depth penalty. On BFCL (370 tools), the learned policy nearly matches the coverage of showing 50 tools (90.3% vs 90.8%) while presenting only 7 on average. On ToolBench (3,251 tools), a fixed shortlist of 5 tools achieves higher aggregate coverage (64.7% vs 61.9%) but finds nothing on hard queries (correct tool ranked 6th-20th). The BoR agent finds 16.7% on those same queries by searching deeper. Downstream validation with Claude Sonnet 4.6 indicates that shorter adaptive lists also improve the LLM's ability to select the right tool: 93.1% versus 87.1% when always shown 5 tools, widening to 76.8% vs 60.9% on medium-difficulty queries where the correct tool is present but not ranked first.
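The chance-correction idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the BoR formula, so everything here is an assumption: the score is taken to be the base-2 log of the observed hit rate over the hit rate of a uniformly random shortlist of the same size, each query is assumed to have exactly one correct tool, and the function names `random_hit_rate`, `bits_over_random`, and `depth_reward` are hypothetical. The paper's actual definition may differ.

```python
import math


def random_hit_rate(k: int, n_tools: int) -> float:
    """Chance that a uniformly random shortlist of size k contains the
    single correct tool out of n_tools (ASSUMPTION: one correct tool)."""
    if not 0 < k <= n_tools:
        raise ValueError("k must satisfy 0 < k <= n_tools")
    return k / n_tools


def bits_over_random(observed_hit_rate: float, k: int, n_tools: int,
                     eps: float = 1e-12) -> float:
    """ASSUMED form of a chance-corrected score: log2 of the observed hit
    rate at depth k over the random baseline at the same depth.
    0 bits means no better than chance; negative means worse."""
    baseline = random_hit_rate(k, n_tools)
    # eps avoids log2(0) when nothing is found at this depth.
    return math.log2(max(observed_hit_rate, eps) / baseline)


def depth_reward(hit: bool, k: int, n_tools: int) -> float:
    """Hypothetical per-query reward: a hit earns log2(n_tools / k) bits,
    so deeper shortlists earn less; a miss earns 0 (ASSUMPTION, the
    abstract does not say how misses are scored)."""
    return math.log2(n_tools / k) if hit else 0.0


if __name__ == "__main__":
    n = 370  # BFCL registry size, from the abstract
    # Simplification: the learned policy's average depth of 7 is treated
    # as if it were a fixed depth.
    for k, rate in [(7, 0.903), (50, 0.908)]:
        print(k, round(bits_over_random(rate, k, n), 3))
```

Under these assumptions, nearly equal coverage at depth 7 and depth 50 yields a much higher score at depth 7, because a random shortlist of 50 already contains the correct tool far more often than a random shortlist of 7. This is the sense in which the reward falls as the shortlist grows without an explicit depth penalty.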