AIDB Daily Papers

MCPToolBench++: A Large Scale AI Agent Model Context Protocol MCP Tool Use Benchmark

Authors: Shiqing Fan, Xichen Ding, Liang Zhang, Linjian Mo

Published: 2025-08-11 | Fields: LLM, efficiency, datasets, benchmarks, inference

Note: The title translation and key points were generated automatically by AI. Refer to the original paper for the exact content.

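Key Points

The paper proposes MCPToolBench++, a large-scale, multi-domain AI agent tool-use benchmark that supports evaluating how well LLMs call MCP tools.
Unlike existing tool-use benchmarks, it accounts for the varying success rates of real-world MCP tools and for the limits of the LLM context window.
The dataset is built on a marketplace of more than 4,000 MCP servers across more than 40 categories, and the paper reports results from evaluating SOTA LLMs on it.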

Abstract

LLMs' capabilities are enhanced by using function calls to integrate various data sources or API results into the context window. Typical tools include search, web crawlers, maps, financial data, file systems, and browser usage. Integrating these data sources or functions requires a standardized method. The Model Context Protocol (MCP) provides a standardized way to supply context to LLMs. However, the evaluation of LLMs' and AI Agents' MCP tool-use abilities suffers from several issues. First, there is a lack of comprehensive datasets or benchmarks to evaluate various MCP tools. Second, the diverse response formats of MCP tool call execution further increase the difficulty of evaluation. Additionally, unlike existing tool-use benchmarks with high success rates in functions such as programming and math, the success rate of real-world MCP tools is not guaranteed and varies across MCP servers. Furthermore, the LLM's context window limits the number of available tools that can be called in a single run, because the textual descriptions of the tools and their parameters are too long for an LLM to process all at once. To help address the challenges of evaluating LLMs' performance on calling MCP tools, we propose MCPToolBench++, a large-scale, multi-domain AI Agent tool-use benchmark. As of July 2025, this benchmark is built upon a marketplace of over 4k MCP servers from more than 40 categories, collected from MCP marketplaces and GitHub communities. The datasets consist of both single-step and multi-step tool calls across different categories. We evaluated SOTA LLMs with agentic abilities on this benchmark and reported the results.
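
Two of the issues named in the abstract, the context-window limit on how many tool descriptions fit in one run and the non-guaranteed success rate of real-world tool calls, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the `ToolSpec` record, the four-characters-per-token estimate, the greedy in-order selection, and the `success_rate` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ToolSpec:
    """Hypothetical tool record: a name plus its textual description."""
    name: str
    description: str


def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly one token per four characters.
    return max(1, len(text) // 4)


def select_tools(tools: list[ToolSpec], token_budget: int) -> list[ToolSpec]:
    """Keep tools in order until the next description would exceed the budget."""
    selected = []
    used = 0
    for tool in tools:
        cost = estimate_tokens(tool.name + " " + tool.description)
        if used + cost > token_budget:
            break
        selected.append(tool)
        used += cost
    return selected


def success_rate(outcomes: list[bool]) -> float:
    """Fraction of tool calls that succeeded (0.0 when no calls were made)."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0
```

A real evaluation would likely use the model's own tokenizer and a retrieval step rather than in-order packing; the sketch only shows why a fixed token budget caps the number of tools that can be offered at once.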

Metadata

arXiv ID: 2508.07575
Category: cs.AI
