AIDB Daily Papers

LLMエージェント向けスキル生成パイプラインのベンチマーク「SkillGenBench」

原題: SkillGenBench: Benchmarking Skill Generation Pipelines for LLM Agents

著者: Yifan Zhou, Zhentao Zhang, Ziming Cheng, Shuo Zhang, Qizhen Lan, Zhangquan Chen, Zhi Yang, QianyuXu, Ronghao Chen, Huacan Wang, Sen Hu

公開日: 2026-05-18 | 分野: LLM ロボティクスソフトウェア cs.AI AIエージェント AI評価

※ 日本語タイトル・ポイントはAIによる自動生成です。正確な内容は原論文をご確認ください。

ポイント

LLMエージェントが再利用可能なスキルを生成する能力を評価するための統一的なベンチマーク「SkillGenBench」を提案した。
既存のベンチマークではスキル生成自体を独立した研究対象として扱っていなかったため、本研究は新たな評価基準を提供する。
実験の結果、スキル生成手法やデータソースによって性能に大きなばらつきがあり、特にドキュメントからのスキル抽出が困難であることが明らかになった。

Abstract

As LLM agents are increasingly built around reusable skills, a central challenge is no longer only whether agents can use provided skills, but whether they can generate correct, reusable, and executable skills from repositories and documents. Existing benchmarks primarily evaluate the efficacy of given skills or the ability of agents to solve downstream tasks from raw context, but they do not isolate skill generation itself as the object of study. We introduce SkillGenBench, a benchmark for evaluating skill generation pipelines under a unified and controlled protocol. In SkillGenBench, a generator receives raw corpora and produces standardized skill artifacts, which are then executed under fixed harnesses and assessed with unified evaluation procedures. The benchmark covers two generation regimes: task-conditioned generation, where a task-specific skill is synthesized after the task is revealed, and task-agnostic generation, where a reusable skill library must be distilled before downstream tasks are known. It also spans two complementary procedural sources: repository-grounded instances, where procedures are distributed across code, configuration, and scripts, and document-grounded instances, where procedures and constraints must be distilled from long-form text. We provide standardized task specifications, pinned environments, and evaluation protocols centered on deterministic execution-based checks, supplemented by auxiliary signals for diagnosis. Experiments across a range of skill-generation methods and backbones show substantial performance variation, highlight the difficulty of reusable skill distillation, and reveal distinct failure modes in skill generation from software repositories versus long-form documents. SkillGenBench establishes a reproducible testbed for studying skill generation as an independent research problem in agent systems.

Paper AI Chat

この論文のPDF全文を対象にAIに質問できます。

質問の例:

AIチャット機能を利用するには、ログインまたは会員登録（無料）が必要です。

会員登録 / ログイン

arXivで読む PDFを開く

メタ情報

arXiv ID: 2605.18693
カテゴリ: cs.AI

ポイント

Abstract

Paper AI Chat

関連するAIDB記事

メタ情報