AIDB Daily Papers
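GroupTravelBench: A Benchmark for Evaluating Multi-User Travel Planning
Note: The Japanese title and key points were generated automatically by AI. Please check the original paper for accurate details.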
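Key Points
- The authors developed a benchmark that evaluates dialogue and conflict-resolution abilities in travel planning for multiple users.
- They synthesized 650 tasks across three difficulty levels, based on realistic user profiles and POI data.
- Even state-of-the-art LLMs showed weaknesses in fairness among users and in coverage of user preferences.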
Abstract
Travel planning is a realistic task for evaluating the planning and tool-use abilities of LLM agents. However, existing benchmarks typically assume only a single user, thereby avoiding one of the most challenging aspects of real-world scenarios: an agent's ability to identify and resolve conflicts among multiple users. To address this gap, we introduce GroupTravelBench, the first benchmark for multi-user, multi-turn travel planning. Based on real user profiles, POI data, and ticket price data, we synthesize 650 tasks and divide them into three difficulty levels. Beyond standard abilities in single-user itinerary planning, such as multi-step reasoning and tool use, our benchmark further evaluates three key capabilities required for travel agents: (i) elicitation: proactively engaging in multi-turn dialogue to gather preferences from each user; (ii) coordination: resolving conflicts among users through compromise or subgrouping strategies; and (iii) planning: searching for travel plans that maximize overall group utility while maintaining fairness and feasibility. To simulate real-world conversational itinerary planning while enabling reliable tool use and offline evaluation, we build an interactive sandbox environment with cached real-world tool data. We evaluate a wide range of LLMs and find that even frontier models still show substantial weaknesses in preference coverage and group fairness. GroupTravelBench provides a practical and reproducible benchmark for advancing research on LLM agents for real-world travel planning.
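The planning capability described in the abstract (maximizing group utility while keeping plans fair and feasible) can be illustrated with a small sketch. The abstract does not state the benchmark's actual scoring formulas, so everything below is an assumption made for illustration: the per-user utility numbers, the `fairness_weight` parameter, and the function names `group_score` and `pick_plan` are hypothetical and are not taken from GroupTravelBench.

```python
from statistics import mean


def group_score(utilities, fairness_weight=0.5):
    """Blend average utility with the worst-off user's utility.

    Hypothetical metric: a higher fairness_weight puts more emphasis
    on the least satisfied group member.
    """
    avg = mean(utilities)
    worst = min(utilities)
    return (1 - fairness_weight) * avg + fairness_weight * worst


def pick_plan(candidates, fairness_weight=0.5):
    """Return the name of the feasible plan with the highest group score."""
    # Infeasible plans (e.g. over budget) are dropped before scoring.
    feasible = {name: c for name, c in candidates.items() if c["feasible"]}
    if not feasible:
        return None
    return max(
        feasible,
        key=lambda name: group_score(feasible[name]["utilities"], fairness_weight),
    )


# Made-up per-user utilities in [0, 1] for a three-person group.
candidates = {
    "museum_day": {"utilities": [0.9, 0.9, 0.1], "feasible": True},
    "mixed_day": {"utilities": [0.6, 0.6, 0.6], "feasible": True},
    "over_budget": {"utilities": [1.0, 1.0, 1.0], "feasible": False},
}

print(pick_plan(candidates))                       # fairness-aware choice
print(pick_plan(candidates, fairness_weight=0.0))  # average utility only
```

With the default weight the sketch prefers `mixed_day`, because one user is nearly ignored by `museum_day`. With `fairness_weight=0.0` it prefers `museum_day`, whose average utility is slightly higher. This is the kind of trade-off between overall utility and group fairness that the abstract says frontier models still handle poorly.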