AIDB Daily Papers

AgentGrounder: Zero-Shot 3D Point Cloud Object Grounding with Multimodal Language Models

Original title: AgentGrounder: Zero-Shot 3D Visual Pointcloud Grounding using Multimodal Language Models

Authors: Cuong Huynh, Maxim Popov, Denis Gridusov, Sergey Kolyubin

Published: 2026-05-25 | Fields: Robotics, Computer Vision, AI, cs.CV, cs.RO, AI Agents

Note: The Japanese-language title and key points were automatically generated by AI. Refer to the original paper for the exact content.

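Key Points

The paper proposes a method for "3D visual grounding", locating the object in a 3D scene that a natural-language description refers to, without any task-specific 3D training.
Where existing methods depend on 2D models and limited 3D information, this method extracts information directly from the 3D point cloud and renders images only when needed, which improves accuracy.
Evaluated on the ScanRefer and Nr3D datasets, it achieves higher accuracy than existing methods and provides a robust foundation for open-vocabulary 3D object grounding.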

Abstract

3D Visual Grounding (3DVG) is an essential capability for embodied AI, requiring agents to localize objects in 3D scenes based on natural language descriptions. Recent zero-shot methods leverage 2D vision-language models (LVLMs). However, they often rely on existing sets of multi-view images and struggle with the limited semantic and spatial details provided by standard 3D segmentation tools. We present AgentGrounder, a zero-shot 3D visual grounding framework that operates directly on colored point clouds without task-specific 3D training. Our approach follows a two-stage design: (1) an offline stage that applies a 3D model to build an Object Lookup Table (OLT) with instance IDs, semantic labels, and 3D bounding boxes; and (2) an online tool-driven agent that decomposes each query, retrieves only relevant candidates from the OLT, performs geometric scoring, and triggers image rendering on demand when additional visual evidence (e.g., color, material, or viewpoint-sensitive cues) is required. Compared with fixed anchor-target matching pipelines, this design reduces cascading matching errors and improves context-window efficiency by avoiding prompts overloaded with irrelevant objects. We evaluate on ScanRefer and Nr3D under a zero-shot setting and observe consistent improvements over SeeGround in our setup, including +2.5% Acc@0.5 on ScanRefer and +6.3% on Nr3D, with a notable +6.3% gain on Nr3D view-independent queries. These results show that combining selective retrieval, geometric reasoning, and adaptive visual inspection yields a practical and robust foundation for open-vocabulary 3D grounding. Our code is available at https://github.com/be2rlab/AgentGrounder.
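
The two-stage design in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `ObjectEntry` class, the `retrieve`, `nearest_to_anchor`, and `needs_rendering` functions, and the keyword list are all hypothetical names chosen for illustration. In particular, the keyword check is only a stand-in for the agent's decision to request a rendering, and center-to-center distance is only one possible form of geometric scoring.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class ObjectEntry:
    """One row of a hypothetical Object Lookup Table (OLT)."""
    instance_id: int
    label: str
    bbox_min: tuple[float, float, float]
    bbox_max: tuple[float, float, float]

    @property
    def center(self) -> tuple[float, ...]:
        # Midpoint of the axis-aligned 3D bounding box.
        return tuple((lo + hi) / 2 for lo, hi in zip(self.bbox_min, self.bbox_max))


def retrieve(olt: list[ObjectEntry], label: str) -> list[ObjectEntry]:
    """Return only the entries whose semantic label matches, not the whole scene."""
    return [entry for entry in olt if entry.label == label]


def nearest_to_anchor(targets: list[ObjectEntry], anchor: ObjectEntry) -> ObjectEntry:
    """Assumed geometric scoring: pick the target whose center is closest to the anchor."""
    return min(targets, key=lambda t: math.dist(t.center, anchor.center))


# Assumed cue words; a real agent would decide this with a language model.
VISUAL_CUES = {"red", "blue", "wooden", "metal", "left", "right"}


def needs_rendering(query: str) -> bool:
    """Stand-in for on-demand rendering: true if the query mentions a visual cue."""
    return any(word in VISUAL_CUES for word in query.lower().split())


# Offline stage (assumed output of a 3D segmentation model).
olt = [
    ObjectEntry(1, "chair", (0.0, 0.0, 0.0), (1.0, 1.0, 1.0)),
    ObjectEntry(2, "chair", (4.0, 0.0, 0.0), (5.0, 1.0, 1.0)),
    ObjectEntry(3, "table", (5.0, 0.0, 0.0), (7.0, 1.0, 1.0)),
    ObjectEntry(4, "lamp", (9.0, 9.0, 0.0), (9.5, 9.5, 2.0)),
]

# Online stage for the query "the chair closest to the table".
chairs = retrieve(olt, "chair")
table = retrieve(olt, "table")[0]
answer = nearest_to_anchor(chairs, table)
```

Here only the two chairs and the table enter the reasoning step, so the lamp never reaches the prompt, and a purely spatial query such as this one would not trigger rendering under the assumed keyword rule.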

Metadata

arXiv ID: 2605.25901
Categories: cs.CV, cs.RO
