AIDB Daily Papers
Lightweight Visual Reasoning for Social Robots
Note: The title and key points are auto-generated by AI. Please refer to the original paper for accurate details.
Key Points
- Develops a lightweight module that lets robots sharing space with humans understand their surroundings and respond to human behaviour.
- Strengthens the coupling between the LLM and the vision encoder in a VLM, reinterpreting visual information in light of textual context to handle more complex HRI.
- Confirms performance gains on navigation, scene description, and human-intention recognition tasks in a simulated environment, with a particularly large accuracy improvement in intention recognition.
Abstract
Robots operating in shared human environments must not only navigate, interact, and detect their surroundings, but also interpret and respond to dynamic, and often unpredictable, human behaviours. Although recent advances have shown promise in enhancing robotic perception and instruction-following using Vision-Language Models (VLMs), they remain limited in addressing the complexities of multimodal human-robot interaction (HRI). Motivated by this challenge, we introduce a lightweight language-to-vision feedback module that closes the loop between an LLM and the vision encoder in VLMs. The module projects image-token hidden states through a gated Multi-Layer Perceptron (MLP) back into the encoder input, prompting a second pass that reinterprets the scene under text context. We evaluate this approach on three robotics-centred tasks: navigation in a simulated environment (Habitat), sequential scene description (Mementos-Robotics), and human-intention recognition (our HRI dataset). Results show that our method improves Qwen 2.5 (7B) by 3.3% (shorter distance), +0.057 description score, and +2.93% accuracy, with less than 3% extra parameters; Gemma 3 (4B) and LLaVA OV 1.5 (4B) show mixed navigation results but gain +0.111/+0.055 and +10.81%/+4.79% on the latter two tasks. Code is available at https://github.com/alessioGalatolo/VLM-Reasoning-for-Robotics.
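To make the mechanism concrete, below is a minimal PyTorch-style sketch of such a language-to-vision feedback module. The class name, layer sizes, and the tanh-gated residual injection are illustrative assumptions; the abstract only specifies a gated MLP that projects image-token hidden states back into the vision-encoder input to trigger a second, text-conditioned pass.

```python
import torch
import torch.nn as nn


class LanguageToVisionFeedback(nn.Module):
    """Gated MLP that maps LLM image-token hidden states back into the
    vision-encoder input space, enabling a second, text-conditioned pass.

    Hypothetical sketch: dimensions and wiring are assumptions, not the
    authors' exact implementation.
    """

    def __init__(self, llm_dim: int, vis_dim: int, hidden_dim: int = 1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(llm_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, vis_dim),
        )
        # Gate starts at zero so the feedback path is initially a no-op.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, image_token_hidden: torch.Tensor,
                vision_inputs: torch.Tensor) -> torch.Tensor:
        # image_token_hidden: (B, N_img, llm_dim) LLM hidden states taken at
        #                     the image-token positions after the first pass.
        # vision_inputs:      (B, N_img, vis_dim) original vision-encoder
        #                     inputs (e.g. patch embeddings).
        feedback = self.proj(image_token_hidden)
        # Inject text-conditioned feedback into the encoder input for pass 2.
        return vision_inputs + torch.tanh(self.gate) * feedback
```

In use, the VLM would run a normal first pass, apply this module to the LLM's image-token hidden states, then re-run the vision encoder on the modified inputs before decoding the final answer. Only the small MLP and the gate add trainable parameters, which is consistent with the under-3% parameter overhead reported in the abstract.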