AIDB Daily Papers
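MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon Tasks
Note: The title and key points below were generated automatically by AI (originally in Japanese). Refer to the original paper for the exact content.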
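Key points
- Proposes a memory-control framework that lets GUI agents maintain task state across many interface transitions in long-horizon GUI tasks.
- Because raw history replay and text-only memory are insufficient, introduces a memory controller that learns to select, compress, and retrieve task-relevant events.
- Experiments show that the proposed method consistently improves GUI agents over no-history, history-replay, and text-only memory baselines, strengthening decision-making in long-horizon tasks.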
Abstract
Recent GUI agents have made substantial progress in visual grounding and action prediction, yet they remain brittle in long-horizon tasks that require maintaining task state across many interface transitions. Existing agents typically rely on raw history replay or text-only memory, which either overwhelms the model with redundant screenshots or discards localized visual evidence needed for future decisions. To address these limitations, we introduce MementoGUI, a plug-in agentic memory framework that equips MLLM-based GUI agents with MementoCore, a learned controller for online memory selection, compression, and retrieval. Rather than treating interaction history as a fixed context, MementoGUI formulates long-horizon GUI control as an online memory-control problem: working memory selectively preserves task-relevant interface events with textual summaries and ROI-level visual evidence, while episodic memory retrieves reusable past trajectories through learned relevance selection. MementoCore modularizes memory control into specialized operators for step processing, memory compression, episodic writing, and episodic selection, enabling plug-in memory augmentation without finetuning the GUI agent backbone. We further develop a scalable data curation pipeline that converts computer-use trajectories into memory-controller training data, introduce MementoGUI-Bench for evaluating long-horizon decision-making in GUI agents, and design MLLM-based metrics for semantic action matching, task progress, and memory consistency. Experiments on GUI-Odyssey, MM-Mind2Web, and MementoGUI-Bench show that MementoGUI consistently improves GUI agents over no-history, history-replay, and text-only memory baselines, with larger MementoCore backbones further strengthening memory-augmented GUI control.
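The abstract names the memory operations (selection, compression, episodic retrieval) but gives no implementation details. The following is a minimal illustrative sketch of how such a loop could be organized, not the paper's method. All names here (`MemoryEvent`, `WorkingMemory`, `select_episodes`, the `relevance` score, the `(x, y, w, h)` ROI box, the threshold and capacity values) are hypothetical, and simple hand-written rules stand in for the learned MementoCore operators.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class MemoryEvent:
    step: int
    summary: str  # textual summary of one interface event
    # Hypothetical ROI crop box (x, y, w, h) standing in for visual evidence.
    roi: Optional[Tuple[int, int, int, int]] = None
    # Hypothetical relevance score; a learned operator would produce this.
    relevance: float = 0.0


@dataclass
class WorkingMemory:
    capacity: int = 4
    events: List[MemoryEvent] = field(default_factory=list)

    def process_step(self, event: MemoryEvent, threshold: float = 0.5) -> bool:
        """Selection: store only events scored at or above the threshold."""
        if event.relevance < threshold:
            return False
        self.events.append(event)
        if len(self.events) > self.capacity:
            self.compress()
        return True

    def compress(self) -> None:
        """Compression: merge the two oldest events into a single entry."""
        if len(self.events) < 2:
            return
        first, second = self.events[0], self.events[1]
        merged = MemoryEvent(
            step=second.step,
            summary=first.summary + "; " + second.summary,
            # Keep the most recent available ROI of the merged pair.
            roi=second.roi if second.roi is not None else first.roi,
            relevance=max(first.relevance, second.relevance),
        )
        self.events[:2] = [merged]


def select_episodes(query: str, episodes: List[str], top_k: int = 1) -> List[str]:
    """Episodic retrieval stand-in: rank stored episode descriptions by
    word overlap with the query (an assumption replacing learned relevance
    selection)."""
    query_words = set(query.lower().split())

    def overlap(text: str) -> int:
        return len(query_words & set(text.lower().split()))

    return sorted(episodes, key=overlap, reverse=True)[:top_k]
```

In this sketch, low-relevance steps are never stored, older entries are merged once capacity is exceeded so the context stays bounded, and past episodes are ranked against the current task description. The GUI agent itself is untouched, which mirrors the plug-in design described in the abstract.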