AIDB Daily Papers

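CATS: Cascaded Adaptive Tree Speculation for Accelerating LLM Inference under Memory Constraints

Original title: CATS: Cascaded Adaptive Tree Speculation for Memory-Limited LLM Inference Acceleration

Authors: Yuning Han, Yangchenchen Jin, Dylan Zhao, Jingwei Sun

Published: 2026-05-11 | Fields: LLM, inference, cs.AI, cs.LG, memory constraints, speculative decoding, edge AI

Note: The Japanese title and key points on this page were generated automatically by AI. Please consult the original paper for the exact content.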

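Key points

The paper analyzes the limits of speculative decoding for LLM inference under memory constraints and proposes a new framework, CATS.
CATS performs cascaded verification and correction based on the memory budget and the parameter offloading pattern, keeping the peak on-device memory footprint equal to that of the target model alone.
In evaluations on real edge devices, CATS achieves a wall-clock speedup of up to 5.08x with no degradation in generation quality, outperforming the prior state-of-the-art method by up to 1.45x.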

Abstract

Auto-regressive decoding in Large Language Models (LLMs) is inherently memory-bound: every generation step requires loading the model weights and intermediate results from memory (e.g., High-Bandwidth Memory (HBM) for GPU servers), making throughput bottlenecked by memory bandwidth rather than compute. Speculative decoding addresses this by enabling parallel verification of multiple draft tokens, effectively amortizing the cost of each target-model call. However, existing speculative decoding methods are designed under the assumption that HBM is sufficiently large to hold both the target model and an auxiliary draft model simultaneously -- an assumption that breaks down on memory-constrained devices such as edge platforms with limited DRAM. We analyze the inference bottleneck in this memory-limited regime and propose CATS, a self-speculative decoding framework that conducts cascaded verification and correction based on the memory budget and parameter offloading patterns on memory-limited devices. This design maximizes token acceptance rate and end-to-end speedup while keeping the peak memory footprint on the device equal to that of the target model alone. We evaluate CATS on different models across five benchmarks on real edge devices. CATS can achieve a wall-clock speedup of up to 5.08x with no degradation in generation quality, outperforming the SOTA method by up to 1.45x under edge memory constraints.
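The draft-then-verify idea behind speculative decoding can be sketched with a toy example. The code below is not the CATS algorithm, which the abstract does not spell out; it is a minimal greedy variant under stated assumptions. A "model" is assumed to be a plain function from a token context to its next token, and the names `speculative_step`, `generate`, `toy_target`, and `toy_draft` are hypothetical. It only illustrates why one target-model verification can yield several tokens while the output stays identical to plain greedy decoding with the target.

```python
from typing import Callable, List, Tuple

# Assumed interface: a "model" maps a token context to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(ctx: List[int], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[int]:
    """Draft k tokens, then keep the longest prefix the target agrees with."""
    # 1) Draft phase: propose k tokens with the cheap model.
    draft: List[int] = []
    scratch = list(ctx)
    for _ in range(k):
        tok = draft_next(scratch)
        draft.append(tok)
        scratch.append(tok)

    # 2) Verify phase. A real system would score all k+1 positions in one
    #    batched target forward pass; this toy loops over them for clarity.
    accepted: List[int] = []
    scratch = list(ctx)
    for tok in draft:
        expected = target_next(scratch)
        if expected != tok:
            accepted.append(expected)  # correction token from the target
            return accepted
        accepted.append(tok)
        scratch.append(tok)
    accepted.append(target_next(scratch))  # bonus token: all drafts accepted
    return accepted


def generate(prefix: List[int], draft_next: NextToken, target_next: NextToken,
             k: int, max_new: int) -> Tuple[List[int], int]:
    """Return (prefix + max_new tokens, number of verification steps)."""
    out = list(prefix)
    steps = 0
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft_next, target_next, k))
        steps += 1
    return out[:len(prefix) + max_new], steps


def greedy_reference(prefix: List[int], target_next: NextToken,
                     max_new: int) -> List[int]:
    """Plain auto-regressive decoding: one target call per token."""
    out = list(prefix)
    for _ in range(max_new):
        out.append(target_next(out))
    return out


def toy_target(ctx: List[int]) -> int:
    # Toy "target model": counts upward modulo 10.
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    # Toy "draft model": agrees with the target except after token 4.
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


PREFIX = [0]
MAX_NEW = 12
tokens, steps = generate(PREFIX, toy_draft, toy_target, k=3, max_new=MAX_NEW)
reference = greedy_reference(PREFIX, toy_target, MAX_NEW)
print(tokens == reference, steps, MAX_NEW)
```

In this toy run the speculative output matches the greedy reference while using fewer verification steps than generated tokens, which is the amortization the abstract describes. How CATS builds and verifies its draft tree under a memory budget is described in the paper itself.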

Metadata

arXiv ID: 2605.11186
Categories: cs.LG, cs.AI
