AIDB Daily Papers

Building Enterprise Realtime Voice Agents from Scratch: A Technical Tutorial

Authors: Jielin Qiu, Zixiang Chen, Liangwei Yang, Ming Zhu, Zhiwei Liu, Juntao Tan, Wenting Zhao, Rithesh Murthy, Roshan Ram, Akshara Prabhakar, Shelby Heinecke, Caiming Xiong, Silvio Savarese, Huan Wang

Published: 2026-03-05 | Topics: LLM, voice agents, dialogue, automation, API, enterprise, streaming

Note: The Japanese title and key points on this page were generated automatically by AI. Refer to the original paper for the exact content.

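Key points

The paper presents a technical tutorial for building enterprise-grade realtime voice agents.
No single existing resource explains the complete pipeline; streaming and pipelining across components are what make the agent realtime.
The agent is built with Deepgram, vLLM, and ElevenLabs and reaches a measured P50 time-to-first-audio of 947 ms with cloud LLM APIs; the code is released.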

Abstract

We present a technical tutorial for building enterprise-grade realtime voice agents from first principles. While over 25 open-source speech-to-speech models and numerous voice agent frameworks exist, no single resource explains the complete pipeline from individual components to a working streaming voice agent with function calling capabilities. Through systematic investigation, we find that (1) native speech-to-speech models like Qwen2.5-Omni, while capable of high-quality audio generation, are too slow for realtime interaction ($\sim$13s time-to-first-audio); (2) the industry-standard approach uses a cascaded streaming pipeline: STT $\rightarrow$ LLM $\rightarrow$ TTS, where each component streams its output to the next; and (3) the key to ``realtime'' is not any single fast model but rather \textit{streaming and pipelining} across components. We build a complete voice agent using Deepgram (streaming STT), vLLM-served LLMs with function calling (streaming text generation), and ElevenLabs (streaming TTS), achieving a measured P50 time-to-first-audio of 947ms (best case 729ms) with cloud LLM APIs, and comparable latency with self-hosted vLLM on NVIDIA A10G GPU. We release the full codebase as a tutorial with working, tested code for every component.
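The cascaded streaming idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code and does not call Deepgram, vLLM, or ElevenLabs: `fake_stt`, `fake_llm`, `sentences`, `fake_tts`, and `run_turn` are hypothetical stand-ins, and splitting the reply at sentence boundaries before synthesis is an assumption made here for illustration. The sketch only shows the ordering property that matters: the first audio chunk is produced while the LLM is still generating the rest of the reply.

```python
from typing import Iterable, Iterator, List

# Ordered trace of what each stage did, used to inspect the interleaving.
events: List[str] = []


def fake_stt(frames: Iterable[bytes]) -> Iterator[str]:
    """Stand-in for a streaming STT client: emits one word per audio frame."""
    for frame in frames:
        word = frame.decode("utf-8")
        events.append(f"stt:{word}")
        yield word


def fake_llm(words: Iterable[str]) -> Iterator[str]:
    """Stand-in for a streaming LLM: reads the full user turn, then streams tokens."""
    prompt = " ".join(words)
    reply = f"You said {prompt}. How can I help?"
    for token in reply.split(" "):
        events.append(f"llm:{token}")
        yield token


def sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Group tokens into sentences so synthesis can start before the reply ends."""
    buf: List[str] = []
    for token in tokens:
        buf.append(token)
        if token.endswith((".", "?", "!")):
            yield " ".join(buf)
            buf = []
    if buf:  # flush a trailing fragment without terminal punctuation
        yield " ".join(buf)


def fake_tts(sents: Iterable[str]) -> Iterator[bytes]:
    """Stand-in for a streaming TTS client: emits one audio chunk per sentence."""
    for sent in sents:
        events.append(f"tts:{sent}")
        yield sent.encode("utf-8")


def run_turn(frames: Iterable[bytes]) -> List[bytes]:
    """Chain the stages lazily; each stage pulls from the previous one on demand."""
    events.clear()
    return list(fake_tts(sentences(fake_llm(fake_stt(frames)))))


if __name__ == "__main__":
    run_turn([b"check", b"my", b"order"])
    for event in events:
        print(event)
```

Running the script prints the event trace; the first `tts:` entry appears before the last `llm:` entry, which is the pipelining effect the abstract credits for low time-to-first-audio. A real agent would replace the generators with network clients and run the stages concurrently.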

Metadata

arXiv ID: 2603.05413
Category: cs.SD
