Voice AI Engineering
Build real-time voice agents: audio pipelines, turn-taking, latency engineering, and the modern orchestration stack.
Why take this course?
Voice is the next interface. Master the architecture behind sub-second voice agents — from modular STT→LLM→TTS pipelines to unified speech-to-speech models, semantic turn-taking, the 800ms latency threshold, and the 2026 orchestration stack (LiveKit, Pipecat, Vapi).
Prerequisites
This course builds on concepts from the following courses; we recommend completing them first:
Course Modules
The three ways to build a voice agent: the modular STT → LLM → TTS sandwich pipeline, unified speech-to-speech models like Gemini 3 and GPT-4o, and hybrid edge-cloud architectures that split reflexes from reasoning.
Learning Goals
- Describe the modular "sandwich" pipeline (STT → LLM → TTS) and its trade-offs: swappable components and debuggability vs cumulative latency.
- Explain unified speech-to-speech (S2S) models that process audio-in → audio-out, achieving sub-500ms responses with emotion awareness at the cost of black-box behavior.
- Design a hybrid edge-cloud architecture where local SLMs handle reflexes (VAD, wake-word) while cloud LLMs handle reasoning.
- Choose the right architectural pattern for a given voice agent use case based on latency, cost, and control requirements.
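The first trade-off above — swappable stages vs cumulative latency — can be shown in a minimal, runnable toy. `fake_stt`, `fake_llm`, and `fake_tts` are stand-ins (not real APIs) whose `time.sleep` calls simulate plausible per-stage latencies; the point is that each stage is independently replaceable, but their latencies add up serially.

```python
import time
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical stage stand-ins -- a real deployment would wrap Whisper,
# an LLM API, and a TTS provider behind these callables.
def fake_stt(audio: bytes) -> str:
    time.sleep(0.3)          # simulated STT latency
    return "what is my account balance"

def fake_llm(text: str) -> str:
    time.sleep(0.4)          # simulated LLM latency
    return "Your balance is $42."

def fake_tts(text: str) -> bytes:
    time.sleep(0.2)          # simulated TTS latency
    return text.encode()     # stand-in for synthesized audio

@dataclass
class SandwichPipeline:
    """STT -> LLM -> TTS: every stage is swappable, but latency accumulates."""
    stt: Callable[[bytes], str] = fake_stt
    llm: Callable[[str], str] = fake_llm
    tts: Callable[[str], bytes] = fake_tts
    stage_latencies: dict = field(default_factory=dict)

    def respond(self, audio_in: bytes) -> bytes:
        t0 = time.monotonic()
        transcript = self.stt(audio_in)
        self.stage_latencies["stt"] = time.monotonic() - t0
        t1 = time.monotonic()
        reply = self.llm(transcript)
        self.stage_latencies["llm"] = time.monotonic() - t1
        t2 = time.monotonic()
        audio_out = self.tts(reply)
        self.stage_latencies["tts"] = time.monotonic() - t2
        return audio_out

pipeline = SandwichPipeline()
pipeline.respond(b"\x00" * 1600)
total = sum(pipeline.stage_latencies.values())
print(f"total: {total:.2f}s")   # ~0.9s of simulated stage time alone -- over the 800ms budget
```

Swapping Deepgram for Whisper is a one-line change to `stt`; that debuggability is exactly what the serial latency pays for.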
Concept Card Preview
Visuals, diagrams, and micro-interactions you'll see in this module.
The Voice Agent Challenge
Nina's team ships a voice agent for customer support. The first demo goes well — until the VP asks a complex question an…
The Modular "Sandwich" Pipeline
The industry standard for enterprise voice agents. You pipe audio through three independent stages: STT (Whisper, De…
Unified Speech-to-Speech (S2S)
Models like Gemini and GPT-4o process audio directly — no STT/TTS stages needed. Audio goes in, audio comes out.…
How voice agents decide when to listen, when to speak, and when to shut up. ML-based voice activity detection, semantic turn-taking that reads linguistic cues instead of silence timers, barge-in handling, and spatial awareness for multi-speaker environments.
Learning Goals
- Explain how ML-based voice activity detection (VAD) distinguishes speech from background noise and why threshold-based approaches fail.
- Describe semantic turn-taking: detecting linguistic completion cues instead of relying on silence timers.
- Design barge-in and interruption handling that kills agent audio the moment user speech is detected.
- Explain acoustic fingerprinting and spatial awareness for locking onto a specific speaker in noisy or multi-speaker environments.
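The barge-in goal above reduces to a small state machine: when VAD reports user speech while the agent is mid-utterance, stop playback and flush queued audio immediately. This is a minimal sketch with hypothetical names (`BargeInController`, `on_vad`); real systems wire these events to a WebRTC audio track.

```python
from enum import Enum, auto

class AgentState(Enum):
    LISTENING = auto()
    SPEAKING = auto()

class BargeInController:
    """Toy interruption handler: the instant VAD reports user speech while
    the agent is speaking, kill playback and drop any unplayed TTS audio."""
    def __init__(self):
        self.state = AgentState.LISTENING
        self.tts_queue: list[bytes] = []
        self.playback_stopped = False

    def start_speaking(self, chunks: list[bytes]) -> None:
        self.tts_queue = list(chunks)
        self.state = AgentState.SPEAKING
        self.playback_stopped = False

    def on_vad(self, user_is_speaking: bool) -> None:
        # Barge-in: user speech detected mid-utterance -> stop immediately.
        if user_is_speaking and self.state is AgentState.SPEAKING:
            self.tts_queue.clear()        # drop unplayed audio
            self.playback_stopped = True  # signal the audio sink to stop
            self.state = AgentState.LISTENING

ctrl = BargeInController()
ctrl.start_speaking([b"chunk1", b"chunk2", b"chunk3"])
ctrl.on_vad(user_is_speaking=True)       # user interrupts
print(ctrl.state, len(ctrl.tts_queue))   # AgentState.LISTENING 0
```

The latency-critical detail is that `on_vad` clears the queue synchronously; waiting for the current sentence to finish is how agents end up talking over their users.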
Concept Card Preview
Visuals, diagrams, and micro-interactions you'll see in this module.
In Voice, Silence Is Data
Text-based AI has it easy — the user hits "send" and the message is complete. Voice is entirely different: **there is no…
Voice Activity Detection (VAD)
Voice Activity Detection answers the most basic question: is someone speaking right now?
Old approach (threshold-ba…
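The threshold-based failure mode is easy to demonstrate. A toy sketch with synthetic frames (the sine-wave "speech" and "hum" are assumptions for illustration): a fixed energy threshold misses quiet speech and fires on loud steady noise — exactly the cases an ML-based VAD handles.

```python
import math

def rms_energy(frame: list[float]) -> float:
    """Root-mean-square energy of one audio frame."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def threshold_vad(frame: list[float], threshold: float = 0.1) -> bool:
    """Old-style VAD: 'speech' whenever energy exceeds a fixed threshold.
    Loud background noise trips it; quiet speech slips under it."""
    return rms_energy(frame) > threshold

# Synthetic 10ms frames (assumed shapes, for illustration only).
quiet_speech = [0.05 * math.sin(i / 3) for i in range(160)]   # soft voice
loud_hum     = [0.3 * math.sin(i / 40) for i in range(160)]   # AC hum

print(threshold_vad(quiet_speech))  # False -> speech missed
print(threshold_vad(loud_hum))      # True  -> false positive on noise
```

ML-based detectors (e.g. Silero-style models) classify frames from learned spectral features rather than raw energy, so the same two frames are separated correctly.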
Semantic Turn-Taking
Old voice systems: "Wait for 2 seconds of silence, then respond." This fails both ways — premature response when users p…
The 800ms rule and every trick to stay under it. Token-level streaming from LLM to TTS, predictive prefetching while the user is still speaking, interstitial fillers that mask think time, and collision avoidance strategies.
Learning Goals
- Explain the 800ms response threshold and why exceeding it breaks conversational flow.
- Design a streaming pipeline that forwards LLM tokens to TTS at the token level instead of waiting for full responses.
- Apply predictive prefetching to query backends while the user is still speaking, using partial transcripts.
- Use interstitial fillers ("Let me look that up...") strategically to mask think time without sounding robotic.
- Identify and prevent response collisions where agent and user audio overlap.
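The prefetching goal above can be sketched with a thread pool: fire the backend query off a partial transcript while the user is still talking, so the result is usually resolved by the time the final transcript arrives. `backend_lookup`, `on_partial_transcript`, and `on_final_transcript` are hypothetical names for this sketch, not a real framework's API.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def backend_lookup(query: str) -> str:
    """Hypothetical slow backend (database, RAG, external API): ~400ms."""
    time.sleep(0.4)
    return f"result for {query!r}"

executor = ThreadPoolExecutor(max_workers=2)
prefetched: dict = {}

def on_partial_transcript(text: str) -> None:
    """Called on interim STT results, while the user is still speaking.
    Kick off the backend query now so it races the rest of the utterance."""
    if text not in prefetched:
        prefetched[text] = executor.submit(backend_lookup, text)

def on_final_transcript(text: str) -> str:
    future = prefetched.get(text) or executor.submit(backend_lookup, text)
    return future.result()   # usually already resolved -> near-zero wait

on_partial_transcript("order status")   # user still mid-sentence
time.sleep(0.45)                        # user finishes ~450ms later
t0 = time.monotonic()
answer = on_final_transcript("order status")
print(f"{answer} (waited {time.monotonic() - t0:.2f}s)")  # wait is ~0s, not ~0.4s
```

The 400ms backend call overlaps with the tail of the user's speech instead of landing inside your 800ms budget; stale partials waste a query, which is usually a cheap trade.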
Concept Card Preview
Visuals, diagrams, and micro-interactions you'll see in this module.
The 800ms Rule
In human conversation, the average gap between turns is 200-300ms — this holds across all studied languages. Our bra…
Your Latency Budget
In a modular sandwich pipeline, every stage eats into your 800ms budget:
| Stage | Latency | What's happening |
|---|--…
Stream Everything: Token-Level TTS
The biggest latency win: don't wait for the LLM to finish before the agent starts speaking.
Traditional: LLM genera…
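The streaming idea above can be sketched in a few lines: buffer LLM tokens only until a phrase boundary, then ship that phrase to TTS immediately. `fake_llm_tokens` is a stand-in for any streaming LLM API; time-to-first-audio drops from full-response generation time to the first phrase's generation time.

```python
import time

def fake_llm_tokens():
    """Stand-in for a streaming LLM: yields tokens as they are generated."""
    for tok in ["Your", " balance", " is", " $42", ".", " Anything", " else?"]:
        time.sleep(0.05)   # simulated per-token generation delay
        yield tok

def stream_to_tts(token_stream, boundary_chars=".!?,"):
    """Forward text to TTS at phrase boundaries instead of waiting
    for the complete response."""
    buffer = ""
    for tok in token_stream:
        buffer += tok
        if buffer and buffer[-1] in boundary_chars:
            yield buffer.strip()   # ship this phrase to TTS now
            buffer = ""
    if buffer.strip():
        yield buffer.strip()       # flush any trailing partial phrase

for phrase in stream_to_tts(fake_llm_tokens()):
    print(phrase)
# Your balance is $42.
# Anything else?
```

TTS starts synthesizing "Your balance is $42." while "Anything else?" is still being generated — the two stages overlap instead of running back-to-back.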
The 2026 voice AI stack: orchestrators like LiveKit and Pipecat for WebRTC plumbing, brain-layer models with native audio reasoning, TTS providers optimized for low time-to-first-byte, and high-level APIs that let you skip the infrastructure.
Learning Goals
- Compare orchestration frameworks (LiveKit, Pipecat) and explain how they handle WebRTC, audio routing, and session management.
- Evaluate brain-layer models (Gemini, GPT-4o) for native audio-reasoning capabilities vs text-only LLMs with STT/TTS wrappers.
- Assess TTS providers (ElevenLabs, Play.ht) on time-to-first-byte, emotional range, and voice cloning quality.
- Distinguish high-level APIs (Vapi, Retell) from DIY stacks and choose the right abstraction level for your use case.
- Design an end-to-end voice agent stack that balances latency, cost, and flexibility for a specific production scenario.
Concept Card Preview
Visuals, diagrams, and micro-interactions you'll see in this module.
The 2026 Voice AI Stack
Building a production voice agent requires four layers:
| Layer | Role | Key metric |
|---|---|---|
| Orchestrator… |
Orchestrators: LiveKit & Pipecat
Orchestrators handle the plumbing: WebRTC connections, audio codecs, session management, and pipeline coordination.
**L…
The Brain: Audio-Native vs Text-Only LLMs
Not all LLMs are equal for voice agents.
Audio-native (Gemini, GPT-4o): accept raw audio, "hear" tone, emotion, pac…