Voice AI Engineering
Build real-time voice agents: audio pipelines, turn-taking, latency engineering, and the modern orchestration stack.
Why take this course?
Voice is the next interface. Master the architecture behind sub-second voice agents: modular STT→LLM→TTS pipelines, unified speech-to-speech models, semantic turn-taking, the 800ms latency threshold, and the 2026 orchestration stack (LiveKit, Pipecat, Vapi).
Prerequisites
This course builds on concepts from the following courses. It is recommended to complete them first:
Course Modules
The three ways to build a voice agent: the modular STT → LLM → TTS sandwich pipeline, unified speech-to-speech models like Gemini 3 and GPT-4o, and hybrid edge-cloud architectures that split reflexes from reasoning.
Learning Goals
- Describe the modular "sandwich" pipeline (STT → LLM → TTS) and its trade-offs: swappable components and debuggability vs cumulative latency.
- Explain unified speech-to-speech (S2S) models, which map audio in directly to audio out, and their trade-off: sub-500ms responses and emotion awareness at the cost of black-box behavior.
- Design a hybrid edge-cloud architecture where local SLMs handle reflexes (VAD, wake-word) while cloud LLMs handle reasoning.
- Choose the right architectural pattern for a given voice agent use case based on latency, cost, and control requirements.
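As a rough illustration of the first goal, the sketch below wires three stand-in stages into a sandwich pipeline and times each one. `Stage`, `run_pipeline`, and the `fake_*` functions are hypothetical placeholders, not any provider's API; a real build would swap in actual STT, LLM, and TTS clients behind the same interface.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Stage:
    """One swappable component in the pipeline (hypothetical type)."""
    name: str
    run: Callable[[Any], Any]


# Stand-in components. Each one only transforms its input so the sketch runs offline.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")          # pretend the audio bytes are the transcript


def fake_llm(transcript: str) -> str:
    return f"You said: {transcript}"      # pretend this is the model's reply


def fake_tts(reply: str) -> bytes:
    return reply.encode("utf-8")          # pretend these bytes are synthesized audio


def run_pipeline(stages: List[Stage], payload: Any) -> Tuple[Any, List[Tuple[str, float]]]:
    """Run stages in order and record per-stage wall-clock time in milliseconds."""
    timings: List[Tuple[str, float]] = []
    for stage in stages:
        start = time.perf_counter()
        payload = stage.run(payload)
        timings.append((stage.name, (time.perf_counter() - start) * 1000.0))
    return payload, timings


PIPELINE = [Stage("stt", fake_stt), Stage("llm", fake_llm), Stage("tts", fake_tts)]

audio_out, timings = run_pipeline(PIPELINE, b"hello")
total_ms = sum(ms for _, ms in timings)   # stage latencies add up: the cumulative cost
```

Because each stage is just a named callable, any one of them can be replaced or logged in isolation, which is the debuggability benefit. The summed `total_ms` is the price paid for running the stages strictly one after another.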
Concept Card Preview
Visuals, diagrams, and micro-interactions you'll see in this module.
The Voice Agent Challenge
A text chatbot can pause for a second and still feel usable. A voice agent cannot. When response timing is wrong, people…
Modular Pipeline: STT → LLM → TTS
The most debuggable voice-agent architecture is a modular pipeline. Audio goes to an STT provider, the transcript goes t…
Unified Speech-to-Speech
Unified speech-to-speech systems process audio input and produce audio output more directly. Instead of forcing every si…
How voice agents decide when to listen, when to speak, and when to shut up. ML-based voice activity detection, semantic turn-taking that reads linguistic cues instead of silence timers, barge-in handling, and spatial awareness for multi-speaker environments.
Learning Goals
- Explain how ML-based voice activity detection (VAD) distinguishes speech from background noise and why threshold-based approaches fail.
- Describe semantic turn-taking: detecting linguistic completion cues instead of relying on silence timers.
- Design barge-in and interruption handling that kills agent audio the moment user speech is detected.
- Explain acoustic fingerprinting and spatial awareness for locking onto a specific speaker in noisy or multi-speaker environments.
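To make the contrast between a fixed silence timer and semantic turn-taking concrete, here is a toy sketch. It assumes a crude stand-in for linguistic completion (checking whether the transcript ends on a connective or filler word); production systems would typically use a trained end-of-turn model instead. The word list, function names, and wait times are all illustrative assumptions.

```python
# Words that often signal an unfinished thought (illustrative, not exhaustive).
INCOMPLETE_ENDINGS = {"and", "but", "so", "because", "or", "to", "the", "a", "my", "um", "uh"}


def looks_complete(transcript: str) -> bool:
    """Toy completion cue: the utterance does not end on a dangling word."""
    words = transcript.lower().rstrip(".?! ").split()
    if not words:
        return False
    return words[-1].strip(",") not in INCOMPLETE_ENDINGS


def should_respond(transcript: str, silence_ms: float,
                   short_wait_ms: float = 300, long_wait_ms: float = 1500) -> bool:
    """Respond quickly after a complete-sounding utterance, wait longer otherwise."""
    required = short_wait_ms if looks_complete(transcript) else long_wait_ms
    return silence_ms >= required
```

With these assumed settings, "What time do you open" followed by 400 ms of silence triggers a reply, while "I want to book a flight to" followed by the same pause does not, because the trailing "to" suggests the user is still thinking.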
Concept Card Preview
In Voice, Silence Is Data
Text interfaces have a send button. Voice does not. The agent must infer whether the user is done, pausing to think, tak…
Voice Activity Detection
Voice Activity Detection, or VAD, answers the first gatekeeping question: is someone speaking right now?
A simple thres…
Semantic Turn-Taking
A silence timer is not enough. "Wait two seconds, then respond" fails when users pause to think and also feels slow when…
The 800ms rule and every trick to stay under it. Token-level streaming from LLM to TTS, predictive prefetching while the user is still speaking, interstitial fillers that mask think time, and collision avoidance strategies.
Learning Goals
- Explain the 800ms response threshold and why exceeding it breaks conversational flow.
- Design a streaming pipeline that forwards LLM tokens to TTS at the token level instead of waiting for full responses.
- Apply predictive prefetching to query backends while the user is still speaking, using partial transcripts.
- Use interstitial fillers ("Let me look that up...") strategically to mask think time without sounding robotic.
- Identify and prevent response collisions where agent and user audio overlap.
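One way to picture the streaming goal is a small buffer that sits between the LLM token stream and the TTS engine. The sketch below is an assumption rather than a prescribed design: it flushes text at sentence punctuation or after a character limit, since many TTS engines sound better with a short phrase of context than with single tokens. `chunk_for_tts` and `max_chars` are hypothetical names.

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def chunk_for_tts(tokens: Iterable[str], max_chars: int = 80) -> Iterator[str]:
    """Yield speakable chunks as soon as they are ready, without waiting for the full reply."""
    buffer = ""
    for token in tokens:
        buffer += token
        stripped = buffer.strip()
        # Flush on sentence punctuation, or when the buffer grows too long.
        # The length-based flush may split mid-word if tokens are subword pieces.
        if stripped and (stripped.endswith(SENTENCE_END) or len(buffer) >= max_chars):
            yield stripped
            buffer = ""
    if buffer.strip():
        yield buffer.strip()              # flush whatever remains at end of stream


# Tokens as they might arrive from a streaming LLM (made-up example).
tokens = ["Sure", ".", " Your", " order", " ships", " Friday", "."]
chunks = list(chunk_for_tts(tokens))
```

Here the first chunk, "Sure.", can be sent to TTS after only two tokens, so audio can start while the rest of the reply is still being generated.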
Concept Card Preview
Latency Changes the Conversation
Voice agents live under a tighter timing constraint than text agents. In text, a delayed answer feels like waiting. In v…
Build a Latency Budget
A modular voice pipeline has several places where delay can appear: VAD, STT, policy checks, LLM time to first token, to…
Stream the Critical Path
A common mistake is waiting for the full LLM response before starting TTS. That makes the user wait for all reasoning an…
The 2026 voice AI stack: orchestrators like LiveKit and Pipecat for WebRTC plumbing, brain-layer models with native audio reasoning, TTS providers optimized for low time-to-first-byte, and high-level APIs that let you skip the infrastructure.
Learning Goals
- Compare orchestration frameworks (LiveKit, Pipecat) and explain how they handle WebRTC, audio routing, and session management.
- Evaluate brain-layer models (Gemini, GPT-4o) for native audio-reasoning capabilities vs text-only LLMs with STT/TTS wrappers.
- Assess TTS providers (ElevenLabs, Play.ht) on time-to-first-byte, emotional range, and voice cloning quality.
- Distinguish high-level APIs (Vapi, Retell) from DIY stacks and choose the right abstraction level for your use case.
- Design an end-to-end voice agent stack that balances latency, cost, and flexibility for a specific production scenario.
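When weighing a candidate stack, a simple per-stage latency tally against the 800ms threshold can be a useful first check. The sketch below shows one way to do that. The stage names and millisecond figures are placeholders for illustration only, not measurements of any provider, and `check_budget` is a hypothetical helper.

```python
from typing import Dict, Tuple

BUDGET_MS = 800.0  # the response threshold used throughout this course


def check_budget(stage_ms: Dict[str, float],
                 budget_ms: float = BUDGET_MS) -> Tuple[float, float, str]:
    """Return (total latency, remaining headroom, slowest stage) for a candidate stack."""
    total = sum(stage_ms.values())
    slowest = max(stage_ms, key=stage_ms.get)
    return total, budget_ms - total, slowest


# Placeholder numbers: replace with values measured in your own environment.
candidate = {
    "vad_endpointing": 200.0,
    "stt_final": 150.0,
    "llm_first_token": 300.0,
    "tts_first_byte": 120.0,
    "network": 60.0,
}

total, headroom, slowest = check_budget(candidate)
```

A negative `headroom` means the stack misses the threshold, and `slowest` points at the stage to optimize or swap first.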
Concept Card Preview
The Voice AI Stack
A production voice agent is not one model. It is a stack of realtime systems that have to cooperate.
The core layers ar…
Orchestrators
The orchestrator is the control plane for the conversation. It receives audio events, manages state, calls models and to…
The Brain Layer
The brain layer decides what the agent should do. It may be a text LLM behind STT, an audio-native realtime model, or a…