Voice AI Engineering
Build real-time voice agents: audio pipelines, turn-taking, latency engineering, and the modern orchestration stack.
Why take this course?
Voice is the next interface. Master the architecture behind sub-second voice agents — from modular STT→LLM→TTS pipelines to unified speech-to-speech models, semantic turn-taking, the 800ms latency threshold, and the 2026 orchestration stack (LiveKit, Pipecat, Vapi).
Prerequisites
This course builds on concepts from the following courses; we recommend completing them first:
Course Modules
The three ways to build a voice agent: the modular STT → LLM → TTS sandwich pipeline, unified speech-to-speech models like Gemini 3 and GPT-4o, and hybrid edge-cloud architectures that split reflexes from reasoning.
Learning Goals
- Describe the modular "sandwich" pipeline (STT → LLM → TTS) and its trade-offs: swappable components and debuggability vs cumulative latency.
- Explain unified speech-to-speech (S2S) models that process audio-in → audio-out, achieving sub-500ms responses with emotion awareness at the cost of black-box behavior.
- Design a hybrid edge-cloud architecture where local SLMs handle reflexes (VAD, wake-word) while cloud LLMs handle reasoning.
- Choose the right architectural pattern for a given voice agent use case based on latency, cost, and control requirements.
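The first trade-off above — swappable stages vs cumulative latency — can be shown in a minimal, runnable toy. `fake_stt`, `fake_llm`, and `fake_tts` are stand-ins (not real APIs) whose `time.sleep` calls simulate plausible per-stage latencies; the point is that each stage is independently replaceable, but their latencies add up serially.

```python
import time
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical stage stand-ins -- a real deployment would wrap Whisper,
# an LLM API, and a TTS provider behind these callables.
def fake_stt(audio: bytes) -> str:
    time.sleep(0.3)          # simulated STT latency
    return "what is my account balance"

def fake_llm(text: str) -> str:
    time.sleep(0.4)          # simulated LLM latency
    return "Your balance is $42."

def fake_tts(text: str) -> bytes:
    time.sleep(0.2)          # simulated TTS latency
    return text.encode()     # stand-in for synthesized audio

@dataclass
class SandwichPipeline:
    """STT -> LLM -> TTS: every stage is swappable, but latency accumulates."""
    stt: Callable[[bytes], str] = fake_stt
    llm: Callable[[str], str] = fake_llm
    tts: Callable[[str], bytes] = fake_tts
    stage_latencies: dict = field(default_factory=dict)

    def respond(self, audio_in: bytes) -> bytes:
        t0 = time.monotonic()
        transcript = self.stt(audio_in)
        self.stage_latencies["stt"] = time.monotonic() - t0
        t1 = time.monotonic()
        reply = self.llm(transcript)
        self.stage_latencies["llm"] = time.monotonic() - t1
        t2 = time.monotonic()
        audio_out = self.tts(reply)
        self.stage_latencies["tts"] = time.monotonic() - t2
        return audio_out

pipeline = SandwichPipeline()
pipeline.respond(b"\x00" * 1600)
total = sum(pipeline.stage_latencies.values())
print(f"total: {total:.2f}s")   # ~0.9s of simulated stage time alone -- over the 800ms budget
```

Swapping Deepgram for Whisper is a one-line change to `stt`; that debuggability is exactly what the serial latency pays for.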
Concept Card Preview
Visuals, diagrams, and micro-interactions you'll see in this module.
The Voice Agent Challenge
Nina's team ships a voice agent for customer support. The first demo goes well — until the VP asks a complex question an…
The Modular "Sandwich" Pipeline
The industry standard for enterprise voice agents. You pipe audio through three independent stages: STT (Whisper, De…
Unified Speech-to-Speech (S2S)
Models like Gemini and GPT-4o process audio directly — no STT/TTS stages needed. Audio goes in, audio comes out.…
How voice agents decide when to listen, when to speak, and when to shut up. ML-based voice activity detection, semantic turn-taking that reads linguistic cues instead of silence timers, barge-in handling, and spatial awareness for multi-speaker environments.
Learning Goals
- Explain how ML-based voice activity detection (VAD) distinguishes speech from background noise and why threshold-based approaches fail.
- Describe semantic turn-taking: detecting linguistic completion cues instead of relying on silence timers.
- Design barge-in and interruption handling that kills agent audio the moment user speech is detected.
- Explain acoustic fingerprinting and spatial awareness for locking onto a specific speaker in noisy or multi-speaker environments.
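The barge-in goal above reduces to a small state machine: when VAD reports user speech while the agent is mid-utterance, stop playback and flush queued audio immediately. This is a minimal sketch with hypothetical names (`BargeInController`, `on_vad`); real systems wire these events to a WebRTC audio track.

```python
from enum import Enum, auto

class AgentState(Enum):
    LISTENING = auto()
    SPEAKING = auto()

class BargeInController:
    """Toy interruption handler: the instant VAD reports user speech while
    the agent is speaking, kill playback and drop any unplayed TTS audio."""
    def __init__(self):
        self.state = AgentState.LISTENING
        self.tts_queue: list[bytes] = []
        self.playback_stopped = False

    def start_speaking(self, chunks: list[bytes]) -> None:
        self.tts_queue = list(chunks)
        self.state = AgentState.SPEAKING
        self.playback_stopped = False

    def on_vad(self, user_is_speaking: bool) -> None:
        # Barge-in: user speech detected mid-utterance -> stop immediately.
        if user_is_speaking and self.state is AgentState.SPEAKING:
            self.tts_queue.clear()        # drop unplayed audio
            self.playback_stopped = True  # signal the audio sink to stop
            self.state = AgentState.LISTENING

ctrl = BargeInController()
ctrl.start_speaking([b"chunk1", b"chunk2", b"chunk3"])
ctrl.on_vad(user_is_speaking=True)       # user interrupts
print(ctrl.state, len(ctrl.tts_queue))   # AgentState.LISTENING 0
```

The latency-critical detail is that `on_vad` clears the queue synchronously; waiting for the current sentence to finish is how agents end up talking over their users.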
Concept Card Preview
Visuals, diagrams, and micro-interactions you'll see in this module.
In Voice, Silence Is Data
Text-based AI has it easy — the user hits "send" and the message is complete. Voice is entirely different: **there is no…
Voice Activity Detection (VAD)
Voice Activity Detection answers the most basic question: is someone speaking right now?
Old approach (threshold-ba…
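The threshold-based failure mode is easy to demonstrate. A toy sketch with synthetic frames (the sine-wave "speech" and "hum" are assumptions for illustration): a fixed energy threshold misses quiet speech and fires on loud steady noise — exactly the cases an ML-based VAD handles.

```python
import math

def rms_energy(frame: list[float]) -> float:
    """Root-mean-square energy of one audio frame."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def threshold_vad(frame: list[float], threshold: float = 0.1) -> bool:
    """Old-style VAD: 'speech' whenever energy exceeds a fixed threshold.
    Loud background noise trips it; quiet speech slips under it."""
    return rms_energy(frame) > threshold

# Synthetic 10ms frames (assumed shapes, for illustration only).
quiet_speech = [0.05 * math.sin(i / 3) for i in range(160)]   # soft voice
loud_hum     = [0.3 * math.sin(i / 40) for i in range(160)]   # AC hum

print(threshold_vad(quiet_speech))  # False -> speech missed
print(threshold_vad(loud_hum))      # True  -> false positive on noise
```

ML-based detectors (e.g. Silero-style models) classify frames from learned spectral features rather than raw energy, so the same two frames are separated correctly.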
Semantic Turn-Taking
Old voice systems: "Wait for 2 seconds of silence, then respond." This fails both ways — premature response when users p…
The 800ms rule and every trick to stay under it. Token-level streaming from LLM to TTS, predictive prefetching while the user is still speaking, interstitial fillers that mask think time, and collision avoidance strategies.
Learning Goals
- Explain the 800ms response threshold and why exceeding it breaks conversational flow.
- Design a streaming pipeline that forwards LLM tokens to TTS at the token level instead of waiting for full responses.
- Apply predictive prefetching to query backends while the user is still speaking, using partial transcripts.
- Use interstitial fillers ("Let me look that up...") strategically to mask think time without sounding robotic.
- Identify and prevent response collisions where agent and user audio overlap.
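The prefetching goal above can be sketched with a thread pool: fire the backend query off a partial transcript while the user is still talking, so the result is usually resolved by the time the final transcript arrives. `backend_lookup`, `on_partial_transcript`, and `on_final_transcript` are hypothetical names for this sketch, not a real framework's API.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def backend_lookup(query: str) -> str:
    """Hypothetical slow backend (database, RAG, external API): ~400ms."""
    time.sleep(0.4)
    return f"result for {query!r}"

executor = ThreadPoolExecutor(max_workers=2)
prefetched: dict = {}

def on_partial_transcript(text: str) -> None:
    """Called on interim STT results, while the user is still speaking.
    Kick off the backend query now so it races the rest of the utterance."""
    if text not in prefetched:
        prefetched[text] = executor.submit(backend_lookup, text)

def on_final_transcript(text: str) -> str:
    future = prefetched.get(text) or executor.submit(backend_lookup, text)
    return future.result()   # usually already resolved -> near-zero wait

on_partial_transcript("order status")   # user still mid-sentence
time.sleep(0.45)                        # user finishes ~450ms later
t0 = time.monotonic()
answer = on_final_transcript("order status")
print(f"{answer} (waited {time.monotonic() - t0:.2f}s)")  # wait is ~0s, not ~0.4s
```

The 400ms backend call overlaps with the tail of the user's speech instead of landing inside your 800ms budget; stale partials waste a query, which is usually a cheap trade.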
Concept Card Preview
Visuals, diagrams, and micro-interactions you'll see in this module.
The 800ms Rule
In human conversation, the average gap between turns is 200-300ms — this holds across all studied languages. Our bra…
Your Latency Budget
In a modular sandwich pipeline, every stage eats into your 800ms budget:
| Stage | Latency | What's happening |
|---|--…
Stream Everything: Token-Level TTS
The biggest latency win: don't wait for the LLM to finish before the agent starts speaking.
Traditional: LLM genera…
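The streaming idea above can be sketched in a few lines: buffer LLM tokens only until a phrase boundary, then ship that phrase to TTS immediately. `fake_llm_tokens` is a stand-in for any streaming LLM API; time-to-first-audio drops from full-response generation time to the first phrase's generation time.

```python
import time

def fake_llm_tokens():
    """Stand-in for a streaming LLM: yields tokens as they are generated."""
    for tok in ["Your", " balance", " is", " $42", ".", " Anything", " else?"]:
        time.sleep(0.05)   # simulated per-token generation delay
        yield tok

def stream_to_tts(token_stream, boundary_chars=".!?,"):
    """Forward text to TTS at phrase boundaries instead of waiting
    for the complete response."""
    buffer = ""
    for tok in token_stream:
        buffer += tok
        if buffer and buffer[-1] in boundary_chars:
            yield buffer.strip()   # ship this phrase to TTS now
            buffer = ""
    if buffer.strip():
        yield buffer.strip()       # flush any trailing partial phrase

for phrase in stream_to_tts(fake_llm_tokens()):
    print(phrase)
# Your balance is $42.
# Anything else?
```

TTS starts synthesizing "Your balance is $42." while "Anything else?" is still being generated — the two stages overlap instead of running back-to-back.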
The 2026 voice AI stack: orchestrators like LiveKit and Pipecat for WebRTC plumbing, brain-layer models with native audio reasoning, TTS providers optimized for low time-to-first-byte, and high-level APIs that let you skip the infrastructure.
Learning Goals
- Compare orchestration frameworks (LiveKit, Pipecat) and explain how they handle WebRTC, audio routing, and session management.
- Evaluate brain-layer models (Gemini, GPT-4o) for native audio-reasoning capabilities vs text-only LLMs with STT/TTS wrappers.
- Assess TTS providers (ElevenLabs, Play.ht) on time-to-first-byte, emotional range, and voice cloning quality.
- Distinguish high-level APIs (Vapi, Retell) from DIY stacks and choose the right abstraction level for your use case.
- Design an end-to-end voice agent stack that balances latency, cost, and flexibility for a specific production scenario.
Concept Card Preview
Visuals, diagrams, and micro-interactions you'll see in this module.
The 2026 Voice AI Stack
Building a production voice agent requires four layers:
| Layer | Role | Key metric |
|---|---|---|
| Orchestrator… |
Orchestrators: LiveKit & Pipecat
Orchestrators handle the plumbing: WebRTC connections, audio codecs, session management, and pipeline coordination.
**L…
The Brain: Audio-Native vs Text-Only LLMs
Not all LLMs are equal for voice agents.
Audio-native (Gemini, GPT-4o): accept raw audio, "hear" tone, emotion, pac…