Voice AI Engineering

Build real-time voice agents: audio pipelines, turn-taking, latency engineering, and the modern orchestration stack.

Why take this course?

Voice is the next interface. Master the architecture behind sub-second voice agents — from modular STT→LLM→TTS pipelines to unified speech-to-speech models, semantic turn-taking, the 800ms latency threshold, and the 2026 orchestration stack (LiveKit, Pipecat, Vapi).

Prerequisites

This course builds on concepts from the following courses, which we recommend completing first:

Course Modules

Module 1: Architectural Patterns — Pipelines, S2S & Hybrid

The three ways to build a voice agent: the modular STT → LLM → TTS sandwich pipeline, unified speech-to-speech models like Gemini 3 and GPT-4o, and hybrid edge-cloud architectures that split reflexes from reasoning.

Learning Goals

  • Describe the modular "sandwich" pipeline (STT → LLM → TTS) and its trade-offs: swappable components and debuggability vs cumulative latency.
  • Explain unified speech-to-speech (S2S) models that process audio-in → audio-out, achieving sub-500ms responses with emotion awareness at the cost of black-box behavior.
  • Design a hybrid edge-cloud architecture where local SLMs handle reflexes (VAD, wake-word) while cloud LLMs handle reasoning.
  • Choose the right architectural pattern for a given voice agent use case based on latency, cost, and control requirements.
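The sandwich pipeline above can be sketched in a few lines. The stage functions here are hypothetical stand-ins (not any real vendor SDK); the point is that each stage is independently swappable while its latency accumulates across the turn:

```python
# Minimal sketch of the modular "sandwich" pipeline (STT -> LLM -> TTS).
# All three stage functions are illustrative placeholders.

import time

def transcribe(audio: bytes) -> str:          # STT stage (e.g. a Whisper-style model)
    return "what's my order status"

def generate_reply(transcript: str) -> str:   # LLM stage
    return f"Let me check on: {transcript}"

def synthesize(text: str) -> bytes:           # TTS stage; bytes stand in for audio samples
    return text.encode("utf-8")

def voice_turn(audio_in: bytes) -> tuple[bytes, float]:
    """Run one conversational turn and measure cumulative pipeline latency."""
    start = time.monotonic()
    text = transcribe(audio_in)               # each stage waits for the previous one,
    reply = generate_reply(text)              # so latencies add up rather than overlap
    audio_out = synthesize(reply)
    return audio_out, time.monotonic() - start

audio_out, latency = voice_turn(b"\x00" * 320)
```

Swapping any stage means replacing one function; the cost is that the serial hand-offs set a floor on total response time.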

Concept Card Preview

Visuals, diagrams, and micro-interactions you'll see in this module.

The Voice Agent Challenge

Nina's team ships a voice agent for customer support. The first demo goes well — until the VP asks a complex question an…

The Modular "Sandwich" Pipeline

The industry standard for enterprise voice agents. You pipe audio through three independent stages: STT (Whisper, De…

Unified Speech-to-Speech (S2S)

Models like Gemini and GPT-4o process audio directly — no STT/TTS stages needed. Audio goes in, audio comes out.…

Module 2: Interaction & Turn-Taking

How voice agents decide when to listen, when to speak, and when to shut up. ML-based voice activity detection, semantic turn-taking that reads linguistic cues instead of silence timers, barge-in handling, and spatial awareness for multi-speaker environments.

Learning Goals

  • Explain how ML-based voice activity detection (VAD) distinguishes speech from background noise and why threshold-based approaches fail.
  • Describe semantic turn-taking: detecting linguistic completion cues instead of relying on silence timers.
  • Design barge-in and interruption handling that kills agent audio the moment user speech is detected.
  • Explain acoustic fingerprinting and spatial awareness for locking onto a specific speaker in noisy or multi-speaker environments.
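The barge-in behavior described above can be sketched as a tiny per-frame state machine. Here `speech_prob` stands in for the score an ML-based VAD would emit for each audio frame, and the 0.6 threshold is an illustrative assumption:

```python
# Barge-in sketch: a VAD score drives a two-state machine that cuts
# agent playback the moment user speech is detected.

AGENT_SPEAKING, LISTENING = "agent_speaking", "listening"

def handle_frame(state: str, speech_prob: float, threshold: float = 0.6):
    """Return (new_state, stop_playback) for one audio frame.

    speech_prob is a hypothetical ML-VAD score in [0, 1]; a raw energy
    threshold in its place would misfire on background noise.
    """
    user_is_speaking = speech_prob >= threshold
    if state == AGENT_SPEAKING and user_is_speaking:
        return LISTENING, True      # barge-in: kill agent audio immediately
    return state, False             # otherwise keep doing what we were doing

state, stop = handle_frame(AGENT_SPEAKING, speech_prob=0.92)
```

A production agent would also flush any queued TTS audio on barge-in, not just stop the current buffer.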

Concept Card Preview

Visuals, diagrams, and micro-interactions you'll see in this module.

In Voice, Silence Is Data

Text-based AI has it easy — the user hits "send" and the message is complete. Voice is entirely different: **there is no…

Voice Activity Detection (VAD)

Voice Activity Detection answers the most basic question: is someone speaking right now?

Old approach (threshold-ba…

Semantic Turn-Taking

Old voice systems: "Wait for 2 seconds of silence, then respond." This fails both ways — premature response when users p…

Module 3: Latency Engineering

The 800ms rule and every trick to stay under it. Token-level streaming from LLM to TTS, predictive prefetching while the user is still speaking, interstitial fillers that mask think time, and collision avoidance strategies.

Learning Goals

  • Explain the 800ms response threshold and why exceeding it breaks conversational flow.
  • Design a streaming pipeline that forwards LLM tokens to TTS at the token level instead of waiting for full responses.
  • Apply predictive prefetching to query backends while the user is still speaking, using partial transcripts.
  • Use interstitial fillers ("Let me look that up...") strategically to mask think time without sounding robotic.
  • Identify and prevent response collisions where agent and user audio overlap.
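Token-level streaming can be sketched as a small buffering generator: flush to TTS as soon as a speakable chunk accumulates instead of waiting for the full response. Both generators and the `min_chars` flush rule are illustrative assumptions, not any specific vendor's API:

```python
# Forward LLM tokens to TTS in small speakable chunks so the agent
# can start talking while the rest of the reply is still generating.

from typing import Iterator

def llm_tokens() -> Iterator[str]:
    """Stand-in for a streaming LLM that yields one token at a time."""
    yield from "Sure, let me check your order status now.".split()

def stream_to_tts(tokens: Iterator[str], min_chars: int = 12) -> Iterator[str]:
    """Yield chunks for TTS once enough text or a natural break accumulates."""
    buffer: list[str] = []
    for tok in tokens:
        buffer.append(tok)
        chunk = " ".join(buffer)
        # Flush on length or on punctuation that marks a prosodic boundary.
        if len(chunk) >= min_chars or tok.endswith((".", "!", "?", ",")):
            yield chunk
            buffer = []
    if buffer:
        yield " ".join(buffer)    # flush whatever remains at end of stream

chunks = list(stream_to_tts(llm_tokens()))
```

The first chunk ("Sure,") is speakable after a single token, so time-to-first-audio depends on the LLM's first-token latency rather than its full generation time.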

Concept Card Preview

Visuals, diagrams, and micro-interactions you'll see in this module.

The 800ms Rule

In human conversation, the average gap between turns is 200-300ms — this holds across all studied languages. Our bra…

Your Latency Budget

In a modular sandwich pipeline, every stage eats into your 800ms budget:

| Stage | Latency | What's happening |
|---|--…
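To make the budget arithmetic concrete, here is a worked example. The per-stage numbers are assumptions chosen for illustration, not measured benchmarks:

```python
# Illustrative latency budget for a modular pipeline. Every stage runs
# serially, so the stage latencies sum against the 800 ms threshold.

budget_ms = 800
stages_ms = {
    "VAD / endpointing": 150,     # assumed values, not benchmarks
    "STT final transcript": 200,
    "LLM first token": 250,
    "TTS first audio byte": 150,
}
total_ms = sum(stages_ms.values())
headroom_ms = budget_ms - total_ms   # slack left before the 800 ms rule breaks
```

With these assumed numbers the pipeline lands at 750 ms, leaving only 50 ms of headroom; any network jitter or a slow LLM first token blows the budget, which is why the streaming and prefetching tricks above matter.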

Stream Everything: Token-Level TTS

The biggest latency win: don't wait for the LLM to finish before the agent starts speaking.

Traditional: LLM genera…

Module 4: Voice AI Tech Stack & Orchestration

The 2026 voice AI stack: orchestrators like LiveKit and Pipecat for WebRTC plumbing, brain-layer models with native audio reasoning, TTS providers optimized for low time-to-first-byte, and high-level APIs that let you skip the infrastructure.

Learning Goals

  • Compare orchestration frameworks (LiveKit, Pipecat) and explain how they handle WebRTC, audio routing, and session management.
  • Evaluate brain-layer models (Gemini, GPT-4o) for native audio-reasoning capabilities vs text-only LLMs with STT/TTS wrappers.
  • Assess TTS providers (ElevenLabs, Play.ht) on time-to-first-byte, emotional range, and voice cloning quality.
  • Distinguish high-level APIs (Vapi, Retell) from DIY stacks and choose the right abstraction level for your use case.
  • Design an end-to-end voice agent stack that balances latency, cost, and flexibility for a specific production scenario.
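The four layers above can be jotted down as a plain data structure for side-by-side comparison. The example components and evaluation criteria are taken from the goals above; the short phrasing of each criterion is a paraphrase, not vendor terminology:

```python
# The 2026 voice AI stack as a comparison table in code form.
# "judged_on" paraphrases the evaluation criteria from the module goals.

stack = {
    "orchestrator":   {"examples": ["LiveKit", "Pipecat"],
                       "judged_on": "WebRTC, audio routing, session management"},
    "brain":          {"examples": ["Gemini", "GPT-4o"],
                       "judged_on": "native audio reasoning"},
    "tts":            {"examples": ["ElevenLabs", "Play.ht"],
                       "judged_on": "time-to-first-byte, emotional range, cloning"},
    "high_level_api": {"examples": ["Vapi", "Retell"],
                       "judged_on": "abstraction level vs control"},
}

# A DIY stack picks one component per layer; a high-level API collapses
# the first three layers behind a single interface.
diy_layers = [layer for layer in stack if layer != "high_level_api"]
```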

Concept Card Preview

Visuals, diagrams, and micro-interactions you'll see in this module.

The 2026 Voice AI Stack

Building a production voice agent requires four layers:

| Layer | Role | Key metric |
Orchestrator

Orchestrators: LiveKit & Pipecat

Orchestrators handle the plumbing: WebRTC connections, audio codecs, session management, and pipeline coordination.

**L…

The Brain: Audio-Native vs Text-Only LLMs

Not all LLMs are equal for voice agents.

Audio-native (Gemini, GPT-4o): accept raw audio, "hear" tone, emotion, pac…
