Tokens & Embeddings Deep Dive
Deep dive into tokenizer internals, cross-model comparison, embedding training, and the encoder/decoder split — for learners who want to go beyond the fundamentals.
Why take this course?
The basics of tokenization and embeddings are covered in How LLMs Work. This deep dive explores tokenizer internals across models, multilingual cost traps, how embeddings are trained (skip-gram, contrastive learning), and the encoder vs decoder architecture split.
Prerequisites
This course builds on concepts from the following courses; we recommend completing them first:
Course Modules
How do LLMs break text into pieces they can process? Explore Byte-Pair Encoding, subword tokenization strategies, vocabulary size tradeoffs, and why the tokenizer must match the model.
Learning Goals
- Describe how Byte-Pair Encoding (BPE) builds vocabularies from training data.
- Explain why subword tokenization strikes a balance between character-level and word-level approaches.
- Analyze how vocabulary size affects model performance, memory, and generalization.
- Recognize why a tokenizer must be paired with the model it was trained with.
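To make the first goal concrete, the BPE merge loop can be sketched in a few lines. This is a toy implementation, not a production tokenizer: the corpus counts follow the classic "low / lower / newest / widest" teaching example, and each word is pre-split into characters. The loop repeatedly finds the most frequent adjacent pair and fuses it into a new vocabulary symbol:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(pair, words):
    """Replace every occurrence of the pair with its fused symbol."""
    merged, joined = " ".join(pair), "".join(pair)
    return {word.replace(merged, joined): freq for word, freq in words.items()}

# Toy corpus: each word is a space-separated sequence of characters.
corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
merges = []
for _ in range(3):
    pair = most_frequent_pair(corpus)
    merges.append(pair)
    corpus = merge_pair(pair, corpus)

print(merges)  # → [('e', 's'), ('es', 't'), ('l', 'o')]
```

Each learned merge becomes a vocabulary entry, which is why the vocabulary (and the merge order) is a product of the training data: a tokenizer trained on different text learns different merges.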
Concept Card Preview
Visuals, diagrams, and micro-interactions you'll see in this module.
The Spelling Paradox
Nina stared at the screen in disbelief. GPT-4 had just claimed 'strawberry' contains two R's. She counted on her fingers…
The Tokenizer: A Wall Between You and the Model
Neural networks only do math. They process numbers, not text. This creates a problem: how do we feed language i…
Subword Tokenization: The Middle Ground
Why not just use one token per letter? Or one token per word? Both approaches break down.
Character tokens make the…
Tokens are just IDs — embeddings give them meaning. Learn how embeddings convert token IDs into dense vectors, the leap from word2vec to contextual embeddings, cosine similarity, and how semantic search works.
Learning Goals
- Understand how embeddings convert token IDs into semantic vector representations.
- Explain the difference between static embeddings (word2vec) and contextualized embeddings (DeBERTa).
- Use cosine similarity to measure how close two vectors are in embedding space.
- Describe how semantic search uses embeddings to find relevant documents beyond keyword matching.
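Cosine similarity itself fits in a few lines. A minimal sketch, using made-up 3-dimensional vectors (real embedding models produce hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = (a · b) / (|a| * |b|); 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for sentence embeddings (values are illustrative).
login_error = [0.9, 0.1, 0.0]
auth_failure = [0.8, 0.2, 0.1]
pizza_recipe = [0.0, 0.1, 0.9]

print(cosine_similarity(login_error, auth_failure))  # high: near-synonyms
print(cosine_similarity(login_error, pizza_recipe))  # low: unrelated
```

Semantic search is this comparison at scale: embed every document once, embed the query at search time, and rank documents by cosine similarity instead of keyword overlap.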
Concept Card Preview
Visuals, diagrams, and micro-interactions you'll see in this module.
Why Keyword Search Breaks
Nina built a documentation search for her API. Users type 'authentication error' and get... nothing. But the docs have a…

The Embedding Matrix: A Lookup Table of Meaning
We know tokenizers convert words to integer IDs. But integers have no semantic meaning — token 451 ('apple') is mathemat…
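The "lookup table" framing can be sketched directly: the embedding matrix is a table with one trainable row per token ID, and "embedding" a token is plain row indexing. The vocabulary size, dimension, and random values here are illustrative, not from any real model:

```python
import random

random.seed(0)
VOCAB_SIZE, DIM = 6, 4

# One trainable row of DIM floats per token ID. In a real model these
# rows are learned during training; here they are random placeholders.
embedding_matrix = [
    [random.uniform(-1, 1) for _ in range(DIM)] for _ in range(VOCAB_SIZE)
]

def embed(token_ids):
    """Embedding 'lookup' is just row indexing; no arithmetic on the IDs themselves."""
    return [embedding_matrix[i] for i in token_ids]

vectors = embed([4, 1, 4])  # repeated IDs map to the same row
print(vectors)
```

The point the card makes holds here too: the integer 451 carries no meaning, but the row stored at index 451 does, because training shaped it.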
The Problem with Static Embeddings
Here's a word that exposes static embeddings' fatal flaw: pitch.
- A pitcher threw a perfect pitch — sports mec…
Not all tokenizers are created equal. Compare tokenizer behavior across GPT-4, Llama, BERT, and StarCoder — explore special tokens, multilingual cost implications, and byte-level approaches like CANINE and ByT5.
Learning Goals
- Compare tokenizer outputs across BERT, GPT-2, GPT-4, Llama 2/3, and StarCoder2 for the same input.
- Recognize how special tokens (BOS, EOS, chat templates) affect model behavior and can introduce subtle bugs.
- Explain why multilingual text can cost 3x-15x more tokens depending on the tokenizer.
- Describe byte-level tokenization approaches (CANINE, ByT5) and when they make sense.
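Byte-level tokenization is simple enough to sketch directly: every UTF-8 byte becomes one token (real ByT5 also reserves a handful of special-token IDs, omitted here). The sketch also shows the multilingual cost problem in miniature, since the same five-character string costs very different numbers of byte tokens depending on the script:

```python
# ByT5-style byte-level "tokenization": one token per UTF-8 byte.
def byte_tokens(text):
    return list(text.encode("utf-8"))

for text in ["hello", "héllo", "こんにちは"]:
    toks = byte_tokens(text)
    print(f"{text!r}: {len(text)} chars -> {len(toks)} byte tokens")
# 'hello' costs 5 tokens, but the 5-character Japanese greeting costs 15,
# because each of its characters is 3 bytes in UTF-8.
```

Subword tokenizers trained mostly on English show the same skew for a different reason: rare scripts get few merges, so non-English text shatters into many short tokens.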
Concept Card Preview
Visuals, diagrams, and micro-interactions you'll see in this module.
The Hidden Passengers: Special Tokens
Nina inspects a BERT tokenizer's output and notices tokens she never typed. [CLS] at the start. [SEP] at the end. Someti…
Chat Templates: Tokens That Shape Conversations
Modern chat models don't just see raw text — they see structured conversations wrapped in special tokens added during in…
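As an illustration, here is a ChatML-style template. The `<|im_start|>`/`<|im_end|>` tokens are one real convention, but every model family defines its own markers, so this sketch is not any specific model's exact template:

```python
# ChatML-style wrapping of a conversation (illustrative, not model-exact).
def apply_chat_template(messages):
    out = []
    for m in messages:
        out.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>")
    # Trailing open tag cues the model to generate the assistant turn.
    out.append("<|im_start|>assistant\n")
    return "\n".join(out)

print(apply_chat_template([
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Count the R's in strawberry."},
]))
```

Getting these markers wrong (or mixing one model's template with another model's weights) is exactly the kind of subtle bug this module covers: the text looks fine to a human, but the model sees malformed structure.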
The Tokenizer Showdown: Same Text, Seven Models
Nina feeds the same multi-line text into seven tokenizers to see how they differ:
English and CAPITALIZATION
(emoji…
Embeddings don't appear by magic — they're trained. Discover the skip-gram objective, negative sampling, contrastive learning from SBERT to CLIP, and the split between representation models (encoders) and generation models (decoders).
Learning Goals
- Explain the skip-gram training objective: predict context words from a center word.
- Describe negative sampling and why it makes Word2Vec training computationally feasible.
- Explain contrastive learning: training embeddings by pulling similar pairs together and pushing dissimilar pairs apart.
- Distinguish representation models (encoder-only) from generation models (decoder-only) and identify which tasks each excels at.
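Generating the (center, context) training pairs named in the first goal above can be sketched with a sliding window (window size 1 here for brevity; word2vec typically uses larger windows):

```python
def skipgram_pairs(tokens, window=2):
    """For each center word, emit (center, context) pairs within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the king rules the realm".split()
print(skipgram_pairs(sentence, window=1))
```

Every pair is a tiny supervised example extracted from raw text with no human labels, which is what makes the method self-supervised.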
Concept Card Preview
Visuals, diagrams, and micro-interactions you'll see in this module.
You Shall Know a Word by the Company It Keeps
Nina wonders: how did Word2Vec learn that 'king' and 'queen' are related without anyone labeling them?
The answer is th…

Skip-Gram: The Training Loop
The skip-gram model is surprisingly simple — just two matrices.
Step 1: Take a center word (e.g., 'king') and c…

The Efficiency Trick: Negative Sampling
There's a problem with vanilla skip-gram: computing probabilities over the entire vocabulary (50,000+ words) for eve…
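A sketch of the negative-sampling loss for a single training step, assuming the vectors are plain Python lists with hypothetical values (real implementations use matrix libraries, train both matrices by gradient descent, and sample negatives from a frequency-adjusted distribution):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def negative_sampling_loss(center_vec, context_vec, negative_vecs):
    """Skip-gram with negative sampling: push the true pair's score up
    and k sampled 'noise' pairs' scores down, instead of normalizing
    over the entire vocabulary."""
    dot = sum(c * t for c, t in zip(center_vec, context_vec))
    loss = -math.log(sigmoid(dot))          # reward the real (center, context) pair
    for neg in negative_vecs:
        ndot = sum(c * n for c, n in zip(center_vec, neg))
        loss += -math.log(sigmoid(-ndot))   # penalize each sampled noise pair
    return loss

# A well-trained state: context aligned with center, negative anti-aligned.
print(negative_sampling_loss([1.0, 0.0], [5.0, 0.0], [[-5.0, 0.0]]))  # small loss
```

The efficiency win is visible in the shape of the loop: each step touches one positive pair plus k negatives (k is typically 5 to 20), not all 50,000+ vocabulary entries.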