Tokens & Embeddings Deep Dive
Deep dive into tokenizer internals, cross-model comparison, embedding training, and the encoder/decoder split — for learners who want to go beyond the fundamentals.
Why take this course?
The basics of tokenization and embeddings are covered in How LLMs Work. This deep dive explores tokenizer internals across models, multilingual cost traps, how embeddings are trained (skip-gram, contrastive learning), and the encoder vs decoder architecture split.
Prerequisites
This course builds on concepts from the following courses; we recommend completing them first:
Course Modules
How do LLMs break text into pieces they can process? Explore Byte-Pair Encoding, subword tokenization strategies, vocabulary size tradeoffs, and why the tokenizer must match the model.
Learning Goals
- Describe how Byte-Pair Encoding (BPE) builds vocabularies from training data.
- Explain why subword tokenization strikes a balance between character-level and word-level approaches.
- Analyze how vocabulary size affects model performance, memory, and generalization.
- Recognize why a tokenizer must be paired with the model it was trained with.
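To make the first goal concrete, the BPE merge loop can be sketched in a few lines. This is a toy implementation, not a production tokenizer: the corpus counts follow the classic "low / lower / newest / widest" teaching example, and each word is pre-split into characters. The loop repeatedly finds the most frequent adjacent pair and fuses it into a new vocabulary symbol:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(pair, words):
    """Replace every occurrence of the pair with its fused symbol."""
    merged, joined = " ".join(pair), "".join(pair)
    return {word.replace(merged, joined): freq for word, freq in words.items()}

# Toy corpus: each word is a space-separated sequence of characters.
corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
merges = []
for _ in range(3):
    pair = most_frequent_pair(corpus)
    merges.append(pair)
    corpus = merge_pair(pair, corpus)

print(merges)  # → [('e', 's'), ('es', 't'), ('l', 'o')]
```

Each learned merge becomes a vocabulary entry, which is why the vocabulary (and the merge order) is a product of the training data: a tokenizer trained on different text learns different merges.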
Concept Card Preview
Visuals, diagrams, and micro-interactions you'll see in this module.
The Spelling Paradox
Nina stared at the screen in disbelief. GPT-4 had just claimed 'strawberry' contains two R's. She counted on her fingers…
The Tokenizer: A Wall Between You and the Model
Neural networks only do math. They process numbers, not text. This creates a problem: how do we feed language i…
Subword Tokenization: The Middle Ground
Why not just use one token per letter? Or one token per word? Both approaches break down.
Character tokens make the…
Tokens are just IDs — embeddings give them meaning. Learn how embeddings convert token IDs into dense vectors, the leap from word2vec to contextual embeddings, cosine similarity, and how semantic search works.
Learning Goals
- Understand how embeddings convert token IDs into semantic vector representations.
- Explain the difference between static embeddings (word2vec) and contextualized embeddings (DeBERTa).
- Use cosine similarity to measure how close two vectors are in embedding space.
- Describe how semantic search uses embeddings to find relevant documents beyond keyword matching.
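Cosine similarity itself fits in a few lines. A minimal sketch, using made-up 3-dimensional vectors (real embedding models produce hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = (a · b) / (|a| * |b|); 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for sentence embeddings (values are illustrative).
login_error = [0.9, 0.1, 0.0]
auth_failure = [0.8, 0.2, 0.1]
pizza_recipe = [0.0, 0.1, 0.9]

print(cosine_similarity(login_error, auth_failure))  # high: near-synonyms
print(cosine_similarity(login_error, pizza_recipe))  # low: unrelated
```

Semantic search is this comparison at scale: embed every document once, embed the query at search time, and rank documents by cosine similarity instead of keyword overlap.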
Concept Card Preview
Visuals, diagrams, and micro-interactions you'll see in this module.
Why Keyword Search Breaks
Nina built a documentation search for her API. Users type 'authentication error' and get... nothing. But the docs have a…

The Embedding Matrix: A Lookup Table of Meaning
We know tokenizers convert words to integer IDs. But integers have no semantic meaning — token 451 ('apple') is mathemat…
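The "lookup table" framing can be sketched directly: the embedding matrix is a table with one trainable row per token ID, and "embedding" a token is plain row indexing. The vocabulary size, dimension, and random values here are illustrative, not from any real model:

```python
import random

random.seed(0)
VOCAB_SIZE, DIM = 6, 4

# One trainable row of DIM floats per token ID. In a real model these
# rows are learned during training; here they are random placeholders.
embedding_matrix = [
    [random.uniform(-1, 1) for _ in range(DIM)] for _ in range(VOCAB_SIZE)
]

def embed(token_ids):
    """Embedding 'lookup' is just row indexing; no arithmetic on the IDs themselves."""
    return [embedding_matrix[i] for i in token_ids]

vectors = embed([4, 1, 4])  # repeated IDs map to the same row
print(vectors)
```

The point the card makes holds here too: the integer 451 carries no meaning, but the row stored at index 451 does, because training shaped it.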
The Problem with Static Embeddings
Here's a word that exposes static embeddings' fatal flaw: pitch.
- A pitcher threw a perfect pitch — sports mec…
Not all tokenizers are created equal. Compare tokenizer behavior across GPT-4, Llama, BERT, and StarCoder — explore special tokens, multilingual cost implications, and byte-level approaches like CANINE and ByT5.
Learning Goals
- Compare tokenizer outputs across BERT, GPT-2, GPT-4, Llama 2/3, and StarCoder2 for the same input.
- Recognize how special tokens (BOS, EOS, chat templates) affect model behavior and can introduce subtle bugs.
- Explain why multilingual text can cost 3x-15x more tokens depending on the tokenizer.
- Describe byte-level tokenization approaches (CANINE, ByT5) and when they make sense.
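Byte-level tokenization is simple enough to sketch directly: every UTF-8 byte becomes one token (real ByT5 also reserves a handful of special-token IDs, omitted here). The sketch also shows the multilingual cost problem in miniature, since the same five-character string costs very different numbers of byte tokens depending on the script:

```python
# ByT5-style byte-level "tokenization": one token per UTF-8 byte.
def byte_tokens(text):
    return list(text.encode("utf-8"))

for text in ["hello", "héllo", "こんにちは"]:
    toks = byte_tokens(text)
    print(f"{text!r}: {len(text)} chars -> {len(toks)} byte tokens")
# 'hello' costs 5 tokens, but the 5-character Japanese greeting costs 15,
# because each of its characters is 3 bytes in UTF-8.
```

Subword tokenizers trained mostly on English show the same skew for a different reason: rare scripts get few merges, so non-English text shatters into many short tokens.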
Concept Card Preview
Visuals, diagrams, and micro-interactions you'll see in this module.
The Hidden Passengers: Special Tokens
Nina inspects a BERT tokenizer's output and notices tokens she never typed. [CLS] at the start. [SEP] at the end. Someti…
Chat Templates: Tokens That Shape Conversations
Modern chat models don't just see raw text — they see structured conversations wrapped in special tokens added during in…
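As an illustration, here is a ChatML-style template. The `<|im_start|>`/`<|im_end|>` tokens are one real convention, but every model family defines its own markers, so this sketch is not any specific model's exact template:

```python
# ChatML-style wrapping of a conversation (illustrative, not model-exact).
def apply_chat_template(messages):
    out = []
    for m in messages:
        out.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>")
    # Trailing open tag cues the model to generate the assistant turn.
    out.append("<|im_start|>assistant\n")
    return "\n".join(out)

print(apply_chat_template([
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Count the R's in strawberry."},
]))
```

Getting these markers wrong (or mixing one model's template with another model's weights) is exactly the kind of subtle bug this module covers: the text looks fine to a human, but the model sees malformed structure.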
The Tokenizer Showdown: Same Text, Seven Models
Nina feeds the same multi-line text into seven tokenizers to see how they differ:
English and CAPITALIZATION
(emoji…
Embeddings don't appear by magic — they're trained. Discover the skip-gram objective, negative sampling, contrastive learning from SBERT to CLIP, and the split between representation models (encoders) and generation models (decoders).
Learning Goals
- Explain the skip-gram training objective: predict context words from a center word.
- Describe negative sampling and why it makes Word2Vec training computationally feasible.
- Explain contrastive learning: training embeddings by pulling similar pairs together and pushing dissimilar pairs apart.
- Distinguish representation models (encoder-only) from generation models (decoder-only) and identify which tasks each excels at.
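Generating the (center, context) training pairs named in the first goal above can be sketched with a sliding window (window size 1 here for brevity; word2vec typically uses larger windows):

```python
def skipgram_pairs(tokens, window=2):
    """For each center word, emit (center, context) pairs within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the king rules the realm".split()
print(skipgram_pairs(sentence, window=1))
```

Every pair is a tiny supervised example extracted from raw text with no human labels, which is what makes the method self-supervised.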
Concept Card Preview
Visuals, diagrams, and micro-interactions you'll see in this module.
You Shall Know a Word by the Company It Keeps
Nina wonders: how did Word2Vec learn that 'king' and 'queen' are related without anyone labeling them?
The answer is th…

Skip-Gram: The Training Loop
The skip-gram model is surprisingly simple — just two matrices.
Step 1: Take a center word (e.g., 'king') and c…

The Efficiency Trick: Negative Sampling
There's a problem with vanilla skip-gram: computing probabilities over the entire vocabulary (50,000+ words) for eve…
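A sketch of the negative-sampling loss for a single training step, assuming the vectors are plain Python lists with hypothetical values (real implementations use matrix libraries, train both matrices by gradient descent, and sample negatives from a frequency-adjusted distribution):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def negative_sampling_loss(center_vec, context_vec, negative_vecs):
    """Skip-gram with negative sampling: push the true pair's score up
    and k sampled 'noise' pairs' scores down, instead of normalizing
    over the entire vocabulary."""
    dot = sum(c * t for c, t in zip(center_vec, context_vec))
    loss = -math.log(sigmoid(dot))          # reward the real (center, context) pair
    for neg in negative_vecs:
        ndot = sum(c * n for c, n in zip(center_vec, neg))
        loss += -math.log(sigmoid(-ndot))   # penalize each sampled noise pair
    return loss

# A well-trained state: context aligned with center, negative anti-aligned.
print(negative_sampling_loss([1.0, 0.0], [5.0, 0.0], [[-5.0, 0.0]]))  # small loss
```

The efficiency win is visible in the shape of the loop: each step touches one positive pair plus k negatives (k is typically 5 to 20), not all 50,000+ vocabulary entries.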