Tokens & Embeddings Deep Dive
Deep dive into tokenizer internals, cross-model comparison, embedding training, and the encoder/decoder split — for learners who want to go beyond the fundamentals.
Why take this course?
The basics of tokenization and embeddings are covered in How LLMs Work. This deep dive explores tokenizer internals across models, multilingual cost traps, how embeddings are trained (skip-gram, contrastive learning), and the encoder vs decoder architecture split.
Prerequisites
This course builds on concepts from the following courses. It is recommended to complete them first:
Course Modules
How do LLMs break text into pieces they can process? Explore Byte-Pair Encoding, subword tokenization strategies, vocabulary size tradeoffs, and why the tokenizer must match the model.
Learning Goals
- Describe how Byte-Pair Encoding (BPE) builds vocabularies from training data.
- Explain why subword tokenization strikes a balance between character-level and word-level approaches.
- Analyze how vocabulary size affects model performance, memory, and generalization.
- Recognize why a tokenizer must be paired with the model it was trained with.
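As a rough illustration of the first goal, the sketch below learns a few BPE merges from a tiny word-frequency table. The corpus, the function name `learn_bpe`, and the merge count are made up for this example; production tokenizers add byte fallback, pre-tokenization, and other details not shown here.

```python
from collections import Counter

def learn_bpe(word_counts, num_merges):
    """Toy BPE trainer: repeatedly fuse the most frequent adjacent symbol pair."""
    # Start with every word split into single characters.
    corpus = {tuple(word): freq for word, freq in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite each word with the chosen pair fused into one symbol.
        new_corpus = {}
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            new_corpus[key] = new_corpus.get(key, 0) + freq
        corpus = new_corpus
    return merges, corpus

# Hypothetical corpus: word -> frequency.
toy_counts = {"low": 5, "lower": 2, "lowest": 3}
merges, segmented = learn_bpe(toy_counts, num_merges=3)
```

After three merges the frequent stem `low` has become a single vocabulary entry, while the rarer suffixes are still split into smaller pieces. That is the subword compromise between character-level and word-level tokenization.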
Concept Card Preview
Visuals, diagrams, and micro-interactions you'll see in this module.
The Spelling Paradox
Nina asks GPT-4 to count the R's in strawberry. It answers: two. The word has three.
The model is not looking at lett…
How Tokenizers Chop Text
A tokenizer has to choose granularity.
Character-level tokenization sees every letter, but sequences explode. A par…
Tokenizer Choice Changes Cost and Behavior
Tokenizers are not interchangeable utilities. They are part of the model.
Three things shape behavior: the algorithm, v…
Tokens are just IDs; embeddings give them meaning. Learn how embeddings convert token IDs into dense vectors, how the field moved from static Word2Vec vectors to contextual embeddings, how cosine similarity measures closeness, and how semantic search works.
Learning Goals
- Understand how embeddings convert token IDs into semantic vector representations.
- Explain the difference between static embeddings (such as Word2Vec) and contextualized embeddings (such as those produced by DeBERTa).
- Use cosine similarity to measure how close two vectors are in embedding space.
- Describe how semantic search uses embeddings to find relevant documents beyond keyword matching.
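The last two goals can be sketched together: cosine similarity scores how closely two vectors point in the same direction, and semantic search ranks documents by that score against the query vector. The three-dimensional vectors and document titles below are invented for illustration; a real system would get its vectors from an embedding model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical document embeddings (made-up numbers, not model output).
doc_vectors = {
    "Fixing 401 Unauthorized responses": [0.9, 0.1, 0.0],
    "Rate limits and quotas": [0.1, 0.9, 0.1],
    "Changelog": [0.0, 0.1, 0.9],
}

# Pretend this is the embedding of the query "authentication error".
query_vector = [0.8, 0.2, 0.1]

# Rank documents from most to least similar to the query.
ranked = sorted(
    doc_vectors,
    key=lambda title: cosine_similarity(query_vector, doc_vectors[title]),
    reverse=True,
)
```

The top-ranked title shares no keyword with the query; it wins only because its vector points in nearly the same direction.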
Concept Card Preview
Keyword Search Breaks on Meaning
Nina ships API documentation search. Users type authentication error and get nothing. The docs have the answer, but th…

Embeddings Turn IDs Into Geometry
Token IDs are labels. Embeddings are meaning.
The tokenizer may turn apple into ID 451 and fruit into ID 902, but t…
Static vs Contextual Embeddings
Static embeddings have one fatal flaw: one word gets one vector forever.
Take pitch:
- baseball pitch
- musical…
Not all tokenizers are created equal. Compare tokenizer behavior across GPT-4, Llama, BERT, and StarCoder, and explore special tokens, multilingual cost implications, and tokenizer-free approaches such as CANINE (character-level) and ByT5 (byte-level).
Learning Goals
- Compare tokenizer outputs across BERT, GPT-2, GPT-4, Llama 2/3, and StarCoder2 for the same input.
- Recognize how special tokens (BOS, EOS, chat templates) affect model behavior and can introduce subtle bugs.
- Explain why non-English text can cost 3x-15x more tokens than the equivalent English text, depending on the tokenizer.
- Describe byte- and character-level approaches (ByT5 operates on UTF-8 bytes, CANINE on Unicode characters) and when they make sense.
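To get a feel for the byte-level trade-off, the sketch below treats each UTF-8 byte as one input unit, which is the general idea behind ByT5-style models. The helper name `utf8_byte_ids` is hypothetical, and the raw byte values shown are not any model's actual ID scheme.

```python
def utf8_byte_ids(text):
    """Return the UTF-8 byte values of text, one integer (0-255) per byte."""
    return list(text.encode("utf-8"))

samples = {
    "English": "hello",
    "Chinese": "你好",
    "Emoji": "🙂",
}

# Map each label to (character count, byte count).
lengths = {
    label: (len(text), len(utf8_byte_ids(text)))
    for label, text in samples.items()
}
# "hello" is 5 characters and 5 bytes; "你好" is 2 characters but 6 bytes.

# Byte sequences are lossless: decoding recovers the original string exactly.
round_trip = bytes(utf8_byte_ids("你好")).decode("utf-8")
```

A byte vocabulary never hits an unknown token, but sequences get longer, and scripts outside ASCII pay several bytes per character.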
Concept Card Preview
Hidden Tokens Shape the Input
The text you type is not the full input the model sees. Tokenizers add hidden structure.
BERT-style models add special…
Normalization Is a One-Way Door
Some tokenizers rewrite text before splitting it. That preprocessing is called normalization.
BERT uncased lowercas…
The Multilingual Token Tax
Nina's team expands to China and finds a cost bug hiding in the tokenizer.
The same phrase can cost very different toke…
Embeddings don't appear by magic — they're trained. Discover the skip-gram objective, negative sampling, contrastive learning from SBERT to CLIP, and the split between representation models (encoders) and generation models (decoders).
Learning Goals
- Explain the skip-gram training objective: predict context words from a center word.
- Describe negative sampling and why it makes Word2Vec training computationally feasible.
- Explain contrastive learning: training embeddings by pulling similar pairs together and pushing dissimilar pairs apart.
- Distinguish representation models (encoder-only) from generation models (decoder-only) and identify which tasks each excels at.
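The first two goals can be made concrete with a minimal sketch of how skip-gram training data is built: each center word is paired with its neighbors inside a window (positive examples), and a few other words are drawn as negatives. The sentence, window size, and function names are illustrative assumptions. The negatives here are drawn uniformly for simplicity, whereas Word2Vec draws them from a smoothed unigram frequency distribution.

```python
import random

def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for every token within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def sample_negatives(vocab, center, positive, k, rng):
    """Pick up to k words that are neither the center nor the true context.

    Uniform sampling is a simplification made for this sketch.
    """
    candidates = [w for w in vocab if w not in (center, positive)]
    return rng.sample(candidates, min(k, len(candidates)))

tokens = "the king greets the queen".split()
vocab = sorted(set(tokens))
pairs = skipgram_pairs(tokens, window=1)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Each training example: one real neighbor plus a handful of sampled negatives.
examples = [
    (center, context, sample_negatives(vocab, center, context, 2, rng))
    for center, context in pairs
]
```

Scoring one positive against a handful of negatives replaces a softmax over the entire vocabulary, which is why negative sampling makes training computationally feasible.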
Concept Card Preview
Skip-Gram Learns Meaning from Neighbors
Nina asks the real question: how did Word2Vec learn that king and queen are related without labels?
The answer is t…

Negative Sampling Makes Training Practical
Naive skip-gram is expensive. For every center word, predicting the exact neighbor requires scoring the entire vocabular…
Relationships Become Directions
Vector arithmetic works because relationships become directions.
If man -> woman is a consistent direction in embeddi…