Tokens & Embeddings Deep Dive

Deep dive into tokenizer internals, cross-model comparison, embedding training, and the encoder/decoder split — for learners who want to go beyond the fundamentals.

Why take this course?

The basics of tokenization and embeddings are covered in How LLMs Work. This deep dive explores tokenizer internals across models, multilingual cost traps, how embeddings are trained (skip-gram, contrastive learning), and the encoder vs decoder architecture split.

Prerequisites

This course builds on concepts from earlier courses. It is recommended to complete them first.

Course Modules

Module 1: Tokenization Foundations

How do LLMs break text into pieces they can process? Explore Byte-Pair Encoding, subword tokenization strategies, vocabulary size tradeoffs, and why the tokenizer must match the model.

Learning Goals

  • Describe how Byte-Pair Encoding (BPE) builds vocabularies from training data.
  • Explain why subword tokenization strikes a balance between character-level and word-level approaches.
  • Analyze how vocabulary size affects model performance, memory, and generalization.
  • Recognize why a tokenizer must be paired with the model it was trained with.
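
As a rough illustration of the first goal, the BPE training loop can be sketched in a few lines. This is a minimal sketch, not the tokenizer of any particular model: it assumes whitespace-split words, starts from single characters, and breaks ties by first occurrence. The function names (`get_pair_counts`, `merge_pair`, `train_bpe`) are hypothetical.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    counts = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts


def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-separated corpus."""
    # Assumption: each word starts as a tuple of single characters.
    words = dict(Counter(tuple(w) for w in corpus.split()))
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break
        best = max(counts, key=counts.get)  # most frequent adjacent pair
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


merges, words = train_bpe("low low low lower lowest", num_merges=2)
print(merges)
print(words)
```

Each merge adds one entry to the vocabulary, so the number of merges is effectively the vocabulary-size knob. The learned merge list is also why a tokenizer has to travel with its model: a different merge list produces different token IDs for the same text.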

Concept Card Preview

Visuals, diagrams, and micro-interactions you'll see in this module.

The Spelling Paradox

Nina asks GPT-4 to count the R's in strawberry. It answers: two. The word has three.

The model is not looking at lett…

How Tokenizers Chop Text

A tokenizer has to choose granularity.

Character-level tokenization sees every letter, but sequences explode. A par…

Tokenizer Choice Changes Cost and Behavior

Tokenizers are not interchangeable utilities. They are part of the model.

Three things shape behavior: the algorithm, v…

Module 2: From Tokens to Vectors

Tokens are just IDs — embeddings give them meaning. Learn how embeddings convert token IDs into dense vectors, the leap from word2vec to contextual embeddings, cosine similarity, and how semantic search works.

Learning Goals

  • Understand how embeddings convert token IDs into semantic vector representations.
  • Explain the difference between static embeddings (word2vec) and contextualized embeddings (DeBERTa).
  • Use cosine similarity to measure how close two vectors are in embedding space.
  • Describe how semantic search uses embeddings to find relevant documents beyond keyword matching.
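
To make the cosine similarity and semantic search goals concrete, here is a small sketch using only the standard library. The three-dimensional vectors and document names are made up for illustration; real embeddings come from a trained model and have hundreds or thousands of dimensions. The helper names `cosine_similarity` and `search` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def search(query_vec, doc_vecs, top_k=2):
    """Rank documents by cosine similarity to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), name)
              for name, vec in doc_vecs.items()]
    return sorted(scored, reverse=True)[:top_k]


# Toy vectors, invented for this example only.
docs = {
    "login failures": [0.9, 0.1, 0.0],
    "rate limits": [0.1, 0.9, 0.1],
    "changelog": [0.0, 0.1, 0.9],
}
query = [0.8, 0.2, 0.0]  # stands in for an embedded user query

for score, name in search(query, docs):
    print(f"{name}: {score:.3f}")
```

Because the ranking depends only on direction, two texts can match without sharing a single keyword, as long as the embedding model places them close together.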

Concept Card Preview

Visuals, diagrams, and micro-interactions you'll see in this module.

Keyword Search Breaks on Meaning

Nina ships API documentation search. Users type authentication error and get nothing. The docs have the answer, but th…

Embeddings Turn IDs Into Geometry

Token IDs are labels. Embeddings are meaning.

The tokenizer may turn apple into ID 451 and fruit into ID 902, but t…

Static vs Contextual Embeddings

Static embeddings have one fatal flaw: one word gets one vector forever.

Take pitch:

  • baseball pitch
  • musical…

Module 3: Tokenizers in the Wild

Not all tokenizers are created equal. Compare tokenizer behavior across GPT-4, Llama, BERT, and StarCoder, and explore special tokens, multilingual cost implications, and tokenizer-free approaches such as byte-level ByT5 and character-level CANINE.

Learning Goals

  • Compare tokenizer outputs across BERT, GPT-2, GPT-4, Llama 2/3, and StarCoder2 for the same input.
  • Recognize how special tokens (BOS, EOS, chat templates) affect model behavior and can introduce subtle bugs.
  • Explain why multilingual text can cost 3x-15x more tokens depending on the tokenizer.
  • Describe byte-level and character-level approaches (ByT5, CANINE) and when they make sense.
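
For a feel of what a byte-level view of text looks like, the sketch below counts UTF-8 bytes per character. It only shows the raw byte sequence a byte-level model would start from and does not reproduce any real model's tokenizer or special tokens. The helper names `byte_tokens` and `bytes_per_char` are hypothetical.

```python
def byte_tokens(text):
    """Return the UTF-8 byte values of `text`, one integer (0-255) per byte."""
    return list(text.encode("utf-8"))


def bytes_per_char(text):
    """Average number of UTF-8 bytes per character (0.0 for empty text)."""
    if not text:
        return 0.0
    return len(byte_tokens(text)) / len(text)


samples = {
    "English": "hello",
    "Accented": "caf\u00e9",  # precomposed e-acute, 2 bytes in UTF-8
    "Chinese": "你好",
}

for label, text in samples.items():
    print(f"{label}: {len(text)} chars -> {len(byte_tokens(text))} bytes")
```

The byte view never hits an unknown symbol, but sequences get longer, and scripts outside ASCII pay more bytes per character. That trade-off is a useful starting point when weighing byte-level models against subword tokenizers.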

Concept Card Preview

Visuals, diagrams, and micro-interactions you'll see in this module.

Hidden Tokens Shape the Input

The text you type is not the full input the model sees. Tokenizers add hidden structure.

BERT-style models add special…

Normalization Is a One-Way Door

Some tokenizers rewrite text before splitting it. That preprocessing is called normalization.

BERT uncased lowercas…

The Multilingual Token Tax

Nina's team expands to China and finds a cost bug hiding in the tokenizer.

The same phrase can cost very different toke…

Module 4: How Embeddings Learn

Embeddings don't appear by magic — they're trained. Discover the skip-gram objective, negative sampling, contrastive learning from SBERT to CLIP, and the split between representation models (encoders) and generation models (decoders).

Learning Goals

  • Explain the skip-gram training objective: predict context words from a center word.
  • Describe negative sampling and why it makes Word2Vec training computationally feasible.
  • Explain contrastive learning: training embeddings by pulling similar pairs together and pushing dissimilar pairs apart.
  • Distinguish representation models (encoder-only) from generation models (decoder-only) and identify which tasks each excels at.
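
The first two goals can be sketched together: skip-gram turns a sentence into (center, context) training pairs, and negative sampling scores one true pair against a handful of random words instead of the whole vocabulary. This is a minimal sketch with hand-picked toy vectors and no training loop. The function names `skipgram_pairs` and `sgns_loss` are hypothetical.

```python
import math


def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sgns_loss(center_vec, context_vec, negative_vecs):
    """Negative-sampling loss for one true pair and k sampled negatives.

    The true context is pushed toward a high dot product with the center,
    each negative toward a low one. Cost grows with k, not vocabulary size.
    """
    loss = -math.log(sigmoid(dot(center_vec, context_vec)))
    for neg in negative_vecs:
        loss -= math.log(sigmoid(-dot(center_vec, neg)))
    return loss


print(skipgram_pairs(["the", "cat", "sat"], window=1))

# Toy 2-d vectors, invented for illustration only.
center = [1.0, 0.5]
context = [0.9, 0.4]
negatives = [[-0.8, 0.1], [0.0, -1.0]]
print(sgns_loss(center, context, negatives))
```

Contrastive learning follows the same pull-together, push-apart pattern, applied to pairs of sentences or image-text pairs rather than neighboring words.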

Concept Card Preview

Visuals, diagrams, and micro-interactions you'll see in this module.

Skip-Gram Learns Meaning from Neighbors

Nina asks the real question: how did Word2Vec learn that king and queen are related without labels?

The answer is t…

Negative Sampling Makes Training Practical

Naive skip-gram is expensive. For every center word, predicting the exact neighbor requires scoring the entire vocabular…

Relationships Become Directions

Vector arithmetic works because relationships become directions.

If man -> woman is a consistent direction in embeddi…
