What an LLM Does
Nina ships her first LLM feature in a weekend. It answers questions, summarizes docs, writes code. Then a user asks: "What's the weather in Berlin?" — and the model confidently invents yesterday's forecast.
Nina didn't break anything. That is how LLMs work. You would have built exactly the same thing.
An LLM does one thing: take text in, predict the next token. Not browse the internet. Not query a database. Just pattern-match against its training data, a massive slice of the internet, one token at a time.
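To make that concrete, here's a minimal sketch of a single prediction step using the Hugging Face transformers library. GPT-2 and the prompt are illustrative choices; any causal LM behaves the same way:

```python
# One prediction step: text in, a score for every vocabulary token out.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The capital of Germany is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits  # shape: (batch, seq_len, vocab_size)

# The model's entire output: a score for each of its ~50k vocabulary tokens.
scores = logits[0, -1]
top = torch.topk(scores, k=5)
for score, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(token_id.item())!r}  {score.item():.2f}")
```

Notice what's missing: there is no retrieval step and no lookup. Whatever the model "knows" has to surface through those scores.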
That's why:
- Hallucination = lossy compression — trillions of words squeezed into billions of parameters, gaps filled with plausible fabrications
- Streaming responses = causal attention — the model predicts one token at a time, left to right (the loop sketched after this list shows this in code)
- Uncanny understanding = contextual embeddings — words shift meaning based on surrounding context
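Here's that loop, continuing the sketch above: predict one token, append it to the input, predict again. Greedy argmax is a simplification (real systems usually sample), but printing each token as it's chosen is all a streaming UI does:

```python
# Token-by-token generation: the reason streaming falls out for free.
generated = input_ids
for _ in range(10):
    with torch.no_grad():
        logits = model(generated).logits
    next_id = logits[0, -1].argmax()  # greedy: always take the top-scoring token
    generated = torch.cat([generated, next_id.view(1, 1)], dim=1)
    print(tokenizer.decode(next_id.item()), end="", flush=True)
print()
```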
None of this is magic. It's engineering — built by solving one concrete failure at a time. It starts with a surprisingly dumb idea: counting words.
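As a preview of that idea, here's a toy bigram model in a few lines: count how often each word follows another, then predict the most frequent follower. The corpus is made up for illustration:

```python
# "Counting words" as a language model: a toy bigram predictor.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each word follows each other word.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(word):
    # The most frequent word seen after `word` in the corpus.
    return follows[word].most_common(1)[0][0]

print(predict_next("the"))  # 'cat' — it follows 'the' twice, vs. once each for the rest
```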
Term note: *token*, *parameters*, *compression*, *attention*, and *embeddings* are introduced informally here; they get deeper treatment in Module 2, Tokens & Embeddings, and Transformer Internals.