What an LLM Does
Nina ships her first LLM feature in a weekend. It answers questions, summarizes docs, writes code. Then a user asks: "What's the weather in Berlin?" — and the model confidently invents yesterday's forecast.
Nina didn't break anything. That is how LLMs work. You would have built exactly the same thing.
An LLM does one thing: take text in, predict the next token. Not browse the internet. Not query a database. Just pattern-match against its training data, a massive slice of the internet, one token at a time.
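To make that concrete, here's a minimal sketch of a single prediction step using the Hugging Face transformers library. GPT-2 and the prompt are illustrative choices; any causal LM behaves the same way:

```python
# One prediction step: text in, a score for every vocabulary token out.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The capital of Germany is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits  # shape: (batch, seq_len, vocab_size)

# The model's entire output: a score for each of its ~50k vocabulary tokens.
scores = logits[0, -1]
top = torch.topk(scores, k=5)
for score, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(token_id.item())!r}  {score.item():.2f}")
```

Notice what's missing: there is no retrieval step and no lookup. Whatever the model "knows" has to surface through those scores.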
That's why:
- Hallucination = lossy compression — trillions of words squeezed into billions of parameters, gaps filled with plausible fabrications
- Streaming responses = causal attention — the model predicts one token at a time, left to right (the loop sketched after this list shows this in code)
- Uncanny understanding = contextual embeddings — words shift meaning based on surrounding context
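Here's that loop, continuing the sketch above: predict one token, append it to the input, predict again. Greedy argmax is a simplification (real systems usually sample), but printing each token as it's chosen is all a streaming UI does:

```python
# Token-by-token generation: the reason streaming falls out for free.
generated = input_ids
for _ in range(10):
    with torch.no_grad():
        logits = model(generated).logits
    next_id = logits[0, -1].argmax()  # greedy: always take the top-scoring token
    generated = torch.cat([generated, next_id.view(1, 1)], dim=1)
    print(tokenizer.decode(next_id.item()), end="", flush=True)
print()
```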
None of this is magic. It's engineering — built by solving one concrete failure at a time. It starts with a surprisingly dumb idea: counting words.
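As a preview of that idea, here's a toy bigram model in a few lines: count how often each word follows another, then predict the most frequent follower. The corpus is made up for illustration:

```python
# "Counting words" as a language model: a toy bigram predictor.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each word follows each other word.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(word):
    # The most frequent word seen after `word` in the corpus.
    return follows[word].most_common(1)[0][0]

print(predict_next("the"))  # 'cat' — it follows 'the' twice, vs. once each for the rest
```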
Term note: *token*, *parameters*, *compression*, *attention*, and *embeddings* are introduced informally here; they get deeper treatment in Module 2, Tokens & Embeddings, and Transformer Internals.