What Is KV Cache and Why Does It Make LLM Inference Fast?

Every token an LLM generates reuses Keys and Values from everything that came before. The KV cache is what makes that reuse cheap. Here's how it works — and why inference slows down with longer context.

Johannes Hayer

Building ai-in-a-shell

April 24, 2026
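
To make the reuse concrete, here is a minimal sketch of single-head attention during token-by-token generation, with and without a cache. It is an illustration under stated assumptions, not the implementation of any particular model or library: the names `KVCache`, `step_with_cache`, and `step_without_cache` are hypothetical, the weights are random, and there is only one head and one layer. It uses only the Python standard library.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x.
    return [dot(row, x) for row in W]


def attend(q, keys, values):
    # Scaled dot-product attention for one query over all given positions.
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


class KVCache:
    # Hypothetical container: one Key and one Value vector per past token.
    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)


def step_with_cache(x, Wq, Wk, Wv, cache):
    # Project only the new token, then attend over everything cached so far.
    cache.append(matvec(Wk, x), matvec(Wv, x))
    return attend(matvec(Wq, x), cache.keys, cache.values)


def step_without_cache(xs, Wq, Wk, Wv):
    # Baseline: re-project every earlier token on every step.
    keys = [matvec(Wk, x) for x in xs]
    values = [matvec(Wv, x) for x in xs]
    return attend(matvec(Wq, xs[-1]), keys, values)


def random_matrix(rng, rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d = 4  # toy embedding size (assumption for the demo)
Wq, Wk, Wv = (random_matrix(rng, d, d) for _ in range(3))
tokens = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(6)]

cache = KVCache()
max_diff = 0.0
for t in range(len(tokens)):
    cached = step_with_cache(tokens[t], Wq, Wk, Wv, cache)
    full = step_without_cache(tokens[:t + 1], Wq, Wk, Wv)
    max_diff = max(max_diff, max(abs(a - b) for a, b in zip(cached, full)))

print(len(cache.keys), max_diff < 1e-9)
```

Both paths produce the same attention output at every step. The cached path does two projections per new token where the baseline redoes them for the whole prefix. The cache still grows by one Key and one Value per token, and `attend` scans all of them, so per-token work and memory both grow with context length.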
