The KV Cache — How LLM Inference Works
One token at a time — and the work that repeats
An LLM writes its reply one token at a time: predict a token, append it, run the model again. The catch is in the "run again". To pick token 1001, attention needs the key and value vectors of all 1000 tokens before it — in every layer, in every head. Recompute those from scratch on every step and generation becomes quadratically wasteful: each step redoes everything the last step already did, plus one more token.
The new token's query is compared against the keys of every earlier token, then blends their values. No keys and values, no attention.
Tokens only attend backwards, so token 17's K and V come out identical on every step. Recomputing them buys nothing.
Step n redoes n tokens' worth of K/V work. Summed over a reply, the waste grows with the square of the length.
Because of the causal mask, a past token's keys and values are frozen. Anything frozen and reused is begging to be computed once and stored.
The fix: cache the keys and values
That stored copy is the KV cache. The first time a token passes through the model, its K and V vectors — one pair per layer, per attention head — are saved. From then on, generating a new token means computing only that token's Q, K and V, attending its query against all the stored keys and values, and appending its own K/V pair to the cache. One column of new work per step, no matter how long the sequence has grown.
Every token contributes one K and one V vector to every layer's every KV head. That whole block of memory is the cache.
A token's Q is used once — to attend from that token at that step — and thrown away. Only K and V are ever read again.
The new token computes its own column and reads the rest. Compute per step is constant; memory traffic grows with context.
Prefill and decode — the two phases of every request
Every request you send to an LLM goes through the same two phases, and they behave completely differently. First the model has to ingest your prompt and build the cache for it; then it can start generating.
The whole prompt is processed in one parallel pass — all tokens at once, filling the cache in bulk. GPUs love this: huge matrix multiplies, fast per token. Its duration is the time to first token.
Generation appends one token per step. Each step does little math but must read the entire cache from GPU memory, so the bottleneck is bandwidth, not FLOPs. This sets your tokens-per-second.
A long prompt means a long prefill — a noticeable pause before the first word. But once decoding starts, the streaming speed barely depends on how big the prompt was.
Prefill cost scales with prompt length, so a 100-page prompt makes you wait. Decode cost per step is roughly "read the cache once" — longer context makes each step a bit heavier on memory, but nothing like recomputing it. The KV cache is exactly what decouples the two.
Watch the cache at work
The grid below is the KV cache itself — one column per token, one row per layer. Watch naive generation relight the whole grid every step, the cache reduce each step to a single column, prefill fill the grid in one parallel pass, decode append columns one at a time, and finally the memory bill grow until it dwarfs the model's own weights.
The bill: a cache that grows with everything
The cache trades compute for memory, and the memory side is brutal. Its size is roughly 2 (K and V) × layers × KV heads × head dim × tokens × bytes per value — linear in context length, multiplied by everything about the model's shape. This is the real reason long context is expensive: a long conversation isn't just slower to prefill, it permanently occupies gigabytes of GPU memory that could be serving other users.
A 70B-class model with full multi-head attention: 80 layers × 64 heads × 128 dims × 2 (K and V) × 2 bytes ≈ 2.6 MB per token. At a 128k-token context that's roughly 335 GB of cache — more than twice the fp16 weights themselves. Untenable, which is why modern architectures attack the cache directly.
Every token in context adds a column to the cache, in every layer. Double the context, double the memory — forever, until the request ends.
Grouped-query attention lets a group of query heads share one K/V head. 8 KV heads instead of 64 cuts the cache 8× with little quality loss — now standard in Llama, Mistral and most open models.
Multi-head latent attention (DeepSeek) stores a small compressed latent per token and reconstructs K/V on the fly — shrinking the cache far below even GQA. More in open-source LLM architectures.
Serving tricks built around the cache
Once you see inference as "managing a giant per-request cache", most of modern LLM serving clicks into place. These four ideas are what let one GPU cluster serve thousands of simultaneous chats.
Store the cache in fixed-size pages instead of one contiguous block — exactly like virtual memory in an OS. No giant up-front allocation, no fragmentation, near-100% memory utilisation.
Decode steps from many requests run together. The moment one reply finishes, a waiting request slots into the freed decode slot mid-batch — instead of the whole batch draining first.
Thousands of requests share the same system prompt, so keep its K/V around and skip that part of prefill entirely. This is why providers bill cached input tokens far cheaper than fresh ones.
Store K and V in 8 or even 4 bits instead of 16 — the same trick as weight quantization, applied to the cache. Halves or quarters the memory for a small accuracy cost.
Time-to-first-token is prefill. Tokens-per-second is decode. Cheap "cached input" tokens are prefix caching. Long-context surcharges are the KV cache eating GPU memory. One data structure explains the whole bill.