The KV Cache & LLM Inference · Suman Bhadra Notes

One token at a time — and the work that repeats

An LLM writes its reply one token at a time: predict a token, append it, run the model again. The catch is in the "run again". To pick token 1001, attention needs the key and value vectors of all 1000 tokens before it — in every layer, in every head. Recompute those from scratch on every step and generation becomes quadratically wasteful: each step redoes everything the last step already did, plus one more token.

Attention looks back Q vs all past K, V

The new token's query is compared against the keys of every earlier token, then blends their values. No keys and values, no attention.

The past never changes causal mask

Tokens only attend backwards, so token 17's K and V come out identical on every step. Recomputing them buys nothing.

Naive cost O(n²) per reply

Step n redoes n tokens' worth of K/V work. Summed over a reply, the waste grows with the square of the length.

The key insight

Because of the causal mask, a past token's keys and values are frozen. Anything frozen and reused is begging to be computed once and stored.

The fix: cache the keys and values

That stored copy is the KV cache. The first time a token passes through the model, its K and V vectors — one pair per layer, per attention head — are saved. From then on, generating a new token means computing only that token's Q, K and V, attending its query against all the stored keys and values, and appending its own K/V pair to the cache. One column of new work per step, no matter how long the sequence has grown.

What's stored K and V, per layer × head

Every token contributes one K and one V vector to every layer's every KV head. That whole block of memory is the cache.

What's not queries

A token's Q is used once — to attend from that token at that step — and thrown away. Only K and V are ever read again.

Each new step compute 1, read n

The new token computes its own column and reads the rest. Compute per step is constant; memory traffic grows with context.

Prefill and decode — the two phases of every request

Every request you send to an LLM goes through the same two phases, and they behave completely differently. First the model has to ingest your prompt and build the cache for it; then it can start generating.

Prefill compute-bound

The whole prompt is processed in one parallel pass — all tokens at once, filling the cache in bulk. GPUs love this: huge matrix multiplies, fast per token. Its duration is the time to first token.

Decode memory-bandwidth-bound

Generation appends one token per step. Each step does little math but must read the entire cache from GPU memory, so the bottleneck is bandwidth, not FLOPs. This sets your tokens-per-second.

What you feel pause, then stream

A long prompt means a long prefill — a noticeable pause before the first word. But once decoding starts, the streaming speed barely depends on how big the prompt was.

Why long prompts delay the first token but barely slow the rest

Prefill cost scales with prompt length, so a 100-page prompt makes you wait. Decode cost per step is roughly "read the cache once" — longer context makes each step a bit heavier on memory, but nothing like recomputing it. The KV cache is exactly what decouples the two.

Watch the cache at work

The grid below is the KV cache itself — one column per token, one row per layer. Watch naive generation relight the whole grid every step, the cache reduce each step to a single column, prefill fill the grid in one parallel pass, decode append columns one at a time, and finally the memory bill grow until it dwarfs the model's own weights.

The bill: a cache that grows with everything

The cache trades compute for memory, and the memory side is brutal. Its size is roughly 2 (K and V) × layers × KV heads × head dim × tokens × bytes per value — linear in context length, multiplied by everything about the model's shape. This is the real reason long context is expensive: a long conversation isn't just slower to prefill, it permanently occupies gigabytes of GPU memory that could be serving other users.

A worked example

A 70B-class model with full multi-head attention: 80 layers × 64 heads × 128 dims × 2 (K and V) × 2 bytes ≈ 2.6 MB per token. At a 128k-token context that's roughly 335 GB of cache — more than twice the fp16 weights themselves. Untenable, which is why modern architectures attack the cache directly.

Grows linearly × tokens

Every token in context adds a column to the cache, in every layer. Double the context, double the memory — forever, until the request ends.

GQA share KV heads

Grouped-query attention lets a group of query heads share one K/V head. 8 KV heads instead of 64 cuts the cache 8× with little quality loss — now standard in Llama, Mistral and most open models.

MLA compress K/V

Multi-head latent attention (DeepSeek) stores a small compressed latent per token and reconstructs K/V on the fly — shrinking the cache far below even GQA. More in open-source LLM architectures.

Serving tricks built around the cache

Once you see inference as "managing a giant per-request cache", most of modern LLM serving clicks into place. These four ideas are what let one GPU cluster serve thousands of simultaneous chats.

Paged attention vLLM

Store the cache in fixed-size pages instead of one contiguous block — exactly like virtual memory in an OS. No giant up-front allocation, no fragmentation, near-100% memory utilisation.

Continuous batching no idle slots

Decode steps from many requests run together. The moment one reply finishes, a waiting request slots into the freed decode slot mid-batch — instead of the whole batch draining first.

Prefix / prompt caching reuse the prefill

Thousands of requests share the same system prompt, so keep its K/V around and skip that part of prefill entirely. This is why providers bill cached input tokens far cheaper than fresh ones.

Quantized KV cache fp8 / int4 values

Store K and V in 8 or even 4 bits instead of 16 — the same trick as weight quantization, applied to the cache. Halves or quarters the memory for a small accuracy cost.

Reading a provider's pricing page

Time-to-first-token is prefill. Tokens-per-second is decode. Cheap "cached input" tokens are prefix caching. Long-context surcharges are the KV cache eating GPU memory. One data structure explains the whole bill.