Sampling & Temperature

Gen AI · temperature · top-k · top-p

A model predicts a distribution, not a word

After reading "The cat sat on the …", an LLM doesn't output one word. It produces a raw score (a logit) for every token in its vocabulary, and softmax turns those scores into a probability for each candidate. Decoding is the rule we use to turn that probability bar chart into an actual choice.
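The logits-to-probabilities step can be sketched in a few lines. This is a minimal illustration, not any particular model's code: the four candidate tokens and their logits are made-up values, and a real vocabulary has tens of thousands of entries.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for four candidate next tokens.
tokens = ["mat", "floor", "sofa", "box"]
logits = [4.0, 2.5, 2.0, 0.5]

probs = softmax(logits)
for tok, p in zip(tokens, probs):
    print(f"{tok:>5}: {p:.3f}")
```

The printed values are the "bar chart" that every decoding rule below starts from.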

Greedy (argmax)

Always take the single highest-probability token. Deterministic, but repetitive and bland.

Sampling (roll the dice)

Draw a token at random, weighted by its probability. More varied — and the knobs below control how varied.
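The two decoding rules differ by a single line. The sketch below assumes a small, already-normalized distribution with made-up probabilities; `greedy` and `sample` are illustrative helper names, not a library API.

```python
import random

# Hypothetical next-token distribution (already softmaxed).
tokens = ["mat", "floor", "sofa", "box"]
probs = [0.70, 0.16, 0.10, 0.04]

def greedy(tokens, probs):
    """Always return the highest-probability token."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return tokens[best]

def sample(tokens, probs, rng):
    """Draw one token at random, weighted by its probability."""
    return rng.choices(tokens, weights=probs, k=1)[0]

rng = random.Random(0)  # seeded only so a run is repeatable
draws = [sample(tokens, probs, rng) for _ in range(10)]

print("greedy :", greedy(tokens, probs))
print("sampled:", draws)
```

Calling `greedy` ten times gives the same token ten times; the sampled list varies with the seed.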

Watch the distribution get reshaped

Same logits every time — only the decoding knob changes. Temperature scales the spread; top-k and top-p trim the long tail of unlikely tokens before we sample.

What each knob does

Temperature T (divide logits by T)

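T < 1 sharpens the distribution: the model becomes confident and nearly greedy. T > 1 flattens it, so rarer tokens get a real chance. As T → 0, sampling converges to greedy decoding; T = 0 itself would divide by zero, so implementations typically special-case it as argmax.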
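To see the sharpening and flattening numerically, divide the logits by T before the softmax. This is a sketch with made-up logits; `softmax_with_temperature` is a hypothetical helper name, and rejecting T ≤ 0 is a choice made here for simplicity.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits / T. Requires T > 0."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for greedy")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.5, 2.0, 0.5]  # hypothetical values

for T in (0.1, 1.0, 2.5):
    probs = softmax_with_temperature(logits, T)
    print(f"T = {T}: {[round(p, 3) for p in probs]}")
```

The top token's share shrinks as T grows, while the ranking of the tokens never changes: temperature reshapes the distribution but does not reorder it.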

Top-k (keep k tokens)

Throw away everything except the k most likely tokens, then renormalize and sample from those.
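A minimal sketch of that filter, operating on probabilities rather than logits for readability. `top_k_filter` is a hypothetical helper name, and the input distribution is made up.

```python
def top_k_filter(probs, k):
    """Zero out all but the k most likely entries, then renormalize."""
    if k < 1:
        raise ValueError("k must be at least 1")
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(ranked[:k])
    kept_mass = sum(probs[i] for i in keep)
    return [p / kept_mass if i in keep else 0.0 for i, p in enumerate(probs)]

probs = [0.70, 0.16, 0.10, 0.04]  # hypothetical distribution
print(top_k_filter(probs, 2))     # only the two largest entries stay nonzero
print(top_k_filter(probs, 1))     # k = 1 leaves a single candidate: greedy
```

Note that k is fixed: the filter keeps the same number of candidates whether the distribution is peaked or flat.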

Top-p (nucleus, keep mass p)

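Keep the smallest set of top tokens whose probabilities sum to at least p (e.g. 0.9), then renormalize and sample from those. The set adapts to how peaked the distribution is: a confident distribution keeps only a few tokens, a flat one keeps many.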
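The adaptive behavior is easiest to see by running the same cutoff on two differently shaped distributions. This is a sketch: `top_p_filter` is a hypothetical helper name and both distributions are made up.

```python
def top_p_filter(probs, p):
    """Keep the smallest top set with cumulative mass >= p, then renormalize."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = set(), 0.0
    for i in ranked:
        keep.add(i)
        mass += probs[i]
        if mass >= p:  # stop as soon as the nucleus covers p
            break
    return [q / mass if i in keep else 0.0 for i, q in enumerate(probs)]

spread = [0.70, 0.16, 0.10, 0.04]  # hypothetical: moderately spread out
peaked = [0.95, 0.03, 0.01, 0.01]  # hypothetical: one dominant token

print(top_p_filter(spread, 0.9))  # three tokens survive
print(top_p_filter(peaked, 0.9))  # a single token survives
```

The same p = 0.9 keeps three candidates in one case and one in the other, which a fixed top-k cannot do.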

Rule of thumb

Need a factual, repeatable answer? Use a low temperature (or greedy decoding). Want creative, diverse writing? Raise the temperature and use top-p ≈ 0.9. Top-k and top-p are usually combined with temperature rather than used instead of it.
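One way the knobs might be chained is sketched below. The order (temperature, then top-k, then top-p) and the choice to treat T ≤ 0 as greedy are assumptions of this sketch; libraries differ on both. `sample_next` is a hypothetical function name.

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Return the index of one sampled token."""
    if temperature <= 0:
        # Assumption of this sketch: T <= 0 means greedy decoding.
        return max(range(len(logits)), key=lambda i: logits[i])
    if top_k is not None and top_k < 1:
        raise ValueError("top_k must be at least 1")

    # 1. Temperature-scaled softmax.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    weights = [math.exp(x - m) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # 2. Top-k: keep the k most likely indices.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]

    # 3. Top-p, measured on the renormalized survivors of step 2.
    if top_p is not None:
        kept_total = sum(probs[i] for i in order)
        kept, mass = [], 0.0
        for i in order:
            kept.append(i)
            mass += probs[i] / kept_total
            if mass >= top_p:
                break
        order = kept

    # 4. Weighted draw; choices() normalizes the weights itself.
    return rng.choices(order, weights=[probs[i] for i in order], k=1)[0]

logits = [4.0, 2.5, 2.0, 0.5]  # hypothetical values
print(sample_next(logits, temperature=0.8, top_p=0.9))  # "creative" settings
print(sample_next(logits, temperature=0))               # "factual": greedy
```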

Now hold the dice yourself. Pick a knob, shape the distribution, then hit Sample ×10 to actually draw ten next-tokens from it. Green bars are the surviving candidates with their renormalized probabilities; faded bars can never be chosen.

Sample at T = 0.1 — ten draws, almost always ten "mat"s (that's why low temperature repeats itself). Now T = 2.5 — even "box" shows up. Then try top-k = 1: that's greedy decoding, no matter how often you roll.
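The odds behind that experiment can be computed directly instead of rolled. The demo's actual logits are not given here, so the sketch below uses made-up ones; only "mat" and "box" are taken from the text, and `ten_draw_odds` is a hypothetical helper name.

```python
import math

def softmax_t(logits, temperature):
    """Softmax over logits / T, for T > 0."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["mat", "floor", "sofa", "box"]
logits = [4.0, 2.5, 2.0, 0.5]  # hypothetical, not the demo's real values

def ten_draw_odds(temperature, draws=10):
    """Return (P(every draw is 'mat'), P('box' appears at least once))."""
    probs = dict(zip(tokens, softmax_t(logits, temperature)))
    all_mat = probs["mat"] ** draws
    some_box = 1 - (1 - probs["box"]) ** draws
    return all_mat, some_box

for T in (0.1, 2.5):
    all_mat, some_box = ten_draw_odds(T)
    print(f"T = {T}: P(ten mats) = {all_mat:.4f}, P(box at least once) = {some_box:.4f}")
```

With these made-up logits, ten "mat"s in a row is close to certain at T = 0.1, while at T = 2.5 "box" turns up at least once in more than half of all ten-draw runs.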

Why trim the tail at all?

Pure sampling, high T
  • Every token has some chance — including nonsense
  • One unlucky draw can derail the whole sentence
  • Great for chaos, bad for coherence
Top-k / top-p sampling
  • The absurd long tail is cut before drawing
  • Still random among plausible tokens
  • The sweet spot for most generation
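The comparison above gets starker with a realistic vocabulary size, because a tail of individually tiny probabilities adds up. The sketch below assumes four plausible tokens plus 50,000 implausible ones that all share one low logit; every number is made up for illustration, and `tail_mass` is a hypothetical helper name.

```python
import math

def tail_mass(head_logits, tail_logit, tail_size, temperature):
    """Probability that one draw lands on any of the tail tokens."""
    # Shift by the largest scaled logit for numerical stability.
    m = max(max(head_logits), tail_logit) / temperature
    head = sum(math.exp(x / temperature - m) for x in head_logits)
    tail = tail_size * math.exp(tail_logit / temperature - m)
    return tail / (head + tail)

head_logits = [4.0, 2.5, 2.0, 0.5]  # hypothetical plausible tokens
tail_logit = -8.0                   # hypothetical score shared by the tail
tail_size = 50_000                  # hypothetical number of tail tokens

for T in (1.0, 2.0):
    mass = tail_mass(head_logits, tail_logit, tail_size, T)
    print(f"T = {T}: tail mass = {mass:.2f}")
```

Under these assumptions the tail holds under a fifth of the probability at T = 1 but the large majority at T = 2, so an untrimmed high-temperature draw usually lands on an implausible token. Top-k or top-p sets that mass to zero before the draw.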