Language Models & Perplexity · Suman Bhadra Notes

A machine that bets on the next word

A language model is a probability machine: hand it any sequence of words and it tells you how plausible that sequence is — P("the cat sat"). In practice that one big probability is factored into a chain of next-word predictions: P(the) · P(cat | the) · P(sat | the cat).
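The factoring above can be sketched in a few lines. The three probabilities are made-up numbers chosen for illustration, not the output of any real model, and the function names are hypothetical.

```python
import math

# Hypothetical next-word probabilities for "the cat sat" (illustrative only).
conditionals = [
    ("the", 0.20),  # P(the)
    ("cat", 0.05),  # P(cat | the)
    ("sat", 0.10),  # P(sat | the cat)
]

def sequence_probability(steps):
    """Chain rule: multiply the next-word probabilities together."""
    prob = 1.0
    for _word, p in steps:
        prob *= p
    return prob

def sequence_log_probability(steps):
    """Same quantity in log space, which avoids underflow on long texts."""
    return sum(math.log(p) for _word, p in steps)

p_sentence = sequence_probability(conditionals)
log_p_sentence = sequence_log_probability(conditionals)
print(p_sentence, math.exp(log_p_sentence))
```

Summing logs rather than multiplying raw probabilities is the usual choice in practice, because a product of many small numbers quickly underflows to zero.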

N-gram models did this job with counts: "after 'the cat', how often did 'sat' follow in my corpus?" Modern LLMs do the same job with transformers, usually over subword tokens rather than whole words. The machinery differs, but the yardstick for measuring how well they do it is the same.
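The counting idea can be sketched as follows, assuming a toy corpus invented for this example and a hypothetical helper `trigram_probability`. This is the plain unsmoothed estimate; practical n-gram models add smoothing so that unseen continuations do not get probability zero.

```python
from collections import Counter

# Toy corpus, invented for illustration.
corpus = "the cat sat on the mat the cat ran the cat sat".split()

# Every run of three consecutive words.
trigrams = list(zip(corpus, corpus[1:], corpus[2:]))
trigram_counts = Counter(trigrams)
# How often each two-word context was followed by some word.
context_counts = Counter((w1, w2) for w1, w2, _w3 in trigrams)

def trigram_probability(w1, w2, w3):
    """Estimate P(w3 | w1 w2) as count(w1 w2 w3) / count(w1 w2 followed by anything)."""
    context = context_counts[(w1, w2)]
    if context == 0:
        return 0.0  # context never seen; a real model would back off or smooth
    return trigram_counts[(w1, w2, w3)] / context

p_sat = trigram_probability("the", "cat", "sat")
p_ran = trigram_probability("the", "cat", "ran")
print(p_sat, p_ran)
```

In this corpus "the cat" is followed by "sat" twice and "ran" once, so the two estimates are 2/3 and 1/3.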

One job: score sequences

Assign a probability to any text — high for fluent English, low for word salad.

Chain rule: one word at a time

The sequence probability is a product of next-word probabilities, each conditioned on what came before.

Same yardstick: 1990s → today

The metric that graded trigram models grades GPT-class models too.

Score it by surprise

So how good is a language model? Show it real text it has never seen and measure how surprised it is. At each position, look up the probability the model gave to the word that actually came next. High probability → low surprise → good model. The standard summary is cross-entropy: the average negative log-probability per token, taken over the tokens that actually occurred. Perplexity is that cross-entropy pushed through the matching exponential: with natural logs, perplexity = exp(cross-entropy). Lower is better for both.
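As a minimal sketch of the two definitions, assume we already have the list of probabilities a model gave to each true next token. The two lists below are invented to contrast a confident model with a confused one, and the function names are hypothetical.

```python
import math

def cross_entropy(true_token_probs):
    """Average negative natural-log probability per token, in nats."""
    return -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)

def perplexity(true_token_probs):
    """Exponential of the cross-entropy; natural log pairs with exp."""
    return math.exp(cross_entropy(true_token_probs))

# Hypothetical probabilities assigned to the true next tokens of one sentence.
confident_model = [0.5, 0.4, 0.6, 0.5]
confused_model = [0.05, 0.02, 0.1, 0.04]

print(perplexity(confident_model))  # lower
print(perplexity(confused_model))   # higher
```

Only the probability of the token that actually occurred enters the score; how the model spread the remaining mass over other tokens does not matter.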

Two names, one number

If you measure cross-entropy in nats (natural log), perplexity is e^H; in bits (log base 2), it's 2^H. The base doesn't matter as long as the log and the exponential match — the perplexity comes out identical.
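One way to see why the base drops out, writing N for the number of tokens, p_i for the probability the model gave the i-th true token, and b for the log base (symbols introduced here for illustration):

```latex
H_b = -\frac{1}{N}\sum_{i=1}^{N}\log_b p_i
\quad\Longrightarrow\quad
b^{H_b} = \prod_{i=1}^{N} b^{-\frac{1}{N}\log_b p_i}
        = \Bigl(\prod_{i=1}^{N} p_i\Bigr)^{-1/N}
```

The right-hand side is the inverse geometric mean of the true-token probabilities and contains no b, so any matched pair of log and exponential gives the same perplexity.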

The K-sided die: what perplexity really says

Here's the intuition that makes perplexity click. A perplexity of K means the model is, on average, as confused as if it were choosing uniformly among K equally likely words at every step. Perplexity 20? Every word is a 20-way dice roll. Perplexity 2? It's basically flipping a coin between two candidates. That's why perplexity is often called the model's effective branching factor.

A tiny worked example with clean numbers. Take a 4-word text where the model gives each true next word probability 1/8:

The arithmetic

Surprise per word: −log₂(1/8) = 3 bits → average over 4 words: (3 + 3 + 3 + 3) / 4 = 3 bits → perplexity: 2³ = 8.
The model is exactly as confused as if it rolled an 8-sided die for every single word.
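The same arithmetic, sketched in code with the log taken in base 2 so the intermediate values are in bits. The variable names are illustrative.

```python
import math

# The model gives each of the 4 true next words probability 1/8.
true_word_probs = [1 / 8, 1 / 8, 1 / 8, 1 / 8]

# Surprise per word in bits: -log2(1/8) = 3.
surprisals_bits = [-math.log2(p) for p in true_word_probs]

# Cross-entropy: the average surprise per word.
avg_bits = sum(surprisals_bits) / len(surprisals_bits)

# Perplexity: 2 raised to the cross-entropy, since the log was base 2.
ppl = 2 ** avg_bits

print(surprisals_bits, avg_bits, ppl)
```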

And the two extremes anchor the scale: a perfect model that always gives the true next word probability 1 has perplexity 1 (no surprise at all), while a model that guesses uniformly over its whole vocabulary has perplexity equal to the vocabulary size — pure dice-rolling.
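Both anchors can be checked numerically. The vocabulary size of 50,000 and the text length of 10 tokens are arbitrary assumptions for this sketch.

```python
import math

def perplexity(true_token_probs):
    """exp of the average negative natural-log probability per token."""
    nll = -sum(math.log(p) for p in true_token_probs)
    return math.exp(nll / len(true_token_probs))

vocab_size = 50_000  # hypothetical vocabulary size
num_tokens = 10      # hypothetical text length

# Perfect model: probability 1 on every true next token.
perfect_ppl = perplexity([1.0] * num_tokens)

# Uniform guesser: probability 1/V on every token, so also on the true one.
uniform_ppl = perplexity([1 / vocab_size] * num_tokens)

print(perfect_ppl, uniform_ppl)
```

The uniform guesser is a reference point rather than a ceiling: a model that puts less than 1/V on the true tokens scores a perplexity above the vocabulary size.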

Watch the yardstick at work

[Interactive walkthrough: a next-word bet, the chain rule assembling a sentence probability, a confident model vs a confused one on the same sentence, and the worked example landing on the 8-sided die.]

Read the number with care

Tokenizer matters: not comparable

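Perplexity is per token, and tokens depend on the vocabulary. A tokenizer that splits the same text into more pieces spreads the same total surprise over more tokens, so per-token perplexities from models with different tokenizers aren't directly comparable.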
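A small sketch of the effect, under a deliberately artificial assumption: two models assign the same total negative log-likelihood to the same text but split it into different numbers of tokens. All numbers are invented. Dividing by a tokenizer-independent unit such as bytes is one common workaround, shown here only as an illustration.

```python
import math

# Assumption: both models give this text the same total surprise (in nats).
total_nll_nats = 60.0

tokens_model_a = 20  # hypothetical coarse tokenizer: 20 tokens for the text
tokens_model_b = 30  # hypothetical finer tokenizer: 30 tokens for the same text
text_bytes = 100     # hypothetical length of the text in bytes

# Per-token perplexity differs even though the total surprise is identical.
ppl_a = math.exp(total_nll_nats / tokens_model_a)  # e^3, about 20.09
ppl_b = math.exp(total_nll_nats / tokens_model_b)  # e^2, about 7.39

# Normalizing by bytes gives one number that does not depend on the tokenizer.
bits_per_byte = total_nll_nats / math.log(2) / text_bytes

print(ppl_a, ppl_b, bits_per_byte)
```

Model B looks far better per token, yet by construction both models found the text equally surprising overall.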

Corpus matters: always "on what?"

Perplexity is measured on a specific dataset. Perplexity 8 on Wikipedia and perplexity 8 on tweets are different achievements.

Low ppl ≠ good chat: needs evals

A great next-word predictor can still be unhelpful or unsafe. Instruction-tuned models are judged with human and automated evals.

Still the engine of pretraining

Don't write perplexity off, though — the loss LLMs minimize during pretraining is exactly this cross-entropy. Every gradient step in a trillion-token run is the model learning to be a little less surprised by the next token.
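As a rough sketch of what that loss looks like at a single position, assume the model emits one raw score (logit) per vocabulary entry and a softmax turns the scores into probabilities. The function name, the 4-token vocabulary, and the logit values are hypothetical; real training code uses a framework's built-in loss and averages over many positions.

```python
import math

def next_token_loss(logits, target_index):
    """Negative log-probability of the true next token under softmax(logits)."""
    # Subtract the max before exponentiating for numerical stability.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    # -log softmax(logits)[target] = logsumexp(logits) - logits[target]
    return log_sum_exp - logits[target_index]

# Hypothetical scores over a 4-token vocabulary.
logits = [2.0, 0.5, -1.0, 0.0]

loss_if_likely_token = next_token_loss(logits, 0)    # true token had the top score
loss_if_unlikely_token = next_token_loss(logits, 2)  # true token had the lowest score
loss_uniform = next_token_loss([0.0] * 4, 2)         # all scores equal: log(4)

print(loss_if_likely_token, loss_if_unlikely_token, loss_uniform)
```

Averaging this quantity over every position in a batch gives the cross-entropy described above, and its exponential is the perplexity on that batch.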