Chinchilla (2022) — Compute-Optimal Scaling

The world before this paper

In 2021, the recipe for a smarter model was simple: make it bigger. The 2020 scaling laws (Kaplan et al.) had been read as a verdict that parameters matter most and data is a detail, so labs built monuments. GPT-3 hit 175B parameters, Gopher 280B, MT-NLG 530B. Almost nobody stopped to ask whether all those weights were actually being fed enough.

The creed: params first

The 2020 scaling-law reading was "parameters matter most" — so every lab raced to build a bigger model than the last.

The giants: ~300B tokens

GPT-3 (175B), Gopher (280B) and MT-NLG (530B) span a 3× range in size, yet all three were trained on roughly 300B tokens.

The blind spot: data

Data was an afterthought. Nobody had carefully traded model size against training tokens at a fixed compute budget.

The key idea

In early 2022, Jordan Hoffmann and a team at DeepMind decided to test the creed instead of extending it. Their question was almost accounting-flavored: a training run buys you a fixed number of FLOPs, so what is the best way to spend them? They published the answer as "Training Compute-Optimal Large Language Models" (Hoffmann et al., NeurIPS 2022), and the field knows it by the name of its star witness: Chinchilla.

The bet was empirical, not theoretical. Train a swarm of models at different sizes on different amounts of data, hold compute fixed within each comparison, and let the loss curves speak. They did, across more than 400 runs, and the curves contradicted the parameters-first assumption the giants had been built on. Want the full mechanics of the models being scaled? See What is an LLM.
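To make the sweep concrete, here is a minimal Python sketch of one fixed-compute comparison. It assumes the common approximation that training costs about 6 FLOPs per parameter per token, and it stands in for real training runs with the parametric loss form L(N, D) = E + A/N^alpha + B/D^beta, using the constants reported in the paper's fit. The function names are hypothetical, and the numbers are illustrative rather than a reproduction of the 400+ runs.

```python
# Parametric loss L(N, D) = E + A / N**ALPHA + B / D**BETA.
# Constants: fitted values reported in Hoffmann et al. (2022); illustrative only.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

FLOPS_PER_PARAM_TOKEN = 6  # assumption: C ~ 6 * N * D


def loss(n_params, n_tokens):
    """Predicted final loss for a model of n_params trained on n_tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def isoflop_sweep(compute, sizes):
    """For one fixed FLOP budget, return (N, D, loss) for each candidate size."""
    rows = []
    for n in sizes:
        d = compute / (FLOPS_PER_PARAM_TOKEN * n)  # tokens this size can afford
        rows.append((n, d, loss(n, d)))
    return rows


def best_size(compute, sizes):
    """Pick the candidate with the lowest predicted loss along the curve."""
    return min(isoflop_sweep(compute, sizes), key=lambda row: row[2])


# Candidate sizes from 1B up to about 1T parameters, doubling each step.
sizes = [1e9 * 2**k for k in range(11)]
best_n, best_d, best_loss = best_size(5.76e23, sizes)  # a Gopher-scale budget
print(f"best size: {best_n:.3g} params, {best_d:.3g} tokens")
```

With this particular fit, the minimum along a Gopher-scale budget sits at an intermediate size well below 280B. It need not land exactly on the 20-tokens-per-parameter rule of thumb, since the paper's different estimation approaches give slightly different optima.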

The paper in one sentence

Hold compute fixed and sweep the trade-off — for every FLOP budget, the lowest loss comes from scaling parameters and tokens in equal proportion, roughly 20 tokens per parameter, which meant the era's giants were massively undertrained.
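The rule of thumb turns into a two-line calculation. This sketch assumes training compute C is about 6 × N × D (a standard approximation, not stated in the text above) and takes the ratio D = 20 × N at face value; substituting gives N = sqrt(C / 120). The function name is hypothetical.

```python
import math

TOKENS_PER_PARAM = 20       # rule of thumb from the paper
FLOPS_PER_PARAM_TOKEN = 6   # assumption: C ~ 6 * N * D


def compute_optimal(compute_flops):
    """Split a FLOP budget into (parameters, tokens) at 20 tokens per parameter.

    From C = 6 * N * D and D = 20 * N we get C = 120 * N**2.
    """
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


n, d = compute_optimal(5.88e23)      # roughly Chinchilla's budget
n4, d4 = compute_optimal(4 * 5.88e23)
print(f"{n:.3g} params, {d:.3g} tokens")
print(f"4x compute -> {n4 / n:.2f}x params, {d4 / d:.2f}x tokens")
```

Quadrupling compute doubles both the parameter count and the token count, which is what "equal proportion" means here: each grows with the square root of the budget.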

One curve, two bets

The whole paper fits in one picture: a plane of model size versus training data, a curve of equal compute, and two points on it. Slide Gopher's budget along the curve to Chinchilla's spot and the benchmarks flip.

The results that mattered

Chinchilla, with 70B parameters trained on 1.4T tokens, used roughly the same training compute as Gopher and beat it across the board.

Same compute: 4× smaller

70B parameters against Gopher's 280B at the same training budget, with better scores (MMLU 67.5% vs 60.0%) and roughly 4× lower cost to serve.
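A quick back-of-the-envelope check of the "same compute" claim, as a sketch. It assumes about 6 FLOPs per parameter per training token and about 2 FLOPs per parameter per generated token at inference; both are common approximations rather than figures from the text, and the helper name is hypothetical.

```python
def training_flops(n_params, n_tokens):
    """Approximate training cost, assuming ~6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens


gopher = training_flops(280e9, 300e9)       # about 5.0e23 FLOPs
chinchilla = training_flops(70e9, 1.4e12)   # about 5.9e23 FLOPs
budget_ratio = chinchilla / gopher

# Inference cost per token scales with parameter count (assumed ~2 FLOPs each),
# so the serving-cost ratio is just the size ratio.
serving_ratio = (2 * 280e9) / (2 * 70e9)

print(f"training budget ratio: {budget_ratio:.2f}")
print(f"serving cost ratio:    {serving_ratio:.1f}")
```

The two training estimates land within about 17% of each other, which counts as the same budget at the precision of this approximation, while the per-token serving cost differs by the full factor of four.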

The ratio: ~20:1

Compute-optimal training wants about 20 tokens per parameter. Gopher had barely one: roughly 300B tokens for 280B parameters.
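The gap is easy to tabulate. This sketch uses the parameter counts quoted in this article and takes the round figure of about 300B training tokens for each of the three giants, so the ratios are approximate.

```python
# (parameters, training tokens); the ~300B token counts are the article's
# round figure, not exact per-model numbers.
models = {
    "GPT-3": (175e9, 300e9),
    "Gopher": (280e9, 300e9),
    "MT-NLG": (530e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
}

tokens_per_param = {
    name: tokens / params for name, (params, tokens) in models.items()
}

for name, ratio in tokens_per_param.items():
    print(f"{name:<11} {ratio:5.1f} tokens per parameter")
```

On these figures every pre-Chinchilla giant sits below two tokens per parameter, and the largest of them below one.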

The evidence: 400+ runs

An empirical sweep across model sizes and data scales — enough to overturn the bigger-is-better creed.

Legacy — and the catch

The paper reset the field's priorities almost immediately. LLaMA followed the new playbook, training 7B to 65B models on 1T to 1.4T tokens. The catch: modern models now deliberately train far past 20 tokens per parameter, because once a model ships, inference cost dominates training cost, and a smaller model trained for longer is cheaper to serve.
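A rough way to see when inference starts to dominate, sketched under two assumptions that are not from the text: training costs about 6 FLOPs per parameter per token and serving costs about 2 FLOPs per parameter per token. The function names are hypothetical. The LLaMA pairing (7B on 1T tokens, 65B on 1.4T tokens) follows the LLaMA paper.

```python
TRAIN_FLOPS_PER_PARAM_TOKEN = 6  # assumption: forward plus backward pass
INFER_FLOPS_PER_PARAM_TOKEN = 2  # assumption: forward pass only


def lifetime_flops(n_params, train_tokens, served_tokens):
    """Total compute over a model's life: training plus all tokens served."""
    train = TRAIN_FLOPS_PER_PARAM_TOKEN * n_params * train_tokens
    serve = INFER_FLOPS_PER_PARAM_TOKEN * n_params * served_tokens
    return train + serve


def break_even_served_tokens(train_tokens):
    """Served tokens at which inference compute equals training compute."""
    return TRAIN_FLOPS_PER_PARAM_TOKEN / INFER_FLOPS_PER_PARAM_TOKEN * train_tokens


chinchilla_break_even = break_even_served_tokens(1.4e12)

# How far past 20 tokens per parameter the LLaMA models went.
llama_tokens_per_param = {"7B": 1.0e12 / 7e9, "65B": 1.4e12 / 65e9}

print(f"break-even: {chinchilla_break_even:.3g} served tokens")
print({k: round(v, 1) for k, v in llama_tokens_per_param.items()})
```

Under these assumptions the break-even point is three times the training token count regardless of model size, so a widely deployed model crosses it quickly, and every parameter saved keeps paying off afterwards.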

What it unlocked

Made data scale a first-class citizen alongside parameters
Smaller-but-better models cut inference costs industry-wide
A model (heh) example of empirical science correcting groupthink

The limits

Optimal for training compute only — ignores inference economics
Assumes unlimited fresh data; the field now stares at a data wall
The "law" is a fit, not physics — constants shift with data quality
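To see what "a fit, not physics" means in practice, here is a minimal sketch of how a scaling exponent is typically estimated: a least-squares line in log-log space. The data points are synthetic, generated from the 20-tokens-per-parameter rule under the assumption C ~ 6 × N × D, so the fit recovers an exponent of 0.5 by construction. The function name is hypothetical.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = k * x**p in log-log space; returns (k, p)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    variance = sum((a - mean_x) ** 2 for a in log_x)
    p = covariance / variance
    k = math.exp(mean_y - p * mean_x)
    return k, p


# Synthetic points, NOT measurements: optimal size N = sqrt(C / 120).
budgets = [1e20, 1e21, 1e22, 1e23]
optimal_sizes = [math.sqrt(c / 120) for c in budgets]

k, p = fit_power_law(budgets, optimal_sizes)
print(f"fitted exponent: {p:.3f}")
```

With real training runs the points scatter around the line, and both the exponent and the prefactor move with the dataset, the range of budgets covered, and the fitting method.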

Go deeper

Read the original: arXiv:2203.15556. For the machinery being scaled, see What is an LLM; for the descendants, LLaMA and Open-source LLM architectures. Next paper: InstructGPT (2022).