Chinchilla (2022) — Compute-Optimal Scaling
The world before this paper
In 2021, the recipe for a smarter model was simple: make it bigger. The 2020 scaling laws had been read as a verdict — parameters matter most, data is a detail — so labs built monuments. GPT-3 hit 175B parameters, Gopher 280B, MT-NLG 530B. Almost nobody stopped to ask whether all those weights were actually being fed enough.
The 2020 scaling-law reading was "parameters matter most" — so every lab raced to build a bigger model than the last.
GPT-3 (175B), Gopher (280B) and MT-NLG (530B) differed 3× in size — yet all trained on roughly the same ~300B tokens.
Data was an afterthought. Nobody had carefully traded model size against training tokens at a fixed compute budget.
The key idea
In early 2022, Jordan Hoffmann and a team at DeepMind decided to test the creed instead of extending it. Their question was almost accounting-flavored: a training run buys you a fixed number of FLOPs — what is the best way to spend them? They published the answer as Hoffmann et al. — "Training Compute-Optimal Large Language Models", NeurIPS 2022, and the field knows it by the name of its star witness: Chinchilla.
The bet was empirical, not theoretical. Train a swarm of models at different sizes on different amounts of data, hold compute fixed along each comparison, and let the loss curves speak. They did — over 400 runs — and the curves disagreed with everything the giants had assumed. Want the full mechanics of the models being scaled? See What is an LLM.
Hold compute fixed and sweep the trade-off: as the FLOP budget grows, the lowest loss comes from scaling parameters and tokens in equal proportion, which works out to roughly 20 tokens per parameter. By that yardstick, the era's giants were massively undertrained.
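To make the 20-tokens-per-parameter rule concrete, here is a minimal sketch that turns a FLOP budget into a model size and token count. It assumes the common rule of thumb that training costs about 6 FLOPs per parameter per token (C ≈ 6·N·D); that approximation and the function name `compute_optimal_allocation` are illustrative assumptions, not something taken from this article.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6  # assumed rule of thumb: C ≈ 6 * N * D
TOKENS_PER_PARAM = 20      # the approximate compute-optimal ratio quoted above


def compute_optimal_allocation(flops, tokens_per_param=TOKENS_PER_PARAM):
    """Return (params, tokens) that spend `flops` at a fixed tokens-per-parameter ratio."""
    if flops <= 0:
        raise ValueError("flops must be positive")
    # C = 6 * N * D with D = r * N  =>  N = sqrt(C / (6 * r))
    params = math.sqrt(flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens


# Gopher's approximate budget: 280B parameters trained on ~300B tokens.
gopher_flops = FLOPS_PER_PARAM_TOKEN * 280e9 * 300e9
params, tokens = compute_optimal_allocation(gopher_flops)
print(f"{params / 1e9:.0f}B parameters, {tokens / 1e12:.2f}T tokens")
```

Under these assumptions Gopher's budget maps to about 65B parameters and 1.3T tokens, which is close to the 70B / 1.4T configuration that Chinchilla actually used.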
One curve, two bets
The whole paper fits in one picture: a plane of model size versus training data, a curve of equal compute, and two points on it. Slide Gopher's budget along that curve to Chinchilla's spot, trading parameters for tokens, and the benchmarks flip.
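The equal-compute curve can be sketched in a few lines. The snippet below is an illustration, not the paper's method: it again assumes C ≈ 6·N·D, and `tokens_on_isoflop` is a hypothetical helper name. It walks along the curve that passes through Gopher's point and reports how many tokens each smaller model could afford.

```python
def tokens_on_isoflop(flops, params):
    """Tokens affordable for a model with `params` parameters at budget `flops`,
    under the assumed approximation C ≈ 6 * N * D."""
    if flops <= 0 or params <= 0:
        raise ValueError("flops and params must be positive")
    return flops / (6 * params)


# Budget fixed at Gopher's approximate point: 280B parameters, ~300B tokens.
budget = 6 * 280e9 * 300e9

for n in (280e9, 140e9, 70e9, 35e9):
    d = tokens_on_isoflop(budget, n)
    print(f"{n / 1e9:>4.0f}B params -> {d / 1e12:.2f}T tokens ({d / n:.1f} tokens/param)")
```

Because the budget is a product of the two quantities, halving the model doubles the tokens and quadruples the tokens-per-parameter ratio. That is why a 4× smaller model moves from about one token per parameter to the high teens.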
The results that mattered
Chinchilla, with 70B parameters trained on 1.4T tokens, used roughly the same training compute as Gopher and beat it across a wide range of benchmarks.
70B vs Gopher's 280B at the same training budget: better scores (MMLU 67.5% vs 60.0%) and, with 4× fewer parameters, roughly 4× cheaper to serve.
Compute-optimal training wants about 20 tokens per parameter. Gopher had barely one.
An empirical sweep of more than 400 training runs across model sizes and data scales, enough to overturn the bigger-is-better creed.
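The ratios behind these numbers are simple arithmetic, sketched below using the parameter and token counts quoted in this article (the "~300B" token figure is the article's rounded value, applied to all three older models). The training-FLOP estimate uses the 6·N·D rule of thumb, which is an assumption and only a coarse approximation.

```python
# (parameters, training tokens) as quoted in this article; token counts are rounded.
MODELS = {
    "GPT-3": (175e9, 300e9),
    "Gopher": (280e9, 300e9),
    "MT-NLG": (530e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
}


def tokens_per_param(params, tokens):
    """Training tokens seen per model parameter."""
    return tokens / params


def approx_train_flops(params, tokens):
    """Assumed rule of thumb: about 6 FLOPs per parameter per training token."""
    return 6 * params * tokens


for name, (n, d) in MODELS.items():
    ratio = tokens_per_param(n, d)
    flops = approx_train_flops(n, d)
    print(f"{name:<10} {ratio:5.1f} tokens/param  ~{flops:.2e} FLOPs")
```

With these inputs Chinchilla sits at 20 tokens per parameter while the three larger models all sit below 2. The rule-of-thumb FLOP estimates for Gopher and Chinchilla land within about 20% of each other, which is the sense in which the two runs share a budget.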
Legacy — and the catch
The paper reset the field's priorities almost overnight. LLaMA followed the new playbook, training 7–65B models on 1–1.4T tokens. Modern models now deliberately train far past 20 tokens per parameter, because for a widely deployed model, cumulative inference cost can dominate training cost.
Legacy:
- Made data scale a first-class citizen alongside parameters
- Smaller-but-better models cut inference costs industry-wide
- A model (heh) example of empirical science correcting groupthink
The catch:
- Optimal for training compute only; ignores inference economics
- Assumes unlimited fresh data; the field now stares at a data wall
- The "law" is a fit, not physics; constants shift with data quality
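The inference-economics caveat can be made concrete with a back-of-the-envelope sketch. It assumes about 6 FLOPs per parameter per training token (forward plus backward pass) and about 2 FLOPs per parameter per generated or processed token at inference (forward pass only). Both constants are common approximations rather than figures from this article, and they ignore attention cost, batching, and hardware utilization. The function names are hypothetical.

```python
TRAIN_FLOPS_PER_PARAM_TOKEN = 6  # assumed: forward + backward pass
INFER_FLOPS_PER_PARAM_TOKEN = 2  # assumed: forward pass only


def lifetime_flops(params, train_tokens, served_tokens):
    """Approximate total compute: one training run plus all tokens served afterwards."""
    train = TRAIN_FLOPS_PER_PARAM_TOKEN * params * train_tokens
    infer = INFER_FLOPS_PER_PARAM_TOKEN * params * served_tokens
    return train + infer


def breakeven_served_tokens(train_tokens):
    """Served tokens at which cumulative inference FLOPs equal training FLOPs."""
    return train_tokens * TRAIN_FLOPS_PER_PARAM_TOKEN / INFER_FLOPS_PER_PARAM_TOKEN


# Chinchilla-like run: 70B parameters, 1.4T training tokens.
breakeven = breakeven_served_tokens(1.4e12)
print(f"inference matches training after ~{breakeven / 1e12:.1f}T served tokens")
```

Under these assumptions the break-even point is three times the training token count, whatever the model size. Past that point every parameter is paid for again on each served token, which is the argument for training a smaller model on more data than the training-only optimum suggests.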
Read the original: arXiv:2203.15556. For the machinery being scaled, see What is an LLM; for the descendants, LLaMA and Open-source LLM architectures. Next paper: InstructGPT (2022).