Chain-of-Thought (2022) — Show Your Work
The world before this paper
By 2021, scale was solving everything except the thing people wanted most: reasoning. Giant language models could write essays, translate, and summarize — but hand one a grade-school math word problem and it would confidently blurt out a wrong number. Worse, going bigger barely helped. The scaling curves that climbed so reliably on other tasks looked stubbornly flat on multi-step problems.
Even huge LLMs flopped on multi-step math and logic — on word-problem benchmarks, more parameters bought almost nothing.
Few-shot prompts showed question → answer pairs, asking the model to jump from problem to result in a single step.
The field largely believed reasoning would need new architectures or specialized training. Prompting was not on the suspect list.
The key idea
In early 2022, Jason Wei and colleagues at Google Brain made a bet that sounded too cheap to work. Look at how humans solve a word problem: nobody leaps from question to answer. We scribble: 5 balls, plus 2 cans of 3, that's 6 more, so 11. Yet the standard few-shot prompt of the era showed the model only the question and the final number, and then everyone acted surprised when it guessed. Their hunch, published as "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" (Wei et al., NeurIPS 2022), was that big models could already reason. They had just never been shown that they were allowed to.
So the team changed nothing about the model. No fine-tuning, no new layers, no extra data. They simply rewrote the examples in the prompt to include the scratch work. The model, ever the imitator, copied the shape of the demonstration: it wrote out its own intermediate steps before committing to an answer. Each step it generates becomes context for the next, so the final answer rides on a trail of easy sub-problems instead of one impossible jump.
Change only the prompt's examples — show worked solutions with intermediate steps instead of bare answers — and the model imitates the pattern, reasoning out loud before answering, which sends accuracy on multi-step problems soaring.
Want the full mechanics? See Prompt engineering.
Watch the prompt do the work
Same model, same weights, same question. The only thing that changes is what the examples in the prompt look like, and that's enough to flip a wrong answer into a right one. The catch: the trick only works once the model is big enough.
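To make the difference concrete, here is a minimal sketch of the two prompt styles. It assumes a single exemplar rather than a full few-shot set, uses the tennis-ball and cafeteria problems from the paper's opening figure, and invents the helper names (`standard_exemplar`, `cot_exemplar`, `build_prompt`) purely for illustration. No model is called; the code only builds the text a model would receive.

```python
# Illustrative sketch only: helper names are hypothetical, and a real
# few-shot prompt would normally contain several exemplars, not one.

EXEMPLAR_QUESTION = (
    "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?"
)
# The "scratch work" that chain-of-thought prompting adds to the exemplar.
EXEMPLAR_REASONING = (
    "Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11."
)
EXEMPLAR_ANSWER = "11"

NEW_QUESTION = (
    "The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have?"
)


def standard_exemplar() -> str:
    """Question followed directly by the final answer."""
    return f"Q: {EXEMPLAR_QUESTION}\nA: The answer is {EXEMPLAR_ANSWER}."


def cot_exemplar() -> str:
    """Same question and answer, with the intermediate steps written out."""
    return (
        f"Q: {EXEMPLAR_QUESTION}\n"
        f"A: {EXEMPLAR_REASONING} The answer is {EXEMPLAR_ANSWER}."
    )


def build_prompt(exemplar: str, new_question: str) -> str:
    """Place the exemplar before the new question and leave the answer open."""
    return f"{exemplar}\n\nQ: {new_question}\nA:"


if __name__ == "__main__":
    print(build_prompt(standard_exemplar(), NEW_QUESTION))
    print("-" * 40)
    print(build_prompt(cot_exemplar(), NEW_QUESTION))
```

The two prompts differ only in the exemplar's answer line. Everything else, including the new question and the open `A:` the model is asked to complete, is identical.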
The results that mattered
The headline result came from GSM8K, a benchmark of grade-school math word problems that had humiliated every large model before it. And the gains weren't a math fluke: the same prompt trick lifted arithmetic, commonsense, and symbolic reasoning benchmarks alike.
PaLM 540B with chain-of-thought, a prompting change only, sailed past the fine-tuned state of the art of the time.
The entire gain came from the prompt. No gradient ever flowed; the ability was already sitting in the weights.
Step-by-step reasoning switches on only at scale: below roughly 10–100B parameters, CoT doesn't help, and can even hurt.
Legacy — and the catch
- Unlocked latent reasoning that was already in the weights
- Made model thinking inspectable — you can read the steps
- Founded the reasoning line: self-consistency, tree-of-thought, reasoning models
- The written chain isn't always the real computation; it can be a post-hoc rationalization
- Helps only at scale; small models babble steps without benefit
- Longer outputs: more tokens, more latency, more cost
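Self-consistency, named in the list above, is simple enough to sketch: sample several chains for the same question, pull the final answer out of each, and keep the most common one. The snippet below is a minimal illustration under stated assumptions. The sampled chains are hardcoded stand-ins for model output, the "The answer is N" format is assumed to match the exemplars, and the function names are hypothetical.

```python
import re
from collections import Counter
from typing import List, Optional

# Assumes each chain ends with a sentence of the form "The answer is <number>."
ANSWER_PATTERN = re.compile(r"The answer is\s*(-?\d+(?:\.\d+)?)")

# Hardcoded stand-ins for chains that would be sampled from a model.
SAMPLED_CHAINS = [
    "23 - 20 = 3. 3 + 6 = 9. The answer is 9.",
    "They used 20, leaving 3. Buying 6 more gives 9. The answer is 9.",
    "23 + 6 = 29. The answer is 29.",  # a chain that went wrong
]


def extract_answer(chain: str) -> Optional[str]:
    """Return the final numeric answer stated in a chain, or None if absent."""
    match = ANSWER_PATTERN.search(chain)
    return match.group(1) if match else None


def majority_vote(chains: List[str]) -> Optional[str]:
    """Return the most frequent final answer across chains, ignoring unparsable ones."""
    answers = [a for a in map(extract_answer, chains) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    print(majority_vote(SAMPLED_CHAINS))
```

Voting on the extracted answer rather than on the full text lets chains that reason differently but agree on the result count as the same vote. It also multiplies the token cost noted in the last bullet, since every sampled chain has to be generated in full.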
Read the original: arXiv:2201.11903. The sequels came fast. Within months, Kojima et al. (2022) showed that even the examples were optional: just appending "Let's think step by step" worked zero-shot. Self-consistency (sample many chains, then majority-vote on the final answers) pushed scores higher still. Later, reasoning-trained models like the o1/R1 lineage were trained to produce the chain on their own, with no prompt trick required. For the surrounding concepts, see Prompt engineering, AI agents, and Evaluating LLMs. Next paper: LoRA (2021).