Reasoning Models & Chain-of-Thought
One token at a time, no scratchpad
An LLM writes its answer one token at a time, and each token gets exactly one forward pass of compute — there is no hidden scratchpad. Ask a multi-step question and demand an instant answer, and the model has to compress all of the reasoning into that single pass. But let it think out loud — write the intermediate steps before committing — and every step it writes becomes context the next step can read. On math, logic and planning problems, accuracy jumps.
Forced to answer immediately, the model must do all the working invisibly, in one shot. Multi-step problems break.
Each written step lands in the context, so the next token can build on it — the context window becomes the scratchpad.
The fix is a prompt-level nudge that makes the model reason before answering, which often brings large accuracy gains on hard multi-step tasks.
This is chain-of-thought (CoT) prompting: show a few worked examples with their reasoning written out (Wei et al., 2022), or simply append "Let's think step by step" (Kojima et al., 2022), and the model produces intermediate steps where each one conditions the next. It started life as a humble prompting trick; the original paper has its own note under research papers.
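To make the difference concrete, here is a minimal sketch of the three prompt styles discussed above: a direct prompt, a zero-shot prompt with a step-by-step cue, and a few-shot prompt with written-out reasoning. The function names, the `Q:`/`A:` layout, and the example questions are illustrative assumptions, not a format prescribed by any paper or API. The code only builds strings and does not call a model.

```python
# Illustrative prompt builders. The Q:/A: layout and the sample
# questions are assumptions made for this sketch.

def direct_prompt(question: str) -> str:
    """Ask for the answer only, leaving no room for intermediate steps."""
    return f"Q: {question}\nA: The answer is"


def zero_shot_cot_prompt(question: str) -> str:
    """Append a step-by-step cue so the model writes its working first."""
    return f"Q: {question}\nA: Let's think step by step."


def few_shot_cot_prompt(question: str, examples: list[tuple[str, str, str]]) -> str:
    """Prefix worked examples given as (question, reasoning, answer) triples."""
    blocks = [
        f"Q: {q}\nA: {reasoning} The answer is {answer}."
        for q, reasoning, answer in examples
    ]
    # The final block is left open so the model continues with its own reasoning.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


EXAMPLES = [
    (
        "A shop sells pens at 3 for $2. How much do 12 pens cost?",
        "12 pens is 12 / 3 = 4 groups of three. Each group costs $2, so 4 * 2 = 8.",
        "$8",
    )
]

question = "A train travels 60 km in 45 minutes. How far does it go in 2 hours?"
print(few_shot_cot_prompt(question, EXAMPLES))
```

Whichever variant is used, the point is the same: the prompt leaves room for the working to be written before the answer.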
Fail fast, then think out loud
One word problem, three regimes. First the model is forced to answer instantly and fumbles. Then it thinks step by step and lands the answer. Then you'll see the training loop that bakes the habit in — and the curve that turns "thinking tokens" into a dial you pay to turn.
From prompting trick to trained behaviour
If thinking out loud helps, why wait for the user to ask? Reasoning models bake it in. OpenAI's o1 (2024), DeepSeek-R1 (Jan 2025) and the "thinking" modes in Qwen and others are trained — largely with reinforcement learning on verifiable rewards — to produce a long internal chain of thought before every final answer. Math and code are the perfect training ground: a program can check the answer automatically, so chains that end correct get rewarded, and behaviours like double-checking, backtracking and trying another approach get reinforced until they emerge on their own. (How reward signals shape model behaviour more broadly is the subject of RLHF & alignment.)
Where a checker can grade the answer, correct chains are rewarded automatically — no human labeller in the loop.
The model emits a long chain of thought before the final reply — sometimes shown in full, often hidden or summarized.
Same weights, more thinking tokens, usually better answers: thinking at inference time is a third scaling axis after model size and training data, with diminishing returns.
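The automatic grading described above can be sketched as a tiny reward function. This is a simplified illustration, not the training code of any particular model: it assumes each completion ends with a line of the form `Answer: <value>` (a convention invented for this sketch) and compares that value to a known reference. A real pipeline would feed such rewards into a reinforcement learning algorithm, which is omitted here.

```python
import re
from typing import Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the text after the last 'Answer:' marker, or None if absent."""
    matches = re.findall(r"Answer:\s*(.+)", completion)
    return matches[-1].strip() if matches else None


def verifiable_reward(completion: str, expected: str) -> float:
    """Score 1.0 when the final answer matches the reference, else 0.0.

    Only the final answer is graded; the chain of thought itself is not.
    """
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    return 1.0 if answer == expected.strip() else 0.0


# Made-up completions for the question "60 km in 45 minutes, how far in 2 hours?"
samples = [
    "60 km in 45 min. Check: 60 / 0.75 = 80 km/h. Two hours gives 160.\nAnswer: 160",
    "60 km times 2 is 120.\nAnswer: 120",
    "I am not sure.",
]
rewards = [verifiable_reward(s, "160") for s in samples]
print(rewards)  # [1.0, 0.0, 0.0]
```

Because the reward looks only at the final answer, any habit that makes correct endings more likely, such as the re-check in the first sample, is reinforced indirectly.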
Every thinking token costs latency and money, so APIs expose a dial: cap the thinking tokens, or pick a low/medium/high effort level. Some models show the full chain (DeepSeek-R1), others hide it and show only a summary (o1) — either way, you usually pay for the hidden tokens too.
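A rough cost estimate shows why the dial matters. The sketch below assumes thinking tokens are billed at the same rate as visible output tokens, which is a common arrangement but should be checked against your provider's pricing. The prices and the effort-to-budget mapping are hypothetical numbers chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_million: float   # USD per 1M input tokens (hypothetical)
    output_per_million: float  # USD per 1M output tokens (hypothetical)


def request_cost(pricing: Pricing, input_tokens: int,
                 thinking_tokens: int, answer_tokens: int) -> float:
    """Estimate the cost of one request in USD.

    Assumption: hidden thinking tokens are billed as output tokens.
    """
    billed_output = thinking_tokens + answer_tokens
    total = (input_tokens * pricing.input_per_million
             + billed_output * pricing.output_per_million)
    return total / 1_000_000


# Hypothetical mapping from an effort level to a thinking-token cap.
EFFORT_BUDGETS = {"low": 1_000, "medium": 4_000, "high": 16_000}

pricing = Pricing(input_per_million=2.0, output_per_million=8.0)
for effort, budget in EFFORT_BUDGETS.items():
    # Worst case: the model uses its whole thinking budget.
    cost = request_cost(pricing, input_tokens=500,
                        thinking_tokens=budget, answer_tokens=200)
    print(f"{effort:>6}: up to ${cost:.4f} per request")
```

With these made-up numbers, the thinking budget dominates the bill well before the visible answer does.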
When a reasoning model is the wrong tool
Skip it:
- Simple lookups and casual chat: slower and pricier, zero accuracy gain
- Latency-sensitive UIs: users sit and wait while it "thinks"
- Long chains eat your context window, and your bill

Use it:
- Multi-step math, logic, planning, tricky debugging
- The answer can be verified, so extra thinking pays for itself
- You'd happily trade seconds and cents for correctness
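The trade-offs listed above can be turned into a simple routing rule. The sketch below is one possible heuristic under stated assumptions, not an established policy: the `Task` fields and the 10-second default for expected thinking time are hypothetical and would need tuning for a real product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    multi_step: bool         # needs several dependent reasoning steps
    latency_budget_s: float  # how long the user is willing to wait


def use_reasoning_model(task: Task, expected_thinking_s: float = 10.0) -> bool:
    """Route to a reasoning model only when thinking is likely to pay off."""
    if not task.multi_step:
        # Lookups and casual chat: extra thinking adds cost, not accuracy.
        return False
    if task.latency_budget_s < expected_thinking_s:
        # The user cannot wait for the chain of thought to finish.
        return False
    return True


print(use_reasoning_model(Task(multi_step=True, latency_budget_s=60.0)))   # True
print(use_reasoning_model(Task(multi_step=False, latency_budget_s=60.0)))  # False
print(use_reasoning_model(Task(multi_step=True, latency_budget_s=1.0)))    # False
```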
The written chain of thought is text that was trained to lead to correct answers — it isn't guaranteed to be a faithful transcript of how the model actually computed the result. A model can write plausible-looking steps while internally taking a different shortcut. Treat the chain as a useful trace for debugging, not as proof of reasoning.