Reasoning Models & Chain-of-Thought

One token at a time, no scratchpad

An LLM writes its answer one token at a time, and each token gets exactly one forward pass of compute; there is no hidden scratchpad. Ask a multi-step question and demand an instant answer, and the model has to compress all of the reasoning into the few passes that produce the answer tokens. But let it think out loud, writing the intermediate steps before committing, and every step it writes becomes context the next step can read. On math, logic and planning problems, accuracy typically improves, often substantially.
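To make the "context as scratchpad" idea concrete, here is a toy sketch. Everything in it is an assumption made for illustration: `toy_next_token` is a hypothetical stand-in for a forward pass that is allowed exactly one addition per call, and the `=` separator is an invented convention. It is not how a real LLM computes, but the loop has the same shape as autoregressive decoding: one call per emitted token, and each emitted token is appended to the context.

```python
from typing import Callable, List


def generate(next_token: Callable[[List[str]], str],
             prompt: List[str],
             max_tokens: int = 50) -> List[str]:
    """Autoregressive loop: one call to next_token (one 'pass') per emitted token."""
    context = list(prompt)
    for _ in range(max_tokens):
        token = next_token(context)
        if token == "<eos>":
            break
        context.append(token)  # the emitted token becomes input to the next pass
    return context[len(prompt):]


def toy_next_token(context: List[str]) -> str:
    """Hypothetical toy 'model': it may perform at most ONE addition per call.

    Tokens before "=" are the numbers to add; tokens after it are the
    partial sums written so far (the scratchpad).
    """
    sep = context.index("=")
    numbers = [int(t) for t in context[:sep]]
    written = [int(t) for t in context[sep + 1:]]
    k = len(written)
    if k >= len(numbers) - 1:
        return "<eos>"  # every number has been folded in
    left = numbers[0] if k == 0 else written[-1]
    return str(left + numbers[k + 1])


prompt = ["12", "7", "30", "5", "="]
thinking = generate(toy_next_token, prompt)               # room to write steps
instant = generate(toy_next_token, prompt, max_tokens=1)  # forced to answer at once
```

With room to write, the toy emits the partial sums 19, 49, 54 and the last one is the correct total. Capped at a single token, it stops at 19, because one pass can only do one step of the work.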

No scratchpad: 1 pass per token

Forced to answer immediately, the model must do all the working invisibly, in one shot. Multi-step problems break.

Steps as tokens: write the working

Each written step lands in the context, so the next token can build on it — the context window becomes the scratchpad.

Chain-of-thought: "think step by step"

A prompt-level nudge that makes the model reason before answering — big accuracy gains on hard tasks.

This is chain-of-thought (CoT) prompting. In the few-shot form, you show a few worked examples with their reasoning written out (Wei et al., 2022); in the zero-shot variant, you simply add "Let's think step by step" (Kojima et al., 2022). Either way, the model produces intermediate steps, each one conditioning the next. It started life as a humble prompting trick; the original paper has its own note under research papers.
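The two prompt styles can be sketched as plain string builders. The `Q:`/`A:` layout, the function names, and the worked example are assumptions made for this sketch; the example was written for illustration and is not quoted from either paper.

```python
ZERO_SHOT_TRIGGER = "Let's think step by step."

# Hypothetical worked example, written for this sketch.
FEW_SHOT_EXAMPLES = [
    {
        "question": ("A shelf holds 3 boxes with 4 books each. "
                     "5 books are removed. How many books remain?"),
        "reasoning": ("3 boxes of 4 books is 3 * 4 = 12 books. "
                      "Removing 5 leaves 12 - 5 = 7."),
        "answer": "7",
    },
]


def few_shot_cot_prompt(question: str, examples=FEW_SHOT_EXAMPLES) -> str:
    """Few-shot CoT: prepend worked examples whose answers include the reasoning."""
    parts = []
    for ex in examples:
        parts.append(
            f"Q: {ex['question']}\n"
            f"A: {ex['reasoning']} The answer is {ex['answer']}."
        )
    parts.append(f"Q: {question}\nA:")  # the model continues from here
    return "\n\n".join(parts)


def zero_shot_cot_prompt(question: str) -> str:
    """Zero-shot CoT: no examples, just a trigger phrase that opens the answer."""
    return f"Q: {question}\nA: {ZERO_SHOT_TRIGGER}"
```

In both cases the prompt ends where the model is expected to start writing its reasoning, so the intermediate steps land in the context before any final answer does.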

Fail fast, then think out loud

One word problem, three regimes. First the model is forced to answer instantly and fumbles. Then it thinks step by step and lands the answer. Then you'll see the training loop that bakes the habit in — and the curve that turns "thinking tokens" into a dial you pay to turn.

From prompting trick to trained behaviour

If thinking out loud helps, why wait for the user to ask? Reasoning models bake it in. OpenAI's o1 (2024), DeepSeek-R1 (Jan 2025) and the "thinking" modes in Qwen and others are trained — largely with reinforcement learning on verifiable rewards — to produce a long internal chain of thought before every final answer. Math and code are the perfect training ground: a program can check the answer automatically, so chains that end correct get rewarded, and behaviours like double-checking, backtracking and trying another approach get reinforced until they emerge on their own. (How reward signals shape model behaviour more broadly is the subject of RLHF & alignment.)
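A verifiable reward can be as simple as "did the chain end on the right answer?". The sketch below is a minimal illustration, not the reward used by any particular lab: the `Final answer:` marker, the function names, and the numeric tolerance are all assumptions.

```python
import re
from typing import Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the text after the last 'Final answer:' marker, or None if absent."""
    matches = re.findall(r"Final answer:\s*(.+)", completion)
    return matches[-1].strip() if matches else None


def verifiable_reward(completion: str, reference: str) -> float:
    """1.0 if the final answer matches the reference, else 0.0.

    Only the final answer is graded; the chain of thought before it is
    never inspected, so any chain that ends correct earns the reward.
    """
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0  # no parseable answer, no reward
    try:
        # Compare numerically when both sides parse as numbers.
        return 1.0 if abs(float(answer) - float(reference)) < 1e-9 else 0.0
    except ValueError:
        # Otherwise fall back to an exact string match.
        return 1.0 if answer == reference.strip() else 0.0


good = "17 * 3 = 51. Let me double-check: 51 / 3 = 17.\nFinal answer: 51"
bad = "17 * 3 = 50.\nFinal answer: 50"
rewards = [verifiable_reward(good, "51"), verifiable_reward(bad, "51")]
```

Because the checker is a program, it can grade every sampled chain during training with no human in the loop. For code tasks, the same role would be played by running a test suite instead of comparing strings.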

Verifiable rewards: math & code

Where a checker can grade the answer, correct chains are rewarded automatically — no human labeller in the loop.

Think, then answer: thousands of tokens

The model emits a long chain of thought before the final reply — sometimes shown in full, often hidden or summarized.

Test-time compute: a new scaling dial

Same weights, more thinking tokens, better answers — a third axis after model size and training data, with diminishing returns.

Thinking budgets, visible vs hidden thought

Every thinking token costs latency and money, so APIs expose a dial: cap the thinking tokens, or pick a low/medium/high effort level. Some models show the full chain (DeepSeek-R1), while others hide it and show only a summary (o1). Either way, you usually pay for the thinking tokens, including any you never see.
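A quick back-of-the-envelope calculation shows why the dial matters. This sketch assumes thinking tokens are billed at the same rate as output tokens, and the prices are hypothetical placeholders; check your provider's actual pricing and billing rules.

```python
def estimate_cost(input_tokens: int,
                  thinking_tokens: int,
                  answer_tokens: int,
                  price_in_per_m: float,
                  price_out_per_m: float) -> float:
    """Estimated request cost in dollars.

    Assumption: thinking tokens are billed like output tokens,
    whether or not the chain is shown to the caller.
    """
    billed_output = thinking_tokens + answer_tokens
    return (input_tokens * price_in_per_m
            + billed_output * price_out_per_m) / 1_000_000


# Hypothetical prices: $2 per million input tokens, $8 per million output tokens.
with_thinking = estimate_cost(500, 4_000, 200, 2.0, 8.0)   # about $0.0346
without_thinking = estimate_cost(500, 0, 200, 2.0, 8.0)    # about $0.0026
```

Under these made-up numbers, a 4,000-token chain makes the request roughly 13 times more expensive than the bare answer, which is the trade a thinking budget lets you control.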

When a reasoning model is the wrong tool

Skip it when

Simple lookups and casual chat: slower and pricier, with little or no accuracy gain
Latency-sensitive UIs: users sit and wait while it "thinks"
Tight context or budget: long chains eat your context window and your bill

Reach for one when

Multi-step math, logic, planning, tricky debugging
The answer can be verified, so extra thinking pays for itself
You'd happily trade seconds and cents for correctness
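The two lists above can be folded into a small routing rule. This is one possible reading of them, not a standard recipe: the `Task` fields and the `pick_model` function are hypothetical names, and a real router would likely weigh these signals rather than treat them as hard switches.

```python
from dataclasses import dataclass


@dataclass
class Task:
    multi_step: bool          # math, logic, planning, tricky debugging
    latency_sensitive: bool   # a user is waiting on the response
    context_tight: bool       # little room left in the window or the budget


def pick_model(task: Task) -> str:
    """Return "reasoning" or "standard" for a task (hypothetical heuristic)."""
    if not task.multi_step:
        return "standard"     # lookups and chat gain little from thinking
    if task.latency_sensitive or task.context_tight:
        return "standard"     # the long chain costs more than it buys here
    return "reasoning"        # seconds and cents traded for correctness


debugging = pick_model(Task(multi_step=True, latency_sensitive=False, context_tight=False))
autocomplete = pick_model(Task(multi_step=True, latency_sensitive=True, context_tight=False))
```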

The chain isn't a confession

The written chain of thought is text that was trained to lead to correct answers — it isn't guaranteed to be a faithful transcript of how the model actually computed the result. A model can write plausible-looking steps while internally taking a different shortcut. Treat the chain as a useful trace for debugging, not as proof of reasoning.