GPT-3 (2020) — Scale Is All You Need?

The world before this paper

In 2019, getting a model to do a new task usually meant training a new model. The BERT-era recipe was carved in stone: pretrain on a large unlabeled text corpus, then collect thousands of labeled examples and fine-tune a separate copy of the network for every single task (one for sentiment, one for QA, one for translation). It worked, but it was a factory line, not intelligence. Meanwhile GPT-2 had dropped a strange hint: a 1.5B-parameter language model could sometimes do tasks zero-shot, straight from a prompt, no labels at all. An expensive question hung in the air.

The BERT recipe: fine-tune everything

Every task needed labeled examples and its own fine-tuned copy of the model. New task, new dataset, new training run.

GPT-2's hint: 1.5B, zero-shot

A big enough language model could do some tasks just from a prompt — crude, unreliable, but with zero labels.

The open question: what if 100×?

Nobody knew what would happen if you scaled that idea a hundredfold. OpenAI decided to find out.

The key idea

In May 2020, Tom Brown and a small army of OpenAI co-authors posted "Language Models are Few-Shot Learners" to arXiv; it was later published at NeurIPS 2020. The bet was almost embarrassingly simple: take GPT-2's recipe (a decoder-only transformer trained to predict the next token), change nothing important about the architecture, and make it more than 100× bigger. 96 layers. 175 billion parameters. Roughly 300 billion tokens of text.

What came out was not just better autocomplete. The model could pick up a brand-new task from demonstrations written into its prompt. Show it three English → French pairs, then a fourth English word, and it continues with the French. No labels collected, no fine-tuning run, no weights touched. The paper named it in-context learning, and the ability grew with scale: few-shot beat zero-shot, and the gap widened as models got bigger. The team was also unusually candid about the failures: the model hallucinated, absorbed bias from the web, and completed text rather than following instructions. And it shipped as an API, not as weights: you didn't download GPT-3, you called it.
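To make "demonstrations written into its prompt" concrete, here is a minimal sketch of how such a prompt can be assembled as plain text. The helper name `build_few_shot_prompt`, the `=>` separator, and the word pairs are illustrative assumptions, not the paper's exact formatting; the point is only that the "training data" is a string, and zero-shot is the same string with the examples removed.

```python
def build_few_shot_prompt(task, demonstrations, query, k=None):
    """Format up to k demonstrations plus one unanswered query as a single prompt.

    Hypothetical helper for illustration; nothing here calls a model.
    """
    shots = demonstrations if k is None else demonstrations[:k]
    lines = [task]
    for source, target in shots:
        # Each demonstration is just a completed line of text.
        lines.append(f"{source} => {target}")
    # The final line is left open; a language model would continue it.
    lines.append(f"{query} =>")
    return "\n".join(lines)


# Illustrative English -> French pairs (assumed, not taken from the source).
demos = [
    ("sea otter", "loutre de mer"),
    ("peppermint", "menthe poivrée"),
    ("plush giraffe", "girafe en peluche"),
]
task = "Translate English to French:"

few_shot = build_few_shot_prompt(task, demos, "cheese")        # three shots
zero_shot = build_few_shot_prompt(task, demos, "cheese", k=0)  # no shots

print(few_shot)
print("---")
print(zero_shot)
```

Switching between zero-shot, one-shot, and few-shot only changes `k`; the model's weights are never part of the picture, which is what "no gradient updates" means in practice.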

The paper in one sentence

Scale a decoder-only transformer to 175 billion parameters and it can learn a task from a few examples written into the prompt — no gradient updates, no fine-tuning: the prompt became the program.

Want the full mechanics? See GPT mechanics — this page stays on the story.

Watch the trick

The animation replays the paper's core experiment: a frozen model, a prompt holding three worked examples, and accuracy curves that fan apart as parameters grow.

The results that mattered

The numbers landed hard. One model, never fine-tuned on task-specific labels, posted strong few-shot results across translation, question answering, cloze tasks, even arithmetic.

Parameters: 175B

~10× larger than any previous dense language model: 96 layers of decoder-only transformer, GPT-2's design with only minor sparse-attention tweaks.
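The 175B figure can be sanity-checked with back-of-envelope arithmetic. The sketch below uses the common approximation of about 12 × d_model² weights per transformer block (attention plus a 4×-wide MLP), ignoring biases and layer norms. Only the layer count comes from this page; the hidden width, vocabulary size, and context length are assumptions taken from the configuration the paper reports for its largest model.

```python
# Rough parameter count for a decoder-only transformer.
N_LAYERS = 96     # stated on this page
D_MODEL = 12288   # assumption: hidden width reported for the largest model
VOCAB = 50257     # assumption: GPT-2-style BPE vocabulary size
N_CTX = 2048      # assumption: context length in tokens


def approx_params(n_layers, d_model, vocab, n_ctx):
    """Approximate weight count, ignoring biases and layer-norm parameters."""
    attention = 4 * d_model * d_model       # Q, K, V and output projections
    mlp = 8 * d_model * d_model             # two linear layers, 4x hidden width
    blocks = n_layers * (attention + mlp)   # 12 * d_model^2 per layer
    embeddings = (vocab + n_ctx) * d_model  # token + learned position embeddings
    return blocks + embeddings


total = approx_params(N_LAYERS, D_MODEL, VOCAB, N_CTX)
print(f"approx. {total / 1e9:.1f}B parameters")
```

Under these assumptions the estimate lands within about one percent of 175B, and nearly all of it sits in the 96 blocks rather than the embeddings.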

Gradient updates: 0

Tasks learned from the prompt alone. Bigger models proved to be better in-context learners.

Training diet: 300B tokens

Huge for 2020 — a number Chinchilla's scaling math would later call too small for 175B parameters.
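A quick calculation shows why 300B tokens looks small in hindsight. The sketch below assumes the widely quoted rule of thumb drawn from the Chinchilla paper, roughly 20 training tokens per parameter; that ratio is an approximation, not an exact law. It also estimates training compute with the standard 6 × parameters × tokens approximation, which is likewise an assumption rather than a figure from this page.

```python
PARAMS = 175e9   # parameters, from this page
TOKENS = 300e9   # training tokens, from this page

# Assumption: Chinchilla-style rule of thumb, ~20 tokens per parameter.
TOKENS_PER_PARAM_RULE = 20

actual_ratio = TOKENS / PARAMS                    # tokens seen per parameter
suggested_tokens = PARAMS * TOKENS_PER_PARAM_RULE # data the rule would call for
shortfall = suggested_tokens / TOKENS             # how far short the diet fell

# Assumption: training FLOPs ~= 6 * parameters * tokens (forward + backward).
train_flops = 6 * PARAMS * TOKENS

print(f"tokens per parameter: {actual_ratio:.2f}")
print(f"rule-of-thumb tokens: {suggested_tokens / 1e12:.1f}T")
print(f"shortfall factor:     {shortfall:.1f}x")
print(f"training compute:     {train_flops:.2e} FLOPs")
```

By this rough measure the model saw fewer than two tokens per parameter, an order of magnitude below what the rule of thumb would suggest for its size.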

Legacy — and the catch

What it unlocked

Established the scaling playbook — bigger model, broader abilities
Prompting replaced fine-tuning for many tasks overnight
Triggered the LLM product era via the API model

The limits

Confidently wrong: hallucination at scale
Completed text instead of following instructions (later mitigated by instruction tuning with RLHF)
Undertrained by Chinchilla's math — parameters outran data

Go deeper

Read the original: arXiv:2005.14165. For the concepts behind the story, see GPT mechanics, What is an LLM, Prompt engineering, and Context windows. Next paper: ViT (2020).