GPT-3 (2020) — Scale Is All You Need?
The world before this paper
In 2019, getting a model to do a new task meant training a new model. The BERT-era recipe was carved in stone: pretrain on a large corpus of unlabeled text, then collect thousands of labeled examples and fine-tune a separate copy of the network for every single task (one for sentiment, one for QA, one for translation). It worked, but it was a factory line, not intelligence. Meanwhile GPT-2 had dropped a strange hint: a 1.5B-parameter language model could sometimes do tasks zero-shot, straight from a prompt, no labels at all. An expensive question hung in the air.
Every task needed labeled examples and its own fine-tuned copy of the model. New task, new dataset, new training run.
A big enough language model could do some tasks just from a prompt — crude, unreliable, but with zero labels.
Nobody knew what would happen if you scaled that idea a hundredfold. OpenAI decided to find out.
The key idea
In May 2020, Tom Brown and a small army of OpenAI co-authors posted "Language Models are Few-Shot Learners" to arXiv; the paper went on to appear at NeurIPS 2020. The bet was almost embarrassingly simple: take GPT-2's recipe (a decoder-only transformer trained to predict the next token), change nothing important about the architecture, and make it more than 100× bigger. 96 layers. 175 billion parameters. Roughly 300 billion tokens of text.
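As a rough sanity check on the headline number, the parameter count of a decoder-only transformer can be estimated from its depth and width. The sketch below uses the common approximation of about 12 × d_model² weights per layer (attention plus a 4×-wide feed-forward block) and ignores biases and layer norms. The width of 12,288, the 50,257-token vocabulary, and the 2,048-token context window are taken from the paper's model table rather than from this page, and the function name is made up for illustration.

```python
def approx_decoder_params(n_layers, d_model, vocab_size, n_ctx):
    """Approximate weight count of a decoder-only transformer (no biases, no layer norms)."""
    attention = 4 * d_model * d_model      # query, key, value, and output projections
    feed_forward = 8 * d_model * d_model   # two matrices of shape d_model x (4 * d_model)
    per_layer = attention + feed_forward   # roughly 12 * d_model^2
    embeddings = vocab_size * d_model + n_ctx * d_model  # token + learned position embeddings
    return n_layers * per_layer + embeddings


# Values assumed from the paper's model table for the largest GPT-3 model.
gpt3_params = approx_decoder_params(n_layers=96, d_model=12288, vocab_size=50257, n_ctx=2048)
gpt2_params = 1.5e9  # GPT-2's size as quoted in this article

scale_up = gpt3_params / gpt2_params
print(f"GPT-3 estimate: {gpt3_params / 1e9:.1f}B parameters, about {scale_up:.0f}x GPT-2")
```

The estimate lands just under 175 billion, which is consistent with the "more than 100× bigger" framing relative to GPT-2's 1.5B.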
What came out was not just better autocomplete. The model could pick up a brand-new task from demonstrations written into its prompt. Show it three English → French pairs, then a fourth English word, and it continues with the French. No labels collected, no fine-tuning run, no weights touched. The paper named it in-context learning, and the ability grew with scale: few-shot beat zero-shot, and the gap widened as models got bigger. The team was also unusually candid about the failures: the model hallucinated, absorbed bias from the web, and completed text rather than following instructions. And it shipped as an API, not as weights: you didn't download GPT-3, you called it.
Scale a decoder-only transformer to 175 billion parameters and it can learn a task from a few examples written into the prompt, with no gradient updates and no fine-tuning. The prompt became the program.
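To make "examples written into the prompt" concrete, here is a minimal sketch of how a zero-shot or few-shot prompt might be assembled as plain text. The `=>` separator, the word pairs, and the `build_prompt` helper are illustrative assumptions, not the paper's exact format, and no model is called: the point is only that the "training data" is a string.

```python
def build_prompt(task, demos, query, k):
    """Return a prompt holding the task description, the first k demonstrations, and the query."""
    lines = [task]
    lines += [f"{source} => {target}" for source, target in demos[:k]]
    lines.append(f"{query} =>")  # left open so the model continues with the answer
    return "\n".join(lines)


# Illustrative English -> French demonstrations (hypothetical, chosen for this sketch).
demos = [("cheese", "fromage"), ("house", "maison"), ("dog", "chien")]
task = "Translate English to French:"

zero_shot = build_prompt(task, demos, "book", k=0)  # task description only
few_shot = build_prompt(task, demos, "book", k=3)   # three worked examples, then the query

print(few_shot)
```

Changing `k` is the whole difference between the zero-shot and few-shot settings: the model's weights stay frozen, and only the text in front of the query changes.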
Want the full mechanics? See GPT mechanics — this page stays on the story.
Watch the trick
The animation replays the paper's core experiment: a frozen model, a prompt holding three worked examples, and accuracy curves that fan apart as parameters grow.
The results that mattered
The numbers landed hard. One model, never fine-tuned on task-specific labels, posted strong few-shot results across translation, question answering, cloze tasks, even arithmetic.
175B parameters: roughly 10× any previous dense language model, in 96 layers of plain decoder-only transformer with no architectural tricks.
Few-shot learning: tasks learned from the prompt alone. Bigger models proved to be better in-context learners.
300B training tokens: huge for 2020, yet a number Chinchilla's scaling math would later call too small for 175B parameters.
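The "too small for 175B parameters" verdict can be checked with back-of-envelope arithmetic. The sketch assumes two approximations that are not stated on this page: the Chinchilla paper's rule of thumb of roughly 20 training tokens per parameter, and the common estimate of training compute as 6 × parameters × tokens. Both are rough, so the outputs are best read as orders of magnitude.

```python
N_PARAMS = 175e9        # GPT-3 parameters, from this article
N_TOKENS = 300e9        # GPT-3 training tokens, from this article
TOKENS_PER_PARAM = 20   # assumption: approximate Chinchilla rule of thumb

actual_ratio = N_TOKENS / N_PARAMS              # tokens GPT-3 saw per parameter
optimal_tokens = TOKENS_PER_PARAM * N_PARAMS    # tokens the rule of thumb would call for
shortfall = optimal_tokens / N_TOKENS           # how far short the actual data fell
train_flops = 6 * N_PARAMS * N_TOKENS           # assumption: C ~ 6 * N * D

print(f"Tokens per parameter: {actual_ratio:.2f}")
print(f"Rule-of-thumb token budget: {optimal_tokens / 1e12:.1f}T ({shortfall:.1f}x the actual data)")
print(f"Approximate training compute: {train_flops:.2e} FLOPs")
```

Under these assumptions GPT-3 saw fewer than two tokens per parameter, about a tenth of what the rule of thumb suggests for a model of its size.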
Legacy — and the catch
- Established the scaling playbook — bigger model, broader abilities
- Prompting replaced fine-tuning for many tasks overnight
- Triggered the LLM product era via the API model
- Confidently wrong: hallucination at scale
- Completed text instead of following instructions (later addressed by instruction tuning with RLHF)
- Undertrained by Chinchilla's math — parameters outran data
Read the original: arXiv:2005.14165. For the concepts behind the story, see GPT mechanics, What is an LLM, Prompt engineering, and Context windows.