InstructGPT (2022) — Teaching GPT to Listen (RLHF)


The world before this paper

GPT-3 was a genius that wouldn't listen. By early 2022 it could write essays, code, and poetry — if you tricked it just right. Ask it plainly to "explain the moon landing to a 6-year-old" and it might reply with a list of more requests like yours, because on the internet, one request is usually followed by another. It was a text-completion engine wearing the costume of an assistant.

Completion engine: more of the same

Asked to explain the moon landing to a 6-year-old, GPT-3 might just generate more requests like yours. It continues patterns; it doesn't answer.

Prompt gymnastics: fragile workarounds

Users hacked around it with elaborate few-shot prompts. Even then, outputs could be confidently false — or flat-out toxic.
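To make "few-shot prompt" concrete, here is a minimal sketch of how such a prompt might be assembled. The `Q:`/`A:` layout, the function name, and the example pairs are all invented for illustration; they are not taken from the paper.

```python
def build_few_shot_prompt(examples: list[tuple[str, str]], query: str) -> str:
    """Assemble a few-shot prompt: worked Q/A pairs, then the real question.

    The Q:/A: layout is an assumed convention, one of many that were used.
    """
    blocks = [f"Q: {question}\nA: {answer}" for question, answer in examples]
    # End on an open "A:" so a completion model continues with an answer
    # instead of inventing another question.
    blocks.append(f"Q: {query}\nA:")
    return "\n\n".join(blocks)


# Hypothetical demonstrations, made up for this sketch.
demo_examples = [
    ("Explain rain to a 6-year-old.",
     "Clouds are full of tiny water drops. When they get heavy, they fall."),
    ("Explain magnets to a 6-year-old.",
     "Magnets are special rocks that pull on some metals without touching."),
]

prompt = build_few_shot_prompt(demo_examples, "Explain the moon landing to a 6-year-old.")
print(prompt)
```

The trick works by making "an answer" the most likely continuation of the pattern, which is also why it is fragile: change the pattern slightly and the continuation changes with it.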

Wrong objective: next token ≠ helpful

Predicting the next web token is simply a different goal from being helpful, honest, and harmless. The objectives differ, so the behavior does too.

The key idea

In 2022, a team at OpenAI published the fix: Ouyang et al. — "Training language models to follow instructions with human feedback", NeurIPS 2022. Their bet was almost humble. Don't make the model bigger. Don't engineer cleverer prompts. Change what the model is optimized for — and use humans as the signal.

The recipe has three stages. First, hire labelers to write good answers to real prompts and fine-tune the model on those demonstrations: show it what helpful looks like. Second, sample several outputs per prompt, have labelers rank them best to worst, and train a reward model to predict those rankings: turn human taste into a number. Third, let the model write freely while the reward model scores it, and use PPO (Proximal Policy Optimization), a reinforcement learning algorithm, to nudge the weights toward higher reward, with a KL (Kullback-Leibler) divergence penalty keeping it from drifting too far from its supervised self.
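Stages two and three can be sketched numerically. The following is a minimal, framework-free illustration, not the paper's implementation. It assumes the reward model has already produced a scalar score per output and that summed log-probabilities are available for the policy and the SFT model. The function names and the `beta` value are chosen here for illustration.

```python
import math
from itertools import combinations


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def ranking_loss(scores_best_to_worst: list[float]) -> float:
    """Stage 2: pairwise ranking loss for one prompt.

    `scores_best_to_worst` holds reward-model scores for the sampled outputs,
    ordered by the labeler's ranking (best first). Each pair contributes
    -log sigmoid(score_preferred - score_rejected), averaged over all pairs.
    """
    pairs = list(combinations(scores_best_to_worst, 2))
    if not pairs:
        raise ValueError("need at least two ranked outputs")
    return -sum(log_sigmoid(better - worse) for better, worse in pairs) / len(pairs)


def kl_penalized_reward(reward: float, logp_policy: float, logp_sft: float,
                        beta: float) -> float:
    """Stage 3: the quantity the RL step maximizes for one sampled output.

    The log-ratio term grows as the policy assigns its own output more
    probability than the SFT model would, so drifting away costs reward.
    """
    return reward - beta * (logp_policy - logp_sft)


# Scores that agree with the human ranking give a lower loss than reversed ones.
print(ranking_loss([2.0, 1.0, 0.0]))
print(ranking_loss([0.0, 1.0, 2.0]))

# beta=0.1 is an arbitrary illustrative value, not one taken from the paper.
print(kl_penalized_reward(reward=1.0, logp_policy=-2.0, logp_sft=-3.0, beta=0.1))
```

When every output gets the same score, the loss is log 2 (about 0.693), the value for a reward model that expresses no preference. Training pushes it below that by scoring preferred outputs higher.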

The paper in one sentence

Align the model in three stages: fine-tune on human demonstrations of good answers (SFT), train a reward model on human rankings of candidate outputs, then optimize the policy against that reward model with PPO reinforcement learning.

Want the full mechanics? See LLM fine-tuning.


The results that mattered

The evaluation was simple and brutal: show human labelers outputs from different models and ask which one they'd rather receive.
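The metric behind this kind of head-to-head evaluation is a win rate. Here is a minimal sketch, assuming each comparison is recorded as the name of the model whose output the labeler preferred. The votes below are invented for illustration and are not the paper's data.

```python
def win_rate(preferred: list[str], model: str) -> float:
    """Fraction of head-to-head comparisons in which `model` was preferred."""
    if not preferred:
        raise ValueError("no comparisons recorded")
    return sum(1 for winner in preferred if winner == model) / len(preferred)


# Made-up votes: each entry names the model whose output one labeler preferred.
votes = ["instruct", "instruct", "base", "instruct"]
print(win_rate(votes, "instruct"))
```

A win rate of 0.5 means labelers are indifferent between the two models. Anything consistently above it means they prefer the candidate.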

The upset: 1.3B > 175B

Labelers preferred outputs from the 1.3B-parameter InstructGPT over those from the raw 175B-parameter GPT-3. On helpfulness, alignment beat a more than 100× advantage in parameter count.

The pipeline: 3 stages

SFT → reward model → PPO. Each stage feeds the next; together they turn a text-completer into an assistant.

The aftershock: ~9 months

From this paper (posted to arXiv in March 2022) to ChatGPT (released in November 2022). ChatGPT was trained with essentially the same recipe, adapted to dialogue.

The wins weren't only about preference. InstructGPT was more truthful on TruthfulQA and less toxic than GPT-3, and the "alignment tax" it initially paid on standard benchmarks was largely mitigated by mixing pretraining gradients back into the RL stage (the variant the paper calls PPO-ptx). Better behavior, at little cost in capability.
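For readers who want the formula, the RL-stage objective with pretraining gradients mixed in can be written as follows. This is a reconstruction in the paper's notation, so check the symbols against the original before relying on them. Here r_θ is the reward model, π^RL the policy being trained, π^SFT the supervised model, β the KL penalty coefficient, and γ the weight on the pretraining term.

```latex
\[
\text{objective}(\phi) =
  \mathbb{E}_{(x,y)\sim D_{\pi_\phi^{\mathrm{RL}}}}
  \left[ r_\theta(x,y)
         - \beta \log \frac{\pi_\phi^{\mathrm{RL}}(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)} \right]
  + \gamma \, \mathbb{E}_{x \sim D_{\mathrm{pretrain}}}
  \left[ \log \pi_\phi^{\mathrm{RL}}(x) \right]
\]
```

Setting γ = 0 recovers plain PPO with a KL penalty. A positive γ rewards the policy for still assigning high likelihood to pretraining text, and that term is what offsets the benchmark regressions.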

Legacy — and the catch

What it unlocked
  • Turned text-completers into assistants — the product unlock for LLMs
  • Human preference became a trainable objective
  • Truthfulness and toxicity improved with little loss of benchmark capability

The limits
  • Optimizes what raters like, which isn't always what's true (sycophancy, reward hacking)
  • Quality ceiling = the labeler pool and its instructions
  • PPO is finicky to tune; simpler successors such as DPO have replaced it in many labs

Go deeper

Read the original: arXiv:2203.02155. For the mechanics behind each stage, see LLM fine-tuning, Evaluating LLMs, and Hallucinations. Next paper: Chain-of-Thought (2022).