RLHF & Alignment
A brilliant autocomplete, not an assistant
A freshly pretrained LLM is a next-token predictor and nothing more. Ask a raw base model "How do I write a good resume?" and it may happily continue with "How do I write a cover letter? How do I prepare for an interview?" — because on the internet, questions often appear in lists of more questions. It isn't being unhelpful on purpose; it's doing exactly what it was trained to do: continue the text. Alignment is everything we do after pretraining to turn that autocomplete into an assistant that behaves the way people actually want.
The usual shorthand for "the way people actually want" is the three H's:
- Helpful: follow instructions, stay on task, and give a useful answer instead of continuing the pattern or wandering off.
- Honest: prefer true statements, express uncertainty, and admit "I don't know" instead of confidently inventing facts.
- Harmless: decline genuinely dangerous requests and avoid toxic output, without refusing everything that merely sounds edgy.
Alignment is the post-training process — supervised fine-tuning plus preference tuning like RLHF or DPO — that reshapes a raw next-token predictor into a model that is helpful, honest, and harmless.
The three-stage pipeline
The modern recipe was laid out in OpenAI's InstructGPT paper (Ouyang et al., 2022), and nearly every chat model since follows the same three stages:
- Pretraining: train on trillions of tokens of text to predict the next one. This is where all the knowledge and language ability comes from, and none of the manners.
- Supervised fine-tuning (SFT): fine-tune on example conversations written by humans. The model learns the assistant format: a question deserves an answer, not more questions.
- Preference tuning: teach the model which of two plausible answers people prefer. This is where tone, judgment, and refusal behavior get shaped, and it is the subject of this note.
SFT can only imitate its demonstrations, and writing the perfect answer is hard even for experts. But comparing two answers is easy — almost anyone can say which of two drafts is better. Preference tuning exploits that asymmetry: instead of asking humans to write ideal answers, it asks them to rank the model's own attempts, and squeezes a training signal out of the votes.
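To make "squeezes a training signal out of the votes" concrete, here is a minimal sketch of how one rater's ranking could be expanded into pairwise training records. The function name `ranking_to_pairs` and the `prompt`/`chosen`/`rejected` field names are assumptions made for illustration, not a fixed standard.

```python
from itertools import combinations

def ranking_to_pairs(prompt, ranked_answers):
    """Expand a best-to-worst ranking into (chosen, rejected) records.

    Hypothetical helper: the field names are illustrative only.
    """
    pairs = []
    # combinations() preserves input order, so `better` always outranks `worse`.
    for better, worse in combinations(ranked_answers, 2):
        pairs.append({"prompt": prompt, "chosen": better, "rejected": worse})
    return pairs

# A rater's ordering of three model drafts, best first (made-up data).
ranked = ["draft B", "draft A", "draft C"]
pairs = ranking_to_pairs("How do I write a good resume?", ranked)

for p in pairs:
    print(p["chosen"], ">", p["rejected"])
```

A ranking of K answers yields K(K-1)/2 comparisons, so a single ranking task can supply several training pairs.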
How RLHF works
RLHF — reinforcement learning from human feedback — sounds intimidating, but the idea is intuitive. You can't backpropagate through a human, so you build a stand-in for one:
- Collect comparisons: sample two (or more) answers from the model for the same prompt and ask a human which is better. No essay-writing, just a vote. Collect tens of thousands of these comparisons.
- Train a reward model: train a separate model to predict which answer the human would prefer. Once trained, this reward model can score any answer instantly, making it a tireless, scalable stand-in for the human raters.
- Optimize the policy with RL: now run reinforcement learning (usually PPO). The LLM (the policy) generates answers, the reward model scores them, and the weights are nudged toward whatever scores higher.
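The reward-model step above can be sketched with a pairwise loss. The text does not name a specific loss; the version below assumes the commonly used Bradley-Terry formulation, where the probability that the chosen answer wins is the sigmoid of the reward difference. The scalar rewards are stand-ins for what a real reward model would output, and the function names are hypothetical.

```python
import math

def preference_probability(reward_chosen, reward_rejected):
    """Bradley-Terry: P(chosen beats rejected) = sigmoid(r_chosen - r_rejected)."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))

def reward_model_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood of the human's vote under the model above."""
    return -math.log(preference_probability(reward_chosen, reward_rejected))

# Made-up scalar scores for one comparison.
undecided = reward_model_loss(0.0, 0.0)  # model has no opinion yet
agrees = reward_model_loss(2.0, 0.0)     # model scores the human's pick higher
disagrees = reward_model_loss(0.0, 2.0)  # model scores the rejected answer higher

print(round(undecided, 3), round(agrees, 3), round(disagrees, 3))
```

The loss shrinks as the reward model ranks the pair the same way the human did, and grows when it ranks them the other way, which is what pushes its scores toward human preferences during training.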
Left unchecked, the policy will find weird text that the reward model loves but humans would hate — repeated flattery, strange tokens, confident nonsense. That's reward hacking: the reward model is only an imperfect proxy, and RL is ruthlessly good at exploiting proxies. The fix is a KL penalty: alongside the reward, the policy is penalized for drifting too far from the frozen SFT model (the reference model). It's a leash — improve within the neighborhood of sensible language, but don't wander off into high-reward gibberish.
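The leash can be sketched as a shaped reward: the reward-model score minus a penalty proportional to how far the policy's log-probabilities have moved from the reference model's on the sampled answer. This is a minimal sketch under assumptions: `beta` and all the numbers are made up, the function name is hypothetical, and many implementations apply the penalty per token rather than once per sequence.

```python
def kl_shaped_reward(reward, policy_logprobs, reference_logprobs, beta=0.1):
    """Reward-model score minus a KL penalty toward the frozen reference model.

    The sum of (log pi - log pi_ref) over the sampled tokens is a
    single-sample estimate of the KL divergence; beta is a hypothetical
    leash strength.
    """
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))
    return reward - beta * kl_estimate

# A sensible answer: modest reward, and the policy stays close to the reference.
sensible = kl_shaped_reward(
    reward=1.0,
    policy_logprobs=[-1.0, -1.2, -0.8],
    reference_logprobs=[-1.1, -1.3, -0.9],
)

# A reward-hacked answer: high reward, but the reference finds the text very unlikely.
hacked = kl_shaped_reward(
    reward=3.0,
    policy_logprobs=[-0.1, -0.1, -0.1],
    reference_logprobs=[-9.0, -9.0, -9.0],
)

print(round(sensible, 2), round(hacked, 2))
```

With these toy numbers the hacked answer's raw reward is three times higher, yet its shaped reward ends up lower, because the penalty charges it for text the reference model considers implausible.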
DPO: skip the reward model entirely
RLHF works, but it's heavy machinery: a separate reward model to train, an RL loop that's famously finicky to tune, and up to four models held in memory at once (policy, reference, reward model, and critic). In 2023, DPO, short for Direct Preference Optimization (Rafailov et al.), showed you can get most of the benefit without the reward model or the RL loop. The key insight: the preference data already tells you what to do. A clever loss function trains the policy directly on the A ≻ B pairs: raise the probability of the preferred answer, lower the rejected one, with the same KL leash to the reference model baked implicitly into the math.
RLHF (with PPO):
- Extra models: on top of the policy and the frozen reference model, you train a reward model and run a critic, which is heavy on memory and engineering
- RL is finicky: unstable training, lots of hyperparameters, easy to reward-hack if the KL leash is mistuned
- Online: the policy explores and gets scored on its own fresh samples; powerful, and still used by major labs
DPO:
- No reward model, no RL loop: just the policy, the reference model, and a classification-style loss
- Stable and simple: trains like ordinary supervised fine-tuning; far fewer knobs to get wrong
- Offline: learns from a fixed dataset of preference pairs, which is why it's everywhere in open-weight models
RLHF and DPO consume the same data — human preference pairs. RLHF distills the preferences into a reward model and then chases that reward with RL; DPO turns the preferences into gradients directly. If you fine-tune an open model today, DPO (or one of its cousins) is very likely what you'll reach for first.
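As a rough illustration of "turns the preferences into gradients directly", here is a sketch of the DPO loss for a single preference pair, following the form given by Rafailov et al. The inputs are summed log-probabilities of each full answer under the policy and under the frozen reference model. The numbers and the `beta` value are made up, and a real implementation would compute these quantities from model outputs over batches.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a whole answer.
    beta (hypothetical value) controls how tightly the policy is held
    to the reference model.
    """
    # How much more (or less) likely each answer is under the policy than the reference.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(margin)), written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))

# At the start of training the policy equals the reference, so the margin is zero.
start = dpo_loss(-12.0, -11.0, -12.0, -11.0)

# After some training: the chosen answer became more likely, the rejected one less.
improved = dpo_loss(-10.0, -13.0, -12.0, -11.0)

print(round(start, 3), round(improved, 3))
```

Because the loss only depends on log-probability shifts relative to the reference model, there is no separate reward model to train and no sampling loop. It also shows why the reference model still has to be kept around.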
Watch the pipeline run
[Animation: the whole pipeline in sequence. The base model autocompletes a question with more questions, SFT reshapes it into an assistant, humans rank a pair of answers to train the reward model, the PPO loop runs with its KL leash, and finally the DPO shortcut skips the machinery.]
What can go wrong
Preference tuning optimizes for what humans say they like — which is close to, but not the same as, what's actually good. That gap is where the failure modes live.
- Reward hacking — the policy exploits blind spots in the reward model: padding answers with confident filler, gaming length, or drifting into text the judge over-scores. The KL leash limits this; it doesn't eliminate it.
- Sycophancy — raters tend to prefer answers that agree with them and sound confident, so the model learns to flatter and to tell you you're right even when you're not. Optimizing approval is not optimizing truth.
- Over-refusal — push too hard on "harmless" and the model starts declining perfectly reasonable requests. Tuning the helpful/harmless balance is an ongoing fight, not a solved problem.
- Alignment ≠ truthfulness — RLHF shapes preferences, it doesn't install a fact-checker. An aligned model still hallucinates; it just does so more politely.
One active direction: replace some of the expensive human feedback with AI feedback — RLAIF and Anthropic's Constitutional AI have a model critique and rank answers against a written set of principles, with humans supervising the process rather than labeling every pair.