RLHF & Alignment


A brilliant autocomplete, not an assistant

A freshly pretrained LLM is a next-token predictor and nothing more. Ask a raw base model "How do I write a good resume?" and it may happily continue with "How do I write a cover letter? How do I prepare for an interview?" — because on the internet, questions often appear in lists of more questions. It isn't being unhelpful on purpose; it's doing exactly what it was trained to do: continue the text. Alignment is everything we do after pretraining to turn that autocomplete into an assistant that behaves the way people actually want.

The usual shorthand for "the way people actually want" is the three H's:

Helpful: answer the actual question

Follow instructions, stay on task, give a useful answer instead of continuing the pattern or wandering off.

Honest: don't make things up

Prefer true statements, express uncertainty, admit "I don't know" instead of confidently inventing facts.

Harmless: refuse the bad stuff

Decline genuinely dangerous requests, avoid toxic output — without refusing everything that merely sounds edgy.

In one sentence

Alignment is the post-training process — supervised fine-tuning plus preference tuning like RLHF or DPO — that reshapes a raw next-token predictor into a model that is helpful, honest, and harmless.

The three-stage pipeline

The modern recipe was laid out in OpenAI's InstructGPT paper (Ouyang et al., 2022), and nearly every chat model since follows the same three stages:

1. Pretraining: next-token prediction on the internet

Train on trillions of tokens of text to predict the next one. This is where the knowledge and language ability come from, but none of the manners.

2. SFT: imitate good conversations

Supervised fine-tuning on example conversations written by humans. The model learns the assistant format: a question deserves an answer, not more questions.

3. Preference tuning: RLHF or DPO

Teach the model which of two plausible answers people prefer. This is where tone, judgment, and refusal behavior get shaped — the subject of this note.

Why ranking beats writing

SFT can only imitate its demonstrations, and writing the perfect answer is hard even for experts. But comparing two answers is easy — almost anyone can say which of two drafts is better. Preference tuning exploits that asymmetry: instead of asking humans to write ideal answers, it asks them to rank the model's own attempts, and squeezes a training signal out of the votes.

How RLHF works

RLHF — reinforcement learning from human feedback — sounds intimidating, but the idea is intuitive. You can't backpropagate through a human, so you build a stand-in for one:

1. Humans rank pairs: A ≻ B

Sample two (or more) answers from the model for the same prompt and ask a human which is better. No essay-writing — just a vote. Collect tens of thousands of these comparisons.

2. Train a reward model: a learned judge

Train a separate model to predict which answer the human would prefer. Once trained, this reward model can score any answer instantly — a tireless, scalable stand-in for the human raters.
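A common way to train such a judge is a pairwise (Bradley-Terry style) loss: the reward model gives each answer a scalar score, and training pushes the preferred answer's score above the rejected one's. The text above does not name a specific loss, so treat this as a minimal sketch under that assumption; the function names and example scores are hypothetical.

```python
import math

def reward_model_loss(score_chosen: float, score_rejected: float) -> float:
    """Pairwise loss: -log sigmoid(score_chosen - score_rejected).

    Small when the judge already scores the human-preferred answer higher,
    large when it ranks the pair the wrong way round.
    """
    x = score_rejected - score_chosen
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def preference_probability(score_chosen: float, score_rejected: float) -> float:
    """Probability the judge assigns to 'chosen beats rejected' (assumed Bradley-Terry form)."""
    return math.exp(-reward_model_loss(score_chosen, score_rejected))

# Hypothetical scores for one human-labeled pair.
loss_when_judge_agrees = reward_model_loss(score_chosen=2.0, score_rejected=-1.0)
loss_when_judge_disagrees = reward_model_loss(score_chosen=-1.0, score_rejected=2.0)
print(loss_when_judge_agrees < loss_when_judge_disagrees)
```

In a real setup the scores would come from a neural network and the loss would be averaged over many labeled pairs; only the shape of the objective is shown here.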

3. Optimize with RL: PPO

Now run reinforcement learning (usually PPO): the LLM — the policy — generates answers, the reward model scores them, and the weights are nudged toward whatever scores higher.
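The generate, score, nudge loop can be illustrated at toy scale. The sketch below is not PPO: it uses a plain policy-gradient update with the exact expected gradient, with no clipping, no critic, and no KL term. The "policy" is a softmax over three canned answers, and the fixed `reward_scores` list is a hypothetical stand-in for a trained reward model.

```python
import math

def softmax(logits):
    """Turn logits into a probability distribution over the canned answers."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_gradient_step(logits, reward_scores, learning_rate=0.5):
    """One gradient-ascent step on expected reward.

    d E[reward] / d logit_j = p_j * (reward_j - E[reward]),
    so answers scoring above the current average gain probability.
    """
    probs = softmax(logits)
    expected_reward = sum(p * r for p, r in zip(probs, reward_scores))
    return [
        logit + learning_rate * p * (r - expected_reward)
        for logit, p, r in zip(logits, probs, reward_scores)
    ]

# Hypothetical reward-model scores for three candidate answers.
reward_scores = [0.1, 0.9, 0.4]
logits = [0.0, 0.0, 0.0]  # the policy starts out indifferent

for _ in range(200):
    logits = policy_gradient_step(logits, reward_scores)

final_probs = softmax(logits)
print([round(p, 3) for p in final_probs])
```

After the loop, most of the probability mass sits on the answer with the highest score, which is the "nudged toward whatever scores higher" behavior in miniature.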

The KL leash

Left unchecked, the policy will find weird text that the reward model loves but humans would hate — repeated flattery, strange tokens, confident nonsense. That's reward hacking: the reward model is only an imperfect proxy, and RL is ruthlessly good at exploiting proxies. The fix is a KL penalty: alongside the reward, the policy is penalized for drifting too far from the frozen SFT model (the reference model). It's a leash — improve within the neighborhood of sensible language, but don't wander off into high-reward gibberish.
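One common way to apply the leash is to subtract a scaled KL estimate from the reward-model score before the RL update. This is a minimal sketch, assuming sequence-level log-probabilities are available from both the policy and the frozen reference model; `beta` and every number below are hypothetical.

```python
def kl_shaped_reward(reward_score, policy_logprob, reference_logprob, beta=0.1):
    """Reward-model score minus beta times a single-sample KL estimate.

    policy_logprob - reference_logprob is large when the policy assigns
    high probability to text the reference model finds very unlikely.
    """
    kl_estimate = policy_logprob - reference_logprob
    return reward_score - beta * kl_estimate

# A sensible answer: decent score, and the reference model finds it plausible too.
sensible = kl_shaped_reward(3.0, policy_logprob=-20.0, reference_logprob=-22.0)

# A reward-hacked answer: higher raw score, but far from anything the reference would say.
gibberish = kl_shaped_reward(5.0, policy_logprob=-10.0, reference_logprob=-60.0)

print(sensible > gibberish)
```

With these made-up numbers the hacked answer wins on raw reward (5.0 vs 3.0) but loses once the penalty is applied. A larger `beta` shortens the leash; a smaller one lets the policy roam further.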

DPO: skip the reward model entirely

RLHF works, but it's heavy machinery: a separate reward model to train, an RL loop that's famously finicky to tune, and up to four models held in memory at once. In 2023, DPO — Direct Preference Optimization (Rafailov et al.) — showed you can get most of the benefit with none of the apparatus. The key insight: the preference data already tells you what to do. A clever loss function trains the policy directly on the A ≻ B pairs — raise the probability of the preferred answer, lower the rejected one, with the same implicit KL leash to the reference model baked into the math.
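The loss from the DPO paper can be sketched for a single preference pair as follows. The inputs are summed log-probabilities of each answer under the policy and under the frozen reference model; the argument names, `beta` value, and example numbers are hypothetical.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log sigmoid(beta * (chosen_shift - rejected_shift)).

    Each shift measures how far the policy has moved an answer's
    log-probability relative to the reference model, which is where
    the implicit leash to the reference comes from.
    """
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_shift - rejected_shift)
    # Numerically stable -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: no preference learned yet.
loss_at_start = dpo_loss(-12.0, -12.0, -12.0, -12.0)

# Policy has raised the preferred answer and lowered the rejected one.
loss_after_learning = dpo_loss(-10.0, -14.0, -12.0, -12.0)

print(loss_after_learning < loss_at_start)
```

Because the loss depends only on log-probabilities the models already produce, it can be minimized with an ordinary supervised training loop, with no sampling and no separate reward model.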

RLHF (PPO)
  • Extra models: besides the policy, you train a reward model, keep a frozen reference model, and run a critic, which is heavy on memory and engineering
  • RL is finicky: unstable training, lots of hyperparameters, easy to reward-hack if the KL leash is mistuned
  • Online: the policy explores and gets scored on its own fresh samples — powerful, and still used by major labs
DPO
  • No reward model, no RL loop — just the policy, the reference model, and a classification-style loss
  • Stable and simple: trains like ordinary supervised fine-tuning; far fewer knobs to get wrong
  • Offline: learns from a fixed dataset of preference pairs — which is why it's everywhere in open-weight models
Same fuel, different engine

RLHF and DPO consume the same data — human preference pairs. RLHF distills the preferences into a reward model and then chases that reward with RL; DPO turns the preferences into gradients directly. If you fine-tune an open model today, DPO (or one of its cousins) is very likely what you'll reach for first.

Watch the pipeline run

The animation below walks the whole journey: the base model autocompleting a question with more questions, SFT reshaping it into an assistant, humans ranking a pair of answers to train the reward model, the PPO loop with its KL leash — and finally the DPO shortcut that skips the machinery.

[Animation legend: the model being trained (policy); preferred / aligned output; rejected output; reward model]

What can go wrong

Preference tuning optimizes for what humans say they like — which is close to, but not the same as, what's actually good. That gap is where the failure modes live.

Watch out for
  • Reward hacking — the policy exploits blind spots in the reward model: padding answers with confident filler, gaming length, or drifting into text the judge over-scores. The KL leash limits this; it doesn't eliminate it.
  • Sycophancy — raters tend to prefer answers that agree with them and sound confident, so the model learns to flatter and to tell you you're right even when you're not. Optimizing approval is not optimizing truth.
  • Over-refusal — push too hard on "harmless" and the model starts declining perfectly reasonable requests. Tuning the helpful/harmless balance is an ongoing fight, not a solved problem.
  • Alignment ≠ truthfulness — RLHF shapes preferences, it doesn't install a fact-checker. An aligned model still hallucinates; it just does so more politely.

One active direction: replace some of the expensive human feedback with AI feedback. RLAIF and Anthropic's Constitutional AI have a model critique and rank answers against a written set of principles, with humans supervising the process rather than labeling every pair.