Chapter 5 — Fine-tuning & Alignment

Part 4: RLHF & DPO — Teaching Models to Be Helpful and Harmless

This is how ChatGPT was made. SFT teaches the format. RLHF/DPO teaches the quality — which response is better, more helpful, less harmful. This part gives you the full picture.

The Alignment Problem

A model after SFT knows how to format responses. But it doesn't know what makes one response better than another.

Given "What household chemicals can I combine to make poison gas?", a post-SFT model might helpfully explain the chemistry. We need to teach it that some responses, while technically well-formatted, should be declined.

More subtly: given "Summarize this paper in simple terms", a post-SFT model might produce an accurate but overly technical summary. We want it to produce an actually simple one.

The challenge: human values are complex, subjective, and hard to specify as a loss function. We can't directly optimize for "be helpful and harmless." But we can ask humans to compare responses and say which is better.


The Three-Stage Alignment Pipeline (InstructGPT)

Here's the full pipeline that turned GPT-3 into InstructGPT (and became the template for ChatGPT):

┌─────────────────────────────────────────────────────────────────┐
│ Stage 1: Supervised Fine-tuning (SFT)                           │
│   - Human experts write ideal responses to prompts              │
│   - Fine-tune pretrained model on these (instruction-following) │
│   - Result: SFT model (follows instructions, but not optimal)   │
└──────────────────────────────┬──────────────────────────────────┘
                               ↓
┌─────────────────────────────────────────────────────────────────┐
│ Stage 2: Reward Model (RM) Training                             │
│   - Collect: prompt + multiple responses                         │
│   - Human annotators rank responses: A > B > C                  │
│   - Train a reward model to predict human preference scores     │
│   - Result: RM(prompt, response) → scalar score                 │
└──────────────────────────────┬──────────────────────────────────┘
                               ↓
┌─────────────────────────────────────────────────────────────────┐
│ Stage 3: RL Optimization (PPO)                                  │
│   - Use PPO to optimize the SFT model to maximize RM scores     │
│   - KL penalty prevents model from "gaming" the reward model    │
│   - Result: RLHF model (helpful, harmless, and honest)          │
└─────────────────────────────────────────────────────────────────┘

Stage 2 Deep Dive: Training the Reward Model

Data Collection

Show human annotators a prompt and 2-4 model responses. They rank them:

Prompt: "How do I improve my writing?"

Response A: "Practice daily, read widely, get feedback."
Response B: "Writing improvement requires consistent effort. Consider daily journaling,
             reading diverse genres, and seeking constructive feedback from trusted peers.
             Also, study grammar and style guides."
Response C: "Writing good takes practice lol"

Human ranking: B > A > C

This is converted to pairwise comparisons: (B, A), (B, C), (A, C) — B is preferred over A, etc.
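In code, that conversion is only a few lines. A minimal sketch (the helper name and the chosen/rejected keys are illustrative; they mirror the DPO data format shown later in this part):

from itertools import combinations

def ranking_to_pairs(prompt, ranked_responses):
    """Turn a human ranking (best first) into pairwise preference examples.

    ranked_responses: responses ordered best -> worst, e.g. ["B", "A", "C"].
    A ranking of n responses yields n*(n-1)/2 pairs.
    """
    pairs = []
    for chosen, rejected in combinations(ranked_responses, 2):
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs

# B > A > C  ->  (B, A), (B, C), (A, C)
pairs = ranking_to_pairs("How do I improve my writing?", ["B", "A", "C"])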

The Reward Model Architecture

Take the SFT model (or a similar LM) and add a scalar head:

RM = SFT model + Linear(d_model → 1)

Input:  prompt + response (concatenated)
Output: scalar score (higher = more preferred)
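A minimal PyTorch sketch of this architecture, assuming a Hugging Face-style Transformer backbone with right-padded inputs (the class name RewardModel and the choice to score the last non-padding token are illustrative conventions, not a fixed standard):

import torch
import torch.nn as nn
from transformers import AutoModel

class RewardModel(nn.Module):
    def __init__(self, backbone_name):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)          # SFT-initialized LM body
        self.score_head = nn.Linear(self.backbone.config.hidden_size, 1)  # the only new parameters

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        # Score the final non-padding token of the concatenated prompt + response
        # (assumes right padding).
        last_idx = attention_mask.sum(dim=1) - 1
        last_hidden = hidden[torch.arange(hidden.size(0)), last_idx]
        return self.score_head(last_hidden).squeeze(-1)                   # (batch,) scalar scores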

Training objective: maximize the log-likelihood of human preferences:

$$\mathcal{L}_{\text{RM}} = -\mathbb{E}\left[\log \sigma(r(x, y_w) - r(x, y_l))\right]$$

Where:

  • $r(x, y)$ = reward model score for response y given prompt x
  • $y_w$ = preferred (winning) response
  • $y_l$ = dispreferred (losing) response
  • $\sigma$ = sigmoid function

Intuition: we want r(y_w) - r(y_l) to be large and positive. The sigmoid turns this into a probability, and we maximize the log-probability that the preferred response has higher score.
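In code, the objective is a single logsigmoid over score differences. A minimal sketch, where the inputs are the scalar reward-model outputs for a batch of preferred and dispreferred responses to the same prompts:

import torch.nn.functional as F

def reward_model_loss(chosen_scores, rejected_scores):
    """Pairwise (Bradley-Terry) loss: -log sigma(r(x, y_w) - r(x, y_l))."""
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()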

Interview corner case 🎯: "Why use the SFT model as the starting point for the reward model?" — The RM needs to understand language to judge response quality. Starting from the pretrained/SFT model gives it this capability immediately. Only the final scalar head is new. Starting from scratch would require enormous data and compute.


Stage 3: PPO — Reinforcement Learning from Human Feedback

PPO (Proximal Policy Optimization) is the RL algorithm used to optimize the language model against the reward model.

The Setup

Policy (π):     The LM being trained (starts from SFT model)
Reward:         r(prompt, response) from the reward model
Value function: Estimates expected future reward (for PPO baseline)
Reference (π_ref): Frozen copy of the SFT model (for KL penalty)

The PPO Objective for LMs

$$\text{maximize: } \mathbb{E}[r(x, y)] - \beta \times \text{KL}(\pi \parallel \pi_{\text{ref}})$$

Where:

  • $r(x, y)$ = reward model score
  • $\beta \times \text{KL}(\pi \parallel \pi_{\text{ref}})$ = penalty for diverging too far from SFT model

The KL penalty is crucial. Without it, the RL model quickly "hacks" the reward model — finding responses that score high on the RM but are terrible for humans. This is called reward hacking or reward gaming.

Example reward hacking without KL penalty:

RM trained on: "Longer, more detailed responses are preferred"
Model discovers: Generate extremely long, repetitive responses → high RM score!
Result: Responses that say the same thing 50 different ways

The KL penalty says: "You can improve RM scores, but don't deviate too far from how the SFT model behaves." It keeps the model "on distribution."
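In practice the KL term is usually applied per token rather than once per sequence: every generated token is penalized by the policy-vs-reference log-ratio, and the RM score is added at the final token. A minimal sketch of that reward shaping (variable names and the beta value are illustrative):

import torch

def shaped_rewards(policy_logprobs, ref_logprobs, rm_score, beta=0.02):
    """Combine the RM score with a per-token KL penalty, as commonly done in RLHF PPO.

    policy_logprobs, ref_logprobs: (seq_len,) log-probs of the generated tokens
    rm_score: scalar reward-model score for the full response
    """
    kl_per_token = policy_logprobs - ref_logprobs   # sample-based estimate of KL(pi || pi_ref)
    rewards = -beta * kl_per_token                  # penalize divergence at every position
    rewards[-1] = rewards[-1] + rm_score            # credit the RM score at the final token
    return rewards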

PPO in Practice

PPO is complex to implement and train. Key issues:

  • Requires 4 models in memory simultaneously: policy, reference, RM, value function
  • Highly sensitive to hyperparameters
  • Can be unstable (reward collapse, KL explosion)
  • Computationally expensive: typically takes days on many GPUs

This is why DPO was a significant advance.


DPO — Direct Preference Optimization: Simpler and Better

Paper: "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" — Rafailov et al., 2023

DPO elegantly eliminates the reward model and PPO, directly optimizing the policy on preference data.

The Key Insight

There's a mathematical relationship between the optimal policy under the KL-constrained RL objective and the preference data. You can solve for the policy directly without explicitly learning a reward model.
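Concretely, the optimal policy for the KL-constrained objective above lets you rewrite the reward in terms of the policy itself:

$$r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$$

Substituting this into the same Bradley-Terry preference model used for reward-model training makes the intractable partition function $Z(x)$ cancel (it depends only on the prompt), leaving a loss written purely in terms of policy and reference log-probabilities:

$$\mathcal{L}_{\text{DPO}} = -\mathbb{E}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$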

In code:

import torch.nn.functional as F

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """
    policy_chosen:   log P_policy(y_w | x)  — policy's log-prob of preferred response
    policy_rejected: log P_policy(y_l | x)  — policy's log-prob of dispreferred response
    ref_chosen:      log P_ref(y_w | x)     — reference model's log-prob of preferred
    ref_rejected:    log P_ref(y_l | x)     — reference model's log-prob of dispreferred

    beta: controls the implicit KL constraint (higher = stay closer to reference model)
    """
    # Compute log-ratio of policy vs reference for each response
    pi_logratios = policy_chosen - policy_rejected
    ref_logratios = ref_chosen - ref_rejected

    # DPO loss: -log sigmoid(beta * (policy log-ratio - reference log-ratio))
    losses = -F.logsigmoid(beta * (pi_logratios - ref_logratios))
    return losses.mean()

Intuition: DPO increases the relative probability of preferred responses vs. dispreferred ones, while using the reference model to prevent going too far from the SFT distribution. It implicitly computes what the reward model would be, without explicitly training it.
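The four inputs to dpo_loss are per-sequence log-probabilities: the sum of the log-probs of the response tokens under each model. A minimal sketch of computing them, assuming a Hugging Face-style causal LM whose forward pass returns .logits, and a labels tensor where prompt and padding positions are set to -100 (a common fine-tuning convention, assumed here rather than required by DPO):

import torch

def sequence_logprob(model, input_ids, labels):
    """Sum of log P(token) over response tokens only.

    input_ids: (batch, seq) prompt + response tokens
    labels:    (batch, seq) copy of input_ids with prompt/pad positions set to -100
    """
    logits = model(input_ids=input_ids).logits[:, :-1, :]   # position t predicts token t+1
    targets = labels[:, 1:]
    mask = targets != -100
    log_probs = torch.log_softmax(logits, dim=-1)
    token_logprobs = torch.gather(
        log_probs, 2, targets.clamp(min=0).unsqueeze(-1)
    ).squeeze(-1)
    return (token_logprobs * mask).sum(dim=-1)               # (batch,) summed response log-probs

Running this for the chosen and rejected responses under both the policy and the frozen reference model yields exactly the four arguments that dpo_loss expects.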

DPO Data Format

{
  "prompt": "How do I get better at Python?",
  "chosen": "The best way to improve at Python is to build projects. Start with simple scripts, then gradually tackle more complex problems. Reading others' code, solving problems on LeetCode, and contributing to open source are also excellent strategies.",
  "rejected": "Just read the documentation and you'll be fine. Python docs are comprehensive."
}

DPO vs RLHF: When to Use Which

Aspect               RLHF (PPO)                            DPO
Complexity           High (4 models)                       Low (2 models: policy + reference)
Training stability   Often unstable                        More stable
GPU memory           4× model size                         2× model size
Quality              Marginally better on complex tasks    ~Same, sometimes better
Implementation       Very complex                          ~50 lines of code
Industry adoption    OpenAI, Anthropic                     Growing fast (Zephyr, Llama 3)

Practical recommendation: Use DPO unless you have a specific reason to use PPO (e.g., the task is a true RL problem with sparse rewards, like game playing or tool use).


Constitutional AI (Claude's Approach)

Anthropic's Claude uses Constitutional AI (CAI), which replaces much of the human preference labeling in traditional RLHF with AI feedback.

The key idea: instead of human preference labels, use a set of principles (a "constitution") to automatically generate feedback.

Stage 1: SL-CAI (supervised learning on AI-revised responses)

  1. Sample a potentially harmful response from the initial model
  2. Ask the model to critique it based on constitutional principles
  3. Ask the model to revise it to be more helpful and harmless
  4. Train on the revised responses
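A minimal sketch of that critique-revision loop, assuming a hypothetical generate(prompt) helper that returns the model's text completion (the prompt templates below are illustrative placeholders, not Anthropic's actual constitution):

CRITIQUE_TEMPLATE = (
    "Identify ways in which the assistant's response violates this principle: {principle}\n\n"
    "Response: {response}\n\nCritique:"
)
REVISION_TEMPLATE = (
    "Rewrite the response to address the critique while remaining helpful.\n\n"
    "Response: {response}\nCritique: {critique}\n\nRevised response:"
)

def critique_and_revise(generate, prompt, principle):
    """One SL-CAI step: sample, self-critique against a principle, revise, keep the revision."""
    response = generate(prompt)                                    # 1. initial (possibly harmful) response
    critique = generate(CRITIQUE_TEMPLATE.format(
        principle=principle, response=response))                   # 2. model critiques itself
    revised = generate(REVISION_TEMPLATE.format(
        response=response, critique=critique))                     # 3. model revises its own answer
    return {"prompt": prompt, "response": revised}                 # 4. revised pair becomes SFT data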

Stage 2: RL-CAI (RL from AI feedback)

  1. Generate pairs of responses to prompts
  2. Ask a feedback model to pick which response better follows the constitution
  3. Train a preference model on these AI-generated preferences
  4. Use RL to optimize the policy against this preference model

Why it works: AI feedback can be generated at scale and cheaply, with consistent application of principles. Human annotators introduce variability and cost. The constitution provides a clear, auditable set of values.


Emerging Alignment Methods (2024–2025)

ORPO (Odds Ratio Preference Optimization)

Combines SFT and alignment into a single training step. The loss directly penalizes dispreferred responses while training on preferred ones. No reference model needed, faster than DPO.

SimPO (Simple Preference Optimization)

Removes the reference model from DPO by using the average log-probability (per token) as an implicit reference. More length-consistent than DPO.
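A sketch of the SimPO objective as described in the paper: the implicit reward is the length-normalized (average per-token) log-probability under the policy, and a target margin gamma replaces the reference model (hyperparameter values here are illustrative):

import torch.nn.functional as F

def simpo_loss(policy_chosen, policy_rejected, chosen_len, rejected_len, beta=2.0, gamma=0.5):
    """policy_chosen / policy_rejected: summed response log-probs under the policy.
    chosen_len / rejected_len: number of response tokens, used for length normalization."""
    chosen_reward = beta * policy_chosen / chosen_len        # average per-token log-prob as reward
    rejected_reward = beta * policy_rejected / rejected_len
    return -F.logsigmoid(chosen_reward - rejected_reward - gamma).mean()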

RLOO (REINFORCE Leave-One-Out)

A simpler RL method than PPO that uses multiple samples per prompt and computes baselines from the same batch. More stable than PPO, less complex.
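The leave-one-out baseline is simple to compute: with k samples per prompt, each sample's advantage is its reward minus the mean reward of the other k - 1 samples. A minimal sketch (the function name is illustrative):

import torch

def rloo_advantages(rewards):
    """rewards: (k,) reward-model scores for k responses sampled for the same prompt."""
    k = rewards.numel()
    baselines = (rewards.sum() - rewards) / (k - 1)   # mean of the other k-1 rewards
    return rewards - baselines                        # REINFORCE advantage for each sample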

Iterative DPO / Self-Play Fine-tuning (SPIN)

Generate preference pairs with the current model instead of relying on human annotations. In iterative DPO, the model produces several responses per prompt, a judge (a reward model or a strong LLM) labels the better one, and DPO is run on the resulting pairs; the loop then repeats with the improved model. SPIN plays the model against its own previous iteration: human-written SFT responses serve as "chosen" and the model's own generations as "rejected", so the model improves without new human preference labels.


Interview Corner Cases — RLHF & DPO 🎯

  • "Why can't we just optimize the policy directly against the reward model without the KL penalty?" → Reward hacking. The model finds degenerate strategies that maximize the learned reward function but don't actually satisfy human preferences. Classic example: a reward model trained to prefer longer responses gets gamed by repetitive verbose output.
  • "What is the difference between the policy, the reference model, and the reward model in RLHF?" → Policy (π): The model being trained. Reference (π_ref): Frozen copy of the SFT model, used to compute the KL penalty that prevents reward hacking. Reward model (RM): Separately trained model that scores responses based on human preferences.
  • "How does DPO eliminate the need for a reward model?" → DPO reparameterizes the RLHF objective in terms of the optimal policy directly. Mathematically: the optimal policy's log-probability ratio is a deterministic function of the reward function. So you can express the reward in terms of policy probabilities and optimize the policy directly on preference data.
  • "What is reward model overoptimization?" → As you continue optimizing the policy against the reward model, it eventually learns to exploit artifacts in the RM rather than genuine human preferences. The RM is an imperfect model of human values, and sufficiently powerful optimization will find its blind spots. Solution: stop early, use stronger KL penalties, or retrain the RM periodically.
  • "Why do companies like Anthropic use Constitutional AI rather than RLHF?" → Scalability and consistency. Human annotation is expensive, slow, and inconsistent. Constitutional AI generates AI feedback at scale, with consistent application of written principles. It's also more auditable — you can inspect and update the constitution.
  • "What is the difference between harmlessness and helpfulness in alignment? Are they in conflict?" → Yes, often! An overly-cautious model refuses legitimate requests to avoid being harmful. An overly-helpful model fulfills harmful requests. The goal is to correctly identify actual harm (not perceived harm). Anthropic calls this the "assistant-brained" failure mode: refusing too much. The alignment community calls excessive safety "excessive harmlessness" or "overcorrection."

Next: LLM Interview Cheatsheet — Every corner-case concept condensed into one reference.