Chapter 5 — Fine-tuning & Alignment
Part 2: LoRA & QLoRA — The Art of Fine-tuning Without Breaking the Bank
LoRA is the single most important practical technique for working with LLMs. It lets you fine-tune a 7B parameter model on a single consumer GPU. This chapter builds the intuition from the ground up.
The Core Problem LoRA Solves
Fine-tuning a 7B parameter model (full fine-tuning) requires:
- Model weights: 7B × 2 bytes (BF16) = 14GB
- Gradients: another 14GB (same size as weights)
- Optimizer states (AdamW): two moments × 4 bytes × 7B = 56GB
- Total: ~84GB+ before activations → requires at least two A100 80GB GPUs (more once activations and longer sequences are counted)
LoRA reduces this to ~16GB — fitting on a single A100 or even a 4090.
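To make the arithmetic concrete, here is a back-of-the-envelope sketch in plain Python (illustrative only; activations and framework overhead are ignored):

params = 7e9
BF16, FP32 = 2, 4  # bytes per value

# Full fine-tuning: BF16 weights + BF16 gradients + two FP32 AdamW moments
full_ft = params * (BF16 + BF16 + 2 * FP32)
print(f"full fine-tuning: {full_ft / 1e9:.0f} GB")  # 84 GB

# LoRA: frozen BF16 base weights, plus weights/gradients/moments for the adapters only
lora_params = 8.4e6  # r=8 on all attention matrices (computed later in this part)
lora = params * BF16 + lora_params * (BF16 + BF16 + 2 * FP32)
print(f"LoRA: {lora / 1e9:.1f} GB plus activations")  # ~14 GB → ~16 GB total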
The Low-Rank Intuition — Building It From Scratch
What Is a Low-Rank Matrix?
A rank-r matrix is one that can be written as the product of two thin matrices:
W shape: (m × n) — a large weight matrix
↓
W = A × B — low-rank decomposition
A shape: (m × r)
B shape: (r × n)
where r << min(m, n)
Instead of storing m×n values, you store m×r + r×n values.
For a 4096×4096 matrix with rank r=8:
- Full: 4096² = 16.7M values
- Low-rank: 4096×8 + 8×4096 = 65,536 values ← 256× smaller!
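You can verify both the rank and the compression ratio numerically. A quick NumPy sketch (values illustrative; the rank check runs an SVD, so it takes a moment at this size):

import numpy as np

m = n = 4096
r = 8
A = np.random.randn(m, r)
B = np.random.randn(r, n)
W = A @ B                           # a full (m × n) matrix, but only rank r

print(np.linalg.matrix_rank(W))     # 8
print(m * n)                        # 16777216 values for the full matrix
print(m * r + r * n)                # 65536 values for the two factors
print((m * n) / (m * r + r * n))    # 256.0x compression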
The LoRA Hypothesis
During fine-tuning, the change in model weights $\Delta W = W_{\text{finetuned}} - W_{\text{pretrained}}$ has low intrinsic rank.
What does this mean intuitively? The pretrained model has already learned a rich representation of language. Fine-tuning for a specific task doesn't require rearranging the entire weight space — it requires a relatively small adjustment that lives in a low-dimensional subspace.
This is analogous to: you've already learned to write well. Learning to write in a specific style (e.g., "formal academic tone") is a small adjustment — you're not re-learning grammar, vocabulary, or sentence structure. You're making a targeted modification.
Evidence: Aghajanyan et al. (2020) showed that fine-tuned models have much lower "intrinsic dimensionality" than their parameter count suggests. The effective number of parameters needed for fine-tuning is in the hundreds to thousands, not billions.
How LoRA Works — Step by Step
The Setup
For each weight matrix $W_0 \in \mathbb{R}^{m \times n}$ that you want to adapt:
h = W₀ x + ΔW x
h = W₀ x + BA x (where ΔW = BA)
B ∈ ℝ^(m×r) initialized to zeros (so ΔW = 0 at start)
A ∈ ℝ^(r×n) initialized to N(0, σ²)
Critical initialization:
- B = 0 at start → ΔW = BA = 0 at start → output = W₀x (pretrained behavior preserved)
- A = random → ensures gradients flow from the first step
- As training proceeds, B updates and ΔW = BA grows from 0
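Putting the pieces together, here is a minimal PyTorch sketch of a LoRA-wrapped linear layer. This is an illustration of the math above, not the PEFT implementation; the 0.01 init scale is an arbitrary small σ, and the α/r scaling is explained in the next section:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable rank-r update (sketch)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 8):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)           # freeze W0
        m, n = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, n) * 0.01)  # random: gradients flow from step 1
        self.B = nn.Parameter(torch.zeros(m, r))         # zero: ΔW = BA = 0 at start
        self.scale = alpha / r                           # α/r scaling, explained below

    def forward(self, x):
        # h = W0 x + (α/r) · B A x
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

Wrapping, say, a model's q_proj with LoRALinear(q_proj) reproduces the pretrained output exactly at step 0, because B is all zeros.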
Which Matrices to Apply LoRA To?
In each transformer attention layer:
W_Q (query projection) ← Apply LoRA
W_K (key projection) ← Apply LoRA
W_V (value projection) ← Apply LoRA
W_O (output projection) ← Apply LoRA
In the FFN:
W_gate ← Sometimes
W_up ← Sometimes
W_down ← Sometimes
Original LoRA paper: Applied only to W_Q and W_V. Later work showed applying to all 4 attention matrices (and sometimes FFN) often gives better results.
The Scaling Factor: Alpha (α)
LoRA includes a scaling parameter α. The adapter contribution is scaled by α/r:
h = W₀ x + (α/r) · BA x
This controls the magnitude of the adaptation: setting α = 2r gives ΔW a scaling factor of 2, and setting α = r gives a factor of 1.
Common practice: set α = r (scaling factor 1) or α = 2r. In the PEFT library this is the lora_alpha argument.
Interview corner case 🎯: "Why is B initialized to zero and not A?" — If both were initialized to random, ΔW = BA would be non-zero at the start, immediately corrupting the pretrained model's behavior. By making B=0, we ensure the model starts at its pretrained behavior (ΔW=0) and gradually learns a useful adaptation. A is random to break symmetry and allow gradients to flow.
LoRA Hyperparameters: What They Do
Rank (r)
The bottleneck dimension. Controls the expressiveness of the adaptation.
r = 4: Very parameter-efficient. Good for narrow tasks (specific style, format)
r = 8: Common default. Good balance for most tasks.
r = 16: More expressive. Good for complex tasks requiring broad capability changes.
r = 64: Near full fine-tuning expressiveness. Used when quality is critical.
r = 256: Very high rank. Approaches full fine-tuning but still more efficient.
Total LoRA parameters for a 7B model with r=8, targeting all attention matrices:
32 layers × 4 matrices × (4096×8 + 8×4096) = 32 × 4 × 65,536 ≈ 8.4M parameters
(vs. 7B total parameters → ~0.12% of model)
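A quick sanity check of that count in plain Python (32 layers and hidden size 4096 correspond to the Llama-2-7B shapes used above):

layers, matrices, d, r = 32, 4, 4096, 8
per_matrix = d * r + r * d            # one B (d×r) and one A (r×d)
total = layers * matrices * per_matrix
print(total)                          # 8388608 ≈ 8.4M
print(f"{total / 7e9:.4%}")           # ~0.12% of 7B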
LoRA Alpha (α)
Controls the effective learning rate of the LoRA weights. Higher α = larger updates.
Rule of thumb: α = r (scaling = 1.0) or α = 2r. The learning rate and alpha interact — higher alpha means the LoRA weights have more influence, so you might want a lower LR.
Target Modules
Which layers to apply LoRA to. More layers = more expressive = more memory.
# Minimal (original paper)
target_modules = ["q_proj", "v_proj"]
# Standard (more expressive)
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]
# Aggressive (everything)
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"]
Dropout
LoRA dropout is applied to the input of the LoRA branch (before the A projection):
lora_dropout = 0.05 # Usually small, 0-0.1
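Putting the hyperparameters together: a typical configuration with the HuggingFace PEFT library might look like this (the model name and values are just the common choices discussed above):

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    r=8,                      # rank of the update
    lora_alpha=16,            # α = 2r → scaling factor 2
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()   # prints trainable vs. total parameter counts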
QLoRA — Quantized LoRA: The 4-bit Fine-tuning Revolution
Paper: "QLoRA: Efficient Finetuning of Quantized LLMs" — Dettmers et al., 2023
QLoRA combines two ideas:
- 4-bit NormalFloat quantization of the base model weights
- LoRA adapters in FP16/BF16
This lets you fine-tune a 65B model on a single 48GB GPU!
What Is Quantization?
Store weights in low-bit format to save memory:
- FP32: 32 bits per weight → 28GB for 7B model
- BF16: 16 bits → 14GB for 7B model
- INT8: 8 bits → 7GB for 7B model
- INT4 (NF4): 4 bits → 3.5GB for 7B model!
But 4-bit quantization loses precision. If you directly train a 4-bit model, the quantization error corrupts the gradients and quality suffers.
QLoRA's insight: Freeze the 4-bit quantized base model. Add LoRA adapters in full precision (BF16). Only the tiny LoRA adapters receive gradient updates.
Forward pass:
input → [dequantize W from 4-bit to BF16 just-in-time] → [W × x] → [+ (α/r) · BA x (BF16)]
Backward pass:
gradients flow only through B and A (BF16)
W_4bit receives NO gradient updates
The memory breakdown for a 7B model with QLoRA:
Base model (NF4): 3.5GB
LoRA adapters (BF16): ~0.1GB (for r=8)
Gradient/optimizer: ~0.5GB (only for LoRA params)
Activations: ~2-3GB
Total: ~6-7GB ← can fit on a single 8GB GPU (with a small batch size and short sequences)
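In practice, setting up the 4-bit base model plus LoRA adapters is a few lines with transformers, bitsandbytes, and PEFT. A sketch (the model name is an example):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat 4-bit (next section)
    bnb_4bit_use_double_quant=True,        # double quantization (below)
    bnb_4bit_compute_dtype=torch.bfloat16, # dequantize to BF16 for matmuls
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=bnb_config
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))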
NF4 — NormalFloat 4-bit
Regular INT4 is uniformly spaced (-8, -7, ..., 7). But neural network weights follow a roughly normal distribution — most values are near zero, few are large.
NF4 uses non-uniform spacing: more precision near zero, less at the extremes. This better matches the actual weight distribution, reducing quantization error.
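To see the idea numerically (this illustrates quantile spacing, not the exact NF4 code values, which are derived from quantiles of a standard normal), compare uniformly spaced INT4 levels with quantile-spaced levels:

import numpy as np

int4_levels = np.linspace(-8, 7, 16)       # uniform spacing
probs = np.linspace(0.02, 0.98, 16)        # illustrative quantile grid
samples = np.random.randn(1_000_000)       # weights are roughly normal after scaling
nf4_like = np.quantile(samples, probs)     # dense near 0, sparse in the tails

print(np.round(np.diff(int4_levels), 2))   # constant gaps between levels
print(np.round(np.diff(nf4_like), 2))      # small gaps near 0, larger at the extremes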
Interview corner case 🎯: "What is double quantization in QLoRA?" — Quantize the quantization constants themselves! The 4-bit quantization requires "scale" constants (one per group of weights). These constants are in FP32. Double quantization compresses these constants to 8-bit, saving an additional ~0.37 bits per parameter. Total effective bitwidth: ~4.37 bits per parameter.
LoRA After Training: Merging Weights
After LoRA training, you have two options for deployment:
Option 1: Keep LoRA Separate (Recommended for Serving)
# At inference: add LoRA output to base model output
output = base_model(x) + lora_scale * (B @ A @ x)  # lora_scale = α/r
# Slightly slower (two extra small matrix multiplies) but flexible
# Can swap different LoRA adapters at runtime
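With PEFT, keeping adapters separate lets you hot-swap them on a single base model. A sketch (the adapter paths and names are hypothetical):

from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = PeftModel.from_pretrained(base, "path/to/adapter_summarize", adapter_name="summarize")
model.load_adapter("path/to/adapter_sql", adapter_name="sql")

model.set_adapter("summarize")   # route requests through one adapter...
model.set_adapter("sql")         # ...or switch without reloading the 7B base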
Option 2: Merge LoRA Into Base Model
# Merge permanently: W_merged = W_pretrained + (α/r) × B @ A
# After merging: standard transformer, no overhead, can't unmerge
from transformers import AutoModelForCausalLM
from peft import PeftModel

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = PeftModel.from_pretrained(model, "path/to/lora_adapter")
merged = model.merge_and_unload()  # W = W0 + (α/r)·BA, adapters folded in
merged.save_pretrained("merged_model")
Merging is useful for deployment — you get the fine-tuned model as a single set of weights, with no LoRA overhead. Downside: you can't use the base model or swap adapters.
LoRA Variants You Should Know
LoRA+ (2024)
Sets different learning rates for the A and B matrices: B (the output projection) trains at a higher LR than A (the input projection). The authors report 1-2% quality gains and faster convergence.
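The idea is easy to express with optimizer parameter groups. A hedged sketch, assuming `model` is a PEFT-wrapped model (PEFT parameter names contain "lora_A"/"lora_B"; the 16× LR ratio is one value the paper suggests):

import torch

a_params = [p for n, p in model.named_parameters() if "lora_A" in n]
b_params = [p for n, p in model.named_parameters() if "lora_B" in n]

optimizer = torch.optim.AdamW([
    {"params": a_params, "lr": 1e-4},    # lower LR for A
    {"params": b_params, "lr": 1.6e-3},  # higher LR for B (e.g., 16×)
])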
DoRA (Weight-Decomposed LoRA, 2024)
Decomposes the pretrained weight into magnitude and direction, adapts them separately. Closer to full fine-tuning performance, especially for smaller rank values.
LoftQ (2023)
Better initialization: quantize the base model to 4-bit, but compensate the quantization error by initializing the LoRA matrices to absorb that error. Starts closer to the pretrained model, converges faster.
AdaLoRA (2023)
Dynamically allocates rank across different weight matrices based on their "importance" (measured by singular value decomposition). More rank to important matrices, less to unimportant ones.
Full Interview Corner Cases — LoRA & QLoRA 🎯
- "How does LoRA achieve the same quality as full fine-tuning with 100× fewer parameters?" → The pre-training already captured the broad capabilities. Fine-tuning for a specific task requires a small "adjustment" that lives in a low-dimensional subspace. LoRA exploits this by parameterizing ΔW = BA, which efficiently represents low-rank updates.
- "What rank should I use for LoRA?" → Start with r=8, which works well for most tasks. Increase to 16-64 if you notice quality is insufficient. For very narrow tasks (specific format/style), r=4 may be enough. For broad capability changes, r=64+.
- "Can LoRA overfit?" → Yes! LoRA has fewer parameters than full fine-tuning but can still overfit on very small datasets. Signs: train loss keeps dropping but val loss increases. Remedies: add LoRA dropout, reduce training steps, increase dataset size.
- "What is the difference between LoRA rank and LoRA alpha?" → Rank controls the expressiveness (how many "directions" of change the adapter can represent). Alpha controls the scale/magnitude of those changes. They interact: the effective contribution of the adapter is proportional to alpha/rank.
- "After training with LoRA, can you continue training from that checkpoint?" → Yes — just load the base model and the LoRA adapter, and resume training. Don't merge before continuing unless you want to start from the merged weights.
- "What's the difference between PEFT and LoRA?" → PEFT is a library by HuggingFace that implements many Parameter-Efficient Fine-Tuning methods. LoRA is one method it implements. Others include prefix tuning, prompt tuning, IA³, AdaLoRA, etc.
- "Can you apply multiple LoRA adapters at the same time?" → Yes, with some frameworks! You can sum the contributions:
W_merged = W0 + ΔW_adapter1 + ΔW_adapter2. This allows modular capability injection. Used in "LoRA Hub" and similar systems.
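PEFT exposes this directly. A sketch, assuming the two hypothetically named adapters from the serving example above are already loaded (the equal weights are just one design choice):

# Combine two already-loaded adapters into a new one
model.add_weighted_adapter(
    adapters=["summarize", "sql"],
    weights=[1.0, 1.0],
    adapter_name="combined",
    combination_type="linear",
)
model.set_adapter("combined")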
Next: RLHF and DPO: Alignment from Human Feedback — How we teach models to be helpful and safe.