Chapter 4 — Pre-training at Scale
Part 2: Scaling Laws — The Science Behind "Bigger is Better"
This chapter explains how to make the most important decisions in LLM training before you write a single line of code: how big should my model be, and how many tokens should I train on?
What Are Scaling Laws?
In 2020, Kaplan et al. (OpenAI) discovered that LLM performance improves in a remarkably predictable way as you scale three variables:
- N: Model size (number of parameters)
- D: Dataset size (number of training tokens)
- C: Compute budget (FLOPs = floating point operations)
The relationship is a power law: plotted on log-log axes, loss vs. scale is a straight line.
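Concretely, Kaplan et al. fit forms like the following (the exponents are approximate and specific to their setup; treat them as indicative):

$$L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N},\ \alpha_N \approx 0.076 \qquad L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D},\ \alpha_D \approx 0.095$$

On log-log axes these become straight lines; note that a 10× increase in $N$ multiplies the loss by $10^{-0.076} \approx 0.84$.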
The intuition: each 10× increase in parameters shaves only about 15-20% off the loss, not 90%. Diminishing returns, but predictably so.
This means you can train a tiny model, extrapolate its loss curve, and predict what a 100× bigger model will achieve — before spending millions on training.
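Here is a minimal sketch of that workflow; the loss values are invented placeholders, and a real study would fit across many training runs:

# Fit a power law to small-model losses, then extrapolate (all numbers made up).
import numpy as np

params = np.array([1e6, 1e7, 1e8, 1e9])   # small pilot models
losses = np.array([5.0, 4.2, 3.5, 2.9])   # hypothetical validation losses

# L = a * N^(-b) is a straight line in log-log space: log L = log a - b*log N
slope, log_a = np.polyfit(np.log(params), np.log(losses), deg=1)
predict = lambda n: np.exp(log_a) * n ** slope   # fitted slope comes out negative

print(predict(1e11))   # predicted loss for a 100x larger model (~2.0 here)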
The Chinchilla Revolution (2022)
Paper: "Training Compute-Optimal Large Language Models" — Hoffmann et al., DeepMind
The Kaplan et al. recommendation was: given a fixed compute budget, put most of it into model size. Their fits said to grow parameters much faster than data, so the best move was a very large model trained on relatively few tokens, stopped well short of convergence.
In practice: GPT-3 (175B parameters) was trained on only 300B tokens — less than 2 tokens per parameter.
Chinchilla reanalyzed this and found: GPT-3 was massively undertrained. The optimal strategy is to scale model size AND data size equally.
The Chinchilla Scaling Laws
For a compute budget $C$ (in FLOPs):
- Optimal model size: $N_{\text{opt}} \propto C^{0.5}$ (approximately)
- Optimal training tokens: $D_{\text{opt}} \propto C^{0.5}$ (approximately)
- Therefore: $N$ and $D$ should grow in equal proportion as compute grows, keeping the ratio $D/N$ roughly constant. Plugging $D \approx 20N$ into $C = 6ND$ gives $N_{\text{opt}} \approx (C/120)^{0.5}$ and $D_{\text{opt}} \approx 20\,N_{\text{opt}}$.
In practice: train ~20 tokens per parameter for compute-optimal training.
GPT-3: N = 175B, D = 300B → D/N = 1.7 tokens/param (severely undertrained)
Chinchilla: N = 70B, D = 1.4T → D/N = 20 tokens/param (compute optimal)
LLaMA 7B: N = 7B, D = 1T → D/N = 143 tokens/param (overtrained for inference)
LLaMA 70B: N = 70B, D = 2T → D/N = 29 tokens/param (near compute optimal)
Chinchilla (70B) outperformed Gopher (280B) despite being 4× smaller, because it was trained on 5× more tokens.
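A back-of-the-envelope helper, assuming only the simplified rules above ($C = 6ND$ and $D \approx 20N$); real compute-optimal fits shift with data quality and architecture:

# Toy Chinchilla-style allocator based on C = 6*N*D with D = 20*N.
def chinchilla_optimal(compute_flops, tokens_per_param=20):
    n_opt = (compute_flops / (6 * tokens_per_param)) ** 0.5   # parameters
    d_opt = tokens_per_param * n_opt                          # training tokens
    return n_opt, d_opt

n, d = chinchilla_optimal(1e23)
print(f"N = {n/1e9:.0f}B params, D = {d/1e9:.0f}B tokens")    # N = 29B, D = 577B

The 1e23 FLOPs example here is the same budget as the interview question below.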
The "Inference-Optimal" Correction
Wait — Chinchilla says train 20 tokens/param. But LLaMA uses 143 tokens/param (for the 7B model). Is LLaMA wrong?
No. Chinchilla tells you the optimal split of training compute only. But once a model is trained, you serve it to millions of users. For inference:
- A smaller model with more training is cheaper to serve
- A 7B model costs roughly 10× less per inference than a 70B model (per-token cost scales about linearly with parameter count)
So inference-optimal training = train a smaller model for much longer. You spend more compute training it, but you save more compute serving it.
LLaMA's philosophy: "I want a model that's as small as possible for inference, but as capable as possible. Train the 7B model on 1T tokens, even though the training is 'suboptimal' by Chinchilla standards."
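The inference-side arithmetic: generating one token costs roughly $2N$ FLOPs (forward pass only), so serving cost scales about linearly with parameter count. A quick check:

# Forward pass only: roughly 2 FLOPs per parameter per generated token.
cost_7b  = 2 * 7e9    # ~1.4e10 FLOPs/token
cost_70b = 2 * 70e9   # ~1.4e11 FLOPs/token
print(cost_70b / cost_7b)   # 10.0 -> the 70B model is ~10x more expensive to serve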
Interview corner case 🎯: "You have a fixed compute budget of 1e23 FLOPs. Should you train a 7B model on 2T tokens, or a 70B model on 200B tokens?"
- Chinchilla optimal: ~29B model on ~580B tokens (from $C = 6ND$ with $D \approx 20N$)
- But if you're deploying to users: smaller model with more data (7B + 2T) because inference is much cheaper
- If you're doing research and only running it once: go compute-optimal
- The answer depends on your deployment scenario!
Understanding FLOPs
FLOPs (Floating Point Operations) measure computational cost.
For a transformer model:
FLOPs ≈ 6 × N × D
Where:
N = number of parameters
D = number of training tokens
6 = the ~2N FLOPs per token of the forward pass plus the ~4N FLOPs per token of the backward pass
Examples:
GPT-2 (117M params, 100B tokens training): 6 × 117e6 × 100e9 = 7e19 FLOPs
GPT-3 (175B params, 300B tokens): 6 × 175e9 × 300e9 = 3.15e23 FLOPs
LLaMA 65B (1.4T tokens): 6 × 65e9 × 1.4e12 = 5.46e23 FLOPs
A100 GPU performance: ~312 TFLOP/s (FP16) = 3.12e14 FLOPs/second
Training GPT-3 with 1 A100: 3.15e23 / 3.12e14 ≈ 1 billion seconds ≈ 32 years
That's why large training runs use thousands of GPUs in parallel. With 10,000 A100s (at ~50% efficiency): 3.15e23 / (10,000 × 0.5 × 3.12e14) ≈ 2e5 seconds ≈ 2.3 days
Interview corner case 🎯: "What is MFU (Model FLOP Utilization)?" — The fraction of theoretical peak FLOP/s that your training actually achieves. Top labs achieve 30-50% MFU on A100s, 40-60% on H100s. If your MFU is <30%, investigate: communication bottlenecks (gradient all-reduce), data loading, or suboptimal batch size.
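The back-of-envelope versions of both calculations, with peak FLOP/s and MFU as inputs you would substitute for your own hardware:

# Training-time and MFU arithmetic, mirroring the GPT-3 numbers above.
def training_days(n_params, n_tokens, n_gpus, peak_flops=3.12e14, mfu=0.5):
    total_flops = 6 * n_params * n_tokens
    return total_flops / (n_gpus * peak_flops * mfu) / 86_400

def measured_mfu(tokens_per_sec, n_params, n_gpus, peak_flops=3.12e14):
    # achieved FLOP/s (6N per training token) over theoretical peak FLOP/s
    return 6 * n_params * tokens_per_sec / (n_gpus * peak_flops)

print(training_days(175e9, 300e9, n_gpus=10_000))   # ~2.3 days at 50% MFU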
Scaling Law Surprises — Emergent Abilities
The most exciting (and disputed) finding from scaling: emergent abilities — capabilities that appear suddenly as you cross a certain scale threshold.
Model size (log scale):   1M    10M    100M    1B    10B    100B    1T
Task accuracy:            0%    0%     0%      0%    0%     75%     90%
In this stylized illustration, the ability "emerges" sharply around 100B parameters: below the threshold the model looks incapable, then performance jumps.
Examples:
- 3-digit arithmetic: Appears around 10B parameters
- Multi-step reasoning: Appears around 50-100B
- Chain-of-thought prompting: Appears around 100B (smaller models don't benefit)
- Instruction following (without fine-tuning): Appears around 50-100B
The controversy: Is it really "emergence" or just a measurement artifact? If you use a finer-grained metric (partial credit instead of all-or-nothing), do abilities appear gradually? Recent work (Schaeffer et al., 2023) argues many "emergent" abilities are artifacts of discontinuous metrics. The community is still debating.
Interview corner case 🎯: "What is the difference between 'emergent abilities' and just 'the metric threshold was hit'?" If a task requires 5 correct reasoning steps and you only get credit when all 5 are correct, a model that's 60% correct at each step scores ~8% (0.6^5), one that's 80% correct scores ~33% (0.8^5), and one that's 95% correct scores ~77% (0.95^5). Task accuracy is a steep function of per-step accuracy, so a gradual improvement in the underlying capability looks like a sudden jump on the all-or-nothing metric. This is the metric artifact argument.
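The whole argument fits in two lines of code, assuming 5 all-or-nothing steps:

# Task accuracy under an all-or-nothing metric, as per-step accuracy improves.
for p in [0.6, 0.7, 0.8, 0.9, 0.95]:
    print(f"per-step {p:.2f} -> task {p**5:.0%}")   # 8%, 17%, 33%, 59%, 77%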
Chapter 4 Training Tricks (The Practical Side)
Batch Size and Learning Rate
Rule of thumb: Linear scaling rule — when you multiply batch size by k, multiply learning rate by k.
# If base LR = 1e-4 at batch_size = 256,
# then at batch_size = 2048 (8x larger): LR = 8e-4.
def scale_lr(base_lr, base_batch, new_batch):
    return base_lr * new_batch / base_batch

scale_lr(1e-4, 256, 2048)   # 8e-4
# But the rule breaks down at very large batch sizes
# (diminishing returns in gradient signal quality).
For LLMs, typical settings:
- Batch size: 1M–4M tokens (packed sequences)
- LR: 3e-4 to 1e-4 for AdamW
Gradient Checkpointing (Activation Checkpointing)
Problem: Forward pass stores all intermediate activations for the backward pass. For large models, this can use more memory than the model parameters themselves!
Solution: Don't store activations. Recompute them during the backward pass.
Tradeoff: ~33% slower training, but ~5-10× memory reduction. Crucial for training large models.
# In PyTorch:
from torch.utils.checkpoint import checkpoint

# Instead of:
x = transformer_layer(x)
# Use:
x = checkpoint(transformer_layer, x, use_reentrant=False)   # recomputes forward during backward
Mixed Precision Training
Train in FP16/BF16 (16-bit) instead of FP32 (32-bit):
- 2× less memory
- 2-3× faster on modern GPUs (special tensor cores for FP16)
- Risk: numerical instability (overflow/underflow)
BF16 vs FP16:
- FP16: 5 bits exponent, 10 bits mantissa — can overflow with large values
- BF16: 8 bits exponent, 7 bits mantissa — same range as FP32, just lower precision
- BF16 is generally the better choice for training LLMs when the hardware supports it (no overflow, no loss scaling needed), at the cost of coarser precision
# In PyTorch with BF16 via autocast (no GradScaler needed; loss scaling
# only matters for FP16, where small gradients underflow):
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    logits, loss = model(x, y)
loss.backward()
# For FP16, you would wrap the backward pass with torch.cuda.amp.GradScaler().
Distributed Training
Just storing a 7B model's weights in BF16 takes 14GB of GPU memory (2 bytes per parameter); a 70B model needs 140GB. Training adds gradients and optimizer states on top, so the real footprint is several times larger. This doesn't fit on a single GPU.
Data Parallelism (simplest): Copy the full model to N GPUs. Each GPU processes a different batch shard. Synchronize gradients with AllReduce after each step.
GPU 0: batch[0:64] → gradient_0
GPU 1: batch[64:128] → gradient_1
GPU 2: batch[128:192] → gradient_2
GPU 3: batch[192:256] → gradient_3
All-reduce: average gradients across GPUs
All: update with averaged gradient
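A minimal runnable sketch of the pattern above (single node, one process per GPU, launched with torchrun; the Linear layer stands in for a real transformer):

# Run with: torchrun --nproc_per_node=4 this_script.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")                    # one process per GPU
rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(rank)

model = torch.nn.Linear(1024, 1024).cuda(rank)     # stand-in for a transformer
model = DDP(model, device_ids=[rank])              # wraps gradient AllReduce
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(64, 1024, device=rank)             # each rank sees its own shard
loss = model(x).pow(2).mean()                      # dummy loss for illustration
loss.backward()                                    # AllReduce fires during backward
opt.step()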
Tensor Parallelism (for very large models): Split individual weight matrices across GPUs. Each GPU holds a slice of each matrix. Requires inter-GPU communication during every forward/backward pass.
Pipeline Parallelism: Assign different layers to different GPUs. GPU 0 runs layers 1-8, GPU 1 runs layers 9-16, etc. Requires micro-batching to keep all GPUs busy.
3D Parallelism (Megatron-LM, DeepSpeed): Combines Data + Tensor + Pipeline parallelism. Used to train models at 1T+ parameters.
Interview corner case 🎯: "What is ZeRO optimization (DeepSpeed)?" — ZeRO (Zero Redundancy Optimizer) partitions optimizer states, gradients, and model parameters across GPUs instead of replicating them (unlike standard Data Parallelism). ZeRO Stage 3 shards everything — you can train a 70B model on 8 GPUs with relatively little memory per GPU. The tradeoff is more communication overhead.
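As a sketch of what enabling it looks like (a hypothetical minimal config; real configs also tune batch size, optimizer, and offload settings for your cluster):

# Hypothetical minimal DeepSpeed config enabling ZeRO Stage 3:
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 3},   # shard params, grads, optimizer states
}
# model_engine, opt, _, _ = deepspeed.initialize(model=model, config=ds_config)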