Chapter 2 — Building a Transformer From Scratch

Part 1: Attention — A Deep Dive with Intuition

This chapter has two parts. Read this theory file first, then run 02_gpt_from_scratch.py to see it all work together.

Why Attention? The Problem It Solves

Before attention existed, models faced a brutal tradeoff. In seq2seq translation:

"The cat sat on the mat" → [LSTM] → fixed-size vector → "Le chat était assis sur le tapis"

The LSTM had to compress the entire meaning of the English sentence into a single vector of, say, 512 numbers. Then the decoder had to decompress that vector into French. For short sentences, fine. For long sentences — information bottleneck.

The 2015 Bahdanau attention paper proposed a fix: instead of one compressed vector, let the decoder look at all the encoder hidden states simultaneously and choose which ones to focus on.

The Transformer took this further: remove the LSTM entirely. Just use attention.


The Query-Key-Value Framework

The most important abstraction in modern deep learning.

flowchart LR
    X[Input X] -->|W_Q| Q[Query Q]
    X -->|W_K| K[Key K]
    X -->|W_V| V[Value V]
    Q --> S["Score: QK_T divided by sqrt(dk)"]
    K --> S
    S --> SM[Softmax: attention weights]
    SM --> O[Weighted sum of V]
    V --> O
    O --> OUT[Output per token]
    style X fill:#fafaf9,stroke:#a8a29e
    style OUT fill:#fafaf9,stroke:#a8a29e

The Search Engine Analogy

Imagine you're searching YouTube.

  • Query: what you type in the search bar ("funny cat videos")
  • Keys: the titles/tags of all videos in the database
  • Values: the actual video content

The system computes a relevance score between your query and every key. The most relevant keys get the highest scores. You retrieve a weighted mix of values — the top videos.

Attention is exactly this, but:

  • The "database" is all the tokens in the sequence
  • The "query" comes from the current token
  • The model learns what makes a good query, key, and value

Formally

For a sequence of tokens, we create three matrices by linearly projecting the embeddings:

Q = X · W_Q    # Queries: "what am I looking for?"    shape: (T, d_k)
K = X · W_K    # Keys:    "what do I advertise?"      shape: (T, d_k)
V = X · W_V    # Values:  "what information do I give?" shape: (T, d_v)

Where X is the input matrix of shape (T, d_model) — T tokens, each with d_model-dimensional embedding.

Then the full attention formula:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Step by step through the math:

Step 1: $QK^T$ — compute all pairwise similarities

$QK^T$ has shape $(T, T)$. Entry $[i, j] = Q_i \cdot K_j$ = how much token $i$ should attend to token $j$.

This is a dot product between two vectors — it's high when the vectors point in similar directions, low when they're perpendicular, negative when they point away.

Step 2: $/ \sqrt{d_k}$ — scale down

Why? If $d_k = 64$, the dot products have standard deviation $\approx \sqrt{64} = 8$. Without scaling, $\text{softmax}(\text{large values})$ saturates and gradients vanish. Dividing by $\sqrt{d_k}$ keeps variance $\approx 1$.

The intuition: imagine two vectors of dimension 64, each component drawn from $\mathcal{N}(0,1)$. The dot product has variance $= 64$ (sum of 64 independent unit-variance terms). $\sqrt{64} = 8$. After dividing, variance $= 1$. Softmax works well in this range.
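
A quick numerical sanity check of this claim (a throwaway sketch; the sample size is arbitrary):

import torch

d_k = 64
q = torch.randn(100_000, d_k)              # components drawn from N(0, 1)
k = torch.randn(100_000, d_k)
dots = (q * k).sum(dim=-1)                 # 100k independent d_k-dimensional dot products
print(dots.std())                          # ≈ 8.0 = sqrt(64)
print((dots / d_k ** 0.5).std())           # ≈ 1.0 after scaling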

Step 3: softmax — turn scores into attention weights

$$\text{softmax}(z)_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$$

Each row of the attention matrix now sums to 1. Row i gives the probability distribution over which tokens token i attends to.

Step 4: · V — weighted average of values

Output[i] = Σ_j attention_weight[i,j] × V[j]

Token i's output is a weighted sum of all value vectors, with weights determined by attention scores.

The output has shape (T, d_v). In the single-head case where d_v = d_model, that matches the input shape. Each token's representation has been "updated" by mixing in information from the tokens it attends to.
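
Putting the four steps together, here is a minimal single-head sketch in PyTorch (the function name, shapes, and random weights are illustrative; this is not the code from 02_gpt_from_scratch.py):

import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q, K: (T, d_k), V: (T, d_v)
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # Steps 1 + 2: (T, T) scaled similarities
    if mask is not None:
        scores = scores + mask                          # optional causal mask (next section)
    weights = F.softmax(scores, dim=-1)                 # Step 3: each row sums to 1
    return weights @ V                                  # Step 4: weighted sum of values

T, d_model = 4, 8
X = torch.randn(T, d_model)
W_Q, W_K, W_V = (torch.randn(d_model, d_model) for _ in range(3))
out = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(out.shape)  # torch.Size([4, 8])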


Causal (Masked) Attention — The Key for Language Models

Here's the problem: during training, if token 5 can see token 6, it's "cheating" — it's reading the answer while taking the test. For autoregressive language modeling, each token should only attend to previous tokens.

Solution: causal masking. Before softmax, set all future positions to -infinity:

mask[i, j] = 0 if j <= i else -inf
scores = QK_T / sqrt(d_k) + mask
attention = softmax(scores)

After softmax:

  • exp(-inf) = 0 → future tokens get zero attention weight
  • Past tokens get their normal weights

Visually, the attention matrix becomes lower-triangular:

     T1   T2   T3   T4
T1 [0.8  0     0    0  ]  ← T1 can only attend to itself
T2 [0.3  0.7   0    0  ]  ← T2 can attend to T1 and T2
T3 [0.2  0.5   0.3  0  ]  ← T3 sees T1, T2, T3
T4 [0.1  0.2   0.4  0.3]  ← T4 sees everything

Why this matters for generation: During inference, you generate one token at a time. The model's output at position T only depends on tokens 1..T. This means once you've computed attention up to position T, you can cache the Keys and Values — they won't change when you add token T+1. This is KV-caching.
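
A minimal sketch of building this mask in PyTorch (torch.tril gives the lower-triangular part; the helper name is illustrative). Passing it as the mask argument of the attention sketch above produces exactly the lower-triangular pattern shown:

import torch

def build_causal_mask(T):
    # 0 where attention is allowed (j <= i), -inf where it is blocked (j > i)
    allowed = torch.tril(torch.ones(T, T, dtype=torch.bool))
    return torch.zeros(T, T).masked_fill(~allowed, float("-inf"))

print(build_causal_mask(4))
# tensor([[0., -inf, -inf, -inf],
#         [0., 0., -inf, -inf],
#         [0., 0., 0., -inf],
#         [0., 0., 0., 0.]])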


Multi-Head Attention — Why One Head Isn't Enough

Single attention learns to focus on one type of relationship. But language has many simultaneous relationships:

  • Subject-verb agreement ("The cats that live on the street are hungry")
  • Coreference ("John said he was tired" — "he" → "John")
  • Syntactic structure (adjectives modify nearest noun)
  • Positional patterns (verbs often follow nouns)

Multi-head attention runs H independent attention operations in parallel, each in a smaller subspace:

# Instead of one attention with d_k = 64:
# Run 8 attention heads each with d_k = 8 (total dimensions = 8×8 = 64)

for each head h in range(H):
    Q_h = X · W_Q_h    # (T, d_k)  where d_k = d_model / H
    K_h = X · W_K_h
    V_h = X · W_V_h
    head_h = Attention(Q_h, K_h, V_h)

# Concatenate all head outputs: (T, H × d_v) = (T, d_model)
# Project back: multiply by W_O matrix
output = Concat(head_1, ..., head_H) · W_O
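
A compact runnable version of the same computation, assuming d_model is divisible by the head count (the class name and the fused qkv projection are implementation choices, not the only way to write it):

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # all heads' W_Q, W_K, W_V fused into one matrix
        self.proj = nn.Linear(d_model, d_model)      # W_O

    def forward(self, x, mask=None):
        B, T, d_model = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split into heads: (B, n_heads, T, d_k)
        q, k, v = (t.view(B, T, self.n_heads, self.d_k).transpose(1, 2) for t in (q, k, v))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)       # (B, n_heads, T, T)
        if mask is not None:
            scores = scores + mask
        out = F.softmax(scores, dim=-1) @ v                          # (B, n_heads, T, d_k)
        out = out.transpose(1, 2).contiguous().view(B, T, d_model)   # concatenate heads
        return self.proj(out)                                        # project back with W_O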

Each head can specialize. After training, you can visualize attention patterns:

  • Head 1 might focus on the immediately preceding token (short-range dependency)
  • Head 2 might connect pronouns to their referents (long-range coreference)
  • Head 3 might focus on verbs attending to subjects

The key insight: different heads learn different notions of "relevance", and the final output integrates all of them.

Interview Corner Case 🎯: "If all heads start with the same weights and same input, why do they learn different things?" — They don't start with the same weights! Each head has its own learned projection matrices W_Q_h, W_K_h, W_V_h initialized randomly. They learn different projections during training. Also, the gradient flows differently to each head because they're separate paths.


Positional Encoding — Giving the Model a Sense of Order

Here's a subtle but critical point: attention has no built-in sense of word order (formally, it is permutation-equivariant).

If you shuffle all the tokens in a sentence and run attention, you get the same output (just shuffled). "Dog bites man" and "Man bites dog" are processed identically — different meaning, same attention.

To fix this, we inject position information. Two approaches:

1. Learned Absolute Positional Embeddings (GPT-2, BERT)

position_embedding = nn.Embedding(max_seq_len, d_model)
x = token_embedding(tokens) + position_embedding(positions)

Simple: learn a separate embedding for each absolute position. Position 0 has its own vector, position 1 has its own vector, etc.

Limitation: The model can only handle sequences up to max_seq_len. If max was 1024, there is simply no embedding to look up for positions beyond that, so longer inputs can't be processed at all.

2. Sinusoidal Positional Encoding (Original Transformer)

$$PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
$$PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

Fixed (not learned). Uses sin/cos waves at different frequencies for different dimensions.

Why it works: The dot product PE(pos)·PE(pos+k) is a function only of k (the relative offset). So the model can infer relative distances — even for positions not seen in training.
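
A short sketch of these formulas in code (the function name is illustrative; the exp/log form is just a numerically convenient way to compute the 1/10000^(2i/d_model) frequencies):

import math
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    # PE[pos, 2i] = sin(pos * freq_i), PE[pos, 2i+1] = cos(pos * freq_i)
    pos = torch.arange(max_len).float().unsqueeze(1)                                # (max_len, 1)
    freqs = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * freqs)
    pe[:, 1::2] = torch.cos(pos * freqs)
    return pe

pe = sinusoidal_positional_encoding(1024, 64)
print(pe.shape)  # torch.Size([1024, 64])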

3. Rotary Position Embedding — RoPE (LLaMA, Mistral, GPT-NeoX)

The state-of-the-art approach. Instead of adding position info to tokens, rotate the Q and K vectors by an angle proportional to their position:

$$Q_{rotated}[pos] = \text{Rotate}(Q[pos], \theta \times pos)$$
$$K_{rotated}[pos] = \text{Rotate}(K[pos], \theta \times pos)$$

Where Rotate(v, angle) applies a rotation matrix in 2D subspaces.

The key property: Q_rotated[p] · K_rotated[q] depends only on the original Q and K vectors and on the relative offset p − q. The attention score therefore encodes relative position directly, which is why RoPE generalizes much better to longer sequences.

Intuition: Think of a clock hand. Instead of saying "I'm at position 5", you say "I've rotated 5 ticks". Any two hands know their relative angle, regardless of what tick they're on.
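
A minimal sketch of the rotation itself, pairing consecutive dimensions and rotating each pair by a position-dependent angle (real implementations such as LLaMA's differ in layout and caching details; apply_rope is an illustrative name):

import torch

def apply_rope(x, base=10000.0):
    # x: (T, d) with d even; rotate each (even, odd) dimension pair by angle pos * freq_i
    T, d = x.shape
    pos = torch.arange(T).float().unsqueeze(1)              # (T, 1)
    freqs = base ** (-torch.arange(0, d, 2).float() / d)    # (d/2,) decreasing frequencies
    angles = pos * freqs                                    # (T, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]                         # pair up dimensions
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                      # standard 2D rotation
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Queries and keys are rotated before the QK^T dot product; values are left untouched:
# q_rot, k_rot = apply_rope(q), apply_rope(k)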

Interview Corner Case 🎯: "How does RoPE enable context length extension?" — Since RoPE encodes relative positions, you can extend context by adjusting the rotation base (the 10000 constant in the formula). YaRN, LongRoPE, and similar methods use this. You can sometimes extend a model trained at 4K context to 128K with only fine-tuning.


The Full Transformer Block

A single transformer layer = multi-head attention + feedforward network + layer normalization + residual connections:

                     Input x
                       │
            ┌──────────┘    (residual connection)
            │              ↓
            │       LayerNorm(x)
            │              ↓
            │    Multi-Head Self-Attention
            │              ↓
            └──────────►  x + attn_output
                            │
             ┌──────────────┘    (residual connection)
             │               ↓
             │          LayerNorm(x)
             │               ↓
             │    Feed-Forward Network (MLP)
             │               ↓
             └──────────►  x + ffn_output
                            │
                         Output x

Why Residual Connections?

The residual connection output = x + f(x) means the layer only needs to learn the difference from the identity (the residual). This has two huge benefits:

  1. Gradient flow: Gradients can flow directly through the skip connection without being modified. Even in a 96-layer network, gradients from the loss can reach the first layer.
  2. Learning stability: If a layer is useless, it can learn to output ~0, making itself a near-identity function. The network depth can grow without hurting performance.

Pre-LN vs Post-LN

  • Post-LN (original paper 2017): LayerNorm after residual addition. Hard to train deep networks without careful LR warmup.
  • Pre-LN (GPT-2+): LayerNorm before each sublayer. Much more stable training. The residual path remains unnormalized, allowing gradients to flow freely.

All modern LLMs use Pre-LN.

The FFN (Feed-Forward Network)

Two linear layers with a nonlinearity:

$$\text{FFN}(x) = \text{Activation}(x \cdot W_1 + b_1) \cdot W_2 + b_2$$

Where:

  • W_1: d_model → 4*d_model (expansion)
  • W_2: 4*d_model → d_model (projection back)

Why 4×? Empirically found to work well. The expanded space lets the model compute more complex functions before projecting back.

What does the FFN do? The attention mechanism aggregates information from different positions. The FFN operates on each position independently and non-linearly. Research suggests FFNs act as a key-value memory store — they memorize factual associations ("Paris is the capital of France") during training.

Modern activation functions:

  • ReLU (original): max(0, x) — simple, good
  • GELU: x·Φ(x) (Gaussian CDF) — smoother, slightly better in practice
  • SwiGLU (LLaMA): Swish(x · W_gate) ⊙ (x · W_up), then projected back by a third matrix W_down — gated variant, ~10% better empirically. Uses 3 matrices instead of 2.
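
Putting the pieces together, here is a minimal Pre-LN block sketch in PyTorch, reusing the MultiHeadSelfAttention sketch from earlier in this chapter (GELU and the 4x expansion follow the GPT-2 recipe; the exact code in 02_gpt_from_scratch.py may differ):

import torch.nn as nn

class TransformerBlock(nn.Module):
    # Pre-LN ordering: x = x + Attn(LN(x)); x = x + FFN(LN(x))
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = MultiHeadSelfAttention(d_model, n_heads)  # sketch defined earlier
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),   # expansion
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),   # projection back
        )

    def forward(self, x, mask=None):
        x = x + self.attn(self.ln1(x), mask)   # residual around attention
        x = x + self.ffn(self.ln2(x))          # residual around FFN
        return x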

Building Intuition: What Does Training Actually Do?

Here's a mental model for what happens when a transformer trains:

Early training (loss ~4.0): The model is random. Its attention is random — it looks at uniformly random tokens. Its predictions are uniform over the vocabulary. High loss, high perplexity.

After 1000 steps (loss ~3.0): The model has learned basic n-gram-like patterns. It knows "the" is often followed by a noun, that periods end sentences, that "I am" often starts a clause. Attention heads are starting to specialize — some learn to focus on nearby tokens.

After 10,000 steps (loss ~2.5): Attention heads develop clear specializations. The model handles syntax: subject-verb agreement, sentence structure. FFN layers start storing factual associations. Generated text is grammatical.

After 100,000 steps (loss ~1.5): Long-range coreference, consistent style, factual recall, multi-step reasoning. Some attention heads clearly track specific linguistic phenomena — you can visualize head 5 connecting "it" to the nearest noun.

Training insight: The model doesn't learn "rules of grammar." It learns millions of soft statistical associations that collectively implement something that looks like grammar and understanding.


Interview Corner Cases — Chapter 2 🎯

  • "Why is the scaling factor √d_k and not d_k or √d_model?" → Dot products of random vectors scale with √d_k (not d_k), because we're summing d_k terms each with variance 1, giving total variance d_k, and std = √d_k.
  • "Can you run self-attention without any weight matrices?" → Yes, but it performs terribly. Without W_Q, W_K, W_V, you'd compute raw token-token similarities. The learned projections are crucial — they teach the model how to ask questions and answer them in a learned subspace.
  • "What happens if all attention weights are equal (1/T each)?" → The output becomes just the mean of all value vectors. The model learns nothing position-specific. This is the failure mode when initialization is bad or training is unstable.
  • "Why do we need both attention AND FFN? Couldn't we just stack attention layers?" → Attention mixes information across positions but is linear in the value vectors. FFN adds non-linearity per-position. The combination of cross-position mixing (attention) and within-position non-linear transformation (FFN) is what makes transformers expressive. Attention-only transformers exist but perform worse.
  • "What is the relationship between number of heads and model quality?" → More heads ≠ always better. Pruning experiments show ~20-30% of heads can be removed with minimal performance loss. What matters is the total d_model and d_k per head. 12 heads of size 64 (d_model=768) is roughly equivalent to 6 heads of size 64 plus some parameter reallocation.
  • "What is the difference between attention score and attention weight?" → Score = raw QK^T / √d_k value (can be any real number). Weight = after softmax (between 0 and 1, sums to 1 per row). Weights are what actually get applied to values.

Next: From GPT to LLaMA: Modern LLM Architectures — How the transformer evolved into today's open models.