Stage 1: Fundamentals

Part 1: Tokens, Embeddings, and Attention

No GPU needed for this stage. Everything here runs on your laptop.

🧩 What is a Token?

A token is the basic unit of text that an LLM processes. It's NOT always a full word.

How tokenization works:

"Hello, world!"   โ†’   ["Hello", ",", " world", "!"]     (4 tokens)
"unhappiness"     โ†’   ["un", "happiness"]                (2 tokens)
"GPT"             โ†’   ["G", "PT"]                        (2 tokens, depends on vocab)

Modern LLMs use Byte Pair Encoding (BPE): common subword chunks get their own token. Rule of thumb: ~1 token ≈ 0.75 words in English.
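You can see BPE in action with a tokenizer library. Below is a minimal sketch using tiktoken (one choice among many; exact splits depend on the vocabulary, so treat the IDs as illustrative):

# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("gpt2")      # GPT-2's BPE vocabulary (50,257 tokens)

ids = enc.encode("Hello, world!")
print(ids)                               # token IDs, e.g. [15496, 11, 995, 0]
print([enc.decode([i]) for i in ids])    # ['Hello', ',', ' world', '!']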

Why tokens and not characters?

  • Characters = too many steps to learn long-range patterns
  • Full words = vocab gets too large (millions of words across languages)
  • Subwords (BPE) = sweet spot: ~32K–100K vocab, handles any language + code

Interview Corner Cases 🎯

  • "How many tokens is 1000 words?" โ†’ ~1333 tokens
  • "Why do LLMs struggle with counting letters in a word?" โ†’ Because "strawberry" might be ONE token. The model never "sees" individual letters unless it explicitly reasons character-by-character.
  • "What's the max context length of GPT-4?" โ†’ 128K tokens (as of 2024). This matters because attention is O(nยฒ) in sequence length.
  • "Why do LLMs sometimes fail with simple arithmetic?" โ†’ Numbers like 12345 may tokenize as ["123", "45"] โ€” the model is doing arithmetic on token sequences, not digits.

🔢 What is an Embedding?

An embedding is a vector (list of numbers) that represents a token in high-dimensional space. Every token gets mapped to a point; semantically similar tokens land near each other.

flowchart LR
    subgraph Vocab["Token Vocabulary"]
        T1["'king'  โ†’ token id 2345"]
        T2["'queen' โ†’ token id 2891"]
        T3["'apple' โ†’ token id 7104"]
    end
    subgraph Space["768-dimensional Vector Space (GPT-2)"]
        V1["[0.23, -0.81, 0.44, ...]  โ† king"]
        V2["[0.21, -0.79, 0.48, ...]  โ† queen  (nearby!)"]
        V3["[-0.55, 0.33, -0.22, ...] โ† apple  (far away)"]
    end
    T1 --> V1
    T2 --> V2
    T3 --> V3
    style Vocab fill:#fafaf9,stroke:#e7e5e4
    style Space fill:#fafaf9,stroke:#e7e5e4

The key insight: distance in vector space = semantic similarity. "king" and "queen" land close together; "apple" is far from both.

The famous arithmetic test

The model learns during training that:

  • $\vec{king} - \vec{man} + \vec{woman} \approx \vec{queen}$
  • $\vec{Paris} - \vec{France} + \vec{Italy} \approx \vec{Rome}$

These relationships are never explicitly taught; they emerge from predicting the next token on billions of examples.

Try it yourself in a few lines of code

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # free, runs on CPU

words = ["king", "queen", "apple", "man", "woman"]
vectors = model.encode(words)   # shape: (5, 384)

# Cosine similarity: 1.0 = identical, 0.0 = unrelated
def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(vectors[0], vectors[1]))   # king vs queen  → ~0.82 (close)
print(cosine(vectors[0], vectors[2]))   # king vs apple  → ~0.18 (far)

# Word arithmetic: king - man + woman ≈ queen?
result = vectors[0] - vectors[3] + vectors[4]
sims = [(w, cosine(result, vectors[i])) for i, w in enumerate(words)]
print(sorted(sims, key=lambda x: -x[1])[:2])
# → [('queen', 0.84), ('king', 0.71)]  ← it works!

Install: pip install sentence-transformers

What's the Embedding Matrix?

It's a lookup table of shape (vocab_size, embedding_dim). Every token ID maps to one row.

import torch
import torch.nn as nn

# GPT-2 size: 50,257 tokens, 768-dim vectors
embedding_table = nn.Embedding(num_embeddings=50257, embedding_dim=768)

token_id = 1337
vector = embedding_table(torch.tensor(token_id))
print(vector.shape)  # → torch.Size([768])

The entire embedding matrix is learned during training: initialized randomly, updated via gradient descent until semantically similar tokens cluster together.
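A small illustrative sketch of that learning signal: in any given training step, only the rows for token IDs that actually appear in the batch receive gradient (the loss below is a stand-in, not a real LM loss).

import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=50257, embedding_dim=768)
ids = torch.tensor([1337, 42])                 # a tiny "batch" of two token IDs

loss = emb(ids).sum()                          # stand-in for a real LM loss
loss.backward()

print(emb.weight.grad[1337].abs().sum() > 0)   # tensor(True): row 1337 got a gradient
print(emb.weight.grad[7].abs().sum() == 0)     # tensor(True): unused rows stay zero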

Interview Corner Cases 🎯

  • "What's the difference between token embeddings and positional embeddings?"

    - Token embeddings = WHAT the token is (its meaning)
    - Positional embeddings = WHERE the token is in the sequence
    - Both are added together before feeding into the transformer (see the sketch after this list)

  • "Can two different tokens have the same embedding?" โ†’ Initially no (they start different), but after training they could theoretically converge if they always appear in identical contexts. In practice they don't.
  • "What is the embedding dimension for GPT-2 vs LLaMA-7B vs GPT-4?" โ†’ GPT-2: 768, LLaMA-7B: 4096, GPT-4: ~12,288 (estimated)
  • "Why do we use learned positional embeddings vs fixed sinusoidal ones?" โ†’ Learned = more flexible, can adapt to training data. Sinusoidal (original Transformer paper) = generalizes to longer sequences unseen at training time. Modern LLMs use RoPE (Rotary Position Embedding) which combines both benefits.

🎯 What is Attention?

Attention answers the question: "When processing this token, which other tokens should I look at?"

The Intuition:

Sentence: "The bank by the river was steep."
               ↑
When processing "bank", should the model attend to "river" (geographic context)
or ignore it and think "financial institution"?

Attention LEARNS to focus on "river" → correctly understands it's a riverbank.

The Math (Scaled Dot-Product Attention):

For each token, we create 3 vectors:

  • Q (Query): "What am I looking for?"
  • K (Key): "What do I contain?"
  • V (Value): "What information do I give out?"
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Step by step:

  1. $QK^T$ → dot product = how much does each query "match" each key? → shape: (seq_len, seq_len)
  2. $/ \sqrt{d_k}$ → scale down to keep softmax out of its saturated, vanishing-gradient regime
  3. $\text{softmax}(\ldots)$ → turn scores into probabilities (attention weights)
  4. $\cdot V$ → weighted sum of values

Why divide by $\sqrt{d_k}$?

Without scaling, with large $d_k$, dot products get very large → softmax saturates → gradients vanish → training breaks.
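The full formula is only a few lines of PyTorch. A minimal sketch (single head, no batching or masking, for clarity):

import torch
import torch.nn.functional as F

def attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.T / d_k**0.5          # steps 1-2: match scores, scaled
    weights = F.softmax(scores, dim=-1)  # step 3: each row sums to 1
    return weights @ V                   # step 4: weighted sum of values

Q = K = V = torch.randn(5, 64)           # 5 tokens, d_k = 64 (self-attention)
print(attention(Q, K, V).shape)          # torch.Size([5, 64])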

Multi-Head Attention

Instead of one attention, run $h$ attention heads in parallel, each learning different relationships:

  • Head 1 might learn syntactic relationships (subject-verb)
  • Head 2 might learn coreference (pronoun → antecedent)
  • Head 3 might learn positional patterns

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\,W^O \quad\text{where}\quad \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$
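A minimal sketch of multi-head self-attention (sizes are GPT-2-ish assumptions; real implementations add dropout, masking, etc.):

import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.d_k = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # W_Q, W_K, W_V fused into one matrix
        self.out = nn.Linear(d_model, d_model)       # W_O

    def forward(self, x):                            # x: (batch, seq, d_model)
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # split d_model into h heads of size d_k: (B, h, T, d_k)
        q, k, v = (t.view(B, T, self.h, self.d_k).transpose(1, 2) for t in (q, k, v))
        att = (q @ k.transpose(-2, -1)) / self.d_k**0.5
        att = att.softmax(dim=-1)                    # attention weights per head
        y = att @ v                                  # (B, h, T, d_k)
        y = y.transpose(1, 2).reshape(B, T, -1)      # concat the heads back together
        return self.out(y)

mha = MultiHeadSelfAttention()
print(mha(torch.randn(1, 10, 768)).shape)  # torch.Size([1, 10, 768])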

Interview Corner Cases 🎯

  • "What is causal (masked) attention?" โ†’ In language modeling, a token can only attend to tokens BEFORE it (left context). We mask out future tokens with -inf before softmax. This is how autoregressive generation works.
  • "What's the time and space complexity of attention?" โ†’ O(nยฒ) in sequence length n. This is why long contexts are expensive. At n=128K tokens, you have 128Kยฒ = 16 billion attention pairs.
  • "What is KV-cache?" โ†’ During inference, you recompute Q for only the new token, but K and V for all previous tokens stay the same โ€” you cache them. This is what makes generation fast.
  • "What is the difference between self-attention and cross-attention?" โ†’ Self-attention: Q, K, V all come from the same sequence. Cross-attention: Q comes from the decoder, K and V come from the encoder (used in seq2seq models like the original Transformer for translation).
  • "Why do transformers need positional encoding?" โ†’ Attention is permutation-invariant โ€” it treats "cat sat mat" and "mat sat cat" identically without position info. Positional encodings break this symmetry.
  • "What is Flash Attention?" โ†’ A memory-efficient attention implementation that rewrites the attention computation to avoid materializing the full nร—n attention matrix in GPU HBM, using tiling. ~3-4x faster in practice.

๐Ÿ—๏ธ Full Transformer Architecture (Quick Map)

flowchart TD
    A([Input Text]) --> B[Tokenizer: text to token IDs]
    B --> C[Token Embedding + Positional Embedding]
    C --> D

    subgraph D["Transformer Block (repeated N times)"]
        D1[LayerNorm] --> D2[Multi-Head Self-Attention]
        D2 --> D3[Residual: add input back]
        D3 --> D4[LayerNorm]
        D4 --> D5[Feed-Forward Network MLP]
        D5 --> D6[Residual: add input back]
    end

    D --> E[Final LayerNorm]
    E --> F[Linear + Softmax over vocabulary]
    F --> G([Next Token Prediction])

    style A fill:#f5f5f4,stroke:#e7e5e4
    style G fill:#f5f5f4,stroke:#e7e5e4
    style D fill:#fafaf9,stroke:#e7e5e4

Key design choices:

  • Residual connections: Prevent vanishing gradients in deep nets
  • LayerNorm: Stabilize training (applied BEFORE attention in modern LLMs: "Pre-LN")
  • FFN: 2-layer MLP with hidden dim = 4× embedding dim. Does most of the "memorization" (a minimal Pre-LN block is sketched after this list)
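A minimal sketch of one Pre-LN block matching the diagram, built on PyTorch's nn.MultiheadAttention (sizes are GPT-2-ish assumptions; the causal mask is omitted for brevity):

import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(                    # hidden dim = 4x d_model
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):                                  # x: (batch, seq, d_model)
        h = self.ln1(x)                                    # LayerNorm BEFORE attention
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual 1
        x = x + self.ffn(self.ln2(x))                      # residual 2
        return x

block = PreLNBlock()
print(block(torch.randn(1, 10, 768)).shape)  # torch.Size([1, 10, 768])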

Interview Corner Cases: Transformer Architecture 🎯

  • "Why are residual connections important?" โ†’ Allow gradients to flow directly to early layers. Without them, training >12 layer networks is nearly impossible.
  • "What is the ratio of attention to FFN parameters?" โ†’ FFN is usually 4ร— larger. In GPT-2: attention ~2.4M params per layer, FFN ~4.7M params per layer.
  • "What's the difference between Pre-LN and Post-LN?" โ†’ Original paper used Post-LN (LayerNorm after residual). Modern LLMs use Pre-LN (before). Pre-LN is more stable to train, especially at large scale.
  • "How does the model generate text?" โ†’ Autoregressive: feed input, predict 1 token, append it to input, repeat. It's a for-loop over the model.
  • "What is temperature in generation?" โ†’ Before softmax, divide logits by T. T<1 = more confident/deterministic. T>1 = more random/creative. T=0 โ‰ˆ argmax (greedy).

**Next file: 02_char_level_lm.py, where you build a character-level language model from scratch in PyTorch.**