Stage 1: Fundamentals
Part 1: Tokens, Embeddings, and Attention
No GPU needed for this stage. Everything here runs on your laptop.
🧩 What is a Token?
A token is the basic unit of text that an LLM processes. It's NOT always a full word.
How tokenization works:
"Hello, world!" โ ["Hello", ",", " world", "!"] (4 tokens)
"unhappiness" โ ["un", "happiness"] (2 tokens)
"GPT" โ ["G", "PT"] (2 tokens, depends on vocab)
Modern LLMs use Byte Pair Encoding (BPE): common subword chunks get their own token. Rule of thumb: ~1 token ≈ 0.75 words in English.
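You can check real BPE splits yourself. A minimal sketch, assuming the `tiktoken` library (which ships OpenAI's BPE vocabularies; it is not otherwise part of this stage's setup):

```python
# pip install tiktoken  (assumed dependency, used only for this demo)
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # GPT-2's BPE vocab (50,257 tokens)

for text in ["Hello, world!", "unhappiness", "12345"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]  # decode each ID back to its chunk
    print(f"{text!r} -> {len(ids)} tokens: {pieces}")
```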
Why tokens and not characters?
- Characters = too many steps to learn long-range patterns
- Full words = vocab gets too large (millions of words across languages)
- Subwords (BPE) = sweet spot: ~32K-100K vocab, handles any language + code
Interview Corner Cases 🎯
- "How many tokens is 1000 words?" โ ~1333 tokens
- "Why do LLMs struggle with counting letters in a word?" โ Because "strawberry" might be ONE token. The model never "sees" individual letters unless it explicitly reasons character-by-character.
- "What's the max context length of GPT-4?" โ 128K tokens (as of 2024). This matters because attention is O(nยฒ) in sequence length.
- "Why do LLMs sometimes fail with simple arithmetic?" โ Numbers like
12345may tokenize as["123", "45"]โ the model is doing arithmetic on token sequences, not digits.
🔢 What is an Embedding?
An embedding is a vector (list of numbers) that represents a token in high-dimensional space. Every token gets mapped to a point, and semantically similar tokens land near each other.
```mermaid
flowchart LR
    subgraph Vocab["Token Vocabulary"]
        T1["'king' → token id 2345"]
        T2["'queen' → token id 2891"]
        T3["'apple' → token id 7104"]
    end
    subgraph Space["768-dimensional Vector Space (GPT-2)"]
        V1["[0.23, -0.81, 0.44, ...] ← king"]
        V2["[0.21, -0.79, 0.48, ...] ← queen (nearby!)"]
        V3["[-0.55, 0.33, -0.22, ...] ← apple (far away)"]
    end
    T1 --> V1
    T2 --> V2
    T3 --> V3
    style Vocab fill:#fafaf9,stroke:#e7e5e4
    style Space fill:#fafaf9,stroke:#e7e5e4
```
The key insight: distance in vector space = semantic similarity. "king" and "queen" land close together; "apple" is far from both.
The famous arithmetic test
The model learns during training that:
- $\vec{king} - \vec{man} + \vec{woman} \approx \vec{queen}$
- $\vec{Paris} - \vec{France} + \vec{Italy} \approx \vec{Rome}$
These relationships are never explicitly taught; they emerge from predicting the next token on billions of examples.
Try it yourself in a few lines of code
```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # free, runs on CPU
words = ["king", "queen", "apple", "man", "woman"]
vectors = model.encode(words)  # shape: (5, 384)

# Cosine similarity: 1.0 = identical, 0.0 = unrelated
def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(vectors[0], vectors[1]))  # king vs queen → ~0.82 (close)
print(cosine(vectors[0], vectors[2]))  # king vs apple → ~0.18 (far)

# Word arithmetic: king - man + woman ≈ queen?
result = vectors[0] - vectors[3] + vectors[4]
sims = [(w, cosine(result, vectors[i])) for i, w in enumerate(words)]
print(sorted(sims, key=lambda x: -x[1])[:2])
# → [('queen', 0.84), ('king', 0.71)] (it works!)
```

Install: `pip install sentence-transformers`
What's the Embedding Matrix?
It's a lookup table of shape (vocab_size, embedding_dim). Every token ID maps to one row.
```python
import torch
import torch.nn as nn

# GPT-2 size: 50,257 tokens, 768-dim vectors
embedding_table = nn.Embedding(num_embeddings=50257, embedding_dim=768)

token_id = 1337
vector = embedding_table(torch.tensor(token_id))
print(vector.shape)  # → torch.Size([768])
```
The entire embedding matrix is learned during training: initialized randomly, then updated via gradient descent until semantically similar tokens cluster together.
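Continuing the snippet above, you can inspect the matrix directly; it is an ordinary trainable parameter:

```python
# One row per token ID; requires_grad=True means gradient descent updates it.
print(embedding_table.weight.shape)          # torch.Size([50257, 768])
print(embedding_table.weight.requires_grad)  # True
```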
Interview Corner Cases 🎯
- "What's the difference between token embeddings and positional embeddings?"
  - Token embeddings = WHAT the token is (its meaning)
  - Positional embeddings = WHERE the token is in the sequence
  - Both are added together before feeding into the transformer
- "Can two different tokens have the same embedding?" โ Initially no (they start different), but after training they could theoretically converge if they always appear in identical contexts. In practice they don't.
- "What is the embedding dimension for GPT-2 vs LLaMA-7B vs GPT-4?" โ GPT-2: 768, LLaMA-7B: 4096, GPT-4: ~12,288 (estimated)
- "Why do we use learned positional embeddings vs fixed sinusoidal ones?" โ Learned = more flexible, can adapt to training data. Sinusoidal (original Transformer paper) = generalizes to longer sequences unseen at training time. Modern LLMs use RoPE (Rotary Position Embedding) which combines both benefits.
🎯 What is Attention?
Attention answers the question: "When processing this token, which other tokens should I look at?"
The Intuition:
Sentence: "The bank by the river was steep."
โ
When processing "bank", should the model attend to "river" (geographic context)
or ignore it and think "financial institution"?
Attention LEARNS to focus on "river" โ correctly understands it's a riverbank.
The Math (Scaled Dot-Product Attention):
For each token, we create 3 vectors:
- Q (Query): "What am I looking for?"
- K (Key): "What do I contain?"
- V (Value): "What information do I give out?"
Step by step:
- $QK^T$ → dot product = how much does each query "match" each key? → shape: (seq_len, seq_len)
- $/ \sqrt{d_k}$ → scale down to keep softmax from saturating (explained below)
- $\text{softmax}(\ldots)$ → turn scores into probabilities (attention weights)
- $\cdot V$ → weighted sum of values

Putting it all together: $\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
Why divide by $\sqrt{d_k}$?
Without scaling, dot products grow with $d_k$ (their variance is roughly proportional to $d_k$), so the softmax saturates, gradients vanish, and training breaks.
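Here is a minimal PyTorch sketch of the four steps. The function name and shapes are illustrative, not a library API:

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (seq_len, d_k) tensors. Returns (seq_len, d_k)."""
    d_k = Q.size(-1)
    scores = Q @ K.T / math.sqrt(d_k)    # (seq_len, seq_len): query-key match
    weights = F.softmax(scores, dim=-1)  # each row sums to 1: attention weights
    return weights @ V                   # weighted sum of values

seq_len, d_k = 5, 64
Q, K, V = (torch.randn(seq_len, d_k) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # torch.Size([5, 64])
```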
Multi-Head Attention
Instead of one attention, run $h$ attention heads in parallel, each learning different relationships:
- Head 1 might learn syntactic relationships (subject-verb)
- Head 2 might learn coreference (pronoun โ antecedent)
- Head 3 might learn positional patterns
$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) \cdot W^O$

where $\text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$
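A compact sketch of the project → split into heads → concat → $W^O$ pipeline. Class name and sizes are illustrative; real implementations also take attention masks and separate K/V inputs:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_k = n_heads, d_model // n_heads
        # One projection each for Q, K, V, plus the output projection W_O
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x):  # x: (batch, seq_len, d_model)
        B, T, C = x.shape
        # Project, then split d_model into n_heads heads of size d_k
        q = self.w_q(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        k = self.w_k(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        v = self.w_v(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / self.d_k**0.5  # (B, heads, T, T)
        out = F.softmax(scores, dim=-1) @ v               # (B, heads, T, d_k)
        out = out.transpose(1, 2).reshape(B, T, C)        # Concat(head_1..head_h)
        return self.w_o(out)                              # multiply by W_O

x = torch.randn(2, 10, 768)
print(MultiHeadAttention()(x).shape)  # torch.Size([2, 10, 768])
```

Each head attends over the same sequence but through its own learned projections, which is what lets different heads specialize in different relationships.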
Interview Corner Cases 🎯
- "What is causal (masked) attention?" โ In language modeling, a token can only attend to tokens BEFORE it (left context). We mask out future tokens with -inf before softmax. This is how autoregressive generation works.
- "What's the time and space complexity of attention?" โ O(nยฒ) in sequence length
n. This is why long contexts are expensive. At n=128K tokens, you have 128Kยฒ = 16 billion attention pairs. - "What is KV-cache?" โ During inference, you recompute Q for only the new token, but K and V for all previous tokens stay the same โ you cache them. This is what makes generation fast.
- "What is the difference between self-attention and cross-attention?" โ Self-attention: Q, K, V all come from the same sequence. Cross-attention: Q comes from the decoder, K and V come from the encoder (used in seq2seq models like the original Transformer for translation).
- "Why do transformers need positional encoding?" โ Attention is permutation-invariant โ it treats "cat sat mat" and "mat sat cat" identically without position info. Positional encodings break this symmetry.
- "What is Flash Attention?" โ A memory-efficient attention implementation that rewrites the attention computation to avoid materializing the full nรn attention matrix in GPU HBM, using tiling. ~3-4x faster in practice.
🏗️ Full Transformer Architecture (Quick Map)
```mermaid
flowchart TD
    A([Input Text]) --> B[Tokenizer: text to token IDs]
    B --> C[Token Embedding + Positional Embedding]
    C --> D
    subgraph D["Transformer Block (repeated N times)"]
        D1[LayerNorm] --> D2[Multi-Head Self-Attention]
        D2 --> D3[Residual: add input back]
        D3 --> D4[LayerNorm]
        D4 --> D5[Feed-Forward Network MLP]
        D5 --> D6[Residual: add input back]
    end
    D --> E[Final LayerNorm]
    E --> F[Linear + Softmax over vocabulary]
    F --> G([Next Token Prediction])
    style A fill:#f5f5f4,stroke:#e7e5e4
    style G fill:#f5f5f4,stroke:#e7e5e4
    style D fill:#fafaf9,stroke:#e7e5e4
```
Key design choices:
- Residual connections: Prevent vanishing gradients in deep nets
- LayerNorm: Stabilize training (applied BEFORE attention in modern LLMs: "Pre-LN")
- FFN: 2-layer MLP with hidden dim = 4× embedding dim. Does most of the "memorization"
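A minimal Pre-LN block wiring these three choices together. It reuses the illustrative MultiHeadAttention class sketched earlier; sizes are GPT-2-like but otherwise assumptions:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = MultiHeadAttention(d_model, n_heads)  # sketched earlier
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(               # 2-layer MLP, hidden dim = 4x
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))  # Pre-LN: normalize BEFORE attention, then residual
        x = x + self.ffn(self.ln2(x))   # same pattern for the FFN
        return x

x = torch.randn(2, 10, 768)
print(TransformerBlock()(x).shape)  # torch.Size([2, 10, 768])
```

The `x = x + ...` lines are the residual connections: gradients can always flow through the identity path, no matter how deep the stack.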
Interview Corner Cases: Transformer Architecture 🎯
- "Why are residual connections important?" โ Allow gradients to flow directly to early layers. Without them, training >12 layer networks is nearly impossible.
- "What is the ratio of attention to FFN parameters?" โ FFN is usually 4ร larger. In GPT-2: attention ~2.4M params per layer, FFN ~4.7M params per layer.
- "What's the difference between Pre-LN and Post-LN?" โ Original paper used Post-LN (LayerNorm after residual). Modern LLMs use Pre-LN (before). Pre-LN is more stable to train, especially at large scale.
- "How does the model generate text?" โ Autoregressive: feed input, predict 1 token, append it to input, repeat. It's a for-loop over the model.
- "What is temperature in generation?" โ Before softmax, divide logits by T. T<1 = more confident/deterministic. T>1 = more random/creative. T=0 โ argmax (greedy).
**Next file: 02_char_level_lm.py - Build a character-level language model from scratch in PyTorch.**