Stage 1: Fundamentals

Part 1: Tokens, Embeddings, and Attention

No GPU needed for this stage. Everything here runs on your laptop.

🧩 What is a Token?

A token is the basic unit of text that an LLM processes. It's NOT always a full word.

How tokenization works:

"Hello, world!"   โ†’   ["Hello", ",", " world", "!"]     (4 tokens)
"unhappiness"     โ†’   ["un", "happiness"]                (2 tokens)
"GPT"             โ†’   ["G", "PT"]                        (2 tokens, depends on vocab)

Modern LLMs use Byte Pair Encoding (BPE): common subword chunks get their own token. Rule of thumb: ~1 token ≈ 0.75 words in English.
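You can see BPE in action with a tokenizer library. Below is a minimal sketch using tiktoken (one choice among many; exact splits depend on the vocabulary, so treat the IDs as illustrative):

# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("gpt2")      # GPT-2's BPE vocabulary (50,257 tokens)

ids = enc.encode("Hello, world!")
print(ids)                               # token IDs, e.g. [15496, 11, 995, 0]
print([enc.decode([i]) for i in ids])    # ['Hello', ',', ' world', '!']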

Why tokens and not characters?

  • Characters = too many steps to learn long-range patterns
  • Full words = vocab gets too large (millions of words across languages)
  • Subwords (BPE) = sweet spot: ~32K–100K vocab, handles any language + code

Interview Corner Cases 🎯

  • "How many tokens is 1000 words?" โ†’ ~1333 tokens
  • "Why do LLMs struggle with counting letters in a word?" โ†’ Because "strawberry" might be ONE token. The model never "sees" individual letters unless it explicitly reasons character-by-character.
  • "What's the max context length of GPT-4?" โ†’ 128K tokens (as of 2024). This matters because attention is O(nยฒ) in sequence length.
  • "Why do LLMs sometimes fail with simple arithmetic?" โ†’ Numbers like 12345 may tokenize as ["123", "45"] โ€” the model is doing arithmetic on token sequences, not digits.

🔢 What is an Embedding?

An embedding is a vector (list of numbers) that represents a token in high-dimensional space. Every token gets mapped to a point; semantically similar tokens land near each other.

flowchart LR
    subgraph Vocab["Token Vocabulary"]
        T1["'king'  โ†’ token id 2345"]
        T2["'queen' โ†’ token id 2891"]
        T3["'apple' โ†’ token id 7104"]
    end
    subgraph Space["768-dimensional Vector Space (GPT-2)"]
        V1["[0.23, -0.81, 0.44, ...]  โ† king"]
        V2["[0.21, -0.79, 0.48, ...]  โ† queen  (nearby!)"]
        V3["[-0.55, 0.33, -0.22, ...] โ† apple  (far away)"]
    end
    T1 --> V1
    T2 --> V2
    T3 --> V3
    style Vocab fill:#fafaf9,stroke:#e7e5e4
    style Space fill:#fafaf9,stroke:#e7e5e4

The key insight: distance in vector space = semantic similarity. "king" and "queen" land close together; "apple" is far from both.

The famous arithmetic test

The model learns during training that:

  • $\vec{king} - \vec{man} + \vec{woman} \approx \vec{queen}$
  • $\vec{Paris} - \vec{France} + \vec{Italy} \approx \vec{Rome}$

These relationships are never explicitly taught; they emerge from predicting the next token on billions of examples.

Try it yourself in a few lines of code

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # free, runs on CPU

words = ["king", "queen", "apple", "man", "woman"]
vectors = model.encode(words)   # shape: (5, 384)

# Cosine similarity: 1.0 = identical, 0.0 = unrelated
def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(vectors[0], vectors[1]))   # king vs queen  → ~0.82 (close)
print(cosine(vectors[0], vectors[2]))   # king vs apple  → ~0.18 (far)

# Word arithmetic: king - man + woman ≈ queen?
result = vectors[0] - vectors[3] + vectors[4]
sims = [(w, cosine(result, vectors[i])) for i, w in enumerate(words)]
print(sorted(sims, key=lambda x: -x[1])[:2])
# → [('queen', 0.84), ('king', 0.71)]  ← it works!

Install: pip install sentence-transformers

What's the Embedding Matrix?

It's a lookup table of shape (vocab_size, embedding_dim). Every token ID maps to one row.

import torch
import torch.nn as nn

# GPT-2 size: 50,257 tokens, 768-dim vectors
embedding_table = nn.Embedding(num_embeddings=50257, embedding_dim=768)

token_id = 1337
vector = embedding_table(torch.tensor(token_id))
print(vector.shape)  # → torch.Size([768])

The entire embedding matrix is learned during training: initialized randomly, updated via gradient descent until semantically similar tokens cluster together.
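A small illustrative sketch of that learning signal: in any given training step, only the rows for token IDs that actually appear in the batch receive gradient (the loss below is a stand-in, not a real LM loss).

import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=50257, embedding_dim=768)
ids = torch.tensor([1337, 42])                 # a tiny "batch" of two token IDs

loss = emb(ids).sum()                          # stand-in for a real LM loss
loss.backward()

print(emb.weight.grad[1337].abs().sum() > 0)   # tensor(True): row 1337 got a gradient
print(emb.weight.grad[7].abs().sum() == 0)     # tensor(True): unused rows stay zero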

Interview Corner Cases 🎯

  • "What's the difference between token embeddings and positional embeddings?"

    - Token embeddings = WHAT the token is (its meaning)
    - Positional embeddings = WHERE the token is in the sequence
    - Both are added together before feeding into the transformer (see the sketch after this list)

  • "Can two different tokens have the same embedding?" โ†’ Initially no (they start different), but after training they could theoretically converge if they always appear in identical contexts. In practice they don't.
  • "What is the embedding dimension for GPT-2 vs LLaMA-7B vs GPT-4?" โ†’ GPT-2: 768, LLaMA-7B: 4096, GPT-4: ~12,288 (estimated)
  • "Why do we use learned positional embeddings vs fixed sinusoidal ones?" โ†’ Learned = more flexible, can adapt to training data. Sinusoidal (original Transformer paper) = generalizes to longer sequences unseen at training time. Modern LLMs use RoPE (Rotary Position Embedding) which combines both benefits.

🎯 What is Attention?

Attention answers the question: "When processing this token, which other tokens should I look at?"

The Intuition:

Sentence: "The bank by the river was steep."
               ↑
When processing "bank", should the model attend to "river" (geographic context)
or ignore it and think "financial institution"?

Attention LEARNS to focus on "river" → correctly understands it's a riverbank.

The Math (Scaled Dot-Product Attention):

For each token, we create 3 vectors:

  • Q (Query): "What am I looking for?"
  • K (Key): "What do I contain?"
  • V (Value): "What information do I give out?"
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Step by step:

  1. $QK^T$ → dot product = how much does each query "match" each key? → shape: (seq_len, seq_len)
  2. $/ \sqrt{d_k}$ → scale down to keep softmax out of its saturated, vanishing-gradient regime
  3. $\text{softmax}(\ldots)$ → turn scores into probabilities (attention weights)
  4. $\cdot V$ → weighted sum of values

Why divide by $\sqrt{d_k}$?

Without scaling, with large $d_k$, dot products get very large → softmax saturates → gradients vanish → training breaks.
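The full formula is only a few lines of PyTorch. A minimal sketch (single head, no batching or masking, for clarity):

import torch
import torch.nn.functional as F

def attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.T / d_k**0.5          # steps 1-2: match scores, scaled
    weights = F.softmax(scores, dim=-1)  # step 3: each row sums to 1
    return weights @ V                   # step 4: weighted sum of values

Q = K = V = torch.randn(5, 64)           # 5 tokens, d_k = 64 (self-attention)
print(attention(Q, K, V).shape)          # torch.Size([5, 64])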

Multi-Head Attention

Instead of one attention, run $h$ attention heads in parallel, each learning different relationships:

  • Head 1 might learn syntactic relationships (subject-verb)
  • Head 2 might learn coreference (pronoun → antecedent)
  • Head 3 might learn positional patterns

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\,W^O \quad\text{where}\quad \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$
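A minimal sketch of multi-head self-attention (sizes are GPT-2-ish assumptions; real implementations add dropout, masking, etc.):

import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.d_k = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # W_Q, W_K, W_V fused into one matrix
        self.out = nn.Linear(d_model, d_model)       # W_O

    def forward(self, x):                            # x: (batch, seq, d_model)
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # split d_model into h heads of size d_k: (B, h, T, d_k)
        q, k, v = (t.view(B, T, self.h, self.d_k).transpose(1, 2) for t in (q, k, v))
        att = (q @ k.transpose(-2, -1)) / self.d_k**0.5
        att = att.softmax(dim=-1)                    # attention weights per head
        y = att @ v                                  # (B, h, T, d_k)
        y = y.transpose(1, 2).reshape(B, T, -1)      # concat the heads back together
        return self.out(y)

mha = MultiHeadSelfAttention()
print(mha(torch.randn(1, 10, 768)).shape)  # torch.Size([1, 10, 768])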

Interview Corner Cases 🎯

  • "What is causal (masked) attention?" โ†’ In language modeling, a token can only attend to tokens BEFORE it (left context). We mask out future tokens with -inf before softmax. This is how autoregressive generation works.
  • "What's the time and space complexity of attention?" โ†’ O(nยฒ) in sequence length n. This is why long contexts are expensive. At n=128K tokens, you have 128Kยฒ = 16 billion attention pairs.
  • "What is KV-cache?" โ†’ During inference, you recompute Q for only the new token, but K and V for all previous tokens stay the same โ€” you cache them. This is what makes generation fast.
  • "What is the difference between self-attention and cross-attention?" โ†’ Self-attention: Q, K, V all come from the same sequence. Cross-attention: Q comes from the decoder, K and V come from the encoder (used in seq2seq models like the original Transformer for translation).
  • "Why do transformers need positional encoding?" โ†’ Attention is permutation-invariant โ€” it treats "cat sat mat" and "mat sat cat" identically without position info. Positional encodings break this symmetry.
  • "What is Flash Attention?" โ†’ A memory-efficient attention implementation that rewrites the attention computation to avoid materializing the full nร—n attention matrix in GPU HBM, using tiling. ~3-4x faster in practice.

๐Ÿ—๏ธ Full Transformer Architecture (Quick Map)

flowchart TD
    A([Input Text]) --> B[Tokenizer: text to token IDs]
    B --> C[Token Embedding + Positional Embedding]
    C --> D

    subgraph D["Transformer Block (repeated N times)"]
        D1[LayerNorm] --> D2[Multi-Head Self-Attention]
        D2 --> D3[Residual: add input back]
        D3 --> D4[LayerNorm]
        D4 --> D5[Feed-Forward Network MLP]
        D5 --> D6[Residual: add input back]
    end

    D --> E[Final LayerNorm]
    E --> F[Linear + Softmax over vocabulary]
    F --> G([Next Token Prediction])

    style A fill:#f5f5f4,stroke:#e7e5e4
    style G fill:#f5f5f4,stroke:#e7e5e4
    style D fill:#fafaf9,stroke:#e7e5e4

Key design choices:

  • Residual connections: Prevent vanishing gradients in deep nets
  • LayerNorm: Stabilize training (applied BEFORE attention in modern LLMs: "Pre-LN")
  • FFN: 2-layer MLP with hidden dim = 4× embedding dim. Does most of the "memorization" (a minimal Pre-LN block is sketched after this list)
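A minimal sketch of one Pre-LN block matching the diagram, built on PyTorch's nn.MultiheadAttention (sizes are GPT-2-ish assumptions; the causal mask is omitted for brevity):

import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(                    # hidden dim = 4x d_model
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):                                  # x: (batch, seq, d_model)
        h = self.ln1(x)                                    # LayerNorm BEFORE attention
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual 1
        x = x + self.ffn(self.ln2(x))                      # residual 2
        return x

block = PreLNBlock()
print(block(torch.randn(1, 10, 768)).shape)  # torch.Size([1, 10, 768])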

Interview Corner Cases: Transformer Architecture 🎯

  • "Why are residual connections important?" โ†’ Allow gradients to flow directly to early layers. Without them, training >12 layer networks is nearly impossible.
  • "What is the ratio of attention to FFN parameters?" โ†’ FFN is usually 4ร— larger. In GPT-2: attention ~2.4M params per layer, FFN ~4.7M params per layer.
  • "What's the difference between Pre-LN and Post-LN?" โ†’ Original paper used Post-LN (LayerNorm after residual). Modern LLMs use Pre-LN (before). Pre-LN is more stable to train, especially at large scale.
  • "How does the model generate text?" โ†’ Autoregressive: feed input, predict 1 token, append it to input, repeat. It's a for-loop over the model.
  • "What is temperature in generation?" โ†’ Before softmax, divide logits by T. T<1 = more confident/deterministic. T>1 = more random/creative. T=0 โ‰ˆ argmax (greedy).

**Next file: 02_char_level_lm.py, where you build a character-level language model from scratch in PyTorch.**