Stage 1 — Interview Cheat Sheet

LLM Fundamentals: Corner Cases & Trick Questions


🔥 TOKENS

| Question | Answer |
| --- | --- |
| How many tokens ≈ 1000 words? | ~1333 tokens (1 token ≈ 0.75 words) |
| Why can't LLMs count letters in "strawberry"? | The model sees "strawberry" as whole subword token(s), never as individual letters, so it can't reliably count characters inside a token |
| Why do LLMs struggle with arithmetic? | Numbers like "12345" tokenize as chunks such as ["123", "45"] — the model operates on token chunks, not digits |
| What is BPE? | Byte Pair Encoding: iteratively merges the most frequent adjacent symbol pair to build a subword vocab |
| What's GPT-4's context length? | 128K tokens for GPT-4 Turbo and GPT-4o (the original GPT-4 launched at 8K/32K) |
| Why not use character-level tokenization everywhere? | Sequences become much longer, so long-range dependencies span many more steps and compute per text grows |
| Why not use word-level? | Vocab explodes (millions of words) and can't handle OOV words or misspellings |
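
A minimal from-scratch sketch of the BPE merge loop described above; the toy corpus, frequencies, and three merge steps are made up for illustration, while real tokenizers operate on bytes and learn tens of thousands of merges:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus (words are tuples of symbols)."""
    pairs = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for word, freq in words.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: words split into characters, with made-up frequencies.
corpus = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6, tuple("wider"): 3}
for step in range(3):                      # 3 merges here; real vocabs use ~30k-100k
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    print(f"merge {step + 1}: {pair}")     # first merge is ('e', 'r') -> 'er'
```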

🔥 EMBEDDINGS

| Question | Answer |
| --- | --- |
| What shape is the embedding matrix? | (vocab_size × embedding_dim), e.g., (50257 × 768) for GPT-2 |
| What's the difference between token and positional embeddings? | Token = WHAT the token is. Positional = WHERE it is. The two are added together |
| What are the 3 types of positional encoding? | Learned absolute, sinusoidal (fixed), RoPE (rotary) |
| Why RoPE? | Encodes relative positions, generalizes better to longer contexts |
| What is ALiBi? | Attention with Linear Biases — adds a position-dependent bias to attention scores, no positional embedding needed |
| What dim is LLaMA-7B's embedding? | 4096 |
| What is weight tying? | Sharing weights between the token embedding matrix and the final output projection (lm_head). Saves vocab_size × dim parameters (roughly 20% for GPT-2-sized models) and often improves performance |
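
A short PyTorch sketch (GPT-2-small sizes assumed) showing token and positional embeddings being added together, and weight tying between the embedding matrix and lm_head:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, max_len = 50257, 768, 1024     # GPT-2 small sizes

tok_emb = nn.Embedding(vocab_size, embed_dim)         # WHAT each token is
pos_emb = nn.Embedding(max_len, embed_dim)            # WHERE it sits (learned absolute)
lm_head = nn.Linear(embed_dim, vocab_size, bias=False)
lm_head.weight = tok_emb.weight                       # weight tying: one shared matrix

token_ids = torch.randint(0, vocab_size, (2, 16))     # (batch=2, seq_len=16)
positions = torch.arange(16)
x = tok_emb(token_ids) + pos_emb(positions)           # added together -> (2, 16, 768)
logits = lm_head(x)                                   # (2, 16, 50257)
print(x.shape, logits.shape)
```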

🔥 ATTENTION

| Question | Answer |
| --- | --- |
| What is the time complexity of attention? | O(n²·d), where n = sequence length, d = dimension |
| What is the space complexity? | O(n²) for the attention matrix |
| What is causal (masked) attention? | Tokens can only attend to past + current tokens (prevents future leakage) |
| What is KV-cache? | Cache Key and Value tensors during inference to avoid recomputing them |
| What is Flash Attention? | Memory-efficient attention using GPU tiling; avoids materializing the full n×n matrix in HBM |
| What's the difference between self-attention and cross-attention? | Self: Q, K, V from the same sequence. Cross: Q from one sequence, K, V from another (encoder-decoder) |
| Why √d_k in the denominator? | Prevents dot products from becoming too large → softmax saturation → vanishing gradients |
| What are attention heads for? | Each head learns different types of relationships (syntax, coreference, etc.) |
| How many attention heads in GPT-2 small? | 12 heads, each of size 64 (768 / 12) |
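
A single-head causal attention sketch in PyTorch showing the √d_k scaling and the future-token mask; shapes are illustrative (head size 64 as in GPT-2 small):

```python
import math
import torch

def causal_attention(q, k, v):
    """q, k, v: (batch, seq_len, d_k). Returns (batch, seq_len, d_k)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (batch, n, n) -- the O(n^2) part
    n = scores.size(-1)
    mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))     # block attention to future tokens
    weights = torch.softmax(scores, dim=-1)               # rows sum to 1 over past + current
    return weights @ v

q = k = v = torch.randn(2, 8, 64)       # head size 64, as in GPT-2 small
out = causal_attention(q, k, v)
print(out.shape)                        # torch.Size([2, 8, 64])
```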

🔥 TRANSFORMER ARCHITECTURE

| Question | Answer |
| --- | --- |
| Why residual connections? | Allow gradients to bypass layers; prevent vanishing gradients in deep nets |
| Pre-LN vs Post-LN? | Pre-LN (LayerNorm before the sublayer) trains more stably; Post-LN is the original paper's layout. Modern LLMs use Pre-LN |
| What does the FFN do? | 2-layer MLP: expands to 4× dim, then compresses back. Widely viewed as where factual knowledge is stored |
| What's the MLP ratio? | Typically 4×, i.e., hidden_dim = 4 × model_dim |
| What is GELU vs ReLU? | GELU is smoother and allows small negative values. Used in GPT-2 onward; better empirical performance |
| What is SwiGLU? | Activation used in LLaMA: combines Swish with a Gated Linear Unit. Slightly better than GELU empirically |
| How many layers does GPT-2 large have? | 36 layers |
| What is the decoder-only architecture? | No encoder; a single stack of causal self-attention blocks generating autoregressively. Used by GPT, LLaMA, Mistral |
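
A minimal Pre-LN decoder block sketch in PyTorch, with both residual connections and the 4× GELU MLP; nn.MultiheadAttention stands in for a hand-rolled attention here:

```python
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Pre-LN: LayerNorm *before* each sublayer, residual connection around it."""
    def __init__(self, dim=768, num_heads=12):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(              # 2-layer MLP with 4x expansion
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, x):
        n = x.size(1)
        causal = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=causal)   # masked self-attention
        x = x + a                                     # residual 1
        x = x + self.ffn(self.ln2(x))                 # residual 2
        return x

x = torch.randn(2, 16, 768)
print(PreLNBlock()(x).shape)                          # torch.Size([2, 16, 768])
```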

🔥 TRAINING

| Question | Answer |
| --- | --- |
| What loss function is used? | Cross-entropy (next-token prediction) |
| What is perplexity? | exp(loss). Lower = better. A perplexity of N means the model is as confused as choosing uniformly from N options |
| What is teacher forcing? | During training, always use ground-truth tokens as input (not the model's predictions) — makes training stable |
| What optimizer? | AdamW (Adam with decoupled weight decay). Standard for LLMs |
| What is gradient clipping? | Cap the gradient norm at a threshold (usually 1.0) to prevent exploding gradients |
| What is learning rate warmup? | Gradually increase the LR from 0 to its peak over roughly the first 1% of training. Prevents early instability |
| What is cosine decay? | LR schedule that follows a cosine curve from the peak down to ~0. Standard for LLM training |
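
A sketch of the next-token cross-entropy objective, the exp(loss) perplexity relation, and a hand-rolled warmup + cosine schedule; the peak LR, step counts, and 1% warmup fraction are illustrative choices:

```python
import math
import torch
import torch.nn.functional as F

vocab_size = 50257
tokens = torch.randint(0, vocab_size, (2, 17))     # (batch, seq_len + 1)
inputs, targets = tokens[:, :-1], tokens[:, 1:]    # teacher forcing: ground-truth inputs, shifted targets
logits = torch.randn(2, 16, vocab_size)            # stand-in for model(inputs)

loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
print("loss:", loss.item(), "perplexity:", loss.exp().item())
print("random-init sanity check, ln(vocab):", math.log(vocab_size))   # ~10.8

def lr_at(step, max_steps=10_000, peak_lr=3e-4, warmup_frac=0.01):
    """Linear warmup to peak_lr, then cosine decay toward 0."""
    warmup = max(int(max_steps * warmup_frac), 1)
    if step < warmup:
        return peak_lr * step / warmup
    progress = (step - warmup) / max(max_steps - warmup, 1)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print([round(lr_at(s), 6) for s in (0, 50, 100, 5_000, 10_000)])
```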

🔥 GENERATION

| Question | Answer |
| --- | --- |
| What is temperature? | Scale logits by 1/T before softmax. T<1 = sharper/confident, T>1 = flatter/creative, T→0 = greedy |
| What is top-k sampling? | Keep only the top k tokens by probability and renormalize |
| What is top-p (nucleus) sampling? | Keep the smallest set of tokens whose cumulative probability ≥ p. More adaptive than top-k |
| What is greedy decoding? | Always pick the argmax token. Fast but often produces repetitive text |
| What is beam search? | Keep the top-B sequences at each step. Better for translation, worse for open-ended generation |
| What is repetition penalty? | Reduce the logit score of tokens that have already appeared. Prevents loops |
| What is the difference between completion and chat models? | Completion: raw next-token prediction. Chat: fine-tuned with SFT/RLHF to follow instructions |
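
A sketch of one sampling step combining temperature, top-k, and top-p; applying the filters in this order, and these particular default values, are illustrative choices rather than a fixed standard:

```python
import torch

def sample_next(logits, temperature=0.8, top_k=50, top_p=0.9):
    """logits: (vocab,). Returns one sampled token id."""
    if temperature <= 0:                                   # T=0 is just greedy decoding
        return torch.argmax(logits).item()
    logits = logits / temperature                          # T<1 sharpens, T>1 flattens

    if top_k is not None:                                  # keep only the k best logits
        kth = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth, float("-inf"))

    if top_p is not None:                                  # nucleus: smallest set with cum. prob >= p
        sorted_logits, idx = torch.sort(logits, descending=True)
        probs = torch.softmax(sorted_logits, dim=-1)
        cum = torch.cumsum(probs, dim=-1)
        cut = cum - probs > top_p                          # keep the token that crosses the threshold
        sorted_logits[cut] = float("-inf")
        logits = torch.full_like(logits, float("-inf")).scatter(0, idx, sorted_logits)

    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

print(sample_next(torch.randn(50257)))
```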

🔥 GOTCHAS INTERVIEWERS LOVE

  1. "Attention is permutation-invariant" — Without positional encoding, the model treats "dog bites man" and "man bites dog" identically.
  2. "Cross-entropy initial value" — A randomly initialized model has loss ≈ ln(vocab_size). For GPT-2: ln(50257) ≈ 10.8. Good sanity check for your training setup.
  3. "Gradient accumulation ≠ larger batch" — Accumulating gradients over N steps is mathematically equivalent to a batch N× larger (assuming no BatchNorm, which LLMs don't use).
  4. "nn.Embedding vs nn.Linear"nn.Embedding is just nn.Linear without bias, accessed via index lookup instead of matrix multiply. Functionally equivalent for integer inputs.
  5. "Why can't you just fine-tune the last layer?" — Because language understanding is distributed across ALL layers. You need at least the top layers fine-tuned, or use LoRA across all layers.
  6. "What happens if you set temperature=0?" — Numerically: logits → ∞ for the max, -∞ for the rest → after softmax, prob=1 for max → deterministic argmax. In code, usually implemented as torch.argmax directly.
  7. "LLMs don't have a concept of 'I don't know'" — They assign probability to every token regardless. Hallucination is what happens when the model confidently predicts plausible-sounding but wrong continuations.

📐 NUMBERS TO MEMORIZE

| Model | Params | Layers | Heads | dim | Vocab |
| --- | --- | --- | --- | --- | --- |
| GPT-2 small | 117M | 12 | 12 | 768 | 50,257 |
| GPT-2 large | 774M | 36 | 20 | 1280 | 50,257 |
| LLaMA-7B | 7B | 32 | 32 | 4096 | 32,000 |
| LLaMA-13B | 13B | 40 | 40 | 5120 | 32,000 |
| Mistral-7B | 7B | 32 | 32 | 4096 | 32,000 |
| GPT-3 | 175B | 96 | 96 | 12,288 | 50,257 |
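
A back-of-envelope check for the table above: the standard decoder-only estimate of ~12 · layers · dim² non-embedding parameters plus vocab · dim for the (possibly tied) embedding matrix lands close to the official counts; exact figures differ a few percent because of LayerNorms, biases, SwiGLU hidden sizes, and positional embeddings.

```python
def approx_params(layers, dim, vocab):
    attn = 4 * dim ** 2            # Q, K, V, and output projections
    mlp = 8 * dim ** 2             # two linears with a 4x hidden dim
    return layers * (attn + mlp) + vocab * dim   # + one (tied) embedding matrix

for name, (layers, dim, vocab) in {
    "GPT-2 small": (12, 768, 50_257),
    "LLaMA-7B": (32, 4_096, 32_000),
    "GPT-3": (96, 12_288, 50_257),
}.items():
    print(f"{name}: ~{approx_params(layers, dim, vocab) / 1e9:.2f}B")
```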

Next: Stage 2 — Train a real Transformer from scratch