Stage 1 — Interview Cheat Sheet
LLM Fundamentals: Corner Cases & Trick Questions
🔥 TOKENS
| Question | Answer |
|---|---|
| How many tokens ≈ 1000 words? | ~1333 tokens (1 token ≈ 0.75 words) |
| Why can't LLMs count letters in "strawberry"? | The model sees "strawberry" as one or two subword tokens, not characters — it never observes individual letters |
| Why do LLMs struggle with arithmetic? | "12345" tokenizes as ["123","45"] — they operate on token chunks |
| What is BPE? | Byte Pair Encoding: merges frequent character pairs iteratively to build subword vocab |
| What's GPT-4's context length? | Original GPT-4: 8K (32K variant). GPT-4 Turbo and GPT-4o: 128K tokens |
| Why not use character-level tokenization everywhere? | Sequences get far longer → more steps (and quadratic attention cost) to span the same text, and the model must learn character-level composition itself |
| Why not use word-level? | Vocab explodes (millions of words), can't handle OOV/misspellings |
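If asked to whiteboard BPE — a minimal toy training loop, assuming a plain word list as input (illustrative only; real tokenizers like GPT-2's operate on bytes and add pre-tokenization):

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Toy BPE: learn `num_merges` merge rules from a list of words."""
    # Start with each word as a tuple of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)          # most frequent pair
        merges.append(best)
        # Apply the merge: replace every occurrence of the pair with one symbol.
        new_corpus = Counter()
        for word, freq in corpus.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges

# e.g. bpe_train(["low", "lower", "lowest", "low"], 3)
# the first merge is likely ("l", "o") -> "lo"
```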
🔥 EMBEDDINGS
| Question | Answer |
|---|---|
| What shape is the embedding matrix? | (vocab_size × embedding_dim), e.g., (50257 × 768) for GPT-2 |
| What's the difference between token and positional embeddings? | Token = WHAT the token is. Positional = WHERE it is. Both added together. |
| What are the 3 types of positional encoding? | Learned absolute, Sinusoidal (fixed), RoPE (rotary) |
| Why RoPE? | Encodes relative positions, generalizes better to longer contexts |
| What is ALiBi? | Attention with Linear Biases — adds position bias to attention scores, no embedding needed |
| What dim is LLaMA-7B's embedding? | 4096 |
| What is weight tying? | Sharing weights between the token embedding matrix and the final output projection (lm_head). Saves vocab_size × dim params (~38M for GPT-2 small) and often improves performance |
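How these pieces fit together — a minimal PyTorch sketch with learned absolute positions and weight tying (class name and sizes are illustrative, GPT-2-ish):

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab_size=50257, d_model=768, max_len=1024):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)   # WHAT each token is
        self.pos_emb = nn.Embedding(max_len, d_model)      # WHERE it sits (learned absolute)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.tok_emb.weight          # weight tying: one shared matrix

    def forward(self, idx):                                # idx: (batch, seq) token ids
        pos = torch.arange(idx.size(1), device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)          # embeddings are ADDED, not concatenated
        # ... transformer blocks would go here ...
        return self.lm_head(x)                             # logits: (batch, seq, vocab)
```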
🔥 ATTENTION
| Question | Answer |
|---|---|
| What is the time complexity of attention? | O(n²·d) where n=sequence length, d=dimension |
| What is the space complexity? | O(n²) for the attention matrix |
| What is causal (masked) attention? | Tokens can only attend to past + current tokens (prevents future leakage) |
| What is KV-cache? | Cache Key and Value tensors during inference to avoid recomputing them |
| What is Flash Attention? | Memory-efficient attention using GPU tiling; avoids materializing full n×n matrix in HBM |
| What's the difference between self-attention and cross-attention? | Self: Q,K,V from same sequence. Cross: Q from one sequence, K,V from another (encoder-decoder) |
| Why √d_k in denominator? | Prevent dot products from becoming too large → softmax saturation → vanishing gradients |
| What are attention heads for? | Each head learns different types of relationships (syntax, coreference, etc.) |
| How many attention heads in GPT-2 small? | 12 heads, each of size 64 (768 / 12) |
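A single-head causal attention sketch showing the √d_k scaling and the mask (shapes illustrative; real models split into heads and fuse the QKV projections):

```python
import math
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    """q, k, v: (batch, seq, d_k). Returns (batch, seq, d_k)."""
    d_k = q.size(-1)
    # Scaled dot products: O(n^2 * d) time, O(n^2) memory for this matrix.
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)      # (batch, seq, seq)
    # Causal mask: position i may attend only to positions <= i.
    n = scores.size(-1)
    mask = torch.triu(torch.ones(n, n, dtype=torch.bool, device=scores.device), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```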
🔥 ARCHITECTURE
| Question | Answer |
|---|---|
| Why residual connections? | Allow gradients to bypass layers; prevent vanishing gradients in deep nets |
| Pre-LN vs Post-LN? | Pre-LN (LayerNorm before sublayer) is more stable. Post-LN is original paper. Modern LLMs use Pre-LN |
| What does the FFN do? | 2-layer MLP: expands to 4×dim then compresses back. Stores factual knowledge |
| What's the MLP ratio? | Typically 4×, i.e., hidden_dim = 4 × model_dim |
| What is GELU vs ReLU? | GELU is smoother; allows small negative values. Used in GPT-2+. Better empirical performance |
| What is SwiGLU? | Activation used in LLaMA: Swish combined with a Gated Linear Unit. Empirically outperforms GELU |
| How many layers does GPT-2 large have? | 36 layers |
| What is the decoder-only architecture? | No encoder. Input is autoregressive. Used by GPT, LLaMA, Mistral |
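Putting the block together — a Pre-LN sketch in PyTorch (GPT-2 style: LayerNorm before each sublayer, 4× GELU FFN; `nn.MultiheadAttention` stands in for a hand-rolled causal attention):

```python
import torch.nn as nn

class PreLNBlock(nn.Module):
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(                # 2-layer MLP, 4x expansion
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x, attn_mask=None):
        # Residual connections let gradients bypass each sublayer.
        h = self.ln1(x)                          # Pre-LN: normalize BEFORE the sublayer
        x = x + self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)[0]
        x = x + self.ffn(self.ln2(x))
        return x
```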
🔥 TRAINING
| Question | Answer |
|---|---|
| What loss function is used? | Cross-entropy (next token prediction) |
| What is perplexity? | exp(loss). Lower = better. Perplexity of N means the model is as confused as choosing uniformly from N options |
| What is teacher forcing? | During training, always use ground truth tokens as input (not model's predictions) — makes training stable |
| What optimizer? | AdamW (Adam with decoupled weight decay). Standard for LLMs |
| What is gradient clipping? | Cap gradient norm at a threshold (usually 1.0) to prevent exploding gradients |
| What is learning rate warmup? | Gradually increase LR from 0 to max for first ~1% of training. Prevents early instability |
| What is cosine decay? | LR schedule that follows cosine curve from max to ~0. Standard for LLM training |
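One training step wiring all of these together — a sketch, assuming `model` is an LM like the one above and `optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)`; warmup and total steps are made-up values:

```python
import math
import torch
import torch.nn.functional as F

def lr_at(step, max_lr=3e-4, warmup=2000, total=200_000):
    # Linear warmup from 0 to max_lr, then cosine decay toward ~0.
    if step < warmup:
        return max_lr * step / warmup
    t = (step - warmup) / (total - warmup)
    return 0.5 * max_lr * (1 + math.cos(math.pi * t))

def train_step(model, optimizer, idx, step):
    """idx: (batch, seq) token ids."""
    # Teacher forcing: inputs are tokens 0..n-1, targets are tokens 1..n.
    logits = model(idx[:, :-1])                        # (batch, seq-1, vocab)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           idx[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # clip grad norm at 1.0
    for g in optimizer.param_groups:                   # apply the LR schedule manually
        g["lr"] = lr_at(step)
    optimizer.step()
    return loss.item()                                 # perplexity = math.exp(loss)
```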
🔥 GENERATION
| Question | Answer |
|---|---|
| What is temperature? | Scale logits by 1/T before softmax. T<1 = sharper/confident. T>1 = flatter/creative. T→0 = greedy |
| What is top-k sampling? | Keep only top k tokens by probability and renormalize |
| What is top-p (nucleus) sampling? | Keep smallest set of tokens whose cumulative prob ≥ p. More adaptive than top-k |
| What is greedy decoding? | Always pick the argmax token. Fast but often produces repetitive text |
| What is beam search? | Keep top-B sequences at each step. Better for translation, worse for open-ended generation |
| What is repetition penalty? | Reduce logit score of tokens that have already appeared. Prevents loops |
| What is the difference between completion and chat models? | Completion: raw next-token prediction. Chat: fine-tuned with RLHF/SFT to follow instructions |
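Temperature, top-k, and top-p combined on one logits vector — a sketch (default thresholds are illustrative):

```python
import torch
import torch.nn.functional as F

def sample(logits, temperature=1.0, top_k=50, top_p=0.9):
    """logits: (vocab,). Returns one sampled token id."""
    if temperature == 0:                     # convention: T=0 means greedy decoding
        return torch.argmax(logits).item()
    logits = logits / temperature            # T<1 sharpens, T>1 flattens
    if top_k is not None:                    # keep only the k highest logits
        kth = torch.topk(logits, top_k).values[-1]
        logits[logits < kth] = float("-inf")
    if top_p is not None:                    # nucleus: smallest set with cum. prob >= p
        sorted_logits, idx = torch.sort(logits, descending=True)
        probs = F.softmax(sorted_logits, dim=-1)
        cum = torch.cumsum(probs, dim=-1)
        cutoff = cum - probs > top_p         # tokens entirely past the nucleus
        sorted_logits[cutoff] = float("-inf")
        logits = torch.full_like(logits, float("-inf")).scatter(0, idx, sorted_logits)
    return torch.multinomial(F.softmax(logits, dim=-1), 1).item()
```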
🔥 GOTCHAS INTERVIEWERS LOVE
- "Attention is permutation-invariant" — Without positional encoding, the model treats "dog bites man" and "man bites dog" identically.
- "Cross-entropy initial value" — A randomly initialized model has loss ≈ ln(vocab_size). For GPT-2: ln(50257) ≈ 10.8. Good sanity check for your training setup.
- "Gradient accumulation ≠ larger batch" — Accumulating gradients over N steps is mathematically equivalent to a batch N× larger (assuming no BatchNorm, which LLMs don't use).
- "nn.Embedding vs nn.Linear" —
nn.Embedding is just nn.Linear without bias, accessed via index lookup instead of matrix multiply. Functionally equivalent for integer inputs. - "Why can't you just fine-tune the last layer?" — Because language understanding is distributed across ALL layers. You need at least the top layers fine-tuned, or use LoRA across all layers.
- "What happens if you set temperature=0?" — Numerically: logits → ∞ for the max, -∞ for the rest → after softmax, prob=1 for max → deterministic argmax. In code, usually implemented as
torch.argmax directly. - "LLMs don't have a concept of 'I don't know'" — They assign probability to every token regardless. Hallucination is what happens when the model confidently predicts plausible-sounding but wrong continuations.
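Two of these gotchas are easy to verify in a few lines — the nn.Embedding/nn.Linear equivalence and the initial-loss sanity check:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, d = 50257, 768
emb = nn.Embedding(vocab, d)
lin = nn.Linear(vocab, d, bias=False)
lin.weight.data = emb.weight.data.T            # Linear stores weights as (out, in) = (d, vocab)

ids = torch.tensor([42, 7, 1000])
one_hot = F.one_hot(ids, vocab).float()
print(torch.allclose(emb(ids), lin(one_hot)))  # True: index lookup == one-hot matmul

# A randomly initialized LM should start near uniform over the vocab:
print(math.log(vocab))                         # ≈ 10.8, expected initial loss for GPT-2's vocab
```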
📐 NUMBERS TO MEMORIZE
| Model | Params | Layers | Heads | dim | Vocab |
|---|---|---|---|---|---|
| GPT-2 small | 124M | 12 | 12 | 768 | 50,257 |
| GPT-2 large | 774M | 36 | 20 | 1280 | 50,257 |
| LLaMA-7B | 7B | 32 | 32 | 4096 | 32,000 |
| LLaMA-13B | 13B | 40 | 40 | 5120 | 32,000 |
| Mistral-7B | 7B | 32 | 32 | 4096 | 32,000 |
| GPT-3 | 175B | 96 | 96 | 12,288 | 50,257 |
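A handy back-of-envelope formula behind this table: per layer, attention (QKV + output projections) ≈ 4d² and the 4× MLP ≈ 8d², so params ≈ 12 · layers · d² + vocab · d. A rough estimate — it ignores LayerNorm, biases, and positional embeddings:

```python
def approx_params(layers, d, vocab):
    # Per block: attention ~ 4*d^2, MLP (4x expansion) ~ 8*d^2; plus the embedding matrix.
    return 12 * layers * d * d + vocab * d

print(f"{approx_params(12, 768, 50257) / 1e6:.0f}M")   # GPT-2 small -> ~124M
print(f"{approx_params(36, 1280, 50257) / 1e6:.0f}M")  # GPT-2 large -> ~772M (table: 774M)
print(f"{approx_params(32, 4096, 32000) / 1e9:.1f}B")  # LLaMA-7B -> ~6.6B (its SwiGLU FFN differs slightly)
```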
Next: Stage 2 — Train a real Transformer from scratch