Stage 1 — Interview Cheat Sheet
LLM Fundamentals: Corner Cases & Trick Questions
🔥 TOKENS
| Question | Answer |
|---|---|
| How many tokens ≈ 1000 words? | ~1333 tokens (1 token ≈ 0.75 words) |
| Why can't LLMs count letters in "strawberry"? | The model sees "strawberry" as one or two subword tokens, not characters — it never observes individual letters |
| Why do LLMs struggle with arithmetic? | "12345" tokenizes as ["123","45"] — they operate on token chunks |
| What is BPE? | Byte Pair Encoding: merges frequent character pairs iteratively to build subword vocab |
| What's GPT-4's context length? | Original GPT-4: 8K (32K variant). GPT-4 Turbo and GPT-4o: 128K tokens |
| Why not use character-level tokenization everywhere? | Sequences get far longer → more steps (and quadratic attention cost) to span the same text, and the model must learn character-level composition itself |
| Why not use word-level? | Vocab explodes (millions of words), can't handle OOV/misspellings |
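If asked to whiteboard BPE — a minimal toy training loop, assuming a plain word list as input (illustrative only; real tokenizers like GPT-2's operate on bytes and add pre-tokenization):

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Toy BPE: learn `num_merges` merge rules from a list of words."""
    # Start with each word as a tuple of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)          # most frequent pair
        merges.append(best)
        # Apply the merge: replace every occurrence of the pair with one symbol.
        new_corpus = Counter()
        for word, freq in corpus.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges

# e.g. bpe_train(["low", "lower", "lowest", "low"], 3)
# the first merge is likely ("l", "o") -> "lo"
```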
🔥 EMBEDDINGS
| Question | Answer |
|---|---|
| What shape is the embedding matrix? | (vocab_size × embedding_dim), e.g., (50257 × 768) for GPT-2 |
| What's the difference between token and positional embeddings? | Token = WHAT the token is. Positional = WHERE it is. Both added together. |
| What are the 3 types of positional encoding? | Learned absolute, Sinusoidal (fixed), RoPE (rotary) |
| Why RoPE? | Encodes relative positions, generalizes better to longer contexts |
| What is ALiBi? | Attention with Linear Biases — adds position bias to attention scores, no embedding needed |
| What dim is LLaMA-7B's embedding? | 4096 |
| What is weight tying? | Sharing weights between the token embedding matrix and the final output projection (lm_head). Saves vocab_size × dim params (~38M for GPT-2 small) and often improves performance |
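How these pieces fit together — a minimal PyTorch sketch with learned absolute positions and weight tying (class name and sizes are illustrative, GPT-2-ish):

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab_size=50257, d_model=768, max_len=1024):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)   # WHAT each token is
        self.pos_emb = nn.Embedding(max_len, d_model)      # WHERE it sits (learned absolute)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.tok_emb.weight          # weight tying: one shared matrix

    def forward(self, idx):                                # idx: (batch, seq) token ids
        pos = torch.arange(idx.size(1), device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)          # embeddings are ADDED, not concatenated
        # ... transformer blocks would go here ...
        return self.lm_head(x)                             # logits: (batch, seq, vocab)
```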
🔥 ATTENTION
| Question | Answer |
|---|---|
| What is the time complexity of attention? | O(n²·d) where n=sequence length, d=dimension |
| What is the space complexity? | O(n²) for the attention matrix |
| What is causal (masked) attention? | Tokens can only attend to past + current tokens (prevents future leakage) |
| What is KV-cache? | Cache Key and Value tensors during inference to avoid recomputing them |
| What is Flash Attention? | Memory-efficient attention using GPU tiling; avoids materializing full n×n matrix in HBM |
| What's the difference between self-attention and cross-attention? | Self: Q,K,V from same sequence. Cross: Q from one sequence, K,V from another (encoder-decoder) |
| Why √d_k in denominator? | Prevent dot products from becoming too large → softmax saturation → vanishing gradients |
| What are attention heads for? | Each head learns different types of relationships (syntax, coreference, etc.) |
| How many attention heads in GPT-2 small? | 12 heads, each of size 64 (768 / 12) |
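A single-head causal attention sketch showing the √d_k scaling and the mask (shapes illustrative; real models split into heads and fuse the QKV projections):

```python
import math
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    """q, k, v: (batch, seq, d_k). Returns (batch, seq, d_k)."""
    d_k = q.size(-1)
    # Scaled dot products: O(n^2 * d) time, O(n^2) memory for this matrix.
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)      # (batch, seq, seq)
    # Causal mask: position i may attend only to positions <= i.
    n = scores.size(-1)
    mask = torch.triu(torch.ones(n, n, dtype=torch.bool, device=scores.device), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```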
🔥 ARCHITECTURE
| Question | Answer |
|---|---|
| Why residual connections? | Allow gradients to bypass layers; prevent vanishing gradients in deep nets |
| Pre-LN vs Post-LN? | Pre-LN (LayerNorm before sublayer) is more stable. Post-LN is original paper. Modern LLMs use Pre-LN |
| What does the FFN do? | 2-layer MLP: expands to 4×dim then compresses back. Stores factual knowledge |
| What's the MLP ratio? | Typically 4×, i.e., hidden_dim = 4 × model_dim |
| What is GELU vs ReLU? | GELU is smoother; allows small negative values. Used in GPT-2+. Better empirical performance |
| What is SwiGLU? | Activation used in LLaMA: Swish combined with a Gated Linear Unit. Empirically outperforms GELU |
| How many layers does GPT-2 large have? | 36 layers |
| What is the decoder-only architecture? | No encoder. Input is autoregressive. Used by GPT, LLaMA, Mistral |
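Putting the block together — a Pre-LN sketch in PyTorch (GPT-2 style: LayerNorm before each sublayer, 4× GELU FFN; `nn.MultiheadAttention` stands in for a hand-rolled causal attention):

```python
import torch.nn as nn

class PreLNBlock(nn.Module):
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(                # 2-layer MLP, 4x expansion
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x, attn_mask=None):
        # Residual connections let gradients bypass each sublayer.
        h = self.ln1(x)                          # Pre-LN: normalize BEFORE the sublayer
        x = x + self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)[0]
        x = x + self.ffn(self.ln2(x))
        return x
```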
🔥 TRAINING
| Question | Answer |
|---|---|
| What loss function is used? | Cross-entropy (next token prediction) |
| What is perplexity? | exp(loss). Lower = better. Perplexity of N means the model is as confused as choosing uniformly from N options |
| What is teacher forcing? | During training, always use ground truth tokens as input (not model's predictions) — makes training stable |
| What optimizer? | AdamW (Adam with decoupled weight decay). Standard for LLMs |
| What is gradient clipping? | Cap gradient norm at a threshold (usually 1.0) to prevent exploding gradients |
| What is learning rate warmup? | Gradually increase LR from 0 to max for first ~1% of training. Prevents early instability |
| What is cosine decay? | LR schedule that follows cosine curve from max to ~0. Standard for LLM training |
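One training step wiring all of these together — a sketch, assuming `model` is an LM like the one above and `optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)`; warmup and total steps are made-up values:

```python
import math
import torch
import torch.nn.functional as F

def lr_at(step, max_lr=3e-4, warmup=2000, total=200_000):
    # Linear warmup from 0 to max_lr, then cosine decay toward ~0.
    if step < warmup:
        return max_lr * step / warmup
    t = (step - warmup) / (total - warmup)
    return 0.5 * max_lr * (1 + math.cos(math.pi * t))

def train_step(model, optimizer, idx, step):
    """idx: (batch, seq) token ids."""
    # Teacher forcing: inputs are tokens 0..n-1, targets are tokens 1..n.
    logits = model(idx[:, :-1])                        # (batch, seq-1, vocab)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           idx[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # clip grad norm at 1.0
    for g in optimizer.param_groups:                   # apply the LR schedule manually
        g["lr"] = lr_at(step)
    optimizer.step()
    return loss.item()                                 # perplexity = math.exp(loss)
```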
🔥 GENERATION
| Question | Answer |
|---|---|
| What is temperature? | Scale logits by 1/T before softmax. T<1 = sharper/confident. T>1 = flatter/creative. T→0 = greedy |
| What is top-k sampling? | Keep only top k tokens by probability and renormalize |
| What is top-p (nucleus) sampling? | Keep smallest set of tokens whose cumulative prob ≥ p. More adaptive than top-k |
| What is greedy decoding? | Always pick the argmax token. Fast but often produces repetitive text |
| What is beam search? | Keep top-B sequences at each step. Better for translation, worse for open-ended generation |
| What is repetition penalty? | Reduce logit score of tokens that have already appeared. Prevents loops |
| What is the difference between completion and chat models? | Completion: raw next-token prediction. Chat: fine-tuned with RLHF/SFT to follow instructions |
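Temperature, top-k, and top-p combined on one logits vector — a sketch (default thresholds are illustrative):

```python
import torch
import torch.nn.functional as F

def sample(logits, temperature=1.0, top_k=50, top_p=0.9):
    """logits: (vocab,). Returns one sampled token id."""
    if temperature == 0:                     # convention: T=0 means greedy decoding
        return torch.argmax(logits).item()
    logits = logits / temperature            # T<1 sharpens, T>1 flattens
    if top_k is not None:                    # keep only the k highest logits
        kth = torch.topk(logits, top_k).values[-1]
        logits[logits < kth] = float("-inf")
    if top_p is not None:                    # nucleus: smallest set with cum. prob >= p
        sorted_logits, idx = torch.sort(logits, descending=True)
        probs = F.softmax(sorted_logits, dim=-1)
        cum = torch.cumsum(probs, dim=-1)
        cutoff = cum - probs > top_p         # tokens entirely past the nucleus
        sorted_logits[cutoff] = float("-inf")
        logits = torch.full_like(logits, float("-inf")).scatter(0, idx, sorted_logits)
    return torch.multinomial(F.softmax(logits, dim=-1), 1).item()
```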
🔥 GOTCHAS INTERVIEWERS LOVE
- "Attention is permutation-invariant" — Without positional encoding, the model treats "dog bites man" and "man bites dog" identically.
- "Cross-entropy initial value" — A randomly initialized model has loss ≈ ln(vocab_size). For GPT-2: ln(50257) ≈ 10.8. Good sanity check for your training setup.
- "Gradient accumulation ≠ larger batch" — Accumulating gradients over N steps is mathematically equivalent to a batch N× larger (assuming no BatchNorm, which LLMs don't use).
- "nn.Embedding vs nn.Linear" —
nn.Embedding is just nn.Linear without bias, accessed via index lookup instead of matrix multiply. Functionally equivalent for integer inputs. - "Why can't you just fine-tune the last layer?" — Because language understanding is distributed across ALL layers. You need at least the top layers fine-tuned, or use LoRA across all layers.
- "What happens if you set temperature=0?" — Numerically: logits → ∞ for the max, -∞ for the rest → after softmax, prob=1 for max → deterministic argmax. In code, usually implemented as
torch.argmax directly. - "LLMs don't have a concept of 'I don't know'" — They assign probability to every token regardless. Hallucination is what happens when the model confidently predicts plausible-sounding but wrong continuations.
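Two of these gotchas are easy to verify in a few lines — the nn.Embedding/nn.Linear equivalence and the initial-loss sanity check:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, d = 50257, 768
emb = nn.Embedding(vocab, d)
lin = nn.Linear(vocab, d, bias=False)
lin.weight.data = emb.weight.data.T            # Linear stores weights as (out, in) = (d, vocab)

ids = torch.tensor([42, 7, 1000])
one_hot = F.one_hot(ids, vocab).float()
print(torch.allclose(emb(ids), lin(one_hot)))  # True: index lookup == one-hot matmul

# A randomly initialized LM should start near uniform over the vocab:
print(math.log(vocab))                         # ≈ 10.8, expected initial loss for GPT-2's vocab
```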
📐 NUMBERS TO MEMORIZE
| Model | Params | Layers | Heads | dim | Vocab |
|---|---|---|---|---|---|
| GPT-2 small | 124M | 12 | 12 | 768 | 50,257 |
| GPT-2 large | 774M | 36 | 20 | 1280 | 50,257 |
| LLaMA-7B | 7B | 32 | 32 | 4096 | 32,000 |
| LLaMA-13B | 13B | 40 | 40 | 5120 | 32,000 |
| Mistral-7B | 7B | 32 | 32 | 4096 | 32,000 |
| GPT-3 | 175B | 96 | 96 | 12,288 | 50,257 |
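A handy back-of-envelope formula behind this table: per layer, attention (QKV + output projections) ≈ 4d² and the 4× MLP ≈ 8d², so params ≈ 12 · layers · d² + vocab · d. A rough estimate — it ignores LayerNorm, biases, and positional embeddings:

```python
def approx_params(layers, d, vocab):
    # Per block: attention ~ 4*d^2, MLP (4x expansion) ~ 8*d^2; plus the embedding matrix.
    return 12 * layers * d * d + vocab * d

print(f"{approx_params(12, 768, 50257) / 1e6:.0f}M")   # GPT-2 small -> ~124M
print(f"{approx_params(36, 1280, 50257) / 1e6:.0f}M")  # GPT-2 large -> ~772M (table: 774M)
print(f"{approx_params(32, 4096, 32000) / 1e9:.1f}B")  # LLaMA-7B -> ~6.6B (its SwiGLU FFN differs slightly)
```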
Next: Stage 2 — Train a real Transformer from scratch