Chapter 6 — Inference & Deployment
Part 1: Quantization — Running 7B Models on Your Laptop
Quantization is the biggest practical enabler for LLM democratization. It's why you can run LLaMA 7B on a MacBook M2. This chapter explains how, why, and what you lose.
The Memory Problem
A 7B parameter model in full precision requires:
FP32 (32-bit): 7B × 4 bytes   = 28 GB  ← needs a data-center GPU (A100 40/80 GB)
BF16 (16-bit): 7B × 2 bytes   = 14 GB  ← fits a 16 GB GPU with almost no headroom
INT8 (8-bit):  7B × 1 byte    = 7 GB   ← fits on an RTX 3090/4090
INT4 (4-bit):  7B × 0.5 bytes = 3.5 GB ← fits on an RTX 3060 or an M2 Mac!
And at inference, you also need memory for:
- KV cache (1–4 GB for 2K–8K context)
- Activations (~1–2 GB per batch element)
So for a consumer 8GB GPU: 3.5GB model + 3GB KV cache + 1.5GB overhead = just fits!
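As a sanity check on these numbers, here is a back-of-envelope estimator. It's a sketch, not a profiler: the defaults below assume the LLaMA 2 7B shape (32 layers, 32 heads, head dim 128) and an FP16 KV cache, and real deployments add framework overhead on top.
def weight_gib(n_params=7e9, bytes_per_param=0.5):
    """Weight memory: bytes_per_param is 0.5 for INT4, 1 for INT8, 2 for BF16."""
    return n_params * bytes_per_param / 2**30

def kv_cache_gib(n_layers=32, n_heads=32, head_dim=128,
                 seq_len=4096, batch=1, bytes_per_elem=2):
    """K and V each store n_layers × n_heads × head_dim values per token."""
    return 2 * n_layers * n_heads * head_dim * seq_len * batch * bytes_per_elem / 2**30

# 7B model at INT4 with a 4K-token FP16 KV cache:
print(f"weights: {weight_gib():.1f} GiB, KV cache: {kv_cache_gib():.1f} GiB")
# → weights: 3.3 GiB, KV cache: 2.0 GiB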
Types of Quantization
Post-Training Quantization (PTQ)
No training required. Just convert the already-trained model weights to lower precision.
Round-to-nearest (RTN): Simplest approach. Just round each weight to the nearest representable value in the target precision.
import torch

def quantize_weight_rtn(W, bits=8):
    """Round-to-nearest asymmetric quantization of a weight tensor."""
    # Compute scale and zero point from the tensor's full range
    w_min, w_max = W.min(), W.max()
    scale = (w_max - w_min) / (2**bits - 1)
    zero_point = torch.round(-w_min / scale)
    # Quantize: float weight → integer code in [0, 2^bits - 1]
    W_int = torch.clamp(torch.round(W / scale + zero_point), 0, 2**bits - 1)
    # Dequantize: integer code → approximate float (done on the fly at inference)
    W_deq = scale * (W_int - zero_point)
    return W_int, W_deq, scale, zero_point
This works decently for INT8. For INT4, quality degrades significantly — you need smarter methods.
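To make that concrete, here is a quick synthetic check using quantize_weight_rtn from above (real weights aren't exactly Gaussian, but the trend holds):
import torch

torch.manual_seed(0)
W = torch.randn(4096, 4096) * 0.02  # roughly weight-like values
for bits in (8, 4):
    _, W_deq, _, _ = quantize_weight_rtn(W, bits=bits)
    print(f"{bits}-bit RTN mean abs error: {(W - W_deq).abs().mean().item():.2e}")
# Each bit removed roughly doubles the rounding error; 8 → 4 bits is ~17× worse.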
GPTQ — Accurate INT4 Quantization
Paper: "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers"
Key insight: quantize the weights one column at a time, and after quantizing each column, absorb its quantization error into the remaining unquantized columns, so the layer's output changes as little as possible.
For each weight matrix W:
For each column c:
1. Round W[:, c] to nearest INT4 value → quantization error δ
2. Use the inverse Hessian (H^-1) to distribute δ across remaining columns
W[:, c+1:] -= δ × H^-1[c, c+1:]
3. Move to next column
This is more expensive than RTN (it needs a Hessian estimated from calibration data), but it produces much higher quality at 4-bit.
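For intuition, here is a heavily simplified single-matrix sketch of that loop. It assumes H_inv, the inverse Hessian from calibration activations, is already computed, and it uses one per-tensor scale; real GPTQ implementations quantize in groups and use a Cholesky factorization for numerical stability.
import torch

def gptq_like_quantize(W, H_inv, bits=4):
    """Toy GPTQ-style loop: quantize column by column, pushing each
    column's quantization error onto the not-yet-quantized columns."""
    W = W.clone()
    n_levels = 2**bits - 1
    scale = (W.max() - W.min()) / n_levels
    zero = torch.round(-W.min() / scale)
    Q = torch.zeros_like(W)
    for c in range(W.shape[1]):
        w = W[:, c]
        q = torch.clamp(torch.round(w / scale + zero), 0, n_levels)
        Q[:, c] = q
        # This column's quantization error, scaled by the Hessian diagonal
        err = (w - scale * (q - zero)) / H_inv[c, c]
        # Compensate: adjust the remaining columns to cancel most of the error
        W[:, c + 1:] -= torch.outer(err, H_inv[c, c + 1:])
    return Q, scale, zero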
In practice:
# Using the AutoGPTQ library (pip install auto-gptq):
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=4,          # 4-bit quantization
    group_size=128,  # each group of 128 weights shares a scale/zero point
    desc_act=False,  # activation-order quantization (False = faster inference)
)

model = AutoGPTQForCausalLM.from_pretrained('meta-llama/Llama-2-7b-hf', quantize_config)
examples = [...]  # a few hundred tokenized examples for calibration
model.quantize(examples)
model.save_quantized('./llama2-7b-gptq-4bit')
GGUF / llama.cpp — CPU-Friendly Quantization
The model file format created by Georgi Gerganov for llama.cpp. Runs on CPU (with optional GPU offload) and supports K-quants: mixed-precision schemes that keep some tensors at higher precision than others.
Common variants:
- Q4_K_M: 4-bit, K-quant medium. Best quality/speed tradeoff. ~4.0 GB for 7B.
- Q5_K_M: 5-bit, better quality. ~4.8 GB for 7B.
- Q8_0: 8-bit, near-full quality. ~7.0 GB for 7B.
- Q2_K: 2-bit, very small but noticeable quality loss. ~2.5 GB for 7B.
Q4_K_M is the "daily driver" — excellent quality, fits on most machines.
# Install llama.cpp (pre-compiled binaries available)
# Or build from source:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make
# Run a GGUF model directly:
./llama-cli -m ./llama-2-7b.Q4_K_M.gguf -p "Explain quantum computing in simple terms:" -n 200
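To produce such a file yourself, the usual flow is to convert HuggingFace weights to a full-precision GGUF, then quantize. Treat this as a sketch: the script and binary names have moved around across llama.cpp versions (the quantize tool, for instance, was renamed llama-quantize).
# Convert a local copy of the HF weights to an FP16 GGUF, then quantize to Q4_K_M:
python convert_hf_to_gguf.py ./Llama-2-7b-hf --outfile llama-2-7b-f16.gguf --outtype f16
./llama-quantize llama-2-7b-f16.gguf llama-2-7b.Q4_K_M.gguf Q4_K_M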
bitsandbytes — Easy INT8/INT4 in Python
The easiest way to quantize for Python/HuggingFace workflows:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit quantization (good quality, ~7 GB for a 7B model)
model_8bit = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    load_in_8bit=True,
    device_map="auto",
)

# 4-bit quantization with NF4 (excellent efficiency, ~3.5 GB for a 7B model)
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)
model_4bit = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=nf4_config,
    device_map="auto",
)
Quantization Quality: What Do You Lose?
Benchmark results (LLaMA 2 7B on MMLU):
| Precision | MMLU Score | Memory | Notes |
|---|---|---|---|
| FP16 | 45.3% | 14 GB | Baseline |
| INT8 | 45.1% | 7 GB | ~0.2% loss — negligible |
| INT4 (NF4) | 44.8% | 3.5 GB | ~0.5% loss — very acceptable |
| INT4 (RTN) | 43.1% | 3.5 GB | ~2% loss — noticeable |
| INT2 | 38.2% | 2 GB | Significant degradation |
Rule of thumb: INT8 is essentially free quality loss. INT4 with smart quantization (GPTQ, NF4) is ~1% quality loss. INT4 with round-to-nearest can be noticeably worse.
Interview corner case 🎯: "Why do outlier weights cause problems in quantization?" — LLM weights typically follow a roughly normal distribution with mean ~0, except for occasional large outliers that can be 100× larger than typical values. With a single scale for the entire tensor, the range has to stretch to accommodate the outliers, leaving very coarse precision for the typical values. Solutions: per-channel quantization (a separate scale per output channel), mixed precision (keep outlier channels in FP16), or LLM.int8() (decompose the matrix multiply to handle outliers separately, covered next). The short experiment below shows the effect numerically.
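A self-contained sketch: one synthetic outlier inflates per-tensor RTN error for every other weight by roughly an order of magnitude.
import torch

torch.manual_seed(0)
w = torch.randn(4096) * 0.02  # typical weights: small, centered near zero
w_out = w.clone()
w_out[0] = 2.0                # a single ~100× outlier

def rtn_mean_error(W, bits=8):
    scale = (W.max() - W.min()) / (2**bits - 1)
    zero = torch.round(-W.min() / scale)
    W_q = torch.clamp(torch.round(W / scale + zero), 0, 2**bits - 1)
    return (W - scale * (W_q - zero)).abs().mean().item()

print(f"no outlier:  {rtn_mean_error(w):.2e}")
print(f"one outlier: {rtn_mean_error(w_out):.2e}")  # scale stretched → ~10× more error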
LLM.int8() — The Clever Outlier Solution
Paper: "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale" — Dettmers et al., 2022
The observation: transformer activations have systematic outlier features in specific channels (the same hidden dimension takes large values across almost all tokens). These outliers emerge at scale (roughly 6B+ parameters).
The solution: mixed decomposition:
import torch

def mixed_int8_matmul(X, W, threshold=6.0):
    """Schematic LLM.int8() decomposition of Y = X @ W."""
    # 1. Find outlier columns in X (any |value| > threshold; the paper uses 6.0)
    outlier_cols = (X.abs() > threshold).any(dim=0)
    X_outlier = X[:, outlier_cols]  # FP16 — keep precision for outliers
    X_normal = X[:, ~outlier_cols]  # INT8 — quantize normal columns
    # 2. Multiply each portion separately
    Y_outlier = X_outlier @ W[outlier_cols, :]  # FP16 matmul (small, exact)
    Y_normal = X_normal @ W[~outlier_cols, :]   # INT8 matmul in real kernels (large, fast)
    # 3. Combine
    return Y_outlier + Y_normal
Only ~0.1% of features are outliers, but keeping them in FP16 preserves almost all of the quality. This is what load_in_8bit=True in bitsandbytes implements under the hood.
Quantization Interview Corner Cases 🎯
- "What is symmetric vs asymmetric quantization?" → Symmetric: range is [-max, max], zero point = 0. No zero-point parameter needed, slightly less accurate. Asymmetric: range is [min, max], with a non-zero zero point that maps the minimum value to zero in integer space. More accurate for non-symmetric weight distributions.
- "What is group quantization, and why is it used for INT4?" → Instead of one scale factor per tensor, use one per group of N weights (typically 128). Each group has a separate scale that better fits that local range, dramatically improving INT4 quality. Cost: slightly more overhead for storing scale factors.
- "Can you fine-tune a quantized model?" → Not the quantized weights directly (they're integers and can't receive gradients). But you can fine-tune LoRA adapters on top of a quantized base (QLoRA). The quantized weights stay frozen; only the FP16 LoRA adapters are trained.
- "What is the tradeoff between quantization bits and inference speed?" → Not always linear. INT8 on modern GPUs with tensor cores can be 2× faster than FP16 for large matrix multiplications. INT4 can be 3-4× faster. But the CPU spends time dequantizing, and memory bandwidth savings may dominate compute savings depending on hardware.
- "Why is GGUF better for CPU inference than GPTQ?" → GPTQ is optimized for CUDA GPU inference. GGUF (used by llama.cpp) uses CPU-friendly data layouts, SIMD instructions, and supports hybrid CPU+GPU inference where some layers are on GPU and rest on CPU. GGUF K-quants also adaptively apply higher precision to important layers.