Chapter 6 — Inference & Deployment
Part 1: Quantization — Running 7B Models on Your Laptop
Quantization is the biggest practical enabler for LLM democratization. It's why you can run LLaMA 7B on a MacBook M2. This chapter explains how, why, and what you lose.
The Memory Problem
A 7B parameter model in full precision requires:
FP32 (32-bit): 7B × 4 bytes   = 28 GB  ← needs a data-center GPU (A100 40/80 GB)
BF16 (16-bit): 7B × 2 bytes   = 14 GB  ← fits a 16 GB GPU with almost no headroom
INT8 (8-bit):  7B × 1 byte    = 7 GB   ← fits on an RTX 3090/4090
INT4 (4-bit):  7B × 0.5 bytes = 3.5 GB ← fits on an RTX 3060 or an M2 Mac!
And at inference, you also need memory for:
- KV cache (1–4 GB for 2K–8K context)
- Activations (~1–2 GB per batch element)
So for a consumer 8GB GPU: 3.5GB model + 3GB KV cache + 1.5GB overhead = just fits!
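As a sanity check on these numbers, here is a back-of-envelope estimator. It's a sketch, not a profiler: the defaults below assume the LLaMA 2 7B shape (32 layers, 32 heads, head dim 128) and an FP16 KV cache, and real deployments add framework overhead on top.
def weight_gib(n_params=7e9, bytes_per_param=0.5):
    """Weight memory: bytes_per_param is 0.5 for INT4, 1 for INT8, 2 for BF16."""
    return n_params * bytes_per_param / 2**30

def kv_cache_gib(n_layers=32, n_heads=32, head_dim=128,
                 seq_len=4096, batch=1, bytes_per_elem=2):
    """K and V each store n_layers × n_heads × head_dim values per token."""
    return 2 * n_layers * n_heads * head_dim * seq_len * batch * bytes_per_elem / 2**30

# 7B model at INT4 with a 4K-token FP16 KV cache:
print(f"weights: {weight_gib():.1f} GiB, KV cache: {kv_cache_gib():.1f} GiB")
# → weights: 3.3 GiB, KV cache: 2.0 GiB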
Types of Quantization
Post-Training Quantization (PTQ)
No training required. Just convert the already-trained model weights to lower precision.
Round-to-nearest (RTN): Simplest approach. Just round each weight to the nearest representable value in the target precision.
import torch

def quantize_weight_rtn(W, bits=8):
    """Round-to-nearest asymmetric quantization of a weight tensor."""
    # Compute scale and zero point from the tensor's full range
    w_min, w_max = W.min(), W.max()
    scale = (w_max - w_min) / (2**bits - 1)
    zero_point = torch.round(-w_min / scale)
    # Quantize: float weight → integer code in [0, 2^bits - 1]
    W_int = torch.clamp(torch.round(W / scale + zero_point), 0, 2**bits - 1)
    # Dequantize: integer code → approximate float (done on the fly at inference)
    W_deq = scale * (W_int - zero_point)
    return W_int, W_deq, scale, zero_point
This works decently for INT8. For INT4, quality degrades significantly — you need smarter methods.
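To make that concrete, here is a quick synthetic check using quantize_weight_rtn from above (real weights aren't exactly Gaussian, but the trend holds):
import torch

torch.manual_seed(0)
W = torch.randn(4096, 4096) * 0.02  # roughly weight-like values
for bits in (8, 4):
    _, W_deq, _, _ = quantize_weight_rtn(W, bits=bits)
    print(f"{bits}-bit RTN mean abs error: {(W - W_deq).abs().mean().item():.2e}")
# Each bit removed roughly doubles the rounding error; 8 → 4 bits is ~17× worse.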
GPTQ — Accurate INT4 Quantization
Paper: "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers"
Key insight: quantize the weights one column at a time, and after quantizing each column, absorb its quantization error into the remaining unquantized columns, so the layer's output changes as little as possible.
For each weight matrix W:
For each column c:
1. Round W[:, c] to nearest INT4 value → quantization error δ
2. Use the inverse Hessian (H^-1) to distribute δ across remaining columns
W[:, c+1:] -= δ × H^-1[c, c+1:]
3. Move to next column
This is more expensive than RTN (it needs a Hessian estimated from calibration data), but it produces much higher quality at 4-bit.
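For intuition, here is a heavily simplified single-matrix sketch of that loop. It assumes H_inv, the inverse Hessian from calibration activations, is already computed, and it uses one per-tensor scale; real GPTQ implementations quantize in groups and use a Cholesky factorization for numerical stability.
import torch

def gptq_like_quantize(W, H_inv, bits=4):
    """Toy GPTQ-style loop: quantize column by column, pushing each
    column's quantization error onto the not-yet-quantized columns."""
    W = W.clone()
    n_levels = 2**bits - 1
    scale = (W.max() - W.min()) / n_levels
    zero = torch.round(-W.min() / scale)
    Q = torch.zeros_like(W)
    for c in range(W.shape[1]):
        w = W[:, c]
        q = torch.clamp(torch.round(w / scale + zero), 0, n_levels)
        Q[:, c] = q
        # This column's quantization error, scaled by the Hessian diagonal
        err = (w - scale * (q - zero)) / H_inv[c, c]
        # Compensate: adjust the remaining columns to cancel most of the error
        W[:, c + 1:] -= torch.outer(err, H_inv[c, c + 1:])
    return Q, scale, zero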
In practice:
# Using the AutoGPTQ library (pip install auto-gptq):
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=4,          # 4-bit quantization
    group_size=128,  # each group of 128 weights shares a scale/zero point
    desc_act=False,  # activation-order quantization (False = faster inference)
)

model = AutoGPTQForCausalLM.from_pretrained('meta-llama/Llama-2-7b-hf', quantize_config)
examples = [...]  # a few hundred tokenized examples for calibration
model.quantize(examples)
model.save_quantized('./llama2-7b-gptq-4bit')
GGUF / llama.cpp — CPU-Friendly Quantization
The model file format created by Georgi Gerganov for llama.cpp. Runs on CPU (with optional GPU offload) and supports K-quants: mixed-precision schemes that keep some tensors at higher precision than others.
Common variants:
- Q4_K_M: 4-bit, K-quant medium. Best quality/speed tradeoff. ~4.0 GB for 7B.
- Q5_K_M: 5-bit, better quality. ~4.8 GB for 7B.
- Q8_0: 8-bit, near-full quality. ~7.0 GB for 7B.
- Q2_K: 2-bit, very small but noticeable quality loss. ~2.5 GB for 7B.
Q4_K_M is the "daily driver" — excellent quality, fits on most machines.
# Install llama.cpp (pre-compiled binaries available)
# Or build from source:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make
# Run a GGUF model directly:
./llama-cli -m ./llama-2-7b.Q4_K_M.gguf -p "Explain quantum computing in simple terms:" -n 200
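To produce such a file yourself, the usual flow is to convert HuggingFace weights to a full-precision GGUF, then quantize. Treat this as a sketch: the script and binary names have moved around across llama.cpp versions (the quantize tool, for instance, was renamed llama-quantize).
# Convert a local copy of the HF weights to an FP16 GGUF, then quantize to Q4_K_M:
python convert_hf_to_gguf.py ./Llama-2-7b-hf --outfile llama-2-7b-f16.gguf --outtype f16
./llama-quantize llama-2-7b-f16.gguf llama-2-7b.Q4_K_M.gguf Q4_K_M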
bitsandbytes — Easy INT8/INT4 in Python
The easiest way to quantize for Python/HuggingFace workflows:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit quantization (good quality, ~7 GB for a 7B model)
model_8bit = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    load_in_8bit=True,
    device_map="auto",
)

# 4-bit quantization with NF4 (excellent efficiency, ~3.5 GB for a 7B model)
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)
model_4bit = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=nf4_config,
    device_map="auto",
)
Quantization Quality: What Do You Lose?
Benchmark results (LLaMA 2 7B on MMLU):
| Precision | MMLU Score | Memory | Notes |
|---|---|---|---|
| FP16 | 45.3% | 14 GB | Baseline |
| INT8 | 45.1% | 7 GB | ~0.2% loss — negligible |
| INT4 (NF4) | 44.8% | 3.5 GB | ~0.5% loss — very acceptable |
| INT4 (RTN) | 43.1% | 3.5 GB | ~2% loss — noticeable |
| INT2 | 38.2% | 2 GB | Significant degradation |
Rule of thumb: INT8 is essentially free quality loss. INT4 with smart quantization (GPTQ, NF4) is ~1% quality loss. INT4 with round-to-nearest can be noticeably worse.
Interview corner case 🎯: "Why do outlier weights cause problems in quantization?" — LLM weights typically follow a roughly normal distribution with mean ~0, except for occasional large outliers that can be 100× larger than typical values. With a single scale for the entire tensor, the range has to stretch to accommodate the outliers, leaving very coarse precision for the typical values. Solutions: per-channel quantization (a separate scale per output channel), mixed precision (keep outlier channels in FP16), or LLM.int8() (decompose the matrix multiply to handle outliers separately, covered next). The short experiment below shows the effect numerically.
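A self-contained sketch: one synthetic outlier inflates per-tensor RTN error for every other weight by roughly an order of magnitude.
import torch

torch.manual_seed(0)
w = torch.randn(4096) * 0.02  # typical weights: small, centered near zero
w_out = w.clone()
w_out[0] = 2.0                # a single ~100× outlier

def rtn_mean_error(W, bits=8):
    scale = (W.max() - W.min()) / (2**bits - 1)
    zero = torch.round(-W.min() / scale)
    W_q = torch.clamp(torch.round(W / scale + zero), 0, 2**bits - 1)
    return (W - scale * (W_q - zero)).abs().mean().item()

print(f"no outlier:  {rtn_mean_error(w):.2e}")
print(f"one outlier: {rtn_mean_error(w_out):.2e}")  # scale stretched → ~10× more error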
LLM.int8() — The Clever Outlier Solution
Paper: "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale" — Dettmers et al., 2022
The observation: transformer activations have systematic outlier features in specific channels (the same hidden dimension takes large values across almost all tokens). These outliers emerge at scale (roughly 6B+ parameters).
The solution: mixed decomposition:
import torch

def mixed_int8_matmul(X, W, threshold=6.0):
    """Schematic LLM.int8() decomposition of Y = X @ W."""
    # 1. Find outlier columns in X (any |value| > threshold; the paper uses 6.0)
    outlier_cols = (X.abs() > threshold).any(dim=0)
    X_outlier = X[:, outlier_cols]  # FP16 — keep precision for outliers
    X_normal = X[:, ~outlier_cols]  # INT8 — quantize normal columns
    # 2. Multiply each portion separately
    Y_outlier = X_outlier @ W[outlier_cols, :]  # FP16 matmul (small, exact)
    Y_normal = X_normal @ W[~outlier_cols, :]   # INT8 matmul in real kernels (large, fast)
    # 3. Combine
    return Y_outlier + Y_normal
Only ~0.1% of features are outliers, but keeping them in FP16 preserves almost all of the quality. This is what load_in_8bit=True in bitsandbytes implements under the hood.
Quantization Interview Corner Cases 🎯
- "What is symmetric vs asymmetric quantization?" → Symmetric: range is [-max, max], zero point = 0. No zero-point parameter needed, slightly less accurate. Asymmetric: range is [min, max], with a non-zero zero point that maps the minimum value to zero in integer space. More accurate for non-symmetric weight distributions.
- "What is group quantization, and why is it used for INT4?" → Instead of one scale factor per tensor, use one per group of N weights (typically 128). Each group has a separate scale that better fits that local range, dramatically improving INT4 quality. Cost: slightly more overhead for storing scale factors.
- "Can you fine-tune a quantized model?" → Not the quantized weights directly (they're integers and can't receive gradients). But you can fine-tune LoRA adapters on top of a quantized base (QLoRA). The quantized weights stay frozen; only the FP16 LoRA adapters are trained.
- "What is the tradeoff between quantization bits and inference speed?" → Not always linear. INT8 on modern GPUs with tensor cores can be 2× faster than FP16 for large matrix multiplications. INT4 can be 3-4× faster. But the CPU spends time dequantizing, and memory bandwidth savings may dominate compute savings depending on hardware.
- "Why is GGUF better for CPU inference than GPTQ?" → GPTQ is optimized for CUDA GPU inference. GGUF (used by llama.cpp) uses CPU-friendly data layouts, SIMD instructions, and supports hybrid CPU+GPU inference where some layers are on GPU and rest on CPU. GGUF K-quants also adaptively apply higher precision to important layers.