Chapter 6 — Inference & Deployment
Part 2: Inference Engines, vLLM, and Serving at Scale
Once you've trained and fine-tuned a model, you need to actually run it. This chapter covers how to serve LLMs efficiently, from your laptop to production at scale.
The Inference Bottleneck: Memory Bandwidth, Not Compute
A common misconception: LLM inference is slow because of the computation (matrix multiplies). Wrong. At low batch sizes, inference is bound by memory bandwidth, not compute.
At inference time:
- For each generated token, you load ALL model weights from GPU memory
- 7B parameters × 2 bytes = 14GB loaded from HBM each time
- A100 HBM bandwidth: 2 TB/s → loading 14GB takes ~7ms
- At 7ms per token: max ~140 tokens/second per A100 (for batch=1)
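You can sanity-check this arithmetic yourself. A minimal back-of-the-envelope sketch in Python (same numbers as the bullets above; real engines come in lower because of KV cache reads and kernel overheads):
# Decode speed ceiling for a memory-bandwidth-bound model
def max_tokens_per_second(n_params, bytes_per_param, hbm_bytes_per_s):
    weight_bytes = n_params * bytes_per_param   # bytes read per generated token
    return hbm_bytes_per_s / weight_bytes       # tokens/s = bandwidth / bytes-per-token

print(max_tokens_per_second(7e9, 2.0, 2e12))   # FP16: ~143 tok/s
print(max_tokens_per_second(7e9, 0.5, 2e12))   # 4-bit: ~571 tok/s (the 4x from quantization)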
Techniques to improve throughput:
- Quantization: Reduce 14GB → 3.5GB of weights, so 4× fewer bytes to load per token
- Larger batches: Multiple users share the same weight-loading overhead
- Speculative decoding: Generate (and verify) multiple tokens per weight load (sketched after this list)
- FlashAttention: Fused attention kernels avoid materializing the attention matrix in HBM, cutting attention memory traffic
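To make the speculative decoding bullet concrete, here is a toy sketch of the greedy variant. draft_next and target_greedy are hypothetical stand-ins for the two models, not a real library API; production implementations verify against the target's probabilities with rejection sampling rather than exact greedy matching.
# Toy greedy speculative decoding. Assumes a non-empty prompt and two callables:
#   draft_next(seq)    -> the draft model's greedy next token for seq
#   target_greedy(seq) -> list g where g[i] is the target model's greedy
#                         next token after the prefix seq[:i+1]
# The point: ONE target forward pass (one weight load) can verify up to k tokens.
def speculative_step(tokens, draft_next, target_greedy, k=4):
    # 1. Draft proposes k tokens autoregressively (cheap, small model)
    proposal, ctx = [], list(tokens)
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)
    # 2. Target scores the whole proposed sequence in one forward pass
    g = target_greedy(tokens + proposal)
    # 3. Accept draft tokens while they match the target's own choices
    accepted = []
    for i, t in enumerate(proposal):
        target_choice = g[len(tokens) + i - 1]  # target's pick at this position
        if t == target_choice:
            accepted.append(t)
        else:
            accepted.append(target_choice)  # first mismatch: take the target's token
            break
    else:
        accepted.append(g[len(tokens) + k - 1])  # all matched: free bonus token
    return tokens + accepted  # 1 to k+1 new tokens per target weight load

# Tiny demo with fake "models" over integer tokens:
draft = lambda seq: (seq[-1] + 1) % 10            # draft guesses last+1
target = lambda seq: [(t + 1) % 10 for t in seq]  # target agrees, so all k accepted
print(speculative_step([1, 2, 3], draft, target))  # -> [1, 2, 3, 4, 5, 6, 7, 8]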
Inference Engines: When to Use What
| Engine | Best For | Quantization | Multi-GPU | Notes |
|---|---|---|---|---|
| HuggingFace Transformers | Dev/testing | bitsandbytes | Basic | Easiest to use |
| llama.cpp | CPU, local use | GGUF | Partial | Very low RAM use |
| vLLM | GPU serving, production | GPTQ, AWQ, FP8 | ✅ | Best throughput |
| TGI (HuggingFace) | Easy deploy | GPTQ, AWQ | ✅ | Good for HF models |
| Ollama | Local, GUI users | GGUF | Partial | Easiest local setup |
| TensorRT-LLM | NVIDIA production | FP8/INT4 | ✅ | Maximum speed |
| ExLlamaV2 | Consumer GPU | EXL2 | Partial | Best on RTX cards |
llama.cpp: The CPU Hero
What it is: A C/C++ implementation of LLM inference that began as a port of LLaMA and now supports many model families. Runs on any machine — Mac, Windows, Linux, no NVIDIA GPU needed.
Why it matters:
- Runs on Apple Silicon (M1/M2/M3) using Metal acceleration
- Uses AVX2/AVX512 SIMD instructions on CPU
- GGUF format with K-quants for excellent quality at low bits
- Supports partial GPU offloading (some layers on GPU, rest on CPU)
# Install
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Build for CPU
make
# Build for CUDA (NVIDIA)
make LLAMA_CUDA=1
# Build for Metal (Mac; recent builds enable Metal by default on Apple Silicon)
make LLAMA_METAL=1
# Note: newer releases have switched to CMake (cmake -B build && cmake --build build)
# Download a GGUF model (Ollama's library or HuggingFace)
# Example: LLaMA 3.2 3B Q4_K_M
wget https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf
# Run it!
./llama-cli \
    -m Llama-3.2-3B-Instruct-Q4_K_M.gguf \
    --chat-template llama3 \
    -n 200 \
    -p "You are a helpful assistant.\n\nUser: Explain quantum entanglement simply.\nAssistant:"
# Start a server (OpenAI-compatible API!):
./llama-server -m model.gguf --port 8080
# Now call it like OpenAI API: http://localhost:8080/v1/chat/completions
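Because the endpoint is OpenAI-compatible, you can also point the official openai Python client at it. A minimal sketch (the api_key is a dummy value; llama-server does not require one by default, and the model name is only a label for a single-model server):
# pip install openai
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="local-model",  # single-model server; the name is just a label
    messages=[{"role": "user", "content": "Explain quantum entanglement simply."}],
    max_tokens=200,
)
print(resp.choices[0].message.content)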
Interview corner case 🎯: "How does llama.cpp handle models that don't fit in GPU memory?" — "GPU offloading" — you specify how many layers to put on GPU (-ngl 20 = 20 layers on GPU, rest on CPU). The model is split at a layer boundary: the GPU-resident layers process the activations, which then cross to the CPU for the remaining layers, once per generated token. Less efficient than all-GPU, but much better than pure CPU.
Ollama: The Developer's Local LLM Tool
What it is: A wrapper around llama.cpp with a model library, simple CLI, and Docker-like model management.
# Install (Mac/Linux)
curl -fsSL https://ollama.ai/install.sh | sh
# Pull and run a model (automatically downloads GGUF)
ollama run llama3.2
# List available models
ollama list
# Serve as API (starts automatically)
# API is at http://localhost:11434
# Call it:
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "Explain gradient descent."}]
}'
# Or use the OpenAI-compatible endpoint:
curl http://localhost:11434/v1/chat/completions -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "Hello!"}]
}'
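There is also an official Python package (pip install ollama). A minimal sketch, assuming the current client API and an Ollama server running on the default port:
import ollama

response = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Explain gradient descent."}],
)
print(response["message"]["content"])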
Ollama handles model storage, GPU detection, and memory management automatically. Best for local development and demos.
vLLM: Production-Grade GPU Serving
What it is: A high-throughput inference engine for GPU serving, with PagedAttention for efficient memory management.
flowchart TD
    subgraph Requests["Incoming Requests"]
        R1[User 1]
        R2[User 2]
        R3[User 3]
    end
    R1 --> CB[Continuous Batching Scheduler]
    R2 --> CB
    R3 --> CB
    CB --> GPU[GPU forward pass]
    GPU --> PA[PagedAttention KV Cache Manager]
    PA -->|allocate pages on demand| KV[(KV Cache on GPU HBM)]
    PA --> OUT[Token output per request]
    style Requests fill:#fafaf9,stroke:#e7e5e4
    style KV fill:#f5f5f4,stroke:#a8a29e
Why it's revolutionary: Standard inference pre-allocates the KV cache for the full max_sequence_length per request (2 × n_layers × n_kv_heads × head_dim values per position, for K and V). At 4K context this reservation is huge, and mostly empty for requests that finish early.
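A worked example of that reservation, assuming Llama-2-7B-style shapes (32 layers, 32 KV heads, head_dim 128, FP16); the numbers are illustrative, not vLLM's exact accounting:
# KV cache reserved per request when pre-allocated for the full context
n_layers, n_kv_heads, head_dim = 32, 32, 128
bytes_per_value = 2                 # FP16
max_seq_len = 4096

per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # x2 for K and V
per_request = per_token * max_seq_len
print(per_token)            # 524288 bytes = 0.5 MB per token
print(per_request / 2**30)  # 2.0 GiB reserved per request
# A request that stops after 200 tokens uses only ~5% of that reservation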
PagedAttention (the key vLLM innovation):
- Stores KV cache in non-contiguous "pages" (like OS virtual memory pages)
- Pages are allocated on demand, not up front
- Different requests can share KV cache pages (for shared prefixes)
- Result: Far less memory is wasted, so many more concurrent requests fit on the same GPU (the vLLM paper reports 2-4× the throughput of prior serving systems, and up to 24× that of naive HuggingFace Transformers)
# Install
pip install vllm
# Start vLLM server (OpenAI-compatible API)
# (newer vLLM versions also ship a shorthand CLI: vllm serve <model> --port 8000)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.2-3B-Instruct \
    --port 8000 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.9
# Add --quantization awq for 4-bit (requires an AWQ-quantized model)
# Query it (identical to OpenAI API):
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-3.2-3B-Instruct",
        "messages": [{"role": "user", "content": "What is LoRA?"}],
        "temperature": 0.7,
        "max_tokens": 200
    }'
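vLLM also has an offline batch API, handy for evals and bulk generation without running a server. A minimal sketch (generate() takes raw text; for instruct models you'd normally apply the chat template first):
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.2-3B-Instruct", max_model_len=8192)
params = SamplingParams(temperature=0.7, max_tokens=200)
outputs = llm.generate(["What is LoRA?", "What is PagedAttention?"], params)
for out in outputs:
    print(out.outputs[0].text)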
vLLM features:
- Continuous batching: new requests slot into existing batches mid-generation
- Prefix caching: if multiple requests share the same system prompt, compute KV once
- Multi-GPU tensor parallel inference
- Streaming responses
- AWQ, GPTQ, FP8 quantization support
When to use vLLM: Any time you're serving to multiple users. Even for 2-3 concurrent users, vLLM's batching gives 3-5× throughput vs. naive HuggingFace inference.
HuggingFace Text Generation Inference (TGI)
Hugging Face's production inference server. Good for deploying HF Hub models.
# Docker deployment (simplest):
docker run --gpus all \
    -p 8080:80 \
    -v $PWD/models:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-2-7b-chat-hf \
    --max-input-length 2048 \
    --max-total-tokens 4096
# Query with Python:
from huggingface_hub import InferenceClient
client = InferenceClient("http://localhost:8080")
response = client.text_generation("Explain transformers:", max_new_tokens=100)
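InferenceClient also supports streaming, if you want token-by-token output from TGI:
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")
for token in client.text_generation("Explain transformers:", max_new_tokens=100, stream=True):
    print(token, end="", flush=True)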
Streaming Responses
Users expect to see text appear token by token (like ChatGPT). This requires streaming from the server, typically via server-sent events (SSE).
# vLLM streaming with Python
import requests, json

def stream_generation(prompt, model="meta-llama/Llama-3.2-3B-Instruct"):
    # model must match the --model the vLLM server was launched with
    response = requests.post(
        "http://localhost:8000/v1/chat/completions",
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "stream": True,  # ← KEY: enable streaming
        },
        stream=True,
    )
    for line in response.iter_lines():
        if line.startswith(b"data: "):
            data = line[6:]  # Remove "data: " prefix
            if data == b"[DONE]":
                break
            chunk = json.loads(data)
            # content can be missing or null (e.g. role-only chunks), so guard it
            delta = chunk["choices"][0]["delta"].get("content") or ""
            print(delta, end="", flush=True)

stream_generation("Tell me about large language models.")
Building a Simple Chat API (see 03_simple_chat_api.py)
# Full FastAPI server in ~100 lines
# Supports streaming, conversation history, system prompts
# OpenAI-compatible format
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
import threading, torch

app = FastAPI()
model_name = "microsoft/phi-2"  # Small, no auth needed
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",  # put the model on GPU if available (requires accelerate)
)

@app.post("/chat")
async def chat(message: str):
    inputs = tokenizer(message, return_tensors="pt").to(model.device)
    # skip_prompt=True so the stream contains only newly generated text
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    # generate() blocks, so run it in a background thread and stream as it goes
    thread = threading.Thread(
        target=model.generate,
        kwargs={"input_ids": inputs.input_ids, "attention_mask": inputs.attention_mask,
                "streamer": streamer, "max_new_tokens": 200,
                "temperature": 0.7, "do_sample": True},
    )
    thread.start()

    def generate():
        for text in streamer:
            yield f"data: {text}\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")
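A matching client sketch for the endpoint above. Note that message travels as a query parameter because of how the route declares it; the host and port (localhost:8000, e.g. via uvicorn) are assumptions:
# Consume the SSE stream from the /chat endpoint above
import requests

with requests.post("http://localhost:8000/chat",
                   params={"message": "Explain attention in one paragraph."},
                   stream=True) as resp:
    for line in resp.iter_lines():
        if line.startswith(b"data: "):
            print(line[6:].decode(), end="", flush=True)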
Interview Corner Cases — Inference & Deployment 🎯
- "What is continuous batching and why does vLLM use it?" → Without continuous batching, a server waits until all requests in a batch finish before starting the next batch. With continuous batching, completed sequences are immediately replaced with new requests, filling freed slots immediately. This dramatically improves GPU utilization (from ~30% to ~90%).
- "What is prefix sharing/caching in vLLM?" → If many requests start with the same system prompt (very common), the KV cache for that prefix can be computed once and shared across all requests. vLLM stores these shared KV blocks and reuses them. This is huge for production deployments where you have a fixed system prompt.
- "How does speculative decoding interact with batching?" → The draft model must generate candidates for the main model to verify. With batching, you'd need to run the draft model for multiple users and verify with the main model for all of them simultaneously. The verification step is naturally batched, but managing draft vs. main model runs adds complexity.
- "What is the difference between latency and throughput in LLM serving?" → Latency: time for first token (TTFT) + time per output token (TPOT). Lower = better user experience. Throughput: tokens/second across all users. Higher = cheaper to serve. They're often in tension: batching improves throughput but increases latency for individual users.
- "What is TensorRT-LLM and when would you use it?" → NVIDIA's inference framework that compiles models to optimized TensorRT engines for their specific GPU. Achieves maximum throughput (sometimes 2× better than vLLM) but requires NVIDIA GPUs, complex setup, and recompilation for each model/GPU combination. Used by large-scale production deployments.