Chapter 6 — Inference & Deployment
Part 2: Inference Engines, vLLM, and Serving at Scale
Once you've trained and fine-tuned a model, you need to actually run it. This chapter covers how to serve LLMs efficiently, from your laptop to production at scale.
The Inference Bottleneck: Memory Bandwidth, Not Compute
A common misconception: LLM inference is slow because of the computation (matrix multiplies). Wrong. At low batch sizes, inference is bound by memory bandwidth, not compute.
At inference time:
- For each generated token, you load ALL model weights from GPU memory
- 7B parameters × 2 bytes = 14GB loaded from HBM each time
- A100 HBM bandwidth: 2 TB/s → loading 14GB takes ~7ms
- At 7ms per token: max ~140 tokens/second per A100 (for batch=1)
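You can sanity-check this arithmetic yourself. A minimal back-of-the-envelope sketch in Python (same numbers as the bullets above; real engines come in lower because of KV cache reads and kernel overheads):
# Decode speed ceiling for a memory-bandwidth-bound model
def max_tokens_per_second(n_params, bytes_per_param, hbm_bytes_per_s):
    weight_bytes = n_params * bytes_per_param   # bytes read per generated token
    return hbm_bytes_per_s / weight_bytes       # tokens/s = bandwidth / bytes-per-token

print(max_tokens_per_second(7e9, 2.0, 2e12))   # FP16: ~143 tok/s
print(max_tokens_per_second(7e9, 0.5, 2e12))   # 4-bit: ~571 tok/s (the 4x from quantization)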
Techniques to improve throughput:
- Quantization: Reduce 14GB → 3.5GB of weights, so 4× fewer bytes to load per token
- Larger batches: Multiple users share the same weight-loading overhead
- Speculative decoding: Generate (and verify) multiple tokens per weight load (sketched after this list)
- FlashAttention: Fused attention kernels avoid materializing the attention matrix in HBM, cutting attention memory traffic
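To make the speculative decoding bullet concrete, here is a toy sketch of the greedy variant. draft_next and target_greedy are hypothetical stand-ins for the two models, not a real library API; production implementations verify against the target's probabilities with rejection sampling rather than exact greedy matching.
# Toy greedy speculative decoding. Assumes a non-empty prompt and two callables:
#   draft_next(seq)    -> the draft model's greedy next token for seq
#   target_greedy(seq) -> list g where g[i] is the target model's greedy
#                         next token after the prefix seq[:i+1]
# The point: ONE target forward pass (one weight load) can verify up to k tokens.
def speculative_step(tokens, draft_next, target_greedy, k=4):
    # 1. Draft proposes k tokens autoregressively (cheap, small model)
    proposal, ctx = [], list(tokens)
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)
    # 2. Target scores the whole proposed sequence in one forward pass
    g = target_greedy(tokens + proposal)
    # 3. Accept draft tokens while they match the target's own choices
    accepted = []
    for i, t in enumerate(proposal):
        target_choice = g[len(tokens) + i - 1]  # target's pick at this position
        if t == target_choice:
            accepted.append(t)
        else:
            accepted.append(target_choice)  # first mismatch: take the target's token
            break
    else:
        accepted.append(g[len(tokens) + k - 1])  # all matched: free bonus token
    return tokens + accepted  # 1 to k+1 new tokens per target weight load

# Tiny demo with fake "models" over integer tokens:
draft = lambda seq: (seq[-1] + 1) % 10            # draft guesses last+1
target = lambda seq: [(t + 1) % 10 for t in seq]  # target agrees, so all k accepted
print(speculative_step([1, 2, 3], draft, target))  # -> [1, 2, 3, 4, 5, 6, 7, 8]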
Inference Engines: When to Use What
| Engine | Best For | Quantization | Multi-GPU | Notes |
|---|---|---|---|---|
| HuggingFace Transformers | Dev/testing | bitsandbytes | Basic | Easiest to use |
| llama.cpp | CPU, local use | GGUF | Partial | Very low RAM use |
| vLLM | GPU serving, production | GPTQ, AWQ, FP8 | ✅ | Best throughput |
| TGI (HuggingFace) | Easy deploy | GPTQ, AWQ | ✅ | Good for HF models |
| Ollama | Local, GUI users | GGUF | Partial | Easiest local setup |
| TensorRT-LLM | NVIDIA production | FP8/INT4 | ✅ | Maximum speed |
| ExLlamaV2 | Consumer GPU | EXL2 | Partial | Best on RTX cards |
llama.cpp: The CPU Hero
What it is: A C/C++ implementation of LLM inference that began as a port of LLaMA and now supports many model families. Runs on any machine — Mac, Windows, Linux, no NVIDIA GPU needed.
Why it matters:
- Runs on Apple Silicon (M1/M2/M3) using Metal acceleration
- Uses AVX2/AVX512 SIMD instructions on CPU
- GGUF format with K-quants for excellent quality at low bits
- Supports partial GPU offloading (some layers on GPU, rest on CPU)
# Install
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Build for CPU
make
# Build for CUDA (NVIDIA)
make LLAMA_CUDA=1
# Build for Metal (Mac; recent builds enable Metal by default on Apple Silicon)
make LLAMA_METAL=1
# Note: newer releases have switched to CMake (cmake -B build && cmake --build build)
# Download a GGUF model (Ollama's library or HuggingFace)
# Example: LLaMA 3.2 3B Q4_K_M
wget https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf
# Run it!
./llama-cli \
    -m Llama-3.2-3B-Instruct-Q4_K_M.gguf \
    --chat-template llama3 \
    -n 200 \
    -p "You are a helpful assistant.\n\nUser: Explain quantum entanglement simply.\nAssistant:"
# Start a server (OpenAI-compatible API!):
./llama-server -m model.gguf --port 8080
# Now call it like OpenAI API: http://localhost:8080/v1/chat/completions
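Because the endpoint is OpenAI-compatible, you can also point the official openai Python client at it. A minimal sketch (the api_key is a dummy value; llama-server does not require one by default, and the model name is only a label for a single-model server):
# pip install openai
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="local-model",  # single-model server; the name is just a label
    messages=[{"role": "user", "content": "Explain quantum entanglement simply."}],
    max_tokens=200,
)
print(resp.choices[0].message.content)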
Interview corner case 🎯: "How does llama.cpp handle models that don't fit in GPU memory?" — "GPU offloading" — you specify how many layers to put on GPU (-ngl 20 = 20 layers on GPU, rest on CPU). The model is split at a layer boundary: the GPU-resident layers process the activations, which then cross to the CPU for the remaining layers, once per generated token. Less efficient than all-GPU, but much better than pure CPU.
Ollama: The Developer's Local LLM Tool
What it is: A wrapper around llama.cpp with a model library, simple CLI, and Docker-like model management.
# Install (Mac/Linux)
curl -fsSL https://ollama.ai/install.sh | sh
# Pull and run a model (automatically downloads GGUF)
ollama run llama3.2
# List available models
ollama list
# Serve as API (starts automatically)
# API is at http://localhost:11434
# Call it:
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "Explain gradient descent."}]
}'
# Or use the OpenAI-compatible endpoint:
curl http://localhost:11434/v1/chat/completions -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "Hello!"}]
}'
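There is also an official Python package (pip install ollama). A minimal sketch, assuming the current client API and an Ollama server running on the default port:
import ollama

response = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Explain gradient descent."}],
)
print(response["message"]["content"])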
Ollama handles model storage, GPU detection, and memory management automatically. Best for local development and demos.
vLLM: Production-Grade GPU Serving
What it is: A high-throughput inference engine for GPU serving, with PagedAttention for efficient memory management.
flowchart TD
    subgraph Requests["Incoming Requests"]
        R1[User 1]
        R2[User 2]
        R3[User 3]
    end
    R1 --> CB[Continuous Batching Scheduler]
    R2 --> CB
    R3 --> CB
    CB --> GPU[GPU forward pass]
    GPU --> PA[PagedAttention KV Cache Manager]
    PA -->|allocate pages on demand| KV[(KV Cache on GPU HBM)]
    PA --> OUT[Token output per request]
    style Requests fill:#fafaf9,stroke:#e7e5e4
    style KV fill:#f5f5f4,stroke:#a8a29e
Why it's revolutionary: Standard inference pre-allocates the KV cache for the full max_sequence_length per request (2 × n_layers × n_kv_heads × head_dim values per position, for K and V). At 4K context this reservation is huge, and mostly empty for requests that finish early.
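A worked example of that reservation, assuming Llama-2-7B-style shapes (32 layers, 32 KV heads, head_dim 128, FP16); the numbers are illustrative, not vLLM's exact accounting:
# KV cache reserved per request when pre-allocated for the full context
n_layers, n_kv_heads, head_dim = 32, 32, 128
bytes_per_value = 2                 # FP16
max_seq_len = 4096

per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # x2 for K and V
per_request = per_token * max_seq_len
print(per_token)            # 524288 bytes = 0.5 MB per token
print(per_request / 2**30)  # 2.0 GiB reserved per request
# A request that stops after 200 tokens uses only ~5% of that reservation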
PagedAttention (the key vLLM innovation):
- Stores KV cache in non-contiguous "pages" (like OS virtual memory pages)
- Pages are allocated on demand, not up front
- Different requests can share KV cache pages (for shared prefixes)
- Result: Far less memory is wasted, so many more concurrent requests fit on the same GPU (the vLLM paper reports 2-4× the throughput of prior serving systems, and up to 24× that of naive HuggingFace Transformers)
# Install
pip install vllm
# Start vLLM server (OpenAI-compatible API)
# (newer vLLM versions also ship a shorthand CLI: vllm serve <model> --port 8000)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.2-3B-Instruct \
    --port 8000 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.9
# Add --quantization awq for 4-bit (requires an AWQ-quantized model)
# Query it (identical to OpenAI API):
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-3.2-3B-Instruct",
        "messages": [{"role": "user", "content": "What is LoRA?"}],
        "temperature": 0.7,
        "max_tokens": 200
    }'
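vLLM also has an offline batch API, handy for evals and bulk generation without running a server. A minimal sketch (generate() takes raw text; for instruct models you'd normally apply the chat template first):
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.2-3B-Instruct", max_model_len=8192)
params = SamplingParams(temperature=0.7, max_tokens=200)
outputs = llm.generate(["What is LoRA?", "What is PagedAttention?"], params)
for out in outputs:
    print(out.outputs[0].text)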
vLLM features:
- Continuous batching: new requests slot into existing batches mid-generation
- Prefix caching: if multiple requests share the same system prompt, compute KV once
- Multi-GPU tensor parallel inference
- Streaming responses
- AWQ, GPTQ, FP8 quantization support
When to use vLLM: Any time you're serving to multiple users. Even for 2-3 concurrent users, vLLM's batching gives 3-5× throughput vs. naive HuggingFace inference.
HuggingFace Text Generation Inference (TGI)
Hugging Face's production inference server. Good for deploying HF Hub models.
# Docker deployment (simplest):
docker run --gpus all \
    -p 8080:80 \
    -v $PWD/models:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-2-7b-chat-hf \
    --max-input-length 2048 \
    --max-total-tokens 4096
# Query with Python:
from huggingface_hub import InferenceClient
client = InferenceClient("http://localhost:8080")
response = client.text_generation("Explain transformers:", max_new_tokens=100)
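InferenceClient also supports streaming, if you want token-by-token output from TGI:
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")
for token in client.text_generation("Explain transformers:", max_new_tokens=100, stream=True):
    print(token, end="", flush=True)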
Streaming Responses
Users expect to see text appear token by token (like ChatGPT). This requires streaming from the server, typically via server-sent events (SSE).
# vLLM streaming with Python
import requests, json

def stream_generation(prompt, model="meta-llama/Llama-3.2-3B-Instruct"):
    # model must match the --model the vLLM server was launched with
    response = requests.post(
        "http://localhost:8000/v1/chat/completions",
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "stream": True,  # ← KEY: enable streaming
        },
        stream=True,
    )
    for line in response.iter_lines():
        if line.startswith(b"data: "):
            data = line[6:]  # Remove "data: " prefix
            if data == b"[DONE]":
                break
            chunk = json.loads(data)
            # content can be missing or null (e.g. role-only chunks), so guard it
            delta = chunk["choices"][0]["delta"].get("content") or ""
            print(delta, end="", flush=True)

stream_generation("Tell me about large language models.")
Building a Simple Chat API (see 03_simple_chat_api.py)
# Full FastAPI server in ~100 lines
# Supports streaming, conversation history, system prompts
# OpenAI-compatible format
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
import threading, torch

app = FastAPI()
model_name = "microsoft/phi-2"  # Small, no auth needed
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",  # put the model on GPU if available (requires accelerate)
)

@app.post("/chat")
async def chat(message: str):
    inputs = tokenizer(message, return_tensors="pt").to(model.device)
    # skip_prompt=True so the stream contains only newly generated text
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    # generate() blocks, so run it in a background thread and stream as it goes
    thread = threading.Thread(
        target=model.generate,
        kwargs={"input_ids": inputs.input_ids, "attention_mask": inputs.attention_mask,
                "streamer": streamer, "max_new_tokens": 200,
                "temperature": 0.7, "do_sample": True},
    )
    thread.start()

    def generate():
        for text in streamer:
            yield f"data: {text}\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")
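A matching client sketch for the endpoint above. Note that message travels as a query parameter because of how the route declares it; the host and port (localhost:8000, e.g. via uvicorn) are assumptions:
# Consume the SSE stream from the /chat endpoint above
import requests

with requests.post("http://localhost:8000/chat",
                   params={"message": "Explain attention in one paragraph."},
                   stream=True) as resp:
    for line in resp.iter_lines():
        if line.startswith(b"data: "):
            print(line[6:].decode(), end="", flush=True)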
Interview Corner Cases — Inference & Deployment 🎯
- "What is continuous batching and why does vLLM use it?" → Without continuous batching, a server waits until all requests in a batch finish before starting the next batch. With continuous batching, completed sequences are immediately replaced with new requests, filling freed slots immediately. This dramatically improves GPU utilization (from ~30% to ~90%).
- "What is prefix sharing/caching in vLLM?" → If many requests start with the same system prompt (very common), the KV cache for that prefix can be computed once and shared across all requests. vLLM stores these shared KV blocks and reuses them. This is huge for production deployments where you have a fixed system prompt.
- "How does speculative decoding interact with batching?" → The draft model must generate candidates for the main model to verify. With batching, you'd need to run the draft model for multiple users and verify with the main model for all of them simultaneously. The verification step is naturally batched, but managing draft vs. main model runs adds complexity.
- "What is the difference between latency and throughput in LLM serving?" → Latency: time for first token (TTFT) + time per output token (TPOT). Lower = better user experience. Throughput: tokens/second across all users. Higher = cheaper to serve. They're often in tension: batching improves throughput but increases latency for individual users.
- "What is TensorRT-LLM and when would you use it?" → NVIDIA's inference framework that compiles models to optimized TensorRT engines for their specific GPU. Achieves maximum throughput (sometimes 2× better than vLLM) but requires NVIDIA GPUs, complex setup, and recompilation for each model/GPU combination. Used by large-scale production deployments.