Chapter 8 — Frontier & Future

Part 1: Mixture of Experts (MoE) — Scaling Without the Cost

MoE is likely how GPT-4, Gemini, and Grok work internally. It's the most important architectural innovation for scaling a model's parameter count, and thus its capacity, without proportionally increasing inference compute.

The Core Idea: Conditional Computation

Standard transformer: every token goes through the same FFN layers. MoE transformer: every token goes through only a subset of FFN "experts."

Standard FFN (dense):
  x → FFN(x) → output
  Every token, every step, same computation.

MoE FFN (sparse):
  x → Router → select 2 of 8 experts → w1·Expert_1(x) + w2·Expert_2(x) → output   (w1, w2 from the router)
  Each token activates only 2/8 = 25% of FFN parameters.

The result: the layer stores 8× the FFN parameters of a single expert, while each token pays for only 2 experts' worth of compute. Compared to a dense FFN with the same per-token cost, that is 4× the parameters.
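To make the parameter/compute trade-off concrete, here is a back-of-the-envelope sketch (the dimensions below are illustrative, not taken from any particular model):

d_model, d_ffn = 4096, 14336          # illustrative FFN dimensions
n_experts, top_k = 8, 2

dense_params  = 2 * d_model * d_ffn             # one FFN: up + down projection
moe_params    = n_experts * dense_params        # parameters stored in the layer
active_params = top_k * dense_params            # parameters a single token uses

print(f"one expert        : {dense_params / 1e6:.0f}M params")
print(f"MoE layer (stored): {moe_params / 1e6:.0f}M params")
print(f"active per token  : {active_params / 1e6:.0f}M params")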


The Architecture

import torch
import torch.nn as nn
import torch.nn.functional as F

# FeedForward is the standard position-wise FFN module (assumed defined in an
# earlier chapter; a minimal stand-in appears in the smoke test below).

class MoELayer(nn.Module):
    """
    Mixture of Experts FFN layer.

    n_experts: total number of expert FFNs (e.g., 8 for Mixtral)
    top_k:     number of experts activated per token (e.g., 2)
    """

    def __init__(self, d_model, d_ffn, n_experts, top_k):
        super().__init__()
        self.n_experts = n_experts
        self.top_k = top_k

        # Router: learns which experts to send each token to
        # Output: (batch×seq, n_experts) logits
        self.router = nn.Linear(d_model, n_experts, bias=False)

        # Expert FFNs (each is an independent FFN)
        self.experts = nn.ModuleList([
            FeedForward(d_model, d_ffn)
            for _ in range(n_experts)
        ])

    def forward(self, x):
        B, T, C = x.shape
        x_flat = x.view(-1, C)   # (B*T, d_model)

        # ── Routing ──────────────────────────────────────────────────
        # Compute routing logits
        router_logits = self.router(x_flat)        # (B*T, n_experts)
        router_probs = F.softmax(router_logits, dim=-1)

        # Select top-k experts for each token
        top_k_probs, top_k_indices = torch.topk(
            router_probs, self.top_k, dim=-1
        )   # each shape: (B*T, top_k)

        # Normalize selected probabilities (so they sum to 1)
        top_k_probs = top_k_probs / top_k_probs.sum(dim=-1, keepdim=True)

        # ── Expert Computation ────────────────────────────────────────
        output = torch.zeros_like(x_flat)

        for i in range(self.top_k):
            expert_idx = top_k_indices[:, i]      # (B*T,) which expert
            expert_weight = top_k_probs[:, i:i+1]  # (B*T, 1)

            # Group tokens by their assigned expert (for efficiency)
            for e in range(self.n_experts):
                mask = (expert_idx == e)
                if mask.any():
                    expert_out = self.experts[e](x_flat[mask])  # Run expert
                    output[mask] += expert_weight[mask] * expert_out

        return output.view(B, T, C)
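A quick smoke test of the layer. If the FeedForward module from earlier chapters is not at hand, a minimal stand-in (names and sizes here are arbitrary) is enough to check shapes:

class FeedForward(nn.Module):
    # minimal stand-in: two linear layers with a GELU in between
    def __init__(self, d_model, d_ffn):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ffn), nn.GELU(), nn.Linear(d_ffn, d_model)
        )

    def forward(self, x):
        return self.net(x)

moe = MoELayer(d_model=64, d_ffn=256, n_experts=8, top_k=2)
x = torch.randn(4, 16, 64)     # (batch, seq, d_model)
print(moe(x).shape)            # torch.Size([4, 16, 64])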

Mixtral 8×7B: The Open-Source MoE Champion

Architecture:

  • 8 expert FFNs per layer, 2 activated per token
  • 32 transformer layers (all have MoE FFN)
  • Total parameters: ~46.7B
  • Active parameters per token: ~13B (2 experts per layer)

What this means:

  • You pay the inference cost of a 13B model
  • But you get the quality of a ~47B model (more total learned knowledge)
  • Training cost is higher (all experts must be trained)

Performance: Mixtral 8×7B beats LLaMA 2 70B on many benchmarks while requiring only 13B active parameters — ~5× more efficient inference.
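Those headline numbers can be roughly reconstructed from Mixtral's published dimensions. The sketch below is back-of-the-envelope arithmetic (it ignores layer norms and the tiny router matrices):

d_model, d_ffn, n_layers = 4096, 14336, 32
n_experts, top_k         = 8, 2
n_heads, n_kv_heads      = 32, 8
vocab                    = 32000

expert_ffn = 3 * d_model * d_ffn                 # SwiGLU: gate, up, down matrices
attention  = 2 * d_model * d_model + 2 * d_model * (d_model // n_heads) * n_kv_heads
embeddings = 2 * vocab * d_model                 # input + output embeddings

total  = n_layers * (n_experts * expert_ffn + attention) + embeddings
active = n_layers * (top_k     * expert_ffn + attention) + embeddings
print(f"total  ~ {total / 1e9:.1f}B")    # ~46.7B
print(f"active ~ {active / 1e9:.1f}B")   # ~12.9B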


The Router: The Heart of MoE

The router is a simple linear layer, but it's the most important part of MoE. Getting routing right is the difference between a working MoE and one where some experts are always chosen and others never used.

Load Balancing — The Key Challenge

Without explicit load balancing, routers collapse: a few experts receive all the traffic, the rest see almost no tokens and stay near their random initialization, so the router learns to prefer the trained experts even more → positive feedback loop → essentially a 1-expert model.

Auxiliary Loss (the standard solution):

def compute_load_balancing_loss(router_probs, expert_indices, n_experts):
    """
    Penalty for unequal expert utilization (Switch Transformer style).
    Encourages the router to distribute tokens evenly across experts.

    router_probs:   (B*T, n_experts) softmax routing probabilities
    expert_indices: (B*T, top_k) indices of the selected experts
    """
    # Fraction of routing assignments that went to each expert.
    # (Counts are not differentiable; gradients flow through mean_router_prob.)
    expert_counts = torch.bincount(
        expert_indices.reshape(-1), minlength=n_experts
    ).float()
    fraction_routed = expert_counts / expert_indices.numel()   # should be ~1/n_experts

    # Mean routing probability for each expert (the differentiable path)
    mean_router_prob = router_probs.mean(dim=0)                # (n_experts,)

    # Dot product of routing fraction and mean probability:
    # minimized when both are uniform across experts
    return n_experts * (fraction_routed * mean_router_prob).sum()

# Add to training loss with small coefficient
total_loss = lm_loss + 0.01 * load_balance_loss

Expert Choice Routing (alternative): Instead of tokens choosing experts, experts choose tokens. Each expert picks its top-k tokens from the batch. Automatically balances load — each expert gets exactly k tokens per batch. But introduces variable computation per token.
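A minimal sketch of the idea (the capacity formula follows the Expert Choice paper; all names here are illustrative):

import torch

n_tokens, n_experts, top_k = 16, 8, 2
router_probs = torch.rand(n_tokens, n_experts).softmax(dim=-1)

# Each expert selects the same fixed number of tokens ("capacity"),
# so load is balanced by construction.
capacity = n_tokens * top_k // n_experts          # here: 4 tokens per expert

gate, token_idx = torch.topk(router_probs.t(), capacity, dim=-1)  # experts pick tokens
# Expert e processes the tokens at token_idx[e], weighted by gate[e].
# A token picked by no expert is skipped (hence variable compute per token);
# a token picked by several experts receives several weighted contributions.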


MoE Training: Challenges

Token dropping: With many experts and limited compute, some tokens might not be processed by enough experts. "Expert capacity" limits how many tokens each expert can process in one forward pass. Tokens exceeding capacity are "dropped" (skipped). Mitigation: increase capacity factor or use no-drop routing.
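The capacity bookkeeping itself is simple; a sketch (capacity_factor is a tunable knob, commonly somewhere between 1.0 and 1.5):

import math

def expert_capacity(n_tokens, n_experts, top_k, capacity_factor=1.25):
    # Maximum number of tokens each expert may process in one forward pass.
    # capacity_factor > 1 leaves headroom for uneven routing; tokens routed
    # beyond this limit are dropped (they pass through the residual path only).
    return math.ceil(capacity_factor * n_tokens * top_k / n_experts)

print(expert_capacity(n_tokens=4096, n_experts=8, top_k=2))   # 1280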

Communication overhead: Tokens and their corresponding expert must be collocated. In distributed training, tokens may need to be sent across GPUs to reach their assigned expert (All-to-All communication). This adds network latency.

Gradient flow: Each expert only processes a subset of tokens. Gradients from under-used experts are sparse. Expert initialization and load balancing are crucial for all experts to learn.


DeepSeek MoE: State of the Art (2024)

DeepSeek-V2 and DeepSeek-V3 pushed MoE further:

DeepSeek-V3 (December 2024):

  • 671B total parameters, ~37B active per token
  • 256 routed experts (plus 1 shared expert), 8 routed experts activated per token
  • Reported training compute cost of ~$5.6M in GPU time (remarkably cheap for the scale)
  • Competitive with GPT-4-class models on many benchmarks

Key innovations:

  • Multi-head latent attention (MLA): Compresses the KV cache dramatically
  • Auxiliary-loss-free load balancing: a dynamic per-expert bias replaces the auxiliary loss term (sketched after this list)
  • FP8 mixed-precision training: among the first models at this scale to train largely in FP8
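A rough sketch of the bias mechanism as described for DeepSeek-V3 (the gating function and the exact update rule are simplified here, and all names are illustrative):

import torch

n_tokens, n_experts, top_k, gamma = 1024, 256, 8, 0.001
scores = torch.rand(n_tokens, n_experts)     # stand-in for per-expert affinity scores
bias   = torch.zeros(n_experts)              # persistent state, updated every step

# The bias shifts which experts get *selected*; the gating weights
# are still taken from the unbiased scores.
_, top_idx = torch.topk(scores + bias, top_k, dim=-1)
gate = torch.gather(scores, -1, top_idx)
gate = gate / gate.sum(dim=-1, keepdim=True)

# After the step: lower the bias of overloaded experts, raise underloaded ones.
load = torch.bincount(top_idx.reshape(-1), minlength=n_experts).float()
bias += gamma * torch.sign(load.mean() - load)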

DeepSeek-R1: Trained with reinforcement learning (GRPO, not PPO/DPO) to develop reasoning capabilities. The model learns to "think" through problems before answering.


Interview Corner Cases — MoE 🎯

  • "If MoE has 8 experts but uses only 2, isn't it like having a smaller model?" → The computation per token is like a smaller model (2/8 of the FFN compute). But the total knowledge capacity is like a larger model — 8× more FFN parameters have been trained and store different learned patterns. Different tokens activate different experts, so across a full document, all experts contribute.
  • "How do you serve a MoE model efficiently?" → Routing tokens to experts can be done efficiently at batch level. The challenge is GPU memory — a 47B model needs more VRAM than a 13B model even if compute is similar. At scale, experts can be distributed across GPUs (expert parallelism). Each GPU hosts a subset of experts; tokens are routed across GPUs with All-to-All.
  • "What is the difference between top-1 and top-2 routing?" → Top-1: each token goes to exactly 1 expert (harder to train, can fail). Top-2 (Mixtral default): each token goes to 2 experts with weighted averaging — more gradient signal per step, more stable training. Top-2 consistently outperforms top-1 empirically.
  • "What is the 'dead expert' problem?" → Without load balancing, some experts receive near-zero routing probability and stop learning (their gradients are negligible). They remain near-random initialization. The router then avoids them even more. Auxiliary load balancing losses prevent this positive feedback collapse.
  • "Can MoE be applied to the attention layers too?" → Yes, but it's less common and less impactful. Attention already has multi-head (a form of specialization). MoE is most beneficial in FFN layers because they're the largest component (~2/3 of parameters) and handle different types of "knowledge." Some research (e.g., Sparse Mixtral) applies MoE to both.