The problem

Your experiment shows +4% on the primary success metric. Leadership wants to ship. But buyer return rate is also up 0.8%, and page load time increased 120ms. Are those acceptable trade-offs?

This is the guardrail problem. Every experiment optimises for something. Guardrails define the things you're not allowed to harm in the process. Without explicit guardrails, teams unconsciously trade off things that matter for things that are easy to measure.

The intuition

Guardrail metrics are the metrics a team agrees, in advance, that they will not trade off. They're not success metrics — you don't need them to improve. You just need them to not get worse beyond a threshold.

The distinction matters. A success metric moving positively is a reason to ship. A guardrail metric breaching its threshold is a reason to stop, regardless of what the success metric says.

Good guardrails share a property: they protect someone other than the team running the experiment. Customer satisfaction metrics protect buyers when the seller-side team is running an experiment. System stability metrics protect infrastructure when a product team is running a feature test. The guardrail exists precisely because the optimising team has an incentive to ignore that dimension.

A guardrail is a commitment, not a metric. The difference is that you agree in advance you won't ship if it's breached.

In practice

At Amazon, experiment readouts had a standard structure: primary metric, secondary metrics, guardrails, and a ship/no-ship recommendation. The guardrails were pre-registered in the experiment doc before launch. Changing them after seeing results required a VP sign-off — which created the right friction.

Typical guardrails for a feature experiment:

- Customer experience: page latency (p50, p99), error rate, zero-result rate
- Business health: return rate, refund rate, seller defect rate
- Platform stability: CPU utilisation, downstream service latency

The threshold-setting process is where teams invest too little time. "Don't harm" is not a threshold. "Buyer return rate < baseline + 0.5pp with 90% confidence" is a threshold. It requires you to decide how much harm is acceptable, which forces the right conversation before you've seen any data.
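A threshold like the one above can be turned into a one-sided check: the experiment breaches the guardrail unless the upper confidence bound on the harm stays below the allowed degradation. A minimal sketch, where the function name, field values, and normal approximation are all illustrative rather than a prescribed implementation:

```python
from statistics import NormalDist


def breaches_threshold(delta: float, se: float, max_harm: float,
                       confidence: float = 0.90) -> bool:
    """One-sided check: can we rule out degradation beyond max_harm?

    delta    -- observed change in the guardrail metric (harm is positive)
    se       -- standard error of delta
    max_harm -- maximum acceptable degradation, agreed before launch
    """
    z = NormalDist().inv_cdf(confidence)  # one-sided z, ~1.28 at 90%
    upper = delta + z * se                # upper confidence bound on harm
    # Breached unless we are confident the harm stays below the threshold
    return upper >= max_harm


# Return rate up 0.2pp with a standard error of 0.1pp, 0.5pp allowed:
# upper bound ~ 0.2 + 1.28 * 0.1 = 0.33pp, below 0.5pp, so not breached
print(breaches_threshold(delta=0.2, se=0.1, max_harm=0.5))
```

Note that the check is one-sided by design: a guardrail only cares about harm, not improvement, so the entire error budget goes into ruling out degradation.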

Going deeper (optional)

Multiple guardrail testing creates a multiple comparisons problem. If you test 10 guardrail metrics at α = 0.05 each, you expect at least one false alarm roughly 40% of the time (1 − 0.95^10 ≈ 0.40). Options: Bonferroni correction (conservative), Holm-Bonferroni (less conservative), or a pre-specified hierarchy where you only test secondary guardrails if primary ones pass.
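The Holm-Bonferroni step-down procedure is straightforward to sketch. The metric names and p-values below are invented for illustration:

```python
def holm_bonferroni(p_values: dict, alpha: float = 0.05) -> dict:
    """Return {metric: flagged} controlling the family-wise error rate."""
    ordered = sorted(p_values.items(), key=lambda kv: kv[1])
    m = len(ordered)
    flagged = {}
    still_rejecting = True
    for i, (metric, p) in enumerate(ordered):
        # Compare the i-th smallest p-value against alpha / (m - i);
        # stop flagging at the first metric that fails its adjusted level.
        if still_rejecting and p <= alpha / (m - i):
            flagged[metric] = True
        else:
            still_rejecting = False
            flagged[metric] = False
    return flagged


# Hypothetical guardrail p-values from one experiment readout
print(holm_bonferroni({"latency": 0.001, "returns": 0.02, "errors": 0.2}))
```

Holm dominates plain Bonferroni: it never flags fewer true breaches, at the same family-wise error rate.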

Some organisations use a composite guardrail score — a weighted combination of several metrics — to reduce the multiple comparisons problem while still capturing the relevant dimensions. The tradeoff is interpretability: a composite score is harder to explain when it's breached.
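In its simplest form a composite guardrail is just a weighted sum of normalised metric deltas, compared against a single pre-agreed cap. A sketch with invented metrics and weights:

```python
def composite_score(deltas: dict, weights: dict) -> float:
    """Weighted sum of normalised metric deltas; positive means net harm."""
    return sum(weights[m] * deltas[m] for m in weights)


# Hypothetical normalised deltas (positive = degradation) and weights
deltas = {"latency": 0.8, "return_rate": 0.3, "error_rate": -0.1}
weights = {"latency": 0.3, "return_rate": 0.5, "error_rate": 0.2}

score = composite_score(deltas, weights)
print(round(score, 2))  # breach if the score exceeds the pre-agreed cap
```

The interpretability cost shows up immediately: when this score breaches, someone still has to decompose it back into its components to explain why.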

def check_guardrails(experiment_results: dict, guardrails: dict) -> dict:
    """Flag each guardrail whose allowed degradation cannot be ruled out."""
    status = {}
    for metric, threshold in guardrails.items():
        delta = experiment_results[metric]["delta"]
        ci_upper = experiment_results[metric]["ci_upper"]
        # Breached unless the upper bound of the delta CI stays below the
        # allowed degradation, i.e. unless we can rule out excessive harm
        status[metric] = {
            "delta": delta,
            "threshold": threshold,
            "breached": ci_upper >= threshold,
        }
    return status