The problem

Your team has been iterating on a GenAI feature for three months. You've fixed bugs, improved prompts, and added safety filters. You're ready to launch — but when someone asks "how do we know it's good enough?", the answer is "it feels better."

Vibes are not a launch criterion. Neither is "it passed our internal demo." You need a benchmark — a repeatable measurement that gives you objective confidence before you ship.

The intuition

A benchmark is not a test suite. A test suite checks that specific inputs produce expected outputs (useful for regression). A benchmark measures capability across a representative distribution of inputs, and compares that measurement to a threshold or a baseline.

Three properties of a useful benchmark:

1. Representative: the eval set reflects the actual distribution of real-world inputs, including tail cases
2. Stable: you can run it on any system version and get a comparable number
3. Tied to a decision: the threshold is derived from "what score do we need to not harm users," not "what score did our best model get"

The third property is almost always missing. Teams build benchmarks that measure how good their system is, not whether it's good enough to ship. Those are different questions.

A benchmark without a launch threshold is just a leaderboard. The threshold is what makes it a quality gate.

In practice

At Amazon, we built launch benchmarks for shopping assistant features with four layers:

Automated scoring (fast feedback): LLM-as-Judge on 2,000 sampled queries, scoring accuracy, relevance, and safety. Runs in under an hour. Used for daily regression.

Human evaluation (ground truth): 300 queries scored by trained annotators on a detailed rubric. Takes 2–3 days. Used to calibrate the automated scorer and confirm before launch.
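Calibrating the automated scorer against human ground truth can start with a simple rank-agreement check on queries both have scored. A minimal sketch using Spearman correlation — the 0.8 threshold is an illustrative assumption, not a number from the text:

```python
from scipy import stats

def judge_is_calibrated(judge_scores, human_scores, min_corr=0.8):
    """Check rank agreement between the LLM judge and human annotators
    on the same queries. min_corr is an illustrative threshold: if the
    judge can't reproduce the human ranking, don't trust it for daily
    regression."""
    rho, p_value = stats.spearmanr(judge_scores, human_scores)
    return bool(rho >= min_corr and p_value < 0.05), rho
```

If the correlation is low, the fix is usually a tighter rubric in the judge prompt, not more eval queries.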

Adversarial set (red-teaming): 100 queries designed to probe known failure modes — leading questions, edge-case products, ambiguous intents. Pass rate on this set is a hard gate.

Shadow traffic evaluation (realism check): 1% of live traffic routed to the new system with no user-visible effect. Automated scoring runs on real queries. Confirms that lab benchmarks translate to production.
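Shadow routing needs to be deterministic so the same request always lands in the same bucket. One common way to sample a fixed fraction of traffic is to hash a stable request identifier; a sketch (the function name and 1% default are assumptions, not the production implementation):

```python
import hashlib

def in_shadow(request_id: str, fraction: float = 0.01) -> bool:
    """Deterministically assign ~fraction of traffic to the shadow
    system by hashing a stable request id. Same id, same decision."""
    digest = int(hashlib.sha256(request_id.encode()).hexdigest(), 16)
    return (digest % 10_000) < fraction * 10_000
```

Hash-based assignment avoids keeping routing state and makes shadow runs reproducible after the fact.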

The launch criteria were: automated score ≥ baseline + 2%, human score ≥ 4.2/5, adversarial pass rate ≥ 95%. These numbers came from a user study that correlated scores with user satisfaction ratings.
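The three conditions compose into a single pass/fail gate. A minimal sketch — it treats "+2%" as two percentage points, which is an assumption about how the margin was defined:

```python
def launch_gate(automated: float, baseline: float, human: float,
                adversarial_pass_rate: float) -> tuple[bool, list[str]]:
    """Evaluate the launch gates described above: automated score must
    beat baseline by 2 points (assumed interpretation of '+2%'), human
    score must reach 4.2/5, adversarial pass rate must reach 95%.
    Returns (ship, list of failed gates)."""
    failures = []
    if automated < baseline + 0.02:
        failures.append(f"automated {automated:.3f} below baseline+2% ({baseline + 0.02:.3f})")
    if human < 4.2:
        failures.append(f"human {human:.2f} below 4.2/5")
    if adversarial_pass_rate < 0.95:
        failures.append(f"adversarial {adversarial_pass_rate:.2%} below 95%")
    return len(failures) == 0, failures
```

Returning the list of failed gates, rather than a bare boolean, keeps the launch-review conversation concrete.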

Going deeper (optional)

Benchmark contamination is underappreciated. If your eval set overlaps with your training data or your prompt examples, your benchmark is measuring memorisation, not generalisation. Maintain a strict holdout and refresh the eval set quarterly as the input distribution evolves.
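A cheap first line of defence against overlap is exact-match checking on normalised text between the eval set and everything the system has seen (training data, prompt examples). A sketch — a real pipeline would add fuzzy or embedding-based matching on top:

```python
import re

def normalise(text: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace so trivially
    reworded duplicates still collide."""
    text = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def contaminated(eval_set: list[str], seen_texts: list[str]) -> list[str]:
    """Return eval queries whose normalised form also appears in the
    training data or prompt examples."""
    seen = {normalise(t) for t in seen_texts}
    return [q for q in eval_set if normalise(q) in seen]
```

Any hit here means the benchmark is partly measuring memorisation; quarantine the query and replace it from the holdout.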

Confidence intervals on your benchmark score matter when the eval set is small. With 300 examples, a 2% score difference might not be statistically significant. Compute an interval — a parametric t-interval, or a bootstrap — and report it alongside the point estimate.

import numpy as np
from scipy import stats

def benchmark_score_with_ci(scores: list[float], confidence: float = 0.95) -> dict:
    """Mean benchmark score with a two-sided t-distribution confidence interval."""
    n = len(scores)
    mean = np.mean(scores)
    se = stats.sem(scores)  # standard error of the mean (sample std, ddof=1)
    # Half-width of the interval from the t critical value with n-1 degrees of freedom
    ci = se * stats.t.ppf((1 + confidence) / 2, df=n - 1)
    return {
        "mean": round(mean, 4),
        "ci_lower": round(mean - ci, 4),
        "ci_upper": round(mean + ci, 4),
        "n": n,
    }