The problem
You've built a shopping assistant that answers product questions. You need to evaluate 10,000 responses before launch. Human review at that scale costs weeks and thousands of dollars. But shipping without evaluation is worse.
The alternative everyone reaches for: use another LLM to score the outputs. Simple in theory. In practice, LLM-as-Judge systems fail quietly — the judge agrees with anything confident-sounding, rewards verbosity regardless of accuracy, or anchors on surface features instead of factual correctness. You get a number that looks like a quality signal but isn't.
The intuition
A reliable LLM-as-Judge isn't a single prompt — it's a pipeline with three components:
1. A rubric, not a vibe. Don't ask the judge "is this a good response?" Ask it to score specific, independently checkable dimensions: factual accuracy, relevance to the query, absence of hallucination, citation of product attributes. Each dimension gets its own 1–5 scale with anchored examples.
2. A calibration step. Before running at scale, score 200 responses with humans. Compare to the LLM's scores. Measure Spearman correlation per dimension. Any dimension below 0.65 correlation needs rubric revision or a different judge model.
3. An audit loop. Sample 2–5% of scored responses weekly for human review. Track judge drift over time — LLM judges degrade when the input distribution shifts (new product categories, seasonal language patterns).
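The calibration step can be sketched in a few lines. This is a minimal pure-Python Spearman implementation (in practice you'd likely reach for `scipy.stats.spearmanr`); `needs_revision` and the dimension names are illustrative, not a fixed API:

```python
from statistics import mean

def spearman(xs, ys):
    """Spearman rank correlation with average ranks for ties."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        i = 0
        while i < len(v):
            j = i
            # Group tied values and assign them their average rank.
            while j + 1 < len(v) and v[order[j + 1]] == v[order[i]]:
                j += 1
            avg_rank = (i + j) / 2 + 1
            for k in range(i, j + 1):
                r[order[k]] = avg_rank
            i = j + 1
        return r

    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

def needs_revision(human_scores, judge_scores, threshold=0.65):
    """Per-dimension flag: True where judge-human correlation is below threshold."""
    return {
        dim: spearman(human_scores[dim], judge_scores[dim]) < threshold
        for dim in human_scores
    }
```

Any dimension flagged `True` goes back for rubric revision before the judge runs at scale.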
A rubric forces the judge to make the same decision a human auditor would. Without it, you're asking for a feeling, not a measurement.
In practice
The rubric structure matters enormously. I've found that dimensions combining multiple concerns ("Is this accurate and relevant and concise?") produce inconsistent scores. One concern per dimension. Four to six dimensions per task. More than that and the judge starts cutting corners.
For the shopping assistant, our rubric had five dimensions: product factual accuracy, query relevance, completeness, tone appropriateness, and absence of fabrication. We scored 500 human-labelled examples and found that our first rubric had 0.41 correlation on "completeness" — it turned out the judge was equating length with completeness. We added anchor examples at each score level and correlation jumped to 0.78.
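A rubric dimension with anchors might look like the following. This is a hypothetical layout, not the one from the project above; the anchor wording is invented to show the shape:

```python
# One concern per dimension, with a concrete anchor at each described score level.
RUBRIC = {
    "completeness": {
        "scale": (1, 5),
        "anchors": {
            1: "Ignores parts of the query (e.g. user asked about sizing, none given).",
            3: "Answers the main question but skips one requested attribute.",
            5: "Addresses every part of the query, regardless of response length.",
        },
    },
    # ...plus product factual accuracy, query relevance, tone appropriateness,
    # and absence of fabrication, each with its own anchors.
}
```

Note the level-5 anchor explicitly decouples completeness from length — that is the fix that moved the correlation from 0.41 to 0.78.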
At scale (32K+ evaluations per run) we batched 10 examples per API call, extracted structured JSON scores, and ran a consistency check: score each example twice with temperature > 0, flag examples where scores differ by more than 1. Those go to human review.
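The double-scoring consistency check reduces to a few lines. A minimal sketch, assuming a `judge_score(example)` callable that runs the judge at temperature > 0 and returns an integer score:

```python
def flag_inconsistent(examples, judge_score, max_diff=1):
    """Score each example twice; return those whose scores differ by more than
    max_diff. Assumes judge_score samples with temperature > 0, so repeated
    calls on the same example can disagree."""
    flagged = []
    for example in examples:
        first = judge_score(example)
        second = judge_score(example)
        if abs(first - second) > max_diff:
            flagged.append(example)
    return flagged
```

Everything returned by `flag_inconsistent` is routed to the human-review queue.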
Going deeper (optional)
The bias landscape for LLM judges is well-documented. Position bias: judges prefer the first option in comparative tasks. Verbosity bias: longer responses score higher even when less accurate. Self-enhancement bias: a model asked to judge its own outputs rates them higher.
Mitigation strategies: swap response order and average the scores (for comparisons), normalise length-sensitive dimensions by response length, and use a judge from a different model family than the model being judged.
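The order-swap mitigation for position bias can be sketched as follows. This assumes a `judge(first, second)` callable returning `"first"`, `"second"`, or `"tie"`; the strict both-orders rule is one reasonable policy, not the only one:

```python
def debiased_preference(judge, a, b):
    """Run a pairwise comparison in both presentation orders.
    Declare a winner only if it wins in both; otherwise treat the pair as a
    tie (or route it to human review), which neutralises position bias."""
    r1 = judge(a, b)  # a presented first
    r2 = judge(b, a)  # a presented second
    a_wins = (r1 == "first") + (r2 == "second")
    b_wins = (r1 == "second") + (r2 == "first")
    if a_wins == 2:
        return "a"
    if b_wins == 2:
        return "b"
    return "tie"  # orders disagree: the "preference" was positional
```

A judge that always prefers whichever response it sees first produces a tie under this scheme, which is exactly the signal you want.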
# JSON braces are doubled so the template survives str.format() with the
# {query}/{response}/{product_data} placeholders.
RUBRIC_PROMPT = """\
Score the following response on FACTUAL ACCURACY only (1-5).
1 = Contains clear factual errors about the product
3 = Mostly accurate, minor omissions
5 = Fully accurate, all claims verifiable from product data
Respond with JSON: {{"score": <int>, "reason": "<one sentence>"}}
Query: {query}
Response: {response}
Product data: {product_data}
"""
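Judge replies rarely come back as clean JSON, so the extraction step deserves defensive parsing. A minimal sketch — `parse_judge_reply` is a hypothetical helper, and the regex-then-validate approach is one assumption about how to handle replies that wrap the JSON in prose:

```python
import json
import re

def parse_judge_reply(raw, lo=1, hi=5):
    """Pull the first {...} object out of a judge reply and validate it.
    Returns the parsed dict, or None if the reply is malformed or the
    score falls outside the rubric's scale (send those to human review)."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if not match:
        return None
    try:
        obj = json.loads(match.group())
    except json.JSONDecodeError:
        return None
    score = obj.get("score")
    if isinstance(score, int) and lo <= score <= hi:
        return obj
    return None
```

Replies that parse to `None` are treated the same as inconsistent double-scores: flagged, not silently dropped.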