The problem

A prompt that works beautifully in a notebook against 50 hand-picked examples falls apart when you run it against 32,000 real products. The model starts hedging where it should be definitive. It hallucinates attributes that weren't in the product data. It breaks your structured output format on the weird edge cases you didn't anticipate.

Prompt engineering for production isn't a creative writing exercise. It's an engineering discipline with a feedback loop.

The intuition

The fundamental shift from demo to production: in a notebook, you iterate until the output looks good. In production, you need a metric that tells you whether the prompt is better or worse across the distribution of real inputs, not just your cherry-picked examples.

That means before you change a single word of your prompt, you need:

1. A labelled evaluation set (200–500 examples, ideally human-scored)
2. A scoring function that maps model output to a number
3. A baseline score to beat

Everything else follows from that loop: change the prompt → run against the eval set → did the score improve?
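That loop can be sketched as a plain function. The names here are illustrative, not from a library: `call_model` and `score` are whatever model wrapper and scoring function your project already has, and the eval set is assumed to be a list of input/expected pairs.

```python
def evaluate_prompt(prompt_template, eval_set, score, call_model):
    """Run a candidate prompt over the eval set and return the mean score.

    prompt_template: str with an {input} placeholder (hypothetical shape)
    eval_set: list of {"input": ..., "expected": ...} dicts
    score: maps (model_output, expected) -> float
    call_model: maps a prompt string to the model's output string
    """
    total = 0.0
    for example in eval_set:
        output = call_model(prompt_template.format(input=example["input"]))
        total += score(output, example["expected"])
    return total / len(eval_set)
```

A prompt change ships only if `evaluate_prompt(candidate, ...)` beats the baseline score you recorded for the current prompt.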

Prompt engineering without an eval set is just vibes. The eval set is what makes it engineering.

In practice

A few things that consistently made the difference across my production prompts:

Constrain the output format strictly. Ask for JSON with a defined schema. Include an example in the prompt. Validate the output with a parser on every response — don't assume the model will comply. When it doesn't, log the failure and retry with a stricter prompt fragment appended.
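A minimal sketch of that validate-and-retry loop, assuming a `call_model` wrapper you supply and a list of required top-level keys standing in for a full schema check:

```python
import json

STRICT_SUFFIX = (
    "\n\nYour previous reply was not valid JSON. "
    "Respond with ONLY a JSON object matching the schema, no prose."
)

def get_validated_json(call_model, prompt, required_keys, max_retries=2):
    """Call the model, parse the reply as JSON, and retry with a stricter
    prompt fragment appended whenever parsing or key validation fails."""
    current = prompt
    for attempt in range(max_retries + 1):
        raw = call_model(current)
        try:
            parsed = json.loads(raw)
            if all(k in parsed for k in required_keys):
                return parsed
        except json.JSONDecodeError:
            pass
        # In production you would log `raw` here before retrying.
        current = prompt + STRICT_SUFFIX
    raise ValueError("model never produced valid JSON")
```

Swap the `required_keys` check for a real schema validator (e.g. `jsonschema` or a Pydantic model) once the shape of the output stabilises.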

Separate retrieval from reasoning. If the model needs product data to answer, put that data in the prompt explicitly. Don't ask the model to "use its knowledge" — it will hallucinate. Give it the facts, ask it to reason.
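Concretely, that means the prompt builder embeds the retrieved facts and forbids answering from memory. A sketch, with the instruction wording as an illustrative example:

```python
import json

def build_grounded_prompt(product_data: dict, question: str) -> str:
    """Embed retrieved product data directly in the prompt so the model
    reasons over given facts instead of recalling (or inventing) them."""
    return (
        "Answer using ONLY the product data below. "
        "If the data does not contain the answer, say 'unknown'.\n\n"
        f"Product data:\n{json.dumps(product_data, indent=2)}\n\n"
        f"Question: {question}"
    )
```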

Spell out the hard cases. Review your failure modes from the eval set and add explicit handling for the top 3–5 edge cases. "If the product data does not contain a weight attribute, score completeness as 2, do not infer." That one instruction reduced our completeness scoring variance by 40%.
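Finding those top edge cases is just a frequency count over your eval failures. A minimal sketch, assuming each logged failure carries a hypothetical `tag` field naming its failure mode:

```python
from collections import Counter

def top_failure_modes(failures, n=5):
    """Rank failure modes by frequency so the most common ones get
    explicit handling in the prompt first.

    failures: list of dicts, each with a 'tag' string (e.g. 'missing_weight')
    """
    return Counter(f["tag"] for f in failures).most_common(n)
```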

Temperature is a dial, not a boolean. For classification and scoring tasks, temperature 0 is almost always right. For open-ended generation, temperature 0.3–0.5 usually gives better diversity without incoherence.
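One way to keep that rule of thumb honest is to make it data, not tribal knowledge. A hypothetical per-task lookup:

```python
# Illustrative defaults reflecting the rule of thumb above,
# not values from any particular library.
TEMPERATURE_BY_TASK = {
    "classification": 0.0,
    "scoring": 0.0,
    "extraction": 0.0,
    "generation": 0.4,  # middle of the 0.3-0.5 band for open-ended text
}

def temperature_for(task: str) -> float:
    # Unknown task types fall back to deterministic output.
    return TEMPERATURE_BY_TASK.get(task, 0.0)
```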

Going deeper (optional)

Prompt regression testing is the practice of running your eval set against every prompt change and tracking scores over time. Treat prompts like code: version them in git, run evals in CI, and block prompt changes that regress quality below a threshold.
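The CI gate itself can be a few lines. A sketch assuming your eval run produces a mean score and the baseline is committed alongside the prompt (both numbers here are placeholders):

```python
BASELINE_SCORE = 0.82  # committed with the prompt; updated when it improves
THRESHOLD = 0.01       # tolerated noise before a drop counts as a regression

def check_no_regression(new_score, baseline=BASELINE_SCORE, threshold=THRESHOLD):
    """Return True if the candidate prompt may ship; wire the False case
    to a failing CI check so regressing changes are blocked."""
    return new_score >= baseline - threshold
```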

For long prompts (>2,000 tokens), test what happens when your prompt is placed at different positions in a long context — models have documented attention decay in the middle of long contexts. Critical instructions belong at the beginning or end, not buried in the middle.
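Testing for that decay is mechanical: build the same prompt with the critical instructions at the start, middle, and end, then run each variant through your eval loop. A minimal sketch:

```python
def make_position_variants(instructions: str, filler: str) -> dict:
    """Build three prompts that differ only in where the critical
    instructions sit within the surrounding context."""
    half = len(filler) // 2
    return {
        "start": instructions + "\n\n" + filler,
        "middle": filler[:half] + "\n\n" + instructions + "\n\n" + filler[half:],
        "end": filler + "\n\n" + instructions,
    }
```

If the "middle" variant scores noticeably worse, move the instructions and keep the placement as a documented constraint on the prompt.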

import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# SCORING_RUBRIC is the judge prompt template, defined elsewhere in the
# project, with {product_data}, {query}, and {response} placeholders.

def score_response(product_data: dict, query: str, model_output: str) -> dict:
    """Score a model response against the rubric with a deterministic LLM judge."""
    prompt = SCORING_RUBRIC.format(
        product_data=json.dumps(product_data, indent=2),
        query=query,
        response=model_output,
    )
    result = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0,  # deterministic scoring
    )
    return json.loads(result.choices[0].message.content)