The problem
A product listing has a title, bullet points, images, and an HTML description. You need to score its quality across dimensions like visual consistency, attribute completeness, and claim accuracy. Text-only models can't see the images. Vision models can see images but struggle with the structured HTML. And the sheer volume — hundreds of thousands of listings — rules out manual review.
Multimodal evaluation pipelines are what you build when the signal you need spans multiple modalities simultaneously.
The intuition
The key architectural decision is whether to fuse modalities before or after scoring. Early fusion (send everything to one multimodal model) is simpler but loses nuance — the model has to attend to images and text simultaneously, which works poorly for precision tasks. Late fusion (score each modality separately, then aggregate) gives you better control and interpretability.
For product listings I used late fusion with three parallel scorers:

- A vision model scoring image quality and attribute-image consistency
- A text model scoring title and bullet-point completeness
- An HTML parser + text model scoring description structure and claim accuracy
Final scores are aggregated with learned weights calibrated on human labels.
Late fusion lets each model specialise. Aggregating signals is easier than expecting one model to hold everything at once.
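The weight calibration can be as simple as a least-squares fit of per-modality scores against human labels. A minimal sketch with made-up numbers (`modality_scores` and `human_labels` are illustrative, not real data):

```python
import numpy as np

# Hypothetical per-modality scores for four listings (vision, text, html),
# each in [1, 5], plus human quality labels to calibrate against.
modality_scores = np.array([
    [4.2, 3.8, 4.0],
    [2.1, 4.5, 3.2],
    [4.8, 4.9, 4.7],
    [1.5, 2.0, 2.5],
])
human_labels = np.array([4.0, 3.0, 4.8, 1.8])

# Fit aggregation weights by least squares, with a bias term.
X = np.hstack([modality_scores, np.ones((len(modality_scores), 1))])
weights, *_ = np.linalg.lstsq(X, human_labels, rcond=None)

def aggregate(scores: np.ndarray) -> float:
    """Combine per-modality scores into one calibrated quality score."""
    return float(np.append(scores, 1.0) @ weights)
```

In production you would fit on thousands of labelled listings, not four, and hold out a validation set to check the weights generalise.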
In practice
HTML ingestion is trickier than it sounds. Raw HTML sent to a language model produces noisy tokenisation — the model spends capacity on <div class="a-section"> instead of the content. I pre-process HTML to a clean markdown-like text representation: extract headings, list items, and paragraph text, strip all tags. The model gets clean signal.
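One way to do that stripping with only the standard library. This is a sketch: a real pipeline would also handle inline tags nested inside the elements it keeps (e.g. `<b>` inside `<li>`):

```python
from html.parser import HTMLParser

class ContentExtractor(HTMLParser):
    """Keep heading, list-item, and paragraph text; drop all other markup."""
    KEEP = {"h1", "h2", "h3", "p", "li"}

    def __init__(self):
        super().__init__()
        self.stack: list[str] = []   # open-tag stack
        self.lines: list[str] = []   # extracted text lines

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        # Only keep text whose immediately enclosing tag is content-bearing.
        if text and self.stack and self.stack[-1] in self.KEEP:
            prefix = "- " if self.stack[-1] == "li" else ""
            self.lines.append(prefix + text)

def html_to_text(html: str) -> str:
    parser = ContentExtractor()
    parser.feed(html)
    return "\n".join(parser.lines)
```

Running it on a wrapper-heavy fragment like `<div class="a-section"><h2>Features</h2><ul><li>Red</li></ul></div>` yields just the headings and bullets, which is all the model needs.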
For images, the main gotcha is resolution. Sending high-resolution images at full size balloons token cost with minimal quality gain. Resize to 512×512 or 768×768 before sending. Use structured output prompting to get attribute-specific scores rather than free-text descriptions — it's much easier to aggregate and validate.
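The resize step, sketched with Pillow. `MAX_SIDE` and the JPEG quality here are illustrative defaults, not tuned values:

```python
from io import BytesIO

from PIL import Image

MAX_SIDE = 768  # cap the longer side; detail above this rarely justifies the token cost

def prepare_image(image_bytes: bytes) -> bytes:
    """Downscale so the longer side is at most MAX_SIDE, re-encode as JPEG."""
    img = Image.open(BytesIO(image_bytes)).convert("RGB")
    scale = MAX_SIDE / max(img.size)
    if scale < 1:  # never upscale
        img = img.resize(
            (round(img.width * scale), round(img.height * scale)),
            Image.LANCZOS,
        )
    buf = BytesIO()
    img.save(buf, format="JPEG", quality=85)
    return buf.getvalue()
```

Preserving aspect ratio (rather than forcing a square) avoids distorting the product, which matters when the scorer is judging visual consistency.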
At Amazon we built a pipeline that ingested 300K product listings per week. The vision scorer flagged image-text inconsistencies (image shows a red item, title says blue) that text-only models completely missed. That single dimension alone caught a class of listing errors that had been invisible to the previous evaluation system.
Going deeper (optional)
The calibration challenge is harder for multimodal than text-only evaluation. Human labellers often disagree on visual quality judgements (is this image "professional"?). Use labeller agreement metrics (Fleiss' kappa or Krippendorff's alpha) to filter out dimensions where human consensus is too low to train against. If humans can't agree, a model won't learn the right signal either.
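Fleiss' kappa is small enough to compute directly. A minimal implementation over an items × categories matrix of rating counts:

```python
import numpy as np

def fleiss_kappa(ratings: np.ndarray) -> float:
    """Fleiss' kappa for an (items x categories) matrix of rating counts.

    ratings[i, j] = number of labellers who put item i in category j;
    every row must sum to the same number of labellers n.
    """
    n = ratings.sum(axis=1)[0]                   # labellers per item
    p_j = ratings.sum(axis=0) / ratings.sum()    # overall category proportions
    P_i = (np.square(ratings).sum(axis=1) - n) / (n * (n - 1))
    P_bar = P_i.mean()                           # observed agreement
    P_e = np.square(p_j).sum()                   # chance agreement
    return (P_bar - P_e) / (1 - P_e)
```

A common rule of thumb treats kappa below roughly 0.4 as weak agreement; dimensions scoring under that are candidates to drop or redefine before calibrating against them.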
The vision scorer in outline (assumes an OpenAI API key in the environment):

```python
import base64
import json

from openai import OpenAI

client = OpenAI()

def score_image_quality(image_path: str, attributes: list[str]) -> dict:
    # Inline the image as a base64 data URL.
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    prompt = f"""Score this product image on: {', '.join(attributes)}.
Return JSON: {{"scores": {{"attr": 1-5}}, "flags": ["issue1"]}}"""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            {"type": "text", "text": prompt},
        ]}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```