title: Monitoring LLMs in Production: What Actually Matters
date: 2026-04-04
excerpt: Why LLM monitoring is different from traditional software monitoring, the 4 signal types, building dashboards, and solving the silent failure problem.


Monitoring LLMs in Production: What Actually Matters

When I first shipped LLM systems into production at Amazon, I brought the monitoring playbook from traditional software. Metrics. Alerts. Dashboards. It felt rigorous and complete.

Then we started seeing failures that no alert caught.

A model would output valid JSON. Response times stayed flat. Cost per request didn't spike. But the model was hallucinating on 15% of queries — facts it fabricated, links that didn't exist, dates that were wrong. We only noticed because a customer complained. The system had been silently failing for three days.

This is the core problem with LLM monitoring: traditional metrics are insufficient. You can have zero errors, perfect latency, optimal cost, and still be serving garbage to your users. This essay is about what actually matters when you're running LLMs at scale.

The 4 Signal Types

LLM systems produce four distinct signal classes. You need all of them. Missing one creates a blind spot.

flowchart LR
    subgraph Easy["Easy to Measure"]
        LA[Latency\np50 / p95 / p99]
        CO[Cost\ntokens × price]
    end
    subgraph Hard["Hard to Measure — but Non-Negotiable"]
        QU[Quality\nLLM-as-judge on 1-5% traffic]
        SA[Safety\nhallucination rate · refusals]
    end
    LA -->|"alert: +20% over baseline"| DASH[Monitoring\nDashboard]
    CO -->|"alert: +15% per request"| DASH
    QU -->|"alert: -10% quality score"| DASH
    SA -->|"alert: any spike in harmful outputs"| DASH
    style Easy fill:#fafaf9,stroke:#e7e5e4
    style Hard fill:#fafaf9,stroke:#e7e5e4

Latency. How fast does the model respond? This is familiar territory — measure end-to-end time from request to response, include tokenization and parsing overhead, track percentiles (p50, p95, p99). Latency matters because it directly impacts user experience and your costs (longer context windows + slower throughput = more infra).
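A minimal sketch of the percentile math, assuming you already collect end-to-end latency samples in milliseconds (how you time them is up to your stack):

```python
import statistics

def latency_percentiles(samples_ms: list[float]) -> dict[str, float]:
    """Compute p50/p95/p99 from end-to-end latency samples (milliseconds)."""
    # quantiles(n=100) returns the 1st through 99th percentile cut points
    cuts = statistics.quantiles(samples_ms, n=100)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# Example: latencies measured from request received to last token parsed
print(latency_percentiles([820, 910, 1004, 1230, 2100, 3450, 980, 1100]))
```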

Cost. How much are you spending per request? Track input tokens, output tokens, and API calls separately. Cost is your constraint on scale. When your system grows 10x, cost grows with it. You need to see this signal clearly, because the economics of LLMs are brutal: a 10% improvement in tokens-per-request across 100M requests is a massive budget win. Most teams don't instrument this properly. They see a monthly bill and shrug.
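A sketch of the per-request arithmetic; the prices below are placeholders, not any provider's real rates:

```python
# Hypothetical per-1K-token prices; substitute your provider's actual rates.
PRICE_PER_1K = {"input": 0.0005, "output": 0.0015}

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one request, pricing input and output tokens separately."""
    return (
        (input_tokens / 1000) * PRICE_PER_1K["input"]
        + (output_tokens / 1000) * PRICE_PER_1K["output"]
    )

# A 10% reduction in average tokens-per-request compounds across every call.
print(round(request_cost(input_tokens=3200, output_tokens=450), 6))
```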

Quality. Is the model producing correct, useful, coherent outputs? This is the hard one. Quality signals are not binary, and they're expensive to measure. You can't grade every output. But you must measure some. The best approach: use a second, cheaper LLM to evaluate the expensive model's output. Ask it: "Does this answer the user's question factually?" or "Is this code bug-free?" Evaluate ~1-5% of production traffic this way. It's cheaper than most teams realize, and it's the only way to detect drift.
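A minimal sketch of the sampling-plus-judging loop. `call_judge` is a stand-in for whatever client wraps your cheaper evaluation model; it is not a real library call:

```python
import random

def should_sample_for_eval(rate: float = 0.02) -> bool:
    """Evaluate roughly `rate` of production traffic (here 2%)."""
    return random.random() < rate

def judge_output(question: str, answer: str, call_judge) -> bool:
    """Ask a cheaper 'judge' model whether the answer is factually responsive.

    `call_judge` is whatever function wraps your secondary model and returns
    its text completion; it is a hypothetical stand-in, not a library API.
    """
    prompt = (
        "Does the ANSWER respond to the QUESTION factually and completely? "
        "Reply with exactly PASS or FAIL.\n\n"
        f"QUESTION: {question}\n\nANSWER: {answer}"
    )
    return call_judge(prompt).strip().upper().startswith("PASS")
```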

Safety. Is the model refusing to respond when it should, or responding when it shouldn't? This includes hallucinations (the silent failure problem), prompt injection, jailbreak attempts, harmful content. Some of this you can catch with classifiers (is this output likely a hallucination? does it match our ground truth data?). Some you catch through user feedback loops. You need both.
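One possible shape for the classifier side, using deliberately crude heuristics (a substring refusal check and a word-overlap grounding proxy) as stand-ins for real trained classifiers:

```python
# Illustrative marker phrases; a trained classifier is the real fix.
REFUSAL_MARKERS = ("i can't help with", "i cannot assist", "as an ai")

def looks_like_refusal(output: str) -> bool:
    """Cheap heuristic refusal detector based on common refusal phrasing."""
    lowered = output.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def grounding_ratio(output_sentences: list[str], context: str) -> float:
    """Fraction of output sentences whose words mostly appear in the retrieved
    context; a crude proxy for whether the answer is grounded or invented."""
    context_words = set(context.lower().split())
    supported = sum(
        1 for s in output_sentences
        if len(set(s.lower().split()) & context_words) / max(len(s.split()), 1) > 0.5
    )
    return supported / max(len(output_sentences), 1)
```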

Key insight: Latency and cost are easy to measure. Quality and safety are hard and non-negotiable. Teams that skip them are flying blind.

What to Log

Most teams log too little or too much. Here's what actually matters:

  • The prompt + context window. Log the exact input the model saw. You cannot debug failures without this.
  • The model ID and version. Which model? Which checkpoint? This is crucial for correlating failures to model changes.
  • Generated output. The full completion, not truncated.
  • Token counts. Input and output, separately. This is how you track cost drift.
  • Latency. End-to-end from first byte to last.
  • User feedback. If the user marks output as helpful/unhelpful, log it.
  • Quality evaluation result. If you ran an evaluation on this output (e.g., "is this hallucinating?"), log the score.

Don't log every single request — that gets expensive fast. But log systematically. Log a consistent sample (e.g., 10% of all requests), plus all errors, plus all user-flagged issues. This ensures you catch patterns without drowning in data.

Common mistake: Logging only the final response. You need the full context window to reproduce failures.
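One way to structure that record and the sampling decision, as a sketch; the field names are illustrative, not a required schema:

```python
import random
from dataclasses import dataclass

@dataclass
class LLMRequestLog:
    """One production request, with everything needed to reproduce a failure."""
    prompt: str               # the full prompt + context window the model saw
    model_id: str             # provider model name plus your own version tag
    output: str               # full, untruncated completion
    input_tokens: int
    output_tokens: int
    latency_ms: float         # end-to-end, first byte to last
    user_feedback: str | None = None    # "helpful" / "unhelpful" if the user voted
    quality_score: float | None = None  # LLM-as-judge result, if this request was sampled

def should_log(is_error: bool, user_flagged: bool, sample_rate: float = 0.10) -> bool:
    """Log all errors and user-flagged requests, plus a 10% random sample."""
    return is_error or user_flagged or random.random() < sample_rate
```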

Detecting Drift

Models drift. Their behavior changes over time. Sometimes it's your fault (you changed the prompt, updated the model, modified the retriever). Sometimes it's the model provider's fault (they updated the weights, changed tokenization, degraded service quality). Sometimes it's the data (user behavior changed, distribution shifted).

Drift is silent. It doesn't throw errors or show up in crash logs. It just degrades slowly. You have to detect it actively.

Latency drift: Set a baseline (p95 latency on day 1). Alert if p95 latency increases by 20% over a rolling 7-day window. Sounds simple. Teams miss this all the time because they don't set a baseline.

Cost drift: Track cost per request. Average it daily. Alert on 15% increase. This catches prompt injection attacks (adversaries can force long outputs), retrieval system failures (you're pulling irrelevant context), or model provider changes.

Quality drift: This is harder. If you're evaluating 1% of traffic, you may only have a few hundred samples per day to work with. Use a moving average over 7 days to smooth noise. Alert on a 10% drop in your quality metric. Better yet, segment by query type; some queries might degrade while others stay flat.

Safety drift: Track hallucination rate, refusal rate, harmful content rate. These should be stable or improving. If refusal rate drops 50%, something changed. If hallucination rate rises, you have a problem.

Drift detection rule: Set a baseline. Define the alert threshold. Automate the check. Review it weekly.
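That rule fits in a few lines. A sketch, assuming you store daily aggregates somewhere you can query:

```python
from statistics import mean

def drift_alert(recent: list[float], baseline: float,
                threshold_pct: float, higher_is_worse: bool = True) -> bool:
    """Compare a rolling-window average against a fixed baseline.

    recent         e.g. the last 7 daily values of p95 latency or cost/request
    baseline       the value you recorded when the system was known-good
    threshold_pct  e.g. 20.0 means "alert on a 20% move in the bad direction"
    """
    current = mean(recent)
    change_pct = (current - baseline) / baseline * 100
    return change_pct >= threshold_pct if higher_is_worse else change_pct <= -threshold_pct

# Latency drift: p95 baseline 1200ms, alert on +20% over a 7-day window -> True
print(drift_alert([1350, 1400, 1480, 1500, 1550, 1600, 1620],
                  baseline=1200, threshold_pct=20))
# Quality drift: judge score baseline 0.92, alert on a 10% drop -> True
print(drift_alert([0.84, 0.83, 0.82, 0.81, 0.82, 0.80, 0.79],
                  baseline=0.92, threshold_pct=10, higher_is_worse=False))
```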

Building Dashboards

You need dashboards, but make them informative, not just pretty.

The "system health" dashboard. Quad view: latency (p50, p95, p99), cost per request (rolling average), quality metric (if you have one), error rate. Update every 5 minutes. This is your glance-at-a-glance view that says "is anything on fire?"

The "request detail" dashboard. Drill-down. Pick a failed request and see: the exact prompt, the exact output, the exact model used, latency, cost, quality evaluation result. This is where you debug.

The "cost analysis" dashboard. Cost per model, cost per endpoint, cost per user, cost per query type. This is how you find leverage points for optimization.

The "quality by segment" dashboard. Quality metric broken down by query type, user segment, model version, time of day. This reveals which queries are failing and which are fine.

Don't build one big dashboard. Build several small, focused ones. Each team member should spend 2 minutes looking at the relevant dashboard every morning.

Alerting Thresholds

Alerting is where monitoring becomes actionable. But most alert configurations are noise.

High-severity alerts (page on-call):

  • Error rate > 5% for 15 minutes
  • p99 latency > 2x baseline for 30 minutes
  • Quality metric drops > 20% over 24 hours
  • Safety alert (detected hallucination or refusal failure)

Medium-severity alerts (Slack, reviewed daily):

  • Cost per request up 15% over 7 days
  • p95 latency drifting up 10% week-over-week
  • Model provider reported service degradation

Low-severity alerts (dashboard only):

  • p50 latency variations
  • Daily cost within normal range but trending

The key: alerts should be infrequent, precise, and actionable. If you're getting 10+ alerts per week, your thresholds are wrong. You're training yourself to ignore alerts.
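One way to keep those thresholds reviewable is to write them down as data rather than scatter them across alerting UIs. A sketch, with placeholder metric names and routing targets:

```python
# Illustrative alert rules mirroring the thresholds above; metric names and
# severity routing are placeholders for whatever your alerting stack expects.
ALERT_RULES = [
    {"severity": "page",  "metric": "error_rate",       "condition": "> 5% for 15m"},
    {"severity": "page",  "metric": "latency_p99",      "condition": "> 2x baseline for 30m"},
    {"severity": "page",  "metric": "quality_score",    "condition": "drop > 20% over 24h"},
    {"severity": "page",  "metric": "safety_flags",     "condition": "any confirmed incident"},
    {"severity": "slack", "metric": "cost_per_request", "condition": "up 15% over 7d"},
    {"severity": "slack", "metric": "latency_p95",      "condition": "up 10% week-over-week"},
    {"severity": "none",  "metric": "latency_p50",      "condition": "dashboard only"},
]
```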

The Silent Failure Problem

This is the real challenge. A system can look healthy by every traditional metric while serving bad outputs.

I've seen this happen four ways:

Model degradation without distribution shift. The model provider updates weights. The new version is "better" on benchmarks but worse on your specific queries. You don't know unless you're measuring quality on your data.

Retrieval failure. You're using RAG (Retrieval Augmented Generation). The retriever returns irrelevant chunks. The model hallucinates based on bad context. Latency is fine. Cost is fine. But every fifth answer is garbage. You catch this by evaluating outputs against ground truth.

Prompt brittleness. A small change in user input breaks the prompt. The model fails to parse instructions or returns malformed output. You only see this if you're sampling failures and reviewing them.

Adversarial input. A user crafts a prompt that jailbreaks the model or extracts secrets. Your safety filtering misses it. The model responds with harmful content. Traditional monitoring sees a normal request. You need a safety evaluation classifier.

The solution: measure quality on a sample of every request type, every model, every day. It's the only way to catch silent failures before customers do.

Hard truth: You can't monitor your way to safety. But you definitely can fail to notice problems if you're not monitoring the right signals.

Putting It Together

Here's the real-world monitoring setup I'd recommend:

  1. Instrument everything. Log prompt, output, latency, tokens, model version, user ID, timestamp.
  2. Sample intelligently. 10% of all requests, 100% of errors, 100% of user-flagged issues.
  3. Evaluate systematically. Use a quality classifier to evaluate ~1-5% of traffic. Check for hallucinations, factuality, safety.
  4. Dashboard obsessively. Five minutes a day on system health. Thirty seconds on cost. This is how you catch problems early.
  5. Alert sparingly. High-severity alerts only. Review them within 5 minutes. The alert-to-action feedback loop should be tight.
  6. Review weekly. Look at error patterns. Spot quality drift. Correlate failures to changes (model updates, prompt versions, retriever changes).

This requires infrastructure. You need logging, evaluation tools, a metric store, a dashboard. But the alternative is serving failures to customers and not knowing about it. That's a worse problem.

The LLM systems that work best are not the ones with the most sophisticated models. They're the ones with the clearest visibility into what's actually happening in production. Invest in your monitoring, because your model is only as good as your ability to see when it's failing.