Why Your Agent Works in the Demo and Fails at Work

You build a booking agent. In your demo, it searches flights, checks the weather, and books a ticket in four clean steps. You deploy it. By week two, you're drowning in error tickets from production. What went wrong?

This is the difference between a prototype and a production system. I've debugged agents at Amazon that worked perfectly on 10 test cases and failed on 80% of real traffic. The failure modes are predictable. Let me walk through them.

Error Compounding: When One Mistake Cascades

This is the first thing that breaks production agents. A single error early in the task corrupts everything downstream.

The Cascade

Imagine a booking agent trying to find flights:

Step 1: Search flights to "Denver"

search_flights("Denver", "2026-04-05", budget=300)
→ Returns: 5 flights found
Status: OK ✓

Step 2: Check prices on results

Agent tries to extract price from result[0]
→ Result is malformed, missing "price" field
→ Agent passes `None` to next step
Status: ERROR ✗

Step 3: Filter by budget

Agent tries: if price < 300
→ None < 300 → TypeError
→ Agent panics, tries fallback logic
Status: ERROR ✗

Step 4: Recommend flight

Agent tries to recommend but has no valid flights
→ Generates: "Sorry, no flights available"
→ But flights existed! Agent just failed to parse them
Status: WRONG ANSWER ✗

This is error compounding. One parsing error cascaded into a wrong answer. And now the user is frustrated.

Why Agents Are Vulnerable

Regular programs are resilient to malformed data because you write error handling. Agents aren't. The LLM sees the error in its context and tries to reason about it, but it doesn't have access to the code that failed. It just sees a result it doesn't understand.

Agent's perspective:
"I called search_flights. I got back a result.
 The result has a 'flights' key but no 'price' field.
 Should I ignore this? Hallucinate a price? Ask the user?
 I'll guess..."

→ Agent guesses. Agent is wrong.

Mitigations

1. Validate tool outputs before passing to LLM

Before feeding a tool result back to the agent context, check it. This sketch validates with the jsonschema package; any schema validator works:

import jsonschema

def execute_tool(name, input_dict):
    result = tools[name](**input_dict)

    # Validate the result against the tool's expected schema
    try:
        jsonschema.validate(result, expected_schema[name])
    except jsonschema.ValidationError as e:
        # Don't pass bad data to the LLM
        return {
            "error": f"Tool result validation failed: {e.message}",
            "original_result": result
        }

    return result

2. Limit tool depth

Don't call 10 tools in sequence. Cap tasks at 3-5 steps; every extra step multiplies the failure risk.
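
A minimal sketch of a hard step cap. Here `agent_step` is a hypothetical stand-in for one LLM call that returns either a tool call or a final answer:

MAX_STEPS = 5  # hard ceiling on tool calls per task

def run_with_step_limit(task):
    context = [task]
    for _ in range(MAX_STEPS):
        action = agent_step(context)  # hypothetical: one LLM call
        if action.type == "final_answer":
            return action.content
        result = execute_tool(action.tool, action.input)
        context.append(result)
    # Budget exhausted: fail loudly instead of looping forever
    return {"error": f"Exceeded {MAX_STEPS} steps without an answer"}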

3. Use explicit error handling in prompts

Tell the agent what to do when things go wrong:

If a tool returns an error:
1. Read the error message carefully
2. Call the tool again with different parameters
3. If it fails twice, inform the user and stop

Do not guess or hallucinate.
Do not proceed with invalid data.

Context Window Limits: The Ceiling Everyone Hits

You think 100K tokens is infinite until you actually build with it.

An agent processing a real customer support case looks like this:

System prompt:              2K tokens
Conversation history:       15K tokens (20 turns)
Relevant docs (retrieved):  10K tokens
Previous examples:          8K tokens
Current request:            1K tokens
Tool descriptions:          5K tokens
                          --------
Running total:             41K tokens

Agent makes 5 tool calls.
Each tool result is ~2K tokens.
                            10K tokens

New total:                 51K tokens

User asks follow-up question...
Agent needs more docs...
                            +10K tokens

Running total:             61K tokens

After a few more interactions: 85K tokens

At 100K, you hit the ceiling. Agent can't function.

This happens fast. And it's expensive: you pay for every token in the context on every LLM call.

Real Problem: Hallucination at the Boundary

When the context window is nearly full, model behavior degrades. The agent stops grounding its answers in the actual context and starts hallucinating.

Agent context: [... 95K tokens ...]

Agent needs to search flights, but adding the search results
would exceed 100K limit.

Agent thinks: "I've seen flight data before.
              Let me reason about what flights probably exist."

Agent hallucinates: "United has a 7:30 AM flight for $240"
(Never searched. Made it up.)

This is insidious because it looks right.

Mitigations

1. Aggressive context truncation

Don't keep the full conversation history. Keep only the last few turns:

# Keep recent context only
recent_messages = messages[-10:]  # last 5 turns = 10 messages, ~2K tokens
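
If you want the cutoff to track tokens rather than message count, a sketch (assuming the `count_tokens` helper from mitigation 4 below):

def truncate_to_budget(messages, max_tokens=2000):
    # Walk backwards from the newest message; always keep at least one
    kept, total = [], 0
    for msg in reversed(messages):
        total += count_tokens(msg["content"])
        if total > max_tokens and kept:
            break
        kept.append(msg)
    return list(reversed(kept))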

2. Summarize long conversations

Before context gets big, summarize:

Original exchange: [15 turns, 10K tokens]

Summary:
"User wants flight to Denver tomorrow.
 Prefers morning departures under $300.
 Lives in San Francisco.
 Has TSA PreCheck."

Summary: [1 message, ~500 tokens]

Keep summary, drop original conversation.
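
One way to wire this up, assuming an OpenAI-style message list and using `llm.generate` (the same placeholder client that appears later in this post) for the summarization call:

def compact_history(messages, keep_last=6):
    # Summarize all but the most recent turns, then replace them
    # with a single summary message
    old, recent = messages[:-keep_last], messages[-keep_last:]
    if not old:
        return messages
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    summary = llm.generate(
        "Summarize this conversation in under 100 words, keeping every "
        "user preference and constraint:\n" + transcript
    )
    return [{"role": "system", "content": f"Summary so far: {summary}"}] + recent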

3. Use vector retrieval instead of in-context

Don't put docs in context. Retrieve them on-demand:

# Instead of:
context = full_docs + user_question  # Bloat

# Do this:
relevant_docs = vector_search(user_question, top_k=3)
context = relevant_docs + user_question  # Lean
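
A concrete sketch using the sentence-transformers package (one embedding model among many; substitute whatever your stack already uses):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def vector_search(question, doc_chunks, doc_embeddings, top_k=3):
    # Embed the question; with normalized vectors, dot product = cosine similarity
    q = model.encode([question], normalize_embeddings=True)[0]
    scores = doc_embeddings @ q
    best = np.argsort(scores)[::-1][:top_k]
    return [doc_chunks[i] for i in best]

# Embed the docs once at startup, not per request:
# doc_embeddings = model.encode(doc_chunks, normalize_embeddings=True)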

4. Monitor token usage

Real agents need token budgets:

token_budget = 80000  # Stay under 100K limit
current_tokens = count_tokens(context)

if current_tokens > token_budget:
    context = summarize_context(context, token_budget * 0.5)
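
The `count_tokens` helper can be as simple as this sketch using the tiktoken package (pick the encoding that matches your model):

import tiktoken

_enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text):
    return len(_enc.encode(text))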

Tool Reliability: When Your Tools Lie

You build an agent that uses your API. The API is "production-grade." Still fails.

Reasons:

  • Rate limiting: API returns 429, agent doesn't retry
  • Timeout: API takes 3 seconds, agent waits 1 second, assumes failure
  • Partial failures: API returns 200 but the data is incomplete
  • Silent bugs: API returns valid JSON but the values are wrong

At Amazon, we found that ~5% of API calls had subtle issues: not outright failures, but valid responses with wrong semantics.

Example: The Price Is Wrong

Agent calls: search_flights("Denver", date="2026-04-05")

API returns:
{
  "flights": [
    {"id": "UA123", "price": "$280"},  ← Should be $290
    {"id": "DL456", "price": null},     ← Missing price
  ]
}

API didn't error. Agent doesn't know this is wrong.
Agent recommends a flight that's actually $10 more expensive.

Mitigations

1. Implement API contracts

Define what valid responses look like and assert them:

def call_flight_api(destination, date):
    response = api.search_flights(destination, date)

    # Contract check
    assert "flights" in response
    for flight in response["flights"]:
        assert "id" in flight
        assert "price" in flight
        assert isinstance(flight["price"], (int, float))

    return response

2. Implement timeouts and retries

Don't call a tool once and assume it worked:

import time

def call_tool_with_retry(tool_name, input_dict, max_retries=3):
    for attempt in range(max_retries):
        try:
            result = tools[tool_name](
                **input_dict,
                timeout=5  # strict per-call timeout
            )
            return result
        except (Timeout, RateLimitError):  # whatever your client raises
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # exponential backoff: 1s, then 2s

3. Sanity check results

When results seem weird, flag them:

result = search_flights(...)
for flight in result["flights"]:
    if flight["price"] is None:
        raise ValueError(f"Flight {flight['id']} has no price")
    if flight["price"] < 0 or flight["price"] > 10000:
        raise ValueError(f"Flight price {flight['price']} is implausible")

4. Maintain a fallback

If the tool fails, do you have a backup plan?

try:
    flights = search_flights(destination, date)
except Exception as e:
    # Fallback: use cached results from yesterday
    flights = cache.get_flights(destination, date)
    if not flights:
        return {"error": "Unable to search flights"}

Prompt Brittleness: The Distribution Shift Problem

Your agent works on these inputs:

"Book me a flight to Denver tomorrow for $300"
"I need a flight to Denver, budget $300, tomorrow morning"
"Find a flight to Denver for tomorrow, under $300"

Then production hits it with:

"denver tomorrow under 300"  (no punctuation)
"FLY ME TO DENVER TOMORROW" (all caps)
"Can I get a Denver flight? Preferably tomorrow? Budget 300?" (question marks)
"I'm thinking Denver, whenever is cheapest" (no date specified)
"Flights to Denver—need it tomorrow" (em-dash)

And suddenly the agent's accuracy drops from 95% to 60%.

This is distribution shift. The real world doesn't match your demo.

Why This Happens

Your prompt probably says:

Extract the following from the user input:
- Destination city
- Departure date
- Budget in USD

Format: "destination, date, budget"

This works for well-formatted input. But "FLY ME TO DENVER TOMORROW" has no structure. The agent hallucinates.

Mitigations

1. Use structured parsing, not prompts

Don't ask the LLM to extract "destination, date, budget". Have it output JSON:

extraction_prompt = """
Extract the flight request. Output JSON only.
{
  "destination": "<city or null>",
  "date": "<YYYY-MM-DD or null>",
  "budget_usd": <number or null>
}
"""

Then validate the JSON schema. If invalid, retry.
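
A sketch of that loop (`llm.generate` is a stand-in for your client; json is stdlib):

import json

REQUIRED_KEYS = {"destination", "date", "budget_usd"}

def extract_with_retry(user_input, max_attempts=3):
    for _ in range(max_attempts):
        raw = llm.generate(extraction_prompt + "\nInput: " + user_input)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON: ask again
        if set(parsed) == REQUIRED_KEYS:
            return parsed
    raise ValueError("Could not extract a valid flight request")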

2. Make prompts defensive

Assume worst-case input:

The user's input may be:
- Misspelled
- Missing information
- Formatted unexpectedly
- Ambiguous

If destination is unclear, ask for clarification.
If date is missing, use tomorrow's date.
If budget is missing, don't assume a limit.

Beyond these defaults, do not guess or infer. Ask the user.

3. Test on diverse inputs

Before shipping, run 100 variations of every user request:

test_cases = [
    "book me a flight to Denver tomorrow",
    "BOOK ME A FLIGHT TO DENVER TOMORROW",
    "denver tomorrow",
    "Denver, tomorrow",
    "I want to fly to denver tomorrow",
    "Can you find me flights to Denver for tomorrow?",
    # ... 94 more variations
]

for test_case in test_cases:
    output = agent.run(test_case)
    assert_correct(output)

4. Use prompt templates with validation

Instead of free-form prompting, use templates:

def parse_flight_request(user_input):
    # Step 1: Extract with LLM
    extracted = llm.extract_json(user_input)

    # Step 2: Validate schema
    if not extracted.destination:
        raise ValueError("No destination found")

    # Step 3: Normalize
    extracted.destination = normalize_city(extracted.destination)
    extracted.date = parse_date(extracted.date or "tomorrow")

    return extracted
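
For completeness, a `normalize_city` sketch using stdlib fuzzy matching (the city list is illustrative; a real system would use a proper gazetteer):

import difflib

KNOWN_CITIES = ["Denver", "Sacramento", "San Francisco", "New York"]

def normalize_city(raw):
    # Case-insensitive fuzzy match: catches "denver", "Denvr", "DENVER"
    match = difflib.get_close_matches(
        raw.strip().title(), KNOWN_CITIES, n=1, cutoff=0.8
    )
    if not match:
        raise ValueError(f"Unknown city: {raw!r}")
    return match[0]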

Evaluation Gaps: You Don't Know What You Don't Know

You test your agent on 20 cases. It works on 19 of them. You ship it. Production breaks.

The problem: you didn't test the failure cases. You tested happy paths.

What You Probably Tested

1. "Book a flight to Denver tomorrow" → Works
2. "Find flights under $300" → Works
3. "I prefer morning flights" → Works
...
(All similar, well-formed requests)

What Production Does

1. "Book a flight to Denver tomorrow"
2. "Actually, change that to Sacramento"
3. "Wait, tomorrow won't work. How about next week?"
4. "Can you check the weather there?"
5. "Never mind, I'll drive. Cancel the search"
6. [User disappears for 3 hours]
7. "Are my flights still available?"

Your agent wasn't built for this. It doesn't handle:

  • Corrections mid-task
  • Cancellations
  • Context recovery after delays
  • Memory across sessions

Mitigations

1. Build a test harness

Don't just test individual inputs. Test trajectories:

test_trajectory = [
    {
        "user_input": "Book a flight to Denver tomorrow",
        "expected_action": ["search_flights", "present_options"]
    },
    {
        "user_input": "Actually, Sacramento instead",
        "expected_action": ["search_flights", "present_options"],
        "context_check": "Agent remembers the date (tomorrow)"
    },
    {
        "user_input": "How's the weather in Sacramento?",
        "expected_action": ["check_weather"],
        "context_check": "Agent remembers 'Sacramento' without re-asking"
    }
]

for step in test_trajectory:
    output = agent.run(step["user_input"])
    assert output_matches(output, step["expected_action"])

2. Test error cases

Explicitly test what breaks:

error_cases = [
    ("No destination: I want a flight", ["ask_clarification"]),
    ("No date: I want a flight to Denver", ["ask_clarification"]),
    ("Bad destination: I want to go to Atlantis", ["clarify_typo"]),
    ("Impossible budget: I want a $5 flight", ["explain_unrealistic"]),
    ("Rate limit: API is overloaded", ["fallback_or_retry"]),
]

for user_input, expected_behaviors in error_cases:
    output = agent.run(user_input)
    for behavior in expected_behaviors:
        assert behavior in output

3. Use golden trajectories

Record good agent runs and replay them as regression tests:

golden_trajectory = [
    {
        "turn": 1,
        "user": "Book flight to Denver tomorrow, under $300",
        "agent_action": "search_flights(destination='Denver', date='2026-04-05')",
        "agent_response": "Found 3 flights. Cheapest is United at 7:30 AM for $280."
    },
    {
        "turn": 2,
        "user": "That works, book it",
        "agent_action": "book_flight(flight_id='UA123')",
        "agent_response": "Booking confirmed. Your flight departs at 7:30 AM."
    }
]

# Replay this trajectory with new agent version
for step in golden_trajectory:
    output = agent.run(step["user"])
    assert_similar(output, step["agent_response"])  # Allow slight variations

The Debugging Problem: Black Box Agents

When your agent fails, where do you look?

  • Did the LLM make a reasoning error?
  • Did a tool return bad data?
  • Did the prompt mislead the agent?
  • Did the agent forget context?
  • Did the user input confuse it?

With traditional code, you have a stack trace. With agents, you have a conversation. Good luck.

What Makes Debugging Hard

Agent output: "No flights available to Denver"

Possible causes:
1. search_flights tool returned empty results
2. Tool call was malformed (agent sent wrong destination)
3. Tool was never called (agent decided not to)
4. Tool timed out (agent saw no results)
5. Agent forgot the destination (context issue)
6. Agent hallucinated "no results" (reasoning error)
7. Search succeeded but agent didn't understand JSON response

You need to see the agent's reasoning trace to know which. Most agents don't log it.

Mitigations

1. Log everything

Every tool call, tool result, and LLM output:

import logging
log = logging.getLogger("agent")

def run_agent_step(context, user_input):
    log.info(f"Agent input: {user_input}")
    log.info(f"Agent context tokens: {count_tokens(context)}")

    response = llm.generate(context)
    log.info(f"Agent output: {response}")

    tool_calls = parse_tool_calls(response)
    log.info(f"Tool calls detected: {tool_calls}")

    for call in tool_calls:
        log.info(f"Executing: {call.name}({call.input})")
        result = execute_tool(call.name, call.input)
        log.info(f"Tool result: {result}")

    return response

Then when something breaks, replay the logs:

[1] Agent input: "Book a flight to Denver tomorrow"
[2] Agent context tokens: 2400
[3] Agent output: "I'll search for flights to Denver..."
[4] Tool calls: [search_flights(destination='Denver', date='2026-04-05')]
[5] Tool result: {"flights": []}
[6] Agent output: "No flights available"

→ Clear problem: search_flights returned empty, agent reported it correctly
   Check: Is your test data populated? Is the date in the future?

2. Use trajectory analysis

Record full trajectories, not just inputs/outputs:

trajectory = {
    "user_id": "user_123",
    "goal": "book flight to Denver",
    "steps": [
        {
            "iteration": 1,
            "llm_input_tokens": 2400,
            "llm_output": "I'll search for flights...",
            "tool_calls": ["search_flights(...)"],
            "tool_results": {"flights": [...]},
            "reasoning": "Found flights, checking prices..."
        },
        {
            "iteration": 2,
            "llm_input_tokens": 3800,
            "llm_output": "Here are 3 options...",
            "tool_calls": [],
            "tool_results": null,
            "reasoning": "Provided recommendations, waiting for user input"
        }
    ],
    "outcome": "SUCCESS",
    "total_steps": 2,
    "total_cost_usd": 0.012
}

3. Build an agent debugger UI

Visualize the trajectory:

Step 1: search_flights
  Input: destination='Denver', date='2026-04-05'
  Output: [3 flights found]
  Reasoning: "Found flights, filtering by price..."

Step 2: present_options
  Reasoning: "All flights are under $300, showing all"
  Output: "Here are your options..."

Final: respond
  Output: "Which flight would you prefer?"

Show each step's input, output, and reasoning. Makes debugging obvious.
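
The renderer doesn't need to be fancy. Even a plain-text dump of the trajectory dict from mitigation 2 makes failures visible (a sketch, assuming those field names):

def render_trajectory(trajectory):
    print(f"Goal: {trajectory['goal']} -> {trajectory['outcome']}")
    for step in trajectory["steps"]:
        print(f"\nStep {step['iteration']}:")
        for call in step["tool_calls"]:
            print(f"  Tool:      {call}")
        print(f"  Reasoning: {step['reasoning']}")
        print(f"  Output:    {step['llm_output']}")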

Key Takeaways

Production agents fail because:

  1. Errors compound: One bad tool result breaks downstream logic
  2. Context windows are tight: Conversations fill fast, causing hallucinations
  3. Tools aren't reliable: APIs time out, return bad data, and rate-limit you
  4. Prompts are brittle: Real input is messier than your tests
  5. You test happy paths: Production traffic includes edge cases
  6. Debugging is hard: Black-box agents need full trajectory logging

The fix isn't magic. It's engineering discipline: validate, retry, monitor, test thoroughly, and log everything.

Your agent won't work until you treat it like production code.