Why Your Agent Works in the Demo and Fails at Work
You build a booking agent. In the demo, it searches flights, checks weather, and books a ticket in four clean steps. You deploy it. By week two, you're drowning in error tickets from production. What went wrong?
This is the difference between a prototype and a production system. I've debugged agents at Amazon that worked perfectly on 10 test cases and failed on 80% of real traffic. The failure modes are predictable. Let me walk through them.
Error Compounding: When One Mistake Cascades
This is the first thing that breaks production agents. A single error early in the task corrupts everything downstream.
The Cascade
Imagine a booking agent trying to find flights:
Step 1: Search flights to "Denver"
search_flights("Denver", "2026-04-05", budget=300)
→ Returns: 5 flights found
Status: OK ✓
Step 2: Check prices on results
Agent tries to extract price from result[0]
→ Result is malformed, missing "price" field
→ Agent passes `None` to next step
Status: ERROR ✗
Step 3: Filter by budget
Agent tries: if price < 300
→ None < 300 → TypeError
→ Agent panics, tries fallback logic
Status: ERROR ✗
Step 4: Recommend flight
Agent tries to recommend but has no valid flights
→ Generates: "Sorry, no flights available"
→ But flights existed! Agent just failed to parse them
Status: WRONG ANSWER ✗
This is error compounding. One parsing error cascaded into a wrong answer. And now the user is frustrated.
Why Agents Are Vulnerable
Regular programs are resilient to malformed data because you write error handling. Agents aren't. The LLM sees the error in its context and tries to reason about it, but it doesn't have access to the code that failed. It just sees a result it doesn't understand.
Agent's perspective:
"I called search_flights. I got back a result.
The result has a 'flights' key but no 'price' field.
Should I ignore this? Hallucinate a price? Ask the user?
I'll guess..."
→ Agent guesses. Agent is wrong.
Mitigations
1. Validate tool outputs before passing to LLM
Before feeding a tool result back to the agent context, check it:
def execute_tool(name, input_dict):
result = tools[name](**input_dict)
# Validate the result schema
try:
result = validate_schema(result, expected_schema[name])
except ValidationError as e:
# Don't pass bad data to the LLM
return {
"error": f"Tool result validation failed: {e}",
"original_result": result
}
return result
2. Limit tool depth
Don't call 10 tools in sequence. Cap the loop at 3-5 steps. Every additional step multiplies the chance that one of them fails.
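The cap is easiest to enforce in the agent loop itself. A minimal sketch: `llm_step` and `execute_tool` are hypothetical stand-ins for your own model wrapper and tool dispatcher.

```python
MAX_TOOL_STEPS = 5  # hard ceiling on sequential tool calls

def run_agent(user_input, llm_step, execute_tool):
    """Run the agent loop, refusing to exceed MAX_TOOL_STEPS tool calls.

    llm_step(context) -> {"type": "final", "content": ...} or
                         {"type": "tool", "name": ..., "input": ...}
    (assumed interface, adapt to your stack)
    """
    context = [{"role": "user", "content": user_input}]
    for _ in range(MAX_TOOL_STEPS):
        action = llm_step(context)
        if action["type"] == "final":
            return action["content"]
        result = execute_tool(action["name"], action["input"])
        context.append({"role": "tool", "content": result})
    # Budget exhausted: stop instead of compounding errors further
    return "Step limit reached; escalating to a human."
```

The key design choice is that hitting the cap produces an explicit hand-off, not a silent best-effort answer.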
3. Use explicit error handling in prompts
Tell the agent what to do when things go wrong:
If a tool returns an error:
1. Read the error message carefully
2. Call the tool again with different parameters
3. If it fails twice, inform the user and stop
Do not guess or hallucinate.
Do not proceed with invalid data.
Context Window Limits: The Ceiling Everyone Hits
You think 100K tokens is infinite until you actually build with it.
An agent processing a real customer support case looks like this:
System prompt: 2K tokens
Conversation history: 15K tokens (20 turns)
Relevant docs (retrieved): 10K tokens
Previous examples: 8K tokens
Current request: 1K tokens
Tool descriptions: 5K tokens
--------
Running total: 41K tokens
Agent makes 5 tool calls.
Each tool result is ~2K tokens.
+10K tokens
New total: 51K tokens
User asks follow-up question...
Agent needs more docs...
+10K tokens
Running total: 61K tokens
After a few more interactions: 85K tokens
At 100K, you hit the ceiling. Agent can't function.
This happens fast. And the cost is terrible—you're paying for every token.
Real Problem: Hallucination at the Boundary
When the context window is almost full, LLMs behave badly. They stop referencing the actual context and start hallucinating.
Agent context: [... 95K tokens ...]
Agent needs to search flights, but adding the search results
would exceed 100K limit.
Agent thinks: "I've seen flight data before.
Let me reason about what flights probably exist."
Agent hallucinates: "United has a 7:30 AM flight for $240"
(Never searched. Made it up.)
This is insidious because it looks right.
Mitigations
1. Aggressive context truncation
Don't keep full conversation history. Keep only the last 3-5 turns:
# Keep recent context only
recent_messages = messages[-10:] # 10 messages = ~2K tokens
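One caveat with a plain slice: it can drop the system prompt along with the old turns. A small sketch that truncates while always keeping the system message, assuming the usual list-of-role-dicts chat format:

```python
def truncate_history(messages, keep_last=10):
    """Drop old turns but always keep the system prompt.

    `messages` is assumed to be a list of {"role": ..., "content": ...}
    dicts with the system prompt somewhere near the front.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-keep_last:]
```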
2. Summarize long conversations
Before context gets big, summarize:
Original exchange: [15 turns, 10K tokens]
Summary:
"User wants flight to Denver tomorrow.
Prefers morning departures under $300.
Lives in San Francisco.
Has TSA PreCheck."
Summary: [~500 tokens]
Keep summary, drop original conversation.
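The swap can be sketched as follows. `llm_summarize(text) -> str` is a hypothetical wrapper around your LLM that returns a short factual summary like the one above:

```python
def compact_history(messages, llm_summarize, max_recent=6):
    """Replace old turns with a single summary message.

    Keeps the last `max_recent` messages verbatim; everything older is
    collapsed into one system message built from `llm_summarize`.
    """
    if len(messages) <= max_recent:
        return messages  # nothing worth compacting yet
    old, recent = messages[:-max_recent], messages[-max_recent:]
    transcript = "\n".join(f'{m["role"]}: {m["content"]}' for m in old)
    summary = llm_summarize(transcript)
    return [{"role": "system", "content": f"Conversation so far: {summary}"}] + recent
```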
3. Use vector retrieval instead of in-context
Don't put docs in context. Retrieve them on-demand:
# Instead of:
context = full_docs + user_question # Bloat
# Do this:
relevant_docs = vector_search(user_question, top_k=3)
context = relevant_docs + user_question # Lean
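`vector_search` above is pseudocode. As a dependency-free stand-in with the same shape, here is a bag-of-words cosine retriever; a real system would score embedding vectors instead, but the interface (query in, top-k docs out) is identical. The explicit `docs` parameter is an assumption of this sketch.

```python
import math
from collections import Counter

def _vec(text):
    return Counter(text.lower().split())

def _cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)  # missing keys count as 0
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def vector_search(query, docs, top_k=3):
    """Return the top_k docs most similar to the query."""
    q = _vec(query)
    return sorted(docs, key=lambda d: _cosine(q, _vec(d)), reverse=True)[:top_k]
```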
4. Monitor token usage
Real agents need token budgets:
token_budget = 80000 # Stay under 100K limit
current_tokens = count_tokens(context)
if current_tokens > token_budget:
context = summarize_context(context, token_budget * 0.5)
Tool Reliability: When Your Tools Lie
You build an agent that uses your API. The API is "production-grade." It still fails.
Reasons:
- Rate limiting: API returns 429, agent doesn't retry
- Timeout: API takes 3 seconds, agent waits 1 second, assumes failure
- Partial failures: API returns 200 but the data is incomplete
- Silent bugs: API returns valid JSON but the values are wrong
In my experience at Amazon, we found that ~5% of API calls had subtle issues. Not failures—valid responses with wrong semantics.
Example: The Price Is Wrong
Agent calls: search_flights("Denver", date="2026-04-05")
API returns:
{
"flights": [
{"id": "UA123", "price": "$280"}, ← Should be $290
{"id": "DL456", "price": null}, ← Missing price
]
}
API didn't error. Agent doesn't know this is wrong.
Agent recommends a flight that's actually $30 more expensive.
Mitigations
1. Implement API contracts
Define what valid responses look like and assert them:
def call_flight_api(destination, date):
response = api.search_flights(destination, date)
# Contract check
assert "flights" in response
for flight in response["flights"]:
assert "id" in flight
assert "price" in flight
assert isinstance(flight["price"], (int, float))
return response
2. Implement timeouts and retries
Don't call tools once and assume it works:
import time  # for the backoff sleeps

def call_tool_with_retry(tool_name, input_dict, max_retries=3):
    for attempt in range(max_retries):
        try:
            result = tools[tool_name](
                **input_dict,
                timeout=5  # strict timeout
            )
            return result
        except (Timeout, RateLimitError):  # your HTTP client's exceptions
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # exponential backoff
3. Sanity check results
When results seem weird, flag them:
result = search_flights(...)
for flight in result["flights"]:
if flight["price"] is None:
raise ValueError(f"Flight {flight['id']} has no price")
if flight["price"] < 0 or flight["price"] > 10000:
raise ValueError(f"Flight price {flight['price']} is implausible")
4. Maintain a fallback
If the tool fails, do you have a backup plan?
try:
flights = search_flights(destination, date)
except Exception as e:
# Fallback: use cached results from yesterday
flights = cache.get_flights(destination, date)
if not flights:
return {"error": "Unable to search flights"}
Prompt Brittleness: The Distribution Shift Problem
Your agent works on these inputs:
"Book me a flight to Denver tomorrow for $300"
"I need a flight to Denver, budget $300, tomorrow morning"
"Find a flight to Denver for tomorrow, under $300"
Then production hits it with:
"denver tomorrow under 300" (no punctuation)
"FLY ME TO DENVER TOMORROW" (all caps)
"Can I get a Denver flight? Preferably tomorrow? Budget 300?" (question marks)
"I'm thinking Denver, whenever is cheapest" (no date specified)
"Flights to Denver—need it tomorrow" (em-dash)
And suddenly the agent's accuracy drops from 95% to 60%.
This is distribution shift. The real world doesn't match your demo.
Why This Happens
Your prompt probably says:
Extract the following from the user input:
- Destination city
- Departure date
- Budget in USD
Format: "destination, date, budget"
This works for well-formatted input. But "FLY ME TO DENVER TOMORROW" has no structure. The agent hallucinates.
Mitigations
1. Use structured parsing, not prompts
Don't ask the LLM to extract "destination, date, budget". Have it output JSON:
extraction_prompt = """
Extract the flight request. Output JSON only.
{
"destination": "<city or null>",
"date": "<YYYY-MM-DD or null>",
"budget_usd": <number or null>
}
"""
Then validate the JSON schema. If invalid, retry.
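That extract-validate-retry loop can be sketched like this; `llm_extract(user_input) -> str` is a hypothetical wrapper that returns the model's raw text response:

```python
import json

REQUIRED_KEYS = {"destination", "date", "budget_usd"}

def extract_with_retry(user_input, llm_extract, max_attempts=3):
    """Ask the LLM for JSON, validate the shape, retry on failure."""
    for attempt in range(max_attempts):
        raw = llm_extract(user_input)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON: try again
        if isinstance(data, dict) and REQUIRED_KEYS <= data.keys():
            return data  # all keys present (values may still be null)
    raise ValueError(f"Could not extract a valid request from: {user_input!r}")
```

Raising after `max_attempts` matters: it converts a silent hallucination into a visible error you can route to clarification or a human.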
2. Make prompts defensive
Assume worst-case input:
The user's input may be:
- Misspelled
- Missing information
- Formatted unexpectedly
- Ambiguous
If destination is unclear, ask for clarification.
If date is missing, use tomorrow's date.
If budget is missing, don't assume a limit.
Beyond these defaults, do not guess or infer. Ask the user.
3. Test on diverse inputs
Before shipping, run 100 variations of every user request:
test_cases = [
"book me a flight to Denver tomorrow",
"BOOK ME A FLIGHT TO DENVER TOMORROW",
"denver tomorrow",
"Denver, tomorrow",
"I want to fly to denver tomorrow",
"Can you find me flights to Denver for tomorrow?",
# ... 94 more variations
]
for test_case in test_cases:
output = agent.run(test_case)
assert_correct(output)
4. Use prompt templates with validation
Instead of free-form prompting, use templates:
def parse_flight_request(user_input):
# Step 1: Extract with LLM
extracted = llm.extract_json(user_input)
# Step 2: Validate schema
if not extracted.destination:
raise ValueError("No destination found")
# Step 3: Normalize
extracted.destination = normalize_city(extracted.destination)
extracted.date = parse_date(extracted.date or "tomorrow")
return extracted
Evaluation Gaps: You Don't Know What You Don't Know
You test your agent on 20 cases. It works on 19 of them. You ship it. Production breaks.
The problem: you didn't test the failure cases. You tested happy paths.
What You Probably Tested
1. "Book a flight to Denver tomorrow" → Works
2. "Find flights under $300" → Works
3. "I prefer morning flights" → Works
...
(All similar, well-formed requests)
What Production Does
1. "Book a flight to Denver tomorrow"
2. "Actually, change that to Sacramento"
3. "Wait, tomorrow won't work. How about next week?"
4. "Can you check the weather there?"
5. "Never mind, I'll drive. Cancel the search"
6. [User disappears for 3 hours]
7. "Are my flights still available?"
Your agent wasn't built for this. It doesn't handle:
- Corrections mid-task
- Cancellations
- Context recovery after delays
- Memory across sessions
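The last two gaps, context recovery and cross-session memory, need state that outlives the process. A minimal sketch of session persistence; the file-per-user JSON layout and `sessions/` directory are assumptions, and any durable store works:

```python
import json
from pathlib import Path

SESSION_DIR = Path("sessions")  # assumed storage location

def save_session(user_id, messages):
    """Persist the conversation so it survives delays and restarts."""
    SESSION_DIR.mkdir(exist_ok=True)
    (SESSION_DIR / f"{user_id}.json").write_text(json.dumps(messages))

def load_session(user_id):
    """Restore a conversation; empty history if the user is new."""
    path = SESSION_DIR / f"{user_id}.json"
    return json.loads(path.read_text()) if path.exists() else []
```

With this in place, "Are my flights still available?" three hours later reloads the earlier context instead of starting from nothing.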
Mitigations
1. Build a test harness
Don't just test individual inputs. Test trajectories:
test_trajectory = [
{
"user_input": "Book a flight to Denver tomorrow",
"expected_action": ["search_flights", "present_options"]
},
{
"user_input": "Actually, Sacramento instead",
"expected_action": ["search_flights", "present_options"],
"context_check": "Agent remembers the date (tomorrow)"
},
{
"user_input": "How's the weather in Sacramento?",
"expected_action": ["check_weather"],
"context_check": "Agent remembers 'Sacramento' without re-asking"
}
]
for step in test_trajectory:
output = agent.run(step["user_input"])
assert output_matches(output, step["expected_action"])
2. Test error cases
Explicitly test what breaks:
error_cases = [
("No destination: I want a flight", ["ask_clarification"]),
("No date: I want a flight to Denver", ["ask_clarification"]),
("Bad destination: I want to go to Atlantis", ["clarify_typo"]),
("Impossible budget: I want a $5 flight", ["explain_unrealistic"]),
("Rate limit: API is overloaded", ["fallback_or_retry"]),
]
for user_input, expected_behaviors in error_cases:
output = agent.run(user_input)
for behavior in expected_behaviors:
assert behavior in output
3. Use golden trajectories
Record good agent runs and replay them as regression tests:
golden_trajectory = [
{
"turn": 1,
"user": "Book flight to Denver tomorrow, under $300",
"agent_action": "search_flights(destination='Denver', date='2026-04-05')",
"agent_response": "Found 3 flights. Cheapest is United at 7:30 AM for $280."
},
{
"turn": 2,
"user": "That works, book it",
"agent_action": "book_flight(flight_id='UA123')",
"agent_response": "Booking confirmed. Your flight departs at 7:30 AM."
}
]
# Replay this trajectory with new agent version
for step in golden_trajectory:
output = agent.run(step["user"])
assert_similar(output, step["agent_response"]) # Allow slight variations
The Debugging Problem: Black Box Agents
When your agent fails, where do you look?
- Did the LLM make a reasoning error?
- Did a tool return bad data?
- Did the prompt mislead the agent?
- Did the agent forget context?
- Did the user input confuse it?
With traditional code, you have a stack trace. With agents, you have a conversation. Good luck.
What Makes Debugging Hard
Agent output: "No flights available to Denver"
Possible causes:
1. search_flights tool returned empty results
2. Tool call was malformed (agent sent wrong destination)
3. Tool was never called (agent decided not to)
4. Tool timed out (agent saw no results)
5. Agent forgot the destination (context issue)
6. Agent hallucinated "no results" (reasoning error)
7. Search succeeded but agent didn't understand JSON response
You need to see the agent's reasoning trace to know which. Most agents don't log it.
Mitigations
1. Log everything
Every tool call, tool result, and LLM output:
def run_agent_step(context, user_input):
log.info(f"Agent input: {user_input}")
log.info(f"Agent context tokens: {count_tokens(context)}")
response = llm.generate(context)
log.info(f"Agent output: {response}")
tool_calls = parse_tool_calls(response)
log.info(f"Tool calls detected: {tool_calls}")
for call in tool_calls:
log.info(f"Executing: {call.name}({call.input})")
result = execute_tool(call.name, call.input)
log.info(f"Tool result: {result}")
return response
Then when something breaks, replay the logs:
[1] Agent input: "Book a flight to Denver tomorrow"
[2] Agent context tokens: 2400
[3] Agent output: "I'll search for flights to Denver..."
[4] Tool calls: [search_flights(destination='Denver', date='2026-04-05')]
[5] Tool result: {"flights": []}
[6] Agent output: "No flights available"
→ Clear problem: search_flights returned empty, agent reported it correctly
Check: Is your test data populated? Is the date in the future?
2. Use trajectory analysis
Record full trajectories, not just inputs/outputs:
trajectory = {
"user_id": "user_123",
"goal": "book flight to Denver",
"steps": [
{
"iteration": 1,
"llm_input_tokens": 2400,
"llm_output": "I'll search for flights...",
"tool_calls": ["search_flights(...)"],
"tool_results": {"flights": [...]},
"reasoning": "Found flights, checking prices..."
},
{
"iteration": 2,
"llm_input_tokens": 3800,
"llm_output": "Here are 3 options...",
"tool_calls": [],
"tool_results": null,
"reasoning": "Provided recommendations, waiting for user input"
}
],
"outcome": "SUCCESS",
"total_steps": 2,
"total_cost_usd": 0.012
}
3. Build an agent debugger UI
Visualize the trajectory:
Step 1: search_flights
Input: destination='Denver', date='2026-04-05'
Output: [3 flights found]
Reasoning: "Found flights, filtering by price..."
Step 2: present_options
Reasoning: "All flights are under $300, showing all"
Output: "Here are your options..."
Final: respond
Output: "Which flight would you prefer?"
Show each step's input, output, and reasoning. Makes debugging obvious.
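You don't need a UI to get most of this benefit; a plain-text renderer over the trajectory records goes a long way. A sketch, assuming steps shaped like the trajectory dict above:

```python
def render_trajectory(steps):
    """Format trajectory steps as a readable debugging view."""
    lines = []
    for i, step in enumerate(steps, start=1):
        # An empty tool_calls list means the agent just responded
        lines.append(f"Step {i}: {step.get('tool_calls') or 'respond'}")
        if step.get("reasoning"):
            lines.append(f"  Reasoning: {step['reasoning']}")
        lines.append(f"  Output: {step['llm_output']}")
    return "\n".join(lines)
```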
Key Takeaways
Production agents fail because:
- Errors compound: One bad tool result breaks downstream logic
- Context windows are tight: Conversations fill fast, causing hallucinations
- Tools aren't reliable: APIs timeout, return bad data, rate limit
- Prompts are brittle: Real input is messier than your tests
- You test happy paths: Production traffic includes edge cases
- Debugging is hard: Black-box agents need full trajectory logging
The fix isn't magic. It's engineering discipline: validate, retry, monitor, test thoroughly, and log everything.
Your agent won't work until you treat it like production code.