Evaluating Agents: How to Know If Your Agent Actually Works
You build an agent. It seems to work. You ship it. Users complain. You're confused—it worked in testing.
The problem: you evaluated it wrong.
Standard LLM evaluation—"Did the model give a good answer?"—doesn't apply to agents. Agents don't give answers. They take actions over multiple steps. A good action in step 1 might be wrong if step 2 fails.
I learned this the hard way at Amazon. We had an agent that could successfully complete 95% of test cases, but only 40% of real support tickets. The agent's individual steps were correct. The trajectory was wrong.
This article is about evaluating trajectories, not individual steps.
Why Standard LLM Eval Breaks for Agents
Standard LLM evaluation looks like:
Input: "Summarize this article"
Expected output: "The article discusses..."
Model output: "The article discusses..."
Match? ✓
Grade: PASS
For agents, this doesn't work:
Input: "Book me a flight to Denver tomorrow under $300"
Agent's trajectory:
Step 1: search_flights("Denver", "2026-04-05", budget=300)
→ Found 3 flights
Status: ✓ Correct
Step 2: filter_by_price(flights, max=300)
→ All 3 flights under $300
Status: ✓ Correct
Step 3: check_weather("Denver", "2026-04-05")
→ Sunny, 72°F
Status: ✓ Correct (unnecessary, but correct)
Step 4: recommend_cheapest(flights)
→ United at 7:30 AM, $280
Status: ✓ Correct
Step 5: book_flight("UA123", passenger="user123")
→ Booking failed (invalid passenger ID)
Status: ✗ Wrong
Final outcome: FAIL (user has no booking)
Individual steps were good. Trajectory was bad. Standard evaluation would miss this.
Trajectory vs. Outcome Evaluation
There are two evaluation strategies. They measure different things.
Trajectory Evaluation
Does the agent take the right steps in the right order?
Success = agent follows expected path
Metric: Did agent:
- Call the right tools in the right order?
- Extract correct information from tool results?
- Handle errors gracefully?
- Reason about the problem correctly?
Example of good trajectory:
Step 1: Understand goal (book flight)
Step 2: Ask for requirements (destination, date, budget)
Step 3: Search flights with those parameters
Step 4: Present options
Step 5: Confirm selection
Step 6: Book flight
Step 7: Confirm booking
Each step has an expected tool call and reasoning.
Pros:
- Captures the reasoning process
- Detects hallucinations ("why did it call this tool?")
- Shows whether agent understands the problem
- Helps debug (see exactly where it went wrong)
Cons:
- Requires defining "correct trajectory" (subjective)
- Multiple valid trajectories exist (search first vs. ask first)
- Brittle (small variations = failure)
- Expensive to grade (need human judgment)
Outcome Evaluation
Does the agent achieve the goal?
Success = agent achieves user's goal
Metric: Did agent:
- Complete the task?
- Arrive at a correct answer?
- Solve the user's problem?
Example of good outcome:
- Goal: Book a flight
- Result: Flight is booked ✓
- Goal: Answer question
- Result: Answer is correct and helpful ✓
- Goal: Fix bug
- Result: Bug is fixed ✓
Pros:
- Simple (did it work or not?)
- Objective (often measurable)
- What users care about
- Cheap to grade (automated checks often work)
Cons:
- Misses "lucky success" (agent gets right answer for wrong reasons)
- Can't debug process
- Doesn't penalize inefficiency (uses 10 steps instead of 3)
- Hides hallucinations
Trajectory Evaluation: The Right Way
To evaluate trajectories, you need:
- Golden trajectories (what the agent should do)
- Actual trajectories (what your agent does)
- A grading rubric (how to score similarity)
Building Golden Trajectories
Manually create examples of correct agent behavior:
golden_trajectories = [
{
"name": "book_flight_happy_path",
"input": "Book me a flight to Denver tomorrow under $300",
"expected_steps": [
{
"step": 1,
"tool": "search_flights",
"input": {
"destination": "Denver",
"date": "2026-04-05",
"budget": 300
},
"expected_output": {
"flights": [...],
"status": "success"
}
},
{
"step": 2,
"tool": "present_options",
"input": {"flights": [...]},
"expected_output": {
"message": "Here are your options...",
"recommended": "cheapest or earliest"
}
},
{
"step": 3,
"tool": "book_flight",
"input": {"flight_id": "...", "passenger_name": "..."},
"expected_output": {"status": "booked"}
}
]
},
{
"name": "book_flight_no_results",
"input": "Book me a flight to Denver tomorrow under $50",
"expected_steps": [
{
"step": 1,
"tool": "search_flights",
"input": {"destination": "Denver", "date": "...", "budget": 50},
"expected_output": {"flights": [], "status": "success"}
},
{
"step": 2,
"tool": "inform_user",
"input": {"message": "No flights found under $50"},
"expected_output": {"status": "informed"}
}
]
}
]
Grading: Levenshtein Distance on Trajectories
Compare actual and expected trajectories:
def levenshtein_distance(a, b):
    # Classic dynamic-programming edit distance; works on any sequences,
    # here lists of tool names
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[len(b)]

def trajectory_similarity(actual, expected):
    """
    Compare actual trajectory to expected trajectory.
    Returns a 0-1 score.
    """
    actual_tools = [step["tool"] for step in actual]
    expected_tools = [step["tool"] for step in expected]
    # Levenshtein distance (edit distance):
    # how many edits to transform actual into expected?
    distance = levenshtein_distance(actual_tools, expected_tools)
    # Normalize: how many changes per step?
    max_distance = max(len(actual_tools), len(expected_tools))
    if max_distance == 0:
        return 1.0
    similarity = 1 - (distance / max_distance)
    return similarity

# Example (trajectories as lists of steps, each with a "tool" field):
actual_trajectory = [{"tool": t} for t in
    ["search_flights", "search_flights", "present_options", "book_flight"]]
expected_trajectory = [{"tool": t} for t in
    ["search_flights", "present_options", "book_flight"]]
similarity = trajectory_similarity(actual_trajectory, expected_trajectory)
# Levenshtein distance = 1 (one extra search_flights call)
# Similarity = 1 - (1/4) = 0.75
Interpretation:
- 1.0: Perfect match
- 0.8-0.99: Slightly different but reasonable
- 0.5-0.79: Major deviations, but achieves goal
- <0.5: Completely wrong approach
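If you want to report those bands as labels rather than raw numbers, a tiny helper does it; the thresholds below are copied from the list above:
def interpret_similarity(score):
    # Thresholds mirror the interpretation above
    if score >= 1.0:
        return "perfect match"
    if score >= 0.8:
        return "slightly different but reasonable"
    if score >= 0.5:
        return "major deviations"
    return "wrong approach"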
Rubric: Categorical Grading
Instead of just similarity, grade different aspects:
def grade_trajectory(actual, expected, actual_output):
"""
Grade agent's trajectory across multiple dimensions.
"""
grade = {
"path_similarity": trajectory_similarity(actual, expected),
"reasoning_quality": grade_reasoning(actual),
"tool_correctness": grade_tool_calls(actual),
"error_handling": grade_error_handling(actual),
"final_outcome": grade_outcome(actual_output)
}
# Weighted overall score
weights = {
"path_similarity": 0.3,
"reasoning_quality": 0.2,
"tool_correctness": 0.2,
"error_handling": 0.15,
"final_outcome": 0.15
}
overall = sum(
grade[key] * weights[key]
for key in weights
)
return grade, overall
def grade_reasoning(trajectory):
    """Does the agent explain its thinking before each action?"""
    # Reward trajectories where every step recorded a non-empty "Thought:" / reasoning string
    has_reasoning = all(
        step.get("reasoning", "").strip()
        for step in trajectory
    )
    return 1.0 if has_reasoning else 0.5
def grade_tool_calls(trajectory):
    """Are tool calls well-formed and sensible?"""
    valid_calls = 0
    for step in trajectory:
        # is_valid_tool_call: your own check that the named tool exists
        # and the arguments match that tool's schema
        if is_valid_tool_call(step):
            valid_calls += 1
    return valid_calls / len(trajectory) if trajectory else 1.0
def grade_error_handling(trajectory):
"""When errors occur, does agent handle gracefully?"""
error_steps = [s for s in trajectory if s.get("error")]
if not error_steps:
return 1.0 # No errors, perfect
handled_well = sum(
1 for step in error_steps
if step.get("recovery_action") # Agent tried to recover
)
return handled_well / len(error_steps)
def grade_outcome(actual_output):
"""Did the agent accomplish the goal?"""
if actual_output.get("success"):
return 1.0
elif actual_output.get("partial_success"):
return 0.5
else:
return 0.0
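Calling the rubric end to end looks roughly like this; run_record and golden are hypothetical names for whatever your harness logs and your golden set stores:
# Hypothetical wiring: names are illustrative, not a fixed schema
grade, overall = grade_trajectory(
    actual=run_record["trajectory"],       # list of step dicts the agent produced
    expected=golden["expected_steps"],     # from your golden trajectories
    actual_output=run_record["final_output"],
)
print(grade["path_similarity"], grade["final_outcome"], overall)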
Outcome Evaluation: Automated and Scalable
Outcome evaluation is simpler. You just check: did it work?
Metric 1: Task Completion
def evaluate_task_completion(agent_output, task_specification):
    """
    Did the agent complete the task?
    """
    if task_specification["type"] == "booking":
        return agent_output.get("booking_confirmed") is True
    elif task_specification["type"] == "search":
        return len(agent_output.get("results", [])) > 0
    elif task_specification["type"] == "answer":
        expected_answer = task_specification["expected_answer"]
        actual_answer = agent_output.get("answer")
        return actual_answer == expected_answer
    # ... more task types
    return False  # Unknown task type: treat as not completed
Metric 2: Efficiency (Step Count)
def evaluate_efficiency(trajectory, min_expected_steps=3):
    """
    How many steps did the agent take?
    Fewer is better (as long as they're the right steps).
    """
    actual_steps = len(trajectory)
    if actual_steps <= min_expected_steps:
        return 1.0  # Efficient
    elif actual_steps <= min_expected_steps * 2:
        return 0.8  # Acceptable
    else:
        return 0.5  # Inefficient
Metric 3: Cost
def evaluate_cost(trajectory):
    """
    How many tokens did the agent use?
    How many API calls?
    """
    total_tokens = sum(
        step.get("tokens", 0)
        for step in trajectory
    )
    api_calls = len([
        step for step in trajectory
        if step.get("type") == "tool_call"
    ])
    # Illustrative pricing; plug in your model's actual rates
    cost_usd = (total_tokens / 1000) * 0.001 + api_calls * 0.01
    if cost_usd < 0.01:
        return 1.0  # Cheap
    elif cost_usd < 0.05:
        return 0.8
    else:
        return 0.5  # Expensive
Building a Test Harness
class AgentTestHarness:
def __init__(self, agent):
self.agent = agent
self.results = []
def run_evaluation(self, test_cases):
"""Run all test cases and collect results"""
for test_case in test_cases:
result = self.evaluate_single(test_case)
self.results.append(result)
return self.summarize()
def evaluate_single(self, test_case):
"""Evaluate one test case"""
input_text = test_case["input"]
# Run agent
actual_trajectory, actual_output = self.agent.run(input_text)
# Grade trajectory
trajectory_grade, trajectory_score = grade_trajectory(
actual_trajectory,
test_case.get("expected_trajectory"),
actual_output
)
# Grade outcome
outcome_score = evaluate_task_completion(
actual_output,
test_case["task"]
)
return {
"test_name": test_case["name"],
"input": input_text,
"trajectory_score": trajectory_score,
"outcome_score": outcome_score,
"actual_trajectory": actual_trajectory,
"passed": trajectory_score > 0.7 and outcome_score > 0.8
}
def summarize(self):
"""Return summary statistics"""
passed = sum(1 for r in self.results if r["passed"])
total = len(self.results)
avg_trajectory = sum(
r["trajectory_score"] for r in self.results
) / total
avg_outcome = sum(
r["outcome_score"] for r in self.results
) / total
return {
"pass_rate": passed / total,
"avg_trajectory_score": avg_trajectory,
"avg_outcome_score": avg_outcome,
"results": self.results
}
# Usage:
harness = AgentTestHarness(my_agent)
results = harness.run_evaluation(test_cases)
print(f"Pass rate: {results['pass_rate']:.1%}")
print(f"Avg trajectory score: {results['avg_trajectory_score']:.2f}")
print(f"Avg outcome score: {results['avg_outcome_score']:.2f}")
Using LLM-as-Judge for Agent Evaluation
Sometimes the trajectory is too subjective to grade with fixed rules. Let another LLM judge it.
import json

def grade_trajectory_with_llm(actual_trajectory, expected_trajectory, user_input):
"""
Use an LLM to judge if trajectory is reasonable.
"""
prompt = f"""
User requested: {user_input}
Expected approach:
{json.dumps(expected_trajectory, indent=2)}
Agent's actual approach:
{json.dumps(actual_trajectory, indent=2)}
Is the agent's approach reasonable?
Consider:
- Does it achieve the goal?
- Does it follow logical reasoning?
- Are there obvious mistakes?
- Is it reasonably efficient?
Grade: 1 (bad) to 5 (excellent)
Explanation: [Your reasoning]
JSON output:
{{"grade": 1-5, "explanation": "..."}}
"""
    # "llm" is whatever client wraps your judge model; this assumes it returns valid JSON
    response = llm.generate(prompt)
    result = json.loads(response)
    return result["grade"] / 5  # Normalize to 0-1
Pros:
- Captures nuance (human-like judgment)
- Handles multiple valid approaches
- Can evaluate reasoning
Cons:
- Expensive (calls another LLM)
- Non-deterministic (varies between runs)
- Can hallucinate (judge might be wrong)
Production Monitoring Signals
Evaluation doesn't end at deployment. Monitor real user interactions.
from datetime import datetime, timedelta

class AgentMonitor:
    def __init__(self):
        self.metrics = {
            "trajectories": [],
            "errors": [],
            "user_satisfaction": []
        }

    def log_trajectory(self, trajectory, outcome, user_feedback):
        """Log every agent run"""
        self.metrics["trajectories"].append({
            "steps": len(trajectory),
            "tools_used": [s["tool"] for s in trajectory],
            "success": outcome["success"],
            "timestamp": datetime.now(),
            "user_satisfaction": user_feedback
        })
        # Keep ratings in their own list so compute_satisfaction() has data to read
        if user_feedback is not None:
            self.metrics["user_satisfaction"].append({
                "rating": user_feedback,
                "timestamp": datetime.now()
            })
def log_error(self, error_type, trajectory_step, recovery_action):
"""Log errors"""
self.metrics["errors"].append({
"type": error_type,
"at_step": trajectory_step,
"recovered": recovery_action is not None,
"timestamp": datetime.now()
})
def get_alerts(self):
"""Check for concerning patterns"""
alerts = []
# Alert: Success rate dropping
recent_success_rate = self.compute_success_rate(
window_hours=1
)
if recent_success_rate < 0.8:
alerts.append({
"severity": "high",
"message": f"Success rate dropped to {recent_success_rate:.1%}"
})
        # Alert: Increasing error frequency
        total_runs = len(self.metrics["trajectories"])
        error_rate = len(self.metrics["errors"]) / total_runs if total_runs else 0.0
        if error_rate > 0.3:
            alerts.append({
                "severity": "medium",
                "message": f"Error rate: {error_rate:.1%}"
            })
# Alert: Decreasing user satisfaction
recent_satisfaction = self.compute_satisfaction(
window_hours=1
)
if recent_satisfaction < 3.5: # Out of 5
alerts.append({
"severity": "high",
"message": f"User satisfaction dropped to {recent_satisfaction:.1f}/5"
})
return alerts
def compute_success_rate(self, window_hours=24):
"""Success rate in time window"""
cutoff = datetime.now() - timedelta(hours=window_hours)
recent = [
t for t in self.metrics["trajectories"]
if t["timestamp"] > cutoff
]
if not recent:
return 1.0
successful = sum(1 for t in recent if t["success"])
return successful / len(recent)
def compute_satisfaction(self, window_hours=24):
"""Average satisfaction in time window"""
cutoff = datetime.now() - timedelta(hours=window_hours)
recent = [
s for s in self.metrics["user_satisfaction"]
if s["timestamp"] > cutoff
]
if not recent:
return 5.0
return sum(s["rating"] for s in recent) / len(recent)
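Wiring the monitor into a serving path is a few lines; my_agent, user_request, and notify_oncall below are placeholders for your own agent, request object, and paging hook:
monitor = AgentMonitor()

# After every production run
trajectory, outcome = my_agent.run(user_request)
monitor.log_trajectory(trajectory, outcome, user_feedback=None)  # attach a rating once the user gives one

# On a schedule (cron job or background task)
for alert in monitor.get_alerts():
    notify_oncall(alert)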
The Eval Loop
Evaluation isn't one-time. It's a loop:
Build agent
↓
Define golden trajectories
↓
Run evaluation
↓
Identify failures
↓
Debug failures
↓
Update agent or prompt
↓
Re-run evaluation
↓
Monitor production
↓
Collect user feedback
↓
Update golden trajectories
↓
Repeat
Each iteration improves the agent.
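The loop is easy to automate: run the harness on every agent or prompt change and block the release when scores regress. A minimal sketch, reusing the AgentTestHarness from earlier; my_agent and test_cases are yours, and the thresholds are illustrative:
import sys

harness = AgentTestHarness(my_agent)          # my_agent: the build under test
results = harness.run_evaluation(test_cases)  # test_cases: your golden set

# Release gate: tune these thresholds to your own task
if results["pass_rate"] < 0.9 or results["avg_outcome_score"] < 0.8:
    print("Eval regression detected, not shipping:", results["pass_rate"])
    sys.exit(1)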
Key Takeaways
Evaluating agents is different from evaluating LLMs:
- Trajectory evaluation measures the path (how agent thinks)
- Outcome evaluation measures the goal (did it work?)
- Use both: trajectory for debugging, outcome for acceptance
- Golden trajectories are your ground truth
- Levenshtein distance on tool sequences is simple baseline
- LLM-as-Judge for subjective trajectory quality
- Automate outcome checks when possible
- Monitor production for drift
- Build a test harness and iterate
Done right, evaluation catches problems before users do.
Your agent isn't ready for production until you can prove it works.
What's Changing Right Now (2025–2026)
Agent evaluation is one of the fastest-moving areas in AI. Here's what's happening and why it matters for practitioners.
LLM-as-Judge is Becoming the Default
Six months ago, most teams graded agents by hand or used rigid string-matching. Today, LLM-as-Judge is the industry standard for any task too nuanced for rule-based checking. Anthropic's Claude, OpenAI's GPT-4o, and Google's Gemini are all being used as evaluators for other LLMs' outputs.
The practical implication: your eval stack should have a judge_model that's separate from your agent_model. For routine regression runs, a cheaper judge model is usually enough; for critical evals, use a judge at least as strong as the model you're evaluating.
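Concretely, that can be as simple as two entries in your eval config. A sketch; the model names and the llm_client wrapper are placeholders, not any particular vendor's API:
# Hypothetical config: keep the judge separate from the agent under test
AGENT_MODEL = "agent-model-large"        # what the agent itself runs on
JUDGE_MODEL = "judge-model-small"        # cheap judge for routine regression runs
CRITICAL_JUDGE_MODEL = "judge-model-xl"  # stronger judge for high-stakes evals

def make_judge(critical=False):
    model = CRITICAL_JUDGE_MODEL if critical else JUDGE_MODEL
    return llm_client(model)  # llm_client: whatever SDK wrapper you already use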
Agent Benchmarks Are Maturing
Three benchmarks now matter for real-world agent work:
- SWE-Bench: Software engineering tasks (fix GitHub issues). State-of-the-art went from 3% (2023) to 55%+ (2025 Claude/GPT-4o). If your agent does coding, test it here.
- GAIA: General assistant tasks requiring real-world tool use (web search, file manipulation, calculator). Humans score ~92%. Top models ~75%. Gap still significant.
- AgentBench: Multi-domain agent tasks. More practical than academic benchmarks.
What this means for you: use SWE-Bench or a subset of GAIA to anchor your internal benchmarks to external reference points.
The Shift to Computer Use Agents
In 2024, Anthropic shipped Claude's "computer use" capability — the agent can take screenshots, move the mouse, and type. By 2025, browser-native agents are productionized at companies like Cognition (Devin), Imbue, and Adept.
Evaluating computer use agents is fundamentally different: you evaluate screenshots, not text. Teams are building screenshot-diffing tools and VLM-based judges that compare expected vs. actual UI state.
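A crude but useful starting point is a pixel diff between an expected screenshot and the one the agent produced; VLM judges build on the same compare-two-images idea. A sketch using Pillow, with an arbitrary noise threshold and tolerance:
from PIL import Image, ImageChops

def screenshots_match(expected_path, actual_path, tolerance=0.02):
    """Pass if the fraction of visibly different pixels is below tolerance."""
    expected = Image.open(expected_path).convert("RGB")
    actual = Image.open(actual_path).convert("RGB").resize(expected.size)
    diff = ImageChops.difference(expected, actual)
    changed = sum(1 for px in diff.getdata() if max(px) > 16)  # ignore tiny noise
    return changed / (expected.width * expected.height) <= tolerance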
Reliability > Capability
The emerging consensus in 2025: the bottleneck is not capability, it's reliability. A 95%-accurate agent fails 1 in 20 tasks — unacceptable for production. Teams are now optimizing specifically for:
- P99 reliability: not "average success rate" but "what happens on the worst 1% of tasks" (see the sketch after this list)
- Graceful degradation: agents that surface uncertainty instead of failing silently
- Human-in-the-loop triggers: automatic escalation when confidence drops below threshold
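The first of these is easy to compute from logs you already have. A sketch that groups runs by task and averages the worst slice; the record shape with task_id and success fields is a made-up example:
from collections import defaultdict

def worst_slice_success(runs, fraction=0.01):
    """Average success rate over the worst `fraction` of tasks."""
    by_task = defaultdict(list)
    for run in runs:
        by_task[run["task_id"]].append(run["success"])
    rates = sorted(sum(v) / len(v) for v in by_task.values())
    if not rates:
        return 1.0  # no data yet
    worst_n = max(1, int(len(rates) * fraction))
    return sum(rates[:worst_n]) / worst_n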
# Pattern: confidence-gated agent
# Assumes "agent" is in scope and returns a confidence estimate with its output
def run_with_confidence_gate(task: str, threshold: float = 0.85) -> dict:
    result = agent.run(task)
if result["confidence"] < threshold:
# Escalate to human instead of returning wrong answer
return {
"status": "needs_review",
"partial_result": result["output"],
"reason": result["uncertainty_reason"],
}
return {"status": "complete", "result": result["output"]}
This pattern — fail loudly rather than fail silently — is the difference between a demo and a production system.