Evaluating Agents: How to Know If Your Agent Actually Works

You build an agent. It seems to work. You ship it. Users complain. You're confused—it worked in testing.

The problem: you evaluated it wrong.

Standard LLM evaluation—"Did the model give a good answer?"—doesn't apply to agents. Agents don't give answers. They take actions over multiple steps. A good action in step 1 might be wrong if step 2 fails.

I learned this the hard way at Amazon. We had an agent that passed 95% of our test cases but resolved only 40% of real support tickets. The agent's individual steps were correct. The trajectory was wrong.

This article is about evaluating trajectories, not individual steps.

Why Standard LLM Eval Breaks for Agents

Standard LLM evaluation looks like:

Input: "Summarize this article"
Expected output: "The article discusses..."
Model output: "The article discusses..."

Match? ✓
Grade: PASS

For agents, this doesn't work:

Input: "Book me a flight to Denver tomorrow under $300"

Agent's trajectory:
Step 1: search_flights("Denver", "2026-04-05", budget=300)
        → Found 3 flights
        Status: ✓ Correct

Step 2: filter_by_price(flights, max=300)
        → All 3 flights under $300
        Status: ✓ Correct

Step 3: check_weather("Denver", "2026-04-05")
        → Sunny, 72°F
        Status: ✓ Correct (unnecessary, but correct)

Step 4: recommend_cheapest(flights)
        → United at 7:30 AM, $280
        Status: ✓ Correct

Step 5: book_flight("UA123", passenger="user123")
        → Booking failed (invalid passenger ID)
        Status: ✗ Wrong

Final outcome: FAIL (user has no booking)

Individual steps were good. Trajectory was bad. Standard evaluation would miss this.

Trajectory vs. Outcome Evaluation

There are two evaluation strategies. They measure different things.

Trajectory Evaluation

Does the agent take the right steps in the right order?

Success = agent follows expected path

Metrics: Did the agent:
- Call the right tools in the right order?
- Extract correct information from tool results?
- Handle errors gracefully?
- Reason about the problem correctly?

Example of good trajectory:
Step 1: Understand goal (book flight)
Step 2: Ask for requirements (destination, date, budget)
Step 3: Search flights with those parameters
Step 4: Present options
Step 5: Confirm selection
Step 6: Book flight
Step 7: Confirm booking

Each step has an expected tool call and reasoning.
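Concretely, each step can be represented as a small record. Here is a minimal sketch; the field names are my own, not a standard schema:

```python
from typing import Any, TypedDict

class TrajectoryStep(TypedDict, total=False):
    step: int                        # 1-based position in the trajectory
    tool: str                        # name of the tool the agent called
    input: dict[str, Any]            # arguments passed to the tool
    expected_output: dict[str, Any]  # what a correct call should return
    reasoning: str                   # the agent's stated rationale for this step

# One step from the flight-booking example:
step: TrajectoryStep = {
    "step": 3,
    "tool": "search_flights",
    "input": {"destination": "Denver", "date": "2026-04-05", "budget": 300},
    "reasoning": "User gave destination, date, and budget; search with them.",
}
```

Keeping the reasoning alongside the tool call is what makes the trajectory gradable later.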

Pros:

  • Captures the reasoning process
  • Detects hallucinations ("why did it call this tool?")
  • Shows whether agent understands the problem
  • Helps debug (see exactly where it went wrong)

Cons:

  • Requires defining "correct trajectory" (subjective)
  • Multiple valid trajectories exist (search first vs. ask first)
  • Brittle (small variations = failure)
  • Expensive to grade (need human judgment)
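One way to soften the "multiple valid trajectories" problem is to keep several golden variants and score the agent against the closest one. A sketch, using difflib's SequenceMatcher as a stand-in for any sequence-similarity measure:

```python
from difflib import SequenceMatcher

def best_match_score(actual_tools, golden_variants):
    """Score the actual tool sequence against the CLOSEST golden variant.

    SequenceMatcher.ratio() returns a 0-1 similarity; swap in normalized
    edit distance or any other sequence metric you prefer.
    """
    return max(
        SequenceMatcher(None, actual_tools, variant).ratio()
        for variant in golden_variants
    )

# Two equally valid approaches: ask for requirements first, or search first.
golden_variants = [
    ["ask_requirements", "search_flights", "present_options", "book_flight"],
    ["search_flights", "present_options", "book_flight"],
]
actual = ["search_flights", "present_options", "book_flight"]
score = best_match_score(actual, golden_variants)  # matches the 2nd variant exactly
```

An agent is no longer penalized for choosing a different, but acceptable, path.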

Outcome Evaluation

Does the agent achieve the goal?

Success = agent achieves user's goal

Metrics: Did the agent:
- Complete the task?
- Arrive at a correct answer?
- Solve the user's problem?

Example of good outcome:
- Goal: Book a flight
- Result: Flight is booked ✓

- Goal: Answer question
- Result: Answer is correct and helpful ✓

- Goal: Fix bug
- Result: Bug is fixed ✓

Pros:

  • Simple (did it work or not?)
  • Objective (often measurable)
  • What users care about
  • Cheap to grade (automated checks often work)

Cons:

  • Misses "lucky success" (agent gets right answer for wrong reasons)
  • Can't debug process
  • Doesn't penalize inefficiency (uses 10 steps instead of 3)
  • Hides hallucinations

Trajectory Evaluation: The Right Way

To evaluate trajectories, you need:

  1. Golden trajectories (what the agent should do)
  2. Actual trajectories (what your agent does)
  3. A grading rubric (how to score similarity)

Building Golden Trajectories

Manually create examples of correct agent behavior:

golden_trajectories = [
    {
        "name": "book_flight_happy_path",
        "input": "Book me a flight to Denver tomorrow under $300",
        "expected_steps": [
            {
                "step": 1,
                "tool": "search_flights",
                "input": {
                    "destination": "Denver",
                    "date": "2026-04-05",
                    "budget": 300
                },
                "expected_output": {
                    "flights": [...],
                    "status": "success"
                }
            },
            {
                "step": 2,
                "tool": "present_options",
                "input": {"flights": [...]},
                "expected_output": {
                    "message": "Here are your options...",
                    "recommended": "cheapest or earliest"
                }
            },
            {
                "step": 3,
                "tool": "book_flight",
                "input": {"flight_id": "...", "passenger_name": "..."},
                "expected_output": {"status": "booked"}
            }
        ]
    },
    {
        "name": "book_flight_no_results",
        "input": "Book me a flight to Denver tomorrow under $50",
        "expected_steps": [
            {
                "step": 1,
                "tool": "search_flights",
                "input": {"destination": "Denver", "date": "...", "budget": 50},
                "expected_output": {"flights": [], "status": "success"}
            },
            {
                "step": 2,
                "tool": "inform_user",
                "input": {"message": "No flights found under $50"},
                "expected_output": {"status": "informed"}
            }
        ]
    }
]

Grading: Levenshtein Distance on Trajectories

Compare actual and expected trajectories:

def levenshtein_distance(a, b):
    """Edit distance between two sequences (insert/delete/substitute)."""
    # Standard dynamic-programming edit distance, generalized to any sequences
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete
                            curr[j - 1] + 1,      # insert
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]

def trajectory_similarity(actual, expected):
    """
    Compare actual trajectory to expected trajectory.
    Returns a 0-1 score.
    """
    actual_tools = [step["tool"] for step in actual]
    expected_tools = [step["tool"] for step in expected]

    # Levenshtein distance: how many edits to transform actual into expected?
    distance = levenshtein_distance(actual_tools, expected_tools)

    # Normalize by the longer trajectory so the score lands in [0, 1]
    max_distance = max(len(actual_tools), len(expected_tools))
    similarity = 1 - (distance / max_distance)

    return similarity

# Example (trajectories are lists of step dicts, as above):
actual_trajectory = [{"tool": t} for t in
    ["search_flights", "search_flights", "present_options", "book_flight"]]
expected_trajectory = [{"tool": t} for t in
    ["search_flights", "present_options", "book_flight"]]

similarity = trajectory_similarity(actual_trajectory, expected_trajectory)
# Levenshtein distance = 1 (one extra search_flights call)
# Similarity = 1 - (1/4) = 0.75

Interpretation:

  • 1.0: Perfect match
  • 0.8-0.99: Slightly different but reasonable
  • 0.5-0.79: Major deviations, but achieves goal
  • <0.5: Completely wrong approach
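The bands above can be encoded as a small helper so grading is consistent across runs (the labels are mine):

```python
def classify_similarity(score: float) -> str:
    """Map a 0-1 trajectory-similarity score to a verdict band."""
    if score >= 1.0:
        return "perfect match"
    elif score >= 0.8:
        return "slightly different but reasonable"
    elif score >= 0.5:
        return "major deviations"
    else:
        return "wrong approach"

verdict = classify_similarity(0.75)  # the duplicated-search example above
```

Exact thresholds are a policy choice; tune them against trajectories you have graded by hand.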

Rubric: Categorical Grading

Instead of just similarity, grade different aspects:

def grade_trajectory(actual, expected, actual_output):
    """
    Grade agent's trajectory across multiple dimensions.
    """
    grade = {
        "path_similarity": trajectory_similarity(actual, expected),
        "reasoning_quality": grade_reasoning(actual),
        "tool_correctness": grade_tool_calls(actual),
        "error_handling": grade_error_handling(actual),
        "final_outcome": grade_outcome(actual_output)
    }

    # Weighted overall score
    weights = {
        "path_similarity": 0.3,
        "reasoning_quality": 0.2,
        "tool_correctness": 0.2,
        "error_handling": 0.15,
        "final_outcome": 0.15
    }

    overall = sum(
        grade[key] * weights[key]
        for key in weights
    )

    return grade, overall

def grade_reasoning(trajectory):
    """Does the agent explain its thinking clearly?"""
    # Heuristic: each step's reasoning field should contain a "Thought:" entry
    has_reasoning = all(
        "thought" in step.get("reasoning", "").lower()
        for step in trajectory
    )
    return 1.0 if has_reasoning else 0.5

def grade_tool_calls(trajectory):
    """Are tool calls well-formed and sensible?"""
    # is_valid_tool_call: your own schema check (tool exists, args validate)
    valid_calls = 0
    for step in trajectory:
        if is_valid_tool_call(step):
            valid_calls += 1

    return valid_calls / len(trajectory) if trajectory else 1.0

def grade_error_handling(trajectory):
    """When errors occur, does agent handle gracefully?"""
    error_steps = [s for s in trajectory if s.get("error")]

    if not error_steps:
        return 1.0  # No errors, perfect

    handled_well = sum(
        1 for step in error_steps
        if step.get("recovery_action")  # Agent tried to recover
    )

    return handled_well / len(error_steps)

def grade_outcome(actual_output):
    """Did the agent accomplish the goal?"""
    if actual_output.get("success"):
        return 1.0
    elif actual_output.get("partial_success"):
        return 0.5
    else:
        return 0.0
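To make the weighting concrete, here is the overall score for the failed-booking example from the introduction: the path looked fine, but the outcome was zero (numbers are illustrative):

```python
grade = {
    "path_similarity": 0.75,
    "reasoning_quality": 1.0,
    "tool_correctness": 1.0,
    "error_handling": 1.0,
    "final_outcome": 0.0,   # booking failed, so the outcome score is 0
}
weights = {
    "path_similarity": 0.3,
    "reasoning_quality": 0.2,
    "tool_correctness": 0.2,
    "error_handling": 0.15,
    "final_outcome": 0.15,
}
overall = sum(grade[k] * weights[k] for k in weights)
# 0.225 + 0.2 + 0.2 + 0.15 + 0.0 = 0.775
```

A 0.775 overall despite a failed booking shows why the outcome score also needs its own hard gate, not just a weight.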

Outcome Evaluation: Automated and Scalable

Outcome evaluation is simpler. You just check: did it work?

Metric 1: Task Completion

def evaluate_task_completion(agent_output, task_specification):
    """
    Did the agent complete the task?
    """
    if task_specification["type"] == "booking":
        return bool(agent_output.get("booking_confirmed"))

    elif task_specification["type"] == "search":
        return len(agent_output.get("results", [])) > 0

    elif task_specification["type"] == "answer":
        expected_answer = task_specification["expected_answer"]
        actual_answer = agent_output.get("answer")
        return actual_answer == expected_answer

    # ... more task types

Metric 2: Time to Completion

def evaluate_efficiency(trajectory, min_expected_steps=3):
    """
    How many steps did the agent take?
    Fewer is better (but not if they're wrong steps).
    """
    actual_steps = len(trajectory)

    if actual_steps <= min_expected_steps:
        return 1.0  # Efficient

    elif actual_steps <= min_expected_steps * 2:
        return 0.8  # Acceptable

    else:
        return 0.5  # Inefficient

Metric 3: Cost

def evaluate_cost(trajectory):
    """
    How many tokens did the agent use?
    How many API calls?
    """
    total_tokens = sum(
        step.get("tokens", 0)
        for step in trajectory
    )

    api_calls = len([
        step for step in trajectory
        if step.get("type") == "tool_call"
    ])

    # Illustrative rates; substitute your provider's actual pricing
    cost_usd = (total_tokens / 1000) * 0.001 + api_calls * 0.01

    if cost_usd < 0.01:
        return 1.0  # Cheap
    elif cost_usd < 0.05:
        return 0.8
    else:
        return 0.5  # Expensive

Building a Test Harness

class AgentTestHarness:
    def __init__(self, agent):
        self.agent = agent
        self.results = []

    def run_evaluation(self, test_cases):
        """Run all test cases and collect results"""
        for test_case in test_cases:
            result = self.evaluate_single(test_case)
            self.results.append(result)

        return self.summarize()

    def evaluate_single(self, test_case):
        """Evaluate one test case"""
        input_text = test_case["input"]

        # Run agent
        actual_trajectory, actual_output = self.agent.run(input_text)

        # Grade trajectory
        trajectory_grade, trajectory_score = grade_trajectory(
            actual_trajectory,
            test_case.get("expected_trajectory"),
            actual_output
        )

        # Grade outcome
        outcome_score = evaluate_task_completion(
            actual_output,
            test_case["task"]
        )

        return {
            "test_name": test_case["name"],
            "input": input_text,
            "trajectory_score": trajectory_score,
            "outcome_score": outcome_score,
            "actual_trajectory": actual_trajectory,
            "passed": trajectory_score > 0.7 and outcome_score > 0.8
        }

    def summarize(self):
        """Return summary statistics"""
        total = len(self.results)
        if total == 0:
            return {"pass_rate": 0.0, "avg_trajectory_score": 0.0,
                    "avg_outcome_score": 0.0, "results": []}

        passed = sum(1 for r in self.results if r["passed"])

        avg_trajectory = sum(
            r["trajectory_score"] for r in self.results
        ) / total

        avg_outcome = sum(
            r["outcome_score"] for r in self.results
        ) / total

        return {
            "pass_rate": passed / total,
            "avg_trajectory_score": avg_trajectory,
            "avg_outcome_score": avg_outcome,
            "results": self.results
        }

# Usage:
harness = AgentTestHarness(my_agent)
results = harness.run_evaluation(test_cases)

print(f"Pass rate: {results['pass_rate']:.1%}")
print(f"Avg trajectory score: {results['avg_trajectory_score']:.2f}")
print(f"Avg outcome score: {results['avg_outcome_score']:.2f}")

Using LLM-as-Judge for Agent Evaluation

Sometimes a trajectory is too subjective to score with rules. Let another LLM judge.

import json

def grade_trajectory_with_llm(actual_trajectory, expected_trajectory, user_input):
    """
    Use an LLM to judge whether a trajectory is reasonable.
    """
    prompt = f"""
    User requested: {user_input}

    Expected approach:
    {json.dumps(expected_trajectory, indent=2)}

    Agent's actual approach:
    {json.dumps(actual_trajectory, indent=2)}

    Is the agent's approach reasonable?
    Consider:
    - Does it achieve the goal?
    - Does it follow logical reasoning?
    - Are there obvious mistakes?
    - Is it reasonably efficient?

    Grade: 1 (bad) to 5 (excellent)
    Explanation: [Your reasoning]

    JSON output:
    {{"grade": 1-5, "explanation": "..."}}
    """

    # Assumes the judge returns clean JSON; in practice, parse defensively
    response = llm.generate(prompt)
    result = json.loads(response)

    return result["grade"] / 5  # Normalize to 0-1

Pros:

  • Captures nuance (human-like judgment)
  • Handles multiple valid approaches
  • Can evaluate reasoning

Cons:

  • Expensive (calls another LLM)
  • Non-deterministic (varies between runs)
  • Can hallucinate (judge might be wrong)
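The non-determinism can be damped by sampling the judge several times and taking the median. A sketch; judge_once stands in for any zero-argument wrapper around your judge call:

```python
import statistics

def judge_with_self_consistency(judge_once, n_samples: int = 5) -> float:
    """Call a non-deterministic judge several times and take the median.

    `judge_once` is any zero-argument callable returning a 0-1 grade
    (e.g. a lambda wrapping grade_trajectory_with_llm). The median damps
    single-run outliers without assuming the judge is well calibrated.
    """
    grades = [judge_once() for _ in range(n_samples)]
    return statistics.median(grades)

# Simulated judge whose grades drift between runs:
_samples = iter([0.8, 0.4, 0.9, 0.8, 0.8])
score = judge_with_self_consistency(lambda: next(_samples), n_samples=5)
```

This multiplies cost by n_samples, so reserve it for evals where a single bad grade would block a release.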

Production Monitoring Signals

Evaluation doesn't end at deployment. Monitor real user interactions.

from datetime import datetime, timedelta

class AgentMonitor:
    def __init__(self):
        self.metrics = {
            "trajectories": [],
            "errors": [],
            "user_satisfaction": []
        }

    def log_trajectory(self, trajectory, outcome, user_feedback):
        """Log every agent run"""
        self.metrics["trajectories"].append({
            "steps": len(trajectory),
            "tools_used": [s["tool"] for s in trajectory],
            "success": outcome["success"],
            "timestamp": datetime.now(),
            "user_satisfaction": user_feedback
        })
        # Also feed the satisfaction series used by compute_satisfaction()
        if user_feedback is not None:
            self.metrics["user_satisfaction"].append({
                "rating": user_feedback,
                "timestamp": datetime.now()
            })

    def log_error(self, error_type, trajectory_step, recovery_action):
        """Log errors"""
        self.metrics["errors"].append({
            "type": error_type,
            "at_step": trajectory_step,
            "recovered": recovery_action is not None,
            "timestamp": datetime.now()
        })

    def get_alerts(self):
        """Check for concerning patterns"""
        alerts = []

        # Alert: Success rate dropping
        recent_success_rate = self.compute_success_rate(
            window_hours=1
        )

        if recent_success_rate < 0.8:
            alerts.append({
                "severity": "high",
                "message": f"Success rate dropped to {recent_success_rate:.1%}"
            })

        # Alert: Increasing error frequency
        total_runs = len(self.metrics["trajectories"])
        error_rate = len(self.metrics["errors"]) / total_runs if total_runs else 0.0

        if error_rate > 0.3:
            alerts.append({
                "severity": "medium",
                "message": f"Error rate: {error_rate:.1%}"
            })

        # Alert: Decreasing user satisfaction
        recent_satisfaction = self.compute_satisfaction(
            window_hours=1
        )

        if recent_satisfaction < 3.5:  # Out of 5
            alerts.append({
                "severity": "high",
                "message": f"User satisfaction dropped to {recent_satisfaction:.1f}/5"
            })

        return alerts

    def compute_success_rate(self, window_hours=24):
        """Success rate in time window"""
        cutoff = datetime.now() - timedelta(hours=window_hours)

        recent = [
            t for t in self.metrics["trajectories"]
            if t["timestamp"] > cutoff
        ]

        if not recent:
            return 1.0

        successful = sum(1 for t in recent if t["success"])
        return successful / len(recent)

    def compute_satisfaction(self, window_hours=24):
        """Average satisfaction in time window"""
        cutoff = datetime.now() - timedelta(hours=window_hours)

        recent = [
            s for s in self.metrics["user_satisfaction"]
            if s["timestamp"] > cutoff
        ]

        if not recent:
            return 5.0

        return sum(s["rating"] for s in recent) / len(recent)

The Eval Loop

Evaluation isn't one-time. It's a loop:

Build agent
    ↓
Define golden trajectories
    ↓
Run evaluation
    ↓
Identify failures
    ↓
Debug failures
    ↓
Update agent or prompt
    ↓
Re-run evaluation
    ↓
Monitor production
    ↓
Collect user feedback
    ↓
Update golden trajectories
    ↓
Repeat

Each iteration improves the agent.
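The "re-run evaluation" step is where teams often regress silently. One way to close the loop in CI is a regression gate over the harness summary. A sketch; the gate itself is my addition, not part of the harness above:

```python
def regression_gate(current: dict, baseline: dict, tolerance: float = 0.02) -> bool:
    """Fail the build if a key eval metric drops more than `tolerance`.

    `current` and `baseline` are summary dicts shaped like the output of
    AgentTestHarness.summarize() (an assumption about your harness).
    """
    for key in ("pass_rate", "avg_trajectory_score", "avg_outcome_score"):
        if current[key] < baseline[key] - tolerance:
            return False
    return True

baseline = {"pass_rate": 0.90, "avg_trajectory_score": 0.85, "avg_outcome_score": 0.88}
current = {"pass_rate": 0.86, "avg_trajectory_score": 0.86, "avg_outcome_score": 0.89}
ok = regression_gate(current, baseline)  # False: pass_rate dropped 4 points
```

Commit the baseline summary alongside the prompt or agent version it came from, so every change is compared against a known-good reference.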

Key Takeaways

Evaluating agents is different from evaluating LLMs:

  1. Trajectory evaluation measures the path (how agent thinks)
  2. Outcome evaluation measures the goal (did it work?)
  3. Use both: trajectory for debugging, outcome for acceptance
  4. Golden trajectories are your ground truth
  5. Levenshtein distance on tool sequences is simple baseline
  6. LLM-as-Judge for subjective trajectory quality
  7. Automate outcome checks when possible
  8. Monitor production for drift
  9. Build a test harness and iterate

Done right, evaluation catches problems before users do.

Your agent isn't ready for production until you can prove it works.


What's Changing Right Now (2025–2026)

Agent evaluation is one of the fastest-moving areas in AI. Here's what's happening and why it matters for practitioners.

LLM-as-Judge is Becoming the Default

Six months ago, most teams graded agents by hand or used rigid string-matching. Today, LLM-as-Judge is the industry standard for any task too nuanced for rule-based checking. Anthropic's Claude, OpenAI's GPT-4o, and Google's Gemini are all being used as evaluators for other LLMs' outputs.

The practical implication: your eval stack should have a judge_model that's separate from your agent_model. Run a cheaper judge for routine checks; for critical evals, use a judge at least as strong as the model you're evaluating.
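That separation is easy to enforce in configuration. A minimal sketch; the model names are placeholders, not recommendations:

```python
from dataclasses import dataclass

@dataclass
class EvalConfig:
    """Keep the judge model separate from the agent under test."""
    agent_model: str = "agent-model-large"
    judge_model: str = "judge-model-small"        # cheaper judge for routine evals
    critical_judge_model: str = "judge-model-xl"  # stronger judge for critical evals

    def judge_for(self, critical: bool = False) -> str:
        return self.critical_judge_model if critical else self.judge_model

cfg = EvalConfig()
routine_judge = cfg.judge_for()               # cheap model
release_judge = cfg.judge_for(critical=True)  # strong model
```

Making the choice explicit in config also makes it auditable: you can see which judge graded which eval run.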

Agent Benchmarks Are Maturing

Three benchmarks now matter for real-world agent work:

  • SWE-Bench: Software engineering tasks (fix GitHub issues). State-of-the-art went from 3% (2023) to 55%+ (2025 Claude/GPT-4o). If your agent does coding, test it here.
  • GAIA: General assistant tasks requiring real-world tool use (web search, file manipulation, calculator). Humans score ~92%. Top models ~75%. Gap still significant.
  • AgentBench: Multi-domain agent tasks. More practical than academic benchmarks.

What this means for you: use SWE-Bench or a subset of GAIA to anchor your internal benchmarks to external reference points.

The Shift to Computer Use Agents

In 2024, Anthropic shipped Claude's "computer use" capability — the agent can take screenshots, move the mouse, and type. By 2025, browser-native agents are productionized at companies like Cognition (Devin), Imbue, and Adept.

Evaluating computer use agents is fundamentally different: you evaluate screenshots, not text. Teams are building screenshot-diffing tools and VLM-based judges that compare expected vs. actual UI state.

Reliability > Capability

The emerging consensus in 2025: the bottleneck is not capability, it's reliability. A 95%-accurate agent fails 1 in 20 tasks — unacceptable for production. Teams are now optimizing specifically for:

  • P99 reliability: not "average success rate" but "what happens on the worst 1% of tasks"
  • Graceful degradation: agents that surface uncertainty instead of failing silently
  • Human-in-the-loop triggers: automatic escalation when confidence drops below threshold

A minimal sketch of the human-in-the-loop trigger (fields like "confidence" are assumptions about your agent's output):

# Pattern: confidence-gated agent
def run_with_confidence_gate(task: str, threshold: float = 0.85) -> dict:
    result = agent.run(task)

    if result["confidence"] < threshold:
        # Escalate to human instead of returning wrong answer
        return {
            "status": "needs_review",
            "partial_result": result["output"],
            "reason": result["uncertainty_reason"],
        }

    return {"status": "complete", "result": result["output"]}

This pattern — fail loudly rather than fail silently — is the difference between a demo and a production system.
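Relatedly, the "P99 reliability" idea from the list above can be sketched as a tail metric: run each task repeatedly, then report the success rate over the worst slice of tasks rather than the average:

```python
def tail_success_rate(per_task_success, worst_fraction: float = 0.01) -> float:
    """Success rate over the worst `worst_fraction` of tasks.

    `per_task_success` maps each task to its success rate across
    repeated runs. The average hides unreliable tasks; the tail exposes them.
    """
    rates = sorted(per_task_success.values())
    k = max(1, int(len(rates) * worst_fraction))
    worst = rates[:k]
    return sum(worst) / len(worst)

# 99 perfectly reliable tasks and one flaky one:
per_task = {f"task_{i}": 1.0 for i in range(99)}
per_task["task_flaky"] = 0.2
tail = tail_success_rate(per_task)  # 0.2, even though the average is ~0.99
```

The average says the agent is nearly perfect; the tail says one task class fails four times out of five. Production users live in the tail.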