
Why agentic eval loops fail

When the grader and the agent share a model family, you measure preferences, not performance. A field guide to evals that catch real regressions.

By Echo Theory Labs
evals · agents · measurement · production-ai

The first eval loop you ship will tell you your agent is great. The second one — built by someone who wasn't on the original team — will tell you it's mediocre. The third one, with hand-graded golden examples, will tell you what's actually happening.

This is a field guide to building the third kind on the first try.

Key takeaways

  • Grading an agent with the same model family tends to inflate reported scores — sometimes by double-digit points.
  • Anchor every eval in 5–10 hand-graded golden examples before you scale to thousands.
  • Stack deterministic checks (regex, schema) underneath LLM-as-judge for high-stakes flows.
  • An eval that doesn't fail on a known-bad output is not an eval — it's decoration.

The problem with confident scores

When you grade an agent with the same model that powers it, you don't measure quality — you measure preference. The model has opinions about its own outputs, and those opinions get baked into the score.

The pattern is depressingly common: weeks of prompt tuning push the reported number up. Swap the grader to a different model family and the number falls back. The gap between the two scores is the bias you've been baking in — and you only see it when you measure it.

What a good trace looks like

Before we talk about scoring, look at what we're scoring. An agent run is a sequence of decisions, tool calls, and observations — not a single output. Your eval needs to grade the whole trace, not just the final answer.

Trace

User asks: "Find the latest Anthropic safety paper."

  1. Thought: "I need to search recent Anthropic publications."
  2. Tool call: web_search (480ms), searching for the paper.
     Input: { "query": "Anthropic safety research 2026" }
  3. Observation: found 12 results; most recent is from April 2026.
     Output: [{"title":"Constitutional AI v3","date":"2026-04","url":"..."}, ...]
  4. Tool call: web_fetch (1.24s), fetching the paper to confirm authorship.
     Input: { "url": "https://anthropic.com/research/constitutional-ai-v3" }
  5. Final (320ms): The latest paper is "Constitutional AI v3", published April 2026 by the Anthropic Alignment team.

A bad eval scores only the final string. A good eval scores the whole run: did the agent search before answering? Did it verify the source? Did it return a date? An eval is the contract between you and your future agent.
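Those trace-level questions can themselves be deterministic checks. A minimal sketch, assuming a trace shaped like the example above; the Step type and tool names are illustrative, not a real schema:

type Step = {
  type: 'thought' | 'tool_call' | 'observation' | 'final';
  tool?: string;
  text: string;
};

// Returns failure reasons; an empty array means the trace passes.
function checkTrace(trace: Step[]): string[] {
  const failures: string[] = [];
  const finalIdx = trace.findIndex((s) => s.type === 'final');
  if (finalIdx === -1) return ['no final answer'];
  const beforeFinal = trace.slice(0, finalIdx);
  // Did the agent search before answering?
  if (!beforeFinal.some((s) => s.type === 'tool_call' && s.tool === 'web_search')) {
    failures.push('answered without searching');
  }
  // Did it verify the source with a fetch?
  if (!beforeFinal.some((s) => s.type === 'tool_call' && s.tool === 'web_fetch')) {
    failures.push('never fetched the source to verify it');
  }
  // Did the final answer include a date?
  if (!/\b(19|20)\d{2}\b/.test(trace[finalIdx].text)) {
    failures.push('final answer contains no year');
  }
  return failures;
}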

Three approaches to grading

There are three families of grader. Each has a place; mixing them is usually best.

LLM-as-judge: use a different model to score the agent's output against a rubric. Flexible, expensive, biased toward verbose answers.

// Ask the judge model (a different family than the agent) for a rubric score.
// `grader` is a generic chat client; parse its reply into a number downstream.
const score = await grader.generate({
  model: 'gpt-4o',
  messages: [{
    role: 'user',
    content: `Rubric: ${rubric}\nAgent output: ${output}\nReturn 0-100.`,
  }],
});

How they compare

| Method        | Cost          | Speed          | Bias risk          | Best for                  |
| ------------- | ------------- | -------------- | ------------------ | ------------------------- |
| LLM-as-judge  | Per-call      | Slow (seconds) | High (same-family) | Open-ended quality        |
| Deterministic | Free          | Fast (ms)      | None               | Schema, format, presence  |
| Human-graded  | Engineer time | Slow (minutes) | Low (deliberate)   | Anchor + spot-check       |

The right answer is rarely "pick one." It's almost always: deterministic checks first, LLM-as-judge for what's left, anchored to a hand-graded golden set you trust.
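In code, that stacking might look like the sketch below: deterministic checks run first and gate the judge, so a hard failure never costs a model call. Every name here is illustrative; the golden-set anchor appears in the production loop further down.

type Check = (output: string) => string | null; // null = pass, string = failure reason
type Judge = (output: string) => Promise<number>; // rubric score, 0-100

async function layeredScore(output: string, checks: Check[], judge: Judge) {
  // Layer 1: deterministic checks, free and instant.
  const failures = checks
    .map((check) => check(output))
    .filter((reason): reason is string => reason !== null);
  if (failures.length > 0) return { score: 0, failures }; // hard fail: skip the judge
  // Layer 2: LLM-as-judge, only for outputs that survive layer 1.
  return { score: await judge(output), failures };
}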

Naive vs production loops

The shape of the loop matters as much as the grader.

Naive
const result = await agent.run(input);
const score = await grader.score(result);
return score;

No retries, no timeout, no logging, no diff against golden. You'll never reproduce a regression.

Production
const result = await withRetry(
  () => agent.run(input),
  { timeout: 30_000, maxAttempts: 3 }
);
const score = await grader.score(result);
const goldenScore = compareAgainst(result, golden[caseId]);
log.info({ caseId, score, goldenScore, trace: result.trace });
return { score, goldenScore };

Idempotent, observable, comparable to a known baseline.
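The withRetry wrapper is doing the quiet work in that loop. One possible shape for it, with the timeout race and retry policy as assumptions rather than a library API:

async function withRetry<T>(
  fn: () => Promise<T>,
  opts: { timeout: number; maxAttempts: number }
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    try {
      // Race the run against a timer so a hung agent can't stall the suite.
      return await Promise.race([
        fn(),
        new Promise<never>((_, reject) =>
          setTimeout(
            () => reject(new Error(`timed out after ${opts.timeout}ms`)),
            opts.timeout
          )
        ),
      ]);
    } catch (err) {
      lastError = err; // retry on timeout or agent error alike
    }
  }
  throw lastError;
}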

Implementing it

  1. Build the golden set first

    Pick 5–10 inputs whose ideal outputs you can write by hand. Store them in evals/golden/. Re-grade them yourself every quarter — your standards drift.

  2. Layer deterministic checks

    Schema validation, length bounds, presence of required fields, regex for forbidden content. These run in milliseconds and catch the embarrassing failures; see the sketch after this list.

  3. Add LLM-as-judge for what's left

    Pick a different model family than your agent. Score against an explicit rubric, not "is this good?" — the rubric reveals disagreement faster than the score does.

  4. Run on every PR

    A 50-case eval that takes 90 seconds is a CI gate. A 5,000-case eval that takes an hour is a once-a-day batch. Both are useful for different things.
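Step 2 in code: a sketch using zod for the schema layer. The schema, length bound, and forbidden patterns are illustrative, not a real ruleset.

import { z } from 'zod';

// Shape we expect the agent's final answer to parse into.
const AnswerSchema = z.object({
  title: z.string().min(1),
  date: z.string().regex(/^\d{4}-\d{2}$/), // e.g. "2026-04"
  url: z.string().url(),
});

// Phrases that should never appear in a shipped answer.
const FORBIDDEN = [/as an ai language model/i, /lorem ipsum/i];

function deterministicChecks(raw: string): string[] {
  const failures: string[] = [];
  if (raw.length > 2_000) failures.push('exceeds length bound');
  for (const pattern of FORBIDDEN) {
    if (pattern.test(raw)) failures.push(`forbidden content: ${pattern}`);
  }
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return [...failures, 'not valid JSON'];
  }
  if (!AnswerSchema.safeParse(parsed).success) failures.push('schema mismatch');
  return failures; // empty = pass; the whole function runs in milliseconds
}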

Caveats and edge cases

What if my agent has no clear 'correct' answer?

Then your eval needs comparative judgments, not absolute scores. Use pairwise comparisons ("which is better, A or B?") with a different-family judge. The signal is in the win rate over many pairs, not any single decision.
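A minimal win-rate loop, assuming a judgePair helper that asks a different-family model to pick 'A' or 'B' (the helper and its prompt are hypothetical):

type PairJudge = (a: string, b: string) => Promise<'A' | 'B'>;

// Win rate for candidate A over many pairs; no single verdict matters much.
async function winRate(
  pairs: Array<{ a: string; b: string }>,
  judgePair: PairJudge
): Promise<number> {
  let winsForA = 0;
  for (const { a, b } of pairs) {
    // Randomize presentation order so the judge's position bias averages out.
    const flipped = Math.random() < 0.5;
    const verdict = await judgePair(flipped ? b : a, flipped ? a : b);
    if ((verdict === 'A') !== flipped) winsForA++;
  }
  return winsForA / pairs.length;
}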

My golden set goes stale. What now?

That's a feature, not a bug. Schedule a quarterly review. If the golden answers no longer match your bar, your bar moved — write down what changed and update them. Don't quietly let the score drift.

Can I use the same model for grading if I prompt it differently?

Marginally better than nothing, definitively worse than a different family. Same-model graders share priors about formatting, verbosity, and refusals. The simplest robustness check: re-grade 50 cases with a foreign-family model. If scores drift >5 points, you have bias.
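That check, sketched. The judge signature is assumed; the 50-case sample and 5-point threshold are the ones from the answer above.

type Judge = (output: string) => Promise<number>; // rubric score, 0-100

// Re-grade a sample with a foreign-family judge and compare mean scores.
async function sameFamilyBiasCheck(
  outputs: string[],
  homeJudge: Judge,    // same family as the agent
  foreignJudge: Judge  // different family
) {
  const sample = outputs.slice(0, 50);
  let home = 0;
  let foreign = 0;
  for (const output of sample) {
    home += await homeJudge(output);
    foreign += await foreignJudge(output);
  }
  const drift = (home - foreign) / sample.length; // mean score gap
  return { drift, biased: Math.abs(drift) > 5 };
}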

Visualizing the full loop

Agent run → Deterministic checks → LLM-as-judge → Golden set diff → Score + Trace (per case)
An eval loop with deterministic checks, LLM-as-judge, and a hand-graded golden anchor. The golden set is the only piece a human touches; everything else runs on every PR.