Why agentic eval loops fail
When the grader and the agent share a model family, you measure preferences, not performance. A field guide to evals that catch real regressions.
The first eval loop you ship will tell you your agent is great. The second one — built by someone who wasn't on the original team — will tell you it's mediocre. The third one, with hand-graded golden examples, will tell you what's actually happening.
This is a field guide to building the third kind on the first try.
Key takeaways
- Grading an agent with the same model family tends to inflate reported scores — sometimes by double-digit points.
- Anchor every eval in 5–10 hand-graded golden examples before you scale to thousands.
- Stack deterministic checks (regex, schema) underneath LLM-as-judge for high-stakes flows.
- An eval that doesn't fail on a known-bad output is not an eval — it's decoration.
The problem with confident scores
When you grade an agent with the same model that powers it, you don't measure quality — you measure preference. The model has opinions about its own outputs, and those opinions get baked into the score.
The pattern is depressingly common: weeks of prompt tuning push the reported number up. Swap the grader to a different model family and the number falls back. The gap between the two scores is the bias you've been baking in — and you only see it when you measure it.
What a good trace looks like
Before we talk about scoring, look at what we're scoring. An agent run is a sequence of decisions, tool calls, and observations — not a single output. Your eval needs to grade the whole trace, not just the final answer.
Trace
User asks: 'Find the latest Anthropic safety paper.'
- Thought: I need to search recent Anthropic publications.
- Tool call: web_search (480ms), searching for the paper.
  Input: { "query": "Anthropic safety research 2026" }
- Observation: Found 12 results; most recent is from April 2026.
  Output: [{"title":"Constitutional AI v3","date":"2026-04","url":"..."}, ...]
- Tool call: web_fetch (1.24s), fetching the paper to confirm authorship.
  Input: { "url": "https://anthropic.com/research/constitutional-ai-v3" }
- Final (320ms): The latest paper is "Constitutional AI v3", published April 2026 by the Anthropic Alignment team.
A bad eval scores only the final string. A good eval scores: did the agent search before answering? Did it verify the source? Did it return a date? An eval is the contract you write between you and your future agent.
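Those trace-level questions can be expressed as deterministic checks. A minimal sketch, assuming a hypothetical `Step` shape rather than a real SDK type; adapt the field names to whatever your agent's trace format actually is:

```typescript
// Assumed trace shape -- not a real SDK type.
type Step =
  | { kind: 'tool_call'; tool: string; input: unknown }
  | { kind: 'final'; text: string };

type Trace = Step[];

// Did the agent search before producing its final answer?
function searchedBeforeAnswering(trace: Trace): boolean {
  const finalIdx = trace.findIndex((s) => s.kind === 'final');
  return trace
    .slice(0, finalIdx === -1 ? trace.length : finalIdx)
    .some((s) => s.kind === 'tool_call' && s.tool === 'web_search');
}

// Did the agent fetch a source to verify it?
function verifiedSource(trace: Trace): boolean {
  return trace.some((s) => s.kind === 'tool_call' && s.tool === 'web_fetch');
}

// Does the final answer contain a year-like date?
function answerHasDate(trace: Trace): boolean {
  const final = trace.find(
    (s): s is Extract<Step, { kind: 'final' }> => s.kind === 'final',
  );
  return final !== undefined && /\b20\d{2}\b/.test(final.text);
}
```

Each check is a few milliseconds and scores the trajectory, not just the final string.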
Three approaches to grading
There are three families of grader. Each has a place; mixing them is usually best.
LLM-as-judge
Use a different model to score the agent's output against a rubric. Flexible, expensive, biased toward verbose answers.

```typescript
const score = await grader.generate({
  model: 'gpt-4o',
  messages: [{
    role: 'user',
    content: `Rubric: ${rubric}\nAgent output: ${output}\nReturn 0-100.`,
  }],
});
```

Deterministic
Schema validation, regex, exact match. Cheap, fast, no opinions. Misses nuance.

```typescript
const valid = JSON_SCHEMA.safeParse(output).success
  && output.includes(expectedDate)
  && output.length < 500;
```

Human-graded
Five to ten gold-standard examples scored by a domain expert. Slow, expensive, irreplaceable. Anchor every other eval to these.

```typescript
const golden = readJSON('evals/golden-set.json');
const score = compareAgainst(output, golden[caseId]);
```

How they compare
| Method | Cost | Speed | Bias risk | Best for |
|---|---|---|---|---|
| LLM-as-judge | Per-call | Slow (seconds) | High (same-family) | Open-ended quality |
| Deterministic | Free | Fast (ms) | None | Schema, format, presence |
| Human-graded | Engineer time | Slow (minutes) | Low (deliberate) | Anchor + spot-check |
The right answer is rarely "pick one." It's almost always: deterministic checks first, LLM-as-judge for what's left, anchored to a hand-graded golden set you trust.
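That stacking order can be sketched as a single scoring function. `Judge` is a stand-in for your LLM-as-judge call, and the specific gates (the 500-character bound, the expected date) are illustrative, not recommendations:

```typescript
// Stand-in for an LLM-as-judge call returning 0-100.
type Judge = (output: string) => Promise<number>;

async function layeredScore(
  output: string,
  expectedDate: string,
  judge: Judge,
): Promise<number> {
  // Layer 1: deterministic gates. A hard fail short-circuits to 0,
  // so you never pay for a judge call on malformed output.
  const deterministicPass =
    output.length < 500 && output.includes(expectedDate);
  if (!deterministicPass) return 0;

  // Layer 2: LLM-as-judge (a different model family than the agent)
  // scores only the outputs that cleared the cheap checks.
  return judge(output);
}
```

The short-circuit is the point: the expensive, biased grader only ever sees outputs the free, unbiased checks have already vetted.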
Naive vs production loops
The shape of the loop matters as much as the grader.
The naive version:

```typescript
const result = await agent.run(input);
const score = await grader.score(result);
return score;
```

No retries, no timeout, no logging, no diff against golden. You'll never reproduce a regression.

The production version:

```typescript
const result = await withRetry(
  () => agent.run(input),
  { timeout: 30_000, maxAttempts: 3 }
);
const score = await grader.score(result);
const goldenScore = compareAgainst(result, golden[caseId]);
log.info({ caseId, score, goldenScore, trace: result.trace });
return { score, goldenScore };
```

Idempotent, observable, comparable to a known baseline.
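The `withRetry` helper isn't from a library; a minimal sketch, assuming a promise-returning runner:

```typescript
async function withRetry<T>(
  fn: () => Promise<T>,
  opts: { timeout: number; maxAttempts: number },
): Promise<T> {
  let lastErr: unknown;
  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    let timer: ReturnType<typeof setTimeout> | undefined;
    try {
      // Race the attempt against a timeout so a hung agent run fails fast
      // instead of stalling the whole eval batch.
      return await Promise.race([
        fn(),
        new Promise<never>((_, reject) => {
          timer = setTimeout(() => reject(new Error('timeout')), opts.timeout);
        }),
      ]);
    } catch (err) {
      lastErr = err; // remember the last failure; retry if attempts remain
    } finally {
      if (timer !== undefined) clearTimeout(timer);
    }
  }
  throw lastErr;
}
```

Note the timeout applies per attempt, so worst case is `timeout * maxAttempts`; budget your CI gate accordingly.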
Implementing it
Build the golden set first
Pick 5–10 inputs whose ideal outputs you can write by hand. Store them in evals/golden/. Re-grade them yourself every quarter: your standards drift.

Layer deterministic checks
Schema validation, length bounds, presence of required fields, regex for forbidden content. These run in milliseconds and catch the embarrassing failures.
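A sketch of such checks; the forbidden patterns and the 500-character bound are illustrative examples, not a recommended list:

```typescript
// Example forbidden-content patterns -- leaked credentials and
// boilerplate refusals. Extend for your own domain.
const FORBIDDEN = [/api[_-]?key/i, /as an ai language model/i];

function passesContentChecks(output: string): boolean {
  return (
    output.length < 500 &&
    !FORBIDDEN.some((re) => re.test(output))
  );
}
```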
Add LLM-as-judge for what's left
Pick a different model family than your agent. Score against an explicit rubric, not "is this good?" — the rubric reveals disagreement faster than the score does.
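An explicit rubric might look like this sketch (the criteria and point allocation are examples), paired with a defensive parse of the judge's numeric reply:

```typescript
// Example rubric: itemized criteria instead of "is this good?".
// Disagreement over a specific line item is actionable; a vague
// score is not.
const RUBRIC = `
Score 0-100. Award points for:
- 40: cites the correct paper title and date
- 30: links a primary source
- 20: searched before answering (check the trace)
- 10: answer is under 100 words
`;

// Judges return prose around the number more often than you'd like.
// Extract the first 1-3 digit run and reject out-of-range values.
function parseScore(reply: string): number | null {
  const m = reply.match(/\b(\d{1,3})\b/);
  if (!m) return null;
  const n = Number(m[1]);
  return n >= 0 && n <= 100 ? n : null;
}
```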
Run on every PR
A 50-case eval that takes 90 seconds is a CI gate. A 5,000-case eval that takes an hour is a once-a-day batch. Both are useful for different things.
Caveats and edge cases
What if my agent has no clear 'correct' answer?
Then your eval needs comparative judgments, not absolute scores. Use pairwise comparisons ("which is better, A or B?") with a different-family judge. The signal is in the win rate over many pairs, not any single decision.
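A sketch of that pairwise loop, with `judgePair` standing in for the different-family judge call; the order randomization guards against the judge's position bias:

```typescript
// Stand-in for an LLM call: "which is better, A or B?"
type PairJudge = (a: string, b: string) => Promise<'A' | 'B'>;

async function winRate(
  candidates: string[],
  baselines: string[],
  judgePair: PairJudge,
): Promise<number> {
  let wins = 0;
  const total = Math.min(candidates.length, baselines.length);
  for (let i = 0; i < total; i++) {
    // Randomize presentation order so a judge that favors position A
    // doesn't systematically favor one side.
    const flip = Math.random() < 0.5;
    const verdict = flip
      ? await judgePair(baselines[i], candidates[i])
      : await judgePair(candidates[i], baselines[i]);
    const candidateWon = flip ? verdict === 'B' : verdict === 'A';
    if (candidateWon) wins++;
  }
  return total === 0 ? 0 : wins / total;
}
```

Report the win rate with its sample size; 60% over 10 pairs is noise, 60% over 500 pairs is a result.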
My golden set goes stale. What now?
That's a feature, not a bug. Schedule a quarterly review. If the golden answers no longer match your bar, your bar moved — write down what changed and update them. Don't quietly let the score drift.
Can I use the same model for grading if I prompt it differently?
Marginally better than nothing, definitively worse than a different family. Same-model graders share priors about formatting, verbosity, and refusals. The simplest robustness check: re-grade 50 cases with a foreign-family model. If scores drift >5 points, you have bias.
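That robustness check is a dozen lines. A sketch, with both graders as stand-ins for the LLM calls:

```typescript
// Stand-in for an LLM grader call returning 0-100.
type Grader = (output: string) => Promise<number>;

// Mean score gap between the same-family and foreign-family grader
// over the same outputs. Positive drift means the same-family grader
// is scoring higher.
async function meanDrift(
  outputs: string[],
  sameFamily: Grader,
  foreignFamily: Grader,
): Promise<number> {
  let drift = 0;
  for (const o of outputs) {
    drift += (await sameFamily(o)) - (await foreignFamily(o));
  }
  return outputs.length === 0 ? 0 : drift / outputs.length;
}
```

If the mean drift exceeds your threshold (5 points in the heuristic above), the same-family grader is inflating your scores.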