Why agentic eval loops fail
When the grader and the agent share a model family, you measure preferences, not performance. A field guide to evals that catch real regressions.
The first eval loop you ship will tell you your agent is great. The second one — built by someone who wasn't on the original team — will tell you it's mediocre. The third one, with hand-graded golden examples, will tell you what's actually happening.
This is a field guide to building the third kind on the first try.
Key takeaways
- Grading an agent with the same model family tends to inflate reported scores — sometimes by double-digit points.
- Anchor every eval in 5–10 hand-graded golden examples before you scale to thousands.
- Stack deterministic checks (regex, schema) underneath LLM-as-judge for high-stakes flows.
- An eval that doesn't fail on a known-bad output is not an eval — it's decoration.
The problem with confident scores
When you grade an agent with the same model that powers it, you don't measure quality — you measure preference. The model has opinions about its own outputs, and those opinions get baked into the score.
The pattern is depressingly common: weeks of prompt tuning push the reported number up. Swap the grader to a different model family and the number falls back. The gap between the two scores is the bias you've been baking in — and you only see it when you measure it.
What a good trace looks like
Before we talk about scoring, look at what we're scoring. An agent run is a sequence of decisions, tool calls, and observations — not a single output. Your eval needs to grade the whole trace, not just the final answer.
Trace
User asks: 'Find the latest Anthropic safety paper.'
- Thought: I need to search recent Anthropic publications.
- Tool call: web_search (480ms), searching for the paper.
  Input: { "query": "Anthropic safety research 2026" }
- Observation: Found 12 results; most recent is from April 2026.
  Output: [{"title":"Constitutional AI v3","date":"2026-04","url":"..."}, ...]
- Tool call: web_fetch (1.24s), fetching the paper to confirm authorship.
  Input: { "url": "https://anthropic.com/research/constitutional-ai-v3" }
- Final (320ms): The latest paper is "Constitutional AI v3", published April 2026 by the Anthropic Alignment team.
A bad eval scores only the final string. A good eval scores: did the agent search before answering? Did it verify the source? Did it return a date? An eval is the contract you write between you and your future agent.
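Those trace-level questions can be expressed as deterministic checks. A minimal sketch, assuming a hypothetical `Step` shape rather than a real SDK type; adapt the field names to whatever your agent's trace format actually is:

```typescript
// Assumed trace shape -- not a real SDK type.
type Step =
  | { kind: 'tool_call'; tool: string; input: unknown }
  | { kind: 'final'; text: string };

type Trace = Step[];

// Did the agent search before producing its final answer?
function searchedBeforeAnswering(trace: Trace): boolean {
  const finalIdx = trace.findIndex((s) => s.kind === 'final');
  return trace
    .slice(0, finalIdx === -1 ? trace.length : finalIdx)
    .some((s) => s.kind === 'tool_call' && s.tool === 'web_search');
}

// Did the agent fetch a source to verify it?
function verifiedSource(trace: Trace): boolean {
  return trace.some((s) => s.kind === 'tool_call' && s.tool === 'web_fetch');
}

// Does the final answer contain a year-like date?
function answerHasDate(trace: Trace): boolean {
  const final = trace.find(
    (s): s is Extract<Step, { kind: 'final' }> => s.kind === 'final',
  );
  return final !== undefined && /\b20\d{2}\b/.test(final.text);
}
```

Each check is a few milliseconds and scores the trajectory, not just the final string.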
Three approaches to grading
There are three families of grader. Each has a place; mixing them is usually best.
LLM-as-judge
Use a different model to score the agent's output against a rubric. Flexible, expensive, biased toward verbose answers.

```typescript
const score = await grader.generate({
  model: 'gpt-4o',
  messages: [{
    role: 'user',
    content: `Rubric: ${rubric}\nAgent output: ${output}\nReturn 0-100.`,
  }],
});
```

Deterministic
Schema validation, regex, exact match. Cheap, fast, no opinions. Misses nuance.

```typescript
const valid = JSON_SCHEMA.safeParse(output).success
  && output.includes(expectedDate)
  && output.length < 500;
```

Human-graded
Five to ten gold-standard examples scored by a domain expert. Slow, expensive, irreplaceable. Anchor every other eval to these.

```typescript
const golden = readJSON('evals/golden-set.json');
const score = compareAgainst(output, golden[caseId]);
```

How they compare
| Method | Cost | Speed | Bias risk | Best for |
|---|---|---|---|---|
| LLM-as-judge | Per-call | Slow (seconds) | High (same-family) | Open-ended quality |
| Deterministic | Free | Fast (ms) | None | Schema, format, presence |
| Human-graded | Engineer time | Slow (minutes) | Low (deliberate) | Anchor + spot-check |
The right answer is rarely "pick one." It's almost always: deterministic checks first, LLM-as-judge for what's left, anchored to a hand-graded golden set you trust.
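That stacking order can be sketched as a single scoring function. `Judge` is a stand-in for your LLM-as-judge call, and the specific gates (the 500-character bound, the expected date) are illustrative, not recommendations:

```typescript
// Stand-in for an LLM-as-judge call returning 0-100.
type Judge = (output: string) => Promise<number>;

async function layeredScore(
  output: string,
  expectedDate: string,
  judge: Judge,
): Promise<number> {
  // Layer 1: deterministic gates. A hard fail short-circuits to 0,
  // so you never pay for a judge call on malformed output.
  const deterministicPass =
    output.length < 500 && output.includes(expectedDate);
  if (!deterministicPass) return 0;

  // Layer 2: LLM-as-judge (a different model family than the agent)
  // scores only the outputs that cleared the cheap checks.
  return judge(output);
}
```

The short-circuit is the point: the expensive, biased grader only ever sees outputs the free, unbiased checks have already vetted.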
Naive vs production loops
The shape of the loop matters as much as the grader.
The naive version:

```typescript
const result = await agent.run(input);
const score = await grader.score(result);
return score;
```

No retries, no timeout, no logging, no diff against golden. You'll never reproduce a regression.

The production version:

```typescript
const result = await withRetry(
  () => agent.run(input),
  { timeout: 30_000, maxAttempts: 3 }
);
const score = await grader.score(result);
const goldenScore = compareAgainst(result, golden[caseId]);
log.info({ caseId, score, goldenScore, trace: result.trace });
return { score, goldenScore };
```

Idempotent, observable, comparable to a known baseline.
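The `withRetry` helper isn't from a library; a minimal sketch, assuming a promise-returning runner:

```typescript
async function withRetry<T>(
  fn: () => Promise<T>,
  opts: { timeout: number; maxAttempts: number },
): Promise<T> {
  let lastErr: unknown;
  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    let timer: ReturnType<typeof setTimeout> | undefined;
    try {
      // Race the attempt against a timeout so a hung agent run fails fast
      // instead of stalling the whole eval batch.
      return await Promise.race([
        fn(),
        new Promise<never>((_, reject) => {
          timer = setTimeout(() => reject(new Error('timeout')), opts.timeout);
        }),
      ]);
    } catch (err) {
      lastErr = err; // remember the last failure; retry if attempts remain
    } finally {
      if (timer !== undefined) clearTimeout(timer);
    }
  }
  throw lastErr;
}
```

Note the timeout applies per attempt, so worst case is `timeout * maxAttempts`; budget your CI gate accordingly.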
Implementing it
Build the golden set first
Pick 5–10 inputs whose ideal outputs you can write by hand. Store them in evals/golden/. Re-grade them yourself every quarter: your standards drift.

Layer deterministic checks
Schema validation, length bounds, presence of required fields, regex for forbidden content. These run in milliseconds and catch the embarrassing failures.
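A sketch of such checks; the forbidden patterns and the 500-character bound are illustrative examples, not a recommended list:

```typescript
// Example forbidden-content patterns -- leaked credentials and
// boilerplate refusals. Extend for your own domain.
const FORBIDDEN = [/api[_-]?key/i, /as an ai language model/i];

function passesContentChecks(output: string): boolean {
  return (
    output.length < 500 &&
    !FORBIDDEN.some((re) => re.test(output))
  );
}
```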
Add LLM-as-judge for what's left
Pick a different model family than your agent. Score against an explicit rubric, not "is this good?" — the rubric reveals disagreement faster than the score does.
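An explicit rubric might look like this sketch (the criteria and point allocation are examples), paired with a defensive parse of the judge's numeric reply:

```typescript
// Example rubric: itemized criteria instead of "is this good?".
// Disagreement over a specific line item is actionable; a vague
// score is not.
const RUBRIC = `
Score 0-100. Award points for:
- 40: cites the correct paper title and date
- 30: links a primary source
- 20: searched before answering (check the trace)
- 10: answer is under 100 words
`;

// Judges return prose around the number more often than you'd like.
// Extract the first 1-3 digit run and reject out-of-range values.
function parseScore(reply: string): number | null {
  const m = reply.match(/\b(\d{1,3})\b/);
  if (!m) return null;
  const n = Number(m[1]);
  return n >= 0 && n <= 100 ? n : null;
}
```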
Run on every PR
A 50-case eval that takes 90 seconds is a CI gate. A 5,000-case eval that takes an hour is a once-a-day batch. Both are useful for different things.
Caveats and edge cases
What if my agent has no clear 'correct' answer?
Then your eval needs comparative judgments, not absolute scores. Use pairwise comparisons ("which is better, A or B?") with a different-family judge. The signal is in the win rate over many pairs, not any single decision.
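A sketch of that pairwise loop, with `judgePair` standing in for the different-family judge call; the order randomization guards against the judge's position bias:

```typescript
// Stand-in for an LLM call: "which is better, A or B?"
type PairJudge = (a: string, b: string) => Promise<'A' | 'B'>;

async function winRate(
  candidates: string[],
  baselines: string[],
  judgePair: PairJudge,
): Promise<number> {
  let wins = 0;
  const total = Math.min(candidates.length, baselines.length);
  for (let i = 0; i < total; i++) {
    // Randomize presentation order so a judge that favors position A
    // doesn't systematically favor one side.
    const flip = Math.random() < 0.5;
    const verdict = flip
      ? await judgePair(baselines[i], candidates[i])
      : await judgePair(candidates[i], baselines[i]);
    const candidateWon = flip ? verdict === 'B' : verdict === 'A';
    if (candidateWon) wins++;
  }
  return total === 0 ? 0 : wins / total;
}
```

Report the win rate with its sample size; 60% over 10 pairs is noise, 60% over 500 pairs is a result.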
My golden set goes stale. What now?
That's a feature, not a bug. Schedule a quarterly review. If the golden answers no longer match your bar, your bar moved — write down what changed and update them. Don't quietly let the score drift.
Can I use the same model for grading if I prompt it differently?
Marginally better than nothing, definitively worse than a different family. Same-model graders share priors about formatting, verbosity, and refusals. The simplest robustness check: re-grade 50 cases with a foreign-family model. If scores drift >5 points, you have bias.
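That robustness check is a dozen lines. A sketch, with both graders as stand-ins for the LLM calls:

```typescript
// Stand-in for an LLM grader call returning 0-100.
type Grader = (output: string) => Promise<number>;

// Mean score gap between the same-family and foreign-family grader
// over the same outputs. Positive drift means the same-family grader
// is scoring higher.
async function meanDrift(
  outputs: string[],
  sameFamily: Grader,
  foreignFamily: Grader,
): Promise<number> {
  let drift = 0;
  for (const o of outputs) {
    drift += (await sameFamily(o)) - (await foreignFamily(o));
  }
  return outputs.length === 0 ? 0 : drift / outputs.length;
}
```

If the mean drift exceeds your threshold (5 points in the heuristic above), the same-family grader is inflating your scores.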