Most of the time, you experience your system from the inside: you see the code, the prompts, and the logs. You know which tricks you used to steer the model, which caches you added to shave a few hundred milliseconds, which hacks you left in “just for now.” When you run it, you can almost hear the gears turning: call the model, retrieve from memory, call another model, write an artifact.
Your users do not see any of that internal structure; for them, the system is a single surface where they ask and it answers. The only reality that matters is the text they see and what it does for them.
Those two perspectives can easily lead you to overestimate the system’s quality. From the inside, you see sophistication and effort and want to believe it adds up to quality. From the outside, quality is whatever appears on the screen. The system feels intelligent because you know how much is going on. The user decides whether it is intelligent based on one blunt criterion: is this actually good?
Across the earlier chapters, we treated intelligence as architecture. Context, memory, agency, coordination, autonomy—each is a pattern you design. Evaluation connects that design to observed behavior by comparing outputs against defined criteria.
This creates a different kind of challenge. How do you tell if a composite system “works” when:
- The same input can produce different outputs.
- Many outputs can be acceptable in different ways.
- What “good” means is partly subjective and context dependent.
- Checking quality carefully is often more expensive than generating the output.
If the system passes all your tests, that does not imply users are happy. A metric that improves while complaints increase may be reflecting your definition of “good” more than genuine progress. When you use a model to judge a model, you may be measuring how convincingly it justifies itself rather than how well it serves users.
The rest of this chapter takes those confusions seriously. We will treat evaluation as its own design problem: deciding what to measure, how to measure it, how to live with stochasticity, and how to keep the whole process sustainable.
The first question is the most basic and the most slippery: when you say “the system works,” what are you actually claiming, and how would you know if it were false?
8.1 Knowing When the System Works
[DEMO: A simple summarizer with a “Looks fine, ship it” button vs. an evaluation panel. On the left, you run a few summaries and eyeball them. On the right, each run is scored along multiple dimensions (accuracy, relevance, clarity) by deterministic checks and an LLM-as-judge. A toggle shows how your initial intuition diverges from the measured metrics.]
When you run an agentic system a few times and the outputs look reasonable, it is tempting to conclude “this works.” Nothing crashes, the shape of the response looks right, and in spot checks you do not see obvious disasters. But “works” is ambiguous. It raises questions: works for which inputs, on which dimensions, and according to whose standard?
If every unit test passes, that does not imply the answers are factually correct. A five‑point gain in an accuracy metric may reflect your definition of “accuracy” more than genuine progress. When users complain more while your dashboard trends upward, the more likely explanation is that your metrics are misaligned with user value rather than that users are wrong.
At this point, every measurement you use is a proxy: it collapses a rich, multifaceted notion of quality into something small enough to log and graph. The danger is not that you cannot measure, but that you start treating the metric as equivalent to overall quality.
Evaluation is a function that takes an output and returns a quality signal. Every quality signal is a proxy: it verifies what you chose to test and ignores everything you did not. Designing evaluation means deciding what proxy to use and remembering that it is incomplete.
The mechanical picture looks like this:
type EvalContext = {
  // Whatever the evaluator needs to know
  input: string;
  taskDescription: string;
  referenceAnswer?: string;
};

type QualitySignal = {
  score: number; // 0–1 overall
  dimensions?: Record<string, number>; // e.g. { accuracy: 0.8, clarity: 0.9 }
  issues?: string[]; // Human- or model-readable notes
};

type Evaluation = (output: string, ctx: EvalContext) => Promise<QualitySignal>;
We will build evaluators that inspect outputs in context and emit quality signals in different formats. Those signals might be:
- A hard pass/fail: did the JSON parse, did the code compile.
- A numeric score: “0.82 helpfulness.”
- A set of labels: “missing required field title.”
- A structured diagnosis: “correct, but irrelevant to the user’s actual question.”
Here is a minimal composite evaluator that illustrates the pattern:
async function evaluateSummary(
  output: string,
  ctx: EvalContext
): Promise<QualitySignal> {
  // Deterministic check: non-empty and short enough
  if (!output.trim()) {
    return { score: 0, issues: ['empty_output'] };
  }
  if (output.length > 500) {
    return { score: 0.2, issues: ['too_long'] };
  }

  // Reference comparison if we have a golden answer
  let refScore = 0;
  if (ctx.referenceAnswer) {
    refScore = await semanticSimilarity(output, ctx.referenceAnswer); // 0–1
  }

  // LLM-as-judge for subjective aspects
  const { score: judgeScore } = await llmJudge(ctx.input, output);

  // Combine: simple weighted average for illustration
  const score = ctx.referenceAnswer
    ? 0.5 * refScore + 0.5 * judgeScore
    : judgeScore;

  return {
    score,
    dimensions: {
      reference: refScore,
      llm: judgeScore
    }
  };
}
This code does not claim to know “truth.” It codifies a decision: these are the aspects of quality we care about for summaries, and this is how we will compress them into a number. Change the weights or methods, and your “quality” moves—even if user experience has not.
“The system works” is not an intrinsic property of the code; it depends on how the system behaves under the evaluation you designed. If you are not explicit about that evaluation, you are relying on intuition and anecdotes. If users complain while metrics rise, it usually means your evaluation is measuring the wrong thing, not that users are mistaken.
Evaluation is a designed function, so you can design it with intent:
- Make quality multi-dimensional instead of a single score when possible.
- Use different evaluators for different purposes (regression safety vs. UX research).
- Be explicit about what your metrics do not capture.
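One way to make the choices above concrete is to keep separate, named evaluators and document their blind spots explicitly. A minimal sketch, assuming the types defined earlier; the registry shape and the notCaptured field are illustrative, not a fixed API:

// A small registry of purpose-specific evaluators. Each entry records what it
// is for and what it deliberately ignores, so nobody mistakes it for "quality".
type NamedEvaluator = {
  name: string;
  purpose: 'regression_safety' | 'ux_research' | 'monitoring';
  evaluate: Evaluation; // the (output, ctx) => QualitySignal function defined above
  notCaptured: string[]; // known blind spots, stated up front
};

const evaluators: NamedEvaluator[] = [
  {
    name: 'summary_quality',
    purpose: 'regression_safety',
    evaluate: evaluateSummary,
    notCaptured: ['tone', 'factuality of uncited claims', 'user-perceived usefulness']
  }
];

// Run every evaluator registered for a purpose and keep the signals separate
// instead of collapsing them into one number.
async function evaluateFor(
  purpose: NamedEvaluator['purpose'],
  output: string,
  ctx: EvalContext
): Promise<Record<string, QualitySignal>> {
  const results: Record<string, QualitySignal> = {};
  for (const e of evaluators.filter(e => e.purpose === purpose)) {
    results[e.name] = await e.evaluate(output, ctx);
  }
  return results;
}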
The rest of the chapter is about what you can plug into this evaluation function and how to live with the limitations of each option.
8.2 Using Models to Evaluate Model Outputs
[DEMO: A question-answering agent with two modes. In mode A, the model answers and then “self-grades” within the same call; grades always look rosy. In mode B, a separate evaluator model, called with the original question and answer, produces harsher, more variable scores. A human rating slider appears so you can compare both models to your own judgment.]
If you already have a powerful language model in the loop, it is natural to ask it whether its own outputs are good. Why not append “Now evaluate your answer on a scale from 1 to 5” at the end of the prompt and capture that score?
In that case, the model is not genuinely “stepping back” to evaluate; it continues the same pattern it just produced, with a bias toward justifying itself. If the model prefers verbose outputs, it may systematically overrate long answers and punish outputs that are correct but stylistically different.
Humans ultimately define what “good” means, but automated judges are attractive because they scale. The challenge is to use them in ways that surface their biases instead of amplifying them.
Same-call self-evaluation is entangled with generation and behaves more like rationalization than judgment. Separate-call LLM-as-judge can work well, but it inherits biases that you must discover by calibrating it against human evaluations. Use LLM judges for scalable screening; use humans for calibration and ground truth on subjective dimensions.
The mechanical difference between self-evaluation and LLM-as-judge is where you put the boundary between calls.
Self-evaluation in a single call looks like this:
async function answerAndSelfEvaluate(question: string): Promise<{
  answer: string;
  selfScore: number;
}> {
  const prompt = `
You are a helpful assistant.
Question:
${question}
First, write your best answer.
Then, on a new line, write: SCORE: <number from 1 to 5> indicating how good your answer was.
`;
  const completion = await llm.complete(prompt);
  const [answerPart, scoreLine] = completion.split('SCORE:');
  const selfScore = Number(scoreLine?.trim() ?? '0') || 0;
  return { answer: answerPart.trim(), selfScore };
}
Mechanically tidy, philosophically compromised. The model is optimizing for a single task: produce text that looks like “good answer + plausible score line.” There is no separation between “writer” and “judge.”
A separate-call judge places the evaluation in a different context, often with a different prompt:
async function llmJudge(
  question: string,
  answer: string
): Promise<{ score: number; rationale: string }> {
  const prompt = `
You are an impartial evaluator.
Task: Rate how well the answer addresses the question.
Consider correctness, relevance, and clarity.
Question:
${question}
Answer:
${answer}
Respond with JSON:
{ "score": number between 0 and 1, "rationale": "short explanation" }
`;
  const raw = await llm.complete(prompt);
  return JSON.parse(raw);
}
Now the judge model:
- Sees the task as evaluation, not answering.
- Can be prompted with explicit criteria and examples.
- Can be swapped independently of the generator model.
This separates roles, but not truth. The judge model still:
- Tends to favor verbose outputs over terse ones.
- Can be impressed by confident but wrong statements.
- May be sensitive to stylistic markers you did not intend.
Treating it as authoritative is as risky as treating user praise as a metric. You need a way to see the judge’s blind spots.
Calibration is the mechanism. You collect a set of examples, have humans rate them, then compare the judge’s scores to human scores:
async function calibrateJudge(
  judge: (q: string, a: string) => Promise<{ score: number }>,
  labeled: { question: string; answer: string; humanScore: number }[]
): Promise<{
  correlation: number;
  bias: number;
  meanAbsError: number;
}> {
  const modelScores: number[] = [];
  const humanScores: number[] = [];

  for (const item of labeled) {
    const { score } = await judge(item.question, item.answer);
    modelScores.push(score);
    humanScores.push(item.humanScore);
  }

  return {
    correlation: pearsonCorrelation(modelScores, humanScores),
    bias: mean(modelScores) - mean(humanScores),
    meanAbsError: meanAbsoluteError(modelScores, humanScores)
  };
}
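calibrateJudge assumes a few ordinary statistics helpers (mean, meanAbsoluteError, pearsonCorrelation). They are the standard formulas; a minimal sketch for completeness:

// Plain implementations of the statistics assumed above.
function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

function meanAbsoluteError(a: number[], b: number[]): number {
  return mean(a.map((x, i) => Math.abs(x - b[i])));
}

function pearsonCorrelation(a: number[], b: number[]): number {
  const ma = mean(a);
  const mb = mean(b);
  const cov = mean(a.map((x, i) => (x - ma) * (b[i] - mb)));
  const stdA = Math.sqrt(mean(a.map(x => (x - ma) ** 2)));
  const stdB = Math.sqrt(mean(b.map(x => (x - mb) ** 2)));
  return stdA === 0 || stdB === 0 ? 0 : cov / (stdA * stdB);
}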
The numbers you get back tell you what the judge is doing:
- High correlation: it agrees with humans in ranking outputs.
- Positive bias: it over-rates answers relative to humans.
- Large error: it is shaky on your particular task or rubric.
The implication is practical: use LLM-as-judge in places where its limitations are acceptable.
- Screening: flag obviously bad outputs before users see them.
- Ranking: choose the best of several candidates.
- Monitoring: track approximate quality trends over time.
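The ranking use above is particularly cheap to wire up with the separate-call judge. A sketch of best-of-N selection, where the generate parameter stands in for whatever produces a candidate answer (the function name is illustrative):

// Use the separate-call judge to pick the best of several sampled candidates.
async function bestOfN(
  generate: (question: string) => Promise<string>,
  question: string,
  n: number = 3
): Promise<{ answer: string; score: number }> {
  const candidates = await Promise.all(
    Array.from({ length: n }, () => generate(question))
  );
  const scored = await Promise.all(
    candidates.map(async answer => ({
      answer,
      score: (await llmJudge(question, answer)).score
    }))
  );
  // The judge is still a proxy: log the losing candidates too, so humans can audit its choices.
  return scored.reduce((best, cur) => (cur.score > best.score ? cur : best));
}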
Do not use it as the final authority in high-stakes contexts, or as the only signal when you change important behaviors. In those cases, humans remain the reference point—not because they are perfect, but because “quality” for subjective tasks is defined by what humans say.
8.3 Evaluating Stochastic Outputs
[DEMO: A small Q&A system where you can run the same question 10 times. A panel shows the distribution of LLM judge scores across runs, alongside deterministic checks (schema validity) and a reference similarity score when a ground-truth answer exists. You can add new test cases and watch some show tight clusters and others wide variance.]
Agentic systems are stochastic by construction. Sampling is a feature, not a bug: it lets the model explore alternatives, escape local optima, and produce varied outputs. But stochasticity collides with the way you are used to testing.
If the same input can yield a family of outputs, “the system passes the test” has to mean something about distributions rather than single runs. A single great run proves nothing about typical behavior, and a deterministic JSON parse check still leaves content quality unexamined.
Test case design adds another layer. If you invent all the cases by hand, you mostly test your imagination. If you harvest them from production, you have to worry about privacy, coverage, and overfitting to narrow slices of reality.
To evaluate stochastic outputs, you measure distributions, not single runs. You combine deterministic checks for structure with content-focused methods like reference comparison and LLM-as-judge. Your test cases come from a mix of real usage, synthetic generation, careful curation, and adversarial design—all versioned as seriously as code.
At the code level, the main shift is trivial: you add a loop.
async function evaluateStochastic(
  system: (input: string) => Promise<string>,
  testCase: EvalContext,
  runs: number = 5
): Promise<{
  mean: number;
  stdDev: number;
  samples: QualitySignal[];
}> {
  const scores: number[] = [];
  const samples: QualitySignal[] = [];

  for (let i = 0; i < runs; i++) {
    const output = await system(testCase.input);
    const quality = await evaluateSummary(output, testCase);
    scores.push(quality.score);
    samples.push(quality);
  }

  return {
    mean: mean(scores),
    stdDev: standardDeviation(scores),
    samples
  };
}
Looping over runs changes what you care about:
- A single score becomes a distribution. You now care about mean, variance, and worst case.
- Reducing variance can be as valuable as raising the mean. A system that averages 0.85 but sometimes drops to 0.4 may be less useful than one that sits reliably at 0.8.
- Outliers matter differently in different contexts. For a marketing copy assistant, a rare bad answer is annoying. For a code generator in production CI, a rare bad answer can break a deployment.
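The loop above reports only mean and standard deviation, but as the list notes, the worst case and the failure rate often matter more. A small summary helper, sketched with an assumed quality floor of 0.7 and with standardDeviation written out since it was taken for granted above:

function standardDeviation(xs: number[]): number {
  const m = mean(xs);
  return Math.sqrt(mean(xs.map(x => (x - m) ** 2)));
}

// Summarize per-run scores in the terms that matter for stochastic systems:
// central tendency, spread, the worst observed run, and how often runs fall
// below an agreed quality floor.
function summarizeScores(scores: number[], floor: number = 0.7) {
  return {
    mean: mean(scores),
    stdDev: standardDeviation(scores),
    worst: Math.min(...scores),
    belowFloor: scores.filter(s => s < floor).length / scores.length
  };
}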
Structural correctness is the easy part. You can check it deterministically:
function structuralChecks(output: string): { passed: boolean; issues: string[] } {
  const issues: string[] = [];

  try {
    JSON.parse(output);
  } catch {
    issues.push('invalid_json');
  }

  if (!output.includes('"title"')) {
    issues.push('missing_title_field');
  }

  return { passed: issues.length === 0, issues };
}
These checks tell you nothing about whether "title" is accurate, relevant, or ethical. For that, you combine them with content-focused evaluators:
- Reference comparison: if you have a known-good answer, measure similarity.
- LLM-as-judge: if you do not, ask a model to approximate human judgment.
- Execution-based checks: if the output is code or SQL, run it against tests or a sandbox.
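Execution-based checks are the most direct when the output is runnable. Here is a sketch for generated JavaScript function bodies; the test-case shape is illustrative, and for SQL or shell output you would swap in a sandboxed database or container:

type ExecTest = { args: unknown[]; expected: unknown };

// Run a generated JavaScript function body against known input/output pairs.
// This verifies behavior, not style; combine it with the content evaluators above.
// Note: new Function executes arbitrary code; only use this inside a sandbox.
function executionCheck(
  generatedSource: string, // e.g. "return args[0] + args[1];"
  tests: ExecTest[]
): { passed: number; total: number; issues: string[] } {
  const issues: string[] = [];
  let passed = 0;
  let fn: (...args: unknown[]) => unknown;
  try {
    fn = new Function('...args', generatedSource) as (...args: unknown[]) => unknown;
  } catch {
    return { passed: 0, total: tests.length, issues: ['does_not_compile'] };
  }
  for (const t of tests) {
    try {
      const result = fn(...t.args);
      if (JSON.stringify(result) === JSON.stringify(t.expected)) passed += 1;
      else issues.push(`wrong_result_for_${JSON.stringify(t.args)}`);
    } catch {
      issues.push(`throws_for_${JSON.stringify(t.args)}`);
    }
  }
  return { passed, total: tests.length, issues };
}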
Evaluation is only as good as the test cases you feed it. Designing those is its own exercise:
interface TestCase {
  id: string;
  input: string;
  category: 'representative' | 'edge' | 'adversarial';
  difficulty: 'easy' | 'medium' | 'hard';
  expectedBehavior?: string;
  groundTruth?: string; // when objectively determinable
  tags: string[];
}

interface EvaluationDataset {
  name: string;
  version: string;
  created: string;
  description: string;
  cases: TestCase[];
}
You populate these from four main sources:
- Real usage: sampled and anonymized production inputs, so your tests match the distribution you actually face.
- Synthetic generation: LLM-generated scenarios to cover combinations users have not hit yet.
- Curation: hand-picked cases that matter disproportionately (e.g. your most common workflows).
- Adversarial design: prompts specifically intended to break your assumptions or bypass your safeguards.
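Synthetic generation is the easiest of these to automate. A sketch that asks the model for scenarios and wraps them as TestCase records; the prompt and the JSON shape are illustrative, and a human should still review what comes back:

// Ask the model for plausible-but-unseen inputs in a category, then wrap them
// as TestCase records for the versioned dataset. Review before committing.
async function generateSyntheticCases(
  taskDescription: string,
  category: TestCase['category'],
  count: number
): Promise<TestCase[]> {
  const prompt = `
You generate test inputs for the following task: ${taskDescription}
Produce ${count} diverse ${category} inputs.
Respond with a JSON array of strings.
`;
  const raw = await llm.complete(prompt);
  const inputs: string[] = JSON.parse(raw);
  return inputs.map((input, i) => ({
    id: `synthetic-${category}-${i}`,
    input,
    category,
    difficulty: 'medium', // a reviewer should adjust this by hand
    tags: ['synthetic']
  }));
}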
Versioning the dataset is non-negotiable. If you change the test set and see metrics improve, you need to know whether you improved the system or just picked easier questions. “Quality went up” only means something when you know relative to which dataset.
Stochastic evaluation feels messier than traditional unit tests because it is. You are measuring behavior across a space of possibilities, not checking a single path. The reward is a more honest picture of what your system actually does.
8.4 Making Evaluation Sustainable
[DEMO: An evaluation dashboard with three knobs: “percent of outputs evaluated,” “percent of evaluations done by humans,” and “acceptable regression threshold.” As you move the knobs, you see simulated monthly cost, latency impact, and the chance of missing a regression. A panel highlights scenarios where Goodhart’s Law kicks in: optimizing for an easy metric worsens user satisfaction.]
Once you start evaluating seriously, you immediately run into a practical wall: cost.
Human evaluation is slow and expensive. LLM-as-judge is cheaper but not free. Even deterministic checks and dataset runs add latency and compute. If you tried to evaluate every output on every dimension with humans, you would spend more on judging than on serving the product.
Optimizing only what is cheap to measure risks drifting away from what users actually care about. Requiring weeks of evaluation before every deployment will halt shipping, while continuous evaluation forces you to define what “good enough” means in practice.
Sustainable evaluation is layered and selective. You use cheap automated checks everywhere, expensive human evaluation where it matters most, and you accept that your metrics are proxies that can be gamed. You ship when regressions are caught by these layers and quality exceeds a defined threshold on representative test sets, then keep evaluating to catch drift.
In code, the layering looks straightforward:
type EvalTier = 'rejected' | 'needs_review' | 'accepted';

async function layeredEvaluate(
  output: string,
  ctx: EvalContext
): Promise<{ tier: EvalTier; signal: QualitySignal }> {
  // Layer 1: Deterministic checks
  const structural = structuralChecks(output);
  if (!structural.passed) {
    return {
      tier: 'rejected',
      signal: { score: 0, issues: structural.issues }
    };
  }

  // Layer 2: LLM-as-judge for scalable subjective assessment
  const llmScore = await llmJudge(ctx.input, output);
  if (llmScore.score < 0.7) {
    return {
      tier: 'rejected',
      signal: { score: llmScore.score, issues: ['low_llm_score'] }
    };
  }

  // Layer 3: Human review for borderline or high-stakes cases
  if (llmScore.score < 0.85) {
    return {
      tier: 'needs_review',
      signal: { score: llmScore.score, issues: ['manual_review_required'] }
    };
  }

  return {
    tier: 'accepted',
    signal: { score: llmScore.score }
  };
}
The design work sits in the policy around this function:
- What thresholds do you pick, and how do you justify them?
- Which categories of outputs bypass human review, and which must be inspected?
- How often do you re-run human calibration to ensure your LLM judge has not drifted?
This is Goodhart’s Law in action. If you optimize only for your LLM judge’s score, your system will eventually learn to produce answers that please the judge rather than users: longer, more hedged, and filled with the kinds of phrases the judge mistakenly equates with quality.
You counter this by:
- Keeping some evaluation channels hidden from the system (e.g. periodic human audits on random samples).
- Maintaining multiple, independent metrics rather than a single magic number.
- Regularly checking whether improvements in metrics correlate with improvements in user satisfaction.
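The last point can itself be semi-automated as a periodic sanity check. A sketch that compares an offline metric against user satisfaction across releases, reusing the pearsonCorrelation helper from earlier; the ReleaseSnapshot shape is illustrative:

type ReleaseSnapshot = {
  release: string;
  offlineScore: number; // average evaluation score on the test set
  userSatisfaction: number; // e.g. average rating or thumbs-up rate in production
};

// If metric movements stop tracking user satisfaction, the metric is being
// gamed or has drifted away from what users value.
function metricStillTracksUsers(
  history: ReleaseSnapshot[],
  minCorrelation: number = 0.5
): { correlation: number; healthy: boolean } {
  const correlation = pearsonCorrelation(
    history.map(h => h.offlineScore),
    history.map(h => h.userSatisfaction)
  );
  return { correlation, healthy: correlation >= minCorrelation };
}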
Sustainability also means temporal coverage. As systems become more autonomous (Chapter 7), they act when you are not looking. Evaluation has to keep up:
// Run periodically to monitor quality in production.
// Sketched as a scheduled method on an agent or monitoring class.
@schedule('every hour')
async monitorProductionQuality() {
  const recent = await this.sampleRecentOutputs(100); // random sample
  const evaluations = await Promise.all(
    recent.map(({ input, output }) =>
      evaluateSummary(output, { input, taskDescription: 'prod' })
    )
  );

  const avg = mean(evaluations.map(e => e.score));
  const failureRate = evaluations.filter(e => e.score < 0.7).length / evaluations.length;

  await this.recordMetric('quality.avg', avg);
  await this.recordMetric('quality.failure_rate', failureRate);

  const baseline = await this.getBaselineMetrics();
  if (avg < baseline.avg - 0.1 || failureRate > baseline.failureRate + 0.05) {
    await this.alert('quality_regression', { avg, failureRate, baseline });
  }
}
You do not evaluate everything. You sample. You set thresholds. You accept that some regressions will slip through, and design alerts to catch trends rather than single incidents.
Shipping decisions become questions of acceptable risk framed by evaluation:
- Has the candidate system been run against the current evaluation dataset?
- Are there any regressions above our tolerance in key categories?
- Do we understand the tradeoffs (e.g. slightly lower completeness for significantly lower latency)?
You move from “it feels better” to “it scores better on these tests, and we have decided that is enough to ship.”
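Those questions can be encoded as a simple gate over per-category evaluation results. A sketch with an illustrative tolerance; deciding which categories are key and what regression is acceptable remains a human call:

type CategoryScore = { category: string; mean: number };

// Compare candidate evaluation results against the current baseline and flag
// any key category that regressed by more than the agreed tolerance.
function shipGate(
  baseline: CategoryScore[],
  candidate: CategoryScore[],
  tolerance: number = 0.02
): { ship: boolean; regressions: string[] } {
  const regressions: string[] = [];
  for (const base of baseline) {
    const cand = candidate.find(c => c.category === base.category);
    if (!cand || cand.mean < base.mean - tolerance) {
      regressions.push(base.category);
    }
  }
  return { ship: regressions.length === 0, regressions };
}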
The final piece is discipline. Evaluation only stays useful if you resist the urge to compress everything into a single “quality score” and declare victory. The more honest you are about what your evaluations measure and what they miss, the more value you can extract from them over time.
8.5 Building Systems with Evaluation-Driven Development
[DEMO: An interactive “evaluation-first” playground. On the left, you define an evaluation spec for a simple citation-generating agent (what counts as a good citation). On the right, you iteratively modify the agent’s prompt and tool wiring. A chart at the bottom shows evaluation scores by dimension over iterations, making clear how changing implementation without changing evaluation can appear to “improve” or “worsen” quality.]
At this point, we have treated evaluation as something you bolt onto a system to see how it behaves. You can invert that relationship.
In traditional software, test-driven development asks you to write tests before implementation. The tests define what “done” means. You then write code until those tests pass. The same discipline is available for agentic systems, but the “tests” are evaluations.
If you cannot state how you would measure success for a feature, you do not yet know what you are building. When the system passes your evaluation but users are unhappy, that usually indicates a gap in the evaluation itself, not evidence that the users are wrong.
Evaluation-driven development means specifying the quality signals you care about before you build or change behavior, then iterating until your system reliably meets those signals. The evaluation formalizes your intentions; the system’s job is to satisfy them.
Mechanically, this looks like defining evaluation specs alongside features:
type EvaluationSpec = {
  name: string;
  description: string;
  metric: (output: string, ctx: EvalContext) => Promise<number>; // 0–1
  threshold: number;
};

const citationAccuracySpec: EvaluationSpec = {
  name: 'citation_accuracy',
  description: 'Responses should only cite sources that actually support the claims.',
  metric: async (output, ctx) => {
    const citations = extractCitations(output); // e.g. URLs or IDs
    if (citations.length === 0) return 0;
    const verified = await verifyCitations(citations, ctx.referenceAnswer ?? '');
    return verified / citations.length; // fraction verified
  },
  threshold: 0.95
};
Before you wire up a “research assistant” agent, you decide: we will not consider this feature acceptable unless at least 95% of cited sources actually back the claims. That is your evaluator’s threshold.
Running the spec is straightforward:
async function checkFeature(
  system: (input: string) => Promise<string>,
  spec: EvaluationSpec,
  testCases: EvalContext[]
): Promise<{ spec: string; passRate: number }> {
  let passed = 0;

  for (const tc of testCases) {
    const output = await system(tc.input);
    const score = await spec.metric(output, tc);
    if (score >= spec.threshold) passed += 1;
  }

  return {
    spec: spec.name,
    passRate: passed / testCases.length
  };
}
When you run this loop, two levers appear:
- If the system fails, you can improve the system or adjust the spec. Changing the spec forces you to articulate why you are lowering or changing the bar.
- If the system passes but users complain about something the spec does not measure (tone, for example), you add new evaluation specs. Your definition of “done” evolves, but explicitly.
Over time, you accumulate a small library of evaluation specs:
- “Follows safety policy” for content filters.
- “Answers all parts of the question” for multi-step tasks.
- “Explains reasoning with at least one concrete example” for educational tools.
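Most of these lean on an LLM judge with a narrow rubric. For example, a sketch of the “answers all parts of the question” spec, with an illustrative judge prompt:

const completenessSpec: EvaluationSpec = {
  name: 'answers_all_parts',
  description: 'Multi-part questions should have every part addressed.',
  metric: async (output, ctx) => {
    const prompt = `
You are an impartial evaluator.
Question:
${ctx.input}
Answer:
${output}
What fraction of the distinct parts of the question does the answer address?
Respond with JSON: { "fraction": number between 0 and 1 }
`;
    const raw = await llm.complete(prompt);
    return JSON.parse(raw).fraction;
  },
  threshold: 0.9
};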
Each spec is a piece of your system’s contract with reality. The more you rely on them, the more your development process starts to look like this:
- Define or refine evaluation specs for the behavior you want.
- Build or modify the system.
- Run evaluations on a representative dataset (possibly with variance-aware sampling).
- Decide whether to ship based on evaluation results and their known limitations.
Evaluation becomes a primary mechanism for specifying and enforcing the behavior you expect from the system: a continuous specification and checking process that you use throughout development.
Key Takeaways
Evaluation is where your agent meets the world.
- An evaluation is a function from output (plus context) to a quality signal. That signal is always a proxy. It measures what you chose to encode and is blind to everything else.
- Letting a model grade itself in the same call produces rationalizations, not independent judgments. Separate-call LLM-as-judge can be a powerful evaluator, but only when you understand and calibrate its biases against human ground truth.
- Stochastic outputs require distributional thinking. You run multiple times, measure variance, combine deterministic structural checks with content-focused evaluators, and build versioned datasets that reflect real, edge, and adversarial usage.
- Sustainable evaluation is layered: cheap deterministic checks everywhere, LLM-as-judge at scale, human evaluation for calibration and high-stakes cases. Optimizing a metric is not the same as optimizing user value, especially when the system learns to game the metric.
- Evaluation-driven development treats evaluations as first-class specifications. You define what “good” means in measurable terms before building, then iterate until your system meets those criteria—adjusting either system or criteria with explicit justification.
Transition to Chapter 9
Evaluation tells you how well your system behaves but not how to make it better. A score of 0.73 does not say which sentence was misleading. A “needs review” flag does not say what to change. Even a detailed LLM judgment can feel like a post-mortem rather than a guide.
To turn evaluation into improvement, you need feedback loops.
In the next chapter, we move from measuring to modifying. We will look at how to design systems that use their own evaluation signals—whether from deterministic checks, LLM judges, or humans—to iteratively refine outputs within a task and to learn across tasks over time. Evaluation is the sensor; feedback is the actuator that responds to it.