Most of the time, you experience your system from the inside: you see the code, the prompts, and the logs. You know which tricks you used to steer the model, which caches you added to shave a few hundred milliseconds, which hacks you left in “just for now.” When you run it, you can almost hear the gears turning: call the model, retrieve from memory, call another model, write an artifact.

Your users do not see any of that internal structure; for them, the system is a single surface where they ask and it answers. The only reality that matters is the text they see and what it does for them.

Those two perspectives can easily lead you to overestimate the system’s quality. From the inside, you see sophistication and effort and want to believe it adds up to quality. From the outside, quality is whatever appears on the screen. The system feels intelligent because you know how much is going on. The user decides whether it is intelligent based on one blunt criterion: is this actually good?

Across the earlier chapters, we treated intelligence as architecture. Context, memory, agency, coordination, autonomy: each is a pattern you design. Evaluation connects that design to observed behavior by comparing outputs against defined criteria. This creates a different kind of challenge. How do you tell if a composite system “works” when:
- The same input can produce different outputs.
- Many outputs can be acceptable in different ways.
- What “good” means is partly subjective and context dependent.
- Checking quality carefully is often more expensive than generating the output.
8.1 Knowing When the System Works
[DEMO: A simple summarizer with a “Looks fine, ship it” button vs. an evaluation panel. On the left, you run a few summaries and eyeball them. On the right, each run is scored along multiple dimensions (accuracy, relevance, clarity) by deterministic checks and an LLM-as-judge. A toggle shows how your initial intuition diverges from the measured metrics.]

When you run an agentic system a few times and the outputs look reasonable, it is tempting to conclude “this works.” Nothing crashes, the shape of the response looks right, and in spot checks you do not see obvious disasters.

But “works” is ambiguous. It raises questions: works for which inputs, on which dimensions, and according to whose standard? If every unit test passes, that does not imply the answers are factually correct. A five-point gain in an accuracy metric may reflect your definition of “accuracy” more than genuine progress. When users complain more while your dashboard trends upward, the more likely explanation is that your metrics are misaligned with user value rather than that users are wrong.

Every measurement you use is a proxy: it collapses a rich, multifaceted notion of quality into something small enough to log and graph. The danger is not that you cannot measure, but that you start treating the metric as equivalent to overall quality.

Evaluation is a function that takes an output and returns a quality signal. Every quality signal is a proxy: it verifies what you chose to test and ignores everything you did not. Designing evaluation means deciding what proxy to use and remembering that it is incomplete.
- A hard pass/fail: did the JSON parse, did the code compile.
- A numeric score: “0.82 helpfulness.”
- A set of labels: “missing required field title.”
- A structured diagnosis: “correct, but irrelevant to the user’s actual question.”
- Make quality multi-dimensional instead of a single score when possible.
- Use different evaluators for different purposes (regression safety vs. UX research).
- Be explicit about what your metrics do not capture.
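As a minimal sketch of these ideas, the snippet below treats evaluation as a function from an output to a multi-dimensional quality signal. The `QualitySignal` type, the `evaluate_summary_json` function, the required field names, and the brevity score are all hypothetical choices made for illustration; they are not an API defined in this chapter.

```python
import json
from dataclasses import dataclass, field


@dataclass
class QualitySignal:
    """One evaluation result, kept multi-dimensional on purpose (hypothetical type)."""
    passed: bool                                 # hard pass/fail
    scores: dict = field(default_factory=dict)   # numeric score per dimension
    labels: list = field(default_factory=list)   # short, machine-readable findings
    # Being explicit about what this evaluator does NOT capture.
    not_measured: tuple = ("factual accuracy", "relevance", "tone")


def evaluate_summary_json(raw_output: str, max_words: int = 50) -> QualitySignal:
    """Run deterministic checks only; everything in `not_measured` stays unexamined."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return QualitySignal(passed=False, labels=["invalid JSON"])
    if not isinstance(data, dict):
        return QualitySignal(passed=False, labels=["top-level value is not an object"])

    labels = []
    for required in ("title", "summary"):  # assumed schema for this example
        if required not in data:
            labels.append(f"missing required field {required}")

    word_count = len(str(data.get("summary", "")).split())
    # Assumed brevity score: 1.0 within the word budget, shrinking as the summary overshoots.
    brevity = 1.0 if word_count <= max_words else max_words / word_count
    return QualitySignal(passed=not labels, scores={"brevity": round(brevity, 2)}, labels=labels)
```

The `not_measured` field is the point of the design: a result that passes here has only passed the proxy, and the signal says so.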
8.2 Using Models to Evaluate Model Outputs
[DEMO: A question-answering agent with two modes. In mode A, the model answers and then “self-grades” within the same call; grades always look rosy. In mode B, a separate evaluator model, called with the original question and answer, produces harsher, more variable scores. A human rating slider appears so you can compare both models to your own judgment.]

If you already have a powerful language model in the loop, it is natural to ask it whether its own outputs are good. Why not append “Now evaluate your answer on a scale from 1 to 5” at the end of the prompt and capture that score?

In that case, the model is not genuinely “stepping back” to evaluate; it continues the same pattern it just produced, with a bias toward justifying itself. If the model prefers verbose outputs, it may systematically overrate long answers and punish outputs that are correct but stylistically different. Even if humans define what “good” means, automated judges are attractive because they scale. The challenge is to use them in ways that surface their biases instead of amplifying them.

Same-call self-evaluation is entangled with generation and behaves more like rationalization than judgment. Separate-call LLM-as-judge can work well, but it inherits biases that you must discover by calibrating it against human evaluations. Use LLM judges for scalable screening; use humans for calibration and ground truth on subjective dimensions.
- Sees the task as evaluation, not answering.
- Can be prompted with explicit criteria and examples.
- Can be swapped independently of the generator model.
- Tends to favor verbose outputs over terse ones.
- Can be impressed by confident but wrong statements.
- May be sensitive to stylistic markers you did not intend.
- High correlation: it agrees with humans in ranking outputs.
- Positive bias: it over-rates answers relative to humans.
- Large error: it is shaky on your particular task or rubric.
- Screening: flag obviously bad outputs before users see them.
- Ranking: choose the best of several candidates.
- Monitoring: track approximate quality trends over time.
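The calibration signals named above (correlation, bias, and error relative to human ratings) can be computed from paired scores on the same outputs. The sketch below assumes you already have a list of judge scores and a list of human scores on the same scale; the function names are hypothetical, and Pearson correlation is used only as a simple stand-in, since a rank correlation may match "agrees in ranking" more closely.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series has no variance."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def calibrate(judge_scores, human_scores):
    """Compare an LLM judge against human ratings of the same outputs."""
    if len(judge_scores) != len(human_scores) or len(judge_scores) < 2:
        raise ValueError("need at least two paired scores")
    diffs = [j - h for j, h in zip(judge_scores, human_scores)]
    return {
        "correlation": pearson(judge_scores, human_scores),  # does the judge track humans?
        "bias": mean(diffs),                                  # > 0 means the judge over-rates
        "mean_abs_error": mean(abs(d) for d in diffs),        # typical size of disagreement
    }
```

A judge with high correlation and a stable positive bias can still be useful for ranking and screening, because the bias can be subtracted out; a judge with low correlation cannot be rescued that way.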
8.3 Evaluating Stochastic Outputs
[DEMO: A small Q&A system where you can run the same question 10 times. A panel shows the distribution of LLM judge scores across runs, alongside deterministic checks (schema validity) and a reference similarity score when a ground-truth answer exists. You can add new test cases and watch some show tight clusters and others wide variance.]

Agentic systems are stochastic by construction. Sampling is a feature, not a bug: it lets the model explore alternatives, escape local optima, and produce varied outputs. But stochasticity collides with the way you are used to testing. If the same input can yield a family of outputs, “the system passes the test” has to mean something about distributions rather than single runs. A single great run proves nothing about typical behavior, and a deterministic JSON parse check still leaves content quality unexamined.

Test case design adds another layer. If you invent all the cases by hand, you mostly test your imagination. If you harvest them from production, you have to worry about privacy, coverage, and overfitting to narrow slices of reality.

To evaluate stochastic outputs, you measure distributions, not single runs. You combine deterministic checks for structure with content-focused methods like reference comparison and LLM-as-judge. Your test cases come from a mix of real usage, synthetic generation, careful curation, and adversarial design, all versioned as seriously as code.
- A single score becomes a distribution. You now care about mean, variance, and worst case.
- Reducing variance can be as valuable as raising the mean. A system that averages 0.85 but sometimes drops to 0.4 may be less useful than one that sits reliably at 0.8.
- Outliers matter differently in different contexts. For a marketing copy assistant, a rare bad answer is annoying. For a code generator in production CI, a rare bad answer can break a deployment.
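The comparison above (a system averaging 0.85 with occasional drops to 0.4 versus one that sits at 0.8) can be made concrete with a few lines of arithmetic. This is a sketch with made-up scores; `summarize_runs`, `meets_bar`, and the threshold values are hypothetical and would depend on your own context.

```python
from statistics import mean, pstdev


def summarize_runs(scores):
    """Collapse repeated runs of one test case into mean, spread, and worst case."""
    return {"mean": mean(scores), "stdev": pstdev(scores), "worst": min(scores)}


def meets_bar(scores, min_mean, min_worst):
    """Pass only if both the typical run and the worst run are acceptable."""
    summary = summarize_runs(scores)
    return summary["mean"] >= min_mean and summary["worst"] >= min_worst


# Illustrative scores for ten runs of the same input (not measured data).
spiky = [0.9] * 9 + [0.4]   # mean 0.85, but one run collapses
steady = [0.8] * 10         # mean 0.80, no surprises

for name, scores in (("spiky", spiky), ("steady", steady)):
    print(name, summarize_runs(scores), meets_bar(scores, min_mean=0.75, min_worst=0.7))
```

With a worst-case floor in the criterion, the system with the lower mean is the one that passes, which is the trade the bullets above describe.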
"title" is accurate, relevant, or ethical. For that, you combine them with content-focused evaluators:
- Reference comparison: if you have a known-good answer, measure similarity.
- LLM-as-judge: if you do not, ask a model to approximate human judgment.
- Execution-based checks: if the output is code or SQL, run it against tests or a sandbox.
- Real usage: sampled and anonymized production inputs, so your tests match the distribution you actually face.
- Synthetic generation: LLM-generated scenarios to cover combinations users have not hit yet.
- Curation: hand-picked cases that matter disproportionately (e.g. your most common workflows).
- Adversarial design: prompts specifically intended to break your assumptions or bypass your safeguards.
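To tie the two lists above together, here is a rough sketch of a test case tagged with its source and evaluated by a deterministic structural check plus a reference comparison when a known-good answer exists. `EvalCase`, the assumed `"answer"` field, and the use of `difflib.SequenceMatcher` are illustrative assumptions; character-level similarity is a crude stand-in for whatever similarity measure suits your task.

```python
import json
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    source: str                      # "real", "synthetic", "curated", or "adversarial"
    reference: Optional[str] = None  # known-good answer, when one exists


def structure_ok(raw_output: str) -> bool:
    """Deterministic check: output is a JSON object with a string "answer" field."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and isinstance(data.get("answer"), str)


def reference_similarity(answer: str, reference: str) -> float:
    """Case-insensitive character similarity in [0, 1]; a crude content proxy."""
    return SequenceMatcher(None, answer.lower(), reference.lower()).ratio()


def evaluate_case(case: EvalCase, raw_output: str) -> dict:
    """Structure first; content similarity only if structure holds and a reference exists."""
    result = {"source": case.source, "structure_ok": structure_ok(raw_output), "similarity": None}
    if result["structure_ok"] and case.reference is not None:
        answer = json.loads(raw_output)["answer"]
        result["similarity"] = reference_similarity(answer, case.reference)
    return result
```

Keeping `source` on every case lets you report results per slice, so a regression on adversarial cases is not averaged away by easy curated ones. Cases without a reference would fall through to an LLM judge or an execution-based check, which this sketch leaves out.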
8.4 Making Evaluation Sustainable
[DEMO: An evaluation dashboard with three knobs: “percent of outputs evaluated,” “percent of evaluations done by humans,” and “acceptable regression threshold.” As you move the knobs, you see simulated monthly cost, latency impact, and the chance of missing a regression. A panel highlights scenarios where Goodhart’s Law kicks in: optimizing for an easy metric worsens user satisfaction.]

Once you start evaluating seriously, you immediately run into a practical wall: cost. Human evaluation is slow and expensive. LLM-as-judge is cheaper but not free. Even deterministic checks and dataset runs add latency and compute. If you tried to evaluate every output on every dimension with humans, you would spend more on judging than on serving the product.

Optimizing only what is cheap to measure risks drifting away from what users actually care about. Requiring weeks of evaluation before every deployment will halt shipping, while continuous evaluation forces you to define what “good enough” means in practice.

Sustainable evaluation is layered and selective. You use cheap automated checks everywhere, expensive human evaluation where it matters most, and you accept that your metrics are proxies that can be gamed. You ship when regressions are caught by these layers and quality exceeds a defined threshold on representative test sets, then keep evaluating to catch drift.
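A back-of-the-envelope calculation shows why evaluation has to be selective. The function below is a hypothetical cost model, and every number in the usage lines is invented for illustration rather than taken from any real pricing.

```python
def monthly_eval_cost(outputs_per_month, sample_rate, human_share, judge_cost, human_cost):
    """Estimate monthly evaluation spend under a simple two-tier model.

    sample_rate: fraction of outputs that get evaluated at all
    human_share: fraction of evaluated outputs that go to human reviewers
    judge_cost / human_cost: assumed cost per evaluation for each tier
    """
    evaluated = outputs_per_month * sample_rate
    human_reviewed = evaluated * human_share
    judge_reviewed = evaluated - human_reviewed
    return judge_reviewed * judge_cost + human_reviewed * human_cost


# Made-up scenario: 100,000 outputs, $0.01 per judge call, $2.00 per human review.
selective = monthly_eval_cost(100_000, sample_rate=0.1, human_share=0.05,
                              judge_cost=0.01, human_cost=2.0)
everything_by_humans = monthly_eval_cost(100_000, sample_rate=1.0, human_share=1.0,
                                         judge_cost=0.01, human_cost=2.0)
print(f"selective: ${selective:,.2f}  all-human: ${everything_by_humans:,.2f}")
```

The model ignores latency and the cost of deterministic checks, but it is enough to see how quickly the human tier dominates once its share grows.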
- What thresholds do you pick, and how do you justify them?
- Which categories of outputs bypass human review, and which must be inspected?
- How often do you re-run human calibration to ensure your LLM judge has not drifted?
- Keeping some evaluation channels hidden from the system (e.g. periodic human audits on random samples).
- Maintaining multiple, independent metrics rather than a single magic number.
- Regularly checking whether improvements in metrics correlate with improvements in user satisfaction.
- Has the candidate system been run against the current evaluation dataset?
- Are there any regressions above our tolerance in key categories?
- Do we understand the tradeoffs (e.g. slightly lower completeness for significantly lower latency)?
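The second question in the checklist above, whether any category regressed beyond tolerance, lends itself to an automated gate. This is a sketch under assumptions: per-category scores are higher-is-better numbers on a shared scale, and `check_regressions`, `ready_to_ship`, and the default tolerance are hypothetical.

```python
def check_regressions(baseline, candidate, tolerance=0.02):
    """Compare per-category scores and report drops larger than `tolerance`.

    baseline / candidate: dicts mapping category name -> score (higher is better).
    Categories the candidate was never run on are reported separately,
    because a missing score is not the same as a passing score.
    """
    regressions = {}
    missing = []
    for category, base_score in baseline.items():
        if category not in candidate:
            missing.append(category)
            continue
        drop = base_score - candidate[category]
        if drop > tolerance:
            regressions[category] = round(drop, 4)
    return {"regressions": regressions, "missing": missing}


def ready_to_ship(baseline, candidate, tolerance=0.02):
    """Gate: every baseline category was evaluated and none dropped beyond tolerance."""
    report = check_regressions(baseline, candidate, tolerance)
    return not report["regressions"] and not report["missing"]
```

A gate like this answers only the first two checklist questions. Whether a flagged drop is an acceptable trade, such as slightly lower completeness for much lower latency, remains a human decision.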
8.5 Building Systems with Evaluation-Driven Development
[DEMO: An interactive “evaluation-first” playground. On the left, you define an evaluation spec for a simple citation-generating agent (what counts as a good citation). On the right, you iteratively modify the agent’s prompt and tool wiring. A chart at the bottom shows evaluation scores by dimension over iterations, making clear how changing implementation without changing evaluation can appear to “improve” or “worsen” quality.]

Up to this point, we have treated evaluation as something you bolt onto a system to see how it behaves. You can invert that relationship.

In traditional software, test-driven development asks you to write tests before implementation. The tests define what “done” means. You then write code until those tests pass. The same discipline is available for agentic systems, but the “tests” are evaluations. If you cannot state how you would measure success for a feature, you do not yet know what you are building. When the system passes your evaluation but users are unhappy, that usually indicates that the evaluation is measuring the wrong thing, not that the measurements themselves are faulty.

Evaluation-driven development means specifying the quality signals you care about before you build or change behavior, then iterating until your system reliably meets those signals. The evaluation formalizes your intentions; the system’s job is to satisfy them.
- If the system fails, you can improve the system or adjust the spec. Changing the spec forces you to articulate why you are lowering or changing the bar.
- If the system passes but users complain about something the spec does not measure (tone, for example), you add new evaluation specs. Your definition of “done” evolves, but explicitly.
- “Follows safety policy” for content filters.
- “Answers all parts of the question” for multi-step tasks.
- “Explains reasoning with at least one concrete example” for educational tools.
- Define or refine evaluation specs for the behavior you want.
- Build or modify the system.
- Run evaluations on a representative dataset (possibly with variance-aware sampling).
- Decide whether to ship based on evaluation results and their known limitations.
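The loop above can be sketched as a spec written before the system exists and a runner that reports whether the current system meets it. Everything here is hypothetical: the `EvalSpec` type, the `run_specs` runner, and especially `has_concrete_example`, which is a deliberately crude keyword proxy for “explains reasoning with at least one concrete example.”

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalSpec:
    name: str
    check: Callable[[str, str], bool]  # (question, answer) -> passed?
    min_pass_rate: float               # the bar that defines "done" for this spec


def has_concrete_example(question: str, answer: str) -> bool:
    """Crude proxy: the answer contains an explicit example marker."""
    lowered = answer.lower()
    return "for example" in lowered or "e.g." in lowered


def run_specs(specs, system, dataset):
    """Run `system` (a callable question -> answer) once per question, then apply every spec."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    outputs = [(question, system(question)) for question in dataset]
    report = {}
    for spec in specs:
        pass_rate = sum(spec.check(q, a) for q, a in outputs) / len(outputs)
        report[spec.name] = {"pass_rate": pass_rate, "met": pass_rate >= spec.min_pass_rate}
    return report


# Step 1: the spec exists before any system does.
specs = [EvalSpec("explains with example", has_concrete_example, min_pass_rate=0.9)]
```

Because the specs are plain data, changing the bar shows up as a diff to `min_pass_rate` or to the check itself, which is what makes lowering it an explicit decision rather than a silent one. For stochastic systems you would call `system` several times per question and apply the spec to the distribution rather than a single answer.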
Key Takeaways
Evaluation is where your agent meets the world.
- An evaluation is a function from output (plus context) to a quality signal. That signal is always a proxy. It measures what you chose to encode and is blind to everything else.
- Letting a model grade itself in the same call produces rationalizations, not independent judgments. Separate-call LLM-as-judge can be a powerful evaluator, but only when you understand and calibrate its biases against human ground truth.
- Stochastic outputs require distributional thinking. You run multiple times, measure variance, combine deterministic structural checks with content-focused evaluators, and build versioned datasets that reflect real, edge, and adversarial usage.
- Sustainable evaluation is layered: cheap deterministic checks everywhere, LLM-as-judge at scale, human evaluation for calibration and high-stakes cases. Optimizing a metric is not the same as optimizing user value, especially when the system learns to game the metric.
- Evaluation-driven development treats evaluations as first-class specifications. You define what “good” means in measurable terms before building, then iterate until your system meets those criteria—adjusting either system or criteria with explicit justification.