You can run an evaluation without changing anything. In Chapter 8, we treated evaluation as the system looking at itself: measuring accuracy, latency, cost, user satisfaction. You took an output, judged it, and got a number or a label back. That was enough to decide whether the system “works” in the statistical sense.

But from the user’s perspective, something is missing. If the first answer is wrong and the system knows it is wrong, the second answer still often looks the same. Agents can explain in detail what went wrong yet repeat the same mistake on the next task. Models can expose flaws in their own reasoning without changing that reasoning on subsequent calls.

This is where feedback enters. Feedback is not another word for evaluation. Evaluation is a measurement of output quality. Feedback is information that the system uses to modify subsequent outputs, such as by updating prompts, selecting tools, or adjusting control flow based on evaluation results. Evaluation tells you how good something is. Feedback changes what happens next.

The difference becomes clear when you implement it in code. A failed unit test and a “7/10, could be better” rating are both evaluations. Only one will reliably make your system improve when you wire it into the loop. The other just records that things are bad and then cheerfully continues as before.

This is an architectural decision: if you treat feedback as “add a quality score somewhere,” you build systems that can describe their own failures but keep repeating them. If you treat feedback as a structured signal that directs the next step, you get systems that actually move, step by step, toward better outputs.

So the first question is not “how do I add a feedback loop?” but something sharper: When you already know how to evaluate, what does feedback add beyond a score, where should it come from, and how do you decide when enough feedback is enough?
9.1 Feedback as a Driver of Change Beyond Evaluation
[DEMO: A side-by-side view. On the left, a code-generating agent with a single “quality score” from 1–10; each iteration regenerates code using only the new score. On the right, the same agent, but the evaluator returns exact failing test names, error messages, and line numbers, which are fed into a targeted “fix this code” prompt. Users can click “iterate” several times on both sides and watch one side oscillate while the other converges to passing tests.]

Evaluation answers, “how good was that?” Feedback must answer a harder question: “how should the next attempt change?” If evaluation says “this is wrong,” it tells you nothing about how to fix it. A 6/10 does not tell you which part to rewrite, whether the issue is accuracy or style, or whether the problem is in the introduction or the conclusion. At some point, “this is bad” has to become “fix X at line Y by doing Z.” The signal needs enough structure for an agent to treat it as a direction instead of a complaint.

Evaluation measures; feedback directs improvement. A score of 6/10 doesn’t tell you how to get to 8/10. Useful feedback is specific (identifies the problem), located (points to where), and actionable (suggests what to do). Test failures are ideal feedback: exact error, exact line, exact mismatch. Model-based feedback approximates this through detailed evaluation prompts.
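Consider a naive loop that regenerates using only the score. Each pass produces a new number, but nothing about how you generate changes.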
Now contrast that with a loop where the evaluator returns specific issues and the next step uses them to revise the output:
- Generate an output for the task.
- Evaluate it, but return issues, not just a score.
- Feed those issues back in as explicit instructions for how to revise.
- Repeat until the score crosses a threshold or you hit an iteration limit.
Structuring the loop this way has a few practical consequences:
- Generic feedback like “could be clearer” forces the model to guess what to change. Detailed feedback like “paragraph 3 introduces vectorization without defining it; add a one-sentence definition” tells it exactly where to act.
- You can decouple how you evaluate from how you improve. The evaluator can be another model, a test runner, or a custom checker. As long as it emits specific, located, actionable issues, the improvement step looks the same.
- Once you adopt this pattern, a lot of vague “feedback loops” stop looking like feedback loops at all. If nothing in your system points to what to change, you are evaluating, not feeding back.
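The generate, evaluate, revise steps listed above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the `Issue` and `Evaluation` shapes, `buildRevisionPrompt`, and `improveWithFeedback` are hypothetical names, and `generate` and `evaluate` are assumed to be functions you supply (synchronous here for brevity; real model calls would normally be asynchronous).

```typescript
// Hypothetical types for illustration; none of these names come from a real library.
interface Issue {
  problem: string;    // specific: what is wrong
  location: string;   // located: where it is
  suggestion: string; // actionable: what to do about it
}

interface Evaluation {
  score: number; // assumed to be in the range 0..1
  issues: Issue[];
}

type Generate = (prompt: string) => string;
type Evaluate = (output: string) => Evaluation;

// Turn structured issues into explicit revision instructions.
function buildRevisionPrompt(task: string, previous: string, issues: Issue[]): string {
  const lines = issues.map(
    (issue, i) => `${i + 1}. At ${issue.location}: ${issue.problem}. Fix: ${issue.suggestion}`
  );
  return [
    `Task: ${task}`,
    `Previous attempt:\n${previous}`,
    `Revise the attempt to resolve these issues:\n${lines.join("\n")}`,
  ].join("\n\n");
}

// Generate, evaluate, and revise until the score crosses the threshold
// or the iteration limit is reached.
function improveWithFeedback(
  task: string,
  generate: Generate,
  evaluate: Evaluate,
  threshold = 0.8,
  maxIterations = 3
): { output: string; score: number; iterations: number } {
  let output = generate(`Task: ${task}`);
  let evaluation = evaluate(output);
  let iterations = 0;

  while (evaluation.score < threshold && iterations < maxIterations) {
    output = generate(buildRevisionPrompt(task, output, evaluation.issues));
    evaluation = evaluate(output);
    iterations += 1;
  }
  return { output, score: evaluation.score, iterations };
}
```

The improvement step only depends on the `Issue` shape, so the evaluator behind `evaluate` could be a test runner, a custom checker, or another model without changing the loop.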
9.2 Trustworthiness of Feedback Sources
[DEMO: Three panels evaluating the same generated answer. Panel A: ground-truth checker (e.g. unit tests or fact database) highlights concrete failures. Panel B: a single LLM both generates and then “reviews” its own answer in one call; sometimes it waves errors through. Panel C: a separate LLM call, with a strict evaluation prompt, flags more issues. A fourth view shows three evaluators with different criteria (accuracy, clarity, completeness) disagreeing; the user can apply simple prioritization rules to see which changes propagate.]

Once feedback directs change, the next question is: how much can you trust the director? If the same model that wrote the answer also critiques it, the critique can be biased toward justifying the original. When you already have an external test suite, a model that says “this looks fine” should not override failing tests. When multiple evaluators disagree, with one calling the answer “factually wrong” and another “beautifully written,” your loop must follow a clear priority rule. Adding more evaluators can surface complementary issues, but it can also drown the agent in conflicting instructions if you do not define how those signals combine.

External verification (test suites, schema checkers, curated fact databases) provides deterministic checks within their domain; rely on them when they exist. Model-based feedback is next best, but requires separate calls for independence and can hallucinate issues or miss real ones. You should monitor agreement between model reviewers and external checks, and handle cases where the reviewer returns invalid JSON or no issues despite known failures. Multiple evaluators with different focuses (accuracy, clarity, completeness) catch complementary issues. Conflicting feedback requires prioritization by severity, confidence, or domain relevance.
Two design choices matter here. First, issues raised by external verification are marked critical. Your improvement step should treat those as non-negotiable. Models can opine on style; tests decide whether the code runs.
Second, model-based feedback is separated into its own call (modelReview). You do not ask the same model that just generated the answer to both produce and critique in one shot. Independence matters. It forces the system to look at the output as an object to be judged, not as part of a single “assistant” persona that may want to save face.
A stripped-down example of that separation looks like this:
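One way to sketch this separation is shown below. It assumes a hypothetical `callModel(system, user)` client that you inject (synchronous here for brevity), and the reviewer prompt, the `ReviewIssue` shape, and `generateAnswer` are illustrative inventions; only the name `modelReview` comes from the text. The reviewer runs as its own call with its own system prompt, and its output is validated rather than trusted.

```typescript
// Hypothetical model client: takes a system prompt and a user message, returns text.
type CallModel = (system: string, user: string) => string;

type Severity = "critical" | "major" | "minor";

interface ReviewIssue {
  severity: Severity;
  location: string;
  problem: string;
  suggestion: string;
}

type ReviewResult =
  | { ok: true; issues: ReviewIssue[] }
  | { ok: false; error: string };

// Illustrative reviewer prompt; the wording is an assumption.
const REVIEWER_SYSTEM =
  "You are a strict reviewer. You did not write this answer. " +
  "Return only a JSON array of issues with fields severity, location, problem, suggestion.";

// Call 1: produce the answer.
function generateAnswer(callModel: CallModel, task: string): string {
  return callModel("You are a careful assistant. Answer the task.", task);
}

// Runtime check that a parsed value matches the ReviewIssue shape.
function isReviewIssue(value: unknown): value is ReviewIssue {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const severity = v["severity"];
  return (
    (severity === "critical" || severity === "major" || severity === "minor") &&
    typeof v["location"] === "string" &&
    typeof v["problem"] === "string" &&
    typeof v["suggestion"] === "string"
  );
}

// Call 2: review the answer in a separate call with a strict, typed output.
function modelReview(callModel: CallModel, task: string, answer: string): ReviewResult {
  const raw = callModel(REVIEWER_SYSTEM, `Task:\n${task}\n\nAnswer to review:\n${answer}`);
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { ok: false, error: "reviewer returned invalid JSON" };
  }
  if (!Array.isArray(parsed) || !parsed.every(isReviewIssue)) {
    return { ok: false, error: "reviewer output did not match the issue schema" };
  }
  return { ok: true, issues: parsed };
}
```

Returning an explicit failure instead of an empty issue list keeps "the reviewer found nothing" distinct from "the reviewer broke", which matters when external checks are known to be failing.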
When evaluators disagree, a simple priority order settles the conflict:
- Always fix critical issues from external sources.
- Then fix major model-identified issues (e.g. accuracy and logic).
- Only if budget remains, address minor issues like phrasing and polish.
- Whenever you can build or borrow an external verifier—tests, schema validators, constraint checkers—do it. Those sources give you feedback you do not need to “trust” in the probabilistic sense. They are deterministic guards within their domain.
- When you use model-based feedback, isolate it. Make it a separate call with a strict prompt and a strictly-typed output. Treat it as another component, not as a mysterious inner voice of the same agent.
- Instead of asking “can I trust this model?,” ask “what domain am I trusting it with?” It might be acceptable to let an LLM judge clarity, but not to overrule a failing test or a violated business constraint.
- When evaluators disagree, do not let the model resolve the conflict implicitly. Express your priorities in code. You control what “matters” for the task.
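Expressing the priority order in code might look like the sketch below. The `Issue` shape, `priority`, and `selectIssues` are hypothetical, and the `budget` parameter (the maximum number of issues addressed per revision) is an assumption introduced for illustration. External issues are always kept, even when they exceed the budget; model-identified issues only fill whatever room remains.

```typescript
type Source = "external" | "model";
type Severity = "critical" | "major" | "minor";

interface Issue {
  source: Source;
  severity: Severity;
  description: string;
}

// Lower number means "fix earlier".
function priority(issue: Issue): 0 | 1 | 2 {
  if (issue.source === "external") return 0; // external failures are non-negotiable
  if (issue.severity === "minor") return 2;  // polish, only if budget remains
  return 1;                                  // model-identified accuracy and logic issues
}

// Pick the issues to pass to the next revision step.
function selectIssues(issues: Issue[], budget: number): Issue[] {
  const ordered = [...issues].sort((a, b) => priority(a) - priority(b));
  const mustFix = ordered.filter((issue) => priority(issue) === 0);
  const optional = ordered.filter((issue) => priority(issue) > 0);
  const remaining = Math.max(0, budget - mustFix.length);
  return [...mustFix, ...optional.slice(0, remaining)];
}
```

Note that a model-reported "critical" issue still ranks below any external failure here; that is a design choice of this sketch, reflecting the idea that a model's opinion should not outrank a failing test.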
9.3 Stopping Conditions for Iterative Improvement
[DEMO: A timeline visualization of iterative runs. Each iteration shows a score and a sparkline of changes. In one scenario, scores increase then plateau; the system stops once gains fall below a small delta. In another, scores bounce up and down; the system stops when the best-so-far stops improving. A third scenario shows a cheap threshold (e.g. “≥0.8 is good enough”); the system stops early even though further small improvements would be possible at higher cost.]

Once you have feedback you trust, it is tempting to keep going. If the current score is 0.84, pushing to 0.9 feels attractive. If every iteration catches at least one more small issue, stopping at three feels premature. And yet you have seen iterations that fix one bug while introducing another, or that refine phrasing endlessly without adding substance. So you need a clear rule for when to stop. You can stop when the score crosses a threshold, when the evaluator stops finding meaningful issues, or when changes become smaller than the noise in your evaluation. When quality plateaus well below your target, you also need to decide whether the feedback is too weak or the task is simply beyond the model.

Stop when quality exceeds a threshold (good enough), improvement stalls (diminishing returns), or iterations exceed a limit (resource constraint). Plateaus indicate either that the feedback isn’t actionable enough or that you’ve reached the capability ceiling for the task. Track the improvement rate: if the delta approaches zero, more iteration won’t help.
First, a quality threshold and an iteration cap guarantee that the loop terminates. Second, a minDelta threshold implements diminishing returns: if the latest iteration improved the score by less than, say, 0.02, you treat that as a plateau. At that point, either you have reached the limit of what this model-plus-feedback combination can do, or your evaluator is too noisy to justify further refinement.
Third, keep track of the best output seen so far. Iteration can degrade quality—an “improvement” might fix a minor issue while breaking something major. Returning the last output is risky. Returning the best seen so far gives you monotonic improvement at the level of the final result, even if the path wobbles.
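The three stopping rules and the best-so-far bookkeeping can be combined in one loop, sketched below. Only `minDelta` is named in the text; `Attempt`, `LoopOptions`, `refineUntilDone`, and the injected `improve` function are hypothetical names for illustration. In this sketch each revision starts from the best attempt seen so far, and gain is measured against that best score.

```typescript
interface Attempt {
  output: string;
  score: number; // assumed to be in the range 0..1
}

type StopReason = "threshold" | "plateau" | "max_iterations";

interface LoopOptions {
  threshold: number;     // stop once the best score reaches this value
  minDelta: number;      // smallest gain over the best score that still counts as progress
  maxIterations: number; // hard cap on revision steps
}

function refineUntilDone(
  initial: Attempt,
  improve: (best: Attempt) => Attempt,
  options: LoopOptions
): { best: Attempt; history: number[]; reason: StopReason } {
  let best = initial;
  const history: number[] = [initial.score];

  for (let i = 0; i < options.maxIterations; i++) {
    if (best.score >= options.threshold) {
      return { best, history, reason: "threshold" };
    }
    const next = improve(best);
    history.push(next.score);

    const gain = next.score - best.score;
    if (next.score > best.score) best = next; // never let a worse attempt replace the best

    if (gain < options.minDelta && best.score < options.threshold) {
      return { best, history, reason: "plateau" };
    }
  }
  const reason: StopReason = best.score >= options.threshold ? "threshold" : "max_iterations";
  return { best, history, reason };
}
```

The returned `history` keeps every score, including regressions, so the shape of the run can be inspected afterwards even though only the best attempt is returned.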
The implications for system behavior are predictable:
- If your history shows scores climbing and then flattening, you have a textbook case of diminishing returns. The right design choice is to stop and ship, not to keep spinning.
- If your history oscillates wildly—0.6, 0.8, 0.62, 0.79—you likely have noisy or conflicting feedback. The loop is not converging; it is wandering. The fix is not “try more iterations” but “improve the feedback source or lower the temperature.”
- If scores improve for two or three iterations and then stall below your target (e.g. stuck at 0.7), you may have hit the model’s capability ceiling for this task with this prompting strategy. At that point, you are in design territory: change the decomposition, add tools, or accept that this system cannot solve this task reliably.
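The three patterns above can be told apart mechanically from a score history. The sketch below is one rough heuristic, not an established method: the `Diagnosis` labels, the `diagnoseHistory` name, the default `minDelta` of 0.02, and the rule "two or more significant direction changes means oscillation" are all assumptions chosen for illustration.

```typescript
type Diagnosis =
  | "plateau"              // flattened at or above target: stop and ship
  | "stalled_below_target" // flattened below target: likely a capability or design limit
  | "oscillating"          // noisy or conflicting feedback
  | "inconclusive";        // still moving, or too little data

function diagnoseHistory(history: number[], target: number, minDelta = 0.02): Diagnosis {
  if (history.length < 2) return "inconclusive";

  // Successive score changes.
  const deltas: number[] = [];
  let previous = history[0] ?? 0;
  for (const score of history.slice(1)) {
    deltas.push(score - previous);
    previous = score;
  }

  // Count direction changes among moves large enough to matter.
  const signs = deltas.filter((d) => Math.abs(d) >= minDelta).map((d) => Math.sign(d));
  let flips = 0;
  signs.forEach((sign, i) => {
    if (i > 0 && sign !== signs[i - 1]) flips += 1;
  });
  if (flips >= 2) return "oscillating";

  const lastScore = history[history.length - 1] ?? 0;
  const lastDelta = deltas[deltas.length - 1] ?? 0;
  if (Math.abs(lastDelta) >= minDelta) return "inconclusive";
  return lastScore >= target ? "plateau" : "stalled_below_target";
}
```

With a target of 0.8, the oscillating history from the list (0.6, 0.8, 0.62, 0.79) is flagged as `"oscillating"`, while a run that flattens at 0.7 is flagged as `"stalled_below_target"`.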
9.4 Cost–Quality Tradeoffs in Iterative Improvement
[DEMO: A cost-quality tradeoff playground. Users can choose a task, then configure “max iterations” and “candidate count.” The UI shows total tokens spent, total latency, and final quality score for different strategies: single-shot, up to 3 iterations, 5 iterations, or parallel generation with selection. A graph displays marginal quality gain per extra iteration side by side with marginal cost.]

If improvement is almost always possible in principle, it is tempting to keep iterating. You could always add one more pass, one more candidate, or one more refinement, but each extra step has cost. If you had infinite compute and zero latency, you might ignore those costs. In practice, every generation and evaluation step costs tokens, wall-clock time, and operational complexity. So the decision is whether one more loop is worth it. When tests give you perfect feedback, the decision is simple: keep going until they pass or the cost is unacceptable. When you lack hard tests and rely on fuzzy model judgments, the costs and benefits are harder to see. If evaluation is expensive, it might be cheaper to generate multiple candidates and let a human pick the best for high-value tasks. If your agents run unsupervised but must meet strict latency budgets, you may only get one or two iterations before you violate SLAs.

Each iteration has a cost (latency, compute, money) and a diminishing benefit. The first iteration catches big issues; subsequent iterations catch smaller ones. Keep iterating only while the expected quality gain from one more loop justifies the extra latency and token cost it would add. For example, stop when recent iterations improve the score by less than a small delta while consuming the same budget as earlier iterations. When tests don’t exist, model-based evaluation substitutes: less precise, but often sufficient.
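A budget-aware version of the "is one more loop worth it?" decision could be sketched as below. Everything here is an assumption made for illustration: the `Step` and `Budget` shapes, the `shouldIterateAgain` name, the choice of "score gain per 1,000 tokens" as the exchange rate between quality and cost, and the guess that the next step will cost about as much as the last one.

```typescript
interface Step {
  score: number;     // quality after this step
  tokens: number;    // tokens spent on this step (generation plus evaluation)
  latencyMs: number; // wall-clock time of this step
}

interface Budget {
  maxTokens: number;
  maxLatencyMs: number;
  minGainPer1kTokens: number; // smallest score gain per 1,000 tokens still worth paying for
}

function shouldIterateAgain(steps: Step[], budget: Budget): boolean {
  const last = steps[steps.length - 1];
  if (last === undefined) return true; // nothing generated yet

  const spentTokens = steps.reduce((sum, s) => sum + s.tokens, 0);
  const spentLatency = steps.reduce((sum, s) => sum + s.latencyMs, 0);

  // Assume the next step costs about as much as the last one.
  if (spentTokens + last.tokens > budget.maxTokens) return false;
  if (spentLatency + last.latencyMs > budget.maxLatencyMs) return false;

  const previous = steps[steps.length - 2];
  if (previous === undefined) return true; // no trend to judge yet

  // Marginal gain of the last step, normalized by what it cost.
  const gain = last.score - previous.score;
  const gainPer1k = gain / (Math.max(last.tokens, 1) / 1000);
  return gainPer1k >= budget.minGainPer1kTokens;
}
```

The hard limits are checked before the marginal-gain rule, so a latency budget is never exceeded just because the last iteration happened to help a lot.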
The strategies worth comparing are:
- Single-shot generation
- One iteration of feedback
- Three iterations of feedback
- Parallel generation with selection only
- Hybrid: parallel generation + a few iterations on the best
Which one fits depends on the workload:
- For interactive chat, you might prefer low-latency, low-iteration behavior: one shot, maybe one refinement.
- For batch offline jobs (e.g. generating documentation overnight), you can afford more loops and higher candidate counts.
- For safety-critical actions (e.g. code that will run in production), you might combine tests, multiple candidates, and iteration, accepting higher cost in exchange for reliability.
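The hybrid strategy and the workload profiles above could be sketched as a small configuration plus one runner. The names (`Strategy`, `STRATEGY_BY_WORKLOAD`, `runStrategy`) and the specific candidate and iteration counts are illustrative assumptions, not recommendations from the text. Candidates are generated sequentially here for simplicity; in practice they would usually be produced in parallel.

```typescript
interface Candidate {
  output: string;
  score: number;
}

interface Strategy {
  candidateCount: number; // how many candidates to generate before selecting
  maxIterations: number;  // how many refinement passes to run on the best one
}

type Workload = "interactive" | "batch" | "safetyCritical";

// Illustrative numbers only; tune them against measured cost and quality.
const STRATEGY_BY_WORKLOAD: Record<Workload, Strategy> = {
  interactive: { candidateCount: 1, maxIterations: 1 },
  batch: { candidateCount: 3, maxIterations: 3 },
  safetyCritical: { candidateCount: 5, maxIterations: 5 },
};

function runStrategy(
  strategy: Strategy,
  generate: (seed: number) => string,
  score: (output: string) => number,
  refine: (output: string) => string
): { best: Candidate; calls: number } {
  // Selection phase: generate candidates and keep the highest-scoring one.
  const first = generate(0);
  let best: Candidate = { output: first, score: score(first) };
  let calls = 1; // counts generation and refinement calls as a rough cost proxy

  for (let seed = 1; seed < Math.max(1, strategy.candidateCount); seed++) {
    const output = generate(seed);
    calls += 1;
    const candidate: Candidate = { output, score: score(output) };
    if (candidate.score > best.score) best = candidate;
  }

  // Iteration phase: refine the best candidate, keeping only improvements.
  for (let i = 0; i < strategy.maxIterations; i++) {
    const output = refine(best.output);
    calls += 1;
    const candidate: Candidate = { output, score: score(output) };
    if (candidate.score > best.score) best = candidate;
  }
  return { best, calls };
}
```

Setting `candidateCount: 1, maxIterations: 0` gives single-shot generation, and `maxIterations: 0` with a higher candidate count gives parallel generation with selection only, so the five strategies in the list are all points in the same two-parameter space.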