You can run an evaluation without changing anything.
In Chapter 8, we treated evaluation as the system looking at itself: measuring accuracy, latency, cost, user satisfaction. You took an output, judged it, and got a number or a label back. That was enough to decide whether the system “works” in the statistical sense.
But from the user’s perspective, something is missing. If the first answer is wrong and the system knows it is wrong, the second answer still often looks the same. Agents can explain in detail what went wrong yet repeat the same mistake on the next task. Models can expose flaws in their own reasoning without changing that reasoning on subsequent calls.
This is where feedback enters. Feedback is not another word for evaluation. Evaluation is a measurement of output quality. Feedback is information that the system uses to modify subsequent outputs, such as by updating prompts, selecting tools, or adjusting control flow based on evaluation results. Evaluation tells you how good something is. Feedback changes what happens next.
The difference becomes clear when you implement it in code. A failed unit test and a “7/10, could be better” rating are both evaluations. Only one will reliably make your system improve when you wire it into the loop. The other just records that things are bad and then cheerfully continues as before.
This is an architectural decision: if you treat feedback as “add a quality score somewhere,” you build systems that can describe their own failures but keep repeating them. If you treat feedback as a structured signal that directs the next step, you get systems that actually move—step by step—toward better outputs.
So the first question is not “how do I add a feedback loop?” but something sharper:
When you already know how to evaluate, what does feedback add beyond a score, where should it come from, and how do you decide when enough feedback is enough?
9.1 Feedback as a Driver of Change Beyond Evaluation
[DEMO: A side-by-side view. On the left, a code-generating agent with a single “quality score” from 1–10; each iteration regenerates code using only the new score. On the right, the same agent, but the evaluator returns exact failing test names, error messages, and line numbers, which are fed into a targeted “fix this code” prompt. Users can click “iterate” several times on both sides and watch one side oscillate while the other converges to passing tests.]
Evaluation answers, “how good was that?” Feedback must answer a harder question: “how should the next attempt change?”
If evaluation says “this is wrong,” it tells you nothing about how to fix it. A 6/10 does not tell you which part to rewrite, whether the issue is accuracy or style, or whether the problem is in the introduction or the conclusion. At some point, “this is bad” has to become “fix X at line Y by doing Z.” The signal needs enough structure for an agent to treat it as a direction instead of a complaint.
Evaluation measures; feedback directs improvement. A score of 6/10 doesn’t tell you how to get to 8/10. Useful feedback is specific (identifies the problem), located (points to where), and actionable (suggests what to do). Test failures are ideal feedback—exact error, exact line, exact mismatch. Model-based feedback approximates this through detailed evaluation prompts.
Mechanically, feedback is just another piece of text going into the context. The difference is how much structure you give it and how explicitly you connect it to the next generation step.
To see what evaluation without feedback looks like in code, consider a loop that measures quality but never uses the result:
async function evaluateOnly(task: string) {
const output = await generate(task);
const score = await evaluate(task, output); // e.g. number from 0 to 1
// Score is logged, but not used to change anything
console.log('Score:', score);
return output;
}
The system “knows” the score, but that knowledge is inert. Nothing about the next call to generate changes.
Now contrast that with a loop where the evaluator returns specific issues and the next step uses them to revise the output:
type Issue = {
description: string;
location?: string; // e.g. "line 23" or "paragraph 2"
suggestion?: string; // what to change
};
type EvalResult = {
score: number;
issues: Issue[];
};
async function generateWithFeedback(task: string, maxIters = 3) {
let output = await generate(task);
for (let i = 0; i < maxIters; i++) {
const evalResult: EvalResult = await evaluateDetailed(task, output);
if (evalResult.score >= 0.9 || evalResult.issues.length === 0) {
return output; // good enough
}
output = await improveWithIssues(task, output, evalResult.issues);
}
return output;
}
async function improveWithIssues(task: string, output: string, issues: Issue[]) {
const issueText = issues.map((issue, i) => {
return `${i + 1}. ${issue.description}
Location: ${issue.location ?? 'unspecified'}
Suggested fix: ${issue.suggestion ?? 'not specified'}`;
}).join('\n\n');
const prompt = `
You previously attempted this task:
TASK:
${task}
PREVIOUS OUTPUT:
${output}
IDENTIFIED ISSUES:
${issueText}
INSTRUCTION:
Produce a revised version that fixes these issues.
Preserve everything that is already correct.
Only change what is necessary to address the issues.`;
return await llmComplete({
prompt,
maxTokens: 1024,
temperature: 0.2
});
}
The mechanism works as follows:
- Generate an output for the task.
- Evaluate it, but return issues, not just a score.
- Feed those issues back in as explicit instructions for how to revise.
- Repeat until the score crosses a threshold or you hit an iteration limit.
In practice, you also need to guard against long inputs by truncating or summarizing the task, output, and issue list so the prompt fits within model limits.
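A minimal guard, assuming character-based truncation is acceptable (the limits below are arbitrary placeholders, and a real system might summarize instead of cutting):

// Crude length guard for the revision prompt. The character budgets are
// illustrative placeholders; a real system might summarize instead of truncating.
function truncateForPrompt(text: string, maxChars: number): string {
  return text.length <= maxChars ? text : text.slice(0, maxChars) + '\n…[truncated]';
}

function boundedIssueText(issues: Issue[], maxIssues = 10, maxCharsPerIssue = 400): string {
  return issues
    .slice(0, maxIssues) // keep only the first (ideally highest-priority) issues
    .map((issue, i) =>
      truncateForPrompt(
        `${i + 1}. ${issue.description} (${issue.location ?? 'unspecified location'})`,
        maxCharsPerIssue
      )
    )
    .join('\n');
}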
The important architectural shift is not the loop itself, but the shape of the evaluator’s output. Instead of a scalar (“7/10”), it returns structured guidance: each issue, where it occurs, and what to do about it. That structure is what makes the next step targeted.
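The evaluateDetailed helper used in the loop was left undefined; one way to get that shape is a single model call with a strict output format. The following is a minimal sketch, assuming the same llmComplete helper as above; the prompt wording and JSON handling are illustrative, not a fixed API:

// Sketch of a model-based evaluateDetailed. Assumes the llmComplete helper used
// elsewhere in this chapter; prompt wording and parsing strategy are illustrative.
async function evaluateDetailed(task: string, output: string): Promise<EvalResult> {
  const prompt = `
You are evaluating an output for the following task.

TASK:
${task}

OUTPUT:
${output}

Return JSON of the form:
{"score": <number from 0 to 1>, "issues": [{"description": "...", "location": "...", "suggestion": "..."}]}

List only real, specific issues. An empty issues array means the output is acceptable.`;

  const raw = await llmComplete({ prompt, maxTokens: 1024, temperature: 0 });
  try {
    return JSON.parse(raw) as EvalResult;
  } catch {
    // Do not silently accept: the calling loop treats an empty issue list as
    // "good enough", so surface parse failures instead of guessing.
    throw new Error('evaluateDetailed: evaluator did not return valid JSON');
  }
}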
The implications are immediate:
- Generic feedback like “could be clearer” forces the model to guess what to change. Detailed feedback like “paragraph 3 introduces vectorization without defining it; add a one-sentence definition” tells it exactly where to act.
- You can decouple how you evaluate from how you improve. The evaluator can be another model, a test runner, or a custom checker. As long as it emits specific, located, actionable issues, the improvement step looks the same (a sketch of such an adapter follows this list).
- Once you adopt this pattern, a lot of vague “feedback loops” stop looking like feedback loops at all. If nothing in your system points to what to change, you are evaluating, not feeding back.
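As an example of that decoupling, a test runner’s raw failures can be adapted into the same Issue shape the improvement step already consumes. The TestFailure type below is a hypothetical stand-in for whatever your runner reports:

// Hypothetical shape of what a test runner reports; adjust to your runner.
type TestFailure = {
  testName: string;
  message: string;   // e.g. "expected 4, received 5"
  file?: string;
  line?: number;
};

// Adapter: turn raw test failures into the Issue shape used by improveWithIssues.
function testFailuresToIssues(failures: TestFailure[]): Issue[] {
  return failures.map(f => ({
    description: `Test "${f.testName}" failed: ${f.message}`,
    location: f.file ? `${f.file}${f.line != null ? `:${f.line}` : ''}` : undefined,
    suggestion: 'Change the code so this test passes without weakening the test.'
  }));
}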
9.2 Trustworthiness of Feedback Sources
[DEMO: Three panels evaluating the same generated answer. Panel A: ground-truth checker (e.g. unit tests or fact database) highlights concrete failures. Panel B: a single LLM both generates and then “reviews” its own answer in one call; sometimes it waves errors through. Panel C: a separate LLM call, with a strict evaluation prompt, flags more issues. A fourth view shows three evaluators with different criteria (accuracy, clarity, completeness) disagreeing; the user can apply simple prioritization rules to see which changes propagate.]
Once feedback directs change, the next question is: how much can you trust the director?
If the same model that wrote the answer also critiques it, the critique can be biased toward justifying the original. When you already have an external test suite, a model that says “this looks fine” should not override failing tests. When multiple evaluators disagree—one calling the answer “factually wrong” and another “beautifully written”—your loop must follow a clear priority rule. Adding more evaluators can surface complementary issues, but it can also drown the agent in conflicting instructions if you do not define how those signals combine.
External verifiers (test suites, schema checkers, curated fact databases) provide deterministic checks within their domains—rely on them when they exist. Model-based feedback is next best, but it requires separate calls for independence and can hallucinate issues or miss real ones. Monitor agreement between model reviewers and external checks, and handle cases where the reviewer returns invalid JSON or reports no issues despite known failures. Multiple evaluators with different focuses (accuracy, clarity, completeness) catch complementary issues. Conflicting feedback requires prioritization—by severity, confidence, or domain relevance.
Mechanically, a feedback source is just a function you call. The trustworthiness comes from how that function is implemented and what domain it covers.
Here is a skeleton that combines three kinds of sources:
type FeedbackSource = 'tests' | 'facts' | 'model-accuracy' | 'model-clarity';
type TrustedIssue = Issue & {
source: FeedbackSource;
severity: 'critical' | 'major' | 'minor';
};
type MultiEvalResult = {
score: number;
issues: TrustedIssue[];
};
async function evaluateWithSources(task: string, output: string): Promise<MultiEvalResult> {
const [testIssues, factIssues, modelIssues] = await Promise.all([
runTestSuite(task, output), // external tests, if any
checkFacts(task, output), // external knowledge, if any
modelReview(task, output) // LLM-based critique
]);
  const allIssues: TrustedIssue[] = [
    ...testIssues.map((i): TrustedIssue => ({ ...i, source: 'tests', severity: 'critical' })),
    ...factIssues.map((i): TrustedIssue => ({ ...i, source: 'facts', severity: 'critical' })),
    ...modelIssues.map((i): TrustedIssue => ({
      ...i,
      source: i.category === 'accuracy' ? 'model-accuracy' : 'model-clarity',
      severity: i.category === 'accuracy' ? 'major' : 'minor'
    }))
  ];
const score = scoreFromIssues(allIssues); // e.g. lower with more severe issues
return { score, issues: prioritizeIssues(allIssues) };
}
There are several design choices encoded here.
First, external verification dominates. If tests or fact checks produce issues, they are tagged as critical. Your improvement step should treat those as non-negotiable. Models can opine on style; tests decide whether the code runs.
Second, model-based feedback is separated into its own call (modelReview). You do not ask the same model that just generated the answer to both produce and critique in one shot. Independence matters. It forces the system to look at the output as an object to be judged, not as part of a single “assistant” persona that may want to save face.
A stripped-down example of that separation looks like this:
async function modelReview(
  task: string,
  output: string
): Promise<(Issue & { category: 'accuracy' | 'logic' | 'clarity' })[]> {
const reviewPrompt = `
You are reviewing an output for the following task:
TASK:
${task}
OUTPUT TO REVIEW:
${output}
Identify any issues with:
1. Factual accuracy
2. Logical consistency
3. Clarity and structure
For each issue, provide:
- category: "accuracy" | "logic" | "clarity"
- description
- location (if specific)
- suggested fix
Respond as JSON array of issues.`;
  const reviewText = await llmComplete({
    prompt: reviewPrompt,
    maxTokens: 1024,
    temperature: 0.2
  });
  return JSON.parse(reviewText);
}
This is the same pattern you saw earlier, but with a crucial constraint: the reviewer sees the task and the output, not the entire conversation that produced it. It becomes an external evaluator in the architectural sense, even if it runs on the same underlying model weights.
Model reviewers have their own failure modes. They can hallucinate issues that do not exist, overlook critical errors, or fail to follow the output schema. You should log reviewer outputs, check them against any ground truth you have, and harden the parsing path—fallback to safe defaults when JSON is invalid, and avoid treating a “no issues” response as authoritative when other checks say otherwise.
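A thin wrapper can make those guards explicit. This sketch assumes the modelReview function above and takes whatever issues your external checks have already produced; the specific rules and logging are illustrative:

// Wrap the raw model reviewer with basic sanity checks against external signals.
async function guardedModelReview(
  task: string,
  output: string,
  externalIssues: Issue[]   // e.g. failures already reported by tests or fact checks
): Promise<Issue[]> {
  let reviewIssues: Issue[] = [];
  try {
    reviewIssues = await modelReview(task, output);
  } catch {
    // Invalid JSON or a failed call: treat as "reviewer unavailable", not "no issues".
    console.warn('modelReview failed; relying on external checks only');
  }

  // If external checks already found problems but the reviewer sees none,
  // do not let the optimistic review dilute the signal.
  if (externalIssues.length > 0 && reviewIssues.length === 0) {
    console.warn('Reviewer reported no issues despite external failures; logging for audit');
  }
  return reviewIssues;
}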
Third, prioritization is explicit. When different sources disagree, you don’t resolve it inside the model’s “intuition”; you resolve it in code, using a simple rule:
- Always fix critical issues from external sources.
- Then fix major model-identified issues (e.g. accuracy and logic).
- Only if budget remains, address minor issues like phrasing and polish.
function prioritizeIssues(issues: TrustedIssue[]): TrustedIssue[] {
  const order = { critical: 0, major: 1, minor: 2 } as const;
  // Copy before sorting so the caller's issue list is not mutated in place.
  return [...issues].sort((a, b) => order[a.severity] - order[b.severity]);
}
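The skeleton earlier also called a scoreFromIssues helper without defining it. A minimal sketch, assuming a simple severity-weighted penalty (the weights are arbitrary, not calibrated values):

// Start from a perfect score and subtract a penalty per issue, weighted by severity.
function scoreFromIssues(issues: TrustedIssue[]): number {
  const penalty = { critical: 0.3, major: 0.15, minor: 0.05 } as const;
  const total = issues.reduce((sum, issue) => sum + penalty[issue.severity], 0);
  return Math.max(0, 1 - total);
}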
The implications for system design are clear:
- Whenever you can build or borrow an external verifier—tests, schema validators, constraint checkers—do it. Those sources give you feedback you do not need to “trust” in the probabilistic sense. They are deterministic guards within their domain.
- When you use model-based feedback, isolate it. Make it a separate call with a strict prompt and a strictly-typed output. Treat it as another component, not as a mysterious inner voice of the same agent.
- Instead of asking “can I trust this model?,” ask “what domain am I trusting it with?” It might be acceptable to let an LLM judge clarity, but not to overrule a failing test or a violated business constraint.
- When evaluators disagree, do not let the model resolve the conflict implicitly. Express your priorities in code. You control what “matters” for the task.
9.3 Stopping Conditions for Iterative Improvement
[DEMO: A timeline visualization of iterative runs. Each iteration shows a score and a sparkline of changes. In one scenario, scores increase then plateau; the system stops once gains fall below a small delta. In another, scores bounce up and down; the system stops when the best-so-far stops improving. A third scenario shows a cheap threshold (e.g. “≥0.8 is good enough”); the system stops early even though further small improvements would be possible at higher cost.]
Once you have feedback you trust, it is tempting to keep going. If the current score is 0.84, pushing to 0.9 feels attractive. If every iteration catches at least one more small issue, stopping at three feels premature. And yet you have also seen iterations that fix one bug while introducing another, or that refine phrasing endlessly without adding substance.
So you need a clear rule for when to stop. You can stop when the score crosses a threshold, when the evaluator stops finding meaningful issues, or when changes become smaller than the noise in your evaluation. When quality plateaus well below your target, you also need to decide whether the feedback is too weak or the task is simply beyond the model.
Stop when: quality exceeds threshold (good enough), improvement stalls (diminishing returns), or iterations exceed limit (resource constraint). Plateaus indicate either feedback isn’t actionable enough, or you’ve reached the capability ceiling for the task. Track improvement rate—if delta approaches zero, more iteration won’t help.
You can encode all three stopping conditions—quality, improvement, and budget—directly into the loop. Here is a minimal pattern:
type IterationRecord = {
iteration: number;
score: number;
output: string;
};
type IterationResult = {
output: string;
converged: boolean;
reason: 'quality_threshold' | 'no_improvement' | 'max_iterations';
history: IterationRecord[];
};
async function iterateToQuality(
task: string,
initialOutput?: string,
options = { targetScore: 0.9, maxIterations: 4, minDelta: 0.02 }
): Promise<IterationResult> {
let currentOutput = initialOutput ?? await generate(task);
let history: IterationRecord[] = [];
let bestOutput = currentOutput;
let bestScore = 0;
for (let i = 0; i < options.maxIterations; i++) {
const evalResult = await evaluateWithSources(task, currentOutput);
const { score, issues } = evalResult;
history.push({ iteration: i, score, output: currentOutput });
if (score > bestScore) {
bestScore = score;
bestOutput = currentOutput;
}
// 1. Stop if good enough
if (score >= options.targetScore) {
return {
output: currentOutput,
converged: true,
reason: 'quality_threshold',
history
};
}
// 2. Stop if improvement stalls
const prev = history[history.length - 2];
if (prev && score - prev.score < options.minDelta) {
return {
output: bestOutput,
converged: true,
reason: 'no_improvement',
history
};
}
// 3. Otherwise, improve using feedback and continue
currentOutput = await improveWithIssues(task, currentOutput, issues);
}
// 4. Stop if out of iterations
return {
output: bestOutput,
converged: false,
reason: 'max_iterations',
history
};
}
Several principles are embedded here.
First, “good enough” is explicit. You do not chase the maximum possible score; you set a target that matches your use case. For a user-facing chat, 0.85 might be fine. For code that runs in production, you might require tests to pass and effectively demand 1.0 on that dimension.
Second, you track change, not just absolute score. The minDelta threshold implements diminishing returns: if the latest iteration only improved by less than, say, 0.02, you treat that as a plateau. At that point, you either reached the limit of what this model-plus-feedback combination can do, or your evaluator is too noisy to justify further refinement.
Third, keep track of the best output seen so far. Iteration can degrade quality—an “improvement” might fix a minor issue while breaking something major. Returning the last output is risky. Returning the best seen so far gives you monotonic improvement at the level of the final result, even if the path wobbles.
The implications for system behavior are predictable:
- If your history shows scores climbing and then flattening, you have a textbook case of diminishing returns. The right design choice is to stop and ship, not to keep spinning.
- If your history oscillates wildly—0.6, 0.8, 0.62, 0.79—you likely have noisy or conflicting feedback. The loop is not converging; it is wandering. The fix is not “try more iterations” but “improve the feedback source or lower the temperature.” (A diagnostic sketch for telling these cases apart follows this list.)
- If scores improve for two or three iterations and then stall below your target (e.g. stuck at 0.7), you may have hit the model’s capability ceiling for this task with this prompting strategy. At that point, you are in design territory: change the decomposition, add tools, or accept that this system cannot solve this task reliably.
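A small diagnostic over the iteration history can tell these cases apart automatically. This sketch reuses the IterationRecord type from iterateToQuality; the thresholds are assumptions to tune against your own runs:

// Classify a run from its score history: converging, plateaued, or oscillating.
type RunDiagnosis = 'converging' | 'plateaued' | 'oscillating' | 'too_short';

function diagnoseRun(history: IterationRecord[], minDelta = 0.02): RunDiagnosis {
  if (history.length < 3) return 'too_short';

  // Score change between consecutive iterations.
  const deltas = history.slice(1).map((r, i) => r.score - history[i].score);

  // Frequent sign flips in the deltas suggest noisy or conflicting feedback.
  const reversals = deltas.slice(1).filter((d, i) => Math.sign(d) !== Math.sign(deltas[i])).length;
  if (reversals >= deltas.length / 2) return 'oscillating';

  // A tiny most-recent gain suggests diminishing returns.
  const recent = deltas[deltas.length - 1];
  if (Math.abs(recent) < minDelta) return 'plateaued';

  return 'converging';
}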
Stopping conditions define what “enough thinking” means for the system, and they can be encoded directly into the loop. They belong in its core design, not bolted on as an afterthought.
9.4 Cost–Quality Tradeoffs in Iterative Improvement
[DEMO: A cost-quality tradeoff playground. Users can choose a task, then configure “max iterations” and “candidate count.” The UI shows total tokens spent, total latency, and final quality score for different strategies: single-shot, up to 3 iterations, 5 iterations, or parallel generation with selection. A graph displays marginal quality gain per extra iteration side by side with marginal cost.]
If improvement is almost always possible in principle, it is tempting to keep iterating. You could always add one more pass, one more candidate, or one more refinement, but each extra step has cost. If you had infinite compute and zero latency, you might ignore those costs. In practice, every generation and evaluation step costs tokens, wall-clock time, and operational complexity.
So the decision is whether one more loop is worth it. When tests give you perfect feedback, the decision is simple: keep going until they pass or the cost is unacceptable. When you lack hard tests and rely on fuzzy model judgments, the costs and benefits are harder to see.
For high-value tasks where evaluation is expensive, it might be cheaper to generate multiple candidates and let a human pick the best. If your agents run unsupervised but must meet strict latency budgets, you may only get one or two iterations before you violate SLAs.
Each iteration has cost (latency, compute, money) and diminishing benefit: the first iteration catches big issues; subsequent iterations catch smaller ones. Keep iterating only while the expected quality gain from one more loop justifies the extra latency and token cost it would add; for example, stop when recent iterations improve the score by less than a small delta while still consuming a full iteration’s budget. When tests don’t exist, model-based evaluation substitutes—less precise, but often sufficient.
Architecturally, you can treat iteration as a cost-benefit decision: given the current score and remaining budget, decide whether another loop is likely to improve the result enough to justify the cost. You can use a simple heuristic instead of a detailed economic model, such as stopping when recent score improvements fall below a fixed threshold while cost continues to accumulate.
Here is a basic pattern that tracks both cost and quality, and stops when the next step looks too expensive for the likely benefit:
type Cost = {
promptTokens: number;
completionTokens: number;
wallTimeMs: number;
};
type EvalWithCost = {
score: number;
issues: Issue[];
cost: Cost;
};
type IterationOptions = {
targetScore: number;
maxIterations: number;
maxTotalTokens: number;
};
async function iterateWithBudget(
task: string,
options: IterationOptions
) {
let output = await generate(task);
let totalTokens = 0;
let bestScore = 0;
let bestOutput = output;
for (let i = 0; i < options.maxIterations; i++) {
const evalResult: EvalWithCost = await evaluateWithCost(task, output);
const { score, issues, cost } = evalResult;
totalTokens += cost.promptTokens + cost.completionTokens;
if (score > bestScore) {
bestScore = score;
bestOutput = output;
}
if (score >= options.targetScore) {
break; // quality threshold reached
}
if (totalTokens >= options.maxTotalTokens) {
break; // budget exhausted
}
// Optional: if remaining budget is tiny, expect negligible improvement
const remainingTokens = options.maxTotalTokens - totalTokens;
if (remainingTokens < estimatedTokensForOneMoreIteration(task)) {
break;
}
output = await improveWithIssues(task, output, issues);
}
return { output: bestOutput, score: bestScore, totalTokens };
}
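The estimatedTokensForOneMoreIteration helper referenced above was left undefined. A rough sketch, assuming roughly four characters per token and a fixed completion budget (both are rules of thumb, not measured values):

// Crude estimate of what one more evaluate + improve round will cost.
function estimatedTokensForOneMoreIteration(task: string): number {
  const promptTokens = Math.ceil(task.length / 4) * 2; // task appears in both the eval and improve prompts
  const completionBudget = 2 * 1024;                   // one evaluation plus one revision, ~1k tokens each
  return promptTokens + completionBudget;
}

A tighter estimate would reuse the measured cost of the previous iteration rather than a static rule of thumb.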
Separating quality and cost unlocks a few patterns.
First, you can compare strategies empirically. Run the same task through:
- Single-shot generation
- One iteration of feedback
- Three iterations of feedback
- Parallel generation with selection only
- Hybrid: parallel generation + a few iterations on the best
Record both final quality and total cost. In many empirical tests, you see a similar pattern: big gains on the first iteration or two, smaller gains after, and costs that scale roughly linearly with iterations and candidates. You should validate this pattern on your own tasks.
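A small harness makes the comparison concrete. This sketch reuses generate, evaluateWithCost, and iterateWithBudget from this chapter; the budgets are illustrative:

type StrategyResult = {
  strategy: string;
  score: number;
  totalTokens: number;
};

// Run the same task under different iteration budgets and record quality vs. cost.
async function compareStrategies(task: string): Promise<StrategyResult[]> {
  const results: StrategyResult[] = [];

  // Single-shot baseline: generate once, evaluate once, no feedback loop.
  const single = await generate(task);
  const singleEval = await evaluateWithCost(task, single);
  results.push({
    strategy: 'single-shot',
    score: singleEval.score,
    totalTokens: singleEval.cost.promptTokens + singleEval.cost.completionTokens
  });

  // Feedback loops with increasing iteration budgets.
  for (const maxIterations of [1, 3, 5]) {
    const run = await iterateWithBudget(task, {
      targetScore: 0.9,
      maxIterations,
      maxTotalTokens: 50_000
    });
    results.push({
      strategy: `${maxIterations} iteration(s)`,
      score: run.score,
      totalTokens: run.totalTokens
    });
  }

  // Note: these totals only include what evaluateWithCost / iterateWithBudget report;
  // add generation costs if your client exposes them.
  return results;
}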
Second, you can choose different policies for different contexts:
- For interactive chat, you might prefer low-latency, low-iteration behavior: one shot, maybe one refinement.
- For batch offline jobs (e.g. generating documentation overnight), you can afford more loops and higher candidate counts.
- For safety-critical actions (e.g. code that will run in production), you might combine tests, multiple candidates, and iteration, accepting higher cost in exchange for reliability.
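One way to keep these choices explicit is a small policy table keyed by context, reusing the IterationOptions type from above. The numbers are illustrative defaults, not recommendations:

// Illustrative per-context iteration policies; tune the values for your system.
type ExecutionContext = 'interactive-chat' | 'batch-offline' | 'safety-critical';

const iterationPolicies: Record<ExecutionContext, IterationOptions> = {
  'interactive-chat': { targetScore: 0.8, maxIterations: 1, maxTotalTokens: 8_000 },
  'batch-offline': { targetScore: 0.9, maxIterations: 5, maxTotalTokens: 100_000 },
  'safety-critical': { targetScore: 0.95, maxIterations: 8, maxTotalTokens: 250_000 }
};

// Usage: pick the policy for the current context and run the budgeted loop.
async function runWithPolicy(task: string, context: ExecutionContext) {
  return iterateWithBudget(task, iterationPolicies[context]);
}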
Third, when tests are not available, you can still approximate rational behavior. Model-based evaluation is noisier and less trustworthy, but you can treat it as a proxy signal and still apply the same logic: stop when scores plateau or budget is spent. The system will not be optimal, but it will be bounded and predictable.
Cost-aware design interacts with earlier patterns: for example, a coordinator can route complex tasks to agents that run heavier feedback loops in batch mode, while keeping interactive paths to one or two iterations. Evaluation gives you the measurements; feedback, properly constrained, gives you a controlled way to spend computation where it actually improves outcomes.
Bridge to Chapter 10
Feedback improves an output in the moment: you generate, evaluate, and then revise. The loop has memory inside the task: each iteration remembers what went wrong in the previous attempt and avoids it. But when the task ends, that memory evaporates. The next user request starts from scratch. The agent rediscovers the same edge cases, relearns the same prompt tricks, and re-fixes the same recurring bugs.
This is where feedback and learning diverge. Feedback is transient and local: it shapes the next few steps in a single episode. Learning is persistent and global: it changes how the system behaves on future episodes because of what happened before.
Feedback is structured, actionable information. This raises a new architectural question: instead of throwing that information away at the end of each task, how do you store it, reuse it, and let it accumulate into durable competence?
Chapter 10 takes that step. We will treat learning as feedback plus memory over time: collecting successful patterns, updating prompts based on evidence, refining retrieval strategies, and slowly turning one-off corrections into lasting improvements in the agent itself.