You have already seen three illusions come into focus.
Context made it feel like you were talking to a single mind. Memory made it feel like that mind persisted. Agency made it feel like the mind could act. In each case, the “mind” turned out to be a composite: the model plus the structures you built around it.
Reasoning is where that composite begins to solve multi-step problems, revise outputs, and handle more complex tasks. A good system does not just respond; it breaks problems down, explores alternatives, checks its work, and revises. Taken together, these behaviors make the system seem to “think.”
The important question is where this “thinking” actually happens.
From the outside, it is tempting to imagine a hidden deliberation phase inside the model. First it thinks, then it speaks. First it reasons, then it acts. Prompts like “think step by step” play into that picture: you ask it to reveal more of its thoughts, as if you were peeling back a layer.
Mechanically, nothing like that exists. There is no inner workspace the model retreats to. There is no second chance after it commits to a token. There is only one thing happening: an autoregressive process, token after token, conditioned on whatever text you fed it and whatever text it has already produced.
This creates a puzzle: if the model has no hidden workspace, where do “plans” live? If every token commits to a trajectory, how can it ever genuinely reconsider? These constraints raise a third issue: how can a system feel like it learns inside a task when a call cannot access information that does not yet exist?
This has direct consequences for how you design your system. Misplacing where reasoning lives leads teams to chase better prompts instead of better architectures. They ask the model to “check its work” inside a single call and find that it mostly affirms its own output. They expect self-reflection where only continuation is possible.
This behavior is not inherent to the model; you can change it through system design. The agent is real, but its reasoning is distributed across components you control: multiple calls, intermediate artifacts, external tools, and explicit feedback loops. You decide when commitment happens, when evaluation happens, when new information enters, and when to branch or iterate.
The rest of this chapter follows that thread through four questions:
- Why split work across multiple LLM calls at all?
- Can a model genuinely evaluate its own output?
- What makes intermediate results worth the complexity?
- When do you branch into multiple alternatives versus iterating on one?
By the end, “reasoning” will look less like a mysterious built-in capability and more like a design pattern: structured generation plus structured evaluation over time.
4.1 Why Split Work Across Multiple Calls?
[DEMO: A toggleable interface that runs a small research-and-summarize task in two modes. Mode A: a single prompt that asks the model to “think step by step” and produce a report. Mode B: a three-step pipeline that (1) plans sub-questions, (2) retrieves web results for each, and (3) synthesizes a report from retrieved snippets. The UI shows both outputs side by side, with a visible intermediate retrieval stage only in Mode B.]
If chain-of-thought in one prompt can make the model “think step by step,” why ever bother splitting work across multiple calls? Why not simply write one very careful prompt, ask for reasoning, and let the model do everything in one go?
If you can ask:
“Think step by step, identify subproblems, imagine relevant sources, reason about them, and then produce the final answer.”
what exactly does a second or third call buy you? If the model cannot fetch new data in the middle of a generation, what kinds of tasks are fundamentally impossible to do well in a single pass?
You can get a long way with one well-designed call. But you will hit a wall for three distinct, mechanical reasons: information, verification, and focus.
Single-call reasoning is constrained to information present at call start. Splitting calls lets you insert intermediate retrieval, external verification, and focused contexts between reasoning steps. You split when you need information or checks that do not exist until mid-task.
Single-call reasoning is bounded by initial context
In a single call, the model’s entire universe is the text you send plus the tokens it has already produced. It cannot pause halfway through, run a database query, and resume. It cannot silently execute code and then condition on the result. It cannot discover new facts that are not yet in context.
A small TypeScript example makes this boundary concrete:
async function singleCallReport(topic: string) {
const prompt = `
You are a careful analyst.
Task: Research "${topic}" and write a short report.
Think step by step, identifying key questions
and answering them.
Then write the final report.
`;
const report = await llm.complete(prompt);
return report;
}
The model has to “research” by pattern-matching over its training data and whatever is in the prompt. There is no chance for actual retrieval or checking; everything must be imagined or recalled from training. The chain-of-thought is real as text, but it is trapped inside that one forward pass.
Now compare this to a split design:
async function multiCallReport(topic: string) {
// Call 1: Decompose the problem
const plan = await llm.complete(`
You are planning research on "${topic}".
List 3–5 concrete subquestions that,
if answered, would give a good overview.
`);
const questions = parseQuestions(plan);
// Call 2: Retrieve for each subquestion
const findings = await Promise.all(
questions.map(async (q) => {
const results = await searchWeb(q); // external tool
return { question: q, results };
})
);
// Call 3: Synthesize from actual retrieved data
const synthesis = await llm.complete(`
You are writing a report on "${topic}".
Here are subquestions and search results:
${findings
.map(
(f) =>
`Question: ${f.question}\n` +
f.results
.map((r) => `Title: ${r.title}\nSnippet: ${r.snippet}`)
.join('\n\n')
)
.join('\n\n---\n\n')}
Based only on these results, write a concise report.
`);
return synthesis;
}
The model is the same between singleCallReport and multiCallReport; only the surrounding pipeline changes. In the multi-call version, intermediate retrieval produces information that did not exist at the start. That information narrows the model’s possibilities in a way the initial prompt alone never could.
External verification lives between calls, not inside them
The same structural limitation applies to verification. In a single call, you can ask the model to “test” or “check” its output, but it cannot execute code or query external systems within that pass. Any “checking” is simulated.
Splitting calls lets you introduce real checks between generations:
async function implementWithTests(spec: string) {
// Call 1: Generate code
const code = await llm.complete(`
Implement the following specification in TypeScript:
${spec}
`);
// Intermediate: run actual tests on the code
const results = await runTestsOnGeneratedCode(code);
// Call 2: Fix based on real failures
if (!results.allPassed) {
const fixed = await llm.complete(`
The following implementation failed tests:
${code}
Test results:
${formatTestResults(results)}
Fix the code so all tests pass.
`);
return fixed;
}
return code;
}
The test results are not opinions the model invented. They are facts produced by executing the code. The second call conditions on those facts. There is simply no way to access that feedback inside the first call, because it does not exist until you run the tests.
Focused context beats “everything at once”
The third reason to split is cognitive: even if you could stuff all instructions, examples, and goals into one giant prompt, every token must attend to everything. The model has to juggle planning, retrieval surrogates, evaluation criteria, and formatting rules simultaneously.
Breaking the task into focused calls allows you to control what each step “pays attention to”:
- A planning call that only thinks about decomposition.
- A retrieval-powered call that only analyzes sources.
- A synthesis call that only integrates structured findings.
Each step sees a simpler world, with less competition in the context window. That often matters more than sheer context size.
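As a sketch of what “focused” means in code, reusing the chapter’s assumed llm.complete client: each step builds its prompt from only the fields it needs, so planning instructions never compete with synthesis instructions for attention. The PipelineState shape and step names here are illustrative, not a fixed API.

// Hypothetical shared state for a research pipeline.
interface PipelineState {
  topic: string;
  questions?: string[];
  findings?: { question: string; snippets: string[] }[];
}

// The planning step sees only the topic, nothing else.
async function planStep(state: PipelineState) {
  return llm.complete(`
    List 3-5 concrete subquestions for researching "${state.topic}".
  `);
}

// The synthesis step sees only structured findings, not the planning
// instructions or the retrieval machinery that produced them.
async function synthesisStep(state: PipelineState) {
  const findingsText = (state.findings ?? [])
    .map((f) => `Question: ${f.question}\n${f.snippets.join('\n')}`)
    .join('\n\n---\n\n');
  return llm.complete(`
    Write a concise report on "${state.topic}" based only on:
    ${findingsText}
  `);
}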
Design guidance
Use single-call reasoning when:
- All necessary information is already in context.
- You do not need real-time retrieval or execution.
- You will verify outputs externally anyway.
- Latency and cost are more important than maximum quality.
Split into multiple calls when:
- You need information that can only be fetched or computed mid-task.
- You need to run external tools (search, code execution, APIs).
- Different steps have different context needs (planning vs. synthesis).
- You care about verifiable correctness, not just plausible text.
The model stays the same. The “reasoning” improves because you changed the shape of the system around it.
4.2 Can a Model Genuinely Evaluate Its Own Output?
[DEMO: A code generation playground with two modes for “Write a quicksort and check it.” Mode A: one prompt that asks the model to implement quicksort and then evaluate its own code in the same response. Mode B: a two-call setup where call 1 generates code and call 2, framed as a separate “reviewer,” critiques that code. The UI highlights how often Mode A declares correctness vs. how often Mode B finds substantial issues.]
If you ask the model to “write a function and then check if it’s correct” in a single prompt, is it actually checking? When it appends “This looks good and handles edge cases,” is that a genuine judgment or just more narration in the same trajectory?
Every token in a completion is conditioned on all previous tokens. Once the model has produced an answer, it has effectively committed to that answer in its internal trajectory. When it later generates “evaluation” tokens, those are conditioned on the fact that it has already said whatever it just said.
So when you add “Now critically evaluate your previous answer” to a prompt, is there any chance for real reconsideration? Or is the model just extending the story it started?
Same-pass self-evaluation is compromised by autoregressive commitment. By the time the model “evaluates,” it has already committed to the output; the evaluation is conditioned on that commitment. A separate call sees the output as input, not as something it just wrote.
The commitment problem in code
Here is a single-call setup where generation and evaluation share one forward pass:
async function singleCallImplementAndEvaluate() {
const response = await llm.complete(`
Write a TypeScript function \`quicksort\` that sorts an array in-place.
After you write the code, evaluate it carefully:
- Is it a correct quicksort?
- Does it sort in-place?
- What edge cases might fail?
First write the code, then the evaluation.
`);
return response;
}
Internally, this looks like one long token sequence:
- Tokens describing the function.
- Tokens with some commentary about the function.
- Possibly a verdict: “This correctly implements quicksort.”
At no point does the model step outside its own trajectory. The evaluation tokens are chosen because they fit the pattern “what typically follows code I just wrote when someone asks me to evaluate it.” That pattern often includes mild criticism or affirmation, but rarely a wholesale reversal.
It is structurally hard for the model to say, “Actually, the thing I just wrote is fundamentally broken,” because the text so far makes that conclusion unlikely in the learned distribution. The easiest continuation is: “This is correct” with maybe some minor caveats.
A separate call changes the conditioning
Now separate generation from evaluation:
async function multiCallImplementAndEvaluate() {
// Call 1: generate code
const code = await llm.complete(`
Write a TypeScript function \`quicksort\` that sorts an array in-place.
`);
// Call 2: evaluate as a fresh reviewer
const evaluation = await llm.complete(`
You are a strict code reviewer.
Here is a quicksort implementation:
${code}
Evaluate it critically:
- Is it a correct quicksort?
- Does it sort in-place?
- What edge cases or bugs do you see?
- How would you improve it?
Be direct and specific.
`);
return { code, evaluation };
}
This setup uses two independent forward passes. That independence changes how the evaluation behaves:
- In the first pass, the model is conditioned on the spec and whatever it has already generated.
- In the second pass, the model is conditioned only on “You are a strict code reviewer” and the code as an external artifact.
The evaluator has no “memory” that it wrote the code. It is not continuing a narrative it started; it is responding to a new prompt that happens to contain code. That changes the probability distribution over evaluation tokens: harsh criticism is now much more likely, because “strict reviewers” harshly criticize arbitrary code in training data.
This independence is architectural, not psychological. It is tempting to describe this as giving the system a “second opinion.” But the important thing is not a metaphor about personalities. It is the separation of forward passes and the change in context:
- Same-pass evaluation is forced to be consistent with the story so far. It is a continuation.
- Separate-pass evaluation is free to take whatever stance the new prompt encourages. It is a reaction.
This explains why “check your work” prompts behave the way they do. Of course they tend to rubber-stamp or nitpick. You asked a continuation process to reverse itself.
Design guidance
Use separate calls for evaluation when:
- You need real critique, not just decorative self-reflection.
- You care about catching substantial issues, not minor cleanup.
- You are going to feed evaluation back for revision.
Reserve single-call self-commentary for:
- Surfacing rough reasoning where correctness will be checked externally.
- UX: making the model explain itself to the user, not to itself.
- Lightweight sanity checks on trivial tasks.
Architecturally, “the evaluator” is just another LLM call with a different prompt and different conditioning. You get genuine evaluation not by asking more nicely, but by moving the evaluation to a different pass.
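One way to make that concrete is to treat the evaluator as a small reusable function. The sketch below uses the chapter’s assumed llm.complete client; the Rubric shape and the evaluateArtifact name are illustrative assumptions, not a fixed API.

// Illustrative rubric shape; the fields are assumptions, not a fixed API.
interface Rubric {
  role: string;       // e.g. "a strict code reviewer"
  criteria: string[]; // e.g. ["correctness", "edge cases", "clarity"]
}

// "The evaluator" is just another call with different conditioning:
// the artifact arrives as input, not as something the model just wrote.
async function evaluateArtifact(artifact: string, rubric: Rubric) {
  return llm.complete(`
    You are ${rubric.role}.
    Evaluate the following artifact against these criteria:
    ${rubric.criteria.map((c) => `- ${c}`).join('\n')}

    Artifact:
    ${artifact}

    Be direct and specific about any problems you find.
  `);
}

The quicksort reviewer from earlier is this function with the prompt written out by hand: the code arrives as the artifact, and the “strict code reviewer” framing supplies the conditioning.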
4.3 What Makes Intermediate Results Worth the Complexity?
[DEMO: An essay-writing demo with three modes on the same topic. Mode A: direct “Write a 500-word essay on X.” Mode B: “Write an essay and critique it in the same response.” Mode C: a three-step loop that (1) generates a draft, (2) uses a separate call to produce targeted feedback, and (3) uses another call to revise based on that feedback. The UI shows the prompt and output at each step and highlights how specific feedback changes the revision.]
Once you split work into multiple calls, each step leaves something behind: plans, drafts, evaluations, test logs, search snippets. These intermediate results become first-class objects in your system.
If each step is just another LLM call, what exactly makes these intermediate artifacts worth the overhead? Why is “Use this feedback to revise your answer” so much more effective than “Improve your answer”?
If intermediate results become inputs to subsequent calls, what is special about those inputs compared to just writing better instructions? What properties distinguish helpful feedback from noise?
Intermediate results provide more specific constraints than general instructions alone. Instructions describe what to do; intermediate artifacts provide concrete material to operate on. “Improve this essay” is vague; “Improve this essay: [draft] using this feedback: [specific issues]” gives the model focused constraints. Each step’s output becomes the next step’s foundation.
Instructions vs. concrete artifacts
To contrast instructions with concrete artifacts, consider two prompts that both ask for an improved explanation of the same concept.
Single-call “just do better”:
async function improveInPlace(concept: string) {
const response = await llm.complete(`
Explain "${concept}" to a junior engineer.
Then improve your explanation to make it clearer.
`);
return response;
}
The model has to imagine a worse explanation and a better one in one trajectory. There is no actual draft to improve; it is all in its head. The “improvement” is just another pass at the same task, lightly constrained by “make it clearer.”
Now compare a two-step pipeline with explicit artifacts:
async function generateThenRevise(concept: string) {
// Step 1: draft
const draft = await llm.complete(`
Explain "${concept}" to a junior engineer
in 3–4 paragraphs.
`);
// Step 2: feedback
const feedback = await llm.complete(`
You are reviewing an explanation for a junior engineer.
Explanation:
${draft}
Identify concrete issues:
- What parts are confusing?
- What important details are missing?
- Where could examples help?
List specific suggestions, not general advice.
`);
// Step 3: revision
const revision = await llm.complete(`
Here is your original explanation:
${draft}
Here is feedback on it:
${feedback}
Rewrite the explanation to address the feedback
while keeping it roughly the same length.
`);
return { draft, feedback, revision };
}
The revision call is conditioned on two artifacts:
- The draft: a concrete object to edit.
- The feedback: a structured list of issues and suggestions.
Those artifacts dramatically narrow the sampling space. The model no longer has to guess what “better” means; the feedback tells it. It no longer has to invent what the draft might have been; it sees the actual text. The probability mass shifts from “any explanation of this concept” to “explanations close to this draft that fix these issues.”
Feedback as structured constraint
Intermediate artifacts are valuable when they add structure:
- A plan: a list of subproblems that subsequent calls must address.
- Tests: specific input-output failures that code must fix.
- Critique: concrete issues that text must repair.
- An outline: section headers that the final document must follow.
In each case, you are not just adding more words to the prompt. You are adding constraints on what future tokens are allowed to say while still satisfying the instructions.
The generate–evaluate–revise loop from the previous section is a direct application of this:
async function generateEvaluateRevise(task: string) {
// Generate
const draft = await llm.complete(task);
// Evaluate (separate call, structured feedback)
const evaluation = await llm.complete(`
You are a critical evaluator.
Task:
${task}
Draft:
${draft}
Identify problems with this draft:
- Incorrect or misleading statements
- Missing important points
- Unclear explanations or poor structure
Be specific and actionable.
`);
// Optionally decide if revision is needed…
// Revise
const revised = await llm.complete(`
Task:
${task}
Original draft:
${draft}
Evaluation of the draft:
${evaluation}
Produce a revised version that fixes the issues.
`);
return { draft, evaluation, revised };
}
The critical part is not the phrasing “be specific.” It is the fact that the evaluation becomes an object that the next step must respect. The model cannot pretend the evaluation did not happen; it is part of the conditioning context.
Intermediate results are not just “more context”
You could, in principle, write one giant prompt that says:
“Imagine a draft, imagine its evaluation, and then imagine the improved version based on that evaluation.”
This is functionally what a single-call “do X, then critique, then improve” prompt does. But the intermediate “draft” and “evaluation” in that scenario only exist as imagined steps in one trajectory. They are not independently generated artifacts that can be inspected, logged, or reused. The model can skip over them or compress them, and nothing in your system will notice.
When you externalize intermediate results as first-class artifacts:
- You can examine them yourself.
- You can store them for later learning.
- You can choose to branch or stop based on them.
- You can send them to different specialized components.
You convert implicit, internal scaffolding into explicit structure that your system—and other calls—can use.
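A minimal sketch of that externalization, with hypothetical names: each step’s output is stored as a typed record that downstream code, evaluators, or humans can inspect later.

// Hypothetical artifact record; the field names are illustrative.
interface StepArtifact {
  step: string;      // e.g. "draft", "feedback", "revision"
  content: string;   // the text the step produced
  createdAt: number; // timestamp, useful for logging and debugging
}

class ArtifactLog {
  private artifacts: StepArtifact[] = [];

  record(step: string, content: string): StepArtifact {
    const artifact = { step, content, createdAt: Date.now() };
    this.artifacts.push(artifact);
    return artifact;
  }

  // Later steps, external tools, or humans can look up any prior artifact.
  get(step: string): StepArtifact | undefined {
    return this.artifacts.find((a) => a.step === step);
  }
}

In generateThenRevise above, the draft, feedback, and revision would each be recorded here, so you can branch on the feedback, replay the revision with a different prompt, or show the whole chain to a user.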
Design guidance
Introduce explicit intermediate results when:
- You want to make the reasoning process inspectable (for logging, debugging, or UX).
- You want later steps to be tightly constrained by earlier ones.
- You want to plug in external evaluators or tools between steps.
- You anticipate reusing intermediate artifacts (e.g., plans, outlines, test suites).
Keep steps minimal but concrete. The more specific the artifact, the more leverage you get from feeding it forward.
4.4 When to Generate Alternatives vs. Iterate on One
[DEMO: A product-description generator with two modes. Mode A: generate–evaluate–revise on a single description until it is “good enough.” Mode B: generate N different descriptions in parallel, have the model score them, and either (1) pick the best as-is or (2) run a single revision pass on just the top candidate. The UI shows how often the best-of-N from parallel generation beats the revised single path.]
Once you can split work across calls and turn intermediate results into artifacts, you have two basic levers:
- Deepen a single path by iterating on it (generate–evaluate–revise–…).
- Broaden your search by exploring multiple paths in parallel and selecting.
If you can always revise, why bother generating multiple candidates? If you can always generate multiple candidates, why spend effort repeatedly polishing one?
If planning and execution are separate calls, should execution be allowed to deviate from the plan, or must it treat the plan as law? When verification is available externally (tests, factual checks), when does model-based evaluation still make sense?
Parallel generation explores different regions of the model’s output distribution; serial refinement deepens one region. Use parallel candidates when you do not know which direction is promising. Use iteration when you have a reasonable direction and need to polish. Use external verification when you have ground truth; use model-based judgment when quality is subjective.
Serial refinement: deepen a committed path
Serial refinement is what we have already seen: generate once, then loop:
- Evaluate the result.
- Revise based on feedback.
- Repeat until some stop condition.
Each revision is intended to incrementally improve the result relative to your starting point.
async function refineUntilGoodEnough(task: string, maxRounds = 3) {
let current = await llm.complete(task);
for (let round = 0; round < maxRounds; round++) {
const evaluation = await llm.complete(`
You are evaluating this response:
Task:
${task}
Response:
${current}
Score it from 1–10 on:
- Correctness
- Completeness
- Clarity
Then list specific improvements.
`);
const { score, suggestions } = parseEvaluation(evaluation);
if (score >= 8) break; // good enough
current = await llm.complete(`
Task:
${task}
Current response:
${current}
Suggested improvements:
${suggestions}
Rewrite the response to address the suggestions.
`);
}
return current;
}
This pattern shines when:
- The space of acceptable answers is narrow and continuous (e.g., tightening an explanation).
- You already have a decent first attempt.
- Evaluation feedback can easily be translated into incremental changes.
It performs poorly when:
- Your initial attempt lands in a bad region (e.g., misinterprets the task).
- The task is creative or open-ended, with many distinct modes.
- There are multiple qualitatively different approaches you want to consider.
In those cases, you are just polishing the wrong thing.
Parallel generation: explore the distribution
Parallel generation plus selection treats the model’s stochasticity as a feature:
async function generateAndSelect(task: string, n = 5) {
// Generate N candidates
const candidates = await Promise.all(
Array.from({ length: n }, () => llm.complete(task))
);
// Evaluate each candidate independently
const evaluations = await Promise.all(
candidates.map((candidate) =>
llm.complete(`
Task:
${task}
Candidate response:
${candidate}
Score from 1–10 on:
- Relevance to the task
- Quality of content
- Clarity and style
Provide scores and a brief justification.
`)
)
);
const scored = candidates.map((candidate, i) => {
const { totalScore } = parseScores(evaluations[i]);
return { candidate, score: totalScore };
});
scored.sort((a, b) => b.score - a.score);
return scored[0].candidate;
}
Each candidate is an independent sample from the output distribution for that prompt. With temperature > 0, those samples diverge: different framings, different argument orders, different creative spins.
Selection then lets you pick:
- The single best candidate.
- The top K candidates to feed into further refinement.
- Different candidates for different audiences or contexts.
You are no longer relying on a single trajectory. You are asking the model, “Show me several ways this could go,” then choosing.
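If you want the “top K for further refinement” option rather than a single winner, the selection step needs only a small change. This sketch assumes the same llm.complete client and parseScores helper used in generateAndSelect.

// Keep the K highest-scoring candidates instead of only the best one.
async function generateTopK(task: string, n = 5, k = 2) {
  const candidates = await Promise.all(
    Array.from({ length: n }, () => llm.complete(task))
  );
  const evaluations = await Promise.all(
    candidates.map((candidate) =>
      llm.complete(`
        Task:
        ${task}
        Candidate response:
        ${candidate}
        Score from 1-10 on relevance, quality, and clarity,
        then briefly justify the scores.
      `)
    )
  );
  return candidates
    .map((candidate, i) => ({
      candidate,
      score: parseScores(evaluations[i]).totalScore, // same assumed helper as above
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((s) => s.candidate);
}

Each surviving candidate can then go through its own refinement loop, which is exactly the combination described below.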
Parallel generation is especially powerful when:
- The task is creative (marketing copy, interface ideas, example generation).
- You do not know which style or angle you want yet.
- The cost of multiple generations is acceptable compared to the cost of a bad choice.
Combining breadth and depth
You do not have to choose between these levers; you can combine them:
- Generate multiple initial candidates.
- Select the top one or two.
- Run refinement loops only on those.
async function exploreThenRefine(task: string) {
const initialBest = await generateAndSelect(task, 5);
const refined = await refineUntilGoodEnough(
`Improve this response to the task:\n\n${task}\n\nResponse:\n${initialBest}`,
2
);
return refined;
}
Breadth helps you avoid starting in a bad region. Depth helps you polish once you have a promising trajectory.
Plans vs. execution
The same tradeoff appears when you separate planning from execution:
- A planning call generates a sequence of steps.
- Execution calls implement those steps.
Should the executor be allowed to deviate? Mechanically, deviation is just another LLM call with slightly different prompts and context. The question is design: do you treat the plan as a suggestion or a contract?
You can encode either behavior in the prompt:
// Strict execution: plan as contract
const strictExecutionPrompt = `
You are executing this plan exactly:
Plan:
1. Collect user requirements
2. Propose three design options
3. Compare options and recommend one
You are on step 2 now. Do NOT change the plan.
`;
// Flexible execution: plan as guidance
const flexibleExecutionPrompt = `
You are executing this plan:
Plan:
1. Collect user requirements
2. Propose three design options
3. Compare options and recommend one
You may modify the plan if you uncover new information
that makes a different sequence clearly better.
When you do, explain why.
`;
A strict executor improves predictability and auditability. A flexible executor can correct bad plans mid-stream at the cost of more variance. Both are just different prompts and control logic around the same model.
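A sketch of the control logic around those prompts, with hypothetical names: the loop walks the plan one step at a time and chooses the strict or flexible framing per call. A fuller flexible executor would also parse any proposed plan changes and update the remaining steps.

// Hypothetical executor loop around the two prompt styles above.
async function executePlan(planSteps: string[], mode: 'strict' | 'flexible') {
  const planText = planSteps.map((step, i) => `${i + 1}. ${step}`).join('\n');
  const framing =
    mode === 'strict'
      ? 'Do NOT change the plan.'
      : 'You may adjust later steps if new information makes that clearly better. Explain any change.';

  const outputs: string[] = [];
  for (let i = 0; i < planSteps.length; i++) {
    const output = await llm.complete(`
      You are executing this plan:
      ${planText}
      You are on step ${i + 1}: ${planSteps[i]}
      ${framing}
      Results of previous steps:
      ${outputs.join('\n\n') || '(none yet)'}
    `);
    outputs.push(output);
  }
  return outputs;
}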
External vs. model-based verification
Finally, where does verification live when you have ground truth?
- If you can run tests, query a database, or call a deterministic API, do that. Use the model to interpret or fix based on those results.
- Use model-based evaluation when quality is subjective (style, tone, pedagogy) or when you have no cheap oracle.
Verification then becomes another place where you choose breadth vs. depth:
- Breadth: ask multiple evaluators (potentially with different instructions) and aggregate.
- Depth: iterate on a single evaluator’s feedback until quality converges.
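As a sketch of the breadth option, with hypothetical evaluator instructions and a hypothetical parseScore helper: run several evaluator calls with different instructions and aggregate their scores before deciding whether to revise.

// Breadth-style verification: several evaluators, different instructions.
async function aggregateEvaluations(task: string, output: string) {
  const evaluatorInstructions = [
    'You are checking factual accuracy.',
    'You are checking completeness against the task.',
    'You are checking clarity and structure.',
  ];
  const scores = await Promise.all(
    evaluatorInstructions.map(async (instruction) => {
      const evaluation = await llm.complete(`
        ${instruction}
        Task:
        ${task}
        Output:
        ${output}
        Give a score from 1-10 and a one-sentence justification.
      `);
      return parseScore(evaluation); // hypothetical helper that extracts the number
    })
  );
  // Simple average; in practice you might weight evaluators differently.
  return scores.reduce((sum, s) => sum + s, 0) / scores.length;
}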
Design guidance
Use serial refinement when:
- You already have a plausible answer.
- Improvements are incremental and local.
- You have structured feedback you can feed forward.
Use parallel generation when:
- You do not know what “good” looks like yet.
- Diversity of ideas matters.
- You can afford multiple samples.
Combine them when:
- You want both exploration and polish.
- The cost of a bad initial direction is high.
From the user’s perspective, brainstorming, revision, and changes of direction come from how you combine these two levers over time.
Bridge to Chapter 5
In this chapter, reasoning stopped being a mystical property of the model and became an architectural choice. You saw that:
- Single-call “thinking” is bound by initial context and autoregressive commitment.
- Splitting work across calls lets you insert retrieval, verification, and focused prompts.
- Intermediate artifacts—plans, drafts, evaluations, test results—are where constraint and improvement actually live.
- Breadth (multiple candidates) and depth (iterative refinement) are the two basic levers you control.
So far, we have treated reasoning as something that happens inside a single composite agent: one system orchestrating calls around one model. Real applications rarely look that simple. They involve multiple components—different agents, tools, and humans—each with their own reasoning loops, all influencing a shared artifact or goal.
The next chapter is about coordination: how you route work between these components, how they share state, how they hand off responsibilities, and how you weave human judgment into these loops. Reasoning turns a single component into a thinker. Coordination turns many components into a system.