You have already seen three illusions come into focus. Context made it feel like you were talking to a single mind. Memory made it feel like that mind persisted. Agency made it feel like the mind could act. In each case, the “mind” turned out to be a composite: the model plus the structures you built around it.
Reasoning is where that composite begins to solve multi-step problems, revise outputs, and handle more complex tasks. A good system does not just respond; it breaks problems down, explores alternatives, checks its work, and revises. Taken together, these behaviors make the system seem to “think.” The important question is where this “thinking” actually happens.
From the outside, it is tempting to imagine a hidden deliberation phase inside the model. First it thinks, then it speaks. First it reasons, then it acts. Prompts like “think step by step” play into that picture: you ask it to reveal more of its thoughts, as if you were peeling back a layer.
Mechanically, nothing like that exists. There is no inner workspace the model retreats to. There is no second chance after it commits to a token. There is only one thing happening: an autoregressive process, token after token, conditioned on whatever text you fed it and whatever text it has already produced.
This creates a puzzle: if the model has no hidden workspace, where do “plans” live? If every token commits to a trajectory, how can it ever genuinely reconsider? These constraints raise a third issue: how can a system feel like it learns inside a task when a call cannot access information that does not yet exist?
This has direct consequences for how you design your system. Misplacing where reasoning lives leads teams to chase better prompts instead of better architectures. They ask the model to “check its work” inside a single call and find that it mostly affirms its own output. They expect self-reflection where only continuation is possible.
This behavior is not inherent to the model; you can change it through system design. The agent is real, but its reasoning is distributed across components you control: multiple calls, intermediate artifacts, external tools, and explicit feedback loops. You decide when commitment happens, when evaluation happens, when new information enters, and when to branch or iterate. The rest of this chapter follows that thread through four questions:
- Why split work across multiple LLM calls at all?
- Can a model genuinely evaluate its own output?
- What makes intermediate results worth the complexity?
- When do you branch into multiple alternatives versus iterating on one?
4.1 Why Split Work Across Multiple Calls?
[DEMO: A toggleable interface that runs a small research-and-summarize task in two modes. Mode A: a single prompt that asks the model to “think step by step” and produce a report. Mode B: a three-step pipeline that (1) plans sub-questions, (2) retrieves web results for each, and (3) synthesizes a report from retrieved snippets. The UI shows both outputs side by side, with a visible intermediate retrieval stage only in Mode B.]
If chain-of-thought in one prompt can make the model “think step by step,” why ever bother splitting work across multiple calls? Why not simply write one very careful prompt, ask for reasoning, and let the model do everything in one go? If you can ask:
“Think step by step, identify subproblems, imagine relevant sources, reason about them, and then produce the final answer.”
what exactly does a second or third call buy you? If the model cannot fetch new data in the middle of a generation, what kinds of tasks are fundamentally impossible to do well in a single pass?
You can get a long way with one well-designed call. But you will hit a wall for three distinct, mechanical reasons: information, verification, and focus.
Single-call reasoning is constrained to information present at call start. Splitting calls lets you insert intermediate retrieval, external verification, and focused contexts between reasoning steps. You split when you need information or checks that do not exist until mid-task.
Single-call reasoning is bounded by initial context
In a single call, the model’s entire universe is the text you send plus the tokens it has already produced. It cannot pause halfway through, run a database query, and resume. It cannot silently execute code and then condition on the result. It cannot discover new facts that are not yet in context.
To make this boundary concrete in TypeScript terms, compare two functions, singleCallReport and multiCallReport. The model is the same in both; only the surrounding pipeline changes. In the multi-call version, intermediate retrieval produces information that did not exist at the start. That information narrows the model’s possibilities in a way the initial prompt alone never could.
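The two functions might look like the following minimal sketch. The `LlmCall` and `SearchFn` types are hypothetical stand-ins for whatever model client and retrieval tool you use, and the prompt wording is an assumption made for illustration.

```typescript
// Hypothetical interfaces: `LlmCall` and `SearchFn` stand in for your own
// model client and retrieval tool. They are not a real library API.
type LlmCall = (prompt: string) => Promise<string>;
type SearchFn = (query: string) => Promise<string[]>;

// One call: the model sees only the topic and its own output so far.
async function singleCallReport(llm: LlmCall, topic: string): Promise<string> {
  return llm(`Think step by step about "${topic}", then write a short report.`);
}

// Three stages: plan, retrieve, synthesize. Retrieval runs between calls,
// so the synthesis call sees text that did not exist when the task began.
async function multiCallReport(
  llm: LlmCall,
  search: SearchFn,
  topic: string
): Promise<string> {
  const plan = await llm(`List sub-questions for "${topic}", one per line.`);
  const subQuestions = plan
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);

  const snippets: string[] = [];
  for (const question of subQuestions) {
    snippets.push(...(await search(question)));
  }

  return llm(
    `Write a short report on "${topic}" using only these sources:\n` +
      snippets.map((s, i) => `[${i + 1}] ${s}`).join("\n")
  );
}
```

Both functions accept the same `llm` argument. The only difference is that `multiCallReport` lets ordinary code fetch data between generations.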
External verification lives between calls, not inside them
The same structural limitation applies to verification. In a single call, you can ask the model to “test” or “check” its output, but it cannot execute code or query external systems within that pass. Any “checking” is simulated. Splitting calls lets you introduce real checks between generations: run the code or query the system, then pass the actual result to the next call.
Focused context beats “everything at once”
The third reason to split is cognitive: even if you could stuff all instructions, examples, and goals into one giant prompt, every token must attend to everything. The model has to juggle planning, retrieval surrogates, evaluation criteria, and formatting rules simultaneously. Breaking the task into focused calls allows you to control what each step “pays attention to”:
- A planning call that only thinks about decomposition.
- A retrieval-powered call that only analyzes sources.
- A synthesis call that only integrates structured findings.
Design guidance
Use single-call reasoning when:
- All necessary information is already in context.
- You do not need real-time retrieval or execution.
- You will verify outputs externally anyway.
- Latency and cost are more important than maximum quality.
Split work across multiple calls when:
- You need information that can only be fetched or computed mid-task.
- You need to run external tools (search, code execution, APIs).
- Different steps have different context needs (planning vs. synthesis).
- You care about verifiable correctness, not just plausible text.
4.2 Can a Model Genuinely Evaluate Its Own Output?
[DEMO: A code generation playground with two modes for “Write a quicksort and check it.” Mode A: one prompt that asks the model to implement quicksort and then evaluate its own code in the same response. Mode B: a two-call setup where call 1 generates code and call 2, framed as a separate “reviewer,” critiques that code. The UI highlights how often Mode A declares correctness vs. how often Mode B finds substantial issues.]
If you ask the model to “write a function and then check if it’s correct” in a single prompt, is it actually checking? When it appends “This looks good and handles edge cases,” is that a genuine judgment or just more narration in the same trajectory?
Every token in a completion is conditioned on all previous tokens. Once the model has produced an answer, it has effectively committed to that answer in its internal trajectory. When it later generates “evaluation” tokens, those are conditioned on the fact that it has already said whatever it just said.
So when you add “Now critically evaluate your previous answer” to a prompt, is there any chance for real reconsideration? Or is the model just extending the story it started?
Same-pass self-evaluation is compromised by autoregressive commitment. By the time the model “evaluates,” it has already committed to the output; the evaluation is conditioned on that commitment. A separate call sees the output as input, not as something it just wrote.
The commitment problem in code
In a single-call setup, generation and evaluation share one forward pass, and the output is one continuous stream:
- Tokens describing the function.
- Tokens with some commentary about the function.
- Possibly a verdict: “This correctly implements quicksort.”
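A single-call setup of this kind might be sketched as follows. `LlmCall` is a hypothetical model interface and the prompt wording is an assumption; the point is only that code, commentary, and verdict come back from one completion.

```typescript
// Hypothetical model interface; not a real client library.
type LlmCall = (prompt: string) => Promise<string>;

// Generation and "evaluation" are requested in one prompt, so the verdict
// tokens are produced in the same pass, conditioned on the code before them.
function buildSelfCheckPrompt(spec: string): string {
  return [
    `Task: ${spec}`,
    "First, write the implementation.",
    "Then evaluate whether your implementation is correct.",
  ].join("\n");
}

async function generateAndSelfCheck(llm: LlmCall, spec: string): Promise<string> {
  // A single completion: code, commentary, and verdict arrive as one string.
  return llm(buildSelfCheckPrompt(spec));
}
```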
A separate call changes the conditioning
Now separate generation from evaluation:
- In the first pass, the model is conditioned on the spec and whatever it has already generated.
- In the second pass, the model is conditioned only on “You are a strict code reviewer” and the code as an external artifact.
That change in conditioning changes what the evaluation can be:
- Same-pass evaluation is forced to be consistent with the story so far. It is a continuation.
- Separate-pass evaluation is free to take whatever stance the new prompt encourages. It is a reaction.
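The two-pass version could look like this sketch. `LlmCall` and `ReviewedCode` are hypothetical names, and the reviewer prompt is an assumption built from the “strict code reviewer” framing above.

```typescript
// Hypothetical model interface; not a real client library.
type LlmCall = (prompt: string) => Promise<string>;

interface ReviewedCode {
  code: string;
  review: string;
}

async function generateThenReview(llm: LlmCall, spec: string): Promise<ReviewedCode> {
  // Call 1: conditioned on the spec only.
  const code = await llm(`Implement the following. Return only code.\n${spec}`);

  // Call 2: a fresh context. The code arrives as input to critique,
  // not as text this pass has just produced.
  const review = await llm(
    "You are a strict code reviewer. List concrete bugs and missed edge cases.\n" +
      "Code under review:\n" +
      code
  );
  return { code, review };
}
```

Because `code` and `review` are separate values, the review can be logged, shown to a user, or fed into a revision call.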
Design guidance
Use separate calls for evaluation when:
- You need real critique, not just decorative self-reflection.
- You care about catching substantial issues, not minor cleanup.
- You are going to feed evaluation back for revision.
Same-pass self-evaluation is acceptable for:
- Surfacing rough reasoning where correctness will be checked externally.
- UX: making the model explain itself to the user, not to itself.
- Lightweight sanity checks on trivial tasks.
4.3 What Makes Intermediate Results Valuable?
[DEMO: An essay-writing demo with three modes on the same topic. Mode A: direct “Write a 500-word essay on X.” Mode B: “Write an essay and critique it in the same response.” Mode C: a three-step loop that (1) generates a draft, (2) uses a separate call to produce targeted feedback, and (3) uses another call to revise based on that feedback. The UI shows the prompt and output at each step and highlights how specific feedback changes the revision.]
Once you split work into multiple calls, intermediate artifacts such as plans, drafts, evaluations, test logs, and search snippets become first-class objects. These are intermediate results. If each step is just another LLM call, what exactly makes these intermediate artifacts worth the overhead? Why is “Use this feedback to revise your answer” so much more effective than “Improve your answer”?
If intermediate results become inputs to subsequent calls, what is special about those inputs compared to just writing better instructions? What properties distinguish helpful feedback from noise?
Intermediate results provide more specific constraints than general instructions alone. Instructions describe what to do; intermediate artifacts provide concrete material to operate on. “Improve this essay” is vague; “Improve this essay: [draft] using this feedback: [specific issues]” gives the model focused constraints. Each step’s output becomes the next step’s foundation.
Instructions vs. concrete artifacts
To contrast instructions with concrete artifacts, consider two prompts that both ask for an improved explanation of the same concept. The first is a single-call “just do better” instruction with nothing to work from. The second supplies two artifacts alongside the instruction:
- The draft: a concrete object to edit.
- The feedback: a structured list of issues and suggestions.
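As a rough illustration, the two prompts could be built like this. Both function names and the exact wording are assumptions made for the example.

```typescript
// Instruction only: the model must guess what "better" means.
function justDoBetterPrompt(concept: string): string {
  return `Write a better explanation of ${concept}.`;
}

// Instruction plus artifacts: a draft to edit and issues to address.
function reviseWithArtifactsPrompt(draft: string, issues: string[]): string {
  return [
    "Revise the draft below. Address every listed issue and change nothing else.",
    "Draft:",
    draft,
    "Issues:",
    ...issues.map((issue, i) => `${i + 1}. ${issue}`),
  ].join("\n");
}
```

The second prompt is longer, but every added line narrows what an acceptable answer looks like.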
Feedback as structured constraint
Intermediate artifacts are valuable when they add structure:
- A plan: a list of subproblems that subsequent calls must address.
- Tests: specific input-output failures that code must fix.
- Critique: concrete issues that text must repair.
- An outline: section headers that the final document must follow.
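One way to make that structure explicit is to give each artifact a type and derive constraints from it. The shapes below are hypothetical, one per bullet above, and `toConstraints` is an illustrative helper rather than an established API.

```typescript
// Hypothetical artifact shapes, one per kind of intermediate result.
type Artifact =
  | { kind: "plan"; subproblems: string[] }
  | { kind: "tests"; failures: { input: string; expected: string; actual: string }[] }
  | { kind: "critique"; issues: string[] }
  | { kind: "outline"; headings: string[] };

// Turn an artifact into explicit constraints for the next call's prompt.
function toConstraints(artifact: Artifact): string[] {
  switch (artifact.kind) {
    case "plan":
      return artifact.subproblems.map((s) => `Address subproblem: ${s}`);
    case "tests":
      return artifact.failures.map(
        (f) => `For input ${f.input}, return ${f.expected} (currently ${f.actual})`
      );
    case "critique":
      return artifact.issues.map((issue) => `Fix: ${issue}`);
    case "outline":
      return artifact.headings.map((h) => `Include section: ${h}`);
  }
}
```

A discriminated union keeps the mapping exhaustive: adding a new artifact kind without handling it becomes a compile-time error under strict settings.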
Intermediate results are not just “more context”
You could, in principle, write one giant prompt that says:
“Imagine a draft, imagine its evaluation, and then imagine the improved version based on that evaluation.”
This is functionally what a single-call “do X, then critique, then improve” prompt does. But the intermediate “draft” and “evaluation” in that scenario only exist as imagined steps in one trajectory. They are not independently generated artifacts that can be inspected, logged, or reused. The model can skip over them or compress them, and nothing in your system will notice.
When you externalize intermediate results as first-class artifacts:
- You can examine them yourself.
- You can store them for later learning.
- You can choose to branch or stop based on them.
- You can send them to different specialized components.
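A minimal sketch of externalized artifacts, assuming a hypothetical `LlmCall` interface and a caller-supplied `needsRevision` gate. Each step’s output is an ordinary value that code can store and branch on.

```typescript
// Hypothetical model interface; not a real client library.
type LlmCall = (prompt: string) => Promise<string>;

// Each intermediate result is stored as data rather than imagined inline.
interface RevisionTrace {
  draft: string;
  feedback: string;
  revision: string | null; // null when the gate decided to stop early
}

async function draftCritiqueRevise(
  llm: LlmCall,
  topic: string,
  needsRevision: (feedback: string) => boolean
): Promise<RevisionTrace> {
  const draft = await llm(`Write a short essay on ${topic}.`);
  const feedback = await llm(`List specific problems in this essay:\n${draft}`);

  // Because feedback is a real value, ordinary code can inspect it and
  // decide whether to continue, stop, or route it elsewhere.
  if (!needsRevision(feedback)) {
    return { draft, feedback, revision: null };
  }

  const revision = await llm(
    `Revise the essay using the feedback.\nEssay:\n${draft}\nFeedback:\n${feedback}`
  );
  return { draft, feedback, revision };
}
```

The returned `RevisionTrace` can be logged or kept for later analysis, which is not possible when draft and critique exist only inside one completion.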
Design guidance
Introduce explicit intermediate results when:
- You want to make the reasoning process inspectable (for logging, debugging, or UX).
- You want later steps to be tightly constrained by earlier ones.
- You want to plug in external evaluators or tools between steps.
- You anticipate reusing intermediate artifacts (e.g., plans, outlines, test suites).
4.4 When to Generate Alternatives vs. Iterate on One
[DEMO: A product-description generator with two modes. Mode A: generate–evaluate–revise on a single description until it is “good enough.” Mode B: generate N different descriptions in parallel, have the model score them, and either (1) pick the best as-is or (2) run a single revision pass on just the top candidate. The UI shows how often the best-of-N from parallel generation beats the revised single path.]
Once you can split work across calls and turn intermediate results into artifacts, you have two basic levers:
- Deepen a single path by iterating on it (generate–evaluate–revise–…).
- Broaden your search by exploring multiple paths in parallel and selecting.
Parallel generation explores different regions of the model’s output distribution; serial refinement deepens one region. Use parallel candidates when you do not know which direction is promising. Use iteration when you have a reasonable direction and need to polish. Use external verification when you have ground truth; use model-based judgment when quality is subjective.
Serial refinement: deepen a committed path
Serial refinement is what we have already seen: generate once, then loop:
- Evaluate the result.
- Revise based on feedback.
- Repeat until some stop condition.
Serial refinement works well when:
- The space of acceptable answers is narrow and continuous (e.g., tightening an explanation).
- You already have a decent first attempt.
- Evaluation feedback can easily be translated into incremental changes.
It struggles when:
- Your initial attempt lands in a bad region (e.g., misinterprets the task).
- The task is creative or open-ended, with many distinct modes.
- There are multiple qualitatively different approaches you want to consider.
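The generate, evaluate, revise loop might be sketched like this. `LlmCall` and `Evaluate` are hypothetical interfaces, and the default threshold of 0.8 and budget of 3 rounds are arbitrary illustrative values, not recommendations.

```typescript
// Hypothetical model interface; not a real client library.
type LlmCall = (prompt: string) => Promise<string>;
// Hypothetical evaluator: returns a score in [0, 1] plus feedback text.
type Evaluate = (text: string) => Promise<{ score: number; feedback: string }>;

async function refine(
  llm: LlmCall,
  evaluate: Evaluate,
  task: string,
  threshold = 0.8,
  maxRounds = 3
): Promise<{ text: string; rounds: number }> {
  let text = await llm(task);
  for (let round = 0; round < maxRounds; round++) {
    const { score, feedback } = await evaluate(text);
    // Stop condition 1: the evaluator is satisfied.
    if (score >= threshold) return { text, rounds: round };
    text = await llm(
      `${task}\nPrevious attempt:\n${text}\nFeedback:\n${feedback}\nRevise.`
    );
  }
  // Stop condition 2: round budget exhausted. The last revision is
  // returned without a final evaluation.
  return { text, rounds: maxRounds };
}
```

Every revision starts from the previous attempt, so the loop stays in the region the first call chose. That is why a bad first attempt is hard to recover from.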
Parallel generation: explore the distribution
Parallel generation plus selection treats the model’s stochasticity as a feature: sample several candidates from the same prompt, then select among them. Selection can return:
- The single best candidate.
- The top K candidates to feed into further refinement.
- Different candidates for different audiences or contexts.
Parallel generation works well when:
- The task is creative (marketing copy, interface ideas, example generation).
- You do not know which style or angle you want yet.
- The cost of multiple generations is acceptable compared to the cost of a bad choice.
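Parallel generation plus selection can be sketched as a best-of-N helper. `LlmCall` and `Score` are hypothetical interfaces; the sketch assumes the model samples with enough randomness that repeated calls with the same prompt differ.

```typescript
// Hypothetical interfaces; not a real client library.
type LlmCall = (prompt: string) => Promise<string>;
type Score = (candidate: string) => Promise<number>;

async function bestOfN(
  llm: LlmCall,
  score: Score,
  prompt: string,
  n: number,
  k = 1
): Promise<string[]> {
  // Independent samples, issued concurrently.
  const candidates = await Promise.all(
    Array.from({ length: n }, () => llm(prompt))
  );
  const scored = await Promise.all(
    candidates.map(async (text) => ({ text, score: await score(text) }))
  );
  // Highest score first; keep the top k.
  return scored
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((c) => c.text);
}
```

With `k = 1` this returns the single best candidate; a larger `k` yields a shortlist for further refinement.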
Combining breadth and depth
You do not have to choose between these levers; you can combine them:
- Generate multiple initial candidates.
- Select the top one or two.
- Run refinement loops only on those.
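The three steps above could be combined as in the sketch below. `Generate`, `Score`, and `Revise` are hypothetical function types supplied by the caller, and keeping a revision only when it scores higher is one possible policy among several.

```typescript
// Hypothetical caller-supplied steps; each would wrap a model call.
type Generate = () => Promise<string>;
type Score = (text: string) => Promise<number>;
type Revise = (text: string) => Promise<string>;

async function exploreThenPolish(
  generate: Generate,
  score: Score,
  revise: Revise,
  n: number,
  keep: number,
  rounds: number
): Promise<string> {
  // Breadth: sample n candidates and rank them.
  const pool = await Promise.all(Array.from({ length: n }, () => generate()));
  const ranked = await Promise.all(
    pool.map(async (text) => ({ text, score: await score(text) }))
  );
  ranked.sort((a, b) => b.score - a.score);

  // Depth: spend refinement rounds only on the survivors.
  let best = { text: "", score: -Infinity };
  for (const start of ranked.slice(0, keep)) {
    let current = start;
    for (let i = 0; i < rounds; i++) {
      const text = await revise(current.text);
      const revised = { text, score: await score(text) };
      if (revised.score > current.score) current = revised; // keep improvements only
    }
    if (current.score > best.score) best = current;
  }
  return best.text; // empty string if n or keep is 0
}
```

The cost is roughly `n` generations plus `keep * rounds` revisions, so `keep` controls how much of the refinement budget goes to each direction.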
Plans vs. execution
The same tradeoff appears when you separate planning from execution:
- A planning call generates a sequence of steps.
- Execution calls implement those steps.
External vs. model-based verification
Finally, where does verification live when you have ground truth?
- If you can run tests, query a database, or call a deterministic API, do that. Use the model to interpret or fix based on those results.
- Use model-based evaluation when quality is subjective (style, tone, pedagogy) or when you have no cheap oracle.
The breadth and depth levers apply to model-based evaluation too:
- Breadth: ask multiple evaluators (potentially with different instructions) and aggregate.
- Depth: iterate on a single evaluator’s feedback until quality converges.
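When a deterministic check exists, the loop might look like the following sketch. `RunTests` is a hypothetical function that executes real tests outside the model and returns failure messages; `LlmCall` and the attempt budget are likewise assumptions.

```typescript
// Hypothetical model interface; not a real client library.
type LlmCall = (prompt: string) => Promise<string>;
// Hypothetical deterministic check: failure messages, empty if all pass.
type RunTests = (code: string) => Promise<string[]>;

async function generateUntilTestsPass(
  llm: LlmCall,
  runTests: RunTests,
  spec: string,
  maxAttempts = 3
): Promise<{ code: string; passed: boolean; attempts: number }> {
  let code = await llm(`Implement:\n${spec}`);
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const failures = await runTests(code); // ground truth, not model opinion
    if (failures.length === 0) return { code, passed: true, attempts: attempt };
    if (attempt === maxAttempts) break;
    // The model's job is to fix, guided by real failure output.
    code = await llm(
      `Fix this code so the failing tests pass.\nCode:\n${code}\nFailures:\n${failures.join("\n")}`
    );
  }
  return { code, passed: false, attempts: maxAttempts };
}
```

The verdict comes from `runTests`, so the model never gets to declare its own output correct.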
Design guidance
Use serial refinement when:
- You already have a plausible answer.
- Improvements are incremental and local.
- You have structured feedback you can feed forward.
Use parallel generation when:
- You do not know what “good” looks like yet.
- Diversity of ideas matters.
- You can afford multiple samples.
Combine both when:
- You want both exploration and polish.
- The cost of a bad initial direction is high.
Bridge to Chapter 5
In this chapter, reasoning stopped being a mystical property of the model and became an architectural choice. You saw that:
- Single-call “thinking” is bound by initial context and autoregressive commitment.
- Splitting work across calls lets you insert retrieval, verification, and focused prompts.
- Intermediate artifacts—plans, drafts, evaluations, test results—are where constraint and improvement actually live.
- Breadth (multiple candidates) and depth (iterative refinement) are the two basic levers you control.