Every other chapter has described a property you can see inside a single task. Context is visible in the prompt. Memory is visible in what gets retrieved. Agency is visible in which tools fire. Feedback is visible in the revisions you watch unfold on screen. Learning is different because you never see it happen directly. You only notice it when you come back a week later and the system fails in a new way, or stops failing in an old way. The prompts look similar. The tools are the same. The model version hasn’t changed. Yet the system does not behave the way it did when you shipped it.

Architecturally, the situation is stark: the model that served your first request is the model that serves your millionth. The weights are frozen. API providers do not update parameters based on your traffic. From the model’s perspective, every request is the first request it has ever seen. Yet from the system’s perspective, this is obviously false. The thousandth bug report does not feel like the first. The knowledge base does not start empty every morning. The routing rules do not remain as naive as the day you wrote them. Something has changed.

Framed this way, the question is architectural: the core reasoning engine is identical from one day to the next, so improvement must live elsewhere. Memory and feedback exist, but some systems get better while others only stay aware of their own mistakes. The key distinction is between systems that only remember and systems that actually learn. The answer is that the composite agent you’ve been building across nine chapters has multiple surfaces where improvement can accumulate. This chapter maps those surfaces.

10.1 Frozen Models and Changing Systems

[DEMO: Two identical LLM-backed agents answer a sequence of similar tasks. Both use the same underlying model and temperature. The left agent always uses a static prompt and fixed retrieval configuration. The right agent stores successful interactions and periodically updates its prompt and retrieval rules based on those examples. Users can step through tasks; over time the right agent’s answers measurably improve while the left agent’s do not, despite identical model weights.]

You can ship an agent, do nothing to the model, and watch performance improve over weeks: fewer escalations, better formatting, fewer hallucinations on common queries. The model is fixed, the weights are identical, and every inference call is stateless, so any accumulated experience must live outside the model. Architecturally, the agent is the composite system, not the model alone. Improvement accumulates in the components you control.
The model doesn’t learn; the system learns. Improvement lives in everything that can change around a fixed reasoning engine: prompts, examples, retrieval configuration, routing rules, and external knowledge.
You already saw in earlier chapters that each task goes through three phases: assemble context, call the model, handle the result. Learning simply adds a fourth phase: update the scaffolding—prompts, stored examples, routing rules, and knowledge structures—for future tasks.
type Task = { type: string; input: string };
type Result = { output: string; score: number };

export class LearningWrapper {
  // System-level state: things that can change
  @field promptLibrary: Map<string, string> = new Map();
  @field successfulExamples: Task[] = [];
  @field routingHints: Map<string, string[]> = new Map();

  constructor(private readonly llm: LLMClient) {}

  async handle(task: Task): Promise<Result> {
    const prompt = this.buildPrompt(task);
    const { output } = await this.llm.complete(prompt);

    const score = await this.evaluate(task, output);

    // This is where learning actually happens
    await this.updateScaffolding(task, output, score);

    return { output, score };
  }

  private buildPrompt(task: Task): string {
    const base = this.promptLibrary.get(task.type) ?? "You are a helpful assistant.";
    const hints = this.routingHints.get(task.type) ?? [];
    return [
      base,
      hints.length ? `Guidelines:\n- ${hints.join("\n- ")}` : "",
      `Task:\n${task.input}`
    ].join("\n\n");
  }

  private async evaluate(task: Task, output: string): Promise<number> {
    // Chapter 8/9: any evaluation mechanism you like
    return await externalJudge(task, output);
  }

  private async updateScaffolding(task: Task, output: string, score: number) {
    if (score > 0.9) {
      // Treat this as an example of “what good looks like”
      this.successfulExamples.push(task);
    }

    if (score < 0.5) {
      // Ask the model how to avoid this kind of failure next time
      const hint = await this.llm.complete(
        `We got a low score on this ${task.type} task.

Task:
${task.input}

Bad output:
${output}

In one concise bullet, state a concrete guideline that would reduce this failure mode in the future.`
      );

      const existing = this.routingHints.get(task.type) ?? [];
      this.routingHints.set(task.type, [...existing, hint.output.trim()]);
    }
  }
}
Nothing inside the LLMClient changes. You never touch weights. All of the improvement happens in the wrapper:
  • The prompt library can be edited.
  • The set of good examples can grow.
  • The routing hints can accumulate new guidelines.
Because the wrapper runs before every call, every small change there propagates to all future tasks. This has two implications: if the system does not improve, the wrapper is not changing; if it does improve, you should be able to identify where that change occurs in code or data. The model does not change between calls; all learning comes from explicit mechanisms around it.
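One way to hold yourself to that standard is to snapshot the wrapper’s mutable state and diff snapshots across deployments. The sketch below assumes the LearningWrapper fields defined above; snapshotScaffolding and diffScaffolding are illustrative helpers, not framework features.
type ScaffoldingSnapshot = {
  prompts: Record<string, string>;
  exampleCount: number;
  hints: Record<string, string[]>;
};

// Capture the mutable state of a LearningWrapper at a point in time
function snapshotScaffolding(w: LearningWrapper): ScaffoldingSnapshot {
  return {
    prompts: Object.fromEntries(w.promptLibrary),
    exampleCount: w.successfulExamples.length,
    hints: Object.fromEntries(w.routingHints)
  };
}

// Report which surfaces changed between two snapshots
function diffScaffolding(before: ScaffoldingSnapshot, after: ScaffoldingSnapshot): string[] {
  const changes: string[] = [];
  for (const [type, prompt] of Object.entries(after.prompts)) {
    if (before.prompts[type] !== prompt) changes.push(`prompt updated for "${type}"`);
  }
  if (after.exampleCount !== before.exampleCount) {
    changes.push(`stored examples: ${before.exampleCount} -> ${after.exampleCount}`);
  }
  for (const [type, hints] of Object.entries(after.hints)) {
    const prev = before.hints[type] ?? [];
    if (hints.length !== prev.length) {
      changes.push(`routing hints for "${type}": ${prev.length} -> ${hints.length}`);
    }
  }
  return changes;
}
Run the diff after each reflection cycle or deployment, and the question of where an improvement came from has a concrete answer.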

10.2 The Difference Between Learning and Memory

[DEMO: On the left, an agent with a “tape recorder” log: it stores entire past interactions and can retrieve and replay them into context. On the right, an otherwise identical agent that, after each task, extracts a short “lesson” (a generalized rule) and stores that instead. Users can run varied tasks; the left agent can quote prior interactions but keeps repeating old mistakes, while the right agent begins to avoid classes of mistakes even on new, unseen inputs.]

By Chapter 2 you already had memory: databases, vector stores, logs. By Chapter 9 you had feedback: scores, critiques, test failures. You can now store almost anything and you can judge almost everything. Logging every interaction and replaying it into context does not necessarily mean the system has learned. Adding evaluation labels also does not guarantee learning. Pasting past successes verbatim into prompts remains memory until it is processed into general rules. The line is easy to blur, because both memory and learning involve putting bits somewhere and reading them later. But they play different roles.
Memory stores information; learning stores patterns that improve performance. Few-shot examples, routing rules, and distilled guidelines are memory that has been processed into reusable behavior.
The difference shows up in how you use what you store. A pure memory system writes everything down and hopes that replaying the right parts later will help:
@field rawLog: { task: Task; output: string; score: number }[] = [];

async afterTask(task: Task, output: string, score: number) {
  this.rawLog.push({ task, output, score });
}

async buildContextFor(task: Task): Promise<string> {
  const similar = await this.retrieveSimilarTasks(task);

  const examples = similar
    .slice(0, 3)
    .map(
      (rec) =>
        `Task:\n${rec.task.input}\n\nOutput:\n${rec.output}\n\nScore: ${rec.score.toFixed(2)}`
    )
    .join("\n\n---\n\n");

  return `You are solving a new task. Here are some past tasks and outputs:

${examples}

Now solve this task:\n${task.input}`;
}
This is memory. It may help, but it is undifferentiated. High-scoring and low-scoring cases can both appear. The system has not distilled why anything worked or failed; it simply stores episodes. A learning system inserts a processing step. It turns raw episodes into compact instructions that can be applied to other episodes.
@field lessons: string[] = [];

async afterTaskWithFeedback(task: Task, output: string, score: number) {
  // Memory: keep the raw record if you want
  // ...

  // Learning: extract a portable lesson
  const lesson = await this.llm.complete(
    `We just handled a task of type "${task.type}".

Task:
${task.input}

Output:
${output}

Score: ${score}

In one sentence, describe a general rule that would improve future handling of similar tasks. Focus on *general* patterns, not this specific example.`
  );

  const text = lesson.output.trim();
  if (text && !this.lessons.includes(text)) {
    this.lessons.push(text);
  }
}

async buildContextFor(task: Task): Promise<string> {
  const relevantLessons = this.lessons.filter((l) => l.includes(task.type)).slice(0, 5);

  return [
    `You are handling a "${task.type}" task.`,
    relevantLessons.length
      ? `Here are lessons learned from past similar tasks:\n- ${relevantLessons.join("\n- ")}`
      : "",
    `Now handle this task:\n${task.input}`
  ]
    .filter(Boolean)
    .join("\n\n");
}
The raw log can be enormous and noisy. The lessons array is small and curated. It encodes patterns instead of instances. The distinction matters for three reasons:
  1. Generalization. Memory lets you copy-and-paste previous behavior. Learning lets you adapt to new inputs that merely rhyme with the old ones.
  2. Compression. You can only fit so much into the context window. Storing a rule (“always ask for the account ID before answering a billing question”) is cheaper than replaying ten transcripts where that turned out to be necessary.
  3. Control. Patterns are auditable. You can read through a list of lessons and see what the system has “internalized.” Raw logs are opaque.
Few-shot examples sit right at this boundary. A curated set of examples is memory, but selecting which examples to include, and rewriting them as generic templates instead of specific episodes, is learning. A useful mental test: if you strip away the specific names and dates from what you stored, does it still improve performance? If yes, you are probably looking at learning. If not, you just have a very detailed memory.
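That test can be made mechanical: strip the specifics from a stored lesson and check whether it still lifts scores on held-out tasks. A minimal sketch, reusing the LLMClient and externalJudge conventions from 10.1; lessonGeneralizes and its redaction prompt are illustrative.
// Illustrative check: does a lesson still help once its specifics are removed?
async function lessonGeneralizes(
  llm: LLMClient,
  lesson: string,
  heldOutTasks: Task[]
): Promise<boolean> {
  // Ask the model to strip names, dates, IDs: anything tied to one episode
  const redacted = await llm.complete(
    `Rewrite the following guideline with all specific names, dates, and identifiers removed, keeping only the general rule:\n\n${lesson}`
  );

  let withLesson = 0;
  let withoutLesson = 0;

  for (const task of heldOutTasks) {
    const base = `Task:\n${task.input}`;
    const guided = `Guideline:\n${redacted.output.trim()}\n\n${base}`;

    const plain = await llm.complete(base);
    const augmented = await llm.complete(guided);

    withoutLesson += await externalJudge(task, plain.output);
    withLesson += await externalJudge(task, augmented.output);
  }

  // If the redacted rule still lifts scores, it encodes a pattern, not an episode
  return withLesson > withoutLesson;
}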

10.3 Managing Example Accumulation

[DEMO: An interface shows a live prompt for a particular task type, with a panel of “candidate examples” collected from past usage. Users can click to accept or reject examples, or let the system auto-select based on similarity. As more tasks run, the prompt on the right either steadily improves (when selection is curated) or bloats and degrades (when every past example is appended). A metric chart shows performance over time under each strategy.]

Learning lives in prompts, examples, routing, and knowledge. This raises another question: as those artifacts accumulate, how do you prevent them from turning into noise? If you keep adding examples to prompts, you need a stopping rule. If every good interaction becomes a demonstration, you risk filling the context window with old news. When the system edits its own instructions, you must decide which edits persist and which get rolled back. “Just store what worked” is not enough. Without selection, learning collapses back into memory: a pile of episodes with no structure. Without forgetting, new knowledge gets buried under old, and previously helpful examples become misleading as the domain shifts.
Examples accumulate usefully when you treat them as a retrieval problem. You store many, select few. The system learns not by hoarding every success, but by choosing the most relevant, current, and representative patterns for each new task.
The simplest mechanism is to separate collection from use. Collection is cheap and permissive: record potentially useful episodes. Use is strict and selective: pick only a handful of examples and rules for the current task. Two practical constraints shape the selection step. Embedding quality and retrieval latency limit how complex your selection logic can be, and retrieval can surface off-topic or adversarial examples, so you need safeguards such as score thresholds, age penalties, and periodic human review of surfaced examples.
type Example = {
  id: string;
  taskType: string;
  input: string;
  output: string;
  score: number;
  timestamp: number;
};

export class ExampleManager {
  @field examples: Example[] = [];

  async considerForStorage(task: Task, output: string, score: number) {
    if (score < 0.9) return; // Only store strong successes

    this.examples.push({
      id: crypto.randomUUID(),
      taskType: task.type,
      input: task.input,
      output,
      score,
      timestamp: Date.now()
    });
  }

  async selectForPrompt(task: Task, maxExamples = 3): Promise<Example[]> {
    // Start from candidates of the same type
    const candidates = this.examples.filter((e) => e.taskType === task.type);

    // Rank by mixed objective: similarity, recency, quality
    const scored = await Promise.all(
      candidates.map(async (ex) => {
        const similarity = await this.semanticSimilarity(task.input, ex.input);
        const recency = 1 - this.agePenalty(ex.timestamp);
        const quality = ex.score;
        return { ex, score: 0.6 * similarity + 0.2 * recency + 0.2 * quality };
      })
    );

    return scored
      .sort((a, b) => b.score - a.score)
      .slice(0, maxExamples)
      .map((s) => s.ex);
  }

  private agePenalty(timestamp: number): number {
    const days = (Date.now() - timestamp) / (1000 * 60 * 60 * 24);
    return Math.min(1, days / 30); // Anything older than 30 days gets full penalty
  }

  private async semanticSimilarity(a: string, b: string): Promise<number> {
    // Use embeddings or a classifier; 0–1 similarity
    return await embeddingSimilarity(a, b);
  }
}
This pattern decouples “how much we remember” from “how much we show the model.”
  • The examples array can grow large.
  • The prompt never sees more than maxExamples for a given task.
  • Old examples gradually lose influence via the age penalty.
  • Low-similarity or low-quality examples rarely surface.
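Wired into the handle loop from 10.1, the split looks like this. A sketch, assuming the wrapper gains an exampleManager member alongside its existing llm, buildPrompt, and evaluate:
async handleWithExamples(task: Task): Promise<Result> {
  // Use: a few relevant, recent, high-quality examples for this task
  const examples = await this.exampleManager.selectForPrompt(task, 3);

  const prompt = [
    this.buildPrompt(task),
    examples.length
      ? `Successful past examples:\n\n${examples
          .map((ex) => `Input:\n${ex.input}\n\nOutput:\n${ex.output}`)
          .join("\n\n---\n\n")}`
      : ""
  ]
    .filter(Boolean)
    .join("\n\n");

  const { output } = await this.llm.complete(prompt);
  const score = await this.evaluate(task, output);

  // Collection: permissive storage; selection stays strict at prompt time
  await this.exampleManager.considerForStorage(task, output, score);

  return { output, score };
}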
You can take this further and let the model itself curate the collection. Instead of storing every high-scoring case, you can ask the model which ones represent distinct patterns:
async summarizeAndCull(taskType: string) {
  const ofType = this.examples.filter((e) => e.taskType === taskType);

  const summary = await this.llm.complete(
    `You are managing a library of examples for task type "${taskType}".

Here are some examples:
${ofType
  .slice(0, 20)
  .map(
    (e, i) => `Example ${i + 1}:
Input: ${e.input}
Output: ${e.output}
Score: ${e.score}
`
  )
  .join("\n")}

1. Group these examples into 3–5 distinct patterns.
2. For each pattern, write a short description and pick *one* example that best represents it.
Return JSON with keys "patterns" (array of {description, representativeExampleIndex}), where representativeExampleIndex is the example number shown above (starting at 1).`
  );

  const patterns = JSON.parse(summary.output).patterns;

  // The prompt numbers examples from 1 and only showed the first 20,
  // so map indices back to zero-based positions within that subset.
  const shown = ofType.slice(0, 20);
  const keepIds = new Set(
    patterns.map(
      (p: { representativeExampleIndex: number }) => shown[p.representativeExampleIndex - 1]?.id
    )
  );
  const shownIds = new Set(shown.map((e) => e.id));

  // Keep the representatives; only cull examples the model actually reviewed
  this.examples = this.examples.filter(
    (e) => e.taskType !== taskType || !shownIds.has(e.id) || keepIds.has(e.id)
  );
}
Now the system not only selects examples per task; it periodically compresses its library into a small set of archetypes. The design implications:
  • Learning is constrained by the context window. You cannot “remember everything” and “use everything.” Retrieval is the valve that determines what actually affects behavior.
  • Staleness is domain-dependent. In a fast-moving domain (policies, prices, medical guidelines), your age penalty should be aggressive; in a stable domain (math proofs), examples can remain valid far longer.
  • Human oversight scales further than you think. You do not need to hand-curate every example. For example, instead of reviewing thousands of support tickets, a reviewer can examine 10–20 archetypal examples per week before they are added to the prompt library. You only need to review the patterns the system proposes to keep.
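That review step can be as lightweight as a pending queue: archetypes proposed by summarizeAndCull wait for a human decision before they become eligible for prompts. A sketch; ReviewQueue and its status field are illustrative, not framework features.
type PendingExample = Example & { status: "pending" | "approved" | "rejected" };

export class ReviewQueue {
  @field pending: PendingExample[] = [];

  // Called by the system: queue a proposed archetype instead of using it immediately
  propose(example: Example) {
    this.pending.push({ ...example, status: "pending" });
  }

  // Called by a human reviewer, e.g. from an admin view
  review(id: string, decision: "approved" | "rejected") {
    const item = this.pending.find((p) => p.id === id);
    if (item) item.status = decision;
  }

  // Only approved archetypes are ever eligible for prompt selection
  approvedFor(taskType: string): Example[] {
    return this.pending.filter((p) => p.taskType === taskType && p.status === "approved");
  }
}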
In a mature production agent, you should be able to answer which examples it relies on for a task type and why. If you cannot, you do not have learning; you have an unstructured accumulation of examples that are no longer meaningfully curated.

10.4 System-Level vs Model-Level Learning

[DEMO: Three panels compare approaches on the same evaluation set. Panel A: a vanilla agent with static prompts and no adaptation. Panel B: an adaptive agent that logs outcomes, updates prompts and retrieval, and uses distilled lessons, but never fine-tunes the model. Panel C: an agent backed by a custom fine-tuned model but with static prompts. Users can toggle which axes to view (accuracy on old tasks, accuracy on new tasks, robustness to distribution shift). The demo highlights that system-level learning improves performance without erasing capabilities, while model fine-tuning can boost some areas while degrading others.]

By now we have a clear recipe for system-level learning:
  1. Use feedback (Chapter 9) to identify what went wrong or right on a task.
  2. Distill that into patterns, examples, or updates to prompts, routing, and knowledge.
  3. Store those patterns somewhere persistent.
  4. Retrieve the relevant subset for the next task.
This is enough to make a system feel like it “learns from experience” without touching the model. There are limits to wrapper-based learning. At some point you may need model training, and crossing that boundary changes how improvements are applied.
Feedback improves a single task; learning connects that feedback to future tasks via storage and reuse. System-level learning changes the wrapper and is reversible and local. Model-level learning (fine-tuning) changes the core and can improve capabilities at the cost of forgetting or distorting others.
The mechanism we have been using all chapter can be summarized in three hooks:
async beforeTask(task: Task): Promise<AugmentedTask> {
  // Learning applied: prompts, examples, knowledge, routing
}

async afterTask(task: Task, result: Result): Promise<void> {
  // Feedback captured: evaluation, critique, metadata
}

@schedule("0 2 * * *")
async periodicReflection(): Promise<void> {
  // Patterns extracted; wrapper updated
}
This architecture has clear limits:
  • If the model simply cannot perform a capability at all (e.g., a small model consistently fails at nontrivial code synthesis), no amount of prompt cleverness will conjure the missing competence.
  • If you want to encode domain knowledge so deeply that it is always “there” even without retrieval, you eventually face the context window and latency costs of external memory.
At that point you may decide to fine-tune. Mechanically, fine-tuning looks like moving some of the work we have been doing in the wrapper into a separate training pipeline:
// System-level: collect structured training signals
type TrainingExample = {
  input: string;
  idealOutput: string;
};

@field trainingBuffer: TrainingExample[] = [];

async afterTask(task: Task, result: Result) {
  const { output, score } = result;

  // Assume the feedback layer (Chapter 9) attaches an optional correctedOutput when a human edits the result
  if (score < 0.7 && result.correctedOutput) {
    this.trainingBuffer.push({
      input: this.buildRawModelInput(task), // what we sent to the base model
      idealOutput: result.correctedOutput
    });
  }

  // ...usual system-level learning here...
}

// Offline: periodically export data for fine-tuning
async exportForFinetune() {
  const batch = this.trainingBuffer.splice(0, this.trainingBuffer.length);
  await writeToBlobStorage(batch);
}
The learning still starts at the system level: you decide which examples are worth turning into gradient updates. The difference is where the “apply improvement” step runs:
  • In system-level learning, you update prompts, examples, routing rules, and knowledge inside your own infrastructure.
  • In model-level learning, you feed labeled examples into a separate training run that produces a new set of weights.
There are important tradeoffs:
  • Safety. System-level changes are easy to sandbox and roll back (see the sketch after this list). If a new routing rule is bad, you delete one row. If a fine-tune is bad, you have a misaligned model and need to revert the whole artifact.
  • Scope. Wrapper changes affect specific surfaces (a task type, a tool, a prompt). Fine-tuning changes behavior everywhere, including places you did not test.
  • Forgetting. System-level learning rarely destroys previous capabilities; at worst it can overshadow them with new prompts. Fine-tuning can overwrite internal representations (“catastrophic forgetting”), making the model worse on tasks it previously handled well.
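The rollback point above is easy to make concrete at the system level: keep prompt history instead of overwriting it. A sketch in the spirit of the promptLibrary used throughout this chapter; VersionedPromptLibrary is an illustrative name, not a framework class.
export class VersionedPromptLibrary {
  @field versions: Map<string, string[]> = new Map(); // taskType -> history of prompts

  current(taskType: string): string | undefined {
    const history = this.versions.get(taskType);
    return history?.[history.length - 1];
  }

  update(taskType: string, newPrompt: string) {
    const history = this.versions.get(taskType) ?? [];
    this.versions.set(taskType, [...history, newPrompt]);
  }

  rollback(taskType: string) {
    const history = this.versions.get(taskType) ?? [];
    if (history.length > 1) {
      this.versions.set(taskType, history.slice(0, -1));
    }
  }
}
A reflection job writes through update, and a bad proposal becomes a one-call rollback instead of an incident.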
In practice, high-leverage learning often looks like:
  • Use system-level mechanisms to adapt quickly, cheaply, and safely.
  • Use the data they generate to inform occasional fine-tunes when there is a clear, well-scoped need (for example, a narrow domain where context-based retrieval is too slow or unwieldy).
The important conceptual shift is this: when you see a system improve, do not assume the model somehow “absorbed” the experience. Ask instead which of two levers moved:
  • The wrapper (prompts, examples, retrieval, routing, knowledge).
  • The weights (a new model or fine-tune).
If the source of improvement is unclear, the system behaves like an uncontrolled experiment rather than a deliberately designed learning process.
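A lightweight way to keep the two levers distinguishable is to record, for every request, which version of each one served it. A sketch; ServingRecord and the wrapperVersion hash are illustrative assumptions.
type ServingRecord = {
  taskId: string;
  modelVersion: string;   // the provider's model identifier used for this call
  wrapperVersion: string; // e.g. a hash of prompts, examples, and routing rules
  score: number;
  timestamp: number;
};

@field servingLog: ServingRecord[] = [];

recordServing(taskId: string, score: number, modelVersion: string, wrapperVersion: string) {
  this.servingLog.push({ taskId, modelVersion, wrapperVersion, score, timestamp: Date.now() });
}
When scores move, the log shows whether the wrapper version, the model version, or both changed underneath them.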

10.5 Putting It Together: An Improving Agent Architecture

The elements from previous chapters come together here. You already know how to store state (Chapter 2), schedule autonomous work (Chapter 7), evaluate (Chapter 8), and feed feedback into multi-step loops (Chapter 9). Learning is what happens when you point those capabilities across tasks instead of just within one. The following class sketches an architecture for an improving agent:
export class ImprovingAgent extends AgenticSystem {
  // 1. Things that can change
  @field promptLibrary: Map<string, string> = new Map();
  @field exampleManager = new ExampleManager(); // ExampleManager from 10.3: stores and selects examples
  @field lessons: string[] = [];
  @field knowledgeBase: VectorStore<Knowledge>;
  @field history: { task: Task; result: Result }[] = [];
  @field improvementLog: { timestamp: number; change: string }[] = [];

  // 2. Before each task: apply what we’ve learned so far
  async beforeTask(task: Task): Promise<AugmentedTask> {
    const basePrompt =
      this.promptLibrary.get(task.type) ?? "You are a careful, honest assistant.";

    const relevantExamples = await this.exampleManager.selectForPrompt(task, 2);
    const relevantLessons = this.lessons.filter((l) => l.includes(task.type)).slice(0, 3);
    const knowledge = await this.knowledgeBase.search(task.input, 5);

    const promptParts = [
      basePrompt,
      relevantLessons.length
        ? `Lessons learned from past "${task.type}" tasks:\n- ${relevantLessons.join("\n- ")}`
        : "",
      relevantExamples.length
        ? `Here are successful past examples:\n\n${relevantExamples
            .map(
              (ex) =>
                `Input:\n${ex.input}\n\nOutput:\n${ex.output}\n(Score: ${ex.score.toFixed(2)})`
            )
            .join("\n\n---\n\n")}`
        : "",
      knowledge.length
        ? `Relevant reference knowledge:\n${knowledge.map((k) => "- " + k.content).join("\n")}`
        : "",
      `Now handle this new task:\n${task.input}`
    ].filter(Boolean);

    return { ...task, prompt: promptParts.join("\n\n") };
  }

  // 3. After each task: capture experience
  async afterTask(task: AugmentedTask, result: Result) {
    this.history.push({ task, result });

    const { score } = result;

    // Memory: keep high-quality examples
    await this.exampleManager.considerForStorage(task, result.output, score);

    // Learning: distill explicit lessons
    if (score < 0.6 || score > 0.9) {
      const lesson = await this.llm.complete(
        `We just handled a "${task.type}" task with score ${score}.

Prompt used:
${task.prompt}

Output:
${result.output}

In one sentence, state a general guideline that would either:
- prevent this kind of failure in the future (if score < 0.6), or
- preserve this success pattern (if score > 0.9).

Return only the guideline.`
      );

      const text = lesson.output.trim();
      if (text && !this.lessons.includes(text)) {
        this.lessons.push(text);
      }
    }

    // Knowledge: optionally extract and store new information
    await this.maybeExtractKnowledge(task, result);
  }

  // 4. Periodically: reflect and update the scaffolding
  @schedule("0 3 * * *") // every day at 3am
  async nightlyReflection() {
    const recent = this.history.slice(-200); // recent tasks only

    const analysis = await this.llm.complete(
      `You are analyzing recent performance of an agent.

Here are some tasks, outputs, and scores:
${recent
  .map(
    ({ task, result }, i) => `Case ${i + 1}:
Type: ${task.type}
Prompt: ${task.prompt.slice(0, 500)}...
Output: ${result.output.slice(0, 500)}...
Score: ${result.score}
`
  )
  .join("\n")}

1. Identify task types whose scores are trending down.
2. For each, propose a small prompt change or additional instruction that would likely improve performance.
Return JSON: [{ taskType, proposal, rationale }].`
    );

    const proposals = JSON.parse(analysis.output);

    for (const p of proposals) {
      const prev = this.promptLibrary.get(p.taskType) ?? "You are a careful, honest assistant.";
      const newPrompt = `${prev}\n\nAdditional guidance:\n${p.proposal}`;

      // (You could test this on a held-out set before adopting it; see the sketch at the end of this section.)
      this.promptLibrary.set(p.taskType, newPrompt);

      this.improvementLog.push({
        timestamp: Date.now(),
        change: `Updated prompt for ${p.taskType}: ${p.proposal}`
      });
    }

    // Maintenance: compress examples, prune stale knowledge, etc.
    await this.exampleManager.summarizeAndCullForAllTypes();
    await this.pruneStaleKnowledge();
  }
}
In this design, everything that learns is explicit:
  • promptLibrary holds evolving instructions.
  • exampleManager and lessons hold distilled patterns from experience.
  • knowledgeBase holds growing domain knowledge.
  • nightlyReflection is the scheduled feedback-to-learning bridge.
The model is nowhere near any of these data structures. It remains a pure function from text to text. All of the adaptation happens in code and storage that you own. Once you see learning this way, the principle generalizes:
  • A personalized assistant is one whose prompts and lessons are keyed by user ID.
  • A team-adapted agent is one whose examples and routing rules are filtered by team.
  • A self-improving system is one with credible evaluation, disciplined logging, and scheduled routines that convert feedback into concrete updates.
There is no separate magic ingredient. Learning is simply feedback plus memory, oriented across time.
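One part of the sketch deserves a concrete shape: the parenthetical in nightlyReflection about testing a proposed prompt on a held-out set before adopting it. A minimal gate, assuming ImprovingAgent exposes the same llm client and an evaluate method like the wrapper in 10.1; the heldOutTasks field and adoptIfBetter helper are illustrative.
@field heldOutTasks: Map<string, Task[]> = new Map(); // taskType -> frozen evaluation tasks

private async trialScore(taskType: string, prompt: string): Promise<number> {
  const tasks = this.heldOutTasks.get(taskType) ?? [];
  if (tasks.length === 0) return 0;

  let total = 0;
  for (const task of tasks) {
    const { output } = await this.llm.complete(`${prompt}\n\nTask:\n${task.input}`);
    total += await this.evaluate(task, output);
  }
  return total / tasks.length;
}

private async adoptIfBetter(taskType: string, current: string, proposed: string) {
  const [before, after] = await Promise.all([
    this.trialScore(taskType, current),
    this.trialScore(taskType, proposed)
  ]);

  if (after > before) {
    this.promptLibrary.set(taskType, proposed);
    this.improvementLog.push({
      timestamp: Date.now(),
      change: `Adopted prompt for ${taskType} (held-out score ${before.toFixed(2)} -> ${after.toFixed(2)})`
    });
  }
}
With the gate in place, nightlyReflection proposes and the held-out set decides; only changes that measurably help reach the production prompt library.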

Key Takeaways

  • The model’s weights are frozen in production. What users experience as “learning” is the evolution of prompts, examples, routing rules, and external knowledge—components you control.
  • Memory and learning are not the same. Memory stores episodes; learning extracts and stores patterns that change future behavior. Curated few-shot examples sit at the boundary: selecting and generalizing them is what turns memory into learning.
  • Example accumulation is a retrieval problem, not a hoarding problem. You can store many, but you should select few: relevant, recent, high-quality, and representative.
  • System-level learning is safe and local: you change the wrapper, not the core. Model-level learning (fine-tuning) is powerful but global and risky: you change the core behavior and can accidentally erase prior capabilities.
  • Architecturally, learning appears as three hooks: before-task application of accumulated knowledge, after-task capture of feedback, and periodic reflection that turns feedback into updated scaffolding.

Transition

Chapter 10 closes the loop on the ten elements. You now have the full vocabulary: context, memory, agency, reasoning, coordination, artifacts, autonomy, evaluation, feedback, and learning. Each element is simple in isolation. The systems that feel “intelligent” in practice are the ones that compose them well. In Part 2, we stop looking at these elements one by one. We build systems where they interact. Chapter 11, Virtual Office, revisits the multi-agent scenario from the introduction and shows how coordination, artifacts, autonomy, and learning combine into something that feels less like a single chatbot—and more like a functioning organization.