When you talk to a capable AI system, it can seem as if you are interacting with something that accumulates knowledge of you. Preferences settle in. Shared jokes return hours later. A project you started yesterday resumes today without recapitulation. The system seems to construct a private representation that includes a history, a set of facts, and an ongoing relationship. That private representation does not live in the model. Between calls, the model is as empty as a pure function. Yet your system clearly knows more on day 30 than it did on day 1. Somewhere, information is piling up. Where does that information go?
It is tempting to answer: “in context.” Chapter 1 already showed that context is the world you assemble for a single call. But context is bounded. You cannot pour a year of interaction into a single window. Meanwhile, the apparent memory of an agent can grow without any obvious limit. The more you use it, the more it seems to know. You now have two different growth curves in your head: a fixed-size context window and an unbounded sense of accumulated information. If the model’s working world is finite, where does the rest of that information go? How does any of it ever come back?
A practical consequence is that if you conflate the bounded context window with unbounded accumulation, you will either forget too much or load too much. In both cases, the agent feels duller than it could be. A second consequence is less obvious. Storage itself is boring. Files, tables, vector indexes—none of these feel like “intelligence.” Yet the agent you experience is only as good as the way your storage feeds its context. Memory is not a separate mystical faculty; it is the plumbing that determines what the model ever gets to think about.
The core design problem is not how to store everything, but how to decide what to retrieve into context. The illusion of a coherent, remembering agent arises from that decision, repeated over time. This chapter is about the pattern behind that illusion. We will treat “memory” as it actually functions in an agentic system: as external storage designed for one purpose only—to be selectively pulled back into a finite context window when it matters.

2.1 What Memory Stores

When an agent “remembers” that the user tends to be skeptical of cloud solutions after a bad AWS outage, or that when Alice mentions “the client” she specifically means Acme Corp, not the generic concept, or that the team has an unspoken understanding that Monday meeting deadlines are soft, what exactly is being stored? In traditional software, memory holds simple types: booleans, integers, bytes, arrays. These are concrete, mechanical representations with clear semantics defined by the programming language. But agent memory behaves differently: the system appears to track context, maintain nuanced distinctions, and preserve relationships between concepts. You cannot easily represent “skeptical of cloud solutions because of past experience” as a database column. When you think about “the agent’s memory,” you think about understanding and meaning, not rows and bytes.
The bridge between these two views is simpler than it appears: semantic information—facts, concepts, understanding—is represented as text strings that the language model interprets. Consider what happens when your system “remembers” that kind of contextual understanding:
// Storage: text describing the user's attitude and its cause
const userContext = {
  userId: 'alice',
  attitudes: [
    {
      topic: 'cloud infrastructure',
      sentiment: 'skeptical',
      reason: 'Experienced significant downtime during AWS outage in Q2 2024, affecting client deliverables'
    }
  ]
};

contextStore.set('alice', userContext);

// Later, when building context, you serialize it:
const context = contextStore.get('alice');
const systemPrompt = `User context: ${JSON.stringify(context, null, 2)}`;

// The model reads this string and interprets its meaning
const messages = [
  { role: 'system', content: systemPrompt },
  { role: 'user', content: 'Should we migrate to Cloudflare Workers?' }
];

const reply = await llm.complete(messages);
// reply: "Given your past experience with cloud outages, let me address
// reliability concerns first. Cloudflare has a different architecture..."
The “understanding” happens when the model reads the string and interprets "skeptical" in relation to "cloud infrastructure" and connects "reason" to the current question about migration. Nothing about the storage is semantic—it’s just JSON. The semantic layer—interpreting that past negative experiences should inform how to frame cloud recommendations—exists only when the LLM processes that text. This is fundamentally different from traditional data storage. In a conventional application, you might store units: 'metric' and then write explicit logic to check that value and format distances accordingly:
// Traditional approach: explicit logic interprets the data
if (user.units === 'metric') {
  return `${distanceInKm} km`;
} else {
  return `${distanceInMiles} miles`;
}
In an agentic system, the interpretation happens inside the model. You store facts as text, inject them into context, and the model’s language understanding does the rest. The “memory” is not storing understanding—it is storing text that, when read by a sufficiently capable language model, produces behavior that looks like understanding.
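For contrast, here is a minimal sketch of the agentic counterpart, assuming the same hypothetical llm.complete helper used above: no branch in your code inspects the preference; the stored value is serialized into the prompt and the model applies it.
// Agentic counterpart (sketch): the preference travels as text, and the
// model -- not your code -- decides how to apply it
const user = { units: 'metric' as const };

const messagesWithPreference = [
  { role: 'system', content: `The user prefers ${user.units} units.` },
  { role: 'user', content: 'How far is Berlin from Munich?' },
];

const distanceReply = await llm.complete(messagesWithPreference);
// Expected: the answer comes back in kilometers, with no explicit
// unit-handling branch anywhere in your code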

Alternative Representations

You can store semantic information in various formats—structured or unstructured—but all of them ultimately become strings that the model reads. The most direct approach is to serialize structured data as-is:
const profile = {
  name: 'Alice',
  company: 'Acme Corp',
  preferences: { units: 'metric' },
  projects: ['European expansion']
};

// Serialized directly to string for context:
const systemPrompt = `User profile: ${JSON.stringify(profile, null, 2)}`;
// User profile: {
//   "name": "Alice",
//   "company": "Acme Corp",
//   "preferences": { "units": "metric" },
//   "projects": ["European expansion"]
// }
The keys like "name", "company", "preferences" are semantic strings the model understands. You do not need to convert this into prose. The model reads JSON and interprets its structure and meaning directly. You could also store prose descriptions if that fits your use case:
// Unstructured representation: natural language
const profile = `
Alice works at Acme Corp, a company with a budget of $50k.
She prefers metric units and is working on expanding their
European market presence.
`;
Or explicit relationship triplets:
// Structured representation: subject-predicate-object triplets
const facts = [
  { subject: 'Alice', predicate: 'works_at', object: 'Acme Corp' },
  { subject: 'Acme Corp', predicate: 'budget', object: '$50k' },
  { subject: 'Alice', predicate: 'prefers', object: 'metric units' }
];

// When building context, these become strings:
const factStrings = facts.map(f =>
  `${f.subject} ${f.predicate.replace('_', ' ')} ${f.object}`
);

const systemPrompt = `Known facts:\n${factStrings.join('\n')}`;
// Known facts:
// Alice works at Acme Corp
// Acme Corp budget $50k
// Alice prefers metric units
Each representation has tradeoffs. JSON is compact and the model handles it well—no conversion needed. Prose reads naturally and can express nuance, but it is harder to update programmatically (you cannot just set profile.company = 'NewCo'). Triplets are easy to query programmatically (“give me all facts about Alice”) but verbose when serialized, and you have to convert them to readable strings.
The key insight is that no matter how you store the information, it ends up as text in a prompt. The model never directly accesses your database schema, your object relationships, or your data structures. It sees strings. The semantic layer—what those strings mean and how to reason about them—is provided by the model’s language understanding when it processes the text.
Your agent’s memory system can absolutely maintain conceptual graphs, symbolic knowledge bases, or any other structure you design. But when the language model needs to reason, all of that structure must be serialized into text and placed in context. The model does not traverse your graph or query your database directly—it reads whatever string representation you provide. The “understanding” happens when the model interprets that text, using the same language understanding it would apply to any string.

2.2 Memory Lives Outside the Model

[DEMO: Show a chat that “remembers” a user’s name across page refreshes. Side panel reveals that the name is just stored in a simple key–value store and reinserted into the prompt on each new message. When the user changes their name, the store updates, and the next prompt reflects it. The demonstration makes the mechanism completely transparent: you can see the store get/set operations happening, see the exact prompt being constructed, and see how the model has no knowledge of prior calls.]
When a system greets you by name, or recalls that you prefer metric units, it is natural to talk as if it remembered. But there is no single “it.” There is a model, there is orchestration code, and there is whatever storage your system happens to use. The model has no state between invocations. When a framework advertises “long-term memory,” that memory lives in storage—databases, files, vector indexes—not in the model. Context is rebuilt on every call from what you retrieve and inject. In code, the full mechanism of “remembering a preference” is almost aggressively mundane:
type Preferences = { name?: string; units?: 'metric' | 'imperial' };

const preferencesStore = new Map<string, Preferences>();

async function chat(userId: string, input: string) {
  // 1. Load memory from external storage
  const prefs = preferencesStore.get(userId) ?? {};

  // 2. Build context that *includes* remembered data
  const systemPrompt = [
    prefs.name && `The user's name is ${prefs.name}.`,
    prefs.units && `The user prefers ${prefs.units} units.`,
  ]
    .filter(Boolean)
    .join(' ');

  const messages = [
    { role: 'system', content: systemPrompt },
    { role: 'user', content: input },
  ];

  // 3. Call the stateless model with this constructed context
  const reply = await llm.complete(messages);

  // 4. Optionally *update* memory based on this turn
  const updatedPrefs = extractPreferencesFromTurn(input, reply, prefs);
  preferencesStore.set(userId, updatedPrefs);

  return reply;
}
There is no hidden state in the model. The only things that persist across calls are what you write into preferencesStore (or a database, or a vector index). The apparent continuity comes from two explicit choices:
  • what to extract from the interaction and store, and
  • how to reinsert stored data into the next context.
Memory is external storage. The model has none. What we call “memory” is data persisted in your system (databases, files, vector stores) that you selectively load into context. Memory feeds context; context is the window into memory.
Several frameworks’ “memory” features become legible once you see this pattern. “Short-term memory” is usually just an array of recent messages kept in process or in a database table. “Long-term memory” is often a vector store or another table keyed by user or topic. Both are entirely outside the model. They differ only in how you can query them. Because memory is external, you can reason about it like any other system component:
  • You can inspect it directly (open the DB, look at the row).
  • You can migrate it, back it up, or wipe it.
  • You can change how you encode it without retraining anything.
The critical implication for agent design is that nothing you fail to store will be available later. A model call that generates a sophisticated plan will forget that plan as soon as the response is sent unless you serialize it somewhere. There is no auto-save inside the weights. Equally, nothing you store will matter unless you later retrieve and load it into context. A gigabyte of carefully collected facts is invisible to the model if you never pull any of them back into a prompt. Memory is a two-step contract: persistence now for potential relevance later, and retrieval later for actual relevance now.
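To make the short-term/long-term distinction concrete, here is a minimal sketch of what those framework features typically reduce to; the type names are illustrative, not taken from any particular framework.
// "Short-term memory": recent turns, kept in order and usually capped
type ShortTermMemory = { role: 'user' | 'assistant'; content: string }[];

// "Long-term memory": keyed facts plus an embedded knowledge list,
// both living entirely outside the model
type LongTermMemory = {
  profiles: Map<string, { name?: string; units?: 'metric' | 'imperial' }>;
  knowledge: { content: string; embedding: number[] }[]; // searched by similarity
};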

Updating Memory

Because memory is external storage, updates work like any other data update—but the approach depends on how you represented the information. With structured data, updates are straightforward:
// Structured memory: easy to update specific facts
const profile = memoryStore.get('alice');
profile.attitudes[0].sentiment = 'neutral'; // User's skepticism has faded
profile.attitudes[0].reason += '. Recent experience with Cloudflare has been positive.';
memoryStore.set('alice', profile);
With prose, updates require rewriting or asking the model to rewrite:
// Prose memory: harder to update specific facts
let profile = `
Alice is skeptical of cloud infrastructure after a bad AWS outage.
She prefers on-premise solutions.
`;

// To update, you either:
// 1. Rewrite the entire string manually
profile = `
Alice was initially skeptical of cloud infrastructure after a bad AWS outage,
but recent positive experience with Cloudflare has made her more open to it.
She still values reliability over convenience.
`;

// 2. Or ask the model to update it
const updated = await llm.complete([
  { role: 'system', content: 'Update this user profile to reflect that their cloud skepticism has faded due to positive recent experience.' },
  { role: 'user', content: profile }
]);
profile = updated;
The model-based update is elegant but expensive (another LLM call) and potentially lossy (the model might omit details). For facts that change frequently or need precise updates, structured representations win. For rich contextual narratives that rarely change, prose can work. Many systems use a hybrid: structured data for facts that need programmatic updates (preferences, settings, explicit commitments), and prose or embeddings for contextual understanding that accumulates over time but rarely needs surgical edits.
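For illustration, a minimal sketch of such a hybrid record; the field names are assumptions, not a prescribed schema.
// Hybrid memory record: structured fields for surgical updates,
// free-form narrative for accumulated context
type HybridMemory = {
  facts: {
    units?: 'metric' | 'imperial';   // precise, programmatically updated
    budgetUsd?: number;
  };
  narrative: string;                 // appended to over time, occasionally summarized
};

const aliceMemory: HybridMemory = {
  facts: { units: 'metric', budgetUsd: 50_000 },
  narrative: 'Initially skeptical of cloud infrastructure after an AWS outage; recent Cloudflare experience has been positive.',
};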

2.3 Retrieval Follows Storage

[DEMO: Three panes showing (1) an array of chat messages, (2) a key–value map of user profiles, (3) a vector store of knowledge snippets. The user can ask specific test questions:
  • “What did I say in my last message?” → Array returns it instantly; map and vector cannot.
  • “What company do I work at?” → Map returns it instantly via user ID; array must scan; vector search may miss it if query doesn’t match semantically.
  • “What have we discussed about pricing?” → Vector store finds semantic matches across all messages; array must scan everything; map has no entry for “pricing.”
Each query shows which data structure succeeds (green checkmark) or fails (red X), making the structural constraints visible.]
Memory is external storage, which surfaces another problem. Storage is easy. You can shovel everything into some table or blob store. The hard part is getting the right pieces back out when you build the next context. The choice of data structures determines which retrieval strategies are available. An array gives you positional access and recency. A map gives you key-based lookup. A vector store gives you semantic search. If you only record positional information, you cannot later perform efficient meaning-based queries. A common source of confusion is treating memory as a single mechanism rather than several distinct components. In practice, different storage layouts give you different questions you can ask. These patterns mirror common data structures in ordinary software. An array is excellent when you care about order or recency:
type Message = { role: 'user' | 'assistant'; content: string };

const messageLog: Message[] = [];

function storeTurn(user: string, assistant: string) {
  messageLog.push({ role: 'user', content: user });
  messageLog.push({ role: 'assistant', content: assistant });
}

function recentMessages(limit: number): Message[] {
  return messageLog.slice(-limit);
}
This structure makes “give me the last 10 messages” trivial and fast. It makes “give me all messages about pricing” expensive, because you must scan everything. A map (or object keyed by ID) is ideal when you know what you’re looking for by name:
type Profile = { company?: string; role?: string };

const profiles = new Map<string, Profile>();

function rememberProfile(userId: string, profile: Profile) {
  profiles.set(userId, profile);
}

function getProfile(userId: string): Profile | undefined {
  return profiles.get(userId);
}
Here “what company does this user work at?” is O(1). “Which users care about pricing?” is not directly supported.
A vector store is what you reach for when you want to query by meaning. Under the hood, a vector store converts text into numerical representations called embeddings that capture semantic meaning. An embedding model (a neural network trained on massive text corpora) transforms a sentence into a vector—essentially a point in high-dimensional space. Sentences with similar meanings end up near each other in this space. When you search, the vector store converts your query into an embedding, then finds the stored items whose embeddings are geometrically closest—usually measured by cosine similarity or Euclidean distance. “Similar meaning” becomes “nearby in vector space.”
type KnowledgeItem = { id: string; content: string };

const knowledge = new VectorStore<KnowledgeItem>();

async function rememberFact(item: KnowledgeItem) {
  await knowledge.add(item.content, item);
}

async function recallRelevant(query: string, k = 5): Promise<KnowledgeItem[]> {
  return knowledge.search(query, { limit: k });
}
This structure can’t answer “what is fact #42?” cheaply unless you also tracked IDs, but it can answer “what do we know about database performance?” even if “database performance” never appeared verbatim in the stored texts. It is worth noting that embedding models have their own biases and blindspots. Semantic similarity is learned, not perfect. The model’s training data determines what “similar” means—domain-specific jargon, rare languages, or novel concepts may not embed well if the model never saw them during training.
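To make “nearby in vector space” concrete, here is a minimal sketch of the similarity computation itself, assuming a hypothetical embed() function that returns an embedding vector; real vector stores replace the linear scan with approximate nearest-neighbor indexes, but the geometric idea is the same.
// Cosine similarity: 1 means same direction, 0 means unrelated
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Naive search: embed the query, compare against every stored embedding
async function naiveSearch(
  query: string,
  items: { content: string; embedding: number[] }[],
  k: number,
) {
  const queryEmbedding = await embed(query); // hypothetical embedding call
  return items
    .map(item => ({ item, score: cosineSimilarity(queryEmbedding, item.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}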

The Implicit Semantic Graph

Vector stores do more than support similarity search: they act like an implicit graph where related items can be discovered at query time without storing explicit edges. In a traditional graph database, if you want to know that “AWS outage” relates to “cloud reliability concerns” relates to “Cloudflare migration,” you must explicitly create those relationship edges and maintain them as facts change. The graph grows quadratically: N items can have on the order of N² potential connections. Storing and updating all those edges becomes expensive. Vector embeddings compress this into a different representation. Instead of storing explicit edges, you store each item as a point in high-dimensional space. The “edges”—which items relate to which others—emerge dynamically at query time by computing geometric proximity. You ask “what’s related to cloud reliability?” and the vector store computes distances to find nearby points. The graph is implicit: connections are computed on demand using nearest-neighbor search rather than stored explicitly.
This is analogous to associative memory: related items can be surfaced based on learned patterns of co-occurrence. The model learns during training that “outage,” “downtime,” and “reliability” co-occur in similar contexts, so their vector representations end up nearby. Retrieval becomes pattern completion: given a partial cue (“cloud reliability”), the system surfaces related patterns without needing explicit pointers. The computational advantage can be large in practice; for example, storing N items requires N embeddings, and approximate nearest-neighbor indexes avoid quadratic behavior in highly connected graphs. The tradeoff is that similarity is approximate. Two items might be semantically related but geometrically distant if the embedding model failed to capture their relationship during training. The graph is lossy. But for many retrieval tasks, approximate similarity is sufficient—and vastly cheaper than exact graph traversal or exhaustive search.
Vector search returns semantically similar items regardless of when they were created, so temporal ordering must be handled separately. Search for “cloud reliability” and you might retrieve a user concern from three months ago, a recent vendor evaluation, and an incident from last year, all interleaved by similarity score. This collapses temporal structure. You might filter to recent events first, then search within that window—mirroring how recent memories feel more salient. Or you might retrieve by similarity but include timestamps in the serialized context, letting the model reconstruct the sequence. The model sees scattered facts with dates and mentally simulates how events unfolded—the “world reconstruction” pattern from Chapter 1.
Storage structure determines retrieval capability. Arrays enable positional access. Maps enable key lookup. Vector stores enable semantic similarity. If you only store by position, you cannot later query by meaning efficiently. Design storage for the retrievals you’ll need, not just the data you have.
A typical agentic system uses all three simultaneously:
  • an array for episodic history (“what just happened?”),
  • maps for structured state (“what is the user’s timezone?”),
  • a vector store for semantic lookup (“what documents are relevant to this question?”).
A common pitfall is picking a single storage shape (for example, “we’ll just log everything”) and assuming you can add arbitrary retrieval later. You cannot ask a vector store for “the N most recent events” unless you stored timestamps. You cannot ask a flat log for “everything about topic X” efficiently without some secondary index. Retrieval is not a generic add-on; it is constrained by how you decided to store. A practical rule is to work backwards from the questions your system must be able to answer when constructing context. For each question, choose a storage structure that makes the corresponding retrieval trivial. When in doubt, store the same underlying fact in multiple shapes:
  • a log entry for chronology,
  • a keyed record for direct access,
  • an embedded entry in a vector index for semantic access.
This redundancy is cheap compared to the cost of struggling to retrieve what you never indexed for.
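A sketch of that redundancy for a single fact, using illustrative store names rather than the exact structures above:
// Illustrative stores (shapes simplified for the sketch)
const factLog: { id: string; content: string; timestamp: number }[] = [];
const factsByUser = new Map<string, { budgetNote?: string }>();
const factIndex = new VectorStore<{ id: string; content: string }>();

async function rememberBudgetConstraint(userId: string, text: string) {
  const fact = { id: `fact-${Date.now()}`, content: text, timestamp: Date.now() };

  factLog.push(fact);                                                        // chronology
  factsByUser.set(userId, { ...factsByUser.get(userId), budgetNote: text }); // direct access
  await factIndex.add(text, { id: fact.id, content: text });                 // semantic access
}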

2.4 Relevance Is Not Just Recency

[DEMO: A timeline of events for a single user (array), a list of “important” markers on some events, and a vector search box. The user can ask test questions and see which events are retrieved under three strategies: Question: “What was my budget constraint?”
  • (a) Recency-only: Shows last 5 messages → none mention budget. ❌ Fails.
  • (b) Semantic-only: Finds a message about “financial planning” → wrong context, talks about investing. ❌ Fails.
  • (c) Hybrid (recency + importance): Finds the starred message from two weeks ago: “Stay under $50k for this project.” ✅ Succeeds.
Question: “What did we just talk about?”
  • (a) Recency-only: Shows the last few messages. ✅ Succeeds.
  • (b) Semantic-only: Returns messages semantically similar to “talk about”—vague, often wrong. ❌ Fails.
  • (c) Hybrid: Recent messages score highest. ✅ Succeeds.
Question: “What do you know about Paris?”
  • (a) Recency-only: No recent mentions. ❌ Fails (unless Paris was just discussed).
  • (b) Semantic-only: Finds all mentions of Paris, including an old joke about croissants. ⚠️ Partial success (finds mentions, but no importance weighting).
  • (c) Hybrid: Finds Paris mentions but weights recent and important ones higher. ✅ Better signal.
Pass/fail indicators make it immediately clear which strategy handles which query type.]
If you can store arbitrarily much and retrieve in several ways, you still haven’t answered the core question: for a given model call, which subset of everything you know should enter the finite context window? Recency is an obvious heuristic. Recent events are often the most relevant. But what about a user’s long-term goal stated three weeks ago? What about a billing constraint established last month that should still govern today’s suggestions? How do you ensure that deeply important but older facts are not drowned by a flood of trivial recent chatter? Semantic similarity seems like an upgrade: “just retrieve by meaning.” But similarity to the current query is not the same as importance to the ongoing relationship. A throwaway joke about Paris might be semantically related to “France”; it is not as important as a clearly expressed project objective that shares fewer surface terms with the current question. The challenge is that the model’s context window forces you to choose. What should you keep close? What should you let fall away? How do you approximate relevance to the current query with mechanisms that only see timestamps, similarity scores, and occasional importance flags? A minimal hybrid retrieval layer looks like this:
type Event = {
  id: string;
  content: string;
  timestamp: number;
  important?: boolean;
};

// Episodic store: append-only log
let events: Event[] = [];

// Semantic index: meaning-based retrieval
const eventIndex = new VectorStore<Event>();

async function rememberEvent(event: Event) {
  events.push(event);
  await eventIndex.add(event.content, event);
}

async function buildContextFor(input: string, tokenBudget: number) {
  // 1. Recency: take the latest N events
  const recent = events.slice(-20);

  // 2. Semantic: events similar to the current input
  const similar = await eventIndex.search(input, { limit: 20 });

  // 3. Importance: all events marked important, regardless of age
  const important = events.filter(e => e.important);

  // 4. Merge with simple scoring
  const scored = new Map<string, { event: Event; score: number }>();

  function bump(e: Event, delta: number) {
    const existing = scored.get(e.id) ?? { event: e, score: 0 };
    existing.score += delta;
    scored.set(e.id, existing);
  }

  for (const e of recent) bump(e, 1.0);
  for (const e of similar) bump(e, 2.0);
  for (const e of important) bump(e, 5.0);

  // 5. Sort by score (and slight bias toward more recent within ties)
  const ranked = Array.from(scored.values())
    .sort(
      (a, b) =>
        b.score - a.score ||
        b.event.timestamp - a.event.timestamp,
    )
    .map(s => s.event);

  // 6. Pack into context until you hit the token budget
  const selected: Event[] = [];
  let used = 0;

  for (const e of ranked) {
    const cost = estimateTokens(e.content);
    if (used + cost > tokenBudget) break;
    selected.push(e);
    used += cost;
  }

  return selected.map(e => ({ role: 'system', content: e.content }));
}
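The packing loop above assumes an estimateTokens helper. A rough sketch, using the common approximation of about four characters per token for English text (a real tokenizer is more accurate when budgets are tight):
function estimateTokens(text: string): number {
  // Crude heuristic: roughly 4 characters per token for typical English prose
  return Math.ceil(text.length / 4);
}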
This example illustrates a deeper point: relevance is multi-dimensional. You need to balance at least three axes:
  • how recent something is,
  • how semantically related it is to the current input,
  • how intrinsically important it is across time.
Recency is a proxy for relevance—usually good, sometimes wrong. Important old information can be more relevant than trivial recent information. Hybrid retrieval (recency + semantic + explicit importance) approximates relevance to the current query better than any single strategy.
Each axis is imperfect. Recency will prioritize last night’s small talk over a critical preference from last month. Pure semantic similarity will happily surface an anecdote that mentions “France” over a pricing constraint that never uses that term. Pure importance flags will risk reintroducing the same handful of items in every context, wasting tokens. Combining them gives you a system that behaves more like a real memory: what happened recently is vivid, but some things never fade, and the current topic colors which memories feel salient. The engineering consequence is that you should treat retrieval as a configurable policy, not as a hard-coded query. You will adjust weights, experiment with thresholds, and sometimes introduce new signals (access frequency, user pinning, explicit “forget this” commands) as you observe failures.

2.4.1 When Semantic Similarity Fails

[DEMO: A side-by-side view showing how embedding-based retrieval handles tricky queries: Query: “I do NOT want spicy food”
  • Retrieval returns: documents about spicy food recommendations. ❌
  • Why: Embeddings capture topics, not negation. “NOT spicy” and “spicy” are semantically close.
Query: “The thing we discussed yesterday”
  • Retrieval returns: unrelated items with similar phrasing. ❌
  • Why: No semantic hook. The query is a temporal reference with no content keywords.
Query: “What’s the latest on the flux capacitor module?”
  • Retrieval returns: generic engineering docs, misses the specific module. ❌
  • Why: Domain-specific jargon (“flux capacitor”) may not have been in the embedding model’s training data.
Query: “Show me the pricing tiers”
  • Retrieval returns: correct pricing documentation. ✅
  • Why: Common terminology the embedding model knows well.
Each result includes an explanation of why the strategy succeeded or failed, demonstrating that semantic search is a tool with specific blindspots, not a universal solution.]
Semantic similarity is powerful, but it is not magic. Embedding models learn associations from their training data, and they fail in predictable ways.
Negation is often ignored. Embeddings encode topics and concepts, not logical operators. “I do NOT want spicy food” and “I want spicy food” will retrieve similar documents because both mention “spicy food.” The model sees the shared semantic space, not the negation.
Indirect references have no hooks. Queries like “the thing we discussed yesterday” or “what I mentioned earlier” contain no content keywords. The embedding is semantically close to other vague temporal references, not to the actual content you care about. These queries need recency-based retrieval, not semantic search.
Domain jargon may not embed well. If your system deals with specialized terminology—medical codes, legal terms, internal project names, emerging technology—the embedding model may never have encountered those terms during training. It will treat them as rare or unknown tokens, producing poor similarity scores.
Homonyms and context collapse. “Jaguar” could mean the animal, the car brand, or the macOS version. Embeddings capture a blended representation of all senses, weighted by frequency in training data. If your domain uses a minority sense, retrieval may surface the wrong context.
The practical implication is that you should treat semantic retrieval as one signal among several, not as a complete solution. Hybrid retrieval that combines recency, explicit structure (tags, categories, importance flags), and semantic similarity will handle edge cases better than any single strategy. When semantic search fails in your system, the fix is usually not “get a better embedding model.” The fix is to add complementary retrieval paths: structured metadata for facts that don’t need semantic matching, recency for temporal queries, and explicit markers for high-priority information.
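One way to add those complementary paths is a small routing layer in front of retrieval. The sketch below reuses the recentMessages, getProfile, and recallRelevant helpers defined earlier; the regex heuristics are deliberately simple and would need tuning for a real system.
type RetrievalRoute = 'recency' | 'structured' | 'semantic';

function chooseRoute(query: string): RetrievalRoute {
  // Temporal references have no semantic hook -- fall back to recency
  if (/\b(yesterday|earlier|last time|just now|we discussed)\b/i.test(query)) {
    return 'recency';
  }
  // Identity and settings questions are better served by keyed lookup
  if (/\b(my name|my email|my timezone|my units)\b/i.test(query)) {
    return 'structured';
  }
  return 'semantic';
}

async function retrieve(query: string, userId: string) {
  switch (chooseRoute(query)) {
    case 'recency':
      return recentMessages(10);
    case 'structured':
      return [getProfile(userId)];
    case 'semantic':
      return recallRelevant(query);
  }
}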

2.4.2 Retrieval Policy Shapes Personality

Two systems with identical storage can feel very different depending on how they weight retrieval signals.
A system that overweights recency feels like it has a “short attention span.” It is responsive to the immediate conversation but forgets long-term commitments. A user might say “remember, I’m on a tight budget” in week one, but by week three the agent is suggesting expensive options because recent messages don’t mention budget and recency dominates the retrieval policy.
A system that overweights importance feels “obsessive about key facts.” It will reintroduce the same handful of starred items in every context, even when they are no longer relevant. The user experiences this as the agent “not letting go” of certain topics or repeatedly reminding them of things they already acknowledged.
A system that overweights semantic similarity is “very on-topic but forgets your name.” It will surface contextually relevant documents and past discussions related to the current query, but it may fail to retrieve simple structured facts (like preferences or identifiers) that don’t have strong semantic hooks in the current input.
Adjusting these weights changes the memory policy rather than the model itself. The model is just a powerful function that reasons over whatever world you construct. The personality the user experiences is shaped by both your retrieval policy and your system prompt. In practice, good retrieval policies are rarely static. You might:
  • Weight recency higher during the first few turns of a session (to stay grounded in the immediate conversation), then shift weight toward importance and semantics as the conversation progresses (to surface long-term context).
  • Boost importance scores when the user asks a planning or decision-making question (“Should I…?” / “What should we do about…?”).
  • Reduce semantic weight when the user is asking for simple facts (“What’s my email?”) where structured lookup is more reliable.
There is no single optimal policy; treat retrieval as a tunable system that you adjust based on observed behavior.
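One way to express that tunability is to lift the weights from the earlier buildContextFor example into an explicit policy object. This is a sketch; the numbers and patterns are placeholders to be adjusted against observed failures.
type RetrievalPolicy = {
  recencyWeight: number;
  semanticWeight: number;
  importanceWeight: number;
};

const defaultPolicy: RetrievalPolicy = {
  recencyWeight: 1.0,
  semanticWeight: 2.0,
  importanceWeight: 5.0,
};

function policyFor(input: string, turnIndex: number): RetrievalPolicy {
  // Early in a session, stay grounded in the immediate conversation
  if (turnIndex < 3) return { ...defaultPolicy, recencyWeight: 3.0 };

  // Planning questions: long-term commitments matter more
  if (/\b(should (i|we)|what should)\b/i.test(input)) {
    return { ...defaultPolicy, importanceWeight: 8.0 };
  }

  // Simple fact lookups: rely less on semantic similarity
  if (/\bwhat('s| is) my\b/i.test(input)) {
    return { ...defaultPolicy, semanticWeight: 0.5 };
  }

  return defaultPolicy;
}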

2.5 Forgetting Is a Design Choice

[DEMO: A growing memory store visualized over time. The interface shows a timeline of stored events with metadata (timestamp, importance, token count). User can toggle strategies: Strategy: “Keep everything”
  • Timeline fills up. Retrieval becomes slow. Context windows fill with noise (20 “thanks!” messages, 15 greetings, outdated preferences from months ago).
  • Evaluation: Ask “What are my current preferences?” → System surfaces a mix of old and new, contradictory preferences. ❌ Fails.
Strategy: “Summarize old”
  • Old events compress into summaries. Recent events stay verbatim.
  • Evaluation: Ask about a general theme (“What have we worked on?”) → System provides good high-level summary. ✅ Succeeds. Ask for a specific quote from two months ago → System can’t provide it. ⚠️ Partial failure (expected tradeoff).
Strategy: “Expire by age”
  • Events older than 1 week disappear unless marked important.
  • Evaluation: Ask about a recent topic → Works well. ✅ Ask about a long-term goal set 3 weeks ago (not marked important) → Gone. ❌ Fails.
Strategy: “Keep only marked important”
  • Aggressive pruning. Only important events survive.
  • Evaluation: Ask about the user’s core objectives → Perfect recall. ✅ Succeeds. Ask about a normal conversation from yesterday (not marked important) → No trace. ❌ Fails.
The demo makes the tradeoffs visible: no forgetting strategy is universally correct; you choose based on what kind of continuity you need.]
Even when storage is cheap, you still need explicit forgetting strategies. Disk space keeps getting cheaper. Vector indexes can scale. You could, in principle, store every keystroke a user ever sends and call it “long-term memory.” But unlimited storage is not unlimited utility. Old information can be wrong (outdated facts), misleading (preferences that changed), or low-value (routine acknowledgments such as “thanks”). Retrieval over an ever-growing pile of data gets harder. Even with embeddings and indexes, your search has to decide which of the millions of items are worth precious context tokens right now. There are three main forgetting mechanisms, each with different failure modes.
1. Forget by summarization
You keep what happened in compressed form and let literal details go:
type Turn = { role: 'user' | 'assistant' | 'system'; content: string; timestamp: number };

let turns: Turn[] = [];
let summary: string | null = null;

async function maybeCompressHistory() {
  // Only compress when we have many turns
  if (turns.length < 40) return;

  const old = turns.slice(0, -10);
  const recent = turns.slice(-10);

  const prompt = [
    {
      role: 'system',
      content:
        'Summarize this interaction. Preserve goals, decisions, and enduring preferences. Omit greetings and small talk.',
    },
    { role: 'user', content: JSON.stringify(old) },
  ];

  const summaryText = await llm.complete(prompt);

  summary = summary
    ? `${summary}\n\nAdditional context:\n${summaryText}`
    : summaryText;

  // Replace old turns with a single synthetic system message
  turns = [
    { role: 'system', content: `Summary of earlier conversation:\n${summary}`, timestamp: Date.now() },
    ...recent,
  ];
}
Now your context can include:
  • the accumulated summary (low cost, coarse),
  • the last few turns verbatim (higher cost, precise).
This is ideal for preserving themes and long-term preferences. It will fail if you later need an exact quote or specific reference from early in the interaction.
2. Forget by expiration
You delete items once they are too old to matter:
const ONE_WEEK = 7 * 24 * 60 * 60 * 1000;

function purgeEphemeralEvents(now: number = Date.now()) {
  // Keep only recent events or those explicitly marked important
  events = events.filter(
    e => e.important || now - e.timestamp < ONE_WEEK,
  );
}
This is appropriate for short-lived contexts: “in-progress notifications,” “jobs in queue,” “transient debug logs.” It is dangerous if you apply it to durable commitments (“user’s billing tier,” “accepted terms of service”).
3. Forget by curation
You explicitly mark certain items as important and aggressively prune the rest:
function markImportant(id: string) {
  const event = events.find(e => e.id === id);
  if (event) event.important = true;
}

function pruneUnimportant(maxSize: number) {
  // Keep all important events, plus the most recent unimportant ones
  const important = events.filter(e => e.important);
  const unimportant = events.filter(e => !e.important).slice(-maxSize);
  events = [...important, ...unimportant];
}
Here you are not trusting time or similarity to identify value. You (or the model) decide that certain facts are worth keeping indefinitely, everything else is subject to aggressive forgetting.
Unlimited storage doesn’t mean unlimited utility. Old information pollutes retrieval, increases costs, and can mislead. Forgetting is memory hygiene: summarization for themes, deletion for obsolete facts, importance-based retention for what matters regardless of age.
You can treat this as memory hygiene: routinely clean up what you store so that retrieval surfaces the right things. Summaries preserve narrative continuity. Expiration removes stale, misleading state. Curation keeps anchors that should never disappear. Because memory is external, you can change your forgetting strategy without touching the model. You can migrate from “keep everything” to “keep summaries and anchors,” reindex your vector store, or adjust expiration policies as you observe real usage. The agent you are shaping lives in those policies as much as in your prompts.

2.6 Putting Memory in Its Place

To see these ideas together, consider a simplified assistant that tracks a user over time:
export class Assistant {
  // Episodic: full turns plus summary
  private turns: Turn[] = [];
  private summary: string | null = null;

  // Structured: preferences and project state
  private preferences: Preferences = {};
  private project: { name: string; objective: string } | null = null;

  // Semantic: knowledge across sessions
  private knowledge = new VectorStore<{ content: string; timestamp: number }>();

  async chat(userId: string, input: string) {
    // 1. Update memory with the new user turn (episodic)
    this.turns.push({ role: 'user', content: input, timestamp: Date.now() });

    // 2. Retrieve from memory for this specific context
    const context = await this.buildContext(input);

    // 3. Call the model
    const reply = await llm.complete(context);

    // 4. Store assistant turn and maybe extract new persistent facts
    this.turns.push({ role: 'assistant', content: reply, timestamp: Date.now() });

    await this.updateStructuredMemory(input, reply);
    await this.updateSemanticMemory(input, reply);

    // 5. Manage growth
    await this.maybeCompressHistory();
    this.pruneUnimportantEvents();

    return reply;
  }

  private async buildContext(input: string) {
    const tokenBudget = 2000;

    // 1. Preferences: direct lookup
    const prefsText = this.renderPreferences();

    // 2. Project state: direct lookup
    const projectText = this.project
      ? `Current project: ${this.project.name}. Objective: ${this.project.objective}.`
      : '';

    // 3. Episodic + semantic hybrid retrieval
    const relevantEvents = await buildContextFor(input, tokenBudget / 2);

    // 4. Remaining budget goes to immediate recent turns
    const recentTurns = this.turns.slice(-10);

    return [
      { role: 'system', content: prefsText },
      projectText && { role: 'system', content: projectText },
      ...relevantEvents,
      ...recentTurns,
      { role: 'user', content: input },
    ].filter(Boolean) as { role: string; content: string }[];
  }

  private renderPreferences(): string {
    const { name, units } = this.preferences;
    const parts = [];
    if (name) parts.push(`The user's name is ${name}.`);
    if (units) parts.push(`The user prefers ${units} units.`);
    return parts.join(' ') || 'You are talking to a returning user.';
  }

  private async updateStructuredMemory(input: string, reply: string) {
    // Use the model (or rules) to extract durable preferences and project info
    const extractionPrompt = [
      {
        role: 'system',
        content:
          'From the following exchange, extract any stable preferences (like name, units) and project info.',
      },
      { role: 'user', content: input },
      { role: 'assistant', content: reply },
    ];
    const json = await llm.completeAsJson<{
      preferences?: Preferences;
      project?: { name?: string; objective?: string };
    }>(extractionPrompt);

    if (json.preferences) {
      this.preferences = { ...this.preferences, ...json.preferences };
    }
    if (json.project) {
      this.project = { ...(this.project ?? {}), ...json.project } as any;
    }
  }

  private async updateSemanticMemory(input: string, reply: string) {
    // Treat assistant replies as candidate knowledge
    await this.knowledge.add(reply, { content: reply, timestamp: Date.now() });
  }

  private async maybeCompressHistory() {
    // As in the earlier example
  }

  private pruneUnimportantEvents() {
    // As in the earlier example
  }
}
This pattern of using the model to extract structured data from conversations comes with tradeoffs. Firing an LLM call on every turn to extract preferences adds cost and latency. It is also fragile: the model might hallucinate a preference that was never stated, or misinterpret ambiguous language. Production systems often use heuristics (regex patterns for email addresses, explicit user settings pages), batch extraction during quiet periods, or manual curation for high-value facts rather than relying entirely on per-turn LLM extraction. The “extract everything with the model” approach is convenient for prototypes but becomes expensive and error-prone at scale.
Nothing in this implementation gives the model “memory” in any intrinsic sense. All the persistent structure lives in your class fields and backing stores. The agent’s apparent continuity arises from:
  • what you choose to preserve (preferences, projects, knowledge, summaries),
  • how you index it (maps, logs, vector stores),
  • how you select from it (recency, similarity, importance),
  • and when you allow things to fade (summarization, expiration, pruning).
From the model’s point of view, each call is just another batch of text. From the user’s point of view, there is a stable counterpart on the other side of the screen. The bridge between these two perspectives is memory as external, organized storage that feeds a carefully constructed context.
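As a concrete illustration of the heuristic alternative mentioned above, here is a minimal sketch that extracts a couple of well-defined fields without any model call; the regex patterns are illustrative, not production-grade.
// Heuristic extraction: cheap and deterministic, but limited to fields
// with recognizable surface forms
function extractWithHeuristics(input: string): Partial<Preferences & { email: string }> {
  const extracted: Partial<Preferences & { email: string }> = {};

  const email = input.match(/[\w.+-]+@[\w-]+\.[\w.]+/);
  if (email) extracted.email = email[0];

  if (/\b(kilometers|celsius|metric)\b/i.test(input)) extracted.units = 'metric';
  if (/\b(miles|fahrenheit|imperial)\b/i.test(input)) extracted.units = 'imperial';

  return extracted;
}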

Bridge to Chapter 3

Chapters 1 and 2 have been about what your system knows and how that knowledge shows up at the moment of inference. Context is the instantaneous world the model sees; memory is everything your system has chosen to keep outside that window and can pull back in. But an entity that only accumulates and recalls is not yet an agent. It can answer questions, summarize documents, and continue conversations, but it cannot do anything beyond emitting text. Agency is the property of making things happen in the world—sending emails, modifying files, updating databases, orchestrating other services. The model’s outputs become actions only when you connect them to tools and execute them. Chapter 3 turns to this next element: how to turn a stateless text generator plus its memory into a system that can cause effects, and how to keep that causal chain understandable and controllable.