AI agent memory context management

Introduction

An LLM is stateless: on its own it cannot remember anything across runs, sessions, or users. Without memory there is no continuity, no personalization, no learning, and no support for long-running tasks.

Memory stack

  1. Layer 1: context window (in the prompt) -> lost when the context overflows
  2. Layer 2: short-term memory (agent session) -> lost when the session ends
  3. Layer 3: long-term memory (persistent) -> stores facts about users, past conversations, learned preferences
  4. Layer 4: external knowledge (tools, RAG) -> documents and external APIs; queried but not maintained by the agent

Core operations

  1. Write: store new information
  2. Read: pull relevant information when needed (the most difficult operation)
  3. Update: modify existing memory when new information contradicts it
  4. Delete: forget or remove stale information that is no longer relevant
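
The four operations can be sketched as a tiny in-memory store. This is only an illustration under stated assumptions: the class name `MemoryStore`, its method signatures, and the word-overlap ranking in `read` are all hypothetical (a real system would more likely rank by embedding similarity).

```python
from dataclasses import dataclass, field


@dataclass
class MemoryStore:
    """Hypothetical in-memory store keyed by a memory id."""
    items: dict[str, str] = field(default_factory=dict)

    def write(self, key: str, content: str) -> None:
        # Write: store new information.
        self.items[key] = content

    def read(self, query: str, k: int = 3) -> list[str]:
        # Read: rank memories by how many query words they share (toy relevance).
        words = set(query.lower().split())
        scored = [
            (len(words & set(text.lower().split())), text)
            for text in self.items.values()
        ]
        ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
        return [text for _, text in ranked[:k]]

    def update(self, key: str, content: str) -> None:
        # Update: overwrite a memory that new information contradicts.
        if key not in self.items:
            raise KeyError(key)
        self.items[key] = content

    def delete(self, key: str) -> None:
        # Delete: forget stale information; unknown keys are ignored.
        self.items.pop(key, None)
```

Read is the hard part because it has to decide relevance; the other three operations are plain dictionary updates here.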

What to store, what not?

  • Stable user facts
  • Strong preferences
  • Outcomes of past tasks (what worked, what failed)
  • Decisions that shape future work (architecture, policies, constraints)

Challenges

  1. Context window cutoff => summarize old results into a compact form, drop tool outputs that are no longer needed, and retrieve only the top-k most relevant search results
  2. Retrieval misses important memories => use semantic search (embeddings) instead of keyword matching, store each memory with a good title and description, and retrieve more candidates than needed
  3. Outdated memory
  4. Memory bloat => apply an LRU (least recently used) eviction policy
  5. Privacy leaks => scope memories by user ID, encrypt sensitive data, and require consent before writing
  6. Short vs. long term => if useful only to the current task, use short term; if it matters later, use long term. When in doubt, use short term
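
The LRU policy named for memory bloat can be sketched with `collections.OrderedDict` from the standard library. The class name `LRUMemory` and the fixed `capacity` are assumptions made for the example, not part of the design above.

```python
from collections import OrderedDict
from typing import Optional


class LRUMemory:
    """Hypothetical bounded memory that evicts the least recently used item."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key: str) -> Optional[str]:
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: str, value: str) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used

    def keys(self) -> list[str]:
        # Oldest first, most recently used last.
        return list(self._items)
```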

System design

Design a text-only personalization and memory system for an AI chatbot

The chatbot should use a user’s previous conversations, preferences, and feedback to generate more personalized responses in future sessions. The system should support retrieval-augmented generation or a similar approach, but the exact implementation is up to you.

Address the following topics:

  1. How would you store user memory from past conversations?
  2. How would you retrieve the right memories for a new user query?
  3. How would you handle conflicting information in the user’s history, such as an old preference contradicting a newer one? If every chatbot response requires retrieval, how can the system be made faster? If a user’s history becomes very large, how should the system manage and summarize it?
  4. How would you design a feedback loop so the system improves personalization quality over time?

System Overview

Every conversation generates knowledge about the user. The system must extract that knowledge, store it durably, retrieve the right subset at query time, resolve conflicts when knowledge contradicts itself, and improve over time through feedback.

graph TD
    U["User message"] --> CL["Classifier:<br/>needs personalization?"]
    CL -->|Yes| R["Retrieval Layer"]
    CL -->|No| LLM
    R --> VS["Vector Store<br/>(semantic search)"]
    R --> KV["Key-Value Store<br/>(user profile)"]
    VS --> Ranker["Re-ranker<br/>(recency + relevance + confidence)"]
    KV --> Ranker
    Ranker --> LLM["LLM generates response<br/>with injected memories"]
    LLM --> Resp["Response to user"]
    Resp --> Extractor["Post-turn Extractor"]
    Extractor --> Writer["Memory Writer"]
    Writer --> VS
    Writer --> KV
    Resp --> FB["Feedback Signals"]
    FB --> Scorer["Confidence Scorer"]
    Scorer --> Writer

1. How would you store user memory from past conversations?

What to store

Not raw transcripts — those are too noisy and grow unboundedly. Instead, the system extracts structured memory items from conversations. Each item is a single fact about the user.

Memory item schema:

| Field | Type | Purpose |
| --- | --- | --- |
| id | UUID | Unique identifier |
| user_id | string | Partition key; isolates users |
| content | string | The fact in natural language (“Prefers dark roast coffee”) |
| category | enum | preference, constraint, biographical, episodic, procedural |
| confidence | float 0–1 | How sure the system is this fact is accurate |
| version | int | Increments on updates; enables conflict resolution |
| created_at | timestamp | When first extracted |
| updated_at | timestamp | When last confirmed or modified |
| source_turn | string | Which conversation turn produced this |
| superseded_by | UUID or null | Points to the newer fact if this one was overridden |
| embedding | float[] | Vector representation for semantic search |
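
One way to express this schema in code is a Python dataclass. The field names follow the schema; the default confidence of 0.5, the UTC timestamps, and the range check are assumptions added for the sketch.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional
from uuid import UUID, uuid4


class Category(str, Enum):
    PREFERENCE = "preference"
    CONSTRAINT = "constraint"
    BIOGRAPHICAL = "biographical"
    EPISODIC = "episodic"
    PROCEDURAL = "procedural"


def _now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class MemoryItem:
    user_id: str                 # partition key
    content: str                 # the fact in natural language
    category: Category
    source_turn: str
    confidence: float = 0.5      # assumed default; the schema does not give one
    version: int = 1
    id: UUID = field(default_factory=uuid4)
    created_at: datetime = field(default_factory=_now)
    updated_at: datetime = field(default_factory=_now)
    superseded_by: Optional[UUID] = None
    embedding: list[float] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")
```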

Memory categories

  • Preference: “Likes Italian food”, “Prefers formal tone”
  • Constraint: “Allergic to peanuts”, “Budget under $100”
  • Biographical: “Lives in Seattle”, “Software engineer”
  • Episodic: “Booked a flight to Tokyo last week” — time-bound events
  • Procedural: “When asked about reports, always include charts” — learned interaction patterns

Storage architecture: two stores, one purpose

Store A — Structured profile (key-value / document store): A compact JSON document per user containing their current profile: name, location, active preferences, constraints. This is small (typically <2KB), always loaded, and cheap to read.

Store B — Semantic memory (vector database): All extracted facts with embeddings. Supports similarity search so the system can retrieve memories relevant to a query the user has never explicitly connected to a preference.

Why two stores? The profile is tiny, always relevant, and should be injected every turn without a search. The semantic store is large, query-dependent, and needs vector search. Combining them into one store either makes the profile slow to load or makes search results noisy with always-relevant items.

Extraction pipeline

After each conversation turn:

  1. Send the conversation to the LLM with a structured extraction prompt:

    “Extract any new facts about the user from this conversation. For each fact, classify it as preference/constraint/biographical/episodic/procedural. If no new facts, return empty.”

  2. Parse the LLM’s structured output (JSON)
  3. Embed each fact using an embedding model
  4. Check for duplicates/conflicts against existing memories (see section 3)
  5. Write to the appropriate store
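
Step 2 (parsing the structured output) might look like the sketch below. The JSON shape, a list of objects with `content` and `category` keys, is an assumption; the pipeline above only says the output is JSON. The function name `parse_extracted_facts` is hypothetical.

```python
import json

ALLOWED = {"preference", "constraint", "biographical", "episodic", "procedural"}


def parse_extracted_facts(raw: str) -> list[dict[str, str]]:
    """Parse the extractor's JSON reply, dropping malformed entries."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []  # treat unparseable output as "no new facts"
    if not isinstance(data, list):
        return []
    facts = []
    for entry in data:
        if not isinstance(entry, dict):
            continue
        content = entry.get("content")
        category = entry.get("category")
        if (isinstance(content, str) and content.strip()
                and isinstance(category, str) and category in ALLOWED):
            facts.append({"content": content.strip(), "category": category})
    return facts
```

Validating before writing keeps a single bad LLM reply from polluting either store.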

When to extract:

  • Not after every single message — that’s expensive
  • After every turn pair (user message + assistant response) — the response often clarifies what the user meant
  • Debounce: if the user sends 5 messages in rapid succession, batch the extraction
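
The debounce rule could be sketched as a small buffer that releases a batch either when it fills up or when the user goes quiet. The class name `ExtractionDebouncer` and the thresholds `quiet_seconds` and `max_turns` are illustrative assumptions; timestamps are passed in explicitly so the logic is easy to test.

```python
from dataclasses import dataclass, field


@dataclass
class ExtractionDebouncer:
    """Buffers turns and releases them as one batch (hypothetical helper)."""
    quiet_seconds: float = 5.0   # assumed idle gap before extracting
    max_turns: int = 5           # assumed upper bound on buffered turns
    _buffer: list[str] = field(default_factory=list)
    _last_seen: float = 0.0

    def add_turn(self, turn: str, now: float) -> list[str]:
        """Record a turn; return a batch if the buffer is full, else []."""
        self._buffer.append(turn)
        self._last_seen = now
        if len(self._buffer) >= self.max_turns:
            return self._flush()
        return []

    def poll(self, now: float) -> list[str]:
        """Return the buffered batch once the user has been quiet long enough."""
        if self._buffer and now - self._last_seen >= self.quiet_seconds:
            return self._flush()
        return []

    def _flush(self) -> list[str]:
        batch, self._buffer = self._buffer, []
        return batch
```

A non-empty return value is the signal to run one extraction call over the whole batch.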

2. How would you retrieve the right memories for a new user query?

Two-phase retrieval

Phase 1 — Static injection (once per conversation start): Load the user’s structured profile from Store A. This gives the LLM immediate context: name, key preferences, active constraints. No search needed — it’s a direct key lookup.

Phase 2 — Contextual retrieval (every turn):

  1. Embed the user’s current message
  2. Search Store B for the top-k most similar memory items (k=5–10)
  3. Apply a re-ranking formula to the candidates
  4. Inject the top results into the system prompt
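
Step 2 of the contextual phase can be illustrated with a brute-force cosine search over one user's memories. This is a sketch for clarity only: a production system would presumably use an approximate-nearest-neighbor index instead of a linear scan, and the dict layout of each memory is an assumption.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, memories, user_id, k=5):
    """Return (similarity, content) for the k closest memories of one user.

    `memories` is a list of dicts with 'user_id', 'content', 'embedding'.
    """
    scored = [
        (cosine(query_vec, m["embedding"]), m["content"])
        for m in memories
        if m["user_id"] == user_id  # never search across users
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```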

Re-ranking formula

Raw vector similarity alone isn’t enough. A fact from 2 years ago shouldn’t outrank one from yesterday if both are equally similar to the query.

score = α · sim(q, m) + β · recency(m) + γ · confidence(m)

Where:

  • sim(q, m) = cosine similarity between query embedding and memory embedding
  • recency(m) = e^(-λ · days_since_update) — exponential decay
  • confidence(m) = the stored confidence score (0–1)
  • α = 0.6, β = 0.25, γ = 0.15 — tunable weights
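
The formula translates directly into a few lines of Python. The weights are the ones listed above; the decay rate `lam = 0.01` per day is an assumed value, since no λ is given.

```python
import math


def rerank_score(sim, days_since_update, confidence,
                 alpha=0.6, beta=0.25, gamma=0.15, lam=0.01):
    """score = alpha * sim + beta * exp(-lam * days) + gamma * confidence."""
    recency = math.exp(-lam * days_since_update)
    return alpha * sim + beta * recency + gamma * confidence


fresh = rerank_score(sim=0.8, days_since_update=1, confidence=0.9)
stale = rerank_score(sim=0.8, days_since_update=730, confidence=0.9)
```

With equal similarity and confidence, the two-year-old fact scores roughly 0.25 lower than the one updated yesterday under this assumed λ, because its recency term has decayed to almost zero.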

What gets injected into the prompt

[System] You are a helpful assistant.

[Memory — Profile]
- Name: Alex
- Location: Seattle
- Key preference: concise answers with examples

[Memory — Relevant to this query]
- Prefers dark roast coffee (confidence: 0.95, last confirmed: 3 days ago)
- Allergic to tree nuts (confidence: 1.0, last confirmed: 2 weeks ago)
- Recently booked a trip to Tokyo for April (confidence: 0.8, 1 week ago)

[User] Can you suggest a coffee shop near my hotel in Tokyo?

The profile section is always present. The relevant memories section changes per turn. Together they rarely exceed 500 tokens.
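
Assembling such a prompt could look like the following sketch. The function name `build_prompt` and the exact section labels are placeholders, not a fixed format.

```python
def build_prompt(profile: dict[str, str], memories: list[str], user_message: str) -> str:
    """Assemble system text, profile, per-turn memories, and the user message."""
    lines = ["[System] You are a helpful assistant.", "", "[Memory - Profile]"]
    lines += [f"- {key}: {value}" for key, value in profile.items()]
    if memories:  # the per-turn section is omitted when nothing was retrieved
        lines += ["", "[Memory - Relevant to this query]"]
        lines += [f"- {m}" for m in memories]
    lines += ["", f"[User] {user_message}"]
    return "\n".join(lines)
```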

When NOT to retrieve

Not every query needs memory lookup. “What’s 2+2?” doesn’t benefit from knowing the user’s coffee preferences. A lightweight classifier decides:

  • Does the query contain personal pronouns (“my”, “I”, “me”)?
  • Does it reference past conversations (“like last time”, “as I mentioned”)?
  • Is it a general knowledge question with no personalization angle?

Skip retrieval for clear general-knowledge queries. This saves latency and cost.
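
A rule-based version of this gate might be as simple as two regular expressions. The patterns below are illustrative assumptions covering only the cues listed above; a deployed classifier would need broader coverage or a small model.

```python
import re

PERSONAL = re.compile(r"\b(i|me|my|mine)\b", re.IGNORECASE)
PAST_REFERENCE = re.compile(r"\b(last time|as i mentioned)\b", re.IGNORECASE)


def needs_personalization(query: str) -> bool:
    """Rule-based gate: retrieve only when the query looks personal."""
    return bool(PERSONAL.search(query) or PAST_REFERENCE.search(query))
```

The word boundaries matter: without `\b`, the "i" in "is" or "Italy" would trigger retrieval for plain general-knowledge questions.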


3. Handling conflicts, speed, and scale

Conflict resolution

The problem: User said “I love spicy food” 6 months ago. Today they say “I can’t handle spicy food anymore.”

Strategy: version-chain with last-explicit-statement-wins.

When the extractor produces a new fact:

  1. Search existing memories for semantically similar items (cosine similarity > 0.85 on the same user_id)
  2. If a near-duplicate exists with the same meaning → skip (don’t store “likes coffee” twice)
  3. If a conflicting fact exists (similar topic, opposite meaning):
    • Mark the old fact: superseded_by = new_fact_id
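    • Set the old fact’s confidence to 0 and exclude superseded facts from retrieval (the record stays available for audit). Zeroing confidence alone is not enough: in the re-ranking formula it only removes the γ · confidence term, so a highly similar old fact could still surface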
    • Store the new fact with version = old.version + 1
  4. If ambiguous (the system isn’t sure if it’s a conflict or an addition):
    • Store both, give the newer one higher confidence
    • On next retrieval that surfaces both, let the LLM see the timestamps and decide

Conflict detection prompt (used in step 3):

“Given existing memory: ‘{old_fact}’ and new statement: ‘{new_fact}’, are these (a) the same fact restated, (b) contradictory facts, or (c) independent facts?”

This LLM call is cheap (short prompt, structured output) and only fires when semantic similarity flags a potential conflict.
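
Once the conflict-detection call has returned a label, the version-chain rule can be applied mechanically. In this sketch the label strings ("same", "contradictory", "independent"), the dict layout, and the initial confidence of 0.9 for the new fact are all assumptions.

```python
import uuid


def apply_new_fact(new_content: str, old: dict, relation: str) -> list[dict]:
    """Apply the version-chain rule for one (old, new) pair.

    `relation` is the label from the conflict-detection prompt.
    Returns the records that should be written; `old` is not mutated.
    """
    if relation == "same":
        return []  # near-duplicate: skip the write
    new = {"id": str(uuid.uuid4()), "content": new_content,
           "confidence": 0.9, "version": 1, "superseded_by": None}
    if relation == "contradictory":
        new["version"] = old["version"] + 1
        retired = {**old, "superseded_by": new["id"], "confidence": 0.0}
        return [retired, new]
    return [new]  # independent: store alongside the old fact
```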

Making retrieval fast

| Bottleneck | Solution | Effect |
| --- | --- | --- |
| Vector search on every turn | Pre-filter by user_id partition: search only within one user’s memories | Reduces search space by orders of magnitude |
| Embedding the query | Cache embeddings for repeated/similar queries within a session | Saves ~100ms per cache hit |
| LLM extraction after every turn | Debounce writes: extract every 3–5 turns or after N seconds of inactivity | Reduces extraction LLM calls by 3–5x |
| Loading user profile | Cache in memory for active sessions; read from DB only on session start | Eliminates DB read on every turn |
| Retrieval for trivial queries | Classifier gate: skip retrieval for general knowledge questions | Saves ~50–200ms for non-personal queries |
| Multiple sequential calls | Parallel retrieval: issue profile load + vector search concurrently | Wall clock = slowest call, not sum |
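
The parallel-retrieval row can be illustrated with `asyncio.gather`. The two coroutines below are stand-ins that merely sleep; their names, return values, and delays are hypothetical.

```python
import asyncio


async def load_profile(user_id: str) -> dict:
    await asyncio.sleep(0.05)  # stand-in for a key-value read
    return {"user_id": user_id, "name": "Alex"}


async def vector_search(user_id: str, query: str) -> list[str]:
    await asyncio.sleep(0.10)  # stand-in for a vector search
    return ["Prefers dark roast coffee"]


async def retrieve(user_id: str, query: str):
    # Both calls start immediately, so the total wait is roughly the
    # slower of the two rather than their sum.
    profile, memories = await asyncio.gather(
        load_profile(user_id), vector_search(user_id, query)
    )
    return profile, memories
```

Calling `asyncio.run(retrieve("u1", "coffee near my hotel"))` from synchronous code returns both results after about one vector-search delay.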

Target latency budget:

| Step | Target | Approach |
| --- | --- | --- |
| Classify query | <10ms | Rule-based or cached small model |
| Profile load | <5ms | In-memory cache, refreshed per session |
| Vector search | <30ms | Partitioned index, approximate nearest neighbors |
| Re-rank | <5ms | Simple arithmetic on k=10 candidates |
| Total retrieval overhead | <50ms | Well within user-perceptible threshold |

Managing large history

When a user accumulates thousands of memory items:

Tier 1 — Progressive summarization:

  • Every ~50 episodic memories, run a summarization pass: “Summarize these 50 travel-related memories into 3–5 key facts”
  • Replace the 50 originals with the summaries (keep originals in cold storage for audit)
  • Preferences and constraints are never summarized — they’re already atomic facts

Tier 2 — TTL-based eviction:

  • Episodic memories (“booked flight last week”) decay naturally — TTL of 90 days
  • Preferences only expire if superseded or explicitly forgotten
  • Constraints never auto-expire (allergies don’t go away silently)

Tier 3 — Confidence-based pruning:

  • Facts with confidence < 0.3 that haven’t been accessed in 60 days → delete
  • Facts that were superseded > 30 days ago → delete (the replacement is canonical)
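
Tiers 2 and 3 boil down to a per-record predicate. The sketch below uses the thresholds stated above; the field names (`last_accessed`, `superseded_at`) are assumptions, and so is the choice to let confidence-based pruning apply to preferences, since how the tiers interact for that category is not spelled out.

```python
from datetime import datetime, timedelta


def should_delete(memory: dict, now: datetime) -> bool:
    """Apply the TTL and pruning rules to one memory record.

    Assumed fields: category, confidence, created_at, last_accessed,
    superseded_at (None while the fact is still current).
    """
    if memory["superseded_at"] is not None:
        # Superseded facts are dropped 30 days after being replaced.
        return now - memory["superseded_at"] > timedelta(days=30)
    if memory["category"] == "constraint":
        return False  # constraints never auto-expire
    if (memory["category"] == "episodic"
            and now - memory["created_at"] > timedelta(days=90)):
        return True  # episodic TTL
    unused = now - memory["last_accessed"] > timedelta(days=60)
    return memory["confidence"] < 0.3 and unused
```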

Budget guardrails:

  • Hard cap: 1,000 active memories per user
  • When approaching the cap, trigger a consolidation pass that merges related facts
  • Example: 12 memories about coffee preferences → 2 consolidated facts

4. Feedback loop for improving personalization quality

Three feedback channels

Channel A — Explicit feedback (high signal, low volume):

The user directly tells the system what to remember or forget:

  • “Remember that I’m vegetarian now”
  • “Stop suggesting spicy food”
  • Thumbs up/down buttons on responses

Actions:

  • “Remember X” → extract fact, set confidence = 1.0, skip extraction pipeline
  • “Forget X” → find matching memory, set superseded_by = "user_deleted"
  • 👍 → boost confidence of all memories used in that response by +0.1 (cap at 1.0)
  • 👎 → decay confidence by -0.2, prompt: “What should I have known?”
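
The thumbs-up and thumbs-down adjustments amount to a clamped addition. The function name and signal strings are hypothetical, and the lower bound of 0 is an assumption (only the cap at 1.0 is stated above).

```python
def apply_feedback(confidence: float, signal: str) -> float:
    """Adjust a memory's confidence after a thumbs-up or thumbs-down."""
    if signal == "up":
        delta = 0.1
    elif signal == "down":
        delta = -0.2
    else:
        raise ValueError(f"unknown signal: {signal}")
    # Clamp to [0, 1]; the floor at 0 is an assumption.
    return min(1.0, max(0.0, confidence + delta))
```

The asymmetry (a larger penalty than reward) means one thumbs-down takes two thumbs-up to undo.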

Channel B — Implicit feedback (medium signal, high volume):

Behavioral signals from the conversation itself:

  • User corrects the agent → the memory that led to the wrong assumption should be decayed
  • User rephrases the same question → retrieval missed something relevant
  • User engages longer after a personalized response → the memories used were valuable
  • User changes topic abruptly after a personalized response → the personalization may have been off

Detection: after each turn, a lightweight classifier checks for correction patterns (“No, I said…”, “Actually…”, “That’s not right”). If detected:

  1. Identify which injected memory led to the error
  2. Decay its confidence
  3. Extract the corrected fact as a new memory with high confidence
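
A first-pass correction detector can be a single regular expression over the opening of the user's message. The patterns cover only the example phrases above and are an assumption; identifying which memory caused the error (step 1) is a separate problem not addressed here.

```python
import re

CORRECTION = re.compile(
    r"^\s*(no,|actually\b|that'?s not right|that is not right)",
    re.IGNORECASE,
)


def is_correction(message: str) -> bool:
    """Return True if the message opens with a common correction phrase."""
    normalized = message.replace("\u2019", "'")  # fold curly apostrophes
    return bool(CORRECTION.search(normalized))
```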

Channel C — Offline quality review (high quality, periodic):

Weekly batch process:

  1. Sample 100 conversations where memories were injected
  2. For each, evaluate: did the user correct the agent? Did engagement increase? Did they return?
  3. Compute a personalization quality score per memory category
  4. Adjust the re-ranking weights (α, β, γ) based on which memory types correlated with positive outcomes
  5. Identify memory categories that consistently lead to corrections → tighten extraction prompts for those categories
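
One simple ingredient of such a batch job is the correction rate per memory category. The input shape, a list of `(category, was_corrected)` pairs from the sampled conversations, is an assumption, and a real quality score would likely combine more signals than corrections alone.

```python
from collections import defaultdict


def correction_rate_by_category(samples):
    """Share of sampled conversations per category that ended in a correction.

    `samples` is an iterable of (category, was_corrected) pairs.
    """
    totals = defaultdict(int)
    corrected = defaultdict(int)
    for category, was_corrected in samples:
        totals[category] += 1
        corrected[category] += int(was_corrected)
    return {c: corrected[c] / totals[c] for c in totals}
```

Categories with a persistently high rate are the candidates for tighter extraction prompts in step 5.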

Memory lifecycle with feedback

stateDiagram-v2
    [*] --> Extracted: LLM extracts fact
    Extracted --> Active: confidence > 0.5
    Extracted --> Probationary: confidence ≤ 0.5

    Active --> Boosted: thumbs-up or implicit positive signal
    Active --> Decayed: thumbs-down or user correction
    Active --> Superseded: newer contradicting fact

    Boosted --> Active: cap at 1.0
    Decayed --> Probationary: confidence drops below 0.5
    Decayed --> Active: subsequent positive signal restores

    Probationary --> Active: confirmed by user or retrieval success
    Probationary --> Pruned: 60 days unused + low confidence

    Superseded --> Archived: 30 days
    Pruned --> [*]
    Archived --> [*]

What “improving over time” looks like concretely

| Metric | How it improves | Mechanism |
| --- | --- | --- |
| Retrieval precision | The right memories surface more often | Confidence scores trained by feedback push good memories up, bad ones down |
| Conflict resolution accuracy | Fewer stale facts shown to users | Supersession chain ensures latest statement wins; corrections accelerate this |
| Extraction quality | Fewer irrelevant facts stored | Offline review identifies low-value extraction categories → refine prompts |
| Re-ranking accuracy | The weighting formula gets better | Offline batch analysis adjusts α/β/γ based on positive feedback correlation |
| User trust | Users give more explicit signals over time | Visible memory controls (“Here’s what I remember; edit anytime”) encourage engagement |

Summary of key design decisions

| Decision | Choice | Rationale |
| --- | --- | --- |
| Store raw transcripts? | No; extract structured facts | Transcripts grow unboundedly, are noisy, and expensive to search |
| One store or two? | Two (profile KV + semantic vector) | Profile is always needed (no search); semantic facts need similarity search |
| Conflict resolution | Last-explicit-statement-wins with version chain | Simple, auditable, matches user mental model |
| When to retrieve | Classifier gate: skip for general queries | Saves latency and avoids irrelevant memory injection |
| When to extract | Debounced: every few turns or on inactivity | Balances freshness with cost |
| History management | Progressive summarization + TTL + confidence pruning | Three complementary mechanisms prevent unbounded growth |
| Feedback model | Explicit + implicit + offline batch | Each channel has different signal quality and volume; combining all three gives robust learning |
Jasmine Nguyen