Introduction
LLMs are stateless: they cannot remember anything across runs, sessions, or users. Without memory there is no continuity, no personalization, no learning, and no support for long-running tasks.
Memory stack
- Layer 1: context window (in the prompt) -> lost when the context overflows
- Layer 2: short-term memory (agent session) -> lost when the session ends
- Layer 3: long-term memory (persistent) -> stores facts about users, past conversations, learned preferences
- Layer 4: external knowledge (tools, RAG) -> documents and external APIs; queried by the agent but not maintained by it
Core operations
- Write: store new information
- Read: pull relevant information when needed (the most difficult operation)
- Update: modify an existing memory when new information contradicts it
- Delete: forget/remove stale information that is no longer relevant
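As a rough illustration of the four operations, the sketch below uses a hypothetical in-process `MemoryStore` class. The class name, the key scheme, and the word-overlap scoring in `read` are assumptions made for this example; a real system would use persistent storage and semantic search.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    key: str
    text: str


class MemoryStore:
    """Toy in-memory store (hypothetical); illustrates write/read/update/delete."""

    def __init__(self):
        self._items = {}

    def write(self, key, text):
        # Write: store new information under a caller-chosen key.
        self._items[key] = Memory(key, text)

    def read(self, query, k=3):
        # Read: rank memories by the number of words shared with the query.
        # This naive overlap score stands in for real semantic search.
        query_words = set(query.lower().split())
        scored = []
        for memory in self._items.values():
            overlap = len(query_words & set(memory.text.lower().split()))
            if overlap > 0:
                scored.append((overlap, memory))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [memory for _, memory in scored[:k]]

    def update(self, key, text):
        # Update: overwrite an existing memory when new information contradicts it.
        if key not in self._items:
            raise KeyError(key)
        self._items[key] = Memory(key, text)

    def delete(self, key):
        # Delete: forget a stale memory; deleting a missing key is a no-op.
        self._items.pop(key, None)
```

Read is the hard part in practice because the query rarely shares exact words with the stored memory, which is why the later sections rely on embeddings.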
What to store, what not?
- Stable user facts
- Strong preferences
- Outcomes of past tasks (what worked, what failed)
- Decisions that shape future work (architecture, policies, constraints)
Challenges
- Context window cutoff => summarize old results into a compact form, drop tool outputs that are no longer needed, and retrieve only the top-k most relevant search results.
- Retrieval misses important memories => use semantic search (embeddings) instead of keyword matching, store each memory with a good title and description, and retrieve more candidates than needed.
- Outdated memory
- Memory bloat: LRU policy
- Privacy leaks: scope memories by user id, encrypt sensitive data, require consent before writing
- Short vs long term: if useful only for the current task => short term; if it matters later => long term. When in doubt, use short term.
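The LRU policy mentioned for memory bloat could look like the minimal sketch below. `LRUMemory` and its `capacity` parameter are hypothetical names for this example; the notes do not prescribe an implementation.

```python
from collections import OrderedDict


class LRUMemory:
    """Keeps at most `capacity` memories, evicting the least recently used one."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        # Reading a memory marks it as recently used.
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            # Evict the entry that has gone longest without being read or written.
            self._items.popitem(last=False)
```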
System design
Design a text-only personalization and memory system for an AI chatbot
The chatbot should use a user’s previous conversations, preferences, and feedback to generate more personalized responses in future sessions. The system should support retrieval-augmented generation or a similar approach, but the exact implementation is up to you.
Address the following topics:
- How would you store user memory from past conversations?
- How would you retrieve the right memories for a new user query?
- How would you handle conflicting information in the user’s history, such as an old preference contradicting a newer one? If every chatbot response requires retrieval, how can the system be made faster? If a user’s history becomes very large, how should the system manage and summarize it?
- How would you design a feedback loop so the system improves personalization quality over time?
System Overview
Every conversation generates knowledge about the user. The system must extract that knowledge, store it durably, retrieve the right subset at query time, resolve conflicts when knowledge contradicts itself, and improve over time through feedback.
graph TD
U["User message"] --> CL["Classifier:<br/>needs personalization?"]
CL -->|Yes| R["Retrieval Layer"]
CL -->|No| LLM
R --> VS["Vector Store<br/>(semantic search)"]
R --> KV["Key-Value Store<br/>(user profile)"]
VS --> Ranker["Re-ranker<br/>(recency + relevance + confidence)"]
KV --> Ranker
Ranker --> LLM["LLM generates response<br/>with injected memories"]
LLM --> Resp["Response to user"]
Resp --> Extractor["Post-turn Extractor"]
Extractor --> Writer["Memory Writer"]
Writer --> VS
Writer --> KV
Resp --> FB["Feedback Signals"]
FB --> Scorer["Confidence Scorer"]
Scorer --> Writer
1. How would you store user memory from past conversations?
What to store
Not raw transcripts — those are too noisy and grow unboundedly. Instead, the system extracts structured memory items from conversations. Each item is a single fact about the user.
Memory item schema:
| Field | Type | Purpose |
|---|---|---|
| id | UUID | Unique identifier |
| user_id | string | Partition key; isolates users |
| content | string | The fact in natural language (“Prefers dark roast coffee”) |
| category | enum | preference, constraint, biographical, episodic, procedural |
| confidence | float 0–1 | How sure the system is this fact is accurate |
| version | int | Increments on updates; enables conflict resolution |
| created_at | timestamp | When first extracted |
| updated_at | timestamp | When last confirmed or modified |
| source_turn | string | Which conversation turn produced this |
| superseded_by | UUID or null | Points to the newer fact if this one was overridden |
| embedding | float[] | Vector representation for semantic search |
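One way to express this schema in code is a dataclass, sketched below. The field names follow the table; the default confidence of 0.5, the UTC timestamps, and the range check are assumptions added for illustration.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import List, Optional


class Category(str, Enum):
    PREFERENCE = "preference"
    CONSTRAINT = "constraint"
    BIOGRAPHICAL = "biographical"
    EPISODIC = "episodic"
    PROCEDURAL = "procedural"


def _utc_now():
    return datetime.now(timezone.utc)


@dataclass
class MemoryItem:
    user_id: str                 # partition key
    content: str                 # the fact in natural language
    category: Category
    source_turn: str             # which conversation turn produced the fact
    confidence: float = 0.5      # assumed default; the notes do not specify one
    version: int = 1
    id: uuid.UUID = field(default_factory=uuid.uuid4)
    created_at: datetime = field(default_factory=_utc_now)
    updated_at: datetime = field(default_factory=_utc_now)
    superseded_by: Optional[uuid.UUID] = None
    embedding: List[float] = field(default_factory=list)

    def __post_init__(self):
        # Reject confidence values outside the 0-1 range from the schema.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
```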
Memory categories
- Preference: “Likes Italian food”, “Prefers formal tone”
- Constraint: “Allergic to peanuts”, “Budget under $100”
- Biographical: “Lives in Seattle”, “Software engineer”
- Episodic: “Booked a flight to Tokyo last week” — time-bound events
- Procedural: “When asked about reports, always include charts” — learned interaction patterns
Storage architecture: two stores, one purpose
Store A — Structured profile (key-value / document store): A compact JSON document per user containing their current profile: name, location, active preferences, constraints. This is small (typically <2KB), always loaded, and cheap to read.
Store B — Semantic memory (vector database): All extracted facts with embeddings. Supports similarity search so the system can retrieve memories relevant to a query the user has never explicitly connected to a preference.
Why two stores? The profile is tiny, always relevant, and should be injected every turn without a search. The semantic store is large, query-dependent, and needs vector search. Combining them into one store either makes the profile slow to load or makes search results noisy with always-relevant items.
Extraction pipeline
After each conversation turn:
- Send the conversation to the LLM with a structured extraction prompt:
“Extract any new facts about the user from this conversation. For each fact, classify it as preference/constraint/biographical/episodic/procedural. If no new facts, return empty.”
- Parse the LLM’s structured output (JSON)
- Embed each fact using an embedding model
- Check for duplicates/conflicts against existing memories (see section 3)
- Write to the appropriate store
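The parsing step in this pipeline can be sketched as follows. The JSON shape (a list of objects with `content` and `category` keys) and the function name `parse_extracted_facts` are assumptions; the notes only say that the LLM returns structured JSON.

```python
import json

ALLOWED_CATEGORIES = {
    "preference", "constraint", "biographical", "episodic", "procedural",
}


def parse_extracted_facts(raw):
    """Parse the extractor's JSON output and drop malformed entries."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # LLM output is not guaranteed to be valid JSON; treat it as "no facts".
        return []
    if not isinstance(data, list):
        return []
    facts = []
    for entry in data:
        if not isinstance(entry, dict):
            continue
        content = entry.get("content")
        category = entry.get("category")
        if not isinstance(content, str) or not content.strip():
            continue
        if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
            continue
        facts.append({"content": content.strip(), "category": category})
    return facts
```

Validating before embedding avoids paying for embeddings of entries that would be rejected anyway.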
When to extract:
- Not after every single message — that’s expensive
- After every turn pair (user message + assistant response) — the response often clarifies what the user meant
- Debounce: if the user sends 5 messages in rapid succession, batch the extraction
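A minimal sketch of the debounce rule is shown below. `ExtractionDebouncer`, the 5-turn limit, and the 10-second quiet period are illustrative assumptions; timestamps are passed in explicitly so the logic is easy to test.

```python
class ExtractionDebouncer:
    """Buffers turns and releases them as one batch for extraction."""

    def __init__(self, max_turns=5, quiet_seconds=10.0):
        self.max_turns = max_turns
        self.quiet_seconds = quiet_seconds
        self._pending = []
        self._last_seen = None

    def add_turn(self, turn, now):
        # Record the turn and remember when the user was last active.
        self._pending.append(turn)
        self._last_seen = now

    def take_batch(self, now):
        # Release the batch when enough turns have piled up or the user went quiet.
        if not self._pending:
            return []
        idle = now - self._last_seen
        if len(self._pending) >= self.max_turns or idle >= self.quiet_seconds:
            batch, self._pending = self._pending, []
            return batch
        return []
```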
2. How would you retrieve the right memories for a new user query?
Two-phase retrieval
Phase 1 — Static injection (once per conversation start): Load the user’s structured profile from Store A. This gives the LLM immediate context: name, key preferences, active constraints. No search needed — it’s a direct key lookup.
Phase 2 — Contextual retrieval (every turn):
- Embed the user’s current message
- Search Store B for the top-k most similar memory items (k=5–10)
- Apply a re-ranking formula to the candidates
- Inject the top results into the system prompt
Re-ranking formula
Raw vector similarity alone isn’t enough. A fact from 2 years ago shouldn’t outrank one from yesterday if both are equally similar to the query.
score = α · sim(q, m) + β · recency(m) + γ · confidence(m)
Where:
- sim(q, m) = cosine similarity between the query embedding and the memory embedding
- recency(m) = e^(-λ · days_since_update), an exponential decay
- confidence(m) = the stored confidence score (0–1)
- α = 0.6, β = 0.25, γ = 0.15 are tunable weights
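The formula translates directly into code. In the sketch below the weights come from the notes, while the decay rate `lam = 0.01` (a half-life of roughly 69 days) is an assumption, since no value for λ is given.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rerank_score(sim, days_since_update, confidence,
                 alpha=0.6, beta=0.25, gamma=0.15, lam=0.01):
    # recency(m) = e^(-lambda * days_since_update); lam is an assumed value.
    recency = math.exp(-lam * days_since_update)
    return alpha * sim + beta * recency + gamma * confidence
```

With equal similarity and confidence, a memory updated yesterday scores higher than one updated two years ago, which is the behavior the formula is meant to produce.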
What gets injected into the prompt
[System] You are a helpful assistant.
[Memory — Profile]
- Name: Alex
- Location: Seattle
- Key preference: concise answers with examples
[Memory — Relevant to this query]
- Prefers dark roast coffee (confidence: 0.95, last confirmed: 3 days ago)
- Allergic to tree nuts (confidence: 1.0, last confirmed: 2 weeks ago)
- Recently booked a trip to Tokyo for April (confidence: 0.8, 1 week ago)
[User] Can you suggest a coffee shop near my hotel in Tokyo?
The profile section is always present. The relevant memories section changes per turn. Together they rarely exceed 500 tokens.
When NOT to retrieve
Not every query needs memory lookup. “What’s 2+2?” doesn’t benefit from knowing the user’s coffee preferences. A lightweight classifier decides:
- Does the query contain personal pronouns (“my”, “I”, “me”)?
- Does it reference past conversations (“like last time”, “as I mentioned”)?
- Is it a general knowledge question with no personalization angle?
Skip retrieval for clear general-knowledge queries. This saves latency and cost.
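A rule-based version of this gate might look like the sketch below. The word lists and the function name `needs_personalization` are assumptions; they cover only the cues listed above and will produce false positives (for example "tell me the capital of France"), so a small trained model may be a better fit in practice.

```python
import re

# First-person words suggest the answer may depend on who is asking.
PERSONAL_WORDS = re.compile(r"\b(i|me|my|mine|myself)\b", re.IGNORECASE)
# Phrases that refer back to earlier conversations.
PAST_REFERENCES = re.compile(r"\b(like last time|as i mentioned)\b", re.IGNORECASE)


def needs_personalization(query):
    """Return True when the query should trigger a memory lookup."""
    return bool(PERSONAL_WORDS.search(query) or PAST_REFERENCES.search(query))
```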
3. Handling conflicts, speed, and scale
Conflict resolution
The problem: User said “I love spicy food” 6 months ago. Today they say “I can’t handle spicy food anymore.”
Strategy: version-chain with last-explicit-statement-wins.
When the extractor produces a new fact:
1. Search existing memories for semantically similar items (cosine similarity > 0.85 on the same user_id)
2. If a near-duplicate exists with the same meaning → skip (don’t store “likes coffee” twice)
3. If a conflicting fact exists (similar topic, opposite meaning):
   - Mark the old fact: superseded_by = new_fact_id
   - Set old fact’s confidence to 0 (still searchable for audit, but won’t surface in retrieval)
   - Store the new fact with version = old.version + 1
4. If ambiguous (the system isn’t sure if it’s a conflict or an addition):
   - Store both, give the newer one higher confidence
   - On next retrieval that surfaces both, let the LLM see the timestamps and decide
Conflict detection prompt (used in step 3):
“Given existing memory: ‘{old_fact}’ and new statement: ‘{new_fact}’, are these (a) the same fact restated, (b) contradictory facts, or (c) independent facts?”
This LLM call is cheap (short prompt, structured output) and only fires when semantic similarity flags a potential conflict.
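Once the conflict-detection call has labeled the pair, the bookkeeping could be sketched as below. The function name `resolve_fact`, the relation labels, and the 0.9 starting confidence for new facts are assumptions; the ambiguous case is left out for brevity.

```python
import uuid


def resolve_fact(existing, new_content, relation):
    """Apply the version-chain rule.

    `relation` is the label returned by the conflict-detection prompt:
    "same", "contradictory", or "independent".
    Returns (possibly updated existing fact, new fact or None).
    """
    if relation == "same":
        # Near-duplicate: keep the existing fact and store nothing new.
        return existing, None
    if relation not in ("contradictory", "independent"):
        raise ValueError(f"unknown relation: {relation}")

    new_fact = {
        "id": str(uuid.uuid4()),
        "content": new_content,
        "version": 1,
        "confidence": 0.9,  # assumed starting confidence
        "superseded_by": None,
    }
    if relation == "contradictory":
        # Last explicit statement wins: chain the versions and retire the old fact.
        new_fact["version"] = existing["version"] + 1
        retired = {**existing, "superseded_by": new_fact["id"], "confidence": 0.0}
        return retired, new_fact
    # Independent facts coexist unchanged.
    return existing, new_fact
```

The old fact is copied rather than mutated, so the caller decides when to persist the change.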
Making retrieval fast
| Bottleneck | Solution | Effect |
|---|---|---|
| Vector search on every turn | Pre-filter by user_id partition — search only within one user’s memories | Reduces search space by orders of magnitude |
| Embedding the query | Cache embeddings for repeated/similar queries within a session | Saves ~100ms per cache hit |
| LLM extraction after every turn | Debounce writes — extract every 3–5 turns or after N seconds of inactivity | Reduces extraction LLM calls by 3–5x |
| Loading user profile | Cache in-memory for active sessions — only read from DB on session start | Eliminates DB read on every turn |
| Retrieval for trivial queries | Classifier gate — skip retrieval for general knowledge questions | Saves ~50–200ms for non-personal queries |
| Multiple sequential calls | Parallel retrieval — issue profile load + vector search concurrently | Wall clock = slowest call, not sum |
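The parallel-retrieval row can be illustrated with `asyncio.gather`. In this sketch `load_profile` and `vector_search` are hypothetical stand-ins that simulate I/O latency with `asyncio.sleep`; real implementations would call the cache and the vector database.

```python
import asyncio


async def load_profile(user_id):
    await asyncio.sleep(0.01)  # simulated cache or key-value read
    return {"user_id": user_id, "name": "Alex"}


async def vector_search(user_id, query, k=5):
    await asyncio.sleep(0.02)  # simulated vector database query
    results = [f"memory relevant to: {query}"]
    return results[:k]


async def retrieve_context(user_id, query):
    # Both calls run concurrently, so the wait is about the slower of the two.
    profile, memories = await asyncio.gather(
        load_profile(user_id),
        vector_search(user_id, query),
    )
    return profile, memories


if __name__ == "__main__":
    print(asyncio.run(retrieve_context("u1", "coffee in Tokyo")))
```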
Target latency budget:
| Step | Target | Approach |
|---|---|---|
| Classify query | <10ms | Rule-based or cached small model |
| Profile load | <5ms | In-memory cache, refreshed per session |
| Vector search | <30ms | Partitioned index, approximate nearest neighbors |
| Re-rank | <5ms | Simple arithmetic on k=10 candidates |
| Total retrieval overhead | <50ms | Well below the delay a user would notice |
Managing large history
When a user accumulates thousands of memory items:
Tier 1 — Progressive summarization:
- Every ~50 episodic memories, run a summarization pass: “Summarize these 50 travel-related memories into 3–5 key facts”
- Replace the 50 originals with the summaries (keep originals in cold storage for audit)
- Preferences and constraints are never summarized — they’re already atomic facts
Tier 2 — TTL-based eviction:
- Episodic memories (“booked flight last week”) decay naturally — TTL of 90 days
- Preferences only expire if superseded or explicitly forgotten
- Constraints never auto-expire (allergies don’t go away silently)
Tier 3 — Confidence-based pruning:
- Facts with confidence < 0.3 that haven’t been accessed in 60 days → delete
- Facts that were superseded > 30 days ago → delete (the replacement is canonical)
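The TTL and pruning rules from Tiers 2 and 3 can be combined into one predicate, sketched below. The `last_accessed` and `superseded_at` fields are not part of the schema above and are assumed here, as is measuring the episodic TTL from `updated_at`.

```python
from datetime import datetime, timedelta

EPISODIC_TTL = timedelta(days=90)
LOW_CONFIDENCE_IDLE = timedelta(days=60)
SUPERSEDED_GRACE = timedelta(days=30)


def should_delete(item, now):
    """Return True when a memory item is due for deletion."""
    # Tier 2: only episodic memories expire by TTL. Preferences and
    # constraints never reach this branch, so they do not auto-expire.
    if item["category"] == "episodic" and now - item["updated_at"] > EPISODIC_TTL:
        return True
    # Tier 3: low-confidence facts that nobody has retrieved recently.
    if item["confidence"] < 0.3 and now - item["last_accessed"] > LOW_CONFIDENCE_IDLE:
        return True
    # Tier 3: superseded facts once the replacement has become canonical.
    superseded_at = item.get("superseded_at")
    if superseded_at is not None and now - superseded_at > SUPERSEDED_GRACE:
        return True
    return False
```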
Budget guardrails:
- Hard cap: 1,000 active memories per user
- When approaching the cap, trigger a consolidation pass that merges related facts
- Example: 12 memories about coffee preferences → 2 consolidated facts
4. Feedback loop for improving personalization quality
Three feedback channels
Channel A — Explicit feedback (high signal, low volume):
The user directly tells the system what to remember or forget:
- “Remember that I’m vegetarian now”
- “Stop suggesting spicy food”
- Thumbs up/down buttons on responses
Actions:
- “Remember X” → extract the fact right away, set confidence = 1.0, and bypass the regular (debounced) extraction pipeline
- “Forget X” → find matching memory, set superseded_by = "user_deleted"
- 👍 → boost confidence of all memories used in that response by +0.1 (cap at 1.0)
- 👎 → lower confidence of those memories by 0.2 and prompt: “What should I have known?”
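The thumbs-up and thumbs-down adjustments amount to a clamped update, sketched below. The function name `apply_thumbs` and the lower bound of 0.0 are assumptions; the notes only state the cap at 1.0.

```python
def apply_thumbs(confidence, thumbs_up):
    """Return the new confidence after a thumbs-up (+0.1) or thumbs-down (-0.2)."""
    delta = 0.1 if thumbs_up else -0.2
    # Clamp to the 0-1 range used by the memory schema (the floor is assumed).
    return min(1.0, max(0.0, confidence + delta))
```

The asymmetric step sizes mean one thumbs-down takes two thumbs-up to undo, which makes the system quicker to distrust a memory than to trust it.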
Channel B — Implicit feedback (medium signal, high volume):
Behavioral signals from the conversation itself:
- User corrects the agent → the memory that led to the wrong assumption should be decayed
- User rephrases the same question → retrieval missed something relevant
- User engages longer after a personalized response → the memories used were valuable
- User changes topic abruptly after a personalized response → the personalization may have been off
Detection: after each turn, a lightweight classifier checks for correction patterns (“No, I said…”, “Actually…”, “That’s not right”). If detected:
- Identify which injected memory led to the error
- Decay its confidence
- Extract the corrected fact as a new memory with high confidence
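The correction check could start as a few regular expressions, as in this sketch. The patterns mirror the three examples above; anchoring them to the start of the message is an assumption made to reduce false positives, and `looks_like_correction` is a hypothetical name.

```python
import re

# Matches messages that open with a correction cue. Both straight and
# curly apostrophes are accepted in "that's".
CORRECTION_PATTERN = re.compile(
    r"^\s*(no,\s+i said\b|actually\b|that[’']s not right)",
    re.IGNORECASE,
)


def looks_like_correction(message):
    """Return True when the user's message starts with a correction cue."""
    return CORRECTION_PATTERN.search(message) is not None
```

Identifying which injected memory caused the error still needs a separate step, for example asking the LLM to compare the correction against the memories that were in the prompt.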
Channel C — Offline quality review (high quality, periodic):
Weekly batch process:
- Sample 100 conversations where memories were injected
- For each, evaluate: did the user correct the agent? Did engagement increase? Did they return?
- Compute a personalization quality score per memory category
- Adjust the re-ranking weights (α, β, γ) based on which memory types correlated with positive outcomes
- Identify memory categories that consistently lead to corrections → tighten extraction prompts for those categories
Memory lifecycle with feedback
stateDiagram-v2
[*] --> Extracted: LLM extracts fact
Extracted --> Active: confidence > 0.5
Extracted --> Probationary: confidence ≤ 0.5
Active --> Boosted: thumbs-up or implicit positive signal
Active --> Decayed: thumbs-down or user correction
Active --> Superseded: newer contradicting fact
Boosted --> Active: cap at 1.0
Decayed --> Probationary: confidence drops below 0.5
Decayed --> Active: subsequent positive signal restores
Probationary --> Active: confirmed by user or retrieval success
Probationary --> Pruned: 60 days unused + low confidence
Superseded --> Archived: 30 days
Pruned --> [*]
Archived --> [*]
What “improving over time” looks like concretely
| Metric | How it improves | Mechanism |
|---|---|---|
| Retrieval precision | The right memories surface more often | Confidence scores trained by feedback push good memories up, bad ones down |
| Conflict resolution accuracy | Fewer stale facts shown to users | Supersession chain ensures latest statement wins; corrections accelerate this |
| Extraction quality | Fewer irrelevant facts stored | Offline review identifies low-value extraction categories → refine prompts |
| Re-ranking accuracy | The weighting formula gets better | Offline batch analysis adjusts α/β/γ based on positive feedback correlation |
| User trust | Users give more explicit signals over time | Visible memory controls (“Here’s what I remember — edit anytime”) encourage engagement |
Summary of key design decisions
| Decision | Choice | Rationale |
|---|---|---|
| Store raw transcripts? | No — extract structured facts | Transcripts grow unboundedly, are noisy, and expensive to search |
| One store or two? | Two (profile KV + semantic vector) | Profile is always needed (no search); semantic facts need similarity search |
| Conflict resolution | Last-explicit-statement-wins with version chain | Simple, auditable, matches user mental model |
| When to retrieve | Classifier gate — skip for general queries | Saves latency and avoids irrelevant memory injection |
| When to extract | Debounced — every few turns or on inactivity | Balances freshness with cost |
| History management | Progressive summarization + TTL + confidence pruning | Three complementary mechanisms prevent unbounded growth |
| Feedback model | Explicit + implicit + offline batch | Each channel has different signal quality and volume; combining all three gives robust learning |