Introduction

LLMs are stateless: on their own they cannot remember anything across runs, sessions, or users. Without memory there is no continuity, no personalization, no learning, and no support for long-running tasks.

Memory stack

Layer 1: context window (in the prompt) -> lost when the context overflows
Layer 2: short-term memory (agent session) -> lost when the session ends
Layer 3: long-term memory (persistent) -> stores facts about users, past conversations, learned preferences
Layer 4: external knowledge (tools, RAG) -> documents and external APIs; queried by the agent but not maintained by it

Core operations

Write: store new information
Read: pull relevant information when needed (the most difficult operation)
Update: modify an existing memory when new information contradicts it
Delete: forget stale information that is no longer relevant
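
The four operations can be sketched as a toy in-memory store. This is only an illustration: `MemoryStore` and its method names are hypothetical, and the keyword match in `read` stands in for whatever relevance search a real system would use.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MemoryStore:
    """Toy in-memory store (hypothetical); keys and values are plain strings."""
    items: Dict[str, str] = field(default_factory=dict)

    def write(self, key: str, value: str) -> None:
        # Write: store new information under a key.
        self.items[key] = value

    def read(self, query: str) -> List[str]:
        # Read: naive keyword match, a stand-in for real relevance search.
        q = query.lower()
        return [v for k, v in self.items.items()
                if q in k.lower() or q in v.lower()]

    def update(self, key: str, value: str) -> None:
        # Update: overwrite only if the memory already exists.
        if key not in self.items:
            raise KeyError(key)
        self.items[key] = value

    def delete(self, key: str) -> None:
        # Delete: forget stale information; missing keys are ignored.
        self.items.pop(key, None)


store = MemoryStore()
store.write("diet", "vegetarian")
store.update("diet", "vegan")   # newer information replaces the old value
print(store.read("diet"))       # ['vegan']
store.delete("diet")
print(store.read("diet"))       # []
```

Read is the hard part in practice because the query rarely shares keywords with the stored memory, which is why the later sections rely on embeddings.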

What to store, what not?

Stable user facts
Strong preferences
Outcomes of past tasks (what worked, what failed)
Decisions that shape future work (architecture, policies, constraints)

Challenges

Context window cutoff => summarize old results into a compact form, drop tool outputs that are no longer needed, retrieve only the top-k most relevant search results
Retrieval misses important memories => use semantic search (embeddings) instead of keyword matching, store each memory with a good title and description, retrieve more candidates than needed
Outdated memory: update or delete entries when newer information contradicts them
Memory bloat: evict with an LRU policy
Privacy leaks: scope memories to a user id, encrypt sensitive data, require consent before writing
Short vs long term: if useful only to the current task => short term; if it matters later => long term. When in doubt, use short term
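
As one illustration of the LRU policy mentioned for memory bloat, here is a minimal sketch built on `collections.OrderedDict`. The class name `LRUMemory` and the capacity value are assumptions made for the example, not part of the design above.

```python
from collections import OrderedDict
from typing import List, Optional


class LRUMemory:
    """Keeps at most `capacity` memories; evicts the least recently used."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._items: "OrderedDict[str, str]" = OrderedDict()

    def get(self, key: str) -> Optional[str]:
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: str, value: str) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used

    def keys(self) -> List[str]:
        return list(self._items)


memory = LRUMemory(capacity=2)
memory.put("a", "likes tea")
memory.put("b", "lives in Seattle")
memory.get("a")                 # touching "a" makes "b" the eviction candidate
memory.put("c", "vegetarian")
print(memory.keys())            # ['a', 'c']
```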

System design

Design a text-only personalization and memory system for an AI chatbot

The chatbot should use a user’s previous conversations, preferences, and feedback to generate more personalized responses in future sessions. The system should support retrieval-augmented generation or a similar approach, but the exact implementation is up to you.

Address the following topics:

How would you store user memory from past conversations?
How would you retrieve the right memories for a new user query?
How would you handle conflicting information in the user’s history, such as an old preference contradicting a newer one?
If every chatbot response requires retrieval, how can the system be made faster?
If a user’s history becomes very large, how should the system manage and summarize it?
How would you design a feedback loop so the system improves personalization quality over time?

System Overview

Every conversation generates knowledge about the user. The system must extract that knowledge, store it durably, retrieve the right subset at query time, resolve conflicts when knowledge contradicts itself, and improve over time through feedback.

graph TD
    U["User message"] --> CL["Classifier:<br/>needs personalization?"]
    CL -->|Yes| R["Retrieval Layer"]
    CL -->|No| LLM
    R --> VS["Vector Store<br/>(semantic search)"]
    R --> KV["Key-Value Store<br/>(user profile)"]
    VS --> Ranker["Re-ranker<br/>(recency + relevance + confidence)"]
    KV --> Ranker
    Ranker --> LLM["LLM generates response<br/>with injected memories"]
    LLM --> Resp["Response to user"]
    Resp --> Extractor["Post-turn Extractor"]
    Extractor --> Writer["Memory Writer"]
    Writer --> VS
    Writer --> KV
    Resp --> FB["Feedback Signals"]
    FB --> Scorer["Confidence Scorer"]
    Scorer --> Writer

1. How would you store user memory from past conversations?

What to store

Not raw transcripts — those are too noisy and grow unboundedly. Instead, the system extracts structured memory items from conversations. Each item is a single fact about the user.

Memory item schema:

Field	Type	Purpose
`id`	UUID	Unique identifier
`user_id`	string	Partition key — isolates users
`content`	string	The fact in natural language (“Prefers dark roast coffee”)
`category`	enum	`preference`, `constraint`, `biographical`, `episodic`, `procedural`
`confidence`	float 0–1	How sure the system is this fact is accurate
`version`	int	Increments on updates; enables conflict resolution
`created_at`	timestamp	When first extracted
`updated_at`	timestamp	When last confirmed or modified
`source_turn`	string	Which conversation turn produced this
`superseded_by`	UUID or null	Points to the newer fact if this one was overridden
`embedding`	float[]	Vector representation for semantic search
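
The schema above maps naturally onto a dataclass. The sketch below is one possible encoding; the default confidence of 0.5 and the use of UTC timestamps are assumptions, since the table does not specify them.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import List, Optional
from uuid import UUID, uuid4


class Category(str, Enum):
    PREFERENCE = "preference"
    CONSTRAINT = "constraint"
    BIOGRAPHICAL = "biographical"
    EPISODIC = "episodic"
    PROCEDURAL = "procedural"


def _now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class MemoryItem:
    user_id: str                      # partition key
    content: str                      # the fact in natural language
    category: Category
    source_turn: str
    confidence: float = 0.5           # assumed default, not from the schema
    version: int = 1
    id: UUID = field(default_factory=uuid4)
    created_at: datetime = field(default_factory=_now)
    updated_at: datetime = field(default_factory=_now)
    superseded_by: Optional[UUID] = None
    embedding: List[float] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Enforce the 0..1 range stated in the schema.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")


item = MemoryItem(
    user_id="u1",
    content="Prefers dark roast coffee",
    category=Category.PREFERENCE,
    source_turn="conv-42/turn-3",     # hypothetical turn identifier
)
print(item.category.value, item.version)  # preference 1
```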

Memory categories

Preference: “Likes Italian food”, “Prefers formal tone”
Constraint: “Allergic to peanuts”, “Budget under $100”
Biographical: “Lives in Seattle”, “Software engineer”
Episodic: “Booked a flight to Tokyo last week” — time-bound events
Procedural: “When asked about reports, always include charts” — learned interaction patterns

Storage architecture: two stores, one purpose

Store A — Structured profile (key-value / document store): A compact JSON document per user containing their current profile: name, location, active preferences, constraints. This is small (typically <2KB), always loaded, and cheap to read.

Store B — Semantic memory (vector database): All extracted facts with embeddings. Supports similarity search so the system can retrieve memories relevant to a query the user has never explicitly connected to a preference.

Why two stores? The profile is tiny, always relevant, and should be injected every turn without a search. The semantic store is large, query-dependent, and needs vector search. Combining them into one store either makes the profile slow to load or makes search results noisy with always-relevant items.

Extraction pipeline

After each conversation turn:

Send the conversation to the LLM with a structured extraction prompt:

“Extract any new facts about the user from this conversation. For each fact, classify it as preference/constraint/biographical/episodic/procedural. If no new facts, return empty.”
Parse the LLM’s structured output (JSON)
Embed each fact using an embedding model
Check for duplicates/conflicts against existing memories (see section 3)
Write to the appropriate store
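
The parsing step can be sketched as follows. The reply shape (a JSON list of objects with `content` and `category` keys) is an assumption for illustration; the notes only say the output is JSON. Malformed output is treated as "no new facts" rather than raising.

```python
import json
from typing import Dict, List

ALLOWED = {"preference", "constraint", "biographical", "episodic", "procedural"}


def parse_extraction(raw: str) -> List[Dict[str, str]]:
    """Parse the extractor's JSON reply; drop malformed or unknown entries."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []  # unparseable output is treated as "no new facts"
    if not isinstance(data, list):
        return []
    facts: List[Dict[str, str]] = []
    for entry in data:
        if not isinstance(entry, dict):
            continue
        content = entry.get("content")
        category = entry.get("category")
        if (isinstance(content, str) and content.strip()
                and isinstance(category, str) and category in ALLOWED):
            facts.append({"content": content.strip(), "category": category})
    return facts


reply = ('[{"content": "Lives in Seattle", "category": "biographical"},'
         ' {"content": "Seems cheerful", "category": "mood"}]')
print(parse_extraction(reply))
# [{'content': 'Lives in Seattle', 'category': 'biographical'}]
```

Validating the category here keeps unknown labels out of the store, so the embedding and conflict-check steps only see well-formed facts.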

When to extract:

Not after every single message — that’s expensive
After every turn pair (user message + assistant response) — the response often clarifies what the user meant
Debounce: if the user sends 5 messages in rapid succession, batch the extraction
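
A debounce of this kind can be sketched with a small buffer that flushes either when enough turns accumulate or when the user goes quiet. The class name and the thresholds (`max_turns`, `idle_seconds`) are hypothetical; the clock is passed in explicitly so the logic is easy to test.

```python
from typing import List


class ExtractionDebouncer:
    """Buffers turns; flushes after `max_turns` or `idle_seconds` of quiet."""

    def __init__(self, max_turns: int = 5, idle_seconds: float = 30.0) -> None:
        self.max_turns = max_turns
        self.idle_seconds = idle_seconds
        self._buffer: List[str] = []
        self._last_seen = 0.0

    def add(self, turn: str, now: float) -> None:
        self._buffer.append(turn)
        self._last_seen = now

    def should_flush(self, now: float) -> bool:
        if not self._buffer:
            return False
        idle = now - self._last_seen
        return len(self._buffer) >= self.max_turns or idle >= self.idle_seconds

    def flush(self) -> List[str]:
        # Hand the whole batch to the extractor and reset the buffer.
        batch, self._buffer = self._buffer, []
        return batch


debouncer = ExtractionDebouncer(max_turns=3, idle_seconds=10.0)
for t, text in enumerate(["hi", "I moved to Seattle", "and I'm vegetarian now"]):
    debouncer.add(text, now=float(t))
if debouncer.should_flush(now=2.0):
    print(debouncer.flush())  # one extraction call for all three messages
```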

2. How would you retrieve the right memories for a new user query?

Two-phase retrieval

Phase 1 — Static injection (once per conversation start): Load the user’s structured profile from Store A. This gives the LLM immediate context: name, key preferences, active constraints. No search needed — it’s a direct key lookup.

Phase 2 — Contextual retrieval (every turn):

Embed the user’s current message
Search Store B for the top-k most similar memory items (k=5–10)
Apply a re-ranking formula to the candidates
Inject the top results into the system prompt

Re-ranking formula

Raw vector similarity alone isn’t enough. A fact from 2 years ago shouldn’t outrank one from yesterday if both are equally similar to the query.

score = α · sim(q, m) + β · recency(m) + γ · confidence(m)

Where:

sim(q, m) = cosine similarity between query embedding and memory embedding
recency(m) = e^(-λ · days_since_update) — exponential decay
confidence(m) = the stored confidence score (0–1)
α = 0.6, β = 0.25, γ = 0.15 — tunable weights
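
The formula can be written out directly. In this sketch the decay rate `lam` is an assumed value, since the notes leave λ unspecified; the weights are the ones listed above.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recency(days_since_update: float, lam: float = 0.01) -> float:
    # Exponential decay; lam = 0.01 is an assumption, not from the notes.
    return math.exp(-lam * days_since_update)


def score(query_vec: Sequence[float], memory_vec: Sequence[float],
          days_since_update: float, confidence: float,
          alpha: float = 0.6, beta: float = 0.25, gamma: float = 0.15,
          lam: float = 0.01) -> float:
    return (alpha * cosine(query_vec, memory_vec)
            + beta * recency(days_since_update, lam)
            + gamma * confidence)


query = [1.0, 0.0]
yesterday = score(query, [1.0, 0.0], days_since_update=1, confidence=0.9)
two_years = score(query, [1.0, 0.0], days_since_update=730, confidence=0.9)
print(yesterday > two_years)  # True: equal similarity, the fresher fact wins
```

Because the weights sum to 1 and each term lies in [0, 1] for non-negative similarity, the score stays in a comparable range across memories.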

What gets injected into the prompt

[System] You are a helpful assistant.

[Memory — Profile]
- Name: Alex
- Location: Seattle
- Key preference: concise answers with examples

[Memory — Relevant to this query]
- Prefers dark roast coffee (confidence: 0.95, last confirmed: 3 days ago)
- Allergic to tree nuts (confidence: 1.0, last confirmed: 2 weeks ago)
- Recently booked a trip to Tokyo for April (confidence: 0.8, 1 week ago)

[User] Can you suggest a coffee shop near my hotel in Tokyo?

The profile section is always present. The relevant memories section changes per turn. Together they rarely exceed 500 tokens.
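
Assembling that prompt could look like the sketch below. The function name `build_prompt` is hypothetical and the section labels are simplified to plain ASCII; the relevant-memories section is omitted when nothing was retrieved.

```python
from typing import Dict, List


def build_prompt(profile: Dict[str, str], memories: List[str],
                 user_message: str) -> str:
    lines = ["[System] You are a helpful assistant.", "", "[Memory - Profile]"]
    lines += [f"- {key}: {value}" for key, value in profile.items()]
    if memories:
        # Only add the per-turn section when retrieval returned something.
        lines += ["", "[Memory - Relevant to this query]"]
        lines += [f"- {m}" for m in memories]
    lines += ["", f"[User] {user_message}"]
    return "\n".join(lines)


print(build_prompt(
    {"Name": "Alex", "Location": "Seattle"},
    ["Prefers dark roast coffee (confidence: 0.95, last confirmed: 3 days ago)"],
    "Can you suggest a coffee shop near my hotel in Tokyo?",
))
```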

When NOT to retrieve

Not every query needs memory lookup. “What’s 2+2?” doesn’t benefit from knowing the user’s coffee preferences. A lightweight classifier decides:

Does the query contain personal pronouns (“my”, “I”, “me”)?
Does it reference past conversations (“like last time”, “as I mentioned”)?
Is it a general knowledge question with no personalization angle?

Skip retrieval for clear general-knowledge queries. This saves latency and cost.
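
A rule-based version of this gate might look like the following sketch. The patterns cover only the cues listed above and are assumptions about what a first iteration would check; a production classifier would likely need more signals.

```python
import re

# Personal pronouns suggest the answer may depend on who is asking.
PERSONAL = re.compile(r"\b(i|my|me|mine)\b", re.IGNORECASE)
# Explicit references to earlier conversations.
PAST_REFERENCE = re.compile(r"\b(like last time|as i mentioned)\b", re.IGNORECASE)


def needs_retrieval(query: str) -> bool:
    return bool(PERSONAL.search(query) or PAST_REFERENCE.search(query))


print(needs_retrieval("What's 2+2?"))                               # False
print(needs_retrieval("Can you suggest a cafe near my hotel?"))     # True
print(needs_retrieval("Book the same place, like last time"))       # True
```

The gate errs on the side of retrieving: any personal cue triggers a lookup, and only clearly general queries skip it.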

3. Handling conflicts, speed, and scale

Conflict resolution

The problem: User said “I love spicy food” 6 months ago. Today they say “I can’t handle spicy food anymore.”

Strategy: version-chain with last-explicit-statement-wins.

When the extractor produces a new fact:

1. Search existing memories for semantically similar items (cosine similarity > 0.85 within the same user_id)
2. If a near-duplicate exists with the same meaning → skip (don’t store “likes coffee” twice)
3. If a conflicting fact exists (similar topic, opposite meaning):
- Mark the old fact: superseded_by = new_fact_id
- Set the old fact’s confidence to 0 and filter out superseded facts at retrieval time (the fact stays stored for audit but does not surface in results)
- Store the new fact with version = old.version + 1
4. If ambiguous (the system isn’t sure whether it’s a conflict or an addition):
- Store both, give the newer one higher confidence
- On the next retrieval that surfaces both, let the LLM see the timestamps and decide

Conflict detection prompt (used in step 3):

“Given existing memory: ‘{old_fact}’ and new statement: ‘{new_fact}’, are these (a) the same fact restated, (b) contradictory facts, or (c) independent facts?”

This LLM call is cheap (short prompt, structured output) and only fires when semantic similarity flags a potential conflict.
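
Once the conflict-detection call returns its verdict, applying it is mechanical. In this sketch the `Fact` type and the verdict labels (`"same"`, `"contradictory"`, `"independent"`) are hypothetical names for the three answers (a), (b), and (c).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Fact:
    id: str
    content: str
    confidence: float
    version: int = 1
    superseded_by: Optional[str] = None


def resolve(old: Fact, new: Fact, verdict: str) -> Optional[Fact]:
    """Apply the verdict; return the fact to store, or None to skip."""
    if verdict == "same":
        return None                      # restated fact: do not store twice
    if verdict == "contradictory":
        old.superseded_by = new.id       # version chain: old points to new
        old.confidence = 0.0
        new.version = old.version + 1
        return new
    return new                           # independent: store alongside


old = Fact("m1", "Loves spicy food", confidence=0.9)
new = Fact("m2", "Can't handle spicy food anymore", confidence=0.9)
stored = resolve(old, new, "contradictory")
print(old.superseded_by, stored.version)  # m2 2
```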

Making retrieval fast

Bottleneck	Solution	Effect
Vector search on every turn	Pre-filter by user_id partition — search only within one user’s memories	Reduces search space by orders of magnitude
Embedding the query	Cache embeddings for repeated/similar queries within a session	Saves ~100ms per cache hit
LLM extraction after every turn	Debounce writes — extract every 3–5 turns or after N seconds of inactivity	Reduces extraction LLM calls by 3–5x
Loading user profile	Cache in-memory for active sessions — only read from DB on session start	Eliminates DB read on every turn
Retrieval for trivial queries	Classifier gate — skip retrieval for general knowledge questions	Saves ~50–200ms for non-personal queries
Multiple sequential calls	Parallel retrieval — issue profile load + vector search concurrently	Wall clock = slowest call, not sum
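
The parallel-retrieval row can be illustrated with `asyncio.gather`. The two coroutines below are stand-ins with artificial delays, not real store clients; their names and return values are made up for the example.

```python
import asyncio
import time
from typing import Dict, List


async def load_profile(user_id: str) -> Dict[str, str]:
    await asyncio.sleep(0.05)            # stand-in for a key-value read
    return {"user_id": user_id, "name": "Alex"}


async def vector_search(user_id: str, query: str) -> List[str]:
    await asyncio.sleep(0.10)            # stand-in for a vector search
    return ["Prefers dark roast coffee"]


async def retrieve(user_id: str, query: str):
    # Both calls run concurrently; wall-clock time tracks the slower one.
    return await asyncio.gather(
        load_profile(user_id),
        vector_search(user_id, query),
    )


start = time.perf_counter()
profile, memories = asyncio.run(retrieve("u1", "coffee near my hotel"))
elapsed = time.perf_counter() - start
print(profile["name"], memories, f"{elapsed:.2f}s")
```

Run sequentially, the two stand-in calls would take roughly the sum of their delays; gathered, the elapsed time is close to the longer delay alone.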

Target latency budget:

Step	Target	Approach
Classify query	<10ms	Rule-based or cached small model
Profile load	<5ms	In-memory cache, refreshed per session
Vector search	<30ms	Partitioned index, approximate nearest neighbors
Re-rank	<5ms	Simple arithmetic on k=10 candidates
Total retrieval overhead	<50ms	Well below what users perceive as delay; excludes the query-embedding call on a cache miss

Managing large history

When a user accumulates thousands of memory items:

Tier 1 — Progressive summarization:

Every ~50 episodic memories, run a summarization pass: “Summarize these 50 travel-related memories into 3–5 key facts”
Replace the 50 originals with the summaries (keep originals in cold storage for audit)
Preferences and constraints are never summarized — they’re already atomic facts

Tier 2 — TTL-based eviction:

Episodic memories (“booked flight last week”) decay naturally — TTL of 90 days
Preferences only expire if superseded or explicitly forgotten
Constraints never auto-expire (allergies don’t go away silently)

Tier 3 — Confidence-based pruning:

Facts with confidence < 0.3 that haven’t been accessed in 60 days → delete
Facts that were superseded > 30 days ago → remove from the active store and archive (the replacement is canonical)
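
The TTL and pruning rules can be combined into a single eviction check. This sketch takes one possible reading of the tiers: constraints are exempt from all automatic eviction, the 90-day TTL applies only to episodic memories, and confidence pruning applies to every other category. The `MemoryState` type and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MemoryState:
    category: str
    confidence: float
    age_days: int                          # days since updated_at
    days_since_access: int
    superseded_days: Optional[int] = None  # None while the fact is current


def should_evict(m: MemoryState) -> bool:
    """True if the memory should leave the active store."""
    if m.superseded_days is not None:
        return m.superseded_days > 30      # superseded more than 30 days ago
    if m.category == "constraint":
        return False                       # constraints never auto-expire
    if m.category == "episodic" and m.age_days > 90:
        return True                        # TTL for time-bound events
    # Confidence-based pruning for rarely used, low-confidence facts.
    return m.confidence < 0.3 and m.days_since_access >= 60


print(should_evict(MemoryState("episodic", 0.9, age_days=120, days_since_access=1)))
# True
print(should_evict(MemoryState("constraint", 0.1, age_days=400, days_since_access=400)))
# False
```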

Budget guardrails:

Hard cap: 1,000 active memories per user
When approaching the cap, trigger a consolidation pass that merges related facts
Example: 12 memories about coffee preferences → 2 consolidated facts

4. Feedback loop for improving personalization quality

Three feedback channels

Channel A — Explicit feedback (high signal, low volume):

The user directly tells the system what to remember or forget:

“Remember that I’m vegetarian now”
“Stop suggesting spicy food”
Thumbs up/down buttons on responses

Actions:

“Remember X” → store the fact immediately with confidence = 1.0, bypassing the debounced extraction pipeline
“Forget X” → find the matching memory, set superseded_by = "user_deleted"
👍 → boost confidence of all memories used in that response by 0.1 (cap at 1.0)
👎 → reduce confidence of those memories by 0.2, and prompt: “What should I have known?”
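
The thumbs-up and thumbs-down adjustments reduce to a clamped update. In this sketch the signal names `"up"` and `"down"` are hypothetical, and the floor at 0.0 is an assumption that follows from the schema's 0 to 1 range (the notes only state the cap).

```python
def apply_feedback(confidence: float, signal: str) -> float:
    """Return the new confidence after a thumbs-up or thumbs-down."""
    if signal == "up":
        delta = 0.1
    elif signal == "down":
        delta = -0.2
    else:
        raise ValueError(f"unknown signal: {signal!r}")
    # Clamp to [0, 1]; the lower bound is an assumption from the schema.
    return min(1.0, max(0.0, confidence + delta))


print(apply_feedback(0.95, "up"))    # 1.0
print(apply_feedback(0.1, "down"))   # 0.0
```

The same function would be applied to every memory that was injected into the rated response.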

Channel B — Implicit feedback (medium signal, high volume):

Behavioral signals from the conversation itself:

User corrects the agent → the memory that led to the wrong assumption should be decayed
User rephrases the same question → retrieval missed something relevant
User engages longer after a personalized response → the memories used were valuable
User changes topic abruptly after a personalized response → the personalization may have been off

Detection: after each turn, a lightweight classifier checks for correction patterns (“No, I said…”, “Actually…”, “That’s not right”). If detected:

Identify which injected memory led to the error
Decay its confidence
Extract the corrected fact as a new memory with high confidence
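
A first pass at the correction detector could be a simple pattern match on the opening of the user's message. The patterns below cover only the three example phrases and are an assumption about how such a classifier might start; attributing the error to a specific injected memory is a separate step not shown here.

```python
import re

# Matches messages that open with one of the example correction phrases.
CORRECTION = re.compile(
    r"^\s*(?:no,\s*i said\b|actually\b|that['’]?s not right\b)",
    re.IGNORECASE,
)


def is_correction(message: str) -> bool:
    return bool(CORRECTION.search(message))


print(is_correction("Actually, I moved to Portland"))  # True
print(is_correction("No, I said tree nuts"))           # True
print(is_correction("Thanks, that works"))             # False
```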

Channel C — Offline quality review (high quality, periodic):

Weekly batch process:

Sample 100 conversations where memories were injected
For each, evaluate: did the user correct the agent? Did engagement increase? Did they return?
Compute a personalization quality score per memory category
Adjust the re-ranking weights (α, β, γ) based on which memory types correlated with positive outcomes
Identify memory categories that consistently lead to corrections → tighten extraction prompts for those categories

Memory lifecycle with feedback

stateDiagram-v2
    [*] --> Extracted: LLM extracts fact
    Extracted --> Active: confidence > 0.5
    Extracted --> Probationary: confidence ≤ 0.5

    Active --> Boosted: thumbs-up or implicit positive signal
    Active --> Decayed: thumbs-down or user correction
    Active --> Superseded: newer contradicting fact

    Boosted --> Active: cap at 1.0
    Decayed --> Probationary: confidence drops below 0.5
    Decayed --> Active: subsequent positive signal restores

    Probationary --> Active: confirmed by user or retrieval success
    Probationary --> Pruned: 60 days unused + low confidence

    Superseded --> Archived: 30 days
    Pruned --> [*]
    Archived --> [*]

What “improving over time” looks like concretely

Metric	How it improves	Mechanism
Retrieval precision	The right memories surface more often	Confidence scores trained by feedback push good memories up, bad ones down
Conflict resolution accuracy	Fewer stale facts shown to users	Supersession chain ensures latest statement wins; corrections accelerate this
Extraction quality	Fewer irrelevant facts stored	Offline review identifies low-value extraction categories → refine prompts
Re-ranking accuracy	The weighting formula gets better	Offline batch analysis adjusts α/β/γ based on positive feedback correlation
User trust	Users give more explicit signals over time	Visible memory controls (“Here’s what I remember — edit anytime”) encourage engagement

Summary of key design decisions

Decision	Choice	Rationale
Store raw transcripts?	No — extract structured facts	Transcripts grow unboundedly, are noisy, and expensive to search
One store or two?	Two (profile KV + semantic vector)	Profile is always needed (no search); semantic facts need similarity search
Conflict resolution	Last-explicit-statement-wins with version chain	Simple, auditable, matches user mental model
When to retrieve	Classifier gate — skip for general queries	Saves latency and avoids irrelevant memory injection
When to extract	Debounced — every few turns or on inactivity	Balances freshness with cost
History management	Progressive summarization + TTL + confidence pruning	Three complementary mechanisms prevent unbounded growth
Feedback model	Explicit + implicit + offline batch	Each channel has different signal quality and volume; combining all three gives robust learning

AI agent memory context management