Agent Memory: How Agents Remember Across Conversations
An agent without memory is a person with amnesia. Every conversation starts from zero. It asks the same questions, forgets preferences, repeats itself.
But memory is tricky. You can't just stuff everything into the prompt—that's expensive and hits context limits fast. Different types of information need different storage strategies.
In my experience building LLM systems at Amazon, we discovered that successful agents use four separate memory systems, each optimized for a different task. This article explains when and how to use each.
The Four Types of Agent Memory
Think of them as a filing system:
- In-context (short-term): What the agent is thinking about right now
- Vector store (semantic): What the agent might need to remember from the past
- Key-value store (episodic): What the agent should always know about the user
- Structured database (procedural): What the agent has learned to do
Type 1: In-Context Memory (Short-Term)
This is the conversation history in the prompt. Everything that's happening right now.
System prompt: "You are a flight booking assistant..."
User: "Book me a flight to Denver"
Assistant: "I'll search for flights..."
Tool result: [Flight options...]
User: "Actually, check the weather first"
Assistant: "Good idea, let me check weather..."
The agent can see all of this. It's in-context.
When to use: Current conversation, active problem-solving, immediate context
Pros:
- Fast (no lookups)
- Clear (agent sees everything)
- Works immediately
Cons:
- Limited size (context window is finite, expensive)
- Grows fast (turns accumulate tokens)
- Not searchable (can't find "that thing we discussed")
Example:
class InContextMemory:
def __init__(self):
self.messages = []
def add_message(self, role, content):
"""Add to conversation history"""
self.messages.append({"role": role, "content": content})
def get_context(self):
"""Return all messages as context"""
return self.messages
def trim_old_messages(self, max_messages=10):
"""Keep only recent messages"""
self.messages = self.messages[-max_messages:]
Real example: User asks "Should I cancel my flight?" Agent sees the 5 most recent messages and knows what flight they're talking about.
Type 2: Vector Store Memory (Semantic)
You have 100 past conversations with a user. You can't fit them all in context. Instead, you embed them and retrieve relevant ones.
How it works:
- Convert past conversations to embeddings (dense vectors)
- When agent needs context, search embeddings
- Retrieve the most similar past conversations
- Add top results to in-context memory
User asks: "Do I usually prefer morning or evening flights?"
Vector search:
→ Embed: "Do I usually prefer morning or evening flights?"
→ Find most similar past statements in database
→ Results:
[past_convo_1: "I love morning flights, gives me time to settle"]
[past_convo_2: "Evening flights are cheaper"]
[past_convo_3: "Morning works better with my schedule"]
→ Add top 2-3 to context for agent
Agent now has relevant memory without filling context with everything.
When to use: Long-term patterns, historical context, learning across sessions
Pros:
- Unlimited history (can store thousands of conversations)
- Semantic (finds relevant context by meaning, not keywords)
- Reduces context bloat (only relevant memories added)
Cons:
- Retrieval quality depends on embeddings (bad embeddings = bad memories)
- Latency (database lookup adds 100-500ms)
- Can conflate similar but different contexts
Example:
class VectorMemory:
def __init__(self, vector_db):
self.vector_db = vector_db
def store_conversation(self, conversation_text, embedding):
"""Store conversation with its embedding"""
self.vector_db.add({
"text": conversation_text,
"embedding": embedding,
"timestamp": datetime.now()
})
def retrieve_relevant(self, current_query, top_k=3):
"""Find most relevant past conversations"""
query_embedding = embed(current_query)
results = self.vector_db.search(
query_embedding,
top_k=top_k
)
return [r["text"] for r in results]
def add_to_context(self, current_query, agent_context):
"""Augment agent context with relevant memories"""
memories = self.retrieve_relevant(current_query)
# Format for agent
memory_text = "Relevant past conversations:\n" + "\n".join(memories)
agent_context.append({
"role": "system",
"content": memory_text
})
return agent_context
Real example: User has had 50 support conversations with the agent. They ask a question similar to one from 3 months ago. Vector search finds that old conversation, and the agent uses it to give consistent advice.
The retrieval problem: If your query embedding doesn't match old embeddings, you miss relevant information. This is why embedding quality matters.
Type 3: Key-Value Store Memory (Episodic)
Some facts should be instantly accessible and don't change often. User preferences, account info, settings.
kv_store = {
"user_preferences": {
"preferred_airline": "United",
"preferred_seat": "window",
"budget_limit_usd": 500
},
"user_account": {
"home_city": "San Francisco",
"loyalty_number": "UA123456",
"payment_on_file": "Visa ending in 4211"
}
}
Before the agent starts, load this into context:
System prompt: "You are a flight booking assistant.
User facts:
- Prefers United airlines
- Prefers window seats
- Home city: San Francisco
- Budget limit: $500
Use these facts to provide personalized service."
When to use: Static facts, user preferences, settings, identity
Pros:
- Instant lookup (O(1) access)
- Structured (type-safe)
- Easy to update
Cons:
- Requires manual schema (you decide what to store)
- Static (can't capture nuance or changes)
- Update logic can get complex
Example:
class KeyValueMemory:
def __init__(self, kv_store):
self.kv_store = kv_store
def get_user_facts(self, user_id):
"""Retrieve all facts about a user"""
return self.kv_store.get(f"user:{user_id}", {})
def set_preference(self, user_id, key, value):
"""Update a user preference"""
user_key = f"user:{user_id}"
facts = self.kv_store.get(user_key, {})
facts[key] = value
self.kv_store.set(user_key, facts)
def add_to_context(self, user_id, agent_context):
"""Add user facts to agent prompt"""
facts = self.get_user_facts(user_id)
if facts:
fact_text = "User facts:\n" + "\n".join([
f"- {k}: {v}" for k, v in facts.items()
])
agent_context.append({
"role": "system",
"content": fact_text
})
return agent_context
def learn_preference(self, user_id, key, value):
"""Update based on observed behavior"""
# Called when agent notices a pattern
self.set_preference(user_id, key, value)
Real example: User's preferred airline is "United", stored in KV. Agent loads this before every conversation and uses it when searching flights.
Type 4: Structured Database Memory (Procedural)
Some agents learn how to do things. These are workflows, learned rules, or patterns.
For example:
- "When searching flights, always check weather first"
- "If user requests a refund, verify they're not a frequent flyer (conflicts with rewards)"
- "Before booking, confirm the date twice"
procedural_db = [
{
"name": "book_flight_workflow",
"steps": [
"1. Ask for destination",
"2. Ask for date",
"3. Ask for budget",
"4. Search flights",
"5. Check weather",
"6. Present options",
"7. Confirm selection",
"8. Book"
],
"learned_from": "observed_successful_bookings"
}
]
When to use: Learned best practices, workflows, conditional logic
Pros:
- Captures learned patterns
- Explicit (easy to audit and modify)
- Reusable across conversations
Cons:
- Requires detection and extraction (what did the agent learn?)
- Can be wrong (learned from unlucky patterns)
- Maintenance burden
Example:
class ProceduralMemory:
def __init__(self):
self.workflows = {}
def register_workflow(self, name, steps):
"""Store a learned workflow"""
self.workflows[name] = {
"steps": steps,
"success_rate": 0.0,
"num_uses": 0
}
def get_workflow(self, name):
"""Retrieve a workflow"""
return self.workflows.get(name)
def update_success_rate(self, name, was_successful):
"""Update based on outcome"""
if name not in self.workflows:
return
w = self.workflows[name]
w["num_uses"] += 1
if was_successful:
w["success_rate"] = (
(w["success_rate"] * (w["num_uses"] - 1) + 1) /
w["num_uses"]
)
def add_to_context(self, relevant_workflows, agent_context):
"""Add relevant workflows to prompt"""
workflow_text = "Relevant workflows:\n"
for wf_name in relevant_workflows:
wf = self.get_workflow(wf_name)
if wf:
workflow_text += f"\n{wf_name}:\n"
for step in wf["steps"]:
workflow_text += f" {step}\n"
agent_context.append({
"role": "system",
"content": workflow_text
})
return agent_context
Real example: Agent learns that searches succeed more when it checks weather first. Procedural memory stores this workflow and suggests it next time.
Putting It Together: A Complete Memory System
Here's how all four types work together in practice:
User: "Book me a flight to Denver like last time"
STEP 1: Retrieve memories
├─ In-context: Load last 5 messages (might be first conversation)
├─ KV store: Load preferences (airline=United, seat=window)
├─ Vector store: Search "Denver flights" → Find past Denver trip
│ Result: "Booked United 7:30 AM flight for $280"
└─ Procedural: Load "book_flight_workflow"
STEP 2: Build agent context
System prompt + facts + workflows + recent history + retrieved memories
STEP 3: Agent processes with full context
"Based on past behavior, user probably wants:
- United airline (from KV store)
- Morning flight (from vector memory)
- Under $300 (from past conversation)
Workflow suggests checking weather. Let me do that."
STEP 4: Agent takes actions
→ Check weather in Denver
→ Search flights for United departures
→ Present options matching patterns
STEP 5: Update memories
├─ Add new conversation to in-context
├─ After conversation ends, store in vector memory
├─ If new preference discovered, update KV
└─ If new workflow discovered, add to procedural
The Forgetting Problem
Here's what nobody talks about: agents need to forget.
If you keep adding memories forever, two problems happen:
- Retrieval gets noisy: More memories = more results that aren't relevant
- Storage gets expensive: Thousands of conversations = expensive database
Solution: Aging and Cleanup
Implement memory decay:
class MemoryWithExpiry:
def __init__(self):
self.memories = []
def add_memory(self, content, ttl_days=30):
"""Add memory with expiration"""
self.memories.append({
"content": content,
"created": datetime.now(),
"ttl_days": ttl_days
})
def cleanup_expired(self):
"""Remove expired memories"""
now = datetime.now()
self.memories = [
m for m in self.memories
if (now - m["created"]).days < m["ttl_days"]
]
def retrieve_relevant(self, query, top_k=3):
"""Retrieve only non-expired memories"""
self.cleanup_expired()
query_embedding = embed(query)
scores = [
similarity(query_embedding, embed(m["content"]))
for m in self.memories
]
top_indices = sorted(
range(len(scores)),
key=lambda i: scores[i],
reverse=True
)[:top_k]
return [self.memories[i]["content"] for i in top_indices]
Real example: Vacation preferences have a short TTL (relevant this summer, not next year). Account info has a long TTL (always relevant).
Practical Patterns
Pattern 1: Load-Augment-Process
def agent_step(user_id, user_input):
# Load all memory types
user_facts = kv_memory.get_user_facts(user_id)
recent_messages = in_context_memory.get_context()
relevant_history = vector_memory.retrieve_relevant(user_input)
relevant_workflows = procedural_memory.find_matching_workflows(user_input)
# Build context
agent_context = []
agent_context = kv_memory.add_to_context(user_id, agent_context)
agent_context = vector_memory.add_to_context(user_input, agent_context)
agent_context = procedural_memory.add_to_context(relevant_workflows, agent_context)
agent_context += recent_messages
# Process
response = agent.generate(agent_context, user_input)
# Store
in_context_memory.add_message("user", user_input)
in_context_memory.add_message("assistant", response)
return response
Pattern 2: Update on Success
Only update procedural memory when something works:
def book_flight(flight_id):
"""Book a flight and update memories if successful"""
try:
result = api.book_flight(flight_id)
# Success! Update procedural memory with what worked
if result["status"] == "confirmed":
procedural_memory.register_workflow(
"successful_booking",
steps=current_workflow_steps
)
# Also update KV with learned preferences
kv_memory.set_preference(
user_id,
"last_successful_booking",
result
)
return result
except Exception as e:
# Failure—don't learn bad patterns
log.error(f"Booking failed: {e}")
return None
Pattern 3: Periodic Consolidation
Summarize old memories to save tokens:
def consolidate_memories():
"""Periodically summarize old conversations"""
old_conversations = vector_memory.get_memories(older_than_days=30)
if len(old_conversations) > 10:
# Summarize
summary = llm.summarize(old_conversations)
# Replace 10 old memories with 1 summary
for convo in old_conversations[:10]:
vector_memory.delete(convo)
vector_memory.store_conversation(
f"Summary of conversations: {summary}",
embedding=embed(summary)
)
Key Takeaways
Agents need four types of memory:
- In-context (short-term): Current conversation
- Vector store (semantic): Relevant historical context, retrieved on-demand
- Key-value (episodic): Static facts about the user
- Structured DB (procedural): Learned workflows and patterns
Use all four together:
- Load facts from KV before each conversation
- Retrieve relevant history from vector store
- Keep recent messages in in-context
- Apply learned workflows from procedural DB
Don't forget to forget:
- Implement memory decay
- Consolidate old memories
- Monitor retrieval quality
Done right, agents remember. Done wrong, they repeat themselves endlessly.