Agent Memory: How Agents Remember Across Conversations

An agent without memory is like a person with amnesia. Every conversation starts from zero: it asks the same questions, forgets preferences, and repeats itself.

But memory is tricky. You can't just stuff everything into the prompt—that's expensive and hits context limits fast. Different types of information need different storage strategies.

Building LLM systems at Amazon, I found that successful agents use four separate memory systems, each optimized for a different task. This article explains when and how to use each.

The Four Types of Agent Memory

Think of them as a filing system:

  1. In-context (short-term): What the agent is thinking about right now
  2. Vector store (semantic): What the agent might need to remember from the past
  3. Key-value store (episodic): What the agent should always know about the user
  4. Structured database (procedural): What the agent has learned to do

Type 1: In-Context Memory (Short-Term)

This is the conversation history in the prompt. Everything that's happening right now.

System prompt: "You are a flight booking assistant..."
User: "Book me a flight to Denver"
Assistant: "I'll search for flights..."
Tool result: [Flight options...]
User: "Actually, check the weather first"
Assistant: "Good idea, let me check weather..."

The agent can see all of this. It's in-context.

When to use: Current conversation, active problem-solving, immediate context

Pros:

  • Fast (no lookups)
  • Clear (agent sees everything)
  • Works immediately

Cons:

  • Limited size (context window is finite, expensive)
  • Grows fast (turns accumulate tokens)
  • Not searchable (can't find "that thing we discussed")

Example:

class InContextMemory:
    def __init__(self):
        self.messages = []

    def add_message(self, role, content):
        """Add to conversation history"""
        self.messages.append({"role": role, "content": content})

    def get_context(self):
        """Return all messages as context"""
        return self.messages

    def trim_old_messages(self, max_messages=10):
        """Keep only recent messages"""
        self.messages = self.messages[-max_messages:]

Real example: User asks "Should I cancel my flight?" Agent sees the 5 most recent messages and knows what flight they're talking about.
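The trim_old_messages method above counts messages, but in practice you usually trim by token budget instead, since a single long message can cost more than ten short ones. A minimal sketch, assuming a crude 4-characters-per-token estimate in place of a real tokenizer:

```python
def estimate_tokens(text):
    """Crude token estimate; swap in a real tokenizer in production."""
    return max(1, len(text) // 4)

def trim_to_budget(messages, max_tokens=2000):
    """Keep the most recent messages that fit within max_tokens."""
    kept, total = [], 0
    for msg in reversed(messages):  # walk newest-first
        cost = estimate_tokens(msg["content"])
        if total + cost > max_tokens:
            break
        kept.append(msg)
        total += cost
    return list(reversed(kept))  # restore chronological order
```

Walking newest-first guarantees the agent never loses the most recent turns, which are usually the ones it needs.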

Type 2: Vector Store Memory (Semantic)

You have 100 past conversations with a user. You can't fit them all in context. Instead, you embed them and retrieve relevant ones.

How it works:

  1. Convert past conversations to embeddings (dense vectors)
  2. When agent needs context, search embeddings
  3. Retrieve the most similar past conversations
  4. Add top results to in-context memory

User asks: "Do I usually prefer morning or evening flights?"

Vector search:
→ Embed: "Do I usually prefer morning or evening flights?"
→ Find most similar past statements in database
→ Results:
   [past_convo_1: "I love morning flights, gives me time to settle"]
   [past_convo_2: "Evening flights are cheaper"]
   [past_convo_3: "Morning works better with my schedule"]

→ Add top 2-3 to context for agent

Agent now has relevant memory without filling context with everything.

When to use: Long-term patterns, historical context, learning across sessions

Pros:

  • Unlimited history (can store thousands of conversations)
  • Semantic (finds relevant context by meaning, not keywords)
  • Reduces context bloat (only relevant memories added)

Cons:

  • Retrieval quality depends on embeddings (bad embeddings = bad memories)
  • Latency (database lookup adds 100-500ms)
  • Can conflate similar but different contexts

Example:

from datetime import datetime

# Assumes `embed(text)` returns a dense vector from your embedding model.
class VectorMemory:
    def __init__(self, vector_db):
        self.vector_db = vector_db

    def store_conversation(self, conversation_text, embedding):
        """Store conversation with its embedding"""
        self.vector_db.add({
            "text": conversation_text,
            "embedding": embedding,
            "timestamp": datetime.now()
        })

    def retrieve_relevant(self, current_query, top_k=3):
        """Find most relevant past conversations"""
        query_embedding = embed(current_query)

        results = self.vector_db.search(
            query_embedding,
            top_k=top_k
        )

        return [r["text"] for r in results]

    def add_to_context(self, current_query, agent_context):
        """Augment agent context with relevant memories"""
        memories = self.retrieve_relevant(current_query)

        # Format for agent
        memory_text = "Relevant past conversations:\n" + "\n".join(memories)

        agent_context.append({
            "role": "system",
            "content": memory_text
        })

        return agent_context

Real example: User has had 50 support conversations with the agent. They ask a question similar to one from 3 months ago. Vector search finds that old conversation, and the agent uses it to give consistent advice.

The retrieval problem: If your query embedding doesn't match old embeddings, you miss relevant information. This is why embedding quality matters.
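Under the hood, "most similar" usually means cosine similarity between embedding vectors. A toy sketch with hand-made 3-dimensional vectors (real embedding models produce hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" -- invented for illustration.
query = [0.9, 0.1, 0.0]
memories = {
    "morning flights": [0.8, 0.2, 0.1],
    "refund policy":   [0.1, 0.1, 0.9],
}
ranked = sorted(
    memories,
    key=lambda k: cosine_similarity(query, memories[k]),
    reverse=True,
)
```

If the embedding model places two unrelated texts near each other, this ranking is confidently wrong, which is exactly the conflation risk listed in the cons above.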

Type 3: Key-Value Store Memory (Episodic)

Some facts should be instantly accessible and don't change often. User preferences, account info, settings.

kv_store = {
    "user_preferences": {
        "preferred_airline": "United",
        "preferred_seat": "window",
        "budget_limit_usd": 500
    },
    "user_account": {
        "home_city": "San Francisco",
        "loyalty_number": "UA123456",
        "payment_on_file": "Visa ending in 4211"
    }
}

Before the agent starts, load this into context:

System prompt: "You are a flight booking assistant.

User facts:
- Prefers United Airlines
- Prefers window seats
- Home city: San Francisco
- Budget limit: $500

Use these facts to provide personalized service."
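One way to build that fact block is to flatten the kv_store dict above into prompt lines. A sketch; the label formatting (underscores to spaces, capitalized) is my own assumption:

```python
def render_user_facts(kv_store):
    """Flatten the nested kv_store dict into a 'User facts' prompt block."""
    lines = ["User facts:"]
    for section in kv_store.values():
        for key, value in section.items():
            # e.g. "preferred_airline" -> "Preferred airline"
            label = key.replace("_", " ").capitalize()
            lines.append(f"- {label}: {value}")
    return "\n".join(lines)
```

The rendered string gets appended to the system prompt before the conversation starts.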

When to use: Static facts, user preferences, settings, identity

Pros:

  • Instant lookup (O(1) access)
  • Structured (type-safe)
  • Easy to update

Cons:

  • Requires manual schema (you decide what to store)
  • Static (can't capture nuance or changes)
  • Update logic can get complex

Example:

class KeyValueMemory:
    def __init__(self, kv_store):
        self.kv_store = kv_store

    def get_user_facts(self, user_id):
        """Retrieve all facts about a user"""
        return self.kv_store.get(f"user:{user_id}", {})

    def set_preference(self, user_id, key, value):
        """Update a user preference"""
        user_key = f"user:{user_id}"
        facts = self.kv_store.get(user_key, {})
        facts[key] = value
        self.kv_store.set(user_key, facts)

    def add_to_context(self, user_id, agent_context):
        """Add user facts to agent prompt"""
        facts = self.get_user_facts(user_id)

        if facts:
            fact_text = "User facts:\n" + "\n".join([
                f"- {k}: {v}" for k, v in facts.items()
            ])

            agent_context.append({
                "role": "system",
                "content": fact_text
            })

        return agent_context

    def learn_preference(self, user_id, key, value):
        """Update based on observed behavior"""
        # Called when agent notices a pattern
        self.set_preference(user_id, key, value)

Real example: User's preferred airline is "United", stored in KV. Agent loads this before every conversation and uses it when searching flights.

Type 4: Structured Database Memory (Procedural)

Some agents learn how to do things. These are workflows, learned rules, or patterns.

For example:

  • "When searching flights, always check weather first"
  • "If the user requests a refund, check their frequent-flyer status first (refunds can conflict with reward redemptions)"
  • "Before booking, confirm the date twice"

procedural_db = [
    {
        "name": "book_flight_workflow",
        "steps": [
            "1. Ask for destination",
            "2. Ask for date",
            "3. Ask for budget",
            "4. Search flights",
            "5. Check weather",
            "6. Present options",
            "7. Confirm selection",
            "8. Book"
        ],
        "learned_from": "observed_successful_bookings"
    }
]

When to use: Learned best practices, workflows, conditional logic

Pros:

  • Captures learned patterns
  • Explicit (easy to audit and modify)
  • Reusable across conversations

Cons:

  • Requires detection and extraction (what did the agent learn?)
  • Can be wrong (learned from unlucky patterns)
  • Maintenance burden

Example:

class ProceduralMemory:
    def __init__(self):
        self.workflows = {}

    def register_workflow(self, name, steps):
        """Store a learned workflow"""
        self.workflows[name] = {
            "steps": steps,
            "success_rate": 0.0,
            "num_uses": 0
        }

    def get_workflow(self, name):
        """Retrieve a workflow"""
        return self.workflows.get(name)

    def update_success_rate(self, name, was_successful):
        """Update based on outcome (incremental running average)"""
        if name not in self.workflows:
            return

        w = self.workflows[name]
        w["num_uses"] += 1

        # Fold the new outcome (1 or 0) into the running average.
        # Failures must count too, or the rate drifts upward.
        outcome = 1 if was_successful else 0
        w["success_rate"] = (
            (w["success_rate"] * (w["num_uses"] - 1) + outcome) /
            w["num_uses"]
        )

    def add_to_context(self, relevant_workflows, agent_context):
        """Add relevant workflows to prompt"""
        workflow_text = "Relevant workflows:\n"

        for wf_name in relevant_workflows:
            wf = self.get_workflow(wf_name)
            if wf:
                workflow_text += f"\n{wf_name}:\n"
                for step in wf["steps"]:
                    workflow_text += f"  {step}\n"

        agent_context.append({
            "role": "system",
            "content": workflow_text
        })

        return agent_context

Real example: Agent learns that searches succeed more when it checks weather first. Procedural memory stores this workflow and suggests it next time.
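The success-rate bookkeeping in update_success_rate is just an incremental running average. A standalone sketch of the same math, with an invented sequence of outcomes:

```python
def running_average(old_avg, n, new_value):
    """Fold one new observation into an average that now covers n values."""
    return (old_avg * (n - 1) + new_value) / n

# Outcomes of five uses of a workflow: 1 = success, 0 = failure.
rate, uses = 0.0, 0
for outcome in [1, 1, 0, 1, 1]:
    uses += 1
    rate = running_average(rate, uses, outcome)
# rate is now 0.8 (4 successes out of 5)
```

This avoids storing every outcome: two numbers (rate and count) are enough to keep the average exact.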

Putting It Together: A Complete Memory System

Here's how all four types work together in practice:

User: "Book me a flight to Denver like last time"

STEP 1: Retrieve memories
├─ In-context: Load last 5 messages (might be first conversation)
├─ KV store: Load preferences (airline=United, seat=window)
├─ Vector store: Search "Denver flights" → Find past Denver trip
│   Result: "Booked United 7:30 AM flight for $280"
└─ Procedural: Load "book_flight_workflow"

STEP 2: Build agent context
System prompt + facts + workflows + recent history + retrieved memories

STEP 3: Agent processes with full context
"Based on past behavior, user probably wants:
 - United airline (from KV store)
 - Morning flight (from vector memory)
 - Under $300 (from past conversation)

 Workflow suggests checking weather. Let me do that."

STEP 4: Agent takes actions
→ Check weather in Denver
→ Search flights for United departures
→ Present options matching patterns

STEP 5: Update memories
├─ Add new conversation to in-context
├─ After conversation ends, store in vector memory
├─ If new preference discovered, update KV
└─ If new workflow discovered, add to procedural

The Forgetting Problem

Here's what nobody talks about: agents need to forget.

If you keep adding memories forever, two problems happen:

  1. Retrieval gets noisy: More memories = more results that aren't relevant
  2. Storage gets expensive: Thousands of conversations = expensive database

Solution: Aging and Cleanup

Implement memory decay:

from datetime import datetime

# Assumes `embed(text)` and `similarity(vec_a, vec_b)` come from your
# embedding stack. Re-embedding every memory on each query is fine for a
# sketch, but cache the embeddings in production.
class MemoryWithExpiry:
    def __init__(self):
        self.memories = []

    def add_memory(self, content, ttl_days=30):
        """Add memory with expiration"""
        self.memories.append({
            "content": content,
            "created": datetime.now(),
            "ttl_days": ttl_days
        })

    def cleanup_expired(self):
        """Remove expired memories"""
        now = datetime.now()
        self.memories = [
            m for m in self.memories
            if (now - m["created"]).days < m["ttl_days"]
        ]

    def retrieve_relevant(self, query, top_k=3):
        """Retrieve only non-expired memories"""
        self.cleanup_expired()

        query_embedding = embed(query)
        scores = [
            similarity(query_embedding, embed(m["content"]))
            for m in self.memories
        ]

        top_indices = sorted(
            range(len(scores)),
            key=lambda i: scores[i],
            reverse=True
        )[:top_k]

        return [self.memories[i]["content"] for i in top_indices]

Real example: Vacation preferences have a short TTL (relevant this summer, not next year). Account info has a long TTL (always relevant).
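Per-fact TTLs make this concrete. A self-contained sketch of the expiry check (dates and facts invented for illustration):

```python
from datetime import datetime

def is_expired(created, ttl_days, now=None):
    """True once a memory has outlived its TTL."""
    now = now or datetime.now()
    return (now - created).days >= ttl_days

now = datetime(2024, 8, 1)
memories = [
    {"content": "Prefers beach vacations",  "created": datetime(2024, 6, 1), "ttl_days": 30},
    {"content": "Loyalty number UA123456",  "created": datetime(2024, 1, 1), "ttl_days": 3650},
]
# Only the long-TTL account fact survives the cleanup pass.
alive = [m for m in memories if not is_expired(m["created"], m["ttl_days"], now)]
```

Choosing the TTL is a product decision: seasonal preferences measured in weeks, identity facts in years.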

Practical Patterns

Pattern 1: Load-Augment-Process

def agent_step(user_id, user_input):
    # Load memory; KV facts and vector history are fetched inside
    # their add_to_context calls below
    recent_messages = in_context_memory.get_context()
    relevant_workflows = procedural_memory.find_matching_workflows(user_input)

    # Build context
    agent_context = []
    agent_context = kv_memory.add_to_context(user_id, agent_context)
    agent_context = vector_memory.add_to_context(user_input, agent_context)
    agent_context = procedural_memory.add_to_context(relevant_workflows, agent_context)
    agent_context += recent_messages

    # Process
    response = agent.generate(agent_context, user_input)

    # Store
    in_context_memory.add_message("user", user_input)
    in_context_memory.add_message("assistant", response)

    return response

Pattern 2: Update on Success

Only update procedural memory when something works:

def book_flight(user_id, flight_id, current_workflow_steps):
    """Book a flight and update memories if successful"""
    try:
        result = api.book_flight(flight_id)

        # Success! Update procedural memory with what worked
        if result["status"] == "confirmed":
            procedural_memory.register_workflow(
                "successful_booking",
                steps=current_workflow_steps
            )

            # Also update KV with learned preferences
            kv_memory.set_preference(
                user_id,
                "last_successful_booking",
                result
            )

        return result

    except Exception as e:
        # Failure—don't learn bad patterns
        log.error(f"Booking failed: {e}")
        return None

Pattern 3: Periodic Consolidation

Summarize old memories to save tokens:

def consolidate_memories():
    """Periodically summarize old conversations"""
    old_conversations = vector_memory.get_memories(older_than_days=30)

    if len(old_conversations) > 10:
        # Summarize the batch we're about to delete
        batch = old_conversations[:10]
        summary = llm.summarize(batch)

        # Replace 10 old memories with 1 summary
        for convo in batch:
            vector_memory.delete(convo)

        vector_memory.store_conversation(
            f"Summary of conversations: {summary}",
            embedding=embed(summary)
        )

Key Takeaways

Agents need four types of memory:

  1. In-context (short-term): Current conversation
  2. Vector store (semantic): Relevant historical context, retrieved on-demand
  3. Key-value (episodic): Static facts about the user
  4. Structured DB (procedural): Learned workflows and patterns

Use all four together:

  • Load facts from KV before each conversation
  • Retrieve relevant history from vector store
  • Keep recent messages in in-context
  • Apply learned workflows from procedural DB

Don't forget to forget:

  • Implement memory decay
  • Consolidate old memories
  • Monitor retrieval quality

Done right, agents remember. Done wrong, they repeat themselves endlessly.