How AI Agents Actually Work

The first time I watched an agent work end-to-end, I realized most explanations skip the boring, crucial details. Everyone talks about the magic of reasoning, but no one shows you the scaffolding that makes it happen. This article is the scaffolding.

An agent is fundamentally different from a chatbot. A chatbot answers your question. An agent decides what to do to answer your question. That decision-making loop—perceive, think, act, observe—is where everything lives.

The Agent Loop: Perceive → Think → Act → Observe

Every agent, from the simplest to the most complex, runs a loop. Here's the pseudocode:

loop:
  1. PERCEIVE: Get the current state
     - Read the user's message
     - Retrieve relevant context from memory
     - Observe tool outputs from the last step

  2. THINK: Reason about what to do
     - LLM processes state and generates thoughts
     - LLM decides: call a tool or respond directly?

  3. ACT: Execute the decision
     - If tool call: invoke the function
     - If response: return answer to user

  4. OBSERVE: Record what happened
     - Store tool output in context
     - Update state for next loop iteration

  Exit when: response given OR max steps exceeded
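
Here's the same loop as a minimal Python sketch. The llm_decide and run_tool helpers are hypothetical placeholders for your model call and your tool dispatcher:

# Minimal sketch of the loop above. `llm_decide` and `run_tool` are
# hypothetical placeholders for your model call and your tool dispatcher.
def agent_loop(user_message, llm_decide, run_tool, max_steps=10):
    context = [{"role": "user", "content": user_message}]      # PERCEIVE: initial state

    for _ in range(max_steps):
        decision = llm_decide(context)                          # THINK: tool call or answer?

        if decision["type"] == "final_answer":                  # ACT: respond directly
            return decision["text"]

        output = run_tool(decision["tool"], decision["input"])  # ACT: invoke the tool
        context.append({                                        # OBSERVE: record the result
            "role": "tool",
            "content": f"{decision['tool']} returned: {output}",
        })

    return "Stopped: maximum step limit exceeded."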

Here's how the loop looks visually — every agent, regardless of complexity, runs this cycle:

flowchart TD
    U([User / Environment]) -->|input| P[PERCEIVE]
    P --> T[THINK]
    T -->|tool call| A[ACT]
    T -->|final answer| R([Respond to User])
    A --> O[OBSERVE]
    O --> P
    style U fill:#f5f5f4,stroke:#e7e5e4
    style R fill:#f5f5f4,stroke:#e7e5e4
    style P fill:#fafaf9,stroke:#a8a29e
    style T fill:#fafaf9,stroke:#a8a29e
    style A fill:#fafaf9,stroke:#a8a29e
    style O fill:#fafaf9,stroke:#a8a29e

This looks abstract. Let me make it concrete with a real example.

Concrete Example: "Book me a flight to Denver"

Step 1: PERCEIVE

User input: "I need a flight to Denver tomorrow, budget $300"
Memory context: [Previous bookings show user prefers morning departures]
Tool outputs: [None yet, first iteration]

Step 2: THINK

The LLM sees that the goal requires searching flights. It doesn't have flight data, so it needs to call a tool.

LLM reasoning (in context window):

I need to: find flights to Denver tomorrow under $300.
Available tools: search_flights, check_weather, book_flight
The search_flights tool takes: destination, date, budget, preferences
I have all that. Let me call search_flights.

Step 3: ACT

Tool call: search_flights(
  destination="Denver",
  date="2026-04-05",
  budget_usd=300,
  time_preference="morning"
)

Step 4: OBSERVE

Tool returns:

[
  {flight_id: "UA123", departure: "7:30 AM", price: "$280"},
  {flight_id: "DL456", departure: "6:45 AM", price: "$295"}
]

Next Loop (Iteration 2):

  • PERCEIVE: Now the context includes the flight results above
  • THINK: Agent decides both options fit the budget and match the morning preference, and recommends UA123, the cheaper of the two
  • ACT: Agent responds directly to user (no tool call)
  • OBSERVE: Conversation ends, or user asks follow-up

This loop repeats until the agent outputs a response (ACT) or hits a maximum step limit.

Tool Use: How LLMs Actually Call Functions

This is where most tutorials handwave. Here's what actually happens.

An LLM doesn't "call" functions like a normal program. Instead, it generates text that describes the function call. The agent runtime parses that text and executes the actual function.

The Actual Mechanism

The LLM never directly touches your code. Instead:

  1. Prompt includes tool definitions: The system prompt lists all available tools in a structured format (usually JSON Schema — see the sketch after this list).
  2. LLM generates tool-use syntax: Depending on your API (OpenAI, Anthropic, etc.), the LLM generates something like:
Based on the flights shown, I should book the cheapest option.

<tool_use>
{
  "name": "book_flight",
  "input": {
    "flight_id": "UA123",
    "passenger_name": "Alice Johnson"
  }
}
</tool_use>
  3. Runtime parses and executes: Your agent code detects the tool-use block, extracts the function name and parameters, calls the actual function, and gets a result.
  4. Result fed back to LLM: The tool result is added to the context, and the loop continues.
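
For reference, the tool definition fed into the prompt in step 1 might look roughly like this. Field names vary by provider, so treat it as a sketch rather than any specific API's exact schema:

{
  "name": "search_flights",
  "description": "Search for available flights matching the user's criteria.",
  "input_schema": {
    "type": "object",
    "properties": {
      "destination": {"type": "string", "description": "Destination city"},
      "date": {"type": "string", "description": "Departure date, YYYY-MM-DD"},
      "budget_usd": {"type": "number", "description": "Maximum price in USD"}
    },
    "required": ["destination", "date"]
  }
}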

Here's pseudocode for the tool-calling runtime:

def run_agent_step(user_input, memory_context, available_tools, steps_left=10):
    # Prepare the prompt: system prompt, prior context, then the new user message
    system_prompt = build_system_prompt(available_tools)
    messages = [{"role": "system", "content": system_prompt}] + memory_context
    if user_input:
        messages.append({"role": "user", "content": user_input})

    # Get LLM response (which may include tool calls)
    response = llm.generate(messages)

    # Parse tool calls from response
    tool_calls = parse_tool_calls(response.text)

    if tool_calls and steps_left > 0:
        # Execute the tools
        results = []
        for call in tool_calls:
            function = available_tools[call.name]
            result = function(**call.input)
            results.append({
                "tool": call.name,
                "input": call.input,
                "output": result
            })

        # Add the assistant turn and tool results to context, then loop again
        new_context = messages[1:] + [{
            "role": "assistant",
            "content": response.text
        }, {
            "role": "user",
            "content": format_tool_results(results)
        }]

        return run_agent_step(
            user_input="",  # empty: we're continuing the same turn
            memory_context=new_context,
            available_tools=available_tools,
            steps_left=steps_left - 1
        )
    else:
        # No tool calls (or step budget exhausted): return the agent's response
        return response.text

Key insight: The LLM never sees the actual function code. It only sees the tool description (name, parameters, purpose). It's generating text that describes what to do. Your runtime interprets that text.

Planning Strategies: ReAct and Chain-of-Thought

An agent without a planning strategy is like driving without checking a map. You might get somewhere, but probably not where you intended.

ReAct: Reason + Act

ReAct (Reasoning + Acting) is the dominant pattern in production agents. The idea: make the LLM's reasoning explicit before each tool call.

The LLM generates something like:

Thought: I need to find flights to Denver for tomorrow.
Action: search_flights
Action Input: {"destination": "Denver", "date": "2026-04-05", "budget_usd": 300}
Observation: [Flight results...]

Thought: The user prefers morning flights. UA123 at 7:30 AM is $280. I should recommend this.
Action: respond_to_user
Action Input: {"message": "I found a flight departing at 7:30 AM for $280..."}

Why this works: Explicit reasoning gives the LLM a hook to catch its own mistakes. If it thinks "I'll call search_flights" but then actually calls book_flight, something is obviously wrong.
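
In practice your runtime has to parse this format back out of raw model text. Here's a minimal sketch, assuming the model reliably emits the Thought / Action / Action Input labels shown above:

import json
import re

# Minimal ReAct parsing sketch: pull the Action / Action Input pair out of a
# single model generation. Assumes the labels shown above appear verbatim.
def parse_react_step(text):
    action = re.search(r"Action:\s*(\S+)", text)
    action_input = re.search(r"Action Input:\s*(\{.*\})", text, re.DOTALL)
    if not action or not action_input:
        return None  # no tool call found; treat the text as a direct response
    return {
        "tool": action.group(1),
        "input": json.loads(action_input.group(1)),
    }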

Chain-of-Thought (CoT)

Chain-of-Thought is simpler: the LLM writes out its reasoning step-by-step before deciding.

Let me think through this:
1. The user wants a flight tomorrow
2. I don't have access to flight databases directly
3. I should use the search_flights tool
4. It needs destination, date, and budget
5. I have all three pieces of information
6. Let me call the tool now...

CoT is useful when the task is complex but doesn't require repeated tool calls. ReAct is better when tools are involved.

The Four Types of Memory

Agents without memory are like people with amnesia—they repeat themselves and can't learn. But not all memory is the same.

1. In-Context (Short-Term) Memory

This is the conversation history stuffed into the prompt. Every message, tool output, and observation lives in the context window.

Pros:

  • Fast (no lookups needed)
  • Clear (the LLM sees everything)
  • Works immediately

Cons:

  • Limited by the context window (typically 100K–200K tokens, depending on the model)
  • Expensive (you pay per token, including all history)
  • Forces you to truncate old conversations

When I built systems at Amazon processing millions of requests, we learned: in-context memory alone doesn't scale. A 10-turn conversation might use 2K tokens, but a 100-turn conversation pushes 20K tokens, and you resend all of it on every single call.
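
One common mitigation is a sliding window over the history: keep the most recent turns within a token budget and drop (or summarize) the rest. A rough sketch, where count_tokens stands in for whatever tokenizer your model provider exposes:

# Sliding-window truncation sketch. `count_tokens` stands in for whatever
# tokenizer your model provider exposes.
def trim_history(messages, count_tokens, budget=8000):
    kept, used = [], 0
    for msg in reversed(messages):          # walk from newest to oldest
        cost = count_tokens(msg["content"])
        if used + cost > budget:
            break                           # older turns get dropped (or summarized)
        kept.append(msg)
        used += cost
    return list(reversed(kept))             # restore chronological order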

2. External Vector Store (Semantic Memory)

Imagine you have 1000 conversations with an agent. You can't fit them all in context. Instead, you embed them into a vector database and retrieve the most relevant ones.

User: "Do I prefer morning or evening flights?"

Retrieve from vector store:
→ similarity("prefer morning") → [past_convo_1, past_convo_2, ...]
→ Add top 3 to context

Pros:

  • Handles unlimited historical data
  • Semantic search (finds relevant context by meaning, not keyword)
  • Reduces context window bloat

Cons:

  • Retrieval quality matters (bad embeddings = bad memories)
  • Latency (extra database lookup)
  • Can mix up similar but different contexts
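
The retrieval step itself is small; most of the quality comes from the embedding model and the store. Here's a sketch using cosine similarity over pre-computed embeddings, where embed is a hypothetical text-to-vector function:

import numpy as np

# Semantic-memory retrieval sketch. `embed` is a hypothetical function that
# maps text to a vector; `memory` is a list of (text, vector) pairs.
def retrieve(query, memory, embed, top_k=3):
    q = embed(query)
    scored = []
    for text, vec in memory:
        score = np.dot(q, vec) / (np.linalg.norm(q) * np.linalg.norm(vec))
        scored.append((score, text))
    scored.sort(reverse=True)                 # highest similarity first
    return [text for _, text in scored[:top_k]]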

3. Key-Value Store (Episodic Memory)

For facts that don't change often, use a simple KV store. "User's preferred airline: United", "User's home city: San Francisco".

Before PERCEIVE step:
facts = kv_store.get("user_preferences")
→ {"airline": "United", "home_city": "San Francisco"}
→ Add to context as: "User prefers United flights and lives in San Francisco"

Pros:

  • Precise (structured facts)
  • Fast (direct lookup)
  • Easy to update

Cons:

  • Requires manual structure (you decide what to store)
  • Static (facts don't capture nuance)
  • Update logic can be tricky

4. Structured Database (Procedural Memory)

Some agents learn procedures. "When booking a flight, always check weather first." This lives in a database of learned rules or workflows.

In my experience, most production agents use a mix of all four:

  • In-context for the current turn
  • Vector store for relevant historical context
  • KV store for user facts
  • Procedural for learned patterns

The tradeoff: more memory systems = more complexity. But a single memory type will eventually fail.
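
Concretely, the PERCEIVE step might assemble context from all of these sources. A sketch, where kv_store, vector_store, and rules_db stand in for whatever backends you choose:

# PERCEIVE-step sketch: assemble context from every memory system.
# `kv_store`, `vector_store`, and `rules_db` are hypothetical backends.
def build_context(user_message, session_history, kv_store, vector_store, rules_db):
    facts = kv_store.get("user_preferences")               # episodic facts
    relevant = vector_store.search(user_message, top_k=3)  # semantic recall
    procedures = rules_db.lookup(task_hint=user_message)   # learned workflows

    preamble = (
        f"Known user facts: {facts}\n"
        f"Relevant past conversations: {relevant}\n"
        f"Learned procedures: {procedures}"
    )
    return [{"role": "system", "content": preamble}] + session_history + [
        {"role": "user", "content": user_message}
    ]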

Here's how these memory types fit into the agent loop:

flowchart LR
    subgraph Memory["Memory Systems"]
        IC["In-Context (current session)"]
        VS["Vector Store (semantic search)"]
        KV["Key-Value Store (user facts)"]
        DB["Structured DB (learned rules)"]
    end
    PERCEIVE -->|load into context| IC
    VS -->|top-k relevant| PERCEIVE
    KV -->|user preferences| PERCEIVE
    DB -->|procedures| PERCEIVE
    IC --> THINK
    THINK --> ACT
    ACT -->|store outcomes| VS
    ACT -->|update facts| KV
    style Memory fill:#fafaf9,stroke:#e7e5e4

From Theory to Practice

Let me tie this together with a more complete example. Here's a simplified booking agent:

SYSTEM PROMPT:
You are a flight booking assistant. You have access to:
- search_flights(destination, date, budget): searches flights
- check_weather(city, date): checks weather forecast
- book_flight(flight_id, passenger_name): books a flight
- respond(message): sends a message to user

Always use ReAct format: Thought → Action → Observation

USER INPUT:
"Book me a flight to Denver tomorrow. I hate rain."

ITERATION 1:
Thought: The user wants a flight and cares about weather.
        I should check weather first, then search flights.
Action: check_weather
Action Input: {"city": "Denver", "date": "2026-04-05"}

Tool output:
{"forecast": "sunny, 72°F"}

ITERATION 2:
Thought: Great weather in Denver. Now find flights.
Action: search_flights
Action Input: {
  "destination": "Denver",
  "date": "2026-04-05",
  "budget_usd": null
}

Tool output:
{"flights": [
  {"id": "UA123", "time": "7:30 AM", "price": "$280"},
  {"id": "DL456", "time": "2:15 PM", "price": "$250"}
]}

ITERATION 3:
Thought: Both flights are available. User said no rain (weather is good).
        I should ask which one they prefer before booking.
Action: respond
Action Input: {
  "message": "I found two flights to Denver tomorrow.
             Weather is sunny and 72°F (great!).
             Option 1: United at 7:30 AM ($280)
             Option 2: Delta at 2:15 PM ($250)
             Which would you prefer?"
}

Agent responds and waits for user input.

Notice: The agent reasoned about the problem (thought) before acting. It even checked weather proactively because the user mentioned rain. This is what separates agents from chatbots.

Key Takeaways

  • Agents run a loop: perceive → think → act → observe
  • Tool use works by LLM text generation + runtime parsing
  • ReAct (explicit reasoning) is the pattern that works in production
  • Memory comes in flavors: in-context (fast), vector store (semantic), KV (precise), procedural (learned)
  • The complexity scales with ambition, but the fundamentals stay the same

Next article: why all of this falls apart at scale.