April 30, 2026

The hidden cost of AI agent memory: why your context window is your biggest bill

Everyone focuses on per-token prices. The real cost driver in most agent deployments is how much context you load on every call. Here is a breakdown of where those tokens are actually going.

Developers optimizing AI agent costs usually start with model selection. Switch from Opus to Sonnet, from GPT-5 to GPT-4.1. That is the right instinct. But in many production deployments, model choice is the second-biggest cost lever, not the first.

The first is context size.

Where your tokens actually go

In a typical agent deployment, input tokens break down roughly like this:

  • System prompt: 1,000-5,000 tokens (instructions, persona, rules)
  • Tool definitions: 500-3,000 tokens (depends on how many tools you have registered)
  • Conversation history: 0-50,000+ tokens (grows with session length)
  • Retrieved context (RAG): 1,000-20,000 tokens per call
  • User message: 10-500 tokens

The user message, which is the actual work, is often the smallest item in the context. You are paying frontier prices for every token in the system prompt and tool definitions, on every single call.
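
To make that concrete, here is a minimal back-of-envelope sketch in Python. The component sizes are mid-range values from the list above, and the $3-per-million-token input price is a placeholder, not any particular model's rate.

    # Back-of-envelope cost of a single agent call, using mid-range
    # values from the breakdown above. All numbers are illustrative.
    COMPONENTS = {
        "system_prompt": 3_000,
        "tool_definitions": 1_500,
        "conversation_history": 10_000,
        "retrieved_context": 5_000,
        "user_message": 200,
    }

    INPUT_PRICE_PER_MTOK = 3.00  # placeholder, USD per million input tokens

    total_tokens = sum(COMPONENTS.values())
    for name, tokens in COMPONENTS.items():
        print(f"{name:>20}: {tokens:>6,} tokens ({tokens / total_tokens:5.1%})")
    print(f"{'total':>20}: {total_tokens:>6,} tokens "
          f"-> ${total_tokens * INPUT_PRICE_PER_MTOK / 1e6:.4f} per call")

With these numbers, the user message is about 1% of what you pay to process on the call.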

The compounding problem

Conversation history is the silent cost multiplier. In a 10-turn conversation, every call carries all the prior turns, so by turn 10 the first 9 turns are sitting in the context window. If turns average 500 tokens of output, the final call loads 4,500 tokens of history, and the first turn's tokens have by that point been billed nine separate times.

The total input tokens for a 10-turn conversation are not 10x the cost of 1 turn. They are roughly 55x (1 + 2 + 3 + ... + 10 = 55 units, where each unit is one turn's worth of history). This is a well-understood problem, but most developers do not have a concrete sense of the numbers until they see a bill.
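
The arithmetic is easy to check. The sketch below assumes every turn adds a flat 500 tokens of history and ignores the system prompt and tool definitions, which in practice only make the multiplier worse.

    # Cumulative input tokens for an N-turn conversation where each turn
    # appends TURN_TOKENS of history that is re-sent on every later call.
    TURN_TOKENS = 500   # tokens each turn adds to the transcript
    N_TURNS = 10

    total_input = 0
    for turn in range(1, N_TURNS + 1):
        # On turn t the context holds t turns' worth of tokens:
        # the t-1 previous turns plus the current one.
        total_input += turn * TURN_TOKENS

    print(f"Turn 1 alone: {TURN_TOKENS:,} input tokens")
    print(f"All {N_TURNS} turns: {total_input:,} input tokens "
          f"({total_input // TURN_TOKENS}x the cost of turn 1)")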

What you can actually do

Summarize history aggressively. Instead of passing the full transcript, summarize it every N turns. Summarization costs tokens up front but saves more on every subsequent call. The break-even is usually around turns 4-6, depending on your model and verbosity.
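
To find your own break-even point, compare cumulative input tokens with and without summarization. In the sketch below, the per-turn token count, summary size, and summarization interval are all placeholder assumptions you would replace with measurements from your own transcripts.

    TURN_TOKENS = 500        # avg tokens each turn adds to the transcript
    SUMMARY_TOKENS = 300     # assumed size of the rolling summary (placeholder)
    SUMMARIZE_EVERY = 4      # re-summarize every N turns

    def full_history_cost(n_turns: int) -> int:
        """Cumulative input tokens when the full transcript is re-sent."""
        return sum(t * TURN_TOKENS for t in range(1, n_turns + 1))

    def summarized_cost(n_turns: int) -> int:
        """Cumulative input tokens with a rolling summary.

        Every SUMMARIZE_EVERY turns, the summarization call re-reads the
        accumulated history once, then collapses it to SUMMARY_TOKENS.
        """
        total, history = 0, 0
        for turn in range(1, n_turns + 1):
            history += TURN_TOKENS        # this turn's contribution
            total += history              # input tokens for this call
            if turn % SUMMARIZE_EVERY == 0:
                total += history          # the summarization call itself
                history = SUMMARY_TOKENS  # transcript collapses to the summary
        return total

    for n in (4, 6, 8, 10):
        print(f"{n:>2} turns: full={full_history_cost(n):>6,}  "
              f"summarized={summarized_cost(n):>6,}")

With these placeholder numbers, summarization pulls ahead between turns 5 and 6, consistent with the range above.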

Trim tool definitions. Every tool you register is loaded on every call. If you have 20 tools but only 3 are relevant to any given task, you are loading 17 tools' worth of tokens unnecessarily. Dynamic tool loading, where you pass only the tools relevant to the current intent, can cut tool definition overhead by 50-80%.
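
A minimal version of dynamic tool loading is a keyword filter in front of the call. Everything below, the registry, the tags, the stand-in schemas, is hypothetical; a production router would more likely classify intent with a small, cheap model and cache the result per session.

    # Hypothetical tool registry mapping tool names to (tags, definition).
    # The definitions would be full JSON schemas in a real deployment.
    TOOL_REGISTRY = {
        "search_docs":   ({"search", "question"}, "<search_docs schema>"),
        "create_ticket": ({"support", "bug"},     "<create_ticket schema>"),
        "run_sql":       ({"data", "report"},     "<run_sql schema>"),
        "send_email":    ({"notify", "support"},  "<send_email schema>"),
    }

    def select_tools(user_message: str, max_tools: int = 3) -> list[str]:
        """Return definitions for tools whose tags match the message.

        Crude keyword matching, for illustration only; a real router
        would classify intent properly and cache per session.
        """
        words = set(user_message.lower().split())
        scored = [
            (len(tags & words), definition)
            for tags, definition in TOOL_REGISTRY.values()
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        # Only pass tools that actually matched, capped at max_tools.
        return [d for score, d in scored[:max_tools] if score > 0]

    print(select_tools("please create a ticket for this bug"))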

Use structured system prompts. Verbose prose system prompts are a common cost leak. "You are a helpful AI assistant that always provides accurate, thoughtful, comprehensive responses..." is 15 tokens spent before the model has been told anything specific. A well-structured instruction set covering the same ground in bullet points is often 30-50% fewer tokens for the same behavioral effect.
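
Measuring is cheap enough that there is no reason to guess. The comparison below uses tiktoken as a rough tokenizer; exact counts vary by model, and both example prompts are illustrative rather than recommended wording.

    import tiktoken  # pip install tiktoken; an approximation for non-OpenAI models

    enc = tiktoken.get_encoding("cl100k_base")

    prose = (
        "You are a helpful AI assistant that always provides accurate, "
        "thoughtful, comprehensive responses. You should carefully consider "
        "the user's question and make sure your answer is complete."
    )
    structured = (
        "Role: support assistant.\n"
        "- Answer accurately; say so if unsure.\n"
        "- Cover the full question before stopping."
    )

    for name, text in (("prose", prose), ("structured", structured)):
        print(f"{name:>10}: {len(enc.encode(text))} tokens")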

Separate long-context from short-context tasks. RAG queries that need large context windows should run on models priced for that use case, not on your default frontier model. Running a document analysis task that needs 30K context tokens on Claude Opus 4.6 costs $0.15 in input alone. That same task on a mid-tier model costs $0.03-0.06.
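
The routing itself can be a token-count threshold in front of the model call. The model names, prices, and threshold below are placeholders for whatever your own price sheet says.

    # Hypothetical two-tier router: large-context jobs go to a cheaper
    # mid-tier model, everything else to the default frontier model.
    FRONTIER = ("frontier-model", 5.00)   # USD per million input tokens
    MID_TIER = ("mid-tier-model", 1.50)

    LONG_CONTEXT_THRESHOLD = 15_000  # input tokens

    def route(input_tokens: int) -> tuple[str, float]:
        """Pick a model and estimate input cost for this call."""
        model, price = (
            MID_TIER if input_tokens >= LONG_CONTEXT_THRESHOLD else FRONTIER
        )
        return model, input_tokens * price / 1e6

    for tokens in (2_000, 30_000):
        model, cost = route(tokens)
        print(f"{tokens:>6,} tokens -> {model} (${cost:.4f} input)")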

What to measure

Before optimizing anything, measure where your tokens are going. Most inference providers return token counts in the API response. Log them, aggregate by task type, and look at the distribution.
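
As a starting point, the sketch below aggregates per-call token counts by task type. The log records are illustrative; in practice each record would come from the usage object in the provider's API response (field names vary: prompt_tokens, input_tokens, and so on), with a task_type label you attach yourself.

    from collections import defaultdict

    # One record per call, as you might log it. Values are made up
    # for illustration; real ones come from the provider's usage object.
    call_log = [
        {"task_type": "chat",      "input_tokens": 12_400, "output_tokens": 350},
        {"task_type": "rag_query", "input_tokens": 28_900, "output_tokens": 600},
        {"task_type": "tool_call", "input_tokens": 6_100,  "output_tokens": 80},
        {"task_type": "rag_query", "input_tokens": 31_200, "output_tokens": 540},
    ]

    totals = defaultdict(lambda: {"calls": 0, "input": 0, "output": 0})
    for rec in call_log:
        bucket = totals[rec["task_type"]]
        bucket["calls"] += 1
        bucket["input"] += rec["input_tokens"]
        bucket["output"] += rec["output_tokens"]

    grand_input = sum(b["input"] for b in totals.values())
    for task, b in sorted(totals.items(), key=lambda kv: -kv[1]["input"]):
        print(f"{task:>10}: {b['calls']} calls, {b['input']:,} input tokens "
              f"({b['input'] / grand_input:.0%} of input)")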

You will usually find that a small number of call patterns (long-session conversations, high-tool-count calls, large RAG retrievals) account for the majority of your costs, and they are not necessarily your highest-value tasks.

The Clawback dashboard shows you a breakdown of where your tokens are going across your agent deployment. Connect it to your OpenClaw session logs and you will see which call patterns are driving costs, not just the total bill.

See your actual numbers

The calculator runs in your browser. No account needed.