May 17, 2026

Agent Cost Catastrophes: Six Real Patterns That Blow Up Your AI Budget

The most expensive AI agent mistakes are not random. They are predictable patterns that show up across codebases once you know what to look for. Here are six that reliably cause runaway spend.

Why agent costs spike

Most teams that run into AI cost problems did not make one big mistake. They made six small ones that compounded. The patterns below show up repeatedly in agent codebases once you start measuring at the token level. Most of them are fixable in an afternoon.

1. The context accumulation loop

A multi-turn agent appends every previous response to the next prompt. After 10 turns, the context window is 90% previous output. You are paying to re-process everything the agent already said, every single call.

Fix: implement a context manager that keeps only the most recent N turns, or summarizes older turns into a compact digest. A well-tuned summarization step typically reduces context window usage by 60-80% for conversation-heavy agents with no meaningful quality loss.

2. Expensive model on cheap tasks

Claude Opus on classification tasks. GPT-5.4 on document chunking. It sounds like a caricature, but it is common. Teams default to the best model for the whole pipeline and never route by task complexity.

Real numbers: classifying a support ticket (determine category, priority, department) on Opus costs roughly $0.050 per call. On Haiku: $0.002. For a team processing 5,000 tickets per month, the difference is $240/month vs $10/month for the same output quality. The expensive model is not 25x better at binary classification.

3. Synchronous tool calls that force a full round-trip

An agent needs four pieces of data. It calls each tool sequentially, waiting for each response before issuing the next call. If each tool call takes 500ms and the model response costs $0.01, you have burned four round-trips of latency and cost where you needed one.

Fix: batch independent tool calls. Most frameworks support parallel function calling. For agents with multiple independent data fetches, parallelizing tool calls can cut both latency and compute cost by 50-70%.

4. No cache on identical prompt prefixes

If your agent has a large, stable system prompt (instructions, persona, policies), you are paying full price to process those tokens on every call. Anthropic, OpenAI, and Google all support prompt caching for repeated prefixes. Cache hit rates of 80%+ are achievable on most production agent deployments.

The math: a 2,000-token system prompt cached vs not cached, 10,000 calls per month. At Claude Sonnet rates, uncached input costs $0.03 per 1K tokens. Cached input costs $0.003. That is $600/month vs $60/month for the same calls. The API call to enable caching is two lines of code.

5. Heartbeat frequency mismatch

A heartbeat agent runs every 30 seconds. Its job is to check whether something needs to happen. 95% of the time, nothing does. You are paying for 95% "nothing to do" responses at full model price.

Fix: tiered polling. Start with a lightweight check (cheap model, minimal context, binary output: "needs action" or "all clear"). Only invoke the full agent when the lightweight check fires. An Haiku-based triage layer running every 30 seconds, triggering a Sonnet full-response agent 5% of the time, typically reduces heartbeat costs by 70-85%.

6. Output verbosity that feeds back as input

The agent produces a 500-word response. That response gets fed back into the next prompt as context. The model is generating tokens you do not need, that then cost money to re-process later.

Fix: be specific about output format in your system prompt. "Respond in 2-3 sentences. Use bullet points for lists. No preamble." LLMs follow output format instructions reliably. Reducing output tokens by 60% reduces both the output cost and the cost of any subsequent turn that ingests that output. For agentic pipelines, this compounds fast.

How to find your version of these patterns

All six patterns are invisible without token-level logging. You need to see: which model handled which call, how many input tokens vs output tokens, what percentage of input tokens were repeated context from previous turns, and whether you are hitting the cache on your stable prefixes.

Most agent frameworks give you a latency log and a bill. That is not enough. You need to see what you are actually sending and why it costs what it costs.

The Clawback dashboard shows this breakdown per agent run. If you are running agents at any meaningful scale, a 30-minute review typically finds 40-60% cost reduction sitting in one or two of these six patterns.

See your actual numbers

The calculator runs in your browser. No account needed.