May 6, 2026
Open-source LLMs are rewriting agent cost math
Five frontier-class open-weight models shipped in 30 days. Llama 4, Qwen 3.5, DeepSeek V4, Gemma 4, Mistral Medium 3.5. What this actually means for teams paying per-token.
Five frontier-class open-weight models shipped in the last 30 days: Meta's Llama 4 (Scout and Maverick), Alibaba's Qwen 3.5, DeepSeek V4 Pro and Flash, Google's Gemma 4, and Mistral Medium 3.5 on April 29. Moonshot's Kimi K2.6 quietly followed. The open-source versus closed-source question has inverted. It used to be "can we get away with using open-weight models?" Now it is "which open-weight model should we deploy?"
For teams paying Anthropic or OpenAI per token, this changes the budget conversation. Here is what the numbers actually look like.
What shipped and what it costs
The benchmark data is real. DeepSeek V4 Pro hits 80.6 on SWE-Bench Verified and 90.1 on GPQA Diamond with a 1M context window. Mistral Medium 3.5 hits 77.6 on SWE-Bench Verified with a 128B dense model. Kimi K2.6 (Moonshot's 1T/32B MoE) hits 80.2 on SWE-Bench Verified and 96.4 on AIME 2026. These are not embarrassing scores hiding behind "open-source" as a qualifier. They are competitive with proprietary frontier models.
On cost, the math depends on how you deploy. Self-hosted means compute costs instead of API fees, but a smaller model like Gemma 4 runs on a single GPU. API access to DeepSeek V4 Flash runs around $0.08 per million input tokens and $0.24 per million output tokens via Novita. Compare that to Claude Opus 4.7 at $5 input and $25 output. For the same agent workload, the bill differs by more than an order of magnitude.
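Here is a back-of-the-envelope sketch of that gap, using the prices quoted above. The workload numbers (calls per month, tokens per call) are hypothetical; swap in your own.

```python
# Rough monthly cost comparison. Prices are $/MTok from the figures above;
# the workload (200k calls, 4k input + 1k output tokens each) is made up.
PRICES = {
    "deepseek-v4-flash": (0.08, 0.24),
    "claude-opus-4.7": (5.00, 25.00),
}

def monthly_cost(model, calls, input_tokens_per_call, output_tokens_per_call):
    price_in, price_out = PRICES[model]
    input_mtok = calls * input_tokens_per_call / 1_000_000
    output_mtok = calls * output_tokens_per_call / 1_000_000
    return input_mtok * price_in + output_mtok * price_out

for model in PRICES:
    print(model, f"${monthly_cost(model, 200_000, 4_000, 1_000):,.2f}")
# deepseek-v4-flash  $112.00
# claude-opus-4.7    $9,000.00
```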
The actual tradeoff for agents
Per-benchmark, these models are close to proprietary frontier. Per-task, the picture is more complicated. Here is what actually matters for agentic workloads.
Instruction following. Most agent work is following multi-step instructions reliably across a long context window. Proprietary models have been tuned hard for this. Some open-weight models are catching up; some still drift on long horizons.
Tool call reliability. Agent pipelines depend on structured outputs and reliable function calling. This is an area where model quality differences show up concretely. A model that hallucinates a tool name or misformats a JSON argument costs you a retry plus the debugging time.
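A minimal sketch of what that reliability cost looks like in practice: validate each tool call against the tools you actually expose, and retry on failure. The tool registry, field names, and `generate` callable here are assumptions, not any particular framework's API.

```python
import json

# Hypothetical tool registry; the keys are the only valid tool names.
TOOLS = {
    "search_codebase": {"required_args": {"query"}},
    "run_tests": {"required_args": {"path"}},
}

def validate_tool_call(raw: str) -> tuple[bool, str]:
    """Check that a model-emitted tool call parses and matches a known tool."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False, "tool call is not valid JSON"
    name = call.get("name")
    if name not in TOOLS:
        return False, f"unknown tool: {name!r}"
    missing = TOOLS[name]["required_args"] - set(call.get("arguments", {}))
    if missing:
        return False, f"missing arguments: {sorted(missing)}"
    return True, "ok"

def call_with_retry(generate, prompt, max_attempts=3):
    """generate(prompt) is whatever client call returns the model's tool-call string."""
    for _ in range(max_attempts):
        raw = generate(prompt)
        ok, reason = validate_tool_call(raw)
        if ok:
            return json.loads(raw)
        # Every retry is another round of input + output tokens on the bill.
        prompt += f"\n\nYour last tool call was rejected: {reason}. Emit corrected JSON only."
    raise RuntimeError("model failed to produce a valid tool call")
```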
Latency and infrastructure overhead. Self-hosted open-weight models mean you manage the infrastructure. API-accessed open-weight models remove that, but you are now depending on a third-party host whose uptime and latency you do not control.
Context window economics. Llama 4 Scout has a 10M token context window. DeepSeek V4 Pro has 1M. For agents that need to load large codebases or long documents, this matters. A larger context window can cut costs (fewer calls to cover the same material) or raise them (more input tokens loaded per call), depending on the workload.
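A rough arithmetic sketch of that tradeoff, with hypothetical token counts and the DeepSeek V4 Flash input price from above:

```python
# One big-context call vs. repeated smaller calls over the same material.
# Codebase size and chunking strategy are made-up numbers for illustration.
PRICE_IN = 0.08  # $/MTok input

codebase_tokens = 600_000

# Strategy A: load the whole codebase once into a large context window.
cost_single = codebase_tokens / 1e6 * PRICE_IN          # $0.048

# Strategy B: 12 smaller calls that each reload a 150k-token slice.
cost_chunked = 12 * 150_000 / 1e6 * PRICE_IN             # $0.144

print(f"single large-context call: ${cost_single:.3f}")
print(f"chunked calls:             ${cost_chunked:.3f}")
```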
A practical routing heuristic
The most cost-efficient agent deployments in 2026 do not use one model for everything. They route by task complexity and stakes.
- High-stakes, complex reasoning: Proprietary frontier (Claude Opus 4.7, GPT-5). These tasks are infrequent and justify the cost.
- Coding tasks with clear specs: DeepSeek V4 Flash or Mistral Medium 3.5. Benchmark-competitive at a fraction of the price.
- Classification, routing, and filtering: Gemma 4 locally or a small open-weight model. These tasks do not need frontier capability.
- Heartbeat and orchestration: Haiku or an equivalent small model. Same logic as the model-routing post: classification tasks do not need a $15/MTok input model.
The teams seeing meaningful cost reductions right now are not switching off proprietary models entirely. They are identifying the 60-70% of calls that do not need frontier capability and routing those somewhere cheaper.
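A minimal sketch of that routing logic, assuming a caller that can tag each task with a coarse type and a stakes flag. The categories, thresholds, and model names are placeholders; a production router would key off richer signals.

```python
# Route by task type and stakes; default to the frontier model when unsure.
ROUTES = {
    "complex_reasoning": "claude-opus-4.7",    # high stakes, infrequent
    "coding":            "deepseek-v4-flash",  # clear spec, benchmark-competitive
    "classification":    "gemma-4",            # small local model
    "orchestration":     "claude-haiku",       # heartbeat / glue calls
}

def pick_model(task_type: str, high_stakes: bool = False) -> str:
    # Anything flagged high-stakes goes to the frontier model regardless of type.
    if high_stakes:
        return ROUTES["complex_reasoning"]
    return ROUTES.get(task_type, ROUTES["complex_reasoning"])

print(pick_model("coding"))                    # deepseek-v4-flash
print(pick_model("classification"))            # gemma-4
print(pick_model("coding", high_stakes=True))  # claude-opus-4.7
```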
What this means for your bill
If your agent stack runs entirely on Claude Opus 4.7 or GPT-5, the 2026 open-source wave is a real alternative for parts of your pipeline. Not a drop-in replacement, but a meaningful routing option.
The prerequisite is knowing which parts of your pipeline actually need frontier capability. That requires token-level observability: which calls are expensive, what those calls are doing, and whether the output quality justifies the cost.
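If you want to eyeball this yourself before reaching for a dashboard, a few lines of aggregation over your call logs gets you most of the way. The log schema here (one JSON object per line with call_type, model, and token counts) is an assumption; adjust the field names to whatever your logs actually emit.

```python
import json
from collections import defaultdict

# $/MTok (input, output); extend with whatever models you actually call.
PRICES = {"claude-opus-4.7": (5.00, 25.00), "deepseek-v4-flash": (0.08, 0.24)}

def cost_by_call_type(path: str) -> dict[str, float]:
    totals = defaultdict(float)
    with open(path) as f:
        for line in f:
            rec = json.loads(line)  # assumed fields: call_type, model, input_tokens, output_tokens
            price_in, price_out = PRICES[rec["model"]]
            totals[rec["call_type"]] += (
                rec["input_tokens"] / 1e6 * price_in
                + rec["output_tokens"] / 1e6 * price_out
            )
    return dict(sorted(totals.items(), key=lambda kv: -kv[1]))

# The call types at the top of this breakdown are the first candidates
# for routing to a cheaper open-weight model.
```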
The Clawback dashboard shows you token usage broken down by call type, session, and model. If you connect your OpenClaw logs, you will see exactly which parts of your agent workload are consuming your budget, and whether they need a $5/MTok model or a $0.08/MTok one.
See your actual numbers
The calculator runs in your browser. No account needed.