May 9, 2026

Hybrid model routing: mixing open-source and proprietary models to cut agent costs

A concrete guide to routing agent calls between open-source and proprietary models based on task type. How to identify which calls need frontier capability and which do not.

The question is not "should I use open-source or proprietary models?" The question is "which calls in my pipeline actually need frontier capability?" Those are different questions, and only the second one leads to a useful answer.

Here is a concrete framework for thinking about hybrid routing in an agent context.

Start with call classification

Pull a sample of recent calls from your agent logs, 50 to 100 calls across a representative period. Sort them by token cost (input tokens times input price, plus output tokens times output price). You will almost always find a power-law distribution: a small number of expensive calls accounting for most of the bill, and a long tail of cheap calls.
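This audit can be scripted in a few lines. A minimal sketch, assuming a log schema with `model`, `task`, `input_tokens`, and `output_tokens` fields, and using hypothetical per-million-token prices; substitute your own schema and your provider's rates:

```python
# Sort logged agent calls by per-call cost, most expensive first.
# Field names and prices below are assumptions, not a real schema.

PRICES = {  # hypothetical $/MTok as (input, output)
    "opus":   (15.00, 75.00),
    "sonnet": (3.00, 15.00),
    "haiku":  (0.80, 4.00),
}

def call_cost(call):
    p_in, p_out = PRICES[call["model"]]
    return (call["input_tokens"] * p_in +
            call["output_tokens"] * p_out) / 1_000_000

calls = [  # in practice, 50-100 calls pulled from your logs
    {"model": "opus",  "task": "code-generation", "input_tokens": 12_000, "output_tokens": 2_000},
    {"model": "opus",  "task": "heartbeat",       "input_tokens": 6_000,  "output_tokens": 300},
    {"model": "haiku", "task": "classification",  "input_tokens": 2_000,  "output_tokens": 50},
]

for call in sorted(calls, key=call_cost, reverse=True):
    print(f"${call_cost(call):.4f}  {call['model']:7s} {call['task']}")
```

Even on a toy sample like this, the shape is visible: the top one or two calls dominate the total.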

For each expensive call, ask: what is the task? What happens if this call produces a mediocre output instead of a great one? The answers tend to cluster into four buckets.

The four routing buckets

Bucket 1: High-stakes irreversible tasks. Writing code that goes directly to production. Composing an email sent from Jake's address. Making a decision that is hard to undo. These need your best model. Do not route them to save $0.08. The cost of a bad output is not the API bill, it is the cleanup time.

Bucket 2: Complex reasoning with clear evaluation criteria. Code generation with a test suite. Research synthesis where you will verify the output. Long-form writing you will edit yourself. These tasks benefit from a frontier model, but the output gets reviewed anyway. Consider a mid-tier model (Sonnet, Mistral Medium) and escalate to Opus or GPT-5 only when the mid-tier output fails.

Bucket 3: Structured extraction and transformation. Parsing a document into a schema. Summarizing a meeting transcript. Converting one format to another. These are high-token tasks that do not require frontier reasoning. A capable open-weight model at $0.08 to $0.50 per million input tokens does this as well as a $5/MTok model in most cases.

Bucket 4: Classification and routing. Deciding which agent handles a message. Filtering irrelevant inputs. Labeling sentiment or intent. These are pure classification tasks, and a small model handles them: Claude Haiku, Gemma 4 running locally, or any fine-tuned small model. The cost difference here is 50x or more versus Opus.
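Once calls are labeled, the four buckets collapse into a routing table. A minimal sketch, where the task labels and model names are illustrative assumptions rather than any real API:

```python
# Static routing table mapping task buckets to models.
# Labels and model names are illustrative placeholders.
ROUTES = {
    "high-stakes":       "opus",               # bucket 1: best model, no exceptions
    "complex-reasoning": "sonnet",             # bucket 2: mid-tier, escalate on failure
    "extraction":        "open-weight-small",  # bucket 3: cheap open-weight model
    "classification":    "haiku",              # bucket 4: smallest model that works
}

def route(task_type):
    # Unknown or unlabeled tasks default to the best model:
    # misrouting an important call costs more than the API savings.
    return ROUTES.get(task_type, "opus")
```

The default is the point of the design: when routing fails, fail expensive, not cheap.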

How to implement this without rebuilding everything

You do not need to rewrite your agent framework to do hybrid routing. The practical approach:

Add a task-type field to your prompts. Even a simple comment in the system prompt saying "task: classification" or "task: code-generation" makes it easy to route programmatically or review in logs.
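In code, that can be as simple as prepending a label. The `# task:` convention here is an assumption for illustration, not an OpenClaw feature:

```python
def tag_prompt(system_prompt, task_type):
    # Prepend a machine-readable task label that both a router
    # and a human reading the logs can key on.
    return f"# task: {task_type}\n{system_prompt}"

prompt = tag_prompt("You decide which agent handles each message.",
                    "classification")
```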

Start with the tail, not the head. Do not replace your most important calls first. Find the highest-volume cheap calls and route those to a cheaper model. Validate the output quality over a week. Then move up the chain.

Measure quality, not just cost. When you route a class of calls to a cheaper model, instrument at least one output-quality signal. Downstream error rate. User satisfaction. Task completion rate. Cost optimization that degrades quality is not a win.

Use a fallback chain. Route to the cheap model. If the output fails a quality check (empty, malformed, fails a test), automatically retry on a better model. The retry adds latency and cost but catches the cases where the cheap model genuinely cannot handle the task.
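A fallback chain is a short loop. The sketch below takes the LLM client as a plain `complete(model, prompt)` callable, which is an assumption about your setup rather than a specific SDK:

```python
import json

def complete_with_fallback(prompt, complete, chain, check):
    """Try models cheapest-first; escalate when the output fails a check.

    complete: callable (model, prompt) -> str, your LLM client (assumed)
    chain:    model names ordered cheapest to most capable
    check:    callable (output) -> bool, task-specific quality gate
    """
    output = ""
    for model in chain:
        output = complete(model, prompt)
        if check(output):
            return model, output
        # Failed the check (empty, malformed, test failure): escalate.
    return chain[-1], output  # every model failed; return the last attempt

# Example quality gate for extraction tasks: output must parse as JSON.
def looks_like_json(text):
    try:
        json.loads(text)
        return True
    except (ValueError, TypeError):
        return False
```

Returning the model name alongside the output makes the escalation rate easy to log, which tells you whether a task class actually belongs on the cheap model.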

What the numbers look like

A typical OpenClaw agent at 30-minute heartbeats with a few active task types might look like this:

  • Heartbeat calls (24/day): 6,000 tokens each, pure classification → Haiku at $0.003/day per channel
  • Document extraction (5/day): 20,000 tokens each → DeepSeek V4 Flash at $0.12/day
  • Complex reasoning (2/day): 15,000 tokens each → Sonnet at $0.24/day
  • High-stakes decisions (0.5/day): 10,000 tokens each → Opus at $0.30/day

Total: roughly $0.66/day. The same workload entirely on Opus: around $4.50/day. The savings come from routing the high-volume, low-complexity calls to cheap models, not from switching your most important calls away from the best model.
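The hybrid total is just the sum of the per-bucket daily costs from the list above, reproduced as arithmetic:

```python
# Per-bucket daily costs taken directly from the example breakdown.
daily = {
    "heartbeat":   0.003,  # Haiku
    "extraction":  0.12,   # DeepSeek V4 Flash
    "reasoning":   0.24,   # Sonnet
    "high_stakes": 0.30,   # Opus
}
hybrid_total = sum(daily.values())
print(f"hybrid: ${hybrid_total:.2f}/day vs. all-Opus: ~$4.50/day")
```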

The prerequisite: knowing what your calls actually are

You cannot route what you cannot see. The first step in any cost optimization is logging your calls with enough context to classify them. Model, token count, task type, latency, output quality signal.
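A minimal log record with that context might look like the following sketch. The field names are assumptions for illustration, not an OpenClaw or Clawback schema:

```python
import json
import time

def log_call(model, task_type, input_tokens, output_tokens,
             latency_ms, quality_signal=None):
    # One record per LLM call, with enough context to classify it later.
    record = {
        "ts": time.time(),
        "model": model,
        "task": task_type,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "latency_ms": latency_ms,
        "quality": quality_signal,  # e.g. tests passed, user thumbs-up
    }
    print(json.dumps(record))  # in practice, append to a JSONL file
    return record
```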

Clawback connects to your OpenClaw session logs and shows you a breakdown of calls by type, cost, and volume. If you have been running agents for a while without this visibility, the first session report usually surfaces a few obvious routing opportunities. The heartbeat bucket alone is often $50-150/month of Opus spend that belongs on Haiku.

See your actual numbers

The calculator runs in your browser. No account needed.