---
title: "Model Routing: Sending the Right Query to the Right Model"
date: 2026-04-04
excerpt: "How to route queries to different models based on complexity, latency vs. cost tradeoffs, and when routing backfires."
---
Model Routing: Sending the Right Query to the Right Model
The fundamental insight: not all queries are created equal.
Some queries are trivial. "What's the weather?" "What's my account balance?" These need only a fast model; any reasonably competent one will do. Others are hard. "Tell me about quantum computing in 5 pages." "Debug this broken code." These need reasoning, depth, multiple steps. They need the expensive model.
If you send every query to the expensive model, you're burning money. If you send every query to the cheap model, you're burning your users. The solution: model routing — decide, on a per-request basis, which model to use.
Done well, routing cuts costs 30-50% with zero quality loss. Done poorly, it confuses your system and makes debugging impossible. This is a guide to getting it right.
The Core Idea
Model routing rests on a simple premise: query complexity varies. Model cost and quality vary too.
```mermaid
flowchart TD
Q([Incoming Query]) --> R{Router: classify complexity}
R -->|simple — 60-70% of traffic| C[Cheap Model e.g. Claude Haiku]
R -->|complex — 30-40% of traffic| E[Expensive Model e.g. Claude Sonnet]
C --> OUT([Response])
E --> OUT
style Q fill:#f5f5f4,stroke:#e7e5e4
style OUT fill:#f5f5f4,stroke:#e7e5e4
style C fill:#fafaf9,stroke:#a8a29e
style E fill:#fafaf9,stroke:#a8a29e
```
style R fill:#fafaf9,stroke:#a8a29e
You have a cheap model (Claude Haiku, GPT-3.5): fast, good enough for simple tasks, 1/10th the cost of the expensive model.
You have an expensive model (Claude 3 Sonnet or GPT-4): slow, excellent on reasoning and complex tasks, 10x the cost.
The routing decision: given a query, which model should handle it?
Cheap model for:
- Factual retrieval (what's the company policy on X?)
- Simple classification (is this email spam?)
- Template-based generation (fill in the template with this data)
- Summarization (summarize this article)
- Simple questions (when was X founded?)
Expensive model for:
- Reasoning (explain why X happened)
- Code generation (write a function that does X)
- Creative writing (write a short story about X)
- Multi-step problem solving (plan a marketing campaign)
- Ambiguous or complex queries (what should we do about X?)
A well-configured router handles 50-70% of queries with the cheap model; the remaining 30-50% go to the expensive model. Cost: roughly 40% of the all-expensive baseline. Quality: no degradation, because the easy queries are exactly the ones being routed away.
How to Classify Query Complexity
There are three approaches: heuristics, classifiers, and learned routers. Each has tradeoffs.
Heuristic Rules
Simple rules based on query characteristics:
- Query length > 500 characters → expensive model
- Contains keywords (code, debug, design, explain) → expensive model
- User is premium tier → expensive model
- This is a followup question in a conversation → expensive model
- Otherwise → cheap model
Pros: Fast, no model calls, easy to understand. Cons: Brittle, requires ongoing tuning, and blind to phrasing a learned approach would catch.
Heuristics are a good starting point. But expect them to misroute roughly 10-15% of queries (sending an easy query to the expensive model, or vice versa).
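The rules above fit in a few lines. This is a sketch: the 500-character threshold and the keyword list come straight from the bullets, not from tuning.

```python
# Heuristic router sketch. Thresholds and keywords are illustrative,
# taken directly from the rule list above.
COMPLEX_KEYWORDS = {"code", "debug", "design", "explain"}

def route(query: str, user_tier: str = "standard", is_followup: bool = False) -> str:
    """Return 'expensive' or 'cheap' based on simple heuristics."""
    text = query.lower()
    if len(query) > 500:                        # long queries tend to be complex
        return "expensive"
    if any(kw in text for kw in COMPLEX_KEYWORDS):
        return "expensive"
    if user_tier == "premium":                  # business rule, not a quality signal
        return "expensive"
    if is_followup:                             # follow-ups need conversational context
        return "expensive"
    return "cheap"
```

Note the keyword check is a plain substring match; in production you'd want word-boundary matching so "designated" doesn't trigger "design".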
Classifiers
Use a small, fast classifier to predict complexity: "Given a query, is this complex? Yes/No."
Train on a dataset of queries you've already labeled. Use a lightweight model (a logistic regression, a small BERT, anything fast). Run the classifier on every incoming query. Route based on the score.
Pros: More accurate than heuristics (80-90%), scales to your specific data, captures linguistic patterns. Cons: Requires labeled training data, adds latency (~100ms per request), requires retraining as query distribution changes.
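A minimal version of this, sketched with scikit-learn. The training examples, labels, and threshold are illustrative assumptions; a real deployment would train on thousands of labeled queries, likely over embeddings rather than TF-IDF features.

```python
# Classifier-based router sketch. Training data and the 0.7 threshold
# are illustrative assumptions, not a recommendation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

queries = [
    "what is the refund policy",               # simple
    "when was the company founded",            # simple
    "summarize this article",                  # simple
    "is this email spam",                      # simple
    "debug this stack trace in my service",    # complex
    "write a function that merges intervals",  # complex
    "plan a marketing campaign for launch",    # complex
    "explain why the deploy failed",           # complex
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]  # 0 = simple, 1 = complex

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(queries, labels)

def route(query: str, threshold: float = 0.7) -> str:
    # predict_proba returns [P(simple), P(complex)]
    p_complex = clf.predict_proba([query])[0][1]
    return "expensive" if p_complex > threshold else "cheap"
```

The threshold is the knob to tune: raise it to send more traffic to the cheap model, lower it when quality metrics on cheap-routed queries slip.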
Learned Routers
Let the system learn optimal routing based on outcomes.
Setup: track every query, which model it used, and the quality of the output. Over time, learn which queries should have gone where. Update routing accordingly.
This is powerful but complex. It requires:
- Quality signal on every request (this is hard)
- A learning algorithm that handles sequential decision-making
- Exploration (occasionally trying the non-preferred model to improve data)
Pros: Optimal over time, adapts to changing patterns, captures domain-specific complexity. Cons: Requires infrastructure, takes weeks to stabilize, and is only as reliable as the quality signal feeding it.
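One concrete shape for this is an epsilon-greedy bandit per query bucket. Everything here is an assumption for the sketch: the bucketing scheme, the cost penalty weight, and the existence of a per-request quality score.

```python
# Epsilon-greedy learned-router sketch. The cost figures, the
# cost_weight penalty, and the quality signal are all assumptions.
import random
from collections import defaultdict

MODELS = ["cheap", "expensive"]
COST = {"cheap": 0.0003, "expensive": 0.003}  # $ per query, illustrative

class LearnedRouter:
    def __init__(self, epsilon: float = 0.1, cost_weight: float = 50.0):
        self.epsilon = epsilon          # exploration rate
        self.cost_weight = cost_weight  # how strongly cost offsets quality
        self.stats = defaultdict(lambda: {"n": 0, "reward": 0.0})

    def choose(self, bucket: str) -> str:
        if random.random() < self.epsilon:        # explore occasionally
            return random.choice(MODELS)
        def avg(model: str) -> float:             # exploit: best average reward
            s = self.stats[(bucket, model)]
            return s["reward"] / s["n"] if s["n"] else 0.0
        return max(MODELS, key=avg)

    def record(self, bucket: str, model: str, quality: float) -> None:
        # Reward trades quality against cost, so the cheap model wins
        # whenever its quality is close enough to the expensive model's.
        reward = quality - self.cost_weight * COST[model]
        s = self.stats[(bucket, model)]
        s["n"] += 1
        s["reward"] += reward
```

The exploration term is what makes this work: without occasionally sending "simple" queries to the expensive model (and vice versa), the router can never learn that its current policy is wrong.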
Cascade Routing vs. Learned Routers
There's a middle ground: cascade routing.
Send query to cheap model first. The cheap model responds, but also outputs a confidence score: "I'm 90% confident in this answer" or "I'm 30% confident."
High confidence? Return the response. Low confidence? Retry with the expensive model.
Pros: Adaptive. The cheap model self-assesses. You get the cost savings when it works, the quality when it doesn't. No separate classifier needed. Cons: More complex. You need to handle retries. Latency is higher (cheap + expensive instead of cheap or expensive). Some models don't output confidence scores well.
Cascade routing is elegant in theory. In practice, model confidence scores are often poorly calibrated — models are confident when they're wrong, uncertain when they're right. I'd use it as an opt-in feature (for high-stakes queries), not as the default router.
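The loop itself is short. This sketch uses a stand-in `call_model` with canned responses; a real implementation would call your model API and parse a self-reported confidence field, with all the calibration caveats above.

```python
# Cascade routing sketch. `call_model` is a hypothetical stand-in for a
# real model client; the canned answers and 0.7 threshold are illustrative.

def call_model(model: str, query: str) -> tuple[str, float]:
    # Stand-in: a real version would hit the model API and ask it to
    # self-report confidence (e.g. as a structured field in the output).
    if model == "cheap":
        if "debug" in query.lower():
            return ("I'm not sure...", 0.3)     # cheap model struggles
        return ("Here's your answer.", 0.9)
    return ("Detailed expert answer.", 0.95)

def cascade(query: str, threshold: float = 0.7) -> tuple[str, str]:
    """Returns (answer, model_used)."""
    answer, confidence = call_model("cheap", query)
    if confidence >= threshold:
        return answer, "cheap"                  # confident enough: return it
    # Low confidence: discard the cheap answer and retry from scratch.
    answer, _ = call_model("expensive", query)
    return answer, "expensive"
```

Note the expensive retry starts from the original query, not from the cheap model's answer; trying to "repair" a weak answer is usually harder than answering fresh.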
Latency vs. Cost vs. Quality Tradeoffs
Routing creates tradeoffs that are real and sometimes painful.
Cost vs. latency: The cheap model is faster. The expensive model is slower. If you route a query to the expensive model, you're trading latency for quality.
If your system must respond in <1 second and the expensive model takes 2 seconds, routing fails. You need a fast model, period. Alternatively, use a cheaper model and accept the quality loss.
Cost vs. quality: The cheap model is worse. It hallucinates more, reasons poorly, generates shorter/lower-quality outputs. Routing exposes this tradeoff. Some queries (code generation, reasoning) suffer noticeably when routed to the cheap model. Your quality metrics will show a hit.
Routing accuracy vs. latency: The more accurate your router, the slower it is. A classifier adds 100-200ms. Cascade routing (cheap + maybe expensive) adds 500ms. Heuristics are instant. You have to trade off routing accuracy against system latency.
The balancing act: most successful routers I've seen use simple heuristics (routing time < 10ms) with occasional cascade fallback (if cheap model confidence is low, retry expensive). This gets most of the cost benefit with minimal latency overhead.
Practical Routing Architectures
Here's how I'd build it for a real system:
Architecture 1: Heuristic + Cascade
1. Apply heuristic rules (query length, keywords, user tier)
2. If heuristic score > 0.8 (definitely complex) → go to expensive model
3. If heuristic score < 0.2 (definitely simple) → go to cheap model
4. If 0.2 < score < 0.8 (uncertain) → go to cheap model, stream response
5. If cheap model outputs low confidence → retry with expensive model
Latency: the cheap model responds in ~100-200ms, the expensive model in ~800-1500ms. Throughput: the cheap model is 5-10x better.
Cost: ~35-45% of all-expensive baseline.
Architecture 2: Lightweight Classifier
1. Run logistic regression classifier on query embedding
2. If P(complex) > 0.7 → expensive model
3. Otherwise → cheap model
Requires: training data, occasional retraining.
Cost: ~40-50% of baseline.
Latency: classifier adds ~50ms, total time similar to heuristic.
Architecture 3: User-Tier Routing
1. If user.tier == premium → always expensive model
2. If user.tier == free → always cheap model
3. If user.tier == standard → heuristic rules
Simple, aligns with business model, easy to explain to customers.
Cost: varies with your user mix (free users only ever incur cheap-model cost).
When Routing Backfires
Routing is powerful but not magic. It breaks in specific scenarios.
Conversation context matters. A follow-up question like "Why?" needs context from the previous response. If you routed the previous response to the cheap model and it was mediocre, the follow-up will be worse. Solution: track conversation model version. If a conversation started with the expensive model, keep using it.
Quality degradation cascades. If a cheap model gives a weak answer, the next expensive model has bad context. It has to "fix" the previous answer, which is harder than getting it right once. Solution: don't try to fix cheap model failures with expensive models in cascade. Just retry.
Multi-step tasks. Code generation often needs multiple steps (write code, test it, refine it). If step 1 went to the cheap model and produced mediocre code, step 2 will be harder. Solution: for known multi-step workflows, use the expensive model for all steps.
User expectations misaligned. A user asks a simple question and gets a one-sentence response from the cheap model. They're disappointed — they expected depth. If the same query to the expensive model would have given a paragraph, routing felt like a downgrade. Solution: be transparent. Show which model handled the query. Let users override the routing decision.
Routing creep. You add more heuristics. More conditions. More fallbacks. The routing logic becomes spaghetti. New engineers don't understand it. Bugs hide. Solution: keep it simple. Heuristic + optional cascade. No more than 5 rules.
Building a Routing Dashboard
You need visibility into your routing decisions.
Track per day:
- % of queries routed to cheap model
- % of queries routed to expensive model
- Quality metrics broken down by routing decision
- Cost broken down by routing decision
- User satisfaction broken down by routing decision
This shows you: are we routing correctly? Are cheap-routed queries acceptable quality? Are expensive-routed queries worth the cost?
If 30% of cheap-routed queries have low quality, you're routing wrong. Adjust thresholds.
If expensive-routed queries have the same quality as cheap-routed, you're routing too conservatively. Lower the threshold.
If your cheap model cost increased 20% (because you routed more queries to it), that's fine. If your overall cost decreased, routing is working.
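A sketch of the daily rollup behind that dashboard, assuming each request is logged as a record with the model used, its cost, a quality score, and a satisfaction flag (all assumed field names).

```python
# Per-day routing rollup sketch. The record shape (model, cost, quality,
# satisfied) is an assumed logging format, not a standard.
from collections import defaultdict

def daily_rollup(records):
    """records: iterable of dicts with keys model, cost, quality, satisfied."""
    agg = defaultdict(lambda: {"n": 0, "cost": 0.0, "quality": 0.0, "satisfied": 0})
    total = 0
    for r in records:
        a = agg[r["model"]]
        a["n"] += 1
        a["cost"] += r["cost"]
        a["quality"] += r["quality"]
        a["satisfied"] += int(r["satisfied"])
        total += 1
    return {
        model: {
            "share": a["n"] / total,                 # % of traffic to this model
            "total_cost": a["cost"],                 # spend per model
            "avg_quality": a["quality"] / a["n"],    # quality by routing decision
            "satisfaction": a["satisfied"] / a["n"], # user satisfaction rate
        }
        for model, a in agg.items()
    }
```

If `avg_quality` on the cheap bucket drops while its `share` rises, your thresholds have drifted too aggressive; that is exactly the signal the adjustments above act on.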
A Real Example
Imagine you're building a documentation chatbot. Users ask questions about your API. Current setup:
- All queries go to Claude 3 Sonnet
- 70% of queries are simple (what endpoint do I call for X?)
- 30% are complex (how do I solve Y with your API?)
- Cost: $0.003 per query
- Average queries: 100,000/month
- Monthly cost: $300
You implement routing:
- Simple queries (detected via heuristics) → Claude Haiku
- Complex queries → Claude 3 Sonnet
Expected outcome:
- 70% of queries at $0.0003: contributes $0.00021 to the blended per-query cost
- 30% of queries at $0.003: contributes $0.0009
- Blended: $0.00111 per query
- Monthly cost: $111
- Savings: 63%
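The arithmetic above, as a quick check. Prices and volumes are the ones from the example.

```python
# Blended-cost check for the documentation chatbot example above.
cheap_price, expensive_price = 0.0003, 0.003   # $ per query
cheap_share, expensive_share = 0.70, 0.30      # routing split
monthly_queries = 100_000

blended = cheap_share * cheap_price + expensive_share * expensive_price
monthly = blended * monthly_queries
baseline = expensive_price * monthly_queries   # all-expensive cost
savings = 1 - monthly / baseline

print(f"blended=${blended:.5f}/query, monthly=${monthly:.0f}, savings={savings:.0%}")
# → blended=$0.00111/query, monthly=$111, savings=63%
```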
Quality: simple queries get routed to a model that's still fully capable of answering them. Complex queries get the expensive model. No quality loss.
This is the power of routing: you get the cost of the cheap model for 70% of queries, without sacrificing quality anywhere.
Conclusion
Model routing is one of the highest-leverage optimizations you can make. It's not hard — heuristics + basic cascade is enough. It scales. It's easy to measure and adjust.
The catch: it requires measurement. You need to know which queries are routed where, what quality you got, and whether it was the right decision. Without measurement, routing is a guess.
Start simple. Heuristic rules. Measure results. Add complexity only if you need it.
Most teams ship routing and cut costs 30-50% in the first month. That's not magic. That's just not burning money on easy problems.