title: Reasoning Models Aren't Always Worth the Cost
type: Realization
date: 2026-04-04
excerpt: I ran the numbers on when o3 actually beats Claude Sonnet, and the answer surprised me. Spoiler - it's not most tasks.
Reasoning Models Aren't Always Worth the Cost
There's a lot of hype around reasoning models (OpenAI's o3, Anthropic's newer reasoning variants, etc.). They're supposed to be smarter, more reliable, better at hard problems.
They are. But they cost roughly 40x more per token and take 10-50x longer.
I spent a few months running these models against production workloads and measuring the tradeoff. The answer: reasoning models are worth it for maybe 20% of tasks. For the other 80%, you're paying for intelligence you don't need.
This is a guide to which model to use for which tasks, based on actual data.
What Reasoning Models Are
Reasoning models use test-time compute — they spend extra compute at inference time (not training time) to think through hard problems.
The mechanism: chain-of-thought. Before answering, the model writes out reasoning. It explores possibilities, considers alternatives, catches its own mistakes. Then it answers.
This extra thinking is computationally expensive, but it produces better results on hard tasks.
o3 (OpenAI) pricing: ~$0.20 per 1000 input tokens, $0.60 per 1000 output tokens. (Compare: GPT-4o is ~$0.005/$0.015.)
Claude Sonnet (Anthropic) baseline: ~$0.003 per 1000 input tokens, $0.015 per 1000 output tokens.
Latency: o3 takes 30-120 seconds (reasoning time). Claude Sonnet takes 1-5 seconds. (User-facing systems can't usually wait 120 seconds.)
So on these list prices you're paying roughly 40x more per token (more if your prompts are input-heavy) for 2-10x better performance on hard tasks. The math only works if the task is actually hard.
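To make the multiple concrete, here's a quick per-call comparison using the list prices above (the token counts are illustrative, not measured):

```python
def call_cost(input_tokens: int, output_tokens: int,
              price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Dollar cost of a single call at per-1K-token list prices."""
    return (input_tokens / 1000) * price_in_per_1k \
         + (output_tokens / 1000) * price_out_per_1k

# Illustrative sizes: a 100-token prompt, a 2,000-token answer.
sonnet = call_cost(100, 2000, 0.003, 0.015)  # ~$0.03
o3 = call_cost(100, 2000, 0.20, 0.60)        # ~$1.22
print(f"o3 costs {o3 / sonnet:.0f}x more on this mix")  # ~40x
```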
Task Types Where Reasoning Models Win
Reasoning models shine on tasks that require multi-step logic, self-correction, or deep reasoning:
Math and Quantitative Reasoning
A user asks: "If a train leaves Boston at 8am going 60mph, and another train leaves New York at 9am going 70mph, and they're 200 miles apart, when do they meet?"
Claude Sonnet: ~60% correct on hard variants. o3: ~95% correct.
Improvement: 35 percentage points. This is real.
Cost per problem: Sonnet $0.01, o3 $0.15. Sonnet gets one wrong roughly every 2.5 problems; o3, roughly every 20. The extra $0.14 per problem works out to about $0.40 per wrong answer avoided. That's meaningful value if mistakes cost you anything.
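As a sanity check on that arithmetic (a sketch using the rough accuracy and price figures just quoted):

```python
def cost_per_avoided_error(acc_cheap: float, acc_expensive: float,
                           price_cheap: float, price_expensive: float) -> float:
    """Extra spend per wrong answer eliminated by upgrading models."""
    errors_avoided = acc_expensive - acc_cheap  # per problem
    extra_cost = price_expensive - price_cheap  # per problem
    return extra_cost / errors_avoided

# Math tasks above: Sonnet 60% correct at $0.01, o3 95% correct at $0.15.
print(cost_per_avoided_error(0.60, 0.95, 0.01, 0.15))  # ~$0.40 per avoided error
```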
Code Generation and Debugging
A user asks: "Write a function that finds the longest palindromic subsequence in a string, with O(n²) time and O(n) space."
Claude Sonnet: produces code, ~70% of submissions pass all tests. o3: produces code, ~95% of submissions pass all tests.
Improvement: 25 percentage points.
If you're using this for production code (where correctness matters), the extra cost is justified.
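For reference, the palindrome prompt above has a standard dynamic-programming answer. A minimal Python sketch, rolling two DP rows to stay within O(n) space:

```python
def longest_palindromic_subsequence(s: str) -> int:
    """Length of the longest palindromic subsequence: O(n^2) time, O(n) space."""
    n = len(s)
    if n == 0:
        return 0
    prev = [0] * n  # DP row for start index i+1
    for i in range(n - 1, -1, -1):
        curr = [0] * n  # DP row for start index i
        curr[i] = 1     # a single character is a palindrome
        for j in range(i + 1, n):
            if s[i] == s[j]:
                curr[j] = prev[j - 1] + 2
            else:
                curr[j] = max(prev[j], curr[j - 1])
        prev = curr
    return prev[n - 1]

assert longest_palindromic_subsequence("bbbab") == 4  # "bbbb"
```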
Complex Multi-Step Problems
"Plan a 2-week Europe trip for a family of 4, budget $10k, visiting 4 countries, accounting for flights, hotels, food, activities, travel days, time zones, visa requirements."
Claude Sonnet: produces a plan, misses 2-3 constraints (e.g., forgets visa requirement, underestimates food costs). o3: produces a plan, hits all constraints, catches its own mistakes.
Improvement: noticeably better on constraint satisfaction.
Adversarial/Tricky Prompts
Some prompts are designed to trick the model. "A bat and ball cost $1.10. The bat costs $1 more than the ball. How much does the ball cost?" (Answer: $0.05, since ball + (ball + $1.00) = $1.10; many LLMs reflexively answer $0.10.)
Claude Sonnet: ~70% correct on tricky prompts. o3: ~98% correct.
This is a 28 percentage point gap. Real improvement.
Task Types Where Reasoning Models Don't Help
But here's the thing: many tasks don't benefit from more reasoning.
Factual Retrieval
"When was Einstein born?" "What's our return policy?" "What's the capital of France?"
The answer is either in the model's training data or it isn't. Extra reasoning can't conjure a fact the model never learned.
Claude Sonnet: ~95% correct on factual questions. o3: ~96% correct.
Improvement: 1 percentage point. Not worth a 40x price multiple.
Simple Classification
"Is this email spam or not spam?" "Is this customer compliment or complaint?"
Claude Sonnet: ~92% correct. o3: ~93% correct.
Improvement: 1 percentage point.
The task is simple enough that any reasonable model gets it right. More reasoning adds latency without adding accuracy.
Summarization
"Summarize this article in 2 sentences."
Claude Sonnet: produces a good summary, missing one minor detail. o3: produces a good summary, missing the same detail.
Improvement: essentially zero.
Summarization is a pattern-matching task. The model either captures the essence or it doesn't. Reasoning doesn't help.
Simple Generation
"Write a fun fact about penguins." "Write a haiku about rain."
Claude Sonnet: produces good creative content. o3: produces slightly more polished content, but not meaningfully so.
Improvement: marginal.
These tasks don't have a single correct answer. The model's first instinct is usually fine. More reasoning doesn't improve creativity.
Standard Information Extraction
"Extract the date, amount, and vendor from this invoice."
Claude Sonnet: ~98% correct. o3: ~99% correct.
Improvement: 1 percentage point.
The Decision Framework
Here's how to decide which model to use:
Use reasoning models (o3, etc.) if:
- The task requires multi-step reasoning (math, logic puzzles, complex planning)
- Correctness is critical (you can't afford mistakes)
- The baseline model (Sonnet) gets it wrong > 20% of the time
- Latency is not a constraint (can wait 30+ seconds)
- The user can't easily verify the answer themselves
Use fast models (Claude Sonnet, GPT-4o) if:
- The task is factual or retrieval-based
- You can verify correctness easily (user feedback, ground truth)
- Latency matters (user-facing systems)
- The cost of errors is low (user can check the answer)
- The task is language-generative (summarization, writing, classification)
Decision matrix:
| Task Type | Complexity | Correctness Criticality | Best Model |
|---|---|---|---|
| Math | High | High | o3 |
| Code generation | High | High | o3 |
| Logic puzzles / planning | High | High | o3 |
| Factual QA | Low | Medium | Sonnet |
| Classification | Low | Medium | Sonnet |
| Summarization | Medium | Low | Sonnet |
| Creative writing | Medium | Low | Sonnet |
| Customer support | Medium | Medium | Sonnet |
| Basic retrieval | Low | Medium | Haiku |
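In code, the matrix collapses to a small routing table. A minimal sketch; the task labels and model names here are placeholders, not a real API:

```python
# Route each task type to the cheapest model that handles it well.
ROUTES = {
    "math": "o3",
    "code_generation": "o3",
    "planning": "o3",
    "factual_qa": "sonnet",
    "classification": "sonnet",
    "summarization": "sonnet",
    "creative_writing": "sonnet",
    "customer_support": "sonnet",
    "basic_retrieval": "haiku",
}

def pick_model(task_type: str) -> str:
    # Unknown task types fall back to the fast, cheap default;
    # escalate to a reasoning model only for known-hard types.
    return ROUTES.get(task_type, "sonnet")

assert pick_model("math") == "o3"
assert pick_model("summarization") == "sonnet"
```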
Real-World Numbers
I tested this on production workloads. Here's what I found:
Customer support chatbot (50,000 queries/month):
- 70% factual questions → use Haiku ($50/month)
- 25% classification → use Sonnet ($150/month)
- 5% complex reasoning → use o3 ($500/month)
- Blended cost: $700/month
If we used o3 for everything: $5000/month (7x higher). If we used Haiku for everything: quality drops, support tickets increase (net cost higher due to churn).
Code generation (1000 requests/day):
- 60% simple functions (CRUD, helpers) → use Sonnet ($100/month)
- 30% complex functions → use o3 ($1500/month)
- 10% one-liners → use Haiku ($10/month)
- Blended: $1610/month
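These blends are just volume-weighted sums. A tiny helper to re-run them on your own traffic mix (the per-query costs are back-of-envelope assumptions derived from the monthly figures above, not list prices):

```python
# Assumed per-query costs, roughly consistent with the buckets above.
PER_QUERY = {"haiku": 0.001, "sonnet": 0.012, "o3": 0.20}

def monthly_cost(volume: int, mix: dict[str, float]) -> float:
    """Blended monthly spend for a traffic mix (model -> share of volume)."""
    return volume * sum(share * PER_QUERY[model] for model, share in mix.items())

support_mix = {"haiku": 0.70, "sonnet": 0.25, "o3": 0.05}
print(monthly_cost(50_000, support_mix))  # ~$685/month with these assumptions
```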
Bug rates:
- Sonnet: 8% of generated code has bugs that block tests
- o3: 1% of generated code has bugs
- This is worth paying more for, because buggy code costs engineering time to fix.
Marketing copy generation:
- All tasks use Claude Sonnet
- o3 produces marginally better copy (not worth 15x the cost)
- Cost: $50/month
The Honest Assessment
Reasoning models are genuinely smarter. They're not hype.
But smartness isn't free, and it's not always necessary.
The uncomfortable truth: most tasks don't need reasoning. They need:
- Better retrieval (pull the right information)
- Better prompts (ask clearly)
- Better evaluation (measure what matters)
A well-engineered Sonnet system beats a poorly-engineered o3 system. Every time.
That said, for the 20% of tasks where reasoning matters — math, complex logic, code, multi-step planning — reasoning models are worth it. The ROI is positive.
My Recommendation
Start with Claude Sonnet. It's fast, cheap, and good for most tasks.
Build infrastructure to measure quality. Evaluate a sample of outputs. Identify which tasks are failing.
For failing tasks, test with o3. Does o3 fix them? If yes, route that task type to o3. If no, the problem is elsewhere (retrieval, prompt, evaluation), and a more expensive model won't fix it.
Route based on task type. Easy tasks to Haiku (for cost). Medium tasks to Sonnet. Hard tasks to o3.
Measure the full cost of errors. If an error in math costs you 10x more than the model call, reasoning models are cheap. If an error in summarization costs you nothing (user just re-reads), reasoning models are expensive.
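That last point is just expected value per call. A sketch, reusing the per-call prices from the math example and an assumed (not measured) $50 of engineering time per bug:

```python
def expected_cost(model_price: float, error_rate: float, error_cost: float) -> float:
    """Per-call model spend plus the expected downstream cost of its mistakes."""
    return model_price + error_rate * error_cost

# Code generation with the bug rates above; $50/bug is a placeholder.
sonnet = expected_cost(0.01, 0.08, 50.0)  # $0.01 + $4.00 = $4.01 per request
o3 = expected_cost(0.15, 0.01, 50.0)      # $0.15 + $0.50 = $0.65 per request
print("o3 wins" if o3 < sonnet else "sonnet wins")  # o3 wins
```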
This is the framework. Apply it to your workload. The answer will surprise you.
Most teams could cut 40-50% of their LLM costs by using the right model for each task. But they don't, because they don't measure. Measure, and the answer becomes obvious.