title: The Economics of Running LLMs - How to Cut Costs Without Killing Quality
date: 2026-04-04
excerpt: Where LLM costs come from, prompt caching, batching, quantization, and the build-vs-buy decision for inference.
The Economics of Running LLMs: How to Cut Costs Without Killing Quality
The first time I saw the bill for running LLM systems at Amazon scale, I understood why people spend weeks optimizing token counts. A few percentage points matter. A 10% improvement in cost across millions of requests becomes hundreds of thousands of dollars.
But here's what most teams get wrong: they optimize in the dark. They don't know where costs actually come from. They apply generic tactics without measuring impact. They sacrifice quality for price without understanding the tradeoff.
This is a guide to the economics of LLMs — where money goes, what actually saves it, and how to think about value.
Where Costs Actually Come From
LLM costs are straightforward: you pay per token.
Input tokens. The tokens in your prompt and context window. The longer your context, the higher your per-request cost. If you're using RAG and retrieving 5000 tokens of context, you're paying for those 5000 tokens on every request, even if only 500 are relevant.
Output tokens. The tokens the model generates. A one-sentence response costs less than a multi-paragraph essay. More important: output tokens are usually 3-5x more expensive than input tokens (for Claude Sonnet, for example, $3 per million input tokens vs. $15 per million output tokens). This is because generating a token is compute-heavy; reading one is not.
API calls. Every request to an LLM API has a base cost, often negligible but real. If you're making multiple API calls per user request, those add up.
Inference infrastructure. If you're self-hosting, you're paying for GPUs, memory, bandwidth. This is a fixed cost structure, not marginal. It's usually cheaper than APIs at massive scale, but only if you're truly massive.
Most teams focus on output tokens (because they're expensive) and ignore input tokens (because they feel inevitable). This is a mistake. The fastest path to cost reduction is usually reducing context window size, not optimizing generation length.
Cost principle: Optimize input tokens first. They're the largest lever and the easiest to pull.
Step 0: Measure Before You Optimize
You can't improve what you can't see. Instrument every LLM call before touching a single prompt:
import anthropic
import time

client = anthropic.Anthropic()

def tracked_call(system: str, user: str, model: str = "claude-sonnet-4-5") -> dict:
    """Wrapper that logs cost-relevant metrics for every LLM call."""
    start = time.time()
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        system=system,
        messages=[{"role": "user", "content": user}],
    )
    latency_ms = (time.time() - start) * 1000

    input_tokens = response.usage.input_tokens
    output_tokens = response.usage.output_tokens

    # Approximate pricing in USD per million tokens (check docs for current rates)
    PRICE_PER_M = {
        "claude-sonnet-4-5": {"input": 3.00, "output": 15.00},
        "claude-haiku-4-5-20251001": {"input": 1.00, "output": 5.00},
    }
    p = PRICE_PER_M.get(model, {"input": 3.00, "output": 15.00})
    cost_usd = (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

    print(f"[{model}] in={input_tokens} out={output_tokens} "
          f"cost=${cost_usd:.5f} latency={latency_ms:.0f}ms")

    return {
        "text": response.content[0].text,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cost_usd": cost_usd,
        "latency_ms": latency_ms,
    }

# Use it:
result = tracked_call(
    system="You are a helpful customer support agent.",
    user="What is your return policy?",
)
Run this for a week on production traffic. You'll immediately see which query types are expensive, which prompts are too long, and where quick wins are.
Prompt Caching: How It Works, When It Saves Money
Prompt caching is a game-changer if you're using the same long context repeatedly.
The idea: you send a prompt with a large, repeated context (say, 10,000 tokens of a document or system prompt). The provider caches it. Subsequent requests that reuse the cached content pay a fraction of the price for those tokens — with Anthropic, cache reads are billed at roughly 10% of the normal input rate; discounts vary by provider.
How it works: You mark certain parts of your prompt as cacheable. The first request pays full price (plus, with Anthropic, a ~25% premium on the tokens written to the cache). Requests within a 5-minute window (for Claude; the window refreshes each time the cache is read) reuse that cache and pay the discounted rate. The cache is keyed on exact content — a single character change invalidates it.
When it saves money:
- You have a static system prompt that's large. Cache it. Every request gets the discount.
- You're processing a long document multiple times. Retrieve it once, cache it, process different queries against it.
- You have few users with high request volume. They reuse context.
When it doesn't save money:
- Your context changes on every request. No cache hit.
- You have many users, each with unique context. The cache thrashes.
- Your queries are so simple that context size doesn't matter.
The math: if caching saves 80% on cached input tokens, and input tokens are 40% of your bill, then a 100% cache-hit rate saves 32% overall. That's huge. In practice you'll hit less than that, which is why caching only pays off with high cache-hit rates (>50%).
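That back-of-the-envelope calculation, with the hit rate made explicit (illustrative numbers, not a real bill):

def caching_savings(input_share, cache_discount, hit_rate):
    """Fraction of the total bill saved by prompt caching.

    input_share    - portion of the bill that is input tokens (e.g. 0.40)
    cache_discount - fraction saved on a cached token (e.g. 0.80 = pay 20%)
    hit_rate       - fraction of input tokens actually served from cache
    """
    return input_share * cache_discount * hit_rate

print(caching_savings(0.40, 0.80, 1.0))   # 0.32 -> the ideal 32%
print(caching_savings(0.40, 0.80, 0.5))   # 0.16 -> more realistic: 16%
print(caching_savings(0.40, 0.80, 0.3))   # ~0.10 -> probably not worth the complexity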
To maximize cache hits, consolidate your static content. Put system prompts, few-shot examples, reference documentation — anything that doesn't change — in a cacheable block at the start of your prompt.
Prompt Caching in Practice (Anthropic SDK)
import anthropic

client = anthropic.Anthropic()

# Imagine you have a 10,000-token internal policy document.
# Cache it once — all subsequent requests reuse the cached tokens at ~10% of the input price.
POLICY_DOC = open("company_policy.txt").read()  # large static document

def answer_policy_question(user_question: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=512,
        system=[
            {
                "type": "text",
                "text": "You are a helpful policy assistant.",
            },
            {
                "type": "text",
                "text": POLICY_DOC,
                "cache_control": {"type": "ephemeral"},  # ← mark as cacheable
            },
        ],
        messages=[{"role": "user", "content": user_question}],
    )

    # Check cache hit
    usage = response.usage
    cache_read = getattr(usage, "cache_read_input_tokens", 0)
    cache_write = getattr(usage, "cache_creation_input_tokens", 0)
    print(f"Cache write: {cache_write} | Cache read: {cache_read}")
    # First call: cache_write > 0, cache_read = 0 (full price, plus the cache-write premium)
    # Next calls: cache_write = 0, cache_read > 0 (~10% price!)

    return response.content[0].text

# First call is expensive. All subsequent calls are cheap.
print(answer_policy_question("Can employees work from home?"))
print(answer_policy_question("What is the vacation policy?"))
print(answer_policy_question("What is the vacation policy?"))
Caching tip: Measure your actual cache-hit rate. If it's below 30%, caching isn't worth the engineering complexity.
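To get that number, aggregate the usage counters the SDK already returns. A minimal sketch, assuming you extend the earlier tracking wrapper to also log the two cache counters (the field names follow the Anthropic SDK):

def cache_hit_rate(logged_calls):
    """Fraction of prompt tokens served from cache across a set of logged calls.

    Each entry is expected to carry the raw usage counts: input_tokens (uncached),
    cache_read_input_tokens, and cache_creation_input_tokens.
    """
    read = sum(c.get("cache_read_input_tokens", 0) for c in logged_calls)
    written = sum(c.get("cache_creation_input_tokens", 0) for c in logged_calls)
    uncached = sum(c.get("input_tokens", 0) for c in logged_calls)
    total = read + written + uncached
    return read / total if total else 0.0

# If cache_hit_rate(last_week_of_calls) comes back under ~0.30, skip the caching complexity.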
Batching Strategies
Batching is underrated. If you can wait a few seconds for results, batching can cut costs dramatically.
The simplest approach: collect requests for a few seconds and send them together — as one request carrying multiple prompts if you run your own inference server, or as a burst of concurrent calls against a hosted API. The per-token rate is usually the same, but you save on per-request overhead and get much better infrastructure utilization.
More sophisticated: use asynchronous batch APIs. You send 1000 requests now and collect results later. The provider queues them, processes them when capacity is available, and returns results in bulk. In exchange for the latency, you get a real discount — both Anthropic and OpenAI price batch requests at roughly half the normal per-token rate.
Where batching works:
- You have a queue of work that can wait 5-60 seconds. Chat applications: no. Nightly reports: yes.
- Your queries are independent. Processing one doesn't depend on results from others.
- Your volume is high. Batching 10 requests is pointless. Batching 10,000 saves real money.
Where batching breaks:
- User-facing systems where latency matters. Adding 30 seconds to every response is unacceptable.
- Sequential workflows. If step 2 depends on step 1's output, you can't batch.
A practical setup: keep a queue for non-urgent work. Every 30 seconds, or when you hit 1000 queued requests, send a batch (a minimal sketch of that flush logic follows below). This gets you most of the efficiency gain with minimal latency impact.
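One way that flush logic might look, as an in-memory, single-process sketch. The MicroBatcher class and submit_batch callable are illustrative names, not part of any SDK; submit_batch stands in for whatever actually sends the work, such as the Batch API call shown in the next section:

import time

class MicroBatcher:
    """Collects non-urgent requests and flushes them in bulk."""

    def __init__(self, submit_batch, max_size=1000, max_wait_s=30.0):
        self.submit_batch = submit_batch  # callable taking a list of request dicts
        self.max_size = max_size
        self.max_wait_s = max_wait_s
        self.pending = []
        self.oldest = None  # timestamp of the oldest queued request

    def add(self, request):
        if not self.pending:
            self.oldest = time.time()
        self.pending.append(request)
        self.flush_if_ready()

    def flush_if_ready(self):
        too_big = len(self.pending) >= self.max_size
        too_old = self.oldest is not None and time.time() - self.oldest >= self.max_wait_s
        if self.pending and (too_big or too_old):
            self.submit_batch(self.pending)
            self.pending, self.oldest = [], None

In a real service you would also call flush_if_ready from a timer so a half-full queue doesn't sit indefinitely.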
Async Batching with Anthropic's Batch API
import anthropic
import time

client = anthropic.Anthropic()

# Build a list of requests — up to 10,000 per batch
requests = [
    {
        "custom_id": f"query-{i}",
        "params": {
            "model": "claude-sonnet-4-5",
            "max_tokens": 256,
            "messages": [{"role": "user", "content": f"Summarize: {doc}"}],
        },
    }
    for i, doc in enumerate(my_documents)  # your list of docs
]

# Submit batch — processed in the background, up to 50% cheaper
batch = client.messages.batches.create(requests=requests)
print(f"Batch submitted: {batch.id}")

# Poll until complete (or use a webhook in production)
while True:
    batch = client.messages.batches.retrieve(batch.id)
    if batch.processing_status == "ended":
        break
    print(f"Still processing... {batch.request_counts}")
    time.sleep(60)

# Retrieve all results
for result in client.messages.batches.results(batch.id):
    if result.result.type == "succeeded":
        print(f"{result.custom_id}: {result.result.message.content[0].text[:100]}")
Use this for: nightly report generation, document classification pipelines, bulk summarization, fine-tuning data generation.
The Build vs. Buy Decision for Inference
Should you use an API (OpenAI, Anthropic, etc.) or host your own model?
APIs are cheaper at small scale. You pay per token, no infrastructure costs. Easy to scale up and down. You get the latest models.
Self-hosting is cheaper at large scale. You pay for GPUs upfront. Marginal cost per token is low. You have full control. But: you need to manage infrastructure, deal with GPU scarcity, keep models updated, handle outages.
As a ballpark, the breakeven point is somewhere north of 100M tokens per month of sustained volume. Below that, use an API. Above that, run the math on self-hosting.
This assumes you're using a decent model (Claude, GPT-4). Using an older model locally (Llama 2) might be cheaper sooner, but you sacrifice quality.
Hidden cost of self-hosting: maintenance, GPU downtime, model updates, ops overhead. Most teams underestimate this by 3-5x.
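Running that math can be a few lines. A back-of-the-envelope sketch — every number below is an illustrative assumption you should replace with your own quotes:

# All figures are assumptions — plug in your own.
MONTHLY_TOKENS = 500_000_000          # total tokens/month (input + output)
API_PRICE_PER_M = 6.00                # blended $/1M tokens for a hosted API
GPU_COST_PER_MONTH = 2_500            # one dedicated GPU instance
NUM_GPUS = 2                          # capacity needed for peak load
OPS_OVERHEAD_MULTIPLIER = 3           # hidden-cost factor from the paragraph above

api_cost = MONTHLY_TOKENS / 1_000_000 * API_PRICE_PER_M
self_host_cost = GPU_COST_PER_MONTH * NUM_GPUS * OPS_OVERHEAD_MULTIPLIER

print(f"API:       ${api_cost:,.0f}/month")
print(f"Self-host: ${self_host_cost:,.0f}/month")
print("Self-hosting wins" if self_host_cost < api_cost else "Stick with the API")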
If you do choose to self-host, quantization becomes critical.
Quantization Trade-offs
Quantization means reducing the precision of model weights. Instead of the 16-bit floats most modern models ship with (or 32-bit in older setups), you store weights as 8-bit or 4-bit integers. That cuts memory 2-4x relative to 16-bit weights and often improves throughput. But model quality degrades.
4-bit quantization: ~4x memory savings. Quality degradation: 5-15%, depending on the model. Acceptable for many tasks (classification, simple generation). Risky for reasoning or code.
8-bit quantization: ~2x memory savings. Quality degradation: 1-5%. Usually minimal. Often worth doing.
The key: quantization is a one-time tradeoff. You save real money on hardware, but you're locked into a specific quality level. Test thoroughly before shipping.
Quantization rule: Use 8-bit quantization by default. Only go to 4-bit if you've tested on your actual workload and the quality loss is acceptable.
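For illustration, here's what 4-bit loading looks like with Hugging Face transformers and bitsandbytes — one common stack, not the only one, and the model name is just a placeholder:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder — any causal LM

# 4-bit NF4 quantization: ~4x smaller weights, some quality loss
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# For 8-bit instead, use BitsAndBytesConfig(load_in_8bit=True)

Whichever level you pick, run your own eval set against the quantized and unquantized versions before committing hardware to it.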
Thinking About Cost Per "Unit of Value"
This is the reframe that matters most.
Don't think: "API X costs $0.10 per 1000 tokens."
Think: "I need to solve user problem Y. What's the cheapest way? What's the fastest? What has the best quality?"
Cost per unit of value means:
- A cheap model that solves the problem in one request: $0.01 per solved problem.
- An expensive model that solves it in half as many requests (no retries, no clarifying follow-ups): maybe $0.02 per solved problem — pricier per token, but a smaller gap than the sticker price suggests.
- An expensive model that fails half the time, with no way to catch the failures: those tokens bought nothing, and the effective cost per solved problem blows up.
This reframe changes everything. It means:
Use model routing. Send easy queries to cheap models and hard queries to expensive models (a minimal router is sketched at the end of this section). This can cut costs 20-40% with little or no quality loss.
Invest in prompt quality. A better prompt is clearer to the model, generates shorter outputs, succeeds on the first try. It costs less and delivers better results.
Optimize retrieval. Bad context makes the model hallucinate. It generates longer outputs to compensate. It needs more calls. Spend engineering effort on retrieval quality. It pays dividends.
Build user feedback loops. If 10% of your outputs are wrong, you're wasting 10% of your compute budget on garbage. Fix that, and you've cut costs.
The cheapest path is not always the most obvious one. Sometimes the expensive approach is cheaper per unit of value delivered.
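One minimal way to implement that routing is to let a cheap model do the triage. A sketch — the rubric, thresholds, and prompts are assumptions you'd tune on your own traffic:

import anthropic

client = anthropic.Anthropic()

CHEAP_MODEL = "claude-haiku-4-5-20251001"
STRONG_MODEL = "claude-sonnet-4-5"

def route_and_answer(query: str) -> str:
    # Step 1: ask the cheap model to grade difficulty (a one-word answer)
    triage = client.messages.create(
        model=CHEAP_MODEL,
        max_tokens=5,
        system="Classify the support query as SIMPLE (account status, order lookup, FAQ) "
               "or COMPLEX (multi-step, ambiguous, needs reasoning). "
               "Reply with exactly one word: SIMPLE or COMPLEX.",
        messages=[{"role": "user", "content": query}],
    )
    label = triage.content[0].text.strip().upper()

    # Step 2: send the real work to the cheapest model that can handle it
    model = CHEAP_MODEL if label == "SIMPLE" else STRONG_MODEL
    response = client.messages.create(
        model=model,
        max_tokens=512,
        system="You are a helpful customer support agent.",
        messages=[{"role": "user", "content": query}],
    )
    return response.content[0].text

The triage call adds a small cost of its own at the cheap rate; measure both sides before rolling it out.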
A Real Cost Optimization Example
Let's say you're building a customer support chatbot. You get 1M queries per month. Current setup:
- Average context: 2000 tokens (customer history, documentation)
- Average response: 300 tokens
- Model: Claude 3 Sonnet ($3 per million input tokens, $15 per million output tokens)
- Cost: roughly $10,500/month ($6,000 on input, $4,500 on output)
You want to cut costs. Here's the order I'd attack it:
Week 1: Prompt optimization. Rewrite the system prompt to be clearer and shorter. Remove redundant examples. Your context goes from 2000 to 1200 tokens. Cost: ~$8,100/month. Savings: ~$2,400. Time: 4 hours.
Week 2: Retrieval quality. Your context includes a lot of irrelevant history. Improve the retriever to pull only relevant docs. Context: 800 tokens. Cost: ~$6,900/month. Savings: ~$1,200. Time: 16 hours.
Week 3: Model routing. 60% of queries are simple (account balance, order status). Route those to a cheaper model (GPT-3.5 or Claude Haiku — roughly a tenth of the price). The other 40% stay on Sonnet. Blended cost: ~$3,200/month. Savings: ~$3,700. Time: 8 hours.
Week 4: Caching. Your documentation is static. Cache the system prompt + shared docs. Cache hit rate: 40%. Cost: ~$2,800/month. Savings: ~$400. Time: 4 hours.
Total savings: ~$7,700/month (about 73%). Total time: 32 hours.
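If you want to sanity-check those numbers, the arithmetic fits in a few lines. A sketch using the assumptions above — Sonnet at $3/$15 per million tokens and a routed model at roughly a tenth of the price:

SONNET = {"input": 3.00, "output": 15.00}   # $ per 1M tokens
QUERIES_PER_MONTH = 1_000_000

def monthly_cost(ctx_tokens, out_tokens, price, queries=QUERIES_PER_MONTH):
    """Monthly spend for a given average context/output size and price table."""
    return queries * (ctx_tokens * price["input"] + out_tokens * price["output"]) / 1_000_000

baseline        = monthly_cost(2000, 300, SONNET)   # ≈ $10,500
after_prompts   = monthly_cost(1200, 300, SONNET)   # ≈ $8,100
after_retrieval = monthly_cost(800, 300, SONNET)    # ≈ $6,900

# Week 3: 60% of traffic moves to a model at roughly 1/10th the price
after_routing = 0.4 * after_retrieval + 0.6 * after_retrieval * 0.1      # ≈ $3,200

# Week 4: 40% cache-hit rate, cached input billed at ~10% of the normal rate
input_spend   = (0.4 + 0.6 * 0.1) * monthly_cost(800, 0, SONNET)         # input portion after routing
after_caching = after_routing - input_spend * 0.4 * 0.9                  # ≈ $2,800

print(baseline, after_prompts, after_retrieval, round(after_routing), round(after_caching))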
This is not aggressive optimization. It's just being intentional.
Conclusion
LLM costs scale linearly with usage, which is both a problem and an opportunity. Every optimization compounds. A 10% improvement is 10% forever.
The teams that build sustainable LLM systems are not the ones who found the perfect model or the smartest prompts. They're the ones who:
- Measure costs clearly (by query type, by model, by user)
- Understand the tradeoffs (cheap vs. good vs. fast)
- Optimize methodically (retrieval, prompts, routing, caching)
- Monitor quality alongside cost (because cost-cutting that breaks quality is not actually cost-cutting)
You have more leverage than you think. Use it.