Spacetime Agents

LLM cost optimization 2025: cut inference spend safely

Haven Vu, Founder & CEO of Spacetime · 4 min read

TL;DR

To cut LLM inference costs without breaking production, start by measuring cost per successful request, not cost per token. Then apply the big levers in order: caching and deduplication, model routing to smaller models for most requests, batching, and only then quantization or self-hosting. Most teams can cut spend 30–60% with these basics before touching training or fancy research.

Engineering teams burn cash on inference because they treat LLMs like magic. They aren't magic; they are software. When the CFO asks why the bill doubled, model choice is rarely the culprit. The real answer is usually that the plumbing hasn't been built yet.

Most “LLM cost optimization” advice focuses on quantization or fancy research techniques. Real wins come from boring, standard engineering patterns applied to probabilistic systems.

The Problem

Everyone tracks token spend. Almost nobody tracks the cost of a failed chain. If your agent tries three times to parse a JSON object and fails, you just bought three expensive failures plus the engineering hours to fix them. That is the hidden tax of poor reliability.

Focus on reliability first. Optimizing for token price before the system is stable targets the wrong metric.

What should you measure before optimizing AI inference costs?

Measure what the business actually pays for: the cost per successful outcome.

  • Cost per successful request: The real price tag includes retries, fallbacks, and every wasted token spent on a hallucination.
  • P95 latency: A cheap model is expensive if it drives users away with 10-second load times.
  • Throughput ceilings: Rate limits and queue depth kill production faster than pricing models do.
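As a minimal sketch of the first metric, here's one way to compute cost per successful request from request logs. The `RequestLog` shape, field names, and per-1k-token pricing are assumptions for illustration, not a prescribed schema; the point is that failed attempts and retries stay in the numerator while only successes land in the denominator.

```python
from dataclasses import dataclass

@dataclass
class RequestLog:
    tokens_in: int
    tokens_out: int
    succeeded: bool

def cost_per_successful_request(logs, price_in_per_1k, price_out_per_1k):
    """Total spend (including failed attempts and retries) divided by
    the number of requests that actually succeeded."""
    total_cost = sum(
        r.tokens_in / 1000 * price_in_per_1k
        + r.tokens_out / 1000 * price_out_per_1k
        for r in logs
    )
    successes = sum(1 for r in logs if r.succeeded)
    if successes == 0:
        return float("inf")  # you paid, and nothing worked
    return total_cost / successes
```

With one failed retry preceding each success, this metric is double the naive per-request cost, which is exactly the hidden tax the section describes.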

A Series B fintech handling ~50k support tickets/month in Q3 routed every request through GPT-4o. This ran up a $6,000/month bill. We found that 40% of queries were simple "reset password" or "status check" requests. We cached the static responses. We routed intent classification to a smaller model (Claude 3 Haiku). The bill dropped to $1,800/month overnight. No model training required. Just better plumbing.

Which inference optimization levers actually matter?

Here’s a safe order of operations that protects production stability:

  Lever                            Typical impact   Risk
  Caching and deduplication        High             Low
  Model routing to smaller models  High             Medium
  Batching and streaming           Medium           Low
  Quantization                     Medium           Medium
  Self-hosting                     Medium to high   High

The first lever: aggressive caching

If a prompt enters the system that you’ve seen before, the response belongs in Redis. Embeddings, legal disclaimers, and standard greetings have no business hitting an LLM provider’s API twice. Normalize your prompts so repeats hit the same cache key.
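A minimal sketch of the normalize-then-cache pattern. An in-memory dict stands in for Redis here so the example is self-contained; in production you'd swap it for `redis.Redis().get`/`set` with a TTL. The normalization rule (lowercase, collapse whitespace) is a deliberately simple assumption; tune it to your traffic.

```python
import hashlib

def cache_key(prompt: str) -> str:
    # Normalize so trivially different prompts ("Reset password",
    # "  reset PASSWORD ") collapse to the same cache key.
    normalized = " ".join(prompt.lower().split())
    return "llm:" + hashlib.sha256(normalized.encode()).hexdigest()

_cache = {}  # stand-in for Redis in this sketch

def cached_completion(prompt: str, call_model):
    """Return the cached response when a normalized repeat arrives;
    otherwise pay for one model call and store the result."""
    key = cache_key(prompt)
    if key in _cache:
        return _cache[key]
    response = call_model(prompt)
    _cache[key] = response
    return response
```

The second "reset password" variant never hits the provider's API; only the cache key is recomputed.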

Why are you using a 400B model for JSON?

You don't need a PhD-level model to summarize a two-sentence email. Classify the request difficulty first. Send easy tasks to a smaller, faster model. Reserve the heavy reasoning models for complex queries.

A 7B parameter model handles JSON formatting just fine. Save the 400B model for creative reasoning.
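A sketch of the classify-then-route step. In production the classifier would itself be a small, cheap model; the token-count and keyword heuristic below is an assumption standing in for it, and the tier names are placeholders.

```python
def route(prompt: str) -> str:
    """Route a request to a model tier by rough difficulty.
    Heuristic stand-in for a small classifier model."""
    hard_markers = ("analyze", "compare", "explain why", "draft")
    needs_reasoning = any(m in prompt.lower() for m in hard_markers)
    if len(prompt.split()) > 200 or needs_reasoning:
        return "large-model"   # reserve heavy reasoning for complex queries
    return "small-model"       # 7B-class model handles extraction/formatting
```

Even a crude router like this pushes the bulk of "reset password"-style traffic off the expensive tier; the fallback path for misclassified requests is what keeps it safe.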

Batching creates free throughput

Batching fills idle GPU gaps while streaming hides the wait time. Together, they keep hardware saturated and responses fast. For example, aggregating user requests into 50ms windows allows the GPU to process matrix multiplications in parallel. This doubles throughput without touching the model weights.
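The windowing logic can be sketched as a pure function over timestamped arrivals: a batch closes `window_ms` after its first request, and everything that landed inside the window is processed together. The 50ms value and the `(arrival_ms, payload)` shape are illustrative assumptions; a real serving loop would do this asynchronously off a queue.

```python
def batch_by_window(requests, window_ms=50):
    """Group (arrival_ms, payload) pairs into batches. A batch closes
    window_ms after its first request arrives, so the GPU runs the
    whole group's matrix multiplications in parallel."""
    batches, current, deadline = [], [], None
    for arrival, payload in sorted(requests):
        if current and arrival > deadline:
            batches.append(current)
            current = []
        if not current:
            deadline = arrival + window_ms
        current.append(payload)
    if current:
        batches.append(current)
    return batches
```

Three requests arriving within 50ms of each other become one forward pass instead of three; the straggler at 120ms starts the next window.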

Quantization is dangerous without guardrails

Quantization cuts cost and improves speed, but it sacrifices nuance. The degradation is invisible. Do not ship quantized models without a rigorous evaluation suite to catch the edge cases where quality degrades—like a legal summarizer missing a double negative in a contract clause.
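One way to make that guardrail concrete is a simple agreement gate: the quantized model only ships if it matches the baseline on a curated eval set that includes the nasty edge cases. The threshold and exact-match comparison are assumptions for the sketch; real suites usually score semantic equivalence rather than string equality.

```python
def quantization_gate(eval_cases, baseline_model, quantized_model,
                      min_agreement=0.98):
    """Block a quantized model from shipping unless it agrees with the
    baseline on the eval set (including double negatives, numeric
    fields, and other known failure modes)."""
    agree = sum(
        1 for case in eval_cases
        if quantized_model(case) == baseline_model(case)
    )
    agreement = agree / len(eval_cases)
    return agreement >= min_agreement, agreement
```

The key design choice is that the gate runs in CI, before deploy, so "invisible" degradation becomes a failed build instead of a production incident.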

When should you self-host vs stay on an API?

Stick to an API when traffic is spiky or you are still finding product-market fit. The engineering overhead of managing your own GPUs is massive. Self-hosting becomes viable when utilization is high and predictable. Sometimes compliance requirements force your hand.

Always factor in the on-call load. If you save $500 on compute but burn out your lead engineer, you lost money.
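The break-even arithmetic, including the on-call line item teams forget, can be sketched as a simple comparison. All the numbers below are illustrative assumptions, not benchmarks; the point is that engineering time belongs in the self-hosting column.

```python
def self_host_breakeven(api_cost_per_1k_tokens, monthly_tokens_k,
                        gpu_monthly_cost, oncall_hours, eng_hourly_rate):
    """Compare all-in monthly cost of an API vs self-hosting,
    counting on-call engineering time as a real cost."""
    api_cost = api_cost_per_1k_tokens * monthly_tokens_k
    self_host_cost = gpu_monthly_cost + oncall_hours * eng_hourly_rate
    return {
        "api": api_cost,
        "self_host": self_host_cost,
        "self_host_wins": self_host_cost < api_cost,
    }
```

At a billion tokens a month the GPUs win; at spiky startup volumes the same formula usually says stay on the API.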

What To Do Next

If you need to cut spend fast, follow this path for the next week. Start by instrumenting your cost per successful request. Stop flying blind. Add caching for your top three repeated workflows immediately, then set up model routing with a strict fallback. Keep reliability high.

Run an eval suite before any quantization or hosting change. If you want a team to build this end-to-end and keep it stable, that’s the work we do at Spacetime Agents.

