Inference Economics in 2025 — Where the Cost Curves Have Settled
The cost-per-token curves moved dramatically through 2024. Where do they sit at the start of 2025, and what does it mean for enterprise architecture decisions?
The cost of LLM inference moved more in 2024 than in any previous year. Commercial API prices dropped sharply as competition intensified. Open-model performance improved. Self-hosted inference matured. The architectural decisions that depended on cost projections from 2023 need to be revisited against 2025 economics.
This piece is a brief practitioner snapshot of where the inference cost curves sit at the start of 2025 — what changed, what the implications are, and what to recalibrate.
What changed in 2024
The directional moves:
Commercial API prices fell substantially
The headline frontier model prices dropped at most providers. Mid-tier and small-model pricing dropped further. The cost of an average enterprise LLM call is materially lower at the end of 2024 than at the start.
Batch APIs at significant discount
Most providers offer batch APIs at roughly 50% of real-time pricing. Adoption is uneven; the discount is genuine.
Open models closed the capability gap
Llama 3.1 405B, Llama 3.3, Mistral Large, and others approached commercial frontier capability on many workloads at much lower per-token cost when self-hosted.
Inference hardware availability improved
H100 and H200 availability stabilised after the 2023 shortage. Pricing on hyperscalers and specialised providers became more competitive.
Prompt caching emerged
Several providers introduced prompt caching that significantly reduces the cost of long-context workloads with repeated prefixes. The discount is workload-specific but can be substantial.
Inference optimisation tools matured
vLLM, TGI, TensorRT-LLM, and others improved throughput and reduced per-call cost on self-hosted infrastructure.
What these changes mean
The case for self-hosting changed
A year ago, self-hosting was justified mainly by data residency. Now it competes on cost at lower volumes than before. The crossover from commercial APIs to self-hosting has shifted toward lower-volume workloads.
This doesn't make self-hosting the right answer for most workloads; the operational complexity remains real. But the cost-justified self-hosting workload set is larger than it was.
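The crossover can be made concrete with a simple break-even calculation. The sketch below assumes a flat monthly self-hosting cost (GPUs plus operations), a small marginal cost per million tokens, and a flat API price per million tokens. The function name and every number are hypothetical placeholders, not quoted prices.

```python
def breakeven_mtok_per_month(api_price_per_mtok: float,
                             selfhost_fixed_monthly: float,
                             selfhost_marginal_per_mtok: float = 0.0) -> float:
    """Monthly volume, in millions of tokens, above which self-hosting is cheaper.

    Simplifying assumption: self-hosting is a fixed monthly cost plus a small
    marginal cost per million tokens; the API is a flat price per million tokens.
    """
    saving_per_mtok = api_price_per_mtok - selfhost_marginal_per_mtok
    if saving_per_mtok <= 0:
        return float("inf")  # the API is never more expensive per token
    return selfhost_fixed_monthly / saving_per_mtok


# Hypothetical figures, for illustration only.
before = breakeven_mtok_per_month(10.0, 20_000.0, 2.0)           # 2500.0
cheaper_hosting = breakeven_mtok_per_month(10.0, 12_000.0, 2.0)  # 1500.0
cheaper_api = breakeven_mtok_per_month(4.0, 12_000.0, 2.0)       # 6000.0
```

Under these assumptions, cheaper hosting pulls the break-even volume down, while a lower API price pushes it back up. Both inputs moved in 2024, so the calculation is worth re-running with current figures rather than inheriting last year's answer.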
Model routing economics shifted
When the cheapest commercial models were 10x cheaper than the most expensive, routing produced order-of-magnitude savings. When the gap is 3-5x, the savings are still material but smaller. Routing is still worthwhile; the absolute savings per call are lower.
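A rough way to see the effect of a narrower price gap is to compute the share of spend that routing saves relative to sending every call to the expensive model. This is a minimal sketch, assuming calls of equal token size and two tiers only; the function name and the 80% routing share are illustrative assumptions.

```python
def routing_saving_fraction(share_to_cheap: float, price_ratio: float) -> float:
    """Fraction of spend saved versus sending every call to the expensive model.

    share_to_cheap: fraction of calls (of equal token size) routed to the cheap model.
    price_ratio: expensive price divided by cheap price (e.g. 10 for a 10x gap).
    """
    if not 0.0 <= share_to_cheap <= 1.0:
        raise ValueError("share_to_cheap must be between 0 and 1")
    if price_ratio < 1.0:
        raise ValueError("price_ratio must be at least 1")
    return share_to_cheap * (1.0 - 1.0 / price_ratio)


# Hypothetical routing share of 80% at two different price gaps.
for ratio in (10.0, 4.0):
    saved = routing_saving_fraction(share_to_cheap=0.8, price_ratio=ratio)
    print(f"{ratio:.0f}x gap: {saved:.0%} of spend saved")
```

In this simplified model the percentage saved falls only moderately as the gap narrows. The absolute saving per call falls further because the expensive baseline itself is cheaper than it was.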
Cost discipline became less urgent but not unnecessary
Workloads whose spend was unsustainable in 2023 are now manageable at low and moderate volumes. High-volume workloads still benefit from discipline; the bills aren't yet trivial.
Long-context workloads became more viable
The combination of falling per-token prices and prompt caching makes long-context workloads more economical than they were. Architectures that used short context for cost reasons can reconsider.
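The interaction between per-token price and caching can be estimated per call. The sketch below assumes that a cached prefix is billed at a fixed fraction of the normal input price on a cache hit and in full on a miss. The discount fraction, hit rate, and prices are hypothetical, and cache-write surcharges (which some providers apply) are deliberately ignored.

```python
def input_cost_per_call(prefix_tokens: int,
                        fresh_tokens: int,
                        price_per_mtok: float,
                        cache_hit_rate: float,
                        cached_price_fraction: float) -> float:
    """Expected input cost of one call with a cacheable shared prefix.

    On a cache hit the prefix is billed at cached_price_fraction of the normal
    price; on a miss it is billed in full. Cache-write surcharges are ignored.
    """
    full_prefix = prefix_tokens * price_per_mtok / 1_000_000
    fresh = fresh_tokens * price_per_mtok / 1_000_000
    expected_prefix = full_prefix * (
        cache_hit_rate * cached_price_fraction + (1.0 - cache_hit_rate)
    )
    return expected_prefix + fresh


# Hypothetical long-context call: 50k-token shared prefix, 1k fresh tokens.
with_cache = input_cost_per_call(50_000, 1_000, 3.0,
                                 cache_hit_rate=0.9, cached_price_fraction=0.1)
without_cache = input_cost_per_call(50_000, 1_000, 3.0,
                                    cache_hit_rate=0.0, cached_price_fraction=0.1)
print(f"with caching: {with_cache:.4f}, without: {without_cache:.4f}")
```

With these placeholder inputs the cached call costs about a fifth of the uncached one. The result is sensitive to the hit rate, so it should be measured rather than assumed.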
Batch processing became more compelling
The roughly 50% discount on batch APIs is large enough that each workload should be deliberately evaluated for batch suitability, rather than left on real-time pricing by default.
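One way to run that evaluation is to list workloads with their latency tolerance and flag the ones that can wait for a batch turnaround. This is a minimal sketch: the `Workload` type, the 24-hour window, and the spend figures are assumptions for illustration, and the actual turnaround and discount should be taken from the provider's terms.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    monthly_cost: float             # current real-time spend
    latency_tolerance_hours: float  # how long results can wait


def batch_savings(workloads, batch_window_hours=24.0, batch_discount=0.5):
    """Return {name: monthly saving} for workloads that can wait for a batch window."""
    return {
        w.name: w.monthly_cost * batch_discount
        for w in workloads
        if w.latency_tolerance_hours >= batch_window_hours
    }


# Hypothetical portfolio.
portfolio = [
    Workload("chat assistant", 8_000.0, 0.0),
    Workload("nightly document tagging", 3_000.0, 48.0),
    Workload("weekly evaluation runs", 1_000.0, 72.0),
]
savings = batch_savings(portfolio)
total_saving = sum(savings.values())  # 2000.0 with the figures above
```

Interactive workloads drop out of the result; anything that tolerates the assumed window is a candidate worth testing.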
The decision changes
A few specific decisions enterprises should revisit:
Model selection
Teams whose workloads defaulted to GPT-4 for capability should re-evaluate that choice. Smaller commercial models, open models, and mid-tier alternatives are competitive on more workloads than they were. The cost differential may have changed enough to flip the decision.
Architecture for cost
Architectures that pre-computed aggressively to avoid model calls may be over-engineered now. The cost of letting more calls happen in real time has dropped.
Self-hosting decisions
Self-hosting cases that were marginal a year ago may be clearly favourable now. The reverse also holds: cases that were marginally unfavourable may now be clearly unfavourable if the relevant commercial API prices dropped enough.
Workload mix decisions
Workloads that were too expensive to consider at 2023 prices may be viable at 2025 prices. The portfolio of AI workloads worth running has expanded.
Long-context adoption
Where long context would simplify the architecture, it is worth reconsidering given prompt caching and per-token price reductions. The cost gap with retrieval-augmented short-context patterns has narrowed.
The metrics to track
For ongoing cost discipline in 2025:
- Cost per call, by workload. The unit-economics measure.
- Cache hit rate, by workload. The discipline indicator.
- Cost per user, per period. The attribution check.
- Cost ratio across model tiers. The routing health check.
- Cost trajectory month-over-month. The trend indicator.
These should be visible at the team and workload level, not just at the org level.
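Most of these metrics fall out of a per-call log. The sketch below assumes each call is recorded with its workload, user, model tier, cost, and token counts; the `CallRecord` fields and function names are hypothetical and would need mapping onto whatever the billing export or gateway actually provides.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CallRecord:
    workload: str
    user: str
    model_tier: str
    cost: float         # cost of this call, in the billing currency
    input_tokens: int
    cached_tokens: int  # input tokens served from the prompt cache


def summarise(records):
    """Aggregate cost per call and cache hit rate by workload, plus cost per user and tier."""
    cost = defaultdict(float)
    calls = defaultdict(int)
    input_tokens = defaultdict(int)
    cached_tokens = defaultdict(int)
    per_user = defaultdict(float)
    per_tier = defaultdict(float)
    for r in records:
        cost[r.workload] += r.cost
        calls[r.workload] += 1
        input_tokens[r.workload] += r.input_tokens
        cached_tokens[r.workload] += r.cached_tokens
        per_user[r.user] += r.cost
        per_tier[r.model_tier] += r.cost
    by_workload = {
        w: {
            "cost_per_call": cost[w] / calls[w],
            # Token-weighted hit rate; 0.0 when no input tokens were logged.
            "cache_hit_rate": (cached_tokens[w] / input_tokens[w]
                               if input_tokens[w] else 0.0),
        }
        for w in calls
    }
    return {
        "by_workload": by_workload,
        "cost_per_user": dict(per_user),
        "cost_by_tier": dict(per_tier),
    }


def month_over_month(previous_cost: float, current_cost: float) -> float:
    """Fractional change in cost between two periods (0.1 means +10%)."""
    if previous_cost == 0:
        raise ValueError("previous_cost must be non-zero")
    return (current_cost - previous_cost) / previous_cost


# Hypothetical records for one period.
records = [
    CallRecord("support", "alice", "small", 0.02, 1_000, 800),
    CallRecord("support", "bob", "small", 0.04, 1_000, 400),
    CallRecord("research", "alice", "frontier", 0.30, 5_000, 0),
]
summary = summarise(records)
```

Grouping by workload first keeps the numbers attributable to a team; the org-level totals can then be derived by summing rather than reported on their own.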
What we keep seeing
Patterns in early 2025 enterprise AI cost reviews:
Many architectures are over-optimised for 2023 economics. Teams that built workarounds for old costs find the workarounds aren't earning their complexity anymore.
Procurement positions need updating. Pricing terms negotiated in 2023 can often be improved significantly. Renegotiation is worthwhile.
Self-hosting decisions need re-evaluation. Decisions that made sense a year ago may have shifted. Periodic review catches this.
The workload portfolio is broadening. Workloads that were uneconomical are becoming viable. The list of "AI workloads we could run" is longer.
Cost discipline as a posture remains valuable. Even with lower per-token costs, the discipline of monitoring, attribution, and budgets prevents the silent inflation that produces surprises.
What we recommend
For enterprise teams reviewing AI cost in 2025:
- Revisit model selection decisions. The cost gradients have changed.
- Re-evaluate self-hosting cases. The economics have shifted at the margin.
- Renegotiate vendor positions. Pricing improvements may be available.
- Reconsider architectures over-optimised for old costs. Simplification may be possible.
- Expand the workload portfolio. Previously uneconomical workloads may now be viable.
- Maintain cost discipline. The discipline still earns its place even with lower per-token costs.
Inference economics in 2025 are materially different from 2023. The architectural decisions, vendor relationships, and workload portfolios that fit 2023 may not fit now. The teams that re-evaluate periodically capture the value as the economics evolve. The teams that lock in to old decisions over-pay for capability they could get more cheaply.