AI & Enterprise AI · 23 April 2024 · 7 min read

LLM Cost Discipline — Engineering Practices That Keep Bills Predictable

Most teams discover LLM cost through the bill. By then, the cost shape is set and hard to change. The engineering practices that keep costs predictable are not exotic, but they have to be in place from the start.

By Intellectual AI Engineering Practice · Collective byline

A recurring conversation with enterprise AI teams six months into production: "Our LLM bill is much higher than we expected." The bill tends to grow faster than usage because the cost-per-call has been slowly creeping up, and nobody was looking. Prompts grew. Retries crept in. The model upgrade brought higher per-token pricing. Function calls multiplied. Each effect was small; together they doubled the bill.

This is a practitioner view of LLM cost discipline — what to instrument, where the leverage actually is, and what habits keep bills predictable rather than surprising.

The cost composition

In a typical retrieval-augmented LLM workload, the cost components are:

Generation calls to the primary model (GPT-4, Claude, equivalents) — usually the largest line
Embedding calls for queries and documents
Retrieval calls to the vector store
Optional judge or evaluation calls
Function execution costs in the downstream systems
Infrastructure for hosting any self-hosted components

Generation dominates in most workloads. Within generation, two factors drive cost:

Input tokens — the prompt, including system instructions, retrieved context, conversation history, examples
Output tokens — what the model produces

Output tokens cost more per token than input tokens on most providers. Even so, input usually accounts for most of the spend, because prompts are much larger than completions in most workloads.
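The arithmetic behind that point can be sketched in a few lines. The prices and token counts here are illustrative assumptions, not any provider's real rates, and `Pricing` and `call_cost` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical per-million-token prices in dollars; not real provider rates."""
    input_per_million: float
    output_per_million: float


def call_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> dict:
    """Split the cost of one generation call into its input and output parts."""
    input_cost = input_tokens * pricing.input_per_million / 1_000_000
    output_cost = output_tokens * pricing.output_per_million / 1_000_000
    return {"input": input_cost, "output": output_cost, "total": input_cost + output_cost}


# Illustrative call: output is priced at 3x input, but the prompt is 20x the completion.
example = call_cost(6_000, 300, Pricing(input_per_million=10.0, output_per_million=30.0))
```

With these assumed numbers the input side costs about $0.06 and the output side about $0.009, so trimming the prompt moves the bill far more than trimming the completion.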

What to instrument from day one

The mechanics:

Token counts per call. Input and output, separately.
Model identifier per call. Which model, which version.
User and workload identifiers. Who was making the request, which workload.
Latency. Latency correlates with token count; outliers surface here.
Cache hit indicators. When a cache responds instead of the model, mark it.
Cost attribution. Each call attributed to a user, a workload, a tenant.

Without this instrumentation, cost optimization is guesswork. With it, the patterns that drive cost are visible.
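One possible shape for that instrumentation is a single record per model call plus a couple of roll-ups. This is a minimal sketch: the field names are illustrative rather than a standard schema, and a real system would write these records to a metrics or logging backend instead of holding them in a list.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CallRecord:
    """One row per model call; field names are illustrative, not a standard schema."""
    model: str
    user_id: str
    workload: str
    tenant: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cache_hit: bool
    cost_usd: float


def cost_by(records, key):
    """Sum cost per value of `key` (e.g. "user_id", "workload", "tenant")."""
    totals = defaultdict(float)
    for record in records:
        totals[getattr(record, key)] += record.cost_usd
    return dict(totals)


def cache_hit_rate(records):
    """Fraction of calls answered from a cache; 0.0 when there are no calls."""
    records = list(records)
    if not records:
        return 0.0
    return sum(r.cache_hit for r in records) / len(records)
```

Keeping input and output tokens as separate fields matters because they are priced differently and tend to grow for different reasons.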

Where the leverage actually is

The interventions that reliably reduce cost without reducing quality:

Prompt minimisation

The largest line in most prompts is retrieved context. A working pattern:

Retrieve more candidates, rerank, send fewer to the model. A reranker is cheaper than the generation step; reducing the context sent to the generator reduces tokens substantially.
Drop low-quality retrievals. If a retrieved chunk's score is below a threshold, exclude it. Tokens spent on irrelevant context are wasted.
Compress conversation history. Long conversations don't need to be sent verbatim. Summary-based history reduces tokens without losing context.
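The first two steps can be sketched as a selection function applied after reranking. This assumes each chunk already carries a reranker score and a token count; the threshold, chunk limit, and token budget are placeholder values to tune per workload, and `Chunk` and `select_context` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    score: float   # relevance score from a reranker, higher is better (assumed 0..1)
    tokens: int    # token count, measured with the target model's tokenizer


def select_context(chunks, min_score=0.5, max_chunks=4, token_budget=1500):
    """Keep the best-scoring chunks that clear the threshold and fit the budget."""
    selected, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        if chunk.score < min_score or len(selected) == max_chunks:
            break  # sorted descending, so nothing later can qualify
        if used + chunk.tokens > token_budget:
            continue  # too large for the remaining budget; try smaller chunks
        selected.append(chunk)
        used += chunk.tokens
    return selected
```

The token budget is the part that makes cost predictable: whatever retrieval returns, the context sent to the generator has a known upper bound.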

Model routing

Not every call needs the most capable model. A working pattern:

Triage with a small model. A small, cheap model classifies the request: easy, medium, hard.
Route easy requests to a cheap model. Most production traffic is easy; the small model handles it adequately.
Reserve the expensive model for hard cases. When the small model flags low confidence, escalate to the larger model.

Done well, this can reduce overall cost by half or more without affecting end-user-visible quality.
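A minimal sketch of the escalation half of this pattern follows. It is a cascade (try cheap, escalate on low confidence) rather than an upfront easy/medium/hard triage step, and it assumes each model is wrapped in a callable that returns an answer together with a confidence value. How that confidence is produced is left open, and the 0.7 threshold is a placeholder.

```python
from typing import Callable, Tuple

# A model here is any callable returning (answer, confidence in 0..1).
# How confidence is obtained (classifier score, self-rating, logprobs) is left open.
Model = Callable[[str], Tuple[str, float]]


def route(prompt: str, cheap: Model, expensive: Model, min_confidence: float = 0.7):
    """Try the cheap model first; escalate only when its confidence is low."""
    answer, confidence = cheap(prompt)
    if confidence >= min_confidence:
        return answer, "cheap"
    answer, _ = expensive(prompt)
    return answer, "expensive"
```

An escalated request pays for both calls, so the saving depends on the share of traffic the cheap model can keep. Logging the returned route label per call gives the model mix directly.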

Caching at multiple layers

Caches that earn their keep:

Embedding cache. The same query produces the same embedding. Cache by canonical query form.
Retrieval cache. The same query against the same index produces the same retrievals. Cache them.
Response cache. For deterministic prompts (low temperature, structured outputs), the same input produces the same output. Cache by input hash.
Partial-context cache. Where prompt-level caching is available, longer prompts benefit substantially.

Cache hit rate is one of the highest-leverage metrics in a mature LLM system.
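As an illustration of the response-cache layer, the sketch below keys an in-memory cache on a hash of the model, prompt, and generation parameters, and tracks its own hit rate. The class and method names are hypothetical, and a production cache would also need expiry and a shared store.

```python
import hashlib
import json


class ResponseCache:
    """In-memory response cache keyed by a hash of everything that affects the output."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(model: str, prompt: str, params: dict) -> str:
        # sort_keys makes the key independent of parameter ordering.
        payload = json.dumps({"model": model, "prompt": prompt, "params": params}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, model, prompt, params, call):
        """Return the cached response, or invoke `call` once and store its result."""
        k = self.key(model, prompt, params)
        if k in self._store:
            self.hits += 1
            return self._store[k]
        self.misses += 1
        self._store[k] = call(model, prompt, params)
        return self._store[k]

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Including the model identifier and parameters in the key avoids serving a response generated under different settings.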

Batch processing

For workloads where latency is not critical, provider batch APIs offer significant discounts in exchange for asynchronous completion. Moving overnight jobs to batch can cut cost by up to around 50% compared with real-time calls for the same workload, depending on the provider's batch pricing.

Output cap

Long completions are expensive. A max_tokens cap that reflects the actual expected response length prevents runaway completions. The default of "no cap" is rarely the right choice.

Streaming where it helps

For user-facing workloads, streaming the response improves perceived latency without changing cost. For non-user-facing workloads, streaming adds overhead with no benefit. Use it where it earns the operational complexity.

Where teams accidentally inflate cost

The patterns we see produce cost surprises:

Retry loops

A call fails; the system retries. The retry succeeds. The cost of the failed attempt is real. Without monitoring, retries can multiply a workload's cost silently. Track retry rates as a primary metric.
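One way to make retry cost visible is to count attempts and accumulate the cost of the failed ones. In this sketch, `attempt` is assumed to be a callable returning `(ok, result, cost_usd)`, which covers billed failures such as an unparseable completion. All names are hypothetical, and backoff is omitted for brevity.

```python
class RetryStats:
    """Counts attempts and the cost of attempts that failed; names are illustrative."""

    def __init__(self):
        self.calls = 0
        self.attempts = 0
        self.wasted_cost = 0.0

    @property
    def retry_rate(self) -> float:
        """Extra attempts per logical call; 0.0 means nothing was retried."""
        return (self.attempts - self.calls) / self.calls if self.calls else 0.0


def call_with_retries(attempt, stats: RetryStats, max_attempts: int = 3):
    """Run `attempt()` up to max_attempts times.

    `attempt` is assumed to return (ok, result, cost_usd), so that the cost
    of a failed attempt (e.g. an unparseable completion) is still recorded.
    """
    stats.calls += 1
    for _ in range(max_attempts):
        stats.attempts += 1
        ok, result, cost = attempt()
        if ok:
            return result
        stats.wasted_cost += cost
    raise RuntimeError(f"gave up after {max_attempts} attempts")
```

Exporting `retry_rate` and `wasted_cost` per workload turns a silent multiplier into a number someone can alert on.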

Loop runaway

A function-calling loop or an agent loop runs longer than intended. Each iteration costs tokens. An unbounded loop can produce 10x the expected cost in a single request. Strict turn limits, enforced as circuit breakers, prevent this.
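A turn limit enforced as a circuit breaker might look like the following. The `step` callable is a hypothetical stand-in for one model-plus-tool iteration that reports whether the loop is done and how many tokens it used. The limits shown are placeholders.

```python
class TurnLimitExceeded(RuntimeError):
    """Raised when the loop is stopped by a limit rather than by finishing."""


def run_agent(step, max_turns: int = 8, max_tokens_total: int = 50_000):
    """Drive an agent loop with hard limits on turns and cumulative tokens.

    `step(turn)` is a hypothetical callable returning (done, tokens_used).
    Returns (turns_taken, tokens_used) when the loop finishes normally.
    """
    tokens = 0
    for turn in range(max_turns):
        done, used = step(turn)
        tokens += used
        if done:
            return turn + 1, tokens
        if tokens >= max_tokens_total:
            raise TurnLimitExceeded(f"token budget exhausted after {turn + 1} turns")
    raise TurnLimitExceeded(f"no result after {max_turns} turns")
```

Raising rather than returning a partial result forces the caller to decide what a stopped loop means, instead of letting it pass as a normal answer.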

Prompt creep

A prompt that started concise grows over time as the team adds instructions, examples, edge-case handling. Six months later, the prompt is three times its original size. Cost per call has risen proportionally. Periodic prompt review catches this.

Conversation history bloat

A conversational system that appends every turn to history sees prompt sizes grow indefinitely. After fifty turns, the prompt is dominated by history. Compression, summarisation, or pruning prevents this.
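A simple version of pruning with optional summarisation is sketched below. Both `count_tokens` and `summarise` are caller-supplied and hypothetical: the first would wrap the target model's tokenizer, the second would typically be a call to a cheap model.

```python
def prune_history(turns, token_budget, count_tokens, summarise=None):
    """Keep the most recent turns that fit in token_budget.

    `count_tokens` and `summarise` are caller-supplied (hypothetical) functions:
    a tokenizer-backed counter and an optional summariser for dropped turns.
    """
    kept, used = [], 0
    for turn in reversed(turns):
        cost = count_tokens(turn)
        if used + cost > token_budget:
            break
        kept.append(turn)
        used += cost
    kept.reverse()
    dropped = turns[: len(turns) - len(kept)]
    if dropped and summarise is not None:
        # The summary's own tokens are not counted against the budget here.
        kept.insert(0, summarise(dropped))
    return kept
```

With this in place the history portion of the prompt stops growing with conversation length and is bounded by the budget plus one summary.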

Expensive model defaulting

Teams start with the most capable model for safety and never re-evaluate. Six months later, every call is going to the expensive model when 80% of calls would work on a cheaper one.

Embedding re-runs

A team re-embeds the entire corpus to switch models or update content. Without batch APIs and careful planning, this is expensive. Plan re-embeddings as deliberate operations.

Budget controls as guardrails

For production workloads, technical budget controls are part of the architecture:

Per-user limits. A user cannot consume more than X tokens or Y dollars per period.
Per-workload limits. A workload has a total budget; exceeding it triggers alerts and, optionally, throttling.
Per-call cost caps. A single call cannot exceed a maximum cost.
Anomaly detection. Unusual cost patterns flagged for investigation.
Circuit breakers. Automatic shutdown if cost rate exceeds policy.

These prevent the worst surprises — a buggy script, an abusive user, a runaway loop — from producing a five- or six-figure bill before anyone notices.
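The per-user and per-call limits can be sketched as a guard that is checked before each request and updated after it. This is an in-memory illustration with hypothetical names. A real deployment would keep the counters in a shared store and reset them each period.

```python
from collections import defaultdict


class BudgetExceeded(RuntimeError):
    """Raised when a request would break a per-call or per-user limit."""


class BudgetGuard:
    """Per-user and per-call spend limits for one accounting period (in-memory sketch)."""

    def __init__(self, per_user_limit_usd: float, per_call_limit_usd: float):
        self.per_user_limit_usd = per_user_limit_usd
        self.per_call_limit_usd = per_call_limit_usd
        self.spent = defaultdict(float)

    def check(self, user_id: str, estimated_cost_usd: float) -> None:
        """Call before the model request; raises instead of letting it through."""
        if estimated_cost_usd > self.per_call_limit_usd:
            raise BudgetExceeded("call exceeds per-call cap")
        if self.spent[user_id] + estimated_cost_usd > self.per_user_limit_usd:
            raise BudgetExceeded(f"user {user_id} would exceed period budget")

    def record(self, user_id: str, actual_cost_usd: float) -> None:
        """Call after the request with the cost computed from real token counts."""
        self.spent[user_id] += actual_cost_usd
```

The estimate passed to `check` can be a worst case derived from the prompt's token count and the call's output cap, which is one more reason to set that cap explicitly.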

Reporting cadence

Cost reporting that drives behaviour:

Daily — overall cost trend, anomalies, top users by cost
Weekly — cost per workload, cache hit rates, model mix
Monthly — cost trends, cost per user, ROI analysis
Quarterly — strategic review, model contract negotiation, architecture-level decisions

Without reports, cost discipline drifts. With reports, the team has a feedback loop.

What we keep seeing

Recurring patterns in production LLM cost engagements:

The first audit always finds the same things. Bloated prompts, no caching, no model routing, no per-user limits. The same problems in different orgs.

Cost optimisation pays back fast. Most teams that invest a sprint in cost discipline see 30-60% bill reduction. The payback is weeks, not months.

Model upgrades are the silent killers. A move from an older model to a newer one often improves quality and worsens cost in the same instant. Without re-evaluation, the cost stays elevated.

Cache hit rates are leading indicators. When cache hit rates drop, something has changed — usage patterns, prompt structure, or cache configuration. Investigate.

Per-user attribution surfaces abuse. A small fraction of users sometimes consumes a large fraction of cost. Visibility lets you act.

What we recommend

For an enterprise team running LLM workloads in 2024:

Instrument cost per call from day one. You cannot manage what you cannot see.
Build model routing into the architecture. Cheap models for easy requests; expensive for hard.
Cache aggressively. Embeddings, retrievals, responses. Each layer compounds.
Cap outputs. The default of unbounded completions is rarely correct.
Set per-user and per-workload limits as circuit breakers.
Review prompts periodically. They grow; trim them.
Treat model upgrades as cost events, not just quality events.

Cost discipline is not a separate engineering concern; it is part of the system design. The teams that build cost awareness into the workload from the start ship sustainable systems. The teams that wait until the bill surprises them spend the next quarter rebuilding what should have been there from the start.
