Intellectual
← All Insights
AI & Enterprise AI7 May 20247 min read

The Case for Smaller Models in Enterprise AI

The default of routing everything to the largest frontier model is a habit, not a strategy. Open and smaller commercial models have closed enough of the gap that the case for using them is now strong for many enterprise workloads.

The default architecture choice for enterprise LLM workloads through 2023 was: route everything to GPT-4, sometimes to Claude, occasionally to GPT-3.5 for cost reasons. That default made sense when smaller models lagged on capability. Through the first half of 2024, the picture has shifted. Llama 3, Mistral's open releases, Phi-3, and Claude 3 Haiku have closed enough of the gap that the case for smaller models in production is now strong.

This is a practitioner view of where smaller models earn their place in enterprise architectures, how to evaluate the trade-off, and what operational reality you take on when you adopt them.

What "smaller" means

Roughly, in current usage:

  • Frontier models — GPT-4, Claude 3 Opus, Gemini 1.5 Pro. Hundreds of billions of parameters effective. Highest capability, highest cost per call, highest latency.
  • Mid-tier models — Claude 3 Sonnet, GPT-4 Turbo, Gemini 1.5 Flash. Strong capability, moderate cost.
  • Small commercial models — Claude 3 Haiku, GPT-3.5 Turbo. Cheaper and faster; capability suitable for many workloads.
  • Open mid-size models — Llama 3 70B, Mixtral 8x22B. Comparable to mid-tier commercial models on many benchmarks.
  • Open small models — Llama 3 8B, Mistral 7B, Phi-3-mini. Useful for narrow tasks; can be fine-tuned for specific workloads.

The capability ranking doesn't fully predict suitability. A small model fine-tuned on a narrow task can outperform a large model used naively on the same task.

Where smaller models work

Workloads where smaller models are genuinely good enough or better:

Classification and routing

Text classification — intent detection, support ticket categorisation, document type identification — is well within reach of small models. A fine-tuned smaller model often outperforms a prompted larger model and costs an order of magnitude less per call.

Structured extraction with constrained schemas

When the extraction schema is well-defined and the inputs are bounded in shape, smaller models do well. The instruction-following capability needed for structured extraction is largely present in models down to the 7B parameter range.

High-volume document processing

When the task is processing many documents at high throughput, the cost differential between large and small models becomes significant. A smaller model that does the job adequately at one-tenth the cost is the better engineering choice.

Latency-sensitive workloads

A smaller model running on dedicated infrastructure can produce responses in tens of milliseconds. Frontier API calls take hundreds of milliseconds at best. For workloads where latency drives experience, the choice is clear.

Privacy-sensitive workloads

Self-hosted smaller models keep data inside your boundary. For workloads with strict data residency or confidentiality requirements, this is decisive.

Where frontier models still earn their cost

Smaller models don't replace frontier models for everything. Where the frontier still earns the cost:

  • Complex reasoning tasks. Multi-step planning, code generation in unfamiliar contexts, deep analysis of complex documents. The capability gap is still significant.
  • Open-ended generation. Drafting documents from limited context, creative writing in support of a workload, generating examples. Frontier models produce higher-quality output here.
  • Edge cases. When the workload covers a long tail of unusual inputs, the frontier model's broader capability handles the tail better.
  • High-stakes decisions. When the output supports a consequential decision, the extra accuracy of the frontier model is worth the cost.

The pattern that works in production: route easy and high-volume work to smaller models; reserve the frontier for the cases that justify it.

The model routing pattern

A working pattern combining models:

  1. A small triage model classifies the incoming request — easy, medium, hard.
  2. Easy requests go to a small model. Most production traffic falls here.
  3. Medium requests go to a mid-tier model. A meaningful fraction.
  4. Hard requests go to a frontier model. A small fraction, where the cost is justified.
  5. Confidence-based escalation. Where any model produces low-confidence output, the next tier takes over.

This pattern can reduce overall cost by 50-80% with minimal quality impact for typical workloads. The triage step is the key piece; investing in it is the unlock.

Self-hosting vs hosted

A real decision when adopting smaller models is whether to self-host or use hosted endpoints:

Hosted small models — Claude Haiku, GPT-3.5 Turbo via the providers' APIs. Cheaper than the frontier but still pay-per-token. Operationally simple.

Hosted open models — Llama 3 or Mixtral via Together AI, Anyscale, AWS Bedrock, Azure ML. Per-token pricing similar to small commercial models. Less operational responsibility than self-hosting.

Self-hosted open models — running Llama 3 or similar on your own infrastructure. Lowest variable cost; significant fixed cost in infrastructure and operations. Justified at high volume; over-investment at low volume.

The choice depends on volume, operational capability, and the data residency profile. A typical adoption curve: start with hosted commercial small models; if volume justifies, move to hosted open models; if volume and compliance justify, move to self-hosted.

What you take on operationally

Self-hosting an open model is a real operational commitment:

  • Inference infrastructure. GPUs, scaling policies, monitoring.
  • Model lifecycle. New model versions release frequently; the team has to evaluate and adopt.
  • Capacity management. Inference throughput, queue management, latency under load.
  • Reliability. A self-hosted model is your problem when it goes down.
  • Cost monitoring. GPU costs are different from per-token costs; the FinOps model has to adapt.
  • Security. Self-hosted models reduce some risks (data leaves the boundary less) and create others (you operate the surface area).

Teams that have been running ML infrastructure are well-prepared. Teams new to it should start with hosted models and consider self-hosting only when the case is strong.

Fine-tuning small models

Where smaller models really shine is in fine-tuning. The economics:

  • Frontier model fine-tuning is rarely available, expensive when it is, and constrained in what you can do with the result.
  • Small open model fine-tuning is accessible, cheaper, and gives you a model you fully control.

A fine-tuned 7B or 8B model on a narrow task often matches or beats a frontier model prompted for the same task — at a fraction of the per-call cost. The training cost is moderate; the inference cost is low; the operational model is yours to define.

This is one of the strongest cases for adopting open models in enterprise: a fine-tuning pipeline that produces narrow expert models is a capability worth building.

What we keep seeing

Patterns in enterprise smaller-model adoption:

Initial scepticism replaced by surprise. Teams expect a significant quality gap and find a manageable one. Easy and medium workloads transition cleanly.

Cost reduction follows. The cost savings from model routing are typically 50-70% of the original bill. The savings fund more sophisticated work elsewhere.

Latency wins are real. Workloads that needed sub-second response times become possible with smaller models when they were impractical with frontier APIs.

Operational burden underestimated for self-hosting. Teams that jump to self-hosting without the ML infrastructure background struggle. Stage the adoption.

Specialisation pays off. A fine-tuned small model for a narrow task is a competitive asset. The frontier model is generally available; your fine-tuned model is yours.

What we recommend

For enterprise teams considering smaller models in 2024:

  1. Audit current usage. Identify the workloads where a smaller model could plausibly work.
  2. Build model routing. The triage step is high-leverage; deploy it across workloads.
  3. Start with hosted small models. Operational simplicity. Establish the quality baseline.
  4. Move to hosted open models when volume justifies. Per-token costs drop further.
  5. Self-host only when the volume, latency, or compliance case is strong.
  6. Build a fine-tuning capability. The narrow specialised model is the real competitive advantage.
  7. Maintain evaluation discipline across model choices. The choice of model is one of many variables; evaluation tells you whether the choice was right.

The default of routing everything to the frontier model was a reasonable habit when smaller models lagged significantly. In 2024 the gap has closed enough that the default is wrong for most enterprise workloads. The teams that adapt will have lower bills, faster systems, and more strategic flexibility. The teams that don't will keep paying frontier prices for work that doesn't need them.

RELATED READING

More from the field.

Service practices the article draws on, related programmes, and other pieces on adjacent topics.

Discuss this work

Bring an enterprise programme.

If anything in this piece resonates with what you're building, talk to us. Senior practitioners engage directly on architecture and delivery.

Work with the practitioners

Bring an enterprise programme.

Architecture audit, new delivery, modernisation, or in-flight rescue — Intellectual engages directly on enterprise programmes with senior practitioners.