Intellectual
← All Insights
AI & Enterprise AI17 September 20248 min read

Self-Hosting Open LLMs in Enterprise — When It's Worth It

Self-hosting open models has gone from a research exercise to a real enterprise option in 2024. The cases where it earns its operational cost are clearer than they were a year ago.

Self-hosting open-weight LLMs has gone from a research exercise to a real enterprise option through 2024. Llama 3 (in particular Llama 3 70B and 8B), Mistral's open releases, Mixtral, Phi-3 — the open weights have closed enough of the capability gap that hosting them inside the enterprise is now a credible architecture for some workloads. The cases where it pays back are clearer than they were a year ago.

This piece is a practitioner view of when self-hosting is worth the operational commitment, what the realistic cost picture looks like, and how to set up self-hosted inference so it actually serves production.

The cases where self-hosting earns its cost

Self-hosting is a real operational commitment. The cases where it pays back:

Data residency or confidentiality constraints

When the data the workload processes cannot leave the boundary — for compliance, contractual, or competitive reasons — hosted commercial models are not an option. Self-hosted models inside the boundary are the path to AI capability for these workloads.

Examples: government workloads where data cannot leave the country; legal workloads where attorney-client privilege constrains data sharing; competitive intelligence workloads where the data is itself the asset.

Sustained high volume

When the workload processes enough requests that the per-call cost differential between commercial APIs and self-hosted infrastructure becomes substantial. The crossover varies by model and workload; for many enterprise scenarios, sustained workloads of a few million calls per month are where self-hosting starts to compete on cost.

Latency-critical workloads

A self-hosted model on dedicated infrastructure can produce responses in tens of milliseconds. Commercial APIs add network latency that pushes p50 to hundreds of milliseconds and p99 much higher. For workloads where latency drives experience, the self-hosted option is the only viable path.

Fine-tuning for narrow tasks

When the workload benefits from a fine-tuned model on a specific task. Commercial fine-tuning is expensive, less flexible, and produces a model you don't fully control. Open model fine-tuning produces a narrow expert that performs better than a general model on the specific task, at a much lower per-call cost.

Independence from vendor changes

Commercial models update silently. Capabilities change; behaviour shifts; pricing evolves. Self-hosted models are stable until you choose to upgrade. For systems where stability matters (regulated workloads, long-lived integrations), this is genuine value.

The cases where self-hosting is the wrong answer

Where self-hosting tends to disappoint:

Low-volume exploratory workloads

When the workload is processing thousands of calls per month, not millions. The infrastructure cost dominates; the per-call savings don't materialise. Commercial APIs are cheaper at this volume.

Workloads needing frontier capability

When the task requires reasoning at the level only the largest commercial models provide. The open models in mid-2024 are competitive with mid-tier commercial models; they trail the frontier on the hardest reasoning tasks.

Teams without inference infrastructure experience

Operating an inference cluster — GPUs, load balancers, scaling policies, monitoring, reliability — is real infrastructure engineering. Teams without that background take it on and produce systems that are unreliable and expensive. Build the skills before betting on self-hosting.

Workloads with bursty demand

A self-hosted cluster is provisioned for the peak. Bursty demand means most of the time the capacity is idle. Commercial APIs scale with demand; the unit economics for bursty workloads favour commercial.

The realistic cost picture

A self-hosted inference setup for a Llama 3 70B-class model in mid-2024:

  • GPU infrastructure — A single H100 GPU costs ~$2-3/hour on hyperscalers. A 70B model in 8-bit quantisation runs on 1-2 H100s; in fp16 needs 2-4. So per-instance is $4-12/hour depending on configuration.
  • Throughput — A well-configured Llama 3 70B instance can serve 50-200 tokens per second for moderate batching. Higher with aggressive batching; lower with conversational latency requirements.
  • Reliability — At least two instances for redundancy. Often four or more for production redundancy and rolling updates.
  • Operating cost — Infrastructure, observability, on-call rotation, periodic upgrades.

For a continuous load equivalent to a few thousand requests per minute, total cost is in the low-mid five figures per month. Commercial API equivalent for the same throughput could be lower or higher depending on the workload's token profile.

For continuous load equivalent to tens of thousands of requests per minute, self-hosting becomes more decisively cheaper than commercial APIs.

The inference stack

A typical production self-hosted inference stack in 2024:

  • Model serving — vLLM is the dominant choice for high-throughput serving; TGI (Text Generation Inference) from Hugging Face is an alternative; specialised offerings like NVIDIA TensorRT-LLM for maximum throughput on Nvidia hardware.
  • Container runtime — Kubernetes for orchestration; specialised operators for GPU workloads.
  • Routing — A gateway in front that handles request routing, queueing, rate limiting.
  • Monitoring — Standard observability plus AI-specific metrics (token throughput, queue depth, request latency by length).
  • Model storage — Where the model weights live (object storage with appropriate caching).
  • Hardware — H100s where available; A100s as the previous-generation choice; consumer GPUs (4090s) for development but not production.

The stack is conventional infrastructure with AI-specific tooling layered on. Teams with strong infrastructure backgrounds adapt quickly; teams without struggle with the basics.

Quantisation and optimisation

A real lever for self-hosted economics is quantisation — running the model in lower precision (8-bit, 4-bit) than the native 16-bit weights. The trade-off:

  • Quality drops, typically modestly. For most tasks, well-implemented 8-bit quantisation produces near-identical results to 16-bit; 4-bit produces measurable but often acceptable degradation.
  • Memory usage drops significantly. A 70B model in 4-bit fits on a single H100 instead of needing two.
  • Throughput improves with quantisation, sometimes substantially.

Quantisation makes the economics of larger open models accessible. A team running 70B-class models in production usually uses 8-bit quantisation; 4-bit for development and lower-stakes workloads.

Operational realities

What teams take on with self-hosting:

Model updates

Open model releases come frequently. The team has to decide whether to upgrade, when to upgrade, how to evaluate the upgrade. Upgrades are migrations, not toggles.

Capacity planning

The cluster has to handle the peak load. Estimating the peak, monitoring against it, scaling proactively — all become regular work.

Reliability engineering

GPUs fail. Networks fail. The serving stack has bugs. The team has to handle these with the rigour of any other production infrastructure.

Security maintenance

The serving stack has security updates. The infrastructure has compliance posture to maintain. AI-specific concerns (model behaviour, prompt injection) add to conventional security work.

Cost monitoring

GPU costs are different from per-token costs. The FinOps model has to adapt. Utilisation, idle time, instance right-sizing all become areas of optimisation.

Vendor relationships

GPU procurement is competitive; relationships with hyperscalers or specialty providers matter for capacity guarantees. Software providers (vLLM, the framework ecosystem) are part of the technology partnership picture.

What we keep seeing

Recurring patterns in enterprise self-hosted LLM engagements:

Teams underestimate the operational commitment. Self-hosting looks like a cost optimisation; it is a capability investment. Teams without infrastructure depth struggle.

The first deployment is more expensive than projected. Capacity sizing, redundancy, monitoring, on-call rotation — all add to the cost model. Realistic projections come from operating experience, not from initial planning.

Quality is competitive on most workloads. Once teams adopt and tune, the quality is comparable to commercial mid-tier models for most enterprise workloads.

The compliance case is decisive when it applies. Workloads with strict data residency or confidentiality have no alternative. Self-hosting is the path to AI capability.

Combination architectures are common. Many enterprises end up with self-hosted models for sensitive workloads and commercial APIs for everything else. The decision is per-workload, not per-organisation.

What we recommend

For enterprise teams considering self-hosting in 2024:

  1. Start with the case. The cost case for self-hosting is workload-specific; don't generalise.
  2. Build infrastructure skills before committing. Self-hosting is real infrastructure; the skills matter.
  3. Trial at small scale before production deployment. The operating model is different from commercial APIs.
  4. Plan capacity for the peak, not the average. Self-hosted economics favour steady high utilisation.
  5. Use quantisation deliberately. Quality and throughput trade-offs matter.
  6. Treat model upgrades as migrations. The release cadence is faster than commercial APIs.
  7. Combine with commercial APIs where it makes sense. Most enterprises end up with both.

Self-hosting open LLMs is a real option in 2024 with real workload categories that justify it. The teams that adopt it deliberately produce strong systems; the teams that adopt it for the wrong reasons (cost optimisation on low volumes, capability matching to commercial APIs) produce expensive and unreliable systems. The case has to be made workload by workload.

Work with the practitioners

Bring an enterprise programme.

Architecture audit, new delivery, modernisation, or in-flight rescue — Intellectual engages directly on enterprise programmes with senior practitioners.