Observability · 12 September 2023 · 8 min read

Enterprise Monitoring Platforms

Datadog, Splunk, Dynatrace, New Relic, the Grafana stack, cloud-native observability — the market has split along distinct operating models with materially different cost curves. A practitioner view of how to choose, when to switch, and where the migrations actually break.

By Intellectual Enterprise Integration Team · Collective byline

The enterprise monitoring platform market in 2023 has split along distinct operating models with materially different cost curves and operational footprints. Datadog has become the SaaS default for many cloud-native estates; Splunk remains entrenched in log-heavy regulated industries; Dynatrace and New Relic compete in the application performance monitoring tier; the Grafana ecosystem (Prometheus, Loki, Tempo, Mimir) has matured into a credible alternative for organisations willing to operate their own observability stack; the hyperscaler-native offerings (Azure Monitor, AWS CloudWatch, Google Cloud Operations) are increasingly viable for cloud-bound workloads.

The question is no longer "what tool to use." The question is which operating model fits the estate's workload, team capability, and cost economics. This piece is the framework we use to evaluate that decision.

The four operating models

The platforms cluster into four operating-model categories:

Category 1 — Full-SaaS, full-coverage. Datadog, Dynatrace, New Relic. The vendor operates everything; the customer instruments applications and consumes the result. Capabilities span metrics, logs, traces, real-user monitoring, synthetic monitoring, security observability. Per-host or per-data pricing.

Category 2 — Log-centric SaaS. Splunk (cloud or on-prem), Sumo Logic, Elastic Cloud. Strong in log aggregation and SIEM-adjacent use cases. Metrics and traces are present but secondary. Often the legacy choice in enterprises that adopted log analytics before unified observability matured.

Category 3 — Self-operated open-source stack. Prometheus + Grafana + Loki + Tempo, possibly with Mimir or Cortex for scale. The customer operates the observability stack itself; the open-source projects provide the capability. Lower licensing cost; higher operational cost.

Category 4 — Hyperscaler-native. Azure Monitor + Application Insights + Log Analytics; AWS CloudWatch + X-Ray; Google Cloud Operations. Tight integration with the cloud's services; weaker integration with other clouds. Per-data pricing similar to SaaS competitors but typically lower for cloud-resident workloads.

Each model has a profile the others can't easily match. The question is which profile fits.

When each model fits

Category 1 (Datadog, Dynatrace, New Relic) fits when:

The estate is mid-to-large with serious observability needs
The team prefers operational simplicity over fine-grained control
The cost economics work — typically when the engineering hours saved exceed the SaaS bill
Coverage across cloud, on-prem, and SaaS environments from a single platform matters

The friction points: cost growth with data volume, lock-in to vendor query languages, occasional gaps in cutting-edge cloud service coverage. The estates that grow into these tools often stay because the operational simplicity is genuinely valuable.

Category 2 (Splunk, Sumo Logic, Elastic) fits when:

The estate is log-heavy and the log analytics value is foundational
Security observability is a primary use case (Splunk's SIEM heritage is real)
Compliance retention requirements drive the architecture
The team has invested deeply in the query language and saved searches

The friction points: cost per unit of data ingested is high, and many estates outgrow the budget for full log retention. Trace and metric capabilities exist but rarely match the Category 1 leaders.

Category 3 (Grafana stack, self-operated) fits when:

The team has the engineering capacity to operate the observability stack
Cost growth in a SaaS model would exceed the engineering cost of self-operation
The estate is technology-forward enough that the open-source ecosystem aligns with how the team works
Specific customisation requirements are easier with open-source than with SaaS

The friction points: real operational work to run the stack at production scale. Estates that adopt Category 3 without the engineering capacity end up with a half-operated observability stack and the operational pain that comes with it.

Category 4 (hyperscaler-native) fits when:

Workloads are predominantly in one cloud
Native cloud-service integration matters (out-of-the-box dashboards for cloud services, integration with cloud identity, alignment with cloud billing)
The team prefers the cloud vendor's operational model
Multi-cloud observability isn't a primary requirement

The friction points: when workloads expand to multiple clouds, the hyperscaler-native model creates per-cloud silos. Migration to a Category 1 or 3 solution often follows.

Cost curves that diverge

The cost shape across the four categories diverges materially as estates scale.

Category 1 typically prices per-host (Datadog Infrastructure) and per-data-volume (logs, custom metrics, traces). For mid-size estates the bill is manageable; for large estates with heavy log volume or high cardinality custom metrics, the bill can grow disproportionately. Datadog cost surprises have been a regular industry conversation since 2021.

Category 2 prices primarily per-data-volume (Splunk ingest, Elastic data nodes). The economics worsen at scale; many large Splunk estates have moved to log tiering strategies to control cost.

Category 3 has near-zero licensing cost but significant engineering cost: typically two to four FTE-equivalents to operate the stack at production scale, plus storage infrastructure cost. Above a certain estate size, that engineering cost is lower than what a SaaS vendor would have charged for the same data volume.

Category 4 pricing tracks the cloud vendor's data ingestion and retention pricing. Inside the cloud, the per-byte cost is often lower than Category 1; cross-cloud or cross-region transfers add costs.

A cost analysis that holds up is one that models actual data volumes against each category's pricing, including the engineering cost for Category 3. We have seen enterprise migration decisions made on assumed cost models that didn't survive contact with actual data volumes.
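As a rough illustration of that kind of model, the sketch below prices one estate against all four categories. Every rate in it is a hypothetical placeholder, not a vendor list price, and the names (`Estate`, `annual_costs`) are ours; the only figure taken from this piece is the two-to-four FTE range for Category 3. Substitute measured volumes and quoted rates before drawing any conclusion from it.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Estate:
    hosts: int
    ingest_gb_per_day: float


# All rates are hypothetical placeholders, NOT vendor list prices.
# Replace them with figures from your own quotes and bills.
SAAS_PER_HOST_MONTH = 20.0      # Category 1: per-host charge
SAAS_PER_GB = 0.50              # Category 1: per-GB ingest charge
LOG_PLATFORM_PER_GB = 1.00      # Category 2: per-GB ingest charge
CLOUD_NATIVE_PER_GB = 0.30      # Category 4: per-GB ingest charge
SELF_OP_FTE_COUNT = 3           # Category 3: within the 2-4 FTE range above
SELF_OP_FTE_ANNUAL = 150_000.0  # Category 3: assumed loaded cost per FTE
SELF_OP_STORAGE_PER_GB = 0.05   # Category 3: assumed storage cost per GB


def annual_costs(estate: Estate) -> Dict[str, float]:
    """Return a rough annual cost for each operating-model category."""
    gb_per_year = estate.ingest_gb_per_day * 365
    return {
        "cat1_full_saas": estate.hosts * SAAS_PER_HOST_MONTH * 12
        + gb_per_year * SAAS_PER_GB,
        "cat2_log_centric": gb_per_year * LOG_PLATFORM_PER_GB,
        "cat3_self_operated": SELF_OP_FTE_COUNT * SELF_OP_FTE_ANNUAL
        + gb_per_year * SELF_OP_STORAGE_PER_GB,
        "cat4_hyperscaler": gb_per_year * CLOUD_NATIVE_PER_GB,
    }


if __name__ == "__main__":
    for name, estate in [("small", Estate(100, 50)), ("large", Estate(5000, 20000))]:
        print(name, annual_costs(estate))
```

With these placeholder rates, the self-operated stack is the most expensive option for the small estate and cheaper than full SaaS for the large one. That reflects only the shape of the curves described above, since the crossover point depends entirely on the real rates.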

OpenTelemetry as the de-risk

The biggest practical change in 2023 is that OpenTelemetry has reached production-ready status for traces and metrics across most major languages and frameworks, with log support still maturing in some SDKs. OpenTelemetry instrumentation in applications produces telemetry that can be routed to any backend that accepts it; switching backends doesn't require re-instrumenting applications.

This dramatically reduces the lock-in cost of vendor choice. An estate that has instrumented with OpenTelemetry can switch from Category 1 to Category 3 (or between vendors in any category) by changing the collector configuration. The migration is non-trivial — dashboards, alerts, and queries need to be rebuilt — but the application-side instrumentation persists.

The estates that have adopted OpenTelemetry can afford to make vendor decisions based on current economics rather than past commitments. The estates that haven't are tied to whatever vendor-specific instrumentation they invested in.

The recommendation regardless of category: use OpenTelemetry for application instrumentation. The portability benefit is genuine and the cost is modest.
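One way to picture the portability: the backend choice lives in deployment configuration, not in application code. The sketch below builds the environment for an OpenTelemetry-instrumented service using standard OpenTelemetry SDK environment variables (`OTEL_SERVICE_NAME`, `OTEL_EXPORTER_OTLP_ENDPOINT`, `OTEL_EXPORTER_OTLP_HEADERS`, and the exporter selectors). The backend names, endpoint URLs, and header value are hypothetical, and the same idea applies if the routing is done in a collector rather than in the SDK exporter.

```python
# Backend names, URLs, and the header value are hypothetical placeholders.
# The variable names are standard OpenTelemetry SDK environment variables.
BACKENDS = {
    "self_hosted_collector": {
        "OTEL_EXPORTER_OTLP_ENDPOINT": "http://otel-collector.internal.example:4317",
    },
    "saas_vendor": {
        "OTEL_EXPORTER_OTLP_ENDPOINT": "https://otlp.vendor.example:4317",
        "OTEL_EXPORTER_OTLP_HEADERS": "api-key=REPLACE_ME",
    },
}


def otel_env(service_name, backend):
    """Build the environment an OTel-instrumented service is started with."""
    env = {
        "OTEL_SERVICE_NAME": service_name,
        "OTEL_TRACES_EXPORTER": "otlp",
        "OTEL_METRICS_EXPORTER": "otlp",
    }
    env.update(BACKENDS[backend])
    return env


def switch_diff(service_name, old, new):
    """Return the variable names that differ between two backend choices."""
    before = otel_env(service_name, old)
    after = otel_env(service_name, new)
    return sorted(
        key for key in before.keys() | after.keys()
        if before.get(key) != after.get(key)
    )


if __name__ == "__main__":
    print(switch_diff("checkout", "self_hosted_collector", "saas_vendor"))
```

In this sketch only the exporter endpoint and headers change between backends; the service name and instrumentation settings stay as they were. Dashboards, alerts, and queries still have to be rebuilt separately.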

When to migrate

Migration between observability platforms is operationally substantial. Don't migrate casually. The cases where migration usually pays back:

Cost growth has become disproportionate. The bill is growing faster than the estate, or the bill is approaching the engineering cost of self-operation. Recalculate; if migration economics genuinely work, plan it.

A new use case isn't supported by the current platform. Real-user monitoring, security observability, cloud-native service coverage — sometimes the platform doesn't fit a new requirement and migration is the only path.

Vendor relationship has materially changed. Acquisition, product direction change, support deterioration. Sometimes the relationship breaks and migration is forced.

Workload shape has changed. A formerly single-cloud estate that's now genuinely multi-cloud may have outgrown the hyperscaler-native model.

The cases where migration usually doesn't pay back:

"The new platform looks better" without a specific gap
Cost is unfavourable but engineering cost to migrate exceeds the cost gap
Migration would be a multi-year programme that other priorities will interrupt
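The cost-driven cases above reduce to two checks: is the bill growing faster than the estate, and does the annual saving recover the one-off migration cost within a horizon the organisation will actually sustain? A minimal sketch follows, with hypothetical function names and illustrative figures only.

```python
def bill_outpacing_estate(bill_growth_pct, estate_growth_pct):
    """True when the bill is growing faster than the estate it monitors."""
    return bill_growth_pct > estate_growth_pct


def payback_years(current_annual, target_annual, migration_cost):
    """Years needed for the annual saving to recover the migration cost.

    Returns None when the target platform is not cheaper, which is the
    case where migration on cost grounds alone does not pay back.
    """
    annual_saving = current_annual - target_annual
    if annual_saving <= 0:
        return None
    return migration_cost / annual_saving


if __name__ == "__main__":
    # Illustrative figures only, not benchmarks.
    print(bill_outpacing_estate(bill_growth_pct=40, estate_growth_pct=15))
    print(payback_years(1_000_000, 600_000, 800_000))
```

A payback period longer than the likely life of the programme is the quantitative form of the second and third "doesn't pay back" cases listed above.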

Migration patterns

When migration is the right call, three architectural patterns apply:

Pattern A — Full cutover. The new platform is stood up; instrumentation is migrated; alerts and dashboards are rebuilt; the old platform is decommissioned on a defined date. Highest disruption, fastest result. Works for smaller estates.

Pattern B — Parallel operation, gradual cutover. Both platforms run in parallel. New workloads land on the new platform; existing workloads migrate in tranches. Most common pattern for large estates.

Pattern C — New-only on the new platform. Existing workloads stay on the legacy platform indefinitely; only new workloads land on the new one. Lowest disruption, longest tail. Appropriate when the legacy platform is functionally acceptable and migration cost can't be justified.

Most enterprise migrations end up using Pattern B. The disciplines that distinguish smooth migrations from rough ones: dashboard parity before cutting over alerts, alert noise tuning on the new platform before relying on it, retention strategy for the legacy platform's historical data, and a defined decommission date that's actually held.

What we recommend

For an enterprise evaluating observability platforms:

Audit the actual data volumes and use cases. Choose against the actual estate, not the brochure.
Default to OpenTelemetry for instrumentation regardless of vendor choice. The portability matters.
Pick the category that fits the operating model and cost curve. Don't try to optimise for every dimension.
Build cost monitoring into the observability stack itself. Surprise bills are predictable failures.
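As one possible shape for the cost monitoring mentioned in the last point, the sketch below projects month-end ingest spend from the daily volumes seen so far and compares it with a budget. The simple run-rate projection, the function names, and the price are assumptions for illustration. In practice the daily figures would come from whatever usage metrics the chosen platform exposes.

```python
def projected_month_spend(daily_gb, days_in_month, price_per_gb):
    """Project month-end spend from ingest so far, using a simple run rate."""
    if not daily_gb:
        return 0.0
    average_gb = sum(daily_gb) / len(daily_gb)
    return average_gb * days_in_month * price_per_gb


def over_budget(daily_gb, days_in_month, price_per_gb, monthly_budget):
    """True when the projected spend exceeds the monthly budget."""
    projected = projected_month_spend(daily_gb, days_in_month, price_per_gb)
    return projected > monthly_budget


if __name__ == "__main__":
    # Hypothetical ingest for the first three days of a 30-day month.
    ingest_so_far = [100, 120, 140]
    print(projected_month_spend(ingest_so_far, 30, price_per_gb=0.5))
    print(over_budget(ingest_so_far, 30, price_per_gb=0.5, monthly_budget=1500))
```

Wiring a check like this into the same alerting path as production alerts makes a budget overrun visible during the month, not only when the invoice arrives.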

For an existing observability estate showing cost or capability strain:

Run the actual cost model against alternatives. The economics often justify migration but sometimes don't.
If migration is the answer, plan Pattern B parallel operation. In large estates, single cutovers fail more often than they succeed.
Adopt OpenTelemetry during the migration; it de-risks future moves.
Set a decommission date for the legacy platform and hold it.

The observability platform market has matured into clear category leaders with distinct operating models. The choice is less about which tool is best and more about which operating model fits the estate. The estates that recognise this distinction make decisions that hold up; the estates that pick a tool based on feature comparisons often switch again within three years.
