Observability · 27 June 2023 · 8 min read

Enterprise Logging & Observability

Most enterprise logging architectures were designed for a different era of cost economics. A practitioner view of structured logging, retention tiers, the cost discipline modern observability requires — and why the cheapest log line is often the one not emitted.

By Intellectual Enterprise Integration Team · Collective byline

Most enterprise logging architectures we audit were designed in an era of cheaper storage and less verbose applications. The pattern — log everything, retain it everywhere, search it occasionally — produced operational cost surprises through the late 2010s and is genuinely punitive in 2023's cloud economics. Datadog, Splunk, Sumo Logic, and Elastic invoices have become a regular item in enterprise IT budget reviews.

The pattern needs revisiting. Not because logging is less important — it remains critical — but because the cost economics now reward discipline that they previously did not. This piece is a practitioner view of what enterprise logging architecture should look like in 2023, and why the cheapest log line is often the one that was never emitted.

What logging is for, refined

Logs serve three operational purposes:

Debugging during incidents. When something is wrong now, logs are where engineers look first. The detail needs to be high; the delay between emission and availability for search needs to be short; the time window needed is usually the last few hours.

Forensic investigation. When something went wrong in the past, logs are the historical record. The detail needs to be high; the time window can be days, weeks, or in regulated industries, years.

Audit trail. Specific events that need to be preserved for compliance, regulatory, or contractual reasons. The detail is structured; the retention is long; the access is occasional but high-stakes.

These three uses have different requirements. Designing one log pipeline to serve all three uniformly is the most common source of avoidable enterprise logging cost, because everything ends up paying the price of the most demanding use. The discipline is to split the uses across different storage tiers with different retention.

The three-tier model

The pattern that produces good cost economics:

Tier 1 — Hot. Recent operational logs, kept in fast-search-capable storage for active debugging. Retention: 7-30 days. Storage cost is high per byte; query latency is low. The Datadog, Splunk, or hot Elasticsearch tier.

Tier 2 — Warm. Older operational logs and historical investigation data, kept in cheaper storage with slower query. Retention: 90-365 days. Storage cost is moderate; query latency is acceptable for forensic work. Object storage with a query engine (S3 + Athena, Azure Data Lake + Synapse, equivalent) is the dominant pattern.

Tier 3 — Cold / Audit. Long-term audit and compliance retention. Stored in cheapest available storage; rarely accessed; format is structured for compliance queries when needed. Glacier, Azure Archive, equivalent. Retention measured in years, often 7-15.

The lifecycle policy moves logs from tier to tier automatically. Tier 1 ages out to tier 2 after 30 days; tier 2 ages out to tier 3 after a year; tier 3 deletes per the retention policy.
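The ageing rule above can be sketched as a small function. This is a minimal illustration, assuming the 30-day and one-year boundaries from the text and a 7-year audit retention (the low end of the range quoted); the constant names and the tier labels are ours. In practice the storage platform's own lifecycle rules would do this work rather than application code.

```python
from datetime import datetime, timedelta, timezone

# Assumed thresholds taken from the ranges in the text; tune per estate.
HOT_DAYS = 30
WARM_DAYS = 365
AUDIT_RETENTION_YEARS = 7  # assumption: low end of the 7-15 year range


def tier_for(log_time: datetime, now: datetime) -> str:
    """Return the storage tier a log record belongs in, given its age."""
    age = now - log_time
    if age <= timedelta(days=HOT_DAYS):
        return "hot"
    if age <= timedelta(days=WARM_DAYS):
        return "warm"
    if age <= timedelta(days=365 * AUDIT_RETENTION_YEARS):
        return "cold"
    return "delete"


now = datetime(2023, 6, 27, tzinfo=timezone.utc)
print(tier_for(now - timedelta(days=3), now))     # hot
print(tier_for(now - timedelta(days=120), now))   # warm
print(tier_for(now - timedelta(days=800), now))   # cold
print(tier_for(now - timedelta(days=3000), now))  # delete
```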

The estates that have implemented this tiering see logging costs decrease by 60-80% compared to retaining everything in tier 1. The estates that haven't continue to renegotiate Datadog contracts annually.
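The arithmetic behind tiering is simple enough to sketch. The example below is illustrative only: the prices are placeholder units, not vendor rates, and the 5% warm-to-hot price ratio is an assumption chosen to show the shape of the calculation rather than to confirm any particular saving.

```python
def monthly_storage_cost(daily_gb: float, tiers: list[tuple[int, float]]) -> float:
    """Steady-state storage cost per month.

    Each tier is (days_held_in_tier, price_per_gb_month). At steady state a
    tier holds daily_gb * days_held_in_tier gigabytes.
    """
    return sum(daily_gb * days * price for days, price in tiers)


# Placeholder inputs, NOT vendor prices: hot storage is 1.0 unit per GB-month
# and warm storage is assumed to cost 5% of that.
DAILY_GB = 100.0
all_hot = monthly_storage_cost(DAILY_GB, [(365, 1.0)])
tiered = monthly_storage_cost(DAILY_GB, [(30, 1.0), (335, 0.05)])

print(f"all hot: {all_hot:.0f} units, tiered: {tiered:.0f} units")
print(f"reduction: {1 - tiered / all_hot:.0%}")
```

The output depends entirely on the placeholder price ratio, and the sketch ignores ingest and query charges, which can be a large part of a hot-tier invoice. Substitute contract rates before drawing conclusions.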

Structured logging — non-negotiable

For logs to be useful operationally, they need to be queryable. Free-text log lines are queryable only by string matching, which is slow and produces poor signal-to-noise. Structured logs — JSON or equivalent — are queryable by field, which transforms what's possible during incidents.

The discipline:

Every log line is a JSON object with consistent fields
Required fields: timestamp (ISO 8601), level, service, correlation ID, message
Conditional fields: user ID, tenant ID, business operation, error class
Free text goes in a message field; structured data goes in dedicated fields
No multiline log lines: a stack trace goes into a single string field with its newlines escaped, not across several physical lines

Most languages and frameworks have structured logging libraries that make this practical: log4j2's JSON layout, Python's structlog, Go's zerolog, Node's pino. The cost of adopting structured logging is a library configuration change; the benefit is order-of-magnitude better operational queryability.
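As a minimal sketch of the discipline above, the following uses only Python's standard `logging` module rather than any of the libraries named, so that it runs without dependencies. The `JsonFormatter` class, the field names `correlation_id`, `tenant_id`, `operation` and `error_class`, and the service name are illustrative assumptions, not a prescribed schema.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object on one line."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": self.service,
            # correlation_id and the conditional fields arrive via `extra=`.
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
        }
        for field in ("user_id", "tenant_id", "operation", "error_class"):
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        if record.exc_info:
            # The stack trace is one string field; json.dumps escapes its newlines.
            entry["stack_trace"] = self.formatException(record.exc_info)
        return json.dumps(entry)


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter(service="billing-api"))  # hypothetical service name
logger = logging.getLogger("billing")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("invoice created", extra={"correlation_id": "req-123", "tenant_id": "t-42"})
```

A dedicated library would normally replace the hand-written formatter; the point of the sketch is the shape of the output, one object per line with free text confined to `message`.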

Estates that emit unstructured logs find that their expensive log aggregation product is mostly being used as a poor search engine. Estates that emit structured logs use the same tooling as a queryable operational database.

What not to log

The cheapest log line is the one that was never emitted. The discipline of deciding what doesn't need to be logged is more consequential than the choice of logging platform.

Things that often log too much:

Debug-level logs in production. Useful during development; rarely useful in production. Most platforms support log-level gates; set the production level to INFO.
Health check noise. The application emits a log line every time a load balancer probes its health endpoint. At scale this is millions of identical log lines per day. Suppress at source.
Routine API request/response logging. The gateway already logs these; the application logging the same data again doubles the cost without adding information.
Verbose third-party library output. Many libraries are noisy by default. Configure their log level downward.
Stack traces for handled errors. If the error is handled gracefully, the stack trace is operational noise. Log the error class and how the error was handled; skip the stack trace.
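Several of these suppressions can be applied at the source with a few lines of configuration. The sketch below uses Python's standard `logging` module; the probe paths, the `path` attribute on the record, the helper names, and the choice of `urllib3` as the noisy library are all assumptions for illustration.

```python
import logging


class DropHealthChecks(logging.Filter):
    """Suppress records for load-balancer probes at the source."""

    # Hypothetical probe paths; use the estate's actual health endpoints.
    PROBE_PATHS = ("/healthz", "/ready")

    def filter(self, record: logging.LogRecord) -> bool:
        path = getattr(record, "path", "")
        return path not in self.PROBE_PATHS  # False means "drop this record"


def configure_production_logging(access_logger: logging.Logger) -> None:
    # Production level gate: DEBUG records are never emitted.
    logging.getLogger().setLevel(logging.INFO)
    # Quieten a noisy third-party library (urllib3 is only an example name).
    logging.getLogger("urllib3").setLevel(logging.WARNING)
    access_logger.addFilter(DropHealthChecks())


def log_handled_error(logger: logging.Logger, exc: Exception, action: str) -> None:
    # Handled error: record the class and what was done, with no stack trace.
    logger.warning(
        "handled error",
        extra={"error_class": type(exc).__name__, "handling": action},
    )
```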

These habits accumulate over years. A logging audit — running through what's actually emitted and asking "why is this useful" for each pattern — often finds 30-50% of log volume that adds no operational value.
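A first pass at such an audit can be as simple as grouping a sample of emitted lines by pattern and ranking the patterns by share of bytes. The sketch below assumes records are dictionaries with `service` and `message` keys; the templating rules (collapse long hex identifiers and numbers) are a deliberately crude assumption and would need tuning against real traffic.

```python
import re
from collections import Counter


def message_template(message: str) -> str:
    """Collapse variable parts so similar lines group together."""
    # Long hex-looking identifiers first, then any remaining numbers.
    message = re.sub(r"\b[0-9a-f]{8,}\b", "<id>", message)
    return re.sub(r"\d+", "<n>", message)


def volume_by_pattern(records: list[dict]) -> list[tuple[str, str, float]]:
    """Return (service, template, share_of_bytes), largest share first."""
    sizes: Counter = Counter()
    for rec in records:
        key = (rec["service"], message_template(rec["message"]))
        sizes[key] += len(rec["message"].encode("utf-8"))
    total = sum(sizes.values()) or 1
    return [(svc, tpl, size / total) for (svc, tpl), size in sizes.most_common()]


sample = [
    {"service": "web", "message": "GET /healthz 200"},
    {"service": "web", "message": "GET /healthz 200"},
    {"service": "web", "message": "GET /healthz 200"},
    {"service": "web", "message": "order 1042 shipped"},
    {"service": "web", "message": "order 7 shipped"},
]
for service, template, share in volume_by_pattern(sample):
    print(f"{share:.0%}  {service}  {template}")
```

Each row of the output is then a candidate for the "why is this useful" question.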

The log pipeline architecture

Between application emission and storage, the log pipeline does enrichment, filtering, routing, and transformation:

Collection. Logs are collected from applications by an agent (Fluent Bit, Vector, Datadog Agent, OpenTelemetry Collector) or via direct API calls.

Enrichment. The pipeline adds context — Kubernetes pod metadata, cloud instance metadata, deployment identifiers. The application doesn't have to know these; the pipeline knows them.

Filtering. The pipeline can drop or sample log lines based on rules. High-volume debug logs in non-debug contexts can be dropped here.

Routing. Different log streams go to different destinations. Operational logs to the hot tier. Audit logs to the audit tier. Security-relevant logs to the SIEM. Cost-sensitive logs to a budget tier.

Transformation. The pipeline can reformat logs as needed for downstream consumers. Field renaming, type coercion, redaction of sensitive data.

The modern observability collector (OpenTelemetry Collector) handles most of this with declarative configuration. Earlier generations required scripting at each stage. The pipeline tier is where logging cost is most controllable; investing in it pays back across years.
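To make the stages concrete, here is a toy pipeline written as plain Python functions. It is a sketch of the concept, not of how any particular collector is configured: the stage names, the record fields (`lvl`, `audit`, `security`, `status`), the metadata values and the destination labels are all hypothetical. The sketch normalises and filters before enriching so that dropped records cost no further work, which is an ordering choice of ours.

```python
from typing import Callable, Iterable, Iterator, Optional

Record = dict
Stage = Callable[[Record], Optional[Record]]  # a stage returns None to drop a record


def transform(record: Record) -> Record:
    # Field renaming and type coercion for downstream consumers.
    out = dict(record)
    if "lvl" in out:
        out["level"] = out.pop("lvl")
    if "status" in out:
        out["status"] = int(out["status"])
    return out


def drop_debug(record: Record) -> Optional[Record]:
    return None if record.get("level") == "DEBUG" else record


def enrich(record: Record) -> Record:
    # Hypothetical metadata the pipeline knows and the application does not.
    return {**record, "k8s_pod": "pod-a1", "deployment": "2023-06-27.1"}


def route(record: Record) -> str:
    if record.get("audit"):
        return "audit"
    if record.get("security"):
        return "siem"
    return "hot"


def run_pipeline(records: Iterable[Record], stages: list[Stage]) -> Iterator[tuple[str, Record]]:
    for record in records:
        current: Optional[Record] = record
        for stage in stages:
            current = stage(current)
            if current is None:
                break  # dropped by a filter stage
        if current is not None:
            yield route(current), current


incoming = [
    {"lvl": "DEBUG", "message": "cache miss"},
    {"lvl": "INFO", "message": "login", "security": True, "status": "200"},
    {"level": "INFO", "message": "role changed", "audit": True},
]
for destination, rec in run_pipeline(incoming, [transform, drop_debug, enrich]):
    print(destination, rec["message"])
```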

Redaction and sensitive data

Logs frequently capture data that shouldn't be in logs: API request bodies that include personal identifiers, JWT tokens, internal credentials, customer PII. The discipline of redaction is part of the logging architecture.

The patterns that work:

Field-aware redaction at emission. Application code that logs structured data knows which fields are sensitive and redacts them before emission.
Pipeline-stage redaction. Pattern matching at the pipeline drops or redacts known-sensitive patterns (credit card numbers, JWT signatures, passwords in any field).
Categorisation of streams. Streams that may contain sensitive data are routed to access-controlled storage; streams that definitely don't can flow to broader-access storage.
Audit of accidental sensitive data. Periodic sampling and review to catch new sensitive data patterns that have started flowing.
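The first two patterns can be sketched together. The field list and both regular expressions below are assumptions for illustration: the card pattern matches any 13 to 16 digit run, so it will also catch long numeric identifiers, and a production rule would likely add a checksum test to reduce false positives.

```python
import re

SENSITIVE_FIELDS = {"password", "authorization", "ssn"}  # hypothetical field names

# JWT shape: three base64url segments separated by dots, header starting "eyJ".
JWT_RE = re.compile(r"\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+")
# Rough card-number shape: 13-16 digits, optionally separated by spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def redact_fields(record: dict) -> dict:
    """Field-aware redaction at emission: the application knows its sensitive keys."""
    return {
        key: "[REDACTED]" if key.lower() in SENSITIVE_FIELDS else value
        for key, value in record.items()
    }


def redact_patterns(record: dict) -> dict:
    """Pipeline-stage redaction: scan every string value for known-sensitive shapes."""
    out = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = JWT_RE.sub("[REDACTED_JWT]", value)
            value = CARD_RE.sub("[REDACTED_CARD]", value)
        out[key] = value
    return out


record = {
    "message": "payment with 4111 1111 1111 1111 failed",
    "password": "hunter2",
    "token": "Bearer eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiIxIn0.c2ln",
}
print(redact_patterns(redact_fields(record)))
```

Running both layers is deliberate: emission-time redaction catches what the application knows about, and the pipeline pass catches what slipped through in free text.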

The cost of getting redaction wrong is real — both in compliance terms and in actual data breach exposure. Most enterprise estates have at least one sensitive-data-in-logs issue that hasn't been caught yet.

Querying patterns that work

Operational queries against logs work best when:

Time range is narrow (hours, not days)
Filters are field-based, not text-based
The query returns aggregates (counts, percentiles) rather than raw lines, except for specific investigation
Correlation ID joins the query across services

Operational queries that don't work well:

Full-text search across days of logs
Queries that scan the warm or cold tier without indexing
Queries that try to reconstruct full request behaviour from logs alone (traces are better)
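The query shape that works (narrow window, field filters, aggregate first, then follow one correlation ID) can be shown over an in-memory list. This is only a sketch: real queries would run in the log platform's own query language, and the records, field names and function names here are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

now = datetime(2023, 6, 27, 12, 0, tzinfo=timezone.utc)
# Hypothetical records standing in for the hot tier.
records = [
    {"timestamp": now - timedelta(minutes=50), "level": "ERROR", "service": "checkout", "correlation_id": "req-1"},
    {"timestamp": now - timedelta(minutes=49), "level": "ERROR", "service": "payments", "correlation_id": "req-1"},
    {"timestamp": now - timedelta(minutes=10), "level": "ERROR", "service": "payments", "correlation_id": "req-2"},
    {"timestamp": now - timedelta(minutes=5), "level": "INFO", "service": "checkout", "correlation_id": "req-2"},
    {"timestamp": now - timedelta(days=2), "level": "ERROR", "service": "checkout", "correlation_id": "req-0"},
]


def error_counts(records, since):
    """Narrow time range, field-based filter, aggregate result."""
    return Counter(
        r["service"] for r in records
        if r["timestamp"] >= since and r["level"] == "ERROR"
    )


def follow_request(records, correlation_id):
    """Pull the raw lines for one request across services, in time order."""
    hits = [r for r in records if r["correlation_id"] == correlation_id]
    return sorted(hits, key=lambda r: r["timestamp"])


print(error_counts(records, since=now - timedelta(hours=2)))
print([r["service"] for r in follow_request(records, "req-1")])
```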

The estates that operate well teach their on-call engineers the queries that work. In the estates that don't, on-call engineers run expensive full-text searches that compete for query resources during the very incident they are trying to resolve.

Cost discipline as continuous practice

Logging cost is not a once-a-year budget exercise. The disciplines that hold costs down:

Per-service log volume metrics. When a service starts emitting 10x its usual volume, an alert fires and someone owns the response.
Per-team cost visibility. Teams can see what their logging costs. The visibility produces voluntary reduction.
Sampling for high-volume non-critical streams. Not every line of access logs needs to be retained; sample at 10% for low-criticality streams.
Regular log-volume reviews. Quarterly, find the top 10 contributors and ask whether the volume is justified.
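Three of these disciplines reduce to very small pieces of logic. The sketch below is illustrative: the function names are ours, the 10% rate and 10x factor come from the text, and hashing the correlation ID is one assumed way to sample so that all lines of a request are kept or dropped together.

```python
import hashlib


def keep_sample(correlation_id: str, rate: float = 0.10) -> bool:
    """Deterministic sampling keyed on the correlation ID."""
    digest = hashlib.sha256(correlation_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform over 0 .. 2**64 - 1
    return bucket < rate * 2**64


def volume_spike(current_bytes: int, baseline_bytes: int, factor: float = 10.0) -> bool:
    """True when a service emits at least `factor` times its usual volume."""
    return baseline_bytes > 0 and current_bytes >= factor * baseline_bytes


def top_contributors(bytes_by_service: dict[str, int], n: int = 10) -> list[tuple[str, int]]:
    """The services to question in the quarterly review, largest first."""
    return sorted(bytes_by_service.items(), key=lambda kv: kv[1], reverse=True)[:n]


usage = {"checkout": 900, "payments": 4200, "search": 150}  # hypothetical byte counts
print(top_contributors(usage, n=2))
print(volume_spike(current_bytes=12_000, baseline_bytes=1_000))
```

Random sampling per line would be simpler, but it leaves partial request histories; keying on the correlation ID keeps sampled requests whole.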

The estates that treat logging cost as continuous practice spend 30-50% less than equivalent estates that don't. The savings compound over years.

What we recommend

For an enterprise rethinking logging architecture:

Adopt structured logging across all new services and as services are touched
Implement tiered retention — hot for operational, warm for forensic, cold for audit
Build the log pipeline with explicit filtering and enrichment
Audit current log volume — what's being emitted, what's being used, what should stop
Establish redaction discipline at emission and at the pipeline
Make cost visible per team

For an estate paying too much for logging:

Run the volume audit. The top 10 contributors are almost always over-emitting.
Implement sampling for non-critical high-volume streams
Shorten hot-tier retention and move older logs to the warm tier; most enterprises are over-storing in the hot tier
Drop debug-level logs in production at the source

Logging is necessary infrastructure that becomes disproportionately expensive when every log line is treated the same. The estates that apply engineering discipline to logging pay reasonable costs for genuinely useful operational visibility. Those that do not end up paying disproportionate costs for log volume that is mostly noise.
