Enterprise Logging & Observability
Most enterprise logging architectures were designed for a different era of cost economics. A practitioner view of structured logging, retention tiers, the cost discipline modern observability requires — and why the cheapest log line is often the one not emitted.
Most enterprise logging architectures we audit were designed in an era of cheaper storage and less verbose applications. The pattern — log everything, retain it everywhere, search it occasionally — produced operational cost surprises through the late 2010s and is genuinely punitive in 2023's cloud economics. Datadog, Splunk, Sumo Logic, and Elastic invoices have become a regular item in enterprise IT budget reviews.
The pattern needs revisiting. Not because logging is less important — it remains critical — but because the cost economics now reward discipline that they previously did not. This piece is a practitioner view of what enterprise logging architecture should look like in 2023, and why the cheapest log line is often the one that was never emitted.
What logging is for, refined
Logs serve three operational purposes:
Debugging during incidents. When something is wrong now, logs are where engineers look first. The detail needs to be high; the delay between a line being emitted and becoming searchable needs to be low; the time window needed is usually the last few hours.
Forensic investigation. When something went wrong in the past, logs are the historical record. The detail needs to be high; the time window can be days, weeks, or in regulated industries, years.
Audit trail. Specific events that need to be preserved for compliance, regulatory, or contractual reasons. The detail is structured; the retention is long; the access is occasional but high-stakes.
These three uses have different requirements. Designing one log pipeline to serve all three uniformly is the most common source of enterprise logging cost. The discipline is to split the uses across different storage tiers with different retention.
The three-tier model
The pattern that produces good cost economics:
Tier 1 — Hot. Recent operational logs, kept in fast-search-capable storage for active debugging. Retention: 7-30 days. Storage cost is high per byte; query latency is low. The Datadog, Splunk, or hot Elasticsearch tier.
Tier 2 — Warm. Older operational logs and historical investigation data, kept in cheaper storage with slower query. Retention: 90-365 days. Storage cost is moderate; query latency is acceptable for forensic work. Object storage with a query engine (S3 + Athena, Azure Data Lake + Synapse, equivalent) is the dominant pattern.
Tier 3 — Cold / Audit. Long-term audit and compliance retention. Stored in cheapest available storage; rarely accessed; format is structured for compliance queries when needed. Glacier, Azure Archive, equivalent. Retention measured in years, often 7-15.
The lifecycle policy moves logs from tier to tier automatically. Tier 1 ages out to tier 2 after 30 days; tier 2 ages out to tier 3 after a year; tier 3 deletes per the retention policy.
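As a minimal sketch of that lifecycle rule, the tier for a log record can be derived from its age alone. The 30-day and one-year thresholds come from the text above; the 7-year audit retention is an assumed value from the low end of the quoted range, and the function name is hypothetical. In practice the storage platform's own lifecycle rules would usually do this rather than application code.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

HOT_DAYS = 30            # tier 1 ages out to tier 2 after 30 days
WARM_DAYS = 365          # tier 2 ages out to tier 3 after a year
AUDIT_DAYS = 7 * 365     # assumed retention policy: 7 years


def tier_for(log_time: datetime, now: datetime) -> Optional[str]:
    """Return the tier a record belongs in, or None if it is due for deletion."""
    age = now - log_time
    if age < timedelta(days=HOT_DAYS):
        return "hot"
    if age < timedelta(days=WARM_DAYS):
        return "warm"
    if age < timedelta(days=AUDIT_DAYS):
        return "cold"
    return None  # past the retention policy: delete


now = datetime(2023, 6, 1, tzinfo=timezone.utc)
print(tier_for(now - timedelta(days=3), now))
print(tier_for(now - timedelta(days=45), now))
```

Keeping the thresholds as named constants makes the retention policy reviewable in one place, which matters when compliance asks what is kept and for how long.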
The estates that have implemented this tiering see logging costs decrease by 60-80% compared to retaining everything in tier 1. The estates that haven't continue to renegotiate Datadog contracts annually.
Structured logging — non-negotiable
For logs to be useful operationally, they need to be queryable. Free-text log lines are queryable only by string matching, which is slow and produces poor signal-to-noise. Structured logs — JSON or equivalent — are queryable by field, which transforms what's possible during incidents.
The discipline:
- Every log line is a JSON object with consistent fields
- Required fields: timestamp (ISO 8601), level, service, correlation ID, message
- Conditional fields: user ID, tenant ID, business operation, error class
- Free text goes in a message field; structured data goes in dedicated fields
- No multiline log lines: a stack trace is emitted as a single string field, not as actual newlines
Most languages and frameworks have structured logging libraries that make this practical: log4j2's JSON layout, Python's structlog, Go's zerolog, Node's pino. The cost of adopting structured logging is a library configuration change; the benefit is order-of-magnitude better operational queryability.
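To make the field discipline concrete, here is a minimal sketch using only Python's standard logging module rather than any of the libraries named above. The class name, the service name, and the exact field names (correlation_id, tenant_id, and so on) are assumptions chosen for illustration, not a prescribed schema.

```python
import json
import logging
from datetime import datetime, timezone

# Conditional fields: included only when the caller supplies them.
OPTIONAL_FIELDS = ("user_id", "tenant_id", "operation", "error_class")


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object on one line."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),  # ISO 8601, UTC
            "level": record.levelname,
            "service": self.service,
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
        }
        for key in OPTIONAL_FIELDS:
            value = getattr(record, key, None)
            if value is not None:
                entry[key] = value
        if record.exc_info:
            # The stack trace becomes a string field; json.dumps escapes
            # its newlines, so the output stays on a single line.
            entry["stack_trace"] = self.formatException(record.exc_info)
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(service="billing"))
logger = logging.getLogger("billing")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("payment accepted", extra={"correlation_id": "abc-123", "tenant_id": "t1"})
```

The extra argument is how the standard library attaches per-call fields to a record; a dedicated structured logging library would typically offer a more ergonomic way to bind them.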
Estates that emit unstructured logs find that their expensive log aggregation product is mostly being used as a poor search engine. Estates that emit structured logs use the same tooling as a queryable operational database.
What not to log
The cheapest log line is the one that was never emitted. The discipline of deciding what doesn't need to be logged is more consequential than the choice of logging platform.
Things that often log too much:
- Debug-level logs in production. Useful during development; rarely useful in production. Most platforms support log-level gates; set the production level to INFO.
- Health check noise. Application emits a log line every time a load balancer probes its health endpoint. At scale this is millions of identical log lines per day. Suppress at source.
- Routine API request/response logging. The gateway already logs these; the application logging the same data again doubles the cost without adding information.
- Verbose third-party library output. Many libraries are noisy by default. Configure their log level downward.
- Stack traces for handled errors. If the error is handled gracefully, the stack trace is operational noise. Log the error class and how the error was handled; skip the stack trace.
These habits accumulate over years. A logging audit — running through what's actually emitted and asking "why is this useful" for each pattern — often finds 30-50% of log volume that adds no operational value.
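Several of these suppressions can be applied at source with a few lines of configuration. The sketch below uses Python's standard logging module; the http_path attribute, the probe paths, and the choice of urllib3 as the noisy library are assumptions for illustration, and a real service would substitute its own.

```python
import logging


class DropHealthChecks(logging.Filter):
    """Reject records produced by load balancer health probes."""

    def __init__(self, paths=("/healthz", "/ready")):
        super().__init__()
        self.paths = tuple(paths)

    def filter(self, record: logging.LogRecord) -> bool:
        # http_path is a hypothetical attribute the application attaches
        # via `extra=`; records without it are always kept.
        return getattr(record, "http_path", None) not in self.paths


def configure_production_logging(handler: logging.Handler) -> None:
    """Apply the production gates: INFO floor, no probe noise, quiet libraries."""
    root = logging.getLogger()
    root.setLevel(logging.INFO)           # debug-level lines are never emitted
    handler.addFilter(DropHealthChecks())
    root.addHandler(handler)
    # Example of turning a third-party library's output down.
    logging.getLogger("urllib3").setLevel(logging.WARNING)
```

Attaching the filter to the handler rather than to a single logger means it applies to every record that reaches that handler, whichever logger emitted it.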
The log pipeline architecture
Between application emission and storage, the log pipeline does enrichment, filtering, routing, and transformation:
Collection. Logs are collected from applications by an agent (Fluent Bit, Vector, Datadog Agent, OpenTelemetry Collector) or via direct API calls.
Enrichment. The pipeline adds context — Kubernetes pod metadata, cloud instance metadata, deployment identifiers. The application doesn't have to know these; the pipeline knows them.
Filtering. The pipeline can drop or sample log lines based on rules. High-volume debug logs in non-debug contexts can be dropped here.
Routing. Different log streams go to different destinations. Operational logs to the hot tier. Audit logs to the audit tier. Security-relevant logs to the SIEM. Cost-sensitive logs to a budget tier.
Transformation. The pipeline can reformat logs as needed for downstream consumers. Field renaming, type coercion, redaction of sensitive data.
The modern observability collector (OpenTelemetry Collector) handles most of this with declarative configuration. Earlier generations required scripting at each stage. The pipeline tier is where logging cost is most controllable; investing in it pays back across years.
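The stages are easier to reason about when written out as plain functions. The following is an illustrative sketch in Python, not collector configuration: the context values, the audit and security flags, and the destination names are all hypothetical, and a real deployment would express the same rules declaratively in the collector.

```python
from typing import Optional, Tuple

# Context the pipeline knows and the application does not (hypothetical values).
PIPELINE_CONTEXT = {"cluster": "prod-eu-1", "deployment_id": "release-42"}


def enrich(event: dict) -> dict:
    """Add pipeline-known context; fields set by the application win."""
    return {**PIPELINE_CONTEXT, **event}


def keep(event: dict) -> bool:
    """Filtering rule: drop debug-level lines before they reach storage."""
    return event.get("level") != "DEBUG"


def route(event: dict) -> str:
    """Pick a destination for the event."""
    if event.get("audit"):
        return "audit"
    if event.get("security"):
        return "siem"
    return "hot"


def process(event: dict) -> Optional[Tuple[str, dict]]:
    """Run one event through filter, enrich, route. None means dropped."""
    if not keep(event):
        return None
    enriched = enrich(event)
    return route(enriched), enriched


print(process({"level": "INFO", "service": "billing", "message": "invoice sent"}))
```

Filtering before enrichment is a deliberate ordering here: there is no point spending work on lines that are about to be dropped.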
Redaction and sensitive data
Logs frequently capture data that shouldn't be in logs: API request bodies that include personal identifiers, JWT tokens, internal credentials, customer PII. The discipline of redaction is part of the logging architecture.
The patterns that work:
- Field-aware redaction at emission. Application code that logs structured data knows which fields are sensitive and redacts them before emission.
- Pipeline-stage redaction. Pattern matching at the pipeline drops or redacts known-sensitive patterns (credit card numbers, JWT signatures, passwords in any field).
- Categorisation of streams. Streams that may contain sensitive data are routed to access-controlled storage; streams that definitely don't can flow to broader-access storage.
- Audit of accidental sensitive data. Periodic sampling and review to catch new sensitive data patterns that have started flowing.
The cost of getting redaction wrong is real — both in compliance terms and in actual data breach exposure. Most enterprise estates have at least one sensitive-data-in-logs issue that hasn't been caught yet.
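A minimal sketch of pipeline-stage redaction, combining a sensitive field-name list with pattern matching on string values. The key names, placeholder strings, and regular expressions are assumptions for illustration; the card pattern in particular is deliberately crude and will also match other long digit runs, so a production rule set would need tuning and a checksum test.

```python
import re

# Field names whose values are always replaced (hypothetical list).
SENSITIVE_KEYS = {"password", "authorization", "token", "secret"}

# Three dot-separated base64url segments, the first starting with "eyJ".
JWT_RE = re.compile(r"\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+")
# 13 to 16 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,15}\d\b")


def redact_value(value):
    """Redact known-sensitive patterns inside strings, recursing into containers."""
    if isinstance(value, dict):
        return redact(value)
    if isinstance(value, list):
        return [redact_value(item) for item in value]
    if isinstance(value, str):
        value = JWT_RE.sub("[REDACTED_JWT]", value)
        value = CARD_RE.sub("[REDACTED_CARD]", value)
    return value


def redact(event: dict) -> dict:
    """Return a copy of the event with sensitive fields and patterns removed."""
    clean = {}
    for key, value in event.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        else:
            clean[key] = redact_value(value)
    return clean


print(redact({"message": "card 4111 1111 1111 1111 declined", "password": "hunter2"}))
```

Pattern matching at the pipeline is a safety net rather than a substitute for field-aware redaction at emission: the application knows which fields are sensitive, while the pipeline can only guess from shape.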
Querying patterns that work
Operational queries against logs work best when:
- Time range is narrow (hours, not days)
- Filters are field-based, not text-based
- The query returns aggregates (counts, percentiles) rather than raw lines, except for specific investigation
- Correlation ID joins the query across services
Operational queries that don't work well:
- Full-text search across days of logs
- Queries that scan the warm or cold tier without indexing
- Queries that try to reconstruct full request behaviour from logs alone (traces are better)
The estates that operate well teach their on-call engineers the queries that work. In the estates that don't, on-call engineers run expensive full-text searches that compete for query resources during the very incident they are trying to resolve.
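The shape of a query that works can be shown over an in-memory list of structured events. This is a sketch of the query pattern only, written in Python with hypothetical function and field names; a real log platform would express the same thing in its own query language.

```python
from collections import Counter
from datetime import datetime


def error_counts(events, start: datetime, end: datetime) -> Counter:
    """Field-based filter over a narrow time range, returning an aggregate."""
    return Counter(
        e["service"]
        for e in events
        if e["level"] == "ERROR" and start <= e["timestamp"] < end
    )


def trace_request(events, correlation_id: str) -> list:
    """Follow one request across services by its correlation ID, in time order."""
    matching = (e for e in events if e.get("correlation_id") == correlation_id)
    return sorted(matching, key=lambda e: e["timestamp"])


events = [
    {"timestamp": datetime(2023, 6, 1, 10, 0), "level": "ERROR",
     "service": "billing", "correlation_id": "abc-123"},
    {"timestamp": datetime(2023, 6, 1, 9, 59), "level": "INFO",
     "service": "gateway", "correlation_id": "abc-123"},
    {"timestamp": datetime(2023, 6, 1, 10, 5), "level": "ERROR",
     "service": "billing", "correlation_id": "def-456"},
]
print(error_counts(events, datetime(2023, 6, 1, 9, 0), datetime(2023, 6, 1, 11, 0)))
print([e["service"] for e in trace_request(events, "abc-123")])
```

Neither function inspects the message text. Both rely entirely on fields, which is what lets a platform answer them from an index instead of scanning every line.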
Cost discipline as continuous practice
Logging cost is not a once-a-year budget exercise. The disciplines that hold costs down:
- Per-service log volume metrics. Track emitted volume per service, and alert loudly when a service starts emitting 10x its usual volume.
- Per-team cost visibility. Teams can see what their logging costs. The visibility produces voluntary reduction.
- Sampling for high-volume non-critical streams. Not every line of access logs needs to be retained; sample at 10% for low-criticality streams.
- Regular log-volume reviews. Quarterly, find the top 10 contributors and ask whether the volume is justified.
The estates that treat logging cost as continuous practice spend 30-50% less than equivalent estates that don't. The savings compound over years.
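Two of these disciplines reduce to very small functions. The sketch below is in Python with hypothetical names: the 10x factor and 10% rate come from the list above, while hashing the correlation ID to make the sampling decision is an assumption made here so that every line belonging to one request is kept or dropped together.

```python
import zlib


def is_volume_spike(current_bytes: int, baseline_bytes: int, factor: float = 10.0) -> bool:
    """True when a service emits at least `factor` times its usual volume."""
    return baseline_bytes > 0 and current_bytes >= factor * baseline_bytes


def keep_sample(correlation_id: str, rate: float = 0.10) -> bool:
    """Deterministic sampling: the same ID always gets the same decision."""
    # Map the ID to one of 10,000 buckets and keep the lowest `rate` share.
    bucket = zlib.crc32(correlation_id.encode("utf-8")) % 10_000
    return bucket < rate * 10_000


print(is_volume_spike(current_bytes=12_000_000, baseline_bytes=1_000_000))
print(keep_sample("abc-123") == keep_sample("abc-123"))
```

Random per-line sampling would be simpler, but it leaves partial request histories in the sample; a hash of a stable identifier avoids that at no extra cost.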
What we recommend
For an enterprise rethinking logging architecture:
- Adopt structured logging across all new services and as services are touched
- Implement tiered retention — hot for operational, warm for forensic, cold for audit
- Build the log pipeline with explicit filtering and enrichment
- Audit current log volume — what's being emitted, what's being used, what should stop
- Establish redaction discipline at emission and at the pipeline
- Make cost visible per team
For an estate paying too much for logging:
- Run the volume audit. The top 10 contributors are almost always over-emitting.
- Implement sampling for non-critical high-volume streams
- Move older data down the retention tiers; most enterprises are over-storing in the hot tier
- Drop debug-level logs in production at the source
Logging is necessary infrastructure that becomes disproportionately expensive when every log line is treated the same. The estates that apply engineering discipline to logging pay reasonable costs for genuinely useful operational visibility. The estates that don't pay disproportionate costs for log volume that's mostly noise.