AI Observability — What to Log and Why
Conventional application observability misses what matters in LLM systems. A practitioner view of the trace shape that actually lets you debug, audit, and improve a production AI system.
A production LLM system produces incidents the way every other production system does. When the incident happens, the question is whether the team can reconstruct what happened. Conventional application observability — request, response, latency, error rate — misses most of what matters. The reasoning inside the LLM call, the retrieved context, the prompt assembly, the model version — without these, the trace is incomplete and the debugging is guesswork.
This is a practitioner view of what to log in a production LLM system, why it matters, and how to design the observability layer so it serves debugging, auditing, evaluation, and improvement at the same time.
What conventional observability misses
A modern application has metrics, logs, and traces. For an LLM workload, the conventional setup captures:
- The user's request
- The HTTP-level interaction with the model provider
- The latency and status code
- Errors
What it misses:
- The actual prompt that went to the model. The HTTP body might be redacted; the assembly of system prompt + retrieved context + user input + examples is not reconstructible.
- The model's response. Often truncated, redacted, or simply not captured at the depth needed for debugging.
- The retrieval result. Which chunks came back, with what scores, from which sources.
- The model identifier and version. The provider may have silently upgraded the model; latency and status codes will not tell you.
- The cost. Token counts and pricing per call.
- The downstream effects. Function calls, side effects, follow-up calls.
- The reasoning. For chain-of-thought or tool-use patterns, the intermediate steps the model produced.
A trace that doesn't capture these reconstructs the network call but not the AI workload. When an incident happens, the team can see that the model was called and, at best, part of what came back; it cannot see why the model produced what it did.
What to log
The trace shape that has held up in production LLM systems:
At session level
- Session ID — joinable across all subsequent calls
- User identity
- Channel — web, mobile, API
- Tenant, workload — for attribution
- Start and end timestamps
At request level
- Request ID — within the session
- The user's original input
- The intent classification or routing decision
- The workload-specific metadata
At call level (for each LLM call in the request)
- Model identifier — including version
- Provider
- The full prompt — system instructions, context, history, user input, in the assembled form sent to the model
- The full response — including any thinking tokens or intermediate reasoning where available
- Input and output token counts
- Latency
- Cost (computed from tokens and pricing)
- Cache hit indicator
- Any errors
At retrieval level (for each retrieval)
- Query — including any rewriting
- Embedding model
- Vector store
- Top-k parameter
- Retrieved chunks — IDs, scores, source documents, the actual content
- Filter conditions applied
At function call level (for each function the LLM invokes)
- Function name and version
- Arguments — as the model produced them
- Validation result
- Execution result
- Latency
- Any errors
At decision level (for human-in-the-loop steps)
- The case that required human input
- The information shown to the human
- The decision the human made
- The reviewer identity and timestamp
This is more data than conventional traces. It is also what makes debugging an LLM system possible.
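The request, call, retrieval, and function-call levels above can be sketched as plain record types. This is a minimal illustration, not a standard schema: every class and field name here is an assumption chosen to mirror the lists above, and the session and decision levels are left out for brevity. Prices are passed in by the caller rather than hard-coded, since they vary by provider and model.

```python
import uuid
from dataclasses import dataclass, field, asdict
from typing import Optional


def new_id() -> str:
    """Hypothetical ID helper; any consistent, join-friendly format works."""
    return uuid.uuid4().hex


@dataclass
class RetrievalRecord:
    query: str                   # after any rewriting
    embedding_model: str
    vector_store: str
    top_k: int
    chunks: list                 # each item: {"id", "score", "source", "content"}
    filters: dict = field(default_factory=dict)


@dataclass
class FunctionCallRecord:
    name: str
    version: str
    arguments: dict              # exactly as the model produced them
    validation_ok: bool
    result: Optional[str] = None
    latency_ms: float = 0.0
    error: Optional[str] = None


@dataclass
class LLMCallRecord:
    model: str                   # include the version, not just an alias
    provider: str
    prompt: str                  # the assembled form sent to the model
    response: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cache_hit: bool = False
    error: Optional[str] = None
    retrievals: list = field(default_factory=list)
    function_calls: list = field(default_factory=list)

    def cost_usd(self, input_price_per_1k: float, output_price_per_1k: float) -> float:
        """Cost computed from token counts and caller-supplied per-1k-token prices."""
        return (self.input_tokens / 1000) * input_price_per_1k + (
            self.output_tokens / 1000
        ) * output_price_per_1k


@dataclass
class RequestRecord:
    session_id: str              # joins the request back to its session
    user_input: str
    routing_decision: str
    request_id: str = field(default_factory=new_id)
    metadata: dict = field(default_factory=dict)
    calls: list = field(default_factory=list)
```

Calling `asdict()` on a `RequestRecord` yields a nested dictionary that can be serialised to whatever trace backend is in use.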
How to structure it
A few patterns for the trace shape:
- Use a tracing standard. OpenTelemetry is the dominant choice. Treat LLM spans as first-class.
- Hierarchical structure. Session contains requests; request contains calls; calls contain retrievals and function calls. The hierarchy lets you drill in.
- Consistent identifiers. Same ID format across spans. Join-friendly.
- Thoughtful sampling. Full traces are expensive at high volume. Sample by importance: keep full traces for all errors and for high-stakes workloads, and a sampled fraction of normal traffic.
- Retention by purpose. Debug traces retained briefly; audit traces retained as long as regulation requires; evaluation traces retained as long as the eval set needs them.
The pattern is more disciplined than conventional logging. The payoff is that the trace is actually useful when you need it.
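The sampling rule in the list above can be sketched as a small decision function. The workload name and the 5% rate are placeholders, not recommendations. Because an error is only known once the request has finished, this is a tail-based decision: the function is meant to run after the trace is complete.

```python
import hashlib

HIGH_STAKES_WORKLOADS = {"claims-adjudication"}  # hypothetical workload name
NORMAL_SAMPLE_RATE = 0.05                        # hypothetical rate


def keep_full_trace(trace_id: str, workload: str, had_error: bool,
                    sample_rate: float = NORMAL_SAMPLE_RATE) -> bool:
    """Decide whether to retain the full trace for a finished request."""
    # Errors and high-stakes workloads are always kept in full.
    if had_error or workload in HIGH_STAKES_WORKLOADS:
        return True
    # Hash the trace ID so the decision is deterministic: every service that
    # sees the same trace ID reaches the same answer without coordination.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < sample_rate
```

Hashing the ID rather than calling a random number generator keeps a trace from being half-retained when several components make the decision independently.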
Where the storage matters
LLM traces are large. Prompts can run thousands of tokens; responses can run thousands more. Retrieval results, stored with their content, add another thousand or so tokens. A single interaction in a heavy workload, with several calls and retrievals, can produce megabytes of trace data.
Storage options:
- Conventional log aggregators (Datadog, Splunk, ELK) — work for the smaller-volume case. Get expensive at scale.
- Specialised LLM observability tools (LangSmith, Helicone, Phoenix, Langfuse) — purpose-built for LLM traces. Easier to navigate; some lack enterprise governance features.
- Object storage with metadata indexes (S3 with Athena, Azure Blob with Synapse) — cheap at scale; less convenient to query interactively.
- Hybrid — high-signal metadata in the log aggregator; full prompts and responses in object storage; the aggregator links to the object store.
The right choice depends on volume, governance posture, and the team's existing observability stack. Most enterprise teams end up with a hybrid model.
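The hybrid option can be sketched as a function that splits one call record in two: a small metadata event for the log aggregator, and a payload blob for object storage, with the metadata carrying a reference to the blob. The field names, the bucket name, and the key layout are all assumptions for illustration; the actual upload to an object store is deliberately left out.

```python
import hashlib
import json


def split_for_hybrid_storage(call: dict, bucket: str = "llm-traces"):
    """Return (metadata_event, blobs) for one LLM call record.

    `call` is assumed to contain "trace_id", "prompt", and "response"
    alongside small fields such as model, token counts, and latency.
    """
    payload = json.dumps(
        {"prompt": call["prompt"], "response": call["response"]},
        sort_keys=True,
        ensure_ascii=False,
    ).encode("utf-8")
    # Content-addressed key: identical payloads map to the same object.
    key = f"{call['trace_id']}/{hashlib.sha256(payload).hexdigest()}.json"

    # Everything except the large text fields goes to the aggregator.
    metadata = {k: v for k, v in call.items() if k not in ("prompt", "response")}
    metadata["payload_ref"] = f"{bucket}/{key}"
    metadata["payload_bytes"] = len(payload)
    return metadata, {key: payload}
```

The aggregator event stays small enough to index and alert on, while the reference lets an engineer jump from a dashboard row to the full prompt and response.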
What to monitor
Observability is not just logs; it is metrics with alerts. The metrics worth monitoring:
- Latency — p50, p95, p99 by workload
- Error rate — by error type
- Cost rate — dollars per hour per workload
- Token throughput — input and output
- Cache hit rate — at each cache layer
- Retrieval quality proxies — average top-k score, retrieval recall against an eval set
- User feedback signal — where users can rate, the rate of positive/negative feedback
- Validation failure rate — for structured output workloads, the rate of malformed output
- Function call failure rate — for tool-using workloads
Each metric has thresholds. Threshold breaches trigger alerts. Alerts trigger investigation. Investigation uses the traces.
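As a rough illustration of the metric-to-alert step, the sketch below computes latency percentiles for one workload and reports which thresholds were breached. It uses the nearest-rank percentile definition; production metric backends typically compute percentiles from histograms instead, and the threshold values in any real system are a team decision, not something this example prescribes.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("percentile of empty list")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def breached_thresholds(latencies_ms, thresholds_ms):
    """Return {name: observed} for each percentile above its threshold.

    `thresholds_ms` maps "p50" / "p95" / "p99" to a limit in milliseconds;
    percentiles without a configured threshold are ignored.
    """
    observed = {
        "p50": percentile(latencies_ms, 50),
        "p95": percentile(latencies_ms, 95),
        "p99": percentile(latencies_ms, 99),
    }
    return {
        name: value
        for name, value in observed.items()
        if name in thresholds_ms and value > thresholds_ms[name]
    }
```

A non-empty result is what would trigger the alert; the trace IDs of the slowest requests in the window are the natural starting point for the investigation.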
How to use it for debugging
When something is wrong in production, the trace lets the team work backwards:
- The user reports a bad output.
- The team looks up the session and the specific request.
- The trace shows the user input, the prompt sent to the model, the retrieved context, the model output, any function calls.
- The team can identify whether the failure was in retrieval (wrong context), in the prompt (poor assembly), in the model (poor reasoning), in the function call (wrong arguments), or in the downstream system.
Without this depth of trace, the team is reduced to running the input through the system manually and hoping to reproduce the issue. The trace turns mysteries into investigations.
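The backwards walk described above depends on being able to read one request as a tree. A minimal sketch, assuming each span is a dictionary with hypothetical `kind`, `summary`, optional `error`, and optional `children` keys:

```python
def render_trace(span, depth=0):
    """Flatten a nested span into indented lines, one per span."""
    line = f"{'  ' * depth}{span['kind']}: {span['summary']}"
    if span.get("error"):
        line += f"  [ERROR: {span['error']}]"
    lines = [line]
    for child in span.get("children", []):
        lines.extend(render_trace(child, depth + 1))
    return lines


def first_error(span):
    """Depth-first search for the first span that recorded an error."""
    if span.get("error"):
        return span
    for child in span.get("children", []):
        found = first_error(child)
        if found is not None:
            return found
    return None


# Illustrative data only; the values are made up.
example = {
    "kind": "request",
    "summary": "user asked about refund policy",
    "children": [
        {"kind": "retrieval", "summary": "top_k=3, best score 0.41"},
        {
            "kind": "llm_call",
            "summary": "example-model-v1, 1850 tokens in / 240 out",
            "children": [
                {
                    "kind": "function_call",
                    "summary": "lookup_order(order_id='')",
                    "error": "validation failed: order_id is empty",
                }
            ],
        },
    ],
}

if __name__ == "__main__":
    print("\n".join(render_trace(example)))
```

In this made-up case the view points at two candidate causes at once: a weak retrieval score and a function call with an empty argument. Having both visible in one place is what lets the team decide which stage actually failed.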
How to use it for audit
In regulated workflows, the trace is the audit record. The trace has to:
- Be retained for the required period
- Be tamper-evident
- Be searchable by case, user, time
- Be exportable for regulator review
- Be redacted appropriately when shared
The audit-grade trace is more demanding than the debug-grade trace. The same infrastructure can serve both with appropriate retention and access controls.
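One common way to make a record tamper-evident is a hash chain, where each entry's hash covers the previous entry's hash. The sketch below illustrates that technique only; it is not a claim about how any particular product or regulation implements it, and the record fields are hypothetical. A chain detects edits only if the latest hash is also anchored somewhere the editor cannot rewrite, such as a separate write-once store.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash, record):
    # Canonical JSON so the same record always hashes the same way.
    body = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_record(chain, record):
    """Append an audit record, linking it to the previous entry's hash."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append(
        {"record": record, "prev_hash": prev_hash, "hash": _digest(prev_hash, record)}
    )


def verify_chain(chain):
    """Return True if no entry has been altered, removed, or reordered."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True
```

Changing any earlier record changes its hash, which breaks the link stored in every later entry, so a single verification pass exposes the edit.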
How to use it for improvement
The trace is the substrate for system improvement:
- Failure analysis — which inputs produce which failures
- Cost analysis — which workloads consume which resources
- Latency analysis — where the time goes
- Quality analysis — where the outputs fall short
Patterns surface from aggregate analysis that single incidents never reveal. The team that uses traces this way improves the system continuously; the team that uses traces only for incident response does not.
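A small example of the aggregate view: counting failures by workload and by the stage where they occurred. It assumes each request record has been labelled with a hypothetical `failure_stage` field (for example during triage or by an automated check), set to `None` for successful requests.

```python
from collections import Counter, defaultdict


def failure_breakdown(requests):
    """Count failures per workload and per stage."""
    breakdown = defaultdict(Counter)
    for req in requests:
        stage = req.get("failure_stage")
        if stage:
            breakdown[req["workload"]][stage] += 1
    return dict(breakdown)


def failure_rate_by_workload(requests):
    """Fraction of requests per workload that carry a failure label."""
    totals = Counter()
    failures = Counter()
    for req in requests:
        totals[req["workload"]] += 1
        if req.get("failure_stage"):
            failures[req["workload"]] += 1
    return {workload: failures[workload] / totals[workload] for workload in totals}
```

A breakdown such as "most support failures are retrieval failures" is the kind of pattern that no single incident review would surface.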
What we keep seeing
Recurring patterns in production LLM observability:
Retrofit is expensive. Teams that didn't capture full traces from the start spend significant effort retrofitting. Plan it from day one.
Storage cost is the trap. Capturing everything is expensive; sampling poorly loses the cases that matter. The right sampling policy (keep all errors, sample successes) is rarely obvious at the outset.
Trace tool sprawl. Teams end up with conventional logs, a specialised LLM observability tool, and the cloud provider's logs. Reconciling them takes effort.
The most-used view is the per-request trace. A clean view of "what happened in this request" with prompts, retrievals, function calls, and outputs visible is what the team will use daily. Optimise for it.
Aggregate analysis is underused. The trace data could answer many improvement questions, but the team has to invest in the analysis tooling and the analyst time. Many teams don't, and miss the improvement signal in their own data.
What we recommend
For an enterprise team building LLM observability in 2024:
- Capture full prompts, full responses, retrieval results, and function calls from day one. Retrofitting is harder than starting right.
- Structure traces hierarchically. Session → request → call → retrieval/function call.
- Use OpenTelemetry or a compatible standard. Lock-in here is painful.
- Pick a storage strategy that scales. Hybrid is common; a pure log-aggregator approach gets expensive at volume.
- Monitor metrics with thresholds. Traces are the deep dive; metrics are the daily signal.
- Build aggregate analysis tooling. The improvement signal is in the data; extract it.
- Treat audit-grade retention as a first-class requirement in regulated workloads.
LLM systems without observability are systems you cannot debug, audit, or improve. The infrastructure is more demanding than conventional observability, but the effort is repaid by what it makes possible. The teams that build it well operate confidently; the teams that skip it operate reactively.