Observability · 30 April 2024 · 8 min read

AI Observability — What to Log and Why

Conventional application observability misses what matters in LLM systems. A practitioner view of the trace shape that actually lets you debug, audit, and improve a production AI system.

By Intellectual AI Engineering Practice · Collective byline

A production LLM system produces incidents the way every other production system does. When the incident happens, the question is whether the team can reconstruct what happened. Conventional application observability — request, response, latency, error rate — misses most of what matters. The reasoning inside the LLM call, the retrieved context, the prompt assembly, the model version — without these, the trace is incomplete and the debugging is guesswork.

This is a practitioner view of what to log in a production LLM system, why it matters, and how to design the observability layer so it serves debugging, auditing, evaluation, and improvement at the same time.

What conventional observability misses

A modern application has metrics, logs, and traces. For an LLM workload, the conventional setup captures:

The user's request
The HTTP-level interaction with the model provider
The latency and status code
Errors

What it misses:

The actual prompt that went to the model. The HTTP body might be redacted; the assembly of system prompt + retrieved context + user input + examples is not reconstructible.
The model's response. Often truncated, redacted, or simply not captured at the depth needed for debugging.
The retrieval result. Which chunks came back, with what scores, from which sources.
The model identifier and version. The provider may have silently upgraded; you can't tell from latency.
The cost. Token counts and pricing per call.
The downstream effects. Function calls, side effects, follow-up calls.
The reasoning. For chain-of-thought or tool-use patterns, the intermediate steps the model produced.

A trace that doesn't capture these reconstructs the network call but not the AI workload. When an incident happens, the team can see that the model was called and what came back; it cannot see why the model produced what it did.

What to log

The trace shape that has held up in production LLM systems:

At session level

Session ID — joinable across all subsequent calls
User identity
Channel — web, mobile, API
Tenant, workload — for attribution
Start and end timestamps

At request level

Request ID — within the session
The user's original input
The intent classification or routing decision
The workload-specific metadata

At call level (for each LLM call in the request)

Model identifier — including version
Provider
The full prompt — system instructions, context, history, user input, in the assembled form sent to the model
The full response — including any thinking tokens or intermediate reasoning where available
Input and output token counts
Latency
Cost (computed from tokens and pricing)
Cache hit indicator
Any errors
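
Cost per call can be computed at write time from the token counts and a price table. The following is a minimal sketch, assuming a hypothetical price table keyed by model identifier; the model name and the prices are placeholders, not real provider pricing.

```python
from decimal import Decimal

# Hypothetical prices in USD per 1,000 tokens. Placeholders, not real provider pricing.
PRICES_PER_1K = {
    "example-model-v1": {"input": Decimal("0.50"), "output": Decimal("1.50")},
}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Return the cost in USD of one LLM call, computed from its token counts."""
    price = PRICES_PER_1K[model]  # a KeyError here flags a model with no price entry
    total = Decimal(input_tokens) * price["input"] + Decimal(output_tokens) * price["output"]
    return total / Decimal(1000)


cost = call_cost("example-model-v1", input_tokens=2000, output_tokens=500)
```

Decimal arithmetic avoids the rounding drift that accumulates when many small float costs are summed. Storing the price table version alongside the computed cost keeps historical figures explainable after prices change.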

At retrieval level (for each retrieval)

Query — including any rewriting
Embedding model
Vector store
Top-k parameter
Retrieved chunks — IDs, scores, source documents, the actual content
Filter conditions applied

At function call level (for each function the LLM invokes)

Function name and version
Arguments — as the model produced them
Validation result
Execution result
Latency
Any errors

At decision level (for human-in-the-loop steps)

The case that required human input
The information shown to the human
The decision the human made
The reviewer identity and timestamp

This is more data than conventional traces. It is also what makes debugging an LLM system possible.
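
The levels listed above can be written down as a schema. The following is a minimal sketch using Python dataclasses; every class and field name is an illustrative assumption that mirrors the lists, not a standard or a vendor format.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Optional

@dataclass
class RetrievedChunk:
    chunk_id: str
    score: float
    source_document: str
    content: str

@dataclass
class RetrievalRecord:
    query: str                      # the query as issued, after any rewriting
    original_query: str
    embedding_model: str
    vector_store: str
    top_k: int
    filters: dict[str, Any] = field(default_factory=dict)
    chunks: list[RetrievedChunk] = field(default_factory=list)

@dataclass
class FunctionCallRecord:
    name: str
    version: str
    arguments: dict[str, Any]       # exactly as the model produced them
    validation_ok: bool
    result: Any
    latency_ms: float
    error: Optional[str] = None

@dataclass
class LLMCallRecord:
    model: str                      # identifier including version
    provider: str
    prompt: str                     # fully assembled prompt as sent to the model
    response: str                   # full response, with reasoning where available
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cost_usd: float
    cache_hit: bool
    error: Optional[str] = None
    retrievals: list[RetrievalRecord] = field(default_factory=list)
    function_calls: list[FunctionCallRecord] = field(default_factory=list)

@dataclass
class HumanDecisionRecord:
    case_id: str
    information_shown: str
    decision: str
    reviewer_id: str
    decided_at: str                 # ISO 8601 timestamp

@dataclass
class RequestRecord:
    request_id: str
    session_id: str                 # join key back to the session
    user_input: str
    routing_decision: str
    metadata: dict[str, Any] = field(default_factory=dict)
    calls: list[LLMCallRecord] = field(default_factory=list)
    decisions: list[HumanDecisionRecord] = field(default_factory=list)

@dataclass
class SessionRecord:
    session_id: str
    user_id: str
    channel: str                    # "web", "mobile" or "api"
    tenant: str
    workload: str
    started_at: str
    ended_at: Optional[str] = None
    requests: list[RequestRecord] = field(default_factory=list)

# Example values below are made up for illustration.
chunk = RetrievedChunk("doc-1#3", 0.82, "doc-1", "example chunk text")
retrieval = RetrievalRecord("refund policy", "how do refunds work?",
                            "example-embedder", "example-store", 5, chunks=[chunk])
call = LLMCallRecord("example-model-v1", "example-provider", "assembled prompt",
                     "full response", 1200, 300, 850.0, 0.0, False, retrievals=[retrieval])
request = RequestRecord("r-1", "s-1", "how do refunds work?", "faq", calls=[call])
flat = asdict(request)              # nested dicts and lists, ready for JSON serialisation
```

Nesting retrievals and function calls under the call that triggered them keeps the per-request view a single tree walk.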

How to structure it

A few patterns for the trace shape:

Use a tracing standard. OpenTelemetry is the dominant choice. Treat LLM spans as first-class.
Hierarchical structure. Session contains requests; request contains calls; calls contain retrievals and function calls. The hierarchy lets you drill in.
Consistent identifiers. Same ID format across spans. Join-friendly.
Sample thoughtfully. Full traces are expensive at high volume. Sample by importance: keep full traces for all errors and for high-stakes workloads, and sample normal traffic.
Retention by purpose. Debug traces retained briefly; audit traces retained as long as regulation requires; evaluation traces retained as long as the eval set needs them.

The pattern is more disciplined than conventional logging. The payoff is that the trace is actually useful when you need it.
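
The sampling rule above (full traces for errors and high-stakes workloads, a sample of normal traffic) can be sketched as one deterministic function. The function name and the 5% default rate are assumptions for illustration; the right rate depends on volume and budget.

```python
import hashlib


def keep_full_trace(trace_id: str, has_error: bool, high_stakes: bool,
                    sample_rate: float = 0.05) -> bool:
    """Decide whether to retain the full trace for one interaction."""
    if has_error or high_stakes:
        return True  # never sample away errors or high-stakes workloads
    # Hash the trace ID into [0, 1) so every span of a trace gets the same decision.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


kept = sum(keep_full_trace(f"trace-{i}", False, False) for i in range(10_000))
```

Hashing the trace ID instead of calling a random number generator means separate services reach the same decision for the same trace without coordinating. One caveat: an error that surfaces late in a request can only be kept in full if the spans are buffered until the request completes.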

Where the storage matters

LLM traces are large. Prompts can run to thousands of tokens, and responses to thousands more. Retrieval results add another thousand tokens or so. A single high-end workload can produce megabytes of trace data per interaction.

Storage options:

Conventional log aggregators (Datadog, Splunk, ELK) — work for the smaller-volume case. Get expensive at scale.
Specialised LLM observability tools (LangSmith, Helicone, Phoenix, Langfuse) — purpose-built for LLM traces. Easier to navigate; some lack enterprise governance features.
Object storage with metadata indexes (S3 with Athena, Azure Blob with Synapse) — cheap at scale; less convenient to query interactively.
Hybrid — high-signal metadata in the log aggregator; full prompts and responses in object storage; the aggregator links to the object store.

The right choice depends on volume, governance posture, and the team's existing observability stack. Most enterprise teams end up with a hybrid model.
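
The hybrid option can be sketched as follows: small, high-signal metadata goes to the aggregator, the full prompt and response go to object storage, and the metadata record carries the key that links the two. The two in-memory containers are stand-ins for real services, and all names are hypothetical.

```python
import hashlib
import json

object_store: dict[str, bytes] = {}   # stand-in for an object storage bucket
aggregator: list[dict] = []           # stand-in for a log aggregator index


def store_call(call_id: str, model: str, latency_ms: float,
               prompt: str, response: str) -> dict:
    """Write the large payload to object storage and a small record to the aggregator."""
    payload = json.dumps({"prompt": prompt, "response": response},
                         sort_keys=True).encode("utf-8")
    # Content-addressed key: identical payloads are stored once.
    key = f"traces/{hashlib.sha256(payload).hexdigest()}.json"
    object_store[key] = payload
    record = {
        "call_id": call_id,
        "model": model,
        "latency_ms": latency_ms,
        "payload_bytes": len(payload),
        "payload_key": key,            # the link from the aggregator to the object store
    }
    aggregator.append(record)
    return record


def load_payload(record: dict) -> dict:
    """Follow the link from a metadata record back to the full prompt and response."""
    return json.loads(object_store[record["payload_key"]])


record = store_call("c-1", "example-model-v1", 850.0, "assembled prompt", "full response")
payload = load_payload(record)
```

Keeping prompts out of the aggregator also narrows where sensitive text lives, which simplifies access control and redaction.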

What to monitor

Observability is not just logs; it is metrics with alerts. The metrics worth monitoring:

Latency — p50, p95, p99 by workload
Error rate — by error type
Cost rate — dollars per hour per workload
Token throughput — input and output
Cache hit rate — at each cache layer
Retrieval quality proxies — average top-k score, retrieval recall against an eval set
User feedback signal — where users can rate, the rate of positive/negative feedback
Validation failure rate — for structured output workloads, the rate of malformed output
Function call failure rate — for tool-using workloads

Each metric has thresholds. Threshold breaches trigger alerts. Alerts trigger investigation. Investigation uses the traces.
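
A minimal sketch of the metric-to-alert step, using only the standard library. The threshold values and metric names are hypothetical; real values come from each workload's service-level targets.

```python
import statistics

# Hypothetical alert thresholds; tune per workload.
THRESHOLDS = {"p95_latency_ms": 2000.0, "error_rate": 0.02}


def latency_percentiles(latencies_ms: list[float]) -> dict[str, float]:
    """Return p50, p95 and p99 from raw latencies (needs at least two samples)."""
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def breaches(metrics: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return the names of the metrics that exceed their thresholds."""
    return sorted(name for name, limit in thresholds.items()
                  if metrics.get(name, 0.0) > limit)


latencies = [float(ms) for ms in range(1, 101)]   # made-up sample: 1 ms to 100 ms
pct = latency_percentiles(latencies)
metrics = {"p95_latency_ms": pct["p95"], "error_rate": 3 / 100}
alerts = breaches(metrics, THRESHOLDS)
```

In this made-up sample the latency is well under its limit and only the error rate breaches, so `alerts` contains a single entry that would start an investigation in the traces.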

How to use it for debugging

When something is wrong in production, the trace lets the team work backwards:

The user reports a bad output.
The team looks up the session and the specific request.
The trace shows the user input, the prompt sent to the model, the retrieved context, the model output, any function calls.
The team can identify whether the failure was in retrieval (wrong context), in the prompt (poor assembly), in the model (poor reasoning), in the function call (wrong arguments), or in the downstream system.

Without this depth of trace, the team is reduced to running the input through the system manually and hoping to reproduce the issue. The trace turns mysteries into investigations.
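
The working-backwards steps above amount to a tree walk over the stored trace. The following is a minimal sketch, assuming each span is a dict with `kind`, `name`, an optional `error`, and optional `children`; that shape is an assumption for illustration.

```python
from typing import Optional


def render(span: dict, depth: int = 0) -> list[str]:
    """Return an indented, line-per-span view of one request trace."""
    line = f"{'  ' * depth}{span['kind']}: {span['name']}"
    if span.get("error"):
        line += f"  [ERROR: {span['error']}]"
    lines = [line]
    for child in span.get("children", []):
        lines.extend(render(child, depth + 1))
    return lines


def first_error(span: dict, path: tuple = ()) -> Optional[tuple]:
    """Return the path from the root to the first span that recorded an error."""
    here = path + (f"{span['kind']}:{span['name']}",)
    if span.get("error"):
        return here
    for child in span.get("children", []):
        found = first_error(child, here)
        if found:
            return found
    return None


# Made-up trace: the model produced invalid arguments for a function call.
trace = {
    "kind": "request", "name": "r-1", "children": [
        {"kind": "retrieval", "name": "kb-search", "children": []},
        {"kind": "llm_call", "name": "answer", "children": [
            {"kind": "function_call", "name": "create_ticket",
             "error": "invalid arguments"},
        ]},
    ],
}
view = render(trace)
failing_path = first_error(trace)
```

Only failures that raised an error show up this way. A retrieval that returned the wrong context without erroring still needs a human to read the chunks, which is why the trace keeps their content.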

How to use it for audit

In regulated workflows, the trace is the audit record. The trace has to:

Be retained for the required period
Be tamper-evident
Be searchable by case, user, time
Be exportable for regulator review
Be redacted appropriately when shared

The audit-grade trace is more demanding than the debug-grade trace. The same infrastructure can serve both with appropriate retention and access controls.
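
One common way to make a trace log tamper-evident is a hash chain, in which each entry's hash covers the previous entry's hash. The following is a minimal sketch of that technique, offered as an illustration and not as the only acceptable design; the function names and record fields are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, record: dict) -> str:
    """Hash a record together with the hash of the entry before it."""
    body = json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("ascii") + body).hexdigest()


def append(chain: list[dict], record: dict) -> None:
    """Append a record, linking it to the current end of the chain."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({"record": record, "prev_hash": prev_hash,
                  "hash": _digest(prev_hash, record)})


def verify(chain: list[dict]) -> bool:
    """Recompute every link; an edited, removed or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True


chain: list[dict] = []
append(chain, {"case": "case-1", "decision": "approve", "reviewer": "reviewer-7"})
append(chain, {"case": "case-2", "decision": "reject", "reviewer": "reviewer-3"})
intact = verify(chain)

chain[0]["record"]["decision"] = "reject"   # simulate after-the-fact tampering
tampered_detected = not verify(chain)
```

The chain only proves integrity relative to its latest hash, so that hash should be stored somewhere the log writer cannot modify. Otherwise truncating the end of the chain, or rewriting it wholesale, goes unnoticed.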

How to use it for improvement

The trace is the substrate for system improvement:

Failure analysis — which inputs produce which failures
Cost analysis — which workloads consume which resources
Latency analysis — where the time goes
Quality analysis — where the outputs fall short

Patterns surface from aggregate analysis that single incidents never reveal. The team that uses traces this way improves the system continuously; the team that uses traces only for incident response does not.
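
Aggregate analysis can start as simple grouping over flattened call records. The following is a minimal sketch; the `workload`, `cost_usd` and `failed_stage` fields are assumptions about what has been extracted from the traces, and the records are made up.

```python
from collections import Counter, defaultdict


def cost_by_workload(calls: list[dict]) -> dict[str, float]:
    """Total cost per workload."""
    totals: dict[str, float] = defaultdict(float)
    for call in calls:
        totals[call["workload"]] += call["cost_usd"]
    return dict(totals)


def failures_by_stage(calls: list[dict]) -> Counter:
    """Count failures per (workload, stage) pair, skipping successful calls."""
    return Counter((call["workload"], call["failed_stage"])
                   for call in calls if call.get("failed_stage"))


calls = [
    {"workload": "support", "cost_usd": 0.25, "failed_stage": None},
    {"workload": "support", "cost_usd": 0.5, "failed_stage": "retrieval"},
    {"workload": "search", "cost_usd": 0.5, "failed_stage": "retrieval"},
    {"workload": "support", "cost_usd": 0.25, "failed_stage": "retrieval"},
]
costs = cost_by_workload(calls)
top_failure = failures_by_stage(calls).most_common(1)[0]
```

In this made-up data the most frequent failure is retrieval in the support workload, a pattern that no single incident would show.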

What we keep seeing

Recurring patterns in production LLM observability:

Retrofit is expensive. Teams that didn't capture full traces from the start spend significant effort retrofitting. Plan it from day one.

Storage cost is the trap. Capturing everything is expensive; sampling poorly loses the cases that matter. The right sampling policy — keep all errors, sample success — is not obvious initially.

Trace tool sprawl. Teams end up with conventional logs, a specialised LLM observability tool, and the cloud provider's logs. Reconciling them takes effort.

The most-used view is the per-request trace. A clean view of "what happened in this request" with prompts, retrievals, function calls, and outputs visible is what the team will use daily. Optimise for it.

Aggregate analysis is underused. The trace data could answer many improvement questions, but the team has to invest in the analysis tooling and the analyst time. Many teams don't, and miss the improvement signal in their own data.

What we recommend

For an enterprise team building LLM observability in 2024:

Capture full prompts, full responses, retrieval results, and function calls from day one. Retrofitting is harder than starting right.
Structure traces hierarchically. Session → request → call → retrieval/function call.
Use OpenTelemetry or a compatible standard. Lock-in here is painful.
Pick a storage strategy that scales. Hybrid is common; a pure log aggregator gets expensive at volume.
Monitor metrics with thresholds. Traces are the deep dive; metrics are the daily signal.
Build aggregate analysis tooling. The improvement signal is in the data; extract it.
Treat audit-grade retention as a first-class requirement in regulated workloads.

LLM systems without observability are systems you cannot debug, audit, or improve. The infrastructure is more demanding than conventional observability but no more demanding than the work is worth. The teams that build it well operate confidently; the teams that skip it operate reactively.
