Enterprise Integration Monitoring
Most integration monitoring tells you that processes are running. Useful integration monitoring tells you whether the business is being served. An operational guide to the metrics that matter, the alerts that should fire, and the dashboards an operations engineer can actually use.
Most integration monitoring in production tells you that processes are running. Server is up. Queue depth is below threshold. Last batch completed within window. This is technical monitoring; it does not tell you whether the business is being served.
Useful integration monitoring answers a different question: are the integrations doing what the business depends on them to do? This piece walks through the practical shape of integration monitoring that operations engineers can use during an incident and that the business can read during a review.
The three layers of integration monitoring
Effective monitoring works in three layers, each answering a different question.
Platform health. Is the integration runtime functional? Are the servers up, the messaging tier responsive, the database connections healthy, the disk space adequate? This is the layer most monitoring tools handle by default. It is necessary but not sufficient.
Process health. Are the individual integrations completing? How many succeeded, how many failed, what is the latency distribution? This layer is harder because it requires the monitoring to understand integration boundaries — what counts as an integration, when one starts, when it ends. Most platforms expose enough hooks to make this visible; many estates do not wire those hooks up.
Business health. Is the business outcome the integration was built for actually being served? An integration that "succeeds" by completing its flow without errors but produces no records downstream is failing in the only way that matters. Business health requires the monitoring to look beyond the integration runtime into the systems on either side.
Estates that operate well monitor all three. Estates that operate poorly stop at platform health and discover business-health problems through stakeholder complaints.
What to measure per integration
For every integration in production, we recommend the following metrics:
Volume. How many messages, documents, or batch items did this integration process in the last interval (hour, day, week)? Each integration should have a known expected range; anything outside that range is a signal even when nothing has technically failed. An integration that processed 10,000 orders yesterday and 200 today did not crash; it stopped being given work, and that is just as important to know quickly.
Latency distribution. Not just average latency — the full distribution, with p50, p95, p99, and max. Most integration latency problems hide in the tail. The integration whose p50 is 50ms and p99 is 8000ms has a problem that average reporting will never reveal.
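As a minimal sketch of why the tail matters, the snippet below summarises a batch of latency samples with Python's standard `statistics` module. The function name `latency_summary` and the sample data are illustrative assumptions, chosen to mirror the 50ms / 8000ms example above; a real estate would normally read these percentiles from its observability stack rather than compute them by hand.

```python
import statistics

def latency_summary(samples_ms):
    """Return mean, p50, p95, p99 and max for a list of latency samples (ms)."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=100) returns the 99 cut points p1..p99 (index 0 is p1).
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(samples_ms),
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
        "max": max(samples_ms),
    }

# Hypothetical data: 98 fast calls and 2 very slow ones.
samples = [50.0] * 98 + [8000.0] * 2
summary = latency_summary(samples)
print(summary)
```

With this data the mean sits at a few hundred milliseconds and p50 and p95 look healthy, while p99 and max expose the slow calls.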
Error rate, classified. Not just "did it fail" but "what kind of failure." Transport errors (network timeouts, gateway 502s), authentication errors, schema validation errors, business logic errors, downstream system errors — each maps to a different remediation path. Lumping them together as "errors" produces alerts that nobody can act on.
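One way to make the classification concrete is a small mapping from raw failures to the categories listed above. This is a sketch under assumptions: the failure record shape (`kind`, `http_status`) and the mapping rules are hypothetical, and a real platform would classify from its own error codes and exception types.

```python
from collections import Counter
from enum import Enum

class ErrorClass(Enum):
    TRANSPORT = "transport"
    AUTH = "authentication"
    SCHEMA = "schema_validation"
    BUSINESS = "business_logic"
    DOWNSTREAM = "downstream_system"
    UNKNOWN = "unknown"

def classify(failure):
    """Map a failure record (hypothetical dict shape) to an ErrorClass."""
    kind = failure.get("kind")
    status = failure.get("http_status")
    if kind == "timeout" or status == 502:
        return ErrorClass.TRANSPORT
    if status in (401, 403):
        return ErrorClass.AUTH
    if kind == "schema":
        return ErrorClass.SCHEMA
    if kind == "business_rule":
        return ErrorClass.BUSINESS
    if status is not None and 500 <= status < 600:
        # Assumption: any other 5xx is attributed to the downstream system.
        return ErrorClass.DOWNSTREAM
    return ErrorClass.UNKNOWN

failures = [
    {"kind": "timeout"},
    {"http_status": 502},
    {"http_status": 401},
    {"kind": "schema"},
    {"http_status": 500},
    {},
]
counts = Counter(classify(f) for f in failures)
for error_class, count in counts.items():
    print(error_class.value, count)
```

Keeping an explicit `UNKNOWN` bucket is a deliberate choice in this sketch: a growing unknown count shows that the classification rules need extending.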
Retry success rate. How often does the retry succeed? An integration with 5% transport errors that all succeed on retry is operating fine. An integration with 5% transport errors that retry to permanent failure is in distress. The retry success metric distinguishes them.
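A rough sketch of the calculation, assuming the platform can export an attempt log as `(message_id, attempt_number, succeeded)` tuples. That log shape and the function name are assumptions for illustration, not a particular product's format.

```python
def retry_success_rate(attempts):
    """Fraction of messages that failed on attempt 1 but succeeded on a retry.

    Returns None when no message needed a retry in the interval.
    """
    first_failed = set()
    recovered = set()
    for message_id, attempt_number, succeeded in attempts:
        if attempt_number == 1 and not succeeded:
            first_failed.add(message_id)
        elif attempt_number > 1 and succeeded:
            recovered.add(message_id)
    if not first_failed:
        return None
    return len(recovered & first_failed) / len(first_failed)

# Hypothetical log: "a" succeeds first time, "b" recovers on retry,
# "c" exhausts its retries.
attempts = [
    ("a", 1, True),
    ("b", 1, False), ("b", 2, True),
    ("c", 1, False), ("c", 2, False), ("c", 3, False),
]
rate = retry_success_rate(attempts)
print(rate)
```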
Dead-letter queue depth. Items that exhausted their retries and were quarantined. The DLQ is the single most useful operational metric for many integrations; an empty DLQ suggests current operations are healthy, a growing DLQ means human intervention is needed.
Business outcome metrics. Tied to the specific integration's purpose. An order-routing integration's most important metric might be "orders submitted within five minutes of receipt"; an invoice integration's might be "invoices acknowledged by the partner within the contracted window." These metrics encode the business commitment the integration exists to fulfill.
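The order-routing example can be sketched as a simple on-time fraction. The record shape, a `(received_at, submitted_at)` pair with `None` for orders not yet submitted, is an assumption; in practice these timestamps usually come from the systems on either side of the integration rather than from the integration runtime itself.

```python
from datetime import datetime, timedelta

def on_time_fraction(orders, window=timedelta(minutes=5)):
    """Fraction of orders submitted within `window` of receipt.

    Orders with no submission timestamp count as not on time.
    Returns None when there are no orders in the interval.
    """
    if not orders:
        return None
    on_time = sum(
        1
        for received_at, submitted_at in orders
        if submitted_at is not None and submitted_at - received_at <= window
    )
    return on_time / len(orders)

t0 = datetime(2024, 1, 1, 9, 0)  # hypothetical receipt time
orders = [
    (t0, t0 + timedelta(minutes=2)),  # on time
    (t0, t0 + timedelta(minutes=5)),  # on time (exactly at the limit)
    (t0, t0 + timedelta(minutes=9)),  # late
    (t0, None),                       # never submitted
]
fraction = on_time_fraction(orders)
print(fraction)
```

Counting unsubmitted orders as misses is the important design choice here: an integration that silently drops work should pull this metric down, not disappear from it.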
What to alert on
Most estates over-alert. Pages fire on platform-health signals that have no business consequence; engineers learn to ignore them; the page that matters arrives in the same channel and is missed.
Useful alerting is selective:
- Alert on volume falling below the expected floor. An integration that should process 100 orders an hour and processes 0 for two hours is a P1.
- Alert on error rate sustained above threshold. Not on a single failure (which retry will handle), but on sustained failures suggesting a systemic problem.
- Alert on DLQ growth. Items accumulating in the dead-letter queue indicate human intervention is needed; alerting on growth (not just on depth) catches problems while they are still small.
- Alert on p99 latency sustained above the SLA. Tail latency problems often grow gradually; alerting on p99 catches them while the p50 still looks fine.
- Alert on business outcome metric breaches. When the integration's business commitment is at risk, page someone who can intervene.
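Three of the rules above (volume floor, sustained error rate, DLQ growth) can be sketched as one evaluation over per-interval series. The function names, thresholds, and the two-interval default are illustrative assumptions; most observability stacks express the same idea through a "for N minutes" clause on the alert rule.

```python
def sustained(values, predicate, intervals):
    """True if the most recent `intervals` values all satisfy `predicate`."""
    recent = values[-intervals:]
    return len(recent) == intervals and all(predicate(v) for v in recent)

def evaluate_alerts(hourly_volume, error_rates, dlq_depths, *,
                    volume_floor, error_threshold, sustain=2):
    """Return the names of alerts that should fire for one integration."""
    alerts = []
    if sustained(hourly_volume, lambda v: v < volume_floor, sustain):
        alerts.append("volume_below_floor")
    if sustained(error_rates, lambda r: r > error_threshold, sustain):
        alerts.append("error_rate_sustained")
    # DLQ growth: depth rose across each of the last `sustain` intervals,
    # regardless of how deep the queue is in absolute terms.
    recent = dlq_depths[-(sustain + 1):]
    if len(recent) == sustain + 1 and all(b > a for a, b in zip(recent, recent[1:])):
        alerts.append("dlq_growing")
    return alerts

fired = evaluate_alerts(
    hourly_volume=[100, 0, 0],      # two empty hours in a row
    error_rates=[0.01, 0.20, 0.01], # one spike that cleared on its own
    dlq_depths=[3, 4, 6],           # small but steadily growing
    volume_floor=50,
    error_threshold=0.05,
)
print(fired)
```

In this example the single error spike does not page anyone, while the empty hours and the growing DLQ do.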
The principle: every alert that fires should require someone to do something. Alerts that fire and clear without intervention are training the team to ignore the next alert.
The dashboards that hold up
The dashboards we see operations teams actually use during incidents follow a pattern:
Tier 1 — Estate overview. One screen, suitable for a wall display. Per integration: green / amber / red status against its SLA, volume vs expected, error rate, latency. An operations engineer scanning this screen can see in five seconds which integrations need attention.
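The green / amber / red rollup can be sketched as a small function per integration. The policy below (red on any breach, amber within 80% of a limit) is an assumption for illustration only; each estate will set its own thresholds against its own SLAs.

```python
def integration_status(volume, expected_range, error_rate, p99_ms, *,
                       error_limit, latency_limit_ms, amber_margin=0.8):
    """Roll an integration's current metrics up to green / amber / red."""
    low, high = expected_range
    breaches = [
        not (low <= volume <= high),
        error_rate > error_limit,
        p99_ms > latency_limit_ms,
    ]
    if any(breaches):
        return "red"
    # Hypothetical amber rule: within the limit but past 80% of it.
    warnings = [
        error_rate > amber_margin * error_limit,
        p99_ms > amber_margin * latency_limit_ms,
    ]
    return "amber" if any(warnings) else "green"

limits = {"error_limit": 0.01, "latency_limit_ms": 1000}
healthy = integration_status(950, (800, 1200), 0.002, 400, **limits)
close_to_limit = integration_status(950, (800, 1200), 0.009, 400, **limits)
starved = integration_status(200, (800, 1200), 0.0, 400, **limits)
print(healthy, close_to_limit, starved)
```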
Tier 2 — Per-integration detail. Click into an integration to see the last 24 hours of volume, latency, errors, and DLQ. Annotations for deployments and known events. Links to the integration's runbook. Filter by partner, by document type, by source system.
Tier 3 — Per-message trace. For a specific message or transaction, the complete path through the platform: which services it touched, when it touched them, what each one returned. Correlation ID is the join key; the integration platform must emit it everywhere.
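A minimal sketch of the join, assuming every service emits events carrying a `correlation_id`, a `service` name, a numeric `timestamp`, and a `result`. These field names are hypothetical, and a tracing backend would normally do this grouping for you.

```python
from collections import defaultdict

def build_traces(events):
    """Group events by correlation ID and order each trace by timestamp."""
    traces = defaultdict(list)
    for event in events:
        traces[event["correlation_id"]].append(event)
    for hops in traces.values():
        hops.sort(key=lambda e: e["timestamp"])
    return dict(traces)

# Hypothetical events, arriving out of order from different services.
events = [
    {"correlation_id": "c-1", "service": "router", "timestamp": 2.0, "result": "200"},
    {"correlation_id": "c-1", "service": "gateway", "timestamp": 1.0, "result": "accepted"},
    {"correlation_id": "c-2", "service": "gateway", "timestamp": 1.5, "result": "accepted"},
    {"correlation_id": "c-1", "service": "partner-adapter", "timestamp": 3.0, "result": "timeout"},
]
traces = build_traces(events)
path = [hop["service"] for hop in traces["c-1"]]
print(path)
```

The sketch only works if the ID is present on every event; a single service that drops it breaks the trace at that hop.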
The dashboards that fail are usually too dense (the operations engineer cannot find what they need) or too sparse (the metric exists but only on a different dashboard). The pattern that works is hierarchical — overview → detail → trace — with each level designed for a specific operational question.
The runbook commitment
Monitoring is necessary; alerting is necessary; but neither is sufficient without a runbook. Every alert that can fire should have a runbook entry that tells the receiving engineer:
- What this alert means
- What systems are affected
- The first three diagnostic steps
- The escalation path if the diagnostic steps do not resolve it
- The known remediation patterns for the common causes
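The five items above can be held as a structured record so that incomplete entries are easy to spot. This is a sketch, not a prescribed format: the class, field names, and the sample entry are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class RunbookEntry:
    alert_name: str
    meaning: str
    affected_systems: list
    diagnostic_steps: list
    escalation_path: str
    known_remediations: list = field(default_factory=list)

    def gaps(self):
        """List which of the five required parts are missing or too thin."""
        problems = []
        if not self.meaning.strip():
            problems.append("meaning")
        if not self.affected_systems:
            problems.append("affected_systems")
        if len(self.diagnostic_steps) < 3:
            problems.append("diagnostic_steps")
        if not self.escalation_path.strip():
            problems.append("escalation_path")
        if not self.known_remediations:
            problems.append("known_remediations")
        return problems

# Hypothetical entry that is still missing steps and remediations.
entry = RunbookEntry(
    alert_name="dlq_growing",
    meaning="Items are exhausting retries faster than they are cleared.",
    affected_systems=["order-routing"],
    diagnostic_steps=["Inspect the newest DLQ item"],
    escalation_path="Integration on-call, then platform lead",
)
missing = entry.gaps()
print(missing)
```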
The runbook is a living document. Each incident produces a runbook update — the diagnostic step that worked, the cause that turned out to be unexpected, the colleague who knew something useful. Over time, the runbook captures the institutional knowledge that would otherwise live in one or two senior engineers' heads.
Estates without runbook discipline have engineers who can solve every incident because they remember every incident. Estates with runbook discipline have engineers who can solve every incident because the team has captured how. The first model fails the day the senior engineer leaves. The second model survives.
What to invest in first
For an estate with minimal integration monitoring:
- Get every integration into a single observability stack. One place to look. The cost is the integration work to emit telemetry consistently from every integration; the benefit is that operations engineers stop hunting for information across multiple tools.
- Define the volume, latency, error rate, and DLQ metrics for every integration, with per-integration baselines for what "normal" looks like.
- Build the Tier 1 estate overview dashboard. Wall-display ready, integrations grouped by business meaning, current status visible at a glance.
- Convert ad hoc alerts into a runbook-backed alerting policy. Every alert has a runbook; alerts without a runbook are deleted or get one.
For an estate with monitoring but recurring operational pain:
- Audit the alert volume. How many alerts fired in the last month? How many led to action? The ratio tells you whether your alerting is selective or noisy.
- Add business outcome metrics to the integrations that have the highest business consequence. Most estates have technical metrics for everything and business metrics for nothing.
- Tighten the runbook discipline. After every incident, update the runbook. Track runbook coverage as a metric.
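Both audit numbers mentioned above, the share of alerts that led to action and runbook coverage, are simple ratios. The sketch below assumes an exported alert history where each record carries a `led_to_action` flag and a set of alert names that have runbooks; both inputs are hypothetical shapes.

```python
def actionable_ratio(alert_history):
    """Share of fired alerts that led to someone doing something."""
    if not alert_history:
        return None
    acted = sum(1 for alert in alert_history if alert["led_to_action"])
    return acted / len(alert_history)

def runbook_coverage(alert_names, runbook_alert_names):
    """Return (coverage fraction, sorted alert names lacking a runbook)."""
    defined = set(alert_names)
    if not defined:
        return None, []
    missing = sorted(defined - set(runbook_alert_names))
    return (len(defined) - len(missing)) / len(defined), missing

# Hypothetical month of alert history and runbook inventory.
history = [
    {"name": "disk_80_percent", "led_to_action": False},
    {"name": "disk_80_percent", "led_to_action": False},
    {"name": "dlq_growing", "led_to_action": True},
    {"name": "cpu_spike", "led_to_action": False},
]
ratio = actionable_ratio(history)
coverage, uncovered = runbook_coverage(
    {"disk_80_percent", "dlq_growing", "cpu_spike", "volume_below_floor"},
    {"dlq_growing", "volume_below_floor"},
)
print(ratio, coverage, uncovered)
```

A low actionable ratio points at alerts to delete or retune; the uncovered list is the backlog of runbook entries to write.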
Integration monitoring done well is not glamorous. It does not produce keynotes or magazine features. It does produce estates where incidents resolve quickly, the business has visibility, and the operations team can sustainably support a complex estate over many years. That is the actual deliverable.