Intellectual
Enterprise Integration | 28 March 2022 | 8 min read

Building Scalable Integration Platforms

Scaling an integration platform is rarely about throughput. The bottlenecks are almost always in the operating model — partner onboarding capacity, deployment cadence, observability coverage, and the senior-engineer concentration that nobody planned for.

The question that arrives most often in capacity-planning conversations for integration platforms is some version of "how many transactions per second can this thing handle." It is rarely the right question. By the time an integration platform genuinely runs out of throughput headroom, it has usually been operating beyond its operating-model headroom for years.

This piece is about the actual scaling work in an enterprise integration estate — the constraints that bind first, what to do about them, and where pure throughput finally becomes the limiting factor.

The four scaling axes

When clients ask us to "scale the integration platform," we work through four axes in order:

Partner-count scaling. How many trading partners or external integrations can the platform onboard per quarter? This is the most common binding constraint for B2B-heavy estates. Throughput per partner may be modest; the operational work to bring each partner on, configure them correctly, monitor them, and respond to their exceptions accumulates linearly.

Integration-count scaling. How many distinct integrations can the platform support before they start interfering with each other operationally? This is the constraint that hits product-organisation estates. Each integration is small; the cognitive load of operating two hundred of them is the bottleneck, not the runtime cost.

Deployment-cadence scaling. How quickly can the team ship a change to production? An estate that can deploy weekly scales differently from one that can deploy quarterly, regardless of underlying platform capacity.

Transaction-throughput scaling. How many messages per second, transactions per minute, or batches per night can the runtime actually handle? This is the axis people usually mean when they say "scale," and it is usually the last one to bind.

The order matters. Estates that hit partner-count or integration-count walls and react by scaling throughput end up with a faster platform that still cannot onboard the next partner. The throughput investment was wasted relative to the actual constraint.

Partner-count scaling — the work

For trading-network and B2B-heavy estates, partner onboarding is the constraint that bites first. At two engineering-weeks per partner, one engineer onboards roughly six partners in a thirteen-week quarter, so a team of four caps at roughly twenty-five partners per quarter; an estate growing past that needs intervention.
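The ceiling is simple arithmetic. A minimal sketch, assuming a 13-week quarter and engineers who do nothing but onboarding; both are simplifying assumptions, and the function names are illustrative:

```python
import math

WEEKS_PER_QUARTER = 13  # assumption: a standard 13-week quarter


def partners_per_quarter(engineers: int, weeks_per_partner: float) -> float:
    """Upper bound on partners onboarded per quarter, ignoring all other work."""
    if weeks_per_partner <= 0:
        raise ValueError("weeks_per_partner must be positive")
    return engineers * WEEKS_PER_QUARTER / weeks_per_partner


def engineers_needed(target_partners: int, weeks_per_partner: float) -> int:
    """Smallest whole number of engineers that reaches the target."""
    return math.ceil(target_partners * weeks_per_partner / WEEKS_PER_QUARTER)


print(partners_per_quarter(engineers=1, weeks_per_partner=2))   # 6.5
print(partners_per_quarter(engineers=4, weeks_per_partner=2))   # 26.0
print(engineers_needed(target_partners=100, weeks_per_partner=2))  # 16
```

Under these assumptions, reaching a hundred partners per quarter by headcount alone would take sixteen engineers, which is why the interventions below target effort per partner instead.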

The interventions that work:

  • A partner onboarding workflow with named stages, owners per stage, and an SLA per stage. The work is decomposed; no single engineer is the bottleneck for every step.
  • A partner agreement template that captures the standard configuration and identifies the variations. Most partners fit the template; the variations are the genuinely different work.
  • A document mapping library that consists of standard maps for common EDI document types (X12 850, X12 855, EDIFACT ORDERS, EDIFACT INVOIC, etc.). New partners inherit from the standard map and override per field; ad-hoc map-building per partner is what does not scale.
  • A partner onboarding portal that lets partners self-configure the non-technical parts (contact info, notification preferences, document type selections, test cases). This single change can cut the engineering effort per partner by 40-60%.
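The "inherit from the standard map and override per field" idea in the mapping-library bullet can be sketched with layered dictionaries. This is an illustration only: the canonical field names, the element references, and the partner are hypothetical, and a real mapping tool will have its own representation.

```python
from collections import ChainMap

# Hypothetical standard map for an X12 850 purchase order:
# canonical field name -> source element reference (illustrative values only).
STANDARD_X12_850 = {
    "po_number": "BEG03",
    "po_date": "BEG05",
    "buyer_id": "N104",
    "line_quantity": "PO102",
    "line_unit_price": "PO104",
}


def partner_map(standard: dict, overrides: dict) -> ChainMap:
    """Layer per-field overrides on top of the standard map.

    Lookups hit the overrides first and fall back to the standard map,
    so a partner records only the fields where it genuinely differs.
    """
    unknown = set(overrides) - set(standard)
    if unknown:
        raise KeyError(f"override for fields not in standard map: {sorted(unknown)}")
    return ChainMap(overrides, standard)


# Hypothetical partner that sends its buyer identifier in a different element.
acme = partner_map(STANDARD_X12_850, {"buyer_id": "REF02"})
print(acme["buyer_id"])   # REF02 (overridden)
print(acme["po_number"])  # BEG03 (inherited)
```

Rejecting overrides for unknown fields keeps partner-specific maps from quietly growing into ad-hoc maps of their own.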

We have brought trading-network estates from twenty-five partners per quarter to over a hundred per quarter with these four interventions and no platform change.

Integration-count scaling — the work

For estates that run hundreds of integrations rather than hundreds of partners, the constraints look different. The platform handles the runtime work fine; the operations team cannot keep up.

The interventions:

  • A shared services library so that every integration uses common error handling, common audit emission, common retry policy. The cognitive load of operating an integration drops dramatically when you can predict its behaviour from the library.
  • A canonical message vocabulary so that integration A's output is integration B's input without translation. New integrations compose against the canonical model rather than against each endpoint's idiosyncratic format.
  • Naming conventions that make the namespace navigable. An engineer encountering an integration named order.purchaseOrder.create.standard can place it without explanation.
  • Per-integration runbooks as a deliverable, not as an afterthought. Operations engineers can support an integration they did not build because the runbook is genuinely useful.
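As one illustration of the shared services library, common retry policy and audit emission can live in a single wrapper that every integration step uses. This is a sketch under assumptions: the logger name, the defaults, and the decorator name are hypothetical, and a real platform would usually provide its own equivalents.

```python
import functools
import logging
import time

audit_log = logging.getLogger("integration.audit")  # hypothetical logger name


def shared_policy(name: str, max_attempts: int = 3, base_delay: float = 0.5,
                  sleep=time.sleep):
    """Wrap an integration step with common retry, backoff, and audit emission."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    result = func(*args, **kwargs)
                    audit_log.info("%s succeeded on attempt %d", name, attempt)
                    return result
                except Exception as exc:
                    audit_log.warning("%s failed on attempt %d: %s", name, attempt, exc)
                    if attempt == max_attempts:
                        raise  # retries exhausted: surface the original error
                    # Exponential backoff: base_delay, then 2x, 4x, ...
                    sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator


@shared_policy("order.purchaseOrder.create.standard")
def create_purchase_order(payload: dict) -> dict:
    # Placeholder body; a real step would call the target system here.
    return {"status": "created", "po_number": payload["po_number"]}
```

Because every step fails, retries, and logs the same way, an operations engineer can predict an unfamiliar integration's behaviour from the policy rather than from its code.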

The cost of these interventions is largely upfront engineering work. The benefit is that adding the two-hundred-and-first integration is no more expensive than adding the fiftieth.
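A naming convention stays navigable only if it is enforced. The check below is a sketch that assumes the four segments of a name such as order.purchaseOrder.create.standard read as domain, entity, action, and variant; that interpretation is an assumption made for illustration.

```python
import re
from typing import NamedTuple

# Assumption: four dot-separated segments read as domain.entity.action.variant.
NAME_PATTERN = re.compile(
    r"^(?P<domain>[a-z][A-Za-z0-9]*)\."
    r"(?P<entity>[a-z][A-Za-z0-9]*)\."
    r"(?P<action>[a-z][A-Za-z0-9]*)\."
    r"(?P<variant>[a-z][A-Za-z0-9]*)$"
)


class IntegrationName(NamedTuple):
    domain: str
    entity: str
    action: str
    variant: str


def parse_integration_name(name: str) -> IntegrationName:
    """Split a conforming name into its parts, or reject it."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"name does not follow the convention: {name!r}")
    return IntegrationName(**match.groupdict())


print(parse_integration_name("order.purchaseOrder.create.standard"))
```

Run as a pipeline check, a validator like this rejects a non-conforming name before the integration reaches production.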

Deployment-cadence scaling — the work

Some estates are limited by deployment cadence rather than by runtime capacity. The team can build the integrations; they cannot push them to production fast enough.

The interventions:

  • Deployment pipelines that are actually automated. Most "automated" pipelines we audit have manual gates that are routinely held up. A genuinely automated pipeline produces a deployment to production within hours of merge, with the manual gates only where compliance requires them.
  • Environment management that is rigorous. Dev / test / staging / production environments that drift in configuration produce deployment failures that have nothing to do with code changes. Configuration-as-code, with the same template applied to each environment with parameters, eliminates the drift.
  • Automated regression testing at the integration level. Manual regression testing on every release caps deployment cadence at the speed of the QA team. Automated regression in the pipeline lets the QA team focus on exploratory testing of new features.
  • Feature flagging for partner-facing changes. New behaviour rolls out to a small partner subset first, with monitoring, before full activation. The deployment can ship without the partner risk holding it back.
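Rolling a change out to a small partner subset is often done with a deterministic hash, so that a partner's cohort does not change between requests. A minimal sketch; the flag name and partner identifiers are hypothetical, and a dedicated feature-flag service would normally replace this:

```python
import hashlib


def in_rollout(flag: str, partner_id: str, percent: int) -> bool:
    """Deterministically place a partner inside or outside a rollout cohort.

    Hashing the flag together with the partner gives each flag its own
    stable cohort, so the same partners are not always first in line.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(f"{flag}:{partner_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return bucket < percent


partners = [f"partner-{n}" for n in range(1000)]
early = [p for p in partners if in_rollout("new-ack-format", p, percent=5)]
print(len(early))  # roughly 5% of partners; the exact count depends on the hash
```

Raising the percentage only ever adds partners to the cohort, so a rollout can widen without flipping anyone back to the old behaviour.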

Many estates that look throughput-bound at the platform level are actually deployment-cadence-bound at the team level. The interventions are not really about the platform; they are about the operating model around it.
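The configuration-as-code point from the list above, one template applied to each environment with parameters, can be sketched as follows. The settings keys, hosts, and values are hypothetical; real estates typically use their platform's own templating or an infrastructure-as-code tool.

```python
from string import Template

# Hypothetical settings file; keys, hosts, and values are illustrative only.
GATEWAY_TEMPLATE = Template(
    "gateway_url=https://$host/inbound\n"
    "max_connections=$max_connections\n"
    "log_level=$log_level\n"
)

# Only these parameters may differ between environments.
ENVIRONMENTS = {
    "dev": {"host": "gateway.dev.example.com", "max_connections": 5, "log_level": "DEBUG"},
    "test": {"host": "gateway.test.example.com", "max_connections": 10, "log_level": "INFO"},
    "staging": {"host": "gateway.staging.example.com", "max_connections": 50, "log_level": "INFO"},
    "production": {"host": "gateway.example.com", "max_connections": 50, "log_level": "WARNING"},
}


def render(environment: str) -> str:
    """Render the one shared template with a single environment's parameters.

    Template.substitute raises KeyError when a parameter is missing, so an
    incomplete environment fails at render time instead of drifting silently.
    """
    return GATEWAY_TEMPLATE.substitute(ENVIRONMENTS[environment])


print(render("staging"))
```

Because the structure lives in one template, any difference between environments is visible as a parameter value rather than buried in a hand-edited file.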

Transaction-throughput scaling — when it actually binds

Eventually throughput does become the binding constraint. The runtime is genuinely saturated, the queues are backing up, the SLAs are at risk. At this point the architectural choices that have been deferred come due.

What works:

  • Horizontal scale at the runtime level. Modern integration platforms (webMethods Hybrid Integration, MuleSoft Anypoint Runtime Fabric, Boomi Molecule) support running multiple runtime instances behind a load balancer with shared state. Adding capacity is a configuration change, not a redesign.
  • Asynchronous patterns where the use case allows. A synchronous integration that takes 200 ms per call completes five calls per second per worker thread, so even with dozens of threads it caps at a few hundred per second per runtime instance. The same workload split into "submit asynchronously, process from queue" can scale to thousands per second on the same hardware.
  • Channel partitioning at the messaging tier. A single Universal Messaging channel handling all order traffic from all regions caps at the channel's throughput. Splitting by region (or by document type, or by partner tier) lets each partition scale independently.
  • Caching for read-heavy reference data. Integrations that look up the same reference data on every call (product catalogues, partner directories, organisational hierarchies) benefit hugely from a cache that the integration platform owns.
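Channel partitioning comes down to a routing decision made before a message is published. The sketch below routes by region; the channel names are hypothetical, and the per-partner shard is an optional further split added here as an assumption, not something the list above prescribes.

```python
import zlib

# Hypothetical channel names; a real estate would use its own topology.
REGION_CHANNELS = {
    "emea": "orders.emea",
    "amer": "orders.amer",
    "apac": "orders.apac",
}


def channel_for(region: str) -> str:
    """Pick the channel for a region; an unknown region raises KeyError."""
    return REGION_CHANNELS[region.lower()]


def shard_for(partner_id: str, shards: int) -> int:
    """Stable shard index, so one partner's messages stay on one shard in order."""
    if shards < 1:
        raise ValueError("shards must be at least 1")
    return zlib.crc32(partner_id.encode("utf-8")) % shards


print(channel_for("EMEA"))  # orders.emea
print(shard_for("partner-42", shards=8))
```

Failing loudly on an unknown region is deliberate: a silent default channel would recreate the single shared bottleneck.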

The architectural moves are well-understood. They require commitment from the architecture team that asynchronous patterns are acceptable for the workload and capacity from the platform team to operate the partitioned topology.
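Of the four moves, the reference-data cache is the simplest to illustrate. The following is a minimal read-through cache with expiry, assuming single-threaded use; the class name and the loader are hypothetical, and most integration platforms ship a caching facility that would be used instead.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class ReferenceDataCache:
    """Read-through cache with per-entry expiry. Not thread-safe."""

    def __init__(self, loader: Callable[[Hashable], Any], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get(self, key: Hashable) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: no backend call
        value = self._loader(key)  # miss or expired: go to the system of record
        self._entries[key] = (now + self._ttl, value)
        return value


def load_product(sku: str) -> dict:
    # Placeholder for a call to the product catalogue.
    return {"sku": sku, "name": f"Product {sku}"}


catalogue = ReferenceDataCache(load_product, ttl_seconds=300)
catalogue.get("A-100")  # loads from the backend
catalogue.get("A-100")  # served from the cache for the next five minutes
```

The time-to-live is the design decision that matters: it bounds how stale reference data can be, so it should be set per data set rather than globally.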

Capacity planning that holds up

Capacity planning for integration platforms is almost always done badly. The pattern we see most often is: someone sums up the "expected transactions per day" across all integrations, divides by 86,400 seconds, and concludes that the platform needs to handle X transactions per second.

This number is almost always wrong, in both directions.

It is too low because traffic is not uniform: there will be a peak hour that is many times the average, and the platform needs to handle that peak, not the average. Real workloads have ratios of 10:1 or higher between peak and average.

It is too high because not all transactions are equal: a 50-byte message routed straight through can cost a thousandth as much as a 5 MB document going through XML schema validation, transformation, and partner-specific signing. Adding them all up as "transactions" produces nonsense numbers.
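Both errors are easy to see with numbers. The figures below are invented for illustration: a daily volume, a 10:1 peak-to-average ratio, and per-message costs chosen to differ by a factor of a thousand.

```python
SECONDS_PER_DAY = 86_400


def average_tps(transactions_per_day: float) -> float:
    """The naive figure: daily volume spread evenly over the day."""
    return transactions_per_day / SECONDS_PER_DAY


def peak_tps(transactions_per_day: float, peak_to_average: float) -> float:
    """The rate the platform must actually absorb during the busiest period."""
    return average_tps(transactions_per_day) * peak_to_average


def cpu_seconds_per_day(messages_per_day: float, cpu_ms_per_message: float) -> float:
    """Daily processing cost of one class of message."""
    return messages_per_day * cpu_ms_per_message / 1000


# Illustrative figures only.
print(average_tps(8_640_000))                  # 100.0
print(peak_tps(8_640_000, peak_to_average=10)) # 1000.0

small = cpu_seconds_per_day(1_000_000, cpu_ms_per_message=0.5)  # 500.0
large = cpu_seconds_per_day(10_000, cpu_ms_per_message=500)     # 5000.0
print(small, large)
```

In this example the large documents are one percent of the transaction count and ten times the processing cost, which is why a single summed "transactions" figure says little about required capacity.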

Capacity planning that works:

  • Identify the half-dozen highest-volume integrations and characterise their peak behaviour separately
  • Identify the most expensive integrations (by CPU, memory, or wall-clock per message) and characterise their cost
  • Estimate the worst-case overlap (what happens if the daily peak of integration A coincides with the daily peak of integration B?)
  • Add capacity headroom — we use 40% as a starting point — to absorb unexpected spikes

This is more work than dividing by 86,400. It produces capacity estimates that survive contact with production.
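The four steps can be combined into a first-pass estimate. This sketch assumes capacity is measured in CPU cores and that the worst case is every profiled peak coinciding; the integration names and figures are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class IntegrationProfile:
    name: str
    peak_msgs_per_sec: float
    cpu_ms_per_msg: float

    @property
    def peak_cores(self) -> float:
        # msgs/s * ms/msg / 1000 = CPU-seconds per second, i.e. cores at peak
        return self.peak_msgs_per_sec * self.cpu_ms_per_msg / 1000.0


def required_cores(profiles: Iterable[IntegrationProfile], headroom: float = 0.40) -> float:
    """Worst case: assume every profiled peak coincides, then add headroom."""
    worst_case = sum(p.peak_cores for p in profiles)
    return worst_case * (1.0 + headroom)


# Hypothetical profiles: one high-volume cheap flow, one low-volume expensive flow.
profiles = [
    IntegrationProfile("order.status.publish.standard", peak_msgs_per_sec=400, cpu_ms_per_msg=5),
    IntegrationProfile("invoice.document.sign.partner", peak_msgs_per_sec=2, cpu_ms_per_msg=1500),
]
print(round(required_cores(profiles), 1))  # about 7 cores with 40% headroom
```

Summing every peak is deliberately pessimistic; if the peaks are known never to overlap, taking the largest overlapping subset gives a tighter estimate.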

What we recommend

When a client says "we need to scale the integration platform," our first conversation is about which axis is binding. The answer is almost never throughput in the first instance. The interventions that produce the largest delta are usually in partner-onboarding workflow, shared-services library, deployment-pipeline automation, and observability — none of which are the answer to "increase TPS."

When throughput is genuinely the constraint, the moves are architectural rather than buy-more-hardware: horizontal scale, asynchronous patterns, channel partitioning, caching. The estates that handle these moves well are the ones that committed to the operating-model investments first, so the scaled platform is run by a team that can actually operate it.

Throughput is the last thing to bind. The estates that focus on it first usually end up with an unconstrained runtime that still cannot ship the next integration.
