Intellectual
AI & Enterprise AI | 26 March 2024 | 9 min read

Multi-Agent Orchestration — Hype Versus Production Reality

Multi-agent frameworks dominate the AI engineering conversation right now. The patterns that actually ship are narrower, more bounded, and more boring than the demos suggest.

Multi-agent orchestration is the loudest conversation in AI engineering right now. CrewAI, AutoGen, LangGraph, and a parade of newer frameworks all claim to make sophisticated agent collectives easy to assemble. The demos are striking. The production deployments at the enterprises we work with are narrower, more bounded, and more boring than the demos.

This is a practitioner view of where multi-agent patterns actually ship into production, where the hype runs ahead of the reality, and what design discipline matters when the goal is a system that operates reliably, not a system that demos well.

What multi-agent actually means

In current usage, a multi-agent system is one in which multiple LLM-driven roles, each with a defined purpose and a defined set of tools, coordinate to accomplish a task that exceeds what any single agent could handle in one call. The coordination patterns vary:

  • Supervisor-worker — a coordinating agent delegates sub-tasks to specialist agents, integrates the results
  • Pipeline — agents arranged in sequence, each transforming the output of the previous
  • Conversation — agents exchange messages until a goal state is reached
  • Marketplace — agents bid on or claim work from a shared queue
  • Hierarchical — multiple levels of coordination with different scopes

In the research literature and the framework documentation, all of these are presented as viable architectures. In enterprise production, the picture is narrower.

What ships, what doesn't

What ships:

  • Bounded supervisor-worker pipelines where the supervisor is constrained to a defined sequence of sub-tasks. The supervisor is closer to a workflow engine than to a free-form planner.
  • Single-agent systems with tool use where the "agent" is one LLM with the ability to call functions. Not multi-agent in any meaningful sense, but often what teams call multi-agent.
  • Hand-off patterns where a first agent does intake, makes a decision, and hands to one of several specialist agents based on the decision. Effectively a routing agent feeding deterministic downstream paths.
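The hand-off pattern can be sketched in a few lines. This is a minimal illustration, not any framework's API: `classify` is a hypothetical stand-in for the intake agent's LLM call, and the handler names and labels are assumptions made for the example.

```python
from typing import Callable, Dict

# Hypothetical specialist handlers; in a real system each would wrap its own agent.
def handle_billing(request: str) -> str:
    return f"billing: {request}"

def handle_technical(request: str) -> str:
    return f"technical: {request}"

def handle_fallback(request: str) -> str:
    return f"human-review: {request}"

# The set of downstream paths is fixed in code, not decided by the model.
HANDLERS: Dict[str, Callable[[str], str]] = {
    "billing": handle_billing,
    "technical": handle_technical,
}

def route(request: str, classify: Callable[[str], str]) -> str:
    """Run intake once, then follow a deterministic path."""
    label = classify(request).strip().lower()
    # Unknown labels never reach a specialist; they go to a fixed fallback.
    handler = HANDLERS.get(label, handle_fallback)
    return handler(request)

def keyword_classifier(request: str) -> str:
    # Stand-in for an LLM intake call, assumed to return a single label.
    return "billing" if "invoice" in request.lower() else "unknown"
```

The model makes one decision (the label); everything after that is ordinary dispatch, which is what makes the path auditable.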

What does not ship reliably:

  • Free-form conversation patterns where agents debate until consensus. The conversations either reach consensus quickly (in which case the multi-agent design was overhead) or fail to converge (in which case the system fails to produce an answer).
  • Marketplace patterns for production workloads where determinism matters. The non-determinism of "who picks up the work" makes them hard to audit, hard to debug, and hard to keep within a cost budget.
  • Hierarchical agent collectives at any scale beyond two levels. The coordination overhead and the failure modes multiply.

The pattern: multi-agent works when the multi-agentness is a structural property of the workflow — different specialist roles, deterministic hand-offs, bounded scope. It doesn't work when the multi-agentness is emergent — agents negotiating, deciding among themselves, freely planning.

The recurring failure modes

Across the production attempts we have seen:

Loop explosion

Two agents exchange messages, refining a draft. Each refinement triggers a response. Without strict turn limits, the conversation continues indefinitely. Cost climbs. The output, when finally produced, is no better than a single-pass solution would have been.

The fix: hard turn limits, monitored. Anything longer than expected is flagged as an anomaly.
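A hard turn limit with anomaly flagging might look like the sketch below. The `step` callable is hypothetical: it stands in for one agent turn and is assumed to return the next message plus a flag saying whether the goal state was reached. The default limits are illustrative only.

```python
from typing import Callable, List, Tuple

class TurnLimitExceeded(RuntimeError):
    """Raised when an exchange hits the hard cap without finishing."""

def run_exchange(
    step: Callable[[List[str]], Tuple[str, bool]],
    max_turns: int = 6,
    expected_turns: int = 3,
) -> Tuple[List[str], bool]:
    """Run an agent exchange under a hard turn cap.

    Returns the transcript and an anomaly flag, which is True when the run
    finished but took more turns than expected.
    """
    transcript: List[str] = []
    for turn in range(1, max_turns + 1):
        message, done = step(transcript)
        transcript.append(message)
        if done:
            return transcript, turn > expected_turns
    # The cap is a hard stop, not a suggestion to the model.
    raise TurnLimitExceeded(f"no goal state after {max_turns} turns")
```

Separating `expected_turns` from `max_turns` gives two signals: a soft one for monitoring and a hard one that stops the spend.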

Coordination drift

The supervisor agent delegates work, receives results, decides what to do next. Over many cycles, the supervisor's representation of the overall task drifts from what the user asked for. The final answer addresses something subtly different from the original request.

The fix: anchor the supervisor at every step to the original task. Re-state the goal in each prompt. Validate the final output against the original task.
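One way to anchor the supervisor is to build every prompt from the original task text rather than from the supervisor's own running paraphrase. The sketch below assumes a hypothetical `judge` callable (for example, a separate model call or a rule check) that decides whether an answer addresses the task; the prompt layout is illustrative.

```python
from typing import Callable, List

def build_supervisor_prompt(
    original_task: str, completed: List[str], next_question: str
) -> str:
    """Assemble a supervisor prompt that restates the original task verbatim."""
    history = "\n".join(f"- {item}" for item in completed) or "- (nothing yet)"
    return (
        f"ORIGINAL TASK (do not reinterpret):\n{original_task}\n\n"
        f"COMPLETED SO FAR:\n{history}\n\n"
        f"DECIDE NEXT:\n{next_question}"
    )

def validate_final(
    original_task: str, answer: str, judge: Callable[[str, str], bool]
) -> str:
    """Check the final answer against the original task before returning it."""
    if not judge(original_task, answer):
        raise ValueError("final answer does not address the original task")
    return answer
```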

Cost surprises

A multi-agent run involves many model calls. A single user request can produce hundreds of LLM invocations under the hood. The unit cost per request can be ten or twenty times what a single-agent run would cost. Without per-request cost monitoring, the bill becomes the first signal that something is wrong.

The fix: per-request cost budgets enforced as circuit breakers. Granular cost monitoring. Periodic review of cost per request type.
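A per-request budget can be enforced with a small accumulator that every model call reports into. This is a sketch under stated assumptions: the per-token prices are placeholder parameters, not real rates, and the class and method names are invented for the example.

```python
from dataclasses import dataclass

class BudgetExceeded(RuntimeError):
    """Raised when a request has spent more than its budget."""

@dataclass
class CostBudget:
    limit_usd: float
    usd_per_1k_input: float   # illustrative price, not a real rate
    usd_per_1k_output: float  # illustrative price, not a real rate
    spent_usd: float = 0.0
    calls: int = 0

    def record(self, input_tokens: int, output_tokens: int) -> float:
        """Record one model call; trip the breaker if the budget is exceeded."""
        cost = (
            (input_tokens / 1000) * self.usd_per_1k_input
            + (output_tokens / 1000) * self.usd_per_1k_output
        )
        # The spend has already happened, so record it before checking.
        self.spent_usd += cost
        self.calls += 1
        if self.spent_usd > self.limit_usd:
            raise BudgetExceeded(
                f"spent {self.spent_usd:.4f} USD over {self.calls} calls, "
                f"limit {self.limit_usd:.4f} USD"
            )
        return cost
```

The orchestrator would create one `CostBudget` per user request and let `BudgetExceeded` abort the run; the `calls` and `spent_usd` fields double as the per-request metrics for later review.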

Observability collapse

When the system makes a mistake, reconstructing what happened requires inspecting the full sequence of agent interactions. Most frameworks log poorly by default. The trace is incomplete; the prompts and outputs at each step are not all captured; the order of operations is hard to reconstruct.

The fix: instrument the agent system explicitly. Every prompt, every output, every tool call, every state transition logged with consistent identifiers. Treat the trace as a first-class artifact, not an afterthought.
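As a minimal sketch of explicit instrumentation, the recorder below tags every event with one run identifier and a sequence number so the order of operations can be reconstructed later. The event kinds and field names are assumptions for illustration; a production system would write to its existing logging or tracing backend.

```python
import json
import time
import uuid
from typing import Any, Dict, List, Optional

class Trace:
    """Append-only record of everything that happened in one run."""

    def __init__(self, run_id: Optional[str] = None) -> None:
        self.run_id = run_id or uuid.uuid4().hex
        self.events: List[Dict[str, Any]] = []

    def log(self, agent: str, kind: str, payload: Dict[str, Any]) -> Dict[str, Any]:
        """Record one prompt, output, tool call, or state transition."""
        event = {
            "run_id": self.run_id,     # same identifier on every event
            "seq": len(self.events),   # explicit ordering, independent of clocks
            "ts": time.time(),
            "agent": agent,
            "kind": kind,              # e.g. "prompt", "output", "tool_call"
            "payload": payload,
        }
        self.events.append(event)
        return event

    def to_jsonl(self) -> str:
        """Serialise the trace as one JSON object per line."""
        return "\n".join(json.dumps(e, sort_keys=True) for e in self.events)
```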

Brittleness under change

A working multi-agent system depends on specific prompt phrasing, specific tool descriptions, specific examples. Small changes — to one prompt, to one tool's description, to the underlying model version — produce large changes in system behaviour. The system that worked yesterday no longer works today.

The fix: extensive integration testing. Pin everything that can be pinned: model versions, prompts, tool descriptions, examples. Treat a change to any agent as a change to the whole system.
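One way to make "a change to any agent is a change to the whole system" mechanical is to fingerprint the full configuration and require the integration suite to run whenever the fingerprint changes. The sketch below is an assumption about how a team might do this; the configuration shape and the model name are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict

def system_fingerprint(agents: Dict[str, Any]) -> str:
    """Hash every pinned input (model, prompt, tool descriptions) of all agents."""
    # Canonical JSON so key order does not affect the hash.
    canonical = json.dumps(agents, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical configuration; the model identifier is a placeholder.
AGENTS = {
    "researcher": {
        "model": "example-model-v1",
        "prompt": "Search and summarise.",
        "tools": {"search": "Search the web for a query."},
    },
    "writer": {
        "model": "example-model-v1",
        "prompt": "Draft from the research summary.",
        "tools": {},
    },
}
```

A deployment gate could compare `system_fingerprint(AGENTS)` with the value recorded at the last passing integration run and refuse to ship on a mismatch.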

When multi-agent is genuinely worth it

Despite the failure modes, there are cases where the multi-agent pattern actually earns its keep:

Strong specialisation

When the work genuinely has multiple specialist roles — a researcher agent that searches and summarises, a writer agent that drafts based on research, a reviewer agent that checks against a quality bar — the specialisation produces better results than a single generalist agent.

The pattern works because each specialist agent has a smaller, more focused job. The prompt for "search and summarise" is much sharper than the prompt for "do everything." Each agent can be optimised independently.

Long-horizon tasks

For tasks that span more steps than fit usefully in a single context window, multi-agent decomposition is one of the few paths that works. A supervisor coordinates over a longer horizon than any single agent could.

The discipline: clear hand-offs, summarised state passed between agents, no expectation that any single agent has the full history.
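A hand-off that carries summarised state rather than full history could be modelled as below. This is a sketch: `summarise` is a hypothetical callable (typically a model call), and the character cap is an arbitrary illustrative bound.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass(frozen=True)
class Handoff:
    goal: str                    # original task, passed through unchanged
    summary: str                 # bounded summary of work so far
    open_items: Tuple[str, ...]  # what the next agent still has to do

def make_handoff(
    goal: str,
    history: List[str],
    open_items: List[str],
    summarise: Callable[[List[str]], str],
    max_summary_chars: int = 400,
) -> Handoff:
    """Build the only state the next agent receives."""
    summary = summarise(history)
    if len(summary) > max_summary_chars:
        # Hard cap so hand-off size cannot grow with history length.
        summary = summary[: max_summary_chars - 3] + "..."
    return Handoff(goal=goal, summary=summary, open_items=tuple(open_items))
```

Because the receiving agent sees only a `Handoff`, nothing in the design can quietly depend on the full history being available.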

Tool isolation

When different parts of the workload need different tool access — one part needs database access, another needs file-system access, another needs external API access — separating them into agents with separate permission sets is a useful security boundary. Each agent has the minimum privilege required for its role.

This is closer to micro-service architecture than to agent architecture, but the patterns are the same.
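The permission boundary can be enforced outside the model, at the point where tool calls are dispatched. In this sketch the agent names, tool names, and the tools themselves are all hypothetical stubs; a real system would back the same check with separate credentials per agent.

```python
from typing import Callable, Dict, FrozenSet

# Stub tools standing in for real database, file-system, and API access.
def query_db(arg: str) -> str:
    return f"db rows for {arg}"

def read_file(arg: str) -> str:
    return f"contents of {arg}"

def call_api(arg: str) -> str:
    return f"api response for {arg}"

TOOLS: Dict[str, Callable[[str], str]] = {
    "query_db": query_db,
    "read_file": read_file,
    "call_api": call_api,
}

# Each agent gets only the tools its role requires.
ALLOWED: Dict[str, FrozenSet[str]] = {
    "analyst": frozenset({"query_db"}),
    "archivist": frozenset({"read_file"}),
    "integrator": frozenset({"call_api"}),
}

def call_tool(agent: str, tool: str, arg: str) -> str:
    """Dispatch a tool call only if the agent's role permits it."""
    if tool not in ALLOWED.get(agent, frozenset()):
        raise PermissionError(f"{agent} may not call {tool}")
    return TOOLS[tool](arg)
```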

Quality multiplication through review

A two-agent loop — one writes, one reviews — often produces better output than a single agent. The review agent applies a quality bar the writer is too close to apply. As long as the loop is bounded (one review, one revision, then commit), the cost is manageable and the quality lift is real.
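The bounded loop described above (one review, one revision, then commit) has no loop construct at all when written out. The three callables are hypothetical stand-ins for the writer and reviewer agents; `review` is assumed to return `None` on approval or feedback text otherwise.

```python
from typing import Callable, Optional

def write_review_commit(
    task: str,
    write: Callable[[str], str],
    review: Callable[[str, str], Optional[str]],
    revise: Callable[[str, str, str], str],
) -> str:
    """One draft, one review, at most one revision, then commit."""
    draft = write(task)
    feedback = review(task, draft)
    if feedback is None:
        return draft  # reviewer approved the first draft
    # Exactly one revision; the result is committed without a second review.
    return revise(task, draft, feedback)
```

The cost ceiling is visible in the code: at most three model calls per request, regardless of what the agents say to each other.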

Framework choices in 2024

Briefly, the frameworks teams are picking up:

  • LangGraph — graph-based agent orchestration on top of LangChain. The graph model is more controllable than free-form. Aligned with how production workflows actually look.
  • AutoGen — Microsoft's framework, strong on conversation patterns. Powerful but the conversation pattern is where most failure modes live.
  • CrewAI — opinionated multi-agent framework. Good for the demo case; production deployments require disciplined extension.
  • Custom orchestration — many teams end up here. The frameworks help with the easy 80%; the production-grade 20% gets built in-house anyway.

None of these are wrong choices. The choice depends more on the team's familiarity, the workload shape, and the integration with existing platform conventions than on framework quality.

What we keep seeing

Recurring patterns in production multi-agent engagements:

The first multi-agent design is too complex. The team designs for the interesting case; the production deployment needs the boring case. Simplification is the first refactor.

The cost surprise happens once. After it happens, monitoring goes in. The bill drops. Lessons learned.

The team underestimates testing effort. A multi-agent system has more states than a single-agent system. The test surface is multiplicatively larger.

The frameworks accelerate the demo and complicate production. They make the easy case trivial and the hard case harder. Often custom orchestration is the right answer.

Single-agent solutions work more often than expected. Before designing a multi-agent system, ask whether a single agent with tools could do the job. Often the answer is yes.

What we recommend

For an enterprise team considering multi-agent designs in 2024:

  1. Start with the question: "Can a single agent with tools do this?" If yes, do that. Multi-agent has real costs.
  2. If multi-agent, prefer supervisor-worker with deterministic hand-offs over free-form conversation patterns.
  3. Hard limits on everything — turns, cost, wall-clock time. Enforce them as circuit breakers.
  4. Instrument observability before scale. The trace has to reconstruct what happened.
  5. Pin model versions across all agents. Treat upgrades as planned migrations.
  6. Pick frameworks based on production fit, not demo aesthetics. Be willing to write custom orchestration when the framework doesn't fit.
  7. Test the integration shape. Multi-agent systems have more failure modes than single-agent systems; coverage matters.

Multi-agent orchestration will be a real category over the next several years. The patterns that ship will look more like workflow engines with intelligent steps than like emergent agent societies. The teams that recognise this will deliver useful systems. The teams that chase the demo aesthetic will deliver systems that work in the showroom and not in production.
