Intellectual
AI & Enterprise AI | 20 August 2024 | 7 min read

Real-Time AI vs Batch AI — Choosing the Right Latency Profile

The default is real-time. The right choice is often batch. A practitioner view of when each pattern earns its complexity, and how to design for the latency profile your workload actually needs.

A pattern recurs across enterprise AI architectures: real-time is the default choice, and batch is rarely considered. The default makes sense for some workloads (anywhere a user is waiting) and is wrong for others, where the latency requirement is assumed rather than real. Both the cost gap and the complexity gap between real-time and batch are significant.

This piece is a practitioner view of how to decide between real-time and batch processing for AI workloads, where each is the right answer, and what the design implications look like.

What each pattern means in this context

Real-time in current LLM usage means a user or upstream system makes a request and waits for a response. Latency targets are usually under a second for the first token, with completion within a few seconds.

Batch means a workload of inputs is queued and processed when capacity allows. Latency targets are minutes to hours; in some cases overnight processing is acceptable.

Asynchronous sits between — a request is submitted, processing happens in the background, the result is delivered later (notification, callback, polling). Latency targets are tens of seconds to minutes.

Each has appropriate use cases. Defaulting to real-time without considering the alternatives leaves significant value on the table.

When real-time is the right answer

The cases where real-time is genuinely required:

  • A user is waiting. Conversational interfaces, agent-assist for customer support, search-style query interfaces.
  • The request is part of a synchronous user flow. "Generate a draft of this document before I switch screens."
  • Latency drives experience. The perceived responsiveness of the system depends on speed.
  • The result feeds an immediate decision. Risk scoring during a transaction; classification of an incoming message that needs immediate routing.

For these, the cost of real-time is the price of doing business. The architecture has to support the latency target, the model choice has to support the latency target, and the infrastructure has to handle the load profile.

When batch is the right answer

The cases where batch is the better choice, often missed by default:

Periodic processing of large volumes

Document processing pipelines that handle thousands of documents per day. Compliance reviews. Bulk classification. The work doesn't need to happen in real time; aggregating it into batch runs typically cuts the model cost by around 50% on providers that offer a discounted batch API.

Pre-computation of expected queries

If most user queries are predictable, run them in advance. The user sees pre-computed results; the LLM doesn't run at request time. This is far cheaper per request and, from the user's perspective, far faster.
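A minimal sketch of that lookup-then-fallback flow, assuming an in-memory store and a hypothetical `live_model` callable standing in for the real-time model call. The class and function names are illustrative, not taken from any particular library.

```python
from typing import Callable, Dict


def normalise(query: str) -> str:
    """Collapse case and whitespace so trivially different queries share a key."""
    return " ".join(query.lower().split())


class PrecomputedAnswers:
    """Serve answers produced by an offline batch run; fall back to a live call."""

    def __init__(self, live_model: Callable[[str], str]) -> None:
        self._store: Dict[str, str] = {}
        self._live_model = live_model  # hypothetical real-time model call
        self.hits = 0
        self.misses = 0

    def load_batch_results(self, results: Dict[str, str]) -> None:
        """Called by the batch job once it has answered the predictable queries."""
        for query, answer in results.items():
            self._store[normalise(query)] = answer

    def answer(self, query: str) -> str:
        key = normalise(query)
        if key in self._store:
            self.hits += 1
            return self._store[key]  # no model call at request time
        self.misses += 1
        return self._live_model(query)  # unpredictable query: pay for real-time
```

The `hits` and `misses` counters give a rough view of how predictable the query stream really is, which is the number that decides whether pre-computation is worth running.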

Offline knowledge base updates

Re-indexing, re-embedding, regenerating summaries when source documents change. Batch processing fits this naturally.
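One way to keep such a batch run small is to re-process only the documents whose content actually changed. The sketch below assumes documents are available as plain strings keyed by ID and that the hash of each previously processed version is stored somewhere; both are assumptions for illustration.

```python
import hashlib
from typing import Dict


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def documents_to_reprocess(
    documents: Dict[str, str], seen_hashes: Dict[str, str]
) -> Dict[str, str]:
    """Return {doc_id: new_hash} for documents that are new or have changed.

    seen_hashes is not modified here, so the caller can record the new hash
    only after re-indexing or re-embedding has actually succeeded.
    """
    changed = {}
    for doc_id, text in documents.items():
        digest = content_hash(text)
        if seen_hashes.get(doc_id) != digest:
            changed[doc_id] = digest
    return changed
```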

Background analysis

Quality monitoring across closed cases. Trend analysis across customer feedback. Periodic compliance scans. The result is consumed later; the processing doesn't need to be live.

Data enrichment

Adding extracted attributes to records in the database. Translating between formats. Generating descriptions. The result is stored; consumption is decoupled from production.

Training data preparation

Labelling, classifying, generating examples for fine-tuning or evaluation. This is almost always batch-appropriate.

The cost case for batch is strong wherever the user isn't actively waiting for the result.

When asynchronous is the right answer

Asynchronous fits where real-time is too constrained and batch is too delayed:

  • Long-running individual tasks. A user requests something that takes 30 seconds; they don't need to stare at a spinner. Submit, notify when done.
  • Multi-step processes. A workflow has several AI steps; each is fast but the chain is slow. An asynchronous design with progress visibility fits.
  • External system integration. A request that depends on a slow downstream system. Asynchronous decouples the AI work from the integration.
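The criteria in the three sections above can be condensed into a rough decision rule. This is a sketch, not a complete policy: the function name, its parameters, and the 5-second threshold for how long a user will tolerate waiting are all assumptions chosen for illustration.

```python
from enum import Enum


class Pattern(Enum):
    REAL_TIME = "real-time"
    ASYNC = "asynchronous"
    BATCH = "batch"


def choose_pattern(
    user_waiting: bool,
    feeds_immediate_decision: bool,
    expected_seconds: float,
    spinner_limit_seconds: float = 5.0,  # assumed tolerance, tune per product
) -> Pattern:
    """Pick a latency profile from the questions the article asks."""
    someone_needs_it_now = user_waiting or feeds_immediate_decision
    if not someone_needs_it_now:
        # Nobody is blocked on the result: queue it and take the cheaper path.
        return Pattern.BATCH
    if expected_seconds > spinner_limit_seconds:
        # Someone cares, but the work is too slow to hold a request open.
        return Pattern.ASYNC
    return Pattern.REAL_TIME
```

For example, `choose_pattern(True, False, 30.0)` returns `Pattern.ASYNC`, matching the 30-second task described above, while a nightly compliance scan with nobody waiting returns `Pattern.BATCH`.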

The architectural differences

Each pattern has different architectural needs:

Real-time architecture

  • Low-latency model access (dedicated capacity, edge deployment, or simply faster providers)
  • Aggressive caching (any cache hit avoids the model call)
  • Streaming where it helps user experience
  • Tight monitoring on p95, p99 latency
  • Capacity for the peak load, not the average
  • Graceful degradation when capacity is constrained
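Monitoring p95 and p99 means looking at the slow tail rather than the average. A minimal sketch using the nearest-rank percentile method follows; the function names and the budget parameter are illustrative assumptions, and a production system would normally rely on its metrics platform for this.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample that has at least
    pct percent of all samples at or below it."""
    if not samples:
        raise ValueError("need at least one latency sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def latency_report(samples_ms: Sequence[float], p95_budget_ms: float) -> Dict[str, object]:
    """Summarise tail latency and flag whether the p95 budget is met."""
    p95 = percentile(samples_ms, 95)
    p99 = percentile(samples_ms, 99)
    return {"p95_ms": p95, "p99_ms": p99, "within_budget": p95 <= p95_budget_ms}
```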

Batch architecture

  • Queue infrastructure (message queues, job schedulers)
  • Idempotent processing (jobs can be retried)
  • Cost-optimised model selection (often cheaper models are fine)
  • Use of batch APIs where available (typically around 50% off retail rates at providers that offer one)
  • Less aggressive caching (less natural cache locality)
  • Capacity for the throughput, not the latency
  • Failure handling at the job level (DLQ, retry policies)
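Idempotency, retries, and a dead-letter queue can be sketched in a few lines. This version assumes an in-memory `results` dict keyed by job ID and a hypothetical `process` callable for the model call; a real pipeline would use a durable store and a proper queue, and would usually add backoff between attempts.

```python
from typing import Callable, Dict, List, Tuple

Job = Tuple[str, str]  # (job_id, payload)


def run_batch(
    jobs: List[Job],
    process: Callable[[str], str],  # hypothetical model call for one item
    results: Dict[str, str],
    max_attempts: int = 3,
) -> List[Job]:
    """Process jobs, skipping any already in results, and return dead letters."""
    dead_letters: List[Job] = []
    for job_id, payload in jobs:
        if job_id in results:
            continue  # finished in an earlier run, so re-running the batch is safe
        for attempt in range(1, max_attempts + 1):
            try:
                results[job_id] = process(payload)
                break
            except Exception:
                if attempt == max_attempts:
                    # Out of retries: park the job for inspection instead of
                    # failing the whole batch.
                    dead_letters.append((job_id, payload))
    return dead_letters
```

Because completed job IDs are skipped, the same batch can be resubmitted after a crash without paying for items that already succeeded.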

Asynchronous architecture

  • Submission and result-retrieval endpoints
  • State management (where is each job in its lifecycle)
  • Notification or polling for completion
  • Long-lived state for in-progress work
  • User-facing progress indication
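The submission, state, and polling pieces fit together roughly as below. This is a sketch with an in-memory store; `JobStore`, its method names, and the state names are assumptions for illustration, and a real service would persist jobs and expose `submit` and `status` as endpoints.

```python
import uuid
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class JobState(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    DONE = "done"
    FAILED = "failed"


@dataclass
class Job:
    job_id: str
    payload: str
    state: JobState = JobState.QUEUED
    progress: float = 0.0  # 0.0 to 1.0, for user-facing progress indication
    result: Optional[str] = None


class JobStore:
    """Tracks where each job is in its lifecycle."""

    def __init__(self) -> None:
        self._jobs: Dict[str, Job] = {}

    def submit(self, payload: str) -> str:
        """Submission endpoint: accept the work and return a handle immediately."""
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = Job(job_id, payload)
        return job_id

    def update(self, job_id: str, state: JobState, progress: float,
               result: Optional[str] = None) -> None:
        """Called by the background worker as the job advances."""
        job = self._jobs[job_id]
        job.state, job.progress, job.result = state, progress, result

    def status(self, job_id: str) -> dict:
        """Result-retrieval endpoint: what a polling client would see."""
        job = self._jobs[job_id]
        return {"state": job.state.value, "progress": job.progress, "result": job.result}
```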

The patterns are mostly conventional. The novelty is in applying them to AI workloads where the default has been to wire everything as synchronous calls.

The cost differential

The pricing tiers, as of mid-2024:

  • Real-time calls at retail rates from major providers
  • Batch APIs at typically 50% of retail rates
  • Provisioned capacity for high volume at negotiated rates, generally significantly below retail

For a workload processing a million items, moving from real-time to batch can cut the model bill by roughly half. At a hundred million items, depending on the per-item cost, the same move can be worth hundreds of thousands of dollars annually.
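The arithmetic is simple enough to sketch. The per-item price below is an assumed figure chosen purely for illustration, not a number from any provider or from this article; the 50% discount is the typical batch rate cited above.

```python
ASSUMED_COST_PER_ITEM_USD = 0.005  # assumption: retail real-time cost per item
BATCH_DISCOUNT = 0.5               # typical batch API discount cited above


def annual_cost(items_per_year: int, cost_per_item: float, discount: float = 0.0) -> float:
    """Yearly model spend for a workload at a given per-item price and discount."""
    return items_per_year * cost_per_item * (1.0 - discount)


def batch_saving(items_per_year: int, cost_per_item: float, discount: float) -> float:
    """Yearly saving from moving the whole workload from real-time to batch."""
    real_time = annual_cost(items_per_year, cost_per_item)
    batch = annual_cost(items_per_year, cost_per_item, discount)
    return real_time - batch
```

Under that assumed price, a million items a year costs $5,000 real-time versus $2,500 in batch, and a hundred million items costs $500,000 versus $250,000. The percentage saving is the same at both scales; what changes is whether the absolute figure is large enough to justify the engineering work.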

What we keep seeing

Recurring patterns in real-time vs batch decisions:

Real-time is the default everywhere. Even for workloads that have no human waiting, teams build for real-time. The cost shows up in the next quarter's bill.

Batch APIs are underused. Most major providers offer batch APIs at significant discount. Most teams don't use them. The discount is real money left on the table.

Pre-computation is rarely considered. In workloads where, say, 80% of queries are predictable, the predictable ones can be pre-computed and the rest fall back to real-time. Few teams do this; the ones that do see disproportionate cost savings.

Async is the missing middle. Teams use real-time or pure batch; the asynchronous design that fits long-running individual tasks is often skipped. Either users wait on spinners that should be notifications, or batch latency is forced where async would be better.

Cost optimisation pays back fast. A latency profile audit of an existing AI deployment typically identifies 20-40% cost savings without quality impact, just by moving the right workloads to the right pattern.

What we recommend

For enterprise teams designing AI architectures in 2024:

  1. Ask the latency question explicitly for every workload. "Does this need real-time?" usually has a more nuanced answer than the default.
  2. Use batch APIs where the workload tolerates the latency. The discount is significant.
  3. Pre-compute predictable queries. The user perceives instant results; the cost is amortised over uses.
  4. Use asynchronous patterns for long-running individual tasks. Don't make users watch spinners.
  5. Right-size the capacity model. Real-time needs peak capacity; batch needs throughput; async needs queues.
  6. Audit existing workloads for latency-profile fit. The cost savings are usually substantial.

The default of "build everything real-time" is a habit, not a strategy. The teams that match the latency profile to the actual requirement build cheaper, more reliable systems. The teams that default to real-time pay for latency they don't need and operate complexity that doesn't earn its keep.
