Long Context Windows — What Changes for Enterprise Workloads
Million-token context windows are now commercially available. They change the design of some workloads materially, change others not at all, and introduce new failure modes worth understanding.
The expansion of context windows through 2024 — Gemini 1.5 at 1-2 million tokens, Claude 3.5 Sonnet at 200K, GPT-4 Turbo at 128K — has shifted what's possible in enterprise LLM design. Workloads that previously required complex retrieval architectures can now be served by stuffing relevant context directly into the prompt. Workloads that couldn't fit at all are now tractable.
This piece is a practitioner view of where long context changes the architecture meaningfully, where it doesn't, and what the new failure modes look like.
What "long context" enables
The capabilities the larger windows unlock:
- Whole-document reasoning. A 500-page legal document fits in a million-token context. The model can reason about the whole thing in one call instead of working from isolated chunks.
- Multiple-document synthesis. Comparing or aggregating across many documents in a single call.
- Extended conversation memory. A long conversation can be fully visible to the model rather than summarised or truncated.
- In-context learning at scale. Hundreds of examples can be included as in-context demonstrations.
- Codebase reasoning. Significant portions of a codebase can be in context.
- Reduced architectural complexity. Some workloads that needed retrieval can be served with a long context call.
These are real capability shifts. They don't make retrieval obsolete; they change where retrieval has to be applied.
Where the architecture actually changes
Document analysis
Previously: extract relevant sections via retrieval, send to the model with the question. The retrieval step determined the quality.
Now: send the whole document, ask the question. The model finds what it needs. Quality is often higher because the model has full context.
The trade-off: cost. Sending a million tokens per call is expensive. For one-off analysis or small volume, the cost is acceptable. For high volume, retrieval remains the cheaper approach.
Long-form Q&A
Previously: retrieve relevant chunks; assemble prompt; answer.
Now: include enough context that the relevant material is sure to be there. For narrow domains, that is often the entire knowledge base. Cost scales with the prompt size, but reasoning quality is often noticeably better.
The pattern that works: when the knowledge base is bounded and the queries justify the cost, long context simplifies the architecture meaningfully.
Few-shot at scale
Previously: a handful of carefully curated examples in the prompt.
Now: dozens or hundreds of examples can be included. The model picks up patterns better with more examples.
For workloads where prompt engineering with a few examples was hitting a quality ceiling, more examples often break through the ceiling.
Conversation memory
Previously: long conversations were summarised, with the summary degrading over many turns.
Now: the full conversation can fit. The model has the context it needs without compression.
For workloads where conversational coherence matters — coaching, support, complex multi-turn tasks — the experience improves materially.
Where long context doesn't help
High-volume short-context workloads
If the workload is already served adequately by short contexts, longer windows don't add value. Most production LLM workloads we have measured use 1-10K input tokens. For those workloads, long context windows are irrelevant.
Latency-sensitive workloads
Long context processing has higher latency. For real-time workloads with sub-second latency requirements, long context isn't usable.
Cost-sensitive workloads
Long context is expensive per call. For workloads where unit economics matter, the cost case for long context has to be made carefully. Often the retrieval pattern is still cheaper.
Workloads where retrieval already works well
If retrieval is producing good answers, switching to long context for the same workload doesn't necessarily improve quality and definitely increases cost. The argument for switching has to be specific to the workload.
The new failure modes
Long context introduces failure modes that short context didn't have:
Lost in the middle
Information in the middle of a long context is less reliably used than information at the beginning or end. The effect is real and reproducible. Designs that depend on the model finding specific information buried in a long context fail more often than designs that surface the relevant information near the boundaries.
The mitigation: don't depend on the model finding needles in haystacks. Structure prompts so important information is at the top or bottom, or use retrieval to surface it explicitly.
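A minimal sketch of the boundary-placement idea, assuming the critical passages have already been identified (by retrieval or by hand). The function name `build_prompt` and the `KEY PASSAGE:` and `QUESTION:` labels are illustrative choices, not a provider convention.

```python
def build_prompt(instructions: str, critical: list[str],
                 background: list[str], question: str) -> str:
    """Assemble a prompt with critical material near the top and the question last."""
    parts = [instructions]
    # Critical passages go right after the instructions, near the top boundary.
    parts.extend(f"KEY PASSAGE: {p}" for p in critical)
    # Bulk background sits in the middle, where recall tends to be weakest.
    parts.extend(background)
    # The question is stated last, so it sits at the bottom boundary.
    parts.append(f"QUESTION: {question}")
    return "\n\n".join(parts)


prompt = build_prompt(
    instructions="Answer using only the contract text provided.",
    critical=["Clause 14 caps liability at the annual contract value."],
    background=["Recitals and definitions.", "Schedule A: service levels."],
    question="What is the liability cap?",
)
```

Whether this ordering helps for a given model and workload is something to confirm in evaluation rather than assume.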
Cost surprises
The unit cost per call can be 100x higher than with short context: at a flat per-token price, a million-token prompt costs 100 times what a 10K-token prompt does. For a workload that runs millions of calls per month, that multiplier dominates the economics. Without cost monitoring at the per-call level, the bill arrives as a surprise.
Quality plateaus
The intuition that "more context = better answer" doesn't hold reliably. Past a certain length, more context can confuse the model rather than help it. Empirical evaluation matters more than intuition.
Token-budget chaos
When multiple components compete for the same large window — system prompt, examples, retrieved context, conversation history, user input — the allocation logic gets complex. Without explicit budgeting, components push each other out or the prompt grows unboundedly.
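One way to make the budgeting explicit is to give each component its own token allowance and trim each independently. The sketch below is illustrative only: `count_tokens` is a crude whitespace stand-in for the provider's real tokenizer, and the component names and budget figures are assumptions.

```python
def count_tokens(text: str) -> int:
    # Stand-in only: counts whitespace-separated words.
    # Swap in the provider's tokenizer for real budgeting.
    return len(text.split())


def fit_to_budget(components: dict[str, list[str]],
                  budgets: dict[str, int]) -> dict[str, list[str]]:
    """Keep each component's items, in the order given, until its budget is spent."""
    kept: dict[str, list[str]] = {}
    for name, items in components.items():
        remaining = budgets.get(name, 0)  # components without a budget get nothing
        selected = []
        for item in items:
            cost = count_tokens(item)
            if cost > remaining:
                break  # stop at the first item that no longer fits
            selected.append(item)
            remaining -= cost
        kept[name] = selected
    return kept


components = {
    "system": ["You are a contracts analyst."],
    "history": ["latest user turn", "previous assistant turn here", "oldest turn"],
}
budgets = {"system": 10, "history": 5}
trimmed = fit_to_budget(components, budgets)
```

The caller decides priority by ordering each list. Here history is passed newest-first, so the oldest turns are the ones dropped when the allowance runs out.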
Latency variance
Long context responses have higher latency variance than short ones. P99 latency in particular degrades. For user-facing workloads, the variance can be more disruptive than the average.
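To see why the average hides the problem, compare the mean with the tail on the same sample. This is a small sketch using the nearest-rank percentile definition. The latency figures are made up for illustration, not measurements.

```python
import math
import statistics


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in seconds: 98 fast calls and 2 slow long-context calls.
latencies = [1.0] * 98 + [30.0] * 2

mean_latency = statistics.mean(latencies)
p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)
```

In this made-up sample the mean and median both look healthy while the p99 sits at the slow-call value, which is the number a user on the unlucky request actually experiences.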
The hybrid pattern
The pattern that works well in mid-to-late 2024: retrieval-augmented systems with long context windows.
The architecture:
- Retrieval identifies the most relevant content for the query.
- Generous chunks are pulled — more than would fit in a short context, less than the whole corpus.
- The long context allows several documents or large sections to be sent.
- The model reasons with rich context.
This pattern combines the cost discipline of retrieval (don't send irrelevant content) with the quality of long context (don't be artificially constrained on what is sent). It's the default architecture in our recent engagements.
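The packing step of the hybrid can be sketched as a greedy fill: take retrieved chunks in relevance order and add each one that still fits a generous token budget. Everything named here is an assumption for illustration: the `Chunk` type, the retriever scores, and the four-characters-per-token estimate, which is a rough heuristic rather than a tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str
    text: str
    score: float  # relevance score from the retriever; higher is better


def estimate_tokens(text: str) -> int:
    # Rough heuristic (about 4 characters per token); use a real tokenizer in practice.
    return max(1, len(text) // 4)


def pack_context(chunks: list[Chunk], token_budget: int) -> list[Chunk]:
    """Greedily select the highest-scoring chunks that fit within token_budget."""
    selected: list[Chunk] = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        cost = estimate_tokens(chunk.text)
        if used + cost > token_budget:
            continue  # skip this chunk; a smaller, lower-ranked one may still fit
        selected.append(chunk)
        used += cost
    return selected


retrieved = [
    Chunk("A", "x" * 400, 0.9),  # about 100 tokens
    Chunk("B", "y" * 800, 0.8),  # about 200 tokens
    Chunk("C", "z" * 200, 0.5),  # about 50 tokens
]
packed = pack_context(retrieved, token_budget=160)
```

The budget is the tuning knob: set it well above what a short-context design would allow, but low enough that per-call cost and latency stay within what the workload can bear.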
Cost modelling
A working pattern for cost modelling long-context workloads:
- Estimate average and p95 input tokens per call.
- Estimate average and p95 output tokens per call.
- Apply provider pricing.
- Estimate call volume.
- Multiply.
- Compare against the same workload with retrieval at conventional context sizes.
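The steps above reduce to a few lines of arithmetic. In the sketch below, every figure is a placeholder: the token counts, call volume, and per-million-token prices are hypothetical and should be replaced with measured values and the provider's current price list.

```python
def monthly_cost(input_tokens: float, output_tokens: float, calls_per_month: int,
                 input_price_per_mtok: float, output_price_per_mtok: float) -> float:
    """Estimated monthly spend, with prices quoted per million tokens."""
    cost_per_million_calls = (input_tokens * input_price_per_mtok
                              + output_tokens * output_price_per_mtok)
    return cost_per_million_calls * calls_per_month / 1_000_000


# Hypothetical prices per million tokens; not any provider's actual rates.
INPUT_PRICE = 3.0
OUTPUT_PRICE = 15.0
CALLS = 10_000

# Same workload two ways: whole-document prompts versus retrieved context.
long_context = monthly_cost(500_000, 1_000, CALLS, INPUT_PRICE, OUTPUT_PRICE)
retrieval = monthly_cost(8_000, 1_000, CALLS, INPUT_PRICE, OUTPUT_PRICE)
ratio = long_context / retrieval
```

Run it once with average token counts for the expected bill and once with p95 counts for the upper bound. Some providers price very long prompts at a different rate, so check whether a flat per-token price is a fair assumption before relying on the result.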
The decision between long context and retrieval is rarely binary. It's usually about where on the spectrum the workload sits. Less common queries with deep reasoning may justify long context per call; more common queries benefit from retrieval to keep per-call cost manageable.
What we keep seeing
Recurring patterns in long-context engagements:
Workloads simplify where the case applies. Some workloads that required complex retrieval engineering become simpler with long context. The architecture cleans up.
Cost discipline becomes more important, not less. The per-call cost variance with long context is higher; the workload-level cost can swing dramatically with usage patterns.
Lost-in-the-middle is real but manageable. Design around it; don't pretend it doesn't exist. Most enterprise designs work fine with the constraint.
Retrieval is still the default at high volume. The cost gradient makes retrieval the right choice for the bulk of enterprise LLM volume. Long context is the right choice for specific high-value cases.
Latency headroom shrinks. Long context can push workloads outside acceptable latency, and architecture decisions need to account for this.
What we recommend
For enterprise teams in late 2024:
- Identify the workloads where long context genuinely simplifies or improves the architecture. Don't blanket-adopt.
- Model the cost before deployment. Long context economics are different from short context.
- Design around the lost-in-the-middle constraint. Put critical content at boundaries; don't bury it.
- Combine retrieval with long context where appropriate. The hybrid is usually the right answer.
- Monitor latency variance, not just averages. P99 matters for user experience.
- Evaluate quality empirically. Intuitions about long context can mislead.
Long context windows are a real new capability. They change some architectures meaningfully and leave others unchanged. The teams that match the technique to the workload reliably capture the value. The teams that adopt indiscriminately either pay too much or fail to use the capability where it actually helps.