AI & Enterprise AI · 16 January 2024 · 9 min read

RAG Architecture — From Demo to Production

Retrieval-augmented generation is the dominant enterprise LLM pattern of the year. The demos are cheap; the production systems are not. A practitioner walkthrough of where the work actually sits.

By Intellectual AI Engineering Practice · Collective byline

A retrieval-augmented generation demo takes an afternoon. A retrieval-augmented generation system in production takes a quarter. The reason is not that the model is hard to integrate or the vector store is exotic. The reason is that the parts of the system that do not involve the model — ingestion, chunking, retrieval quality, observability, versioning — turn out to be the hard parts.

This is a walkthrough of what RAG actually looks like when it ships into a regulated enterprise environment, and where teams predictably underestimate the work.

What RAG is, briefly

A user query arrives. The system finds relevant content from a knowledge base, attaches it to a prompt, and asks an LLM to answer using that content. The pattern reduces hallucinations, grounds responses in current data, and avoids the cost of fine-tuning models on enterprise knowledge.

The high-level pipeline:

Ingestion — documents flow into the knowledge base
Chunking — documents are split into retrievable units
Embedding — chunks are encoded as vectors
Indexing — vectors are stored in a vector database
Query — the user query is embedded
Retrieval — top-k similar chunks are returned
Reranking — chunks are reordered by a more expensive scoring step
Prompt assembly — the chunks plus the query plus instructions become a prompt
Generation — the LLM produces an answer
Post-processing — the answer is validated, formatted, and logged

Each of these steps is a place where the demo and the production system diverge.
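
As a rough orientation, the query-time half of the pipeline (retrieval through generation) can be sketched as three pluggable stages. This is a minimal sketch, not a reference design: `Chunk`, `RagPipeline`, and the three callables are hypothetical names, and the instruction text in the prompt is a placeholder.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str


@dataclass
class RagPipeline:
    # Each stage is injected so it can be swapped or tested in isolation.
    retrieve: Callable[[str, int], list]   # (query, k) -> candidate chunks
    rerank: Callable[[str, list], list]    # (query, candidates) -> reordered
    generate: Callable[[str], str]         # prompt -> answer

    def answer(self, query: str, k: int = 50, keep: int = 5) -> str:
        candidates = self.retrieve(query, k)
        top = self.rerank(query, candidates)[:keep]
        # Tag every chunk with its id so the answer can cite it later.
        context = "\n\n".join(f"[{c.chunk_id}] {c.text}" for c in top)
        prompt = (
            "Answer using only the context below.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {query}"
        )
        return self.generate(prompt)
```

Keeping the stages as injected callables means each one can be replaced with a stub in tests, which matters once the evaluation harness starts exercising them separately.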

Where the production work actually sits

Ingestion

The demo loads a folder of clean PDFs. The production system ingests from a dozen sources, each with its own access controls, refresh schedule, and content shape. The work here is:

Connector engineering. SharePoint, Confluence, Google Drive, an internal CMS, a regulatory filings database. Each needs an authenticated connector with permission propagation, incremental sync, and deletion handling.
Content parsing. PDFs with tables, scanned images, mixed-language documents, structured forms, marked-up regulatory text. Each format has its own parsing path. The output quality of the entire RAG system is upper-bounded here.
Metadata extraction. Author, publication date, document type, regulatory classification, business unit, sensitivity level. The richer the metadata, the more usable the retrieval index becomes.
Lifecycle. Documents update. Documents get deleted. Documents get reclassified. The ingestion pipeline has to handle change events, not just one-time loads.

A team that treats ingestion as a one-time bulk operation produces a knowledge base that decays into uselessness within months. A team that builds proper ingestion plumbing produces a knowledge base that compounds in value.
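
One way to handle change events rather than one-time loads is to compare a content hash per document against what the index already holds. The sketch below assumes the connector can list current documents as `{doc_id: bytes}` and that the index stores a hash per document; `plan_sync` and `SyncPlan` are illustrative names, not part of any connector library.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncPlan:
    to_upsert: list  # new or changed document ids
    to_delete: list  # ids that disappeared from the source


def content_hash(body: bytes) -> str:
    return hashlib.sha256(body).hexdigest()


def plan_sync(source_docs: dict, indexed_hashes: dict) -> SyncPlan:
    """Diff the source listing against the hashes stored in the index."""
    to_upsert = sorted(
        doc_id
        for doc_id, body in source_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(body)
    )
    # Anything indexed but no longer in the source must be removed,
    # otherwise deleted documents keep surfacing in answers.
    to_delete = sorted(set(indexed_hashes) - set(source_docs))
    return SyncPlan(to_upsert, to_delete)
```

A real connector would usually prefer the source system's own change feed where one exists; hashing is the fallback that works everywhere.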

Chunking

The demo splits documents on fixed token boundaries. The production system thinks about what a retrievable unit actually is for the content type.

Structured documents (contracts, regulatory filings, policy documents) have natural section structure. Chunk on the structure, not on tokens. A retrieved chunk that respects "Section 4.2 — Termination" reads better than one that ends mid-paragraph.
Conversational content (support tickets, chat transcripts) chunks on turn boundaries, with enough surrounding context to disambiguate references.
Long-form prose (whitepapers, reports) benefits from overlapping chunks — same content in two retrievable windows means a query that hits either window surfaces the content.
Tabular content needs special handling — a table cell makes no sense without its column headers and at least some surrounding context. Naive chunking destroys tables.

Chunking strategy is the second-biggest determinant of retrieval quality, after embedding model choice. It deserves dedicated engineering time.
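
Two of the strategies above, sketched under simplifying assumptions: structured documents are assumed to mark sections with lines starting "Section N.N" (real documents need a heading pattern per document family), and long-form prose is split into word windows as a stand-in for token windows. Function names are illustrative.

```python
import re

# Assumed heading convention; adjust per document family.
HEADING = re.compile(r"^Section \d+(?:\.\d+)*\b.*$", re.MULTILINE)


def chunk_by_section(text: str) -> list:
    """Split on section headings so no chunk ends mid-section."""
    starts = [m.start() for m in HEADING.finditer(text)]
    if not starts:
        return [text.strip()] if text.strip() else []
    chunks = []
    preamble = text[: starts[0]].strip()
    if preamble:
        chunks.append(preamble)
    for begin, end in zip(starts, starts[1:] + [len(text)]):
        chunks.append(text[begin:end].strip())
    return chunks


def overlapping_windows(words: list, size: int, overlap: int) -> list:
    """Fixed-size windows where consecutive windows share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    step = size - overlap
    windows = []
    for start in range(0, len(words), step):
        windows.append(words[start : start + size])
        if start + size >= len(words):
            break  # the last window already reaches the end
    return windows
```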

Embedding model selection

The demo uses whatever embedding model the tutorial used. The production system picks deliberately.

Selection criteria:

Quality on the domain. A model optimised for English web text may underperform on regulatory documents or non-English content. Evaluate on a sample of real content.
Dimensionality and cost. Higher-dimensional embeddings cost more to store and query. The marginal quality improvement from 3072-dim over 1024-dim is often not worth the storage cost.
Update cadence. Embedding models update; re-embedding the entire corpus is expensive. Models that version cleanly and offer migration paths are preferable.
Hosting model. Hosted embedding APIs are convenient but introduce a per-document cost and a network dependency. Self-hosted models on commodity hardware are realistic for many enterprise deployments.

The embedding model is the foundation. Changing it later means re-embedding everything. Decide deliberately at the start.
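
The storage side of the dimensionality trade-off is simple arithmetic. A back-of-envelope sketch, assuming uncompressed float32 vectors and ignoring index overhead, metadata, and replication (all of which add to the real figure):

```python
def raw_vector_size_gib(num_chunks: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage only: chunks x dimensions x bytes per value."""
    return num_chunks * dims * bytes_per_value / 2**30


if __name__ == "__main__":
    for dims in (1024, 3072):
        size = raw_vector_size_gib(10_000_000, dims)
        print(f"{dims} dims: {size:.1f} GiB")
```

For a hypothetical corpus of 10 million chunks this comes to roughly 38 GiB at 1024 dimensions and roughly 114 GiB at 3072, before any index structures are built on top.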

Vector store selection

The demo uses whatever vector store the tutorial used. The production system picks based on operational profile.

Hosted vs self-hosted. Pinecone, Weaviate Cloud, MongoDB Atlas Vector Search, and others are hosted offerings. Postgres with pgvector, Milvus, and self-hosted Weaviate are the self-managed alternatives. Hosted is fastest to start; self-hosted is necessary for some compliance regimes.
Hybrid search. Pure vector search misses queries where lexical match matters. Hybrid (BM25 + vector) consistently outperforms either alone. The store has to support this.
Metadata filtering. Most enterprise queries are filtered — by document type, by date, by access permissions, by language. The store has to support efficient filtered search, not just naive top-k.
Operational maturity. Backup, replication, point-in-time recovery, monitoring. These are the things you wish you had thought about before the first production incident.
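
Where the store does not fuse lexical and vector results itself, the two ranked lists can be merged in application code. The sketch below uses reciprocal rank fusion, which is one common fusion method and an assumption here rather than something this walkthrough prescribes; `k=60` is a conventional default for the smoothing constant.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Merge several ranked id lists; each hit contributes 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    bm25_hits = ["a", "b", "c"]    # lexical ranking, best first
    vector_hits = ["c", "a", "d"]  # embedding ranking, best first
    print(reciprocal_rank_fusion([bm25_hits, vector_hits]))
```

Because it works on ranks rather than raw scores, this approach avoids having to normalise BM25 scores against cosine similarities.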

Retrieval and reranking

Naive top-k retrieval against an embedding is the baseline. Production RAG goes further:

Query rewriting. A user query may be ambiguous, under-specified, or in a different style than the indexed content. An LLM call rewrites the query into a form better suited for retrieval. This costs a model call but improves retrieval quality substantially.
Multi-query retrieval. Generate several variations of the query, retrieve for each, take the union. Catches content that any single phrasing would miss.
Reranking. A first-pass retrieval returns 50 candidates. A more expensive reranker model (often a cross-encoder) scores each candidate against the query and returns the top 5. The cross-encoder is too expensive to run over the whole index but cheap over a candidate set.
Diversity-aware retrieval. Top-k by similarity tends to return near-duplicates. Diversity penalties produce a broader retrieval set, which generally produces a better final answer.

Retrieval quality is the biggest determinant of overall RAG quality. Teams that invest here see compound returns.
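
Diversity-aware retrieval is often implemented as maximal marginal relevance (MMR): each pick trades relevance to the query against similarity to what has already been picked. A minimal pure-Python sketch, assuming candidates arrive as `{chunk_id: vector}`; the `lam` weighting and function names are illustrative choices.

```python
import math


def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec: list, candidates: dict, k: int, lam: float = 0.7) -> list:
    """Pick up to k ids; lam=1.0 is plain top-k, lower values favour diversity."""
    selected = []
    remaining = dict(candidates)
    while remaining and len(selected) < k:

        def score(cid: str) -> float:
            relevance = cosine(query_vec, remaining[cid])
            # Penalise closeness to the most similar already-selected chunk.
            redundancy = max(
                (cosine(remaining[cid], candidates[s]) for s in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(sorted(remaining), key=score)
        selected.append(best)
        del remaining[best]
    return selected
```

In a two-stage setup this would typically run over the reranked candidate set, not over the whole index.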

Prompt assembly

The demo concatenates retrieved chunks. The production system manages the prompt as a structured artifact.

System instructions — the persistent rules the model follows. Tone, refusal behaviour, citation requirements, output format.
Retrieved context — the chunks, ideally with metadata (source, date, section reference) so the model can cite.
Conversation history — when present, with explicit boundary markers.
User query — clearly distinguished from context.
Output format instructions — what shape the response should take.

Each section has a token budget. Each section has an order. The prompt assembly logic is non-trivial code that deserves its own tests.
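
A sketch of what budgeted assembly can look like. It assumes chunks arrive ranked best-first as `(chunk_id, text)` pairs and uses a whitespace word count as a stand-in for the model's real tokenizer; the section markers and function names are hypothetical.

```python
def count_tokens(text: str) -> int:
    # Stand-in only: replace with the target model's tokenizer.
    return len(text.split())


def fit_chunks(chunks: list, budget: int) -> list:
    """Keep ranked chunks in order until the context budget is exhausted."""
    kept, used = [], 0
    for chunk_id, text in chunks:
        cost = count_tokens(text)
        if used + cost > budget:
            break  # stop at the first chunk that does not fit
        kept.append((chunk_id, text))
        used += cost
    return kept


def assemble_prompt(system: str, chunks: list, query: str, context_budget: int) -> str:
    """Fixed section order with explicit boundary markers."""
    context = "\n".join(
        f"[{chunk_id}] {text}" for chunk_id, text in fit_chunks(chunks, context_budget)
    )
    return (
        f"### SYSTEM\n{system}\n\n"
        f"### CONTEXT\n{context}\n\n"
        f"### QUESTION\n{query}"
    )
```

Because assembly is a pure function of its inputs, it can be unit-tested without calling a model: given these chunks and this budget, exactly these chunk ids appear in the prompt.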

Citation and grounding

In any enterprise RAG system, the answer has to be traceable to the source. The model needs to be instructed — and verified — to cite the chunks it relied on.

The pattern that works:

Tag each retrieved chunk with an identifier.
Instruct the model to cite by identifier.
Post-process the answer to validate that each citation refers to an actual retrieved chunk.
If validation fails, retry with a stricter prompt or escalate.

Without this discipline, the system produces plausible-looking answers that nobody can defend in an audit.
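
The validation step can be sketched as follows, assuming the model is instructed to cite as `[chunk_id]` and that ids use only letters, digits, underscores, and hyphens. The `generate(prompt, strict)` callable is hypothetical; a real system would also need to avoid confusing other bracketed text with citations.

```python
import re

# Assumed citation format: [chunk_id]
CITATION = re.compile(r"\[([A-Za-z0-9_-]+)\]")


def validate_citations(answer: str, retrieved_ids: set) -> tuple:
    """Return (ok, unknown_ids). An answer with no citations is not ok."""
    cited = set(CITATION.findall(answer))
    unknown = cited - retrieved_ids
    return bool(cited) and not unknown, unknown


def answer_with_validation(generate, prompt: str, retrieved_ids: set, max_attempts: int = 2):
    """Retry once in strict mode; return None so the caller can escalate."""
    for attempt in range(max_attempts):
        strict = attempt > 0  # later attempts use the stricter prompt
        answer = generate(prompt, strict)
        ok, _unknown = validate_citations(answer, retrieved_ids)
        if ok:
            return answer
    return None
```

This only checks that each citation points at a chunk that was actually retrieved. Whether the cited chunk supports the claim is a faithfulness question and belongs to evaluation.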

Evaluation

The hardest unsolved problem in enterprise RAG is evaluation. Knowing whether your system is getting better or worse requires:

An evaluation set. Real queries with known good answers. Curated by domain experts, not synthesised by an LLM.
A scoring methodology. Retrieval recall (did the right chunks come back?), answer faithfulness (does the answer reflect the retrieved chunks?), answer helpfulness (does it actually answer the user's question?). Each requires separate measurement.
Continuous evaluation. Every time the model, the embedder, the chunking strategy, or the prompt template changes, the eval set runs. Regressions get caught.

Teams that build the evaluation harness early ship faster and with more confidence. Teams that skip it operate on intuition until production incidents force them to build it.
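
Of the three measurements, retrieval recall is the easiest to automate. A minimal sketch, assuming the eval set is a list of `(query, relevant_chunk_ids)` pairs and that `retrieve(query, k)` returns ranked chunk ids; both shapes are assumptions for illustration.

```python
def recall_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of the known-relevant chunks that appear in the top k."""
    if not relevant:
        raise ValueError("each eval query needs at least one relevant chunk")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(eval_set: list, retrieve, k: int) -> float:
    """Average recall@k over the whole curated eval set."""
    scores = [
        recall_at_k(retrieve(query, k), relevant, k)
        for query, relevant in eval_set
    ]
    return sum(scores) / len(scores)
```

Run in CI against a stored baseline, a check such as `assert mean_recall_at_k(eval_set, retrieve, 5) >= baseline` turns a chunking or embedder change that hurts retrieval into a failed build instead of a production surprise.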

Observability

For every request, the pipeline logs:

Query text (after rewriting)
Retrieved chunk IDs and scores
Prompt token count
Model identifier and version
Generation latency
Output token count
Final answer
User feedback if any

This is the substrate for evaluation, debugging, and improvement. Build it before you need it.
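
The fields above map naturally onto one structured record per request. A sketch using a dataclass serialised to a JSON line; the field names are illustrative, and a production system would also need to decide how query text and answers are redacted or retained.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class RagTrace:
    query_text: str        # after rewriting
    retrieved: list        # (chunk_id, score) pairs, in rank order
    prompt_tokens: int
    model_id: str          # include the pinned version
    latency_ms: float
    output_tokens: int
    answer: str
    user_feedback: Optional[str] = None


def to_log_line(trace: RagTrace) -> str:
    """One JSON object per line, ready for any log shipper."""
    return json.dumps(asdict(trace), sort_keys=True)
```

Emitting the same record shape from every request makes it straightforward to replay logged queries through the evaluation harness later.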

What we keep seeing

Recurring patterns in production RAG deployments:

Quality limit set by ingestion, not by the model. Teams try to improve RAG quality by switching models when the real bottleneck is broken PDF parsing.

Vector index sprawl. One index per document type, one per department, one per project. The retrieval surface fragments and overall recall drops. Consolidate where the access control permits.

Prompt drift. A prompt that worked at launch quietly degrades as the corpus grows. Without evaluation, this is invisible until users complain.

Context window mismanagement. Newer models offer larger context windows. Teams stuff more chunks in. Performance often gets worse, not better. The model's attention is finite; relevance beats volume.

Model version surprises. A hosted model is silently updated; outputs change. Pin versions and treat upgrades as planned migrations.

What we recommend

For an enterprise team building RAG in 2024:

Spend the first weeks on ingestion. Get one document type flowing properly before adding more.
Build evaluation before scale. A small curated eval set beats a large unevaluated system.
Pick the embedding model and chunking strategy deliberately. These are the foundation; changing them later is expensive.
Add reranking and hybrid search before adding more documents. Better retrieval over a small index beats worse retrieval over a large one.
Enforce citation and validate it. The system has to defend its answers.
Pin model versions. Treat upgrades as migrations.
Plan for the operational profile from day one — backup, monitoring, access control, audit. Production systems have these or fail at the first incident.

The model is the smallest part of the system. The integration discipline around it is what determines whether RAG works in your enterprise or stays a demo.
