AI & Enterprise AI19 March 20249 min read

Conversational BI — Patterns That Survive in Production

Conversational interfaces over enterprise data are tempting and easy to demo. The patterns that survive enterprise governance, accuracy expectations, and data complexity are a much narrower set than the demos suggest.

ByIntellectual AI Engineering Practice· Collective byline

Conversational interfaces over enterprise data have been "almost there" for two decades. Every generation of natural-language query tools produced confident demos that failed under the weight of real schemas, real governance, and real user expectations. The current generation, with LLMs in the loop, has changed the picture — but most production deployments still fail for predictable reasons.

This piece is a practitioner view of conversational BI as it actually ships in 2024 — what works, what doesn't, and where the design discipline matters.

The promise and the reality

The promise: a business user asks a question in natural language; the system returns a chart, a number, or a brief narrative answer; the user does not need to learn SQL, the BI tool's UI, or the schema.

The reality, when poorly designed: a system that answers the easy questions reasonably well, answers the hard questions confidently and wrongly, and produces answers no analyst would defend. Users learn within weeks not to trust it.

The reality, when well designed: a system that answers a defined set of questions correctly, refuses gracefully outside that set, cites the data it is reading from, and gives the user a path to verify the answer. Users learn what they can ask, build trust, and adopt.

The difference is design discipline, not model capability.

The architecture pattern

A working conversational BI system has the following shape:

Intent classification — what is the user actually asking for
Schema mapping — what entities and relationships in the data model are relevant
Query generation — produce a query (SQL, KQL, or a structured retrieval) that addresses the intent
Query validation — check that the generated query is safe, well-formed, and within permitted scope
Execution — run the query
Result interpretation — turn the query result into a useful answer
Citation and provenance — show the user what data was queried, what query was used
Refinement — let the user clarify or refine if the answer is not what they meant

Each step is a place to fail. The successful systems get all of them right; the failed systems get one of them wrong.

Intent classification — narrower than it looks

Most enterprise questions, when categorised, fall into a small set:

Retrieval — "Show me the X for Y"
Aggregation — "What is the total/average/count of X"
Comparison — "Compare X to Y"
Trend — "How has X changed over time"
Anomaly — "Which X is unusual"
Diagnostic — "Why did X change"

The first four are tractable with current technology. Anomaly detection is workable with careful design. Diagnostic questions are the hardest; they require reasoning across data that the system may not have direct access to. The discipline is to design for the first four, refuse anomaly questions outside known anomaly definitions, and route diagnostic questions to humans.

Classification is the first place where governance shows up. A question outside the supported categories should produce a clear "I can't answer that" message, not an attempted answer that will be wrong.

Schema mapping — where most systems break

The schema is the part LLMs have not seen in training. The model knows English; it doesn't know that your customer table is called cust_mast, that the relevant date field is eff_dt, and that the relationship to the order table goes through an intermediate party table with two filtering conditions.

The pattern that works:

Curated semantic layer. A descriptive model of the business entities, attributes, and relationships, separate from the physical schema. The LLM is given the semantic layer, not raw table names.
Domain-specific terminology. What does "active customer" mean in your business? What does "this quarter" mean — calendar or fiscal? The semantic layer encodes these definitions explicitly.
Example questions and answers. Few-shot examples paired with the queries that produce them. The model learns the pattern of how questions map to queries in this specific environment.
Schema retrieval, not schema dump. For large schemas, retrieve only the relevant subset for each query. The full schema doesn't fit in a prompt usefully, and including irrelevant tables hurts accuracy.

Without this discipline, the model is guessing at the schema, which produces wrong queries that look right.

Query generation — structured, not free-form

The model generates a query. The query is in SQL, KQL, MDX, or a custom query DSL.

What works:

Constrained generation. Tools like LangChain's SQL chains, LlamaIndex's structured query engine, or custom validators ensure the generated query uses only known tables and known columns.
Templates and constraints. Where possible, the model fills in templates rather than generating from scratch. A template "SELECT FROM WHERE " with constrained values for each variable is more reliable than free-form SQL generation.
Multi-step query construction. For complex queries, the model produces the query in steps — entities first, joins next, aggregations last. Each step is checked.

What doesn't work:

Asking the model to generate complex multi-CTE queries from a natural-language description. The error rate is too high. Decompose or refuse.
Letting the model produce arbitrary SQL against the production database. Even with read-only access, a runaway query can take down the database.

Query validation

Before execution:

Syntax check. Does this parse?
Schema check. Do all referenced tables and columns exist?
Access check. Does the user have permission to query these tables and columns?
Safety check. No destructive operations, no obviously dangerous patterns (cross joins without conditions, no LIMIT on potentially large queries).
Cost estimate. What is the estimated query cost? If high, require explicit confirmation or refuse.

Validation is deterministic code, not LLM judgment. The LLM produces the query; the validator decides whether to run it.

Execution and result interpretation

Execution is conventional. Result interpretation is where the LLM re-enters.

The pattern:

Format the result. A single number is presented as a number. A small table is presented as a table. A larger table is summarised.
Generate a narrative. A brief explanation of what the answer is and how it was computed. Not flowery; matter-of-fact.
Highlight context. If the result is unusual, flag it. If the result needs caveats (filtered on incomplete data, partial month, etc.), state them.

The narrative is generated; the underlying numbers come from the query. The numbers are facts; the narrative is the LLM's interpretation. Users need to be able to see both.

Citation and provenance

Every answer shows:

The query that was executed (or a simplified version a user can read)
The data source
The freshness of the data (when was it last updated)
Any filters or scopes applied
A link to the underlying records, where appropriate

Without provenance, the user cannot verify the answer. They have to take the system's word for it, and they shouldn't.

This is the single biggest difference between conversational BI that earns trust and conversational BI that loses trust. Users with provenance can verify and increase their trust over time. Users without provenance learn to distrust because they have no path to check.

Refinement

The first answer is often not what the user wanted. The system has to support refinement.

"That's not what I meant." The user can refine the question; the system uses the prior context.
"Show me as a chart." The user can change the presentation without re-asking.
"Limit to the last quarter." The user can add filters conversationally.
"Show me the underlying records." The user can drill in.

Refinement is what turns the interface from a one-shot oracle into a working investigative tool. It requires the system to maintain conversational state, which adds complexity to the architecture.

Where teams underestimate the work

Recurring patterns in production deployments:

Underestimating the semantic layer. Teams expect the LLM to "figure out" the schema. Without a semantic layer, accuracy stays at proof-of-concept levels. The investment in the semantic layer is what unlocks production accuracy.

Trusting the model on aggregations. "Sum of revenue for customers in segment X" is a common question. The model can produce a query; the query may join tables incorrectly and produce a number that is wrong but plausible. Validate aggregations against known totals, especially for high-stakes answers.

Skipping the evaluation set. Without a curated set of question/answer pairs, you don't know whether changes are improving the system. Build it early. Curate it with domain experts. Run it on every change.

Skipping the refinement design. Teams build the one-shot oracle and call it done. Users want refinement; without it, the interface feels punitive.

Trying to answer every question. Pressure to support more question categories before the existing ones are reliable. Each new category adds failure modes. Master the supported set first.

What we keep seeing

Patterns in successful production conversational BI:

Narrow but reliable. Support a defined set of questions; refuse outside it; expand based on demonstrated mastery. Users adapt to a narrow tool that always works.

Strong semantic layer. Investment in curated semantic models that encode the business meaning of the data.

Evaluation discipline. Curated eval sets, run on every change, with regression alerts.

Provenance as a first-class feature. Users can see the query, the source, the freshness. They can verify.

Hand-off to analysts. For questions outside scope, a clean hand-off to "ask an analyst" rather than an attempted answer that will be wrong.

What we recommend

For an enterprise team building conversational BI in 2024:

Define the supported question categories. Refuse outside them.
Invest in the semantic layer. It is the foundation.
Use constrained query generation, not free-form SQL.
Validate every generated query before execution.
Show provenance with every answer.
Support refinement, not just one-shot queries.
Build evaluation alongside the system. Curate the eval set with domain experts.

Conversational BI works in 2024 if treated as a constrained, verifiable interface over a curated semantic model. It does not work as a natural-language search over the raw schema. The difference between these two design philosophies determines whether the deployment ships and stays shipped, or quietly gets abandoned within a year.

Bring an enterprise programme.

Architecture audit, new delivery, modernisation, or in-flight rescue — Intellectual engages directly on enterprise programmes with senior practitioners.

Contact Intellectual →Read more insights