AI & Enterprise AI13 August 20248 min read

Text-to-SQL Beyond Demos — What Production Deployments Actually Require

Natural-language-to-SQL has been a research demo for two decades. Current models make it credible. Making it production-grade in an enterprise data warehouse requires more than the demo suggests.

ByIntellectual AI Engineering Practice· Collective byline

Natural-language-to-SQL has been a research demo for two decades. The earlier generations produced systems that worked on toy schemas and broke on enterprise ones. Current models — GPT-4 class and above — are credible enough that production deployments are now realistic. Realistic doesn't mean simple. The work that turns a demo into a production system is more than the demos suggest.

This piece is a practitioner view of what it takes to ship text-to-SQL into an enterprise data warehouse — the components that matter, the failure modes that recur, and the design choices that determine whether the system earns trust or loses it.

The components of a working system

A text-to-SQL system that holds up in production has the following shape:

Query understanding — the user's question is interpreted
Schema resolution — relevant tables and columns are identified
Query construction — SQL is generated against the relevant schema
Query validation — the SQL is checked for safety, correctness, and permissions
Execution — the SQL runs against the data warehouse
Result interpretation — the result is formatted for the user
Citation — the query and the source are shown
Refinement — the user can clarify or extend

Each component has substance. The naive demo collapses several of these into a single LLM call; the production system separates them.

Query understanding

Before generating SQL, the system understands the question:

What kind of question is it? Retrieval, aggregation, comparison, trend, anomaly, diagnostic. Different question types map to different query patterns.
What entities are involved? Named entity recognition over the question against the business vocabulary.
What time frame applies? Explicit, implicit, or unspecified.
What scope or filter is implied? "My region", "this quarter", "active customers" — terms that have specific meaning in the business context.

Question understanding is non-trivial because business questions are often ambiguous. The system that asks for clarification when uncertain produces better answers than the system that guesses.

Schema resolution — where the leverage is

The model doesn't know your schema. It knows English. The schema has to be presented to the model in a form it can use.

This is the highest-leverage area in a text-to-SQL system. The patterns that work:

A semantic layer

A curated description of business entities and their relationships, separate from the physical schema. Entities have business names, descriptions, attributes, and relationships. Metrics are defined explicitly with the underlying calculation. Filters are named.

The semantic layer is the artifact the model uses. The physical schema is a backend detail.

Schema retrieval, not schema dump

For large schemas, only the relevant subset is presented to the model per query. The full schema doesn't fit in a useful prompt. A retrieval step finds the entities and relationships relevant to the question.

Few-shot examples

Question-and-SQL pairs from your domain, demonstrating the patterns. These anchor the model on your specific schema, naming conventions, and query style. A small number of high-quality examples produces dramatic improvement in accuracy.

Glossary

A glossary of business terms with their definitions. "Active customer" means something specific; the model needs to know what it means here.

Negative examples

Examples of common mistakes and the correct alternatives. "This question often produces this query, but the right query is this other one." Negative examples sharpen the model's choices.

Investment in the semantic layer is the unlock. Teams that skip it ship systems that work on simple questions and fail on real ones.

Query construction

The model generates SQL. The patterns for getting reliable output:

Constrained generation. Tools like sqlglot for SQL parsing and validation can be used in a constrained-decoding loop.
Template-based generation. For common patterns, fill in templates rather than generating from scratch.
Multi-step generation. First identify entities; then identify joins; then generate filters and aggregations. Each step is checked.
Explanation alongside SQL. The model produces a brief explanation of what the query does. The explanation is shown to the user for verification.

What doesn't work:

Asking the model to generate complex multi-CTE SQL from a vague question.
Letting the model see only the schema and hope it figures out the semantics.

Query validation

Before execution:

Syntax check. Does this parse?
Schema check. Do all referenced tables and columns exist?
Permission check. Does the user have access to the referenced tables and columns?
Safety check. No destructive operations (DELETE, UPDATE, DROP). No obviously dangerous patterns. Required LIMIT clauses.
Cost estimate. What is the estimated cost of this query? If high, require explicit confirmation or refuse.
Time-out enforcement. Query has a maximum run time.

Validation is deterministic code, not LLM judgment. The model proposes; the validator decides.

Execution

Conventional, with one caveat: the executor runs with the user's permissions, not with the system's. Row-level security, column-level masking, all the conventional database security features apply.

This is the safety net for cases where the model produces a query that goes beyond what the user is supposed to see. The database enforces the boundary.

Result interpretation

The query produces a result. The result is formatted for the user:

A single number is shown as a number with context (what it represents, what filter was applied)
A small set of rows is shown as a table
A larger set is summarised with key statistics
A trend or time series is offered as a chart
Anomalies in the result are highlighted

For complex results, the LLM produces a brief narrative — what the answer is, in plain language. The narrative is grounded in the actual numbers, not in the model's prior beliefs.

Citation

Every answer shows:

The query that was executed (or a simplified, readable version)
The data source
The freshness of the data
Any filters or scopes applied
Any caveats (partial data, recent update, in-progress refresh)

This is the cornerstone of trust. Users with citation can verify; users without it have to take the system's word for it.

Refinement

The user can ask follow-up questions, refine the query, change the presentation, drill in. The system maintains conversational state — what entities are in scope, what filters apply, what time frame is current.

Refinement is what makes the interface a useful tool rather than a one-shot oracle.

Where it fails

Recurring failure modes in production text-to-SQL:

Joins that look right but aren't. The model picks a plausible join key; the join is technically valid but semantically wrong. The query runs; the answer is incorrect. Hard to detect without validation against known totals.

Aggregations on the wrong level. Sum across rows that shouldn't be summed; count of records that should be deduplicated. The query runs; the number is wrong.

Filters that miss important conditions. The user's question implies a filter the model didn't generate. The result includes more than the user intended.

Time frame ambiguity. "Last quarter" — calendar or fiscal? The system needs to know or to ask.

Unstated business rules. "Active customers" includes a specific set of conditions the user takes for granted. Without the semantic layer encoding this, the model uses its general understanding, which is wrong.

Each of these is a case where the SQL is syntactically correct and the result is semantically wrong. They are the hardest failure modes to catch.

What we keep seeing

Recurring patterns in production text-to-SQL engagements:

Investment in the semantic layer is decisive. Systems with a strong semantic layer reach 80-90% accuracy on relevant questions; systems without struggle below 50%.

Validation against known answers catches the silent failures. A curated test set of question-and-answer pairs, run on every change, surfaces the joins-that-look-right-but-aren't class.

The narrow successful systems answer a defined set of questions reliably. The ambitious ones try to answer everything and end up answering nothing reliably.

Adoption tracks trust. Users who can verify answers (via citation) build trust over time. Users who can't lose trust at the first failure and don't return.

The data team is the new owner. Text-to-SQL systems live in the data warehouse. The analytics or data engineering team is the natural operator, not the AI team.

What we recommend

For an enterprise team building text-to-SQL in 2024:

Define the supported question categories. Refuse outside them.
Invest heavily in the semantic layer. It is the unlock.
Use constrained query generation, not free-form SQL.
Validate every generated query. Syntax, schema, permission, safety, cost.
Execute with the user's permissions. The database is the safety net.
Show citations with every answer. Trust depends on verifiability.
Curate an evaluation set with domain experts. Run it on every change.
Plan ownership at the data team. They will operate it.

Text-to-SQL is finally crossing the production threshold for enterprise data warehouses. The teams that ship reliably are the ones that respect the engineering work around the model — the semantic layer, the validation pipeline, the citation discipline. The teams that ship the demo aesthetic into production end up with systems that answer well in showcases and produce incidents in operation.

More from the field.

Service practices the article draws on, related programmes, and other pieces on adjacent topics.

Service practices

Service

AI & Intelligent Automation

/services/ai-solutions →

Service

Data Engineering & Analytics

/services/data-analytics →

Related pieces

19 March 20249 min read

Conversational BI — Patterns That Survive in Production

Conversational interfaces over enterprise data are tempting and easy to demo. The patterns that survive enterprise governance, accuracy expectations, and data complexity are a much narrower set than the demos suggest.

17 March 20268 min read

Three Years of Enterprise AI — What We Got Right and Wrong

A practitioner reflection on three years of enterprise AI work — the patterns I called correctly, the calls I got wrong, and what to take from each into 2026 and beyond.

10 February 20267 min read

The 2026 AI Infrastructure Shift — What's Changing Underneath

The infrastructure layer for enterprise AI is shifting in 2026. New hardware, new deployment patterns, new economics. A look at what's actually different and what it means for architecture decisions.

Industry

Financial Services & Banking

Regulated integration, compliance automation, and secure digital banking.

Discuss this work

Bring an enterprise programme.

If anything in this piece resonates with what you're building, talk to us. Senior practitioners engage directly on architecture and delivery.

Contact Intellectual →

← Newer post

Real-Time AI vs Batch AI — Choosing the Right Latency Profile

Older post →

AI Systems and Enterprise Identity — Where Most Deployments Cut Corners

Work with the practitioners

Bring an enterprise programme.

Architecture audit, new delivery, modernisation, or in-flight rescue — Intellectual engages directly on enterprise programmes with senior practitioners.

Contact Intellectual →Read more insights