Text-to-SQL: AI-Powered Natural Language Querying of Your Data

How Text-to-SQL systems translate natural language questions into SQL queries, why they fail silently without good data architecture underneath, and how to evaluate, deploy, and govern Text-to-SQL in enterprise environments.

Text-to-SQL — the ability for a natural language question to be automatically translated into a SQL query and executed against a database — is one of the most commercially promising AI applications for enterprise data. The pitch is compelling: business users can ask questions in plain English, get answers from live data without writing SQL, and stop depending on analysts to translate business questions into queries.

The reality is more nuanced. Text-to-SQL works well in constrained, well-structured environments. It fails silently in complex, poorly documented ones. Understanding where Text-to-SQL succeeds, where it fails, and what data architecture requirements it imposes is essential for organisations evaluating or deploying it.

How Text-to-SQL Works

Modern Text-to-SQL systems use large language models that have been fine-tuned or prompted to understand SQL schema and generate correct queries. The basic flow:

1. The user asks a question in natural language

2. The system selects relevant tables and columns from the database schema (using semantic search or rules)

3. The selected schema plus the user's question are assembled into a prompt

4. The language model generates a SQL query

5. The query is executed against the database

6. The result is returned to the user (either as raw data or translated back into natural language)

The quality of the SQL generated depends on: the quality of the schema documentation, the quality of schema selection, the model's SQL generation capability, and the complexity of the query required to answer the question.

When Text-to-SQL Fails

The most dangerous characteristic of Text-to-SQL failure is that it often fails silently — generating a syntactically correct SQL query that executes successfully and returns a number, but the number is wrong.

**Ambiguous schema.** If a table named "orders" could mean purchase orders or sales orders depending on context, the model cannot reliably choose correctly. If a column named "revenue" might mean gross or net revenue depending on which table it is in, the model will make assumptions that may not match the user's intent.

**Undocumented business logic.** The SQL for "active customers" is not self-evident from the schema. If active customers means users who have placed at least one order in the last 90 days and have not requested account deletion, that definition must be documented (in column descriptions, in table descriptions, or in documented examples) for the model to generate correct SQL. Without documentation, the model guesses.

**Complex multi-hop queries.** Questions that require joining many tables, applying non-obvious filters, or computing intermediate results before aggregating push the boundaries of reliable SQL generation. The model may generate a query that looks plausible but applies joins in the wrong order, omits a necessary filter, or aggregates at the wrong granularity.

**Dialect and function differences.** SQL is not uniform across databases. A Snowflake-dialect query uses different syntax than a BigQuery-dialect query. Models trained primarily on one dialect produce syntax errors or incorrect results on others. This is less severe with current frontier models but still a failure mode for edge cases.

Data Architecture Requirements for Reliable Text-to-SQL

Text-to-SQL quality is directly correlated with data architecture quality. The same data architecture best practices that make analytics reliable make Text-to-SQL reliable.

**Documented schemas.** Every table and every column needs a description that unambiguously states what it contains and what the values mean. Column descriptions are the primary mechanism by which Text-to-SQL systems understand business logic. A column named net_revenue with the description "Net revenue in USD, after returns and before tax, as recognised on the order date" is unambiguous. A column named rev with no description is not.

**Well-named tables and columns.** Table and column names should be self-explanatory. orders, customers, products, order_lines. Not tbl_ord_v2, cust_mstr, rev_adj. The model uses names as signals; opaque names degrade generation quality.

**Denormalised analytical models.** Text-to-SQL works better against wide, denormalised analytical tables (dimensional models, dbt mart layers) than against normalised OLTP schemas. A wide orders table with customer attributes, product attributes, and order line details joined in produces simpler queries than a star schema with ten tables. The same good data architecture practices that improve analyst productivity improve Text-to-SQL accuracy.

**Documented metrics and definitions.** The metric definitions that live in a dbt semantic layer, a business glossary, or documented calculated fields should be surfaced to the Text-to-SQL system. If "customer lifetime value" is defined and documented, a question asking for LTV by customer segment has a reference answer. Without it, the model invents a definition.

Evaluation: Measuring Text-to-SQL Quality

Text-to-SQL systems must be evaluated before deployment and continuously monitored in production. Evaluation requires a benchmark set of question-answer pairs: questions a business user might ask, paired with known-correct SQL and expected results.

**Execution accuracy**: does the generated SQL produce the same result as the reference SQL? This is the most meaningful metric — a query that produces the correct answer via a different SQL path is a correct query.

**Valid SQL rate**: does the generated SQL execute without error? A high rate of syntax errors indicates a schema documentation or model quality problem.

**Semantic accuracy**: for questions with known-correct answers (e.g., "how many orders were placed yesterday?" against a dataset with a known answer), does the query return the correct number?

Building a benchmark requires significant upfront effort. A minimum viable benchmark has 50–100 questions spanning the range of question types the system will face. Expand the benchmark as new failure modes are discovered in production.

Text-to-SQL Products and Approaches

**Tableau Pulse and Tableau Ask Data**: Tableau's natural language query features use Text-to-SQL internally to generate VizQL queries from natural language, constrained to Tableau's data model. Works within Tableau's governance framework; limited to questions answerable from the connected data source.

**Snowflake Cortex Analyst**: Snowflake's native Text-to-SQL, configured via a semantic model YAML that describes tables, columns, metrics, and relationships. Runs within Snowflake's access control model; the semantic model documentation is the primary quality lever.

**Databricks SQL AI Functions**: AI_QUERY and related functions that allow natural language queries within Databricks notebooks and SQL.

**Standalone frameworks (LangChain, LlamaIndex, Vanna.ai)**: open-source and commercial frameworks for building custom Text-to-SQL pipelines. More flexible but require more engineering investment to deploy reliably.

**BI tool integrations**: Power BI's Q&A feature, Looker's natural language search, and similar features in established BI tools apply Text-to-SQL within the tool's semantic model, benefiting from the tool's existing data definitions.

Deployment Governance

Text-to-SQL changes the governance model for data access. Users who previously could only query data through curated dashboards can now ask arbitrary questions of the database. The database's access control model (row-level security, column masking, role-based access) applies — but the user's ability to formulate questions that indirectly leak sensitive information requires consideration.

**Run Text-to-SQL against analytical layers, not operational databases.** Text-to-SQL should query data warehouse mart layers and reporting schemas, not production application databases. This enforces data access boundaries, reduces the risk of performance impact on operational systems, and means the data model has been designed for analytical use.

**Maintain query logging and auditing.** Every Text-to-SQL query should be logged: the natural language input, the generated SQL, and the result. This supports debugging, security review, and building the evaluation benchmark from real user questions.

Our BI strategy consulting practice advises on AI-powered analytics deployments including Text-to-SQL evaluation and governance — contact us to discuss your requirements.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →