What Is Query Optimization in a Data Warehouse?

Query optimization is the process of improving the performance of SQL queries in a data warehouse — reducing execution time, lowering compute cost, and improving the experience for analysts and BI tools. This guide explains the key techniques, the differences between cloud warehouse optimizers, and the most impactful changes to make first.

Query optimization is the process of improving the performance of SQL queries in a data warehouse — reducing execution time, lowering compute cost, and making analytical workloads reliable and predictable. In cloud data warehouses billed by compute (Snowflake virtual warehouse credits, BigQuery bytes scanned, Redshift cluster-hours), query optimization directly reduces cost.

Understanding how cloud warehouses execute queries — and what causes them to execute poorly — is prerequisite to optimizing them effectively.

How Cloud Warehouse Query Planners Work

When a SQL query is submitted to a cloud warehouse, the query planner (optimizer) produces an execution plan: the sequence of operations the engine will perform to return results. The planner's goal is to find the cheapest plan — minimizing I/O, data movement, and compute — from among the many possible execution orders.

Cloud warehouses use cost-based optimizers (CBO): they estimate the cost of different execution plans based on statistics about the data (row counts, column cardinality, data distribution, null rates) and choose the plan with the lowest estimated cost.

The optimizer's decisions have a larger impact on query performance than almost any other factor. A query that the optimizer executes with an efficient plan may run in seconds; the same query with a poor plan may run for minutes. Providing the optimizer with accurate statistics and writing queries in ways the optimizer can reason about effectively is the core of query optimization.

Partition Pruning and Clustering

Cloud warehouses organize data in micro-partitions (Snowflake), partitions (BigQuery, Redshift), or file segments (Databricks). Queries that filter on the partition or clustering key can skip entire partitions that cannot contain relevant data — dramatically reducing I/O.

**Snowflake micro-partitions and clustering** — Snowflake automatically organizes data into micro-partitions and maintains per-partition min/max metadata. When a query filters on a column, the query planner checks which micro-partitions could contain matching values and scans only those. For tables where data is inserted in order of a frequently-queried column (like a timestamp), auto-clustering often works well. For tables where that is not the case, explicit Automatic Clustering on the high-cardinality filter column improves pruning dramatically.

**BigQuery partitioning and clustering** — BigQuery supports date/timestamp partitioning (data in separate storage shards per day or month) and clustering by up to four columns. Queries that filter on the partition column scan only relevant partitions (reducing bytes-scanned-based cost). Queries that filter on clustering columns benefit from BigQuery's column clustering, which organizes data within partitions to co-locate similar values.

**Redshift distribution style** — Redshift distributes rows across nodes using a distribution key. Joins on the distribution key execute without data movement between nodes. Queries that join on non-distribution-key columns require data to be moved, which is expensive. Choosing the right distribution key for the join patterns in your workload is a primary Redshift optimization.

Join Order and Join Type

Query planners choose the order in which tables are joined and the join algorithm (hash join, nested loop, merge join) based on statistics. Poor statistics cause the optimizer to make incorrect join order decisions, which can produce catastrophically bad plans.

**Always filter before joining** — where possible, apply filters before joining to large tables. Filtering reduces the number of rows that participate in the join, which reduces the data movement and compute costs of the join.

**Avoid cross joins** — a cross join between two tables of N and M rows produces N*M rows. Cross joins are almost never intentional; they typically result from missing join conditions. In large tables, they quickly become unmanageable.

**Large table to large table joins** — joining two very large tables is expensive in any warehouse. Pre-aggregate one of the tables before joining, filter each table to the minimum relevant rows, or redesign the model to avoid the join.

Subqueries and CTEs

Common Table Expressions (CTEs) improve readability and maintainability but do not automatically improve performance. In some databases, CTEs act as optimization fences — the optimizer cannot push predicates into them. In others, they are inlined and optimized as part of the full query.

Snowflake inlines most CTEs; filtering outside the CTE is equivalent to filtering inside. BigQuery materializes CTEs into temporary tables by default, which can be beneficial (the CTE is computed once and reused) or harmful (the planner cannot push a predicate from the outer query into the CTE).

Testing alternative formulations — CTE vs subquery vs derived table — and comparing their execution plans is the only reliable way to determine which performs better for a specific query in a specific warehouse.

Avoiding SELECT *

Columnar warehouses read only the columns referenced in a query. SELECT * reads all columns, eliminating this benefit. Always specify the columns needed; this is particularly important in wide tables with hundreds of columns.

In Snowflake and BigQuery, where billing is based on data scanned or credits consumed, SELECT * on a 200-column table costs significantly more than SELECT col1, col2, col3 on the same table.

Materialized Views and Pre-Aggregation

For queries that run frequently on large tables with computationally expensive aggregations, materialized views pre-compute and store the aggregation result. Subsequent queries against the materialized view scan the pre-computed result rather than re-aggregating the base table.

Snowflake materialized views: the warehouse maintains the view automatically as the base table changes. Queries that can be served from the materialized view are transparently rewritten by the optimizer.

BigQuery materialized views: similarly maintained automatically, with transparent query rewriting.

dbt incremental models: not true materialized views, but incremental materialization pre-computes transformation results and updates only changed rows on each run. For BI tools querying large transformed tables, materialized dbt models dramatically outperform querying source tables directly.

Warehouse Sizing and Workload Isolation (Snowflake)

In Snowflake, the virtual warehouse size determines the compute available for a query. Larger warehouses process more data in parallel and execute complex queries faster. For simple queries against small tables, a larger warehouse does not help and costs more.

Workload isolation through multiple virtual warehouses prevents analytics workloads (long-running dashboards, ad-hoc queries) from competing with ETL workloads (data loading, dbt transformations). Contention between workloads causes unpredictable query performance. Separate warehouses with appropriate sizing for each workload type is the standard Snowflake production setup.

Query Profiling Tools

All major cloud warehouses expose execution plan information:

- **Snowflake**: Query Profile in the web UI shows a DAG of execution steps with time and bytes-read for each step. Identifying the most expensive step (longest bar) immediately focuses optimization effort.

- **BigQuery**: EXPLAIN query or Query Plan in the console shows execution stages with slot milliseconds and bytes processed per stage.

- **Redshift**: EXPLAIN command outputs the query plan with estimated costs per step. SVL_QUERY_REPORT shows actual execution statistics for completed queries.

Starting with the query profile — identifying which step consumes the most time or bytes — is more effective than guessing what to optimize.

Our data architecture and Tableau consulting practices include query optimization for data warehouse and BI workloads — contact us to discuss performance improvements for your analytical environment.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →