SQL optimization is the process of improving query performance by reducing the amount of data read, the number of operations performed, and the compute resources consumed. This guide explains the most impactful optimization techniques for analytical SQL workloads.
SQL optimization is the process of improving query performance — reducing the time a query takes to run, the amount of data it reads, and the compute resources it consumes. In analytical environments with tables containing billions of rows and queries executing against terabytes of data, the difference between an unoptimized and optimized query can be the difference between minutes and seconds, or between a result and a timeout.
Optimization matters at every layer: the queries analysts write, the data models engineers build, the indexes and partitions architects configure. This guide focuses on the techniques that have the most impact most often.
Understand the Query Execution Plan
Before optimizing a query, understand how the database plans to execute it. Every major SQL system exposes a query plan: Snowflake's EXPLAIN command, BigQuery's query plan in the console, PostgreSQL's EXPLAIN ANALYZE. The plan shows the sequence of operations the engine will perform, the estimated cost of each operation, and — in most systems — the actual rows processed at each step.
Reading a query plan reveals:
- Full table scans where filter pushdowns should be applied
- Nested loop joins on large tables that should use hash joins
- Operations on columns that are unsorted or unpartitioned, forcing full scans
- Stages that consume disproportionate time or data volume
Optimization without a query plan is guessing. Start there.
Minimize Data Read
The most impactful category of optimization in columnar analytical databases (Snowflake, BigQuery, Redshift, ClickHouse) is reducing how much data the query reads. These systems store data by column; queries that read only the columns they need are dramatically faster than queries that read all columns.
**Select only needed columns.** SELECT * reads all columns. In a table with 80 columns where your query needs 5, SELECT * reads 16 times the data your query actually uses. Name the columns explicitly.
**Push filters early.** WHERE clauses that filter large fractions of the data should reference columns that are partitioned, clustered, or sorted. In Snowflake, micro-partition pruning eliminates partitions based on the filter before the scan begins. In BigQuery, partition filters on the partitioning column skip entire partition files. Filters on non-clustered columns still require full scans.
**Use partition pruning.** Tables partitioned by date allow queries filtered to a date range to scan only the relevant partitions. A query filtered to "last 30 days" on a two-year table scans 4% of the data. The partition column must appear in the WHERE clause for pruning to occur.
**Avoid functions on filter columns.** WHERE DATE(created_at) = '2024-01-01' applies a function to every row before filtering, preventing partition pruning. WHERE created_at >= '2024-01-01' AND created_at < '2024-01-02' uses the raw column and enables pruning.
Write Efficient Joins
Joins are often the most expensive operation in analytical queries. Several patterns consistently cause performance problems:
**Joining before aggregating.** When joining two large tables where one will be significantly reduced by a GROUP BY, aggregate first in a subquery, then join the smaller result. This reduces the data volume the join must process.
**Cartesian products from missing join conditions.** A JOIN without a complete join predicate produces a cartesian product — every row of the left table matched with every row of the right. On tables with millions of rows, this is catastrophic. Always verify join conditions are complete and correct.
**Joining on high-cardinality string columns.** Joining on integer IDs is faster than joining on string columns because integer comparison is cheaper and equality on strings requires full string comparison. Where possible, join on integer keys.
**Fan-out joins.** Joining a detail table (orders) to a parent table (customers) and then to another detail table (products) without controlling for duplicate rows at the intermediate step can multiply row counts unexpectedly. Check intermediate row counts when joins produce more rows than expected.
Avoid Common Anti-Patterns
**Correlated subqueries.** A subquery in the SELECT or WHERE clause that references the outer query executes once per row of the outer result. On a million-row table, the subquery runs a million times. Rewrite using JOINs or window functions.
**DISTINCT as a debugging tool.** SELECT DISTINCT used to eliminate unexpected duplicates is often masking an upstream join problem. The join is producing duplicates, and DISTINCT is hiding the cause while making the query slower. Fix the join.
**Wildcard pattern matching on large string columns.** WHERE description LIKE '%keyword%' cannot use indexes and requires scanning every character of every string in the column. Where full-text search is needed, use a purpose-built full-text index.
**Repeated subquery evaluation.** If the same subquery expression appears multiple times in a query, most SQL engines evaluate it multiple times. Use a CTE (WITH clause) to define the subquery once and reference it by name. Most modern query engines materialize CTEs, executing them once.
Index and Partition Strategy (for OLTP-style databases)
For PostgreSQL, MySQL, and other row-oriented databases used in analytical contexts:
**Index columns used in JOIN conditions and WHERE clauses.** An index on a column allows the database to find matching rows without a full table scan — critical for filter-heavy queries on large tables.
**Avoid index creation on low-cardinality columns.** An index on a boolean column with only two values is unlikely to help — the database may find it faster to scan the table than to use the index and follow pointers for half the rows.
**Use covering indexes for frequently-run queries.** A covering index includes all columns needed by a specific query in the index itself, allowing the query to be answered entirely from the index without accessing the table.
**Partial indexes for filtered subsets.** An index with a WHERE clause indexes only the rows matching the filter. For a table where most queries filter to active = true, an index WHERE active = true is smaller and faster than a full-column index.
Materialization and Incremental Models
For recurring analytical queries that are expensive to run from scratch:
**Materialize expensive intermediate results.** In dbt, converting a view to a materialized table (or an incremental table) eliminates the cost of re-running the underlying SQL on every downstream query. The materialized table is computed once per pipeline run and queried directly.
**Use incremental materialization for large tables.** Full-refresh materialization recomputes the entire table on every run. Incremental materialization processes only new or changed rows. For tables with billions of rows, the difference in pipeline run time is orders of magnitude.
**Pre-aggregate where aggregations are always the same.** If every dashboard query aggregates daily revenue by region, materializing a pre-aggregated daily_revenue_by_region table eliminates the aggregation from every query. This is the dimensional modeling instinct: denormalize aggressively for read performance.
Query Profiling Tools
- Snowflake: Query History in the console, Query Profile visualization showing execution stages
- BigQuery: Query plan in the console, Execution Details with bytes processed per stage
- Redshift: EXPLAIN, STL_QUERY, SVL_QUERY_REPORT system tables
- PostgreSQL: EXPLAIN ANALYZE with actual row counts and execution times
- dbt: dbt's query timing in logs; warehouse-specific profiling via the same tools above
Our data architecture practice reviews and optimizes SQL transformation layers for organizations where pipeline performance or analyst query latency has become a constraint. Contact us to discuss your query optimization requirements.
A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.
Book a Call →