BlogData Architecture

Redshift Performance Tuning: Distribution Keys, Sort Keys, and Query Optimisation

Obed Tsimi
Obed Tsimi
Founder & Senior Tableau Architect
·January 10, 202712 min read

Amazon Redshift performance is largely determined by table design decisions made at creation time — distribution style, sort key, and compression encoding. This guide covers how each affects query performance, the diagnostic queries that identify problems, and the optimisation patterns for common slow-query scenarios.

Redshift performance is largely predetermined by table design decisions. Unlike Snowflake or BigQuery, where the query engine handles most distribution and execution decisions automatically, Redshift exposes distribution and sort key configuration directly to the user. Getting these decisions right at table creation time is the most impactful performance work in a Redshift environment. Getting them wrong produces slow queries that require table recreation to fix — which is why understanding these concepts before designing your schema matters.

Distribution Styles

Redshift stores table data across multiple compute nodes. How it distributes data — which rows go to which nodes — determines whether join and aggregation operations require data movement across the network.

### EVEN Distribution

Rows are distributed across nodes in a round-robin fashion. Every node gets approximately the same number of rows. Regardless of the column values, distribution is balanced.

**When to use:** Small dimension tables that are joined frequently to large fact tables. Also the correct default for tables where you have not yet determined the right distribution key.

**Problem:** When two tables with EVEN distribution are joined, Redshift must redistribute one table's rows across nodes to co-locate matching rows for the join. This redistribution — a "broadcast" or "redistribute" operation visible in the EXPLAIN plan — adds network I/O cost.

### KEY Distribution

Rows are distributed based on the values in a specified distribution key column. All rows with the same distribution key value go to the same node.

**When to use:** Large fact tables that are frequently joined to dimension tables. The distribution key should be the column used in the JOIN condition. When the fact table and dimension table are both distributed on the same key values, Redshift can join them without network redistribution — a co-located join that is dramatically faster.

**How to choose the key:** The distribution key should be:

1. A column used in frequent JOIN conditions (the foreign key to your primary dimension)

2. A high-cardinality column that distributes data evenly across nodes

3. Not a column with extreme skew (a few values appearing in 90% of rows would send most data to a few nodes)

For a fact table with order-level data, distributing on customer_id (if that is the primary join key) or order_id are both reasonable choices. Distributing on a low-cardinality column like region or country concentrates rows on few nodes.

### ALL Distribution

A full copy of the table is stored on every node. Joins to ALL-distributed tables are always co-located — every node has the full table.

**When to use:** Very small reference tables (under a few million rows) that are joined to large tables from many different queries. Date dimension tables, status code lookup tables, and small product attribute tables are candidates. The storage cost of the full copy is worth the join performance for tables small enough to replicate.

**Do not use for:** Tables over a few million rows. The storage cost and maintenance overhead (VACUUM, ANALYZE, copy operations for INSERT) make ALL distribution impractical at scale.

### AUTO Distribution

Redshift chooses the distribution style automatically. Small tables start as ALL; they transition to EVEN as they grow. This is the default for tables created without an explicit DISTKEY.

**When to use:** For tables where you have not yet determined the right distribution strategy, or for smaller tables where the performance difference is not material. For large production fact tables, explicit KEY distribution on the right column will outperform AUTO.

Sort Keys

Sort keys tell Redshift how to sort data within each 1MB block. Well-chosen sort keys allow Redshift to skip entire blocks during queries — a technique called zone map filtering. If your query's WHERE clause filters on the sort key, Redshift reads only the relevant blocks rather than the full table.

### Single-Column Sort Keys

A single sort key is appropriate when most queries filter on one column — typically a date column for time-series data.

create table fact_orders (

order_date date not null sortkey,

...

);

With order_date as the sort key, queries filtering on order_date (WHERE order_date >= '2025-01-01') use zone map filtering to skip blocks that contain no data in the date range.

### Compound Sort Keys

Multiple columns are specified in priority order. Zone map filtering applies to the leading column; subsequent columns only filter within the range identified by the leading column.

create table fact_orders (

order_date date not null,

region varchar(50) not null,

...

) compound sortkey(order_date, region);

Queries filtering on order_date AND region benefit from both. Queries filtering only on region get minimal benefit — only rows within each date block are filtered.

**When to use:** When queries filter on multiple columns in a consistent priority order (date first, then region/product/etc).

### Interleaved Sort Keys

All specified columns are given equal weight in the sort key. Zone map filtering applies to any combination of the specified columns equally.

create table fact_orders (

order_date date not null,

region varchar(50) not null,

...

) interleaved sortkey(order_date, region);

**Trade-off:** Interleaved sort keys are more expensive to maintain during VACUUM operations. On tables with frequent INSERTs, interleaved sort keys increase VACUUM time significantly. AWS recommends compound sort keys for most use cases and reserving interleaved for tables that are primarily read-heavy with ad-hoc filter patterns.

Diagnostic Queries

### Finding Unsorted Rows

After data loading, newly inserted rows are unsorted. VACUUM restores sort order. A high percentage of unsorted rows reduces zone map efficiency.

select trim(name) as table_name,

rows as total_rows,

unsorted as pct_unsorted

from svv_table_info

where unsorted > 20

order by unsorted desc;

Tables with more than 20% unsorted rows are candidates for VACUUM SORT ONLY (faster than full VACUUM — only restores sort order without reclaiming space).

### Finding Skewed Distribution

Uneven data distribution across nodes creates imbalanced processing — some nodes do much more work than others.

select trim(name) as table_name,

skew_rows

from svv_table_info

where skew_rows > 1.5

order by skew_rows desc;

A skew_rows value above 1.5 indicates meaningful imbalance. The fix is changing the distribution key to one with higher cardinality and more even value distribution.

### Identifying Expensive Queries

select query,

total_exec_time / 1000000.0 as exec_time_seconds,

trim(querytxt) as query_text

from stl_query

where total_exec_time > 60000000 -- queries over 60 seconds

order by total_exec_time desc

limit 20;

Combine with the EXPLAIN plan for the slow queries to identify whether the bottleneck is redistribution (network), sort key misalignment (full scans), or join order.

### Reading the EXPLAIN Plan

The EXPLAIN output for a Redshift query shows the physical execution plan. Key things to look for:

**DS_DIST_INNER or DS_DIST_ALL_INNER:** Row redistribution is occurring during a join. This adds network overhead and indicates the join columns are not aligned with the distribution keys.

**DS_BCAST_INNER:** The smaller table is being broadcast to all nodes for the join. This is expected for small tables; for large tables it indicates a distribution key mismatch.

**Full scan without zone map filtering:** If the EXPLAIN shows a table scan with no filter, the sort key is not aligned with the query's WHERE predicates.

WLM and Concurrency Scaling

Workload Management (WLM) configures query queues — how many queries execute concurrently and how memory is allocated between query classes.

The default WLM configuration has a single queue with up to 5 concurrent queries. For environments with mixed workloads — both fast interactive queries and long-running batch transformations — separate queues for different query types prevent long-running batch jobs from blocking analyst queries.

Automatic WLM (available since 2019) dynamically allocates memory and concurrency based on query characteristics. For new Redshift deployments, Automatic WLM with query monitoring rules (QMR) to classify and route query types is recommended over manual WLM configuration.

Concurrency Scaling automatically adds transient cluster capacity during periods of high concurrent demand. It adds cost (charged by the second above the free tier) but prevents queueing delays during peak usage.

VACUUM and ANALYZE

**VACUUM:** Reclaims space from deleted rows and restores sort order for unsorted data. DELETE operations in Redshift soft-delete rows (marking them as deleted but not immediately reclaiming space); VACUUM physically removes them.

VACUUM operations are resource-intensive and should run during low-usage windows. Prioritise tables with high delete rates and high unsorted percentages. VACUUM DELETE ONLY reclaims space without sorting — faster for tables where sort order degradation is not the primary concern.

**ANALYZE:** Updates table statistics used by the query planner. Outdated statistics cause the query planner to make poor execution decisions — wrong join order, incorrect parallelism. Run ANALYZE after large data loads and periodically on tables that receive frequent updates.

Our data architecture consulting practice includes Redshift architecture design and performance optimisation — contact us to discuss Redshift performance for your analytics environment.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →