BlogCloud Engineering

Amazon Redshift Performance Optimisation: Distribution, Sort Keys, and Workload Management

James Okafor
James Okafor
Senior Data Engineer
·August 18, 202711 min read

Redshift is a column-oriented MPP warehouse that distributes data across compute nodes. Its performance characteristics are fundamentally shaped by how data is distributed across nodes and sorted within them — choices made at table creation time that affect every query run against that table.

Amazon Redshift is a column-oriented massively parallel processing warehouse that distributes data across compute nodes. Its performance characteristics are fundamentally shaped by how data is distributed across nodes (distribution key) and sorted within each node (sort key) — decisions made at table creation time that cannot be changed without rebuilding the table. Getting these choices right is the foundation of Redshift performance; no amount of query tuning compensates for a poorly distributed table.

Distribution Style and Distribution Keys

Redshift distributes table rows across compute nodes according to the distribution style. The three styles:

**EVEN** distribution spreads rows across nodes in a round-robin fashion. Every node gets roughly the same number of rows regardless of content. EVEN is appropriate for small tables and for tables that are never joined to other tables. For tables that participate in joins, EVEN distribution forces data movement — Redshift must redistribute rows from one or both tables across nodes before the join can execute. Data redistribution is expensive and is the most common source of Redshift performance problems.

**KEY** distribution assigns rows to nodes based on the hash of the distribution key column. Rows with the same distribution key value always go to the same node. When two tables are joined on a column that is the distribution key of both, the matching rows are already co-located on the same node — no redistribution is needed. KEY distribution is the right choice for large fact tables joined on a consistent join column (customer_id, order_id).

**ALL** distribution stores a copy of the entire table on every node. Joins against an ALL-distributed table never require redistribution — the matching rows are always on the local node. ALL distribution is appropriate for small dimension tables that are frequently joined to large fact tables.

The optimal distribution key for a fact table is usually the foreign key that participates in the most frequent and most expensive joins. For a customer orders table, distributing on customer_id collocates the orders with the customer dimension (if the customer table uses ALL distribution or is also distributed on customer_id).

**Automatic distribution** (DISTKEY AUTO) allows Redshift to recommend and apply the distribution key based on workload analysis. It is a good starting point for new tables; reviewing the recommendations before accepting them for tables with complex join patterns is advisable.

Sort Keys

Sort keys define the physical order of data within each node and enable range-restricted scans: when a query includes a filter on the sort key column, Redshift skips blocks that cannot contain matching values (similar to Snowflake's micro-partition pruning or BigQuery's clustering).

**Single-column sort key** — one column defines the sort order. The sort key should be the column most frequently used in range filter predicates (WHERE conditions). For time-series tables, the date or timestamp column is the standard sort key: queries filtering by date range skip the blocks outside the date range.

**Compound sort key** — multiple columns, sorted in the specified order. A compound sort key on (date, region) allows range-restricted scans on date alone (leading key), or on (date AND region) together, but not on region alone. The leading column of a compound sort key should be the most selective filter column.

**Interleaved sort key** — treats each column as equally important for filtering. Interleaved sort keys support efficient filtering on any column in the key, at the cost of higher maintenance overhead on VACUUM and INSERT operations. Interleaved sort keys are appropriate for ad-hoc workloads where filter patterns are unpredictable; they are not appropriate for write-heavy tables.

Workload Management

Redshift Workload Management (WLM) allocates memory and concurrency slots across query queues. The default configuration uses a single queue; production environments with mixed workloads benefit from separate queues for different query types.

**Queue separation** — separate queues for ETL/ELT transforms (which need large memory allocations for hash joins and aggregations) and interactive BI queries (which need fast response time with limited memory). Without queue separation, a large transform query can consume all available memory and queue all BI queries behind it.

**Concurrency scaling** — Redshift can automatically route overflow queries to transient concurrency scaling clusters when the main cluster is saturated. Concurrency scaling is charged per second of cluster time; it prevents query queuing during peak periods at moderate cost.

**Query monitoring rules** — WLM rules can automatically terminate, log, or change queue priority for queries that exceed resource thresholds (elapsed time, rows returned, memory consumed). Rules that log expensive queries to a monitoring table are more informative than rules that terminate them silently.

VACUUM and ANALYZE

Redshift's columnar storage requires maintenance to retain optimal query performance.

**VACUUM** reclaims space from deleted or updated rows and, for tables with sort keys, re-sorts data to restore sort order. DELETE and UPDATE operations mark rows as deleted but do not reclaim their storage; VACUUM FULL reclaims the storage. VACUUM SORT ONLY re-sorts without reclaiming space.

By default, Redshift runs VACUUM automatically on a schedule. For tables with very high write volume (frequent appends and deletes), automatic VACUUM may not keep up. Monitoring the 'unsorted' percentage in SVV_TABLE_INFO identifies tables that need manual VACUUM attention.

**ANALYZE** updates the statistics that the query planner uses to select join strategies and access paths. Stale statistics cause the planner to make suboptimal choices — choosing a nested loop join where a hash join would be faster, or choosing redistribution where collocated processing is possible. ANALYZE runs automatically after bulk loads; for tables that receive continuous small-batch updates, scheduling periodic manual ANALYZE ensures statistics stay current.

Redshift Serverless vs. Provisioned

Redshift Serverless automatically manages compute capacity, scaling up for queries and scaling to zero when idle. It is appropriate for variable workloads with periods of inactivity — development environments, low-volume analytics, or workloads that run a few times per day. Provisioned clusters provide consistent performance for sustained workloads and offer more configuration control (WLM, node type selection, reserved instances pricing).

For production analytics environments with predictable workloads, provisioned clusters with reserved instance pricing are more cost-effective than Serverless. For development or unpredictable workloads, Serverless eliminates idle compute cost and provides on-demand scalability.

Our cloud engineering practice optimises Redshift environments for performance and cost — contact us to discuss Redshift architecture and tuning for your data warehouse.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →