BlogData Architecture

Amazon Redshift Architecture: Design Patterns for Analytical Performance

Obed Tsimi
Obed Tsimi
Founder & Senior Tableau Architect
·December 1, 202713 min read

Amazon Redshift is AWS's managed cloud data warehouse — a columnar, MPP (massively parallel processing) database designed for analytical workloads. Understanding Redshift's architecture — how it distributes data across nodes, how sort keys and distribution keys affect query performance, and how RA3 nodes separate compute and storage — is the foundation for designing Redshift deployments that perform at scale.

Amazon Redshift is AWS's managed cloud data warehouse — a columnar, massively parallel processing (MPP) database designed for analytical workloads. Since its launch in 2012, Redshift has been the primary cloud data warehouse for AWS-native data stacks. Understanding Redshift's architecture — how it distributes data, how sort keys and distribution keys interact with query execution, and how RA3 nodes separate compute and storage — is essential for designing Redshift deployments that perform at scale.

Redshift Architecture Overview

**MPP architecture** — Redshift distributes data and query execution across multiple nodes. Each query is decomposed into segments that run in parallel across all nodes. The degree of parallelism scales with node count; adding nodes increases both storage capacity and query throughput.

**Columnar storage** — Redshift stores data column-by-column rather than row-by-row. Analytical queries that access only a subset of columns scan far less data in a columnar layout. Column data is also more compressible, reducing storage and I/O.

**Leader node and compute nodes** — the leader node receives queries, generates query execution plans, and coordinates execution across compute nodes. Compute nodes execute the actual query work and return results to the leader node for aggregation. For small result sets, most query time is on compute nodes; for large result returns, the leader node can become a bottleneck.

**RA3 nodes and managed storage** — RA3 (Redshift-managed storage) nodes separate compute from storage, similar to Snowflake and BigQuery. Data is stored in S3 and cached on local NVMe drives on each compute node. This separation allows independent scaling of compute and storage. RA3 is the recommended node type for new Redshift deployments.

Distribution Styles

How Redshift distributes rows across compute nodes — the distribution style — is one of the most impactful design decisions for query performance. Poorly chosen distribution causes data movement during joins and aggregations, which is the primary source of Redshift performance problems.

**DISTSTYLE AUTO** — Redshift automatically selects the optimal distribution style based on table size and query patterns. Recommended for most tables; Redshift's automatic statistics update and optimiser evolve the distribution as query patterns change.

**DISTSTYLE EVEN** — rows are distributed across nodes in a round-robin fashion, regardless of data values. Produces even storage distribution but may cause data movement during joins when the join key doesn't match the distribution key.

**DISTSTYLE KEY** — rows are distributed by the value of a specified column (DISTKEY). Rows with the same DISTKEY value are co-located on the same node. When two large tables are joined on their DISTKEY columns, the join executes on each node locally without data movement — the most efficient join pattern.

Use DISTKEY for large fact tables and their most frequently-joined dimension tables (joined on the same key). The goal is to co-locate the rows that are most frequently joined together.

**DISTSTYLE ALL** — a complete copy of the table is stored on every node. Appropriate for small dimension tables that are frequently joined to large fact tables; eliminates data movement for joins to this table at the expense of storage overhead.

Use ALL for small dimension tables (under a few hundred thousand rows). Joining a large fact table (DISTKEY) to a small dimension table (ALL) produces a local join on every node without data movement.

Sort Keys

Sort keys specify the physical order of data within each node slice. Redshift uses zone maps (min/max values per 1MB block) to skip blocks that cannot contain data matching a filter predicate. Sort keys make this skipping effective.

**Single-column sort key** — specify one column. Most effective when queries frequently filter on that column.

**Compound sort key** — multiple columns in order. Most effective when queries filter on the first sort key column, or on the first and second together. Range scans on the compound key prefix are efficient; range scans on non-prefix columns are not.

**Interleaved sort key** — each column in the sort key is weighted equally, enabling efficient scans for any single sort key column. More effective than compound for diverse query patterns but has higher maintenance overhead (VACUUM REINDEX required for optimal performance after large data loads).

Sort key selection guidance:

- Use the column most frequently in range predicates (date columns for time-series data)

- Use DISTKEY = sort key if the distribution key is also a common filter column

- Avoid high-cardinality columns as sort keys; low to medium cardinality columns with repetitive values enable better zone map effectiveness

Compression

Redshift stores column data with column-level encoding. Selecting appropriate compression encodings reduces storage and I/O:

**ENCODE AUTO** — Redshift automatically selects optimal encoding per column. Recommended for most tables; use ANALYZE COMPRESSION to review Redshift's recommendations if manually specifying encodings.

Common encoding types:

- **ZSTD** — general-purpose compression, good ratio and speed; default for many column types

- **BYTEDICT** — efficient for low-cardinality columns (status codes, categories)

- **DELTA** — efficient for monotonically increasing columns (dates, sequential IDs)

- **LZO** — higher compression ratio for longer text columns

Workload Management (WLM)

Redshift's Workload Management (WLM) controls how queries are queued and prioritised. Without WLM configuration, all queries compete for the same resource pool.

**Automatic WLM** — Redshift automatically manages concurrency and memory allocation based on query complexity and cluster load. Recommended for most deployments.

**Manual WLM** — define queues with explicit concurrency levels, memory percentages, and user or query group assignments. Use manual WLM when specific workloads (ETL jobs, ad-hoc queries, BI tool queries) need strict resource isolation.

For BI tools that run many concurrent simple queries, a queue with higher concurrency (10-20 slots) and lower memory per slot is appropriate. For complex ETL queries that require large memory for sort and hash operations, a queue with lower concurrency (2-5 slots) and higher memory per slot is better.

Redshift Spectrum

Redshift Spectrum allows querying data in S3 directly from Redshift without loading it into Redshift tables. External tables in S3 (Parquet, ORC, CSV, JSON) are queried using standard SQL alongside internal Redshift tables.

Use cases:

- Historical data in S3 that is rarely queried but needs to be available for ad-hoc analysis

- Landing zone data that is queryable before being loaded into Redshift

- Large volumes of cold data that would be expensive to store in Redshift managed storage

Spectrum queries execute partially on Redshift's compute nodes and partially on a fleet of Spectrum nodes managed by AWS. Performance for selective queries on S3 Parquet data (with partition pruning) is reasonable; performance for full-scan queries is slower than equivalent Redshift internal table queries.

RA3 vs Legacy Node Types

Legacy Redshift deployments used DS2 or DC2 nodes where storage was local to the node. RA3 nodes use Redshift Managed Storage (backed by S3 with NVMe caching), enabling:

- Independent scaling of compute and storage

- Consistent performance as storage grows (S3 scales automatically)

- Serverless Redshift (pay per query, no node management)

New Redshift deployments should use RA3 nodes or Redshift Serverless unless there is a specific reason to use older node types.

Our data architecture practice designs Redshift architectures and optimises Redshift deployments for enterprise analytics teams — contact us to discuss your Redshift performance and architecture.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →