AWS Data Architecture: Building a Modern Data Platform on Amazon Web Services

AWS has the broadest data service catalog of any cloud provider. Here is the reference architecture for building a modern data platform on AWS — choosing the right services and avoiding the common pitfalls.

The quick answer

The breadth of the AWS catalog makes choosing the right combination of services complex. A modern data platform on AWS typically uses S3 for object storage, AWS Glue or MSK (Kafka) for ingestion, Redshift or Snowflake/Databricks (hosted on AWS) for the warehouse layer, dbt for transformation, Airflow on MWAA or Step Functions for orchestration, and QuickSight or a third-party BI tool for serving. The AWS-native path (Glue + Redshift + Lake Formation + QuickSight) is coherent but imposes significant AWS lock-in. The hybrid path (Snowflake or Databricks on AWS, with best-of-breed tooling) is more flexible. This guide covers both patterns.

The AWS data service landscape

AWS offers data services in every layer of the data platform stack. Understanding the service map prevents choosing redundant tools:

**Storage**: S3 (standard object storage, the foundation of every AWS data platform), EFS (managed NFS for shared file systems), EBS (block storage for compute instances). For a data platform, S3 is the primary data lake storage layer.

**Ingestion**: AWS DMS (Database Migration Service, for one-time migrations and CDC from RDS and on-premise databases), Kinesis Data Streams (AWS-native event streaming; similar in role to Kafka but not API-compatible with it), Kinesis Data Firehose (since renamed Amazon Data Firehose; managed delivery to S3, Redshift, OpenSearch, Splunk), AWS Glue (ETL and data integration service), MSK (Managed Streaming for Apache Kafka, the option to choose when Kafka API compatibility is required).

**Catalog, governance, and processing**: AWS Lake Formation (governance layer for the S3 data lake: access control on top of the Glue Data Catalog), AWS Glue Data Catalog (Hive-compatible metadata catalog for S3 data), AWS Glue ETL (Spark-based managed ETL).

**Data warehouse**: Amazon Redshift (columnar MPP warehouse; the RA3 node type with Redshift Managed Storage separates compute and storage), Redshift Serverless (no cluster provisioning). For organisations not committed to AWS-native, Snowflake and Databricks are both well supported on AWS.

**Query engines**: Amazon Athena (serverless SQL query engine that queries S3 directly; built on Presto in earlier engine versions and on Trino in current ones; $5/TB scanned), EMR (managed Spark, Hive, Presto on EC2 or EKS), Redshift Spectrum (query S3 data from Redshift without loading it).

**Orchestration**: Amazon MWAA (Managed Workflows for Apache Airflow), AWS Step Functions (serverless workflow orchestration for AWS services), AWS Glue Workflows (for Glue-centric pipelines).

**BI and serving**: Amazon QuickSight (AWS-native BI tool; SPICE in-memory engine; per-session pricing), Tableau/Power BI/Looker hosted on EC2 or accessed via SaaS.

**ML / AI**: Amazon SageMaker (managed ML platform for model training, feature store, endpoint serving), Amazon Bedrock (managed LLM APIs).

The AWS-native data stack

For organisations committed to AWS-native tooling, the coherent stack is:

**Ingestion**: Kinesis Data Firehose (for streaming event data → S3), AWS DMS (for CDC from RDS/Aurora), AWS Glue ETL (for batch ingestion and transformation of third-party data). For SaaS source connectors at scale, Fivetran with Redshift as the destination is more reliable than Glue for complex sources.

**Storage**: S3 as the data lake layer (organised in bronze/silver/gold prefix structure). AWS Glue Data Catalog to register table metadata. Lake Formation for access governance.

**Warehouse**: Redshift RA3 with Redshift Managed Storage. Redshift Spectrum for querying S3 data from Redshift SQL without loading it into Redshift — enabling a lake-house pattern where raw data stays in S3 and is queried via Spectrum.

**Transformation**: AWS Glue ETL (Spark-based) for large-scale data processing, or dbt-core running in a Docker container on ECS/Fargate for SQL transformation. dbt with the dbt-redshift adapter connects directly to Redshift; this is the recommended pattern for SQL transformation in an AWS/Redshift environment.

**Orchestration**: MWAA for Airflow-based orchestration (the standard for complex multi-step pipelines), or Step Functions for AWS service-centric workflows (triggering Glue jobs, Lambda functions, SageMaker pipelines).

**BI**: Amazon QuickSight for AWS-native reporting at scale (especially for large user populations where QuickSight's per-session pricing is cost-effective), or third-party BI tools.
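As a rough illustration of the Step Functions option described under orchestration, the sketch below builds an Amazon States Language definition that runs Glue jobs one after another. It only produces the JSON; it does not call AWS. The job names (`bronze_to_silver`, `silver_to_gold`), the function name, and the retry values are hypothetical placeholders, not settings from a real pipeline.

```python
import json

# Step Functions service integration that starts a Glue job run; the ".sync"
# suffix makes the state wait until the job run has finished.
GLUE_RUN_SYNC = "arn:aws:states:::glue:startJobRun.sync"


def build_glue_pipeline(job_names):
    """Return an Amazon States Language definition that runs Glue jobs in order.

    Job names must be unique, because each one is used to name its state.
    """
    if not job_names:
        raise ValueError("at least one Glue job name is required")

    states = {}
    for index, job_name in enumerate(job_names):
        state = {
            "Type": "Task",
            "Resource": GLUE_RUN_SYNC,
            "Parameters": {"JobName": job_name},
            # Retry values are illustrative assumptions, not recommendations.
            "Retry": [
                {
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 60,
                    "MaxAttempts": 2,
                    "BackoffRate": 2.0,
                }
            ],
        }
        if index + 1 < len(job_names):
            # Chain to the state that runs the next job in the list.
            state["Next"] = f"Run_{job_names[index + 1]}"
        else:
            state["End"] = True
        states[f"Run_{job_name}"] = state

    return {
        "Comment": "Run Glue jobs sequentially",
        "StartAt": f"Run_{job_names[0]}",
        "States": states,
    }


# Hypothetical job names following the bronze/silver/gold layout.
definition = build_glue_pipeline(["bronze_to_silver", "silver_to_gold"])
print(json.dumps(definition, indent=2))
```

The printed JSON could be supplied as the definition when creating a state machine. For pipelines with branching, sensors, or many non-AWS dependencies, an Airflow DAG on MWAA is likely the better fit, as noted above.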

The hybrid AWS data stack

For organisations that want best-of-breed tooling hosted on AWS without full AWS lock-in:

**Ingestion**: Fivetran or Airbyte Cloud (hosted on AWS, connecting to Snowflake or Databricks)

**Storage**: Snowflake on AWS (storage and compute in AWS regions, with Snowflake's virtual warehouse model), or Databricks on AWS with Delta Lake on S3

**Transformation**: dbt Cloud or dbt Core on ECS, with dbt-snowflake or dbt-databricks adapter

**Orchestration**: Airflow on MWAA, Prefect Cloud, or Astronomer (managed Airflow on AWS)

**BI**: Tableau, Power BI, or Looker (all available as SaaS or self-hosted on EC2)

The hybrid approach avoids Glue's complexity (Glue ETL has high configuration overhead and slow iteration cycles compared to dbt) and Redshift's architectural constraints (table distribution style and sort key design require significant expertise, whereas Snowflake's approach is more automated).

Key architectural decisions

**Redshift vs Snowflake on AWS**: the central choice for the warehouse layer. Snowflake on AWS provides a superior user experience, better automatic query optimisation, and no distribution key configuration overhead. Redshift provides better integration with AWS-native services (Lake Formation, Glue, DMS, SageMaker) and lower cost for steady-state workloads when provisioned clusters are covered by reserved nodes. See snowflake vs redshift for the full comparison.

**Glue vs Fivetran/Airbyte for ingestion**: Glue ETL is flexible (any Python code) but has high overhead (slow job startup, complex debugging, no pre-built connector ecosystem). Fivetran and Airbyte have pre-built connectors for most common sources and are significantly faster to deploy. Use Glue for custom ingestion logic that Fivetran/Airbyte cannot handle; use Fivetran/Airbyte for standard SaaS and database sources.

**Athena vs Redshift for ad-hoc analytics**: Athena ($5/TB scanned) is cost-effective for low-frequency ad-hoc queries on S3 data. Redshift is better for high-frequency, concurrent analytical queries where Redshift's performance optimisation exceeds Athena's. For BI tools with many concurrent users, Redshift outperforms Athena; for occasional analyst exploration of raw S3 data, Athena is simpler.

**Lake Formation vs IAM for data access**: Lake Formation adds a governed data access layer on top of S3 and Glue Data Catalog, enabling column-level security and row-level access control on S3 data. For regulated or audited environments (HIPAA, PCI, SOC 2), Lake Formation's governance capabilities are important. For simpler environments, IAM policies and S3 bucket policies provide sufficient access control with less complexity, although they work at the bucket, prefix, and object level rather than the column or row level.
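To make the column-level security point concrete, here is a minimal sketch of the request a Lake Formation column grant takes. It only builds the arguments; the commented-out lines show how they might be passed to boto3's `grant_permissions` call. The account ID, role, database, table, and column names are all hypothetical.

```python
def column_select_grant(principal_arn, database, table, columns):
    """Build keyword arguments for a Lake Formation column-level SELECT grant."""
    if not columns:
        raise ValueError("list at least one column to grant")
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                # Only these columns become readable for the principal.
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


# Hypothetical principal and table: analysts may read order facts,
# but not any customer-identifying columns in the same table.
request = column_select_grant(
    "arn:aws:iam::123456789012:role/analyst",
    "sales",
    "orders",
    ["order_id", "order_date", "amount"],
)

# Applying the grant needs boto3 and AWS credentials, so it is left commented:
#   import boto3
#   boto3.client("lakeformation").grant_permissions(**request)
print(request)
```

An IAM or bucket policy cannot express this restriction, because it has no notion of columns within a Parquet file; that gap is the main reason to accept Lake Formation's extra complexity.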

Cost considerations

AWS data services have varied pricing models. Watch for:

**Glue DPU costs**: AWS Glue ETL jobs are billed per DPU-hour (DPU = data processing unit). A 10-DPU Glue job running for 1 hour costs ~$4.40, which implies a rate of about $0.44 per DPU-hour. For frequent, short-running transformation jobs, this adds up; dbt on a persistent EC2 instance or ECS task is often cheaper for SQL transformation.

**Athena scan costs**: because Athena bills $5 per TB scanned, cost is driven by how much data each query reads. Unoptimised queries against unpartitioned data scan the full table. Partition the data and store it in a columnar format (Parquet) so that queries read only the partitions and columns they need.

**Data transfer costs**: data transferred between AWS regions or to the internet incurs transfer fees. Co-locate data processing in the same region as storage to minimise inter-region transfer charges.
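The Glue and Athena figures above reduce to simple arithmetic, sketched below. The Glue rate is the one implied by the 10-DPU example and the Athena rate is the $5/TB quoted in this guide; actual rates vary by region and change over time, and the sketch ignores billing minimums. The function names, the 2 TB table, and the pruning fractions are illustrative assumptions.

```python
# Rates taken from the figures in this guide; treat them as assumptions
# and check current regional pricing before relying on them.
GLUE_USD_PER_DPU_HOUR = 0.44  # implied by 10 DPUs x 1 hour costing about $4.40
ATHENA_USD_PER_TB = 5.00


def glue_job_cost(dpus, hours):
    """Cost of one Glue job run: DPUs x hours x rate."""
    return dpus * hours * GLUE_USD_PER_DPU_HOUR


def athena_query_cost(tb_scanned):
    """Cost of one Athena query given the terabytes it scans."""
    return tb_scanned * ATHENA_USD_PER_TB


def scanned_tb(table_tb, partition_fraction=1.0, column_fraction=1.0):
    """Estimate data scanned after partition pruning and column selection.

    Both fractions are simplifying assumptions: the share of partitions the
    query touches, and the share of bytes held by the columns it reads
    (only meaningful for columnar formats such as Parquet).
    """
    for fraction in (partition_fraction, column_fraction):
        if not 0.0 <= fraction <= 1.0:
            raise ValueError("fractions must be between 0 and 1")
    return table_tb * partition_fraction * column_fraction


# The 10-DPU, 1-hour job from the text.
print(f"Glue job: ${glue_job_cost(10, 1):.2f}")

# Hypothetical 2 TB table: full scan vs one of 30 daily partitions,
# reading columns that hold 20% of the bytes.
full_scan = athena_query_cost(scanned_tb(2.0))
pruned = athena_query_cost(scanned_tb(2.0, 1 / 30, 0.2))
print(f"Athena full scan: ${full_scan:.2f}")
print(f"Athena pruned:    ${pruned:.4f}")
```

Multiplying the per-query figure by the expected number of queries per month gives a quick sense of whether partitioning work pays for itself, or whether a workload is frequent enough to justify moving it to Redshift.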

For the Redshift-specific architecture, see snowflake vs redshift. For the Databricks alternative on AWS, see azure synapse vs databricks. For the data pipeline architecture that runs on AWS, see data pipeline best practices.

Our cloud engineering and data architecture practices design and implement AWS data platforms — from service selection through implementation and cost optimisation. Book a free 30-minute audit to discuss your AWS data architecture.
