Analytics on AWS: Designing the AWS Analytics Stack

The AWS analytics services and how they fit together — S3 as the data lake foundation, AWS Glue for cataloguing and ETL, Athena for serverless SQL, Redshift for the data warehouse, Kinesis for streaming ingestion, and the architecture patterns that combine them into a production analytics platform.

AWS offers more analytics services than any other cloud provider — and more opportunity to make expensive architecture mistakes. The depth of the AWS analytics portfolio (S3, Glue, Athena, EMR, Kinesis, Redshift, QuickSight, Lake Formation, DataZone, MSK, OpenSearch, SageMaker) means the tool selection and integration decisions are genuinely hard. This guide covers the core AWS analytics services, how they fit together, and the architecture patterns that work in production.

The AWS Analytics Foundation: S3

Amazon S3 is the foundational storage layer for AWS analytics. Almost every analytics architecture on AWS starts here — raw data lands in S3, processed data lives in S3, and the data warehouse (Redshift Managed Storage) is backed by S3.

**S3 as the data lake.** Raw data from source systems (operational databases, APIs, event streams, file transfers) lands in S3 in its original format. S3 storage is cheap ($0.023/GB/month standard, $0.0125/GB for infrequent access, $0.004/GB for Glacier), highly durable (eleven 9s), and serverless. Data can remain in S3 indefinitely, accessible via any number of compute engines.

**S3 organisation.** Organise S3 buckets by data zone (raw, processed, curated) and by source domain (salesforce/, product/, financials/). Within each zone, partition by date: raw/salesforce/2024/03/15/. Date partitioning is the most important structure decision — it enables partition pruning in Athena and other S3-querying services.

**S3 access control.** Use IAM roles (not access keys) for all programmatic S3 access. Enable S3 server-side encryption (SSE-S3 or SSE-KMS). Enable S3 access logging and object versioning for data zones that require audit trails.

AWS Glue: Cataloguing and ETL

AWS Glue serves two functions: the Glue Data Catalog (a metadata repository for tables defined over S3 data) and Glue ETL (a managed Spark environment for data transformation jobs).

**Glue Data Catalog** stores table definitions — schema, partition information, and location — for data in S3. When you define a Glue table pointing at a partitioned S3 prefix, services like Athena and Redshift Spectrum can query that data using the table's schema without needing to inspect the files directly. Glue Crawlers can automatically discover and update schemas by scanning S3 paths — useful for tracking schema evolution in ingested data.

**Glue ETL** runs PySpark or Scala Spark jobs on managed infrastructure. You write the transformation logic; Glue handles cluster provisioning, scaling, and termination. Glue is billed per DPU-hour (Data Processing Unit). For straightforward ETL workflows, Glue is convenient; for complex Python-heavy data engineering, Databricks or EMR (managed Spark) provide more flexibility.

**Glue Studio** provides a visual interface for building ETL pipelines that generate underlying PySpark code. Appropriate for teams with limited Spark expertise who need to build simple transformations.

Amazon Athena: Serverless SQL on S3

Athena is a serverless SQL query engine that queries data directly in S3. There is no cluster to manage — you write SQL, Athena executes it against S3 data using distributed compute, and you pay per TB scanned ($5/TB).

Athena use cases:

- Ad-hoc SQL queries against large S3 datasets without loading them into a warehouse

- Data exploration during pipeline development

- Federated queries joining S3 data with data in RDS, Redshift, DynamoDB, or other sources via Athena Federated Query

- The query engine for Glue-catalogued S3 tables

**Performance optimisation:** Partition all Athena-queried tables by date or another high-selectivity column. Use columnar formats (Parquet or ORC) instead of CSV or JSON — columnar formats reduce bytes scanned by allowing column-level skipping. Enable S3 query result caching for repeated identical queries.

**Athena vs Redshift Spectrum:** Both query S3 data with SQL. Athena is serverless with per-query billing; Redshift Spectrum requires an existing Redshift cluster and charges $5/TB on top of Redshift compute costs. Use Athena for ad-hoc and exploration queries; use Redshift Spectrum for queries that join S3 data with Redshift warehouse tables.

Amazon Redshift: The AWS Data Warehouse

Redshift is AWS's purpose-built analytical SQL database. It is the right choice for structured analytical workloads where query complexity is high, multi-user concurrency is required, or BI tool integration is the primary use case.

Redshift RA3 nodes (the current generation) separate compute from managed storage — storage is backed by S3 via Redshift Managed Storage, and you scale compute nodes independently of data volume. Redshift Serverless scales automatically based on workload and bills per RPU-second — appropriate for intermittent or unpredictable workloads.

For Redshift architecture and optimisation practices, see the separate Redshift best practices guide.

Amazon Kinesis: Streaming Ingestion

Amazon Kinesis provides managed streaming data pipelines for real-time ingestion:

**Kinesis Data Streams:** A horizontally scalable, durable stream of events. Producers write records to shards; consumers read from shards. Unlike Kafka, Kinesis Data Streams has a maximum data retention of 7 days and has strict throughput limits per shard (1MB/s write, 2MB/s read). Appropriate for event streaming at scale where you control both producer and consumer.

**Kinesis Data Firehose:** A managed delivery pipeline that reads from Kinesis Data Streams (or directly from producers) and writes to S3, Redshift, OpenSearch, or Splunk. Firehose handles batching, compression, and format conversion (to Parquet/ORC for S3). This is the standard pattern for streaming events into S3: producer → Kinesis Data Streams → Kinesis Firehose → S3 (partitioned Parquet).

**Amazon MSK (Managed Streaming for Apache Kafka):** Managed Kafka for teams that need Kafka's richer ecosystem — longer retention, more consumer groups, Kafka Streams, Kafka Connect connectors, or compatibility with existing Kafka producers. MSK is the right choice when Kinesis's throughput limits or 7-day retention are insufficient.

Lake Formation: Access Governance for the Data Lake

AWS Lake Formation provides centralised access control over S3 data via the Glue Data Catalog. Rather than managing S3 bucket policies and IAM permissions separately for each table, Lake Formation provides table-level and column-level permissions that apply consistently across Athena, Redshift Spectrum, Glue ETL, and EMR.

For organisations with PII data in S3 or compliance requirements for data lake access, Lake Formation is the right governance layer. The tradeoff: Lake Formation adds setup complexity and requires all access to route through its permission model.

Common AWS Analytics Architecture Patterns

Pattern 1: Batch ELT (most common)

Source systems → Fivetran/Airbyte to S3 (raw zone) → Glue Crawler to catalog → dbt on Redshift for transformation → Redshift data warehouse → Tableau/QuickSight for BI.

This pattern handles most analytical workloads for organisations that are AWS-native and want managed components. The transformation layer (dbt on Redshift) is mature and well-supported.

Pattern 2: Serverless Analytics (low-traffic or ad-hoc)

Source systems → custom Lambda or Glue ETL to S3 (Parquet, partitioned) → Athena for SQL queries → QuickSight for BI.

No persistent warehouse. Appropriate for organisations with low query concurrency, unpredictable usage patterns, and analytical teams comfortable with Athena SQL. Cost is proportional to actual query volume.

Pattern 3: Streaming + Batch Hybrid

Events → Kinesis Data Streams → Kinesis Firehose → S3 (raw Parquet, partitioned by hour) + Kinesis Data Analytics (Flink) for real-time aggregations → Redshift for historical analysis → real-time OLAP engine (ClickHouse or Druid on EC2) for sub-minute dashboards.

Appropriate for organisations with genuine real-time requirements alongside historical analytical needs. Operationally complex; justify carefully before building.

**What to avoid:** Building custom Python extractors for source systems that Fivetran connects to; using S3 without partitioning (Athena full-table scans are expensive); using Glue ETL for simple SQL transformations that dbt handles better; mixing analytical workloads into production RDS instances instead of a dedicated analytical database.

Our data architecture consulting practice designs and implements AWS analytics architectures — contact us to discuss your AWS analytics requirements.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →