A cloud data platform is the integrated set of cloud-native services that an organization uses to store, process, govern, and expose data for analytical and operational use. It is broader than a data warehouse — it encompasses the full infrastructure layer from data ingestion through consumption, including the governance policies and operational processes that keep it reliable.
The term "cloud data platform" is used in different ways by different vendors. AWS promotes its "Modern Data Architecture," Google has its "Data Cloud," Azure has its "Analytics Platform," and Databricks has the "Data Intelligence Platform." Despite the different branding, they share a common architecture: cloud object storage as the foundation, analytical compute services above it, a governance layer across it, and a consumption layer (BI, ML, applications) at the top.
## Why "Cloud Data Platform" Rather Than "Cloud Data Warehouse"
A data warehouse is one component of a cloud data platform — the analytical query engine. A cloud data platform is the full system:
- **Ingestion layer** — pipelines that move data from source systems (operational databases, SaaS applications, IoT devices, third-party feeds) into the platform's storage
- **Storage layer** — where data lives: cloud object storage (S3, GCS, Azure Blob) for raw and semi-structured data; the cloud data warehouse (Snowflake, BigQuery, Redshift) for governed analytical tables
- **Processing layer** — compute for batch transformation (dbt, Spark), streaming processing (Kafka, Flink), and ML training (Vertex AI, SageMaker)
- **Governance layer** — data catalog, data lineage, access control, quality monitoring, and policy enforcement
- **Consumption layer** — BI tools, notebooks, APIs, and applications that use the platform's data
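Within that system, the data warehouse supplies the governed analytical tables and the SQL engine that queries them. Ingestion, processing outside the warehouse, governance, and the consumption tooling make up the rest of the platform.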
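The layer model above can be written down as a small data structure, which makes the warehouse-versus-platform distinction concrete. This is a minimal sketch: the `Layer` enum, the `Platform` class, and the component names passed to it are illustrative assumptions, not part of any vendor's API.

```python
from dataclasses import dataclass, field
from enum import Enum


class Layer(Enum):
    INGESTION = "ingestion"
    STORAGE = "storage"
    PROCESSING = "processing"
    GOVERNANCE = "governance"
    CONSUMPTION = "consumption"


@dataclass
class Platform:
    # Maps each layer to the components chosen for it (names are examples only).
    components: dict = field(default_factory=dict)

    def add(self, layer: Layer, component: str) -> None:
        self.components.setdefault(layer, []).append(component)

    def missing_layers(self) -> list:
        # Layers that have no component yet, in declaration order.
        return [layer for layer in Layer if not self.components.get(layer)]


# A warehouse on its own fills only one layer of the platform.
platform = Platform()
platform.add(Layer.STORAGE, "Snowflake")
print([layer.value for layer in platform.missing_layers()])
# -> ['ingestion', 'processing', 'governance', 'consumption']
```

Listing the unfilled layers is a quick way to see what still has to be chosen before a warehouse deployment can be called a platform.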
## The Major Cloud Data Platform Architectures
### AWS Data Platform
The canonical AWS data platform uses:
- **S3** for raw data storage (the data lake)
- **AWS Glue** for managed ETL and data catalog
- **Amazon Redshift** for analytical querying (with Redshift Spectrum for querying S3 data directly)
- **Amazon EMR** (managed Hadoop/Spark) or **AWS Glue** Spark jobs (authored visually in Glue Studio) for large-scale batch processing
- **Amazon Kinesis** or **Amazon MSK** (Managed Streaming for Apache Kafka) for streaming ingestion
- **AWS Lake Formation** for governance and access control across the platform
- **Amazon SageMaker** for ML
- **QuickSight** for BI (or integrated Tableau/Power BI)
### Google Cloud Data Platform
The Google Cloud equivalent:
- **Google Cloud Storage** for raw data storage
- **BigQuery** for analytical querying — serverless, and increasingly the processing engine for both SQL analytics and ML (BigQuery ML)
- **Dataflow** (managed Apache Beam) for both batch and streaming processing
- **Pub/Sub** for event streaming ingestion
- **Dataplex** for governance, cataloging, and data quality across the platform
- **Vertex AI** for ML
- **Looker** for BI (or integrated Tableau/Power BI)
### Databricks Data Intelligence Platform (Lakehouse)
Databricks structures the platform around the lakehouse architecture:
- **Delta Lake** (on cloud object storage: S3, GCS, or ADLS) for bronze (raw), silver, and gold layer tables
- **Databricks SQL** for SQL analytics
- **Databricks Runtime (Apache Spark)** for large-scale batch and streaming processing
- **MLflow** for ML lifecycle management
- **Unity Catalog** for governance and access control
- **Databricks Partner Connect** for BI (Tableau, Power BI, Looker connect to Databricks SQL)
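The three stacks above can be lined up by the role each service plays. The sketch below transcribes the lists into a lookup table. The role names (`sql_analytics`, `streaming_ingestion`, and so on) are labels chosen here for illustration, and where a list offers alternatives only the first one is kept.

```python
# Role -> service, transcribed from the three architecture lists above.
# Role names are illustrative labels, not vendor terminology.
STACKS = {
    "aws": {
        "object_storage": "S3",
        "sql_analytics": "Amazon Redshift",
        "batch_processing": "Amazon EMR",
        "streaming_ingestion": "Amazon Kinesis",
        "governance": "AWS Lake Formation",
        "ml": "Amazon SageMaker",
        "bi": "QuickSight",
    },
    "gcp": {
        "object_storage": "Google Cloud Storage",
        "sql_analytics": "BigQuery",
        "batch_processing": "Dataflow",
        "streaming_ingestion": "Pub/Sub",
        "governance": "Dataplex",
        "ml": "Vertex AI",
        "bi": "Looker",
    },
    "databricks": {
        "object_storage": "Delta Lake on S3, GCS, or ADLS",
        "sql_analytics": "Databricks SQL",
        "batch_processing": "Databricks Runtime (Apache Spark)",
        # The Databricks list above names no dedicated ingestion service.
        "governance": "Unity Catalog",
        "ml": "MLflow",
        "bi": "Databricks Partner Connect",
    },
}


def equivalents(role):
    """Return the service each stack uses for a role, or None if unlisted."""
    return {stack: services.get(role) for stack, services in STACKS.items()}


print(equivalents("governance"))
# -> {'aws': 'AWS Lake Formation', 'gcp': 'Dataplex', 'databricks': 'Unity Catalog'}
```

A `None` in the result marks a role the corresponding list leaves open, which is where an assembled platform needs a component from elsewhere.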
## The Governance Layer: What Makes It a Platform
What distinguishes a cloud data platform from a collection of disconnected cloud services is the governance layer — the infrastructure that makes data discoverable, trustworthy, and secure across the full system.
**Data catalog and lineage** — the platform knows what data exists, where it came from, how it has been transformed, and what depends on it. Impact analysis (what breaks if I change this table?) and provenance (where did this number come from?) are answerable without manual investigation.
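Both questions reduce to graph traversal once lineage is recorded as edges between datasets. The following is a minimal sketch assuming a hypothetical in-memory lineage graph; the dataset names are invented, and a real catalog would store and serve these edges itself.

```python
from collections import deque

# Hypothetical lineage edges: upstream dataset -> datasets derived from it.
LINEAGE = {
    "raw.orders": ["staging.orders"],
    "raw.customers": ["staging.customers"],
    "staging.orders": ["marts.revenue"],
    "staging.customers": ["marts.revenue", "marts.churn"],
    "marts.revenue": ["dashboard.exec_kpis"],
    "marts.churn": [],
    "dashboard.exec_kpis": [],
}


def downstream(graph, start):
    """Impact analysis: everything that depends on `start`, directly or not."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def upstream(graph, target):
    """Provenance: every dataset that feeds `target`, directly or not."""
    # Reverse the edges, then reuse the same breadth-first traversal.
    reverse = {}
    for parent, children in graph.items():
        for child in children:
            reverse.setdefault(child, []).append(parent)
    return downstream(reverse, target)


print(sorted(downstream(LINEAGE, "staging.orders")))
# -> ['dashboard.exec_kpis', 'marts.revenue']
print(sorted(upstream(LINEAGE, "marts.churn")))
# -> ['raw.customers', 'staging.customers']
```

Changing `staging.orders` therefore puts the revenue mart and the executive dashboard at risk, while a number in `marts.churn` traces back to the raw customer feed.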
**Unified access control** — access policies are enforced consistently across storage, compute, and analytics services. A user who should not see PII cannot see it whether they are running a SQL query in the warehouse, querying raw files in object storage, or accessing a data source through a BI tool.
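One way to picture this is a single policy function that every access path calls before returning data. The sketch below assumes a hypothetical tag-based policy: the `Column` type, the `pii` tag, the role names, and the three path names are all invented for illustration, and real platforms enforce this inside the governance service, not in application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    tags: frozenset = frozenset()


# Hypothetical policy: which roles may read columns carrying each tag.
TAG_POLICY = {"pii": {"privacy_officer"}}

CUSTOMERS = [
    Column("customer_id"),
    Column("email", frozenset({"pii"})),
    Column("region"),
]


def visible_columns(columns, roles):
    """Return the column names the caller may read.

    Untagged columns are open. A tagged column needs a permitted role for
    every tag it carries; tags missing from the policy are denied by default.
    """
    return [
        col.name
        for col in columns
        if all(roles & TAG_POLICY.get(tag, set()) for tag in col.tags)
    ]


ACCESS_PATHS = ("warehouse_sql", "object_storage", "bi_tool")


def read(path, roles):
    # Every access path delegates to the same policy check, so the answer
    # cannot differ between the warehouse, raw files, and the BI tool.
    if path not in ACCESS_PATHS:
        raise ValueError(f"unknown access path: {path}")
    return visible_columns(CUSTOMERS, set(roles))


for path in ACCESS_PATHS:
    print(path, read(path, {"analyst"}))
# each line ends with ['customer_id', 'region']
```

The design point is that the policy lives in one place. If each service kept its own copy of the rules, the copies would drift and PII would leak through whichever path was updated last.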
**Data quality monitoring** — the platform surfaces data quality issues — missing records, schema changes, statistical anomalies — before they propagate to analytical consumers. Quality failures trigger alerts and, in well-governed platforms, halt downstream pipelines rather than silently propagating bad data.
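The halt-on-failure behavior can be sketched as a gate that runs before any downstream step. This is an illustrative example only: the function names, thresholds, and the null-ratio check (a crude stand-in for statistical anomaly detection) are assumptions, not a specific tool's API.

```python
class QualityGateError(Exception):
    """Raised to stop the pipeline when a batch fails its checks."""


def check_batch(rows, expected_columns, min_rows, max_null_ratio=0.1):
    """Return a list of human-readable failures; empty means the batch passed."""
    failures = []
    # Missing records: fewer rows than the pipeline expects.
    if len(rows) < min_rows:
        failures.append(f"missing records: got {len(rows)}, expected at least {min_rows}")
    # Schema change: any row whose columns differ from the expected set.
    expected = set(expected_columns)
    if any(set(row) != expected for row in rows):
        failures.append("schema change: a row's columns differ from the expected set")
    # Simple anomaly check: too many nulls in a column.
    for column in expected_columns:
        nulls = sum(1 for row in rows if row.get(column) is None)
        if rows and nulls / len(rows) > max_null_ratio:
            failures.append(f"anomaly: {column} is null in {nulls}/{len(rows)} rows")
    return failures


def run_pipeline(rows, expected_columns, min_rows, downstream):
    """Run `downstream` only if the batch passes; otherwise halt loudly."""
    failures = check_batch(rows, expected_columns, min_rows)
    if failures:
        raise QualityGateError("; ".join(failures))
    return downstream(rows)


good = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 5.0}]
print(run_pipeline(good, ["id", "amount"], min_rows=2, downstream=len))
# -> 2
```

Raising an exception instead of logging a warning is what keeps a bad batch from reaching dashboards: the orchestrator sees a failed task and does not schedule the steps that depend on it.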
**Metric and semantic consistency** — the platform enforces consistent definitions of business metrics across all consumers. The semantic layer sits between raw data and consumer tools, ensuring that revenue means the same thing in every tool and every query.
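A semantic layer can be thought of as a registry that holds each metric's definition once and generates the query for every consumer. The sketch below assumes a hypothetical registry format and invented table and column names; production semantic layers use their own definition languages.

```python
# Hypothetical metric registry: one definition per business metric.
METRICS = {
    "revenue": {
        "table": "marts.orders",
        "expression": "SUM(amount)",
        "filters": ["status = 'completed'"],
    },
}


def compile_metric(name, group_by=()):
    """Build the SQL for a metric so every consumer gets the same definition."""
    metric = METRICS[name]
    dims = list(group_by)
    select = ", ".join(dims + [f"{metric['expression']} AS {name}"])
    sql = f"SELECT {select} FROM {metric['table']}"
    if metric["filters"]:
        sql += " WHERE " + " AND ".join(metric["filters"])
    if dims:
        sql += " GROUP BY " + ", ".join(dims)
    return sql


print(compile_metric("revenue", ["region"]))
# -> SELECT region, SUM(amount) AS revenue FROM marts.orders WHERE status = 'completed' GROUP BY region
```

Because the filter on completed orders lives in the registry, a dashboard grouped by region and an ad hoc query with no grouping both count revenue the same way.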
## Building vs. Buying a Cloud Data Platform
There is no single "cloud data platform" product to purchase. Every organization assembles a platform from components — and the assembly choices determine the platform's capabilities, operational complexity, and cost.
**Best-of-breed assembly** selects the best tool for each layer: Fivetran for ingestion, Snowflake for warehousing, dbt for transformation, Airflow for orchestration, Tableau for BI. This approach produces maximum capability in each layer at the cost of integration complexity and vendor proliferation.
**Cloud-native consolidation** uses the native services of a single cloud provider (AWS, GCP, Azure) for as many layers as possible. This reduces integration complexity and often produces better economics for organizations deeply committed to a single cloud, at the cost of lock-in: each layer is only as good as that provider's service for it.
**Databricks lakehouse consolidation** uses Databricks as the central compute and governance layer, with Delta Lake as the storage format. Strong for organizations where both ML and SQL analytics are first-class requirements.