Cloud Data Architecture Patterns: Medallion, Lambda, Kappa, and When to Use Each

A guide to the most widely used cloud data architecture patterns: medallion architecture for lakehouse environments, Lambda for parallel batch and streaming, Kappa for stream-first simplicity, and data mesh for decentralised ownership, plus an evaluation framework for choosing the right pattern for your organisation.

Cloud data architecture patterns are the structural templates that determine how data flows from source systems through transformation into analytical consumption. The wrong pattern for a given set of requirements produces either over-engineered complexity or under-powered simplicity. This guide covers the four most widely used patterns — medallion, Lambda, Kappa, and the data mesh — with the evaluation criteria for choosing among them.

The Medallion Architecture

Medallion architecture is the default structural pattern for lakehouse environments (for example, Databricks, or object storage using the Delta Lake or Apache Iceberg table formats). It organises data into three layers, named after medals to signal increasing data quality:

**Bronze (raw):** Data as ingested from source systems — unmodified, append-only. A Bronze table of Salesforce opportunities contains every record as it arrived from Fivetran, including historical versions and duplicates. Bronze is the permanent audit layer — it is never modified or deleted. If a downstream model produces incorrect output, you can reprocess from Bronze.

**Silver (cleansed and standardised):** Bronze data that has been deduplicated, cast to the correct types, null-handled, and standardised (consistent column names across sources, consistent enumerations). Silver represents one clean record per business entity version: for Salesforce, Silver would be a deduplicated view of current and historical Opportunity records with consistent status naming. Silver tables may be joined and conformed, but they are not aggregated; they keep record-level grain.

**Gold (aggregated and business-ready):** Silver data transformed into the analytical models used for reporting and BI — fact tables, dimension tables, pre-aggregated summaries, metric definitions. Gold is what BI tools and analysts query.

The medallion pattern is well-suited for lakehouse environments because it maps naturally to how Spark pipelines process data: Bronze ingest → Silver clean and conform → Gold aggregate and model. It is also useful conceptually as an organisational principle — every data team member knows what layer a table belongs to and what quality to expect from it.
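The Bronze → Silver → Gold flow can be sketched in a few lines of plain Python. This is a minimal illustration rather than a Spark pipeline: the sample records, the field names, the `STAGE_MAP` lookup, and the choice to treat a missing amount as `0.0` are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime

# Bronze: raw records exactly as ingested (append-only, duplicates included).
# All field names and values are hypothetical sample data.
bronze = [
    {"id": "006A", "stage": "closed won", "amount": "1200.50", "loaded_at": "2024-01-01T10:00:00"},
    {"id": "006A", "stage": "closed won", "amount": "1200.50", "loaded_at": "2024-01-01T10:00:00"},
    {"id": "006B", "stage": "Closed-Won", "amount": None, "loaded_at": "2024-01-02T09:30:00"},
    {"id": "006C", "stage": "PROSPECTING", "amount": "300", "loaded_at": "2024-01-02T11:15:00"},
]

# Hypothetical mapping from inconsistent source values to one enumeration.
STAGE_MAP = {
    "closed won": "CLOSED_WON",
    "closed-won": "CLOSED_WON",
    "prospecting": "PROSPECTING",
}

def to_silver(rows):
    """Deduplicate, cast types, handle nulls, and standardise enumerations."""
    seen = set()
    silver_rows = []
    for row in rows:
        key = (row["id"], row["loaded_at"])  # one record per entity version
        if key in seen:
            continue
        seen.add(key)
        silver_rows.append({
            "opportunity_id": row["id"],
            "stage": STAGE_MAP[row["stage"].lower()],
            # Assumption: a missing amount is treated as 0.0.
            "amount": float(row["amount"]) if row["amount"] is not None else 0.0,
            "loaded_at": datetime.fromisoformat(row["loaded_at"]),
        })
    return silver_rows

def to_gold(silver_rows):
    """Aggregate Silver into a business-ready summary: total amount per stage."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["stage"]] += row["amount"]
    return dict(totals)

silver = to_silver(bronze)   # Bronze itself is never modified
gold = to_gold(silver)
```

Because Bronze is left untouched, a bug in `to_silver` or `to_gold` can be fixed and both layers rebuilt from the same raw input.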

**In dbt projects**, medallion maps to: staging (Bronze-adjacent, minimally transformed source tables) → intermediate (Silver-adjacent, conformed and joined) → marts (Gold, analytical models for consumption).

The Lambda Architecture

Lambda architecture maintains two parallel processing paths for the same data:

**Batch layer:** Processes all historical data on a schedule (daily, hourly). Produces accurate but latent results. The batch layer is the authoritative source for historical analysis — it has complete data, correct data, and has been reprocessed as often as necessary.

**Speed layer:** Processes only the most recent data (since the last batch run) in near-real-time. Produces approximate or partial results quickly. The speed layer fills the latency gap between batch runs.

**Serving layer:** Merges results from both layers. Queries combine the full historical accuracy of the batch layer with the freshness of the speed layer.
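As a rough sketch of how the three layers fit together, the example below merges a precomputed batch view with a speed view built from events that arrived after the last batch run. The data, the `BATCH_CUTOFF` value, and the function names are hypothetical; a real serving layer would typically be a database or query engine rather than in-memory dictionaries.

```python
from datetime import datetime

# Hypothetical batch view: event counts per day, complete up to the cutoff.
BATCH_CUTOFF = datetime(2024, 1, 3)
batch_view = {"2024-01-01": 120, "2024-01-02": 98}

# Hypothetical raw events seen by the speed layer.
recent_events = [
    {"ts": datetime(2024, 1, 3, 0, 5)},
    {"ts": datetime(2024, 1, 3, 0, 9)},
    {"ts": datetime(2024, 1, 2, 23, 59)},  # already covered by the batch view
]

def speed_view(events, cutoff):
    """Count only events at or after the last batch cutoff."""
    counts = {}
    for event in events:
        if event["ts"] >= cutoff:
            day = event["ts"].date().isoformat()
            counts[day] = counts.get(day, 0) + 1
    return counts

def serve(batch, speed):
    """Serving layer: batch results plus the fresh increment from the speed layer."""
    merged = dict(batch)
    for day, count in speed.items():
        merged[day] = merged.get(day, 0) + count
    return merged

result = serve(batch_view, speed_view(recent_events, BATCH_CUTOFF))
```

The cutoff filter is what prevents double counting: once the next batch run covers 3 January, the speed view for that day is discarded.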

Lambda was influential in the early 2010s, when batch and streaming were architecturally distinct: batch systems ran Hadoop/MapReduce, while streaming systems ran Storm or similar early stream processors. Modern cloud warehouses (Snowflake, BigQuery) can handle near-real-time ingestion without a separate streaming system, and Spark Structured Streaming unifies batch and streaming in a single API, so the original motivation for two separate stacks is weaker today.

**When Lambda is still appropriate:** When batch and real-time systems genuinely have different compute requirements — for example, daily full recomputation of a complex ML feature that must also be available within 30 seconds of the latest event. The batch layer handles the complete history with full correctness; the speed layer handles the latest minute of events with a simplified model. The serving layer presents a merged view.

**The main criticism of Lambda:** Maintaining two separate codebases for the same business logic is expensive. When the batch and speed layers are implemented differently, they frequently diverge — producing different results for the same time period. This is the "Lambda hell" that Kappa architecture attempts to solve.

The Kappa Architecture

Kappa architecture, proposed by Jay Kreps (Kafka co-creator), simplifies Lambda by eliminating the batch layer. Everything is processed as a stream. Historical reprocessing is done by replaying the event log from the beginning — the same streaming pipeline handles both real-time processing and historical reprocessing.

The premise: if your event log (Kafka) retains enough history, and your stream processing engine (Flink, Spark Structured Streaming) can replay that log efficiently, you do not need a separate batch layer. The streaming pipeline is the single source of truth.

Kappa's requirements:

- Long event log retention (Kafka configured for extended retention — days to indefinitely)

- Efficient log replay (your stream processor can replay from offset 0 without exceeding time or cost constraints)

- Stateless or efficiently manageable stateful processing (stream processing state stores can be rebuilt from the log)
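The requirements above can be illustrated with a small sketch in which a Python list stands in for a Kafka topic partition (list index as offset). The `process` and `run` functions and the sample events are hypothetical; the point is only that one piece of business logic serves both live processing and a full replay from offset 0.

```python
# In-memory stand-in for a retained event log: list index = offset.
event_log = [
    {"user": "a", "amount": 10},
    {"user": "b", "amount": 5},
    {"user": "a", "amount": 7},
]

def process(state, event):
    """The single implementation of the business logic (running total per user)."""
    state[event["user"]] = state.get(event["user"], 0) + event["amount"]
    return state

def run(log, from_offset=0, state=None):
    """Apply process() to every event from from_offset; return state and next offset."""
    state = {} if state is None else state
    for offset in range(from_offset, len(log)):
        process(state, log[offset])
    return state, len(log)

# Live processing: consume what is there, then pick up a newly arrived event.
live_state, next_offset = run(event_log)
event_log.append({"user": "b", "amount": 1})
live_state, next_offset = run(event_log, from_offset=next_offset, state=live_state)

# Reprocessing: rebuild the state store by replaying the log from offset 0.
rebuilt_state, _ = run(event_log, from_offset=0)
```

If the logic in `process` changes, the replay produces the corrected state without a second batch implementation, provided the log still holds the full history.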

**When Kappa works well:** When your data is fundamentally event-driven and arrives via Kafka; when reprocessing latency (time to replay and recompute from the beginning) is acceptable; and when the business logic is complex enough that maintaining two implementations (batch and streaming) is genuinely expensive.

**Kappa's limitations:** Full log replay for petabyte-scale history is expensive and slow. For data that did not originate as events (database snapshots, ERP exports, CSV uploads), fitting it into a Kappa streaming model is awkward. Most enterprise data environments have a mix of event-driven and batch-oriented sources that Kappa does not handle uniformly.

**The practical reality:** Most data platforms use neither pure Lambda nor pure Kappa. They use a medallion architecture for structured analytical data and add a streaming ingestion path (Kafka → warehouse) for event data that requires near-real-time freshness. The streaming and batch paths converge in the same warehouse tables rather than being served separately.

The Data Mesh

The data mesh is an organisational and architectural pattern, not a technical system. It addresses the centralised data team bottleneck that occurs when all data pipelines and analytical models are owned by a single platform team.

Core principles:

**Domain ownership:** Data is owned and published by the business domain that produces it, not a central data team. The sales domain owns and publishes its Salesforce data product. The finance domain owns and publishes its GL data product. Each domain team is responsible for the quality, freshness, and SLA of its data products.

**Data as a product:** Data published by domains is treated as a product, with an interface (schema), a contract (documented columns, freshness SLA, quality guarantees), and an owner (a named person or team accountable for it).

**Self-serve data infrastructure platform:** A central platform team provides the infrastructure (storage, compute, cataloguing, governance) that domain teams use to publish their data products. The platform team does not own the data; domain teams do.

**Federated computational governance:** Governance policies (privacy, security, quality standards) are defined centrally but applied by domain teams, and they are enforced through automated compliance checks rather than manual review.
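One way to picture "data as a product" together with automated governance is a contract object that a check can be run against. The sketch below is illustrative only: `DataContract`, `check_contract`, the product name, and the owner are hypothetical and do not correspond to any particular data mesh tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass(frozen=True)
class DataContract:
    """Hypothetical contract for one data product."""
    name: str                  # e.g. "sales.opportunities"
    owner: str                 # accountable person or team
    columns: dict              # documented column name -> expected Python type
    max_staleness: timedelta   # freshness SLA

def check_contract(contract, rows, last_updated, now):
    """Automated compliance check: return a list of contract violations."""
    violations = []
    if now - last_updated > contract.max_staleness:
        violations.append("freshness SLA breached")
    for i, row in enumerate(rows):
        missing = set(contract.columns) - set(row)
        if missing:
            violations.append(f"row {i}: missing columns {sorted(missing)}")
            continue
        for column, expected in contract.columns.items():
            if not isinstance(row[column], expected):
                violations.append(f"row {i}: {column} is not {expected.__name__}")
    return violations

contract = DataContract(
    name="sales.opportunities",
    owner="sales-data-team",
    columns={"opportunity_id": str, "amount": float},
    max_staleness=timedelta(hours=24),
)
now = datetime(2024, 1, 2, 12, 0)

ok = check_contract(
    contract, [{"opportunity_id": "006A", "amount": 1200.5}],
    last_updated=datetime(2024, 1, 2, 6, 0), now=now,
)
bad = check_contract(
    contract, [{"opportunity_id": "006B", "amount": "n/a"}],
    last_updated=datetime(2023, 12, 30), now=now,
)
```

In this framing the central platform team would own `check_contract`, while each domain team owns its own `DataContract` and the data behind it.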

**When data mesh is appropriate:** At organisational scale where a central data team cannot keep up with the analytical demands of many domains. Typically this occurs when an organisation has 10+ data producers (domains) and the central bottleneck is visibly limiting business value.

**Data mesh is not appropriate when:** The organisation is small enough for a central team to serve all analytical needs; when domain teams lack the data engineering capability to own and publish data products; or when the domains are not sufficiently independent to manage their own data without central coordination.

Choosing the Right Pattern

**Medallion:** Use for any lakehouse environment (Databricks, Delta Lake, Iceberg). Also use as an organisational principle for dbt projects regardless of storage layer. Appropriate for most organisations.

**Lambda:** Use when you genuinely need both full historical reprocessing accuracy and sub-minute freshness, and the two requirements impose different compute architectures. Rare in practice. Avoid if the same code can serve both needs.

**Kappa:** Use when data is primarily event-driven, arrives via Kafka, and log replay is feasible for your data volumes. Common in ad tech and IoT; less common in enterprise analytics with mixed batch and event sources.

**Data mesh:** Use as an organisational evolution for large organisations where a central data team bottleneck is demonstrably limiting business value. This is not a technical decision; it is an operating model decision.

For most organisations building or modernising a cloud data platform, medallion architecture with continuous warehouse ingestion for event data is the practical starting point. Lambda and Kappa are specialist patterns for specific requirements. Data mesh is an organisational evolution, not an initial architecture.
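The criteria above can be condensed into a small checklist function. This is a simplification for illustration, not a substitute for a proper architecture review: the parameter names are hypothetical, and the only threshold used (ten or more domains) is the rule of thumb given in the data mesh section.

```python
def candidate_patterns(
    *,
    lakehouse_or_dbt: bool,
    needs_full_history_and_sub_minute: bool,
    different_compute_per_path: bool,
    primarily_kafka_events: bool,
    replay_feasible: bool,
    domain_count: int,
    central_team_bottleneck: bool,
):
    """Return the patterns whose stated criteria are met (medallion as the default)."""
    patterns = []
    if lakehouse_or_dbt:
        patterns.append("medallion")
    if needs_full_history_and_sub_minute and different_compute_per_path:
        patterns.append("lambda")
    if primarily_kafka_events and replay_feasible:
        patterns.append("kappa")
    if domain_count >= 10 and central_team_bottleneck:
        patterns.append("data mesh")
    # Medallion is described as the practical starting point for most organisations.
    return patterns or ["medallion"]

typical = candidate_patterns(
    lakehouse_or_dbt=True,
    needs_full_history_and_sub_minute=False,
    different_compute_per_path=False,
    primarily_kafka_events=False,
    replay_feasible=False,
    domain_count=4,
    central_team_bottleneck=False,
)
```

The patterns are not mutually exclusive: a large organisation can run a medallion layout inside each domain of a data mesh.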

