
Data Architecture Patterns: The Designs That Solve Real Enterprise Problems

Austin Duncan
Managing Director & Principal Data Architect
June 12, 2026 · 12 min read

Enterprise data architectures are not invented from scratch — they are assembled from a set of proven patterns. Lambda, Kappa, medallion, hub-and-spoke, data mesh, event sourcing: here is what each pattern is, what problem it solves, and when to use it.

The quick answer

Data architecture is not invented from scratch for each organisation. It is assembled from a set of well-understood patterns, each of which solves a specific class of problem. Lambda architecture handles the combination of batch and real-time processing. Medallion architecture organises data into quality tiers. Hub-and-spoke integrates many source systems through a central hub. Data mesh distributes ownership to domain teams. Event sourcing preserves full event history rather than current state.

Understanding these patterns — what problem each solves, what trade-offs it carries, and when it is appropriate — is the foundation of data architecture practice. This guide covers the patterns that appear most frequently in enterprise data environments.

Medallion architecture (Bronze/Silver/Gold)

The medallion architecture organises data into three quality tiers, each representing a stage of processing from raw to consumption-ready.

**Bronze** is the raw layer: data as it arrived from source systems, unmodified. The source schema is preserved, duplicates may exist, and data quality is as-received. Bronze is the immutable record of what was ingested. If anything in the Silver or Gold layers turns out to be wrong, Bronze is what you rebuild those layers from.

**Silver** is the cleaned and standardised layer. Schema is normalised to the platform's conventions (renamed columns, consistent data types, trimmed strings), duplicates are removed, and basic validation checks are run. Transformations at this stage are business-neutral: data type casting, null handling, structural standardisation. Silver represents "the data as it should be": cleaned and integrated, but not yet shaped for any specific analytical purpose.

**Gold** is the consumption-ready layer — dimensional models, aggregate tables, and feature sets designed for specific analytical use cases. Star schema fact and dimension tables, pre-aggregated summaries, machine learning feature tables. Gold is what BI tools and analysts query.

The medallion architecture is now the dominant pattern in modern cloud data platforms built on Databricks (Delta Lake) and in dbt-based transformation stacks. It maps closely onto dbt's staging-intermediate-mart layer convention: staging and intermediate models play roughly the role of Silver, and mart models are Gold.

The key architectural principle: each layer writes to the next but never backward. Silver reads from Bronze; Gold reads from Silver. If Gold data is wrong, you fix the transformation between Silver and Gold — not the Bronze data. This unidirectional flow makes the architecture auditable and correctable.
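The three tiers and the one-way flow between them can be sketched in a few lines of plain Python. This is a minimal illustration, not a platform implementation: the order records, column names, and the rule that an order without an amount fails validation are all assumptions made up for the example.

```python
from collections import defaultdict

# Bronze: rows exactly as received from a hypothetical order system.
# Source column names are kept, values are strings, duplicates are included.
BRONZE_ORDERS = [
    {"OrderID": "1001", "Cust Name": "  Acme Ltd ", "Amount": "250.00"},
    {"OrderID": "1001", "Cust Name": "  Acme Ltd ", "Amount": "250.00"},  # duplicate delivery
    {"OrderID": "1002", "Cust Name": "Globex", "Amount": "99.50"},
    {"OrderID": "1003", "Cust Name": "Acme Ltd", "Amount": None},  # fails validation
]


def build_silver(bronze_rows):
    """Rename columns, cast types, trim strings, validate, and deduplicate."""
    silver, seen_ids = [], set()
    for row in bronze_rows:
        if row["Amount"] is None:  # assumed rule: an order must have an amount
            continue
        order_id = int(row["OrderID"])
        if order_id in seen_ids:  # keep the first copy of each order
            continue
        seen_ids.add(order_id)
        silver.append({
            "order_id": order_id,
            "customer_name": row["Cust Name"].strip(),
            "amount": float(row["Amount"]),
        })
    return silver


def build_gold(silver_rows):
    """Aggregate Silver into a consumption-ready table: revenue per customer."""
    revenue = defaultdict(float)
    for row in silver_rows:
        revenue[row["customer_name"]] += row["amount"]
    return dict(revenue)


silver_orders = build_silver(BRONZE_ORDERS)  # reads Bronze only
gold_revenue = build_gold(silver_orders)     # reads Silver only
```

Neither function mutates its input, so Bronze stays intact and both downstream layers can be rebuilt from it at any time by rerunning the two steps.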

For the full context on how the medallion architecture is implemented in data lakehouses, see what is a data lakehouse.

Lambda architecture

Lambda architecture addresses the combination of batch and real-time processing by maintaining two parallel processing paths: a batch layer (high-latency, complete, accurate) and a speed layer (low-latency, approximate, real-time), merged by a serving layer.

**Batch layer**: processes the full historical dataset on a schedule (hourly, daily). Accurate and complete. High latency: data processed in the batch layer is only available after the batch runs.

**Speed layer**: processes a stream of recent events in near-real-time (Kafka + Flink or Spark Streaming). Low latency: data is available within seconds or minutes. The speed layer only handles data since the last batch run — it fills the "gap" that the batch layer has not yet processed.

**Serving layer**: merges batch layer results (covering historical data) with speed layer results (covering recent data) to produce a unified view for queries.
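As a rough sketch of how the three layers fit together, the example below counts page views. The batch view, the cutoff timestamp, and the event data are hypothetical; a real speed layer would run in a stream processor rather than over a Python list.

```python
from datetime import datetime

# Batch layer output: complete counts up to the last batch run (hypothetical values).
BATCH_CUTOFF = datetime(2026, 6, 12, 0, 0)
BATCH_VIEW = {"page_a": 1000, "page_b": 400}

# Recent events seen by the speed layer as (timestamp, page) pairs.
RECENT_EVENTS = [
    (datetime(2026, 6, 11, 23, 59), "page_a"),  # already counted by the batch run
    (datetime(2026, 6, 12, 0, 5), "page_a"),
    (datetime(2026, 6, 12, 0, 7), "page_c"),
]


def build_speed_view(events, cutoff):
    """Count only the events the batch layer has not processed yet."""
    counts = {}
    for timestamp, page in events:
        if timestamp >= cutoff:
            counts[page] = counts.get(page, 0) + 1
    return counts


def serve(batch_view, speed_view):
    """Serving layer: merge historical and recent counts into one answer."""
    merged = dict(batch_view)
    for page, count in speed_view.items():
        merged[page] = merged.get(page, 0) + count
    return merged


speed_view = build_speed_view(RECENT_EVENTS, BATCH_CUTOFF)
unified_view = serve(BATCH_VIEW, speed_view)
```

The cutoff is the piece that keeps the two paths from double counting: once the next batch run completes, the cutoff moves forward and the speed view for the older window is discarded.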

The Lambda architecture allows serving queries that require both historical accuracy (batch) and current freshness (speed) — use cases like fraud detection, real-time pricing, live operational dashboards, and customer-facing analytics where users expect data to reflect events from the last few minutes.

The limitation of Lambda: maintaining two parallel processing pipelines for the same data is operationally expensive. Batch and speed logic must be kept in sync, bugs in one path may not manifest in the other, and the team has roughly twice as much pipeline code to build, test, and monitor as in a pure batch architecture. The Kappa architecture addresses this.

For the full real-time vs batch decision, see real-time data architecture.

Kappa architecture

Kappa architecture (Jay Kreps, 2014) is a simplification of Lambda: instead of maintaining separate batch and speed layers, treat everything as a stream. Historical reprocessing is handled by replaying the event log from the beginning, not by a separate batch system.
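The idea of one processing path plus replay can be shown with an in-memory list standing in for the event log. This is a simplified sketch: the account events and both aggregation functions are invented for illustration, and a real deployment would read from a retained Kafka topic rather than a Python list.

```python
# Append-only event log as (account, amount) pairs, standing in for a retained topic.
EVENT_LOG = [("alice", 30), ("bob", 20), ("alice", -10), ("bob", 5)]


def sum_all(amount):
    """Version 1 of the job logic: net balance."""
    return amount


def sum_credits_only(amount):
    """Version 2 of the job logic: ignore debits."""
    return max(amount, 0)


def run_stream_job(log, logic, from_offset=0, state=None):
    """The single processing path: fold events into per-account state."""
    state = dict(state or {})
    for account, amount in log[from_offset:]:
        state[account] = state.get(account, 0) + logic(amount)
    return state


# Normal operation: resume from a checkpoint taken after the first two events.
checkpoint = run_stream_job(EVENT_LOG[:2], sum_all)
live_state = run_stream_job(EVENT_LOG, sum_all, from_offset=2, state=checkpoint)

# Reprocessing: replay the same log from offset 0, with old or new logic.
replayed_state = run_stream_job(EVENT_LOG, sum_all)
reprocessed_state = run_stream_job(EVENT_LOG, sum_credits_only)
```

Live processing and historical reprocessing use the same function here; the only thing that changes is the starting offset and, when the logic is revised, the function passed in.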

The Kappa architecture works when three conditions hold: your streaming platform (Kafka) retains the full event history, your stream processing framework (Flink, Spark Streaming) can reprocess that history efficiently, and your analytical requirements can be met by streaming computations rather than by the full-dataset scans that batch processing handles efficiently.

Kappa reduces operational complexity versus Lambda (one processing path rather than two) at the cost of higher requirements on the streaming infrastructure. For organisations where real-time processing is the primary requirement and historical reprocessing is infrequent, Kappa is often the right choice. For organisations where batch analytics are as important as real-time, Lambda (or a hybrid) may be more practical.

Hub-and-spoke (data warehouse/data hub)

Hub-and-spoke is the traditional enterprise data integration pattern: a central data hub (data warehouse or enterprise data warehouse) receives data from multiple source systems (spokes) and provides integrated data to multiple consumers (BI tools, data marts, operational systems).

The central hub performs integration — standardising schemas across sources, applying business rules, resolving entities (the same customer in CRM and ERP is mapped to a single canonical customer record). Consumers query the hub rather than individual source systems.
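Entity resolution in the hub can be illustrated with two small source extracts. The sketch below assumes, purely for the example, that a normalised email address is a reliable match key and that the CRM name takes precedence; real matching rules are usually more involved. All record values and field names are hypothetical.

```python
# Hypothetical extracts from two spokes with different schemas.
CRM_CUSTOMERS = [
    {"crm_id": "C-17", "email": "Jo.Smith@Example.com", "name": "Jo Smith"},
    {"crm_id": "C-18", "email": "ann@example.com", "name": "Ann Lee"},
]
ERP_CUSTOMERS = [
    {"erp_no": 9001, "email": " jo.smith@example.com", "billing_name": "J. Smith"},
    {"erp_no": 9002, "email": "raj@example.com", "billing_name": "Raj Patel"},
]


def match_key(email):
    """Assumed matching rule: same trimmed, lower-cased email means same customer."""
    return email.strip().lower()


def empty_record():
    return {"name": None, "crm_id": None, "erp_no": None}


def build_customer_hub(crm_rows, erp_rows):
    """Map source records onto one canonical customer record per match key."""
    hub = {}
    for row in crm_rows:
        record = hub.setdefault(match_key(row["email"]), empty_record())
        record["crm_id"] = row["crm_id"]
        record["name"] = row["name"]  # assumed rule: the CRM name wins
    for row in erp_rows:
        record = hub.setdefault(match_key(row["email"]), empty_record())
        record["erp_no"] = row["erp_no"]
        if record["name"] is None:  # fall back to the ERP name
            record["name"] = row["billing_name"]
    return hub


customer_hub = build_customer_hub(CRM_CUSTOMERS, ERP_CUSTOMERS)
```

Consumers that need both the CRM and ERP identifiers for a customer read them from `customer_hub` instead of re-implementing the match themselves.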

The benefit: integration logic lives once (in the hub) rather than being duplicated across every consumer that needs to join data from multiple sources. The limitation: the hub becomes a bottleneck — every new source integration requires hub development, and every schema change in a source system potentially affects hub pipelines.

Modern cloud data warehouses (Snowflake, BigQuery, Databricks) are the typical hub in contemporary hub-and-spoke architectures. The modern data stack (Fivetran for ingestion + dbt for transformation) operationalises hub-and-spoke in a scalable way.

Hub-and-spoke works well for organisations with a manageable number of source systems and centralised data engineering capacity. It becomes strained when the number of source systems is very large, when source system schema changes are frequent, or when different domains have divergent data requirements that do not benefit from central integration.

For the hub-and-spoke pattern in a specific cloud architecture context, see how to build a modern data stack.

Data mesh

Data mesh (Zhamak Dehghani, 2019) inverts the hub-and-spoke model. Rather than centralising data integration in a platform team, data mesh distributes data ownership to domain teams: the team that owns the business domain (sales, marketing, finance, logistics) owns the data that domain produces and is responsible for making it available as a "data product" to other teams.

The four principles: domain ownership (data is owned by the domain, not by a central platform), data as product (domains treat their data outputs as products with quality guarantees and documentation), self-serve platform (a central platform team provides the infrastructure capabilities that domain teams use — storage, governance, access control), federated governance (domain teams apply governance standards defined centrally but enforced locally).
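One way to make "data as product" and "federated governance" concrete is a small product descriptor that each domain fills in, checked against rules defined centrally. The sketch below is only an illustration: the `DataProduct` fields and the three governance rules are assumptions, not part of any data mesh standard.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """Hypothetical descriptor a domain team publishes alongside its data."""
    name: str
    owner_domain: str
    description: str
    schema: dict  # column name -> type name
    pii_columns: set = field(default_factory=set)
    freshness_sla_hours: int = 24


def governance_violations(product):
    """Centrally defined rules that each domain runs locally before publishing."""
    problems = []
    if not product.owner_domain.strip():
        problems.append("missing owner domain")
    if not product.description.strip():
        problems.append("missing description")
    for column in sorted(product.pii_columns):
        if column not in product.schema:
            problems.append(f"PII column not in schema: {column}")
    return problems


orders_product = DataProduct(
    name="sales.orders_daily",
    owner_domain="sales",
    description="One row per order, refreshed daily.",
    schema={"order_id": "int", "customer_email": "str", "amount": "decimal"},
    pii_columns={"customer_email"},
)
draft_product = DataProduct(
    name="marketing.leads",
    owner_domain="marketing",
    description="",
    schema={"lead_id": "int"},
    pii_columns={"phone"},
)

orders_issues = governance_violations(orders_product)
draft_issues = governance_violations(draft_product)
```

The rules live in one shared function, but each domain runs them against its own product, which is the division of responsibility that federated governance describes.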

Data mesh solves the bottleneck of the centralised hub-and-spoke model in very large organisations: when the central data engineering team cannot scale to handle the integration requirements of every domain, distributing that responsibility to domains enables parallelism.

Data mesh is appropriate for large organisations with many distinct data domains, significant engineering capacity in those domains, and a history of centralised data team bottlenecks. It is not appropriate for most mid-market organisations — the organisational complexity of data mesh exceeds the problem it solves when the organisation is not large enough to experience the centralised bottleneck. For a full assessment, see data mesh architecture.

Event sourcing

Event sourcing is an architectural pattern where the state of a system is derived from a log of events rather than from a current-state record. Instead of storing "current customer status = Active," you store "CustomerCreated at timestamp X, CustomerActivated at timestamp Y" — and derive current status by replaying events.
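Deriving state by replay can be sketched as follows. The event names `CustomerCreated` and `CustomerActivated` come from the example above; the `Pending` and `Inactive` statuses, the `CustomerDeactivated` event, and the timestamps are assumptions added for illustration.

```python
# Status a customer has after each event type.
STATUS_AFTER_EVENT = {
    "CustomerCreated": "Pending",        # assumed initial status
    "CustomerActivated": "Active",
    "CustomerDeactivated": "Inactive",   # assumed extra event type
}

# Append-only event log; ISO 8601 timestamps sort correctly as strings.
EVENT_LOG = [
    {"customer_id": 1, "type": "CustomerCreated", "at": "2026-01-01T09:00:00"},
    {"customer_id": 2, "type": "CustomerCreated", "at": "2026-01-01T09:30:00"},
    {"customer_id": 1, "type": "CustomerActivated", "at": "2026-01-02T10:00:00"},
]


def status_as_of(events, as_of=None):
    """Replay events in time order and return each customer's status."""
    status = {}
    for event in sorted(events, key=lambda e: e["at"]):
        if as_of is not None and event["at"] > as_of:
            break  # stop replaying at the requested point in time
        status[event["customer_id"]] = STATUS_AFTER_EVENT[event["type"]]
    return status


current_status = status_as_of(EVENT_LOG)
status_on_day_one = status_as_of(EVENT_LOG, as_of="2026-01-01T23:59:59")
```

Because the log is never overwritten, the same replay answers both "what is the status now" and "what was the status at any earlier moment", which a current-state table cannot do.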

Event sourcing is primarily an application architecture pattern rather than a data architecture pattern, but it has direct implications for data architecture: event-sourced systems are naturally stream-first, produce complete event history (not just current state), and often use event streaming platforms (Kafka) as their persistence layer.

For data teams, event sourcing means: source system data arrives as an event log rather than as a current-state database export. Building analytics on event-sourced data requires aggregating event sequences into current-state views (sessionisation, funnel analysis, state derivation) rather than simply querying current-state tables. This is analytically rich — full history is preserved — but more complex to model than current-state data.

Event sourcing is common in modern SaaS products and microservices architectures. Data teams working with Salesforce platform events, Segment track events, or Kafka-based microservice event streams are working with event-sourced data whether or not the source engineering team named it that way.

CQRS (Command Query Responsibility Segregation)

CQRS is an application architecture pattern that separates the read model (queries) from the write model (commands). Write operations go to a transactional store optimised for writes; read operations query a separate store optimised for reads.

In data architecture, CQRS is relevant because it produces separate read and write stores: the write store (operational database, event log) is the source for ingestion; the read store (materialised views, denormalised tables, search indices) may itself be an intermediate analytics target. Data teams need to understand which store they are ingesting from and what it represents.

CQRS is often combined with event sourcing: commands create events (event sourcing handles the write side); read models are materialised by projecting events (CQRS handles the read side).
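The split between the two sides can be sketched in a single process. In the example below the write model validates commands and appends events, and a separate projection builds a denormalised read model from those events. The class, the event names, and the "active customers per region" view are hypothetical.

```python
class CustomerWriteModel:
    """Write side: validates commands and appends events to a log."""

    def __init__(self):
        self.events = []
        self._open_customers = set()

    def register(self, customer_id, region):
        if customer_id in self._open_customers:
            raise ValueError(f"customer {customer_id} already registered")
        self._open_customers.add(customer_id)
        self.events.append(
            {"type": "CustomerRegistered", "customer_id": customer_id, "region": region}
        )

    def close(self, customer_id):
        if customer_id not in self._open_customers:
            raise ValueError(f"customer {customer_id} is not open")
        self._open_customers.remove(customer_id)
        self.events.append({"type": "CustomerClosed", "customer_id": customer_id})


def project_active_by_region(events):
    """Read side: materialise a query-friendly view by projecting events."""
    region_of, active = {}, {}
    for event in events:
        if event["type"] == "CustomerRegistered":
            region_of[event["customer_id"]] = event["region"]
            active[event["region"]] = active.get(event["region"], 0) + 1
        elif event["type"] == "CustomerClosed":
            region = region_of.pop(event["customer_id"])
            active[region] -= 1
    return active


write_model = CustomerWriteModel()
write_model.register(1, "EMEA")
write_model.register(2, "EMEA")
write_model.register(3, "APAC")
write_model.close(2)

active_by_region = project_active_by_region(write_model.events)
```

For a data team, `write_model.events` corresponds to the write store and `active_by_region` to the read store: ingesting the former gives full history, while ingesting the latter gives only one pre-shaped view of it.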

Data vault

Data vault (Dan Linstedt, 2000) is a modelling method for enterprise data warehouses that separates loading from business transformation. A data vault model consists of Hubs (core business keys such as Customer, Product, Order), Links (relationships between Hubs), and Satellites (descriptive attributes of Hubs and Links, with full historical records).
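The three table types can be sketched with in-memory structures. The example assumes hashed business keys and a change-detection hash on satellite rows, which is a common convention in Data Vault 2.0 implementations but not the only option; the table names, columns, and sample records are hypothetical.

```python
import hashlib

hub_customer = {}         # customer hash key -> business key row
sat_customer = []         # append-only history of descriptive attributes
link_customer_order = {}  # link hash key -> pair of hub hash keys


def hash_key(*business_keys):
    """Derive a deterministic key from normalised business keys."""
    normalised = "|".join(str(k).strip().upper() for k in business_keys)
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def load_customer(customer_no, attributes, load_ts, source):
    """Insert the Hub row once; append a Satellite row only when attributes change."""
    hk = hash_key(customer_no)
    hub_customer.setdefault(
        hk, {"customer_no": customer_no, "load_ts": load_ts, "record_source": source}
    )
    hash_diff = hashlib.sha256(repr(sorted(attributes.items())).encode("utf-8")).hexdigest()
    latest = next((r for r in reversed(sat_customer) if r["customer_hk"] == hk), None)
    if latest is None or latest["hash_diff"] != hash_diff:
        sat_customer.append(
            {"customer_hk": hk, "load_ts": load_ts, "hash_diff": hash_diff, **attributes}
        )
    return hk


def load_customer_order_link(customer_no, order_no, load_ts, source):
    """Record the relationship between two Hubs exactly once."""
    link_hk = hash_key(customer_no, order_no)
    link_customer_order.setdefault(link_hk, {
        "customer_hk": hash_key(customer_no),
        "order_hk": hash_key(order_no),
        "load_ts": load_ts,
        "record_source": source,
    })
    return link_hk


load_customer("C-17", {"name": "Jo Smith", "city": "Leeds"}, "2026-01-01", "crm")
load_customer("c-17 ", {"name": "Jo Smith", "city": "Leeds"}, "2026-01-02", "crm")  # no change
load_customer("C-17", {"name": "Jo Smith", "city": "York"}, "2026-01-03", "crm")    # new version
load_customer_order_link("C-17", "O-1", "2026-01-03", "erp")
load_customer_order_link("C-17", "O-1", "2026-01-04", "erp")  # already known
```

After these loads there is one Hub row, two Satellite rows (Leeds, then York), and one Link row. Adding a new source system would mean adding another Satellite keyed on the same hash, without touching the existing Hub.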

Data vault's strength: it accommodates schema changes and new source systems without restructuring existing tables. New attributes are added as new Satellites; new source systems are added without changing existing Hub structures. This makes data vault one of the most flexible warehouse modelling approaches for environments with many source systems and frequent changes.

The cost: data vault models are not analyst-friendly. Raw vault tables require significant transformation before they are queryable by analysts. A business vault layer (Business Data Vault) adds business rules on top of raw vault. Presentation views or dimensional marts add the final analyst-facing layer. This adds layers and complexity compared to Kimball dimensional modelling.

Data vault is appropriate for large enterprises with many source systems, frequent schema changes, and strict audit requirements. It is over-engineered for most mid-market organisations. For the Kimball vs Inmon comparison that contextualises data vault, see Kimball vs Inmon.

Choosing the right pattern

No single pattern is universally correct. The right architecture pattern is determined by:

- **Scale and complexity**: a 5-source hub-and-spoke works where a data mesh would be over-engineered; a data mesh becomes necessary where a centralised hub cannot scale

- **Latency requirements**: batch medallion architecture is simpler than Lambda/Kappa; Lambda/Kappa are required when sub-minute data freshness is genuinely necessary

- **Organisational structure**: hub-and-spoke works when data engineering is centralised; data mesh requires distributed engineering capability

- **Change velocity**: data vault accommodates frequent source system changes that would break a Kimball warehouse; Kimball is simpler when source schemas are stable

- **Team capability**: simpler patterns (medallion + Kimball dimensional) are executed well by small teams; complex patterns (data mesh, data vault) require larger, more specialised teams

Most mid-market organisations building a modern data platform in 2026 should: adopt the medallion architecture with Kimball dimensional modelling at the Gold layer, build on a modern cloud warehouse (Snowflake, BigQuery, or Databricks), use dbt for transformation, and implement hub-and-spoke integration with Fivetran or Airbyte for ingestion. That is the pattern combination that is best supported, most widely implemented, and easiest to hire for.

For the architectural assessment that identifies which patterns fit your specific environment, our data architecture consulting practice does this work as an engagement. If you are designing a new data platform or evaluating the patterns your current architecture uses against your growth requirements, book a free 30-minute audit.
