BlogData Engineering

Medallion Architecture: Bronze, Silver, Gold Layers Explained

James Okafor
James Okafor
Data & Cloud Engineer
·July 7, 202610 min read

The medallion architecture organises data lakehouse layers by quality and transformation stage — raw in bronze, cleaned in silver, business-ready in gold. This guide covers when to use it, how to implement it, and the common mistakes that break the pattern.

The medallion architecture is a data design pattern that organises data in a lakehouse into three sequential layers — bronze, silver, and gold — each representing a progressively higher level of data quality and transformation. Databricks popularised the name, but the underlying principle (raw → cleaned → curated) predates it significantly. This guide explains the layers, when to use this architecture, how to implement it well, and the mistakes that cause it to fail.

The three layers

**Bronze layer** — raw ingestion. This layer contains data exactly as it arrived from source systems. No transformations. No schema enforcement. No deduplication. The only modification allowed is adding ingestion metadata: the timestamp the record arrived, the source system identifier, and optionally a batch or file ID for lineage. Bronze is an immutable audit trail — it should never be modified after writes. If a source system sends malformed data, that malformed data is in bronze. The bronze layer is the answer to: "What did the source system send us, and when?"

**Silver layer** — cleaned and conformed. This layer applies the transformations that make data usable: deduplication, type casting, null handling, schema enforcement, and cross-source harmonisation. Silver is where you resolve that one source system calls it "customer_id" and another calls it "cust_id" — both are renamed to a common identifier. Silver does not apply business logic. It does not aggregate. It produces one clean record per real-world entity or event. The silver layer is the answer to: "What does the data actually say, after fixing source quality issues?"

**Gold layer** — business-ready. This layer contains the aggregated, business-logic-applied models that analytics and reporting consume. Fact tables, dimension tables, aggregate tables, denormalised wide tables for specific reporting use cases. Gold is where Kimball-style dimensional modeling lives. Gold models are purpose-built: a gold table optimised for the CFO's revenue dashboard may be entirely different from the gold table optimised for the operations team's order tracking dashboard. The gold layer is the answer to: "What business questions does this data answer, in a form the business can use?"

Why the layers matter

The primary benefit of the medallion architecture is separation of concerns. Without defined layers, pipelines tend to conflate ingestion, cleaning, and business logic into a single transformation — which means a change to business logic requires touching the same code that handles ingestion, and a source schema change potentially breaks downstream reports.

With defined layers:

- Source system changes affect only the bronze-to-silver transformation. Silver and gold are insulated.

- Business logic changes affect only silver-to-gold transformations. Bronze and silver are unaffected.

- Data quality issues can be diagnosed by querying bronze (what was received) vs silver (what was cleaned) vs gold (what was reported).

- Reprocessing is safe — since bronze is immutable and complete, you can always rerun silver and gold from bronze if logic changes.

Implementation on Delta Lake

The medallion architecture is most commonly implemented on Delta Lake (Databricks, or open-source Delta on S3/ADLS/GCS), though it is applicable to any lakehouse storage with ACID guarantees.

The typical Delta implementation:

- Bronze: APPEND-only writes using Delta merge with identity-based deduplication only on exact duplicates from re-delivered batches. Schema evolution enabled. All columns from the source preserved.

- Silver: MERGE operations using WHEN MATCHED / WHEN NOT MATCHED to upsert cleaned records. Type-safe schema. Constraints enforced. Partition by a date column (event date or processing date) for query performance.

- Gold: Most gold tables are either full refreshes (OVERWRITE on each run) or incremental appends. Full refresh is simpler and appropriate for gold tables built from silver tables that are themselves slowly growing. Incremental is necessary for large gold tables where a full refresh would be too slow.

With dbt on Delta, bronze maps to source tables (raw), silver maps to staging and intermediate models, and gold maps to mart models. The layer separation is enforced by dbt's model organisation convention.

When not to use the medallion architecture

The medallion architecture adds complexity. It is appropriate when:

- Multiple source systems need to be harmonised (silver is where harmonisation happens)

- Reprocessing capability matters (bronze enables full reprocessing from source)

- Data quality debugging is a recurring need (comparing bronze vs silver to diagnose source issues)

- Multiple downstream use cases with different granularity requirements (separate gold tables per use case)

It is overkill when:

- You have a single source system and a single reporting use case. A two-layer approach (raw + curated) is simpler.

- You are building a small data pipeline for a startup. The overhead of three layers exceeds the benefit at low data volume and team size.

- Your source data is already clean and well-structured. If bronze and silver would be identical, you do not need both.

Common implementation mistakes

**Applying business logic in silver**: The silver layer is for quality and conformance, not business decisions. "Revenue" is calculated in gold. "Gross margin" is calculated in gold. If silver starts containing business metrics, the separation of concerns breaks and debugging becomes harder. Keep silver as a clean mirror of source data, without business logic applied.

**Making bronze mutable**: Bronze loses its value as an audit trail the moment records are modified or deleted. If you delete a record from bronze because it was "wrong", you can no longer answer the question "what did the source system send us?" Use delete flags or soft deletes in silver instead.

**One gold layer for everything**: Gold tables should be purpose-built for their consumers. A single denormalised gold table that tries to serve both the CFO's revenue reporting and the operations team's order tracking will be slow and complex for both use cases. Build separate gold models per consumer or use case.

**Skipping schema evolution handling**: Source schemas change. Bronze must be able to receive schema changes without pipeline failure — Delta Lake schema evolution (MERGE SCHEMA option) handles this for Delta-based implementations. If bronze rejects records because of source schema changes, you lose data.

**No data contracts on gold**: Gold tables are the contract between the data team and the business. They should have documented schemas, defined SLAs, and data quality tests. Without these, consumers cannot trust the gold layer and bypass it — defeating the architecture's purpose.

Medallion architecture with Databricks

Databricks is the canonical implementation environment for the medallion architecture. Unity Catalog provides the governance layer — bronze, silver, and gold maps cleanly to Unity Catalog schemas within a catalog. Data lineage is tracked automatically across Delta operations. Delta Live Tables (DLT) can automate the bronze-to-silver-to-gold pipeline orchestration with built-in data quality constraints.

For teams using Databricks without DLT, the medallion architecture is typically implemented with Apache Spark notebooks or dbt-spark, orchestrated by Apache Airflow or Databricks Workflows.

Medallion architecture without Databricks

The medallion architecture is not Databricks-specific. It applies equally to:

- Snowflake: bronze as RAW schema, silver as STAGING schema, gold as MART schema — Delta Lake equivalent is Snowflake's managed storage with streams for incremental processing.

- BigQuery: bronze as dataset with raw tables, silver as staging dataset, gold as analytics dataset — scheduled queries or dbt models implement the transformations.

- Any cloud data warehouse with dbt: dbt's layer conventions (sources → staging → intermediate → marts) map directly to bronze → silver → gold.

The storage format differs; the design principle is identical.

For the data engineering foundations this pattern builds on, see data pipeline best practices and delta lake guide. For the dimensional modeling that lives in the gold layer, see kimball vs inmon. Our data architecture consulting practice designs and implements medallion architectures for organisations migrating from monolithic data warehouses to lakehouse patterns — book a free architecture review.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →