Data lakes promised a single repository for all organisational data — structured, semi-structured, and unstructured — that could be queried with flexible tools for flexible purposes. Most data lake implementations fell short of that promise. This guide covers why, and what modern data lake architectures do differently to avoid the data swamp problem.
The data lake concept emerged in the early 2010s as a response to the limitations of the data warehouse: warehouses required schema definition before loading, supported only structured data, and were expensive to scale at the data volumes that organisations were beginning to accumulate from web, mobile, and IoT sources. A data lake, typically built on Hadoop HDFS and later on S3, would accept any data format, at any scale, without a predefined schema. The model was schema-on-read rather than schema-on-write: structure is applied when the data is queried, not when it is loaded.
The promise was compelling. The reality was that most data lake implementations became data swamps: storage repositories where data was deposited without sufficient structure, documentation, or governance to be analytically useful. Finding data was difficult. Understanding what the data meant was harder. Trusting that it was correct was often impossible.
Modern data lake architectures address these problems not by abandoning the data lake concept but by building the governance, structure, and processing capabilities that the original data lake implementations lacked.
Why Data Swamps Happen
Data swamps share common failure patterns:
**No defined ownership**: Data was deposited in the lake by teams that owned the data in source systems. Once deposited, no one maintained it. When the schema changed in the source, the lake data became stale and inconsistent, and no one was responsible for updating it.
**No discoverability infrastructure**: Files in S3 are not discoverable by purpose. A data engineer joining the team cannot find the customer data they need without asking someone who was there when it was loaded — and that person may have left. Without a catalog or documentation layer, the lake is opaque.
**Schema-on-read without documentation creates schema-on-ignorance**: The flexibility of schema-on-read is valuable when the schema is documented. Without documentation, analysts cannot know what columns mean, what grain the data is at, or which values are valid. The flexibility becomes a liability.
**No data quality management**: Operational systems have inconsistent data quality. Loading raw data directly to the lake preserves all quality problems. Without a quality management layer, analysts encounter unexpected nulls, inconsistent values, and format variations that make the data unreliable.
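The schema-on-read pattern above is easier to judge with a small example. The sketch below is a minimal illustration, not a reference implementation: the `ColumnDoc` class, the `ORDERS_SCHEMA` list, and the sample records are all hypothetical. It shows a documented schema being applied at read time, so the raw lines stay untouched while the reader still knows what each column means.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class ColumnDoc:
    """Documentation for one column: how to parse it and what it means."""
    name: str
    parse: Callable[[Any], Any]
    description: str

# Hypothetical documented schema for an "orders" feed.
ORDERS_SCHEMA = [
    ColumnDoc("order_id", str, "Unique order identifier; grain is one row per order"),
    ColumnDoc("amount", float, "Order total in the account currency"),
    ColumnDoc("status", str, "One of: open, shipped, cancelled"),
]

def read_with_schema(raw_lines, schema):
    """Apply the schema at read time; the raw lines themselves are never changed."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for col in schema:
            value = record.get(col.name)
            # Missing fields become None instead of silently disappearing.
            row[col.name] = None if value is None else col.parse(value)
        rows.append(row)
    return rows

# Raw records as a source system might deliver them: mixed types, missing
# fields, and an undocumented extra field.
raw = [
    '{"order_id": 1001, "amount": "19.90", "status": "open"}',
    '{"order_id": "1002", "amount": 5, "extra_field": true}',
]
orders = read_with_schema(raw, ORDERS_SCHEMA)
```

Without the `description` strings and the explicit column list, a reader of the second record would have no way to tell whether the missing `status` is a defect or an expected state.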
The Zone Architecture
Modern data lake architectures use a zone structure that manages data through stages from raw to curated:
**Raw zone (bronze/landing)**: Data exactly as received from source systems. No transformations. Append-only; records are never modified or deleted. The raw zone is a complete record of what was ever received from each source. This is the recovery point — if a transformation is incorrect and data has been lost from later layers, the raw zone contains the original data for reprocessing.
**Cleansed zone (silver)**: Data from the raw zone that has been standardised: types cast, nulls handled consistently, naming conventions applied, duplicates resolved. The cleansed zone is still granular (not aggregated) but is reliable and consistently formatted. Transformations in this layer are auditable and can be re-run, because their inputs remain unchanged in the raw zone.
**Refined zone (gold/consumption)**: Business-logic transformations applied to the cleansed layer. Domain-specific datasets, aggregations, and denormalised analytical tables designed for specific consumption patterns. The refined zone is what BI tools, data science teams, and downstream applications connect to.
This three-zone (medallion) architecture is a widely adopted pattern for data lake design. It provides the recovery capability of the raw layer, the clean analytical interface of the consumption layer, and the traceability of intermediate processing.
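The flow through the three zones can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: the record layout, the `to_silver` and `to_gold` functions, and the revenue-per-country aggregate are invented for the example, and a real pipeline would run on a processing engine and write each zone to storage.

```python
from collections import defaultdict

# Raw (bronze) zone: records exactly as received, including a duplicate
# delivery and messy values. Nothing here is ever modified.
bronze = [
    {"OrderID": "1", "Amount": "10.50", "Country": " gb "},
    {"OrderID": "1", "Amount": "10.50", "Country": " gb "},  # duplicate delivery
    {"OrderID": "2", "Amount": "", "Country": "DE"},         # missing amount
    {"OrderID": "3", "Amount": "4.00", "Country": "de"},
]

def to_silver(raw_records):
    """Cleansed zone: cast types, normalise names and values, drop duplicates."""
    seen, silver = set(), []
    for rec in raw_records:
        order_id = int(rec["OrderID"])
        if order_id in seen:
            continue
        seen.add(order_id)
        silver.append({
            "order_id": order_id,
            "amount": float(rec["Amount"]) if rec["Amount"] else None,
            "country": rec["Country"].strip().upper(),
        })
    return silver

def to_gold(silver_records):
    """Refined zone: a business-level aggregate (revenue per country)."""
    revenue = defaultdict(float)
    for rec in silver_records:
        if rec["amount"] is not None:
            revenue[rec["country"]] += rec["amount"]
    return dict(revenue)

silver = to_silver(bronze)
gold = to_gold(silver)
```

Because `bronze` is left untouched, a bug in `to_silver` or `to_gold` can be fixed and both later zones rebuilt from the original records.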
Delta Lake and the Lakehouse
The original data lake stored data as raw files (Parquet, CSV, ORC) without ACID transaction support. Concurrent writes could corrupt data. There was no way to update or delete records without rewriting entire partitions. Schema enforcement required external tooling.
Delta Lake (the open-source table format originally developed by Databricks) addresses these problems by adding a transaction log on top of Parquet files in object storage. Delta tables have:
- ACID transactions: concurrent writes do not corrupt data
- Schema enforcement: writes that violate the defined schema are rejected
- MERGE/UPDATE/DELETE: row-level modifications that rewrite only the affected data files rather than entire partitions
- Time Travel: query historical state of the table
Delta Lake tables are the foundation of the "lakehouse" architecture: they combine the data lake's scale and flexibility (open file formats in low-cost object storage) with warehouse-style reliability (transactions, schema enforcement, consistent reads). The medallion zone architecture described above is most commonly implemented with Delta tables at each zone.
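To make the transaction log idea concrete, here is a toy model in plain Python. It is a simplified sketch and not Delta Lake's actual protocol or API: the `ToyTableLog` class and its methods are hypothetical, and exclusive file creation on a local disk stands in for the atomic commit that a real table format needs from its storage layer. It does show the core mechanism: each commit is a numbered log entry listing data files added and removed, and the table state at any version is the replay of those entries.

```python
import json
import os
import tempfile

class ToyTableLog:
    """Toy commit log: one JSON file per version listing files added and removed."""

    def __init__(self, log_dir, schema):
        self.log_dir = log_dir
        self.schema = set(schema)  # column names allowed in writes
        os.makedirs(log_dir, exist_ok=True)

    def latest_version(self):
        versions = [int(name.split(".")[0]) for name in os.listdir(self.log_dir)]
        return max(versions, default=-1)

    def commit(self, add=(), remove=(), columns=None):
        # Schema enforcement: reject writes whose columns differ from the table schema.
        if columns is not None and set(columns) != self.schema:
            raise ValueError(f"schema mismatch: {sorted(columns)}")
        version = self.latest_version() + 1
        path = os.path.join(self.log_dir, f"{version:020d}.json")
        # Mode "x" fails if the file already exists, so two writers cannot both
        # claim the same version; the loser would have to re-read the log and retry.
        with open(path, "x") as f:
            json.dump({"add": list(add), "remove": list(remove)}, f)
        return version

    def files_at(self, version=None):
        """Replay commits up to `version` to get the data files visible at that point."""
        if version is None:
            version = self.latest_version()
        live = set()
        for v in range(version + 1):
            with open(os.path.join(self.log_dir, f"{v:020d}.json")) as f:
                entry = json.load(f)
            live -= set(entry["remove"])
            live |= set(entry["add"])
        return live

log = ToyTableLog(tempfile.mkdtemp(), schema=["order_id", "amount"])
log.commit(add=["part-0.parquet"], columns=["order_id", "amount"])  # version 0
log.commit(add=["part-1.parquet"], columns=["order_id", "amount"])  # version 1
# A row-level update is modelled as replacing one file: remove the old, add the new.
log.commit(add=["part-2.parquet"], remove=["part-0.parquet"])       # version 2
```

Calling `log.files_at(1)` replays only the first two commits, which is the toy equivalent of time travel: the removed file still exists in storage, and the log decides which files a reader sees.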
Catalog and Governance
A data lake without a data catalog is a data swamp. The catalog layer is what makes data discoverable, understandable, and trustworthy.
**Technical metadata**: What tables exist, what columns they have, what types those columns are, when the data was last updated. The catalog makes this discoverable without requiring knowledge of the file system layout.
**Business metadata**: What does this table contain? What does the 'status' column mean? What is the grain? Who owns this data? This context is the difference between finding data and being able to use it.
**Data lineage**: Where did this data come from? Which upstream tables feed this one? Which downstream tables depend on it? Lineage enables impact analysis and regulatory provenance.
**Data quality metrics**: What is the current null rate for key columns? Are primary keys unique? Has the row count dropped unexpectedly? Quality metrics that are captured at each pipeline run and surfaced in the catalog give analysts confidence in what they are using.
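The three quality questions above (null rate, key uniqueness, unexpected row count drops) map directly onto simple checks. The following is a minimal sketch, assuming rows are available as Python dictionaries; the `quality_metrics` function, its parameters, and the 20% drop threshold are illustrative choices, not a standard.

```python
def quality_metrics(rows, key, required, previous_row_count=None, max_drop=0.2):
    """Compute simple per-run quality metrics for a list of dict rows."""
    total = len(rows)
    # Share of rows where each required column is missing or None.
    null_rate = {
        col: (sum(1 for r in rows if r.get(col) is None) / total if total else 0.0)
        for col in required
    }
    keys = [r.get(key) for r in rows]
    metrics = {
        "row_count": total,
        "null_rate": null_rate,
        "key_is_unique": len(keys) == len(set(keys)),
        "row_count_dropped": False,
    }
    if previous_row_count:
        # Flag the run if the row count fell by more than max_drop since last run.
        drop = (previous_row_count - total) / previous_row_count
        metrics["row_count_dropped"] = drop > max_drop
    return metrics

rows = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 2, "email": None},
    {"customer_id": 2, "email": "c@example.com"},  # duplicate key
    {"customer_id": 4, "email": "d@example.com"},
]
report = quality_metrics(rows, key="customer_id", required=["email"], previous_row_count=10)
```

Storing a report like this for every pipeline run gives the catalog a history to display, so an analyst can see whether a null rate is normal for the table or new.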
Unity Catalog (Databricks), AWS Glue Data Catalog, Google Cloud Data Catalog, and tools like Atlan, DataHub, and OpenMetadata provide these catalog capabilities to varying degrees. The choice of catalog should be driven by which platforms your data lake uses and what governance requirements you have, not by the catalog's features in isolation.
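Whichever product is chosen, the information it holds has roughly the same shape. The sketch below is a hypothetical in-memory model, not the data model of any of the tools named above: `CatalogEntry`, the table names, and `downstream_of` are invented for illustration. It shows technical metadata, business metadata, and lineage on one record, and how lineage supports impact analysis.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    name: str
    columns: dict      # technical metadata: column name -> type
    description: str   # business metadata: what the table contains
    owner: str         # business metadata: who is responsible
    grain: str         # business metadata: what one row represents
    upstream: list = field(default_factory=list)  # lineage: tables this one reads from

entries = [
    CatalogEntry("bronze.orders", {"OrderID": "string"},
                 "Orders as received", "sales-eng", "one row per delivery"),
    CatalogEntry("silver.orders", {"order_id": "bigint"},
                 "Deduplicated orders", "sales-eng", "one row per order",
                 upstream=["bronze.orders"]),
    CatalogEntry("gold.revenue_by_country", {"country": "string", "revenue": "double"},
                 "Revenue by country", "finance", "one row per country",
                 upstream=["silver.orders"]),
]
catalog = {entry.name: entry for entry in entries}

def downstream_of(table, catalog):
    """Impact analysis: every table that directly or indirectly depends on `table`."""
    impacted, frontier = set(), [table]
    while frontier:
        current = frontier.pop()
        for entry in catalog.values():
            if current in entry.upstream and entry.name not in impacted:
                impacted.add(entry.name)
                frontier.append(entry.name)
    return impacted

impacted = downstream_of("bronze.orders", catalog)
```

Here `impacted` contains both the silver and gold tables, which is the question an engineer needs answered before changing the source feed.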
When a Data Lake Is the Right Architecture
A data lake is appropriate when:
- Data volumes or data type diversity (unstructured, semi-structured, ML training data) exceed what a purpose-built warehouse handles cost-effectively
- The organisation needs to store raw data for regulatory compliance or future reprocessing
- ML and data science workloads require access to data at fine grain and in formats that warehouses are not optimised for
- Historical data that is rarely queried needs to be stored at low cost
A data lake is not the right architecture when:
- The primary use case is structured SQL analytics — a cloud data warehouse is simpler to operate and often lower total cost
- The team does not have the engineering capacity to build and maintain the catalog, quality, and governance layers that make a data lake usable
Our data architecture and cloud engineering practice designs data lake architectures that avoid the swamp — contact us to discuss your data lake requirements.