
What Is a Data Lake? Architecture, Use Cases, and Trade-Offs

James Okafor
Lead Data Engineer
February 23, 2028 · 11 min read

A data lake is a centralised repository that stores raw data in its native format — structured, semi-structured, and unstructured — at any scale. This guide explains what a data lake is, how it differs from a data warehouse, the storage formats and query engines that make data lakes useful for analytics, and the governance challenges that have made "data swamp" a more accurate description of many real-world data lakes.

What a Data Lake Is

A data lake is a centralised repository that stores data in its raw, native format — structured tables, semi-structured JSON and CSV files, unstructured text, images, video — at any scale and at low cost, without requiring upfront schema definition.

The defining characteristic is schema-on-read rather than schema-on-write. A traditional database requires you to define the schema before loading data. A data lake accepts data in any format and defers schema interpretation to query time. This makes ingestion fast and flexible — but shifts the complexity to the consumer who must understand the raw data's structure.
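As a minimal sketch of schema-on-read, the snippet below keeps raw JSON lines exactly as they landed and applies a schema only when the data is read. The field names and the `READ_SCHEMA` mapping are hypothetical, chosen for illustration; a real lake would usually do this through a query engine rather than hand-written Python.

```python
import json

# Raw events stored as-is. The two records do not share the same shape or types.
RAW_LINES = [
    '{"user_id": "42", "amount": "19.99", "ts": "2024-01-05T10:00:00"}',
    '{"user_id": 43, "amount": 5, "country": "DE"}',
]

# Hypothetical schema owned by the reader, not by the storage layer.
READ_SCHEMA = {"user_id": int, "amount": float, "country": str}


def read_with_schema(lines, schema):
    """Project each raw record onto the schema, casting values at read time."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            # Fields absent from the raw record become None instead of failing.
            row[field] = cast(value) if value is not None else None
        rows.append(row)
    return rows


rows = read_with_schema(RAW_LINES, READ_SCHEMA)
for row in rows:
    print(row)
```

Ingestion never rejected the second record for its different shape. The cost shows up on the read side, where the consumer has to decide how to cast values, what to do with missing fields, and which extra fields (here `ts`) to ignore.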

Data lakes are built on object storage (Amazon S3, Google Cloud Storage, Azure Blob Storage), which provides essentially unlimited capacity at list prices of roughly $0.02-0.03 per GB per month for standard tiers, typically several times cheaper than the provisioned block storage behind a conventional database.

How Data Lakes Emerged

The original case for data lakes in the early 2010s was cost and flexibility: store everything cheaply, run Hadoop MapReduce jobs over it when needed. The promise was that by storing all raw data, organisations would be able to answer questions they had not yet thought to ask.

In practice, the early data lake experience was poor. Hadoop was complex to operate. Raw data without schema or documentation was unusable by anyone who had not ingested it. "Store everything" without governance produced data swamps — vast repositories of undocumented, untrustworthy data that nobody knew how to query.

The modern data lake addresses these failures through:

- **Open table formats** (Apache Iceberg, Delta Lake, Apache Hudi) that add ACID transactions, schema enforcement, and metadata management to object storage

- **Fast query engines** (Apache Spark, Trino, Amazon Athena, BigQuery Omni) that query Parquet files on object storage at scale

- **Governance tooling** (AWS Lake Formation, Apache Atlas, Databricks Unity Catalog) that adds access control and cataloguing

Data Lake vs Data Warehouse

The distinction between a data lake and a data warehouse has blurred significantly with the lakehouse architecture, but the traditional comparison is:

**Data Warehouse**:

- Structured, processed data only

- Schema defined before loading (schema-on-write)

- SQL query interface

- High-cost managed storage

- Strong ACID guarantees

- High query performance for structured analytical queries

- Examples: Snowflake, BigQuery, Redshift

**Data Lake**:

- Any data format including raw, semi-structured, and unstructured

- Schema defined at query time (schema-on-read)

- Multiple query interfaces: SQL, Python/Spark, ML frameworks

- Low-cost object storage

- Traditional lakes: no ACID (Parquet is not transactional); modern lakes with Iceberg/Delta: ACID

- Query performance varies by engine and table optimisation

- Examples: S3/GCS/ADLS with Parquet files, Iceberg tables, Hudi tables

When a Data Lake Is the Right Choice

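**Large data volume at low cost**: Storing 50TB of raw event data in Snowflake would cost about $1,150/month at $0.023/GB compressed (50,000 GB × $0.023). Storing the same data on S3 Intelligent-Tiering would cost roughly $200-400/month once most objects have moved to the colder access tiers; data that stays in the frequent-access tier is billed at about the same per-GB rate as the warehouse. For high-volume data with infrequent query access (compliance archival, raw event history), the storage economics strongly favour object storage.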
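The arithmetic behind a comparison like this can be sketched in a few lines. The flat rate of $0.023/GB comes from the text above. The tier shares and the two colder-tier prices in `ASSUMED_TIERS` are illustrative assumptions, not a quoted price list, so check current pricing for your region before relying on the result.

```python
GB_PER_TB = 1000  # decimal terabytes, as used on cloud price lists


def flat_cost(terabytes, usd_per_gb_month):
    """Monthly cost when every gigabyte is billed at one rate."""
    return terabytes * GB_PER_TB * usd_per_gb_month


def tiered_cost(terabytes, tiers):
    """Monthly cost when data is spread across tiers.

    tiers: list of (share_of_data, usd_per_gb_month); shares must sum to 1.
    """
    total_share = sum(share for share, _ in tiers)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("tier shares must sum to 1")
    return sum(terabytes * GB_PER_TB * share * price for share, price in tiers)


# Assumption: 10% hot, 20% infrequent-access, 70% archive-style tier.
ASSUMED_TIERS = [(0.10, 0.023), (0.20, 0.0125), (0.70, 0.004)]

warehouse_monthly = flat_cost(50, 0.023)
lake_monthly = tiered_cost(50, ASSUMED_TIERS)
print(f"flat rate: ${warehouse_monthly:,.0f}/month")
print(f"tiered:    ${lake_monthly:,.0f}/month")
```

Under these assumptions the flat rate comes to about $1,150 per month and the tiered mix to about $380 per month. The saving comes entirely from the share of data that sits in colder tiers, so the result is sensitive to how often the data is actually read.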

**Diverse data types**: A data warehouse handles structured tabular data well. A data lake can store structured tables, semi-structured JSON event logs, document text, and audio transcripts in the same location. ML pipelines that need to process diverse data types benefit from a data lake as the common landing zone.

**ML and data science workloads**: Python/Spark-based ML training reads data directly from S3/GCS. Running ML training against a data warehouse is inefficient — the warehouse query engine is optimised for SQL, not Spark DataFrames. A data lake is the natural training data store for ML at scale.

**Multiple compute engines**: Data that needs to be accessed by Spark, Flink, Trino, and a BI tool simultaneously benefits from open table formats on object storage, because all engines can read the same physical files. Data held in a warehouse's proprietary storage typically has to be exported, or read through a connector, before Spark can use it.

The Data Swamp Problem

A data lake becomes a data swamp when:

**No data catalogue**: Data is stored in folders with names like "dump_2022-03-15" with no documentation of what is inside. Nobody knows what data exists, what it means, or whether it is reliable.

**No schema enforcement**: Raw JSON files with inconsistent schemas accumulate. A query that worked last month fails today because a new upstream system added a field that changed the structure.

**No access control**: The entire data lake is readable by everyone. Sensitive data (PII, financial records) is mixed with operational data with no isolation.

**No data quality**: Raw data is ingested without validation. Known-bad data sits alongside good data with no labelling.

Preventing data swamps requires: a data catalogue (Glue Data Catalog, Databricks Unity Catalog, AWS Lake Formation), schema enforcement (Parquet schemas, Iceberg schema evolution), access control (Lake Formation column and row-level permissions, Databricks Unity Catalog), and data quality monitoring.
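The schema-enforcement and data-quality points can be illustrated with a small ingestion gate. This is a minimal sketch, assuming a hypothetical `EXPECTED_SCHEMA` for an orders feed; it is not how any particular catalogue or table format implements enforcement. Records that do not match are quarantined with a reason instead of being mixed in with good data.

```python
# Hypothetical contract for an "orders" feed.
EXPECTED_SCHEMA = {"order_id": int, "amount": float, "currency": str}


def validate(record, expected=EXPECTED_SCHEMA):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for field, expected_type in expected.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            actual = type(record[field]).__name__
            problems.append(
                f"{field}: expected {expected_type.__name__}, got {actual}"
            )
    # Flag fields that upstream added without updating the contract.
    for field in sorted(record.keys() - expected.keys()):
        problems.append(f"unexpected field: {field}")
    return problems


def partition_batch(records):
    """Split a batch into accepted records and (record, problems) pairs."""
    accepted, quarantined = [], []
    for record in records:
        problems = validate(record)
        if problems:
            quarantined.append((record, problems))
        else:
            accepted.append(record)
    return accepted, quarantined


batch = [
    {"order_id": 1, "amount": 9.5, "currency": "EUR"},
    {"order_id": "2", "amount": 9.5, "currency": "EUR"},
    {"order_id": 3, "amount": 1.0, "currency": "USD", "coupon": "X"},
]
accepted, quarantined = partition_batch(batch)
```

Here only the first record is accepted. The second is quarantined for a type mismatch on `order_id`, and the third for the undeclared `coupon` field. This is the kind of upstream change that otherwise surfaces weeks later as a broken query.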

The Lakehouse Pattern

The lakehouse architecture adds data warehouse semantics to a data lake storage layer — combining the cost and flexibility of object storage with the performance and governance of a managed warehouse.

Open table formats (Apache Iceberg, Delta Lake) add to Parquet files on S3:

- ACID transactions: commit writes atomically; reads see a consistent snapshot

- Schema evolution: add columns without rewriting files

- Time travel: query the table as of an earlier snapshot, for as long as that snapshot is retained

- Metadata-based query optimisation: partition pruning, column statistics, file skipping
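The transaction and time-travel items in this list both follow from one idea: the table is a sequence of immutable snapshots, each listing the data files that make up the table at that point. The toy class below illustrates that idea in memory. The class and file names are hypothetical, and real table formats keep this metadata as files on object storage with an atomic pointer update, which this sketch does not attempt to model.

```python
class SnapshotTable:
    """Toy table whose state is an append-only list of immutable snapshots."""

    def __init__(self):
        self._snapshots = [()]  # snapshot 0 is the empty table

    def commit(self, add=(), remove=()):
        """Publish a new snapshot; readers of older snapshots are unaffected."""
        removed = set(remove)
        current = self._snapshots[-1]
        new_snapshot = tuple(f for f in current if f not in removed) + tuple(add)
        # A single append stands in for the atomic metadata pointer swap.
        self._snapshots.append(new_snapshot)
        return len(self._snapshots) - 1

    def files(self, snapshot_id=None):
        """List data files at a snapshot (the latest one by default)."""
        if snapshot_id is None:
            snapshot_id = len(self._snapshots) - 1
        return list(self._snapshots[snapshot_id])


table = SnapshotTable()
s1 = table.commit(add=["part-0001.parquet", "part-0002.parquet"])
# A compaction-style commit: replace one file and add another in one step.
s2 = table.commit(add=["part-0003.parquet"], remove=["part-0001.parquet"])

latest_files = table.files()
files_at_s1 = table.files(s1)
```

A reader pinned to `s1` still sees the two original files after the second commit, which is the essence of both snapshot isolation and time travel. Expiring old snapshots is what eventually limits how far back a query can go.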

Compute engines (Spark, Trino, Databricks SQL, Athena) query Iceberg/Delta tables using the file statistics to plan efficient queries.
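File skipping from column statistics can be sketched as a range-overlap check. The example assumes each data file carries a minimum and maximum value for an `event_date` column. The `FileStats` type and the file paths are hypothetical, and real engines read these statistics from table metadata or Parquet footers rather than from a Python list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    path: str
    min_date: str  # ISO dates compare correctly as strings
    max_date: str


def prune(files, lo, hi):
    """Keep only files whose [min_date, max_date] range overlaps [lo, hi]."""
    return [f.path for f in files if f.max_date >= lo and f.min_date <= hi]


files = [
    FileStats("events/part-0001.parquet", "2024-01-01", "2024-01-31"),
    FileStats("events/part-0002.parquet", "2024-02-01", "2024-02-29"),
    FileStats("events/part-0003.parquet", "2024-03-01", "2024-03-31"),
]

# Query: WHERE event_date BETWEEN '2024-02-10' AND '2024-03-05'
to_scan = prune(files, "2024-02-10", "2024-03-05")
```

Only the February and March files need to be opened for this query; the January file is skipped without reading a byte of its data. The technique pays off most when files are written sorted or partitioned on the filtered column, so that ranges rarely overlap.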

The practical result: many organisations use a lakehouse as the raw and intermediate data layer (cheap storage, Spark processing), with a managed warehouse (Snowflake, BigQuery) as the analytical serving layer (fast SQL, BI tool integration).

Our data architecture practice designs data lake and lakehouse architectures for organisations moving beyond managed warehouses — contact us to discuss your data platform requirements.
