The Modern Data Stack: Every Component Explained

A plain-language guide to every layer of the modern data stack — ingestion, storage, transformation, orchestration, cataloguing, and BI — the leading tools at each layer, the architectural decisions that determine which tools belong in your stack, and what the modern data stack gets right and wrong.

The term "modern data stack" describes a particular architectural approach to analytics infrastructure that emerged around 2016–2020 and became the de facto standard for cloud-native data teams. Understanding each layer — what it does, which tools live there, and which architectural decisions matter — is essential for anyone building or evaluating their analytics infrastructure.

The Layers of the Modern Data Stack

### Layer 1: Data Ingestion

Ingestion moves data from source systems (SaaS applications, databases, event streams, files) into the data warehouse. The modern data stack approach is to use managed connectors rather than building custom ETL.

**What it does:** Connects to source systems via APIs or database replication, extracts data, and loads it into a staging schema in the warehouse. Most managed connectors perform incremental extraction — only pulling new or changed records since the last sync.

Leading tools:

- **Fivetran:** The premium managed connector service. 300+ connectors, automatic schema evolution, minimal configuration. High cost per connector at scale.

- **Airbyte:** Open-source alternative to Fivetran. Self-hosted (more operational overhead) or Airbyte Cloud. Lower cost, similar connector breadth.

- **Stitch (by Talend):** Mid-market option. Fewer connectors than Fivetran, lower price.

- **Singer:** Open-source framework for building custom connectors. More engineering overhead, flexible.

**Architectural decision:** At what granularity do you need source data? Full-history replication (every change to every record) versus current-state snapshots (the current value of every row) have different storage and query implications. For most analytics use cases, current-state replication is sufficient and significantly cheaper to store.

### Layer 2: Data Storage

The data warehouse stores the raw ingested data and the transformed analytical data. The warehouse is the central persistence layer for the entire stack.

**What it does:** Columnar storage optimised for analytical query patterns — aggregations, group-bys, range scans across large datasets. Different from OLTP databases (row-store, optimised for transactional read-write patterns).

Leading tools:

- **Snowflake:** Separated storage and compute, multi-cloud, SQL-first. Most polished analyst experience.

- **BigQuery:** Serverless, native GCP ecosystem integration, strong ML features.

- **Databricks (Delta Lake):** Unified data + ML platform, open Delta format, strong for engineering-heavy teams.

- **Redshift:** AWS-native, good for existing AWS organisations, mature ecosystem.

**Architectural decision:** Schema organisation within the warehouse matters as much as which warehouse you choose. The raw schema stores data exactly as ingested. Staging schemas hold lightly transformed data. Mart schemas hold the final analytical models consumed by BI tools. This layering keeps transformation logic organised and queryable.

### Layer 3: Data Transformation

Transformation takes raw ingested data and converts it into clean, modelled, tested analytical assets. The modern data stack approach: transformation in SQL, in the warehouse, using dbt.

**What it does:** Cleans raw data (deduplicating, handling nulls, standardising formats), implements business logic (metric definitions, entity relationships, classification), and builds the dimensional model that BI tools query (fact tables, dimension tables, wide mart tables).

Leading tools:

- **dbt (data build tool):** The dominant transformation layer tool. SQL-first, version-controlled, with built-in testing, documentation, and lineage. dbt Core is open-source; dbt Cloud adds deployment, scheduling, and a development IDE.

- **SQLMesh:** Open-source alternative to dbt with stronger CI/CD features and virtual environments for staging. Gaining adoption.

- **Dataform (Google):** Google's dbt-equivalent, natively integrated with BigQuery.

**Architectural decision:** The dbt project structure determines how maintainable your transformation code will be. The standard pattern — staging models (one per source), intermediate models (joined and enriched entities), mart models (business-layer aggregations) — keeps concerns separated and makes the lineage graph readable.

### Layer 4: Data Orchestration

Orchestration schedules and coordinates pipeline execution — running ingestion, transformation, and downstream processes in the right order, handling failures and retries, and providing visibility into pipeline health.

**What it does:** Defines the dependency graph between pipeline tasks, schedules runs, monitors execution, handles retries on failure, and alerts on issues.

Leading tools:

- **Apache Airflow:** The most widely deployed orchestrator. DAG-based, Python-native, large ecosystem of operators. Operationally complex to self-host; managed via Astronomer, AWS MWAA, or GCP Composer.

- **Prefect:** Python-native, easier to test than Airflow, Prefect Cloud for managed deployment. Growing adoption.

- **Dagster:** Asset-based orchestration model with first-class dbt integration. Strong for data asset-centric workflows.

- **dbt Cloud (scheduler):** For organisations whose transformation work is entirely in dbt, dbt Cloud's built-in scheduler may be sufficient without a separate orchestrator.

**Architectural decision:** Many mid-market organisations start with dbt Cloud's scheduler (or a simple cron-triggered deployment) and add a full orchestrator only when the pipeline complexity demands it — multiple interdependent pipelines, conditional logic, parallel execution branches. Start simple; add orchestration complexity when you have a reason for it.

### Layer 5: Data Catalogue and Lineage

The catalogue makes the data warehouse discoverable — what tables exist, what they contain, where they come from, and who owns them.

**What it does:** Indexes table schemas and metadata, tracks data lineage (source-to-report), provides search across all data assets, enables annotation with business descriptions and ownership, and surfaces data quality metrics.

Leading tools:

- **dbt Docs:** dbt's built-in documentation site, generated from schema.yml descriptions. Covers the transformation layer with automatic lineage diagrams. Free with dbt Core.

- **Atlan:** Full-featured enterprise catalogue with Tableau integration, lineage from ingestion through BI, and collaborative annotation. Mid-to-large enterprise.

- **DataHub (open source):** LinkedIn's open-source catalogue. Self-hosted, broad integration ecosystem, active community.

- **Alation:** Established enterprise catalogue, strong for regulated industries with compliance requirements.

**Architectural decision:** dbt Docs covers the transformation layer for free and is sufficient for most teams through the first few years. Full-stack catalogues become valuable when analysts routinely cannot find data and spend significant time on discovery.

### Layer 6: Business Intelligence

The BI layer connects to the data warehouse, queries the analytical models, and renders visualisations for business users.

**What it does:** Provides a no-code or low-code interface for building dashboards and reports, manages caching for performance, controls access to data and content, and distributes analytical content to business users.

Leading tools:

- **Tableau:** The most capable visualisation tool. Best for complex analytical views and organisations where BI is a strategic capability. High licensing cost.

- **Looker:** SQL-centric BI with the LookML semantic layer. Excellent for metric consistency across reports. Now Google Cloud-native.

- **Power BI:** Microsoft ecosystem integration, strong DAX calculation engine, lower cost than Tableau at scale.

- **Metabase:** Open-source, self-hosted option for cost-conscious organisations. Less capable than Tableau or Looker but sufficient for many use cases.

- **Superset (Apache):** Open-source, highly configurable, requires engineering investment to operate well.

**Architectural decision:** BI tool selection is often more about organisation fit than feature comparison. A team of 3 analysts at a Series A startup has different requirements than an enterprise with 50 analysts and strict governance requirements. Match the tool to the team and the use case, not the benchmark comparison.

What the Modern Data Stack Gets Right

**Separation of concerns:** Ingestion, storage, transformation, and BI are handled by specialised tools, each excellent at its layer. You are not locked into a single vendor's vertically integrated stack.

**Version control for transformation logic:** dbt's SQL-in-git model brings software engineering practices to data transformation for the first time at scale.

**Testability:** dbt tests make data quality assertions first-class parts of the pipeline, not an afterthought.

**Cloud economics:** Pay for what you use. No upfront infrastructure investment, no capacity planning for peak load.

What the Modern Data Stack Gets Wrong

**Cost at scale:** Fivetran plus Snowflake plus dbt Cloud plus a catalogue plus Tableau adds up quickly. Mid-market organisations regularly discover that their data stack costs more than they expected once they have meaningful data volumes.

**Operational complexity:** Five specialised tools means five things to learn, five things to troubleshoot, five vendors to manage. The integration overhead is real.

**Streaming is an afterthought:** The classic modern data stack is a batch architecture. Streaming requires adding a Kafka layer and a stream processor, which are architecturally quite different from the batch ELT paradigm.

**Semantic layer fragmentation:** Metric definitions implemented in dbt and separately in the BI tool remain a solved problem only in organisations with strong governance. Most organisations end up with metric definitions in multiple places.

Our data architecture consulting practice designs and implements modern data stack architectures — contact us to discuss the right stack for your team and data volume.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →