BlogData Architecture

The Modern Data Stack: What It Is, What It Gets Right, and What It Gets Wrong

Austin Duncan
Austin Duncan
Managing Director & Principal Data Architect
·May 9, 202712 min read

The modern data stack — cloud warehouse, dbt, managed ingestion connectors, BI tool — replaced the traditional ETL-to-data-warehouse pattern for good reasons. But it also brought new limitations and failure modes that organisations encounter after the initial implementation succeeds. This is an honest assessment of what the modern data stack is, why it works, and where it falls short.

The modern data stack is a loosely defined architecture pattern that emerged around 2018–2020 as cloud data warehouses became affordable, dbt became the transformation standard, and managed ingestion connectors eliminated the need to build extraction pipelines from scratch. It replaced the traditional ETL (extract, transform, load) pattern with ELT (extract, load, transform) — loading raw data into the warehouse first, transforming in-warehouse using SQL.

The shift was real and the benefits were significant. But the term "modern data stack" has been stretched to cover a growing set of tools with varying maturity and utility. This guide focuses on what the pattern actually consists of, why it emerged, and where organisations encounter its limitations.

The Pattern and Why It Emerged

The traditional ETL pattern required purpose-built ETL tools (Informatica, Talend, SSIS) to extract data from sources, apply transformations in the ETL layer, and load clean, structured data into a data warehouse. The ETL layer was complex, expensive to maintain, and difficult to test or version control.

The modern data stack shifts the transformation step to occur after loading. Raw data is extracted and loaded with minimal transformation — tools like Fivetran, Stitch, and Airbyte handle this. The transformation step happens inside the warehouse using dbt. The BI tool connects to the transformed tables.

Why this works better than traditional ETL:

**Preservation of raw data**: loading raw data first preserves all information from the source. Traditional ETL that transforms during extraction throws away information — if the transformation logic turns out to be wrong, the raw data is not available for reprocessing.

**Transformation in SQL**: SQL is the lingua franca of data teams. More people can write, review, and maintain SQL transformations than can work with proprietary ETL tool logic. dbt makes SQL transformations testable, documented, and version controlled.

**Cloud warehouse economics**: cloud warehouses with on-demand compute make in-warehouse transformation affordable. Running transformations inside the warehouse is fast and scales with compute allocation, not with ETL server capacity.

**Managed connectors**: Fivetran and Airbyte eliminated the engineering cost of extraction pipelines for common SaaS sources. Before managed connectors, extracting data from Salesforce or HubSpot required custom engineering and ongoing maintenance.

What the Modern Data Stack Gets Right

**Ingestion without custom engineering**: for the 90% of organisations whose primary data sources are common SaaS tools (Salesforce, HubSpot, Shopify, Stripe, Google Analytics, database replication from standard databases), managed connectors make ingestion a configuration problem, not an engineering problem.

**SQL-first transformation with software engineering practices**: dbt brought testing, documentation, version control, and CI/CD to SQL transformation. The quality of SQL transformation code in a dbt project is consistently higher than hand-written SQL transformation scripts maintained without these practices.

**Democratised access to powerful compute**: cloud warehouses removed the need to provision and manage on-premise data warehouse hardware. A small data team at a Series B startup can query the same class of analytical compute that a Fortune 500 was using five years ago.

**BI tool flexibility**: with data modelled in the warehouse, the BI tool is interchangeable. Swapping from Tableau to Power BI or vice versa does not require rebuilding the transformation layer.

What the Modern Data Stack Gets Wrong (or Leaves Unsolved)

**The coordination problem between tools**: managed connectors load raw data with the source system's schema. The raw data often has poor field naming, missing foreign keys, and schema that changes without warning. dbt models that depend on specific source schemas break when the source changes. The modern data stack solves the extraction problem but does not solve the source schema stability problem — that requires data contracts, which are a separate discipline.

**Reverse ETL as an afterthought**: the modern data stack is excellent at getting data into an analytics warehouse. It was designed less carefully for the use case of getting insights back out into operational systems — CRM, marketing platform, customer success tools. Reverse ETL tools (Census, Hightouch) fill this gap, but they are add-ons, not core to the original pattern.

**Streaming and real-time limitations**: the modern data stack is a batch processing pattern. Data moves in periodic loads — hourly, daily. For use cases that require real-time data (fraud detection, real-time personalisation, operational dashboards with sub-minute freshness), the modern data stack requires supplementation with streaming infrastructure (Kafka, Flink) that operates on a fundamentally different architectural model.

**Testing coverage gaps**: dbt tests cover schema contracts and simple data quality rules well. They are less effective for statistical anomaly detection, temporal consistency checks, and cross-model consistency verification. Data quality beyond dbt tests requires additional tooling.

**Semantic layer fragmentation**: data warehouse tables model data but do not capture business metric definitions. "Revenue" computed in one dbt model may differ from "Revenue" computed in another if multiple data teams are working independently. Tools like dbt Semantic Layer, Cube, and Metriql address this, but semantic layer governance is an unsolved problem for most organisations.

**Cost management at scale**: cloud warehouses with consumption-based pricing are excellent at low usage and manageable at moderate usage. At high usage — thousands of queries per day, very large datasets, many concurrent users — costs can escalate significantly beyond initial estimates. Cost governance (query budget alerting, warehouse auto-suspend, materialisation strategy optimisation) becomes a significant operational concern at scale.

When the Modern Data Stack Is Not the Right Pattern

The modern data stack is optimised for the common case: SaaS data sources, SQL transformation, analytical BI use cases. It is not optimised for:

**Highly regulated industries with strict data residency requirements**: managed connectors store data temporarily in the connector provider's infrastructure. For regulated industries (healthcare, finance, defence) with strict data residency requirements, fully managed connectors may not be permissible. Custom extraction pipelines that never leave the controlled environment are required.

**Real-time event processing**: the pattern was designed for batch. Real-time use cases (streaming analytics, event-driven ML, real-time fraud detection) require a streaming architecture alongside or instead of the batch modern data stack.

**Organisations with heavily customised, proprietary source systems**: managed connectors work for standard SaaS APIs. Highly customised ERP systems, proprietary operational databases, or legacy systems without APIs require custom extraction engineering that managed connectors cannot provide.

Our data architecture practice designs data stacks appropriate for each organisation's specific context — contact us to discuss your data architecture.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →