The Modern Data Stack in 2025: What Has Changed and What Has Survived

How the modern data stack has evolved: which tools have consolidated, which categories have been disrupted, what the AI era changes about data infrastructure, and what the stack looks like for new builds versus established environments in 2025.

The "modern data stack" was a useful concept when it emerged around 2020: cloud data warehouses (Snowflake, BigQuery), ELT tools (Fivetran, Airbyte), SQL-first transformation (dbt), and BI tools (Looker, Mode) formed a recognisable pattern that contrasted with legacy on-premises ETL and data warehouse approaches.

In 2025, the concept is both more mature and more complicated. Some elements have become standard infrastructure; others have been disrupted or commoditised. AI has added a layer of new tooling and new requirements. Understanding what has changed — and what the data stack actually looks like for new builds versus maintenance of established environments — matters for data architects and data leaders making platform decisions.

What Has Consolidated

**Cloud data warehouses are infrastructure.** Snowflake, BigQuery, and Databricks (with the lakehouse pattern) are no longer novel; they are the expected foundation. The decision has moved from "should we use a cloud warehouse?" to "which cloud warehouse, and should we use a warehouse or a lakehouse?" The warehouse-first vs lakehouse debate has largely resolved in favour of "depends on your use case" — warehouses for SQL analytics-focused organisations, lakehouses for ML-heavy or data engineering-intensive workloads with diverse access patterns.

**dbt is the transformation standard.** SQL-first transformation with dbt has become the default for analytics engineering. The competition — Spark-based transformation, custom stored procedures, ELT tools with built-in transformation — has been largely displaced for the analytics layer. dbt Cloud, dbt Semantic Layer, and the expanding dbt ecosystem have made dbt a platform rather than a single tool.

**Fivetran and Airbyte have commoditised ingestion.** Managed connectors for SaaS-to-warehouse ingestion are table stakes. The competition is on connector breadth, pricing, and the degree to which ingestion is a solved problem rather than engineering work.

**Tableau and Power BI have held their ground in enterprise BI.** Despite the emergence of Looker, Sigma, Hex, and other modern BI tools, Tableau and Power BI remain dominant in established enterprise environments. Looker has been absorbed into Google Cloud and continues to evolve there. Overall, the BI tool landscape has consolidated more slowly than the data engineering layer.

What Has Been Disrupted

**The standalone reverse ETL category is under pressure.** Tools like Census and Hightouch emerged to solve the problem of syncing warehouse data to operational systems. The problem is real, but it has been partially absorbed by composable CDP architectures, and some warehouse vendors (Snowflake with Data Sharing, Google with BigQuery) have built more direct activation capabilities. The standalone reverse ETL market has not disappeared, but it has not grown to the scale early predictions suggested.

**Orchestration is contested.** Airflow was the default orchestration choice for most of the 2018–2023 period. Dagster, Prefect, and other alternatives have carved out significant market share by addressing Airflow's limitations in observability and developer experience. The orchestration category no longer has a single dominant tool.

**The data catalogue market has not delivered.** Enterprise data catalogues (Alation, Collibra, Atlan) were supposed to solve discovery, governance, and documentation. They have delivered value in specific organisations with dedicated data governance programmes, but they have not become the universal data infrastructure layer they were positioned to be. Most organisations still struggle with documentation and discovery.

**Streaming is real but not universal.** The pitch for real-time data infrastructure (Kafka, Flink, real-time ML) has not been validated for most organisations. Most mid-market analytics workloads run on daily or hourly batch: the latency is acceptable, and the operational complexity of streaming is not justified. Streaming remains standard practice in the fintech, marketplace, and gaming verticals, where it always was. The claim that "all analytics will be real-time" has not materialised.

What AI Has Changed

**LLM functions in the warehouse.** Snowflake Cortex, BigQuery ML's Gemini integration, and Databricks' AI Functions bring LLM inference into SQL. This has meaningful practical application: classifying unstructured text, extracting information from documents, generating embeddings for semantic search. It is not a revolution in how analytics works, but it is a genuine productivity improvement for specific use cases.

**Text-to-SQL has arrived but not eliminated SQL.** Tools like Tableau Pulse, Snowflake Cortex Analyst, and various AI query interfaces have made natural language querying viable for simple questions against well-documented data models. They have not eliminated the need for SQL or data modelling expertise — they amplify it. Badly documented data models produce bad natural language query results.

**AI infrastructure has created new data requirements.** RAG systems, vector stores, embedding pipelines, and model evaluation infrastructure are new categories that did not exist in the 2020 data stack. Organisations building AI products need this infrastructure, and it sits adjacent to (or integrated with) their existing data platform.
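To make the retrieval step of a RAG system concrete, here is a minimal sketch in plain Python. It assumes documents have already been embedded; the three-dimensional vectors, the document names, and the `top_k` helper are all hypothetical and chosen for illustration only. A real system would get embeddings from a model and would typically delegate the search to a vector store rather than scanning a dictionary.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the k (doc_id, score) pairs most similar to the query."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in store.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings; real ones have hundreds of dimensions.
store = {
    "pricing_faq": [0.9, 0.1, 0.0],
    "onboarding_guide": [0.1, 0.9, 0.1],
    "security_policy": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # stands in for an embedded user question

results = top_k(query, store, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

The retrieved documents would then be passed to the language model as context. The brute-force scan shown here is the part a vector store replaces with an index once the corpus grows.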

**Data quality requirements have increased.** AI systems are sensitive to training data quality and retrieval data quality in ways that traditional BI is not. A dashboard that shows an incorrect number is a data quality issue. An AI assistant that gives an incorrect answer because its training or context data was wrong is a trust and reliability issue of a different magnitude. This is driving increased investment in data contracts, data quality automation, and lineage.
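As a rough illustration of what a data contract enforces, the sketch below checks incoming rows against a declared schema before they reach downstream consumers. The `ColumnRule` class, the `validate_rows` function, and the order columns are hypothetical names invented for this example; dedicated data contract and data quality tools cover far more (freshness, ranges, uniqueness, ownership).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnRule:
    """One column in a (hypothetical) data contract."""
    name: str
    dtype: type
    nullable: bool = False


def validate_rows(rows, contract):
    """Return a list of human-readable contract violations."""
    violations = []
    for i, row in enumerate(rows):
        for rule in contract:
            if rule.name not in row:
                violations.append(f"row {i}: missing column '{rule.name}'")
                continue
            value = row[rule.name]
            if value is None:
                if not rule.nullable:
                    violations.append(f"row {i}: '{rule.name}' is null")
                continue
            if not isinstance(value, rule.dtype):
                violations.append(
                    f"row {i}: '{rule.name}' expected {rule.dtype.__name__}, "
                    f"got {type(value).__name__}"
                )
    return violations


# Illustrative contract and sample rows.
contract = [
    ColumnRule("order_id", int),
    ColumnRule("amount", float),
    ColumnRule("coupon", str, nullable=True),
]
rows = [
    {"order_id": 1, "amount": 19.99, "coupon": None},      # valid
    {"order_id": 2, "amount": None, "coupon": "SPRING"},   # null amount
    {"order_id": "3", "amount": 5.0},                      # wrong type, missing column
]

violations = validate_rows(rows, contract)
for v in violations:
    print(v)
```

In practice a check like this would run in the pipeline and block or quarantine a load when violations appear, so that bad records never reach a dashboard or a retrieval index.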

What the Stack Looks Like for New Builds in 2025

For a new mid-market B2B SaaS company building its data stack:

**Ingestion:** Fivetran or Airbyte for SaaS connectors + Segment/RudderStack for event data + Debezium for CDC from operational databases.

**Warehouse:** Snowflake or BigQuery. Databricks if ML is a primary use case from day one.

**Transformation:** dbt Core (or dbt Cloud for managed execution). Python transformations in Snowpark or Databricks where SQL is insufficient.

**Orchestration:** dbt Cloud native scheduler for dbt; Dagster or Airflow for multi-step pipelines. Use a managed service if possible.

**BI:** Tableau or Power BI for enterprise dashboards, Metabase or Superset for internal analytics, PostHog for product analytics.

**AI/LLM layer:** Snowflake Cortex or BigQuery ML for in-warehouse inference. Vector store (pgvector, Pinecone, or Snowflake native) for RAG if building AI products.
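The layers above depend on one another, and working out a valid run order is the core job of an orchestrator. The sketch below models that dependency graph with Python's standard-library `graphlib` and groups tasks into batches that could run in parallel. The task names are hypothetical labels for the layers listed above, not real job identifiers, and this is not how Dagster or Airflow are configured; it only shows the underlying idea.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on (hypothetical names).
pipeline = {
    "ingest_saas": set(),          # managed SaaS connectors
    "ingest_events": set(),        # event stream collection
    "ingest_cdc": set(),           # change data capture from operational DBs
    "dbt_build": {"ingest_saas", "ingest_events", "ingest_cdc"},
    "refresh_dashboards": {"dbt_build"},
    "refresh_embeddings": {"dbt_build"},
}

sorter = TopologicalSorter(pipeline)
sorter.prepare()  # also raises CycleError if the graph has a cycle

batches = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # tasks whose dependencies are done
    batches.append(ready)
    sorter.done(*ready)                 # mark the batch as finished

for n, batch in enumerate(batches, start=1):
    print(f"stage {n}: {', '.join(batch)}")
```

Real orchestrators add scheduling, retries, alerting, and lineage on top of this ordering, which is a large part of why a managed option is usually worth it.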

For data architecture design and technology selection, our data architecture consulting practice helps organisations evaluate and choose the right stack — contact us to discuss your data platform requirements.
