How to Build a Modern Data Stack: The Complete 2026 Guide

The modern data stack — Fivetran, Snowflake, dbt, and a BI layer — has become the default architecture for mid-market analytics. Here is what it actually is, how the layers fit together, what it costs to build, and when it is not the right answer.

The quick answer

The modern data stack is a set of cloud-native, composable tools that together handle data ingestion, storage, transformation, orchestration, and analytics. It replaced the monolithic ETL-and-data-warehouse architecture of the previous decade. The defining characteristics: cloud-first, modular (each layer is a best-of-breed tool rather than a suite), SQL-accessible at every layer, and built for agility rather than upfront schema design.

The standard stack: a managed ingestion tool (Fivetran or Airbyte) loads raw data into a cloud data warehouse (Snowflake, BigQuery, or Databricks). dbt transforms raw data into analytics-ready models. An orchestration tool (Airflow, Prefect, or Dagster) schedules and monitors the pipeline. A BI tool (Tableau, Power BI, or Looker) consumes the transformed data. Optional: a semantic layer, data catalogue, and observability tooling wrap around the core stack.

The five core layers

### Layer 1: Data ingestion

Ingestion tools extract data from source systems — CRMs, databases, SaaS APIs, event streams — and load it into the data warehouse. The modern pattern is ELT (extract, load, transform) rather than ETL: load raw data first, transform it in the warehouse using compute that is already there.

**Fivetran** is the market standard for managed connectors. It maintains 300+ pre-built connectors to SaaS sources (Salesforce, HubSpot, Stripe, Google Ads), handles incremental syncing, manages schema changes, and requires zero custom code for covered sources. Pricing is consumption-based (per monthly active rows). For organisations that primarily connect to common SaaS systems, Fivetran eliminates most ingestion engineering.

**Airbyte** is the open-source alternative. Airbyte has 350+ connectors and can be self-hosted (free) or run via Airbyte Cloud. For organisations with custom source systems, Airbyte's Connector Development Kit allows custom connectors. Self-hosted Airbyte requires infrastructure management; the cost savings versus Fivetran depend on the total cost of the engineering time that infrastructure management consumes.

**Stitch** (now part of Talend) is a simpler, lower-cost Fivetran alternative for organisations with straightforward ingestion requirements. Less feature-rich than Fivetran; appropriate for smaller data teams.

**Custom pipelines with Apache Kafka** are used when real-time event streaming is required. Kafka handles high-volume, low-latency event ingestion that scheduled batch tools cannot. For most organisations, batch ingestion is sufficient — Kafka adds significant operational complexity and is only justified when sub-minute data freshness is a genuine business requirement.

### Layer 2: Data storage

The data warehouse (or lakehouse) is the central data repository that all downstream analytics reads from. The key decision here is warehouse vs lakehouse.

**Cloud data warehouses** (Snowflake, BigQuery, Amazon Redshift, Azure Synapse) are optimised for SQL analytics. Columnar storage, massively parallel query execution, separated compute and storage. Snowflake is the dominant independent option; BigQuery is the natural choice for GCP-first organisations; Redshift for AWS-heavy environments; Synapse for Azure environments (though Microsoft Fabric is increasingly replacing Synapse for new builds).

**Data lakehouses** (Databricks with Delta Lake) combine the raw storage economics of a data lake with the ACID transaction reliability and SQL query performance of a warehouse. The lakehouse is the right choice when ML and AI workloads run alongside SQL analytics, when the data volume makes warehouse storage economics prohibitive, or when open table formats (Delta Lake, Apache Iceberg, Apache Hudi) are a strategic requirement.

For most mid-market analytics use cases, a managed cloud data warehouse (Snowflake or BigQuery) is the right default. The operational simplicity and SQL query performance are better suited to analytics-first use cases than a lakehouse. See Snowflake vs Databricks for a detailed comparison.

### Layer 3: Transformation

**dbt (data build tool)** is the standard for the transformation layer. dbt models are SELECT statements — you write SQL that defines how to transform raw source data into analytics-ready tables and views. dbt compiles those models, manages dependencies between them, runs them against the warehouse, and tests the outputs.

The advantages of dbt over stored procedures and traditional SQL transformation: version control (models are files in a git repository), automated testing (dbt test runs after each model build), auto-generated documentation, and automatic lineage capture (dbt builds a dependency graph from model references). See our detailed guide on what is dbt and how it fits your data stack for a full breakdown.

For the transformation-layer pattern, see ETL vs ELT for when each pipeline pattern is appropriate.

### Layer 4: Orchestration

Orchestration tools schedule pipelines, manage dependencies between pipeline steps, and handle failures — retrying failed tasks, alerting on persistent failures, and providing visibility into pipeline run history.

**Apache Airflow** is the most widely deployed orchestration tool. Python-based DAGs (directed acyclic graphs) define pipeline logic. Extensive integrations with cloud services, warehouses, and data tools. Airflow requires infrastructure management (managed options: Google Cloud Composer, Amazon MWAA, Astronomer). The learning curve for writing and maintaining Airflow DAGs is significant — it is a serious engineering tool.

**Prefect** is a more modern Python-based orchestrator designed to address Airflow's operational complexity. Less infrastructure overhead, better developer experience, cloud-managed option (Prefect Cloud). Better for organisations that want Airflow-like flexibility without as much operational burden.

**Dagster** is asset-oriented: rather than defining workflows as sequences of tasks, Dagster models pipelines as software-defined assets — data assets that are produced by code. This approach maps more naturally onto the modern data stack's asset model (dbt models, Fivetran connectors, etc.) and produces better data lineage. Strong fit for teams that use dbt heavily.

**dbt Cloud's built-in scheduler** handles simple orchestration requirements without a separate tool. For organisations that primarily need to schedule dbt runs and Fivetran syncs, dbt Cloud's jobs feature and dbt Cloud's API may be sufficient without deploying a full orchestration platform.

### Layer 5: Business intelligence

The BI layer is where transformed data becomes reports, dashboards, and self-service analytics. The three dominant enterprise platforms:

**Tableau** — strongest for complex visualisations, large-extract performance, and environments with varied visualisation requirements. Best suited for organisations where the visual analytics quality matters and analytical depth is required. For organisations on Tableau, the modern data stack integration connects Tableau to Snowflake or BigQuery via live query or extract, depending on performance requirements.

**Power BI** — best for Microsoft-first organisations. Native integration with Azure Synapse, Fabric, M365. Often included in existing M365 licensing. DAX is well-suited for financial modelling. See Power BI vs Tableau for a detailed comparison.

**Looker** — strongest for organisations where the semantic layer (LookML model) is the primary data governance mechanism. API-first and best for embedded analytics in SaaS products. Google/BigQuery-native.

For decision framework guidance on BI tool selection, how to choose a BI tool in 2026 covers the full comparison.

The optional layers

### Semantic layer

A semantic layer sits between the warehouse and BI tools, defining business metrics centrally so that every downstream tool reads the same definition. "Monthly recurring revenue," "active customer," "net promoter score" — defined once, consumed consistently. Tools: dbt Semantic Layer (MetricFlow), Cube, AtScale.

The semantic layer is essential when multiple BI tools query the same warehouse (metric definitions would otherwise diverge), when LLM-based natural language query is planned (AI needs canonical metric definitions to answer business questions accurately), or when business logic is complex enough that embedding it in individual BI workbooks creates governance risk.

For the full breakdown, see what is a semantic layer.

### Data catalogue and lineage

A data catalogue documents what data assets exist, what they mean, who owns them, and where they came from. Essential for any organisation with more than a few dozen data assets. Tools: Microsoft Purview, Atlan, dbt Docs (lightweight option).

Lineage tracking shows how data flows from source systems through transformation into BI outputs — critical for impact analysis ("if this source system changes schema, what dashboards break?") and regulatory requirements (GDPR, BCBS 239). See data lineage for implementation guidance.

### Data observability

Observability tools monitor data quality automatically, detecting anomalies — unexpected volume changes, unusual null rates, statistical distribution shifts — without explicit rules being defined for each check. Tools: Monte Carlo, Bigeye, Metaplane.

Observability is most valuable when the data environment is large and evolving fast enough that explicit quality rules cannot keep pace. For organisations with simpler environments, dbt tests plus pipeline monitoring cover most quality requirements.

Common stack combinations

**The standard mid-market stack**: Fivetran + Snowflake + dbt Cloud + Airflow (or dbt Cloud scheduler) + Tableau or Power BI. This is the most common combination for mid-market organisations with 1–10 data engineers. Well-supported, good community, extensive documentation.

**The Google stack**: Fivetran or Airbyte + BigQuery + dbt + Cloud Composer (Airflow) + Looker. Natural for GCP-first organisations with significant BigQuery investment. Looker's BigQuery integration is native.

**The Azure/Microsoft stack**: ADF (Azure Data Factory) or Fivetran + Azure Synapse or Microsoft Fabric + dbt or Synapse transforms + Azure Synapse Pipelines (Airflow equivalent) + Power BI. For Microsoft-first organisations, especially those moving to Microsoft Fabric, the native tooling is often the right choice over independent tools.

**The ML-first stack**: Airbyte + Databricks (Delta Lake) + dbt + Dagster + Tableau or Power BI for SQL analytics + MLflow + Databricks model serving. For organisations where ML and AI workloads are as important as SQL analytics.

For cost benchmarks by stack size, what is a data warehouse covers cloud warehouse cost ranges ($2,000–$8,000/month for mid-market) and what is a data lakehouse covers lakehouse infrastructure costs ($3,000–$15,000/month).

When the modern data stack is NOT the right answer

The modern data stack is not universally the right architecture. Three scenarios where it may not be:

**Very small data volume and simple analytics requirements.** If your organisation has a handful of data sources, a small analytical user base, and reporting requirements that a few well-designed Power BI reports on a SQL Server database can meet — the modern data stack adds operational complexity without proportional value. Build what you need, not what the architecture pattern calls for.

**Real-time requirements that batch pipelines cannot meet.** The standard modern data stack is a batch architecture. Fivetran syncs on a schedule; dbt runs on a schedule; warehouses are queried after data has been loaded. For use cases requiring sub-minute data freshness (fraud detection, IoT sensor monitoring, live operational dashboards), a streaming architecture built on Kafka and Flink is required alongside or instead of the batch stack. See real-time data architecture for when streaming is genuinely needed.

**Organisations with strict data residency requirements and no approved cloud.** If regulatory or security requirements prohibit cloud data storage, a cloud-native stack is not an option. This is increasingly rare as cloud providers achieve the certifications required by most regulated industries, but it applies to certain defence and government contexts.

What does it cost to build?

Professional services to design, build, and deploy a modern data stack for a mid-market organisation typically ranges from $80,000–$250,000 depending on:

- Number of source systems to integrate

- Data volume and warehouse sizing requirements

- Complexity of transformation logic

- Whether custom connectors are required for bespoke source systems

- BI migration scope (migrating existing reports vs building from scratch)

- Governance and security requirements

Ongoing infrastructure costs: cloud warehouse ($2,000–$8,000/month for mid-market), Fivetran ($1,000–$5,000/month depending on connector count and volume), dbt Cloud ($100–$2,000/month depending on tier), BI tool licensing ($3,000–$15,000/month depending on platform and user count).

For a detailed engagement cost breakdown, data architecture consulting cost covers the full pricing spectrum.

Where to start

If you are building a modern data stack from scratch, the sequence that produces the least rework:

**First:** Define your analytics requirements before selecting tools. What decisions need to be made? What data is required to make them? Who needs access? What are the refresh frequency requirements? Tool selection follows requirements; requirements do not follow tool selection.

**Second:** Design the data architecture before building. A data architect — internal or external — should define the warehouse schema, the transformation layer design, the governance model, and the lineage requirements before engineers start building. Skipping this step produces a stack that needs to be rebuilt within 18 months.

**Third:** Start with ingestion and storage. Get your most important data sources reliably loading into the warehouse before building transformation logic or BI reports. The foundation must be stable before you build on it.

**Fourth:** Build transformation before BI. BI tools should query well-structured, tested, documented data models — not raw source tables. Building BI directly on source tables creates a maintenance burden and metric inconsistency that compounds over time.

**Fifth:** Implement governance from the start. Data cataloguing, lineage tracking, and access controls are ten times harder to retrofit than to build in. Establish the governance model when the platform is new; adding it to an established, undocumented platform requires a full audit before it can be implemented.

Our data architecture consulting practice designs and builds modern data stacks for mid-market and enterprise organisations. If you are starting a build or evaluating your current architecture against modern patterns, book a free 30-minute audit and we will tell you directly what we would do in your position.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →