dbt (data build tool) is the standard transformation layer in modern data stacks. You write SQL SELECT statements; dbt handles execution, testing, documentation, and lineage. Here is what it does, how it works, and where it fits in your data architecture.
The quick answer
dbt (data build tool) is a SQL-based transformation framework that sits in the T of ELT (extract, load, transform). You write SELECT statements that define how to transform raw data into analytics-ready tables and views. dbt compiles those statements, manages execution order, runs tests on the outputs, and generates documentation. It is the standard tool for the transformation layer in modern data stacks built on Snowflake, BigQuery, Databricks, or Redshift.
What makes dbt different from writing SQL directly in your data warehouse: your transformation logic lives in version-controlled files (not in a warehouse GUI), tests validate outputs automatically on each run, documentation is generated from the models themselves, and lineage is captured automatically from model dependencies. The discipline of "transformation as code" — applying software engineering practices to data transformation — is what dbt enables.
What dbt does (and does not do)
dbt handles transformation. It does not ingest data (that is Fivetran, Airbyte, or ADF), does not store data (that is your warehouse), does not orchestrate pipelines (though dbt Cloud has a basic scheduler), and does not serve data to end users (that is Tableau, Power BI, or Looker).
Within transformation, dbt handles:
- **Model execution**: compiles and runs SELECT statements against the warehouse in the correct dependency order
- **Testing**: runs assertions about data quality after each model builds (are values non-null? are keys unique? do foreign keys resolve?)
- **Documentation**: generates a browsable data catalogue from model descriptions and column metadata
- **Lineage**: builds a dependency graph from model references, showing how data flows from source to consumption
What dbt does not handle natively: event-triggered pipeline runs (requires an orchestrator), real-time or streaming transformation (dbt is a batch tool), complex Python-based transformations (though dbt Python models are possible on Snowflake and BigQuery), and ingestion from source systems.
dbt Core vs dbt Cloud
**dbt Core** is the open-source CLI tool. It is free, runs anywhere, and handles model compilation and execution against any supported warehouse. dbt Core requires you to manage your own execution environment — running dbt manually or integrating it into an orchestration tool (Airflow, Dagster, Prefect).
**dbt Cloud** is dbt Labs' managed SaaS product built on top of dbt Core. It adds: a managed execution environment (no local install required), a web IDE for writing and testing models, a built-in scheduler for running jobs on a cadence, Slim CI (run only models affected by a code change), the semantic layer (MetricFlow), and the Explorer lineage interface. dbt Cloud Developer tier starts free for individual use; Team and Enterprise tiers are priced per seat.
For production use at any meaningful scale, dbt Cloud simplifies operations significantly versus self-managing dbt Core execution. The main case for sticking with dbt Core is when you are already running an orchestration platform (Airflow, Dagster) and want to invoke dbt from within it rather than paying for dbt Cloud's scheduler.
How dbt models work
A dbt model is a SQL file containing a single SELECT statement. The file name becomes the table or view name in the warehouse. dbt handles the CREATE TABLE AS / CREATE VIEW AS wrapping automatically.
A model file for a customer metrics table might be: SELECT customer_id, SUM(order_amount) AS total_revenue, COUNT(order_id) AS order_count, MAX(order_date) AS most_recent_order FROM staging.orders GROUP BY customer_id. That SELECT statement — nothing else — is the entire model.
When dbt runs, it executes this statement against the warehouse and materialises the result as either a view (rebuilt on each query) or a table (rebuilt on each dbt run). The materialisation type is configurable per model.
Dependencies between models are declared using the ref() function: ref('stg_orders') references the stg_orders model, not a hard-coded schema.table path. dbt reads these references, builds the dependency graph, and executes models in the correct order. If stg_orders must exist before orders_metrics can build, dbt ensures that ordering automatically.
The staging-intermediate-mart layer pattern
Most dbt projects follow a layered architecture:
**Staging models** (stg_ prefix) perform one-to-one transformations of raw source tables: renaming columns for consistency, casting data types, basic cleaning. One staging model per source table. Staging models are the single place where raw source data is translated into the canonical naming conventions of the data platform.
**Intermediate models** (int_ prefix, optional) handle complex joins and business logic that does not belong in staging but is not the final consumption model. Order line items joined to products, joined to categories, with promotional discounts applied. Intermediate models avoid duplicating complex logic in multiple mart models.
**Mart models** (marts/ directory, often no prefix or fct_/dim_ prefix) are the final analytics-ready tables that BI tools and analysts query. Fact tables (fct_orders, fct_sessions), dimension tables (dim_customers, dim_products), and wide denormalised tables for specific analytical use cases. Mart models are the Gold layer of the medallion architecture.
This layered pattern maps directly onto the medallion architecture (Bronze/Silver/Gold): staging is Silver (cleaned, standardised), marts are Gold (ready for consumption). For the medallion architecture context, see what is a data lakehouse.
dbt tests
dbt ships with four built-in schema tests: unique (values in a column are unique), not_null (values are not null), accepted_values (values are limited to a specified list), and relationships (a foreign key resolves to a row in the referenced table).
These tests are declared in YAML configuration files alongside models. A column definition might declare: unique: true and not_null: true. dbt test runs these assertions against the built model and fails if they are violated.
Custom tests extend the built-in tests: a positive_amounts test that asserts revenue columns are never negative, a valid_date_range test that asserts dates are within expected bounds, a row_count_comparison test that asserts two tables reconcile. Custom tests are SQL files that return zero rows if the test passes (any returned rows indicate a test failure).
The dbt_utils package (and other community packages from the dbt Hub) provides a library of commonly needed tests beyond the built-in four: expression_is_true (evaluate an arbitrary boolean SQL expression), recency (assert that data is recent enough), cardinality_equality (assert the distinct count of a column), and many more.
dbt tests run as part of each dbt build. If tests fail, the build can be configured to halt — preventing bad data from reaching downstream consumers.
dbt documentation
dbt generates a static documentation site from your project: every model, column, source, and test is documented. Model descriptions come from YAML files where you document what each model represents. Column descriptions document individual fields.
The generated docs site includes the lineage graph — a visual representation of how data flows from sources through staging and intermediate models to marts. The lineage graph makes it easy to understand the impact of a change: which downstream models depend on the model you are about to modify?
For organisations looking for a lightweight data catalogue, dbt Docs serves as one without additional tooling. For more complete cataloguing (ownership, classification, access policies, external lineage), tools like Atlan, Datahub, or Microsoft Purview integrate with dbt and ingest the dbt-generated metadata.
dbt and the semantic layer
dbt's semantic layer (powered by MetricFlow) allows you to define business metrics — revenue, active users, churn rate — in dbt rather than in BI tools. Metrics defined in dbt are then queryable by semantic layer-aware BI tools, ensuring that "monthly recurring revenue" means the same thing in every tool that queries it.
This approach solves the metric consistency problem at the platform level: if Power BI and Tableau both query MRR through the dbt semantic layer, they cannot produce different numbers. Without the semantic layer, each BI tool's MRR calculation is defined separately and can drift.
The dbt semantic layer is available on dbt Cloud Team and Enterprise tiers. For the broader context of semantic layers and why they matter, see what is a semantic layer.
dbt in the modern data stack
dbt's role in the modern data stack is the T in ELT. The full stack: Fivetran (or Airbyte) loads raw data into Snowflake (or BigQuery or Databricks). dbt transforms raw data into analytics-ready models. Tableau (or Power BI or Looker) consumes dbt mart models for reporting and analysis.
dbt integrates with orchestration tools: Airflow's dbt operators, Dagster's dbt integration (Dagster is particularly well-suited to dbt's asset model), and Prefect's dbt flows allow dbt runs to be managed as part of broader pipeline workflows. For pipelines where ingestion (Fivetran sync) must complete before transformation (dbt run) begins, orchestration tools manage that dependency.
For the full modern data stack context, see how to build a modern data stack.
When dbt is the right choice
dbt is the right transformation tool when:
- Your data warehouse is a cloud warehouse (Snowflake, BigQuery, Databricks, Redshift, Synapse) — dbt compiles against these natively
- Your transformation logic is primarily SQL-expressible — dbt is a SQL tool; complex Python transformations are possible but not its strength
- You want version control, testing, and documentation for transformation logic — dbt's primary value proposition
- Your team has SQL skills — dbt is accessible to analytics engineers and SQL-proficient analysts, not just data engineers
- You are building a new data platform and want to follow modern data stack conventions from the start
dbt may not be the right choice when: transformation logic requires complex Python that SQL cannot express cleanly, you are in an on-premise environment without a supported warehouse, your organisation is heavily invested in a different transformation tool (Informatica, Talend, SSIS) and migration cost is prohibitive, or your data engineering team is more comfortable with Python-first tools (Spark, PySpark) than SQL.
For the broader transformation pattern context and when to choose ETL vs ELT, see ETL vs ELT: which pipeline pattern should you use.
Getting started
dbt Core can be installed with pip (pip install dbt-snowflake for Snowflake, pip install dbt-bigquery for BigQuery, etc.). The dbt init command scaffolds a new project. The official dbt documentation at docs.getdbt.com is comprehensive and well-maintained.
For production use, dbt Cloud's free Developer tier is the fastest path to a working environment — no infrastructure management, a web IDE, and a scheduler for running jobs on a cadence.
If you are building a data platform and want expert guidance on the transformation layer design — model architecture, testing strategy, semantic layer implementation, dbt Cloud vs Core decision — our data architecture consulting practice has built dbt-based transformation layers on Snowflake, BigQuery, and Databricks. Book a free 30-minute audit to discuss your specific requirements.
A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.
Book a Call →