dbt and Databricks: Analytics Engineering on the Lakehouse

dbt on Databricks bridges the gap between data lake flexibility and data warehouse structure — dbt models define business logic as SQL or Python transformations that Databricks executes via Spark or Photon. For organisations running Databricks as their primary data platform, dbt provides the transformation layer governance, testing, and documentation that raw Databricks notebooks cannot.

dbt on Databricks combines dbt's transformation governance, testing, and documentation with Databricks' Spark-based compute and Delta Lake storage. The combination is increasingly common in organisations that chose Databricks as their primary data platform and need the structure and reliability of analytics engineering workflows on top of the flexible, compute-intensive workloads Databricks enables.

The Databricks Data Platform Context

Databricks differs from traditional cloud data warehouses in its architecture and primary use cases. Where Snowflake and BigQuery are designed for SQL-centric analytics workloads, Databricks runs Apache Spark — a distributed compute engine that handles both SQL analytics and non-SQL workloads: machine learning, streaming, complex data transformations involving Python and R, and large-scale data processing that exceeds SQL's expressive power.

The implication for dbt: dbt on Databricks runs models as Spark SQL or Python, using Delta Lake as the storage format. Delta Lake provides ACID transactions, schema enforcement, time travel, and MERGE operations on top of cloud object storage — the properties that make Databricks behave more like a data warehouse and less like a raw data lake.

Databricks Adapter Configuration

The dbt-databricks adapter:

dbt_project_name:

target: prod

outputs:

prod:

type: databricks

host: "your-workspace.azuredatabricks.net"

http_path: "/sql/1.0/warehouses/your-warehouse-id"

token: "your-databricks-token"

schema: analytics_prod

catalog: main

Key configuration fields:

**host** — the Databricks workspace hostname.

**http_path** — the HTTP path for the SQL warehouse or cluster. SQL warehouses (Databricks SQL) are recommended over interactive clusters for dbt production workloads — they have lower startup time and are purpose-built for SQL analytics.

**token** — a Databricks personal access token or service principal token. For production, use a service principal rather than a personal user token.

**catalog** — the Unity Catalog catalog containing the target schema. Unity Catalog is Databricks' enterprise data governance layer; using Unity Catalog-enabled workspaces and specifying the catalog in the adapter configuration enables fine-grained access control at the catalog, schema, and table levels.

Unity Catalog and dbt

Unity Catalog (UC) is Databricks' centralised governance layer — a three-level namespace (catalog.schema.table) with centralised access controls, lineage tracking, and data discovery. For organisations running dbt on Databricks, Unity Catalog changes several aspects of how dbt projects are structured:

**Three-level namespace** — dbt models in a UC-enabled workspace are referenced as catalog.schema.table. The dbt-databricks adapter handles this transparently when catalog is configured in the profile. Cross-catalog references (a model in catalog A referencing a model in catalog B) require explicit database configuration on the model.

**Access control** — table-level read and write permissions in UC are granted to service principals rather than to IP addresses or cluster-level credentials. The service principal running dbt needs CREATE TABLE privileges in the target schema and SELECT privileges on source schemas.

**Lineage** — Unity Catalog tracks lineage automatically: every time a dbt model runs and creates or modifies a table, UC records the lineage from the source tables. This automatic lineage complements dbt's own lineage graph and is visible in the Databricks UI without requiring any dbt-specific configuration.

Materialisation Strategies on Databricks

dbt on Databricks supports the standard materialisation types plus Databricks-specific options:

**view** — creates a Delta view. As on other platforms, views are not materialised and run the underlying query on each access.

**table** — creates a Delta table. Full rebuild on every run.

**incremental** — creates a Delta table with incremental updates. The recommended incremental strategy on Databricks is merge, which uses Delta Lake's native MERGE INTO statement for ACID-compliant upserts.

incremental with merge strategy:

{{ config(

materialized='incremental',

incremental_strategy='merge',

unique_key='order_id',

on_schema_change='sync_all_columns'

) }}

SELECT ...

{% if is_incremental() %}

WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})

{% endif %}

Delta Lake's MERGE INTO is fully ACID: it handles concurrent writes correctly and guarantees that the merge either fully succeeds or fully rolls back. This is a significant advantage over non-ACID incremental approaches.

**Python models** — dbt on Databricks supports Python models that run as Databricks notebooks via PySpark or pandas-on-Spark. Python models are appropriate for transformations that cannot be expressed in SQL: ML feature engineering, complex array manipulations, custom aggregations using Python libraries.

A Python model:

import pyspark.sql.functions as F

def model(dbt, session):

orders = dbt.ref("stg_orders")

return orders.withColumn("revenue_bucket",

F.when(F.col("revenue") < 100, "low")

.when(F.col("revenue") < 1000, "mid")

.otherwise("high"))

Python models use the same dependency management and testing as SQL models — they can reference other models via dbt.ref() and dbt tests can be applied to their outputs.

Performance Considerations on Databricks

Databricks performance characteristics differ from SQL-centric warehouses:

**Cluster startup time** — interactive Databricks clusters have startup times of 5–10 minutes. SQL warehouses have shorter startup times (30–60 seconds). For dbt CI/CD, cluster startup time adds significantly to run time; use SQL warehouses for dbt workloads where startup time matters.

**Photon engine** — Databricks' Photon vectorised query engine accelerates SQL workloads substantially. Photon is available on compute types labelled "Photon"; use Photon-enabled SQL warehouses for dbt production workloads where query speed matters.

**Z-ordering** — Delta Lake's Z-ORDER OPTIMIZE command physically reorganises data in a table to co-locate related values, improving query performance for multi-dimensional filter predicates similar to BigQuery clustering. For large Delta tables with frequently filtered dimensions, running OPTIMIZE with Z-ORDER as a post-hook after incremental model runs improves downstream query performance.

**Auto-optimisation** — enable Delta Lake auto-optimisation (autoOptimize and autoCompact properties) on large incrementally-written tables. Small file accumulation from incremental writes degrades query performance; auto-optimisation merges small files automatically.

When to Use dbt on Databricks vs dbt on a Separate Warehouse

Some organisations run Databricks for ML and data engineering workloads while using a separate cloud data warehouse (Snowflake, BigQuery) for SQL analytics. The question of where to run dbt depends on where the authoritative business-layer tables live:

**Run dbt on Databricks** when the primary analytical data — the cleansed, business-logic-applied tables that BI tools query — live in Delta Lake and are consumed primarily through Databricks SQL or directly from Delta via connectors.

**Run dbt on a separate warehouse** when data is moved from Databricks (for ML and raw processing) to the warehouse (for business analytics), and the transformation layer that produces business metrics lives in the warehouse.

Many modern stacks use a medallion architecture: raw data arrives in the Bronze layer, Databricks processes it through Silver (cleaning and standardisation), and dbt transforms the Silver layer into the Gold layer (business metrics). Whether dbt runs on Databricks or on a separate warehouse for the Gold layer depends on where Gold is stored.

Our data architecture practice designs dbt-on-Databricks transformation architectures including Unity Catalog governance and medallion architecture for enterprise clients — contact us to discuss your Databricks analytics engineering stack.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →