
Airflow and dbt: Orchestrating Analytics Engineering Workflows

Austin Duncan
Project Manager & Data Strategist
November 10, 2027 · 11 min read

Running dbt inside Airflow requires deciding how granular your DAG should be — one task per dbt run, one task per dbt model, or something in between. The right approach depends on your monitoring requirements, failure recovery strategy, and the size of your dbt project. This guide covers the integration patterns, their trade-offs, and the operational considerations for production dbt-on-Airflow deployments.

How you represent dbt's model execution within Airflow's task graph determines the granularity of monitoring, the precision of failure recovery, and the operational overhead of maintaining the integration. There is no single right answer: the appropriate pattern depends on your project's size, your monitoring requirements, and the operational maturity of your data team.

Integration Options

There are three primary patterns for running dbt in Airflow, each with different characteristics:

Pattern 1: Single task — dbt run as a shell command

The simplest integration: one Airflow task that runs dbt run as a shell command. The entire dbt project runs as a single atomic unit.

    from airflow.operators.bash import BashOperator

    # One Airflow task runs the whole dbt project.
    run_dbt = BashOperator(
        task_id='run_dbt',
        bash_command='cd /dbt && dbt run --profiles-dir /dbt/profiles',
    )

Advantages: minimal setup, easy to understand.

Limitations: if any model fails, the whole task fails, and an Airflow retry re-runs the entire dbt project from scratch. The Airflow UI shows no per-model status, so finding out which models failed means reading dbt's logs rather than Airflow's task graph.

Appropriate for: small dbt projects (under ~50 models), teams with low Airflow proficiency, or situations where dbt's own job management handles monitoring adequately.
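With a single task, the list of failed models has to come from dbt itself. A minimal sketch, assuming dbt has written its run_results.json artifact with a results list whose entries carry unique_id and status fields; the function names and the default path (derived from the /dbt project directory used above) are illustrative assumptions:

```python
import json
from pathlib import Path

# Statuses treated as failures in this sketch.
FAILING_STATUSES = {"error", "fail", "runtime error"}


def failed_nodes(run_results):
    """Return the unique_id of every node with a failing status."""
    return [
        result["unique_id"]
        for result in run_results.get("results", [])
        if result.get("status") in FAILING_STATUSES
    ]


def load_run_results(path="/dbt/target/run_results.json"):
    """Load the artifact; the default path is an assumption, not a dbt requirement."""
    return json.loads(Path(path).read_text())
```

Calling failed_nodes(load_run_results()) in a follow-up task or alert callback would surface the failing models without anyone opening the raw dbt log. Nodes that dbt skipped because an upstream model failed are not reported here, since they did not fail themselves.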

Pattern 2: dbt Cloud jobs triggered from Airflow

If you use dbt Cloud, Airflow can trigger dbt Cloud jobs through the dbt Cloud API and wait for them to complete. The DbtCloudRunJobOperator (from the apache-airflow-providers-dbt-cloud provider package) handles this:

    from airflow.providers.dbt.cloud.operators.dbt import DbtCloudRunJobOperator

    # Trigger an existing dbt Cloud job and block until it finishes.
    run_dbt_cloud = DbtCloudRunJobOperator(
        task_id='run_dbt_cloud_job',
        job_id=12345,
        dbt_cloud_conn_id='dbt_cloud_default',
        wait_for_termination=True,
    )

Advantages: dbt Cloud handles its own monitoring, alerting, and run history; Airflow only needs to trigger and wait.

Limitations: requires dbt Cloud. The dbt Cloud job is a black box from Airflow's perspective: model-level detail lives in the dbt Cloud UI, not in Airflow.

Appropriate for: teams using dbt Cloud that want Airflow to manage the broader pipeline orchestration (triggering dbt after upstream extraction completes) while delegating dbt execution detail to dbt Cloud.

Pattern 3: Per-model tasks using the Cosmos framework

Astronomer Cosmos (the open-source astronomer-cosmos package) parses the dbt project, for example from its manifest, and generates one Airflow task per dbt model. The full dbt DAG is represented as Airflow tasks in the correct dependency order.

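    from cosmos import DbtDag, ProjectConfig, ProfileConfig

    # Cosmos renders the dbt project as an Airflow DAG.
    dbt_dag = DbtDag(
        project_config=ProjectConfig('/dbt'),
        profile_config=ProfileConfig(...),
        schedule_interval='@daily',  # on Airflow 2.4+, `schedule` is the preferred name
        dag_id='dbt_dag',
    )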

Advantages: per-model visibility in the Airflow UI; per-model retry on failure; failure recovery runs only failed models and their downstream dependents; full dbt lineage visible in Airflow.

Limitations: for large dbt projects, generating hundreds or thousands of Airflow tasks can strain the Airflow scheduler and UI. Performance testing on your specific Airflow infrastructure is required before adopting this pattern for large projects.

Appropriate for: teams that want full observability into dbt execution at the model level; projects where failure recovery precision matters; organisations with Airflow infrastructure capable of handling the task count.
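To get a feel for what per-model rendering involves, and for how many tasks a project would produce, the dependency graph can be read straight from dbt's manifest. This is an illustrative sketch and not Cosmos's implementation; it assumes a parsed manifest.json whose nodes entries carry resource_type and depends_on.nodes, and the function name is hypothetical:

```python
from graphlib import TopologicalSorter


def model_run_order(manifest):
    """Return model unique_ids in an order that respects dbt dependencies."""
    nodes = manifest.get("nodes", {})
    models = {
        uid for uid, node in nodes.items()
        if node.get("resource_type") == "model"
    }
    # Map each model to the models it depends on, ignoring sources, seeds and tests.
    graph = {
        uid: {
            dep for dep in nodes[uid].get("depends_on", {}).get("nodes", [])
            if dep in models
        }
        for uid in models
    }
    return list(TopologicalSorter(graph).static_order())
```

The length of the returned list is a rough lower bound on the task count Airflow would have to schedule, which is a useful number to have before load-testing the scheduler. Depending on how tests are rendered, the real count may be higher.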

Handling dbt Failures in Airflow

When a dbt model fails:

**Single-task pattern**: the Airflow task fails. On retry, every model executes again, including those that already succeeded. If the failure was transient (network issue, temporary resource constraint), the full re-run will succeed. If the failure was due to a data quality issue in a specific model, debugging requires reading dbt logs.

**Cosmos per-model pattern**: only the failed model's Airflow task fails. Downstream models that depend on the failed model have status upstream_failed. When the failed task is cleared together with its downstream tasks, only that task and its downstream dependents re-run. This is significantly more efficient for large projects and makes root cause identification easier.
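The difference in recovery cost between the two patterns can be made concrete by computing which models actually need to re-run after a failure. A small sketch, assuming a hypothetical parents mapping from each model to the models it depends on:

```python
from collections import deque


def rerun_set(parents, failed):
    """Return the failed model plus every model downstream of it."""
    # Invert the parent mapping so we can walk from a model to its dependents.
    children = {}
    for model, deps in parents.items():
        for dep in deps:
            children.setdefault(dep, set()).add(model)

    seen = {failed}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for child in children.get(current, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen
```

In the per-model pattern only this set re-runs. In the single-task pattern the retry covers every model in the project, whatever the size of this set.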

Separating Extraction, Loading, and Transformation in the DAG

The most robust Airflow-dbt integration treats extraction, loading, and transformation as separate DAG sections with explicit dependencies:

1. Extraction tasks (Fivetran, Airbyte, custom extractors) run first.

2. A source freshness sensor waits for sources to be fresh before triggering dbt.

3. dbt runs after sources are confirmed fresh.

4. Data quality checks run after dbt completes.

5. Notification tasks alert downstream consumers.

This structure ensures that dbt is never triggered against stale source data, and that downstream consumers are only notified when the full pipeline — including quality checks — has completed successfully.
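The freshness gate in step 2 could be driven by the artifact that dbt source freshness writes. The sketch below assumes a parsed sources.json with a results list whose entries have unique_id and status fields, and that a status of pass means the source is fresh; the function names and the choice to treat warnings as stale are assumptions:

```python
def stale_sources(freshness, allowed=("pass",)):
    """Return unique_ids of sources whose freshness status is not allowed."""
    return [
        result["unique_id"]
        for result in freshness.get("results", [])
        if result.get("status") not in allowed
    ]


def sources_are_fresh(freshness, allowed=("pass",)):
    """True only when every checked source has an allowed status."""
    return not stale_sources(freshness, allowed)
```

A function like sources_are_fresh fits the shape Airflow expects from a sensor's callable, returning True once dbt can safely start. Passing allowed=("pass", "warn") would let dbt proceed when sources are only past their warning threshold.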

Environment Variable Management

dbt in Airflow needs database credentials. Secure approaches:

**Airflow Connections**: store credentials in Airflow's connections store, which encrypts them at rest, or in a configured Airflow secrets backend, and pass them to dbt as environment variables at task runtime.

**AWS Secrets Manager / Vault**: for more complex credential management, fetch credentials from a secrets manager at task start.

**Docker / Kubernetes secrets**: when using KubernetesExecutor, mount credentials as Kubernetes secrets into the task pod.

Do not store credentials in DAG code or dbt profiles.yml committed to the repository.
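One way to keep profiles.yml free of secrets is to have it read every credential through dbt's env_var() function and to build the environment at task start. A minimal sketch; the variable names are hypothetical, and the DBT_ENV_SECRET_ prefix is used because dbt scrubs values of variables with that prefix from its logs:

```python
import os


def dbt_env(host, user, password, base_env=None):
    """Build the environment passed to the dbt process.

    The values would come from an Airflow Connection or a secrets manager,
    never from literals in DAG code.
    """
    env = dict(os.environ if base_env is None else base_env)
    env.update(
        {
            "DBT_HOST": host,  # hypothetical variable name
            "DBT_USER": user,  # hypothetical variable name
            "DBT_ENV_SECRET_PASSWORD": password,
        }
    )
    return env
```

The resulting mapping can be handed to BashOperator's env argument or to subprocess.run(..., env=env), with profiles.yml referencing the values as, for example, env_var('DBT_ENV_SECRET_PASSWORD').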

Monitoring dbt Runs in Airflow

Key monitoring considerations for dbt-in-Airflow:

**Log forwarding**: dbt logs should be forwarded to your centralised logging infrastructure (CloudWatch, Datadog, ELK) alongside Airflow's own task logs. This makes debugging after a failure easier without needing to navigate to the Airflow UI.

**SLA monitoring**: set Airflow task SLAs on dbt runs so that a run exceeding its expected duration raises an alert before downstream consumers receive stale data.

**Model-level run times**: if using per-model tasks (Cosmos), track task durations over time to identify models whose run time is increasing. Rising run time is a signal of data growth that may call for a change of incremental strategy.
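Tracking run-time growth can start very simply. The sketch below assumes you already collect per-model durations in seconds, oldest first, from Airflow task durations or dbt's artifacts; the function name, the window size and the 1.5x threshold are illustrative choices:

```python
from statistics import mean


def slowing_models(history, recent=3, threshold=1.5):
    """Flag models whose recent mean duration exceeds their earlier mean.

    history maps a model name to its run durations in seconds, oldest first.
    Returns a dict of model name to the recent/baseline ratio.
    """
    flagged = {}
    for model, durations in history.items():
        if len(durations) < 2 * recent:
            continue  # not enough runs to compare two windows
        baseline = mean(durations[:-recent])
        latest = mean(durations[-recent:])
        if baseline > 0 and latest / baseline >= threshold:
            flagged[model] = round(latest / baseline, 2)
    return flagged
```

For example, a model whose last three runs averaged 100 seconds against an earlier average of 61 seconds would be flagged with a ratio of 1.64, making it a candidate for an incremental materialisation review.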

