Apache Airflow is the de facto standard for orchestrating data pipelines. This guide explains how Airflow works, its DAG-based task scheduling model, how it compares to modern alternatives like Prefect and Dagster, and when the complexity of a full orchestrator is actually warranted.
Airflow is an open-source platform for authoring, scheduling, and monitoring data pipelines. Built at Airbnb and open-sourced in 2015, it has become the dominant orchestration tool in the data engineering ecosystem. Thousands of organizations use it to schedule ETL pipelines, dbt runs, ML training jobs, and any other workflow that involves multiple dependent steps.
What Orchestration Means
Data pipelines rarely consist of a single step. Moving data from Salesforce to a data warehouse might require extracting records, loading them to a staging table, running validation checks, running dbt transformations, running data quality tests, and sending a Slack notification. Each step depends on the previous one. If the extraction fails, the load should not proceed. If validation fails, the transformation should not run.
Orchestration is the management of this dependency graph: scheduling when pipelines run, sequencing dependent steps, handling failures, retrying failed tasks, and providing visibility into what succeeded and what failed.
Without orchestration, data engineers either write custom scheduling scripts (fragile, undocumented, hard to debug) or use cron (simple for single steps, breaks down for dependent multi-step workflows). Airflow provides a purpose-built system for managing complex, dependent data workflows at scale.
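To make the sequencing, retry, and failure-handling behavior concrete, here is a minimal sketch in plain Python. It is not Airflow code: `run_with_retries`, `run_pipeline`, and the three step functions are hypothetical names invented for illustration, and a real orchestrator adds scheduling, persistence, and a UI on top of this idea.

```python
import time
from typing import Callable


def run_with_retries(step: Callable[[], None], retries: int = 2, delay_s: float = 0.0) -> bool:
    """Run one step, retrying on any exception; return True on success."""
    for attempt in range(retries + 1):
        try:
            step()
            return True
        except Exception:
            # Wait before the next attempt, unless this was the last one.
            if attempt < retries:
                time.sleep(delay_s)
    return False


def run_pipeline(steps: list[tuple[str, Callable[[], None]]], retries: int = 2) -> dict[str, str]:
    """Run steps in order; once one fails, mark every later step as skipped."""
    states: dict[str, str] = {}
    upstream_failed = False
    for name, step in steps:
        if upstream_failed:
            states[name] = "skipped"
            continue
        ok = run_with_retries(step, retries=retries)
        states[name] = "succeeded" if ok else "failed"
        upstream_failed = not ok
    return states


def extract() -> None:
    pass  # pretend the extraction worked


def validate() -> None:
    raise ValueError("row count mismatch")  # simulated validation failure


def transform() -> None:
    pass  # never reached in this example


states = run_pipeline([("extract", extract), ("validate", validate), ("transform", transform)])
print(states)
# {'extract': 'succeeded', 'validate': 'failed', 'transform': 'skipped'}
```

The sketch only handles a linear chain. Branching dependencies are where a graph-based model becomes necessary.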
DAGs: The Core Concept
Everything in Airflow is organized around DAGs — Directed Acyclic Graphs. A DAG is a Python file that defines:
- The tasks that make up a workflow
- The dependencies between them (task B runs after task A, task C runs after both A and B)
- The schedule (run daily at 2am, run hourly, run on demand)
- The configuration (what to retry on failure, how many parallel tasks to run, how long to wait before timing out)
The "directed acyclic" constraint means the dependency graph must flow in one direction — no circular dependencies where task A depends on task B which depends on task A.
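The acyclic constraint can be checked mechanically. The sketch below uses Python's standard-library `graphlib` module rather than anything from Airflow; `is_valid_dag` is a hypothetical helper, and the mapping format (task name to the set of tasks it depends on) is an assumption chosen for the example.

```python
from graphlib import CycleError, TopologicalSorter


def is_valid_dag(upstream: dict[str, set[str]]) -> bool:
    """Return True if the mapping (task -> upstream tasks) contains no cycle."""
    try:
        # prepare() raises CycleError when the graph has a circular dependency.
        TopologicalSorter(upstream).prepare()
        return True
    except CycleError:
        return False


# B runs after A; C runs after both A and B.
print(is_valid_dag({"b": {"a"}, "c": {"a", "b"}}))  # True

# A depends on B, which depends on A.
print(is_valid_dag({"a": {"b"}, "b": {"a"}}))  # False
```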
A simple DAG might look like:
- Task 1: Extract from source API
- Task 2a: Load to staging table (runs after Task 1)
- Task 2b: Run data quality check on source (runs after Task 1)
- Task 3: Run dbt transformations (runs after both 2a and 2b succeed)
- Task 4: Send Slack notification (runs after Task 3)
Airflow visualizes this as a graph in its web UI, with each task showing its current state: queued, running, succeeded, failed, skipped.
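The five-task example can be modeled with the standard library to see which tasks become runnable together. This is a sketch of the ordering logic only, not of Airflow's scheduler, and the task names and `execution_waves` helper are made up for the illustration. In an actual Airflow DAG file the same edges would normally be declared between task objects with the `>>` operator.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must succeed before it can run.
upstream = {
    "extract": set(),
    "load_staging": {"extract"},
    "quality_check": {"extract"},
    "dbt_transform": {"load_staging", "quality_check"},
    "notify_slack": {"dbt_transform"},
}


def execution_waves(upstream: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into waves; tasks in the same wave have no dependency on each other."""
    sorter = TopologicalSorter(upstream)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose upstream tasks are done
        waves.append(ready)
        sorter.done(*ready)  # pretend the whole wave succeeded
    return waves


print(execution_waves(upstream))
# [['extract'], ['load_staging', 'quality_check'], ['dbt_transform'], ['notify_slack']]
```

Tasks 2a and 2b land in the same wave, which is why an executor with more than one worker can run them in parallel.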
Airflow Architecture
**Scheduler** — the process that reads DAG files, determines which tasks are ready to run (dependencies met, scheduled time reached), and queues them for execution.
**Executor** — the component that actually runs tasks. The LocalExecutor runs tasks as subprocesses on the scheduler machine. The CeleryExecutor distributes tasks across a pool of worker machines. The KubernetesExecutor runs each task in its own Kubernetes pod, providing clean isolation and dynamic scaling.
**Workers** — the machines or processes that execute tasks. With CeleryExecutor or KubernetesExecutor, workers are separate from the scheduler, enabling horizontal scale.
**Webserver** — the web UI for monitoring DAGs, inspecting task logs, manually triggering runs, and managing variables and connections.
**Metadata database** — PostgreSQL or MySQL stores DAG definitions, task state, run history, variables, and connections. All Airflow components read from and write to this database.
Key Features
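**Sensors** — tasks that wait for an external condition before proceeding. A FileSensor waits for a file to appear at a filesystem path; an S3KeySensor does the same for an object in S3. An ExternalTaskSensor waits for a specific task in another DAG to succeed. Sensors check their condition at a configurable interval, so a DAG can react to external events without hand-written polling loops.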
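Conceptually, a basic sensor re-checks a condition at a fixed interval until it holds or a timeout expires. The sketch below shows that loop in plain Python. `wait_for` and `file_has_arrived` are hypothetical names, and the parameter names are loosely modeled on the `poke_interval` and `timeout` settings that Airflow sensors expose; this is not how Airflow implements sensors internally.

```python
import time
from typing import Callable


def wait_for(condition: Callable[[], bool], poke_interval_s: float, timeout_s: float) -> bool:
    """Check the condition every poke_interval_s seconds; give up after timeout_s."""
    deadline = time.monotonic() + timeout_s
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poke_interval_s)


# Simulated external condition: the "file" shows up on the third check.
checks = {"count": 0}


def file_has_arrived() -> bool:
    checks["count"] += 1
    return checks["count"] >= 3


arrived = wait_for(file_has_arrived, poke_interval_s=0.01, timeout_s=1.0)
print(arrived, checks["count"])  # True 3
```

A loop like this occupies its worker while it waits, which is one reason long-running sensors need attention in production deployments.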
**Hooks and Connections** — abstraction layer for connecting to external systems. Airflow maintains an encrypted connection store with credentials for databases, cloud services, and APIs. Hooks (PostgresHook, S3Hook, BigQueryHook, etc.) use these connections to interact with external systems from within tasks.
**Variables** — key-value store for pipeline configuration. Variables can be updated without modifying DAG code, enabling environment-specific configuration.
**XComs** — cross-communication mechanism for passing small values between tasks. Task A can push an XCom value (like a row count or a file path), and Task B can pull it to use as input.
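The push and pull pattern can be pictured as a small key-value store keyed by task and value name. The class below is an in-memory stand-in written for this illustration; `XComStore` and its methods are hypothetical and are not Airflow's API, where values are persisted in the metadata database.

```python
from typing import Any


class XComStore:
    """In-memory stand-in for a cross-task value store (hypothetical)."""

    def __init__(self) -> None:
        self._values: dict[tuple[str, str], Any] = {}

    def push(self, task_id: str, key: str, value: Any) -> None:
        """Record a small value under the pushing task's id and a key."""
        self._values[(task_id, key)] = value

    def pull(self, task_id: str, key: str, default: Any = None) -> Any:
        """Read a value pushed by another task, or the default if absent."""
        return self._values.get((task_id, key), default)


def extract(store: XComStore) -> None:
    rows = [{"id": 1}, {"id": 2}, {"id": 3}]
    # Push only the small summary value, not the rows themselves.
    store.push("extract", "row_count", len(rows))


def notify(store: XComStore) -> str:
    count = store.pull("extract", "row_count", default=0)
    return f"Loaded {count} rows"


xcom = XComStore()
extract(xcom)
message = notify(xcom)
print(message)  # Loaded 3 rows
```

Passing a row count or a file path this way keeps the shared store small; the data itself stays in the warehouse or object storage.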
**Pools** — resource slots that limit concurrent task execution. A pool with 5 slots allows at most 5 tasks from that pool to run simultaneously — useful for rate-limiting writes to a source system or managing database connection limits.
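The slot-limiting behavior resembles a counting semaphore. The following sketch uses Python threads to show 20 tasks competing for a 5-slot pool. The `Pool` class and `write_batch` function are hypothetical and only imitate the idea; Airflow enforces pools in its scheduler, not with in-process semaphores.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class Pool:
    """A fixed number of slots; tasks block until a slot is free (hypothetical)."""

    def __init__(self, slots: int) -> None:
        self._slots = threading.Semaphore(slots)
        self._lock = threading.Lock()
        self.running = 0
        self.peak = 0  # highest number of tasks observed running at once

    def run(self, task):
        with self._slots:  # take a slot, release it when the task ends
            with self._lock:
                self.running += 1
                self.peak = max(self.peak, self.running)
            try:
                return task()
            finally:
                with self._lock:
                    self.running -= 1


def write_batch() -> str:
    time.sleep(0.02)  # simulate a write to a rate-limited system
    return "ok"


pool = Pool(slots=5)
with ThreadPoolExecutor(max_workers=20) as executor:
    futures = [executor.submit(pool.run, write_batch) for _ in range(20)]
    results = [future.result() for future in futures]

print(len(results), pool.peak <= 5)  # 20 True
```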
**Backfill** — running a DAG for historical date ranges. If a new pipeline needs to populate historical data, Airflow can create and run one DAG run per schedule interval across that range, either through the DAG's `catchup` setting or through an explicit backfill command.
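For a daily schedule, a backfill amounts to one run per day in the requested range. The sketch below enumerates those dates in plain Python, assuming a daily schedule and an inclusive end date; `backfill_dates` is a hypothetical helper, not an Airflow function.

```python
from datetime import date, timedelta


def backfill_dates(start: date, end: date) -> list[date]:
    """Return one logical date per day from start to end, inclusive."""
    days = (end - start).days
    return [start + timedelta(days=offset) for offset in range(days + 1)]


runs = backfill_dates(date(2024, 1, 1), date(2024, 1, 7))
print(len(runs), runs[0].isoformat(), runs[-1].isoformat())
# 7 2024-01-01 2024-01-07
```

Each date would correspond to a separate DAG run, so tasks should be written to process only the data for their own date.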
Managed Airflow Options
Running Airflow infrastructure in-house requires managing the scheduler, workers, webserver, metadata database, and DAG file distribution — non-trivial operational overhead. Managed options reduce this burden:
**Amazon MWAA** (Managed Workflows for Apache Airflow) — fully managed Airflow on AWS. Integrates with IAM, S3 (for DAG storage), CloudWatch (for logging), and VPC.
**Google Cloud Composer** — fully managed Airflow on GCP. DAGs stored in GCS. Integrates with BigQuery, Dataflow, and Cloud Run.
**Astronomer** — commercial managed Airflow platform with CI/CD integration, role-based access control, and enterprise support. Astronomer also maintains the Astronomer Runtime image, its hardened distribution of Airflow.
Airflow vs Modern Alternatives
Airflow's dominance has attracted alternatives that address its real weaknesses.
**Prefect** — Python-native orchestration with a cleaner API, local execution for development without infrastructure, built-in observability, and a deployment model that does not require DAG file distribution. Prefect flows are Python functions with decorators, which many engineers find more intuitive than Airflow DAG syntax. Better suited for teams that want faster iteration and simpler local development.
**Dagster** — asset-centric orchestration that models pipelines around data assets (tables, files, models) rather than tasks. Dagster's asset model enables automatic lineage tracking, incremental processing, and better understanding of what data has been produced. Strong type system and first-class testing support. The most architecturally sophisticated of the three — higher learning curve, higher ceiling.
**dbt Core + Cloud** — for teams whose orchestration needs are primarily running dbt models, dbt Cloud's scheduler may be sufficient without a separate orchestration tool. dbt Cloud handles scheduling, testing, documentation generation, and alerting for dbt-only pipelines.
The honest comparison: Airflow has the largest ecosystem, the most documentation, and the widest familiarity among practitioners. If you are hiring data engineers, they have very likely used Airflow. Prefect has better developer experience for greenfield projects. Dagster has the most sophisticated architecture for complex data platforms. None of them is wrong.
When Airflow Is Warranted
Airflow adds real operational complexity. It is warranted when:
- Pipelines have multiple dependent steps with complex dependency logic
- Pipelines need reliable retry logic and failure alerting
- You need visibility into what ran, when it ran, and what failed
- Multiple pipelines share infrastructure and need resource pooling
- You need backfill capability for historical data loading
- Workflow complexity has outgrown cron jobs and shell scripts
Airflow is likely overkill when:
- Your entire pipeline is a single dbt run on a schedule — dbt Cloud handles this
- You have two or three simple, independent pipelines — cron with Slack alerting may be sufficient
- Your team has no Airflow experience and faster iteration with Prefect is more valuable
- You are early-stage and the infrastructure overhead is a significant distraction