BlogData Engineering

Apache Airflow: What It Is and How to Use It for Data Pipelines

James Okafor
James Okafor
Data & Cloud Engineer
·June 21, 202612 min read

Apache Airflow is the most widely deployed workflow orchestration platform for data engineering. It schedules, monitors, and manages complex data pipeline dependencies via Python-defined DAGs. Here is how it works and the patterns that make Airflow deployments maintainable.

The quick answer

Apache Airflow is an open-source workflow orchestration platform for authoring, scheduling, and monitoring data pipelines. You define pipelines as Directed Acyclic Graphs (DAGs) in Python — each node in the graph is a task, each edge is a dependency. Airflow handles scheduling, retries, alerting, and provides a web UI for monitoring pipeline status. It is the dominant orchestrator in the modern data stack, used to coordinate dbt runs, Spark jobs, ML training pipelines, API calls, and any multi-step process that needs scheduling, dependency management, and observability.

Core concepts

**DAG (Directed Acyclic Graph)**: the fundamental Airflow abstraction. A DAG is a Python file that defines a collection of tasks and the dependencies between them. "Directed" means dependencies flow in one direction (no cycles). "Acyclic" means no task can depend on itself, directly or indirectly. Each DAG has a schedule (a cron expression or a preset like daily, hourly), a start date, and task definitions.

**Task**: a unit of work within a DAG. Tasks are defined using operators — classes that define what a task does. Airflow ships with operators for common patterns: PythonOperator (run a Python function), BashOperator (run a shell command), SQLExecuteQueryOperator (run a SQL query), and dozens more for external services (Snowflake, BigQuery, dbt, Spark, AWS, GCP).

**Operator**: the class that defines task behaviour. Operators abstract the execution details so you can declare tasks at a high level. A SnowflakeOperator handles the Snowflake connection, executes a SQL statement, and reports success or failure — without you writing the connection logic.

**Sensor**: a special type of operator that waits for a condition to be true before proceeding. ExternalTaskSensor waits for a task in another DAG to complete. S3KeySensor waits for a file to appear in S3. Sensors are how Airflow handles event-driven triggers without polling logic in application code.

**TaskInstance**: each execution of a task within a specific DAG run. When a DAG runs, each task creates a TaskInstance with a state (queued, running, success, failed, skipped, upstream_failed). The web UI shows TaskInstance states for every DAG run.

**DAG Run**: one execution of a DAG for a specific logical date. Airflow's scheduling model is "data interval" based — a DAG scheduled daily at midnight creates a DAG run for each day's interval, with a logical date corresponding to the start of that interval. This matters for backfilling: you can trigger historical DAG runs for past dates.

**Executor**: the component that determines how tasks are actually executed. LocalExecutor runs tasks as subprocesses on the Airflow worker machine. CeleryExecutor distributes tasks across a pool of worker nodes using Redis or RabbitMQ as a broker. KubernetesExecutor launches each task as an isolated Kubernetes pod. The right executor depends on scale and infrastructure.

Setting up Airflow

**Install**: Airflow has complex dependency management — use the official constraint files for your Python version when installing. The typical production installation uses Docker or Kubernetes to run Airflow's components (webserver, scheduler, worker) as separate services.

**Managed options**: Amazon MWAA (Managed Workflows for Apache Airflow) on AWS, Cloud Composer on GCP, and Astronomer on any cloud are managed Airflow services that eliminate infrastructure management. For teams without dedicated platform engineering capacity, managed Airflow is worth the premium over self-hosted.

**Local development**: Astro CLI (Astronomer's CLI) or the official Docker Compose setup from the Airflow documentation are the standard for local development. Both spin up a complete Airflow environment (webserver, scheduler, metadata database, optional worker) via Docker Compose.

**Project structure**: Airflow reads DAG files from a configured directory (the dags_folder). In a typical project, you have a dags/ directory containing Python files — each file can define one or more DAGs. Supporting code (utilities, operators, hooks) goes in a plugins/ directory or a Python package within the project.

Writing your first DAG

A minimal DAG defines three things: a DAG object with ID, schedule, and start date; task definitions using operators; and dependency declarations (task_a >> task_b means task_b runs after task_a succeeds).

The with DAG(...) context manager syntax is the most common pattern in modern Airflow. The @dag decorator syntax is an alternative that makes DAG definitions look more like Python functions.

For task dependencies, Airflow uses the bit-shift operators: >> sets a downstream dependency (task_a >> task_b), << sets an upstream dependency. For multiple dependencies, chain(task_a, task_b, task_c) sets a sequential chain, or [task_a, task_b] >> task_c sets multiple upstreams.

Connecting to external systems

Airflow uses Connections to store credentials for external systems — databases, cloud services, APIs. Connections are stored in the Airflow metadata database (or a secrets backend like AWS Secrets Manager or HashiCorp Vault for production). Each Connection has a connection ID (a string like snowflake_prod or postgres_dw) and type-specific fields (host, login, password, schema, port, extras).

Operators reference connection IDs rather than embedding credentials. A SnowflakeOperator with snowflake_conn_id='snowflake_prod' uses whatever credentials are configured for that connection ID — keeping credentials out of DAG code and enabling rotation without code changes.

**Secrets backend**: for production, configure Airflow to read connections and variables from a secrets manager rather than the metadata database. This is a security requirement in regulated environments and best practice generally.

dbt integration

The most common Airflow use case in the modern data stack is orchestrating dbt runs. The DbtRunOperator and related operators (from packages like astronomer-cosmos or dbt-airflow) allow you to run dbt commands from Airflow tasks, with per-model task granularity if desired.

**astronomer-cosmos**: the most widely adopted dbt-Airflow integration library, maintained by Astronomer. It reads your dbt project and generates an Airflow DAG with one task per dbt model, preserving dbt dependencies as Airflow task dependencies. This gives per-model retry, per-model monitoring, and per-model alerting.

**BashOperator approach**: simpler — run dbt commands using BashOperator. One task for dbt run --select staging, another for dbt test. Less granular but simpler to maintain.

Scheduling patterns

**Cron schedules**: Airflow accepts standard cron expressions for scheduling. Common examples: daily at midnight (0 0 * * *), hourly (0 * * * *), weekdays at 6am (0 6 * * 1-5).

**Preset schedules**: Airflow provides string shortcuts — @daily, @hourly, @weekly, @monthly — as alternatives to cron expressions.

**Data-aware scheduling**: Airflow 2.4+ introduced Datasets — a way to trigger DAGs when upstream DAGs produce specific data assets, rather than on a time schedule. A DAG that produces a dataset triggers downstream DAGs that consume it, enabling event-driven pipelines without polling.

**Backfill**: Airflow can run DAGs for historical dates using airflow dags backfill. Useful for reprocessing after a pipeline fix, or for populating historical data when a new DAG is first deployed.

Monitoring and alerting

**Web UI**: the Airflow web UI provides DAG-level and task-level visibility — DAG run history, task states, execution logs for each task instance. The Graph View shows the DAG topology; the Grid View shows execution history across runs.

**SLA misses**: configure SLA (Service Level Agreement) on tasks — if a task does not complete within the defined window, Airflow triggers an SLA miss callback. SLA misses surface in the web UI and can send email alerts.

**Email alerting**: configure on_failure_callback on tasks or on the DAG to trigger when failures occur. The EmailOperator and Airflow's built-in email alerting (configured in airflow.cfg) send failure notifications. PagerDuty, Slack, and other alerting integrations are available as community providers.

**Metrics**: Airflow emits StatsD metrics for task durations, scheduler heartbeat, pool utilisation, and more. Ship these to Prometheus or Datadog for dashboards and alerting.

Common patterns and best practices

**Idempotency**: design tasks to be idempotent — running a task twice with the same inputs produces the same result as running it once. This is critical for safe retries. SQL tasks with INSERT ... SELECT are not idempotent; MERGE or DELETE + INSERT patterns are.

**Avoid logic in DAG files**: DAG files are parsed frequently by the Airflow scheduler. Heavy computation, database queries, or network calls in DAG-level code (outside of task callables) slow down scheduler performance. Keep DAG files lightweight; put complex logic in task callables or importable modules.

**Pools**: Airflow pools limit concurrency for resource-constrained systems. Define a pool for your database (e.g., max 5 concurrent tasks against the production warehouse) and assign relevant tasks to that pool to prevent overloading.

**Variables and environment-specific configuration**: use Airflow Variables (or environment variables, or a secrets backend) for configuration values that differ between environments. Avoid hardcoding environment-specific values in DAG code.

**XCom**: Airflow's mechanism for passing small data between tasks (task A pushes a value, task B pulls it). Keep XCom usage to small metadata (IDs, counts, status flags) — not large data payloads. Large inter-task data transfers should go through shared storage (S3, GCS, a database table).

Alternatives to Airflow

**Prefect**: Python-native orchestration with a simpler programming model and native support for dynamic workflows. Often preferred for teams building ML pipelines.

**Dagster**: asset-centric orchestration — the core abstraction is a data asset (like a table or file), not a task. Built-in data quality checks, lineage, and a strong development experience. Increasingly popular for modern data stacks.

**dbt Cloud**: for pure dbt orchestration, dbt Cloud's built-in scheduler eliminates the need for a separate orchestrator. If your pipelines are dbt-centric, this simplifies the stack.

**AWS Step Functions, GCP Cloud Workflows**: cloud-native workflow services without Airflow's infrastructure overhead. Less flexible for complex data pipelines but zero maintenance cost.

For the transformation layer that Airflow typically orchestrates, see what is dbt and data pipeline best practices. For the data ingestion that feeds pipelines, see fivetran vs airbyte.

Our data architecture consulting practice designs and implements modern data stacks — including Airflow deployment, DAG structure, and integration with dbt, Snowflake, and other stack components. If your current pipeline orchestration is fragile, hard to maintain, or missing observability, book a free 30-minute audit.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →