Apache Airflow for Data Pipelines: Orchestration at Scale

Apache Airflow is the most widely deployed workflow orchestrator for data pipelines — used by data engineering teams to schedule, monitor, and manage the dependencies between extraction, transformation, and loading jobs. This guide covers Airflow's core concepts, DAG design patterns for data pipelines, and the operational practices that make Airflow reliable in production.

Apache Airflow is the most widely deployed data workflow orchestrator — a platform for defining, scheduling, and monitoring complex data pipelines that have dependencies between tasks and require reliable execution and observability. It was created at Airbnb in 2014 and open-sourced in 2015, joining the Apache Software Foundation in 2019. Airflow's core concept is the Directed Acyclic Graph (DAG): a workflow defined as a graph of tasks with directed dependencies, ensuring tasks execute in the correct order.

Core Concepts

**DAG (Directed Acyclic Graph)** — a collection of tasks with defined dependencies. A DAG defines a workflow: what runs, in what order, and on what schedule. DAGs are defined in Python files in Airflow's DAGs directory.

**Task** — the unit of work in a DAG. Each task performs one operation: extracting data from an API, running a SQL query, executing a dbt model, or sending a notification. Tasks are connected by directed edges that define execution order.

**Operator** — the type of a task. Airflow provides operators for common operations: PythonOperator (run a Python function), BashOperator (run a shell command), SQLOperator (run SQL against a database), and many data-specific operators (S3ToRedshiftOperator, BigQueryInsertJobOperator, etc.). The operator determines what the task does; the task instance is the specific execution of an operator with given arguments.

**Scheduler** — the Airflow component that continuously evaluates DAGs against their schedules, determines which tasks are ready to run (dependencies met, no upstream failures), and submits them to the executor.

**Executor** — the component that runs tasks. The LocalExecutor runs tasks on the scheduler's machine; the CeleryExecutor distributes tasks across a worker pool; the KubernetesExecutor runs each task in a separate Kubernetes pod. Most production deployments use CeleryExecutor or KubernetesExecutor for scalability and isolation.

**Webserver** — the Airflow UI: a web interface showing DAG status, task logs, execution history, and configuration.

DAG Design for Data Pipelines

Data pipeline DAGs typically follow a pattern:

1. **Extract** — tasks that pull data from sources (APIs, databases, file systems) and stage it to a data lake or landing zone.

2. **Load** — tasks that load staged data into the data warehouse.

3. **Transform** — tasks that run transformation logic (dbt, SQL, Spark) to produce analytics-ready tables.

4. **Validate** — tasks that run data quality checks and send alerts if checks fail.

5. **Notify** — tasks that signal downstream consumers (dashboards, ML pipelines) that fresh data is available.

A simple data pipeline DAG in Python (using Airflow's TaskFlow API, introduced in Airflow 2.0):

from airflow.decorators import dag, task

from datetime import datetime

@dag(schedule_interval='0 6 * * *', start_date=datetime(2024, 1, 1), catchup=False)

def daily_sales_pipeline():

@task()

def extract_from_salesforce():

# Call Salesforce API, write to S3

pass

@task()

def load_to_warehouse():

# Load from S3 to Snowflake landing tables

pass

@task()

def run_dbt_transformations():

# Execute dbt run for sales models

pass

extract_from_salesforce() >> load_to_warehouse() >> run_dbt_transformations()

daily_sales_pipeline()

Dependency Management

Airflow's dependency system supports:

**Sequential execution** — task_a >> task_b >> task_c runs tasks in order.

**Parallel execution** — [task_a, task_b] >> task_c runs task_a and task_b in parallel, then task_c after both complete.

**Conditional execution** — BranchPythonOperator selects which downstream task to execute based on a runtime condition. Use sparingly; complex branching logic is difficult to monitor and debug.

**Cross-DAG dependencies** — ExternalTaskSensor waits for a task in another DAG to complete before proceeding. Useful for pipelines that have upstream dependencies on other teams' workflows.

Scheduling and Catchup

Airflow schedules DAGs using cron syntax or preset intervals (daily, hourly, weekly). The schedule_interval parameter defines when the DAG runs.

**Catchup** — when a DAG is enabled for the first time with a start_date in the past, Airflow attempts to run all missed intervals by default. For most data pipelines, this is not desirable — set catchup=False to prevent historical backfilling.

**Data intervals** — Airflow schedules DAGs at the end of each data interval. A daily DAG with schedule_interval='@daily' scheduled for 2024-01-02 runs the logic for the 2024-01-01 data interval. The task sees ds = '2024-01-01' — the start of the data interval it processes.

Task Retry and Failure Handling

Airflow supports configurable retry behaviour per task:

- **retries** — number of retry attempts on failure (default 0)

- **retry_delay** — time to wait between retries

- **retry_exponential_backoff** — exponential backoff between retries

- **on_failure_callback** — a Python function called when the task fails after all retries

Production data pipelines should define appropriate retry behaviour for tasks that may fail transiently (network calls to external APIs) versus tasks where retrying without investigation is inappropriate (data transformation tasks where failure indicates a data quality issue).

Monitoring and Alerting

Airflow provides built-in monitoring:

**Task status** — each task instance has a status: queued, running, success, failed, skipped, upstream_failed. The DAG graph view shows status visually.

**Task logs** — each task instance generates logs accessible in the Airflow UI and storable in configured log backends (S3, GCS, Azure Blob).

**SLA misses** — tasks can have SLA (Service Level Agreement) thresholds; Airflow alerts when tasks exceed their SLA.

**Email notifications** — Airflow sends email on task failure, retry, and SLA miss when configured.

For production deployments, integrate Airflow alerting with monitoring systems (PagerDuty, Opsgenie, Slack) by configuring the on_failure_callback function to post to the appropriate channel.

Airflow vs Alternatives

Airflow is the right choice when:

- Complex dependency graphs with many tasks

- Existing Airflow expertise or infrastructure

- Need for a mature, battle-tested platform with broad operator support

Consider alternatives when:

- Small teams that find Airflow's operational complexity disproportionate to their needs

- Asset-oriented workflows fit better (Dagster)

- Python-native orchestration with simpler deployment (Prefect)

Our data architecture practice designs and implements Airflow orchestration architectures for enterprise data pipelines — contact us to discuss pipeline orchestration for your data stack.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →