
Azure Data Factory: What It Is and How to Use It Effectively

James Okafor
Data & Cloud Engineer
June 17, 2026 · 11 min read

Azure Data Factory is Microsoft's cloud ETL/ELT and data integration service. It connects 90+ data sources, orchestrates pipeline runs, and integrates with the full Azure data stack. Here is how it works, when it is the right tool, and the patterns that make ADF pipelines maintainable.

The quick answer

Azure Data Factory (ADF) is Microsoft's cloud-based ETL/ELT and data integration service. It connects to 90+ data sources — SaaS applications, databases, file systems, cloud storage — and orchestrates data movement and transformation pipelines. In the Microsoft Azure data stack, ADF is the standard ingestion and pipeline orchestration layer between source systems and Azure data stores (Synapse, ADLS Gen2, Azure SQL Database, Databricks).

ADF sits between Fivetran-style managed connectors (simpler, less customisable) and fully custom code (Spark on Databricks, Python pipelines). It provides pre-built connectors and a visual pipeline designer while retaining the flexibility to embed custom code. The trade-off: more implementation effort than managed connectors, less operational overhead than custom code.

Core ADF concepts

### Pipelines

A pipeline is a logical container for activities that together perform a data movement or transformation task. Pipelines group related activities, define their execution order, and handle failure/retry logic. You might have: a pipeline that ingests from Salesforce, a pipeline that transforms and loads to Synapse, a pipeline that orchestrates both. Pipelines can call other pipelines (nested pipeline pattern).

### Activities

Activities are the units of work within a pipeline. ADF has three activity categories:

**Data movement activities**: the Copy Activity is the core data movement activity. It copies data from a source to a sink using one of ADF's 90+ connectors. Depending on how it is configured, it also handles schema mapping, compression, and parallel copy. Incremental loads are built around it, typically with a watermark or change-tracking pattern.

**Data transformation activities**: Mapping Data Flow (visual, code-optional transformation designer), Wrangling Data Flow (Power Query-based self-service transformation, now surfaced as the Power Query activity), Databricks notebook activity (run a Databricks notebook), HDInsight Hive/Pig/Spark activity, and stored procedure activity (run a stored procedure in Azure SQL or Synapse).

**Control flow activities** — If Condition, Switch, ForEach (iterate over a list and run an activity set for each item), Until (run until a condition is met), Wait, Set Variable, Execute Pipeline, Web (call an HTTP endpoint), Lookup (query a data store and pass the result to the next activity), Get Metadata.
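To make the relationship between pipelines, activities, and execution order concrete, here is a minimal sketch that builds a pipeline definition as a Python dictionary in the general shape of ADF's pipeline JSON. The names (`IngestTables`, `TableList`, `SqlTable`, `LakeFolder`, `etl.table_list`) are hypothetical, and the source and sink type names are assumptions that should be checked against the connector you actually use.

```python
import json


def activity(name, type_, type_properties, depends_on=None, **extra):
    """Build one activity entry; depends_on maps upstream name -> conditions."""
    entry = {
        "name": name,
        "type": type_,
        "dependsOn": [
            {"activity": upstream, "dependencyConditions": conditions}
            for upstream, conditions in (depends_on or {}).items()
        ],
        "typeProperties": type_properties,
    }
    entry.update(extra)  # e.g. inputs/outputs for a Copy activity
    return entry


# Lookup: read the list of tables to ingest (hypothetical control table).
lookup = activity(
    "LookupTables",
    "Lookup",
    {
        "source": {
            "type": "SqlServerSource",
            "sqlReaderQuery": "SELECT tableName FROM etl.table_list",
        },
        "dataset": {"referenceName": "TableList", "type": "DatasetReference"},
        "firstRowOnly": False,
    },
)

# Copy: runs once per item inside the ForEach below.
copy_one = activity(
    "CopyTable",
    "Copy",
    {"source": {"type": "SqlServerSource"}, "sink": {"type": "ParquetSink"}},
    inputs=[{
        "referenceName": "SqlTable",
        "type": "DatasetReference",
        "parameters": {"tableName": "@item().tableName"},
    }],
    outputs=[{"referenceName": "LakeFolder", "type": "DatasetReference"}],
)

# ForEach: iterate over the Lookup output, only after the Lookup succeeds.
for_each = activity(
    "ForEachTable",
    "ForEach",
    {
        "items": {"value": "@activity('LookupTables').output.value", "type": "Expression"},
        "isSequential": False,
        "activities": [copy_one],
    },
    depends_on={"LookupTables": ["Succeeded"]},
)

pipeline = {"name": "IngestTables", "properties": {"activities": [lookup, for_each]}}
print(json.dumps(pipeline, indent=2))
```

The execution order lives entirely in `dependsOn`: the ForEach waits for the Lookup, and the Copy is nested inside the ForEach rather than chained after it.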

### Datasets and Linked Services

A **Linked Service** is a connection definition — credentials and connection string for a data store (SQL Server, Salesforce, Blob Storage). Linked Services are defined once and reused across all activities that connect to the same data store.

A **Dataset** represents a specific data object within a linked service — a specific table in SQL Server, a specific folder in Blob Storage, a specific object in Salesforce. Datasets are defined at design time and referenced in activities.
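The split between the two is easier to see side by side. The sketch below, again as Python dictionaries in the general shape of ADF's JSON, defines one Linked Service and one parameterised Dataset that points at it. All names (`SalesSqlServer`, `KeyVaultLS`, `SqlTable`, the secret name) are hypothetical, and the property names should be verified against the connector documentation.

```python
# Linked Service: where the data store is and how to authenticate.
linked_service = {
    "name": "SalesSqlServer",
    "properties": {
        "type": "SqlServer",
        "typeProperties": {
            "connectionString": "Data Source=sql01.corp.example;Initial Catalog=Sales;User ID=adf_reader;",
            # Assumption: the password is kept in Key Vault, not inline.
            "password": {
                "type": "AzureKeyVaultSecret",
                "store": {"referenceName": "KeyVaultLS", "type": "LinkedServiceReference"},
                "secretName": "sales-sql-password",
            },
        },
    },
}

# Dataset: which object inside that store. Parameters let one dataset
# definition serve many tables.
dataset = {
    "name": "SqlTable",
    "properties": {
        "type": "SqlServerTable",
        "linkedServiceName": {
            "referenceName": linked_service["name"],
            "type": "LinkedServiceReference",
        },
        "parameters": {
            "schemaName": {"type": "String"},
            "tableName": {"type": "String"},
        },
        "typeProperties": {
            "schema": {"value": "@dataset().schemaName", "type": "Expression"},
            "table": {"value": "@dataset().tableName", "type": "Expression"},
        },
    },
}


def dataset_reference(name, **parameters):
    """Build the reference an activity uses to point at a dataset."""
    return {"referenceName": name, "type": "DatasetReference", "parameters": parameters}


ref = dataset_reference("SqlTable", schemaName="dbo", tableName="Customers")
print(ref)
```

One Linked Service and one parameterised Dataset can then cover every table in the database; only the parameter values change per activity.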

### Triggers

Triggers determine when pipelines run:

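**Schedule trigger**: run on a recurring wall-clock schedule (daily at 2am, every 4 hours, weekdays only). The schedule is defined as a recurrence with a frequency and interval rather than as a cron expression.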

**Tumbling window trigger**: fire at a fixed interval from a start time, producing contiguous, non-overlapping time windows. Useful for time-partitioned processing (process yesterday's data, starting each morning at 3am). Supports backfill and concurrency control.

**Event-based trigger**: run when an event occurs in Azure Event Grid — a new file lands in a Blob Storage container, a topic message arrives. Event-driven rather than schedule-driven.

**Manual trigger**: run on-demand via the ADF UI or REST API.
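The tumbling window idea is worth seeing in code, because backfill falls out of it naturally. This is a plain-Python sketch of how contiguous windows are derived from a start time and an interval; it illustrates the concept and is not ADF's implementation. In ADF, the window bounds reach the pipeline through the trigger outputs `windowStartTime` and `windowEndTime`.

```python
from datetime import datetime, timedelta


def tumbling_windows(start, end, interval):
    """Yield contiguous, non-overlapping (window_start, window_end) pairs.

    Only complete windows are produced: a window is emitted once its
    end time is at or before `end`.
    """
    cursor = start
    while cursor + interval <= end:
        yield cursor, cursor + interval
        cursor += interval


# Backfill example: a daily trigger whose start time is three days in the past.
windows = list(
    tumbling_windows(datetime(2024, 1, 1), datetime(2024, 1, 4), timedelta(days=1))
)
for window_start, window_end in windows:
    print(window_start.isoformat(), "->", window_end.isoformat())
```

Each window maps to one pipeline run, so setting a start time in the past produces one run per missed window, and a concurrency limit controls how many of those runs execute at once.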

### Integration Runtime

The Integration Runtime (IR) is the compute infrastructure that runs ADF activities. Three types:

**Azure Integration Runtime**: Microsoft-managed compute in Azure. No infrastructure management; billed per activity run and per DIU-hour (Data Integration Unit hour) of data movement. Use for cloud-to-cloud data movement between Azure services or public-cloud SaaS applications.

**Self-hosted Integration Runtime**: an agent installed in your on-premise network or private cloud, enabling ADF to reach data sources behind a firewall (on-premise SQL Server, Oracle, SAP, etc.). You manage the server that runs the self-hosted IR. Required for any on-premise data source.

**Azure-SSIS Integration Runtime**: a managed cluster for running SQL Server Integration Services (SSIS) packages in Azure. Lift-and-shift path for organisations with existing SSIS packages — run them in ADF without rewriting.
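In practice, the choice of runtime is attached to the Linked Service. As a rough sketch, a Linked Service for an on-premise source names the self-hosted IR in a `connectVia` reference. The runtime name `SelfHostedIR-DC1` and the server details are hypothetical.

```python
import copy


def bind_to_integration_runtime(linked_service, ir_name):
    """Return a copy of the linked service that connects through the named IR."""
    bound = copy.deepcopy(linked_service)
    bound["properties"]["connectVia"] = {
        "referenceName": ir_name,
        "type": "IntegrationRuntimeReference",
    }
    return bound


# Hypothetical on-premise SQL Server behind the corporate firewall.
on_prem_sql = {
    "name": "OnPremSqlServer",
    "properties": {
        "type": "SqlServer",
        "typeProperties": {
            "connectionString": "Data Source=sql01.corp.example;Initial Catalog=Sales;"
        },
    },
}

bound = bind_to_integration_runtime(on_prem_sql, "SelfHostedIR-DC1")
print(bound["properties"]["connectVia"])
```

A Linked Service without `connectVia` falls back to the default Azure Integration Runtime, which is why cloud-to-cloud connections usually omit it.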

Common ADF pipeline patterns

### Incremental loading with watermarking

For large tables, loading all data from scratch on every run (full refresh) is slow and expensive. Incremental loading reads only records that have changed since the last run.

A watermark-based incremental load: the pipeline reads the high-water mark (last successful load timestamp) from a control table, queries the source for records updated after that timestamp, copies them to the destination, and updates the control table with the new high-water mark. The ForEach activity iterates over multiple tables; parameterised datasets make the pattern reusable across any number of source tables.
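The watermark steps above can be sketched end to end with an in-memory SQLite database standing in for the source, destination, and control table. This shows the logic only; in ADF the same steps would typically be a Lookup, a Copy Activity, and a stored procedure or script activity. The table and column names (`control`, `src_customers`, `dst_customers`, `updated_at`) are hypothetical.

```python
import sqlite3


def incremental_load(conn, table):
    """Copy rows changed since the stored watermark, then advance the watermark.

    `table` is interpolated into SQL, so it must come from a trusted
    control list, never from user input.
    """
    cur = conn.cursor()
    (old_mark,) = cur.execute(
        "SELECT watermark FROM control WHERE table_name = ?", (table,)
    ).fetchone()
    # Fix the upper bound first so rows arriving mid-run wait for the next run.
    (new_mark,) = cur.execute(f"SELECT MAX(updated_at) FROM src_{table}").fetchone()
    if new_mark is None or new_mark <= old_mark:
        return 0  # nothing changed since the last successful load
    rows = cur.execute(
        f"SELECT id, name, updated_at FROM src_{table} "
        "WHERE updated_at > ? AND updated_at <= ?",
        (old_mark, new_mark),
    ).fetchall()
    cur.executemany(f"INSERT OR REPLACE INTO dst_{table} VALUES (?, ?, ?)", rows)
    # Advance the watermark only after the copy has been written.
    cur.execute(
        "UPDATE control SET watermark = ? WHERE table_name = ?", (new_mark, table)
    )
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE control (table_name TEXT PRIMARY KEY, watermark TEXT);
CREATE TABLE src_customers (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT);
CREATE TABLE dst_customers (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT);
INSERT INTO control VALUES ('customers', '2024-01-01T00:00:00');
INSERT INTO src_customers VALUES (1, 'Ada', '2023-12-31T09:00:00');
INSERT INTO src_customers VALUES (2, 'Grace', '2024-01-02T10:00:00');
""")

first = incremental_load(conn, "customers")   # picks up only the row changed after the watermark
conn.execute(
    "UPDATE src_customers SET name = 'Ada L.', updated_at = '2024-01-03T08:00:00' WHERE id = 1"
)
second = incremental_load(conn, "customers")  # picks up the updated row
third = incremental_load(conn, "customers")   # no changes, no work
print(first, second, third)
```

Two details matter in production: the watermark is advanced only after the copy succeeds, so a failed run is retried from the same point, and the upper bound is captured before reading so the window is stable for the whole run.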

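Where ADF offers native change data capture for the source (for example, SQL Server or Azure SQL Database with CDC enabled), it removes the watermarking complexity and provides row-level changes (insert/update/delete) without custom control table logic. Support varies by connector, so confirm it for your source. Log-based tools such as Oracle LogMiner and Debezium are separate CDC mechanisms, not ADF connectors.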

### Copy with schema mapping

Source schemas rarely match target schemas perfectly. The Copy Activity's schema mapping configuration maps source column names to destination column names, handles data type conversions, and applies column selection (copy only the columns you need). This is a metadata-level transformation — no custom code required for simple renames and type conversions.
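As an illustration, the mapping is expressed in the Copy Activity as a translator block. The helper below generates one from a simple Python dictionary. The column names are hypothetical, and the type names are examples of ADF's interim data types that should be checked against the schema mapping documentation.

```python
import json


def tabular_translator(column_map):
    """Build a Copy Activity translator from {source: (sink, sink_type or None)}.

    Columns that are not listed are not copied, which doubles as
    column selection.
    """
    mappings = []
    for source_name, (sink_name, sink_type) in column_map.items():
        sink = {"name": sink_name}
        if sink_type:
            sink["type"] = sink_type  # request a type conversion on the way in
        mappings.append({"source": {"name": source_name}, "sink": sink})
    return {"type": "TabularTranslator", "mappings": mappings}


translator = tabular_translator({
    "cust_id": ("CustomerId", "Int32"),         # rename and convert
    "cust_nm": ("CustomerName", None),          # rename only
    "created_ts": ("CreatedAt", "DateTime"),
})
print(json.dumps(translator, indent=2))
```

Generating the translator from a mapping table is one way to keep the rename rules in version control or a metadata store instead of clicking through each Copy Activity.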

For more complex transformations (splitting a source column, applying business logic, joining to a lookup table), Mapping Data Flows handle transformation without custom code.

### Mapping Data Flows

Mapping Data Flows is ADF's visual ETL designer — a canvas-based interface where you connect transformation steps (Source, Filter, Select, Derived Column, Join, Aggregate, Sink) to define transformation logic. Data Flows compile to Spark under the hood and run on a managed Spark cluster.

Data Flows are appropriate for: transformations that require joining multiple sources, applying complex business rules that the Copy Activity's schema mapping cannot handle, and organisations that want visual transformation design without writing Spark code.

Data Flows incur additional cost (Spark cluster compute on top of ADF activity costs) and have startup latency (Spark cluster initialisation takes 1–3 minutes). For simple column mapping and type conversion, the Copy Activity with schema mapping is faster and cheaper. Use Data Flows when the transformation complexity justifies it.

### Orchestrating external compute

ADF excels at orchestrating pipelines across multiple services — it does not need to run all the transformation logic itself. Common orchestration patterns:

**ADF + Databricks**: ADF triggers a Databricks notebook or job (via the Databricks activity), Databricks performs the heavy transformation (Spark-scale processing, ML pipelines), and ADF handles the scheduling, dependency management, and retry logic around it.

**ADF + dbt Cloud**: ADF triggers a dbt Cloud job via the Web Activity (calling the dbt Cloud API), waits for completion, and chains subsequent pipeline steps. ADF handles the scheduling; dbt handles the SQL transformation.

**ADF + stored procedures**: ADF calls a SQL stored procedure in Synapse or Azure SQL Database as part of a pipeline, combining ADF's scheduling and error handling with SQL-native transformation logic.
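The "trigger, then wait for completion" step in these patterns is a polling loop. In ADF it is usually built from a Web Activity inside an Until loop with a Wait; the Python sketch below shows the same control flow. The status codes are an assumption based on dbt Cloud's run status values and should be confirmed against the current API documentation, and `fetch_status` is a hypothetical callable standing in for the HTTP call.

```python
import time

# Assumed dbt Cloud run status codes; confirm against the current API docs.
SUCCESS, ERROR, CANCELLED = 10, 20, 30


def wait_for_run(fetch_status, poll_seconds=30, timeout_seconds=3600, sleep=time.sleep):
    """Poll until the external job reaches a terminal state.

    Returns True on success and False on error or cancellation.
    Raises TimeoutError if no terminal state is seen in time.
    """
    waited = 0
    while True:
        status = fetch_status()
        if status == SUCCESS:
            return True
        if status in (ERROR, CANCELLED):
            return False
        if waited >= timeout_seconds:
            raise TimeoutError(f"run still in status {status} after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Simulated run: queued, running, running, then success. No network, no real sleep.
statuses = iter([1, 3, 3, 10])
result = wait_for_run(lambda: next(statuses), sleep=lambda seconds: None)
print(result)
```

The timeout is the part most often forgotten: without it, a stuck external job holds the orchestrating pipeline open indefinitely.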

ADF vs alternatives

**Fivetran / Airbyte**: managed connectors for SaaS ingestion. Simpler to configure and maintain than ADF for supported sources (Salesforce, HubSpot, Stripe, etc.). Less flexible for custom sources, on-premise sources, or complex transformation requirements. Use Fivetran/Airbyte for SaaS-to-cloud ingestion where a managed connector exists; use ADF where on-premise sources, custom connections, or complex orchestration are required.

**Synapse Pipelines**: ADF capabilities integrated into Azure Synapse Analytics workspace. The same pipeline engine, the same connectors, the same Mapping Data Flows — just managed from within a Synapse workspace rather than a standalone ADF resource. For organisations using Synapse as their data warehouse, Synapse Pipelines eliminates a separate ADF service. For organisations not using Synapse, standalone ADF is the choice.

**Azure Data Factory vs Microsoft Fabric Data Factory**: Microsoft Fabric includes its own Data Factory (often called Data Factory in Fabric or Fabric Pipelines). The Fabric version supports similar pipeline patterns and connectors. For new builds on Microsoft Fabric, the Fabric-native Data Factory is the go-forward choice over standalone ADF.

**Apache Airflow** (self-managed on Azure, or managed through a provider such as Astronomer; Amazon MWAA and Google Cloud Composer are the AWS and GCP managed equivalents): more flexible orchestration but requires more operational management than ADF. Better for complex Python-based workflows and organisations with strong Airflow expertise. ADF is simpler to operate for the common enterprise pattern of scheduled data ingestion and transformation; Airflow is more capable for complex, code-first orchestration.

ADF monitoring and error handling

ADF provides a built-in monitoring interface: pipeline run history, activity-level run details, trigger history, and real-time run status. For each pipeline run, you can drill into individual activity runs to see input parameters, output results, and error messages.

For production environments, ADF diagnostic settings send pipeline run logs and metrics to Azure Monitor (Log Analytics), enabling custom alerting and dashboards. Configure alerts for: pipeline failure (alert on run status = Failed), long-running pipelines (alert when duration exceeds a threshold), and copy activity throughput drops (alert when bytes copied per hour drops below expected baseline).

Error handling in pipelines: the Copy Activity has configurable fault tolerance (skip failed rows, log errors, continue on column type mismatch). For pipeline-level error handling, the failure dependency path (labelled Upon Failure in the designer) runs error-handling activities when an upstream activity fails. Typical handlers are a notification activity (call a webhook, send an email) or a cleanup activity (roll back a partial load).
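A failure path is just a dependency with a different condition. The sketch below attaches a webhook notification to an upstream copy activity, using Python dictionaries in the general shape of ADF's pipeline JSON. The activity names and the webhook URL are hypothetical, and the body format is an assumption about what the receiving endpoint expects.

```python
def on_failure(upstream, handler):
    """Return a copy of `handler` that runs only if `upstream` fails."""
    wired = dict(handler)
    wired["dependsOn"] = [
        {"activity": upstream, "dependencyConditions": ["Failed"]}
    ]
    return wired


notify = on_failure(
    "CopySales",
    {
        "name": "NotifyFailure",
        "type": "WebActivity",
        "typeProperties": {
            "url": "https://example.invalid/hooks/adf-alerts",  # placeholder endpoint
            "method": "POST",
            "body": {
                "pipeline": "@{pipeline().Pipeline}",
                "runId": "@{pipeline().RunId}",
                "message": "CopySales failed",
            },
        },
    },
)
print(notify["dependsOn"])
```

One caveat worth knowing: when a failure handler runs and succeeds, the pipeline run can be reported as succeeded, so handlers that should not mask the failure often end with an explicit failing step.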

For ADF cost management: ADF pricing is based on pipeline activity runs, data movement (metered in DIU-hours), and Data Flow cluster time (metered in vCore-hours). Pipeline activities are inexpensive per run; Data Flow compute is the primary cost driver for transformation-heavy workloads. For cost-conscious architectures, minimise Data Flow usage for simple transformations (use the Copy Activity instead) and keep the Data Flow debug cluster at the smallest size that works, since the default debug cluster is often larger than development needs.
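A back-of-the-envelope estimate makes the cost drivers visible. The function below adds up the three components; the rates in the example are placeholders chosen for easy arithmetic, not Azure prices, so substitute the current figures for your region from the pricing page.

```python
def estimate_monthly_cost(
    activity_runs,
    diu_hours,
    data_flow_vcore_hours,
    *,
    rate_per_1000_runs,
    rate_per_diu_hour,
    rate_per_vcore_hour,
):
    """Split an ADF bill estimate into its three main components."""
    parts = {
        "orchestration": activity_runs / 1000 * rate_per_1000_runs,
        "data_movement": diu_hours * rate_per_diu_hour,
        "data_flow": data_flow_vcore_hours * rate_per_vcore_hour,
    }
    parts["total"] = sum(parts.values())
    return parts


# Hypothetical workload and placeholder rates (NOT real prices).
estimate = estimate_monthly_cost(
    activity_runs=30_000,
    diu_hours=120,
    data_flow_vcore_hours=400,
    rate_per_1000_runs=1.0,
    rate_per_diu_hour=0.25,
    rate_per_vcore_hour=0.5,
)
for component, cost in estimate.items():
    print(f"{component:>14}: {cost:8.2f}")
```

Even with made-up rates, the shape is typical: Data Flow compute dominates once it is in regular use, which is why moving simple transformations to the Copy Activity is usually the first cost lever.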

For the Azure data architecture context that ADF fits into, see Azure data architecture best practices. For how ADF compares to the ELT pattern used in modern data stacks (Fivetran + dbt), see ETL vs ELT.

Our cloud engineering practice builds ADF pipelines for Azure data platform clients — from ingestion from on-premise SQL Server and SAP to complex multi-step orchestration combining ADF, Databricks, and Synapse. If you are building an Azure data pipeline or evaluating ADF for your environment, book a free 30-minute audit.
