BlogData Engineering

What Is ETL? Extract, Transform, Load Explained

Austin Duncan
Austin Duncan
Project Manager & Data Strategist
·March 1, 202810 min read

ETL — Extract, Transform, Load — is the process of moving data from source systems into a destination, applying transformations along the way. This guide explains the ETL process, how it differs from ELT, the tools used for each stage, and when to use traditional ETL versus the modern ELT approach enabled by cloud data warehouses.

ETL — Extract, Transform, Load — is the process of moving data from source systems into a data warehouse or analytics environment. It is one of the most fundamental concepts in data engineering, and understanding it is prerequisite to understanding why most analytics environments are built the way they are.

The Three Stages

**Extract** is reading data from source systems. Sources may include transactional databases (Postgres, MySQL, SQL Server), SaaS applications (Salesforce, HubSpot, Stripe), flat files (CSV exports, SFTP drops), APIs (REST endpoints, webhooks), streaming systems (Kafka topics, Kinesis streams), and legacy systems (mainframes, AS/400 outputs). Extraction may be full (all records every run) or incremental (only records created or modified since the last run). Incremental extraction requires a reliable way to identify new and changed records — typically a timestamp column, a sequence ID, or a CDC (change data capture) mechanism.

**Transform** is applying business logic to convert raw source data into a form suitable for analysis. Transformations include: cleaning invalid or missing values, standardizing formats (date formats, currency, codes), joining data from multiple sources, aggregating records, computing derived metrics, applying business rules (revenue recognition logic, customer segmentation), and enforcing conformance to the target schema. In traditional ETL, transformation happens on a dedicated transformation engine before the data reaches the warehouse.

**Load** is writing the transformed data into the target system — typically a data warehouse, data mart, or data lake. Load patterns include full refresh (truncate and reload the entire table), incremental append (add new records only), upsert/merge (add new records, update changed records based on a key), and slowly changing dimension (SCD) management for tracking historical state.

ETL vs ELT

The rise of cloud data warehouses (Snowflake, BigQuery, Redshift) with abundant cheap compute changed the canonical approach from ETL to ELT — Extract, Load, Transform.

In **ELT**, raw data is loaded into the warehouse first, and transformation happens inside the warehouse using SQL. Tools like dbt (data build tool) have made this pattern widely adopted — you write SQL SELECT statements, dbt compiles them into CREATE TABLE or CREATE VIEW statements, and your transformation logic runs inside the warehouse.

The case for ELT over ETL:

**The warehouse is cheaper for transformation** — cloud warehouses have abundant compute at low cost. Running transformations in the warehouse eliminates the need for a separate transformation server.

**Raw data is preserved** — loading raw data first means you can re-transform it when requirements change. In traditional ETL, if the transformation was wrong, the raw data is gone and you must re-extract from the source.

**SQL is more accessible** — analytics engineers and data analysts can write and maintain dbt models in SQL. Traditional ETL transformation logic was often Java or Python code maintainable only by data engineers.

**Iteration is faster** — changing a transformation means changing a SQL file and rerunning dbt, not redeploying a data pipeline.

The case for ETL over ELT:

**Sensitive data** — if data must be masked or filtered before ever touching the warehouse, transformation before load is necessary. PII that should never land in the data warehouse must be handled at extraction.

**Complex transformations** — some transformations (ML-based enrichment, complex deduplication, third-party API lookups) do not belong in SQL and require Python or other languages. These are better handled in the extraction/transformation layer.

**Legacy warehouse constraints** — older warehouses may not have cheap enough compute to make in-warehouse transformation economical.

Common ETL/ELT Tools

**Managed ingestion (extraction layer)**: Fivetran, Airbyte, Stitch, Hevo. These tools handle the extraction complexity — incremental sync, schema detection, rate limit management, connector maintenance. They are the standard choice for extracting SaaS data sources. Fivetran is the market leader for enterprise-grade managed connectors.

**Transformation layer (dbt)**: dbt (data build tool) is the standard for SQL-based transformation in ELT pipelines. dbt Core is open source; dbt Cloud adds scheduling, documentation, and CI/CD. Analytics engineers write SELECT statements; dbt manages dependency ordering, tests, documentation, and incremental logic.

**Python-based orchestration**: Apache Airflow is the standard for orchestrating complex pipelines that include Python transformation steps, custom API integrations, or multi-step workflows. Prefect and Dagster are newer alternatives with better developer experience. Databricks Workflows integrates orchestration with Spark-based transformation.

**Traditional ETL tools**: Informatica PowerCenter, IBM DataStage, Talend, SSIS (SQL Server Integration Services). These tools dominated the pre-cloud era. They are still common in enterprises with legacy warehouse investments, regulatory requirements for on-premise data processing, or large investments in existing ETL processes.

Incremental Loading and Change Data Capture

Full refresh — dropping and reloading an entire table every run — is simple but expensive. For large tables, incremental extraction is essential.

The most common incremental extraction patterns:

**Timestamp filtering** — selecting only records where the "updated_at" column value is greater than the last successful run time. Simple to implement; breaks if source records do not have reliable update timestamps.

**CDC (Change Data Capture)** — reading the database's transaction log (binlog in MySQL, WAL in Postgres) to capture inserts, updates, and deletes at the row level without querying the source table. CDC is accurate, low-latency, and does not require timestamp columns. Debezium is the most common open-source CDC tool; Fivetran and Airbyte handle CDC for managed extraction.

**Sequence-based extraction** — using an auto-incrementing ID to capture new records. Works for append-only tables; does not capture updates or deletes.

Understanding your source system's capabilities and limitations — whether reliable timestamps exist, whether CDC is feasible, whether the source system can tolerate the query load of incremental extraction — is a prerequisite to designing reliable extraction.

ETL Pipeline Reliability

The quality of your analytics environment is only as good as the reliability of your ETL pipelines. Common failure modes include:

- Extract failures due to source system downtime, schema changes, or rate limiting

- Incremental logic errors that cause duplicate records or missed records

- Transformation logic bugs that produce wrong results silently

- Load failures that leave tables in a partially loaded state

- Schema drift when source systems add, rename, or remove columns

Reliable ETL pipelines require monitoring (alerting on failures), data quality checks (row count validation, null checks, referential integrity), idempotent loads (running the same pipeline twice produces the same result), and documented runbooks for common failure types.

Our data architecture practice designs and builds production-grade ETL and ELT pipelines — contact us to discuss your data integration requirements.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →