Data Pipeline Best Practices: How to Build Pipelines That Do Not Break

Data pipelines that fail silently, produce wrong data, or require constant manual intervention are the primary reason data teams lose stakeholder trust. Here is the engineering discipline that makes pipelines reliable.

The quick answer

Data pipelines that are reliable, maintainable, and observable share a set of engineering practices that poorly designed pipelines lack: idempotency (safe to re-run), thorough error handling (fail loudly, not silently), comprehensive testing (validated outputs, not just successful execution), observability (visibility into what is happening and why), and documentation (understandable by anyone who maintains the pipeline, not just the person who wrote it).

Pipelines that do not follow these practices become a major source of data team stress: on-call incidents, silent data corruption, hard-to-diagnose failures, and technical debt that makes every change risky.

Idempotency: the most important pipeline property

An idempotent pipeline produces the same result regardless of how many times it is run. Run it once — correct result. Run it twice — same correct result, no duplicates. Run it after a failure that left partial data — correct result, no corruption.

Non-idempotent pipelines are a common cause of data corruption incidents. A pipeline that inserts new records without checking for duplicates produces duplicates if it is re-run after a partial failure. A pipeline that truncates and reloads avoids duplicates but loses data if the reload fails after the truncate, unless both steps run in a single transaction.

Patterns that produce idempotent pipelines:

**UPSERT/MERGE rather than INSERT**: insert new records, update existing ones on a match. If a record already exists with the same key, update it rather than inserting a duplicate. Snowflake MERGE INTO, BigQuery MERGE, and dbt's incremental materialisation with the unique_key configuration all implement this pattern.
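As a minimal sketch of the upsert pattern, the example below uses Python's built-in sqlite3 module and SQLite's `INSERT ... ON CONFLICT DO UPDATE` as a stand-in for a warehouse MERGE. The `customers` table, its columns, and the `upsert_customers` helper are hypothetical names chosen for illustration.

```python
import sqlite3

def upsert_customers(conn, rows):
    """Insert new customers; update existing ones on a matching customer_id."""
    conn.executemany(
        """
        INSERT INTO customers (customer_id, email)
        VALUES (?, ?)
        ON CONFLICT(customer_id) DO UPDATE SET email = excluded.email
        """,
        rows,
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, email TEXT)")

batch = [(1, "a@example.com"), (2, "b@example.com")]
upsert_customers(conn, batch)
upsert_customers(conn, batch)  # re-run of the same batch: no duplicates

row_count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
```

Running the same batch twice leaves two rows, which is the property that makes a re-run after a partial failure safe.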

**Transactional writes**: write all data for a partition in a single transaction. Either the entire partition is written or nothing is. No partial states visible to readers. Delta Lake's ACID transactions and Snowflake's transactional writes provide this guarantee for data warehouse writes.

**Overwrite partitions atomically**: for batch pipelines writing to partitioned tables, overwrite the entire partition atomically — the new partition data replaces the old partition data in a single operation. If the write fails, the old partition remains intact. Do not append to partitions; overwrite them.
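One way to illustrate an atomic partition overwrite is a delete-and-insert inside a single transaction. The sketch below uses sqlite3 purely for demonstration; the `sales` table, the `load_date` partition column, and `overwrite_partition` are assumptions, and a real warehouse would use its own partition-replace mechanism.

```python
import sqlite3

def overwrite_partition(conn, load_date, rows):
    """Replace every row for load_date in one transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("DELETE FROM sales WHERE load_date = ?", (load_date,))
        conn.executemany(
            "INSERT INTO sales (load_date, order_id, amount) VALUES (?, ?, ?)",
            [(load_date, order_id, amount) for order_id, amount in rows],
        )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (load_date TEXT, order_id INTEGER, amount REAL NOT NULL)"
)

overwrite_partition(conn, "2024-01-01", [(1, 10.0), (2, 20.0)])
overwrite_partition(conn, "2024-01-01", [(1, 10.0), (2, 20.0)])  # re-run: still 2 rows

try:
    # The NULL amount violates NOT NULL, so the whole overwrite rolls back.
    overwrite_partition(conn, "2024-01-01", [(3, 30.0), (4, None)])
except sqlite3.IntegrityError:
    pass

rows_after = conn.execute(
    "SELECT order_id, amount FROM sales WHERE load_date = ? ORDER BY order_id",
    ("2024-01-01",),
).fetchall()
```

After the failed third write, the partition still holds the two rows from the last successful run rather than an empty or half-written state.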

**Deterministic incremental loads**: for incremental loads, use a stable, reliable watermark (the maximum processed timestamp or sequence number stored in a control table) rather than a relative time window (yesterday's data). A relative time window changes every time the pipeline runs; a control-table watermark is deterministic regardless of when the pipeline runs.
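The control-table watermark can be sketched as follows, again with sqlite3 as an illustrative stand-in. The `control`, `source_orders`, and `target_orders` tables and the `load_increment` function are hypothetical; the point is that the load and the watermark update commit together.

```python
import sqlite3

def load_increment(conn):
    """Copy source rows newer than the stored watermark, then advance it."""
    with conn:  # load and watermark update commit together or not at all
        (watermark,) = conn.execute(
            "SELECT last_seq FROM control WHERE pipeline = 'orders'"
        ).fetchone()
        new_rows = conn.execute(
            "SELECT seq, payload FROM source_orders WHERE seq > ? ORDER BY seq",
            (watermark,),
        ).fetchall()
        if not new_rows:
            return 0
        conn.executemany(
            "INSERT INTO target_orders (seq, payload) VALUES (?, ?)", new_rows
        )
        conn.execute(
            "UPDATE control SET last_seq = ? WHERE pipeline = 'orders'",
            (new_rows[-1][0],),
        )
    return len(new_rows)

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_orders (seq INTEGER PRIMARY KEY, payload TEXT);
    CREATE TABLE target_orders (seq INTEGER PRIMARY KEY, payload TEXT);
    CREATE TABLE control (pipeline TEXT PRIMARY KEY, last_seq INTEGER);
    INSERT INTO control VALUES ('orders', 0);
    INSERT INTO source_orders VALUES (1, 'a'), (2, 'b');
""")

first = load_increment(conn)   # loads seq 1 and 2
second = load_increment(conn)  # nothing new since the watermark
```

Because the watermark depends only on what has already been loaded, running the function at a different time of day, or twice in a row, does not change which rows it selects.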

Fail loudly, not silently

The most dangerous pipeline failure mode is silent failure: the pipeline completes successfully but produces wrong data. The scheduler shows a green checkmark; users see incorrect numbers; no one knows until an analyst discovers the discrepancy.

Causes of silent failure: unchecked null handling (SQL aggregates such as SUM skip NULLs, so a NULL in a field that should never be null silently underreports the total), type coercion (a string "N/A" in a numeric field is coerced to NULL or zero rather than causing an error), schema evolution (a source column renamed without updating the pipeline query, producing a column that is entirely NULL), and filter conditions that silently exclude records (a WHERE clause more restrictive than intended).

Engineering practices that prevent silent failure:

**Schema validation at ingestion**: validate that the incoming data matches the expected schema before processing. New columns, missing columns, and data type changes should trigger an alert, not proceed silently.
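A schema check at ingestion might look something like the sketch below. `EXPECTED_SCHEMA` and `validate_schema` are hypothetical, and records are assumed to arrive as Python dictionaries; a production pipeline would more likely rely on a schema registry or a validation library.

```python
# Hypothetical expected schema: column name -> Python type.
EXPECTED_SCHEMA = {"order_id": int, "amount": float, "currency": str}

def validate_schema(record, expected=EXPECTED_SCHEMA):
    """Return a list of schema problems for one record; empty means valid."""
    problems = []
    for column in sorted(expected.keys() - record.keys()):
        problems.append(f"missing column: {column}")
    for column in sorted(record.keys() - expected.keys()):
        problems.append(f"unexpected column: {column}")
    for column in sorted(expected.keys() & record.keys()):
        if not isinstance(record[column], expected[column]):
            problems.append(
                f"wrong type for {column}: got {type(record[column]).__name__}"
            )
    return problems

good = {"order_id": 1, "amount": 9.99, "currency": "GBP"}
bad = {"order_id": "1", "amount": 9.99, "region": "EU"}
problems = validate_schema(bad)
```

Returning the full list of problems, rather than stopping at the first, gives the alert enough context to show every way the source drifted.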

**Row count checks**: after each load, validate that the number of rows loaded falls within a tolerance band derived from historical patterns. A load that produces 10% of the expected rows is a failure even if the pipeline completed without error.
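A simple version of this check compares the loaded row count with the mean of recent runs. The 50% tolerance and the history values below are assumptions for illustration; the right band depends on how much your volumes normally vary.

```python
from statistics import mean

def row_count_ok(loaded, history, tolerance=0.5):
    """True if loaded is within +/- tolerance of the historical mean row count."""
    baseline = mean(history)
    return abs(loaded - baseline) <= tolerance * baseline

history = [1000, 1040, 980, 1010]  # made-up row counts from recent runs

normal_day = row_count_ok(1020, history)
suspicious_day = row_count_ok(100, history)  # roughly 10% of the usual volume
```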

**Business rule assertions**: after each transformation, assert that business rules hold. Revenue should be non-negative. Customer count should not drop by more than 10% week-over-week without an explanation. Transaction amounts should not exceed a known maximum. These assertions are encoded as dbt tests or data quality checks that fail the pipeline if violated.

**Explicit NULL handling**: treat unexpected NULLs as errors, not as default values. If a column that should never be null contains NULLs, fail the pipeline and alert — do not proceed with the NULL values.
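Business rule assertions and explicit NULL handling can share one check that raises on the first violation, as in this sketch. `DataQualityError`, `check_orders`, and the `max_amount` threshold are hypothetical; in a dbt project the same rules would typically be expressed as tests instead.

```python
class DataQualityError(Exception):
    """Raised to fail the run when a data quality rule is violated."""

def check_orders(rows, max_amount=100_000):
    """Raise DataQualityError on the first violation; return the row count otherwise."""
    for i, row in enumerate(rows):
        amount = row.get("amount")
        if amount is None:
            # Unexpected NULL: fail instead of treating it as zero.
            raise DataQualityError(f"row {i}: amount is NULL")
        if amount < 0:
            raise DataQualityError(f"row {i}: negative amount {amount}")
        if amount > max_amount:
            raise DataQualityError(f"row {i}: amount {amount} exceeds {max_amount}")
    return len(rows)

checked = check_orders([{"amount": 10.0}, {"amount": 25.5}])
```

Letting the exception propagate is what turns a silent data problem into a failed run that the scheduler and alerting can see.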

Testing pipelines

Pipeline testing goes beyond "the pipeline ran without an error." A pipeline can complete successfully and produce wrong data. Testing validates that outputs are correct, not just that the pipeline executed.

**Unit tests for transformation logic**: test individual transformations in isolation. For dbt models, use dbt's built-in tests (not_null, unique, accepted_values, relationships) plus custom tests for business rules. For Spark transformations, write PySpark unit tests that feed representative input DataFrames and validate the expected output.
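To keep the example dependency-free, the sketch below tests a plain Python transformation with the standard unittest module rather than dbt or PySpark. The `normalise_currency` function and the exchange rates are made up; the same structure applies to a function that takes and returns DataFrames.

```python
import unittest

def normalise_currency(amount, currency, rates):
    """Convert amount to GBP; unknown currencies raise instead of defaulting."""
    if currency not in rates:
        raise ValueError(f"no rate for currency {currency!r}")
    return round(amount * rates[currency], 2)

class NormaliseCurrencyTest(unittest.TestCase):
    RATES = {"GBP": 1.0, "USD": 0.5}  # made-up rates for the test

    def test_converts_known_currency(self):
        self.assertEqual(normalise_currency(10.0, "USD", self.RATES), 5.0)

    def test_rejects_unknown_currency(self):
        with self.assertRaises(ValueError):
            normalise_currency(10.0, "EUR", self.RATES)

suite = unittest.defaultTestLoader.loadTestsFromTestCase(NormaliseCurrencyTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```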

**Integration tests**: test the pipeline end-to-end against a representative sample of production data. Does the full pipeline produce the expected row counts, aggregate totals, and null rates when run against the test dataset?

**Regression tests**: after any change to a pipeline, compare the output against the previous known-correct output. Did the change break any existing queries? Did it change any metric values unexpectedly?

**Data diff testing**: for incremental pipelines, validate that each run produces the expected incremental change. Did today's load add the expected number of new records? Did it correctly update the modified records? Did it leave the records that should be unchanged untouched?
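A data diff can be as simple as comparing keyed snapshots taken before and after a run. The sketch below assumes each snapshot fits in memory as a dictionary of primary key to row; `diff_snapshots` is a hypothetical helper, and large tables would need a warehouse-side comparison instead.

```python
def diff_snapshots(before, after):
    """Compare two {key: row} snapshots and classify each key."""
    common = before.keys() & after.keys()
    return {
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        "changed": sorted(k for k in common if before[k] != after[k]),
        "unchanged": sorted(k for k in common if before[k] == after[k]),
    }

before = {1: {"status": "open"}, 2: {"status": "open"}}
after = {1: {"status": "open"}, 2: {"status": "closed"}, 3: {"status": "open"}}
diff = diff_snapshots(before, after)
```

A test can then assert on the diff directly, for example that exactly one record was added, one was updated, and none were removed.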

Error handling and recovery

**Alerting**: every pipeline failure should trigger an immediate alert to the on-call data engineer. The alert should include: which pipeline failed, at what step, with what error message, and what data was affected. Alerts that do not include enough context to diagnose the failure add friction to incident response.

**Retry logic with backoff**: transient failures (temporary network unavailability, throttled API, transient query timeout) should be retried automatically before triggering an alert. Implement exponential backoff: retry after 1 minute, then 2 minutes, then 4 minutes, then alert if still failing. Do not retry indefinitely — set a maximum retry count.
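The retry schedule described above (1, 2, then 4 minutes, then give up) could be sketched like this. `TransientError` and `run_with_retry` are hypothetical names, and the `sleep` parameter is injectable so the example can record delays instead of actually waiting.

```python
import time

class TransientError(Exception):
    """Hypothetical marker for failures worth retrying."""

def run_with_retry(task, max_retries=3, base_delay=60.0, sleep=time.sleep):
    """Run task, retrying transient failures after 60s, 120s, then 240s."""
    for attempt in range(max_retries + 1):
        try:
            return task()
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted: let the caller alert
            sleep(base_delay * 2 ** attempt)

delays = []
attempts = {"n": 0}

def flaky():
    """Fails twice with a transient error, then succeeds."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TransientError("throttled")
    return "ok"

outcome = run_with_retry(flaky, sleep=delays.append)  # records delays, no real sleep
```

Only `TransientError` is retried here; any other exception propagates immediately, since retrying a logic error or a schema mismatch only delays the alert.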

**Dead-letter queues**: for streaming pipelines, messages that cannot be processed (malformed data, schema mismatches, business rule violations) should be sent to a dead-letter queue for manual review rather than being discarded or allowed to halt the stream. Monitor the dead-letter queue: accumulation indicates a systematic processing problem.
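The routing logic behind a dead-letter queue is sketched below, with in-memory lists standing in for real queues or topics. `process_stream` and `handle` are hypothetical, and the set of exceptions treated as unprocessable is an assumption you would adapt to your own handler.

```python
import json

def process_stream(messages, handle):
    """Apply handle to each raw message; route failures to a dead-letter list."""
    processed, dead_letters = [], []
    for raw in messages:
        try:
            processed.append(handle(json.loads(raw)))
        except (json.JSONDecodeError, KeyError, ValueError) as exc:
            # Keep the original payload and the reason for manual review.
            dead_letters.append({"raw": raw, "error": repr(exc)})
    return processed, dead_letters

def handle(event):
    """Hypothetical handler: requires a non-negative numeric amount."""
    amount = float(event["amount"])
    if amount < 0:
        raise ValueError("negative amount")
    return amount

messages = ['{"amount": 5}', "not json", '{"qty": 1}', '{"amount": -2}']
processed, dead_letters = process_stream(messages, handle)
```

The length of the dead-letter list is the number to monitor: a steady trickle is expected, while a sudden jump usually points to an upstream schema or producer change.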

**Checkpoint and resume**: long-running batch pipelines should checkpoint progress periodically so that a failure mid-run can resume from the last checkpoint rather than restarting from scratch. Spark's built-in checkpointing, dbt's incremental materialisation patterns, and partition-by-partition batch processing all implement resumability.
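Partition-by-partition resumability can be illustrated with a small checkpoint file. In this sketch the checkpoint is a JSON list of completed partitions on local disk, which is an assumption; `run_partitions` is a hypothetical helper, and a real pipeline would more likely keep this state in a control table.

```python
import json
import os
import tempfile

def run_partitions(partitions, process, checkpoint_path):
    """Process partitions in order, skipping any already recorded as done."""
    done = set()
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = set(json.load(f))
    for partition in partitions:
        if partition in done:
            continue
        process(partition)  # should itself be idempotent, e.g. a partition overwrite
        done.add(partition)
        with open(checkpoint_path, "w") as f:
            json.dump(sorted(done), f)

checkpoint = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
partitions = ["2024-01-01", "2024-01-02", "2024-01-03"]
processed = []

def failing(partition):
    """Simulates a failure partway through the run."""
    if partition == "2024-01-02":
        raise RuntimeError("simulated mid-run failure")
    processed.append(partition)

try:
    run_partitions(partitions, failing, checkpoint)
except RuntimeError:
    pass

run_partitions(partitions, processed.append, checkpoint)  # resumes at 2024-01-02
```

The second call skips the partition that already completed and picks up at the one that failed, so each partition is processed once across both runs.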

**Rollback procedures**: for pipelines that write to production tables, define a rollback procedure: a way to undo the most recent run if it produced incorrect results. Delta Lake's RESTORE command (RESTORE TABLE ... TO VERSION AS OF), Snowflake's Time Travel, and partition overwrite patterns all provide rollback mechanisms. Document the rollback procedure before it is needed; do not figure it out during an incident.

Observability

Pipeline observability means having clear visibility into what your pipelines are doing and why. The three pillars are metrics, logs, and lineage; a health dashboard brings them together:

**Metrics**: quantitative measures of pipeline health — rows processed per run, run duration, error count, queue depth, lag for streaming pipelines. Metrics are collected continuously and used for alerting and trend analysis.
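As a rough sketch of collecting per-run metrics, the context manager below records row count, duration, and status for each run. `track_run` and the in-memory `metrics` list are hypothetical; a real setup would send these values to whatever metrics backend the team already uses.

```python
import time
from contextlib import contextmanager

@contextmanager
def track_run(pipeline, metrics, clock=time.monotonic):
    """Record rows, duration, and final status for one pipeline run."""
    run = {"pipeline": pipeline, "rows": 0, "status": "running"}
    start = clock()
    try:
        yield run
        run["status"] = "success"
    except Exception:
        run["status"] = "failed"
        raise
    finally:
        run["duration_s"] = clock() - start
        metrics.append(run)  # recorded whether the run succeeded or failed

metrics = []
with track_run("orders_daily", metrics) as run:
    run["rows"] = 1200  # the pipeline body would set this after loading
```

Because the record is appended in the `finally` clause, failed runs show up in the metrics too, which is what alerting and trend analysis need.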

**Logs**: detailed records of what happened during each pipeline run — which steps executed, what data was processed, what errors occurred, what parameters were used. Logs are for debugging; metrics are for monitoring.

**Lineage**: the record of how data flows from source to destination: which source tables feed which transformation models, and which models feed which output tables. Lineage enables impact analysis (if this source changes, what breaks?) and root cause analysis (this output is wrong; what upstream processing produced it?). dbt's automatic lineage, OpenLineage for Spark, and the Tableau Metadata API capture lineage at different layers.
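Impact analysis over lineage is a graph traversal. The sketch below walks a small, made-up lineage graph breadth-first; the table names and the `impacted_by` helper are hypothetical, and in practice the graph would come from a lineage tool rather than a hand-written dictionary.

```python
from collections import deque

# Hypothetical lineage: each node maps to the nodes that read from it.
DOWNSTREAM = {
    "raw.orders": ["stg_orders"],
    "raw.customers": ["stg_customers"],
    "stg_orders": ["fct_sales"],
    "stg_customers": ["fct_sales", "dim_customers"],
    "fct_sales": ["sales_dashboard"],
}

def impacted_by(node, downstream=DOWNSTREAM):
    """Return everything reachable downstream of node, in breadth-first order."""
    seen, order, queue = {node}, [], deque([node])
    while queue:
        for child in downstream.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order

impact = impacted_by("raw.customers")
```

Reversing the edges and running the same traversal from an output table gives the root cause view: every upstream node that could have contributed to a wrong number.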

**Dashboards for pipeline health**: build a monitoring dashboard that shows the current state of all scheduled pipelines — last run time, last run status, current queue depth, extract freshness for Tableau data sources. This dashboard should be the first thing the on-call engineer checks when investigating a data quality complaint.

Documentation

Undocumented pipelines are a liability. When the person who built the pipeline leaves, the pipeline becomes a black box that no one can modify safely.

**Document the why, not the what**: code comments should explain why a decision was made, not what the code does (the code explains what it does). Why is this column transformed this way? Why is this filter applied? Why is this join performed in this order? Document the non-obvious decisions.

**Dependency documentation**: for each pipeline, document which sources it reads from, which tables it writes to, which downstream pipelines or BI tools depend on its output, the expected refresh cadence, and the SLA its output carries.

**Runbook**: a runbook documents the manual steps required to operate and troubleshoot the pipeline — how to re-run it, how to roll it back, how to handle the most common failure modes, who to escalate to for issues that are not self-service. For production pipelines, a runbook is not optional.

For the observability tooling that monitors data quality in pipeline outputs (beyond whether the pipeline ran), see data observability. For the transformation layer best practices for dbt-based pipelines, see what is dbt. For cloud infrastructure cost management that applies to pipeline compute, see cloud data cost optimisation.

Our cloud engineering and data architecture consulting practices build and review data pipelines for mid-market and enterprise organisations. If your pipelines are causing data quality incidents or are difficult to maintain, book a free 30-minute audit and we will identify the specific reliability gaps.
