
Data Engineering Best Practices: What Separates Good Pipelines from Great Ones

James Okafor
Data & Cloud Engineer
June 11, 2027 · 13 min read

Most data pipelines work — until they do not. The difference between pipelines that are reliable, maintainable, and debuggable and pipelines that accumulate technical debt until they fail in production is a set of practices that are easy to skip under deadline pressure and costly to retrofit later.

Data pipelines are not software in the traditional sense — they are not interactive, they do not have users in the room when they run, and their failures often manifest as silent data quality issues rather than obvious errors. This makes them harder to build well than application code. The practices that separate good pipelines from unreliable ones are mostly about handling the ways pipelines fail — not just the happy path.

Idempotency

An idempotent pipeline produces the same result for the same input no matter how many times it is run. This is the single most important pipeline property for operational reliability, because retries and backfills are routine in production.

Non-idempotent pipelines accumulate problems when retried. An append-only pipeline that inserts rows without checking for duplicates will double-count data on retry. A pipeline that uses the current time to filter data will produce different results if run at different times. A pipeline that writes to a file without overwriting will accumulate versions.

Design pipelines to be idempotent:

- Use MERGE or INSERT ... ON CONFLICT for upsert operations rather than blind inserts.

- Partition output by date and overwrite the relevant partition on each run rather than appending.

- Use source-side watermarks (event timestamps, not pipeline execution times) for incremental filtering.

- Write to a staging location and atomically swap to production rather than writing in place.
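
The upsert approach can be illustrated with a minimal sketch. It assumes SQLite (via Python's standard `sqlite3` module) as a stand-in for the target warehouse; the `daily_totals` table, its columns, and the function names are hypothetical and chosen only for illustration.

```python
import sqlite3


def load_daily_totals(conn, rows):
    """Upsert (day, customer_id, total) rows; re-running with the same rows changes nothing."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS daily_totals (
            day TEXT NOT NULL,
            customer_id INTEGER NOT NULL,
            total REAL NOT NULL,
            PRIMARY KEY (day, customer_id)
        )
        """
    )
    with conn:  # one transaction: either every row lands or none does
        conn.executemany(
            """
            INSERT INTO daily_totals (day, customer_id, total)
            VALUES (?, ?, ?)
            ON CONFLICT (day, customer_id) DO UPDATE SET total = excluded.total
            """,
            rows,
        )


def row_count(conn):
    """Number of rows currently in the hypothetical daily_totals table."""
    return conn.execute("SELECT COUNT(*) FROM daily_totals").fetchone()[0]
```

Because the primary key identifies a row by business key rather than by load time, a retry overwrites the earlier attempt instead of double-counting it.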

Explicit Schema Handling

Pipelines that do not validate schema assumptions break silently when sources change. A column is renamed upstream; the pipeline runs without error but produces nulls for a field that should be populated. A data type changes; string-to-numeric coercions produce unexpected nulls or errors downstream.

Validate schema at ingestion: assert that expected columns are present and have expected types before processing. Fail loudly when schema assumptions are violated — a pipeline that fails with a clear error message ("expected column 'customer_id' of type INTEGER, got type STRING") is better than one that silently produces incorrect data.

For sources with explicit schemas (databases with typed columns, Avro or Parquet files with embedded schemas), use the schema as a contract and fail if the contract is broken. For sources without explicit schemas (CSV, JSON), define the expected schema and validate against it on each run.
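
As a rough sketch of validation at ingestion for a schemaless source, the following checks each record against an expected schema and fails loudly on the first mismatch. The `EXPECTED_SCHEMA` mapping, the `SchemaError` class, and the column names are assumptions made for the example, not part of any particular library.

```python
# Hypothetical expected schema for an ingested JSON or CSV source.
EXPECTED_SCHEMA = {"customer_id": int, "email": str, "amount": float}


class SchemaError(ValueError):
    """Raised when a record does not match the expected schema."""


def validate_schema(records, expected=EXPECTED_SCHEMA):
    """Return records unchanged if every record matches; raise SchemaError otherwise."""
    for i, record in enumerate(records):
        missing = expected.keys() - record.keys()
        if missing:
            raise SchemaError(f"record {i}: missing column(s) {sorted(missing)}")
        for column, expected_type in expected.items():
            value = record[column]
            # Strict check: a string "1" is not accepted where an int is expected.
            if type(value) is not expected_type:
                raise SchemaError(
                    f"record {i}: expected column '{column}' of type "
                    f"{expected_type.__name__}, got type {type(value).__name__}"
                )
    return records
```

This sketch treats nulls as violations; a real schema definition would likely mark which columns are nullable.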

Data Contracts Between Pipeline Stages

In multi-stage pipelines, each stage makes assumptions about its inputs. When upstream stages change, downstream stages may break. The traditional approach (communication between team members) does not scale as pipelines grow.

Explicit data contracts — documented schemas, row count expectations, freshness expectations — make assumptions testable. dbt schema tests and source tests formalise contracts for SQL-based transformations: test that primary keys are unique, that foreign keys have matching records, that non-nullable columns contain no nulls, that value ranges are within expected bounds.

For non-SQL pipelines, the same principle applies: write assertions that validate data between stages. A pipeline that fails at a data contract violation produces a clear, specific error. A pipeline without contracts fails downstream with a confusing symptom.
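
For a non-SQL pipeline, a contract check between stages might look like the sketch below. The function names and the shape of the contract (a key column, a list of required columns, a minimum row count) are illustrative assumptions that mirror the uniqueness and not-null tests described above.

```python
def check_contract(rows, *, key, required, min_rows=1):
    """Return a list of contract violations for a stage's output (empty list means pass)."""
    violations = []
    if len(rows) < min_rows:
        violations.append(f"expected at least {min_rows} rows, got {len(rows)}")
    keys = [row.get(key) for row in rows]
    if len(set(keys)) != len(keys):
        violations.append(f"duplicate values in key column '{key}'")
    for column in required:
        nulls = sum(1 for row in rows if row.get(column) is None)
        if nulls:
            violations.append(f"{nulls} null(s) in required column '{column}'")
    return violations


def enforce_contract(rows, **contract):
    """Pass rows through to the next stage, or fail with every violation listed."""
    violations = check_contract(rows, **contract)
    if violations:
        raise ValueError("data contract violated: " + "; ".join(violations))
    return rows
```

Calling `enforce_contract(rows, key="order_id", required=["amount"])` between two stages turns a confusing downstream symptom into a specific error at the boundary where the assumption broke.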

Incremental Processing

Full-refresh pipelines that reprocess all historical data on every run are simple but expensive and slow. Cost grows with data volume: if runtime scales linearly, a full refresh that takes 5 minutes at 1 million rows takes about 500 minutes, or more than 8 hours, at 100 million rows.

Design for incremental processing: track the high-water mark of processed data (the maximum timestamp or sequence number seen by the last successful run), process only records newer than the high-water mark, and update the high-water mark only on successful completion. Store the high-water mark in a persistent state store, not in process memory, so that it survives restarts.

Incremental processing introduces the possibility of missed records (records with late-arriving timestamps that fall before the high-water mark) and duplicate records (records processed in the overlap window when retrying). Design the downstream upsert logic to handle both cases correctly.
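
One possible shape for this, sketched under stated assumptions: the high-water mark lives in a small JSON file (a stand-in for whatever persistent state store is available), late arrivals are caught by re-reading a fixed lookback window, and the `load` callable is assumed to be an upsert so that re-delivered rows do not create duplicates. All names here are hypothetical.

```python
import json
from datetime import datetime, timedelta
from pathlib import Path


def read_watermark(path):
    """Return the stored high-water mark, or None before the first successful run."""
    p = Path(path)
    if not p.exists():
        return None
    return datetime.fromisoformat(json.loads(p.read_text())["high_water_mark"])


def write_watermark(path, value):
    """Persist the high-water mark as an ISO 8601 timestamp."""
    Path(path).write_text(json.dumps({"high_water_mark": value.isoformat()}))


def run_incremental(source_rows, state_path, load, lookback=timedelta(hours=1)):
    """Process rows newer than (watermark - lookback); advance the watermark on success."""
    watermark = read_watermark(state_path)
    cutoff = None if watermark is None else watermark - lookback
    batch = [r for r in source_rows if cutoff is None or r["event_time"] > cutoff]
    if not batch:
        return 0
    load(batch)  # assumed to be an upsert: the lookback window re-delivers some rows
    newest = max(r["event_time"] for r in batch)
    # Only reached if load() did not raise; the watermark never moves backwards.
    write_watermark(state_path, newest if watermark is None else max(newest, watermark))
    return len(batch)
```

The lookback window trades some repeated work for tolerance of late-arriving records; records that arrive later than the window are still missed, so its size should reflect how late the source is known to deliver.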

For dbt, incremental models handle this pattern with the 'is_incremental()' macro: the first run processes all data and subsequent runs process only new data. The model is idempotent within each incremental window when a unique_key is configured, so that reprocessed rows are updated rather than appended.

Explicit Dependency Management

Pipelines that run on time schedules without explicit dependency verification are fragile. A daily pipeline that runs at 6 AM assumes its source data is ready by 6 AM. When the source is delayed by 30 minutes, the pipeline processes yesterday's data without any indication that today's data was missed.

Use data-aware scheduling rather than time-based scheduling where possible: trigger downstream pipelines when upstream pipelines complete successfully, not on a fixed schedule. Modern orchestrators (Airflow, Prefect, Dagster) support sensors or event-based triggers that wait for specific data conditions before proceeding.

When data-aware scheduling is not feasible, add explicit checks at the start of each pipeline run: assert that source data for the expected time window exists and has the expected row count before processing. Fail with a clear error if the source condition is not met.
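
A minimal sketch of such a precondition check follows. It assumes source rows are available as dictionaries with an `event_date` field; the `SourceNotReady` exception and the function name are hypothetical.

```python
from datetime import date


class SourceNotReady(RuntimeError):
    """Raised when the source does not yet hold the expected data for a run."""


def assert_source_ready(rows, run_date, min_rows):
    """Return the row count for run_date, or raise if it is below min_rows."""
    found = sum(1 for r in rows if r["event_date"] == run_date)
    if found < min_rows:
        raise SourceNotReady(
            f"source has {found} rows for {run_date.isoformat()}, "
            f"expected at least {min_rows}"
        )
    return found
```

Running this as the first step of the 6 AM job means a delayed source produces an explicit failure that can be retried, rather than a successful run over the wrong day's data.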

Observability

A pipeline that runs but produces incorrect data is worse than a pipeline that fails, because silent failures can go undetected for days while incorrect data influences decisions.

Emit metrics from pipelines: records processed per run, records loaded, records rejected, run duration, high-water mark advanced. These metrics make it possible to detect anomalies — a run that processed zero records when it normally processes 50,000 is a problem even if it did not error.

Log structured events from pipelines: job start, job completion, record counts at each stage, validation failures, retry attempts. Structured logs (JSON, not free-text) are queryable and filterable; free-text logs require parsing.
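
Structured events can be emitted with Python's standard `logging` module and a small JSON formatter, as in the sketch below. The `JsonFormatter` class, the `fields` attribute name, and the helper functions are assumptions for illustration; dedicated structured-logging libraries offer the same idea with more features.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "event": record.getMessage(),
            **getattr(record, "fields", {}),  # extra key/value pairs, if any
        }
        return json.dumps(payload, sort_keys=True)


def get_pipeline_logger(name="pipeline", stream=None):
    """Return a logger that writes JSON lines to stream (stdout by default)."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def log_event(logger, event, **fields):
    """Log a named event with structured fields such as record counts."""
    logger.info(event, extra={"fields": fields})
```

A call such as `log_event(logger, "job_completed", records_loaded=50000, records_rejected=12)` produces a single JSON line whose fields can be filtered and aggregated without text parsing.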

Alert on failures and on anomalies: row count drops of more than 20% from the previous comparable run; run duration more than 2x the median; high-water mark not advancing within the expected window.
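
The first two alert rules can be expressed as a small comparison against run history, sketched here. The metric names (`rows`, `duration_s`) and the function name are hypothetical, and the thresholds are taken from the examples above rather than being recommended values.

```python
from statistics import median


def detect_anomalies(current, history, max_row_drop=0.20, max_duration_ratio=2.0):
    """Compare the current run's metrics with previous comparable runs."""
    alerts = []
    if not history:
        return alerts  # nothing to compare against on the first run
    previous_rows = history[-1]["rows"]
    if previous_rows > 0:
        drop = (previous_rows - current["rows"]) / previous_rows
        if drop > max_row_drop:
            alerts.append(f"row count dropped {drop:.0%} vs previous run")
    typical = median(run["duration_s"] for run in history)
    if current["duration_s"] > max_duration_ratio * typical:
        alerts.append(
            f"duration {current['duration_s']}s exceeds "
            f"{max_duration_ratio}x median {typical}s"
        )
    return alerts
```

What counts as a "comparable" run depends on the pipeline: for data with weekly seasonality, comparing a Monday run with the previous Monday is usually more meaningful than comparing it with Sunday.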

Testing

Data pipelines need tests, but the test patterns are different from application code. The relevant tests:

**Unit tests for transformation logic**: Test individual calculated fields, business rules, and data transformations with known inputs and expected outputs. In dbt, unit tests define input data and assert output values for individual models.

**Integration tests with real data**: Run the pipeline against a sample of real historical data and assert that the output matches known correct values. These catch regressions in transformation logic that unit tests with synthetic data miss.

**Data quality assertions**: Run after each pipeline execution, not during development. Assert that output data meets quality expectations: no duplicate primary keys, no null values in required columns, foreign key referential integrity, value distributions within expected ranges.

**Schema change tests**: Run during CI/CD, not production. For pipelines with schema contracts, assert that the expected schema is present before deploying a new version.
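
To make the unit-test pattern concrete, here is a small sketch using Python's standard `unittest` module. The `net_revenue` business rule is invented for the example; the point is the shape of the test, with known inputs and exact expected outputs.

```python
import unittest


def net_revenue(gross, refunds, tax_rate):
    """Hypothetical business rule: revenue after refunds, with tax removed."""
    if not 0 <= tax_rate < 1:
        raise ValueError("tax_rate must be in [0, 1)")
    return round((gross - refunds) / (1 + tax_rate), 2)


class NetRevenueTest(unittest.TestCase):
    def test_removes_tax(self):
        self.assertEqual(net_revenue(120.0, 0.0, 0.20), 100.0)

    def test_subtracts_refunds_before_tax(self):
        self.assertEqual(net_revenue(120.0, 60.0, 0.20), 50.0)

    def test_rejects_invalid_tax_rate(self):
        with self.assertRaises(ValueError):
            net_revenue(100.0, 0.0, 1.5)
```

Saved in a module, these tests run with `python -m unittest`. Keeping transformation logic in plain functions like this, separate from I/O, is what makes it testable without a database or an orchestrator.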

Documentation

Pipelines are infrastructure that other people will maintain — or that you will maintain after forgetting the original context. Document:

- The business question the pipeline answers, not just what it does technically

- The source systems, including which fields are used and what they mean in the source context

- The grain of output tables — what each row represents

- Known data quality issues in sources and how they are handled

- The retry and recovery procedure when the pipeline fails

Our data engineering and data architecture practice builds pipelines to this standard and remediates pipelines that have accumulated technical debt — contact us to discuss your data engineering needs.
