Machine Learning Data Pipelines: Engineering the Data Layer for ML Systems

James Okafor
Senior Data Engineer
July 31, 2027 · 13 min read

Machine learning models are only as good as the data pipelines that feed them. The data engineering work required to prepare, validate, version, and serve training and inference data is consistently underestimated in ML projects, and failures at the data layer account for the majority of ML system failures in production. A model architecture that achieves state-of-the-art performance on benchmark datasets can still degrade substantially when deployed against production data that differs systematically from the data it was trained and evaluated on.

What ML Data Pipelines Must Do

The ML data pipeline is responsible for four distinct workflows:

**Training data preparation** produces the dataset used to train and validate the model. The requirements are: correct temporal joins (features must be computed as they would have been observed at the prediction time, using no information from after the prediction horizon); label computation (the outcome variable must be computed accurately for historical instances); feature engineering (raw data transformed into the derived attributes the model uses); and data quality validation (checks that catch data issues before they silently corrupt training).

**Inference data preparation** produces the features needed to score new instances. The fundamental requirement — which is frequently violated — is that inference features must be computed identically to training features. Different time zone handling, different null imputation, different aggregation windows between training and inference produce systematic inconsistencies that degrade model performance in ways that are difficult to diagnose.

**Experiment data management** handles the datasets produced during model development: training sets, validation sets, test sets, and holdout sets. Experiments need to be reproducible: the same dataset, the same code, and the same random seeds should produce the same model. This requires data versioning — the ability to reconstruct the exact dataset used for a specific training run.

**Model serving infrastructure** makes model predictions available to downstream systems. This may be batch scoring (running the model over the full population and storing scores) or online inference (serving predictions via API with latency constraints). The two modes have different data infrastructure requirements.
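One common way to keep inference features identical to training features is to route both paths through a single feature function. The sketch below is illustrative only: `build_features`, `to_utc`, the field names, and the imputation default are hypothetical, and it assumes naive timestamps are in UTC.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical imputation value, chosen for illustration only.
DEFAULT_SESSIONS_30D = 0.0


def to_utc(ts: datetime) -> datetime:
    """Normalise to UTC. Naive timestamps are assumed (not known) to be UTC."""
    if ts.tzinfo is None:
        return ts.replace(tzinfo=timezone.utc)
    return ts.astimezone(timezone.utc)


def build_features(raw: dict) -> dict:
    """Single feature definition imported by both the training job and the scorer."""
    event_time = to_utc(raw["event_time"])
    sessions = raw.get("sessions_30d")
    return {
        "hour_of_day_utc": event_time.hour,
        # Null imputation lives in one place, so both paths impute the same way.
        "sessions_30d": DEFAULT_SESSIONS_30D if sessions is None else float(sessions),
        "sessions_30d_missing": sessions is None,
    }


# The same instant, expressed in two time zones, yields the same features.
local = datetime(2024, 1, 1, 23, 30, tzinfo=timezone(timedelta(hours=-5)))
utc = datetime(2024, 1, 2, 4, 30, tzinfo=timezone.utc)
training_row = build_features({"event_time": local, "sessions_30d": None})
inference_row = build_features({"event_time": utc, "sessions_30d": None})
```

The design choice here is that time zone handling and null imputation are defined once, so a change to either reaches training and inference together.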

Data Quality in ML Pipelines

Standard data pipelines validate schema and null rates. ML data pipelines require additional validation that is specific to the modelling context:

**Distribution validation** — checking that the distribution of key features falls within the range observed during training. A feature that has shifted substantially may cause model degradation even if the schema and null rates look normal. Standard checks include distribution comparisons using Kolmogorov-Smirnov tests for continuous features and chi-squared tests for categorical features.

**Label leakage detection** — verifying that features do not contain information from after the prediction time. This is easy to miss in temporal datasets where the same table contains both features and outcomes. The check requires explicit validation of the temporal boundary for each feature against the label timestamp.

**Training-inference consistency** — verifying that features computed at inference time match those computed at training time. The most practical approach is to log a sample of inference feature vectors, compare their distributions to training feature distributions, and alert when they diverge beyond a defined threshold.

**Class balance and label distribution** — for classification models, monitoring the proportion of positive labels in training data. Significant shifts in label prevalence between training batches may indicate data pipeline issues (early outcome leakage, missing population), and they affect model calibration in ways that are not always visible in standard accuracy metrics.
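As a rough illustration of the distribution check for a continuous feature, the sketch below computes the two-sample Kolmogorov-Smirnov statistic in pure Python. The function names and the 0.2 alert threshold are assumptions made for the example. In practice a library routine such as `scipy.stats.ks_2samp`, which also returns a p-value, would usually be a better choice.

```python
from bisect import bisect_right


def ks_statistic(reference: list, current: list) -> float:
    """Two-sample KS statistic: largest absolute gap between the empirical CDFs."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    ref = sorted(reference)
    cur = sorted(current)
    max_gap = 0.0
    # Both empirical CDFs are step functions that only change at observed
    # values, so it is enough to compare them at those values.
    for x in set(ref) | set(cur):
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        max_gap = max(max_gap, abs(cdf_ref - cdf_cur))
    return max_gap


def has_drifted(reference: list, current: list, threshold: float = 0.2) -> bool:
    """Flag a feature whose KS statistic exceeds a hypothetical alert threshold."""
    return ks_statistic(reference, current) > threshold
```

The same comparison can serve the training-inference consistency check: pass the training feature values as `reference` and a logged sample of inference values as `current`.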

The Temporal Join Problem

An incorrect temporal join is the most common source of data leakage in ML pipelines. Leakage occurs when features are computed using data from after the point in time at which the prediction would have been made.

Consider a customer churn model trained on monthly snapshots. For each customer-month, the label is whether the customer churned in the following 30 days. The features include a trailing-30-day usage metric. The temporal join must ensure that the usage metric for each customer-month observation is computed from data up to (and not including) the start of the 30-day prediction window — not from data that extends into the prediction window.

In practice, this requires point-in-time correct feature tables: for each observation in the training data, features are computed using only data that was available at the observation timestamp. Building these tables requires careful handling of slowly changing dimensions, late-arriving data, and correction events. It is significantly more complex than building a standard analytics feature table, and the errors do not show up as failures in offline metrics: leakage typically inflates training and validation performance, making the model appear to work better than it does until it fails in production.
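The boundary condition in the churn example can be sketched as follows. This is a simplified illustration rather than a full point-in-time table build: `trailing_usage` and the sample events are hypothetical, events are assumed to be sorted by date, and late-arriving data and corrections are not handled.

```python
from bisect import bisect_left
from datetime import date, timedelta


def trailing_usage(events: list, observation_date: date, window_days: int = 30) -> float:
    """Sum usage in [observation_date - window_days, observation_date).

    `events` is a list of (event_date, usage) tuples sorted by event_date.
    The observation date is excluded because it is the first day of the
    prediction window.
    """
    dates = [event_date for event_date, _ in events]
    start = bisect_left(dates, observation_date - timedelta(days=window_days))
    end = bisect_left(dates, observation_date)  # strictly before the observation date
    return sum(usage for _, usage in events[start:end])


events = [
    (date(2024, 1, 5), 3.0),
    (date(2024, 1, 20), 2.0),
    (date(2024, 2, 1), 7.0),   # falls inside the prediction window
    (date(2024, 2, 10), 4.0),  # falls inside the prediction window
]
feature = trailing_usage(events, observation_date=date(2024, 2, 1))
```

Here `feature` counts only the two January events. Handling late-arriving data would additionally require filtering on the time each record became available, not only on the event date. For tabular data, `pandas.merge_asof` with `allow_exact_matches=False` expresses a similar strictly-before join.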

Orchestration and Dependency Management

ML data pipelines have complex dependency graphs. A model training pipeline depends on feature tables; feature tables depend on staging tables; staging tables depend on extractions from source systems. Any upstream failure propagates downstream.

The standard orchestration tools (Airflow, Prefect, Dagster) handle dependency management and failure alerting. ML pipelines add further requirements:

**Data availability checks** — before triggering downstream computations, verify that upstream data meets minimum freshness and completeness requirements. A feature table computed from incomplete data is worse than no feature table, because the model will be trained or scored on a distorted population.

**Backfill handling** — when upstream data is corrected or backfilled, ML pipelines may need to recompute derived features and retrain the model. The pipeline design should handle triggered recomputation without manual intervention.

**Cost management** — ML data pipelines, particularly feature engineering over large populations, are compute-intensive. Incremental computation (computing only the features for records that have changed since the last run) reduces cost significantly compared to full-population recomputation on every run.
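A data availability check of the kind described above might look like the sketch below, which an orchestrator task could call before triggering feature computation. `TableStatus`, `is_ready`, and the freshness and completeness thresholds are all hypothetical. How the load time and expected row count are obtained depends on the platform.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class TableStatus:
    name: str
    last_loaded_at: datetime
    row_count: int
    expected_row_count: int


def is_ready(
    status: TableStatus,
    now: datetime,
    max_age: timedelta = timedelta(hours=6),   # assumed freshness requirement
    min_completeness: float = 0.98,            # assumed completeness requirement
) -> tuple:
    """Return (ready, reason) for an upstream table."""
    age = now - status.last_loaded_at
    if age > max_age:
        return False, f"{status.name} is stale: last loaded {age} ago"
    if status.expected_row_count <= 0:
        return False, f"{status.name} has no expected row count to compare against"
    completeness = status.row_count / status.expected_row_count
    if completeness < min_completeness:
        return False, f"{status.name} is incomplete: {completeness:.1%} of expected rows"
    return True, "ok"


now = datetime(2024, 3, 1, 12, 0, tzinfo=timezone.utc)
fresh = TableStatus("staging.usage", now - timedelta(hours=1), 990, 1000)
partial = TableStatus("staging.usage", now - timedelta(hours=1), 600, 1000)
```

Returning a reason alongside the boolean lets the orchestrator surface why a downstream run was skipped, instead of silently computing features from a distorted population.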

Versioning and Reproducibility

A model in production was trained on specific data, with specific feature engineering code, at a specific time. When the model is retrained or replaced, the ability to reproduce the previous training run is essential for debugging regression — determining whether a performance degradation is caused by model changes, feature changes, or data changes.

Data versioning for ML requires:

**Dataset versioning** — storing or reconstructing the exact training dataset for each model version. Full dataset storage is expensive at scale; the practical alternative is storing the query or pipeline code used to generate the dataset alongside a timestamp and a sample of rows sufficient to verify the dataset characteristics.

**Feature engineering versioning** — tracking changes to feature computation code. This means version-controlling the dbt models, SQL queries, or Python code that compute features, and recording which version was used for each model training run.

**Model registry** — a catalogue of trained models with their training data version, feature engineering version, hyperparameters, and performance metrics. The model registry makes it possible to roll back to a previous model version when a new deployment degrades performance.
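The three pieces above can be tied together with a small record per training run. The sketch below is one possible shape, not a reference to any particular registry product: `dataset_fingerprint`, `ModelRecord`, and all field values are hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass


def dataset_fingerprint(query: str, as_of: str, sample_rows: list) -> str:
    """Hash the generating query, its as-of timestamp, and a row sample."""
    payload = json.dumps(
        {"query": query, "as_of": as_of, "sample": sample_rows},
        sort_keys=True,   # stable key order so equal inputs hash identically
        default=str,      # serialise dates and similar values as strings
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class ModelRecord:
    model_version: str
    dataset_fingerprint: str
    feature_code_version: str   # e.g. a commit hash of the feature code
    hyperparameters: dict
    metrics: dict


sample = [{"customer_id": 1, "sessions_30d": 12, "churned": 0}]
fingerprint = dataset_fingerprint(
    "SELECT * FROM features.churn_training", "2024-03-01T00:00:00Z", sample
)
record = ModelRecord(
    model_version="churn-v7",
    dataset_fingerprint=fingerprint,
    feature_code_version="abc1234",
    hyperparameters={"max_depth": 6},
    metrics={"auc": None},  # filled in after evaluation
)
```

A fingerprint built this way does not prove the full dataset is unchanged. It only detects changes to the query, the timestamp, or the sampled rows, which matches the cheaper alternative to full dataset storage described above.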
