
Feature Engineering for Machine Learning: Data Architecture Patterns That Work at Scale

James Okafor
Data & Cloud Engineer
July 11, 2027 · 13 min read

Feature engineering — transforming raw data into the input representations that machine learning models learn from — is where most ML projects spend the most time and encounter the most production problems. The patterns that work at scale are fundamentally data architecture decisions: how features are computed, stored, and served consistently across training and inference.

Feature engineering is the process of transforming raw data into the numerical representations that machine learning models use as input. It is where most ML projects spend the majority of their time and encounter the majority of their production failures — not because the transformation logic is inherently complex, but because the data architecture required to compute and serve features consistently across model training and real-time inference is genuinely difficult to build well.

The core problem: a feature computed one way during training and a slightly different way during inference produces incorrect predictions at scale, usually without raising any error. This problem, known as training-serving skew, is the most common production ML failure mode.

What Feature Engineering Involves

Feature engineering transforms raw entity data into the numerical inputs that ML models receive. For a customer churn prediction model, raw data might include: customer records, subscription events, support tickets, product usage events, and billing records. Features derived from this data might include: days since last login (recency), number of support tickets in the last 30 days (engagement signal), MRR trend over 3 months (commercial signal), and usage event count in the last 7 days normalised by user tenure (behavioural feature normalised for lifecycle stage).

Each of these features requires:

1. Identifying the relevant raw data

2. Defining the computation logic (how exactly is "days since last login" computed? As of what timestamp?)

3. Implementing the computation consistently for training data and real-time inference

4. Storing the computed value for efficient retrieval during inference (if real-time computation is too slow)
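As a minimal sketch of step 2, "days since last login" can be written as a function that takes the as-of timestamp as an explicit argument, so the same logic serves both historical training rows and live requests. The function name, the whole-day rounding, and the choice to return None when there is no prior login are illustrative assumptions, not a prescribed definition.

```python
from datetime import datetime, timezone
from typing import Optional, Sequence


def days_since_last_login(login_times: Sequence[datetime], as_of: datetime) -> Optional[int]:
    """Whole days between the latest login at or before `as_of` and `as_of`.

    Returns None when there is no login at or before `as_of`.
    """
    # Ignore logins after the as-of timestamp: they were not knowable at that time.
    visible = [t for t in login_times if t <= as_of]
    if not visible:
        return None
    return (as_of - max(visible)).days


logins = [
    datetime(2027, 6, 1, 9, 0, tzinfo=timezone.utc),
    datetime(2027, 6, 20, 18, 30, tzinfo=timezone.utc),
]

# Training: as_of is the historical timestamp attached to the training example.
training_value = days_since_last_login(logins, datetime(2027, 6, 10, tzinfo=timezone.utc))

# Inference: as_of is the request time, passed in by the caller rather than
# read from a clock inside the function.
inference_value = days_since_last_login(logins, datetime(2027, 7, 1, tzinfo=timezone.utc))
```

Passing `as_of` in, rather than calling the system clock inside the function, is what makes the training and inference computations the same code.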

Training-Serving Skew: The Most Common Failure Mode

Training-serving skew occurs when a feature is computed differently during model training than during model inference. The model learned from training data produced by one computation but receives inference data produced by another. Its predictions therefore rest on features that do not match what it was trained on, and they are degraded or incorrect as a result.

Common causes of training-serving skew:

**Different code paths**: Training uses Python/Pandas; inference uses a different language or runtime. Small differences in handling of nulls, date calculations, or floating-point precision produce different values for the same input.

**Different timestamps**: Training features are computed as of a historical timestamp tied to each labelled example. Inference features are computed as of "now." If the feature definition does not enforce the as-of timestamp consistently, training features can include data recorded after that timestamp, which inference features can never have.

**Data availability differences**: Training uses data that is available as historical batch; inference uses data arriving in real-time via different pipelines. Delays in one pipeline relative to another produce feature values that differ from what the model expected.
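One way to guard against the first cause is to route both paths through a single feature function and to run a parity check that compares offline values with values logged at serving time. The sketch below is illustrative only: `tickets_per_tenure_month`, `find_skew`, the tolerance, and the sample values are all hypothetical.

```python
import math
from typing import List, Mapping, Optional


def tickets_per_tenure_month(tickets_30d: Optional[int], tenure_days: Optional[int]) -> float:
    """Hypothetical feature: support tickets in the last 30 days per month of tenure."""
    # Null handling is decided once, here, so both code paths make the same choice.
    if tickets_30d is None:
        tickets_30d = 0
    if tenure_days is None or tenure_days <= 0:
        return 0.0
    # Treat tenure shorter than one month as one month to avoid inflated ratios.
    months = max(tenure_days / 30.0, 1.0)
    return tickets_30d / months


def find_skew(offline: Mapping[str, float], online: Mapping[str, float],
              rel_tol: float = 1e-9) -> List[str]:
    """Return entity keys present in both mappings whose values disagree."""
    return [
        key
        for key in sorted(offline.keys() & online.keys())
        if not math.isclose(offline[key], online[key], rel_tol=rel_tol, abs_tol=1e-12)
    ]


raw = {
    "c1": {"tickets_30d": 3, "tenure_days": 90},
    "c2": {"tickets_30d": None, "tenure_days": 400},
}

# Training path and inference path both call the shared function.
offline_values = {k: tickets_per_tenure_month(**v) for k, v in raw.items()}
online_values = {k: tickets_per_tenure_month(**v) for k, v in raw.items()}
no_skew = find_skew(offline_values, online_values)

# Values logged by a hypothetical serving implementation that treats a null
# ticket count as 1 instead of 0 are caught by the parity check.
logged_online_values = {"c1": 1.0, "c2": 0.075}
skewed_keys = find_skew(offline_values, logged_online_values)
```

A shared function only helps when both paths can run the same runtime; where they cannot, a parity check of this kind over sampled entities is a cheap way to detect divergence.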

The architectural solution to training-serving skew is the feature store.

Feature Stores

A feature store is a centralised repository for feature definitions, computed feature values, and the serving infrastructure that delivers features consistently to both training and inference workloads.

**Feature definitions**: The feature store holds the definition of how each feature is computed — the logic, the source tables, the aggregation window, the handling of nulls. When the definition is centralised, training and inference use the same definition automatically.

**Offline feature storage**: Pre-computed feature values stored in a columnar store (S3 + Parquet, Delta Lake) for batch retrieval during model training. Training jobs read from the offline store rather than computing features from raw data at training time.

**Online feature serving**: Pre-computed feature values stored in a low-latency key-value store (Redis, DynamoDB, Cassandra) for real-time retrieval during inference. Inference code retrieves pre-computed features by entity key (customer ID, product ID) in milliseconds rather than computing them at request time.

**Point-in-time correctness**: For training, the feature store must serve the feature value that was available at the time the training example occurred — not the current value, which would introduce future leakage. This is implemented via time-travel queries against the offline store: "give me the feature values for customer X as of timestamp T."
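The "as of timestamp T" lookup can be sketched with a sorted per-entity history and a binary search. This is a toy in-memory stand-in for an offline store, not the API of any particular feature store; the class and method names are assumptions made for illustration.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import datetime
from typing import Dict, List, Optional


class OfflineFeatureHistory:
    """Toy stand-in for an offline store: timestamped values per entity."""

    def __init__(self) -> None:
        self._timestamps: Dict[str, List[datetime]] = defaultdict(list)
        self._values: Dict[str, List[float]] = defaultdict(list)

    def write(self, entity_id: str, computed_at: datetime, value: float) -> None:
        # Keep each entity's history sorted by timestamp so lookups can use binary search.
        idx = bisect_right(self._timestamps[entity_id], computed_at)
        self._timestamps[entity_id].insert(idx, computed_at)
        self._values[entity_id].insert(idx, value)

    def as_of(self, entity_id: str, timestamp: datetime) -> Optional[float]:
        """Latest value computed at or before `timestamp`, or None if there is none."""
        times = self._timestamps.get(entity_id, [])
        idx = bisect_right(times, timestamp)
        if idx == 0:
            return None
        return self._values[entity_id][idx - 1]


history = OfflineFeatureHistory()
history.write("customer_x", datetime(2027, 1, 1), 12.0)
history.write("customer_x", datetime(2027, 3, 1), 2.0)
history.write("customer_x", datetime(2027, 2, 1), 7.0)  # out-of-order write is handled

# Each training example is joined to the value that was available at its own timestamp.
training_examples = [
    ("customer_x", datetime(2027, 2, 15)),
    ("customer_x", datetime(2026, 12, 1)),
]
training_features = [history.as_of(entity, ts) for entity, ts in training_examples]
```

The second example returns None rather than a later value, which is the behaviour that prevents future leakage. In pandas, `merge_asof` with a backward direction performs a comparable join across whole tables.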

Open-source feature stores include Feast and Hopsworks. Managed feature stores include Vertex AI Feature Store (GCP), Amazon SageMaker Feature Store, and Databricks Feature Store. The choice depends on the ML platform in use.

Feature Computation Patterns

**Batch computation**: Features are computed on a schedule (hourly, daily) and stored in the feature store. Training retrieves historical values; inference retrieves the most recently computed value. Appropriate when the model can tolerate feature values that are up to one refresh interval old.

**Streaming computation**: Features are computed from an event stream (Kafka, Kinesis) in near-real-time as events arrive. A customer's "events in last 60 minutes" feature is updated within seconds of each new event. Appropriate for time-sensitive features that drive immediate decisions (fraud detection, real-time personalisation).

**On-demand computation**: Simple features that can be computed at inference time from context available in the request (time of day, day of week, device type from request headers) are computed on-demand without pre-storage. Reserve for computations that are fast and do not require historical aggregation.
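The streaming pattern above, using the "events in last 60 minutes" feature, can be sketched as a sliding-window counter. This is a single-process illustration that assumes events for each entity arrive in timestamp order and that time only moves forward; a production system would run equivalent logic in a stream processor fed by Kafka or Kinesis. The class name and interface are hypothetical.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class SlidingWindowCounter:
    """Counts events per entity within the window (now - window_seconds, now]."""

    def __init__(self, window_seconds: float = 3600.0) -> None:
        self.window_seconds = window_seconds
        self._events: Dict[str, Deque[float]] = defaultdict(deque)

    def _evict(self, entity_id: str, now: float) -> None:
        # Drop events that have fallen out of the window.
        events = self._events[entity_id]
        cutoff = now - self.window_seconds
        while events and events[0] <= cutoff:
            events.popleft()

    def record(self, entity_id: str, event_time: float) -> None:
        # Assumes per-entity events arrive in non-decreasing timestamp order.
        self._events[entity_id].append(event_time)
        self._evict(entity_id, event_time)

    def count(self, entity_id: str, now: float) -> int:
        self._evict(entity_id, now)
        return len(self._events[entity_id])


counter = SlidingWindowCounter(window_seconds=3600.0)
for event_time in (0.0, 1800.0, 3000.0):  # seconds since an arbitrary start
    counter.record("customer_x", event_time)

count_at_3000 = counter.count("customer_x", now=3000.0)
count_at_3700 = counter.count("customer_x", now=3700.0)
count_at_7000 = counter.count("customer_x", now=7000.0)
```

Each update touches only the events that have expired, so the feature stays current within the processing delay of the stream rather than the refresh interval of a batch job.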

Feature Pipelines and Data Quality

Feature pipelines need the same quality and reliability practices as any data pipeline — with the additional requirement that feature quality directly affects model prediction quality.

**Null handling**: Many ML models cannot accept nulls as input. Every feature must have a defined null handling strategy: imputation with a fixed default (such as zero), imputation with a statistic computed from the training data (mean, median, or mode), or flagging null as a separate category. The null handling must be identical in training and inference.
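A minimal sketch of identical null handling: the fill value is fitted once on training data, persisted, and reloaded unchanged at inference time. The `MedianImputer` class, the null indicator it returns, and the JSON round trip are illustrative assumptions rather than a specific library's interface.

```python
import json
from dataclasses import asdict, dataclass
from statistics import median
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class MedianImputer:
    """Holds a fill value learned from training data only."""

    fill_value: float

    @classmethod
    def fit(cls, training_values: Sequence[Optional[float]]) -> "MedianImputer":
        observed = [v for v in training_values if v is not None]
        if not observed:
            raise ValueError("cannot fit an imputer on a column with no observed values")
        return cls(fill_value=float(median(observed)))

    def transform(self, value: Optional[float]) -> Tuple[float, int]:
        """Return (imputed value, was_null indicator)."""
        if value is None:
            return self.fill_value, 1
        return float(value), 0


# Training: fit on the training column and persist the fitted state.
training_column = [4.0, None, 10.0, 6.0, None]
imputer = MedianImputer.fit(training_column)
saved_state = json.dumps(asdict(imputer))

# Inference: reload the saved state instead of recomputing a median on live data.
serving_imputer = MedianImputer(**json.loads(saved_state))
imputed_null = serving_imputer.transform(None)
imputed_value = serving_imputer.transform(3)
```

Recomputing the median on production data would silently change the fill value, which is itself a form of training-serving skew.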

**Distribution monitoring**: Feature distributions shift over time as the underlying data changes. For example, a feature with a 5% null rate during training may show a 30% null rate in production because a source system changed, or a numerical feature's distribution may shift because the customer base changed. Feature distribution drift is an early signal of model degradation that often appears before any deterioration in prediction quality is measurable.

Monitor feature distributions at ingestion (before the model consumes them) and alert when distributions deviate significantly from training distributions. Tools like Great Expectations, Evidently AI, and whylogs provide feature monitoring capabilities.
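To make "deviate significantly" concrete, the sketch below compares a production sample with the training sample using two simple checks: the null rate and the population stability index (PSI). PSI is swapped in here as one common drift statistic; the article does not prescribe it. The bin count, the epsilon floor, and the 0.25 alert level (a widely quoted rule of thumb) are assumptions to tune per feature.

```python
import math
from typing import List, Optional, Sequence


def null_rate(values: Sequence[Optional[float]]) -> float:
    """Fraction of values that are None."""
    return sum(v is None for v in values) / len(values)


def population_stability_index(expected: Sequence[float], actual: Sequence[float],
                               bins: int = 10, eps: float = 1e-6) -> float:
    """PSI between two samples, using equal-width bins over the expected range."""
    lo, hi = min(expected), max(expected)
    if hi == lo:
        raise ValueError("expected sample has no spread; PSI is undefined")
    width = (hi - lo) / bins

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Values outside the expected range are clamped into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor at eps so empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


training_sample = [float(i) for i in range(100)]
production_sample = [float(i + 50) for i in range(100)]  # shifted upward by 50

psi_same = population_stability_index(training_sample, training_sample)
psi_shifted = population_stability_index(training_sample, production_sample)
drift_alert = psi_shifted > 0.25  # assumed alert level

production_null_rate = null_rate([1.0, None, 3.0, None])
```

Checks like these run on the feature values themselves, so they can raise an alert before labels arrive and before prediction quality can be measured.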

SQL-Based Feature Engineering with dbt

For teams already using dbt for analytical transformations, dbt is an effective tool for feature engineering when features are derived from warehouse data and the serving latency requirement is measured in minutes rather than seconds.

dbt models at the mart layer can produce feature tables: one row per entity (customer, product, order) with computed features as columns. These tables serve as the offline feature store and feed batch inference jobs.

For online serving, a job reads the dbt-computed feature table and writes current values to an online store such as Redis. This hybrid architecture uses dbt for the computation (SQL, version-controlled, tested) and a dedicated online store for low-latency serving.
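The materialisation job can be sketched as follows. An in-memory dictionary stands in for the online store so the example runs without external services, and the column names, key format, and `materialize` function are hypothetical. A real job would read rows from the warehouse table and write them with the online store's own client.

```python
from datetime import datetime
from typing import Any, Dict, Iterable, Mapping, Optional


class InMemoryOnlineStore:
    """Stand-in for a key-value online store, used for illustration only."""

    def __init__(self) -> None:
        self._data: Dict[str, Dict[str, Any]] = {}

    def put(self, key: str, features: Mapping[str, Any]) -> None:
        self._data[key] = dict(features)

    def get(self, key: str) -> Optional[Dict[str, Any]]:
        return self._data.get(key)


def materialize(feature_rows: Iterable[Mapping[str, Any]], store: InMemoryOnlineStore,
                entity_column: str = "customer_id",
                timestamp_column: str = "computed_at") -> int:
    """Write the most recent row per entity to the online store; return entities written."""
    latest: Dict[Any, Mapping[str, Any]] = {}
    for row in feature_rows:
        entity_id = row[entity_column]
        if entity_id not in latest or row[timestamp_column] > latest[entity_id][timestamp_column]:
            latest[entity_id] = row
    for entity_id, row in latest.items():
        # Keep the timestamp alongside the features so serving code can check staleness.
        features = {k: v for k, v in row.items() if k != entity_column}
        store.put(f"customer:{entity_id}", features)
    return len(latest)


# Rows as they might be read from a dbt-built feature table (hypothetical columns).
rows = [
    {"customer_id": "c1", "computed_at": datetime(2027, 7, 10), "days_since_last_login": 4},
    {"customer_id": "c1", "computed_at": datetime(2027, 7, 11), "days_since_last_login": 5},
    {"customer_id": "c2", "computed_at": datetime(2027, 7, 11), "days_since_last_login": 0},
]

store = InMemoryOnlineStore()
written = materialize(rows, store)
served = store.get("customer:c1")
```

Because the online values are copied from the same table that training reads, the computation logic exists in one place and only the storage layer differs between the two paths.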

Our data architecture and data engineering practice designs feature engineering architectures that eliminate training-serving skew and scale to production ML workloads — contact us to discuss your ML data architecture.
