Feature engineering is the process of transforming raw data into the input variables, or features, that a machine learning model uses to make predictions. It sits between raw data collection and model training, and it is often where the most consequential work in applied machine learning happens. This guide covers the core techniques, the training-serving skew problem, and the role of feature stores in production ML systems.
A model is only as good as the features it receives. A sophisticated neural network trained on poorly engineered features will often be outperformed by a simpler logistic regression trained on well-engineered ones. Understanding what feature engineering is, and why it matters, is a prerequisite to understanding why production ML systems require substantial data infrastructure beyond the model itself.
What Features Are
A feature is any measurable property of the data that a model uses as input. For a customer churn prediction model, features might include: number of logins in the last 30 days, time since last purchase, number of support tickets in the last 90 days, average order value over lifetime, days since account creation, change in purchase frequency between the last two quarters.
None of these features exist directly in a transactional database. The database has individual login events, individual purchase records, individual support tickets. Feature engineering converts these raw events into the aggregated, derived representations that a model can learn from.
Core Feature Engineering Techniques
**Aggregations and rolling windows** — computing statistics over time windows: count of events in last 7 days, sum of purchases in last 30 days, average session length in last 60 days. Rolling window aggregations are among the most common and most valuable features in time-series and behavioral prediction models.
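As a rough illustration, a rolling window aggregation can be sketched in plain Python. The function name `rolling_window_stats`, the event layout, and the sample purchases are all hypothetical; at real scale this logic would usually live in SQL or a distributed engine.

```python
from datetime import datetime, timedelta

def rolling_window_stats(events, as_of, days):
    """Count and sum event amounts in the window (as_of - days, as_of]."""
    start = as_of - timedelta(days=days)
    in_window = [amount for ts, amount in events if start < ts <= as_of]
    return {"count": len(in_window), "sum": sum(in_window)}

# Hypothetical purchase events for one customer: (timestamp, amount).
purchases = [
    (datetime(2024, 1, 2), 20.0),
    (datetime(2024, 1, 20), 35.0),
    (datetime(2024, 1, 28), 15.0),
]
as_of = datetime(2024, 1, 31)

features = {
    "purchases_7d": rolling_window_stats(purchases, as_of, 7),
    "purchases_30d": rolling_window_stats(purchases, as_of, 30),
}
```

Passing `as_of` explicitly, rather than reading the current time inside the function, is a deliberate choice here: the same code can then compute features for any historical moment.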
**Encoding categorical variables** — machine learning models operate on numbers, not strings. Categorical variables (country, product category, device type) must be converted to numeric representations. One-hot encoding creates a binary column for each category value. Label encoding assigns an integer to each category. Target encoding replaces category values with the mean target value for that category, capturing predictive signal without expanding dimensionality.
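The three encodings can be compared side by side in a small sketch. The rows, the `country` and `churned` columns, and the helper names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical training rows with one categorical column and a binary target.
rows = [
    {"country": "US", "churned": 1},
    {"country": "DE", "churned": 0},
    {"country": "US", "churned": 0},
    {"country": "FR", "churned": 1},
]
categories = sorted({r["country"] for r in rows})  # fixed category order

def one_hot(value):
    """One binary column per known category."""
    return [1 if value == c else 0 for c in categories]

# Label encoding: one integer per category.
label_index = {c: i for i, c in enumerate(categories)}

# Target encoding: mean of the target within each category.
totals = defaultdict(lambda: [0, 0])  # category -> [target_sum, row_count]
for r in rows:
    totals[r["country"]][0] += r["churned"]
    totals[r["country"]][1] += 1
target_mean = {c: s / n for c, (s, n) in totals.items()}
```

Two caveats apply. Label encoding implies an ordering between categories that usually does not exist, which can mislead linear models. Target encoding uses the label itself, so the means should be computed only from training data (ideally out of fold) to avoid leaking the target into the feature.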
**Temporal features** — extracting useful signals from timestamps: day of week, hour of day, is_weekend, days_since_last_event, month, quarter, season. Temporal features capture behavioral patterns that recur over time cycles.
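A minimal sketch of timestamp decomposition, assuming naive datetimes that are already in a consistent time zone. The function name and the sample timestamps are hypothetical.

```python
from datetime import datetime

def temporal_features(ts, last_event_ts):
    """Derive calendar and recency features from an event timestamp."""
    return {
        "day_of_week": ts.weekday(),          # Monday = 0, Sunday = 6
        "hour_of_day": ts.hour,
        "is_weekend": ts.weekday() >= 5,
        "month": ts.month,
        "quarter": (ts.month - 1) // 3 + 1,
        "days_since_last_event": (ts - last_event_ts).days,
    }

feats = temporal_features(
    datetime(2024, 3, 16, 14, 30),   # a Saturday afternoon
    datetime(2024, 3, 1, 9, 0),
)
```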
**Interaction features** — multiplying or combining two features to capture their interaction effect. Purchase frequency multiplied by average order value gives a revenue rate feature that captures both dimensions of purchasing behavior in a single number.
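The revenue rate example reduces to a single multiplication. The customer values below are made up, and `revenue_rate` is an illustrative name.

```python
def revenue_rate(purchases_per_month, avg_order_value):
    """Interaction feature: purchase frequency times average order value."""
    return purchases_per_month * avg_order_value

# Hypothetical customers: (purchases per month, average order value).
customers = {"a": (4.0, 25.0), "b": (1.0, 100.0), "c": (0.5, 40.0)}
rates = {cid: revenue_rate(freq, aov) for cid, (freq, aov) in customers.items()}
```

Customers `a` and `b` end up with the same revenue rate despite very different behavior, which is one reason the original features are usually kept alongside the interaction term.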
**Normalization and scaling** — many models are sensitive to the scale of input features. Distance-based models (k-NN, SVM), linear models, and neural networks benefit from features on similar scales. Min-max scaling compresses features to a 0-1 range. Standardization scales features to mean zero and standard deviation one. Tree-based models (gradient boosting, random forests) are scale-invariant and typically do not require normalization.
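Both scalers fit in a few lines. This sketch uses the population standard deviation and made-up order values; the function names are illustrative.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]

def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]

order_values = [10.0, 20.0, 30.0, 40.0, 50.0]
scaled = min_max_scale(order_values)
z_scores = standardize(order_values)
```

In a real pipeline the minimum, maximum, mean, and standard deviation would be estimated on the training set only and then reused unchanged at serving time, rather than recomputed per batch as this sketch does.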
**Handling missing values** — real-world data has gaps. Imputation strategies include replacing nulls with mean/median/mode, using model-based imputation (predicting the missing value from other features), or encoding missingness as an explicit binary feature (is_null) which may itself be predictive.
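Median imputation combined with an explicit missingness flag might look like the following. The column `days_since_purchase` and the helper name are hypothetical, and `None` stands in for a null.

```python
from statistics import median

def impute_with_indicator(values):
    """Fill missing entries with the median and flag which entries were missing."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    imputed = [fill if v is None else v for v in values]
    is_null = [1 if v is None else 0 for v in values]
    return imputed, is_null

days_since_purchase = [3, None, 10, 7, None]
imputed, is_null = impute_with_indicator(days_since_purchase)
```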
**Text and unstructured features** — converting text to numerical representations: TF-IDF vectors for keyword frequency, word embeddings (Word2Vec, GloVe) for semantic meaning, transformer embeddings (BERT, Sentence Transformers) for high-dimensional semantic representations. Extracting features from images, audio, or documents similarly requires transforming unstructured content into numeric vectors.
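To make the TF-IDF idea concrete, here is a bare-bones version using whitespace tokenization and the unsmoothed formula tf × log(N / df). Library implementations typically apply smoothing and normalization, so their numbers will differ. The support ticket texts are invented.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, one TF-IDF vector per document)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vocab = sorted(doc_freq)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([
            (counts[t] / len(tokens)) * math.log(n_docs / doc_freq[t])
            for t in vocab
        ])
    return vocab, vectors

tickets = ["refund not received", "app crash on login", "refund request"]
vocab, vectors = tfidf(tickets)
```

Terms that appear in many documents, such as "refund" here, receive lower weights than terms unique to one document.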
**Lag features** — in time-series prediction, the value of a variable at previous time steps is often the most predictive feature for its current value. Revenue last week, revenue two weeks ago, and revenue last month are lag features for this week's revenue prediction.
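A small sketch of lag construction for a weekly series, using lags of 1, 2, and 4 weeks to approximate "last week", "two weeks ago", and "last month". The revenue numbers and the `add_lags` helper are illustrative.

```python
def add_lags(series, lags):
    """Build one row per time step with the target and its lagged values."""
    rows = []
    for t, value in enumerate(series):
        row = {"target": value}
        for lag in lags:
            # None marks steps where the series has no history that far back.
            row[f"lag_{lag}"] = series[t - lag] if t - lag >= 0 else None
        rows.append(row)
    return rows

weekly_revenue = [100, 120, 90, 130, 150]
rows = add_lags(weekly_revenue, lags=(1, 2, 4))
```

Early rows contain `None` where no history exists yet; they are typically dropped or imputed before training.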
The Training-Serving Skew Problem
Feature engineering creates a significant operational challenge: the feature computation logic must be consistent between training and serving.
When training a model, features are computed from historical data in batch. When the model is deployed and makes real-time predictions, features must be computed from current data — and the computation must be identical. If the feature logic diverges (even subtly), the model receives inputs at serving time that differ from the distribution it was trained on, and performance degrades in ways that can be difficult to diagnose.
This is the training-serving skew problem, and it is one of the primary sources of production ML reliability issues. A feature defined as "average session length in last 30 days" during training, but computed with a slightly different SQL query or different time zone handling during serving, will take subtly different values, and those differences accumulate into systematic prediction error.
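The time zone case is easy to reproduce. In this hypothetical sketch, the training pipeline derives `is_weekend` in UTC while the serving path uses the user's local time; the function names and the UTC-8 offset are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

def is_weekend_utc(ts):
    """Training-side logic (assumed): weekend flag evaluated in UTC."""
    return ts.astimezone(timezone.utc).weekday() >= 5

def is_weekend_local(ts, utc_offset_hours):
    """Serving-side logic (assumed): weekend flag evaluated in local time."""
    local = ts.astimezone(timezone(timedelta(hours=utc_offset_hours)))
    return local.weekday() >= 5

# Saturday 02:00 UTC is still Friday 18:00 at UTC-8.
event = datetime(2024, 3, 16, 2, 0, tzinfo=timezone.utc)
training_value = is_weekend_utc(event)
serving_value = is_weekend_local(event, utc_offset_hours=-8)
```

The same event yields different feature values in the two paths, and nothing fails loudly. Only events near a day boundary are affected, which is part of why this kind of skew is hard to diagnose.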
Feature Stores
Feature stores emerged as infrastructure to solve the training-serving skew problem and make features reusable across models.
A feature store has two components:
**Offline store**: historical feature values stored in a data warehouse or data lake and used to build training datasets. Features are computed by batch jobs and stored with timestamps, enabling time-travel queries that retrieve what a feature's value would have been at any past moment. Point-in-time correctness is critical: if you are training a churn model on historical data, the features for each training example must be the values that would have been available at the time of the prediction, not values computed after the fact.
**Online store**: low-latency storage (commonly a key-value store such as Redis or DynamoDB) for real-time feature serving. When a model makes a prediction, it retrieves pre-computed features from the online store in milliseconds rather than computing them from scratch. Batch jobs keep the online store synchronized with the offline store.
The same feature definitions govern both stores, ensuring that the logic used during training is identical to the logic used during serving.
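The relationship between the two stores can be sketched with in-memory dictionaries. This is a toy model, not how any particular feature store is implemented: the `logins_30d` definition, the integer day numbers, and the function names are all hypothetical, and a plain dict stands in for Redis or DynamoDB.

```python
from bisect import bisect_right

def logins_30d(login_days, as_of_day):
    """Shared feature definition: logins in the window (as_of_day - 30, as_of_day]."""
    return sum(1 for d in login_days if as_of_day - 30 < d <= as_of_day)

# Hypothetical raw data: user -> login days as integer day numbers.
raw_logins = {"u1": [1, 10, 25, 40], "u2": [5, 38, 39]}
snapshot_days = [30, 40]  # days on which the batch job ran

# Offline store: full history of (snapshot_day, value) per user.
offline_store = {
    user: [(day, logins_30d(days, day)) for day in snapshot_days]
    for user, days in raw_logins.items()
}

def get_historical_feature(user, as_of_day):
    """Point-in-time lookup: latest snapshot at or before as_of_day."""
    history = offline_store[user]
    idx = bisect_right([day for day, _ in history], as_of_day)
    return history[idx - 1][1] if idx else None

# Online store: only the latest value per user, materialized from offline.
online_store = {user: history[-1][1] for user, history in offline_store.items()}

def get_online_feature(user):
    """Serving-time lookup: no computation, just a key read."""
    return online_store[user]
```

A training example labeled on day 35 receives the day-30 snapshot rather than the day-40 one, which is the point-in-time guarantee. Because both stores are populated from the single `logins_30d` definition, the value served online matches the latest offline value.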
**Feast** (open source, originally developed by Gojek) is one of the most widely adopted open-source feature stores. Tecton is a leading commercial feature store platform. Databricks Feature Store is integrated into the Databricks ML environment. Vertex AI Feature Store is Google Cloud's managed offering.
Feature Selection
Feature engineering creates features; feature selection determines which of them the model actually uses. Including too many features, especially irrelevant or redundant ones, can reduce model performance through overfitting, increases inference latency, and creates unnecessary maintenance burden.
Feature importance scores (from tree-based models), permutation importance, and correlation analysis help identify which features contribute most to predictive power. L1 regularization (Lasso) can drive the weights of irrelevant features to exactly zero during training, performing implicit selection. Recursive feature elimination repeatedly removes the least important features and re-evaluates model performance at each step.
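Of these methods, correlation analysis is the simplest to show. The sketch below ranks hypothetical features by the absolute Pearson correlation with a churn label; the feature names and values are invented, and this is only a first-pass filter since it misses nonlinear and interaction effects.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

# Hypothetical feature columns and churn labels for six customers.
features = {
    "support_tickets_90d": [0, 1, 4, 5, 0, 6],
    "logins_30d": [20, 18, 2, 1, 25, 3],
    "account_id_last_digit": [3, 7, 1, 8, 4, 2],
}
churned = [0, 0, 1, 1, 0, 1]

scores = {name: abs(pearson(col, churned)) for name, col in features.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

The two behavioral features score high while the arbitrary ID digit scores low, making it a candidate for removal.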
Why Feature Engineering Matters to Data Infrastructure
Feature engineering is not just a machine learning concern — it has significant data infrastructure implications.
The computation of rolling window aggregations across billions of events requires distributed processing (Spark, Flink, or warehouse-based computation). The storage of point-in-time correct historical features requires careful pipeline design. The synchronization between offline and online stores requires orchestration and monitoring. And the re-use of features across multiple models requires a cataloguing and discovery layer — the feature store metadata — so data scientists can find existing features rather than recomputing them.
Organizations investing in ML at scale often find that the data infrastructure for feature engineering demands more effort than the infrastructure for model training and serving.
Our data architecture practice designs data infrastructure for machine learning use cases, including feature engineering pipelines and feature store architecture — contact us to discuss ML data infrastructure requirements.