Predictive analytics extends the data architecture beyond descriptive and diagnostic reporting into forward-looking outputs: churn probability scores, demand forecasts, lead scoring, fraud risk ratings. The data infrastructure challenge is not the modelling — it is ensuring that predictions are computed reliably, updated on the right cadence, and integrated into the systems where decisions are actually made. Most predictive analytics projects fail not because the models are wrong, but because the infrastructure around the models breaks down.
The Anatomy of a Predictive Analytics System
A predictive analytics system has four layers, each with distinct infrastructure requirements:
**Data preparation layer** — assembling the features that will be used in the model. This includes raw data extraction from source systems, feature engineering (deriving calculated attributes from raw data), and the temporal joins required to assemble features as they would have been observed at a specific historical point in time. The last requirement is the most frequently missed: using future information to train a model on historical data (data leakage) produces models that appear to perform well in training but fail in production.
**Model training layer** — the compute environment that trains the model on historical data. The key infrastructure requirements are: reproducible training runs (same data, same code, same hyperparameters produce the same model); experiment tracking (logging the parameters and performance metrics of each training run); and model registry (versioning trained models and tracking which version is deployed in which environment).
**Prediction serving layer** — generating predictions from the trained model. Predictions can be computed in two modes: batch scoring (running the model over the full customer or product population on a schedule, storing predictions in the data warehouse) and online inference (generating a prediction for a specific entity in real time, usually via API). Most business applications are adequately served by batch scoring; real-time inference adds significant infrastructure complexity and should only be used when the latency requirement genuinely cannot be met by batch.
**Integration layer** — delivering predictions to the systems where decisions are made. A churn probability score stored in the data warehouse but not surfaced in the CRM does not affect churn decisions. A fraud risk score computed every morning but not accessible to transaction monitoring does not affect fraud prevention. The integration layer connects predictions to the applications, alerts, and workflows that operationalise them.
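The temporal join described under the data preparation layer can be sketched with pandas. This is a minimal illustration, not a production pipeline: the table names, column names, and sample values (`labels`, `snapshots`, `logins_30d`, `churned_within_30d`) are hypothetical, and it assumes feature values are stored as timestamped snapshots. `pandas.merge_asof` with `direction="backward"` attaches to each observation the most recent snapshot computed at or before the observation time, so no later information leaks into the training row.

```python
import pandas as pd

# Hypothetical label table: one row per (customer, observation date) with the outcome.
labels = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "observed_at": pd.to_datetime(["2024-02-01", "2024-03-01", "2024-02-01"]),
    "churned_within_30d": [0, 1, 0],
})

# Hypothetical feature snapshots: the feature value as of the time it was computed.
snapshots = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "computed_at": pd.to_datetime(
        ["2024-01-15", "2024-02-10", "2024-03-05", "2024-01-20", "2024-02-15"]
    ),
    "logins_30d": [12, 7, 1, 30, 28],
})


def point_in_time_join(labels: pd.DataFrame, snapshots: pd.DataFrame) -> pd.DataFrame:
    """Attach to each label row the latest snapshot at or before its observation time."""
    # merge_asof requires both frames to be sorted by their time keys.
    left = labels.sort_values("observed_at")
    right = snapshots.sort_values("computed_at")
    return pd.merge_asof(
        left,
        right,
        left_on="observed_at",
        right_on="computed_at",
        by="customer_id",        # match snapshots within the same customer only
        direction="backward",    # never pick a snapshot from after the observation
    )


training_set = point_in_time_join(labels, snapshots)
print(training_set)
```

A plain join on `customer_id` that takes the latest snapshot per customer would instead hand every historical observation the 2024-03-05 value for customer 1, which is exactly the leakage the data preparation layer has to prevent.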
Common Data Architecture Patterns
**Churn prediction** requires a user activity table at the customer grain, with feature engineering that creates lagged metrics (usage in the last 7 days, usage in the last 30 days, change in usage over the last 30 days). The temporal join requirement is critical: features must be computed using only data that would have been available at the time the prediction was made. A churn model trained with next-month usage data as a feature is not predictive — it is describing churn that has already happened.
**Demand forecasting** uses time-series data at the product-location-day grain. The feature engineering includes seasonal components (day of week, month of year, holiday flags), trend components, and exogenous variables (promotions, pricing, weather where relevant). Demand forecasts need to be stored with their uncertainty bounds, not just point estimates: a forecast of 100 units with a prediction interval of [70, 130] carries different inventory implications than a forecast of 100 units with an interval of [95, 105].
**Lead scoring** uses CRM activity data, website behaviour data, and firmographic attributes to predict the probability that a lead will convert to a closed opportunity. The integration point is the CRM: scores need to be written back to the lead or contact record in Salesforce or HubSpot so that sales reps see them in their normal workflow, not in a separate analytics tool.
**Customer segmentation** is a related but distinct use case: instead of predicting a binary outcome or a continuous value, it assigns customers to segments based on behavioural similarity. Segmentation models (k-means, hierarchical clustering) produce segment assignments that need to be stored per customer and refreshed as behaviour changes. The challenge is stability: segments that change substantially on each refresh are confusing for the marketing and sales teams that act on them.
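The lagged usage metrics described for churn prediction can be sketched as follows. This is an illustrative sketch under assumptions: the event table layout (`customer_id`, `event_time`, `units`), the function name `usage_features`, and the sample values are all hypothetical. The point it shows is that every window is anchored to an explicit `as_of` cutoff, so an event recorded after the cutoff can never contribute to a feature.

```python
import pandas as pd


def usage_features(events: pd.DataFrame, as_of: pd.Timestamp) -> pd.DataFrame:
    """Per-customer trailing usage, using only events strictly before `as_of`."""
    past = events[events["event_time"] < as_of]

    def trailing_sum(start_days: int, end_days: int) -> pd.Series:
        # Half-open window [as_of - start_days, as_of - end_days).
        lo = as_of - pd.Timedelta(days=start_days)
        hi = as_of - pd.Timedelta(days=end_days)
        in_window = past[(past["event_time"] >= lo) & (past["event_time"] < hi)]
        return in_window.groupby("customer_id")["units"].sum()

    # One row per known customer, including those with no recent activity.
    customers = pd.Index(sorted(events["customer_id"].unique()), name="customer_id")
    out = pd.DataFrame(index=customers)
    out["usage_7d"] = trailing_sum(7, 0)
    out["usage_30d"] = trailing_sum(30, 0)
    out["usage_prev_30d"] = trailing_sum(60, 30)
    out = out.fillna(0)  # no events in a window means zero usage
    out["usage_change_30d"] = out["usage_30d"] - out["usage_prev_30d"]
    return out


# Hypothetical sample data; the 2024-03-05 event lies after the cutoff.
events = pd.DataFrame({
    "customer_id": [1, 1, 1, 1, 2],
    "event_time": pd.to_datetime(
        ["2024-02-28", "2024-02-10", "2024-01-15", "2024-03-05", "2024-01-20"]
    ),
    "units": [5, 3, 10, 100, 4],
})

features = usage_features(events, pd.Timestamp("2024-03-01"))
print(features)
```

Calling the same function once per historical observation date produces training features, and calling it with the scoring date produces inference features, which keeps the window logic identical in both cases.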
The Training-Serving Skew Problem
Training-serving skew is one of the most common failure modes in production ML systems, and it is a data infrastructure problem rather than a modelling problem. It occurs when the features computed at training time are computed differently from the features computed at inference time, producing inconsistent inputs to the model.
The most common causes:
**Different data sources** — training uses a historical extract from the data warehouse; inference uses a real-time API query. If the two sources compute the same feature differently — different time zone handling, different null treatment, different rounding — the model receives different inputs in training and production.
**Time window differences** — training uses a correctly computed trailing 30-day window at each historical observation point; inference uses a simple query for the last 30 days relative to query time, without anchoring the window to the same temporal boundary. The inference window can then cover data that falls outside the boundary the training windows respected, so the same feature carries different values in training and in production.
**Schema drift** — training data was prepared when the source schema had a specific structure; the source schema changed after training and the inference pipeline was not updated.
The standard solution is a feature store: a system that centralises feature computation and serves the same computed features to both training and inference pipelines. Without a feature store, preventing training-serving skew requires careful engineering discipline across every feature in every model.
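The core idea behind a feature store, one definition per feature shared by both pipelines, can be illustrated without any particular product. The sketch below is a deliberately small in-process stand-in, not a feature store implementation: the registry `FEATURES`, the helper names, and the event tuple layout are hypothetical. Both the training path and the scoring path call the same `feature_vector` function, so a change to a window or a null rule applies to both at once.

```python
from datetime import datetime, timedelta

# Hypothetical event record layout: (customer_id, event_time, units).


def usage_in_window(events, customer_id, as_of, days):
    """Sum units for one customer in the half-open window [as_of - days, as_of)."""
    start = as_of - timedelta(days=days)
    return sum(
        units
        for cid, ts, units in events
        if cid == customer_id and start <= ts < as_of
    )


# Single registry of feature definitions: the only place a feature is defined.
FEATURES = {
    "usage_7d": lambda ev, cid, as_of: usage_in_window(ev, cid, as_of, 7),
    "usage_30d": lambda ev, cid, as_of: usage_in_window(ev, cid, as_of, 30),
}


def feature_vector(events, customer_id, as_of):
    """Compute every registered feature for one customer at one point in time."""
    return {name: fn(events, customer_id, as_of) for name, fn in FEATURES.items()}


def build_training_rows(events, observations):
    """Training path. `observations` is a list of (customer_id, observed_at, label)."""
    return [
        {**feature_vector(events, cid, observed_at), "label": label}
        for cid, observed_at, label in observations
    ]


def features_for_scoring(events, customer_id, now):
    """Inference path. Uses the same definitions as training by construction."""
    return feature_vector(events, customer_id, now)
```

A usage example with made-up events: building a training row and a scoring request for the same customer and timestamp yields identical feature values, which is the property that separately written training and inference queries tend to lose.

```python
events = [
    (1, datetime(2024, 2, 28), 5.0),
    (1, datetime(2024, 2, 10), 3.0),
    (1, datetime(2024, 3, 5), 100.0),  # after the cutoff, must be ignored
]
cutoff = datetime(2024, 3, 1)
train_row = build_training_rows(events, [(1, cutoff, 1)])[0]
serve_row = features_for_scoring(events, 1, cutoff)
print(train_row, serve_row)
```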
Model Performance Monitoring
Predictive models degrade over time as the real world changes. A churn model trained on data from a period of organic growth may not accurately predict churn during an economic downturn. A demand forecasting model trained on history that includes the pandemic period, without treating that period as an exception, will learn distorted seasonality parameters.
Model performance monitoring requires:
**Outcome labels** — the actual outcomes that the model was predicting. For churn models, this means knowing which customers actually churned. Outcome data arrives with a lag at least as long as the prediction horizon: a 30-day churn prediction cannot be evaluated until 30 days after it was made.
**Prediction vs. actuals tracking** — a table that records, for each prediction made, the predicted probability or value and the actual outcome when it is available. This table is the foundation for performance metrics (AUC, RMSE, calibration) computed on a rolling basis.
**Distribution monitoring** — tracking the distribution of input features over time. If the distribution of a key feature shifts significantly, the model may be receiving inputs it was not trained to handle. Distribution shifts often precede outcome-level performance degradation and can be detected earlier.
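Two of the monitoring requirements above lend themselves to a short sketch. The first function computes AUC directly from a predictions-versus-actuals table by comparing every positive with every negative; it assumes the caller has already filtered to predictions whose outcome window has elapsed. The second computes the Population Stability Index (PSI), which is one commonly used drift measure and is a choice made here for illustration, since the text above does not prescribe a specific metric. Function names and the number of bins are assumptions.

```python
import numpy as np


def auc(scores, labels) -> float:
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels).astype(bool)
    pos, neg = scores[labels], scores[~labels]
    if len(pos) == 0 or len(neg) == 0:
        return float("nan")  # undefined without both classes
    # Pairwise comparison is O(n_pos * n_neg): fine for a sketch, not for huge tables.
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((wins + 0.5 * ties) / (len(pos) * len(neg)))


def population_stability_index(baseline, current, bins: int = 10) -> float:
    """PSI between a baseline (training-time) sample and a current sample of one feature."""
    baseline = np.asarray(baseline, dtype=float)
    current = np.asarray(current, dtype=float)
    # Bin edges come from baseline quantiles; duplicates from tied values are dropped.
    edges = np.unique(np.quantile(baseline, np.linspace(0, 1, bins + 1)))
    # Keep interior edges only, so values outside the baseline range land in the end bins.
    inner = edges[1:-1]
    n_bins = len(inner) + 1
    b_counts = np.bincount(np.searchsorted(inner, baseline, side="right"), minlength=n_bins)
    c_counts = np.bincount(np.searchsorted(inner, current, side="right"), minlength=n_bins)
    eps = 1e-6  # avoids log(0) when a bin is empty in one sample
    b_frac = np.clip(b_counts / b_counts.sum(), eps, None)
    c_frac = np.clip(c_counts / c_counts.sum(), eps, None)
    return float(np.sum((c_frac - b_frac) * np.log(c_frac / b_frac)))
```

In practice both values would be computed on a rolling basis and stored alongside the model version, so that a rising PSI on a key feature can be investigated before enough outcome labels have matured to show a drop in AUC.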
Our data architecture practice designs predictive analytics infrastructure for organisations ready to move from descriptive to forward-looking analytics — contact us to discuss your predictive analytics programme.