A/B Testing Data Infrastructure: Experiment Analysis at Scale

The data infrastructure required to run reliable A/B tests — experiment assignment logging, metric computation pipelines, statistical analysis at scale, and the governance framework that prevents p-hacking and novelty effects from polluting your experiment results.

A/B testing (controlled experimentation) is one of the highest-return investments a product organisation can make. But the validity of experiment results depends entirely on the quality of the data infrastructure supporting them. A poorly designed experiment data pipeline produces misleading results — false positives that ship features that do not actually work, false negatives that kill features that would have delivered value.

This guide covers the data infrastructure required to run reliable A/B tests at scale: experiment assignment logging, metric computation, statistical analysis, and the governance framework that prevents common experiment errors.

The Data Pipeline for A/B Testing

Reliable experiment analysis requires a specific data pipeline:

**Step 1: Experiment assignment.** When a user is assigned to a variant (control or treatment), the assignment must be logged immediately with user_id, experiment_id, variant_id, and assignment timestamp. This is the assignment table: the ground truth of who was in which group.

Assignment logging must be complete and accurate. A common failure mode: assignments are logged from the server but not from all client platforms, so the assignment table captures only a fraction (say 60%) of assignments. The missing rows are rarely a random sample, since they tend to come from specific platforms, so analysis on incomplete assignments is biased rather than merely noisier.
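A minimal sketch of an assignment record and a per-platform coverage check follows. The `platform` field, the `expected_counts` input (for example, counts taken from request logs), and the 0.99 threshold are assumptions for illustration, not part of the minimal schema described above.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Assignment:
    user_id: str
    experiment_id: str
    variant_id: str
    assigned_at: datetime
    platform: str  # assumption: added so coverage can be audited per platform


def coverage_by_platform(assignments, expected_counts):
    """Return the ratio of logged to expected assignments for each platform."""
    logged = Counter(a.platform for a in assignments)
    return {
        platform: logged.get(platform, 0) / expected
        for platform, expected in expected_counts.items()
        if expected > 0
    }


def flag_incomplete(coverage, min_ratio=0.99):
    """List platforms whose logged share falls below the (hypothetical) threshold."""
    return sorted(p for p, ratio in coverage.items() if ratio < min_ratio)
```

Running a check like this on a schedule surfaces a platform that has silently stopped logging before the gap reaches an experiment readout.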

**Step 2: Event collection.** User behaviour after assignment must be captured in the event stream — the same events used for product analytics. The experiment analysis joins assignment data to event data: for each assigned user, what did they do after assignment?

**Step 3: Metric computation.** Raw events must be transformed into experiment metrics: conversion rate, revenue per user, session duration, retention. The metric definition must be precise: does revenue per user include users who generated zero revenue? Is conversion measured within 24 hours of assignment or over the experiment period?
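The join and the metric definitions can be sketched as below. This is an illustration under stated assumptions: one assignment per user, a hypothetical `purchase` event that carries revenue, revenue per user averaged over all assigned users (zero-revenue users included), and conversion counted only within a 24-hour window after assignment. A real pipeline would express the same logic in warehouse SQL.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Assignment:
    user_id: str
    variant_id: str
    assigned_at: datetime


@dataclass(frozen=True)
class Event:
    user_id: str
    name: str
    ts: datetime
    revenue: float = 0.0


def per_user_metrics(assignments, events, conversion_window=timedelta(hours=24)):
    """Join events to assignments, keeping only post-assignment events."""
    assigned_at = {a.user_id: a.assigned_at for a in assignments}
    # Every assigned user gets a row, so zero-revenue users stay in the denominator.
    metrics = {
        a.user_id: {"variant": a.variant_id, "revenue": 0.0, "converted": False}
        for a in assignments
    }
    for e in events:
        start = assigned_at.get(e.user_id)
        if start is None or e.ts < start:
            continue  # unassigned user, or behaviour before assignment
        row = metrics[e.user_id]
        row["revenue"] += e.revenue
        if e.name == "purchase" and e.ts - start <= conversion_window:
            row["converted"] = True
    return metrics


def summarise(metrics):
    """Aggregate per-user rows into per-variant experiment metrics."""
    totals = {}
    for row in metrics.values():
        agg = totals.setdefault(row["variant"], {"users": 0, "revenue": 0.0, "conversions": 0})
        agg["users"] += 1
        agg["revenue"] += row["revenue"]
        agg["conversions"] += int(row["converted"])
    return {
        variant: {
            "revenue_per_user": agg["revenue"] / agg["users"],
            "conversion_rate": agg["conversions"] / agg["users"],
        }
        for variant, agg in totals.items()
    }
```

Writing the window and the denominator into code forces the two definitional questions above to be answered once, rather than differently by each analyst.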

**Step 4: Statistical analysis.** With assignment-to-metric data assembled, statistical tests determine whether the treatment effect is statistically significant. At scale, this analysis runs in SQL or Python against the warehouse.

Common Data Infrastructure Failures

**Assignment-exposure contamination.** An assignment is not the same as an exposure. A user might be assigned to a treatment variant but never encounter the feature being tested — if the feature is only visible on mobile web, desktop users assigned to the treatment never see it. Including non-exposed users in the analysis dilutes the true effect and reduces statistical power.

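The solution: log exposure separately from assignment. An exposure event fires when the user reaches the point where the variant would be shown, and it must fire under the same conditions in both control and treatment. Analysis should then filter to exposed users, not all assigned users. If exposure is logged only in the treatment arm, or if the treatment itself changes who gets exposed, the filter introduces selection bias instead of removing dilution.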
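As a rough sketch, the exposure filter is a semi-join of the assignment table against the exposure log. The dictionary keys used here are assumptions about the log schema.

```python
def filter_to_exposed(assignments, exposures):
    """Keep assignments whose user has at least one exposure in the same experiment."""
    exposed = {(e["user_id"], e["experiment_id"]) for e in exposures}
    return [
        a for a in assignments
        if (a["user_id"], a["experiment_id"]) in exposed
    ]
```

The filter is applied identically to both arms; the variant is deliberately not part of the join key.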

**Novelty effect bias.** When a new feature is introduced, initial engagement is elevated because of novelty — users engage with new things more than familiar things. Running an experiment for two days and observing a lift may be capturing novelty, not a sustainable effect. Most experiment frameworks require a minimum experiment duration (typically 1–2 weeks) that spans the novelty period.

**Multiple testing (p-hacking).** If an analyst can choose which metric to report after seeing the results, the chance of finding a false positive increases with the number of metrics checked: with 20 independent metrics each tested at a 0.05 threshold, the chance of at least one false positive is about 64%. Experimentation governance requires pre-registering the primary metric before running the experiment, adjusting significance thresholds for multiple secondary metrics (Bonferroni correction or FDR control), and not peeking at results before the planned analysis date unless the analysis uses a sequential method designed for continuous monitoring.
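The two corrections named above can be sketched in a few lines. Bonferroni controls the family-wise error rate by dividing the threshold by the number of tests; Benjamini-Hochberg is used here as the FDR procedure, which is an assumption since the text does not name a specific one.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject a hypothesis only if its p-value is at most alpha / number_of_tests."""
    if not p_values:
        return []
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank with p <= rank / m * alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            largest_rank = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= largest_rank:
            rejected[i] = True
    return rejected
```

With secondary-metric p-values of 0.001, 0.012, 0.03, and 0.2, Bonferroni rejects the first two and Benjamini-Hochberg rejects the first three, which illustrates why FDR control is the less conservative choice when many secondary metrics are tracked.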

**Network effects contamination.** In social or collaborative products, users in the treatment and control groups interact with each other. A user in the control group might be influenced by treatment-group users' behaviour. Standard A/B testing assumes treatment-control independence — violation of this assumption requires cluster-based randomisation (randomise by group, not individual).

Experiment Assignment Randomisation

The statistical validity of an experiment depends on random, unbiased assignment to variants. The most common mechanism: hash-based assignment. The user_id (or device_id for pre-login) is hashed with the experiment_id, and the result modulo N determines the variant. This is deterministic (the same user always gets the same variant in the same experiment), evenly distributed, and can be computed without a central assignment service.
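A minimal sketch of hash-based assignment, assuming SHA-256 as the hash and a colon-joined key; the function name and key format are illustrative choices rather than a prescribed scheme.

```python
import hashlib


def assign_variant(unit_id, experiment_id, variants=("control", "treatment")):
    """Deterministically map a randomisation unit to a variant for one experiment."""
    # Including experiment_id in the key makes buckets independent across experiments.
    key = f"{experiment_id}:{unit_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]
```

Because the result depends only on the two IDs, any server or client holding the experiment configuration can compute the same answer without a network call. Python's built-in `hash()` is avoided on purpose: it is salted per process for strings, so it would not be stable across machines.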

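**Randomisation unit selection.** The randomisation unit (what is hashed for assignment) must be consistent with the unit of analysis. If you are measuring revenue per user, randomise and analyse at the user level. Session-level randomisation suits session-level outcomes only when a user can safely see different variants in different sessions. If you randomise by user but analyse per-session or per-pageview metrics, observations from the same user are correlated; treating them as independent understates the variance and inflates false positive rates. In that case, either aggregate the metric to the user level or use a variance estimate that accounts for the clustering, such as the delta method or cluster-robust standard errors.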
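When assignment is by user but the metric is a per-session ratio (for example, conversions per session), one way to avoid the mismatch without changing the randomisation unit is to aggregate to per-user totals and estimate the standard error of the ratio with the delta method. The sketch below assumes one numerator and one denominator total per user; the function name is hypothetical.

```python
from statistics import fmean


def ratio_with_delta_se(numerators, denominators):
    """Return (ratio, standard_error) for mean(numerators) / mean(denominators).

    Inputs are per-user totals, e.g. conversions and sessions for each user.
    """
    n = len(numerators)
    if n < 2 or n != len(denominators):
        raise ValueError("need at least two users and equal-length inputs")
    mean_y = fmean(numerators)
    mean_x = fmean(denominators)
    ratio = mean_y / mean_x
    var_y = sum((y - mean_y) ** 2 for y in numerators) / (n - 1)
    var_x = sum((x - mean_x) ** 2 for x in denominators) / (n - 1)
    cov_xy = sum(
        (x - mean_x) * (y - mean_y) for x, y in zip(denominators, numerators)
    ) / (n - 1)
    # First-order delta-method variance of a ratio of two sample means.
    var_ratio = (var_y - 2 * ratio * cov_xy + ratio ** 2 * var_x) / (n * mean_x ** 2)
    return ratio, max(var_ratio, 0.0) ** 0.5
```

The per-variant ratio and standard error can then feed a standard two-sample z-test, with users rather than sessions as the independent observations.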

**Long-term holdout groups.** Permanently hold out 1–5% of users from all experiments and from the features those experiments ship. This holdout group sees only the baseline experience. Comparing holdout-group outcomes to outcomes for the rest of the user base, who receive every shipped change, measures the cumulative effect of all experiments shipped, including interaction effects and long-term behavioural changes that short experiments cannot detect.

Statistical Methods at Scale

At scale — millions of users across hundreds of concurrent experiments — statistical analysis runs in the data warehouse:

**Z-test or t-test for proportions and means.** For binary metrics (conversion rate), the two-sample z-test for proportions is appropriate. For continuous metrics (revenue per user, session duration), use the t-test, preferably Welch's version, which does not assume equal variances. Both need only per-variant counts, means, and variances, which SQL aggregates produce directly.
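Both tests can be computed from the summary statistics a warehouse query returns. The sketch below uses only the standard library; for the Welch statistic it takes the p-value from the normal distribution, which is a large-sample approximation and an assumption that holds reasonably at the sample sizes discussed here but not for small experiments.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_c, n_c, conv_t, n_t):
    """Two-sided z-test for a difference in conversion rates (pooled standard error)."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    pooled = (conv_c + conv_t) / (n_c + n_t)
    se = sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


def welch_test_large_sample(mean_c, var_c, n_c, mean_t, var_t, n_t):
    """Welch statistic from per-variant mean, sample variance, and count.

    The p-value uses the normal approximation, valid only for large samples.
    """
    se = sqrt(var_c / n_c + var_t / n_t)
    t = (mean_t - mean_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p_value
```

For example, 1,000 conversions out of 10,000 control users against 1,100 out of 10,000 treatment users gives a z statistic of roughly 2.3.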

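**Variance reduction (CUPED).** CUPED (Controlled-experiment Using Pre-Experiment Data) reduces the variance of experiment metrics by removing the part of the metric that is predictable from pre-experiment data. If users who were more active before the experiment are more likely to convert regardless of the treatment, CUPED controls for this, reducing the sample size required for the same statistical power. In practice it is a covariate adjustment: each user's metric Y is replaced by Y minus theta times (X minus the mean of X), where X is the pre-experiment covariate and theta = cov(X, Y) / var(X), the same coefficient a regression of Y on X would give. Because X is measured before assignment, the adjustment leaves the expected treatment effect unchanged.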
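A minimal sketch of the CUPED adjustment, assuming a single pre-experiment covariate per user and a theta estimated on the pooled data from both variants; the function name is illustrative.

```python
from statistics import fmean


def cuped_adjust(metric, pre_metric):
    """Return CUPED-adjusted values: y - theta * (x - mean(x)).

    metric:     in-experiment values, one per user (both variants pooled)
    pre_metric: the same users' values from before the experiment started
    """
    mean_x = fmean(pre_metric)
    mean_y = fmean(metric)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(pre_metric, metric))
    var_x = sum((x - mean_x) ** 2 for x in pre_metric)
    if var_x == 0:
        return list(metric)  # constant covariate carries no information
    theta = cov_xy / var_x
    return [y - theta * (x - mean_x) for x, y in zip(pre_metric, metric)]
```

The adjusted values keep the same overall mean as the original metric, so the usual z-test or t-test is then run per variant on the adjusted values. The stronger the correlation between the pre-experiment covariate and the metric, the larger the variance reduction.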

**Sequential testing.** Traditional A/B testing requires fixing the analysis date in advance — peeking at results before the planned date inflates false positive rates. Sequential testing (using methods like always-valid confidence intervals or the sequential probability ratio test) allows continuous monitoring with controlled Type I error rates, enabling early stopping when results are clear.
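To make the idea concrete, here is a sketch of Wald's sequential probability ratio test in its simplest form: a single stream of binary outcomes tested against two fixed conversion rates. This is a deliberate simplification and an assumption of the example; comparing two variants in production typically uses a two-sample or mixture variant of the test, which is not shown here.

```python
from math import log


def sprt_bernoulli(observations, p0, p1, alpha=0.05, beta=0.2):
    """Wald's SPRT for H0: rate = p0 against H1: rate = p1, on 0/1 outcomes.

    Returns (decision, n_observed), where decision is "reject_h0",
    "accept_h0", or "continue" if neither boundary has been crossed.
    """
    upper = log((1 - beta) / alpha)   # crossing it favours H1
    lower = log(beta / (1 - alpha))   # crossing it favours H0
    llr = 0.0  # running log-likelihood ratio of H1 to H0
    n = 0
    for n, x in enumerate(observations, start=1):
        llr += log(p1 / p0) if x else log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "reject_h0", n
        if llr <= lower:
            return "accept_h0", n
    return "continue", n
```

The boundaries are fixed before any data arrive, which is what allows the running statistic to be checked after every observation while keeping the Type I error rate approximately at `alpha`.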

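**Bayesian analysis.** Bayesian experiment analysis produces posterior distributions over treatment effects rather than binary significant/not-significant decisions. The output is directly interpretable (for example, the probability that the treatment is better than control), and the analysis does not require fixing a sample size in advance. The conclusions do depend on the chosen prior, and a rule that stops the experiment the moment a posterior threshold is crossed can still produce more wrong decisions than intended, so stopping rules should be agreed before launch.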
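For a conversion metric, one common setup is a Beta-Binomial model, sketched below. The uniform Beta(1, 1) prior, the Monte Carlo estimate, and the function name are assumptions made for illustration.

```python
import random


def prob_treatment_beats_control(
    conv_c, n_c, conv_t, n_t, draws=20000, seed=0, prior_a=1.0, prior_b=1.0
):
    """Estimate P(treatment rate > control rate) under independent Beta posteriors."""
    rng = random.Random(seed)  # seeded so repeated runs give the same estimate
    wins = 0
    for _ in range(draws):
        rate_c = rng.betavariate(prior_a + conv_c, prior_b + n_c - conv_c)
        rate_t = rng.betavariate(prior_a + conv_t, prior_b + n_t - conv_t)
        wins += rate_t > rate_c
    return wins / draws
```

The returned value is the "probability the treatment is better than control" referred to above. The same posterior draws can also be used to summarise the size of the lift, not just its sign.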

Experiment Platform Architecture

At scale, experiments require a dedicated infrastructure layer:

**Assignment service.** A high-throughput service that returns variant assignments given a user ID and experiment ID. Backed by a fast store (Redis, Memcached) that caches active experiment configurations. Latency must be under a few milliseconds for client-facing experiments.

**Experiment registry.** A centralised store of all experiments — their hypotheses, metrics, randomisation units, start and end dates, and current status. The registry is the governance layer: experiments must be registered before they start, with primary metrics and success criteria documented.
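A registry entry and a pre-launch check might look like the sketch below. The field names, the allowed randomisation units, and the seven-day minimum duration are hypothetical; a real registry would set these according to its own governance policy.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ExperimentSpec:
    experiment_id: str
    hypothesis: str
    primary_metric: str
    randomisation_unit: str
    start_date: date
    end_date: date
    secondary_metrics: list = field(default_factory=list)
    status: str = "draft"


def validate_for_launch(spec, min_days=7, allowed_units=("user", "device", "session")):
    """Return a list of governance problems; an empty list means the spec may launch."""
    problems = []
    if not spec.hypothesis.strip():
        problems.append("hypothesis is missing")
    if not spec.primary_metric.strip():
        problems.append("primary metric is not pre-registered")
    if spec.primary_metric in spec.secondary_metrics:
        problems.append("primary metric is also listed as secondary")
    if spec.randomisation_unit not in allowed_units:
        problems.append(f"unknown randomisation unit: {spec.randomisation_unit}")
    if (spec.end_date - spec.start_date).days < min_days:
        problems.append(f"planned duration is shorter than {min_days} days")
    return problems
```

Having the assignment service refuse to serve experiments whose spec fails this check is one way to turn the registry from documentation into enforcement.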

**Analysis pipeline.** A scheduled pipeline that runs experiment analysis daily, computing metrics for each active experiment and flagging results for review. Results are published to a dashboard that experiment owners can monitor.

**Experiment-aware BI tooling.** Tableau, Looker, or Superset dashboards that surface experiment results with confidence intervals, power calculations, and metric history. Making experiment results accessible without requiring SQL reduces the analysis bottleneck for product teams.

For data engineering teams building experimentation infrastructure, our data architecture consulting practice can help design the data pipeline and analysis layer — contact us to discuss your requirements.
