A/B testing produces reliable results only when the data infrastructure behind it correctly handles experiment assignment, exposure logging, metric calculation, and statistical analysis. Mistakes at any layer produce false positives — finding effects that do not exist — or false negatives — missing effects that do. Most A/B testing infrastructure has at least one systematic flaw, and the most common flaws are not obvious from looking at the tooling alone.
The Exposure Logging Problem
The most common infrastructure flaw in A/B testing is incorrect exposure logging: recording that a user was exposed to an experiment when they were not, or failing to record exposures when users were genuinely exposed.
**Over-logging** occurs when an experiment is logged at the page load event rather than at the point of actual feature exposure. A user who loaded a page with an experimental feature but navigated away before seeing it gets logged as exposed. This dilutes the treatment group with non-exposed users, which attenuates the measured effect size — making real effects look smaller or disappear.
**Under-logging** occurs when network failures, ad blockers, or client-side rendering issues prevent exposure events from being recorded. Users who were exposed but not logged are excluded from the analysis, which biases the experiment if the logging failure is correlated with the variant or with user characteristics that affect the outcome.
**Assignment without exposure** occurs in server-side experiments where assignment happens at request time but the feature being tested is conditionally rendered. A user assigned to the treatment group who never triggered the condition that shows the feature is technically in the treatment group but has not been exposed to the treatment. Including them in the analysis adds noise.
The correct exposure logging approach is to log the event at the moment the user encounters the experimental stimulus — not at assignment, not at page load, but at the precise point where the treatment is shown or the feature is activated.
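A minimal sketch of logging at the point of exposure, under stated assumptions: `ExposureLogger`, `render_checkout`, and the `checkout_banner` experiment are hypothetical names invented for illustration, and the in-memory list stands in for a real event stream. The point it tries to show is that the log call sits next to the conditional render, fires for control and treatment alike, and fires at most once per user and experiment.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ExposureLogger:
    """Hypothetical logger: keeps one exposure event per (user, experiment)."""
    events: list = field(default_factory=list)
    _seen: set = field(default_factory=set)

    def log_exposure(self, user_id: str, experiment_id: str, variant: str) -> bool:
        key = (user_id, experiment_id)
        if key in self._seen:  # already logged: keep the first exposure only
            return False
        self._seen.add(key)
        self.events.append({
            "user_id": user_id,
            "experiment_id": experiment_id,
            "variant": variant,
            "exposed_at": datetime.now(timezone.utc).isoformat(),
        })
        return True


def render_checkout(user_id: str, variant: str, cart_is_empty: bool,
                    logger: ExposureLogger) -> str:
    # Assumption: the experimental banner only appears when the cart has items.
    if cart_is_empty:
        return "empty-cart page"  # assigned but never exposed: nothing is logged
    # Log for BOTH variants at the point where the treatment would be shown,
    # so the control group is filtered by the same condition.
    logger.log_exposure(user_id, "checkout_banner", variant)
    return "checkout with new banner" if variant == "treatment" else "checkout"
```

Logging control users at the same trigger point matters as much as logging treatment users: it keeps the two analysis populations comparable.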
Randomisation Unit Selection
The randomisation unit — the entity that gets assigned to treatment or control — has major consequences for experiment validity.
**User-level randomisation** is the default and correct choice for most experiments. Each user is randomly assigned to treatment or control the first time they become eligible for the experiment, and that assignment is sticky across sessions. User-level randomisation ensures that each user's experience is consistent (they always see the same variant) and that observations from the same user are not split across treatment and control.
**Session-level randomisation** assigns users to different variants in different sessions. This is appropriate only for experiments where the treatment is stateless and session-independent — which is rare. Session-level randomisation means the same user can be in both treatment and control across different sessions, which violates the independence assumption underlying standard statistical tests.
**Cookie-level randomisation** treats new cookies as new users. Users who clear cookies, use private browsing, or switch devices get re-randomised. In products with significant anonymous-to-authenticated user flows, cookie-level randomisation produces substantial noise and underestimates true user-level effects.
For most product experiments, the correct choice is authenticated user ID as the randomisation unit, with assignment stored server-side (not in cookies), so it persists across devices and sessions.
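One way to sketch user-ID assignment with server-side persistence is shown below. This is an illustration, not a reference implementation: `assign_variant` and `AssignmentStore` are hypothetical names, the dictionary stands in for a real server-side table, and SHA-256 is simply one reasonable choice of hash.

```python
import hashlib


def assign_variant(user_id: str, experiment_id: str, treatment_share: float = 0.5) -> str:
    """Map a user to a variant deterministically from (experiment, user)."""
    # Including the experiment ID in the hashed key gives each experiment
    # its own independent split of the same user base.
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


class AssignmentStore:
    """In-memory stand-in for a server-side assignment table (assumption)."""

    def __init__(self):
        self._rows = {}

    def get_or_assign(self, user_id: str, experiment_id: str) -> str:
        key = (user_id, experiment_id)
        if key not in self._rows:
            self._rows[key] = assign_variant(user_id, experiment_id)
        # The stored value wins, so the assignment stays stable even if the
        # allocation percentage is changed later.
        return self._rows[key]
```

Because the key is the authenticated user ID rather than a cookie, the same user gets the same variant on every device and in every session.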
The Metric Calculation Layer
The metrics used to evaluate experiments — the primary success metric and guardrail metrics — need to be calculated from the correct population of events: only those events that occurred after a user was exposed to the experiment.
**Pre-exposure events** that occur before a user encounters the treatment are noise. Including them in metric calculations dilutes the effect. For experiments that launch to existing users, the pre-exposure history of users in treatment and control is identical in expectation under randomisation; including it in the outcome metric adds no signal and reduces statistical sensitivity. (Pre-exposure data is still useful as a covariate, as in CUPED, but not as part of the outcome itself.)
**Metric grain matters**: conversions should be calculated at the user grain (did the user convert, not how many conversion events occurred) unless the experiment is designed to affect conversion frequency rather than conversion probability. User-grain metrics are less sensitive to outliers and align with business interpretability.
**Guardrail metrics** — the metrics you monitor to detect unintended harms — should be calculated for the same population and time window as the primary metric. A treatment that improves conversion but degrades session duration or increases error rates has a complicated effect that is not visible if guardrail monitoring uses different populations or time windows.
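The post-exposure filter and the user grain can be sketched together. The function below is an assumption-laden illustration: the tuple layouts, the `purchase` event name, and `user_grain_conversion` itself are all hypothetical, and a real pipeline would more likely express this as a warehouse query.

```python
from datetime import datetime


def user_grain_conversion(exposures, events, conversion_event="purchase"):
    """Return {variant: (converted_users, exposed_users)}.

    exposures: iterable of (user_id, variant, exposed_at)
    events:    iterable of (user_id, event_name, occurred_at)
    """
    # Keep each user's earliest exposure as the start of their analysis window.
    first_exposure = {}
    for user_id, variant, exposed_at in exposures:
        if user_id not in first_exposure or exposed_at < first_exposure[user_id][1]:
            first_exposure[user_id] = (variant, exposed_at)

    # A user counts as converted once, however many conversion events they have.
    converted = set()
    for user_id, event_name, occurred_at in events:
        if event_name != conversion_event or user_id not in first_exposure:
            continue
        if occurred_at >= first_exposure[user_id][1]:  # drop pre-exposure events
            converted.add(user_id)

    totals = {}
    for user_id, (variant, _) in first_exposure.items():
        conv, exposed = totals.get(variant, (0, 0))
        totals[variant] = (conv + int(user_id in converted), exposed + 1)
    return totals
```

Running guardrail metrics through the same `first_exposure` window is one way to keep their population and time range aligned with the primary metric.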
Statistical Analysis Requirements
The statistical framework for A/B test analysis needs to be established before the experiment runs, not after. Pre-registration of the primary metric, sample size calculation, and minimum detectable effect prevents selective reporting and p-value hacking.
**Sample size calculation** determines how long the experiment needs to run to detect an effect of the minimum business-relevant size. The inputs are: baseline metric rate, minimum detectable effect (MDE), desired statistical power (typically 80%), and significance threshold (typically 5%). Running experiments for fixed periods without sample size calculation leads to either under-powered tests (the experiment cannot detect real effects of the size the business cares about) or over-running (continuing the experiment after the required sample size has been reached, which wastes traffic and, if results are re-checked along the way, adds further looks that inflate the false positive rate).
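For a conversion-rate metric, the four inputs combine in the standard normal-approximation formula for comparing two proportions. The sketch below assumes a two-sided test, equal allocation, and an MDE expressed relative to the baseline; `sample_size_per_variant` is a name chosen for illustration.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, relative_mde, alpha=0.05, power=0.80):
    """Approximate users needed in EACH variant for a two-sided test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)      # rate under the smallest effect of interest
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)
```

With a 10% baseline and a 10% relative MDE, this comes out at a little under 15,000 users per variant. Halving the MDE roughly quadruples the requirement, which is why the MDE deserves a deliberate business decision rather than a default.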
**Peeking** is the most common source of false positives in A/B testing programmes. Checking the p-value daily and stopping the experiment as soon as significance is reached inflates the false positive rate well beyond the nominal level: a test with a 5% significance threshold and daily peeking can have a true false positive rate above 20%. The solutions are: run for a pre-determined duration without acting on interim significance; use sequential testing methods designed for continuous monitoring (mSPRT, or group sequential designs with alpha spending); or use Bayesian methods with a pre-specified decision rule, bearing in mind that stopping on a posterior threshold can still inflate frequentist error rates.
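The inflation is easy to reproduce with a simulated A/A test, where no true effect exists. The sketch below makes simplifying assumptions: per-user outcomes are standard normal, traffic is identical every day, and the experiment is declared significant if any daily z-score crosses 1.96. The function name and defaults are illustrative.

```python
import math
import random


def false_positive_rates(n_experiments=2000, days=20, users_per_day=500,
                         z_crit=1.96, seed=7):
    """Return (rate with daily peeking, rate with one look at the end)."""
    rng = random.Random(seed)
    peeking_hits = fixed_hits = 0
    for _ in range(n_experiments):
        diff = 0.0       # running sum of (treatment - control) outcomes
        n = 0            # users per variant so far
        crossed = False
        for _ in range(days):
            # Outcomes are N(0, 1) in both variants, so the daily difference
            # of sums is N(0, 2 * users_per_day).
            diff += rng.gauss(0.0, math.sqrt(2 * users_per_day))
            n += users_per_day
            z = diff / math.sqrt(2 * n)
            if abs(z) > z_crit:
                crossed = True   # a daily peek would have stopped here
        peeking_hits += crossed
        fixed_hits += abs(z) > z_crit  # single look on the final day
    return peeking_hits / n_experiments, fixed_hits / n_experiments
```

Under these assumptions the single-look rate stays near the nominal 5%, while the peeking rate is several times higher.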
**CUPED (Controlled experiment Using Pre-Experiment Data)** reduces experiment variance by using pre-experiment metric values as covariates. For metrics that are correlated with pre-experiment values (session duration, revenue per user), CUPED can reduce the required sample size by 20–50%, allowing the same experiment to reach conclusive results faster or with fewer users.
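The adjustment itself is short. In the sketch below, each user's outcome `y` is shifted by `theta * (x - mean(x))`, where `x` is the same user's pre-experiment value and `theta = cov(x, y) / var(x)` is estimated on the pooled data from both variants. The variance of the adjusted metric shrinks by a factor of roughly `1 - rho**2`, where `rho` is the correlation between `x` and `y`. The synthetic data is an assumption made purely for demonstration.

```python
import random
from statistics import fmean, pvariance


def cuped_adjust(y, x):
    """Return CUPED-adjusted outcomes: y - theta * (x - mean(x))."""
    x_bar, y_bar = fmean(x), fmean(y)
    cov = fmean([(xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)])
    theta = cov / pvariance(x)
    # Subtracting a mean-zero term leaves the average outcome unchanged.
    return [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]


# Synthetic demo: in-experiment values strongly related to pre-experiment values.
rng = random.Random(0)
pre = [rng.gauss(100, 20) for _ in range(5000)]
post = [0.8 * xi + rng.gauss(0, 10) for xi in pre]
adjusted = cuped_adjust(post, pre)
print("variance before:", pvariance(post), "after:", pvariance(adjusted))
```

The covariate must come from before exposure; using anything the treatment could have influenced would bias the estimate.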
The Infrastructure Stack
A reliable A/B testing infrastructure at minimum requires:
**Assignment service** — a service that generates deterministic user-to-variant assignments, stores them server-side, and returns consistent assignments across multiple requests. The assignment algorithm should be a deterministic hash of the user ID combined with an experiment-specific salt, so that assignments are reproducible and adding new experiments does not re-randomise existing ones.
**Exposure logging pipeline** — an event stream that captures user ID, experiment ID, variant, timestamp, and context at the moment of exposure. This stream flows into the data warehouse where it is joined with outcome metrics for analysis.
**Experiment configuration store** — a database of active experiments with their configurations (variants, allocation percentages, eligible populations, start and end dates) that the assignment service reads. Changing experiment configurations mid-flight changes what users see and invalidates analysis unless the change is logged.
**Analysis pipeline** — the queries or transformation models that join exposure logs with outcome metrics, compute the metric values per user per variant, and run the statistical tests. Standardising this pipeline prevents inconsistent analysis across experiments.
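As an illustration of the final step of the analysis pipeline, the sketch below runs a two-sided, two-proportion z-test on per-variant counts of exposed and converted users. It assumes a binary user-grain metric and a single pre-planned look; the function name and return shape are hypothetical.

```python
import math
from statistics import NormalDist


def two_proportion_ztest(conv_control, n_control, conv_treatment, n_treatment):
    """Two-sided z-test for a difference in conversion rates."""
    p_c = conv_control / n_control
    p_t = conv_treatment / n_treatment
    # Pooled rate under the null hypothesis of no difference.
    pooled = (conv_control + conv_treatment) / (n_control + n_treatment)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_treatment))
    z = (p_t - p_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return {"lift": p_t - p_c, "z": z, "p_value": p_value}
```

Keeping one shared function like this behind every experiment readout is one way to get the standardisation described above.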
For teams without a dedicated experimentation platform, the combination of a feature flagging tool (LaunchDarkly, Statsig, or similar) for assignment and exposure, a data warehouse for metric calculation, and dbt models for the analysis pipeline is a practical implementation that supports hundreds of concurrent experiments without requiring a custom-built platform.