E-commerce Analytics: Data Architecture for Online Retail and Direct-to-Consumer

Obed Tsimi
Founder & Senior Tableau Architect
July 24, 2027 · 12 min read

E-commerce analytics connects web behaviour, transaction data, fulfilment operations, and marketing attribution into the unified view that lets online retailers optimise acquisition cost, conversion, and retention simultaneously. The data architecture challenge is that these four data domains are generated by fundamentally different systems at fundamentally different volumes and grains — and making them analytically useful together requires deliberate integration design, not just co-location in a data warehouse.

The Four Data Domains

**Web behaviour data** is event-level, high-volume, and user-session-centric. Every page view, product impression, search, add-to-cart, and checkout step generates an event. A mid-size e-commerce site with 100,000 monthly sessions generates millions of events per month. This data comes from analytics platforms (Google Analytics 4, Amplitude, Mixpanel) or from first-party event tracking via server-side or client-side event streams. The grain is user-session-event; the key identifiers are session ID, anonymous user ID, and (where known) authenticated customer ID.

**Transaction data** is the authoritative commercial record. Each order contains a customer, line items with SKU and quantity, prices and discounts, shipping details, and payment information. The grain is order-line. Transaction data comes from the commerce platform (Shopify, Magento, BigCommerce, custom) and is lower-volume but higher-value than web behaviour data.

**Marketing data** comes from paid advertising platforms (Google Ads, Meta Ads, TikTok Ads), email platforms, affiliate networks, and SEO performance data. The grain varies by platform: impression and click data is available at the campaign/ad/keyword level; spend data is typically daily by campaign.

**Fulfilment and operations data** covers shipment tracking, delivery confirmation, returns, and customer service interactions. This data is essential for understanding the post-purchase experience but is often omitted from e-commerce analytics stacks that focus only on acquisition and conversion.
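To make the four grains concrete, the records might be modelled roughly as follows. This is only a sketch: every class and field name is hypothetical, chosen for illustration rather than taken from any particular platform's schema.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Optional


@dataclass(frozen=True)
class WebEvent:
    """Web behaviour: one row per user-session-event."""
    session_id: str
    anonymous_id: str
    customer_id: Optional[str]  # known only once the user has authenticated
    event_name: str             # e.g. "page_view", "add_to_cart"
    occurred_at: datetime


@dataclass(frozen=True)
class OrderLine:
    """Transactions: one row per order line."""
    order_id: str
    customer_id: str
    sku: str
    quantity: int
    unit_price: float
    discount: float


@dataclass(frozen=True)
class CampaignSpend:
    """Marketing: one row per platform, campaign, and day."""
    platform: str
    campaign_id: str
    spend_date: date
    spend: float
    impressions: int
    clicks: int


@dataclass(frozen=True)
class Shipment:
    """Fulfilment: one row per shipment of an order."""
    order_id: str
    shipped_at: datetime
    delivered_at: Optional[datetime]  # None until delivery is confirmed
    returned: bool
```

In this sketch no single key appears in all four record types, which is why putting the data in one warehouse does not by itself make it joinable.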

The Identity Resolution Problem

The single hardest technical problem in e-commerce analytics is identity resolution: connecting the same person across different states and sessions. A customer who browses anonymously, creates an account, makes a purchase, opens a promotional email, and browses again in a new session is represented by at least four different identifiers across different systems. Without identity resolution, the customer journey is fragmented across multiple records with no linkage.

The standard approach to identity resolution:

**Anonymous to authenticated stitching** — when a user creates an account or logs in, connect their authenticated customer ID to all anonymous session IDs that were active in the same browser before authentication. This requires storing a mapping of anonymous IDs to customer IDs as authentication events occur.

**Cross-device stitching** — connecting the same customer across different devices is harder. The most reliable signal is email opens and clicks (a click in an email on mobile can be associated with the customer ID in the email), supplemented by probabilistic matching (same IP address, same household, similar behavioural fingerprints).

**Marketing attribution stitching** — connecting ad clicks (which carry click IDs from advertising platforms) to authenticated customer records (which are in the commerce system). The standard implementation uses UTM parameters and advertising platform click IDs stored in first-party cookies or server-side session records, then matched to customer IDs when the customer converts.
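The first of these techniques, anonymous to authenticated stitching, can be sketched in a few lines. The `Session` class, the ID formats, and the "latest login wins" rule are all assumptions made for illustration; shared devices and multiple accounts per browser need more careful handling in practice.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Session:
    session_id: str
    anonymous_id: str                  # browser-level ID, e.g. from a first-party cookie
    customer_id: Optional[str] = None  # filled in by stitching


def build_identity_map(auth_events):
    """Map anonymous_id -> customer_id from (anonymous_id, customer_id) auth events.

    Assumption: if one browser is seen with several accounts, the latest event wins.
    """
    identity_map = {}
    for anonymous_id, customer_id in auth_events:
        identity_map[anonymous_id] = customer_id
    return identity_map


def stitch(sessions, identity_map):
    """Attach a customer_id to every session whose anonymous_id has been mapped."""
    for s in sessions:
        if s.customer_id is None:
            s.customer_id = identity_map.get(s.anonymous_id)
    return sessions


sessions = [
    Session("s1", "anon-A"),  # browsed before creating an account
    Session("s2", "anon-A"),  # logged in during this session
    Session("s3", "anon-B"),  # never authenticated
]
identity_map = build_identity_map([("anon-A", "cust-42")])
stitched = stitch(sessions, identity_map)
print([(s.session_id, s.customer_id) for s in stitched])
```

Session `s1` is resolved retroactively because it shares an anonymous ID with the later login, while `s3` stays anonymous.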

Complete identity resolution is not achievable — cookie restrictions, browser privacy features, and device fragmentation guarantee that some sessions remain anonymous. The goal is maximising the proportion of sessions with resolved identity, not achieving 100%.
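The share of sessions with resolved identity is easy to track as a metric. A minimal sketch, assuming each session is represented by its customer ID, or `None` if it is still anonymous (a hypothetical representation, not a platform export format):

```python
def identity_resolution_rate(session_customer_ids):
    """Share of sessions that carry a resolved customer ID (None = still anonymous)."""
    if not session_customer_ids:
        return 0.0
    resolved = sum(1 for cid in session_customer_ids if cid is not None)
    return resolved / len(session_customer_ids)


# Invented example: 3 of 5 sessions have a resolved identity.
rate = identity_resolution_rate(["cust-42", "cust-42", None, "cust-7", None])
print(f"{rate:.0%}")
```

Watching this rate over time shows whether stitching changes are actually improving coverage.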

The Attribution Problem

Marketing attribution in e-commerce is the process of assigning credit for conversions to the marketing channels and campaigns that contributed to them. It is analytically consequential because channel-level ROAS (Return on Ad Spend) calculations drive budget allocation decisions: undervaluing a channel leads to underinvestment; overvaluing leads to overspend.
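ROAS itself is simple arithmetic: revenue attributed to a channel divided by spend on that channel. A minimal sketch, using invented figures and hypothetical channel names:

```python
def roas(attributed_revenue: float, ad_spend: float) -> float:
    """Return on ad spend: attributed revenue per unit of spend."""
    if ad_spend <= 0:
        raise ValueError("ad_spend must be positive")
    return attributed_revenue / ad_spend


# Illustrative numbers only: (attributed revenue, spend) per channel.
channels = {
    "paid_social": (12_000.0, 4_000.0),
    "branded_search": (9_000.0, 1_500.0),
}
roas_by_channel = {name: roas(rev, spend) for name, (rev, spend) in channels.items()}
print(roas_by_channel)  # {'paid_social': 3.0, 'branded_search': 6.0}
```

The ad platforms report the spend. The attributed revenue depends entirely on the attribution model, which is why the choice of model drives the budget decision.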

The fundamental problem is that customers interact with multiple channels before converting. A customer who sees a Facebook ad, searches organically, clicks a Google Shopping ad, and converts from an email produces four touch points with legitimate claims to attribution credit. The channel you credit determines which channel's ROAS looks best.

The common attribution models:

**Last-touch** — all credit to the last click before conversion. Systematically overvalues bottom-of-funnel channels (branded search, email) and undervalues awareness and consideration channels. Still the default in many platforms because it is simple to implement.

**First-touch** — all credit to the first touch. Opposite bias: overvalues acquisition channels, ignores retention and re-engagement channels.

**Linear** — equal credit across all touch points. Less biased but dilutes credit for high-impact touches.

**Data-driven attribution** — statistical modelling of which touch points are predictive of conversion, applied to observed conversion and non-conversion paths. More accurate but requires volume (typically 10,000+ monthly conversions) and a platform that supports it. Google Analytics 4 offers data-driven attribution by default for accounts with sufficient data.
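The three rule-based models can be compared on the four-touch journey described earlier. This is a sketch with hypothetical channel names; it assumes a non-empty, time-ordered path and does not attempt data-driven attribution, which needs a fitted statistical model.

```python
def last_touch(path):
    """All credit to the final touch before conversion."""
    return {path[-1]: 1.0}


def first_touch(path):
    """All credit to the first touch in the journey."""
    return {path[0]: 1.0}


def linear(path):
    """Equal credit to every touch; repeated channels accumulate credit."""
    share = 1.0 / len(path)
    credit = {}
    for channel in path:
        credit[channel] = credit.get(channel, 0.0) + share
    return credit


# The example journey: Facebook ad, organic search, Google Shopping ad, email.
path = ["facebook_ad", "organic_search", "google_shopping_ad", "email"]

print(last_touch(path))   # {'email': 1.0}
print(first_touch(path))  # {'facebook_ad': 1.0}
print(linear(path))       # each of the four channels receives 0.25
```

The same conversion credits email fully, Facebook fully, or each channel at 25%, depending only on the model chosen.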

For most e-commerce businesses, the practical recommendation is data-driven attribution from GA4 as the primary model, supplemented by incrementality testing (holdout experiments) for channels where attribution models are most uncertain.
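The arithmetic behind a holdout experiment is worth seeing once. The sketch below assumes a randomly assigned holdout group that is not shown the channel's ads; the function name and all figures are invented for illustration, and a real test would also need a significance check, which is omitted here.

```python
def holdout_lift(treated_users, treated_conversions, holdout_users, holdout_conversions):
    """Estimate incremental conversions and relative lift from a holdout test.

    Assumes both groups are non-empty and the holdout group has some conversions.
    """
    treated_rate = treated_conversions / treated_users
    holdout_rate = holdout_conversions / holdout_users
    # Conversions the treated group would not have produced without the channel.
    incremental = (treated_rate - holdout_rate) * treated_users
    relative_lift = (treated_rate - holdout_rate) / holdout_rate
    return incremental, relative_lift


# Illustrative numbers: 3.0% conversion when exposed vs 2.5% in the holdout.
incremental, lift = holdout_lift(50_000, 1_500, 10_000, 250)
print(round(incremental), f"{lift:.0%}")
```

In this invented example, only about 250 of the 1,500 treated-group conversions are incremental, however many the attribution model credits to the channel.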

The Conversion Rate Optimisation Data Model

Conversion rate optimisation (CRO) analytics uses funnel analysis to identify where users abandon the purchase journey. The standard e-commerce funnel is: sessions → product page views → add-to-cart → checkout initiation → checkout completion. Each transition has a conversion rate; an unusually low or falling rate at a specific step shows where optimisation effort should focus.
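The step-to-step rates are straightforward to compute once each funnel stage has a session count. A minimal sketch, with hypothetical step names and invented counts:

```python
FUNNEL_STEPS = [
    "session",
    "product_view",
    "add_to_cart",
    "checkout_start",
    "checkout_complete",
]


def step_conversion_rates(counts):
    """Conversion rate for each transition between consecutive funnel steps."""
    rates = {}
    for prev, nxt in zip(FUNNEL_STEPS, FUNNEL_STEPS[1:]):
        rates[f"{prev} -> {nxt}"] = counts[nxt] / counts[prev] if counts[prev] else 0.0
    return rates


# Illustrative counts of sessions reaching each step.
counts = {
    "session": 100_000,
    "product_view": 60_000,
    "add_to_cart": 9_000,
    "checkout_start": 4_500,
    "checkout_complete": 2_700,
}
rates = step_conversion_rates(counts)
for transition, rate in rates.items():
    print(f"{transition}: {rate:.1%}")
```

With these numbers the product view to add-to-cart transition converts at 15%, far below the other steps, so it would be the first candidate for investigation.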

Funnel analysis requires web behaviour data at the session grain, connected to transaction data to distinguish converting from non-converting sessions. The key analytical dimensions are device type (mobile vs. desktop conversion rates typically differ substantially), acquisition channel (paid search users convert differently than organic), new vs. returning visitors, and product category (conversion rates vary significantly by product type and price point).

The CRO data model connects these dimensions to funnel stage drop-off rates, enabling the analytics team to identify: which device types have the largest conversion gap; which channels drive high traffic but low conversion; which product pages have high view but low add-to-cart rates; and which checkout steps have elevated abandonment.
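Slicing the funnel by a dimension can be sketched as follows. The example assumes each session record carries its dimension values and the furthest funnel step it reached, and that the funnel is strictly sequential (reaching a step implies reaching every earlier one). The field names and sample sessions are hypothetical.

```python
from collections import defaultdict

STEPS = ["session", "product_view", "add_to_cart", "checkout_start", "checkout_complete"]


def funnel_by_dimension(sessions, dimension):
    """Count sessions reaching each funnel step, grouped by one dimension."""
    counts = defaultdict(lambda: [0] * len(STEPS))
    for s in sessions:
        reached = STEPS.index(s["furthest_step"])
        # A session counts towards its furthest step and every step before it.
        for i in range(reached + 1):
            counts[s[dimension]][i] += 1
    return {value: dict(zip(STEPS, row)) for value, row in counts.items()}


def overall_conversion(step_counts):
    """Share of sessions that completed checkout."""
    return step_counts["checkout_complete"] / step_counts["session"]


sessions = [
    {"device": "mobile", "furthest_step": "session"},
    {"device": "mobile", "furthest_step": "product_view"},
    {"device": "mobile", "furthest_step": "add_to_cart"},
    {"device": "mobile", "furthest_step": "checkout_complete"},
    {"device": "desktop", "furthest_step": "product_view"},
    {"device": "desktop", "furthest_step": "checkout_complete"},
]
by_device = funnel_by_dimension(sessions, "device")
for device, step_counts in by_device.items():
    print(device, step_counts, f"{overall_conversion(step_counts):.0%}")
```

Passing `"channel"` or `"product_category"` as the dimension would work the same way, provided the session records carry those fields.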
