Data Architecture Principles: The Decisions That Shape Every Data Platform

Austin Duncan
Managing Director & Principal Data Architect
February 21, 2027 · 12 min read

The foundational principles that guide good data architecture decisions: separating storage from compute, designing for the query pattern rather than the source schema, building for the business unit rather than the application, preferring managed services over custom infrastructure, and late binding, which determines when data should be modelled and when it should be left flexible.

Good data architecture is not a collection of tool choices — it is a set of principles applied consistently that produce systems which are reliable, scalable, and maintainable over years of change. The principles in this guide are not new; they draw on decades of data engineering and distributed systems design. But they are routinely violated, which is why so many data platforms require expensive remediation work 3–5 years after their initial build.

Principle 1: Separate Storage from Compute

The foundational architectural decision in modern data platforms is separating where data is stored from where it is processed. The pre-cloud architecture coupled them: a fixed cluster stored data on its local disks and processed queries using its fixed CPUs. Scaling required adding nodes — expensive, slow, and linear.

Cloud-native separation decouples them: data lives in cheap, durable object storage (S3, GCS, Azure Blob), and compute clusters are provisioned on demand for specific workloads and released when not needed. This enables:

**Independent scaling:** Scale storage and compute separately based on actual needs. A data asset that grows by 100x in volume does not require 100x more compute, because query volume and data volume grow largely independently of each other.

**Multiple compute engines on the same data:** Different workloads (batch transformation, interactive SQL, streaming, ML) can run different compute engines against the same underlying data without copying it. Athena, Redshift Spectrum, and Databricks can all query the same S3 data lake.

**Cost proportionality:** Pay for compute when it is running, not continuously. Under usage-based billing, a warehouse that processes data for 4 hours per day costs roughly 4/24 (about 17%) of what a continuously running equivalent would cost.

**Implications for design:** Every data architecture decision that couples storage to compute is a design debt that limits future optionality. Building a system where processed data can only be queried by a specific proprietary engine creates vendor lock-in that becomes increasingly painful as requirements evolve.
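
The cost proportionality point above can be checked with simple arithmetic. The sketch below assumes pure per-hour billing with no minimum charge or idle cost; the hourly rate is a hypothetical figure, not a real price.

```python
def daily_compute_cost(hourly_rate: float, hours_running: float) -> float:
    """Cost of one day of compute that is billed only while it runs."""
    if not 0 <= hours_running <= 24:
        raise ValueError("hours_running must be between 0 and 24")
    return hourly_rate * hours_running


HOURLY_RATE = 12.0  # hypothetical price per cluster-hour

on_demand = daily_compute_cost(HOURLY_RATE, hours_running=4)
always_on = daily_compute_cost(HOURLY_RATE, hours_running=24)

print(f"on-demand: {on_demand:.2f}, always-on: {always_on:.2f}")
print(f"ratio: {on_demand / always_on:.3f}")
```

Real bills rarely match this exactly, since storage, minimum billing increments, and start-up time all add cost, but the ratio shows why idle compute is the first thing to eliminate.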

Principle 2: Design for the Query Pattern, Not the Source Schema

Source systems are designed for transactions, not analysis. A relational schema optimised for a CRM application — orders in one table, order line items in another, products in another, customers in another — is correctly designed for low-latency OLTP operations. It is incorrectly designed for analytical queries that need to join all four tables to answer a business question.

The analytical data model should be designed around how data will be queried, not how it exists in the source. This means:

**Pre-joining frequently joined tables:** A fact_orders mart table that includes customer_name, customer_region, and product_category eliminates join overhead for the most common analytical patterns. The analyst does not need to join dim_customers to get the customer's region.

**Designing at the right grain:** Analytical queries have a natural grain: the level at which they aggregate. A sales performance dashboard that reports by sales rep and month needs data at the rep-month grain. Building the mart at that grain means the BI tool aggregates a small pre-aggregated dataset, not a billion-row fact table.

**Denormalising strategically:** Denormalisation that reduces join complexity for common patterns is good analytical architecture. Denormalisation that creates data model maintenance debt (duplicated fields that can diverge) is bad architecture. The test: does this denormalisation remove a join that appears in roughly 80% of queries? If yes, it is worth it. If the join appears in only 20% of queries, keep the model normalised and let that 20% pay for the join. Cases in between are a judgment call.
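
The pre-joining and grain ideas above can be sketched with an in-memory SQLite database. The source tables, sample rows, and the `mart_sales_region_month` name are illustrative assumptions; only `fact_orders` and its `customer_name`, `customer_region`, and `product_category` columns come from the text. The example aggregates by region and month rather than rep and month because the sample data has no sales rep column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalised source tables, shaped for OLTP (hypothetical sample data).
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY,
                            customer_name TEXT, customer_region TEXT);
    CREATE TABLE products  (product_id INTEGER PRIMARY KEY, product_category TEXT);
    CREATE TABLE orders    (order_id INTEGER PRIMARY KEY,
                            customer_id INTEGER, order_date TEXT);
    CREATE TABLE order_line_items (order_id INTEGER, product_id INTEGER, amount REAL);

    INSERT INTO customers VALUES (1, 'Acme', 'EMEA'), (2, 'Globex', 'APAC');
    INSERT INTO products  VALUES (10, 'Hardware'), (11, 'Software');
    INSERT INTO orders    VALUES (100, 1, '2024-01-15'), (101, 2, '2024-01-20'),
                                 (102, 1, '2024-02-03');
    INSERT INTO order_line_items VALUES (100, 10, 50.0), (100, 11, 25.0),
                                        (101, 11, 40.0), (102, 10, 10.0);

    -- Mart at order-line grain with the commonly needed attributes pre-joined.
    CREATE TABLE fact_orders AS
    SELECT o.order_id, o.order_date, c.customer_name, c.customer_region,
           p.product_category, li.amount
    FROM order_line_items li
    JOIN orders o    ON o.order_id = li.order_id
    JOIN customers c ON c.customer_id = o.customer_id
    JOIN products p  ON p.product_id = li.product_id;

    -- Coarser mart built at the grain the dashboard actually reports on.
    CREATE TABLE mart_sales_region_month AS
    SELECT customer_region,
           substr(order_date, 1, 7) AS order_month,
           SUM(amount) AS revenue
    FROM fact_orders
    GROUP BY customer_region, substr(order_date, 1, 7);
""")

# The analyst's query needs no joins and scans only the small aggregate.
rows = conn.execute(
    "SELECT customer_region, order_month, revenue "
    "FROM mart_sales_region_month ORDER BY customer_region, order_month"
).fetchall()
for row in rows:
    print(row)
```

In a real platform these `CREATE TABLE ... AS` statements would more likely live in the transformation layer than in application code; the point here is only the shape of the two marts.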

Principle 3: Build for the Business Unit, Not the Application

The most common data architecture mistake: building data models that mirror the source application's structure rather than the business unit's conceptual model.

An e-commerce company's order management system stores data in tables named order_header, order_line, inventory_allocation, and fulfillment_event. These names reflect the application's internal design. The business unit's concepts are: Order, Customer, Product, Shipment.

The analytical model should speak the business language. A dim_customers table represents the business concept of a customer — consolidated from Salesforce accounts, billing system customers, and product users. A fact_orders table represents an order as the business understands it — not the application's order_header and order_line tables joined together.

**Why this matters:** Business users, analysts, and product managers navigate the data model using business concepts. A data model that maps directly to business concepts is self-explanatory to these users. One that reflects application internals requires translation that introduces errors.

**How to do it:** Spend time with domain experts — finance, sales, operations — before designing the model. Map the business concepts they care about. Only then determine how source system tables map to those concepts.
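
As a rough illustration of mapping several source systems onto one business concept, the sketch below consolidates customer records into a single `dim_customers` list. Matching on a lower-cased email address is an assumption made for the example, as are the record layouts and the `build_dim_customers` helper; real entity resolution usually needs agreed matching rules from the domain experts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    """Business concept: one row per real-world customer."""
    email: str
    name: str
    source_systems: tuple[str, ...]


def build_dim_customers(sources: dict[str, list[dict]]) -> list[Customer]:
    """Merge per-system records into one customer per normalised email."""
    merged: dict[str, dict] = {}
    for system, records in sources.items():
        for rec in records:
            key = rec["email"].strip().lower()  # assumed matching rule
            # The first system to mention a customer supplies the display name.
            entry = merged.setdefault(key, {"name": rec["name"], "systems": []})
            if system not in entry["systems"]:
                entry["systems"].append(system)
    return [
        Customer(email=key, name=val["name"], source_systems=tuple(val["systems"]))
        for key, val in sorted(merged.items())
    ]


# Hypothetical extracts from the three systems named in the text.
sources = {
    "salesforce_accounts": [{"email": "ana@example.com", "name": "Ana Ruiz"}],
    "billing_customers": [
        {"email": "ANA@example.com", "name": "A. Ruiz"},
        {"email": "bo@example.com", "name": "Bo Li"},
    ],
    "product_users": [{"email": "bo@example.com", "name": "Bo Li"}],
}

dim_customers = build_dim_customers(sources)
for customer in dim_customers:
    print(customer)
```

The consumer of `dim_customers` sees customers, not Salesforce accounts or billing rows; which system each attribute came from becomes a detail of the model rather than something every analyst has to know.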

Principle 4: Prefer Managed Services Over Custom Infrastructure

Every custom system is a maintenance liability. A custom ETL pipeline that you built and own means: you maintain it when the source API changes, you debug it when it fails at 2am, you rewrite it when the destination schema changes, and you document it when the original engineer leaves.

Managed services — Fivetran for ingestion, Snowflake for the warehouse, dbt Cloud for transformation deployment, Prefect Cloud for orchestration — transfer the infrastructure maintenance burden to specialists. The tradeoff is cost and reduced flexibility.

For most data platform components, the tradeoff favours managed services:

**The maintenance cost of custom infrastructure** is systematically underestimated. Engineers tend to estimate the build cost accurately and then discount the ongoing maintenance. The ongoing maintenance of a custom system over 3 years is typically 2–3x the original build cost.

**Managed services handle scaling automatically.** A custom ETL that works at 10GB/day requires re-engineering at 100GB/day. Fivetran handles 10GB and 100GB without architectural changes.

**Build custom only where you have a genuine differentiated requirement.** A source system that no managed connector supports; a transformation pattern that no managed service can execute; a security requirement that prevents data from leaving your infrastructure. These are legitimate reasons to build custom. "We want to own the code" is not a legitimate reason when the cost of ownership is underestimated.
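
To make the maintenance multiple above concrete, here is a small worked calculation. The build cost and the managed-service fee are hypothetical numbers chosen only to show the arithmetic; the 2x and 3x multiples are the range quoted in the text.

```python
def three_year_custom_cost(build_cost: float, maintenance_multiple: float) -> float:
    """Build cost plus three years of maintenance expressed as a multiple of it."""
    return build_cost * (1 + maintenance_multiple)


BUILD_COST = 100_000.0         # hypothetical one-off engineering cost
MANAGED_ANNUAL_FEE = 60_000.0  # hypothetical subscription price

custom_low = three_year_custom_cost(BUILD_COST, maintenance_multiple=2.0)
custom_high = three_year_custom_cost(BUILD_COST, maintenance_multiple=3.0)
managed = MANAGED_ANNUAL_FEE * 3

print(f"custom over 3 years:  {custom_low:,.0f} to {custom_high:,.0f}")
print(f"managed over 3 years: {managed:,.0f}")
```

The comparison only holds for the inputs you put in, so it is worth running with your own estimates before deciding either way.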

Principle 5: Late Binding — Keep Raw Data Flexible

The principle of late binding states: defer interpretation of data as long as possible. Load raw data in its original form; apply business logic at query time or in a separate transformation layer rather than at ingestion time.

**Why late binding matters:** Business logic changes. A metric that was defined as "revenue" last year is redefined this year to exclude returns. If revenue was calculated at ingestion time and only the result was stored in the warehouse, recalculating it means re-extracting and reprocessing all historical data from the source. If revenue is defined in a dbt model that queries the raw event data, redefining it means updating the model and running a full refresh against data the warehouse already holds.

**Raw layer design:** The raw schema stores source data exactly as it arrived, with no transformations applied. Fivetran output, CDC captures, file uploads — all land in the raw schema without modification. The raw data is the source of truth; everything else is derived.

**Transformation layer as the interpretation layer:** Business logic lives in the transformation layer (dbt models), not in ingestion pipelines. The ingestion pipeline's job is to get data from A to B reliably. The transformation layer's job is to interpret that data and make it analytically useful.

**The exception:** Lightweight normalisation at ingestion is acceptable — standardising column names to snake_case, casting strings to appropriate types, handling known encoding issues. Deep business logic at ingestion is not.
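
A minimal sketch of late binding, using plain Python in place of a warehouse and dbt. The event layout, the `ingest` helper, and the two revenue functions are assumptions for illustration. Ingestion only standardises column names; the revenue definition lives in a separate function that can be swapped without touching the stored raw events.

```python
import re


def to_snake_case(name: str) -> str:
    """Lightweight normalisation that is acceptable at ingestion."""
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)  # split camelCase
    return re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_").lower()


def ingest(raw_records: list[dict]) -> list[dict]:
    """Land records as they arrived, with only column names standardised."""
    return [{to_snake_case(k): v for k, v in rec.items()} for rec in raw_records]


def revenue_v1(events: list[dict]) -> float:
    """Last year's definition: the sum of all sale amounts."""
    return sum(e["amount"] for e in events if e["event_type"] == "sale")


def revenue_v2(events: list[dict]) -> float:
    """This year's definition: sales minus returns."""
    returns = sum(e["amount"] for e in events if e["event_type"] == "return")
    return revenue_v1(events) - returns


# Hypothetical raw events; no business logic is applied when they land.
raw_layer = ingest([
    {"OrderId": 1, "EventType": "sale", "Amount": 100.0},
    {"OrderId": 2, "EventType": "sale", "Amount": 50.0},
    {"OrderId": 1, "EventType": "return", "Amount": 100.0},
])

# Redefining the metric is a change to the interpretation layer only.
print(revenue_v1(raw_layer), revenue_v2(raw_layer))
```

Had the pipeline stored only the `revenue_v1` result, the return event would have been discarded at ingestion and the new definition could not be computed from what remained.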

Principle 6: Document Decisions, Not Just Facts

Architectural documentation that only describes what was built (the schema, the tables, the field names) is less valuable than documentation that explains why. Why is the customer entity modelled this way? Why does the fact table include these fields pre-joined? Why is the partition key order_date rather than created_at?

The architectural decisions embedded in a data model are not obvious to new team members or future maintainers. Without documented rationale, those decisions appear arbitrary — and arbitrary-looking decisions get changed, sometimes breaking things that worked correctly.

**Decision records:** For each non-obvious architectural choice, document the decision, the alternatives considered, and the reasoning. This can be as simple as a comment in the dbt model's description or as formal as an Architecture Decision Record (ADR).

**Lineage documentation:** Every model should document its source and its purpose. Not just the field descriptions, but the business concept the model represents, the grain, and the primary use cases it serves.
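
One lightweight way to capture a decision record is a small structure with the three elements listed above: the decision, the alternatives, and the reasoning. The `DecisionRecord` class and the example reasoning below are hypothetical; the same fields could just as well live in a dbt model description or a formal ADR file.

```python
from dataclasses import dataclass, field


@dataclass
class DecisionRecord:
    """A non-obvious architectural choice and the rationale behind it."""
    title: str
    decision: str
    alternatives: list[str] = field(default_factory=list)
    reasoning: str = ""

    def to_text(self) -> str:
        """Render the record as plain text suitable for a model description."""
        lines = [
            f"Decision: {self.title}",
            f"Chosen: {self.decision}",
            "Alternatives considered:",
        ]
        lines += [f"  - {alt}" for alt in self.alternatives]
        lines.append(f"Reasoning: {self.reasoning}")
        return "\n".join(lines)


record = DecisionRecord(
    title="Partition key for fact_orders",
    decision="order_date",
    alternatives=["created_at"],
    # Illustrative rationale only; the real reason belongs to your team.
    reasoning="Dashboards filter on business order date, not row creation time.",
)

text = record.to_text()
print(text)
```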

Principle 7: Design for Change

A data architecture that cannot be changed without breaking consumers is a liability, not an asset. Source systems change. Business requirements change. Technology improves. An architecture designed on the assumption that it will be stable forever is not well-designed — it is fragile.

Designing for change means:

**Abstraction layers:** Downstream consumers reference marts, not raw tables. When a source system changes, the staging model is updated; the mart may or may not require changes; BI tool queries are unaffected. The abstraction layers isolate consumers from upstream volatility.

**Versioning for breaking changes:** When a mart model must change in a way that breaks consumer queries, the change is versioned — the new schema coexists with the old for a migration period, then the old is retired. Consumers are not forced to migrate on a single deployment day.

**Testing:** A test suite that catches regressions before they reach production is the primary mechanism for making change safe. Changes that would break downstream consumers are detected in CI, not by business users noticing incorrect numbers.
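
A sketch of the kind of check that can run in CI to flag breaking changes before consumers see them. It compares a published mart schema with a proposed one and treats removed columns and changed types as breaking, while added columns are allowed. The schemas and the `breaking_changes` helper are hypothetical; a real setup would read column metadata from the warehouse or from the transformation tool.

```python
def breaking_changes(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """List changes that would break queries written against the old schema."""
    problems = []
    for column, old_type in old.items():
        if column not in new:
            problems.append(f"removed column: {column}")
        elif new[column] != old_type:
            problems.append(f"type changed: {column} {old_type} -> {new[column]}")
    return problems  # columns that exist only in `new` are non-breaking


# Hypothetical published schema and a proposed replacement.
published = {"order_id": "INTEGER", "customer_region": "TEXT", "amount": "REAL"}
proposed = {"order_id": "INTEGER", "amount": "TEXT", "order_month": "TEXT"}

problems = breaking_changes(published, proposed)
for problem in problems:
    print(problem)

# In CI, a non-empty list would fail the build, or require shipping the
# proposed schema as a new version alongside the old one.
```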

Our data architecture consulting practice applies these principles to real-world platform design and remediation — contact us to discuss the architectural principles appropriate for your environment.
