Data architect interviews test system design, trade-off reasoning, and architectural judgment — not trivia. Here are the questions commonly asked, what good answers look like, and how interviewers evaluate them.
The quick answer
Senior data architecture interviews test three things: technical depth (can you design a real system?), trade-off reasoning (can you evaluate competing approaches?), and communication (can you explain complex decisions to stakeholders?). Interviewers are not asking you to recite the definition of a star schema. They present ambiguous design problems and evaluate how you decompose them, what trade-offs you surface, and how well you defend your recommendations. This guide covers the question categories, what interviewers are actually assessing, and what strong answers look like.
Category 1: System design questions
These are open-ended design problems presented as scenarios. They test whether you can translate requirements into architecture.
**"Design a data platform for a mid-market retail company"**: this question has no single correct answer. The interviewer is evaluating: do you ask clarifying questions before designing (data volumes, team size, budget, existing systems)? Do you propose a coherent stack rather than listing every tool you know? Do you make trade-offs explicit (Snowflake vs Redshift, given these requirements)? Do you address operational concerns (monitoring, alerting, on-call) or only the happy path?
Strong answers: start with questions. "What is the scale — 10 users or 10,000? Real-time requirements or batch? What is the current stack?" Then propose a layered architecture with specific tool choices and the reasoning. "For a 50-person retail company without dedicated data engineering, I would start with Fivetran for ingestion, Snowflake as the warehouse, and dbt Cloud for transformation. Looker or Power BI for BI, depending on whether they want a governed semantic layer or self-service. I would defer Kafka and streaming until there is a specific real-time use case — adding streaming complexity without a use case is premature."
**"How would you handle a pipeline that processes 10 billion events per day?"**: tests knowledge of distributed processing at scale. Address the target latency first (batch or real-time?): a streaming architecture (Kafka + Flink/Spark Streaming) suits sub-minute latency, while micro-batch (Spark on EMR/Databricks) suits latencies from five minutes to an hour. Then cover the partitioning strategy for the output tables, cost optimisation (columnar formats, partition pruning), and what breaks at 10B/day that does not break at 1M/day (checkpoint overhead, state management, partition skew).
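It often helps to turn the daily figure into per-second rates before choosing an architecture. The sketch below does that arithmetic. The 1,000-byte average event size and the 3x peak factor are illustrative assumptions, not figures from the question, and `sustained_rate` is a hypothetical helper name.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def sustained_rate(events_per_day: int, avg_event_bytes: int, peak_factor: float = 1.0) -> dict:
    """Convert a daily event volume into per-second rates.

    avg_event_bytes and peak_factor are assumptions the candidate should
    state out loud; they are not given in the interview question.
    """
    events_per_sec = events_per_day / SECONDS_PER_DAY * peak_factor
    bytes_per_sec = events_per_sec * avg_event_bytes
    return {
        "events_per_sec": events_per_sec,
        "mb_per_sec": bytes_per_sec / 1e6,
        # Daily volume is a total, so it is not scaled by the peak factor.
        "tb_per_day": events_per_day * avg_event_bytes / 1e12,
    }

avg = sustained_rate(10_000_000_000, avg_event_bytes=1_000)
peak = sustained_rate(10_000_000_000, avg_event_bytes=1_000, peak_factor=3.0)

print(round(avg["events_per_sec"]))   # 115741 events/s on average
print(round(avg["mb_per_sec"], 1))    # 115.7 MB/s at the assumed event size
print(avg["tb_per_day"])              # 10.0 TB/day of raw events
```

Stating the numbers this way makes the later points concrete: at roughly 116,000 events per second on average, a single hot partition or an oversized checkpoint becomes a real bottleneck in a way it never would at 1M events per day.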
**"Design a CDC pipeline from a production PostgreSQL database to a data warehouse"**: tests knowledge of CDC specifics. Cover: Debezium reading the WAL, publishing to Kafka, Kafka Connect sink to Snowflake or BigQuery, handling schema changes (Schema Registry, forward-compatible evolution), handling deletes (soft-delete CDC events vs hard deletes), idempotency (at-least-once delivery → handle duplicates in the warehouse with MERGE or deduplication logic).
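The idempotency point is the one candidates most often hand-wave, so a small model can help. The sketch below applies change events to an in-memory table so that redelivered or out-of-order events have no effect. The event shape (`pk`, `lsn`, `op`, `row`) is a simplified assumption for illustration, not the actual Debezium envelope, and `apply_cdc_events` is a hypothetical name. In a warehouse the same logic would typically be expressed as a MERGE keyed on the primary key.

```python
from typing import Iterable

def apply_cdc_events(table: dict, events: Iterable[dict]) -> dict:
    """Apply change events to `table` (keyed by primary key) idempotently.

    Each event is assumed to carry: pk, lsn (an increasing log position),
    op ("c" create, "u" update, "d" delete) and row (None for deletes).
    """
    for ev in events:
        current = table.get(ev["pk"])
        # Skip duplicates and stale redeliveries: only a newer LSN wins.
        if current is not None and ev["lsn"] <= current["lsn"]:
            continue
        if ev["op"] == "d":
            # Keep a tombstone (soft delete) so that a late redelivery of an
            # older event cannot resurrect the row.
            table[ev["pk"]] = {"lsn": ev["lsn"], "row": None, "deleted": True}
        else:
            table[ev["pk"]] = {"lsn": ev["lsn"], "row": ev["row"], "deleted": False}
    return table

events = [
    {"pk": 1, "lsn": 100, "op": "c", "row": {"name": "a"}},
    {"pk": 1, "lsn": 101, "op": "u", "row": {"name": "b"}},
    {"pk": 1, "lsn": 101, "op": "u", "row": {"name": "b"}},  # duplicate delivery
    {"pk": 2, "lsn": 102, "op": "c", "row": {"name": "x"}},
    {"pk": 2, "lsn": 103, "op": "d", "row": None},
    {"pk": 2, "lsn": 102, "op": "c", "row": {"name": "x"}},  # late redelivery
]

once = apply_cdc_events({}, events)
twice = apply_cdc_events(apply_cdc_events({}, events), events)
print(once == twice)  # True: replaying the batch changes nothing
```

The design choice worth naming in the interview is the tombstone: a hard delete in the target would let the late create for `pk` 2 reinsert a row that no longer exists in the source.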
Category 2: Trade-off questions
These test whether you can evaluate competing approaches with nuance rather than declaring one option universally superior.
**"Snowflake vs Databricks — which would you choose for this scenario?"**: strong answers are conditional. "Databricks is the right choice when ML/AI workloads are a first-class requirement alongside SQL analytics — the unified compute for Spark, Python, SQL, and MLflow is compelling. Snowflake is the right choice for SQL-heavy analytics teams where the ergonomics of the warehouse interface and the performance of Snowflake's query engine outweigh Databricks' ML integration." Then apply the framework to the specific scenario described.
**"When would you use a star schema vs a data vault?"**: demonstrate genuine knowledge of both. "Star schema optimises for query performance and analyst usability — the dimensional model is intuitive and BI tools generate efficient SQL against it. Data Vault optimises for auditability and historisation — every change is recorded, full lineage is maintained, and the model is highly adaptable to schema changes in source systems. I would recommend Data Vault for regulated industries (financial services, insurance) where the auditability requirement is non-negotiable. For most other environments, star schema is more practical." See kimball vs inmon.
**"Your cloud data warehouse bill is growing 3x year-over-year. How do you address it?"**: tests cost optimisation knowledge. Structure the answer: first, diagnose (what is driving the growth — compute or storage? which queries or workloads?), then address by category (clustering/partitioning for query scan reduction, warehouse rightsizing for compute waste, incremental materialisation for transformation efficiency). See cloud data warehouse cost optimization.
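The diagnosis step can be as simple as ranking workloads by their share of spend before touching any configuration. The sketch below assumes a query log that has already been exported into records with a `workload` tag and a `credits` cost. Both field names and the sample figures are hypothetical, and real warehouses expose this information through their own usage views.

```python
from collections import defaultdict

def spend_by_workload(query_log):
    """Return (workload, credits, share_of_total) tuples, largest first."""
    totals = defaultdict(float)
    for q in query_log:
        totals[q["workload"]] += q["credits"]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    return sorted(
        ((w, c, c / grand_total) for w, c in totals.items()),
        key=lambda t: t[1],
        reverse=True,
    )

# Hypothetical sample data, for illustration only.
query_log = [
    {"workload": "dbt_full_refresh", "credits": 40.0},
    {"workload": "bi_dashboards", "credits": 30.0},
    {"workload": "dbt_full_refresh", "credits": 20.0},
    {"workload": "ad_hoc", "credits": 10.0},
]

for workload, credits, share in spend_by_workload(query_log):
    print(f"{workload}: {credits:.0f} credits ({share:.0%})")
```

In this made-up log, full-refresh transformations account for 60% of spend, which would point to incremental materialisation before warehouse rightsizing.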
Category 3: Governance and compliance questions
These test whether you understand the governance layer, particularly important for enterprise and regulated industry roles.
**"How would you implement row-level security in your data warehouse?"**: cover the implementation pattern in at least one warehouse. Snowflake: row access policies (a function that returns a boolean, applied to the table, evaluated on every query). BigQuery: row-level access policies (filter expressions created with DDL on the table and granted to specific IAM principals such as users or groups). Tableau, at the BI layer: user filters in published data sources (map [UserName] to a field in the data). Address: performance implications of RLS (every query must evaluate the policy) and testing strategy (how do you verify RLS is working correctly?).
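For the testing question, one workable approach is to run the same query as several test users and compare the visible rows against expectations worked out independently of the policy. The sketch below models that idea in plain Python. It is not warehouse syntax, and the entitlements, users, and expected counts are all hypothetical.

```python
# Hypothetical entitlement mapping: which regions each user may see.
ENTITLEMENTS = {"alice": {"EMEA"}, "bob": {"EMEA", "APAC"}}

def row_policy(user: str, row: dict) -> bool:
    """Model of a row access policy: True when the user may see the row."""
    return row["region"] in ENTITLEMENTS.get(user, set())

def query_as(user: str, rows: list) -> list:
    """Model of the warehouse applying the policy to every query."""
    return [r for r in rows if row_policy(user, r)]

rows = [
    {"order_id": 1, "region": "EMEA"},
    {"order_id": 2, "region": "EMEA"},
    {"order_id": 3, "region": "APAC"},
    {"order_id": 4, "region": "AMER"},
]

# Expectations written by hand, independently of the policy code, so the
# test cannot pass merely by repeating the policy's own logic.
EXPECTED_ORDER_IDS = {
    "alice": {1, 2},
    "bob": {1, 2, 3},
    "mallory": set(),  # unknown user: must see nothing (deny by default)
}

def verify_rls() -> dict:
    """Return, per test user, whether the visible rows match expectations."""
    return {
        user: {r["order_id"] for r in query_as(user, rows)} == expected
        for user, expected in EXPECTED_ORDER_IDS.items()
    }

print(verify_rls())  # {'alice': True, 'bob': True, 'mallory': True}
```

The two failure modes to check are leakage (a user sees rows outside their entitlement) and over-filtering (a user loses rows they should see). Comparing exact sets catches both.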
**"How do you handle PII in a data platform?"**: cover: classification (automated scanning to identify PII columns), access control (only authorised roles can access raw PII), pseudonymisation vs anonymisation trade-offs, compliance requirements (GDPR right to erasure — how do you handle deletion requests in a data warehouse where deletes are expensive?). Strong answers address all four, not just access control.
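To make the pseudonymisation trade-off concrete, here is a minimal sketch using a keyed hash (HMAC-SHA256) from the Python standard library. The function name, the normalisation rule, and the keys are assumptions for illustration. In practice the key would live in a secrets manager, not in code.

```python
import hashlib
import hmac

def pseudonymise(value: str, key: bytes) -> str:
    """Replace a PII value with a stable keyed token.

    The same value and key always yield the same token, so joins and
    distinct counts still work on the pseudonymised column.
    """
    normalised = value.strip().lower()
    return hmac.new(key, normalised.encode("utf-8"), hashlib.sha256).hexdigest()

key_v1 = b"example-key-do-not-use"   # hypothetical key, for illustration only
key_v2 = b"a-different-example-key"

token_a = pseudonymise(" Ada@Example.com ", key_v1)
token_b = pseudonymise("ada@example.com", key_v1)
token_c = pseudonymise("ada@example.com", key_v2)

print(token_a == token_b)  # True: stable token, joins keep working
print(token_b == token_c)  # False: a different key gives unlinkable tokens
```

The trade-off to state in the interview: anyone holding the key can recompute tokens for known values, so pseudonymised data generally still counts as personal data under GDPR. Anonymisation removes that link entirely, at the cost of losing the ability to join on the individual.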
Category 4: Incident response and operational questions
**"A dashboard that was correct yesterday shows wrong numbers today. How do you investigate?"**: this tests systematic debugging. Strong answer: start at the dashboard and work backward toward the source. First check whether the dashboard's underlying data has changed (query the warehouse directly to reproduce the issue). If the warehouse data is wrong, check when it changed (time travel or query history). If it changed during a dbt run, check dbt logs for the run that produced the change. If it changed during ingestion, check Fivetran sync logs. Trace the lineage back to the source. Common root causes: source schema change, dbt model logic error, extract refresh failure producing stale data, join producing fan-out duplicates.
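Of the root causes listed, join fan-out is the easiest to demonstrate. The sketch below uses an in-memory SQLite database with two hypothetical tables, `orders` and `shipments`, to show how a one-to-many join inflates a total and how a simple grouping query finds the offending keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE shipments (shipment_id INTEGER PRIMARY KEY, order_id INTEGER);
    INSERT INTO orders VALUES (1, 100.0), (2, 50.0);
    -- Order 1 was split into two shipments: this is the fan-out.
    INSERT INTO shipments VALUES (10, 1), (11, 1), (12, 2);
""")

# Correct revenue, computed on the orders table alone.
true_total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

# Revenue after joining to shipments: order 1 is counted twice.
joined_total = conn.execute("""
    SELECT SUM(o.amount)
    FROM orders o
    JOIN shipments s ON s.order_id = o.order_id
""").fetchone()[0]

# Diagnostic: which join keys appear more than once on the "many" side?
fan_out_keys = conn.execute("""
    SELECT order_id, COUNT(*)
    FROM shipments
    GROUP BY order_id
    HAVING COUNT(*) > 1
""").fetchall()

print(true_total)    # 150.0
print(joined_total)  # 250.0
print(fan_out_keys)  # [(1, 2)]
```

A wrong number that appears overnight with no code change often traces back to exactly this: new source rows turning a join that used to be one-to-one into one-to-many.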
**"How do you ensure data quality in a production pipeline?"**: cover the three-layer approach — source validation (freshness, schema, null rates), transformation testing (dbt tests — primary keys, foreign keys, business rules), and reconciliation (aggregate totals matched against source system). See data quality framework.
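One check from each of the three layers can be sketched in a few lines. The helper names, the sample rows, and the source total below are hypothetical. In a dbt project the uniqueness check would normally be a built-in `unique` test rather than hand-written code.

```python
def null_rate(rows: list, column: str) -> float:
    """Source validation: fraction of rows where `column` is missing."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(column) is None) / len(rows)

def is_unique(rows: list, key: str) -> bool:
    """Transformation test: the primary key must not repeat."""
    values = [r[key] for r in rows]
    return len(values) == len(set(values))

def reconciles(source_total: float, warehouse_total: float, tolerance: float = 0.0) -> bool:
    """Reconciliation: aggregate totals must match within a tolerance."""
    return abs(source_total - warehouse_total) <= tolerance

# Hypothetical warehouse rows and a total reported by the source system.
warehouse_rows = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 2, "amount": None},
    {"order_id": 3, "amount": 5.0},
    {"order_id": 4, "amount": 25.0},
]
source_system_total = 40.0

warehouse_total = sum(r["amount"] for r in warehouse_rows if r["amount"] is not None)

print(null_rate(warehouse_rows, "amount"))               # 0.25
print(is_unique(warehouse_rows, "order_id"))             # True
print(reconciles(source_system_total, warehouse_total))  # True
```

Each check would normally be paired with a threshold and an alert, for example failing the run when the null rate on a required column exceeds an agreed limit.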
Category 5: Communication and influence
**"How would you present a data architecture recommendation to a CFO?"**: tests communication skills. Strong answers: lead with the business outcome, not the technology. "Under the current architecture, the analytics team spends 40% of its time on fire-fighting rather than analysis. This recommendation reduces that to under 10%, freeing 3 engineer-equivalents of capacity for new development. The cost is $X; the expected benefit is $Y in recovered engineering time and $Z in reduced cloud spend." Then, and only then, briefly cover the technical approach. See data architecture ROI.
What interviewers are looking for
**Clarity under ambiguity**: the best candidates ask clarifying questions before designing, make assumptions explicit when they must assume, and acknowledge when they do not know rather than guessing.
**Proportionate depth**: interviewers are not looking for encyclopaedic knowledge. They are looking for appropriate depth: you do not need to know every Spark configuration parameter, but you do need to know when to use Spark and what the major trade-offs are.
**Defensibility**: can you defend your choices? "I would use Snowflake here because..." with a reason is better than "Snowflake is always better." The reasoning is what the interviewer is actually evaluating.
**Awareness of what you do not know**: senior architects who say "I have not used that specific product in production, but based on the architecture it would behave like X" are more credible than those who confidently describe products they have never used.
For the foundational knowledge these questions draw on, see data architecture patterns, kimball vs inmon, snowflake architecture guide, and how to become a data architect.