The SQL, systems design, and pipeline architecture questions that distinguish senior data engineers in interviews — plus the scenario-based questions that test real-world problem-solving rather than textbook knowledge.
Data engineering interviews have gotten harder and more consistent over the past few years. Most senior data engineering roles now follow a predictable structure: SQL assessment, systems design, pipeline architecture, and a behavioral interview about how you have handled production incidents or complex technical decisions. Knowing the structure, and what interviewers are actually evaluating in each section, makes the difference.
This guide covers each interview component with the questions that distinguish strong candidates from average ones.
SQL Assessment
SQL interviews for data engineers are not about syntax — everyone knows SELECT FROM WHERE. They test whether you understand what the database is doing and can write queries that are correct, efficient, and idiomatic.
**Window functions.** Almost every senior DE SQL interview includes window functions. The common scenarios: running totals, rankings within groups, lag/lead for comparing current and prior period values, and first/last value within a partition.
A frequently asked question: "Write a query that returns the top-3 products by revenue per country."
A naive answer uses a subquery with LIMIT 3 per country, which is incorrect in SQL (LIMIT applies to the full result, not per group). The correct answer uses RANK() or ROW_NUMBER() with a PARTITION BY clause:
WITH ranked AS (
SELECT
country,
product_name,
SUM(revenue) as total_revenue,
RANK() OVER (PARTITION BY country ORDER BY SUM(revenue) DESC) as revenue_rank
FROM sales
GROUP BY country, product_name
)
SELECT country, product_name, total_revenue
FROM ranked
WHERE revenue_rank <= 3;
**Cumulative aggregation and running totals.** A question like "compute the 7-day rolling average of daily orders" tests whether you know ROWS BETWEEN 6 PRECEDING AND CURRENT ROW within a window specification.
**Deduplication and exactly-once semantics.** Given a table with duplicate rows (e.g., event data where the same event was loaded twice), write a query that returns only the first occurrence. Tests knowledge of ROW_NUMBER() with deduplication logic and understanding of why simply using DISTINCT often does not solve the problem.
**Self-joins for sequential patterns.** "Find all customers who made a purchase within 30 days of their first purchase." Tests whether you can construct a self-join with date arithmetic rather than reaching for window functions unnecessarily.
**Explaining query plans.** Many SQL interviews ask you to describe what a query will do, predict its performance characteristics, or identify why it would be slow. Know how to read a basic EXPLAIN plan — what a sequential scan versus an index scan means, what a hash join versus a nested loop join means, and when a query will be slow because of its structure.
Pipeline Architecture and Systems Design
Systems design in data engineering interviews tests how you think about building data infrastructure, not just writing code. The format is typically: "Design a data pipeline for X" — and then a 45-minute conversation about trade-offs.
Common design prompts:
- Design a real-time fraud detection pipeline
- Design a data platform that ingests from 50 SaaS sources into a warehouse
- Design a metrics computation system that produces daily business KPIs
- Design a pipeline that processes 10TB of event data per day
What interviewers evaluate:
Do you ask clarifying questions before designing? Strong candidates ask: What are the latency requirements? What are the data volumes? What is the downstream consumer? What are the failure mode requirements? A candidate who designs without asking these questions is not thinking like an engineer.
Do you understand the trade-offs? A batch pipeline is simpler to build and operate than a streaming pipeline. If the latency requirement is hourly, building a Kafka + Flink streaming system is over-engineering. Naming the trade-off — "batch is simpler and cheaper but adds latency; streaming is complex but can support sub-minute freshness" — demonstrates engineering judgment.
Do you consider failure modes? What happens when a source API is down? When the destination warehouse has an outage? When a schema change breaks the pipeline? Strong candidates design for failure explicitly — dead letter queues, idempotent writes, alerting, and retry logic.
Do you know the tooling? For a batch ELT pipeline at scale, "I would use Fivetran for ingestion, Snowflake as the warehouse, and dbt for transformation" is a valid answer. "I would build a custom Python extraction framework" for a problem that Fivetran solves is a red flag — it demonstrates lack of awareness of what managed tooling exists.
**The idempotency question.** Almost every pipeline design interview will probe idempotency — if the pipeline runs twice for the same time window, does the warehouse end up with duplicate data? Strong candidates design idempotent pipelines: upserts or merge operations rather than appends, deduplication logic, and checkpoint-based processing that can safely restart from the last successful state.
Data Modeling Questions
**Dimensional modelling design.** Given a business domain (e-commerce, finance, SaaS subscriptions), design a dimensional model. The interviewer is looking for: identification of facts (transactions, events) versus dimensions (customers, products, time), grain definition for the fact table, and understanding of when to denormalise (star schema) versus normalise (snowflake schema).
"Design a data model for an e-commerce platform that supports order analysis, product performance analysis, and customer lifetime value calculation." A strong answer identifies: fct_orders (grain: one row per order line), dim_customers, dim_products, dim_date. It notes that CLV calculation requires additional models (customer-level aggregation table, cohort models). It discusses SCD Type 2 for product price changes and customer dimension history.
**Grain design.** "Your fct_orders table has one row per order, but you now need to support order line-level analysis. How do you design this?" Tests understanding of fact table grain and the implications of changing grain for downstream models and reports.
**Slowly changing dimension design.** "A customer's account tier can change multiple times per year. How do you track this so historical purchases show the tier at the time of purchase?" SCD Type 2 — the classic answer, but strong candidates also address the implementation (dbt snapshots, surrogate keys in fact tables, join patterns for current vs historical analysis).
Scenario-Based and Behavioral Questions
**Production incident diagnosis.** "Your pipeline ran last night but the dashboard is showing lower revenue than expected. Walk me through how you would diagnose this." Strong candidates structure their diagnosis: check pipeline logs first (did it complete? any errors?); check row counts (are tables empty or unexpectedly small?); check the specific query the dashboard runs (correct filters, correct joins?); check for schema changes in the source. They do not jump to conclusions — they describe a systematic process.
**Trade-off decisions.** "You are building a reporting pipeline for month-end financial close. The business wants it to run in under 2 hours. The current approach takes 6 hours. What would you investigate?" Tests whether you understand the performance levers: warehouse sizing, query optimisation, incremental vs full refresh, parallelism, and when to accept that some targets are not achievable without fundamental redesign.
**Communicating with non-technical stakeholders.** "A business analyst tells you that two different dashboards show different revenue numbers for the same period. How do you handle this?" Tests your approach to data quality investigation and stakeholder communication — not just the technical diagnosis but how you communicate ambiguity, manage expectations, and present findings.
Preparation Approach
For SQL: practice with real datasets on a warehouse you can query directly. LeetCode's database problems are too narrow for data engineering roles — they test algorithmic SQL that rarely appears in practice. Use real analytical scenarios: time series analysis, cohort analysis, funnel analysis, and deduplication.
For systems design: practice designing pipelines end-to-end on a whiteboard or paper before your interview. The goal is not to memorise a design but to develop fluency in the vocabulary (idempotency, backfill, schema evolution, late-arriving data, CDC) and the frameworks (what are the latency requirements, what are the failure modes, what is the operational burden).
For behavioral questions: prepare 3–4 examples of technical decisions you have made, production problems you have solved, and situations where you had to communicate technical constraints to stakeholders. Use the STAR format (Situation, Task, Action, Result) but focus on the Action and the trade-offs you made.
Our data engineering consulting practice hires and evaluates data engineers — contact us if you are looking for data engineering talent or support.
A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.
Book a Call →