BlogData Engineering

Data Engineering Interview Questions: What Senior Engineers Actually Ask

Austin Duncan
Austin Duncan
Managing Director & Principal Data Architect
·April 11, 202713 min read

Data engineering interviews test a different skill set than software engineering interviews. The questions that reveal strong candidates are not algorithm puzzles — they are system design discussions, pipeline architecture scenarios, and probing questions about trade-offs in real infrastructure decisions. This guide covers the questions experienced data engineers ask and what good answers look like.

Data engineering interviews have evolved significantly over the past decade. Leetcode-style algorithm questions test general software engineering competence but tell you almost nothing about whether a candidate can build reliable data pipelines. The most effective data engineering interviews focus on system design, pipeline architecture, and the trade-offs that experienced engineers have had to make in real production environments.

This guide covers the questions that reveal signal — and what strong vs weak answers look like.

Pipeline Design Questions

"Walk me through how you would design an ingestion pipeline for a high-volume transactional database."

What you are testing: does the candidate understand the difference between full and incremental loading, the challenges of CDC (change data capture) at scale, and the operational considerations of running pipelines against production databases?

Strong answers cover: the choice between full table scans (simple but slow and load-intensive), query-based incremental loading (fast but misses updates and deletes), and CDC via database log tailing (Debezium, AWS DMS, or similar). Strong candidates discuss the impact on the source database (read load during full scans, binary log retention requirements for CDC), how deletes are handled (a common failure mode of incremental approaches), and how schema changes are managed.

Weak answers describe "pulling data from the database on a schedule" without discussing the incremental strategy, and cannot explain how they would handle high-volume tables with frequent updates and deletes.

"You have a pipeline that processes 10 million records per day and needs to complete within a 4-hour window. It is currently taking 6 hours. How would you approach this?"

What you are testing: does the candidate have a systematic approach to performance diagnosis, or do they jump to solutions?

Strong answers start with diagnosis before solutions. Where is the time being spent? Is it in the extract (slow source queries), the transform (expensive computations on a large dataset), or the load (network transfer or write contention)? The candidate should ask for profiling data before proposing solutions. Solutions then become targeted: for slow source queries — index optimisation or query restructuring; for slow transforms — parallelisation, reducing data volume processed per record, or pushing computation to the warehouse; for slow loads — bulk insert instead of row-by-row, write parallelism, or partitioned target tables.

Weak answers jump directly to "add more compute" or "shard the pipeline" without diagnosing where the bottleneck actually is.

Data Modelling Questions

"How would you model a customer's purchase history in a data warehouse if you need to support both transactional analysis (individual orders) and customer-level aggregate analysis (lifetime value, purchase frequency)?"

What you are testing: does the candidate understand dimensional modelling, grain declaration, and when to use different fact table structures?

Strong answers discuss the grain decision explicitly: a transaction fact table at order-line grain serves transactional analysis; a customer-period snapshot fact table serves aggregate analysis without requiring expensive aggregation queries at runtime. The candidate should discuss the trade-offs: transaction fact + aggregation at query time vs pre-aggregated snapshot + staleness risk. They should mention slowly changing dimensions for customer attributes that change over time (segment, tier, acquisition channel).

Weak answers describe a single wide table with all fields, or describe a normalised OLTP schema without addressing the analytical query patterns the model needs to serve.

"How do you handle slowly changing dimensions in your data models?"

What you are testing: does the candidate know the SCD types and when to apply each?

Strong answers cover Type 1 (overwrite — use when history is not analytically relevant), Type 2 (add row with effective dates — use when historical values matter for analysis, e.g., "what segment was this customer in when they placed this order?"), and Type 3 (add prior column — use when only one prior state is needed). They should be able to give examples of attributes that warrant each type and discuss the query complexity trade-offs of Type 2 (more joins, more complex range queries).

Weak answers know only Type 1 and Type 2 by name without being able to reason about when each is appropriate.

Infrastructure and Operations Questions

"Describe how you would design a CI/CD pipeline for a dbt project."

What you are testing: does the candidate understand software engineering practices applied to data transformation?

Strong answers cover: a PR-triggered CI pipeline that compiles the dbt project (checking for syntax errors and broken references), runs slim CI (only changed models and their downstream dependencies) using the state-modified+ selector with a deferred production manifest, runs SQLFluff linting, and provides a test environment with PR-scoped schemas so tests do not conflict with each other. The production CD pipeline runs after merge: source freshness check, full run, full test, artifact upload for next CI state comparison.

Weak answers describe "running dbt test before pushing" without understanding the PR isolation problem or the slim CI concept.

"What do you do when a pipeline fails in production at 2 AM?"

What you are testing: does the candidate have an operational mindset — are they thinking about runbooks, alerting, escalation, and the impact on downstream consumers?

Strong answers describe: the alerting system that detects the failure (dbt Cloud alerts, Airflow failure callbacks, custom monitoring on quality checks), the runbook that guides initial diagnosis (is it a source data issue? a transformation bug? infrastructure failure?), how they communicate to downstream consumers (which dashboards are affected, what the current data status is), and how they decide between quick remediation (rerun the pipeline) vs escalation (wake up the on-call engineer, hold the load, investigate the root cause before proceeding).

Weak answers describe diagnosing the failure technically without mentioning communication, consumer impact, or the decision about whether the current data state is safe to leave.

Advanced Technical Questions

"What is the difference between UNION and UNION ALL, and when would you use each?"

UNION removes duplicate rows (requiring a sort/hash operation). UNION ALL retains all rows. Use UNION ALL by default — it is faster and the deduplication of UNION is usually unnecessary if the source queries are designed correctly. Use UNION only when you genuinely need to deduplicate across the unioned result sets.

"Explain the difference between a LEFT JOIN and a LEFT ANTI JOIN."

Left JOIN returns all rows from the left table with matching right table rows joined (NULL for right columns where no match). Left anti join returns only rows from the left table where no match exists in the right table — the SQL equivalent of "give me all left-table records that are not in the right table." In SQL: SELECT l.* FROM left l LEFT JOIN right r ON l.id = r.id WHERE r.id IS NULL.

"How would you detect and handle duplicate records in an ingestion pipeline?"

Strong answers discuss: detecting duplicates before loading (COUNT vs COUNT(DISTINCT) on the natural key), deduplication strategies (ROW_NUMBER() partitioned by natural key ordered by timestamp to keep the most recent), handling duplicates in the target (UPSERT/MERGE vs delete-and-insert), and distinguishing true duplicates from legitimate repeat transactions (the same customer purchasing the same product on the same day may not be a duplicate — it depends on the business context).

What to Actually Test

The questions above reveal three competency areas: pipeline architecture (can they design reliable, scalable data movement?), data modelling (can they design models that answer analytical questions efficiently?), and operational thinking (do they understand what happens when things go wrong?).

What most data engineering interviews test but should not: algorithmic puzzle solving. A candidate who can reverse a binary tree but cannot explain the difference between FIXED and INCLUDE LOD expressions in Tableau, or cannot explain what a backfill strategy looks like for a 5-year historical dataset, is not a strong data engineering candidate for a consulting environment.

Our data engineering practice is always looking for strong data engineers — contact us if you are interested in opportunities or need a fractional team.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →