Data Engineer vs Data Scientist: Roles, Skills, and How They Work Together

Data engineers build the pipelines and platforms that make data available. Data scientists build models and analysis on top of that data. The distinction matters for hiring, team design, and understanding why ML projects fail without the right infrastructure beneath them.

The quick answer

Data engineers build the infrastructure that makes data accessible: pipelines, warehouses, transformation layers, and data quality controls. Data scientists build on top of that infrastructure: machine learning models, statistical analysis, predictive systems, and exploratory data work. Organisations frequently confuse the two roles, hire one when they need the other, and build ML programmes on infrastructure that was never designed to support them.

The most common failure mode: an organisation hires data scientists to build ML models before a data engineer has built reliable, clean, accessible data. The data scientists then spend most of their time (70–80% is a commonly cited figure) doing data engineering work (cleaning, joining, deduplicating) instead of modelling. The ML programme stalls not because the models are wrong but because the data beneath them is not ready.

Data engineer: what the role actually does

A data engineer designs and builds the systems that move, transform, and store data at scale. Concretely:

**Pipeline development**: building ingestion pipelines from source systems (databases, APIs, SaaS platforms, event streams) into the data platform. Handling schema changes, API rate limits, incremental vs full loads, error handling, and retry logic. Tools: Fivetran, Airbyte, Azure Data Factory (ADF), Kafka, Spark, Python.
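To make "incremental loads" and "retry logic" concrete, here is a minimal sketch in plain Python. It assumes a source that exposes an `updated_at` column and fails with `ConnectionError` on transient problems. The names `with_retries`, `incremental_load`, `flaky_fetch`, and the `SOURCE` rows are hypothetical, invented for illustration; managed tools such as Fivetran or Airbyte handle this internally in their own way.

```python
import time


def with_retries(fetch, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(), retrying on ConnectionError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** (attempt - 1))  # delay doubles after each failure


def incremental_load(source_rows, watermark):
    """Return only rows changed after the watermark, plus the new watermark."""
    new_rows = [r for r in source_rows if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in new_rows), default=watermark)
    return new_rows, new_watermark


# Hypothetical source table. ISO-8601 date strings compare correctly as text.
SOURCE = [
    {"id": 1, "updated_at": "2024-01-01"},
    {"id": 2, "updated_at": "2024-01-03"},
    {"id": 3, "updated_at": "2024-01-05"},
]

attempts = {"count": 0}


def flaky_fetch():
    """Simulated source that fails twice before succeeding."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("transient network error")
    return SOURCE


rows = with_retries(flaky_fetch, max_attempts=3, base_delay=0.0)
loaded, watermark = incremental_load(rows, watermark="2024-01-02")
print([r["id"] for r in loaded], watermark)
```

A full load would simply ignore the watermark and reload every row. The incremental version trades that simplicity for less data moved per run, at the cost of having to store the watermark reliably between runs.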

**Data warehouse/lakehouse architecture**: designing the storage layer — schema design, partitioning strategy, clustering, access patterns. Building the medallion architecture (Bronze/Silver/Gold). Choosing between Snowflake, BigQuery, Databricks, and Redshift based on requirements. For the decision context, see Snowflake vs Databricks.

**Transformation layer**: building and maintaining the transformation pipelines that convert raw ingested data into analytics-ready models. In many modern stacks, this means dbt running on a cloud warehouse. Managing model dependencies, testing, documentation, and scheduling. See what is dbt.

**Orchestration**: managing pipeline execution — scheduling, dependency resolution, failure alerting, retries. Tools: Apache Airflow, Prefect, Dagster.
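The core of orchestration, dependency resolution plus retries and alerting, can be sketched with Python's standard-library `graphlib`. This is an illustration only, not how Airflow, Prefect, or Dagster are implemented. The task names, the `run_pipeline` helper, and the print-based "alert" are all hypothetical.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on (hypothetical pipeline).
DAG = {
    "ingest_orders": set(),
    "ingest_customers": set(),
    "transform_sales": {"ingest_orders", "ingest_customers"},
    "refresh_dashboard": {"transform_sales"},
}


def run_pipeline(dag, tasks, max_retries=1):
    """Run tasks in dependency order, retrying failures and skipping
    anything downstream of a task that ultimately failed."""
    completed, failed, skipped = [], [], []
    for name in TopologicalSorter(dag).static_order():
        if any(dep in failed or dep in skipped for dep in dag[name]):
            skipped.append(name)  # an upstream task did not complete
            continue
        for attempt in range(max_retries + 1):
            try:
                tasks[name]()
            except Exception as exc:
                if attempt == max_retries:
                    failed.append(name)
                    print(f"ALERT: {name} failed after {attempt + 1} attempts: {exc}")
            else:
                completed.append(name)
                break
    return completed, failed, skipped


def ok():
    """Stand-in for a task that succeeds."""


def boom():
    """Stand-in for a task that always fails."""
    raise RuntimeError("source unavailable")


tasks = {name: ok for name in DAG}
completed, failed, skipped = run_pipeline(DAG, tasks)

tasks_with_failure = dict(tasks, ingest_customers=boom)
completed2, failed2, skipped2 = run_pipeline(DAG, tasks_with_failure)
print(completed2, failed2, skipped2)
```

Skipping downstream tasks when an upstream task fails is the design choice that matters here: it prevents a dashboard from refreshing on top of stale or partial data.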

**Data quality**: implementing quality checks at ingestion, transformation, and the BI layer. Writing dbt tests, configuring anomaly detection (Monte Carlo, Bigeye), and monitoring pipeline health.
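As a rough illustration of what such checks do, the sketch below implements three of them in plain Python: a not-null check, a uniqueness check, and a crude row-count anomaly check. In a dbt project the first two would normally be declared as tests rather than hand-written. The function names, the `ORDERS` rows, and the 50% tolerance are assumptions made for the example.

```python
from collections import Counter


def check_not_null(rows, column):
    """Return the rows where the column is missing or None."""
    return [r for r in rows if r.get(column) is None]


def check_unique(rows, column):
    """Return the values that appear more than once in the column."""
    counts = Counter(r[column] for r in rows if r.get(column) is not None)
    return sorted(value for value, n in counts.items() if n > 1)


def check_row_count(rows, expected, tolerance=0.5):
    """Return True when the row count deviates from the expected count
    by more than the given fraction (a very simple anomaly check)."""
    return abs(len(rows) - expected) > tolerance * expected


# Hypothetical batch with one null customer and one duplicated order_id.
ORDERS = [
    {"order_id": 1, "customer_id": "c1", "amount": 20.0},
    {"order_id": 2, "customer_id": None, "amount": 35.5},
    {"order_id": 2, "customer_id": "c3", "amount": 12.0},
]

null_customers = check_not_null(ORDERS, "customer_id")
duplicate_ids = check_unique(ORDERS, "order_id")
volume_anomaly = check_row_count(ORDERS, expected=10)
print(len(null_customers), duplicate_ids, volume_anomaly)
```

Running checks like these at ingestion, before the data reaches downstream models, is what lets analysts and data scientists trust the warehouse without re-validating it themselves.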

**Infrastructure management**: provisioning and managing cloud data infrastructure, including compute, storage, access control, and cost governance. For cloud-specific patterns, see Azure data architecture best practices.

A data engineer's output is infrastructure: reliable pipelines, clean data in the warehouse, tested transformation models, and data that data scientists and analysts can trust. The job is done when the data is correct, accessible, and maintainable — not when an analysis is produced.

**What data engineers are not**: they are not primarily analysts (they do not produce business insights), not data scientists (they do not build predictive models), and not database administrators (they are not primarily responsible for production OLTP database operations, though there is overlap in data platform administration).

Data scientist: what the role actually does

A data scientist builds models, analysis, and systems that extract value from data beyond what reporting dashboards can provide. Concretely:

**Exploratory analysis**: statistical investigation of datasets to understand patterns, distributions, correlations, and anomalies. Python (pandas, matplotlib, seaborn), R, or SQL-based analysis in Jupyter notebooks or BI tools.

**Machine learning model development**: building predictive models — classification (will this customer churn?), regression (what will revenue be next quarter?), clustering (what customer segments exist?), recommendation (what should we show this user?). Tools: scikit-learn, XGBoost, TensorFlow, PyTorch, MLflow.

**Feature engineering**: transforming raw data into the numerical representations that ML models consume. This is the work that most directly requires clean, well-structured data from the data engineering layer — poor data quality in features produces poor model performance.
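A minimal sketch of two common feature-engineering steps, standardising a numeric column and one-hot encoding a categorical one, using only the standard library. The `customers` records and the helper names are hypothetical. In practice this is usually done with pandas or scikit-learn preprocessing rather than by hand.

```python
from statistics import mean, pstdev


def standardise(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column carries no signal
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Encode categories as 0/1 indicator vectors, in sorted category order."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return encoded, categories


def build_features(records):
    """Combine standardised spend with one-hot encoded plan per record."""
    spend = standardise([r["monthly_spend"] for r in records])
    plans, categories = one_hot([r["plan"] for r in records])
    return [[s] + p for s, p in zip(spend, plans)], categories


# Hypothetical customer records from a clean warehouse table.
customers = [
    {"plan": "basic", "monthly_spend": 10.0},
    {"plan": "pro", "monthly_spend": 30.0},
    {"plan": "basic", "monthly_spend": 20.0},
]

features, plan_categories = build_features(customers)
print(plan_categories, features)
```

A null or mistyped `monthly_spend` would break or silently distort the scaling step, which is why this work depends so directly on upstream data quality.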

**Model evaluation and validation**: assessing model performance (accuracy, precision, recall, AUC-ROC), identifying overfitting, comparing model versions, establishing baseline comparisons. Rigorous evaluation prevents deploying models that look good on training data but fail in production.
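The sketch below computes accuracy, precision, and recall from scratch for a binary classifier, to show why accuracy alone can mislead. The labels are made-up illustration data for a hypothetical churn model where positives are rare. Real projects would typically use a library such as scikit-learn for these metrics.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def evaluate(y_true, y_pred):
    """Return accuracy, precision, and recall, guarding against division by zero."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical churn labels: 3 churners out of 10 customers.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

scores = evaluate(y_true, y_pred)
print(scores)
```

Here the model is right on 7 of 10 customers, yet it finds only one of the three actual churners. On imbalanced problems, precision and recall usually say more about usefulness than accuracy does.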

**Model deployment**: deploying trained models to production for inference — REST API endpoints, batch scoring pipelines, embedded model calls. In mature organisations, this may be shared with or owned by a machine learning engineer. Tools: MLflow Model Registry, Databricks Model Serving, SageMaker, Azure ML.

**A/B testing and experimentation**: designing experiments to test hypotheses about user behaviour, model performance, or business interventions. Statistical significance analysis, power calculations, and experiment tracking.
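One common form of significance analysis is the two-proportion z-test on conversion rates. The sketch below is one way to write it with the standard library. It assumes large, independent samples, and the conversion counts are invented for illustration. It does not cover power calculations or corrections for peeking at results early.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates.

    Returns the z statistic and the p-value under the null hypothesis
    that both variants share the same underlying conversion rate.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # rate if there is no real difference
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical experiment: control converts 100/1000, variant converts 130/1000.
z, p_value = two_proportion_z_test(100, 1000, 130, 1000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

With these made-up numbers the p-value falls below the conventional 0.05 threshold. Whether the measured lift justifies shipping the variant is still a business judgement, not a statistical one.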

**Stakeholder communication**: translating model outputs and analytical findings into business language. Data scientists need to communicate confidence intervals, model limitations, and decision recommendations to non-technical stakeholders.

**What data scientists are not**: they are not data engineers (they should not be building production pipelines — though they often do this when engineering support is absent), not BI developers (they do not primarily build dashboards), and not database administrators.

The analytics engineer: the role in between

A third role — **analytics engineer** — has emerged between data engineering and data science, occupying the transformation and modelling layer:

Analytics engineers build and maintain the dbt transformation models that convert raw warehouse data into analytics-ready tables. They work closely with both data engineers (who provide clean Bronze/Silver data) and analysts/data scientists (who consume Gold layer models). They write SQL, version-control models, write dbt tests, and document data assets.

The analytics engineer role fits teams where the transformation layer is owned separately from pipeline development (which goes to data engineers) and from statistical modelling (which goes to data scientists). dbt popularised this role definition, and it is now widely recognised in modern data teams.

For the full data team structure and hiring sequence, see how to build a data team.

How the roles interact

In a functioning data team, the roles form a production chain:

Data engineers build and maintain the pipelines that load raw data into the warehouse. Analytics engineers transform that raw data into clean, tested, documented models. Data scientists consume those models to build features and train ML models. BI developers consume those models to build dashboards. The chain breaks when any link fails — and the failure point is almost always the data engineering foundation.

In practice, the boundary between roles is fuzzy and negotiated within teams. In smaller organisations, a data engineer may also own the transformation layer (analytics engineer responsibilities). A data scientist may build their own feature pipelines when data engineering capacity is limited. In larger organisations, the roles are more distinct.

The important point is not exactly where the boundaries fall but that someone owns each layer. When a "data scientist" is expected to cover everything from pipeline building to modelling to deployment, and there is no data engineer, the result is a researcher who spends most of their time on infrastructure work.

Hiring implications

**Hire the data engineer before the data scientist.** If you do not have reliable, clean data in a queryable warehouse, data scientists cannot do their work effectively. Hiring a data scientist before the data foundation exists has a predictable cost: the scientist spends months doing engineering work, builds models on data they cleaned themselves (in a way that is neither maintainable nor reproducible), and delivers results slowly. Investing in the foundation first produces faster, more reliable ML output.

**Do not expect data scientists to solve data engineering problems.** Data scientists can work around missing infrastructure, but they do so by building ad-hoc, fragile solutions that do not scale and create technical debt. When a data scientist has been in a role for six months and the most common complaint is "I spend all my time cleaning data," the organisation does not have a data scientist problem — it has a data engineering problem.

**The data scientist-to-engineer ratio.** A common rule of thumb is that mature data organisations maintain roughly 1 data engineer per data scientist. This ratio varies significantly with ML programme maturity: early-stage programmes need more engineering support per scientist, while mature programmes with established infrastructure can support more scientists per engineer. As a starting point, plan for 1:1.

For compensation benchmarks and role specifications, see how to build a data team.

Frequently asked questions

Can one person do both roles?

In small organisations, yes — and this is often described as a "data scientist" who builds their own pipelines and models. The limitation: doing both well requires a very unusual combination of engineering rigour and statistical depth. Most people are stronger in one domain. As organisations scale, the two responsibilities grow apart and specialisation becomes necessary.

What is a machine learning engineer?

An ML engineer sits between data scientist and software engineer: focused on deploying, scaling, and maintaining ML models in production. A data scientist may build a model; an ML engineer makes it production-grade, scalable, and maintainable. In smaller organisations, data scientists own ML deployment; in larger organisations, a dedicated ML engineering function handles it.

Our data scientists say they spend 80% of their time on data prep. What does that mean?

It means your data engineering foundation is inadequate for your ML programme. The path forward: hire a data engineer (or engage a consulting firm) to build reliable pipelines and a clean data warehouse. Once the data foundation exists, the share of data scientist time spent on data prep should fall sharply; 20% or below is a reasonable target.

Our data architecture consulting practice builds the data platforms that make ML programmes viable. If your data team is spending more time on data plumbing than on modelling, book a free 30-minute audit to identify what is missing from the foundation.
