
What Is Data Engineering? Roles, Responsibilities, and the Modern Data Stack

James Okafor
Lead Data Engineer
February 16, 2028 · 11 min read

Data engineering is the discipline of designing, building, and maintaining the systems that collect, store, process, and serve data to the rest of the organisation. This guide explains what data engineers do, how the role differs from data science and analytics, the tools and technologies in the modern data stack, and why the data engineering function is the foundation that all analytics depends on.

What Data Engineering Is

Data engineering covers every system that collects, stores, processes, and serves data throughout an organisation, from design through build to ongoing maintenance. Data engineers build the infrastructure that makes data available, reliable, and usable for analysts, data scientists, and BI tools.

The outputs of data engineering are systems, not analyses. A data engineer builds a pipeline that delivers customer transaction data from the operational database to the analytical warehouse. They do not analyse what the transactions mean — they ensure the data is complete, timely, and correctly formatted when the analyst or data scientist needs it.
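
To make "complete, timely, and correctly formatted" concrete, here is a minimal sketch of a batch check a pipeline might run before publishing data. The field names, the 24-hour freshness threshold, and the `check_batch` helper are all hypothetical, chosen only for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical column names for a customer-transactions feed.
REQUIRED_FIELDS = ("transaction_id", "customer_id", "amount", "created_at")


def check_batch(rows, now, max_age=timedelta(hours=24)):
    """Return a list of human-readable problems found in a batch of rows.

    Assumes every created_at value carries a UTC offset, so it can be
    compared with a timezone-aware `now`.
    """
    if not rows:
        return ["batch is empty"]

    problems = []
    newest = None
    for i, row in enumerate(rows):
        # Completeness: every required field must be present and non-null.
        missing = [f for f in REQUIRED_FIELDS if row.get(f) is None]
        if missing:
            problems.append(f"row {i}: missing {', '.join(missing)}")
            continue
        # Format: amount must be numeric, created_at must parse as ISO 8601.
        try:
            float(row["amount"])
            created = datetime.fromisoformat(row["created_at"])
        except (TypeError, ValueError):
            problems.append(f"row {i}: badly formatted amount or created_at")
            continue
        if newest is None or created > newest:
            newest = created

    # Timeliness: the newest valid row must be recent enough.
    if newest is None or now - newest > max_age:
        problems.append("batch is stale or has no valid timestamps")
    return problems
```

An empty list means the batch passed all three checks; anything else is a reason to hold the load and alert someone.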

What Data Engineers Actually Do

**Build data pipelines**: The core function — designing and implementing the flows that move data from source systems to analytical destinations. A pipeline might ingest Salesforce CRM data every hour via Fivetran, CDC-stream operational database changes via Debezium into Kafka, or batch-extract files from an SFTP server and load them to S3.
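
As a rough illustration of the batch case, the sketch below extracts rows from a CSV export and loads them into a raw table. It uses an in-memory SQLite database as a stand-in for the warehouse, and the `raw_accounts` table and its columns are hypothetical. A real pipeline would read from SFTP or an API and write to S3 or a warehouse instead.

```python
import csv
import io
import sqlite3


def extract(csv_text):
    """Parse a CSV export into dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def load(conn, rows):
    """Upsert rows into a raw table and return the table's row count.

    Keying on account_id makes the load idempotent: re-running the same
    batch replaces rows instead of duplicating them.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_accounts ("
        "account_id TEXT PRIMARY KEY, name TEXT, updated_at TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO raw_accounts (account_id, name, updated_at) "
        "VALUES (:account_id, :name, :updated_at)",
        rows,
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM raw_accounts").fetchone()[0]


if __name__ == "__main__":
    export = (
        "account_id,name,updated_at\n"
        "A1,Acme,2028-02-01\n"
        "A2,Globex,2028-02-03\n"
    )
    warehouse = sqlite3.connect(":memory:")  # stand-in for a real warehouse
    print(load(warehouse, extract(export)))
    print(load(warehouse, extract(export)))  # re-run: count is unchanged
```

Idempotency is the design choice worth noting here: a pipeline that can be safely re-run after a failure is much easier to operate than one that cannot.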

**Maintain data infrastructure**: Keeping pipelines running reliably, which means monitoring for failures, debugging extraction errors, handling schema changes in source systems, managing database connections, and ensuring refresh schedules are met. In many teams, more data engineering time goes to operations than to new development.
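
Much of that operational work is about handling transient failures gracefully. A minimal sketch of retry with exponential backoff, assuming a hypothetical `task` callable that raises on failure, might look like this. Orchestrators such as Airflow provide their own retry settings, so this only illustrates the idea.

```python
import time


def run_with_retries(task, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run task(); after a failure wait base_delay * 2**attempt, then retry.

    The final failure is re-raised so monitoring and alerting can see it.
    `sleep` is injectable so the function can be tested without waiting.
    """
    for attempt in range(max_attempts):
        try:
            return task()
        except Exception as exc:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            delay = base_delay * 2 ** attempt
            print(f"attempt {attempt + 1} failed ({exc}); retrying in {delay}s")
            sleep(delay)
```

Retries help with flaky connections and rate limits. They do not help with a changed schema or bad credentials, which is why the last failure is raised rather than swallowed.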

**Design storage architecture**: Selecting and configuring the data stores that serve different purposes — a data warehouse (Snowflake, BigQuery, Redshift) for analytical queries, a data lake (S3 with Iceberg) for large-scale processing, a streaming layer (Kafka) for real-time data, and caches or pre-aggregation tables for high-frequency BI queries.

**Data modelling and transformation**: Building the transformation layer that converts raw ingested data into clean, analytics-ready tables. In modern data stacks, much of this is written in dbt. A common split is that data engineers own the heavy transformation work (Spark jobs, complex SQL transformations) while analytics engineers own the dbt semantic layer above it.
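
The sketch below shows the shape of such a transformation: a single SELECT that turns a messy raw table into a clean one, run here against SQLite for portability. The `raw_orders` and `clean_orders` tables and their columns are hypothetical, and a dbt model would express the same SELECT in its own project structure.

```python
import sqlite3

# One SELECT defines the clean table: normalise emails, cast amounts,
# drop rows without a key, and keep only the latest load of each order.
# If two loads of the same order share a loaded_at value, both survive.
CLEAN_ORDERS_SQL = """
CREATE TABLE clean_orders AS
SELECT
    r.order_id,
    LOWER(TRIM(r.customer_email)) AS customer_email,
    CAST(r.amount AS REAL)        AS amount
FROM raw_orders AS r
WHERE r.order_id IS NOT NULL
  AND r.loaded_at = (
      SELECT MAX(loaded_at) FROM raw_orders WHERE order_id = r.order_id
  )
"""


def build_clean_orders(conn):
    """Rebuild clean_orders from raw_orders and return its rows."""
    conn.execute("DROP TABLE IF EXISTS clean_orders")
    conn.execute(CLEAN_ORDERS_SQL)
    return conn.execute(
        "SELECT order_id, customer_email, amount "
        "FROM clean_orders ORDER BY order_id"
    ).fetchall()
```

Rebuilding the table from scratch on each run keeps the logic simple. Incremental builds are a later optimisation for when full rebuilds become too slow.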

**Tooling and platform work**: Maintaining Airflow DAGs, configuring Spark clusters, managing Fivetran connectors, setting up monitoring and alerting, and handling infrastructure-level concerns like IAM permissions, network security, and cost management.

The Modern Data Stack

The modern data stack is the set of tools that most mid-market and enterprise analytics teams use:

**Ingestion**: Fivetran or Airbyte for SaaS and database sources. Debezium for CDC. Kafka for event streaming. Custom Python connectors for proprietary or internal sources.

**Storage**: Snowflake, BigQuery, Redshift, or Databricks as the primary analytical store. S3/GCS as a data lake layer for raw data and large-scale processing. Object storage is usually the cheapest place to keep large volumes of raw data, and it is not tied to any single query engine.

**Transformation**: dbt (data build tool) as the SQL transformation framework. PySpark or Spark SQL for transformations too large or complex for warehouse SQL.

**Orchestration**: Apache Airflow or Dagster to schedule and monitor pipeline execution. Dagster for asset-oriented pipelines; Airflow for complex dependency graphs.
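
At its core, an orchestrator resolves a dependency graph into a valid execution order. The sketch below shows that idea with Python's standard-library `graphlib` (Python 3.9+). The task names are hypothetical, and this is not how Airflow or Dagster are configured; it only illustrates the ordering problem they solve.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "ingest_crm": set(),
    "ingest_orders": set(),
    "transform_customers": {"ingest_crm"},
    "transform_revenue": {"ingest_orders", "transform_customers"},
    "refresh_dashboard": {"transform_revenue"},
}


def run_pipeline(tasks, run):
    """Call run(name) for every task, never before its dependencies.

    TopologicalSorter raises graphlib.CycleError if the graph has a cycle.
    """
    order = list(TopologicalSorter(tasks).static_order())
    for name in order:
        run(name)
    return order


if __name__ == "__main__":
    run_pipeline(PIPELINE, lambda name: print(f"running {name}"))
```

Real orchestrators add scheduling, retries, parallel execution of independent tasks, and a UI on top of this ordering.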

**BI layer**: Tableau, Power BI, Looker, or Metabase connected to the analytical warehouse.

The stack is relatively standardised — most modern analytics teams use some combination of these tools. The data engineering work is in the configuration, integration, and operation of these tools, not in building them from scratch.

Data Engineering vs Data Science vs Analytics Engineering

**Data engineering**: Builds the pipelines and infrastructure. Makes data available, reliable, and scalable. Output: working systems.

**Data science**: Uses data to build models — predictive models (churn, demand forecasting), recommendation systems, NLP, computer vision. Output: models and statistical findings. Depends on data engineering infrastructure to access training data at scale.

**Analytics engineering**: Transforms raw data into clean, business-ready tables using dbt. Owns the semantic layer between the warehouse and BI tools. Output: reliable, documented data models. Sits between data engineering and data analysis.

**Data analysis**: Queries data to answer business questions. Builds dashboards and reports. Communicates findings to stakeholders. Output: insight. Depends on analytics engineering infrastructure for reliable data.

The boundaries are blurry in practice. Small teams have individuals who span multiple roles. Large teams have specialised functions for each. The key distinction is the primary output: systems vs models vs data models vs insight.

Why Data Engineering Is the Foundation

Every other data function depends on data engineering being done well. If the pipelines are unreliable, analysts spend their time investigating data anomalies instead of producing insight. If the data architecture is poorly designed, query performance degrades as data volume grows. If schema changes in source systems break pipelines silently, downstream reports produce wrong numbers without anyone noticing.
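
One way to turn a silent schema break into a loud one is to compare the source's current schema against the one the pipeline expects before loading anything. A minimal sketch follows, with a hypothetical expected schema and hypothetical helper names.

```python
# Hypothetical expected schema: column name -> declared type.
EXPECTED_SCHEMA = {"order_id": "INTEGER", "amount": "REAL", "currency": "TEXT"}


def diff_schema(expected, actual):
    """Compare two {column: type} mappings and describe every difference."""
    return {
        "missing": sorted(set(expected) - set(actual)),
        "unexpected": sorted(set(actual) - set(expected)),
        "retyped": sorted(
            col for col in set(expected) & set(actual)
            if expected[col] != actual[col]
        ),
    }


def assert_schema(expected, actual):
    """Raise instead of loading when the source schema has drifted."""
    drift = diff_schema(expected, actual)
    if any(drift.values()):
        raise ValueError(f"schema drift detected: {drift}")
```

Whether an unexpected new column should fail the run or only log a warning is a policy decision; this sketch takes the strict option.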

Investment in data engineering infrastructure has a compounding return — a reliable, well-designed data platform enables every analyst, data scientist, and BI consumer to do higher-quality work. Under-investment in data engineering produces fragile pipelines, unreliable data, and a data team that spends most of its time firefighting instead of building analytical capability.

When to Hire a Data Engineer

Organisations typically begin to need dedicated data engineering capacity when:

- The data team is spending significant time maintaining pipelines and debugging data quality issues rather than producing analysis

- Multiple source systems need to be integrated and no clear ownership exists for that integration

- Data volume or query complexity is exceeding what a single analytical tool can handle efficiently

- The organisation needs real-time or near-real-time data for operational decisions, not just daily batch refreshes

- The analytics function is growing and the data infrastructure needs to scale proportionally

Many organisations start with Fivetran handling ingestion and dbt handling transformation. These managed or low-maintenance tools reduce the need for dedicated data engineering until scale demands it.

Our data architecture practice provides data engineering consulting including pipeline design, warehouse architecture, and team augmentation — contact us to discuss your data engineering requirements.
