BlogData Engineering

Apache Spark Architecture: How Spark Works and When to Use It

Obed Tsimi
Obed Tsimi
Founder & Senior Tableau Architect
·December 20, 202612 min read

The internal architecture of Apache Spark — driver, executors, DAG scheduler, shuffle, and memory management — explained for data engineers making decisions about when distributed processing is worth the operational overhead versus simpler alternatives like duckdb, Polars, or SQL on a modern cloud warehouse.

Apache Spark is the dominant distributed processing framework for large-scale data engineering. Understanding how it works internally — not just how to use it — determines whether you use it appropriately, debug it effectively, and decide correctly when simpler tools are the better choice.

The Core Architecture

Spark operates on a driver-executor model. Every Spark application has exactly one driver and one or more executors.

### The Driver

The driver is the process that runs your Spark application code. It is responsible for:

- Parsing and planning the computation (converting your DataFrame API calls or SQL queries into a logical plan)

- Optimising the logical plan using the Catalyst query optimiser

- Converting the optimised logical plan into a physical plan — a DAG of stages and tasks

- Scheduling tasks on executors via the cluster manager

- Tracking task completion and handling failures

The driver is a single point of coordination. It does not process data itself; it directs executors that do. Driver memory is primarily used for the logical and physical plan, shuffle metadata, and result collection (when you call collect() on a DataFrame — a common cause of driver OOM errors).

### Executors

Executors are the worker processes that run tasks. Each executor is allocated a fixed amount of CPU cores and memory at application startup. Within an executor:

- Each core can run one task at a time

- Memory is divided between execution memory (used for shuffles, joins, aggregations) and storage memory (used for caching DataFrames)

- Task results small enough to fit in memory are returned to the driver; larger results are written to disk

Executor memory management is where most Spark performance problems originate. The default memory configuration is rarely optimal for real workloads.

### Cluster Manager

The cluster manager allocates executor resources to Spark applications. Spark supports three cluster managers: YARN (the Hadoop resource manager, common in on-premise deployments), Kubernetes (increasingly the default in cloud-native environments), and Spark Standalone (simple, useful for development). On managed Spark platforms like Databricks, the cluster manager is abstracted — you configure cluster size and autoscaling behaviour, and Databricks manages the underlying allocation.

DAG Execution Model

Spark does not execute operations immediately when you write them. Spark operations are either transformations (lazy — they build a computation graph but do not execute) or actions (eager — they trigger execution).

**Transformations** include: filter, select, withColumn, join, groupBy, union. Calling these builds the DAG but runs nothing.

**Actions** include: show, count, collect, write. Calling an action triggers the scheduler to execute the accumulated transformation DAG.

### Stages and Shuffles

Spark's scheduler partitions the DAG into stages. A stage boundary occurs whenever a shuffle is required — when data needs to be redistributed across partitions. Shuffle-requiring operations include: groupBy, orderBy, join (usually), repartition, distinct.

Shuffles are the most expensive operation in Spark. They require writing intermediate data to disk on executors, transferring it over the network to target executors, and re-reading it. A query plan with many shuffle operations will be slow; reducing shuffles is the primary focus of Spark performance optimisation.

Within a stage, tasks execute in parallel — one task per partition. The number of partitions determines the parallelism. Too few partitions underutilises available cores; too many creates scheduling overhead and small-file problems on output.

The Catalyst Optimiser and Tungsten Engine

Spark's performance advantage over raw MapReduce comes from two internal systems:

**Catalyst** is Spark's query optimiser. It analyses the logical plan you write, applies a set of optimisation rules (predicate pushdown, constant folding, join reordering, column pruning), and produces an optimised physical plan. The EXPLAIN command shows what Catalyst produces for a given query.

**Tungsten** is Spark's execution engine. It uses off-heap binary memory management (avoiding JVM garbage collection overhead), code generation (converting query plans to optimised JVM bytecode at runtime), and cache-aware computation algorithms. Databricks extended Tungsten with Photon, a vectorised native C++ execution engine that delivers significantly better performance on SQL workloads.

Memory Management

Spark uses two memory regions:

**Unified memory pool:** Managed memory shared between execution (shuffles, aggregations, sorting) and storage (DataFrame caching). The split is dynamic by default — execution can borrow from storage and vice versa, up to configurable limits.

**User memory:** JVM heap memory outside the unified pool, used for user data structures and UDF execution.

The most common memory-related failures:

**Executor OOM:** Execution memory insufficient for the shuffle or aggregation being performed. Fix: increase executor memory, reduce partition data size (increase num partitions), or rewrite the query to avoid the expensive operation.

**Driver OOM:** Caused by calling collect() on a large DataFrame (pulling all data to the driver), or by building large data structures on the driver. Fix: avoid collect() on large datasets; use DataFrame write operations instead.

**GC pressure:** Frequent JVM garbage collection pausing executors. Symptom: tasks taking much longer than expected, GC metrics high in Spark UI. Fix: increase executor memory, reduce object creation in UDFs, switch to Dataset API with typed operations.

Partitioning and Data Distribution

Spark's execution model is partition-parallel. Understanding partitioning is essential for both performance and correctness.

**Default parallelism:** For operations creating new DataFrames (range, read), Spark uses spark.default.parallelism or the number of input file splits. For shuffle operations, Spark uses spark.sql.shuffle.partitions (default 200 — frequently too high for small datasets, too low for large ones).

**Partition skew:** If data is not uniformly distributed across partitions, some tasks take much longer than others (stragglers), and the stage does not complete until the last task finishes. Partition skew is common in real-world datasets — for example, a join key where one value appears in 80% of rows. Fix: salt the join key, use skew hints (Spark 3.x), or enable adaptive query execution (AQE).

**Adaptive Query Execution (AQE):** Available since Spark 3.0, AQE dynamically re-optimises the query plan during execution based on runtime statistics — coalescing small shuffle partitions, converting sort-merge joins to broadcast joins when the smaller side is small enough, and handling partition skew. Enable it with spark.sql.adaptive.enabled=true. It resolves a significant proportion of partition-related performance issues automatically.

When Spark Is the Right Tool

Spark is the right tool when:

- **Data volume exceeds single-machine capacity:** When your data does not fit in memory (or on disk) on a single machine, distributed processing becomes necessary. For datasets under 100GB, single-machine tools are almost always faster and simpler.

- **Processing latency requirements allow batch execution:** Spark's startup overhead (cluster provisioning, driver initialisation, plan compilation) makes it unsuitable for sub-second query response. Spark is a batch processing engine.

- **Your transformation logic is complex:** Complex multi-stage transformations, ML feature engineering, and nested data processing benefit from Spark's flexible API. SQL-on-a-warehouse is simpler for straightforward aggregations.

When Spark Is Not the Right Tool

**For data under 100GB:** DuckDB, Polars, and pandas-on-a-large-VM process datasets of this size faster than Spark, with no cluster overhead, no JVM, and dramatically simpler debugging. Spark's distributed overhead is a net cost, not a net benefit, at this scale.

**For SQL analytics and BI:** A modern cloud warehouse (Snowflake, BigQuery, Redshift) is faster, cheaper, and easier to operate for SQL analytical queries than Spark. Spark SQL exists but warehouses are optimised for this workload.

**For streaming at moderate scale:** Kafka Streams and Flink are more operationally efficient for streaming workloads that do not require the Spark ecosystem. Structured Streaming is powerful but adds Spark's overhead to streaming.

**When your team does not have Spark expertise:** The skill investment to use Spark well is significant. A team that does not already know Spark will spend months learning it when dbt-on-Snowflake would have solved their problem in weeks.

Our data engineering consulting practice architects distributed data platforms including Spark-based pipelines and helps organisations decide when Spark is and is not the right investment — contact us to discuss your processing requirements.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →