Logical Plan

Spark queries begin with a logical plan, which represents the query as a high-level directed acyclic graph (DAG) of transformations (e.g., map, filter, groupBy). This plan describes what the query needs to compute but not how it will be executed. The Catalyst optimizer rewrites the logical plan to apply performance-enhancing optimizations, such as:

  • Predicate Pushdown: Reducing the amount of data processed by pushing filters closer to the data source.
  • Column Pruning: Selecting only the columns required by the query.

Physical Plan

After optimization, the logical plan is translated into one or more candidate physical plans that define how the query will be executed. A cost model evaluates these plans to select the one with the lowest execution cost. With adaptive query optimization (AQO), Spark refines the physical plan during execution by collecting runtime statistics (e.g., cardinality, partition sizes) to adjust operations dynamically and improve efficiency.

Query Stages and Distribution

Query Stages

The physical plan is divided into query stages, which are independent units of execution. These stages are separated by shuffle boundaries, points where data must be reorganized (e.g., for groupBy or join operations). Each stage can execute in parallel and represents a contiguous set of transformations that do not require inter-stage communication.

Task Distribution

Within a query stage, the data is partitioned, and each partition is processed by a separate task. A task is the smallest unit of execution in Spark and corresponds to one partition of a physical plan node (e.g., map, reduce). The Spark driver coordinates execution by assigning tasks to worker nodes (executors), which process the data in parallel.

Inter-Stage Communication

Shuffling and Results

When one stage depends on the results of another (e.g., after a shuffle boundary), intermediate results are materialized as partitioned data stored in memory or disk (e.g., shuffle files). Shuffling redistributes data across workers to align partitions for the next stage’s computations, ensuring correctness in operations like joins and aggregations.

Stateful Transformations

Batch Mode

Stateful transformations in batch mode (e.g., groupByKey, reduceByKey) require accumulating intermediate results for the same key. During execution:

  • Intermediate results are materialized in shuffle files to ensure they are accessible across query stages.
  • The state itself is ephemeral, existing only during the execution of the query.

Streaming Mode

In streaming mode, state persists across micro-batches to support incremental computation. For stateful transformations (e.g., windowed aggregations):

  • State is stored in an external state store (e.g., RocksDB, HDFS, S3).
  • Tasks interact with the state store using Spark’s StateStore API to:
    1. Read state (e.g., partial aggregates or window buffers).
    2. Update state with records from the current micro-batch.
    3. Persist updated state back to the state store for fault tolerance. Tasks themselves remain stateless, delegating state management to the execution engine, which ensures consistency and durability through the state store.