Whole-Stage Java Code Generation (Whole-Stage CodeGen) is a physical query optimization in Spark SQL that fuses multiple physical operators (as a subtree of plans that support code generation) together into a single Java function.
Whole-Stage Java Code Generation improves the execution performance of a query by collapsing a query tree into a single optimized function that eliminates virtual function calls and leverages CPU registers for intermediate data.
Columnar Execution
The Whole-Stage Code Generation framework is row-based.
The input RDDs of the physical operators of a whole-stage pipeline are all RDD[InternalRow] s. The output RDD of a whole-stage pipeline is also an RDD[InternalRow] . However, the input to a whole-stage code gen stage can be columnar (RDD[ColumnarBatch]).
If a physical operator supports columnar execution, it can’t at the same time support whole-stage-codegen.
CodegenSupport Physical Operators
Physical operators that support code generation extend CodegenSupport(and keep supportCodegen flag enabled).
ObjectType Whole-Stage Java Code Generation does not support (skips) physical operators that produce a domain object (the DataType of the output expression is ObjectType) as domain objects cannot be written into an UnsafeRow.