The union type is useful when a single column can hold values of more than one type. The value in each slot of the array can be of any of these types, which are named like struct fields and included in the metadata of the type.
Unlike other layouts, the union type does not have its own validity bitmap. Instead, each slot’s validity is determined by the children, which are composed to create the union array itself. There are two distinct union layouts that can be used when creating an array:
- dense
- sparse
Dense Unions
A dense union represents a mixed-type array with 5 bytes of overhead for each value: 1 byte for the type ID and 4 bytes for the offset. It contains the following structures:
- One child array for each type.
- A types buffer: A buffer of 8-bit signed integers, with each value representing the type ID for the corresponding slot, indicating which child array to read from for that slot.
- An offsets buffer: A buffer of signed 32-bit integers, indicating the offset into the corresponding child’s array for the type in each slot.
A dense union allows for the common use case of a union of structs with non-overlapping fields: Union<s1: Struct1, s2: Struct2, s3: Struct3, …>. Here’s an example of the layout for a union of type Union<f: float, i: int32> with the values [{f=1.2}, null, {f=3.4}, {i=5}]:
[Figure: memory layout of the dense union example]
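The dense layout can also be sketched in plain Python. This is only an illustrative model of the buffers described above, not Arrow library code: the names `children`, `types`, `offsets`, and `dense_union_get` are assumptions made for this example, `None` stands in for a null tracked by a child's validity bitmap, and the null slot is assumed to be stored in the `f` child.

```python
import struct
from array import array

# One child array per type, keyed by type ID. None models a null slot,
# since validity is tracked by the children rather than by the union.
children = {
    0: ("f", [1.2, None, 3.4]),  # float child holds three values
    1: ("i", [5]),               # int32 child holds one value
}

# Types buffer: one 8-bit signed type ID per union slot.
types = array("b", [0, 0, 0, 1])

# Offsets buffer: position inside the selected child for each union slot
# (array code "i" is a 32-bit signed integer on common platforms).
offsets = array("i", [0, 1, 2, 0])


def dense_union_get(slot):
    """Return the logical value stored in the given union slot."""
    name, child = children[types[slot]]
    value = child[offsets[slot]]
    return None if value is None else {name: value}


decoded = [dense_union_get(slot) for slot in range(len(types))]

# Per-slot overhead: one int8 type ID plus one int32 offset.
overhead_per_slot = struct.calcsize("<b") + struct.calcsize("<i")
```

Reading slot 3 looks up type ID 1 in the types buffer, then offset 0 in the offsets buffer, and so returns the first (and only) element of the `i` child.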
Sparse Unions
A sparse union has the same structure as a dense union, except that it has no offsets buffer: each child array is equal in length to the union itself, so a slot in the union is read from the same position in the selected child.
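To show how the two layouts relate, here is a minimal pure-Python sketch that expands dense-union buffers into sparse-union children. It is not Arrow library code: the function name `dense_to_sparse` is hypothetical, and using `None` as the filler for unselected slots is an arbitrary choice made for this example.

```python
def dense_to_sparse(types, offsets, children, fill=None):
    """Expand dense-union children so each one is as long as the union.

    types    -- type ID per union slot
    offsets  -- index into the selected child per union slot
    children -- mapping of type ID to that type's (compact) child values
    fill     -- placeholder written into slots a child does not own
    """
    length = len(types)
    sparse = {type_id: [fill] * length for type_id in children}
    for slot, (type_id, offset) in enumerate(zip(types, offsets)):
        # Copy the value to the same position as the union slot.
        sparse[type_id][slot] = children[type_id][offset]
    return sparse


# Dense buffers for [{f=1.2}, null, {f=3.4}, {i=5}]; None models the null.
dense_types = [0, 0, 0, 1]
dense_offsets = [0, 1, 2, 0]
dense_children = {0: [1.2, None, 3.4], 1: [5]}

sparse_children = dense_to_sparse(dense_types, dense_offsets, dense_children)
```

After the conversion the offsets are no longer needed, because every child has the same length as the types buffer.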
A sparse union takes up significantly more space than a dense union, but it has advantages for specific use cases. In many cases it is much easier to use with vectorized expression evaluation. In addition, any group of equal-length arrays can be interpreted as a union, because you only have to define the types buffer. When interpreting a sparse union, only the slot in the child indicated by the types buffer is considered; the unselected values are ignored and could be anything.
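The following plain-Python sketch illustrates the last two points: two existing equal-length arrays become a union once a types buffer is defined, and the values in unselected slots have no effect on the result. It is a simplified model rather than Arrow library code, and the names `f_child`, `i_child`, and `sparse_union_get` are assumptions made for this example.

```python
from array import array

# Two equal-length arrays. The placeholder values in unselected slots
# (-1.0, 111, 222, 333) are arbitrary and are never read.
f_child = [1.2, None, 3.4, -1.0]   # None models a null in the child
i_child = [111, 222, 333, 5]
children = [("f", f_child), ("i", i_child)]

# Defining the types buffer is all that is needed to view them as a union.
types = array("b", [0, 0, 0, 1])


def sparse_union_get(slot):
    """Read the union slot from the same position in the selected child."""
    name, child = children[types[slot]]
    value = child[slot]
    return None if value is None else {name: value}


decoded = [sparse_union_get(slot) for slot in range(len(types))]

# Overwriting an unselected slot does not change the decoded union.
i_child[0] = 999
decoded_again = [sparse_union_get(slot) for slot in range(len(types))]
```

The trade-off is visible in the child lengths: the `i` child stores four values to represent a single integer, whereas the dense layout would store one value plus an offset per slot.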