Run-encoded arrays

The run-end encoding (REE) data type is a variation on a common encoding method called run-length encoding (RLE). These methods for encoding data are particularly useful for data containing long sequences of the same value, referred to as a run.

The in-memory layout of an REE array is the following:

Unlike other data types, REE arrays do not have any buffers of their own.
There are two child arrays:
- run_ends: Either 16, 32, or 64-bit signed integers. Each value is the logical array index where a run of values ends. Nulls are not allowed.
- Values: The values for the runs at the corresponding indexes of the run_ends array.

For example, the following array [1.0, 1.0, 1.0, 1.0, null, null, 2.0, 3.0, 3.0] has two sequence of 1.0 and 3.0 and would be encoded like so:

Arrow does not adopt RLE because you typically encode the explicit lengths of each run, making inefficient random access to the array for individual indexes and breaking the O(1) efficiently in arrow. It instead adopts REE where each value in run_ends contains the accumulated sum of all length for the runs to that index: if you need the length of an individual run, you can simply subtract two adjacent run ends to compute it.

if you want to determine the value of the sample array for index 5

you can perform a binary search for the first value larger than the desired index. This would be the value 6 in index 1 of the run ends child array.
you would then check the validity bitmap index 1 and find it is 0, that means the value is null

Tip

Although the child array can contain nulls and therefore one could use a single buffer as in the Variable-length binary arrays, using this approach makes easier to maintain the association of what lengths mean. A null value is handled by having a run with its corresponding value in the child array being null.

Edmondo's Vault

Explorer

Run-encoded arrays

Graph View

Backlinks