Variable-length binary view arrays (German Strings)

Researcher at the Technical University of Munich wrote a a paper that describes a system called UmbraDB, in which they outline a new-highly efficient in memory-string representation for columnar data. This representation has also been popularly named German-style strings and was later adopted by two extremely popular open source projects: Meta’s Velox engine and DuckDB. Finally, it was adapted as a data type for Arrow arrays, both to maintain zero-copy compatibility with these systems and to bring this representation to the entire ecosystem that relies on Arrow.

In this approach, the memory layout is composed of a variable number of buffers:

as usual, the first buffer is the validity bitmap
the second buffer use a 16-byte structure to represent the location and length of each view of bytes (called a view header).
the other buffers contain the binary data that’s too big to fit in the 16-bytes

In practice what happens is a short/small string optimization: for strings that are small enough, we store them inline in the view header, without requiring additional indirection. For larger string, we store a prefix we can use for some operations (prefix match) so we don’t need to look into the full screen if the prefix don’t match.

For example, the following array: ["Hello", "Penny the cat", "and welcome"] could be stored like in the following image:

How many data buffers?

There can be any number of data buffers, so this is only one possible representation

Key benefits of German Strings

The fact that the strings are fixed-length mean each item can be produced in parallel without the risk of overwriting in the second buffer (the view header buffer) locations that belong to other strings

Some operations are much faster on fixed-length value arrays (if all fits in 16 bytes, you can use SIMD)

Edmondo's Vault

Explorer

Variable-length binary view arrays (German Strings)

Graph View

Backlinks