A record batch is a collection of equal-length Arrow arrays along with a schema. Often, when reading in and manipulating data, we get that data in chunks and then want to assemble it to treat it as a single large table, as shown in the following diagram:

One way to do this would be to simply allocate enough space to hold the full table and then copy the columns of each record batch into the allocated space. That way, we end up with the finished table as a single cohesive record batch in memory but: -It’s potentially very expensive to allocate an entirely new large chunk of memory for each column and copy all the data over. -What if we get another record batch of data? We would have to do this again to accommodate the – now larger – table each time we get more data.
Chunked arrays
A chunked array is just a thin wrapper around a group of Arrow arrays of the same data type. This way, we can incrementally build up an array, or even a whole table, efficiently without constantly having to allocate larger and larger chunks of memory and copying data.
An Arrow table holds one or more chunked arrays and a schema, allowing us to conceptually treat all the data as if it were a single contiguous table of data:
- without having to pay the costs to frequently reallocate and copy the data
- losing memory locality
To reduce the negative impact of losing locality, we try to get chunks as large as possible, balancing the cost of the allocations and copies against the cost of processing non-contiguous data.
Chunk sizes alignment
The chunk sizes of the columns don’t need to be aligned with each other. So, Field 0 could be distributed into chunks of 100 elements while Field 1 is distributed into chunks of 250 elements, and so on; the underlying point being that this enables columns, and by extension tables, to be built in whichever method allows for the fewest copies of the data while enabling easy interactions.