Arrow is designed to be easily passable between processes and the IPC libraries are the interfaces for passing around record batches.
Arrow’s IPC protocol defines three message types used for conveying information:
- **Schema
- RecordBatch
- DictionaryBatch
The series of binary payloads for these messages can be reconstructed into in-memory record batches without the need for memory copying. Each message consists of a FlatBuffers message for metadata and an optional message body. FlatBuffers is a highly efficient, cross-platform serialization library designed originally by Google and whose design allows for the message to be interpreted and accessed as is without the need to deserialize it into a different intermediate format first
The Arrow Stream format
A typical stream will consist of a Schema message first, followed by some number of DictionaryBatch and RecordBatch messages. That first Schema message doesn’t contain any data buffers, only metadata information about the message, such as type information. DictionaryBatch only shows-up if there are dictionary-encoded arrays in the schema.
The Arrow File format
The file format is just an extension of the streaming format, with a magic string indicator to start and end the file and a footer. The footer contains a copy of the schemaand the memory offsets and lengths for each block of data in the file, enabling random access to any record batch in the file. It is recommended to use the .arrow extension. A stream typically won’t be written to a file, but if it is, the recommended extension is .arrows. There are also registered Multipurpose Internet Mail Extension (MIME) media types for both the streaming and file format of Apache Arrow data, as follows:
- https://www.iana.org/assignments/media-types/application/vnd.apache.arrow.stream
- https://www.iana.org/assignments/media-types/application/vnd.apache.arrow.file
The Arrow file format is ideal to work using Memory Mapping. For example, by memory mapping a large Arrow file to read the values and process a column, we’re taking advantage of the lazy loading and zero-copy properties of memory mapping and the PyArrow library: it will only read the pages from the file that it needs when the corresponding virtual memory locations are accessed, then will the data be materialized and pulled into RAM.