Arrow is not a library, buy a collection of libraries. Tho share data efficiently between different languages and runtimes in the same processes Arrow provide:
- The C data interface fro sharing both schemas and data
- The C data interface fro streaming record batches
The Arrow project defines a small, stable set of C definitions that can be copied into a project to allow easily passing data across the boundaries of different languages and libraries. For languages and runtimes that aren’t C or C++, it should still be easy to use whatever foreign function interface (FFI) declarations correspond to your language or runtime.
Note
The C API is not intended to provide the higher-level operations exposed by Java, C++ or Go libraries. It’s just for passing the Arrow data itself.
Note
The C interface is only intended for sharing data between different components of the same processes, while for distinct processes we should use the Arrow IPC
Comparison between Arrow C and Arrow IPC
The C data interface is superior to the IPC format because of the following factors:
- Zero-copy by design
- Customizable release callback for resource lifetime management
- Minimal C definition that is easily copied to different code bases
- Data is exposed in Arrow’s logical format without any dependency on FlatBuffers
The IPC format is superior to the C data interface because of the following reasons:
- It allows for storage and persistence or communication across different processes and machines
- It does not require C data access
- It has room to add other features such as compression since it is a streamable form
The ArrowSchema structure
By separating the schema description and the data itself into two separate structures, the ABI allows producers of Arrow C data to avoid the cost of exporting and importing the schema for every single batch of data. In cases where it’s just a single batch of data, the same API call can return both structures at once, providing efficiency in both situations.
Important
The ArrowSchema structure is built in a way that it can be used also to define a single arrow::Field. The children array represents individual fields, while the parent ArrowSchema represents an entire schema
#define ARROW_FLAG_DICTIONARY_ORDERED 1
#define ARROW_FLAG_NULLABLE 2
#define ARROW_FLAG_MAP_KEYS_SORTED 4
struct ArrowSchema {
// Array type description
const char* format;
const char* name;
const char* metadata;
int64_t flags;
int64_t n_children;
struct ArrowSchema** children;
struct ArrowSchema* dictionary;
// Release callback
void (*release)(struct ArrowSchema*);
// Opaque producer-specific data
void* private_data;
};The format field
The format field describes the data type using a string. The strings are simple type, while for parametrized types their need to be parsed, and same for temporal type:
- B stands for Boolean, c for Int8 and s for int16
- w:42 for fixed-width 42-bytes binary, d:19,10 a decimal with precision 19 and scale 10
- tdD is a Date32, tts is a Time32, tDs is a duration(seconds)
- for dictionaries, it indicates the index type for the dictionary
- for nested types, i.e. list, +l, large list +L, struct +S
Tip
The format field describes only the top-level type’s format and nested data types will have children ArrowSchema.
The metadata field
The metadata field in represents a series of pairs of keys and values in a single binary string. It is encoded in this compact way:
- An int32 integer defining how many entries we have
- Repeated:
- An int32 defining the length of the key followed by the bytes for the key
- An int32 defining the length of the value followed by the bytes for the value
Extension type information is encoded in the metadata
Arrow define a user-defined extension-type as a data type, but the Arrow Schema defines the encoding and not the runtime type, therefore information about the extension would be encoded in the metadata using
ARROW:extension:nameandARROW:extension:metadatakeys
Representing a RecordBatch schema
The Record Batch schema can simply be represented as a StructArray whose children are the columns f the table/record batch. An similar representation is used by ArrowArray
The ArrowArray structure
The ArrowArray definition mimic the ArrowSchema, with the following notable aspects:
- There is a value
offsetthat describe how many elements are in the buffers before the start of the array. This is useful for slicing and reusing buffers - As you would expect from Memory management only the buffer in this array are described by the length and null count, while buffers of children array are not counted in
n_buffers
struct ArrowArray {
int64_t length;
int64_t null_count;
int64_t offset;
int64_t n_buffers;
int64_t n_children;
const void** buffers;
struct ArrowArray** children;
struct ArrowArray* dictionary;
// release callback
void (*release)(struct ArrowArray*);
// opaque producer related data
void* private_data;
};Using the C interface from Python
As described in PyArrow FFI, PyArrow FFI is the entry point for the C interface. When dealing with Go, even after building with CGO in c-shared build-mode, it is important to compile the extension using a build script like so:
import os
from pyarrow.cffi import ffi
ffi.set_source("_sample",
r"""
**#include "abi.h"**
**#include "libsample.h"**
""",
**library_dirs=[os.getcwd()],**
**libraries=["sample"],**
**extra_link_args=[f"-Wl,-rpath={os.getcwd()}"])**
ffi.cdef("""
void processBatch(uintptr_t, uintptr_t);
""")
if __name__ == "__main__":
**ffi.compile(verbose=True)**This will produce a library _sample.cpython-39-x86_64-linux-gnu.so that we can import and use:
from _sample import ffi, lib
import pyarrow.parquet as pq
tbl = pq.read_table('<path to file>
/yellow_tripdata_2015-01.parquet')
batches = tbl.to_batches(None)
c_schema = ffi.new('struct ArrowSchema*')
c_array = ffi.new('struct ArrowArray*')
ptr_schema = int(ffi.cast('uintptr_t', c_schema))
ptr_array = int(ffi.cast('uintptr_t', c_array))
batches[0].schema._export_to_c(ptr_schema)
batches[0]._export_to_c(ptr_array)
lib.processBatch(ptr_schema, ptr_array)Streaming Arrow data between Python and Go
The C streaming API is a higher-level abstraction built on the initial ArrowSchema and ArrowArray structures to make it easier to stream data within a process across API boundaries. The design of the stream is to expose a chunk-pulling API that pulls blocks of data from the source one at a time, all with the same schema. The structure is defined as follows:
struct ArrowArrayStream {
// callbacks for stream functionality
int (*get_schema)(struct ArrowArrayStream*,
struct ArrowSchema*);
int (*get_next)(struct ArrowArrayStream*,
struct ArrowArray*);
const char* (*get_last_error)(struct ArrowArrayStream*);
// Release callback and private data
void (*release)(struct ArrowArrayStream*);
void* private_data;
};There are three key functions:
- Getting the schema (0 means success)
- Getting the next chunk of data (o means success)
- Getting last error, can be called only if one of the two above doesn’t return 0
Important
The lifetime of the schema and the data chunks that are populated by the callback functions are not tied to the lifetime of ArrowArrayStream and should be released independently
Support for non-CPU device data
The support of the Arrow C data interface for non-CPU device data is detailed in ArrowDeviceArray and ArrowDeviceArray struct