Arrow is not a library, buy a collection of libraries. Tho share data efficiently between different languages and runtimes in the same processes Arrow provide:

  • The C data interface fro sharing both schemas and data
  • The C data interface fro streaming record batches

The Arrow project defines a small, stable set of C definitions that can be copied into a project to allow easily passing data across the boundaries of different languages and libraries. For languages and runtimes that aren’t C or C++, it should still be easy to use whatever foreign function interface (FFI) declarations correspond to your language or runtime.

Note

The C API is not intended to provide the higher-level operations exposed by Java, C++ or Go libraries. It’s just for passing the Arrow data itself.

Note

The C interface is only intended for sharing data between different components of the same processes, while for distinct processes we should use the Arrow IPC

Comparison between Arrow C and Arrow IPC

The C data interface is superior to the IPC format because of the following factors:

  • Zero-copy by design
  • Customizable release callback for resource lifetime management
  • Minimal C definition that is easily copied to different code bases
  • Data is exposed in Arrow’s logical format without any dependency on FlatBuffers

The IPC format is superior to the C data interface because of the following reasons:

  • It allows for storage and persistence or communication across different processes and machines
  • It does not require C data access
  • It has room to add other features such as compression since it is a streamable form

The ArrowSchema structure

By separating the schema description and the data itself into two separate structures, the ABI allows producers of Arrow C data to avoid the cost of exporting and importing the schema for every single batch of data. In cases where it’s just a single batch of data, the same API call can return both structures at once, providing efficiency in both situations.

Important

The ArrowSchema structure is built in a way that it can be used also to define a single arrow::Field. The children array represents individual fields, while the parent ArrowSchema represents an entire schema

#define ARROW_FLAG_DICTIONARY_ORDERED 1
#define ARROW_FLAG_NULLABLE 2
#define ARROW_FLAG_MAP_KEYS_SORTED 4
struct ArrowSchema {
    // Array type description
    const char* format;
    const char* name;
    const char* metadata;
    int64_t flags;
    int64_t n_children;
    struct ArrowSchema** children;
    struct ArrowSchema* dictionary;
    // Release callback
    void (*release)(struct ArrowSchema*);
    // Opaque producer-specific data
    void* private_data;
};

The format field

The format field describes the data type using a string. The strings are simple type, while for parametrized types their need to be parsed, and same for temporal type:

  • B stands for Boolean, c for Int8 and s for int16
  • w:42 for fixed-width 42-bytes binary, d:19,10 a decimal with precision 19 and scale 10
  • tdD is a Date32, tts is a Time32, tDs is a duration(seconds)
  • for dictionaries, it indicates the index type for the dictionary
  • for nested types, i.e. list, +l, large list +L, struct +S

Tip

The format field describes only the top-level type’s format and nested data types will have children ArrowSchema.

The metadata field

The metadata field in represents a series of pairs of keys and values in a single binary string. It is encoded in this compact way:

  • An int32 integer defining how many entries we have
  • Repeated:
    • An int32 defining the length of the key followed by the bytes for the key
    • An int32 defining the length of the value followed by the bytes for the value

Extension type information is encoded in the metadata

Arrow define a user-defined extension-type as a data type, but the Arrow Schema defines the encoding and not the runtime type, therefore information about the extension would be encoded in the metadata using ARROW:extension:name and ARROW:extension:metadata keys

Representing a RecordBatch schema

The Record Batch schema can simply be represented as a StructArray whose children are the columns f the table/record batch. An similar representation is used by ArrowArray

The ArrowArray structure

The ArrowArray definition mimic the ArrowSchema, with the following notable aspects:

  • There is a value offset that describe how many elements are in the buffers before the start of the array. This is useful for slicing and reusing buffers
  • As you would expect from Memory management only the buffer in this array are described by the length and null count, while buffers of children array are not counted in n_buffers
struct ArrowArray {
    int64_t length;
    int64_t null_count;
    int64_t offset;
    int64_t n_buffers;
    int64_t n_children;
    const void** buffers;
    struct ArrowArray** children;
    struct ArrowArray* dictionary;
    // release callback
    void (*release)(struct ArrowArray*);
    // opaque producer related data
    void* private_data;
};

Using the C interface from Python

As described in PyArrow FFI, PyArrow FFI is the entry point for the C interface. When dealing with Go, even after building with CGO in c-shared build-mode, it is important to compile the extension using a build script like so:

import os
from pyarrow.cffi import ffi
ffi.set_source("_sample",
    r"""
    **#include "abi.h"**
    **#include "libsample.h"**
    """,
    **library_dirs=[os.getcwd()],**
    **libraries=["sample"],**
    **extra_link_args=[f"-Wl,-rpath={os.getcwd()}"])**
ffi.cdef("""
    void processBatch(uintptr_t, uintptr_t);
    """)
if __name__ == "__main__":
    **ffi.compile(verbose=True)**

This will produce a library _sample.cpython-39-x86_64-linux-gnu.so that we can import and use:

from _sample import ffi, lib
 
import pyarrow.parquet as pq
tbl = pq.read_table('<path to file>
        /yellow_tripdata_2015-01.parquet')
batches = tbl.to_batches(None)
 
					c_schema = ffi.new('struct ArrowSchema*')
c_array = ffi.new('struct ArrowArray*')
ptr_schema = int(ffi.cast('uintptr_t', c_schema))
ptr_array = int(ffi.cast('uintptr_t', c_array))
batches[0].schema._export_to_c(ptr_schema)
batches[0]._export_to_c(ptr_array)
					
lib.processBatch(ptr_schema, ptr_array)

Streaming Arrow data between Python and Go

The C streaming API is a higher-level abstraction built on the initial ArrowSchema and ArrowArray structures to make it easier to stream data within a process across API boundaries. The design of the stream is to expose a chunk-pulling API that pulls blocks of data from the source one at a time, all with the same schema. The structure is defined as follows:

struct ArrowArrayStream {
    // callbacks for stream functionality
    int (*get_schema)(struct ArrowArrayStream*,
        struct ArrowSchema*);
    int (*get_next)(struct ArrowArrayStream*,
        struct ArrowArray*);
    const char* (*get_last_error)(struct ArrowArrayStream*);
    // Release callback and private data
    void (*release)(struct ArrowArrayStream*);
    void* private_data;
};

There are three key functions:

  • Getting the schema (0 means success)
  • Getting the next chunk of data (o means success)
  • Getting last error, can be called only if one of the two above doesn’t return 0

Important

The lifetime of the schema and the data chunks that are populated by the callback functions are not tied to the lifetime of ArrowArrayStream and should be released independently

Support for non-CPU device data

The support of the Arrow C data interface for non-CPU device data is detailed in ArrowDeviceArray and ArrowDeviceArray struct