Introduction

The flight Protocol is a protocol built on top of gRPC to transfer streaming data.

A Flight server typically handle the following requests:

  • Handshake: A simple request allowing for custom authentication logic and an implementation-defined session token if desired.
  • ListFlights: This gets a list of the available data streams on the server.
  • GetSchema: This retrieves the schema for a specific data stream.
  • GetFlightInfo: This retrieves a “plan” for retrieving a specific dataset, potentially describing the consumption of multiple streams of data. This request allows for custom serialized commands and metadata, such as a server’s particular application parameters.
  • PollFlightInfo: Similar to GetFlightInfo, it starts a query and gets information to poll the execution status. The purpose of this interface is to handle potentially long-running queries. GetFlightInfo doesn’t return until the request completes, while this should return as soon as possible and then get polled for status.
  • DoGet: This retrieves a data stream from the server.
  • DoPut: This sends a data stream from the client to the server.
  • ListActions: This retrieves the list of available implementation-defined action types.
  • DoAction: This performs a specified implementation-defined action and returns any results, such as general RPC function calls.
  • DoExchange: This opens a bidirectional stream between the server and client for both sending and receiving Arrow data along with metadata. This is particularly useful for offloading computation.

Horizontal scalability with Arrow Flight

GetFlightInfo method returns a list of endpoint objects. Each endpoint object contains a location and a Ticket object, describing the server to send a DoGet request with Ticket to retrieve a portion of the dataset. To get the whole dataset, you just need to consume all of the endpoints.

Parallel consume

With Flight, the Planner component of the server doesn’t really provide data, but only location of the data (and a Ticket), so that the client can consume data in parallel from multiple workers This pattern overcomes a typical limitation that most distributed query engines have because they operate like so:

  1. Requests are first routed through a coordinator node and then distributed out to worker nodes.
  2. The coordinator node then gathers up all the fragments of the results from the worker nodes and sends them back to the client.

When accessing a very large dataset, the data is ultimately being transported multiple times between different nodes on its way to the client. Additionally, the throughput of that coordinator node becomes a limiting factor in how fast you can get access to all of the data.

Distributed client

With the previous pattern, data shuffling to the coordinator and coordinator throughput are not a limiting factor anymore, but the client becomes a limiting factor. However, by introducing a coordinator, we can parallelize data consumption, effectively distributing the client responsibility among multiple workers

Location-aware client

An additional benefit of the Flight approach is that the FlightInfo is not limited to one location per ticket, but potentially multiple locations, so that the client can pick the location more suitable. For example, we can pick the closer geographical location

# Adding your business logic to Flight

The DoAction method of the protocol can be used to set some kind of session-specific parameter, or maybe they want to request a particular dataset to be kept in memory as a cache for future requests.

This method accepts an arbitrarily named action and optional binary data that can be interpreted in any way you want when implementing the server.

FlightSQL

The ODBC standard has a significant number of operations beyond just executing queries and fetching data rows, such as listing tables, catalogs, list keywords, available user defined functions and features.

To facilitate replacement of ODBC and JDBC which are row oriented with Flight, Dremio proposed a new extension called FlightSQL which is a framework to manage SQL queries and metadata using the Flight protocol. It adds a series of pre-defined objects, messages and types to enable the same type of metadata and command information that is available through ODBC