Job Manager

The job manager is responsible for coordinating jobs. It handles job scheduling, resource allocation, task distribution, checkpointing.

Tip

There is typically one active Job manager but for high availability you can have standby job managers

Task Managers

Task Managers execute the actual data processing tasks. Each task managers contains multiple task slots similar to Spark Executor. Task managers perform the data transformations and maintain the local state

State Backend

Flink supports state backends of various types:

  • RocksDB
  • In-memory

Checkpointing

When a new message is received, the local state is updated. Periodically, a checkpoint barrier is injected into the data stream. When task managers receive the checkpoint barrier they stop processing new records and take a snapshot of the current state. Such a mechanism allow a consistent checkpoint across all tasks.

RocksDB

It is a common practice to enable RockDB for state storage for multiple reasons:

  • Local Restart: In case of a Task Manager crash and restart on the same server, the store can be recovered from a persistent local copy on disk, since RocksDB save locally on disk
  • Larger than RAM state: RocksDB state can be much larger than RAM, since RocksDB uses an LSM tree
  • Performance tuning : RocksDB performance can be tuned, for example write buffers size or compression algorithms, to better suit specific use cases

High availability

When running Flink in highly available mode:

  • Multiple Job Managers instances are executed, and a leader election process takes place upon failures, typically involving Zookeeper
  • Tasks failed are re-allocated to different task managers