Job Manager
The job manager is responsible for coordinating jobs. It handles job scheduling, resource allocation, task distribution, checkpointing.
Tip
There is typically one active Job manager but for high availability you can have standby job managers
Task Managers
Task Managers execute the actual data processing tasks. Each task managers contains multiple task slots similar to Spark Executor. Task managers perform the data transformations and maintain the local state
State Backend
Flink supports state backends of various types:
- RocksDB
- In-memory
Checkpointing
When a new message is received, the local state is updated. Periodically, a checkpoint barrier is injected into the data stream. When task managers receive the checkpoint barrier they stop processing new records and take a snapshot of the current state. Such a mechanism allow a consistent checkpoint across all tasks.
RocksDB
It is a common practice to enable RockDB for state storage for multiple reasons:
- Local Restart: In case of a Task Manager crash and restart on the same server, the store can be recovered from a persistent local copy on disk, since RocksDB save locally on disk
- Larger than RAM state: RocksDB state can be much larger than RAM, since RocksDB uses an LSM tree
- Performance tuning : RocksDB performance can be tuned, for example write buffers size or compression algorithms, to better suit specific use cases
High availability
When running Flink in highly available mode:
- Multiple Job Managers instances are executed, and a leader election process takes place upon failures, typically involving Zookeeper
- Tasks failed are re-allocated to different task managers