Apache Flink fault tolerance mechanism


Original address: flink-release-1.2 Data Streaming Fault Tolerance


Apache Flink provides a fault tolerance mechanism that can recover a data stream application to a consistent state. It ensures that, in case of failure, each record affects the state exactly once; this guarantee can also be relaxed to at least once.

The fault tolerance mechanism works by continuously creating snapshots of the distributed data stream. For stream applications with small state, these snapshots are very lightweight and can be taken frequently with little impact on performance. The state of the streaming application is stored in a configurable location, such as the master node or HDFS.

In case of a failure (machine, network, software, etc.), Flink stops the distributed dataflow. The system then restarts all operators and resets them to the most recent successful checkpoint, and the input streams are rewound to the positions recorded in that state snapshot. Any record processed as part of the restarted parallel dataflow is guaranteed not to have been part of the checkpointed state.

Note: for the fault tolerance mechanism to work, the data source (such as a message queue or broker) must be able to replay the data stream. Apache Kafka has this capability, and Flink's Kafka connector takes advantage of it.

Note: since Flink's checkpoints are implemented through distributed snapshots, we use the terms snapshot and checkpoint interchangeably.


The core of Flink's fault tolerance mechanism is to continuously create consistent snapshots of the distributed data stream and its operator state. These snapshots act as consistent checkpoints to which the system can roll back when it encounters a failure. The paper Lightweight Asynchronous Snapshots for Distributed Dataflows describes Flink's snapshotting mechanism; it is inspired by the Chandy-Lamport distributed snapshot algorithm and tailored to Flink's execution model.


One of the core concepts of Flink's distributed snapshots is the stream barrier. Barriers are injected into the data stream and flow downstream with the data as part of the stream. Barriers never overtake records, and the flow remains strictly ordered. A barrier divides the data stream into two parts: records that enter the current snapshot and records that enter the next one. Each barrier carries a snapshot ID, and the records preceding a barrier belong to that snapshot. Barriers do not interrupt the flow of data and are therefore very lightweight. Barriers from different snapshots can be in the stream at the same time, which means multiple snapshots may be in progress concurrently.
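The partitioning effect of barriers can be illustrated with a small Python sketch (conceptual only, not Flink code): barriers interleave with records, and every record belongs to the snapshot whose barrier has not yet passed it.

```python
# Conceptual sketch: barriers split a stream into consecutive snapshots.
# A barrier is modeled as the tuple ("barrier", snapshot_id).

def partition_by_barrier(stream):
    """Group records by the snapshot they belong to.

    A record belongs to snapshot n if it arrives before barrier n.
    """
    snapshots = {}
    current = 1  # records before barrier 1 belong to snapshot 1
    for element in stream:
        if isinstance(element, tuple) and element[0] == "barrier":
            current = element[1] + 1  # subsequent records -> next snapshot
        else:
            snapshots.setdefault(current, []).append(element)
    return snapshots

stream = ["a", "b", ("barrier", 1), "c", ("barrier", 2), "d", "e"]
print(partition_by_barrier(stream))
# {1: ['a', 'b'], 2: ['c'], 3: ['d', 'e']}
```

Note how barriers 1 and 2 can both be "in flight" in the same stream, which is what allows several snapshots to be in progress at once.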

Barriers are injected at the data sources. When the barrier for snapshot n is injected, the system records the current position Sn in the source stream. In Apache Kafka, for example, this is the offset of the last record in a partition. The position Sn is reported to a module called the checkpoint coordinator (Flink's JobManager).

The barriers then flow downstream. When an intermediate operator has received the barrier for snapshot n from all of its input streams, it emits a barrier for snapshot n into all of its output streams. When a sink operator (the end of the streaming DAG) has received barrier n from all of its input streams, it acknowledges snapshot n to the checkpoint coordinator. After all sinks have acknowledged the snapshot, it is marked as complete.
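The coordinator's bookkeeping can be sketched as follows (a hypothetical model, not Flink's actual JobManager code): a snapshot is marked complete only once every sink has acknowledged it.

```python
# Conceptual sketch: the checkpoint coordinator (Flink's JobManager plays
# this role) marks snapshot n complete once every sink has acknowledged it.

class CheckpointCoordinator:
    def __init__(self, num_sinks):
        self.num_sinks = num_sinks
        self.acks = {}          # snapshot id -> set of sinks that acked
        self.completed = set()  # snapshot ids marked as complete

    def acknowledge(self, snapshot_id, sink_id):
        self.acks.setdefault(snapshot_id, set()).add(sink_id)
        if len(self.acks[snapshot_id]) == self.num_sinks:
            self.completed.add(snapshot_id)

coordinator = CheckpointCoordinator(num_sinks=2)
coordinator.acknowledge(1, "sink-0")
assert 1 not in coordinator.completed   # still waiting for sink-1
coordinator.acknowledge(1, "sink-1")
assert 1 in coordinator.completed       # all sinks confirmed snapshot 1
```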

Operators that receive more than one input stream need to align the input streams on the snapshot barriers (see the figure above):

  • As soon as the operator receives barrier n from one input stream, it cannot process any further records from that stream until it has also received barrier n from its other input streams. Otherwise, records belonging to snapshot n would be mixed with records belonging to snapshot n + 1.

  • Streams from which barrier n has already been received are temporarily set aside: records arriving on them are not processed, but placed in an input buffer.

  • Once barrier n has been received from the last input stream, the operator emits all pending output records, followed by its own barrier for snapshot n.

  • After that, the operator resumes processing all input streams, consuming the records in the input buffer before reading new records from the streams.
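The alignment steps above can be sketched as a small simulation (a conceptual model, not Flink's implementation; it assumes inputs are read round-robin, which stands in for arbitrary network interleaving):

```python
# Conceptual sketch of barrier alignment for a multi-input operator:
# once a barrier arrives on one input, further records on that input are
# buffered until the matching barrier arrives on all other inputs.

def align(inputs):
    """inputs: dict of stream name -> list of records / ('barrier', n)."""
    processed = []
    buffered = []        # records set aside for the next snapshot
    blocked = set()      # streams whose barrier has already arrived
    cursors = {name: 0 for name in inputs}
    while any(cursors[n] < len(s) for n, s in inputs.items()):
        for name, stream in inputs.items():
            if cursors[name] >= len(stream):
                continue
            element = stream[cursors[name]]
            cursors[name] += 1
            if isinstance(element, tuple) and element[0] == "barrier":
                blocked.add(name)
                if blocked == set(inputs):      # barrier seen on all inputs:
                    processed.append(("snapshot", element[1]))
                    processed.extend(buffered)  # resume with buffered data
                    buffered.clear()
                    blocked.clear()
            elif name in blocked:
                buffered.append(element)        # after this stream's barrier
            else:
                processed.append(element)       # belongs to current snapshot
    return processed

streams = {"left": ["a", ("barrier", 1), "b"],
           "right": ["x", "y", ("barrier", 1)]}
print(align(streams))
# ['a', 'x', 'y', ('snapshot', 1), 'b']
```

Record "b" follows the left barrier, so it is held back until both barriers have arrived; everything before the snapshot marker belongs to snapshot 1.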


Any state that an operator contains must be included in the snapshot. Operator state comes in several forms:

  • User-defined state: state directly created or modified by a transformation function (for example, map() or filter()). User-defined state can be a simple variable in a Java object inside the transformation function, or key/value state associated with the function. See State in Streaming Applications.

  • System state: state that the operator caches as part of its computation. A typical example is window buffers, in which the system collects the records belonging to a window until the window is evaluated and its result emitted.

After receiving the barriers from all of its input streams, and before emitting the barriers to its output streams, the operator takes a snapshot of its state. At that point, all state updates from records before the barriers have been applied, and no updates depending on records after the barriers are included. Because snapshots can be very large, the back-end storage system is configurable. By default, snapshots are stored in the JobManager's memory, but a production system should configure a reliable distributed store (such as HDFS). After the state has been stored, the operator acknowledges its checkpoint and emits the barriers into its output streams.

The snapshot now contains:

  • For each parallel data source: the offset/position in the stream at the time the snapshot was taken

  • For each stateful operator: a pointer to the state stored as part of the snapshot
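A completed snapshot is therefore only metadata: source offsets plus pointers into the state store. A minimal sketch (all names and values below are hypothetical):

```python
# Conceptual sketch: a snapshot records, per parallel source, the stream
# offset at snapshot time, and per stateful operator a pointer to its
# stored state (not the state itself).

from dataclasses import dataclass, field

@dataclass
class Snapshot:
    snapshot_id: int
    source_offsets: dict = field(default_factory=dict)  # source -> offset
    state_pointers: dict = field(default_factory=dict)  # operator -> storage path

snap = Snapshot(
    snapshot_id=7,
    source_offsets={"kafka-partition-0": 5821, "kafka-partition-1": 5790},
    state_pointers={"window-op": "hdfs:///checkpoints/7/window-op"},
)
assert snap.source_offsets["kafka-partition-0"] == 5821
```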

Exactly Once vs. At Least Once

The alignment operation may add latency to the streaming program. Usually this extra latency is on the order of a few milliseconds, but we have also seen cases where it increased noticeably. For applications that require consistently low latency (a few milliseconds) for all records, Flink provides a switch to skip the alignment during checkpoints: the operator takes its snapshot as soon as it sees a barrier, instead of waiting for the barriers from the other inputs.

When alignment is skipped, the operator keeps processing all input even after a barrier for checkpoint n arrives. This means that the operator processes records belonging to checkpoint n + 1 before the snapshot for checkpoint n is taken. On recovery, those records are therefore duplicated: they are included in the state of checkpoint n and are also replayed from the sources afterwards.
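The duplication can be made concrete with a small simulation (a conceptual model, assuming round-robin reads, not Flink code): without alignment, a record that follows its stream's barrier can still end up in checkpoint n's state.

```python
# Conceptual sketch: a two-input operator with alignment skipped processes
# records even after one input's barrier, and snapshots its state only when
# barriers have arrived on both inputs (at-least-once mode).

def process_at_least_once(streams):
    state = []
    seen_barriers = set()
    snapshot = None
    cursors = {name: 0 for name in streams}
    while any(cursors[n] < len(s) for n, s in streams.items()):
        for name, stream in streams.items():
            if cursors[name] >= len(stream):
                continue
            element = stream[cursors[name]]
            cursors[name] += 1
            if element == "barrier":
                seen_barriers.add(name)
                if seen_barriers == set(streams):
                    snapshot = list(state)  # state already includes "b"!
            else:
                state.append(element)       # processed even after a barrier
    return snapshot

streams = {"left": ["a", "barrier", "b"],
           "right": ["x", "y", "barrier"]}
# "b" follows the left barrier, so it belongs to checkpoint n + 1, yet it
# lands in checkpoint n's snapshot; after recovery the left source replays
# from its barrier position, so "b" is processed a second time.
print(process_at_least_once(streams))
# ['a', 'x', 'y', 'b']
```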

Note: alignment only happens for operators with multiple inputs (joins) or multiple upstream senders (after a repartition/shuffle). Therefore, dataflows consisting only of embarrassingly parallel operations such as map(), flatMap(), and filter() actually provide exactly-once guarantees even in at-least-once mode.

Asynchronous State Snapshots

Note that the mechanism described above implies that an operator stops processing input records while it stores its snapshot in the back end. This synchronous step introduces a delay every time a snapshot is taken.

Instead, the operator can keep processing data while the snapshot is written asynchronously in the background. To do this, the operator must be able to produce a state object that subsequent modifications do not affect, for example the copy-on-write data structures used in RocksDB.
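The copy-on-write idea can be sketched in a few lines (a toy model of the idea, not RocksDB's implementation): a snapshot keeps an immutable view while ongoing processing mutates a fresh copy.

```python
# Conceptual sketch of copy-on-write state: a snapshot view taken at the
# barrier stays stable while the operator keeps updating its state.

class CowState:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # copy-on-write: never mutate a dict a snapshot may still reference
        new_data = dict(self._data)
        new_data[key] = value
        self._data = new_data

    def snapshot(self):
        return self._data   # safe to hand to a background writer thread

state = CowState()
state.put("count", 1)
view = state.snapshot()     # taken when the barrier arrives
state.put("count", 2)       # processing continues during the async snapshot
assert view["count"] == 1   # the snapshot view is unaffected
assert state.snapshot()["count"] == 2
```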

When the input barriers are received, the operator makes such a copy of its state for the asynchronous snapshot, immediately emits the barrier to its output streams, and continues normal stream processing. Once the background snapshot completes, the operator acknowledges the checkpoint to the checkpoint coordinator (the JobManager). A checkpoint is now complete only when all sinks have received the barriers and all stateful operators have confirmed that their state backup is finished (which may happen after the sinks receive the barriers).

For more on state snapshots, see state backends.


The recovery procedure under this fault tolerance mechanism is straightforward: upon a failure, Flink selects the most recently completed checkpoint k. The system then redeploys the entire distributed dataflow and resets the state of every operator to the state of checkpoint k. The sources are reset to read the stream from position Sk. In Apache Kafka, for example, this means telling the consumer to start fetching from offset Sk.
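Recovery amounts to two resets, which a small sketch makes explicit (hypothetical structures, not Flink's runtime): rewind each source to its recorded offset and restore each operator's state from the checkpoint.

```python
# Conceptual sketch of recovery from checkpoint k: rewind every source to
# its recorded offset Sk and restore every operator's snapshotted state.

def recover(checkpoint, sources, operators):
    for name, offset in checkpoint["source_offsets"].items():
        sources[name]["position"] = offset        # rewind to Sk
    for name, state in checkpoint["operator_state"].items():
        operators[name]["state"] = dict(state)    # restore snapshot state

checkpoint = {
    "source_offsets": {"kafka-0": 42},
    "operator_state": {"counter": {"count": 17}},
}
sources = {"kafka-0": {"position": 99}}   # had read ahead before the failure
operators = {"counter": {"state": {"count": 23}}}
recover(checkpoint, sources, operators)
assert sources["kafka-0"]["position"] == 42
assert operators["counter"]["state"] == {"count": 17}
```

The records between offset 42 and 99 are replayed after recovery; by the earlier argument, none of their effects were part of checkpoint k's state.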

If the state was snapshotted incrementally, the operator first recovers from the latest full snapshot and then applies the series of incremental updates on top of that state.
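For key/value state, incremental recovery reduces to replaying deltas over the last full snapshot, sketched here (a simplified model of the idea):

```python
# Conceptual sketch of incremental-snapshot recovery: start from the latest
# full snapshot, then apply each incremental delta in order.

def restore(full_snapshot, deltas):
    state = dict(full_snapshot)
    for delta in deltas:          # deltas ordered oldest -> newest
        state.update(delta)       # each delta holds only the changed keys
    return state

full = {"a": 1, "b": 2}
deltas = [{"b": 3}, {"c": 4}]
print(restore(full, deltas))
# {'a': 1, 'b': 3, 'c': 4}
```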

Operator Snapshot Implementation

An operator snapshot consists of two parts: a synchronous part and an asynchronous part.

Operators and state backends provide their snapshots as a Java FutureTask. That task encapsulates the state in which the synchronous part has completed and the asynchronous part is still pending. The asynchronous part is executed by a background thread.

Operators that snapshot purely synchronously return an already completed FutureTask. If an asynchronous part needs to run, it is executed in the run() method of that FutureTask.
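This two-phase pattern can be sketched with Python futures standing in for Java's FutureTask (the helper names and in-memory store below are hypothetical, not Flink API):

```python
# Conceptual sketch: the synchronous part captures the state; the
# asynchronous part persists it, either inline (returning an already
# completed future) or in a background thread.

from concurrent.futures import Future, ThreadPoolExecutor

STORAGE = {}   # stands in for the back-end store

def persist(checkpoint_id, captured):
    STORAGE[checkpoint_id] = captured      # the asynchronous part of the work
    return checkpoint_id

def snapshot(operator_state, checkpoint_id, executor=None):
    captured = dict(operator_state)        # synchronous part: capture state
    if executor is None:                   # fully synchronous operator:
        done = Future()                    # return an already completed future
        done.set_result(persist(checkpoint_id, captured))
        return done
    # otherwise run the pending part in a background thread; a future that
    # has not started yet can be cancelled to release its resources
    return executor.submit(persist, checkpoint_id, captured)

with ThreadPoolExecutor(max_workers=1) as pool:
    fut = snapshot({"count": 5}, checkpoint_id=3, executor=pool)
    assert fut.result() == 3               # wait for the background write
assert STORAGE[3] == {"count": 5}
```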

The tasks can be cancelled so that streams and other resource-consuming handles are released.