Flink learning — a detailed explanation of incremental checkpoints for managing large state

Time: 2021-2-24

The article is reproduced from: https://ververica.cn/develope…
Author: Qiu Congxian (Shan Zhi)

Apache Flink is a stateful stream processing framework. State here is the information an operator keeps in memory about the records it has already processed, to be used when processing subsequent records. State is essential in many complex stream processing scenarios, for example (a short code sketch follows this list):

  • Saving all historical records in order to find a certain pattern among them
  • Saving all records from the last minute, to be aggregated into per-minute results
  • Saving the current model parameters during model training
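For instance, here is a minimal sketch of keyed state in Flink’s Java API (class and field names are illustrative): a ValueState that counts the events seen per key, which is included in every checkpoint and restored after a failure.

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Counts events per key; the count lives in Flink-managed keyed state,
// so it is part of every checkpoint and is restored after a failure.
public class CountPerKey extends RichFlatMapFunction<String, Long> {
    private transient ValueState<Long> count;

    @Override
    public void open(Configuration parameters) {
        count = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Long.class));
    }

    @Override
    public void flatMap(String value, Collector<Long> out) throws Exception {
        Long current = count.value(); // null for the first event of this key
        long updated = (current == null) ? 1L : current + 1L;
        count.update(updated);
        out.collect(updated);
    }
}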

A stateful stream processing framework must have good fault tolerance to be useful in a production environment. Fault tolerance here means that the final result is neither lost nor duplicated, whether the failure is a hardware fault or a program exception.

Flink’s fault tolerance has been a powerful feature from the very beginning: in case of failure, it guarantees that results are neither lost nor duplicated, with little impact on the performance of normal processing.

At the core of this is the checkpoint mechanism, which Flink uses to guarantee state consistency. In Flink, a checkpoint is a global, asynchronous snapshot of the job state, triggered periodically and persisted to a durable storage system (usually a distributed file system). After a failure, Flink recovers from the most recent snapshot. Some users’ job state has reached the GB or even TB level; taking a full checkpoint of such a large state is very time- and resource-consuming. Therefore, the incremental checkpoint mechanism was introduced in Flink 1.3.
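For reference, periodic checkpointing is enabled per job; a minimal sketch, with an arbitrary 60-second interval:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// take a checkpoint every 60 seconds (interval chosen here only for illustration)
env.enableCheckpointing(60000);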

Before incremental checkpoints, every checkpoint in Flink contained the complete state of the job. After observing that the state typically does not change very much between checkpoints, we added support for incremental checkpoints: an incremental checkpoint contains only the difference between the previous checkpoint and the current one (that is, the increment).

For jobs with very large state, incremental checkpoints can improve performance significantly. Production users have reported that for a TB-scale job, the overall checkpoint time dropped from 3 minutes to 30 seconds after enabling incremental checkpoints. These savings come mainly from no longer having to write the full state to the persistent storage system at every checkpoint.

How to use

Currently, incremental checkpoints can only be used with the RocksDB state backend. Flink relies on RocksDB’s internal backup mechanism to generate the checkpoint files. Flink automatically cleans up files from old checkpoints, so the incremental checkpoint history does not grow indefinitely.

To enable incremental checkpoints for a job, it is recommended to read Apache Flink’s checkpointing documentation in full. In short, you enable checkpointing as before, and additionally set the second parameter of the RocksDBStateBackend constructor to true.

Java example

// filebackend is the checkpoint directory URI, e.g. "hdfs:///flink-checkpoints"
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStateBackend(new RocksDBStateBackend(filebackend, true)); // true enables incremental checkpoints

Scala example

// filebackend is the checkpoint directory URI, e.g. "hdfs:///flink-checkpoints"
val env = StreamExecutionEnvironment.getExecutionEnvironment()
env.setStateBackend(new RocksDBStateBackend(filebackend, true)) // true enables incremental checkpoints

Flink keeps one successful checkpoint by default. If you need to retain more than one checkpoint, you can set this through the following configuration option:

state.checkpoints.num-retained
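For example, to retain the two most recent successful checkpoints, set the following in flink-conf.yaml:

state.checkpoints.num-retained: 2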

Principle analysis

Flink’s incremental checkpoints are built on RocksDB’s checkpoint mechanism. RocksDB is a key-value store based on the LSM-tree structure. It collects all changes in a mutable in-memory buffer (called a memtable); a change to a key in the memtable overwrites the previous value for that key. When the current memtable is full, RocksDB writes its contents to disk in sorted order. Once on disk, the file can no longer be changed; it is called an sstable.

RocksDB’s background compaction threads merge sstables and deduplicate keys. The merged sstable contains all the key-value pairs of its inputs, and RocksDB deletes the input sstables after the merge.
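As a toy illustration of the memtable and sstable behavior just described (a deliberate simplification, not RocksDB’s actual code; an "sstable" is reduced here to an immutable sorted list of entries):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy LSM store: writes go to a sorted in-memory memtable; when the
// memtable is full, it is frozen into an immutable, sorted "sstable".
public class ToyLsmStore {
    private static final int MEMTABLE_LIMIT = 4; // arbitrary small limit

    private TreeMap<String, String> memtable = new TreeMap<>();
    private final List<List<Map.Entry<String, String>>> sstables = new ArrayList<>();

    public void put(String key, String value) {
        memtable.put(key, value); // a later write to a key overwrites the earlier value
        if (memtable.size() >= MEMTABLE_LIMIT) {
            flush();
        }
    }

    // Freeze the current memtable into an immutable sorted sstable, then
    // start a fresh memtable (compaction of sstables is omitted here).
    private void flush() {
        sstables.add(Collections.unmodifiableList(new ArrayList<>(memtable.entrySet())));
        memtable = new TreeMap<>();
    }
}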

On top of this, Flink tracks which sstable files have been created and deleted since the previous checkpoint. Because sstables are immutable, Flink can use them to record changes in state. To do this, Flink calls RocksDB’s flush to force all data in the memtable to be written out as sstables, which are then hard-linked into a temporary directory. This step is done in the synchronous phase of the checkpoint; everything else happens in the asynchronous phase and does not block normal data processing.

Flink copies all newly created sstables to persistent storage (such as HDFS or S3) and references them from the new checkpoint. Sstables that already exist in a previous checkpoint are not uploaded again; Flink simply references them. Flink also guarantees that no checkpoint references a deleted file, since file deletion in RocksDB only happens through compaction, which merges the original contents into a new sstable. This is how Flink’s incremental checkpoints can truncate the checkpoint history.
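The upload-or-reference decision described in the last two paragraphs can be sketched as follows (a simplification for illustration; uploadToPersistentStorage and the manifest shape are hypothetical, not Flink’s internal API):

import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Sketch: build the file manifest of one incremental checkpoint. The map
// "alreadyUploaded" holds files uploaded by earlier retained checkpoints,
// keyed by file name, with their handles in persistent storage as values.
final class IncrementalSnapshotSketch {

    Map<String, String> snapshot(Set<String> localSstables,
                                 Map<String, String> alreadyUploaded) {
        Map<String, String> manifest = new HashMap<>();
        for (String sstable : localSstables) {
            String handle = alreadyUploaded.get(sstable);
            if (handle == null) {
                // New since the last checkpoint: upload it.
                handle = uploadToPersistentStorage(sstable);
            }
            // Either way, the new checkpoint references the file.
            manifest.put(sstable, handle);
        }
        return manifest;
    }

    private String uploadToPersistentStorage(String sstable) {
        return "hdfs:///checkpoints/" + sstable; // placeholder for the real upload
    }
}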

For the purpose of tracking the difference between checkpoints, backing up the merged sstables produced by compaction is somewhat redundant work. However, Flink handles it incrementally, the extra overhead is usually small, and because it allows a shorter checkpoint history to be kept, fewer checkpoints’ files need to be read during recovery, so we consider the trade-off worthwhile.

An example

[Figure: an example of incremental checkpoints, showing the local RocksDB files, the referenced files on persistent storage, and the files’ reference counts after each checkpoint]

The figure above takes a single stateful operator as an example, with the number of retained checkpoints set to 2. For each checkpoint, the figure records, from left to right, the local RocksDB state files, the referenced files on persistent storage, and the files’ reference counts after the checkpoint completes.

  • Checkpoint 1: The local RocksDB contains two sstable files. The checkpoint backs up both files to persistent storage. When the checkpoint completes, the reference count of each of the two files is increased by 1. Reference counts are stored as key-value pairs, where the key is composed of the operator’s current parallelism and the file name; a registry also maintains the mapping from each key to the corresponding file on persistent storage.
  • Checkpoint 2: RocksDB has generated two new sstable files, and the two old files still exist. Flink backs up the two new files and references the two old ones. When the checkpoint completes, Flink increases the reference count of all four files by 1.
  • Checkpoint 3: RocksDB’s compaction merges sstable-(1), sstable-(2), and sstable-(3) into sstable-(1,2,3) and deletes the three input files. The merged file contains all the key-value pairs of the three deleted files. Sstable-(4) still exists, and a new file, sstable-(5), has been generated. Flink backs up sstable-(1,2,3) and sstable-(5) to persistent storage and references sstable-(4), increasing its reference count. Since the number of retained checkpoints has reached the limit (2), checkpoint 1 is deleted, and the reference counts of all files referenced by checkpoint 1 (sstable-(1) and sstable-(2)) are decreased by 1.
  • Checkpoint 4: RocksDB merges sstable-(4), sstable-(5), and the newly generated sstable-(6) into sstable-(4,5,6). Flink backs up sstable-(4,5,6) to persistent storage, increases the reference counts of sstable-(1,2,3) and sstable-(4,5,6) by 1, then deletes checkpoint 2 and decreases the reference counts of the files it referenced by 1. At this point, the reference counts of sstable-(1), sstable-(2), and sstable-(3) drop to 0, and Flink deletes these three files from persistent storage. (A minimal sketch of this reference-counting scheme follows the list.)
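The reference counting walked through above can be summarized in the following simplified sketch (an illustration of the scheme described in this article, not Flink’s actual implementation; the delete helper is hypothetical):

import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Sketch: shared-file reference counting across retained checkpoints.
// Real keys combine the operator's parallelism and the file name; a
// plain file name is used here to keep the illustration short.
final class RefCountRegistrySketch {
    private final Map<String, Integer> refCounts = new HashMap<>();

    // When a checkpoint completes: +1 for every file it references.
    void checkpointCompleted(Set<String> referencedFiles) {
        for (String file : referencedFiles) {
            refCounts.merge(file, 1, Integer::sum);
        }
    }

    // When an old checkpoint is dropped: -1 for every file it referenced;
    // a file whose count reaches 0 is deleted from persistent storage.
    void checkpointDiscarded(Set<String> referencedFiles) {
        for (String file : referencedFiles) {
            int remaining = refCounts.merge(file, -1, Integer::sum);
            if (remaining == 0) {
                refCounts.remove(file);
                deleteFromPersistentStorage(file); // hypothetical helper
            }
        }
    }

    private void deleteFromPersistentStorage(String file) {
        System.out.println("deleting " + file);
    }
}

Replaying checkpoints 1 through 4 from the figure against this sketch reproduces the counts described above, including the deletion of sstable-(1), sstable-(2), and sstable-(3) after checkpoint 2 is dropped.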

Race conditions and concurrent checkpoints

Flink supports concurrent checkpoints. Sometimes a checkpoint triggered later completes first, so an incremental checkpoint must choose the correct baseline. Flink only references files from successfully completed checkpoints, which prevents referencing files that may have been deleted.

Recovery from checkpoints and performance

No additional configuration is needed for recovery once incremental checkpoints are enabled. If a job fails, Flink’s JobMaster tells all tasks to recover from the last successful checkpoint, whether it is a full or an incremental one. Each TaskManager then downloads the state files it needs from persistent storage.

Although incremental checkpoints can reduce checkpoint time for large state, there is no free lunch: we give something up elsewhere. Incremental checkpoints reduce the total checkpointing time, but they can also lead to longer recovery time. If the cluster fails frequently, Flink’s TaskManagers must download the needed state files from multiple checkpoints (and these files may contain state that has since been deleted), so the overall recovery time of the job may be longer than without incremental checkpoints.

In addition, with incremental checkpoints we cannot simply delete the files generated by old checkpoints, because newer checkpoints may still reference them. This can require more storage space, and more bandwidth may be consumed during recovery.

For strategies on balancing convenience against performance, refer to this document:

https://ci.apache.org/projects/flink/flink-docs-release-1.9/ops/state/large_state_tuning.html