The checkpoint mechanism is the cornerstone of Flink's reliability. When an operator fails for some reason (such as an abnormal exit), it lets the Flink cluster restore the state of the whole application flow graph to a state from before the failure, which keeps the state of the flow graph consistent. Flink's checkpoint mechanism is based on the Chandy-Lamport algorithm.
When an application that needs checkpointing starts, Flink's JobManager creates a CheckpointCoordinator for it. The checkpoint coordinator is solely responsible for taking snapshots of that application.
1) The checkpoint coordinator periodically sends a barrier to all source operators of the application's flow graph.
2) When a source operator receives a barrier, it pauses data processing, takes a snapshot of its current state, and saves it to the specified persistent storage. It then reports the status of its snapshot to the checkpoint coordinator, broadcasts the barrier to all of its downstream operators, and resumes data processing.
3) When a downstream operator receives the barrier, it pauses its own data processing, takes a snapshot of its state, and saves it to the specified persistent storage. It then reports its snapshot to the checkpoint coordinator, broadcasts the barrier to all of its downstream operators, and resumes data processing.
4) Each operator takes a snapshot and broadcasts the barrier downstream as described in step 3, until the barrier reaches the sink operators and the snapshot is complete.
5) When the checkpoint coordinator has received the reports of all operators, it considers the snapshot for this cycle successful. If it does not receive the reports of all operators within the specified time, it considers the snapshot for this cycle failed.
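The reporting and timeout logic in steps 1 to 5 can be pictured with a small model. The following is a minimal sketch in plain Scala, not Flink's implementation: the names CoordinatorSketch, PendingCheckpoint, ack, and status are hypothetical, operators are identified by plain strings, and the current time is passed in as a millisecond value so that the example stays deterministic.

```scala
// Hypothetical model of the coordinator's bookkeeping; these are not Flink classes.
object CoordinatorSketch {
  sealed trait Status
  case object Pending extends Status
  case object Completed extends Status
  case object Failed extends Status

  final case class PendingCheckpoint(
      id: Long,
      startedAtMs: Long,
      timeoutMs: Long,
      expected: Set[String],
      acked: Set[String] = Set.empty) {

    // Record a snapshot report; unknown operators and late reports are ignored.
    def ack(operator: String, nowMs: Long): PendingCheckpoint =
      if (expected.contains(operator) && nowMs - startedAtMs <= timeoutMs)
        copy(acked = acked + operator)
      else this

    // Completed once every expected operator has reported in time,
    // Failed once the timeout has passed with reports still missing.
    def status(nowMs: Long): Status =
      if (acked == expected) Completed
      else if (nowMs - startedAtMs > timeoutMs) Failed
      else Pending
  }
}
```

In this model a checkpoint that expects reports from "source", "map", and "sink" stays Pending until all three have called ack, and turns Failed if the timeout passes first.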
If an operator has two input sources, it blocks the input whose barrier arrives first. Only when the barrier with the same number arrives from the second input does it take its own snapshot and broadcast the barrier downstream. The details are shown in the figure below:
1) Suppose that operator C has two input sources, A and B.
2) In the i-th snapshot cycle, for some reason (such as processing delay or network delay), the barrier sent by input source A arrives first. Operator C then temporarily blocks the input channel of input source A and only receives data from input source B.
3) When the barrier sent by input source B arrives, operator C takes its own snapshot and reports it to the checkpoint coordinator. It then merges the two barriers into one and broadcasts it to all downstream operators.
4) When a failure occurs, the checkpoint coordinator notifies all operators in the flow graph to restore the checkpointed state of a given cycle, and data processing then resumes. This distributed checkpoint mechanism is what ensures that each record is reflected in the state exactly once.
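The barrier alignment in steps 1 to 3 can be sketched as a small simulation. This is an illustration under stated assumptions, not Flink code: AlignmentSketch, AligningOperator, Record, and Barrier are hypothetical names, the operator's state is just a running sum, and events from a blocked channel are buffered and replayed once all barriers have arrived.

```scala
// Hypothetical simulation of barrier alignment; not Flink's implementation.
object AlignmentSketch {
  sealed trait Event
  final case class Record(value: Int) extends Event
  final case class Barrier(id: Long) extends Event

  // Sums incoming records and aligns barriers across all input channels.
  final class AligningOperator(channels: Set[String]) {
    private var sum = 0
    private var blocked = Set.empty[String]
    private val buffered = scala.collection.mutable.Queue.empty[(String, Event)]
    val snapshots = scala.collection.mutable.Map.empty[Long, Int]
    val output = scala.collection.mutable.ListBuffer.empty[Event]

    def receive(channel: String, event: Event): Unit = event match {
      case _ if blocked.contains(channel) =>
        // The channel already delivered its barrier: hold the event back.
        buffered.enqueue(channel -> event)
      case Record(v) =>
        sum += v
        output += Record(sum)
      case Barrier(id) =>
        blocked += channel
        if (blocked == channels) {
          // All barriers arrived: snapshot, forward one barrier, unblock, replay.
          snapshots(id) = sum
          output += Barrier(id)
          blocked = Set.empty
          val pending = buffered.toList
          buffered.clear()
          pending.foreach { case (c, e) => receive(c, e) }
        }
    }
  }
}
```

A record that arrives on channel A after A's barrier is not part of the snapshot. It is processed only after the barrier from B has arrived and the snapshot has been taken.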
MemoryStateBackend saves snapshot data in the memory of the JobManager. It is only suitable for testing and for very small amounts of snapshot data, and it is not recommended for large-scale commercial deployments.
Limitations of MemoryStateBackend:
By default, the size of each individual state is limited to 5 MB. You can increase this value in the constructor of MemoryStateBackend.
Regardless of the configured maximum state size, the state cannot be larger than the Akka frame size (see configuration).
The aggregate state must fit into the JobManager's memory.
MemoryStateBackend is recommended for:
Local development and debugging.
Jobs that hold little state, such as jobs that consist only of record-at-a-time functions (Map, FlatMap, Filter, ...). The Kafka consumer also requires very little state.
FsStateBackend saves snapshot data to a file system. The file systems currently supported are mainly HDFS and local files. If HDFS is used, pass a path beginning with "hdfs://" when initializing FsStateBackend, for example new FsStateBackend("hdfs:///hacluster/checkpoint"). If local files are used, pass a path beginning with "file://", for example new FsStateBackend("file:///data"). Local files are not recommended for distributed applications: if an operator fails on node A and is recovered on node B, the data written on node A cannot be read on node B, and state recovery fails.
FsStateBackend is recommended for:
Jobs with large state, long windows, or large key/value state.
All high-availability setups.
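The warning about local files can be turned into a simple pre-flight check. The sketch below is only an illustration in plain Scala: CheckpointPathSketch and isNodeLocal are hypothetical names, and the rule it applies (a "file" scheme, or no scheme at all, means the path is local to one node) is an assumption made for this example rather than anything Flink enforces.

```scala
import java.net.URI

// Hypothetical helper; Flink itself does not provide this check.
object CheckpointPathSketch {
  // True when the URI points at the local file system of a single node,
  // which the text above advises against for distributed applications.
  def isNodeLocal(checkpointUri: String): Boolean = {
    val scheme = Option(new URI(checkpointUri).getScheme)
    scheme.forall(_.equalsIgnoreCase("file"))
  }
}
```

With the two example paths from the text, "file:///data" is flagged as node-local and "hdfs:///hacluster/checkpoint" is not.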
RocksDBStateBackend sits between local files and HDFS. During normal operation it uses RocksDB to persist data to local files. When a snapshot is taken, the local data is snapshotted and persisted through FsStateBackend. The user does not need to specify FsStateBackend separately: it is enough to pass an HDFS or local path at initialization, for example new RocksDBStateBackend("hdfs:///hacluster/checkpoint") or new RocksDBStateBackend("file:///data").
If the application uses custom windows, RocksDBStateBackend is not recommended. In a custom window, state is stored in the state backend as ListState. If a key has many values, RocksDB reads the ListState very slowly, which hurts performance. Users can choose FsStateBackend + HDFS or RocksDBStateBackend + HDFS depending on the needs of the application.
    import org.apache.flink.streaming.api.CheckpointingMode
    import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // start a checkpoint every 1000 ms
    env.enableCheckpointing(1000)
    // advanced options:
    // set the checkpointing mode: exactly-once or at-least-once
    env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
    // set the checkpoint timeout in milliseconds
    env.getCheckpointConfig.setCheckpointTimeout(60000)
    // whether the whole task should fail if an error occurs while taking a snapshot: true = yes, false = no
    env.getCheckpointConfig.setFailOnCheckpointingErrors(false)
    // set how many checkpoints may be in progress at the same time
    env.getCheckpointConfig.setMaxConcurrentCheckpoints(1)
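There are two ways to change the state backend.
First: per-job adjustment
Modify the code of the current job by calling env.setStateBackend(...) with the desired backend, for example new FsStateBackend("hdfs:///hacluster/checkpoint")
or new MemoryStateBackend()
or new RocksDBStateBackend(filebackend, true) [a third-party dependency needs to be added]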
Second: global adjustment
Note: state.backend can take the following values: jobmanager (MemoryStateBackend), filesystem (FsStateBackend), rocksdb (RocksDBStateBackend).
Advanced options for checkpoint
Checkpointing is disabled by default. To use it, you must enable it first. Once it is enabled, the default checkpointing mode is exactly-once.
    // Trigger a checkpoint every second
    env.enableCheckpointing(1000)
    // Specify the checkpointing mode. Two options:
    //   CheckpointingMode.EXACTLY_ONCE: the default
    //   CheckpointingMode.AT_LEAST_ONCE
    env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
In general, choose CheckpointingMode.EXACTLY_ONCE unless the scenario requires very low latency (a few milliseconds).
Note: to guarantee EXACTLY_ONCE end to end, the source and the sink must both guarantee EXACTLY_ONCE as well.
    // If the program is cancelled, keep the previous checkpoint
    env.getCheckpointConfig.enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)
By default, checkpoints are not retained and are only used to recover a job after a failure. You can enable externalized, persistent checkpoints and specify a retention policy:
ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION: the checkpoint is kept when the job is cancelled. Note that in this case you must clean up the checkpoint state manually after cancellation.
ExternalizedCheckpointCleanup.DELETE_ON_CANCELLATION: the checkpoint is deleted when the job is cancelled. The checkpoint is then only available if the job fails.
    // Set the checkpoint timeout in milliseconds.
    // A checkpoint that has not completed within the timeout is aborted.
    env.getCheckpointConfig.setCheckpointTimeout(60000)

    // Minimum pause between checkpoints: how long after the previous checkpoint
    // completes before another checkpoint can be triggered.
    // Setting this parameter implies that maxConcurrentCheckpoints is 1.
    env.getCheckpointConfig.setMinPauseBetweenCheckpoints(500)
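How the checkpoint interval and the minimum pause interact can be illustrated with a little arithmetic. The following is a simplified model in plain Scala, not Flink's scheduler: MinPauseSketch and nextTriggerMs are hypothetical names, and the rule used here (the next checkpoint starts no earlier than one interval after the last trigger and no earlier than the minimum pause after the last completion) is an assumption that captures the intent of the setting.

```scala
// Hypothetical, simplified model of checkpoint scheduling; not Flink code.
object MinPauseSketch {
  // Earliest time (ms) at which the next checkpoint may start in this model.
  def nextTriggerMs(
      lastTriggerMs: Long,
      lastCompletionMs: Long,
      intervalMs: Long,
      minPauseMs: Long): Long =
    math.max(lastTriggerMs + intervalMs, lastCompletionMs + minPauseMs)
}
```

With an interval of 1000 ms and a minimum pause of 500 ms, a checkpoint triggered at 0 ms that completes at 900 ms pushes the next trigger to 1400 ms. If it completes at 200 ms, the next trigger stays at 1000 ms.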
    // Set how many checkpoints may be in progress at the same time
    env.getCheckpointConfig.setMaxConcurrentCheckpoints(1)
Specifies the maximum number of checkpoints that can be in progress at the same time.
    env.getCheckpointConfig.setFailOnCheckpointingErrors(true)
Specifies whether the task fails when an error occurs during checkpointing. The default value is true. If it is set to false, the task declines the checkpoint and continues to run.
Restart strategy of Flink
Flink supports different restart strategies, which control how a job is restarted after it fails. A cluster can be started with a default restart strategy, which is always used when no job-specific restart strategy is defined. If a restart strategy is specified when a job is submitted, it overrides the cluster's default restart strategy.
The default restart strategy is set in Flink's configuration file, flink-conf.yaml. The configuration parameter restart-strategy defines which strategy is used. If checkpointing is not enabled, the "no restart" strategy is used. If checkpointing is enabled but no restart strategy is configured, the fixed-delay strategy is used with Integer.MAX_VALUE restart attempts. See the available restart strategies below to learn which values are supported.
Each restart strategy has its own parameters that control its behavior. These values can also be set in the configuration file. The description of each restart strategy contains more information about its configuration values.
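The precedence rules described above can be summarized in a few lines. This is a sketch in plain Scala with hypothetical names (RestartResolutionSketch, Strategy, effective). It only models which strategy wins and is not how Flink resolves its configuration internally.

```scala
// Hypothetical model of restart strategy precedence; not Flink classes.
object RestartResolutionSketch {
  sealed trait Strategy
  case object NoRestart extends Strategy
  final case class FixedDelay(attempts: Int) extends Strategy
  final case class FailureRate(maxFailuresPerInterval: Int) extends Strategy

  // A job-level strategy overrides the cluster default. With neither set,
  // the fallback depends on whether checkpointing is enabled.
  def effective(
      jobLevel: Option[Strategy],
      clusterLevel: Option[Strategy],
      checkpointingEnabled: Boolean): Strategy =
    jobLevel
      .orElse(clusterLevel)
      .getOrElse(if (checkpointingEnabled) FixedDelay(Int.MaxValue) else NoRestart)
}
```

For example, with no strategy configured anywhere, a job with checkpointing enabled resolves to FixedDelay(Int.MaxValue), and a job without checkpointing resolves to NoRestart.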
In addition to defining a default restart strategy, you can specify a restart strategy for each individual job. It is set programmatically by calling the setRestartStrategy() method on the ExecutionEnvironment. Note that the same applies to StreamExecutionEnvironment.
The following example shows how to set a fixed delay restart strategy for a job. In case of failure, the system tries to restart the job three times and waits 10 seconds between attempts.
    import java.util.concurrent.TimeUnit
    import org.apache.flink.api.common.restartstrategy.RestartStrategies
    import org.apache.flink.api.common.time.Time
    import org.apache.flink.api.scala.ExecutionEnvironment

    val env = ExecutionEnvironment.getExecutionEnvironment
    env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
      3, // number of restart attempts
      Time.of(10, TimeUnit.SECONDS) // delay between attempts
    ))
Fixed delay restart strategy
The fixed delay restart strategy attempts to restart the job a given number of times. If the maximum number of attempts is exceeded, the job eventually fails. Between two consecutive restart attempts, the restart strategy waits a fixed amount of time.
This strategy can be enabled as the default restart strategy by setting the following configuration parameters in flink-conf.yaml:
    restart-strategy: fixed-delay
    restart-strategy.fixed-delay.attempts: 3
    restart-strategy.fixed-delay.delay: 10 s
The fixed delay restart strategy can also be set programmatically:
    val env = ExecutionEnvironment.getExecutionEnvironment
    env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
      3, // number of restart attempts
      Time.of(10, TimeUnit.SECONDS) // delay between attempts
    ))
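The decision the fixed delay strategy makes on each failure can be modeled in a few lines. This is a minimal sketch in plain Scala, assuming the behavior described above. FixedDelaySketch, FixedDelay, and onFailure are hypothetical names and not part of Flink's API.

```scala
// Hypothetical model of the fixed delay decision; not Flink's RestartStrategies.
object FixedDelaySketch {
  final case class FixedDelay(maxAttempts: Int, delayMs: Long) {
    // `failureCount` is the number of failures so far, including the current one.
    // Returns Some(delay to wait) if a restart is allowed, None if the job fails for good.
    def onFailure(failureCount: Int): Option[Long] =
      if (failureCount <= maxAttempts) Some(delayMs) else None
  }
}
```

With three attempts and a 10 second delay, the first three failures each lead to a restart after 10000 ms, and the fourth failure ends the job.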
Failure rate restart strategy
The failure rate restart strategy restarts the job after a failure, but once the failure rate (the number of failures per time interval) is exceeded, the job eventually fails. Between two consecutive restart attempts, the restart strategy waits a fixed amount of time.
The failure rate restart strategy can be enabled as the default by setting the following configuration parameters in flink-conf.yaml:
    restart-strategy: failure-rate
    restart-strategy.failure-rate.max-failures-per-interval: 3
    restart-strategy.failure-rate.failure-rate-interval: 5 min
    restart-strategy.failure-rate.delay: 10 s
The failure rate restart strategy can also be set programmatically:
    val env = ExecutionEnvironment.getExecutionEnvironment
    env.setRestartStrategy(RestartStrategies.failureRateRestart(
      3, // maximum number of failures per measurement interval
      Time.of(5, TimeUnit.MINUTES), // interval over which the failure rate is measured
      Time.of(10, TimeUnit.SECONDS) // delay between two consecutive restart attempts
    ))
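The failure rate check can be pictured as a sliding window over failure timestamps. The sketch below is a simplified model in plain Scala, not Flink's implementation: FailureRateSketch, FailureRate, and onFailure are hypothetical names, and the exact boundary handling (a failure counts while it is less than one interval old) is an assumption made for this example.

```scala
// Hypothetical sliding-window model of the failure rate strategy; not Flink code.
object FailureRateSketch {
  final case class FailureRate(maxFailuresPerInterval: Int, intervalMs: Long, delayMs: Long) {
    // `failureTimesMs` lists every failure so far, including the one at `nowMs`.
    // Returns Some(delay to wait) if the job may restart, None if it should fail for good.
    def onFailure(failureTimesMs: Seq[Long], nowMs: Long): Option[Long] = {
      val recentFailures = failureTimesMs.count(t => nowMs - t < intervalMs)
      if (recentFailures <= maxFailuresPerInterval) Some(delayMs) else None
    }
  }
}
```

With at most 3 failures per 5 minutes, a fourth failure inside the same 5 minute window ends the job. If the earlier failures have aged out of the window, the job restarts again.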
No restart strategy
The job fails directly and no restart is attempted.
The no restart strategy can also be set programmatically:
    val env = ExecutionEnvironment.getExecutionEnvironment
    env.setRestartStrategy(RestartStrategies.noRestart())