Job execution and daemon fault tolerance in Flink


1、 Job execution fault tolerance

Flink's error recovery mechanism is divided into multiple levels: the Task failover strategy at the Execution level, and the job restart strategy at the ExecutionGraph level. When an error occurs, Flink first tries to trigger a small-scale recovery; if that cannot handle the failure, recovery is escalated to a larger scope. See the sequence diagram below for details.

When a Task fails, the TaskManager notifies the JobManager via RPC. The JobManager changes the state of the corresponding Execution to FAILED and triggers the Task failover strategy. If the failover strategy's conditions are met, the JobManager restarts the Execution; otherwise the failure is escalated to the ExecutionGraph, which enters the FAILING state. The job restart strategy then decides whether the job is restarted (RESTARTING state) or exits abnormally (FAILED state).

1.1. Task failover strategies

At present, there are three Task failover strategies: RestartAllStrategy, RestartIndividualStrategy and RestartPipelinedRegionStrategy.

RestartAllStrategy: restarts all Tasks. This is the safest strategy for restoring job consistency and serves as the fallback when other failover strategies fail. It is currently the default Task failover strategy.

RestartPipelinedRegionStrategy: restarts all Tasks in the Region containing the failed Task. Regions are determined by the data exchange mode between Tasks: Tasks connected by pipelined data exchanges are placed in the same Region, while there is no pipelined data exchange between different Regions.

RestartIndividualStrategy: restarts only the failed Task. If the restarted Task is not a data source, it cannot replay its input, so some data may be lost and the delivery guarantee is weakened. The scope of this strategy is therefore rather limited; it applies only to jobs with no data transfer between Tasks.
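As a rough illustration, the failover strategy can be selected cluster-wide in flink-conf.yaml. A minimal sketch for choosing region-based failover (exact key names and accepted values depend on the Flink version):

```yaml
# Restart only the tasks in the region of the failed task.
# Use "full" to restart all tasks of the job instead.
jobmanager.execution.failover-strategy: region
```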

1.2. Job restart strategy

If a Task failure ultimately triggers a full restart, the job restart strategy controls whether the job is resumed. Flink provides three job-level restart strategies.

FixedDelayRestartStrategy: allows a specified number of execution failures; if the count is exceeded, the job fails. A fixed delay between restarts can be configured to reduce the load that frequent retries place on external systems and to avoid unnecessary error logs.

FailureRateRestartStrategy: allows a specified number of execution failures within a specified time window; if the failure rate is exceeded, the job fails. As with the fixed-delay strategy, a restart delay can also be configured.

NoRestartStrategy: fails the job immediately when an execution failure occurs.
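These restart strategies can be configured in flink-conf.yaml. A sketch of a fixed-delay setup, with the failure-rate alternative shown in comments (key names and value formats may vary between Flink versions):

```yaml
# Retry up to 3 times, waiting 10 seconds between attempts.
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s

# Alternative: fail only if failures exceed a rate, e.g.
# restart-strategy: failure-rate
# restart-strategy.failure-rate.max-failures-per-interval: 3
# restart-strategy.failure-rate.failure-rate-interval: 5 min
# restart-strategy.failure-rate.delay: 10 s
```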

2、 Daemon fault tolerance

In the Flink on YARN deployment mode, there are two key daemon processes: the JobManager and the TaskManager. The JobManager's main responsibilities, coordinating resources and managing job execution, are handled by two internal components, the ResourceManager and the JobMaster, respectively. The relationship between the three is shown in the figure below.

2.1 Fault tolerance of the TaskManager

If the ResourceManager detects a TaskManager failure through a heartbeat timeout or through a notification from the cluster manager, it notifies the affected JobMaster and starts a new TaskManager as a replacement. Note that the ResourceManager is not concerned with individual Flink jobs; reacting to the failure on behalf of the job is the JobMaster's responsibility.

If the JobMaster learns of a TaskManager failure from the ResourceManager, or detects it itself through a heartbeat timeout, it first removes the TaskManager from its SlotPool and marks all Tasks running on that TaskManager as FAILED, thereby triggering the job execution fault tolerance mechanism described above to recover the job.
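Heartbeat-based failure detection among the JobMaster, ResourceManager and TaskManager is governed by two configuration options. A sketch using what are, to my knowledge, Flink's default values (check your version's documentation before relying on them):

```yaml
# Interval at which heartbeat requests are sent, in milliseconds.
heartbeat.interval: 10000
# Timeout after which a peer is considered failed, in milliseconds.
heartbeat.timeout: 50000
```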

The state of the Tasks that ran on the failed TaskManager has already been written to checkpoints and is restored automatically after the restart, so no data inconsistency results.

2.2 Fault tolerance of the ResourceManager

If a TaskManager detects a ResourceManager failure through a heartbeat timeout, or receives a notification from ZooKeeper that the ResourceManager has lost leadership, the TaskManager finds the new leader and registers itself with it. The execution of its Tasks is not interrupted during this period.

If the JobMaster detects a ResourceManager failure through a heartbeat timeout, or receives a notification from ZooKeeper that the ResourceManager has lost leadership, the JobMaster likewise waits for a new ResourceManager to become leader and then re-requests all of its TaskManagers. Since the original TaskManagers may also recover successfully, TaskManagers newly requested by the JobMaster are released after being idle for a period of time.

The ResourceManager maintains a great deal of state, including the active containers, the available TaskManagers, and the mapping between TaskManagers and JobMasters. However, this information is not the ground truth: it can be rebuilt through state synchronization with the JobMasters and TaskManagers, so it does not need to be persisted.

2.3 Fault tolerance of the JobMaster

If a TaskManager detects a JobMaster failure through a heartbeat timeout, or receives a notification from ZooKeeper that the JobMaster has lost leadership, the TaskManager triggers its own error recovery and waits for a new JobMaster. If no new JobMaster appears within a certain period, the TaskManager marks its slots as free and informs the ResourceManager.

If the ResourceManager detects a JobMaster failure through a heartbeat timeout, or receives a notification from ZooKeeper that the JobMaster has lost leadership, it simply informs the TaskManagers of this and does nothing else.

The JobMaster holds much state that is critical to job execution. The JobGraph and user code are re-fetched from persistent storage such as HDFS, and checkpoint metadata is retrieved from ZooKeeper. Task execution state does not need to be recovered, because the whole job is rescheduled; slot information is rebuilt through state synchronization with the ResourceManager and the TaskManagers.
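Recovering this state requires high availability to be enabled. A minimal sketch of a ZooKeeper-based HA configuration in flink-conf.yaml; the quorum addresses and storage path below are placeholders, and key names may differ between Flink versions:

```yaml
# Use ZooKeeper for leader election and checkpoint metadata.
high-availability: zookeeper
high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
# Persistent storage (e.g. HDFS) for JobGraphs and checkpoint pointers.
high-availability.storageDir: hdfs:///flink/ha/
```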

2.4 Concurrent failures

In the Flink on YARN deployment mode, the JobMaster and the ResourceManager both run inside the JobManager process, so a JobManager process failure usually means they fail concurrently. The TaskManager then handles the failure in the following steps:

  • Follow the normal JobMaster failure handling.
  • For a period of time, try to offer its slots to the new JobMaster.
  • Keep trying to register itself with the ResourceManager.

It is worth noting that the new JobManager is brought up automatically by YARN's application-attempt retry mechanism. Because Flink configures YARN to keep containers across application attempts, the TaskManagers are not cleaned up and can re-register with the newly started Flink ResourceManager and JobMaster.
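The number of application attempts YARN makes for the JobManager is controlled on the Flink side by a single option; a sketch (the value 4 is an arbitrary example, and the effective limit is also capped by YARN's own maximum):

```yaml
# Maximum number of times YARN restarts the JobManager
# (ApplicationMaster) before the application is failed.
yarn.application-attempts: 4
```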

3、 Summary

Flink's fault tolerance mechanism ensures the reliability and resilience of the system. It comprises job execution fault tolerance and daemon fault tolerance. For job execution, Flink provides Task-level failover strategies and job-level restart strategies that retry automatically on failure. For the daemons, in the on-YARN mode Flink detects failures through heartbeats between internal components and through YARN's monitoring. A TaskManager failure is recovered by requesting a new TaskManager and restarting the affected Tasks or the job; a JobManager failure is recovered by automatically bringing up a new JobManager and having the TaskManagers re-register with the new leader.
