Introduction: This article is shared by Li Yu, an Apache Flink PMC member and senior technical expert at Alibaba. It introduces Flink's fault-tolerance mechanism from four angles: stateful stream computing, globally consistent snapshots, Flink's fault-tolerance mechanism, and Flink's state management.
By Li Yu
The content outline is as follows:
- Stateful stream computing
- Globally consistent snapshots
- Flink's fault-tolerance mechanism
- Flink's state management
1. Stateful stream computing
Stream computing involves a data source that continuously emits messages, and a resident program that runs the processing code: each time it receives a message from the source, it processes it and sends the result downstream.
Distributed stream computing
Distributed stream computing partitions the input stream in some way and processes the partitions with multiple distributed instances.
State in stream computing
Computation can be divided into stateless and stateful computation. A stateless computation processes each event in isolation, while a stateful computation must remember and process information across multiple events.
Take a simple example: suppose an event consists of an event ID and an event value. If the processing logic simply parses each event and outputs its value, that is a stateless computation. If, instead, the logic compares each event's value with the previous event's value and outputs it only when it is larger, that is a stateful computation, because it must remember the previous value.
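The contrast above can be sketched in a few lines of Python. This is an illustrative sketch, not Flink code; the function names and the event tuples are invented for the example.

```python
# Stateless vs. stateful processing over (event_id, value) pairs.

def stateless_parse(event):
    """Stateless: the output depends only on the single event."""
    event_id, value = event
    return value

def stateful_max(events):
    """Stateful: emit a value only when it exceeds the largest value seen
    so far, which requires remembering past events (the state)."""
    current_max = None  # this variable is the state
    outputs = []
    for _, value in events:
        if current_max is None or value > current_max:
            outputs.append(value)
            current_max = value
    return outputs

events = [("e1", 3), ("e2", 1), ("e3", 5), ("e4", 4)]
print([stateless_parse(e) for e in events])  # [3, 1, 5, 4]
print(stateful_max(events))                  # [3, 5]
```

The stateful variant's output depends on the history of events, so a failure that loses `current_max` changes the results; this is exactly why state must be part of any fault-tolerance story.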
There are many kinds of state in stream computing. In a deduplication scenario, all primary keys seen so far are state; in window computation, data that has entered a window but not yet been triggered is state; in machine learning / deep learning scenarios, the model being trained and its parameters are state.
2. Globally consistent snapshots
A globally consistent snapshot is a mechanism for backing up and restoring a distributed system.
What is a global snapshot
First, a global snapshot concerns a distributed application, whose multiple processes are spread across multiple servers. Second, each process has its own processing logic and internal state. Third, the processes can communicate with one another. Fourth, the application's global state at a given moment, covering both the internal state of every process and the messages in flight between them, is called a global snapshot.
Why do I need a global snapshot
- First, it can serve as a checkpoint: the global state is backed up periodically, and when the application fails it can be restored from the snapshot;
- Second, it enables deadlock detection: after the snapshot is taken, the program keeps running, and the snapshot can be analyzed to determine whether the application is deadlocked and, if so, handle it accordingly.
Global snapshot example
The figure below is an example of a global snapshot in a distributed system.
P1 and P2 are two processes with message channels between them, C12 and C21. For process P1, C12 is the channel on which it sends messages, its output channel; C21 is the channel on which it receives messages, its input channel.
Besides the channels, each process has a local state: for example, P1 and P2 each hold variables X, Y, and Z with their values in memory. The local states of P1 and P2 together with the states of the channels between them form the initial global state, which can also be called a global snapshot.
Suppose P1 sends a message asking P2 to change the value of X from 4 to 7, but the message is still in the channel and has not yet reached P2. This state is also a global snapshot.
Next, P2 receives the message from P1 but has not yet processed it. This state is also a global snapshot.
Finally, P2 processes the message and changes its local X from 4 to 7, which is again a global snapshot.
So whenever an event occurs, the global state changes. An event can be sending a message, receiving a message, or modifying local state.
2. Globally consistent snapshot
Consider two events A and B. If, in absolute time, A occurs before B, and B is included in the snapshot, then A must also be included in the snapshot. A global snapshot satisfying this condition is called a globally consistent snapshot.
2.1 Implementing a globally consistent snapshot
Clock synchronization cannot produce a globally consistent snapshot. A global stop-the-world synchronization can, but its drawback is obvious: it pauses all applications and hurts overall performance.
3. An asynchronous algorithm for globally consistent snapshots: Chandy-Lamport
Chandy-Lamport is an asynchronous algorithm that produces a globally consistent snapshot without disturbing the running application.
The system requirements of Chandy-Lamport are as follows:
- First, it must not affect the running of the application: sending and receiving messages continue, and no process needs to stop;
- Second, every process can record its local state;
- Third, the recorded states can be collected in a distributed way;
- Fourth, any process can initiate a snapshot.
Chandy-Lamport also assumes reliable channels: messages are delivered in order, without loss or duplication.
3.1 Flow of the Chandy-Lamport algorithm
The Chandy-Lamport algorithm has three parts: initiating the snapshot, executing the snapshot in a distributed way, and terminating the snapshot.
Any process can initiate a snapshot. As shown in the figure below, when P1 initiates a snapshot, it first records its local state, i.e. takes a local snapshot, and then immediately sends a marker message on all of its output channels, with no time gap in between. A marker is a special message, distinct from the messages the application itself exchanges.
After sending the markers, P1 starts recording the messages arriving on all of its input channels, i.e. the C21 channel shown in the figure.
Distributed execution snapshot
As shown in the figure below, suppose Pi receives a marker message on channel Cki, i.e. a marker sent by Pk to Pi. There are two cases.
Case one: this is the first marker Pi has received from any channel. Pi first records its local state, then records channel Cki as empty, meaning that any message arriving later on Cki will not be included in this snapshot. At the same time it immediately sends a marker on all of its output channels, and finally starts recording the messages on all input channels other than Cki.
As noted above, messages arriving on Cki after the marker are not included in the snapshot, even though messages keep flowing. Case two: Pi has already received a marker before. It then stops recording Cki and saves the messages recorded on Cki so far as Cki's final state in this snapshot.
There are two conditions for terminating a snapshot:
- First, all processes have received a marker message and recorded their local state;
- Second, every process has received a marker from each of its n-1 input channels and recorded those channels' states.
Once the snapshot terminates, a snapshot collector (a central server) gathers the partial snapshots into a globally consistent snapshot.
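The algorithm described above can be sketched as a small simulation. This is a minimal, hypothetical model (the `Process` and `Net` classes and the single-digit channel naming like "C12" = from P1 to P2 are inventions for the sketch), replaying the earlier P1/P2 example where P1 asks P2 to set X from 4 to 7 while a snapshot is in flight.

```python
from collections import deque

MARKER = object()  # the special marker message, distinct from app messages

class Process:
    """One process whose whole local state is a single value x (illustrative)."""
    def __init__(self, name, x):
        self.name, self.x = name, x
        self.local_snapshot = None   # local state recorded for the snapshot
        self.channel_snapshot = {}   # input channel -> recorded in-flight messages
        self.recording = set()       # input channels currently being recorded

    def initiate(self, net):
        """Initiator: snapshot locally, send markers, record every input."""
        self._snapshot_and_forward(net)
        self.recording = set(net.inputs(self.name))
        for ch in self.recording:
            self.channel_snapshot[ch] = []

    def _snapshot_and_forward(self, net):
        self.local_snapshot = self.x
        for ch in net.outputs(self.name):
            net.send(ch, MARKER)

    def on_message(self, ch, msg, net):
        if msg is MARKER:
            if self.local_snapshot is None:        # first marker this process sees
                self._snapshot_and_forward(net)
                self.channel_snapshot[ch] = []     # this channel snapshots as empty
                self.recording = set(net.inputs(self.name)) - {ch}
                for other in self.recording:
                    self.channel_snapshot[other] = []
            else:                                  # marker on a recorded channel:
                self.recording.discard(ch)         # freeze that channel's snapshot
        else:
            if ch in self.recording:               # in-flight message: record it
                self.channel_snapshot[ch].append(msg)
            self.x = msg                           # then apply it to local state

class Net:
    """Reliable FIFO channels named like 'C12' = from P1 to P2."""
    def __init__(self, procs, chans):
        self.procs = {p.name: p for p in procs}
        self.chans = {c: deque() for c in chans}
    def outputs(self, p): return [c for c in self.chans if c[1] == p[1]]
    def inputs(self, p):  return [c for c in self.chans if c[2] == p[1]]
    def send(self, ch, m): self.chans[ch].append(m)
    def deliver(self, ch):                         # deliver oldest message on ch
        self.procs["P" + ch[2]].on_message(ch, self.chans[ch].popleft(), self)

p1, p2 = Process("P1", 4), Process("P2", 4)
net = Net([p1, p2], ["C12", "C21"])
net.send("C12", 7)      # "set X to 7" is in flight toward P2
p1.initiate(net)        # P1 snapshots X=4 and queues a marker behind the 7
net.send("C21", 9)      # P2 sends 9 to P1 before seeing any marker
for ch in ["C12", "C12", "C21", "C21"]:
    net.deliver(ch)
print(p1.local_snapshot, p2.local_snapshot)  # 4 7
print(p1.channel_snapshot)                   # {'C21': [9]}
```

The resulting snapshot is consistent: P1 recorded X=4, P2 recorded X=7 (it applied the 7 before its marker arrived), channel C12 snapshots as empty, and the in-flight 9 is captured as C21's channel state, so no event is lost or double-counted.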
Example walkthrough
In the example below, some events, such as A, happen purely inside a process and involve no interaction with other processes. An internal event can be modeled as a message the process sends to itself; you can think of A as C11 = [A→].
How does the Chandy-Lamport globally consistent snapshot algorithm execute?
Suppose P1 initiates the snapshot. It first takes a snapshot of its local state, called S1, then immediately sends marker messages on all of its output channels, i.e. to P2 and P3, and then records all of its input channels, i.e. the messages from P2, P3, and itself.
In the figure, the vertical axis is absolute time. Why do P3 and P2 receive the marker at different times? In a real distributed deployment the network conditions between nodes differ, so message delivery times differ.
P3 receives the marker first, and it is the first marker it has seen. It takes a snapshot of its local state, marks channel C13 as closed, starts sending markers on all of its output channels, and finally begins recording the messages on all input channels except C13.
P1 then receives the marker sent by P3, but it is not the first marker P1 has seen. P1 closes channel C31 and takes the messages recorded so far as that channel's snapshot; any message from P3 arriving later will not be added to this snapshot.
Next, P2 receives the marker from P3, which is the first marker it has seen. It takes a snapshot of its local state, marks channel C32 as closed, starts sending markers on all of its output channels, and begins recording the messages on all input channels except C32.
Now consider the marker P2 receives from P1. It is not the first marker P2 has seen, so P2 closes its remaining input channel and records that channel's state.
Next, P1 receives the marker from P2, which is not the first it has seen, so it closes its remaining input channels and takes the recorded messages as their states. It then holds two channel states: C11, the messages it sent to itself, and C21, the message sent to P1 by event H in P2.
At the last time point, P3 receives the marker from P2. It is not the first marker it has seen, so it proceeds as described above. During this period P3 also had a local event J, which it likewise records as state.
When every process has recorded its local state and all of every process's input channels are closed, the globally consistent snapshot is complete: the global state at a past point in time has been fully recorded.
3.3 The relationship between Chandy-Lamport and Flink
Flink is a distributed system, and it uses globally consistent snapshots to form checkpoints that support failure recovery. Flink's asynchronous snapshot algorithm differs from Chandy-Lamport in several ways:
- First, Chandy-Lamport supports strongly connected graphs, whereas Flink's topologies are weakly connected graphs;
- Second, Flink uses a tailored (pruned) version of the Chandy-Lamport asynchronous snapshot algorithm;
- Third, in the DAG scenario Flink's asynchronous snapshot algorithm does not need to store channel state, which greatly reduces the snapshot's storage footprint.
3. Flink's fault-tolerance mechanism
Fault tolerance means recovering to the state before an error occurred. Stream computing offers three fault-tolerance consistency guarantees: exactly once, at least once, and at most once.
- Exactly once means each event affects the state exactly once. The "once" here is not strict end-to-end exactly once: it applies only within Flink and excludes the source and sink processing.
- At least once means each event affects the state at least once, i.e. events may be processed repeatedly.
- At most once means each event affects the state at most once, i.e. state may be lost when an error occurs.
End-to-end exactly once
Exactly once means the computed result is always correct, but it may be output more than once; it therefore requires a replayable source.
End-to-end exactly once means the result is both correct and output only once. Besides a replayable source, it requires a transactional sink or idempotent output of results.
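The idempotent-output half of this requirement can be illustrated with a toy sink. This is a hypothetical sketch (the `IdempotentSink` class and the keys are invented): the sink is keyed, and overwriting the same key with the same value is harmless, so replaying results after recovery does not change what downstream consumers see.

```python
# An idempotent sink: re-emitting the same (key, value) after a replay
# leaves the final output unchanged, giving end-to-end exactly once
# despite the results being produced more than once.

class IdempotentSink:
    def __init__(self):
        self.store = {}            # key -> value; overwriting is idempotent

    def write(self, key, value):
        self.store[key] = value    # writing the same pair twice is harmless

sink = IdempotentSink()
results = [("e1", 3), ("e2", 5)]
for k, v in results:
    sink.write(k, v)
# A failure occurs; the source is replayed and the same results re-emitted:
for k, v in results:
    sink.write(k, v)
print(sink.store)   # {'e1': 3, 'e2': 5}
```

A transactional sink achieves the same end differently: it buffers output and commits it atomically only when the checkpoint completes, so uncommitted duplicates are never visible.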
Flink’s state fault tolerance
Many scenarios require exactly-once semantics, i.e. each event is processed exactly once. How is this guaranteed?
Exactly-once fault tolerance in the simple scenario
The simple scenario is shown in the figure below: record the local state together with the source offset, i.e. the current position in the event log.
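The single-process case can be sketched in a few lines. This is an illustrative sketch (the log contents and the summing logic are invented): the checkpoint persists the source offset and the operator state atomically as a pair, and recovery restores both, so no event is counted twice or missed.

```python
# Exactly-once recovery in the simple scenario: persist (offset, state)
# together; on failure, restore both and re-read the log from the offset.

event_log = [2, 5, 1, 7, 3]   # a replayable source indexed by offset

def run(checkpoint, upto):
    """Process events from the checkpointed offset, summing their values."""
    offset, total = checkpoint
    while offset < upto:
        total += event_log[offset]
        offset += 1
    return (offset, total)

checkpoint = (0, 0)               # initial (source offset, operator state)
checkpoint = run(checkpoint, 3)   # process 3 events, then checkpoint
assert checkpoint == (3, 8)       # offset 3, partial sum 2+5+1

# Simulate a crash after the checkpoint: restart from (offset=3, total=8).
# Events 0..2 are not reprocessed, so each event affects state exactly once.
offset, total = run(checkpoint, len(event_log))
print(offset, total)   # 5 18
```

The key design point is that offset and state are saved as one atomic unit; saving them separately would reopen the window for duplication or loss.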
State fault tolerance in the distributed scenario
In the distributed scenario, we need a globally consistent snapshot across multiple operators with local state, produced without interrupting the computation. A Flink job's topology is special: it is a directed acyclic, weakly connected graph, so Flink can use a pruned version of Chandy-Lamport that records only the offsets of all inputs and the state of each operator. It relies on a rewindable source, i.e. one that can re-read from an earlier point via an offset, so channel state need not be stored, which saves a great deal of storage space when the job contains aggregation logic.
Recovery then consists of resetting the data source's position and having each operator restore its state from the checkpoint.
3. Flink’s distributed snapshot method
First, a checkpoint barrier, the counterpart of the marker message in the Chandy-Lamport algorithm above, is injected into the source data stream. The barriers naturally cut the stream into segments, each containing the data of one checkpoint;
Flink has a global coordinator. Unlike Chandy-Lamport, where any process can initiate a snapshot, this centralized coordinator injects the checkpoint barrier into every source to start the snapshot. When a node receives a barrier, it only needs to store its local state, because Flink does not store channel state.
When the checkpoint finishes, every parallel instance of every operator sends a confirmation message to the coordinator; once the checkpoint coordinator has received the confirmations of all tasks, the snapshot is complete.
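The coordinator's bookkeeping for those confirmations amounts to counting acknowledgements. This is a hypothetical sketch (the class name and task IDs are invented, not Flink's actual classes): the checkpoint is declared complete only when every expected task has acknowledged.

```python
# A toy checkpoint coordinator: the checkpoint completes once every
# task's acknowledgement has arrived.

class CheckpointCoordinator:
    def __init__(self, task_ids):
        self.pending = set(task_ids)   # tasks we still expect an ack from
        self.completed = False

    def acknowledge(self, task_id):
        self.pending.discard(task_id)
        if not self.pending:
            self.completed = True      # the snapshot (checkpoint) is finished

coord = CheckpointCoordinator(["source-0", "map-0", "map-1", "sink-0"])
for task in ["source-0", "map-0", "map-1"]:
    coord.acknowledge(task)
print(coord.completed)   # False: sink-0 has not acknowledged yet
coord.acknowledge("sink-0")
print(coord.completed)   # True
```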
4. Process demonstration
As shown in the figure below, suppose checkpoint barrier n is injected at the source; the source first records the offset of the partition it is processing.
As time passes, the source sends the checkpoint barrier downstream to its two parallel instances; when the barrier reaches them, each records its local state into the checkpoint.
Finally, when the barrier reaches the last subtask, the snapshot is complete.
That was a relatively simple scenario in which each operator has only one input stream. The figure below shows a more complex scenario, in which an operator has multiple input streams.
When an operator has more than one input, the barriers must be aligned. How? As shown in the figure below, starting from the state on the left: when the barrier has arrived on one input but not yet on the others, the stream whose barrier arrived first is blocked (to guarantee exactly once) while data on the other stream keeps being processed. When the barrier arrives on the other stream as well, the blocked stream is released and the operator forwards the barrier downstream.
Blocking one of the streams during alignment creates backpressure; barrier alignment thus causes backpressure and pauses the operator's data processing.
If the input whose barrier has already arrived were not blocked during alignment and its data kept flowing in, data belonging to the next checkpoint would be included in the current one; after a failure the source is rewound and some data is reprocessed, which is at-least-once semantics. If at-least-once is acceptable, this avoids the side effects of barrier alignment. In addition, asynchronous snapshots can minimize task pauses and allow multiple checkpoints to run at the same time.
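The alignment rule above can be sketched as a small function. This is an illustrative model, not Flink's implementation (the function name, channel lists, and `BARRIER` sentinel are invented); it assumes each input eventually delivers its barrier.

```python
from collections import deque

BARRIER = "BARRIER-n"   # sentinel standing in for checkpoint barrier n

def align_and_process(ch_a, ch_b):
    """Exactly-once alignment over two inputs: once a barrier arrives on
    one input, block that input and drain only the other until its barrier
    arrives. Returns (records processed before the checkpoint, backlog
    left on the blocked inputs, which belongs to the next checkpoint)."""
    processed = []
    inputs = {"a": deque(ch_a), "b": deque(ch_b)}
    seen = set()                       # inputs whose barrier has arrived
    while len(seen) < 2:
        for name in ("a", "b"):
            if name in seen or not inputs[name]:
                continue
            item = inputs[name].popleft()
            if item == BARRIER:
                seen.add(name)         # block this input from now on
            else:
                processed.append(item)
    # records behind a barrier were never processed early:
    blocked = list(inputs["a"]) + list(inputs["b"])
    return processed, blocked

pre, post = align_and_process([1, 2, BARRIER, 9], [3, BARRIER, 8])
print(pre)    # [1, 3, 2]  -> state snapshotted after these
print(post)   # [9, 8]     -> belong to checkpoint n+1
```

Dropping the `if name in seen: continue` check is exactly the unaligned behavior the text describes: 9 and 8 would leak into the current checkpoint, degrading the guarantee to at-least-once after a rewind.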
5. Snapshot trigger
Uploading the local snapshot to the storage system while processing continues requires a copy-on-write mechanism for the state.
If data processing resumes as soon as the snapshot of the metadata is taken, how do we ensure that the resumed application logic does not modify the data while it is being uploaded? Different state backends handle this differently: the heap backend triggers copy-on-write of the data, while for the RocksDB backend the LSM-tree structure guarantees that snapshotted data is never modified in place.
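The heap backend's copy-on-write behavior can be sketched as follows. This is a hypothetical model (the `CowState` class is invented, not Flink's heap backend): the snapshot pins the current version of the state without copying anything, and the first write after the snapshot copies the map so the pinned version stays untouched during the asynchronous upload.

```python
# Copy-on-write during an asynchronous snapshot: the snapshot holds a
# reference to an immutable version of the state, while processing
# continues on a fresh copy, so the upload never sees new writes.

class CowState:
    def __init__(self):
        self._data = {}
        self._frozen = None          # version pinned by an in-flight snapshot

    def snapshot(self):
        self._frozen = self._data    # pin the current version (no copy yet)
        return self._frozen

    def put(self, key, value):
        if self._frozen is self._data:
            self._data = dict(self._data)   # copy on first write after snapshot
        self._data[key] = value

state = CowState()
state.put("x", 4)
snap = state.snapshot()        # the async upload would read from `snap`
state.put("x", 7)              # triggers the copy; the snapshot is untouched
print(snap["x"])               # 4
print(state._data["x"])        # 7
```

RocksDB needs no such copy because LSM trees never overwrite data in place: a snapshot simply references the immutable SST files current at that moment.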
4. Flink's state management
1. Flink status management
First, a state must be defined. In the example below, a value state is defined.
When defining a state, the following information is given:
- A state identifier (ID)
- The state's data type
- Registration of the state with the local state backend
- Reads and writes of the state through the local state backend
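The pattern those four points describe can be sketched as a toy API. This is not Flink's actual API (Flink's is Java, e.g. `ValueStateDescriptor`); the class and method names below are invented to illustrate the flow of ID + type definition, backend registration, and read/write through a handle.

```python
# A toy value-state API: define a state by identifier and type, register
# it with a local state backend, then read and write it via the handle.

class ValueStateHandle:
    def __init__(self, backend, state_id, default):
        self.backend, self.state_id, self.default = backend, state_id, default

    def value(self):
        return self.backend.store.get(self.state_id, self.default)

    def update(self, v):
        self.backend.store[self.state_id] = v

class LocalStateBackend:
    def __init__(self):
        self.store = {}              # state_id -> current value

    def register_value_state(self, state_id, state_type, default=None):
        assert default is None or isinstance(default, state_type)
        return ValueStateHandle(self, state_id, default)

backend = LocalStateBackend()
max_seen = backend.register_value_state("max-seen", int, 0)  # ID + type
for v in [3, 1, 5, 4]:
    if v > max_seen.value():       # read through the backend
        max_seen.update(v)         # write through the backend
print(max_seen.value())            # 5
```

Routing all reads and writes through the backend is what lets the runtime swap storage (heap vs. RocksDB) and snapshot the state without changing the user's processing logic.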
2. Flink state backends
Flink has two kinds of state backends:
- The first is the JVM heap backend. Data lives as Java objects and is read and written as objects, so access is very fast. It has two drawbacks: first, storing objects takes many times the space of the serialized, compressed on-disk form, so it uses a lot of memory; second, although reads and writes need no serialization, building a snapshot does, so its asynchronous snapshot process is relatively slow.
- The second is the RocksDB backend, which must serialize on every read and write, so access is relatively slow. Its advantage is that the LSM-based data structure turns a snapshot into a set of SST files, so the asynchronous checkpoint is just a file copy and CPU consumption is relatively low.
This article is the original content of Alibaba cloud and cannot be reproduced without permission.