Author: Tang Yun (dried tea)
The relationship between checkpoint and state
Checkpoint is a global operation triggered from the sources to all downstream nodes. The following figure gives an intuitive view of checkpoints: in the red box, you can see that a total of 569K checkpoints have been triggered, and all of them completed successfully without failure.
State is the main data that a checkpoint persists. According to the statistics in the figure below, the state here is only 9 KB.
What is state
Next, let's see what state is, using the classic word count code. This code listens on local port 9000 and counts the frequency of the words arriving on that socket. We start netcat locally and type hello world in the terminal. What will the program output?
The answer is obvious: **(hello, 1)** and **(world, 1)**.
So the question is: if you type hello world in the terminal again, what will the program output?
The answer is just as obvious: **(hello, 2)** and **(world, 2)**. Flink knows that it has already processed hello world once because state is at work. This kind of state is called keyed state. It stores the data counted so far, which is how Flink knows that hello and world have each appeared once before.
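To make the role of keyed state concrete, here is a minimal sketch in plain Java. It is not Flink code: the hypothetical class `WordCountModel` keeps one counter per word in a map, which stands in for the state Flink maintains per key. Lowercasing the input is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a keyed word count; not part of Flink.
class WordCountModel {
    // One counter per key, playing the role of keyed state.
    private final Map<String, Integer> countPerWord = new HashMap<>();

    // Processes one input line and returns the records emitted for it.
    List<String> processLine(String line) {
        List<String> emitted = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            // Read the previous count for this key, add one, and store it back.
            int updated = countPerWord.merge(word, 1, Integer::sum);
            emitted.add("(" + word + ", " + updated + ")");
        }
        return emitted;
    }

    public static void main(String[] args) {
        WordCountModel model = new WordCountModel();
        System.out.println(model.processLine("hello world"));
        System.out.println(model.processLine("hello world"));
    }
}
```

Without the map, the second call could not know that each word had been seen before; that memory is what state provides.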
Let's review the word count code. Calling the keyBy interface creates a KeyedStream that partitions the data by key, which is the prerequisite for using keyed state. After that, the sum method calls the built-in StreamGroupedReduce implementation.
What is keyed state
Keyed state has two characteristics:
- It can only be used in functions and operations on a KeyedStream, such as keyed UDFs and window state
- Keyed state is already partitioned, and each key belongs to exactly one keyed state partition
To understand the concept of partitioning, we need to look at the semantics of keyBy. As you can see in the figure below, there are three parallel subtasks on the left and three on the right. After words arrive on the left, they are distributed through keyBy. For example, for the input hello world, the hash computation always routes the word "hello" to the subtask at the bottom right.
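The routing idea can be sketched in a few lines of plain Java. This is a simplified model, not Flink's implementation: Flink first hashes keys into key groups and then assigns key groups to subtasks, whereas the hypothetical `subtaskFor` below simply takes the key's hash code modulo the parallelism.

```java
// Simplified model of keyBy routing; Flink's real assignment goes through key groups.
class KeyByModel {
    // Maps a key to one of `parallelism` downstream subtasks (0-based index).
    static int subtaskFor(String key, int parallelism) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(key.hashCode(), parallelism);
    }

    public static void main(String[] args) {
        for (String word : new String[] {"hello", "world", "hello"}) {
            System.out.println(word + " -> subtask " + subtaskFor(word, 3));
        }
    }
}
```

Because the result depends only on the key, every occurrence of the same word reaches the same subtask, so that subtask alone holds the state for that key.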
What is an operator state
- Also known as non-keyed state; each operator state is bound to exactly one instance of an operator.
- A common operator state is source state, such as the current offset recorded by a source.
Look at the word count code again, this time for operator state: the FromElementsFunction class uses operator state of type ListState. States can also be classified by state type, as shown in the figure below:
In addition to this classification, state can also be classified by whether Flink manages it directly:
- Managed state: state managed by Flink. All the states described above are managed state
- Raw state: Flink only provides a stream for storing the data. To Flink, raw state is just bytes
In actual production, only managed state is recommended, and this article focuses on it.
How to use state in Flink
The following figure uses the **StreamGroupedReduce** class, which the sum method of word count relies on, to explain how to use keyed state in code:
The following figure uses the **FromElementsFunction** class from the word count example to show how to use operator state in code:
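As a rough illustration of the operator state pattern, the sketch below models a source that checkpoints how many elements it has emitted and skips them after a restore. It is plain Java under stated assumptions: `CountingSourceModel` and its method names are hypothetical and only mirror the snapshot and restore idea, not Flink's interfaces.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of a source whose operator state is its emit count; not Flink's API.
class CountingSourceModel {
    private final List<String> elements;
    private int numEmitted = 0;

    CountingSourceModel(List<String> elements) {
        this.elements = elements;
    }

    // Emits the next element, or returns null when the source is exhausted.
    String emitNext() {
        return numEmitted < elements.size() ? elements.get(numEmitted++) : null;
    }

    // Snapshot: the state is a list holding a single entry, the emit count.
    List<Integer> snapshotState() {
        List<Integer> state = new ArrayList<>();
        state.add(numEmitted);
        return state;
    }

    // Restore: continue after everything that was emitted before the snapshot.
    void restoreState(List<Integer> state) {
        numEmitted = state.isEmpty() ? 0 : state.get(0);
    }
}
```

The state is tied to the source instance rather than to any key, which is what distinguishes operator state from keyed state.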
Execution mechanism of checkpoint
Before introducing the execution mechanism of checkpoint, we need to understand how state is stored, because state is the main object that checkpoint persists.
Classification of StateBackend
The following figure illustrates the three types of state backend built into Flink. Among them, **MemoryStateBackend** and **FsStateBackend** keep state in the Java heap at run time; only when a checkpoint is executed does **FsStateBackend** persist the data to remote storage in file format. **RocksDBStateBackend** uses RocksDB (an LSM DB that mixes memory and disk) to store the state.
MemoryStateBackend stores the working state data in the Java memory of the TaskManager. Key/value state and window operators use hash tables to store values and triggers. When taking a snapshot (checkpointing), the generated snapshot data is sent to the JobManager together with the checkpoint ACK message, and the JobManager keeps all received snapshots in its Java memory.
MemoryStateBackend now takes snapshots asynchronously by default to avoid blocking the pipeline processing of the main thread.
State access with MemoryStateBackend is very fast, but it is not suitable for production environments, because MemoryStateBackend has the following limitations:
- The size of each state is limited to 5 MB by default (this value can be set through the MemoryStateBackend constructor)
- The size of all state data of each task (a task may contain multiple operators in a pipeline) cannot exceed the frame size of the RPC system (akka.framesize, default 10 MB)
- The total state data received by the JobManager cannot exceed the JobManager's memory
Suitable scenarios for MemoryStateBackend:
- Local development and debugging
- Jobs with small state
The following figure shows where MemoryStateBackend stores its data:
It is worth noting that when a savepoint is triggered, the JobManager persists the snapshot data to external storage.
FsStateBackend requires a checkpoint path, such as "hdfs://namenode:40010/flink/checkpoints" or "file:///data/flink/checkpoints"; we generally configure it as an HDFS directory.
FsStateBackend keeps the working state data in the Java memory of the TaskManager. When taking a snapshot, it writes the snapshot data to the path configured above and then informs the JobManager of the file path. The metadata of all states is kept in the JobManager (in HA mode, the metadata is written to the checkpoint directory).
FsStateBackend uses asynchronous snapshot mode by default to avoid blocking the pipeline processing of the main thread. You can disable this mode through the FsStateBackend constructor:
new FsStateBackend(path, false);
FsStateBackend is suitable for the following scenarios:
- Jobs with large state, long windows, or large key/value state (large keys or large values)
- Suitable for high availability solutions
RocksDBStateBackend also requires a checkpoint path, for example "hdfs://namenode:40010/flink/checkpoints" or "file:///data/flink/checkpoints"; it is generally configured as an HDFS path.
RocksDB is an embeddable, persistent key-value storage engine that provides ACID support. Developed by Facebook on the basis of LevelDB, it uses an LSM storage engine and is a hybrid of memory and disk storage.
RocksDBStateBackend keeps the working state in the RocksDB database of the TaskManager. When a checkpoint is taken, the full data in RocksDB is transferred to the configured file directory, and a small amount of metadata is kept in JobManager memory (in HA mode, it is saved in the checkpoint directory).
RocksDBStateBackend takes snapshots asynchronously.
Limitations of RocksDBStateBackend:
- Since the JNI bridge API of RocksDB is based on byte[], each key or value supported by RocksDBStateBackend cannot exceed 2^31 bytes (2 GB).
- Note that state that relies on merge operations (for example, ListState) may grow beyond 2^31 bytes at run time, causing the program to fail.
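A quick back-of-the-envelope check shows how a merged list value can cross this limit. The sketch below is plain Java arithmetic, not a Flink or RocksDB API; `fitsInOneValue` is a hypothetical helper, and it ignores any per-element separator or framing overhead.

```java
// Arithmetic sketch of the per-value size limit; not a Flink or RocksDB API.
class RocksDbValueLimitModel {
    // The limit stated above: 2^31 bytes per key or value.
    static final long LIMIT_BYTES = 1L << 31;

    // Returns true if a merged list value of `elementCount` entries, each
    // `bytesPerElement` bytes after serialization, stays within the limit.
    static boolean fitsInOneValue(long elementCount, long bytesPerElement) {
        return Math.multiplyExact(elementCount, bytesPerElement) <= LIMIT_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(fitsInOneValue(1_000_000, 1_000)); // about 1 GB
        System.out.println(fitsInOneValue(3_000_000, 1_000)); // about 3 GB
    }
}
```

No single element is large in the second case; it is the accumulation under one key that exceeds the limit.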
RocksDBStateBackend is suitable for the following scenarios:
- Jobs with very large state, very long windows (days), or large key/value state
- Suitable for high availability mode
With RocksDBStateBackend, the state size is limited only by the disk space of the TaskManager (with FsStateBackend, by contrast, the state size is limited by TaskManager memory). The price is throughput: RocksDBStateBackend is slower than the other two backends, because every read and write of state data in RocksDB has to go through deserialization / serialization.
RocksDBStateBackend is the only one of the three that supports incremental checkpoints.
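Incremental here means that a checkpoint persists only the files that earlier checkpoints have not already uploaded. A minimal sketch of that selection in plain Java follows; the class, the method `filesToUpload`, and the file names are hypothetical and only model the idea.

```java
import java.util.Set;
import java.util.TreeSet;

// Models the file selection of an incremental checkpoint; not Flink's implementation.
class IncrementalUploadModel {
    // Returns the files of the current local snapshot that no earlier
    // checkpoint has uploaded yet; only these need to be persisted.
    static Set<String> filesToUpload(Set<String> currentFiles, Set<String> alreadyUploaded) {
        Set<String> delta = new TreeSet<>(currentFiles);
        delta.removeAll(alreadyUploaded);
        return delta;
    }

    public static void main(String[] args) {
        Set<String> current = Set.of("001.sst", "002.sst", "003.sst");
        Set<String> uploaded = Set.of("001.sst", "002.sst");
        System.out.println(filesToUpload(current, uploaded));
    }
}
```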
How to save managed keyed / operator state
HeapKeyedStateBackend has two implementations:
- Supports asynchronous checkpoints (default): storage format CopyOnWriteStateMap
- Supports only synchronous checkpoints: storage format NestedStateMap
In particular, when HeapKeyedStateBackend is used with MemoryStateBackend, the checkpoint serialization stage is limited to 5 MB of data by default.
For RocksDBKeyedStateBackend, each state is stored in a separate column family, in which the key group, key, and namespace are serialized together and stored in the DB as the key.
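The composite key layout can be pictured with a small plain-Java sketch. The byte format here is an assumption made for illustration (a 2-byte key group prefix and length-prefixed UTF-8 strings); Flink's actual serialization depends on the configured serializers and is not reproduced here.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Illustrative composite key layout; the exact byte format is an assumption.
class CompositeKeyModel {
    // Serializes key group, key, and namespace, in that order, into one byte[].
    static byte[] compositeKey(int keyGroup, String key, String namespace) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        byte[] nsBytes = namespace.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(2 + 4 + keyBytes.length + 4 + nsBytes.length);
        buf.putShort((short) keyGroup);            // key group prefix
        buf.putInt(keyBytes.length).put(keyBytes); // length-prefixed key
        buf.putInt(nsBytes.length).put(nsBytes);   // length-prefixed namespace
        return buf.array();
    }
}
```

Since the key group comes first, entries of the same key group sit next to each other under byte-wise key ordering.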
Detailed explanation of checkpoint execution mechanism
This section walks through the checkpoint execution process step by step. On the left of the figure below is the checkpoint coordinator, which initiates the entire checkpoint. In the middle is a Flink job composed of two sources and one sink. On the right is persistent storage, which corresponds to HDFS in most user scenarios.
a. In the first step, the checkpoint coordinator triggers the checkpoint on all source nodes.
b. In the second step, the source nodes broadcast a barrier downstream. The barrier is the core of the Chandy-Lamport distributed snapshot algorithm: a downstream task executes the corresponding checkpoint only after it has received the barrier from all of its inputs.
c. In the third step, after a task completes its state backup, it notifies the checkpoint coordinator of the state handle of the backup data.
d. In the fourth step, after the downstream sink node has collected the barriers from its two upstream inputs, it executes its local snapshot. The figure specifically shows the process of a RocksDB incremental checkpoint: first, RocksDB flushes its data to disk (indicated by the big red triangle), and then the Flink framework selects the files that have not been uploaded yet for persistent backup (indicated by the small purple triangle).
e. Similarly, after the sink node completes its checkpoint, it returns the state handle to notify the coordinator.
f. Finally, when the checkpoint coordinator has collected the state handles of all tasks, the checkpoint is considered globally complete, and a checkpoint meta file is backed up to the persistent storage.
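The coordinator's bookkeeping in steps c, e and f can be sketched as follows. This is a plain-Java model with hypothetical names (`PendingCheckpointModel`, `acknowledge`), not Flink's coordinator: it tracks which tasks have not acknowledged yet and reports completion once every state handle has arrived.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Models how a pending checkpoint becomes complete; not Flink's coordinator.
class PendingCheckpointModel {
    private final Set<String> notYetAcknowledged;
    private final Map<String, String> stateHandles = new HashMap<>();

    PendingCheckpointModel(Set<String> allTasks) {
        this.notYetAcknowledged = new HashSet<>(allTasks);
    }

    // Records a task's state handle; returns true once every task has acknowledged.
    boolean acknowledge(String task, String stateHandle) {
        if (notYetAcknowledged.remove(task)) {
            stateHandles.put(task, stateHandle);
        }
        return notYetAcknowledged.isEmpty();
    }

    // The collected handles are what a checkpoint meta file would reference.
    Map<String, String> stateHandles() {
        return stateHandles;
    }
}
```

With tasks `source-1`, `source-2` and `sink`, the third acknowledgement is the one that returns true; only then would the meta file be written.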
EXACTLY_ONCE semantics of checkpoint
To implement exactly-once semantics, Flink uses an input buffer to cache the data received during the alignment phase and processes it after alignment is completed. For at-least-once semantics, the data does not need to be cached and subsequent data is processed directly, so some data may be processed multiple times after a restore. The following figure is the schematic diagram of checkpoint alignment from the official documentation:
Note that Flink's checkpoint mechanism only guarantees exactly-once for Flink's own computation. End-to-end exactly-once additionally requires support from the source and the sink.
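The alignment phase can be modeled in a few lines of plain Java. This is a sketch under simplifying assumptions (a single checkpoint, string events, hypothetical class and method names), not Flink's implementation: once an input has delivered its barrier, further records from that input are buffered until the barrier has arrived on every input.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

// Models exactly-once barrier alignment for one checkpoint; not Flink's implementation.
class BarrierAlignerModel {
    static final String BARRIER = "BARRIER";

    private final int numInputs;
    private final Set<Integer> blockedInputs = new HashSet<>();
    private final Queue<String> buffered = new ArrayDeque<>();
    private final List<String> log = new ArrayList<>();

    BarrierAlignerModel(int numInputs) {
        this.numInputs = numInputs;
    }

    void onEvent(int input, String event) {
        if (BARRIER.equals(event)) {
            blockedInputs.add(input);
            if (blockedInputs.size() == numInputs) {
                // Barriers from all inputs have arrived: snapshot, then drain the buffer.
                log.add("snapshot");
                blockedInputs.clear();
                while (!buffered.isEmpty()) {
                    log.add("process " + buffered.poll());
                }
            }
        } else if (blockedInputs.contains(input)) {
            // This input is already past its barrier: hold the record back.
            buffered.add(event);
        } else {
            log.add("process " + event);
        }
    }

    List<String> log() {
        return log;
    }
}
```

If input 0 delivers `a`, its barrier, and then `b`, while input 1 delivers `x` and then its barrier, the model processes `a` and `x`, takes the snapshot, and only then processes `b`. An at-least-once variant would process `b` immediately, so its effect would be in the snapshot and it could be applied again after a restore.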
Difference between savepoint and checkpoint
Both can be used for job recovery, with the main differences as follows:
|Savepoint|Checkpoint|
|---|---|
|Triggered by the user via a command; the user manages its creation and deletion|Saved to the external persistent storage given by the user when the checkpoint completes|
|Stored in a standardized format that allows job upgrades or configuration changes|Retained in external storage when the job fails (or is canceled)|
|The user needs to provide the savepoint path to restore the job state|The user needs to provide the checkpoint path of the job state for recovery|