Up to now, only memory, file system and RocksDB are available as state backends for Flink jobs, and RocksDB is the only choice when the amount of state data is large (from GB to TB). The performance of RocksDB depends heavily on tuning. With the default configuration, read and write performance may be very poor.
However, configuring RocksDB is also extremely complex. There are hundreds of adjustable parameters, and there is no universal optimization scheme. If we only consider Flink state storage, though, we can still sum up some relatively general optimization ideas. This article first introduces some basic knowledge and then lists the methods.
Note: The content of this article is based on our online practice with Flink 1.9. In version 1.10 and later, because the TaskManager memory model was reworked, RocksDB memory is part of the off-heap managed memory by default, which avoids some manual tuning. If performance is still poor and intervention is required, you must set the state.backend.rocksdb.memory.managed parameter to false to disable managed memory for RocksDB.
State R/W on RocksDB
The read and write logic of RocksDB as a Flink state backend differs slightly from the general case, as shown in the figure below.
Each registered state in a Flink job corresponds to a column family, that is, it has its own independent set of memtables and sstables. A write first goes to the active memtable. Once the active memtable is full, it is converted to an immutable memtable and flushed to disk to form an sstable. A read looks for the target data in the active memtable, the immutable memtables, the block cache and the sstables, in that order. In addition, sstables are merged by the compaction strategy to form the leveled LSM tree storage structure, which is well known and not repeated here.
In particular, since Flink persists a snapshot of the RocksDB data to the file system in each checkpoint cycle, there is no need to write the write-ahead log (WAL), and both WAL and fsync can be safely disabled.
The author has explained RocksDB's compaction strategies in detail before, and mentioned the concepts of read amplification, write amplification and space amplification. The essence of RocksDB tuning is to balance these three factors. In Flink, which emphasizes real-time processing, we should focus on read and write amplification.
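To make the lookup order concrete, here is a minimal toy model of one column family. It is only a sketch: the class and method names are hypothetical, memtable capacity is counted in entries rather than bytes, and the block cache and compaction are left out.

```python
class ToyColumnFamily:
    """Toy model of the column family behind one Flink state (not RocksDB code)."""

    def __init__(self, memtable_capacity=2):
        # Capacity in entries; stands in for write_buffer_size.
        self.memtable_capacity = memtable_capacity
        self.active = {}       # active memtable
        self.immutables = []   # immutable memtables, newest first
        self.sstables = []     # flushed tables, newest first

    def put(self, key, value):
        self.active[key] = value
        if len(self.active) >= self.memtable_capacity:
            # A full memtable becomes immutable and a fresh one takes over.
            self.immutables.insert(0, self.active)
            self.active = {}

    def flush(self):
        # Flush oldest first, so the newest table ends up at the front.
        while self.immutables:
            self.sstables.insert(0, self.immutables.pop())

    def get(self, key):
        # Newest data wins: active memtable, immutable memtables, then sstables.
        sources = (
            ("active memtable", [self.active]),
            ("immutable memtable", self.immutables),
            ("sstable", self.sstables),
        )
        for name, tables in sources:
            for table in tables:
                if key in table:
                    return table[key], name
        return None, "not found"


cf = ToyColumnFamily(memtable_capacity=2)
cf.put("a", 1)
cf.put("b", 2)   # the memtable is now full and becomes immutable
cf.put("a", 3)   # the newer value lands in the fresh active memtable
```

Reading "a" returns the newer value from the active memtable, while "b" is served from the immutable memtable until `flush()` moves it into an sstable.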
As the read-write cache of the LSM tree, the memtable has a great impact on write performance. Here are some notable parameters. For ease of comparison, each entry below lists the original RocksDB parameter name and the corresponding Flink configuration key, separated by a vertical bar.
- write_buffer_size | state.backend.rocksdb.writebuffer.size: The size of a single memtable. The default value is 64MB. When a memtable reaches this threshold, it is marked immutable. Generally speaking, increasing this parameter moderately reduces write amplification, but it also increases the pressure on the L0 and L1 levels after flushing. The compaction parameters therefore need to be adjusted as well, as discussed later.
- max_write_buffer_number | state.backend.rocksdb.writebuffer.count: The maximum number of memtables, counting both active and immutable ones. The default value is 2. When all memtables are full but flushing is slow, writes stall. Therefore, if there is enough memory or a mechanical hard disk is used, it is recommended to increase this parameter, for example to 4.
- min_write_buffer_number_to_merge | state.backend.rocksdb.writebuffer.number-to-merge: The minimum number of memtables merged before a flush. The default value is 1. For example, if this parameter is set to 2, a flush can only be triggered when there are at least two immutable memtables (that is, a single immutable memtable will wait). The advantage of increasing this value is that more changes can be merged before flushing, which reduces write amplification. It may also increase read amplification, because there are more memtables to check when reading data. In our tests, setting this parameter to 2 or 3 worked best.
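Since every registered state has its own memtables, the memory cost of these parameters grows with the number of states. The following is a rough sketch of the worst-case memtable footprint of one RocksDB instance. The function name is hypothetical, and the estimate ignores per-memtable overhead.

```python
MB = 1024 * 1024


def max_memtable_bytes(num_states, write_buffer_size=64 * MB,
                       max_write_buffer_number=2):
    """Upper bound on memtable memory for one RocksDB instance.

    Each registered state is a column family with up to
    max_write_buffer_number memtables of write_buffer_size bytes each.
    """
    return num_states * write_buffer_size * max_write_buffer_number


# Five states with the default settings, then with writebuffer.count raised to 4.
default_total = max_memtable_bytes(5)
tuned_total = max_memtable_bytes(5, max_write_buffer_number=4)
print(default_total // MB, "MB vs", tuned_total // MB, "MB")
```

For five states this gives 640MB with the defaults and 1280MB with four memtables per state, so it is worth checking the available memory before raising either parameter.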
Tuning Block/Block Cache
A block is the basic storage unit of an sstable. The block cache acts as the read cache and uses an LRU algorithm to keep the most recently used blocks, so it has a great impact on read performance.
- block_size | state.backend.rocksdb.block.blocksize: The size of a block. The default value is 4KB. In production environments it is usually increased to 32KB. For mechanical hard disks, it can be increased to 128~256KB to make full use of their sequential read capability. Note, however, that if the block size increases while the block cache size stays the same, the cache holds fewer blocks, which increases read amplification.
- block_cache_size | state.backend.rocksdb.block.cache-size: The size of the block cache. The default value is 8MB. As the read path described above shows, a larger block cache effectively keeps read requests for hot data from falling through to the sstables. Therefore, if memory is sufficient, it is recommended to set it to 128MB or even 256MB, which greatly improves read performance.
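The interaction between the two parameters above is simple arithmetic. The sketch below estimates how many blocks fit in the cache. The function name is hypothetical, and the estimate ignores per-block metadata overhead.

```python
KB = 1024
MB = 1024 * KB


def blocks_in_cache(block_cache_size, block_size):
    """Approximate number of blocks the block cache can hold."""
    return block_cache_size // block_size


print(blocks_in_cache(8 * MB, 4 * KB))     # defaults
print(blocks_in_cache(8 * MB, 32 * KB))    # larger blocks, same cache
print(blocks_in_cache(256 * MB, 32 * KB))  # larger blocks, larger cache
```

With the default 8MB cache, moving from 4KB to 32KB blocks cuts the number of cached blocks from 2048 to 256, which is why the block size and the cache size are best raised together.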
Compaction is the most expensive operation in all LSM tree based storage engines. If it is handled badly, it can easily block reads and writes. Readers are encouraged to read the previous article on RocksDB's compaction strategies for background knowledge, which will not be repeated here.
- compaction_style | state.backend.rocksdb.compaction.style: The compaction algorithm. The default, LEVEL (i.e. leveled compaction), is fine, and the following parameters all assume it.
- target_file_size_base | state.backend.rocksdb.compaction.level.target-file-size-base: The size threshold of a single sstable file at the L1 level. The default value is 64MB. For each level above L1, the threshold is multiplied by the factor target_file_size_multiplier (its default value is 1, which means the maximum sstable size is the same at every level). Increasing this value reduces the frequency of compaction and reduces write amplification, but it also keeps old data from being cleaned up in time, which increases read amplification. This parameter is hard to tune. It is generally not recommended to set it above 256MB.
- max_bytes_for_level_base | state.backend.rocksdb.compaction.level.max-size-level-base: The total data size threshold of the L1 level. The default value is 256MB. For each level above L1, the threshold is multiplied by the factor max_bytes_for_level_multiplier (the default is 10). Because the size thresholds of the higher levels are all derived from it, it should be adjusted carefully. It is recommended to set it to a multiple of target_file_size_base, and the multiple should not be too small, for example 5 to 10 times.
- level_compaction_dynamic_level_bytes | state.backend.rocksdb.compaction.level.use-dynamic-size: This parameter was mentioned before. When enabled, the multiplication factor for the thresholds above becomes a division factor, so the data size threshold of each level is adjusted dynamically and more data falls on the highest level. This reduces space amplification and makes the structure of the whole LSM tree more stable. For mechanical hard disk environments, enabling it is highly recommended.
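The level thresholds described above can be worked out with a few lines of arithmetic. This is a sketch with hypothetical function names. The dynamic variant is a deliberate simplification of the "division factor" idea (start from the actual size of the highest level and divide downward) and does not reproduce RocksDB's exact algorithm.

```python
MB = 1024 * 1024


def level_size_targets(max_bytes_for_level_base=256 * MB, multiplier=10,
                       num_levels=3):
    """Static thresholds for L1..Ln: each level is `multiplier` times the previous one."""
    return [max_bytes_for_level_base * multiplier ** i for i in range(num_levels)]


def dynamic_level_size_targets(last_level_bytes, multiplier=10, num_levels=3):
    """Simplified dynamic thresholds: divide from the highest level's actual size."""
    targets = [last_level_bytes // multiplier ** i for i in range(num_levels)]
    return targets[::-1]  # order as L1..Ln


static = level_size_targets()
dynamic = dynamic_level_size_targets(1000 * MB)
print([t // MB for t in static])   # thresholds in MB for L1, L2, L3
print([t // MB for t in dynamic])
```

With the defaults, the static thresholds are 256MB, 2560MB and 25600MB. In the simplified dynamic case with 1000MB of data on the highest level, they become 10MB, 100MB and 1000MB, so the lower levels stay small relative to the data actually stored.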
- max_open_files | state.backend.rocksdb.files.open: As the name implies, the maximum number of files that a RocksDB instance can open. The default value is -1, which means no limit. Because the sstable indexes and bloom filters reside in memory and occupy file descriptors by default, if this value is too small, the indexes and bloom filters cannot be loaded normally, which seriously slows down read performance.
- max_background_compactions/max_background_flushes | state.backend.rocksdb.thread.num: The maximum number of concurrent background threads responsible for flush and compaction. The default value is 1. Note that Flink handles these two parameters as one (corresponding to the DBOptions.setIncreaseParallelism() method). Since both flush and compaction are relatively heavy operations, it is recommended to increase this value if there is enough spare CPU. In our practice, it is generally set to 4.
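As a recap, the Flink-side keys listed in this article can be collected into one configuration fragment. The sketch below only renders "key: value" lines. The helper name is hypothetical, the values are simply the ones suggested above, and the size notation (32kb, 256mb) is assumed to be accepted by Flink's memory size parser. Treat the result as a starting point to test against your own workload, not as a recommended setting.

```python
def render_flink_conf(options):
    """Render a dict of options as 'key: value' lines."""
    return "\n".join(f"{key}: {value}" for key, value in options.items())


# Values taken from the suggestions in this article; adjust per job.
suggested = {
    "state.backend.rocksdb.writebuffer.count": 4,
    "state.backend.rocksdb.writebuffer.number-to-merge": 2,
    "state.backend.rocksdb.block.blocksize": "32kb",
    "state.backend.rocksdb.block.cache-size": "256mb",
    "state.backend.rocksdb.compaction.level.use-dynamic-size": "true",
    "state.backend.rocksdb.thread.num": 4,
}

print(render_flink_conf(suggested))
```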
In addition to setting parameters as described above, users can also create DBOptions and ColumnFamilyOptions instances by implementing the ConfigurableRocksDBOptionsFactory interface, which is more flexible. For reference, readers can look at the RocksDB parameter sets predefined by Flink (in the PredefinedOptions enum).
This article is reproduced from littlemagic’s blog