Pulsar's message storage mechanism and Bookie's GC mechanism

Date: 2021-12-22

About Apache Pulsar

Apache Pulsar is a top-level project of the Apache Software Foundation and a next-generation cloud-native distributed messaging and streaming platform. It integrates messaging, storage, and lightweight functional computing, adopts an architecture that separates computing from storage, and supports multi-tenancy, persistent storage, and cross-region replication across multiple data centers, with stream-storage features such as strong consistency, high throughput, low latency, and high scalability.
GitHub address: http://github.com/apache/pulsar/

About the author

The author of this article, Bao Mingyu, is a senior engineer in the Tencent TEG Data Platform Department and an Apache Pulsar contributor. He is keen on open source technology, has rich experience in the message queue field, and is currently focused on the adoption and promotion of Pulsar.

The MQ team of Tencent's Data Platform Department has researched Pulsar in depth and made many performance and stability optimizations; Pulsar is now live in TDBank. This article, one of a series on Pulsar, briefly walks through Pulsar's message storage and BookKeeper's mechanism for cleaning up storage files. BookKeeper can be understood as a NoSQL storage system that uses RocksDB to store index data by default.

Pulsar message storage

Pulsar messages are stored in BookKeeper. BookKeeper is a fat-client system: the client part is called BookKeeper, and each storage node in the server-side cluster is called a bookie. The broker in a Pulsar system acts as a client of the BookKeeper storage system and stores Pulsar messages in the bookie cluster through the client SDK that BookKeeper provides.

Each partition of each topic in Pulsar (a non-partitioned topic can be treated as partition 0; partitioned topics number their partitions from 0) corresponds to a series of ledgers, and each ledger stores only the messages of its own partition. For each partition, only one ledger is open for writing at any given time.

When Pulsar produces and stores a message, it first finds the ledger currently in use by the partition and then generates the entry ID for the message; entry IDs increase monotonically within a ledger. Without batching (the producer can configure this; batching is enabled by default), one entry contains one message; in batch mode, one entry may contain multiple messages. Bookies write, look up, and fetch data only at the entry level.

Therefore, the msgId of every Pulsar message consists of four parts (three in older versions): (ledgerId, entryId, partition-index, batch-index). The partition index is -1 for a non-partitioned topic, and the batch index is -1 for a non-batched message.
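To make the composition concrete, here is a minimal sketch of the four components. The record and helper below are hypothetical illustrations, not actual client classes; Pulsar's own client carries these fields inside its MessageId implementations.

```java
/**
 * Hypothetical sketch of the four msgId components described above;
 * it only mirrors the fields Pulsar's MessageId implementations carry.
 */
record MsgId(long ledgerId, long entryId, int partitionIndex, int batchIndex) {

    /** -1 means "not a partitioned topic" / "not a batched message". */
    static MsgId ofSingleMessage(long ledgerId, long entryId) {
        return new MsgId(ledgerId, entryId, -1, -1);
    }

    @Override
    public String toString() {
        // Same ordering as the tuple in the text.
        return ledgerId + ":" + entryId + ":" + partitionIndex + ":" + batchIndex;
    }
}
```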

Each ledger is switched when its lifetime or the number of entries it stores exceeds a threshold, and new messages for the same partition are then stored in the next ledger. A ledger is only a logical concept, a logical grouping dimension for the data; it has no corresponding physical entity.
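The switch condition can be pictured as a simple predicate. The sketch below is hypothetical: the names are illustrative, and in practice the thresholds come from broker settings such as managedLedgerMaxEntriesPerLedger and the ledger rollover time limits.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical sketch of the ledger rollover rule described above. */
final class LedgerRollover {

    /**
     * A new ledger is opened once the current one is old enough or
     * holds too many entries; parameter names are illustrative.
     */
    static boolean shouldRollover(Instant ledgerCreatedAt,
                                  long entriesInLedger,
                                  Duration maxLedgerAge,
                                  long maxEntriesPerLedger) {
        boolean tooOld = Duration.between(ledgerCreatedAt, Instant.now())
                                 .compareTo(maxLedgerAge) >= 0;
        boolean tooBig = entriesInLedger >= maxEntriesPerLedger;
        return tooOld || tooBig;
    }
}
```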

After each bookie node in the BookKeeper cluster receives messages, the data is stored and processed in three parts: journal files, entry log files, and index files.

Entry data is written to the journal file as a write-ahead log (WAL). Each journal file has a size limit; when a single file exceeds it, writing rolls over to the next file. Because journal files are flushed to disk in real time, it is recommended, both for performance and to avoid interference between read and write I/O, to keep the journal storage directory separate from the entry log directory and to mount a dedicated disk (preferably an SSD) for each journal directory. Only a few journal files are retained, and files beyond the configured number are deleted. Entries are stored in the journal file in completely random, first-come-first-written order. The journal file exists to ensure that messages are not lost.
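The recommended directory separation is a bookie-side setting. Here is a minimal sketch assuming BookKeeper's standard ServerConfiguration API (the same settings appear in bookkeeper.conf as journalDirectories and ledgerDirectories); the paths are placeholders.

```java
import org.apache.bookkeeper.conf.ServerConfiguration;

/**
 * Minimal sketch of the directory layout recommended above, assuming
 * BookKeeper's standard ServerConfiguration API; paths are placeholders
 * for an SSD-backed journal disk and separate data disks.
 */
public class BookieDirs {
    public static ServerConfiguration build() {
        ServerConfiguration conf = new ServerConfiguration();
        // Journal on its own (ideally SSD) disk, flushed in real time as a WAL.
        conf.setJournalDirName("/mnt/ssd/bookkeeper/journal");
        // Entry logs (and index files) on separate disks to avoid I/O interference.
        conf.setLedgerDirNames(new String[] {
                "/mnt/disk1/bookkeeper/ledgers",
                "/mnt/disk2/bookkeeper/ledgers"
        });
        return conf;
    }
}
```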

As shown in the figure below, after receiving a request to add an entry, each bookie maps its ledger ID to a journal directory and an entry log directory and stores the entry data in the corresponding directories. At present, a bookie does not support changing its storage directories while running (adding or removing directories during use will make some data unfindable).

[Figure: mapping from ledger ID to journal and entry log directories]

As shown in the figure below, when a bookie receives an entry write request, it writes the entry to the journal file and saves it to the write cache. The write cache has two parts, one being written into and one being flushed to disk, and the two are used alternately.

The write cache contains an index structure through which the corresponding entry can be found. This index is memory-level and is based on the ConcurrentLongLongPairHashMap structure defined in BookKeeper.

In addition, each entry log storage directory corresponds to one instance of the SingleDirectoryDbLedgerStorage class, and each SingleDirectoryDbLedgerStorage object holds a RocksDB-based index through which the entry log file containing any given entry can be found quickly. Each write cache sorts entries as they are added, so that within one write cache the data of the same ledger is adjacent and ordered; when the write cache is flushed into the entry log file, the written data is therefore locally ordered. This design greatly improves subsequent read efficiency.
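The two index levels described here can be modeled as follows. This is a simplified, hypothetical sketch: the real bookie uses ConcurrentLongLongPairHashMap for the in-memory write-cache index and RocksDB for the location index; ordinary Java maps stand in for both.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Simplified, hypothetical model of the two index levels described above;
 * plain maps stand in for ConcurrentLongLongPairHashMap and RocksDB.
 */
final class EntryIndexes {

    record EntryKey(long ledgerId, long entryId) implements Comparable<EntryKey> {
        @Override
        public int compareTo(EntryKey o) {
            int c = Long.compare(ledgerId, o.ledgerId);
            return c != 0 ? c : Long.compare(entryId, o.entryId);
        }
    }

    /** Where an entry landed after the write cache was flushed. */
    record Location(long entryLogId, long offset) {}

    // Write-cache index: (ledgerId, entryId) -> entry payload. A sorted map
    // keeps same-ledger entries adjacent, which is why the flushed entry log
    // ends up locally ordered.
    final Map<EntryKey, byte[]> writeCache = new TreeMap<>();

    // Location index: (ledgerId, entryId) -> (entryLogId, offset).
    final Map<EntryKey, Location> locationIndex = new ConcurrentHashMap<>();
}
```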

[Figure: bookie write path through journal, write cache, entry log, and RocksDB index]

The index data in SingleDirectoryDbLedgerStorage is also flushed to index files as entries are flushed. When a bookie crashes and restarts, data can be recovered from the journal files and entry log files, ensuring that no data is lost.

Pulsar consumers are accelerated by multiple layers of cache when reading data, as shown in the figure below:

[Figure: multi-layer cache read path]

The order of obtaining data is as follows:

  • Try the entry cache on the broker side; if missed, continue;
  • Try the part of the bookie write cache that is currently being written; if missed, continue;
  • Try the part of the bookie write cache that is being flushed to disk; if missed, continue;
  • Try the bookie read cache; if missed, continue;
  • Read the entry log file on disk through the index.

At whichever of the above steps the data is found, it is returned directly and the remaining steps are skipped. If the data is read from a disk file, it is also put into the read cache on the way back. In addition, a disk read fetches more than the requested entry: because storage is locally ordered, the probability that adjacent data will be requested next is very high, and this read-ahead greatly improves the efficiency of subsequent reads.
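The five steps, plus the read-cache backfill, can be sketched as a chain of lookups. All names below are hypothetical; the sketch only encodes the lookup order and the backfill-on-disk-hit behavior described in this section.

```java
import java.util.Optional;

/**
 * Hypothetical sketch of the five-step read path described above;
 * each lookup method stands in for one cache layer.
 */
interface EntryReader {
    Optional<byte[]> fromBrokerEntryCache(long ledgerId, long entryId);
    Optional<byte[]> fromWriteCacheBeingWritten(long ledgerId, long entryId);
    Optional<byte[]> fromWriteCacheBeingFlushed(long ledgerId, long entryId);
    Optional<byte[]> fromReadCache(long ledgerId, long entryId);
    byte[] readFromEntryLogViaIndex(long ledgerId, long entryId); // disk read

    default byte[] read(long ledgerId, long entryId) {
        return fromBrokerEntryCache(ledgerId, entryId)
            .or(() -> fromWriteCacheBeingWritten(ledgerId, entryId))
            .or(() -> fromWriteCacheBeingFlushed(ledgerId, entryId))
            .or(() -> fromReadCache(ledgerId, entryId))
            .orElseGet(() -> {
                // Disk hit: read and backfill the read cache, since locally
                // ordered storage makes adjacent entries likely to be next.
                byte[] data = readFromEntryLogViaIndex(ledgerId, entryId);
                backfillReadCache(ledgerId, entryId, data);
                return data;
            });
    }

    void backfillReadCache(long ledgerId, long entryId, byte[] data);
}
```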

In day-to-day use, we should try to avoid or reduce scenarios that consume data old enough to trigger reads from disk files, so as not to affect the performance of the overall system.

GC mechanism of BookKeeper

Each bookie in BookKeeper cleans up data periodically; by default, a check runs every 15 minutes. The main cleanup process is as follows:

[Figure: bookie GC flow]

  1. Clean up the ledgers stored on the bookie (compare the ledger IDs stored on the bookie against those stored in ZooKeeper; any ledger that no longer exists in ZooKeeper is deleted from the bookie);
  2. Count the proportion of surviving entries in each entry log; when an entry log has no surviving ledgers left, delete it;
  3. Clean up entry log files according to their metadata (an entry log is deleted when all the ledger IDs it contains are invalid);
  4. Compact entry log files: when the proportion of surviving entries in an entry log falls below 0.5 (major GC, run once a day by default) or below 0.2 (minor GC, run once an hour by default), compact the entry log file by copying its surviving entries into a new file and then deleting the old one. If the entry log files handled in a single GC run are large, the run may take a long time; the relevant bookie settings are sketched below.

Through the above process, we can understand the general flow a bookie follows when cleaning up entry log files.
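The check interval and compaction thresholds map onto bookie configuration. Below is a minimal sketch assuming BookKeeper's standard ServerConfiguration setters (equivalently gcWaitTime, minorCompactionThreshold/minorCompactionInterval, and majorCompactionThreshold/majorCompactionInterval in bookkeeper.conf); the values mirror the defaults quoted in this article.

```java
import org.apache.bookkeeper.conf.ServerConfiguration;

/**
 * Sketch of the GC/compaction knobs described above, assuming BookKeeper's
 * standard ServerConfiguration API; values mirror this article's defaults.
 */
public class BookieGcConfig {
    public static ServerConfiguration build() {
        ServerConfiguration conf = new ServerConfiguration();
        conf.setGcWaitTime(15 * 60 * 1000L);           // GC check every 15 minutes (ms)
        conf.setMinorCompactionThreshold(0.2);         // minor GC: <20% of entries alive
        conf.setMinorCompactionInterval(60 * 60);      // minor GC period: 1 hour (s)
        conf.setMajorCompactionThreshold(0.5);         // major GC: <50% of entries alive
        conf.setMajorCompactionInterval(24 * 60 * 60); // major GC period: 1 day (s)
        return conf;
    }
}
```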

Notably, whether a ledger can be deleted is driven by the client; in Pulsar, the broker triggers the deletion.

The broker runs a periodic processing thread (every 2 minutes by default) that cleans up the ledgers whose messages have already been consumed: it obtains the last confirmed message from the cursors contained in the topic and deletes, from the topic's ledger list, all ledgers that come before that message's ledger (note: the ledger currently holding that position is not included). This removes the metadata in ZooKeeper and notifies the bookies to delete the corresponding ledgers.
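The trimming rule can be pictured with the sketch below. This is purely illustrative pseudologic with hypothetical names, not the broker's actual implementation: every ledger strictly before the one holding the slowest cursor's last confirmed position is removed.

```java
import java.util.List;

/** Illustrative sketch of the broker-side ledger trimming rule described above. */
final class LedgerTrimmer {

    interface TopicMetadata {
        List<Long> ledgerIds();           // ledgers of this topic, oldest first
        long slowestCursorLedgerId();     // ledger holding the last confirmed position
        void deleteLedger(long ledgerId); // drop from ZK metadata + notify bookies
    }

    /** Delete every ledger strictly before the slowest cursor's ledger. */
    static void trim(TopicMetadata topic) {
        long keepFrom = topic.slowestCursorLedgerId();
        for (long ledgerId : List.copyOf(topic.ledgerIds())) {
            if (ledgerId < keepFrom) {
                topic.deleteLedger(ledgerId);
            } else {
                break; // list is ordered; the current ledger is never deleted
            }
        }
    }
}
```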

Analysis of problems encountered in operation

In production, we have repeatedly encountered scenarios where a bookie runs short of disk space and a large number of entry log files accumulate on it. There are two typical causes.

Reason one

Production messages are too scattered. Take an extreme scenario: 10,000 topics are produced to in sequence, one message per topic. After such a round, the ledger of each topic is not switched for a long time, because neither its lifetime nor its size threshold is reached. A large number of active ledger IDs are thus scattered across many entry log files, and none of those entry log files can be deleted or compacted in time.

If you encounter such a scenario, you can force the ledgers to switch by restarting. Of course, if consumption cannot keep up, the ledger holding the last acked position also remains active and cannot be deleted.

Reason two

During GC, if there are many existing entry log files and a large portion of them meet the minor or major GC threshold, a single minor or major GC run takes too long, and entry log files that expire during that time cannot be cleaned up.

This happens because a single cleanup process runs sequentially: the next round starts only after the previous one finishes. Work is currently being proposed to optimize this process so that an overlong sub-step does not hold up the whole cycle.

Summary

This article first introduced how Pulsar messages are organized, stored, and retrieved, and then described the GC process of a single bookie in detail. When using Pulsar, we should try to avoid consuming history old enough that reads must go to disk.

When operating and maintaining bookies, the number of storage directories cannot be adjusted while a bookie is running; if an adjustment is needed, it has to be made by expanding or shrinking individual bookie nodes.
