How to insert an LSM tree into NVM

Date: 2022-05-09

Author | shutter
Source: Alibaba Tech official account

Introduction: With the commercialization of non-volatile memory products, we have become increasingly interested in their potential for large-scale adoption in cloud-native databases. X-Engine is an LSM-tree storage engine developed by the PolarDB new storage engine team of the Alibaba Cloud Database Products Division, and it currently serves customers on Alibaba Cloud PolarDB. Starting from X-Engine, and taking both the strengths and the limitations of non-volatile memory into account, we redesigned and re-implemented the core components of the storage engine, including the main-memory data structures, transaction processing, and a persistent memory allocator. The result is high-performance transaction processing without write-ahead logging, lower write amplification across the whole system, and faster failure recovery. The work was published at VLDB 2021 as "Revisiting the Design of LSM-tree Based OLTP Storage Engine with Persistent Memory". The paper covers a lot of ground and should be a useful reference for follow-up research and applications. As a team that keeps investing in fundamental database technology, we work on transaction storage engines, high-performance transaction processing, new storage devices, heterogeneous computing devices, and AI for DB. Like-minded colleagues at home and abroad are welcome to join us or get in touch.

I. Background introduction

1 Introduction to persistent memory

Compared with DRAM, persistent memory (PM) offers larger capacity and lower static power consumption while retaining byte addressability, and it adds persistence. It is intended to greatly increase a machine's memory capacity and reduce its static power draw, while the combination of persistence and byte addressability simplifies system design and opens new opportunities for storage engine design. Intel Optane DCPMM ships in the DDR4 DIMM form factor and is therefore also called a persistent memory module (PMM). DCPMM is currently available in three capacities per module: 128GB, 256GB and 512GB, with actual usable capacities of 126.4GiB, 252.4GiB and 502.5GiB respectively. Optane DCPMM currently works only with Intel Cascade Lake processors and, like conventional DRAM, is attached to the processor through Intel's integrated memory controller (iMC). Although DCPMM offers the same byte addressability as DRAM, its I/O behaviour differs considerably, mainly in media access granularity, cache control, concurrent access, and cross-NUMA-node access. Interested readers can refer to the literature [1,2].
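To make the later discussion concrete, below is a minimal sketch (my own, not code from the paper) of how a store is made durable on DCPMM: data is copied into a DAX-mapped region, the affected cache lines are written back with CLWB, and an SFENCE orders the flushes. The 64-byte CPU flush granularity versus the 256-byte media access granularity is where the write-amplification issues discussed later come from. The helper names (`persist`, `pm_write`) are illustrative.

```cpp
// Minimal sketch: making a store durable on PM.
// Assumes the CPU supports CLWB and the PM region is DAX-mapped (e.g. a
// file on an fsdax device mapped with mmap). Not the paper's actual code.
#include <immintrin.h>
#include <cstdint>
#include <cstring>

constexpr size_t kCacheLine = 64;   // CPU flush granularity
// Note: the Optane media itself accesses data in 256-byte blocks, which is
// why small random writes cause write amplification on PM.

inline void persist(const void* addr, size_t len) {
    auto p   = reinterpret_cast<uintptr_t>(addr) & ~(kCacheLine - 1);
    auto end = reinterpret_cast<uintptr_t>(addr) + len;
    for (; p < end; p += kCacheLine) {
        _mm_clwb(reinterpret_cast<void*>(p));   // write back each cache line
    }
    _mm_sfence();                               // order flushes before later stores
}

// Usage: copy a record into a PM-mapped buffer, then flush it.
void pm_write(void* pm_dst, const void* src, size_t len) {
    std::memcpy(pm_dst, src, len);
    persist(pm_dst, len);
}
```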

2 Introduction to X-Engine

X-Engine is an OLTP database storage engine based on the LSM-tree architecture; its overall structure is shown in Figure 2. A single database may consist of multiple LSM-tree instances (called subtables), each storing one table, index, or table partition. For details, see the SIGMOD 2019 paper [3]. An LSM tree divides data into multiple levels whose sizes grow by a fixed ratio, residing in memory and on disk respectively, and data flows from upper to lower levels through compaction. Because DRAM is volatile, a write-ahead log (WAL) is written to disk before the data itself to guarantee persistence; once the in-memory data has been flushed or compacted to disk, the corresponding WAL can be discarded. In a typical design, the in-memory data is kept in a skip list; when it exceeds a size limit, it is frozen (the switch operation and the immutable memtable in the figure) and flushed to disk, and a new memtable is created. On disk, each level stores data in multiple sorted string tables (SSTs), each of which is essentially a B-tree, and the key ranges of different SSTs within the same level normally do not overlap. In practice, however, systems such as LevelDB and RocksDB allow SSTs in level 0 to overlap in order to speed up memtable flushes, and the resulting unordered layout of level 0 degrades read efficiency.

Figure 2: Main architecture of X-Engine

3 Opportunities and challenges

Existing LSM-tree based OLTP storage engines typically suffer from the following problems. (1) The WAL sits on the write critical path; in particular, to satisfy the ACID properties of transactions, the WAL is usually written to disk synchronously, which slows down writes. Moreover, because DRAM is volatile, a very large memtable would improve performance but would also inflate the WAL and prolong recovery. (2) Level 0 is usually allowed to be unordered so that memtables can be flushed quickly, but if too many unordered data blocks pile up, read performance suffers badly, especially for range reads. Intuitively, PM, being persistent and byte-addressable, could host a persistent memtable in place of the volatile DRAM memtable and eliminate the cost of maintaining a WAL. In practice, however, the characteristics of PM make efficient persistent indexes, and the accompanying PM memory management, challenging to build. In addition, current PM hardware guarantees only 8-byte atomic writes, so traditional approaches still need extra logging (also known as the PM-internal log) to provide atomic write semantics for larger writes.

To address these challenges, this paper designs Halloc, an efficient PM memory manager tailored to LSM trees; proposes an optimized PM-based semi-persistent memtable to replace the DRAM memtable of the traditional design; uses the ROR lock-free, log-free algorithm to remove the WAL that traditional designs rely on to maintain the ACID properties of transactions; and introduces a globally ordered persistent index layer, the Global Index (GI), together with an in-memory merge strategy, to replace the traditional level 0, improving query efficiency and reducing the CPU and I/O cost of maintaining level 0 data. The main changes are shown in Figure 3, where MEM denotes the active memtable, IMM the immutable memtable, and the SP prefix semi-persistence. These designs bring three main benefits: (1) writes are faster because neither a WAL nor the extra internal log of a PM programming library is needed; (2) data in PM is already persistent, so frequent flushes and level 0 compactions are avoided, and the memtable capacity can be set much larger without worrying about recovery time; (3) level 0 data is globally ordered, so level 0 accumulation is no longer a concern.

Figure 3: Comparison between the proposed scheme and the traditional scheme

II. Semi-persistent memory table

Table: classification of PM-based index designs

Updates and inserts into a persistent index usually involve multiple small random PM writes larger than 8 bytes, which introduces consistency-maintenance overhead and write amplification from random writes. As summarized in the table above, PM-based index designs fall into three categories. Non-persistent designs simply use PM as if it were DRAM; they deliver the best index performance, but lose data on power failure. Fully persistent designs, such as BzTree and wB+Tree, persist all index data (inner nodes, leaf nodes, and so on); they recover quickly, but the persistence overhead is generally high and performance is therefore low. The compromise is to persist only the necessary data, trading recovery time for performance: NV-Tree and FPTree, for example, persist only leaf nodes and rebuild inner nodes at restart. In an LSM tree the memtable is usually small, so the semi-persistent design is a good fit. This paper uses two techniques to reduce the cost of maintaining the persistent memtable.

Figure 4: Structure of the semi-persistent memtable

Persist only leaf nodes. In practice, cloud OLTP engines based on LSM trees do not use large memtables; 256MB is typical, for two reasons: (1) cloud users usually buy database instances with small memory; (2) the LSM tree needs small memtables to keep flushes fast. For a 256MB memtable, we find that rebuilding the non-leaf nodes at restart, when only leaf nodes are persisted, takes less than 10ms, which is fast enough for the database systems we study. Second, the index separates sequence numbers from user keys to speed up key lookup and to satisfy the memtable's MVCC (multi-version concurrency control) constraints. As shown in Figure 4, for a value with only one version (9), the index stores a pointer directly to its location in PM; for values with multiple versions, a growable array stores the version sequence numbers together with their pointers. Since the index is volatile, keys are not stored explicitly in the index; the index is rebuilt at restart by scanning the key-value pairs in PM.
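As a rough illustration of this idea, the sketch below shows what a volatile index entry might look like: the index lives in DRAM and stores only sequence numbers plus pointers into key-value records that reside in PM. The type and field names (`PmRecord`, `IndexEntry`, `Version`) are mine, not the paper's.

```cpp
// Illustrative sketch of the semi-persistent memtable idea: the index is
// volatile and only points into persistent key-value records stored in PM.
#include <cstdint>
#include <utility>
#include <vector>

struct PmRecord;            // a key-value record laid out in a PM zone

// One entry per user key in the volatile index.
struct IndexEntry {
    struct Version { uint64_t seq; PmRecord* rec; };
    // For a key with a single version (the common case) Figure 4 shows a
    // direct pointer; here a small vector plays both roles for simplicity.
    std::vector<Version> versions;   // ordered by seq, newest last

    void add(uint64_t seq, PmRecord* rec) { versions.push_back({seq, rec}); }

    // Return the newest version visible at read snapshot `snapshot_seq` (MVCC).
    PmRecord* get(uint64_t snapshot_seq) const {
        for (auto it = versions.rbegin(); it != versions.rend(); ++it)
            if (it->seq <= snapshot_seq) return it->rec;
        return nullptr;
    }
};
// On restart the whole index is rebuilt by scanning the PM records,
// so keys never need to be duplicated inside the index itself.
```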

Batch sequential writes to reduce write amplification. In PM, small random writes are turned into random 256-byte block writes by the hardware controller, which amplifies writes and wastes PM bandwidth. Because the memtable is append-only by design, the semi-persistent memtable avoids this problem by packing small writes into a large WriteBatch, writing the WriteBatch to PM sequentially, and then inserting the individual records into the volatile index. In Figure 4, "batch" denotes one large WriteBatch write, and "slot" records the ID of the zone object allocated from the Halloc allocator.
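The sketch below shows this write path under simplifying assumptions (the record layout, `Zone`, and `append_batch` are illustrative, and `persist` is the flush helper sketched earlier): the whole batch is appended to a PM zone as one sequential write and flushed once, and only then is the volatile index updated with pointers into the zone.

```cpp
// Sketch: append a WriteBatch sequentially to a PM zone, then index it.
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

void persist(const void* addr, size_t len);    // flush helper from earlier

struct Zone {                 // a PM region obtained from the allocator
    char*  base;              // DAX-mapped address
    size_t tail = 0;          // append offset
};

struct Record { std::string key; std::string value; uint64_t seq; };

template <typename Index>
void append_batch(Zone& zone, const std::vector<Record>& batch, Index& index) {
    size_t start = zone.tail;
    std::vector<char*> locations;              // where each record landed in PM
    for (const auto& r : batch) {
        char* p = zone.base + zone.tail;
        locations.push_back(p);
        // encode: [seq][klen][vlen][key][value]  (layout is illustrative)
        std::memcpy(p, &r.seq, sizeof(r.seq));  p += sizeof(r.seq);
        uint32_t kl = r.key.size(), vl = r.value.size();
        std::memcpy(p, &kl, sizeof(kl));        p += sizeof(kl);
        std::memcpy(p, &vl, sizeof(vl));        p += sizeof(vl);
        std::memcpy(p, r.key.data(), kl);       p += kl;
        std::memcpy(p, r.value.data(), vl);     p += vl;
        zone.tail = static_cast<size_t>(p - zone.base);
    }
    persist(zone.base + start, zone.tail - start);   // one sequential flush
    // Only after the data is durable is the volatile index updated.
    for (size_t i = 0; i < batch.size(); ++i)
        index.put(batch[i].key, batch[i].seq, locations[i]);
}
```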

III. ROR: lock-free, log-free transaction commit algorithm

Because the memtable index in this paper does not need to be persistent, only the data itself must be persisted atomically. PM is byte-addressable and persistent, but it guarantees atomicity only for 8-byte writes (this refers specifically to Intel Optane DCPMM), so any write larger than 8 bytes risks being torn. The traditional solution treats the log as the data: as shown in Figure 5, log entries are appended sequentially, and the 8-byte head pointer is updated after each entry is written. Since the hardware guarantees that the head update is atomic, entries before head are known to be fully written, while entries after head may be partially written and are discarded at restart.
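A rough sketch of this conventional pattern is shown below (my own simplification, reusing the `persist` helper from earlier): an entry is made durable first, and only then is the 8-byte head advanced, which is the single atomic step recovery relies on.

```cpp
// Sketch of the conventional PM log described above: entries are appended
// after `head`, and only the 8-byte head update is atomic, so anything
// beyond head is treated as garbage after a crash.
#include <atomic>
#include <cstdint>
#include <cstring>

void persist(const void* addr, size_t len);    // flush helper from earlier

struct PmLog {
    char*                  buf;     // DAX-mapped log area
    std::atomic<uint64_t>* head;    // 8-byte offset, itself stored in PM

    void append(const void* entry, uint32_t len) {
        uint64_t off = head->load(std::memory_order_relaxed);
        std::memcpy(buf + off, &len, sizeof(len));
        std::memcpy(buf + off + sizeof(len), entry, len);
        persist(buf + off, sizeof(len) + len);      // the entry is durable first
        // Only now advance head: an 8-byte store the hardware makes atomic.
        head->store(off + sizeof(len) + len, std::memory_order_release);
        persist(head, sizeof(uint64_t));
    }
};
// Recovery replays entries up to `head`. The drawbacks the paper points out:
// writers are serialized, and one entry mixes data with different lifetimes.
```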

This scheme has two problems. (1) The log can only be written sequentially, which wastes the parallel write capability of a multi-core system. (2) A single log entry mixes data with different lifetimes, which makes log entries hard to reclaim. As Figure 5 shows, the data in Log1 goes into three LSM-tree instances whose memtables are flushed at different times; an instance that receives few writes flushes slowly, so Log1 cannot be reclaimed for a long time, reducing PM space utilization.

Figure 5: Problems of the traditional scheme

To solve these problems, this paper proposes the ROR algorithm, which uses the ChainLog data structure to separate data lifetimes and a lock-free ring to enable concurrent log writes. To further reduce random PM writes and improve write performance, ROR also batches small ChainLogs into larger blocks. As shown in Figure 6, ChainLog guarantees atomicity for data of any size written to PM; batching aggregates small transactions into larger writes to reduce random PM writes; and the concurrent ring provides lock-free, pipelined writing of ChainLogs into PM for better multi-core scalability.

Figure 6: Overall framework of the ROR algorithm

A transaction to be committed is first encapsulated into a WriteBatch, and one or more WriteBatches are further encapsulated into a ChainLog that is written to PM in one batch. The paper keeps the LSM tree's original concurrency control, namely two-phase locking (2PL) plus MVCC. As shown in Figure 6, ROR uses a fixed but tunable number of concurrent buckets to control concurrent writes: the first thread to enter a bucket becomes the leader and performs the actual write, while the other threads entering that bucket become followers. The leader aggregates its own WriteBatch and those of all followers in the bucket into a larger WriteBatch and writes it.
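As a rough illustration of the leader/follower grouping (the real ROR pipeline is lock-free and built on a concurrent ring; the mutex-and-condition-variable version below, with made-up names such as `Bucket` and `write_chainlog`, only sketches the batching idea):

```cpp
// Simplified leader/follower sketch. Followers that arrive while the leader
// is writing the previous group are naturally aggregated into the next one.
#include <condition_variable>
#include <mutex>
#include <vector>

struct WriteBatch { /* encoded key-value records */ bool done = false; };
void write_chainlog(const std::vector<WriteBatch*>& group);  // assumed PM writer

struct Bucket {
    std::mutex mu;
    std::condition_variable cv;
    std::vector<WriteBatch*> pending;
    bool leader_active = false;

    void commit(WriteBatch* wb) {
        std::unique_lock<std::mutex> lk(mu);
        pending.push_back(wb);
        if (leader_active) {                        // follower: wait for my group
            cv.wait(lk, [&] { return wb->done; });
            return;
        }
        leader_active = true;                       // first thread in: lead
        do {
            auto group = std::move(pending);        // batches queued so far
            pending.clear();
            lk.unlock();
            write_chainlog(group);                  // one aggregated PM write
            lk.lock();
            for (auto* b : group) b->done = true;
            cv.notify_all();                        // wake that group's followers
        } while (!pending.empty());                 // drain groups queued meanwhile
        leader_active = false;
    }
};
```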

One key element of ROR is the ChainLog, which extends the 8-byte head with a field identifying the log's lifetime and a field identifying the write position. The write-position field makes it possible to locate, and discard at restart, ChainLogs that were only partially written. The lifetime field lets the data destined for different LSM-tree instances within a ChainLog be written to different memory spaces that are isolated from each other. In addition, ChainLogs are always committed serially from a global point of view: a ChainLog entry is considered persistent only after all earlier ChainLogs have been persisted in PM. Serialized commit means that, during recovery, only the last ChainLog entry needs to be examined to determine whether it was partially written.

Consider lifetime separation first. An intuitive approach is to give each LSM-tree instance its own head pointer. As shown in Figure 7, each head indicates the write position of its own memory space; the spaces are isolated from each other and have independent lifetimes, so the space of a memtable can be reclaimed immediately after that memtable is flushed, without waiting for other LSM-tree instances. The problem is that a single transaction may update more than one LSM-tree instance, which would require updating several head pointers and therefore cannot be done with a single 8-byte atomic hardware write.

Figure 7: Example of log entry lifetimes

To solve this problem, ROR extends the 8-byte head that indicates the write position in each LSM-tree instance with the position information shown in Figure 8; this extended pointer, gidx, replaces the original 8-byte head. The upper 4 bytes of gidx record the previous write position and the lower 4 bytes record the current one. Within each 4-byte half, 6 bits record the memtable slot and 26 bits record the offset within that slot, so a total of 4GB of space can be addressed. The data of a ChainLog entry is split into n sub-entries, each written to its corresponding LSM-tree instance, and every sub-entry records the number of sub-entries and the LSN of its ChainLog.
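The bit layout lends itself to a short sketch (field widths follow the description of Figure 8; the helper names are mine): each 4-byte half packs a 6-bit slot and a 26-bit offset, the high half remembers the previous position, and rollback is the single right shift the text mentions.

```cpp
// Sketch of the gidx encoding: one 8-byte word, updated atomically by hardware.
#include <cstdint>

using Gidx = uint64_t;

constexpr uint32_t kOffsetBits = 26;              // 2^26 bytes = 64MB per slot
constexpr uint32_t kOffsetMask = (1u << kOffsetBits) - 1;

inline uint32_t pack_pos(uint32_t slot, uint32_t offset) {
    return (slot << kOffsetBits) | (offset & kOffsetMask);   // 6-bit slot + 26-bit offset
}
inline uint32_t cur_pos(Gidx g)  { return static_cast<uint32_t>(g); }        // low half
inline uint32_t prev_pos(Gidx g) { return static_cast<uint32_t>(g >> 32); }  // high half

// Advance: the current position moves into the "previous" half and the new
// position becomes current; the whole word is one 8-byte atomic store.
inline Gidx advance(Gidx g, uint32_t slot, uint32_t offset) {
    return (static_cast<Gidx>(cur_pos(g)) << 32) | pack_pos(slot, offset);
}

// Rollback after a partial write: drop the current half and fall back to the
// previous position (the "right shift" the text refers to).
inline Gidx rollback(Gidx g) { return g >> 32; }
```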

Figure 8: Structure of gidx

Figure 9 illustrates the process: ChainLog 2 (R2) has been committed successfully and R3 is being committed. During R3's commit the system loses power, so sub-entry R31 is written successfully but R32 is not. During recovery, scanning the last ChainLog sub-entry in every LSM-tree instance shows that the largest LSN is 3; R31 was written successfully and records a sub-entry count of 2, yet R32 was never written. The gidx of LSM-tree instance 1 is therefore rolled back (a simple, fast right-shift operation), R31 is discarded, and the LSN falls back to 2, returning the system to a globally consistent state.

Figure 9: Power-failure recovery

Because ChainLogs obey serialized write semantics, scanning the last ChainLog sub-entry of every LSM tree at restart is sufficient to re-establish a globally consistent state. One way to guarantee serialized semantics is to literally write serially, but that throws away the concurrent write capability of a multi-core platform; in essence it builds the order before writing. ROR instead builds the order after writing: each thread writes without worrying about ordering, and ROR dynamically selects a leader thread that collects the ChainLogs already written and establishes their order. Since ordering only touches ChainLog metadata, write performance improves significantly. When the current leader finishes, it steps down, and ROR dynamically selects another leader to continue the process, which is driven by a lock-free ring and a lock-free leader-selection algorithm; readers can refer to the paper for the details.

IV. Global Index and lightweight in-memory merge

GI (Global Index) maintains a mutable, indexed data layer in PM to replace the unordered level 0 data on disk. To keep the implementation simple, GI uses the same volatile index as the memtable and keeps the data itself in PM. Because the memtable also keeps its data in PM, moving data from the memtable into GI requires no copying: only the pointers in the GI index need to be updated to point at the data already written by the memtable. Note that GI could use any range index, including a persistent one to speed up system recovery. Since GI updates involve neither multi-key updates nor write transactionality, existing lock-free, log-free range indexes can be applied to GI directly.

In-memory merge: the in-memory merge moves data from the memtable into GI. GI uses the same index design as the memtable: key-value pairs are stored in PM, and each leaf holds a growable, versioned array for all versions of the same key. During the merge, a key that does not yet exist in GI is simply inserted; if it already exists, the versions of that key in GI are examined and, when necessary, old versions are garbage-collected to save memory. Because key-value pairs in PM are managed by Halloc, which does not support freeing individual key-value pairs, only the memtable's memory is released when the in-memory merge completes; the PM memory inside GI is released in bulk only after all key-value pairs in GI have been merged to disk.
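The sketch below illustrates this merge under simplifying assumptions (ordinary `std::map` stands in for the paper's index, and `PmRecord`, `merge_into_gi`, and `oldest_snapshot` are illustrative names): because both indexes only hold pointers to data already in PM, the merge moves pointers rather than copying values, and multi-version cleanup drops versions no live snapshot can still read.

```cpp
// Sketch of the in-memory merge from a frozen memtable index into GI.
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct PmRecord;                                  // value bytes resident in PM

struct VersionList {                              // versions kept oldest-first
    std::vector<std::pair<uint64_t, PmRecord*>> versions;
};

using MemIndex    = std::map<std::string, VersionList>;   // frozen memtable index
using GlobalIndex = std::map<std::string, VersionList>;   // GI (volatile, data in PM)

void merge_into_gi(const MemIndex& imm, GlobalIndex& gi, uint64_t oldest_snapshot) {
    for (const auto& [key, vlist] : imm) {
        auto& dst = gi[key];                      // insert the key if absent
        for (const auto& v : vlist.versions)      // memtable versions are newer,
            dst.versions.push_back(v);            // so ascending order is preserved
        // Multi-version cleanup: a version can be dropped once a newer version
        // is already visible to the oldest live snapshot.
        auto& vs = dst.versions;
        while (vs.size() > 1 && vs[1].first <= oldest_snapshot)
            vs.erase(vs.begin());
    }
    // PM space for the merged records is reclaimed later in bulk (zone
    // granularity), only after the whole GI has been flushed to disk.
}
```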

Snapshot: snapshotting allows in-memory merges to proceed while GI is being merged to disk, so the former is never blocked. To take a snapshot, GI freezes its current data and internally creates a new index to absorb subsequent merges from the memtable. This design avoids blocking foreground operations, but a query may now have to consult two indexes, which adds some query overhead.
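A small sketch of this "frozen plus active" arrangement, reusing the `GlobalIndex` and `VersionList` stand-ins from the previous sketch (the `GiSnapshot` name is mine): reads probe the active index first and fall back to the frozen one until the disk merge finishes.

```cpp
// Sketch: during a PM-to-disk merge, reads may consult two GI indexes.
#include <string>

struct GiSnapshot {
    GlobalIndex* active;    // receives ongoing memtable merges
    GlobalIndex* frozen;    // immutable, being compacted to disk (may be null)

    const VersionList* find(const std::string& key) const {
        if (auto it = active->find(key); it != active->end()) return &it->second;
        if (frozen) {
            if (auto it = frozen->find(key); it != frozen->end()) return &it->second;
        }
        return nullptr;
    }
};
```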

PM-to-disk merge: because the GI data being merged to disk is immutable and globally ordered, the PM-to-disk merge does not block foreground operations. Moreover, the global order makes it possible to partition the merge by key range and run it in parallel, accelerating the merge from PM to disk.

Data consistency: merging PM data to disk changes the database state, which could leave the state inconsistent if the system crashes mid-way. The paper handles this by maintaining a manifest log on disk to record state changes. Since the manifest log is not on the foreground write path, it does not affect write performance.

V. Halloc memory allocator

Halloc is a PM memory allocator specialized for LSM trees. It addresses the inefficiency and fragmentation problems of general-purpose PM allocators with three key techniques: object-pool-based memory reservation, application-friendly memory management, and unified address-space management. Its architecture is shown in Figure 10. Halloc obtains its own address space by creating a DAX file directly and manages its internal metadata itself. The address space is divided into four regions: Halloc metadata, subtable metadata, memtable metadata, and a number of zones used for actual memory allocation.

Figure 10: Overall architecture of the Halloc memory allocator

Object-pool-based memory reservation. Halloc reduces fragmentation in PM management by statically reserving fixed-size object pools whose address ranges do not overlap. Each object pool contains a metadata region that records object allocation, a persistent freelist that tracks idle objects, and an object region holding fixed-size objects whose size is specified explicitly when the pool is created. Operating on the persistent freelist involves several discontiguous PM writes larger than 8 bytes, so it risks data inconsistency. Traditional schemes such as PMDK and Mnemosyne use internal logging to make such operations transactional, but that reintroduces logging overhead. To eliminate it, the paper ensures the consistency of freelist operations with the following scheme.

Figure 11: Internal structure of the freelist

As shown in Figure 11, a freelist is stored contiguously in memory: a metadata region recording allocation, an index region that indexes idle objects, and an object region holding the objects themselves. Each object has an 8-byte index word whose highest bit marks whether the object has been persisted, which is what makes allocation and release atomic. The freelist exposes four interfaces: get obtains an object from the freelist; commit tells Halloc that the object has been initialized and can be removed from the freelist; check tests whether an object was persisted, so that a restart does not follow dangling object references; and release frees an object. The core idea is to set the persistence flag in the object's index word, so that leaked objects can be identified by scanning at restart.
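The sketch below shows a simplified version of this protocol (the real Halloc layout and recovery scan are more involved, and the volatile `free_ids` list here stands in for the persistent freelist): the only persistent state change per allocation is one 8-byte update of the object's index word, so no undo/redo log is needed.

```cpp
// Sketch of the get/commit/check/release protocol described above.
#include <cstddef>
#include <cstdint>
#include <vector>

void persist(const void* addr, size_t len);        // flush helper from earlier

constexpr uint64_t kCommittedBit = 1ull << 63;     // highest bit of the index word

struct FreeList {
    uint64_t* index;        // one 8-byte word per object, resident in PM
    char*     objects;      // fixed-size object region in PM
    uint64_t  obj_size;
    std::vector<uint64_t> free_ids;   // volatile free list, rebuilt on restart

    // get: hand out a free object. It is NOT yet marked committed, so if we
    // crash before commit() the restart scan treats it as leaked and reclaims it.
    uint64_t get() { uint64_t id = free_ids.back(); free_ids.pop_back(); return id; }

    // commit: the caller has finished initializing the object (and linked it
    // into its own persistent structure); set the bit so recovery keeps it.
    void commit(uint64_t id) {
        index[id] |= kCommittedBit;
        persist(&index[id], sizeof(uint64_t));      // one 8-byte atomic update
    }

    // check: used during recovery to ask whether an object was ever committed.
    bool check(uint64_t id) const { return index[id] & kCommittedBit; }

    // release: clear the bit and return the object to the free list.
    void release(uint64_t id) {
        index[id] &= ~kCommittedBit;
        persist(&index[id], sizeof(uint64_t));
        free_ids.push_back(id);
    }
};
```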

To recover an object pool at startup, Halloc first scans the freelist and marks its objects in a bitmap, then scans the index region to confirm whether each object is reachable from the freelist; unreachable objects are reclaimed. This adds some restart overhead, but in practice the scan is fast: in our experiments scanning millions of objects takes only a few milliseconds, which is negligible for the systems studied.

Application-friendly memory management. Halloc offers LSM trees two kinds of object-pool service: custom object pools and the zone object pool. The design exploits the LSM tree's append-only writes and bulk reclamation, which greatly simplify memory management. For custom object pools, as shown in Figure 10, Halloc maintains a memtable pool and a subtable pool that store the engine's memtable metadata and subtable metadata respectively. A subtable object contains a linked list (memlist) of all its memtable objects, of which the first is the active memtable and the rest are frozen memtables. Each memtable object indexes a bounded number of zone objects, and each zone object stores the memtable's actual data. The zone object pool is built into Halloc and lets applications manage memory in their own way; this is needed because a custom object pool can only hold a limited number of fixed-size objects. Since Halloc is not a general-purpose PM allocator, an application that needs to manage many variable-sized objects must implement its own memory management on top of the zone object pool.

Unified address-space management. To make it easy to manage volatile and persistent memory together, Halloc supports both persistent and volatile allocation within the address space of a single DAX file, which greatly simplifies the use of PM resources. Like libmemkind, Halloc relies on jemalloc to handle variable-sized volatile allocations; the difference is that Halloc uses zones as jemalloc's basic memory unit, i.e. jemalloc always obtains zone objects from the zone pool and subdivides them. Zones allocated this way never have commit called on them, so all such zone objects are reclaimed after a restart. One notable limitation is that a single volatile allocation cannot exceed the size of one zone, because the zone pool only guarantees address contiguity within a single zone. For larger allocations the user can split the request into several allocations, or, if the objects are of fixed size and limited number, allocate them statically from a custom object pool.

VI. Experimental evaluation

Experimental platform. The experiments use an Alibaba Cloud ecs.ebmre6p.26xlarge instance with two Intel(R) Xeon(R) Platinum 8269CY CPUs, 52 cores each, 104 cores in total. Each core has a 32KB L1 cache and a 1MB L2 cache, and all cores on a CPU share a 32MB last-level cache. The instance has 187GB of DRAM and 1TB of PM; the PM is split evenly between the two CPUs, i.e. 512GB per socket. The instance is also configured with 2TB of ESSD as cloud disk. In all experiments the PM is configured as two Linux devices, one per CPU, and the Linux kernel version is 4.19.81.

Parameter configuration. Unless otherwise specified, a single memtable is 256MB, the GI of a single subtable is at most 8GB, and level 1 of a single subtable is 8GB; the baseline system uses a 256MB level 0. All experiments use synchronous WAL, use direct I/O to bypass the page cache, and disable compression to measure the maximum performance of the systems.

1. Comprehensive evaluation

The first experiments use the standard YCSB benchmark. The database is preloaded with 800 million records evenly distributed over 16 subtables; each record has an 8-byte key and a 500-byte value, about 500GB of data in total. Four configurations are evaluated: (1) the baseline system with all data on ESSD (denoted XS); (2) the improved scheme with 200GB of PM managed by Halloc (denoted XP); (3) the baseline system with all data placed on the faster PM (denoted XS-PM); (4) the improved scheme with the data originally on ESSD placed on PM instead (denoted XP-PM). Configurations (1) and (2) are the standard configurations in practice, while (3) and (4) mainly evaluate the systems with the ESSD removed. Every experiment uses 32 client threads, a 50GB row cache and a 50GB block cache, and runs for 30 minutes so that compaction has time to run.

Figure 12: YCSB experimental results

The results are shown in Figure 12. For the write-intensive workload A with random requests, XP/XP-PM deliver 3.8x and 2.3x the performance of XS/XS-PM; for the write-intensive workload F with random requests, XP/XP-PM deliver 2.7x and 2.2x the performance of XS/XS-PM. Figure 12 also shows that XP's average access latency is 36% lower than XS's. Under a skewed workload (Zipf = 1), XP's performance is close to XS's, and XP-PM performs worse than XS-PM. These results show that, compared with the baseline, the proposed scheme achieves better overall performance and issues less disk I/O, but the gap between XP-PM and XS-PM is small, and under skew XP-PM even falls behind the baseline XS-PM. Note, however, that these configurations place all data in PM, which is too expensive to be used in practice.

For the read-intensive workloads (B, C, D, E): with workload B and random requests, XP/XP-PM deliver 1.7x and 1.2x the performance of XS/XS-PM, and with workload D 1.4x and 1.1x, with lower latency as well; average latency drops by 39% for workload B and 26% for workload D. This is mainly because XP does not write a WAL and therefore has lower write latency. Under skewed workloads the benefit of XP shrinks: for workload B, XP/XP-PM are only 1.1x and 1.5x faster than XS/XS-PM. For workloads C and E there are few writes, so almost all data has been compacted to ESSD, and XP/XP-PM perform similarly to XS/XS-PM.

Figure 12 (continued): System latency and CPU and I/O overhead

CPU and I/O consumption. Figure 12(c) shows CPU consumption and cumulative I/O while running the YCSB load phase and workload A. XP is more CPU-efficient, and its I/O volume when running workload A is 94% lower than the baseline's, mainly because XP uses a larger GI to buffer more updates in PM and therefore flushes less data to disk.

Figure 13: System latency and CPU and I/O overhead

Sensitivity to database size. To examine how the benefit of the improved system varies with database size, the experiment injects 100GB to 600GB of data and then runs workload D. As Figure 13 shows, when the database grows from 100GB to 600GB, the performance of the baseline XS drops by 88%, while XP drops by only 27%. This is mainly because workload D mostly reads the latest updates, which XP keeps in fast, persistent PM, whereas the baseline XS has to read from the slow disk every time the system is restarted for the test.

Figure 14: Results for a single LSM-tree instance (40GB data set)

Single LSM-tree instance test. To compare with recent schemes that use PM to improve LSM trees, SLM-DB and NoveLSM are chosen as baselines. Since neither supports running multiple LSM-tree instances in one database, a single subtable is used. The experiment uses four clients and loads 40GB of data. As Figure 14 shows, XP loads data much faster: 22x the throughput of SLM-DB and 7x that of NoveLSM. This is mainly because, although SLM-DB and NoveLSM use a persistent skip list as the memtable, they still rely on a WAL for transactional atomicity, and neither supports concurrent writes. SLM-DB uses a single-level structure with one global persistent B+ tree to index the data on disk; this improves reads, but writes must keep the disk and the persistent index in PM consistent, so write performance is low. NoveLSM only introduces a persistent memtable, so its improvement is limited (PS: not very novel).

Figure 15: TPC-C evaluation results

TPC-C performance. The improved scheme is integrated into MySQL as a storage engine plugin; an 80GB database is preloaded and TPC-C is then run for 30 minutes. As Figure 15 shows, XP improves TPS by 2x over XS and reduces p95 latency by 62%, mainly because XP avoids the WAL and has a larger PM to cache more data. However, XP's TPS fluctuates more than XS's, mainly because XP performs all-to-all compaction from level 0 to level 1, which causes harsher cache eviction. How to balance compaction against cache eviction is an important direction for future work.

2 Semi-persistent memtable evaluation

To evaluate the semi-persistent memtable of the improved scheme, all background flush and compaction operations are disabled, and ROR's batch size is set to 50 to minimize ROR's influence (batch = 50 is close to the performance limit of the PM hardware). The following index schemes are compared: (1) the DRAM-based skip list that is the baseline system's default memtable index (denoted SLM); (2) a persistent memtable based on FAST&FAIR (denoted FFM); (3) a semi-persistent memtable based on a variant of FPTree (denoted FPM; this experiment uses optimistic lock coupling (OLC) for concurrency, whereas the original FPTree uses HTM plus leaf locks); (4) the proposed scheme with index nodes in DRAM (denoted SPM-D); (5) the proposed scheme with index nodes in PM (denoted SPM-P). Schemes (4) and (5) probe the memtable's performance when PM is used non-persistently. FAST&FAIR and FPTree are persistent B+ trees optimized for PM; FPTree persists only leaf nodes, so it is also a semi-persistent index. Since FAST&FAIR and FPTree do not support variable-sized keys, the experiment adds run-time KV resolution for these two memtables, i.e. the index stores only fixed-size pointers to the KV pairs. The experiment inserts 30 million KV pairs with 8-byte keys and 32-byte values, about 1.5GB in total; inside the memtable the KV pairs become 17-byte keys and 33-byte values.

Figure 16: Memtable performance evaluation results

Insert performance: Figure 16(a)(b) shows write performance as the number of threads grows from 1 to 32. SPM-D and SPM-P perform almost identically on writes even though SPM-P puts the index nodes in slower PM, because the persistence overhead is the dominant cost. Compared with SLM/FFM/FPM, SPM-D is 5.9x/5.8x/8.3x faster for sequential writes and 2.9x/5.7x/6.0x faster for random writes. Even though SLM keeps its index in faster DRAM, SPM-D/SPM-P still outperform it clearly, mainly because SPM uses a radix-tree index with better read/write efficiency; and even though FPM also keeps its inner nodes in faster DRAM, its implementation here relies on run-time KV resolution, which loads the actual KV data from slower PM.

Lookup performance: Figure 16(c) shows point-read performance. SPM-D outperforms SLM/FFM/FPM by up to 2.7x/14x/16x respectively. For point reads, SPM uses prefix matching while SLM/FFM/FPM use binary search; for mostly short keys, a radix tree is clearly more efficient to read. Although FPM uses hash-based fingerprints in its leaf nodes to accelerate point reads, a memtable point read is in practice turned into a short range scan that fetches the latest version of the key, so FPM's fingerprints are of little help; FPM's leaf nodes also store entries out of order, trading some read efficiency for faster writes. SPM-D beats SPM-P on point reads mainly because SPM-P keeps its index nodes in slower PM, so lookups are limited by PM's read bandwidth. For range queries (Figure 16(d)), SPM-D and SPM-P are worse than SLM: although a radix tree in DRAM is generally not great at range queries, the experiment shows that performance is limited even more by PM read bandwidth; profiling shows that SPM-D spends 70% of its range-query time reading data from PM.

3 ROR evaluation

ROR mainly affects the write performance of the system. To reduce interference from background tasks, all flush and compaction operations are disabled, and memtable indexing is turned off. Each thread writes 1 million 24-byte KV pairs. The experiment varies the number of threads and the batch size to evaluate their impact on write performance.

Figure 17: ROR algorithm evaluation results

Impact of batch size: in the experiment of Figure 17(a), the thread count is fixed at 32 and the batch size is varied. Growing the batch size from 1 to 90 increases write throughput by 49x, while average and p99 latency increase by 1.3x and 1.7x respectively. At batch size 90 the write throughput even exceeds the PM hardware's throughput for random 24-byte writes, mainly because ROR always tries to write sequentially. From batch size 50 to 90 the gains grow slowly while latency rises sharply, because the PM hardware is close to saturation; beyond 90, throughput actually drops, because an overly large batch causes stalls in ROR that hurt write throughput.

Effect of the number of threads: with the batch size fixed at 50, the thread count is varied from 1 to 64. Figure 17(b) shows that throughput scales linearly from 1 to 16 threads but grows slowly beyond 16: going from 16 to 64 threads raises throughput to 1.1x while p99 latency rises to 2.9x, mainly because more concurrent writers increase contention on the PM hardware. In practice, choosing the right thread count and batch size depends on the specific hardware and workload.

4 Global Index evaluation

Figure 18: Global Index (GI) evaluation results

In this experiment, compaction from level 0 to level 1 is disabled to evaluate the advantage of GI over XS's unordered level 0; XS-PM denotes the baseline with its level 0 and WAL placed in PM. The experiment first writes KV pairs of random sizes between 64 bytes and 4KB, 50GB in total, and then measures random point reads and range queries. Figure 18(a) shows that XP beats XS and XS-PM for random writes across all KV sizes. In Figure 18(b)(c) the gap is much larger: XP's random point reads are 113x faster than XS-PM and its random range queries 21x faster. The main reason is that, with level 0-to-level 1 compaction disabled, a large number of unordered data blocks pile up at level 0 (more than 58 were observed in the experiment), whereas XP's GI is globally ordered and therefore fast to query.

Another experiment uses 32 client threads with a 1:1 read/write ratio under high load, running for 10 minutes to observe how performance evolves over time. Figure 18(d) shows that XP outperforms XS/XS-PM by up to 15x and is far more stable: over the run, the performance of XS and XS-PM degrades by 85%, while XP degrades by only 35%. Even though XS-PM keeps its data in faster PM (using PM as an ordinary disk), it still lags behind XP because level 0 accumulation still hurts it badly, whereas XP's globally ordered GI and more efficient in-memory merge greatly reduce the impact of level 0 accumulation.

5 Halloc evaluation

Figure 19: Halloc evaluation results

This experiment compares Halloc with Ralloc and PMDK's pmemobj to evaluate persistent memory allocation performance. Halloc delegates non-persistent allocation to jemalloc, so its volatile-allocation performance is similar to such approaches; readers can refer to [4] for more on allocators that use PM as non-persistent memory. The experiment runs single-threaded and measures the latency of a single allocation over 1 million allocations, with object sizes from 128 bytes to 16KB. Since Halloc is not a general-purpose allocator, malloc-style allocation is emulated by allocating from an object pool (halloc pool) and from zones (halloc zone). Figure 19 shows that both halloc pool and halloc zone allocate in under 1 microsecond in all tests, mainly because Halloc issues only one flush and one fence instruction per allocation. Halloc is, however, designed only for LSM-tree based systems and is not as general-purpose as Ralloc or pmemobj.

6 Restart recovery performance

Figure 20: Restart recovery performance

To evaluate recovery performance, the experiment writes 32GB of data with 8-byte keys and 500-byte values, about 70 million KV pairs in total. GI uses a non-persistent index and keeps all of the data, so the entire GI index must be rebuilt at restart, which is equivalent to recovering a 32GB memtable. For the baselines XS and XS-PM (WAL placed in PM), flush is disabled so that at least 32GB of valid WAL remains. Since XP can rebuild its indexes in parallel, the number of recovery threads is increased from 1 to 104 (all CPU cores). Figure 20 shows that XP restarts in about a second thanks to parallel index rebuilding, whereas XS and XS-PM take several minutes, largely because the baseline system only supports single-threaded recovery. Because XP recovers in parallel, all CPU resources can be used to speed up startup. In real deployments the memtable is usually far smaller than 32GB, so recovery can finish in well under a second. Such fast recovery could change the current primary/standby approach to high availability, potentially making the standby node unnecessary and halving the number of ECS instances.

VII. Postscript

This work is only a small step toward using PM and other new hardware to optimize LSM-tree based OLTP storage engines. Building a usable industrial storage engine involves many other aspects, such as reliability, availability, stability and cost-effectiveness, and is largely a process of balancing these factors. For the scheme in this paper, the following problems still need to be solved before industrial deployment.

Data reliability. For a database system, data reliability is paramount. Although PM provides persistent byte addressability, it still has issues that affect reliability, such as device write wear and hardware failures. In cloud database instances, the traditional persistent storage layer achieves highly reliable storage through multiple replicas and distributed consensus protocols, and the same problem must be solved when PM serves as persistent storage. A promising direction is a highly reliable distributed persistent memory pool built on PM, but how to design such a pool and what I/O characteristics it would have remain open questions. From an industrial perspective, optimizing persistent indexes for single-machine PM hardware in an OLTP storage engine may not, by itself, be that significant.

PM memory efficiency. In this paper PM can be used both persistently and non-persistently, but with a fixed PM capacity, deciding how much space to devote to each purpose deserves further study, for example adapting the split to the workload: give more PM to persistent storage under write-intensive loads, and more to non-persistent uses such as caching under read-intensive loads.

Performance jitter. For LSM-tree based storage engines a long-standing headache is performance jitter, and an important cause is the massive cache invalidation triggered by background compaction. In distributed settings this can be alleviated with smart caching [5]. On a single machine, could a dedicated cache area be allocated in PM to temporarily hold the pre-compaction data and then replace it in the cache gradually?

Extended reading

[1] J. Yang, J. Kim, M. Hoseinzadeh, J. Izraelevitz, and S. Swanson, “An empirical guide to the behavior and use of scalable persistent memory,” in Proceedings of the 18th USENIX Conference on File and Storage Technologies (FAST ’20), 2020, pp. 169–182.

[2] B. Daase, L. J. Bollmeier, L. Benson, and T. Rabl, "Maximizing Persistent Memory Bandwidth Utilization for OLAP Workloads," in Proceedings of the ACM SIGMOD International Conference on Management of Data, 2021.

[3] G. Huang et al., “X-Engine: An optimized storage engine for large-scale e-commerce transaction processing,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, Jun. 2019, pp. 651–665.

[4] D. Waddington, M. Kunitomi, C. Dickey, S. Rao, A. Abboud, and J. Tran, “Evaluation of intel 3D-Xpoint NVDIMM technology for memory-intensive genomic workloads,” in ACM International Conference Proceeding Series, Sep. 2019, pp. 277–287.

[5] M. Y. Ahmad and B. Kemme, "Compaction management in distributed key-value datastores," Proc. VLDB Endow., vol. 8, no. 8, pp. 850–861, 2015.

This article is original content of Alibaba Cloud and may not be reproduced without permission.