[Serial] How to master the core technology of openGauss database? Secret three: handle storage technology (7)


Table of contents

openGauss database SQL engine

openGauss database executor technology

openGauss storage technology

1. Overview of openGauss storage

2. openGauss row storage engine

3. openGauss column storage engine

4. openGauss memory engine

I. Compatibility design of the memory engine

II. Memory engine indexes

III. Concurrency control of the memory engine

IV. Memory management of the memory engine

V. Persistence of the memory engine

openGauss transaction mechanism

openGauss database security

openGauss storage technology

4. openGauss memory engine

III. Concurrency control of the memory engine
The memory engine's concurrency control mechanism uses OCC (Optimistic Concurrency Control), which delivers excellent concurrent performance in scenarios where conflicts on the operated data are rare.

The transaction cycle and concurrency control component structure of the memory engine are shown in Figure 42.


Figure 42 The transaction cycle of the memory engine and the structure of the concurrency control component

It is worth explaining here why the memory engine's data organization is a nearly lock-free design.

Beyond the lock-free mechanisms of Masstree itself, described above, the memory engine's transaction processing flow further minimizes concurrency conflicts.

During a transaction, each worker thread copies every record it needs to read into local memory and saves it in its read set; all computation within the transaction is then performed on this local data. The results of the corresponding operations are stored in the worker thread's local write set. Only when the transaction completes does the worker thread enter the commit attempt, performing validation on the read set and write set; if validation succeeds, the global versions of the records in the write set are updated.
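The read-set/write-set flow described above can be sketched in a few lines. This is a toy model for illustration, not openGauss code; `Record`, `Transaction`, and all method names are invented, and the per-record `threading.Lock` merely stands in for the lock bit in the row header.

```python
import threading

class Record:
    """Globally visible record: a value plus a version (TID) and a lock bit."""
    def __init__(self, value):
        self.value = value
        self.version = 0               # bumped on every committed update
        self.latch = threading.Lock()  # stands in for the header lock bit

class Transaction:
    def __init__(self):
        self.read_set = {}   # record -> version observed at read time
        self.write_set = {}  # record -> new value computed locally

    def read(self, rec):
        # Copy the record into transaction-local memory.
        self.read_set[rec] = rec.version
        return rec.value

    def write(self, rec, new_value):
        # Buffer the update locally; the global version is untouched.
        self.write_set[rec] = new_value

    def commit(self):
        # Validation: latch only the records being written, then check
        # that nothing we read has changed since we read it.
        locked = sorted(self.write_set, key=id)  # fixed order avoids deadlock
        for rec in locked:
            rec.latch.acquire()
        try:
            for rec, seen_version in self.read_set.items():
                if rec.version != seen_version:
                    return False  # conflict with a concurrent commit: abort
            for rec, new_value in self.write_set.items():
                rec.value = new_value
                rec.version += 1  # install the new global version
            return True
        finally:
            for rec in locked:
                rec.latch.release()
```

Note that the global versions are touched only during validation, and only write-set records are ever latched; a transaction that validated against a stale version simply aborts and can be retried.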

This process confines the transaction's interaction with global versions to the validation step, so none of the transaction's other operations affect concurrent transactions. Moreover, even validation does not require locks in the traditional sense; it needs only the lock bit carried in each record's header. These design choices all serve to minimize the resource contention and conflicts that concurrency can produce, and to use the CPU cache more efficiently.

At the same time, the read set and write set make it straightforward to support multiple isolation levels: different isolation levels are obtained by applying different validation rules to the read set and write set during the validation stage. By checking the lock bit and the TID structure in the row header of each record's global version in these two sets, a transaction can detect read/write conflicts with other transactions and decide, under its isolation level, whether it can commit or must abort. In addition, because Masstree's Trie nodes carry version numbers, structural changes to Masstree (inserts and deletes) change the version numbers of the affected Trie nodes. By maintaining a node set of the Trie nodes touched by a range query and re-validating it during the validation stage, a transaction can easily check at commit time whether that subset has changed, detecting phantoms with very low time complexity.
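The node-set validation for range queries can likewise be illustrated with a toy model (all names invented; a real Masstree node is far more elaborate): any structural change bumps the node's version, so re-checking the recorded versions at commit time exposes phantoms in time proportional to the size of the node set.

```python
class TrieNode:
    """Simplified index node: structural changes bump its version."""
    def __init__(self, keys):
        self.keys = sorted(keys)
        self.version = 0

    def insert(self, key):
        self.keys.append(key)
        self.keys.sort()
        self.version += 1  # structural change, visible to validators

    def range_scan(self, lo, hi):
        return [k for k in self.keys if lo <= k <= hi]

class RangeQueryTxn:
    def __init__(self):
        self.node_set = {}  # node -> version observed during the scan

    def scan(self, node, lo, hi):
        # Remember the node version alongside the range-query result.
        self.node_set[node] = node.version
        return node.range_scan(lo, hi)

    def validate(self):
        # O(|node_set|): any version mismatch means an insert/delete
        # touched the scanned range -> possible phantom -> abort.
        return all(node.version == seen
                   for node, seen in self.node_set.items())
```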

IV. Memory management of the memory engine
Since the memory engine's data resides entirely in memory, data can be organized per record rather than in the page-based form, which gives it a great advantage in the granularity of conflicts on data operations. Freed from the constraints of segment-page organization, it needs no shared buffer area for caching and interacting with the disk, and its design need not optimize for I/O and disk performance (for example, the height of a B+-tree index, or the random read/write problems of HDDs (Hard Disk Drives)); data reads and operations can therefore be better optimized and parallelized.

Because the data lives entirely in memory, managing and controlling memory resources is especially important: the memory allocation mechanism and its implementation greatly affect the memory engine's computational throughput. The memory engine's memory management is divided into three layers, as shown in Figure 43.


Figure 43 Schematic diagram of memory management of the memory engine

The three-tier design is described below:

(1) The first layer is the memory engine itself, which includes temporary memory usage and long-term memory usage (data storage).

(2) The second layer is the object memory pool, which mainly provides memory for first-layer objects such as tables, indexes, row records, key values, and sentinels (row pointers). This layer requests large chunks of memory from the layer below and then performs fine-grained allocation.

(3) The third layer is the resource management layer, which is mainly responsible for interacting with the operating system and actually requesting memory. To reduce the call overhead of memory requests, the unit of interaction is generally about 2 MB. This layer also provides memory prefetching and pre-reservation.

The third layer is actually very important, mainly because:

(1) Memory prefetching can effectively reduce memory allocation overhead and improve throughput.
(2) Interacting with the NUMA library is very expensive; invoking it directly on the allocation path would severely impact performance.
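The three-layer split can be sketched as follows. This is an illustrative model under assumed simplifications (a Python `bytearray` stands in for an OS memory chunk; all class names are invented): layer 3 prefetches ~2 MB chunks so that layer 2's fine-grained allocations rarely pay the cost of a real OS request.

```python
CHUNK = 2 * 1024 * 1024  # layer 3 requests memory from the OS in ~2 MB units

class ResourceLayer:
    """Layer 3: talks to the OS; prefetches chunks to amortize call overhead."""
    def __init__(self, prefetch=2):
        self.chunks = [bytearray(CHUNK) for _ in range(prefetch)]
        self.os_calls = prefetch  # count of (simulated) OS memory requests

    def get_chunk(self):
        if not self.chunks:
            self.chunks.append(bytearray(CHUNK))  # fall back to the OS
            self.os_calls += 1
        return self.chunks.pop()

class ObjectPool:
    """Layer 2: carves large chunks into fine-grained objects (rows, keys...)."""
    def __init__(self, resource, obj_size):
        self.resource = resource
        self.obj_size = obj_size
        self.chunk = None
        self.offset = 0

    def alloc(self):
        # Bump-pointer allocation inside the current chunk; fetch a new
        # chunk from layer 3 only when the current one is exhausted.
        if self.chunk is None or self.offset + self.obj_size > len(self.chunk):
            self.chunk = self.resource.get_chunk()
            self.offset = 0
        view = memoryview(self.chunk)[self.offset:self.offset + self.obj_size]
        self.offset += self.obj_size
        return view
```

Layer 1 (the engine itself) would simply call `alloc()` for each row record, key, or sentinel it creates; thousands of such allocations are served from a single prefetched chunk.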
The memory engine adapts to the NUMA architecture differently for short-term and long-term memory usage. Short-term usage typically belongs to a transaction or session itself; such memory should generally be allocated locally on the NUMA node of the CPU core processing the session, so that the transaction's own memory accesses incur little overhead. Long-term memory usage, such as the storage of tables, indexes, and records, should use interleaved memory in the NUMA sense, spread as evenly as possible across the NUMA nodes, to prevent the performance degradation caused by excessive memory consumption on a single NUMA node.

Short-term memory usage, i.e., NUMA-local memory, also has a very important property: this memory is used only by the transaction itself (for example, copied read data and pending updates), so concurrency control on it can be avoided entirely.
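The placement policy described above can be condensed into a small sketch (names invented; real code would go through the NUMA library, which is exactly the expensive interaction that layer 3 hides): short-term allocations stay on the session's node, while long-term allocations are interleaved round-robin across all nodes.

```python
import itertools

class NumaPolicy:
    """Placement policy sketch: local for short-term, interleaved for long-term."""
    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self._rr = itertools.cycle(range(num_nodes))

    def place_short_term(self, session_cpu_node):
        # Transaction/session memory: allocate on the node whose CPU core
        # is processing this session, keeping accesses node-local.
        return session_cpu_node

    def place_long_term(self):
        # Table/index/record storage: interleave evenly across all nodes
        # so no single node's memory is exhausted.
        return next(self._rr)
```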

V. Persistence of the memory engine
The memory engine relies on a synchronous WAL mechanism and checkpoints to guarantee data durability. Through a WAL mechanism compatible with openGauss (i.e., the transaction log), it both persists data and keeps data synchronized between the primary and standby nodes, providing high reliability with RPO = 0 and high availability with a small RTO.

The persistence mechanism of the memory engine is shown in Figure 44.


Figure 44 Persistence mechanism of memory engine

As the figure shows, the memory engine's manager calls openGauss's Xlog module: the WAL writer thread (log flushing thread) writes the persistent log to disk, while wal_sender (the transaction-log sending thread) ships it to the standby, where wal_receiver (the transaction-log receiving thread) receives it, writes it to disk, and replays it.

The memory engine's checkpoint is likewise triggered according to openGauss's own Checkpointer mechanism.

openGauss implements its checkpoint mechanism by flushing the dirty pages in the shared_buffer (shared buffer) and writing a special checkpoint log at checkpoint time. Because the memory engine stores everything in memory and has no concept of dirty pages, it implements a CALC-based checkpoint mechanism instead.

This relies mainly on the concept of partial multi-versioning: when a checkpoint command is issued, two versions are used to track each record: the live version, i.e., the latest version of the record, and the stable version, i.e., the version the record had when the checkpoint was issued and a virtual consistency point was formed. Transactions committed before the consistency point must update both the live and stable versions, while transactions after the consistency point update only the live version and leave the stable version unchanged. When no checkpoint is in progress, the stable version is actually empty, meaning the stable and live versions hold the same value at that time; only during a checkpoint, when transactions after the consistency point make updates, are both versions needed, so that the checkpoint can run in parallel with normal transaction processing.

The implementation of CALC (Checkpointing Asynchronously using Logical Consistency) has five stages:

(1) Rest phase: no checkpoint is in progress, and each record stores only its live version.
(2) Prepare phase: the system enters this phase as soon as a checkpoint is triggered. In this phase, transactions that modify data still update the live version, but before the update, if the stable version does not yet exist, the live version's data is first saved into the stable version. When such a transaction finishes its updates, before releasing its locks it performs a check: if the system is still in the prepare phase, the stable version it just created can be removed; otherwise, if the system has already left the prepare phase for the next one, the stable version is preserved.
(3) Resolve phase: once every transaction that started before the prepare phase has committed or rolled back, the system enters the resolve phase, which means a virtual consistency point has been established. Changes committed before this point will be reflected in this checkpoint.
(4) Capture phase: once all transactions from the prepare phase have finished, the system enters the capture phase. A background thread writes the checkpoint version of each record (the stable version if it exists, otherwise the live version) to disk, and then deletes the stable versions.
(5) Complete phase: after the checkpoint has been written and all transactions from the capture phase have finished, the system enters the complete phase, and the behavior of transactional writes returns to the same default state as in the rest phase.
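The dual-version bookkeeping behind these stages can be modeled in a few lines. This sketch collapses the prepare/resolve subtleties (in particular the prepare-phase rule for discarding a just-created stable version) into a single "in checkpoint" flag; all class and method names are invented for illustration.

```python
class CalcRecord:
    """Partial multi-versioning: a live version plus an optional stable one."""
    def __init__(self, value):
        self.live = value
        self.stable = None  # empty outside a checkpoint (stable == live then)

class CalcCheckpointer:
    """Toy CALC model: one flag stands in for the five phases."""
    def __init__(self, records):
        self.records = records       # name -> CalcRecord
        self.in_checkpoint = False   # True once the consistency point forms

    def begin_checkpoint(self):
        # Virtual consistency point established (resolve phase).
        self.in_checkpoint = True

    def update(self, name, new_value):
        rec = self.records[name]
        # After the consistency point, preserve the point-in-time value
        # before overwriting; before it, only the live version is kept.
        if self.in_checkpoint and rec.stable is None:
            rec.stable = rec.live
        rec.live = new_value

    def capture(self):
        # Write out the stable version where present, else the live one,
        # then drop all stable versions (capture -> complete).
        image = {n: (r.stable if r.stable is not None else r.live)
                 for n, r in self.records.items()}
        for r in self.records.values():
            r.stable = None
        self.in_checkpoint = False
        return image
```

Because updates after the consistency point touch only the live version, normal transactions proceed in parallel while the checkpoint image stays consistent.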
CALC has the following advantages:

(1) Low memory consumption: each record keeps at most two versions during a checkpoint. If a record's stable version equals its live version, or no checkpoint is in progress, only the data itself is physically stored in memory.
(2) Low implementation cost: compared with the checkpoint mechanisms of other in-memory databases, it has less impact on the system as a whole.
(3) Use of a virtual consistency point: there is no need to block the entire database's business and processing flow to reach a physical consistency point; partial multi-versioning achieves a virtual consistency point instead.
This concludes the "openGauss storage technology" chapter; the next article will begin the "openGauss transaction mechanism" chapter, so stay tuned…