Introduction to the storage engine
A database storage engine is the underlying software component of a database. The database management system (DBMS) creates, queries, modifies, and deletes data through the storage engine. Different storage engines provide different storage mechanisms, indexing techniques, locking granularity, and other features, so choosing a particular storage engine also gives you its specific database capabilities.
As the foundation of a database, a mature storage engine must account for many factors, including read and write efficiency and how to operate at the lowest cost and risk. After weighing these factors against the characteristics of a key-value database, we chose the txhdb storage engine, developed by Tencent, to store the data of tcapsusdb. The format and advantages of the txhdb storage engine are introduced below.
Storage engine format
The data file of tcapsusdb can be roughly divided into three areas: the header area, the memory-mapped area, and the file access area, as shown in the figure below. The memory-mapped area and the file access area store the actual data.
- Header area: stores metadata, statistics, hash buckets, free block chain headers, extended data, and other information.
- Memory-mapped area: when the data file is loaded, this part of the file is mapped into the memory address space with mmap, and the area is read and written as ordinary memory, which indirectly gives the effect of an in-memory cache. This area sits at the front of the data file, and its default size is 1 GB.
- File access area: this area follows the memory-mapped area. Data in this area is read and written through the ordinary file read and write interfaces.
A more detailed format is as follows:
The whole file is divided into a header control area and a data area;
When the data file is opened, a file mapping object is created starting from the beginning of the file. For write operations, at least the header control area is included in the memory-mapped range;
Key-value records are organized in a hash table. Two hash collision resolution strategies are available: a balanced binary tree and a linked list. The strategy is selected by a parameter when the engine file is created. The balanced binary tree is built on a second hash value computed from the key (called the secondary hash);
When data lies outside the mmap area, it is accessed with pread / pwrite, using its offset from the start of the file.
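The split between the two access paths can be sketched as follows. This is a minimal illustration rather than txhdb's actual code: the class name `RegionFile`, the sizes, and the rule that a record straddling the boundary goes through pread / pwrite are all assumptions made for the example.

```python
import mmap
import os
import tempfile


class RegionFile:
    """Toy data file: the first `mapped_size` bytes are memory-mapped and
    everything after that is accessed with pread/pwrite (hypothetical layout)."""

    def __init__(self, path, mapped_size, total_size):
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        os.ftruncate(self.fd, total_size)
        self.mapped_size = mapped_size
        # Map only the front of the file, starting at offset 0.
        self.map = mmap.mmap(self.fd, mapped_size)

    def write(self, offset, data):
        if offset + len(data) <= self.mapped_size:
            # Entirely inside the mapped area: plain memory write.
            self.map[offset:offset + len(data)] = data
        else:
            # Outside (or straddling) the mapped area: positional file write.
            os.pwrite(self.fd, data, offset)

    def read(self, offset, length):
        if offset + length <= self.mapped_size:
            return bytes(self.map[offset:offset + length])
        return os.pread(self.fd, length, offset)

    def close(self):
        self.map.close()
        os.close(self.fd)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        f = RegionFile(os.path.join(tmp, "demo.db"),
                       mapped_size=4096, total_size=16384)
        f.write(100, b"hot")     # lands in the memory-mapped area
        f.write(8192, b"cold")   # lands in the file access area
        result = (f.read(100, 3), f.read(8192, 4))
        f.close()
        return result
```

Both paths address data by its offset from the start of the file, so a caller does not need to know which area a record lives in.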
The head control area is divided into the following parts:
- Basic control information area: contains magic, version information, file type, record alignment parameter, free block parameter, compression attribute, bucket number, record number, file size, first record location, bucket information, free block information, etc.
- Hash bucket information area: stores the storage offset of the first record in each bucket;
- Memory free block header: head of the free data block linked list for the mmap area of this file;
- File free block header: head of the free data block linked list for the file access area;
- LRU information area: the LRU chain that tracks access to data records in the mmap area;
- Extended area: a storage area that is opaque to txhdb, through which tcapsvr stores data table description information.
Free block management
Data records vary in size, and size changes or deletions during storage leave free blocks in the file. To reduce the cost of organizing and reusing free blocks of different sizes, txhdb stores data records in block space. Block space is aligned according to the apow parameter, which defines the minimum data block size (2^apow bytes). The whole storage space is organized as an array of blocks whose sizes grow linearly, level by level, in multiples of the minimum alignment unit. The number of data blocks is determined by the fpow parameter. With apow = 8 and fpow = 10, the free data blocks are laid out as shown in the following diagram:
The actual key or value data is stored in one or more free blocks of a given level:
- Free memory blocks are used first, followed by file blocks
- In memory, contiguous blocks are used first, and then discrete blocks
- In the file, only contiguous blocks can be used
If all records are small, the file may end up with too many discrete records; in that case the data can be defragmented periodically by relocating it.
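The alignment and allocation order described in this section can be sketched as follows. This is an illustration under stated assumptions, not txhdb's implementation: the names `aligned_size`, `block_level`, and `FreeBlocks` are hypothetical, level k is assumed to hold blocks of k minimum units, and the fallback to discrete blocks and the role of fpow are left out.

```python
def aligned_size(size, apow):
    """Round `size` up to a multiple of the minimum block, 2**apow bytes."""
    unit = 1 << apow
    return (size + unit - 1) // unit * unit


def block_level(size, apow):
    """Assumed reading of 'grows linearly': level k holds blocks of k units."""
    return aligned_size(size, apow) >> apow


class FreeBlocks:
    """Per-level free lists, one set for memory and one for the file."""

    def __init__(self, apow):
        self.apow = apow
        self.memory = {}  # level -> offsets of free blocks in the mmap area
        self.file = {}    # level -> offsets of free blocks in the file area

    def release(self, offset, size, in_memory):
        pool = self.memory if in_memory else self.file
        pool.setdefault(block_level(size, self.apow), []).append(offset)

    def allocate(self, size):
        level = block_level(size, self.apow)
        # Free memory blocks take priority over file blocks.
        for name, pool in (("memory", self.memory), ("file", self.file)):
            if pool.get(level):
                return name, pool[level].pop()
        return None  # nothing free at this level
```

With apow = 8 the minimum block is 256 bytes, so a 300-byte record and a 500-byte record both round up to 512 bytes and can reuse each other's freed blocks.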
Key value separation
Because data records are stored in a hash table, every read or write must access the record's key. txhdb separates keys from values to make data retrieval more efficient.
The key and the value are stored separately, in a key node and a value node. The hash value maps to the key node, and the key node in turn maps to the value node. Key nodes are stored in memory by preference, while value nodes may be stored in memory or on disk.
The details are as follows:
- The key of a record may consist of multiple blocks: one head block and multiple split blocks. Each block records the offset of the next block, and the key head block also records the offset of the value head block.
- The value of a record may also consist of multiple blocks: one head block and multiple split blocks. The offset of the value is recorded in the key head block.
- The hash bucket records the offset of the key, and colliding records are linked through the left and right fields of the key head block to form the linked list or binary tree.
- The width used by online services is 32, that is, 4 B. The default minimum key head block is therefore 64 B (the minimum value of apow is 6, and 2^6 = 64 B). The engine's own information occupies 32 to 33 B, leaving 31 to 32 B for the service. Based on this, services can design compact keys that occupy as few blocks as possible.
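The lookup path from bucket to key node to value node can be sketched as follows. This is a simplified model, not the on-disk format: `ToyStore` and `KeyNode` are hypothetical names, integer dictionary keys stand in for file offsets, `zlib.crc32` and `zlib.adler32` stand in for the primary and secondary hash functions, and the tree is not rebalanced.

```python
import zlib
from dataclasses import dataclass


@dataclass
class KeyNode:
    key: bytes
    hash2: int          # secondary hash, orders nodes within a bucket
    value_offset: int   # "offset" of the value node
    left: int = 0       # "offset" of the left child, 0 = none
    right: int = 0      # "offset" of the right child, 0 = none


class ToyStore:
    def __init__(self, bucket_count=4):
        self.buckets = [0] * bucket_count  # offset of first key node, 0 = empty
        self.key_nodes = {}                # offset -> KeyNode
        self.value_nodes = {}              # offset -> value bytes
        self.next_offset = 1

    def _alloc(self):
        off = self.next_offset
        self.next_offset += 1
        return off

    def put(self, key, value):
        h2 = zlib.adler32(key)
        idx = zlib.crc32(key) % len(self.buckets)
        off, parent, side = self.buckets[idx], None, None
        while off:
            node = self.key_nodes[off]
            if (h2, key) == (node.hash2, node.key):
                self.value_nodes[node.value_offset] = value  # overwrite
                return
            side = "left" if (h2, key) < (node.hash2, node.key) else "right"
            parent, off = node, getattr(node, side)
        voff = self._alloc()
        self.value_nodes[voff] = value
        koff = self._alloc()
        self.key_nodes[koff] = KeyNode(key, h2, voff)
        if parent is None:
            self.buckets[idx] = koff      # first record in this bucket
        else:
            setattr(parent, side, koff)   # link collision via left/right

    def get(self, key):
        h2 = zlib.adler32(key)
        off = self.buckets[zlib.crc32(key) % len(self.buckets)]
        while off:
            node = self.key_nodes[off]
            if (h2, key) == (node.hash2, node.key):
                return self.value_nodes[node.value_offset]
            off = node.left if (h2, key) < (node.hash2, node.key) else node.right
        return None
```

A lookup only touches key nodes until the matching key is found, which is why keeping key nodes in memory pays off even when the value node lives on disk.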
Multi-level LRU chain for data heat management
To record which data is accessed most often, a multi-level LRU chain tracks the data in the mmap area. The number of LRU levels can be configured through a parameter. A multi-level LRU is used instead of a single-level LRU chain so that eviction can take the recent access count into account, in addition to the most recent access time.
- Multi-level LRU that considers both the most recent access time and the access count
- The access count is increased on read and write access and decreased during the location scan
- Records in the LRU chain with an access count of 1 are evicted first
- Swap out condition: the remaining memory is lower than a certain threshold
- Swap in condition: the remaining memory is higher than a certain threshold
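A multi-level LRU along these lines can be sketched as follows. This is one possible reading of the description, not txhdb's implementation: the class `MultiLevelLRU` is hypothetical, the access count is capped at the number of levels so that each count maps to one chain, and the memory threshold checks that trigger swap-out and swap-in are omitted.

```python
from collections import OrderedDict


class MultiLevelLRU:
    def __init__(self, levels=3):
        # chains[0] holds records with count 1, chains[1] count 2, and so on.
        self.chains = [OrderedDict() for _ in range(levels)]
        self.counts = {}

    def _place(self, key, count):
        old = self.counts.get(key)
        if old is not None:
            del self.chains[old - 1][key]
        self.counts[key] = count
        self.chains[count - 1][key] = None  # appended at the most-recent end

    def access(self, key):
        """Read or write access: raise the count (capped at the top level)."""
        count = min(self.counts.get(key, 0) + 1, len(self.chains))
        self._place(key, count)

    def scan(self):
        """Periodic scan: lower every count, but never below 1."""
        for key, count in list(self.counts.items()):
            if count > 1:
                self._place(key, count - 1)

    def evict(self):
        """Evict from the count-1 chain first, least recently used first."""
        for chain in self.chains:
            if chain:
                key, _ = chain.popitem(last=False)
                del self.counts[key]
                return key
        return None
```

For example, after the accesses a, b, a, c the record a has count 2 while b and c have count 1, so eviction returns b, then c, and only then a, even though a was not the most recently touched record.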
We have now covered the basic structure of the storage engine behind tcapsusdb, a distributed NoSQL database, and we will explore more of tcapsusdb's design in future articles.