[Technical Summary] From Hash Index to LSM Tree

Time: 2020-09-19

Abstract: This article starts from the implementation of the simplest key-value database, then applies indexing techniques to the bottlenecks encountered along the way, with the goal of building a deeper understanding of database indexing technology.

Preface

The database is one of the most commonly used components in a software system. Whether it is a large and complex e-commerce platform or a simple personal blog, it relies on a database, either to store massive amounts of data or to keep a few simple pieces of state. In general, databases are divided into relational databases and non-relational databases (also known as NoSQL databases); a typical representative of the former is MySQL, and a typical representative of the latter is HBase. Relational or not, a database cannot do without two basic functions: (1) storing data; (2) querying data. In short, when you hand your data to the database, it keeps it and returns it to you later when you ask for it.

Around these two basic functions, databases employ many techniques to optimize them, the most widely known of which is database indexing. An index is a data structure that greatly improves query (read) performance at the expense of a small amount of storage (write) performance. There are many types of indexes. The hash index is the simplest and most efficient, but due to its own limitations it is not widely used as the primary index in database systems. The most common indexing technique today is the B/B+ tree index, which is widely used in relational databases, mainly in read-heavy, write-light scenarios. With the rise of NoSQL databases, the LSM tree (Log-Structured Merge-Tree) has also become popular, carried forward by Google's Bigtable paper. Strictly speaking, the LSM tree is not an index in the traditional sense; it is more of a design approach, mainly used in write-heavy, read-light scenarios.

This series of articles starts from the implementation of the simplest key-value database, then applies the indexing techniques mentioned above to the bottlenecks encountered along the way, so as to build a deeper understanding of database indexing technology.

The simplest database

Martin Kleppmann gives a simple database implementation in his book Designing Data-Intensive Applications:

#!/bin/bash
db_set() {
  echo "$1,$2" >> database
}

db_get() {
  grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
}

This shell script of fewer than 10 lines implements a simple key-value database. It has two functions, db_set and db_get: the former provides data storage, the latter data querying. The database uses a plain text file (the database file) for storage; each record holds one key-value pair, with the key and value separated by a comma (,).

Using the database is also very simple: calling db_set key value stores the key and its value in the database, and calling db_get key returns the value currently associated with the key:

$ db_set 123456 '{"name":"London","attractions":["Big Ben","London Eye"]}' 
$ db_set 42 '{"name":"San Francisco","attractions":["Golden Gate Bridge"]}'
$ db_get 42
{"name":"San Francisco","attractions":["Golden Gate Bridge"]}

From the implementation of db_set we can see that every write simply appends the record to the end of the database file. Since sequential writes to the file system are highly efficient, writes to this database perform well. However, append-only writes also mean that when the same key is updated, its old value is not overwritten, so every call to db_get has to traverse all records, collect every value that matches the key, and return the latest one. The read performance of this database is therefore very poor.

(Figure: read and write operations of the simplest database)

Next, let's rewrite this simplest database in Java:

/**
 * SimpleKvDb.java
 * Append-only writes
 * Reads scan the entire file
 */
public class SimpleKvDb implements KvDb {
    ...
    @Override
    public String get(String key) {
        try (Stream<String> lines = logFile.lines()) {
            //Step1: filter out all values for the key (corresponds to: grep "^$1," database)
            List<String> values = lines
                    .filter(line -> Util.matchKey(key, line))
                    .collect(Collectors.toList());
            //Step2: take the latest value (corresponds to: sed -e "s/^$1,//" | tail -n 1)
            String record = values.get(values.size() - 1);
            return Util.valueOf(record);
        } catch (IOException e) {
            e.printStackTrace();
            return null;
        }
    }
    @Override
    public void set(String key, String value) {
        //Step1: append the record (corresponds to: echo "$1,$2" >> database)
        String record = Util.composeRecord(key, value);
        logFile.append(record);
    }
    ...
}
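The Util helper referenced above is not shown in the article. Based on the record format described earlier (key and value separated by a comma), a minimal sketch of it might look like this; the class body below is an assumption for illustration, not the original code:

//Hypothetical sketch of the Util helper used by SimpleKvDb (not the original code).
//It relies only on the "key,value" record format described above.
class Util {
    //Does this record belong to the given key? (corresponds to grep "^$1,")
    static boolean matchKey(String key, String record) {
        return record.startsWith(key + ",");
    }
    //Extract the value part of a record (corresponds to sed -e "s/^$1,//")
    static String valueOf(String record) {
        return record.substring(record.indexOf(',') + 1);
    }
    //Extract the key part of a record
    static String keyOf(String record) {
        return record.substring(0, record.indexOf(','));
    }
    //Compose a "key,value" record (corresponds to echo "$1,$2")
    static String composeRecord(String key, String value) {
        return key + "," + value;
    }
}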

The results of a JMH benchmark are as follows:

Benchmark                                             Mode  Cnt   Score   Error  Units
SimpleKvDbReadBenchmark.simpleKvDb_get_10000_test     avgt    8   0.631 ± 0.033  ms/op  // read, 10k records
SimpleKvDbReadBenchmark.simpleKvDb_get_100000_test    avgt    8   7.020 ± 0.842  ms/op  // read, 100k records
SimpleKvDbReadBenchmark.simpleKvDb_get_1000000_test   avgt    8  62.562 ± 5.466  ms/op  // read, 1M records
SimpleKvDbWriteBenchmark.simpleKvDb_set_test          avgt    8   0.036 ± 0.005  ms/op  // write
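The benchmark classes themselves are not shown in this article. For reference, a read benchmark of this kind could be structured roughly as follows with JMH; the class name, constructor, and data sizes here are illustrative assumptions, not the original benchmark code:

import java.nio.file.Paths;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

//Illustrative JMH benchmark sketch; the SimpleKvDb constructor shown here is assumed.
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 3)
@Measurement(iterations = 8)
@Fork(1)
@State(Scope.Benchmark)
public class SimpleKvDbReadBenchmark {
    private KvDb db;

    @Setup
    public void prepare() {
        db = new SimpleKvDb(Paths.get("simple_10000.db")); //hypothetical constructor
        //Pre-populate the database with 10k records
        for (int i = 0; i < 10_000; i++) {
            db.set("key" + i, "value" + i);
        }
    }

    @Benchmark
    public String simpleKvDb_get_10000_test() {
        return db.get("key5000");
    }
}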

The results show that this implementation has high write performance but poor read performance: the read time grows linearly with the amount of data.

So how can we optimize the read performance of SimpleKvDb? Introduce an index!

An index is a data structure derived from the primary data in the database. It does not affect the data itself, only the database's read and write performance. For reads, it can quickly locate the target data and thus greatly improve read performance; for writes, the extra work of updating the index slightly reduces write performance.

As mentioned in the preface, there are many types of indexes, each with its own characteristics. Whether to use an index, and which one, therefore depends on the actual application scenario.

Next, we will optimize SimpleKvDb with the simplest and most efficient index: the hash index.

Adding a hash index to the database

Considering that a key-value database is itself similar to a hash table, an index strategy comes to mind naturally: maintain a hash table in memory that records, for each key, the byte offset of its record in the data file (such as the database file mentioned above).

For writes, after appending the record to the data file, the hash table must be updated; for reads, the hash table is first consulted to find the byte offset of the key's record in the data file, and then the record is located and read directly at that offset, avoiding an inefficient full-file scan.

(Figure: read and write operations after adding the index)

Adding a hash index to SimpleKvDb, the corresponding implementation is as follows:

/**
 * HashIdxKvDb.java
 * Append-only writes
 * Uses a hash index to improve read performance
 */
public class HashIdxKvDb implements KvDb {
    //Data storage file
    private final LogFile curLog;
    //Index, value is the offset of the data corresponding to the key in the file
    private final Map<String, Long> idx;
    ...
    @Override
    public String get(String key) {
        if (!idx.containsKey(key)) {
            return "";
        }
        //Step1: read index
        long offset = idx.get(key);
        //Step2: read value according to index
        String record = curLog.read(offset);
        return Util.valueOf(record);
    }
    @Override
    public void set(String key, String value) {
        String record = Util.composeRecord(key, value);
        long curSize = curLog.size();
        //Step1: append write data
        if (curLog.append(record) != 0) {
            //Step2: update index
            idx.put(key, curSize);
        }
    }
  ...
}

Before implementing HashIdxKvDb, we abstract the data storage file into a LogFile object. Its two basic methods are append (append a record) and read (read a record at a given offset). The read method locates a record quickly via the seek method of RandomAccessFile:

//Append-only log file used to store the database's data
class LogFile {
    //File path
    private Path path;
  ...
    //Append one record to the log file, automatically adding a line separator
    //Returns the number of bytes successfully written
    long append(String record) {
        try {
            record += System.lineSeparator();
            Files.write(path, record.getBytes(), StandardOpenOption.APPEND);
            return record.getBytes().length;
        } catch (IOException e) {
            e.printStackTrace();
        }
        return 0;
    }
    //Read one line starting at the given byte offset
    String read(long offset) {
        try (RandomAccessFile file = new RandomAccessFile(path.toFile(), "r")) {
            file.seek(offset);
            return file.readLine();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return "";
    }
    ...
}

The results of a JMH benchmark are as follows:

Benchmark                                               Mode  Cnt  Score   Error  Units
HashIdxKvDbReadBenchmark.hashIdxKvDb_get_10000_test     avgt    8  0.021 ± 0.001  ms/op  // read, 10k records
HashIdxKvDbReadBenchmark.hashIdxKvDb_get_100000_test    avgt    8  0.021 ± 0.001  ms/op  // read, 100k records
HashIdxKvDbReadBenchmark.hashIdxKvDb_get_1000000_test   avgt    8  0.021 ± 0.001  ms/op  // read, 1M records
HashIdxKvDbWriteBenchmark.hashIdxKvDb_set_test          avgt    8  0.038 ± 0.005  ms/op  // write

The benchmark results show that, compared with SimpleKvDb, the read performance of HashIdxKvDb is greatly improved, the read time no longer grows linearly with the amount of data, and write performance does not drop noticeably.

Although the hash index is very simple to implement, it is extremely efficient: a lookup needs only one disk seek (the seek operation) and one disk I/O (the readLine operation) to load the data. If the data was loaded into the file system cache earlier, no disk I/O is needed at all.

Data merging: compact

So far, whether it is SimpleKvDb or HashIdxKvDb, a write always appends data to a single file. This storage approach is usually called an append-only log.

So how do we keep the append-only log from growing without bound until disk space runs out?

A distinctive feature of the append-only log is that old records are never overwritten or deleted, yet that data is often useless: when reading the value of a key, the database only needs its latest value. One way to solve the problem is therefore to eliminate these useless records:

(1) Once the append-only log grows to a certain size, create a new append-only log and append to it instead. Under this mechanism the data is spread across multiple append-only log files, called segment files.

(2) Ensure that only the current segment file is readable and writable, and that the old segment files are read-only.

(Figure: segment file mechanism)

(3) Compact the old segment files: keep only the latest record for each key and delete the older records, as the small example below illustrates.

(Figure: compacting a single segment file)
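To make the idea concrete, here is a small illustration (not taken from the article) of what compacting one segment file does, using the record format from the earlier example:

# old segment file before compact
42,{"name":"San Francisco","attractions":["Golden Gate Bridge"]}
123456,{"name":"London","attractions":["Big Ben"]}
42,{"name":"San Francisco","attractions":["Golden Gate Bridge","Alcatraz"]}

# level1 compact segment file after compact (only the latest record per key survives)
123456,{"name":"London","attractions":["Big Ben"]}
42,{"name":"San Francisco","attractions":["Golden Gate Bridge","Alcatraz"]}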

The compact operation usually runs in a background thread. The database writes the merged result to a new compacted segment file, so the compact operation does not interfere with the logic that reads data from the old segment files. Once the compact operation completes, the old segment files are deleted and subsequent reads are served from the compacted segment file.

Now the keys within a single compacted segment file are unique, but duplicate keys may still exist across multiple compacted segment files. We can compact those compacted segment files again, reducing the amount of data further.

(Figure: compacting multiple segment files)

Similar compact operations can be performed layer by layer; for example, level2 compacted segment files can be compacted again to produce level3 compacted segment files. More levels are not necessarily better, though; the specific compact strategy has to be designed for the actual application scenario.

Adding the compact mechanism to HashIdxKvDb

Next, we try to add the compact mechanism to HashIdxKvDb. Since the data is now spread across multiple segment files, the previous hash index scheme no longer applies; we need to maintain a hash index for each segment file. This is easy to implement: just maintain another hash table whose key is the segment file and whose value is that segment file's hash index.

(Figure: hash indexes over multiple segment files)

The corresponding code implementation is as follows:

//Implementation of multi file hash index
class MultiHashIdx {
    ...
    private Map<LogFile, Map<String, Long>> idxs;
    //Get the offset of the key's record in the given LogFile
    long idxOf(LogFile file, String key) {
        if (!idxs.containsKey(file) || !idxs.get(file).containsKey(key)) {
            return -1;
        }
        return idxs.get(file).get(key);
    }

    //Record the offset of the key's record in the given LogFile
    void addIdx(LogFile file, String key, long offset) {
        idxs.putIfAbsent(file, new ConcurrentHashMap<>());
        idxs.get(file).put(key, offset);
    }
    ...
}

In addition, CompactionHashIdxKvDb needs to maintain collections of old segment files, level1 compacted segment files, and level2 compacted segment files, and to run the compact operations on these collections periodically via a ScheduledExecutorService.

/**
 * When the current segment file reaches a certain size, writes go to a new segment file, and old segment files are compacted periodically.
 * A hash index is maintained for each segment file to improve read performance
 * Supports single-threaded writes and multi-threaded reads
 */
public class CompactionHashIdxKvDb implements KvDb {
    ...
    //The segment file currently being appended to
    private LogFile curLog;

    //Collection of old (full) segment files; a level1 compact merges these files periodically
    private final Deque<LogFile> toCompact;
    //Collection of level1 compacted segment files; a level2 compact merges these files periodically
    private final Deque<LogFile> compactedLevel1;
    //Collection of level2 compacted segment files
    private final Deque<LogFile> compactedLevel2;
    //Hash indexes for the segment files
    private final MultiHashIdx idx;

    //Scheduler for the periodic compact tasks
    private final ScheduledExecutorService compactExecutor;
    ...
}
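The scheduling itself is not shown above. A minimal sketch of how the two compact tasks might be registered with the ScheduledExecutorService, assuming a hypothetical helper method and illustrative intervals, could look like this:

//Sketch only: wiring the periodic compact tasks inside CompactionHashIdxKvDb.
//The method name and intervals are assumptions, not taken from the original code.
//Assumes java.util.concurrent.TimeUnit is imported.
private void startCompactionSchedule() {
    //Merge old segment files in the toCompact queue (level1 compact) every 10 seconds
    compactExecutor.scheduleWithFixedDelay(this::compactLevel1, 10, 10, TimeUnit.SECONDS);
    //Merge the compactedLevel1 queue into one level2 file every 60 seconds
    compactExecutor.scheduleWithFixedDelay(this::compactLevel2, 60, 60, TimeUnit.SECONDS);
}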

Compared with HashIdxKvDb, before writing new data CompactionHashIdxKvDb must check whether the current file is full. If it is, it creates a new LogFile to append to and archives the full LogFile into the toCompact queue.

@Override
public void set(String key, String value) {
    try {
        //If the current LogFile is full, archive it into the toCompact queue and create a new LogFile
        if (curLog.size() >= MAX_LOG_SIZE_BYTE) {
            String curPath = curLog.path();
            Map<String, Long> oldIdx = idx.allIdxOf(curLog);
            curLog.renameTo(curPath + "_" + toCompactNum.getAndIncrement());
            toCompact.addLast(curLog);
            //Carry the existing index entries over to the archived (renamed) file
            idx.addAllIdx(curLog, oldIdx);
            curLog = LogFile.create(curPath);
            idx.cleanIdx(curLog);
        }

        String record = Util.composeRecord(key, value);
        long curSize = curLog.size();
        //The index will be updated if the write is successful
        if (curLog.append(record) != 0) {
            idx.addIdx(curLog, key, curSize);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}

The read operation of CompactionHashIdxKvDb is more involved. Because the data is scattered across multiple segment files, the lookup has to proceed in the following order until the key is found: the LogFile currently being appended to > the toCompact queue > the compactedLevel1 queue > the compactedLevel2 queue. As a result, in the worst case (when the queried data lives in the compactedLevel2 queue) the read operation of CompactionHashIdxKvDb is also fairly inefficient.

@Override
public String get(String key) {
    //Step 1: search from the current logfile
    if (idx.idxOf(curLog, key) != -1) {
        long offset = idx.idxOf(curLog, key);
        String record = curLog.read(offset);
        return Util.valueOf(record);
    }
    //Step 2: find from tocompact
    String record = find(key, toCompact);
    if (!record.isEmpty()) {
        return record;
    }
    //Step 3: find from compact level 1
    record = find(key, compactedLevel1);
    if (!record.isEmpty()) {
        return record;
    }
    //Step 4: find in compact Level 2
    record = find(key, compactedLevel2);
    if (!record.isEmpty()) {
            return record;
    }
    return "";
}
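The find helper used in steps 2 to 4 is not shown in the article. Given that new files are added to the queues with addLast, one plausible implementation (a sketch, not the original code) is to walk each queue from its newest file to its oldest and return the first hit located via that file's hash index:

//Hypothetical sketch of the find helper: search a queue of segment files from
//newest (tail) to oldest (head) using each file's hash index.
//Assumes java.util.Iterator and java.util.Deque are imported.
private String find(String key, Deque<LogFile> logFiles) {
    Iterator<LogFile> it = logFiles.descendingIterator();
    while (it.hasNext()) {
        LogFile logFile = it.next();
        long offset = idx.idxOf(logFile, key);
        if (offset != -1) {
            //Found: read the record at the offset and return its value
            return Util.valueOf(logFile.read(offset));
        }
    }
    return "";
}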

The old segment files in the toCompact queue can be compacted individually (single-file, level1 compact) using their hash indexes. Since the record an index entry points to is always the latest one for that key, we only need to iterate over the hash index, read the latest record for each key, and write it to the new level1 compacted segment file.

//Level 1 compact is performed to merge a single old segment file
void compactLevel1() {
    while (!toCompact.isEmpty()) {
        //Create a new level1 compact segment file
        LogFile newLogFile = LogFile.create(curLog.path() + "_level1_" + level1Num.getAndIncrement());
        LogFile logFile = toCompact.getFirst();
        //Only the latest value corresponding to each key is kept
        idx.allIdxOf(logFile).forEach((key, offset) -> {
            String record = logFile.read(offset);
            long curSize = newLogFile.size();
            if (newLogFile.append(record) != 0) {
                idx.addIdx(newLogFile, key, curSize);
            }
        });
        //After writing, it is stored in the compact level 1 queue and the corresponding file in tocompact is deleted
        compactedLevel1.addLast(newLogFile);
        toCompact.pollFirst();
        logFile.delete();
    }
}

The multi-file level2 compact strategy for the compactedLevel1 queue is to merge all level1 compacted segment files currently in the queue into a single level2 compacted segment file. The specific steps are as follows:

1. Take a snapshot of the compactedLevel1 queue, to avoid being affected by new level1 compacted segment files that are added to the queue while the level2 compact is running.

2. Perform the compact operation on the snapshot: write the records from the level1 compacted segment files into the new level2 compacted segment file, from newest to oldest; if a key already exists in the level2 compacted segment file, skip it.

3. When finished, delete the merged level1 compacted segment files from the compactedLevel1 queue.

//Perform level2 compact to merge all the files in the compactedlevel1 queue
void compactLevel2() {
    ...
    //Generate a snapshot
    Deque<LogFile> snapshot = new LinkedList<>(compactedLevel1);
    if (snapshot.isEmpty()) {
        return;
    }
    int compactSize = snapshot.size();
    //Naming rule for level2 files: <filename>_level2_<num>
    LogFile newLogFile = LogFile.create(curLog.path() + "_level2_" + level2Num.getAndIncrement());
    while (!snapshot.isEmpty()) {
        //Start with the latest level1 compact segment file
        LogFile logFile = snapshot.pollLast();
        logFile.lines().forEach(record -> {
            String key = Util.keyOf(record);
            //Only keys that do not exist in the current level2 compact segment file are written
            if (idx.idxOf(newLogFile, key) == -1) {
                //After successful writing, update the index
                long offset = newLogFile.size();
                if (newLogFile.append(record) != 0) {
                    idx.addIdx(newLogFile, key, offset);
                }
            }
        });
    }
    compactedLevel2.addLast(newLogFile);
    //After the write is complete, delete the corresponding file in the compact level 1 queue
    while (compactSize > 0) {
        LogFile logFile = compactedLevel1.pollFirst();
        logFile.delete();
        compactSize--;
    }
    ...
}

Summary

This article first introduced the simplest database given by Martin Kleppmann, re-implemented it in Java, and then optimized it with a hash index and the compact mechanism.

SimpleKvDb stores data in an append-only log. The main advantage of the append-only log is its high write performance (sequential writes to the file system are much faster than random writes). Append-only writes also mean that old records for the same key are never overwritten or deleted, so a query has to traverse the entire file, find all records for the key, and pick the latest one. Because it involves a full-file scan, the read performance of SimpleKvDb is very poor.

To optimize the read performance of SimpleKvDb, we implemented HashIdxKvDb with a hash index. The hash index is an in-memory hash table that stores, for each key, the offset of its record in the file. Each lookup therefore needs only one disk seek and one disk I/O, which is very efficient.

To keep the ever-growing append-only log from exhausting disk space, we introduced the compact mechanism into HashIdxKvDb and implemented CompactionHashIdxKvDb. The compact mechanism effectively cleans up stale data in the database and thus relieves the pressure on disk space.

Although the hash index is simple and efficient, it has the following two limitations:

1. The hash index must be kept in memory. If we implemented the hash index on disk instead, it would incur a large number of random disk reads and writes, causing performance to drop sharply. Moreover, with the introduction of the compact mechanism, the data is scattered across multiple segment files, so we have to maintain a hash index for every segment file, which makes the memory consumed by the hash indexes keep growing and puts great memory pressure on the system.

2. Range queries are very inefficient. For example, to query all keys in the database within the range [key0000, key9999], you have to traverse every element in the index and check whether it falls within the range.

Given these two limitations of the hash index, how can we do better? In the next article in this series, we will introduce another index that does not have these two limitations: the LSM tree.
