The underlying storage of sqlite is based on B-tree. The basic reading and writing unit of B-Tree to the underlying storage is page, and each page is corresponding to a globally unique page number. Generally speaking, the page number starts to increase from 1. The process of modifying data by the storage engine of class B-Tree is shown in the following figure:
From the above figure, it is necessary to distinguish several core modules of the storage engine of the B-Tree class:
- B-Tree algorithm module: Read pages from the page manager to memory, make logical modifications, and mark the page as a dirty page after modification, so that the page manager knows which pages have been modified and need to be dropped later.
- Page Manager: Responsible for providing the B-Tree algorithm module with an interface for reading and writing pages according to page numbers.
- Database file: This is not actually a module, it generally refers to the database-related files on the disk, and any modification will eventually fall to the database file. In sqlite, a database file is a single file, in other storage engines it may be a group of related files.
The top-level B-Tree algorithm module, when performing a write transaction, first initiates a request to the page manager to read pages into memory. It is noted that the B-Tree module does not directly deal with database files, but passes through pages. The manager module (expanded below), after modifying the page, marks it as “dirty page”, and the page manager is ultimately responsible for placing the dirty page into the database file.
Now let’s talk about the specific work of the “page manager” module, and some implementations are also called “buffer manager”. This module is responsible for: managing pages in memory
Manage pages in memory. This involves two parts:
- If the page is not currently in memory, the page needs to be loaded to disk according to the page number.
- The page does not have to be loaded to the disk every time it is read or written. Sometimes the page already exists in the cache. In this case, it is not necessary to load the page data to the disk. Therefore, the “page manager” module also needs to be responsible for maintaining these in-memory page caches. When to eliminate these pages, which in-memory pages to eliminate, and when to actually load them from disk, it is the work of this module.
- Externally (the external ones here are more B-Tree algorithm modules), in fact, you do not need and cannot see the details of the page cache. The page manager provides external interfaces for reading and writing pages based on page numbers.
Error recovery, transaction management, such as:
- A transaction needs to modify N pages. When the modification is in the middle, the process crashes. At this time, when restarting, the data before the transaction needs to be restored to successfully start, that is, the function of rolling back the transaction needs to be provided.
- The same transaction needs to modify N pages. When the transaction has not been committed, if the transaction level is not read uncommitted, the previous modification effect cannot be seen by other transactions. This is also what the page manager needs to do. After all, it provides external The interface for reading and writing pages is used to determine when the content of a page with the same page number is visible.
With these basic understandings, let’s take a look at the evolution of sqlite in concurrent reading and writing
The earliest implementation of the page manager was based on the Journal file, which stored the content of the page before it was modified:
What can be seen is:
- The Journal file stores the content of the page to be modified by a transaction before it is modified. This definition is a bit awkward, so let’s call it “old page content”.
- Each time a transaction is committed, it means that the modification of the transaction’s team pages has fallen into the database file. At this time, the old page content saved in the Journal file is no longer needed and can be deleted.
- Since each transaction modification must be placed on the database file, these placing operations involve multiple disk seeks, that is, multiple random disk seeks for one transaction, which is actually very expensive.
- When the transaction rollback function is required, the page manager can read the old page content from the Journal file and overwrite it.
- Although this algorithm is simple, it has obvious flaws: it does not have any concurrent read and write support. Every time a write transaction is started, from the start of the write transaction to the completion of the write transaction, other read and write transactions cannot start, which can be said to be “one write is all stuck”.
As can be seen from the above analysis, with the mechanism of the Journal file, each time a transaction is written:
- All content modifications need to be placed on the database file to be completed.
- During this process, there cannot be other concurrent read and write operations at the same time.
Starting from sqlite 3.7.0 (SQLite Release 3.7.0 On 2010-07-211, sqlite introduced a more common WAL mechanism to solve the problem of page read and write concurrency. The principle of WAL is shown in the following figure:
In the WAL mechanism, the modification of the page by the transaction:
- Instead of falling into the database file immediately, it is first written to the WAL file. This has two benefits:
- WAL files are append-only files, adding new content at the end of the file is faster for operations such as writing to disk files, because there is a lot less disk seek process.
- With WAL, the concurrency of read and write has been improved a bit: because the modification of the transaction is not immediately placed in the database file, it is invisible, and it is easier to roll back the modification of the transaction later: do not modify the part of the transaction. WAL content will do.
- Since the modification sometimes has not yet been placed on the disk, it is necessary to maintain an index of the pages in the wal, which is used to locate the pages in the WAL according to the page number. Since the wal index can control which wal file contents are “visible”, it can control that uncommitted transaction modifications are not visible to read operations.
- The WAL file cannot grow all the time. It is necessary to periodically put the committed transaction modifications in the WAL file to the database file. This process is called “checkpoint”. After the “checkpoint”, the wal index can be modified. Although the checkpoint process transfers the contents of the WAL file to the database file, it is still a random write process for the database file, and there are many disk seek operations. However, because a checkpoint accumulates multiple write transactions, the cost is small. Some.
Although there can still only be one write transaction going on at the same time, multiple read transactions exist at the same time. The core reason is that the modification is not immediately placed in the database file, so the visibility of the modification can be controlled by the wal index, that is, the write transaction can be written, and the read transaction can be read, as long as the modification that controls these write transactions is not. It can be seen in the wal index. Although WAL supports “one write and multiple reads”, instead of “one write all stuck” like the Journal file, there is still a problem that has not been solved: when doing checkpoint operations, even write transactions cannot be performed.
Two possible optimizations
The following introduces the two optimization schemes that sqlite is currently discussing. The reason why it is said to be “possible” is that this part of the code has not been merged into the trunk, and it is still in the branch for the time being. See: https://github.com/ sqlite/sql….
In order to solve the pain point of “cannot do write transactions during checkpoint”, sqlite is currently trying a new WAL-2 mechanism.
After the introduction of WAL-2, there are two WAL files at the same time, so you can: when checkpointing one of the WAL files, continue to write the other WAL file, and switch it when you checkpoint next time, so that the checkpoint will not block the write operation.
The current WAL mechanism can only support one write transaction at the same time. The BEGIN CONCURRENT mechanism can realize multiple write concurrency. This SQLite: Begin Concurrent document roughly describes the idea of this optimization:
The key to maximizing concurrency using BEGIN CONCURRENT is to ensure that there are a large number of non-conflicting transactions. In SQLite, each table and each index is stored as a separate b-tree, each of which is distributed over a discrete set of database pages. This means that:
- Two transactions that write to different sets of tables never conflict, and that
- Two transactions that write to the same tables or indexes only conflict if the values of the keys (either primary keys or indexed rows) are fairly close together.
Simply understand the above sentence:
- If different write transactions operate on different tables, although the data of different tables are physically in the same database file, they logically belong to different B-Trees, so that the pages managed by different B-Trees are not connected to each other. There will be no conflict. At most, it can be locked when the disk is placed in the database file.
- Second, even if multiple write transactions operate on the same table, as long as the key values of the same table are far apart, there is little possibility of conflict. Once a conflict is found when the transaction is committed, the transaction is done again from the beginning until there is no conflict and it is successfully committed when it can be committed. The latter idea of conflict resolution is actually not in the document. It is a method that I came up with based on other papers :).
At present, these two optimizations have not been merged into the trunk, so I have not looked at the implementation in detail. The subsequent optimization of the storage engine reflected in the sqlite trunk will be sorted out.
 SQLite Release 3.7.0 On 2010-07-21:
 SQLite: Begin Concurrent:
 sqlite3.36 version btree implementation (3) – journal file backup mechanism – codedump network log:
 sqlite3.36 version btree implementation (four) – WAL implementation – codedump network log:
Databend is an open source, flexible, low-cost new data warehouse that can also perform real-time analysis based on object storage. Looking forward to your attention, we will explore cloud-native data warehouse solutions together and build a new generation of open source Data Cloud.
- Databend Documentation: https://databend.rs/
- GitHub ：https://github.com/datafusela…
The article was first published on the public account: Databend