| 1.0 | 2021.11.21 | Added content from an internal company sharing session |
The title comes from InfluxDB's own background introduction to the birth of its storage engine:
> The workload of time series data is quite different from normal database workloads. There are a number of factors that conspire to make it very difficult to get it to scale and perform well:
>
> - Billions of individual data points
> - High write throughput
> - High read throughput
> - Large deletes to free up disk space
> - Mostly an insert/append workload, very few updates
>
> The first and most obvious problem is one of scale. In DevOps, for instance, you can collect hundreds of millions or billions of unique data points every day. To prove out the numbers, let's say we have 200 VMs or servers running, with each server collecting an average of 100 measurements every 10 seconds. Given there are 86,400 seconds in a day, a single measurement will generate 8,640 points in a day, per server. That gives us a total of 200 * 100 * 8,640 = 172,800,000 individual data points per day. We find similar or larger numbers in sensor data use cases.
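The back-of-the-envelope arithmetic in the quote can be checked in a few lines of Python:

```python
# Check the quoted numbers: 200 servers, 100 measurements each,
# one sample every 10 seconds.
SECONDS_PER_DAY = 86_400
SAMPLE_INTERVAL = 10  # seconds between samples

points_per_measurement_per_day = SECONDS_PER_DAY // SAMPLE_INTERVAL  # 8,640
total_points_per_day = 200 * 100 * points_per_measurement_per_day

print(total_points_per_day)  # 172800000 individual data points per day
```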
Recently I was responsible for the monitoring of some products. I assumed that a time series database must face demanding RT (response time) and IOPS requirements, so I wanted to see how one is implemented internally – and whether it closely resembles the Kafka and HBase designs I already know.
First, a quick primer. A time series database stores data that changes over time and indexes it by time (a point or an interval). It was first applied to the data collected and generated by real-time monitoring, inspection, and analysis equipment in industry (e.g. the power and chemical industries). Such industrial data has a few typical characteristics: it is generated at high frequency (each monitoring point can produce multiple data points per second); it depends heavily on acquisition time (each data point requires a unique timestamp); and there are many monitoring points producing a large volume of data (a conventional real-time monitoring system can have thousands of monitoring points, each generating data every second). The data is a historical imprint: immutable, unique, and ordered. In short, a time series database combines a simple data structure with a very large volume of data.
Anyone who has used a time series database knows this: its data is usually append-only, rarely deleted, or deletion is not allowed at all. Queries generally cover continuous ranges. For example:
- On a monitoring page we usually observe data over some time window, and drill down into a finer time range when necessary.
- The time series database pushes the metrics that the alerting system cares about to it.
1.1 Pitfalls Prometheus stepped in
Let's briefly review the data structure in Prometheus. It is a typical key-value pair: the key (generally called a series, identified by the metric name and its labels) combined with a timestamp locates a sample, and the value is the sample's value.
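This data model can be sketched as follows. The class and field names here are illustrative, not Prometheus's actual types; the point is only that (series, timestamp) acts as the key and the sample value is the value:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SeriesKey:
    """A series: metric name plus a sorted, immutable label set."""
    metric: str
    labels: tuple  # sorted (name, value) pairs, so the key is hashable

# (series, timestamp) -> sample value
samples = {}

key = SeriesKey("http_requests_total", (("method", "GET"), ("status", "200")))
samples[(key, 1637500000)] = 42.0
```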
In the early design, samples of the same series were grouped together according to certain rules, and files were organized by time. The storage therefore becomes a matrix:
The advantage is that both writes and reads (whether filtered by condition or by time range) can proceed in parallel. But the disadvantages are just as obvious: queries turn into scans across the matrix. This design easily triggers random reads and writes, which are painful on both HDDs and SSDs (interested readers can see Section 3.2 below).
So Prometheus improved the storage in a later version: each series gets its own file, and each series buffers its samples in memory until they reach 1 KB, then flushes them to disk in one go.
This alleviates the random I/O problem, but brings new ones:
- If the machine crashes while data is still in memory (below the 1 KB threshold), that data is lost
- The number of series can easily grow very large, leading to high memory consumption
- Continuing from the above: when all that buffered data is flushed in one breath, the disk becomes very busy
- Many files have to be kept open, consuming file descriptors
- When an application has not reported data for a long time, should its in-memory data be flushed? There is no good way to decide
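The per-series buffering scheme described above can be sketched as follows. This is not Prometheus's actual code; the 1 KB chunk size matches the text, while the 16-byte sample size (8-byte timestamp plus 8-byte value) is an assumption for illustration:

```python
CHUNK_SIZE = 1024  # flush threshold in bytes, from the text
SAMPLE_SIZE = 16   # assumed: 8-byte timestamp + 8-byte value

class SeriesBuffer:
    """One in-memory buffer per series, flushed when ~1 KB accumulates."""
    def __init__(self, name):
        self.name = name
        self.pending = []   # samples not yet on disk (lost on crash)
        self.flushed = []   # batches written out sequentially

    def append(self, ts, value):
        self.pending.append((ts, value))
        if len(self.pending) * SAMPLE_SIZE >= CHUNK_SIZE:
            # one sequential write per full chunk, not one per sample
            self.flushed.append(self.pending)
            self.pending = []

buf = SeriesBuffer("cpu_usage")
for t in range(100):  # 100 samples * 16 B: one full chunk plus a remainder
    buf.append(t, 0.5)
```

Note how the last 36 samples sit in `pending`: exactly the data that a crash would lose, and that has no clear flush deadline if the application stops reporting.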
1.2 Pitfalls InfluxDB stepped in
1.2.1 LevelDB, based on the LSM tree
The write performance of an LSM tree is much better than its read performance. However, InfluxDB exposes a deletion API, and once a deletion occurs things get troublesome: a tombstone record is inserted, and any query must merge its result set with the tombstones; only later does a compaction process run and remove the underlying data. In addition, InfluxDB offers TTLs, which means data is deleted by time range.
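The tombstone mechanism can be sketched with a toy in-memory LSM store (real LSM engines also have on-disk levels and compaction, omitted here):

```python
# Sentinel marking a deleted key; the data is not physically removed yet.
TOMBSTONE = object()

class TinyLSM:
    def __init__(self):
        self.memtable = {}

    def put(self, key, value):
        self.memtable[key] = value

    def delete(self, key):
        # Deletion is just another write: insert a tombstone record.
        self.memtable[key] = TOMBSTONE

    def get(self, key):
        # Queries must merge results with tombstones so deleted
        # keys disappear before compaction physically removes them.
        v = self.memtable.get(key)
        return None if v is TOMBSTONE else v

db = TinyLSM()
db.put("cpu.t1", 0.9)
db.delete("cpu.t1")
```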
To avoid this slow deletion, InfluxDB adopted a sharding design: different time ranges are cut into different LevelDB instances, so deleting a range is just closing that database and deleting its files. However, with a large amount of data, this causes the problem of too many open file handles.
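The time-sharding idea can be sketched as follows. Each shard stands in for a separate LevelDB instance; the one-hour shard width is an illustrative choice, not InfluxDB's actual default:

```python
SHARD_WIDTH = 3600  # seconds per shard; illustrative, one shard per hour

shards = {}  # shard start time -> {timestamp: value}

def write(ts, value):
    start = ts - ts % SHARD_WIDTH
    shards.setdefault(start, {})[ts] = value

def expire_before(cutoff):
    # Retention deletes whole shards (whole files on disk),
    # never individual keys, so no tombstones are needed.
    for start in [s for s in shards if s + SHARD_WIDTH <= cutoff]:
        del shards[start]

write(100, 1.0)      # lands in the shard starting at 0
write(4000, 2.0)     # lands in the shard starting at 3600
expire_before(3600)  # the first shard is dropped in one operation
```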
1.2.2 BoltDB, based on an mmap'd B+ tree
BoltDB stores its data in a single file, and the performance of its mmap-based B+ tree is not bad at runtime. But as write volume grows, things get troublesome: how do you cope with writes touching hundreds of thousands of series at a time?
To alleviate this, InfluxDB introduced a WAL, which effectively softens the pain of random writes: adjacent writes are collected into a buffer and then flushed together, much like MySQL's buffer pool. However, this does not solve the decline in write throughput; it only delays the problem.
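The WAL-plus-buffer pattern can be sketched as follows. The lists stand in for the append-only log file and the sorted on-disk data, and the flush threshold of 4 points is arbitrary:

```python
wal = []     # stands in for the append-only log file
buffer = []  # in-memory write buffer
disk = []    # stands in for sorted on-disk data

def write(point):
    wal.append(point)     # sequential append first: durability is cheap
    buffer.append(point)
    if len(buffer) >= 4:  # flush threshold, illustrative
        flush()

def flush():
    global buffer
    # Adjacent points are batched and written out together,
    # turning many random writes into one sequential write.
    disk.extend(sorted(buffer))
    buffer = []

for ts in [3, 1, 2, 4, 5]:
    write((ts, float(ts)))
```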
Think about it carefully: the hot data in a time series database is only the most recent data. The workload is write-heavy and read-light, with almost no deletes or updates, and data is only appended in order. We can therefore adopt aggressive storage, access, and retention policies for time series databases.
2.1 Key data structures
A variant implementation of the log-structured merge tree (LSM tree) replaces the B+ tree used as the storage structure in traditional relational databases. The LSM tree suits workloads that write much and read little (it turns random writes into sequential writes) and hardly ever delete data. Implementations generally use time as the key; in InfluxDB this structure is called the Time Structured Merge Tree (TSM tree).
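A minimal sketch of this idea, with time as the key: random writes are absorbed by a memtable, become one sequential write when the memtable is flushed as a sorted run, and reads merge the memtable with the runs. The flush threshold of 3 entries is purely illustrative:

```python
memtable = {}  # timestamp -> value; absorbs random writes in memory
runs = []      # sorted (timestamp, value) runs, standing in for disk

def put(ts, value):
    memtable[ts] = value
    if len(memtable) >= 3:  # flush threshold, illustrative
        # Flushing writes one sorted run sequentially to "disk".
        runs.append(sorted(memtable.items()))
        memtable.clear()

def range_query(start, end):
    # Reads must merge on-disk runs with the memtable; newer
    # memtable entries win over older flushed ones.
    result = {ts: v for run in runs for ts, v in run if start <= ts <= end}
    result.update({ts: v for ts, v in memtable.items() if start <= ts <= end})
    return sorted(result.items())

for ts in [5, 1, 3, 2, 4]:
    put(ts, ts * 10)
```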
There is an even more extreme form, not uncommon in time series databases, called the round-robin database (RRD). Implemented with the idea of a ring buffer, it can only store a fixed amount of the latest data; data beyond the retention period or capacity is overwritten in rotation. It therefore has a fixed database capacity, yet can accept unlimited data input.
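The ring-buffer behavior of an RRD can be sketched in a few lines (real RRD tools like RRDtool also consolidate data into multiple archives, which is omitted here):

```python
class RRD:
    """Fixed-capacity ring buffer: new samples overwrite the oldest."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = [None] * capacity
        self.next = 0  # write position, wraps around

    def write(self, sample):
        self.slots[self.next] = sample
        self.next = (self.next + 1) % self.capacity

    def latest(self):
        # Return retained samples, oldest first.
        ordered = self.slots[self.next:] + self.slots[:self.next]
        return [s for s in ordered if s is not None]

rrd = RRD(3)
for ts in range(5):  # 5 writes into 3 slots: the first 2 are overwritten
    rrd.write(ts)
```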
2.2 Key strategies
- WAL (write-ahead log): as in many data-intensive applications, the WAL ensures data durability and mitigates random writes. In a time series database it also serves as a carrier of query data: when a request arrives, the storage engine merges data from the WAL with the data already on disk. In addition, it is compressed with Snappy, a relatively cheap compression algorithm.
- Set aggressive data retention policies, such as automatically deleting data past its expiration time (TTL), to save storage space while improving query performance. For an ordinary database, automatically deleting data after it has been stored for a while would be unthinkable.
- Resample the data to save space. For example, data from the last few days may need second-level precision, while queries over month-old cold data only need day-level precision, and data from a year ago is fine at week-level precision. Resampling and aggregating the data this way saves a great deal of storage space.
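The resampling strategy in the last bullet can be sketched as a simple aggregation from second-level samples into day-level buckets (averaging is one reasonable choice of aggregate; real engines may also keep min/max/sum):

```python
SECONDS_PER_DAY = 86_400

def downsample_to_days(samples):
    """samples: list of (unix_ts, value) -> list of (day_start_ts, mean)."""
    buckets = {}
    for ts, value in samples:
        day = ts - ts % SECONDS_PER_DAY  # truncate to day boundary
        buckets.setdefault(day, []).append(value)
    # One aggregated point per day replaces thousands of raw points.
    return sorted((day, sum(vs) / len(vs)) for day, vs in buckets.items())

raw = [(0, 1.0), (10, 3.0), (86_400, 5.0)]  # two days of toy data
daily = downsample_to_days(raw)
```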
Overall, compared with Kafka and HBase, the internal structure of a time series database is by no means simple, and it has great learning value.
3.1 Reference links
- Zhou Zhiming: Phoenix architecture
3.2 Disk random read/write vs. sequential read/write
When we issue an addressing request to a disk (whether to read data from some region, or to locate a region to write), the first bottleneck we hit is the spindle speed, followed by the actuator arm that moves the heads.
An HDD's sequential read and write throughput is roughly 200 MB/s and 220 MB/s respectively.
SSDs look better: random read/write speeds are generally around 400 MB/s and 360 MB/s, and sequential read/write speeds around 560 MB/s and 550 MB/s.
But the real problem lies in the SSD's internal structure. Its most basic physical unit is the flash cell; cells make up a page, and multiple pages make up a block.
Writes happen in units of a page (4 KB in the figure), which means that even writing a single byte occupies 4 KB. And that is not the deadliest part. The deadliest is deletion: erases happen on a whole block (512 KB in the figure), so even deleting 1 KB of data inside a block leads to write amplification.
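Working through the figure's numbers makes the amplification concrete:

```python
PAGE = 4 * 1024     # smallest write unit, per the figure (4 KB)
BLOCK = 512 * 1024  # smallest erase unit, per the figure (512 KB)

# Writing 1 byte still consumes a whole page:
write_amplification = PAGE / 1  # 4096x for a 1-byte write

# Deleting 1 KB forces an erase of the whole 512 KB block, and the
# surviving data must first be copied out and rewritten elsewhere:
survivors = BLOCK - 1024  # bytes rewritten just to delete 1 KB
```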