Abstract: Starting from the basic concepts, application scenarios, requirements, and capabilities of time series databases, this article walks through the past and present of time series databases.
Time series databases have suddenly become a hot topic. Facebook has open-sourced its Beringei time series database, and TimescaleDB, a time series database built on PostgreSQL, has also been open-sourced. Time series databases are a key service for the Internet of Things (IoT), and this burst of activity in the industry shows how eager enterprises are to embrace the IoT era.
A time series database is a vertical database highly optimized for time series data. Many application scenarios in manufacturing, banking and finance, DevOps, social media, health care, smart home, networking, and other industries are well suited to time series databases:
- Manufacturing: for example, a lightweight production management cloud platform uses IoT and big data technology to collect and analyze the time series data generated during production, and presents production progress, goal achievement, and the utilization of people, machines, and materials on the production site in real time, making the production site fully transparent and improving production efficiency.
- Banking and finance: trading systems for traditional securities and emerging cryptocurrencies collect and analyze the time series data generated during trading to support quantitative trading.
- DevOps: operation and maintenance systems for IT infrastructure and applications collect and analyze monitoring indicators from running equipment and application services to track the health of equipment and applications in real time.
- Social media: big data platforms for social apps track user interaction data, analyze user habits, and improve the user experience; live streaming systems collect monitoring indicators from the streamer, the audience, and the intermediate links in the streaming pipeline to monitor stream quality.
- Health care: business intelligence tools collect health data from smart watches and smart bracelets, and track key indicators and the overall health of the business.
- Smart home: a home IoT platform collects data from smart home devices to enable remote monitoring.
- Network: a network monitoring system presents network latency and bandwidth usage in real time.
Requirements for time series data
In the scenarios above, especially in IoT and operation and maintenance monitoring, massive amounts of monitoring data need to be stored and managed. Taking Huawei Cloud Eye Service (CES) as an example, a single region has more than 70 million monitoring indicators to monitor and must process 900,000 reported indicator items every second. Assuming each item is 50 bytes, that amounts to roughly 1 PB of monitoring data per year. The sensors of a single autonomous vehicle generate 80 GB of monitoring data a day.
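The yearly volume quoted above can be checked with a quick back-of-the-envelope calculation. The sketch below uses only the two figures from the text (900,000 items per second, 50 bytes per item) and assumes decimal units and a 365-day year; the constant names are illustrative. Under those assumptions the result is about 1.4 PB, the same order of magnitude as the 1 PB stated.

```python
# Back-of-the-envelope estimate of yearly monitoring data volume.
# Both input figures come from the text; decimal units (1 PB = 1e15 bytes)
# and a 365-day year are assumptions made for this sketch.
ITEMS_PER_SECOND = 900_000       # reported indicator items per second
BYTES_PER_ITEM = 50              # assumed size of one indicator item
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

bytes_per_second = ITEMS_PER_SECOND * BYTES_PER_ITEM
bytes_per_year = bytes_per_second * SECONDS_PER_YEAR

print(f"{bytes_per_second / 1e6:.0f} MB/s")    # prints "45 MB/s"
print(f"{bytes_per_year / 1e15:.2f} PB/year")  # prints "1.42 PB/year"
```

A sustained 45 MB/s of raw writes before any indexing or replication overhead is what makes the write-throughput and storage-cost requirements below so demanding.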
Traditional relational databases struggle to support this volume of data and this level of write pressure. Hadoop-based big data solutions and existing time series databases also face significant challenges. For large-scale IoT and public-cloud-scale operation and maintenance monitoring scenarios, the main requirements for a time series database are:
- Continuous high-performance writes: monitoring indicators are usually collected at a fixed frequency. In some industrial IoT scenarios the sensor sampling interval is very short, in some cases as short as 100 ns, while public cloud operation and maintenance monitoring generally collects data at second-level intervals. The time series database must sustain high-pressure writes around the clock (7 × 24 hours).
- High-performance queries: the value of a time series database lies in data analysis, which has strict real-time requirements. Typical analysis tasks such as anomaly detection and predictive maintenance frequently retrieve large amounts of time series data from the database. To keep the analysis real-time, the database must respond quickly to queries over massive data.
- Low storage cost: data volumes in IoT and operation and maintenance monitoring scenarios are growing exponentially. They are more than 1,000 times those of typical OLTP database scenarios, and these scenarios are very cost sensitive, so low-cost storage solutions are required.
- Support for massive numbers of timelines: in large-scale IoT and public cloud operation and maintenance scenarios, the number of indicators to monitor is usually in the tens of millions or even hundreds of millions, so the time series database must be able to manage hundreds of millions of timelines.
- Elasticity: monitoring scenarios also see sudden bursts of business growth. For example, the operation and maintenance monitoring data of the Huawei WeLink service grew 100-fold during the epidemic. The time series database needs to scale out quickly enough to cope with such sudden growth.
Capabilities of open source time series databases
Over the past 10 years, with the rapid adoption and development of mobile Internet, big data, artificial intelligence, the Internet of Things, machine learning, and related technologies, many time series databases have emerged. Because these databases differ in their technologies and design goals, they also differ greatly in how well they meet the requirements above. The rest of this section discusses several of the most widely used open source time series databases.
OpenTSDB uses HBase as its underlying storage and builds its own logic layer and external interface layer on top. This architecture takes full advantage of HBase to achieve high data availability and good write performance. However, compared with InfluxDB, OpenTSDB has a longer data path, and there is room for further optimization in read/write performance and data compression.
InfluxDB is a popular time series database in the industry. It has a purpose-built storage engine and introduces an inverted index to strengthen multi-dimensional conditional queries, which makes it well suited to time series workloads. Time series insight reports and aggregation analysis are the main query scenarios for a time series database, and a single query may need to group and aggregate hundreds of millions of data points. Here, the Volcano (row-at-a-time iterator) model adopted by InfluxDB has a significant negative impact on aggregation query performance.
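To make the cost of the Volcano model concrete, the following minimal sketch contrasts a row-at-a-time aggregation with a batch-at-a-time one. It is an illustration under stated assumptions, not code from InfluxDB or any other database: the function names and the batch size are hypothetical, and pure Python only shows the difference in control flow, not the real speedup a vectorized engine gets from tight loops over columnar batches.

```python
from typing import Iterator, List


def scan_rows(values: List[float]) -> Iterator[float]:
    """Volcano-style leaf operator: hands one row to its parent per call."""
    for v in values:
        yield v


def volcano_sum(rows: Iterator[float]) -> float:
    """Aggregate operator that pulls rows one at a time from its child.

    Every row costs one iterator call, which is the per-row overhead
    that hurts when hundreds of millions of points are aggregated.
    """
    total = 0.0
    for v in rows:
        total += v
    return total


def scan_batches(values: List[float], batch_size: int) -> Iterator[List[float]]:
    """Batch-oriented leaf operator: hands a whole block of values per call."""
    for start in range(0, len(values), batch_size):
        yield values[start:start + batch_size]


def batched_sum(batches: Iterator[List[float]]) -> float:
    """Aggregate operator that processes one block per call.

    The per-call overhead is paid once per batch instead of once per row.
    """
    total = 0.0
    for batch in batches:
        total += sum(batch)
    return total


points = [float(i) for i in range(10)]
print(volcano_sum(scan_rows(points)))
print(batched_sum(scan_batches(points, batch_size=4)))
```

Both functions return the same result; the batched version simply crosses the operator boundary far fewer times, which is the property a vectorized query engine exploits.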
TimescaleDB is a time series database built on the traditional relational database PostgreSQL. It inherits many of PostgreSQL's advantages, such as SQL support, trajectory data storage, joins, and extensibility, and it has good read and write performance. However, TimescaleDB uses a fixed schema and takes up a large amount of storage space. It is an option for businesses whose time series workloads stay relatively stable over the long term and that are not sensitive to storage cost.
Emergence of GaussDB (for Influx)
At present, no open source solution satisfies the requirements for high-performance writes, massive numbers of timelines, and high data compression all at once. GaussDB (for Influx) draws on the strengths of the existing open source solutions and is designed as a time series database with a cloud-native architecture. The architecture is shown in the figure below.
Compared with existing open source time series databases, the architecture reflects the following two design considerations:
- Separation of storage and compute
First, a mature distributed storage system improves the reliability of the system. Monitoring data is written continuously at high throughput while a large number of queries run at the same time, so a business interruption or data loss caused by any system failure has a serious business impact. Using a proven, mature distributed storage system significantly improves system reliability, reduces the risk of data loss, and significantly shortens the time needed to build the system.
Second, the physical binding between data and nodes imposed by the traditional shared-nothing architecture is removed. Data belongs to a compute node only logically, which makes compute nodes stateless. When a compute node is added, there is no need to migrate large amounts of data between compute nodes; part of the data is simply reassigned logically from one node to another, which reduces cluster expansion time from days to minutes.
Third, offloading multi-replica replication from the compute nodes to the distributed storage nodes avoids a problem that arises when users build their own database on cloud hosts: the distributed database and the distributed storage each keep 3 replicas, for a total of 9 redundant copies. Avoiding this significantly reduces storage cost.
- Kernel bypass
To avoid the performance loss of copying data back and forth between user space and kernel space, GaussDB (for Influx) applies a kernel bypass design end to end. Instead of using standard distributed block or distributed file services, it uses a customized distributed storage system designed for databases that exposes a user-space interface. The compute nodes are deployed in containers and communicate directly with the storage nodes over a dedicated storage network.
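The stateless compute nodes described under the separation of storage and compute can be sketched as a change to an ownership map only. The example below is a hypothetical illustration, not the GaussDB (for Influx) implementation: the shard IDs, node names, and the `add_node` helper are all assumptions. The point is that scaling out rewrites a small piece of metadata while the data itself stays in shared storage.

```python
from collections import Counter
from typing import Dict


def add_node(owner: Dict[int, str], new_node: str) -> Dict[int, str]:
    """Return a new shard-to-node map after adding `new_node`.

    Only logical ownership changes; no shard data is copied, because
    (by assumption) every node reads shards from shared storage.
    """
    nodes = set(owner.values()) | {new_node}
    target = len(owner) // len(nodes)   # shards the new node should take over
    counts = Counter(owner.values())    # current load per existing node
    new_owner = dict(owner)
    moved = 0
    for shard in sorted(owner):
        if moved == target:
            break
        node = new_owner[shard]
        if counts[node] > target:       # take shards only from overloaded nodes
            new_owner[shard] = new_node
            counts[node] -= 1
            moved += 1
    return new_owner


# Six shards on two compute nodes, then a third node joins.
before = {0: "node-a", 1: "node-a", 2: "node-a",
          3: "node-b", 4: "node-b", 5: "node-b"}
after = add_node(before, "node-c")
reassigned = [s for s in before if before[s] != after[s]]
print(after)
print("shards reassigned:", reassigned)
```

In a shared-nothing design the same scale-out would require physically copying the reassigned shards to the new node, which is where the days-versus-minutes difference comes from.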
In addition to the architecture, GaussDB (for Influx) makes the following optimizations for the other requirements of IoT and operation and maintenance monitoring scenarios:
- A write-optimized LSM-tree layout and an asynchronous logging scheme improve write performance by 94% compared with current time series databases.
- A vectorized query engine, an ARC block cache, and an aggregation result cache improve aggregation query performance, up to 9 times higher than that of current time series databases.
- A compression algorithm designed around the distribution characteristics of time series data achieves twice the compression ratio of Gorilla, and cold data is automatically tiered to object storage, reducing storage cost by 60%.
- An optimized indexing algorithm for massive numbers of timelines improves indexing efficiency. At the scale of tens of millions of timelines, write performance is five times that of current time series databases.
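The compression item above uses Gorilla as its baseline. As background, here is a minimal sketch of the delta-of-delta transform that Gorilla-style encoders apply to timestamps. It shows only the transform, not the variable-length bit packing that follows it, and it is not the GaussDB (for Influx) algorithm, which the text does not describe. The function names are illustrative.

```python
from typing import List


def delta_of_delta_encode(timestamps: List[int]) -> List[int]:
    """Replace each timestamp with the change in its interval.

    The first element is stored as is; every later element is
    (current delta - previous delta). Regularly sampled series
    turn into long runs of zeros, which pack into very few bits.
    """
    if not timestamps:
        return []
    encoded = [timestamps[0]]
    prev, prev_delta = timestamps[0], 0
    for t in timestamps[1:]:
        delta = t - prev
        encoded.append(delta - prev_delta)
        prev, prev_delta = t, delta
    return encoded


def delta_of_delta_decode(encoded: List[int]) -> List[int]:
    """Invert delta_of_delta_encode."""
    if not encoded:
        return []
    timestamps = [encoded[0]]
    prev_delta = 0
    for dod in encoded[1:]:
        delta = prev_delta + dod
        timestamps.append(timestamps[-1] + delta)
        prev_delta = delta
    return timestamps


# Samples every 10 time units with slight jitter near the end.
ts = [1000, 1010, 1020, 1030, 1041, 1050]
enc = delta_of_delta_encode(ts)
print(enc)  # prints [1000, 10, 0, 0, 1, -2]
assert delta_of_delta_decode(enc) == ts
```

Because monitoring indicators are collected at a fixed frequency, most encoded values are zero or close to it, which is the kind of distribution characteristic a time series compressor can exploit.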
GaussDB (for Influx) has successfully supported the commercial launch of the company's WeLink and Cloud Eye Service (CES) monitoring services. Next, we will explore efficient ways to analyze the valuable data hidden in massive data sets, so as to provide users with better analysis and insight capabilities.