Introduction | TBase is an enterprise-grade distributed HTAP database system developed by Tencent TEG's data platform team on the basis of open-source PostgreSQL. Within a single database cluster it provides strongly consistent, highly concurrent distributed online transaction processing together with high-performance online data analysis. This article is an edited transcript of a talk given by Wu Xin, an expert engineer on the Tencent TBase team, at an online Cloud+ Community salon, covering Tencent Cloud TBase's exploration and practice in the field of distributed HTAP.
1、 Introduction to the TBase distributed database
1. Development history of TBase
Tencent has been trying out PostgreSQL in its internal businesses since 2009, accumulating experience in developing enterprise distributed databases in-house. At the time the goal was to meet some modest needs: PostgreSQL was introduced as a supplement to TDW, to make up for TDW's weak performance on small-data analysis, and the processing demands were correspondingly small.
As the business grew, however, we needed more efficient online transaction processing and the ability to scale the database out, so we kept investing in database development.
TBase's first version was released in 2014 and went into internal use on Tencent's big data platform. In 2015, the TBase WeChat Pay merchant cluster went live, supporting more than 600 million transactions per day. In 2018, V2 greatly strengthened transactions, query optimization, and enterprise-grade features, and TBase gradually began serving external customers. In 2019, TBase won the bid for the core business of PICC Group, helping them build a core system that leads the domestic insurance industry and providing stable service over a long period.
As TBase's overall capabilities continued to mature, we wanted to contribute them to the open-source community and support more database localization projects. So in November 2019 we open-sourced the database, hoping to help drive the digital upgrade of industry.
2. Introduction to the PostgreSQL database
TBase is a distributed database developed in-house on top of single-node PostgreSQL. Beyond a complete set of relational database capabilities, TBase adds many enterprise-grade features. It also strengthens distributed transactions, better supports online analytical workloads, and provides a one-stop solution.
In terms of data security we have distinctive features, including a separation-of-duties security model and data desensitization and encryption. In addition, flexibly configurable multi-site, multi-center active-active deployment provides fairly complete guarantees for high-availability scenarios in financial online transaction processing, laying a solid foundation for localizing core businesses in finance and insurance.
PostgreSQL is an open-source RDBMS released under a BSD-style license, so its source code can be modified flexibly, and modified versions can be commercialized.
PostgreSQL is an open-source project initiated by Turing Award winner Michael Stonebraker. As the figure above shows, it has been iterating for a long time, has reached version 12, and remains quite active overall.
3. Development trend of PostgreSQL
PostgreSQL has drawn growing attention over the past decade, first of all because its kernel features, backed by a consistently active community, have kept improving year after year. The figure above shows statistics from DB-Engines; the data show that PostgreSQL made clear progress last year even as other databases stalled or grew slowly.
The yellow curve in the figure below is PostgreSQL, and we can see intuitively that its trend is quite positive.
The current open-source TBase version is based on PostgreSQL 10. We are continuously porting more PostgreSQL features, and will feed changes back to the open-source community, hoping for good integration and interaction with the broader PostgreSQL ecosystem.
2、 Positioning and overall architecture of open-source TBase
1. Positioning of open-source TBase
By business scenario, databases are mainly divided into OLAP, OLTP, and HTAP.
OLAP is characterized by very large data volumes, typically 10 PB or more, and sensitivity to storage cost. Its concurrency is not as high as OLTP's, but it provides better support for complex queries.
OLTP data volumes are relatively small; many small and medium systems never reach the TB level. But the demands on transactions and query requests are high, with throughput reaching a million TPS or more, and OLTP has strict disaster-recovery requirements.
Many domestic database localization efforts start from OLAP, because starting from OLTP is considerably harder: that space is still heavily dominated by IBM and Oracle. We hope to achieve localization there as soon as possible. TBase has worked in the insurance industry for a long time and has strong OLTP capabilities for core business.
The other category is HTAP. In most earlier deployments, TP and AP were kept in separate systems, with ETL or streaming replication moving data between them. It would clearly be more ideal for one system to support both workload types at once.
Of course, this is more complicated. As you can see, the two workloads have very different characteristics, and the corresponding kernel optimizations point in different, sometimes conflicting, technical directions.
TBase approaches HTAP from concrete requirements. TBase leans toward TP while also providing good AP processing, aiming for the best possible coexistence within one system. To reach truly extreme performance, however, the TP and AP sides still need to be isolated so that users get complete service capabilities.
TBase's perspective likewise grew out of requirements: at Tencent Cloud it began as a transaction system, and AP analysis capabilities were added gradually.
The main business requirements in this space are: transaction data may exceed 1 TB and analytical data 5 TB; concurrency must exceed 2,000 connections; and peak load may reach 10 million transactions per second. When the cluster is scaled out, the impact on existing transaction processing, analysis, and data redistribution must be minimized, while the system remains a fully distributed, transactionally consistent database.
TBase has also strengthened many enterprise-grade capabilities, such as the separation-of-duties security model, data governance, and separating hot data from cold data and large merchants' data from small merchants' data.
Earlier we covered TBase's development history; along the way, we also hope to contribute to the open-source community.
In the domestic environment it is still difficult to replace core business systems, and most adoption starts from analytical systems. In recent years TBase has begun to move systematically into core transaction processing. Through open source, TBase hopes to give back to the community, fill some gaps with its HTAP capability, and help the ecosystem grow.
Since open-sourcing, we have received growing attention and adoption, including from the European Space Agency's Gaia mission, which uses our system to analyze data on the stars of the Milky Way. We hope more engineers and friends will join TBase's development, and that this introduction helps you engage with the TBase open-source community.
2. Overall architecture of TBase
A cluster consists of three kinds of components: GTM, Coordinator (CN), and Datanode (DN). The GTM is mainly responsible for global transaction control and is the cornerstone of the distributed consistency protocol. The Coordinator is the access point for user workloads: it parses user requests and dispatches them, while the actual computation and data storage happen on the Datanodes.
3、 Introduction to HTAP capabilities
Having introduced HTAP, let's start with OLTP, where TBase's capabilities are outstanding.
Users with demanding transaction or concurrency requirements need a strong distributed transaction system, specifically one that is high-performance and low-cost. Here TBase has a significant advantage over traditional IBM systems or the more expensive foreign all-in-one appliances.
Another requirement is scalability: as nodes are added, transaction processing capacity should scale approximately linearly. How do we achieve this?
Let me briefly introduce MVCC transaction handling. Single-node PostgreSQL maintains a list of currently active transactions in a structure called the ProcArray; whenever a user session starts a new transaction, it is recorded in this active-transaction list. When tuple visibility must be judged, a snapshot taken from the active-transaction list is compared against the XID information recorded in the tuple header, implementing MVCC concurrency control.
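To make the snapshot idea concrete, here is a minimal Python sketch of snapshot-based visibility, loosely modeled on single-node PostgreSQL. The names (`Snapshot`, `is_visible`) are illustrative, not actual PostgreSQL identifiers, and aborted transactions are omitted for brevity.

```python
from dataclasses import dataclass, field

@dataclass
class Snapshot:
    xmin: int                                  # lowest XID still active when the snapshot was taken
    xmax: int                                  # first XID not yet assigned at snapshot time
    active: set = field(default_factory=set)   # XIDs in the active-transaction list at snapshot time

def is_visible(tuple_xmin: int, snapshot: Snapshot) -> bool:
    """A tuple is visible if its inserting transaction committed before the snapshot,
    i.e. it is neither in the future (>= xmax) nor still listed as active."""
    if tuple_xmin >= snapshot.xmax:
        return False          # inserted by a transaction that started after the snapshot
    if tuple_xmin in snapshot.active:
        return False          # inserting transaction was still in flight
    return True               # otherwise treat as committed (abort handling omitted)

# Example: a snapshot taken while XIDs 100 and 105 were running.
snap = Snapshot(xmin=100, xmax=110, active={100, 105})
```

With this snapshot, a tuple inserted by XID 90 is visible, one inserted by XID 105 is not (still in flight), and one inserted by XID 112 is not (started after the snapshot).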
Extending this to a distributed system, the simplest approach is a central node: following the earlier architecture, the GTM keeps a centralized active-transaction list and hands out snapshots uniformly for every request.
But this creates a big problem: the central node becomes a bottleneck, and the GTM runs into issues such as oversized snapshots and heavy network usage.
With the GTM as a centralized node there is a single-point bottleneck: every request must obtain a correct snapshot, which requires locking the active-transaction list, and under high concurrency lock contention becomes severe. How do we solve this?
This is actually a common problem across the industry. The schemes we see in the Internet industry today derive from the direction set by Google Spanner.
Google Spanner is a globally distributed database that provides consistent database services across continents. Its concurrency control has two notable features: multi-version concurrency control over KV storage based on global time, and the TrueTime API, which uses costly GPS and atomic-clock hardware to provide globally consistent timestamps, on top of which a real-time-based commit protocol is built. Because the system is globally distributed, the average clock uncertainty is about 6 milliseconds, and overall transaction latency is relatively high.
Many systems borrow their transaction model from Google Spanner; CockroachDB is one of them.
Percolator is another Google system, built for efficient incremental processing in its search engine. It uses KV storage with MVCC based on a global logical timestamp for concurrency control, with timestamps provided by a dedicated timestamp service. In the first phase of a distributed transaction the modified records must be locked, and the locks are released when the commit phase finishes; transaction commit has time complexity O(N), where N is the number of records, which hurts commit performance. Of course, this design matches that system's requirements.
Let's look at TBase's distributed transaction capabilities; here we made major improvements over the earlier work.
First, we upgraded the GTM cluster from allocating global XIDs to allocating global timestamps (GTS), which increase monotonically. Based on GTS we designed a new MVCC visibility protocol, including mechanisms such as vacuum. This design pushes the commit protocol down from the GTM's single-point bottleneck to the individual nodes, reducing pressure, and achieves GTM active-standby high availability through timestamp log replication.
Under this protocol the GTM only has to allocate global GTS values, so the single-point pressure is clearly resolved. By our measurements, a Tencent TS85 server can allocate roughly 12 million timestamps per second, which basically covers all distributed workloads and user scenarios.
As mentioned, under Percolator's design tuples must be locked and records modified at commit, so performance suffers. We optimized the commit protocol by deferring the write of GTS into the tuple header: at commit time we do not modify the GTS information of every tuple, but instead store it in a corresponding GTS store file, which also serves as the guarantee for transaction recovery.
When a user first scans the data, the status is fetched from the GTS store file and written into the tuple header; subsequent scans no longer need to consult the status file, which speeds up access and transaction processing. Overall, this gives the database a highly efficient design at the transaction level.
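A hedged sketch of the lazy-GTS idea described above: commit timestamps live in a separate store until the first scan copies them into the tuple header. All names here (`GtsStore`, `Tuple`, `is_visible`) are hypothetical stand-ins, not TBase internals.

```python
class GtsStore:
    """Maps a committing XID to its commit GTS (stands in for the GTS store file)."""
    def __init__(self):
        self._commit_gts = {}
    def record_commit(self, xid, gts):
        self._commit_gts[xid] = gts
    def lookup(self, xid):
        return self._commit_gts.get(xid)   # None if the transaction never committed

class Tuple:
    def __init__(self, xid, value):
        self.xid = xid
        self.value = value
        self.commit_gts = None             # filled in lazily on first scan

def is_visible(tup, snapshot_gts, store):
    if tup.commit_gts is None:             # first scan: fetch and cache the commit GTS
        tup.commit_gts = store.lookup(tup.xid)
    if tup.commit_gts is None:
        return False                       # inserting transaction not committed yet
    return tup.commit_gts <= snapshot_gts  # visible if committed before the snapshot

store = GtsStore()
store.record_commit(xid=7, gts=100)
t = Tuple(xid=7, value="a")
```

After the first `is_visible(t, 150, store)` call, `t.commit_gts` holds 100, so later scans skip the store lookup; that is the access speed-up the text describes.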
There are three ways to distribute data across the cluster.
The first is the replicated table: every storage node holds a complete copy of the data. This suits small, rarely changing tables and speeds up join queries.
The second is hash distribution, the classic approach: data is hashed on the distribution column and scattered across the storage nodes. If the hash key is chosen poorly, the data may become skewed.
The third is range distribution, which splits data into segments by key range. Compared with hash it distributes data less strictly, and it supports elastic changes to the node set well, but its computational behavior is not quite as good as hash.
Overall, TBase chose replicated tables plus an enhanced hash distribution.
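The three strategies can be sketched as routing functions that map a row's distribution-column value to data nodes. This is a toy illustration under assumed node names and ranges, not TBase's actual routing code.

```python
import hashlib

NODES = ["dn1", "dn2", "dn3"]

def route_replicated(_key):
    """Replicated table: every node holds a full copy, so a write goes everywhere."""
    return list(NODES)

def route_hash(key):
    """Hash distribution: one node, chosen by hashing the distribution column."""
    h = int(hashlib.md5(str(key).encode()).hexdigest(), 16)
    return [NODES[h % len(NODES)]]

# Assumed range boundaries: [lo, hi) -> node.
RANGES = [(0, 1000, "dn1"), (1000, 2000, "dn2"), (2000, 3000, "dn3")]

def route_range(key):
    """Range distribution: the segment containing the key decides the node."""
    for lo, hi, node in RANGES:
        if lo <= key < hi:
            return [node]
    raise ValueError("key outside all ranges")
```

Skew under hash distribution corresponds to many keys collapsing onto one node; a poorly chosen distribution column (e.g. one with few distinct values) makes this likely.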
Next, let's look at distributed queries: pushing the query down versus pulling the data up.
Early systems sometimes chose the quicker implementation: storage is split across multiple DNs, and data is pulled up to the CN for computation.
The pros and cons are obvious. The advantage is that it is simple and quick to build; the disadvantage is that the CN becomes a bottleneck and the network comes under heavy pressure. We therefore prefer the approach on the right of the figure above: push data and computation down to the DN nodes.
In the ideal case, all computation happens on the DNs. When data must be redistributed among DNs, the DNs need the ability to interact with each other; this was strengthened after TBase V2. Today TBase pushes computation down to the DN nodes as much as possible.
The figure above shows the difference between SQL shipping and plan shipping.
When processing a query or a query plan there are two cases. In the first, after analysis the CN sends SQL directly to a DN for execution, and the CN is only responsible for collecting the results. This optimizes well, because there is no need to run a distributed consistent commit protocol across multiple nodes, and it uses computing resources more efficiently; some of our OLTP optimizations take this approach.
The second case, common in OLAP, is genuine plan shipping: the CN builds an overall plan for the query, splits it into computation fragments at redistribution boundaries, and distributes the fragments to the DNs for processing.
As noted earlier, if an OLTP query can be pushed down to a single DN, performance improves. Here is a simple example.
Both tables have distribution column f1 and data column f2. If the query is written as a join on the distribution keys and hash distribution is used, the query can be pushed down to the individual DNs: because the data on different DNs is partitioned by the distribution columns, no cross-node computation or data redistribution is needed.
The second case is an equi-join on the distribution keys combined with a fixed constant value for a distribution key.
In this case the CN can determine from the value of f1 which DN to push the query to.
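The CN-side decision just described can be sketched as follows. This is a toy model of the pushdown logic, not the TBase planner; the function names and the `"all-parallel"` marker are invented for illustration.

```python
def hash_node(value, n_nodes=3):
    """Stand-in for the real hash routing of a distribution-key value."""
    return hash(str(value)) % n_nodes

def pushdown_target(dist_key_value=None, joins_on_dist_keys=False):
    """Return the single DN to ship the query to, 'all-parallel' when a
    co-located join can run independently on every DN, or None when the
    CN must gather and redistribute data itself."""
    if dist_key_value is not None:
        # e.g. SELECT ... WHERE f1 = 42: the constant pins one DN.
        return f"dn{hash_node(dist_key_value)}"
    if joins_on_dist_keys:
        # e.g. a JOIN b ON a.f1 = b.f1, both hash-distributed on f1:
        # each DN joins only its own slice, no redistribution needed.
        return "all-parallel"
    return None
```

A query that qualifies for neither case (say, a join on a non-distribution column) falls through to `None`, which is exactly where the redistribution machinery discussed later comes in.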
Some queries are more complex, such as subqueries, but the analysis is similar.
Subqueries can be complicated, but if we can determine that multiple layers of subqueries route to the same single node as the outer query, the whole query can still be shipped to that DN. This benefits OLTP performance and makes better use of cluster capacity.
Complex queries require the optimizer to choose a good plan among many candidates.
There are two approaches: rule-based optimization (RBO) and cost-based optimization (CBO). RBO decides by fixed rules whether a query plan matches and optimizes accordingly. It is an early implementation technique: because the computation involved is small, it is efficient for some scenarios, but its obvious drawback is inflexibility, and it cannot handle more complex scenarios.
Most databases of this kind actually use CBO. In short, CBO dynamically enumerates all candidate paths and selects the one with the lowest cost as the execution plan. Its advantages are broad applicability, suitability for complex scenarios, and stable performance. Its disadvantages are implementation complexity and certain prerequisites, including statistics and the construction of a cost model.
But this is not absolute; neither approach simply beats the other, and in practice they are combined. TBase optimizes mainly on top of CBO. For example, when a small table is involved in a computation, there is no need to redistribute it; it can simply be replicated.
Let's walk through a simple example of how joins are handled under distribution. Take two tables, tbl_a and tbl_b.
If f1 is the distribution column and the join is an equi-join on the distribution keys, the query becomes a pushdown and can be computed directly on the DNs.
In the middle of the figure above, tbl_a joins on its distribution key while tbl_b joins on a non-distribution column. In that case, if tbl_b is small enough, it can be replicated to the DNs, which involves some cost estimation; if tbl_b is large, it may instead need to be redistributed.
As mentioned, TBase is also strong in OLAP. The optimization work here relies mainly on parallel computing, and full computational parallelism shows up at several levels.
The first is node-level parallelism: as a distributed database, we can compute on many nodes and processes at once. The next level is process-level parallelism: TBase has not been converted to a thread model, so intra-node parallelism is expressed at the process level, where we have enhanced PostgreSQL's process-parallel capabilities. The last level is instruction-level parallelism, where we optimize instructions and will keep strengthening this later.
So how is process-level parallelism implemented in PostgreSQL's query plans? We initially followed PG 10, whose parallel capability was not very strong: it provided only the basic framework and optimizations for a few operators, and that became one of TBase's optimization points at the time.
Many plans that can run in parallel on a single machine could not in the distributed case, so we wanted to strengthen these capabilities and guarantee a wider range of parallelism.
Parallel planning is a bottom-up derivation. For example, if a node at the bottom can run in parallel but some layer above it cannot, a Gather node can be added on top of the parallel portion to collect results from multiple processes, and planning then continues upward. This is another point we wanted to enhance.
Let me introduce some specific optimizations.
Early PG's hash join could parallelize the outer plan, but building the hash table on the inner side could not be parallelized.
In short, a hash join has several steps: first build the hash table, then fetch outer-plan rows, compute their hash values, and probe for matches. We parallelized the construction of the inner hash table as well, so that both the left and right subtrees of the hash join run in parallel, and parallelism can continue to propagate to upper nodes.
Another case is AGG (aggregation).
Aggregation is often two-phase: partial aggregation runs on the DNs, and the upper layer of the plan then performs the final aggregation on those fragments. When a redistribution sits in between, the data on the DNs must first be consolidated before the final aggregation runs.
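The two-phase pattern above can be shown with a distributed average, where the partial state must be a mergeable pair (count, sum) rather than a local average. Function names are illustrative, not TBase operators.

```python
def partial_agg(rows):
    """Runs on each DN: reduce local rows to a mergeable partial state."""
    return (len(rows), sum(rows))

def final_agg(partials):
    """Runs at the final stage: merge the partial states, then finalize."""
    count = sum(c for c, _ in partials)
    total = sum(s for _, s in partials)
    return total / count if count else None

# Each inner list stands for the rows stored on one DN.
dn_data = [[1, 2, 3], [4, 5], [6]]
avg = final_agg([partial_agg(rows) for rows in dn_data])   # average over all DNs
```

Note that averaging the per-DN averages would be wrong when the DNs hold different row counts; shipping (count, sum) pairs is what makes the final merge correct.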
With multi-layer subqueries, computing layer by layer keeps overall parallelism low. We therefore also parallelized redistribution itself: partial aggregation results can be sent, according to the hash distribution, to the corresponding upper-level DN nodes and processed in parallel.
There are also data-transmission optimizations.
As mentioned, redistribution nodes strengthen parallelism, but data sending and receiving also need optimization. In early versions, single-threaded processing led to high network latency and poor performance, so we optimized this part to achieve better parallel execution.
4、 Introduction to the enterprise-grade capabilities of open-source TBase
TBase has also strengthened a number of enterprise-grade capabilities, and we will keep contributing and optimizing them in open source.
The current open-source TBase can already be deployed multi-site, multi-center, or active-active, and includes security, management, and audit capabilities. TBase sets a high bar for security, and we will continue contributing in this area.
In terms of horizontal scalability, TBase can expand capacity with minimal user-visible impact. Capacity expansion is a common pain point with large data volumes, and we will keep strengthening this capability.
TBase also has self-developed analytical tables and hot/cold data separation, both with good results: they reduce user cost and make data placement more flexible.
TBase released open-source v2.1.0 on July 13 this year. We continue to build out the open-source edition: further strengthening multi-active capability, improving maintainability, continuously upgrading performance and security (including issues reported by enterprise customers), enhancing statistics, and optimizing redistribution for small tables. We hope you will keep following TBase and bring us your questions and discussions.
5、 Q & A
Q: What are the deployment requirements for TBase?
A: You can get the open-source version of TBase; the usage instructions, source-based build, and setup process are all clearly documented. To try it out, an ordinary x86 server or a local Linux server is enough for a simple setup, and you are welcome to stay in touch with us about enterprise projects.
Q: Why build on PostgreSQL?
A: In practice the choice comes down to MySQL versus PostgreSQL; let me explain why we chose PostgreSQL.
First, PostgreSQL's license is friendlier and more flexible: you can modify its code and release the result essentially however you like.
In addition, its kernel implementation is rigorous and distinctive, and it keeps being enhanced at a fast iteration pace. Early on we kept merging PostgreSQL features, so as PostgreSQL grew, our kernel iterated quickly too; at the same time, understanding the kernel let us do deeper tuning.
Q: Is the DN storage cluster based on Raft? Multi-leader or single-leader?
A: At present our DN nodes do not use the Raft protocol; they use active-standby replication. I know many new systems build on Raft's commit protocol, or use such a replication protocol for consistency and high availability. But multi-replica Raft does affect commit performance: compared with the traditional path, the overall flow has higher latency. It comes down to the CAP trade-off: strengthening C inevitably costs some A.
We lean toward OLTP, where transaction and latency requirements are strict, so we made this choice.
Q: Can you describe the distributed transaction flow in detail? How are distributed transactions and two-phase commit guaranteed across multiple machines?
A: We basically follow two-phase commit. One part is the two-phase commit control flow and protocol; the other is the transaction isolation protocol. What I presented earlier was mainly the enhanced MVCC; commits are basically two-phase on top of the commit protocol.
Because we use the GTM, this differs from the traditional single-node approach: the GTM performs some unified coordination, as highlighted earlier. The benefit is reduced pressure on the GTM. There is some blocking in the prepare phase, but after optimization the impact is very small, while the pressure on the GTM drops greatly.
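For readers unfamiliar with the flow the answer refers to, here is a hedged sketch of generic two-phase commit driven by a coordinator: commit only proceeds if every participant prepares successfully, and all participants are stamped with the same global commit timestamp. Error handling, durability, and logging are omitted; all names are illustrative, not TBase code.

```python
class DataNode:
    def __init__(self, name, fail_prepare=False):
        self.name = name
        self.fail_prepare = fail_prepare
        self.state = "idle"
    def prepare(self):
        if self.fail_prepare:
            self.state = "aborted"
            return False
        self.state = "prepared"        # work durably staged, locks held
        return True
    def commit(self, gts):
        self.state = f"committed@{gts}"
    def abort(self):
        self.state = "aborted"

def two_phase_commit(nodes, commit_gts):
    # Phase 1: every participant must prepare successfully.
    if all(n.prepare() for n in nodes):
        # Phase 2: stamp every participant with the same global commit GTS.
        for n in nodes:
            n.commit(commit_gts)
        return True
    # Any prepare failure aborts the whole distributed transaction.
    for n in nodes:
        n.abort()
    return False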
Q: How does the underlying storage satisfy both row-store and column-store needs? Or is it stored contiguously in blocks (tiles)?
A: The underlying storage of our open-source version is mainly row storage. We will keep strengthening column storage and HTAP to further improve HTAP capability, and once that work stabilizes we will consider rolling it into the open-source version.
Q: What is the minimum number of servers needed?
A: A single-node setup is possible: one DN, one CN, and one GTM will work. Ideally, deploy two DNs to experience more of the distributed behavior. In our enterprise service clusters we run more than a thousand nodes, where solving the GTM's single-point pressure has greatly improved overall cluster scalability, so you can start from two nodes and grow to many.
Q: Does GTM timestamping use batching or pipelining? Does TBase currently support read consistency from standbys?
A: Yes, we do. We have also made many optimizations to GTM timestamping; in short, GTS can be allocated monotonically in parallel. Based on current scale and our estimates of customer scenarios, an x86 server can reach a timestamping capacity of 12 million QPS, and with stronger servers the overall capacity is higher, so this part is basically not a bottleneck.
Q: What security mechanisms does TBase have?
A: From the business perspective, as mentioned: security isolation, strong row-level security rules, column-level access control, and enhanced data encryption and desensitization. These are all applicable to an enterprise-grade database, and many enterprise service capabilities are already in TBase. We will iterate further as needed.