Author: Huang Xiao, TUG Beijing leader, TUG 2020 MOA.
Nowadays, distributed databases are in full bloom. What should be considered when selecting a database architecture? At the TUG enterprise event at Lufax, Huang Xiao, leader of TUG Beijing, shared the common architecture and application scenarios of TiDB. The following content is compiled from the transcript of that talk.
This article is divided into three parts:
- Today's distributed database products are in a state of "full bloom"
- Some thoughts on database architecture selection in this landscape
- Common application scenarios of TiDB
Distributed database products
From the domestic database popularity ranking published by Motianlun, we can see that TiDB ranks first, with the established domestic databases DM, GBase, OceanBase, and PolarDB in second through fifth place. The curve trend in the figure above shows that domestic databases are in a period of vigorous development. For a distributed database, we care most about the following points:
- It can handle massive data;
- It is highly available;
- It is easy to scale, unlike the old approach of splitting databases and tables, where application changes were required, costs were high, and merging data back was very troublesome;
- It provides strong consistency.
Thoughts on database architecture selection
For an online service with high concurrency, the following points should be considered:
- Stability. For any online service, users can tolerate a slightly slower transaction, but they cannot tolerate frequent downtime. Stability is the first priority; without stability, efficiency is meaningless.
- Efficiency. Once the system is stable, the faster it is, the better the user experience. A user whose takeout order is accepted within seconds must feel great; if the order is only accepted after 30 minutes, the user will wonder whether the system is broken or the delivery rider is slacking off.
- Cost. With stability and efficiency in place, we need to ask whether the cost is worth it, because only when cost comes down can we make a profit.
- Security. Security is a problem we cannot avoid: whatever business we do, we worry about our data being leaked.
So in a database, we care most about stability, efficiency, cost reduction, and security. Beyond these four there is open source: when choosing a technology, I hope the database is open source, because then I have community support when I run into problems, and when I want to contribute to the product, I can iterate together with the community. For stability, we consider these aspects:
- Can this database take on more work? Does it offer additional diagnostic and high-availability capabilities?
- Is monitoring and alerting easy to set up in operations?
- Does it support smooth rolling upgrades, and do upgrades affect the business?
- Are there problems during data migration?
- Is data verification easy to do?
- How efficient is elastic scaling out and in during operations?
For performance, we care most about four points:
- First, low latency.
- Second, whether the transaction model matches what we normally use. As we all know, MySQL uses a pessimistic transaction model; what I hope for is to migrate to a new database while keeping our original usage habits.
- Third, high QPS, that is, whether the database can sustain heavy access. For example, if tonight's promotion triples the traffic, can the database withstand it? If not, what happens: does it go down completely, or does the server have an automatic performance-protection mechanism?
- Fourth, support for massive data. If it cannot support massive data, you have to discuss the schema design with the business side in advance and decide whether to shard databases and tables first, and with how many shards.
For cost, there are three main considerations:
- First, application access cost: whether applications can connect to the database easily, and whether communication and training are needed in advance.
- Second, hardware cost, i.e. CPU + memory + disk. For example, one distributed database is a scale-up product that requires 384 GB of memory, and not every Internet company can afford such high-spec machines. A machine commonly used in the Internet industry, with about 128 GB of memory, a 30- or 40-core CPU, and a 3.2 TB PCIe card, already costs more than 40,000 yuan, and high-density machines cost exponentially more. Choosing such a database in this scenario leads to very high hardware costs.
- Third, network bandwidth.
For security, there are three points to consider:
- First, whether the database has an audit function. Taking financial-industry databases as an example, users certainly want to audit who has accessed the data and what operations were performed on it.
- Second, whether data is recoverable: no matter what abnormal operation a user performs, the data can ultimately be retrieved from backups.
- Third, database permissions. We need to consider how fine-grained permissions can be, because in some special scenarios we want access control down to the table level or even the column level. For example, the ID card numbers, mobile phone numbers, and account passwords in personal information should not be visible to DBAs or other R&D staff.
Common application scenarios of TiDB
Business scale and volume
At present, our TiDB deployment has about 1,700 nodes across hundreds of clusters. The largest single cluster has about 40 nodes, and a single table holds hundreds of billions of records. We are still at a small scale of adoption and are exploring richer business scenarios. In terms of traffic, daily requests exceed 10 billion, and the peak QPS of a single cluster exceeds 100,000.
In what scenarios do we choose the TiDB database?
We choose a distributed database because it is elastic and scalable: I want capacity that can expand and shrink on demand, rather than endless re-sharding. Anyone who has used MySQL knows that when traffic grows, we have to split databases and tables: one becomes two, two become four, four become eight. The more we split, the scarier the numbers get, and the cost grows exponentially, while traffic does not necessarily grow exponentially. If I cannot bear the cost of splitting and simply don't split, takeout users will complain: an order used to be accepted in 5 seconds, but now nothing shows up after 10 or 20 seconds, and meanwhile we face competition from rivals. The industry's main answer to this is the separation of storage and compute. Often we are short of compute resources, not storage: storage may only need to grow 1.5x while compute needs to grow 4x, and the two do not match. A storage-compute-separated architecture solves this mismatch, so one of the reasons for choosing TiDB is that it separates compute from storage.
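The re-sharding pain described above can be made concrete with a small sketch. The hash-mod routing rule, shard counts, and key space below are all invented for illustration, not any production scheme:

```python
def shard_of(user_id: int, shard_count: int) -> int:
    """Route a user to a shard by hash-mod (a common MySQL sharding scheme)."""
    return user_id % shard_count

def moved_fraction(keys, old_count, new_count):
    """Fraction of keys whose shard changes after a resharding."""
    moved = sum(1 for k in keys if shard_of(k, old_count) != shard_of(k, new_count))
    return moved / len(keys)

if __name__ == "__main__":
    # Splitting 4 shards into 8 remaps half of all keys, so every doubling
    # is a large data migration on top of the doubled hardware cost.
    print(f"4 -> 8 shards: {moved_fraction(range(100_000), 4, 8):.0%} of rows must move")
```

With an elastic, storage-compute-separated database, this migration step disappears: nodes are added and the system rebalances itself.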
In the era of rapid Internet development, black swan events are common, and after one appears, traffic grows explosively in the short term. But explosive traffic growth does not mean that the number of your DBAs grows explosively, nor that your tooling does. In that case, no matter how much manpower the DBAs invest, they cannot split shards fast enough. Besides, from ordering machines to having them truly usable online, even if data-center load testing goes smoothly, at least a month may pass. So one of our big pain points is: when the business grows explosively, we have no time to re-shard.
During promotional campaigns there is a very steep traffic peak, and right after the event the peak drops away immediately. For this situation, before a big promotion the DBA must work with the business side on full-link load testing, splitting what needs splitting and expanding what needs expanding: one database into two, or one into four, or one into eight. After the event, those databases must be merged back by importing the data through DTS, and data consistency during the import must be considered. Both the business side and the DBA find this very painful.
So, across the three scenarios above, our biggest pain is the lack of storage-compute separation.
One of the reasons we chose TiDB is its storage-compute-separated architecture. On the compute side, tidb-server is the SQL engine, responsible for parsing and executing SQL; PD provides the metadata and timestamp services of the distributed database; and TiKV provides distributed storage that can scale out without limit.
In this architecture, storage is the TiKV cluster and compute is the tidb-server cluster. They are independent of each other: each can be scaled out or in without affecting the other components. This is a perfect answer to our demands, so we chose TiDB.
Financial-grade strong consistency scenario
In addition to the elastic scaling scenario, we also considered the financial-grade strong consistency scenario when adopting TiDB. Let me explain why this scenario was introduced.
Let's look at a problem we encountered on MySQL. MySQL 5.6 offers semi-synchronous replication, and MySQL 5.7 offers enhanced semi-synchronous replication, also known as lossless semi-sync, meaning semi-sync with less data loss. Before the commit succeeds, the binlog of the transaction is first shipped to a slave; only after the slave returns its ACK does the master apply the change in the InnoDB engine.
But this carries a risk: at this point the business side has not yet been told that the commit succeeded, yet the binlog has already been sent to the slave. If the master crashes now while the slave has applied the transaction, there is a risk of inconsistency.
Lossless semi-synchronous replication does not solve the data consistency problem.
Even with the semi-sync timeout set to infinity, it is still not a strongly consistent setup.
Although the timeout can be set to infinity, if the network between master and slaves breaks, no slave can return an ACK. MySQL later introduced MGR to solve this problem. MGR does solve strong consistency of data, but it does not solve scalability: an MGR group can accept at most nine nodes, and in both 5.7 and 8.0 MGR is very sensitive to network jitter; even second-level jitter can cause the write node to fail over. The community has found too many bugs in MGR's multi-primary mode, so those of us who use MGR now run it in single-primary mode to avoid transaction conflicts and avoid triggering more problems.
MySQL semi-sync does not solve the consistency problem; TiDB solves it through the Multi-Raft protocol.
In the TiKV layer, data is split into regions, each region has multiple replicas, and the replicas form a Raft group. Each Raft group has a leader responsible for reads and writes, which ensures that when the leader of a region group goes down, the remaining nodes elect a new leader to take over reads and writes. In this way, data written to a Raft group is not lost; at least, the failure of a single node loses nothing.
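The majority-commit rule behind this can be sketched in a few lines. This is a toy model of Raft's commit condition, not TiKV's actual implementation; the function name is an assumption:

```python
def is_committed(acks: int, replicas: int) -> bool:
    """A Raft log entry is committed once a majority of replicas store it."""
    return acks >= replicas // 2 + 1

# With 3 replicas per region, 2 acks are enough to commit a write.
# One node can then crash without losing the entry, and the surviving
# majority can elect a new leader that is guaranteed to hold it.
```

This is why a single-node failure cannot lose committed data, unlike semi-sync replication where one ACK from one slave is considered enough.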
Let’s take a look at a typical financial scenario that requires distributed transactions.
Cross database transaction scenario
In addition to strong consistency, financial systems also require transactions, and MySQL semi-sync, whose timeout cannot truly be set to unlimited, does not provide them across databases.
Local life-service companies strongly depend on merchants' fulfillment ability, and that fulfillment ability in turn depends on the system's data consistency and high availability. Take a takeout order as an example: when the order is recorded separately on the user side and the merchant side, a cross-database transaction is involved, and at that point MySQL's data consistency alone cannot be relied on.
Database and table sharding scenario
A typical scenario for a distributed database is database and table sharding.
For example, in a transfer under the user dimension, user A's account decreases by 100 yuan and user B's account increases by 100 yuan, but the two accounts may be on different data shards. We certainly do not want one commit to succeed while the other fails, so in the sharding scenario, distributed transactions must remain consistent.
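The cross-shard transfer above can be sketched as a tiny two-phase commit, which is the general idea behind Percolator-style distributed transactions. The `Shard` class, account names, and balances are invented for illustration:

```python
class Shard:
    """A toy data shard holding account balances."""
    def __init__(self, balances):
        self.balances = dict(balances)

    def prepare(self, account, delta):
        """Phase 1: check the change can be applied (balance must stay >= 0)."""
        return self.balances[account] + delta >= 0

    def commit(self, account, delta):
        """Phase 2: apply the change."""
        self.balances[account] += delta

def transfer(shard_a, shard_b, src, dst, amount):
    """Atomic cross-shard transfer: commit only if both shards can prepare."""
    if shard_a.prepare(src, -amount) and shard_b.prepare(dst, amount):
        shard_a.commit(src, -amount)
        shard_b.commit(dst, amount)
        return True
    return False  # neither leg is applied, so no half-finished transfer
```

Either both legs commit or neither does, which is exactly the guarantee a naive per-shard commit cannot give.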
Service-oriented (SOA) scenario
Another typical scenario for distributed transactions is service orientation (SOA).
As shown in the figure above, in a microservice architecture we want the yellow, blue, and green databases to stay consistent as a whole. How, then, do we ensure overall transaction consistency in this scenario?
Before we had a distributed database, the order business could write multiple transactions. When the MySQL cluster on the user side goes down, the MySQL cluster on the merchant side may not go down at the same time; a verification service can then find orders that exist on the merchant side but not on the user side, and the data can be repaired through out-of-band reconciliation. But this approach depends heavily on the business scenario and is very complex. The figure below shows the reconciliation logic: first, poll the cluster status to determine whether it is down.
If it is, determine whether the failure is on the merchant side or the user side. If it is on the merchant side, check the user side and pull its data over to backfill; if the merchant side's data is found to be lost, check the binlog on the user side to see whether the data can be pulled back. This drives the BCP (business check platform), which is effectively a business-level transaction verification mechanism; parsing the binlog and compensating the data into the other dimension is also a form of reconciliation logic.
One of the most popular concepts in the industry is the BASE flexible transaction, where BA stands for basic availability, S for soft state, and E for eventual consistency. Under BASE, the industry basically adopts two methods. One is TCC, that is, Try / Confirm / Cancel: each participant first tries the operation; if all of them can proceed, they confirm, and if any cannot, they cancel. The other, used when a transaction lasts a long time, is the Saga pattern. These are common industry solutions, but we found that relying only on this kind of reconciliation logic for the order business does not work well.
For example, in this scenario, what do we do when a data center in Beijing goes down, or when the network of a data center in Shanghai degrades? This is where we need to build the whole distribution link into sets (unitized deployment). A set means partitioning traffic by user dimension at the entry point: for example, users A, B, and C are all assigned to the first set, and users D, E, and F to the second. The two sets synchronize through DTS. When one set fails, it is briefly unavailable, and all of its traffic can be shifted to the second set; this way orders can still be placed in the other set and the service stays available. But set architecture is an invasive, bone-deep transformation, because it requires complete changes from the traffic entry point through the business logic down to the database.
Given that, not every business is willing to make this transformation, because it is very painful. Besides the order-type business there is also the account-type business. Order business writes records to multiple dimensions when an order is placed; account business, sitting at the financial layer, carries hard requirements, and finance has strong demands for multi-active deployment and remote disaster recovery.
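The entry-point routing of a set architecture can be sketched roughly as follows. The set names, the CRC32 hash rule, and the failover policy are assumptions for illustration; a real set deployment also has to handle data synchronization and in-flight state:

```python
import zlib

def primary_set(user_id: str, sets):
    """Stable assignment of a user to a set by hashing the user id."""
    return sets[zlib.crc32(user_id.encode()) % len(sets)]

def route(user_id: str, sets, failed=frozenset()):
    """Send the user to their primary set, failing over if it is down."""
    chosen = primary_set(user_id, sets)
    if chosen in failed:
        healthy = [s for s in sets if s not in failed]
        if not healthy:
            raise RuntimeError("no healthy set available")
        chosen = healthy[zlib.crc32(user_id.encode()) % len(healthy)]
    return chosen
```

When one set's data center fails, its users are simply re-routed to a surviving set, which keeps ordering available during the outage.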
There are three problems that cannot be solved:
- The balance cannot go below zero; it must be a rigid transaction with consistent data.
- What do we do when an IDC power or network failure takes down an entire data center?
- Set architecture can handle the order business within a single data center, but it is much harder for the account business: with set-based two-way replication, bad data spreads to multiple clusters, and retrieving the data at that point is very difficult.
These are the two pain points we meet in transactional business. First, reconciliation for orders does not work well, and the business side is not necessarily willing to cooperate with a full set transformation. Second, when the account-type business has strong consistency demands on its data, it cannot be fixed by reconciliation, and there is no good answer once the data goes bad. This is where our strong demand for financial-grade strong consistency comes from.
Solution: the Percolator distributed transaction model
Therefore, based on the above scenario demands, we chose the Percolator distributed transaction model.
For financial databases, we recommend choosing the pessimistic mode, because it is consistent with MySQL's original behavior, requires relatively few changes on the business side, and is therefore easier to stay compatible with. It also frees R&D from the cumbersome logic of reconciliation and splitting, letting them focus on their own business and saving cost. When using TiDB distributed transactions, we have two suggestions:
- First, pack small transactions together. TiDB runs distributed transactions, which require many network round trips. If every tiny write is executed as its own transaction, the repeated round trips add up to very long latency and hurt performance badly.
- Second, split large transactions. A particularly large transaction takes a long time to commit: the larger the transaction, the more keys it updates, and reads issued during the transaction must wait for it to commit, which seriously hurts read latency. So we recommend splitting large transactions.
Data platform scenario
The third scenario we encounter is the data platform scenario: facing massive data, the workload begins to become blurred and complicated.
Blurred and complicated here means that requests that used to lean toward AP, or toward data analysis, now need real-time data, and we want to serve them from TiDB.
Take a practical scenario of ours as an example: when we want to work out whether our hotel room prices are competitive, we crawl a lot of data for the calculation. The data must be real-time, and the calculation must not affect online room pricing. If this workload were not split out, continuous real-time computation running against the OLTP data on the same database would cause noticeable response delays.
In this scenario, we pull the data into TiDB through binlog synchronization. On this cluster we run a lot of computation and high-frequency point queries. With such a large volume of data, the write volume is also very high. RocksDB, TiDB's underlying storage, uses the LSM-tree model, a write-friendly data structure, so TiDB's write performance in this scenario meets our needs.
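Why an LSM-tree is write-friendly can be illustrated with a toy model: writes land in an in-memory memtable and are flushed to immutable sorted files in batches, turning random writes into sequential I/O. This is a drastic simplification, not RocksDB's actual design; the class name and limits are invented:

```python
class TinyLSM:
    """Toy LSM-tree: in-memory memtable plus immutable sorted 'SSTables'."""
    def __init__(self, memtable_limit=3):
        self.memtable = {}
        self.sstables = []          # newest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value              # cheap in-memory write
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # One sequential write of a sorted, immutable file.
        self.sstables.insert(0, dict(sorted(self.memtable.items())))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in self.sstables:             # newest table wins
            if key in table:
                return table[key]
        return None
```

Real engines add a write-ahead log, compaction, and Bloom filters on top, but the write path is the same: absorb writes in memory, persist them sequentially in batches.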
Such a cluster also serves a small number of report-type requests. So the first use is real-time computation; the second is building a search engine on this solution.
In finance, we have consumption data related to payments, receipts, and other vouchers. We want to extract this data from the various systems and gather it into a large data mart. Once formed, the data can be reused many times, for example for operational reports, real-time dashboards, the data mart itself, or by downstream data subscribers. Many mechanisms are involved in the data synchronization, such as binlog, message queues, or double writes by the business.
Here are some other scenarios we consider when using TiDB.
First, separation of hot and cold data. As the company's historical data accumulates over years of operation, the online data volume becomes very large. We import part of the historical data into a TiDB cluster, which also reduces cost appropriately.
Second, the company's internal logs and business monitoring data. Because TiDB's underlying LSM-tree storage is very write-friendly and can scale out without limit, it is well suited to this kind of log analysis.
Third, MySQL has many restrictions on online schema changes. To keep data from lagging, the change has to be throttled around business peaks, and paused when master-slave delay appears. Today most online schema changes use pt-osc or gh-ost, but with many sharded tables they take a long time; we either ask the business side to accept reduced write capacity while changing tables during certain hours, or find other workarounds. TiDB's second-level DDL solves this big pain point for us.
Fourth, a special scenario: business migrated from ES or HBase. The main problem with HBase is that it does not support secondary indexes, while business on ES migrates to TiDB because of ES's poor availability.
Fifth, a hot scenario this year: with the rise of 5G and the Internet of Things and the explosive growth of data volume, we see many demands for combining TP and AP. In this scenario we actually implement T+0 analysis of TP and AP workloads in one system. For example, during big promotions we compute campaign effectiveness to guide coupon issuance; there are many T+0 big-data analysis demands that are hard to meet with T+1 reports alone. With HTAP, however, we can analyze data online and feed it to the dashboards for decision-making, reducing trial-and-error and marketing costs.
These are the common architecture and application scenarios of TiDB. I hope they help you.