On November 19, 2019, Ant Financial held a conference themed "Peak Vision and Focus on New Financial Technologies" in Beijing, introducing the technology behind Alipay's 2019 Double 11, and releasing OceanBase 2.2 and the SOFAStack dual-mode microservice platform. We have organized and published the series of talks on the official Ant Financial technology account.
Yang Zhenkun, Senior Researcher at Ant Financial and founder of the OceanBase database, gave a talk titled "OceanBase: An Enterprise-Grade Relational Database for the Future" at the conference. The following is a transcript of the speech:
On October 2, the TPC-C benchmark results of the OceanBase database were published. Today I would like to report some of the story behind them, give you a brief introduction to our technical approach, and cover the TPC-C benchmark and OceanBase's testing and certification. Finally, a brief summary.
What we benchmarked is an online transaction processing (OLTP) system, but the greater significance lies not in the benchmark itself, but in the shift of transaction processing in relational databases from centralized systems to distributed systems. Moreover, the greater significance lies not in OLTP itself, but in OLAP. You may find this strange: you ran an OLTP benchmark, so how can its value lie in OLAP?
First, some background on databases and business applications, and the challenges that today's centralized databases face. Database product development must start from business requirements. After years of development, and especially after automated teller machines appeared in the mid-1980s, a very important database capability came into wide use: online transaction processing.
In fact, in addition to online transaction processing, businesses also need to analyze the transaction data held in the database, that is, online analytical processing.
But in recent years the situation has changed. OLTP and OLAP, which were originally handled by the same relational database, have been split into two systems: relational databases, sharded into multiple databases and tables, continue to do online transaction processing, while data warehouses handle business intelligence analysis, that is, online analytical processing.
Why has this happened? Because of the Internet. In just a few years, the Internet increased total transaction volume and data volume by hundreds or thousands of times. Even the fastest-evolving single-machine hardware cannot keep up with this pace, and even if a single machine could provide the needed processing capacity and storage, it would certainly not be economical.
Turning one system into two is very inconvenient. First, the data warehouse has no data of its own; the data has to come from the transaction processing database. So a bridge must be built to extract, transform and load (ETL) the data from the transaction database, and ETL is not real-time; otherwise the data warehouse would itself become a transaction database.
Second, sharding the transaction database into multiple databases and tables brings many challenges. For example, the order number needs to be globally unique. With a single database this is easy to handle. With multiple databases, a field must be added to the order number to indicate which shard it belongs to, so each shard only needs local uniqueness. What do you do when the number of shards grows from two digits to three digits? That is business expansion. And what about business contraction? The business has to make many changes.
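The shard-prefixed order-number scheme described above can be sketched as follows. This is a minimal illustration, not Alipay's actual scheme; the class name, digit widths and sequence format are all hypothetical:

```python
# Minimal sketch: globally unique order IDs built from a shard prefix
# plus a per-shard local sequence (all names and widths hypothetical).
class ShardedIdGenerator:
    def __init__(self, shard_id: int, shard_digits: int = 2):
        self.shard_id = shard_id
        # Widening this digit count is the painful "expansion" step:
        # every consumer that parses the ID format must change with it.
        self.shard_digits = shard_digits
        self.local_seq = 0

    def next_order_id(self) -> str:
        self.local_seq += 1
        # The shard prefix makes the ID globally unique even though
        # each shard only guarantees local uniqueness of its sequence.
        return f"{self.shard_id:0{self.shard_digits}d}{self.local_seq:010d}"

gen = ShardedIdGenerator(shard_id=7)
print(gen.next_order_id())  # "070000000001"
```

Shrinking is just as disruptive: retiring a shard means its prefix must never be reissued, which is exactly the kind of business-level change the speaker is pointing at.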
The third issue is the data warehouse itself. A data warehouse is inherently oriented toward a particular subject. We have never heard of a relational database being oriented toward a subject: a relational database is just a relational database, and you can build indexes and materialized views on it as needed. A data warehouse, however, can only serve a particular subject; if there are multiple different subjects, multiple data warehouses must be built. Similar subjects can be merged, but that does not change the subject-oriented nature of the warehouse, and it causes a great deal of data redundancy. Another problem is timeliness, because a data warehouse is essentially not updated in real time.
The transaction processing database cannot scale out because it is centralized, and the most important reason is the consistency guarantees that transactions require. Throughout more than half a century of database development, transaction processing has always been handled by centralized systems.
Another reason is that online transaction processing systems demand very high availability and reliability, and the inherent weakness of distributed systems is reliability: when many machines are put together, overall reliability declines exponentially unless special techniques are used. For example, if you put together 100 machines that each offer five nines (99.999%) of reliability, the reliability of the whole system is only about three nines (99.9%), and no critical business dares to run on a three-nines system.
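The five-nines arithmetic above can be checked directly. This assumes the naive case the speaker is describing, where the system needs all 100 machines up at once, with no replication:

```python
# Worked check of the claim above: 100 machines, each 99.999% reliable,
# in a system that fails if any single machine fails.
per_machine = 0.99999          # five nines per machine
machines = 100
system = per_machine ** machines
print(f"{system:.6f}")         # ~0.999001, i.e. roughly three nines
```

The majority-replication technique introduced later in the talk is precisely the "special technology" that breaks this multiplication: with redundancy, adding machines can raise availability instead of lowering it.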
For these two reasons, transaction processing systems remained centralized for many years. When we built OceanBase's distributed transaction processing system, there were many doubts, and we kept hearing this question until last year. Finally, the company decided to pursue an online transaction processing benchmark certification. This benchmark is not just a score run. First you must prove that your system can do transaction processing and satisfies the ACID properties of transactions: atomicity, consistency, isolation and durability. Only after passing the ACID tests can the subsequent tests proceed. So the TPC-C benchmark is not, as some have said, something where you can pile up machines and run a high score.
If the ball in the picture represents 100 yuan, the most common financial scenario is that A transfers 100 yuan to B. The biggest difficulty in this transfer is that the process must be atomic, with no intermediate state: if A transfers the money out, B must receive it; if B does not receive the money, A must not have transferred it out.
If the two accounts are on one machine, there are mature ways to handle this; if they are on two machines, it becomes very difficult. How do you coordinate two machines to do this at the same time? Database designers do it in two phases. In the first phase, account A is checked to see whether it can transfer the money out, and account B is checked to see whether it exists and can receive the transfer. If either check fails, for example if A has no money to transfer, the transfer is cancelled. In the second phase, A is told to deduct 100 yuan and B is told to add 100 yuan.
In fact, this method has a major flaw: what if the first-phase checks all pass, but in the second phase machine A suddenly has a problem? Should B still add the 100 yuan? According to the protocol, since the first phase succeeded, B's 100 yuan should be added. But if machine A is completely broken and a standby machine takes over, and the standby has no record of this transfer, it knows nothing about it, so B's 100 yuan must not be added. On the other hand, if machine A was merely under high CPU load, or the network was congested for a while and then recovered, the 100 yuan has in fact been deducted, and it would be wrong for B not to add the 100 yuan. B cannot tell these two cases apart. As a result, for many years there was no distributed database that could be used for transaction processing.
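The two-phase transfer above, and its weak point, can be sketched in a few lines. This is a toy model, not any real database's implementation; the class and function names are hypothetical:

```python
# Minimal sketch of the two-phase transfer described above. The flaw:
# after voting "yes" in phase 1, a participant cannot distinguish a
# crashed coordinator from a slow one, so it does not know whether to apply.
class Account:
    def __init__(self, balance: int):
        self.balance = balance
        self.prepared = None  # pending change after a successful phase-1 vote

    def prepare(self, amount: int) -> bool:
        # Phase 1: vote yes only if the change is possible.
        if self.balance + amount < 0:
            return False
        self.prepared = amount
        return True

    def commit(self):
        # Phase 2: apply the prepared change.
        self.balance += self.prepared
        self.prepared = None

    def abort(self):
        self.prepared = None

def transfer(a: Account, b: Account, amount: int) -> bool:
    # Coordinator: run phase 1 on both accounts, then phase 2.
    if a.prepare(-amount) and b.prepare(amount):
        # If the coordinator dies HERE, b has voted yes but cannot know
        # whether a committed: exactly the dilemma described in the text.
        a.commit()
        b.commit()
        return True
    a.abort()
    b.abort()
    return False

a, b = Account(100), Account(0)
transfer(a, b, 100)
print(a.balance, b.balance)  # 0 100
```

The comment inside `transfer` marks the window where a coordinator failure leaves B stuck, which is the problem OceanBase's majority replication is later said to remove.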
Is there a transaction processing system that can scale out and scale in at any time? That is exactly the goal the company set for OceanBase when it was founded in 2010.
OceanBase's technical solution
When the OceanBase project was established, it was first of all driven by the market. Our traffic and data volumes had grown by dozens or even hundreds of times, making traditional databases hard to use, or simply unaffordable.
Second, online transaction processing is a real-time system and cannot stop; otherwise we could not pay with Alipay for food or rides.
Third, the data in the database cannot be wrong. But how can software have no bugs? And this is a real-time transaction database. At that time Taobao and Alipay had tens of thousands of databases, which gave this project two advantages: first, the economic value was large enough, given how much was being paid for commercial database systems; second, so many businesses provided fertile soil for a new database to grow. We often talk about the countryside surrounding the city: we could always find some relatively peripheral businesses to start with.
At the beginning of the oceanbase project, two important objectives were set:
- The system should be able to scale horizontally;
- The system must be highly available, despite using commodity hardware.
Now let's look at how OceanBase solves the problem of high availability.
This figure is a schematic of traditional primary/standby database replication: the primary database processes transactions and synchronizes them to the standby. If the standby is required to be fully consistent with the primary, every transaction must be synchronized to the standby in real time. If the standby fails, or the network between primary and standby fails, transactions pile up on the primary, and in a very short time the primary is overwhelmed and the service becomes unavailable, which may be even worse than data errors. People ask why the data warehouse, which is also distributed, does not worry about machine failures. The root cause is that data warehouse data is not updated in real time: if such an exception occurs, it can simply pause the updates.
Our approach is to add another standby. The primary synchronizes each transaction to two standbys, and as long as one standby receives it, then together with the primary at least two databases have it. The key here is the majority: every transaction lands on at least two of the three databases, so if any one database fails, even the primary, every transaction still exists on at least one surviving machine. This is how we make the system highly available.
The probability of two machines failing at the same time is very low for natural failures, but not necessarily when human factors are involved. For example, if a machine is deliberately shut down to replace a component for an upgrade, then combined with a natural failure, two machines may be down at once. Therefore, more critical businesses write five copies of the transaction log and three or four copies of the data. Even if one machine is shut down manually while another fails naturally, the whole business system remains available.
Back to distributed transactions, OceanBase's approach is this: we turn every physical node into a Paxos group, which is equivalent to turning it into a virtual node backed by three or five physical nodes. Under the majority-success protocol, if two of three, or three of five, nodes write successfully, the write is judged successful. This solves the problem that two-phase commit cannot make progress when one machine fails. Through such seemingly simple means we solved the distributed transaction problem.
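The majority rule described above can be sketched as follows. This is an illustration of the quorum idea only, not OceanBase's actual Paxos implementation; the names are hypothetical:

```python
# Minimal sketch of majority-quorum replication: a "virtual node"
# backed by several replicas treats a write as durable once a
# majority of replicas acknowledge it, so one dead replica is tolerated.
class Replica:
    def __init__(self, alive: bool = True):
        self.alive = alive
        self.log = []

    def append(self, entry) -> bool:
        if self.alive:
            self.log.append(entry)
            return True
        return False  # a failed replica never acknowledges

def replicate(replicas, entry) -> bool:
    acks = sum(1 for r in replicas if r.append(entry))
    return acks >= len(replicas) // 2 + 1  # majority quorum

group = [Replica(), Replica(), Replica(alive=False)]  # one replica down
print(replicate(group, "txn-1"))  # True: 2 of 3 acks is a majority
```

Because each two-phase-commit participant is now such a group rather than a single machine, the loss of any one physical node no longer stalls the commit.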
OceanBase took on some business at CCB relatively early. The vast majority of Ant Financial's databases run on OceanBase, with the rest being migrated continuously. Alibaba was the first to adopt it in a project, and financial institutions such as MYbank and the Bank of Nanjing are also moving a large amount of business onto OceanBase. The Bank of Xi'an is a customer developed this year, and has already migrated Oracle business to OceanBase.
The TPC-C benchmark was born in the 1980s. With the advent of ATMs, database vendors all wanted to promote their online transaction processing systems. Each vendor had its own way of benchmarking online transaction processing; with no uniform standard, the results were not convincing, and users could not compare systems against each other.
At this point Jim Gray, together with a number of academic and industrial authorities, proposed the DebitCredit benchmark. Although the standard was published, database vendors did not strictly follow it in their tests, but freely tampered with it to make their own results look higher. It was like having laws without a law enforcement team: everyone interpreted and applied the law according to their own understanding.
Omri Serlin was remarkable: he persuaded eight companies to establish the TPC and formulate the TPC series of standards, which was equivalent to legislation. At the same time, the TPC supervises and audits the test process and test results, which is equivalent to law enforcement. From then on, database benchmarking had a unified standard.
The TPC-C benchmark has been revised continuously, with many versions over the years, on the one hand adapting to changes in business, and on the other to changes in hardware and software. Even today it remains a very universally applicable scenario, whether in finance, transportation, communications or elsewhere.
TPC-C tests five kinds of transactions in total. The largest share is order creation, the New-Order transaction. TPC-C is a sales model: each order contains at most 15 items, 10 on average. The model takes the warehouse as its unit; each warehouse has 10 sales districts, and each district serves 3,000 customers. The test simulates a customer going to a sales district to buy things, purchasing between 5 and 15 items. Because the purchased items are not necessarily all in the local warehouse, each item is assumed to have a 1% probability of coming from another warehouse, so in a distributed system about 10% of order-creation transactions are distributed transactions. The simulation also requires that 1% of orders be rolled back. The performance metric, tpmC, is the number of orders created per minute. Out of every 100 transactions, 45 are order creations and the other 55 are the remaining transaction types, 43 of which are order payments. In order payment there is a 15% probability that the payment is not made at the local warehouse but at a remote warehouse, which again makes it a distributed transaction.
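The "about 10% distributed" figure above follows directly from the per-item probability, which can be checked with a line of arithmetic using the standard TPC-C parameters:

```python
# Quick check of the figure quoted above: with each of ~10 items having
# a 1% chance of living in a remote warehouse, the chance that an order
# touches at least one remote warehouse is roughly 10%.
p_remote_item = 0.01
avg_items = 10
p_distributed = 1 - (1 - p_remote_item) ** avg_items
print(f"{p_distributed:.3f}")  # 0.096, i.e. about 10% of New-Order transactions
```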
This is the TPC-C test model: it simulates a person going to a sales terminal to buy something. The request is sent to an application system, which you can imagine as a simplified Taobao or Alipay, and then on to the database. The entire application system and database must be disclosed: what machines are used, their configurations, and their prices. Terminal emulators do not need to be disclosed. There are many hard requirements. Each warehouse has nine tables, and there are rules on how much data each table holds and how that data is distributed. If you do not meet the requirements, you cannot run the test.
At most 12.86 tpmC is allowed per warehouse. The 60-odd million tpmC we achieved therefore required about 4.8 million warehouses, with about 336 TB of raw data, and our data is stored in two copies. The specification requires that the system tolerate the failure of a single component. With shared storage, that means a single component inside the shared storage is allowed to fail, and as you know, if a single component of shared storage fails, the shared storage as a whole does not fail. That is no use to us: we run on virtual machines, so to pass this test we must write at least two copies of every piece of data, so that if any one machine fails our system still works normally.
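The warehouse count quoted above is a straightforward division; the sketch below uses the rounded 60 million figure from the talk, so it lands slightly under the "about 4.8 million" mentioned for the full result:

```python
# Back-of-envelope check: 12.86 tpmC is the per-warehouse cap in the
# TPC-C specification; dividing the (rounded) result by it gives the
# minimum warehouse count the benchmark configuration must load.
tpmc_total = 60_000_000
tpmc_per_warehouse = 12.86
warehouses = tpmc_total / tpmc_per_warehouse
print(f"{warehouses / 1e6:.2f} million warehouses")  # 4.67 million
```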
In addition, the functional tests come before the performance tests. There are many functional tests; the key is to prove that you satisfy the database transaction properties: atomicity, consistency, isolation and durability. Isolation requires serializability, which is relatively difficult, especially for a distributed database. Among today's distributed databases, besides OceanBase, there is also Google Spanner.
There are also two requirements for the performance run: first, it must run stably for 8 hours without any manual intervention; second, performance must be measured for at least 2 hours, with fluctuation during that window not exceeding 2%. These are the requirements of a real production system. During the measurement window the transaction mix must be maintained, including the proportions of order creation and order payment; under that premise, the number of orders created in the window is recorded, yielding the real tpmC value. Because the auditor thought our results were remarkably high, OceanBase ran the performance measurement for the full 8 hours, with overall fluctuation under 0.5%, because we did not want to leave anyone room to question the result.
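The stability criterion described above can be sketched as a simple peak-to-trough check. The sample numbers below are made up for illustration; the 2% tolerance is the requirement quoted in the talk:

```python
# Minimal sketch of the fluctuation check: sample throughput across
# the measurement window and compare peak-to-trough spread to the cap.
def fluctuation(samples) -> float:
    lo, hi = min(samples), max(samples)
    return (hi - lo) / hi

tpmc_samples = [59_900_000, 60_100_000, 60_000_000, 59_950_000]  # illustrative
print(fluctuation(tpmc_samples) < 0.02)  # True: well within the 2% requirement
```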
The results are now in the 60-day publicity period, during which anyone may claim that the results do not conform to the standard or involve cheating, and you must stand up and prove that you really do conform to the specification. After 60 days the result becomes what is called active, and from then on anyone in your region can buy the system at the published price.
Much hardware goes out of production after three years, so after three years the result is treated as a historical record. The record still stands, valid and legal, but you can no longer buy the system.
Let's compare three results from this history:
First, August 2010, when OceanBase had just been launched and we were still thinking about how to survive. In 2010, DB2 achieved a result of more than 10 million with 3 hosts and 30 storage units. Less than 4 months later, Oracle achieved more than 30 million with 27 SPARC hosts and 97 storage units. Storage was the bigger bottleneck.
Many people also ask: why did Oracle not do this again for so many years? Oracle did. In 2012 Oracle published a single-machine x86 result of more than 5 million, and again in 2013, on a better machine, more than 8 million. But Oracle had already done more than 30 million, so why publish 5 million and 8 million? My own view is that it was to deter other vendors. The 30-million result used 27 machines, a little over 1 million each on average, while the 2012 and 2013 results showed 5 million and then 8 million on a single machine. The implied message: if you try again, I can get 8 million per machine, and even if scaling is not linear, that is a frightening result. Unless there was a breakthrough in distributed systems, no database could surpass Oracle's performance results.
One more thing changed: we did not buy a lot of hardware. In the end we used 204 database servers, plus 3 management nodes and 3 monitoring nodes, 210 in total. We used virtual machines; a physical machine has roughly 50% more memory and CPU than the virtual machines we used, so if we ran this test on physical machines the result would be about 50% higher, since virtualization also carries some overhead.
The application system used 64 servers, which must also be disclosed. Some people question: who can afford a test costing 380 million yuan? The 380 million means that if a user buys the system and runs it for three years, the total price of hardware, software and technical support is 380 million yuan. For us, the three-year hardware cost is the cost of renting virtual machines, and hardware accounts for less than one fifth of the total system price. Rented from Alibaba Cloud, that fifth is a 36-month figure, while our actual test used the machines for only about three months. From that you can estimate the real hardware outlay of the test.
The greater value of this test is not OLTP itself, but proving to others that a distributed database can do transaction processing, and more importantly, that this database can do business intelligence analysis as well as transaction processing. In most scenarios, users and customers will no longer need to build a separate data warehouse and copy data out of the transaction system. Otherwise, a system used only for transactions would not deliver enough value.
Finally, a brief summary. Since the 1980s, transaction processing and business intelligence have been core requirements of relational databases, yet the centralized architecture developed over the years has limited scalability, especially since the rise of the Internet and mobile Internet. The TPC-C benchmark does not care what we or anyone else builds: it defines the business requirements, namely order creation, order payment, order query and order delivery. As long as those business requirements are met, even a file system could take the test, and measuring a good result would still be a real achievement.
Through this test we wanted to prove that we are the first distributed database with transaction processing capability, something that did not exist before. The most important thing OceanBase will do next is provide not only relational database functionality but also business intelligence capability, offering customers both transaction processing and business intelligence analysis. Thank you.