How did OceanBase take first place in the TPC-C benchmark?

Date: 2019-11-02

Alimei's note: TPC-C is the benchmark for the order-creation and order-payment workloads of a commodity sales business, defined by the TPC (Transaction Processing Performance Council), and it is the authoritative benchmark for database online transaction processing (OLTP) systems.

OceanBase, a distributed relational database developed by Ant Financial, has attracted a lot of attention since it took first place in the TPC-C benchmark. In this article, core members of the OceanBase R&D team give a technical interpretation of the test.

I. How OceanBase ran the TPC-C test

Taking on the TPC-C benchmark is arguably the dream of every database kernel developer, but the TPC-C specification is very complex. Since this was the first time a Chinese database, and the first time a distributed database, attempted the ranking, the OceanBase team spent more than a year preparing for the challenge.

Advance preparation

A TPC-C test first requires engaging the officially certified auditors to audit and supervise the test. There are only three certified auditors in the world, and they attached great importance to the OceanBase audit.

Testing system

At present there is no off-the-shelf tool that implements the TPC-C specification. Take BenchmarkSQL, the tool most commonly seen in vendor POC environments, as an example: it is merely a stress tool that simulates the TPC-C pressure model. Even its data loading is non-compliant, since it does not guarantee global randomness for the large volume of generated strings; its stress phase lacks the mandatory think time and keying time, so it produces an artificially high tpmC with a small amount of data; and, most critically, it simplifies the test model into a tool that connects directly to the database, which does not comply with the TPC-C specification at all.

In the standard, the test system is divided into two parts: the RTE (Remote Terminal Emulator) and the SUT (System Under Test). In practice, by role, the SUT can be further divided into the WAS (Web Application Server) and the DB server.

The DB server is the database service of the vendor under test. The RTE plays the role of the client terminals in the test model; transaction triggering and response-time statistics are done there. The standard explicitly requires that every user terminal keep a long-lived connection, which the database obviously cannot bear with such a massive number of warehouses. The WAS is the bridge between the RTE and the DB: the standard allows it to use connection pooling, and it manages all requests transparently.

Of the three roles, the WAS and the DB must be commercially available products. OceanBase used OpenResty as its WAS this time. The RTE is generally implemented by each vendor according to the standard, but all of its code must be strictly audited. OceanBase implemented a complete, fully compliant RTE from scratch that supports deployment in a very large test system. Note that an actual TPC-C test stresses not only the DB side: the RTE side also consumes a great deal of resources. For the 60.88 million tpmC result, we ran 960 RTE clients on 64 cloud servers (each with 64 cores and 128 GB of memory) to simulate a total of 47,942,400 user terminals, and finally had to run the consistency and durability audit verification against the statistics collected by all of these RTEs.
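
To put these numbers in perspective, a rough calculation from the figures above (using the TPC-C rule of 10 terminals per warehouse and the 12.86 tpmC cap per warehouse discussed in the next section):

\[
\begin{aligned}
\text{warehouses} &= \frac{47{,}942{,}400\ \text{terminals}}{10\ \text{terminals/warehouse}} = 4{,}794{,}240,\\
\text{tpmC ceiling} &\approx 4{,}794{,}240 \times 12.86 \approx 61.7\ \text{million},\\
\text{terminals per RTE client} &\approx \frac{47{,}942{,}400}{960} \approx 49{,}940.
\end{aligned}
\]

The reported 60.88 million tpmC is therefore roughly 98.7% of the theoretical ceiling for that warehouse count.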

Although it is only a test client, the RTE contains many small details that can cause the final test to fail. For example, one that is easy to overlook: because the terminals are simulated through a web front end, 100 ms must be added to the response time of every transaction to account for the display delay of the terminal output.

Test planning

TPC-C has never been a simple test. Its approach is quite scientific: it does not mandate a particular hardware or software configuration, but instead defines the test specification and a set of audit limits. Every database vendor can tune freely according to its own characteristics to obtain the best performance or price/performance. At the same time, this poses a hard problem for every vendor: how to plan the available resources so that an attempt at the TPC-C ranking actually succeeds.

  1. Hardware selection: this covers not only the database servers but also the RTE and WAS servers. A lot of testing and tuning had to be done in advance to work out, under cost-performance constraints, how many resources each role should get. The biggest advantage of this OceanBase test is that it used cloud resources throughout: we did not have to worry about low-level details such as machine rooms, racks and cabling, and could obtain the machine types we needed simply by adjusting instance specifications.
  2. Parameter selection: choosing appropriate configuration parameters is a real headache. One of the most typical questions is how many warehouses to load and how many warehouses to host on each database server. To model real business scenarios as closely as possible, the TPC-C standard defines a think time and a keying time for each transaction type, which caps the contribution of one warehouse at 12.86 tpmC (see the worked numbers after this list). To reach a higher result a vendor must therefore load more warehouses; on the other hand, roughly 80% of a machine's storage space (an empirical value) has to be reserved for 60 days of incremental data. Under a distributed architecture, getting every database server fully loaded while making the best use of the CPU, memory, IO and storage space on each of them took us nearly a month of adjustment and optimization.
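
As a rough sanity check of the 12.86 figure (our approximation from the minimum keying times and mean think times in the specification, which are not reproduced here): each warehouse drives 10 terminals, the mandated keying and think times give a weighted cycle of roughly 21 seconds per transaction, and about 45% of transactions are New-Orders, so

\[
10 \times \frac{60\ \text{s}}{\sim 21\ \text{s}} \times 0.45 \approx 12.86\ \text{tpmC per warehouse},
\qquad
W_{\min} \approx \frac{\text{target tpmC}}{12.86}.
\]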

Performance measurement

The part of the TPC-C standard that draws the most attention, the performance stress test, specifies the following three phases:

  1. Ramp-up: the standard allows each database a warm-up period to reach a stable state, but every configuration used during ramp-up must be identical to the configuration in the final report.
  2. Steady state: while maintaining ACID properties and the serializable isolation level, the database must be able to run at the finally reported tpmC for more than 8 hours without any human intervention, completing a checkpoint every half hour.
  3. Measurement interval: the standard requires the system to be capable of 8 hours of stable operation, but the measurement interval itself only needs to last at least 2 hours. During this interval the cumulative tpmC fluctuation must not exceed 2%, and at least 4 checkpoints must be completed.

A typical TPC-C performance run therefore goes like this: warm up for a while until steady state is reached, keep running stably for some time and complete a checkpoint, and then enter the 2-hour measurement interval.

OceanBase completed this performance test with the most rigorous process used in TPC-C so far: 10 minutes of warm-up, then 25 minutes of steady-state running at 60.88 million tpmC including a checkpoint, and then a full 8-hour measurement interval. The whole run took more than 8.5 hours, with a maximum performance fluctuation of less than 0.5%, a result that delighted even the auditors.

Before and after the performance run, the auditors also check the random distribution of data and transactions, which in practice means a large number of full-table-scan and aggregation SQL statements. The largest of these has to scan an order-line table of more than a trillion rows, something of a "TPC-H test inside TPC-C". When we first ran these statements during the on-site audit we hit a large number of SQL timeouts, but with the new parallel execution framework in OceanBase 2.2 we soon got these large queries under control. Passing TPC-C therefore requires not a database that excels at only one thing, but a general-purpose relational database with no weak spots. In this regard, Oracle is still a model for OceanBase to learn from.

ACID

  1. The A (atomicity) test verifies atomicity by committing and rolling back Payment transactions. As with the I test, OceanBase ran the A test twice, once for local transactions and once for distributed transactions.
  2. The C (consistency) test in the standard includes 12 cases, of which the first four must be verified. Each case is essentially one complex, large SQL statement, and the key for a distributed database is to guarantee global consistency at all times.
  3. For the I (isolation) test, the standard requires that all TPC-C transactions except Stock-Level satisfy the highest isolation level, serializable, and it constructs nine cases to verify this. For a distributed database this requirement is not easy to meet. Fortunately, OceanBase has provided global timestamp support since version 2.x, so at the auditors' special request the entire I test was run in two rounds, one fully local and one fully distributed. This is probably the first time in TPC-C history that a vendor has had to run two rounds of I tests, 18 cases in total; perhaps in the near future the TPC-C standard will even be updated for distributed databases as a result of OceanBase's test.
  4. In the D (durability) test, OceanBase has a natural advantage over traditional databases: every warehouse's data is kept as two data replicas and three log replicas, and Paxos-based strong synchronization guarantees an RTO of only seconds with RPO = 0.

For the most stringent item in the D test, permanent failure of a storage medium, OceanBase also chose the most rigorous scenario. Under full load of 60 million tpmC, which exceeds what the standard requires, we forcibly destroyed a cloud server node; overall throughput recovered to 60 million tpmC within two minutes and kept running until the end of the test, far exceeding the TPC-C requirement. By comparison, traditional databases basically rely on disk RAID to survive the storage-medium and log failure scenarios of the D test. OceanBase should be the first vendor to pass all D tests without RAID, relying entirely on the database's own recovery mechanism.

We also killed two server nodes during the D test: first a data node, and then, after tpmC had recovered and stayed stable for 5 minutes, the RootServer leader node. In both cases tpmC recovered quickly.

II. SQL optimization for the TPC-C benchmark

TPC-C is a typical OLTP (online transaction processing) benchmark that examines a database's transaction processing capability under high concurrency. Its final metrics are tpmC (transactions per minute C, i.e. the number of New-Order transactions the system processes per minute in the TPC-C model) and the average system cost per tpmC. In OLTP scenarios the response time of each request is extremely short, so in a TPC-C test every vendor tries to compress each operation to the shortest possible time; it is no exaggeration to say that the optimization of some key operations has to go down to the CPU instruction level.

Before getting into the topic, let's look at the TPC-C transaction model. It consists of five transaction types: order creation (New-Order), order payment (Payment), order status query (Order-Status), order delivery (Delivery) and stock level query (Stock-Level), which occur in fixed proportions; the final metric counts the New-Order transactions completed per minute. As you know, a database transaction is composed of several logically related SQL statements that either all succeed or all fail; this is the "A" (atomicity) in ACID. So how long does a TPC-C transaction take? The report makes it clear: only a dozen or so milliseconds. Since a transaction consists of multiple SQL statements, each SQL statement takes less than 1 millisecond on average!

In the client/server (C/S) model, a SQL statement goes through client input, network transmission, SQL optimization, execution and returning the result to the client. The execution itself may only update a single field and takes very little time; over the whole path, however, a large share of the time is spent interacting with the client, wasting resources and adding latency. How do we solve this? The answer is stored procedures.

Stored procedures

A "stored procedure" is a procedural programming language that the database provides to its users. With it, users can encapsulate application logic into a procedure that is stored in the database and invoked at any time, so that work which would otherwise require many round trips to the database can be done in a single interaction, saving the intermediate network transmission and waiting time (see Figure 1). Suppose the network overhead of an average transaction is 30%, i.e. 30% of the CPU is spent receiving, sending and parsing network traffic; being able to reclaim that 30% of CPU and turn it into processing power in a 60-million-tpmC test is remarkable. Stored procedures also reduce transaction response time, shortening the critical sections held under transaction locks inside the database kernel, which indirectly improves CPU utilization and overall throughput, and they significantly reduce waiting time on the application side.

[Figure 1]

In TPC-C, stored procedures are very important for the execution efficiency of the whole system. OceanBase 2.2 not only fully supports stored procedures, it also optimizes how efficiently they execute.
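
As a concrete illustration, here is a minimal sketch of what wrapping part of the New-Order logic in an Oracle-mode stored procedure could look like. It is not the procedure used in the test: the procedure name and parameter list are hypothetical, while the table and column names follow the standard TPC-C schema.

    -- Hypothetical sketch: one server-side call replaces several client round trips.
    CREATE OR REPLACE PROCEDURE new_order_begin(
        p_w_id IN  NUMBER,   -- warehouse id
        p_d_id IN  NUMBER,   -- district id
        p_o_id OUT NUMBER    -- order id assigned to the new order
    ) AS
    BEGIN
      -- Read and advance the district's next order id in a single statement.
      UPDATE district
         SET d_next_o_id = d_next_o_id + 1
       WHERE d_w_id = p_w_id AND d_id = p_d_id
       RETURNING d_next_o_id - 1 INTO p_o_id;
      -- In a full implementation the order header and order lines would be
      -- inserted here as well; the whole block commits or rolls back atomically
      -- when the caller issues COMMIT or ROLLBACK.
    END;
    /

The application then needs only a single call per step (for example an anonymous block invoking new_order_begin) instead of one round trip per SQL statement.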

Compiled execution

A stored procedure is written in a high-level procedural language and must be translated into machine code before it can run. There are generally two ways to do this: compiled execution and interpreted execution. Compiled execution allows code optimization and is much more efficient than interpretation. OceanBase built a stored procedure compiler on the LLVM compiler framework, which has matured in recent years; using just-in-time (JIT) compilation, stored procedures are translated into efficient binary executable code, improving execution efficiency by orders of magnitude. LLVM first transforms stored procedures into machine-independent intermediate code, which naturally gives them cross-platform compilation and execution capability, and LLVM's built-in optimization passes ensure correct and efficient executable code on a variety of hardware platforms.

Array Binding

Another capability that plays an important role in the TPC-C test is batched DML execution, known in Oracle as "array binding". Executing a SQL statement in a database can be roughly divided into two stages: plan generation and execution. Even though execution plans are cached, looking up a suitable plan is still a time-consuming part of the overall process. Can this time be saved? When a group of statements share exactly the same execution plan and differ only in their parameters, a specific syntax in the stored procedure turns their execution into one batch, so "plan generation" has to be done only once; this is array binding.

With array binding, the database first looks up the plan to be used and then executes it repeatedly, redoing only the parameter binding before each execution, much like reassigning values inside a C for loop instead of redefining the data structure each time. Array binding is controlled by the user: the FORALL keyword in the stored procedure triggers it. We used array binding in many places during the TPC-C test to increase the system's processing capacity, with very noticeable results.
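
A minimal Oracle-mode PL/SQL sketch of array binding, here batching the insertion of the order lines of one New-Order transaction (the collection type, variable names and constant values are illustrative; order_line is the standard TPC-C table, with some of its columns omitted for brevity):

    DECLARE
      TYPE num_tab IS TABLE OF NUMBER INDEX BY PLS_INTEGER;
      l_ol_i_id num_tab;   -- item id of each order line
      l_ol_qty  num_tab;   -- quantity of each order line
    BEGIN
      -- Normally these arrays come from the transaction input; dummy values here.
      FOR i IN 1 .. 10 LOOP
        l_ol_i_id(i) := i;
        l_ol_qty(i)  := 5;
      END LOOP;

      -- One plan lookup, then a single batched execution for all ten rows.
      FORALL i IN 1 .. l_ol_i_id.COUNT
        INSERT INTO order_line (ol_o_id, ol_d_id, ol_w_id, ol_number,
                                ol_i_id, ol_quantity)
        VALUES (3001, 1, 1, i, l_ol_i_id(i), l_ol_qty(i));
    END;
    /

Compared with issuing ten separate INSERT statements, the plan is located once and the parameter arrays are bound in a single batch.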

Prepared statement and execution plan cache

A prepared statement is a binary request protocol that can greatly reduce interaction costs. OceanBase supports prepared statements not only between the client program and the database, but also internally, when the stored procedure engine calls the SQL engine. After a prepare operation, the SQL statement or stored procedure gets a unique ID; every subsequent execution only needs to pass in this ID and the parameters. The system finds the corresponding stored procedure or SQL plan in its cache and starts executing, saving the considerable CPU overhead of parsing SQL text on every request.

Internally, OceanBase caches both the compiled executable code of stored procedures and SQL execution plans. Stored procedures and SQL statements invoked with different parameters can fetch the execution objects they need from this cache, usually in well under tens of microseconds, avoiding the millisecond-level delay and CPU cost of recompilation.
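
Conceptually the flow is "prepare once, execute many times by ID". The sketch below uses the text-protocol PREPARE/EXECUTE syntax (MySQL-compatible mode) purely as an illustration; the test itself used Oracle mode and the binary prepared-statement protocol, which achieves the same effect at the wire level:

    -- Prepare once: the statement text is parsed and given a reusable handle.
    PREPARE get_item FROM
      'SELECT i_price, i_name, i_data FROM item WHERE i_id = ?';

    -- Execute many times: only the handle and the parameter are sent, and the
    -- cached plan is reused without re-parsing the SQL text.
    SET @id = 42;
    EXECUTE get_item USING @id;
    SET @id = 43;
    EXECUTE get_item USING @id;

    DEALLOCATE PREPARE get_item;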

Updatable view

In OLTP scenarios there are many ways to improve performance by reducing the number of round trips between the application and the database, and the updatable view is one of them. Ordinary database views are usually read-only: by defining a view, users describe the data they are interested in and its access interface. But a view can also serve as the entry point for updates. In the TPC-C New-Order scenario, for example, the application needs to fetch item information, update the stock, and get the updated values back. Normally this takes two SQL statements:

    SELECT i_price, i_name, i_data FROM item WHERE i_id = ?;

    UPDATE stock
      SET s_order_cnt = s_order_cnt + 1,
          s_ytd = s_ytd + ?,
          s_remote_cnt = s_remote_cnt + ?,
          s_quantity = (CASE WHEN s_quantity < ? + 10 THEN s_quantity + 91 ELSE s_quantity END) - ?
      WHERE s_i_id = ?
          AND s_w_id = ?
      RETURNING s_quantity, s_dist_01,
          -- i_data below is the value fetched by the preceding SELECT (stock has no i_data column)
          CASE WHEN i_data NOT LIKE '%ORIGINAL%' THEN 'G' ELSE (CASE WHEN s_data NOT LIKE '%ORIGINAL%' THEN 'G' ELSE 'B' END) END
      BULK COLLECT INTO ...;

But by creating an updatable view:

  CREATE VIEW stock_item AS
      SELECT i_price, i_name, i_data, s_i_id, s_w_id, s_order_cnt, s_ytd, s_remote_cnt, s_quantity, s_data, s_dist_01
      FROM stock s, item i WHERE s.s_i_id = i.i_id;

We can update the stock and retrieve both the item and stock information with a single statement:

UPDATE stock_item
      SET s_order_cnt = s_order_cnt + 1,
          s_ytd = s_ytd + ?,
          s_remote_cnt = s_remote_cnt + ?,
          s_quantity = (CASE WHEN s_quantity < ? + 10 THEN s_quantity + 91 ELSE s_quantity END) - ?
      WHERE s_i_id = ?
          AND s_w_id = ?
      RETURNING i_price, i_name, s_quantity, s_dist_01,
          CASE WHEN i_data NOT LIKE '%ORIGINAL%' THEN 'G' ELSE (CASE WHEN s_data NOT LIKE '%ORIGINAL%' THEN 'G' ELSE 'B' END) END
      BULK COLLECT INTO ...;

This saves one statement's worth of interaction and makes the update logic more intuitive. Updatable views let users operate on a view as if it were an ordinary table, but not every view can be updatable: for a view with DISTINCT or GROUP BY, for instance, it is unclear which underlying row an update should apply to, so updates are not allowed. Specifically, for the two-table join view on stock and item above, the updated table must remain key-preserved after the join, i.e. item.i_id must be unique.

It should be emphasized that the TPC-C specification prohibits materialized views; an updatable view does not change how the underlying tables are stored, so it complies with the specification.

Because TPC-C is designed to reflect the workload of a real OLTP system as faithfully as possible, many of the optimizations we made are broadly applicable. In a highly concurrent OLTP system, most SQL requests are very short; a pure client/server interaction model inevitably wastes time on frequent round trips between the application and the database. Stored procedures greatly reduce this interaction cost and make the system more immune to network jitter, a core capability that is indispensable for a distributed OLTP database.

In this TPC-C test we used the Oracle compatibility mode supported since OceanBase 2.0; all stored procedures and SQL use Oracle-compatible data types and syntax. This was done both to pursue the best possible optimization and to keep product iteration moving in a general-purpose, sustainable direction.

III. Challenges of the TPC-C benchmark for the database transaction engine

OceanBase's TPC-C test differs greatly in hardware from the Oracle and DB2 results on the list. OceanBase's database servers are 204 + 3 Alibaba Cloud ecs.i2.16xlarge ECS instances, 204 as data nodes and 3 as root nodes, all of which any reader can purchase on demand on the Alibaba Cloud website. If you read the TPC-C reports of Oracle and DB2, you will see that they use dedicated storage devices; for example, the 2010 test of Oracle, the previous record holder, used 97 COMSTAR dedicated storage devices, 28 of which stored the database redo logs.

This difference in hardware poses different challenges to the software architecture. With dedicated storage, reliability is guaranteed by hardware redundancy inside the device, so database software using it will not lose data; but the cost is high, because dedicated storage devices are extremely expensive.

OceanBase provides its database service on ordinary ECS servers and stores data only on each server's local disks, the most commodity hardware setup possible. This, however, puts great pressure on the software architecture, because a single ECS server is far less reliable than a dedicated storage device, and it is a major challenge for OceanBase's transaction engine to implement ACID on top of ordinary ECS servers.

The TPC-C test imposes complete and strict requirements on transactional ACID properties. The following describes how OceanBase addresses each of them.

Paxos log synchronization guarantees durability

Transaction durability in OceanBase is guaranteed by persisting redo logs. Every redo log is synchronously replicated in real time to two other database servers, so that, including the machine that generated it, three machines in total persist the redo log to disk.

OceanBase uses the Paxos consensus protocol to coordinate redo log persistence across these three machines. Paxos succeeds when more than half of the replicas (the "majority": two out of three) succeed, so a transaction can commit as soon as two of the machines have persisted the log; in the normal case the third machine completes persistence almost immediately as well. If that third machine happens to fail, the transaction commit is unaffected, and once the machine recovers it automatically catches up on the missing redo logs. If it fails permanently, the system redistributes its replication duties to other machines in the cluster, which automatically fill in the missing content and catch up with the latest redo log.
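
In general, with n = 2f + 1 replicas Paxos needs acknowledgments from a majority of f + 1 replicas to commit and tolerates f simultaneous replica failures; three replicas is simply the f = 1 case:

\[
n = 2f + 1, \qquad \text{majority} = f + 1 = \left\lfloor \tfrac{n}{2} \right\rfloor + 1, \qquad n = 3 \;\Rightarrow\; \text{majority} = 2,\ f = 1.
\]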

The biggest advantage of Paxos is the balance it strikes between data durability and service availability. With three replicas, if any one replica fails at any moment, at least one other replica still has the data, and writes can continue because the remaining two replicas still form a majority.

OceanBase therefore not only guarantees transaction durability but also greatly improves the database's ability to keep serving. When the TPC auditors audited OceanBase's durability on site, they picked a random machine in the cluster and forcibly powered it off while the clients kept applying pressure. They found that no data was lost and that the database kept serving without any human intervention, which surprised the auditors and earned their high praise.

Atomicity through automatic two-phase commit

Of the five TPC-C transaction types, "order creation" (New-Order) and "order payment" (Payment) each modify a large amount of data and are relatively complex. The TPC-C standard mandates transaction atomicity: the modifications a transaction makes to the warehouse, order, customer and other tables must take effect atomically; a partially successful transaction is not allowed.

OceanBase partitions its data across machines by warehouse ID. If every transaction stayed within a single warehouse, then no matter how large the data volume was, each transaction would only touch data on one machine and commit on that machine, a perfectly linear-scaling scenario, but not a realistic one: real business involves plenty of interaction across different dimensions. The TPC-C standard takes this seriously and imposes randomness rules on the data each transaction touches, guaranteeing that 10% of "order creation" transactions and 15% of "order payment" transactions operate on two or more warehouses. Inside OceanBase this produces cross-machine transactions, which must use a two-phase commit protocol to guarantee atomicity.

OceanBase automatically tracks which data every SQL statement in a transaction touches and determines the two-phase commit participants from where the modifications actually land. When the transaction starts to commit, OceanBase automatically picks the first participant as the coordinator, which sends a prepare message to all participants; each participant writes its own redo log and prepare log. Once the coordinator has confirmed that every participant's redo log and prepare log are complete, it sends a commit message to all participants and waits for their commit work to finish. The whole protocol runs automatically during commit and is completely transparent to users. Because a coordinator is chosen per two-phase-commit transaction and any machine in the system can take on the coordinator role, OceanBase's transaction processing capacity scales linearly.
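
From the application's point of view nothing changes. The sketch below shows a Payment-style transaction whose customer belongs to a different warehouse than the paying warehouse (the values are illustrative; the tables and columns follow the TPC-C schema); it is written as an ordinary transaction, and OceanBase decides internally whether one-phase or two-phase commit is required:

    -- Warehouse 1 and warehouse 2 may be hosted on different machines.
    UPDATE warehouse SET w_ytd = w_ytd + 100 WHERE w_id = 1;
    UPDATE district  SET d_ytd = d_ytd + 100 WHERE d_w_id = 1 AND d_id = 5;

    -- The customer is "remote": it belongs to warehouse 2.
    UPDATE customer
       SET c_balance = c_balance - 100,
           c_ytd_payment = c_ytd_payment + 100
     WHERE c_w_id = 2 AND c_d_id = 5 AND c_id = 1234;

    -- COMMIT automatically triggers two-phase commit because the transaction
    -- touched partitions on more than one machine; the application is unaware.
    COMMIT;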

Multi-version concurrency control ensures transaction isolation

The TPC-C standard requires the "order creation", "order payment", "order status query" and "order delivery" transactions to be serializable. OceanBase uses multi-version concurrency control: when a transaction commits, it requests a commit timestamp, and its changes are written into the storage engine as a new version without touching earlier versions. When a transaction begins, it obtains a read timestamp, and every read within the transaction sees only data committed as of that timestamp. Reads therefore never encounter dirty reads, non-repeatable reads or phantom reads. At the same time, a transaction holds row locks on the rows it modifies, so two concurrent transactions modifying the same row are mutually exclusive.
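
The behavior can be illustrated with two concurrent sessions (a schematic sketch; the "Session 2" statements are shown as comments, the table is the TPC-C stock table, and the exact error reported on conflict may differ by version):

    -- Session 1: start a serializable transaction and take a snapshot read.
    SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
    SELECT s_quantity FROM stock WHERE s_w_id = 1 AND s_i_id = 100;  -- sees, say, 50

    -- Session 2 (concurrently): updates the same row and commits.
    --   UPDATE stock SET s_quantity = 40 WHERE s_w_id = 1 AND s_i_id = 100;
    --   COMMIT;

    -- Session 1: the repeated read still uses its own snapshot.
    SELECT s_quantity FROM stock WHERE s_w_id = 1 AND s_i_id = 100;  -- still sees 50

    -- Session 1: tries to modify the row Session 2 already changed; under
    -- serializable isolation the write conflict is detected and the transaction
    -- must be rolled back and retried.
    UPDATE stock SET s_quantity = s_quantity - 1
     WHERE s_w_id = 1 AND s_i_id = 100;
    ROLLBACK;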

OceanBase's global timestamp service also runs with multiple replicas. It can be deployed on three dedicated machines, or, as in this TPC-C test, on the root node machines, sharing their resources. The three timestamp replicas form a highly available service: every timestamp acquisition is confirmed by at least two of the three machines, so the failure of any single machine has no impact on obtaining timestamps.

Following the TPC-C standard, OceanBase prepared 9 different scenarios to test transaction isolation under read-read and read-write conflicts, and all of them passed the auditors' review.

Consistency

With the transaction capabilities described above, OceanBase can fully guarantee the various data consistency constraints. The TPC-C standard defines 12 consistency scenarios that check the data in the database before and after the various test runs. Because OceanBase's test data is so large, the verification SQL has to examine a huge amount of data, so the real challenge is the efficiency of those SQL statements themselves. Using OceanBase's parallel query capability to harness the computing resources of the whole cluster, the verification SQL runtimes were shortened by orders of magnitude, and the consistency audit was completed smoothly.

Replicated tables

The TPC-C model contains an item table describing all the products sold by the simulated company, including their names, prices and other information. Every order-creation transaction has to read this table to determine the price of the order. If the item table were stored on only one machine, the order-creation transactions running on every other machine would all have to call that machine, which would become a bottleneck. OceanBase supports replicated tables: once the item table is declared a replicated table, its data is automatically replicated to every machine in the cluster.

The TPC-C standard does not limit how many copies of the data may exist, but however the data is organized, transactional ACID must still be guaranteed. OceanBase uses a special broadcast protocol to preserve ACID across all replicas of a replicated table: when the table is modified, all replicas are modified together, and when a machine fails, the replicated-table logic automatically removes the invalid replicas so that modifications never wait unnecessarily on a failed machine. Replicated tables are useful in many business scenarios, for example dictionary tables holding key reference data, or exchange-rate tables in financial systems.
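
In the DDL, a table can be declared replicated with a table-level attribute. The sketch below uses the DUPLICATE_SCOPE attribute; treat the exact attribute name, value and version support as an assumption rather than confirmed syntax (the column list is abbreviated from the TPC-C item table):

    -- Assumed syntax: declare the TPC-C item table as a replicated ("duplicate")
    -- table so that every server in the cluster holds a local copy for reads.
    CREATE TABLE item (
        i_id    NUMBER       NOT NULL,
        i_name  VARCHAR2(24),
        i_price NUMBER(5,2),
        i_data  VARCHAR2(50),
        PRIMARY KEY (i_id)
    ) DUPLICATE_SCOPE = 'cluster';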

IV. Storage optimization for the TPC-C benchmark

TPC-C requires that the reported performance (tpmC) of the database under test be proportional to the amount of data. The basic data unit of TPC-C is the warehouse, and each warehouse holds roughly 70 MB of data (depending on the implementation). The standard caps the tpmC obtainable from one warehouse at 12.86 (assuming zero database response time).

Suppose a system achieves 1.5 million tpmC: that corresponds to roughly 120,000 warehouses, or about 8.4 TB of data at 70 MB per warehouse. Some vendors run a modified "TPC-C" that does not meet the audit requirements and has no per-warehouse tpmC cap; measuring performance on a few hundred or a few thousand warehouses loaded entirely into memory is meaningless and could never pass an audit. In a real TPC-C test, storage accounts for a large share of the cost. OceanBase, the first shared-nothing database and the first LSM-tree-based database to top the TPC-C list, had to address the following key points in its storage architecture:

  1. For reliability, OceanBase stores two copies of the data and three copies of the logs, whereas traditional centralized databases store only one copy of the data in their TPC-C tests;
  2. Because OceanBase stores two copies of the data and its TPC-C test uses the same Alibaba Cloud i2 server model as production systems, SSD capacity becomes the bottleneck. OceanBase uses online compression to relieve this, which further increases CPU usage; centralized database tests, by contrast, store a single copy without enabling compression;
  3. OceanBase's LSM-tree engine has to run compaction in the background at regular intervals, while TPC-C requires the test to run for at least 8 hours with less than 2% jitter during the 2-hour measurement interval, so OceanBase's storage layer has to eliminate the jitter caused by the LSM engine's background work.

Two copies of data

To guarantee reliability and zero data loss (RPO = 0), there are two approaches: fault tolerance at the hardware level or at the software level. OceanBase chooses software-level fault tolerance, which has the advantage of cheaper hardware but introduces redundant storage of multiple replicas. OceanBase uses the Paxos protocol to keep data strongly consistent through single-machine failures: a write is considered successful only after it is synchronized to the majority (more than half) of the replicas, so the replica count is normally odd, and for cost reasons the most common deployment is three replicas.

The primary problem with three replicas is storage cost. Previous commercial database TPC-C tests mostly ran on disk arrays, and the TPC-C specification explicitly does not require the disk array to be disaster-tolerant, so going into a TPC-C test with three times the storage of a traditional database is clearly hard to accept.

We observed that only the logs are synchronized through Paxos and therefore need three copies; the data does not. Two copies of the data are enough to recover from a single-machine failure: when one copy becomes unavailable because its server is down, the other copy only needs to catch up through the logs and can then continue serving.

Logs take much less space than data. We therefore separate data from logs and define three replica types: an F replica holds the data and synchronizes the logs, and serves external reads and writes; a D replica holds the data and synchronizes the logs but does not serve external traffic; an L replica only synchronizes the logs and stores no data. When the F replica fails, the D replica can be promoted to F and start serving once its data has caught up. In the TPC-C test we deployed in F-D-L mode (one F, one D and one L replica), which uses only twice the storage of a single data copy. Both the D and L replicas have to replay logs, and the D replica also has to keep its data in sync, which consumes network bandwidth and CPU.

Online compression

Under the shared-nothing architecture, OceanBase has to store at least two copies of the data for disaster recovery, which means it needs twice the storage space of a traditional database.

To mitigate this, the OceanBase TPC-C test compresses data online. A warehouse that takes close to 70 MB in an Oracle database takes only about 50 MB in OceanBase after compression, greatly reducing storage. The TPC-C specification requires enough disk space to hold 60 days of data; since OceanBase keeps two copies, it gains reliability but must effectively store the equivalent of 120 days of data, and all of this storage cost is included in the overall price.

OceanBase stores its data on 204 ECS i2 cloud servers with the same specifications used by real online business. Each server has a 1 TB log disk and close to 13 TB of data disk. After accounting for 60 days of two compressed data copies, the data disks have little margin left, which from a resource-cost perspective is a good balance; if OceanBase's single-server tpmC improves further, disk capacity will become the bottleneck. The OceanBase LSM engine is append-only, which avoids random modification and makes online compression possible. Both in the TPC-C test and in core OLTP production systems (such as Alipay transaction payment), OceanBase enables online compression, trading CPU for storage space.
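
A back-of-the-envelope check using only the figures above (not numbers from the disclosure report): two compressed copies of a warehouse cost about 2 × 50 MB = 100 MB versus 70 MB for a single uncompressed copy, i.e. roughly 43% more space rather than twice as much, and the raw data capacity of the cluster is about

\[
\frac{2 \times 50\ \text{MB}}{70\ \text{MB}} \approx 1.43, \qquad 204 \times 13\ \text{TB} \approx 2.65\ \text{PB},
\]

which has to hold the initial data plus 60 days of growth for both copies.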

Smooth storage performance

A major challenge of the TPC-C test is that the performance curve must be essentially flat throughout the stress run, with fluctuation of no more than 2%. That is hard even for a traditional database, because it requires fine-grained control of all background tasks so that no background task hogs resources and causes foreground requests to back up. For OceanBase it is even harder: the storage engine is based on an LSM tree, which must run compaction periodically, and compaction is a heavy background operation that consumes a lot of CPU and disk IO and naturally interferes with foreground queries and writes. We made a number of optimizations to smooth out the impact of background tasks; in the final result, the jitter of the performance curve over the whole 8-hour stress run was below 0.5%.

  • Hierarchical dump

In an LSM tree, data is first written to an in-memory memtable. To free memory, the memtable must periodically be merged with the SSTables on disk; this process is called compaction. In many LSM-tree storage systems the SSTables are organized into multiple levels to address write performance: when the number or size of SSTables in one level reaches a threshold, they are merged into the next level. Multi-level SSTables solve the write problem, but too many SSTables slow down queries. OceanBase also borrows the layering idea, but uses a more flexible compaction strategy that keeps the total number of SSTables small, striking a better balance between read and write performance.

  • Resource isolation

Background tasks such as compaction consume a lot of server resources. To reduce their impact on user queries and writes, we isolate the resources of foreground and background tasks along four dimensions: CPU, memory, disk IO and network IO. For CPU, background tasks and user requests are placed in different thread pools and isolated by CPU affinity. For memory, foreground and background requests use separate memory management. For disk IO, we limit the IOPS of background-task IO and apply a deadline-based algorithm for flow control. For network IO, background-task RPCs and user-request RPCs go into different queues, and the bandwidth used by background-task RPCs is capped.

Storage CPU usage

The TPC-C benchmark focuses on overall tpmC, and many people also look at tpmC per core; however, that metric is only meaningful within the same architecture. Regarding CPU usage, three points are worth noting:

  1. In a centralized architecture, CPUs are needed not only for the database but also for the dedicated storage devices. For example, in the Oracle result of more than 30 million tpmC that now ranks second, the database used 108 SPARC T3 processors with 1,728 physical cores and 13,824 execution threads, while the storage used Intel servers as controller heads: 97 servers with 194 Intel Xeon X5670 CPUs and 2,328 physical cores.
  2. A centralized database runs on highly reliable hardware and needs to store only one copy of the data, whereas OceanBase is fault-tolerant in software: the hardware is cheaper, but it keeps two copies of the data and three copies of the logs, and maintaining these replicas costs a significant amount of CPU;
  3. OceanBase enables online compression both in the TPC-C test and in production, which further increases CPU usage.

It is therefore unscientific to naively compare OceanBase's CPU core count with Oracle's: the comparison would have to include the CPU cores in the shared storage devices as well as the CPU overhead of OceanBase's multi-replica storage and online compression. TPC-C's recommended approach is to look not at the specific software or hardware architecture but at the overall hardware cost. In the OceanBase test, hardware accounts for only about 18% of the total cost, and on hardware price/performance alone OceanBase is far better than the centralized databases.

Follow-up work

OceanBase's distributed architecture brings lower hardware cost, better availability and linear scalability, but its single-server performance still lags far behind Oracle and DB2, so optimizing single-server storage performance will be a focus of future work. In addition, OceanBase is positioned to serve both OLTP and OLAP workloads with a single engine, and its OLAP capability is currently not as strong as Oracle's; we need to strengthen the storage module's handling of large queries, support pushing OLAP operators down into the storage layer, and even perform OLAP computation directly on compressed data.

Authors: Yang Zhenkun (founder of OceanBase), Cao Hui (technical expert, OceanBase), Chen Mengmeng (senior technical expert, OceanBase), Pan Yi (senior technical expert, OceanBase), Han Fusheng (senior technical expert, OceanBase), Zhao Yuzhong (senior technical expert, OceanBase)

This article is from Alibaba Tech, a partner of the Yunqi Community. To reprint it, please contact the original author.
