For the first time! Evolution and engineering practice of the OceanBase storage system architecture

Time: 2020-11-25

OB Jun: As a fully self-developed distributed database, OceanBase has been in development for nearly ten years. Over that decade, its storage architecture has evolved several times to meet increasingly complex and demanding storage requirements. This article is based on Zhao Yuzhong’s talk at the 2019 SACC China System Architects Conference.

As user data keeps growing, vertical scaling based on traditional shared storage gradually stops working, and distributed storage has become the standard way to handle massive volumes of user data.

As an architect, what should you pay attention to when designing a system’s distributed storage architecture? Put another way, what features should an ideal distributed storage product offer its customers?


We believe that an ideal distributed storage architecture should focus on these five aspects:

  • Scalability: Scalability is what most distinguishes distributed storage from single-machine storage. Distributed storage scales far better than single-machine storage, but distributed storage systems still differ greatly from one another in how well they scale. Some work well at the scale of tens of nodes but run into serious problems as the node count grows to hundreds or even thousands. User data never stops growing. Without linear scalability, a system that supports the business today may become an obstacle to its growth tomorrow.
  • High availability: Node failures are very common in distributed systems, and the more nodes a system has, the more often one of them fails. For many businesses it is essential that the system remain available when a node fails. High availability can be divided into levels by fault type and recovery time. Can the system tolerate a single-node failure, multi-node failures, the loss of one data center, of several data centers, of one city, or of several cities? Can it recover within days, hours, minutes, or seconds? Different business scenarios have different high availability requirements.
  • Consistency: Consistency is a much-abused concept, and many people confuse it with the C in ACID, the properties of database transactions. Here we mean consistency in a distributed system. Put simply, it describes whether users of a distributed read-write system can always read the most recently written data. If they always can, the system is strongly consistent; otherwise it is weakly consistent. Eventual consistency is a special case of weak consistency: the latest data cannot always be read immediately, but once writes stop, reads eventually return it. Although many distributed systems claim to provide consistency, most of the time they provide only weak or eventual consistency. Strong consistency is critical for some businesses, especially financial businesses that handle transactions: if you cannot guarantee that you always read the latest data, assets may be lost.
  • Low cost: A distributed storage system can replace high-end minicomputers and mainframes with cheaper PC servers, which is a significant cost advantage. Low cost does not mean low performance, however. A distributed system has many nodes, and using them all at once can deliver higher performance than a large server. Low cost combined with high performance saves our users more in overall system cost.
  • Ease of use: Low cost usually concerns hardware costs, while ease of use concerns labor costs. For developers, ease of use means a simple interface, ideally with zero learning and migration cost, that is still powerful enough to meet a wide range of needs. For operations staff, it means a stable and robust system, complete monitoring and operations tooling, and a low barrier to learning and use.

Evolution of Architecture

Architecture design serves the business, and even a perfect system architecture creates value only when a business uses it. Of course, the demanding requirements placed on a system architecture often conflict with one another, with the cost of implementing the system, and with what implementation demands of its developers, so trade-offs are unavoidable.

Let’s review the trade-offs and the thinking behind each architecture change, following the evolution of the storage architecture since OceanBase was founded more than nine years ago.


1) OceanBase version 0.1 (2010)

OceanBase was founded by Yang Zhenkun at Taobao in 2010. At that time, most of Taobao’s businesses had already sharded their databases and tables by user, so a new distributed storage system seemed to have little to offer. Eventually we found OceanBase’s first business: Taobao Favorites, the list of saved items that users see when they open their favorites in the Taobao mobile app. It still runs on OceanBase today.


At that time, Favorites faced a problem that sharding could not solve. Its core business involved two tables: a user table, which recorded the items each user had collected, anywhere from a few to several thousand; and a product table, which recorded each product’s description, price and other details. When a user adds or deletes a favorite, the corresponding row is inserted into or deleted from the user table; when a merchant modifies a product’s details, such as its price, the corresponding row in the product table is updated.

When a user opens Favorites, a join query between the user table and the product table shows the latest product information. At first both tables lived in one database and ran well. As user data grew, however, a single database could no longer hold them. The common practice is to shard tables by user. The user table can be split that way, but the product table has no user field. If the product table is instead sharded by product, then opening a user’s favorites requires querying several different databases and joining the results. The database middleware of the time had no such capability, and even if it had, such a query would have taken so long that it would have seriously hurt the user experience. The business was in serious difficulty.

OceanBase took on this problem. Looking at the five properties of scalability, high availability, consistency, low cost and ease of use, which were hard requirements for the business and which could it give up? The strongest requirement was scalability, because the traditional single-machine approach had reached its limit. The property the business could most readily give up was ease of use: its reads and writes were very simple, a basic read-write interface was enough, and it did not even need indexes on its tables. We also noticed that the business had some consistency requirements: it could tolerate a degree of weakly consistent reads, but not incorrect data. These characteristics meant that from its first day OceanBase was a distributed relational database supporting online transaction processing.


We noticed a characteristic of the Favorites business: its existing data set was large, but the daily increment was small, since relatively few users add favorites on any given day. It cared more about the scalability of data storage than about the scalability of writes. We therefore divided the data into two parts: baseline data and incremental data. Baseline data is static and distributed across ChunkServers. Incremental data is written to the UpdateServer, where it is usually held in memory, and online transactions are supported through a redo log. During the daily off-peak period, the data on the UpdateServer is merged with the data on the ChunkServers, which we call the “daily merge”. MergeServer is a stateless server that routes writes and merges query results; RootServer is responsible for scheduling and load balancing across the whole cluster. This storage architecture resembles an LSM tree, and it is the reason all later OceanBase storage engines are based on the LSM tree.
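
The split between baseline and incremental data can be illustrated with a minimal Python sketch. It is not OceanBase code: the class `TwoLevelStore`, its method names and the `TOMBSTONE` marker are hypothetical, and the redo log, distribution across servers and concurrency are all left out. It only shows that reads consult the incremental data first and that the daily merge folds it into the baseline.

```python
TOMBSTONE = object()  # hypothetical marker for a key deleted since the last merge


class TwoLevelStore:
    """Toy model: a static baseline plus in-memory incremental updates."""

    def __init__(self, baseline):
        self.baseline = dict(baseline)  # plays the ChunkServer role
        self.incremental = {}           # plays the UpdateServer role

    def put(self, key, value):
        # All writes go to the incremental data only.
        self.incremental[key] = value

    def delete(self, key):
        # A delete is recorded as a tombstone, not applied to the baseline.
        self.incremental[key] = TOMBSTONE

    def get(self, key):
        # Incremental data is newer, so it shadows the baseline.
        if key in self.incremental:
            value = self.incremental[key]
            return None if value is TOMBSTONE else value
        return self.baseline.get(key)

    def daily_merge(self):
        # Fold the incremental data into the baseline, then start afresh.
        for key, value in self.incremental.items():
            if value is TOMBSTONE:
                self.baseline.pop(key, None)
            else:
                self.baseline[key] = value
        self.incremental.clear()
```

Because every read checks the incremental data before the baseline, the result is the same before and after a merge; the merge only changes where the data lives.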


Looking back at the architecture of OceanBase 0.1, it was actually strongly consistent: because writes went through a single point, any read was guaranteed to return the latest written data. Its cost was also low, and it had a degree of scalability: storage space could be expanded easily, which met the needs of the business at that time.

2) OceanBase version 0.2-0.3 (2011)

OceanBase version 0.1 soon went live and served reads for the Favorites business. However, the business could not switch all of its traffic to OceanBase, because the version 0.1 architecture had a major defect: it was not highly available. The failure of any server made data inaccessible, which was unacceptable for Favorites. We soon delivered the OceanBase 0.2 architecture to close the high availability gap.


In OceanBase version 0.2 we introduced a primary-standby mode, which was the common disaster recovery approach for traditional databases at the time. Data is synchronized from the primary database to the standby through the redo log. When the primary fails, the standby can be promoted to primary and continue serving. Redo log synchronization is asynchronous, which means a primary-standby switchover is lossy and may lose several seconds of data.
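
Why asynchronous log shipping makes a switchover lossy can be seen in a small Python sketch. The classes `Primary` and `Standby` and the function `lost_on_failover` are hypothetical names for illustration; real log shipping works on byte streams and checkpoints rather than Python lists.

```python
class Standby:
    def __init__(self):
        self.redo_log = []  # entries received from the primary so far


class Primary:
    def __init__(self):
        self.redo_log = []  # entries already acknowledged to clients
        self.shipped = 0    # how many entries have been sent to the standby

    def commit(self, entry):
        # Asynchronous mode: acknowledge without waiting for the standby.
        self.redo_log.append(entry)

    def ship(self, standby, max_entries):
        # Background shipping that may lag behind commits.
        batch = self.redo_log[self.shipped:self.shipped + max_entries]
        standby.redo_log.extend(batch)
        self.shipped += len(batch)


def lost_on_failover(primary, standby):
    # Entries the primary acknowledged but the standby never received.
    return primary.redo_log[len(standby.redo_log):]
```

If the primary commits three writes but has shipped only two when it fails, `lost_on_failover` returns the third: the client was told it succeeded, yet the promoted standby has no record of it.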


When we compare the architectures of OceanBase 0.2 and 0.1, we find that version 0.2 finally gained the important property of high availability, but not without cost. First, the system was no longer strongly consistent: we could not guarantee that the business would always read the latest data, and some data might be lost when a server went down. Second, introducing primary and standby databases greatly increased cost, doubling the number of machines we used. The later OceanBase 0.3 release optimized a great deal of code on top of 0.2, which further reduced system cost, but architecturally it did not differ significantly from 0.2.

3) OceanBase version 0.4 (2012)

With the success of the Favorites business, we soon took on more new businesses. Taobao Express is a merchant-facing business that also ran into problems sharding could not solve. First, its data volume kept growing beyond what a single database could support. Second, it is an OLAP-type business with many join queries across multiple tables, and because each table has different dimensions, the tables cannot all be split by user ID. The business was satisfied with OceanBase’s scalability, high availability and low cost, but using the interface was really too painful. So the question is, what is the best interface language? For programming, different languages have different supporters, but for data manipulation, we think SQL must be the best language. For simple KV queries SQL may feel too heavy, but as a business grows more complex, SQL is the simplest and most portable choice.


In OceanBase version 0.4 we added preliminary SQL support. Users could access OceanBase with standard SQL, including simple inserts, deletes, updates and queries as well as join queries, but SQL support was not yet complete. OceanBase 0.4 was also our last open source version.


Comparing the architectures of OceanBase 0.4 and 0.2, version 0.4 finally filled the ease-of-use gap, and OceanBase gradually began to look like a standard distributed database.

4) OceanBase version 0.5 (2014)

At the end of 2012, the OceanBase team moved to Alipay. At that time, Alipay had a strong need to remove IOE (IBM minicomputers, Oracle databases and EMC storage) completely. IOE cost too much, but PC servers could hardly match the stability of high-end storage. If an open source database such as MySQL were run on PC servers instead, the business would face the potential risk of losing data. At the time, MySQL-based disaster recovery still relied on primary-standby synchronization. For a core system such as Alipay transaction payments, the damage from losing even a single order is immeasurable. The business therefore demanded stronger consistency and higher availability from the database, which led us to build a new generation of architecture, OceanBase version 0.5.


In OceanBase version 0.5 we introduced the Paxos consensus protocol, which uses majority voting to keep data consistent when a single point fails. OceanBase 0.5 is generally deployed with three replicas. When one replica fails, the other two fill in any missing log entries and elect a new primary to provide service. We can guarantee that no data is lost under a single point of failure and that recovery takes less than 30 seconds. To better support the business, OceanBase 0.5 is also fully compatible with the MySQL protocol, supports secondary indexes, and has a rule-based execution plan. Users can connect to OceanBase seamlessly with MySQL clients and use it just like MySQL.
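
The majority arithmetic behind the three-replica deployment can be checked with a short Python sketch. This is not an implementation of Paxos; it only computes quorum sizes and verifies that any two majorities share a replica, which is the property that lets the survivors recover every acknowledged log entry. All function names are hypothetical.

```python
from itertools import combinations


def majority(n_replicas):
    # Smallest number of replicas that forms a majority.
    return n_replicas // 2 + 1


def is_committed(acks, n_replicas):
    # A log entry counts as committed once a majority has persisted it.
    return acks >= majority(n_replicas)


def tolerated_failures(n_replicas):
    # Replicas that can fail while a majority is still reachable.
    return n_replicas - majority(n_replicas)


def quorums_intersect(n_replicas):
    # Every pair of majorities shares at least one replica, so a newly
    # elected primary can always find each committed entry on a survivor.
    q = majority(n_replicas)
    groups = list(combinations(range(n_replicas), q))
    return all(set(a) & set(b) for a in groups for b in groups)
```

With three replicas the majority is two, so one replica can fail without losing committed data, which matches the deployment described above.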


Comparing the architectures of OceanBase 0.4 and 0.5, we find that version 0.5, built on Paxos, offers higher availability and strong consistency, and its SQL support makes it easier to use. The price is the move from two copies to three, which raises system cost by a further 50%.

5) OceanBase version 1.0 (2016)

Under the OceanBase 0.5 architecture, the business requirements for strong consistency, high availability and ease of use were well supported, and the pain points shifted to scalability and cost. As user write volume kept growing, the single write point of the UpdateServer was bound to become a bottleneck, and the three replicas carried a high cost. OceanBase version 1.0 brought a new architecture focused on these two pain points.

In OceanBase version 1.0 we support multi-point writes. Architecturally, we merged the UpdateServer, ChunkServer, MergeServer and RootServer into a single OBServer, and every OBServer can serve both reads and writes. The overall architecture is more elegant, and deployment and operations are simpler. A table can be divided into multiple partitions, and different partitions can be spread across different OBServers. User read and write requests are routed to the appropriate OBServer through a proxy layer, OBProxy. Each partition is still made highly available through the Paxos protocol: when an OBServer fails, the partitions on it automatically switch to other OBServers that hold replicas of them and continue serving.
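
The routing idea can be sketched in Python under some loud assumptions: the table is hash-partitioned by key, the proxy keeps a replica list per partition, and the first server in each list is the one currently serving that partition. The class `PartitionRouter`, the helper `make_router` and the server names are hypothetical and do not describe the real proxy's logic. In the real system the new serving replica is chosen by Paxos election, not by list order.

```python
import zlib


class PartitionRouter:
    """Hypothetical proxy-side routing table, for illustration only."""

    def __init__(self, replicas):
        # replicas[i] lists the servers holding partition i; the first
        # entry is treated as the replica currently serving requests.
        self.replicas = [list(servers) for servers in replicas]

    def partition_of(self, key):
        # Assumption: hash partitioning with a deterministic checksum.
        return zlib.crc32(key.encode("utf-8")) % len(self.replicas)

    def route(self, key):
        # Send the request to the server currently serving the partition.
        return self.replicas[self.partition_of(key)][0]

    def fail_server(self, server):
        # Drop the failed server everywhere; each affected partition is
        # then served by the next surviving replica in its list.
        for servers in self.replicas:
            if server in servers:
                servers.remove(server)


def make_router():
    # Three partitions, each replicated on three servers, with the
    # serving replicas spread across different servers.
    return PartitionRouter([
        ["observer1", "observer2", "observer3"],
        ["observer2", "observer3", "observer1"],
        ["observer3", "observer1", "observer2"],
    ])
```

After `router.fail_server("observer1")`, no key is routed to the failed server, and the partition it used to serve is taken over by another server that already holds a replica.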


In terms of cost, we noticed that what the Paxos protocol requires to be synchronized across three replicas is the log: the log must be written in three copies, but the data need not be. Compared with the data, the log volume is always small. If log and data are separated, we can achieve three-replica high availability at roughly the storage cost of two replicas. In OceanBase version 1.0 we divided replicas into two types: full-function replicas and log replicas. A full-function replica holds both data and log and serves complete user reads and writes; a log replica holds only the log and only takes part in Paxos voting.
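
The storage saving follows from simple arithmetic, sketched here in Python. The function `storage_cost` and the sizes used in the usage note are made-up illustrations, not measurements from OceanBase.

```python
def storage_cost(data_size, log_size, full_replicas, log_replicas):
    # A full-function replica stores data and log; a log replica stores
    # only the log.
    return full_replicas * (data_size + log_size) + log_replicas * log_size
```

For example, with a hypothetical 1000 units of data and 50 units of log, three full-function replicas cost `storage_cost(1000, 50, 3, 0)`, which is 3150 units, while two full-function replicas plus one log replica cost `storage_cost(1000, 50, 2, 1)`, which is 2150 units. The log is still written three times, so a majority of log copies survives any single failure.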

OceanBase version 1.0 also introduced multi-tenancy. A single OceanBase cluster can support multiple tenants that share the cluster’s resources, while OceanBase isolates each tenant’s CPU, memory, IO and disk usage. Tenants can configure resource capacity according to their own needs, and the cluster balances load dynamically across OBServers. This lets us deploy many small tenants on one large cluster, reducing overall system cost.

Compared with OceanBase 0.5, version 1.0 greatly improved scalability, and log replicas and multi-tenancy greatly reduced cost. The improved scalability came at a price, however: multi-point writes bring great complexity. First, a user’s write does not necessarily touch only one partition, and writes that span multiple partitions inevitably require distributed transactions, which we complete with the two-phase commit protocol. Second, obtaining a globally monotonically increasing timestamp in a distributed system is extremely difficult. Because there was no global clock, we performed read-write concurrency control within a single partition using local timestamps, which limited the system’s consistency: single-partition queries were still strongly consistent, but cross-partition queries could not guarantee strongly consistent reads, a significant restriction for users. In addition, because of partitioning, a secondary index became a local index within each partition. This required the index key to contain the partition key and ruled out global unique indexes, which also inconvenienced users.
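
The two-phase commit protocol mentioned above can be sketched in a few lines of Python. This is the textbook form of the protocol with hypothetical names (`Participant`, `two_phase_commit`); it leaves out logging, timeouts and coordinator failure, and it makes no claim about how OceanBase implements or optimizes it.

```python
class Participant:
    """One partition taking part in a distributed transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Phase 1: vote yes only if the local changes can be made durable.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    # Phase 1: ask every participant to prepare and collect the votes.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only if all voted yes, otherwise abort everywhere.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False
```

A transaction that writes to two partitions commits only if both prepare successfully; a single negative vote aborts it on every partition, so no partition ends up with half of the transaction.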


6) OceanBase version 2.0 (2018)

Externally, the overall architecture of OceanBase 2.0 differs little from that of 1.0: it is still a shared-nothing, three-replica architecture. Internally, however, we made major improvements in scalability, high availability, consistency, low cost and ease of use.


For scalability, we implemented partition splitting. When creating a table, a user may not be able to estimate the right number of partitions. When a partition grows too large, splitting it increases the partition count. Although partition splitting is a heavyweight DDL operation, in OceanBase 2.0 it runs online and has little impact on normal business reads and writes.


For high availability, we support primary and standby databases. Users with only two data centers can achieve lossless disaster recovery within one data center using three replicas there, and lossy disaster recovery across data centers using the primary and standby databases.

For consistency, we support global snapshots, which deliver strongly consistent distributed reads and writes. Building on global snapshots, we also added support for global indexes and foreign keys.
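
How a global snapshot yields consistent cross-partition reads can be illustrated with a toy multi-version model in Python. The assumptions are loud ones: a single counter stands in for the global timestamp source, every write carries a commit timestamp, and a read sees the newest version at or below its snapshot timestamp. `GlobalTimestamp` and `Partition` are hypothetical names, and nothing here reflects OceanBase internals.

```python
import itertools


class GlobalTimestamp:
    """Stand-in for a cluster-wide, monotonically increasing clock."""

    def __init__(self):
        self._counter = itertools.count(1)

    def next(self):
        return next(self._counter)


class Partition:
    def __init__(self):
        self.versions = {}  # key -> list of (commit_ts, value)

    def write(self, key, value, commit_ts):
        self.versions.setdefault(key, []).append((commit_ts, value))

    def read(self, key, snapshot_ts):
        # Return the newest version committed at or before the snapshot.
        visible = [(ts, v) for ts, v in self.versions.get(key, [])
                   if ts <= snapshot_ts]
        if not visible:
            return None
        return max(visible, key=lambda item: item[0])[1]
```

Suppose account A lives on one partition and account B on another, and a transfer commits on both with the same global timestamp. A reader that uses one snapshot timestamp for both partitions sees either the state before the transfer or the state after it, never a mixture of the two.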

For low cost, the transaction layer supports table groups (tablegroup), which let a group of similar tables be “bound” together to reduce distributed transaction overhead. The storage layer introduces data encoding, which further reduces storage space through dictionary, RLE, constant, delta, column-equal and inter-column prefix encodings. Encoding is adaptive: a suitable encoding algorithm is chosen automatically according to the characteristics of the data.
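
The idea of adaptive encoding can be sketched in Python by estimating the encoded size of a column under a few schemes and keeping the smallest. Only raw, RLE and dictionary encoding are modeled, and the cost model (4 bytes per run count and per dictionary code) is an assumption made for this example, not the one OceanBase uses. All names are hypothetical.

```python
COUNT_BYTES = 4  # assumed size of an RLE run length
CODE_BYTES = 4   # assumed size of a dictionary code


def value_bytes(value):
    return len(str(value).encode("utf-8"))


def rle_encode(column):
    # Collapse consecutive equal values into (value, run_length) pairs.
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [(value, count) for value, count in runs]


def dict_encode(column):
    # Store each distinct value once and replace values with codes.
    dictionary = sorted(set(column))
    index = {value: code for code, value in enumerate(dictionary)}
    return dictionary, [index[value] for value in column]


def choose_encoding(column):
    # Estimate the size under each scheme and pick the smallest.
    dictionary, codes = dict_encode(column)
    costs = {
        "raw": sum(value_bytes(v) for v in column),
        "rle": sum(value_bytes(v) + COUNT_BYTES for v, _ in rle_encode(column)),
        "dictionary": sum(value_bytes(v) for v in dictionary)
                      + CODE_BYTES * len(codes),
    }
    return min(costs, key=costs.get)
```

Under this cost model, a column holding one value repeated many times is best served by RLE, a column that alternates between a few long strings by a dictionary, and a column of short unique values is left unencoded.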

For ease of use, we support Oracle tenants, so users can run MySQL tenants and Oracle tenants in the same OBServer cluster, and we support stored procedures, window functions, hierarchical queries, expression indexes, full-text indexes, ACS, SPM, the recycle bin and other features.

Summary

Although OceanBase 2.0 today strikes a better balance among scalability, high availability, consistency, low cost and ease of use, this architecture was not achieved overnight. As the path from version 0.1 to version 2.0 shows, OceanBase’s architecture has been evolving all along. Serving the business better always means facing trade-offs, and improving one property often comes at the expense of others. Architectural optimization has no end point: to better meet future business needs, OceanBase’s architecture will keep evolving.

About the author: Zhao Yuzhong (nickname Chen Qun) is a senior technical expert at Ant Financial Services and is currently responsible for storage-related development in the OceanBase team. He received his Ph.D. in computer science from the University of Science and Technology of China in 2010, joined Alipay the same year to work on the research and development of a distributed transaction framework, and joined the OceanBase team in 2013.


Author: Zhao Yuzhong


This article is original content of the Yunqi Community and may not be reproduced without permission.