This article is based on the 2016 paper "What's Really New with NewSQL?" by Andrew Pavlo (Associate Professor, Computer Science Department, Carnegie Mellon University) and Matthew Aslett (Vice President, 451 Research). The following is a translation of that paper, reorganized and edited.
In recent years, a new class of database management system (DBMS) called NewSQL has emerged, claiming the ability to scale the workloads of modern online transaction processing (OLTP) systems in a way that earlier systems could not.
Given that traditional DBMSs have been under development for nearly 40 years, it is worth examining whether NewSQL's claimed advantages are real or merely marketing. And if these systems really do deliver better performance, the next question is whether they represent a genuine technological breakthrough, or whether hardware advances have simply made the old bottlenecks disappear.
To explore these questions, we first review the history of database development to understand the background and motivation for NewSQL's emergence. We then discuss the definition, characteristics, and classification of NewSQL in detail, along with the systems in each category.
1、 A brief history of DBMSs
The world's first database system, IBM's IMS, was released in 1966. It was used to store the parts and supplier information needed by the Saturn V and Apollo space programs. IMS's main contribution was to demonstrate the idea that "application logic and data manipulation logic should be separated": application developers need only care about the logical changes to data, not the details of how they are carried out. After IMS came the first relational databases, chiefly IBM's System R and the University of California's Ingres, the predecessor of PostgreSQL. Ingres quickly became popular in the information systems of other universities and was commercialized in the late 1970s. Around the same time, Oracle adopted a design similar to System R's and released the first version of its DBMS. In the early 1980s a wave of new companies launched their own commercial database products, among them Sybase and Informix. After System R, IBM released a new relational database, DB2, in 1983; it reused parts of System R's code, though neither system was open source.
From the late 1980s to the early 1990s, object-oriented languages grew popular, which in turn spawned a number of object-oriented DBMSs intended to bridge the gap between the database model and the language's object model. However, lacking a standard interface like SQL, these object-oriented DBMSs never achieved wide market acceptance. Some of their design ideas were nevertheless gradually absorbed into relational databases, and many popular relational systems later added support for object, XML, and JSON data. Document-oriented NoSQL databases can also be seen, more or less, as an extension of object-oriented DBMSs.
A major event of the 1990s was the release of two open-source relational databases, MySQL and PostgreSQL. MySQL was created in Sweden in 1995, developed from the ISAM-based mSQL system; PostgreSQL began in 1994, when two Berkeley graduate students forked the QUEL-based Postgres code to add support for the SQL query language.
Since 2000, Internet applications have sprung up, demanding far more resources than traditional software services. They must support concurrent access by huge numbers of users and have stringent availability requirements, ideally staying online at all times. In practice, the database became the bottleneck of Internet applications. Many vendors tried to scale vertically by improving single-machine hardware, but the gains were limited and showed clearly diminishing returns. Vertical scaling is also rarely smooth: moving data from one machine to another requires a long service outage, which Internet users will not accept. To solve this problem, some companies built custom middleware to shard data across a number of ordinary single-node DBMSs: a single logical database is presented to the application, while the data is distributed across different physical databases. When an application issues a request, the middleware forwards or rewrites it and dispatches it to one or more nodes of the underlying database cluster; after those nodes execute the request and return data, the middleware aggregates the results and returns them to the application. Two famous systems built on this idea were eBay's Oracle-based cluster and Google's MySQL-based cluster. Facebook later adopted a similar strategy for its internal MySQL cluster, which is still in use today. Although middleware-level sharding can handle simple point reads and writes, it is very difficult to update multiple records in one transaction or to join multiple tables. For this reason, these early middleware systems did not support such operations: eBay required its middleware users to implement join logic in the application layer. This obviously violates the principle that "application logic and data manipulation logic should be separated," and re-exposes data manipulation logic to application developers.
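The scatter/gather pattern that such middleware performs can be sketched in a few lines of Python. `FakeShard` and its `execute` method are hypothetical stand-ins for the underlying single-node databases; the point is that the middleware prunes non-matching shards, fans the request out, and merges the partial results before the application sees them.

```python
class FakeShard:
    """Stand-in for one database node holding a contiguous key sub-range."""
    def __init__(self, lo, hi, rows):
        self.lo, self.hi, self.rows = lo, hi, rows

    def execute(self, lo, hi):
        # A real node would run SQL; here we just filter our rows.
        return [r for r in self.rows if lo <= r <= hi]

def scatter_gather(lo, hi, shards):
    """Route the request only to shards whose key range overlaps [lo, hi],
    then merge the partial results for the application."""
    partials = []
    for s in shards:
        if s.hi < lo or s.lo > hi:
            continue              # prune shards that cannot match
        partials.extend(s.execute(lo, hi))
    return sorted(partials)       # the middleware, not the app, merges
```

The same routing layer is what breaks down for multi-shard transactions and joins, which is why the early systems pushed that logic back into the application.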
Eventually, middleware-based sharding schemes were gradually abandoned, and companies turned to building their own distributed databases. Beyond the problems exposed by the middleware approach, traditional database solutions revealed two further problems:
First, traditional databases prioritize consistency and correctness at the expense of availability and performance, a trade-off that runs counter to the needs of Internet applications, which emphasize concurrency and availability.
Second, many features of traditional relational databases are unsuited to Internet applications, yet supporting them consumes extra resources; a lighter-weight database might deliver better overall performance. Moreover, the relational model may not be the best way to express application data, and using full SQL for simple queries felt like overkill.
These problems were the origin of the NoSQL movement of 2005 to 2010. NoSQL proponents generally believed that what blocked the horizontal scalability and availability of traditional databases was the ACID guarantee and the relational model, so the core of the NoSQL movement was to abandon strong transactional consistency and the relational model, and to embrace eventual consistency and alternative data models (such as key/value, graphs, and documents). The two most famous NoSQL databases are Google's BigTable and Amazon's Dynamo. Since neither was open source, other organizations launched similar open-source projects, including Facebook's Cassandra (based on BigTable and Dynamo) and Powerset's HBase (based on BigTable). Some start-ups also joined the NoSQL movement, not necessarily inspired by BigTable and Dynamo but responding to the same NoSQL philosophy; the best known of these is MongoDB.
By the end of the 2000s there were many distributed database products to choose from. The advantage of NoSQL was that application developers could focus on the application logic itself rather than on the database's scalability. At the same time, however, many applications, such as financial and order-processing systems, could not give up transactional consistency and so rejected NoSQL. Some organizations, notably Google, found that too many of their engineers had to worry about data consistency, which both leaked through the database abstraction and increased code complexity. Such organizations could either return to traditional DBMSs and scale vertically with bigger machines, or return to the middleware era and build middleware supporting distributed transactions. Both options are very expensive, and so the NewSQL movement began to brew.
2、 The rise of NewSQL
This paper defines NewSQL as a general term for a class of modern relational databases that provide scalable performance for general OLTP read-write workloads while supporting the ACID guarantees of transactions. In other words, these systems have the scalability of NoSQL databases yet keep the transactional properties of traditional databases. NewSQL brings the idea that "application logic and data manipulation logic should be separated" back to the modern database world, confirming once again that history tends to advance in a spiral.
In the 2000s, many data warehouse systems appeared (such as Vertica, Greenplum, and Aster Data). These systems, designed to handle OLAP requests, are outside the scope of NewSQL as defined in this paper: OLAP databases focus on large, complex, read-only queries over massive data sets, with query times lasting seconds, minutes, or longer. The read-write transactions that NewSQL databases are designed for have the following characteristics:
- Short-lived (no long user stalls)
- Touch only a small subset of data via index lookups (no full table scans or large distributed joins)
- Highly repetitive, typically executing the same query statements with different parameters
Some scholars define NewSQL systems more narrowly as databases whose implementations use lock-free concurrency control and a shared-nothing architecture. All of the systems we consider NewSQL do indeed have these characteristics.
3、 Classification of NewSQL
Now that NewSQL has been defined, we can survey the landscape of NewSQL databases. The NewSQL databases on the market fall into three categories:
- Entirely new systems, redesigned and built from scratch on new architectures
- Middleware layers that provide NewSQL features on top of existing databases
- Database-as-a-service (DBaaS) offerings from cloud providers, usually based on new architectures
Before this article, the two authors had also classified a fourth scheme, "replacing the storage engine of a single-node database," as NewSQL. Typical examples are drop-in replacements for MySQL's InnoDB, such as TokuDB, ScaleDB, Akiban, deepSQL, and MyRocks. The advantage of a new storage engine is that the replacement is invisible to the application. The authors have since revised this view: extending a single-node database by swapping storage engines or adding plug-ins is not representative of NewSQL systems and falls outside the scope of this article. In practice, attempts to improve OLTP performance by replacing MySQL's storage engine have generally failed in the end.
Next, we discuss each of the three categories of NewSQL databases.
1. New architecture
The NewSQL systems built from scratch are the most interesting, because their designs have the greatest freedom, unburdened by the architecture of legacy systems. All database systems in this category are distributed, shared-nothing architectures and include the following components:
- Multi node concurrency control
- Multi replica data replication
- Flow control
- Distributed query processing
Another advantage of building a new system is that every component can be optimized for a multi-node environment, for example the query optimizer and the inter-node communication protocol. As a concrete example, most NewSQL databases can ship a query's intermediate data directly between nodes, rather than routing it through a central node as some middleware-based solutions must.
With the exception of Google's Spanner, databases in this category manage their own storage, which means these DBMSs are also responsible for distributing data across nodes, rather than relying on an off-the-shelf distributed file system (such as HDFS) or storage fabric (such as Apache Ignite). This is a crucial design decision. Managing storage itself means "sending the query to the data," whereas relying on third-party storage means "bringing the data to the query": the former transmits query commands, the latter transmits the data itself, so the former is clearly friendlier to network bandwidth. A self-managed storage layer also lets the database use more refined and efficient replication strategies, rather than being limited to block-based replication. In general, building the storage layer yields higher performance.
Using a new architecture is not without drawbacks. The biggest is that the technology is too new, leaving users worried about undiscovered pitfalls. This in turn means the user base is small, which does not help the product mature. In addition, the widely adopted ecosystem of operations, monitoring, and alerting tools has to be rebuilt from scratch. To mitigate these problems, some systems, such as Clustrix, MemSQL, and TiDB, chose to be compatible with the MySQL wire protocol, while CockroachDB chose compatibility with the PostgreSQL wire protocol.
Examples: Clustrix, CockroachDB, Google Spanner, H-Store, HyPer, MemSQL, NuoDB, SAP HANA, VoltDB, TiDB, etc.
(Figure: CockroachDB architecture)
2. Transparent sharding middleware
There are also products on the market offering middleware solutions similar to those once built at eBay, Google, Facebook, and other companies, now with ACID support. Each database node beneath the middleware typically:
- Runs the same single-node database system
- Holds only a portion of the overall data
- Is not meant to receive reads and writes independently from applications
The centralized middleware is responsible for routing requests, coordinating transaction execution, and distributing, replicating, and partitioning data across nodes. Each database node usually runs a thin communication module that talks to the middleware, executes requests on its behalf, and returns data. Together, these components present a single logical database to the outside world.
The advantage of middleware is that swapping it in for an existing database system is very simple and invisible to application developers. The most common single-node database in middleware solutions is MySQL, which means that to be MySQL-compatible, the middleware layer must speak the MySQL wire protocol. Although Oracle provides the MySQL Proxy and MySQL Fabric tools to help with this, most companies implement protocol compatibility themselves to avoid potential GPL licensing issues.
The drawback of middleware-based solutions is their reliance on traditional databases. Traditional databases generally use the disk-oriented architecture born in the 1970s, so these solutions cannot use the memory-oriented architectures adopted by some NewSQL systems, and therefore cannot take advantage of their more efficient storage management and concurrency control components. Previous studies have shown that disk-centric designs limit how effectively a traditional database can exploit many CPU cores and large amounts of memory. Furthermore, for complex queries, middleware solutions may introduce redundant query planning and optimization (once in the middleware, once on the database nodes), although this can also be seen as a partial, local optimization of the query.
Examples: AgilData Scalable Cluster, MariaDB MaxScale, ScaleArc, and ScaleBase.
(Figure: logical architecture of MariaDB MaxScale)
3. Database as a service
Many cloud computing providers offer NewSQL database-as-a-service (DBaaS) products. With such a service, the development team no longer maintains the database system on its own hardware or on virtual machines rented from a cloud provider; instead, the provider takes over configuration, tuning, replication, backup, and so on. Customers simply access the database through a provided URL, or manage it through a control panel and API.
DBaaS customers pay according to resource utilization. Because different database queries can consume very different amounts of compute, DBaaS providers usually do not charge per query as they do for block storage services; instead, customers specify maximum resource limits (such as storage space, compute, and memory) against which the provider guarantees service. The best-known NewSQL DBaaS is Amazon's Aurora, which is compatible with both the MySQL and PostgreSQL wire protocols and, behind the scenes, uses a log-structured storage management layer to improve I/O parallelism.
Some companies also build DBaaS solutions on top of the infrastructure of the large cloud providers; ClearDB, for example, can be deployed on all the major clouds. An advantage of this approach is that the database can be spread across different providers in the same region to reduce the risk of failure.
Examples: Amazon Aurora, ClearDB.
(Figure: Amazon Aurora cloud deployment architecture)
4、 The current state of NewSQL
In this section, we examine the design of each major component of NewSQL database systems to see whether they are truly innovative in practice and in theory.
1. Main memory storage
Traditional DBMSs use a disk-oriented storage design: data lives primarily on block storage devices such as SSDs or HDDs. Because reads and writes to block devices are slow, these DBMSs cache data blocks in memory after reading them, and also buffer outgoing writes in memory so they can be flushed in batches. Since memory was far smaller and more expensive than disk, this design greatly reduced hardware costs. Decades later, memory prices and capacities have improved to the point that, except for some very large OLTP databases, most databases fit entirely in memory. This lets a DBMS read data from memory at full speed and even eliminates the need for components such as the buffer pool manager and heavyweight concurrency control. The shift from disk-oriented to memory-oriented design opens new possibilities for performance optimization.
Many NewSQL databases adopt memory-based storage architectures, from academic research systems such as H-Store and HyPer to commercial systems such as MemSQL, SAP HANA, and VoltDB. Under OLTP workloads, these systems outperform disk-based storage architectures.
The idea of keeping an entire database in memory is not new. The University of Wisconsin-Madison laid the groundwork for main-memory databases in the 1980s, covering indexing, query processing, data recovery, and more, and the first distributed main-memory database, PRISMA/DB, was developed in the same era. The first commercial in-memory databases appeared in the 1990s, including Altibase, Oracle TimesTen, and AT&T's DataBlitz.
(Figure: Altibase in-memory database architecture)
The innovation of memory-oriented NewSQL databases is to evict inactive data from memory to reduce memory usage, allowing the database to hold more data than fits in memory without falling back to a disk-oriented architecture. Broadly, this requires an internal tracking mechanism that finds inactive records in memory and moves them out to external storage when necessary. Take H-Store as an example: its anti-caching component evicts inactive records and leaves a tombstone in each record's place. When a transaction touches a tombstone, the transaction is aborted, and a separate thread asynchronously loads the record back into memory. The same goal can also be achieved directly with the operating system's virtual memory paging, as VoltDB does. To avoid false negatives, all secondary indexes must retain the keys of evicted records, so if a table has many secondary indexes, evicting a record still wastes memory. Microsoft's Project Siberia instead uses Bloom filters to decide whether a target record exists, reducing memory consumption; although Siberia is not a NewSQL database, the technique is worth noting.
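The tombstone-based eviction described above can be sketched as follows. This is an illustrative model, not H-Store's actual implementation: the reload happens inline rather than via transaction abort and an asynchronous loader thread.

```python
class AntiCache:
    """Anti-caching sketch: cold records are evicted to a 'disk' dict and
    replaced in memory by a tombstone; a read that hits a tombstone pulls
    the record back into memory before returning it."""
    TOMBSTONE = object()   # sentinel left in place of an evicted record

    def __init__(self):
        self.memory = {}   # hot store
        self.disk = {}     # cold store

    def evict(self, key):
        self.disk[key] = self.memory[key]
        self.memory[key] = self.TOMBSTONE   # placeholder left behind

    def read(self, key):
        value = self.memory[key]
        if value is self.TOMBSTONE:
            # A real system would abort the transaction and reload the
            # record asynchronously; here we reload inline for brevity.
            value = self.memory[key] = self.disk.pop(key)
        return value
```

Note how the tombstone itself still occupies a memory slot, which mirrors the index-key retention problem described above.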
MemSQL, another NewSQL database, takes a different approach to datasets larger than memory and does not track per-record metadata. MemSQL organizes data in a log-structured form, which improves write performance, and administrators can also instruct the database to store a table in columnar format.
Overall, memory-based storage is not significantly innovative; it can be seen as an extension of earlier work.
2. Partitioning / sharding
All NewSQL databases scale horizontally by dividing the database into disjoint subsets, called partitions or shards.
Operating on data across a distributed database is not new either. Phil Bernstein and his colleagues did foundational work in this area as early as the late 1970s: their SDD-1 project made many basic contributions to distributed transaction processing. In the early 1980s, the System R and Ingres teams built their own distributed database systems: IBM's R* adopted a shared-nothing, disk-oriented design similar to SDD-1's, while the distributed version of Ingres is mainly remembered for a dynamic query optimization algorithm that recursively splits a query into smaller pieces. Later, the Gamma project at the University of Wisconsin-Madison explored different partitioning strategies.
However, these early distributed database systems never took off, for two reasons:
First, computer hardware in the 20th century was so expensive that most companies could not afford it.
Second, high-performance, highly available Internet applications did not yet exist; database QPS in that era was generally in the tens to hundreds.
Neither assumption holds today. With the spread of cloud infrastructure and the help of open-source distributed systems and tools, building a data-intensive application is far simpler than it used to be, which has brought distributed databases back onto the stage.
In a distributed database, each table is divided into segments based on one or more partitioning attributes, either by a hash function or by ranges. Related segments from multiple tables are often combined into the same partition (node), which is responsible for executing any request that touches its data. The DBaaS systems (Amazon Aurora, ClearDB) are exceptions that do not support this partitioning strategy. Ideally, a distributed database should automatically fan a query out to multiple partitions and merge the results; with the exception of ScaleArc, all the NewSQL databases provide this.
Many OLTP database schemas can be transformed into a tree structure in which every table hangs off a root table; partitioning on a hash of the root table's primary key then keeps the data touched by each query within a single partition. For example, if there is a customer table and the database is partitioned by customer ID, then all of a customer's orders and account information live in the same partition. In this way, almost all transactions execute within a single partition, which reduces inter-node communication and avoids the cost of two-phase commit (2PC), improving overall performance.
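The co-location idea can be sketched like this: every table in the tree carries the root table's key (here a hypothetical `customer_id` column), and all tables shard on a stable hash of that key, so one customer's rows from every table land on the same partition.

```python
import hashlib

N_PARTITIONS = 4

def partition_for(customer_id):
    # Stable hash of the root table's key; md5 keeps the mapping
    # consistent across processes (unlike Python's built-in hash()).
    digest = hashlib.md5(str(customer_id).encode()).hexdigest()
    return int(digest, 16) % N_PARTITIONS

def route_row(table, row):
    """Every table in the tree shards on the customer_id it carries,
    so a customer's rows from all tables co-locate on one partition."""
    return partition_for(row["customer_id"])
```

A transaction touching only one customer's rows then needs exactly one partition, which is why no 2PC round is required.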
In some NewSQL databases, such as NuoDB and MemSQL, the nodes are not homogeneous. NuoDB designates one or more nodes in the cluster as storage managers (SMs); each SM stores a portion of the data, divided internally into blocks. The remaining nodes are transaction engines (TEs), each caching the blocks involved in its transactions. This is a classic separation of compute and storage. To execute a transaction, a TE must obtain all the blocks the transaction touches, from SMs or from other TEs; it must then acquire write locks on the data to be modified, and after modifying the data locally, broadcast the changes to the relevant SMs and other TEs. To reduce blocks bouncing back and forth between TEs, NuoDB puts considerable effort into its load-balancing strategy, so that related blocks gravitate to the same TE. This lets NuoDB partition data sensibly without making any assumptions about the schema (such as the tree structure above). MemSQL also separates compute and storage like NuoDB, but unlike NuoDB its aggregator nodes cache no data; instead they break a query into smaller queries for each storage node to execute, so the compute nodes are stateless. In both NuoDB and MemSQL, compute and storage can scale horizontally and independently. Whether heterogeneous node architectures beat homogeneous ones in performance or operational complexity remains an open question.
An important feature of NewSQL databases is support for live data migration: the database can rebalance data across physical nodes, relieve hotspots, and scale out without interrupting its own service. This is harder for NewSQL than for NoSQL, because NewSQL must also keep its ACID guarantees intact during migration. In general, NewSQL systems use one of two approaches to support online data migration:
The first organizes data into coarse-grained logical shards and hashes them onto physical nodes; when rebalancing is needed, whole logical shards are moved between nodes. This approach is used by Clustrix, AgilData, Cassandra, and DynamoDB.
The second rebalances at a finer granularity, such as individual tuples or groups of tuples, rearranging data by value range. MongoDB, ScaleBase, and H-Store take this approach.
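A minimal sketch of the first (coarse-grained) approach: keys hash to a fixed number of logical shards, and a small table maps each shard to a physical node. Rebalancing edits the table and migrates only the moved shard's data; the key-to-shard hash never changes. This is a simplified illustration, not any particular system's implementation.

```python
class ShardMap:
    """Coarse-grained rebalancing: a fixed key->shard hash plus a
    mutable shard->node assignment table."""
    def __init__(self, n_shards, nodes):
        self.n_shards = n_shards
        # Initial round-robin assignment of logical shards to nodes.
        self.assignment = {s: nodes[s % len(nodes)] for s in range(n_shards)}

    def node_for(self, key):
        shard = hash(key) % self.n_shards   # this mapping never changes
        return self.assignment[shard]

    def move_shard(self, shard, new_node):
        # Only this shard's data migrates; every other key keeps its node.
        self.assignment[shard] = new_node
```

Because the hash is fixed, a rebalance invalidates routing for exactly one shard at a time, which is what makes migration incremental.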
3. Concurrency control
The concurrency control mechanism is the core of transaction processing in a database system, and it touches almost every other part of the system. Concurrency control lets different users access the database as if each had exclusive use of it; it guarantees the atomicity and isolation of transactions and shapes the overall behavior of the system.
Beyond the choice of concurrency control mechanism, another key design decision for a distributed database is whether transaction coordination is centralized or decentralized. Under a centralized protocol, every transaction operation goes through a central coordinator, which decides whether to approve it; under a decentralized protocol, each node maintains the transaction state for the data it owns, and nodes communicate with one another to detect concurrency conflicts. Decentralized coordination scales better, but it usually requires tightly synchronized wall clocks across nodes to establish a total order on transactions.
The first distributed database systems, born in the 1970s, generally used two-phase locking (2PL). SDD-1, the first database to support distributed transactions over a shared-nothing cluster, used a centralized coordinator. IBM's R* was similar to SDD-1 but used decentralized coordination, distributed 2PL, in which each transaction locks the data it accesses. The distributed version of Ingres also used distributed 2PL, but relied on centralized deadlock detection.
Because distributed deadlock handling is so complex, almost all NewSQL databases have abandoned 2PL. The popular concurrency control mechanisms today are variants of timestamp ordering (TO), in which the database assumes concurrent transactions will not execute in an order that would violate serializable ordering. The most popular concurrency control protocol among NewSQL systems is decentralized MVCC: on every update, the database creates a new version of the record. Keeping multiple versions of a record means read transactions do not block writes, and writes do not block reads. NewSQL databases using decentralized MVCC include MemSQL, HyPer, HANA, and CockroachDB. Although each customizes the scheme to its needs, the core idea is not new: MVCC appeared in an MIT PhD dissertation as early as 1979, and the first commercial MVCC databases arrived in the early 1980s, including Digital's VAX Rdb and InterBase, the latter written by Jim Starkey, who later went on to create NuoDB's original design and MySQL's Falcon storage engine.
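The core MVCC read/write rule can be sketched in a few lines. This is a minimal single-node model of the versioning idea only; real systems add commit protocols, write-conflict detection, and garbage collection of old versions.

```python
class MVCCStore:
    """Minimal multi-version store: each write appends a new version
    stamped with its commit timestamp; a reader sees the newest version
    no later than its snapshot timestamp, so reads never block writes
    and writes never block reads."""
    def __init__(self):
        self.versions = {}   # key -> sorted list of (commit_ts, value)

    def write(self, key, value, commit_ts):
        self.versions.setdefault(key, []).append((commit_ts, value))
        self.versions[key].sort()   # keep versions in timestamp order

    def read(self, key, snapshot_ts):
        visible = [v for ts, v in self.versions.get(key, []) if ts <= snapshot_ts]
        return visible[-1] if visible else None
```

A reader with snapshot timestamp 2 still sees the old value even after a writer has committed version 3, which is exactly the non-blocking property described above.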
(Figure: classification of common concurrency control mechanisms)
Other database systems use a hybrid of 2PL and MVCC. In this scheme, write transactions still acquire locks under 2PL, and whenever a transaction modifies a record, the database also creates a new version of it in MVCC fashion; read transactions acquire no locks and therefore do not block writes. The best-known implementation of this scheme is MySQL's InnoDB, but it is also used by Google Spanner, NuoDB, and Clustrix. NuoDB improves on basic MVCC by broadcasting record version information among nodes via a gossip protocol. All the middleware and DBaaS solutions inherit the concurrency control of the single-node database underneath them; since most use MySQL, they naturally get the hybrid 2PL-plus-MVCC scheme.
This paper considers the concurrency control implemented by Google Spanner (along with its descendants F1 and SpannerSQL) to be the most novel among all NewSQL systems. Although it is based on the 2PL-plus-MVCC hybrid, the difference is that Spanner uses hardware (GPS receivers and atomic clocks) to achieve high-precision clock synchronization, and uses these tightly synchronized clocks to timestamp transactions and establish their order, achieving multi-version consistency over a wide-area network. CockroachDB offers the same transactional consistency without such hardware, relying instead on a hybrid clock protocol that combines loosely synchronized clocks with logical counters.
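The hybrid clock idea behind CockroachDB's approach can be sketched as follows: each timestamp pairs a physical wall-clock reading with a logical counter, so timestamps stay monotonic and respect causality even when node clocks are only loosely synchronized. This is a simplified model of a hybrid logical clock; the class and method names are illustrative, not CockroachDB's API.

```python
import time

class HybridLogicalClock:
    """Simplified hybrid logical clock: (wall, logical) pairs that
    advance on every local event and on every received timestamp."""
    def __init__(self, now=time.time):
        self.now = now
        self.wall = 0      # highest physical time seen so far
        self.logical = 0   # tie-breaking counter

    def tick(self, received=None):
        """Advance for a local event, or on receipt of a remote
        (wall, logical) timestamp; returns the new timestamp."""
        physical = self.now()
        r_wall, r_logical = received if received else (0, 0)
        new_wall = max(self.wall, r_wall, physical)
        if new_wall == self.wall == r_wall:
            self.logical = max(self.logical, r_logical) + 1
        elif new_wall == self.wall:
            self.logical += 1          # our wall time is ahead: bump counter
        elif new_wall == r_wall:
            self.logical = r_logical + 1   # remote is ahead: follow it
        else:
            self.logical = 0           # fresh physical time dominates
        self.wall = new_wall
        return (self.wall, self.logical)
```

Comparing timestamps lexicographically then yields an order consistent with causality without requiring atomic clocks.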
As of the paper's publication, the only commercial NewSQL database not using an MVCC variant was VoltDB, which still uses timestamp-ordering concurrency control: transactions on each partition execute one at a time, rather than interleaving their operations as MVCC does. In VoltDB, single-partition transactions are scheduled in a decentralized manner, while cross-partition transactions are scheduled centrally. VoltDB orders transactions by logical timestamps and executes them in turn on each partition; while a transaction runs on a partition, it has exclusive access to all of that partition's data, so the system needs no finer-grained locking. The drawback of this partition-based concurrency control is that if a transaction spans multiple partitions and the network between the coordinator and those partitions is slow, the partitions sit idle, unable to process other requests. Partition-based concurrency control is not a new idea either: Hector Garcia-Molina proposed a similar variant in a 1992 paper, and it was implemented in kdb and later in H-Store (the academic predecessor of VoltDB).
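VoltDB-style serial execution per partition can be sketched like this: each partition drains a queue of transactions in logical-timestamp order, and since only one transaction runs at a time, it has the whole partition to itself and no record-level locks are needed. An illustrative sketch under those assumptions, not VoltDB's implementation.

```python
import heapq
import itertools

class PartitionExecutor:
    """One partition executes one transaction at a time, in logical
    timestamp order; the running transaction has exclusive access to
    the partition's entire key space."""
    def __init__(self, data):
        self.data = data              # this partition's data
        self._queue = []
        self._seq = itertools.count() # tie-breaker for equal timestamps

    def submit(self, ts, txn):
        heapq.heappush(self._queue, (ts, next(self._seq), txn))

    def run_all(self):
        results = []
        while self._queue:
            ts, _, txn = heapq.heappop(self._queue)
            results.append(txn(self.data))   # exclusive, serial execution
        return results
```

The cross-partition stall described above corresponds to this loop sitting blocked between transactions while it waits for a remote coordinator.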
Overall, there is no significant innovation in the core concurrency control mechanisms used by NewSQL databases; they are mainly engineering adaptations of older methods to modern hardware and distributed environments.
4. Secondary indexes
Supporting secondary indexes is not difficult for a single-node database, since all the data lives on one node, but it is much harder for a distributed database. For example, suppose a customer table is partitioned across physical nodes by customer ID. When a query looks up customers by email address instead, it has to visit every node to get a correct result.
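The cost asymmetry in this example can be made concrete with a toy model, where three in-memory dictionaries stand in for physical nodes (an illustration only, not any specific NewSQL system): a lookup on the partitioning key touches exactly one node, while a lookup on the email attribute must be scattered to all nodes and the results gathered.

```python
# Toy model: the customer table is hash-partitioned by customer id
# across NUM_NODES "nodes" (plain dicts standing in for machines).

NUM_NODES = 3
nodes = [dict() for _ in range(NUM_NODES)]   # node -> {cust_id: row}

def node_for(cust_id):
    return nodes[hash(cust_id) % NUM_NODES]

def insert(cust_id, email):
    node_for(cust_id)[cust_id] = {"id": cust_id, "email": email}

def lookup_by_id(cust_id):
    # Partitioning key: exactly one node is contacted.
    return node_for(cust_id).get(cust_id)

def lookup_by_email(email):
    # Secondary attribute: scatter the query to all nodes, gather matches.
    return [row for node in nodes for row in node.values()
            if row["email"] == email]

insert(1, "a@example.com")
insert(2, "b@example.com")
print(lookup_by_email("b@example.com"))
```

A secondary index on email exists precisely to avoid the full scatter in `lookup_by_email`; the design decisions below are about where such an index lives and who keeps it up to date.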
To support secondary indexes, a distributed database must make two design decisions:
- Where to store the secondary index
- How to maintain the secondary index within transactions
If the system has a centralized coordinator, as in the middleware approach, the secondary index can be stored at both the coordinator node and the partition nodes. The advantage of this approach is that there is only a single version of the index for the whole system to maintain.
NewSQL databases with new architectures typically use partitioned secondary indexes, meaning each node stores only a portion of the index, rather than every node holding a complete replicated copy. The trade-off between the two is as follows: with a partitioned index, a query may have to be routed to several nodes, but an update touches only one; with a replicated index, a lookup can be served by a single node, but any change to the indexed attribute must run as a distributed transaction that updates every copy.
Clustrix combines the two schemes: secondary indexes are range-partitioned, and every node stores the mapping from ranges to partitions. When a query or update arrives, it is first routed to the appropriate node, which then performs the operation. This two-tier design combines the advantages of both schemes.
If a NewSQL database does not support secondary indexes, the common workaround is for developers to build their own and store it in a distributed cache system. However, relying on an external system can let the index drift out of sync with the data, so developers must use this approach with caution.
5. Replication
The best way to ensure application availability and data durability is to replicate at the database level. All modern databases, including NewSQL systems, provide some kind of data replication mechanism.
There are two important design decisions in database replication:
How to ensure data consistency across nodes? In a strongly consistent database system, newly written data must be persisted by all of the relevant replicas before the transaction can be considered committed. This way, a read request can be sent to any replica, and the data it returns can be treated as the latest. Strong consistency is attractive, but to maintain such synchronization the DBMS must use an atomic commit protocol such as two-phase commit (2PC), and if a node fails or a network partition occurs during the protocol, the database service can stall. This is why NoSQL systems usually adopt a weakly consistent or eventually consistent model, under which the master node can acknowledge the transaction to the client without waiting for all replicas to persist the write. All of the NewSQL systems we know of support strongly consistent replication, but there is nothing novel in how they achieve it: the fundamental research on state machine replication for database systems dates back to the 1970s, and NonStop SQL, released in the 1980s, was the first distributed database system built on strongly consistent replication.
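The commit rule described above, acknowledge only after every replica has prepared, is the essence of 2PC. Below is a minimal sketch of the happy path with toy in-memory replicas (failure handling, timeouts, and the abort path are omitted; all names are hypothetical).

```python
# Two-phase commit sketch: phase 1 asks every replica to durably
# "prepare" the writes and vote; phase 2 commits everywhere. Only
# after phase 2 may the client be told the transaction committed.

class Replica:
    def __init__(self):
        self.durable = {}     # committed state
        self.staged = {}      # prepared-but-uncommitted writes

    def prepare(self, txn_id, writes):
        self.staged[txn_id] = dict(writes)   # e.g. fsync a prepare record
        return True                          # vote yes

    def commit(self, txn_id):
        self.durable.update(self.staged.pop(txn_id))

def two_phase_commit(replicas, txn_id, writes):
    # Phase 1: every replica must vote yes before we may commit.
    if not all(r.prepare(txn_id, writes) for r in replicas):
        return False                         # abort path omitted
    # Phase 2: commit everywhere, then acknowledge to the client.
    for r in replicas:
        r.commit(txn_id)
    return True
```

The availability problem in the text is visible here: if a replica crashes between the two phases, the coordinator cannot finish the `for` loop and the transaction's fate stays in limbo until the replica returns.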
How to propagate data across nodes? There are two main execution modes. The first is active-active replication, in which every replica executes the same request: when a new request arrives, the database system executes it on all replicas. The second is active-passive replication, in which the request is executed on one node first and the resulting state is then shipped to the other replicas. Most NewSQL systems use active-passive replication, mainly because requests may arrive at different nodes in different orders, so applying active-active replication naively easily leads to inconsistent replicas. By contrast, deterministic DBMSs such as H-Store, VoltDB, and ClearDB use active-active replication, because in a deterministic DBMS transactions execute in the same order on every node.
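A second reason active-active replication requires determinism is that the request itself may be non-deterministic (think of SQL's `NOW()` or `RANDOM()`): re-executing it on each replica diverges, while shipping the resulting state does not. The toy example below illustrates just that contrast; it is not modeled on any particular system.

```python
import random

# Active-active replays the *request* on each replica; active-passive
# executes once on the primary and ships the resulting *state*.

def nondeterministic_txn(db):
    db["token"] = random.random()     # stand-in for NOW()/RANDOM() in SQL

# Active-active with a non-deterministic request: replicas diverge.
r1, r2 = {}, {}
nondeterministic_txn(r1)
nondeterministic_txn(r2)
print(r1 != r2)                       # replicas almost surely differ

# Active-passive: execute once, then transfer state, not re-execute.
primary = {}
nondeterministic_txn(primary)
backup = dict(primary)                # state shipping
print(primary == backup)
```

A deterministic DBMS closes this gap differently: it removes the non-determinism from execution itself (fixing the transaction order and the values of non-deterministic functions up front), which is what lets H-Store-style systems use active-active safely.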
Where NewSQL systems differ from earlier database systems in engineering terms is that they also consider replication over the wide-area network. In the era of cloud services, deploying an application across multiple data centers is no longer difficult. Although NewSQL systems can support WAN data synchronization, users still need to ensure adequate network quality between data centers. As of the paper's publication, Spanner and CockroachDB were the only two databases offering optimizations for consistent replication over the WAN: Spanner uses a combination of atomic clocks and GPS-based hardware clocks, while CockroachDB uses a hybrid clock scheme.
6. Crash recovery
Another important feature of NewSQL database systems is their failure recovery mechanism. Fault tolerance in traditional databases focuses mainly on ensuring data durability; NewSQL systems must additionally provide high availability, that is, keep the database service usable even when failures occur.
In a traditional database, failure recovery is usually based on the write-ahead log (WAL): after a crash, the system restarts, loads the last checkpoint, and replays the WAL, restoring the correct state it had before the failure. This scheme, called ARIES, was invented by IBM database researchers around 1990, and today almost all database systems use ARIES or one of its variants.
In a distributed database with replicas, the traditional recovery scheme cannot be used directly. After the master node crashes, the system elects a replica as the new master. When the old master comes back online, the cluster has already accepted many new writes, so its local checkpoint and WAL alone cannot bring it up to date: it must synchronize all the new changes from the new master. There are currently two main ways to implement this:
First, recover from the local checkpoint and WAL, then pull the newer log entries from the master or other replicas. As long as the recovering node processes logs faster than new data is being written, it will eventually catch up with the cluster. This is feasible for databases that use physical or physiological logging, because applying log records locally is much faster than re-executing the original SQL.
Second, abandon the local checkpoint and WAL and instead fetch a fresh copy of the state from the master or another replica. The advantage of this scheme is that the same mechanism can also be used to add new nodes to the cluster.
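The first scheme can be sketched as a three-step catch-up loop. The toy model below uses an in-memory key/value store with log entries shaped as `(lsn, key, value)`; all names are illustrative assumptions, not any real system's recovery code.

```python
# Recovery scheme (1) sketch: restore the checkpoint, replay the
# local WAL, then apply only the master's log entries with an LSN
# greater than the last one we applied locally.

def recover(checkpoint, local_wal, master_log):
    state = dict(checkpoint)              # 1. load the last checkpoint
    applied = 0
    for lsn, key, value in local_wal:     # 2. replay the local WAL
        state[key] = value
        applied = lsn
    for lsn, key, value in master_log:    # 3. catch up from the master:
        if lsn > applied:                 #    skip entries we already have
            state[key] = value
            applied = lsn
    return state, applied

state, lsn = recover(
    checkpoint={"a": 1},
    local_wal=[(2, "b", 2)],
    master_log=[(2, "b", 2), (3, "c", 3)],  # (3, ...) was written while down
)
print(state, lsn)  # → {'a': 1, 'b': 2, 'c': 3} 3
```

The "keep up" condition from the text maps directly onto step 3: recovery only terminates if this loop drains faster than the master appends new entries.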
Usually, NewSQL systems based on middleware or DBaaS layer additional infrastructure, such as leader election, on top of the failure recovery mechanism of a single-node database to achieve the management capability they need. NewSQL systems built on new architectures instead abandon the single-node recovery mechanism and either use off-the-shelf log replication components such as ZooKeeper and etcd, or implement consensus algorithms such as Paxos and Raft themselves.
All of the solutions and technical components mentioned above have existed since the 1990s.
5、 Future trends
Future database systems should be able to run analytical queries and machine learning algorithms on freshly generated data. This kind of workload is often called real-time analytics or hybrid transaction-analytical processing (HTAP), which extracts insights and knowledge from historical data and new data at the same time. Traditional business analytics and business intelligence can only analyze historical data, but data is usually most valuable at the moment it is generated, and its value decays as time passes. Reducing the interval between when data is produced and when it is analyzed therefore yields greater value.
There are three ways to support HTAP workloads:
The most common method is to deploy two independent databases: one handles the OLTP workload (the front-end database) and the other handles the OLAP workload (the back-end database). The front-end database processes the data generated by new transactions, while in the background the system uses ETL tools to import data from the OLTP database into the back-end database, which is usually a data warehouse. The application executes OLAP queries on the back-end database to avoid slowing down the front-end OLTP system, and new data derived in the OLAP system is pushed back into the front-end OLTP database, forming a closed loop.
Another popular design is the lambda architecture: a batch processing system such as Hadoop or Spark computes comprehensive views over historical data, while a stream processing system such as Storm or Spark Streaming computes near-real-time views over newly generated data. The batch system periodically rescans the data set, and its results are fed back to the online service through the streaming system.
A common characteristic of the two schemes above is that data is routed to heterogeneous systems, which brings several problems. First, propagating data changes to the other system takes time, usually minutes or hours, so the analysis cannot be real-time. Second, deploying and maintaining two systems is operationally expensive; such labor costs can account for around 30% of the total cost. Third, if developers want to analyze the data in both systems at once, they must write two sets of data access logic. Some schemes try to hide the existence of the two systems and expose only one interface, but they usually still copy data from the OLTP system (such as HBase) to the OLAP system (such as Spark) behind the scenes.
The third solution is to use an HTAP database, that is, a single system that supports high-throughput, low-latency OLTP workloads while also running complex, long-running OLAP workloads over both historical (cold) data and new (hot) data. What distinguishes an HTAP database from a general-purpose one is that it combines implementation techniques specialized for OLTP (in-memory storage, lock-free execution) with techniques specialized for OLAP (columnar storage, vectorized execution).
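The row-versus-column split at the heart of an HTAP engine can be shown with a deliberately tiny illustration: plain Python lists stand in for the two storage layouts (real engines add vectorized execution, compression, MVCC, and much more).

```python
# Row layout: each record is stored contiguously, so an OLTP point
# update touches exactly one record.
rows = [{"id": i, "amount": i * 10, "region": "EU"} for i in range(5)]
rows[3]["amount"] = 999                 # fast point update

# Column layout: one attribute is packed together, so an OLAP
# aggregate scans a single dense array instead of whole records.
amount_column = [r["amount"] for r in rows]
total = sum(amount_column)
print(total)  # → 1069
```

An HTAP engine keeps both representations (or converts hot row data into columns in the background) so that the point update and the aggregate each hit the layout they are fast on, which is exactly the HANA/MemSQL design described next.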
SAP HANA and MemSQL were among the first NewSQL databases to claim support for HTAP workloads. HANA uses different execution engines internally: one stores data in rows, suited to OLTP workloads, and the other stores data in columns, suited to OLAP workloads. MemSQL uses different storage managers, one for row storage and one for column storage, combined at the execution engine layer. HyPer moved from an H-Store-like concurrency control and row-storage scheme to MVCC with columnar storage in order to support more complex OLAP queries. VoltDB has likewise shifted its marketing strategy from pure OLTP toward support for stream processing, and S-Store tries to add streaming capability on top of the H-Store architecture. Even some OLAP-oriented systems from the 2000s, such as Greenplum, have begun adding support for OLTP.
Although the rise of HTAP databases portends the end of the single, monolithic OLAP data warehouse, that will not happen in the short term, because these warehouses are still the universal back-end databases of most companies, storing all of their historical data. One day, however, it will no longer be necessary to move data around just to serve OLAP workloads.