This article is from: dbaplus community
Introduction to the author
Zhang Liang, head of data research and development of Jingdong Digital Technology Co., Ltd. He loves open source and currently leads two open source projects elastic job and sharding sphere (sharding JDBC). He is good at distributed architecture based on Java and cloud platform based on kubernetes and mesos. He advocates elegant code and has a lot of research on how to write expressive code. In early 2018, he joined Jingdong Digital Technology Co., Ltd. and now he is the head of data research and development. At present, the main focus is on building sharding sphere into a first-class financial data solution in the industry.
The accumulation of data is the enterprise wealth of today’s giants in all walks of life, and database is an important way of data storage. With the popularity of big data and micro services, the traditional relational database will also usher in changes. Cloud native database architecture is getting more and more attention, so I want to talk about cloud native data architecture with you. As the first part of this paper, we will first analyze the current development status of various databases.
1、 Is relational database still available?
Relational database has been a leader in the field of database for decades. The following figure is the database ranking published by the world’s more authoritative DB engines in June 2018. The ranking is mainly based on the number of keyword searches by Google and Bing search engines, the number of employees, the number of job searches, and the number of questions and concerns on stack overflow
As of June 2018, among the top six databases, only mongodb, which ranks fifth, is a document database, while the rest are relational databases, and the proportion of the top three databases is far ahead of other databases.
Relational database is still strong after the bombardment of big data, NoSQL and newsql, which is closely related to its inherent advantages. Its advantages are mainly reflected in the impact on developers, operation and maintenance personnel and the system itself.
For developers, the primary advantage of relational database is SQL oriented.
SQL is a structured query language for relational databases. Although different relational databases have different SQL dialects, SQL based on ANSI standard is supported by most relational databases. And SQL is a database oriented access language, which can easily add, delete, modify, query, authorize and manage the database. SQL query flexibility is very high, can be very convenient in online transaction processing (OLTP) and online analytical processing (OLAP) between the conversion.
In addition, SQL is a programming language that application development engineers must master, and its popularity is very wide. The probability of recruiting an application development programmer who can’t write SQL at all is very small. Therefore, SQL greatly reduces the cost of developer recruitment.
In addition to SQL language itself, various development languages support relational database very well. Take Java as an example: JDBC is the standard interface for Java language to access database, and various relational database manufacturers provide drivers to implement JDBC interface. Engineers using java do not need to perceive the differences between different relational databases, they only need to program according to JDBC interface.
Due to the difficulty of one-to-one correspondence between relational database storage and object-oriented Java program, many object relational mapping (ORM) frameworks are produced to simplify the impedance mismatch of relational object model, such as JPA and its official implementation hibernate, mybatis, jooq, etc., which further simplifies the daily development work of application engineers. ORM framework is mostly encapsulated by JDBC, which has high compatibility with relational databases.
Operation and maintenance advantages
Because relational database exists for a long time, for each common relational database, it is easy to recruit the corresponding database administrator (DBA), so as to ensure the stability, security, integrity and performance of relational database, and ensure the monitoring and analysis of the system bottleneck of relational database and the rationality of the design.
Mature relational databases have their own perfect ecosystem, which is used to ensure high availability, data backup, performance monitoring and analysis and other mature supporting tools. Large scale enterprises and important business systems generally need special DBAs for operation and maintenance.
Only time is the standard of maturity and stability of inspection technology. Relational database has been tested for decades, and has been used in a very large scale. Its storage engine has been very mature. The database engine based on mvcc achieves a good balance between performance and correctness, and greatly improves the query efficiency through B + tree index. In the face of key nodes such as data, careful selection of relational database is the preferred solution for architects.
The transaction based on acid is another powerful guarantee that relational database brings to application system. Acid is the acronym for the four basic elements that a database transaction can perform correctly. It includes atomicity, consistency, isolation and persistence. Only the database supporting transaction can guarantee the correctness and integrity of data to the maximum extent
- Atomicity。 All operations in the same transaction are either completed (COMMIT) or not completed (rollback) and cannot be stuck in an intermediate phase. If an error occurs during the execution of the transaction, the data will be restored to the state before the transaction.
- Consistency。 Non read only transactions should encapsulate the transition of the state of the database from one consistent state to another. Consistent state means that the data in the database should meet the integrity constraints, and the intermediate state of the transaction should not be perceived outside the transaction.
- Isolation。 When multiple transactions are executed concurrently, they should not affect other transactions, just as only one operation is executed by the database in parallel.
- Durability。 After the transaction is completed, all changes made by the transaction to the database will be persisted in the database.
It’s not difficult to use transactions in programming. Various development frameworks such as spring have made it very simple and elegant at the aspect oriented (AOP) level.
The performance and access capacity of relational database are impeccable in the era of enterprise application oriented to single data node. However, with the rapid expansion of the amount of visits and data, it is difficult for relational database to become the underlying support of such a large-scale system, or even the bottleneck of the application system.
There are three main shortcomings of relational database
- Single node concurrent access is limited。 At the same time, because the data stored in the database is stateful, it is difficult to arbitrarily split and expand the service. A single database node carries a large number of query and update requests from service nodes, which is not a peer-to-peer architecture deployment mode.
- Limited data capacity of single node。 The carrying capacity of a single database node is limited. The larger the amount of data, the deeper the index created for querying the data. Index depth determines the number of IO accesses. The deeper the index depth is, the slower the search is.
- Serious performance degradation of distributed transactions。 After splitting the database, we need to use distributed transaction instead of local transaction. Xa based distributed transaction uses two-phase commit, locking resources in the preparation phase until the end of the whole transaction. When the system concurrency increases, the performance will decline sharply.
To sum up, the shortcomings of relational database, in the final analysis, are caused by the original intention of the design. It is not a distributed product, and it is not friendly to distributed system, so it is difficult to adapt to the Internet architecture model. In the face of stateless services that can be flexibly expanded at any time, relational databases have become a little cumbersome.
2、 NoSQL not up to expectations
With the disadvantages of relational database exposed more and more obvious, NoSQL has become a useful supplement. However, NoSQL is not intended to replace relational database, but not only SQL, which provides an alternative to SQL.
NoSQL has many kinds of classification, including key value database, document database, column family database and graph database, which are used to solve different scenarios.
1. Key value database
The representative of key value database is redis。 It is used as a cache in many scenarios, but redis also provides disk dropping function. In the case of primary key query, redis is very efficient, but it can’t do anything about content query.
Redis provides the ability of cluster, which can disperse data to different nodes, effectively dispersing the access bottleneck of a single node. If all the data of redis cannot be loaded in the memory and the disk is dropped, the performance of redis will be degraded. Therefore, in the case of large amount of data, it is a good solution to partition the redis data according to the primary key.
Redis provides transaction functions through multi, exec, discard and watch commands. Redis transaction provides a one-time, sequential and non interruptible command execution mechanism. However, even if the execution of some commands in the transaction fails, it cannot be rolled back. Therefore, the transaction of redis does not correspond to the transaction in the database domain one by one.
2. Document database
The representative of document database is mongodb。 Document model is closer to object-oriented data expression. It has a high degree of freedom schema model, which can easily map with JSON data.
The design concept of document database is completely different from that of relational database. It has no statically defined table structure, but can flexibly add or delete attributes in documents and embed subdocuments and arrays. Therefore, the design of document oriented database application should focus on the object itself, rather than giving priority to how to define the database table structure. This design makes it very convenient for the development engineer to modify the program logic without considering the lock table caused by the change of database table structure.
Mongodb does not support acid transactions until the latest version 4.0. Previous versions only support final consistent transactions. Compared with the relational database, the stability of mongodb’s acid transaction has yet to be verified. Therefore, it is not recommended to use mongodb for very critical business systems such as order, transaction, accounting, etc., but for business systems such as forums that require lower level of data transaction.
3. Column family database
The representative of column family database is HBase in Hadoop big data system。 It is a distributed database specially used to deal with massive data.
HBase determines a record by row primary key and column family. The attribute in each column family is not fixed, which is similar to document database. HBase can also automatically segment data, making the data storage automatically have the ability of horizontal expansion. HBase’s data is stored in a distributed file system such as HDFS, and the support for massive data is the best.
LSM tree (log structured merge tree) was used in HBase. It puts the changes to the data in the memory, and after reaching the specified threshold, it merges the changes and writes them to the disk in batches, converting the single write operation into batch write operation, which greatly improves the write speed. But when reading data, HBase needs to find the data in memory and disk respectively, which has a certain impact on the performance. Therefore, HBase is more suitable for the application of write more and read less. In addition, HBase also does not support acid transactions, and can only query data through row keys.
Graph database is a database used to deal with graph relations. It is used in special scenarios, so it will not be introduced here.
In general, NoSQL database has many kinds, which are suitable for different scenarios. Let’s briefly compare the three types of NoSQL databases described in the previous article through the following table:
Although the use scenarios of NoSQL are very different, most of them support the functions of fragmentation and data migration required by distributed database, and are better than traditional relational database in the support of massive data and large concurrency.
NoSQL database can provide good scalability and flexibility, but its disadvantages are very obvious
Different NoSQL databases have their own query language. Compared with SQL, it is more difficult to make the application standard interface. And NoSQL can’t provide acid transaction operation, so many enterprises can’t apply NoSQL to core business system.
As the definition of NoSQL is based on relational database, not relational database.
3、 The rising newsql
Because SQL and acid transactions are too popular, and the demand for distributed database is unprecedented, so another kind of database, newsql, came into being.
Newsql is the abbreviation for all kinds of distributed and extensible databases. It inherits the processing ability of NoSQL for massive data, and maintains the support of traditional relational databases for SQL and acid transactions. The focus of newsql is on hybrid databases, which are more inclined to find a multi-mode database setup scheme that no longer distinguishes OLTP and OLAP queries.
In 2016, Andrew Pavlo and Matthew Aslet published a paper: “what’s really new with newsql?” in which they divided newsql into three categories: new architecture, transparent sharding middleware and Database-As-a-Service.
1. New architecture
This kind of newsql is a new database system for distributed architecture design.
They generally use share nothing architecture, support multi node concurrency control, highly fault-tolerant automatic data replication, flow control and distributed query processing.
Because they are designed for distributed multi node systems, they are better for query optimization and communication protocol between nodes. For example, multiple data nodes of newsql database can communicate directly without relying on the central node.
In addition to Google’s spanner, other similar databases need to manage the storage and distribution of data in disk and memory. This means that this type of database system is responsible for sending queries to the data node instead of copying the data to the requesting node to reduce network transmission.
Due to the use of a new architecture design and storage engine, it has not been fully verified by time, so enterprise technology selectors are particularly cautious. At the same time, because the number of engineers who have experience in the operation and maintenance of the new generation of newsql is also small, compared with the relational database, the current users are relatively small. Many enterprises will try to follow up the new architecture of newsql, but have not yet migrated the core system.
The most typical new architecture products are Google’s spanner and domestic database tidb.
2. Transparent slicing Middleware
Transparent partitioned database middleware allows applications to write data to multiple data nodes, but data nodes still use relational database oriented to single data node. Transparent sliced middleware uses central components to route data operation requests, coordinate transactions, manage data distribution and copy data. The whole cluster is a logical instance, and the application can be used smoothly without any change.
The core advantage of transparent partitioned database middleware is compatibility. It can switch between the existing stand-alone relational database and partitioned middleware at low cost, without any code changes by developers. They are designed to make full use of the computing and storage capacity of traditional relational database in a distributed environment, rather than realizing a new relational database.
In this way, not only the stability and compatibility of traditional relational database can be utilized, but also the processing of distributed scene can be added. Incremental rather than subversive is the core concept of this kind of newsql products. Because of open source and popularity, database middleware based on MySQL protocol is the most common.
Because the traditional relational database based on single data node is designed for disk, the memory based storage management and concurrency control is not as efficient as the new distributed architecture newsql. In addition, SQL parsing, query plan optimization and other work will be repeated in middleware and database, making the overall efficiency slightly inferior to the new design of newsql.
In the domestic large and medium-sized Internet companies, this kind of newsql is very popular, and each company basically has its own database middleware. However, due to the heavy coupling with the company’s internal business system, there are few mature open source products. The sharding proxy in the sharding sphere ecosystem that we will discuss later belongs to this kind of newsql product.
3. Cloud database
The last type of newsql is cloud database products provided by cloud computing companies. The user of the cloud database does not need to maintain the database and its hardware, but all the data is hosted to the services provided by the cloud platform. The user can connect to the cloud database through the URL of the database, and operate and monitor the system through API or operation dashboard.
The cost of using cloud database is the lowest, and engineers do not need to consider any details of the database. It is an ideal solution for small and medium-sized companies, but for companies with huge data volume, it is more appropriate to adopt the first two newsql open source or self-developed solutions.
Aurora provided by Amazon is a typical application of this kind of newsql.
In general, although newsql is not yet mature, it is a correct attempt for the future. The three types of newsql database have different concerns. The focus of the new architecture type of database is a thorough innovation; the focus of the transparent fragmented database middleware is incremental; and the cloud database pays more attention to the details of shielding users.
Although different types have their own merits, their core functions are similar. No matter which kind of newsql, hybrid database will be the future development direction. When OLTP and OLAP are no longer distinguished, the development cost will be greatly reduced.
So far, we have basically sorted out the current development status of all kinds of databases. In the next article, I will explain in detail the core functional features of cloud native databases. Students with relevant thoughts are also welcome to leave messages.