About the author: a coder and senior developer, active in the open source community (20K GitHub stars, 15K followers); a believer in digital currency who enjoys secondary market trading (stocks + crypto); currently doing investment work at Zhenge Fund.
If you want to talk with the author, you can add his WeChat: daimajia (please note where you found this and who you are). If you are starting a business, or have an idea for one, you are welcome to send your BP to, or discuss it with, the author by email: [email protected]
In the industrial age, coal and steel consumption indicated a country's degree of development. In the information age, the amount of data will be the new indicator, and almost all industry competition is, in essence, competition over data. Behind the growth of data stands the database engine, evolving generation after generation. In my investment work at Zhenge Fund, I see Chinese teams constantly challenging the overseas monopoly in the database field and building a new generation of database engines. In my spare time, I put together a brief summary of the history of database development.
Database technology has gone through four stages of development.
Stage 1: Non-relational databases
Before modern databases emerged (in the 1960s), the file system was the earliest form of database. Programmers read text files, extracted key data from them in code, and tried to construct the relationships between the data in their heads. The programming languages popular in those years often had strong file and text processing capabilities (such as Perl). As data volume grew, data dimensions diversified, and requirements for data credibility and security rose, simply storing data in TXT files became extremely challenging.
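A minimal sketch of that era's approach, assuming a hypothetical plain-text file inventory.txt with one comma-separated record per line (the filename, format and record fields are invented for illustration):

```python
# Sketch: treating a plain text file as a "database" (pre-DBMS style).
# Assumes a hypothetical file inventory.txt with lines like: P01,Apple,3.50

def find_product(path, product_code):
    """Scan the whole file line by line to find one record."""
    with open(path) as f:
        for line in f:
            code, name, price = line.strip().split(",")
            if code == product_code:
                return {"code": code, "name": name, "price": float(price)}
    return None  # no index, no types, no integrity checks

print(find_product("inventory.txt", "P01"))
```

Every query is a full scan, and the "schema" lives only in the programmer's head, which is exactly the pain the DBMS concept set out to remove.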
People then began to put forward the concept of the database management system (DBMS). The evolution of databases is, at its core, a continual rethinking and optimization of data structures and data relationships.
Hierarchical and network models (1960s)
The database model of the first stage was the hierarchical model.
Figure 1: a hierarchical database expressing a school's structure
The hierarchical model is the earliest database model, and it spread along with the early IBM mainframes. Compared with managing data in text files, this model was a huge improvement, but it also had many problems.
Problems:
- It expresses one-to-many structures well, but it is difficult to express many-to-many structures.
  - For example, the figure can express that a department has multiple teachers, but it struggles to express that a teacher may belong to multiple departments.
- The hierarchy is not flexible enough.
  - For example, adding a new relationship may force great changes to the whole database structure, which creates a huge workload in real development.
  - Querying data requires keeping the latest structure diagram in mind at all times and traversing the tree structure to derive answers.
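To make the many-to-many problem concrete, here is a toy sketch of a hierarchical record, loosely following the school structure of Figure 1 (the names are invented). Every record has exactly one parent, so a teacher in two departments must be stored twice:

```python
# Sketch: a hierarchical (tree) database record. Each node has one parent,
# which is why many-to-many relationships cannot be expressed cleanly.

school = {
    "name": "School",
    "children": [
        {"name": "Dept A", "children": [{"name": "Teacher 1"}, {"name": "Teacher 2"}]},
        {"name": "Dept B", "children": [{"name": "Teacher 2"}]},  # Teacher 2 duplicated
    ],
}

def find(node, name, path=()):
    """Answering any query means traversing the tree from the root."""
    path = path + (node["name"],)
    if node["name"] == name:
        print(" -> ".join(path))
    for child in node.get("children", []):
        find(child, name, path)

find(school, "Teacher 2")  # prints two paths: the same record exists twice
```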
Building on the hierarchical model, people then proposed an optimization: the network model.
Figure 2: a network model database
The network model was the most popular database model before the relational database appeared. It solves the many-to-many problem. However, the following problems remained:
Problems:
- It is difficult to implement and maintain at the code level
- Querying data still requires keeping the latest structure diagram in mind at all times
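A companion sketch to the one above, under the same invented names: in the network model a record may be linked from multiple owners, which solves many-to-many, but the programmer still navigates the links by hand:

```python
# Sketch: a network-model record linked from multiple owners.
# No duplication, but every query is manual pointer-chasing.

teacher2 = {"name": "Teacher 2"}
dept_a = {"name": "Dept A", "members": [{"name": "Teacher 1"}, teacher2]}
dept_b = {"name": "Dept B", "members": [teacher2]}  # same record, shared link

# "Which departments does Teacher 2 belong to?" requires knowing the
# structure and walking every owner record yourself.
for dept in (dept_a, dept_b):
    if teacher2 in dept["members"]:
        print(dept["name"])
```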
Stage 2: Relational databases
Early models (1970)
The relational model was a great leap over the network model. In the network model, one type of data always depends on another type of data; as shown in Figure 1, teachers are subordinate to departments. That dependency is the source of pain in the real-world design and development of hierarchical and network models.
One of the innovations of the relational model is to remove the hard links between tables and store a relationship in an ordinary field of the current table, making the tables relatively independent of one another. Take the tables below: looking only at table2, you know that product_code points to the details of a specific product. table2 and table1 are naturally connected while remaining relatively independent.
Figure 3: a relational database
The product_code column in table2 points to the corresponding data in table1, thus establishing the relationship between table2 and table1
In 1970, when E. F. Codd proposed this model, people thought it would be difficult to implement. As in the example above, when you retrieve table2 and hit the product_code column, you have to scan table1 again. Limited by the hardware of the time, this retrieval method kept machines overloaded. But soon, with the blessing of Moore's law, what everyone questioned was no longer a problem. IBM DB2, Ingres, Sybase, Oracle, Informix and MySQL, names you still hear today, were all born in this era.
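Here is a small runnable sketch of the Figure 3 relationship using SQLite. The table and column names follow the figure; the sample rows are invented for illustration:

```python
# Sketch: table2 stores only product_code; a JOIN resolves it against table1.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (product_code TEXT PRIMARY KEY, name TEXT, price REAL);
    CREATE TABLE table2 (order_id INTEGER PRIMARY KEY,
                         product_code TEXT REFERENCES table1(product_code),
                         quantity INTEGER);
    INSERT INTO table1 VALUES ('P01', 'Apple', 3.5);
    INSERT INTO table2 VALUES (1, 'P01', 10);
""")

for row in conn.execute("""
        SELECT t2.order_id, t1.name, t1.price, t2.quantity
        FROM table2 AS t2 JOIN table1 AS t1 USING (product_code)"""):
    print(row)  # (1, 'Apple', 3.5, 10)
```

The JOIN is exactly the "scan table1 again" step that early critics worried about; modern indexes and hardware made it cheap.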
At this point, a major category was born in the database field: online transaction processing (OLTP), databases dedicated to day-to-day transactions, such as the insert, delete and update workloads of bank transactions. The other major category, online analytical processing (OLAP) databases, is covered later.
Data warehouses (1980s)
With the development of relational databases and the digitalization of different business scenarios, people began to collect data across business scenarios and to analyze it to assist business decisions (decision support systems). From this requirement, the concept of the data warehouse was born.
As shown in the figure below, an enterprise often stores data from different business scenarios in different databases. Before mature data warehouse products existed, data analysts had to do a great deal of preparatory work to gather the data they needed. The essence of the data warehouse is to serve the business scenario of data analysis and mining.
Figure 4: Data Warehouse
Explanation: ETL is the abbreviation of extract, transform, load. Because data in different databases or systems may have inconsistent formats, inconsistent units and so on, a preprocessing step is needed.
A data warehouse is a subject-oriented, integrated, non-volatile, time-variant collection of data in support of management's decision-making.
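A minimal ETL sketch, assuming two hypothetical source systems whose field names, date formats and currencies differ (the records and the exchange rate are invented):

```python
# Minimal ETL sketch: unify records from two source systems, then load
# them into one warehouse table.
import sqlite3

sales_cny = [{"date": "2021-01-02", "amount_cny": 700.0}]  # extract: system A
sales_usd = [{"day": "02/01/2021", "usd": 100.0}]          # extract: system B

def transform(rec):
    """Normalize field names, date format and currency (rate is assumed)."""
    if "usd" in rec:
        d, m, y = rec["day"].split("/")
        return (f"{y}-{m}-{d}", rec["usd"] * 7.0)  # hypothetical CNY rate
    return (rec["date"], rec["amount_cny"])

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (day TEXT, amount_cny REAL)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?)",  # load
                      [transform(r) for r in sales_cny + sales_usd])
print(warehouse.execute(
    "SELECT day, SUM(amount_cny) FROM sales GROUP BY day").fetchall())
```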
OLAP (online analytical processing)
With the concept and implementation of data warehouses in the 1980s, people tried to do data analysis on top of them. But new problems appeared in the process, the most obvious being efficiency: earlier relational databases were not built for analysis. What data analysts wanted was an engine supporting multidimensional data views and operations.
Take the data cube below: compared with the two-dimensional display and operations of the relational databases above, an OLAP database can quickly build and operate on multi-dimensional data.
Figure: a data cube, organizing and displaying data along multiple dimensions
Figure: the various operations on a data cube
In 1993, Edgar F. Codd, the father of the relational database, proposed the concept of online analytical processing (OLAP). In essence it is the idea of multidimensional databases and multidimensional analysis, aiming to meet the specific query and reporting requirements of decision support and multidimensional environments.
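To make "multidimensional views" concrete, here is a toy roll-up over a three-dimensional fact table in pure Python; the dimensions (time, region, product) and figures are invented, and a real OLAP engine would precompute and index such aggregates rather than scan on every query:

```python
# Toy "data cube": aggregate a fact table along any chosen dimensions.
from collections import defaultdict

facts = [  # (time, region, product, sales)
    ("2021-Q1", "North", "Phone", 120),
    ("2021-Q1", "South", "Phone", 80),
    ("2021-Q2", "North", "Laptop", 200),
]

def roll_up(facts, dims):
    """Sum sales grouped by any subset of the (time, region, product) dimensions."""
    idx = {"time": 0, "region": 1, "product": 2}
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[idx[d]] for d in dims)
        totals[key] += row[3]
    return dict(totals)

print(roll_up(facts, ["region"]))           # slice: sales per region
print(roll_up(facts, ["time", "product"]))  # dice: sales per quarter per product
```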
Stage 3: NoSQL
Time moved on, and with the arrival of the Internet era, the surge in data volume brought new challenges to relational databases. The most obvious were the following:
Challenge 1: the high cost of adding data columns
Tables in a relational database are defined in advance. When the database already holds hundreds of millions of rows and the business needs a new column of data, you discover, to your surprise, that under relational rules you must touch all of those rows at once to add the column (otherwise the database reports errors), a great challenge to server performance in a production environment.
Imagine social networking sites like Facebook, Twitter and Weibo, where fields change constantly as new features ship every day.
For example, if you need to add a status column, you must write active or inactive into hundreds of millions of rows in one go, otherwise the database will not satisfy its integrity constraints; a sketch of this follows.
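A sketch of Challenge 1 in SQLite (the users table and the status default are placeholders taken from the example above):

```python
# Sketch: adding a NOT NULL column to an existing table forces a value
# onto every existing row.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("a",), ("b",)])

# Without a DEFAULT, SQLite rejects this: existing rows would violate NOT NULL.
conn.execute("ALTER TABLE users ADD COLUMN status TEXT NOT NULL DEFAULT 'active'")

# On hundreds of millions of rows, engines that rewrite the table at this
# step do very real work in production.
print(conn.execute("SELECT * FROM users").fetchall())
# [(1, 'a', 'active'), (2, 'b', 'active')]
```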
Challenge 2: database performance
As business scale kept growing, the performance problems of relational databases began to surface. Although database vendors proposed all kinds of solutions, the underlying relational design remained the fundamental cause of the performance ceiling. Developers began to resort to extreme measures such as splitting databases, splitting tables, and caching to squeeze out performance.
In response to these challenges, a new database model, NoSQL, was proposed.
To solve the column expansion problem, NoSQL proposes a new data storage format that removes the relationships of the relational model. With no associations between records, the architecture gains scalability in exchange.
A new data structure: related data is stored together
The underlying innovation of NoSQL is that it was born for clustered, scalable scenarios.
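A sketch of the document-style layout described above, with related data nested inside one self-contained record instead of spread across joined tables (the order fields are invented):

```python
# Sketch: one self-contained document holds all its related data.
import json

order = {
    "order_id": 1,
    "customer": {"name": "Alice", "city": "Beijing"},
    "items": [
        {"product_code": "P01", "name": "Apple", "price": 3.5, "quantity": 10},
    ],
    # A new field needs no schema change across the whole collection:
    "status": "active",
}

# Each document can be stored, sharded and served independently,
# which is what makes cluster scalability natural.
print(json.dumps(order, indent=2))
```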
On the basis of NoSQL theory, four types of databases developed according to enterprise application scenarios (a key-value example is sketched after this list):
- Document databases (document-oriented): such as MongoDB and CouchDB. "Document" here refers to a data storage structure, such as XML, JSON or JSONB.
- Key-value databases: Redis, Memcached and Riak are key-value databases
- Column-family databases: such as Cassandra and HBase
- Graph databases (graph-oriented): such as Neo4j and OrientDB. This kind of database focuses on organizing data around the chains of relationships between data.
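A brief key-value sketch using the redis-py client; it assumes a Redis server running on localhost (install the client with `pip install redis`), and the key names are invented:

```python
# Sketch of the key-value style: no tables, no schema, just keys and values.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

r.set("user:42:name", "Alice")
r.set("user:42:city", "Beijing")
print(r.get("user:42:name"))  # "Alice"
```

The key design (`user:42:name`) carries the structure that a relational schema would otherwise hold, which is why key-value stores shard so easily across a cluster.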
As enterprise data kept growing, new demands were placed on data processing capabilities. The phrase "big data" that we hear every day stands for a huge technology architecture covering data collection, cleaning, computation, storage, analysis and other links; the database is just one part of it. In the figure below, a big data architecture from 2017, the databases discussed in this article basically cover only the storage link. Hadoop, Kafka, Hive, Spark and Materialize are the big data engines you hear of every day; don't confuse them with databases.
The database is just one part of the big data concept
Stage 4: Cloud-native databases
With the advent of the cloud era, cloud-native databases built for cloud environments keep taking database market share.
The biggest difference between a cloud-native database and a managed or self-built database is that a cloud-native database treats CPU, memory, storage and so on as independent resources, each of which can scale elastically on its own. By drawing on the massive resource pools of large cloud vendors, it maximizes resource utilization and reduces cost, and it supports scaling specific resources independently to meet the changing business needs of many kinds of users, ultimately achieving fully serverless operation. A managed database, by contrast, is still confined to the traditional server architecture: the ratio between its resources is fixed within a narrow range, its elasticity and resource utilization are greatly limited, and it cannot fully enjoy the dividends of the cloud.
Built on cloud-native database technology, future startup teams will not need to spend great energy coping with the onslaught of massive data; they can simply focus on their business.
Representatives of cloud-native databases include Alibaba Cloud's PolarDB, Tencent Cloud's CynosDB, Huawei Cloud's TaurusDB and Amazon's Aurora.
Finally, this article closes with a database landscape map from Alibaba CIO Academy. The database products and their placement in the diagram represent the current pattern of the database industry well.
Appendix:
In the database field there is a CAP theorem that has to be mentioned; if you are interested, you can read Ruan Yifeng's blog.
CAP theorem
In the field of modern databases, there is a CAP theorem:
- Consistency
- Availability
- Partition tolerance
Put simply, the CAP theorem says a distributed database cannot achieve consistency, availability, and partition tolerance all at the same time. For a more concrete explanation, see Ruan Yifeng's blog; it is very well written, and I won't expand on it here.
Relational databases choose consistency and partition tolerance, while NoSQL chooses partition tolerance and availability to meet business needs.