The third generation of an analytical database


Editor’s recommendation:
In the 1960s, the first enterprise database products were born. Over the following 60 years, the database field has iterated continuously. With the explosive growth of data volumes and increasingly rich data types, new databases are emerging across the board, and standardized cloud services have become a trend.
Let’s revisit the “past and present lives” of the database from a historical perspective.

It has been more than 20 years since I first studied database management systems at university. Looking back on my career, landing on this one-way road of databases was something of an accident: my first job assigned me to an Oracle sub-group, and that began an inseparable bond with databases.
There is far too much to say about databases, and I dare not discuss academic theory here. I simply want to lay out my own understanding, in the hope of helping friends who are interested in databases better understand this ancient yet vibrant thing.

What is a database

The word “database” is simply data + base: as the name suggests, the source and foundation of data. So why have a database at all? A database is, first of all, a piece of software. Before databases were born, the common practice was for a programmer to write a small program to complete each data-processing or analysis task.

As computers became popular, more and more scenarios began to use them, producing more and more data and more demand for data analysis. To lower the threshold of data analysis and let more people manage and analyze data conveniently and efficiently, engineers created a dedicated piece of software that stores data in a sensible layout to improve access efficiency, provides easy-to-use interfaces and rich analysis algorithms, and integrates effective management tools to improve data security. This is the database, also known as the database management system (DBMS).

A database is a complete data management system, covering the data storage model, data organization, analysis algorithms, management tools, access interfaces, and so on.

Take a granary as an analogy. If you farm a small plot that produces just enough food for your family, any leftover grain can go into a jar, and the jar only has to be convenient for your household. But if you farm 10,000 mu and cannot possibly eat everything you produce, you must build a dedicated warehouse to store the grain, one that also makes it convenient for different merchants to haul grain away. To keep the stored grain safe and the operation efficient, the warehouse needs special design and treatment: constant temperature and humidity, automatic sprinklers, conveyor systems, and so on. A database is similar.

The database traces its origins to the Apollo moon-landing program. At the time, large numbers of analysts were needed to process enormous amounts of data, so a data management and analysis tool usable by more people had to be developed. It was truly a lighthouse moment for mankind, and the NASA engineers deserve the credit.

What are the core functions of a database

Databases fall into different categories depending on the application scenario. The most classic division is OLTP (online transaction processing) versus OLAP (online analytical processing). For example, you pay by card every day: taking the subway, buying lunch, buying drinks, shopping on Taobao. Each transaction must be recorded accurately in a backend database; that is the OLTP type.

You may also query last month’s spending through the system. The system aggregates last month’s transactions and tells you how much went to meals, transportation, entertainment, and so on. OLAP is the type that supports this scenario.

OLTP mainly handles short transactions and requires high transaction throughput, because each person may pay a dozen or more times a day, yet each operation touches a relatively small amount of data. OLAP may be used only once a month per person, but each query processes a large amount of data and the computation is complex.
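The contrast can be sketched with Python’s standard-library `sqlite3` standing in for the card-payment backend (the table and category names are illustrative, not from any real system): OLTP is many short transactions each writing one small row, while OLAP is one infrequent query that scans and aggregates everything.

```python
import sqlite3

# In-memory database standing in for the payment backend.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE payments (
        id INTEGER PRIMARY KEY,
        category TEXT,
        amount REAL
    )
""")

# OLTP: many short transactions, each touching a single small row.
for category, amount in [("transport", 3.0), ("meals", 12.5),
                         ("transport", 3.0), ("shopping", 45.0)]:
    with conn:  # each block commits as one transaction
        conn.execute("INSERT INTO payments (category, amount) VALUES (?, ?)",
                     (category, amount))

# OLAP: one infrequent query that scans and aggregates all the data.
report = conn.execute("""
    SELECT category, SUM(amount) FROM payments
    GROUP BY category ORDER BY category
""").fetchall()
print(report)  # [('meals', 12.5), ('shopping', 45.0), ('transport', 6.0)]
```

A real OLTP engine is tuned for the first loop (high write throughput, point access), while an OLAP engine is tuned for the last query (large scans, heavy aggregation).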

In recent years, further categories have emerged, such as HTAP (hybrid transactional/analytical processing) platforms that handle both workloads in one system, and multi-model databases that handle multiple data types. How should we understand these classifications?

It is like cars with different functions: trucks, buses, MPVs, SUVs, pickups, fuel vehicles, new-energy vehicles, and so on. The core function of a car is always the same, but to suit different scenarios and needs, different cars get different architectural designs and tuning. That is all.

So, what core functions should a database have?

First of all, a database must store data. Data should be kept safely, in a reasonable format, on persistent storage media, guaranteeing its correctness, integrity, and security. This is the core function of any data system. In other words, once data is handed to the database, the database must ensure it is neither lost nor corrupted. That is the minimum requirement, just as with the granary: grain that cannot be stored properly goes moldy or gets eaten by mice.

Secondly, a database should make data access as efficient as possible: store data in efficient layouts so that writes are fast, the data is easy for users to understand and convenient for upper-layer applications to use, and queries run faster as well. It is like grain arriving at the depot: it should be weighed, dried, inspected, packed, and stored quickly, not after a week’s wait. And when one merchant wants wheat and another wants corn, the depot must quickly locate the right storage area and hand over the grain.

Thirdly, a database should provide rich data analysis algorithms and complete data-intensive computation inside the database as far as possible, reducing data-transfer overhead and the computational load on upper-layer business logic. Likewise, a grain depot should offer complete on-site processing: weighing, drying, packaging, quality grading, and so on, to facilitate grain trading.

Finally, a database should provide easy-to-use interfaces, lower the bar for data analysts, and support a variety of analysis tools to make data more convenient to use. A grain depot, similarly, should have convenient parking, clear signage, and professional, friendly staff.

What are the core components of a database

A database’s core components usually cover the following functions:

a. Storage management

How data is organized and stored: key-value or relational; row-oriented or column-oriented; whether compression, deletes, and updates are supported; which data types and storage interfaces (POSIX file systems or object storage) are supported; whether storage-compute separation, distributed storage, transaction processing, and multiple replicas are supported; and which algorithms are used to speed up retrieval (indexing). Storage management is the core component of a database: solve storage management well, and half the database problem is solved.
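The row-versus-column choice mentioned above can be illustrated with a toy sketch (plain Python lists and dicts, not a real storage engine): the same records laid out record-by-record versus column-by-column.

```python
# Toy illustration (not a real storage engine): the same three records
# laid out row-wise and column-wise. Field names are made up.
records = [
    {"id": 1, "name": "wheat", "tons": 10.0},
    {"id": 2, "name": "corn",  "tons": 7.5},
    {"id": 3, "name": "rice",  "tons": 4.0},
]

# Row store: each record is kept together -- good for OLTP point
# lookups and single-record updates.
row_store = [(r["id"], r["name"], r["tons"]) for r in records]

# Column store: each column is kept together -- good for OLAP scans,
# and each column compresses well because its values are homogeneous.
column_store = {
    "id":   [r["id"] for r in records],
    "name": [r["name"] for r in records],
    "tons": [r["tons"] for r in records],
}

# An aggregate over one column only has to read that column's array,
# not every field of every record.
total_tons = sum(column_store["tons"])
print(total_tons)  # 21.5
```

This is why analytical databases tend to be columnar: queries that touch one column out of hundreds can skip the rest entirely.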


b. Query optimizer

To execute a query efficiently, the database must find an optimal execution path: whether to use an index at all; if there are several indexes, which one to pick; if the data is spread across different storage units (tables, collections, etc.), in what order to access them for the best efficiency; and so on. The optimizer faces what can be an extremely complex path-planning problem. It must compute a near-optimal path in a very short time and relies on a large number of core optimization algorithms; it is among the most complex parts of a database.
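You can watch a real optimizer make the index-or-scan decision using SQLite’s `EXPLAIN QUERY PLAN` (the table and index names below are illustrative): before an index exists the only available path is a full scan; once one exists, the optimizer switches to an index search.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (city TEXT, cost REAL)")
conn.executemany("INSERT INTO trips VALUES (?, ?)",
                 [("Shanghai", 0.0), ("Hainan", 1200.0)])

def plan(sql):
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail);
    # the detail column describes the access path the optimizer chose.
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# Without an index, the optimizer can only scan the whole table.
before = plan("SELECT * FROM trips WHERE city = 'Hainan'")

# After an index exists, the optimizer switches to an index search.
conn.execute("CREATE INDEX idx_city ON trips(city)")
after = plan("SELECT * FROM trips WHERE city = 'Hainan'")

print(before)
print(after)
```

The exact wording of the plan text varies by SQLite version, but the first plan reports a scan and the second reports a search using `idx_city`.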

For example, suppose you want to take the whole family, elderly and children included, from Shanghai to Hainan, and you want a plan with the best cost-performance and the highest family satisfaction. What factors go into the plan? First, how to get there: by car, by train, or by plane. If driving: how long does it take, how many rest stops are needed, do you and your wife have the time, can the elderly and the children endure it, and what about fuel and tolls? If flying: how to get to the airport, how much luggage you can bring, whether there is a ticket discount, what to do after landing, whether to rent a car. Then the accommodation: somewhere quiet for the elderly, near the attractions the children like, within budget, and so on. By this point, can you feel how many factors a query optimizer has to weigh?


Of course, this work can be implemented relatively simply (rule-based). As my wife put it: fix the dates, fly both ways, stay in a five-star hotel with a private beach, and the plan becomes much simpler. But it can also be almost unimaginably complicated (based on machine learning, on actual costs, and so on). If your wife says you are fully responsible, the dates are flexible anywhere from August to September, and you should spend less, do your homework, and find the best deal, then planning becomes very complex and demands a great deal of decision-making information; but the resulting plan will likely be closer to optimal and adapt to more scenarios than a rule-based one.

c. Execution module

After the optimizer produces an execution plan, the execution module performs the corresponding computation on the data according to that plan: data access, ordinary arithmetic, sorting, averages and hashing, even some machine learning algorithms, plus data compression/decompression, and finally returns the results to the client.
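One of the operators such an execution module runs is hash aggregation. A minimal sketch (the function and column names are made up for illustration) of how an executor might evaluate `SELECT category, AVG(amount) ... GROUP BY category`:

```python
# Toy hash-aggregation operator: one pass over the input rows,
# keeping a running (sum, count) per group in a hash table.
def hash_avg(rows):
    groups = {}  # category -> [running sum, running count]
    for category, amount in rows:
        state = groups.setdefault(category, [0.0, 0])
        state[0] += amount
        state[1] += 1
    # Finalize: turn each (sum, count) into an average.
    return {k: s[0] / s[1] for k, s in groups.items()}

rows = [("meals", 10.0), ("meals", 14.0), ("transport", 3.0)]
print(hash_avg(rows))  # {'meals': 12.0, 'transport': 3.0}
```

Real engines apply the same single-pass idea, but over compressed column batches and often in parallel across cores or nodes.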


d. Internal management and scheduling

To work properly, a database also needs internal coordination and management modules, such as memory-storage synchronization, storage-space compaction, metadata management, cluster state detection, fault tolerance, and failure recovery.

e. Management tools and interfaces

To improve ease of use, a database needs to provide a set of management tools, such as backup/recovery, health checks, runtime monitoring, resource isolation, permission management, security auditing, user-defined interfaces, and various data access interfaces.

The development and future of databases

The development of the database has tracked the development of computer architecture: from mainframes, to personal computers plus networks (x86), to cloud services, the database has gone through a corresponding series of evolutions.


a. The mainframe era

The earliest computers and databases were used only in aerospace and military fields, and only needed to support professional data analysts. By the late 1970s, as computers entered more business scenarios, demand for large-scale data analysis arose, and databases had to serve more general users. In IBM’s first paper on the relational database, the most emphasized point was to let users analyze data efficiently without worrying about how the data is stored and organized.

To make databases easy to use, SQL (Structured Query Language) was defined. With this syntax, database users only need to think about how to analyze the data, not about the underlying data distribution and storage.

To support concurrent data operations by a large number of users, database transaction semantics were defined, guaranteeing that even under concurrent operations, users see data consistent with the business logic.
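The key transaction guarantee, atomicity, can be demonstrated with `sqlite3` (the account names and amounts are invented for the example): if anything fails mid-transfer, the whole transaction rolls back and no half-done state is visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100.0), ("bob", 0.0)])
conn.commit()

try:
    with conn:  # one transaction: both updates commit, or neither does
        conn.execute(
            "UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
        raise RuntimeError("crash mid-transfer")  # simulate a failure
        conn.execute(
            "UPDATE accounts SET balance = balance + 50 WHERE name = 'bob'")
except RuntimeError:
    pass  # the context manager already rolled the transaction back

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100.0, 'bob': 0.0} -- no money vanished
```

Without the transaction, the simulated crash would have left alice 50 short with bob never credited, exactly the inconsistency transactions exist to prevent.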

To ensure both efficiency and safety, the database redo log (transaction log) was designed, along with a series of concepts that still appear in today’s databases: undo log, commit log, checkpoint, and so on.
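SQLite exposes one concrete form of this idea, write-ahead logging: committed changes are first appended to a sequential log so they survive a crash, and the main database file is brought up to date later at a checkpoint. A small sketch (the file path is arbitrary):

```python
import os
import sqlite3
import tempfile

# WAL mode needs a real file; an in-memory database has no log to keep.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(path)

# Switch the journal from the default rollback mode to write-ahead logging.
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
print(mode)  # 'wal'

conn.execute("CREATE TABLE t (x INTEGER)")
with conn:
    conn.execute("INSERT INTO t VALUES (1)")
# The commit is durable once its WAL record reaches disk; the main
# database file is updated later, at a checkpoint.
conn.close()
```

The same division of labor (sequential log for durability, checkpoints for the main data files) appears, under various names, in essentially every transactional database.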

In the mainframe era, hardware was extremely expensive; storage, memory, and CPU were all scarce. Database design and usage therefore adopted algorithms and architectures that reduce memory use, reduce data redundancy, and improve retrieval efficiency. Hence the great development of index types, powerful query optimizers, and data-caching algorithms. At the same time, users had to design elaborate data models (third normal form, star schema, snowflake schema, etc.) to reduce redundancy, which of course also made database applications harder to develop.

b. The x86 era

With the wide adoption of x86 servers and advances in networking, it became more cost-effective to connect N x86 servers into a cluster and use the cluster’s combined compute and storage in place of an expensive mainframe. Under this trend, various distributed database systems were designed to exploit the cluster. Their core idea is to spread data across nodes and use the compute and storage resources of many nodes to scale data storage and analysis. Under a distributed architecture, consistency protocols, multi-replica mechanisms, high availability, data sharding, and scale-out/scale-in all became problems a distributed database must design for and solve.
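The data-sharding mechanism mentioned above is, at its simplest, hashing a key to pick a node. A minimal sketch (node names and keys are invented; real systems add replication and rebalancing on top):

```python
import hashlib

NODES = ["node-0", "node-1", "node-2"]

def shard_for(key: str) -> str:
    # Hash the key and map it onto a node. A stable hash (md5 here)
    # keeps the placement deterministic across processes, unlike
    # Python's built-in hash(), which is salted per process.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[h % len(NODES)]

# The same key always routes to the same node, so reads can find
# what writes stored without consulting a central directory.
assert shard_for("user:42") == shard_for("user:42")
placement = {k: shard_for(k) for k in ["user:1", "user:2", "user:3"]}
print(placement)
```

Production systems typically refine this with consistent hashing or range partitioning so that adding or removing a node only moves a fraction of the data.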

In the x86 era, with hardware costs falling sharply, users cared more about the flexibility of data analysis and the speed of delivery. When using a database, attention shifted to accelerating the analysis process and making data easier for people to understand, rather than building complex models merely to reduce data redundancy.

c. The cloud era

As technology advanced further, cloud services, which virtualize or containerize traditional hardware to raise resource utilization and cut production and operations costs, were adopted by more and more enterprises. To fit the cloud service stack, databases gained corresponding cloud features, such as storage-compute separation, elastic scaling, microservices, and cross-region data synchronization.

In the cloud era, users focus on the efficiency and return on investment of data analysis, and on whether a product provides convenient, integrated data processing services so that business developers can concentrate on the business itself. Database services are accordingly evolving toward standardized cloud services.

d. Prospects

Different database architectures and deployment models are not simply successive replacements; they coexist and iterate gradually over a long period. Today, many financial institutions still run database products on mainframes, though new business and scenarios there are very limited; data-processing products based on x86 servers remain the mainstream enterprise choice; and meanwhile, the market share of cloud databases keeps growing. Which database product to adopt should be decided by your own business needs: the right fit is the best. That said, from the perspective of technological evolution, cloud technology (both public and private) is the general trend, because the cloud delivers higher efficiency.

The database, alongside chips and operating systems, is one of the three foundational technologies of the information industry, and it has long been hot in both capital and technology. In recent years, a considerable number of excellent database products and companies have emerged in China. As mankind moves toward a digital civilization, ever more data will be generated, and ever more value will need to be mined from it. As the core carrier of data, the database will continue to play an important role. I am fortunate to have worked in this field all along, and I look forward to contributing, together with my colleagues, to the progress of human digital technology.

With the vision of redefining data science, Zilliz is committed to building a world-leading open-source technology innovation company, unlocking the hidden value of unstructured data for enterprises through open-source and cloud-native solutions.
Zilliz built the Milvus vector database to accelerate the development of next-generation data platforms.
Milvus is a graduation project of the LF AI & Data Foundation. It manages large unstructured datasets and is widely used in new drug discovery, recommender systems, chatbots, and more.