AnalyticDB for MySQL: PB Cloud Warehouse Core Technology and Scenario Analysis


The 2009 Aliyun Summit and Shanghai Developers Conference opened on July 24. The Summit shares with future world developers the technological drinks in the fields of open source big data, IT infrastructure cloud, database, cloud prototype, Internet of Things, and so on, and jointly explores the trend of advanced science and technology. This paper summarizes the excellent speeches of Nan Xian, a senior technology expert of Aliyun Intelligence in the database special session. In order to share the core technology and application scenarios of AnalyticDB for MySQL, a PB level cloud data warehouse in Aliyun.

Database Special PPT Download

The content of this article is organized from the speech video and PPT.

From “BigData”to “FastData”

Now there’s a saying “From Big Data to FastData.” Why? In fact, BigData has been developing for many years, and has become or is about to become a capability available to all industries and developers. However, for BigData vendors, it is slowly found that their own development has encountered some bottlenecks, such as the merger of Cloudera and HORTONWORKS, there is some pressure on the stock price, and senior Hadoop players like MAPR are also looking for buyers.

AnalyticDB for MySQL: PB Cloud Warehouse Core Technology and Scenario Analysis

The reason for these problems is that demand is changing and the value exploration of data itself is changing. The most important point is that we only need to face a large amount of data, but now we need more real-time and intelligent calculation and analysis for the full amount of data. Therefore, the positioning of Aliyun Analytic DB is a data warehouse product that can realize real-time calculation and online data value.

Introduction to AnalyticDB for MySQL: PB-level Real-time Data Warehouse

The core idea of Analytic DB for MySQL (ADB), a PB-level real-time data warehouse, is to realize the online optimization of cost performance and data value. AnalyticDB’s function is to realize multi-table complex association analysis and millisecond retrieval of billions and trillions. In the benchmark evaluation environment of 10TB scale of TPC, the global official organization, AnalyticDB achieves the first performance in the world, and is at least five times ahead of other domestic manufacturers. Compared with open source Spark, its performance is better. It’s more than 10 times higher. At the same time, papers on the core technology of AnalyticDB were also published at the world’s top academic conference in the field of database.

AnalyticDB for MySQL: PB Cloud Warehouse Core Technology and Scenario Analysis

AnalyticDB for MySQL has been incubating within the Alibaba Group for many years and has accumulated a large number of users on the Alibaba cloud. Behind such a strong performance, Analytic DB has many core capabilities:

Compatibility & Beyond MySQLFirst, AnalyticDB is compatible with MySQL and goes beyond MySQL. Because MySQL is more able to solve many business problems in the TP scenario. It is very stable and popular. But in the analysis scenario, it is not enough to be compatible with MySQL. It also needs support for window functions, CTE functions and OLAP standards. AnalyticDB fully supports these functions and standards.

Cloud Native, Real-time On-Demand Extreme ElasticityAnalyticDB is a cloud native database, and can achieve extreme flexibility on demand in real time. AnalyticDB’s disks can be scaled flexibly, and the nodes of a single user cluster can be scaled from the smallest three to the largest 2000, and mixed load can be realized.

Analysis of Unstructured and Structured FusionAnalyticDB can realize unstructured and structured fusion analysis. The unstructured data here not only includes JSON, BLOB and array types in MySQL, but also supports AI retrieval vectors for face recognition. AnalyticDB supports five vector types and five vector retrieval algorithms.

Complete enterprise-level characteristicsAnalyticDB has complete enterprise features, such as backup and recovery, and other data tools such as recycling bin do not have the ability. At the same time, Analytic DB also provides high reliability assurance across AZ and Region.

AnalyticDB Core Technology: Layered Storage + Storage Computing Separation Architecture Brings Extreme Flexibility and Openness

As shown in the figure below, the front technology module of AnalyticDB is the interface layer, which is very compatible with MySQL protocol and can provide JDBC, ODBC, UDF and AI interfaces. The next layer is the optimizer, which is known to face great challenges in complex analysis scenarios. AnalyticDB also supports rule-based optimizer RBO, cost-based optimizer CBO and history-based learning optimizer HBO. The next layer is Xihe, a very large-scale distributed computing engine, which is based on open source code and has strong performance after many years of optimization and transformation by R&D team in Ali. The next layer is the AnalyticDB row-column hybrid storage engine, which has powerful storage and retrieval capabilities. At present, Analytic DB is based entirely on cloud infrastructure construction, including basic capabilities on cloud such as ECS, GPU, FPGA, OSS, etc.

AnalyticDB for MySQL: PB Cloud Warehouse Core Technology and Scenario Analysis

For the technical architecture of AnalyticDB, there are three roles from top to bottom. The top level is Frontnode, also known as Coordinator Coordinator Coordinator. It supports multiple Master linear extensions. When writing and querying are concurrent, it can achieve linear extensions, and can support extensions to hundreds of Frontnodes at most. Frontnode distributes data under the write process and writes data to different storage nodes. Worker in the middle layer is the computing node, which can be flexibly scaled, and is also the core of the separation of computing and storage. When the load arrives, the physical execution plan made by the upper parser is calculated. The source node reads the underlying storage. One of the most important concepts of storage is Group, which is explained here, such as MySQL’s master or three copies. For Analytic DB, there are multiple groups because they are distributed. It can be considered that data is partitioned horizontally to each group, and Raft protocol is used to ensure data consistency and real-time. Cost is also a very important consideration in large-scale cluster, so Analytic DB achieves hierarchical storage at the bottom level, that is, when data needs to be separated between hot and cold, it can put hot data on SSD and exchange cold data into or out of OSS in real time through hierarchical storage system.

AnalyticDB Core Technology – Intelligent “Row-Column Mixing + Full Index” Brings Extreme Performance

In addition to the distributed optimizer and ultra-large-scale high-performance execution engine mentioned above, the provision of extreme performance also depends on how much compression the storage engine itself can do for computing. For database systems, the most important reason why Analytic DB can achieve the first performance in the world is the optimization of the storage layer.

AnalyticDB for MySQL: PB Cloud Warehouse Core Technology and Scenario Analysis

Here we introduce two technical points of AnalyticDB in storage layer optimization, full index and column mixing. AnalyticDB implements a set of indexing frameworks to accommodate various scenarios. AnalyticDB’s indexing framework has the ability of multi-channel progressive stream merging. It can merge many indexing conditions and single table location conditions, and generate a row number. When you get this line number, you can filter out a large amount of data in the storage layer. After the data filtering is completed, you just need to hand it to the computing engine for aggregation and other computing tasks. AnalyticDB supports many different types of indexes, inverted index and hash index for string type data, multi-dimensional KD tree index for digital type, JSON index, bitmap index and unstructured vector index.

Another reason why Analytic DB has high performance is that it uses mixed row and column storage. As we all know, the row storage used in MySQL and other databases is suitable for updating, but the disadvantage is that it is not suitable for analysis and calculation, because analysis and calculation usually carries out statistical analysis or aggregation operations on some columns of a large table with more than 100 columns, and row storage needs to read out all the data of more than 100 columns of each row. The enlargement of columns is very obvious. The advantage of column storage is that, in addition to the high compression rate for the same Schema column, it is naturally suitable for filtering in computing.

The mixed storage of rows and columns designed by AnalyticDB is shown in the figure above. Header files contain metadata such as Sum, Count, Max, Min, etc. These statistics are used to filter indexes, conditions, and blocks. There are many different blocks inside each file, and each block can be seen as a Row Group, which is stored internally as a Column. Therefore, when scanning is needed, data can be quickly retrieved once it is located in which Row Group. The reason why Row Groups are designed to be Row Groups rather than whole files is that detailed data is often needed. If stored in columns, data of all rows needs to be spelled out. This can be very expensive, and the two scenarios can be dynamically compatible by dynamically adjusting the size of different Row Groups.

AnalyticDB Core Technology: A System Set Storage Considering Multiple Scenarios

Analytic DB has powerful performance, besides the flexibility of its architecture, it also has a storage system that can take into account a variety of scenarios. It can be compatible with multi-dimensional analysis, detailed query, real-time writing and other scenarios. There are three key technologies behind it, including the column mixing mentioned above and the mixed load. AnalyticDB supports mixed load management with high concurrency and low latency. In addition, it supports a very large-scale fusion computing engine.

AnalyticDB for MySQL: PB Cloud Warehouse Core Technology and Scenario Analysis

AnalyticDB for MySQL: Typical scenarios and customer value

AnalyticDB supports the synchronization of large data, application queues, MySQL, Oracle data to AnalyticDB through DTS and other data synchronization tools for data analysis. In the upper layer, different visual reports and user APPs can be accessed. To take an internal case, Ali’s mother has a targeted marketing platform called Damopan. There are more than 10,000 tables on the Damo disk. The total data volume is more than 100 TB. The average query size is Join of 10 tables. With Analytic DB, the Damo disk can support thousands of column dimensions screening. It can achieve a peak write speed of 1.2 million per second. The QPS of Join in more than 10 tables can reach 50.

AnalyticDB for MySQL: PB Cloud Warehouse Core Technology and Scenario Analysis

Data Lake Analytics: Global Data, Open Analysis

Aliyun’s DataLake Analytics is Serverless’s open analytics product, with no data storage, and therefore very low cost. If the user’s data is in OSS or a storage engine, it can be calculated directly without moving the data. Because of the Serverless design concept, it has strong flexibility and low cost.

AnalyticDB for MySQL: PB Cloud Warehouse Core Technology and Scenario Analysis

Data Lake Analytics and AnalyticDB are complementary relationships. AnalyticDB belongs to data warehouse, while Data Lake Analytics can perform fusion analysis. Data Lake Analytics not only inherits AnalyticDB’s spoon and computing engine, but also integrates Spark, and its bottom layer implements Serverless based on K8S.

AnalyticDB for MySQL: PB Cloud Warehouse Core Technology and Scenario Analysis

Next, let’s share two specific cases.

Customer Case 1: Mobile Operations APP Solution

Here, take the other camera as an example to share the mobile operation APP solution. Tens of billions of data from other cameras are stored in MySQL. Previously, NoSQL databases such as MongoDB were used for data storage, but it was found that the analysis performance was unacceptable. At that time, it took at least 40 minutes for each report to be completed. After using AnalyticDB to rebuild the system architecture, we can find a significant improvement in user statistics, activity effects and behavior analysis, and achieve a report construction span from 40 minutes to seconds.

AnalyticDB for MySQL: PB Cloud Warehouse Core Technology and Scenario Analysis

Customer Case 2: Real-time Warehouse in Logistics Industry

The second case is to share the real-time warehouse plan of the logistics industry with the example of the company delivering four parties in cross-border logistics. For the data architecture of four parties, business processing is implemented through MySQL, and data is synchronized to AnalyticDB through DTS. For log data, agent is used to manage, message queue Kafka is used to collect data, and internal program is used to inject the collected data into AnalyticDB. The front end connects QuickBI and DataV of Aliyun to realize data screen and report. It takes only one and a half months to complete the construction of the real-time data warehouse.

AnalyticDB for MySQL: PB Cloud Warehouse Core Technology and Scenario Analysis

AnalyticDB for MySQL: Customers are in all walks of life

AnalyticDB for MySQL has customers in all walks of life, such as the Internet, new retail, digital government, finance and public security, taxation, power grid, political law, etc. These industries are widely using Aliyun’s AnlyticDB.

AnalyticDB for MySQL: PB Cloud Warehouse Core Technology and Scenario Analysis

AnalyticDB for MySQL Version 3.0: Next Generation Cloud Native OLAP Products

The new version of Analytic DB for MySQL 3.0 created by the Aliyun database team has been developed and is currently in public beta.

AnalyticDB for MySQL: PB Cloud Warehouse Core Technology and Scenario Analysis

Compared with previous versions, Analytic DB for MySQL version 3.0 mainly improves in the following four aspects

Easy to useCompared with version 2.0, users can feel more database features.
Analytic DB version 3.0 integrates the computing power, analytical power and database experience of large data on a very large scale. Users can feel that Analytic DB 3.0 is a MySQL with super computing power and super fast computing speed. It supports up to 256 DBs and greatly improves compatibility. In Alibaba’s internal tests, the compatibility is as high as 99.99%, and it can be instantly visible when writing data.

Higher performanceReal-time write performance of Analytic DB version 3.0 has improved by 1.5 times, query performance has increased by 40% on the basis of the original world number one, and it is expected that the list will continue to be refreshed in the future. AnalyticDB supports batch data import from MaxCompute, Hadoop, OSS and MySQL databases to AnalyticDB, and achieves data import speed at the TB level per hour.

More flexibleAnalyticDB version 3.0 has higher flexibility. Disk storage space can be flexibly scaled. For example, a node can be expanded from 100G to 1TB or even larger. At the same time, disk space can be scaled up and down vertically, and the number of nodes can also be scaled up and down arbitrarily.

More reliableAnalyticDB 3.0 has a fully aligned privilege system with MySQL, which supports Library-level privileges, table-level privileges and column-level privileges. And the data storage uses three copies, reaching the industrial level of security, but also has complete data backup and recovery capabilities.

Author: Visitors be77vkb76molw

Read the original text

This article is the original content of Yunqi Community, which can not be reproduced without permission.

Recommended Today

Hadoop MapReduce Spark Configuration Item

Scope of application The configuration items covered in this article are mainly for Hadoop 2.x and Spark 2.x. MapReduce Official documents…Lower left corner: mapred-default.xml Examples of configuration items name value description mapreduce.job.reduce.slowstart.completedmaps 0.05 Resource requests for Reduce Task will not be made until the percentage of Map Task completed reaches that value. mapreduce.output.fileoutputformat.compress false […]