Interpretation of 100 billion level OLAP engine

Time:2021-2-26

Introduction:Data bank is a commercial data product operated by brand consumers. Because its core analysis ability needs to realize free analysis in any dimension and strong demand for response time on massive data, we use analyticdb as the underlying analysis engine on a large scale. Finally, with low cost and excellent performance, we support millions of OLAP queries every day during the promotion period of tens of thousands of brand businesses Inquiry.

Author: senior technical expert of data technology and products Department of Yang ran

preface

data bankIt is a commercial data product operated by brand consumers. Due to its core analysis ability, we need to realize free analysis in any dimension on massive data and strong demand on response time. We use analyticdb as the underlying analysis engine on a large scale, and finally support millions of OLAP queries every day during the promotion period of tens of thousands of brand businesses with low cost and excellent performance.

At present, the data bank stores tens of trillions of data in analyticdb, occupying about 1.6p of storage space, and the average query response time is less than 5 seconds.

Introduction to data banking

As a commercial data product operated by consumers, data bank provides many data capabilities such as link flow analysis, crowd selection, crowd portrait, etc.

Link flow analysis

AIPL is a unique indicator of data bank, which is used to measure the relationship between brand and consumer (AIPL is the abbreviation of four stages, namely a cognition, I interest, P purchase and l loyalty). Link flow analysis is used to obtain the change of AIPL relationship between consumers in any two days of a brand (as shown in the figure below, the change of AIPL of a brand from last year’s double 11 to this year’s double 11 is untrue Data).
Interpretation of 100 billion level OLAP engine

In this scenario, the user can select any two dates in the past 540 days, plus the two dimensions of brand and category. The possible input of the user isMillion billion level

Crowd portrait

Crowd portrait is the core ability of consumers to operate products. In addition to the specific crowd of users, data bank can also draw the crowd of link flow to help the brand analyze the reasons for the change of consumer relationship (as shown in the figure below, a brand’s “double 11” last year was in purchase status, but this year’s “double 11” is in loss status, not real data )。
Interpretation of 100 billion level OLAP engine

In this scenario, data bank provides users with more than 200 tags, most of which are industry-related. Each crowd portrait only involves some tags. If all tags are pre calculated for the crowd, it will lead to a great waste of resources.

Crowd circle count / crowd circle

Crowd selection is the core competence of consumer operation products. Compared with most consumer operation products, users can only use tag data. Data bank crowd selection (minute level) allows users to use tags, contacts (which can be understood as consumer behavior, such as purchasing, searching, watching live broadcast, etc.) and other kinds of data. At the same time, users can also view the consumption under the condition of circle selection in real time The number of participants (in seconds).
Interpretation of 100 billion level OLAP engine

In this scenario, all kinds of circle selection conditions can be freely combined through the intersection, union and difference. At the same time, some circle selection conditions, such as the purchase amount, are numerical values that users can fill in and cannot be enumerated.

Why data banks choose analytical DB

For ordinary analysis business, if there is no requirement for response time, offline computing (Hadoop / hive / maxcompute) can meet almost all the needs of data analysis. However, from the perspective of users’ online response, high-frequency functions have strong demand for response time.

For example: user decision needs a lot of timeCrowd portraitThe choice of the next portrait depends on the result of the previous one. If offline computing is used, it will not only greatly prolong the decision-making time of users, but also interrupt the continuity of users’ analytical thinking, which will have a great impact on the user experience.

There are two ways to solve the problem of response time

  1. Precomputation, which calculates the indicators under the combination of all the optional dimensions of the user offline. When the user is analyzing, the system directly goes to the database to get the results.
  2. OLAP online computing stores the lightly aggregated data (all user optional dimensions) in the MPP engine, and calculates the indicators immediately according to the conditions submitted by users.

These two ideas have their own characteristics. Precomputing needs to consider a series of problems, such as dimension explosion, offline precomputing can not be completed in a limited time, or the resource waste caused by the result of precomputing not being used due to the change of demand. OLAP can provide free computing in any dimension without pre computing, but it also needs to consider the storage cost, capacity and computing performance of MPP engine.

In general, as a data product for consumer operation, it has a strong demand for response time, which is not suitable for the use of pre computing; at the same time, because of the huge amount of data (tens of trillions, Pb level), the overall cost is also an important consideration.

OLAP engine selection

OLAP of data bank has several challenges:

  • Data volume challenge:The amount of original data is very large, with a total of 1.6pb of historical data and 22 trillion records, of which more than 10 trillion tables are stored in analyticdb.
  • Data writing performance challenges:60 billion rows of records are added in a single day, with a total size of 10TB. The baseline task is imported within two hours from 7:00 to 9:00 every morning, and the import speed is required to be at least 10 million TPS;
  • Performance challenges for complex queries:The query type is complex, and most of them are large complex interaction analysis. The 2 trillion level large table is filtered and then associated with 8 tables, which needs to be returned in 10 seconds.
  • Export performance challenges:Fast result set export, business needs to analyze the selected crowd to do data export. We need to be able to export the results of analyticdb to maxcompute conveniently and efficiently. And it is required to be able to support more than 20 million level result export tasks per minute.
  • Cost challenge:Considering the Pb level data, it is too expensive to store all the data in SSD, so we hope that the system can have the hierarchical storage capacity of hot and cold data, and realize the online performance with the price close to offline.
  • Stability challenges:For complex workload, in real online scenarios, the three challenges mentioned above will appear at the same time, which requires the system to run smoothly under such complex workload.

Through preliminary technical research and testing, analyticdb is selected as the basic platform for data banking business analysis, as follows:

1. Cold and hot data layering capability
The enterprise level feature of data separation provided by analyticdb can greatly improve the cost performance of data storage. Hot table (stored in ESSD), cold table (stored in OSS) and temperature table (mixed mode, some exist in ESSD and some are stored in OSS) can be selected according to table granularity. Customers can freely specify according to business requirements, and hot and cold policies can be arbitrarily converted. For users, it is a storage, a set of syntax, and easy to realize joint query. The scenarios we use are mainly cold tables, and analyticdb has SSD cache for cold tables to speed up, reduce costs while taking into account performance. Data bank stores tens of trillions of data in analyticdb, occupying about 2p of storage space. It has become the core data warehouse storage of data bank, and is expected to continue to grow in the future.
Interpretation of 100 billion level OLAP engine
**
2. High throughput real-time writing
The bottom layer of analyticdb adopts the distributed parallel architecture, which achieves high write / import throughput. Massive data can be directly written in tens of millions or even hundreds of millions of TPS in real time. At the same time, for offline aggregated tables, analyticdb provides a direct batch load mode, which can directly load data with high throughput for online query.
Interpretation of 100 billion level OLAP engineInterpretation of 100 billion level OLAP engine

  1. Powerful real-time computing capability with high concurrency and low latency

Under the three business query scenarios, there is a strong demand for response time. Most of the queries are aggregation and connection queries after the circle selection of trillions of tables. Analyticdb uses cold data caching, preheating and other technologies to make the average response time of these queries less than 10 seconds.

Data model and table design

Data bank mainly stores four types of data in analyticdb
1. AIPL data, that is, the relationship between brands and consumers
2. Label data, that is, consumer attributes
3. Contact data, that is, consumer behavior
4. Crowd data, that is, the data bank users in the data bank precipitation crowd

Since the analysis object of all scenarios is consumer ID, most tables use consumer ID as the distribution key, which can avoid data shuffle (redistribution) in the query process. The following mainly introduces the table design of AIPL and label. The table design of contact point is similar to that of crowd, so it will not be repeated.

1. AIPL data

AIPL data is partitioned by day, but due to the large amount of data generated every day (more than 50 billion), even if the batch import performance of analyticdb is outstanding, it still needs a long time to import. Considering that analyticdb does not support multi-level partitioning, we split the AIPL table from the brand dimension into 20 sub tables, which can improve the import performance on the one hand and the query performance on the other.

In addition to providing AIPL analysis in the brand dimension, data banks can also drill down to the secondary category dimension. To support the secondary category, there are two schemes:
1. Extending the secondary category dimension on the original AIPL table
2. Add a new set of AIPL tables containing secondary category dimensions

The first scheme saves more storage space and only needs to import a set of tables; the second scheme has better query performance.

After verification, we use the second scheme. The query without secondary category accounts for a large proportion in data bank. When the query does not contain secondary category, the first scheme requires group by consumer ID, which will occupy a large amount of memory in the execution process, with low concurrency and poor performance. Thanks to the lower storage cost of analyticdb, the use of the second solution did not lead to a substantial increase in cost.
At the same time, because the brand ID is a necessary dimension for query, and the AIPL status is a frequently used dimension, setting the brand ID and AIPL status as aggregate columns can effectively reduce query IO and improve query performance.

The table structure of the two groups of tables is as follows:

_ --AIPL table without secondary category dimension_
CREATE TABLE `aipl_[001-020]` (
  `customer_id` bigint,
  `brand_id` bigint,
  `aipl_status` int,
  `day` bigint
)
DISTRIBUTE BY HASH(`customer_id`)
PARTITION BY VALUE(day)
CLUSTERED BY (`brand_id`,`aipl_status`)

_ --AIPL table with secondary category dimension_
CREATE TABLE `aipl_cate_[001-020]` (
  `customer_id` bigint,
  `brand_id` bigint,
  `cate_id` bigint,
  `aipl_status` int,
  `day` bigint
)
DISTRIBUTE BY HASH(`customer_id`)
PARTITION BY VALUE(day)
CLUSTERED BY (`brand_id`,`cate_id`, `aipl_status`)

2. Label

In general, due to the existence of multi value tags (for example, a consumer can have multiple interests), the tag table will be designed as kV structure, as follows (schematic diagram)

CREATE TABLE `tag` (
  `customer_id` bigint,
  `tag_key` int,
  `tag_value` int
)
DISTRIBUTE BY HASH(`customer_id`)

However, data bank can select multiple tags for merging and subtracting in crowd selection. The use of kV structure will result in the merging and subtracting of consumer ID sets with multiple tag values in memory, and the performance is poor.

Using the multivalued column (multivalue / or JSON column) of analyticdb, the table structure of the label table of data bank is as follows:

CREATE TABLE `tag` (
  `customer_id` bigint,
  `tag1` int,
  `tag2` int,
  `tag3` multivalue,
  ......
)
DISTRIBUTE BY HASH(`customer_id`)

The intersection and union of multiple tags will be converted into the and / or relationship of the tag table at the bottom. However, due to the large number of tags, the table with more than 200 columns is not only slow in import performance, but also filled with more null values (customer)_ If Id has no value under a certain tag, it will fill in a specific value), which leads to a more severe data expansion. Therefore, similar to the split of AIPL, the tag table is also split into more than ten tables according to different topics.

Crowd selection accelerates

Data bank will solidify the crowd selected by users (that is, save the consumer ID list) for subsequent operation. Because crowd selection will involve dozens of sub queries, the time of circle selection is long, and the intermediate results may be large. Therefore, data bank will split a circle selection into multiple query slices, send them to ADB for execution, and finally delete the query results of each slice User ID list) in ETL to complete the crowd circle.

The whole crowd selection process makes full use of the index ability of ADB and the ability of offline mixed load, which can not only speed up the crowd selection, but also improve the utilization rate of the overall resources, especially for queries with high conditional screening rate (such as crowd selection involving AIPL). At the same time, the original cloud elasticity of ADB can easily cope with the peak pressure of double 11.

Business value

Generally speaking, the business value of analyticdb to data bank is mainly reflected in the following aspects:
1. High performance OLAP engine:Under the background of data bank’s 22 trillion pieces of data, occupying 1.6p storage space, it realizes the average 3-5 seconds of query response, supports the second level OLAP product implementation of data bank, and provides users with flexible and efficient analysis tools.
2. Significant cost reduction:After using the hot and cold hierarchical storage + pay as you go mode of analyticdb, the cost of data usage is greatly reduced while ensuring the performance. Compared with the previous generation version, the cost is reduced by about 46%.
3. Flexibility to deal with big promotion,Data bank is based on the mixed crowd circle selection mode implemented by analyticdb. During the big promotion period, using the cloud native flexibility of analyticdb, it can realize the rapid expansion of resources to cope with the peak.

Original link
This article is the original content of Alibaba cloud and cannot be reproduced without permission.

Recommended Today

How can spring boot quickly configure multiple redis data sources

brief introduction The main application scenario of redis multiple data sources is that multiple data sources need to be usedredisServer or use multipleredisLibrary, this paper usesfastdepRely on the integration framework for rapid integrationRedisMultiple data sources and integrationlettuceThe connection pool only needs to be created after the dependency is introducedyamlConfigure the connection information of multiple data […]