Summary: In the just-concluded 2020 Tmall Double 11, the cloud-native real-time data warehouse built on MaxCompute interactive analysis (hereinafter referred to as Hologres) + real-time computing Flink was deployed in core data scenarios for the first time, setting a new record for the big data platform. To mark the occasion, we are publishing a series of articles on the cloud-native real-time data warehouse at Double 11. This article focuses on the best practices of Hologres in Alibaba's Taobao marketing activity analysis scenario, and describes the technical tests behind the first production use of Flink + Hologres stream-batch integration on Alibaba's Double 11 marketing analysis dashboard.
In the business operations of the Taobao family (Taoxi), big promotions are a key scenario for both operations and user growth. The marketing activity analysis product is the core data product used to support decisions and guide operations during big promotions. It covers analysis of the whole chain before, during and after an activity, and must meet the different data timeliness and flexibility requirements of different roles at different stages. The overall product picture is as follows:
The old version of marketing activity analysis was built on the conventional architecture of separate real-time and offline data systems plus the FW product architecture. Previous activities exposed many problems in it, of which three are core:
- Inconsistency between real-time and offline data: real-time and offline data with the same metric definition are inconsistent, in both data logic and data interfaces. Because real-time and offline data are developed separately (different developers, different interfaces), the operation and maintenance cost of the data rises and the burden of product construction grows considerably.
- High maintenance cost: as business volume grows, the original databases cannot support complex and changing application scenarios quickly and flexibly. Conventional databases such as HBase, MySQL and ADB each cover only one need, such as massive data storage, high-concurrency point queries or OLAP queries. Extremely complex business therefore has to rely on several databases at once, which makes overall maintenance and dependency costs very high.
- Poor scalability: under the FW framework, product building logic is complex, scalability is poor, and the maintenance cost during an activity is very high.
Therefore, responding quickly to frequently changing business demands and handling data problems during activities more efficiently has become increasingly important. The upgraded, new generation of marketing activity analysis architecture needs to meet the following requirements:
1. The real-time and offline data warehouses unify the model (one logic for real-time and offline) and the interface (data storage and data retrieval), truly achieving stream-batch integration.
2. A more powerful data warehouse that supports both concurrent writes and queries on massive data and timely queries for the business.
3. Simpler product building logic and lower product implementation complexity.
Given this background, we needed to rebuild the existing architecture and find alternative products to relieve the business pain points. After a long period of research and trials, we chose a technical solution based on real-time computing Flink + Hologres + FBI (a visual analysis tool inside Alibaba) to rebuild the architecture of Tmall marketing activity analysis.
2、 Stream-batch integration technical solution
Through in-depth analysis of business data requirements, exploration of data models from several angles and research into data warehouses, we settled on the overall technical framework for rebuilding the marketing activity analysis product, as shown in the figure below:
- The stream-batch integrated architecture unifies stream and batch at the level of SQL logic and computing engine.
- Hologres unifies data storage and query.
- FBI's product capabilities reduce construction cost, give the business high flexibility, and meet the reporting needs of different roles.
Next, we introduce the core components of the solution in detail: stream-batch integration, Hologres and FBI.
1. Stream-batch integration technical framework
The structure of the traditional data warehouse is shown in the figure below. Its core problems are as follows:
- The storage layer is split between stream and batch: clusters, tables and fields are all separate, so the application layer has to write different access logic for each.
- Processing logic cannot be reused between stream and batch: the SQL standards differ and so do the computing engines, so real-time and offline pipelines have to be developed separately. In many cases the logic is similar, but it cannot be converted flexibly between the two systems, which leads to duplicated work.
- The computing clusters are separate, and real-time and offline workloads peak at different times, which leads to low resource utilization with obvious peaks and troughs.
The stream-batch integrated data warehouse architecture is shown in the figure below. The upgraded architecture has the following core points:
- First, although the DWD layer of the data warehouse uses different storage media for real-time and offline data, the data models must be equivalent. They are then wrapped in a logical table (one logical table maps to two physical tables, the real-time DWD and the offline DWD), and data computing code is written against the logical table.
- Second, developing code against logical tables, configuring the stream and batch computing modes individually and applying different scheduling strategies all require a development platform (the Dataphin unified stream-batch development platform) as support, so that development, operation and maintenance form one convenient workflow.
- Finally, unifying the storage layer under the OneData specification means unifying not only the model specification but also the storage medium, so that the two sides connect seamlessly.
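The logical-table idea in the first point can be illustrated with a minimal Python sketch. It is not Dataphin's implementation: the class, the `render` helper and all table names (`dwd_orders`, `rt_dwd_orders`, `batch_dwd_orders`) are hypothetical and exist only to show how one piece of SQL written against a logical table could be bound to either physical table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogicalTable:
    """One logical DWD table backed by two physical tables (names are hypothetical)."""
    name: str
    realtime_table: str  # physical real-time DWD table
    offline_table: str   # physical offline DWD table

    def resolve(self, mode: str) -> str:
        """Return the physical table for the given computing mode."""
        if mode == "stream":
            return self.realtime_table
        if mode == "batch":
            return self.offline_table
        raise ValueError(f"unknown mode: {mode!r}")


def render(sql_template: str, table: LogicalTable, mode: str) -> str:
    """Bind SQL written against the logical table to one physical table."""
    placeholder = "{" + table.name + "}"
    return sql_template.replace(placeholder, table.resolve(mode))


ORDERS = LogicalTable("dwd_orders", "rt_dwd_orders", "batch_dwd_orders")

# The business logic is written once, against the logical table.
SQL = "SELECT item_id, SUM(amount) AS gmv FROM {dwd_orders} GROUP BY item_id"

stream_sql = render(SQL, ORDERS, "stream")
batch_sql = render(SQL, ORDERS, "batch")
print(stream_sql)
print(batch_sql)
```

Because only the table binding differs between the two rendered statements, the real-time and offline results share the same logic by construction, which is the property the equivalent data models are meant to guarantee.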
In this year's Double 11, the traffic peak processed by real-time computing Flink reached a record 4 billion records per second, and the data volume reached 7 TB per second. The Flink-based stream-batch integrated data application made its debut in the marketing activity analysis scenario and withstood strict production tests of stability, performance and efficiency.
Flink streaming and Flink batch tasks as a whole showed strong stability during the activity, with no problems in link capacity, single machines, network bandwidth or anything else throughout.
2. Hologres: putting stream-batch integration into production
The stream-batch integrated data architecture unifies the data layer as a whole, and it needs one product to unify storage as well. That product must support high-concurrency writes, timely queries and OLAP analysis.
In the old architecture, each page module queried one or more databases, such as MySQL, HBase and ADB 3.0 (the old version of HybridDB). Because HBase offers high-concurrency writes and high-performance point queries, most real-time data was placed in HBase. Because MySQL tables are convenient to manage and easy to query, dimension table data and offline data were usually stored in MySQL. In addition, some modules involve data that is small in volume but has many dimensions, such as marketing play data, and for these ADB was chosen as the database for OLAP multidimensional analysis. This led to two pain points: real-time and offline data were separated, and managing multiple databases and instances was messy.
The new marketing activity analysis product has two goals. One is unified storage, to reduce operation and maintenance cost and improve R&D efficiency. The other is high performance, high stability and low cost.
After benchmarking various products, we chose Hologres as the unified storage product for the whole of marketing activity analysis. As a one-stop real-time data warehouse compatible with the PostgreSQL 11 protocol, Hologres connects seamlessly with the big data ecosystem, supports PB-scale data analysis and processing with high concurrency and low latency, and lets existing BI tools be used easily and economically for multidimensional analysis and business exploration. In such a complex business scenario, these advantages stand out.
Through in-depth analysis of the three modules of marketing activity analysis, combined with the business side's requirements for data timeliness, we defined a specific real-time pipeline for the data of each module:
- For core modules such as live broadcast, pre-sale, add-to-cart and traffic monitoring, we use Hologres's real-time point query capability.
- For complex and changing marketing scenarios, we use Hologres's real-time OLAP query capability.
To provide the point query and OLAP analysis capabilities that marketing activity analysis requires, Tmall marketing activity analysis set up two databases, dt_camp and dt_camp_olap. The dt_camp point query database has to keep some historical activity data for a long time for comparison, so its overall data volume is nearly 40 TB. The marketing play OLAP database stores some detailed play data, and its overall data volume is nearly 100 TB. Because marketing play data as a whole must be highly accurate, no lossy approximate query methods are used, which places higher demands on the query performance of the whole data warehouse.
To improve the overall performance of Hologres, we applied the following optimization strategies to the marketing activity analysis data warehouse:
- Set the distribution key: for count(distinct user_id) queries, set user_id as the distribution key so that count distinct is computed inside each shard in Hologres. This avoids a large amount of data shuffling and greatly improves query performance.
- Minimize count distinct: rewrite the SQL into multi-level group by operations to reduce the cost of count distinct.
- Shard pruning: in some scenarios a query specifies some of the keys in a table's PK. If that key combination is set as the distribution key, the shards the query will hit can be determined while the query is being processed, which reduces the number of RPC requests. This is very important for high-QPS scenarios.
- Generate the optimal plan: marketing activity analysis includes point or range queries on summary data, OLAP queries on raw data, and top-N queries after single-table aggregation. For each query type, Hologres can generate the optimal execution plan from the collected statistics to guarantee query QPS and latency.
- Write optimization: writes in marketing activity analysis are update operations on column-store tables. In Hologres, an update looks up the uniqueID for the specified PK, finds the corresponding record by that uniqueID and marks it as deleted, and then writes a new record. If an increasing segment key can be set, the lookup can use the segment key to locate the file quickly, which speeds up locating the record by PK and improves write performance. The write peak of the marketing activity analysis system can reach 8 million updates per second (800W/s).
- Small file merging: for some tables that are not written very frequently, the keys updated over a period of time are relatively fixed, so each flush of the memory table produces a relatively small file. Hologres's default compaction policy does not compact these files, which leaves a large number of small files. Tuning the compaction parameters to compact more often and reduce the number of small files significantly improves query performance.
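The first three strategies in the list above rest on the same property of hash distribution, which a small Python simulation can make concrete. This is only a sketch under stated assumptions, not how Hologres is implemented: the shard count, the CRC32 hash, the function names and the sample rows are all illustrative.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 4  # illustrative value, not a Hologres default


def shard_of(key, num_shards=NUM_SHARDS):
    """Place a key on a shard. CRC32 stands in for the engine's real hash."""
    return zlib.crc32(str(key).encode("utf-8")) % num_shards


def distribute(rows, key_field):
    """Group rows by the shard that their distribution-key value hashes to."""
    shards = defaultdict(list)
    for row in rows:
        shards[shard_of(row[key_field])].append(row)
    return shards


def count_distinct_by_shard(shards, field):
    """Sum of per-shard distinct counts.

    Correct only when `field` is the distribution key: each value then lives
    in exactly one shard, so local counts add up and no shuffle is needed.
    """
    return sum(len({row[field] for row in shard_rows}) for shard_rows in shards.values())


def count_distinct_two_level(rows, group_field, distinct_field):
    """COUNT(DISTINCT x) per group, rewritten as two group-by levels."""
    # Level 1: GROUP BY (group, x) removes duplicates.
    pairs = {(row[group_field], row[distinct_field]) for row in rows}
    # Level 2: GROUP BY group and count the surviving rows.
    counts = defaultdict(int)
    for group, _ in pairs:
        counts[group] += 1
    return dict(counts)


def shards_to_scan(key_value):
    """Shard pruning: an equality filter on the distribution key hits one shard."""
    return [shard_of(key_value)]


rows = [
    {"user_id": f"u{i % 7}", "channel": "live" if i % 2 == 0 else "presale"}
    for i in range(50)
]
shards = distribute(rows, "user_id")
print(count_distinct_by_shard(shards, "user_id"))
print(count_distinct_two_level(rows, "channel", "user_id"))
print(shards_to_scan("u3"))
```

If the rows were distributed by a different field, the same user could appear in several shards and the per-shard counts would overcount, which is why the choice of distribution key matters for count distinct.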
Hologres performance during Double 11: in the point query scenario, the write peak was several hundred thousand records per second and the service capacity was several million per second; in the OLAP scenario, the write peak was 4 million per second (400W/s) and the service capacity was 5 million per second (500W/s). At the same time, almost 99.7% of individual point queries and OLAP queries returned within milliseconds. Throughout the activity, Hologres was very stable overall and supported fast point queries and OLAP analysis at the same time.
3. FBI analysis dashboard
As the preferred data visualization platform in the Alibaba ecosystem, FBI can quickly build all kinds of reports for data analysis, supports fast access to and extension of multiple data sets, and offers advanced functions for building analytical data products.
In the core process of building products on FBI, four core functions greatly reduce the construction cost:
1) The "real-time / hour / minute model", which integrates real-time and offline data, automatically produces accurate trends and comparisons for real-time data
For the stream-batch integrated underlying data defined by marketing activities, and to give users flexibility in analyzing real-time data, real-time comparisons and hourly comparisons, FBI abstracts a standard data model that integrates real-time and offline data. Once the model is created, real-time data can be compared accurately, minute-level trend analysis is routed automatically to the minute table, and hourly trends are routed directly to the hour table.
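A rough Python sketch of the routing behaviour described above follows. FBI is an internal tool, so none of this reflects its actual API: the table names and function names are hypothetical, and reading "accurate comparison" as "compare against the same minute of an earlier day" is an assumption.

```python
from datetime import datetime, timedelta

# Hypothetical physical tables behind one real-time/hour/minute model.
TABLES = {
    "minute": "ads_camp_trend_minute",
    "hour": "ads_camp_trend_hour",
}


def route_table(granularity: str) -> str:
    """Pick the physical table for a trend query by its time granularity."""
    try:
        return TABLES[granularity]
    except KeyError:
        raise ValueError(f"unsupported granularity: {granularity!r}") from None


def aligned_comparison_point(now: datetime, days_back: int = 1) -> datetime:
    """Return the same minute on an earlier day.

    Assumption: a running real-time total is compared with the baseline's
    total at the same time of day, not with the baseline's full-day total.
    """
    return now.replace(second=0, microsecond=0) - timedelta(days=days_back)


now = datetime(2020, 11, 11, 0, 30, 45)
print(route_table("minute"))
print(route_table("hour"))
print(aligned_comparison_point(now))
```

Keeping the routing inside the model means a report only declares the granularity it needs, and the choice of table stays out of the report definition.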
2) FBI's original FAX functions: minimal definitions that output a variety of complex indicators
Complex indicators such as channel share, category share, year-on-year contribution and cumulative activity turnover were defined in SQL in the previous version. This made the SQL long and greatly reduced the stability and maintainability of the product. To solve this, FBI built an analysis DSL that is easy to learn and understand, called FAX functions (20+ analysis functions such as year-on-year difference, contribution rate and activity accumulation). A single short statement can define the complex indicators used in marketing activity analysis.
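The indicators named above can be written as small reusable functions, which is roughly what an analysis DSL saves the report author from spelling out in SQL. The Python sketch below is not the FAX syntax, which the article does not show. The function names are hypothetical, and the contribution-rate formula (a part's change divided by the total change) is one common definition assumed here for illustration.

```python
from itertools import accumulate


def yoy_difference(current, previous):
    """Year-on-year difference: this period minus the same period last year."""
    return current - previous


def share(parts):
    """Share of each part in the total, e.g. channel share or category share."""
    total = sum(parts.values())
    if total == 0:
        raise ValueError("total is zero, share is undefined")
    return {name: value / total for name, value in parts.items()}


def contribution_rate(current, previous):
    """Each part's change divided by the total change (assumed definition)."""
    deltas = {name: current[name] - previous.get(name, 0) for name in current}
    total_delta = sum(deltas.values())
    if total_delta == 0:
        raise ValueError("total change is zero, contribution is undefined")
    return {name: delta / total_delta for name, delta in deltas.items()}


def cumulative(values):
    """Running total over the activity, e.g. cumulative turnover by day."""
    return list(accumulate(values))


# Made-up numbers, for illustration only.
this_year = {"live": 30, "search": 90}
last_year = {"live": 20, "search": 60}

print(yoy_difference(sum(this_year.values()), sum(last_year.values())))
print(share(this_year))
print(contribution_rate(this_year, last_year))
print(cumulative([10, 20, 5]))
```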
3) Configurable analysis capabilities and plug-ins for proprietary logic greatly reduce page construction time
Page construction is a core step in building a product. How can user configuration effort be reduced? FBI's approach is as follows:
a. Configurable general analysis capabilities: the most common analysis scenarios, such as cross tables, activity comparison and passing date variables as parameters, are abstracted into simple configuration items that complete the corresponding same-period comparison and year-on-year difference analysis.
b. Plug-ins for proprietary logic: customizations that act on blocks, such as activity parameters, show/hide and result sorting, can be covered by data plug-ins.
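4) Build a high-assurance system for FBI, upgrading release control, monitoring and early warning, change notification and so on, to support the 1-5-10 target (detect a problem within 1 minute, locate it within 5 minutes and recover within 10 minutes)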
3、 Safeguards from the testing side
To further guarantee the product quality of marketing activity analysis, the testing side carried out strict data comparison and verification from detail data to summary data to the product side, and set up all-round monitoring of the core data of Datong.
During the activity, these testing and inspection functions greatly improved the ability to find data problems proactively and to catch core problems in time, and so greatly improved the quality and stability of the whole data product.
4、 Business feedback & value
During the whole Double 11 period, the marketing activity analysis product based on real-time computing Flink + Hologres stream-batch integration supported high-frequency access from the Tmall business group and Xiaoer (Alibaba operations staff) at thousands of PV per capita, and achieved the goal of zero P1/P2 faults. Compared with previous years, the product also showed several advantages during the activity:
- Rich: real-time data is used widely in the marketing activity analysis product. The core dimensions can be drilled down to multiple dimensions, such as activity products and business tag tiers. Real-time data at the business and product dimensions was also added for add-to-cart and pre-sale, which supports business BD better.
- Stable: thanks to the consistently high and stable output of Hologres, both real-time data writing and data reading were very stable throughout Double 11. The engineering side monitored user access and data response efficiency in real time and analyzed and resolved business problems as they arose. Product inspection covered the core data of the product, which further ensured the stability of the whole product.
- Efficient: the use of stream-batch integration technology and the unified connection to Hologres greatly improved the efficiency of taking on requirements during the activity (the overall requirement capacity during this year's Double 11 was three times last year's) and improved the timeliness of problem feedback and resolution overall (3-4 times that of previous activities).
5、 Future prospects
Although the solution has passed a major test, the exploration of technology never ends, and we need to keep improving to handle more complex business scenarios:
1) Dataphin: further improve the stream-batch integration product to reduce the cost of manual intervention while further guaranteeing data quality.
2) Hologres: resource isolation, with reads and writes isolated, to better guarantee query SLAs; interoperability between Hologres and MaxCompute metadata, to give product metadata stronger protection; and dynamic scaling, to respond flexibly to peak and everyday business needs.
3) FBI: improve product version management, so that multiple people can edit the same page without overwriting one another, and support product building more efficiently.
This article is original content of Alibaba Cloud and may not be reproduced without permission.