Author:Zhang Guangming (Shi Heng)）Senior technical expert of Alibaba search Division
1. Business background
Alibaba e-commerce search recommendation real-time data warehouse carries the real-time warehouse scenarios of Alibaba Group’s Taobao, Taobao special edition, hungry and other e-commerce businesses, and provides a variety of data application support, including real-time large screen, real-time report, real-time algorithm training, real-time A / b experimental Kanban and so on.
Value of data
We believe that the data is in the brain position of Alibaba search recommendation, which is reflected in algorithm iteration, product operation and boss decision-making. So how does the data flow in the search recommendation business scenario? The first is information collection. When users use the search and recommendation function of mobile Taobao, they will trigger the buried point information on the server; Next, it will be processed by offline and real-time ETL, and then loaded into the product engine; Then we will build an analysis system based on the engine to help algorithms and products make analysis and decisions. After forming a decision, some new content will be online, and users can see some business forms produced by the algorithm model; In this way, a new process of data acquisition, processing, loading and analysis is produced. In this way, data can be used to form a complete service link, in which each link is very important.
Search for recommended typical scenarios
There are many different application scenarios of real-time data in e-commerce search and recommendation, such as real-time analysis, algorithm application and refined crowd operation.
1) Real time analysis and algorithm application scenarios
In the real-time analysis and algorithm application scenarios, we use the real-time data warehouse to build analysis reports, real-time large screens, training algorithm models and other types of data products. The real-time data demand search recommendation scenario mainly has the following characteristics:
- Large amount of data: single day Pb storage
- Total number of single table:_ 100 billion+_
- QPS high peak: write to RPS6500W+
- Peak query QPS:_ 200+_
- High data flexibility, diversified analysis scenarios, fixed condition high-frequency analysis and non fixed condition multi-dimensional query
2) Refined crowd operation scenario
In e-commerce operation, there is often a need to adopt different operation strategies for different groups of people. In the traditional way, offline data is used to launch activities for people, but generally the operation effect of the previous day can be seen only the next day. In order to observe and improve the operation effect more efficiently, real-time crowd delivery and crowd portrait have become essential needs.
The real-time data warehouse will provide real-time crowd behavior effect data for activity operation in the form of real-time large screen and real-time report, such as real-time UV and real-time transaction volume of people in different regions and ages. In addition, it is also necessary to correlate and compare real-time data with offline data to provide real-time month on month and year-on-year data.
2. Typical real-time data warehouse demands
Based on the above background, during the construction of real-time data warehouse, we summarized the following typical demands of real-time data warehouse:
- Grouped cross section
For example, the display of indicators by industry is usually queried in SQL with group by;
- Multidimensional filtering
Scene filtering, user filtering, commodity filtering, merchant filtering, etc. usually use the array field to filter the attribute value;
Calculate real-time indicators based on detailed data aggregation, such as sum and count_ Distinct calculation, etc;
- A/B Test
Calculate the real-time gap data between the test bucket and the reference bucket by analyzing the bucket field in the log buried point;
- Specify key
When troubleshooting problems or observing core business indicators, it is often necessary to specify business ID and commodity ID to query real-time indicators. Aggregate calculation needs to be carried out after filtering based on the ID field in the detailed real-time table;
- Flow batch integration
Since the real-time data warehouse only retains the data of the last two days, when facing the requirements of year-on-year and month on month calculation, it is necessary to read the offline data and perform associated calculation with the real-time data, so that the product / operation can intuitively see the comparison performance between this year’s real-time data and the same period last year when viewing the upper level report.
3. Real time data warehouse architecture
Based on the appeal of typical real-time data warehouse, we abstract the typical real-time data warehouse architecture shown in the figure below.
The real-time collected business logs are cleaned and filtered by real-time calculation Flink, and the results are written into the OLAP engine. The OLAP engine should not only support multi-dimensional interactive query, but also support kV query and flow batch query to meet our various business demands. At the same time, the OLAP engine also needs to connect with various business applications built on the upper layer and provide online services.
Based on this typical real-time architecture, the following is the evolution process of real-time architecture under our search recommendation scenario.
1) Real time data warehouse architecture version 1.0
The first is version 1.0 of real-time data warehouse architecture, as shown in the figure below. This version is mainly composed of three plates:
In the data collection layer, we divide the upstream real-time collected data into user behavior logs, commodity dimension tables, merchant dimension tables, user dimension tables, etc. why are there dimension tables? Because each service will not bury all the information in the log when it is buried, if all the information is carried by the user behavior log, the flexibility will be particularly poor. Therefore, the dimension table plays the role of information expansion in the business.
The collected user behavior log will be written to Flink for real-time calculation in real time. The dimension table data such as user dimension table and commodity dimension table will be archived in maxcompute. After preliminary calculation, it will be synchronized to the batch engine through data synchronization tool (dataX).
In the data processing layer, in the stream processing part, Flink performs preliminary processing on the user behavior log data written in real time. The specific processing includes data analysis, cleaning, filtering, association dimension table, etc.
In the batch processing part, in order to query and filter data according to attributes in data query and service, it is necessary to associate the user’s real-time behavior with the dimension table in Flink job, which requires that the batch processing system can support high QPS query. At that time, the QPS of a single table of search business was up to 65 million. After multi-party research, HBase was selected as the batch processing engine of dimension table.
In the Flink job, a real-time wide table containing multiple dimension columns is output based on the attribute data in the associated HBase dimension table such as user ID, commodity ID and merchant ID, and then output to the OLAP engine. In order to simplify Flink’s real-time operation and reduce the pressure of real-time calculation, we do not use window functions to aggregate indicators in Flink, but simply filter and correlate the real-time logs and directly input the detailed data to the downstream, which requires the downstream engine to support both kV query, OLAP multi-dimensional interactive query and flow batch integrated query.
Data query and service
In the first version of the architecture, we use lightning engine to carry the real-time detailed data output by Flink, realize the integration of query flow and batch based on lightning, and then provide unified real-time data query service for upper layer applications.
However, the limitations of lightning are also very obvious: first, the query method is non SQL type, which is not friendly enough. If you write SQL, you need secondary encapsulation. Second, lightning adopts a public cluster, and multi-user resources are not isolated. When a large amount of data needs to be queried, problems such as performance fluctuation and resource queuing are easy to occur, which makes the query time-consuming and has certain restrictions in the use of actual business scenarios.
2) Real time data warehouse architecture version 2.0
Based on the limitations of lightning, we hope to find an alternative product, which has the ability to support OLAP interactive query and high QPS dimension table verification query. Therefore, in the real-time data warehouse architecture of version 2.0, we began to access hologres.
At the beginning, we just used hologres instead of lightning to provide kV and OLAP query capabilities, which solved the limitations brought by lightning. This architecture looks very good, but because it still needs to go through HBase to store dimension tables, with the growth of the amount of data, the time for importing data into HBase is also longer, which actually wastes a lot of resources. Moreover, with the increase of the real-time requirements of online services, the disadvantages of HBase are becoming more and more obvious.
One of the core capabilities of hologres is to speed up offline data, especially for maxcompute data. It can connect with its resources at the bottom and speed up query. Therefore, we came up with the idea of replacing HBase with hologres. With hologres as the unified storage, there is no need to import and export data, ensuring that one data is stored one by one.
Therefore, the final real-time data warehouse architecture version 2.0 is as follows:
In the data processing stage, the user dimension table, commodity dimension table and merchant dimension table are directly stored in hologres in row storage mode to replace HBase storage. Jobs in Flink can directly read the dimension table of hologres and associate it with the behavior log.
In the data query and service stage, we uniformly store the real-time detailed data output by Flink processing to hologres, which provides high concurrent data real-time writing and real-time query.
4. Hologres based best practices
Due to the access of hologres, the real-time data warehouse version 2.0 not only simplifies the architecture and saves resources, but also truly realizes the integration of flow and batch. This architecture has been used so far. Here are the best practices of hologres in searching and recommending specific business scenarios based on this architecture.
1) Bank deposit best practices
Hologres supports two storage modes: row storage and column storage. Row storage is friendly to key value query scenarios and is suitable for point query and scan based on primary key. The table in row storage mode can be regarded as a table similar to HBase, and different tables are used to store dimension information of different entities. In Flink real-time job, dimension table data can be efficiently read from hologres row storage table and associated with entities in real-time stream.
2) Listing best practices
The default table storage mode in hologres is column storage. Column storage is friendly to OLAP scenarios and suitable for various complex queries.
Based on the listing and storage mode of hologres, we have built a real-time data query Kanban for search and recommendation services, which can support real-time filtering of dozens of different dimensions.When the maximum peak number of writes per second (RPS) exceeds 5 million, you can still query the aggregate indicator results filtered by multiple dimensions at the second level.
At the same time, the hologres table supports setting the TTL attribute of table data. Generally, we set the life cycle of a real-time table to 48 hours, and the data exceeding 48 hours will be automatically deleted. In the real-time Kanban, users can query the real-time data in the last two days to avoid unnecessary waste of resources.
3) Best practice of flow batch integration
Hologres supports not only ad hoc analysis and query of data based on real-time details, but also direct accelerated query of maxcompute offline table. Therefore, we use this feature to realize the query of stream batch integration (real-time offline Federation analysis).
In the tmall promotion campaign, we used hologres’s federal analysis ability to build the target completion rate of core businesses and the comparison Kanban of the same period last year, which provided effective data support for operation algorithm decision-making.
Among them, the development of target completion rate Kanban becomes easier with the help of real-time offline Federation analysis, that is, the indicators of the day are queried in real time through hologres, and the indicators of the day in the real-time table are divided by the target indicators set in the offline table, so that the operation can see the completion of the goals of the core businesses updated in real time.
Compared with the same period last year, the calculation logic of real-time Kanban is similar. You can join the real-time table with the offline table last year in SQL to calculate the key indicators year-on-year.
All calculations can be completed in hologres, and the calculation logic can be expressed through SQL. There is no need for additional data development. One data and one set of code reduce the difficulty of development, operation and maintenance, and truly realize the integration of flow and batch.
4) High concurrency real-time update
In some scenarios, we not only need to write data to the OLAP engine in real-time increments, but also need to update the written data.
For example, during order closing attribution, Flink real-time job will double flow join the order submission data flow and the progress click data flow, and associate the last click event before order submission. When multiple click events arrive successively, we need to update the detailed data of order attribution. At this time, we need to use the update support of hologres to update the original data through the primary key of the data to ensure the accuracy of transaction attribution data. In practice, the update write peak of hologres can reach 50W, meeting the high concurrent real-time update requirements of business.
5. Future outlook
We hope to continuously improve the existing real-time data warehouse based on hologres engine in the future. The main directions are:
1) Real time table join
At present, hologres supports the join between 10 billion level tables and 100 million level tables, and the second level query response. Based on this feature, it is expected that the dimension table association work originally completed by Flink real-time job in the data processing stage can be changed to real-time join calculation in the query hologres stage. For example, table 1 is a detailed data table and table 2 is a user dimension table. In the query stage, the join can filter the user dimension table and then associate it with the detailed data table to achieve the purpose of filtering data. Such an improvement will bring several benefits:
1) Reduce the amount of data storage in hologres and avoid a large amount of data redundancy stored in the real-time table (for example, the data of the same commodity ID will be stored repeatedly);
2) Improving the timeliness of dimension attributes in real-time data and calculating after real-time joining dimension table data in the query stage can always use the latest dimension attributes when filtering data through dimensions.
2) Persistent storage
In the future, we will explore how to store the real-time data of common dimensions in a persistent manner by using the computing and storage capabilities of hologres.