Xiaohongshu's Big Data Recommendation Practice on Alibaba Cloud

Time: 2020-10-30

This article is divided into three parts. The first part covers the use of real-time computing in Xiaohongshu's recommendation business. The second part discusses how Xiaohongshu uses some of Flink's new features. The third part covers real-time OLAP analysis scenarios and the cooperation with Alibaba Cloud MaxCompute and Hologres.

Author: Guo Yi, head of recommendation engineering at Xiaohongshu

Xiaohongshu's recommendation business architecture

First, the figure shows some typical recommendation services. The module that makes the heaviest use of big data is the online recommendation engine on the far left. A recommendation engine is generally divided into recall, ranking, and re-ranking stages; I won't elaborate on them here. From a big data perspective, the recommendation engine mainly uses a prediction model to estimate how much a user will like each candidate note, and then decides according to some strategy which notes to recommend. When the model is served, we need to capture the note features it sees, and these flow back into our training data to train new models. After the recommendation engine returns the notes, the user's consumption behavior on them, including impressions, clicks, likes, and so on, forms a user behavior stream. This behavior stream is joined with the feature stream to generate the model training data used to iterate the model. Combining user and note information also produces user and note profiles as well as the analysis reports used by the recommendation business.
After more than a year of transformation, in the recommendation scenario, apart from the data-to-strategy analysis loop, which still requires human participation to iterate strategies, the other modules are basically updated in real time or near real time.

Real-time computing in the recommendation service

Here is a bit more detail on the real-time computing that happens after feature and user behavior data flow back, and on how we use the data they generate. The feature stream produced by the recommendation engine is very large: it contains all the notes returned for a recommendation request, about 100 of them, together with all the features of those notes, several hundred in total. Our current approach is to write these features into an efficient KV store that we developed in-house and cache them for several hours; then, once the user behavior data flows back from the client, we start the stream processing.
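To make the caching step concrete, here is a minimal sketch of a TTL feature cache keyed by (user, note). Xiaohongshu's in-house KV store is not public, so Guava's in-process cache stands in for it; the key layout, the 4-hour TTL, and the FeatureVector type are illustrative assumptions.

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import java.util.concurrent.TimeUnit;

public class FeatureCache {
    // Hypothetical payload: the several hundred features logged per (user, note) pair.
    public record FeatureVector(float[] values) {}

    // Stand-in for the in-house KV store: entries expire after a few hours,
    // the window within which behavior events are expected to flow back.
    private final Cache<String, FeatureVector> cache = CacheBuilder.newBuilder()
            .expireAfterWrite(4, TimeUnit.HOURS) // "several hours" in the article; 4 is assumed
            .maximumSize(10_000_000L)            // illustrative capacity bound
            .build();

    public void put(String userId, String noteId, FeatureVector features) {
        cache.put(userId + ":" + noteId, features);
    }

    // Returns null when a behavior event arrives after its features expired.
    public FeatureVector lookup(String userId, String noteId) {
        return cache.getIfPresent(userId + ":" + noteId);
    }
}
```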
Our first step is to attribute and summarize the client-side user behavior. Let me explain what attribution and summarization mean here. On the Xiaohongshu app, client-side tracking is divided by page. For example, a user sees a note in the home feed recommendations and clicks it; after clicking, the user jumps to the note page, browses the note there, and likes it. The user may also click the author's avatar, enter the author's profile page, and follow the author from there. Attribution means that this whole series of behaviors should be counted as behavior of the home feed recommendation and must not be mixed with other businesses, because a search user who sees the same note in search results may perform the same actions. So we have to determine which business each user behavior came from; that is attribution.
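As an illustration of attribution, here is a minimal sketch that assumes each tracking event carries the page it occurred on and a reference to the event that led the user there; the event fields and surface names are hypothetical, not Xiaohongshu's actual schema.

```java
import java.util.HashMap;
import java.util.Map;

public class Attribution {
    // Hypothetical tracking event: where it happened, which event led the user
    // to that page (null for a root event such as a home-feed impression), and what it was.
    public record Event(String eventId, String parentEventId, String page, String action) {}

    /**
     * Walks each event's parent chain back to its root and attributes the event
     * to the surface where the chain started, e.g. "home_feed" vs "search".
     */
    public static Map<String, String> attribute(Iterable<Event> events) {
        Map<String, Event> byId = new HashMap<>();
        for (Event e : events) byId.put(e.eventId(), e);

        Map<String, String> attribution = new HashMap<>();
        for (Event e : events) {
            Event cur = e;
            while (cur.parentEventId() != null && byId.containsKey(cur.parentEventId())) {
                cur = byId.get(cur.parentEventId());
            }
            attribution.put(e.eventId(), cur.page());
        }
        return attribution;
    }
}
```

So a like on the note page whose chain started with a home-feed click is counted as home-feed behavior, even though the identical note could also have been reached through search.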

Summarization means that for a series of user behaviors on the same note we generate one summary record, which makes subsequent analysis easier. After attribution, then, there is a real-time stream of individual user behaviors. On the summary side, because a window period is required, the summary data is generally delayed, currently by about 20 minutes. When we generate the attributed and summarized data streams, we also join in dimension-table data: we look up the features that were used when the note was recommended to the user, and we attach some basic user and note information to the stream.

There are four important downstream scenarios. The first is generating breakdowns by business, which mainly tells us a user's click-through rate and other business metrics across different note dimensions, and conversely a note's click-through rate across different users; these are among the more important features in our real-time recommendation. The second is the wide table we use for real-time analysis: we flatten user information, note information, and the summarized user-note interactions into one multi-dimensional table for real-time analysis, described in more detail later. The third is real-time training data: a training example consists of the features used at ranking time for a user-note interaction, labeled with the outcomes we summarized; with this data we train and update the model. The fourth is that all the summary data enters the offline data warehouse for subsequent analysis and report processing.
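A minimal sketch of the summarization step, written against Flink's keyed-window API; the 20-minute tumbling window matches the delay mentioned above, while the event and summary types are illustrative assumptions rather than the production job.

```java
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class NoteSummaryJob {
    // Assumed shape of one attributed behavior event.
    public static class BehaviorEvent {
        public String userId, noteId, action; // action: "click", "like", "collect", ...
        public long dwellMillis;
        public String key() { return userId + ":" + noteId; }
    }

    // One summary record per (user, note) per window.
    public static class Summary {
        public String userNoteKey;
        public long clicks, likes, collects, dwellMillis;
    }

    // Rolls all of one user's behaviors on one note into a single Summary.
    public static class SummaryAggregate
            implements AggregateFunction<BehaviorEvent, Summary, Summary> {
        @Override public Summary createAccumulator() { return new Summary(); }
        @Override public Summary add(BehaviorEvent e, Summary acc) {
            acc.userNoteKey = e.key();
            switch (e.action) {
                case "click":   acc.clicks++;   break;
                case "like":    acc.likes++;    break;
                case "collect": acc.collects++; break;
            }
            acc.dwellMillis += e.dwellMillis;
            return acc;
        }
        @Override public Summary getResult(Summary acc) { return acc; }
        @Override public Summary merge(Summary a, Summary b) {
            a.clicks += b.clicks; a.likes += b.likes;
            a.collects += b.collects; a.dwellMillis += b.dwellMillis;
            return a;
        }
    }

    // Assumes timestamps and watermarks are assigned upstream.
    public static DataStream<Summary> summarize(DataStream<BehaviorEvent> attributed) {
        return attributed
                .keyBy(BehaviorEvent::key)
                .window(TumblingEventTimeWindows.of(Time.minutes(20))) // matches the ~20 min delay
                .aggregate(new SummaryAggregate());
    }
}
```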

Stream computing optimization: batch-stream unification in Flink

Next I'll talk about how we use some of Flink's new features to optimize the stream computing process. I will mainly cover two points, the first of which is batch-stream unification.

As just described, we summarize a user's behavior per note, and a lot of summary information is involved. Besides the simplest signals, such as whether the user liked or collected the note, there are more complicated labels, such as how long the user stayed on the note page, or whether a click on the note was an effective click. In some advertising scenarios, for example, a click only counts as effective if the user clicks and then stays for more than 5 seconds. We want such complex logic to be implemented exactly once in our system and then used by both real-time and batch computing. Traditionally that is difficult, because batch and streaming are usually two separate implementations: if we implement the definition of an effective click in Flink, we also need an offline version of the same definition, perhaps written in SQL. So Xiaohongshu adopted the new source interface from FLIP-27, which lets a log file, a batch-style input, be consumed as a stream. In this way we achieve batch-stream unification at the code level.
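The crux is that the business definition lives in exactly one place. A minimal sketch under assumed event fields: the same predicate is applied whether the events are replayed from log files through a FLIP-27 source or consumed live from the stream, so the streaming and batch jobs cannot drift apart.

```java
import java.time.Duration;

public final class EffectiveClick {
    // Assumed shape of a click after summarization: when it happened and how
    // long the user subsequently stayed on the note page.
    public record ClickEvent(String userId, String noteId, Duration dwell) {}

    private static final Duration MIN_DWELL = Duration.ofSeconds(5);

    /**
     * The single definition of an "effective click": a click followed by a
     * dwell of more than 5 seconds. Both the real-time Flink job and the batch
     * job call this method, so the definition is implemented only once.
     */
    public static boolean isEffective(ClickEvent click) {
        return click.dwell().compareTo(MIN_DWELL) > 0;
    }

    private EffectiveClick() {}
}
```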

Stream computing optimization: multi-sink optimization

Another Flink feature we use is the multi-sink optimization in Flink 1.11. It means that one piece of data is written to multiple data applications; for example, the same records need to feed both the user-behavior wide table and an offline dataset. What the multi-sink optimization does is let the job read from Kafka only once and, for the same key, do the KV lookup only once, then produce multiple outputs and write to multiple sinks at the same time. This greatly reduces the pressure on both Kafka and the KV store.
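A sketch of the fan-out in DataStream terms, with hypothetical sinks and a stand-in for the KV lookup; the point is that the Kafka source and the per-key lookup appear once and the enriched stream is shared by every sink.

```java
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import java.util.Properties;

public class MultiSinkJob {
    // Hypothetical sinks standing in for the wide-table and offline outputs.
    static class WideTableSink implements SinkFunction<String> {
        @Override public void invoke(String value, Context ctx) { /* write to the OLAP wide table */ }
    }
    static class OfflineStoreSink implements SinkFunction<String> {
        @Override public void invoke(String value, Context ctx) { /* write to the offline warehouse */ }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092"); // placeholder address
        props.setProperty("group.id", "behavior-job");

        // Read the behavior topic once...
        DataStream<String> behaviors = env.addSource(
                new FlinkKafkaConsumer<>("user-behaviors", new SimpleStringSchema(), props));

        // ...and run the per-key KV feature lookup once (a stand-in map here).
        DataStream<String> enriched = behaviors.map(event -> event /* + looked-up features */);

        // The same enriched stream fans out to multiple sinks, so neither the
        // Kafka read nor the KV lookup is repeated per output pipeline.
        enriched.addSink(new WideTableSink());
        enriched.addSink(new OfflineStoreSink());

        env.execute("multi-sink-behavior-job");
    }
}
```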

Typical OLAP scenarios at Xiaohongshu

Finally, I'd like to talk about our OLAP scenarios and our cooperation with Alibaba Cloud MaxCompute and Hologres. There are many OLAP scenarios in Xiaohongshu's recommendation business; here I will describe four common ones. The most common is real-time analysis grouped by the user's experiment group. In the recommendation business we frequently adjust strategies and update models, and every adjustment is rolled out as an experiment that places users into different A/B test groups so their behavior can be compared. In fact a user is in multiple experiments at the same time, belonging to one experimental group within each. The experimental analysis we do takes one experiment, summarizes user behavior and data by experimental group within it, and compares the user metrics across groups. This is a very common scenario, but also a computation-heavy one, because it requires grouping by the user's experiment tags.
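The shape of such a query, sketched against a hypothetical wide-table schema in which each row carries the user's experiment tags as an array column (`exp_groups`); Hologres speaks the PostgreSQL protocol, so plain JDBC works, but the table and column names here are assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ExperimentAnalysis {
    public static void main(String[] args) throws Exception {
        // Endpoint and credentials are placeholders.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://hologres-endpoint:80/recommend", "user", "pass");
             Statement stmt = conn.createStatement()) {

            // Per-group CTR for one experiment: unnest each row's experiment
            // tags, keep the groups belonging to experiment 'exp_123', aggregate.
            ResultSet rs = stmt.executeQuery(
                "SELECT g AS exp_group, " +
                "       SUM(clicks)::float8 / NULLIF(SUM(impressions), 0) AS ctr " +
                "FROM user_note_wide_table, UNNEST(exp_groups) AS g " +
                "WHERE g LIKE 'exp_123:%' " +
                "  AND event_date >= CURRENT_DATE - INTERVAL '7 days' " +
                "GROUP BY g ORDER BY g");
            while (rs.next()) {
                System.out.printf("%s ctr=%.4f%n", rs.getString(1), rs.getDouble(2));
            }
        }
    }
}
```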
Another scenario arises because Xiaohongshu's recommendation actually runs across multiple data centers. Data centers often undergo changes: operations and maintenance work, bringing up a new service, or launching a new model in one data center first. We therefore need an end-to-end way to verify whether the data across data centers is consistent and whether the user experience in different data centers is the same. To do that, we compare user behavior across data centers and check whether the final metrics agree. We use the same approach for model and code releases: when a new model or code version is released, we compare the user-behavior metrics generated by the old and new versions to see whether they are consistent. Likewise, our OLAP system drives real-time business-metric alerting: if users' click-through rate or number of likes drops sharply, a real-time alarm is triggered.

The scale of OLAP data at Xiaohongshu

At peak, our real-time computation records 350,000 user behaviors per second. The wide table has about 300 fields, and we want to keep the data for more than two weeks, about 15 days, because experimental analysis often compares this week's data against the previous week's. We run about 1,000 queries per day.

Xiaohongshu + Hologres

In July we started cooperating with Alibaba Cloud MaxCompute and Hologres. Hologres is a new-generation intelligent data warehouse solution that covers real-time and offline computing in a one-stop way. Its main applications are real-time dashboards, Tableau, and data science, and our research found it well suited to our recommendation scenarios.

Hologres application scenarios at Xiaohongshu

What Hologres mainly does for us is accelerate queries over offline data, providing interactive, table-level query responses on it. There is no need to move data from the offline warehouse into a real-time warehouse, because everything lives in one system. On top of this real-time data warehouse we built a user insight system that monitors the platform's user data in real time and can diagnose users from different angles, which supports refined operations; this fits our user scenarios very well. In addition, Hologres's real-time-offline federated computing allows interactive analysis that combines the real-time computing engine with offline MaxCompute data, supporting real-time-offline federated queries and building full-link refined operations.
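A sketch of what the federated path can look like over JDBC, with placeholder project, table, and column names; Hologres documents a built-in `odps_server` for exposing MaxCompute tables as foreign tables, so offline history can be unioned with real-time rows without copying data, though the exact setup here is an assumption, not Xiaohongshu's configuration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class FederatedQuery {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://hologres-endpoint:80/recommend", "user", "pass");
             Statement stmt = conn.createStatement()) {

            // Map an offline MaxCompute table into Hologres as a foreign table
            // (project and table names are placeholders).
            stmt.execute(
                "IMPORT FOREIGN SCHEMA my_odps_project LIMIT TO (behavior_summary_offline) " +
                "FROM SERVER odps_server INTO public");

            // Federated query: recent rows from the real-time Hologres table
            // unioned with older history read directly from MaxCompute.
            stmt.executeQuery(
                "SELECT event_date, SUM(clicks) AS clicks FROM ( " +
                "  SELECT event_date, clicks FROM behavior_summary_realtime " +
                "  UNION ALL " +
                "  SELECT event_date, clicks FROM behavior_summary_offline " +
                ") t GROUP BY event_date ORDER BY event_date");
        }
    }
}
```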

Hologres vs. ClickHouse

Before cooperating with Alibaba Cloud MaxCompute, we ran our own ClickHouse cluster. It was fairly large, 1,320 cores in total. Because ClickHouse is not a compute-storage-separated solution, we stored only 7 days of data to keep costs down. And because ClickHouse had no particular optimization for our user-experiment tags, queries spanning more than three days were very slow. Since this is an OLAP scenario, we want every query to return results within two minutes, so in practice we could only query the past three days of data.

Another problem was that ClickHouse's support for deduplication was problematic, so we did not configure deduplication on the cluster; whenever the upstream data stream jittered and produced duplicates, duplicate records appeared downstream in ClickHouse. We also had to dedicate staff to operating and maintaining ClickHouse, and we found that running a clustered ClickHouse carries a high O&M cost.

So in July we cooperated with Alibaba Cloud and migrated our largest table, the recommendation user wide table, to MaxCompute and Hologres. We now have 1,200 cores on Hologres; because it is a compute-storage-separated solution, 1,200 cores are enough for us, while our storage demand is larger and we keep 15 days of data. Hologres also made some customized optimizations for the scenario of grouping users by experiment, so we can now comfortably query 7 to 15 days of data, and query performance in this scenario is greatly improved over ClickHouse. Hologres supports primary keys as well, so we configured one and write with the insert-or-ignore method; with a primary key in place, deduplication comes naturally. As long as the upstream guarantees at-least-once delivery, the downstream data contains no duplicates. And because it all runs on Alibaba Cloud, we carry no O&M cost.
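A minimal sketch of the idempotent write path, with a hypothetical table and primary key; since Hologres is PostgreSQL-compatible, insert-or-ignore can be expressed as `INSERT ... ON CONFLICT DO NOTHING`, which is what makes at-least-once upstream delivery safe.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class IdempotentWriter {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://hologres-endpoint:80/recommend", "user", "pass")) {

            // behavior_id is the assumed primary key: a replayed duplicate hits
            // the key and is silently ignored, so at-least-once delivery
            // upstream produces no duplicate rows downstream.
            String sql = "INSERT INTO user_note_wide_table (behavior_id, user_id, note_id, clicks) "
                       + "VALUES (?, ?, ?, ?) ON CONFLICT (behavior_id) DO NOTHING";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setLong(1, 42L);
                ps.setString(2, "user-1");
                ps.setString(3, "note-9");
                ps.setInt(4, 1);
                ps.executeUpdate(); // first write inserts the row
                ps.executeUpdate(); // a replay of the same record is a no-op
            }
        }
    }
}
```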
