This article, shared by JD.com's search algorithm architecture team, introduces the application of Apache Flink to online learning for JD product search ranking. The outline is as follows:
2. JD search online learning architecture
3. Real-time sample generation
4. Flink online learning
5. Monitoring system
6. Planning and summary
In JD's product search ranking, we often find that a lack of diversity in the search results keeps the system away from its optimum. To counter the Matthew effect in the data, which hurts ranking diversity, we adopted a binomial Thompson sampling model. However, that algorithm still applies the same strategy to all users and does not effectively exploit personalized information about users and products. We therefore use online learning to combine deep learning with Thompson sampling, building a personalized diversity ranking scheme whose model parameters are updated in real time.
In this scheme, Flink is mainly used for real-time sample generation and for online learning itself. Since samples are the cornerstone of model training, we compared Flink, Storm and Spark Streaming for large-scale sample processing and finally chose Flink as the framework for producing the real-time sample stream and iterating the online learning parameters. The full online learning pipeline is particularly long: it covers online feature logging, streaming feature processing, joining streaming features with user behavior labels, handling abnormal samples, and training and updating the model's dynamic parameters in real time. We therefore integrated with JD's Observer system, building complete full-link monitoring to guarantee data stability and integrity at every stage. Let's start with the online learning architecture of JD search.
2. JD search online learning architecture
The ranking model system of JD search mainly consists of the following parts:
1. Predictor: the model prediction service. The model it loads has a static part and a dynamic part. The static part is trained from offline data and mainly learns dense feature representations of users and docs; the dynamic part holds doc-granularity weight vectors, which are updated in real time by the online learning task.
2. Rank: mainly contains the ranking strategies. Once the final ranking is determined, a feature log is recorded in real time, and the features of each doc are written, in order, to the feature data stream as the data source for subsequent real-time samples.
3. Feature Collector: receives the feature data emitted by the online prediction system and outputs a feature stream at query + doc granularity, shielding downstream consumers from online-system-specific logic such as caching, deduplication and filtering.
4. Sample Join: takes the feature data and the user behavior labels (exposure, click, add-to-cart, order, etc.) as data sources, and joins them into samples that meet the business requirements via Flink's union + timer data model. Algorithm engineers can choose different labels as positive and negative samples according to the training objective.
5. Online Learning: consumes the real-time samples produced upstream to train and update the dynamic part of the model.
3. Real-time sample generation
Online learning places high demands on both the timeliness and accuracy of sample generation and on the stability of the job. With massive user log data flowing in continuously, we must keep the job's latency low, the sample join rate high and the task stable, while preserving throughput and maximizing resource utilization.
The main flow of JD search's online sample generation is as follows:
1. The data sources are the exposure stream, the feature stream and the user behavior stream. All are delivered uniformly as JDQ pipeline streams, supported by JD's real-time computing platform.
2. After the feature, exposure and label streams are received, the data is cleaned into the format required by the task.
3. Each standardized stream is unioned with the others, followed by a keyBy operation.
4. A Flink timer is registered in the process function, acting as the real-time window for sample generation.
5. The generated samples are written to JDQ as the input of online learning, and persisted to HDFS for offline training, incremental learning and data analysis.
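The union + timer pattern in steps 3–5 can be sketched as follows. This is a minimal, hypothetical Python simulation, not the production job: events from the feature and behavior streams are unioned, keyed by (query, doc), buffered in per-key state, and flushed into one sample when a simulated timer fires. The class name, field names and window length are all illustrative.

```python
from collections import defaultdict

WINDOW = 10 * 60  # join window in seconds (assumed value)

class SampleJoiner:
    """Toy stand-in for a Flink KeyedProcessFunction doing the sample join."""

    def __init__(self):
        # per-key state: the doc's features plus all behavior labels seen so far
        self.state = defaultdict(lambda: {"features": None, "labels": []})
        self.timers = {}  # key -> timer fire timestamp

    def on_event(self, key, event, ts):
        slot = self.state[key]
        if event["type"] == "feature":
            slot["features"] = event["payload"]
        else:  # exposure / click / add-to-cart / order ... treated as labels
            slot["labels"].append(event["type"])
        # register a single timer per key, WINDOW after the first event
        self.timers.setdefault(key, ts + WINDOW)

    def on_timer(self, key):
        slot = self.state.pop(key)
        self.timers.pop(key)
        if slot["features"] is None:
            return None  # feature never arrived: drop (or route to a side output)
        return {"key": key, "features": slot["features"],
                "label": 1 if "click" in slot["labels"] else 0}

j = SampleJoiner()
j.on_event(("q1", "doc1"), {"type": "feature", "payload": [0.1, 0.2]}, ts=0)
j.on_event(("q1", "doc1"), {"type": "exposure"}, ts=5)
j.on_event(("q1", "doc1"), {"type": "click"}, ts=9)
print(j.on_timer(("q1", "doc1")))  # joined positive sample
```

In the real job the buffering lives in Flink keyed state and the timer is a Flink event-time/processing-time timer, so the same logic scales out by key.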
Optimization practice for the online sample task:
The sample data throughput of JD search reaches GBs per second, which puts high demands on partitioning for distributed processing, very large state, and exception handling.
1. Data skew
Data skew is almost inevitable when using keyBy. Assume here that the key design is reasonable, the shuffle mode is chosen correctly, the task has no backpressure and resources are sufficient; the skew discussed below is caused purely by the task's parallelism settings. Let's first look at how Flink distributes keys to subtasks:
keyGroupId = assignToKeyGroup(key, maxParallelism)
subtaskNum = computeOperatorIndexForKeyGroup(maxParallelism, parallelism, keyGroupId)
Suppose our parallelism is set to 300; Flink's default maxParallelism is then 512. With 512 key groups spread over 300 subtasks, some subtasks are inevitably assigned two key groups while others get only one, so the data is naturally skewed. There are two solutions:
● Set the parallelism to a power of 2;
● Set maxParallelism to an integer multiple of the parallelism.
With scheme 1 the parallelism can only ever be adjusted to powers of 2, so we recommend scheme 2: with parallelism 300, set maxParallelism to 1200. If the data is still skewed, maxParallelism can be raised further so that each key group holds fewer keys, which also reduces the chance of skew.
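The imbalance is easy to verify numerically. The sketch below mirrors Flink's key-group-to-subtask formula (`keyGroupId * parallelism / maxParallelism`, integer division); the key-hashing step (murmur hash of the key's hashCode) is omitted since only the group-to-subtask mapping matters here.

```python
def operator_index_for_key_group(max_parallelism, parallelism, key_group_id):
    # same integer arithmetic Flink uses to map a key group to a subtask
    return key_group_id * parallelism // max_parallelism

def key_groups_per_subtask(max_parallelism, parallelism):
    counts = [0] * parallelism
    for kg in range(max_parallelism):
        counts[operator_index_for_key_group(max_parallelism, parallelism, kg)] += 1
    return counts

# parallelism 300 with the default maxParallelism 512: some subtasks own one
# key group, others own two -- a built-in 2x skew before any data arrives.
counts_512 = key_groups_per_subtask(512, 300)
print(min(counts_512), max(counts_512))    # 1 2

# maxParallelism 1200 (4x the parallelism): every subtask owns exactly four.
counts_1200 = key_groups_per_subtask(1200, 300)
print(min(counts_1200), max(counts_1200))  # 4 4
```

With 512 groups over 300 subtasks, 212 subtasks receive two key groups and 88 receive one, which is exactly the imbalance described above; any maxParallelism that is a multiple of the parallelism distributes the groups evenly.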
The online sample job makes heavy use of Flink state. We initially kept state in memory (the default), but as the data volume grew, the state size surged and GC pauses became very long. We then switched the state backend to RocksDB, which solved the GC problem. Our checkpoint configuration is as follows:
● Enable incremental checkpoints;
● Set reasonable values for the checkpoint timeout, interval and minimum pause;
● Let Flink manage the memory used by RocksDB, and tune RocksDB's block cache and write buffer;
● Optimize state usage by splitting the data across multiple state objects to reduce serialization/deserialization cost.
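For reference, the settings above map onto a `flink-conf.yaml` fragment roughly like the following; the concrete intervals are illustrative placeholders, not our production values.

```yaml
state.backend: rocksdb
state.backend.incremental: true                # incremental checkpoints
state.backend.rocksdb.memory.managed: true     # let Flink manage RocksDB memory
state.backend.local-recovery: true             # local recovery on failover
execution.checkpointing.interval: 5min         # illustrative values
execution.checkpointing.timeout: 10min
execution.checkpointing.min-pause: 2min
```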
During tuning we found that RocksDB access took a long time; jstack showed many threads waiting on serialization and deserialization. As the number of algorithm features gradually grew past 500 per sample, each record became larger and larger. But the sample join only needs to associate on the primary key, not on the features. So we store the primary key in a ValueState and the feature values in a MapState/ListState. (The feature values could also be kept in external storage; that is a trade-off between network I/O and local I/O.)
● Enable local recovery for failure recovery.
Since our checkpoint data has reached the TB level, any task failure puts great pressure on both HDFS and the task itself. We therefore prefer local recovery, which both reduces the load on HDFS and speeds up recovery.
4. Flink online learning
For online learning, let's first introduce the Bernoulli Thompson sampling algorithm. Assume the reward probability of each product follows a Beta distribution. For each product i we maintain two parameters, the success count Si and the failure count Fi, together with a success prior α and a failure prior β shared by all products.
For each candidate product we sample an expected reward Q(a_t) = θ_i from its Beta(α + Si, β + Fi) distribution, and show the user the product with the largest sampled reward. The environment then returns the real reward, and the corresponding model parameters are updated, which achieves the online learning effect. In our personalized version, the parameter is an n-dimensional vector per product: the raw features are passed through a DNN/MLP network to obtain an n-dimensional personalized representation of the product. The likelihood is modeled with a logistic regression function, and Flink builds real-time samples from this representation and the real-time feedback, which are used to iteratively update the parameter distribution.
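The basic (non-personalized) Bernoulli Thompson sampling loop described above can be sketched as follows. This is a hedged toy illustration: `true_ctr` stands in for real user feedback, and the priors and round count are arbitrary.

```python
import random

def thompson_select(s, f, alpha, beta):
    # sample theta_i ~ Beta(alpha + s_i, beta + f_i) per item, pick the argmax
    thetas = [random.betavariate(alpha + si, beta + fi)
              for si, fi in zip(s, f)]
    return max(range(len(thetas)), key=thetas.__getitem__)

def update(i, reward, s, f):
    # real reward feeds back into the per-item success/failure counts
    if reward:
        s[i] += 1
    else:
        f[i] += 1

random.seed(0)
s, f = [0, 0, 0], [0, 0, 0]
true_ctr = [0.1, 0.5, 0.3]  # hidden reward probabilities (illustrative)
for _ in range(2000):
    i = thompson_select(s, f, alpha=1.0, beta=1.0)
    update(i, random.random() < true_ctr[i], s, f)
# after enough rounds, the best arm (index 1) receives most of the traffic
```

The personalized variant replaces the scalar per-item counts with the n-dimensional representation and a logistic likelihood, but the sample-select-update loop is the same.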
1. Data ordering guarantee
After the real-time samples are received from JDQ, their order is not guaranteed, so a watermark mechanism is used to restore the ordering of the data.
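The idea behind the watermark-based reordering can be illustrated with a toy buffer: events are held back and only released, sorted by event time, once the watermark (here simply the maximum seen timestamp minus an assumed out-of-orderness bound) passes them. The bound and class name are illustrative, not taken from the article.

```python
import heapq

OUT_OF_ORDER_BOUND = 5  # seconds of allowed lateness (assumed value)

class Reorderer:
    def __init__(self):
        self.buffer = []   # min-heap ordered by event timestamp
        self.max_ts = 0

    def on_event(self, ts, event):
        heapq.heappush(self.buffer, (ts, event))
        self.max_ts = max(self.max_ts, ts)
        watermark = self.max_ts - OUT_OF_ORDER_BOUND
        released = []
        # release everything the watermark has passed, in timestamp order
        while self.buffer and self.buffer[0][0] <= watermark:
            released.append(heapq.heappop(self.buffer))
        return released

r = Reorderer()
r.on_event(3, "a")
r.on_event(1, "b")
print(r.on_event(9, "c"))  # watermark 4 releases ts 1 and 3, in order
```

Flink's event-time watermarks generalize this: the watermark advances per source and flows through the topology, so downstream operators see a consistent ordering guarantee.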
2. Sample data processing
When the window reaches a certain positive/negative ratio or data volume, one batch of training is performed to iterate a new parameter vector. The product embeddings are kept in Flink state, and the updated parameters then serve as the dynamic part of the model.
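The batch trigger just described might look like the sketch below; the thresholds (`BATCH_SIZE`, `MIN_POS`, `MIN_NEG`) are assumed values for illustration, not the production configuration.

```python
BATCH_SIZE = 512        # assumed window size cap
MIN_POS, MIN_NEG = 16, 16  # assumed minimum positives/negatives per batch

def should_train(batch):
    """Train when the window is full, or when it holds enough of each class."""
    pos = sum(sample["label"] for sample in batch)
    neg = len(batch) - pos
    return len(batch) >= BATCH_SIZE or (pos >= MIN_POS and neg >= MIN_NEG)

window = [{"label": 1}] * 16 + [{"label": 0}] * 16
print(should_train(window))       # True: class-balance condition met
print(should_train(window[:10]))  # False: too small and one-sided
```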
3. Synchronous vs. asynchronous iteration
When the personalized EE (explore-exploit) parameters were updated asynchronously, out-of-order parameter updates slowed the convergence of the online model and wasted traffic. We therefore switched from asynchronous to synchronous updates to avoid out-of-order parameter reads and writes. In synchronous mode, the parameter vector stored in state is needed by the next training iteration; if it is lost, the iteration for that product breaks off. To guard against parameter loss from system failures, we designed a double guarantee: after a normal task failure or restart, parameters are recovered from a checkpoint or savepoint; if they cannot be recovered in an unexpected situation, the previous version of the parameters is fetched from the remote online service and written back into state.
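A minimal sketch of that double guarantee, under the assumption that both stores can be modeled as key-value lookups: Flink state (restored via checkpoint/savepoint) is tried first, and only on a miss are the parameters fetched from the remote online service and written back so the iteration can resume. `state` and `remote` are plain dicts standing in for the real storage layers.

```python
def load_params(doc_id, state, remote):
    params = state.get(doc_id)
    if params is None:                  # state lost in an unexpected failure
        params = remote.get(doc_id)     # last parameter version served online
        if params is not None:
            state[doc_id] = params      # write back so iteration can continue
    return params

state = {"doc1": [0.1, 0.2]}
remote = {"doc1": [0.0, 0.0], "doc2": [0.3, 0.4]}
print(load_params("doc1", state, remote))  # served from state
print(load_params("doc2", state, remote))  # recovered from the remote service
```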
4. Support for multiple experiment versions
The online learning job uses a single Flink task to support multiple model versions running AB experiments in different experiment buckets. Traffic buckets are distinguished by a version number, the corresponding real-time samples are keyed by docId + version, and the iteration processes do not affect each other.
To improve bandwidth utilization and meet performance requirements, we transmit data internally in protobuf (PB) format. Our investigation showed that PB outperforms Flink's generic Kryo serialization for our payloads, so we use Flink's custom serialization mechanism to pass data between operators directly in PB format.
5. Monitoring system
Here we distinguish between business full-link monitoring and task stability monitoring, described in detail below.
1. Full link monitoring
The whole system uses JD's internal Observer platform for business full-link monitoring, mainly covering Predictor service monitoring, feature dump QPS monitoring, feature and label quality monitoring, join monitoring, training monitoring and AB metrics monitoring.
2. Task stability monitoring
Task stability monitoring here mainly refers to the stability monitoring of the Flink jobs themselves. The link throughput reaches GB/s, the feature message QPS reaches 100,000, and online learning must run uninterrupted, so monitoring and alerting are essential for both the sample job and the learning job:
● Container memory, CPU, thread count and GC monitoring;
● Sample-related business monitoring.
6. Planning and summary
Flink performs excellently in real-time data processing: disaster recovery, throughput, a rich operator library, ease of use, and native support for unified batch and stream processing, which makes it a good open-source choice for building online learning. As the scale of machine learning data grows and the requirements on data and model freshness rise, Flink is being applied ever more widely; online learning is not only a complement to offline model training, but also a trend in improving the efficiency of model systems. To this end, we have made plans for further work.
Thanks to the real-time computing R&D department and the search ranking algorithm team for their support.