This article is published by Tencent Cloud + Community
Author: Tencent Technology Engineering
Introduction: In recent years, deep learning has achieved notable results in the field of recommendation systems and has unique advantages over traditional recommendation methods. Our team has also tried some deep learning methods and accumulated experience in the image-and-text recommendation of QQ. This article introduces a deep learning method for the recall module of a recommendation system. It originates from a paper on YouTube video recommendation published by Google at RecSys 2016. We made some modifications on top of that paper and ran an online A/B test; compared with traditional collaborative recall, the click-through rate and other metrics improved significantly.
For completeness, before introducing the main model, this article first reviews traditional recommendation and recall algorithms and points out the advantages of deep learning over traditional methods, and then turns to the main topic: the deep recall model.
1. Overview of recommendation system algorithms
According to the types of data used, recommendation algorithms can be divided into two categories:
The first category is recommendation algorithms based on user behavior data, also known as collaborative filtering. Collaborative filtering is divided into memory-based and model-based methods. Representative memory-based algorithms include the user-based UserCF and the item-based ItemCF; their characteristic is to compute user-user or item-item similarity directly from behavior data. Representative model-based algorithms are mainly latent-variable models, such as SVD, matrix factorization (MF) [2,3], and the topic models PLSA and LDA; their characteristic is to first use behavior data to compute latent vectors for users and items, and then use these latent vectors to compute the matching degree between users and items for recommendation.
In practice, we can obtain not only user behavior data but also profile data of users and items, such as gender, age, region, tags, categories, titles, and body text. In some literature, data other than behavior is called side information. Traditional collaborative filtering does not consider side information; if side information is combined with behavior data, the accuracy of recommendation should improve. Algorithms that use both behavior data and side information belong to the second category.
In the second category, the most common model is the CTR model. A CTR model is essentially a binary classifier, and LR, XGBoost, LightGBM, and other classifiers are widely used. Behavior data and side information are used together to construct the features and labels of the training samples. After the classifier is trained, it predicts the probability that a user will click each item, and the top-K items with the highest probability are pushed to the user as the recommendation result. Compared with pure behavior-based collaborative filtering, a CTR model that uses side information usually achieves better recommendation results. The key to the success of a CTR model lies in how to combine side information and behavior data to construct discriminative user features, item features, and cross features.
In the past five years, CTR models based on deep learning have developed rapidly and achieved better results than traditional CTR models in many application scenarios. Compared with traditional CTR models, deep CTR models have unique advantages, mainly reflected in the following aspects:
(1) Feature integration ability: Using embedding techniques, deep learning can transform any categorical variable into a low-dimensional dense vector with semantic structure, making it easy to combine categorical and continuous variables as model input. This is more convenient and effective than the traditional one-hot or multi-hot representation of categorical variables, and it is especially suitable for web recommendation scenarios with many categorical variables.
(2) Automatic feature crossing: Through the strong nonlinear fitting ability of neural networks, deep learning can automatically learn cross features of users and items from the raw input data and match users and items at a deeper level, thereby improving recommendation accuracy. By learning cross features automatically, the workload of feature engineering is greatly reduced.
(3) End-to-end learning: Traditional CTR models separate feature construction from model training: data preprocessing and feature construction or feature learning come first, and then the model is trained. These pre-constructed or pre-learned features are not necessarily the best fit for the current model. In deep learning, the input data needs little preprocessing or separate feature learning; feature learning and model training are carried out simultaneously, so the learned features fit the current model best, and the generalization ability of the learned model is usually better. Because no separate feature learning is needed, development efficiency also improves.
(4) Mature computing platforms and tools: No matter how elegant a model is, it has to be solved to play a real role in practice. For deep learning, existing computing platforms such as TensorFlow make model solving very easy. Researchers only need to focus on the design and optimization of the model and do not need to derive complex solutions; the solving of the model is completed by the automatic differentiation of TensorFlow and similar tools, which greatly reduces the difficulty of implementing and deploying models.
For the above reasons, CTR models based on deep learning have attracted wide attention, developed rapidly in recent years, and achieved remarkable results in many businesses.
2. Recall model
Generally speaking, a recommendation system is divided into two layers: a recall layer and a ranking layer.
The recall layer is responsible for quickly selecting a pool of items related to the user's interests from all items, greatly narrowing the scope of candidates and preparing for ranking. For information-feed products, the recalled items usually also need to be timely. Therefore, in the information recommendation scenario, the recall model should meet the following requirements:
(1) Efficiency: complete the recall of items within a short response time;
(2) Relevance: recall items matching the user's interests as much as possible;
(3) Timeliness: newly published items should also be recallable, so that the latest content has a chance to be exposed.
2.1 Traditional recall models
Taking image-and-text recommendation as an example, a common recall approach is to associate similar articles with the articles a user has clicked in the past. According to the data used, it can be divided into content-based recall and collaborative recall. For content-based recall:
(1) recall articles with the same or similar tags and categories according to the tags and categories in the user's profile;
(2) compute the title vector or body vector of each article from its title or body text using word2vec, GloVe, or similar methods, then recall articles similar in content by computing the cosine similarity of these vectors.
For collaborative recall: (1) use the Jaccard formula to directly compute the user overlap of two articles as their behavioral similarity.
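As a minimal sketch of this behavioral similarity, the Jaccard coefficient of two articles' clicker sets is the size of the intersection over the size of the union (the user IDs below are made up for illustration):

```python
def jaccard_similarity(users_a, users_b):
    """Behavioral similarity of two articles: |A ∩ B| / |A ∪ B|
    over the sets of users who clicked each article."""
    a, b = set(users_a), set(users_b)
    if not a and not b:
        return 0.0  # no behavior data for either article
    return len(a & b) / len(a | b)

# Toy example: two articles clicked by overlapping user groups.
sim = jaccard_similarity(["u1", "u2", "u3"], ["u2", "u3", "u4"])
print(sim)  # 2 common users out of 4 distinct users -> 0.5
```

In practice this is computed for every frequently co-clicked article pair, and the most similar articles are recalled for each article the user has clicked.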
Advantages and disadvantages:
These two kinds of recall methods have their own advantages and disadvantages.
The first method has the advantage of being able to recall the latest articles. Its disadvantage is that the recalled content may be too concentrated, or interest may drift. With the vectorization approach, the phenomenon of "recalling what you click" easily appears, which reduces the diversity of recommendation. Recalling directly by tags or categories may bring in articles of low relevance, reducing recall accuracy.
The second method can alleviate the problems of overly concentrated recall content and interest drift to some extent, but because it depends on behavior data, it can only recall articles contained in the training data, not the latest articles. To recall new articles, the similarities must be recomputed at regular intervals, so the articles it recalls are never fully up to date.
To address the shortcomings of both methods, we use a deep learning approach to recall: we make full use of the user's profile information to deeply match the user's interests with articles, while ensuring that the latest articles can be recalled. In industry, the recall method Google used for YouTube video recommendation has been quite successful, but it does not consider whether the recalled content is up to date. To let this method recall the latest items, we made some modifications to the YouTube deep recall model so that it can recall articles that deeply match the user's interests while keeping the recalled articles fresh. In addition, because the YouTube recall model matches user vectors directly against article vectors, a vector index system makes it easy to meet the performance requirement of quickly recalling a candidate set from a large number of articles.
2.2 Deep recall model
For simplicity, we call the deep recall model proposed by Google for YouTube video recommendation the YouTube recall model. Before discussing the model, let us first describe two basic network structures commonly used in deep CTR models: the embedding layer and the fully connected layer.
Embedding layer
The input of a deep CTR model usually contains various categorical variables, such as the tags, primary categories, and secondary categories in user profiles. These categorical variables can take many values; tags, for example, may have hundreds of thousands of values. If one-hot or multi-hot encoding were used to represent tags, very high-dimensional sparse vectors would be generated. In deep CTR models, the embedding method is therefore usually used to represent such many-valued categorical variables as low-dimensional dense vectors before feeding them into the network. The low-dimensional dense vector is called the embedding of the categorical variable, and the layer that transforms the categorical variable into its embedding is called the embedding layer. The figure below shows an example of using the embedding method to represent the tags in a user profile as a low-dimensional dense vector.
Figure 1 Example of using the embedding method to represent categorical variables
First, we initialize a lookup table, which is a matrix (it can be fixed in advance, or learned from data and updated iteratively). The number of rows of the matrix is the number of values the categorical variable can take; in this example, it is the total number of tags. The number of columns is the dimension of the low-dimensional dense vector (specified in advance). If the total number of tags is 10000, then the lookup table has 10000 rows, and row i is the embedding corresponding to tag i. In the example above, the user has two tags, numbered 308 and 4080, with weights 0.7 and 0.3. To compute the embedding, we first take out the vectors in rows 308 and 4080, denoted
and then take the weighted sum of these two vectors with weights 0.7 and 0.3 to obtain the embedding of the user's tags.
In fact, looking up embeddings this way is equivalent to multiplying the high-dimensional sparse vector obtained by one-hot or multi-hot encoding of the categorical variable by the lookup table; the result is identical. The only difference is that performing this matrix multiplication as a lookup followed by a weighted sum is much more efficient. So, in essence, embedding performs a linear dimensionality reduction.
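The equivalence can be verified in a few lines of numpy. This is a sketch with a randomly initialized lookup table, reusing the tag IDs 308 and 4080 and weights 0.7 and 0.3 from the example above:

```python
import numpy as np

rng = np.random.default_rng(0)
num_tags, dim = 10000, 8
lookup_table = rng.normal(size=(num_tags, dim))  # one row per tag

# The user's two tags and their weights, as in the example above.
tag_ids = np.array([308, 4080])
weights = np.array([0.7, 0.3])

# Efficient lookup: gather the two rows, then take a weighted sum.
emb_lookup = weights @ lookup_table[tag_ids]

# Equivalent multi-hot formulation: a 10000-dim sparse vector times the table.
multi_hot = np.zeros(num_tags)
multi_hot[tag_ids] = weights
emb_matmul = multi_hot @ lookup_table

assert np.allclose(emb_lookup, emb_matmul)  # identical result
```

The lookup path touches only two rows of the matrix, while the multi-hot path multiplies a 10000-dimensional vector by a 10000 x 8 matrix, which is why the lookup formulation is used in practice.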
Fully connected layer
The fully connected layer is the basic structure of an MLP; it defines the nonlinear mapping from layer l to layer l+1. Its main function is to give the model the ability of nonlinear fitting, that is, the ability to learn feature crosses. Figure 2 shows a schematic of a full connection from layer l to layer l+1.
Figure 2 Schematic diagram of the full connection from layer l to layer l+1
Network structure of the YouTube recall model
Based on the user's click history and profile, the YouTube recall model computes the user's preference probability for each item in the item library and returns the top-K items with the highest probability as the recall list. In essence, the YouTube recall model is a very large-scale multi-class classification model: each item is a class, and the user's features, computed from the user's profile, serve as the model input to predict the user's favorite top-K classes (i.e., top-K items).
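At serving time, the class score of each item reduces to an inner product between the user vector and the item vector, so top-K recall is a nearest-neighbor search. A minimal sketch with random vectors (a production system would use a vector index instead of a brute-force scan):

```python
import numpy as np

rng = np.random.default_rng(1)
num_items, dim, k = 1000, 16, 5
item_vectors = rng.normal(size=(num_items, dim))  # one vector per item (class)
user_vector = rng.normal(size=dim)                # output of the DNN tower

# Unnormalized class scores: inner product with every item vector.
scores = item_vectors @ user_vector

# Top-K in O(n) with argpartition, then sort only the K winners by score.
topk = np.argpartition(-scores, k)[:k]
topk = topk[np.argsort(-scores[topk])]
```

The softmax normalization is monotone in the inner product, so ranking by inner product gives the same top-K as ranking by predicted probability.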
In the recall scenario of the main feed, the user input mainly includes the following types of data:
(1) the user's click history, including articles the user has read, liked, commented on, collected, or Biu-ed;
(2) the user's interest profile, including the user's tags, primary categories, secondary categories, etc.;
(3) the user's demographic features, including gender, age, etc.;
(4) the user's context information, including region and the time period of visiting the recommendation system, etc.
It should be noted that the item lookup table in our model is composed of word2vec vectors of the articles' segmented text and is not updated during training. In the original YouTube recall paper, the items are videos and the item lookup table is learned. But in the image-and-text recommendation scenario, if the article lookup table were learned, there would be no way to recall the latest articles: only articles that appeared in the training samples could be recalled, which cannot meet the need to recommend new articles. To recall new articles, we modify the original model and directly use the word2vec vector of each article to construct the article lookup table, where the article's word2vec vector is obtained as a weighted sum of its word vectors, and the word vectors are learned in advance by word2vec and then fixed. Every time a new article enters the library, we compute its word2vec vector by this weighted sum and save it. When the YouTube model performs recall online, whether computing the user's interest vector or computing inner products, the vector of every article, including the latest ones, is available in real time, thus meeting the need to recall the latest articles.
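A sketch of this article-vector construction follows. The per-word weights (for example tf-idf) are an assumption for illustration; the text only says the article vector is a weighted sum of fixed, pre-trained word vectors:

```python
import numpy as np

def article_vector(words, word_vectors, word_weights=None):
    """Weighted sum of pre-trained word2vec vectors over an article's
    segmented words. word_weights (e.g. tf-idf) is a hypothetical choice;
    the weighting scheme is not specified in the text. Uniform weights
    are used when none is given."""
    total = None
    for w in words:
        if w not in word_vectors:      # skip out-of-vocabulary words
            continue
        weight = 1.0 if word_weights is None else word_weights.get(w, 0.0)
        vec = weight * np.asarray(word_vectors[w])
        total = vec if total is None else total + vec
    return total  # None if no word was in the vocabulary

# Toy 2-dimensional vocabulary, purely for illustration.
wv = {"cat": np.array([1.0, 0.0]), "dog": np.array([0.0, 1.0])}
v = article_vector(["cat", "dog", "unknown"], wv, {"cat": 0.7, "dog": 0.3})
```

Because the word vectors are fixed, this computation needs no model retraining, which is exactly what lets a brand-new article obtain its vector the moment it enters the library.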
Figure 3 Network structure of the YouTube deep recall model
Taking all users' side information and click behavior data as training samples and maximizing the total likelihood function above, we can learn all the lookup tables and the weights and biases of the DNN.
Because the softmax in this model must be computed over all articles in the library, whose number is generally in the hundreds of thousands, direct optimization is infeasible and very time-consuming. In practice, an optimization technique designed for such large-scale multi-class problems, called candidate sampling, is used. The most widely used variant is NCE (noise contrastive estimation), which turns the softmax over clicked samples into multiple binary logistic problems. Since softmax has the sum-to-one property, maximizing the probability of clicked samples inevitably minimizes the probability of non-clicked samples, while binary logistic regression does not have this property, so negative samples must be assigned explicitly. NCE randomly selects negative samples according to item popularity. By optimizing the NCE loss, all parameters of the model can be trained quickly. In practice, we use the function tf.nn.nce_loss provided by TensorFlow to do candidate sampling and compute the NCE loss.
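To make the reduction concrete, here is a minimal numpy sketch of the sampled binary-logistic idea behind candidate sampling. It is a simplification: real NCE, as implemented in tf.nn.nce_loss, additionally corrects each logit by the log probability of the sampling distribution, which is omitted here:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sampled_binary_loss(user_vec, item_vectors, pos_id, neg_ids):
    """Replace the full softmax over all items with a few binary logistic
    terms: the clicked item is the positive; a handful of popularity-sampled
    items serve as negatives. (Real NCE also subtracts the log sampling
    probability from each logit; omitted in this sketch.)"""
    pos_logit = item_vectors[pos_id] @ user_vec
    neg_logits = item_vectors[neg_ids] @ user_vec
    loss = -np.log(sigmoid(pos_logit))               # push clicked item up
    loss -= np.log(1.0 - sigmoid(neg_logits)).sum()  # push sampled items down
    return loss

rng = np.random.default_rng(0)
items = rng.normal(size=(100, 8))   # toy item library of 100 vectors
u = rng.normal(size=8)              # toy user vector
loss = sampled_binary_loss(u, items, pos_id=3, neg_ids=np.array([10, 20, 30]))
```

Instead of normalizing over hundreds of thousands of items per training step, each step touches only the positive item and a few sampled negatives, which is what makes training tractable.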
3. Experiment and analysis
To verify whether the YouTube deep recall model is more effective than the traditional recall method, we conducted an online A/B test. After offline training, the deep recall model is used for online recall by the server; the baseline recall method is article-based collaborative filtering.
(1) User vector computation:
the user's recent click history, including articles read, liked, commented on, collected, or shared;
the user's latest interest profile, including tags, primary categories, secondary categories, and their weights;
the user's demographic features, mainly gender and age;
the user's context information, mainly region and current time.
(2) Positive sample selection:
articles clicked by the user within one day after the time point of the profile snapshot.
(3) Article vectors:
the word2vec vectors of all articles involved in the positive samples are computed along with the user vectors.
(4) Sample size:
users are sampled so that the number of training users is in the tens of millions, and the total number of samples reaches hundreds of millions.
Sampling analysis of experimental results
Before the online experiment, we did a sampling analysis: we randomly selected a number of users, retrieved their historically clicked articles, and then looked at which articles the YouTube recall model and the collaborative filtering model each recalled, subjectively judging which recall was more consistent with the user's click history. The following is a case analysis of one user:
From this user's historically clicked articles, we can see that his interests are entertainment, social topics, and technology. Both collaborative recall and YouTube recall can recall articles related to these categories. In contrast, the articles from collaborative recall tend to be similar in content, while the articles from YouTube recall are not only similar in content but also related in topic, which is better for diversity and exploration. For example, with collaborative recall, if the user clicked on Avengers: Endgame, Marvel articles appear in the recall; if the user clicked on articles about Jack Ma, Jack Ma appears in the recall. With YouTube recall, the phenomenon of "recalling what you click" is much rarer: the recalled articles remain relevant but also extend the user's interests. For example, Avengers: Endgame led to recalling Crazy Alien. Although the two are not films in the same series, both were new releases at the time; perhaps the user was not particularly interested in Marvel movies but in new movies in general, and the YouTube model recognized this interest trend and recalled Crazy Alien. So, subjectively, the recall feels both relevant and exploratory.
However, the above case analysis reflects only subjective impressions, and a sampling analysis cannot represent the whole. The most reliable way is to evaluate the recall algorithm with an online A/B test to see whether it improves the core online metrics.
Online evaluation metrics
The evaluation metrics of the online A/B test are the number of clicks, the number of exposures, the click-through rate, and the article coverage of the algorithm. The control group uses the item-based collaborative filtering algorithm ItemCF, which computes item-item similarity via Jaccard and recalls similar articles based on the articles clicked by the user. The experimental group uses the YouTube deep recall model. With the same ranking layer, the comparison between the experimental group and the control group on clicks, exposures, and click-through rate is as follows:
Click-through rate comparison
Comparison of clicks
As can be seen from the online metrics, the number of exposures of YouTube deep recall is slightly lower than that of collaborative recall, but its click-through rate is significantly higher: the YouTube recall model's exposures are about 80% of collaborative recall's, while its average click-through rate is about 20% higher. This shows that the articles recalled by the YouTube deep recall model match users' interests better than those of collaborative recall.
In addition, the YouTube recall model also outperforms collaborative recall in the diversity of recommended content and article coverage. Online statistics show that the total number of deduplicated recommended articles in the experimental group is about 2% higher than in the control group. This indirectly reflects that YouTube recall finds more articles matching user interests than collaborative recall.
(1) This article introduced a recall model based on deep learning and compared it with traditional recall methods;
(2) Online experiments show that the click-through rate of articles recalled by the YouTube recall model is significantly higher than that of collaborative filtering recall, indicating that the YouTube recall model learns a more accurate user interest vector from profile data, matches more articles that fit user interests, and demonstrates the advantages of deep learning models in feature integration and automatic learning of cross features;
(3) Because the YouTube recall model uses pre-trained word2vec article vectors when computing the user vector, and every article's word2vec vector can be computed as soon as it enters the library, the YouTube recall model can recall the most recently added articles, achieving a truly up-to-date recall.
[1] Xiang Liang. Recommender System Practice. Beijing: Posts and Telecom Press, 2012.
[2] Koren Y, Bell R, Volinsky C. Matrix factorization techniques for recommender systems. Computer, 2009(8): 30-37.
[3] Hu Y, Koren Y, Volinsky C. Collaborative filtering for implicit feedback datasets. ICDM, 2008: 263-272.
[4] Blei D M, Ng A Y, Jordan M I. Latent Dirichlet allocation. Journal of Machine Learning Research, 2003, 3(Jan): 993-1022.
[5] Zhang S, Yao L, Sun A. Deep learning based recommender system: A survey and new perspectives. arXiv preprint arXiv:1707.07435, 2017.
[6] Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. ICLR Workshop, 2013.
[7] Mikolov T, Sutskever I, Chen K, et al. Distributed representations of words and phrases and their compositionality. NIPS, 2013: 3111-3119.
[8] Pennington J, Socher R, Manning C. GloVe: Global vectors for word representation. EMNLP, 2014: 1532-1543.
[9] Covington P, Adams J, Sargin E. Deep neural networks for YouTube recommendations. RecSys, 2016: 191-198.
[10] Chen T, Guestrin C. XGBoost: A scalable tree boosting system. SIGKDD, 2016: 785-794.
[11] Ke G, Meng Q, Finley T, et al. LightGBM: A highly efficient gradient boosting decision tree. NIPS, 2017: 3146-3154.