Alimei’s Guide: Search, recommendation and advertising are the core businesses through which Internet content providers create value. On Alibaba’s e-commerce trading platform, they are likewise of great significance and value. Alibaba’s recommendation technology has now been optimized once more. Let’s take a look at the new recommendation technology together.

**I. Background**

Search, recommendation and advertising seem to have different business forms, but in fact, their technical composition is very similar. From the perspective of recommendation, search can be regarded as a kind of recommendation with query relevance constraints, while advertising is a kind of recommendation with marketing intention (price) constraints of advertisers. Therefore, the innovation of recommendation technology plays a fundamental role in promoting the overall development of search, recommendation and advertising business technology.

From the perspective of technology evolution, recommendation algorithms have been renewed repeatedly in recent years. From the first generation, item-based collaborative filtering (Item-CF), to the second generation, vector retrieval based on the inner product model, recommendation technology lifted the ceiling on the search range of the candidate subset. However, vector retrieval restricts the user-item preference measure to the inner product model, and cannot accommodate more advanced scoring models (such as deep networks with attention structures).

To further lift the ceiling on model capability while still retrieving from the full library under efficiency constraints, Alimama’s precision targeted advertising team previously proposed a new generation of tree-based deep model (TDM), which achieved significant improvements on large-scale recommendation problems. Recently, the team has obtained new results in large-scale recommendation research, showing how to jointly optimize the model, the index and the retrieval algorithm in a data-driven way. The paper presenting these results has been accepted at NeurIPS 2019.

**II. Problems in the existing system**

As shown in the figure below, in a large-scale task, a search, recommendation or advertising system usually consists of three major components: model, index and retrieval algorithm. The model computes the preference probability for a single user-item pair, and the index organizes all items in an orderly way. The retrieval algorithm recalls the final recommendation results from the index according to the output of the model. The three components together determine the recall quality and are intrinsically related.

However, taking recommendation as an example, existing recommendation systems often fail to fully consider the relationship among model, index and retrieval. From the perspective of joint optimization, the representative algorithms of several generations of recommendation systems are analyzed as follows:

1. In Item-CF, the inverted index is built according to some custom similarity measure between items. Retrieval queries the candidate set in the inverted index according to the user’s historical behavior, then sorts and truncates it. During sorting, the model scores the items in the candidate set according to custom rules. In this system, the model and the retrieval are fixed by rules, with no learning or tuning.

2. In the vector retrieval pattern, the system learns a vector representation for each user and each item, and their inner product is used as the prediction of the user’s preference for the item. Retrieval is equivalent to a nearest neighbor search of the user vector in the set of item vectors. In large-scale problems, an approximate nearest neighbor index structure can be used to speed up retrieval. When building a vector retrieval recommendation system, the goal of model training is to accurately predict the preference probability of a single user-item pair, while the goal of building the KNN retrieval index is to minimize the approximation error, so the two optimization directions are not the same. At the same time, preference prediction in inner product form has limited expressive power and cannot accommodate more advanced scoring models.

3. In TDM, we approach the joint optimization of model, index and retrieval by alternately optimizing the model and the tree structure, combined with a parameter-free layer-by-layer beam search retrieval process. However, in TDM the optimization objectives of model learning and tree structure index learning are not fully consistent, so the two optimizations may interfere with each other and lead to a suboptimal overall result. For a tree-structured index in particular, both the construction of model training samples and the choice of retrieval path depend closely on the tree structure, so its quality is especially important.
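The vector retrieval pattern in item 2 can be sketched in a few lines. This is a minimal illustration, not the production system: the 2-dimensional embeddings and item names are made up for the example, and the search is exhaustive rather than using an approximate nearest neighbor index.

```python
import heapq


def inner_product(u, v):
    """Preference score of the inner product model: <user vector, item vector>."""
    return sum(a * b for a, b in zip(u, v))


def top_k_by_inner_product(user_vec, item_vecs, k):
    """Exhaustive top-k retrieval: score every item, keep the k largest scores."""
    scored = ((inner_product(user_vec, vec), item_id)
              for item_id, vec in item_vecs.items())
    return [item_id for _, item_id in heapq.nlargest(k, scored)]


# Hypothetical embeddings, not taken from the article.
item_vecs = {
    "book": [1.0, 0.0],
    "pen": [0.8, 0.2],
    "shoe": [0.0, 1.0],
}
user_vec = [0.9, 0.1]
print(top_k_by_inner_product(user_vec, item_vecs, k=2))
```

Whatever the model learns, the score here can only be an inner product of two fixed vectors, which is the expressiveness limit the text describes.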

To sum up, this paper proposes the joint optimization of tree-based index and deep model (JTM) to solve the problems of current large-scale recommendation methods, removing the mutual constraints caused by optimizing each module of the system independently, so as to optimize the overall recommendation performance.

**III. Deep tree matching recommendation technology with end-to-end joint learning**

JTM inherits the TDM system framework: a tree-structured index plus an arbitrary deep user-item preference scoring model. Through joint optimization and hierarchical feature modeling, JTM achieves significantly higher recommendation accuracy than TDM. To better understand JTM, let’s first briefly review the principle of TDM.

**3.1 The tree-based deep recommendation model TDM**

The task of a recommender system is to select, from a candidate set (e.g. the item library), the subset that matches the user’s current preferences. When the candidate set is large, recommending from the whole library quickly and effectively is a challenging problem. TDM uses a tree as the index structure and further requires the user’s preference for nodes on the tree to satisfy the following approximate max-heap property:

$$p^{(l)}(n \mid u) = \frac{\max_{n_c \in \{\text{children of } n \text{ in level } l+1\}} p^{(l+1)}(n_c \mid u)}{\alpha^{(l)}}$$

where $p^{(l)}(n \mid u)$ is the ground-truth preference probability of user $u$ for node $n$, and $\alpha^{(l)}$ is the normalization term of the preference probability distribution in level $l$. This modeling ensures that the $k$ nodes with the largest preference probability in level $l$ are contained among the children of the top-$k$ nodes in level $l-1$. Based on this model, TDM transforms the recommendation problem into a top-down hierarchical retrieval problem on the tree. The following figure shows how TDM generates the candidate subset.

First, each item in the candidate set is assigned to a different leaf node of the tree, as shown in figure (b). A non-leaf node can be regarded as an abstraction of its set of child nodes. Figure (a) shows how the user’s preference probability for a node is computed. The user information and the node to be scored are first vectorized as the input of the deep scoring network (for example, a fully connected network or an attention network), and the output of the network is the user’s preference probability for the node. To retrieve the top-$k$ candidate subset, that is, the top-$k$ leaf nodes, we use top-down beam search. In level $l$, we only score and sort the children of the $k$ nodes selected in level $l-1$, and select $k$ candidates in level $l$ from them. Figure (b) shows the retrieval process.

By using a tree as the index, the time complexity of top-1 retrieval for a user’s preference subset is $O(\log n)$, where $n$ is the size of the whole candidate set. This complexity is also independent of the structure of the user preference scoring model. At the same time, the approximate max-heap assumption turns the goal of model learning into learning the user-node preference distribution, which lets TDM break the inner product restriction on user preference scoring imposed by nearest neighbor retrieval and use arbitrarily complex scoring models, thus greatly improving recommendation accuracy.
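The layer-by-layer beam search can be sketched as follows. This is a toy illustration under stated assumptions: the tree is a complete binary tree stored in array layout (root at index 0, children of node `i` at `2i+1` and `2i+2`), and the scores are a hypothetical lookup table that satisfies the max-heap property exactly, standing in for the deep scoring network. The function names are made up for the example.

```python
def build_max_heap_scores(leaf_scores):
    """Fill a complete binary tree so that each parent holds the max of its children.

    len(leaf_scores) is assumed to be a power of two.
    """
    n_leaves = len(leaf_scores)
    scores = [0.0] * (n_leaves - 1) + list(leaf_scores)
    for node in range(n_leaves - 2, -1, -1):
        scores[node] = max(scores[2 * node + 1], scores[2 * node + 2])
    return scores


def beam_search(score_fn, n_nodes, k):
    """Keep the k best nodes in each level and expand only their children."""
    beam = [0]  # start from the root
    while True:
        children = [c for n in beam for c in (2 * n + 1, 2 * n + 2) if c < n_nodes]
        if not children:  # the beam already consists of leaf nodes
            return beam
        beam = sorted(children, key=score_fn, reverse=True)[:k]


# Hypothetical leaf preferences for one user (8 leaves, node ids 7..14).
leaf_scores = [0.1, 0.7, 0.3, 0.9, 0.2, 0.4, 0.8, 0.5]
scores = build_max_heap_scores(leaf_scores)
top2 = beam_search(lambda node: scores[node], len(scores), k=2)
print(top2)
```

Each level scores at most `2k` nodes, so the number of scoring calls grows with the tree depth rather than with the number of leaves, and `score_fn` can be any model because the search only needs its output values.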

**3.2 Joint optimization framework in JTM**

It can be seen from the retrieval process that the recommendation accuracy of TDM is determined jointly by the user preference scoring model $M$ and the tree index structure $T$, and the two are coupled. Specifically, given $n$ positive samples $\{(u^{(i)}, c^{(i)})\}_{i=1}^{n}$, meaning that user $u^{(i)}$ is interested in item $c^{(i)}$, the tree structure $T$ determines which non-leaf nodes the model $M$ must select in order to return item $c^{(i)}$ to user $u^{(i)}$. Jointly optimizing $M$ and $T$ avoids the overall suboptimal result caused by conflicting optimization directions. Therefore, in JTM, we optimize $M$ and $T$ under a common loss function. First, we construct the objective function of the joint optimization.

Denote by $p(\pi(c) \mid u; \pi)$ the user’s preference probability for the leaf node $\pi(c)$, where $\pi(\cdot)$ is the projection function that maps each item in the candidate set to a leaf node of the tree; $\pi(\cdot)$ determines the index order of the items in the tree structure. If $(u, c)$ is a positive sample, we have $p(\pi(c) \mid u; \pi) = 1$. Under the approximate max-heap assumption, the preference probability of every ancestor node of $\pi(c)$ is also 1, that is,

$$p\left(b_j(\pi(c)) \mid u; \pi\right) = 1, \quad j = 0, \dots, l_{max},$$

where $b_j(\cdot)$ is the projection function that maps a node to its ancestor in level $j$, and $l_{max}$ is the maximum level of tree $T$. Denote by $\hat{p}(\pi(c) \mid u; \theta, \pi)$ the estimate, returned by model $M$, of user $u$’s preference probability for node $\pi(c)$, where $\theta$ denotes the model parameters. Given the $n$ positive samples, we want to jointly optimize $\pi$ and $\theta$ to fit the above user preference distribution over the tree nodes. To this end, we want $\pi$ and $\theta$ to minimize the following global empirical loss function:

$$L(\theta, \pi) = -\sum_{i=1}^{n} \sum_{j=0}^{l_{max}} \log \hat{p}\left(b_j(\pi(c^{(i)})) \mid u^{(i)}; \theta, \pi\right)$$

In solving this problem, because optimizing $\pi$ is a combinatorial optimization problem, it is hard to optimize it simultaneously with $\theta$ using gradient-based algorithms. Therefore, we propose a joint optimization framework that alternately optimizes $\theta$ and $\pi$, as shown in the figure below. Because both steps optimize the same objective, the whole algorithm is driven toward convergence. In fact, if model learning and tree learning both decrease the loss function, the whole algorithm converges, because $\{L(\theta_t, \pi_t)\}$ is a monotonically decreasing sequence bounded below by 0.
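The convergence argument can be written as a chain of inequalities. This is a sketch under two assumptions: each model-learning step and each tree-learning step does not increase the loss, and the loss is a negative log-likelihood of estimated probabilities, so it cannot be negative. The order of the two steps within one iteration is chosen here for illustration.

$$L(\theta_t, \pi_t) \;\ge\; L(\theta_{t+1}, \pi_t) \;\ge\; L(\theta_{t+1}, \pi_{t+1}) \;\ge\; 0$$

The first inequality is the model-learning step with the tree fixed, the second is the tree-learning step with the model fixed, and the last holds because every $\hat{p} \le 1$ gives $-\log \hat{p} \ge 0$. A monotonically non-increasing sequence that is bounded below converges.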

In model training, $\min_\theta L(\theta, \pi)$ amounts to learning the user-node preference scoring model for every level. Thanks to the tree structure and the approximate max-heap property, model training only needs to fit the user-node preference distribution in the training set, which allows an arbitrarily complex neural network model, and $\min_\theta L(\theta, \pi)$ can be solved with popular algorithms such as SGD and Adam. Sampling strategies such as noise-contrastive estimation (NCE) can be used to speed up the computation of the normalization term.
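To make the training objective concrete, the sketch below evaluates the loss for a fixed tree and a fixed model. It assumes the loss is the negative log-likelihood summed over each positive sample’s leaf node and all of its ancestors, with the tree stored as a complete binary tree in array layout. The names `ancestors`, `joint_loss` and the preference table are hypothetical, and `p_hat` stands in for the neural scoring model.

```python
import math


def ancestors(node):
    """Path from a node up to the root (node itself included), array-layout binary tree."""
    path = [node]
    while node > 0:
        node = (node - 1) // 2
        path.append(node)
    return path


def joint_loss(samples, pi, p_hat):
    """Negative log-likelihood over the leaf and every ancestor of each positive sample."""
    loss = 0.0
    for user, item in samples:
        for node in ancestors(pi[item]):
            loss -= math.log(p_hat(user, node))
    return loss


# Tree with 4 leaves (node ids 3..6); pi maps items to leaves.
pi = {"a": 3, "b": 4, "c": 5, "d": 6}
# Hypothetical model outputs for user "u1" on the path 3 -> 1 -> 0.
table = {("u1", 0): 1.0, ("u1", 1): 0.8, ("u1", 3): 0.5}
loss = joint_loss([("u1", "a")], pi, lambda user, node: table[(user, node)])
print(loss)
```

Changing `pi` changes which nodes appear in the sum, which is why the tree and the model share one objective in this formulation.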

Tree structure learning solves $\max_\pi -L(\theta, \pi)$ for given model parameters, which is a combinatorial optimization problem. In fact, given the shape of the tree (for ease of exposition we assume a complete binary tree; the proposed optimization algorithm extends easily to multi-way trees), $\max_\pi -L(\theta, \pi)$ is equivalent to finding an optimal matching between the candidate set and all leaf nodes, which in turn is equivalent to the maximum weight matching of a weighted bipartite graph. The analysis is as follows:

If the $k$-th item $c_k$ is assigned to the $m$-th leaf node $n_m$, that is, $\pi(c_k) = n_m$, we can compute the following weight:

$$\mathcal{L}_{c_k, n_m} = \sum_{(u, c) \in A_k} \sum_{j=0}^{l_{max}} \log \hat{p}\left(b_j(n_m) \mid u; \theta\right)$$

where

$$A_k = \{(u^{(i)}, c^{(i)}) \mid c^{(i)} = c_k\}$$

is the set of training samples whose target item is $c_k$. Taking the leaf nodes of the tree and the items of the candidate set as vertices, the full connections between leaf nodes and items as edges, and $\mathcal{L}_{c_k, n_m}$ as the weight of the edge between $c_k$ and $n_m$, we can construct a weighted bipartite graph $V$, as shown in flowchart (b) in Section 2.1. In this case, every possible $\pi(\cdot)$ is a matching of $V$, and we have

$$-L(\theta, \pi) = \sum_{k=1}^{|C|} \mathcal{L}_{c_k, \pi(c_k)}$$

where $C$ is the set of all $c_k$. Therefore, $\max_\pi -L(\theta, \pi)$ is equivalent to finding the maximum weight matching of $V$.
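The equivalence with maximum weight matching can be checked on a tiny example by brute force. This sketch is only an illustration: the weight matrix is made up (the entries play the role of summed log-probabilities, hence negative), and enumerating all matchings is feasible only for a handful of items, which is exactly why a cheaper algorithm is needed at scale.

```python
from itertools import permutations


def best_assignment(weights):
    """Exhaustively find the item -> leaf matching with the largest total weight.

    weights[k][m] is the weight of assigning item k to leaf m (square matrix).
    Enumerates all n! matchings, so it is usable only for very small n.
    """
    n = len(weights)
    best_perm, best_total = None, float("-inf")
    for perm in permutations(range(n)):
        total = sum(weights[k][perm[k]] for k in range(n))
        if total > best_total:
            best_perm, best_total = perm, total
    return list(best_perm), best_total


# Hypothetical weights for 3 items and 3 leaves.
weights = [
    [-0.10, -0.20, -3.00],
    [-0.15, -2.00, -2.50],
    [-1.00, -0.40, -0.30],
]
assignment, total = best_assignment(weights)
print(assignment, total)
```

In this example item 0 would individually prefer leaf 0, but the best overall matching gives leaf 0 to item 1, which shows why the assignment has to be solved globally rather than item by item.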

For large-scale candidate sets, classical algorithms for maximum weight matching, such as the Hungarian algorithm, are hard to use because of their high complexity. Even for the simplest greedy algorithm, the cost of computing and storing all the weights is unacceptable. To solve this problem, we exploit the tree structure and propose a segmented tree learning algorithm. Instead of assigning all items directly to leaf nodes, we assign items to nodes step by step, from the top of the tree to the bottom. Notes:

Repeat this process until each item is assigned to a leaf node. The flow of the whole algorithm is shown in the figure below:

**3.3 Hierarchical user interest representation**

In essence, JTM (like TDM) is a deep redesign of the index structure and retrieval method of a recommendation system. Each level of the tree can be regarded as an aggregated representation of the items at a different granularity. Through top-down, level-by-level relevance retrieval on the tree, JTM finds the candidate subset that best matches the user information from coarse to fine, which is consistent with how people select preferred items. By considering the model and the index structure jointly, JTM and TDM decompose a complex large-scale recommendation task into several cascaded sub-retrieval tasks. In the upper-level retrieval tasks, selection only needs to happen at a coarser granularity, and the candidate set at each level is far smaller than the whole candidate set, so training difficulty is greatly reduced. It can be expected that when the sub-retrieval tasks are solved well enough, the result of the cascaded retrieval will surpass selecting the candidate subset directly from the whole candidate set.

In fact, the tree structure itself provides a hierarchy over the candidate set, because each non-leaf node is a learned abstraction of all its child nodes. This inspires us to model user behavior features hierarchically, and therefore more precisely, when training model $M$ for the sub-retrieval task of each level.

Specifically, each item in the user’s historical behavior is a discrete ID feature. In model training, each item and each node on the tree is embedded in a continuous feature space and optimized together with the model. Since each non-leaf node is an abstraction of its child nodes, given the user behavior sequence $c = \{c_1, c_2, \cdots, c_m\}$, we propose to use $c^l = \{b_l(\pi(c_1)), b_l(\pi(c_2)), \cdots, b_l(\pi(c_m))\}$, together with the target node and other user features, to generate the input of model $M$ for retrieval in level $l$. In this way, the level-$l$ ancestors of the items the user interacted with serve as the user’s abstracted behavior sequence. There are two main benefits:

1. Interlayer independence.

The traditional approach, in which every level of retrieval shares the same item embeddings for the user behavior sequence, introduces noise when training $M$ as the user preference scoring model of each level, because $M$’s training objective differs from level to level. A direct solution is to give each item a separate embedding at each level and train them jointly, but this greatly increases the number of parameters. The hierarchical user behavior feature proposed in this paper uses the embeddings of the nodes in the corresponding level to generate the input of $M$, so that embedding learning is independent across levels without increasing the total number of parameters.

2. Accurate user modeling.

During retrieval, $M$ selects, level by level, coarse-to-fine abstractions of the final candidate subset. The hierarchical user behavior feature representation we propose captures the essence of this retrieval process: it abstracts the user behavior with the nodes of the current level, which makes user preferences more predictable and reduces the confusion caused by feature representations that are too coarse or too fine.
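The hierarchical behavior feature can be sketched as a simple mapping from items to their ancestors. This is an illustration only: it assumes a complete binary tree in array layout with the root at level 0, and the item names, the mapping `pi` and the function names are hypothetical. In the real model each node id would then be looked up in an embedding table.

```python
def ancestor_at_level(leaf, leaf_level, level):
    """Ancestor of `leaf` at `level` in an array-layout binary tree (root is level 0)."""
    node = leaf
    for _ in range(leaf_level - level):
        node = (node - 1) // 2
    return node


def behavior_at_level(behaviors, pi, leaf_level, level):
    """Replace each item of the behavior sequence by its ancestor node at `level`."""
    return [ancestor_at_level(pi[item], leaf_level, level) for item in behaviors]


# Tree with 8 leaves at level 3 (node ids 7..14); pi maps items to leaves.
pi = {"c1": 7, "c2": 8, "c3": 12, "c4": 14}
behaviors = ["c1", "c2", "c3"]
for level in (3, 2, 1):
    print(level, behavior_at_level(behaviors, pi, leaf_level=3, level=level))
```

At coarser levels several items collapse onto the same node, so the sequence fed to the model describes the user at the same granularity as the nodes being scored.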

**IV. Experimental results**

**4.1 Experimental setup**

We used two large public datasets, Amazon Books and UserBehavior, to evaluate the effectiveness of the method. Amazon Books records users’ behavior on Amazon; we chose its largest subcategory, books. UserBehavior is Alibaba’s open-source Taobao user behavior dataset. The sizes of the datasets are as follows:

In the experiment, we compared the following methods:

- Item-CF: the basic collaborative filtering algorithm, which is widely used in personalized recommendation tasks.
- YouTube product-DNN: the vector inner product retrieval model used in YouTube video recommendation.
- HSM: the hierarchical softmax model, widely used in NLP as an alternative to computing normalized probabilities.
- TDM: our previous tree-based deep matching recommendation method.
- DNN: the TDM model with the tree structure removed. This method uses no index structure, so it scores the whole candidate set and then selects the top-k. Because scoring all candidates is computationally expensive, it cannot be applied in practice, but it serves as a strong baseline for comparison.
- JTM: the joint optimization method proposed in this paper. We also compared two variants of JTM, JTM-J and JTM-H. JTM-J uses joint optimization of the tree structure but not the hierarchical user interest representation; JTM-H, on the contrary, uses the hierarchical user interest representation but keeps a fixed initial tree structure instead of joint learning.

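In all neural network models, the same three-layer fully connected network is used as the scoring model. For evaluation, we use precision, recall and F-measure as performance metrics, defined as follows:

$$\mathrm{Precision}(u) = \frac{|P_u \cap G_u|}{|P_u|}, \qquad \mathrm{Recall}(u) = \frac{|P_u \cap G_u|}{|G_u|}, \qquad \mathrm{F\text{-}measure}(u) = \frac{2 \cdot \mathrm{Precision}(u) \cdot \mathrm{Recall}(u)}{\mathrm{Precision}(u) + \mathrm{Recall}(u)}$$

where $P_u$ is the set of items recalled for user $u$, and $G_u$ is the ground-truth set of items that user $u$ is interested in.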
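The three metrics can be computed per user as in the sketch below. It uses the standard set-based definitions (hits divided by the number of recalled items, and by the number of ground-truth items); the function name and the example item ids are made up for illustration.

```python
def precision_recall_f(recalled, truth):
    """Per-user precision, recall and F-measure of a recalled set against ground truth."""
    recalled, truth = set(recalled), set(truth)
    hits = len(recalled & truth)
    precision = hits / len(recalled) if recalled else 0.0
    recall = hits / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical recalled set P_u and ground-truth set G_u for one user.
p, r, f = precision_recall_f(["a", "b", "c", "d"], ["b", "d", "e"])
print(p, r, f)
```

Here two of the four recalled items are relevant and two of the three relevant items are recalled. Dataset-level numbers would be obtained by averaging the per-user values.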

**4.2 Comparison results**

The following table shows the comparison results of each method on the two datasets. Compared with the best baseline method, DNN (too computationally expensive to apply in practice), JTM achieves relative recall improvements of 45.3% on Amazon Books and 8.1% on UserBehavior.

DNN performs better than YouTube product-DNN, which reflects the limitations of the inner product model: constructing the user preference probability in inner product form alone cannot fully fit the user-item preference distribution. In addition, TDM does not perform as well as DNN, which shows the necessity of tree structure optimization.

A poor tree structure may cause model learning to converge to a suboptimal result. Especially for sparse data like Amazon Books, the embeddings of nodes on the tree cannot be fully learned and are not sufficiently distinguishable, so TDM’s improvement is not significant. In contrast, the JTM-H scheme alleviates the data sparsity problem at coarse granularity to some extent by applying the hierarchical user interest representation proposed in this paper, and therefore improves very significantly over TDM on the Amazon Books dataset. Through joint optimization, JTM is significantly better than DNN on all datasets and evaluation metrics, with much lower retrieval complexity.

These results show that JTM can learn a better tree structure and interest model through joint optimization under a unified objective. The comparison among JTM, JTM-J and JTM-H shows that both joint learning under a common objective and the hierarchical user interest representation improve the final recommendation accuracy. In addition, under the joint framework of JTM, tree structure learning and hierarchical interest representation reinforce each other, producing an effect of 1 + 1 > 2.

**4.3 Learning convergence of the tree structure**

In tree-based recommendation methods, the tree structure directly affects sample generation during training and the retrieval path during prediction. A good tree structure plays an important positive role in both model training and interest retrieval. In the figure below, we compare JTM’s joint tree learning scheme based on a unified objective with TDM’s scheme based on clustering of item embeddings. The first three plots show the results on the Amazon Books dataset, and the last three show the results on the UserBehavior dataset.

The experimental results show that the JTM scheme proposed in this paper converges stably to a better tree structure as tree learning iterates. In contrast, the clustering-based scheme shows behavior similar to overfitting in the late iterations.

**V. Conclusion**

JTM provides an algorithmic framework for jointly optimizing the scoring model and the tree index structure of the tree-based deep matching model under a unified objective. For tree structure optimization, based on the characteristics of the tree structure, we propose a hierarchical reconstruction algorithm that can be used for large-scale tasks. For model optimization and scoring, based on the fact that tree retrieval refines the candidate set level by level, we propose a hierarchical modeling method for user behavior features.

JTM inherits the advantages of TDM: it breaks the constraints of the inner product model and can accommodate arbitrary deep scoring models. In addition, JTM brings significant improvement through joint optimization. JTM solves the suboptimal combination of separately optimized components in earlier recommendation system architectures, and establishes a fully data-driven, end-to-end system in which index, model and retrieval are jointly optimized. Furthermore, JTM is a major technical innovation over the existing user-tag-doc two-stage retrieval architecture of search, recommendation and advertising.

The previous TDM solution has been open-sourced on GitHub; it is built on X-DeepLearning, the deep learning platform developed by Alibaba. See the GitHub download link to learn more.


Author: Alimama technical team


This article is from alitech, a partner of yunqi community. If you need to reprint it, please contact the original author.