Deep Learning Ranking Practice Based on a Knowledge Graph in Dianping Search

Date: 2020-03-25

1. Introduction

Challenges and ideas

Search is the biggest entry point for users looking for information in the Dianping app, and an important link between users and information. However, users search in highly diverse ways and scenarios, and the variety of businesses served and the large differences in their traffic pose great challenges for Dianping search (hereinafter "Dianping search"), embodied in the following aspects:

  1. Diverse intents: users look for many types of information in many ways. Information types include POIs, lists, UGC, guides, influencers, and so on; search modes include by distance, by popularity, by dish, by location, and so on. For example, when searching by brand, users want the nearest or most frequently visited branch; but when searching by dish, they are more sensitive to the number of recommended dishes, and the distance factor is weakened.
  2. Diverse businesses: frequency of use, difficulty of decision, and business demands all differ across verticals. For example, home-decoration users are very low frequency, their behavior is sparse, distance matters little, and the decision cycle may be very long; food, by contrast, is mostly a real-time consumption scenario with abundant behavior data and high distance sensitivity.
  3. Diverse users: users differ greatly in price, distance, taste, and category preferences; search must mine these preferences deeply to deliver fully personalized results, a different ranking for every user.
  4. LBS search: compared with e-commerce and general web search, the added LBS dimension greatly increases the complexity of search scenarios. For example, between tourists and local residents, the former pay more attention to well-known local merchants when searching for food but are relatively insensitive to distance.

These characteristics, compounded with dimensions such as time, location, and scenario, give Dianping search more distinctive challenges than a general search engine. To meet them, we upgraded our NLP (natural language processing) technology, carried out deep query understanding and deep review analysis, and relied on knowledge graph and deep learning technology to upgrade the overall search architecture. With close cooperation between Meituan's NLP Center and the Dianping Search Intelligence Center, in only half a year the core KPI of Dianping search grew, on top of an already high baseline, six times as much as over the previous year and a half, completing the annual goal half a year ahead of schedule.

Search architecture reconstruction based on the knowledge graph

Meituan's NLP Center is building Meituan Brain, the world's largest knowledge graph for dining and entertainment (see "Meituan Brain: knowledge graph modeling methods and applications"). It fully mines and associates scenario data, uses NLP technology to let machines "read" user reviews, understands users' preferences in dishes, price, service, environment, and so on, and builds knowledge associations among users, merchants, commodities, and scenarios, forming a "knowledge brain" [1]. By injecting knowledge graph information into every stage of search, we upgraded and reshaped the overall architecture of Dianping search. Figure 1 shows the five-layer search architecture built on the knowledge graph. This article is the second in the Meituan Brain series (for the first, see "Meituan's dining and entertainment knowledge graph: unveiling Meituan Brain"); it introduces the evolution of the core ranking layer in this five-layer architecture, in three parts:

  1. The evolution of core ranking from traditional machine learning models to large-scale deep learning models.
  2. Feature engineering practice for the deep learning ranking model in search scenarios.
  3. LambdaDNN, a deep learning listwise ranking algorithm suited to search scenarios.

[Figure 1: the five-layer search architecture of Dianping search, built on the knowledge graph]

2. Exploration and practice of ranking models

Learning to rank (L2R) is a separate branch of machine learning, categorized mainly as follows:

  1. By sample generation method and loss function, L2R divides into pointwise, pairwise, and listwise approaches.
  2. By model structure, it divides into linear ranking models, tree models, deep learning models, and their combinations (GBDT+LR, Wide & Deep, etc.).

In terms of ranking models, Dianping search went through the iteration path common in the industry: from the early linear model LR, to FM and FFM with automatic second-order feature crosses, to the nonlinear tree models GBDT and GBDT+LR, and recently to a full migration to large-scale deep learning ranking models. Below we briefly review the application, advantages, and disadvantages of the traditional machine learning models (LR, FM, GBDT), then describe our exploration and practice with deep models in detail.

Traditional machine learning models


  1. LR can be regarded as a single-layer, single-node linear network. Its advantage is strong interpretability. Generally speaking, interpretability is a property the industry values: it means better controllability, and it guides engineers in analyzing problems and optimizing the model. However, LR relies on heavy manual feature mining, and its limited feature crosses cannot provide strong expressive power.
  2. FM can be regarded as adding second-order cross terms on top of LR. Automatic feature crossing reduces manual mining, increases the model's nonlinearity, and captures more information. FM automatically learns pairwise relations between features, but higher-order feature crosses are still out of reach (see the formula below this list).
  3. GBDT is a boosting model: a strong model is obtained by combining multiple weak models that successively fit the residual. Tree models have a natural advantage in mining high-order combinations of statistical features and retain decent interpretability. GBDT's main defects are its reliance on continuous statistical features and its inability to handle high-dimensional sparse features and sequence features well.
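For concreteness, FM's scoring function has the standard form below (the general formulation from the FM literature, not a Dianping-specific variant); the latent vectors $\mathbf{v}_i$ are learned automatically and supply the second-order cross terms:

$$\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle\, x_i x_j$$

Because each pairwise weight is factorized into an inner product of latent vectors, FM can estimate crosses even for feature pairs that rarely co-occur, but it stops at second order.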

Deep neural network models

As the business developed, it became harder and harder to squeeze metric gains out of the traditional models. At the same time, the complexity of the business required us to introduce multi-dimensional information sources, such as massive user history data and large-scale knowledge graph features, to achieve accurate and personalized ranking. Therefore, starting in the second half of 2018, we pushed the migration of the L2 core ranking layer's main model to a deep learning ranking model. The advantages of deep models are as follows:

  1. Powerful fitting ability: a deep network consists of multiple hidden layers and nodes; with nonlinear activation functions it can in theory fit any function, which suits the complex scenarios of Dianping search well.
  2. Strong feature representation and generalization: deep models can handle many features that traditional models cannot. For example, a deep network can learn the hidden information of high-dimensional sparse IDs directly from massive training samples and represent it with embeddings; for text, sequence, and image features, deep networks have corresponding structures or units to process them.
  3. Automatic feature combination and discovery: DeepFM proposed by Huawei and the Deep & Cross Network proposed by Google can combine features automatically, replacing a great deal of manual work.

The figure below shows a network structure based on the Wide & Deep model proposed by Google [2]. The wide part takes the fine-grained statistical features already used in the LR and GBDT stages; these long-window, high-frequency behavioral statistics provide good memorization. The deep part learns low-order, high-dimensional sparse categorical features through a deep neural network, fits the long tail of the samples, discovers new feature combinations, and improves the model's generalization. In addition, features that traditional machine learning models struggle to describe, such as text and the header image, can be preprocessed and represented end to end (a minimal Keras sketch follows the figure below).

[Figure: network structure based on the Wide & Deep model]
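As a minimal illustration of this structure, the Keras sketch below wires a wide statistical input and an embedded sparse ID into one model; the layer sizes, feature names, and vocabulary sizes are invented placeholders, not the production configuration:

```python
import tensorflow as tf

# Illustrative sizes; the real feature set and vocabularies are much larger.
NUM_WIDE_FEATURES = 100        # hand-crafted statistical / cross features
CATEGORY_VOCAB_SIZE = 10_000   # e.g. a merchant-category ID vocabulary
EMBEDDING_DIM = 16

wide_in = tf.keras.Input(shape=(NUM_WIDE_FEATURES,), name="wide_stats")
cat_in = tf.keras.Input(shape=(1,), dtype="int32", name="category_id")

# Deep part: sparse ID -> dense embedding -> stacked fully connected layers.
deep = tf.keras.layers.Embedding(CATEGORY_VOCAB_SIZE, EMBEDDING_DIM)(cat_in)
deep = tf.keras.layers.Flatten()(deep)
for units in (256, 128, 64):
    deep = tf.keras.layers.Dense(units, activation="relu")(deep)

# Wide part: the memorized statistical features feed the output directly.
out = tf.keras.layers.Dense(1, activation="sigmoid")(
    tf.keras.layers.Concatenate()([wide_in, deep]))

model = tf.keras.Model([wide_in, cat_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy")
```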

3. Feature engineering practice for the search deep ranking model

The rise of deep learning frees algorithm engineers from much of the manual feature mining and combination work; there is even an argument that engineers specializing in feature engineering risk unemployment. So far, however, automatic feature learning shows mainly in the CV field, where the raw features are pixels, dense low-order signals, and convolution layers can automatically combine and transform them, clearly beating previously hand-defined image features. In NLP, the Transformer has brought great progress in automatic feature mining; BERT used it to reach state-of-the-art results on multiple NLP tasks.

In CTR prediction and learning to rank, however, deep learning's automatic feature mining has not yet decisively displaced manual feature engineering, which therefore remains very important. Of course, feature engineering for deep models differs somewhat from that for traditional models. Our work focuses mainly on the following aspects.

3.1 Feature preprocessing

  • Feature normalization: deep network training relies almost entirely on back-propagation, and such gradient-based optimization is very sensitive to feature scale, so features must be normalized or standardized for the model to converge well.
  • Feature discretization: in industry, continuous values are rarely fed to the model directly; they are usually discretized first. Discretized features are more robust to outliers and add nonlinearity to the model, and they are also easier to embed. We mainly use the following two discretization methods (see the sketch after this list):

    • Equal-frequency bucketing: split by sample frequency so each bucket holds roughly the same number of samples. Missing values can go to a default bucket or a dedicated one.
    • Tree-model bucketing: equal-frequency discretization often works poorly when a feature's distribution is very uneven. In that case we train a tree model on the single feature against the label, take the tree's split points as bucket boundaries, and use the corresponding leaf nodes as bucket IDs.
  • Feature combination: based on the business scenario, basic features are combined into richer behavioral representations, giving the model prior information and speeding up convergence. Typical examples:

    • Crossing user gender with category captures gender differences in category preference; for example, male users may pay less attention to "beauty" merchants.
    • Crossing time with category captures temporal differences between categories; for example, bars are more likely to be clicked at night.
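The sketch below illustrates both discretization methods and a simple feature cross on toy data; the column names, bucket counts, and tree depth are hypothetical choices:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "ctr_30d": rng.random(1000),                        # continuous statistic
    "gender": rng.choice(["m", "f"], 1000),
    "category": rng.choice(["food", "beauty", "bar"], 1000),
    "label": rng.integers(0, 2, 1000),
})

# Equal-frequency bucketing: each bucket holds roughly the same sample count.
df["ctr_eqfreq_bucket"] = pd.qcut(df["ctr_30d"], q=10, labels=False,
                                  duplicates="drop")

# Tree-model bucketing: fit a shallow tree on the single feature vs. the label
# and reuse its split thresholds as bucket boundaries.
tree = DecisionTreeClassifier(max_leaf_nodes=8).fit(df[["ctr_30d"]], df["label"])
thresholds = sorted(t for t in tree.tree_.threshold if t != -2)  # -2 marks leaves
df["ctr_tree_bucket"] = np.digitize(df["ctr_30d"], thresholds)

# Feature combination: cross gender with category to encode preference priors.
df["gender_x_category"] = df["gender"] + "_" + df["category"]
```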

3.2 Embedding

The greatest appeal of deep learning is its powerful feature representation ability. In the Dianping search scenario we have massive user behavior data, rich merchant UGC, and the multi-dimensional, fine-grained label data provided by Meituan Brain. We use deep learning to embed this information into multiple vector spaces, representing users' personalized preferences and precise merchant profiles with embeddings. Vectorized embeddings are also convenient for the deep model's further generalization, combination, and similarity computation.

3.2.1 Embedding of user behavior sequences

User behavior sequences (query sequences, merchant click sequences, filter sequences) contain rich user preference information. For example, when a user filters by "nearest first", the user is probably in an instant-consumption scenario and more sensitive to distance. Behavior sequence features generally enter the network in the three ways shown in the figure below:

  • Pooling: after embedding, apply a sum/average pooling layer. This is cheap to adopt, but it ignores the temporal order of behaviors.
  • RNN: feed the sequence into an LSTM/GRU and aggregate it with the recurrent network. The cost is higher model complexity and slower online inference.
  • Attention: after embedding the sequence, apply an attention mechanism, realized as weighted-sum pooling; its computation is cheaper than LSTM/GRU [4] (a sketch follows the figure).

[Figure: three ways of connecting behavior-sequence features]
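A minimal sketch of the attention variant, assuming padded fixed-length sequences and dot-product attention against the current query context; the shapes and scoring function are illustrative choices, not the production design:

```python
import tensorflow as tf

def attention_pooling(seq_emb, context_emb, mask):
    """Weighted-sum pooling over a user behavior sequence.

    seq_emb:     [batch, seq_len, dim] embedded behaviors (clicks, queries...).
    context_emb: [batch, dim] embedding of the current query/context.
    mask:        [batch, seq_len] 1.0 for real behaviors, 0.0 for padding.
    """
    logits = tf.einsum("bd,bsd->bs", context_emb, seq_emb)  # dot-product scores
    logits += (1.0 - mask) * -1e9                 # exclude padded positions
    weights = tf.nn.softmax(logits, axis=-1)      # attention weights per step
    # Weighted sum: aggregates the sequence at a much lower serving cost
    # than an LSTM/GRU pass over the same steps.
    return tf.einsum("bs,bsd->bd", weights, seq_emb)
```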

Meanwhile, to distinguish the effects of users' long-term and short-term preferences on ranking, we also split behavior sequences along the time dimension, at session, half-hour, one-day, and one-week granularities, which likewise yielded online gains.

3.2.2 Embedding of user IDs

A common way to describe user preference is to embed user IDs directly and feed them to the model as features, but the online effect was unsatisfactory. Analyzing user behavior data, we found that a considerable share of user IDs have sparse behavior data, so their embeddings do not converge sufficiently and fail to describe users' preferences well.

Airbnb's KDD 2018 paper offers a solution to this problem [9]: cluster user IDs using basic user profiles and behavior data. Airbnb's main scenario is short-term lodging for travelers, who generally travel only once or twice a year, so Airbnb's user behavior data is even sparser than Dianping search's.

[Figure: Airbnb-style construction of user types from bucketized features]

As shown in the figure above, the user's profile and behavior features are bucketized, and the feature abbreviations and bucket numbers are spliced together; the resulting cluster ID looks like: US_lt1_pn3_pg3_r3_s4_c2_b1_bd2_bt2_nu3.

We adopted a similar scheme; the sparsity problem was solved well, and we gained some extra benefits. As a platform for local life-service information, most users' behavior concentrates where they live, which used to leave ranking unpersonalized when a user arrived somewhere new. This clustering gathers users with the same behavior in different places, which also alleviates some of these cross-location personalization problems (a minimal sketch of the ID construction follows).
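A minimal sketch of the cluster-ID construction in the spirit of the Airbnb paper; the feature abbreviations and bucket boundaries are invented for illustration and do not reproduce either Airbnb's or our actual scheme:

```python
def bucket(value, boundaries):
    """1-based bucket index of `value` given ascending boundaries."""
    for i, b in enumerate(boundaries):
        if value < b:
            return str(i + 1)
    return str(len(boundaries) + 1)

def user_type_id(user):
    """Splice feature abbreviations with bucket numbers into one cluster ID."""
    parts = [
        user["country"],                                    # e.g. "US"
        "lt" + bucket(user["lifetime_orders"], [1, 5, 20]),
        "pn" + bucket(user["price_norm"], [0.3, 0.6, 0.9]),
        "r" + bucket(user["review_count"], [1, 10, 50]),
    ]
    return "_".join(parts)

# Two users with the same bucketed behavior in different cities share this ID,
# so one well-trained embedding covers both of them.
print(user_type_id({"country": "US", "lifetime_orders": 3,
                    "price_norm": 0.5, "review_count": 12}))  # -> US_lt2_pn2_r3
```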

3.2.3 Embedding of merchant information

Besides feeding merchant IDs into the model directly, Meituan Brain also uses deep learning to mine UGC and fully portray fine-grained merchant attributes such as taste and characteristics, for example the labels "easy parking", "refined dishes", and "would visit again" shown in the figure below.

[Figure: fine-grained merchant labels mined from UGC]

Compared with bare star ratings and review counts, this information covers more angles at a finer granularity. We embed these labels as well and feed them into the model in one of three ways:

  • Direct connection: pool the label features and feed them into the input layer directly. This suits end-to-end learning, but limited by the input layer's size it can only take the top labels and easily loses abstract entity information.
  • Grouped direct connection: similar to direct connection, but the labels are first classified, e.g., into dishes/styles/tastes; the top-N entities of each class are then pooled into semantic vectors of separate dimensions. Compared with ungrouped direct connection, it retains more abstract information.
  • Sub-model connection: a DSSM-style model can take the labels as input to learn a merchant embedding. This preserves the labels' abstract information best, but the online implementation and computation costs are high.

3.2.4 Accelerating the convergence of embedding features

In our deep ranking model, besides embedding features there are also many strong memorization features on the query, shop, and user dimensions, which converge quickly. To accelerate the convergence of the embedding features, we tried the following solutions:

  • Low-frequency filtering: filtering out very low-frequency features greatly reduces the parameter count and avoids overfitting.
  • Pre-training: pre-train the sparse embedding features with auxiliary models, then fine-tune them inside the ranking model:

    • Model user-merchant click relations with unsupervised models such as word2vec and fastText to generate merchant embeddings from co-occurrence.
    • Model query-merchant click behavior with supervised models such as DSSM to obtain query and merchant embeddings.
  • Multi-task: give the sparse embedding features a separate sub-loss, as shown in the figure below (and in the sketch after it). The embedding features then receive gradients from both loss functions, while the sub-loss is independent of the strong features, which accelerates embedding convergence.

[Figure: a separate sub-loss drives the sparse embedding features]
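A minimal sketch of this sub-loss setup: the auxiliary head is fed only by the sparse embedding sub-network, so the embedding parameters receive gradients from both terms while the strong memorization features touch only the main term. The names and the 0.3 weight are assumptions for illustration:

```python
import tensorflow as tf

def total_loss(labels, main_logits, aux_logits, aux_weight=0.3):
    """labels: click labels; main_logits: output of the full network;
    aux_logits: output of a head built only on the sparse embeddings."""
    main = tf.nn.sigmoid_cross_entropy_with_logits(labels=labels,
                                                   logits=main_logits)
    aux = tf.nn.sigmoid_cross_entropy_with_logits(labels=labels,
                                                  logits=aux_logits)
    # Embedding features sit on both gradient paths; the sub-loss path is
    # independent of the strong features, so the embeddings converge faster.
    return tf.reduce_mean(main) + aux_weight * tf.reduce_mean(aux)
```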

3.3 Image features

Images occupy a large display area on the search results page, and their quality directly affects user experience and clicks. A merchant's header image comes from merchant and user uploads, and its quality varies widely, so image features are also an important class of inputs to the ranking model. Dianping search currently uses mainly the following image features:

  • Basic features: extract basic information such as brightness and chroma/saturation, then discretize it into basic image features.
  • Generalization features: extract image features with ResNet50 [3] and obtain generalized image features by clustering.
  • Quality features: use our in-house image quality model and take an intermediate layer's output as the image quality embedding.
  • Label features: extract whether the image shows food, environment, a price list, a logo, and so on, as image classification and label features.

[Figure: image features used in Dianping search]

4. LambdaDNN: a deep learning listwise ranking algorithm for search scenarios

4.1 The gap between search business metrics and the model's optimization objective

In general there is always some gap between a model's prediction target and the business metric. If the prediction target is close to the business target, improving the model reliably improves the business metrics; otherwise, the model's offline metrics improve while the key online metrics barely move, or even regress. Most industrial deep ranking models use pointwise log loss, which has a large gap from the search business metrics, reflected in two aspects:

  1. Search businesses commonly track QV_CTR or SSR (session success rate), which care about whether a user's search succeeds (whether there is any click), whereas pointwise log loss focuses on the click-through rate of a single item.
  2. The search business cares more about the quality of the results at the top of the page, whereas the pointwise approach treats all samples equally.


For these reasons, we optimized the loss function of the deep learning model.

4.2 Improving the optimization objective: from log loss to NDCG

To bring the ranking model's optimization objective as close as possible to the search business metrics, the loss must be computed per query, and samples at different positions must carry different weights. Compared with log loss, NDCG (Normalized Discounted Cumulative Gain), commonly used in search systems, is clearly closer to what the business requires. NDCG is computed as follows:

$$\mathrm{DCG@}k = \sum_{j=1}^{k} G(l_j)\,\eta(j), \qquad \mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{Z_k}$$

The cumulative part is DCG (Discounted Cumulative Gain), the position-discounted gain. For the result list under a query, the function $G$ gives the relevance gain of each doc and is usually exponential: $G(l_j) = 2^{l_j} - 1$, where $l_j$ is the relevance grade (e.g., in {0, 1, 2}). The function $\eta$ is the position discount, generally $\eta(j) = 1/\log_2(j+1)$. The more relevant the docs near the top, the larger the DCG. Since we usually only care about the ranking quality of the top k of the list page, $Z_k$ denotes the maximum possible DCG@k, and the normalized result is NDCG@k (a small computation sketch follows).
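A small NumPy rendering of the formula, with exponential gain and a base-2 logarithmic position discount:

```python
import numpy as np

def dcg_at_k(relevances, k):
    """DCG@k with gain G(l) = 2^l - 1 and discount eta(j) = 1/log2(j + 1)."""
    rel = np.asarray(relevances, dtype=float)[:k]
    gains = 2.0 ** rel - 1.0
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))  # j = 1..k
    return float(np.sum(gains * discounts))

def ndcg_at_k(relevances, k):
    """Normalize by Z_k, the ideal DCG@k obtained by sorting by relevance."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance grades {0, 1, 2} in ranked order; a perfect ordering scores 1.0.
print(ndcg_at_k([2, 1, 0, 1], k=4))  # ~0.98: one relevant doc sits too low
```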

The problem is that NDCG is non-smooth everywhere, so optimizing it directly is infeasible. LambdaRank provides a way out: bypass the objective function itself, directly construct a special gradient, update the model parameters along that gradient, and ultimately fit NDCG [6]. If we can back-propagate this gradient through a deep network, we can train a deep network that optimizes NDCG. This gradient is called the lambda gradient, and the deep learning network built on it is what we call LambdaDNN.

Understanding the lambda gradient requires first introducing LambdaRank. LambdaRank is constructed pairwise: under the same query, a clicked sample and an unclicked sample usually form a sample pair. The basic assumption of the model is as follows: let $P_{ij}$ be the probability that $doc_i$ is more relevant than $doc_j$ under the same query, where $s_i$ and $s_j$ are the model scores of $doc_i$ and $doc_j$:

$$P_{ij} = \frac{1}{1 + e^{-\sigma(s_i - s_j)}}$$

Using cross-entropy as the loss function, let $S_{ij}$ denote the true label of the sample pair: $S_{ij} = 1$ when $doc_i$ is more relevant than $doc_j$ (i.e., $doc_i$ was clicked by the user and $doc_j$ was not), and $S_{ij} = -1$ otherwise. The loss function can be written as:

$$C = \frac{1}{2}\,(1 - S_{ij})\,\sigma(s_i - s_j) + \log\!\left(1 + e^{-\sigma(s_i - s_j)}\right)$$

When constructing pairs we can always let $i$ be the more relevant document, so that $S_{ij} \equiv 1$. Substituting this into the formula above and differentiating, the gradient of the loss function is:

$$\lambda_{ij} = \frac{\partial C}{\partial s_i} = -\frac{\partial C}{\partial s_j} = \frac{-\sigma}{1 + e^{\sigma(s_i - s_j)}}$$

So far the loss has not taken the samples' positions into account. We therefore refine the gradient further by weighting it with the change in NDCG when $doc_i$ and $doc_j$ swap positions; the formula below is the lambda gradient mentioned above. It can be shown that iterative updates with this gradient achieve the goal of optimizing NDCG.

$$\lambda_{ij} = \frac{-\sigma}{1 + e^{\sigma(s_i - s_j)}}\,\left|\Delta\mathrm{NDCG}_{ij}\right|$$

The physical meaning of the lambda gradient is shown in the figure below, where blue marks the more relevant (clicked) documents: the lambda gradient pushes docs near the top of the list harder (red arrows). With this way of computing gradients, during training we use the deep network to score the docs under the same query, compute lambda gradients from the users' actual clicks, and back-propagate them through the network, obtaining a deep network that directly optimizes NDCG (a computation sketch follows the figure).

[Figure: physical meaning of the lambda gradient]
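The sketch below puts the two formulas together: it accumulates lambda gradients for the docs under one query, scaling the pairwise RankNet gradient by the $|\Delta\mathrm{NDCG}|$ swap term. For simplicity it assumes the docs are listed in their current ranked order, so list index equals position; this is an illustrative NumPy rendering, not the production implementation:

```python
import numpy as np

def lambda_gradients(scores, relevances, sigma=1.0):
    """Accumulate lambda gradients for the docs of one query.

    scores:     model scores, listed in the current ranked order.
    relevances: relevance grades (e.g. 0/1 for unclicked/clicked).
    """
    rel = np.asarray(relevances, dtype=float)
    n = rel.size
    gains = 2.0 ** rel - 1.0                            # G(l) = 2^l - 1
    discounts = 1.0 / np.log2(np.arange(2, n + 2))      # eta(j) = 1/log2(j+1)
    # Z_k: ideal DCG; guard against an all-zero relevance list.
    ideal = float(np.sum(np.sort(gains)[::-1] * discounts)) or 1.0
    lambdas = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if rel[i] <= rel[j]:
                continue   # keep only pairs where doc_i is more relevant
            # |delta NDCG| if doc_i and doc_j swapped positions.
            delta = abs((gains[i] - gains[j]) *
                        (discounts[i] - discounts[j])) / ideal
            # Pairwise RankNet gradient scaled by the swap term; it is negative,
            # so a descent update s -= lr * lambda raises the relevant doc.
            lam = -sigma * delta / (1.0 + np.exp(sigma * (scores[i] - scores[j])))
            lambdas[i] += lam
            lambdas[j] -= lam
    return lambdas

# The clicked doc (index 1) gets a negative lambda: its score rises under descent.
print(lambda_gradients([2.0, 0.5, 1.0], [0, 1, 0]))
```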

4.3 Engineering implementation of LambdaDNN

We train the LambdaDNN model on the distributed TensorFlow framework. As mentioned above, the lambda gradient must be computed over the samples of the same query, but normally all samples are shuffled randomly across workers, so the samples must be preprocessed:

  1. Shuffle by query ID, aggregate the samples of the same query, and pack them into one TFRecord (a packing sketch follows this list).
  2. Because the number of docs recalled differs across query requests, note that for these variable-sized query samples TF pads every sample in a mini-batch to the same size when pulling data for training, which injects a large number of meaningless default values into the input. We handle this in two ways:

    • Process the keys in the MR stage so that several queries' samples are aggregated together, then split them dynamically during training.
    • Read the padded samples, locate the padded entries via a preset padding marker, and drop the padded data.
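A minimal sketch of the per-query packing, using the standard tf.train.Example API; the feature names and layout are illustrative assumptions, not our production schema:

```python
import tensorflow as tf

def pack_query(query_id, doc_features, labels):
    """Serialize all docs recalled for one query into a single TFRecord Example.

    doc_features: list of equal-length float feature vectors, one per doc.
    labels:       one click label per doc.
    """
    flat = [v for doc in doc_features for v in doc]  # row-major flatten
    example = tf.train.Example(features=tf.train.Features(feature={
        "query_id": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[query_id.encode()])),
        # doc_count lets the trainer split the flat tensor back into docs
        # dynamically, so padding never reaches the loss.
        "doc_count": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[len(doc_features)])),
        "features": tf.train.Feature(
            float_list=tf.train.FloatList(value=flat)),
        "labels": tf.train.Feature(
            float_list=tf.train.FloatList(value=[float(l) for l in labels])),
    }))
    return example.SerializeToString()

record = pack_query("q_123", [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], [1, 0, 0])
```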


To improve training efficiency, we worked closely with the Data Platform Center of the Basic R&D Platform to explore and verify a number of optimizations:

  1. Complete the mapping of ID-type features and similar operations during preprocessing, to avoid repeating the computation across multiple training epochs.
  2. Convert the samples to TFRecord, read them with TFRecordDataset, and pipeline parsing with computation; per-worker computation throughput improved by roughly 10x.
  3. Concat multiple categorical features into a single multi-hot tensor for one embedding_lookup, which reduces map operations and helps shard parameter storage and computation.
  4. When computing gradients and regularization on sparse tensors, keep the index values and update only the entries that actually have values.
  5. Shard large tensor variables across multiple parameter servers to reduce the communication pressure of synchronized worker updates, lessen update blocking, and obtain smoother gradient updates.

Overall, for a sample size of about 3 billion and features on the hundred-million scale, one round of iteration completes in about half an hour; with suitably more parallel computing resources, minute-level training is achievable.

4.4 Further improving the optimization objective

In the NDCG formula, the loss weight decays with position according to the fixed theoretical discount. However, the curve of actual exposure CTR against position differs considerably from NDCG's theoretical discount.

In mobile scenarios, as users scroll and flip through the list, their visual focus moves with the screen and the page turns. For example, after turning to the second page users tend to refocus, so the exposure CTR at the head of page two is actually higher than at the tail of page one. We tried two schemes to adjust the position discount in NDCG:

  1. Fit the discount curve to the actual exposure CTR: fit a formula to the statistically observed exposure CTR at each position and use it in place of the theoretical discount in NDCG; the fitted curve is shown in Figure 12.
  2. Compute position bias as the position discount: position bias has been widely discussed in the industry [7]. The process of a user clicking a merchant is decomposed into two steps, observe then click: (a) the user must first see the merchant, with a probability that depends only on position; (b) having seen it, the probability of clicking depends only on the merchant's relevance. The probability from step (a) is the position bias. There is much more that could be discussed here, which we will not detail (a sketch of this decomposition follows).
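As a toy illustration of scheme 2, the sketch below replaces the theoretical $1/\log_2(j+1)$ discount with an estimated examination probability $P(\text{seen} \mid \text{position})$; the numbers are invented placeholders (chosen to show the page-2 head rebound), not measured values:

```python
import numpy as np

# P(seen | position), estimated from exposure logs in practice; page 2 starts
# at position 6 here, and its head rebounds above the tail of page 1.
examine_prob = np.array([1.00, 0.72, 0.54, 0.47, 0.41,   # page 1
                         0.52, 0.44, 0.38, 0.33, 0.30])  # page 2

def empirical_dcg(relevances):
    """DCG with the empirical examination probability as the position discount."""
    gains = 2.0 ** np.asarray(relevances, dtype=float) - 1.0
    return float(np.sum(gains * examine_prob[:gains.size]))
```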

[Figure 12: position-discount curve fitted to the actual exposure CTR]

Compared with the base tree model and the pointwise DNN model, the LambdaDNN model trained with the NDCG computation modified as above improved the business metrics very significantly.

[Figure: business-metric comparison of LambdaDNN against the tree model and pointwise DNN]

4.5 The Lambda deep ranking framework

Besides DNN, the lambda gradient can be combined with most common network structures. To learn more feature crosses, we tried LambdaDeepFM and LambdaDCN networks on top of LambdaDNN. The DCN network is a parallel structure with a cross network in which each layer's output features are explicitly crossed with the original first-layer input, so each cross layer is equivalent to learning the residual of the mapping over the crossed features.

[Figure: the Lambda deep ranking framework]

Offline comparative experiments show that combining the lambda gradient with the DCN network brings DCN's strengths into full play; its concise polynomial crossing design effectively improves the training effect of the model. The comparison of NDCG metrics is shown in the figure below:

[Figure: NDCG comparison across the Lambda-based models]

5. Deep learning ranking diagnosis system

Although the deep ranking model greatly improved the business metrics, its black-box nature created huge interpretation costs and brought several problems to the search business:

  1. Daily search bad cases cannot be answered quickly: the search team daily handles a stream of "soul-searching" questions from users, merchants, and bosses, such as "why is it ranked this way?" or "this merchant's quality is similar to mine, why does it rank ahead of me?". When we first switched to the deep ranking model we were at a loss with such questions and spent a great deal of time locating the causes.
  2. No way to learn from bad cases and distill rules for continuous optimization: without understanding why the ranking model produced a bad result, we cannot locate what is wrong with the model, nor summarize patterns from bad cases to set the future optimization direction of the model and features.
  3. Whether the model and features are fully learned is unknown: after mining new features, we usually decide whether to ship them by whether the offline metrics improve. But even when a feature brings gains, we cannot tell whether it performs well enough, for example whether the distance feature fitted by the model scores farther merchants higher within some particular distance band.

All these problems can produce ranking results that users cannot understand; we need to clearly diagnose and explain the deep ranking model.

There has been some exploration of machine learning interpretability. LIME (Local Interpretable Model-agnostic Explanations) is one approach, shown in the figure below: perturb the features of a single sample to generate nearby samples and observe the model's predictions on them; weight these perturbed points by their distance from the original data, fit an interpretable model on them, and read the explanation off that model's predictions [5]. For example, to explain how a sentiment classifier predicts that "I hate this movie" is negative, we can drop some words or shuffle them into new samples, predict sentiment on each, and ultimately find that the word "hate" is what makes the sentence negative.
[Figure: the LIME explanation workflow]

Based on LIME's idea of an interpreter, we developed a deep-model interpretation tool, the Athena system. Athena currently supports two working modes, pairwise and listwise:

  1. The pairwise mode explains the relative order of two results in the same list. By reassigning or swapping sample features and observing how the scores and ranks move, it diagnoses whether the current order is as expected (a minimal sketch follows this list). As shown in the figure below, the feature panel on the right quickly diagnoses why "Nanjing Dapaidang" ranks above "Golden Age Shunfeng Harbor": the first row of the feature-rank information shows that if "Golden Age Shunfeng Harbor"'s 1.3 km distance feature were replaced by "Nanjing Dapaidang"'s 0.2 km distance feature, its rank would rise by 10; we can therefore conclude that the decisive factor putting "Nanjing Dapaidang" in front is its proximity.
  2. The listwise mode works basically like LIME: it generates perturbed samples from the whole result list, trains a linear model on them, and outputs that model's feature importances as the explanation.
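The sketch below captures the core of the pairwise mode: swap one feature value between two docs, re-score the list, and report the rank shift. The `model` object and feature names are stand-ins; the real Athena interfaces are not shown in this article.

```python
import copy

def rank_of(model, docs, idx):
    """1-based rank of docs[idx] after scoring the whole list."""
    scores = model.predict(docs)
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return order.index(idx) + 1

def feature_swap_effect(model, docs, idx_a, idx_b, feature_name):
    """How many positions doc A gains if it borrows doc B's feature value."""
    base_rank = rank_of(model, docs, idx_a)
    perturbed = copy.deepcopy(docs)                      # docs: list of feature dicts
    perturbed[idx_a][feature_name] = docs[idx_b][feature_name]
    return base_rank - rank_of(model, perturbed, idx_a)  # > 0 means A moves up

# e.g. feature_swap_effect(model, docs, a, b, "distance_km") == 10 would mean
# borrowing doc B's distance lifts doc A ten positions.
```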

[Figure: the Athena pairwise diagnosis panel]

6. Summary and outlook

In the second half of 2018, Dianping search completed a comprehensive upgrade from tree models to a large-scale deep learning ranking model. The team explored deep learning feature engineering, model structure, optimization objectives, and engineering practice, and achieved significant gains in core metrics. Of course, many directions remain to explore.

At the feature level, the vast label information provided by the knowledge graph has not been fully mined. In terms of usage, simply attaching labels as text loses the knowledge graph's structural information, so graph embedding is a direction to try next. The team will also use BERT for deeper semantic representation of queries and merchant text.

At the model-structure level, the online network is still dominated by fully connected DNN structures, which learn low-rank data less well than DeepFM and DCN. LambdaDeepFM and LambdaDCN have already shown offline gains, and we will further optimize the network structure.

On the optimization objective, lambda loss considers only the clicked/unclicked sample pairs within a query and discards the many queries with no click at all. Meanwhile, the same user's behavior across different queries within a short time also carries usable information. The team is therefore exploring models that jointly consider log loss and lambda loss, using multi-task learning and shuffling samples along different dimensions so the model learns fully; we have already seen some offline gains.

Finally, the groupwise model proposed in Google's recently open-sourced TF-Ranking is also inspiring. Most listwise methods today are listwise only in the training stage; in scoring and prediction they are still pointwise, considering only the current merchant's features and not the context of the list. We will explore this direction in the future.

References

  1. Meituan Brain: knowledge graph modeling methods and applications
  2. Wide & Deep Learning for Recommender Systems
  3. Deep Residual Learning for Image Recognition
  4. Attention Is All You Need
  5. Local Interpretable Model-Agnostic Explanations: LIME
  6. From RankNet to LambdaRank to LambdaMART: An Overview
  7. A Novel Algorithm for Unbiased Learning to Rank
  8. Unbiased Learning-to-Rank with Biased Feedback
  9. Real-time Personalization using Embeddings for Search Ranking at Airbnb

About the authors

  • Feiyi joined Meituan-Dianping in 2016; senior algorithm engineer, currently mainly responsible for R&D of the core ranking layer of Dianping search.
  • Zhu Sheng joined Meituan-Dianping in 2016; senior algorithm engineer, currently responsible for R&D of the core ranking layer of Dianping search.
  • Tang Biao joined Meituan-Dianping in 2013; senior algorithm expert and search technology director of the Dianping platform, committed to delivering deep query understanding and large-scale deep learning ranking.
  • Zhang Gong joined Meituan-Dianping in 2012; researcher at Meituan-Dianping, currently mainly responsible for the evolution of the Dianping search business and building the group's shared search platform.
  • Zhong Yuan, Ph.D., head of the NLP Center in Meituan's AI Platform Department and head of the Dianping Search Intelligence Center. He has published more than 30 papers at top international academic conferences, won the ICDE 2015 Best Paper Award, was the speaker of the ACL 2016 tutorial "Understanding Short Texts", has published 3 academic monographs, and holds 5 US patents. He previously served as a research manager at Microsoft Research Asia and a research scientist at Facebook, responsible for Microsoft Research's knowledge graph and dialogue bot projects and Facebook's production-level NLP services.
