Search NLP industry model and lightweight customization

Time: 2022-5-14

Introduction: The OpenSearch NLP industry models and lightweight customization solution reduce customers' labeling costs. With no labeling, or only a small amount of simple labeling, they make it easier to extend search to new domains.
Special guests:

Xu Guangwei (kunka) — Alibaba algorithm expert

Search NLP algorithm

Search pipeline

This is the complete pipeline from query terms to search results. NLP algorithms mainly act in the second stage, query analysis, which contains multiple NLP modules, such as word segmentation, error correction, entity recognition, term weighting, and synonyms on the text side, as well as semantic vectors. The system uses a multi-channel recall and ranking architecture that combines text and semantic vectors, so that it can meet the search quality requirements of different business scenarios. Beyond query analysis, NLP algorithms are also widely applied in the first stage (search guidance) and the fourth stage (ranking service).
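The multi-channel recall idea can be illustrated with a minimal sketch. It assumes a toy in-memory corpus, hand-written two-dimensional vectors, and a simple weighted merge of the two scores; the names `DOCS`, `text_recall`, `vector_recall`, and `search` are hypothetical and are not OpenSearch APIs.

```python
import math

# Toy corpus; the texts and vectors are made up for illustration only.
DOCS = {
    "d1": {"text": "red running shoes", "vec": [0.9, 0.1]},
    "d2": {"text": "blue sneakers for jogging", "vec": [0.8, 0.2]},
    "d3": {"text": "kitchen knife set", "vec": [0.1, 0.9]},
}


def text_recall(query_terms):
    """Text channel: score each document by the share of query terms it contains."""
    hits = {}
    for doc_id, doc in DOCS.items():
        overlap = len(set(query_terms) & set(doc["text"].split()))
        if overlap:
            hits[doc_id] = overlap / len(query_terms)
    return hits


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def vector_recall(query_vec, threshold=0.9):
    """Semantic channel: keep documents whose vector is close to the query vector."""
    hits = {}
    for doc_id, doc in DOCS.items():
        score = cosine(query_vec, doc["vec"])
        if score >= threshold:
            hits[doc_id] = score
    return hits


def search(query_terms, query_vec, text_weight=0.5):
    """Merge both recall channels, then sort candidates by the combined score."""
    text_hits = text_recall(query_terms)
    vec_hits = vector_recall(query_vec)
    candidates = set(text_hits) | set(vec_hits)
    scored = {
        doc_id: text_weight * text_hits.get(doc_id, 0.0)
        + (1 - text_weight) * vec_hits.get(doc_id, 0.0)
        for doc_id in candidates
    }
    return sorted(scored, key=scored.get, reverse=True)


print(search(["running", "shoes"], [0.9, 0.1]))
```

In this toy run the text channel alone would miss the "sneakers" document because it shares no query term; the vector channel recalls it, which is the motivation for combining the two.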


Query analysis

NLP algorithms mainly act in the following submodules:


  • Word segmentation: accurate word segmentation improves retrieval efficiency and makes recall results more accurate.
  • Spelling correction: spelling errors in the query entered by the user are corrected automatically to improve the search experience.
  • Entity recognition: each word in the query is tagged with its entity label, which provides key features for subsequent query rewriting and ranking.
  • Term weighting: each word is assigned a high, medium, or low weight level, so that low-weight words can be dropped and the query re-run when it returns no results.
  • Synonyms: words are expanded with other words of the same meaning to widen the scope of recall.
  • Query rewriting: finally, after the full query analysis, the query entered by the user is rewritten as a whole into a query string that our search engine can recognize.
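The modules above can be chained as in the following minimal sketch. Everything in it is an assumption made for illustration: the dictionaries are toy data, whitespace splitting stands in for a real segmentation model, the `AND`/`OR` output is a made-up query syntax rather than the actual OpenSearch query string, and `relax` shows only one plausible way to drop low-weight words before retrying a query.

```python
from dataclasses import dataclass, field

# Toy resources; every entry is hypothetical.
CORRECTIONS = {"nkie": "nike", "shose": "shoes"}
ENTITY_LABELS = {"nike": "brand", "shoes": "category"}
LABEL_WEIGHTS = {"brand": "high", "category": "high"}
SYNONYMS = {"shoes": ["sneakers"]}


@dataclass
class Term:
    text: str
    entity: str = "other"
    weight: str = "low"
    synonyms: list = field(default_factory=list)


def analyze(query):
    """Run the query through the analysis steps in order."""
    tokens = query.lower().split()                      # 1. segmentation (stand-in)
    tokens = [CORRECTIONS.get(t, t) for t in tokens]    # 2. spelling correction
    terms = []
    for token in tokens:
        entity = ENTITY_LABELS.get(token, "other")      # 3. entity recognition
        weight = LABEL_WEIGHTS.get(entity, "low")       # 4. term weighting
        synonyms = SYNONYMS.get(token, [])              # 5. synonym expansion
        terms.append(Term(token, entity, weight, synonyms))
    return terms


def rewrite(terms):
    """6. Rewrite the analyzed terms into a (hypothetical) engine query string."""
    clauses = []
    for term in terms:
        variants = [term.text] + term.synonyms
        clause = " OR ".join(variants)
        if len(variants) > 1:
            clause = f"({clause})"
        clauses.append(clause)
    return " AND ".join(clauses)


def relax(terms):
    """Drop low-weight terms so a query with no results can be retried."""
    kept = [t for t in terms if t.weight != "low"]
    return kept or terms


terms = analyze("nkie running shose")
print(rewrite(terms))         # strict query
print(rewrite(relax(terms)))  # relaxed query for a retry
```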

OpenSearch now supports not only Alibaba’s self-developed search engine but also the open-source Elasticsearch (ES) engine, which makes it easier for users to adopt our algorithm capabilities.
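As a rough illustration of what targeting an ES engine could look like, the sketch below turns analyzed query terms into an Elasticsearch `bool` query body. The input format, the `title` field, and the boost values are assumptions made for this example; the source does not describe how OpenSearch actually generates ES queries.

```python
import json

# Hypothetical mapping from weight level to boost value.
BOOSTS = {"high": 3.0, "medium": 2.0, "low": 1.0}


def to_es_query(terms, field="title"):
    """Build an Elasticsearch bool query from analyzed terms.

    Each term is a dict with "text", "weight", and optional "synonyms".
    High- and medium-weight terms are required (must); low-weight terms
    only influence scoring (should).
    """
    must, should = [], []
    for term in terms:
        variants = [term["text"]] + term.get("synonyms", [])
        clause = {
            "bool": {
                "should": [
                    {"match": {field: {"query": v, "boost": BOOSTS[term["weight"]]}}}
                    for v in variants
                ],
                "minimum_should_match": 1,
            }
        }
        (should if term["weight"] == "low" else must).append(clause)
    return {"query": {"bool": {"must": must, "should": should}}}


body = to_es_query([
    {"text": "nike", "weight": "high"},
    {"text": "shoes", "weight": "high", "synonyms": ["sneakers"]},
    {"text": "cheap", "weight": "low"},
])
print(json.dumps(body, indent=2))
```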

Industry model

Customer pain points

1. General models are hard to adapt to specific domains

  • The general model mainly solves problems in the news and information domain;
  • Its effect degrades sharply in specific industries.

E-commerce model and common domain


2. Few open industry models

  • Cloud service providers basically only provide general models
  • Public industry data sets also mainly cover general areas


Why the problem is hard to solve

The process of building an industry search NLP model:


  • The first step is data set annotation. It requires tens of thousands of labeled examples and takes months, and it also demands a high level of industry knowledge.
  • The next step is model training. This step requires professional algorithm engineers; if the people doing it are unfamiliar with the algorithms, model iteration will be very inefficient.
  • The last step is model deployment, where efficiency has to be optimized by operations and maintenance personnel.

In practice, the data set annotation stage alone poses many challenges.

Difficulties in word segmentation annotation

1. High requirements for domain knowledge

For example:

  • Name of drug: lidocaine chlorhexidine aerosol | lidocaine chlorhexidine aerosol
  • Address: Wangying village, Sikeshu Township, Nanzhao County | Wangying village, Sikeshu Township, Nanzhao County

2. Overlapping ambiguity is hard to judge

For example:

  • Washing powder | washing powder

Difficulties in entity recognition annotation

1. High requirements for domain knowledge

For example:

  • Australia Aitamei (mother and baby brand) gold stage 1
  • Kobe (sneaker series) 4
  • PyTorch implementation of GAN (algorithm model)

Solution

Building on the data accumulated from Alibaba’s internal search, OpenSearch combines automatic data mining with self-developed algorithm models to rework the pipeline for building industry models.

Take word segmentation and NER as examples. The following model diagram shows the word segmentation process. First, an automatic new-word discovery algorithm mines new words in the target domain. With these new words, we then build distantly supervised training data for the target domain.
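One common way to build such distantly supervised data is dictionary matching, sketched below. This is an assumption for illustration, since the source does not say how the training data is constructed: forward maximum matching segments unlabeled text with the mined lexicon, and the result is converted to BMES character tags (Begin, Middle, End, Single). The lexicon entries and the helper names are hypothetical.

```python
def forward_max_match(text, lexicon, max_len=6):
    """Greedy segmentation: at each position take the longest lexicon word.

    Characters that start no lexicon word become single-character tokens.
    """
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


def to_bmes(words):
    """Convert a word sequence into per-character BMES tags."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


# Hypothetical mined new words: "lidocaine" and "aerosol".
mined_lexicon = {"利多卡因", "气雾剂"}

sentence = "利多卡因气雾剂"
words = forward_max_match(sentence, mined_lexicon)
print(words)
print(list(zip(sentence, to_bmes(words))))
```

Labels produced this way are noisy, because a dictionary match ignores context; this is why a denoising step matters before such data is used for training.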


Based on this distantly supervised training data, we propose an adversarial learning network structure that reduces the noise in the data, and in the end we obtain a word segmentation model for the target domain.


The following model diagram shows the NER process. We adopt a Graph NER model structure that incorporates a graph neural network and can integrate a knowledge base with annotated data. The knowledge base comes from the new words mined automatically by the new-word discovery module in the word segmentation pipeline described above; we automatically tag these words with entity labels to build the domain knowledge base. The corresponding technical papers have been published at ACL, a top conference in the NLP field.


To sum up, take the e-commerce industry as an example of the results that the OpenSearch industry model achieves with the technical approach described above.

As can be seen, the enhanced e-commerce industry version of OpenSearch performs much better than the general version.

This approach is not limited to e-commerce: any industry with accumulated data can use it to build an industry model quickly.


OpenSearch lightweight customization

Customer pain points


First, using the general model directly achieves an effect of about 60 points.

The industry model just described can reach 80 points.

But each individual customer has customization problems in its own subdivided domain, and the typical customer’s goal may be 90 points.

For example, consider the following two cases:

  • On the left, “Vance soda series” is actually the brand and series name of a sneaker. The OpenSearch e-commerce model correctly identifies the brand and the common words, but it does not correctly identify the specific sub-series “soda”.
  • On the lower right, the example is “Hanben cuibaowei drink”. Here the OpenSearch e-commerce model does not recognize the brand or its sub-series.

If customers customize and optimize on top of the industry model we provide by themselves, they run into the same problems described above for building industry models, so in the end it is hard to break through 85 points.


Our goal is to reduce customers’ labeling cost, requiring no labeling at all or only a small amount of simple labeling, so that customization becomes easier for customers and directly reaches the effect of 85 points.

Solution ideas

The overall process is similar to the pipeline for building industry models. We turn these capabilities into product tools so that customers can take part in tuning on their own.

1. Create a new training model

The following figure is a demo of the tool we built. The upper part shows model creation: customers choose a base industry model and then upload unlabeled data from their own domain, and model training starts automatically.


2. Effect evaluation

After model training, our system provides an intuitive effect evaluation. You can compare the base model with the automatically trained model and see what has changed. Customers can also do a small amount of manual annotation to verify the model’s effect.


This pipeline is already used internally at Alibaba and will be made available to customers in the OpenSearch product in the near future. Previously, a lightweight customization that achieved the effect above took us one to two months and required more than 10,000 labeled sentences. Now, with this approach, the same effect can be achieved in only one week, with no labeling at all or with fewer than 1,000 labeled examples.


Lightweight customization results

Our tool automatically discovers new words in the customer’s scenario and predicts their entity labels. Each new word shown in brackets is predicted in different contexts, which yields a distribution over labels. This distribution helps us judge whether the new word is a valid new word and which entity label it belongs to, and so provides the most important information for our model.
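The label-distribution idea can be sketched as follows. The input format, the `"O"` (non-entity) label, and the 0.6 acceptance threshold are assumptions made for illustration; the source does not specify how the tool decides.

```python
from collections import Counter, defaultdict


def label_distribution(predictions):
    """Aggregate (word, predicted_label) pairs from many contexts into
    a per-word distribution over labels."""
    counts = defaultdict(Counter)
    for word, label in predictions:
        counts[word][label] += 1
    return {
        word: {label: n / sum(counter.values()) for label, n in counter.items()}
        for word, counter in counts.items()
    }


def accept_new_words(distribution, min_share=0.6):
    """Keep a candidate word only if one entity label clearly dominates."""
    accepted = {}
    for word, dist in distribution.items():
        label, share = max(dist.items(), key=lambda item: item[1])
        if label != "O" and share >= min_share:
            accepted[word] = label
    return accepted


# Hypothetical model predictions for two candidate words in several contexts.
predictions = [
    ("soda", "series"), ("soda", "series"), ("soda", "series"), ("soda", "O"),
    ("drink", "O"), ("drink", "category"),
]

dist = label_distribution(predictions)
print(dist)
print(accept_new_words(dist))
```

A word whose predictions agree across contexts is accepted with its dominant label, while a word with a split or mostly non-entity distribution is left for manual review.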

Address scenario


E-commerce scenario


This article is the original content of Alibaba cloud and cannot be reproduced without permission.