How does Xianyu (Idle Fish) push the accuracy of second-hand attribute extraction above 95%?

Date: 2021-09-14


A first look at the effect

Figure 1 – Demo of the second-hand attribute extraction algorithm (1)

Background

As a C2X app, Xianyu differs from Taobao in the following ways when it comes to listing goods:

  1. Insufficient product information due to lightweight listing. Xianyu uses a lightweight "pictures + free-text description" listing flow. This caters to users who want to publish quickly, but it also means listings carry little structured information. If the platform wants to understand what an item actually is, algorithms have to interpret the pictures and text the user provides.
  2. Goods carry unique second-hand attributes. Unlike the first-hand attributes of new products on Taobao (brand, model, specification parameters, etc.), second-hand attributes describe the depreciation or state of preservation of an item after it has been used for a while, such as [number of uses], [purchase channel] and [completeness of packaging / accessories]. Each category also has second-hand attributes of its own, for example [shelf life] for personal care and beauty, [screen condition] and [disassembly / repair history] for mobile phones, and [whether it has been washed] for clothing.

Problems and difficulties

Second-hand attribute extraction is an information extraction problem in NLP. The common approach is to decompose it into named entity recognition (NER) tasks and text classification tasks.

The difficulties of second-hand attribute extraction tasks include:

  1. Different models need to be built for different categories and different second-hand attributes / attribute clusters.
  2. If supervised learning (the BERT family) is used, the annotation workload is heavy and the development cycle becomes long.

Solution

Methodology

In today's NLP landscape, the BERT family (and the many Transformer-derived variants) still dominates leaderboards such as GLUE and CLUE, and information extraction is no exception. The author therefore also uses the BERT family in some of the scenarios in this scheme. However, no algorithm is all-powerful in every scenario; there is only the most suitable algorithm for a given field and a given scenario. The author's own methodology for attribute extraction can be summarized as follows:

  1. The sentence pattern is relatively fixed, or constrained by a template. For example, a description template of the form time + place + person + event ("at some time, somewhere, someone did something"). Use NER. Suggested methods: CRF, BiLSTM + CRF, the BERT family, or the BERT family + CRF.
  2. The sentence pattern is not fixed, but the domain / scenario keywords are relatively fixed, or there are keyword templates, common names, jargon, etc. Treat it as text classification:
     • When there are not many synonyms and alternative expressions (tens to a few hundred at most) and the keywords follow a lognormal / exponential distribution (i.e. a small set of high-frequency keywords dominates): regular expressions + rules.
     • When there are many synonyms and expressions (hundreds to thousands), place-name recognition being a typical case: the BERT family.
  3. Neither the sentence patterns nor the wording is fixed, e.g. sentiment analysis of social comments / chat. Suggested method: the BERT family.

Scheme architecture

Figure 2 – Architecture of the second-hand attribute extraction scheme

  • NLP tasks: as mentioned earlier, the different second-hand attribute recognition requirements are decomposed into text multi-class classification, multi-label classification and NER tasks.
  • Text multi-class classification: the "1-of-N" problem, e.g. judging from the text whether an item ships for free (a binary classification).
  • Multi-label classification: several "1-of-N" problems solved at the same time, e.g. simultaneously judging a phone's screen condition (good / medium / poor) and body condition (good / medium / poor). The common approach is to share the network layers across labels and superimpose the per-label loss functions with certain weights. Because the labels are correlated to some degree, this sometimes works better than solving several separate "1-of-N" problems, and since several attributes (an attribute cluster) are modeled together, training and inference are also simpler. A minimal sketch follows this list.
  • NER: named entity recognition.
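
To make the shared-layers / weighted-loss idea concrete, here is a minimal sketch in Keras. It is not the production model: the vocabulary size, sequence length, toy encoder and loss weights are all illustrative assumptions, and in the actual scheme the encoder would be a BERT-family model.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE = 20000   # assumption: token vocabulary size
MAX_LEN = 64         # assumption: maximum sequence length

tokens = keras.Input(shape=(MAX_LEN,), dtype="int32", name="tokens")

# Shared layers: both attribute heads read the same sentence representation.
x = layers.Embedding(VOCAB_SIZE, 128)(tokens)
x = layers.GlobalAveragePooling1D()(x)
x = layers.Dense(128, activation="relu")(x)

# One head per second-hand attribute in the cluster (good / medium / poor).
screen = layers.Dense(3, activation="softmax", name="screen_condition")(x)
body = layers.Dense(3, activation="softmax", name="body_condition")(x)

model = keras.Model(tokens, [screen, body])

# The per-label losses are superimposed with fixed weights, as described above.
model.compile(
    optimizer="adam",
    loss={"screen_condition": "sparse_categorical_crossentropy",
          "body_condition": "sparse_categorical_crossentropy"},
    loss_weights={"screen_condition": 0.5, "body_condition": 0.5},
    metrics=["accuracy"],
)
model.summary()
```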

Modeling method

  1. Manual annotation stage. Because annotation is labor-intensive, the group's AliNLP is used to assist wherever possible. First, the input text is analyzed with AliNLP's e-commerce NER model. The second-hand attributes that fall into NER tasks, such as shelf life / warranty period / capacity / number of uses / clothing style, can then be located directly from the relevant parts of speech or entities and annotated in BIO format. The remaining second-hand attributes, which fall into classification tasks, are annotated on top of the e-commerce NER segmentation results, which improves annotation efficiency.
  2. Algorithm training stage. This is the core of the scheme, and it mainly uses three approaches:

(1) ALBERT-tiny: modeling follows the mainstream pre-train + fine-tune scheme. Because this model is fast at inference, it is used in real-time online scenarios with demanding QPS and latency requirements. For NER tasks, a CRF layer can optionally be added on top of the network. ALBERT stands for "A Lite BERT", and it lives up to the name: its main advantage is speed.
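
As an illustration of this pre-train + fine-tune route only, here is a hedged sketch using the open-source transformers library. The checkpoint path, label set and sample texts are placeholders; the article does not say which training framework the team actually used.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

CKPT = "./albert_tiny_zh"          # assumption: a local Chinese ALBERT-tiny checkpoint
LABELS = ["mailed", "not_mailed"]  # example: binary "does it ship for free" attribute

tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForSequenceClassification.from_pretrained(CKPT, num_labels=len(LABELS))

texts = ["全新未拆封，包邮", "自提，不邮寄"]   # toy item descriptions
labels = torch.tensor([0, 1])

batch = tokenizer(texts, padding=True, truncation=True, max_length=64, return_tensors="pt")
outputs = model(**batch, labels=labels)
outputs.loss.backward()            # one fine-tuning step (optimizer omitted for brevity)

with torch.no_grad():
    pred = model(**batch).logits.argmax(dim=-1)
print([LABELS[i] for i in pred])
```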

The source code of ALBERT is largely the same as BERT's, but the network structure has several important differences:

  1. The word embedding layer is factorized, which greatly reduces the number of embedding parameters. Let the vocabulary size be V and the hidden size be H. For BERT, the embedding has V × H parameters. ALBERT first maps words to a smaller embedding of size E and then expands it to H, giving V × E + E × H parameters. Since E is much smaller than H and H is much smaller than V, the number of parameters to train drops sharply (a quick calculation follows this list).
  2. Cross-layer parameter sharing. Taking ALBERT-base as an example, the attention parameters and/or the feed-forward (FFN) parameters are shared across its 12 layers; by default both are shared. In the source code this is implemented simply via the reuse parameter of tensorflow.variable_scope. Parameter sharing further reduces the number of trainable parameters. ALBERT also tweaks some pre-training tasks and training details, which are not covered here.
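
A quick back-of-the-envelope check of the factorization in point 1, with illustrative numbers (a 21,128-token Chinese vocabulary, H = 768, E = 128; these figures are assumptions, not taken from the article):

```python
# Compare BERT-style embedding parameters with ALBERT's factorized embedding.
V = 21_128   # vocabulary size (illustrative)
H = 768      # hidden size
E = 128      # factorized embedding size used by ALBERT

bert_embedding_params = V * H            # V x H
albert_embedding_params = V * E + E * H  # V x E + E x H

print(f"BERT-style embedding : {bert_embedding_params:,}")    # ~16.2M
print(f"ALBERT factorized    : {albert_embedding_params:,}")  # ~2.8M
print(f"reduction factor     : {bert_embedding_params / albert_embedding_params:.1f}x")
```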

ALBERT comes in several depths: ALBERT-large / xlarge / xxlarge (24 layers), ALBERT-base (12 layers), ALBERT-small (6 layers) and ALBERT-tiny (4 layers). Generally, the more layers, the longer training and inference take. Because online deployment requires fast real-time inference, the smallest variant, ALBERT-tiny, is chosen in this scheme: its Chinese inference speed is roughly 10× that of BERT-base while accuracy is largely retained (figures quoted from github/albert_zh).

(2) StructBERT-base: again the mainstream pre-train + fine-tune scheme. Its accuracy on second-hand attribute recognition is estimated to be about 1% to 1.5% higher than ALBERT-tiny's, so it is used in offline T+1 scenarios. For NER tasks, a CRF layer can optionally be added on top of the network. StructBERT is Alibaba's in-house algorithm; its strength is high accuracy, and it once ranked third on the GLUE leaderboard.

Word structural objective: on top of BERT's MLM task, StructBERT adds a task that shuffles word order and forces the model to reconstruct the correct order. In the paper, a trigram is randomly selected and shuffled, and the objective below is added as an extra term alongside the MLM loss. StructBERT's inspiration may have come from a popular internet meme: "Studies show that the order of Chinese characters does not necessarily affect reading; only after you finish reading this sentence do you realize that all the characters in it are scrambled."

Figure 4 – Objective function of the word structural task (quoted from the StructBERT paper)
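
The figure itself did not survive extraction; as a rough paraphrase of the word-ordering term from the published StructBERT paper (for a shuffled trigram t1, t2, t3, where "pos_i" denotes the token predicted at position i):

```latex
\max_{\theta}\; \log P(\mathrm{pos}_1 = t_1,\ \mathrm{pos}_2 = t_2,\ \mathrm{pos}_3 = t_3 \mid t_1, t_2, t_3;\ \theta)
```

i.e. the model is trained to put each shuffled token back in its original position, and this term is added alongside the MLM loss.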

Another reason StructBERT is chosen in this scheme is that the group provides pre-trained models (and interfaces) for this algorithm that are specialized for the e-commerce domain, in three sizes:

  • StructBERT-base: 12 layers
  • StructBERT-lite: 6 layers
  • StructBERT-tiny: 4 layers

The offline T+1 scenario pursues higher accuracy and has no hard real-time requirement, so StructBERT-base is selected.


(3) Regular expressions. Advantages: the fastest option, 10–100× faster than ALBERT-tiny; for many second-hand attributes with relatively fixed sentence patterns and keywords it is also more accurate than the two algorithms above; and it is easy to maintain. Disadvantages: it relies heavily on business knowledge, industry experience and data analysis to compile a large set of regular-expression patterns.
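
For illustration, a toy version of the regex + rules route. The patterns and attribute names below are invented for this sketch and are not the production rule set.

```python
import re

# Toy rules for a "usage" attribute: a usage count ("只用过三次") or a brand-new phrasing.
USAGE_PATTERNS = [
    re.compile(r"(?:只|仅)?(?:使用|用)过?\s*([一二三四五六七八九十两\d]+)\s*次"),
    re.compile(r"(全新|未拆封|未使用)"),
]

def extract_usage(text: str) -> str:
    """Return a coarse usage attribute extracted from an item description."""
    for pattern in USAGE_PATTERNS:
        match = pattern.search(text)
        if match:
            return match.group(1)
    return "unknown"

print(extract_usage("九五新，只用过三次，包装齐全"))  # -> "三"
print(extract_usage("全新未拆封，包邮"))              # -> "全新"
```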

  3. Rule revision stage. After the algorithm runs, its output is post-processed according to business rules:

  1. Normalization of recognition results: many NER results cannot be used directly and need to be "normalized". For example, if a men's clothing size is recognized as "175/88A", it should be automatically mapped to "L".
  2. Some second-hand attributes may conflict with or depend on each other, so after recognition the results are revised according to business rules. For example, if a seller claims an item is "brand new" but also states that "it has only been used three times", the "brand new" is automatically downgraded to "not brand new" (99% new or 95% new; the grading differs slightly between categories). A sketch of both kinds of rule follows.
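
A minimal sketch of the two revision rules described above; the size table, attribute names and downgrade logic are simplified assumptions, not the team's actual rule set.

```python
# Assumption: a partial mapping from garment-size notation to letter sizes.
SIZE_MAP = {"175/88a": "L", "170/84a": "M"}

def normalize_size(raw: str) -> str:
    """Map an NER-extracted clothing size to a letter size when a rule exists."""
    return SIZE_MAP.get(raw.lower(), raw)

def revise_condition(attrs: dict) -> dict:
    """Downgrade 'brand new' if the extracted usage count says the item was used."""
    if attrs.get("condition") == "brand new" and attrs.get("usage_count", 0) > 0:
        attrs["condition"] = "99 new"   # category-dependent downgrade in the real rules
    return attrs

print(normalize_size("175/88A"))                                       # -> "L"
print(revise_condition({"condition": "brand new", "usage_count": 3}))  # downgraded
```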

Algorithm deployment

  • Offline T+1 scenario: deployed via ODPS (now called MaxCompute) + UDF, i.e. the algorithm is written into a UDF script in Python and the model files are uploaded to ODPS as resources (a sketch follows this list).
  • Online real-time scenario: the model is deployed in a distributed fashion via PAI-EAS, and data interaction is handled through iGraph (a real-time graph database) and TPP.
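
For the offline path, a hedged sketch of what a MaxCompute (ODPS) Python UDF wrapping the extraction logic might look like. The resource name, signature and rule logic are placeholders, not the team's actual code.

```python
from odps.udf import annotate
from odps.distcache import get_cache_file  # reads a file uploaded as an ODPS resource

@annotate("string->string")
class ExtractSecondHandAttrs(object):
    def __init__(self):
        # Placeholder resource: a rule table shipped alongside the UDF.
        self.rules = [line.strip() for line in get_cache_file("attr_rules.txt")]

    def evaluate(self, description):
        if description is None:
            return None
        # Real logic would apply the regex rules or call the fine-tuned model here.
        return "is_mailed:yes" if u"包邮" in description else "is_mailed:unknown"
```

After registering the function, it could be invoked from SQL roughly as `SELECT extract_second_hand_attrs(description) FROM item_table;` (function and table names are again placeholders).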

Algorithm evaluation

For each second-hand attribute of each category, an evaluation standard is formulated, and a certain amount of data is then sampled for manual evaluation by an outsourced team. The evaluation compares the manual labels with the algorithm's output and reports accuracy, precision and recall.
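
A minimal sketch of how the sampled manual labels might be compared with the algorithm output to produce these metrics for one binary attribute (the data below is made up):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

manual = ["mailed", "mailed", "not_mailed", "mailed", "not_mailed"]      # outsourced labels
algo   = ["mailed", "not_mailed", "not_mailed", "mailed", "not_mailed"]  # algorithm output

print("accuracy :", accuracy_score(manual, algo))
print("precision:", precision_score(manual, algo, pos_label="mailed"))
print("recall   :", recall_score(manual, algo, pos_label="mailed"))
```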

Final effect

Accuracy

Manual evaluation of the recognition results shows that both accuracy and precision reach a very high level for every category, with error rates well below the threshold required for going live. The scheme has been applied online to goods in Xianyu's main categories.

Showcase

Figure 5 – Demo of the second-hand attribute extraction algorithm (2)

Application scenarios & future outlook

The results of second-hand attribute extraction are already applied in the following scenarios:

  1. Pricing
  2. Chat
  3. Mining the high-quality goods pool
  4. Search and shopping guidance
  5. Personalized product recommendation

Future plans:

  1. Second-hand attribute extraction currently covers Xianyu's mainstream categories; the plan is to gradually extend coverage to all categories.
  2. Extraction currently relies mainly on text. Since Xianyu listings are described with both pictures and text, image algorithms could be applied in the future to further enrich the structured information of goods.
  3. Use and analyze the second-hand attributes of goods to form standards for high-quality goods and expand the high-quality goods pool.

Author: Jian Li
This article is original content from Alibaba Cloud and may not be reproduced without permission.
