Research on the application of BERT distillation in the identification of junk public opinion

Time: 2021-07-01

Introduction: Recently, large-scale pre-trained models such as BERT have achieved remarkable results on many NLP sub-tasks. However, their huge number of parameters makes them hard to deploy online and unable to meet production requirements. The public-opinion audit business contains a large amount of junk public opinion, which consumes a lot of manpower to review. In this post, we apply BERT distillation to improve the performance of a TextCNN classifier on the junk public opinion recognition task, and exploit its small size and speed to deploy it successfully.

The risk samples are as follows:

[Figure: examples of junk public opinion (risk) samples]

Part One: Traditional distillation schemes

At present, there are four main kinds of model compression and acceleration techniques:

  • Parameter pruning and sharing
  • Low-rank factorization
  • Transferred/compact convolutional filters
  • Knowledge distillation

Knowledge distillation transfers the knowledge of a teacher network to a student network, so that the student network's performance approaches the teacher's. This post focuses on the application of knowledge distillation.

1. Soft labels

Knowledge distillation was proposed by Hinton et al., building on the earlier model-compression idea of Caruana et al. By introducing the soft labels of a teacher network (a complex network with good accuracy but long prediction time) as part of the overall loss, it guides a student network (a simple network with slightly lower accuracy but low prediction latency) to learn, achieving knowledge transfer. It is a general and simple model-compression technique that works across different model architectures. The key observations behind it are:

  • The class predictions of a large neural network encode the similarity structure among classes in the data.
  • With this prior, a small neural network can converge with very little data from a new scenario.
  • The softmax distribution becomes more uniform as the temperature increases.

The loss formula is as follows:

$$ L = \alpha \, T^2 \, \mathrm{CE}\!\left(\sigma\!\left(z_t / T\right),\ \sigma\!\left(z_s / T\right)\right) \;+\; (1-\alpha)\, \mathrm{CE}\!\left(y,\ \sigma\!\left(z_s\right)\right) $$

where $z_t$ and $z_s$ are the teacher's and the student's logits, $y$ is the one-hot label, $\alpha$ weights the soft-label term, and $\sigma(\cdot / T)$ is the softmax softened by temperature $T$:

$$ q_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)} $$
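Translating this into code, a minimal PyTorch sketch of the soft-label loss might look as follows (the temperature `T`, weight `alpha`, and function name are illustrative, not from the original post; the KL form is used, which is equivalent to the soft cross-entropy up to a constant that does not depend on the student):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Soft-label distillation: alpha * T^2 * soft term + (1 - alpha) * hard cross-entropy."""
    # Teacher distribution and student log-distribution, both softened by temperature T.
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    # KL divergence between the softened distributions; the T^2 factor keeps its gradient
    # magnitude comparable to the hard-label term (as in Hinton et al.).
    soft_loss = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T ** 2)
    # Ordinary cross-entropy against the ground-truth labels (temperature 1).
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```

The later sketches in this post reuse this `distillation_loss` helper.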

From this we can see that distillation has the following advantages:

  • The student can learn the representation ability of the large model, as well as inter-class information that does not exist in one-hot labels.
  • Robustness to noise: as shown in the figure below, when the labels are noisy, the teacher's gradient partially corrects the student's gradient.
  • The generalization of the model is strengthened to a certain extent.

[Figure: under label noise, the teacher's gradient partially corrects the student's gradient]

2. Using hints (FitNets)

FitNets (Romero et al., ICLR 2015) uses not only the teacher network's final output logits but also its intermediate hidden-layer representations to train the student network, producing deep and thin FitNets.

[Figure: FitNets hint-based training of a thin, deep student network]

The intermediate-layer (hint) loss is as follows:

$$ \mathcal{L}_{HT}(W_{\mathrm{Guided}}, W_r) = \tfrac{1}{2}\,\big\lVert\, u_h(x;\, W_{\mathrm{Hint}}) - r\big(v_g(x;\, W_{\mathrm{Guided}});\, W_r\big) \,\big\rVert^2 $$

where $u_h$ is the teacher up to its hint layer, $v_g$ is the student up to its guided layer, and $r(\cdot\,; W_r)$ is a regressor that maps the guided layer to the hint layer's dimensionality.

Adding this intermediate-layer loss constrains the student network's solution space with the teacher network's representations, so that the student's optimum moves closer to the teacher's. In this way the student learns the teacher's higher-order representations while reducing redundancy in the network parameters.
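A minimal sketch of such a hint loss, assuming the student's guided-layer features are mapped to the teacher's hint-layer dimensionality by a learned linear regressor (the layer choices and dimensions are illustrative):

```python
import torch.nn as nn

class HintLoss(nn.Module):
    """FitNets-style hint loss: L2 distance between the teacher's hint-layer features
    and a regression of the student's guided-layer features into the same space."""

    def __init__(self, student_dim, teacher_dim):
        super().__init__()
        # Regressor r(.; W_r): projects student features to the teacher's feature size.
        self.regressor = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_hidden, teacher_hidden):
        # The teacher's features are a fixed target: no gradient flows into the teacher.
        diff = self.regressor(student_hidden) - teacher_hidden.detach()
        return 0.5 * (diff ** 2).sum(dim=-1).mean()
```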

3. Co-training (route constrained optimization)

Route Constrained Optimization (RCO; Jin, Peng et al., arXiv 2019) is inspired by curriculum learning. When the gap between the student and the teacher is too large, distillation fails and produces a cognitive bias. RCO therefore proposes route constrained hint learning: the intermediate checkpoints along the teacher network's training path are used as teachers one after another, and their outputs are given to the student network for training, so the student can learn from these intermediate models step by step, from easy to hard.

The training path is as follows:
[Figure: the RCO training path, with intermediate teacher checkpoints used as successive anchors]
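A sketch of an RCO-style training loop under these assumptions: the teacher's intermediate checkpoints were saved during its own training, and the `distillation_loss` helper sketched earlier is reused (all names are illustrative):

```python
import torch

def train_student_rco(student, teacher, checkpoint_paths, loader, optimizer, epochs_per_stage=1):
    """Distil from each intermediate teacher checkpoint in order (easy to hard)."""
    for ckpt in checkpoint_paths:
        # Load the next anchor point on the teacher's training route.
        teacher.load_state_dict(torch.load(ckpt))
        teacher.eval()
        for _ in range(epochs_per_stage):
            for inputs, labels in loader:
                with torch.no_grad():          # the intermediate teacher is frozen
                    teacher_logits = teacher(inputs)
                student_logits = student(inputs)
                loss = distillation_loss(student_logits, teacher_logits, labels)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
```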

Part Two: BERT-to-TextCNN distillation schemes

In order to improve model accuracy while guaranteeing timeliness and coping with the shortage of GPU resources, we built a distillation scheme from a BERT teacher model to a TextCNN student model.

Scheme 1: offline logit distillation to TextCNN

The traditional soft-label method from Part One is used for distillation: the trained BERT teacher produces its logits offline, and the TextCNN student is trained on these soft labels together with the hard labels.

[Figure: Scheme 1, offline logit distillation from BERT to TextCNN]
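A sketch of the offline part, assuming the teacher has already been trained; the file name and the batch format of the loader are illustrative:

```python
import torch

@torch.no_grad()
def export_teacher_logits(bert_teacher, loader, path="bert_logits.pt"):
    """Run the trained BERT teacher once over the corpus and cache its logits to disk,
    so the TextCNN student can later be trained without loading BERT at all."""
    bert_teacher.eval()
    # The loader must iterate in a fixed order so the cached logits align with the samples.
    cached = [bert_teacher(inputs) for inputs, _ in loader]
    torch.save(torch.cat(cached, dim=0), path)

# The student is then trained on (text, hard label, cached teacher logit) triples
# using the distillation_loss helper sketched earlier.
```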

Scheme 2: joint training of BERT and TextCNN (parameter isolation)

Parameter isolation: in each iteration the teacher model trains one step and passes its logits to the student. The teacher's parameter update is driven only by the hard labels, while the student's parameter update is driven by the soft-label loss on the teacher's logits plus the hard-label loss on the ground-truth labels.

[Figure: Scheme 2, joint training of BERT and TextCNN with parameter isolation]
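A sketch of one training iteration under parameter isolation, reusing the `distillation_loss` helper; the two optimizers and the function name are illustrative:

```python
import torch.nn.functional as F

def joint_step_isolated(bert, textcnn, inputs, labels, opt_teacher, opt_student):
    # Teacher step: BERT is updated by the hard-label loss only.
    teacher_logits = bert(inputs)
    teacher_loss = F.cross_entropy(teacher_logits, labels)
    opt_teacher.zero_grad()
    teacher_loss.backward()
    opt_teacher.step()

    # Student step: detach() blocks any gradient of the soft-label loss
    # from flowing back into BERT, so the teacher is isolated from the student.
    student_logits = textcnn(inputs)
    student_loss = distillation_loss(student_logits, teacher_logits.detach(), labels)
    opt_student.zero_grad()
    student_loss.backward()
    opt_student.step()
```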

Scheme 3: joint training of BERT and TextCNN (no parameter isolation)

No parameter isolation: similar to Scheme 2, except that the gradient of the student's soft-label loss from the previous iteration is also used to update the teacher's parameters.

[Figure: Scheme 3, joint training of BERT and TextCNN without parameter isolation]
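A sketch of the non-isolated variant: the teacher's logits are not detached, so the student's step leaves soft-label gradients on BERT's parameters, and those gradients are consumed at the teacher's next update (names are illustrative; assumes the `distillation_loss` helper):

```python
import torch.nn.functional as F

def joint_step_shared(bert, textcnn, inputs, labels, opt_teacher, opt_student):
    # Teacher step: the hard-label gradient is ADDED to whatever soft-label gradient
    # the student deposited on BERT's parameters in the previous iteration.
    teacher_logits = bert(inputs)
    F.cross_entropy(teacher_logits, labels).backward(retain_graph=True)
    opt_teacher.step()
    opt_teacher.zero_grad()   # cleared only after the feedback has been consumed

    # Student step: no detach(), so the soft-label loss also back-propagates into BERT.
    # opt_student only updates (and only zeroes) the TextCNN parameters, leaving the
    # feedback gradient on BERT for the next teacher step.
    student_logits = textcnn(inputs)
    distillation_loss(student_logits, teacher_logits, labels).backward()
    opt_student.step()
    opt_student.zero_grad()
```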

Scheme 4: joint multi-task training of BERT and TextCNN

The teacher and the student are trained at the same time in a multi-task manner, and both models are updated from the combined loss at every step.

[Figure: Scheme 4, joint multi-task training of BERT and TextCNN]
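A sketch of the fully shared multi-task step, with a single optimizer over the parameters of both models (illustrative; reuses the `distillation_loss` helper):

```python
import torch.nn.functional as F

def joint_step_multitask(bert, textcnn, inputs, labels, optimizer):
    # Both models are updated together from one combined loss at every step,
    # so gradients from the student's soft-label term also reach BERT immediately.
    teacher_logits = bert(inputs)
    student_logits = textcnn(inputs)
    loss = (F.cross_entropy(teacher_logits, labels)
            + distillation_loss(student_logits, teacher_logits, labels))
    optimizer.zero_grad()   # a single optimizer holds both models' parameters
    loss.backward()
    optimizer.step()
```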

Scheme 5: multiple teachers

When a model is updated, it usually needs to keep covering the samples recalled by the online historical model. We therefore use the online historical model as an additional teacher, so that the new model also learns the knowledge of the original historical model and maintains a high coverage of it.

[Figure: Scheme 5, distillation with multiple teachers]
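One way to express the multi-teacher objective, assuming the historical online TextCNN's logits are cached the same way as the BERT teacher's; the weights and names are illustrative:

```python
import torch.nn.functional as F

def multi_teacher_loss(student_logits, bert_logits, legacy_logits, labels,
                       w_bert=0.5, w_legacy=0.3):
    """Student loss with two teachers: the BERT model and the historical online TextCNN.
    The legacy term pulls the new student toward the old model's outputs, which is what
    keeps its coverage of the original model high."""
    hard = F.cross_entropy(student_logits, labels)
    # alpha=1.0 keeps only the soft-label term of the distillation_loss helper.
    soft_bert = distillation_loss(student_logits, bert_logits, labels, alpha=1.0)
    soft_legacy = distillation_loss(student_logits, legacy_logits, labels, alpha=1.0)
    return hard + w_bert * soft_bert + w_legacy * soft_legacy
```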

The results are as follows:

[Table: evaluation results of the five distillation schemes]

From the above experiments, we can observe several interesting phenomena.

1) Schemes 2 and 3 both train the teacher first and then the student in each iteration; they differ only in whether the returned gradient is isolated from the teacher, and Scheme 2 scores lower than Scheme 3. In Scheme 3 the teacher and the student each take one step per iteration, and the student's soft-label loss is fed back to the teacher, so the teacher learns how to guide the student and the teacher's own performance also improves.

2) Scheme 4 updates both models together and feeds gradients back at every step, yet the result goes the other way: TextCNN's performance drops rapidly. Although BERT's performance does not decline, it is hard for BERT to guide TextCNN correctly at every single step of feedback.

3) Scheme 5 uses the logits of the historical TextCNN, mainly so that the new model can replace the online model while maintaining a high coverage of the original model's predictions. Although recall decreases, the overall coverage of the original model is 5% higher than with a single TextCNN.

References

1. Hinton G., Vinyals O., Dean J. Distilling the Knowledge in a Neural Network[J]. 2015.
2. Romero A., Ballas N., Kahou S. E., et al. FitNets: Hints for Thin Deep Nets[J]. ICLR 2015.
3. Jin X., Peng B., Wu Y., et al. Knowledge Distillation via Route Constrained Optimization[J]. 2019.

Welcome to join the Big Security Machine Intelligence team at Ant Group. We focus on massive public opinion data, mining latent financial and platform risks with big-data and natural language understanding technology, safeguarding the security of users' funds and improving the user experience in the Ant ecosystem. Send your resume directly to [email protected]; every email will receive a reply.