Abstract: This article introduces frontier methods for learning with noisy labels and optimization strategies for training neural networks in imperfect settings, with the goal of improving model performance.
This article comes from the Huawei Cloud community and is based on "Learning from Noisy Labels with Deep Neural Networks", written by the community author "guessing ditch".
The success of neural networks rests on large amounts of clean data and deep network models. In real scenarios, however, neither the data nor the model is ideal: labels may be wrong at the data level (e.g., a dog labeled as a wolf), and because real business scenarios care about latency, the network cannot be particularly deep. We continuously iterate on effective ways to train neural networks under these data and model deficiencies, using noisy-label learning techniques to handle noisy data during training. These techniques have been deployed in the team's business scenarios. They optimize several modules, including the loss function, network architecture, model regularization, loss adjustment, sample selection, and label correction, and are not limited to fully supervised learning: semi-supervised and self-supervised methods are also used to improve the robustness of the whole model.
【Robust Loss Function】
The core idea: when the data is clean, the standard cross-entropy (CE) loss can place extra emphasis on hard negative samples, which improves the model; when the data is noisy, CE is biased by the noisy samples. We therefore need to modify the loss function so that every sample carries roughly equal weight during training. A natural choice is the Generalized Cross Entropy (GCE) loss, which uses a hyperparameter to interpolate between the CE loss and the MAE loss.
- A. Ghosh, H. Kumar, and P. Sastry, "Robust loss functions under label noise for deep neural networks," in Proc. AAAI, 2017.
- Z. Zhang and M. Sabuncu, "Generalized cross entropy loss for training deep neural networks with noisy labels," NeurIPS 2018.
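As a concrete illustration, here is a minimal NumPy sketch of the GCE loss described above. The formula L_q = (1 - p_y^q) / q recovers CE as q → 0 and MAE at q = 1; the default q = 0.7 follows the paper, but treat the rest of the interface as an assumption for this sketch.

```python
import numpy as np

def softmax(logits):
    """Numerically stable row-wise softmax."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def gce_loss(logits, targets, q=0.7):
    """Generalized Cross Entropy: L_q = (1 - p_y^q) / q.

    q -> 0 recovers cross entropy; q = 1 gives an MAE-style loss
    that weights every sample equally, hence the noise robustness.
    """
    p_y = softmax(logits)[np.arange(len(targets)), targets]
    return np.mean((1.0 - p_y ** q) / q)
```

Sweeping q between 0 and 1 trades CE's fast convergence against MAE's robustness to mislabeled samples.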
Another line of work starts from the idea of KL divergence. When computing cross entropy, with q the ground-truth distribution and p the predicted distribution, relatively clean data poses no problem; but under heavy noise, q may no longer represent the true distribution, while p, on the contrary, may. The authors therefore propose a symmetric cross-entropy function that adds the reverse term.
- Y. Wang, X. Ma, Z. Chen, Y. Luo, J. Yi, and J. Bailey, "Symmetric cross entropy for robust learning with noisy labels," in Proc. ICCV, 2019, pp. 322-330.
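A minimal sketch of the symmetric cross entropy just described: the usual CE term plus a reverse CE term in which prediction and label swap roles. The weights `alpha`, `beta` and the clip value for log(0) on the one-hot zeros are assumptions for this sketch (the paper tunes them per dataset).

```python
import numpy as np

def sce_loss(logits, targets, alpha=0.1, beta=1.0, clip=1e-4):
    """Symmetric Cross Entropy: alpha * CE(q, p) + beta * RCE(p, q).

    RCE swaps the roles of prediction p and label q; the log of the
    one-hot zeros is clipped to log(clip) to stay finite.
    """
    z = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    n, k = p.shape
    one_hot = np.eye(k)[targets]
    ce = -np.mean(np.log(p[np.arange(n), targets]))
    rce = -np.mean(np.sum(p * np.log(np.clip(one_hot, clip, 1.0)), axis=1))
    return alpha * ce + beta * rce
```

The RCE term stays bounded even for confidently wrong labels, which is what dampens the influence of noisy samples.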
【Robust Network Architecture】
In this part we draw on ingenious network architectures. During training, the model itself selects a batch of clean data, gradually improving its robustness. The first framework to introduce is Co-teaching: two models each select samples and feed them to the other network for the loss computation. The data passed to the peer network are the lowest-loss samples within each mini-batch, and the amount of data selected changes as the epochs increase. In addition, the data are shuffled at the end of each epoch to ensure that no sample is permanently forgotten.
- How does Disagreement Help Generalization against Label Corruption? ICML 2019.
- Co-teaching: Robust Training of Deep Neural Networks with Extremely Noisy Labels, NeurIPS 2018.
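The small-loss exchange at the heart of Co-teaching can be sketched as follows. The function name and the way `remember_rate` is supplied are assumptions; in the paper the rate decays with the epoch according to the estimated noise level.

```python
import numpy as np

def coteach_select(loss_a, loss_b, remember_rate):
    """Small-loss sample exchange from Co-teaching.

    Each network picks the fraction `remember_rate` of lowest-loss
    samples in the mini-batch; those indices are then used to update
    the *other* network, so each model filters noise for its peer.
    """
    n_keep = max(1, int(remember_rate * len(loss_a)))
    idx_for_b = np.argsort(loss_a)[:n_keep]  # A's clean picks train B
    idx_for_a = np.argsort(loss_b)[:n_keep]  # B's clean picks train A
    return idx_for_a, idx_for_b
```

Because the two networks start from different initializations, they make different mistakes, and cross-feeding the small-loss subsets keeps either model from reinforcing its own errors.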
Another idea is to score clean and noisy samples with an attention mechanism; the paper calls this attention feature mixup. The final loss has two parts: one is the cross-entropy loss between each image of a class and its label; the other is the loss on the new sample x′ and label y′ produced by mixup.
【Model Regularization】
In this part, regularization tricks are added to prevent the model from over-fitting the noisy data. Common regularization methods include label smoothing, L1, L2, mixup, and so on.
In fact, these are training tricks that are closely tied to the loss-function improvements above, so we will not cover them in detail here.
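Of the regularizers listed above, mixup is the least standard, so here is a minimal sketch. The `alpha` default and the seeded generator are assumptions for reproducibility of the sketch.

```python
import numpy as np

def mixup(x, y_onehot, alpha=0.2, rng=None):
    """mixup regularization: train on convex combinations of sample
    pairs and their one-hot labels, which discourages the network
    from memorizing individual (possibly mislabeled) targets."""
    rng = rng or np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)          # mixing coefficient in (0, 1)
    perm = rng.permutation(len(x))        # random pairing of samples
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mix, y_mix
```

Because the mixed labels are soft, a single noisy label can contribute at most a fraction `lam` of any training target.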
【Sample Selection】
This module starts from how to pick out better data. One method is the Area Under the Margin metric (AUM), which we used last year in CVPR WebVision 2020 (the top competition in image recognition, successor to ImageNet) to win the championship. The idea: within a mini-batch, compute for each image the difference (the margin) between the logit of its assigned class and the largest logit among the other classes; averaging this margin over multiple epochs gives each image's AUM value. Experiments show that for clean data this value is relatively large, while for mislabeled data it is relatively small or even negative. The authors use this idea to separate the clean and noisy data within a class. The paper also points out that a threshold at the 99th percentile works best for separating clean and noisy data.
- Pleiss, Geoff, et al. "Identifying mislabeled data using the area under the margin ranking." NeurIPS 2020.
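The margin computation behind AUM can be sketched in a few lines. The running-average update over epochs is an assumed simplification of the paper's bookkeeping.

```python
import numpy as np

def aum_update(running_margin, logits, targets, epoch):
    """Area Under the Margin: margin between the assigned-class logit
    and the largest *other* logit, averaged over the epochs seen so far.
    Clean samples accumulate large positive AUM; mislabeled ones drift
    small or negative."""
    n = len(targets)
    z_y = logits[np.arange(n), targets]
    masked = logits.copy()
    masked[np.arange(n), targets] = -np.inf   # exclude assigned class
    margin = z_y - masked.max(axis=1)
    return (running_margin * epoch + margin) / (epoch + 1)
```

After training, sorting samples by their final AUM and thresholding yields the clean/noisy split the paper describes.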
Another paper partitions the data based on density clustering, dividing each class into an easy, a semi-hard, and a hard subset. Noisy data is generally hard to train on, and per-image weights of 1.0, 0.5, and 0.5 are recommended for the three subsets. Model training then borrows the idea of curriculum learning.
- Guo, Sheng, et al. “Curriculumnet: Weakly supervised learning from large-scale web images.” Proceedings of the European Conference on Computer Vision (ECCV). 2018.
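A rough sketch of the easy/semi-hard/hard weighting just described. This is an assumed simplification: CurriculumNet clusters samples by local density in feature space, whereas here we merely rank by distance to the class centroid and split into thirds to illustrate the weight assignment.

```python
import numpy as np

def curriculum_weights(features, weights=(1.0, 0.5, 0.5)):
    """Assign per-sample loss weights 1.0 / 0.5 / 0.5 to the easy,
    semi-hard, and hard thirds of one class, ranked by density
    (approximated here as closeness to the class centroid)."""
    center = features.mean(axis=0)
    dist = np.linalg.norm(features - center, axis=1)
    order = np.argsort(dist)              # densest (closest) first
    w = np.empty(len(features))
    for part, wt in zip(np.array_split(order, 3), weights):
        w[part] = wt
    return w
```

Training then proceeds from the high-weight easy subset toward the hard one, in the spirit of curriculum learning.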
【Semi-supervised Learning】
For noisy-label learning based on semi-supervised learning, we first introduce DivideMix. It essentially follows the co-teaching idea, except that after clean and noisy samples are separated, the noisy samples are treated as unlabeled and trained in the FixMatch style. At present the SOTA for semi-supervised image classification is still FixMatch, which achieves nearly fully supervised accuracy even with 90% of the samples unlabeled. So the route to high accuracy is basically semi-supervision plus cleanly distinguishing the noise.
The whole pipeline is divided into two parts: co-divide and semi-supervised learning.
In the co-divide part, a pre-trained model computes the loss of the n samples, under the assumption that these n loss values are generated by a mixture of two Gaussian distributions: the component with the larger mean corresponds to noisy samples, and the one with the smaller mean to clean samples. Given each sample's loss and a chosen threshold on the posterior, the training data can be divided into a labeled and an unlabeled set, which are then trained with SSL methods.
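The co-divide step above can be sketched with scikit-learn's `GaussianMixture` (the threshold default of 0.5 and the function interface are assumptions for this sketch):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def co_divide(per_sample_loss, threshold=0.5):
    """Fit a 2-component 1-D GMM to the per-sample losses; the component
    with the smaller mean models the clean samples. Samples whose clean
    posterior exceeds `threshold` keep their labels; the rest become the
    unlabeled set for the SSL phase."""
    losses = per_sample_loss.reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(losses)
    clean_comp = int(np.argmin(gmm.means_.ravel()))
    p_clean = gmm.predict_proba(losses)[:, clean_comp]
    return p_clean > threshold, p_clean
```

The clean posterior `p_clean` is also reused downstream as a per-sample confidence weight rather than a hard decision.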
It should be noted that, for the model to converge, we must train several epochs on all the data before dividing it, i.e., a "warm-up". However, the warm-up causes the model to over-fit asymmetric noise samples, giving them small losses too, so the GMM can no longer separate them easily, which hurts later training. To solve this, we add an extra regularization term -H to the original cross-entropy loss during warm-up. This term is a negative entropy: it penalizes samples with sharp predictive probability distributions and flattens them, preventing the model from becoming over-confident.
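A minimal sketch of the warm-up objective with the confidence penalty described above; the weight `lam` is an assumption for this sketch.

```python
import numpy as np

def warmup_loss(logits, targets, lam=0.5):
    """Warm-up objective: CE plus a negative-entropy penalty -H that
    punishes sharp (over-confident) predictions, keeping per-sample
    losses separable for the later GMM split."""
    z = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    ce = -np.mean(np.log(p[np.arange(len(targets)), targets]))
    neg_entropy = np.mean(np.sum(p * np.log(p + 1e-12), axis=1))  # = -H
    return ce + lam * neg_entropy
```

Minimizing the `-H` term pushes the prediction distribution toward uniform, counteracting the over-confidence the warm-up would otherwise produce on noisy samples.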
After dividing the training data, we can use off-the-shelf semi-supervised learning methods to train the model. The paper uses the common MixMatch method, but before applying it, two improvements are made: co-refinement and co-guessing.
- DivideMix: Learning with Noisy Labels as Semi-supervised Learning. ICLR 2020
【Label Correction】
The idea of label correction is very simple: it amounts to re-labeling samples whose labels look wrong, although completely discarding the original label would be too aggressive. In its "label correction phase", this ICCV 2019 paper uses a pre-trained model to sample several images from each class at random and obtains each class's prototype via clustering. The final loss is the sum of the cross-entropy loss computed with the original label and the cross-entropy loss computed with the pseudo-label given by the prototypes.
- Han, Jiangfan, Ping Luo, and Xiaogang Wang. "Deep self-learning from noisy labels." ICCV 2019.
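The prototype-based pseudo-labeling step can be sketched as below. This is an assumed simplification: the paper selects prototypes by density-based clustering of features, while here the prototypes are taken as given and each sample simply adopts the label of its most similar prototype by cosine similarity.

```python
import numpy as np

def correct_labels(features, prototypes):
    """Assign each sample the pseudo-label of its most similar class
    prototype (cosine similarity). The training loss then combines CE
    on the original label with CE on this pseudo-label."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sim = f @ p.T              # (n_samples, n_classes) similarity matrix
    return sim.argmax(axis=1)  # pseudo-labels
```

Keeping both loss terms means a sample is only gradually pulled toward the corrected label, rather than having its original label discarded outright.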
Results and Conclusion:
Research in the field of noisy-label learning is very meaningful. We have verified it in our own scenario with good improvements: at least 2-3 points and up to 10 points. Of course, verification in one scenario cannot fully demonstrate a method's effectiveness, and we also found that combining multiple methods sometimes does not multiply the gains; on the contrary, it may reduce the final result.
We hope to use AutoML ideas to select the optimal combination of methods. We also hope that noisy-label learning methods become more transferable, since most work still focuses on classification tasks. Later we will explore meta-learning approaches in the field of noisy-label learning and keep updating the latest methods of each module into MMClassification. You are welcome to exchange ideas with us.