Author | ass
Edit | CV
Report | I love computer vision (wechat ID: aicvml)
Unsupervised Person Re-identification via Softened Similarity Learning
- Paper link: https://arxiv.org/abs/2004.03547
- Code link: https://github.com/ryanaleksa… (Unofficial)
- First author: Yutian Lin (now an associate researcher at Wuhan University)
- Affiliations: Hangzhou Dianzi University (first author), Huawei, Baidu Research, ReLER Lab of the University of Technology Sydney
- The image data is completely unlabeled;
- Clustering is abandoned, and soft labels are used to avoid the hard quantization loss;
- Part (image slice) information and cross-camera information are applied in the unsupervised setting;
- State-of-the-art results are reached among unsupervised methods for person re-identification.
The main highlights are as follows:
1. Abandon clustering and adopt softened classification
Disadvantages of clustering: clustering-based methods coarsely divide the images into clusters for training, which makes the model highly dependent on the clustering result. As shown in Fig. 1 (b), images of the same person can be split across different clusters and are then trained with wrongly assigned pseudo labels. Since errors in unsupervised clustering are inevitable, learning with a hard quantization loss tends to fit the noisy labels produced by clustering.
Softened label classification: unlike clustering-based methods, where each image belongs to exactly one category through a one-hot label, this paper mines the relationships between unlabeled images as a mild constraint. The authors assign soft labels based on the top k images most similar to the target, treating the label as a distribution and encouraging each image to be associated with several related classes. In the figure below, purple marks the target and yellow marks the k reliable images closest to it.
2. Auxiliary information is introduced to help find similar images
The soft-label constraint is relatively weak, but compared with hard classification it also gives the algorithm more room. Therefore, when measuring the similarity between images, the global and part features of each pedestrian image and the camera ID are also taken into account.
02 Proposed method
The framework can be divided into three components (shown as three colored rectangles):
- A baseline classification network classifies each image into its own category and produces the feature representation;
- The similarity between unlabeled images is explored based on the feature embedding and auxiliary information, and k reliable images are selected for each training image;
- The target label distribution is softened according to the k reliable images, and the network is fine-tuned with the softened labels so that each image is pulled closer to its k reliable images and pushed away from the others.
The implementation steps of each component are described next.
1、 Baseline: initialization with hard labels
The red box and red arrows in the overall framework diagram belong to this baseline step.
The baseline maximizes the cosine distance between each image feature and the other entries of the lookup table, while minimizing the cosine distance between each image feature and its corresponding centroid feature. By learning to recognize each unlabeled image individually, the baseline network gains its initial discriminative ability.
1. Label initialization: because no ground-truth identity labels are available, each image is labeled by its index, so every image is initially treated as a separate class.
2. Non-parametric classifier:
Classification model: my understanding is that the normalized image features are used for classification directly, without passing through any further layers, which is why the classifier is called non-parametric.
The authors use a lookup table V to store the features of all training images, and the stored feature of each image serves as the weight vector of its class. Softmax then performs the multi-class classification.
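① Data preprocessing: the feature is L2-normalized, $v = \phi(\theta; x) / \lVert \phi(\theta; x) \rVert$, where $\phi(\theta; x)$ is the embedding of image $x$ produced by the network with parameters $\theta$.
② Classification: the probability that an image x belongs to the i-th class is defined by softmax:
$$p(i \mid x, V) = \frac{\exp(V_i^{\top} v / \tau)}{\sum_{j=1}^{N} \exp(V_j^{\top} v / \tau)}$$
where $V_i$ is the i-th row of the lookup table V and stores the weight parameters (i.e. the image feature) of that class, $v$ is the normalized feature of x, N is the number of classes (one per training image), and $\tau$ is a temperature parameter that controls how soft the probability distribution over the classes is (i.e. how hard the label is).
③ Loss and optimizer
Loss: cross-entropy loss
$$\mathcal{L} = -\sum_{j} t(y_j) \log p(y_j \mid x_i, V)$$
where $t(y_j)$ is the conditional empirical distribution over the class labels: the probability is set to 1 for the ground-truth class and to 0 for all other classes.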
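The baseline steps above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy feature size and the temperature value `tau=0.1` are chosen for the example, and the lookup table is simply filled with the normalized features instead of being updated during training.

```python
import numpy as np

def l2_normalize(features, eps=1e-12):
    """Scale each row to unit L2 norm so dot products equal cosine similarity."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    return features / np.maximum(norms, eps)

def class_probabilities(v, lookup_table, tau=0.1):
    """Softmax over similarities between features v (B, D) and table V (N, D)."""
    logits = v @ lookup_table.T / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

def hard_label_loss(probs, labels):
    """Cross-entropy with one-hot targets: mean of -log p(y_i | x_i, V)."""
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(picked).mean())

# Toy data (assumption): 6 training images with 8-dimensional features.
rng = np.random.default_rng(0)
V = l2_normalize(rng.normal(size=(6, 8)))  # lookup table, one row per image
v = V.copy()                               # features of the same images
labels = np.arange(6)                      # each image is its own class

probs = class_probabilities(v, V, tau=0.1)
loss = hard_label_loss(probs, labels)
print(probs.shape, loss)
```

A smaller `tau` sharpens the distribution towards the most similar lookup-table row, while a larger `tau` flattens it.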
2、 Model learning with softened similarity
The green and blue parts of the overall framework diagram belong to this step.
In addition to minimizing the cosine distance between each image feature and its ground-truth feature in the lookup table, the model also minimizes the distance between each image feature and its reliable images. At the same time, the cosine distance between each image feature and the features of all other classes is maximized.
Forcing features of the same person into different classes hurts the network. The authors therefore propose assigning similar representations to images estimated to show the same pedestrian, i.e. the soft-label method.
1. Similarity calculation: for two images $x_i$ and $x_j$, the dissimilarity between them is written as $D(x_i, x_j)$ (see the next section for how this distance is computed).
2. Define labels: for each image $x_i$, the k images closest to it are called its reliable images. This set is denoted $\mathcal{K}_i$, and the labels of its members are called the reliable classes of $x_i$. The reliable images are estimated to show the same person as $x_i$, but they are not merged with it into a single class.
3. Redefine the target label: the authors propose a softened classification network that learns the similarity between identities in a smoother way (without hard labels), rather than training the k reliable images as one class. During training, the network should predict each image not only as its ground-truth class but also as its reliable classes. Therefore, a non-zero value is assigned to the reliable classes in the target label. The target label distribution of image $x_i$ is written as:
$$t(y_j) = \begin{cases} \lambda, & y_j = y_i \\ (1 - \lambda)/k, & y_j \text{ is a reliable class of } x_i \\ 0, & \text{otherwise} \end{cases}$$
where $y_i$ is the ground-truth (index) class of $x_i$ and λ is a hyperparameter that balances the ground-truth class against the reliable classes. When λ = 1, the model reduces to the baseline network with 0/1 labels: it learns to recognize the ground-truth label of each image but cannot learn the similarity and consistency among images of the same person. Conversely, when λ is too small, the model may fail to predict the ground-truth label.
4. Loss: cross-entropy loss
Images are labeled with a soft label distribution (representing probabilities) rather than a one-hot label. The target is no longer only the ground-truth class but a distribution that also covers the k reliable classes. Taking the reliable classes into account lowers the weight placed on the ground-truth class and raises the weight of the reliable classes, which guides the network to learn the similarity between pedestrian images smoothly.
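The softened target and its cross-entropy can be sketched as follows. This is a minimal illustration that assumes the ground-truth class receives weight λ and the remaining 1 - λ is shared equally among the k reliable classes; the function names and the value `lam=0.6` are hypothetical choices for the example.

```python
import numpy as np

def soft_targets(index, reliable, num_classes, lam=0.6):
    """Target distribution for one image.

    index:    ground-truth (index) class of the image
    reliable: indices of its k reliable classes (assumed not to contain index)
    lam:      weight kept on the ground-truth class (assumed value)
    """
    t = np.zeros(num_classes)
    t[list(reliable)] = (1.0 - lam) / len(reliable)
    t[index] = lam
    return t

def softened_loss(probs, targets, eps=1e-12):
    """Cross-entropy between a soft target and the predicted distribution."""
    return float(-(targets * np.log(probs + eps)).sum())

# Toy example: 6 classes, image 0 has reliable classes 2 and 5.
t_soft = soft_targets(0, [2, 5], num_classes=6, lam=0.6)
t_hard = soft_targets(0, [2, 5], num_classes=6, lam=1.0)  # baseline one-hot case

uniform = np.full(6, 1.0 / 6.0)  # an untrained, uniform prediction
loss = softened_loss(uniform, t_soft)
print(t_soft, t_hard, loss)
```

With `lam=1.0` the target collapses to the one-hot label of the baseline, which matches the limiting case described above.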
3、 Similarity estimation with auxiliary information
To obtain better results, the authors also use auxiliary information to help estimate similarity.
Part similarity exploration
After extracting the CNN feature map, the authors divide it horizontally into P parts. Each part is average-pooled into a part feature representation. The average distance between the corresponding parts of two images is taken as the part distance between them:
$$D_p(x_i, x_j) = \frac{1}{P} \sum_{p=1}^{P} \lVert \phi^{p}(\theta; x_i) - \phi^{p}(\theta; x_j) \rVert$$
where $\phi^{p}(\theta; \cdot)$ is the feature embedding function of the p-th part of an image.
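The part distance can be sketched as below. This is an illustration only: it assumes a feature map of shape (channels, height, width), horizontal stripes taken along the height axis, and the Euclidean norm as the per-part distance; the function names and `num_parts=8` are hypothetical.

```python
import numpy as np

def part_features(feature_map, num_parts):
    """Split a (C, H, W) feature map into horizontal stripes and average-pool
    each stripe into one C-dimensional vector. Returns an array of shape (P, C)."""
    stripes = np.array_split(feature_map, num_parts, axis=1)  # split along H
    return np.stack([s.mean(axis=(1, 2)) for s in stripes])

def part_distance(fmap_a, fmap_b, num_parts=8):
    """Mean Euclidean distance between corresponding part features."""
    pa = part_features(fmap_a, num_parts)
    pb = part_features(fmap_b, num_parts)
    return float(np.linalg.norm(pa - pb, axis=1).mean())

# Toy feature maps (assumption): 4 channels, 16 rows, 8 columns.
rng = np.random.default_rng(0)
fmap_a = rng.normal(size=(4, 16, 8))
fmap_b = rng.normal(size=(4, 16, 8))

parts_a = part_features(fmap_a, 8)
d_ab = part_distance(fmap_a, fmap_b)
d_aa = part_distance(fmap_a, fmap_a)
print(parts_a.shape, d_ab, d_aa)
```

Comparing stripe by stripe keeps local details (for example upper body versus legs) from being averaged away in a single global feature.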
The cross-camera encouragement (CCE)
With the CCE term, the dissimilarity between images with the same camera ID increases. CCE therefore helps include more reliable images from different cameras and removes some negative images from the same camera.
The performance of person re-identification is affected by differences between cameras. Images taken by the same camera "naturally" share some similarity. The paper therefore proposes a cross-camera encouragement (CCE) term, which encourages images taken under different cameras to be selected as reliable images.
This has two benefits. First, by learning cross-camera information, the network can predict similar features for a person in different camera views, which helps the re-identification task. Second, many different pedestrians wearing similar clothes appear under the same camera, and CCE helps find the true matches across cameras instead of these negative samples.
As shown in the figure below, without CCE, the query image and the image captured by CAM3 look very different because of the camera gap, even though they show the same person. Meanwhile, a negative sample (the red example) ends up at a small distance from the query simply because both come from the same camera.
The authors denote the camera ID of training sample $x_i$ as $c_i$. The CCE term between two images $x_i$ and $x_j$ is:
$$D_c(x_i, x_j) = \begin{cases} \lambda_c, & c_i = c_j \\ 0, & c_i \neq c_j \end{cases}$$
where $\lambda_c$ is a parameter that controls the influence of CCE.
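A minimal sketch of the CCE term, assuming it adds a constant penalty to every pair of images that share a camera ID and nothing otherwise. The function name and the value `lam_c=0.02` are placeholders for illustration, not values taken from the paper.

```python
import numpy as np

def cce_matrix(camera_ids, lam_c=0.02):
    """Pairwise CCE term: lam_c where two images share a camera, else 0."""
    cams = np.asarray(camera_ids)
    same_camera = cams[:, None] == cams[None, :]
    return np.where(same_camera, lam_c, 0.0)

# Toy example: images 0 and 1 come from camera 1, image 2 from camera 3.
d_c = cce_matrix([1, 1, 3], lam_c=0.02)
print(d_c)
```

Because the penalty is only added to same-camera pairs, cross-camera pairs move up in the nearest-neighbour ranking without their own distances being changed.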
After adding the CCE term and the part similarity, the overall distance is defined as:
$$D(x_i, x_j) = (1 - \lambda_p)\, D_g(x_i, x_j) + \lambda_p\, D_p(x_i, x_j) + D_c(x_i, x_j)$$
where $D_g$ is the global distance between the whole-image features, $D_p$ is the part distance, $D_c$ is the CCE term, and $\lambda_p$ balances the contributions of global and part similarity. As shown in the green part of the overall framework, the dissimilarity between two images consists of the global distance, the part distance and the cross-camera encouragement. Computing both the global and the part distance measures the similarity of the overall appearance as well as of local details, which makes the selection of reliable images more accurate.
With the CCE term, images from different cameras are selected as reliable images more often, which lets the network learn from more diverse images. Both components improve the discriminative ability of the trained model.
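The selection of reliable images can be sketched as follows. This is a hedged illustration: it assumes the overall distance is a convex combination of the global and part distances plus the CCE term, and that the k reliable images are simply the k nearest neighbours under that distance. The function names, `lam_p=0.5`, the CCE weight 0.05 and the toy distance values are all made up for the example.

```python
import numpy as np

def overall_distance(d_global, d_part, d_cce, lam_p=0.5):
    """Combine global distance, part distance and the CCE term (assumed form)."""
    return (1.0 - lam_p) * d_global + lam_p * d_part + d_cce

def select_reliable(distance, k):
    """Indices of the k nearest images for every image, excluding itself."""
    d = np.array(distance, dtype=float)  # copy so the input is not modified
    np.fill_diagonal(d, np.inf)          # an image is not its own reliable image
    return np.argsort(d, axis=1)[:, :k]

# Toy example: images 0, 1 share camera 0; images 2, 3 share camera 1.
cams = np.array([0, 0, 1, 1])
d_global = np.array([
    [0.00, 0.50, 0.51, 0.90],
    [0.50, 0.00, 0.80, 0.52],
    [0.51, 0.80, 0.00, 0.50],
    [0.90, 0.52, 0.50, 0.00],
])
d_part = d_global.copy()  # reuse the same values to keep the example small
d_cce = np.where(cams[:, None] == cams[None, :], 0.05, 0.0)

without_cce = select_reliable(overall_distance(d_global, d_part, 0.0), k=1)
with_cce = select_reliable(overall_distance(d_global, d_part, d_cce), k=1)
print(without_cce.ravel(), with_cce.ravel())
```

In this toy setup the nearest neighbour of image 0 is a same-camera image when the CCE term is left out and a cross-camera image once it is added, which is the effect the CCE term is meant to produce.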
Comparison with the state of the art
Compared with all unsupervised methods, the authors' method achieves state-of-the-art results on the two image datasets Market-1501 and DukeMTMC-reID.
Compared with all unsupervised methods, the authors' method also achieves state-of-the-art results on the two video datasets MARS and DukeMTMC-VideoReID.
The authors study the hyperparameters on Market-1501, including λ, the number of reliable images k, and other parameters.
Finally, ablation experiments on the part information and the CCE term are carried out on the Market-1501 and DukeMTMC datasets, showing that both are necessary.