# A new SOTA: pseudo supervised object localization (PSOL) | CVPR 2020

Time: 2020-11-12

This paper proposes pseudo supervised object localization (PSOL) to address weakly supervised object localization. The method splits localization and classification into two independent networks and uses deep descriptor transformation (DDT) to generate pseudo ground-truth (GT) bounding boxes for training the localization network. The overall results reach SOTA, and the approach is worth studying.

Source: Xiaofei’s algorithm Engineering Notes official account

Paper: Rethinking the Route Towards Weakly Supervised Object Localization

# Introduction

Because labeling large amounts of training data is difficult, some research studies how to learn with weak supervision. Weakly supervised training data generally contains only image-level labels, with no object location or semantic labels. Among weakly supervised tasks, weakly supervised object localization (WSOL) is the most practical: it only needs to locate the object of the given label.

Based on its experiments, the paper argues that the localization part of WSOL should be class-agnostic and independent of classification. Following this observation, WSOL is divided into two parts: class-agnostic object localization and object classification. As shown in Figure 1, the resulting method is named pseudo supervised object localization (PSOL). The algorithm first generates pseudo GT bounding boxes (bboxes) through deep descriptor transformation (DDT) and then trains a regressor on these bboxes. This removes two limitations of WSOL: the restriction to a single fully connected layer (whose weights serve as channel-wise weights on the convolutional features), and the trade-off caused by coupling localization and classification.

The main contributions of this paper are as follows:

• Weakly supervised object localization should be divided into two independent parts: class-agnostic object localization and object classification
• Although the generated bboxes are biased, the paper argues that the localization model should be optimized on them directly, without class labels; doing so achieves SOTA
• PSOL transfers well across datasets: the localization model works on a different dataset without fine-tuning

# Related Works

Note that weakly supervised object localization (WSOL) differs from weakly supervised object detection (WSOD). WSOL assumes that the image contains only one object, whereas WSOD makes no such assumption. WSOD therefore generally needs additional methods to generate region proposals.

# Methodology

### A paradigm shift from WSOL to PSOL

Current WSOL methods can generate bboxes with category labels, but they have several problems:

• The learning objective is unclear, which degrades localization performance. A single CNN cannot perform localization and classification well at the same time, because localization requires global features of the object, while classification only needs local features
• CAM (class activation mapping) must store a 3D feature map to compute the heatmap of a class, which is then filtered by a threshold. However, choosing that threshold is very difficult

Inspired by the class-agnostic proposal process of selective search and Faster R-CNN, PSOL divides WSOL into two subtasks: class-agnostic object localization and object classification. The localization model is trained directly on pseudo GT bboxes instead of deriving bboxes from a classification heatmap, which largely avoids the problems mentioned above.

### The PSOL Method

##### Bounding Box Generation

PSOL differs from WSOL in that it generates pseudo bboxes for the training images, which have no bbox annotations. A detector would be the natural choice, since it directly provides bboxes and categories. However, the largest detection dataset has only 80 categories, which is not enough for general-purpose object detection. Moreover, most current detectors need substantial computing resources and large input sizes, so they cannot be applied to large-scale datasets. Besides detection models, localization methods can be applied directly to the training images to produce bboxes; two such options are described next.

1. WSOL methods

First, the pretrained network $F$ produces the final convolutional feature map $G$ of the input image $I$, and the predicted label $L_{pred}$ is obtained through global pooling and the final fully connected layer. According to $L_{pred}$ or the ground-truth label $L_{gt}$, the weight $W \in \mathbb{R}^{d}$ of that specific class is taken from the final fully connected layer. The spatial positions of $G$ are then summed channel-wise with these weights to obtain the class-specific heatmap $H$, $H_{i,j} = \sum_{k=1}^{d} G_{i,j,k} W_k$. Finally, $H$ is upsampled to the original image size, and threshold filtering generates the final bbox.
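As a minimal sketch of the heatmap step described above, the following NumPy code computes $H_{i,j} = \sum_k G_{i,j,k} W_k$ on a toy feature map, upsamples it, and thresholds it into a box. The function names, the nearest-neighbour upsampling, and the threshold ratio of 0.2 are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def cam_heatmap(G, W):
    """Class-specific heatmap H[i, j] = sum_k G[i, j, k] * W[k]."""
    return np.einsum("ijk,k->ij", G, W)

def upsample_nearest(H, out_h, out_w):
    """Nearest-neighbour resize, a simple stand-in for real interpolation."""
    rows = np.arange(out_h) * H.shape[0] // out_h
    cols = np.arange(out_w) * H.shape[1] // out_w
    return H[rows[:, None], cols[None, :]]

def bbox_from_mask(mask):
    """Tightest (x, y, w, h) box around the True pixels, or None if empty."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    x0, y0 = xs.min(), ys.min()
    return int(x0), int(y0), int(xs.max() - x0 + 1), int(ys.max() - y0 + 1)

# Toy 4x4 feature map with d = 2 channels; channel 0 fires on a 2x2 region.
G = np.zeros((4, 4, 2))
G[1:3, 1:3, 0] = 1.0
G[:, :, 1] = 0.1
W = np.array([1.0, -0.5])          # hypothetical FC weights of one class

H = cam_heatmap(G, W)
H_up = upsample_nearest(H, 8, 8)   # pretend the original image is 8x8
threshold = 0.2 * H_up.max()       # assumed threshold ratio
bbox = bbox_from_mask(H_up > threshold)
print(bbox)
```

The resulting box depends directly on the threshold ratio, which is exactly the quantity the text above describes as hard to choose.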

2. DDT recap

Co-supervised (co-localization) methods perform well on localization tasks, and DDT is the best-performing one with the lowest computational cost. For a set $S$ of $n$ images with the same label, the pretrained model $F$ gives the final feature map of each image $I$, $G = F(I) \in \mathbb{R}^{h \times w \times d}$, which can be viewed as $hw$ descriptors in $\mathbb{R}^{hw \times d}$. Gathering these feature maps yields a large descriptor set $G_{all} \in \mathbb{R}^{n \times hw \times d}$. Principal component analysis (PCA) is applied along the depth (channel) dimension to obtain the eigenvector $P \in \mathbb{R}^{d}$ with the largest eigenvalue. The channel-wise weighted sum of $G$ then gives the final heatmap $H$, $H_{i,j} = \sum_{k=1}^{d} G_{i,j,k} P_k$. $H$ is upsampled to the original image size, thresholded at zero, and the largest connected component is used to obtain the bbox.
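The DDT steps above can be sketched in NumPy as follows. This is an illustrative toy, not the authors' implementation: descriptors are mean-centered before projection (standard for PCA, though the formula in the text omits it), the sign of $P$ is fixed so that its components sum to a positive value (an assumption made here to keep the example deterministic), and the upsampling step is skipped. All function names are hypothetical.

```python
import numpy as np
from collections import deque

def ddt_direction(G_all):
    """First principal direction P (length d) of descriptors (n, hw, d), and their mean."""
    X = G_all.reshape(-1, G_all.shape[-1])
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)            # d x d covariance over all descriptors
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    return eigvecs[:, -1], mean

def largest_component_bbox(mask):
    """(x, y, w, h) of the largest 4-connected True region, found by BFS."""
    h, w = mask.shape
    seen = np.zeros_like(mask, dtype=bool)
    best = []
    for sy in range(h):
        for sx in range(w):
            if not mask[sy, sx] or seen[sy, sx]:
                continue
            comp, queue = [], deque([(sy, sx)])
            seen[sy, sx] = True
            while queue:
                y, x = queue.popleft()
                comp.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                        seen[ny, nx] = True
                        queue.append((ny, nx))
            if len(comp) > len(best):
                best = comp
    if not best:
        return None
    ys, xs = zip(*best)
    return min(xs), min(ys), max(xs) - min(xs) + 1, max(ys) - min(ys) + 1

# Toy data: n = 2 images, 4x4 feature maps, d = 3 channels.
G_all = np.zeros((2, 4, 4, 3))
G_all[0, 1:3, 1:3, :2] = 1.0   # object region in image 0
G_all[0, 0, 3, :2] = 1.0       # isolated spurious activation
G_all[1, 0:2, 2:4, :2] = 1.0   # object region in image 1

P, mean = ddt_direction(G_all.reshape(2, 16, 3))
if P.sum() < 0:                # assumed sign convention
    P = -P
H = (G_all[0] - mean) @ P      # heatmap of image 0, shape (4, 4)
bbox = largest_component_bbox(H > 0)
print(bbox)
```

Keeping only the largest connected component discards the isolated activation at the top-right corner, so the box covers just the 2x2 object region.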

##### Localization Methods

After the pseudo bboxes are generated, bbox regression is used to train the localization model. Single-class regression (SCR) is used here. Let the bbox be $(x, y, w, h)$, where $(x, y)$ is the top-left corner and $(w, h)$ are the width and height. The values are first normalized: $x^* = \frac{x}{w_i}$, $y^* = \frac{y}{h_i}$, $w^* = \frac{w}{w_i}$, $h^* = \frac{h}{h_i}$, where $w_i$ and $h_i$ are the width and height of the input image. The final output passes through a sigmoid activation, and training uses a least-squares loss.
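The normalization and loss described above can be illustrated with a small pure-Python sketch. The function names are hypothetical, and averaging the squared errors over the four coordinates is an assumption; the text only states that a least-squares loss is used.

```python
import math

def normalize_bbox(bbox, img_w, img_h):
    """Map (x, y, w, h) in pixels to (x*, y*, w*, h*) in [0, 1]."""
    x, y, w, h = bbox
    return (x / img_w, y / img_h, w / img_w, h / img_h)

def denormalize_bbox(pred, img_w, img_h):
    """Map a normalized prediction back to pixel coordinates."""
    x, y, w, h = pred
    return (x * img_w, y * img_h, w * img_w, h * img_h)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def scr_loss(logits, target):
    """Mean squared error between sigmoid(logits) and the normalized target box."""
    preds = [sigmoid(z) for z in logits]
    return sum((p - t) ** 2 for p, t in zip(preds, target)) / len(target)

# A pseudo GT box on a 224x224 input image.
target = normalize_bbox((56, 28, 112, 168), 224, 224)
# An untrained regressor that outputs zero logits predicts 0.5 everywhere.
loss = scr_loss([0.0, 0.0, 0.0, 0.0], target)
print(target, loss)
```

Because the targets lie in $[0, 1]$, the sigmoid output covers exactly the valid range, and `denormalize_bbox` recovers pixel coordinates at inference time.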

# Experiments

### Experimental Setups

• Datasets: ImageNet-1k and CUB-200 are used. The bboxes of the test data are accurately annotated, while the bboxes of the training set are generated by the methods described above
• Metrics: three metrics are used. GT-known Loc is correct when the IoU between the predicted bbox and the GT bbox is greater than 50%; Top-1 Loc is correct when both the top-1 classification and GT-known Loc are correct; Top-5 Loc is correct when the correct class is among the top-5 predictions and GT-known Loc is correct
• Base models: VGG16, Inception V3, ResNet50, and DenseNet161, without enlarging the input image size. Some WSOL methods need the class-specific weights of a single fully connected layer to generate the heatmap, but PSOL does not. For a fair comparison, a variant in which the fully connected layers are replaced by a single fully connected layer is also considered
• Joint and separate optimization: for the jointly optimized model (-Joint), a bbox regression branch is added to the original classification network, and classification and localization are trained at the same time. For the separately optimized model (-Sep), two models are trained independently
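The three localization metrics listed above can be sketched in plain Python. The sketch assumes boxes in `(x, y, w, h)` format with a top-left origin and a single class-agnostic predicted box per image; the function names are hypothetical, and the strict `>` comparison follows the wording "greater than 50%".

```python
def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def gt_known_loc(pred_box, gt_box, thresh=0.5):
    """Correct when the predicted box overlaps the GT box with IoU above thresh."""
    return iou(pred_box, gt_box) > thresh

def top_k_loc(pred_box, gt_box, ranked_labels, gt_label, k):
    """Correct when the GT label is in the top-k predictions and the box is correct."""
    return gt_label in ranked_labels[:k] and gt_known_loc(pred_box, gt_box)

pred_box = (10, 10, 100, 100)
gt_box = (20, 20, 100, 100)
ranked_labels = [3, 7, 1, 9, 4]   # hypothetical classifier output, best first
gt_label = 7

overlap = iou(pred_box, gt_box)
known = gt_known_loc(pred_box, gt_box)
top1 = top_k_loc(pred_box, gt_box, ranked_labels, gt_label, k=1)
top5 = top_k_loc(pred_box, gt_box, ranked_labels, gt_label, k=5)
print(overlap, known, top1, top5)
```

In this example the box is localized correctly, yet Top-1 Loc fails because the classifier ranks the true label second. This shows why GT-known Loc isolates localization quality from classification quality.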

# Results and Analyses

### Ablation Studies on How to Generate Pseudo Bounding Boxes

On the validation set, DDT-VGG16 gives the best performance.

### Comparison with State-of-the-art Methods

Comparing with SOTA methods and visualizing the results leads to the following observations:

• DDT alone already outperforms WSOL methods, which shows that class-agnostic localization is useful and that WSOL should be split into two independent models
• All PSOL models perform better with separate training than with joint training, which shows that localization and classification learn different things
• PSOL has a large advantage on CUB-200. Because the categories are very similar, the category label may not help localization, whereas the co-localization of DDT is more effective
• CNNs can cope with noisy labels and still reach higher accuracy: the GT-known Loc of the PSOL models is generally higher than that of DDT-VGG16
• Some constraints of WSOL do not carry over to PSOL, for example allowing only a single fully connected layer and requiring a larger output feature map. Removing the usual three fully connected layers hurts accuracy: VGG-Full is better than VGG-GAP. In addition, WSOL methods are not effective on complex networks such as DenseNet. The main reason is that DenseNet uses features from multiple layers for classification, not only the last layer, so the semantics of the last layer are not as clear as in VGG. PSOL-DenseNet avoids this problem and achieves the highest accuracy

### Transfer Ability on Localization

PSOL models trained on ImageNet transfer to CUB-200 without any additional supervision and even outperform fine-tuned WSOL methods, which suggests that tying object localization to category information is unnecessary.

### Combining with State-of-the-art Classification

When combined with state-of-the-art classification models, PSOL still outperforms WSOL methods.

### Comparison with fully supervised methods

The paper's description of the comparison with fully supervised methods is not very clear. The supervised classification networks in the table presumably use a WSOL method plus a localization loss. According to the results, the Faster R-CNN ensemble from ILSVRC has the highest accuracy. The region proposal network generalizes well across categories without fine-tuning, which suggests that localization and classification can be separated.

# Conclusion

This paper proposes pseudo supervised object localization (PSOL) to address the problems of weakly supervised object localization. The method splits localization and classification into two independent networks and uses deep descriptor transformation (DDT) to generate pseudo GT bboxes for training. The overall results reach SOTA, and the approach is worth studying.