This paper proposes CornerNet, which detects objects by detecting pairs of corners and achieves performance comparable to the state-of-the-art detectors of its time. CornerNet borrows its approach from human pose estimation and opens up a new framework for object detection. Many later papers build on CornerNet to develop new corner-based detectors.

Source: Xiaofei’s algorithm Engineering Notes official account

**CornerNet: Detecting Objects as Paired Keypoints**

**Paper address: https://arxiv.org/abs/1808.01244**

**Paper code: https://github.com/princeton-vl/CornerNet**

# Introduction

Most object detection algorithms rely heavily on anchor boxes. The paper argues that anchor boxes have two drawbacks: 1) a large number of anchor boxes must be tiled over the feature map to avoid missed detections, but only a small fraction of them end up matching an object, which creates an imbalance between positive and negative samples and slows down training; 2) anchor boxes introduce extra hyperparameters and design choices, which makes training the model more complicated.

Based on these considerations, the paper proposes CornerNet, which formulates object detection as detecting the top-left and bottom-right corners of a bounding box. The network structure is shown in Figure 1. A convolutional network predicts one set of heatmaps for top-left corners and one for bottom-right corners, and the two sets of corners are then paired to output predicted boxes, which completely eliminates the need for anchor boxes. Experiments show that CornerNet performs on par with the mainstream algorithms of its time, establishing a new paradigm for object detection.

# CornerNet

### Overview

CornerNet detects an object by detecting the top-left and bottom-right corners of its bounding box. A convolutional network predicts two sets of heatmaps that represent the corner locations of the different object categories, one set for top-left corners and one for bottom-right corners. To pair top-left corners with bottom-right corners, the network predicts an embedding vector for each corner, trained so that the distance between the embeddings of two corners belonging to the same object is very small. In addition, offsets are predicted to slightly adjust the corner positions.
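To make the pairing step concrete, here is a minimal Python sketch of how predicted corners might be grouped into boxes by embedding distance. The tuple layout, the function name `group_corners`, the `max_dist` threshold, and the use of the averaged corner scores as the box score are all assumptions made for illustration, not the authors' implementation.

```python
def group_corners(top_lefts, bottom_rights, max_dist=0.5):
    """Pair corners into boxes (illustrative sketch).

    Each corner is a tuple (x, y, score, embedding, category), where the
    embedding is a single float. max_dist is an assumed threshold.
    """
    boxes = []
    for x1, y1, s1, e1, c1 in top_lefts:
        for x2, y2, s2, e2, c2 in bottom_rights:
            if c1 != c2:
                continue  # corners of different categories never match
            if x2 <= x1 or y2 <= y1:
                continue  # bottom-right must lie below and to the right
            if abs(e1 - e2) > max_dist:
                continue  # embeddings too far apart: different objects
            # Assumption: score the box with the mean of the corner scores.
            boxes.append((x1, y1, x2, y2, (s1 + s2) / 2, c1))
    # Highest-scoring boxes first.
    return sorted(boxes, key=lambda b: b[4], reverse=True)


top_lefts = [(10, 10, 0.9, 0.10, 0), (50, 40, 0.8, 2.00, 0)]
bottom_rights = [(30, 30, 0.7, 0.15, 0), (90, 80, 0.6, 2.10, 0)]
boxes = group_corners(top_lefts, bottom_rights)
```

In this toy input, only the corner pairs whose embeddings are close (0.10 with 0.15, and 2.00 with 2.10) survive, so two boxes are produced.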

The structure of CornerNet is shown in Figure 4. An hourglass network serves as the backbone, followed by two independent prediction modules, one for top-left corners and one for bottom-right corners. Each prediction module applies corner pooling and then outputs the heatmaps, embedding vectors, and offsets used for the final prediction.

### Detecting Corners

The predicted heatmaps have size $C \times H \times W$, where $C$ is the number of categories (there is no background channel). Each ground-truth corner corresponds to exactly one positive location, and all other locations are negative. During training, however, negative locations are not penalized equally: the penalty is reduced for negative locations within a radius of a positive location. The reason is that a pair of false corners close to the ground-truth corners can still produce a box with a sufficiently high IoU, as shown in Figure 5.

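The radius is set according to the size of the object so that a pair of points within the radius generates a box whose IoU with the ground-truth box is at least $t$. Given the radius, the penalty reduction follows an unnormalized two-dimensional Gaussian $e^{-\frac{x^2 + y^2}{2\sigma^2}}$ centered at the positive location, where $x$ and $y$ are the distances relative to the positive location and $\sigma$ is 1/3 of the radius. Let $p_{cij}$ be the predicted score at position $(i, j)$ for category $c$, and let $y_{cij}$ be the ground-truth score obtained from the Gaussian kernel. A variant of focal loss is designed:

$$L_{det} = \frac{-1}{N} \sum_{c=1}^{C} \sum_{i=1}^{H} \sum_{j=1}^{W} \begin{cases} (1 - p_{cij})^{\alpha} \log(p_{cij}) & \text{if } y_{cij} = 1 \\ (1 - y_{cij})^{\beta} (p_{cij})^{\alpha} \log(1 - p_{cij}) & \text{otherwise} \end{cases}$$

Here $N$ is the number of objects in the image, and $\alpha$ and $\beta$ are hyperparameters that control the contribution of each location (the paper sets $\alpha = 2$ and $\beta = 4$).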
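The penalty reduction can be illustrated with a small pure-Python sketch for a single category. It builds a Gaussian ground-truth heatmap around one corner and evaluates a focal-loss variant in which negatives are down-weighted by $(1 - y)^{\beta}$. The function names, the circular window, and the `eps` term are assumptions made for this example; the defaults `alpha=2` and `beta=4` follow the values reported in the paper.

```python
import math


def gaussian_target(height, width, cx, cy, radius):
    """Heatmap for one corner at column cx, row cy (requires radius > 0).

    The value is 1 at the corner and decays as an unnormalized Gaussian with
    sigma = radius / 3. Assumption: values outside a circle of the given
    radius are left at 0.
    """
    sigma = radius / 3.0
    heat = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            d2 = (x - cx) ** 2 + (y - cy) ** 2
            if d2 <= radius * radius:
                heat[y][x] = math.exp(-d2 / (2 * sigma * sigma))
    return heat


def corner_focal_loss(pred, target, num_objects, alpha=2, beta=4, eps=1e-12):
    """Focal-loss variant: negatives near a corner are penalized less."""
    total = 0.0
    for pred_row, target_row in zip(pred, target):
        for p, y in zip(pred_row, target_row):
            if y == 1.0:  # positive location
                total += (1 - p) ** alpha * math.log(p + eps)
            else:  # negative location, weight shrinks as y approaches 1
                total += (1 - y) ** beta * p ** alpha * math.log(1 - p + eps)
    return -total / num_objects


target = gaussian_target(5, 5, cx=2, cy=2, radius=2)

# Two predictions with one confident false positive each:
near = [[0.01] * 5 for _ in range(5)]
near[2][3] = 0.9  # one pixel away from the true corner
far = [[0.01] * 5 for _ in range(5)]
far[0][0] = 0.9   # outside the radius

loss_near = corner_focal_loss(near, target, num_objects=1)
loss_far = corner_focal_loss(far, target, num_objects=1)
```

The false positive next to the true corner yields a smaller loss than the one outside the radius, which is the intended effect of the Gaussian weighting.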

Because the network downsamples its input, a position $(x, y)$ in the original image is mapped to $(\lfloor \frac{x}{n} \rfloor, \lfloor \frac{y}{n} \rfloor)$ in the heatmap, where $n$ is the downsampling factor. When heatmap locations are mapped back to the original image, some precision is lost to this rounding, which can greatly affect the IoU of small objects. To address this, the paper predicts offsets that slightly adjust the corner positions before they are mapped back to the input resolution:

$$o_k = \left( \frac{x_k}{n} - \left\lfloor \frac{x_k}{n} \right\rfloor, \frac{y_k}{n} - \left\lfloor \frac{y_k}{n} \right\rfloor \right)$$

Here $o_k$ is the offset, and $x_k$ and $y_k$ are the coordinates of corner $k$. Note that the network predicts one set of offsets for top-left corners and another for bottom-right corners, and each set is shared across all categories. During training, a smooth L1 loss is applied at ground-truth corner locations to train the offsets:

$$L_{off} = \frac{1}{N} \sum_{k=1}^{N} \text{SmoothL1Loss}(o_k, \hat{o}_k)$$

where $\hat{o}_k$ is the predicted offset for corner $k$ and $N$ is the number of objects.
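As a worked example of the offset idea, the sketch below computes the fractional part that is lost when a corner is mapped to the heatmap by flooring, and shows how adding it back recovers the original coordinates. The helper names (`corner_offset`, `refine_corner`, `smooth_l1`) and the `beta` parameter of the smooth L1 function are assumptions chosen for illustration.

```python
import math


def corner_offset(x, y, n):
    """Fraction lost when image position (x, y) is floored onto a heatmap
    that is n times smaller than the image."""
    return (x / n - math.floor(x / n), y / n - math.floor(y / n))


def refine_corner(hx, hy, offset, n):
    """Map heatmap location (hx, hy) back to image coordinates, using the
    offset to undo the rounding."""
    return ((hx + offset[0]) * n, (hy + offset[1]) * n)


def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 on one scalar: quadratic below beta, linear above."""
    d = abs(pred - target)
    return 0.5 * d * d / beta if d < beta else d - 0.5 * beta


n = 4                                  # downsampling factor
x, y = 103, 58                         # corner in the input image
hx, hy = x // n, y // n                # heatmap location: (25, 14)
offset = corner_offset(x, y, n)        # (0.75, 0.5)

without_offset = (hx * n, hy * n)      # (100, 56): up to n - 1 pixels off
with_offset = refine_corner(hx, hy, offset, n)  # (103.0, 58.0)
```

With $n = 4$, ignoring the offset can misplace a corner by up to 3 pixels per axis, which matters most for small objects.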

### Grouping Corners

When an image contains multiple objects, the predicted top-left and bottom-right corners must be matched to each other before complete boxes can be formed. Following a strategy from human pose estimation, the network predicts a one-dimensional embedding for each corner and decides which corners belong together based on the distance between embeddings. Let $e_{t_k}$ be the embedding of the top-left corner of object $k$ and $e_{b_k}$ the embedding of its bottom-right corner. A pull loss groups the corners of the same object and a push loss separates the corners of different objects:

$$L_{pull} = \frac{1}{N} \sum_{k=1}^{N} \left[ (e_{t_k} - e_k)^2 + (e_{b_k} - e_k)^2 \right]$$

$$L_{push} = \frac{1}{N(N-1)} \sum_{k=1}^{N} \sum_{\substack{j=1 \\ j \neq k}}^{N} \max(0, \Delta - |e_k - e_j|)$$

Here $N$ is the number of objects, $e_k$ is the average of $e_{t_k}$ and $e_{b_k}$, and $\Delta = 1$. As with the offset loss, the pull and push losses are applied only at ground-truth corner locations.
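The two grouping losses can be sketched in a few lines of Python. The pull term draws the two corner embeddings of each object toward their mean, and the push term keeps the means of different objects at least a margin apart. The function name and the list-based inputs are assumptions for illustration; embeddings are plain floats because they are one-dimensional.

```python
def pull_push_loss(tl_embeds, br_embeds, delta=1.0):
    """Return (pull, push) for N objects.

    tl_embeds[k] and br_embeds[k] are the scalar embeddings of the top-left
    and bottom-right corners of object k. delta is the push margin.
    """
    n = len(tl_embeds)
    if n == 0:
        return 0.0, 0.0
    means = [(t + b) / 2 for t, b in zip(tl_embeds, br_embeds)]

    # Pull: squared distance of each corner embedding to its object's mean.
    pull = sum((t - m) ** 2 + (b - m) ** 2
               for t, b, m in zip(tl_embeds, br_embeds, means)) / n

    # Push: hinge penalty when two object means are closer than delta.
    push = 0.0
    if n > 1:
        for k in range(n):
            for j in range(n):
                if j != k:
                    push += max(0.0, delta - abs(means[k] - means[j]))
        push /= n * (n - 1)
    return pull, push


# Well-separated, tightly grouped embeddings: both losses vanish.
good = pull_push_loss([0.0, 2.0], [0.0, 2.0])
# Object 0 has spread-out corners, and both objects share the same mean.
bad = pull_push_loss([0.0, 0.5], [1.0, 0.5])
```

In the second call the pull loss is non-zero because the corners of object 0 disagree, and the push loss is at its maximum because the two object means coincide.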

### Corner Pooling

A corner usually lies outside the object, so there is often no local evidence of the object at the corner position. To determine whether a pixel is a top-left corner, we need to look horizontally to the right for the topmost boundary of the object and vertically downward for its leftmost boundary. Based on this prior knowledge, corner pooling is proposed to localize corners.

Suppose we need to determine whether position $(i, j)$ is a top-left corner. Let $f_t$ and $f_l$ be the feature maps that are input to the top-left corner pooling layer, and let $f_{t_{ij}}$ and $f_{l_{ij}}$ be the feature vectors at position $(i, j)$ in $f_t$ and $f_l$. With feature maps of size $H \times W$, corner pooling first max-pools all feature vectors between $(i, j)$ and $(i, H)$ in $f_t$ into a vector $t_{ij}$, likewise max-pools all feature vectors between $(i, j)$ and $(W, j)$ in $f_l$ into a vector $l_{ij}$, and finally adds $t_{ij}$ and $l_{ij}$ together. With the indexing used here, the complete calculation can be expressed as follows:

$$t_{ij} = \begin{cases} \max(f_{t_{ij}}, t_{i(j+1)}) & \text{if } j < H \\ f_{t_{iH}} & \text{otherwise} \end{cases} \tag{6}$$

$$l_{ij} = \begin{cases} \max(f_{l_{ij}}, l_{(i+1)j}) & \text{if } i < W \\ f_{l_{Wj}} & \text{otherwise} \end{cases} \tag{7}$$

Equations 6 and 7 use an element-wise max.

In the implementation, Equations 6 and 7 can be computed efficiently over the whole feature map, as shown in Figure 6, in a way that resembles dynamic programming. For top-left corner pooling, the input feature maps are scanned from right to left for the horizontal max-pooling and from bottom to top for the vertical max-pooling. At each position, the output is the element-wise max of the input at that position and the output at the previous position in the scan. Finally, the two resulting feature maps are added together.
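The scan described above can be written as a short pure-Python sketch for single-channel feature maps stored as nested lists indexed `[row][column]`. The function names are hypothetical, and a real implementation would operate on multi-channel tensors, but the running-max recurrence is the same.

```python
def max_from_below(f):
    """out[r][c] = max of f[r][c], f[r+1][c], ..., down to the last row.
    Computed with one bottom-to-top scan."""
    h, w = len(f), len(f[0])
    out = [row[:] for row in f]
    for r in range(h - 2, -1, -1):
        for c in range(w):
            out[r][c] = max(f[r][c], out[r + 1][c])
    return out


def max_from_right(f):
    """out[r][c] = max of f[r][c], f[r][c+1], ..., up to the last column.
    Computed with one right-to-left scan."""
    h, w = len(f), len(f[0])
    out = [row[:] for row in f]
    for r in range(h):
        for c in range(w - 2, -1, -1):
            out[r][c] = max(f[r][c], out[r][c + 1])
    return out


def top_left_corner_pool(f_t, f_l):
    """Top-left corner pooling: vertical running max of f_t plus horizontal
    running max of f_l, added element-wise."""
    t = max_from_below(f_t)
    l = max_from_right(f_l)
    return [[a + b for a, b in zip(t_row, l_row)] for t_row, l_row in zip(t, l)]


# f_t has evidence somewhere below position (row 1, col 1);
# f_l has evidence somewhere to its right.
f_t = [[0, 0, 0],
       [0, 0, 0],
       [0, 1, 0]]
f_l = [[0, 0, 0],
       [0, 0, 1],
       [0, 0, 0]]
pooled = top_left_corner_pool(f_t, f_l)
```

The pooled map peaks at row 1, column 1, the only position that sees evidence both below it and to its right, even though neither input has any activation there. Each scan visits every position once, so the cost is linear in the size of the feature map.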

The complete structure of the prediction module is shown in Figure 7. It is essentially a modified residual block in which the first $3 \times 3$ convolution module is replaced with a corner pooling module, and it finally outputs the heatmaps, embedding vectors, and offsets.

### Hourglass Network

CornerNet uses the hourglass network as its backbone, a network originally introduced for human pose estimation. The hourglass module, shown in Figure 3, first downsamples the features, then upsamples them back to the original resolution, and adds skip connections to preserve the details lost during downsampling. The hourglass network used in this paper consists of two hourglass modules, with the following modifications:

- Replace the max pooling layers used for downsampling with convolutions of stride 2
- Downsample five times in total, increasing the number of channels along the way (256, 384, 384, 384, 512)
- Upsample with two residual modules followed by nearest-neighbor upsampling
- Each skip connection contains two residual modules
- Before the hourglass modules, reduce the image resolution by a factor of 4 using a $7 \times 7$ convolution module with stride 2 and 128 channels, followed by a residual module with stride 2 and 256 channels
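- The original hourglass network applies intermediate supervision to each hourglass module and adds the intermediate predictions back into the network. CornerNet keeps the intermediate supervision during training but does not add the intermediate predictions back, because the paper finds that doing so hurts performance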

# Experiments

Effect of corner pooling.

Effect of reducing the penalty for negative locations near positive ones.

Effect of combining the hourglass network with corner detection.

Comparison of the heatmap and offset prediction results.

Comparison with other detection networks.

# Conclusion

The paper proposes CornerNet, which detects objects as pairs of corners and performs on par with the state-of-the-art detectors of its time. By borrowing ideas from human pose estimation, CornerNet establishes a new framework for object detection, and many later papers build on it to develop new corner-based detectors.
