# Objects as Points: Predicting Target Centers, without NMS or Other Post-Processing | CVPR 2019

Time: 2021-04-18

This paper proposes the CenterNet algorithm, built on a keypoint prediction network: it regards each detected target as a keypoint, first finds the target's center point, and then regresses its size. Compared with the earlier, identically named CenterNet algorithm, the method in this paper is simpler yet sufficiently powerful; it needs no NMS or other post-processing and can be extended to other detection tasks.

Source: Xiaofei’s algorithm Engineering Notes official account

- Paper: Objects as Points
- Code: https://github.com/xingyizhou/CenterNet

# Introduction

Although current anchor-based methods achieve high performance, they must enumerate all possible target positions and sizes, which is wasteful. This paper therefore proposes the simple and efficient CenterNet, which represents a target by its center point and then regresses the target's size from the center-point feature.

CenterNet converts the input image into a heatmap whose peaks correspond to target centers; the feature vector at a peak is used to predict the target's height and width, as shown in Figure 2. At inference time, only a simple forward pass is needed, and post-processing operations such as NMS are unnecessary.

Compared with existing methods, CenterNet achieves a better trade-off between accuracy and speed. In addition, its architecture is general and can be extended to other tasks, such as 3D object detection and human keypoint estimation.

# Preliminary

Define the input image $I \in \mathcal{R}^{W \times H \times 3}$ and the predicted keypoint heatmap $\hat{Y} \in [0,1]^{\frac{W}{R} \times \frac{H}{R} \times C}$, where $R$ is the output stride of the heatmap (set to 4) and $C$ is the number of keypoint types. $\hat{Y}_{x,y,c}=1$ indicates a detected keypoint at that pixel, while $\hat{Y}_{x,y,c}=0$ indicates background. For the backbone, the paper tries several fully convolutional encoder-decoder networks: the hourglass network, a residual network with deconvolution, and DLA (Deep Layer Aggregation).

The keypoint prediction branch is trained in the same way as CornerNet. For a GT keypoint $p \in \mathcal{R}^2$ of category $c$, its position on the heatmap $\tilde{p} = \lfloor \frac{p}{R} \rfloor$ is computed, and a Gaussian kernel

$$Y_{xyc} = \exp\left(-\frac{(x-\tilde{p}_x)^2 + (y-\tilde{p}_y)^2}{2\sigma_p^2}\right)$$

weights each pixel by its distance to the keypoint to produce the GT heatmap $Y \in [0,1]^{\frac{W}{R} \times \frac{H}{R} \times C}$, where $\sigma_p$ is an object-size-adaptive standard deviation, as shown in Figure 3. If Gaussians of the same class overlap, the element-wise maximum is taken. The training loss is a penalty-reduced pixel-wise logistic regression with focal loss:

$$L_k = \frac{-1}{N} \sum_{xyc} \begin{cases} (1-\hat{Y}_{xyc})^{\alpha} \log(\hat{Y}_{xyc}) & \text{if } Y_{xyc}=1 \\ (1-Y_{xyc})^{\beta} (\hat{Y}_{xyc})^{\alpha} \log(1-\hat{Y}_{xyc}) & \text{otherwise} \end{cases}$$

where $\alpha$ and $\beta$ are focal-loss hyperparameters and $N$ is the number of keypoints. To recover the discretization error caused by downsampling the feature map, a local offset $\hat{O} \in \mathcal{R}^{\frac{W}{R} \times \frac{H}{R} \times 2}$ is additionally predicted for each keypoint. The offset is shared across categories and trained with L1 loss; only GT keypoint positions are supervised, and all other positions do not participate in training.
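The Gaussian splatting and element-wise maximum described above can be sketched in NumPy. This is a minimal illustration, not the official implementation; `gaussian_heatmap` and `render_class_heatmap` are hypothetical helper names.

```python
import numpy as np

def gaussian_heatmap(shape, center, sigma):
    """Splat one GT keypoint onto a heatmap with a Gaussian kernel.

    shape:  (H, W) of the downsampled heatmap
    center: (cx, cy) keypoint position on the heatmap (p / R)
    sigma:  object-size-adaptive standard deviation
    """
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    cx, cy = center
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

def render_class_heatmap(shape, centers, sigma):
    """Build one class's GT heatmap; overlapping Gaussians of the
    same class are merged with the element-wise maximum."""
    heat = np.zeros(shape, dtype=np.float32)
    for c in centers:
        heat = np.maximum(heat, gaussian_heatmap(shape, c, sigma))
    return heat
```

Taking the maximum rather than the sum keeps every value in $[0,1]$, so the heatmap remains a valid per-pixel keypoint probability target.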
# Objects as Points

Define $(x^{(k)}_1, y^{(k)}_1, x^{(k)}_2, y^{(k)}_2)$ as the GT box of target $k$ with category $c_k$; its center point is $p_k = (\frac{x^{(k)}_1 + x^{(k)}_2}{2}, \frac{y^{(k)}_1 + y^{(k)}_2}{2})$. The heatmap is used to obtain all center points, and the size of each target $s_k = (x^{(k)}_2 - x^{(k)}_1, y^{(k)}_2 - y^{(k)}_1)$ is then regressed from the center-point feature. To reduce the computational burden, the size prediction is shared across categories, and only GT keypoints participate in training.
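Mapping one GT box to its training targets (heatmap cell, sub-pixel offset, and size) can be sketched as follows; `box_to_target` is a hypothetical helper illustrating the definitions above, assuming the paper's output stride $R = 4$.

```python
def box_to_target(x1, y1, x2, y2, stride=4):
    """Map a GT box to CenterNet-style training targets (sketch).

    Returns the integer cell on the downsampled heatmap, the
    sub-pixel offset target, and the size target s_k = (w, h)
    in input-image pixels.
    """
    # center point p_k on the input image
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    # integer cell on the R-times downsampled heatmap
    px, py = int(cx / stride), int(cy / stride)
    # offset target recovers the discretization error p/R - floor(p/R)
    off = (cx / stride - px, cy / stride - py)
    size = (x2 - x1, y2 - y1)
    return (px, py), off, size
```

For example, a box (10, 20, 50, 60) has center (30, 40), which lands on heatmap cell (7, 10) with a residual offset of (0.5, 0.0).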

The complete CenterNet loss function is as follows:

$$L_{det} = L_k + \lambda_{size} L_{size} + \lambda_{off} L_{off}$$

where $\lambda_{size} = 0.1$ and $\lambda_{off} = 1$.

CenterNet directly predicts the keypoint heatmap $\hat{Y}$, the offset $\hat{O}$, and the target size $\hat{S}$, for a total of $C + 4$ outputs at each location. All outputs share the backbone features, each followed by its own $3 \times 3$ convolution, ReLU, and $1 \times 1$ convolution.

During inference, the peaks of each class's heatmap are extracted first; a peak's value must be greater than or equal to those of its eight neighbors, and the top 100 peaks are kept. For each peak $(\hat{x}_i, \hat{y}_i)$, the predicted keypoint value $\hat{Y}_{x_i y_i c}$ serves as the detection confidence, and combining it with the predicted offset $\hat{O} = (\delta\hat{x}_i, \delta\hat{y}_i)$ and target size $\hat{S} = (\hat{w}_i, \hat{h}_i)$ yields the prediction box:

$$(\hat{x}_i + \delta\hat{x}_i - \hat{w}_i/2,\ \hat{y}_i + \delta\hat{y}_i - \hat{h}_i/2,\ \hat{x}_i + \delta\hat{x}_i + \hat{w}_i/2,\ \hat{y}_i + \delta\hat{y}_i + \hat{h}_i/2)$$
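The 8-neighbor peak test that replaces NMS, and the box decoding, can be sketched in NumPy for a single-class heatmap. This is an illustrative sketch, not the official code; `extract_peaks` and `decode_box` are hypothetical names, and the official implementation performs the same check with a $3 \times 3$ max-pool.

```python
import numpy as np

def extract_peaks(heat, k=100):
    """Keep cells that are >= all 8 neighbours, then return the
    top-k by score. heat: (H, W) heatmap for one class."""
    h, w = heat.shape
    padded = np.pad(heat, 1, mode="constant", constant_values=-np.inf)
    # local 3x3 maximum around every cell (includes the cell itself)
    local_max = np.max(
        [padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)],
        axis=0,
    )
    ys, xs = np.nonzero(heat >= local_max)   # cells that are local maxima
    scores = heat[ys, xs]
    order = np.argsort(-scores)[:k]          # top-k by confidence
    return xs[order], ys[order], scores[order]

def decode_box(x, y, offset, size, stride=4):
    """Turn a peak into a box using the predicted offset and size."""
    ox, oy = offset
    w, hgt = size
    cx, cy = (x + ox) * stride, (y + oy) * stride
    return (cx - w / 2, cy - hgt / 2, cx + w / 2, cy + hgt / 2)
```

Because the confidence ranking alone filters detections, no IoU-based suppression pass is needed afterwards.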

# Experiment

The accuracy and speed of target detection in different backbone networks are compared.

Target detection performance comparison.

3D detection performance comparison.

Performance comparison of human key point detection.

# Conclusion

This paper proposes the CenterNet algorithm, built on a keypoint prediction network: it regards each detected target as a keypoint, first finds the target's center point, and then regresses its size. Compared with the earlier, identically named CenterNet algorithm, the method in this paper is simpler yet sufficiently powerful; it needs no NMS or other post-processing and can be extended to other detection tasks.