This paper proposes CenterNet, a detection algorithm built on keypoint prediction: it treats each detected object as a keypoint, first finds the object's center point, and then regresses the object's size. Compared with the identically named CenterNet algorithm from an earlier paper, the method here is simpler yet powerful enough; it needs no NMS or other post-processing and can be extended to other detection tasks.

Source: Xiaofei's Algorithm Engineering Notes (WeChat official account)

**Paper: Objects as Points**

**Address: https://arxiv.org/abs/1904.07850**
**Code: https://github.com/xingyizhou/CenterNet**

# Introduction

Although current anchor-based methods achieve high performance, they must enumerate all possible object positions and sizes, which is wasteful. This paper therefore proposes the simple and efficient CenterNet, which represents each object by its center point and then regresses the object's size from the center-point features.

CenterNet converts the input image into a heatmap whose peaks correspond to object centers; the feature vector at each peak is used to predict the object's height and width, as shown in Figure 2. At inference time only a simple forward pass is needed, without post-processing such as NMS.

Compared with existing methods, CenterNet offers a better accuracy-speed trade-off. In addition, the CenterNet architecture is general and can be extended to other tasks, such as 3D object detection and human keypoint estimation.

# Preliminary

Define the input image $I \in R^{W \times H \times 3}$ and the predicted keypoint heatmap $\hat{Y} \in [0,1]^{\frac{W}{R} \times \frac{H}{R} \times C}$, where $R$ is the output stride of the heatmap (set to 4) and $C$ is the number of keypoint types. A value $\hat{Y}_{x,y,c}=1$ means the pixel is a detected keypoint, while $\hat{Y}_{x,y,c}=0$ means background. For the backbone, the paper tries several fully convolutional encoder-decoder networks: the hourglass network, residual networks with deconvolution, and DLA (deep layer aggregation).

The keypoint prediction branch is trained in the same way as CornerNet. For each GT keypoint $p \in \mathcal{R}^2$ of category $c$, its corresponding position $\tilde{p} = \lfloor \frac{p}{R} \rfloor$ on the heatmap is computed, and the GT heatmap $Y \in [0,1]^{\frac{W}{R} \times \frac{H}{R} \times C}$ is produced by splatting a Gaussian kernel $Y_{xyc} = \exp\left(-\frac{(x-\tilde{p}_x)^2 + (y-\tilde{p}_y)^2}{2\sigma_p^2}\right)$, which assigns each pixel a weight according to its distance from the keypoint; $\sigma_p$ is an object-size-adaptive standard deviation, as shown in Figure 3. If two Gaussians of the same class overlap, the element-wise maximum is taken. The training loss is a penalty-reduced pixel-wise logistic regression with a focal loss.
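Written out from the paper, the keypoint training objective is:

$$
L_k = \frac{-1}{N} \sum_{xyc}
\begin{cases}
(1-\hat{Y}_{xyc})^{\alpha} \log(\hat{Y}_{xyc}) & \text{if } Y_{xyc}=1 \\
(1-Y_{xyc})^{\beta} \, (\hat{Y}_{xyc})^{\alpha} \log(1-\hat{Y}_{xyc}) & \text{otherwise}
\end{cases}
$$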

$\alpha$ and $\beta$ are the hyperparameters of the focal loss, and $N$ is the number of keypoints in the image. To recover the discretization error caused by downscaling the feature map, an offset $\hat{O} \in \mathcal{R}^{\frac{W}{R} \times \frac{H}{R} \times 2}$ is additionally predicted for each keypoint. The offset is shared across categories and trained with an L1 loss.
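From the paper, the offset loss is:

$$
L_{off} = \frac{1}{N} \sum_{p} \left| \hat{O}_{\tilde{p}} - \left( \frac{p}{R} - \tilde{p} \right) \right|
$$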

The offset loss is computed only at GT keypoint locations; all other locations do not participate in training.
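As an illustrative sketch (not the paper's code) of the heatmap target construction described above, the hypothetical helper below splats a Gaussian around a keypoint in NumPy and resolves overlaps with an element-wise maximum:

```python
import numpy as np

def splat_gaussian(heatmap, center, sigma):
    """Splat a Gaussian kernel around a keypoint onto a single-class heatmap.

    Overlapping Gaussians keep the element-wise maximum, as in the paper.
    """
    H, W = heatmap.shape
    ys, xs = np.ogrid[:H, :W]
    cx, cy = center
    g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    np.maximum(heatmap, g, out=heatmap)  # element-wise max with existing values
    return heatmap
```

The function name and interface are assumptions for illustration; real implementations typically also clip the kernel to a small window around the keypoint for speed.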

# Objects as Points

Define $(x^{(k)}_1, y^{(k)}_1, x^{(k)}_2, y^{(k)}_2)$ as the GT box of object $k$ with category $c_k$; its center point is $p_k = (\frac{x^{(k)}_1 + x^{(k)}_2}{2}, \frac{y^{(k)}_1 + y^{(k)}_2}{2})$. The heatmap is used to obtain all center points, and the size of each object $s_k = (x^{(k)}_2 - x^{(k)}_1,\ y^{(k)}_2 - y^{(k)}_1)$ is then regressed. To reduce the computational burden, the size prediction is shared across categories and uses only GT keypoint locations.
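From the paper, the size regression uses an L1 loss at the center points:

$$
L_{size} = \frac{1}{N} \sum_{k=1}^{N} \left| \hat{S}_{p_k} - s_k \right|
$$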

The complete CenterNet loss function is as follows:
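From the paper, with $\lambda_{size} = 0.1$ and $\lambda_{off} = 1$:

$$
L_{det} = L_k + \lambda_{size} L_{size} + \lambda_{off} L_{off}
$$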

CenterNet directly predicts the keypoint heatmap $\hat{Y}$, the offset $\hat{O}$, and the object size $\hat{S}$, for a total of $C+4$ outputs at each location. All outputs share the backbone features, each head consisting of its own $3 \times 3$ convolution, ReLU, and $1 \times 1$ convolution.

At inference, the peaks of each class's heatmap are first extracted; a peak must be no smaller than its 8 surrounding neighbors, and the top 100 peaks are kept. For each peak point $(x_i, y_i)$, the predicted keypoint value $\hat{Y}_{x_i, y_i, c}$ is used as the detection confidence, and is combined with the predicted offset $\hat{O} = (\delta\hat{x}_i, \delta\hat{y}_i)$ and size $\hat{S} = (\hat{w}_i, \hat{h}_i)$ to generate the prediction box:
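From the paper, the decoded box corners are:

$$
\left( \hat{x}_i + \delta\hat{x}_i - \frac{\hat{w}_i}{2},\ \ \hat{y}_i + \delta\hat{y}_i - \frac{\hat{h}_i}{2},\ \ \hat{x}_i + \delta\hat{x}_i + \frac{\hat{w}_i}{2},\ \ \hat{y}_i + \delta\hat{y}_i + \frac{\hat{h}_i}{2} \right)
$$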

Because this peak-extraction step is sufficient to replace NMS, all prediction boxes are output directly from the keypoints, with no need for NMS or other post-processing. Notably, the paper implements peak extraction with a clever trick: the heatmap is passed through a $3 \times 3$ max pooling with padding 1, and the pooled output is compared with the original heatmap; points whose values are unchanged are the desired peaks.
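A minimal NumPy sketch of this pooling-based pseudo-NMS (illustrative only; the paper's implementation uses a max-pooling layer on GPU):

```python
import numpy as np

def heatmap_peaks(heat):
    """Keep only local maxima: 3x3 max filter (padding=1), then compare with the original."""
    H, W = heat.shape
    padded = np.pad(heat, 1, mode="constant", constant_values=-np.inf)
    # Stack the 9 shifted views and take the element-wise max:
    # equivalent to a 3x3 max pool with stride 1 and padding 1.
    pooled = np.max(
        [padded[dy:dy + H, dx:dx + W] for dy in range(3) for dx in range(3)], axis=0
    )
    return np.where(pooled == heat, heat, 0.0)  # non-peaks are suppressed to 0
```

The surviving nonzero entries are the candidate centers; taking their top 100 values by confidence completes the decoding step described above.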

# Implementation details

The input to CenterNet is $512 \times 512$, and the output heatmap is $128 \times 128$. Four backbone structures are tested: ResNet-18, ResNet-101, DLA-34, and Hourglass-104; deformable convolutions are used to improve the ResNets and DLA-34.

### Hourglass

The hourglass structure is shown in Figure (a); the numbers in the boxes are the downscaling ratios of the feature maps. It contains two hourglass modules, each with five downsampling layers and five upsampling layers, and skip connections link the corresponding downsampling and upsampling layers. Hourglass is the largest of the networks and gives the best keypoint prediction.

### ResNet

The overall ResNet structure matches the original version, with deconvolution layers added to restore the feature-map resolution. The deconvolution weights are initialized to bilinear interpolation, and the dashed arrows denote $3 \times 3$ deformable convolutions.

### DLA

DLA uses hierarchical skip connections; the original structure is shown in Figure (c). In this paper most of the convolutions in the skip connections are replaced with deformable convolutions, the outputs of each layer are fused with $3 \times 3$ convolutions, and a final $1 \times 1$ convolution maps to the output dimension, as shown in Figure (d).

# Experiment

Comparison of object detection accuracy and speed across the different backbone networks.

Object detection performance comparison.

3D detection performance comparison.

Human keypoint detection performance comparison.

# Conclusion

This paper proposes CenterNet, a detection algorithm built on keypoint prediction: it treats each detected object as a keypoint, first finds the object's center point, and then regresses the object's size. Compared with the identically named CenterNet algorithm from an earlier paper, the method here is simpler yet powerful enough; it needs no NMS or other post-processing and can be extended to other detection tasks.

If this article helped you, please like or share it.

For more content, follow the WeChat official account.