This paper proposes PConv, which effectively exploits the internal correlations between scales by running a 3-D convolution over the feature pyramid, regularized with a dedicated integrated batch normalization (iBN). In addition, it proposes SEPC, which uses deformable convolution to adapt to the irregular correspondence between actual feature maps and to maintain scale equilibrium. PConv and SEPC significantly improve state-of-the-art detectors while adding little extra computation.

Source: Xiaofei's Algorithm Engineering Notes (WeChat official account)

**Paper: Scale-Equalizing Pyramid Convolution for Object Detection**

**Paper address: https://arxiv.org/pdf/2005.03101.pdf**
**Paper code: https://github.com/jshilong/SEPC**

## Introduction

Feature pyramids are an important means of handling object scale variation, but there are large semantic gaps between feature maps at different levels. To close these gaps, many studies focus on strengthening feature fusion, yet most of them simply rescale and add the feature maps without considering the internal properties of the feature pyramid. Inspired by scale-space theory (multi-scale feature extraction), this paper proposes PConv (pyramid convolution), which uses 3-D convolution to associate neighbouring feature maps and mine the interactions between scales. Considering that features change considerably between pyramid levels and that the point-to-point correspondence between levels is irregular, the paper further proposes SEPC (scale-equalizing pyramid convolution), which applies deformable convolution to the higher levels of the feature pyramid; this adapts to the actual scale changes and maintains the scale equilibrium between levels.

The main contributions of the paper are as follows:

- A lightweight pyramid convolution, PConv, which mines the correlations between scales by applying 3-D convolution to the feature pyramid.
- A scale-equalizing pyramid convolution, SEPC, which reduces the discrepancy between the feature pyramid and a Gaussian pyramid (the paper proves that PConv is scale-invariant on a Gaussian pyramid).
- The modules improve the performance of state-of-the-art single-stage object detectors while barely affecting inference speed.

## Pyramid convolution

PConv (pyramid convolution) is essentially a 3-D convolution spanning both the scale and spatial dimensions. As shown in Figure 4a, PConv can be expressed as $N$ different 2-D convolutions.

However, feature maps at different pyramid levels have different sizes. To accommodate them, PConv uses different strides when processing different levels. The paper sets $N = 3$: the first convolution kernel has stride 2 and the last has stride 0.5.

PConv can be expressed as

$$y^{l} = w_{1} \ast_{s2} x^{l-1} + w_{0} \ast x^{l} + w_{-1} \ast_{s0.5} x^{l+1}$$

where $w_{1}$, $w_{0}$ and $w_{-1}$ are three independent 2-D convolution kernels, $x^{l}$ is the input feature map at level $l$, and $\ast_{s2}$ denotes convolution with stride 2.

The convolution kernel with stride 0.5 is implemented by bilinearly upsampling the feature map by 2× and then applying a stride-1 convolution. PConv also uses zero padding. For the bottom and top pyramid levels, only two of the terms in the formula are used. The overall computation of PConv is about 1.5 times that of the original FPN.
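The PConv operation described above can be sketched in PyTorch as follows. This is a minimal illustration under the formula's three-kernel structure, not the authors' implementation; the class and variable names are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PConv(nn.Module):
    """Pyramid convolution with N = 3: each output level sums a stride-2 conv
    of the finer level below, a stride-1 conv of the same level, and a
    "stride-0.5" conv (2x bilinear upsample, then stride-1 conv) of the
    coarser level above. Boundary levels use only two of the three terms."""
    def __init__(self, channels):
        super().__init__()
        self.w1 = nn.Conv2d(channels, channels, 3, stride=2, padding=1)   # finer level, stride 2
        self.w0 = nn.Conv2d(channels, channels, 3, stride=1, padding=1)   # same level
        self.wm1 = nn.Conv2d(channels, channels, 3, stride=1, padding=1)  # coarser level, after upsample

    def forward(self, feats):
        # feats: list of (B, C, H, W) tensors ordered fine -> coarse
        outs = []
        for l, x in enumerate(feats):
            y = self.w0(x)
            if l > 0:                      # term from the finer level below
                y = y + self.w1(feats[l - 1])
            if l < len(feats) - 1:         # term from the coarser level above
                up = F.interpolate(feats[l + 1], size=x.shape[-2:],
                                   mode='bilinear', align_corners=False)
                y = y + self.wm1(up)
            outs.append(y)
        return outs
```

Each output level keeps the spatial size of its input level, so the module can be stacked like an ordinary conv layer in the head.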

#### Pipeline

As shown in Figure 5a, the head of RetinaNet can be regarded as a PConv with $N = 1$. The paper replaces the four conv layers of the head with four PConv layers with $N = 3$; stacking PConv layers gradually strengthens the inter-scale correlation without much extra computation. To reduce the computation further, the four PConv layers can be shared between the classification and localization branches, with one extra ordinary convolution layer added to each branch, as shown in Figure 5b. This design actually costs less computation than the original RetinaNet head; see Appendix 1 of the paper for the detailed calculation.
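The Figure 5b layout can be sketched as below. To keep the sketch self-contained, plain per-level 3×3 convs stand in for the shared PConv stages (an assumption, not the paper's module); the class and head names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedHead(nn.Module):
    """Head layout from Figure 5b: four shared stages feed both branches,
    then each branch gets one extra ordinary conv before its prediction
    layer. Plain convs applied per level stand in for PConv here."""
    def __init__(self, c, num_classes, num_anchors):
        super().__init__()
        self.shared = nn.ModuleList(
            [nn.Conv2d(c, c, 3, padding=1) for _ in range(4)])
        self.cls_extra = nn.Conv2d(c, c, 3, padding=1)   # extra conv, cls branch
        self.reg_extra = nn.Conv2d(c, c, 3, padding=1)   # extra conv, loc branch
        self.cls_out = nn.Conv2d(c, num_anchors * num_classes, 3, padding=1)
        self.reg_out = nn.Conv2d(c, num_anchors * 4, 3, padding=1)

    def forward(self, feats):
        cls_scores, bbox_preds = [], []
        for x in feats:
            for conv in self.shared:       # stages shared by both branches
                x = F.relu(conv(x))
            cls_scores.append(self.cls_out(F.relu(self.cls_extra(x))))
            bbox_preds.append(self.reg_out(F.relu(self.reg_extra(x))))
        return cls_scores, bbox_preds
```

Sharing the four stages means their cost is paid once instead of twice, which is where the saving over the original two-branch RetinaNet head comes from.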

#### Integrated batch normalization (iBN) in the head

PConv uses a shared BN layer whose statistics are computed over all feature maps in the feature pyramid rather than over a single level. Since the statistics come from every feature map in the pyramid, they have lower variance and are more stable, so the BN layer can be trained well even with a small batch size.
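One way to realize this shared statistic is to flatten every level, concatenate the pixels, and pass them through a single BN module so the per-channel mean and variance are computed over the whole pyramid at once. The sketch below is one such realization under that assumption, not the authors' implementation:

```python
import torch
import torch.nn as nn

class IntegratedBN(nn.Module):
    """Shared BN over a feature pyramid: one BatchNorm2d whose per-channel
    statistics are computed jointly over every pyramid level."""
    def __init__(self, channels):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, feats):
        b, c = feats[0].shape[:2]
        sizes = [f.shape[-2:] for f in feats]
        # flatten each level to (B, C, H*W) and concatenate all pixels
        flat = torch.cat([f.reshape(b, c, -1) for f in feats], dim=2)
        # one BN pass -> statistics cover every level simultaneously
        flat = self.bn(flat.unsqueeze(-1)).squeeze(-1)
        outs, i = [], 0
        for h, w in sizes:                 # split back into pyramid levels
            outs.append(flat[:, :, i:i + h * w].reshape(b, c, h, w))
            i += h * w
        return outs
```

Because the pixel count pooled into each statistic is the sum over all levels, the estimates stay stable even when the image batch is small.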

## Scale-equalizing pyramid convolution

PConv uses a fixed convolution kernel size across levels. On a Gaussian pyramid (where the blur is mild and the Gaussian kernel matches the scaling ratio of the feature maps), PConv can extract scale-invariant features; see Appendix 3 of the paper for the proof.

In practice, however, because of the many convolution layers and nonlinear operations, the blur in a feature pyramid is much more severe than in a Gaussian pyramid (the degree of blur may be out of proportion to the feature-map size), so a fixed kernel size can hardly extract scale-invariant features. The paper therefore proposes SEPC (scale-equalizing pyramid convolution), which applies deformable convolution to every level except the bottom one and predicts an offset for each level; this adapts to the blur of each level, maintains the scale equilibrium between feature maps, and extracts scale-invariant features.

SEPC has the following advantages:

- The adaptability of deformable convolution handles the large inter-level blur of the feature pyramid.
- It reduces the discrepancy between the feature pyramid and a Gaussian pyramid (the paper proves that PConv extracts scale-invariant features from a Gaussian pyramid).
- Since the convolution computation on a higher level is 4× lower than on the level below it (the feature-map area shrinks by 4×), adding deformable convolution to the higher levels brings only a small amount of extra computation.

SEPC comes in two versions: SEPC-full adds deformable convolution to both the combined head and the extra head in Figure 5b, while SEPC-lite adds it only to the extra head.

## Experiments

#### Single-stage object detectors

#### Effect of each component

#### Comparison of different BN implementations in the head

The output of a BN layer is $y = \gamma \frac{x - \mu}{\sigma} + \beta$, where $\gamma$ and $\beta$ are learnable parameters and $\mu$ and $\sigma$ are statistics. Figure 7 compares three BN variants; integrated BN (iBN) is the shared BN proposed in the paper, in which all parameters and statistics are shared across pyramid levels.

#### Comparison with other feature fusion modules

#### Comparison with state-of-the-art object detectors

#### Extension to two-stage object detectors

## Conclusion

This paper proposes PConv, which effectively exploits the internal correlations between scales by running a 3-D convolution over the feature pyramid, regularized with a dedicated integrated batch normalization (iBN). In addition, it proposes SEPC, which uses deformable convolution to adapt to the irregular correspondence between actual feature maps and to maintain scale equilibrium. PConv and SEPC significantly improve state-of-the-art detectors while adding little extra computation.
