ChannelNets: channel-wise convolutions that slide along the channel dimension | NeurIPS 2018

Time: 2021-11-30

Channel-wise convolution slides along the channel dimension, which neatly avoids the dense input-to-output connections of an ordinary convolution without being as rigid as group convolution. It is a very good idea.

Source: Xiaofei’s algorithm Engineering Notes official account

Paper: ChannelNets: Compact and Efficient Convolutional Neural Networks via Channel-Wise Convolutions

• Paper code: https://github.com/HongyangGao/ChannelNets

Introduction

Depth-wise separable convolution reduces the computation and parameter count of a network, and within it the point-wise convolution accounts for most of the parameters. The paper argues that the next step in making networks lightweight is to change the dense connection pattern between inputs and outputs. It therefore proposes channel-wise convolution, which makes the connections between input and output channels sparse rather than fully connected. Unlike the strict partitioning of group convolution, the convolution slides along the channel dimension, which better preserves information exchange between channels. Building on this idea, the paper further proposes depth-wise separable channel-wise convolution and a convolutional classification layer that replaces the network's final global pooling + fully connected layer, and uses these structures to construct ChannelNets.

Channel-Wise Convolutions and ChannelNets

Figure a shows the structure of depth-wise separable convolution, and Figure b shows the structure of depth-wise separable convolution with grouping, in which each dot represents one feature channel.

Channel-Wise Convolutions

The core of channel-wise convolution is the sparsity of the input-output connections: each output is connected to only part of the input. Conceptually it differs from group convolution. It does not strictly partition the input, but samples several related inputs with a certain stride to produce each output (sliding along the channel dimension), which reduces the number of parameters while still allowing some information flow between channels. Suppose the input has $m$ channels, the convolution kernel size is $d_k$, the output dimension is $n$, and the input feature map size is $d_f\times d_f$. An ordinary convolution has $m\times d_k\times d_k\times n$ parameters and a computation cost of $m\times d_k\times d_k\times d_f\times d_f\times n$. A channel-wise convolution has $d_c\times d_k\times d_k$ parameters, where $d_c$, the number of input channels sampled at a time, is generally much smaller than $m$, and its computation cost is $d_c\times d_k\times d_k\times d_f\times d_f\times n$. Both the parameter count and the computation are decoupled from the input dimension $m$.
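The sliding described above can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: it assumes the $d_k = 1$ case, no padding along the channel axis, and the function names are made up for this example.

```python
import numpy as np

def channel_wise_conv(x, kernel, stride=1):
    """Slide one shared 1-D kernel along the channel axis of x.

    x:      feature maps of shape (m, d_f, d_f)
    kernel: shared weights of shape (d_c,), i.e. the d_k = 1 case
    Returns n = (m - d_c) // stride + 1 maps (no padding, an assumption).
    """
    m, d_c = x.shape[0], kernel.shape[0]
    n = (m - d_c) // stride + 1
    out = np.empty((n,) + x.shape[1:], dtype=x.dtype)
    for j in range(n):
        window = x[j * stride : j * stride + d_c]            # (d_c, d_f, d_f)
        out[j] = np.tensordot(kernel, window, axes=(0, 0))   # weighted sum over the window
    return out

def regular_conv_params(m, n, d_k):
    return m * d_k * d_k * n      # grows with both m and n

def channel_wise_conv_params(d_c, d_k):
    return d_c * d_k * d_k        # independent of m and n

x = np.random.default_rng(0).normal(size=(16, 7, 7))   # m = 16
kernel = np.array([0.25, 0.5, 0.25])                   # d_c = 3
y = channel_wise_conv(x, kernel)                       # 14 output maps

print(y.shape)                           # (14, 7, 7)
print(regular_conv_params(16, 14, 1))    # 224
print(channel_wise_conv_params(3, 1))    # 3
```

Each output map here depends on only three neighboring input maps, and the same three weights are reused for every output, which is why the parameter count does not depend on $m$.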

Group Channel-Wise Convolutions

The partitioning in group convolution creates an information barrier between channels. To add information exchange between groups, a fusion layer is usually appended that keeps the grouping while merging the features of all groups. The paper uses a group channel-wise convolution layer as the fusion layer, which contains $g$ channel-wise convolutions. Let the input feature dimension be $n$ and the number of groups be $g$. Each channel-wise convolution has stride $g$ (the sliding stride along the channel dimension) and outputs $n/g$ feature maps (it slides $n/g$ times). To ensure that the output of each group covers all inputs, $d_c \ge g$ must hold. Finally, all outputs are concatenated. The structure is shown in Figure c.
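A rough NumPy sketch of such a fusion layer follows. The zero padding at the end of the channel axis (so that each kernel yields exactly $n/g$ maps) and the function name are assumptions made for this illustration; the paper's code may pad differently.

```python
import numpy as np

def group_channel_wise_conv(x, kernels, g):
    """x: (n, d_f, d_f); kernels: (g, d_c), one 1-D kernel per group.

    Every kernel slides over all n input maps with stride g and produces
    n / g maps; the g results are concatenated back to n maps.
    """
    n = x.shape[0]
    num_kernels, d_c = kernels.shape
    assert num_kernels == g and n % g == 0 and d_c >= g
    # Assumed zero padding at the end so the last window is complete.
    pad = np.zeros((d_c - g,) + x.shape[1:], dtype=x.dtype)
    xp = np.concatenate([x, pad], axis=0)
    groups = []
    for k in kernels:
        maps = [np.tensordot(k, xp[s:s + d_c], axes=(0, 0)) for s in range(0, n, g)]
        groups.append(np.stack(maps))        # (n / g, d_f, d_f)
    return np.concatenate(groups, axis=0)    # (n, d_f, d_f)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4, 4))       # n = 8 input maps
kernels = rng.normal(size=(2, 4))    # g = 2 kernels with d_c = 4 >= g
y = group_channel_wise_conv(x, kernels, g=2)
print(y.shape)        # (8, 4, 4)
print(kernels.size)   # 8 parameters in total (g * d_c)
```

Because the stride is $g$ and $d_c \ge g$, consecutive windows touch or overlap, so each group's output sees every input map.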

Depth-Wise Separable Channel-Wise Convolutions

Depth-wise separable channel-wise convolution follows the depth-wise convolution with a channel-wise convolution to fuse features, which reduces parameters and computation. The structure is shown in Figure d. In the figure the channel-wise convolution has stride 1 and $d_c$ is 3, so it fuses features while cutting the parameter count.
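To get a feel for the saving, the parameter counts of the two variants can be compared directly. The layer size below (1024 input and output channels, 3×3 depth-wise kernel) is an assumption chosen for illustration, and biases are ignored; only $d_c = 3$ comes from the figure description.

```python
def depthwise_separable_params(m, n, d_k):
    # depth-wise part (one d_k x d_k filter per channel) + point-wise 1x1 part
    return d_k * d_k * m + m * n

def depthwise_separable_channel_wise_params(m, d_k, d_c):
    # same depth-wise part, but the point-wise part becomes one shared d_c kernel
    return d_k * d_k * m + d_c

m = n = 1024   # assumed layer width
standard = depthwise_separable_params(m, n, d_k=3)
sparse = depthwise_separable_channel_wise_params(m, d_k=3, d_c=3)

print(standard)           # 1057792
print(sparse)             # 9219
print(standard - sparse)  # 1048573
```

Almost all of the saving comes from dropping the $m\times n$ point-wise weights; the depth-wise part is unchanged.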

Convolutional Classification Layer

Networks generally use global pooling and a fully connected layer for the final classification, but this combination has a very large number of parameters. Global pooling + fully connected layer can in fact be converted into a depth-wise separable convolution: a depth-wise convolution with fixed weights replaces the global pooling, and a point-wise convolution replaces the fully connected layer. The depth-wise separable channel-wise convolution above can therefore be applied for further optimization. Since there is no activation function or BN between the pooling and the fully connected layer, it is more efficient to implement this as a regular three-dimensional convolution.
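The equivalence claimed here is easy to check numerically. The sketch below uses small made-up sizes and random weights, and treats global average pooling as the pooling operation (an assumption consistent with the fixed-weight description).

```python
import numpy as np

rng = np.random.default_rng(0)
m, d_f, n = 6, 7, 4                      # illustrative sizes only
x = rng.normal(size=(m, d_f, d_f))       # input feature maps
W = rng.normal(size=(n, m))              # fully connected weights (no bias)

# Usual head: global average pooling followed by a fully connected layer.
logits_fc = W @ x.mean(axis=(1, 2))

# Same head as a depth-wise separable convolution:
# a d_f x d_f depth-wise kernel with fixed weights 1 / d_f^2 per channel ...
dw_kernel = np.full((d_f, d_f), 1.0 / d_f**2)
dw_out = (x * dw_kernel).sum(axis=(1, 2))        # one value per channel
# ... followed by a 1x1 point-wise convolution that reuses the FC weights.
logits_conv = np.einsum("nm,m->n", W, dw_out)

print(np.allclose(logits_fc, logits_conv))   # True
```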

Suppose the input feature maps have size $m\times d_f\times d_f$ and the number of classes is $n$. The depth-wise convolution or global pooling can be viewed as a three-dimensional convolution with kernel size $d_f\times d_f\times 1$ and fixed weights $1/d_f^2$, and the channel-wise convolution can be viewed as a three-dimensional convolution with kernel size $1\times 1\times d_c$. The two can be merged into a single three-dimensional convolution with kernel size $d_f\times d_f\times d_c$. To match the number of classes, $d_c = m-n+1$, that is, only $(m-n+1)$ input feature maps are used to predict each class.
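A small NumPy sketch of this merged layer is given below. The sizes are made up, the function name is hypothetical, and the kernel is stored channel-first as `(d_c, d_f, d_f)`; the final check only confirms that a factored kernel reproduces pooling followed by a channel-wise convolution, whereas the merged kernel in a trained layer would be learned freely.

```python
import numpy as np

def conv_classification_layer(x, kernel):
    """x: (m, d_f, d_f); kernel: one shared 3-D kernel of shape (d_c, d_f, d_f).

    The kernel spans the whole spatial extent and slides only along the
    channel axis, giving n = m - d_c + 1 class scores.
    """
    m, d_c = x.shape[0], kernel.shape[0]
    n = m - d_c + 1
    return np.array([np.sum(x[j:j + d_c] * kernel) for j in range(n)])

rng = np.random.default_rng(0)
m, n, d_f = 12, 10, 4                 # illustrative sizes only
d_c = m - n + 1                       # 3, so the output has exactly n entries
x = rng.normal(size=(m, d_f, d_f))

kernel = rng.normal(size=(d_c, d_f, d_f))
logits = conv_classification_layer(x, kernel)
print(logits.shape)                   # (10,)

# Sanity check of the merge: fixed 1 / d_f^2 pooling weights times a 1-D
# channel kernel w_c gives the same result as pooling, then sliding w_c.
w_c = rng.normal(size=d_c)
merged = w_c[:, None, None] * np.full((d_f, d_f), 1.0 / d_f**2)
two_step = np.correlate(x.mean(axis=(1, 2)), w_c, mode="valid")
print(np.allclose(conv_classification_layer(x, merged), two_step))   # True
```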

The paper visualizes the weights of the fully connected classification layer, with blue marking weights that are 0 or close to 0. The weights turn out to be very sparse, that is, the layer actually uses only part of its input, so it is reasonable to use only part of the input features here.

ChannelNets

ChannelNet is built on the MobileNet architecture, using the group module (GM) and the group channel-wise module (GCWM) designed in Figure 3. Because GM suffers from blocked information flow between groups, a GCWM is placed before the GM modules to produce grouped features that contain global information.

ChannelNet comes in three versions:

• ChannelNet-v1 replaces some of the depth-wise separable convolutions with GM and GCWM. The number of groups is 2, and the network has about 3.7 million parameters.
• ChannelNet-v2 replaces the last depth-wise separable convolution with a depth-wise separable channel-wise convolution, saving about 1 million parameters, roughly 25% of the parameters of ChannelNet-v1.
• ChannelNet-v3 replaces the last pooling layer and fully connected layer with the convolutional classification layer described above, saving about 1 million (1024×1000-7×7×25) parameters.
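The ChannelNet-v3 figure can be reproduced with a few lines of arithmetic, using only the numbers quoted above: 1024 input feature maps of size 7×7, 1000 classes, and $d_c = m-n+1$. Biases are ignored.

```python
m, n, d_f = 1024, 1000, 7

d_c = m - n + 1                     # channels seen by each class score
fc_params = m * n                   # fully connected classification layer
conv_cls_params = d_f * d_f * d_c   # one shared d_f x d_f x d_c kernel

print(d_c)                          # 25
print(fc_params)                    # 1024000
print(conv_cls_params)              # 1225
print(fc_params - conv_cls_params)  # 1022775
```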

Experimental Studies

Comparison of network performance on ILSVRC 2012.

Comparison with even lighter networks, using MobileNet's width multiplier to scale the dimensions of each layer.

Ablation of the group channel-wise convolution in ChannelNet, in which GCWM is replaced with a GM module. Considering that GCWM adds only 32 parameters, the resulting performance gain is very efficient.

Conclusion

Channel-wise convolution slides along the channel dimension, which neatly avoids the dense input-to-output connections of an ordinary convolution without being as rigid as group convolution. It is a very good idea. However, I feel the results in the paper itself are not strong enough: it compares only against MobileNetV1, and its performance is a little worse than MobileNetV2.