ShuffleNetV1/V2 Overview | Lightweight Networks

Time: 2021-7-24

The ShuffleNet series is an important family of lightweight networks. ShuffleNetV1 proposes the channel shuffle operation, which lets the network exploit group convolution for acceleration, while ShuffleNetV2 overturns most of V1's design choices from a practical standpoint and proposes the channel split operation, speeding up the network while reusing features, with strong results.
Source: Xiaofei’s algorithm Engineering Notes official account

ShuffleNet V1


Paper: ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices

Introduction

Neural networks keep getting more accurate while their inference gets slower, so in practice a compromise between speed and accuracy has to be made. The paper therefore analyzes where small networks spend their time and proposes ShuffleNet. This post first introduces ShuffleNet's core operations, channel shuffle and group convolution, then the structure of the ShuffleNet unit, and finally the overall ShuffleNet architecture.

Channel Shuffle for Group Convolutions

Many current mainstream networks use pointwise convolution to reduce dimensionality and thereby network complexity. However, because the input dimension is high, pointwise convolution itself is expensive; for small networks this cost causes a significant performance hit. In a ResNeXt unit, for example, pointwise convolution accounts for 93.4% of the computation. The paper therefore applies group convolution to the pointwise layers as well, and first discusses two possible implementations:

  • Figure 1a is the most direct approach: all operations are fully channel-isolated, but then each output is connected to only a small fraction of the input, which blocks information flow between groups and weakens representational power.
  • Figure 1b redistributes the output channels: the output of each group is divided into several sub-groups, and each sub-group is fed into a different group of the next layer, which preserves the information flow between groups.

The idea of Figure 1b can be implemented simply with the channel shuffle operation, as shown in Figure 1c: assuming a convolution layer with $g$ groups whose output has $g \times n$ dimensions, first reshape() the output to $(g, n)$, then transpose(), and finally flatten() it back to $g \times n$ dimensions.
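A minimal PyTorch sketch of this reshape / transpose / flatten trick (the function name and tensor layout are my own assumptions, not taken from the paper):

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Channel shuffle: reshape (N, g*n, H, W) -> (N, g, n, H, W),
    transpose the group and per-group channel axes, flatten back."""
    n, c, h, w = x.size()
    x = x.view(n, groups, c // groups, h, w)   # reshape channels into (g, n)
    x = x.transpose(1, 2).contiguous()         # transpose group <-> per-group channels
    return x.view(n, c, h, w)                  # flatten back to (N, g*n, H, W)
```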

ShuffleNet Unit

Based on the channel shuffle operation, the paper proposes two ShuffleNet units, starting from the basic residual structure in Figure 2a, which contains a $3 \times 3$ depthwise convolution for feature extraction:

  • Figure 2b is the ShuffleNet unit that keeps the feature map size. The initial $1 \times 1$ convolution is replaced with a pointwise group convolution followed by channel shuffle. The second pointwise group convolution restores the unit's input dimension so that the element-wise addition with the shortcut is possible. Following the recommendations of the depthwise separable convolution papers, the latter two convolutions are followed by BN only rather than BN + ReLU. The paper also tried adding another channel shuffle after the second pointwise group convolution, but it did not improve accuracy much. (A code sketch of this unit follows the notes below.)
  • Figure 2c is the ShuffleNet unit that halves the feature map size, used for downsampling between blocks. It adds a $3 \times 3$ average pooling (stride 2) on the shortcut path and replaces the final element-wise addition with channel concatenation, which increases the output dimension at little extra cost.

  The ShuffleNet unit is computationally efficient. For an input of $c \times h \times w$ with bottleneck dimension $m$, a ResNet unit costs $hw(2cm + 9m^2)$ FLOPs, a ResNeXt unit costs $hw(2cm + 9m^2/g)$ FLOPs, while a ShuffleNet unit costs only $hw(2cm/g + 9m)$ FLOPs, where $g$ is the number of convolution groups. Under an equal compute budget, the lower cost means ShuffleNet can afford wider feature maps, which matters a great deal in small networks.
  It should be noted that although depthwise convolution has low theoretical complexity, its implementation efficiency on real hardware is poor; ShuffleNet therefore applies depthwise convolution only to the (lower-dimensional) bottleneck features.
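Below is a minimal PyTorch sketch of the stride-1 unit in Figure 2b, reusing the channel_shuffle helper above; the class name, groups=3 and the 1/4 bottleneck ratio are illustrative assumptions rather than the paper's reference code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShuffleNetV1Unit(nn.Module):
    """Figure 2b: 1x1 group conv -> shuffle -> 3x3 depthwise conv -> 1x1 group conv,
    with an identity shortcut and element-wise addition."""
    def __init__(self, channels: int, groups: int = 3, bottleneck_ratio: int = 4):
        super().__init__()
        mid = channels // bottleneck_ratio                   # bottleneck width = 1/4 of output
        self.groups = groups
        self.gconv1 = nn.Conv2d(channels, mid, 1, groups=groups, bias=False)
        self.bn1 = nn.BatchNorm2d(mid)
        self.dwconv = nn.Conv2d(mid, mid, 3, padding=1, groups=mid, bias=False)  # depthwise, BN only
        self.bn2 = nn.BatchNorm2d(mid)
        self.gconv2 = nn.Conv2d(mid, channels, 1, groups=groups, bias=False)     # restore input width
        self.bn3 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.gconv1(x)))
        out = channel_shuffle(out, self.groups)
        out = self.bn2(self.dwconv(out))
        out = self.bn3(self.gconv2(out))
        return F.relu(out + x)   # element-wise addition with the shortcut
```

The stride-2 unit of Figure 2c would instead concatenate the branch output with a $3 \times 3$ average-pooled (stride 2) shortcut.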

Network Architecture

The structure of ShuffleNet is shown in Table 1: three stages are each built by stacking ShuffleNet units. The first unit of each stage is special, using the stride = 2 structure of Figure 2c so that the feature map size is halved and the number of channels doubled; the other units use the structure of Figure 2b, with the bottleneck dimension set to 1/4 of the output. Table 1 designs networks with different group numbers and adjusts the output dimensions accordingly, keeping the overall model at about 140 MFLOPs; the larger the group number, the wider the dimensions the network can afford.

Experiments

To obtain different network complexities, a scaling factor $s$ is applied to the layer widths in Table 1; for example, ShuffleNet 0.5× halves the output dimension of every layer in Table 1.

Performance for different scales and group numbers.

Compare the effects of channel shuffle on different network sizes.

While keeping the complexity fixed, stages 2-4 are replaced with other mainstream structures as far as possible (see the original paper for the exact designs) for performance comparison.

Comparison with MobileNet at the same complexity.

Comparison with mainstream networks.

Performance as the backbone for object detection.

Single-thread CPU inference speed comparison.

Conclusion

The core of ShuffleNet is the channel shuffle operation, which restores the information exchange between groups so that the network can make full use of pointwise group convolution. This both cuts the bulk of the network's computation and allows wider convolutions. Judging from the experiments, it is very solid work.

ShuffleNet V2


**Paper: ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design**

Introduction

The paper finds that FLOPs, the usual indicator of computational complexity, is not equivalent to speed: as Figure 1 shows, networks with similar FLOPs can differ considerably in speed. FLOPs alone is therefore an insufficient measure; memory access cost (MAC) and the degree of parallelism on the device must also be considered. Based on these observations, the paper derives four guidelines for lightweight network design, from theory to experiment, and then proposes ShuffleNet V2 according to them.

Practical Guidelines for Efficient Network Design

To make sure the conclusions hold in practice, the experiments are carried out on the following industrial devices:

  • GPU. A single NVIDIA GeForce GTX 1080Ti, with cuDNN 7.0 as the convolution library.
  • ARM. A Qualcomm Snapdragon 810.

The paper presents the following four guidelines for lightweight network design:

  1. G1: Equal channel width minimizes memory access cost (MAC).

Most mainstream networks use depthwise separable convolution, in which the pointwise convolution carries most of the computational cost. Suppose the input dimension is $c_1$, the output dimension is $c_2$, and the feature map size is $h \times w$; then the FLOPs of a $1 \times 1$ convolution is $B = hwc_1c_2$, and its memory access cost is $MAC = hw(c_1 + c_2) + c_1c_2$. MAC can be written in terms of $B$ as:

$MAC = hw(c_1 + c_2) + c_1c_2 \ge 2hw\sqrt{c_1c_2} + c_1c_2 = 2\sqrt{hwB} + \frac{B}{hw}$

By the AM-GM inequality, the lower bound is attained when $c_1$ and $c_2$ are equal, i.e. memory access cost is minimized when the input and output dimensions match.
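As a quick sanity check (toy numbers of my own, not from the paper), the snippet below keeps the FLOPs of a $1 \times 1$ convolution roughly fixed and varies the $c_1:c_2$ ratio; the 1:1 split gives the smallest MAC:

```python
# Toy check of G1: keep B = h*w*c1*c2 roughly fixed and vary the c1:c2 ratio.
h = w = 56
B = h * w * 128 * 128                        # FLOPs budget of a 1x1 convolution

for ratio in (1, 2, 6, 12):                  # c1 : c2 = 1 : ratio
    c1 = round((B / (h * w * ratio)) ** 0.5)
    c2 = ratio * c1
    mac = h * w * (c1 + c2) + c1 * c2
    print(f"c1:c2 = 1:{ratio:<2d}  c1={c1:4d}  c2={c2:5d}  MAC={mac}")
# The 1:1 ratio yields the lowest MAC, in line with the inequality above.
```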

To rule out a gap between theory and practice, the paper also compares on real devices: with FLOPs held constant, the ratio of input to output dimensions is varied, and the 1:1 setting is indeed the fastest. When designing a structure, the convolution input and output dimensions should therefore be kept as close as possible.

  2. G2: Excessive group convolution increases MAC

Group convolution reduces FLOPs; with FLOPs fixed, it allows more channels, but increasing the channels raises MAC. For a $1 \times 1$ group convolution, the relationship between MAC and FLOPs is:

$MAC = hw(c_1 + c_2) + \frac{c_1c_2}{g} = hwc_1 + \frac{Bg}{c_1} + \frac{B}{hw}$

where $g$ is the number of groups and $B = hwc_1c_2/g$ is the FLOPs. With the input $hwc_1$ and the FLOPs $B$ fixed, MAC increases as $g$ increases.
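Similarly, a toy check of G2 (again with numbers of my own choosing): fixing the input and the FLOPs, MAC grows as the group number $g$ increases:

```python
# Toy check of G2: fix the input (h, w, c1) and the FLOPs B, then vary the group number g.
h = w = 56
c1 = 128
B = h * w * c1 * 128                         # equals a plain 1x1 conv with c2 = 128

for g in (1, 2, 4, 8):
    c2 = B * g // (h * w * c1)               # more groups buy a wider output at the same FLOPs
    mac = h * w * (c1 + c2) + c1 * c2 // g
    print(f"g={g}  c2={c2:4d}  MAC={mac}")
# MAC grows with g even though the FLOPs stay constant.
```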

The paper also verifies this on real devices: using more groups slows inference down, mainly because of the larger MAC. The group number should therefore be chosen carefully for the target platform and task; a large group number can raise accuracy somewhat, but it also makes the computational cost grow quickly.

  3. G3: Network fragmentation reduces degree of parallelism

Some current networks use many fragmented paths within a single block; NASNet-A, for instance, uses 13 branches per block, whereas conventional networks use only 2-3. Such designs can improve accuracy, but they are unfriendly to parallel execution on the device and degrade runtime performance.

With FLOPs fixed, serial and parallel fragmented structures are compared; the single-branch structure performs best, and the degradation from fragmentation is most pronounced on GPU devices.

  4. G4: Element-wise operations are non-negligible

Profiling ShuffleNetV1 and MobileNetV2 shows that the time spent in element-wise operations (ReLU, AddTensor, AddBias, etc.) cannot be ignored, especially on GPU devices: their FLOPs are small, but their MAC is relatively high.

On real devices, with FLOPs fixed, using more element-wise operations lowers the network's throughput.

Finally, the design guidelines found in the paper are summarized as follows:

  • Use convolutions with equal input and output dimensions
  • Be aware of the cost of group convolution
  • Reduce the number of branches
  • Reduce element-wise operations

ShuffleNet V2: an Efficient Architecture

As analyzed above, the pointwise group convolution and bottleneck structure of ShuffleNetV1 raise MAC, causing non-negligible overhead. To be both fast and accurate, the key is to use wide convolutions whose input and output dimensions are equal, without dense $1 \times 1$ convolution and without too many groups.

  The ShuffleNetV1 unit structures are shown in Figure 3ab. To meet the goals above, a channel split operation is added on top of V1, as in Figure 3c. At the beginning of each unit, the feature map is split into $c - c'$ and $c'$ channels. Following G3, one branch passes straight through; following G1, the other branch contains three convolutions with identical input and output dimensions. Following G2, group convolution is no longer used; the channel split at the start of the unit already acts like a two-group convolution. After the convolutions, the two branches are concatenated, restoring the unit's input dimension (G1), and a channel shuffle is then applied. There is no element-wise addition, which satisfies G4. In implementation, the concat / channel shuffle / channel split operations are fused for further speed. (A code sketch of this unit follows these notes.)
  The spatial downsampling unit is a slightly modified version, shown in Figure 3d: the channel split is removed, so the output dimension doubles while the spatial size is halved.
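A minimal PyTorch sketch of the stride-1 unit in Figure 3c, reusing the channel_shuffle helper from the V1 section (the class name and layer details are assumptions for illustration, not the reference implementation):

```python
import torch
import torch.nn as nn

class ShuffleNetV2Unit(nn.Module):
    """Figure 3c: channel split -> (identity | 1x1 conv, 3x3 depthwise conv, 1x1 conv)
    -> concat -> channel shuffle."""
    def __init__(self, channels: int):
        super().__init__()
        assert channels % 2 == 0
        c = channels // 2                                    # c' = c / 2
        self.branch = nn.Sequential(
            nn.Conv2d(c, c, 1, bias=False), nn.BatchNorm2d(c), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, padding=1, groups=c, bias=False), nn.BatchNorm2d(c),  # depthwise, BN only
            nn.Conv2d(c, c, 1, bias=False), nn.BatchNorm2d(c), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)                       # channel split into c' and c - c'
        out = torch.cat((x1, self.branch(x2)), dim=1)    # concatenate instead of add
        return channel_shuffle(out, 2)                   # shuffle mixes the two branches
```

The downsampling unit of Figure 3d drops the channel split and applies a stride-2 depthwise convolution on both branches before the concatenation, doubling the channels while halving the spatial size.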

Similar to ShuffleNetV1, with $c' = c/2$, stages 2-4 are stacks of ShuffleNetV2 units, and an extra $1 \times 1$ convolution is added before the global pooling to help feature fusion. ShuffleNetV2 is both fast and accurate, for two main reasons: first, the efficiency of each unit allows larger feature dimensions and network capacity; second, channel split lets part of the features pass directly through the block, which amounts to the feature reuse of DenseNet.

The paper visualizes the degree of feature reuse in DenseNet and ShuffleNetV2. In DenseNet, connections between adjacent layers are much stronger than those between distant layers, which implies that dense connections across all layers contain redundancy. In ShuffleNetV2, the influence of one block on later blocks decays geometrically with distance at rate $(c - c')/c = 0.5$, giving a reuse pattern similar to DenseNet's.
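For intuition (my own restatement of the argument, assuming $c' = c/2$): each unit passes $c - c'$ of its $c$ channels straight through, so the number of channels of block $i$ still directly carried by block $i+j$ is roughly

$\left(\frac{c - c'}{c}\right)^{j} c = 0.5^{j}\, c ,$

i.e. direct feature reuse decays exponentially with the distance $j$ between blocks.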

Experiments

Applying the ShuffleNetV2 unit to large networks for comparison.

Performance of ShuffleNetV2 as a detection network backbone.

The performance is compared with mainstream classification networks of different sizes.

Conclusion

Starting from practice and guided by actual inference speed, the paper distills four guidelines for lightweight network design and proposes ShuffleNetV2 accordingly, balancing accuracy and speed. The channel split operation in particular is a highlight, achieving DenseNet-like feature reuse.

CONCLUSION


The ShuffleNet series is an important family of lightweight networks. ShuffleNetV1 introduced the channel shuffle operation so that group convolution can be used for acceleration, while ShuffleNetV2 overturned most of V1's designs from a practical standpoint and introduced the channel split operation, speeding up the network while reusing features, with strong results.


