The core of the ESPNet series is the dilated convolution pyramid, in which each branch uses a different dilation rate, so multi-scale features can be fused without increasing the number of parameters. Compared with depthwise separable convolution, the depthwise dilated separable convolution pyramid is more cost-effective. The HFF multi-scale feature fusion method is also worth borrowing.
Source: Xiaofei’s algorithm Engineering Notes official account
ESPNet
Paper: ESPNet: Efficient Spatial Pyramid of Dilated Convolutions for Semantic Segmentation
 Paper address: https://arxiv.org/abs/1803.06815
 Paper code: https://github.com/sacmehta/ESPNet
Introduction
ESPNet is a lightweight network for semantic segmentation. Its core is the ESP module, which consists of a pointwise convolution and a pyramid of dilated convolutions: the former reduces computational complexity, and the latter resamples the features with different effective receptive fields. The ESP module is more efficient than other convolution factorization methods (MobileNet / ShuffleNet). ESPNet reaches 112 fps / 21 fps / 9 fps on a GPU / laptop / edge device, respectively.
ESP module
The ESP module decomposes a standard convolution into a pointwise convolution and a spatial pyramid of dilated convolutions. The pointwise convolution maps the input to a low-dimensional feature space. The dilated convolution pyramid then resamples these low-dimensional features in parallel with $K$ branches of $n \times n$ dilated convolutions, where branch $k$ has dilation rate $2^{k-1}$, $k \in \{1, \cdots, K\}$. This decomposition greatly reduces the parameter count and memory footprint of the ESP module while maintaining a large effective receptive field.
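To make the dilation schedule concrete, the following small sketch lists the dilation rate of each branch and the side length that an $n \times n$ kernel covers at that rate. The function names are made up for illustration, and the values $n = 3$, $K = 4$ are only an example.

```python
def esp_branch_dilations(K):
    """Dilation rate 2^(k-1) for pyramid branches k = 1..K."""
    return [2 ** (k - 1) for k in range(1, K + 1)]


def effective_kernel_size(n, dilation):
    """Side length covered by an n x n kernel with the given dilation rate."""
    return (n - 1) * dilation + 1


dilations = esp_branch_dilations(4)
sizes = [effective_kernel_size(3, r) for r in dilations]
print(dilations)  # [1, 2, 4, 8]
print(sizes)      # [3, 5, 9, 17]
```

The largest branch alone already covers a 17 x 17 window while still learning only a 3 x 3 kernel.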

Width divider K
For a standard convolution with $M$ input channels, $N$ output channels, and an $n \times n$ kernel, the number of parameters to learn is $n^2MN$ and the effective receptive field is $n^2$. The hyperparameter $K$ adjusts the computational complexity of the ESP module. First, a pointwise convolution reduces the input dimension from $M$ to $\frac{N}{K}$ (reduce). The dilated convolution pyramid described above then processes the low-dimensional features in parallel (split and transform), and finally the outputs of the $K$ dilated convolution branches are merged (merge). The ESP module has $\frac{MN}{K} + \frac{(nN)^2}{K}$ parameters and an effective receptive field of $[(n-1)2^{K-1} + 1]^2$, an improvement over the standard convolution in both parameter count and receptive field.
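The parameter formulas can be checked numerically. This is a minimal sketch that counts weights only (no biases), assumes $K$ divides $N$, and uses hypothetical function names; the values $M = N = 128$, $n = 3$, $K = 4$ are chosen just as an example.

```python
def standard_conv_params(M, N, n):
    """Weights of a standard n x n convolution from M to N channels."""
    return n * n * M * N


def esp_params(M, N, n, K):
    """Weights of an ESP module: pointwise reduce plus K dilated branches."""
    d = N // K                         # reduced dimension, assumes K divides N
    reduce_params = M * d              # 1 x 1 convolution from M to N/K channels
    pyramid_params = K * n * n * d * d  # K branches, each n x n from d to d channels
    return reduce_params + pyramid_params


def esp_receptive_field(n, K):
    """Effective receptive field [(n-1) * 2^(K-1) + 1]^2 of the ESP module."""
    side = (n - 1) * 2 ** (K - 1) + 1
    return side * side


print(standard_conv_params(128, 128, 3))  # 147456
print(esp_params(128, 128, 3, 4))         # 40960
print(esp_receptive_field(3, 4))          # 289
```

For these values the ESP module uses 3.6 times fewer parameters than the standard convolution, while its receptive field grows from 3 x 3 to 17 x 17.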

Hierarchical feature fusion (HFF) for degridding
Although the dilated convolution pyramid provides a larger receptive field, directly concatenating the branch outputs produces unwanted gridding artifacts, as shown in Figure 2. To solve this, the branch outputs are summed hierarchically before concatenation. Compared with adding extra convolutions for post-processing, HFF removes the gridding artifacts without much additional computation. In addition, to improve gradient flow through the network, a shortcut connection from input to output is added to the ESP module.
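The hierarchical sum can be sketched with plain Python lists standing in for feature maps. This is a simplified illustration and not the authors' implementation: it assumes the branches are ordered by increasing dilation rate and that each fused output is the running sum of all branches up to that point, which may differ in detail from the official code. The name `hff_merge` is made up for this example.

```python
from itertools import accumulate


def hff_merge(branch_outputs):
    """Hierarchically sum branch outputs, then concatenate them.

    branch_outputs: list of equal-length feature vectors, ordered by
    increasing dilation rate (an assumption of this sketch).
    """
    def add(a, b):
        return [x + y for x, y in zip(a, b)]

    # Running sums: b1, b1 + b2, b1 + b2 + b3, ...
    fused = list(accumulate(branch_outputs, add))
    # Concatenate the fused outputs along the channel axis.
    return [value for branch in fused for value in branch]


print(hff_merge([[1, 1], [2, 2], [4, 4]]))  # [1, 1, 3, 3, 7, 7]
```

Because each fused output also contains the outputs of the smaller dilation rates, the holes left by a large dilation rate are filled in before concatenation, and the only extra cost is a few element-wise additions.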
Relationship with other CNN modules
The paper compares the core modules of several lightweight networks. The ESP module compares favorably in parameter count, memory, and receptive field.
ESPNet
Figure 4 shows the evolution of the ESPNet architecture. $l$ denotes the feature map size (spatial level): modules with the same $l$ operate on feature maps of the same size. The red and green modules are the down-sampling and up-sampling modules, respectively. Unless otherwise stated, $\alpha_2 = 2$ and $\alpha_3 = 8$, where $\alpha_l$ is the number of times the ESP module is repeated at level $l$.
Experiments
Only some of the experiments are listed here; see the paper for the rest.
Comparison after replacing the ESP module in Figure 4d with other modules.
Comparison with other semantic segmentation models.
Conclusion
ESPNet is a lightweight network for semantic segmentation. While staying lightweight, it designs its core module specifically for semantic segmentation: the dilated convolution pyramid extracts features with multiple receptive fields while reducing the parameter count, and HFF neatly removes the gridding artifacts. Both ideas are well worth borrowing.
ESPNetV2
Paper: ESPNetv2: A Light-weight, Power Efficient, and General Purpose Convolutional Neural Network
 Paper address: https://arxiv.org/abs/1811.11431
 Paper code: https://github.com/sacmehta/ESPNetv2
Introduction
There are three main approaches to making models lightweight: model compression, model quantization, and lightweight architecture design. This paper designs the lightweight network ESPNetv2, and its main contributions are:
 A general-purpose lightweight architecture that supports both visual and sequential data, that is, both vision tasks and natural language processing tasks.
 An extension of ESPNet with depthwise dilated separable convolutions, which achieves better accuracy with fewer parameters than ESPNet.
 Experiments showing that ESPNetv2 achieves good accuracy with a low parameter count on multiple vision tasks, including image classification, semantic segmentation, and object detection.
 A cyclic learning rate scheduler that works better than the usual fixed learning rate scheduler.
Depthwise dilated separable convolution
Assume the input is $X \in \mathbb{R}^{W \times H \times c}$, the convolution kernel is $K \in \mathbb{R}^{n \times n \times c \times \hat{c}}$, and the output is $Y \in \mathbb{R}^{W \times H \times \hat{c}}$. The parameter counts and effective receptive fields of standard convolution, group convolution, depthwise separable convolution, and depthwise dilated separable convolution are shown in Table 1.
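Table 1 is not reproduced in this article, so the following sketch recomputes the comparison from the standard weight-count formulas for each convolution type (biases ignored). The group count `g`, the dilation rate `r`, the function name, and the example values are assumptions for illustration and are not taken from the table.

```python
def conv_comparison(n, c, c_hat, g, r):
    """Return {conv type: (parameter count, receptive field side length)}."""
    dilated_side = (n - 1) * r + 1  # side covered by an n x n kernel at dilation r
    return {
        "standard": (n * n * c * c_hat, n),
        "group": (n * n * c * c_hat // g, n),
        # depthwise n x n convolution (n^2 * c) plus pointwise 1 x 1 (c * c_hat)
        "depthwise separable": (n * n * c + c * c_hat, n),
        "depthwise dilated separable": (n * n * c + c * c_hat, dilated_side),
    }


for name, (params, side) in conv_comparison(n=3, c=64, c_hat=64, g=4, r=4).items():
    print(f"{name}: {params} parameters, {side} x {side} receptive field")
```

With these example values the depthwise dilated separable convolution has the same 4672 parameters as the plain depthwise separable one, but covers a 9 x 9 window instead of 3 x 3.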
EESP unit
The EESP (Extremely Efficient Spatial Pyramid) module improves the ESP module using depthwise dilated separable convolutions and group pointwise convolutions. The original ESP module structure is shown in Fig. 1a. First, the pointwise convolution is replaced with a group pointwise convolution, and then the computationally expensive dilated convolutions are replaced with depthwise dilated separable convolutions. HFF is still used to remove the gridding artifacts. The resulting structure, shown in Fig. 1b, reduces the computational complexity by a factor of $\frac{Md + n^2d^2K}{\frac{Md}{g} + (n^2 + d)dK}$, where $d = \frac{N}{K}$ is the reduced dimension, $g$ is the number of groups in the group pointwise convolution, and $K$ is the number of branches in the dilated convolution pyramid. Because computing the $K$ pointwise convolutions separately is equivalent to a single group pointwise convolution with $K$ groups, and group convolution has a more efficient implementation, the structure is further refined into the final form in Fig. 1c.
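To get a feel for the size of this factor, the ratio can be evaluated directly. The sketch below simply plugs numbers into the expression $\frac{Md + n^2d^2K}{\frac{Md}{g} + (n^2 + d)dK}$; the function name and the values $M = 240$, $d = 60$, $n = 3$, $g = K = 4$ are illustrative choices.

```python
def eesp_reduction_factor(M, d, n, K, g):
    """Ratio of the ESP cost to the EESP (Fig. 1b) cost."""
    esp_cost = M * d + n * n * d * d * K           # pointwise reduce + K dilated convs
    eesp_cost = M * d / g + (n * n + d) * d * K    # group pointwise + K depthwise dilated separable convs
    return esp_cost / eesp_cost


print(round(eesp_reduction_factor(M=240, d=60, n=3, K=4, g=4), 2))  # 7.14
```

With these values the EESP structure is roughly seven times cheaper than the ESP structure it replaces.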
To learn multi-scale features more efficiently, the paper proposes a down-sampling version of the EESP module (Strided EESP with shortcut connection to an input image), which makes the following changes:
 Change the depthwise dilated separable convolutions to a stride = 2 version.
 Add an average pooling operation to the module's original shortcut.
 Replace the element-wise addition with concatenation, which increases the feature dimension of the output.
 To prevent the information loss caused by down-sampling, add a shortcut connected to the input image. This path applies pooling repeatedly until its spatial size matches the feature map output by the module, then uses two convolutions to extract features and adjust the dimension, and finally performs element-wise addition.
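The spatial bookkeeping behind these changes can be sketched with the standard output-size formula for convolution and pooling. The kernel sizes, paddings, dilation rates, and function names below are assumptions made for illustration (3 x 3 kernels, padding equal to the dilation rate, 3 x 3 average pooling with stride 2 and padding 1); the paper's exact settings may differ.

```python
def conv_out_size(size, kernel, stride, padding, dilation=1):
    """Standard output-size formula for a convolution or pooling window."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


def strided_eesp_sizes(feature_size, image_size, dilations=(1, 2, 4, 8)):
    # Assumed: 3 x 3 depthwise dilated convolutions, stride 2, padding = dilation.
    branch_sizes = [conv_out_size(feature_size, 3, 2, r, r) for r in dilations]
    # Assumed: 3 x 3 average pooling, stride 2, padding 1 on the module shortcut.
    shortcut_size = conv_out_size(feature_size, 3, 2, 1)
    # Assumed: the image shortcut repeats the same pooling until the sizes match.
    image_path_size, num_pools = image_size, 0
    while image_path_size > shortcut_size:
        image_path_size = conv_out_size(image_path_size, 3, 2, 1)
        num_pools += 1
    return branch_sizes, shortcut_size, image_path_size, num_pools


print(strided_eesp_sizes(feature_size=56, image_size=224))
# ([28, 28, 28, 28], 28, 28, 3)
```

Under these assumptions every branch, the pooled shortcut, and the image path end up at the same spatial size, which is what makes the concatenation and the final element-wise addition possible.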
Network architecture
The network structure of ESPNetv2 is shown in Table 2. Each convolution in the EESP module is followed by a BN layer and PReLU, except for the last group convolution of the module, whose PReLU is applied after the element-wise addition. The settings are $g = K = 4$, and the rest is similar to ESPNet.
Cyclic learning rate scheduler
For image classification training, the paper designs a cyclic learning rate scheduler. At epoch $t$, the learning rate $\eta_t$ is computed as:
$\eta_t = \eta_{max} - (t \bmod T) \cdot \eta_{min}$
where $\eta_{max}$ and $\eta_{min}$ are the maximum and minimum learning rates and $T$ is the cycle length.
The visualization of the cyclic learning rate scheduler is shown in Figure 4.
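A minimal sketch of such a scheduler follows, assuming the schedule takes the form $\eta_t = \eta_{max} - (t \bmod T) \cdot \eta_{min}$ given in the ESPNetv2 paper. The function name and the values $\eta_{max} = 0.5$, $\eta_{min} = 0.1$, $T = 5$ are illustrative, and any additional decay applied on top of the cycle during training is left out.

```python
def cyclic_lr(t, eta_max, eta_min, T):
    """Learning rate at epoch t: restart at eta_max every T epochs,
    then decrease linearly by eta_min per epoch within the cycle."""
    return eta_max - (t % T) * eta_min


# Two full cycles of length 5.
schedule = [cyclic_lr(t, eta_max=0.5, eta_min=0.1, T=5) for t in range(10)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch}: lr = {lr:.2f}")
```

The rate falls from 0.5 to 0.1 over five epochs and then jumps back to 0.5, which gives the sawtooth shape of a warm-restart schedule.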
Experiments
Image classification performance comparison.
Semantic segmentation performance comparison.
Object detection performance comparison.
Text generation performance comparison.
Conclusion
Building on ESPNet, ESPNetv2 incorporates depthwise separable convolution to make the model even lighter. Combined with richer feature fusion, the model extends to a variety of tasks and performs very well.
CONCLUSION
The core of the ESPNet series is the dilated convolution pyramid, in which each branch uses a different dilation rate, so multi-scale features can be fused without increasing the number of parameters. Compared with depthwise separable convolution, the depthwise dilated separable convolution pyramid is more cost-effective. The HFF multi-scale feature fusion method is also worth borrowing.