Dynamic ReLU: Microsoft's accuracy-boosting trick may be the best ReLU improvement | ECCV 2020

Time: 2020-11-19

In this paper, Dynamic ReLU is proposed, which adjusts its piecewise activation function dynamically according to the input. Compared with ReLU and its variants, it brings a significant performance improvement with only a small amount of extra computation, and it can be seamlessly embedded into current mainstream models.

Source: Xiaofei’s algorithm Engineering Notes official account

Paper: Dynamic ReLU


Introduction


ReLU is an important milestone in deep learning. It is simple but powerful and greatly improves the performance of neural networks. There are many improved versions of ReLU, such as LeakyReLU and PReLU, but the final parameters of both the original and the improved versions are fixed. The paper therefore naturally asks whether it would be better to adjust the parameters of ReLU according to the input features.


Based on the above idea, Dynamic ReLU (DY-ReLU) is proposed. As shown in Figure 2, DY-ReLU is a piecewise function $f_{\theta(x)}(x)$ whose parameters are produced by the hyper function $\theta(x)$ from the input $x$. The hyper function $\theta(x)$ aggregates the context of all input dimensions to adapt the activation function $f_{\theta(x)}(x)$, which significantly improves the expressive power of the network with only a small amount of extra computation. In addition, the paper provides three forms of DY-ReLU with different sharing mechanisms over spatial locations and channels; different forms suit different tasks. The paper also verifies that DY-ReLU brings good improvements in keypoint detection and image classification.

Definition and Implementation of Dynamic ReLU


Definition

The original ReLU is defined as $y=\max\{x, 0\}$, where $x$ is the input vector; for the $c$-th channel of the input, the activation is computed as $y_c=\max\{x_c, 0\}$. ReLU can be generalized to the piecewise linear function $y_c=\max_k\{a^k_c x_c+b^k_c\}$. This paper adapts the coefficients $a^k_c$ and $b^k_c$ dynamically to the input $x$:

$$y_c = f_{\theta(x)}(x_c) = \max_{1\le k\le K}\{a^k_c(x)\, x_c + b^k_c(x)\}$$

The coefficients $(a^k_c, b^k_c)$ are the output of the hyper function $\theta(x)$:

$$[a^{1:K}_{1:C},\ b^{1:K}_{1:C}]^T = \theta(x)$$

$K$ is the number of linear functions, $C$ is the number of channels, and the activation parameters $(a^k_c, b^k_c)$ depend not only on $x_c$ but also on $x_{j \ne c}$.

Implementation of the hyper function $\theta(x)$

The hyper function is implemented as a light-weight SE-like subnetwork: the input is squeezed by global average pooling and passed through two fully-connected layers (with a ReLU in between), and $2\sigma(x)-1$ is used to normalize the output to $(-1, 1)$, where $\sigma$ is the sigmoid function. The subnetwork outputs $2KC$ elements, corresponding to the residuals of $a^{1:K}_{1:C}$ and $b^{1:K}_{1:C}$. The final output is the sum of the initial values and the residuals:

$$a^k_c(x) = \alpha^k + \lambda_a \Delta a^k_c(x), \qquad b^k_c(x) = \beta^k + \lambda_b \Delta b^k_c(x)$$

$\alpha^k$ and $\beta^k$ are the initial values of $a^k_c$ and $b^k_c$, and $\lambda_a$ and $\lambda_b$ are scalars that control the magnitude of the residuals. For $K=2$, the default parameters are $\alpha^1=1$ and $\alpha^2=\beta^1=\beta^2=0$, which is exactly the original ReLU, and the default scalars are $\lambda_a=1.0$ and $\lambda_b=0.5$.
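
As a concrete illustration, below is a minimal PyTorch sketch of the channel-wise variant (DY-ReLU-B, described later), assuming $K=2$, a two-layer fully-connected hyper function with a reduction ratio of 4, and the default initial values above. It is a reading aid, not the authors' reference implementation.

```python
# Minimal sketch of DY-ReLU-B (spatially shared, channel-wise coefficients).
# Layer sizes and the reduction ratio are assumptions made for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DyReLUB(nn.Module):
    def __init__(self, channels, reduction=4, k=2, lambda_a=1.0, lambda_b=0.5):
        super().__init__()
        self.k = k
        self.lambda_a, self.lambda_b = lambda_a, lambda_b
        # Hyper function theta(x): global average pool -> FC -> ReLU -> FC,
        # producing 2*K*C residual elements.
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, 2 * k * channels),
        )
        # Initial values: alpha^1 = 1, all other alphas/betas = 0 (i.e. plain ReLU).
        init_a = torch.zeros(k)
        init_a[0] = 1.0
        self.register_buffer("init_a", init_a)
        self.register_buffer("init_b", torch.zeros(k))

    def forward(self, x):                                   # x: (N, C, H, W)
        n, c, h, w = x.shape
        ctx = F.adaptive_avg_pool2d(x, 1).view(n, c)        # global context, (N, C)
        res = 2 * torch.sigmoid(self.fc(ctx)) - 1           # residuals normalized to (-1, 1)
        res = res.view(n, c, 2 * self.k)
        # Split residuals into slopes (a) and intercepts (b), add initial values.
        a = self.init_a + self.lambda_a * res[..., : self.k]   # (N, C, K)
        b = self.init_b + self.lambda_b * res[..., self.k :]   # (N, C, K)
        # y = max_k { a^k_c * x_c + b^k_c }, broadcast over spatial positions.
        x_ = x.unsqueeze(-1)                                    # (N, C, H, W, 1)
        out = x_ * a.view(n, c, 1, 1, self.k) + b.view(n, c, 1, 1, self.k)
        return out.max(dim=-1).values
```

Changing the last fully-connected layer to output only $2K$ values and broadcasting them across channels would give the spatially-and-channel-shared variant (DY-ReLU-A) described later.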

Relation to Prior Work

[Table 1: the relationship between DY-ReLU and ReLU, LeakyReLU, PReLU, SE and Maxout.]

DY-ReLU is quite general. Table 1 shows the relationship between DY-ReLU and the original ReLU and its variants. With specific learned parameters, DY-ReLU can become equivalent to ReLU, LeakyReLU and PReLU. When $K=1$ and the bias $b^1_c=0$, it is equivalent to the SE module. In addition, DY-ReLU can also be seen as a dynamic and efficient Maxout operator, which replaces Maxout's $K$ convolutions with $K$ dynamic linear transforms and likewise outputs their maximum.
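
To make these equivalences concrete, the snippet below is a toy example of the $K=2$ form with fixed coefficients and zero intercepts (rather than learned, input-dependent ones), showing which parameter settings recover the static activations:

```python
import torch

def piecewise_max(x, a1, a2, b1=0.0, b2=0.0):
    """y = max{a1*x + b1, a2*x + b2}: the K=2 per-channel form of DY-ReLU."""
    return torch.maximum(a1 * x + b1, a2 * x + b2)

x = torch.linspace(-2, 2, 5)
relu_like  = piecewise_max(x, a1=1.0, a2=0.0)    # a1=1, a2=b1=b2=0  -> ReLU
leaky_like = piecewise_max(x, a1=1.0, a2=0.01)   # fixed small a2    -> LeakyReLU(0.01)
prelu_like = piecewise_max(x, a1=1.0, a2=0.25)   # per-channel slope -> PReLU-style
```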

Variations of Dynamic ReLU


[Figure 2: the three DY-ReLU variants and their calculation structures.]

The paper provides three forms of DY-ReLU, which differ in how the activation parameters are shared across spatial locations and channels.

DY-ReLU-A

Shared across both spatial positions and channels. The calculation is shown in Figure 2A; only $2K$ parameters are output. It is the cheapest to compute but also the weakest in expressive power.

DY-ReLU-B

Shared across spatial positions but channel-wise, as shown in Figure 2B; $2KC$ parameters are output.

DY-ReLU-C

Shared across neither spatial positions nor channels: each element has its own activation function $\max_k\{a^k_{c,h,w} x_{c,h,w} + b^k_{c,h,w}\}$. Although the expressive power is very strong, it requires far too many output parameters ($2KCHW$); producing them directly with a fully-connected layer, for example, would bring too much extra computation. The paper therefore refactors the calculation, as shown in Figure 2C: the spatial dimension is decomposed into a separate attention branch, and the channel-wise parameters $[a^{1:K}_{1:C}, b^{1:K}_{1:C}]$ are multiplied by the spatial attentions $[\pi_{1:HW}]$. The attention is computed with a $1\times 1$ convolution and normalized with a constrained softmax function:

[Equation: constrained softmax normalization of the spatial attentions $\pi_{1:HW}$.]

$\gamma$ is used to scale the averaged attention and is set to $\frac{HW}{3}$ in the paper; $\tau$ is the temperature, set to a large value (10) at the beginning of training to prevent the attention from being too sparse.
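
Below is a rough sketch of the DY-ReLU-C spatial branch under these assumptions: the per-position logits come from a $1\times 1$ convolution, and the constrained softmax is taken here as $\min\{\gamma\cdot\mathrm{softmax}(z/\tau),\ 1\}$ with $\gamma=\frac{HW}{3}$. The exact normalization in the paper may differ in detail; this is only an illustration of the decomposition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    """Spatial branch of DY-ReLU-C: one attention value per position, in (0, 1]."""
    def __init__(self, channels, tau=10.0):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=1)  # 1x1 conv -> per-position logit
        self.tau = tau  # temperature; large at the start of training to avoid sparse attention

    def forward(self, x):                        # x: (N, C, H, W)
        n, _, h, w = x.shape
        z = self.conv(x).view(n, h * w)          # spatial logits, (N, HW)
        gamma = h * w / 3.0                      # scaling factor HW/3 from the paper
        pi = torch.clamp(gamma * F.softmax(z / self.tau, dim=1), max=1.0)
        return pi.view(n, 1, h, w)

# Usage: attn = SpatialAttention(C)(x) has shape (N, 1, H, W); the per-element
# coefficients are obtained by multiplying the channel-wise (a, b) from the
# DY-ReLU-B-style branch with this attention at every spatial position.
```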

Experimental Results

[Table: image classification comparison experiments.]

[Table: keypoint detection experiments.]

[Table: multi-aspect comparison with ReLU on ImageNet.]

[Table: comparison with other activation functions.]

[Figure: visualization of DY-ReLU's inputs, outputs and slope changes in different blocks, showing its dynamic behavior.]

Conclusion


In this paper, Dynamic ReLU is proposed, which adjusts its piecewise activation function dynamically according to the input. Compared with ReLU and its variants, it brings a significant performance improvement with only a small amount of extra computation, and it can be seamlessly embedded into current mainstream models. As mentioned in an earlier article, APReLU also performs a kind of dynamic ReLU with a very similar subnetwork structure, but the $\max_{1\le k\le K}$ in DY-ReLU makes it more general and more effective than APReLU.


