# Dynamic ReLU: Microsoft’s refreshing new work may be the best ReLU improvement | ECCV 2020

Time: 2020-11-19

This paper proposes Dynamic ReLU, an activation function that dynamically adjusts the parameters of its piecewise form according to the input. Compared with ReLU and its variants, it brings a significant performance improvement at only a small extra computational cost, and it can be seamlessly embedded into current mainstream models.

Source: Xiaofei’s Algorithm Engineering Notes official account

Paper: Dynamic ReLU

## Introduction

ReLU is an important milestone in deep learning: it is simple yet powerful and greatly improves the performance of neural networks. Many improved versions of ReLU exist, such as Leaky ReLU and PReLU, but the parameters of the original and of these variants are fixed once training ends. This naturally leads the paper to ask whether it would be better to adjust the parameters of ReLU adaptively according to the input features.

Based on this idea, the paper proposes Dynamic ReLU (DY-ReLU). As shown in Figure 2, DY-ReLU is a piecewise function $f_{\theta(x)}(x)$ whose parameters are produced by a hyper function $\theta(x)$ from the input $x$. The hyper function encodes the global context of all dimensions of the input to adapt the activation function $f_{\theta(x)}(x)$, which significantly improves the expressiveness of the network at a small extra computational cost. In addition, the paper presents three forms of DY-ReLU with different sharing mechanisms across spatial locations and channels; different forms suit different tasks. Experiments verify that DY-ReLU brings solid improvements on both keypoint detection and image classification.

## Definition and Implementation of Dynamic ReLU

The original ReLU is defined as $y=\max\{x, 0\}$, where $x$ is the input vector; for the $c$-th channel of the input, the activation is computed as $y_c=\max\{x_c, 0\}$. ReLU can be generalized as a piecewise linear function $y_c=\max_k\{a^k_c x_c + b^k_c\}$. This paper adapts the coefficients $(a^k_c, b^k_c)$ dynamically to the input $x$:

$$y_c = f_{\theta(x)}(x_c) = \max_{1 \le k \le K}\{a^k_c(x)\, x_c + b^k_c(x)\}$$

where the coefficients $(a^k_c, b^k_c)$ are the output of the hyper function $\theta(x)$, $K$ is the number of linear functions, and $C$ is the number of channels. Note that the activation parameters $(a^k_c, b^k_c)$ are related not only to $x_c$ but also to the other channels $x_{j \ne c}$.

#### Implementation of the hyper function $\theta(x)$

The hyper function is implemented as a lightweight SE-style subnetwork: the input is squeezed by global average pooling, passed through two fully connected layers with a ReLU in between, and the output is normalized to $(-1, 1)$ by $2\sigma(x)-1$, where $\sigma$ is the sigmoid function. The subnetwork outputs $2KC$ elements, corresponding to the residuals $\Delta a^{1:K}_{1:C}$ and $\Delta b^{1:K}_{1:C}$. The final coefficients are the sum of initial values and residuals:

$$a^k_c(x) = \alpha^k + \lambda_a \Delta a^k_c(x), \qquad b^k_c(x) = \beta^k + \lambda_b \Delta b^k_c(x)$$
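To make the computation concrete, below is a minimal PyTorch sketch of the spatially-shared, channel-wise form (DY-ReLU-B). The class name `DyReLUB`, the `reduction` bottleneck ratio, and the exact layer shapes are illustrative assumptions, not the authors’ reference implementation.

```python
import torch
import torch.nn as nn


class DyReLUB(nn.Module):
    """Minimal sketch of DY-ReLU-B (spatially shared, channel-wise).

    The hyper function theta(x) is an SE-style subnet: global average
    pooling, two FC layers with a ReLU in between, and a 2*sigmoid - 1
    normalization of the 2KC residual outputs. `reduction` is an
    assumed bottleneck ratio.
    """

    def __init__(self, channels, k=2, reduction=8,
                 lambda_a=1.0, lambda_b=0.5):
        super().__init__()
        self.k = k
        self.lambda_a = lambda_a
        self.lambda_b = lambda_b
        # Initial values: alpha^1 = 1, alpha^2 = beta^1 = beta^2 = 0,
        # so zero residuals recover the original ReLU when K = 2.
        init_a = torch.zeros(k)
        init_a[0] = 1.0
        self.register_buffer("init_a", init_a)
        self.register_buffer("init_b", torch.zeros(k))
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, 2 * k * channels),
        )

    def forward(self, x):
        # x: (N, C, H, W)
        n, c, _, _ = x.shape
        context = x.mean(dim=(2, 3))          # squeeze: (N, C)
        res = self.fc(context)                # (N, 2KC)
        res = 2 * torch.sigmoid(res) - 1      # normalize to (-1, 1)
        res = res.view(n, c, 2 * self.k)
        da, db = res.split(self.k, dim=2)     # each (N, C, K)
        a = self.init_a + self.lambda_a * da  # a_c^k(x)
        b = self.init_b + self.lambda_b * db  # b_c^k(x)
        # y_c = max_k { a_c^k * x_c + b_c^k }, broadcast over H and W.
        out = (x.unsqueeze(-1) * a.view(n, c, 1, 1, self.k)
               + b.view(n, c, 1, 1, self.k))
        return out.max(dim=-1).values
```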

Here $\alpha^k$ and $\beta^k$ are the initial values of $a^k_c$ and $b^k_c$, while $\lambda_a$ and $\lambda_b$ are scalars that control the magnitude of the residuals. For $K=2$, the default parameters are $\alpha^1=1$ and $\alpha^2=\beta^1=\beta^2=0$, which corresponds to the original ReLU, and the default scalars are $\lambda_a=1.0$ and $\lambda_b=0.5$.
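As a quick sanity check of these defaults, the hypothetical usage below zeroes the last FC layer of the sketch above so that every residual vanishes ($2\sigma(0)-1=0$); with $\alpha^1=1$ and the other initial values at zero, DY-ReLU then reduces exactly to the original ReLU.

```python
import torch

m = DyReLUB(channels=64)              # K = 2, paper's default scalars
torch.nn.init.zeros_(m.fc[2].weight)  # force all residuals to zero
torch.nn.init.zeros_(m.fc[2].bias)

x = torch.randn(2, 64, 8, 8)
assert torch.allclose(m(x), torch.relu(x))
```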