Adaptively parametric ReLU (APReLU) is a dynamic activation function that does not “treat all inputs equally”. The paper proposing it was submitted to IEEE Transactions on Industrial Electronics on May 3, 2019, accepted on January 24, 2020, and published on the IEEE website on February 13, 2020.
After summarizing traditional activation functions and the attention mechanism, this article explains a dynamic activation function built on the attention mechanism, namely the adaptively parametric rectifier linear unit (APReLU), in the hope that it will be helpful to you.
1. The traditional activation function is static
Activation functions are an important part of modern artificial neural networks; their role is to make the network nonlinear. Let’s first introduce some of the most common activation functions, namely the sigmoid, tanh and ReLU activation functions, as shown in the figure below:
The outputs of the sigmoid and tanh activation functions lie in (0, 1) and (-1, 1) respectively, and their gradients lie in (0, 0.25] and (0, 1]. When there are many layers, repeatedly multiplying such gradients can shrink them toward zero, so the artificial neural network may suffer from vanishing gradients. The gradient of the ReLU activation function is either zero or one, which helps mitigate vanishing and exploding gradients, so it has been widely used in recent years.
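The effect of depth on these gradients can be checked numerically. The following is a minimal sketch in plain Python; the function names and the depth of 20 layers are illustrative assumptions, and the chain ignores the weight matrices that a real network would also multiply in.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)            # largest value is 0.25, reached at x = 0

def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2  # largest value is 1, reached at x = 0

def relu_grad(x):
    return 1.0 if x > 0 else 0.0    # either zero or one

# Hypothetical depth; the best case for sigmoid is every layer sitting at x = 0.
depth = 20
sigmoid_chain = sigmoid_grad(0.0) ** depth
relu_chain = relu_grad(1.0) ** depth

print("sigmoid chain:", sigmoid_chain)
print("relu chain:", relu_chain)
```

Even in its best case, the product of sigmoid gradients shrinks rapidly with depth, while the product along an active ReLU path stays at one.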
However, the ReLU activation function still has a flaw. If all the features fed to it are less than zero during training, its output is all zero and no gradient flows back through it, so training fails. To avoid this situation, some scholars have proposed the leaky ReLU activation function, which does not set features less than zero to zero but instead multiplies them by a small coefficient, such as 0.1 or 0.01.
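The difference between the two functions is easy to see on a handful of negative inputs. This is a minimal sketch; the feature values are made up for illustration, and the default slope of 0.01 is one of the example coefficients mentioned above.

```python
def relu(x):
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    # Negative inputs are scaled by a small, manually chosen coefficient
    # instead of being set to zero.
    return x if x > 0 else slope * x

# Hypothetical features that all happen to be negative.
features = [-2.0, -0.5, -1.5]

relu_out = [relu(v) for v in features]
leaky_out = [leaky_relu(v) for v in features]

print(relu_out)   # every entry is zero, so no information passes through
print(leaky_out)  # small negative values survive
```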
In leaky ReLU, the value of this coefficient is set manually. However, a manually set coefficient may not be the best choice. Therefore, Kaiming He et al. proposed the parametric ReLU (PReLU) activation function, which makes this coefficient a trainable parameter that is updated by gradient descent together with the other parameters during training. However, the PReLU activation function has one characteristic: once training is completed, its coefficient becomes a fixed value. In other words, the value of this coefficient is the same for all test samples.
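To make the idea of a trainable coefficient concrete, here is a minimal sketch that fits a single PReLU coefficient with plain gradient descent. The toy data, the squared-error loss, the learning rate and the helper name `train_alpha` are all assumptions made for illustration; they are not the training setup of the original PReLU paper.

```python
def prelu(x, alpha):
    # Positive part passes through; negative part is scaled by alpha.
    return max(0.0, x) + alpha * min(0.0, x)

def train_alpha(samples, targets, alpha=0.0, lr=0.1, epochs=200):
    """Fit alpha by gradient descent on the mean squared error (toy setup)."""
    for _ in range(epochs):
        grad = 0.0
        for x, t in zip(samples, targets):
            err = prelu(x, alpha) - t
            grad += 2.0 * err * min(0.0, x)  # d(err^2)/d(alpha)
        alpha -= lr * grad / len(samples)
    return alpha

# Toy data generated with a hypothetical "true" negative slope of 0.25.
true_slope = 0.25
xs = [-2.0, -1.0, -0.5, 1.0, 2.0]
ts = [prelu(x, true_slope) for x in xs]

alpha = train_alpha(xs, ts)
print("learned alpha:", alpha)

# After training, the same alpha is applied to every test sample.
test_outputs = [prelu(x, alpha) for x in [-3.0, -0.1, 4.0]]
print(test_outputs)
```

The last two lines show the limitation discussed in the text: whatever value `alpha` reaches during training, it is reused unchanged for every test input.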
So far we have briefly introduced several common activation functions. What is wrong with them? Consider this: if an artificial neural network adopts one of the above activation functions, or a combination of them, then once training is finished and the network is applied to test samples, the nonlinear transformation applied to every test sample is the same, that is, static. This is a rather rigid approach.
As shown in the figure below, let the scatter plot on the left represent the original feature space, the scatter plot on the right represent the high-level feature space learned by the artificial neural network, the small dots and squares represent two different classes of samples, and f, g and h represent nonlinear functions. These samples are then mapped from the original feature space to the high-level feature space by the same nonlinear functions. In other words, the “=” in the figure means that the nonlinear transformation these samples experience is exactly the same.
Then, can we set the parameters of the activation function separately for each sample, according to that sample's characteristics, so that each sample experiences its own dynamic nonlinear transformation? This is what the APReLU activation function, introduced later in this article, achieves.
2. Attention mechanism
The APReLU activation function introduced in this article draws on the classic squeeze-and-excitation network (SENet), a well-known deep learning method based on the attention mechanism. The basic principle of SENet is shown in the figure below:
Here is the idea behind SENet. For many samples, the importance of each feature channel in the feature map is likely to be different. For example, feature channel 1 of sample A may be very important while feature channel 2 is not, whereas feature channel 1 of sample B is not important but feature channel 2 is. In this case, for sample A we should focus on feature channel 1 (i.e. give feature channel 1 a higher weight); conversely, for sample B we should focus on feature channel 2 (i.e. give feature channel 2 a higher weight).
To achieve this goal, SENet learns a set of weight coefficients through a small fully connected network and uses them to weight each channel of the original feature map. In this way, each sample (whether a training sample or a test sample) has its own unique set of weights for its feature channels. This is in fact an attention mechanism: it pays attention to important feature channels and gives them higher weights.
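The channel-weighting idea can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions: feature maps are lists of 1-D channels, the small network is global average pooling followed by two fully connected layers with a sigmoid output, and the weights `w1`, `b1`, `w2`, `b2` are hand-set hypothetical values rather than learned ones.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense(vec, weights, biases):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, biases)]

def se_weights(feature_map, w1, b1, w2, b2):
    # Squeeze: global average pooling gives one number per channel.
    squeezed = [sum(ch) / len(ch) for ch in feature_map]
    # Excitation: FC -> ReLU -> FC -> sigmoid gives one weight in (0, 1) per channel.
    hidden = [max(0.0, h) for h in dense(squeezed, w1, b1)]
    return [sigmoid(z) for z in dense(hidden, w2, b2)]

def se_block(feature_map, w1, b1, w2, b2):
    # Scale every channel by its own sample-specific weight.
    weights = se_weights(feature_map, w1, b1, w2, b2)
    return [[w * v for v in ch] for w, ch in zip(weights, feature_map)]

# Hypothetical hand-set parameters: 2 channels -> 2 hidden units -> 2 channels.
w1, b1 = [[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]
w2, b2 = [[2.0, -2.0], [-2.0, 2.0]], [0.0, 0.0]

sample_a = [[3.0, 3.0], [1.0, 1.0]]  # channel 1 dominates
sample_b = [[1.0, 1.0], [3.0, 3.0]]  # channel 2 dominates

weights_a = se_weights(sample_a, w1, b1, w2, b2)
weights_b = se_weights(sample_b, w1, b1, w2, b2)
print("sample A weights:", weights_a)
print("sample B weights:", weights_b)
print("sample A reweighted:", se_block(sample_a, w1, b1, w2, b2))
```

With the same network parameters, the two samples receive different channel weights, which is the per-sample behavior described above.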
3. Adaptively parametric rectifier linear unit (APReLU) activation function
The APReLU activation function is, in essence, an integration of SENet and the PReLU activation function. In SENet, the weights learned by the small fully connected network are used to weight each feature channel. The APReLU activation function also obtains weights through a small fully connected network, and then uses this set of weights as the coefficients in the PReLU activation function, that is, the weights of the negative part. The basic principle of the APReLU activation function is shown in the figure below.
We can see that the nonlinear transformation in the APReLU activation function has exactly the same form as in the PReLU activation function. The only difference is that the weight coefficients applied to negative features in APReLU are learned through a small fully connected network. When an artificial neural network adopts the APReLU activation function, each sample can have its own unique weight coefficients, that is, its own unique nonlinear transformation (as shown in the figure below). At the same time, the input feature map and output feature map of the APReLU activation function have the same size, which means that APReLU can be easily embedded into existing deep learning algorithms.
To sum up, the APReLU activation function gives each sample its own unique nonlinear transformation, provides a more flexible, dynamic mode of nonlinear transformation, and has the potential to improve the accuracy of pattern recognition.
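A rough sketch of this per-sample behavior is given below. The text only states that a small fully connected network produces the negative-part coefficients, so several details here are assumptions: the network input (per-channel means of the positive and negative parts), the two-layer FC, ReLU, FC, sigmoid shape, the absence of any normalization layer, and the hand-set weights `w1`, `b1`, `w2`, `b2`. It should be read as an illustration of the idea, not as the paper's exact architecture.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense(vec, weights, biases):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, biases)]

def aprelu_coefficients(feature_map, w1, b1, w2, b2):
    # Assumed network input: per-channel means of the positive and negative parts.
    pos_mean = [sum(max(0.0, v) for v in ch) / len(ch) for ch in feature_map]
    neg_mean = [sum(min(0.0, v) for v in ch) / len(ch) for ch in feature_map]
    hidden = [max(0.0, h) for h in dense(pos_mean + neg_mean, w1, b1)]
    # Sigmoid keeps each per-channel coefficient in (0, 1).
    return [sigmoid(z) for z in dense(hidden, w2, b2)]

def aprelu(feature_map, w1, b1, w2, b2):
    alphas = aprelu_coefficients(feature_map, w1, b1, w2, b2)
    # Same form as PReLU, but alpha now depends on the input sample.
    return [[max(0.0, v) + a * min(0.0, v) for v in ch]
            for a, ch in zip(alphas, feature_map)]

# Hypothetical hand-set parameters: 4 statistics -> 2 hidden units -> 2 channels.
w1, b1 = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, -1.0]], [0.0, 0.0]
w2, b2 = [[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]

sample_a = [[2.0, -2.0], [1.0, -1.0]]
sample_b = [[1.0, -1.0], [4.0, -4.0]]

alphas_a = aprelu_coefficients(sample_a, w1, b1, w2, b2)
alphas_b = aprelu_coefficients(sample_b, w1, b1, w2, b2)
out_a = aprelu(sample_a, w1, b1, w2, b2)
out_b = aprelu(sample_b, w1, b1, w2, b2)

print("sample A coefficients:", alphas_a)
print("sample B coefficients:", alphas_b)
print("sample A output:", out_a)
print("sample B output:", out_b)
```

Two samples passed through the same layer get different negative-part coefficients, positive values pass through unchanged, and the output has the same shape as the input, so the function can stand in wherever a ReLU or PReLU would be used.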
Zhao M, Zhong S, Fu X, et al. Deep residual networks with adaptively parametric rectifier linear units for fault diagnosis[J]. IEEE Transactions on Industrial Electronics, 2020, DOI: 10.1109/TIE.2020.2972458, Date of Publication: 13 February 2020