Author: LogM

This article was originally published at https://segmentfault.com/u/logm/articles and may not be reproduced without permission.

## 1. Why ReLU for CNNs?

The derivative of sigmoid lies in (0, 0.25], that of tanh in (0, 1], and that of ReLU takes only the two values {0, 1}.
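These ranges are easy to verify numerically (a minimal sketch; the grid and variable names are my own):

```python
import numpy as np

x = np.linspace(-10, 10, 100001)  # grid includes x = 0, where both peaks occur

sigmoid = 1.0 / (1.0 + np.exp(-x))
d_sigmoid = sigmoid * (1.0 - sigmoid)   # sigmoid'(x) = s(x)(1 - s(x)), peaks at 0.25
d_tanh = 1.0 - np.tanh(x) ** 2          # tanh'(x) = 1 - tanh(x)^2, peaks at 1
d_relu = (x > 0).astype(float)          # ReLU'(x) is exactly 0 or 1

print(d_sigmoid.max())  # -> 0.25
print(d_tanh.max())     # -> 1.0
```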

If sigmoid is used in every layer of a CNN, backpropagation multiplies many derivatives that are all less than 1, and the gradient vanishes.
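A toy illustration of that repeated multiplication, using sigmoid's *best-case* derivative of 0.25 at every layer (real gradients shrink even faster, since most activations are away from 0):

```python
# Multiply the maximal sigmoid derivative (0.25) across 20 layers.
grad = 1.0
for layer in range(20):
    grad *= 0.25

print(grad)  # 0.25**20, about 9.1e-13 -- effectively zero
```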

So why not use tanh? Although the derivative of tanh can reach 1 (at the origin), it still saturates in its tails, where the gradient again vanishes. At best, tanh alleviates the vanishing-gradient problem.

ReLU has no vanishing gradient in its positive half, but the gradient is exactly zero in its negative half. (However, some argue that the zeroed negative half makes the activations sparse, which acts as a mild regularizer.)

Most importantly, ReLU is much cheaper to compute than sigmoid and tanh.
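A rough micro-benchmark of the three activations (timings are machine- and library-dependent; NumPy vectorizes all three, so the gap is smaller here than in a scalar implementation, where ReLU is a single comparison while sigmoid and tanh each need an `exp`):

```python
import timeit
import numpy as np

x = np.random.randn(1_000_000)

# Time 50 forward passes of each activation over one million values.
t_relu = timeit.timeit(lambda: np.maximum(x, 0.0), number=50)
t_sigmoid = timeit.timeit(lambda: 1.0 / (1.0 + np.exp(-x)), number=50)
t_tanh = timeit.timeit(lambda: np.tanh(x), number=50)

print(f"relu: {t_relu:.3f}s  sigmoid: {t_sigmoid:.3f}s  tanh: {t_tanh:.3f}s")
```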

> A four-layer convolutional neural network with ReLUs (solid line) reaches a 25% training error rate on CIFAR-10 six times faster than an equivalent network with tanh neurons (dashed line). (Krizhevsky et al., 2012)

ReLU was not originally invented for deep networks, so it is hard to say, from the inventor's point of view, what problem in deep networks ReLU was designed to solve. In practice, researchers first found empirically that ReLU works well in deep networks, and only afterwards proposed theories to explain why. These post-hoc justifications of ReLU are therefore somewhat forced.

Because ReLU was not developed specifically for deep networks, it still has many problems and much room for improvement when transplanted to them.

Reference: How did Krizhevsky and others think of using Dropout and ReLU in CNN?

## 2. Why tanh for RNNs?

"Tanh for RNNs" mainly refers to using tanh as the activation for the hidden state in GRU and LSTM cells; the gates inside them generally use sigmoid, because gate values must lie between 0 and 1.
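The division of labor is visible in a single LSTM step. Below is a minimal sketch (toy dimensions, random weights; not any library's API): sigmoid squashes the three gates into (0, 1), while tanh squashes the candidate and hidden states into (-1, 1).

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    # One fused affine map yields the pre-activations of all four parts.
    z = W @ x + U @ h + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # gates: values in (0, 1)
    g = np.tanh(g)                                # candidate state: in (-1, 1)
    c_new = f * c + i * g                         # gated cell-state update
    h_new = o * np.tanh(c_new)                    # hidden state squashed by tanh
    return h_new, c_new

n, d = 8, 5                                       # hidden size, input size
W = rng.standard_normal((4 * n, d))
U = rng.standard_normal((4 * n, n))
b = np.zeros(4 * n)
h, c = np.zeros(n), np.zeros(n)
x = rng.standard_normal(d)

h, c = lstm_step(x, h, c, W, U, b)
print(h.shape, c.shape)  # -> (8,) (8,)
```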

An RNN uses the same parameter matrix W at every time step. Ignoring the activation function, backpropagating through many steps amounts to multiplying by W repeatedly: components along eigendirections of W with magnitude less than 1 shrink rapidly toward 0, while those with magnitude greater than 1 blow up. RNNs are therefore more prone to vanishing and exploding gradients than CNNs.
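Repeated multiplication by a fixed matrix makes this concrete (a sketch with an arbitrary random matrix rescaled to two different spectral radii):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((10, 10))

norms = {}
for scale in (0.5, 1.5):
    # Rescale A so its spectral radius (largest eigenvalue magnitude) is `scale`.
    W = A / np.abs(np.linalg.eigvals(A)).max() * scale
    v = np.ones(10)
    for _ in range(50):   # 50 "time steps" of multiplying by the same W
        v = W @ v
    norms[scale] = np.linalg.norm(v)

print(norms)  # radius 0.5 -> vanishes toward 0; radius 1.5 -> explodes
```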

Understanding the source of RNNs' exploding gradients helps explain why ReLU is not recommended:

> At first sight, ReLUs seem inappropriate for RNNs because they can have very large outputs, so they might be expected to be far more likely to explode than units that have bounded values.

Likewise, understanding the source of RNNs' vanishing gradients helps explain why tanh is not ideal either: tanh also suffers from vanishing gradients, though far less severely than sigmoid.

So neither ReLU nor tanh is perfect. In practice two recipes coexist: `RNN + ReLU + gradient clipping` and `GRU/LSTM + tanh`. Which of the two is better is still debated.
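For the first recipe, gradient clipping caps the update when the gradient explodes. A minimal sketch of clipping by global norm (the function name and threshold are my own, not any library's API):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    # Rescale all gradients together if their combined L2 norm exceeds max_norm.
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total > max_norm:
        grads = [g * (max_norm / total) for g in grads]
    return grads

grads = [np.full(4, 10.0), np.full(4, -10.0)]   # an "exploded" gradient
clipped = clip_by_global_norm(grads, max_norm=5.0)

total = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(total)  # rescaled down to about 5.0
```

The clip rescales rather than truncating elementwise, so the gradient's direction is preserved and only its magnitude is bounded.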

Reference: Why use tanh instead of ReLu as activation function in RNN?