**This article is from machine learning alchemy**

Generally speaking, the vanishing gradient problem and the exploding gradient problem can be jointly referred to as the **gradient instability problem**.

**Knowledge to memorize:** The vanishing gradient problem is addressed by using ReLU instead of sigmoid, by BN (batch normalization) layers, and by residual structures. The exploding gradient problem can be restrained by regularization. The derivative of sigmoid lies in the range (0, 0.25].

# 1 causes

Both problems stem from the **chain rule**. When a model has many layers, the gradient computation contains many product terms. The following example illustrates this:

This is an example with only one neuron per layer, where the activation function of each neuron is sigmoid. Suppose we want to update the parameter $b_1$.

Using the commonly accepted notation:

- $w_1 x_1 + b_1 = z_1$: this is what $z$ means (for later layers, $z_j = w_j a_{j-1} + b_j$);
- $\sigma(z_1) = a_1$: this is what $a$ means.

We can get this partial derivative:

$\frac{\partial C}{\partial b_1} = \frac{\partial z_1}{\partial b_1}\frac{\partial a_1}{\partial z_1} \frac{\partial z_2}{\partial a_1}\frac{\partial a_2}{\partial z_2} \frac{\partial z_3}{\partial a_2}\frac{\partial a_3}{\partial z_3} \frac{\partial z_4}{\partial a_3}\frac{\partial a_4}{\partial z_4} \frac{\partial C}{\partial a_4}$

Then, simplify:

$\frac{\partial C}{\partial b_1}=\sigma'(z_1)w_2\sigma'(z_2)w_3\sigma'(z_3)w_4\sigma'(z_4)\frac{\partial C}{\partial a_4}$
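The simplified product can be checked numerically. The sketch below is only an illustration under assumptions that are not in the original: a squared-error cost $C = \frac{1}{2}(a_4 - y)^2$, and arbitrary example values for the input, target, weights, and biases. It evaluates the product formula and compares it with a finite-difference estimate of $\partial C / \partial b_1$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def forward(x, weights, biases):
    """Run the one-neuron-per-layer chain; return every z_j and the final activation."""
    a = x
    zs = []
    for w, b in zip(weights, biases):
        z = w * a + b
        zs.append(z)
        a = sigmoid(z)
    return zs, a

def cost(x, weights, biases, target):
    # Assumed cost for illustration: C = 0.5 * (a_4 - target)^2
    _, a_out = forward(x, weights, biases)
    return 0.5 * (a_out - target) ** 2

def grad_b1(x, weights, biases, target):
    """dC/db_1 from the simplified chain-rule product."""
    zs, a_out = forward(x, weights, biases)
    grad = a_out - target                 # dC/da_4 for the assumed cost
    for j in reversed(range(len(zs))):
        grad *= sigmoid_prime(zs[j])      # sigma'(z_{j+1}) in 1-based notation
        if j > 0:
            grad *= weights[j]            # w_{j+1} in 1-based notation
    return grad

# Arbitrary example values (assumptions, chosen only for the check).
x, target = 0.5, 1.0
weights = [0.8, -1.2, 0.5, 1.5]
biases = [0.1, -0.3, 0.2, 0.0]

analytic = grad_b1(x, weights, biases, target)

# Central finite difference with respect to b_1.
eps = 1e-6
b_plus = [biases[0] + eps] + biases[1:]
b_minus = [biases[0] - eps] + biases[1:]
numeric = (cost(x, weights, b_plus, target) - cost(x, weights, b_minus, target)) / (2 * eps)

print(analytic, numeric)
```

The two printed values should agree to many decimal places, which confirms that $\partial z_1 / \partial b_1 = 1$ drops out and only the $\sigma'(z_j)$ and $w_j$ factors remain.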

The key is $\sigma'(z_j)$. The derivative of the sigmoid function lies in the range (0, 0.25], which means that, unless the weights are large, the deeper the network is, the smaller the gradients of the earlier layers become. The following figure shows the derivative of the sigmoid function:
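The 0.25 bound can be verified with a few lines of Python. This is a minimal sketch: the grid of $z$ values and the chosen depths are arbitrary assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Scan z over [-10, 10] in steps of 0.1 and find where the derivative is largest.
grid = [i / 10.0 for i in range(-100, 101)]
peak_z = max(grid, key=sigmoid_prime)
peak = sigmoid_prime(peak_z)

# Upper bound on the product of n sigmoid derivatives: 0.25 ** n.
bounds = {n: 0.25 ** n for n in (1, 5, 10, 20)}

print(peak_z, peak)
for n, bound in bounds.items():
    print(n, bound)
```

The derivative peaks at $z = 0$ with value 0.25, so the product of ten sigmoid derivatives alone is already below $10^{-6}$, before the weights are taken into account.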

Therefore, the following phenomenon often occurs:

The figure shows the gradient magnitudes of the four hidden layers. As you can see, the earliest hidden layer (the one closest to the input) updates the slowest. [Note: the vertical axis in the figure is on a logarithmic scale.]

Gradient explosion is then just as easy to understand: if the factors $|w_j\sigma'(z_j)|$ are greater than 1, their product grows with depth and the gradient explodes.

[Note: if the activation function is sigmoid, its derivative is at most 0.25, and $|w_j|$ is generally not greater than 4, so networks with sigmoid activations generally suffer from the vanishing gradient problem.]
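The two regimes can be contrasted with a small numerical sketch. It rests on simplifying assumptions that are not in the original: every layer shares the same weight `w`, and every $z_j$ is pinned at 0, where $\sigma'$ reaches its maximum of 0.25. A trained network does not behave this neatly; the point is only to show how the per-layer factor $w\,\sigma'(z)$ compounds.

```python
import math

def sigmoid_prime(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def chain_product(w, z, depth):
    """Product of `depth` identical factors w * sigma'(z) (hypothetical helper)."""
    return (w * sigmoid_prime(z)) ** depth

depth = 10
vanishing = chain_product(w=2.0, z=0.0, depth=depth)   # factor 0.5 per layer
exploding = chain_product(w=8.0, z=0.0, depth=depth)   # factor 2.0 per layer

print(vanishing, exploding)
```

With `w = 2` the factor per layer is 0.5 and the product over ten layers is $0.5^{10} \approx 0.001$; with `w = 8` the factor is 2 and the product is $2^{10} = 1024$. Only weights with magnitude above 4 can push the factor past 1, which is why the sigmoid case usually vanishes rather than explodes.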

[Summary]:

- Vanishing and exploding gradients both affect the gradients of the earlier layers: the chain rule multiplies together many factors that are smaller (or greater) than 1, so the gradient becomes very small (or very large);
- The maximum derivative of sigmoid is 0.25, so sigmoid networks generally run into the vanishing gradient problem.

# 2 solutions

## 2.1 replace activation function

The most common solution is to change the activation function. In today's neural networks, apart from the places where sigmoid is still used, the activation function of each layer is usually ReLU.

**[ReLU]** If the derivative of the activation function is 1, the activation itself no longer shrinks or amplifies the gradient in the chain-rule product. **[Benefits]** The derivative of ReLU equals 1 on the positive part, so it avoids the vanishing gradient caused by the activation. **[Drawbacks]** The derivative on the negative part equals 0, which means that as soon as one $z_j$ in the chain-rule product is less than 0, the gradient flowing through that neuron is 0 and the parameters before it on that path are not updated.
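A minimal sketch of this behavior follows. The helper `activation_factor` and the example pre-activation values are hypothetical, introduced only for illustration; the helper multiplies the ReLU derivatives along a chain and ignores the weight factors.

```python
def relu(z):
    return max(0.0, z)

def relu_prime(z):
    # Derivative of ReLU: 1 on the positive part, 0 on the negative part.
    return 1.0 if z > 0 else 0.0

def activation_factor(zs):
    """Product of the ReLU derivatives along a chain of pre-activations."""
    product = 1.0
    for z in zs:
        product *= relu_prime(z)
    return product

all_positive = activation_factor([0.3, 2.0, 5.0, 0.1])   # every z_j > 0
one_negative = activation_factor([0.3, -2.0, 5.0, 0.1])  # a single z_j < 0

print(all_positive, one_negative)
```

When every $z_j$ is positive the activation contributes a factor of exactly 1, however deep the chain; a single negative $z_j$ makes the whole product 0.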

**[Leaky ReLU]** A small slope is added to the negative part of ReLU.

This solves the problem of dead neurons in ReLU.
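As a rough sketch, leaky ReLU and its derivative can be written as below. The slope `alpha = 0.01` is a commonly used default and is an assumption here; the original does not specify a value.

```python
def leaky_relu(z, alpha=0.01):
    # Identity on the positive part, small slope alpha on the negative part.
    return z if z > 0 else alpha * z

def leaky_relu_prime(z, alpha=0.01):
    # The derivative is never exactly 0, so the gradient is never cut off entirely.
    return 1.0 if z > 0 else alpha

print(leaky_relu(-2.0), leaky_relu_prime(-2.0), leaky_relu_prime(3.0))
```

Because the derivative on the negative part is `alpha` rather than 0, a neuron with $z_j < 0$ still receives a small gradient and can recover.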

**[ELU]** Like leaky ReLU, it is designed to solve the dead-neuron problem, but the slope of the negative part is not fixed.

However, compared with leaky ReLU, it requires more computation, because the negative part evaluates an exponential.
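For comparison, here is a minimal sketch of ELU in its usual form, $\alpha(e^z - 1)$ for $z \le 0$. The choice `alpha = 1.0` is an assumption made for illustration.

```python
import math

def elu(z, alpha=1.0):
    # Identity on the positive part, alpha * (exp(z) - 1) on the negative part.
    return z if z > 0 else alpha * (math.exp(z) - 1.0)

def elu_prime(z, alpha=1.0):
    # The slope on the negative part, alpha * exp(z), changes with z.
    return 1.0 if z > 0 else alpha * math.exp(z)

for z in (-3.0, -1.0, 2.0):
    print(z, elu(z), elu_prime(z))
```

The slope on the negative part decays smoothly toward 0 as $z$ becomes more negative instead of staying constant, and each evaluation there costs one call to `exp`.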

## 2.2 batchnorm layer

This has been hugely successful and is used almost everywhere in image processing. In essence, the BN layer **addresses the gradient problem in backpropagation**.

Neural networks suffer from a problem called **Internal Covariate Shift**.

Suppose the input data pass through the first layer to produce the input of the second layer. The distribution of the second layer's input differs from that of the first layer's input, so the parameters of the second layer are updated to fit the distribution of their own input. In the next batch, however, the parameters of the first layer have also changed, so the distribution of the second layer's input differs from what it was in the previous batch, and the update direction of the second layer's parameters changes as well. The more layers there are, the more pronounced the problem becomes.

To keep the distribution of each layer stable, why not simply normalize the output of every layer to zero mean and unit variance? The trouble is that the network would then fail to learn the characteristics of the input data: forcing every layer's data into the same standard distribution, whatever the data are, would be odd. BN therefore adds two adaptive parameters, $\gamma$ and $\beta$, that are learned during training. With them, the data of each layer are normalized to a distribution with mean $\beta$ and standard deviation $\gamma$.
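The normalization step can be sketched for a single feature over one batch as follows. This is a simplified, training-mode illustration only: the function name `batch_norm`, the `eps` constant, and the example values are assumptions, and real implementations also track running statistics for inference.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a batch, then scale by gamma and shift by beta."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    # eps keeps the division stable when the batch variance is close to 0.
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

batch = [1.0, 2.0, 3.0, 4.0]
out = batch_norm(batch, gamma=2.0, beta=5.0)

out_mean = sum(out) / len(out)
out_std = math.sqrt(sum((v - out_mean) ** 2 for v in out) / len(out))
print(out_mean, out_std)
```

The output has mean `beta` and standard deviation very close to `gamma` (slightly smaller because of `eps`), so the layer can learn whatever location and scale suit the next layer instead of being locked to zero mean and unit variance.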

[Normalizing the input distribution removes the absolute differences in the data and emphasizes the relative differences, so the BN layer works well in classification. For image-to-image tasks, the absolute differences in the data are also very important, so the BN layer may not be as effective.]

## 2.3 residual structure

The residual structure, simply understood, gives the deep network shortcut connections so that, along the shortcut paths, the network is effectively not so deep. This alleviates the vanishing gradient problem.
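The effect of the shortcut on the gradient can be sketched with scalar layers. Assume, purely for illustration, that each layer body is $f(a) = \sigma(w a)$ with a shared weight `w`; a plain layer outputs $f(a)$, while a residual layer outputs $a + f(a)$, whose derivative is $1 + f'(a)$. The function name and the example values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def input_gradient(x, w, depth, residual):
    """d(output)/d(input) for `depth` stacked scalar layers, via the chain rule."""
    a, grad = x, 1.0
    for _ in range(depth):
        z = w * a
        local = w * sigmoid_prime(z)   # derivative of the layer body f(a) = sigmoid(w * a)
        if residual:
            a = a + sigmoid(z)         # y = a + f(a)
            grad *= 1.0 + local        # dy/da = 1 + f'(a)
        else:
            a = sigmoid(z)             # y = f(a)
            grad *= local              # dy/da = f'(a)
    return grad

plain_grad = input_gradient(x=0.5, w=0.5, depth=10, residual=False)
residual_grad = input_gradient(x=0.5, w=0.5, depth=10, residual=True)
print(plain_grad, residual_grad)
```

In the plain stack every factor is at most $0.5 \times 0.25 = 0.125$, so the gradient collapses after ten layers. In the residual stack every factor is at least 1 (for positive `w`), because the shortcut always contributes the constant term 1.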

## 2.4 regularization

The exploding gradient problem mentioned earlier is generally caused by $w_j$ being too large; L2 regularization, which penalizes large weights, can mitigate it.
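A minimal sketch of how the L2 penalty keeps weights small: adding $\frac{\lambda}{2} w^2$ to the cost adds $\lambda w$ to the gradient, so every update also pulls $w$ toward 0. The learning rate, $\lambda$, and the starting weight below are arbitrary assumptions, and the data gradient is set to 0 to isolate the effect of the penalty.

```python
def sgd_step_l2(w, data_grad, lr=0.1, lam=0.5):
    """One gradient step on C + (lam / 2) * w**2; the penalty contributes lam * w."""
    return w - lr * (data_grad + lam * w)

w = 8.0                      # a weight large enough to make w * sigma'(z) exceed 1
history = [w]
for _ in range(50):
    w = sgd_step_l2(w, data_grad=0.0)   # data gradient zeroed out for illustration
    history.append(w)

print(history[0], history[-1])
```

Each step multiplies the weight by $1 - \eta\lambda = 0.95$, so after 50 steps it has decayed from 8 to below 1. In real training the data gradient counteracts this pull, and the penalty only biases the weights toward smaller magnitudes.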