This section introduces common failure cases of the backpropagation algorithm and common methods for regularizing neural networks.
Several common situations can cause backpropagation to go wrong.
The gradients for the lower layers (those closer to the input) can become very small. In deep networks, computing these gradients involves taking the product of many small terms. When the gradients of the lower layers vanish toward 0, those layers train very slowly, or stop training altogether.
The ReLU activation function can help prevent vanishing gradients.
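As an illustration (not part of the original text), the toy NumPy sketch below multiplies one local derivative per layer, the way backpropagation chains gradients: with a saturating activation such as sigmoid, whose derivative never exceeds 0.25, thirty layers are enough to shrink the product to essentially zero, while ReLU's derivative of 1 on its active region preserves the magnitude.

```python
# Toy sketch: backprop multiplies one local derivative per layer,
# so many small terms drive the gradient toward zero.
import numpy as np

def chained_gradient(local_derivs):
    """Product of per-layer local derivatives, as accumulated by backprop."""
    return np.prod(local_derivs)

depth = 30
sigmoid_derivs = np.full(depth, 0.25)   # best case for sigmoid: sigma'(0) = 0.25
relu_derivs = np.full(depth, 1.0)       # ReLU derivative is 1 on the active region

print(chained_gradient(sigmoid_derivs))  # ~8.7e-19: effectively vanished
print(chained_gradient(relu_derivs))     # 1.0: gradient magnitude preserved
```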
If the weights in a network are very large, the gradients for the lower layers involve the product of many large terms. In this case the gradients can explode: they become too large to allow convergence.
Batch normalization can help prevent exploding gradients, as can lowering the learning rate.
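As a hedged sketch of these two mitigations (the layer sizes, input width, and learning rate below are illustrative assumptions, not from the original text), a Keras model might insert BatchNormalization layers between the dense layers and train with a modest SGD learning rate:

```python
# Illustrative Keras model: batch normalization between layers plus a
# small learning rate, the two mitigations mentioned above.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),                   # 20 input features (placeholder choice)
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.BatchNormalization(),          # keeps activations in a well-behaved range
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(1),
])

# Lowering the learning rate is the other simple mitigation.
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01), loss="mse")
```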
Dead ReLU units
Once the weighted sum for a ReLU unit falls below 0, the unit can get stuck. It outputs an activation of 0, contributing nothing to the network's output, and gradients no longer flow through it during backpropagation. With the source of gradients cut off, the inputs to the ReLU may never change enough to bring the weighted sum back above 0.
Lowering the learning rate can help keep ReLU units from dying.
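One practical way to notice the problem (an illustrative NumPy sketch, not from the original text) is to measure how many units never activate on a batch; the strongly negative bias below is chosen deliberately so that most units come out dead in this toy setup. The mitigation the text suggests remains simply a lower learning rate.

```python
# Toy check: count ReLU units whose weighted sum is <= 0 for every example
# in a batch, i.e. units that output 0 and receive no gradient.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 20))    # a batch of 256 examples (hypothetical data)
W = rng.normal(size=(20, 64))     # one hidden layer's weights
b = np.full(64, -20.0)            # very negative bias: a common way units die

z = x @ W + b                     # weighted sums (pre-activation)
dead = np.all(z <= 0, axis=0)     # unit never activates on this batch
print(f"dead ReLU units: {dead.sum()} / {dead.size}")
```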
Yet another form of regularization, called dropout, is useful for neural networks. It works by randomly dropping out units in the network for a single gradient step. The more you drop out, the stronger the regularization (see the sketch after this list):
- 0.0 = no dropout regularization
- 1.0 = drop out everything; the model learns nothing
- Values between 0.0 and 1.0 are more useful
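A minimal Keras sketch (the layer sizes and the 0.3 rate are illustrative assumptions, not from the original text) showing where the dropout rate enters a model:

```python
# Illustrative model with dropout: the rate argument is the fraction of
# unit activations randomly zeroed at each training step.
import tensorflow as tf

dropout_rate = 0.3   # a value between 0.0 and 1.0, per the list above

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),                    # 20 input features (placeholder choice)
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(dropout_rate),          # drops 30% of activations per step
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(dropout_rate),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
```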