1. Multilayer perceptron

Definition: a multilayer perceptron (MLP) is obtained by introducing one or more hidden layers into a single-layer neural network, so the network consists of an input layer, one or more hidden layers, and an output layer

2. Activation function of multilayer perceptron

Without an activation function, a multilayer perceptron degenerates into a single-layer network

Formulas of the multilayer perceptron without an activation function: hidden layer H = XW_{h} + b_{h}

Output layer O = HW_{o} + b_{o} = (XW_{h} + b_{h})W_{o} + b_{o} = XW_{h}W_{o} + b_{h}W_{o} + b_{o}

Here W_{h}W_{o} plays the role of a single weight matrix W, and b_{h}W_{o} + b_{o} plays the role of a single bias b, so the output has the form XW + b. This is the same first-degree (affine) function that a single layer computes, so the network collapses back into a single layer
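
The collapse derived above can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper names (matmul, add_bias, rand_matrix) and the layer sizes are assumptions chosen for illustration, not part of the original notes.

```python
import random

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def add_bias(M, b):
    """Add the row vector b to every row of M."""
    return [[m + bj for m, bj in zip(row, b)] for row in M]

def rand_matrix(rows, cols, rng):
    """Matrix with entries drawn uniformly from [-1, 1]."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n, d, h, q = 4, 3, 5, 2  # batch size, inputs, hidden units, outputs (arbitrary)

X = rand_matrix(n, d, rng)
W_h, b_h = rand_matrix(d, h, rng), rand_matrix(1, h, rng)[0]
W_o, b_o = rand_matrix(h, q, rng), rand_matrix(1, q, rng)[0]

# Two layers with no activation function in between
H = add_bias(matmul(X, W_h), b_h)
O_two_layer = add_bias(matmul(H, W_o), b_o)

# Equivalent single layer: W = W_h W_o, b = b_h W_o + b_o
W = matmul(W_h, W_o)
b = [u + v for u, v in zip(matmul([b_h], W_o)[0], b_o)]
O_single = add_bias(matmul(X, W), b)

# Largest element-wise difference; only floating-point rounding remains
max_diff = max(
    abs(u - v)
    for row_a, row_b in zip(O_two_layer, O_single)
    for u, v in zip(row_a, row_b)
)
print(max_diff)
```

Both computations give the same outputs up to rounding, which is why stacking layers without a nonlinearity adds no expressive power.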

Role of the activation function

(1) It makes a multilayer perceptron genuinely multilayer; without it, the network is equivalent to a single-layer perceptron

(2) By introducing nonlinearity, it lets the network approximate a wide class of nonlinear functions, which overcomes the limitation of the single-layer network
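
As a small illustration of point (2), XOR cannot be represented by a single affine layer, but a hidden layer with two units and a nonlinearity can represent it. The sketch below uses ReLU as the nonlinearity; the weights are picked by hand for illustration (not learned), and the function names are hypothetical.

```python
def relu(z):
    """Rectified linear unit: max(0, z)."""
    return max(0.0, z)

def xor_mlp(x1, x2):
    """Two hidden ReLU units with hand-picked weights, then a linear output."""
    h1 = relu(x1 + x2)        # hidden unit 1: weights (1, 1), bias 0
    h2 = relu(x1 + x2 - 1.0)  # hidden unit 2: weights (1, 1), bias -1
    return h1 - 2.0 * h2      # output: weights (1, -2), bias 0

def xor_no_activation(x1, x2):
    """Same weights with the activation removed; this is just an affine function."""
    h1 = x1 + x2
    h2 = x1 + x2 - 1.0
    return h1 - 2.0 * h2

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [xor_mlp(a, b) for a, b in inputs]
linear_outputs = [xor_no_activation(a, b) for a, b in inputs]

print(outputs)         # matches XOR on the four inputs
print(linear_outputs)  # does not match XOR
```

With the activation removed, the same weights reduce to 2 - (x1 + x2), which is affine in the inputs and cannot reproduce XOR.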

Desirable properties of an activation function

(1) It should be continuous and differentiable (non-differentiability at a few points is allowed), so that the network parameters can be learned with numerical optimization methods

(2) The activation function should be as simple as possible, which improves computational efficiency

(3) The range of the activation function and the range of its derivative should lie in a suitable interval; otherwise the stability and efficiency of training will suffer

5 common activation functions

1. Sigmoid

Common in early neural networks, RNNs, and binary classification tasks. Its output range is (0, 1), so it can be used to output the probability in binary classification

Disadvantage: in the saturated regions (inputs of large magnitude) the gradient is close to zero, so the parameters are barely updated and the gradient is difficult to propagate backward
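
The saturation problem can be seen by evaluating the sigmoid and its derivative at a few points. This is a minimal sketch using the standard definition sigmoid(z) = 1 / (1 + e^{-z}) and its derivative sigmoid(z)(1 - sigmoid(z)); the function names and sample inputs are chosen here for illustration.

```python
import math

def sigmoid(z):
    """Logistic sigmoid; naive form, adequate for the moderate inputs used here."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: s * (1 - s), where s = sigmoid(z)."""
    s = sigmoid(z)
    return s * (1.0 - s)

# The derivative peaks at z = 0 and shrinks quickly as |z| grows
for z in (0.0, 2.0, 5.0, 10.0):
    print(z, sigmoid(z), sigmoid_grad(z))
```

The derivative never exceeds 0.25 (its value at z = 0), and at z = 10 it is already below 1e-4, so a unit operating in that region passes almost no gradient to the layers before it.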

2. Tanh (hyperbolic tangent)

3. ReLU (rectified linear unit)

The most commonly used activation function in neural networks. It does not saturate for positive inputs. Although it is not differentiable at z = 0, this does not violate the properties required of an activation function (non-differentiability at a few points is allowed). It is widely used in convolutional networks
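
A minimal sketch of ReLU and the derivative typically used during training. The derivative does not exist at z = 0; returning 0 there is a convention assumed for this sketch, and the function names are illustrative.

```python
def relu(z):
    """ReLU(z) = max(0, z)."""
    return z if z > 0 else 0.0

def relu_grad(z):
    # Derivative is 1 for z > 0 and 0 for z < 0.
    # At z == 0 it is undefined; 0 is used here by convention (an assumption).
    return 1.0 if z > 0 else 0.0

for z in (-3.0, 0.0, 2.5, 100.0):
    print(z, relu(z), relu_grad(z))
```

For any positive input the derivative stays at 1 no matter how large the input is, in contrast to the sigmoid, whose derivative shrinks toward zero for large inputs.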