## Objective

**Neural networks can solve linearly inseparable problems by introducing nonlinearity into their neurons.** The simplest example is XOR.
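As a minimal sketch of this point, the hand-picked weights below (an illustrative assumption, not taken from a trained model) let a hidden layer of two ReLU units compute XOR, which a single linear unit cannot do. The function name `xor_net` is hypothetical.

```python
def relu(x):
    return max(0.0, x)

def xor_net(x1, x2):
    # Hidden layer: two ReLU units with hand-picked (hypothetical) weights.
    h1 = relu(x1 + x2)        # grows with the number of active inputs
    h2 = relu(x1 + x2 - 1.0)  # non-zero only when both inputs are active
    # Output layer: a plain linear combination of the hidden units.
    return h1 - 2.0 * h2

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, xor_net(a, b))
```

Without the `relu` calls the whole expression would reduce to a linear function of `x1` and `x2`, and no such function matches all four XOR rows.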

Hornik proved that a multilayer feedforward network with just one hidden layer can approximate any continuous function to arbitrary precision, provided the hidden layer contains enough neurons. Neural networks can therefore approximate nonlinear functions arbitrarily well, which is why they can be applied to so many nonlinear models.

Without an activation function, the output of each layer is a linear function of the previous layer's output. No matter how many layers the network has, its output is then a linear combination of the inputs, and the multilayer network degenerates into a single-layer network.
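The collapse can be checked numerically. The sketch below stacks two linear layers with arbitrarily chosen weights (biases are left out for brevity, which is a simplifying assumption) and compares the result with a single layer whose weight matrix is the product of the two.

```python
def matmul(a, b):
    # Multiply two matrices given as lists of rows.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# Two linear layers without activation (weights chosen arbitrarily).
W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]  # 3x2: 2 inputs -> 3 hidden units
W2 = [[2.0, 0.0, 1.0]]                       # 1x3: 3 hidden units -> 1 output
x = [[0.5], [-2.0]]                          # input as a 2x1 column vector

two_layer = matmul(W2, matmul(W1, x))  # apply layer 1, then layer 2
W_merged = matmul(W2, W1)              # one equivalent 1x2 weight matrix
one_layer = matmul(W_merged, x)

print(two_layer, one_layer)
```

Both computations give the same output, so the extra layer adds no expressive power unless a nonlinearity sits between the two matrix products.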

## Common activation functions

The classical functions include the **Sigmoid function**, the **Tanh function**, and the **ReLU function**.
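For reference, the three functions can be written in a few lines using their standard textbook definitions. This is a plain-Python sketch that is not hardened against overflow for very large negative inputs.

```python
import math

def sigmoid(x):
    # Squashes any real input into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # Squashes any real input into (-1, 1), centered on 0.
    return math.tanh(x)

def relu(x):
    # Passes positive inputs through and clips negative ones to 0.
    return max(0.0, x)

for x in (-2.0, 0.0, 2.0):
    print(f"x={x:+.1f}  sigmoid={sigmoid(x):.4f}  "
          f"tanh={tanh(x):+.4f}  relu={relu(x):.1f}")
```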

A good activation function usually needs to satisfy the following properties:

- Nonlinearity: the derivative is not constant.
- Differentiability: this guarantees that gradients can be computed during optimization. ReLU is non-differentiable at a finite number of points, but a subgradient can be used in place of the gradient there.
- Cheap computation: a complex activation function slows computation down, which is why ReLU is more popular than sigmoid and other activation functions that involve exp.
- Non-saturation: saturation means that the gradient is close to zero in some intervals (that is, the gradient vanishes), so the parameters stop being updated. The classic example is sigmoid, whose derivative is close to 0 when x is a large positive or a large negative value. For x < 0 the gradient of ReLU is always 0, so ReLU saturates there as well. Leaky ReLU and PReLU were proposed to solve this problem.
- Monotonicity: the sign of the derivative does not change. When the activation function is monotonic, a single-layer network is guaranteed to be convex. Activation functions such as Mish are not monotonic, however, so monotonicity is not a hard requirement, because neural networks are inherently non-convex anyway.
- Few parameters: most activation functions have no parameters. PReLU, with a single parameter, slightly increases the size of the network. Another exception is Maxout: although it has no parameters of its own, for the same number of output channels a k-way Maxout needs k times as many input channels as other functions, which means k times as many neurons.
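The saturation point in the list above can be illustrated by comparing derivatives on the negative side. In this sketch the slope `alpha=0.01` is an illustrative choice, and the value returned at exactly x = 0 is a chosen subgradient. PReLU has the same form as Leaky ReLU but treats `alpha` as a learned parameter.

```python
def relu_grad(x):
    # Derivative of max(0, x); the value at x == 0 is a chosen subgradient.
    return 1.0 if x > 0 else 0.0

def leaky_relu_grad(x, alpha=0.01):
    # alpha is the slope on the negative side (0.01 is an assumption).
    return 1.0 if x > 0 else alpha

for x in (-3.0, -0.5, 0.5, 3.0):
    print(x, relu_grad(x), leaky_relu_grad(x))
```

For negative inputs ReLU passes back no gradient at all, while the leaky variant still passes back a small one, so the unit can keep learning.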

## Advantages and disadvantages of activation functions

### Sigmoid activation function

**Advantages:**

- The function is smooth and easy to differentiate
- The output lies between 0 and 1

**Disadvantages:**

- It is expensive to compute (both forward and backward propagation involve exponentiation and division);
- Gradient vanishing: when the input is very large or very small (toward either end of the curve), the sigmoid derivative is close to zero, so the final gradient is close to zero and the parameters are barely updated;
- The output of sigmoid is not zero-centered. As a result, the inputs to subsequent layers are not zero-centered either, which hampers gradient descent.

**Why sigmoid suffers from vanishing gradients:**

In the backpropagation algorithm, the derivative of sigmoid is

$$\sigma'(x) = \sigma(x)\,\bigl(1 - \sigma(x)\bigr), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}$$

[Figure: the sigmoid function and its derivative]

**The derivative peaks at 0.25 at x = 0 and quickly approaches 0 as x moves away from 0, which easily causes “gradient vanishing”.**
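A small numeric sketch of this effect uses the standard identity sigmoid'(x) = sigmoid(x)(1 − sigmoid(x)). The per-layer bound ignores the weight terms in the chain rule, which is a simplifying assumption made only for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# The derivative is largest at x = 0 and shrinks quickly away from it.
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x={x:4.1f}  sigmoid'(x)={sigmoid_grad(x):.6f}")

# Upper bound on the product of activation derivatives across n sigmoid
# layers: each layer contributes a factor of at most 0.25.
for n in (1, 5, 10):
    print(f"layers={n:2d}  bound={0.25 ** n:.2e}")
```

Even in the best case the activation-derivative factor drops below one thousandth after five layers, which is consistent with the rule of thumb that deep sigmoid networks train poorly.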

**Solutions to gradient vanishing:** Generally speaking, a sigmoid network starts to show gradient vanishing within 5 layers. The problem still exists, but it has been effectively alleviated by newer techniques such as layer-wise pre-training in DBNs, layer-by-layer normalization with batch normalization, and Xavier and MSRA weight initialization.

**Why pursue zero-centered outputs:** the closer the output mean is to 0, the closer SGD gets to the natural gradient (a second-order optimization technique), which reduces the number of iterations required.

### Tanh function

The hyperbolic tangent function takes values in (-1, 1).

Tanh works well when the differences between features are pronounced, and it keeps amplifying those differences over successive iterations.

**Advantages:**

- It converges faster than the sigmoid function.
- Unlike the sigmoid function, its output is centered on 0.

**Disadvantages:**

It does not fix the biggest problem of the sigmoid function: the gradient still vanishes because of saturation.
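Both properties follow from the fact that tanh is a rescaled and shifted sigmoid, tanh(x) = 2·sigmoid(2x) − 1. The sketch below checks that identity on a few sample inputs (chosen symmetrically about zero, an arbitrary choice), compares the output means, and shows that the derivative 1 − tanh²(x) still collapses for large inputs.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2

xs = [-3.0, -1.0, 0.0, 1.0, 3.0]  # inputs symmetric about zero

# tanh is a rescaled, shifted sigmoid: tanh(x) = 2 * sigmoid(2x) - 1
for x in xs:
    assert abs(math.tanh(x) - (2.0 * sigmoid(2.0 * x) - 1.0)) < 1e-12

mean_tanh = sum(math.tanh(x) for x in xs) / len(xs)
mean_sigmoid = sum(sigmoid(x) for x in xs) / len(xs)
print(mean_tanh, mean_sigmoid)           # about 0 versus about 0.5
print(tanh_grad(0.0), tanh_grad(5.0))    # gradient shrinks for large |x|
```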

### ReLU function

**Advantages:**

ReLU needs no exponential operation, so it is cheap to compute, and it does not saturate for x > 0. With ReLU, SGD converges much faster than with sigmoid or tanh.

**Disadvantages:**

- Training is very “fragile”: as training progresses, **neurons may die so that their weights can no longer be updated**. For example, suppose a very large gradient flows through a ReLU neuron. After the parameter update, the neuron may no longer activate on any data, so its gradient stays 0 from then on; that is, the ReLU neuron dies irreversibly during training. If the learning rate is large, as many as 40% of the neurons in the network may be dead.
- **ReLU's output is shifted**: the mean of its output is always greater than zero. Together with neuron death, this shift affects the convergence of the network.
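The neuron-death scenario can be reproduced with a single neuron. In the sketch below, the inputs, the learning rate of 0.1, and the size of the "oversized" bias update are all hypothetical values chosen for illustration, and the loss is simply the sum of the outputs.

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

inputs = [0.2, 0.5, 1.0, 1.5]  # hypothetical training inputs
w, b = 1.0, 0.0                 # a single neuron: y = relu(w * x + b)

# Pretend one oversized update pushed the bias far into the negative range.
b -= 10.0

for step in range(3):
    # Gradient of sum(y) with respect to w and b over the whole data set.
    grad_w = sum(relu_grad(w * x + b) * x for x in inputs)
    grad_b = sum(relu_grad(w * x + b) for x in inputs)
    w -= 0.1 * grad_w
    b -= 0.1 * grad_b
    outputs = [relu(w * x + b) for x in inputs]
    print(step, grad_w, grad_b, outputs)
```

Every pre-activation is negative after the bad update, so both gradients are zero on every step and the parameters never move again: the neuron outputs 0 for all inputs and cannot recover through gradient descent.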

Many improved variants have since been proposed to address ReLU's defects (Leaky ReLU, PReLU, Maxout); they are not covered in detail here.

## Extension: saturation

Professor Bengio defines an activation function whose derivative equals 0 only in the limit as a **soft-saturating activation function**. Correspondingly, if there is a constant c such that h′(x) = 0 for every x > c, the function is called **right hard-saturating**; if there is a constant c such that h′(x) = 0 for every x < c, it is called **left hard-saturating**. An activation function that is both left and right hard-saturating is called **hard-saturating**.
