Summary of network weight initialization methods (2): LeCun, Xavier and He (Kaiming)

Time: 2020-2-22

Catalog

  • Weight initialization best practices
  • Relationship between expectation and variance
  • Variance analysis of the fully connected layer
  • Initialization method under tanh
    • LeCun 1998
    • Xavier 2010
  • Initialization method under ReLU / PReLU
    • He 2015 for ReLU
    • He 2015 for PReLU
    • Implementation in Caffe
  • Summary
  • Reference resources

Blog: blog.shinelee.me | Cnblogs | CSDN

Weight initialization best practices

Forward propagation

Backward propagation

In the previous post, we saw that initializing the weights to all zeros, to a constant, or to values that are too large or too small all work poorly. What kind of initialization do we need, then?

  • Because we lack a prior on the sign and magnitude of the weights \(w\), they should be initialized near 0, but not all zeros or a constant, so there must be some randomness; that is, the mathematical expectation should be \(E(w)=0\);
  • To avoid vanishing and exploding gradients, the weights should be neither too large nor too small, so the variance \(Var(w)\) needs some control;

  • In the multilayer structure of a deep neural network, the output of each activation layer is the input of a later layer, so we want the output variances of different activation layers to be equal, i.e. \(Var(a^{[l]})=Var(a^{[l-1]})\), which also means the input variances of different activation layers are equal, i.e. \(Var(z^{[l]})=Var(z^{[l-1]})\);
  • If the activation function is ignored, forward and backward propagation can both be regarded as repeated multiplication by the weight matrix (or its transpose). If the weights are too large, the forward pass may fall into the saturation region and the backward pass may explode; if they are too small, the gradients may vanish in the backward pass. So the numerical range (variance) of the weights at initialization should take both the forward and backward processes into account (a small numerical sketch of this effect follows this list).
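As a rough numerical illustration of the last point, the sketch below (an assumed setup, not from the original post; the layer width, depth, and weight scales are arbitrary) pushes a standardized input through a stack of purely linear layers and prints how the output scale vanishes or explodes when the weight standard deviation is chosen poorly.

```python
# Toy experiment: repeated multiplication by random weight matrices, ignoring
# the activation function, with three different weight standard deviations.
import numpy as np

rng = np.random.default_rng(0)
fan_in = 512                          # width of every layer in this toy example
x = rng.standard_normal(fan_in)       # standardized input: E(x)=0, Var(x)=1

for std in (0.001, 1 / np.sqrt(fan_in), 0.5):   # too small / balanced / too large
    a = x.copy()
    for _ in range(10):               # 10 stacked linear layers, bias = 0
        W = rng.normal(0.0, std, size=(fan_in, fan_in))
        a = W @ a
    print(f"weight std {std:.4f}: output std {a.std():.3e}")
```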

The random initialization of weights can be regarded as sampling from some probability distribution; commonly used choices include the Gaussian distribution and the uniform distribution. Controlling the expectation and variance of the weights then becomes a matter of controlling the parameters of that distribution, so the weight initialization problem turns into a problem of setting the parameters of a probability distribution.

From the previous post we know that back-propagation is affected by both the weight matrices and the activation functions. So how should weight initialization adapt when the activation function differs and when the per-layer hyperparameters (the numbers of inputs and outputs) differ? The main research results are summarized below,

weight initialization

Here, \(fan\_in\) and \(fan\_out\) denote the numbers of inputs and outputs of the current fully connected layer. More precisely, \(fan\_in\) is the number of connections feeding into a node and \(fan\_out\) is the number of connections flowing out of a node, as shown in the figure below (taken from this link),

fan_in and fan_out of a node

For a convolutional layer, the weights consist of \(n\) convolution kernels of size \(c\times h \times w\). Each output neuron is then connected to \(c\times h \times w\) input neurons, i.e. \(fan\_in = c\times h \times w\), and each input neuron is connected to \(n\times h \times w\) output neurons, i.e. \(fan\_out=n\times h \times w\).
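As a small companion to the paragraph above, here is an illustrative helper (not part of the original post; the function name is made up) that derives \(fan\_in\) and \(fan\_out\) from a weight tensor's shape, using a (fan_out, fan_in) layout for fully connected weights and an (n, c, h, w) layout for convolution kernels.

```python
# Derive fan_in / fan_out from the shape of a weight tensor.
import numpy as np

def fan_in_fan_out(weight: np.ndarray) -> tuple[int, int]:
    if weight.ndim == 2:                        # fully connected: (fan_out, fan_in)
        fan_out, fan_in = weight.shape
    elif weight.ndim == 4:                      # convolution: (n, c, h, w)
        n, c, h, w = weight.shape
        fan_in, fan_out = c * h * w, n * h * w
    else:
        raise ValueError("expected a 2-D or 4-D weight tensor")
    return fan_in, fan_out

print(fan_in_fan_out(np.empty((64, 3, 3, 3))))  # conv example: fan_in=27, fan_out=576
```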

Relationship between expectation and variance

Next, we briefly review the relevant properties of expectation and variance.

For a random variable \(X\), the variance can be computed as
\[
Var(X) = E(X^2) - (E(X))^2
\]

If two random variables \(X\) and \(Y\) are independent of each other, their covariance is 0,
\[
Cov(X, Y) = 0
\]

It further follows that \(E(XY)=E(X)E(Y)\), derived as follows,
\[
\begin{align} Cov(X, Y) &= E((X-E(X))(Y-E(Y))) \\
&= E(XY)-E(X)E(Y) =0 \end{align}
\]

The variance of the sum of two independent random variables is
\[
\begin{aligned} \operatorname{Var}(X+Y) &=E\left((X+Y)^{2}\right)-(E(X+Y))^{2} \\ &=E\left(X^{2}+Y^{2}+2 X Y\right)-(E(X)+E(Y))^{2} \\ &=\left(E\left(X^{2}\right)+E\left(Y^{2}\right)+2 E(X Y)\right)-\left((E(X))^{2}+(E(Y))^{2}+2 E(X) E(Y)\right) \\ &=\left(E\left(X^{2}\right)+E\left(Y^{2}\right)+2 E(X) E(Y)\right)-\left((E(X))^{2}+(E(Y))^{2}+2 E(X) E(Y)\right) \\ &=E\left(X^{2}\right)-(E(X))^{2}+E\left(Y^{2}\right)-(E(Y))^{2} \\ &=\operatorname{Var}(X)+\operatorname{Var}(Y) \end{aligned}
\]

The variance of the product of two independent random variables is
\[
\begin{aligned} \operatorname{Var}(X Y) &=E\left((X Y)^{2}\right)-(E(X Y))^{2} \\ &=E\left(X^{2}\right) E\left(Y^{2}\right)-(E(X) E(Y))^{2} \\ &=\left(\operatorname{Var}(X)+(E(X))^{2}\right)\left(\operatorname{Var}(Y)+(E(Y))^{2}\right)-(E(X))^{2}(E(Y))^{2} \\ &=\operatorname{Var}(X) \operatorname{Var}(Y)+(E(X))^{2} \operatorname{Var}(Y)+\operatorname{Var}(X)(E(Y))^{2} \end{aligned}
\]
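As a quick numerical sanity check (not in the original post; the sample distributions are arbitrary), the snippet below estimates both sides of these two identities from independent samples.

```python
# Monte Carlo check of Var(X+Y) and Var(XY) for independent X and Y.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(1.5, 2.0, size=1_000_000)     # E(X)=1.5, Var(X)=4
Y = rng.uniform(-1.0, 3.0, size=1_000_000)   # E(Y)=1,   Var(Y)=4/3

print(np.var(X + Y), np.var(X) + np.var(Y))  # both close to 16/3
print(np.var(X * Y),                         # compare with the product formula
      np.var(X) * np.var(Y) + np.mean(X)**2 * np.var(Y) + np.var(X) * np.mean(Y)**2)
```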

Variance analysis of the fully connected layer

For a linear (fully connected) layer followed by a nonlinear activation layer, the computation is as follows, where \(z_i^{[l-1]}\) is the input to the activation function of the \(i\)-th neuron in layer \(l-1\), \(a_i^{[l-1]}\) is its output, \(w_{ij}^{[l]}\) is the weight connecting the \(i\)-th output neuron of layer \(l\) to its \(j\)-th input neuron, and \(b^{[l]}\) is the bias,
\[
\begin{align}a_i^{[l-1]} &= f(z_i^{[l-1]}) \\z_i^{[l]} &= \sum_{j=1}^{fan\_in} w_{ij}^{[l]} \ a_j^{[l-1]}+b^{[l]} \\a_i^{[l]} &= f(z_i^{[l]})\end{align}
\]

In the initialization phase, treat each weight and each input as a random variable, and make the following assumptions and inferences,

  • Each element of the network input, \(x_1,x_2,\dots\), is independent and identically distributed (i.i.d.);
  • The weights of each layer are randomly initialized, and the weights within the same layer, \(w_{i1}, w_{i2}, \dots\), are i.i.d. with expectation \(E(w)=0\);
  • The weights \(w\) of each layer are randomly initialized independently of the inputs \(a\), so their products \(w_{i1}a_1, w_{i2}a_2, \dots\) are also mutually independent and identically distributed;
  • By the formulas above, within the same layer the \(z_1, z_2, \dots\) are then i.i.d., and so are the \(a_1, a_2, \dots\).

Note that the i.i.d. assumption above holds only in the initialization phase. Once training starts, the weights are updated according to the back-propagation formulas and are no longer independent of each other.

In the initialization phase, the relationship between the variance of the input \(a\) and that of the output \(z\) is as follows (taking \(b=0\)),
\[
\begin{align}
Var(z) &=Var(\sum_{j=1}^{fan\_in} w_{ij} \ a_j) \\
&= fan\_in \times (Var(wa)) \\
&= fan\_in \times (Var(w) \ Var(a) + E(w)^2 Var(a) + Var(w) E(a)^2) \\
&= fan\_in \times (Var(w) \ Var(a) + Var(w) E(a)^2)
\end{align}
\]
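This formula can also be checked empirically. The Monte Carlo sketch below (an assumed setup, not from the original post) draws a fresh weight vector and input vector for every sample, so that both \(w\) and \(a\) behave as random variables, and compares the empirical variance of \(z\) with the prediction.

```python
# Monte Carlo check of Var(z) = fan_in * (Var(w)Var(a) + Var(w)E(a)^2), with b = 0.
import numpy as np

rng = np.random.default_rng(0)
fan_in, samples = 128, 50_000
var_w, mean_a, var_a = 0.01, 0.5, 2.0        # E(w) = 0 by construction

W = rng.normal(0.0, np.sqrt(var_w), size=(samples, fan_in))
a = rng.normal(mean_a, np.sqrt(var_a), size=(samples, fan_in))
z = (W * a).sum(axis=1)                      # z = sum_j w_j * a_j

predicted = fan_in * (var_w * var_a + var_w * mean_a**2)
print(z.var(), predicted)                    # both close to 2.88
```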

Initialization method under tanh

If the activation function is the linear identity map, i.e. \(f(x)=x\), then \(a = z\), and naturally \(E(a)=E(z)\) and \(Var(a) = Var(z)\).

Because the network input has expectation \(E(x)=0\) and the weights of every layer have expectation \(E(w) = 0\), under the independence assumption and the formula \(E(XY)=E(X)E(Y)\) we get \(E(a)=E(z)=\sum E(wa)=\sum E(w)E(a)=0\). From this,
\[
Var(a^{[l]}) = Var(z^{[l]}) = fan\_in \times Var(w) \times Var(a^{[l-1]})
\]

Further, let \(n^{[l]}\) be the number of outputs of layer \(l\) (its \(fan\_out\)); the number of inputs of layer \(l\) (its \(fan\_in\)) then equals the number of outputs of the previous layer, \(n^{[l-1]}\). The variance of the output of layer \(L\) is
\[
\begin{align}
Var(a^{[L]}) = Var(z^{[L]}) &= n^{[L-1]} Var(w^{[L]}) Var(a^{[L-1]}) \\
&=\left[\prod_{l=1}^{L} n^{[l-1]} Var(w^{[l]})\right] {Var}(x)
\end{align}
\]

In back-propagation there is a similar expression, with \(n^{[l-1]}\) replaced by \(n^{[l]}\) (that is, \(fan\_in\) replaced by \(fan\_out\)) and \(x\) replaced by the partial derivative of the loss function with respect to the network output.

So, after passing through \(t\) layers, the variances in forward propagation and in backward propagation are amplified or shrunk, respectively, by the factors
\[
\prod^{t} n^{[l-1]} Var(w^{[l]}) \\
\prod^{t} n^{[l]} Var(w^{[l]})
\]

To avoid vanishing and exploding gradients, it is best to keep these factors at 1.

Note that the above conclusion is obtained under the condition that the activation function is the identity map; the tanh activation function is approximately the identity map near 0, that is, \(\tanh(x) \approx x\).

LeCun 1998

LeCun's 1998 paper Efficient BackProp, assuming standardized inputs and the tanh activation function, sets \(n^{[l-1]}Var(w^{[l]})=1\) so that the per-layer variance in the forward pass stays unchanged during the initialization phase; the weights are sampled from the following Gaussian distribution, where for layer \(l\), \(fan\_in = n^{[l-1]}\),
\[
W \sim N(0, \frac{1}{fan\_in})
\]
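A minimal sketch of this rule in NumPy (the layer shape is an arbitrary example, not from the paper):

```python
# LeCun-style initialization: weights ~ N(0, 1/fan_in).
import numpy as np

def lecun_normal(fan_out: int, fan_in: int, rng=np.random.default_rng()):
    """Sample a (fan_out, fan_in) weight matrix from N(0, 1/fan_in)."""
    return rng.normal(0.0, np.sqrt(1.0 / fan_in), size=(fan_out, fan_in))

W = lecun_normal(256, 784)
print(W.var(), 1 / 784)   # empirical variance is close to 1/fan_in
```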

Xavier 2010

In the 2010 paper Understanding the difficulty of training deep feedforward neural networks, Xavier Glorot and Yoshua Bengio consider both the forward and the backward process, i.e. both \(fan\_in\) and \(fan\_out\); the weights are sampled from the following Gaussian distribution,
\[
W \sim N(0, \frac{2}{fan\_in + fan\_out})
\]

The paper also mentions initialization from a uniform distribution. Since the variance of a uniform distribution is related to its range by
\[
Var(U(-n, n)) = \frac{n^2}{3}
\]

setting \(Var(U(-n, n)) = \frac{2}{fan\_in + fan\_out}\) gives
\[
n = \frac{\sqrt{6}}{\sqrt{fan\_in + fan\_out}}
\]

That is, the weights can also be sampled from the following uniform distribution,
\[
W \sim U(-\frac{\sqrt{6}}{\sqrt{fan\_in + fan\_out}}, \frac{\sqrt{6}}{\sqrt{fan\_in + fan\_out}})
\]
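Both forms can be transcribed into a short sketch (illustrative shapes, not from the paper):

```python
# Xavier (Glorot) initialization in its normal and uniform forms.
import numpy as np

def xavier_normal(fan_out, fan_in, rng=np.random.default_rng()):
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_out, fan_in))

def xavier_uniform(fan_out, fan_in, rng=np.random.default_rng()):
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

for W in (xavier_normal(300, 400), xavier_uniform(300, 400)):
    print(W.var(), 2 / (400 + 300))   # both empirical variances ~ 2/(fan_in+fan_out)
```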

The effect of Xavier initialization on test error for different activation functions is shown below; curves marked with \(N\) use Xavier (normalized) initialization, and softsign is a tanh-like activation function with an improved saturation region. The gap in test error between tanh and tanh N is clearly visible in the figure.

test error

The paper contains more comparison plots of weights and gradients during training, which are not reproduced here; please refer to the paper for details.

Initialization method under ReLU / PReLU

Recall the formula derived above,
\[
Var(z)= fan\_in \times (Var(w) \ Var(a) + Var(w) E(a)^2)
\]

Since the tanh activation function is approximately the identity map near 0, we could assume \(E(a) = 0\); but the output of the ReLU activation function is greater than or equal to 0 and never negative, so the assumption \(E(a) = 0\) no longer holds.

activation functions

However, we can further deduce that,
\[
\begin{align}
Var(z) &= fan\_in \times (Var(w) \ Var(a) + Var(w) E(a)^2) \\
&= fan\_in \times (Var(w) (E(a^2) - E(a)^2)+Var(w)E(a)^2) \\
&= fan\_in \times Var(w) \times E(a^2)
\end{align}
\]

He 2015 for ReLU

For a specific layer \(l\), we have
\[
Var(z^{[l]}) = fan\_in \times Var(w^{[l]}) \times E((a^{[l-1]})^2)
\]

Suppose \(w^{[l-1]}\) is drawn from a distribution symmetric about the origin. Since \(E(w^{[l-1]}) = 0\) and \(b^{[l-1]} = 0\), the distribution of \(z^{[l-1]}\) can be taken to have expectation 0 and to be symmetric about the origin.

For a distribution symmetric about the origin, only the part greater than 0 is retained after ReLU, so
\[
\begin{align}Var(x) &= \int_{-\infty}^{+\infty}(x-0)^2 p(x) dx \\&= 2 \int_{0}^{+\infty}x^2 p(x) dx \\&= 2 E(\max(0, x)^2)\end{align}
\]

Therefore, the formula above can be rewritten as
\[
\begin{align}Var(z^{[l]}) &= fan\_in \times Var(w^{[l]}) \times E((a^{[l-1]})^2) \\&= \frac{1}{2} \times fan\_in \times Var(w^{[l]}) \times Var(z^{[l-1]}) \end{align}
\]

As before, we want this amplification factor to equal 1, i.e.
\[
\frac{1}{2} \times fan\_in \times Var(w^{[l]}) = 1 \\
Var(w) = \frac{2}{fan\_in}
\]

so the weights of each layer are initialized as
\[
W \sim N(0, \frac{2}{fan\_in})
\]

Similarly, considering the backward process, the weights can also be initialized as
\[
W \sim N(0, \frac{2}{fan\_out})
\]

The paper notes that either of the two can be used alone, because once the network architecture is fixed, the ratio between the overall amplification factors of the two choices is a constant, namely the product over layers of the ratios of \(fan\_in\) to \(fan\_out\), as explained below,

He initialization

Comparing Xavier and He initialization with the ReLU activation function, the drop in test error is shown below: for the 22-layer network, He initialization converges faster; for the 30-layer network, the error does not drop at all with Xavier initialization, while He initialization converges normally.

Xavier vs He
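As a toy illustration of this behaviour (an assumed experiment, not the one in the paper; width, depth, and batch size are arbitrary), the sketch below stacks 30 linear + ReLU layers whose weights are drawn from \(N(0, \frac{2}{fan\_in})\) and checks that the pre-activation scale stays of order one instead of exploding or vanishing.

```python
# He (fan_in) initialization keeps the signal scale stable through deep ReLU stacks.
import numpy as np

rng = np.random.default_rng(0)
width, depth, batch = 512, 30, 2_000

a = rng.standard_normal((batch, width))              # input activations
for _ in range(depth):
    W = rng.normal(0.0, np.sqrt(2.0 / width), size=(width, width))  # N(0, 2/fan_in)
    z = a @ W.T                                      # pre-activation, bias = 0
    a = np.maximum(z, 0.0)                           # ReLU
print(z.std())   # remains O(1) after 30 layers rather than exploding or vanishing
```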

He 2015 for PReLU

For the PReLU activation function, the negative part is \(f(x) = ax\), as shown on the right,

ReLU and PReLU

For PReLU, computing \(E((a^{[l-1]})^2)\) is not difficult: integrating the positive and the negative part separately gives
\[
\frac{1}{2} (1 + a^2) \times fan\_in \times Var(w^{[l]}) = 1 \\Var(w) = \frac{2}{(1 + a^2) fan\_in} \\W \sim N(0, \frac{2}{(1 + a^2) fan\_in}) \\W \sim N(0, \frac{2}{(1 + a^2) fan\_out})
\]
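A minimal sketch transcribing the \(fan\_in\) form of this rule (the slope and shapes are example values, not from the paper):

```python
# He initialization for PReLU: Var(w) = 2 / ((1 + a^2) * fan_in).
import numpy as np

def he_prelu_normal(fan_out, fan_in, a=0.25, rng=np.random.default_rng()):
    std = np.sqrt(2.0 / ((1.0 + a**2) * fan_in))     # a = 0 recovers the ReLU case
    return rng.normal(0.0, std, size=(fan_out, fan_in))

W = he_prelu_normal(256, 512, a=0.25)
print(W.var(), 2 / ((1 + 0.25**2) * 512))
```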

Implementation in Caffe

Although the paper says either \(fan\_in\) or \(fan\_out\) can be used alone, the Caffe implementation also provides an option that averages the two, as shown below; the default is \(fan\_in\).

MSRA in Caffe
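The idea behind the three variance-normalization modes can be sketched in Python as follows; this is a rough transcription of the concept (fan_in, fan_out, or their average), not Caffe's actual C++ code, and the function name is made up for illustration.

```python
# MSRA-style filler sketch with three normalization modes; fan_in is the default.
import numpy as np

def msra_like_fill(fan_out, fan_in, mode="fan_in", rng=np.random.default_rng()):
    n = {"fan_in": fan_in,
         "fan_out": fan_out,
         "average": (fan_in + fan_out) / 2.0}[mode]
    return rng.normal(0.0, np.sqrt(2.0 / n), size=(fan_out, fan_in))

for mode in ("fan_in", "fan_out", "average"):
    print(mode, msra_like_fill(128, 256, mode).var())
```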

Summary

This concludes the introduction to weight initialization methods for deep neural networks. With BN layers, weight initialization may no longer be so critical, but analyzing the classical weight initialization methods gives a deeper understanding of how neural networks operate.

Above.

Reference resources

  • cs231n-Neural Networks Part 2: Setting up the Data and the Loss
  • paper-Efficient BackProp
  • paper-Understanding the difficulty of training deep feedforward neural networks
  • paper-Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
  • wiki-Variance
  • Initializing neural networks
  • Weight Initialization in Neural Networks: A Journey From the Basics to Kaiming
  • Kaiming He initialization
  • Choosing Weights: Small Changes, Big Differences
  • Understand Kaiming Initialization and Implementation Detail in PyTorch