# NLP Tutorial (3) – Neural Networks and Backpropagation

Time: 2022-11-19

This series is a full set of study notes for the Stanford CS224n course "Natural Language Processing with Deep Learning"; the corresponding course videos can be found here.

ShowMeAI has produced Chinese translations and annotations for all of the CS224n courseware, and turned them into GIF animations! See Lecture 3 – Advanced Word Vectors and Lecture 4 – Neural Network Backpropagation and Computational Graphs for the annotated courseware to study along with. For more information, see the end of the article.

## Introduction

CS224n is a professional course on deep learning and natural language processing produced by Stanford, a top university. Its core content covers RNNs, LSTMs, CNNs, Transformers, BERT, question answering, summarization, text generation, language models, reading comprehension, and other cutting-edge topics.

This set of notes introduces single-layer and multi-layer neural networks and how to use them for classification. We then discuss how to train them with a distributed gradient technique known as backpropagation, and see how the chain rule lets us perform parameter updates step by step. After a rigorous mathematical discussion of neural networks, we cover some practical tips and tricks for training them, including: neuron units (nonlinearities), gradient checks, Xavier parameter initialization, learning rates, AdaGrad, and more. Finally, we will motivate the use of recurrent neural networks as language models.

### Content points

• Neural networks
• Backpropagation
• Neurons
• Hinge loss
• Xavier parameter initialization
• Learning rate

## 1. Neural Network Basics

(For this section you can also refer to ShowMeAI's summary articles of Andrew Ng's course: Deep Learning Tutorial | Neural Network Basics, Deep Learning Tutorial | Shallow Neural Networks, and Deep Learning Tutorial | Deep Neural Networks.)

In the previous discussion we argued that a nonlinear classifier is needed because most data are not linearly separable; otherwise the performance of a linear classifier on such data is limited. A neural network is a class of classifiers with nonlinear decision boundaries, as shown in the figure below. We can clearly see the nonlinear decision boundary in the figure; let's see how the model learns it.

Neural networks are biologically inspired classifiers, which is why they are often called “artificial neural networks” to distinguish them from organic ones. In reality, however, human neural networks are far more capable and complex than artificial ones, so it’s usually best not to draw too many similarities between the two.

### 1.1 A single neuron

A neuron is a general-purpose computational unit that takes $$n$$ inputs and produces an output. Different neurons will have different outputs according to their different parameters (generally considered as neuron weights).

A common choice of neuron is the $$sigmoid$$, or "binary logistic regression", unit. This kind of neuron takes an $$n$$-dimensional vector as input and computes an activation scalar (output) $$a$$. The neuron is also associated with an $$n$$-dimensional weight vector $$w$$ and a bias scalar $$b$$.

The output of this neuron is:

$$a=\frac{1}{1+exp(-(w^{T}x+b))}$$

We can also combine the weight and bias terms in the above formula:

$$a=\frac{1}{1+exp(-[w^{T}\;\;b]\cdot [x\;\;1])}$$

The visualization of the above formula is shown in the figure below:

❐ Neurons are the basic building blocks of neural networks. We will see that a neuron can be one of many functions that allow non-linearities to accumulate in the network.
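As a concrete illustration, here is a minimal NumPy sketch of a single sigmoid neuron (the names `sigmoid_neuron`, `w`, `b`, `x` are ours, chosen to match the formula above; the numbers are purely illustrative):

```python
import numpy as np

def sigmoid_neuron(x, w, b):
    """Compute a = 1 / (1 + exp(-(w^T x + b))) for one neuron."""
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))

x = np.array([1.0, 2.0, -1.0])   # n-dimensional input
w = np.array([0.5, -0.25, 0.1])  # n-dimensional weight vector
b = 0.1                          # bias scalar
a = sigmoid_neuron(x, w, b)      # activation scalar in (0, 1)
print(a)                         # here w.x + b = 0, so a = 0.5
```

The output is always squashed into $$(0, 1)$$, which is what makes the unit usable as a soft binary decision.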

### 1.2 Single layer neural network

We extend the above idea to multiple neurons, considering the input $$x$$ as the input of multiple such neurons, as shown in the figure below.

If we define the weights of the different neurons as $$\{w^{(1)}, \cdots ,w^{(m)}\}$$, the biases as $$\{b_1, \cdots ,b_m\}$$, and the corresponding activation outputs as $$\{a_1, \cdots ,a_m\}$$, then:

$$a_{1} =\frac{1}{1+exp(-(w^{(1)T}x+b_1))}$$

$$\vdots$$

$$a_{m} =\frac{1}{1+exp(-(w^{(m)T}x+b_m))}$$

Let’s define a simplified formula to better express complex networks:

$$\sigma(z) = \begin{bmatrix} \frac{1}{1+exp(-z_1)} \\ \vdots \\ \frac{1}{1+exp(-z_m)} \end{bmatrix}$$

$$b = \begin{bmatrix} b_{1} \\ \vdots \\ b_{m} \end{bmatrix} \in \mathbb{R}^{m}$$

$$W = \begin{bmatrix} -\;\;w^{(1)T}\;\;- \\ \vdots \\ -\;\;w^{(m)T}\;\;- \end{bmatrix} \in \mathbb{R}^{m\times n}$$

We can now write the output of the scaling and the biases as:

$$z=Wx+b$$

The activation function sigmoid can be changed into the following form:

$$\begin{bmatrix} a_{1} \\ \vdots \\ a_{m} \end{bmatrix} = \sigma(z) = \sigma(Wx+b)$$

So what is the role of these activations? We can think of these activations as indicators that some weighted feature combination exists. We can then use combinations of these activations to perform classification tasks.
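The layer computation above can be sketched directly in NumPy (shapes follow the definitions in the text: `W` is $$m \times n$$, `b` has length $$m$$; the random values are only illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(x, W, b):
    """a = sigmoid(W x + b): m activations from an n-dimensional input."""
    return sigmoid(W @ x + b)

rng = np.random.default_rng(0)
n, m = 4, 3                      # input dimension, number of neurons
x = rng.normal(size=n)
W = rng.normal(size=(m, n))      # one weight row per neuron
b = np.zeros(m)
a = layer_forward(x, W, b)
print(a.shape)                   # (3,): one activation per neuron
```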

### 1.3 Feed-forward computation

So far we know that an input vector $$x\in \mathbb{R}^{n}$$ can be transformed by a layer of sigmoid units to obtain the activation output $$a\in \mathbb{R}^{m}$$. But what is the intuition for doing so? Let us consider a named entity recognition (NER) problem in NLP as an example:

Museums in Paris are amazing

Here we want to decide whether the center word Paris is a named entity. In this case, we most likely want to capture not only the word vectors of the words in the window but also some other interactions between the words for the classification. For example, perhaps only when Museums is the first word and in is the second word is Paris a named entity. Such nonlinear decisions usually cannot be captured by feeding the input directly to a Softmax function; instead we need to add an intermediate neural network layer before scoring. We can therefore use another matrix $$\mathbf{U} \in \mathbb{R}^{m \times 1}$$ together with the activation output to compute the unnormalized score for the classification task:

$$s=\mathbf{U}^{T}a=\mathbf{U}^{T}f(Wx+b)$$

Among them, $$f$$ is the activation function (such as the sigmoid function).

Dimensional analysis: If we use a $$4$$ dimensional word vector to represent each word, and use a window of $$5$$ words, then the input is $$x\in \mathbb{R}^{20}$$ . If we use $$8$$ sigmoid units in the hidden layer and generate a score output from the activation function, where $$W\in \mathbb{R}^{8\times 20}$$ , $$b\in \mathbb{R}^{8}$$ , $$U\in \mathbb{R}^{8\times 1}$$ , $$s\in \mathbb{R}$$ .
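The dimensional analysis can be checked mechanically with a quick sketch (random values, shapes exactly as in the text):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=20)              # 5-word window of 4-dim vectors -> R^20
W = rng.normal(size=(8, 20))         # hidden layer: 8 sigmoid units
b = rng.normal(size=8)
U = rng.normal(size=(8, 1))          # scoring vector

f = lambda z: 1.0 / (1.0 + np.exp(-z))   # sigmoid activation
a = f(W @ x + b)                         # a in R^8
s = (U.T @ a).item()                     # unnormalized scalar score
print(a.shape, type(s))                  # (8,) <class 'float'>
```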

### 1.4 Hinge loss

Like many machine learning models, neural networks need an optimization objective function, a measure of error that we want to minimize or maximize. Here we discuss a commonly used objective: the maximum-margin objective function. The idea behind this objective is to ensure that the score computed for "true" labeled data is higher than the score computed for "false" labeled data.

Going back to the previous example, let the score computed for the "true" labeled window Museums in Paris are amazing be $$s$$, and let the score computed for the "false" labeled window Not all museums in Paris be $$s_c$$ (the subscript $$c$$ indicates that the window is corrupt).

Then our objective is to maximize $$(s-s_c)$$ or minimize $$(s_c-s)$$. However, we modify the objective so that error is only accumulated when $$s_c > s \Rightarrow (s_c-s) > 0$$. The intuition is that we only care that the "correct" data point scores higher than the "wrong" data point; the rest does not matter. Therefore, the error is $$(s_c-s)$$ when $$s_c > s$$ and 0 otherwise, and our optimization objective is now:

$$minimize\;J=max\,(s_c-s,0)$$

However, the optimization objective function above is risky because it cannot create a safe margin. We want the “true” data to have a score greater than some positive interval $$\Delta$$ than the “false” data. In other words, we want the error to be calculated when $$(s-s_c < \Delta)$$, not when $$(s-s_c < 0)$$. Therefore, we modify the optimization objective function as:

$$minimize\;J=max\,(\Delta+s_c-s,0)$$

We can scale this margin so that $$\Delta=1$$ and let the other parameters adapt automatically during optimization without affecting model performance. (For more on hinge loss and the max-margin problem, you can read the explanation of the SVM algorithm in ShowMeAI's Machine Learning Algorithms Tutorial.) Finally, we define the optimization objective over all training windows as:

$$minimize\;J=max\,(1+s_c-s,0)$$

According to the above formula:

$$s_c=\mathbf{U}^{T}f(Wx_c+b)$$

$$s=\mathbf{U}^{T}f(Wx+b)$$
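A minimal sketch of the max-margin objective for one true/corrupt window pair, with scores computed as in the two formulas above (function names and values are ours, purely illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def score(x, W, b, U):
    """s = U^T f(Wx + b), the unnormalized window score."""
    return (U.T @ sigmoid(W @ x + b)).item()

def hinge_loss(s, s_c, delta=1.0):
    """J = max(delta + s_c - s, 0): zero once the true window wins by delta."""
    return max(delta + s_c - s, 0.0)

rng = np.random.default_rng(2)
W, b, U = rng.normal(size=(8, 20)), rng.normal(size=8), rng.normal(size=(8, 1))
x_true, x_corrupt = rng.normal(size=20), rng.normal(size=20)
J = hinge_loss(score(x_true, W, b, U), score(x_corrupt, W, b, U))
print(J)   # non-negative; exactly 0 when the margin is already satisfied
```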

❐ The maximum-margin objective function is often used with support vector machines

### 1.5 Backpropagation (single-sample form)

In the previous section we introduced the hinge loss. Now let's discuss how to train the model's parameters when the loss $$J$$ is positive (if the loss is $$0$$, no parameter update is needed). We generally use gradient descent (or a variant such as SGD) to update the parameters, so we need the gradient information for every parameter that appears in the update formula:

$$\theta^{(t+1)}=\theta^{(t)}-\alpha\nabla_{\theta^{(t)}}J$$
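The update rule itself is a one-liner; here is a hedged sketch on a toy quadratic loss (the loss and its gradient are our own illustrative example, not from the notes):

```python
import numpy as np

def sgd_step(theta, grad, alpha=0.1):
    """theta^{t+1} = theta^{t} - alpha * grad_theta J(theta^{t})."""
    return theta - alpha * grad

# illustrative example: J(theta) = ||theta||^2 / 2, whose gradient is theta
theta = np.array([1.0, -2.0])
for _ in range(100):
    theta = sgd_step(theta, theta)   # gradient of this toy loss is theta itself
print(theta)                         # shrinks toward the minimum at 0
```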

Backpropagation is a method that uses the differential chain rule to compute the gradient of loss for any parameter on the model. In order to further understand backpropagation, let’s look at a simple network in the figure below:

Here we use a neural network with only a single hidden layer and a single output unit. Now let’s set up some symbol definitions first:

• $$x_i$$ is the input of the neural network
• $$s$$ is the output of the neural network
• The neurons in each layer (including the input and output layers) receive an input and produce an output. The $$j$$th neuron of the $$k$$th layer receives a scalar input $$z_j^{(k)}$$ and produces a scalar activation output $$a_j^{(k)}$$
• We call the backpropagated error computed at $$z_j^{(k)}$$ $$\delta_j^{(k)}$$
• Layer 1 is the input layer, not the first hidden layer. For the input layer, $$x_j=z_j^{(1)}=a_j^{(1)}$$
• $$W^{(k)}$$ is the transfer matrix that maps the output of layer $$k$$ to the input of layer $$k+1$$. In this new notation, the example in Section 1.3 above has $$W^{(1)}=W$$ and $$W^{(2)}=U$$

Now let's start backpropagation.

Assume that the loss $$J=(1+s_c-s)$$ is positive and we want to update the parameter $$W_{14}^{(1)}$$. Observe that $$W_{14}^{(1)}$$ participates only in the computation of $$z_1^{(2)}$$ and $$a_1^{(2)}$$. This fact is crucial to understanding backpropagation: backpropagated gradients are only affected by the values they contribute to. $$a_1^{(2)}$$ is then multiplied by $$W_1^{(2)}$$ in the forward computation to produce the score. From the max-margin loss we can see:

$$\frac{\partial J}{\partial s}=-\frac{\partial J}{\partial s_c}=-1$$

To simplify we only analyze $$\frac{\partial s}{\partial W_{ij}^{(1)}}$$ . so,

\begin{aligned} \frac{\partial s}{\partial W_{ij}^{(1)}} &= \frac{\partial W^{(2)}a^{(2)}}{\partial W_{ij}^{(1)}}=\frac{\partial W_i^{(2)}a_i^{(2)}}{\partial W_{ij}^{(1)}}=W_i^{(2)}\frac{\partial a_i^{(2)}}{\partial W_{ij}^{(1)}} \\ \Rightarrow W_i^{(2)}\frac{\partial a_i^{(2)}}{\partial W_{ij}^{(1)}} &= W_i^{(2)}\frac{\partial a_i^{(2)}}{\partial z_i^{(2)}}\frac{\partial z_i^{(2)}}{\partial W_{ij}^{(1)}} \\ &= W_i^{(2)}\frac{\partial f(z_i^{(2)})}{\partial z_i^{(2)}}\frac{\partial z_i^{(2)}}{\partial W_{ij}^{(1)}} \\ &= W_i^{(2)}f^{\prime}(z_i^{(2)})\frac{\partial z_i^{(2)}}{\partial W_{ij}^{(1)}} \\ &= W_i^{(2)}f^{\prime}(z_i^{(2)})\frac{\partial}{\partial W_{ij}^{(1)}}(b_i^{(1)}+a_1^{(1)}W_{i1}^{(1)}+a_2^{(1)}W_{i2}^{(1)}+a_3^{(1)}W_{i3}^{(1)}+a_4^{(1)}W_{i4}^{(1)}) \\ &= W_i^{(2)}f^{\prime}(z_i^{(2)})\frac{\partial}{\partial W_{ij}^{(1)}}(b_i^{(1)}+\sum_{k}a_{k}^{(1)}W_{ik}^{(1)}) \\ &= W_i^{(2)}f^{\prime}(z_i^{(2)})a_j^{(1)} \\ &= \delta_i^{(2)}\cdot a_j^{(1)} \end{aligned}

Here $$a^{(1)}$$ refers to the inputs at the input layer. We can see that the gradient computation reduces in the end to $$\delta_i^{(2)}\cdot a_j^{(1)}$$, where $$\delta_i^{(2)}$$ is essentially the backpropagated error of the $$i$$th neuron in layer 2, and $$a_j^{(1)}$$ is multiplied by $$W_{ij}$$ on its way into the $$i$$th neuron of layer 2.

Taking the figure below as an example, let's explain backpropagation from the perspective of "error sharing/distribution". We again want to update $$W_{14}^{(1)}$$:

• ① We start backpropagation with the error signal 1 at $$a_1^{(3)}$$
• ② We then multiply this error by the local gradient of the neuron that maps $$z_1^{(3)}$$ to $$a_1^{(3)}$$. In this example the gradient is exactly 1, so the error is still 1, and $$\delta_1^{(3)}=1$$
• ③ The error signal 1 has now reached $$z_1^{(3)}$$. We need to distribute it so that a "fair share" of the error reaches $$a_1^{(2)}$$
• ④ The error at $$a_1^{(2)}$$ is now $$\delta_1^{(3)}\times W_1^{(2)}=W_1^{(2)}$$ (the error signal at $$z_1^{(3)}$$ is $$\delta_1^{(3)}$$). So the error at $$a_1^{(2)}$$ is $$W_1^{(2)}$$
• ⑤ As in step ②, we move the error across the neuron that maps $$z_1^{(2)}$$ to $$a_1^{(2)}$$ by multiplying it by the local gradient, which here is $$f'(z_1^{(2)})$$
• ⑥ The error at $$z_1^{(2)}$$ is therefore $$f'(z_1^{(2)})W_1^{(2)}$$, which we define as $$\delta_1^{(2)}$$
• ⑦ Finally, we give this error its "fair share" with respect to $$W_{14}^{(1)}$$ by multiplying it by $$a_4^{(1)}$$, the input that $$W_{14}^{(1)}$$ touched in the forward computation
• ⑧ The gradient of the loss with respect to $$W_{14}^{(1)}$$ is therefore $$a_4^{(1)}f'(z_1^{(2)})W_1^{(2)}$$

Note that the results we get using this method are exactly the same as the previous differentiation method. Therefore, computing the gradient error of the corresponding parameter in the network can use either the chain rule or the error sharing and distribution method – both methods can achieve the same result, but it may be helpful to think of them in various ways.

Bias update: bias terms (such as $$b_1^{(1)}$$) are mathematically equivalent to the other weights, except that the input they multiply when computing the next layer's input $$z_1^{(2)}$$ is the constant 1. Therefore, the bias gradient of the $$i$$th neuron in layer $$k$$ is $$\delta_i^{(k)}$$. For example, if in the example above we updated $$b_1^{(1)}$$ instead of $$W_{14}^{(1)}$$, the gradient would be $$f'(z_1^{(2)})W_1^{(2)}$$.

General steps for backpropagation from $$\delta^{(k)}$$ to $$\delta^{(k-1)}$$:

• ① We have an error $$\delta_i^{(k)}$$ propagated backwards from $$z_i^{(k)}$$, as shown in the figure below

• ② We propagate this error back to $$a_j^{(k-1)}$$
• ③ Therefore, the error received in $$a_j^{(k-1)}$$ is $$\delta_i^{(k)}W_{ij}^{(k-1)}$$
• ④ However, $$a_j^{(k-1)}$$ may have been fed to multiple neurons of the next layer in the forward computation, as in the figure below. In that case the error of the $$m$$th neuron in layer $$k$$ is also backpropagated to $$a_j^{(k-1)}$$ in the same way
• ⑤ So the error received at $$a_j^{(k-1)}$$ is now $$\delta_i^{(k)}W_{ij}^{(k-1)}+\delta_m^{(k)}W_{mj}^{(k-1)}$$
• ⑥ In fact, we can write this error sum compactly as $$\sum_i\delta_i^{(k)}W_{ij}^{(k-1)}$$
• ⑦ Now that we have the correct error at $$a_j^{(k-1)}$$, we multiply it by the local gradient $$f^{\prime}(z_j^{(k-1)})$$ to propagate the error back to the $$j$$th neuron of layer $$k-1$$
• ⑧ The error reaching $$z_j^{(k-1)}$$ is therefore $$f^{\prime}(z_j^{(k-1)})\sum_i\delta_i^{(k)}W_{ij}^{(k-1)}$$

### 1.6 Backpropagation (vectorized form)

In real neural network training we usually update the weights based on a batch of samples, and the more efficient approach is vectorization. In vectorized form, we can update an entire weight matrix and bias vector in one step. Note that this is just a simple extension of the model above, which helps us understand error backpropagation at the matrix-vector level.

For a given parameter $$W_{ij}^{(k)}$$, we know that its error gradient is $$\delta_i^{(k+1)}\cdot a_j^{(k)}$$, where $$W^{(k)}$$ is the matrix that maps $$a^{(k)}$$ to $$z^{(k+1)}$$. We can therefore write the gradient error of the entire matrix $$W^{(k)}$$ as:

$$\nabla_{W^{(k)}} = \begin{bmatrix} \delta_1^{(k+1)}a_1^{(k)} & \delta_1^{(k+1)}a_2^{(k)} & \cdots \\ \delta_2^{(k+1)}a_1^{(k)} & \delta_2^{(k+1)}a_2^{(k)} & \cdots \\ \vdots & \vdots & \ddots \\ \end{bmatrix} = \delta^{(k+1)}a^{(k)T}$$

We can therefore write the gradient of the entire matrix form as the outer product of the backpropagated error vector and the forward activation output in the matrix.

Now let’s see how the error vector $$\delta^{(k+1)}$$ can be calculated.

We have from the example above

$$\delta_j^{(k)}=f^{\prime}(z_j^{(k)})\sum_i\delta_i^{(k+1)}W_{ij}^{(k)}$$

This can be simply rewritten in matrix form:

$$\delta^{(k)}=f^{\prime}(z^{(k)})\circ (W^{(k)T}\delta^{(k+1)})$$

In the formula above, the $$\circ$$ operator denotes element-wise multiplication of vectors ( $$\mathbb{R}^{N}\times \mathbb{R}^{N}\rightarrow \mathbb{R}^{N}$$ ).
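Combining the two vectorized formulas — the outer-product gradient $$\nabla_{W^{(k)}}=\delta^{(k+1)}a^{(k)T}$$ and the recursion for $$\delta^{(k)}$$ — one backward layer step can be sketched as follows (sigmoid assumed as the activation $$f$$; the function name `backward_layer` is ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backward_layer(delta_next, W, z, a):
    """One vectorized backprop step through layer k.
    delta_next: error at layer k+1; W: maps a^(k) to z^(k+1);
    z, a: pre-activation and activation at layer k."""
    grad_W = np.outer(delta_next, a)              # delta^(k+1) a^(k)T
    grad_b = delta_next                           # bias gradient is just delta
    delta = sigmoid_grad(z) * (W.T @ delta_next)  # f'(z) o (W^T delta)
    return grad_W, grad_b, delta

rng = np.random.default_rng(3)
m, n = 3, 4
W = rng.normal(size=(m, n))
z = rng.normal(size=n)
a = sigmoid(z)
delta_next = rng.normal(size=m)
grad_W, grad_b, delta = backward_layer(delta_next, W, z, a)
print(grad_W.shape, delta.shape)   # (3, 4) (4,)
```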

Computational efficiency: having explored element-wise and vectorized updates, we must recognize that vectorized operations are far more efficient in scientific computing environments such as MATLAB or Python (with the NumPy/SciPy libraries), so vectorized operations should be used in practice. We also want to reduce redundant computation in backpropagation. For example, note that $$\delta^{(k)}$$ depends directly on $$\delta^{(k+1)}$$. So after using $$\delta^{(k+1)}$$ to update $$W^{(k)}$$, we must save $$\delta^{(k+1)}$$ for computing $$\delta^{(k)}$$, and then repeat this for layers $$(k-1) \cdots (1)$$. This recursive procedure is what makes backpropagation computationally affordable.

## 2. Neural Networks: Tips and Advice

(For this section you can also refer to ShowMeAI's summary article of Andrew Ng's course: Deep Learning Tutorial | Practical Aspects of Deep Learning.)

### 2.1 Gradient checks

In the last part we introduced how to compute error gradients and parameter updates in neural networks using calculus-based methods.

Here we present a method for numerically approximating these gradients. Although it is too computationally inefficient to be used directly for training networks, this method can estimate the derivative with respect to any parameter very accurately; it can therefore serve as a useful check on the correctness of our analytic derivatives.

Given a model parameter vector $$\theta$$ and a loss function $$J$$, the numerical gradient around $$\theta_i$$ is obtained by the central difference formula:

$$f^{\prime}(\theta)\approx \frac{J(\theta^{(i+)})-J(\theta^{(i-)})}{2\varepsilon }$$

where $$\varepsilon$$ is a very small number (generally around $$10^{-5}$$). When we perturb the $$i$$th element of the parameter vector $$\theta$$ by $$+\varepsilon$$, we can compute the forward-propagation error $$J(\theta^{(i+)})$$. Similarly, when we perturb the $$i$$th element by $$-\varepsilon$$, we can compute the forward-propagation error $$J(\theta^{(i-)})$$.

Thus, computing the forward pass twice, we can estimate the gradient for any given parameter in the model. We note that the definition of the numerical gradient is very similar to that of the derivative, where, in the scalar case:

$$f^{\prime}(\theta)\approx \frac{f(x+\varepsilon)-f(x)}{\varepsilon}$$

Of course there is a slight difference: the definition above only perturbs $$x$$ in the positive direction. While it would be possible to define the numerical gradient that way, in practice the central difference formula is often more accurate and stable, since we perturb the parameter in both directions. To better approximate the derivative/slope around a point, we need to examine the behavior of the function $$f$$ both to the left and to the right of that point. Taylor's theorem can also be used to show that the central difference formula has an error proportional to $$\varepsilon^{2}$$, which is quite small, while the one-sided derivative definition is more error-prone.

Now you may be wondering, if this method is so accurate, why don’t we use it instead of backpropagation to calculate the gradient of the neural network?

• ① We need to consider efficiency – whenever we want to calculate the gradient of an element, we need to do two forward passes in the network, which is very computationally resource intensive.
• ② Many large-scale neural networks contain millions of parameters, and it is obviously not a good choice to calculate each parameter twice.
• ③ In optimization techniques such as SGD, we need to calculate the gradient through thousands of iterations, and using such methods can quickly become overwhelming.

We use gradient checking only to verify the correctness of our analytic gradients. An implementation of a gradient check is as follows:

```python
import numpy as np

def eval_numerical_gradient(f, x):
    """
    a naive implementation of the numerical gradient of f at x
    - f should be a function that takes a single argument
    - x is the point (numpy array) to evaluate the gradient at
    """

    fx = f(x)  # evaluate function value at original point
    grad = np.zeros(x.shape)
    h = 0.00001

    # iterate over all indexes in x
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])

    while not it.finished:

        # evaluate function at x+h and x-h
        ix = it.multi_index
        old_value = x[ix]
        x[ix] = old_value + h   # increment by h
        fxh_left = f(x)         # evaluate f(x + h)
        x[ix] = old_value - h   # decrement by h
        fxh_right = f(x)        # evaluate f(x - h)
        # restore to previous value (very important!)
        x[ix] = old_value

        # compute the partial derivative
        # via the central difference formula (the slope)
        grad[ix] = (fxh_left - fxh_right) / (2 * h)
        it.iternext()  # step to next dimension
    return grad
```
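For example, a self-contained gradient check on a toy loss (our own example, with a compact version of the same central-difference loop):

```python
import numpy as np

def numgrad(f, x, h=1e-5):
    """Central-difference gradient of f at x (same idea as the code above)."""
    g = np.zeros_like(x)
    for i in range(x.size):
        old = x.flat[i]
        x.flat[i] = old + h
        fp = f(x)               # f(x + h)
        x.flat[i] = old - h
        fm = f(x)               # f(x - h)
        x.flat[i] = old         # restore
        g.flat[i] = (fp - fm) / (2 * h)
    return g

f = lambda x: np.sum(x ** 2)    # toy loss with analytic gradient 2x
x = np.array([1.0, -2.0, 3.0])
err = np.max(np.abs(numgrad(f, x) - 2 * x))
print(err)                      # tiny: the analytic gradient checks out
```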

### 2.2 Regularization

Like many machine learning models, neural networks are prone to overfitting, which makes the model achieve near-perfect performance on the training set, but cannot generalize to the test set. A common way to deal with overfitting (“high variance problem”) is to use $$L2$$ regularization. We only need to add a regular term to the loss function $$J$$, and the current loss function is as follows:

$$J_{R}=J+\lambda\sum_{i=1}^{L}\left \| W^{(i)} \right \| _F$$

In the formula above, $$\left \| W^{(i)} \right \|_F$$ is the Frobenius norm of the matrix $$W^{(i)}$$ (the $$i$$th weight matrix in the network), and $$\lambda$$ is a hyperparameter controlling the relative weight of the regularization term in the loss.

❐ Definition of the Frobenius norm of a matrix $$U$$: $$\left \| U \right \|_F=\sqrt{\sum_i \sum_{l} U_{il}^{2}}$$

When we try to minimize $$J_R$$, regularization essentially penalizes weights that grow too large while optimizing the loss function (making the distribution of weight values more balanced and preventing any of the weights from becoming especially large).

Due to the quadratic nature of the Frobenius norm (computing the sum of the squares of the elements of the matrix), the $$L2$$ regularization term effectively reduces the flexibility of the model and thus reduces the possibility of overfitting.
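A sketch of the regularized loss, summing Frobenius norms over the two weight matrices from the earlier example (the function name `l2_penalty` and the numbers are ours, purely illustrative):

```python
import numpy as np

def l2_penalty(weights, lam):
    """lambda * sum of Frobenius norms of the weight matrices."""
    return lam * sum(np.linalg.norm(W, 'fro') for W in weights)

rng = np.random.default_rng(4)
W1 = rng.normal(size=(8, 20))
U = rng.normal(size=(8, 1))
J = 0.37                                   # some data loss (illustrative)
J_reg = J + l2_penalty([W1, U], lam=0.01)  # J_R = J + lambda * sum ||W||_F
print(J_reg >= J)                          # the penalty only ever adds to J
```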

Adding such a constraint can also be explained with Bayesian thinking: the regularization term amounts to placing a prior distribution on the model parameters, and optimization pulls the weights toward 0; how close they get depends on the value of $$\lambda$$. Choosing an appropriate $$\lambda$$ value is important and is done through hyperparameter tuning.

• If the value of $$\lambda$$ is too large, many weights will be close to $$0$$, the model will fail to learn anything meaningful from the training set, and performance on the training, validation and test sets will often be very poor.
• If the value of $$\lambda$$ is too small, the model will still overfit.

It should be noted that the bias terms are not regularized and are not counted in the regularization term – try to think about why.

Why is the bias term not included in the regularization term?

The bias term is just an offset in the model; it can be fit with a small amount of data, and empirically the magnitude of the bias has no significant impact on model performance, so there is no need to regularize it.

Sometimes we use other types of regularization terms, such as the $$L1$$ regularization term, which sums the absolute values of all parameter elements. However, in practice $$L1$$ regularization is rarely used, because it drives the weight parameters to be sparse. In the next section, we discuss dropout, another effective regularization method that works by randomly setting neurons to $$0$$ during forward propagation.

❐ Dropout effectively "freezes" some units by ignoring their weights at each iteration. These "frozen" units are not set to $$0$$; rather, the network treats them as $$0$$ for that iteration, and they are not updated during it.

### 2.3 Dropout (random deactivation)

Dropout is a very powerful regularization technique, first introduced by Srivastava et al. in the paper "Dropout: A Simple Way to Prevent Neural Networks from Overfitting". The figure below shows how dropout is applied to a neural network.

The idea is simple and effective: during training, in each forward/backward pass we randomly "drop" some subset of the neurons (or, equivalently, we keep each neuron alive with some probability $$p$$). Then, at test time, we use all of the neurons to make predictions.

Networks that use dropout generally learn more meaningful information from the data, are less prone to overfitting, and usually achieve higher overall performance on today's tasks. One intuition for why the technique works so well is that dropout essentially trains exponentially many smaller networks at once and averages their predictions.

In practice, the way we apply dropout is to take the output $$h$$ of each layer of neurons and keep each neuron active with probability $$p$$, setting it to $$0$$ otherwise. Then, during backpropagation, we only pass gradients back through the neurons that were active in the forward pass. Finally, at test time we use all of the neurons for the forward computation. There is one key subtlety, however: for dropout to work properly, the expected output of a neuron at test time should be roughly the same as during training – otherwise the magnitude of the outputs could differ substantially and the behavior of the network would change. Therefore, we usually need to rescale each neuron's output at test time; it is left as an exercise for the reader to work out the scale that makes the expected outputs during training and testing equal (the test-time output should be multiplied by $$p$$, i.e. divided by $$1/p$$).
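The training-time masking described above can be sketched with the common "inverted dropout" variant, which scales by $$1/p$$ during training so that no test-time rescaling is needed (a standard trick, not something specific to these notes; the function name is ours):

```python
import numpy as np

def dropout_forward(h, p, train=True, rng=None):
    """Keep each unit with probability p; inverted dropout scales by 1/p
    at training time so the expected output matches test time."""
    if not train:
        return h                       # test: use all neurons, no rescaling
    rng = rng or np.random.default_rng()
    mask = (rng.random(h.shape) < p) / p
    return h * mask                    # dropped units are exactly 0

rng = np.random.default_rng(5)
h = np.ones(100000)
out = dropout_forward(h, p=0.8, rng=rng)
print(out.mean())                      # close to 1.0: expectation preserved
```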

#### 1) Dropout content supplement

The following is adapted from "Neural Networks and Deep Learning".

• Purpose: To alleviate the overfitting problem and achieve the effect of regularization to a certain extent
• Effect: Reduce the dependence of the lower layer nodes on it, forcing the network to learn more robust features

#### 2) Interpretation of ensemble learning

Each discard is equivalent to sampling a sub-network from the original network. If a neural network has $$n$$ neurons, then a total of $$2^n$$ sub-networks can be sampled.

Each iteration is equivalent to training a different sub-network, which all share the parameters of the original network. Then, the final network can be approximated as a combined model integrating exponentially different networks.

#### 3) Interpretation of Bayesian Learning

Dropout can also be interpreted as an approximation to Bayesian learning. Let $$y=f(\mathbf{x}, \theta)$$ denote the neural network to be learned. Bayesian learning assumes the parameter $$\theta$$ is a random vector with prior distribution $$q(\theta)$$, and the Bayesian prediction is:

\begin{aligned} \mathbb{E}_{q(\theta)}[y] &=\int_{q} f(\mathbf{x}, \theta) q(\theta) d \theta \\ & \approx \frac{1}{M} \sum_{m=1}^{M} f\left(\mathbf{x}, \theta_m\right) \end{aligned}

where $$f(\mathbf{x}, \theta_m)$$ is the network after the $$m$$th application of dropout, and its parameter vector $$\theta_m$$ is one sample of the full parameter set $$\theta$$.

#### 4) Variational Dropout (Variational Dropout) in RNN

Dropout generally randomly discards neurons, but it can also be extended to randomly discard connections between neurons, or randomly discard each layer.

In an RNN, the hidden state at each time step cannot simply be dropped at random, as this would damage the network's memory along the time dimension. A simple approach is to apply random dropping only to the non-temporal connections (i.e. the non-recurrent connections). As shown in the figure, dashed lines represent random dropping, and different colors represent different dropout masks.

However, under the Bayesian-learning interpretation, dropout is a form of sampling of the parameters $$\theta$$, and each sampled parameter set must remain the same across all time steps. Therefore, when applying dropout to a recurrent neural network, we should randomly drop each element of the parameter matrices and use the same dropout mask at every time step. This method is called variational dropout.

The figure below gives an example of the variational dropout method, and the same color indicates that the same dropout mask is used.

### 2.4 Neuron activation function

The neural networks we have seen above all use the sigmoid activation function for nonlinear classification. But in many applications, better networks can be designed with other activation functions. Below we list some common activation functions and their gradients; any of them can be used in place of the sigmoid discussed above.

#### 1) Sigmoid

This is a common choice we have discussed, the activation function $$\sigma$$ is:

$$\sigma(z)=\frac{1}{1+exp(-z)}$$

Among them $$\sigma(z)\in (0,1)$$

The gradient of $$\sigma(z)$$ is:

$$\sigma^{\prime}(z)=\frac{exp(-z)}{(1+exp(-z))^{2}}=\sigma(z)(1-\sigma(z))$$

#### 2) tanh

The tanh function is an alternative to the sigmoid function, which can converge faster in practice. The main difference between tanh and sigmoid is that the output range of tanh is from -1 to 1, while the output range of sigmoid is from 0 to 1.

$$tanh(z)=\frac{exp(z)-exp(-z)}{exp(z)+exp(-z)}=2\sigma(2z)-1$$

Among them $$tanh(z)\in (-1, 1)$$

The gradient of $$tanh(z)$$ is:

$$tanh^{\prime}(z)=1-\bigg(\frac{exp(z)-exp(-z)}{exp(z)+exp(-z)}\bigg)^{2}=1-tanh^{2}(z)$$

#### 3) hard tanh

The hardtanh function is sometimes preferred over the tanh function because it is cheaper to compute. However, it saturates when the magnitude of $$z$$ exceeds 1 (it stays constant at $$-1$$ or $$1$$, as shown in the figure below).

The hardtanh activation function is:

\begin{aligned} hardtanh(z) = \begin{cases} -1& :z<-1\\ z & :-1\le z \le 1 \\ 1 & :z>1 \end{cases} \end{aligned}

The differentiation of the hardtanh function can also be expressed in the form of a piecewise function:

\begin{aligned} hardtanh ^{\prime}(z) &= \begin{cases} 1 & :-1\le z \le 1 \\ 0 & :otherwise \end{cases} \end{aligned}

#### 4) soft sign

The soft sign function is another non-linear activation function that can be an alternative to tanh because it does not saturate prematurely like hard clipped functions:

$$softsign(z)=\frac{z}{1+ \left | z \right |}$$

The derivative of the soft sign function is:

$$softsign^{\prime}(z)=\frac{1}{(1+ \left | z \right |)^{2}}$$

which is always positive and decays quadratically as $$\left | z \right |$$ grows.
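A small numerical check (illustrative, not from the notes) that the derivative of $$z/(1+\left | z \right |)$$ is $$1/(1+\left | z \right |)^{2}$$; the looser tolerance accommodates the kink at $$z=0$$:

```python
import numpy as np

def softsign(z):
    return z / (1.0 + np.abs(z))

def softsign_grad(z):
    return 1.0 / (1.0 + np.abs(z)) ** 2

z = np.linspace(-4.0, 4.0, 17)
h = 1e-5
numeric = (softsign(z + h) - softsign(z - h)) / (2.0 * h)
print(np.allclose(numeric, softsign_grad(z), atol=1e-4))  # True
```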

#### 5) ReLU

The ReLU (Rectified Linear Unit) is a common choice of activation function; it does not saturate even when $$z$$ is very large, and it has seen great success in computer vision applications:

$$rect(z)=max(z,0)$$

The differentiation of the ReLU function is a piecewise function:

\begin{aligned} rect^{\prime}(z) &= \begin{cases} 1 & :z > 0 \\ 0 & :otherwise \end{cases} \end{aligned}
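ReLU and its subgradient are one-liners in NumPy (an illustrative sketch; taking the derivative at $$z=0$$ to be $$0$$ is a common convention):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_grad(z):
    # Subgradient convention: 0 at z == 0
    return (z > 0).astype(float)

z = np.array([-2.0, -0.1, 0.0, 0.1, 2.0])
print(relu(z))       # negative entries zeroed out
print(relu_grad(z))  # 1 where z > 0, else 0
```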

#### 6) Leaky ReLU

When $$z$$ is less than $$0$$, a traditional ReLU unit does not backpropagate any error. The leaky ReLU improves on this: when $$z$$ is less than $$0$$, a small error signal is still backpropagated.

$$leaky(z)=max(z, k\cdot z)$$

where $$0<k<1$$. The differential of the leaky ReLU function is also a piecewise function:

\begin{aligned}
leaky ^{\prime} (z) &=
\begin{cases}
1 & :z > 0 \\
k & :otherwise
\end{cases}
\end{aligned}

### 2.5 Data preprocessing

As is generally the case with machine learning models, a critical step in ensuring that the model achieves reasonable performance on the task at hand is to perform basic preprocessing on the data. Some common techniques are outlined below.

#### 1) Mean subtraction

Given a set of input data $$X$$, the mean feature vector of $$X$$ is usually subtracted from every value in $$X$$ to zero-center the data. In practice it is important that the mean is computed only on the training set, and that this same mean is subtracted from the training, validation, and test sets.

#### 2) Normalization

Another common technique (although less common than mean subtraction) is to scale every input feature dimension so that all dimensions have a similar range of magnitudes. This is useful because input features are often measured in different “units”, but we often want to initially treat all features as equally important. It is achieved by dividing each feature by its standard deviation computed on the training set.

#### 3) Whitening

Compared with the two methods above, whitening is less commonly used. It transforms the data so that the features are decorrelated and all have the same variance (the covariance matrix becomes the identity matrix). First, mean subtraction is applied to the data to obtain $$X^{\prime}$$. Then we perform singular value decomposition on $$X^{\prime}$$ to get the matrices $$U$$, $$S$$, $$V$$, and compute $$UX^{\prime}$$ to project $$X^{\prime}$$ onto the basis defined by the columns of $$U$$. Finally, we scale the data appropriately by dividing each dimension of the result by the corresponding singular value in $$S$$ (if a singular value is zero, we divide by a small number instead).

### 2.6 Parameter initialization

A critical step in getting the best performance out of a neural network is to initialize the parameters in a sensible way.
A good way to start is to initialize the weights to small random numbers distributed around 0; this usually works well in practice. In the paper “Understanding the difficulty of training deep feedforward neural networks (2010)”, Xavier Glorot and Yoshua Bengio study the effect of different weight and bias initialization schemes on training dynamics. Their experimental results show that for sigmoid and tanh activation units, faster convergence and lower error are achieved when a weight matrix $$W\in \mathbb{R}^{n^{(l+1)}\times n^{(l)}}$$ is randomly initialized from the following uniform distribution:

$$
W\sim U\bigg[-\sqrt{\frac{6}{n^{(l)}+n^{(l+1)}}},\sqrt{\frac{6}{n^{(l)}+n^{(l+1)}}}\;\bigg]
$$

Where $$n^{(l)}$$ is the number of input units of $$W$$ ($$fan\text{-}in$$) and $$n^{(l+1)}$$ is the number of output units of $$W$$ ($$fan\text{-}out$$). In this parameter initialization scheme, the bias units are initialized to $$0$$. The approach tries to preserve the variance of activations across layers as well as the variance of backpropagated gradients; without such initialization, the gradient variance (which carries the correction information) typically decays as it is backpropagated through the layers.

### 2.7 Learning Strategies

The rate/magnitude of model parameter updates during training can be controlled with a learning rate. In the simplest gradient descent formulation, $$\alpha$$ is the learning rate:

$$
\theta^{new}=\theta^{old}-\alpha\nabla_{\theta}J_{t}(\theta)
$$

You might think that we should choose a larger value of $$\alpha$$ in order to converge faster. However, faster convergence is not guaranteed by a larger learning rate. In fact, with a very high learning rate we may find that the loss function fails to converge, because the parameter updates are so large that the model repeatedly overshoots the minimum of the convex objective, as shown in the figure below. In non-convex models (and the models we encounter are often non-convex), the outcome of a high learning rate is unpredictable, but the probability that the loss function fails to converge is very high.

A simple way to avoid a non-converging loss is to use a small learning rate and let the model iterate carefully through the parameter space. Of course, if the learning rate is too small, the loss function may not converge in a reasonable amount of time, or may get stuck at a local optimum. Therefore, like any other hyperparameter, the learning rate must be tuned effectively.

The most computationally expensive part of a deep learning system is the training phase, so some research has attempted to improve methods for setting the learning rate. For example, Ronan Collobert scales the learning rate of a weight $$W_{ij}$$ (with $$W\in \mathbb{R}^{n^{(l+1)}\times n^{(l)}}$$ ) by the inverse square root of the neuron’s $$fan\text{-}in$$ $$n^{(l)}$$.

Another technique that has proven effective is called annealing: after many iterations, the learning rate is reduced. Training starts with a high learning rate so that the minimum is approached quickly; as we get closer and closer to the minimum, the learning rate is decreased so that the optimum can be located within a finer range.
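To make the overshooting behavior concrete, here is a toy example (not from the notes): minimizing $$J(\theta)=\theta^{2}$$, whose gradient descent update is $$\theta \leftarrow (1-2\alpha)\theta$$, so the iterates converge only when $$0<\alpha<1$$:

```python
# Toy illustration: gradient descent on J(theta) = theta^2.
# The update theta <- theta - alpha * 2 * theta contracts toward 0
# only when |1 - 2*alpha| < 1, i.e. 0 < alpha < 1.

def run_gd(alpha, steps=50, theta=1.0):
    for _ in range(steps):
        theta -= alpha * 2.0 * theta  # gradient of theta**2 is 2*theta
    return theta

print(abs(run_gd(alpha=0.1)))  # close to 0: converges
print(abs(run_gd(alpha=1.1)))  # huge: the iterates overshoot and diverge
```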
A common way to implement annealing is to reduce the learning rate $$\alpha$$ by a factor $$x$$ after every $$n$$ iterations. Exponential decay is also very common, where the learning rate becomes $$\alpha(t)=\alpha_0 e^{-kt}$$ after $$t$$ iterations, with $$\alpha_0$$ the initial learning rate and $$k$$ a hyperparameter. Yet another approach lets the learning rate decrease over time as:

$$
\alpha(t)=\frac{\alpha_0\tau}{max(t,\tau)}
$$

In the above scheme, $$\alpha_0$$ is a tunable parameter representing the initial learning rate, and $$\tau$$ is a tunable parameter indicating the point in time at which the learning rate should start decreasing. In practice, this method is very effective. In the next sections we discuss adaptive approaches to gradient descent that do not require manually setting the learning rate.

### 2.8 Optimizing updates with momentum (Momentum)

(For neural network optimization algorithms, you can also refer to ShowMeAI’s summary of Andrew Ng’s course: Deep Learning Tutorial | Neural Network Optimization Algorithms)

The momentum method, inspired by the study of dynamics in physics, is a variant of gradient descent that attempts a more efficient update scheme using the “velocity” of the updates. The pseudocode for the momentum update is as follows:

```python
# Computes a standard momentum update
# on parameters x
v = mu * v - alpha * grad_x
x += v
```

### 2.9 Adaptive optimization algorithms

(For neural network optimization algorithms, you can also refer to ShowMeAI’s summary of Andrew Ng’s course: Deep Learning Tutorial | Neural Network Optimization Algorithms)

AdaGrad is an implementation of standard stochastic gradient descent (SGD) with one key difference: the learning rate differs for each parameter. The learning rate of each parameter depends on that parameter’s history of gradient updates: parameters with smaller historical updates are updated faster, with a larger learning rate. In other words, parameters that have not been updated much in the past are now more likely to receive higher learning rates.

$$
\theta_{t,i}=\theta_{t-1,i}-\frac{\alpha}{\sqrt{\sum_{\tau=1}^{t}g_{\tau,i}^{2}}} g_{t,i} \quad \text{where} \quad g_{t,i}=\frac{\partial}{\partial\theta_i^{t}}J_{t}(\theta)
$$


In this technique, we see that if the historical RMS of the gradient is low, then the learning rate can be very high. A simple implementation of this technique looks like this:

```python
# Assume the gradient dx and parameter vector x
cache += dx ** 2
x += -learning_rate * dx / np.sqrt(cache + 1e-8)
```

Other common adaptive methods are RMSProp and Adam, and their update rules are as follows:

```python
# Update rule for RMSProp
cache = decay_rate * cache + (1 - decay_rate) * dx ** 2
x += -learning_rate * dx / (np.sqrt(cache) + eps)
```

```python
# Update rule for Adam (simplified, without bias correction)
m = beta1 * m + (1 - beta1) * dx
v = beta2 * v + (1 - beta2) * (dx ** 2)
x += -learning_rate * m / (np.sqrt(v) + eps)
```
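As a toy usage example (the hyperparameter values and setup are illustrative, not from the notes), the simplified Adam rule without bias correction can be run in a loop to minimize $$f(x)=x^{2}$$:

```python
import numpy as np

# Minimize f(x) = x^2 with the simplified Adam rule (no bias correction)
x = np.array([5.0])
m = np.zeros_like(x)
v = np.zeros_like(x)
beta1, beta2 = 0.9, 0.999
learning_rate, eps = 0.01, 1e-8

for _ in range(2000):
    dx = 2.0 * x                           # gradient of x**2
    m = beta1 * m + (1 - beta1) * dx       # first-moment estimate
    v = beta2 * v + (1 - beta2) * dx ** 2  # second-moment estimate
    x += -learning_rate * m / (np.sqrt(v) + eps)

print(float(x[0]))  # ends up near the minimum at 0
```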
