**Neural network learning notes (2)**

These notes are the second part of a series on neural networks, following the author's neural network learning notes (1). They summarize common configuration choices for neural networks: (1) data preprocessing; (2) weight initialization; (3) regularization and dropout; (4) loss functions.

**1. Data preprocessing**

For neural networks, common data preprocessing methods include 0-1 normalization (standardization to zero mean and unit standard deviation), principal component analysis (PCA), and one-hot encoding of labels.

(1) 0-1 normalization: transform every dimension of the samples so that it has mean 0 and standard deviation 1. To do this, compute the mean and standard deviation of each dimension over the training data, then subtract that mean from the dimension and divide by that standard deviation. As for why the data need this processing, the author has not found a fully satisfying answer; one commonly cited reason is that features on comparable scales make gradient-based optimization better conditioned. Readers are welcome to leave a message below.

**Be careful:** When normalizing, the statistics used (such as the mean and variance) must be computed from the training set only and then applied unchanged to the validation set / test set. Do not recompute them on the validation set / test set, and do not compute them on the whole data set before partitioning it. I think this keeps the evaluation of the model's generalization ability fair and prevents information from the validation set / test set from leaking into the training process.
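The rule above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the helper names `fit_standardizer` and `standardize`, the synthetic data, and the small `eps` guard are all assumptions made for the example.

```python
import numpy as np

def fit_standardizer(X_train, eps=1e-8):
    """Compute per-feature mean and standard deviation from the training set only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0) + eps  # eps (an assumption) avoids division by zero
    return mean, std

def standardize(X, mean, std):
    """Apply previously computed training statistics to any split."""
    return (X - mean) / std

# Synthetic data standing in for a real train/test partition.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=3.0, size=(200, 4))
X_test = rng.normal(loc=5.0, scale=3.0, size=(50, 4))

mean, std = fit_standardizer(X_train)           # statistics come from training data
X_train_std = standardize(X_train, mean, std)
X_test_std = standardize(X_test, mean, std)     # test set reuses the same statistics
```

The test split is never passed to `fit_standardizer`, so nothing about it can influence the statistics used during training.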

(2) principal component analysis (PCA): for neural networks, this method is mainly used to reduce the dimensionality of the data (it can also be used for data compression). Many articles on the Internet explain the basic process of PCA, but I think many of them are not rigorous, so I recommend the explanation of PCA in this link to all readers. In short, PCA comes down to finding a matrix that projects the original data to a lower dimension and that can also restore the reduced data as closely as possible to the original. Finding this matrix can therefore be posed as an optimization problem, and solving it with basic linear algebra yields the standard PCA procedure.
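As a rough sketch of the idea, the projection matrix can be obtained from a singular value decomposition of the centered data. The function names, the choice of SVD rather than an explicit covariance eigendecomposition, and the synthetic data are assumptions for illustration only.

```python
import numpy as np

def pca_fit(X, k):
    """Return the training mean and a (D, k) matrix of the top-k principal directions."""
    mean = X.mean(axis=0)
    X_centered = X - mean
    # Rows of Vt are orthonormal directions ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    W = Vt[:k].T
    return mean, W

def pca_reduce(X, mean, W):
    """Project data onto the k principal directions."""
    return (X - mean) @ W

def pca_restore(Z, mean, W):
    """Map reduced data back to the original space (approximate reconstruction)."""
    return Z @ W.T + mean

# Synthetic 3-D data that lies almost on a 2-D plane.
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 2))
basis = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
X = latent @ basis + 0.01 * rng.normal(size=(100, 3))

mean, W = pca_fit(X, k=2)
Z = pca_reduce(X, mean, W)              # shape (100, 2)
X_restored = pca_restore(Z, mean, W)    # close to X because little variance was dropped
```

The same matrix `W` is used both to reduce and to restore, which matches the description of PCA as a single matrix that minimizes the reconstruction error.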

(3) one hot coding: in multi-class tasks, for example classifying samples with ten features into five categories, the sample data may look like this: features: [x1, x2, x3, x4, x5, x6, x7, x8, x9, x10], label: [y], where y belongs to {0, 1, 2, 3, 4} and the five numbers 0 to 4 represent five different categories. When a neural network is used for this task, the output layer usually has five nodes, one per category. To make the labels match the shape of the network output, we convert the labels of the training samples to one-hot format. In the example above, a sample labeled “4” has the one-hot code [0,0,0,0,1]. The Keras library provides the function `to_categorical`, which makes this conversion easy.
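For readers who want to see the conversion itself, here is a small NumPy sketch that avoids any Keras dependency. The helper name `one_hot` is hypothetical and the labels are made up for the example.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Convert integer class labels to one-hot rows by indexing an identity matrix."""
    labels = np.asarray(labels, dtype=int)
    return np.eye(num_classes)[labels]

y = [4, 0, 2]                      # integer labels in {0, 1, 2, 3, 4}
Y = one_hot(y, num_classes=5)      # shape (3, 5); the first row encodes label 4
```

Taking `Y.argmax(axis=1)` recovers the original integer labels, which is handy when comparing network outputs with the ground truth.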

**2. Weight initialization**

In optimization, the initial values of the parameters affect the final result, and this matters a great deal for neural network parameters as well. Below we introduce commonly used weight initialization methods and mistakes to avoid when applying neural networks (the following content is mainly taken from the cs231n course).

(1) small random number

When initializing a neural network, it is common to set the weights of the neurons to very small random numbers, which is known as “breaking symmetry”. The reasoning is that if the weights are random and distinct at the start, the neurons compute different gradients and develop into distinct parts of the whole network. A NumPy implementation for a weight matrix of shape (D, H) is: `W = 0.01 * np.random.randn(D, H)`, where `randn` samples from a Gaussian with zero mean and unit standard deviation. It should be emphasized that smaller is not necessarily better: the gradient passed back through a layer is proportional to its weights, so if the weights are too small, each parameter receives only a very small update during back propagation.

**Be careful:** Do not set all weights to 0. Every neuron would then compute the same output, so during back propagation every neuron would receive the same gradient and undergo the same weight update. In other words, if all weights are initialized to the same number, the network has no source of asymmetry.
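The symmetry problem can be checked numerically. The sketch below is an illustration under assumptions: a tiny one-hidden-layer tanh network without biases, a mean squared error loss, and the hypothetical helper `hidden_weight_grad`. It compares the gradient of the first-layer weights under constant and small random initialization.

```python
import numpy as np

def hidden_weight_grad(W1, W2, X, y):
    """Gradient of the mean squared error w.r.t. W1 for a one-hidden-layer tanh network."""
    hidden = np.tanh(X @ W1)                      # (N, H) hidden activations
    pred = hidden @ W2                            # (N, 1) network output
    d_pred = 2.0 * (pred - y) / len(X)            # derivative of the mean squared error
    d_hidden = d_pred @ W2.T                      # back propagate through the output layer
    d_pre = d_hidden * (1.0 - hidden ** 2)        # back propagate through tanh
    return X.T @ d_pre                            # (D, H): one column per hidden neuron

rng = np.random.default_rng(0)
D, H = 3, 4
X = rng.normal(size=(8, D))
y = rng.normal(size=(8, 1))

# Every weight set to the same constant (all zeros is a special case of this).
g_same = hidden_weight_grad(np.full((D, H), 0.5), np.full((H, 1), 0.5), X, y)

# Small random Gaussian weights.
W1_rand = 0.01 * rng.standard_normal((D, H))
W2_rand = 0.01 * rng.standard_normal((H, 1))
g_rand = hidden_weight_grad(W1_rand, W2_rand, X, y)

# With constant weights every hidden neuron gets the same gradient column.
same_columns_identical = np.allclose(g_same, g_same[:, :1])
rand_columns_identical = np.allclose(g_rand, g_rand[:, :1])
```

With identical gradient columns, the hidden neurons stay identical after every update, so the layer behaves like a single neuron no matter how wide it is.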

(2) bias initialization

The common way to initialize the biases is to set them to 0, because symmetry breaking is already provided by the random weights. For neurons that use ReLU as the non-linearity, some people prefer to initialize the bias to a small constant such as 0.01, because this ensures that every neuron produces a non-zero output from the start and therefore receives gradient updates from the start. However, it is not clear whether this provides a consistent performance improvement, so we usually set the bias to 0.

(3) batch normalization

This technique was proposed by Ioffe and Szegedy in 2015, and their paper explains it in great detail; here we give only a basic description. During mini-batch gradient descent, each training step processes m samples. For each neuron, the original linear activation x is transformed by subtracting the mean E[x] of that activation over the m samples in the mini-batch and dividing by its standard deviation, the square root of the variance Var[x]. The expression is $\hat{x} = \frac{x - E[x]}{\sqrt{Var[x] + \epsilon}}$, where ε is a small constant added for numerical stability. Many deep learning libraries provide an implementation of batch normalization, such as the BatchNormalization layer in Keras. In practice, batch normalization makes networks much more robust to poor initialization and helps speed up the training of neural networks.
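A minimal NumPy sketch of the training-time forward pass is given below. The learnable scale `gamma` and shift `beta` come from the original batch normalization paper rather than from these notes; the function name, the value of `eps`, and the omission of the running averages used at inference time are simplifying assumptions.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then apply a scale and a shift."""
    mu = x.mean(axis=0)                      # per-feature mean over the m samples
    var = x.var(axis=0)                      # per-feature variance over the m samples
    x_hat = (x - mu) / np.sqrt(var + eps)    # divide by the standard deviation
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=10.0, size=(32, 5))   # badly scaled activations
out = batchnorm_forward(x, gamma=np.ones(5), beta=np.zeros(5))
```

With `gamma` set to ones and `beta` to zeros, each column of `out` has approximately zero mean and unit standard deviation regardless of how the inputs were scaled.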

**3. Regularization and dropout**

Next, we introduce some common methods for reducing overfitting in neural networks, namely regularization and dropout.

(1) regularization

L2 regularization: this method reduces overfitting by adding a penalty term, the sum of the squares of all parameters, to the final loss function. As a formula, we add $\frac{1}{2} \lambda \|w\|_{2}^{2}$ to the loss function, where λ is the regularization coefficient that controls the strength of the penalty. Intuitively, this kind of regularization heavily penalizes parameters with large magnitude and favors weights that are small and spread out.

L1 regularization: the penalty term added to the loss function is the sum of the absolute values of the parameters, i.e. $\lambda \|w\|_{1}$. As a result, the model drives many parameters very close to 0, so the weight vector becomes sparse.

Max norm constraints: impose a constraint directly on the parameters, such as $\|w\|_{2} < c$ for the weight vector of each neuron, where c is a chosen constant.

It should be noted that we seldom regularize the biases of the network, although in practice regularizing the biases rarely leads to worse results.
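The three approaches can be sketched as small NumPy helpers. The function names are hypothetical, and two conventions are assumptions: the L1 penalty is written as λ times the sum of absolute values (any constant factor can be absorbed into λ), and each column of `W` is treated as the incoming weight vector of one neuron.

```python
import numpy as np

def l2_penalty(w, lam):
    """Half of lambda times the squared L2 norm of the weights."""
    return 0.5 * lam * np.sum(w ** 2)

def l1_penalty(w, lam):
    """Lambda times the sum of absolute values of the weights."""
    return lam * np.sum(np.abs(w))

def apply_max_norm(W, c):
    """Rescale each column of W so that its L2 norm is at most c."""
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))
    return W * scale

w = np.array([3.0, -4.0])
l2 = l2_penalty(w, lam=0.1)     # 0.5 * 0.1 * (9 + 16)
l1 = l1_penalty(w, lam=0.1)     # 0.1 * (3 + 4)

W = np.array([[3.0, 0.3],
              [4.0, 0.4]])
W_clipped = apply_max_norm(W, c=1.0)   # first column shrinks, second is left alone
```

The penalties are added to the data loss before computing gradients, while the max norm constraint is typically enforced by rescaling the weights after each parameter update.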

(2) dropout

Dropout is a very effective method for reducing overfitting, proposed in recent years by Srivastava et al. In short, during training the output of each neuron is kept with a certain probability p (a hyperparameter) and set to 0 otherwise. Dropout can be interpreted as sampling a smaller network from the neurons of a larger fully connected network at each training step and training only that sub-network. In the test phase no units are dropped, and the outputs are scaled so that their expected values match those seen during training, so the result of the whole network can be interpreted as the average prediction of the many sampled networks.
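Here is a minimal sketch of the mechanism using the "inverted dropout" variant, in which the kept activations are divided by p during training so that the test-time forward pass needs no change. This variant, the function names, and the toy activations are assumptions for illustration; the original formulation instead scales the outputs at test time.

```python
import numpy as np

def dropout_train(h, p, rng):
    """Keep each activation with probability p and rescale the kept ones by 1/p."""
    mask = (rng.random(h.shape) < p) / p    # entries are either 0 or 1/p
    return h * mask

def dropout_test(h):
    """At test time all units are kept; inverted dropout needs no extra scaling."""
    return h

rng = np.random.default_rng(0)
h = np.ones((1000, 100))                    # stand-in for hidden activations
h_train = dropout_train(h, p=0.5, rng=rng)  # roughly half the entries become 0
h_test = dropout_test(h)
```

Because of the 1/p rescaling, the expected value of each activation is the same in training and testing, which is what allows the full network to stand in for the average of the sampled sub-networks.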

**4. Loss function**

The loss function measures the error between the predicted value of the model and the labeled value. The loss over a whole data set is the average error over all samples. Different loss functions suit different tasks. Here we introduce the loss functions commonly used for three kinds of problems: (1) binary classification; (2) multi-class classification; (3) regression.

(1) binary classification

Hinge loss: label each sample as either -1 or 1. The expression is $L = \max(0, 1 - y \cdot l)$, where y is the predicted value and l is the labeled value. Intuitively, if the predicted value agrees with the label (both are -1 or both are 1), the function takes its small value 0; if they disagree (one is -1 and the other is 1), it takes the large value 2, which increases the error.
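The expression can be evaluated directly in NumPy. In this sketch the function name and the sample values are made up, and the loss is averaged over the samples, following the convention that the loss of a data set is the mean of the per-sample losses.

```python
import numpy as np

def hinge_loss(y_pred, label):
    """Mean hinge loss for predictions y_pred and labels in {-1, +1}."""
    return np.mean(np.maximum(0.0, 1.0 - y_pred * label))

labels = np.array([1.0, -1.0, 1.0, -1.0])

agree = hinge_loss(np.array([1.0, -1.0, 1.0, -1.0]), labels)      # every term is max(0, 0)
disagree = hinge_loss(np.array([-1.0, 1.0, -1.0, 1.0]), labels)   # every term is max(0, 2)
margin = hinge_loss(np.array([0.5, -0.5, 0.5, -0.5]), labels)     # correct sign, small margin
```

The third case shows that a prediction with the correct sign still incurs some loss until its margin reaches 1.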

Binary cross entropy: used for classification models with a single output node whose output lies in [0,1]. For such a model, the final prediction is obtained by thresholding the output. For example, if the output value is greater than the threshold 0.5, the predicted category is 1, otherwise it is 0. The expression of the loss function is $L = -l \log(y) - (1 - l) \log(1 - y)$, where l is the label value (0 or 1) and y is the output value of the model. Put simply, if the prediction is close to the label, for example a predicted value of 0.8 when the label is 1, the loss is small; if not, for example a predicted value of 0.1 when the label is 1, the loss is large.
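The two numeric cases in the text can be reproduced with a short NumPy sketch. It uses the full two-term form of binary cross entropy, -l·log(y) - (1-l)·log(1-y), so that labels of 0 are penalized as well, and it predicts class 1 when the output exceeds the threshold. The function names and the clipping constant `eps` are assumptions.

```python
import numpy as np

def binary_cross_entropy(y_pred, label, eps=1e-12):
    """Mean binary cross entropy for outputs in [0, 1] and labels in {0, 1}."""
    y = np.clip(y_pred, eps, 1.0 - eps)     # keep log() away from 0
    return np.mean(-label * np.log(y) - (1.0 - label) * np.log(1.0 - y))

def predict_class(y_pred, threshold=0.5):
    """Threshold the network output to obtain a class in {0, 1}."""
    return (y_pred > threshold).astype(int)

good = binary_cross_entropy(np.array([0.8]), np.array([1.0]))   # equals -log(0.8)
bad = binary_cross_entropy(np.array([0.1]), np.array([1.0]))    # equals -log(0.1)
classes = predict_class(np.array([0.8, 0.1]))
```

Here `good` is roughly ten times smaller than `bad`, which matches the intuition that confident wrong predictions are punished heavily.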

(2) multi-class classification

This problem refers to classifying samples into more than two categories, as in the MNIST and CIFAR-10 data sets. For this problem, a softmax layer is used at the end of the network to compress the outputs into the range [0,1] so that they sum to 1, which lets the value of each output node be interpreted as the probability that the sample belongs to the corresponding category. The cross entropy expression for the multi-class problem is $L = -\sum_{i} l_i \log(y_i)$, where $l_i$ is the i-th component of the one-hot label and $y_i$ is the i-th softmax output.
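A compact NumPy sketch of a softmax layer followed by the multi-class cross entropy, computed as the sum of -l_i·log(y_i) over the classes with one-hot labels. The function names, the max-subtraction trick for numerical stability, and the example logits are assumptions made for illustration.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp()."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_cross_entropy(probs, one_hot_labels, eps=1e-12):
    """Mean over samples of -sum_i l_i * log(y_i)."""
    return np.mean(-np.sum(one_hot_labels * np.log(probs + eps), axis=1))

logits = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])
probs = softmax(logits)                    # each row lies in [0, 1] and sums to 1
labels = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
loss = categorical_cross_entropy(probs, labels)
```

Because the labels are one-hot, only the predicted probability of the true class contributes to the loss of each sample.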

(3) regression problem

The essence of a regression problem is to fit the data so that the gap between the value predicted by the model and the labeled value is as small as possible. For this kind of problem, the common loss functions are the L2 norm and the L1 norm of the difference between the model output and the label. For the concept of a norm, refer to this article.
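Both losses are easy to write in NumPy. As is common in practice, this sketch uses the squared L2 distance averaged over the samples (mean squared error) and the L1 distance averaged over the samples (mean absolute error); the function names and the example values are assumptions.

```python
import numpy as np

def l2_loss(pred, target):
    """Mean squared error: the squared L2 distance averaged over the samples."""
    return np.mean((pred - target) ** 2)

def l1_loss(pred, target):
    """Mean absolute error: the L1 distance averaged over the samples."""
    return np.mean(np.abs(pred - target))

pred = np.array([2.5, 0.0, 2.0])
target = np.array([3.0, -0.5, 2.0])

mse = l2_loss(pred, target)    # (0.25 + 0.25 + 0) / 3
mae = l1_loss(pred, target)    # (0.5 + 0.5 + 0) / 3
```

The squared loss penalizes large errors more strongly than the absolute loss, which makes the L1 version less sensitive to outliers.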