Introduction

There is no doubt that neural networks are the most popular machine learning technology today, so it is well worth understanding how they learn.
To understand how neural networks learn, let’s look at the following figure:
If we express each layer’s input and output values as vectors, the weights as matrices, and the biases as vectors, then the neural network above is simply the application of a sequence of vector functions. In other words, each function takes a vector as input, performs some transformation on it, and outputs the transformed vector. In the figure above, each line represents a function, which can be either a matrix multiplication plus a bias vector or an activation function. The circles represent the vectors these functions act on.
For example, we start with the input vector and feed it into the first function, which computes linear combinations of its components and outputs the resulting vector. We then feed that vector into the activation function, and so on, until we reach the last function in the sequence. The output of the last function is the prediction of the neural network.
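The forward pass described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the figure’s exact network: the layer sizes, weights, and ReLU activation are all hypothetical choices.

```python
import numpy as np

# A tiny two-layer network as a sequence of vector functions:
# a linear combination (W @ x + b), then an activation, repeated per layer.
def relu(x):
    return np.maximum(0, x)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)  # layer 1: 3 inputs -> 4 units
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)  # layer 2: 4 units -> 2 outputs

x = np.array([1.0, -0.5, 2.0])  # input vector
h = relu(W1 @ x + b1)           # first linear combination + activation
y = W2 @ h + b2                 # last function: the network's prediction
print(y.shape)                  # a 2-dimensional output vector
```

Each `@` is one of the “lines” in the figure: a function applied to the vector produced by the previous one.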
So far, we have discussed how a neural network produces an output, which is what we are interested in. We know that a neural network simply passes its input vector through a sequence of functions. But these functions depend on parameters: the weights and biases.
How can a neural network learn these parameters so that it makes good predictions?
Let’s recall what a neural network really is: just a function, a large function composed of smaller functions applied in sequence. This function has a set of parameters. At first, we don’t know what these parameters should be, so we initialize them randomly; the network will therefore give us random outputs at the start. How can we improve them? Before trying to improve them, we need a way to evaluate the network’s performance: if we have no way to measure how good or bad the model is, how can we make it better?
Therefore, we need to design a function that takes the neural network’s predictions and the true labels in the dataset as input, and outputs a number representing the network’s performance. The learning problem then becomes the optimization problem of finding the minimum (or maximum) of this function. In machine learning, this function usually measures how bad our predictions are, so it is called the loss function. Our problem is to find the network parameters that minimize the loss function.
Stochastic gradient descent algorithm
You may be good at finding the minimum of a function using calculus. For such problems, we usually take the gradient of the function, set it equal to zero, find all the solutions (also known as critical points), and choose the one that gives the smallest function value; this is the global minimum. Can we do the same to minimize our loss function? Unfortunately not. The main problem is that the loss function of a neural network is not as neat as the functions in calculus textbooks. It is an extremely complex function with thousands, hundreds of thousands, or even millions of parameters, and a closed-form solution is usually impossible to find. Such problems are instead solved by iterative methods, which do not try to find a direct solution but start from a random one and try to improve it a little at each iteration. After many iterations, we end up with a fairly good solution.
One such iterative method is gradient descent. As you may know, the gradient of a function gives the direction of steepest ascent; taking its negative gives the direction of steepest descent, i.e., the direction along which we approach a minimum as quickly as possible. So at each iteration (also called a training step), we compute the gradient of the loss function and subtract it (multiplied by a factor called the learning rate) from the old parameters to obtain the new parameters of the neural network.
In symbols, the update rule is θ_new = θ_old − η·∇L(θ_old), where θ (theta) is the vector containing all the parameters of the neural network, η is the learning rate, and L is the loss function.
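To see this update rule in action, here is a minimal sketch on a toy one-parameter loss L(θ) = (θ − 3)², whose gradient is 2(θ − 3); the learning rate and iteration count are hypothetical choices.

```python
# Gradient descent on a toy loss L(theta) = (theta - 3)**2,
# whose gradient is 2 * (theta - 3).
def grad(theta):
    return 2.0 * (theta - 3.0)

theta = 0.0            # random starting point
lr = 0.1               # learning rate (eta)
for _ in range(100):   # training iterations
    theta = theta - lr * grad(theta)   # theta_new = theta_old - lr * gradient

print(round(theta, 4))  # converges toward the minimizer 3.0
```

Each pass through the loop is one application of the update rule above.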
In standard gradient descent, the gradient is computed over the whole dataset. This is usually undesirable because the computation can be expensive. In practice, the dataset is randomly divided into chunks called batches, and an update is made for each batch. This method is called stochastic gradient descent.
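A minimal mini-batch SGD loop might look like the following sketch, fitting a linear model with an MSE gradient; the dataset, batch size, and learning rate are all hypothetical.

```python
import numpy as np

# Mini-batch SGD sketch: shuffle the dataset, split it into batches,
# and update the parameters once per batch.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # 100 samples, 3 features
y = X @ np.array([1.0, -2.0, 0.5])     # targets from a known linear rule

w = np.zeros(3)                        # parameters to learn
lr, batch_size = 0.1, 20
for epoch in range(50):
    idx = rng.permutation(len(X))      # random division into batches
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(Xb)  # MSE gradient on the batch
        w -= lr * grad                 # one update per batch

print(np.round(w, 2))                  # approaches [1.0, -2.0, 0.5]
```

The key point is that each update uses the gradient computed on one batch rather than on all 100 samples.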
The update rule above only considers the gradient computed at the current position. The trajectory of a point moving on the loss surface is therefore sensitive to any change, and sometimes we want to make it more robust. To this end, we use a concept inspired by physics: momentum. The idea is that, when updating, we also take previous updates into account, accumulating them in a variable Δθ. If more updates are made in the same direction, we move “faster” in that direction and do not change our trajectory because of small disturbances. Think of it as velocity.
With momentum, the update becomes Δθ_new = α·Δθ_old − η·∇L(θ_old) and θ_new = θ_old + Δθ_new, where α is a non-negative factor that determines how much the old updates contribute. When α is zero, we are not using momentum.
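Extending the toy gradient-descent sketch with a momentum term gives the following; α and the other constants are hypothetical choices.

```python
# Gradient descent with momentum on L(theta) = (theta - 3)**2.
# alpha controls how much of the previous update is kept
# (alpha = 0 recovers plain gradient descent).
def grad(theta):
    return 2.0 * (theta - 3.0)

theta, delta = 0.0, 0.0
lr, alpha = 0.1, 0.9
for _ in range(200):
    delta = alpha * delta - lr * grad(theta)  # accumulate the "velocity"
    theta = theta + delta                      # move by the accumulated update

print(round(theta, 4))  # approaches the minimizer 3.0
```

Because `delta` remembers earlier steps, updates pointing in a consistent direction build up speed instead of restarting from zero each iteration.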
Backpropagation algorithm
How do we compute these gradients? Recall that the neural network and the loss function are just a composition of functions. So how do we compute the partial derivatives of a composite function? With the chain rule. Let’s look at the following figure:
Suppose we want the partial derivative of the loss function with respect to the first layer’s weights: we take the partial derivative of the first linear combination with respect to its weights, then multiply by the partial derivative of the next function (the activation function) with respect to the previous function’s output, and continue this way until we multiply by the partial derivative of the loss function with respect to the last activation. What if we want the derivative with respect to the second layer’s weights? We would carry out the same process, but this time starting from the second linear combination, and all the other factors we would multiply already appeared when computing the first layer’s derivative. So, instead of computing these terms over and over again, we compute them from back to front, hence the name backpropagation.
We first compute the partial derivatives of the loss function with respect to the network’s output layer, then propagate these derivatives back toward the first layer, maintaining a running product of derivatives along the way. Note that there are two kinds of derivative. The first is the derivative of a function with respect to its input; we multiply these into the running product to track the network’s error from the output layer down to the current layer. The second kind is the derivative with respect to the parameters; we do not multiply these into the product but store them as part of the gradient, which we will later use to update the parameters.
Therefore, when we encounter a function with no learnable parameters (such as an activation function), we take only the first kind of derivative, in order to keep propagating the error backward. But when the function has learnable parameters (such as a linear combination, with its weights and biases), we take both kinds: the derivative with respect to the input, to propagate the error, and the derivatives with respect to the weights and biases, which we store as part of the gradient. We proceed this way from the last function back to the first, until there are no more parameters to learn. This is the backpropagation algorithm.
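The two kinds of derivative can be made concrete with a tiny network. The sketch below uses a hypothetical architecture (linear → sigmoid → linear, with MSE loss): “input” derivatives (`d_a1`, `d_z1`) carry the error backward, while “parameter” derivatives (`dW1`, `db1`, `dW2`, `db2`) are stored as the gradient.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)
x, y_true = np.array([0.5, -1.0, 2.0]), np.array([1.0])

# Forward pass, keeping intermediate vectors for the backward pass.
z1 = W1 @ x + b1
a1 = sigmoid(z1)
y_pred = W2 @ a1 + b2
loss = np.mean((y_pred - y_true) ** 2)

# Backward pass: propagate the error from the loss back to the first layer.
d_y = 2 * (y_pred - y_true)      # dL/dy_pred
dW2 = np.outer(d_y, a1)          # parameter derivative: stored in the gradient
db2 = d_y                        # parameter derivative for the bias
d_a1 = W2.T @ d_y                # input derivative: the error flows backward
d_z1 = d_a1 * a1 * (1 - a1)      # through the sigmoid (no parameters to store)
dW1 = np.outer(d_z1, x)          # parameter derivatives for the first layer
db1 = d_z1
```

The stored `dW1`, `db1`, `dW2`, `db2` would then be plugged into the gradient-descent update rule from the previous section.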
Softmax activation and cross entropy loss function
In the classification task, the softmax function is commonly used in the last layer.
The softmax function converts its input vector into a probability distribution. As the figure above shows, the elements of softmax’s output vector are all positive and sum to 1. When we use softmax activation, the last layer of the network has as many nodes as there are classes in the dataset, and the softmax activation gives a probability distribution over the possible classes. The network thus outputs the probability that the input vector belongs to each possible class, and we choose the class with the highest probability as the network’s prediction.
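A minimal softmax sketch (the logit values are hypothetical; subtracting the maximum is a standard trick for numerical stability and does not change the result):

```python
import numpy as np

# Softmax turns an arbitrary vector into a probability distribution:
# every output is positive and the outputs sum to 1.
def softmax(z):
    e = np.exp(z - np.max(z))  # shift by the max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])  # hypothetical output-layer values
probs = softmax(logits)
print(probs.sum())                  # 1.0
print(np.argmax(probs))             # index of the predicted class: 0
```

The `argmax` at the end is the “choose the class with the highest probability” step described above.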
When the softmax function is used as the activation of the output layer, cross-entropy is usually used as the loss function. Cross-entropy loss measures how similar two probability distributions are. We can express the true label of an input x as a probability distribution: the true class has probability 1, and all other classes have probability 0. This representation of the label is also known as one-hot encoding. We then use cross-entropy to measure how close the predicted probability distribution is to the true one.
The cross-entropy loss is L(y, ŷ) = −Σᵢ yᵢ log(ŷᵢ), where y is the one-hot encoding of the true label, ŷ (y hat) is the predicted probability distribution, and yᵢ, ŷᵢ are the elements of these vectors.
If the predicted probability distribution is close to the one-hot encoding of the true label, the loss will be close to 0. Otherwise, if they differ greatly, the loss can grow arbitrarily large.
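This behavior is easy to check numerically; the two predicted distributions below are hypothetical, and the small `eps` guards against taking log(0).

```python
import numpy as np

# Cross-entropy between a one-hot label y and a predicted distribution y_hat:
# L = -sum_i y_i * log(y_hat_i). With a one-hot y, only the log-probability
# assigned to the true class contributes.
def cross_entropy(y, y_hat, eps=1e-12):
    return -np.sum(y * np.log(y_hat + eps))  # eps avoids log(0)

y = np.array([0.0, 1.0, 0.0])                          # one-hot label: class 1
good = cross_entropy(y, np.array([0.05, 0.9, 0.05]))   # confident and correct
bad = cross_entropy(y, np.array([0.9, 0.05, 0.05]))    # confident and wrong
print(round(good, 3), round(bad, 3))                   # small loss vs. large loss
```

A prediction close to the one-hot label gives a loss near 0, while a confidently wrong prediction is penalized heavily.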
Mean square error loss function
Softmax activation and cross-entropy loss are mainly used for classification tasks, but neural networks can easily be adapted to regression tasks by choosing an appropriate loss function and activation for the last layer. For example, if instead of class labels we have a list of numbers we want to approximate, we can use the mean squared error (MSE) loss function. In that case, we usually use the identity activation (i.e., f(x) = x) at the last layer.
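For completeness, here is the regression pairing in code; the target and predicted values are hypothetical.

```python
import numpy as np

# Regression setup: identity activation f(x) = x at the last layer,
# and mean squared error as the loss.
def identity(x):
    return x

def mse(y_pred, y_true):
    return np.mean((y_pred - y_true) ** 2)

y_true = np.array([1.5, -0.3, 2.0])            # numbers we want to approximate
y_pred = identity(np.array([1.4, -0.1, 2.2]))  # raw last-layer outputs
print(round(mse(y_pred, y_true), 4))           # 0.03
```

Unlike softmax, the identity activation leaves the last layer’s outputs unconstrained, which is what a regression target requires.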
To sum up, the learning process of a neural network is just an optimization problem: we need to find the parameters that minimize the loss function. But that is not easy; whole books have been written about optimization techniques. And beyond optimization, there is also the question of choosing the right network architecture for a given task.
I hope this article helps you, and thank you very much for reading.