Let’s talk about “Mathematics” before we start deep learning

Time: 2021-01-19

Abstract: Deep neural networks are built on calculus and some statistics.

A deep neural network (DNN) is essentially a collection of connected perceptrons, where a single perceptron is a single neuron. We can think of an artificial neural network (ANN) as a system that receives a group of inputs, each fed along a weighted path. These inputs are processed, and an output is generated to perform some task. Over time, the artificial neural network "learns" by developing different paths: paths can carry different weights, and paths considered more important (that is, paths that produce more desirable results) are assigned higher weights in the model than those that produce less desirable results.
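To make this concrete, here is a minimal sketch of a single perceptron in Python; the specific weights, bias, and sigmoid activation are illustrative choices, not anything prescribed by the text above.

```python
import math

def perceptron(inputs, weights, bias):
    """A single neuron: a weighted sum of the inputs plus a bias,
    squashed by a sigmoid activation into the range (0, 1)."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 / (1 + math.exp(-z))  # sigmoid activation

# Example: three inputs fed along three weighted paths
print(perceptron([1.0, 0.5, -0.2], weights=[0.4, -0.1, 0.8], bias=0.1))
```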

In a deep neural network, if all inputs of a layer are densely connected to all of its outputs, that layer is called a dense layer. In addition, a deep neural network can contain multiple hidden layers. A hidden layer sits between the input and the output of the neural network, where an activation function transforms the incoming information. It is called a hidden layer because it cannot be observed directly from the inputs and outputs of the system. The deeper the neural network, the more information the network can extract from the data.
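As an illustration, a dense layer amounts to a matrix-vector product followed by an activation; the sketch below stacks one hidden layer between input and output. The layer sizes and random weights here are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, W, b):
    """Dense layer: every input connects to every output (Wx + b),
    followed by a ReLU activation."""
    return np.maximum(0.0, W @ x + b)

x = rng.normal(size=3)                          # 3 inputs
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # hidden layer (4 units)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)   # output layer

hidden = dense(x, W1, b1)   # hidden layer: not directly observable
output = W2 @ hidden + b2   # network output
print(output)
```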

However, although our goal is to learn as much as possible from the data, a deep learning model may suffer from overfitting. This happens when the model learns too much from the training data, including its random noise.

The model can then identify very complex patterns in the data, but this hurts its performance on new data: the noise picked up from the training data does not carry over to new or unseen data, so the model cannot generalize the patterns it has discovered. Nonlinearity is also very important in deep learning models. Although a model learns a great deal by having multiple hidden layers, applying linear forms to nonlinear problems leads to poor performance.
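This is why hidden layers apply nonlinear activation functions; without them, a stack of dense layers collapses into a single linear map. The sigmoid and ReLU below are two common choices, shown purely for illustration.

```python
import numpy as np

def sigmoid(z):
    """Smoothly squashes any real number into (0, 1)."""
    return 1 / (1 + np.exp(-z))

def relu(z):
    """Passes positive values through, zeroes out negatives."""
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), relu(z))
```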


Now the question is: how do these layers learn? Let's apply an artificial neural network to a real scenario to understand how a model is trained to achieve a goal. During the COVID-19 pandemic, many schools transitioned to virtual learning, which left some students worried about their chances of passing their courses. "Can I pass this course?" is a question any AI system should be able to answer.

For simplicity, imagine this model has only three inputs: the number of classes a student attends, the hours spent on homework, and the number of network disconnections over the course of the lessons. The output is a binary classification: the student either passes or fails the course. It's the end of the semester. Student A attended 21 classes, spent 90 hours on homework, and was disconnected from the network seven times during the semester. These inputs are fed into the model, and the output predicts that the student has a 5% chance of passing the course. A week later, the final results are announced, and Student A passes the course. So what went wrong with the model's prediction?
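Concretely, the model might compute something like the sketch below: the three inputs pass through randomly initialized weights to a sigmoid output, read as a pass probability. All the numbers here are hypothetical; the point is that untrained weights yield an arbitrary prediction.

```python
import numpy as np

def predict_pass_probability(x, W1, b1, w2, b2):
    """Forward pass: one hidden layer (ReLU) then a sigmoid output,
    interpreted as the probability of passing the course."""
    h = np.maximum(0.0, W1 @ x + b1)
    z = w2 @ h + b2
    return 1 / (1 + np.exp(-z))

x = np.array([21.0, 90.0, 7.0])  # classes attended, homework hours, disconnects
rng = np.random.default_rng(42)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
w2, b2 = rng.normal(size=4), 0.0

print(predict_pass_probability(x, W1, b1, w2, b2))  # arbitrary until trained
```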

Technically, nothing went wrong: the model worked exactly as it was built. The problem is that the model doesn't know what happened. We simply initialized some weights on the paths, but the model does not know what is right and what is wrong; therefore, the weights are not correct. That is what learning is all about: the model needs to know when it is wrong, and we achieve that by computing some form of "loss". The loss computation depends on the problem at hand, but it generally involves minimizing the difference between the predicted output and the actual output.


In the scenario above, there is only one student, so only one error term to minimize. Usually, however, that is not the case: there are many students, and many differences to minimize. Therefore, the total loss is usually computed as the average of the differences between all predictions and the actual observations.
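Written out, with $\hat{y}_i$ the prediction for student $i$, $y_i$ the actual outcome, and $\ell$ a per-example loss, the total loss over $N$ students is the average:

$$J(W) = \frac{1}{N} \sum_{i=1}^{N} \ell\left(\hat{y}_i, y_i\right)$$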

Recall that, as mentioned earlier, the loss computation depends on the problem at hand. Since our current problem is binary classification, the appropriate loss is the cross-entropy loss. The idea behind this function is that it compares the predicted distribution of students passing the course with the actual distribution, and tries to minimize the difference between the two.
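For binary outcomes, with $y_i \in \{0, 1\}$ for fail/pass and $\hat{y}_i$ the predicted pass probability, the binary cross-entropy loss takes its standard form:

$$J(W) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + \left(1 - y_i\right) \log\left(1 - \hat{y}_i\right) \right]$$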

Suppose we no longer want to predict whether a student passes, but instead want to predict their grade. Cross-entropy loss is no longer a suitable choice; the mean squared error loss is more appropriate. The idea is that it minimizes the squared difference between the actual and predicted values.
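In the same notation, with $\hat{y}_i$ now a predicted grade rather than a probability:

$$J(W) = \frac{1}{N} \sum_{i=1}^{N} \left(y_i - \hat{y}_i\right)^2$$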


Now that we know some loss functions, we can move on to loss optimization and model training. A key ingredient of a good deep neural network is an appropriate set of weights. Loss optimization means finding the set of weights $W$ that minimizes the computed loss. If there were only a single weight component, the weight and the loss could be plotted on a two-dimensional graph and the loss-minimizing weight read off directly. However, most deep neural networks have many weight components, and visualizing an $n$-dimensional graph is very difficult.

Instead, the derivative of the loss function with respect to all the weights is computed to determine the direction of steepest ascent. Knowing which way is up and which is down, the model moves downhill until it reaches a convergence point at a local minimum. Once the descent is complete, a set of optimal weights is returned; these are the weights the deep neural network should use (assuming the model was developed well).
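The update step itself is simple. Below is a sketch of gradient descent on a toy one-weight loss, where `eta` is the learning rate discussed later; the loss function and starting point are made up for illustration.

```python
def loss(w):
    """Toy loss with a minimum at w = 3."""
    return (w - 3.0) ** 2

def grad(w):
    """Derivative of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)

w, eta = 0.0, 0.1           # initial weight and learning rate
for _ in range(100):
    w -= eta * grad(w)      # step opposite the direction of steepest ascent
print(w, loss(w))           # w converges toward 3, loss toward 0
```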

The process of computing these derivatives is called back propagation, and it is essentially the chain rule of calculus. Consider a network with one hidden layer: how does a small change in the first set of weights affect the final loss? That is exactly what the derivative, or gradient, expresses. The first set of weights feeds into a hidden layer, which has its own set of weights leading to the predicted output and the loss. Therefore, the effect of a weight change on the hidden layer must also be considered. These are the only two parts of such a network; if there were more weights to consider, the process would continue by applying the chain rule from the output back to the input.
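For a network with input weights $w_1$, hidden activation $h$, output $\hat{y}$, and loss $J$, the chain rule spells this out (a standard form, not a formula from the original text):

$$\frac{\partial J}{\partial w_1} = \frac{\partial J}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial h} \cdot \frac{\partial h}{\partial w_1}$$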


Another important factor when training deep neural networks is the learning rate. As the model searches for an optimal set of weights, it must update its weights by some factor. Although this may seem trivial, choosing the factor by which the model should move is quite difficult. If the factor is too small, the model may take an extremely long time to converge, or get stuck in a point that is not the global minimum. If the factor is too large, the model may overshoot the target point entirely and then diverge.

Although a fixed learning rate may work, it is often not ideal. An adaptive learning rate reduces the chance of the problems mentioned above: the factor changes according to the current gradient, the size of the current weights, or some other signal that influences where the model should look next for the best weights.
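One simple way to make the step size adapt to the gradient is an AdaGrad-style rule, sketched below against the same toy loss as before; the constants are illustrative assumptions, and real frameworks offer this and related schemes out of the box.

```python
def grad(w):
    """Gradient of the toy loss (w - 3)^2."""
    return 2.0 * (w - 3.0)

w, eta, eps = 0.0, 1.0, 1e-8
g_sq_sum = 0.0                        # running sum of squared gradients
for _ in range(200):
    g = grad(w)
    g_sq_sum += g * g
    # AdaGrad-style: steps shrink where gradients have been large
    w -= eta / (g_sq_sum + eps) ** 0.5 * g
print(w)                              # approaches 3
```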


As we have seen, deep neural networks are built on calculus and some statistics. Working through the mathematics behind these processes is useful because it helps you understand what is really happening inside a model, which in turn leads to better models overall. And even if these concepts are not easy to digest, don't worry: most frameworks ship with tools such as automatic differentiation that handle them for you.
