[mindspire: machine learning with little Mi] Some suggestions on applying machine learning

Time: 2022-3-4

So far, little Mi has introduced many different learning algorithms, and you have probably become something of an expert in advanced machine learning techniques without even noticing. But understanding the algorithms is only half the story: how do you apply them efficiently and effectively? How do you choose the most promising path instead of wasting time on meaningless attempts? If you want to know, follow little Mi and keep learning~

1 machine learning diagnosis method
Let's again use the example of predicting house prices. Suppose we have implemented regularized linear regression, minimized the cost function, and obtained the learned parameters. If we then test the hypothesis on a new set of house samples and find a huge error between the predicted prices and the actual prices, what we have to do is improve the algorithm. But that raises the real question: how should we improve it?

One obvious way to improve performance is to collect more training samples, that is, more actual sales data for different houses. Unfortunately, many people spend a great deal of time collecting more training samples, convinced that twice or even ten times the data will surely solve the problem. Sometimes, however, getting more training data simply does not help.

Another option is to try a smaller set of features. If the model uses many features, such as $x_1, x_2, \dots, x_n$, you might take the time to carefully select a small subset of them to prevent overfitting. Conversely, using more features might be what helps, in which case you would want to collect more data in order to obtain those additional features. This, too, can grow into a large project, for example running telephone surveys to gather more housing cases, or conducting land surveys to get more information about each plot, so it is not a decision to take lightly. In every case we would very much like to know, before spending months of effort, whether the work will pay off. We could also try adding polynomial features, such as $x_1^2$, $x_2^2$, or the product $x_1 x_2$, or we could consider decreasing or increasing the regularization parameter.

Each of the options above can easily turn into a project of six months or more. Unfortunately, most people pick among them by gut feeling. Someone says, "let's go and find more data", and spends six months collecting it; someone else says, "well, let's extract more features from the houses we already have". Many people spend six months or longer on a method chosen almost at random, only to discover, with regret, that it was a dead end. Fortunately, there is a family of simple techniques that can give you twice the result for half the effort: they let you rule out at least half of the options on the list and keep only the promising ones, saving a great deal of unnecessary time. So, to improve the performance of a machine learning system, suppose we use a trained linear regression model to predict house prices on unseen data and find a large error. What can we do next?

1. Obtain more training samples – often effective, but costly; the methods below may work just as well and are usually worth trying first.

2. Try to reduce the number of features

3. Try to get more features

4. Try to add polynomial features

5. Try to reduce the degree of regularization

6. Try to increase the degree of regularization

Rather than randomly choosing one of the methods above, we should use machine learning diagnostics to tell us which of them are likely to help our algorithm.

In the next few sections, little Mi will first show you how to evaluate the performance of a machine learning algorithm, and then discuss these techniques, which are collectively called machine learning diagnostics. A diagnostic is a test you can run to gain insight into whether an algorithm is working, and into which attempts to improve it would actually be meaningful. Diagnostics can take time to understand and implement, but that time is well spent, because they can save months when developing a learning algorithm.

2 evaluate a hypothesis
In this section, little Mi will show you how to evaluate the hypothesis function learned by your algorithm. Later we will build on this to discuss how to avoid overfitting and underfitting.

When we fit the parameters of a learning algorithm, we choose the parameters that minimize the training error. Some people assume that a very small training error must be a good thing, but we already know that a small training error does not by itself mean a good hypothesis; we have seen how an overfitted hypothesis fails to generalize to new data. So how do we judge whether a hypothesis is overfitting? For a simple example with one feature, we can plot the hypothesis and inspect the curve, but once there is more than one feature, plotting becomes difficult or even impossible. We therefore need another way to evaluate the hypothesis. To test whether an algorithm is overfitting, we split the data into a training set and a test set, typically using 70% of the data for training and the remaining 30% for testing. It is important that both sets contain all types of data, so we usually "shuffle" the data before splitting it.
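
As a minimal sketch in NumPy (the variable names and data here are hypothetical, not from the original text), the shuffle-and-split step might look like this:

```python
import numpy as np

def train_test_split(X, y, train_frac=0.7, seed=0):
    """Shuffle the examples, then split them into a training set and a test set."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    idx = rng.permutation(m)            # random shuffle of the example indices
    m_train = int(train_frac * m)       # e.g. 70% of the data for training
    train_idx, test_idx = idx[:m_train], idx[m_train:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

# Example usage with made-up data: 100 examples, 3 features
X = np.random.randn(100, 3)
y = np.random.randn(100)
X_train, y_train, X_test, y_test = train_test_split(X, y)
```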

After learning the parameters of our model from the training set, we evaluate the model on the test set. The test error can be computed in two ways:

For the linear regression model, we compute the cost function on the test set:

$$J_{test}(\theta) = \frac{1}{2m_{test}} \sum_{i=1}^{m_{test}} \left( h_\theta(x^{(i)}_{test}) - y^{(i)}_{test} \right)^2$$

For the logistic regression model, in addition to computing the cost function on the test set, we can compute the misclassification (0/1) error for each test sample:

$$err(h_\theta(x), y) = \begin{cases} 1 & \text{if } h_\theta(x) \ge 0.5 \text{ and } y = 0, \text{ or } h_\theta(x) < 0.5 \text{ and } y = 1 \\ 0 & \text{otherwise} \end{cases}$$

and then average the results over the test set:

$$\text{Test error} = \frac{1}{m_{test}} \sum_{i=1}^{m_{test}} err\left( h_\theta(x^{(i)}_{test}), y^{(i)}_{test} \right)$$
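
A minimal sketch of both error measures in NumPy, assuming theta is the learned parameter vector and the design matrices include a bias column (these names are assumptions for illustration):

```python
import numpy as np

def linreg_test_cost(theta, X_test, y_test):
    """Squared-error cost J_test(theta) on the test set."""
    m_test = X_test.shape[0]
    residuals = X_test @ theta - y_test
    return (residuals @ residuals) / (2 * m_test)

def logreg_misclassification_rate(theta, X_test, y_test):
    """Fraction of test examples where the 0/1 prediction disagrees with the label."""
    prob = 1.0 / (1.0 + np.exp(-(X_test @ theta)))   # sigmoid hypothesis h_theta(x)
    predictions = (prob >= 0.5).astype(int)          # threshold at 0.5
    return np.mean(predictions != y_test)
```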

3 model selection and cross validation set
Suppose we want to choose among ten polynomial models of different degrees, $d = 1, 2, \dots, 10$:

$$h_\theta(x) = \theta_0 + \theta_1 x \qquad (d = 1)$$
$$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 \qquad (d = 2)$$
$$\vdots$$
$$h_\theta(x) = \theta_0 + \theta_1 x + \dots + \theta_{10} x^{10} \qquad (d = 10)$$

Obviously, a higher-degree polynomial will fit our training data better, but fitting the training data well does not mean the model will generalize; we want the model that generalizes best. To choose it, we use a cross validation set: 60% of the data is used as the training set, 20% as the cross validation set, and 20% as the test set.

The model selection procedure is as follows (a code sketch appears after the list):

1. Use the training set to train 10 models

2. For each of the 10 models, compute the cross validation error (the value of the cost function on the cross validation set)

3. Select the model with the lowest cross validation error

4. Use the model selected in step 3 to estimate the generalization error (the value of the cost function on the test set)
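
A minimal sketch of this procedure for a one-dimensional input, assuming plain least-squares fits and hypothetical variable names:

```python
import numpy as np

def poly_features(x, d):
    """Build the design matrix [1, x, x^2, ..., x^d] for a 1-D input vector x."""
    return np.column_stack([x ** p for p in range(d + 1)])

def mse(theta, X, y):
    """Mean squared error on a data set (the constant factor does not affect the choice)."""
    return np.mean((X @ theta - y) ** 2)

def select_degree(x_train, y_train, x_cv, y_cv, max_degree=10):
    """Fit one model per degree on the training set, pick the degree with the lowest CV error."""
    best_d, best_err, best_theta = None, np.inf, None
    for d in range(1, max_degree + 1):
        X_train = poly_features(x_train, d)
        theta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)  # least-squares fit
        err_cv = mse(theta, poly_features(x_cv, d), y_cv)
        if err_cv < best_err:
            best_d, best_err, best_theta = d, err_cv, theta
    return best_d, best_theta
```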

4 diagnosing bias and variance
When a learning algorithm does not perform well, it is almost always for one of two reasons: either the bias is high or the variance is high. In other words, the model is either underfitting or overfitting. Telling which of the two is happening is very important, because it guides us toward the most effective way to improve the algorithm. In this section, little Mi will explain bias and variance in detail and show how to evaluate a learning algorithm and judge whether its problem is bias or variance, because this distinction is essential to understanding how to improve the algorithm. High bias and high variance correspond essentially to underfitting and overfitting.

To aid the analysis, we usually plot the training set error and the cross validation set error against the polynomial degree d on the same chart:

For the training set, when d is small the model underfits and the error is large; as d increases, the fit improves and the training error decreases. For the cross validation set, when d is small the model also fits poorly and the error is large; as d increases, the cross validation error first decreases and then increases again, and the turning point is where the model starts to overfit the training data. So if the cross validation error is large, how do we tell whether the problem is variance or bias? From the chart above we know:

When the training set error and the cross validation set error are both large and close to each other: high bias / underfitting

When the cross validation set error is much larger than the training set error: high variance / overfitting
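
As a very rough illustration of this rule of thumb (the baseline and ratio thresholds below are hypothetical and would need tuning for a real problem):

```python
def diagnose(err_train, err_cv, baseline, gap_ratio=2.0):
    """Crude bias/variance diagnosis from training and cross validation errors.

    baseline is the error level you consider acceptable for the task;
    gap_ratio controls how much larger the CV error must be to call it variance.
    """
    if err_cv > gap_ratio * err_train:
        return "high variance / overfitting"
    if err_train > baseline:
        return "high bias / underfitting"
    return "looks fine"
```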

5 regularization and bias / variance
When training a model, we generally use regularization to prevent overfitting, but we may end up regularizing too much or too little. In other words, choosing the regularization parameter λ raises the same kind of question as choosing the polynomial degree.

For example, we might try a range of λ values such as 0, 0.01, 0.02, 0.04, 0.08, ..., roughly doubling each time up to about 10, giving 12 candidate values. As before, we split the data into a training set, a cross validation set and a test set.

The selection procedure is (a code sketch follows the list):

1. Use the training set to train 12 models with different degrees of regularization

2. For each of the 12 models, compute the cross validation error

3. Select the model with the smallest cross validation error

4. Use the model selected in step 3 to compute the generalization error on the test set. We can also plot the training set error and the cross validation set error against the value of λ on the same chart:

When λ is small, the training set error is small (overfitting), while the cross validation set error is large.

As λ increases, the training set error keeps increasing (eventually underfitting), while the cross validation set error first decreases and then increases.
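
A minimal sketch of the λ selection loop, assuming closed-form regularized (ridge) linear regression on precomputed design matrices with a bias column; all names are hypothetical:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form regularized linear regression; the bias column (first column) is not penalized."""
    n = X.shape[1]
    penalty = lam * np.eye(n)
    penalty[0, 0] = 0.0                       # do not regularize the intercept
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

def select_lambda(X_train, y_train, X_cv, y_cv):
    """Train one model per lambda, return the lambda with the lowest (unregularized) CV error."""
    lambdas = [0] + [0.01 * 2 ** k for k in range(11)]    # 0, 0.01, 0.02, ..., ~10.24 (12 values)
    best_lam, best_err = None, np.inf
    for lam in lambdas:
        theta = ridge_fit(X_train, y_train, lam)
        err_cv = np.mean((X_cv @ theta - y_cv) ** 2) / 2  # CV error measured without the penalty term
        if err_cv < best_err:
            best_lam, best_err = lam, err_cv
    return best_lam
```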

6 learning curve
The learning curve is a very useful tool for judging whether a learning algorithm suffers from bias or variance; it is a good sanity check for the algorithm. A learning curve plots the training set error and the cross validation set error as functions of the number of training samples m. That is, if we have 100 rows of data, we start by training on 1 row and gradually train on more and more rows. The intuition is that with very few training examples the model fits them almost perfectly, but it does not generalize well to the cross validation or test data; as m grows, the training error rises while the cross validation error falls.

How to use the learning curve to identify high bias / underfitting: as an example, suppose we try to fit a straight line to data that a straight line cannot capture. We can see that no matter how many more training samples we add, the error stays high and barely changes.

In other words, in the high bias / underfitting case, adding more data to the training set is unlikely to help. How to use the learning curve to identify high variance / overfitting: suppose we use a very high-degree polynomial model with very little regularization. We can see that when the cross validation error is much larger than the training error, adding more data to the training set can improve the model.

In conclusion, in the high variance / overfitting case, adding more training data is likely to improve the algorithm.
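
A minimal sketch of computing a learning curve for linear regression, again assuming design matrices with a bias column and hypothetical variable names:

```python
import numpy as np

def learning_curve(X_train, y_train, X_cv, y_cv):
    """Training and cross validation error as a function of the number of training samples m."""
    m_total = X_train.shape[0]
    errs_train, errs_cv = [], []
    for m in range(1, m_total + 1):
        Xm, ym = X_train[:m], y_train[:m]
        theta, *_ = np.linalg.lstsq(Xm, ym, rcond=None)          # fit on the first m examples
        errs_train.append(np.mean((Xm @ theta - ym) ** 2) / 2)   # error on those m examples
        errs_cv.append(np.mean((X_cv @ theta - y_cv) ** 2) / 2)  # error always on the full CV set
    return errs_train, errs_cv
```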

7 specific countermeasures
little Mi has now shown you how to evaluate a learning algorithm and discussed model selection, bias and variance. So how do these diagnostics help us judge which methods are likely to improve a learning algorithm and which are likely to be futile? Going back to the six options proposed in Section 1, let's see which one to choose in which situation:

1. Get more training samples – fixes high variance

2. Try to reduce the number of features – fixes high variance

3. Try to get more features – fixes high bias

4. Try to add polynomial features – fixes high bias

5. Try to decrease the degree of regularization (smaller λ) – fixes high bias

6. Try to increase the degree of regularization (larger λ) – fixes high variance

Bias and variance in neural networks:

Using a smaller neural network is similar to having fewer parameters: it tends toward high bias and underfitting, but its computational cost is low. Using a larger neural network is similar to having more parameters: it tends toward high variance and overfitting, and although the computational cost is higher, regularization can be used to make it fit the data better. In general, a larger neural network with regularization performs better than a smaller one. For choosing the number of hidden layers, we usually start from one layer and gradually increase the count. To make a better choice, split the data into a training set, a cross validation set and a test set, train neural networks with different numbers of hidden layers, and select the one with the lowest cost on the cross validation set, as in the sketch below.
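
As a minimal sketch of that selection loop, assuming scikit-learn's MLPRegressor (the library choice, layer width and regularization strength are assumptions, not from the text):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def select_architecture(X_train, y_train, X_cv, y_cv, max_hidden_layers=3, units=25, alpha=0.01):
    """Train networks with 1..max_hidden_layers hidden layers (each with `units` neurons and
    L2 regularization strength `alpha`) and return the one with the lowest CV error."""
    best_model, best_err = None, np.inf
    for n_layers in range(1, max_hidden_layers + 1):
        model = MLPRegressor(hidden_layer_sizes=(units,) * n_layers,
                             alpha=alpha, max_iter=2000, random_state=0)
        model.fit(X_train, y_train)
        err_cv = np.mean((model.predict(X_cv) - y_cv) ** 2) / 2   # CV error of this architecture
        if err_cv < best_err:
            best_model, best_err = model, err_cv
    return best_model
```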

That wraps up little Mi's introduction to bias and variance, and to the learning curve as a diagnostic for them. When you are trying to improve a learning algorithm's performance, use these ideas to judge which directions are likely to help and which are likely to be a waste of time, so that machine learning can be applied effectively to real problems. Hopefully the techniques covered in these sections, the diagnostics built on variance, bias and learning curves, really will help you apply machine learning more efficiently and make it work well.

Next week we will start learning support vector machines. See you next time! (wave for ten minutes)