Designing a machine learning system


Today, Little Mi will walk you through how to design a machine learning system and some problems you may run into when building a complex one. Along the way, Little Mi will also share some small tips for putting complex machine learning systems together skillfully. Oh, and just between us, these may save you a lot of time when building a large-scale machine learning system~

1 Initial steps
No more preamble, let's start with an example! Suppose we need to design a spam classifier. To solve such a problem, the first decision we have to make is how to select and represent the feature vector. We can select a list of the 100 words that appear most often in spam, and build our feature vector according to whether each of these words appears in the email (1 if it appears, 0 if it doesn't). The vector's dimension is 100 × 1.
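As a concrete sketch, the feature extraction above might look like the following; the five-word vocabulary here is a hypothetical stand-in for the real 100-word list:

```python
# Sketch: build a 0/1 feature vector from an email's text.
# SPAM_WORDS is an illustrative stand-in for the "100 most frequent
# spam words" described above; a real list would have 100 entries.
SPAM_WORDS = ["buy", "discount", "deal", "now", "winner"]

def email_to_features(email_text, vocabulary=SPAM_WORDS):
    """Return x with x[j] = 1 if vocabulary[j] appears in the email, else 0."""
    tokens = set(email_text.lower().split())
    return [1 if word in tokens else 0 for word in vocabulary]

x = email_to_features("Buy now and claim your discount")
print(x)  # → [1, 1, 0, 1, 0]
```

With a real 100-word vocabulary, the returned list would be the 100 × 1 feature vector described above.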

In order to build this classifier algorithm, we can do many things, such as:

Collect more data, so that we have more spam and non-spam samples
Develop a set of sophisticated features based on the email's routing information
Develop a set of sophisticated features based on the email's body text, including how to handle truncation
Develop sophisticated algorithms to detect deliberate misspellings (e.g. writing watch as w4tch)
Which of the above options deserves focused study, time, and effort needs careful consideration, rather than just following a gut feeling. When we use machine learning, we can always "brainstorm" a pile of methods to try. Later, when Little Mi takes you through error analysis, you will learn a more systematic way to choose the right approach from among many different methods.

2 Error analysis
Error analysis can usually help us make decisions more systematically. If we are going to study machine learning or build a machine learning application, the best practice is not to start with a very complex system full of sophisticated features; instead, we build a simple algorithm that we can implement quickly. When Andrew Ng talks about tackling a machine learning problem, he suggests spending at most a day getting a first result, even if that result is not very good. Even if it runs imperfectly, run it anyway, and validate it on cross-validation data. Once that is done, we can plot learning curves. By plotting learning curves and examining the errors, we can find out whether the algorithm suffers from high bias, high variance, or some other problem. After such analysis, we can decide whether to train with more data or add more features. This is in fact a good approach, because we cannot know in advance whether we need complex features, or more data, or something else, so it is hard to know where we should spend our time to improve the algorithm's performance. But once we have a very simple, even imperfect, implementation running, we can plot learning curves to guide further choices and avoid premature optimization.

The idea behind this approach is that we must let evidence, not just intuition, guide our decisions about how to allocate our time when optimizing an algorithm. Besides plotting learning curves, another very useful activity is error analysis: when building a spam classifier, look at the cross-validation set and examine for ourselves which emails the algorithm misclassifies. From the spam and non-spam emails the algorithm gets wrong, we can often find systematic patterns: which types of email are consistently misclassified. Doing this frequently inspires new features, or reveals the current system's weaknesses and suggests how to improve it. The recommended procedure for building a learning algorithm is:

1. Start with a simple algorithm that can be implemented quickly, implement it, and test it on the cross-validation set

2. Plot learning curves and decide whether more data, more features, or some other option would help

3. Error analysis: manually examine the cross-validation examples our algorithm mispredicts, and look for a systematic trend in them

Taking our spam filter as an example, error analysis means examining all the emails the algorithm mispredicts in the cross-validation set and seeing whether they can be grouped into categories, for example pharmaceutical spam, counterfeit-goods spam, or password-theft (phishing) emails. Then look at which group the classifier gets wrong most often and start optimizing there. Think about how the classifier could be improved: for example, if you find certain features are missing, record how often each issue occurs, such as the number of misspellings or the number of emails with unusual routing, and then start with the most frequent case. Error analysis does not always tell us whether a particular improvement is worthwhile. Sometimes we need to try different models and compare them; when comparing models, we use a single numerical value to judge which model is better and more effective, usually the cross-validation set error. In our spam classifier example, consider the question "should we treat discount / discounts / discounted / discounting as the same word?" If doing so would improve our algorithm, we would use stemming software. Error analysis cannot settle this kind of question; we can only try the algorithm with and without stemming, and then judge which is better by the numerical evaluation.
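The counting step described above can be sketched like this; the `categorize` function here is a hypothetical stand-in, since in practice a person assigns the categories by hand while reading the misclassified emails:

```python
from collections import Counter

def tally_error_categories(misclassified, categorize):
    """Count how many misclassified cross-validation emails fall into
    each category, so we can optimize the most frequent case first."""
    return Counter(categorize(email) for email in misclassified)

# Hypothetical categorizer standing in for manual labelling.
def categorize(email):
    if "pharmacy" in email:
        return "medicine spam"
    if "password" in email:
        return "password theft"
    return "other"

errors = ["cheap pharmacy pills", "reset your password here", "pharmacy deals"]
counts = tally_error_categories(errors, categorize)
print(counts.most_common(1))  # → [('medicine spam', 2)]
```

The category with the largest count is where optimization effort is likely to pay off most.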

Therefore, when building learning algorithms, we always try many new ideas and implement many versions of the algorithm. If every time we try a new idea we have to manually inspect examples to judge whether it performs well or badly, for questions like whether to use stemming or whether to be case sensitive, it will be hard to make decisions. With a single quantitative numerical evaluation, however, we can see directly whether the error got larger or smaller, and whether our idea improved the algorithm or made it worse, which greatly speeds up iteration. One more note: Little Mi insists on doing error analysis on the cross-validation set, not the test set, haha.
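A minimal sketch of such a numerical comparison, with made-up predictions standing in for the outputs of the two classifier variants (with and without stemming):

```python
def cv_error(predictions, labels):
    """Fraction of cross-validation examples the classifier gets wrong."""
    wrong = sum(p != y for p, y in zip(predictions, labels))
    return wrong / len(labels)

# Hypothetical cross-validation labels and two variants' predictions.
labels = [1, 0, 1, 1, 0]
preds_without_stemming = [1, 0, 0, 1, 1]
preds_with_stemming = [1, 0, 1, 1, 1]

e1 = cv_error(preds_without_stemming, labels)  # 0.4
e2 = cv_error(preds_with_stemming, labels)     # 0.2
print("use stemming" if e2 < e1 else "skip stemming")  # → use stemming
```

The single number `cv_error` lets us decide between the two variants without re-reading individual emails.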

In conclusion, when starting on a new machine learning problem, we recommend implementing a relatively simple and fast algorithm, even if it is not perfect. After this initial implementation, error analysis becomes a very powerful tool for deciding what to do next and which optimization to pursue. And once we have a fast (if imperfect) implementation plus a single numerical evaluation, we can try new ideas and quickly find out whether they improve the algorithm's performance, letting us decide faster what to discard and what to keep, with error analysis helping us choose systematically.

3 Precision and recall
In the previous section, Little Mi mentioned error analysis and the importance of a single error metric: a real number that evaluates the learning algorithm and measures its performance. One important caveat is that the choice of error metric can have a subtle impact on the learning algorithm; this involves the problem of skewed classes. Classes are skewed when our training set contains very many examples of one class and few or none of the others. For example, suppose we want an algorithm to predict whether a tumor is malignant, and in our training set only 0.5% of cases are malignant. A trivial non-learning "algorithm" that predicts benign in every case would have an error of only 0.5%, while the neural network we trained has an error of 1%. In this situation the error rate alone cannot be the basis for judging the algorithm's quality. We therefore divide the algorithm's predictions into four cases (the basis of precision and recall):

True positive (TP): predicted positive, actually positive
True negative (TN): predicted negative, actually negative
False positive (FP): predicted positive, actually negative
False negative (FN): predicted negative, actually positive

Precision = TP / (TP + FP)

Continuing with the malignant-tumor example: of all the patients we predicted to have malignant tumors, the higher the percentage who actually have malignant tumors, the better.

Recall = TP / (TP + FN)

Of all the patients who actually have malignant tumors, the higher the percentage we successfully predict as malignant, the better. So for the algorithm above that always predicts tumors are benign, the recall is 0.
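The four counts, and precision and recall, can be computed directly. The tiny label set below is illustrative and reproduces the "always predict benign" case from the skewed-class example:

```python
def confusion_counts(predictions, actuals):
    """Count TP, TN, FP, FN for 0/1 predictions against 0/1 labels."""
    tp = sum(p == 1 and y == 1 for p, y in zip(predictions, actuals))
    tn = sum(p == 0 and y == 0 for p, y in zip(predictions, actuals))
    fp = sum(p == 1 and y == 0 for p, y in zip(predictions, actuals))
    fn = sum(p == 0 and y == 1 for p, y in zip(predictions, actuals))
    return tp, tn, fp, fn

def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

# The "always predict benign" classifier: one malignant case, never caught.
actuals = [1, 0, 0, 0, 0, 0, 0, 0]
always_benign = [0] * len(actuals)
tp, tn, fp, fn = confusion_counts(always_benign, actuals)
print(recall(tp, fn))  # → 0.0
```

This makes the skewed-class problem concrete: the error rate of this classifier is only 1/8, yet its recall is 0.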

4 Selecting a threshold value
How do we trade off precision against recall? Continuing with the tumor-prediction example: if our algorithm outputs a value between 0 and 1, we can predict positive or negative using a threshold of 0.5.

As mentioned in the previous section: because Precision = TP / (TP + FP), of all the patients we predict to have malignant tumors, the higher the percentage who actually do, the better; and because

Recall = TP / (TP + FN), of all the patients who actually have malignant tumors, the higher the percentage we successfully predict, the better. If we want to predict positive (malignant) only when we are very confident, that is, if we want higher precision, we can use a threshold greater than 0.5, such as 0.7 or 0.9. Doing so reduces the number of patients falsely predicted as malignant, but increases the number of malignant tumors we fail to catch. If instead we want higher recall, so that as many potentially malignant patients as possible get further examination and diagnosis, we can use a threshold smaller than 0.5, such as 0.3. We can plot the relationship between recall and precision at different threshold values; the shape of the curve varies with the data.

In this case, we can use the F1 score to choose the threshold value:

F1 = 2PR / (P + R), where P is precision and R is recall.

Select the threshold value at which the F1 value is maximal.
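Threshold selection by maximum F1 can be sketched as follows; the scores and labels are made-up cross-validation outputs for illustration:

```python
def f1(p, r):
    """F1 = 2PR / (P + R); defined as 0 when both are 0."""
    return 2 * p * r / (p + r) if p + r else 0.0

def best_threshold(scores, actuals, thresholds):
    """Return (threshold, F1) for the threshold with the highest F1 on the CV set."""
    best = None
    for t in thresholds:
        preds = [1 if s >= t else 0 for s in scores]
        tp = sum(p and y for p, y in zip(preds, actuals))
        fp = sum(p and not y for p, y in zip(preds, actuals))
        fn = sum((not p) and y for p, y in zip(preds, actuals))
        p_ = tp / (tp + fp) if tp + fp else 0.0
        r_ = tp / (tp + fn) if tp + fn else 0.0
        score = f1(p_, r_)
        if best is None or score > best[1]:
            best = (t, score)
    return best

# Hypothetical CV scores (algorithm outputs in [0, 1]) and true labels.
scores = [0.9, 0.8, 0.6, 0.4, 0.2]
actuals = [1, 1, 0, 1, 0]
print(best_threshold(scores, actuals, [0.3, 0.5, 0.7]))
```

Here the lower threshold wins because catching the borderline malignant case (score 0.4) raises recall more than the extra false positive hurts precision.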

5 Data
Of course, how much training data to use is another important aspect of machine learning system design. Under certain conditions, getting a large amount of data and training a certain type of learning algorithm on it can be an effective way to obtain a learning algorithm with good performance. When those conditions hold for your problem and you can get a lot of data, this can be a very good route to a high-performance learning algorithm.

Michele Banko and Eric Brill conducted a very interesting study in which they tried to distinguish commonly confused words using machine learning algorithms. They tried many different algorithms and found that when the amount of data is very large, all of these different algorithms work well.

For example, in the sentence: For breakfast I ate __ eggs (to, two, too), "two" is the confusable word. They treated this as a supervised learning problem and tried to classify: which word is appropriate at a specific position in an English sentence. They used several different learning algorithms, for example a variant of logistic regression called the "perceptron"; some commonly used algorithms such as the Winnow algorithm, which is quite similar to a regression method; a memory-based learning algorithm; a naive Bayes algorithm; and so on. And when should we get more data rather than modify the algorithm? That is what we really want to know. What they did was vary the size of the training set and apply each of these learning algorithms to training sets of different sizes. The following are their results.

The trends are very clear. First, most of the algorithms have similar performance. Second, performance improves as the training set grows: the horizontal axis shows the training-set size in millions of samples, from 0.1 million up to 1000 million, that is, one billion training examples, and the algorithms' performance rises accordingly. In fact, if you pick any algorithm, even an "inferior" one, and give it more data, it is likely to outperform the other algorithms, even the "superior" ones, on these examples. Because this original study was very influential, and a series of later studies showed similar results, a general consensus arose in machine learning: "it's not who has the best algorithm that wins, it's who has the most data". So is this really true? If it is, then given a learning algorithm, getting a large amount of data is usually the best way to ensure high performance, rather than debating which algorithm to use. But this relies on an assumption: that in our machine learning problem, the features x contain enough information to predict y accurately. For example, with the confusable words two, to, and too, if x captures the words surrounding the blank to be filled, then given "For breakfast I ate __ eggs" there is plenty of information telling us the missing word is "two", not "to" or "too".

So capturing the features, even just the surrounding words, can give us enough information to determine the label y, in other words, which of the three confusable words should fill the blank. Now let's look at large data. Assume the features carry enough information to predict y, and suppose we use a learning algorithm with a large number of parameters, such as logistic regression or linear regression with many features, or a neural network with many hidden units. These are very powerful learning algorithms with many parameters that can fit very complex functions, so we can think of them as low-bias algorithms. If we run such an algorithm on the data, it will most likely fit the training set well, so the training error will be very low. Now suppose we use a very, very large training set. In that case, even though the model has many parameters, if the training set is much larger than the number of parameters, the algorithm is unlikely to overfit; that is, the training error should be close to the test error. Another way to look at it: to get a high-performance learning algorithm, we want it to have neither high bias nor high variance. We address the bias problem by using a learning algorithm with many parameters, giving us a low-bias algorithm, and we address the variance problem by using a very large training set.
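A tiny simulation of the "large training set ⇒ training error close to test error" claim, using a one-parameter least-squares model on synthetic data (all the numbers here are made up for illustration):

```python
import random

def fit_slope(xs, ys):
    """Least-squares slope for the one-parameter model y ≈ w * x."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def mse(w, xs, ys):
    """Mean squared error of the fitted model on a data set."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def make_data(m, rng):
    """Synthetic data: y = 2x plus Gaussian noise."""
    xs = [rng.uniform(0.0, 1.0) for _ in range(m)]
    ys = [2.0 * x + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys

rng = random.Random(0)
test_xs, test_ys = make_data(1000, rng)

# With far more training examples than parameters (10000 vs 1),
# the gap between training error and test error becomes very small.
train_xs, train_ys = make_data(10000, rng)
w = fit_slope(train_xs, train_ys)
gap = abs(mse(w, train_xs, train_ys) - mse(w, test_xs, test_ys))
print(gap)  # a small number: training error ≈ test error
```

With only one parameter and ten thousand examples, overfitting is essentially impossible, which is the low-variance half of the argument; the low-bias half comes from the model family being rich enough to represent the true relationship.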

With a very large training set we will not have a variance problem, and by putting the two together we finally get a learning algorithm with low bias and low variance, which performs well on the test set. Fundamentally, the key assumptions are: the features contain enough information and we have a good class of functions, which together ensure low bias; and a very large training set, which ensures low variance. This suggests a condition for success: if you have a large amount of data and you train a learning algorithm with many parameters, that is a good recipe for a high-performance learning algorithm. The two key tests are therefore: first, can y be accurately predicted from the features x? Second, can we obtain a huge training set and train a many-parameter learning algorithm on it? If we can do both, we can usually obtain a well-performing learning algorithm.

Well, that's how to design a machine learning system! Little Mi has learned a lot too~ btw, Little Mi's prediction last week was wrong; we'll learn support vector machines next week, haha~ (wave for ten minutes!)