- So far we have covered a series of supervised learning algorithms, such as linear regression, logistic regression, and neural networks, and reached a conclusion: in machine learning, what matters is often not whether you use algorithm A or algorithm B, but how much data you collect. The performance of an algorithm also depends on the skill of the person applying it, for example in the choice of features and of the regularization parameter
- Next, we discuss the last supervised learning algorithm: the support vector machine (SVM). Compared with logistic regression and neural networks, SVM provides a more powerful way to learn complex nonlinear functions

### From the perspective of logistic regression

- We start with logistic regression and modify it step by step into a support vector machine

(1) First, recall what we want logistic regression to do. When a sample has y = 1, we want the hypothesis h to be close to 1, which means z = θᵀx is much greater than 0; when a sample has y = 0, we want h to be close to 0, which means z = θᵀx is much less than 0

(2) Next, look at the logistic regression cost function for a single sample, that is, m = 1. With y = 1 we get the curve shown in the figure: for the cost to be small, z must be much greater than 0. We replace this curve with two line segments, the right one horizontal, and this becomes one term of the SVM cost function. With y = 0 we get the other curve shown in the figure: for the cost to be small, z must be much less than 0. We again replace the curve with two line segments, this time with the left one horizontal, to get the other term. These two terms are called cost1(z) and cost0(z)
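The two replacement curves can be sketched in a few lines of Python. This is only an illustration under assumptions: the kinks are placed at z = 1 and z = -1 to match the conditions discussed later in these notes, and the slope of 1 on the sloped segment is an assumed value that the notes do not specify.

```python
import math

def sigmoid(z):
    """Logistic hypothesis h = 1 / (1 + e^(-z)), with z = theta^T x."""
    return 1.0 / (1.0 + math.exp(-z))

def logistic_cost(z, y):
    """Logistic regression cost for a single sample with label y."""
    h = sigmoid(z)
    return -math.log(h) if y == 1 else -math.log(1.0 - h)

def cost1(z):
    """Two-segment stand-in for the y = 1 curve: flat (zero) for z >= 1.
    The slope of 1 on the left segment is an assumption."""
    return max(0.0, 1.0 - z)

def cost0(z):
    """Two-segment stand-in for the y = 0 curve: flat (zero) for z <= -1.
    The slope of 1 on the right segment is an assumption."""
    return max(0.0, 1.0 + z)

# Compare the smooth logistic cost with its piecewise-linear replacement.
for z in (-2.0, 0.0, 2.0):
    print(z, round(logistic_cost(z, 1), 3), cost1(z), cost0(z))
```

The logistic cost only approaches zero as z grows, while cost1 is exactly zero once z reaches 1, which is what makes the later margin argument possible.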

- Next, we start to construct the cost function of SVM

(1) Move the minus sign of the logistic regression cost function inside the sum, then replace the two logarithmic terms with cost1(z) and cost0(z)

(2) Remove the 1/m factor. For a given training set, m is a constant, so keeping or dropping it does not change which parameters minimize the cost function

(3) Change the parameterization. Instead of using the regularization parameter λ to weigh the regularization term against the error term, SVM uses a parameter C that multiplies the error term. The effect is the same, and C can be thought of as 1/λ

After three steps, we get the cost function of SVM
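Written out, the result of the three steps would look as follows. This is a sketch in the usual course notation, assuming m training samples, n features, and the terms cost1 and cost0 defined above; it is not copied from the original figures.

$$
\min_{\theta}\; C \sum_{i=1}^{m} \left[ y^{(i)}\,\mathrm{cost}_1\!\left(\theta^{T} x^{(i)}\right) + \left(1 - y^{(i)}\right)\mathrm{cost}_0\!\left(\theta^{T} x^{(i)}\right) \right] + \frac{1}{2} \sum_{j=1}^{n} \theta_j^{2}
$$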

- Here we should pay attention to the difference between logistic regression and support vector machine

(1) Logistic regression outputs a probability, and we then choose a threshold to decide whether the prediction is 0 or 1

(2) A support vector machine directly predicts whether y is 0 or 1
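The difference can be illustrated with a small sketch. The parameter values and the sample below are hypothetical, and the SVM rule used here (predict 1 when θᵀx ≥ 0) is the usual convention, assumed rather than stated in these notes.

```python
import math

def dot(theta, x):
    """Inner product theta^T x."""
    return sum(t * xi for t, xi in zip(theta, x))

def logistic_predict(theta, x, threshold=0.5):
    """Logistic regression: compute a probability, then apply a chosen threshold."""
    prob = 1.0 / (1.0 + math.exp(-dot(theta, x)))
    return prob, int(prob >= threshold)

def svm_predict(theta, x):
    """SVM: output the label directly from the sign of theta^T x (assumed rule)."""
    return int(dot(theta, x) >= 0)

theta = [-1.0, 2.0]   # hypothetical parameters; theta[0] is the intercept
x = [1.0, 0.8]        # x[0] = 1 is the bias feature

print(logistic_predict(theta, x))        # probability and thresholded label
print(logistic_predict(theta, x, 0.7))   # same probability, stricter threshold
print(svm_predict(theta, x))             # label only
```

Changing the threshold changes the logistic regression label for the same probability, while the SVM prediction has no such knob.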

### Support vector machine: also known as large margin classifier

- Support vector machines are also called large margin classifiers. To see why, first consider what is required to minimize the cost function of SVM

(1) When y = 1, the second term (cost0) drops out, and the first term (cost1) equals 0 only when z is greater than or equal to 1

(2) When y = 0, the first term (cost1) drops out, and the second term (cost0) equals 0 only when z is less than or equal to -1

Therefore, unlike logistic regression, which only requires z to be positive or negative (or uses a threshold we set ourselves), the support vector machine imposes stricter requirements: z ≥ 1 and z ≤ -1 (personal understanding: it is not black or white; there is a blank transition zone between the positive and negative labels). This is equivalent to building in an extra safety factor, which makes the classification more reliable

- Combining the above, to make the error term zero, every sample must satisfy one of two conditions

(1) When y = 1, the error term is determined by cost1, and z must be greater than or equal to 1

(2) When y = 0, the error term is determined by cost0, and z must be less than or equal to -1

- Because of this safety factor, the decision boundary of SVM is more robust: it keeps a larger minimum distance (the margin) from the training samples. This property is why SVM is sometimes called a large margin classifier

- If we set C very large (for example 100000), the error term carries a very large weight. A single outlier is then likely to change the decision boundary noticeably, which is clearly undesirable

### The mathematical principle behind large margin classification

- Next, let's see why this cost function gives a decision boundary with a larger distance from the training samples

##### Review of the vector inner product

uᵀv = vᵀu = p · ‖u‖, where p is the signed length of the projection of v onto u and ‖u‖ is the norm (length) of u. p is positive when the angle between u and v is less than 90° and negative when it is greater than 90°
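A quick numerical check of this identity, using two arbitrary example vectors chosen for illustration:

```python
import math

def dot(u, v):
    """Inner product u^T v."""
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    """Euclidean norm (length) of u."""
    return math.sqrt(dot(u, u))

def signed_projection(v, u):
    """Signed length of the projection of v onto u."""
    return dot(u, v) / norm(u)

u = [4.0, 3.0]
v = [2.0, -6.0]   # makes an angle of more than 90 degrees with u

p = signed_projection(v, u)
# The inner product and p * ||u|| agree; p is negative for this pair.
print(dot(u, v), p, p * norm(u))
```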

##### Applying the inner product to the cost function of support vector machine

- To simplify, we set θ0 = 0 and assume there are only two features, so everything can be drawn in two-dimensional coordinates

(1) When every sample satisfies its condition, the error term is zero and only the regularization term remains, which can be written as half the squared norm of the parameter vector, (1/2)‖θ‖²

(2) Suppose there is only one sample x in the training set. Then z = θᵀx can be written as p · ‖θ‖, where p is the signed projection of the feature vector x onto the parameter vector θ, and this product is compared with 1 or -1

(3) Because θ0 = 0, the decision boundary passes through the origin. Because the parameter vector θ is orthogonal to the decision boundary, the projection p of a sample onto θ is its signed distance from the boundary. The conditions are p · ‖θ‖ ≥ 1 for y = 1 and p · ‖θ‖ ≤ -1 for y = 0, so the larger |p| is, the smaller ‖θ‖ can be, and the smaller the cost. This is why the support vector machine produces a large margin classification

- To sum up, by making the norm of the parameter vector as small as possible, the support vector machine finds the decision boundary with the maximum margin
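The argument can be made concrete with a toy example. The sketch below assumes θ0 = 0, two features, and a small hand-made data set; `required_norm` is a hypothetical helper that returns the smallest ‖θ‖ satisfying p · ‖θ‖ ≥ 1 (y = 1) and p · ‖θ‖ ≤ -1 (y = 0) when θ points in a given direction.

```python
import math

def required_norm(direction, samples):
    """Smallest ||theta|| that meets every margin condition when theta
    points along `direction` (theta_0 = 0, boundary through the origin)."""
    length = math.hypot(*direction)
    unit = [d / length for d in direction]
    margins = []
    for x, y in samples:
        p = sum(xi * ui for xi, ui in zip(x, unit))  # signed projection onto theta
        margins.append(p if y == 1 else -p)          # positive when on the correct side
    smallest = min(margins)
    if smallest <= 0:
        return math.inf                              # this direction does not separate the data
    return 1.0 / smallest

# Hand-made, linearly separable toy data: (features, label).
samples = [([2.0, 2.0], 1), ([3.0, 1.0], 1), ([-2.0, -2.0], 0), ([-1.0, -3.0], 0)]

# Two candidate directions for theta; each implies a different decision boundary.
for direction in ([1.0, 1.0], [1.0, 0.0]):
    print(direction, required_norm(direction, samples))
```

The direction whose boundary sits farther from the nearest sample needs the smaller ‖θ‖, so minimizing ‖θ‖ favors it.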

### Kernel function

- When fitting a nonlinear boundary, we usually construct polynomial features, but this quickly becomes computationally expensive. In support vector machines, we instead construct new features in a different way to fit the nonlinear boundary

##### How to construct new feature variables

- We manually select some points (landmarks), use a function to measure the similarity between x and each landmark, and take these similarities as the new feature variables. The similarity function is called the kernel function. There are many kinds of kernel functions; here we use the Gaussian kernel as an example

- Intuition for the kernel function: when x is close to a landmark, the numerator ‖x - l‖² in the exponent is close to 0 and f is close to 1; when x is far from the landmark, the numerator is large and f is close to 0. So the kernel function measures the similarity between x and the landmark: close to 1 when they are near, close to 0 when they are far apart. Each landmark l defines one new feature variable f

- The influence of the kernel parameter σ² on the kernel function can be seen in a contour plot. When σ² becomes smaller, the contours contract and f falls off more quickly with distance from the landmark; when σ² becomes larger, the contours spread out and f falls off more slowly

- With the new feature variables defined, look at the new hypothesis function. Suppose we have already obtained the parameters. When a sample is close to l1, f1 is close to 1 while f2 and f3 are close to 0; the hypothesis value is 0.5, so we predict y = 1. When a sample is far from all the landmarks, f1, f2, and f3 are all close to 0; the hypothesis value is -0.5, so we predict y = 0
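A minimal sketch of this similarity, assuming the standard Gaussian kernel form f = exp(-‖x - l‖² / (2σ²)); the landmark and the test points are made up for illustration.

```python
import math

def gaussian_kernel(x, landmark, sigma=1.0):
    """Similarity f = exp(-||x - l||^2 / (2 * sigma^2))."""
    sq_dist = sum((xi - li) ** 2 for xi, li in zip(x, landmark))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

landmark = [3.0, 5.0]                              # hypothetical landmark l
print(gaussian_kernel([3.0, 5.0], landmark))       # x on the landmark
print(gaussian_kernel([3.5, 5.0], landmark))       # x close to the landmark
print(gaussian_kernel([9.0, 9.0], landmark))       # x far from the landmark
```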
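The effect of the parameter can also be checked numerically. The sketch below assumes the standard Gaussian kernel form and evaluates it at one fixed point for a few illustrative values of σ².

```python
import math

def gaussian_kernel(x, landmark, sigma_sq):
    """Similarity f = exp(-||x - l||^2 / (2 * sigma^2)), parameterized by sigma^2."""
    sq_dist = sum((xi - li) ** 2 for xi, li in zip(x, landmark))
    return math.exp(-sq_dist / (2.0 * sigma_sq))

landmark = [0.0, 0.0]
x = [1.0, 1.0]   # fixed point at squared distance 2 from the landmark

# Smaller sigma^2: similarity at the same distance is lower (narrower bump).
# Larger sigma^2: similarity at the same distance is higher (wider bump).
for sigma_sq in (0.5, 1.0, 3.0):
    print(sigma_sq, round(gaussian_kernel(x, landmark, sigma_sq), 4))
```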
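The prediction step can be sketched as follows. The landmark positions and the parameter values θ = (-0.5, 1, 1, 0) are assumptions chosen only so that the hypothesis values come out near 0.5 and -0.5 as in the description above; the rule "predict 1 when the hypothesis value is at least 0" is the usual convention.

```python
import math

def gaussian_kernel(x, landmark, sigma_sq=1.0):
    """Similarity f = exp(-||x - l||^2 / (2 * sigma^2))."""
    sq_dist = sum((xi - li) ** 2 for xi, li in zip(x, landmark))
    return math.exp(-sq_dist / (2.0 * sigma_sq))

def predict(x, landmarks, theta, sigma_sq=1.0):
    """Return (score, label) with score = theta_0 + sum_k theta_k * f_k
    and label = 1 when score >= 0, where f_k is the similarity to landmark k."""
    features = [gaussian_kernel(x, l, sigma_sq) for l in landmarks]
    score = theta[0] + sum(t * f for t, f in zip(theta[1:], features))
    return score, int(score >= 0)

landmarks = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]   # hypothetical l1, l2, l3
theta = [-0.5, 1.0, 1.0, 0.0]                        # assumed parameter values

print(predict([0.1, 0.0], landmarks, theta))     # close to l1
print(predict([30.0, 30.0], landmarks, theta))   # far from every landmark
```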