Statistical learning 3: linear support vector machine (SVM)


Learning strategy

Soft margin maximization

The “linearly separable support vector machine” defined in the previous chapter requires that the training data be linearly separable. In practice, however, the training data often contain outliers, so the data are often linearly inseparable. This requires us to modify the algorithm from the previous chapter: we relax the constraints and replace the original hard margin maximization with soft margin maximization.
Given training set

\[ D = \{(\bm{x}^{(1)}, y^{(1)}), (\bm{x}^{(2)}, y^{(2)}), \dots, (\bm{x}^{(m)}, y^{(m)})\} \tag{1} \]

where \(\bm{x}^{(i)} \in \mathcal{X} \subseteq \mathbb{R}^n\) and \(y^{(i)} \in \mathcal{Y} = \{+1, -1\}\).
If the training set is linearly separable, the linearly separable support vector machine is equivalent to solving the following convex optimization problem:

\[ \begin{aligned}
\underset{\bm{w}, b}{\min} &\quad \frac{1}{2} || \bm{w}||^2\\
\text{s.t.} &\quad y^{(i)} (\bm{w}^T \bm{x}^{(i)} + b) \geqslant 1 \\
&\quad (i = 1, 2, \dots, m)
\end{aligned} \tag{2} \]

where \(y^{(i)} (\bm{w}^T \bm{x}^{(i)} + b) - 1 \geqslant 0\) means that the sample point \((\bm{x}^{(i)}, y^{(i)})\) has a functional margin greater than or equal to 1. Now we relax this condition for each sample point \((\bm{x}^{(i)}, y^{(i)})\) by introducing a slack variable \(\xi_{i} \geqslant 0\), so that the constraint becomes \(y^{(i)} (\bm{w}^T \bm{x}^{(i)} + b) \geqslant 1-\xi_{i}\). Each slack variable \(\xi_{i}\) is also penalized in the objective function, which becomes \(\frac{1}{2} || \bm{w}||^2+C\sum_{i=1}^{m}\xi_{i}\), where \(C>0\) is called the penalty coefficient. The optimization now tries to make the margin as large as possible (i.e. make \(\frac{1}{2} || \bm{w}||^2\) as small as possible) while keeping the number of misclassified points small. This is called the soft margin.
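To make the slack variables concrete, here is a minimal numpy sketch; the hyperplane and sample points are hypothetical, chosen only for illustration. It computes the smallest \(\xi_i \geqslant 0\) satisfying each relaxed constraint:

```python
import numpy as np

# Hypothetical hyperplane w^T x + b = 0 and three positive samples
w, b = np.array([1.0, 1.0]), -1.0
X = np.array([[2.0, 2.0],     # functional margin 3: constraint already satisfied
              [0.75, 0.75],   # margin 0.5: correct side, but inside the margin
              [-1.0, 0.0]])   # margin -2: misclassified
y = np.array([1.0, 1.0, 1.0])

margins = y * (X @ w + b)
# smallest xi_i >= 0 such that margin_i >= 1 - xi_i
xi = np.maximum(0.0, 1.0 - margins)
```

Only the second and third points incur a nonzero slack, and only those contribute to the penalty term \(C\sum_i \xi_i\).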

Linear support vector machine

In this way, the linear support vector machine becomes the following convex quadratic programming problem (original problem):

\[ \begin{aligned}
\underset{\bm{w}, b, \bm{\xi}}{\min} &\quad \frac{1}{2} || \bm{w}||^2 + C\sum_{i=1}^{m}\xi_{i}\\
\text{s.t.} &\quad y^{(i)} (\bm{w}^T \bm{x}^{(i)} + b) \geqslant 1-\xi_{i} \\
&\quad \xi_{i} \geqslant 0 \\
&\quad (i = 1, 2, \dots, m)
\end{aligned} \tag{3} \]

Because this is a convex quadratic program, a solution in \((\bm{w}, b, \bm{\xi})\) is guaranteed to exist. It can be proved that the solution for \(\bm{w}\) is unique, while the solution for \(b\) may not be unique and instead lies in an interval.
Let the solution of \((3)\) be \(\bm{w}^{*}, b^*\). The separating hyperplane is then \(\{\bm{x} \mid \bm{w}^{*T}\bm{x}+b^*=0\}\), and the classification decision function is \(f(\bm{x})=\text{sign}(\bm{w}^{*T}\bm{x}+b^*)\).
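For example, with a hypothetical solution \(\bm{w}^* = (0.5, 0.5)^T\), \(b^* = -1\) (illustrative values, not derived from any dataset in the text), the decision function can be sketched as:

```python
import numpy as np

w_star, b_star = np.array([0.5, 0.5]), -1.0  # hypothetical solution

def f(x):
    # classification decision function f(x) = sign(w*^T x + b*)
    return np.sign(w_star @ x + b_star)
```

Points with \(\bm{w}^{*T}\bm{x}+b^* > 0\) are classified as positive, the rest as negative.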


Conventional constrained optimization algorithm

As in the previous chapter, we solve the original problem \((3)\) through its dual. The dual problem is as follows (the derivation is the same as in the previous chapter):

\[ \begin{aligned}
\underset{\bm{\alpha}}{\max} &\quad -\frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i\alpha_jy^{(i)}y^{(j)}\langle \bm{x}^{(i)}, \bm{x}^{(j)}\rangle + \sum_{i=1}^{m}\alpha_i \\
\text{s.t.} &\quad \sum_{i=1}^{m}\alpha_iy^{(i)} = 0 \\
&\quad C - \alpha_i - \mu_i = 0\\
&\quad \alpha_i \geqslant 0 \\
&\quad \mu_i \geqslant 0 \\
&\quad (i=1, 2,\dots,m)
\end{aligned} \tag{4} \]

Next, we eliminate \(\mu_i\) and keep only \(\alpha_i\): since \(\mu_i = C - \alpha_i \geqslant 0\) and \(\alpha_i \geqslant 0\), the constraints reduce to \(0 \leqslant \alpha_i \leqslant C\). Negating the objective to turn the maximization in \((4)\) into a minimization, we obtain:

\[ \begin{aligned}
\underset{\bm{\alpha}}{\min} &\quad \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i\alpha_jy^{(i)}y^{(j)}\langle \bm{x}^{(i)}, \bm{x}^{(j)}\rangle - \sum_{i=1}^{m}\alpha_i \\
\text{s.t.} &\quad \sum_{i=1}^{m}\alpha_iy^{(i)} = 0 \\
&\quad 0 \leqslant \alpha_i \leqslant C \\
&\quad (i=1, 2,\dots,m)
\end{aligned} \tag{5} \]

This is the dual form of the original optimization problem \((3)\).

Solving it, we obtain the solution of the dual problem \(\bm{\alpha}^*=(\alpha_1^*, \alpha_2^*, \dots, \alpha_m^*)^T\).
The situation is now more complex than in the fully linearly separable case. For example, positive examples are not only distributed on the positive side of the soft margin boundary; due to misclassification, some may lie on the side corresponding to the negative examples, and likewise for negative examples. This is shown in the figure below:


In the figure, the solid line is the separating hyperplane and the dashed lines are the margin boundaries; “\(\circ\)” marks positive examples and “\(\times\)” marks negative examples, and \(\frac{\xi_i}{||\bm{w}||}\) is the distance from the example \(\bm{x}^{(i)}\) to the margin boundary.
As in the previous chapter, for \(\alpha_i^* > 0\) (note that \(\alpha_i^* < C\) is not required), the example \(\bm{x}^{(i)}\) of the corresponding sample point \((\bm{x}^{(i)}, y^{(i)})\) is called a support vector (a soft margin support vector). Its distribution is much more complex than in the hard margin case: it can lie on the margin boundary, between the margin boundary and the separating hyperplane, or even on the misclassified side of the separating hyperplane. If \(0 < \alpha_i^* < C\), then \(\xi_i = 0\) and the support vector falls exactly on the margin boundary. If \(\alpha_i^* = C\) and \(0 < \xi_i < 1\), the point is correctly classified and \(\bm{x}^{(i)}\) lies between the margin boundary and the separating hyperplane. If \(\alpha_i^* = C\) and \(\xi_i = 1\), then \(\bm{x}^{(i)}\) lies on the separating hyperplane. If \(\alpha_i^* = C\) and \(\xi_i > 1\), the point is misclassified and \(\bm{x}^{(i)}\) lies on the wrong side of the separating hyperplane.

Next, we need to transform the solution of the dual problem into the solution of the original problem.

Given the solution of the dual problem \(\bm{\alpha}^*=(\alpha_1^*, \alpha_2^*, \dots, \alpha_m^*)^T\), if \(\bm{\alpha}^*\) has a component \(\alpha_s^*\) with \(0 < \alpha_s^* < C\), then the solution \(\bm{w}^*\) and \(b^*\) of the original problem can be obtained by the following formulas:

\[ \begin{aligned}
\bm{w}^{*} &= \sum_{i=1}^{m}\alpha_i^{*}y^{(i)}\bm{x}^{(i)} \\
b^{*} &= y^{(s)} - \sum_{i=1}^{m}\alpha_i^*y^{(i)}\langle \bm{x}^{(s)}, \bm{x}^{(i)} \rangle
\end{aligned} \tag{6} \]

These formulas are derived in the same way as in the previous chapter, from the fact that the solution of the original problem (a convex quadratic programming problem) satisfies the KKT conditions.

In this way, we can get the separation hyperplane as follows:

\[ \{\bm{x} \mid \sum_{i=1}^{m}\alpha_i^*y^{(i)}\langle \bm{x}^{(i)}, \bm{x} \rangle + b^* = 0\} \]

The classification decision function is as follows:

\[ f(\bm{x}) = \text{sign}\left(\sum_{i=1}^{m}\alpha_i^*y^{(i)}\langle \bm{x}^{(i)}, \bm{x} \rangle + b^*\right) \]

(As before, the reason for writing the inner product form \(\langle \bm{x}^{(i)}, \bm{x} \rangle\) is to make it convenient to introduce kernel functions later.)

To sum up, the algorithm for solving the linear support vector machine defined by formula \((3)\) is as follows:
Note that in step (2) of the algorithm, \(b^*\) is calculated from a randomly chosen component \(\alpha_s^*\); therefore the \(b\) obtained by solving the strategy (original problem) of formula \((3)\) may not be unique.
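The whole procedure can also be tried numerically on a toy case. The sketch below solves the dual with `scipy.optimize.minimize` (SLSQP) and then recovers \(\bm{w}^*\) and \(b^*\); the two-point dataset and the value of \(C\) are hypothetical, chosen only for illustration:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical toy data: one negative and one positive example
X = np.array([[0.0, 0.0],
              [2.0, 2.0]])
y = np.array([-1.0, 1.0])
C = 10.0

# Q_ij = y_i y_j <x_i, x_j>
Q = (y[:, None] * y[None, :]) * (X @ X.T)

def dual_obj(a):
    # minimization form of the dual: 1/2 a^T Q a - sum(a)
    return 0.5 * a @ Q @ a - a.sum()

res = minimize(dual_obj, np.zeros(len(y)),
               bounds=[(0.0, C)] * len(y),
               constraints={'type': 'eq', 'fun': lambda a: a @ y})
alpha = res.x

# Recover w* and b* from a component with 0 < alpha_s < C
w = (alpha * y) @ X
s = int(np.argmax((alpha > 1e-6) & (alpha < C - 1e-6)))
b = y[s] - (alpha * y * (X @ X[s])).sum()
```

For this symmetric pair the separating hyperplane passes through the midpoint \((1, 1)\), i.e. \(\bm{w}^* = (0.5, 0.5)^T\), \(b^* = -1\).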

Unconstrained optimization algorithm based on hinge loss function

In fact, problem \((3)\) can also be written in unconstrained form, with the following objective function:

\[ \underset{\bm{w}, b}{\min} \; \frac{1}{m}\sum_{i=1}^{m}\max(0,\ 1-y^{(i)}(\bm{w}^{T}\bm{x}^{(i)} + b))+\lambda||\bm{w}||^2 \]

The first term of this objective is the empirical risk, or empirical loss (together with the regularization term it forms the structural risk). The function

\[ L(y(\bm{w}^{T}\bm{x} + b))=\max(0,\ 1-y(\bm{w}^{T}\bm{x} + b)) \]

is called the hinge loss function.
The hinge loss function means that if a sample \((\bm{x}^{(i)}, y^{(i)})\) is correctly classified with a functional margin (confidence) greater than or equal to 1 (i.e. \(y^{(i)}(\bm{w}^{T}\bm{x}^{(i)} + b) \geqslant 1\)), the loss is 0; otherwise the loss is \(1-y^{(i)}(\bm{w}^{T}\bm{x}^{(i)} + b)\). In the classification figure above, for example, \(\bm{x}^{(4)}\) is correctly classified, but its functional margin is not greater than 1, so its loss is not 0.
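A minimal numpy sketch of this hinge-loss objective (the data below are hypothetical, for illustration only):

```python
import numpy as np

def svm_objective(w, b, X, y, lamb):
    # average hinge loss max(0, 1 - y_i (w^T x_i + b)) plus L2 penalty
    margins = y * (X @ w + b)
    return np.maximum(0.0, 1.0 - margins).mean() + lamb * np.dot(w, w)

X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, -1.0])
```

At \(\bm{w}=\bm{0}, b=0\) every margin is 0, so every hinge loss equals 1 and the objective is exactly 1.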
The 0-1 loss function, the hinge loss function and the perceptron loss function are summarized as follows:

\[ \begin{aligned}
L_{0\text{-}1}(z) &= \begin{cases} 1, & z \leqslant 0 \\ 0, & z > 0 \end{cases} \\
L_{\text{hinge}}(z) &= \max(0,\ 1-z) \\
L_{\text{perceptron}}(z) &= \max(0,\ -z)
\end{aligned} \]

where \(z = y(\bm{w}^{T}\bm{x} + b)\).

The image comparison of these three functions is shown in the following figure:


As the figure shows, the hinge loss function resembles a hinge, hence its name. Although the 0-1 loss function is the true loss function of the binary classification problem, it is not continuous or differentiable, and optimizing it directly is NP-hard, so we instead optimize an objective built from one of its upper bounds, the hinge loss function. Such an upper-bound loss is also called a surrogate loss function. The perceptron loss function can be understood intuitively: when \((\bm{x}^{(i)}, y^{(i)})\) is correctly classified the loss is 0, and otherwise it is \(-y(\bm{w}^{T}\bm{x} + b)\). By contrast, the hinge loss is 0 only when the point is not merely correctly classified but classified with a sufficiently large margin; in other words, the hinge loss function places higher demands on the learning result.
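The upper-bound relationship between these losses is easy to verify numerically; here is a short numpy sketch over a grid of margin values:

```python
import numpy as np

z = np.linspace(-2.0, 2.0, 9)  # z = y (w^T x + b), the functional margin

zero_one = np.where(z <= 0, 1.0, 0.0)  # true 0-1 loss (convention: loss 1 at z <= 0)
hinge = np.maximum(0.0, 1.0 - z)       # surrogate loss, upper bound of 0-1
perceptron = np.maximum(0.0, -z)       # perceptron loss
```

Since \(1-z \geqslant -z\) everywhere, the hinge loss dominates the perceptron loss as well as the 0-1 loss.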
The hinge loss function is continuous everywhere, though not differentiable at the hinge point, so gradient-based numerical optimization (gradient descent, subgradient methods, and so on) can be applied; we will not repeat the details here. Note that the objective is convex but not smooth, so convergence can be slow.


In the last chapter, we tried to solve the convex quadratic programming problem \((3)\) directly, without the help of the SMO algorithm, and found that convergence was too slow. Now we try gradient descent on the hinge-loss objective to solve for the parameters \(\bm{w}\) and \(b\) directly.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np
import math
import torch

# Data preprocessing
def preprocess(X, y):
    # Min-max normalize each feature of X
    for i in range(X.shape[1]):
        field = X[:, i]
        X[:, i] = (field - field.min()) / (field.max() - field.min())
    # Convert labels uniformly to +1 / -1
    y = np.where(y == 1, y, -1)
    return X, y

class SVM():
    def __init__(self, lamb=0.01, input_dim=256):
        self.lamb = lamb
        self.w, self.b = np.random.rand(
            input_dim), 0.0

    # The loss over a batch can be computed in parallel (vectorized)
    def objective_func(self, X, y):
        margin = torch.mul(y, torch.matmul(X, self.w) + self.b)
        # hinge loss: max(0, 1 - margin)
        loss_vec = torch.clamp(1 - margin, min=0)
        return loss_vec.mean() + self.lamb * torch.norm(self.w)**2

    def train_data_loader(self, X_train, y_train):
        # Shuffle once per epoch, then iterate over mini-batches
        data = np.concatenate((X_train, y_train.reshape(-1, 1)), axis=1)
        for ep in range(self.epoch):
            np.random.shuffle(data)
            X_sh, y_sh = data[:, :-1], data[:, -1]
            for bz in range(math.ceil(self.m / self.batch_size)):
                start = bz * self.batch_size
                yield (
                    torch.tensor(X_sh[start: start + self.batch_size], dtype=torch.float64),
                    torch.tensor(y_sh[start: start + self.batch_size], dtype=torch.float64))

    def test_data_loader(self, X_test):
        # Iterate over the test set in batches
        for bz in range(math.ceil(self.m_t/self.test_batch_size)):
            start = bz * self.test_batch_size
            yield X_test[start: start+self.test_batch_size]

    def compile(self, **kwargs):
        self.batch_size = kwargs['batch_size']
        self.test_batch_size = kwargs['test_batch_size']
        self.eta = kwargs['eta']      # learning rate
        self.epoch = kwargs['epoch']  # number of passes over the training set

    def sgd(self, params):
        with torch.no_grad():
            for param in params:
                param -= self.eta * param.grad
                param.grad.zero_()  # reset gradients for the next batch

    def fit(self, X_train, y_train):
        self.m = X_train.shape[0]  # number of training samples
        # w is initialized to random values in __init__; it must not be initialized to 0
        self.w, self.b = torch.tensor(
            self.w, dtype=torch.float64, requires_grad=True), torch.tensor(self.b, dtype=torch.float64, requires_grad=True)
        for X, y in self.train_data_loader(X_train, y_train):
            loss_v = self.objective_func(X, y)
            loss_v.backward()  # compute gradients before the SGD update
            self.sgd([self.w, self.b])
            # print(self.w, self.b)
        self.w = self.w.detach().numpy()
        self.b = self.b.detach().numpy()

    def pred(self, X_test):
        #Traverse each sample in the test set
        self.m_t = X_test.shape[0]
        pred_list = []
        for x in self.test_data_loader(X_test):
            pred_list.append(np.sign(np.matmul(x, self.w) + self.b))
        return np.concatenate(pred_list, axis=0)

if __name__ == "__main__":
    X, y = load_breast_cancer(return_X_y=True)
    X, y = preprocess(X, y)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
    clf = SVM(lamb=0.01, input_dim=X_train.shape[1])
    clf.compile(batch_size=256, test_batch_size=1024, eta=0.001, epoch=1000)  # training hyper-parameters
    clf.fit(X_train, y_train)
    y_pred = clf.pred(X_test)
    acc_score = accuracy_score(y_test, y_pred)
    print("The accuracy is: %.1f" % acc_score)

As you can see, the result is not great yet; it is still being debugged:

The accuracy is: 0.6