Machine Learning (4): An Intuitive Understanding of SVM and Code Practice

Date: 2021-06-22

In the last article we introduced the use of logistic regression for classification problems. In this article we look at a more powerful classification model. The focus is still on code practice; you will find that we have more and more ways to solve problems, and that handling them becomes simpler and simpler.

Support vector machine (SVM) is one of the most popular machine learning models. It is especially suitable for the classification of small and medium-sized complex data sets.

1. What is a support vector machine?

SVM searches for an optimal decision boundary among many instances. The instances that lie on the margin boundary are called support vectors, and the model is called a support vector machine because these vectors "support" (determine) the separating hyperplane.

So how do we ensure that the decision boundary we get is optimal?

[Figure: three straight lines that all separate the two classes]

As shown in the figure above, all three black lines separate the data set perfectly. Using a single straight line we could therefore obtain infinitely many solutions. So which line is the best?

[Figure: the maximum-margin line, with dashed margin boundaries and highlighted support vectors]

As shown in the figure above, if we measure the distance from the line to the nearest instances and require the line to stay as far away from them as possible, we obtain a unique solution. Our goal is to maximize the distance between the two dashed lines in the figure, i.e. the margin. The highlighted instances are called support vectors.

This is the support vector machine.
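To make "margin" and "support vector" concrete, here is a minimal sketch on a tiny made-up data set (not the article's data): for a linear SVM the margin width equals 2/||w||, where w is the learned weight vector, and the instances sitting on the margin are exposed through support_vectors_.

import numpy as np
from sklearn.svm import SVC

#Tiny hypothetical toy set: two clearly separated classes
X_toy = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
                  [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_model = SVC(kernel='linear', C=1000).fit(X_toy, y_toy)

#The instances that define (support) the separating line
print(toy_model.support_vectors_)

#Margin width of a linear SVM: 2 / ||w||
w = toy_model.coef_[0]
print(2 / np.linalg.norm(w))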

2. Understanding the theory through code

2.1 Importing the data set

Add the imports:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Import the data set (you don't need to worry about this domain name):

df = pd.read_csv('https://blog.caiyongji.com/assets/mouse_viral_study.csv')
df.head()
Med_1_mL Med_2_mL Virus Present
0 6.50823 8.58253 0
1 4.12612 3.07346 1
2 6.42787 6.36976 0
3 3.67295 4.90522 1
4 1.58032 2.44056 1

The data set simulates a medical study: mice exposed to a virus were treated with two drugs at different doses, and after two weeks it was recorded whether each mouse was infected.

  • Features: drug dose Med_1_mL and drug dose Med_2_mL
  • Label: whether the mouse is infected with the virus (1 infected / 0 uninfected)
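Before modeling, a quick optional check of the label distribution and dose ranges (this step is not in the original walkthrough; it only confirms what the two columns contain):

#Optional sanity check: class balance and feature ranges
print(df['Virus Present'].value_counts())
print(df.describe())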

2.2 Observing the data

sns.scatterplot(x='Med_1_mL',y='Med_2_mL',hue='Virus Present',data=df)

We use Seaborn to draw a scatter plot of the infection outcome against the two drug-dose features.

[Figure: scatter plot of Med_1_mL vs Med_2_mL, colored by Virus Present]

sns.pairplot(df,hue='Virus Present')

We use the pairplot method to draw the pairwise relationships between the two features.

[Figure: pairplot of the two dose features, colored by Virus Present]

We can make a rough judgment: increasing the drug dosage tends to prevent the mice from being infected.

2.3 Training on the data set with SVM

#SVC: support vector classifier
from sklearn.svm import SVC

#Data preparation
y = df['Virus Present']
X = df.drop('Virus Present',axis=1) 

#Define the model
model = SVC(kernel='linear', C=1000)

#Training model
model.fit(X, y)

#Draw an image
#Defining the method of drawing SVM boundary
def plot_svm_boundary(model,X,y):
    
    X = X.values
    y = y.values
    
    # Scatter Plot
    plt.scatter(X[:, 0], X[:, 1], c=y, s=30,cmap='coolwarm')

    
    # plot the decision function
    ax = plt.gca()
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()

    # create grid to evaluate model
    xx = np.linspace(xlim[0], xlim[1], 30)
    yy = np.linspace(ylim[0], ylim[1], 30)
    YY, XX = np.meshgrid(yy, xx)
    xy = np.vstack([XX.ravel(), YY.ravel()]).T
    Z = model.decision_function(xy).reshape(XX.shape)

    # plot decision boundary and margins
    ax.contour(XX, YY, Z, colors='k', levels=[-1, 0, 1], alpha=0.5,
               linestyles=['--', '-', '--'])
    # plot support vectors
    ax.scatter(model.support_vectors_[:, 0], model.support_vectors_[:, 1], s=100,
               linewidth=1, facecolors='none', edgecolors='k')
    plt.show()
plot_svm_boundary(model,X,y)

[Figure: linear SVM decision boundary, margins, and circled support vectors (C=1000)]

We import the SVC (support vector classifier) class from sklearn, which is an implementation of SVM.
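Because we used kernel='linear', the fitted model also exposes its weight vector and the number of support vectors per class; reading them is optional but connects the plot above to the theory in section 1:

#Optional inspection of the fitted linear model
print(model.n_support_)      #number of support vectors per class
print(model.coef_[0])        #weight vector w of the separating line
print(model.intercept_[0])   #intercept b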

2.4 SVC parameter C

The SVC parameter C controls the L2 regularization; the regularization strength is inversely proportional to the value of C. In other words, the larger C is, the weaker the regularization, and C must be strictly positive.

model = SVC(kernel='linear', C=0.05)
model.fit(X, y)
plot_svm_boundary(model,X,y)

When we reduce the value of C, we can see that the model fits the data more loosely: the margin widens and more instances are allowed inside it.

[Figure: linear SVM decision boundary with C=0.05, showing a wider margin]
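To see the effect of C more directly, we can loop over a few values and redraw the boundary each time (an optional experiment reusing plot_svm_boundary; the particular values are arbitrary):

#Optional: compare several regularization strengths
for c in [0.01, 0.1, 1, 1000]:
    m = SVC(kernel='linear', C=c)
    m.fit(X, y)
    print('C =', c, '-> support vectors:', len(m.support_vectors_))
    plot_svm_boundary(m, X, y)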

2.5 The kernel trick

The SVC kernel parameter can take the values {'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'}. As we did above, setting kernel='linear' gives linear classification. So what do we do for nonlinear classification?

2.5.1 Polynomial kernel

The polynomial kernel, kernel='poly', in a nutshell generates multiple features from a single feature so that a curve can be fitted. For example, we can expand the relationship from X to y as follows:

X X^2 X^3 y
0 6.50823 6.50823**2 6.50823**3 0
1 4.12612 4.12612**2 4.12612**3 1
2 6.42787 6.42787**2 6.42787**3 0
3 3.67295 3.67295**2 3.67295**3 1
4 1.58032 1.58032**2 1.58032**3 1

So we can fit the data set with a curve.
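The expansion in the table can be reproduced explicitly with sklearn's PolynomialFeatures. This is only an illustration of the idea: SVC(kernel='poly') never materializes these columns, because the kernel trick computes the same result implicitly.

#Illustration only: explicit polynomial expansion of one feature column
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=3, include_bias=False)
X_expanded = poly.fit_transform(df[['Med_1_mL']])   #columns: X, X^2, X^3
print(poly.get_feature_names_out())
print(X_expanded[:5])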

model = SVC(kernel='poly', C=0.05,degree=5)
model.fit(X, y)
plot_svm_boundary(model,X,y)

We use the polynomial kernel and set degree=5, so the maximum degree of the polynomial is 5. We can see that the decision boundary now has a certain curvature.

[Figure: polynomial-kernel SVM decision boundary with a slight curve (degree=5)]

2.5.2 Gaussian RBF kernel

The default kernel of SVC is the Gaussian RBF, i.e. the radial basis function. Here we need to introduce the gamma parameter, which controls the shape of the bell-shaped function. Increasing gamma makes the bell curve narrower, so the range of influence of each instance is smaller and the decision boundary becomes more irregular. Decreasing gamma makes the bell curve wider, so each instance has a larger range of influence and the decision boundary is smoother.

model = SVC(kernel='rbf', C=1,gamma=0.01)
model.fit(X, y)
plot_svm_boundary(model,X,y)

[Figure: RBF-kernel SVM decision boundary (C=1, gamma=0.01)]
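To observe the narrowing effect described above, we can refit with a much larger gamma and compare the two plots (an optional experiment; the value 1 is chosen arbitrarily):

#Optional: a larger gamma shrinks each instance's range of influence,
#so the boundary hugs the training points more tightly
model_large_gamma = SVC(kernel='rbf', C=1, gamma=1)
model_large_gamma.fit(X, y)
plot_svm_boundary(model_large_gamma, X, y)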

2.6 Parameter tuning: grid search

from sklearn.model_selection import GridSearchCV
svm = SVC()
param_grid = {'C':[0.01,0.1,1],'kernel':['rbf','poly','linear','sigmoid'],'gamma':[0.01,0.1,1]}
grid = GridSearchCV(svm,param_grid)
grid.fit(X,y)
print("grid.best_params_ = ",grid.best_params_,", grid.best_score_ =" ,grid.best_score_)

We can use GridSearchCV to traverse the possible combinations of hyperparameters and find the optimal ones. This is brute-force parameter tuning that relies on computing power. Of course, during the analysis stage we must limit the candidate range of each parameter for this method to be practical.

Because the data set is so simple, we already reach 100% accuracy at the first combination tried. The output is as follows:

grid.best_params_ =  {'C': 0.01, 'gamma': 0.01, 'kernel': 'rbf'} , grid.best_score_ = 1.0
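After the search, the best model (already refitted on the full data) is available as grid.best_estimator_ and can be used for prediction like any fitted SVC; here we simply predict on the first few training rows, since this tutorial keeps no separate test set:

#The best model found by the grid search
best_model = grid.best_estimator_
print(best_model)
print(best_model.predict(X.head()))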

Summary

When dealing with linearly separable data sets, we can train with SVC(kernel='linear'); we can also use the faster LinearSVC, especially when the training set is very large or has many features (see the sketch below).
When dealing with nonlinear SVM classification, we can use the Gaussian RBF kernel, the polynomial kernel, or the sigmoid kernel to fit a nonlinear model. We can also find the optimal hyperparameters with GridSearchCV.
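A minimal sketch of the LinearSVC alternative mentioned above (note that LinearSVC uses a different solver than SVC(kernel='linear'), so the resulting coefficients may differ slightly):

#LinearSVC: a faster linear SVM implementation, useful for large data sets
from sklearn.svm import LinearSVC

linear_model = LinearSVC(C=1, max_iter=10000)
linear_model.fit(X, y)
print(linear_model.coef_, linear_model.intercept_)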
