# Machine learning basics: linear models

Date: 2020-01-22

## Linear model for regression

### Linear regression

What is linear regression?

The general model of regression problem is as follows:
$$y = \sum_i w[i] \cdot x[i] + b$$
As shown in the figure below, for one-dimensional data, linear regression fits a straight line $y = ax + b$ to the given points $(x_i, y_i)$, that is, it finds the coefficients $a$ and $b$.

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import mglearn
import warnings
warnings.filterwarnings('ignore')
```

```python
mglearn.plots.plot_linear_regression_wave()
```

```
w: 0.393906  b: -0.031804
```

Extended to multidimensional data, training a linear regression model means finding the parameter vector $w$; only the fitting target changes, from a line to a high-dimensional plane. The two most common fitting methods are ordinary least squares (OLS) and gradient descent. In Python, both the sklearn and statsmodels packages implement linear regression: sklearn is a general-purpose machine learning package, while statsmodels leans toward statistics. First, we train a model with sklearn's LinearRegression:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = mglearn.datasets.make_wave(n_samples=60)  # import data

# Dataset split: the same random_state gives the same partition
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# Standard sklearn-style API: model().fit(X, y)
lr = LinearRegression().fit(X_train, y_train)

print('coefficient: {}'.format(lr.coef_))
print('intercept: {}'.format(lr.intercept_))
print('training precision: {}'.format(lr.score(X_train, y_train)))
print('test precision: {}'.format(lr.score(X_test, y_test)))
```

```
coefficient: [0.39390555]
intercept: -0.031804343026759746
training precision: 0.67008903150757556
test precision: 0.65933685968637
```

Sklearn fits the model with OLS, and score returns the coefficient of determination ($R^2$). The test-set $R^2$ is only about 0.66, which is not good; this is because the original data are one-dimensional. As the data dimensionality increases, linear models can become very powerful. Next, we train the model with statsmodels:

```python
import statsmodels.api as sm

# Add a constant term to the model; without it the fitted line passes through the origin
X_train_const = sm.add_constant(X_train)
# Train the model
ols = sm.OLS(y_train, X_train_const).fit()
# Output the statistical report
print(ols.summary())
```
```
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.670
Method:                 Least Squares   F-statistic:                     87.34
Date:                Sun, 22 Sep 2019   Prob (F-statistic):           6.46e-12
Time:                        21:54:35   Log-Likelihood:                -33.187
No. Observations:                  45   AIC:                             70.37
Df Residuals:                      43   BIC:                             73.99
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.0318      0.078     -0.407      0.686      -0.190       0.126
x1             0.3939      0.042      9.345      0.000       0.309       0.479
==============================================================================
Omnibus:                        0.703   Durbin-Watson:                   2.369
Prob(Omnibus):                  0.704   Jarque-Bera (JB):                0.740
Skew:                          -0.081   Prob(JB):                        0.691
Kurtosis:                       2.393   Cond. No.                         1.90
==============================================================================

Warnings:
Standard Errors assume that the covariance matrix of the errors is correctly specified.
```



The summary output is a table in the style of Eviews or Minitab. The R-squared is 0.67, matching the sklearn result, and the model's F-statistic and the parameters' t-values show that the result is significant.

#### Ordinary least squares (OLS)

For one-variable linear regression, minimizing the mean squared error gives the closed-form estimates
$$\theta_1 = \frac{\sum_i (y_i - \bar{y})(x_i - \bar{x})}{\sum_i (x_i - \bar{x})^2}, \qquad \theta_0 = \bar{y} - \theta_1 \bar{x}$$
In matrix form, setting the derivative of the loss to zero gives $\mathbf{X}^{T}(\mathbf{X\theta} - \mathbf{Y}) = 0$, and finally
$$\mathbf{\theta} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{Y}$$
The least squares principle is to minimize the MSE by taking derivatives to obtain the parameters $\theta$. Next, we introduce another method, gradient descent.

#### Gradient descent method

Gradient descent is a purely computational method. The loss function is a function of the coefficients; one-variable linear regression has two parameters, which form a loss surface. First we pick a set of coefficients at random, i.e. a random initial point on that surface, and then repeatedly perform the following simultaneous updates
$$\theta_0 := \theta_0 - \alpha \frac{\partial J(\theta)}{\partial \theta_0}, \qquad \theta_1 := \theta_1 - \alpha \frac{\partial J(\theta)}{\partial \theta_1}$$
where ":=" denotes assignment; this step is repeated until termination.
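The closed-form solution above can be checked numerically with NumPy. This is a minimal sketch on synthetic data; the true coefficients 0.4 and 0.1 are made up for the illustration:

```python
import numpy as np

# Verify theta = (X^T X)^{-1} X^T Y on synthetic one-dimensional data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 0.4 * X[:, 0] + 0.1 + rng.normal(scale=0.1, size=100)

# Prepend a column of ones so the intercept is estimated as theta[0]
X_design = np.column_stack([np.ones(len(X)), X])

# Solve the normal equations; lstsq is numerically safer than an explicit inverse
theta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(theta)  # approximately [0.1, 0.4]
```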

Let's analyze what happens. The step size $\alpha$ multiplying the partial derivative is a positive number. When the partial derivative is greater than zero, $J(\theta)$ increases as $\theta_i$ increases, and the new $\theta_i$ is smaller than the old one, so $J(\theta)$ decreases; when the partial derivative is less than zero, $J(\theta)$ decreases as $\theta_i$ increases, and the new $\theta_i$ is larger than the old one, so $J(\theta)$ still decreases. That is, every iteration of the loop decreases the loss function, until a local minimum is reached, as shown in the figure above.

Our loss function is convex, not a surface with multiple local minima; its true shape is shown below, so the local minimum is also the global minimum. Algorithm steps:

• Determine the loss function
• Initialize the coefficients and step size
• Update the coefficients
• Repeat the update until termination

• Batch gradient descent, described above, uses all the data to compute the loss function and gradient at each update.
• Stochastic gradient descent uses a single randomly chosen sample to compute the gradient each time.
• Mini-batch gradient descent uses a subset of the data to compute the gradient.
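The steps above can be sketched for one-variable regression. This is a minimal illustration, not library code; the data, learning rate, and iteration count are made up for the example:

```python
import numpy as np

# Batch gradient descent for one-variable linear regression.
# J(theta) = (1/2n) * sum((theta0 + theta1*x - y)^2); both parameters
# are updated simultaneously in each iteration, as described above.
def gradient_descent(x, y, alpha=0.1, n_iter=1000):
    theta0, theta1 = 0.0, 0.0          # initialize the coefficients
    n = len(x)
    for _ in range(n_iter):
        err = theta0 + theta1 * x - y  # residuals over the whole batch
        g0 = err.sum() / n             # dJ/dtheta0
        g1 = (err * x).sum() / n       # dJ/dtheta1
        theta0 -= alpha * g0           # simultaneous update
        theta1 -= alpha * g1
    return theta0, theta1

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 1                          # exact line y = 2x + 1
print(gradient_descent(x, y))          # converges to ~(1.0, 2.0)
```

Using the whole batch for each gradient makes this the batch variant; replacing `x, y` by a single random sample per step would give stochastic gradient descent.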

### Generalization of linear regression: polynomial regression

For one-variable linear regression, when the dependent variable $y$ is not linear in $x$, linear regression cannot be applied directly. By Taylor's theorem (expanding at $a = 0$), $y$ can be expressed linearly in terms of $x, x^2, x^3, \dots$, so we can take $x^n$ as additional variables and turn the one-variable regression into a multiple linear regression, increasing the accuracy of the model.
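A minimal sketch of this idea with sklearn's PolynomialFeatures; the quadratic data below are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Fit y = x^2 by adding the polynomial term x^2 as an extra feature,
# turning a one-variable nonlinear fit into a multiple linear regression
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = x[:, 0] ** 2

X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)
model = LinearRegression().fit(X_poly, y)
print(model.score(X_poly, y))  # close to 1.0: the quadratic term captures the curve
```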

### Generalized linear regression

That is, by taking logarithms, variables without a linear relationship are transformed into an approximately linear relationship so that linear regression applies: $$\ln \mathbf{y} = \mathbf{X\theta}$$
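For example, if $y$ grows exponentially with $x$, regressing $\ln y$ on $x$ recovers a linear relationship. A sketch on synthetic, noise-free data (the growth rate 0.5 is made up):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# y = exp(0.5 x) is nonlinear in x, but ln(y) = 0.5 x is linear
x = np.linspace(0, 5, 50).reshape(-1, 1)
y = np.exp(0.5 * x[:, 0])

model = LinearRegression().fit(x, np.log(y))
print(model.coef_)  # recovers the slope ~0.5
```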

### Ridge regression

Ridge regression adds an L2 penalty (the squared L2 norm of the coefficients) to the least squares loss:
$$J(\mathbf\theta) = \frac{1}{2}(\mathbf{X\theta} - \mathbf{Y})^T(\mathbf{X\theta} - \mathbf{Y}) + \frac{1}{2}\alpha||\theta||_2^2$$

```python
# Ridge regression in sklearn
from sklearn.linear_model import Ridge

X, y = mglearn.datasets.load_extended_boston()
print('data scale: {}'.format(X.shape))
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ridge = Ridge(alpha=1).fit(X_train, y_train)
lr = LinearRegression().fit(X_train, y_train)

print('linear regression precision ([training, test]): {}'.format(
    [lr.score(X_train, y_train), lr.score(X_test, y_test)]))
print('ridge regression precision ([training, test]): {}'.format(
    [ridge.score(X_train, y_train), ridge.score(X_test, y_test)]))
```

```
data scale: (506, 104)
linear regression precision ([training, test]): [0.9520519609032729, 0.6074721959665752]
ridge regression precision ([training, test]): [0.885796658517094, 0.7527683481744755]
```

As you can see, the extended Boston dataset has 104 features but only 506 samples. The training accuracy of the ridge model is lower than that of linear regression, but its test accuracy is higher.
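How $\alpha$ controls the strength of the L2 penalty can be seen directly: larger $\alpha$ shrinks the coefficient vector toward zero. A sketch on synthetic data generated with sklearn's make_regression (not the Boston set; the sizes and alphas are made up):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic regression problem with more features than ridge strictly needs
X, y = make_regression(n_samples=60, n_features=30, noise=10, random_state=0)

# The L2 norm of the learned coefficients shrinks as alpha grows
norms = {alpha: np.linalg.norm(Ridge(alpha=alpha).fit(X, y).coef_)
         for alpha in [0.01, 1, 100]}
for alpha, norm in norms.items():
    print(alpha, norm)
```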

The overfitting problem can also be alleviated by increasing the amount of data. As shown in the figure below, as the amount of data grows, the test accuracy of linear regression approaches that of ridge.

```python
mglearn.plots.plot_ridge_n_samples()
```

### Lasso

Lasso uses L1 regularization: the penalty term is the L1 norm, which can drive some feature coefficients exactly to zero. This makes the model easier to interpret and reveals its important features. Because of the absolute values, the loss has non-differentiable points, so OLS and plain gradient descent are unavailable (sklearn uses coordinate descent instead).
$$J(\mathbf\theta) = \frac{1}{2n}(\mathbf{X\theta} - \mathbf{Y})^T(\mathbf{X\theta} - \mathbf{Y}) + \alpha||\theta||_1$$

```python
# Lasso implementation
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.1, max_iter=1000000).fit(X_train, y_train)
print('training precision: {}'.format(lasso.score(X_train, y_train)))
print('test precision: {}'.format(lasso.score(X_test, y_test)))
print('number of features used in the model: {}'.format(np.sum(lasso.coef_ != 0)))
```

```
training precision: 0.7709955157630054
test precision: 0.63020997610041
number of features used in the model: 8
```
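The effect of $\alpha$ on sparsity can be illustrated on synthetic data. A sketch using make_regression rather than the Boston set; the feature counts and alphas are made up:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Only 5 of the 30 features are truly informative;
# larger alpha zeroes out more Lasso coefficients
X, y = make_regression(n_samples=100, n_features=30, n_informative=5,
                       noise=1, random_state=0)

used = {alpha: int(np.sum(Lasso(alpha=alpha).fit(X, y).coef_ != 0))
        for alpha in [0.1, 1, 10]}
for alpha, n in used.items():
    print(alpha, n)
```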

For linear models for binary classification, see the notes on SVM.

## Linear models for multiclass classification

Many linear models are not directly suitable for multiclass problems, but the one-vs-rest method can be used. For example, to divide data into three classes A, B, and C, three classifiers are trained, one per class: the class-A classifier separates class A from not-class-A, and so on. A point claimed by several classes at once (say both A and B), or by none, is assigned to the class whose classifier gives the highest score.

```python
from sklearn.datasets import make_blobs

X, y = make_blobs(random_state=42)
mglearn.discrete_scatter(X[:, 0], X[:, 1], y)
plt.xlabel('Feature 0')
plt.ylabel('Feature 1')
plt.legend(['Class0', 'Class1', 'Class2'])
```

A linear SVM is used to classify the data above:

```python
from sklearn.svm import LinearSVC

LSVC = LinearSVC().fit(X, y)
LSVC.coef_, LSVC.intercept_, LSVC.score(X, y)
```

```
(array([[-0.17492558,  0.23141285],
        [ 0.4762191 , -0.06937294],
        [-0.18914557, -0.20399693]]),
 array([-1.0774515 ,  0.13140521, -0.08604887]),
 1.0)
```



It can be seen that LinearSVC outputs three lines, each separating one class from the rest. Next we visualize them: the three lines divide the plane into seven regions, and points in the overlapping regions are assigned to the class with the highest score.
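The highest-score rule can be checked directly: in the one-vs-rest scheme, `predict` is exactly the argmax of `decision_function` over the per-class scores. A small sketch refitting the same blobs data:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

# Points claimed by several classifiers, or by none, go to the
# class whose classifier returns the highest score
X, y = make_blobs(random_state=42)
svc = LinearSVC().fit(X, y)

scores = svc.decision_function(X)             # one score per class, per sample
manual_pred = np.argmax(scores, axis=1)       # pick the highest-scoring class
print(np.all(manual_pred == svc.predict(X)))  # True
```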

```python
mglearn.plots.plot_2d_classification(LSVC, X, fill=True, alpha=.7)
mglearn.discrete_scatter(X[:, 0], X[:, 1], y)
line = np.linspace(-10, 10)
for coef, intercept, color in zip(LSVC.coef_, LSVC.intercept_, ['b', 'r', 'g']):
    # Each boundary satisfies coef[0]*x0 + coef[1]*x1 + intercept = 0
    plt.plot(line, -(line * coef[0] + intercept) / coef[1], c=color)
plt.xlabel('Feature 0')
plt.ylabel('Feature 1')
plt.legend(['Class0', 'Class1', 'Class2', 'Line class 0', 'Line class 1', 'Line class 2'], loc=(1.01, 0))
```