## Linear model for regression

### Linear regression

What is linear regression?

The general form of the linear regression model is as follows:

$$y = \sum_i w[i]\cdot x[i] + b$$

As shown in the figure below, for one-dimensional data, linear regression fits a straight line $y = ax + b$ to the given points $(x_i, y_i)$, that is, it finds the coefficients $a$ and $b$.

```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import mglearn
import warnings
warnings.filterwarnings('ignore')
mglearn.plots.plot_linear_regression_wave()
```

```
w[0]: 0.393906 b: -0.031804
```

Extended to multidimensional data, training a linear regression model means finding the parameter vector **w**; the fitting target simply becomes a high-dimensional plane. The two most common fitting methods are ordinary least squares (OLS) and gradient descent. In Python, both sklearn and statsmodels can be used for linear regression: sklearn is a general machine learning package, while statsmodels leans toward statistics. First, we train a model with sklearn's LinearRegression:

```
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
X,y = mglearn.datasets.make_wave(n_samples=60) # import data
#Data set partition: the same random state gives the same split of the data set
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=42)
#Standard sklearn-style API: model().fit(X, y)
lr = LinearRegression().fit(X_train,y_train)
print('coefficients: {}'.format(lr.coef_))
print('intercept: {}'.format(lr.intercept_))
print('training score: {}'.format(lr.score(X_train,y_train)))
print('test score: {}'.format(lr.score(X_test,y_test)))
```

```
coefficients: [0.39390555]
intercept: -0.031804343026759746
training score: 0.67008903150757556
test score: 0.65933685968637
```

sklearn fits the model with OLS, and `score` returns the coefficient of determination ($R^2$). The $R^2$ on the test set is only about 0.66, which is not good; this is because the original data is one-dimensional. As the data dimension increases, the linear model can become very powerful. Next, we train the model with statsmodels:

```
import statsmodels.api as sm
#Add a constant term to the model. If this is not done, the fitted line will pass through the origin
x = sm.add_constant(X_train)
#Training model
ols = sm.OLS(y_train,x).fit()
#Output statistical report
print(ols.summary())
```

```
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.670
Model: OLS Adj. R-squared: 0.662
Method: Least Squares F-statistic: 87.34
Date: Sun, 22 Sep 2019 Prob (F-statistic): 6.46e-12
Time: 21:54:35 Log-Likelihood: -33.187
No. Observations: 45 AIC: 70.37
Df Residuals: 43 BIC: 73.99
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -0.0318 0.078 -0.407 0.686 -0.190 0.126
x1 0.3939 0.042 9.345 0.000 0.309 0.479
==============================================================================
Omnibus: 0.703 Durbin-Watson: 2.369
Prob(Omnibus): 0.704 Jarque-Bera (JB): 0.740
Skew: -0.081 Prob(JB): 0.691
Kurtosis: 2.393 Cond. No. 1.90
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
```

The summary output is a table similar in style to Eviews or Minitab statistical reports. The R-squared is 0.67, the same as the sklearn result. Moreover, the F-statistic of the model and the t-values of the parameters show that the result is significant.

#### Ordinary least squares (OLS)

For one-variable linear regression, minimizing the squared error gives $$\theta_1 = \frac{\sum (y_i - \bar y)(x_i - \bar x)}{\sum (x_i - \bar x)^2}, \qquad \theta_0 = \bar y - \theta_1\bar x$$

In matrix form, setting the derivative to zero, $$\mathbf{X}^{T}(\mathbf{X\theta} - \mathbf{Y}) = 0$$ finally gives $$\mathbf{\theta} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{Y}$$

The least squares principle is to minimize the MSE by setting its derivative to zero, obtaining the parameters $\theta$ in closed form. Next, we introduce another method, gradient descent.
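The closed-form solution above can be checked directly with NumPy. This is a minimal sketch on invented toy data (the true line $y = 2x + 1$ is an assumption of the example, not part of the text):

```
import numpy as np

# Toy data: the true line is y = 2x + 1 plus a little noise
rng = np.random.RandomState(0)
x = rng.uniform(-3, 3, 50)
y = 2 * x + 1 + 0.1 * rng.randn(50)

# Design matrix with a leading column of ones for the intercept theta_0
X = np.column_stack([np.ones_like(x), x])

# Normal equation: theta = (X^T X)^{-1} X^T Y
theta = np.linalg.inv(X.T @ X) @ X.T @ y
print(theta)  # close to [1, 2]
```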

#### Gradient descent method

Gradient descent is an iterative numerical optimization method, well suited to computer implementation.

As shown in the figure, the loss function is a function of the coefficients, and one-variable linear regression has two parameters, which form a loss surface. First, we randomly choose a set of coefficients, that is, we randomly pick an initial point on the surface above, and then perform the following simultaneous update: $$\theta_0 := \theta_0 - \alpha\frac{\partial J(\theta)}{\partial\theta_0}, \qquad \theta_1 := \theta_1 - \alpha\frac{\partial J(\theta)}{\partial\theta_1}$$ where ":=" denotes assignment. This step is repeated until termination.

Let's analyze what happens. The step size $\alpha$ multiplying the partial derivative is a positive number. When the partial derivative is greater than zero, $J(\theta)$ increases as $\theta_i$ increases, and the new $\theta_i$ is smaller than the old $\theta_i$, so $J(\theta)$ decreases; when the partial derivative is less than zero, $J(\theta)$ decreases as $\theta_i$ increases, and the new $\theta_i$ is larger than the old $\theta_i$, so $J(\theta)$ still decreases. In other words, every iteration reduces the loss, and we eventually reach a local minimum, as shown in the figure above.

Our loss function is convex, not a surface with multiple local minima. Its actual shape is as follows, so the only local minimum is also the global minimum.

Algorithm steps:

- Determine the loss function
- Initialize the coefficients and the step size
- Update the coefficients
- Repeat the update step until convergence
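The steps above can be sketched for one-variable linear regression with plain NumPy; the learning rate, iteration count, and toy data below are illustrative choices, not prescribed by the text:

```
import numpy as np

# Toy data: the true line is y = 2x + 1 plus noise
rng = np.random.RandomState(0)
x = rng.uniform(-3, 3, 100)
y = 2 * x + 1 + 0.1 * rng.randn(100)

theta0, theta1 = 0.0, 0.0   # initialize the coefficients
alpha = 0.05                # step size (learning rate)

for _ in range(1000):
    pred = theta0 + theta1 * x
    # Partial derivatives of J(theta) = (1/2n) * sum((pred - y)^2)
    grad0 = (pred - y).mean()
    grad1 = ((pred - y) * x).mean()
    # Simultaneous update of both coefficients
    theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1

print(theta0, theta1)  # converges near (1, 2)
```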

**Gradient descent family**

- Batch gradient descent, described above, uses all of the data in each update to compute the loss function and gradient.
- Stochastic gradient descent uses only one randomly chosen sample to compute the gradient each time.
- Mini-batch gradient descent uses a subset of the data to compute the gradient.
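sklearn exposes stochastic gradient descent through `SGDRegressor`. A minimal sketch on synthetic data (the dataset and hyperparameters here are chosen only for illustration):

```
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor

# Synthetic one-feature regression problem
X, y = make_regression(n_samples=200, n_features=1, noise=5.0, random_state=0)

# SGDRegressor updates the coefficients from individual samples
sgd = SGDRegressor(max_iter=1000, tol=1e-3, random_state=0).fit(X, y)
print(sgd.score(X, y))
```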

### Generalization of linear regression: polynomial regression

For one-variable linear regression, when the dependent variable $y$ is not linear in $x$, linear regression cannot be used directly. According to Taylor's theorem, $$f(x) = \sum_{n=0}^{\infty}\frac{f^{(n)}(a)}{n!}(x-a)^n$$

Setting $a = 0$, we see that $y$ can be expressed linearly in $x, x^2, x^3, \dots$, so we can take $x^n$ as additional variables and turn the one-variable linear regression into a multiple linear regression, increasing the accuracy of the model.
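As a quick sketch of this idea, `PolynomialFeatures` in sklearn generates the extra $x^n$ columns; the quadratic toy target below is invented for illustration:

```
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Nonlinear target: y = x^2
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = (x ** 2).ravel()

# Plain linear regression cannot fit a parabola
lr = LinearRegression().fit(x, y)
print(lr.score(x, y))  # near 0

# Adding x^2 as an extra feature makes the problem linear again
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)
lr_poly = LinearRegression().fit(X_poly, y)
print(lr_poly.score(X_poly, y))  # essentially 1
```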

### Generalized linear regression

When the variables have no linear relationship, taking the logarithm can sometimes transform them into an approximately linear relationship so that linear regression applies: $$\ln\mathbf{y} = \mathbf{X\theta}$$
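A minimal sketch of the log transform, assuming a purely exponential toy relationship $y = e^{2x+1}$ (invented for illustration):

```
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy exponential relationship: y = exp(2x + 1)
rng = np.random.RandomState(0)
x = rng.uniform(0, 2, 100).reshape(-1, 1)
y = np.exp(2 * x.ravel() + 1)

# Taking logs makes the relationship linear: ln(y) = 2x + 1
lr = LinearRegression().fit(x, np.log(y))
print(lr.coef_, lr.intercept_)  # approximately [2.] and 1.0
```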

### Ridge regression

Ridge regression adds an L2 regularization term (the squared L2 norm of the coefficients) to the OLS loss, which shrinks the coefficients and reduces overfitting:

$$J(\mathbf\theta) = \frac{1}{2}(\mathbf{X\theta} - \mathbf{Y})^T(\mathbf{X\theta} - \mathbf{Y}) + \frac{1}{2}\alpha||\theta||_2^2$$

```
#The realization of ridge regression in sklearn
from sklearn.linear_model import Ridge
X,y = mglearn.datasets.load_extended_boston()
print('data shape: {}'.format(X.shape))
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state = 0)
ridge = Ridge(alpha=1).fit(X_train,y_train)
lr = LinearRegression().fit(X_train,y_train)
print('linear regression scores ([train, test]): {}'.format([lr.score(X_train,y_train),lr.score(X_test,y_test)]))
print('ridge regression scores ([train, test]): {}'.format([ridge.score(X_train,y_train),ridge.score(X_test,y_test)]))
```

```
data shape: (506, 104)
linear regression scores ([train, test]): [0.9520519609032729, 0.6074721959665752]
ridge regression scores ([train, test]): [0.885796658517094, 0.7527683481744755]
```

As you can see, the extended Boston dataset has 104 features but only 506 samples. The training score of the ridge model is lower than that of plain linear regression, but its test score is higher.

Overfitting can also be mitigated by increasing the amount of data. As shown in the figure below, as the amount of data grows, the test score of linear regression approaches that of ridge.

`mglearn.plots.plot_ridge_n_samples()`

### Lasso

Lasso uses L1 regularization: the penalty term is the L1 norm. It can drive some feature coefficients exactly to zero, **which makes the model easier to interpret and highlights its important features**. Because the absolute value has non-differentiable points, OLS and plain gradient descent are not available (sklearn uses coordinate descent instead). $$J(\mathbf\theta) = \frac{1}{2n}(\mathbf{X\theta} - \mathbf{Y})^T(\mathbf{X\theta} - \mathbf{Y}) + \alpha||\theta||_1$$

```
#lasso implementation
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=0.1,max_iter=1000000).fit(X_train,y_train)
print('training score: {}'.format(lasso.score(X_train,y_train)))
print('test score: {}'.format(lasso.score(X_test,y_test)))
print('number of features used by the model: {}'.format(np.sum(lasso.coef_ != 0)))
```

```
training score: 0.7709955157630054
test score: 0.63020997610041
number of features used by the model: 8
```

### ElasticNet

ElasticNet combines the penalties of lasso and ridge, with a parameter $\rho$ (the L1 ratio) controlling the mix of the L1 and L2 terms: $$J(\mathbf\theta) = \frac{1}{2n}(\mathbf{X\theta} - \mathbf{Y})^T(\mathbf{X\theta} - \mathbf{Y}) + \alpha\rho||\theta||_1 + \frac{\alpha(1-\rho)}{2}||\theta||_2^2$$
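A minimal ElasticNet sketch in sklearn, on synthetic high-dimensional data invented for illustration (the `alpha` and `l1_ratio` values are arbitrary):

```
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

# Synthetic data: 50 features, only 5 of them informative
X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)

# l1_ratio blends the penalties: 1.0 is pure lasso, 0.0 is pure ridge
enet = ElasticNet(alpha=1.0, l1_ratio=0.5, max_iter=100000).fit(X, y)
print(enet.score(X, y))
print(np.sum(enet.coef_ != 0))
```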

## Linear model for binary classification

A linear model for binary classification predicts with the following formula: $$\hat{y} = \sum_i w[i]\cdot x[i] + b > 0$$ that is, a sample is assigned to one class when the linear function is positive and to the other when it is negative.

The commonly used binary classification models are logistic regression and linear support vector machine (LSVM).

### Logistic Regression

Logistic regression maps the dependent variable $y$ of linear regression into $[0,1]$ through a nonlinear (sigmoid) transformation, interpreted as the probability that a sample belongs to class 1.

#### Understanding logistic regression

The sigmoid function $$g(z) = \frac{1}{1+e^{-z}}$$ has the following graph: at $z = 0$ the function value is 0.5, and as $z$ tends to minus or plus infinity the value tends to 0 or 1 respectively. If $\mathbf{z = X\theta}$, then the value obtained from linear regression is mapped into $(0,1)$, and $g(z)$ can be viewed as the probability of being classified as 1: the closer to 1, the greater that probability, with samples near the critical value 0.5 the most likely to be misclassified.

```
X = np.linspace(-10,10)
y = []
for i in X:
    y.append(1/(1+np.exp(-i)))
plt.plot(X,y)
plt.xlabel('z')
plt.ylabel('g(z)')
```

#### The principle of logistic regression

Assuming the sample points are independent and identically distributed, with $m$ samples, the maximum likelihood method (MLE) gives the likelihood function $$L(\theta) = \prod_{i=1}^{m}P(y_i|x_i,\theta)$$ Since the likelihood function represents the probability of observing the existing samples, it should be maximized. Therefore the negative logarithm of the likelihood is taken as the loss function: $$J(\theta) = -\ln L(\theta) = -\sum\limits_{i=1}^{m}(y_i\ln(h_{\theta}(x_i))+ (1-y_i)\ln(1-h_{\theta}(x_i)))$$

Taking the partial derivative gives $$\frac{\partial}{\partial\theta}J(\theta) = \mathbf{X}^T(h_{\theta}(\mathbf{X}) - \mathbf{Y})$$

The gradient descent update is $$\theta = \theta - \alpha\,\mathbf{X}^T(h_{\theta}(\mathbf{X}) - \mathbf{Y})$$
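The update rule above can be sketched in NumPy on a small invented two-cluster dataset (all constants here are illustrative):

```
import numpy as np

# Toy data: two Gaussian clusters centered at (2,2) and (-2,-2)
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + [2, 2], rng.randn(50, 2) - [2, 2]])
y = np.hstack([np.ones(50), np.zeros(50)])
Xb = np.column_stack([np.ones(len(X)), X])  # add intercept column

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

theta = np.zeros(3)
alpha = 0.1
for _ in range(2000):
    # theta = theta - alpha * X^T (h_theta(X) - y), scaled by 1/m
    theta -= alpha * Xb.T @ (sigmoid(Xb @ theta) - y) / len(y)

pred = sigmoid(Xb @ theta) > 0.5
print((pred == y).mean())  # training accuracy, close to 1
```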

#### Sklearn implementation

```
#Using logistic regression on breast cancer data
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
cancer = load_breast_cancer()
X_train,X_test,y_train,y_test = train_test_split(cancer.data,cancer.target,stratify = cancer.target,random_state=42)
for C,marker in zip([0.001,1,100],['o','^','v']):
    logistic = LogisticRegression(C = C,penalty='l2').fit(X_train,y_train)
    print('training score (C={}): {}'.format(C,logistic.score(X_train,y_train)))
    print('test score (C={}): {}'.format(C,logistic.score(X_test,y_test)))
    plt.plot(logistic.coef_.T,marker,label = 'C={}'.format(C))
plt.xticks(range(cancer.data.shape[1]),cancer.feature_names,rotation = 90)
plt.xlabel('Coefficient Index')
plt.ylabel('Coefficient')
plt.legend()
```

```
training score (C=0.001): 0.9225352112676056
test score (C=0.001): 0.9370629370629371
training score (C=1): 0.9553990610328639
test score (C=1): 0.958041958041958
training score (C=100): 0.971830985915493
test score (C=100): 0.965034965034965
```

Logistic regression can also be regularized by adding a regularization term to the loss function. LogisticRegression in sklearn uses L2 regularization by default, and the `penalty` parameter selects the regularization method. The figure above shows the coefficients of models trained with different regularization strengths: the smaller $C$ is, the stronger the regularization and the smaller the range of the coefficients. This is because in sklearn's parameterization $C$ is the inverse of the regularization strength $\alpha$: $$J(\mathbf\theta) = -\sum\limits_{i=1}^{m}(y_i\ln(h_{\theta}(x_i))+ (1-y_i)\ln(1-h_{\theta}(x_i))) + \frac{1}{2C}||\theta||_2^2$$

### Linear SVC

See SVM

## Linear model for multiclassification

Many linear models do not extend directly to multiclass problems, but the one-vs-rest method can be used. For example, to divide data into three classes A, B and C, three classifiers are trained, one per class: the class-A classifier separates class A from not-class-A, and so on. For points that fall into several categories at once (say both A and B) or into none, the category whose classifier gives the highest score wins.

```
from sklearn.datasets import make_blobs
X,y=make_blobs(random_state=42)
mglearn.discrete_scatter(X[:,0],X[:,1],y)
plt.xlabel('Feature 0')
plt.ylabel('Feature 1')
plt.legend(['Class0','Class1','Class2'])
```

Linear SVM is used to classify the above data:

```
from sklearn.svm import LinearSVC
LSVC=LinearSVC().fit(X,y)
LSVC.coef_,LSVC.intercept_,LSVC.score(X,y)
```

```
(array([[-0.17492558, 0.23141285],
[ 0.4762191 , -0.06937294],
[-0.18914557, -0.20399693]]),
array([-1.0774515 , 0.13140521, -0.08604887]),
1.0)
```

It can be seen that LinearSVC outputs three lines, each separating one class from the others. Next we visualize them: the three lines divide the plane into seven regions, and points in the overlapping regions are assigned to the class whose classifier gives the highest score.

```
mglearn.plots.plot_2d_classification(LSVC,X,fill=True,alpha=.7)
mglearn.discrete_scatter(X[:,0],X[:,1],y)
line = np.linspace(-10,10)
for coef,intercept,color in zip(LSVC.coef_,LSVC.intercept_,['b','r','g']):
    plt.plot(line,-(line*coef[0]+intercept)/coef[1])
plt.xlabel('Feature 0')
plt.ylabel('Feature 1')
plt.legend(['Class0','Class1','Class2','Line class 0','Line class 1','Line class 2'],loc=(1.01,0))
```