Linear regression: principle and implementation of gradient descent method

Time:2021-1-17

1、 Linear regression

For a detailed introduction to linear regression, please refer to my previous blog post, "Linear regression: the realization of the least square method". There I explained that the key to building a linear regression model is to solve:

\[(w^*, b^*)=\arg\min\sum^{m}_{i=1}{{(f(x_i)-y_i)^2}}
\]

Here we introduce another algorithm: gradient descent method.

2、 Mathematical principle of gradient descent method

Suppose there are the following problems:

\[w=\arg\min f(w)
\]

Expanding \(f(w)\) to first order around \(w_0\) with Taylor's formula gives:

\[f(w)\approx f(w_0)+(w-w_0)∇f(w_0)
\]

The figure is as follows:

[Figure: first-order Taylor approximation of \(f(w)\) near \(w_0\)]

   Here \(w-w_0\) represents both the size and the direction of the move, so we can write \(w-w_0=\eta\gamma\), where \(\eta\) is a positive real number giving the step size and \(\gamma\) is a unit vector giving the direction of the step. \(f(w_0)+(w-w_0)∇f(w_0)\) is then an estimate of \(f(w)\) in a neighborhood of \(w_0\), which requires \(\eta\) to be small; otherwise the accuracy of the estimate degrades.
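To see concretely how the accuracy of the estimate depends on the step size, here is a small numerical check (a sketch on an assumed toy function, not from the original post):

f = lambda w: w ** 2          # assumed toy function
grad = lambda w: 2 * w        # its gradient

w0 = 1.0
for eta in (0.01, 0.1, 1.0):                       # step sizes of increasing length
    w = w0 + eta                                   # move by eta in the direction gamma = +1
    estimate = f(w0) + (w - w0) * grad(w0)         # first-order Taylor estimate of f(w)
    print(eta, f(w), estimate, abs(f(w) - estimate))
# The estimation error grows with eta, which is why the step must stay in a
# small neighborhood of w0.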
The purpose of the gradient descent algorithm is to drive the objective function \(f(w)\) towards its minimum as quickly as possible, which gives the first condition:

\[condition1:f(w)-f(w_0)=\eta\gamma∇f(w_0)≤0
\]

   Since \(\eta\) is a constant greater than 0, it can be dropped; what remains is to make \(\gamma∇f(w_0)\) non-positive and as small as possible. Expanding the dot product gives:

\[\gamma∇f(w_0)=|\gamma||∇f(w_0)|\cos \alpha
\]

So, to make \(\gamma∇f(w_0)\) non-positive and as small as possible, the following condition must hold:

\[condition2:\cos \alpha=-1
\]

In other words, the unit vector \(\gamma\) must point in the direction opposite to \(∇f(w_0)\), so we can draw the following conclusion:

\[condition3:\gamma=\frac{-∇f(w_0)}{|∇f(w_0)|}
\]

   Substituting \(condition3\) into \(w-w_0=\eta\gamma\) gives:

\[condition4:w-w_0=\eta\frac{-∇f(w_0)}{|∇f(w_0)|}
\]

   Since \(\eta\) and \(|∇f(w_0)|\) are both real numbers greater than 0, they can be merged into a new step size \(\eta^*=\frac{\eta}{|∇f(w_0)|}\) (written simply as \(\eta\) below). Rearranging the equation then gives:

\[condition5:w=w_0-\eta∇f(w_0)
\]

  \(condition5\) is the update formula for the parameter \(w\) of \(f(w)\) in the gradient descent algorithm. It also explains why the weights need to be updated in the direction opposite to the gradient.
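As a quick illustration of \(condition5\), the following sketch (toy objective assumed, not from the post) repeatedly applies the update rule to \(f(w)=(w-3)^2\):

grad_f = lambda w: 2 * (w - 3)    # gradient of the assumed toy objective f(w) = (w - 3)^2

w, eta = 0.0, 0.1
for _ in range(100):
    w = w - eta * grad_f(w)       # condition 5: step against the gradient
print(w)                          # approaches the minimizer w = 3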

3、 Optimization of gradient descent method

Next, we use the gradient descent algorithm to optimize the linear regression model. Here the cost function \(E\) of the linear regression model is written in the following form (the factor \(\frac{1}{2m}\) simply makes the derivative cleaner):

\[E=\frac{1}{2m}\sum^{m}_{i=1}{(f(x_i)-y_i)^2}
\]

Taking the derivative of \(E\) with respect to \(w\) (writing the prediction simply as \(f(x_i)=wx_i\)) gives:

\[\frac{\partial E}{\partial w}=\frac{1}{m}\sum^{m}_{i=1}{(wx_i-y_i)\frac{\partial (wx_i-y_i)}{\partial w}}=\frac{1}{m}\sum^{m}_{i=1}{(wx_i-y_i)x_i}
\]

   From \(condition5\) above, the weight update formula is \(w^*=w+\Delta w\), where \(\Delta w=-\eta\frac{\partial E}{\partial w}\). The final weight update formula is therefore:

\[w^*=w+\frac{\eta}{m}\sum^{m}_{i=1}{(y_i-wx_i)x_i}
\]
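Before moving to the full implementation in the next section, the update formula can be checked in isolation. The sketch below (variable names are my own, not from the post) performs the vectorized update with NumPy on synthetic data; the leading column of ones lets \(w\) absorb the bias \(b\):

import numpy as np

def gradient_step(w, X, y, eta):
    # w* = w + eta/m * sum_i (y_i - w x_i) x_i, vectorized
    residual = y - X @ w
    return w + eta * (X.T @ residual) / len(y)

# toy usage on assumed synthetic data: y ≈ 2 + 3 * x
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.random(100)])   # first column of ones acts as the bias feature
y = 2.0 + 3.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

w = np.zeros(2)
for _ in range(2000):
    w = gradient_step(w, X, y, eta=0.5)
print(w)   # close to [2, 3]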

4、 Python implementation

The fitting algorithm can be implemented from the update formula derived above:

def _gradient_descent(self, X, y):
    for _ in range(self.max_iter):
        delta = y - self._linear_func(X)
        self.W[0] += self.eta * np.sum(delta) / X.shape[0]   # bias term; as if the first feature column were all ones
        self.W[1:] += self.eta * (delta @ X) / X.shape[0]
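The method above assumes a surrounding class holding the step size eta, the iteration count max_iter, the weight vector W and a _linear_func helper. That class is not shown in this section, so the following is only a minimal sketch (attribute and method names are assumptions) that makes the snippet and the test script below runnable:

import numpy as np

class LinearRegression:
    """Minimal sketch of the class assumed by _gradient_descent above."""

    def __init__(self, eta=0.1, max_iter=3000):
        self.eta = eta            # learning rate
        self.max_iter = max_iter  # number of gradient descent iterations
        self.W = None             # W[0] is the bias, W[1:] are the feature weights

    def _linear_func(self, X):
        return X @ self.W[1:] + self.W[0]

    def _gradient_descent(self, X, y):
        # same update as the snippet above, repeated so the sketch runs on its own
        for _ in range(self.max_iter):
            delta = y - self._linear_func(X)
            self.W[0] += self.eta * np.sum(delta) / X.shape[0]
            self.W[1:] += self.eta * (delta @ X) / X.shape[0]

    def fit(self, X, y):
        self.W = np.zeros(X.shape[1] + 1)
        self._gradient_descent(X, y)
        return self

    def predict(self, X):
        return self._linear_func(X)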

Load the Boston housing dataset to test the implementation:

if __name__ == "__main__":
    from sklearn import datasets
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import MinMaxScaler
    import matplotlib.pyplot as plt
    boston = datasets.load_boston()
    X = boston.data
    y = boston.target
    scaler = MinMaxScaler().fit(X)
    X = scaler.transform(X)
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3)
    lr = LinearRegression().fit(X_train, y_train)
    y_pred = lr.predict(X_test)
    from sklearn.metrics import mean_squared_error
    print(mean_squared_error(y_test, y_pred))
    plt.figure()
    plt.plot(range(len(y_test)), y_test)
    plt.plot(range(len(y_pred)), y_pred)
    plt.legend(["test", "pred"])
    plt.show()

Mean square error:

[Output: mean squared error on the test set]

Cost curve:

[Figure: cost curve over training iterations]

Fitting curve:

[Figure: predicted values vs. test values on the test set]