# Last regression algorithm

Time：2022-4-8

## 1、 Introduction

in the previous section, we learned that one method to solve multicollinearity is to regularize the cost function. One regularization algorithm is called ridge regression algorithm. Let’s learn another regularization algorithm-Lasso regression algorithm)1(lasso regression algorithm), the full name of Lasso is called least absolute convergence and selection operator.

## Two, model introduction

Let’s first review the cost function of ridge regression, and add a penalty coefficient to the original standard linear regression cost function λ Square of L2 norm of W vector:

$$\operatorname{Cost}(w) = \sum_{i = 1}^N \left( y_i – w^Tx_i \right)^2 + \lambda\|w\|_{2}^{2}$$

Lasso regression algorithm also adds a regular term like ridge regression, but adds a penalty coefficient instead λ The L1 norm of the w vector is taken as the penalty term (L1 norm means the sum of the absolute values of each element of the vector w), so this regularization method is also called L1 regularization.

$$\operatorname{Cost}(w) = \sum_{i = 1}^N \left( y_i – w^Tx_i \right)^2 + \lambda\|w\|_1$$

The same is to find the size of W when minimizing the cost function:

$$w = \underset{w}{\operatorname{argmin}}\left(\sum_{i = 1}^N \left( y_i – w^Tx_i \right)^2 + \lambda\|w\|_1\right)$$

since the L1 norm of the vector is added, and there is an absolute value in it, the cost function is not derivable everywhere, so there is no way to directly obtain the analytical solution of W by directly deriving. Here are two methods to solve the weight coefficient W:Coordinate descent method2（coordinate descent）、Minimum angle regression method3（Least Angle Regression，LARS）

## 3、 Algorithm steps

Coordinate descent method:

the core of coordinate descent method is the same as its name, which is to gradually approach the optimal solution by updating the value of weight coefficient one after another along a certain coordinate axis.

specific steps:

(1) Initialize the weight vector to zero, for example.

(2) Traverse all the weight coefficients, successively take one of the weight coefficients as a variable, fix the other weight coefficients as the result of the previous calculation as a constant, and find the optimal solution under the current condition with only one weight coefficient variable.

in iteration K, the method of updating the weight coefficient is as follows:

$$(3) Step (2) is a complete iteration. When all weight coefficients change little or reach the maximum number of iterations, the iteration is ended. By Nicoguaro – Own work As shown in the figure above, other weight coefficients are fixed in each iteration, and only updated in the direction of one coordinate axis, and finally the optimal solution is reached. Minimum angle regression method: minimum angle regression method is a method of feature selection, which calculates the correlation of each feature and gradually calculates the optimal solution through mathematical formula. specific steps: (1) Initialize the weight vector to zero, for example. (2) Initialize the residual vector as the target label vector Y – XW. Since W is the zero vector at this time, the residual vector is equal to the target label vector y (3) Select an eigenvector x with the largest correlation with the residual vector_ i. Find a set of weight coefficients w along the direction of the eigenvector Xi, and another eigenvector x with the largest correlation with the residual vector appears_ J make the correlation between the new residual vector residual and the two eigenvectors equal (that is, the residual vector is equal to the angular bisection vector of the two eigenvectors), and recalculate the residual vector. (4) Repeat step (3) and continue to find a set of weight coefficients w to make the third feature vector x with the largest correlation with the residual vector_ K makes the correlation between the new residual vector residual and the three eigenvectors equal (that is, the residual vector is equal to the equiangular vector of the three eigenvectors), and so on. (5) When the residual vector residual is small enough or all eigenvectors have been selected, the iteration is ended. Least Angle Regression – Figure 2 the figure above shows the case when there are only two eigenvectors, and the initial residual vector is y_ 2, where x_ 1. The correlation between the vector and the residual vector is the largest (the included angle of the vector is the smallest), and find one θ_ 11 so that the new residual vector Y_ 2 – x_ oneθ_ 11 is X_ 1 and X_ Angle bisector of 2 (u_2 in the figure), W1 at this time= θ_ 11, w2 = 0。 Finally find a group θ_ 21， θ_ 22 so that the new residual vector Y_ 2 – x_ one θ_11 – (x_1 θ_21 + x_2 θ_ 22) as small as possible, W1= θ_ 11 + θ_ 21，w2 = θ_ 22。 All eigenvectors have been selected, so end the iteration. The coordinate descent method is relatively easier to understand than the minimum angle regression method. You only need to care about one weight coefficient at a time. The least angle regression law reduces the number of iterations through some ingenious mathematical formulas, making its worst computational complexity similar to the least square method. ## 4、 Proof of principle Lasso regression cost function is convex also need to prove:$$
f\left(\frac{x_{1}+x_{2}}{2}\right) \leq \frac{f\left(x_{1}\right)+f\left(x_{2}\right)}{2}
$$Inequality left:$$
\operatorname{Cost}\left(\frac{w_{1}+w_{2}}{2}\right)=\sum_{i=1}^{N}\left[\left(\frac{w_{1}+w_{2}}{2}\right)^{T} x_{i}-y_{i}\right]^{2}+\lambda\left\|\frac{w_{1}+w_{2}}{2}\right\|_{1}
$$Inequality right:$$
\frac{\operatorname{Cost}\left(w_{1}\right)+\operatorname{Cost}\left(w_{2}\right)}{2}=\frac{\sum_{i=1}^{N}\left(w_{1}^{T} x_{i}-y_{i}\right)^{2}+\sum_{i=1}^{N}\left(w_{2}^{T} x_{i}-y_{i}\right)^{2}+\lambda\left\|w_{1}\right\|_{1}+\lambda\left\|w_{2}\right\|_{1}}{2}
$$(1) The first half of both sides of the inequality is consistent with the standard linear regression. It only needs to prove that the remaining difference is greater than or equal to zero (2) According to the positive homogeneity of vector norm, the coefficient of constant term can be multiplied (3) Will λ Mention outside parentheses (4) According to the definition of vector norm, if it satisfies the triangular inequality, it must be greater than or equal to zero$$
\begin{aligned}
\Delta &=\lambda\left\|w_{1}\right\|_{1}+\lambda\left\|w_{2}\right\|_{1}-2 \lambda\left\|\frac{w_{1}+w_{2}}{2}\right\|_{1} & (1) \\
&=\lambda\left\|w_{1}\right\|_{1}+\lambda\left\|w_{2}\right\|_{1}-\lambda\left\|w_{1}+w_{2}\right\|_{1} & (2) \\
&=\lambda\left(\left\|w_{1}\right\|_{1}+\left\|w_{2}\right\|_{1}-\left\|w_{1}+w_{2}\right\|_{1}\right) & (3) \\
& \geq 0 & (4)
\end{aligned}
$$Artificial control λ The size of the final proof is greater than zero, and the final result is equal to zero. Mathematical formula of minimum regression method find the bisection vector of unit angle: (1) Let the unit angle bisection vector be u_ A. It can be regarded as a linear combination of the selected feature vectors, and the subscript a represents the sequence number set of the selected features (2) Since the included angle between each selected feature vector and the angular bisection vector is the same, each element of the vector after the dot product of the feature vector and the angular bisection vector must be the same, 1_ A is a vector whose elements are all 1, and Z is a constant (3) According to equation (2), the coefficient vector of linear combination can be expressed ω_ A$$
\begin{matrix}
u_\mathcal{A} = X_\mathcal{A} \omega_\mathcal{A} & (1)\\
X_\mathcal{A}^Tu_\mathcal{A} = X_\mathcal{A}^TX_\mathcal{A} \omega_\mathcal{A} = z 1_\mathcal{A} & (2)\\
\omega_\mathcal{A} = z(X_\mathcal{A}^TX_\mathcal{A})^{-1} 1_\mathcal{A} & (3)
\end{matrix}
$$(1) Angular bisection vector u_ A is the unit vector, so the length is 1 (2) Rewrite to vector form (3) Coefficient vector of linear combination brought into the previous step ω_ A (4) According to the nature of transpose, the bracket of the first term can be simplified and the constant Z is proposed (5) The matrix multiplied by its inverse matrix is the identity matrix, so it can be simplified (6) Finally, the expression of constant Z is obtained$$
\begin{array}{cc}
\left\|u_{\mathcal{A}}\right\|^{2}=1 & (1)\\
\omega_{\mathcal{A}}^{T} X_{\mathcal{A}}^{T} X_{\mathcal{A}} \omega_{\mathcal{A}}=1 & (2) \\
\left(z\left(X_{\mathcal{A}}^{T} X_{\mathcal{A}}\right)^{-1} 1_{\mathcal{A}}\right)^{T} X_{\mathcal{A}}^{T} X_{\mathcal{A}} z\left(X_{\mathcal{A}}^{T} X_{\mathcal{A}}\right)^{-1} 1_{\mathcal{A}}=1 & (3) \\
z^{2} 1_{\mathcal{A}}^{T}\left(X_{\mathcal{A}}^{T} X_{\mathcal{A}}\right)^{-1}\left(X_{\mathcal{A}}^{T} X_{\mathcal{A}}\right)\left(X_{\mathcal{A}}^{T} X_{\mathcal{A}}\right)^{-1} 1_{\mathcal{A}}=1 & (4) \\
z^21_\mathcal{A}^T\left(X_\mathcal{A}^TX_\mathcal{A}\right)^{-1}1_\mathcal{A} = 1 & (5) \\
z= \frac{1}{\sqrt[]{1_\mathcal{A}^T\left(X_\mathcal{A}^TX_\mathcal{A}\right)^{-1}1_\mathcal{A}}} = \left(1_\mathcal{A}^T\left(X_\mathcal{A}^TX_\mathcal{A}\right)^{-1}1_\mathcal{A}\right)^{-\frac{1}{2} } & (6) \\
\end{array}
$$(1) Multiply the transpose of the eigenvector by the eigenvector and use G_ A means (2) Bring in G_ A. Get the expression of Z (3) Bring in G_ A. Get ω_ Expression of a$$
\begin{array}{c}
G_{\mathcal{A}}=X_{\mathcal{A}}^{T} X_{\mathcal{A}} & (1)\\
z=\left(1_{\mathcal{A}}^{T}\left(G_{\mathcal{A}}\right)^{-1} 1_{\mathcal{A}}\right)^{-\frac{1}{2}} & (2)\\
\omega_{\mathcal{A}}=z\left(G_{\mathcal{A}}\right)^{-1} 1_{\mathcal{A}} & (3)
\end{array}
$$put x_ A times ω_ A. You get the angular bisection vector U_ The expression of a, so that the bisection vector of unit angle can be obtained. For more detailed proof, please refer toBradley Efron’s paper4 find the length of the angular bisector vector: （1） μ_ A represents the current predicted value and updates one along the direction of the angular bisection vector γ length (2) C represents the correlation between the eigenvector and the residual vector, which is the dot product of the two vectors. Bring it into equation (1) to obtain the updated expression of the correlation (3) When the eigenvector is the selected eigenvector, because the product of each eigenvector and angular bisection vector is the same, it is equal to Z (4) Calculate the product of each eigenvector and angular bisection vector (5) When the feature vector is not the selected feature vector, use a to bring in equation (2) (6) If you want to add the next eigenvector, the absolute values of equations (3) and (5) must be equal to ensure that the correlation after the next update is also the same (7) Solution γ Expression for$$
\begin{array}{c}
\mu_{A^{+}}=\mu_{A}+\gamma u_{A} & (1)\\
C_{+}=X^{T}\left(y-\mu_{A^{+}}\right)=X^{T}\left(y-\mu_{A}-\gamma u_{A}\right)=C-\gamma X^{T} u_{A} & (2)\\
C_{+}=C-\gamma z \quad(j \in A) & (3)\\
a=X^{T} u_{A} & (4)\\
C_{+}=C_{j}-\gamma a_{j} \quad(j \notin A) & (5)\\
|C-\gamma z|=\left|C-\gamma a_{j}\right| & (6)\\
\gamma=\min _{j \notin A}^{+}\left\{\frac{C-C_{j}}{z-a_{j}}, \frac{C+C_{j}}{z+a_{j}}\right\} & (7)
\end{array}


8195 γ That is, the length of the angular bisection vector. For more detailed proof, please refer toBradley Efron’s paper4

## 5、 Code implementation

Lasso regression algorithm is implemented in python (coordinate descent method):

def lassoUseCd(X, y, lambdas=0.1, max_iter=1000, tol=1e-4):
"""
Lasso regression, using coordinate descent method
args:
X - training dataset
Y - target tag value
Lambdas - penalty term coefficient
max_ ITER - maximum number of iterations
TOL - tolerance value of variation
return:
W - weight coefficient
"""
#Initialize w to zero vector
w = np.zeros(X.shape)
for it in range(max_iter):
done = True
#Traverse all arguments
for i in range(0, len(w)):
#Record the last round factor
weight = w[i]
#Find the best coefficient under the current conditions
w[i] = down(X, y, w, i, lambdas)
#When the variation of one of the coefficients does not reach its tolerance value, continue the cycle
if (np.abs(weight - w[i]) > tol):
done = False
#When all coefficients change little, end the cycle
if (done):
break
return w

def down(X, y, w, index, lambdas=0.1):
"""
cost(w) = (x1 * w1 + x2 * w2 + ... - y)^2 + ... + λ(|w1| + |w2| + ...)
Assuming that W1 is a variable, other values are constants. After being brought into the above formula, the cost function is a univariate quadratic function about W1, which can be written as follows:
cost(w1) = (a * w1 + b)^2 + ... +  λ| w1| + c (a,b,c, λ  (all constant)
=>After expansion
cost(w1) = aa * w1^2 + 2ab * w1 +  λ| w1| + c (aa,ab,c, λ  (all constant)
"""
#Sum of coefficients of quadratic terms after expansion
aa = 0
#Sum of coefficients of expanded primary term
ab = 0
for i in range(X.shape):
#Coefficient of the primary term in parentheses
a = X[i][index]
#Coefficients of constant terms in brackets
b = X[i][:].dot(w) - a * w[index] - y[i]
#It can be easily obtained that the coefficients of the expanded quadratic term are the sum of the squares of the coefficients of the primary term in brackets
aa = aa + a * a
#It can be easily obtained that the coefficient of the expanded primary term is the coefficient of the primary term in brackets multiplied by the sum of the constant term in brackets
ab = ab + a * b
#Because it is a univariate quadratic function, when the derivative is zero, the value of the function is the minimum. Only the quadratic term coefficient, the primary term coefficient and λ
return det(aa, ab, lambdas)

def det(aa, ab, lambdas=0.1):
"""
Find w through the derivative of the cost function. When w = 0, it is not differentiable
det(w) = 2aa * w + 2ab + λ = 0 (w > 0)
=> w = - (2 * ab + λ) / (2 * aa)

det(w) = 2aa * w + 2ab - λ = 0 (w < 0)
=> w = - (2 * ab - λ) / (2 * aa)

det(w) = NaN (w = 0)
=> w = 0
"""
w = - (2 * ab + lambdas) / (2 * aa)
if w < 0:
w = - (2 * ab - lambdas) / (2 * aa)
if w > 0:
w = 0
return w

Lasso regression algorithm (minimum angle regression method) is implemented in Python:

def lassoUseLars(X, y, lambdas=0.1, max_iter=1000):
"""
Lasso regression, using least angle regression
Thesis: https://web.stanford.edu/ ~hastie/Papers/LARS/LeastAngle_ 2002.pdf
args:
X - training dataset
Y - target tag value
Lambdas - penalty term coefficient
max_ ITER - maximum number of iterations
return:
W - weight coefficient
"""
n, m = X.shape
#Selected feature subscript
active_set = set()
#Current prediction vector
cur_pred = np.zeros((n,), dtype=np.float32)
#Residual vector
residual = y - cur_pred
#The point product of eigenvector and residual vector, i.e. correlation
cur_corr = X.T.dot(residual)
#Select the most relevant subscript
j = np.argmax(np.abs(cur_corr), 0)
#Adds a subscript to the selected feature subscript set
#Initialize weight coefficient
w = np.zeros((m,), dtype=np.float32)
#Record the weight coefficient of the last time
prev_w = np.zeros((m,), dtype=np.float32)
#Record feature update direction
sign = np.zeros((m,), dtype=np.int32)
sign[j] = 1
#Average correlation
lambda_hat = None
#Record the last average correlation
prev_lambda_hat = None
for it in range(max_iter):
#Calculate residual vector
residual = y - cur_pred
#Dot product of eigenvector and residual vector
cur_corr = X.T.dot(residual)
#Current correlation maximum
largest_abs_correlation = np.abs(cur_corr).max()
#Calculate current average correlation
lambda_hat = largest_abs_correlation / n
#When the average correlation is less than λ End the iteration ahead of time
# https://github.com/scikit-learn/scikit-learn/blob/2beed55847ee70d363bdbfe14ee4401438fba057/sklearn/linear_model/_least_angle.py#L542
if lambda_hat <= lambdas:
if (it > 0 and lambda_hat != lambdas):
ss = ((prev_lambda_hat - lambdas) / (prev_lambda_hat - lambda_hat))
#Recalculate the weight coefficient
w[:] = prev_w + ss * (w - prev_w)
break
#Update last average correlation
prev_lambda_hat = lambda_hat

#When all features are selected, the iteration ends
if len(active_set) > m:
break

#Selected eigenvector
X_a = X[:, list(active_set)]
#X in the paper_ Calculation formula of a - (2.4)
X_a *= sign[list(active_set)]
#G in the paper_ Calculation formula of a - (2.5)
G_a = X_a.T.dot(X_a)
G_a_inv = np.linalg.inv(G_a)
G_a_inv_red_cols = np.sum(G_a_inv, 1)
#A in the paper_ Calculation formula of a - (2.5)
A_a = 1 / np.sqrt(np.sum(G_a_inv_red_cols))
#In the paper ω  Calculation formula of - (2.6)
omega = A_a * G_a_inv_red_cols
#Calculation formula of angular bisection vector in the paper - (2.6)
equiangular = X_a.dot(omega)
#Calculation formula of a in the paper - (2.11)
cos_angle = X.T.dot(equiangular)
#In the paper γ
gamma = None
#Next selected feature subscript
next_j = None
#Direction of the next feature
next_sign = 0
for j in range(m):
if j in active_set:
continue
#In the paper γ  Calculation method of - (2.13)
v0 = (largest_abs_correlation - cur_corr[j]) / (A_a - cos_angle[j]).item()
v1 = (largest_abs_correlation + cur_corr[j]) / (A_a + cos_angle[j]).item()
if v0 > 0 and (gamma is None or v0 < gamma):
gamma = v0
next_j = j
next_sign = 1
if v1 > 0 and (gamma is None or v1 < gamma):
gamma = v1
next_j = j
next_sign = -1
if gamma is None:
#In the paper γ  Calculation method of - (2.21)
gamma = largest_abs_correlation / A_a

#Selected eigenvector
sa = X_a
#Angular bisection vector
sb = equiangular * gamma
#Solving linear equations (SA * SX = sb)
sx = np.linalg.lstsq(sa, sb)
#Record the weight coefficient of the last time
prev_w = w.copy()
d_hat = np.zeros((m,), dtype=np.int32)
for i, j in enumerate(active_set):
#Update the current weight factor
w[j] += sx[i] * sign[j]
#D in the paper_ Calculation method of hat - (3.3)
d_hat[j] = omega[i] * sign[j]
#In the paper γ_ Calculation method of J - (3.4)
gamma_hat = -w / d_hat
#In the paper γ_ Calculation method of hat - (3.5)
gamma_hat_min = float("+inf")
#In the paper γ_ Subscript of hat
gamma_hat_min_idx = None
for i, j in enumerate(gamma_hat):
if j <= 0:
continue
if gamma_hat_min > j:
gamma_hat_min = j
gamma_hat_min_idx = i
if gamma_hat_min < gamma:
#Update current forecast vector - (3.6)
cur_pred = cur_pred + gamma_hat_min * equiangular
#Remove the subscript to the selected feature subscript set
active_set.remove(gamma_hat_min_idx)
#Update feature update direction set
sign[gamma_hat_min_idx] = 0
else:
#Update current prediction vector
cur_pred = X.dot(w)
#Adds a subscript to the selected feature subscript set
#Update feature update direction set
sign[next_j] = next_sign

return w

## 6、 Third party library implementation

scikit-learn6Implementation (coordinate descent method):

from sklearn.linear_model import Lasso

#Initialize lasso regressor. Coordinate descent method is used by default
reg = Lasso(alpha=0.1, fit_intercept=False)
#Fitting linear model
reg.fit(X, y)
#Weight coefficient
w = reg.coef_

scikit-learn7Implementation (minimum angle regression):

from sklearn.linear_model import LassoLars

#Initialize lasso regressor and use the minimum angle regression method
reg = LassoLars(alpha=0.1, fit_intercept=False)
#Fitting linear model
reg.fit(X, y)
#Weight coefficient
w = reg.coef_

## 7、 Animation demonstration

The example below shows that the weight of each working period is gradually changed towards the optimal solution of the previous working period, and the weight of each working period is only one month. < center > coordinate descent method < / center >

The following dynamic diagram shows λ The horizontal axis is the penalty coefficient λ ， The vertical axis is the weight coefficient, and each color represents the weight coefficient of an independent variable (the training data comes from sklearn diabetes datasets): <center> λ Influence on weight coefficient < / center > Can see when λ When gradually increasing（ λ Moving to the left), the weight coefficient of some features will quickly become zero. This property shows that lasso regression can be used for feature selection and control λ To select the key features.

## 8、 Mind map ## 9、 References

For a complete demonstration, please clickhere

Note: This article strives to be accurate and easy to understand, but as the author is also a beginner, his level is limited. If there are errors or omissions in the article, readers are urged to criticize and correct it by leaving a message