Date: 2022-06-01

# 1、 Introduction

In the previous section, we learned that one way to mitigate multicollinearity is to regularize the cost function, and that one such regularization algorithm is ridge regression. Let's learn another regularization algorithm: Lasso regression[1]. The full name of Lasso is "least absolute shrinkage and selection operator".

# 2、 Model introduction

First, let's review the cost function of ridge regression: a penalty term, the coefficient λ times the square of the L2 norm of the vector w, is added to the cost function of standard linear regression:

$$\operatorname{Cost}(w) = \sum_{i = 1}^N \left( y_i - w^Tx_i \right)^2 + \lambda\|w\|_{2}^{2}$$

Like ridge regression, the Lasso regression algorithm also adds a regularization term, but it uses the penalty coefficient λ times the L1 norm of the vector w as the penalty term (the L1 norm of w is the sum of the absolute values of its elements), so this regularization method is also called L1 regularization.

$$\operatorname{Cost}(w) = \sum_{i = 1}^N \left( y_i - w^Tx_i \right)^2 + \lambda\|w\|_1$$
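To make the penalty concrete, here is a minimal NumPy sketch (not from the original article; the function name `lasso_cost` and the toy data are ours) that evaluates this cost function for a given weight vector:

```python
import numpy as np

def lasso_cost(X, y, w, lam):
    """Sum of squared errors plus lam times the L1 norm of w."""
    residuals = y - X.dot(w)
    return np.sum(residuals ** 2) + lam * np.sum(np.abs(w))

X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, 2.0])
w = np.array([0.5, -0.25])
print(lasso_cost(X, y, w, lam=0.1))  # ≈ 3.325
```

Here the squared-error part contributes 3.25 and the L1 penalty contributes 0.1 × (0.5 + 0.25) = 0.075.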

Similarly, we seek the w that minimizes the cost function:

$$w = \underset{w}{\operatorname{argmin}}\left(\sum_{i = 1}^N \left( y_i - w^Tx_i \right)^2 + \lambda\|w\|_1\right)$$

Since the added L1 norm contains absolute values, the cost function is not differentiable everywhere, so there is no way to obtain an analytical solution for w by setting the derivative to zero. Below are two methods for solving for the weight coefficients w: the coordinate descent method[2] and least angle regression[3] (Least Angle Regression, LARS).
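Although the L1 term is not differentiable at zero, the one-dimensional subproblem that coordinate descent repeatedly solves has a closed-form answer, the soft-thresholding operator (a standard result, not stated explicitly in the article; the function name below is ours):

```python
import numpy as np

def soft_threshold(u, lam):
    """Closed-form minimizer of (w - u)^2 + lam * |w|."""
    return np.sign(u) * max(abs(u) - lam / 2, 0.0)

print(soft_threshold(3.0, 1.0))   # 2.5
print(soft_threshold(0.2, 1.0))   # 0.0 (small coefficients are shrunk exactly to zero)
print(soft_threshold(-3.0, 1.0))  # -2.5
```

The `max(..., 0)` term is what drives small coefficients exactly to zero, the property that makes Lasso useful for feature selection.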

# 3、 Algorithm steps

Coordinate descent method:

As its name suggests, the core of the coordinate descent method is to update the weight coefficients iteration by iteration along one coordinate-axis direction at a time, gradually approaching the optimal solution.

Specific steps:

(1) Initialize the weight coefficients w, for example, to the zero vector.

(2) Iterate over all the weight coefficients, taking each one in turn as the only variable while fixing the others at their most recently computed values, and find the optimal solution of the resulting single-variable problem.

At iteration k, the update rule for each weight coefficient w_i is:

$$w_i^{(k)} = \underset{w_i}{\operatorname{argmin}} \operatorname{Cost}\left(w_1^{(k)}, \ldots, w_{i-1}^{(k)}, w_i, w_{i+1}^{(k-1)}, \ldots, w_m^{(k-1)}\right)$$

(3) Step (2) constitutes one complete iteration. When all weight coefficients change little, or the maximum number of iterations is reached, the iteration ends.

<center> By Nicoguaro – Own work </center>

As shown in the figure above, in each iteration the other weight coefficients are fixed and only one of them is updated along its coordinate-axis direction, eventually reaching the optimal solution.

Least angle regression method:

Least angle regression is a feature selection method: it computes the correlation of each feature with the residual and steps toward the optimal solution through explicit formulas.

Specific steps:

(1) Initialize the weight coefficients w, for example, to the zero vector.

(2) Initialize the residual vector to y - Xw. Since w is the zero vector at this point, the residual vector equals the target label vector y.

(3) Select the feature vector x_i that has the greatest correlation with the residual vector, and move the weight coefficients w along the direction of x_i until another feature vector x_j has the same correlation with the recomputed residual vector (that is, until the residual vector lies along the angle bisector of the two feature vectors); then recompute the residual vector.

(4) Repeat step (3): continue to move w until a third feature vector x_k has the greatest correlation with the residual vector and the new residual vector is equally correlated with all three feature vectors (that is, the residual vector lies along their equiangular vector), and so on.

(5) When the residual vector is small enough, or all feature vectors have been selected, the iteration ends.

<center> Least Angle Regression – Figure 2 </center>

The figure above illustrates the case of only two feature vectors, with initial residual vector y_2, where the correlation between x_1 and the residual vector is the largest (their included angle is the smallest). We find a θ_11 that makes the new residual vector y_2 - x_1 θ_11 lie along the angle bisector of x_1 and x_2 (u_2 in the figure); at this point w_1 = θ_11, w_2 = 0. Finally we find θ_21 and θ_22 that make the new residual vector y_2 - x_1 θ_11 - (x_1 θ_21 + x_2 θ_22) as small as possible; then w_1 = θ_11 + θ_21, w_2 = θ_22. All feature vectors have now been selected, so the iteration ends.

The coordinate descent method is easier to understand than least angle regression: each step only concerns one weight coefficient. Least angle regression reduces the number of iterations through some ingenious formulas, making its worst-case computational complexity comparable to that of the least squares method.

# 4、 Proof of principle

To show that the Lasso regression cost function is a convex function, it suffices to prove the midpoint condition:

$$
f\left(\frac{x_{1}+x_{2}}{2}\right) \leq \frac{f\left(x_{1}\right)+f\left(x_{2}\right)}{2}
$$

The left-hand side of the inequality:

$$
\operatorname{Cost}\left(\frac{w_{1}+w_{2}}{2}\right)=\sum_{i=1}^{N}\left[\left(\frac{w_{1}+w_{2}}{2}\right)^{T} x_{i}-y_{i}\right]^{2}+\lambda\left\|\frac{w_{1}+w_{2}}{2}\right\|_{1}
$$

The right-hand side of the inequality:

$$
\frac{\operatorname{Cost}\left(w_{1}\right)+\operatorname{Cost}\left(w_{2}\right)}{2}=\frac{\sum_{i=1}^{N}\left(w_{1}^{T} x_{i}-y_{i}\right)^{2}+\sum_{i=1}^{N}\left(w_{2}^{T} x_{i}-y_{i}\right)^{2}+\lambda\left\|w_{1}\right\|_{1}+\lambda\left\|w_{2}\right\|_{1}}{2}
$$

(1) The sum-of-squares part on both sides is handled exactly as in standard linear regression, so it only remains to prove that the difference of the remaining penalty terms is greater than or equal to zero.

(2) By the absolute homogeneity of vector norms, the constant factor can be pulled out of the norm.

(3) Factor λ out of the parentheses.

(4) Since a vector norm satisfies the triangle inequality, the expression must be greater than or equal to zero.

$$
\begin{aligned}
\Delta &=\lambda\left\|w_{1}\right\|_{1}+\lambda\left\|w_{2}\right\|_{1}-2 \lambda\left\|\frac{w_{1}+w_{2}}{2}\right\|_{1} & (1) \\
&=\lambda\left\|w_{1}\right\|_{1}+\lambda\left\|w_{2}\right\|_{1}-\lambda\left\|w_{1}+w_{2}\right\|_{1} & (2) \\
&=\lambda\left(\left\|w_{1}\right\|_{1}+\left\|w_{2}\right\|_{1}-\left\|w_{1}+w_{2}\right\|_{1}\right) & (3) \\
& \geq 0 & (4)
\end{aligned}
$$

Since λ is restricted to the non-negative real numbers, the final result must be greater than or equal to zero, so the Lasso cost function is convex.

Mathematical formulas of the least angle regression method

Finding the unit equiangular (angle-bisector) vector:

(1) Let the unit equiangular vector be u_A, which can be written as a linear combination of the selected feature vectors; the subscript A denotes the index set of the selected features.

(2) Since each selected feature vector makes the same angle with the equiangular vector, every element of the product of X_A^T with the equiangular vector must be equal; 1_A is a vector of all ones and z is a constant.

(3) From equation (2), the coefficient vector ω_A of the linear combination can be expressed as follows:

$$
\begin{matrix}
u_\mathcal{A} = X_\mathcal{A} \omega_\mathcal{A} & (1)\\
X_\mathcal{A}^Tu_\mathcal{A} = X_\mathcal{A}^TX_\mathcal{A} \omega_\mathcal{A} = z 1_\mathcal{A} & (2)\\
\omega_\mathcal{A} = z(X_\mathcal{A}^TX_\mathcal{A})^{-1} 1_\mathcal{A} & (3)
\end{matrix}
$$

(1) The equiangular vector u_A is a unit vector, so its length is 1.

(2) Rewrite (1) in vector form.

(3) Substitute the coefficient vector ω_A from the previous step.

(4) By the properties of the transpose, the first factor can be expanded, and the constant z moved out front.

(5) A matrix multiplied by its inverse is the identity matrix, so the expression simplifies.

(6) Finally, solve for the constant z.

$$
\begin{array}{cc}
\left\|u_{\mathcal{A}}\right\|^{2}=1 & (1)\\
\omega_{\mathcal{A}}^{T} X_{\mathcal{A}}^{T} X_{\mathcal{A}} \omega_{\mathcal{A}}=1 & (2) \\
\left(z\left(X_{\mathcal{A}}^{T} X_{\mathcal{A}}\right)^{-1} 1_{\mathcal{A}}\right)^{T} X_{\mathcal{A}}^{T} X_{\mathcal{A}} z\left(X_{\mathcal{A}}^{T} X_{\mathcal{A}}\right)^{-1} 1_{\mathcal{A}}=1 & (3) \\
z^{2} 1_{\mathcal{A}}^{T}\left(X_{\mathcal{A}}^{T} X_{\mathcal{A}}\right)^{-1}\left(X_{\mathcal{A}}^{T} X_{\mathcal{A}}\right)\left(X_{\mathcal{A}}^{T} X_{\mathcal{A}}\right)^{-1} 1_{\mathcal{A}}=1 & (4) \\
z^21_\mathcal{A}^T\left(X_\mathcal{A}^TX_\mathcal{A}\right)^{-1}1_\mathcal{A} = 1 & (5) \\
z= \frac{1}{\sqrt[]{1_\mathcal{A}^T\left(X_\mathcal{A}^TX_\mathcal{A}\right)^{-1}1_\mathcal{A}}} = \left(1_\mathcal{A}^T\left(X_\mathcal{A}^TX_\mathcal{A}\right)^{-1}1_\mathcal{A}\right)^{-\frac{1}{2} } & (6) \\
\end{array}
$$

(1) Denote the product X_A^T X_A by G_A.

(2) Substituting G_A gives the expression for z.

(3) Substituting G_A gives the expression for ω_A.

$$
\begin{array}{c}
G_{\mathcal{A}}=X_{\mathcal{A}}^{T} X_{\mathcal{A}} & (1)\\
z=\left(1_{\mathcal{A}}^{T}\left(G_{\mathcal{A}}\right)^{-1} 1_{\mathcal{A}}\right)^{-\frac{1}{2}} & (2)\\
\omega_{\mathcal{A}}=z\left(G_{\mathcal{A}}\right)^{-1} 1_{\mathcal{A}} & (3)
\end{array}
$$

Multiplying X_A by ω_A then yields the unit equiangular vector u_A. For a more detailed proof, please refer to Bradley Efron's paper[4].

Finding the step length along the equiangular vector:

(1) μ_A denotes the current prediction vector, which is updated by a length γ along the direction of the equiangular vector.

(2) C denotes the correlations between the feature vectors and the residual vector, i.e. their dot products; substituting equation (1) gives the updated expression for the correlations.

(3) For a selected feature vector, the product of the feature vector with the equiangular vector equals z.

(4) Let a be the vector of products of each feature vector with the equiangular vector.

(5) For a feature vector that has not been selected, substitute a into equation (2).

(6) To add the next feature vector, the absolute values of equations (3) and (5) must be equal, which ensures the correlations after the next update are the same.

(7) Solve for the expression of γ:

$$
\begin{array}{c}
\mu_{A^{+}}=\mu_{A}+\gamma u_{A} & (1)\\
C_{+}=X^{T}\left(y-\mu_{A^{+}}\right)=X^{T}\left(y-\mu_{A}-\gamma u_{A}\right)=C-\gamma X^{T} u_{A} & (2)\\
C_{+}=C-\gamma z \quad(j \in A) & (3)\\
a=X^{T} u_{A} & (4)\\
C_{+}=C_{j}-\gamma a_{j} \quad(j \notin A) & (5)\\
|C-\gamma z|=\left|C-\gamma a_{j}\right| & (6)\\
\gamma=\min _{j \notin A}^{+}\left\{\frac{C-C_{j}}{z-a_{j}}, \frac{C+C_{j}}{z+a_{j}}\right\} & (7)
\end{array}
$$

This γ is the step length along the equiangular vector. For a more detailed proof, please refer to Bradley Efron's paper[4].
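The formulas above for z, ω_A and u_A can be checked numerically. The following sketch (our own, using random data) verifies that the constructed vector really has unit length and the same correlation z with every selected feature vector:

```python
import numpy as np

rng = np.random.default_rng(0)
# Three selected (sign-adjusted) feature vectors as columns
X_a = rng.standard_normal((10, 3))
ones = np.ones(3)

G_a = X_a.T @ X_a                          # G_A = X_A^T X_A
G_a_inv = np.linalg.inv(G_a)
z = 1.0 / np.sqrt(ones @ G_a_inv @ ones)   # the constant z
omega = z * G_a_inv @ ones                 # ω_A = z G_A^{-1} 1_A
u = X_a @ omega                            # equiangular vector u_A

print(np.linalg.norm(u))  # unit length (≈ 1.0)
print(X_a.T @ u)          # every entry equals z
```

This matches the derivation: X_A^T u_A = z 1_A and ‖u_A‖² = z² 1_A^T G_A^{-1} 1_A = 1.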

# 5、 Code implementation

Use Python to implement the Lasso regression algorithm (coordinate descent method):

```python
import numpy as np

def lassoUseCd(X, y, lambdas=0.1, max_iter=1000, tol=1e-4):
    """
    Lasso regression using coordinate descent
    args:
        X - training dataset
        y - target label values
        lambdas - penalty coefficient
        max_iter - maximum number of iterations
        tol - tolerance for the change of the coefficients
    return:
        w - weight coefficients
    """
    # Initialize w to the zero vector
    w = np.zeros(X.shape[1])
    for it in range(max_iter):
        done = True
        # Iterate over all variables
        for i in range(0, len(w)):
            # Record the coefficient of the previous round
            weight = w[i]
            # Find the optimal coefficient under the current conditions
            w[i] = down(X, y, w, i, lambdas)
            # When the change of any coefficient exceeds the tolerance, keep looping
            if (np.abs(weight - w[i]) > tol):
                done = False
        # When all coefficients change little, end the loop
        if (done):
            break
    return w

def down(X, y, w, index, lambdas=0.1):
    """
    cost(w) = (x1 * w1 + x2 * w2 + ... - y)^2 + ... + λ(|w1| + |w2| + ...)
    Treating w1 as the only variable and the others as constants, the cost
    function becomes a univariate quadratic function of w1:
    cost(w1) = (a * w1 + b)^2 + ... + λ|w1| + c   (a, b, c, λ are constants)
    => after expansion
    cost(w1) = aa * w1^2 + 2ab * w1 + λ|w1| + c   (aa, ab, c, λ are constants)
    """
    # Sum of the coefficients of the expanded quadratic terms
    aa = 0
    # Sum of the coefficients of the expanded linear terms
    ab = 0
    for i in range(X.shape[0]):
        # Coefficient of the linear term inside the parentheses
        a = X[i][index]
        # Constant term inside the parentheses
        b = X[i][:].dot(w) - a * w[index] - y[i]
        # The coefficient of the expanded quadratic term is the sum of squares
        # of the linear coefficients inside the parentheses
        aa = aa + a * a
        # The coefficient of the expanded linear term is the sum of the products
        # of the linear coefficient and the constant term inside the parentheses
        ab = ab + a * b
    # Since this is a univariate quadratic function, the minimum is where the
    # derivative is zero; only aa, ab and λ are needed
    return det(aa, ab, lambdas)

def det(aa, ab, lambdas=0.1):
    """
    Solve for w from the derivative of the cost function; at w = 0 it is not differentiable
    det(w) = 2aa * w + 2ab + λ = 0 (w > 0)
    => w = - (2 * ab + λ) / (2 * aa)

    det(w) = 2aa * w + 2ab - λ = 0 (w < 0)
    => w = - (2 * ab - λ) / (2 * aa)

    det(w) = NaN (w = 0)
    => w = 0
    """
    w = - (2 * ab + lambdas) / (2 * aa)
    if w < 0:
        w = - (2 * ab - lambdas) / (2 * aa)
        if w > 0:
            w = 0
    return w
```
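The closed-form single-coordinate update used in the `det` step above can be sanity-checked against a brute-force grid search. This is a standalone sketch restating the same case analysis, not part of the original code:

```python
import numpy as np

def closed_form(aa, ab, lam):
    # Same case analysis as det(): minimizer of aa*w^2 + 2*ab*w + lam*|w|
    w = -(2 * ab + lam) / (2 * aa)
    if w < 0:
        w = -(2 * ab - lam) / (2 * aa)
        if w > 0:
            w = 0.0
    return w

def cost(w, aa, ab, lam):
    return aa * w ** 2 + 2 * ab * w + lam * np.abs(w)

aa, ab, lam = 2.0, -1.5, 0.4
w_star = closed_form(aa, ab, lam)          # 0.65
grid = np.linspace(-5, 5, 200001)
w_grid = grid[np.argmin(cost(grid, aa, ab, lam))]
print(w_star, w_grid)  # the closed form matches the grid minimum
```

At w = 0.65 the derivative 2·aa·w + 2·ab + λ = 2.6 − 3 + 0.4 = 0, confirming the formula for the w > 0 branch.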

Use Python to implement the Lasso regression algorithm (least angle regression):

```python
import numpy as np

def lassoUseLars(X, y, lambdas=0.1, max_iter=1000):
    """
    Lasso regression using least angle regression (LARS)
    Paper: https://web.stanford.edu/~Hastie/papers/lars/leastangle_2002.pdf
    args:
        X - training dataset
        y - target label values
        lambdas - penalty coefficient
        max_iter - maximum number of iterations
    return:
        w - weight coefficients
    """
    n, m = X.shape
    # Indices of the selected features
    active_set = set()
    # Current prediction vector
    cur_pred = np.zeros((n,), dtype=np.float32)
    # Residual vector
    residual = y - cur_pred
    # Dot products of the feature vectors with the residual vector, i.e. the correlations
    cur_corr = X.T.dot(residual)
    # Index of the most correlated feature
    j = np.argmax(np.abs(cur_corr), 0)
    # Add the index to the set of selected features
    active_set.add(j)
    # Initialize the weight coefficients
    w = np.zeros((m,), dtype=np.float32)
    # Record the previous weight coefficients
    prev_w = np.zeros((m,), dtype=np.float32)
    # Record the update direction of each feature
    sign = np.zeros((m,), dtype=np.int32)
    sign[j] = 1
    # Average correlation
    lambda_hat = None
    # Record the previous average correlation
    prev_lambda_hat = None
    for it in range(max_iter):
        # Compute the residual vector
        residual = y - cur_pred
        # Dot products of the feature vectors with the residual vector
        cur_corr = X.T.dot(residual)
        # Current maximum absolute correlation
        largest_abs_correlation = np.abs(cur_corr).max()
        # Current average correlation
        lambda_hat = largest_abs_correlation / n
        # When the average correlation drops below λ, end the iteration early
        # https://github.com/scikit-learn/scikit-learn/blob/2beed55847ee70d363bdbfe14ee4401438fba057/sklearn/linear_model/_least_angle.py#L542
        if lambda_hat <= lambdas:
            if (it > 0 and lambda_hat != lambdas):
                ss = ((prev_lambda_hat - lambdas) / (prev_lambda_hat - lambda_hat))
                # Recompute the weight coefficients
                w[:] = prev_w + ss * (w - prev_w)
            break
        # Update the previous average correlation
        prev_lambda_hat = lambda_hat

        # When all features have been selected, end the iteration
        if len(active_set) >= m:
            break

        # Matrix of the selected feature vectors
        X_a = X[:, list(active_set)]
        # X_A in the paper - formula (2.4)
        X_a *= sign[list(active_set)]
        # G_A in the paper - formula (2.5)
        G_a = X_a.T.dot(X_a)
        G_a_inv = np.linalg.inv(G_a)
        G_a_inv_red_cols = np.sum(G_a_inv, 1)
        # A_A in the paper - formula (2.5)
        A_a = 1 / np.sqrt(np.sum(G_a_inv_red_cols))
        # ω in the paper - formula (2.6)
        omega = A_a * G_a_inv_red_cols
        # Equiangular vector in the paper - formula (2.6)
        equiangular = X_a.dot(omega)
        # a in the paper - formula (2.11)
        cos_angle = X.T.dot(equiangular)
        # γ in the paper
        gamma = None
        # Index of the next selected feature
        next_j = None
        # Direction of the next feature
        next_sign = 0
        for j in range(m):
            if j in active_set:
                continue
            # Computation of γ in the paper - formula (2.13)
            v0 = (largest_abs_correlation - cur_corr[j]) / (A_a - cos_angle[j]).item()
            v1 = (largest_abs_correlation + cur_corr[j]) / (A_a + cos_angle[j]).item()
            if v0 > 0 and (gamma is None or v0 < gamma):
                gamma = v0
                next_j = j
                next_sign = 1
            if v1 > 0 and (gamma is None or v1 < gamma):
                gamma = v1
                next_j = j
                next_sign = -1
        if gamma is None:
            # Computation of γ in the paper - formula (2.21)
            gamma = largest_abs_correlation / A_a

        # Selected feature vectors
        sa = X_a
        # Step along the equiangular vector
        sb = equiangular * gamma
        # Solve the system of linear equations (sa * sx = sb)
        sx = np.linalg.lstsq(sa, sb, rcond=None)[0]
        # Record the previous weight coefficients
        prev_w = w.copy()
        d_hat = np.zeros((m,), dtype=np.float32)
        for i, j in enumerate(active_set):
            # Update the current weight coefficients
            w[j] += sx[i] * sign[j]
            # d_hat in the paper - formula (3.3)
            d_hat[j] = omega[i] * sign[j]
        # γ_j in the paper - formula (3.4)
        gamma_hat = -w / d_hat
        # γ_hat in the paper - formula (3.5)
        gamma_hat_min = float("+inf")
        # Index of γ_hat
        gamma_hat_min_idx = None
        for i, j in enumerate(gamma_hat):
            if j <= 0:
                continue
            if gamma_hat_min > j:
                gamma_hat_min = j
                gamma_hat_min_idx = i
        if gamma_hat_min < gamma:
            # Update the current prediction vector - formula (3.6)
            cur_pred = cur_pred + gamma_hat_min * equiangular
            # Remove the index from the set of selected features
            active_set.remove(gamma_hat_min_idx)
            # Update the set of feature directions
            sign[gamma_hat_min_idx] = 0
        else:
            # Update the current prediction vector
            cur_pred = X.dot(w)
            # Add the index to the set of selected features
            active_set.add(next_j)
            # Update the set of feature directions
            sign[next_j] = next_sign

    return w
```

# 6、 Third-party library implementation

scikit-learn[6] implementation (coordinate descent method):

```python
from sklearn.linear_model import Lasso

# Initialize the Lasso regressor; the coordinate descent method is used by default
reg = Lasso(alpha=0.1, fit_intercept=False)
# Fit the linear model
reg.fit(X, y)
# Weight coefficients
w = reg.coef_
```

scikit-learn[7] implementation (least angle regression):

```python
from sklearn.linear_model import LassoLars

# Initialize the Lasso regressor, using the least angle regression method
reg = LassoLars(alpha=0.1, fit_intercept=False)
# Fit the linear model
reg.fit(X, y)
# Weight coefficients
w = reg.coef_
```

# 7、 Animation demonstration

The following animation shows, for the working-years vs. average-monthly-salary example from the previous section, how the weight coefficients change along only one coordinate-axis direction at a time, gradually approaching the optimal solution.

<center> Coordinate descent method </center>

The following animation shows the influence of λ on the weight coefficients. The horizontal axis is the penalty coefficient λ and the vertical axis is the weight coefficients; each color represents the weight coefficient of one independent variable (the training data comes from the sklearn diabetes dataset):

<center> Influence of λ on the weight coefficients </center>

It can be seen that as λ gradually increases (λ moving to the left), the weight coefficients of some features quickly become zero. This property shows that Lasso regression can be used for feature selection, with λ controlling which key features are kept.
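The feature-selection behaviour described above can be reproduced with a compact, self-contained coordinate-descent sketch (our own condensed version of the section 5 implementation, run on synthetic data, so the names and data here are ours):

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Tiny coordinate-descent Lasso: minimizes SSE + lam * ||w||_1."""
    n, m = X.shape
    w = np.zeros(m)
    for _ in range(n_iter):
        for i in range(m):
            r = y - X @ w + X[:, i] * w[i]   # residual excluding feature i
            aa = X[:, i] @ X[:, i]           # quadratic coefficient
            ab = -X[:, i] @ r                # half the linear coefficient
            w_i = -(2 * ab + lam) / (2 * aa)
            if w_i < 0:
                w_i = -(2 * ab - lam) / (2 * aa)
                if w_i > 0:
                    w_i = 0.0
            w[i] = w_i
    return w

rng = np.random.default_rng(42)
X = rng.standard_normal((100, 5))
# Only features 0 and 3 actually influence y
y = X @ np.array([3.0, 0.0, 0.0, 1.5, 0.0]) + 0.01 * rng.standard_normal(100)

for lam in (0.1, 50.0, 1000.0):
    nonzero = np.sum(np.abs(lasso_cd(X, y, lam)) > 1e-6)
    print(lam, nonzero)  # larger λ leaves fewer nonzero coefficients
```

With a moderate λ only the two truly informative features survive, and with a very large λ every coefficient is driven to zero.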

# 8、 Mind map

# 9、 References

For a complete demonstration, please click here

Note: This article strives to be accurate and easy to understand, but as the author is also a beginner, his level is limited. If the article contains errors or omissions, readers are urged to point them out by leaving a message.