Machine learning algorithm series (V) – Lasso regression algorithm

Time: 2022-6-1

Background knowledge required for reading this article: the linear regression algorithm and a little programming knowledge

1、 Introduction

   In the previous section, we learned that one way to deal with multicollinearity is to regularize the cost function, and one such regularization algorithm is the ridge regression algorithm. In this section we look at another regularization algorithm: the Lasso regression algorithm [1]. The full name of Lasso is the least absolute shrinkage and selection operator.

2、 Model introduction

First, let’s review the cost function of ridge regression: a penalty term is added to the cost function of standard linear regression, namely the penalty coefficient λ times the squared L2 norm of the weight vector w:

$$
\operatorname{Cost}(w) = \sum_{i = 1}^N \left( y_i - w^Tx_i \right)^2 + \lambda\|w\|_{2}^{2}
$$

Like ridge regression, the Lasso regression algorithm also adds a regularization term, but the penalty term is the penalty coefficient λ times the L1 norm of the w vector (the L1 norm of a vector is the sum of the absolute values of its elements), so this regularization method is also called L1 regularization.

$$
\operatorname{Cost}(w) = \sum_{i = 1}^N \left( y_i - w^Tx_i \right)^2 + \lambda\|w\|_1
$$

Similarly, we look for the w that minimizes the cost function:

$$
w = \underset{w}{\operatorname{argmin}}\left(\sum_{i = 1}^N \left( y_i - w^Tx_i \right)^2 + \lambda\|w\|_1\right)
$$
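As a concrete reference, the cost function above can be written in a few lines of NumPy. This is only a minimal sketch for illustration; the names lasso_cost and lam are made up here and do not come from any library:

import numpy as np

def lasso_cost(X, y, w, lam):
    #Squared error term: sum over all samples of (y_i - w^T x_i)^2
    residual = y - X.dot(w)
    #Add lam times the L1 norm of w (sum of absolute values)
    return np.sum(residual ** 2) + lam * np.sum(np.abs(w))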

   Since the L1 norm of the vector contains an absolute value, the cost function is not differentiable everywhere, so there is no way to obtain an analytical solution for w by setting the derivative to zero. Below are two methods for solving for the weight coefficient w: the coordinate descent method [2] (coordinate descent) and least angle regression [3] (Least Angle Regression, LARS).

3、 Algorithm steps

Coordinate descent method:

   As its name suggests, the core of the coordinate descent method is to update the weight coefficients along one coordinate axis direction at a time, iteration after iteration, gradually approaching the optimal solution.

   Specific steps:

(1) Initialize the weight coefficient W, for example, to a zero vector.

(2) Traverse all the weight coefficients: in turn take one weight coefficient as the variable, fix the other weight coefficients at the values from the previous calculation and treat them as constants, and find the optimal solution of the cost function with respect to this single weight coefficient.

   At iteration k, the weight coefficients are updated as follows:

$$
w_i^{(k)} = \underset{w_i}{\operatorname{argmin}} \operatorname{Cost}\left(w_1^{(k)}, \ldots, w_{i-1}^{(k)}, w_i, w_{i+1}^{(k-1)}, \ldots, w_m^{(k-1)}\right)
$$

(3) A full pass through step (2) constitutes one complete iteration. When all weight coefficients change very little, or the maximum number of iterations is reached, the iteration ends.


<center>Coordinate descent (By Nicoguaro – Own work)</center>

As shown in the figure above, in each iteration the other weight coefficients are fixed and only one of them is updated along its coordinate axis direction, until the optimal solution is finally reached. The one-dimensional subproblem solved in step (2) has a simple closed-form solution, shown below.
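For reference, when the other coefficients are held fixed, the cost function as a function of the single coefficient w_i can be written in the form aa * w_i^2 + 2ab * w_i + λ|w_i| + c (the same shorthand aa, ab is used again in the code implementation below). Its minimizer is the standard soft-thresholding expression, stated here without the full derivation:

$$
w_i^{*}=
\begin{cases}
-\dfrac{2ab+\lambda}{2aa}, & 2ab+\lambda<0 \\
-\dfrac{2ab-\lambda}{2aa}, & 2ab-\lambda>0 \\
0, & \text{otherwise}
\end{cases}
$$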

Least angle regression method:

   Least angle regression is also a feature selection method: it computes the correlation of each feature with the residual and gradually builds the optimal solution through a set of mathematical formulas.

   Specific steps:

(1) Initialize the weight coefficient W, for example, to a zero vector.

(2) Initialize the residual vector as y - Xw. Since w is the zero vector at this point, the residual vector is equal to the target label vector y.

(3) Select the feature vector x_i that has the greatest correlation with the residual vector. Move the weight coefficients w along the direction of x_i until another feature vector x_j appears whose correlation with the residual vector is equally large, i.e. the new residual vector has the same correlation with both feature vectors (the residual vector lies along the angle bisector vector of the two feature vectors), then recalculate the residual vector.

(4) Repeat step (3): continue adjusting the weight coefficients w until a third feature vector x_k appears whose correlation with the residual vector is equally large, so that the new residual vector has the same correlation with all three feature vectors (the residual vector lies along the equiangular vector of the three feature vectors), and so on.

(5) When the residual vector is small enough or all feature vectors have been selected, the iteration ends.


<center>Least Angle Regression – Figure 2</center>

   The figure above illustrates the case of only two feature vectors. The initial residual vector is y_2, and x_1 has the greatest correlation with it (the angle between the two vectors is the smallest). First find a θ_11 such that the new residual vector y_2 - x_1 θ_11 lies along the angle bisector of x_1 and x_2 (u_2 in the figure); at this point w_1 = θ_11, w_2 = 0. Then find a pair θ_21, θ_22 such that the new residual vector y_2 - x_1 θ_11 - (x_1 θ_21 + x_2 θ_22) is as small as possible; now w_1 = θ_11 + θ_21, w_2 = θ_22. All feature vectors have been selected, so the iteration ends.

The coordinate descent method is easier to understand than least angle regression: you only need to deal with one weight coefficient at a time. Least angle regression reduces the number of iterations through some ingenious mathematical formulas, so that its worst-case computational complexity is similar to that of the least squares method.

4、 Proof of principle

The Lasso regression cost function is a convex function

   As in the previous section, we only need to prove that:

$$
f\left(\frac{x_{1}+x_{2}}{2}\right) \leq \frac{f\left(x_{1}\right)+f\left(x_{2}\right)}{2}
$$

Left-hand side of the inequality:

$$
\operatorname{Cost}\left(\frac{w_{1}+w_{2}}{2}\right)=\sum_{i=1}^{N}\left[\left(\frac{w_{1}+w_{2}}{2}\right)^{T} x_{i}-y_{i}\right]^{2}+\lambda\left\|\frac{w_{1}+w_{2}}{2}\right\|_{1}
$$

Right-hand side of the inequality:

$$
\frac{\operatorname{Cost}\left(w_{1}\right)+\operatorname{Cost}\left(w_{2}\right)}{2}=\frac{\sum_{i=1}^{N}\left(w_{1}^{T} x_{i}-y_{i}\right)^{2}+\sum_{i=1}^{N}\left(w_{2}^{T} x_{i}-y_{i}\right)^{2}+\lambda\left\|w_{1}\right\|_{1}+\lambda\left\|w_{2}\right\|_{1}}{2}
$$

(1) The first half of both sides of the inequality is the same as in standard linear regression, so it only needs to be proved that the remaining difference is greater than or equal to zero

(2) By the homogeneity of the vector norm, the constant factor can be moved inside the norm

(3) Factor λ out of the expression

(4) By the triangle inequality satisfied by the vector norm, the term in parentheses must be greater than or equal to zero

$$
\begin{aligned}
\Delta &=\lambda\left\|w_{1}\right\|_{1}+\lambda\left\|w_{2}\right\|_{1}-2 \lambda\left\|\frac{w_{1}+w_{2}}{2}\right\|_{1} & (1) \\
&=\lambda\left\|w_{1}\right\|_{1}+\lambda\left\|w_{2}\right\|_{1}-\lambda\left\|w_{1}+w_{2}\right\|_{1} & (2) \\
&=\lambda\left(\left\|w_{1}\right\|_{1}+\left\|w_{2}\right\|_{1}-\left\|w_{1}+w_{2}\right\|_{1}\right) & (3) \\
& \geq 0 & (4)
\end{aligned}
$$

Since the penalty coefficient λ is chosen to be a non-negative real number, the final result must be greater than or equal to zero, which shows that the Lasso cost function is convex.
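As a quick numerical sanity check (not part of the original proof), the midpoint inequality can be verified for randomly chosen w_1 and w_2; the data and names below are made up for illustration only:

import numpy as np

def cost(X, y, w, lam):
    #Lasso cost: squared error plus lam times the L1 norm of w
    return np.sum((y - X.dot(w)) ** 2) + lam * np.sum(np.abs(w))

np.random.seed(0)
X = np.random.randn(50, 5)
y = np.random.randn(50)
w1, w2 = np.random.randn(5), np.random.randn(5)
lam = 0.5

left = cost(X, y, (w1 + w2) / 2, lam)
right = (cost(X, y, w1, lam) + cost(X, y, w2, lam)) / 2
#Convexity implies left <= right for any w1, w2 and any lam >= 0
print(left <= right)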

Mathematical formulas of the least angle regression method

   Finding the unit angle bisector vector:

(1) Let the unit angle bisector vector be u_A, which can be regarded as a linear combination of the selected feature vectors; the subscript A denotes the index set of the selected features

(2) Since the angle between each selected feature vector and the angle bisector vector is the same, every element of the dot product of the selected feature vectors with the angle bisector vector must be equal. 1_A is a vector whose elements are all 1, and z is a constant

(3) From equation (2), the coefficient vector ω_A of the linear combination can be expressed as

$$
\begin{matrix}
u_\mathcal{A} = X_\mathcal{A} \omega_\mathcal{A} & (1)\\
X_\mathcal{A}^Tu_\mathcal{A} = X_\mathcal{A}^TX_\mathcal{A} \omega_\mathcal{A} = z 1_\mathcal{A} & (2)\\
\omega_\mathcal{A} = z(X_\mathcal{A}^TX_\mathcal{A})^{-1} 1_\mathcal{A} & (3)
\end{matrix}
$$

(1) The angle bisector vector u_A is a unit vector, so its length is 1

(2) Rewrite in vector form

(3) Substitute the coefficient vector ω_A of the linear combination from the previous step

(4) By the properties of the transpose, the parentheses of the first term can be expanded and the constant z pulled out

(5) A matrix multiplied by its inverse is the identity matrix, so the expression can be simplified

(6) Finally, the expression for the constant z is obtained

$$
\begin{array}{cc}
\left\|u_{\mathcal{A}}\right\|^{2}=1 & (1)\\
\omega_{\mathcal{A}}^{T} X_{\mathcal{A}}^{T} X_{\mathcal{A}} \omega_{\mathcal{A}}=1 & (2) \\
\left(z\left(X_{\mathcal{A}}^{T} X_{\mathcal{A}}\right)^{-1} 1_{\mathcal{A}}\right)^{T} X_{\mathcal{A}}^{T} X_{\mathcal{A}} z\left(X_{\mathcal{A}}^{T} X_{\mathcal{A}}\right)^{-1} 1_{\mathcal{A}}=1 & (3) \\
z^{2} 1_{\mathcal{A}}^{T}\left(X_{\mathcal{A}}^{T} X_{\mathcal{A}}\right)^{-1}\left(X_{\mathcal{A}}^{T} X_{\mathcal{A}}\right)\left(X_{\mathcal{A}}^{T} X_{\mathcal{A}}\right)^{-1} 1_{\mathcal{A}}=1 & (4) \\
z^21_\mathcal{A}^T\left(X_\mathcal{A}^TX_\mathcal{A}\right)^{-1}1_\mathcal{A} = 1 & (5) \\
z= \frac{1}{\sqrt[]{1_\mathcal{A}^T\left(X_\mathcal{A}^TX_\mathcal{A}\right)^{-1}1_\mathcal{A}}} = \left(1_\mathcal{A}^T\left(X_\mathcal{A}^TX_\mathcal{A}\right)^{-1}1_\mathcal{A}\right)^{-\frac{1}{2} } & (6) \\
\end{array}
$$

(1) Denote the product of the transpose of the selected feature matrix with itself by G_A

(2) Substituting G_A gives the expression for z

(3) Substituting G_A gives the expression for ω_A

$$
\begin{array}{c}
G_{\mathcal{A}}=X_{\mathcal{A}}^{T} X_{\mathcal{A}} & (1)\\
z=\left(1_{\mathcal{A}}^{T}\left(G_{\mathcal{A}}\right)^{-1} 1_{\mathcal{A}}\right)^{-\frac{1}{2}} & (2)\\
\omega_{\mathcal{A}}=z\left(G_{\mathcal{A}}\right)^{-1} 1_{\mathcal{A}} & (3)
\end{array}
$$

   Multiplying X_A by ω_A gives the unit angle bisector vector u_A. For a more detailed proof, please refer to Bradley Efron’s paper [4].
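The formulas for G_A, z, ω_A and u_A above translate directly into NumPy. The sketch below assumes X_a already holds the selected (sign-adjusted) feature vectors as columns; the function and variable names are illustrative only:

import numpy as np

def equiangular_vector(X_a):
    #G_A = X_A^T X_A - equation (1)
    G_a = X_a.T.dot(X_a)
    ones = np.ones(X_a.shape[1])
    #Solve G_A x = 1_A, i.e. x = (G_A)^{-1} 1_A
    G_a_inv_ones = np.linalg.solve(G_a, ones)
    #z = (1_A^T (G_A)^{-1} 1_A)^(-1/2) - equation (2)
    z = 1.0 / np.sqrt(ones.dot(G_a_inv_ones))
    #ω_A = z (G_A)^{-1} 1_A - equation (3)
    omega = z * G_a_inv_ones
    #Unit angle bisector vector u_A = X_A ω_A
    u_a = X_a.dot(omega)
    return z, omega, u_a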

   Finding the step length γ along the angle bisector vector:

(1) μ_A denotes the current prediction vector, which is updated by a step of length γ along the direction of the angle bisector vector

(2) C denotes the correlation between the feature vectors and the residual vector, i.e. their dot product; substituting equation (1) gives the updated expression of the correlation

(3) For a feature vector that has already been selected, the product of each selected feature vector with the angle bisector vector is the same, and equal to z

(4) Let a be the vector of products of every feature vector with the angle bisector vector

(5) For a feature vector that has not yet been selected, substitute a into equation (2)

(6) To add the next feature vector, the absolute values of equations (3) and (5) must be equal, which ensures that the correlations are equal after the next update

(7) Solve for the expression of γ

$$
\begin{array}{c}
\mu_{A^{+}}=\mu_{A}+\gamma u_{A} & (1)\\
C_{+}=X^{T}\left(y-\mu_{A^{+}}\right)=X^{T}\left(y-\mu_{A}-\gamma u_{A}\right)=C-\gamma X^{T} u_{A} & (2)\\
C_{+}=C-\gamma z \quad(j \in A) & (3)\\
a=X^{T} u_{A} & (4)\\
C_{+}=C_{j}-\gamma a_{j} \quad(j \notin A) & (5)\\
|C-\gamma z|=\left|C-\gamma a_{j}\right| & (6)\\
\gamma=\min _{j \notin A}^{+}\left\{\frac{C-C_{j}}{z-a_{j}}, \frac{C+C_{j}}{z+a_{j}}\right\} & (7)
\end{array}
$$

   This γ is the step length along the angle bisector vector. For a more detailed proof, please refer to Bradley Efron’s paper [4].
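The step length in equation (7) can be computed as in the sketch below, which mirrors the selection loop used in the implementation of the next section. It assumes C is the current maximum absolute correlation, corr holds the correlations C_j of all features, a = X^T u_A, z is the constant derived above, and active is the set of selected feature indices; min+ means taking the minimum over positive candidates only, and edge cases such as z = a_j are ignored here:

def step_length(C, corr, a, z, active):
    #γ = min+ over j not in the active set of (C - C_j)/(z - a_j) and (C + C_j)/(z + a_j)
    gamma = None
    for j in range(len(corr)):
        if j in active:
            continue
        for v in ((C - corr[j]) / (z - a[j]), (C + corr[j]) / (z + a[j])):
            #Keep only positive candidates and track the smallest one
            if v > 0 and (gamma is None or v < gamma):
                gamma = v
    return gamma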

5、 Code implementation

Implementation of the Lasso regression algorithm in Python (coordinate descent method):

import numpy as np

def lassoUseCd(X, y, lambdas=0.1, max_iter=1000, tol=1e-4):
    """
    Lasso regression using coordinate descent
    args:
        X - training dataset
        y - target label values
        lambdas - penalty coefficient
        max_iter - maximum number of iterations
        tol - tolerance for the change of the coefficients
    return:
        w - weight coefficients
    """
    #Initialize w to zero vector
    w = np.zeros(X.shape[1])
    for it in range(max_iter):
        done = True
        #Traverse all the weight coefficients
        for i in range(0, len(w)):
            #Record the coefficient value from the previous round
            weight = w[i]
            #Find out the best coefficient under current conditions
            w[i] = down(X, y, w, i, lambdas)
            #If the change of any coefficient exceeds the tolerance, another iteration is needed
            if (np.abs(weight - w[i]) > tol):
                done = False
        #When all coefficients change little, end the cycle
        if (done):
            break
    return w

def down(X, y, w, index, lambdas=0.1):
    """
    cost(w) = (x1 * w1 + x2 * w2 + ... - y)^2 + ... + λ(|w1| + |w2| + ...)
    Treat w1 as the only variable and all other values as constants. The cost function then becomes a univariate quadratic function of w1 (plus an absolute value term), which can be written as:
    cost(w1) = (a * w1 + b)^2 + ... + λ|w1| + c    (a, b, c, λ are constants)
    => after expansion
    cost(w1) = aa * w1^2 + 2ab * w1 + λ|w1| + c    (aa, ab, c, λ are constants)
    """
    #Quadratic term coefficient of the expanded cost function (accumulated below)
    aa = 0
    #Half of the linear term coefficient of the expanded cost function (accumulated below)
    ab = 0
    for i in range(X.shape[0]):
        #Coefficient of the linear term inside the parentheses
        a = X[i][index]
        #Constant term inside the parentheses
        b = X[i][:].dot(w) - a * w[index] - y[i]
        #The quadratic term coefficient after expansion is the sum of squares of the linear coefficients inside the parentheses
        aa = aa + a * a
        #The linear term coefficient after expansion is twice the sum of the products of the linear coefficient and the constant term inside the parentheses
        ab = ab + a * b
    #Since this is a univariate quadratic function plus an absolute value term, the minimum is where the (sub)derivative is zero; only aa, ab and λ are needed
    return det(aa, ab, lambdas)

def det(aa, ab, lambdas=0.1):
    """
    Solve for w by setting the derivative of the cost function to zero; the cost function is not differentiable at w = 0
    det(w) = 2aa * w + 2ab + λ = 0 (w > 0)
    => w = - (2 * ab + λ) / (2 * aa)

    det(w) = 2aa * w + 2ab - λ = 0 (w < 0)
    => w = - (2 * ab - λ) / (2 * aa)

    det(w) = NaN (w = 0)
    => w = 0
    """
    w = - (2 * ab + lambdas) / (2 * aa)
    if w < 0:
        w = - (2 * ab - lambdas) / (2 * aa)
        if w > 0:
            w = 0
    return w
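A quick usage example of the coordinate descent implementation above; the synthetic data and the true weights are made up purely for illustration:

import numpy as np

np.random.seed(42)
X = np.random.randn(100, 3)
#Hypothetical true weights; the third feature does not contribute to y
y = X.dot(np.array([2.0, -1.0, 0.0])) + 0.1 * np.random.randn(100)

w = lassoUseCd(X, y, lambdas=0.1)
#The third coefficient is expected to be shrunk toward (or exactly to) zero
print(w)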

Implementation of the Lasso regression algorithm in Python (least angle regression method):

import numpy as np

def lassoUseLars(X, y, lambdas=0.1, max_iter=1000):
    """
    Lasso regression using least angle regression
    Paper: https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf
    args:
        X - training dataset
        y - target label values
        lambdas - penalty coefficient
        max_iter - maximum number of iterations
    return:
        w - weight coefficients
    """
    n, m = X.shape
    #Selected feature subscript
    active_set = set()
    #Current prediction vector
    cur_pred = np.zeros((n,), dtype=np.float32)
    #Residual vector
    residual = y - cur_pred
    #Dot product of the feature vectors and the residual vector, i.e. the correlations
    cur_corr = X.T.dot(residual)
    #Select the most relevant subscript
    j = np.argmax(np.abs(cur_corr), 0)
    #Add a subscript to the selected feature subscript set
    active_set.add(j)
    #Initialize weight coefficient
    w = np.zeros((m,), dtype=np.float32)
    #Record the last weighting factor
    prev_w = np.zeros((m,), dtype=np.float32)
    #Record feature update direction
    sign = np.zeros((m,), dtype=np.int32)
    sign[j] = 1
    #Average correlation
    lambda_hat = None
    #Record last average correlation
    prev_lambda_hat = None
    for it in range(max_iter):
        #Calculate residual vector
        residual = y - cur_pred
        #Dot product of the feature vectors and the residual vector
        cur_corr = X.T.dot(residual)
        #Current correlation maximum
        largest_abs_correlation = np.abs(cur_corr).max()
        #Calculate current average correlation
        lambda_hat = largest_abs_correlation / n
        #When the average correlation is less than λ, end the iteration early
        # https://github.com/scikit-learn/scikit-learn/blob/2beed55847ee70d363bdbfe14ee4401438fba057/sklearn/linear_model/_least_angle.py#L542
        if lambda_hat <= lambdas:
            if (it > 0 and lambda_hat != lambdas):
                ss = ((prev_lambda_hat - lambdas) / (prev_lambda_hat - lambda_hat))
                #Recalculate weight factor
                w[:] = prev_w + ss * (w - prev_w)
            break
        #Update last average correlation
        prev_lambda_hat = lambda_hat

        #When all features are selected, the iteration ends
        if len(active_set) > m:
            break

        #Matrix of the selected feature vectors
        X_a = X[:, list(active_set)]
        #X_A in the paper - equation (2.4)
        X_a *= sign[list(active_set)]
        #G_A in the paper - equation (2.5)
        G_a = X_a.T.dot(X_a)
        G_a_inv = np.linalg.inv(G_a)
        G_a_inv_red_cols = np.sum(G_a_inv, 1)     
        #A_A in the paper - equation (2.5)
        A_a = 1 / np.sqrt(np.sum(G_a_inv_red_cols))
        #ω in the paper - equation (2.6)
        omega = A_a * G_a_inv_red_cols
        #Angle bisector (equiangular) vector in the paper - equation (2.6)
        equiangular = X_a.dot(omega)
        #a in the paper - equation (2.11)
        cos_angle = X.T.dot(equiangular)
        #γ in the paper
        gamma = None
        #Next selected feature subscript
        next_j = None
        #Direction of next feature
        next_sign = 0
        for j in range(m):
            if j in active_set:
                continue
            #γ in the paper - equation (2.13)
            v0 = (largest_abs_correlation - cur_corr[j]) / (A_a - cos_angle[j]).item()
            v1 = (largest_abs_correlation + cur_corr[j]) / (A_a + cos_angle[j]).item()
            if v0 > 0 and (gamma is None or v0 < gamma):
                gamma = v0
                next_j = j
                next_sign = 1
            if v1 > 0 and (gamma is None or v1 < gamma):
                gamma = v1
                next_j = j
                next_sign = -1
        if gamma is None:
            #γ in the paper - equation (2.21)
            gamma = largest_abs_correlation / A_a

        #Matrix of the selected feature vectors
        sa = X_a
        #Angle bisector vector scaled by the step length γ
        sb = equiangular * gamma
        #Solve the linear system sa * sx = sb (in the least squares sense)
        sx = np.linalg.lstsq(sa, sb, rcond=None)
        #Record the last weighting factor
        prev_w = w.copy()
        #d_hat must be a float array, otherwise the ω values below would be truncated to integers
        d_hat = np.zeros((m,), dtype=np.float32)
        for i, j in enumerate(active_set):
            #Update the current weight factor
            w[j] += sx[0][i] * sign[j]
            #d_hat in the paper - equation (3.3)
            d_hat[j] = omega[i] * sign[j]
        #γ_j in the paper - equation (3.4)
        gamma_hat = -w / d_hat
        #γ_hat in the paper - equation (3.5)
        gamma_hat_min = float("+inf")
        #Index corresponding to γ_hat in the paper
        gamma_hat_min_idx = None
        for i, j in enumerate(gamma_hat):
            if j <= 0:
                continue
            if gamma_hat_min > j:
                gamma_hat_min = j
                gamma_hat_min_idx = i
        if gamma_hat_min < gamma:
            #Update the current prediction vector - equation (3.6)
            cur_pred = cur_pred + gamma_hat_min * equiangular
            #Remove the index from the selected feature index set
            active_set.remove(gamma_hat_min_idx)
            #Update feature update direction set
            sign[gamma_hat_min_idx] = 0
        else:
            #Update the current prediction vector
            cur_pred = X.dot(w)
            #Add a subscript to the selected feature subscript set
            active_set.add(next_j)
            #Update feature update direction set
            sign[next_j] = next_sign

    return w
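A corresponding usage example of the least angle regression implementation above. Note that in this implementation the stopping criterion compares lambdas with the maximum correlation divided by the number of samples (following the scikit-learn convention linked in the code), so the same lambdas value is not directly comparable with the coordinate descent version:

import numpy as np

np.random.seed(42)
X = np.random.randn(100, 3)
#Hypothetical true weights; the third feature does not contribute to y
y = X.dot(np.array([2.0, -1.0, 0.0])) + 0.1 * np.random.randn(100)

w = lassoUseLars(X, y, lambdas=0.1)
#The coefficient of the irrelevant third feature is expected to be driven to zero
print(w)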

6、 Third party library implementation

scikit-learn [6] implementation (coordinate descent method):

from sklearn.linear_model import Lasso

#Initialize lasso regressor, using coordinate descent method by default
reg = Lasso(alpha=0.1, fit_intercept=False)
#Fitting linear model
reg.fit(X, y)
#Weight coefficient
w = reg.coef_

scikit-learn [7] implementation (least angle regression method):

from sklearn.linear_model import LassoLars

#Initialize the Lasso regressor, using the least angle regression method
reg = LassoLars(alpha=0.1, fit_intercept=False)
#Fitting linear model
reg.fit(X, y)
#Weight coefficient
w = reg.coef_
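As a small sanity check, the two scikit-learn estimators can be fitted on the same data; with the same alpha they optimize the same objective and are expected to produce very similar coefficients (synthetic data, for illustration only):

import numpy as np
from sklearn.linear_model import Lasso, LassoLars

np.random.seed(0)
X = np.random.randn(100, 3)
#The third feature does not contribute to y
y = X.dot(np.array([2.0, -1.0, 0.0])) + 0.1 * np.random.randn(100)

reg_cd = Lasso(alpha=0.1, fit_intercept=False).fit(X, y)
reg_lars = LassoLars(alpha=0.1, fit_intercept=False).fit(X, y)
#The two solvers are expected to agree up to numerical tolerance
print(reg_cd.coef_)
print(reg_lars.coef_)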

7、 Animation demonstration

The following animation uses the working-years and average-monthly-salary example from the previous section to show how the weight coefficients are changed along only one coordinate axis direction at a time, gradually approaching the optimal solution.


<center>Coordinate descent method</center>

The following animation shows the influence of λ on the weight coefficients. The horizontal axis is the penalty coefficient λ, the vertical axis is the weight coefficient, and each color represents the weight coefficient of one independent variable (the training data comes from the sklearn diabetes dataset):


<center>Influence of λ on the weight coefficients</center>

We can see that as λ gradually increases (moving to the left in the figure), the weight coefficients of some features quickly become zero. This property shows that Lasso regression can be used for feature selection: by controlling λ, the key features can be selected.
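The behaviour described above can be reproduced with scikit-learn's lasso_path function on the diabetes dataset; the sketch below only prints the number of non-zero coefficients at the two ends of the path (the plotting itself is omitted):

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import lasso_path

X, y = load_diabetes(return_X_y=True)
#alphas are returned in decreasing order; coefs has shape (n_features, n_alphas)
alphas, coefs, _ = lasso_path(X, y, n_alphas=100)
print(np.sum(coefs[:, 0] != 0), "non-zero coefficients at the largest alpha")
print(np.sum(coefs[:, -1] != 0), "non-zero coefficients at the smallest alpha")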

8、 Mind map

9、 References

  1. https://en.wikipedia.org/wiki…(statistics)
  2. https://en.wikipedia.org/wiki…
  3. https://en.wikipedia.org/wiki…
  4. https://web.stanford.edu/~has…
  5. https://zhuanlan.zhihu.com/p/…
  6. https://scikit-learn.org/stab…
  7. https://scikit-learn.org/stab…

For a complete demonstration, please click here

Note: This article strives to be accurate and easy to understand, but since the author is also a beginner and his level is limited, there may be errors or omissions in the article; readers are kindly asked to point them out by leaving a message.

This article was first published on AI map. You are welcome to follow.