A case study of R language penalized logistic regression (Lasso, ridge regression) high dimensional variable selection classification model

Time：2021-6-9

Logistic regression is a common method in research, which can screen influencing factors, predict probability and classify. For example, the data obtained by high pass sequencing technology in medical research challenges the selection of high-dimensional variables. Penalized logistic regression can select variables and estimate coefficients for high-dimensional data, and its effective algorithm ensures the feasibility of calculation. Methods the commonly used penalty logistic algorithms such as lasso and ridge regression were introduced.

method

As we have seen before, the classical estimation technique for estimating the parameters of parametric models is to use the maximum likelihood method. More specifically,

The objective function here only focuses on goodness of fit. But often, in econometrics, we believe that simple theories are preferable to more complex ones. So we want to punish models that are too complex.

That’s a good idea. This is often mentioned in Econometrics textbooks, but the choice of models usually does not involve reasoning. Usually, we use the maximum likelihood method to estimate the parameters, and then use AIC or BIC to compare the two models. Akaike (AIC) standard is based on

We have a measure of goodness of fit on the left, while on the right, the penalty increases with the “complexity” of the model.

Here, complexity is the number of variables used. But suppose we don’t make variable selection, we consider the regression of all covariates. definition

AIC can be written as

In fact, this is our objective function. More specifically, we will consider

In this article, I want to discuss the numerical algorithms for solving this optimization problem, for L1 (ridge regression) and L2 (lasso regression).

Standardization of covariates

Here we use the infarct data observed from patients in the emergency room, and we want to know who survived and get a predictive model. The first step is to consider all the covariates X_ The linear transformation of JXJ to normalize variables (with unit variance)

`` for(j in 1:7) X[,j] = (X[,j]-mean(X[,j]))/sd(X[,j]) ``

Ridge regression

Before running some code, think back to the problem we want to solve

Considering the log likelihood of Gaussian variables, the sum of squares of the residuals is obtained, and the explicit solution is obtained. But not in the case of logistic regression.
The heuristic method of ridge regression is shown in the figure below. In the background, we can visualize the (two-dimensional) log likelihood of logistic regression. If we rewire the optimization problem as a constrained optimization problem, the blue circle is our constraint

Can be written equivalently (this is a strictly convex problem)

Therefore, the maximum constrained value should be on the blue disk

``````b0=bbeta[1]
beta=bbeta[-1]
sum(-y*log(1 + exp(-(b0+X%*%beta))) -
(1-y)*log(1 + exp(b0+X%*%beta)))}
u = seq(-4,4,length=251)
v = outer(u,u,function(x,y) LogLik(c(1,x,y)))

lines(u,sqrt(1-u^2),type="l",lwd=2,col="blue")
lines(u,-sqrt(1-u^2),type="l",lwd=2,col="blue")``````

Let’s consider the objective function, the following code

`````` -sum(-y*log(1 + exp(-(b0+X%*%beta))) - (1-y)*
log(1 + exp(b0+X%*%beta)))+lambda*sum(beta^2) ``````

Why not try a standard optimizer? We mentioned that it’s not wise to use optimization routines because they are strongly dependent on starting points.

`````` beta_init = lm(y~.,)\$coefficients

for(i in 1:1000){
vpar[i,] = optim(par = beta_init*rnorm(8,1,2),
function(x) LogLik(x,lambda), method = "BFGS", control = list(abstol=1e-9))\$par}
par(mfrow=c(1,2))
plot(density(vpar[,2]) ``````

Obviously, even if we change the starting point, it seems that we are converging towards the same value. It can be considered the best.

It will then be used for calculation βλ Code for

`````` beta_init = lm(y~.,data )\$coefficients
logistic_opt = optim(par = beta_init*0, function(x) LogLik(x,lambda),
method = "BFGS", control=list(abstol=1e-9)) ``````

We can make it βλ The evolution of the model is visualized as λ Function of

`````` v_lambda = c(exp(seq(-2,5,length=61)))

plot(v_lambda,est_ridge[1,],col=colrs[1])
for(i in 2:7) lines(v_lambda,est_ridge[i,],``````

It seems to make sense: we can observe λ The contraction at the time of increase.

Ridge, using netwon Raphson algorithm

As we have seen, we can also use Newton Raphson to solve this problem. There is no penalty term, the algorithm is

among

therefore

Then there’s the code

`````` for(j in 1:7) X[,j] = (X[,j]-mean(X[,j]))/sd(X[,j])

for(s in 1:9){
pi = exp(X%*%beta[,s])/(1+exp(X%*%beta[,s]))

B = solve(t(X)%*%Delta%*%X+2*lambda*diag(ncol(X))) %*% (t(X)%*%Delta%*%z)
beta = cbind(beta,B)}
beta[,8:10]
[,1] [,2] [,3]
XInter 0.59619654 0.59619654 0.59619654
XFRCAR 0.09217848 0.09217848 0.09217848
XINCAR 0.77165707 0.77165707 0.77165707
XINSYS 0.69678521 0.69678521 0.69678521
XPRDIA -0.29575642 -0.29575642 -0.29575642
XPAPUL -0.23921101 -0.23921101 -0.23921101
XPVENT -0.33120792 -0.33120792 -0.33120792
XREPUL -0.84308972 -0.84308972 -0.84308972``````

Again, it seems to converge very fast.

Interestingly, through this algorithm, we can also get the variance of the estimator

Then according to λ Function calculation βλ Code for

`````` for(s in 1:20){
pi = exp(X%*%beta[,s])/(1+exp(X%*%beta[,s]))
diag(Delta)=(pi*(1-pi))
z = X%*%beta[,s] + solve(Delta)%*%(Y-pi)
B = solve(t(X)%*%Delta%*%X+2*lambda*diag(ncol(X))) %*% (t(X)%*%Delta%*%z)
beta = cbind(beta,B)}
Varz = solve(Delta)
Varb = solve(t(X)%*%Delta%*%X+2*lambda*diag(ncol(X))) %*% t(X)%*% Delta %*% Varz %*%
Delta %*% X %*% solve(t(X)%*%Delta%*%X+2*lambda*diag(ncol(X))) ``````

We can visualize it βλ The evolution of (as λ (function of)

`````` plot(v_lambda,est_ridge[1,],col=colrs[1],type="l")
for(i in 2:7) lines(v_lambda,est_ridge[i,],col=colrs[i])``````

The evolution of variance is obtained

In retrospect, when λ= 0 (on the left side of the graph), β 0= β MCO (no punishment). So, when λ The results show that (I) the bias increases (the estimate tends to zero) and (II) the variance decreases.

Using glmnetridge regression

As usual, there are R functions available for ridge regression. Let’s use the glmnet function, α= 0

`````` for(j in 1:7) X[,j] = (X[,j]-mean(X[,j]))/sd(X[,j])
glmnet(X, y, alpha=0)
plot(glm_ridge,xvar="lambda")``````

As the standard norm of L1,

Ridge regression with orthogonal covariates when the covariates are orthogonal, an interesting example is obtained. This can be obtained by principal component analysis of covariates.

`` get_pca_ind(pca)\$coord``

Let’s do ridge regression for these (orthogonal) covariates

`````` glm_ridge = glmnet(pca_X, y, alpha=0)
plot(glm_ridge,xvar="lambda",col=colrs,lwd=2)``````

We clearly observed the contraction of parameters, i.e

application

Let’s try the second set of data

We can try all kinds of things λ Value of

``````glmnet(cbind(df0\$x1,df0\$x2), df0\$y==1, alpha=0)
plot(reg,xvar="lambda",col=c("blue","red"),lwd=2)
abline(v=log(.2)) ``````

perhaps

`````` abline(v=log(1.2))
plot_lambda(1.2)``````

The next step is to change the standard of punishment and use the L1 standard norm.

Standardization of covariates

As mentioned earlier, the first step is to consider all covariates X_ The linear transformation of JXJ makes variables standardized (with unit variance)

`````` for(j in 1:7) X[,j] = (X[,j]-mean(X[,j]))/sd(X[,j])
X = as.matrix(X)``````

Ridge regression

The heuristic method of lasso lasso regression is shown in the figure below. In the background, we can visualize the (two-dimensional) log likelihood of logistic regression, and the blue square is our constraint. If we reconsider the optimization problem as a constrained optimization problem,

``````LogLik = function(bbeta){

sum(-y*log(1 + exp(-(b0+X%*%beta))) -
(1-y)*log(1 + exp(b0+X%*%beta)))}

polygon(c(-1,0,1,0),c(0,1,0,-1),border="blue")``````

The advantage here is that it can be used as a variable selection tool.

Heuristic, the mathematical explanation is as follows. Consider a simple regression equation y_ i=xi β+ε， have   L1 penalty and L2 loss function. The optimization problem becomes

The first order condition can be written as

The solution is

Optimization process

`````` logistic_opt = optim(par = beta_init*0, function(x) PennegLogLik(x,lambda),
hessian=TRUE, method = "BFGS", control=list(abstol=1e-9))

plot(v_lambda,est_lasso[1,],col=colrs[1],type="l")
for(i in 2:7) lines(v_lambda,est_lasso[i,],col=colrs[i],lwd=2)``````

The results are volatile.

Using glmnet

For comparison, using the R program for Lasso, we get the following

`` plot(glm_lasso,xvar="lambda",col=colrs,lwd=2)``

If we look closely at the output, we can see that there is variable selection, in a sense, β j， λ= 0, which means “really zero.”.

``````,lambda=exp(-4))\$beta
7x1 sparse Matrix of class "dgCMatrix"
s0
FRCAR .
INCAR 0.11005070
INSYS 0.03231929
PRDIA .
PAPUL .
PVENT -0.03138089
REPUL -0.20962611``````

We can’t expect null values without optimization routines

`````` opt_lasso(.2)
FRCAR INCAR INSYS PRDIA
0.4810999782 0.0002813658 1.9117847987 -0.3873926427
PAPUL PVENT REPUL
-0.0863050787 -0.4144139379 -1.3849264055``````

Orthogonal covariates

Before learning mathematics, please note that there are some very clear “variable” selection processes when covariates are orthogonal,

`````` pca = princomp(X)
pca_X = get_pca_ind(pca)\$coord

plot(glm_lasso,xvar="lambda",col=colrs)
plot(glm_lasso,col=colrs)``````

Standard Lasso

If we go back to the original lasso method, the goal is to solve the problem

Note that intercept is not penalized. The first order condition is

that is

We can check whether bf0 contains at least subdifferential.

For items on the left

So the equation can be written

Then we simplify them into a set of rules to examine our solutions

We can make it β J is decomposed into the sum of positive and negative parts β Replace J with β j+- β J -, where β j+， β j-≥0。 The lasso problem becomes

order α j+， α J − denotes β j+， β Lagrange multiplier of J −.

In order to satisfy the condition of stationarity, we take Lagrange’s theorem β J + and set it to zero

We are right β J − do the same to get

For convenience, the soft threshold function is introduced

Pay attention to optimization

You can also write

Observed

This is a coordinate update.
Now, if we consider a (slightly) more general problem, there are weights in the first part

The coordinates are updated to

Back to our original question.

Lasso lasso logistic regression

The logic problem can be expressed as quadratic programming problem. Think back to logarithmic likelihood here

This is the concave function of the parameter. Therefore, we can use the quadratic approximation of log likelihood – Taylor expansion,

Where Z_ Izi is

PI is prediction

In this way, we get a penalized least square problem. We can use the previous method

`````` beta0 = sum(y-X%*%beta)/(length(y))
beta0list[j+1] = beta0
betalist[[j+1]] = beta
obj[j] = (1/2)*(1/length(y))*norm(omega*(z - X%*%beta -
beta0*rep(1,length(y))),'F')^2 + lambda*sum(abs(beta))

if (norm(rbind(beta0list[j],betalist[[j]]) -
rbind(beta0,beta),'F') &lt; tol) { break }
}
return(list(obj=obj[1:j],beta=beta,intercept=beta0)) }``````

It looks like the result of a call to glmnet, for something large enough λ， We do have free ingredients.

Application on the second data set

Now consider a second dataset with two covariates. The code to get lasso estimation is

`````` plot_l = function(lambda){
m = apply(df0,2,mean)
s = apply(df0,2,sd)
for(j in 1:2) df0[,j] &
reg = glmnet(cbind(df0\$x1,df0\$x2), df0\$y==1, alpha=1,lambda=lambda)
u = seq(0,1,length=101)
p = function(x,y){

predict(reg,newx=cbind(x1=xt,x2=yt),type="response")}

image(u,u,v,col=clr10,breaks=(0:10)/10)
points(df\$x1,df\$x2,pch=c(1,19)[1+z],cex=1.5)

Considering some small values of lambda, we only have some degree of parameter reduction

`````` plot(reg,xvar="lambda",col=c("blue","red"),lwd=2)
abline(v=exp(-2.8))
plot_l(exp(-2.8))``````

But use a larger one λ， Then there is variable selection β 1， λ= 0

Most popular insights

What is “hybrid cloud”?

In this paper, we define the concept of “hybrid cloud”, explain four different cloud deployment models of hybrid cloud, and deeply analyze the industrial trend of hybrid cloud through a series of data and charts. 01 introduction Hybrid cloud is a computing environment that integrates multiple platforms and data centers. Generally speaking, hybrid cloud is […]