7. Lasso regression

Time: 2022-6-19


1. prepare input data

load("TCGA-KIRC_sur_model.Rdata")
ls()
#> [1] "exprSet" "meta"
exprSet[1:4,1:4]
#>              TCGA-A3-3307-01A-01T-0860-13 TCGA-A3-3308-01A-02R-1324-13
#> hsa-let-7a-1                     11.94363                     13.16162
#> hsa-let-7a-2                     12.97324                     14.17303
#> hsa-let-7a-3                     12.04630                     13.18481
#> hsa-let-7b                       13.76790                     14.51511
#>              TCGA-A3-3311-01A-02R-1324-13 TCGA-A3-3313-01A-02R-1324-13
#> hsa-let-7a-1                     12.26391                     12.38326
#> hsa-let-7a-2                     13.26650                     13.39119
#> hsa-let-7a-3                     12.28186                     12.35551
#> hsa-let-7b                       14.09461                     12.33623
meta[1:4,1:4]
#>             ID event death last_followup
#> 1 TCGA-A3-3307     0     0          1436
#> 2 TCGA-A3-3308     0     0            16
#> 3 TCGA-A3-3311     1  1191             0
#> 4 TCGA-A3-3313     1   735             0

2. build lasso regression model

The input data are the expression matrix (tumor samples only) and each patient's survival outcome; the sample order of the two must be consistent.
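Before modeling, it is worth verifying that the samples actually line up. A minimal sanity check (assuming, as in the printout above, that the exprSet column names are full TCGA barcodes whose first 12 characters match meta$ID):

# Sanity check: the patient part of each barcode should equal meta$ID, in the same order
stopifnot(all(substr(colnames(exprSet), 1, 12) == meta$ID))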

x = t(exprSet)   # transpose: glmnet wants samples in rows, genes in columns
y = meta$event
library(glmnet)
model_lasso <- glmnet(x, y, nlambda = 10, alpha = 1)   # alpha = 1 is the lasso penalty; only 10 λ values as a demo
print(model_lasso)
#> 
#> Call:  glmnet(x = x, y = y, alpha = 1, nlambda = 10) 
#> 
#>     Df  %Dev   Lambda
#> 1    0  0.00 0.129900
#> 2   11 11.73 0.077850
#> 3   21 20.92 0.046670
#> 4   54 30.69 0.027980
#> 5   95 44.71 0.016770
#> 6  160 57.60 0.010050
#> 7  228 68.75 0.006028
#> 8  295 77.36 0.003613
#> 9  347 83.51 0.002166
#> 10 380 88.43 0.001299

Since this is just a demonstration, only 10 λ values are computed here. The three output columns mean: Df is the number of nonzero coefficients (the degrees of freedom); %Dev is the proportion of deviance explained by the model — for linear models it is the R² (R-squared) of the fit, which lies between 0 and 1, with values closer to 1 indicating a better model, and 0 meaning the model predicts no better than simply using the mean of the dependent variable; Lambda is the penalty parameter used to build each model.

The higher the proportion of deviance explained, the better; however, the number of genes used to build the model should not be too large, so a compromise value of λ is usually taken.
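As a quick way to inspect this trade-off, the three printed columns are also stored in the fitted object (df, dev.ratio, and lambda are standard glmnet slots):

# Rebuild the printed table from the model object;
# dev.ratio is a fraction, so multiply by 100 to match the %Dev column
data.frame(Df = model_lasso$df,
           `%Dev` = 100 * model_lasso$dev.ratio,
           Lambda = model_lasso$lambda,
           check.names = FALSE)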

2.1 select an appropriate λ value

Calculate 1000 λ values, plot the cross-validation curve, and select the best-performing λ.

cv_fit <- cv.glmnet(x=x, y=y, nlambda = 1000,alpha = 1)
plot(cv_fit)

The two dashed lines mark two special λ values: lambda.min and lambda.1se. Any λ between these two values is considered acceptable. The model built with lambda.1se is the simplest, i.e. it uses the fewest genes, while lambda.min gives higher accuracy but uses more genes.
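To see the two values and how many genes each keeps, a quick check on the cv_fit object fitted above:

# The two special λ values
c(lambda.min = cv_fit$lambda.min, lambda.1se = cv_fit$lambda.1se)
# Nonzero coefficients at each λ (minus 1 to drop the intercept)
sum(coef(cv_fit, s = "lambda.min") != 0) - 1
sum(coef(cv_fit, s = "lambda.1se") != 0) - 1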

2.2 rebuild the model with the two λ values

model_lasso_min <- glmnet(x=x, y=y, alpha = 1, lambda=cv_fit$lambda.min)
model_lasso_1se <- glmnet(x=x, y=y, alpha = 1, lambda=cv_fit$lambda.1se)

The two values are passed in through the lambda parameter. With the models built, the selected genes can be picked out. All genes are stored in the model's beta slot: genes used by the model have a value in the s0 column, while unused genes are recorded as ".". So the following code picks out the genes that were used.

head(model_lasso_min$beta,20)
#> 20 x 1 sparse Matrix of class "dgCMatrix"
#>                        s0
#> hsa-let-7a-1   .         
#> hsa-let-7a-2   .         
#> hsa-let-7a-3   .         
#> hsa-let-7b     .         
#> hsa-let-7c     .         
#> hsa-let-7d     .         
#> hsa-let-7e     .         
#> hsa-let-7f-1   .         
#> hsa-let-7f-2   .         
#> hsa-let-7g     .         
#> hsa-let-7i     .         
#> hsa-mir-1-2    .         
#> hsa-mir-100    .         
#> hsa-mir-101-1 -0.02460853
#> hsa-mir-101-2  .         
#> hsa-mir-103-1  .         
#> hsa-mir-103-2  .         
#> hsa-mir-105-1  .         
#> hsa-mir-105-2  .         
#> hsa-mir-106a   .
choose_gene_min=rownames(model_lasso_min$beta)[as.numeric(model_lasso_min$beta)!=0]
choose_gene_1se=rownames(model_lasso_1se$beta)[as.numeric(model_lasso_1se$beta)!=0]
length(choose_gene_min)
#> [1] 69
length(choose_gene_1se)
#> [1] 15
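Optionally, you can check how the two gene sets relate; because lambda.1se applies a stronger penalty, its genes are usually (though not guaranteed to be) a subset of the lambda.min genes:

# Overlap between the two selected gene sets
length(intersect(choose_gene_1se, choose_gene_min))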

3. model prediction and evaluation

3.1 self-prediction

The newx parameter is the data to predict on. The output lasso.prob is a matrix: the first column is the prediction from the min model, the second from the 1se model. The predictions are probabilities (percentages), not hard 0/1 labels.

Put each sample's actual survival status next to the predictions; since the order is consistent, cbind can be used directly.

lasso.prob <- predict(cv_fit, newx = x, s = c(cv_fit$lambda.min, cv_fit$lambda.1se))   # newx: the expression matrix to predict on (here the training data itself)
re=cbind(y ,lasso.prob)
head(re)
#>                              y         1         2
#> TCGA-A3-3307-01A-01T-0860-13 0 0.1304463 0.2161883
#> TCGA-A3-3308-01A-02R-1324-13 0 0.3652738 0.3646634
#> TCGA-A3-3311-01A-02R-1324-13 1 0.3015306 0.2927687
#> TCGA-A3-3313-01A-02R-1324-13 1 0.4953936 0.3473468
#> TCGA-A3-3316-01A-01T-0860-13 0 0.3381294 0.3110597
#> TCGA-A3-3317-01A-01T-0860-13 0 0.3472768 0.3380213

3.2 boxplot

Visualize the predictions: group samples by actual survival status and draw boxplots to view the prediction results as a whole.

re=as.data.frame(re)
colnames(re)=c('event','prob_min','prob_1se')
re$event=as.factor(re$event)
library(ggpubr) 
p1 = ggboxplot(re, x = "event", y = "prob_min",
               color = "event", palette = "jco",
               add = "jitter")+
  scale_y_continuous(limits = c(0,1)) +
  stat_compare_means()
p2 = ggboxplot(re, x = "event", y = "prob_1se",
          color = "event", palette = "jco",
          add = "jitter")+ 
  scale_y_continuous(limits = c(0,1)) +
  stat_compare_means()
library(patchwork)
p1+p2
#> Warning: Removed 16 rows containing non-finite values (stat_boxplot).
#> Warning: Removed 16 rows containing non-finite values (stat_compare_means).
#> Warning: Removed 16 rows containing missing values (geom_point).

It can be seen that samples that are actually alive (0) get somewhat smaller predicted values (closer to 0), while samples that are actually dead (1) get somewhat larger predicted values (closer to 1). The overall trend is correct, though not perfectly accurate, so the model is usable.

Comparing the two λ values, the resulting models differ little; the min model's predictions are slightly more accurate.
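To put a rough number on this before drawing ROC curves, one option (not part of the original notes) is to binarize the probabilities at an arbitrary 0.5 cutoff and tabulate them against the truth:

# Confusion tables at a 0.5 cutoff; the cutoff is arbitrary,
# and the ROC analysis below avoids having to choose one
table(truth = re$event, pred_min = ifelse(re$prob_min > 0.5, 1, 0))
table(truth = re$event, pred_1se = ifelse(re$prob_1se > 0.5, 1, 0))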

3.3 ROC curve

AUC ranges from 0.5 to 1, and the closer to 1 the better. ROC curves can be drawn from the prediction results.

library(ROCR)
library(caret)
#Predict yourself
#min
pred_min <- prediction(re[,2], re[,1])
auc_min = performance(pred_min,"auc")@y.values[[1]]
perf_min <- performance(pred_min,"tpr","fpr")
plot(perf_min,colorize=FALSE, col="blue") 
lines(c(0,1),c(0,1),col = "gray", lty = 4 )
text(0.8,0.2, labels = paste0("AUC = ",round(auc_min,3)))
#1se
pred_1se <- prediction(re[,3], re[,1])
auc_1se = performance(pred_1se,"auc")@y.values[[1]]
perf_1se <- performance(pred_1se,"tpr","fpr")
plot(perf_1se,colorize=FALSE, col="red") 
lines(c(0,1),c(0,1),col = "gray", lty = 4 )
text(0.8,0.2, labels = paste0("AUC = ",round(auc_1se,3)))
  • Optional, for perfectionists: draw the two models' curves in the same plot.
plot(perf_min,colorize=FALSE, col="blue") 
plot(perf_1se,colorize=FALSE, col="red",add = T) 
lines(c(0,1),c(0,1),col = "gray", lty = 4 )
text(0.8,0.3, labels = paste0("AUC = ",round(auc_min,3)),col = "blue")
text(0.8,0.2, labels = paste0("AUC = ",round(auc_1se,3)),col = "red")

  • You can also draw a nicer version with ggplot2.

tpr_min = performance(pred_min,"tpr")@y.values[[1]]
tpr_1se = performance(pred_1se,"tpr")@y.values[[1]]
dat = data.frame(tpr_min = perf_min@y.values[[1]],
                 fpr_min = perf_min@x.values[[1]],
                 tpr_1se = perf_1se@y.values[[1]],
                 fpr_1se = perf_1se@x.values[[1]])
library(ggplot2)
ggplot() + 
  geom_line(data = dat,aes(x = fpr_min, y = tpr_min),color = "blue") + 
  geom_line(aes(x=c(0,1),y=c(0,1)),color = "grey")+
  theme_bw()+
  annotate("text",x = .75, y = .25,
           label = paste("AUC of min = ",round(auc_min,2)),color = "blue")+
  scale_x_continuous(name  = "fpr")+
  scale_y_continuous(name = "tpr")
ggplot() + 
  geom_line(data = dat,aes(x = fpr_min, y = tpr_min),color = "blue") + 
  geom_line(data = dat,aes(x = fpr_1se, y = tpr_1se),color = "red")+
  geom_line(aes(x=c(0,1),y=c(0,1)),color = "grey")+
  theme_bw()+
  annotate("text",x = .75, y = .25,
           label = paste("AUC of min = ",round(auc_min,2)),color = "blue")+
  annotate("text",x = .75, y = .15,label = paste("AUC of 1se = ",round(auc_1se,2)),color = "red")+
  scale_x_continuous(name  = "fpr")+
  scale_y_continuous(name = "tpr")

4. build model and predict on split data

4.1 split the data

Split the data with the R package caret; the result is a set of sample indices, which can be used to subset the columns of the expression matrix and the rows of meta.

library(caret)
set.seed(12345679)
sam<- createDataPartition(meta$event, p = .5,list = FALSE)
head(sam)
#>      Resample1
#> [1,]         5
#> [2,]         9
#> [3,]        13
#> [4,]        17
#> [5,]        19
#> [6,]        22

You can check how some clinical parameters are distributed between the two groups to confirm the split is balanced.

train <- exprSet[,sam]
test <- exprSet[,-sam]
train_meta <- meta[sam,]
test_meta <- meta[-sam,]

prop.table(table(train_meta$stage))
#> 
#>         i        ii       iii        iv 
#> 0.4636015 0.1072797 0.2796935 0.1494253
prop.table(table(test_meta$stage)) 
#> 
#>         i        ii       iii        iv 
#> 0.5249042 0.1111111 0.1954023 0.1685824
prop.table(table(test_meta$race)) 
#> 
#>                     asian black or african american                     white 
#>                0.01171875                0.08593750                0.90234375
prop.table(table(train_meta$race)) 
#> 
#>                     asian black or african american                     white 
#>                0.01937984                0.13953488                0.84108527

4.2 model the training set

The modeling method is the same as above.

#Calculate lambda
x = t(train)
y = train_meta$event
cv_fit <- cv.glmnet(x=x, y=y, nlambda = 1000,alpha = 1)
plot(cv_fit)
#Build model
model_lasso_min <- glmnet(x=x, y=y, alpha = 1, lambda=cv_fit$lambda.min)
model_lasso_1se <- glmnet(x=x, y=y, alpha = 1, lambda=cv_fit$lambda.1se)
#Pick out genes
head(model_lasso_min$beta)
#> 6 x 1 sparse Matrix of class "dgCMatrix"
#>              s0
#> hsa-let-7a-1  .
#> hsa-let-7a-2  .
#> hsa-let-7a-3  .
#> hsa-let-7b    .
#> hsa-let-7c    .
#> hsa-let-7d    .
choose_gene_min=rownames(model_lasso_min$beta)[as.numeric(model_lasso_min$beta)!=0]
choose_gene_1se=rownames(model_lasso_1se$beta)[as.numeric(model_lasso_1se$beta)!=0]
length(choose_gene_min)
#> [1] 42
length(choose_gene_1se)
#> [1] 6

4.3 model prediction on the test set

Use the model built on the training set to predict survival in the test set; note that the newx parameter changes to the test matrix.

lasso.prob <- predict(cv_fit, newx=t(test), s=c(cv_fit$lambda.min,cv_fit$lambda.1se) )
re=cbind(event = test_meta$event ,lasso.prob)
head(re)
#>                              event         1         2
#> TCGA-A3-3307-01A-01T-0860-13     0 0.2346060 0.2849366
#> TCGA-A3-3308-01A-02R-1324-13     0 0.3545987 0.3752418
#> TCGA-A3-3311-01A-02R-1324-13     1 0.3812493 0.3368730
#> TCGA-A3-3313-01A-02R-1324-13     1 0.4420153 0.3805503
#> TCGA-A3-3317-01A-01T-0860-13     0 0.3536417 0.3175332
#> TCGA-A3-3319-01A-02R-1324-13     0 0.7300191 0.4180086

Visualize again

re=as.data.frame(re)
colnames(re)=c('event','prob_min','prob_1se')
re$event=as.factor(re$event)
library(ggpubr) 
p1 = ggboxplot(re, x = "event", y = "prob_min",
               color = "event", palette = "jco",
               add = "jitter")+
  scale_y_continuous(limits = c(0,1)) +
  stat_compare_means()
p2 = ggboxplot(re, x = "event", y = "prob_1se",
          color = "event", palette = "jco",
          add = "jitter")+ 
  scale_y_continuous(limits = c(0,1)) +
  stat_compare_means()
library(patchwork)
p1+p2
#> Warning: Removed 4 rows containing non-finite values (stat_boxplot).
#> Warning: Removed 4 rows containing non-finite values (stat_compare_means).
#> Warning: Removed 4 rows containing missing values (geom_point).

Draw the ROC curves again

library(ROCR)
library(caret)
#Training set model prediction test set
#min
pred_min <- prediction(re[,2], re[,1])
auc_min = performance(pred_min,"auc")@y.values[[1]]
perf_min <- performance(pred_min,"tpr","fpr")

#1se
pred_1se <- prediction(re[,3], re[,1])
auc_1se = performance(pred_1se,"auc")@y.values[[1]]
perf_1se <- performance(pred_1se,"tpr","fpr")
tpr_min = performance(pred_min,"tpr")@y.values[[1]]
tpr_1se = performance(pred_1se,"tpr")@y.values[[1]]
dat = data.frame(tpr_min = perf_min@y.values[[1]],
                 fpr_min = perf_min@x.values[[1]],
                 tpr_1se = perf_1se@y.values[[1]],
                 fpr_1se = perf_1se@x.values[[1]])

ggplot() + 
  geom_line(data = dat,aes(x = fpr_min, y = tpr_min),color = "blue") + 
  geom_line(data = dat,aes(x = fpr_1se, y = tpr_1se),color = "red")+
  geom_line(aes(x=c(0,1),y=c(0,1)),color = "grey")+
  theme_bw()+
  annotate("text",x = .75, y = .25,
           label = paste("AUC of min = ",round(auc_min,2)),color = "blue")+
  annotate("text",x = .75, y = .15,label = paste("AUC of 1se = ",round(auc_1se,2)),color = "red")+
  scale_x_continuous(name  = "fpr")+
  scale_y_continuous(name = "tpr")

The AUC values are lower than without splitting, as expected: predicting held-out samples gives a more honest estimate than self-prediction.
*Course notes from the Biotrainee (生信技能树) bioinformatics course