Lasso regression
1. Prepare input data
load("TCGA-KIRC_sur_model.Rdata")
ls()
#> [1] "exprSet" "meta"
exprSet[1:4,1:4]
#> TCGA-A3-3307-01A-01T-0860-13 TCGA-A3-3308-01A-02R-1324-13
#> hsa-let-7a-1 11.94363 13.16162
#> hsa-let-7a-2 12.97324 14.17303
#> hsa-let-7a-3 12.04630 13.18481
#> hsa-let-7b 13.76790 14.51511
#> TCGA-A3-3311-01A-02R-1324-13 TCGA-A3-3313-01A-02R-1324-13
#> hsa-let-7a-1 12.26391 12.38326
#> hsa-let-7a-2 13.26650 13.39119
#> hsa-let-7a-3 12.28186 12.35551
#> hsa-let-7b 14.09461 12.33623
meta[1:4,1:4]
#> ID event death last_followup
#> 1 TCGA-A3-3307 0 0 1436
#> 2 TCGA-A3-3308 0 0 16
#> 3 TCGA-A3-3311 1 1191 0
#> 4 TCGA-A3-3313 1 735 0
2. Build the lasso regression model
The input data are the expression matrix (tumor samples only) and each patient's survival status (event); the sample order of the two must be consistent (a quick check follows the code below).
x = t(exprSet) # transpose: glmnet expects rows = samples, columns = features
y = meta$event
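As a minimal sanity check (a sketch; it assumes the TCGA barcodes in colnames(exprSet) begin with the 12-character patient IDs stored in meta$ID), verify that the two objects are in the same order:
# Sketch: confirm sample order agrees between exprSet and meta
stopifnot(identical(substr(colnames(exprSet), 1, 12), as.character(meta$ID)))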
library(glmnet)
model_lasso <- glmnet(x, y, nlambda = 10, alpha = 1) # alpha = 1 selects the lasso penalty
print(model_lasso)
#>
#> Call: glmnet(x = x, y = y, alpha = 1, nlambda = 10)
#>
#> Df %Dev Lambda
#> 1 0 0.00 0.129900
#> 2 11 11.73 0.077850
#> 3 21 20.92 0.046670
#> 4 54 30.69 0.027980
#> 5 95 44.71 0.016770
#> 6 160 57.60 0.010050
#> 7 228 68.75 0.006028
#> 8 295 77.36 0.003613
#> 9 347 83.51 0.002166
#> 10 380 88.43 0.001299
This is only an example, so just 10 λ values are computed. The three output columns mean:
- Df is the number of nonzero coefficients (degrees of freedom).
- %Dev is the proportion of deviance explained by the model. For linear models this is the R-squared of the fit; it lies between 0 and 1, and the closer to 1, the better the model. A value of 0 means the model predicts no better than simply taking the mean of the dependent variable.
- Lambda is the key tuning parameter for building the model.
The higher the deviance explained, the better; however, the number of genes used to build the model should not be too large, so a compromise value of λ must be chosen.
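If you want the coefficients at one specific λ along this path (a sketch; the value 0.01 here is arbitrary), coef() will interpolate between the fitted values:
# Sketch: coefficients at an arbitrary λ = 0.01 (first five shown)
coef(model_lasso, s = 0.01)[1:5, ]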
2.1 Select an appropriate λ value
Compute 1000 λ values by cross-validation, plot the result, and pick the best-performing λ.
cv_fit <- cv.glmnet(x=x, y=y, nlambda = 1000,alpha = 1)
plot(cv_fit)
The two dashed lines mark two special λ values: lambda.min and lambda.1se. Any λ between these two is considered appropriate. The model built with lambda.1se is the simplest, i.e. it uses the fewest genes, while lambda.min gives higher accuracy but uses more genes.
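The two values themselves can be inspected directly (output omitted):
# The two λ values marked by the dashed lines
cv_fit$lambda.min # λ giving the lowest mean cross-validated error
cv_fit$lambda.1se # largest λ within one standard error of that minimum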
2.2 Rebuild the models with these two λ values
model_lasso_min <- glmnet(x=x, y=y, alpha = 1, lambda=cv_fit$lambda.min)
model_lasso_1se <- glmnet(x=x, y=y, alpha = 1, lambda=cv_fit$lambda.1se)
The two λ values are passed in through the lambda argument. With the models built, the selected genes can be picked out. Coefficients for all genes are stored in the model's beta slot: genes used by the model have a value in the s0 column, while unused genes are recorded as ".". The following code extracts the genes that were used.
head(model_lasso_min$beta,20)
#> 20 x 1 sparse Matrix of class "dgCMatrix"
#> s0
#> hsa-let-7a-1 .
#> hsa-let-7a-2 .
#> hsa-let-7a-3 .
#> hsa-let-7b .
#> hsa-let-7c .
#> hsa-let-7d .
#> hsa-let-7e .
#> hsa-let-7f-1 .
#> hsa-let-7f-2 .
#> hsa-let-7g .
#> hsa-let-7i .
#> hsa-mir-1-2 .
#> hsa-mir-100 .
#> hsa-mir-101-1 -0.02460853
#> hsa-mir-101-2 .
#> hsa-mir-103-1 .
#> hsa-mir-103-2 .
#> hsa-mir-105-1 .
#> hsa-mir-105-2 .
#> hsa-mir-106a .
choose_gene_min=rownames(model_lasso_min$beta)[as.numeric(model_lasso_min$beta)!=0]
choose_gene_1se=rownames(model_lasso_1se$beta)[as.numeric(model_lasso_1se$beta)!=0]
length(choose_gene_min)
#> [1] 69
length(choose_gene_1se)
#> [1] 15
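As a quick look at how the two gene sets relate (a sketch), intersect them; the smaller 1se set typically falls largely inside the min set:
# Sketch: overlap between the 1se and min gene sets
length(intersect(choose_gene_1se, choose_gene_min))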
3. Model prediction and evaluation
3.1 Self-prediction
The newx argument is the data to predict on. The output lasso.prob is a matrix: the first column is the prediction from the min model and the second from the 1se model. The predictions are probabilities (percentages), not hard 0/1 labels.
Put each sample's survival status alongside the predictions; cbind can be used directly.
lasso.prob <- predict(cv_fit, newx = x, s = c(cv_fit$lambda.min, cv_fit$lambda.1se)) # newx: expression matrix to predict on (here the training data itself)
re=cbind(y, lasso.prob)
head(re)
#> y 1 2
#> TCGA-A3-3307-01A-01T-0860-13 0 0.1304463 0.2161883
#> TCGA-A3-3308-01A-02R-1324-13 0 0.3652738 0.3646634
#> TCGA-A3-3311-01A-02R-1324-13 1 0.3015306 0.2927687
#> TCGA-A3-3313-01A-02R-1324-13 1 0.4953936 0.3473468
#> TCGA-A3-3316-01A-01T-0860-13 0 0.3381294 0.3110597
#> TCGA-A3-3317-01A-01T-0860-13 0 0.3472768 0.3380213
3.2 Boxplots
Visualize the predictions: grouping samples by their actual survival status, draw boxplots to view the prediction results as a whole.
re=as.data.frame(re)
colnames(re)=c('event','prob_min','prob_1se')
re$event=as.factor(re$event)
library(ggpubr)
p1 = ggboxplot(re, x = "event", y = "prob_min",
color = "event", palette = "jco",
add = "jitter")+
scale_y_continuous(limits = c(0,1)) +
stat_compare_means()
p2 = ggboxplot(re, x = "event", y = "prob_1se",
color = "event", palette = "jco",
add = "jitter")+
scale_y_continuous(limits = c(0,1)) +
stat_compare_means()
library(patchwork)
p1+p2
#> Warning: Removed 16 rows containing non-finite values (stat_boxplot).
#> Warning: Removed 16 rows containing non-finite values (stat_compare_means).
#> Warning: Removed 16 rows containing missing values (geom_point).
As the plots show, samples whose true outcome is alive (0) tend to get smaller predicted values (closer to 0), while samples whose true outcome is dead (1) tend to get larger predicted values (closer to 1). The overall trend is correct, though not perfectly accurate, so the model is usable.
Comparing the two λ values, the resulting models differ little; the predictions of the min model are slightly more accurate.
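For a rough numeric check (a sketch; the 0.5 probability cutoff is an arbitrary choice), binarize the predictions and tabulate them against the true events:
# Sketch: confusion tables at an arbitrary 0.5 cutoff
table(truth = re$event, pred_min = as.numeric(re$prob_min > 0.5))
table(truth = re$event, pred_1se = as.numeric(re$prob_1se > 0.5))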
3.3 ROC curves
AUC values range from 0.5 (no better than chance) to 1; the closer to 1, the better. ROC curves can be drawn from the prediction results.
library(ROCR)
library(caret)
# Self-prediction
#min
pred_min <- prediction(re[,2], re[,1])
auc_min = performance(pred_min,"auc")@y.values[[1]]
perf_min <- performance(pred_min,"tpr","fpr")
plot(perf_min,colorize=FALSE, col="blue")
lines(c(0,1),c(0,1),col = "gray", lty = 4 )
text(0.8,0.2, labels = paste0("AUC = ",round(auc_min,3)))
#1se
pred_1se <- prediction(re[,3], re[,1])
auc_1se = performance(pred_1se,"auc")@y.values[[1]]
perf_1se <- performance(pred_1se,"tpr","fpr")
plot(perf_1se,colorize=FALSE, col="red")
lines(c(0,1),c(0,1),col = "gray", lty = 4 )
text(0.8,0.2, labels = paste0("AUC = ",round(auc_1se,3)))
- Optional polish: draw the two models' curves on the same plot.
plot(perf_min,colorize=FALSE, col="blue")
plot(perf_1se,colorize=FALSE, col="red",add = T)
lines(c(0,1),c(0,1),col = "gray", lty = 4 )
text(0.8,0.3, labels = paste0("AUC = ",round(auc_min,3)),col = "blue")
text(0.8,0.2, labels = paste0("AUC = ",round(auc_1se,3)),col = "red")
- You can also draw a nicer version with ggplot2.
tpr_min = performance(pred_min,"tpr")@y.values[[1]]
tpr_1se = performance(pred_1se,"tpr")@y.values[[1]]
dat = data.frame(tpr_min = perf_min@y.values[[1]],
                 fpr_min = perf_min@x.values[[1]],
                 tpr_1se = perf_1se@y.values[[1]],
                 fpr_1se = perf_1se@x.values[[1]])
library(ggplot2)
ggplot() +
geom_line(data = dat,aes(x = fpr_min, y = tpr_min),color = "blue") +
geom_line(aes(x=c(0,1),y=c(0,1)),color = "grey")+
theme_bw()+
annotate("text",x = .75, y = .25,
label = paste("AUC of min = ",round(auc_min,2)),color = "blue")+
scale_x_continuous(name = "fpr")+
scale_y_continuous(name = "tpr")
ggplot() +
geom_line(data = dat,aes(x = fpr_min, y = tpr_min),color = "blue") +
geom_line(data = dat,aes(x = fpr_1se, y = tpr_1se),color = "red")+
geom_line(aes(x=c(0,1),y=c(0,1)),color = "grey")+
theme_bw()+
annotate("text",x = .75, y = .25,
label = paste("AUC of min = ",round(auc_min,2)),color = "blue")+
annotate("text",x = .75, y = .15,label = paste("AUC of 1se = ",round(auc_1se,2)),color = "red")+
scale_x_continuous(name = "fpr")+
scale_y_continuous(name = "tpr")
4. Build the model on split data and predict
4.1 Split the data
Split the data with the R package caret. The result is a set of sample indices (rows of meta, equivalently columns of exprSet) that can be used to subset the expression matrix and meta.
library(caret)
set.seed(12345679)
sam<- createDataPartition(meta$event, p = .5,list = FALSE)
head(sam)
#> Resample1
#> [1,] 5
#> [2,] 9
#> [3,] 13
#> [4,] 17
#> [5,] 19
#> [6,] 22
You can check how some clinical parameters are distributed across the two groups (a formal balance test follows the tables below):
train <- exprSet[,sam]
test <- exprSet[,-sam]
train_meta <- meta[sam,]
test_meta <- meta[-sam,]
prop.table(table(train_meta$stage))
#>
#> i ii iii iv
#> 0.4636015 0.1072797 0.2796935 0.1494253
prop.table(table(test_meta$stage))
#>
#> i ii iii iv
#> 0.5249042 0.1111111 0.1954023 0.1685824
prop.table(table(test_meta$race))
#>
#> asian black or african american white
#> 0.01171875 0.08593750 0.90234375
prop.table(table(train_meta$race))
#>
#> asian black or african american white
#> 0.01937984 0.13953488 0.84108527
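If you want a formal test of balance rather than eyeballing the proportions (a sketch; it uses the stage column shown above), a chi-squared test can compare the two groups:
# Sketch: test whether tumor stage is balanced between train and test
group <- rep(c("train", "test"), c(nrow(train_meta), nrow(test_meta)))
stage <- c(as.character(train_meta$stage), as.character(test_meta$stage))
chisq.test(table(group, stage))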
4.2 Build the model on the training set
The modeling method is the same as above.
#Calculate lambda
x = t(train)
y = train_meta$event
cv_fit <- cv.glmnet(x=x, y=y, nlambda = 1000,alpha = 1)
plot(cv_fit)
#Build model
model_lasso_min <- glmnet(x=x, y=y, alpha = 1, lambda=cv_fit$lambda.min)
model_lasso_1se <- glmnet(x=x, y=y, alpha = 1, lambda=cv_fit$lambda.1se)
#Pick out genes
head(model_lasso_min$beta)
#> 6 x 1 sparse Matrix of class "dgCMatrix"
#> s0
#> hsa-let-7a-1 .
#> hsa-let-7a-2 .
#> hsa-let-7a-3 .
#> hsa-let-7b .
#> hsa-let-7c .
#> hsa-let-7d .
choose_gene_min=rownames(model_lasso_min$beta)[as.numeric(model_lasso_min$beta)!=0]
choose_gene_1se=rownames(model_lasso_1se$beta)[as.numeric(model_lasso_1se$beta)!=0]
length(choose_gene_min)
#> [1] 42
length(choose_gene_1se)
#> [1] 6
4.3 Model prediction
Build the model on the training set and predict the survival status of the test set; note that the newx argument now changes to the test matrix.
lasso.prob <- predict(cv_fit, newx=t(test), s=c(cv_fit$lambda.min,cv_fit$lambda.1se) )
re=cbind(event = test_meta$event, lasso.prob)
head(re)
#> event 1 2
#> TCGA-A3-3307-01A-01T-0860-13 0 0.2346060 0.2849366
#> TCGA-A3-3308-01A-02R-1324-13 0 0.3545987 0.3752418
#> TCGA-A3-3311-01A-02R-1324-13 1 0.3812493 0.3368730
#> TCGA-A3-3313-01A-02R-1324-13 1 0.4420153 0.3805503
#> TCGA-A3-3317-01A-01T-0860-13 0 0.3536417 0.3175332
#> TCGA-A3-3319-01A-02R-1324-13 0 0.7300191 0.4180086
Visualize again
re=as.data.frame(re)
colnames(re)=c('event','prob_min','prob_1se')
re$event=as.factor(re$event)
library(ggpubr)
p1 = ggboxplot(re, x = "event", y = "prob_min",
color = "event", palette = "jco",
add = "jitter")+
scale_y_continuous(limits = c(0,1)) +
stat_compare_means()
p2 = ggboxplot(re, x = "event", y = "prob_1se",
color = "event", palette = "jco",
add = "jitter")+
scale_y_continuous(limits = c(0,1)) +
stat_compare_means()
library(patchwork)
p1+p2
#> Warning: Removed 4 rows containing non-finite values (stat_boxplot).
#> Warning: Removed 4 rows containing non-finite values (stat_compare_means).
#> Warning: Removed 4 rows containing missing values (geom_point).
Draw the ROC curves again
library(ROCR)
library(caret)
# Training-set model predicting the test set
#min
pred_min <- prediction(re[,2], re[,1])
auc_min = performance(pred_min,"auc")@y.values[[1]]
perf_min <- performance(pred_min,"tpr","fpr")
#1se
pred_1se <- prediction(re[,3], re[,1])
auc_1se = performance(pred_1se,"auc")@y.values[[1]]
perf_1se <- performance(pred_1se,"tpr","fpr")
tpr_min = performance(pred_min,"tpr")@y.values[[1]]
tpr_1se = performance(pred_1se,"tpr")@y.values[[1]]
dat = data.frame(tpr_min = perf_min@y.values[[1]],
                 fpr_min = perf_min@x.values[[1]],
                 tpr_1se = perf_1se@y.values[[1]],
                 fpr_1se = perf_1se@x.values[[1]])
ggplot() +
geom_line(data = dat,aes(x = fpr_min, y = tpr_min),color = "blue") +
geom_line(data = dat,aes(x = fpr_1se, y = tpr_1se),color = "red")+
geom_line(aes(x=c(0,1),y=c(0,1)),color = "grey")+
theme_bw()+
annotate("text",x = .75, y = .25,
label = paste("AUC of min = ",round(auc_min,2)),color = "blue")+
annotate("text",x = .75, y = .15,label = paste("AUC of 1se = ",round(auc_1se,2)),color = "red")+
scale_x_continuous(name = "fpr")+
scale_y_continuous(name = "tpr")
The AUC values are lower than those obtained without splitting, which is expected: the test samples were not seen during model fitting.
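To gauge how much the test-set AUC depends on this particular random split (a sketch; it reuses exprSet, meta, and the libraries loaded above, and the five repeats and nlambda = 100 are arbitrary choices), you could repeat the partition a few times:
# Sketch: repeat the 50/50 split to see how stable the test AUC is
set.seed(1234)
aucs <- sapply(1:5, function(i) {
  sam <- createDataPartition(meta$event, p = .5, list = FALSE)[, 1]
  fit <- cv.glmnet(x = t(exprSet[, sam]), y = meta$event[sam],
                   nlambda = 100, alpha = 1)
  prob <- predict(fit, newx = t(exprSet[, -sam]), s = fit$lambda.min)
  performance(prediction(prob, meta$event[-sam]), "auc")@y.values[[1]]
})
summary(aucs)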
*Course notes from the 生信技能树 (Biotrainee) bioinformatics course.