Principal component regression (PCR) is essentially an ordinary least squares (OLS) fit on the first few principal components (PCs) of the predictors. This brings several advantages:
- The number of predictors is virtually unlimited.
- Correlated predictors do not undermine the regression fit.
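However, in many cases it is much wiser to perform a decomposition similar to PCA but with components built to best explain the response. This is the idea behind partial least squares (PLS), whose components are usually called latent variables (LVs). When the response is a class label, the method is known as PLS discriminant analysis (PLS-DA).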
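To make the contrast concrete, here is a minimal base-R sketch on simulated data (not Arcene). The choice of `k = 3` components and the names `X`, `y`, `pcr_fit` and `w_pls` are assumptions made for illustration only. PCR regresses the response on PCA scores, which are built without looking at `y`, whereas the first weight vector of partial least squares (PLS) is proportional to `t(X) %*% y` and therefore depends on the response.

```r
set.seed(1)
n <- 50; p <- 200                      # more predictors than observations
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] - X[, 2] + rnorm(n, sd = 0.5)

# Center predictors and response
Xc <- scale(X, center = TRUE, scale = FALSE)
yc <- y - mean(y)

# PCR: PCA ignores y, then OLS on the first k score vectors
k <- 3                                 # assumed number of components
pca <- prcomp(Xc, center = FALSE)
pc_scores <- pca$x[, 1:k]
pcr_fit <- lm(yc ~ pc_scores)

# PLS (first latent variable only): the weight vector uses y
w_pls <- crossprod(Xc, yc)             # t(Xc) %*% yc, a p x 1 matrix
w_pls <- w_pls / sqrt(sum(w_pls^2))    # normalise to unit length
lv1 <- Xc %*% w_pls                    # scores on the first latent variable

# Compare how strongly each first component tracks the response
cor_pc1 <- abs(cor(pc_scores[, 1], yc))
cor_lv1 <- abs(cor(lv1[, 1], yc))
```

These are in-sample correlations, which flatter both methods when there are more predictors than observations. That is why the number of components is usually chosen by cross-validation.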
Today we will run PLS-DA on the Arcene dataset, which contains 100 observations and 10,000 explanatory variables.
Let's get started with R.
The cancer / no-cancer labels (coded as -1 / 1) are stored in a separate file, so we append them to the full dataset as an extra column and can then use the formula syntax to train the models.
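```r
# Load caret (install it first if necessary)
library(caret)

# Read the 10,000 predictors; the 10,001st column is skipped via "NULL"
arcene <- read.table("train.data", sep = " ",
                     colClasses = c(rep("numeric", 10000), "NULL"))

# Add the labels as an additional column
arcene$class <- factor(scan("train.labels", sep = "\t"))
```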
The main questions now are:
- How accurately can we predict whether a patient is sick from the MS profile of their serum?
- Which proteins / MS peaks best discriminate cancer patients from healthy subjects?
For preprocessing, we will use the `preProc` argument to remove zero-variance predictors and then center and scale all remaining ones, in that exact order. Considering the size of the sample (_n_ = 100), I will pick a 5-fold cross-validation (CV) repeated 10 times. The large number of repetitions compensates for the high variance that comes with a reduced number of folds, giving a total of 50 accuracy estimates.
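Why the order matters can be seen in a small base-R sketch on toy data. The matrix `X` and the helper `prep` are hypothetical, and the sketch mimics rather than calls caret's preprocessing (caret names these steps `"zv"`, `"center"` and `"scale"`). Standardizing a constant column divides zero by a zero standard deviation and produces `NaN`, so zero-variance predictors have to be removed first.

```r
# Toy predictor matrix: the second column is constant (zero variance)
X <- cbind(a = c(1, 2, 3, 4),
           b = c(5, 5, 5, 5),
           c = c(2, 4, 6, 9))

# Standardizing first: the constant column becomes NaN (0 / 0)
bad <- scale(X)
any(is.nan(bad[, "b"]))

# Hypothetical helper: drop zero-variance columns, then center and scale
prep <- function(X) {
  keep <- apply(X, 2, var) > 0          # step 1: zero-variance filter
  scale(X[, keep, drop = FALSE])        # steps 2 and 3: center, then scale
}

good <- prep(X)
colnames(good)                          # only "a" and "c" remain
```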
```r
# Compile cross-validation settings
set.seed(100)
myfolds <- createMultiFolds(arcene$class, k = 5, times = 10)
control <- trainControl("repeatedcv", index = myfolds, selectionFunction = "oneSE")
```
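The `selectionFunction = "oneSE"` setting asks caret for the simplest model whose performance is within one standard error of the best one. A rough base-R sketch of that rule follows. The function `one_se_pick` and the accuracy matrix are made up for illustration (they are not Arcene results), and the sketch assumes the candidates are ordered from simplest to most complex.

```r
# Hypothetical helper implementing the one-standard-error rule.
# acc: matrix of resampled accuracies, one row per resample,
#      one column per candidate model, ordered simplest -> most complex.
one_se_pick <- function(acc) {
  m    <- colMeans(acc)
  se   <- apply(acc, 2, sd) / sqrt(nrow(acc))
  best <- which.max(m)
  threshold <- m[best] - se[best]       # one SE below the best mean
  which(m >= threshold)[1]              # simplest model above the threshold
}

# Made-up accuracies for 1, 2 and 3 latent variables over 4 resamples
acc <- cbind(lv1 = c(0.60, 0.62, 0.58, 0.60),
             lv2 = c(0.80, 0.84, 0.76, 0.80),
             lv3 = c(0.81, 0.86, 0.75, 0.82))
one_se_pick(acc)                        # picks lv2: within one SE of lv3, but simpler
```

The rule trades a small loss in estimated accuracy for a more parsimonious model, which is attractive when the estimates themselves are noisy.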
The figure shows the CV curve: the average accuracy (_y_-axis, %) obtained from models trained with different numbers of LVs (_x_-axis).
Now, for comparison, we perform linear discriminant analysis (LDA) on PCA-transformed predictors (PCA-DA). We can also try a more complex model, such as a random forest (RF).
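As a rough illustration of the PCA-DA idea, the sketch below runs PCA on simulated two-class data and feeds the first few scores to `MASS::lda`. The data, the choice of five components, and all object names are assumptions for this example. It is not the caret workflow used for Arcene, and it reports in-sample accuracy only.

```r
library(MASS)  # provides lda()

set.seed(42)
n <- 60; p <- 100
cls <- factor(rep(c("-1", "1"), each = n / 2))

# Simulated predictors: the first 10 variables are shifted in class "1"
X <- matrix(rnorm(n * p), n, p)
X[cls == "1", 1:10] <- X[cls == "1", 1:10] + 3

# Step 1: PCA on centered, scaled predictors
pca <- prcomp(X, center = TRUE, scale. = TRUE)
k <- 5                                  # assumed number of PCs to keep
scores <- as.data.frame(pca$x[, 1:k])

# Step 2: LDA on the PC scores
fit <- lda(scores, grouping = cls)
pred <- predict(fit, scores)$class

# In-sample accuracy (optimistic; cross-validate in practice)
acc <- mean(pred == cls)
```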
Finally, we can compare the accuracy of PLS-DA, PCA-DA and RF.
We will use `resamples` to gather the three models and borrow the plotting capabilities of `ggplot2` to compare the 50 accuracy estimates from the optimal cross-validated model in each of the three cases.
Clearly, the long RF run time does not translate into superior performance; quite the opposite. Although the three models perform similarly on average, the accuracy of RF varies much more across resamples. If we are looking for a robust model, this is of course a problem. In this case, PLS-DA and PCA-DA show the best performance (63-95% accuracy), and either model would do well in diagnosing cancer in new serum samples.
To conclude, we will use the variable importance in the prediction (VIP) from both PLS-DA and PCA-DA to identify the ten proteins that best diagnose cancer.
The PLS-DA VIP plot above clearly sets V1184 apart from all other proteins. This could be an interesting cancer biomarker. Of course, many other tests and models would be needed before it could serve as a reliable diagnostic tool.