R language: gradient boosting machine (GBM), support vector machine (SVM) and regularized discriminant analysis (RDA) model training, parameter tuning and performance comparison on sonar data

Time: 2022-9-22

Original link: http://tecdat.cn/?p=24354

This article describes how to simplify the model building and evaluation process.

The train function of the caret package can be used to:

  • evaluate, using resampling, the effect of model tuning parameters on performance
  • choose the "best" model across these parameters
  • estimate model performance from a training set

First, a specific model must be selected.

The first step in tuning a model is to choose a set of parameters to evaluate. For example, if fitting a partial least squares (PLS) model, you must specify the number of PLS components to evaluate.

Once the model and tuning parameter values are defined, the type of resampling should also be specified. Currently, k-fold cross-validation (once or repeated), leave-one-out cross-validation, and bootstrap (simple estimation or the 632 rule) resampling methods can be used by train. After resampling, the process produces a profile of performance measures that can guide the user in choosing which tuning parameter values should be selected. By default, the function automatically chooses the tuning parameters associated with the best value, although different algorithms can be used.

Sonar data example

Here we load the data:

library(mlbench)
data(Sonar)

str(Sonar[, 1:10])


Create a stratified random sample of data as training and test sets:

library(caret)
set.seed(998)
inTraining <- createDataPartition(Sonar$Class, p = .75, list = FALSE)
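The returned row indices can then be used to carve out the two sets. A minimal sketch (the names training and testing are assumed by the later examples):

training <- Sonar[ inTraining, ]
testing  <- Sonar[-inTraining, ]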

We will use this data to illustrate the functionality on this (and other) pages.

Basic parameter tuning

By default, simple bootstrap resampling is used. Other methods are available, such as repeated K-fold cross-validation, leave-one-out cross-validation, etc. The trainControl function specifies the type of resampling:

fitControl <- trainControl(## 10-fold CV
                           method = "repeatedcv",
                           number = 10,
                           ## repeated ten times
                           repeats = 10)

The first two arguments to train are the predictor and outcome data objects, respectively. The third argument, method, specifies the type of model. To illustrate, we will fit a boosted tree model via the gbm package. The basic syntax for fitting this model using repeated cross-validation is shown below:

set.seed(825)
Fit1 <- train(Class ~ ., data = training, method = "gbm",
              trControl = fitControl, verbose = FALSE)
Fit1

(The printed output, omitted here, lists Accuracy and Kappa for each candidate value of interaction.depth and n.trees.)

For gradient boosting machine (GBM) models, there are four main tuning parameters:

  • the number of iterations, i.e. trees (called n.trees in the gbm function call)
  • the complexity of the tree, called interaction.depth
  • the learning rate: how quickly the algorithm adapts, called shrinkage
  • the minimum number of training set samples in a node to commence splitting (n.minobsinnode)

The default values tested for this model are shown in the first two columns (shrinkage and n.minobsinnode are not shown because the grid of candidate models all use a single value for these tuning parameters). The column labeled "Accuracy" is the overall agreement rate averaged over the cross-validation iterations. The agreement standard deviation is also calculated from the cross-validation results. The "Kappa" column is the average of Cohen's (unweighted) Kappa statistic over the resampling results.

train works with specific models. For these models, train can create a grid of tuning parameters automatically. By default, if p is the number of tuning parameters, the grid size is 3^p. As another example, the regularized discriminant analysis (RDA) model has two parameters (gamma and lambda), both of which lie between 0 and 1. The default training grid would produce nine combinations in this two-dimensional space.
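As a sketch, those nine RDA combinations could be built with expand.grid; the exact default values are internal to caret, so the sequences below are only illustrative:

rdaGrid <- expand.grid(gamma  = seq(0, 1, length = 3),
                       lambda = seq(0, 1, length = 3))
nrow(rdaGrid)  # 9 candidate models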

The next sections will cover the other features of train.

Reproducibility Considerations

Many models use random numbers during the parameter estimation phase. Also, the resampling indices are chosen using random numbers. There are two main ways to control the randomness to ensure reproducible results.

  • There are two approaches to ensuring that the same resamples are used between calls to train. The first is to use set.seed just prior to calling train; the first use of random numbers is to create the resampling information. Alternatively, if you would like to use specific splits of the data, the index argument of the trainControl function can be used.
  • Seeds can also be set for each model fit within resampling. While setting the seed prior to calling train may guarantee that the same random numbers are used, this is unlikely to be the case when parallel processing is used (depending on which technology is leveraged). To set the model-fitting seeds, trainControl has an additional argument called seeds that can be used; see the sketch after this list. The value for this argument is a list of integer vectors that are used as seeds. The help page for trainControl describes the appropriate format for this option.
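A minimal sketch of the seeds option follows. The lengths assume 10-fold CV repeated 5 times (50 resamples plus one final-model seed) and 22 candidate tuning combinations; both counts are illustrative and must match your own setup:

set.seed(123)
seeds <- vector(mode = "list", length = 51)
for(i in 1:50) seeds[[i]] <- sample.int(1000, 22)  # one seed per candidate model
seeds[[51]] <- sample.int(1000, 1)                 # for the final model fit
seededCtrl <- trainControl(method = "repeatedcv", number = 10,
                           repeats = 5, seeds = seeds)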

Custom tuning process

There are several ways to customize the process of selecting tuning/complexity parameters and building the final model.

Preprocessing options

As mentioned earlier, train can preprocess the data in various ways prior to model fitting. The preProcess function is used automatically. This function can be used for centering and scaling, imputation (see details below), applying the spatial sign transformation, and feature extraction via principal component analysis or independent component analysis.

To specify what preprocessing should occur, the train function has an argument called preProcess. Additional options for the preProcess function can be passed via the trainControl function.

These processing steps are applied during any predictions generated by predict.train, extractPrediction or extractProbs (see details later in this document). The preprocessing will NOT be applied to predictions that directly use the object$finalModel object.
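For instance, a sketch of requesting centering and scaling inside train; the preprocessing is then applied automatically whenever predict.train is called on the resulting object:

ppFit <- train(Class ~ ., data = training, method = "gbm",
               preProcess = c("center", "scale"),
               trControl = fitControl, verbose = FALSE)
predict(ppFit, newdata = head(testing))  # preprocessing applied here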

For imputation, three methods are currently implemented:

  • k-nearest neighbors takes a sample with missing values and finds the k closest samples in the training set. The average of the k training set values for the predictor is used as a substitute for the original data. When calculating the distances to the training set samples, the predictors used in the calculation are those with no missing values for that sample and no missing values in the training set.
  • Another approach is to fit a bagged tree model for each predictor using the training set samples. This is usually a fairly accurate model and can handle missing values. When a predictor for a sample requires imputation, the values of the other predictors are fed through the bagged tree and the prediction is used as the new value. This model can have significant computational cost.
  • The median of the predictor's training set values can be used to estimate the missing data.

If there are missing values in the training set, PCA and ICA models only use complete samples.
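Returning to imputation, a minimal sketch of k-nearest-neighbor imputation via preProcess, assuming a data set with missing predictor values (the sonar predictors are in columns 1 to 60):

imputeObj <- preProcess(training[, 1:60], method = c("knnImpute"))
trainingImputed <- predict(imputeObj, training[, 1:60])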

Alternate tuning grid

The tuning parameter grid can be specified by the user. The argument tuneGrid takes a data frame with columns for each tuning parameter. The column names should be the same as the fitting function's arguments. For the RDA example above, the names would be gamma and lambda. train will tune the model over each combination of values in the rows.

For boosted tree models, we can fix the learning rate and evaluate more than three values ​​of n.trees.
 

gbmGrid <- expand.grid(interaction.depth = c(1, 5, 9),
                       n.trees = (1:30)*50,
                       shrinkage = 0.1,
                       n.minobsinnode = 20)

set.seed(825)
Fit2 <- train(Class ~ ., data = training, method = "gbm",
              trControl = fitControl, verbose = FALSE,
              tuneGrid = gbmGrid)
Fit2


Another option is to use a random sample of possible tuning parameter combinations, i.e. a "random search".

To use random search, use the option search = "random" in the call to trainControl. In this situation, the tuneLength parameter defines the total number of parameter combinations that will be evaluated.
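A sketch of a random search over the RDA parameters (the value of tuneLength is arbitrary here):

randomCtrl <- trainControl(method = "repeatedcv", number = 10,
                           repeats = 10, search = "random")
set.seed(825)
rdaRandom <- train(Class ~ ., data = training, method = "rda",
                   trControl = randomCtrl, tuneLength = 30)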

Plotting the resampling profile

The plot function can be used to examine the relationship between the estimates of performance and the tuning parameters. For example, a simple call to the function shows the results for the first performance measure:

trellis.par.set(caretTheme())
plot(Fit2)


Other performance metrics can be shown using the metric option:

trellis.par.set(caretTheme())
plot(Fit2, metric = "Kappa")


Other types of plots are also available. See ?plot.train for more details. The code below shows a heatmap of the results:

trellis.par.set(caretTheme())
plot(Fit2, metric = "Kappa", plotType = "level",
     scales = list(x = list(rot = 90)))


A ggplot method can also be used:

ggplot(Fit2)


There are also plotting functions that represent the resampled estimates in more detail. See ?xyplot.train for more details.

From these plots, a different set of tuning parameters may be desired. To change the final values without starting the whole process again, update.train can be used to refit the final model. See ?update.train.
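A sketch of update.train, refitting the final model at hand-picked values without repeating the resampling (the list names must match the model's tuning parameters; these particular values are arbitrary):

Fit2b <- update(Fit2, param = list(n.trees = 100, interaction.depth = 1,
                                   shrinkage = .1, n.minobsinnode = 20))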

The trainControl function

The trainControl function generates parameters that further control how models are created, with possible values:

  • method: the resampling method: "boot", "cv", "LOOCV", "LGOCV", "repeatedcv", "timeslice", "none" and "oob". The last value, out-of-bag estimates, can only be used by random forest, bagged tree, bagged earth, bagged flexible discriminant analysis, or conditional tree forest models. GBM models are not included. Also, for leave-one-out cross-validation, no uncertainty estimates are given for the resampled performance measures.
  • number and repeats: number controls the number of folds in K-fold cross-validation or the number of resampling iterations for bootstrapping and leave-group-out cross-validation. repeats applies only to repeated K-fold cross-validation. Suppose method = "repeatedcv", number = 10 and repeats = 3; then three separate 10-fold cross-validations are used as the resampling scheme.
  • verboseIter: a logical for printing a training log.
  • returnData: a logical for saving the data into a slot called trainingData. A combined example follows this list.
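A short sketch combining these options: repeated 10-fold CV with a printed training log and without retaining a copy of the data:

ctrlDemo <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                         verboseIter = TRUE, returnData = FALSE)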

Alternative performance metrics

Users can change the metrics used to determine optimal settings. By default, RMSE, R², and mean absolute error (MAE) are computed for regression, while accuracy and Kappa are computed for classification. Also by default, the parameter values are chosen using RMSE and accuracy for regression and classification, respectively. The metric argument of the train function allows the user to control which criterion of optimality is used. For example, in problems where there is a low percentage of samples in one class, using metric = "Kappa" can improve the quality of the final model, as sketched below.
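A sketch of optimizing on Kappa instead of accuracy, reusing the fitControl object defined earlier:

set.seed(825)
kappaFit <- train(Class ~ ., data = training, method = "gbm",
                  trControl = fitControl, verbose = FALSE,
                  metric = "Kappa")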

If none of these metrics is satisfactory, the user can also compute custom performance metrics. The trainControl function has an argument called summaryFunction that specifies a function for computing performance. The function should have the following arguments:

  • data is a reference to a data frame or matrix with columns called obs and pred for the observed and predicted outcome values (either numeric data for regression or character values for classification). Currently, class probabilities are not passed to the function. The values in data are the held-out predictions (and their associated reference values) for a single combination of tuning parameters. If the classProbs argument of the trainControl object is set to TRUE, additional columns containing the class probabilities will be present in data. The names of these columns are the same as the class levels. Also, if weights were specified in the call to train, a column called weights will also be in the data set.
  • lev is a character string that has the outcome factor levels taken from the training data. For regression, a value of NULL is passed into the function.
  • model is a character string for the model being used (i.e. the value passed to the method argument of train).

The output of this function should be a vector of numeric summary metrics with non-null names. By default, train evaluates classification models in terms of the predicted classes. Optionally, class probabilities can also be used to measure performance. To obtain predicted class probabilities within the resampling process, the classProbs argument in trainControl must be set to TRUE. This merges columns of probabilities into the predictions generated from each resample (there is one column per class, and the column names are the class names).
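For the two-class sonar problem, a control object set up this way might look like the following sketch (the name ctrl is reused by the train call below):

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10,
                     ## estimate class probabilities
                     classProbs = TRUE,
                     ## evaluate performance using the following function
                     summaryFunction = twoClassSummary)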

As shown in the previous section, custom functions can be used to calculate performance scores that are averaged over the resamples. The package ships with one such function, twoClassSummary, which computes the sensitivity, specificity and area under the ROC curve:

head(twoClassSummary)


To rebuild the boosted tree model with this criterion, the following code shows the relationship between the tuning parameters and the area under the ROC curve:

Fit3 <- train(Class ~ ., data = training, method = "gbm",
              trControl = ctrl, verbose = FALSE,
              tuneGrid = gbmGrid, metric = "ROC")
Fit3


In this case, the average area under the ROC curve associated with the optimal tuning parameters was 0.922 across the 100 resamples.

Choose the final model

Another way to customize the tuning process is to modify the algorithm used to select the "best" parameter values, given the performance numbers. By default, the train function chooses the model with the largest performance value (or the smallest, for mean squared error in regression models). Other schemes for selecting models can be used. Breiman et al. (1984) suggested the "one standard error rule" for simple tree-based models. In this case, the model with the best performance value is identified, and resampling is used to estimate the standard error of performance. The final model used is the simplest model within one standard error of the (empirically) best model. For simple trees this makes sense, because these models will start to overfit as they become more and more specific to the training data.

train allows the user to specify alternative rules for selecting the final model. The argument selectionFunction can be used to supply a function to algorithmically determine the final model. There are three existing functions in the package: best chooses the largest/smallest value, oneSE attempts to capture the spirit of Breiman et al. (1984), and tolerance selects the least complex model within some percent tolerance of the best value.
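The rule is requested through trainControl; for example, a sketch of applying the one-standard-error rule:

oneSECtrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10,
                          classProbs = TRUE, summaryFunction = twoClassSummary,
                          selectionFunction = "oneSE")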

User-defined functions can be used as long as they have the following arguments:

  • x is a data frame containing the tuning parameters and their associated performance metrics. Each row corresponds to a different tuning parameter combination.
  • metric is a character string indicating which performance metric should be optimized (this is passed in directly from the metric argument of train).
  • maximize is a single logical value indicating whether larger values of the performance metric are better (this is also passed directly from the call to train).

The function should output a single integer indicating which row of x is chosen.
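As a sketch, a user-defined rule that, among all models within 0.5% of the best metric value, picks the one with the fewest trees (this assumes the tuning grid has an n.trees column, as it does for gbm):

fewestTrees <- function(x, metric, maximize) {
  bestVal <- if (maximize) max(x[, metric]) else min(x[, metric])
  tol <- 0.005 * abs(bestVal)
  keep <- if (maximize) x[, metric] >= bestVal - tol else x[, metric] <= bestVal + tol
  candidates <- which(keep)
  ## return the row index of the simplest acceptable model
  candidates[which.min(x$n.trees[candidates])]
}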

As an example, if we chose the previous boosted tree model on the basis of overall accuracy, we would choose: n.trees = 1450, interaction.depth = 5, shrinkage = 0.1, n.minobsinnode = 20. The plot is fairly flat, with accuracy values ranging from 0.863 to 0.922; a less complex model (e.g. fewer, shallower trees) might also yield acceptable accuracy.

The tolerance function can be used to find a less complex model based on (x - x_best)/x_best * 100, the percent difference. For example, to select parameter values based on a 2% loss of performance:

whichTwoPct <- tolerance(Fit3$results, metric = "ROC",
                         tol = 2, maximize = TRUE)


Fit3$results[whichTwoPct, 1:6]


This indicates that we can get a less complex model with an area under the ROC curve of 0.914 (compared to the "pick the best" value of 0.922).

The main issue with these functions is related to ordering the models from simplest to most complex. In some cases this is easy (e.g. simple trees, partial least squares), but in cases such as this model, the ordering is subjective. For example, is a boosted tree model with 100 iterations and a tree depth of 2 more complex than one with 50 iterations and a depth of 8? The package makes some choices. In the case of boosted trees, the package assumes that increasing the number of iterations adds complexity at a faster rate than increasing the tree depth, so models are ordered by the number of iterations and then by depth.

Extract predictions and class probabilities

As previously mentioned, the object produced by train contains the "optimized" model in the finalModel sub-object. Predictions can be made from these objects as usual. In some cases, such as pls or gbm objects, additional parameters from the optimized fit may need to be specified. In these cases, the train object uses the results of the parameter optimization to predict new samples. For example, if predictions were created using predict.gbm, the user would have to specify the number of trees directly (there is no default). Also, for binary classification, the predictions from this function take the form of the probability of one of the classes, so an extra step is required to convert this to a factor vector. predict.train automatically handles these details (and for other models as well).

Also, there is very little standard syntax for model predictions in R. For example, to get class probabilities, many predict methods have an argument called type that is used to specify whether classes or probabilities should be generated. Different packages use different values of type, such as "prob", "posterior", "response", "probability" or "raw". In other cases, completely different syntax is used.

For predict.train, the type options are standardized to be "class" and "prob". For example:

predict(Fit3, newdata = head(testing))


predict(Fit3, newdata = head(testing), type = "prob")


Explore and compare resampling distributions

Within-model

For example, the following statements create a density plot:

trellis.par.set(caretTheme())
densityplot(Fit3, pch = "|")


Note that if you are interested in plotting the resampling results across multiple tuning parameters, the option resamples = "all" should be used in the control object.

Between models

The caret package also includes functions to characterize the differences between models (generated using train, sbf or rfe) via their resampling distributions.

First, a support vector machine model is fit to the sonar data. The data are standardized using the preProc argument. Note that the same random number seed is set prior to the fit; it is identical to the seed used for the boosted tree model, which ensures that the same resampling sets are used.

set.seed(825)
svmFit <- train(Class ~ ., data = training, method = "svmRadial",
                trControl = ctrl,
                preProc = c("center", "scale"),
                tuneLength = 8,
                metric = "ROC")
svmFit


In addition, a regularized discriminant analysis model was fitted.

set.seed(825)
rdaFit <- train(Class ~ ., data = training, method = "rda",
                trControl = ctrl, metric = "ROC")
rdaFit


Given these models, can we make statistical statements about their performance differences? To do this, we first collect the resampling results using the resamples function.

resamps <- resamples(list(GBM = Fit3, SVM = svmFit, RDA = rdaFit))


summary(resamps)


There are several lattice plot methods that can be used to visualize the resampling distributions: density plots, box-and-whisker plots, scatterplot matrices, and scatterplots of summary statistics. For example:

theme1 <- trellis.par.get()
theme1$plot.symbol$col <- rgb(.2, .2, .2, .4)
trellis.par.set(theme1)
bwplot(resamps, layout = c(3, 1))


Since the models are fit on the same versions of the training data, it makes sense to make inferences on the differences between the models. In this way, we reduce the within-resample correlation that may exist. We can compute the differences and then use a simple t-test to evaluate the null hypothesis that there is no difference between the models.
 

difValues <- diff(resamps)
difValues


summary(difValues)


bwplot(difValues, layout = c(3, 1))


dotplot(difValues)


Fitting models without parameter tuning

When the model tuning values are known, train can be used to fit the model to the entire training set without any resampling or parameter tuning, using the method = "none" option in trainControl. For example:

fitControl <- trainControl(method = "none", classProbs = TRUE)
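A single tuning parameter combination must then be supplied to train via tuneGrid, since no resampling is available to choose among candidates; a sketch (the particular values are illustrative):

Fit4 <- train(Class ~ ., data = training, method = "gbm",
              trControl = fitControl, verbose = FALSE,
              ## only one row is allowed when method = "none"
              tuneGrid = data.frame(interaction.depth = 4, n.trees = 100,
                                    shrinkage = .1, n.minobsinnode = 20))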
Fit4


Note that plot.train, resamples, confusionMatrix.train and several other functions will not work with this object, but predict.train and others will:

predict(Fit4, newdata = head(testing))


predict(Fit4, newdata = head(testing), type = "prob")

