This article describes how the caret package simplifies the model building and evaluation process.
The train function of the caret package can be used to
- evaluate, using resampling, the effect of model tuning parameters on performance
- choose the "best" model across these parameters
- estimate model performance from a training set
First, a specific model must be selected.
The first step in tuning a model is to choose a set of parameters to evaluate. For example, if fitting a partial least squares (PLS) model, you must specify the number of PLS components to evaluate.
Once the model and tuning parameter values are defined, the type of resampling should also be specified. Currently, _k_-fold cross-validation (simple or repeated), leave-one-out cross-validation, and bootstrap (simple estimation or the 632 rule) resampling methods can be used by train. After resampling, the process produces a profile of performance measures to guide the user in choosing which tuning parameter values to select. By default, the function automatically chooses the tuning parameters associated with the best value, although different algorithms can be used.
Sonar data example
Here we load the data:
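The Sonar data come from the mlbench package; a minimal setup before partitioning might look like this (the seed value simply fixes the partition so it is reproducible):

```r
library(caret)
library(mlbench)
data(Sonar)

set.seed(998)  # fix the seed so the partition below is reproducible
```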
Create a stratified random sample of the data as training and test sets:

inTraining <- createDataPartition(Sonar$Class, p = .75, list = FALSE)
training <- Sonar[ inTraining, ]
testing  <- Sonar[-inTraining, ]
We will use this data to illustrate the functionality on this (and other) pages.
Basic parameter tuning
By default, simple bootstrap resampling is used for line 3 of the algorithm above. Others are available, such as repeated _K_-fold cross-validation, leave-one-out, etc. The function trainControl can be used to specify the type of resampling:
fitControl <- trainControl(## 10-fold CV
                           method = "repeatedcv",
                           number = 10,
                           ## repeated ten times
                           repeats = 10)
The first two arguments to train are the predictor and outcome data objects, respectively. The third argument, method, specifies the type of model. To illustrate, we will fit a boosted tree model via the gbm package. The basic syntax for fitting this model using repeated cross-validation is shown below:
Fit1 <- train(Class ~ ., data = training,
              method = "gbm",
              trControl = fitControl,
              ## This last option is actually one
              ## for gbm() that passes through
              verbose = FALSE)
For gradient boosting machine (GBM) models, there are four main tuning parameters:
- the number of iterations, i.e. trees (n.trees)
- the complexity of the tree, called interaction.depth
- the learning rate: how quickly the algorithm adapts, called shrinkage
- the minimum number of training set samples in a node to commence splitting (n.minobsinnode)
The default values tested for this model are shown in the first two columns (shrinkage and n.minobsinnode are not shown because the grid set of candidate models all use single values for these tuning parameters). The column labeled "Accuracy" is the overall agreement rate averaged over the cross-validation iterations. The agreement standard deviation is also calculated from the cross-validation results. The "Kappa" column is the mean of Cohen's (unweighted) Kappa statistic over the resampling results.
train works with specific models. For these models, a grid of tuning parameters can be created automatically. By default, if _p_ is the number of tuning parameters, the grid size is _3^p_. As another example, the regularized discriminant analysis (RDA) model has two parameters (gamma and lambda), both of which lie between 0 and 1. The default training grid would produce nine combinations in this two-dimensional space.
The next section covers the other features of train.
Many models use random numbers at the stage of estimating parameters. Also, the resampling index is chosen using random numbers. There are two main ways to control randomness to ensure reproducible results.
- There are two ways to ensure that the same resamples are used between calls to train. The first is to use set.seed just prior to calling train; the first use of random numbers is to create the resampling information. Alternatively, if you would like to use specific splits of the data, the index argument of the trainControl function can be used.
- Seeds can also be set for the models fit within each resample. While setting the seed prior to calling train may guarantee that the same random numbers are used, this is unlikely to be the case when parallel processing is used (depending on which technology is leveraged). To set the model-fitting seeds, trainControl has an additional argument called seeds. The value for this argument is a list of integer vectors that are used as seeds. The help page for trainControl describes the appropriate format for this option.
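A sketch of building such a seeds list, assuming 10-fold CV repeated 10 times (so B = 100 resamples) and that 30 is an upper bound on the number of tuning combinations evaluated; the object names are illustrative:

```r
set.seed(1)
B <- 100                                    # 10-fold CV repeated 10 times
seeds <- vector(mode = "list", length = B + 1)
## one integer vector per resample, each at least as long as the
## number of models evaluated in that resample
for (i in 1:B) seeds[[i]] <- sample.int(10000, 30)
## the last element is a single seed for the final model fit
seeds[[B + 1]] <- sample.int(10000, 1)

seededCtrl <- trainControl(method = "repeatedcv",
                           number = 10, repeats = 10,
                           seeds = seeds)
```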
Custom tuning process
There are several ways to customize the process of selecting tuning/complexity parameters and building the final model.
As previously mentioned, train can pre-process the data in various ways prior to model fitting. The function preProcess is used automatically. This function can be used for centering and scaling, imputation (see details below), applying the spatial sign transformation, and feature extraction via principal component analysis or independent component analysis.
To specify what pre-processing should occur, the train function has an argument called preProcess. Additional options to the preProcess function can be passed via the trainControl function.
These processing steps would be applied during any predictions generated using predict.train, extractPrediction or extractProbs (see details later in this document). The pre-processing would not be applied to predictions that directly use the object$finalModel object.
For imputation, three methods are currently implemented:
- _k_-nearest neighbors takes a sample with missing values and finds the _k_ closest samples in the training set. The average of the _k_ training set values of the predictor is used as a surrogate for the original data. When calculating the distances to the training set samples, the predictors used in the calculation are those that have no missing values for that sample and no missing values in the training set.
- Another approach is to fit a bagged tree model for each predictor using the training set samples. This is usually a fairly accurate model and can handle missing values. When a predictor for a sample requires imputation, the values of the other predictors are fed through the bagged tree and the prediction is used as the new value. This model can have a significant computational cost.
- The median of the predictor training set values can be used to estimate missing data.
If there are missing values in the training set, PCA and ICA models only use complete samples.
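A sketch of requesting imputation via preProcess directly (the object names are illustrative; "knnImpute", "bagImpute" and "medianImpute" are the documented method values for the three approaches above):

```r
## compute centering/scaling/imputation parameters from the
## predictors (column 61 of Sonar holds the outcome, Class)
preVals <- preProcess(training[, -61],
                      method = c("center", "scale", "knnImpute"))
## apply them, here back to the training predictors
trainImputed <- predict(preVals, training[, -61])
```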
Alternate tuning grid
The tuning parameter grid can be specified by the user. The argument tuneGrid can take a data frame with columns for each tuning parameter. The column names should be the same as the fitting function's arguments. For the previously mentioned RDA example, the names would be gamma and lambda. train will tune the model over each combination of values in the rows.
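For instance, a grid for the RDA example might be built as follows (the name rdaGrid and the particular value sequences are illustrative):

```r
## column names must match the tuning parameters gamma and lambda
rdaGrid <- expand.grid(gamma = seq(0, 1, length.out = 3),
                       lambda = seq(0, 1, length.out = 3))
```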
For boosted tree models, we can fix the learning rate and evaluate more than three values of n.trees.
gbmGrid <- expand.grid(interaction.depth = c(1, 5, 9),
                       n.trees = (1:30)*50,
                       shrinkage = 0.1,
                       n.minobsinnode = 20)

Fit2 <- train(Class ~ ., data = training,
              method = "gbm",
              trControl = fitControl,
              verbose = FALSE,
              ## Now specify the exact models to evaluate:
              tuneGrid = gbmGrid)
Another option is to use a random sample of possible tuning parameter combinations, i.e. "random search". To use random search, use the option search = "random" in the call to trainControl. In this situation, the tuneLength parameter defines the total number of parameter combinations that will be evaluated.
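A sketch of random search under these options (the names rSearch and FitRandom, and the choice of 30 combinations, are illustrative):

```r
rSearch <- trainControl(method = "repeatedcv",
                        number = 10,
                        repeats = 10,
                        search = "random")

FitRandom <- train(Class ~ ., data = training,
                   method = "gbm",
                   trControl = rSearch,
                   verbose = FALSE,
                   ## evaluate 30 random parameter combinations
                   tuneLength = 30)
```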
Plotting the resampling profile
The plot function can be used to examine the relationship between the performance estimates and the tuning parameters. For example, a simple invocation of the function shows the results for the first performance measure:
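Assuming the grid-tuned model above is stored as Fit2, the default call is simply:

```r
plot(Fit2)
```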
The metric option can be used to show other performance measures:

trellis.par.set(caretTheme())
plot(Fit2, metric = "Kappa")
Other types of plots are also available. See ?plot.train for more details. The code below shows a heatmap of the results:
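A sketch of such a heatmap, using plot.train's documented plotType = "level" option (the axis rotation is just a readability tweak):

```r
trellis.par.set(caretTheme())
plot(Fit2, metric = "Kappa", plotType = "level",
     scales = list(x = list(rot = 90)))
```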
A ggplot method can also be used:
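For example, the default ggplot method plots the first performance metric against the tuning parameters:

```r
ggplot(Fit2)
```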
There are also plotting functions that show more detailed representations of the resampled estimates. See ?xyplot.train for more details.
From these plots, a different set of tuning parameters may be desired. To change the final values without starting the whole process again, update.train can be used to refit the final model. See ?update.train.
The function trainControl generates parameters that further control how models are created, with possible values:
method: the resampling method: "boot", "cv", "LOOCV", "LGOCV", "repeatedcv", "timeslice", "none" and "oob". The last value, out-of-bag estimates, can only be used by random forest, bagged trees, bagged earth, bagged flexible discriminant analysis, or conditional tree forest models. GBM models are not included. Also, for leave-one-out cross-validation, no uncertainty estimates are given for the resampled performance measures.
number: controls the number of folds in _K_-fold cross-validation or the number of resampling iterations for bootstrapping and leave-group-out cross-validation.
repeats: applies only to repeated _K_-fold cross-validation. Suppose method = "repeatedcv", number = 10 and repeats = 3; then three separate 10-fold cross-validations are used as the resampling scheme.
verboseIter: a logical for printing a training log.
returnData: a logical for saving the data into a slot called trainingData.
Alternative performance metrics
Users can change the metrics used to determine optimal settings. By default, RMSE, _R_2, and mean absolute error (MAE) are computed for regression, while accuracy and Kappa are computed for classification. Also by default, parameter values are chosen using RMSE and accuracy for regression and classification, respectively.
The metric argument of train allows the user to control which optimality criterion is used. For example, in problems where there is a low percentage of samples in one class, using metric = "Kappa" can improve the quality of the final model.
If none of these metrics is satisfactory, the user can also compute custom performance metrics. The trainControl function has an argument called summaryFunction that specifies a function for computing performance. The function should have the following arguments:
- data is a reference to a data frame or matrix with columns called obs and pred for the observed and predicted outcome values (either numeric data for regression or character values for classification). Currently, class probabilities are not passed to the function. The values in data are the held-out predictions (and their associated reference values) for a single combination of tuning parameters. If the classProbs argument of the trainControl object is set to TRUE, additional columns containing the class probabilities will be present in data. The names of these columns are the same as the class levels. Also, if weights were specified in the call to train, a column called weights will also be in the data set.
- lev is a character string that has the outcome factor levels taken from the training data. For regression, a value of NULL is passed into the function.
- model is a character string for the model being used (i.e. the value passed to the method argument of train).
The output of this function should be a vector of numeric summary metrics with non-null names. By default, train evaluates classification models in terms of the predicted classes. Optionally, class probabilities can also be used to measure performance. To obtain the predicted class probabilities within the resampling process, the classProbs argument in trainControl must be set to TRUE. This merges columns of probabilities into the predictions generated from each resample (there is a column per class and the column names are the class names).
As shown in the last section, custom functions can be used to calculate performance scores that are averaged over the resamples. The built-in function twoClassSummary computes the sensitivity, specificity and area under the ROC curve:
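A minimal sketch of the control object for this, using caret's built-in twoClassSummary (class probabilities must be turned on for ROC-based summaries):

```r
fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 10,
                           ## Estimate class probabilities
                           classProbs = TRUE,
                           ## Evaluate performance using
                           ## the following function
                           summaryFunction = twoClassSummary)
```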
To reconstruct a boosted tree model using this criterion, we can see the relationship between the tuning parameters and the area under the ROC curve using the following code:
Fit3 <- train(Class ~ ., data = training,
              method = "gbm",
              trControl = fitControl,
              verbose = FALSE,
              tuneGrid = gbmGrid,
              ## Specify which metric to optimize
              metric = "ROC")
In this case, the average area under the ROC curve associated with the best tuning parameter was 0.922 over 100 resamplings.
Choose the final model
Another way to customize the tuning process is to modify the algorithm used to select the "best" parameter values, given the performance numbers. By default, the train function chooses the model with the largest performance value (or smallest, for mean squared error in regression models). Other schemes for selecting models can be used. Breiman et al (1984) suggested the "one standard error rule" for simple tree-based models. In this case, the model with the best performance value is identified and, using resampling, the standard error of performance is estimated. The final model used is the simplest model within one standard error of the (empirically) best model. With simple trees this makes sense, since these models will start to overfit as they become more and more specific to the training data.
train allows the user to specify alternate rules for selecting the final model. The argument selectionFunction can be used to supply a function to algorithmically determine the final model. There are three existing functions in the package: best chooses the largest/smallest value, oneSE attempts to capture the spirit of Breiman et al (1984), and tolerance selects the least complex model within some percent tolerance of the best value.
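For instance, the one-standard-error rule can be requested through trainControl (a sketch; the control object name is illustrative):

```r
oneSECtrl <- trainControl(method = "repeatedcv",
                          number = 10,
                          repeats = 10,
                          ## pick the simplest model within one SE
                          ## of the empirically best model
                          selectionFunction = "oneSE")
```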
User-defined functions can be used as long as they have the following parameters:
- x is a data frame containing the tuning parameters and their associated performance metrics. Each row corresponds to a different tuning parameter combination.
- metric is a character string indicating which performance metric should be optimized (this is passed in directly from the metric argument of train).
- maximize is a single logical value indicating whether larger values of the performance metric are better (this is also directly passed from the call to train).
The function should output a single integer indicating which row in x is chosen.
As an example, if we chose the boosted tree model from before based on overall accuracy, we would choose: n.trees = 1450, interaction.depth = 5, shrinkage = 0.1, n.minobsinnode = 20. The plot is fairly compact, with accuracy values ranging from 0.863 to 0.922. A less complex model (e.g. fewer, shallower trees) might also yield acceptable accuracy.
The tolerance function could be used to find a less complex model based on ( _x_ − _x_best ) / _x_best × 100, the percent difference. For example, to select parameter values based on a 2% loss of performance:
whichTwoPct <- tolerance(Fit3$results, metric = "ROC",
                         tol = 2, maximize = TRUE)
This shows that we can get a less complex model with an area under the ROC curve of 0.914 (compared to the "choose best" value of 0.922).
The main issue with these functions is related to ordering the models from simplest to most complex. In some cases, this is easy (e.g. simple trees, partial least squares), but for models such as this one, the ordering is subjective. For example, is a boosted tree model with 100 iterations and a tree depth of 2 more complex than one with 50 iterations and a depth of 8? The package makes some choices regarding these orderings. In the case of boosted trees, the package assumes that increasing the number of iterations adds complexity at a faster rate than increasing the tree depth, so models are ordered on the number of iterations and then ordered by depth.
Extract predictions and class probabilities
As previously mentioned, the objects produced by the train function contain the "optimized" model in the finalModel sub-object. Predictions can be made from these objects as usual. In some cases, such as pls or gbm objects, additional parameters from the optimized fit may need to be specified. In these cases, the train objects use the results of the parameter optimization to predict new samples. For example, if predictions were created using predict.gbm, the user would have to specify the number of trees directly (there is no default). Also, for binary classification, the predictions from this function take the form of the probability of one of the classes, so extra steps are required to convert this to a factor vector. predict.train automatically handles these details for this (and other) models.
Also, very few standard syntaxes exist for model predictions in R. For example, to get class probabilities, many predict methods have an argument called type that is used to specify whether classes or probabilities should be generated. Different packages use different values of type, such as "prob", "posterior", "response", "probability" or "raw". In other cases, completely different syntax is used.
For predict.train, the type options are standardized to be "class" and "prob". For example:
predict(Fit3, newdata = head(testing))
predict(Fit3, newdata = head(testing), type = "prob")
Explore and compare resampling distributions
Within the model
For example, the following statement creates a density plot:
trellis.par.set(caretTheme())
densityplot(Fit3, pch = "|")
Note that if you are interested in plotting the resampling results across multiple tuning parameters, the option returnResamp = "all" should be used in the control object.
The caret package also includes functions to characterize the differences between models (generated using train, sbf or rfe) via their resampling distributions.
First, a support vector machine model is fit to the sonar data. The preProc argument is used to center and scale the data. Note that the same random number seed is set prior to the model fit as was used for the boosted tree model.
set.seed(825)
svmFit <- train(Class ~ ., data = training,
                method = "svmRadial",
                trControl = fitControl,
                preProc = c("center", "scale"),
                tuneLength = 8,
                metric = "ROC")
Also, a regularized discriminant analysis model was fit:

set.seed(825)
rdaFit <- train(Class ~ ., data = training,
                method = "rda",
                trControl = fitControl,
                tuneLength = 4,
                metric = "ROC")
Given these models, can we make statistical statements about their performance differences? To do this, we first collect the resampling results using the resamples function:

resamps <- resamples(list(GBM = Fit3,
                          SVM = svmFit,
                          RDA = rdaFit))
There are several lattice plot methods that can be used to visualize the resampling distributions: density plots, box-and-whisker plots, scatterplot matrices and scatterplots of summary statistics. For example:
theme1 <- trellis.par.get()
theme1$plot.symbol$col = rgb(.2, .2, .2, .4)
trellis.par.set(theme1)
bwplot(resamps, layout = c(3, 1))
Since the models are fit on the same versions of the training data, it makes sense to make inferences on the differences between models. In this way, we reduce the within-resample correlation that may exist. We can compute the differences, then use a simple t-test to evaluate the null hypothesis that there is no difference between models.
difValues <- diff(resamps)
summary(difValues)
bwplot(difValues, layout = c(3, 1))
Fitted model without parameter adjustment
In cases where the model tuning values are known, train can be used to fit the model to the entire training set without any resampling or parameter tuning. The method = "none" option in trainControl can be used:
fitControl <- trainControl(method = "none", classProbs = TRUE)

Fit4 <- train(Class ~ ., data = training,
              method = "gbm",
              trControl = fitControl,
              verbose = FALSE,
              ## Only a single model can be passed to the
              ## function when no resampling is used:
              tuneGrid = data.frame(interaction.depth = 4,
                                    n.trees = 100,
                                    shrinkage = .1,
                                    n.minobsinnode = 20),
              metric = "ROC")
Note that plot.train, resamples, confusionMatrix.train and several other functions will not work with this object, but predict.train and others will:
predict(Fit4, newdata = head(testing))
predict(Fit4, newdata = head(testing), type = "prob")