Partition of data sets for machine learning


Generally, the data set is divided into three subsets (as shown in the figure below). Holding out a validation set in addition to the training and test sets can greatly reduce the chance of overfitting:


Figure 2. Dividing a single data set into three subsets
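
A three-way split of this kind can be sketched in a few lines of Python. This is only an illustration: the function name split_dataset, the 70/15/15 proportions, and the fixed seed are assumptions chosen for the demo, not values taken from this article.

```python
import random


def split_dataset(examples, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle a copy of `examples` and cut it into train/validation/test lists.

    The fractions are illustrative defaults; whatever is left after the
    training and validation slices becomes the test set.
    """
    if train_frac <= 0 or val_frac <= 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must be positive and leave room for a test set")

    shuffled = list(examples)
    # Shuffle first so that each subset is a random sample of the data.
    random.Random(seed).shuffle(shuffled)

    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)

    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, validation, test


train, validation, test = split_dataset(range(100))
print(len(train), len(validation), len(test))
```

Every example lands in exactly one subset, so nothing the model is evaluated on was seen during training.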

Use the validation set to evaluate the results of training. Then, once the model has “passed” the validation set, use the test set to double-check that evaluation. The following figure shows this new workflow:


Figure 3. Better workflow

In this improved workflow:

1. Select the model that performs best on the validation set.

2. Double-check that model against the test set.

This workflow is better because it exposes the test set less often: the test set is consulted only once, after a model has already been chosen.
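
The two steps can be sketched as follows. This is a toy example under stated assumptions, not the article's own code: the data is synthetic, the model is a one-dimensional k-nearest-neighbours classifier, and the helper names (make_data, knn_predict, accuracy) and the candidate values of k are all hypothetical.

```python
import random


def make_data(n, rng):
    """Toy data: the label is 1 when x > 0.5, with about 10% of labels flipped."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < 0.1:
            label = 1 - label
        data.append((x, label))
    return data


def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (k is odd)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return int(2 * votes > k)


def accuracy(train, examples, k):
    """Fraction of `examples` that the k-NN model labels correctly."""
    correct = sum(knn_predict(train, x, k) == y for x, y in examples)
    return correct / len(examples)


rng = random.Random(0)
data = make_data(600, rng)  # already in random order, so slicing is a fair split
train, validation, test = data[:400], data[400:500], data[500:]

# Step 1: compare candidate hyperparameters on the validation set only.
candidates = [1, 5, 15, 45]
val_scores = {k: accuracy(train, validation, k) for k in candidates}
best_k = max(val_scores, key=val_scores.get)

# Step 2: consult the test set exactly once, for the chosen model.
test_score = accuracy(train, test, best_k)
print(f"best k={best_k}, validation={val_scores[best_k]:.2f}, test={test_score:.2f}")
```

Because the choice of k never looks at the test set, the final test score remains an honest estimate of how the selected model behaves on unseen data.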

Note:

The more you use the same data to decide on hyperparameter settings or other model improvements, the less confident you can be that the results will truly generalize to new, unseen data. In other words, test sets and validation sets “wear out” with repeated use. Note that validation sets usually wear out more slowly than test sets.

If possible, collect more data to “refresh” the test set and validation set. Starting over with fresh data is a good way to reset.

This work is licensed under a CC agreement; reprints must credit the author and link to this article.