How to split a dataset correctly? Summary of three common methods
Splitting a data set into training and validation sets helps us understand how well a model generalizes to new, unseen data. An overfitted model may not generalize well to such data, and therefore cannot make good predictions.
Having an appropriate validation strategy is the first step toward making good predictions and realizing the business value of AI models. This article compiles some common data splitting strategies.
Divide the data set into 2 parts, training and validation, using 80% for training and 20% for validation. You can do this using scikit-learn's random sampling.
First of all, the random seed needs to be fixed; otherwise the split changes on every run, so results cannot be compared or reproduced during debugging. If the data set is small, there is no guarantee that the validation split is uncorrelated with the training split. If the data is unbalanced, the class ratio of the full data set is not guaranteed to be preserved in each split.
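As a minimal sketch of the 80/20 split described above, the snippet below uses scikit-learn's `train_test_split` with a fixed `random_state`. The toy arrays `X` and `y` are made up for illustration, and the `stratify=y` call at the end is an optional addition (not part of the original description) that keeps the class ratio in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, made up for illustration: 100 samples, 2 features, unbalanced labels.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)

# 80% training / 20% validation; random_state fixes the seed so the split is reproducible.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Running the split again with the same seed yields the same validation rows.
X_train2, X_val2, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
same_split = np.array_equal(X_val, X_val2)

# Optional: stratify=y preserves the 90/10 class ratio in both parts.
_, _, _, y_val_strat = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(len(X_train), len(X_val), same_split, int(y_val_strat.sum()))
```

Without `stratify`, the number of minority-class samples that land in the validation part depends on the seed, which is the ratio problem mentioned above.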
So a simple split mainly helps us develop and debug. It is not robust enough for real training, and the following splitting methods help address these problems.
Split the data set into k partitions. In the image below, the dataset is divided into 5 partitions.
Select one partition as the validation data set, while the other partitions form the training data set. Repeating this for every partition trains the model on each different combination of partitions.
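To make the mechanism explicit, here is a small library-free sketch of the partitioning step. The helper name `k_fold_indices` and its parameters are hypothetical, chosen only for this illustration; in practice scikit-learn's `KFold` does the same job.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs, one per fold (illustrative helper)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed, so the folds are reproducible
    # Partition i takes every k-th shuffled index, so partition sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val_idx = folds[i]  # one partition is held out for validation
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, val_idx

splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, val_idx in splits:
    print("train:", sorted(train_idx), "val:", sorted(val_idx))
```

Every sample appears in exactly one validation partition, so each model is validated on data it was not trained on.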
In the end, K different models are obtained, and these models are later used together through an ensemble method at inference time.
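One possible way to combine the K models is to average their predicted probabilities. The sketch below assumes scikit-learn's `KFold` and a `LogisticRegression` classifier on synthetic data; the model choice, the data, and the averaging rule are all assumptions for illustration, not something the article prescribes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic data, made up for illustration; the last 10 rows stand in for new, unseen data.
X_all, y_all = make_classification(n_samples=210, n_features=5, random_state=0)
X, y = X_all[:200], y_all[:200]
X_new = X_all[200:]

kf = KFold(n_splits=5, shuffle=True, random_state=0)
models, val_scores = [], []
for train_idx, val_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])                    # train on K-1 partitions
    val_scores.append(model.score(X[val_idx], y[val_idx]))  # validate on the held-out one
    models.append(model)

# Simple ensemble: average the class probabilities predicted by the K models.
avg_proba = np.mean([m.predict_proba(X_new) for m in models], axis=0)
ensemble_pred = avg_proba.argmax(axis=1)

print(len(models), avg_proba.shape, np.round(val_scores, 2))
```

The per-fold validation scores also give a rough idea of how much the performance estimate varies between splits.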
K is usually set to one of 3, 5, 7, 10, or 20.
If you want a low-bias estimate of model performance, use a higher K (for example 20). If you are building a model for variable selection, use a lower K (3 or 5) and the model will have lower variance.
Advantages:
Problems:
Stratified-kFold can retain the ratio between different classes in each fold. Suppose the data set is unbalanced, say Class1 has 10 examples and Class2 has 100 examples. Stratified-kFold creates each fold with the same class ratio as the original data set.
The idea is similar to K-fold cross-validation, but each fold has the same class ratio as the original data set.
Each split preserves the initial ratio between classes. If your data set is large, plain K-fold cross-validation may also preserve the proportions, but only by chance, whereas Stratified-kFold guarantees them and can therefore be used with small data sets.
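The 10-versus-100 example above can be checked with scikit-learn's `StratifiedKFold`. This is a minimal sketch: the placeholder feature matrix and the choice of 5 folds are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Class 1 has 10 examples and class 2 has 100, as in the example in the text.
y = np.array([1] * 10 + [2] * 100)
X = np.zeros((len(y), 1))  # placeholder features; only the labels drive the split

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_counts = []
for train_idx, val_idx in skf.split(X, y):
    n_class1 = int(np.sum(y[val_idx] == 1))
    n_class2 = int(np.sum(y[val_idx] == 2))
    fold_counts.append((n_class1, n_class2))

print(fold_counts)  # each validation fold holds 2 examples of class 1 and 20 of class 2
```

Each validation fold keeps the original 1:10 ratio, which plain K-fold would only match by chance on a data set this small.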
Bootstrap and Subsampling are similar to K-fold cross-validation, but they do not have fixed folds. They randomly select some data from the data set, use the remaining data for validation, and repeat this n times.
Bootstrap = sampling with replacement, which we have introduced in detail in previous articles.
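The repeated random splits can be sketched with the standard library alone. Bootstrap conventionally draws the training indices with replacement; the sketch assumes that Subsampling draws them without replacement, and that whatever was not drawn is used for validation. The helper names, the 80% training fraction, and the sample counts are hypothetical.

```python
import random

def bootstrap_split(n_samples, rng):
    """Draw n_samples training indices WITH replacement; validate on the rest."""
    train_idx = [rng.randrange(n_samples) for _ in range(n_samples)]
    chosen = set(train_idx)
    val_idx = [i for i in range(n_samples) if i not in chosen]
    return train_idx, val_idx

def subsample_split(n_samples, train_fraction, rng):
    """Draw training indices WITHOUT replacement; validate on the rest."""
    n_train = int(n_samples * train_fraction)
    train_idx = rng.sample(range(n_samples), n_train)
    chosen = set(train_idx)
    val_idx = [i for i in range(n_samples) if i not in chosen]
    return train_idx, val_idx

rng = random.Random(0)  # fixed seed so the repeated splits are reproducible
n_repeats = 3
boot_splits = [bootstrap_split(20, rng) for _ in range(n_repeats)]
sub_splits = [subsample_split(20, 0.8, rng) for _ in range(n_repeats)]

for train_idx, val_idx in boot_splits:
    print("bootstrap: unique train =", len(set(train_idx)), "val =", len(val_idx))
for train_idx, val_idx in sub_splits:
    print("subsample: train =", len(train_idx), "val =", len(val_idx))
```

Unlike K-fold, a given sample may end up in the validation part several times across repetitions, or never.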
When should they be used? Bootstrap and Subsampling are only worth using if the standard error of the estimated metric is large. This may be due to outliers in the data set.
Usually in machine learning, k-fold cross-validation is used as a starting point. If the data set is unbalanced, Stratified-kFold is used. If there are many outliers, Bootstrap or other methods can be used to improve the data split.