
How to split a dataset correctly? Summary of three common methods

WBOY
2023-04-08 18:51:07

Splitting a dataset into training and validation sets helps us understand how well a model generalizes to new, unseen data. An overfitted model may not generalize well, and so cannot make good predictions.

An appropriate validation strategy is the first step toward making good predictions and realizing the business value of AI models. This article summarizes some common data splitting strategies.

Simple training and test split

Divide the dataset into 2 parts, training and validation, using 80% for training and 20% for validation. You can do this with scikit-learn's random sampling.
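As a minimal sketch of the 80/20 split described above, the following uses scikit-learn's `train_test_split`. The toy data, the variable names, and the seed value 42 are assumptions made for illustration, not part of the original article.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (illustrative only): 100 samples, 3 features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# 80% training, 20% validation; random_state fixes the shuffle.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_val.shape)  # (80, 3) (20, 3)
```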


First of all, the random seed needs to be fixed. Otherwise each run produces a different split, so results cannot be compared across runs or reproduced during debugging. If the dataset is small, there is no guarantee that the validation split is uncorrelated with the training split. If the data is unbalanced, the class ratios in the two splits will not necessarily match those of the full dataset.
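Both points can be checked with a small experiment. This is a sketch under assumed toy data (10 positive and 100 negative labels); the helper `val_indices` is hypothetical and exists only for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy unbalanced labels (illustrative only): 10 positives, 100 negatives.
y = np.array([1] * 10 + [0] * 100)
X = np.arange(len(y)).reshape(-1, 1)  # each sample's feature is its own id

def val_indices(seed):
    """Hypothetical helper: sorted sample ids that land in the validation split."""
    _, X_val, _, _ = train_test_split(X, y, test_size=0.2, random_state=seed)
    return sorted(X_val.ravel().tolist())

# Same seed -> identical split, so runs can be compared and reproduced.
same_split = val_indices(0) == val_indices(0)

# Different seeds -> the number of positives in validation can drift.
positives_per_seed = []
for seed in range(5):
    _, _, _, y_val = train_test_split(X, y, test_size=0.2, random_state=seed)
    positives_per_seed.append(int(y_val.sum()))

print(same_split, positives_per_seed)
```

With a proportional split, each validation set would hold 2 positives; any other count in `positives_per_seed` is the ratio drift described above.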

So a simple split is useful mainly for development and debugging; it is not robust enough for real training. The following splitting methods address these problems.

K-fold cross validation

Split the dataset into k partitions. For example, with k = 5 the dataset is divided into 5 partitions.

Select one partition as the validation set and use the remaining partitions as the training set. Repeating this for each partition trains the model on a different combination of partitions every time.

In the end, K different models are obtained. At inference time these models are used together through an ensemble method.
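The procedure can be sketched with scikit-learn's `KFold`. The synthetic data, the choice of `LogisticRegression`, and averaging predicted probabilities as the ensemble method are all assumptions for illustration; the article does not prescribe a specific model or ensembling rule.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_new = rng.normal(size=(10, 4))  # stands in for unseen data

kf = KFold(n_splits=5, shuffle=True, random_state=42)
models, fold_scores = [], []
for train_idx, val_idx in kf.split(X):
    # Train on k-1 partitions, validate on the held-out partition.
    model = LogisticRegression()
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[val_idx], y[val_idx]))
    models.append(model)

# Ensemble at inference: average the k models' predicted probabilities.
avg_proba = np.mean([m.predict_proba(X_new)[:, 1] for m in models], axis=0)
ensemble_pred = (avg_proba >= 0.5).astype(int)

print(len(models), np.mean(fold_scores), ensemble_pred)
```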

K is usually set to 3, 5, 7, 10, or 20.

To estimate model performance with low bias, use a higher K (for example 20). If you are building a model for variable selection, use a low K (3 or 5), which gives the model lower variance.

Advantages:

  • By averaging model predictions, you can improve model performance on unseen data drawn from the same distribution.
  • This is a widely used method for obtaining good production models.
  • Different ensembling techniques can be used to create a prediction for every sample in the dataset, and these predictions can be used to improve the model. These are called OOF (out-of-fold) predictions.
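Out-of-fold predictions can be sketched with scikit-learn's `cross_val_predict`, which predicts each sample using a model that did not see that sample during training. The synthetic data and the `LogisticRegression` model are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

kf = KFold(n_splits=5, shuffle=True, random_state=42)

# One out-of-fold probability per sample, each produced by the model
# whose training folds excluded that sample.
oof_proba = cross_val_predict(
    LogisticRegression(), X, y, cv=kf, method="predict_proba"
)[:, 1]

oof_accuracy = np.mean((oof_proba >= 0.5).astype(int) == y)
print(oof_proba.shape, oof_accuracy)
```

Because every sample gets exactly one prediction, `oof_proba` can serve as an input feature for a second-stage model or as an estimate of performance over the whole dataset.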

Caveats:

  • If you have an unbalanced dataset, use Stratified-kFold.
  • If you retrain a model on the entire dataset, you cannot compare its performance with any model trained with k-fold, because each k-fold model is trained on k-1 partitions rather than on the entire dataset.

Stratified-kFold

Stratified k-fold retains the ratio between the classes in each fold. Suppose the dataset is unbalanced, with 10 examples in Class1 and 100 examples in Class2. The idea is the same as in k-fold cross-validation, except that every fold is built with the same class ratio as the original dataset.

Each split preserves the initial ratio between classes. If your dataset is large, plain k-fold cross-validation may also preserve the proportions, but only by chance, whereas stratified k-fold guarantees them and therefore also works with small datasets.
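The 10-versus-100 example can be reproduced with scikit-learn's `StratifiedKFold`, with plain `KFold` alongside for comparison. This is a sketch on made-up labels; the feature matrix is a placeholder because only the labels affect stratification.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Labels from the example: 10 samples of Class1 (1), 100 of Class2 (0).
y = np.array([1] * 10 + [0] * 100)
X = np.zeros((len(y), 1))  # placeholder features

# Stratified: every validation fold receives the same share of Class1.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
stratified_counts = [int(y[val_idx].sum()) for _, val_idx in skf.split(X, y)]

# Plain k-fold: the Class1 count per fold is left to chance.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
plain_counts = [int(y[val_idx].sum()) for _, val_idx in kf.split(X)]

print(stratified_counts)  # [2, 2, 2, 2, 2]
print(plain_counts)
```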

Bootstrap and Subsampling

Bootstrap and subsampling are similar to k-fold cross-validation, but they do not have fixed folds. They randomly select some data from the dataset for training, use the remaining data for validation, and repeat this n times.

Bootstrap = sampling with replacement, which we have introduced in detail in previous articles.

When should you use them? Bootstrap and subsampling are worth using only when the standard error of the estimated metric is large. This may be due to outliers in the dataset.
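A minimal sketch of bootstrap validation follows, assuming synthetic data, a `LogisticRegression` model, and 30 repeats; none of these come from the original article. Each repeat trains on a sample drawn with replacement and validates on the rows that were never drawn (the out-of-bag rows). The standard deviation of the scores across repeats is used here as a rough indicator of how noisy the metric estimate is.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

n = len(y)
n_repeats = 30  # assumed value; the article only says "n times"
scores = []
for _ in range(n_repeats):
    # Bootstrap: draw n row indices with replacement for training.
    train_idx = rng.integers(0, n, size=n)
    # Validation: rows that were never drawn (out-of-bag).
    val_idx = np.setdiff1d(np.arange(n), train_idx)
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[val_idx], y[val_idx]))

scores = np.array(scores)
print(scores.mean(), scores.std(ddof=1))
```

For subsampling, the same loop could draw the training indices without replacement instead, for example with `rng.choice(n, size=m, replace=False)` for some training size `m` smaller than `n`.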

Summary

In machine learning, k-fold cross-validation is usually the starting point. If the dataset is unbalanced, use Stratified-kFold. If there are many outliers, bootstrap or other methods can be used to improve the data split.


Statement:
This article is reproduced from 51cto.com.