The Train, Validation, and Test Sets: How to Split Your Machine Learning Data

What's the optimal machine learning data split ratio, and how do you achieve it? Learn how to avoid overfitting and start building reliable machine learning models in hours, not weeks.

Here's the first rule of machine learning:

Don't use the same dataset for model training and model evaluation.

If you want to build a reliable machine learning model, you need to split your dataset into the training set, validation set, and test set.

If you don't, your results will be biased, and you'll end up with a false impression of better model accuracy.

It's a trap!

But—

You are about to learn how to avoid it (and build models that work like magic). This article will give you a brief explanation of why splitting your machine learning data matters and the best ways to approach it. 

Here's what we'll cover:

  1. Train vs. Validation vs. Test set
  2. How to split your machine learning data?
  3. Common pitfalls in the training data split

Now, let's dive in!

Train vs. Validation vs. Test set

To train and test a model, we should break our data down into three distinct dataset splits.

The Training Set

It is the set of data used to train the model, that is, to make it learn the hidden features and patterns in the data.

The same training data is fed to the neural network again in every epoch, and the model continues to learn the features of the data.

The training set should have a diversified set of inputs so that the model is trained in all scenarios and can predict any unseen data sample that may appear in the future.

The Validation Set

The validation set is a set of data, separate from the training set, that is used to validate our model performance during training.

This validation process gives information that helps us tune the model’s hyperparameters and configurations accordingly. It is like a critic telling us whether the training is moving in the right direction or not.

The model is trained on the training set, and the model evaluation is performed on the validation set after every epoch.

The main idea of holding out a validation set is to detect and prevent overfitting, i.e., the situation where the model becomes really good at classifying the samples in the training set but cannot generalize and make accurate classifications on data it has not seen before.
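As a rough illustration of evaluating on the validation set after every epoch, the sketch below fits a one-parameter linear model with plain gradient descent and records both losses per epoch. The toy data, the learning rate, and the names `mse` and `train_with_validation` are assumptions made up for this example, not part of any library.

```python
def mse(w, data):
    """Mean squared error of the model y = w * x on (x, y) pairs."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def train_with_validation(train, val, epochs=50, lr=0.01):
    """Run gradient descent on the training set; measure validation loss each epoch."""
    w = 0.0
    history = []  # one (train_loss, val_loss) pair per epoch
    for _ in range(epochs):
        # Gradient of the MSE with respect to w, computed on training data only.
        grad = sum(2 * (w * x - y) * x for x, y in train) / len(train)
        w -= lr * grad
        # The validation set is used for measurement only, never for the update.
        history.append((mse(w, train), mse(w, val)))
    return w, history


# Hypothetical toy data following y = 2x.
train_set = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
val_set = [(1.5, 3.0), (2.5, 5.0)]

w, history = train_with_validation(train_set, val_set)

# Epoch with the lowest validation loss; a common place to keep a checkpoint.
best_epoch = min(range(len(history)), key=lambda i: history[i][1])
print(f"learned w = {w:.4f}, best epoch = {best_epoch}")
```

If the validation loss started rising while the training loss kept falling, that would be the signal of overfitting described above.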

💡 Pro tip: Check out An Introductory Guide to Quality Training Data for Machine Learning.

The Test Set

The test set is a separate set of data used to test the model after completing the training.

It provides an unbiased final model performance metric in terms of accuracy, precision, etc. To put it simply, it answers the question of "How well does the model perform?"
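As a minimal sketch of what such final metrics look like, the functions below compute accuracy and precision from test labels and model predictions. The label and prediction lists are hypothetical values invented for the example.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision(y_true, y_pred, positive=1):
    """Fraction of predicted positives that are actually positive."""
    true_of_predicted_pos = [t for t, p in zip(y_true, y_pred) if p == positive]
    if not true_of_predicted_pos:
        return 0.0  # no positive predictions were made
    return sum(t == positive for t in true_of_predicted_pos) / len(true_of_predicted_pos)


# Hypothetical test-set labels and the model's predictions for them.
test_labels = [1, 0, 1, 1, 0, 0, 1, 0]
test_predictions = [1, 0, 0, 1, 0, 1, 1, 0]

print("accuracy:", accuracy(test_labels, test_predictions))
print("precision:", precision(test_labels, test_predictions))
```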

Training, test, and validation data

How to split your Machine Learning data?

The creation of different samples and splits in the dataset helps us judge the true model performance. 

The dataset split ratio depends on the number of samples present in the dataset and the model.

Some common rules of thumb for the dataset split include:

  • If there are several hyperparameters to tune, the machine learning model requires a larger validation set to optimize the model performance. Conversely, if the model has few or no hyperparameters, a small validation set is enough.
  • If the use case is one where a false prediction is very costly (such as a wrong cancer prediction), it’s better to validate the model after each epoch so that problems are caught early and the model is checked against varied scenarios.
  • As the dimensionality (number of features) of the data grows, the neural network usually needs more parameters and hyperparameters, which makes the model more complex. In these scenarios, a large share of the data should be kept in the training set, alongside a validation set.

💡 Pro tip: See the list of 65+ Best Free Datasets for Machine Learning to find quality data.

The truth is—

There is no optimal split percentage.

One has to come to a split percentage that suits the requirements and meets the model’s needs. 

However, there are two major concerns while deciding on the optimum split:

  • With less training data, the model's learned parameters will have higher variance: results change noticeably depending on which samples happen to land in the training set.
  • With less testing data/validation data, your model evaluation/model performance statistic will have greater variance.
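The second concern can be made roughly quantitative. Assuming the evaluation samples are independent, an accuracy measured on n samples has a standard error of about sqrt(p(1 - p) / n), the usual binomial approximation. The sketch below applies that formula; the function name `accuracy_standard_error` and the example accuracy of 0.9 are assumptions for illustration.

```python
import math


def accuracy_standard_error(acc, n_samples):
    """Binomial approximation of the standard error of a measured accuracy."""
    return math.sqrt(acc * (1.0 - acc) / n_samples)


# The same measured accuracy is much less certain on a small evaluation set.
for n in (100, 1_000, 10_000):
    se = accuracy_standard_error(0.9, n)
    print(f"n={n:>6}: accuracy 0.90 +/- {1.96 * se:.3f} (approx. 95% interval)")
```

Under this approximation, shrinking the evaluation set by a factor of 100 widens the uncertainty by a factor of 10.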

Essentially, you need to come up with an optimum split that suits the need of the dataset/model.

That said, a rough standard split you will often encounter is 80% for training, 10% for validation, and 10% for testing.

Machine Learning data split
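A split like this can be sketched in a few lines of plain Python: shuffle once, then slice. The function name `split_dataset`, the default fractions, and the fixed seed are assumptions for this example; the right fractions depend on your dataset and model, as discussed above.

```python
import random


def split_dataset(samples, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle the samples once, then slice them into train, validation, and test."""
    shuffled = list(samples)  # copy so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible

    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test


data = list(range(1000))  # stand-in for real samples
train, val, test = split_dataset(data)
print(len(train), len(val), len(test))
```

Because the three slices come from a single shuffled list, no sample can appear in more than one split.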

3 common pitfalls in the training data split

Finally, let's briefly discuss common mistakes that data scientists make when building their models.

Low-quality training data

The quality of the training data is crucial to model performance.

If the training data is “garbage,” one cannot expect the model to perform well.

Moreover, since the machine learning algorithms are sensitive to the training data, even small variations/errors in the training set can lead to significant errors in the model performance. 

💡 Pro tip: Check out A Simple Guide to Data Preprocessing in Machine Learning and Data Cleaning Checklist to learn more.

Overfitting

Overfitting happens when the machine learning model memorizes the pattern in the training data to such an extent that it fails to classify unseen data.

The noise or fluctuations in the training data are treated as features and learned by the model. As a result, the model performs very well on the training set but poorly on the validation and test sets.
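An extreme case makes the effect easy to see. In the sketch below, a hypothetical `MemorizingModel` stores every training example verbatim, and the labels are pure random noise, so there is nothing real to learn. Both the class and the data are made up for this illustration.

```python
import random
from collections import Counter


class MemorizingModel:
    """Stores every training example; falls back to the majority label otherwise."""

    def fit(self, xs, ys):
        self.table = dict(zip(xs, ys))
        self.default = Counter(ys).most_common(1)[0][0]
        return self

    def predict(self, xs):
        return [self.table.get(x, self.default) for x in xs]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


rng = random.Random(0)
xs = list(range(400))                 # sample ids used as the only "feature"
ys = [rng.randint(0, 1) for _ in xs]  # labels are pure noise

train_x, train_y = xs[:200], ys[:200]
val_x, val_y = xs[200:], ys[200:]

model = MemorizingModel().fit(train_x, train_y)
train_acc = accuracy(train_y, model.predict(train_x))
val_acc = accuracy(val_y, model.predict(val_x))
print(f"train accuracy: {train_acc:.2f}, validation accuracy: {val_acc:.2f}")
```

The training accuracy is perfect because every training sample is looked up directly, while the validation accuracy stays near chance. Evaluating only on the training set would hide this completely.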

Overemphasis on Validation and Test Set metrics

The validation set metric is the one that steers the training of the model.

After each epoch, the machine learning model is evaluated on the validation set. Based on the validation metrics, hyperparameters are adjusted and decisions such as when to stop training are made. Because these decisions are tuned to the validation set, its metrics become optimistic over time, which is why the test set should be kept for a single final evaluation.

Metrics should be chosen so that improving them reflects a real improvement in the model for the use case at hand.
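One way to keep the validation and test metrics in their separate roles is sketched below: a single hyperparameter (a decision threshold) is chosen using the validation set only, and the test set is touched once at the very end. The scores, labels, and candidate thresholds are hypothetical values for illustration.

```python
def accuracy(threshold, data):
    """Accuracy of the rule 'predict 1 if score >= threshold' on (score, label) pairs."""
    return sum((score >= threshold) == bool(label) for score, label in data) / len(data)


# Hypothetical (model score, true label) pairs.
val_set = [(0.1, 0), (0.3, 0), (0.45, 0), (0.55, 1), (0.7, 1), (0.9, 1)]
test_set = [(0.2, 0), (0.4, 0), (0.6, 1), (0.8, 1)]

candidate_thresholds = [0.2, 0.5, 0.8]

# Hyperparameter selection looks at the validation set only.
best_threshold = max(candidate_thresholds, key=lambda t: accuracy(t, val_set))

# The test set is evaluated once, after every choice has been made.
final_test_accuracy = accuracy(best_threshold, test_set)
print(f"chosen threshold: {best_threshold}, test accuracy: {final_test_accuracy:.2f}")
```

If the threshold were instead picked by looking at the test set, the reported test accuracy would no longer be an unbiased estimate.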

Train, Validation, and Test Set: Key takeaways

Finally, here's a recap of everything we've learned:

  • Training data is the set of data on which the actual training takes place. The validation split helps improve model performance by guiding tuning after each epoch. The test set informs us about the final accuracy of the model after completing the training phase.
  • The training set should not be too small; otherwise, the model will not have enough data to learn. On the other hand, if the validation set is too small, then evaluation metrics like accuracy, precision, recall, and F1 score will have large variance and will not lead to proper tuning of the model.
  • In general, putting 80% of the data in the training set, 10% in the validation set, and 10% in the test set is a good split to start with.
  • The optimum split of the test, validation, and train set depends upon factors such as the use case, the structure of the model, dimension of the data, etc.

Pragati Baheti
Microsoft

Pragati is a software developer at Microsoft, and a deep learning enthusiast. She writes about the fundamental mathematics behind deep neural networks.
