Here's the first rule of machine learning—
Don't use the same dataset for model training and model evaluation.
If you do, your results will be biased, and you'll end up with an inflated impression of your model's accuracy.
It's a trap!
You are about to learn how to avoid it (and build models that work like magic). This article will give you a brief explanation of why splitting your machine learning data matters and the best ways to approach it.
Here's what we'll cover:
- Train, validation, and test sets: what they are and what they're for
- How to choose the dataset split ratio
- Common mistakes during the train-test split
Now, let's dive in!
To train and evaluate our model properly, we should break our data down into three distinct splits: the training set, the validation set, and the test set.
The training set is the data used to train the model and let it learn the hidden features and patterns in the data.
The same training data is fed to the neural network repeatedly, epoch after epoch, and the model gradually learns the features of the data.
The training set should contain a diverse set of inputs so that the model is trained across all realistic scenarios and can generalize to unseen samples it may encounter in the future.
The validation set is a set of data, separate from the training set, that is used to validate our model performance during training.
This validation process gives information that helps us tune the model’s hyperparameters and configurations accordingly. It is like a critic telling us whether the training is moving in the right direction or not.
While the model is being fitted to the training set, its performance is evaluated on the validation set after every epoch.
The main idea of holding out a validation set is to prevent our model from overfitting, i.e., becoming so good at classifying the samples in the training set that it cannot generalize and make accurate classifications on data it has not seen before.
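To make this concrete, here's a minimal sketch of per-epoch validation monitoring. It assumes scikit-learn's MLPClassifier on a synthetic dataset; the model, sizes, and epoch count are illustrative, not prescriptive.

```python
# A minimal sketch: train a small neural network one epoch at a time and
# score it on a held-out validation set after each pass.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = MLPClassifier(hidden_layer_sizes=(32,), random_state=0)
classes = np.unique(y_train)

for epoch in range(20):
    # partial_fit performs a single pass over the data, i.e., one epoch
    model.partial_fit(X_train, y_train, classes=classes)
    val_acc = model.score(X_val, y_val)  # evaluate on data the model never trains on
    print(f"epoch {epoch + 1:2d}: validation accuracy = {val_acc:.3f}")
```

If the validation accuracy stops improving while training continues, that's your cue that the model has started to overfit.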
The test set is a separate set of data used to test the model after completing the training.
It provides an unbiased final model performance metric in terms of accuracy, precision, etc. To put it simply, it answers the question of "How well does the model perform?"
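Here's a minimal sketch of that final, one-time evaluation, using scikit-learn; the logistic regression model and the 90/10 split are just placeholders for whatever you actually trained.

```python
# A minimal sketch: fit on the training portion, then touch the test set
# exactly once to report final metrics.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

print("test accuracy :", accuracy_score(y_test, y_pred))
print("test precision:", precision_score(y_test, y_pred))
```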
The creation of different samples and splits in the dataset helps us judge the true model performance.
The dataset split ratio depends on the number of samples present in the dataset and the model.
Some common rules of thumb for the split:
- A model with few hyperparameters is easy to validate and tune, so a relatively small validation set can suffice.
- A model with many hyperparameters needs a larger validation set to tune them reliably.
- If the dataset itself is small, cross-validation may serve you better than a single fixed split.
The truth is—
There is no optimal split percentage.
One has to come to a split percentage that suits the requirements and meets the model’s needs.
However, there are two major concerns when deciding on the optimum split:
- If there is too little training data, the model's parameter estimates will have greater variance, and it may fail to learn the underlying patterns.
- If there is too little validation or test data, the performance statistics will have greater variance, and the evaluation becomes unreliable.
Essentially, you need to strike a balance between these two concerns that suits your dataset and model.
But here's the rough standard you will most often encounter: about 80% of the data for training, 10% for validation, and 10% for testing (70/15/15 is another common choice).
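As a sketch of how you might implement such a split, here's one common pattern using scikit-learn's train_test_split; since it only splits two ways, it's applied twice (the exact ratios are, again, just the common default).

```python
# A minimal sketch of an 80/10/10 train/validation/test split.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# First carve off the 10% test set...
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.10, random_state=0
)
# ...then take 1/9 of the remaining 90% (i.e., 10% of the original) as validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=1 / 9, random_state=0
)

print(len(X_train), len(X_val), len(X_test))  # 800 100 100
```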
Finally, let's briefly discuss common mistakes that data scientists make when building their models.
The quality of the training data is crucial to model performance.
If the training data is "garbage," one cannot expect the model to perform well.
Moreover, since machine learning algorithms are sensitive to their training data, even small variations or errors in the training set can lead to significant errors in model performance.
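A few cheap sanity checks go a long way here. The sketch below assumes a pandas DataFrame loaded from a hypothetical train.csv with a hypothetical "label" column; adapt the names to your data.

```python
# A minimal sketch of quick training-data quality checks.
import pandas as pd

df = pd.read_csv("train.csv")  # hypothetical file and column names

print(df.isna().sum())                            # missing values per column
print(df.duplicated().sum(), "duplicate rows")    # exact duplicates
print(df["label"].value_counts(normalize=True))   # class balance
```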
Overfitting happens when the machine learning model memorizes the patterns in the training data to such an extent that it fails to classify unseen data correctly.
The noise and fluctuations in the training data are learned as features by the model. As a result, the model performs extremely well on the training set but poorly on the validation and test sets.
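The telltale symptom is a large gap between training and validation accuracy. Here's a minimal sketch that provokes it deliberately: an unconstrained decision tree on noisy data (the model and noise level are chosen purely for illustration).

```python
# A minimal sketch: an unconstrained tree memorizes label noise, so it
# scores near-perfectly on training data but much worse on validation data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, flip_y=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train accuracy:", tree.score(X_train, y_train))  # typically ~1.0
print("val accuracy  :", tree.score(X_val, y_val))      # noticeably lower
```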
The validation set metrics are what steer the course of training.
After each epoch, the machine learning model is evaluated on the validation set; based on the validation metrics, we tune the hyperparameters and decide, for instance, when to stop training.
Metrics should be chosen so that they genuinely reflect the behavior we care about; optimizing a poorly chosen metric can push the model's performance in the wrong direction.
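As a quick illustration of why the choice matters, consider an imbalanced problem: accuracy can look excellent while F1 exposes the failure. This toy sketch fakes the predictions rather than training a model.

```python
# A minimal sketch: on 95/5 imbalanced data, always predicting the
# majority class gets 95% accuracy but an F1 score of zero.
from sklearn.metrics import accuracy_score, f1_score

y_true = [0] * 95 + [1] * 5   # 95% negative, 5% positive
y_pred = [0] * 100            # "model" that always predicts the majority class

print("accuracy:", accuracy_score(y_true, y_pred))             # 0.95
print("f1      :", f1_score(y_true, y_pred, zero_division=0))  # 0.0
```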
Finally, here's a recap of everything we've learned:
- Never train and evaluate a model on the same data; split it into training, validation, and test sets.
- The training set teaches the model, the validation set guides tuning during training, and the test set provides a final, unbiased performance estimate.
- There is no universal split ratio; balance the data needed for training against the data needed for reliable evaluation, with roughly 80/10/10 as a common starting point.
- Watch out for poor-quality training data, overfitting, and badly chosen validation metrics.