Out of all the things that can go wrong with your ML model, overfitting is one of the most common and most detrimental errors.
The bad news is that this time, it's not an exaggeration.
Overfitting is a frequent issue, and if your model generalizes poorly to new test data, you know you have a problem.
Worry not! We are here to help you understand the issue of overfitting and find ways to avoid it should you come dangerously close to overfitting your model.
In the next few minutes, you'll learn the following:
“Overfitting refers to the model that models the training data way too well”
Such models fail to generalize and perform poorly on unseen data, defeating the model's purpose.
When can overfitting occur?
High variance in the model's performance is an indicator of an overfitting problem.
The training time of the model or its architectural complexity may cause the model to overfit. If the model trains for too long on the training data or is too complex, it learns the noise or irrelevant information within the dataset.
Here are some of the key definitions that’ll help you navigate through this guide.
Underfitting occurs when the model has high bias, i.e., we are oversimplifying the problem, and as a result, the model performs poorly even on the training data.
Overfitting occurs when the model has high variance, i.e., the model performs well on the training data but does not perform accurately on the evaluation set. The model memorizes the data patterns in the training dataset but fails to generalize to unseen examples.
Overfitting happens when:
Underfitting happens when:
The goal is to find a good fit such that the model picks up the patterns from the training data and does not end up memorizing the finer details.
This, in turn, ensures that the model generalizes and accurately predicts other data samples.
Have a look at this visual comparison to get a better understanding of the differences.
Here’s something you should know—
Detecting overfitting is technically not possible unless we test the model on data it has not seen during training.
One of the leading indicators of an overfit model is its inability to generalize to new datasets. The most obvious way to start detecting overfitting in machine learning models is to segment the dataset. We do this so that we can examine the model's performance on each subset of the data, spot overfitting when it occurs, and see how the training process works.
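To make the idea of segmenting the dataset concrete, here is a minimal sketch using NumPy. The noisy sine-wave data, the 30/10 split, and the polynomial degrees are all assumptions chosen for illustration; they are not taken from a real project.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

# Hypothetical toy data: a noisy sine wave.
x = rng.uniform(0.0, 1.0, size=40)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=x.shape)

# Segment the dataset: hold out 10 of the 40 points for testing.
order = rng.permutation(len(x))
test_idx, train_idx = order[:10], order[10:]


def mse(model, idx):
    """Mean squared error of a fitted polynomial on the given rows."""
    return float(np.mean((model(x[idx]) - y[idx]) ** 2))


# Fit increasingly flexible models and record (train MSE, test MSE).
results = {}
for degree in (1, 3, 9):
    model = Polynomial.fit(x[train_idx], y[train_idx], deg=degree)
    results[degree] = (mse(model, train_idx), mse(model, test_idx))
    print(f"degree={degree}: train MSE={results[degree][0]:.3f}, "
          f"test MSE={results[degree][1]:.3f}")
```

The training error can only go down as the degree grows. A test error that stalls or rises while the training error keeps falling is the gap that signals overfitting.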
K-fold cross-validation is one of the most popular techniques for detecting overfitting.
In k-fold cross-validation, we split the data points into k equally sized subsets called "folds." One fold acts as the testing set, and the remaining folds are used to train the model.
Training on a limited sample lets us estimate how the model is expected to perform in general when it makes predictions on data not used during training. In each round, a different fold acts as the validation set.
After all the iterations, we average the scores to assess the performance of the overall model.
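The procedure above can be sketched in a few lines of NumPy. The helper names `k_fold_indices` and `cross_val_mse`, the toy data, and the choice of a cubic polynomial as the model are assumptions made for this example.

```python
import numpy as np
from numpy.polynomial import Polynomial


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx


def cross_val_mse(x, y, degree, k=5):
    """Train on k-1 folds, score on the held-out fold, and average the scores."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(x), k):
        model = Polynomial.fit(x[train_idx], y[train_idx], deg=degree)
        scores.append(float(np.mean((model(x[val_idx]) - y[val_idx]) ** 2)))
    return float(np.mean(scores)), scores


# Hypothetical toy data: a noisy sine wave.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 50)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=x.shape)

mean_mse, fold_mse = cross_val_mse(x, y, degree=3, k=5)
print(f"per-fold MSE: {[round(s, 3) for s in fold_mse]}")
print(f"average MSE:  {mean_mse:.3f}")
```

A large spread between the per-fold scores, or an average validation error far above the training error, suggests the model is fitting noise.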
Here we will discuss possible options to prevent overfitting, which helps improve the model performance.
As the amount of training data increases, the crucial features to be extracted become more prominent, and the model can better recognize the relationship between the input attributes and the output variable. The only assumption in this method is that the data fed into the model is clean; otherwise, adding more data would worsen the problem of overfitting.
An alternative to training with more data is data augmentation, which is less expensive and safer than the previous method. Data augmentation makes a data sample look slightly different every time the model processes it.
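As a rough illustration for image-like data, the sketch below flips and shifts a 2-D array at random. The `augment` helper and its `max_shift` parameter are hypothetical, and the shift wraps pixels around the edge for simplicity, which a real pipeline might not want.

```python
import numpy as np


def augment(image, rng, max_shift=2):
    """Return a randomly flipped and horizontally shifted copy of a 2-D image."""
    out = image.copy()
    if rng.random() < 0.5:
        out = np.fliplr(out)  # mirror left to right half of the time
    shift = int(rng.integers(-max_shift, max_shift + 1))
    # np.roll wraps pixels around the border (a simplification for this sketch).
    return np.roll(out, shift, axis=1)


rng = np.random.default_rng(0)
image = np.arange(16.0).reshape(4, 4)  # hypothetical 4x4 "image"

# Each call produces a slightly different view of the same sample.
augmented = augment(image, rng)
print(augmented)
```

Calling `augment` once per epoch means the model rarely sees the exact same pixels twice, while the label of the sample stays the same.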
Another option, similar to data augmentation, is adding noise to the input and output data. Adding noise to the input makes the model more stable without affecting data quality and privacy, while adding noise to the output makes the data more diverse. Noise should be added in moderation so that it does not make the data incorrect or too different.
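A minimal sketch of noise addition for a regression-style dataset might look like this. The `add_gaussian_noise` helper, the toy features, and the standard deviation of 0.05 are assumptions; the right noise level depends on the scale of your data.

```python
import numpy as np


def add_gaussian_noise(values, rng, std):
    """Return a copy of `values` with zero-mean Gaussian noise added."""
    return values + rng.normal(0.0, std, size=values.shape)


rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(100, 3))  # hypothetical input features in [0, 1]
y = X.sum(axis=1)                         # hypothetical regression targets

# Keep the noise small relative to the data range so samples stay plausible.
X_noisy = add_gaussian_noise(X, rng, std=0.05)
y_noisy = add_gaussian_noise(y, rng, std=0.05)

print(f"largest input change:  {np.max(np.abs(X_noisy - X)):.3f}")
print(f"largest target change: {np.max(np.abs(y_noisy - y)):.3f}")
```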
Every model has several parameters or features depending upon the number of layers, number of neurons, etc. The model can pick up many redundant features, or features that can be determined from other features, leading to unnecessary complexity. The more complex the model, the higher the chance that it overfits.
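One simple way to spot features that are determinable from other features is to look at pairwise correlations. The sketch below is one possible heuristic, not the only approach; the `select_uncorrelated` helper and the 0.95 threshold are assumptions.

```python
import numpy as np


def select_uncorrelated(X, threshold=0.95):
    """Return column indices, skipping columns highly correlated with a kept one."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, i] < threshold for i in kept):
            kept.append(j)
    return kept


rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
# Hypothetical feature matrix: the third column nearly duplicates the first.
X = np.column_stack([a, b, 2.0 * a + 0.01 * rng.normal(size=200)])

kept = select_uncorrelated(X)
X_reduced = X[:, kept]
print(f"kept columns: {kept}")
```

Correlation only catches linear redundancy between pairs of features, so treat it as a first pass.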
Cross-validation is a robust measure to prevent overfitting. The complete dataset is split into parts. In standard K-fold cross-validation, we need to partition the data into k folds. Then, we iteratively train the algorithm on k-1 folds while using the remaining holdout fold as the test set. This method allows us to tune the hyperparameters of the neural network or machine learning model and test it using completely unseen data.
So far, we have seen that model complexity is one of the top reasons for overfitting. The data simplification method reduces overfitting by decreasing the complexity of the model until it is simple enough that it does not overfit. Some of the procedures include pruning a decision tree, reducing the number of parameters in a neural network, and using dropout on a neural network.
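For decision trees, a quick way to see the effect of a simpler model is to cap the tree depth. The sketch below assumes scikit-learn is installed and uses a synthetic dataset; limiting `max_depth` is used here as a simple stand-in for pruning (scikit-learn also offers cost-complexity pruning through the `ccp_alpha` parameter).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical noisy classification data (flip_y randomly flips 10% of labels).
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# An unrestricted tree versus one limited to three levels.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
small_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

for name, tree in (("unrestricted", full_tree), ("max_depth=3", small_tree)):
    print(f"{name}: nodes={tree.tree_.node_count}, "
          f"train acc={tree.score(X_train, y_train):.2f}, "
          f"test acc={tree.score(X_test, y_test):.2f}")
```

Comparing the two rows shows how much of the unrestricted tree's training accuracy actually carries over to the test set.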
If overfitting occurs when a model is too complex, reducing the number of features makes sense. Regularization methods like Lasso (L1) can be beneficial if we do not know which features to remove from our model. Regularization applies a "penalty" to the parameters with larger coefficients, which subsequently limits the model's variance.
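To show how an L1 penalty pushes unhelpful coefficients to zero, here is a small from-scratch sketch using proximal gradient descent (often called ISTA). It is one of several ways to solve the Lasso problem and not necessarily what a library does internally; the function names, the synthetic data, and `alpha=0.5` are assumptions.

```python
import numpy as np


def soft_threshold(v, t):
    """Shrink every entry of v toward zero by t, clipping at zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def lasso_ista(X, y, alpha, n_iter=500):
    """Minimize (1/2n) * ||Xw - y||^2 + alpha * ||w||_1 by proximal gradient."""
    n, d = X.shape
    step = n / (np.linalg.norm(X, 2) ** 2)  # 1 / Lipschitz constant of the gradient
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n
        w = soft_threshold(w - step * grad, step * alpha)
    return w


# Hypothetical data: only the first two of ten features matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_w = np.array([3.0, -2.0] + [0.0] * 8)
y = X @ true_w + 0.1 * rng.normal(size=200)

w_ols = np.linalg.lstsq(X, y, rcond=None)[0]  # no penalty
w_l1 = lasso_ista(X, y, alpha=0.5)            # L1 penalty

print("least squares:", np.round(w_ols, 3))
print("L1-penalized: ", np.round(w_l1, 3))
```

The penalized coefficients for the eight irrelevant features end up at zero, which is why L1 regularization doubles as a feature selection tool. The price is that the two useful coefficients are shrunk as well.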
Ensembling is a machine learning technique that combines several base models to produce one optimal predictive model. In ensemble learning, the predictions are aggregated to identify the most popular result. Well-known ensemble methods include bagging and boosting, which help prevent overfitting because the ensemble model is built by aggregating multiple models.
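As a short bagging example, the sketch below trains 50 decision trees on bootstrap samples and lets scikit-learn's `BaggingClassifier` aggregate their predictions. It assumes scikit-learn is installed; the synthetic dataset and the number of estimators are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical noisy classification data.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# A single fully grown tree versus an ensemble of trees trained on bootstrap samples.
single_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
bagged_trees = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50, random_state=0
).fit(X_train, y_train)

print(f"single tree test accuracy:  {single_tree.score(X_test, y_test):.2f}")
print(f"bagged trees test accuracy: {bagged_trees.score(X_test, y_test):.2f}")
```

Each tree overfits its own bootstrap sample in a different way, so combining them tends to average out part of that variance.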
Early stopping aims to pause the model's training before it starts memorizing noise and random fluctuations in the data. There is a risk that the model stops training too soon, leading to underfitting. One has to find the optimal amount of time or number of iterations the model should train for.
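A common way to implement this is to watch the validation loss and stop once it has not improved for a set number of epochs. The sketch below does that for a plain linear model trained with gradient descent; the function name, the `patience` value, the learning rate, and the toy data are all assumptions.

```python
import numpy as np


def train_with_early_stopping(X_tr, y_tr, X_val, y_val,
                              lr=0.05, max_epochs=1000, patience=10):
    """Gradient descent on squared error; stop when validation loss stalls."""
    w = np.zeros(X_tr.shape[1])
    best_w, best_val, best_epoch = w.copy(), np.inf, 0
    for epoch in range(max_epochs):
        grad = X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)
        w -= lr * grad
        val_loss = float(np.mean((X_val @ w - y_val) ** 2))
        if val_loss < best_val:
            # New best: remember these weights.
            best_w, best_val, best_epoch = w.copy(), val_loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return best_w, best_val, best_epoch


# Hypothetical small, noisy dataset with many features relative to samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))
true_w = np.zeros(20)
true_w[:3] = 2.0
y = X @ true_w + rng.normal(0.0, 1.0, size=60)
X_tr, y_tr, X_val, y_val = X[:30], y[:30], X[30:], y[30:]

best_w, best_val, best_epoch = train_with_early_stopping(X_tr, y_tr, X_val, y_val)
print(f"best validation MSE {best_val:.3f} at epoch {best_epoch}")
```

A small `patience` risks stopping too soon and underfitting, while a large one lets the model keep fitting noise, so it is worth tuning like any other hyperparameter.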
Large weights in a neural network signify a more complex network. Probabilistically dropping out nodes in the network is a simple and effective method to prevent overfitting. In dropout regularization, some number of layer outputs are randomly ignored or “dropped out” during training to reduce the complexity of the model.
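Here is a minimal NumPy sketch of dropout applied to one layer's outputs. It uses the "inverted dropout" variant, which rescales the surviving activations so their expected value stays the same; the `dropout` function and the rate of 0.5 are assumptions for illustration.

```python
import numpy as np


def dropout(activations, rate, rng, training=True):
    """Randomly zero a fraction `rate` of activations and rescale the rest."""
    if not training or rate == 0.0:
        return activations  # dropout is switched off at inference time
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob


rng = np.random.default_rng(0)
activations = np.ones(1000)  # hypothetical layer outputs

dropped = dropout(activations, rate=0.5, rng=rng)
print(f"fraction zeroed:   {np.mean(dropped == 0.0):.2f}")
print(f"mean after dropout: {dropped.mean():.2f}")
```

Because a different random mask is drawn on every training step, no single node can be relied on too heavily.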
Our tip: If two models have almost equal performance and the only difference is that one is more complex than the other, go with the less complex model. In data science, it's a rule of thumb to start with a less complex model and add complexity over time.
Finally, here’s a short recap of everything we’ve learned today.
We can solve the problem of overfitting by: