Deep Learning is used for solving complex pattern recognition tasks. However—
Such models require a large amount of labeled data (think millions of annotated images) to perform optimally.
Therefore, sometimes we need to rely on pre-trained models for solving supervised learning tasks, i.e., a model already trained on a large dataset is re-used for the task at hand with a fewer data samples.
What’s more, without customized models trained specifically for the task we want to perform we can be certain that our model will eventually underperform.
For example, in a multi-class classification task, the pre-trained model we are using might not provide the optimal performance for all the classes in the dataset.
Similarly, a different pre-trained model might work well on some other classes of the same data. Thus, we need a method that can aggregate the performance of all such models and provide a better solution for all distributions of data.
This is where the concept of “Ensemble Learning” comes into play. And let me tell you—it's a real game changer.
In the next few minutes, we'll help you understand the following:
However, before diving into the topic, you might want to refresh your knowledge and check out these guides:
Ensemble Learning is a method of reaching a consensus in predictions by fusing the salient properties of two or more models. The final ensemble learning framework is more robust than the individual models that constitute the ensemble because ensembling reduces the variance in the prediction errors
Ensemble Learning tries to capture complementary information from its different contributing models—that is, an ensemble framework is successful when the contributing models are statistically diverse.
In easier words, models that display performance variation when evaluated on the same dataset are better suited to form an ensemble.
Different models which make incorrect predictions on different sets of samples from the dataset should be ensembled. If two statistically similar models are ensembled (models that make wrong predictions on the same set of samples), the resulting model will only be as good as the contributing models. An ensemble won’t make any difference to the prediction ability in such a case.
The diversity in the predictions of the contributing models of an ensemble is popularly verified using the Kullback-Leibler and Jensen-Shannon Divergence metrics (this paper is great example demonstrating the point).
Here are some of the scenarios where ensemble learning comes in handy.
As explained in the example at the beginning of this article, there may arise situations where different models perform better on some distributions within the dataset, say, for example, a model may be well adapted to differentiate between cats and dogs, but not so much when distinguishing between dogs and wolves.
On the other hand, a second model can accurately differentiate between dogs and wolves while producing wrong predictions on the “cat” class. An ensemble of these two models might draw a more discriminative decision boundary between all the three classes of the data.
In cases where a substantial amount of data is available, we may divide the classification tasks between different classifiers and ensemble them during prediction time, rather than trying to train one classifier with large volumes of data.
On the other hand, in cases where the dataset available is small (for example, in the biomedical domain, where acquiring labeled medical data is costly), we can use a bootstrapping ensemble strategy.
The way it works is quite simple—
We train different classifiers using various “bootstrap samples” of data, i.e., we create several subsets of a single dataset using replacement. It means that the same data sample may be present in more than one subset, which will be later used to train different models (for further reading, check out this paper).
This method will be further explained in the section on the “Bagging” ensemble technique.
The very core of an ensemble framework is based on the confidence in predictions by the different models. For example, when trying to draw a consensus between four models on a cat/dog classification problem, if two models predict the sample as class “cat” and the other two predict as “dog,” the confidence of the ensemble is low.
Further, researchers also use the confidence scores of the individual classifiers to generate a final confidence score of the ensemble (Examples: Paper-1, Paper-2). Involving the confidence scores for developing the ensemble gives more robust predictions than simple “majority voting” since a prediction with 95% confidence is more reliable than a prediction with 51% confidence.
Therefore, we can assign more importance to classifiers that predict with more confidence during the ensemble.
Sometimes, a problem can have a complex decision boundary, and it might become impossible for a single classifier to generate the appropriate boundary.
For example, if we have a linear classifier and we try to tackle a problem with a parabolic (polynomial) decision boundary. One linear classifier obviously cannot do the job well. However, an ensemble of multiple linear classifiers can generate any polynomial decision boundary.
An example of such a case is shown in the diagram below.
The most prevalent reason for using an ensemble learning model is information fusion for enhancing classification performance. That is, models that have been trained on different distributions of data pertaining to the same set of classes are employed during prediction time to get a more robust decision.
For example, we may have trained one cat/dog classifier on high-quality images taken by a professional photographer. In contrast, another classifier has been trained on data using low-quality photos captured on mobile phones. When predicting a new sample, integrating the decisions from both these classifiers will be more robust and bias-free.
Now, let’s move on to some popular mechanisms for computing an ensemble.
Ensemble learning combines the mapping functions learned by different classifiers to generate an aggregated mapping function.
The diverse methods proposed over the years use different strategies for computing this combination.
Below we describe the most popular methods that are commonly used in the literature.
The Bagging ensemble technique is the acronym for “bootstrap aggregating” and is one of the earliest ensemble methods proposed.
For this method, subsamples from a dataset are created and they are called “bootstrap sampling.” To put it simply, random subsets of a dataset are created using replacement, meaning that the same data point may be present in several subsets.
These subsets are now treated as independent datasets, on which several Machine Learning models will be fit. During test time, the predictions from all such models trained on different subsets of the same data are accounted for.
There is an aggregation mechanism used to compute the final prediction (like averaging, weighted averaging, etc. discussed later).
The image shown above exemplifies the Bagging ensemble mechanism.
Note that, in the bagging mechanism, a parallel stream of processing occurs. The main aim of the bagging method is to reduce variance in the ensemble predictions.
Thus, the chosen ensemble classifiers usually have high variance and low bias (complex models with many trainable parameters). Popular ensemble methods based on this approach include:
The boosting ensemble mechanism works in a way markedly different from the bagging mechanism.
Here, instead of parallel processing of data, sequential processing of the dataset occurs. The first classifier is fed with the entire dataset, and the predictions are analyzed.
The instances where Classifier-1 fails to produce correct predictions (that are samples near the decision boundary of the feature space) are fed to the second classifier.
This is done so that Classifier-2 can specifically focus on the problematic areas of feature space and learn an appropriate decision boundary. Similarly, further steps of the same idea are employed, and then the ensemble of all these previous classifiers is computed to make the final prediction on the test data.
The pictorial representation of the same is shown below.
The main aim of the boosting method is to reduce bias in the ensemble decision. Thus, the classifiers are chosen for the ensemble usually need to have low variance and high bias, i.e., simpler models with less trainable parameters.
Other algorithms based on this approach include:
The stacking ensemble method also involves creating bootstrapped data subsets, like the bagging ensemble mechanism for training multiple models.
However, here, the outputs of all such models are used as an input to another classifier, called meta-classifier, which finally predicts the samples. The intuition behind using two layers of classifiers is to determine whether the training data have been appropriately learned.
For example, in the example of the cat/dog/wolf classifier at the beginning of this article, if, say, Classifier-1 can distinguish between cats and dogs, but not between dogs and wolves, the meta-classifier present in the second layer will be able to capture this behavior from classifier-1. The meta classifier can then correct this behavior before making the final prediction.
A pictorial representation of the stacking mechanism is shown below.
The diagram above shows one level of stacking. There are also multi-level stacking ensemble methods where additional layers of classifiers are added in between.
However, such practices become computationally very expensive for a relatively small boost in performance.
The “Mixture of Experts” genre of ensemble trains several classifiers, the outputs of which are ensemble using a generalized linear rule.
The weights assigned to these combinations are further determined by a “Gating Network,” also a trainable model and usually a neural network.
Such an ensemble technique is usually used when different classifiers are trained on other parts of the feature space. Following the previous example of the cat/dog/wolf classification problem, suppose one classifier is trained only on cats/dogs data, and another is trained on dogs/wolves data.
Such a method is also successful on the “Information Fusion” problem described before.
Majority voting is one of the earliest and easiest ensemble schemes in the literature. In this method, an odd number of contributing classifiers are chosen, and for each sample, the predictions from the classifiers are computed. Then, as the name suggests, the class that gets most of the class from the classifier pool is deemed the ensemble’s predicted class.
Such a method works well for binary classification problems, where there are only two candidates for which the classifiers can vote. However, it fails for a problem with many classes since many cases arise, where no class gets a clear majority of the votes.
In such cases, we usually choose a random class among the top candidates, which leads to a more considerable margin of error. Thus, methods based on the confidence scores are more reliable and are used more widely now.
The “Max Rule” ensemble method relies on the probability distributions generated by each classifier. This method employs the concept of “confidence in prediction” of the classifiers and thus is a superior method to Majority Voting for multi-class classification challenges.
Here, for a predicted class by a classifier, the corresponding confidence score is checked. The class prediction of the classifier that predicts with the highest confidence score is deemed the prediction of the ensemble framework.
In this ensemble technique, the probability scores for multiple models are first computed. Then, the scores are averaged over all the models for all the classes in the dataset.
Probability scores are the confidence in predictions by a particular model. So, here we are pooling the confidences of several models to generate a final probability score for the ensemble. The class that has the highest probability after the averaging operation is assigned as the predicted class.
Such a method has been used in this paper for COVID-19 detection from lung CT-scan images.
In the weighted probability averaging technique, similar to the previous method, the probability or confidence scores are extracted from the different contributing models.
But here, unlike the other case, we calculate a weighted average of the probability. The weights in this approach refer to the importance of each classifier, i.e., a classifier whose overall performance on the dataset is better than another classifier is given more importance while computing the ensemble, which leads to a better predictive ability of the ensemble framework.
In the deep learning literature, the classification accuracy of the models is usually assigned as the weights to the classifiers while computing the ensemble.
More recent methods, for example described in this paper, focus on leveraging weights more intelligently. In the paper, the authors argue that for class-imbalanced datasets, i.e., for datasets where each class contains a different quantity of training data, assigning the classification accuracy as the weights to the ensemble can aggravate the performance.
Instead, they proposed a new weighting scheme, wherein metrics such as precision, recall, F1-score, and AUC (area under receiver operating characteristics curve) are used. They proposed a fusion function that inputs these four evaluation metrics to generate an appropriate weight (or importance) to the input classifiers. They applied their ensemble method on the pneumonia detection problem using lung X-Ray images and showed that their method performs superior to weighted averaging using only classification accuracy.
The flowchart diagram of their method and their results upon comparison with different ensemble methods are shown below. The relevant codes for their method are also available on GitHub.
The ensemble methods described above have been around for decades. However, with the advancements in research, much more powerful ensemble techniques have been developed for different use cases.
For example, Fuzzy Ensembles are a class of ensemble techniques that use the concept of “dynamic importance.”
The “weights” given to the classifiers are not fixed; they are modified based on the contributing models’ confidence scores for every sample, rather than checking the performance on the entire dataset. They perform much better than the popularly used weighted average probability methods. The codes for the papers are also available here: Paper-1, Paper-2, and Paper-3.
Another genre of ensemble technique that has recently gained popularity is called “Snapshot Ensembling.”
As we can see from the discussion throughout this article, ensemble learning comes at the expense of training multiple models.
Especially in deep learning, it is a costly operation, even with transfer learning. So, this ensemble learning method proposed in this paper trains only one deep learning model and saves the model snapshots at different training epochs.
The ensemble of these models generates a final ensemble prediction framework on the test data.
They proposed some modifications to the usual deep learning model training regime to ensure the diversity in the model snapshots. The model weights saved at these different epochs need to be significantly different to make the ensemble successful. You can find further information on these methods in the reference papers.
Ensemble learning is a fairly common strategy in deep learning and has been applied to tackle a myriad of problems. It has helped tackle complex pattern recognition tasks that require computers to learn high-level semantic information from digital images or videos, like object detection, where bounding boxes need to be formed around the objects of interest and image classification.
Here are some of the real-life applications of Ensemble Learning.
Classification and localization of diseases for simplistic and fast prognosis have been aided by Ensemble learning, like in cardiovascular disease detection from X-Ray and CT scans.
Monitoring of physical characteristics of a target area without coming in physical contact, called Remote Sensing, is a difficult task since the data acquired by different sensors have varying resolutions leading to incoherence in data distribution.
Tasks like Landslide Detection and Scene Classification have also been accomplished with the help of Ensemble Learning.
Detection of digital fraud is an important and challenging task since very minute precision is required to automate the process. Ensemble Learning has proved its efficacy in detecting Credit Card Fraud and Impression Fraud.
Ensemble Learning is also applied in speech emotion recognition, especially in the case of multi-lingual environments. The technique allows for the combining of all classifiers’ effect instead of choosing one classifier and compromising certain language corpus’s accuracy.
Ensemble Learning is a standard machine learning technique that involves taking the opinions of multiple experts (classifiers) to make predictions.
The need for ensemble learning arises in several problematic situations that can be both data-centric and algorithm-centric, like a scarcity/excess of data, the complexity of the problem, constraint in computational resources, etc.
The several methods evolved over the decades have proven their utility in tackling many such issues. Still, newer ensemble approaches are being developed by researchers that address the caveats of the traditional ensembles.