However, supervised learning requires a large amount of carefully labeled data, and the data labeling process is often long, expensive, and error-prone.
That is—unless you have an auto-annotation tool, like V7, at your disposal ;-)
Furthermore, models trained using supervised learning perform well on the data they were trained on but cannot acquire the “skill” of generalizing to new distributions of unlabeled data, which makes supervised learning a bottleneck for further advances in Deep Learning.
Unsupervised Learning is another Machine Learning paradigm that tries to make sense of unlabeled data through various techniques.
Self-Supervised Learning (SSL) is one such methodology that can learn complex patterns from unlabeled data. SSL allows AI systems to work more efficiently when deployed due to its ability to train itself, thus requiring less training time.
In the next few minutes, you’ll learn everything you need to know about Self-Supervised Learning and how this approach changes the way we build and think about AI. We’ll also highlight some of the most exciting directions and areas that SSL is already transforming.
Here’s what we’ll cover:
And in case you landed on this page looking for quality data to train a computer vision model—we’ve got you covered!
Now, let’s dive in!
Self-Supervised Learning (SSL) is a Machine Learning paradigm where a model, when fed with unstructured data as input, generates data labels automatically, which are further used in subsequent iterations as ground truths.
The fundamental idea for self-supervised learning is to generate supervisory signals by making sense of the unlabeled data provided to it in an unsupervised fashion on the first iteration.
Then, the model uses the high confidence data labels among those generated to train the model in the next iterations like any other supervised learning model via backpropagation. The only difference is, the data labels used as ground truths in every iteration are changed.
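The label-selection step described above can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: the function name `select_confident_labels`, the 0.9 threshold, and the example probabilities are all assumptions made for the sketch.

```python
def select_confident_labels(probabilities, threshold=0.9):
    """Return (sample_index, predicted_label) pairs for confident predictions.

    `probabilities` holds one list of class probabilities per unlabeled sample.
    A sample is kept only if its highest class probability meets the threshold.
    """
    selected = []
    for index, probs in enumerate(probabilities):
        label = max(range(len(probs)), key=probs.__getitem__)
        if probs[label] >= threshold:
            selected.append((index, label))
    return selected


# Hypothetical model outputs for four unlabeled samples over three classes.
predictions = [
    [0.95, 0.03, 0.02],
    [0.40, 0.35, 0.25],
    [0.05, 0.92, 0.03],
    [0.10, 0.10, 0.80],
]
pseudo_labels = select_confident_labels(predictions, threshold=0.9)
```

Only the selected pairs would be used as ground truths in the next training iteration; the remaining samples stay unlabeled until the model becomes confident about them.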
Supervised Learning entails training a model with data that have high-quality manual labels associated with them to tune the model weights accordingly.
Self-Supervised Learning also entails training a model with data and their labels, but the labels here are generated by the model itself and are not available at the very start.
Unsupervised Learning works on datasets with no available labels, and such a learning paradigm tries to make sense of the data provided without using labels at any stage of its training.
Thus, from this discussion, we can infer that SSL is a subset of Unsupervised Learning since both are provided only with unstructured data. However, Unsupervised Learning works towards clustering, grouping, and dimensionality reduction, whereas SSL performs conclusive tasks like classification, segmentation, and regression like any supervised model.
Although supervised learning is widely successful in vast application domains, there are several problems associated with it.
Supervised learning relies heavily on large volumes of high-quality labeled data, which is costly and time-consuming to acquire. This is a huge limitation in domains like medical imaging, where only expert medical professionals can manually annotate the data.
Furthermore, supervised learning models work optimally when each category of data has a more or less equal number of samples. Class imbalance adversely affects the model performance. And yet, acquiring enough data for rare classes is difficult—for example, data for a newly identified wild species of birds.
SSL eliminates the need for manual data labeling.
The concept of SSL got popularized in the context of Natural Language Processing (NLP) when it was applied to transformer models like BERT, for tasks like text prediction, determination of text topic, etc.
Here are some of the benefits of Self-Supervised Learning.
As discussed above, the success of supervised learning depends heavily on the quantity of high-quality data labels. Further, novel classes outside those that the supervised model is trained for cannot be accommodated at testing time. SSL on the other hand works with unstructured data and can train on massive amounts of it.
Supervised Learning requires human-annotated labels to train models. Here, the computer tries to learn how humans think through their already labeled examples. But, as we discussed—labeling such large amounts of data is not always feasible.
Reinforcement Learning is another way to go, where a model is rewarded or penalized for its predictions and its weights are tuned accordingly. However, this too is infeasible in a number of practical scenarios.
SSL explores a machine’s capability of thinking independently—like humans—by automatically generating labels without any humans in the AI loop. The model itself needs to decide whether the labels generated are reliable or not, and accordingly use them in the next iteration to tune its weights.
SSL was first used in the context of NLP.
Since then, it has been extended to solve a variety of Computer Vision tasks like image classification, video frame prediction, etc. Active research is going on in the field of SSL to enhance its capabilities further to make it as accurate as supervised learning models.
Here are some of the limitations of Self-Supervised Learning.
In SSL, the model needs to make sense of the provided unlabeled data, and also generate the corresponding labels, which burdens the model more than those trained for supervised learning tasks. Models can be trained much faster when examples with their ground truths are provided.
For example, in contrastive learning-type SSL (which we will explain soon), for each anchor-positive pair (for example, two crops of the same image), several anchor-negative pairs (the anchor crop paired with crops of several different images) need to be sampled in every iteration, making the training process much slower.
SSL models generate their own labels for the dataset, and we do not have any external support that can aid the model in determining whether its computations are correct. Thus, SSL models cannot be expected to be as accurate as traditional supervised learning models.
In SSL, if the model predicts a wrong class with a very high confidence score, the model will keep believing that the prediction is correct and won’t tune the weights against this prediction.
In this section we will explore the various genres of the SSL framework that are popularly used.
Energy-based models (EBMs) try to compute the compatibility between two given inputs using a mathematical function. When given two inputs, if an EBM produces a low energy output, it means that the inputs have high compatibility. A high energy output indicates high incompatibility.
For example, two augmented versions of the same image, say of a dog, when given as input to an EBM should produce a low energy output, while an image of a dog and an image of a cat given as input should produce a high energy output.
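As a toy illustration of this idea, the sketch below defines an energy as one minus the cosine similarity of two feature vectors. This choice of energy function and the three hand-written feature vectors are assumptions made for the example; real EBMs learn the function from data.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def energy(u, v):
    """Toy energy: near 0 for compatible inputs, larger for incompatible ones."""
    return 1.0 - cosine_similarity(u, v)


# Hypothetical feature vectors for two augmented dog views and one cat image.
dog_view_1 = [0.9, 0.1, 0.0]
dog_view_2 = [0.8, 0.2, 0.1]
cat_view = [0.1, 0.9, 0.3]

low_energy = energy(dog_view_1, dog_view_2)   # compatible pair
high_energy = energy(dog_view_1, cat_view)    # incompatible pair
```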
A joint embedding architecture is a two-branch network in which the branches are identical in construction. One input is fed to each branch, and each branch computes a separate embedding vector. A module at the head of the network takes the two embedding vectors as inputs and calculates the distance between them in the latent space.
Thus, when the two inputs are similar to each other (two augmented versions of a dog image), the distance calculated should be small. The network parameters can be easily tuned to ensure that the inputs in the latent space are close to each other.
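A rough sketch of such a two-branch setup follows. The fixed 2x3 weight matrix stands in for a trained encoder and is purely hypothetical, as are the input vectors; the point is only that both branches share the same encoder and a head compares the two embeddings.

```python
import math

# Hypothetical shared weights: both branches apply the same linear encoder.
SHARED_WEIGHTS = [
    [0.5, -0.2, 0.1],
    [0.3, 0.8, -0.5],
]


def encode(x, weights=SHARED_WEIGHTS):
    """One branch: project the input into a 2-dimensional latent space."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def embedding_distance(x1, x2):
    """Head module: Euclidean distance between the two branch embeddings."""
    return math.dist(encode(x1), encode(x2))


# Hypothetical inputs: two augmented dog views and one cat image.
dog_a = [1.0, 0.0, 0.2]
dog_b = [0.9, 0.1, 0.2]
cat = [0.0, 1.0, 0.8]

similar_distance = embedding_distance(dog_a, dog_b)
dissimilar_distance = embedding_distance(dog_a, cat)
```

During training, the shared weights would be adjusted so that distances such as `similar_distance` shrink.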
In Contrastive Learning-type SSL, we train a model by contrasting an input (like a text, an image, a video segment), called “anchor”, with positive and negative examples. A positive sample refers to one which belongs to the same distribution as the anchor, while the negative sample has a distribution different than that of the anchor.
Let us understand this with the help of an example.
Suppose we have a deep model that we want to train for classifying images. Given an input x, the model produces an output in the feature space. Further, suppose we have the anchor xa, which is part of the image of a dog, and its corresponding model output.
Now, the positive sample corresponding to xa, is a cropped out part of the same image of the dog, denoted by x+, while the negative sample is a cropped out part of another image (suppose of a cat), denoted by x-. In contrastive learning, the aim is to minimize the distance between xa and x+ in the feature space, and at the same time, to maximize the distance between xa and x-.
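One common way to express this objective is a triplet margin loss, sketched below. The article does not name a specific loss at this point, so the triplet formulation, the margin value, and the example feature vectors are assumptions chosen for illustration.

```python
import math


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Penalize the model unless the negative is at least `margin` farther
    from the anchor than the positive is (all inputs are feature vectors)."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical features: the positive is close to the anchor, the negative is far.
well_separated = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
# Swapping the roles puts the "positive" far away, so the loss becomes large.
badly_separated = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```

Minimizing this quantity pulls the anchor and positive together while pushing the negative away, which matches the goal stated above.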
The idea for contrastive predictive coding was first presented in this paper.
The intuition here is to learn the representations that encode the underlying shared information between different parts of the data while also discarding low-level information and noise which is more local.
For example, given the upper half of an image, a model should predict the lower half of the image. In the image shown above, x is a time-series signal for which data is available up to time t, and the model needs to predict the signal up to time t+4. Here, g_enc is an embedding network that extracts features z_t from the signal x_t, and g_ar is an autoregressive model that summarizes all z_≤t in the embedding space to produce a context latent representation c_t = g_ar(z_≤t). This representation is used to model a density ratio that preserves the mutual information between the predicted signal and the aggregated context c_t.
This idea is extendable to image, video and text data as well. Thus, in CPC, we combine prediction of future observations (Predictive Coding) with a probabilistic contrastive loss (expression shown below), giving this method the name.
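For reference, the contrastive loss introduced with CPC (often called InfoNCE) is commonly written as follows. Here X is a set of N samples containing one positive sample x_{t+k} and N-1 negatives, and W_k is a learned linear map for prediction step k; the notation follows the symbols used above.

```latex
f_k(x_{t+k}, c_t) = \exp\left(z_{t+k}^{\top} W_k \, c_t\right),
\qquad
\mathcal{L}_N = -\,\mathbb{E}_{X}\left[
  \log \frac{f_k(x_{t+k}, c_t)}{\sum_{x_j \in X} f_k(x_j, c_t)}
\right]
```

Minimizing this loss amounts to correctly picking the positive sample out of the N candidates given the context c_t.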
This class of methods applies the general idea of contrastive learning to entire instances of data (like a whole image).
For example, two rotated or flipped versions of the same dog image can serve as the anchor-positive pair, while a rotated/flipped version of a cat image can serve as a negative sample. Now, similar to the basic principle, the distance between the anchor-positive pair is to be minimized, while that between the anchor-negative pair needs to be maximized.
The main idea behind this technique is that an input which has undergone some basic data transformations should still belong to the same category, i.e., a deep learning model should be invariant to transformations. An image of a dog, when flipped vertically and converted to grayscale, still denotes the class “dog”.
In this class of methods, a random image is taken and random data transformations are applied to it (like flipping, cropping, adding noise, etc.) to create the positive sample. Now, several other images from the dataset are taken as the negative samples, and a loss function is designed similar to CPC to maximize the distance between the anchor-negative sample pairs.
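A loss of this kind can be sketched as a softmax cross-entropy over one positive and several negatives (an InfoNCE-style loss). The version below is a simplified illustration that assumes unit-length embeddings, dot-product similarity, and a temperature of 0.1; none of these choices come from the article.

```python
import math


def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """Negative log-probability of the positive among all candidates."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    # Index 0 holds the positive; the rest are negatives.
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, negative) / temperature for negative in negatives]

    # Numerically stable log-sum-exp.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[0]


# Hypothetical unit-length embeddings.
anchor = [1.0, 0.0]
good_loss = info_nce_loss(anchor, [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
bad_loss = info_nce_loss(anchor, [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
```

The loss is small when the anchor is closest to its positive and grows when a negative sample is closer, which is why every extra negative adds to the per-iteration cost.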
In 2020, a paper proposed the SwAV (Swapping Assignments between multiple Views) model, which is a method for comparing cluster assignments to contrast different image views while not relying on explicit pairwise feature comparisons.
The goal in this method is to learn visual features in an online fashion without supervision. For this the authors propose an online clustering-based self-supervised method. Typical clustering-based methods are offline in the sense that they alternate between a cluster assignment step where image features of the entire dataset are clustered, and a training step where the cluster assignments, i.e., “codes” are predicted for different image views.
Unfortunately, these methods are not suitable for online learning as they require multiple passes over the dataset to compute the image features necessary for clustering. In SwAV, the authors enforce consistency between codes from different augmentations of the same image.
This solution is inspired by contrastive instance learning as the codes are not considered as a target, but are only used to enforce consistent mapping between views of the same image. SwAV can be interpreted as a way of contrasting between multiple image views by comparing their cluster assignments instead of their features. Thus, this method can be scaled to potentially unlimited amounts of data.
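The swapped-prediction idea can be sketched as follows: the code of one view is predicted from the prototype scores of the other view, and vice versa. In this simplified sketch the codes are passed in directly (the SwAV paper computes them with an online optimal-transport step, which is omitted here), and the function names, temperature, and example values are assumptions.

```python
import math


def softmax(scores, temperature=0.1):
    """Turn prototype similarity scores into a probability distribution."""
    scaled = [s / temperature for s in scores]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(code, probs):
    """Cross-entropy between a target code and predicted probabilities."""
    return -sum(q * math.log(p) for q, p in zip(code, probs) if q > 0)


def swapped_prediction_loss(scores_t, scores_s, code_t, code_s, temperature=0.1):
    """Predict the code of view s from view t's scores, and the reverse."""
    p_t = softmax(scores_t, temperature)
    p_s = softmax(scores_s, temperature)
    return cross_entropy(code_s, p_t) + cross_entropy(code_t, p_s)


# Hypothetical scores of two views of one image against three prototypes.
scores_t = [0.9, 0.1, 0.0]
scores_s = [0.8, 0.2, 0.1]
consistent = swapped_prediction_loss(scores_t, scores_s, [1, 0, 0], [1, 0, 0])
inconsistent = swapped_prediction_loss(scores_t, scores_s, [1, 0, 0], [0, 1, 0])
```

Note that no pairwise feature comparison between images appears anywhere: the two views interact only through their cluster assignments.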
Non-Contrastive Self-Supervised Learning (NC-SSL) is a learning paradigm where only positive sample pairs are used to train a model, unlike in Contrastive Learning where both positive and negative pairs are used. This seems counterintuitive, since minimizing only the distances between positive pairs would appear to collapse into a constant solution.
However, NC-SSL has been shown to learn non-trivial representations with only positive pairs, using an extra predictor and a stop-gradient operation. Furthermore, the learned representations show comparable (or even better) performance on downstream tasks.
This brings about two fundamental questions: (1) why the learned representation does not collapse to trivial (i.e., constant) solutions, and (2) without negative pairs, what representation NC-SSL learns from the training and how the learned representation reduces the sample complexity in downstream tasks.
To answer the first question, in NC-SSL, different techniques are proposed to avoid collapsing. BYOL and SimSiam use an extra predictor and stop gradient operation. Beyond these, BatchNorm (including its variants), de-correlation, whitening, centering, and online clustering are all effective ways to enforce implicit contrastive constraints among samples for preventing collapse.
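To make the predictor and stop-gradient combination more concrete, here is a sketch of the symmetric negative cosine loss used by SimSiam-style methods. Plain Python has no automatic differentiation, so the stop-gradient appears only as a comment: in an autograd framework the target vector would be detached from the computation graph at that point. The variable names are illustrative assumptions.

```python
import math


def negative_cosine(prediction, target):
    """Negative cosine similarity between a predictor output and a target.

    The target is meant to be treated as a constant (stop-gradient): no
    gradient would flow back through the branch that produced it.
    """
    dot = sum(p * z for p, z in zip(prediction, target))
    norm_p = math.sqrt(sum(p * p for p in prediction))
    norm_z = math.sqrt(sum(z * z for z in target))
    return -dot / (norm_p * norm_z)


def symmetric_loss(p1, z1, p2, z2):
    """p1, p2: predictor outputs for views 1 and 2; z1, z2: encoder outputs."""
    return 0.5 * negative_cosine(p1, z2) + 0.5 * negative_cosine(p2, z1)


# Perfectly aligned views reach the minimum value of -1.
aligned = symmetric_loss([1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0])
# Orthogonal predictions and targets give a loss of 0.
orthogonal = symmetric_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0])
```

Only positive pairs appear in this loss, so it is the asymmetry between the predictor branch and the gradient-free target branch that is credited with preventing collapse.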
Wang et al. looked for an answer to the second question in this paper, where they proved that a desirable projection matrix can be learned in a linear network setting and that it reduces the sample complexity on downstream tasks. Further, their analysis highlights the crucial role of weight decay in NC-SSL, which discards the features that have high variance under augmentations and keeps the invariant features.
SSL is widely used in NLP and speech recognition. However, let’s also take a look at some of the most promising SSL applications for Computer Vision.
As discussed before, obtaining labeled data in the biomedical domain is extremely difficult, for both privacy reasons and the need for multiple expert doctors to manually annotate the data. This calls for unsupervised methods that can accurately deal with scanty biomedical data.
Contrastive Self-Supervised Learning has been used in unsupervised histopathology image classification in this paper for the detection of cancer. Here, the authors have used the instance discrimination method of SSL where they used augmented copies of a sample image to create positive pairs.
Other applications of SSL in healthcare may be in the segmentation of medical images, for example, the segmentation of organs from an X-Ray image (as depicted in the image above). Such information aids doctors in the diagnosis of several diseases.
Orienting 3D objects is a critical component in the automation of many packing and assembly tasks. SSL has also been employed in this domain, for example in this paper, where the authors used depth information to correctly orient novel 3D objects using a robot.
Verification of signatures can be posed as a self-supervised learning problem, where novel data can be fed for detecting forgery.
Automatic colorization of grayscale images or videos is a useful self-supervised learning task. Here, the task boils down to mapping the given grayscale image/video to a distribution over quantized color value outputs.
The same concept can be extended to image inpainting and context filling, for example predicting missing text or filling a gap in a voice recording.
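The colorization pretext task shows how labels can come from the data itself: the grayscale version of a pixel is the input, and its quantized color is the target. The sketch below quantizes RGB values into 4 bins per channel for simplicity; that quantization scheme and the function names are assumptions made for illustration, and published colorization methods typically quantize a different color space.

```python
def to_grayscale(rgb):
    """Luma of an 8-bit RGB pixel using the common BT.601 weights."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def color_bin(rgb, bins_per_channel=4):
    """Map an 8-bit RGB pixel to a single quantized color class index."""
    step = 256 // bins_per_channel
    r, g, b = (min(channel // step, bins_per_channel - 1) for channel in rgb)
    return (r * bins_per_channel + g) * bins_per_channel + b


def make_colorization_pairs(pixels, bins_per_channel=4):
    """Build (grayscale input, color class target) pairs with no human labels."""
    return [(to_grayscale(p), color_bin(p, bins_per_channel)) for p in pixels]


# Hypothetical pixels: black, pure red, white.
pairs = make_colorization_pairs([(0, 0, 0), (255, 0, 0), (255, 255, 255)])
```

A model trained on such pairs predicts a distribution over the color classes for each grayscale input.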
The prediction of future frames in video sequence data is a very useful SSL application paradigm. It is possible to obtain high accuracy in such tasks since a video is a collection of semantically related frames in sequence. Some logic is always followed in the order of frames, for example, the motion of objects is always smooth, and gravity always acts downwards.
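Training pairs for frame prediction can be built directly from the frame order, with no annotation involved. The following sketch uses a sliding window; the function name, the context length of 3, and the string placeholders for frames are assumptions for the example.

```python
def make_frame_prediction_pairs(frames, context_length=3):
    """Pair each window of consecutive frames with the frame that follows it."""
    pairs = []
    for end in range(context_length, len(frames)):
        context = frames[end - context_length:end]
        target = frames[end]
        pairs.append((context, target))
    return pairs


# Hypothetical video represented by frame identifiers.
video = ["f0", "f1", "f2", "f3", "f4"]
training_pairs = make_frame_prediction_pairs(video, context_length=3)
```

Each pair gives the model a short context and asks it to predict the next frame, so the supervisory signal comes entirely from the video itself.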
The field of robotics has interesting SSL applications. A robot cannot be trained to deal with each and every circumstance in the practical world, and it needs to make some decisions autonomously.
For example, the Mars rover missions rely heavily on unsupervised navigation mechanisms, since the time lag between Earth and Mars makes it infeasible to operate them manually.
Supervised Learning has been widely successful in addressing challenges in Computer Vision. However, its dependency on large amounts of high-quality labeled data makes training such a model a difficult endeavour.
Self-Supervised Learning is a more feasible option now, since we can acquire large amounts of unstructured data with our advanced technology, but human-centered labeling operations are expensive and time-demanding.
SSL annotates the unstructured data given as input, and uses these self-generated data labels as ground truths for future iterations to train the model. This learning paradigm, which originated in NLP applications, has shown promise in Computer Vision tasks like image classification and segmentation, object recognition, etc.
Several genres of SSL now exist, grouped by their working principle (the two most-used being the Contrastive and Non-Contrastive Learning paradigms), each with its own merits and demerits. Active research is still being conducted on SSL methods to enhance their performance and lower their computational requirements.
In the past decade, the field of AI has made significant developments in Machine Learning systems that can tackle a vast range of Computer Vision problems using the paradigm of supervised learning.