The Beginner’s Guide to Contrastive Learning

Here's everything you need to know about contrastive learning and its most prominent applications. Ready to dive in?
May 22, 2022
Contrastive learning framework example

Contrastive Learning is a technique that enhances the performance of vision tasks by using the principle of contrasting samples against each other to learn attributes that are common between data classes and attributes that set apart a data class from another.

This mode of learning, which mimics the way humans learn about the world around them, has shown promising results in the Deep Learning literature, thus gaining importance in the field of Computer Vision research.

You are about to read the most comprehensive article on Contrastive Learning.

Here’s what we’ll cover:

  1. What is Contrastive Learning?
  2. The importance of Contrastive Learning
  3. How does Contrastive Learning work?
  4. Ten Contrastive Learning frameworks
  5. Applications of Contrastive Learning

And in case you landed here looking for a tool to label your machine learning data or train a model, check out:

  1. V7 Open Datasets
  2. V7 Data Annotation
  3. V7 Model Training

Let’s dive in.

What is contrastive learning?

Contrastive Learning is a Machine Learning paradigm where unlabeled data points are juxtaposed against each other to teach a model which points are similar and which are different.

That is, as the name suggests, samples are contrasted against each other: those belonging to the same distribution are pulled towards each other in the embedding space, while those belonging to different distributions are pushed apart.

The importance of contrastive learning

Supervised Learning is the machine learning technique where a model is trained by using a large number of labeled examples. The quality of data labels is of immense importance for the success of supervised models.

💡 Pro Tip: Check out Supervised vs. Unsupervised Learning: What’s the Difference?

However, acquiring such high-quality labeled data is a cumbersome task, especially in domains like biomedical imaging, where expert doctors are required to annotate the data. This is both expensive and time-consuming. By a commonly cited estimate, around 80% of the time spent in a supervised learning ML project goes into acquiring and cleaning the data for model training.

Thus, the recent focus of Deep Learning research has been on reducing the requirement for supervision in model training. To this end, several methodologies have been proposed, like Semi-Supervised Learning, Unsupervised Learning, and Self-Supervised Learning.

In Semi-Supervised Learning, a small amount of labeled data is used along with a large amount of unlabeled data to train a deep model. In Unsupervised Learning, the model tries to make sense of the unstructured data without any data labels.

💡 Pro Tip: Read more on Train-Validation-Test sets.

Self-Supervised Learning (SSL) has a slightly different approach.

Like in unsupervised learning, unstructured data is provided as input to the model. However, the model annotates the data on its own, and labels that have been predicted with high confidence are used as ground truths in future iterations of the model training.

This keeps improving the model weights to make better predictions. The efficacy of SSL methods as compared to traditional supervised methods has captured the attention of several Computer Vision researchers.

💡 Pro Tip: Learn how to label your data faster by reading What is Data Labeling and How to Do It Efficiently [Tutorial].

One of the oldest and most popular techniques employed in SSL is Contrastive Learning, which uses “positive” and “negative” samples to guide the Deep Learning model.

Contrastive Learning has since evolved further and is now also used in fully supervised and semi-supervised settings, where it boosts the performance of existing state-of-the-art methods.

Let us now discuss the working principle of Contrastive Learning.

How does Contrastive Learning work in Vision AI?

Contrastive Learning mimics the way humans learn. For example, we might not know what otters are or what grizzly bears are, but seeing the images (as shown below), we can at least infer which pictures show the same animals.

The basic contrastive learning framework consists of selecting a data sample, called “anchor,” a data point belonging to the same distribution as the anchor, called the “positive” sample, and another data point belonging to a different distribution called the “negative” sample.

The SSL model tries to minimize the distance between the anchor and positive samples, i.e., the samples belonging to the same distribution, in the latent space, and at the same time maximize the distance between the anchor and the negative samples.

As shown in the example above, two images belonging to the same class lie close to each other in the embedding space (“d+”), and those belonging to different classes lie at a greater distance from each other (“d-”). Thus, a contrastive learning model (denoted by “theta” in the example above) tries to minimize the distance “d+” and maximize the distance “d-.”

Several techniques exist to select the positive and negative samples with respect to the anchor, which we will discuss next.

Instance Discrimination Method

In this class of Contrastive Learning, entire images are transformed and used as positive samples for an anchor image. For example, if we select an image of a dog as the anchor, we can mirror the image or convert it to grayscale to use as the positive sample. The negative sample can be any other image in the dataset.

The image below shows the basic framework of the instance discrimination-based contrastive learning technique. The distance function can be anything, from Euclidean distance to cosine distances in the embedding space.

Some image augmentation methods popularly used for Instance Discrimination-based Contrastive Learning are listed as follows:

  1. Colour Jittering: Here, the brightness, contrast, and saturation of an RGB image are changed randomly. This technique helps ensure that a model is not memorizing a given object by the scene's colors. While output image colors can appear odd to human interpretation, such augmentations help a model consider the edges and shape of objects rather than only the colors.

  2. Image Rotation: An image is rotated randomly within 0-90 degrees. Since rotating an image doesn’t change the core information contained in it (i.e., a dog in an image will still be a dog), models are trained to be rotation invariant for robust prediction.

  3. Image Flipping: The image is flipped (mirrored) about its center, either vertically or horizontally. This is an extension of the concept of image rotation-based augmentation.

  4. Image Noising: Random noise is added to the images pixel-wise. This technique allows the model to learn how to separate the signal from the noise in the image and makes it more robust to changes in the image during test time. For example, randomly changing some pixels in the image to white or black is known as salt-and-pepper noise (an example is shown below).

  5. Random Affine: Affine is a geometric transformation that preserves lines and parallelism, but not necessarily the distances and angles.

The image above shows some examples of the image augmentations described in this section.
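
To make the augmentations listed in this section concrete, here is a minimal NumPy sketch of how a positive sample could be built from an anchor image. The function names, the parameter ranges, and the random array standing in for a real image are all assumptions made for illustration; rotation is simplified to multiples of 90 degrees, and real pipelines typically rely on an image library instead.

```python
import numpy as np

def horizontal_flip(img):
    """Mirror an H x W x C image left-to-right."""
    return img[:, ::-1, :]

def rotate_90(img, k=1):
    """Rotate by k * 90 degrees (a simplification of arbitrary-angle rotation)."""
    return np.rot90(img, k=k, axes=(0, 1))

def color_jitter(img, rng, max_brightness=0.2, max_contrast=0.2):
    """Randomly shift brightness and scale contrast of a float image in [0, 1]."""
    brightness = rng.uniform(-max_brightness, max_brightness)
    contrast = 1.0 + rng.uniform(-max_contrast, max_contrast)
    mean = img.mean()
    out = (img - mean) * contrast + mean + brightness
    return np.clip(out, 0.0, 1.0)

def salt_and_pepper(img, rng, amount=0.05):
    """Set a random fraction of pixels to black (0) or white (1)."""
    out = img.copy()
    mask = rng.random(img.shape[:2])
    out[mask < amount / 2] = 0.0       # "pepper" pixels
    out[mask > 1 - amount / 2] = 1.0   # "salt" pixels
    return out

def make_positive(anchor, rng):
    """Build a positive sample by chaining random augmentations of the anchor."""
    view = anchor
    if rng.random() < 0.5:
        view = horizontal_flip(view)
    view = color_jitter(view, rng)
    view = salt_and_pepper(view, rng)
    return view

rng = np.random.default_rng(0)
anchor = rng.random((32, 32, 3))   # hypothetical stand-in for a real image
positive = make_positive(anchor, rng)
```

Any other image in the dataset, optionally passed through the same function, would then play the role of a negative sample.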

Image Subsampling/Patching Method

This class of Contrastive Learning methods breaks a single image into multiple patches of a fixed dimension (say, 10x10 patch windows). There might be some degree of overlap between the patches.

Now, suppose we take the image of a cat and use one of its patches as the anchor while leveraging the rest as the positive samples. Patches from other images (say, one patch each of a raccoon, an owl, and a giraffe) are used as negative samples.
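
The patching scheme described above can be sketched as follows. This is an illustrative example only: the helper name `extract_patches`, the stride of 5 pixels (which produces overlapping 10x10 patches), and the random arrays standing in for real images are assumptions.

```python
import numpy as np

def extract_patches(img, patch=10, stride=5):
    """Slide a patch x patch window over an H x W x C image.

    A stride smaller than the patch size produces overlapping patches.
    """
    h, w = img.shape[:2]
    patches = []
    for top in range(0, h - patch + 1, stride):
        for left in range(0, w - patch + 1, stride):
            patches.append(img[top:top + patch, left:left + patch])
    return np.stack(patches)

rng = np.random.default_rng(0)
cat = rng.random((30, 30, 3))       # hypothetical stand-in for the cat image
raccoon = rng.random((30, 30, 3))   # hypothetical stand-in for another image

cat_patches = extract_patches(cat)
anchor = cat_patches[0]               # one patch of the cat is the anchor
positives = cat_patches[1:]           # the remaining cat patches are positives
negatives = extract_patches(raccoon)  # patches from other images are negatives
```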

Contrastive Learning: Objectives

A number of loss functions have been defined in the Contrastive Learning literature for applications in different problems, each with its own set of functionalities. Let us discuss some of these in this section.

1. Max margin Contrastive Loss

It is one of the oldest loss functions proposed in the Contrastive Learning literature (Paper).

The basic idea here is that the loss function maximizes the distance between samples if they do not belong to the same distribution and instead minimizes the distance between them if they belong to the same distribution. It is mathematically represented as follows:
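
A commonly used form of this loss is sketched below in LaTeX. It assumes that f_θ denotes the embedding network “theta” and that distances are Euclidean; the cited paper may use slightly different notation.

```latex
\mathcal{L}(s_i, s_j, \theta) =
  \mathbf{1}[y_i = y_j] \,
    \left\lVert f_\theta(s_i) - f_\theta(s_j) \right\rVert_2^2
  + \mathbf{1}[y_i \neq y_j] \,
    \max\!\left(0,\; \epsilon - \left\lVert f_\theta(s_i) - f_\theta(s_j) \right\rVert_2 \right)^2
```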

Here, “s_i” and “s_j” are the two samples with corresponding labels “y_i” and “y_j” that need to be compared, “theta” is the embedding network, and “epsilon” is a hyperparameter, defining the lower bound distance between samples of different classes.

The labels for the samples are generated by whether the samples belong to the same distribution. For example, if the two samples are cropped versions of the same image or the augmented versions of the same sample, the labels will be the same.
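
As a rough illustration of the behaviour described in this section, the sketch below computes a max margin loss for a single pair of embeddings: same-label pairs are penalized by their squared distance, and different-label pairs are penalized only when they are closer than the margin “epsilon.” The function name and the toy vectors are assumptions, and the embeddings are taken as given rather than produced by a trained network.

```python
import numpy as np

def max_margin_contrastive_loss(z_i, z_j, same_label, epsilon=1.0):
    """Loss for one pair of embeddings z_i, z_j.

    same_label is True when y_i == y_j; epsilon is the margin hyperparameter.
    """
    distance = np.linalg.norm(z_i - z_j)
    if same_label:
        return distance ** 2                   # pull similar samples together
    return max(0.0, epsilon - distance) ** 2   # push dissimilar samples at least epsilon apart

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])   # at distance 5 from a
c = np.array([0.5, 0.0])   # at distance 0.5 from a

loss_same = max_margin_contrastive_loss(a, b, same_label=True)
loss_far_negative = max_margin_contrastive_loss(a, b, same_label=False)
loss_close_negative = max_margin_contrastive_loss(a, c, same_label=False)
```

A negative that already lies farther away than the margin contributes no loss, so only negatives inside the margin drive the update.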

2. Triplet Loss

The triplet loss (proposed in this paper) is very similar to the contrastive loss: both try to minimize the distance between similar distributions and maximize the distance between unlike distributions. The primary difference in the triplet loss is that a positive and a negative sample are simultaneously taken as input with the anchor sample to compute the loss. Mathematically it is represented as follows:
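
A widely used form of the triplet loss is sketched below in LaTeX. The embedding network f_θ and the margin symbol ε are assumptions about notation, chosen to match the max margin loss above.

```latex
\mathcal{L}(s_a, s^{+}, s^{-}) =
  \max\!\left(0,\;
    \left\lVert f_\theta(s_a) - f_\theta(s^{+}) \right\rVert_2^2
    - \left\lVert f_\theta(s_a) - f_\theta(s^{-}) \right\rVert_2^2
    + \epsilon \right)
```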

Here, “s_a,” “s+,” and “s-” respectively represent the anchor, positive and negative samples. For the success of models that use the triplet loss function, it is crucial that the negative samples are difficult to separate from the positive samples.

For example, raccoons and ringtails look very similar (see the image below). Both have striped, bushy tails and similar body fur colors. When sampling negatives against raccoons, choosing ringtails will enable the model to differentiate between classes more effectively than, say, choosing an elephant as the negative sample.
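
The effect of hard negatives can be illustrated with a small sketch. The 2-D embeddings below are made-up numbers, and the function names and margin value are assumptions; the point is only that an easy negative yields zero loss (no learning signal), while a hard negative does not.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on the gap between squared anchor-positive and anchor-negative distances."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

def hardest_negative(anchor, candidates):
    """Pick the candidate negative that lies closest to the anchor."""
    dists = np.sum((candidates - anchor) ** 2, axis=1)
    return candidates[np.argmin(dists)]

raccoon_anchor = np.array([0.0, 0.0])
raccoon_positive = np.array([0.1, 0.0])
ringtail = np.array([0.3, 0.0])    # hypothetical hard negative: close to the anchor
elephant = np.array([5.0, 5.0])    # hypothetical easy negative: far from the anchor

loss_hard = triplet_loss(raccoon_anchor, raccoon_positive, ringtail)
loss_easy = triplet_loss(raccoon_anchor, raccoon_positive, elephant)
chosen = hardest_negative(raccoon_anchor, np.stack([elephant, ringtail]))
```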

3. N-pair Loss

The N-pair loss is an extension of the triplet loss function. Instead of sampling a single negative sample, multiple negative samples (“N-1” of them) are sampled along with one anchor and one positive sample. The mathematical representation for this loss is as follows:
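
A common way to write the N-pair loss is sketched below in LaTeX, assuming that f denotes the embedding network and that similarity is measured by an inner product.

```latex
\mathcal{L}\left(s^{a}, s^{+}, \{s_i^{-}\}_{i=1}^{N-1}\right)
  = \log\!\left(1 + \sum_{i=1}^{N-1}
      \exp\!\left(f(s^{a})^{\top} f(s_i^{-}) - f(s^{a})^{\top} f(s^{+})\right)\right)
  = -\log \frac{\exp\!\left(f(s^{a})^{\top} f(s^{+})\right)}
               {\exp\!\left(f(s^{a})^{\top} f(s^{+})\right)
                + \sum_{i=1}^{N-1} \exp\!\left(f(s^{a})^{\top} f(s_i^{-})\right)}
```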

The N-pair loss function shown above is defined for (N+1) training samples, where the first sample is the anchor “s^a,” the second sample is the positive sample, “s+” and the rest of the (N-1) samples are the negative examples.

In this equation, if exactly one negative sample is drawn per class, the resulting loss is equivalent to the softmax loss function for a multi-class classification problem.

4. InfoNCE

InfoNCE, where NCE stands for Noise-Contrastive Estimation, is another type of contrastive loss function.

If “S = {s_1, s_2, …, s_N}” denotes the set of “N” random samples containing one positive sample and “N-1” negative samples, the loss function can be mathematically represented as follows:
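
The InfoNCE loss is usually written as sketched below in LaTeX. The scoring function f_k, the context c_t, and the positive sample s_{t+k} follow the notation used in the rest of this section.

```latex
\mathcal{L}_{N} = -\,\mathbb{E}_{S}\left[
  \log \frac{f_k(s_{t+k}, c_t)}{\sum_{s_j \in S} f_k(s_j, c_t)}
\right]
```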

Optimizing this loss will result in “f_k” estimating the density ratio given by:
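
In LaTeX, the density ratio is commonly stated as a proportionality, sketched here with the same symbols as the surrounding text:

```latex
f_k(s_{t+k}, c_t) \;\propto\; \frac{p(s_{t+k} \mid c_t)}{p(s_{t+k})}
```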

The density ratio preserves the mutual information between the future observations “s_{t+k}” and the context latent representation “c_t.” Modeling this ratio avoids having to model “p(s_{t+k} | c_t)” directly with a generative model.

5. Logistic Loss

The logistic loss is a simple convex loss function popularly used in Supervised Learning literature. It is mathematically expressed as:

The loss is defined for “N” samples which are denoted by “s_i” with corresponding labels (whether or not the samples belong to the same distribution) denoted by “y_i.”

6. NT-Xent Loss

The Normalized Temperature-scaled Cross-Entropy or NT-Xent loss is a modification of the multi-class N-pair loss with an addition of the temperature (T) parameter. It is mathematically represented as follows:
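
For a positive pair (i, j) in a batch of 2N augmented samples, the NT-Xent loss is commonly written as sketched below in LaTeX. The symbol τ stands for the temperature (written “T” in the text), and the batch size 2N is an assumption about the setup.

```latex
\ell_{i,j} = -\log
  \frac{\exp\!\left(\mathrm{sim}(z_i, z_j)/\tau\right)}
       {\sum_{k=1}^{2N} \mathbf{1}[k \neq i]\,
        \exp\!\left(\mathrm{sim}(z_i, z_k)/\tau\right)}
```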

“sim(.)” denotes the cosine similarity function, and “z_i” and “z_j” are the encoded features of samples “s_i” and “s_j” respectively.
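
A minimal NumPy sketch of a temperature-scaled cross-entropy over cosine similarities is given below. It assumes a batch layout in which rows 2k and 2k+1 of the feature matrix are two views of the same sample; the function name, the layout, and the temperature value are illustrative assumptions, not a specific library's API.

```python
import numpy as np

def nt_xent_loss(z, temperature=0.5):
    """z has shape (2N, d); rows 2k and 2k+1 are two views of the same sample."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit vectors, so dot product = cosine similarity
    sim = z @ z.T / temperature                          # (2N, 2N) temperature-scaled similarities
    np.fill_diagonal(sim, -np.inf)                       # a sample is never compared with itself
    n = z.shape[0]
    pos_index = np.arange(n) ^ 1                         # partner of row 2k is 2k+1 and vice versa
    row_max = sim.max(axis=1, keepdims=True)             # subtract the max for numerical stability
    log_denom = np.log(np.exp(sim - row_max).sum(axis=1)) + row_max[:, 0]
    log_prob = sim[np.arange(n), pos_index] - log_denom
    return -log_prob.mean()

# Two samples, two views each: views agree in `aligned`, disagree in `misaligned`.
aligned = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
misaligned = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])

loss_aligned = nt_xent_loss(aligned)
loss_misaligned = nt_xent_loss(misaligned)
```

The loss is lower when the two views of each sample map to similar embeddings, which is the agreement the loss is meant to reward.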

The above equation shows the expression for the NT-Xent loss used in Self-Supervised Learning. For supervised learning, the contrastive loss shown above is incapable of handling the case where, due to the presence of labels, more than one sample is known to belong to the same class. Generalization to an arbitrary number of positives leads to a choice between multiple possible functions.

Thus, the NT-Xent loss’ extension into the supervised learning paradigm is expressed as follows:
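
One published form of the supervised extension is sketched below in LaTeX. It assumes a multiview batch of 2N samples with embeddings z_i, labels y_i, temperature τ, and an inner product as the similarity measure.

```latex
\mathcal{L}^{sup} = \sum_{i=1}^{2N} \mathcal{L}_i^{sup},
\qquad
\mathcal{L}_i^{sup} = \frac{-1}{2N_{y_i} - 1}
  \sum_{j=1}^{2N} \mathbf{1}[i \neq j]\, \mathbf{1}[y_i = y_j]\,
  \log \frac{\exp\!\left(z_i \cdot z_j / \tau\right)}
            {\sum_{k=1}^{2N} \mathbf{1}[i \neq k]\, \exp\!\left(z_i \cdot z_k / \tau\right)}
```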

Here, “N_{y_i}” is the number of samples in the batch that share the label “y_i” of the anchor “i,” so (2N_{y_i} - 1) is the number of positives in the multiview batch distinct from “i.”

Supervised Contrastive Learning (SCL) vs. Self-Supervised Contrastive Learning (SSCL)

Supervised Learning refers to the learning paradigm where both the data and their corresponding labels are available for training a model. In Self-Supervised Learning, on the other hand, the model generates labels using the raw input data without any external support.

💡 Pro Tip: Check out A Simple Guide to Data Preprocessing in Machine Learning.

In Self-Supervised Contrastive Learning (SSCL), due to the absence of class labels, the positive samples are generated from the anchor image itself by various data augmentation techniques. Augmented versions of all other images are considered as “negative” samples.

This setup creates a problem that hinders model training. For example, if we have an image of a dog as the anchor sample, then only augmented versions of this image can form the positive samples. Images of other dogs, thus, also belong to the set of negative samples.

As a result, under the Self-Supervised Learning framework, the contrastive loss function will force the two different dog images to be distant in the embedding space, whereas, in reality, they should lie as close together as possible.

Thus, Supervised Contrastive Learning (SCL) leverages the label information effectively and allows samples of the same distribution (like several images of different dogs) to lie close together in the latent space, while samples belonging to disparate classes are repelled in the latent space.

Accordingly, the loss functions for the SCL frameworks are also different from those used in SSCL frameworks. One such example has been explored in the previous section while explaining the NT-Xent loss function.

The problems associated with SSCL mentioned here have also given rise to the field of Non-Contrastive Learning for Self-Supervised Learning, where we do not use any negative samples at all.

We only use positive samples to train models where the aim is to push the representations of the samples (belonging to the same distribution) close to each other in the embedding space. This is a vast area of research and cannot be detailed any further in this article.


Ten Contrastive Learning frameworks

Let us look into the working mechanisms of ten popular Contrastive Learning frameworks proposed in recent literature by Deep Learning and Computer Vision researchers.

1. SimCLR

The SimCLR model, developed by Google Brain and proposed in this paper, is a framework for contrastive learning of visual representations. SimCLR was proposed to address Self-Supervised and Semi-Supervised Learning problems through Contrastive Learning.

Its basic working principle is to maximize the agreement between different augmented versions of the same sample using a contrastive loss in the latent space. The framework of the SimCLR method is shown below.

Source: Paper

The SimCLR model consists of the following modules:

1. A data augmentation module that transforms a given data sample (image) randomly to create two views of the same example (“x_i” and “x_j” in the diagram above). These represent the positive pairs. The SimCLR framework applies the following three augmentations: random crop and resizing (with random flip), color distortions, and Gaussian blur. According to the results obtained by the authors, random cropping and color distortion are essential for achieving good performance.

2. A neural network base encoder, denoted by “f(.)” in the architecture diagram, that extracts representation vectors from the augmented data samples. Although any network can serve as the backbone encoder, the authors chose the ResNet model for simplicity. The features are extracted after the final average pooling layer of the ResNet-50 model.

💡 Pro Tip: Read An Introduction to Autoencoders: Everything You Need to Know

3. Next, a small neural network projection head, denoted by “g(.),” is incorporated to map the extracted representative vectors to a common latent space. This will allow the contrastive loss to be implemented. The authors use a simple Multi-Layer Perceptron with one hidden layer for this purpose. They investigated and found that applying the contrastive loss in this latent space produces better results than directly evaluating the loss over the features extracted from ResNet-50.

4. Finally, a contrastive loss function is defined based on previous literature, which the authors chose to be the NT-Xent loss function that has been explained in the previous section.

Source: Paper

The SimCLR model achieved state-of-the-art results at the time (results shown above), and since then, several Contrastive Learning models have been devised based on this framework.


2. NNCLR

Most Contrastive Learning algorithms that are based on instance discrimination train the encoders to be invariant to pre-defined transformations of the same instance.

While most methods treat different views of the same image as positives for a contrastive loss, the Nearest-Neighbor Contrastive Learning (NNCLR) framework developed in this paper draws positives from other instances in the dataset, i.e., it uses different images that are likely to show similar content, rather than only augmenting the same image.

The framework of the NNCLR model is shown below.

Source: Paper

The NNCLR model samples the nearest neighbors from the dataset in the latent space and treats them as positive samples, leading to a more diverse selection of positive pairs, which in turn helps the model learn better.

Source: Paper

NNCLR uses the InfoNCE loss just like in the SimCLR framework, but now, the positive sample is the nearest neighbor of the anchor image. The loss function defined in the paper is as follows:
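
A sketch of this loss in LaTeX is given below. It assumes that z_i is the embedding of the anchor, z_i^+ the embedding of its augmented view, τ a temperature, and n the batch size; the paper's exact notation may differ.

```latex
\mathcal{L}_i^{\mathrm{NNCLR}} = -\log
  \frac{\exp\!\left(\mathrm{NN}(z_i, Q) \cdot z_i^{+} / \tau\right)}
       {\sum_{k=1}^{n} \exp\!\left(\mathrm{NN}(z_i, Q) \cdot z_k^{+} / \tau\right)}
```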

Source: Paper

Here, “Q” is the support set, and “NN” denotes the nearest neighbor function.

3. ORE

Joseph et al. introduced the Open World Object Detection (OWOD) problem and the Open World Object Detector, or ORE, in this paper. Here, a model is tasked to:

1) identify objects that have not been introduced to it as “unknown,” without explicit supervision.

2) incrementally learn these identified unknown categories without forgetting previously learned classes when the corresponding labels are progressively received.

The overview of the OWOD framework is shown below.

The ORE framework addresses the incremental object detection problem, where previously unseen objects are detected in the images. As and when more information about the identified unknown classes becomes available, the system is able to incorporate them into its existing knowledge base.

This would define a smart object detection system, and the ORE methodology is an effort towards achieving this goal.

💡 Pro Tip: Read YOLO: Real-Time Object Detection Explained.

To cluster the unknowns optimally with contrastive clustering, supervision on what an “unknown instance” is would be needed. However, it is infeasible to manually annotate even a small subset of the potentially infinite set of unknown classes.

To counter this, the authors propose an auto-labeling mechanism based on the Region Proposal Network (which generates a set of bounding box predictions for foreground and background instances) to pseudo-label unknown instances.

💡 Pro Tip: Learn best practices for Annotating with Bounding Boxes and start building your object detectors on V7.

The inherent separation of auto-labeled unknown instances in the latent space helps the energy-based classification head in ORE to differentiate between the known and unknown instances.

To prevent the model from forgetting older classes, a few examples from these classes are “replayed” in every iteration for continual learning. The results obtained by this method (shown below) outperform several state-of-the-art methods.

Source: Paper


4. CURL

CURL is the abbreviation for Contrastive Unsupervised Representations for Reinforcement Learning (RL), proposed in this paper. CURL learns contrastive representations jointly with the RL objective. Here, representation learning is posed as an auxiliary task that can be coupled to any model-free RL algorithm.

CURL uses a form of contrastive learning that maximizes agreement between augmented versions of the same observation, where each observation is a stack of temporally sequential frames. CURL significantly improves sample efficiency over prior pixel-based methods by performing contrastive learning simultaneously with an off-policy RL algorithm.

The authors have focussed more on simplicity and reproducibility by adding minimal overhead in terms of architecture and model learning.

💡 Pro Tip: Have a look at The Essential Guide to Neural Network Architectures.

The contrastive learning objective in CURL operates with the same latent space and architecture typically used for model-free RL and seamlessly integrates with the training pipeline without the need to introduce multiple additional hyperparameters.


5. PCRL

Preservational Contrastive Representation Learning or PCRL was proposed in this paper for learning self-supervised medical representations. The overview of the model is shown below.

Source: Paper

Contrastive learning aims to learn invariant representations via contrasting image pairs, which can be regarded as an implicit way to preserve maximal information.


The authors argue that explicitly preserving more information is beneficial and complementary to the contrastive loss.

To achieve this goal, an intuitive solution is to reconstruct the original inputs using learned representations so that these representations can preserve the information closely related to the inputs. However, the authors discover that directly adding a plain reconstruction branch to restore the original inputs does not significantly improve the learned representations.

To address this problem, the PCRL model reconstructs diverse contexts using representations learned from the contrastive loss. For restoring the diverse images, the authors propose two modules: Transformation-conditioned Attention (to enable the reconstruction of diverse contexts) and Cross-model Mixup (which shuffles the feature representations to enable more diverse restoration). Together, these form a triple encoder, single decoder architecture for self-supervised learning.

6. SwAV

Swapping Assignments between multiple Views or SwAV is an unsupervised contrastive clustering mechanism developed by Facebook in this paper.

SwAV takes advantage of contrastive methods without requiring pairwise comparisons to be computed. Specifically, SwAV simultaneously clusters the data while enforcing consistency between cluster assignments produced for different augmentations (or “views”) of the same image, instead of comparing features directly as in traditional contrastive learning.

Simply put, the model uses a “swapped” prediction mechanism where it predicts the code of a view from the representation of another view.

The architecture diagram is shown below

Source: Paper

The authors proposed a Multi-Crop augmentation scheme, which creates multiple views of the same sample without increasing the computational requirements. Two standard high-resolution crops (of size 224x224) and multiple lower-resolution crops (of size 96x96) are obtained. Using low-resolution views lets the model train on more views of each image at a low computational cost.

The base encoder of the SwAV model is the ResNet backbone (with different depths). The main contribution of the SwAV mechanism is in the online cluster assignments by using mini-batches. Generally, clustering is performed offline, meaning the entire dataset is fed to the model.

The swapped loss function used in the SwAV algorithm is:
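
Using the symbols defined in the next paragraph, the swapped prediction objective can be sketched in LaTeX as:

```latex
L(z_t, z_s) = \ell(z_t, q_s) + \ell(z_s, q_t)
```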

Here, “z_t” and “z_s” are the features extracted from two views of the same image, and “q_t” and “q_s” are the corresponding intermediate codes (which are obtained by matching the features to the set of prototypes). “l(z, q)” measures the fit between features “z” and a code “q.”

The results obtained by SwAV are shown below.

Source: Paper

7. MoCo

Momentum Contrast or MoCo is a self-supervised learning algorithm with a contrastive loss proposed in this paper.

Source: Paper

We can think of contrastive loss methods as building dynamic dictionaries. The “keys” (tokens) in the dictionary are sampled from the data (e.g., images or patches) and are represented by an encoder network.

Unsupervised learning trains encoders to perform dictionary look-up: an encoded “query” sample should be similar to its matching key and dissimilar to others. Learning is formulated as minimizing a contrastive loss.

To achieve this, MoCo forms a queue of mini-batches that are encoded by the momentum encoder network. As a new mini-batch is selected, its encodings are enqueued, and the oldest encodings in the data structure are dequeued. This decouples the dictionary size, represented by the queue, from the batch size and enables a much larger dictionary to query from.
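
The queue mechanism can be sketched with a fixed-length FIFO buffer, as below. The class and function names, the momentum value, and the random arrays standing in for encoded keys are assumptions for illustration; the momentum update shown (key weights move slowly toward the query-encoder weights) is the standard exponential moving average, applied here to plain NumPy arrays rather than a real network.

```python
from collections import deque
import numpy as np

def momentum_update(key_params, query_params, m=0.999):
    """Move the key-encoder weights slowly toward the query-encoder weights."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]

class KeyQueue:
    """FIFO dictionary of encoded keys whose size is independent of the batch size."""

    def __init__(self, max_batches):
        # A deque with maxlen drops the oldest mini-batch automatically.
        self.batches = deque(maxlen=max_batches)

    def enqueue(self, keys):
        self.batches.append(keys)

    def all_keys(self):
        return np.concatenate(list(self.batches), axis=0)

rng = np.random.default_rng(0)
queue = KeyQueue(max_batches=3)
for step in range(5):
    batch_keys = rng.normal(size=(4, 8))   # stand-in for momentum-encoder outputs
    queue.enqueue(batch_keys)

dictionary = queue.all_keys()              # keys from the 3 most recent mini-batches
updated = momentum_update([np.zeros(2)], [np.ones(2)], m=0.9)
```

Here the dictionary holds 12 keys even though each mini-batch contributes only 4, which is the decoupling of dictionary size from batch size described above.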

8. Supervised Contrastive Segmentation

Current semantic segmentation methods focus only on mining “local” context, i.e., dependencies between pixels within individual images, by context-aggregation modules or structure-aware optimization criteria. They ignore the “global” context of the training data, i.e., rich semantic relations between pixels across different images.

The fully-supervised contrastive segmentation framework proposed in this paper is posed as a solution to this problem, where pixel embeddings belonging to the same semantic class are enforced to be more similar than those belonging to different classes.

The overview diagram of the method is shown below.

Source: Paper

The pixel-wise contrastive learning method proposed by the authors lifts the current image-wise training strategy for semantic segmentation to an inter-image, pixel-to-pixel paradigm. It essentially learns a well-structured pixel semantic embedding space by fully using the global semantic similarities among labeled pixels.

💡 Pro Tip: Check out A Gentle Introduction to Image Segmentation for Machine Learning.

The authors develop a region memory to better explore the large visual data space and further calculate pixel-to-region contrast. Integrated with pixel-to-pixel contrast computation, this method exploits semantic correlations among pixels and between pixels and semantic regions.

The authors employ a weighted average of the pixel-wise cross-entropy loss and the supervised NCE loss for their model, which provided a better clustering result than the cross-entropy loss alone (shown below).

Source: Paper

The quantitative results obtained by the authors are shown below.

Source: Paper

9. PCL

Prototypical Contrastive Learning or PCL proposed in this paper is an unsupervised representation learning method that bridges contrastive learning with clustering. PCL learns low-level features for the task of instance discrimination, and it also encodes the semantic structures discovered by clustering into the learned embedding space.

The training framework for the PCL model is shown below.

Source: Paper

The previous methods based on instance discrimination form a special case in the EM framework proposed by the authors. The authors introduce prototypes as latent variables to help find the maximum-likelihood estimation of the network parameters in an Expectation-Maximization framework. They iteratively perform E-step as finding the distribution of prototypes via clustering and M-step as optimizing the network via contrastive learning.

💡 Pro Tip: Looking for quality training data? See 65+ Best Free Datasets for Machine Learning and 20+ Open Source Computer Vision Datasets.

The authors also propose ProtoNCE, a new contrastive loss that improves the widely used InfoNCE by dynamically estimating the concentration for the feature distribution around each prototype. ProtoNCE also includes an InfoNCE term in which the instance embeddings can be interpreted as instance-based prototypes.

PCL outperforms instance-wise contrastive learning on multiple benchmarks with substantial improvements in low-resource transfer learning. PCL also leads to better clustering results. The qualitative and quantitative results obtained by the authors with the PCL model are shown below.

Source: Paper
Source: Paper

10. SSCL

The Self-Supervised Contrastive Learning or SSCL framework developed in this paper addresses the aspect detection problem, which involves extracting interpretable aspects and identifying aspect-specific segments (such as sentences) from online reviews.

According to the authors, previous deep learning-based topic models, specifically aspect-based autoencoders, suffer from several problems, such as extracting noisy aspects and poorly mapping aspects discovered by models to the aspects of interest.

To tackle these challenges, the authors proposed an SSCL framework consisting of a Smooth Self Attention (SSA) model along with a high-resolution selective mapping (HRSMap) method. The overview of the method is shown below.

Source: Paper

The authors constructed two representations directly based on (i) word embeddings and (ii) aspect embeddings for every review segment in a corpus.

Then, a contrastive learning mechanism is devised to map aspect embeddings to the word embedding space. In the image above, “alpha” represents the smooth self-attention parameters, while “beta” represents soft-labels (probability distribution) over model-inferred aspects for a review segment.

Selective mapping means the model will not map noisy or meaningless aspects to gold-standard aspects. For aspect mapping, the authors used a high-resolution mapping, in the sense that the number of model-inferred aspects should be at least three times more than the number of gold-standard aspects so that model-inferred aspects have better coverage. This is pictorially represented below.

Source: Paper

Applications of Contrastive Learning

Finally, let's have a look at some of the most prominent applications of Contrastive Learning.

Semi-supervised Learning

Obtaining a large quantity of labeled data is difficult, especially in domains like astronomy, remote sensing, and biomedical engineering. Thus, we may often have a dataset where only a few samples are annotated while the rest are unlabeled.

Semi-Supervised Learning aims to utilize both the unlabelled and the labeled samples in the dataset to train a model and make predictions. In 2020, this research found that deeper and wider self-supervised models are strong semi-supervised learners.

That is, a model is pre-trained in an unsupervised fashion using the unlabeled part of the data, and then with the few labeled samples available, the model is fine-tuned.

Source: Paper

After this large deep network is pre-trained and fine-tuned using the data, it can be distilled into a much smaller network, using a concept called “Knowledge Distillation,” with little loss in classification accuracy by using the unlabeled examples for a second time, but in a task-specific way.

💡 Pro Tip: Read 12 Types of Neural Network Activation Functions: How to Choose?

The authors achieved a 10x improvement in label efficiency on the ImageNet dataset when they used 1% of labels (≤13 images per class) compared to the previous state-of-the-art.

And with 10% labels, they achieved results higher than supervised learning models.

Source: Paper

Supervised Learning

Contrastive Learning is now also widely applied in fully supervised settings.

Because class labels are readily available in such a setting, the contrastive loss can be formulated more effectively: positive pairs need not be augmented versions of the same sample and can instead be any other samples from the same class.

This paper bridges the gap between self-supervised and fully supervised learning and enables contrastive learning to be applied in the supervised setting by proposing the SupCon loss function. SupCon encourages embeddings from the same class to be pulled closer together, while embeddings from different classes are pushed apart.

This simplifies the process of positive selection while avoiding potential false negatives. Since SupCon accommodates multiple positives per anchor, this approach results in an improved selection of positive examples that are more varied while still containing semantically relevant information.
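The idea of multiple positives per anchor can be sketched in a few lines. The example below is a simplified, dependency-free illustration of a supervised contrastive loss, not the authors' reference implementation: for each anchor, every other sample with the same label is a positive, and all remaining samples form the denominator. The toy 2-D embeddings, the labels, and the temperature of 0.1 are assumptions chosen for illustration.

```python
import math

def normalize(vector):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def supcon_loss(embeddings, labels, temperature=0.1):
    """Average supervised contrastive loss over all anchors with a positive."""
    z = [normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Similarity of the anchor to every other sample in the batch.
        sims = {a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
                for a in range(n) if a != i}
        peak = max(sims.values())
        log_denominator = peak + math.log(
            sum(math.exp(s - peak) for s in sims.values()))
        # Mean negative log-probability over all positives of this anchor.
        total += -sum(sims[p] - log_denominator for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0

# Toy embeddings: the first two point one way, the last two another way.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned_loss = supcon_loss(embeddings, labels=[0, 0, 1, 1])
mixed_loss = supcon_loss(embeddings, labels=[0, 1, 0, 1])
print(aligned_loss < mixed_loss)
```

When the labels agree with the geometry of the embedding space, the loss is low. When same-class samples sit far apart, the loss is high, which is the signal that pulls them together during training.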

While conventional Contrastive Learning methods restrict label information to downstream tasks, the SupCon model allows the label information to play an active role in representation learning.

The model proposed is robust to image corruptions and hyperparameter variations.

Source: Paper

Natural Language Processing (NLP)

Contrastive Learning has seen applications in NLP as well (such as SimCSE), where the goal is to learn an embedding space in which similar sentences are close to each other while dissimilar ones are far apart.

However, while Contrastive Learning in computer vision only requires generating augmentations of images, constructing text augmentations is more challenging because the sentence's meaning must be kept intact. There are four methods for augmenting text sequences:

1. Back-Translation: Here, the idea is to generate augmented sentences using back-translation. That is, a sentence that has been translated to a different language (say from English to Japanese) is translated back (from the translated Japanese sentence back to English). CERT is one such framework that uses this technique.

2. Lexical Edits: This type of augmentation takes a sentence as input and randomly applies one of the following simple operations:

  • Random Insertion: Insert a synonym of a randomly selected non-stop word in the sentence at a random position.
  • Random Swap: Randomly swap two words “n” times.
  • Random Deletion: Randomly delete each word in the sentence with probability “p.”
  • Synonym Replacement: Randomly choose “n” words from the sentence that are not stop words. Replace each of these words with one of its synonyms chosen at random.
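The four lexical edits listed above can be sketched with nothing but the standard library. The stop-word list and the synonym dictionary below are tiny hypothetical stand-ins; a real pipeline would draw on a proper thesaurus.

```python
import random

STOP_WORDS = {"the", "a", "is", "on", "of"}               # toy stop-word list
SYNONYMS = {"quick": ["fast", "speedy"],                  # toy thesaurus
            "happy": ["glad", "joyful"]}

def random_insertion(words, rng):
    """Insert a synonym of a random non-stop word at a random position."""
    words = list(words)
    candidates = [w for w in words if w not in STOP_WORDS and w in SYNONYMS]
    if candidates:
        synonym = rng.choice(SYNONYMS[rng.choice(candidates)])
        words.insert(rng.randint(0, len(words)), synonym)
    return words

def random_swap(words, n, rng):
    """Swap two randomly chosen positions, n times."""
    words = list(words)
    if len(words) < 2:
        return words
    for _ in range(n):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p, rng):
    """Drop each word with probability p, keeping at least one word."""
    if not words:
        return []
    kept = [w for w in words if rng.random() > p]
    return kept or [rng.choice(words)]

def synonym_replacement(words, n, rng):
    """Replace up to n non-stop words with a randomly chosen synonym."""
    words = list(words)
    candidates = [i for i, w in enumerate(words)
                  if w not in STOP_WORDS and w in SYNONYMS]
    for i in rng.sample(candidates, min(n, len(candidates))):
        words[i] = rng.choice(SYNONYMS[words[i]])
    return words

rng = random.Random(0)  # seeded so runs are repeatable
sentence = "the quick dog is happy".split()
print(" ".join(synonym_replacement(sentence, 1, rng)))
print(" ".join(random_swap(sentence, 2, rng)))
```

Each function returns a new list and leaves the input untouched, so the original sentence and its augmented version can be used together as a positive pair.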

3. Cutoff: This augmentation strategy was proposed in this paper. Here, once a sentence is embedded into a vector representation (say, of size “Nxm,” where “N” = number of features, “m” = length of sentence), one of the following three strategies is used:

  • Feature Cutoff: Remove some selected features.
  • Token Cutoff: Remove the information of a few selected tokens.
  • Span Cutoff: Remove a continuous chunk of text.
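The three cutoff strategies can be illustrated on a toy embedding matrix. This sketch assumes that "removing" information means zeroing the corresponding entries, and for convenience it stores one row per token and one column per feature (the transpose of the Nxm layout described above). The matrix values and function names are illustrative.

```python
def token_cutoff(matrix, token_indices):
    """Zero the full embedding of each selected token (row)."""
    drop = set(token_indices)
    return [[0.0] * len(row) if t in drop else list(row)
            for t, row in enumerate(matrix)]

def feature_cutoff(matrix, feature_indices):
    """Zero the selected feature dimensions (columns) for every token."""
    drop = set(feature_indices)
    return [[0.0 if f in drop else value for f, value in enumerate(row)]
            for row in matrix]

def span_cutoff(matrix, start, length):
    """Zero a contiguous run of tokens starting at position `start`."""
    return token_cutoff(matrix, range(start, start + length))

# Toy sentence embedding: 4 tokens, 3 features each.
sentence = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

print(token_cutoff(sentence, [1]))
print(feature_cutoff(sentence, [0]))
print(span_cutoff(sentence, 1, 2))
```

Span cutoff is simply token cutoff applied to neighbouring positions, which is why it is implemented on top of `token_cutoff` here.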

4. Dropout: In this approach, proposed in this paper, we take a collection of sentences and use each sentence as its own positive. During training, Transformers apply dropout masks to their fully-connected layers and attention probabilities. The method simply feeds the same input to the encoder twice, each time with a different dropout mask, and treats the two resulting embeddings as a positive pair.
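The mechanism is easy to see on a toy example. The sketch below replaces the Transformer with a hypothetical one-layer encoder (dropout on the input followed by a linear map), so it only illustrates how two forward passes over the same input produce two different "views". The features, weights, and dropout rate of 0.1 are made-up values.

```python
import random

def dropout(vector, p, rng):
    """Inverted dropout: zero each entry with probability p, rescale the rest."""
    return [0.0 if rng.random() < p else x / (1.0 - p) for x in vector]

def encode(features, weights, p, rng):
    """Toy encoder: dropout on the input, then a single linear layer."""
    hidden = dropout(features, p, rng) if p > 0 else list(features)
    return [sum(w * h for w, h in zip(row, hidden)) for row in weights]

rng = random.Random(0)  # seeded so runs are repeatable

# Hypothetical input features for one sentence and a fixed 2x8 weight matrix.
features = [0.5, -1.0, 2.0, 0.3, 1.5, -0.7, 0.9, 1.1]
weights = [
    [0.2, 0.1, -0.3, 0.4, 0.05, 0.6, -0.1, 0.3],
    [-0.5, 0.3, 0.2, 0.1, 0.4, -0.2, 0.7, 0.05],
]

# Same sentence, two forward passes, two independently sampled dropout masks.
view_a = encode(features, weights, p=0.1, rng=rng)
view_b = encode(features, weights, p=0.1, rng=rng)
print(view_a)
print(view_b)
```

The two views form the positive pair, while the embeddings of the other sentences in the batch serve as negatives. With dropout disabled (`p=0`), both passes return the same embedding, so the augmentation comes entirely from the dropout noise.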

Computer Vision

Contrastive Learning has most extensively been explored in the field of Computer Vision. We have discussed several image-based applications of Contrastive Learning in previous sections.

Other applications include:

1. Video Sequence Prediction: The VideoMoCo model uses a Contrastive Learning methodology for unsupervised representation learning. The authors used temporally adversarial examples to augment input samples, i.e., some of the frames from the sequence were dropped and used as positive/negative pairs.

2. Object Detection: DetCo is a contrastive self-supervised approach for object detection which uses 1) contrastive learning between the global image and local patches and 2) multi-level supervision to intermediate representations.

3. Semantic Segmentation: Segmentation of natural images using contrastive learning has been explored in this paper, which uses a supervised contrastive loss for pre-training a model, and uses the traditional cross-entropy for fine-tuning.

4. Remote Sensing: The GLCNet model uses a self-supervised pre-train and supervised fine-tuning approach for the segmentation of remote sensing image data. The authors exploit both the global image-level representation using a “global style contrastive learning module” and representations of the local regions using the “local features matching contrastive learning module.”

5. Perceptual Audio Similarity: CDPAM is an unsupervised contrastive learning-based method for classifying audio samples based on their perceptual similarity. The authors fine-tune their model by collecting human judgments on triplet comparisons to improve generalization to a broader range of audio perturbations.

Contrastive Learning: Summary

Contrastive Learning is a recent and popular technique for self-supervised learning and for enhancing existing supervised learning-based approaches.

It works on the principle of juxtaposing samples from a dataset and pulling their representations together or pushing them apart in the embedding space, depending on whether the samples belong to the same distribution (i.e., the same class in classification tasks or the same object in object detection/recognition tasks) or to different distributions.

There are different ways to generate positive contrast samples from the anchor sample, like augmenting the entire sample or creating subsamples (like extracting patches from images) from the existing data. Contrastive Learning-based methods have boosted performance in Semi-Supervised Learning and Representation Learning tasks.

We have explored some of the most popular Contrastive Learning frameworks in the literature and discussed the various loss functions devised for the same in this article. We have also looked into several applications of Contrastive Learning, both in and beyond Computer Vision.

Newer methods are still being researched to use minimal supervision in Contrastive Learning and achieve performance as good as or better than traditional supervised learning methods.


Rohit Kundu is a Ph.D. student in the Electrical and Computer Engineering department of the University of California, Riverside. He is a researcher in the Vision-Language domain of AI and has published several papers in top-tier conferences and notable peer-reviewed journals.
