Data rampage and data drought; as machine learning practitioners, we’re often drowning in what we can’t use, and desperate for what doesn’t exist.
On the one hand, supervised learning is the bread and butter of machine learning (ML), but it is powered by labeled data, which is tedious and expensive to annotate. On the other hand, unsupervised learning uses unlabeled data, which is often plentiful because it needs no human-made annotations.
When used alone, either of these strategies is often impractical for training a model to deployment-ready benchmarks. Labeling an entire dataset is time-consuming and expensive, and training on unlabeled data alone may not deliver the desired accuracy (… or F1 score 😁).
What if we have access to both types of data? Or what if we only want to label a percentage of our dataset? How can we combine both our labeled and unlabeled datasets to improve model performance?
We can use semi-supervised intuition to answer these questions, as it leverages both labeled and unlabeled data to bolster model performance.
Now, let’s dive in.
Semi-supervised learning is a broad category of machine learning techniques that utilizes both labeled and unlabeled data; in this way, as the name suggests, it is a hybrid technique between supervised and unsupervised learning.
In general, the core idea of semi-supervision is to treat a datapoint differently based on whether it has a label or not: for labeled points, the algorithm uses traditional supervision to update the model weights; for unlabeled points, the algorithm minimizes the difference between its predictions on similar training examples.
For intuition, consider the moons dataset in Figure 1: a binary classification problem with one class for each crescent moon. Let’s say we only have 8 labeled datapoints, with the rest unlabeled.
Supervised training updates model weights to minimize the average difference between predictions and labels. However, with limited labeled data, this might find a decision boundary that is valid for the labeled points but won’t generalize to the whole distribution—as in Figure 2a below.
Unsupervised learning, on the other hand, tries to cluster points together based on similarities in some feature space. But without labels to guide training, an unsupervised algorithm might find sub-optimal clusters. In Figure 2b, for example, the discovered clusters do not match the true class distribution.
Without sufficient labeled data, or in difficult clustering settings, supervised and unsupervised techniques can fail to achieve the desired result.
In the semi-supervised setting, however, we use both labeled and unlabeled data. Our labeled points act as a sanity check; they ground our model predictions and add structure to the learning problem by establishing how many classes there are, and which clusters correspond to which class.
Unlabeled datapoints provide context; by exposing our model to as much data as possible, we can accurately estimate the shape of the whole distribution.
With both parts—labeled and unlabeled data—we can train more accurate and resilient models. In our moons dataset, semi-supervised training can step closer to the true distribution shown in Figure 3.
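As a rough illustration of the moons intuition, here is a minimal pure-Python sketch. It assumes a noise-free two-moons dataset and 8 hand-picked labeled points clustered near one tip of each crescent. The helper names (`make_moons`, `one_nn`, `propagate_labels`) and the greedy nearest-neighbor propagation rule are illustrative choices, not a method taken from a specific paper. The sketch compares a supervised 1-nearest-neighbor baseline, which only sees the 8 labels, against a simple scheme that spreads labels outward through the unlabeled points.

```python
import math

def make_moons(n_per_moon=50):
    """Return (points, true_labels) for two noise-free interleaving crescents."""
    upper, lower = [], []
    for i in range(n_per_moon):
        t = math.pi * i / (n_per_moon - 1)
        upper.append((math.cos(t), math.sin(t)))            # class 0
        lower.append((1 - math.cos(t), 0.5 - math.sin(t)))  # class 1
    return upper + lower, [0] * n_per_moon + [1] * n_per_moon

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def one_nn(points, seeds):
    """Supervised baseline: copy the label of the closest labeled point."""
    return [min((dist(p, points[j]), lab) for j, lab in seeds.items())[1]
            for p in points]

def propagate_labels(points, seeds):
    """Semi-supervised sketch: repeatedly label the unlabeled point that is
    closest to any already-labeled point, so labels spread through dense regions."""
    labels = dict(seeds)
    unlabeled = set(range(len(points))) - set(labels)
    while unlabeled:
        _, i, lab = min((dist(points[i], points[j]), i, labels[j])
                        for i in unlabeled for j in labels)
        labels[i] = lab
        unlabeled.remove(i)
    return [labels[i] for i in range(len(points))]

def accuracy(pred, truth):
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

points, truth = make_moons()
# 8 labeled points (an assumption for this demo): 4 near one tip of each crescent.
seeds = {40: 0, 43: 0, 46: 0, 49: 0, 90: 1, 93: 1, 96: 1, 99: 1}

acc_supervised = accuracy(one_nn(points, seeds), truth)
acc_semi = accuracy(propagate_labels(points, seeds), truth)
print(f"1-NN on 8 labels: {acc_supervised:.2f}, with propagation: {acc_semi:.2f}")
```

With these hand-picked labels, the baseline misclassifies the far end of each crescent, while propagation follows the shape of the data. Real data is noisier, so treat this as intuition rather than a benchmark.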
Techniques that leverage unlabeled data are motivated by the real-world challenges around data collection.
Every year, companies and practitioners dump exorbitant amounts of time and money into labeling datasets for machine learning. All the while, unlabeled data remains idle; and while it’s typically cheap and easy to collect, delivering results without labels is a challenge. If it’s possible to avoid manual data labeling while achieving strong results, machine learning practitioners can conserve precious, otherwise wasted resources.
Given the same datapoints, supervised learning on an entirely labeled dataset will generally train a better model than training on a set where a portion of the points are unlabeled. But semi-supervised learning is powerful when labels are limited and unlabeled data is plentiful. In this case, our model gains exposure to instances it might encounter in deployment, without us investing time and money labeling thousands upon thousands of extra images.
Some of the most powerful applications might be where labeling data is difficult.
In many NLP tasks, like webpage classification, speech analysis, or named-entity recognition, and in less traditional machine learning applications, like protein sequence classification, labeling data is especially tedious and time-consuming, and it can require domain expertise. Here, leveraging as much unlabeled data as possible makes dataset engineering more efficient.
Whenever unlabeled data is easy to collect, blending unlabeled and labeled datasets can boost model performance.
As a broad subset of machine learning, semi-supervised intuition is based on a few core principles.
The continuity, or smoothness, assumption indicates that close-together datapoints are likely to have the same label.
Similarly, the cluster assumption indicates that, in a classification problem, data tends to be organized into high-density clusters, and that datapoints of the same cluster are likely to have the same label. Therefore, a decision boundary should not lie in areas of densely packed datapoints; rather, it should lie in-between high-density regions, separating them into discrete clusters.
The manifold assumption adapts the intuition for our example moons dataset to deep learning applications, including computer vision and natural language processing. It assumes the high-dimensional data distribution can be represented in an embedded low-dimensional space. This low-dimensional space is called the data manifold.
For intuition as to what a manifold is, consider a sheet of paper as a 2D plane, where we can identify a location with a set of x-y coordinates. We can take our sheet of paper and crumple it into a ball, where it is represented as a sphere in 3D space. Now, a pair of 2D coordinates on the original sheet of paper can be mapped to a set of x-y-z coordinates in 3D space on the crumpled ball.
This crumpled ball, in 3D space, is our higher-dimensional space. Embedded in that 3D space is the sheet of paper’s 2D coordinate plane, serving as our low-dimensional manifold.
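One consequence of the paper analogy can be sketched in a few lines: two points may sit close together in the surrounding space while being far apart along the sheet itself. The example below uses a lower-dimensional analogue, a 1D strip rolled into a 2D spiral. The spiral mapping `embed` and the helper names are assumptions made purely for illustration.

```python
import math

def embed(s):
    """Map a 1D position s on the strip to a point on a 2D spiral."""
    return (s * math.cos(s), s * math.sin(s))

def euclidean(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def distance_along_strip(s_start, s_end, steps=10_000):
    """Approximate the path length along the spiral by summing short chords."""
    total, prev = 0.0, embed(s_start)
    for k in range(1, steps + 1):
        cur = embed(s_start + (s_end - s_start) * k / steps)
        total += euclidean(prev, cur)
        prev = cur
    return total

a, b = 2 * math.pi, 4 * math.pi   # two positions exactly one full turn apart
straight_line = euclidean(embed(a), embed(b))
along_manifold = distance_along_strip(a, b)
print(f"straight line: {straight_line:.2f}, along the strip: {along_manifold:.2f}")
```

The straight-line distance cuts across the gap between turns of the spiral, while the distance along the strip has to travel the whole way around, so the two can differ by a large factor.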
Consider a binary classification problem between images of cats and dogs. In deep learning applications, an image is just a big tensor of values indicating the colors of pixels. This space of images is our high-dimensional space.
Based on color values, images of cats and dogs are scattered in an incomprehensible distribution in high-dimensional Euclidean space (the crumpled paper ball), where, unlike our moons dataset, there are no clear clusters. Therefore, it is helpful to assume there exists a lower-dimensional manifold (the sheet of paper) such that the idea of distance is representative of semantic meaning—in this case, where datapoints of cats are clustered near cats, and dogs are clustered near dogs.
Why is this valuable?
In deep learning applications, learning to untangle the distribution of high-dimensional images of cats and dogs is difficult. Therefore, based on the manifold assumption, our model can learn the function mapping images in Euclidean space to representations on our low-dimensional manifold. Here, our cluster and continuity assumptions are more reliable, and we can classify a datapoint based on its learned representation.
As a conduit to our original assumptions, the manifold assumption helps harness semi-supervised techniques in deep learning settings.
Now, let’s introduce some implementations of semi-supervised learning intuition.
The core motivation of using consistency regularization is to take advantage of the continuity and cluster assumptions.
In the semi-supervised setting, let’s say we have a dataset with both labeled and unlabeled examples of two classes.
During training, we handle labeled and unlabeled datapoints differently: for points with labels, we optimize using traditional supervised learning, calculating loss by comparing our prediction to our label; for unlabeled points, we want to enforce that—on our low-dimensional manifold—similar datapoints have similar predictions.
But how can we enforce this consistency?
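Consider a dataset D made up of labeled examples, each an input x paired with a label y, and unlabeled examples, each an input x alone.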
With augmentations, we can create artificially similar datapoints.
Consider a function Augment(x) that slightly alters x. We need to make sure our model outputs similar predictions for an augmented datapoint, Augment(x), and its original counterpart, x. Returning to our moons dataset in Figure 5, see the highlighted unlabeled datapoints as examples of x; see the black circles representing the area of potential Augment(x) points.
For a given image x, our model should make similar predictions for all datapoints in the radius of potential Augment(x). In practice, this works by introducing both a supervised and unsupervised loss term. Some of the most popular implementations of consistency regularization are Pi-Models and temporal ensembling, proposed by Laine and Aila in Temporal Ensembling for Semi-supervised Learning.
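Given CrossEntropy, a popular supervised loss function, and a model f, Laine and Aila’s loss has, in simplified form, two parts: for a labeled pair (x, y), the supervised term CrossEntropy(f(x), y); and for an unlabeled point x, the unsupervised term ||f(Augment(x)) - f(x)||², the squared L2 distance between the two predictions.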
Optimizing this loss for unlabeled datapoints pulls the prediction for any Augment(x) toward the prediction for its original x, with the distance between the two measured by the L2-norm. By minimizing the distance between predictions of similar datapoints, we’ll find a decision boundary consistent with our continuity and cluster assumptions.
The unsupervised loss term directly encourages a model to assign similar datapoints to the same class; and, if model predictions are consistent for a certain radius around each datapoint x, a decision boundary is forced away from high-density clusters of data.
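To make the two loss terms concrete, here is a minimal sketch of a combined loss that follows the description above: cross-entropy for labeled points and a squared L2 distance between f(Augment(x)) and f(x) for unlabeled points. This is not Laine and Aila's exact formulation; the published method has further details (such as how the unsupervised term is weighted during training) that are not shown. The toy model, the Gaussian-noise `augment`, and the `unsup_weight` and `noise_scale` parameters are all assumptions for illustration.

```python
import math
import random

def toy_model(x, w=(2.0, -1.0), b=0.0):
    """Hypothetical stand-in for f: logistic regression returning [P(class 0), P(class 1)]."""
    z = w[0] * x[0] + w[1] * x[1] + b
    p1 = 1.0 / (1.0 + math.exp(-z))
    return [1.0 - p1, p1]

def cross_entropy(probs, label):
    return -math.log(max(probs[label], 1e-12))

def squared_l2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def augment(x, rng, noise_scale):
    """Assumed augmentation: add a little Gaussian noise to each feature."""
    return [v + rng.gauss(0.0, noise_scale) for v in x]

def semi_supervised_loss(f, batch, rng, unsup_weight=1.0, noise_scale=0.05):
    """batch is a list of (x, y) pairs, with y set to None for unlabeled points."""
    total = 0.0
    for x, y in batch:
        if y is not None:
            # Labeled point: traditional supervised loss.
            total += cross_entropy(f(x), y)
        else:
            # Unlabeled point: predictions for x and Augment(x) should agree.
            total += unsup_weight * squared_l2(f(augment(x, rng, noise_scale)), f(x))
    return total / len(batch)

rng = random.Random(0)
batch = [([1.0, 0.5], 1), ([-1.0, 0.2], 0), ([0.3, 0.4], None), ([-0.2, -0.1], None)]
loss = semi_supervised_loss(toy_model, batch, rng)
print(f"combined loss: {loss:.4f}")
```

In a real training loop this value would be minimized with gradient descent in a deep learning framework; the sketch only shows how each datapoint contributes to the total.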
On GitHub, there is an excellent implementation of the combined supervised-unsupervised loss for temporal ensembling by Johan Ferret (@ferretj), which you can check out here: Temporal Ensembling.
You can explore other popular forms of consistency regularization here:
Pseudo-labelling is a technique where, during training, model predictions are converted into “one-hot” labels.
For example, in our moons dataset classifier, consider a datapoint that our model predicts is blue with a probability of 0.75. As a pseudo-label, that prediction becomes blue with a probability of 1.
All confident model predictions are converted into “one-hot” vectors in this way, where the most confident class becomes the label. We then train on the new “one-hot” probability distribution as a pseudo-label.
Not only are we able to create artificial labels, but training over pseudo-labels is also a form of entropy minimization: the model’s predictions are encouraged to be high-confidence on unlabeled datapoints. In addition, by accepting certain predictions as truth, we are not learning general rules about the true data distribution (inductive learning). Instead, pseudo-labels offer a form of transductive learning: reasoning from given training data to other specific test data.
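The conversion from a prediction to a pseudo-label can be sketched as follows. The 0.7 confidence threshold and the function names are assumptions for illustration; the text above only says that confident predictions are converted. The entropy helper shows why this counts as entropy minimization: the one-hot target carries less uncertainty than the original prediction.

```python
import math

def pseudo_label(probs, threshold=0.7):
    """Return a one-hot pseudo-label if the top class is confident enough, else None."""
    top = max(range(len(probs)), key=lambda k: probs[k])
    if probs[top] < threshold:
        return None  # not confident enough to trust as a label
    return [1.0 if k == top else 0.0 for k in range(len(probs))]

def entropy(probs):
    """Shannon entropy in nats; zero-probability classes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

prediction = [0.75, 0.25]          # [P(blue), P(other class)]
label = pseudo_label(prediction)   # one-hot vector with blue as the label
print(label, entropy(prediction), entropy(label))
```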
Proposed by Dong-Hyun Lee in Pseudo-label, pseudo-labels are helpful alone, but in combination with other techniques like consistency regularization, they can help achieve state-of-the-art results.
There’s an implementation of pseudo-labels on GitHub you can check out here: Pseudo-labels.
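Here’s the gist.
In a generic semi-supervised algorithm, given a dataset of labeled and unlabeled data, examples are handled in one of two ways: labeled datapoints are trained on with traditional supervision, and unlabeled datapoints are trained on with an unsupervised objective such as consistency regularization or pseudo-labels.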
Some of the papers and implementations with the best results have taken holistic approaches, utilizing many techniques in a single algorithm.
For example, FixMatch uses both consistency-regularization and pseudo-labels; and MixMatch uses a combination of mixup operations and label sharpening to train on both labeled and unlabeled data.
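As a simplified sketch of how these pieces combine, the first function below follows the FixMatch idea for a single unlabeled example: take the prediction on a weakly augmented view, keep it as a pseudo-label only if it is confident, and apply cross-entropy to the prediction on a strongly augmented view. The second function shows the kind of temperature-based label sharpening MixMatch uses. The function names are hypothetical, the inputs are assumed to be probability vectors already produced by a model, and the default threshold and temperature are common choices rather than requirements.

```python
import math

def fixmatch_unlabeled_loss(weak_probs, strong_probs, threshold=0.95):
    """Cross-entropy between the pseudo-label from the weak view and the
    prediction on the strong view; zero if the weak view is not confident."""
    top = max(range(len(weak_probs)), key=lambda k: weak_probs[k])
    if weak_probs[top] < threshold:
        return 0.0
    return -math.log(max(strong_probs[top], 1e-12))

def sharpen(probs, temperature=0.5):
    """Raise each probability to 1/temperature and renormalize,
    which pushes the distribution toward its most likely class."""
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]

confident = fixmatch_unlabeled_loss([0.97, 0.03], [0.60, 0.40])
skipped = fixmatch_unlabeled_loss([0.80, 0.20], [0.60, 0.40])
sharpened = sharpen([0.75, 0.25])
print(confident, skipped, sharpened)
```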
The field of semi-supervised learning has been borrowing from other cutting-edge research areas. In their paper Big Self-Supervised Models are Strong Semi-Supervised Learners, Chen et al. proposed using unlabeled data to train a large task-agnostic unsupervised model, fine-tuning it with label-supervision, then returning to unlabeled data to perform self-training on a task-specific model.
Learn more about self-supervised learning in The Beginner's Guide to Self-Supervised Learning, and you can read about the most recent state-of-the-art papers and results for many applications of semi-supervised learning on Papers With Code.
There are also paradigms within the semi-supervised setting like active learning, which aims to identify which unlabeled points are most valuable to be labeled by a human in the loop. As we build datasets to deploy AI into the real world, techniques like these can be crucial to cover edge cases and achieve deployment-ready benchmarks.
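One common way to decide which unlabeled points are worth sending to a human is uncertainty sampling: rank points by how unsure the model is and label the most uncertain first. The sketch below uses predictive entropy as the uncertainty score. This is only one of several active learning strategies, and the function names and example predictions are made up for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; higher means the model is less certain."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def most_uncertain(predictions, k):
    """Return the indices of the k predictions with the highest entropy."""
    ranked = sorted(range(len(predictions)),
                    key=lambda i: entropy(predictions[i]),
                    reverse=True)
    return ranked[:k]

# Hypothetical model outputs for four unlabeled points.
predictions = [[0.98, 0.02], [0.55, 0.45], [0.80, 0.20], [0.50, 0.50]]
to_label = most_uncertain(predictions, k=2)
print(to_label)
```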
Here's a quick summary of everything we've covered:
One of the most difficult problems in machine learning is building a dataset. Labeled data is expensive; unlabeled data is cheap. Using both types of data to expose your model to as much of the target sample space as possible is incredibly powerful, and it can help you achieve high accuracy with only a small fraction of labeled data.