Humans can recognize new object classes from very few instances. However, most machine learning techniques require thousands of examples to achieve similar performance.
In the past decade, computer vision researchers have primarily focused on solving generic tasks using millions of images. This, however, has tied model performance tightly to the sheer quantity of training data.
Therefore, researchers have developed ‘Few-Shot Learning’ to mitigate the data scarcity issue, focusing on training models on far less data without compromising their performance.
This guide will help you understand everything you need to know about Few-Shot Learning in a couple of minutes.
Here’s what we’ll cover:
And hey—
If you are searching for tools to annotate your data and train your ML models, we've got you covered!
Head over to our Open ML Datasets repository, pick a dataset, upload it to V7, and start annotating data to train your neural networks in one place. Have a look at these resources to get started:
Train ML models and solve any computer vision task faster with V7.
Don't start empty-handed. Explore our repository of 500+ open datasets and test-drive V7's tools.
Let's begin!
Few-Shot Learning is an example of meta-learning: a learner is trained on several related tasks during the meta-training phase, so that it can generalize well to unseen (but related) tasks, given just a few examples, during the meta-testing phase.
Few-shot training stands in contrast to traditional methods of training machine learning models, where a large amount of training data is typically used. Few-shot learning is used primarily in Computer Vision.
In practice, few-shot learning is useful when training examples are hard to find (e.g., cases of a rare disease) or the cost of data annotation is high.
The importance of Few-Shot Learning
Few-shot learning uses the N-way-K-shot classification approach: discriminating between N classes given only K examples of each.
Conventional methods will not work here, as modern classification models have far more parameters than available training examples and will generalize poorly.
If the data is insufficient to constrain the problem, then one possible solution is to learn from the experience of other similar problems. To this end, most approaches characterize few-shot learning as a meta-learning problem.
A shot is nothing more than a single example available for training, so in N-shot learning we have N training examples per class.
In the N-way-K-shot setting, we have K labeled images for each of N classes, i.e., N × K total examples, which we call the support set S. We also have to classify a query set Q, in which each example belongs to one of the N classes. For instance, a 5-way-1-shot task has a support set of just five images, one per class.
N-shot learning has mainly three sub-fields:
Zero-Shot Learning aims to classify samples of classes never seen during training. Given a general idea of an object's attributes, appearance, properties, and functionality, a model can recognize it without a single labeled example.
In the One Shot Learning problem, we have a single sample of each class.
Few-Shot Learning typically has two to five samples per class, making it a more flexible version of One-Shot Learning.
Now, let's discuss how Few-Shot Learning works in more detail.
We use one set of classification problems to help solve other, previously unseen ones.
Here, each task mimics the few-shot scenario, so for N-way-K-shot classification, each task includes N classes with K examples.
In the classical learning framework, we learn how to classify from training examples and evaluate the results on test data. In the meta-learning framework, we learn how to classify from a set of training tasks and evaluate on a set of test tasks.
N classes with K examples are known as the support set for the task and are used for learning how to solve this task.
In addition, there are further examples of the same classes, known as the query set, which are used to evaluate performance on this task. Tasks can be completely non-overlapping; we may never see the classes from one task in any of the others.
In the classic paradigm, we have a specific task, and an algorithm is learning if its performance on that task improves with experience.
In the meta-learning paradigm, we have a set of tasks, and an algorithm is learning to learn if its performance improves with experience and with the number of tasks it has seen. Such an algorithm is called a meta-learning algorithm.
Let’s assume we have a test task TEST. We will train our Meta Learning algorithm on a batch of training tasks TRAIN. Training experience gained from attempting to solve TRAIN tasks will be used to solve the TEST task.
Training an FSL task has a set sequence of steps. Imagine we have a classification problem, as we mentioned before. To start, we need to choose a base dataset. Choosing a quality base dataset is crucial.
In the N-way-K-Shot classification problem, we have a large base dataset that we’ll use as a Meta Learning training set (TRAIN).
The meta-training process will run for a finite number of episodes. We form an episode like this:
1. Sample N classes from TRAIN.
2. For each class, sample K support examples and Q query examples.
3. Have the model classify the query examples using the support set; the resulting loss drives the parameter update.
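To make this concrete, here is a minimal episode sampler in Python, assuming a labeled base dataset stored as NumPy arrays; the function name and default sizes are illustrative choices, not from any particular library:

```python
# A minimal N-way-K-shot episode sampler sketch (NumPy).
import numpy as np

def sample_episode(images, labels, n_way=5, k_shot=1, q_queries=5, rng=None):
    """Form one episode: N classes with K support and Q query examples each."""
    rng = rng or np.random.default_rng()
    # Step 1: sample N classes from the base dataset
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    support, query = [], []
    for new_label, c in enumerate(classes):
        # Step 2: sample K support and Q query examples of this class
        idx = rng.permutation(np.flatnonzero(labels == c))[:k_shot + q_queries]
        support += [(images[i], new_label) for i in idx[:k_shot]]
        query += [(images[i], new_label) for i in idx[k_shot:]]
    # Step 3 happens outside: the model learns from `support`
    # and is evaluated (and updated) on `query`.
    return support, query
```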
Approaches to meta-learning are diverse, and there is no single best approach. However, there are three distinct ways, each of which exploits a different type of prior knowledge:
Prior knowledge about similarity: ML models try to learn embeddings in training tasks that tend to separate different classes even when they are unseen.
Prior knowledge about learning: ML models use prior knowledge to constrain the learning algorithm to choose parameters that generalize well from a few examples.
Prior knowledge of data: ML models exploit prior knowledge about the structure and variability of the data, which enables constructing viable models from a few examples.
The data-level approach is based on a simple concept: if you don't have enough data to build a reliable model and avoid overfitting or underfitting, you should add more data.
Many FSL problems are solved by using additional information from a large base dataset. The key feature of the base dataset is that it doesn’t have classes that we have in our support set for the Few-Shot task. For example, if we want to classify a specific bird species, the base dataset can have images of many other birds.
We can also produce more data ourselves. To reach this goal, we can use data augmentation or even generative adversarial networks (GANs).
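For instance, here is a small torchvision augmentation pipeline, one common way to synthesize extra views of each support image; the specific transforms and parameter values are illustrative choices, not a prescribed recipe:

```python
# A minimal data augmentation sketch for few-shot support images.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(84, scale=(0.8, 1.0)),  # random crop and resize
    transforms.RandomHorizontalFlip(),                   # mirror half the time
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.ToTensor(),
])
# Applying `augment` repeatedly to each support image yields several
# slightly different views, enlarging the effective support set.
```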
From the parameter-level point of view, it's relatively easy to overfit on few-shot samples, as a model's high-dimensional parameter space has more than enough capacity to memorize a handful of examples.
To overcome this problem, we can limit the parameter space and use regularization and appropriate loss functions, so that the model generalizes from the limited number of training samples.
Alternatively, we can enhance performance by guiding the model through the vast parameter space. A standard optimization algorithm might not give reliable results with so little training data.
That is why, at the parameter level, we train the model to find the best route through the parameter space to reach optimal prediction results.
Next, let us briefly describe the most prominent Few-Shot Image Classification algorithms.
MAML, short for Model-Agnostic Meta-Learning, was inspired by a simple question: how much data is needed to learn something new? Can we teach algorithms to learn how to learn?
Meta-learning algorithms like MAML can be designed to address tasks such as few-shot classification, few-shot regression, and fast adaptation in reinforcement learning.
Before explaining how to train MAML (meta-training), let's define what we expect at meta-test time. Suppose we have found a good initialization parameter θ from which we can perform efficient, one-shot adaptation.
Given a new task, the new parameter θ’, obtained by gradient descent, should perform well on the new task. The figure below illustrates how MAML should work at meta-test time. We are looking for a pretrained parameter that can reach near-optimal parameters for every task in one (or a few) gradient step(s).
The meta-training algorithm is divided into two parts: an inner loop that adapts θ to each training task with a few gradient steps on its support set, and an outer loop that updates θ itself so that the adapted parameters perform well on the corresponding query sets.
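Below is a minimal sketch of that two-level loop in PyTorch 2.x, assuming a generic classification model and a batch of (support, query) tasks; the helper names and hyperparameters are illustrative, and this is a simplified take on MAML rather than the paper's exact implementation:

```python
# A minimal MAML-style meta-training sketch (PyTorch 2.x).
import torch
import torch.nn.functional as F

def inner_adapt(model, support_x, support_y, inner_lr=0.01, steps=1):
    """Inner loop: a few gradient steps on the support set yield
    task-adapted parameters theta'."""
    params = dict(model.named_parameters())
    for _ in range(steps):
        logits = torch.func.functional_call(model, params, (support_x,))
        loss = F.cross_entropy(logits, support_y)
        # create_graph=True keeps the graph for second-order meta-gradients
        grads = torch.autograd.grad(loss, list(params.values()), create_graph=True)
        params = {name: p - inner_lr * g
                  for (name, p), g in zip(params.items(), grads)}
    return params

def meta_train_step(model, meta_opt, tasks):
    """Outer loop: update theta so that adapted parameters do well
    on each task's query set."""
    meta_opt.zero_grad()
    meta_loss = 0.0
    for support_x, support_y, query_x, query_y in tasks:
        adapted = inner_adapt(model, support_x, support_y)
        logits = torch.func.functional_call(model, adapted, (query_x,))
        meta_loss = meta_loss + F.cross_entropy(logits, query_y)
    (meta_loss / len(tasks)).backward()  # meta-backpropagation through theta'
    meta_opt.step()
```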
MAML currently doesn't work as well as metric learning algorithms on popular few-shot image classification benchmarks. It is also quite hard to train: there are two levels of training, so the hyperparameter search is much more complex.
Plus, the meta-backpropagation involves computing gradients of gradients (second-order derivatives), so you have to resort to first-order approximations to train it on standard GPUs. For these reasons, you would probably rather use metric learning algorithms for your computer vision projects at home or at work.
Prototypical networks are based on the concept that there exists an embedding in which several points cluster around a single prototype representation for each class. It aims to learn per-class prototypes based on sample averaging in the feature space.
Prototypical networks compute an M-dimensional representation, or prototype, for each class through an embedding function with learnable parameters. Each prototype is the mean vector of the embedded support points belonging to its class.
Prototypical networks are more efficient than the recent meta-learning algorithms, making them an appealing approach to few-shot and zero-shot learning.
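Here is a minimal sketch of the core computation, assuming `embed` is any learnable embedding network; the names and shapes are illustrative:

```python
# A minimal Prototypical Networks sketch (PyTorch).
import torch

def prototypical_logits(embed, support_x, support_y, query_x, n_classes):
    z_support = embed(support_x)   # (N*K, M) embedded support points
    z_query = embed(query_x)       # (Q, M) embedded queries
    # Each prototype is the mean of the embedded support points of its class
    prototypes = torch.stack([
        z_support[support_y == c].mean(dim=0) for c in range(n_classes)
    ])                             # (N, M)
    # Classify queries by negative Euclidean distance to each prototype;
    # a softmax over these logits gives class probabilities.
    dists = torch.cdist(z_query, prototypes)  # (Q, N)
    return -dists
```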
Matching Networks were the first model to be trained and tested on N-way-K-shot tasks. The appeal is straightforward: training and evaluating on the same kind of tasks lets us optimize for the target task end-to-end. The Matching Networks paper develops the novel idea of a fully differentiable nearest-neighbors algorithm.
Matching Networks, based on deep neural networks, combine embedding and classification to form an end-to-end differentiable nearest-neighbors classifier.
They first embed a high-dimensional sample into a low-dimensional space and then perform a generalized form of nearest-neighbor classification.
The embedding function used for few-shot classification is a CNN, which is differentiable, making the attention mechanism and the whole Matching Network fully differentiable. This makes it straightforward to fit the entire model end-to-end with standard methods such as stochastic gradient descent.
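A minimal sketch of this attention-based classification follows, again assuming an embedding network `embed`; this is the basic cosine-attention formulation without the paper's full context embeddings:

```python
# A minimal Matching Networks-style classifier sketch (PyTorch).
import torch
import torch.nn.functional as F

def matching_probs(embed, support_x, support_y, query_x, n_classes):
    z_s = F.normalize(embed(support_x), dim=1)   # (N*K, M) unit-norm support
    z_q = F.normalize(embed(query_x), dim=1)     # (Q, M) unit-norm queries
    # Cosine-similarity attention over the support set
    attention = F.softmax(z_q @ z_s.t(), dim=1)  # (Q, N*K)
    # Each query's prediction is an attention-weighted vote over support labels
    one_hot = F.one_hot(support_y, n_classes).float()
    return attention @ one_hot                   # (Q, N) class probabilities
```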
In a Relation Network (RN), the distance function is not defined in advance but learned by the algorithm: the RN has a dedicated relation module that does exactly this. If you want to learn more, check out the paper.
The overall structure is as follows: the relation module sits on top of the embedding module, which is the part that computes embeddings and class prototypes from input images.
The relation module is fed the concatenation of a query image's embedding with each class prototype, and it outputs a relation score for each pair. Applying a softmax to the relation scores gives a prediction.
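Here is a minimal sketch of such a relation module in PyTorch; the paper uses a convolutional relation module over feature maps, so this small MLP over concatenated vectors is an illustrative stand-in:

```python
# A minimal relation module sketch (PyTorch).
import torch
import torch.nn as nn

class RelationModule(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        # A learned "distance": scores how related two embeddings are
        self.net = nn.Sequential(
            nn.Linear(2 * emb_dim, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),  # relation score in [0, 1]
        )

    def forward(self, query_emb, prototypes):
        # query_emb: (Q, M); prototypes: (N, M)
        q = query_emb.unsqueeze(1).expand(-1, prototypes.size(0), -1)
        p = prototypes.unsqueeze(0).expand(query_emb.size(0), -1, -1)
        pairs = torch.cat([q, p], dim=-1)   # (Q, N, 2M) concatenated pairs
        return self.net(pairs).squeeze(-1)  # (Q, N) relation scores
```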
Few-shot object detection aims to generalize on novel objects using limited supervision and annotated samples.
Most FSOD approaches divide classes into two non-overlapping sets during training: base classes and novel classes.
The training dataset contains only base classes and is used to train the baseline model. The model is then fine-tuned on a combined dataset of base and novel classes. The last stage is testing on a dataset composed of only novel classes.
Two popular few-shot object detection tasks are used for benchmarking: MS-COCO on 10-shot and MS-COCO on 30-shot. Let's look at the top 3 models for each of these tasks:
Depending on the task, these three algorithms outperform others. However, there is a massive gap in accuracy between classic object detection tasks and few-shot object detection.
Faster R-CNN is modified for few-shot object detection. It consists of three blocks: a backbone that extracts image features, a Region Proposal Network (RPN) that generates candidate object regions, and an RCNN head that classifies each proposal and refines its bounding box.
In the Decoupled Faster R-CNN (DeFRCN) architecture for few-shot object detection, two Gradient Decoupled Layers and an offline Prototypical Calibration Block are inserted into the standard Faster R-CNN framework, decoupling multi-stage training and multi-task learning, respectively.
Existing FSOD systems largely follow few-shot classification (FSC) approaches, neglecting the problems of spatial misalignment and information entanglement, which results in low performance.
The paper proposes a novel Dual-Awareness Attention (DAnA), which captures the pairwise spatial relationship across the support and query images.
The generated query-position-aware (QPA) support features are robust to spatial misalignment and capable of guiding the detection network precisely. The DAnA component adapts to various object detection networks and enhances FSOD performance by paying attention to specific semantics conditioned on the query.
Experimental results demonstrate that DAnA significantly boosts few-shot object detection performance on the COCO benchmark (a relative gain of +6.9 AP). Equipped with DAnA, conventional object detection models such as Faster R-CNN and RetinaNet, which are not explicitly designed for few-shot learning, reach state-of-the-art performance on FSOD tasks.
Few Shot Learning has applications in a wide array of AI tasks.
Few-shot learning enables natural language processing (NLP) applications including:
Few-shot learning is used mainly in machine vision to deal with problems such as:
Data that contains information regarding voices/sounds can be analyzed by acoustic signal processing, and few-shot learning can enable the deployment of the following tasks:
Below is a curated list of some of the most cited and acknowledged research work in the Few Shot Learning domain.
💡 Read next:
A Step-by-Step Guide to Text Annotation [+Free OCR Tool]
The Complete Guide to CVAT - Pros & Cons
9 Essential Features for a Bounding Box Annotation Tool
9 Reinforcement Learning Real-Life Applications
Mean Average Precision (mAP) Explained: Everything You Need to Know
The Beginner's Guide to Deep Reinforcement Learning
Building AI products? This guide breaks down the A to Z of delivering an AI success story.