Here's the thing—
Collecting a large amount of data when tackling a completely new task can be challenging, to say the least.
Obtaining satisfactory model performance (think: model accuracy) using only a limited amount of data for training is also tricky... if not impossible.
Fortunately, there is a solution that addresses this very problem, and it goes by the name of Transfer Learning.
It's almost too good to be true because its idea is very simple: You can train the model with a small amount of data and still achieve a high level of performance.
Pretty cool, right?
Well, read on because we are about to give you the best explanation of how it all works.
Now, let's dive in!
Transfer learning is a machine learning method where we reuse a pre-trained model as the starting point for a model on a new task.
To put it simply—a model trained on one task is repurposed on a second, related task as an optimization that allows rapid progress when modeling the second task.
By applying transfer learning to a new task, one can achieve significantly higher performance than training with only a small amount of data.
Transfer learning is so common that it is rare to train a model for image- or natural-language-processing-related tasks from scratch.
AlexNet and Inception, typically pre-trained on the ImageNet dataset, are classic examples of models used as the basis for transfer learning.
Deep learning experts introduced transfer learning to overcome the limitations of traditional machine learning models.
Let's have a look at the differences between the two types of learning.
1. Traditional machine learning models require training from scratch, which is computationally expensive and requires a large amount of data to achieve high performance. On the other hand, transfer learning is computationally efficient and helps achieve better results using a small data set.
2. Traditional ML has an isolated training approach where each model is independently trained for a specific purpose, without any dependency on past knowledge. Contrary to that, transfer learning uses knowledge acquired from the pre-trained model to proceed with the task. To paint a better picture of it:
For instance, a model pre-trained on ImageNet may transfer poorly to biomedical images, because ImageNet contains few images from the biomedical field.
3. Transfer learning models achieve optimal performance faster than traditional ML models. This is because models that leverage knowledge (features, weights, etc.) from previously trained models already understand the relevant features, which makes training faster than building neural networks from scratch.
Different transfer learning strategies and techniques are applied based on the domain of the application, the task at hand, and the availability of data.
Before deciding on a transfer learning strategy, it is crucial to have answers to the following questions:
Traditionally, transfer learning strategies fall under three major categories depending upon the task domain and the amount of labeled/unlabeled data present.
Let's explore them in more detail.
Inductive Transfer Learning requires the source and target domains to be the same, though the specific tasks the model is working on are different.
The algorithms try to use the knowledge from the source model and apply it to improve the target task. The pre-trained model already has expertise on the features of the domain and is at a better starting point than if we were to train it from scratch.
Inductive transfer learning is further divided into two subcategories depending upon whether the source domain contains labeled data or not. These include multi-task learning and self-taught learning, respectively.
The Transductive Transfer Learning strategy applies to scenarios where the domains of the source and target tasks are not exactly the same but are interrelated. One can derive similarities between the source and target tasks. These scenarios usually have a lot of labeled data in the source domain, while the target domain has only unlabeled data.
Unsupervised Transfer Learning is similar to Inductive Transfer learning. The only difference is that the algorithms focus on unsupervised tasks and involve unlabeled datasets both in the source and target tasks.
Now, we'll go through another way of categorizing transfer learning strategies based on the similarity of the domain and independent of the type of data samples present for training.
Let's dive in.
Homogeneous Transfer Learning approaches are developed to handle situations where the source and target domains share the same feature space.
In Homogeneous Transfer learning, domains have only a slight difference in marginal distributions. These approaches adapt the domains by correcting the sample selection bias or covariate shift.
Here's the breakdown.
It covers a simple scenario in which there is a large amount of labeled data in the source domain and a limited amount in the target domain. Both domains share the same feature space and differ only in their marginal distributions.
For example, suppose we need to build a model to diagnose cancer in a specific region where the elderly are the majority. Limited target-domain instances are given, and relevant data are available from another region where young people are the majority. Directly transferring all the data from another region may be unsuccessful since the marginal distribution difference exists, and the elderly have a higher risk of cancer than younger people.
In this scenario, it is natural to consider adapting the marginal distributions. Instance-based Transfer learning reassigns weights to the source domain instances in the loss function.
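Continuing the cancer-diagnosis example, here is a minimal sketch of instance-based re-weighting in Python. The Gaussian age distributions, the weight formula (a density ratio between the target and source marginals), and the weighted loss are all illustrative assumptions, not a prescribed algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
src_x = rng.normal(30, 5, 200)   # source domain: mostly young patients (age)
tgt_mu, src_mu, sigma = 70, 30, 5  # target region skews elderly

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Importance weights ~ p_target(x) / p_source(x): source instances that
# resemble the target population (the elderly) get larger weights.
weights = gaussian_pdf(src_x, tgt_mu, sigma) / gaussian_pdf(src_x, src_mu, sigma)
weights /= weights.mean()        # normalize so the loss keeps its scale

def weighted_log_loss(y_true, y_prob, w):
    """Logistic loss with per-instance weights on the source samples."""
    eps = 1e-9
    return -np.mean(w * (y_true * np.log(y_prob + eps)
                         + (1 - y_true) * np.log(1 - y_prob + eps)))

# Dummy call: all-negative labels, uninformative 0.5 predictions.
loss = weighted_log_loss(np.zeros_like(src_x), np.full_like(src_x, 0.5), weights)
```

In practice the density ratio is usually estimated rather than computed from known distributions, but the re-weighted loss has the same shape.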
The parameter-based transfer learning approaches transfer the knowledge at the model/parameter level.
This approach involves transferring knowledge through the shared parameters of the source and target domain learner models. One way to transfer the learned knowledge can be by creating multiple source learner models and optimally combining the re-weighted learners similar to ensemble learners to form an improved target learner.
The idea behind parameter-based methods is that a well-trained model on the source domain has learned a well-defined structure, and if two tasks are related, this structure can be transferred to the target model. In general, there are two ways to share the weights in deep learning models: soft weight sharing and hard weight sharing.
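Hard weight sharing can be sketched in a few lines: a single hidden layer is shared between two task-specific heads, so its parameters are transferred between tasks by construction. All shapes and names here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Shared trunk (hard weight sharing) feeding two task-specific heads.
W_shared = rng.normal(size=(4, 8))   # feature extractor shared by both tasks
W_task_a = rng.normal(size=(8, 3))   # head for task A (say, 3-way classification)
W_task_b = rng.normal(size=(8, 1))   # head for task B (say, regression)

def forward(x):
    h = np.tanh(x @ W_shared)          # one shared representation...
    return h @ W_task_a, h @ W_task_b  # ...consumed by both heads

x = rng.normal(size=(5, 4))          # a batch of 5 examples
out_a, out_b = forward(x)
```

Soft weight sharing, by contrast, keeps separate parameters per task and adds a penalty that pulls them toward each other.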
Feature-based approaches transform the original features to create a new feature representation. This approach can further be divided into two subcategories, i.e., asymmetric and symmetric Feature-based Transfer Learning.
Relational-based transfer learning approaches focus on learning the relations within the source domain and transferring that relational knowledge to the target domain for use in the current context.
Such approaches transfer the logical relationship or rules learned in the source domain to the target domain.
For example, relationships learned between different elements of speech in a male voice can significantly help analyze sentences spoken in another voice.
Transfer learning involves deriving representations from a previous network to extract meaningful features from new samples for an interrelated task. However, these approaches do not account for differences in the feature spaces of the source and target domains.
It is often challenging to collect labeled source domain data with the same feature space as the target domain, and Heterogeneous Transfer learning methods are developed to address such limitations.
This technique aims to solve the issue of source and target domains having differing feature spaces and other concerns like differing data distributions and label spaces. Heterogeneous Transfer Learning is applied in cross-domain tasks such as cross-language text categorization, text-to-image classification, and many others.
Finally, let's discuss Transfer Learning in the context of Deep Learning.
Domains like natural language processing and image recognition are considered to be the hot areas of research for transfer learning. There are also many models that achieved state-of-the-art performance.
These pre-trained neural networks/models form the basis of transfer learning in the context of deep learning and are referred to as deep transfer learning.
To understand the flow of deep learning models, it's essential to understand what they are made up of.
Deep learning systems are layered architectures that learn different features at different layers. Initial layers learn generic, low-level features (such as edges and textures), which become increasingly fine-grained and task-specific as we go deeper into the network.
These layers are finally connected to a last layer (usually a fully connected layer, in the case of supervised learning) to produce the final output. This opens up the scope of using popular pre-trained networks (such as the Oxford VGG Model, Google Inception Model, and Microsoft ResNet Model) without their final layer as fixed feature extractors for other tasks.
The key idea here is to leverage the pre-trained model's weighted layers to extract features, but not update the model's weights during training with new data for the new task.
The pre-trained models are trained on a large and general enough dataset and will effectively serve as a generic model of the visual world.
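As a toy illustration of the fixed-feature-extractor idea, the sketch below trains only a new linear head while the base weights stay frozen. A small random matrix stands in for a real pre-trained network such as VGG or ResNet; the data and dimensions are made up:

```python
import numpy as np

rng = np.random.default_rng(2)

# A random matrix stands in for the pre-trained base; we never update it.
W_base = rng.normal(size=(10, 16))
W_base_before = W_base.copy()                # kept to verify it stays frozen
W_head = np.zeros((16, 2))                   # fresh trainable output layer

def features(x):                             # frozen feature extractor
    return np.maximum(0, x @ W_base)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

x = rng.normal(size=(64, 10))                # toy 2-class dataset
y = (x[:, 0] > 0).astype(int)
Y = np.eye(2)[y]

for _ in range(300):                         # gradient steps on the head only
    f = features(x)
    p = softmax(f @ W_head)
    W_head -= 0.1 * f.T @ (p - Y) / len(x)

acc = (softmax(features(x) @ W_head).argmax(1) == y).mean()
```

Only `W_head` ever receives gradient updates; `W_base` plays the role of the downloaded, frozen network.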
This is a more engaging technique, where we do not just directly depend on the features extracted from the pre-trained models and replace the final layer but also selectively retrain some of the previous layers.
Deep neural networks are layered structures and have many tunable hyperparameters. The role of the initial layers is to capture generic features, while the later ones focus more on the explicit task at hand. It makes sense to fine-tune the higher-order feature representations in the base model to make them more relevant for the specific task. We can re-train some layers of the model while keeping some frozen in training.
An example is depicted in the following figure on an object detection task, where initial lower layers of the network learn very generic features and the higher layers learn very task-specific features.
One logical way to increase the model's performance even further is to re-train (or "fine-tune") the weights of the top layers of the pre-trained model alongside the training of the classifier you added.
This forces the weights to be updated from the generic feature maps the model learned on the source task. Fine-tuning allows the model to apply its past knowledge to the target domain and re-learn some representations.
Moreover, one should try to fine-tune a small number of top layers rather than the entire model. The first few layers learn elementary and generic features that generalize to almost all types of data.
Therefore, it's wise to freeze these layers and reuse the basic knowledge derived from the past training. As we go higher up, the features are increasingly more specific to the dataset on which the model was trained. Fine-tuning aims to adapt these specialized features to work with the new dataset, rather than overwrite the generic learning.
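This freezing scheme can be expressed as per-layer learning rates: zero for the frozen generic layers, small positive values for the fine-tuned top layers. Layer names, rates, and gradients below are purely illustrative:

```python
import numpy as np

# Per-layer learning rates: frozen layers get 0, fine-tuned top layers a
# small positive rate.
learning_rates = {"conv1": 0.0, "conv2": 0.0,   # frozen generic layers
                  "conv3": 1e-4, "head": 1e-3}  # slowly fine-tuned top layers

params = {name: np.ones(3) for name in learning_rates}
grads = {name: np.full(3, 0.5) for name in learning_rates}  # dummy gradients

for name, lr in learning_rates.items():          # one SGD step per layer
    params[name] -= lr * grads[name]
```

After the step, the frozen layers are untouched while the head has moved slightly toward the new task.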
Lastly, let us walk you through the process of how transfer learning works in practice.
The first step is to choose the pre-trained model we would like to keep as the base of our training, depending on the task. Transfer learning requires a strong correlation between the knowledge of the pre-trained source model and the target task domain for them to be compatible.
Here are some of the pre-trained models you can use:
For computer vision:
For NLP tasks:
The base model is one of the architectures, such as ResNet or Xception, selected in the first step for its close relation to our task. We can either download the pre-trained network weights, which saves the additional training time, or use only the network architecture and train the model from scratch.
There can be a case where the base model will have more neurons in the final output layer than we require in our use case. In such scenarios, we need to remove the final output layer and change it accordingly.
Freezing the starting layers from the pre-trained model is essential to avoid the additional work of making the model learn the basic features.
If we do not freeze the initial layers, we will lose all the learning that has already taken place. This will be no different from training the model from scratch and will be a loss of time, resources, etc.
The only knowledge we are reusing from the base model is the feature extraction layers. We need to add additional layers on top of them to predict the specialized tasks of the model. These are generally the final output layers.
The pre-trained model’s final output will most likely differ from the output we want for our model. For example, pre-trained models trained on the ImageNet dataset will output 1000 classes.
However, we need our model to work for two classes. In this case, we have to train the model with a new output layer in place.
One method of improving the performance is fine-tuning.
Fine-tuning involves unfreezing some part of the base model and training the entire model again on the whole dataset at a very low learning rate. The low learning rate will increase the performance of the model on the new dataset while preventing overfitting.
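Putting these steps together, here is a compact end-to-end sketch on a toy two-layer network. The "pre-trained" weights are random stand-ins for a downloaded model, and the 1000-class head mirrors the ImageNet example above; in practice you would do this with a framework such as Keras or PyTorch:

```python
import numpy as np

rng = np.random.default_rng(3)

# Step 2: "pre-trained" base weights (random stand-ins) plus the original
# 1000-class output layer, which we discard for our 2-class problem.
W1 = rng.normal(size=(8, 32)) * 0.5          # pre-trained feature layer
W_old_head = rng.normal(size=(32, 1000))     # original head: not reused

# Steps 3-4: freeze W1 and attach a fresh 2-class output layer.
W_head = np.zeros((32, 2))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def hidden(x, W):
    return np.maximum(0, x @ W)              # ReLU feature extractor

x = rng.normal(size=(128, 8))                # toy 2-class data
y = (x[:, 0] + x[:, 1] > 0).astype(int)
Y = np.eye(2)[y]

for _ in range(300):                         # phase 1: train the head only
    h = hidden(x, W1)
    p = softmax(h @ W_head)
    W_head -= 0.1 * h.T @ (p - Y) / len(x)

# Step 5: unfreeze the base and fine-tune everything at a much lower rate.
for _ in range(50):
    h = hidden(x, W1)
    p = softmax(h @ W_head)
    dh = (p - Y) @ W_head.T * (h > 0)        # backprop through the ReLU
    W1 -= 1e-3 * x.T @ dh / len(x)
    W_head -= 1e-3 * h.T @ (p - Y) / len(x)

acc = (softmax(hidden(x, W1) @ W_head).argmax(1) == y).mean()
```

Note the two learning rates: 0.1 while only the new head trains, then 1e-3 once the base is unfrozen, matching the low-learning-rate advice above.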
Domain adaptation is a transfer learning scenario where the source and target domains share a task but differ in their data distributions (and sometimes in their feature spaces).
Domain adaptation is the process of adapting one or more source domains to transfer information to improve the performance of a target learner. This process attempts to alter a source domain to bring the distribution of the source closer to that of the target.
In a neural network, different layers capture features of different complexity. In a perfect scenario, we'd develop an algorithm that makes these feature representations domain-invariant and improves their transferability across domains.
In such a context, the feature representations of the source and target domains should be as similar as possible. The goal is to add an objective at the source that encourages similarity with the target by making the two domains hard to tell apart.
Specifically, domain confusion loss is used to confuse the high-level classification layers of a neural network by matching the distributions of the target and source domains.
In the end, we want to make sure samples come across as mutually indistinguishable to the classifier. To achieve this, one has to minimize the classification loss for the source samples, and one has to also minimize the domain confusion loss for all samples.
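One common formulation of the domain confusion loss measures the cross-entropy between the domain classifier's softmax output and a uniform distribution over domains, so the loss is smallest exactly when source and target samples are indistinguishable. A minimal sketch (the probability values are made up):

```python
import numpy as np

def domain_confusion_loss(domain_probs):
    """Cross-entropy between a uniform distribution over domains and the
    domain classifier's output; minimized when the classifier cannot tell
    source samples and target samples apart."""
    n_domains = domain_probs.shape[1]
    uniform = np.full(n_domains, 1.0 / n_domains)
    return -np.mean(np.sum(uniform * np.log(domain_probs + 1e-9), axis=1))

confident = np.array([[0.99, 0.01], [0.02, 0.98]])  # domains easily separated
confused = np.array([[0.5, 0.5], [0.5, 0.5]])       # domains indistinguishable
```

Minimizing this loss with respect to the feature extractor pushes the representations toward domain invariance, while the ordinary classification loss keeps them useful for the task.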
In the case of multitask learning, several tasks from the same domain are learned simultaneously without distinction between the source and targets.
We have a set of learning tasks, t1, t2, …, tn, and we co-learn all tasks simultaneously.
This helps to transfer knowledge from each scenario and develop a rich combined feature vector from all the varied scenarios of the same domain. The learner optimizes the learning/performance across all of the n tasks through some shared knowledge.
One-shot learning is a classification task where we have one or a few examples to learn from and classify many new examples in the future.
This is the case of face recognition, where people's faces must be classified correctly with different facial expressions, lighting conditions, accessories, and hairstyles, and the model has one or a few template photos as input.
For one-shot learning, we need to fully rely on the knowledge transfer from the base model trained on a few examples we have for a class.
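A minimal sketch of this template-matching view of one-shot recognition: each identity is represented by a single embedding (which a pre-trained face network would produce), and a query is assigned to the nearest template by cosine similarity. The embeddings and names below are made-up placeholders:

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

templates = {                       # one stored embedding per known identity
    "alice": np.array([1.0, 0.1, 0.0]),
    "bob":   np.array([0.0, 1.0, 0.2]),
}

def identify(embedding):
    """Return the identity whose single template is closest to the query."""
    return max(templates, key=lambda name: cosine(templates[name], embedding))

query = np.array([0.9, 0.2, 0.1])   # embedding of a new photo
```

All the heavy lifting happens in the pre-trained embedding function; the one-shot classifier itself is just nearest-template matching.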
If transfer learning is applied with zero labeled instances of a class, so that it does not depend on labeled data samples for that class at all, the corresponding strategy is called Zero-shot learning.
Zero-shot learning relies on additional information supplied during the training phase, such as attribute descriptions or semantic embeddings, to recognize unseen classes.
Zero-shot learning models the traditional input variable, x, the traditional output variable, y, and an additional random variable describing the task. It comes in handy in scenarios such as machine translation, where we may not have labels in the target language.
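A minimal attribute-based zero-shot sketch: each class, including one never seen during training, is described by an attribute vector, and the class whose description best matches the model's predicted attributes wins. The attributes and class names are illustrative assumptions:

```python
import numpy as np

class_attributes = {                  # hypothetical attribute descriptions
    "horse": np.array([1, 0, 1]),     # [has_hooves, has_stripes, has_tail]
    "zebra": np.array([1, 1, 1]),     # a class unseen during training
}

def classify(predicted_attributes):
    """Pick the class whose attribute description is closest to the
    attributes the model predicted for the input."""
    return min(class_attributes,
               key=lambda c: np.linalg.norm(class_attributes[c]
                                            - predicted_attributes))

pred = np.array([0.9, 0.8, 1.0])      # model says: hooves, stripes, tail
```

Because classification goes through the attribute space, the model can name a class it never saw a labeled example of.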
Transfer learning helps data scientists to learn from the knowledge gained from a previously used machine learning model for a similar task.
This is the reason why this technique has now become applied in several fields we've listed below.
NLP is one of the most attractive applications of transfer learning. It uses the knowledge of pre-trained AI models that understand linguistic structures to solve cross-domain tasks. Everyday NLP tasks like next-word prediction, question answering, and machine translation use deep learning models like BERT, XLNet, ALBERT, TF Universal Model, etc.
Transfer learning is also applied in Image Processing.
Deep neural networks are used to solve image-related tasks because they work well at identifying complex features of an image. The higher (dense) layers contain the task-specific logic, so tuning them does not affect the generic feature extraction in the lower layers. Image recognition, object detection, noise removal from images, etc., are typical application areas of transfer learning, because all image-related tasks require the same basic knowledge and pattern detection of familiar image structures.
Transfer learning algorithms are used to solve Audio/Speech related tasks like speech recognition or speech-to-text translation.
When we say "Siri" or "Hey Google!", the primary AI model developed for English speech recognition is busy processing our commands at the backend.
Interestingly, a pre-trained AI model developed for English speech recognition forms the basis for a French speech recognition model.
Finally, let's do a quick recap of everything we've learned today. Here's a bullet-point summary of the things we've covered: