A Newbie-Friendly Guide to Transfer Learning

18 min read

—

Oct 12, 2021

Here's everything you need to know about classical Transfer Learning and Deep Transfer Learning. Read this guide to improve your model training and achieve better performance in less time.

Pragati Baheti

Guest Author and Software Developer

Here's the thing—

Collecting a large amount of data when tackling a completely new task can be challenging, to say the least.

However—

Obtaining satisfactory model performance (think: model accuracy) using only a limited amount of data for training is also tricky... if not impossible.

Fortunately, there is a solution that addresses this very problem, and it goes by the name of Transfer Learning.

It's almost too good to be true because its idea is very simple: You can train the model with a small amount of data and still achieve a high level of performance.

Pretty cool, right?

Well, read on because we are about to give you the best explanation of how it all works.

Here’s what we’ll cover:

What is Transfer Learning?
Traditional Machine Learning vs. Transfer Learning
Classical Transfer Learning strategies
Transfer Learning for Deep Learning
Deep Transfer Learning in 6 steps
Types of Deep Transfer Learning
Deep Transfer Learning Applications

What is Transfer Learning?

Transfer learning is a machine learning method where we reuse a pre-trained model as the starting point for a model on a new task.

To put it simply—a model trained on one task is repurposed on a second, related task as an optimization that allows rapid progress when modeling the second task.

By applying transfer learning to a new task, one can achieve significantly higher performance than by training from scratch with only a small amount of data.

Transfer learning is so common that it is rare to train a model for image or natural language processing tasks from scratch.

Instead, researchers and data scientists prefer to start from a pre-trained model that already knows how to classify objects and has learned general features like edges and shapes in images.

AlexNet and Inception, both trained on the ImageNet dataset, are typical examples of pre-trained models that serve as a basis for transfer learning.

Transfer Learning

Traditional Machine Learning vs. Transfer Learning

Deep learning experts introduced transfer learning to overcome the limitations of traditional machine learning models.

Let's have a look at the differences between the two types of learning.

1. Traditional machine learning models require training from scratch, which is computationally expensive and requires a large amount of data to achieve high performance. On the other hand, transfer learning is computationally efficient and helps achieve better results using a small data set.

Searching for high-quality data? Check out 65+ Best Free Datasets for Machine Learning

2. Traditional ML has an isolated training approach where each model is independently trained for a specific purpose, without any dependency on past knowledge. In contrast, transfer learning uses knowledge acquired from the pre-trained model to proceed with the task. That knowledge only helps when the source and target data are related. To paint a better picture of it:

A model pre-trained on ImageNet may transfer poorly to biomedical images, because ImageNet does not contain images belonging to the biomedical field.

3. Transfer learning models reach optimal performance faster than traditional ML models. This is because models that leverage knowledge (features, weights, etc.) from previously trained models already understand the features, which makes training faster than training neural networks from scratch.

Traditional Machine Learning vs. Transfer Learning

Classical Transfer Learning Strategies

Different transfer learning strategies and techniques are applied based on the domain of the application, the task at hand, and the availability of data.

Before deciding on a transfer learning strategy, it is crucial to have answers to the following questions:

Which part of the knowledge can be transferred from the source to the target to improve the performance of the target task?
When to transfer and when not to, so that one improves the target task performance/results and does not degrade them?
How to transfer the knowledge gained from the source model based on our current domain/task?

Traditionally, transfer learning strategies fall under three major categories depending upon the task domain and the amount of labeled/unlabeled data present.

Curious to learn more about labeling data? Read What is Data Labeling and How to Do It Efficiently.

Let's explore them in more detail.

Inductive Transfer Learning

Inductive Transfer Learning requires the source and target domains to be the same, though the specific tasks the model is working on are different.

The algorithms try to use the knowledge from the source model and apply it to improve the target task. The pre-trained model already has expertise on the features of the domain and is at a better starting point than if we were to train it from scratch.

Inductive transfer learning is further divided into two subcategories depending upon whether the source domain contains labeled data or not. These include multi-task learning and self-taught learning, respectively.

Transductive Transfer Learning

Scenarios where the domains of the source and target tasks are not exactly the same but are interrelated use the Transductive Transfer Learning strategy. One can derive similarities between the source and target tasks. These scenarios usually have a lot of labeled data in the source domain, while the target domain has only unlabeled data.

Unsupervised Transfer Learning

Unsupervised Transfer Learning is similar to Inductive Transfer learning. The only difference is that the algorithms focus on unsupervised tasks and involve unlabeled datasets both in the source and target tasks.

Transfer Learning strategies

Check out Supervised vs. Unsupervised Learning: What’s the Difference?

Common Approaches to Transfer Learning

Now, we'll go through another way of categorizing transfer learning strategies based on the similarity of the domain and independent of the type of data samples present for training.

Let's dive in.

Homogeneous Transfer Learning

Homogeneous Transfer learning approaches are developed and proposed to handle situations where the domains are of the same feature space.

In Homogeneous Transfer learning, domains have only a slight difference in marginal distributions. These approaches adapt the domains by correcting the sample selection bias or covariate shift.

Here's the breakdown.

Instance transfer

It covers a simple scenario in which there is a large amount of labeled data in the source domain and only a limited amount in the target domain. The two domains share a feature space and differ only in their marginal distributions.

For example, suppose we need to build a model to diagnose cancer in a specific region where the elderly are the majority. Limited target-domain instances are given, and relevant data are available from another region where young people are the majority. Directly transferring all the data from another region may be unsuccessful since the marginal distribution difference exists, and the elderly have a higher risk of cancer than younger people.

In this scenario, it is natural to consider adapting the marginal distributions. Instance-based Transfer learning reassigns weights to the source domain instances in the loss function.
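To make the reweighting idea concrete, here is a minimal sketch. It assumes we can split patients into age groups and estimate how common each group is in both regions. The function names and the toy data are made up for illustration, and real instance-transfer methods estimate these weights in more sophisticated ways.

```python
from collections import Counter
import math

def importance_weights(source_groups, target_groups):
    """Weight each source instance by p_target(group) / p_source(group)."""
    src_counts = Counter(source_groups)
    tgt_counts = Counter(target_groups)
    n_src, n_tgt = len(source_groups), len(target_groups)
    weights = []
    for group in source_groups:
        p_src = src_counts[group] / n_src
        p_tgt = tgt_counts[group] / n_tgt  # 0 if the group never appears in the target
        weights.append(p_tgt / p_src)
    return weights

def weighted_log_loss(y_true, y_prob, weights):
    """Binary cross-entropy where each instance counts according to its weight."""
    total = 0.0
    for y, p, w in zip(y_true, y_prob, weights):
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / sum(weights)

# Toy data: the source region is mostly young, the target region mostly elderly.
source_groups = ["young"] * 8 + ["elderly"] * 2
target_groups = ["young"] * 2 + ["elderly"] * 8
weights = importance_weights(source_groups, target_groups)
```

With this toy data, the elderly source patients get a much larger weight than the young ones, so the loss pays more attention to the source instances that resemble the target region.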

Parameter transfer

The parameter-based transfer learning approaches transfer the knowledge at the model/parameter level.

This approach involves transferring knowledge through the shared parameters of the source and target domain learner models. One way to transfer the learned knowledge can be by creating multiple source learner models and optimally combining the re-weighted learners similar to ensemble learners to form an improved target learner.

The idea behind parameter-based methods is that a well-trained model on the source domain has learned a well-defined structure, and if two tasks are related, this structure can be transferred to the target model. In general, there are two ways to share the weights in deep learning models: soft weight sharing and hard weight sharing.

In soft weight sharing, the model is expected to be close to the already learned features and is usually penalized if its weights deviate significantly from a given set of weights.
In hard weight sharing, we share the exact weights among different models.
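The difference between the two is easy to see in a few lines of code. The sketch below is only an illustration: the penalty is a plain squared distance, and the `strength` parameter and the example weights are assumptions, not values from any particular model.

```python
def soft_sharing_penalty(target_weights, source_weights, strength=0.1):
    """Penalty that grows as the target weights drift away from the source weights."""
    return strength * sum((t - s) ** 2 for t, s in zip(target_weights, source_weights))

def hard_share(source_weights):
    """Hard sharing: the target model reuses the exact source weights."""
    return list(source_weights)

source_w = [0.5, -1.2, 0.3]   # weights learned on the source task (made up)
close_w = [0.6, -1.1, 0.3]    # target weights that stayed near the source
far_w = [2.0, 0.0, -1.0]      # target weights that drifted far away

penalty_close = soft_sharing_penalty(close_w, source_w)
penalty_far = soft_sharing_penalty(far_w, source_w)
shared_w = hard_share(source_w)
```

In soft sharing, this penalty would be added to the target task's loss during training, so `far_w` is punished much more than `close_w`. In hard sharing, there is nothing to penalize because the weights are identical.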

Ready to train your models? Have a look at Mean Average Precision (mAP) Explained: Everything You Need to Know.

Feature-representation transfer

Feature-based approaches transform the original features to create a new feature representation. This approach can further be divided into two subcategories, i.e., asymmetric and symmetric Feature-based Transfer Learning.

Asymmetric approaches transform the source features to match the target ones. In other words, we take the features from the source domain and fit them into the target feature space. There can be some information loss in this process due to the marginal difference in the feature distribution.
Symmetric approaches find a common latent feature space and then transform both the source and the target features into this new feature representation.
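As a rough illustration of the asymmetric case, the sketch below rescales a single source feature so that its mean and spread match the target's. This is a deliberately simplified stand-in for real feature-based methods, and the function name and numbers are invented for the example.

```python
from statistics import mean, pstdev

def align_source_to_target(source_col, target_col):
    """Rescale one source feature so its mean and spread match the target's."""
    s_mean, s_std = mean(source_col), pstdev(source_col)
    t_mean, t_std = mean(target_col), pstdev(target_col)
    if s_std == 0:
        # A constant source feature carries no spread to rescale.
        return [t_mean for _ in source_col]
    return [(x - s_mean) / s_std * t_std + t_mean for x in source_col]

source_feature = [1.0, 2.0, 3.0, 4.0]       # a measurement on the source's scale
target_feature = [10.0, 30.0, 50.0, 70.0]   # the same quantity on the target's scale
aligned = align_source_to_target(source_feature, target_feature)
```

After the transformation, the source values sit in the target feature space, so a model trained on them sees inputs that look like target data.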

Relational-knowledge transfer

Relational-based transfer learning approaches mainly focus on learning the relations between the source and target domains and reusing that knowledge in the current context.

Such approaches transfer the logical relationship or rules learned in the source domain to the target domain.

For example, if we learn the relationship between different elements of the speech in a male voice, it can help significantly to analyze the sentence in another voice.

Heterogeneous Transfer Learning

Transfer learning involves deriving representations from a previous network to extract meaningful features from new samples for an inter-related task. However, the approaches covered so far do not account for differences in the feature spaces between the source and target domains.

It is often challenging to collect labeled source domain data with the same feature space as the target domain, and Heterogeneous Transfer learning methods are developed to address such limitations.

This technique aims to solve the issue of source and target domains having differing feature spaces and other concerns like differing data distributions and label spaces. Heterogeneous Transfer Learning is applied in cross-domain tasks such as cross-language text categorization, text-to-image classification, and many others.

Learn more by reading Optical Character Recognition: What is It and How Does it Work?

Transfer Learning for Deep Learning

Finally, let's discuss Transfer Learning in the context of Deep Learning.

Domains like natural language processing and image recognition are considered to be the hot areas of research for transfer learning. There are also many models that achieved state-of-the-art performance.

Pre-trained neural networks/models form the basis of transfer learning in the context of deep learning, and their reuse is referred to as deep transfer learning.

Off-the-shelf pre-trained models as feature extractors

To understand the flow of deep learning models, it's essential to understand what they are made up of.

Deep learning systems are layered architectures that learn different features at different layers. Initial layers capture low-level, generic features (such as edges), and the features become more complex and task-specific as we go deeper into the network.

Read A Comprehensive Guide to Convolutional Neural Networks.

These layers are finally connected to the last layer (usually a fully connected layer, in the case of supervised learning) to get the final output. This opens the scope of using popular pre-trained networks (such as the Oxford VGG model, Google Inception model, and Microsoft ResNet model) without their final layer as fixed feature extractors for other tasks.

Transfer Learning with Pre-trained Deep Learning Models as Feature Extractors

The key idea here is to leverage the pre-trained model's weighted layers to extract features, but not update the model's weights during training with new data for the new task.

The pre-trained models are trained on a large and general enough dataset and will effectively serve as a generic model of the visual world.
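Here is a toy sketch of the feature-extractor idea in plain Python. The "pre-trained" layer is just a small hand-written weight matrix standing in for a real network body, and every name and number in it is an assumption made for illustration. The point is that only the new classifier head is trained, while the base weights are never touched.

```python
import math

# Hypothetical "pre-trained" layer: fixed weights standing in for a real network body.
PRETRAINED_W = [[1.0, -1.0], [0.5, 0.5]]

def extract_features(x):
    """Frozen feature extractor: a linear layer followed by ReLU, never updated."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in PRETRAINED_W]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_head(samples, labels, lr=0.5, epochs=200):
    """Train only the new classifier head on top of the frozen features."""
    head_w, head_b = [0.0, 0.0], 0.0
    feats = [extract_features(x) for x in samples]  # computed once, since the base is frozen
    for _ in range(epochs):
        for f, y in zip(feats, labels):
            p = sigmoid(sum(w * fi for w, fi in zip(head_w, f)) + head_b)
            grad = p - y  # gradient of the log loss with respect to the logit
            head_w = [w - lr * grad * fi for w, fi in zip(head_w, f)]
            head_b -= lr * grad
    return head_w, head_b

def predict(x, head_w, head_b):
    f = extract_features(x)
    return int(sigmoid(sum(w * fi for w, fi in zip(head_w, f)) + head_b) >= 0.5)

samples = [[2.0, 0.0], [3.0, 1.0], [0.0, 2.0], [1.0, 3.0]]
labels = [1, 1, 0, 0]
head_w, head_b = train_head(samples, labels)
```

Because the base is frozen, its features can be computed once and cached, which is a large part of why this approach is cheap compared to training the whole network.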

Looking for quality datasets? See our list of 20+ Open Source Computer Vision Datasets

Fine Tuning Off-the-shelf Pre-trained Models

This is a more involved technique, where we do not just replace the final layer and depend directly on the features extracted from the pre-trained model, but also selectively retrain some of the previous layers.

Deep neural networks are layered structures and have many tunable hyperparameters. The role of the initial layers is to capture generic features, while the later ones focus more on the explicit task at hand. It makes sense to fine-tune the higher-order feature representations in the base model to make them more relevant for the specific task. We can re-train some layers of the model while keeping some frozen in training.

An example is depicted in the following figure on an object detection task, where initial lower layers of the network learn very generic features and the higher layers learn very task-specific features.

Fine-tuning: Supervised domain adaptation

Freezing vs. Fine-tuning

One logical way to increase the model's performance even further is to re-train (or "fine-tune") the weights of the top layers of the pre-trained model alongside the training of the classifier you added.

This will force the weights to be updated from the generic feature maps the model learned on the source task toward features associated specifically with the target dataset. Fine-tuning allows the model to apply past knowledge in the target domain and re-learn some things.

Moreover, one should try to fine-tune a small number of top layers rather than the entire model. The first few layers learn elementary and generic features that generalize to almost all types of data.

Therefore, it's wise to freeze these layers and reuse the basic knowledge derived from the past training. As we go higher up, the features are increasingly more specific to the dataset on which the model was trained. Fine-tuning aims to adapt these specialized features to work with the new dataset, rather than overwrite the generic learning.

Freeze vs. Fine-Tune in Transfer Learning for Deep Learning

Transfer Learning in 6 steps

Lastly, let us walk you through the process of how transfer learning works in practice.

Transfer Learning process

1. Obtain pre-trained model

The first step is to choose the pre-trained model we would like to keep as the base of our training, depending on the task. Transfer learning requires a strong correlation between the knowledge of the pre-trained source model and the target task domain for them to be compatible.

Here are some of the pre-trained models you can use:

For computer vision:

VGG-16
VGG-19
Inception V3
Xception
ResNet-50

For NLP tasks:

Word2Vec
GloVe
FastText

2. Create a base model

The base model is one of the architectures, such as ResNet or Xception, that we selected in the first step because it closely relates to our task. We can either download the pre-trained network weights, which saves the time of additional training, or use only the network architecture and train the model from scratch.

There can be a case where the base model will have more neurons in the final output layer than we require in our use case. In such scenarios, we need to remove the final output layer and change it accordingly.

Base model creation with the removal of classifiers.

3. Freeze layers

Freezing the starting layers from the pre-trained model is essential to avoid the additional work of making the model learn the basic features.

If we do not freeze the initial layers, training on the new data can overwrite much of the learning that has already taken place. In the worst case, this is little different from training the model from scratch and wastes time and resources.

4. Add new trainable layers

The only knowledge we are reusing from the base model is the feature extraction layers. We need to add additional layers on top of them to predict the specialized tasks of the model. These are generally the final output layers.

5. Train the new layers

The pre-trained model’s final output will most likely differ from the output we want for our model. For example, pre-trained models trained on the ImageNet dataset will output 1000 classes.

However, we need our model to work for two classes. In this case, we have to train the model with a new output layer in place.

6. Fine-tune your model

One method of improving the performance is fine-tuning.

Fine-tuning involves unfreezing some part of the base model and training the entire model again on the whole dataset at a very low learning rate. The low learning rate lets the model adapt to the new dataset without overwriting the pre-trained features too quickly, which also helps limit overfitting.
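The six steps can be sketched end to end with a toy network in plain Python. Everything here is an assumption made for illustration: `TinyNet`, the hand-written "pre-trained" weights, the four-sample dataset, and the learning rates. A real project would use a framework and a real pre-trained model, but the order of operations is the same.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class TinyNet:
    """Toy stand-in for a pre-trained network: one base layer plus a new head."""

    def __init__(self, base_w):
        self.base_w = [row[:] for row in base_w]  # steps 1-2: reuse pre-trained weights
        self.base_frozen = True                   # step 3: freeze the base
        self.head_w = [0.0] * len(base_w)         # step 4: add a new trainable head
        self.head_b = 0.0

    def forward(self, x):
        pre = [sum(w * xi for w, xi in zip(row, x)) for row in self.base_w]
        hidden = [max(0.0, p) for p in pre]
        out = sigmoid(sum(w * h for w, h in zip(self.head_w, hidden)) + self.head_b)
        return pre, hidden, out

    def train_step(self, x, y, lr):
        pre, hidden, out = self.forward(x)
        grad_logit = out - y                      # log-loss gradient at the output
        if not self.base_frozen:                  # the base only changes when unfrozen
            for j, row in enumerate(self.base_w):
                if pre[j] > 0:                    # ReLU passes gradient only when active
                    g = grad_logit * self.head_w[j]
                    self.base_w[j] = [w - lr * g * xi for w, xi in zip(row, x)]
        self.head_w = [w - lr * grad_logit * h for w, h in zip(self.head_w, hidden)]
        self.head_b -= lr * grad_logit

    def loss(self, data):
        eps = 1e-12
        total = 0.0
        for x, y in data:
            p = self.forward(x)[2]
            total += -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
        return total / len(data)

pretrained = [[1.0, -1.0], [0.5, 0.5]]            # hypothetical source-task weights
data = [([2.0, 0.0], 1), ([3.0, 1.0], 1), ([0.0, 2.0], 0), ([1.0, 3.0], 0)]

net = TinyNet(pretrained)
loss_start = net.loss(data)
for _ in range(100):                              # step 5: train only the new head
    for x, y in data:
        net.train_step(x, y, lr=0.5)
loss_head = net.loss(data)
base_after_head = [row[:] for row in net.base_w]

net.base_frozen = False                           # step 6: unfreeze and fine-tune
for _ in range(100):
    for x, y in data:
        net.train_step(x, y, lr=0.01)             # much lower learning rate
loss_finetuned = net.loss(data)
```

Note that the base weights stay identical to the pre-trained ones throughout step 5 and only start to move, slowly, once they are unfrozen in step 6.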

Types of Deep Transfer Learning

Domain Adaptation

Domain adaptation is a transfer learning scenario where the source and target domains have different feature spaces and distributions.

Domain adaptation is the process of adapting one or more source domains to transfer information to improve the performance of a target learner. This process attempts to alter a source domain to bring the distribution of the source closer to that of the target.

Check out our guide on Domain Adaptation in Computer Vision

Domain Confusion

In a neural network, different layers identify features of different complexity. In a perfect scenario, we'd develop an algorithm that makes these features domain-invariant and improves their transferability across domains.

The feature representations between the source and target domains should be as similar as possible in such a context. The goal is to add an objective to the model at the source to encourage similarity with the target by confusing the source domain itself.

Specifically, domain confusion loss is used to confuse the high-level classification layers of a neural network by matching the distributions of the target and source domains.

In the end, we want to make sure samples come across as mutually indistinguishable to the classifier. To achieve this, one has to minimize the classification loss for the source samples, and one has to also minimize the domain confusion loss for all samples.
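One common way to write this down, shown here only as a sketch, is to measure how far a domain classifier's prediction is from a uniform guess over the domains. The function names and the `weight` parameter are assumptions for illustration, and the probabilities would come from a real network in practice.

```python
import math

def cross_entropy(class_probs, label):
    """Classification loss for one labeled source sample."""
    return -math.log(class_probs[label])

def domain_confusion_loss(domain_probs):
    """Cross-entropy between a uniform distribution over domains and the
    domain classifier's prediction. It is lowest when the prediction is uniform."""
    return -sum(math.log(p) for p in domain_probs) / len(domain_probs)

def total_loss(source_class_probs, source_labels, all_domain_probs, weight=0.1):
    """Classification loss on source samples plus confusion loss on all samples."""
    cls = sum(cross_entropy(p, y) for p, y in zip(source_class_probs, source_labels))
    cls /= len(source_labels)
    conf = sum(domain_confusion_loss(d) for d in all_domain_probs) / len(all_domain_probs)
    return cls + weight * conf

confused = domain_confusion_loss([0.5, 0.5])     # cannot tell source from target
confident = domain_confusion_loss([0.99, 0.01])  # easily tells them apart
combined = total_loss([[0.8, 0.2]], [0], [[0.5, 0.5], [0.6, 0.4]])
```

The confusion loss is smallest when the domain classifier is maximally unsure, which is exactly the "mutually indistinguishable" situation described above.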

Multi-task Learning

In the case of multitask learning, several tasks from the same domain are learned simultaneously without distinction between the source and targets.

We have a set of learning tasks, t1, t2, …, tn, and we co-learn all tasks simultaneously.

This helps to transfer knowledge from each scenario and develop a rich combined feature vector from all the varied scenarios of the same domain. The learner optimizes the learning/performance across all of the n tasks through some shared knowledge.
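A minimal sketch of this joint optimization, assuming two toy regression tasks that share one weight and each have their own head, could look like this. The model, data, and learning rate are all invented for the example.

```python
def joint_loss(shared_w, heads, tasks):
    """Sum of the mean squared errors of all tasks, using one shared weight."""
    total = 0.0
    for head, (xs, ys) in zip(heads, tasks):
        total += sum((head * shared_w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    return total

def train(tasks, lr=0.01, steps=500):
    """Gradient descent on the joint loss: every task updates the shared weight."""
    shared_w = 1.0
    heads = [1.0 for _ in tasks]
    for _ in range(steps):
        grad_shared = 0.0
        grad_heads = []
        for head, (xs, ys) in zip(heads, tasks):
            errs = [head * shared_w * x - y for x, y in zip(xs, ys)]
            grad_shared += sum(2 * e * head * x for e, x in zip(errs, xs)) / len(xs)
            grad_heads.append(sum(2 * e * shared_w * x for e, x in zip(errs, xs)) / len(xs))
        shared_w -= lr * grad_shared
        heads = [h - lr * g for h, g in zip(heads, grad_heads)]
    return shared_w, heads

xs = [1.0, 2.0, 3.0]
tasks = [
    (xs, [2.0 * x for x in xs]),  # task t1: y = 2x
    (xs, [4.0 * x for x in xs]),  # task t2: y = 4x
]
loss_before = joint_loss(1.0, [1.0, 1.0], tasks)
shared_w, heads = train(tasks)
loss_after = joint_loss(shared_w, heads, tasks)
```

The shared weight receives gradients from both tasks at every step, while each head only sees its own task. That shared part is where the knowledge transfer happens.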

Multi-task Learning in Deep Transfer Learning

One-shot Learning

One-shot learning is a classification task where we have one or a few examples to learn from and classify many new examples in the future.

This is the case of face recognition, where people's faces must be classified correctly with different facial expressions, lighting conditions, accessories, and hairstyles, and the model has one or a few template photos as input.

For one-shot learning, we need to rely fully on the knowledge transferred from the base model, because we have only one or a few examples for each class.
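In practice, this often comes down to comparing embeddings. The sketch below assumes a pre-trained model has already turned each face into a vector. The embeddings and names are made up, and a new face is simply assigned to the most similar template.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def classify_one_shot(query_embedding, templates):
    """Return the label whose single template embedding is most similar to the query."""
    return max(templates, key=lambda label: cosine_similarity(query_embedding, templates[label]))

# Hypothetical embeddings, as if produced by a pre-trained face model.
templates = {
    "alice": [0.9, 0.1, 0.2],  # one template photo per person
    "bob": [0.1, 0.8, 0.3],
}
query = [0.8, 0.2, 0.25]       # a new photo with different lighting or expression
predicted = classify_one_shot(query, templates)
```

No training happens on the new classes at all. All the work is done by the embedding, which is why the quality of the transferred base model matters so much here.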

Check out 3 Signs You Are Ready to Annotate Data for Machine Learning.

Zero-shot Learning

If transfer learning is taken to the extreme, with zero labeled instances of a class available during training, the corresponding strategy is called zero-shot learning.

Zero-shot learning needs additional data during the training phase to understand the unseen data.

Zero-shot learning focuses on the traditional input variable, x, the traditional output variable, y, and the task-specific random variable. Zero-shot learning comes in handy in scenarios such as machine translation, where we may not have labels in the target language.
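One simple way to picture the additional data is as a description of each class. In the sketch below, which is an illustration with made-up attributes and values, the model predicts attributes for an input, and the class with the closest description wins, even if that class had no training examples.

```python
def predict_class(predicted_attributes, class_descriptions):
    """Pick the class whose description is closest to the predicted attributes."""
    def distance(label):
        description = class_descriptions[label]
        return sum((p - d) ** 2 for p, d in zip(predicted_attributes, description))
    return min(class_descriptions, key=distance)

# Attributes: [has_stripes, has_hooves, has_wings].
# "zebra" has no training images and is known only through its description.
class_descriptions = {
    "horse": [0, 1, 0],
    "tiger": [1, 0, 0],
    "zebra": [1, 1, 0],
}

# Hypothetical output of an attribute predictor trained on horses and tigers only.
predicted_attributes = [0.9, 0.8, 0.1]
predicted = predict_class(predicted_attributes, class_descriptions)
```

Here the class descriptions play the role of the task-specific variable: they tell the model what an unseen class looks like without a single labeled example.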

Deep Transfer Learning applications

Transfer learning helps data scientists to learn from the knowledge gained from a previously used machine learning model for a similar task.

This is why the technique is now applied in several fields, which we've listed below.

NLP

NLP is one of the most attractive applications of transfer learning. Transfer learning uses the knowledge of pre-trained AI models that can understand linguistic structures to solve cross-domain tasks. Everyday NLP tasks like next-word prediction, question answering, and machine translation use deep learning models like BERT, XLNet, ALBERT, TF Universal Model, etc.

Computer Vision

Transfer learning is also applied in Image Processing.

Deep Neural Networks are used to solve image-related tasks as they work well at identifying complex features of an image. The early layers hold the basic logic for detecting patterns in an image; thus, tuning the higher layers will not affect that base logic. Image recognition, object detection, noise removal from images, etc., are typical application areas of transfer learning because all image-related tasks require basic knowledge and pattern detection of familiar images.

Read YOLO: Real-Time Object Detection Explained.

Audio/Speech

Transfer learning algorithms are used to solve Audio/Speech related tasks like speech recognition or speech-to-text translation.

When we say "Siri" or "Hey Google!", the primary AI model developed for English speech recognition is busy processing our commands at the backend.

Interestingly, a pre-trained AI model developed for English speech recognition forms the basis for a French speech recognition model.

Transfer Learning in a nutshell: Key takeaways

Finally, let's do a quick recap of everything we've learned today. Here's a bullet-point summary of the things we've covered:

Transfer learning models focus on storing knowledge gained while solving one problem and applying it to a different but related problem.
Instead of training a neural network from scratch, many pre-trained models can serve as the starting point for training. These pre-trained models give a more reliable architecture and save time and resources.
Transfer learning is used in scenarios where there is not enough data for training or when we want better results in a short amount of time.
Transfer learning involves selecting a source model whose domain is similar to the target domain, adapting the source model before transferring its knowledge, and then training it on the target task.
It is common to fine-tune the higher-level layers of the model while freezing the lower-level ones, because the basic knowledge transferred from the source task to a target task in the same domain stays the same.
In tasks with a small amount of data, if the source model is too similar to the target model, there might be an issue of overfitting. Tuning the learning rate, freezing some layers from the source model, or adding linear classifiers while training the target model can help avoid this issue.

Pragati Baheti

Pragati is a software developer at Microsoft, and a deep learning enthusiast. She writes about the fundamental mathematics behind deep neural networks.
