Domain Adaptation in Computer Vision: Everything You Need to Know

What is Domain Adaptation in Computer Vision and why is it important? Explore Domain Adaptation techniques and improve your models' performance to build more accurate AI faster.
18 min read  ·  October 21, 2022

Deep Learning algorithms are the go-to choice for engineers solving all kinds of Computer Vision problems, from classification and segmentation to object detection and image retrieval. However, this approach comes with two main problems.

Firstly, neural networks require a lot of labeled data for training, and manually annotating it is a laborious task. Secondly, a trained deep learning model performs well on test data only if it comes from the same data distribution as the training data. A dataset of photos taken on a mobile phone, for instance, has a significantly different distribution than one captured with a high-end DSLR camera, and traditional Transfer Learning methods often fail under such a shift.

Thus, for every new dataset, we first need to annotate the samples and then re-train the deep learning model to adapt to the new data. Training a sizable Deep Learning model with datasets as big as the ImageNet dataset even once takes a lot of computational power (model training may go on for weeks), and training them again is infeasible.

Domain Adaptation is a method that tries to address this problem. Using domain adaptation, a model trained on one dataset does not need to be re-trained from scratch on a new dataset. Instead, the pre-trained model can be adjusted to give optimal performance on the new data. This saves a lot of computational resources, and in techniques like unsupervised domain adaptation, the new data does not even need to be labeled.

Here’s what we’ll cover:

  1. What is Domain Adaptation
  2. Types of Domain Adaptation
  3. Domain Adaptation Techniques

In case you are searching for tools to annotate your data and train your ML models - we've got you covered! 

Head over to our Open ML Datasets repository, pick a dataset, upload it to V7, and start annotating data to train your neural networks in one place. Have a look at these resources to get started:

  1. V7 Image and Video Annotation
  2. V7 Model Training
  3. V7 Dataset Management
  4. V7 Automated Annotation

Solve any video or image labeling task 10x faster and with 10x less manual work.

Don't start empty-handed. Explore our repository of 500+ open datasets and test-drive V7's tools.

What is Domain Adaptation

Domain Adaptation is a technique to improve the performance of a model on a target domain containing insufficient annotated data by using the knowledge learned by the model from another related domain with adequate labeled data.


Domain Adaptation is essentially a special case of transfer learning.

The mechanism of domain adaptation is to uncover the common latent factors across the source and target domains and adapt them to reduce both the marginal and conditional mismatch in terms of the feature space between domains. Following this, different domain adaptation techniques have been developed, including feature alignment and classifier adaptation.

Domain Adaptation: Key definitions

Before diving in, let’s quickly go through some of the most important concepts regarding domain adaptation, using an example scenario: a classification model is trained (with supervised learning) on photos captured by a mobile phone, and this model is then used to classify images captured by a DSLR camera.

Source Domain: This is the data distribution on which the model is trained using labeled examples. In the example above, the dataset created by the cellphone photos is the source domain.

Target Domain: This is the data distribution on which a model pre-trained on a different domain is used to perform a similar task. In the example above, the dataset generated from the DSLR camera photos is the target domain.

Domain Translation: Domain Translation is the problem of finding a meaningful correspondence between two domains.

Domain Shift: A domain shift is a change in the statistical distribution of data between the different domains (like the training, validation, and test sets) for a model.
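To make the idea concrete, here is a minimal sketch that quantifies a simple form of domain shift by comparing a single feature's statistics across two domains. The feature ("average brightness") and the distribution parameters are toy assumptions, not measurements from real phone or DSLR data:

```python
import math
import random

def feature_stats(samples):
    """Per-dataset mean and standard deviation of a 1-D feature."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    return mean, math.sqrt(var)

random.seed(0)
# Toy stand-ins for one image feature (e.g., average brightness)
# in two domains: phone photos vs. DSLR photos.
source = [random.gauss(0.40, 0.05) for _ in range(1000)]  # "phone"
target = [random.gauss(0.55, 0.08) for _ in range(1000)]  # "DSLR"

mu_s, sd_s = feature_stats(source)
mu_t, sd_t = feature_stats(target)
shift = abs(mu_s - mu_t)  # a crude one-number summary of the domain shift
print(round(shift, 2))
```

A model calibrated to the source statistics will systematically misjudge target inputs whenever this gap is large, which is exactly the failure mode domain adaptation targets.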

💡 Pro Tip: Learn more about Train, Validation, and Test sets and how to partition them.

Types of Domain Adaptation

Domain Adaptation methods can be categorized into several types depending on factors like the available annotations of the target domain data, the nature of the source and target domain feature spaces, and the path through which domain adaptation is achieved. We will discuss them one by one.

Categorization of Domain Adaptation methods

First, let’s look into the types of DA based on the labeling of target domain data:


  1. Supervised DA
  2. Semi-Supervised DA
  3. Weakly Supervised DA
  4. Unsupervised DA

Supervised 

The target domain data is fully annotated in Supervised Domain Adaptation (SDA). Unsupervised Domain Adaptation needs large amounts of target data to be effective, and this is emphasized even more when using deep models. SDA, however, can function well without such vast amounts of target domain training data, and since that data is small, labeling it is likely not very expensive.

This paper has formulated one SDA approach, where the authors introduce a classification and contrastive semantic alignment (CCSA) loss. The input images are mapped into an embedding/feature space using a deep learning model, based on which the classification prediction is computed.

💡 Pro Tip: Read The Beginner’s Guide to Contrastive Learning.

They use the semantic alignment loss (along with a traditional classification loss) to encourage samples from different domains but belonging to the same category to map nearby in this embedding space.
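The alignment idea can be sketched with a small contrastive-style loss over cross-domain pairs. This is a simplified illustration of the principle, not the paper's exact CCSA formulation; the embeddings and margin below are toy assumptions:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def semantic_alignment_loss(src, tgt, margin=1.0):
    """Contrastive-style loss over cross-domain embedding pairs.

    src, tgt: lists of (embedding, label). Same-class pairs are pulled
    together; different-class pairs are pushed at least `margin` apart.
    """
    total, count = 0.0, 0
    for e_s, y_s in src:
        for e_t, y_t in tgt:
            d = euclidean(e_s, e_t)
            if y_s == y_t:
                total += d ** 2                      # alignment term
            else:
                total += max(0.0, margin - d) ** 2   # separation term
            count += 1
    return total / count

# Toy 2-D embeddings: same-class samples from both domains lie close,
# so the loss is already small.
src = [([0.0, 0.0], 0), ([2.0, 2.0], 1)]
tgt = [([0.1, 0.0], 0), ([2.0, 1.9], 1)]
print(semantic_alignment_loss(src, tgt))
```

Minimizing such a loss alongside an ordinary classification loss is what encourages same-category samples from different domains to map nearby in the embedding space.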

Deep supervised domain adaptation
Source: Paper

Semi-Supervised

In Semi-Supervised Domain Adaptation (SSDA), only a few data samples in the target domain are labeled. Unsupervised Domain Adaptation methods have been seen to perform poorly when only a few target domain samples are labeled (as shown in this paper), and thus SSDA approaches tend to be vastly different from such unsupervised methods.

This paper devised a cosine similarity-based classifier architecture that predicts a K-way class probability vector by computing cosine similarity between the ‘K’ class-specific weight vectors and the output of a feature extractor (lower layers), followed by a softmax.

Each class weight vector is an estimated “prototype” that can be regarded as a representative point of that class. The approach is similar to those used in Few-Shot Learning settings. The difference between Few-Shot approaches and their approach is shown below:

Baseline Few-shot learning method
Source: Paper
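A cosine-similarity classifier of this kind can be sketched in a few lines: normalize the feature and each class "prototype," take cosine similarities as logits, and apply a temperature-scaled softmax. The prototypes and temperature below are toy assumptions, not the paper's learned values:

```python
import math

def cosine_classifier(feature, prototypes, temperature=0.05):
    """K-way class probabilities from cosine similarity to class prototypes."""
    def norm(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    f = norm(feature)
    sims = [sum(a * b for a, b in zip(f, norm(w))) for w in prototypes]
    logits = [s / temperature for s in sims]       # sharpen with temperature
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]       # numerically stable softmax
    z = sum(exps)
    return [e / z for e in exps]

# Two class prototypes; the feature points almost along prototype 0.
probs = cosine_classifier([1.0, 0.1], [[1.0, 0.0], [0.0, 1.0]])
print([round(p, 3) for p in probs])
```

Because similarities live in [-1, 1], the low temperature is what makes the softmax output decisive.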

The key idea in the aforementioned approach is to minimize the distance between the class prototypes and neighboring unlabeled target samples, thereby extracting discriminative features. The problem lies in the computation of domain-invariant prototypes using only a few labeled target domain samples.

Therefore, the authors move the weight vectors towards the target by maximizing the entropy of unlabeled target examples in the first adversarial step. Second, they update the feature extractor to minimize the entropy of the unlabeled examples to make them better clustered around the prototypes. This process is formulated as a mini-max game between the weight vectors and the feature extractor and applied over the unlabeled target examples. The overview of their approach is shown below.

An overview of the model architecture and MME
Source: Paper
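The quantity both players in this mini-max game optimize is the Shannon entropy of the classifier's predictions on unlabeled target samples. A minimal sketch, with hypothetical probability vectors standing in for real predictions:

```python
import math

def entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of a class-probability vector."""
    return -sum(p * math.log(p + eps) for p in probs)

confident = [0.98, 0.01, 0.01]   # target sample clustered near a prototype
uncertain = [0.34, 0.33, 0.33]   # target sample far from all prototypes

# The classifier (weight vectors) is updated to MAXIMIZE this entropy on
# unlabeled target data; the feature extractor to MINIMIZE it.
print(round(entropy(confident), 3), round(entropy(uncertain), 3))
```

Maximizing entropy drags the prototypes toward the target distribution, while minimizing it forces target features to cluster tightly around those prototypes.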

Weakly-Supervised

Weakly Supervised Domain Adaptation (WSDA) refers to a problem setting wherein only “weak labels” are available in the target domain. For example, in a semantic segmentation domain adaptation problem, ground truth masks may be unavailable in the target domain while the categories of the objects to be segmented are available.

So, here the category labels are the weak labels (and the ground truth segmentation masks are called “hard labels”).

💡Pro Tip: Go here to learn more about Semantic Segmentation.

This paper proposed a WSDA approach for 3D Hand Pose Estimation on Hand Object Interaction (HOI) data. For this, the authors trained a domain adaptation network using 2D object segmentation masks and 3D pose labels for hand-only data. With this information, the network then needs to annotate the HOI data, that is, estimate its hand poses.

Example of hand pose estimation in hand-object interaction
Source: Paper

They achieved the domain adaptation through two forms of guidance in image space (though these could also be applied in feature space). Two image generation methods were investigated and combined: a generative adversarial network and a mesh renderer using estimated 3D meshes and textures. As an outcome, input HOI images are transformed into segmented and de-occluded hand-only images, effectively improving hand pose estimation accuracy. The overview of their approach is shown below:

Diagram of the proposed 3D hand mesh and pose estimation framework via domain adaptation
Source: Paper

Unsupervised

In Unsupervised Domain Adaptation (UDA), labels of any kind (weak or hard) are entirely missing for the target domain data. A model trained on source domain data must adapt to the target domain on its own.

One such UDA method is proposed in this paper, where the authors develop a new Residual Transfer Network (RTN) approach to domain adaptation in deep networks, which can simultaneously learn adaptive classifiers and transferable features. They relax the shared-classifier assumption made by previous methods and assume that the source and target classifiers differ by a small residual function. The schematic of their approach is shown below.

Residual Transfer Network for domain adaptation
Source: Paper

Classifier adaptation is enabled in this method by plugging several layers into deep networks to explicitly learn the residual function with reference to the target domain classifier. In this way, the source domain classifier and target domain classifier can be bridged tightly in the back-propagation procedure.

The target domain classifier is tailored to the target domain data by exploiting the low-density separation criterion. The features of multiple layers are then fused with the tensor product and embedded into a reproducing kernel Hilbert space to match distributions for feature adaptation.
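The residual assumption at the core of RTN can be sketched in a few lines: the source classifier is modeled as the target classifier plus a small learned perturbation. The toy classifiers and residual below are hypothetical stand-ins, not the paper's learned functions:

```python
def target_logits(x):
    """Stand-in target-domain classifier (hypothetical toy weights)."""
    return [2.0 * x, -1.0 * x]

def residual(x):
    """Small learned perturbation bridging the two classifiers."""
    return [0.1 * x, 0.05 * x]

def source_logits(x):
    # RTN's assumption: f_S(x) = f_T(x) + residual(x)
    return [t + r for t, r in zip(target_logits(x), residual(x))]

print(source_logits(1.0))
```

Because only the small residual layers need to be learned on top of the shared network, the two classifiers stay tightly bridged during back-propagation.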

Homogeneous DA vs. Heterogeneous DA

Domain Adaptation can be categorized into Homogeneous and Heterogeneous DA based on different domain divergences. Both of these DA methods further have their own supervised, semi-supervised, and unsupervised categories.

Homogeneous DA

Homogeneous DA refers to a problem where the feature spaces of the source and target domains are identical, with the same dimensionality; the difference lies only in the data distributions.

Homogeneous DA considers that source and target domain data are collected using the same type of features, that is, cross-domain data are observed in the same feature space but exhibit different distributions. Thus, this is also called a “distribution-shift” type Domain Adaptation problem.

Heterogeneous DA

In Heterogeneous DA problems, the source and target domains are non-equivalent and might have different feature space dimensionality. In heterogeneous DA, cross-domain data are described by different types of features and thus exhibit distinct distributions (for example, training and test image data with different resolutions or encoded by different codebooks). It is thus also known as a “feature space difference” type DA problem and is a much more challenging problem than Homogeneous DA.

One such Heterogeneous DA method is devised in this paper, which addresses a semi-supervised DA problem. The authors propose a learning algorithm of Cross-Domain Landmark Selection (CDLS). The overview of their method is shown below:

Cross-Domain Landmark Selection for heterogeneous domain adaptation
Source: Paper

Instead of viewing all cross-domain data to be equally important during adaptation, the CDLS model derives a heterogeneous feature transformation which results in a domain-invariant subspace for associating cross-domain data. In addition, the representative source and target domain data are jointly exploited to improve the adaptation capability of CDLS.

Once the adaptation process is complete, one can simply project cross-domain labeled and unlabeled target domain data into the derived subspace for performing recognition. 

One-Step vs. Multi-Step DA

One-Step vs Multi-Step Domain Adaptation

The final form of categorization of Domain Adaptation techniques is based on how the domain adaptation is achieved: most DA settings assume that the source and target domains are directly related; thus, transferring knowledge can be accomplished in one step. We call them One-Step DA.

In reality, however, this assumption does not always hold. When there is little overlap between the two domains, performing One-Step DA will not be effective. Fortunately, there are sometimes intermediate domains that can draw the source and target domains closer than their original distance. We can then use a series of intermediate bridges to connect two seemingly unrelated domains and perform One-Step DA through each bridge; this is called Multi-Step (or transitive) DA.

For example, face and vehicle images are dissimilar in shape and other aspects, so one-step DA would fail. However, images of an intermediate category, such as “football helmet,” can be introduced as an intermediate domain to enable a smoother knowledge transfer.

Four algorithmic Domain Adaptation principles

Reweighting Algorithms/Instance-based Adaptation

Reweighting algorithms work on the principle of minimizing the distribution difference by reweighting the source data and then training a classifier on the reweighted source data. This decreases the importance of data belonging to the source-only classes.

The target and reweighted source domain data are used to train the feature extractor by adversarial training or kernel mean matching to align distributions. Such methods are also called “Instance-based Adaptation.”
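One common way to obtain such instance weights (a generic sketch, not the adversarial formulation of the paper below) is to train a domain classifier that estimates P(target | x) and convert its output into a density ratio. The classifier scores below are hypothetical:

```python
def importance_weight(p_target, eps=1e-6):
    """Density-ratio weight w(x) = p_T(x) / p_S(x) from a domain classifier.

    p_target is the classifier's estimate of P(domain = target | x),
    assuming balanced source and target pools.
    """
    p = min(max(p_target, eps), 1.0 - eps)  # clamp to avoid division by zero
    return p / (1.0 - p)

# Source samples that "look like" the target domain get up-weighted;
# source-only-looking samples are down-weighted toward zero.
scores = [0.9, 0.5, 0.1]  # hypothetical domain-classifier outputs
weights = [importance_weight(s) for s in scores]
print([round(w, 2) for w in weights])
```

Training the source classifier on these reweighted samples then approximates training on the target distribution itself.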

One such technique has been proposed in this paper where the authors propose an adversarial reweighting (AR) approach, which adversarially learns to reweight the source domain data for aligning the distributions of the source and target domains.

Specifically, their approach relies on adversarial training to learn the weights of source domain data to minimize the Wasserstein distance between the reweighted source domain and target domain distributions.

Their workflow is as follows.

PDA: Partial Domain Adaptation
Source: Paper. PDA: Partial Domain Adaptation

Iterative Algorithms

As the name suggests, Iterative Adaptation methods aim to “auto-label” target domain data iteratively. However, these methods generally require some labeled target samples, so they fit supervised and semi-supervised DA settings. Here, a deep model trains on the labeled source domain data and annotates the unlabeled target samples.

Then a new model is learned from the newly labeled target domain samples.

This paper, for example, presents an algorithm that seeks to slowly adapt its training set from the source to the target domain, using ideas from co-training. The authors accomplish this in two ways: first, they train on their own output in rounds, where at each round, they include in the training data the target instances they are most confident of.

Second, they select a subset of shared source and target features based on their compatibility (measured across the training and unlabeled sets). As more target instances are added to the training set, target-specific features become compatible across the two sets and are therefore included in the predictor.
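The confidence-based self-labeling step at the core of such iterative methods can be sketched as follows; the toy 1-D model and the threshold value are assumptions for illustration:

```python
def select_confident(model, unlabeled, threshold=0.9):
    """One self-training round: pseudo-label targets the model is sure about.

    `model` maps a sample to a class-probability list. Returns the
    (sample, pseudo_label) pairs above `threshold` and the leftovers.
    """
    accepted, remaining = [], []
    for x in unlabeled:
        probs = model(x)
        conf = max(probs)
        if conf >= threshold:
            accepted.append((x, probs.index(conf)))
        else:
            remaining.append(x)
    return accepted, remaining

# Hypothetical 1-D model: class 1 if x is large, class 0 if small.
model = lambda x: [1.0 - x, x]
accepted, remaining = select_confident(model, [0.95, 0.55, 0.05])
print(accepted, remaining)
```

In practice this loop repeats: the accepted pairs join the training set, the model is retrained, and the threshold gradually admits more target samples.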

Feature-based Adaptation

Feature-based Adaptation techniques aim to map the source data into the target data by learning a transformation that extracts invariant feature representation across domains. They usually create a new feature representation by transforming the original features into a new feature space and then minimizing the gap between domains in the new representation space in an optimization procedure while preserving the underlying structure of the original data.

Feature-based Adaptation methods can be further categorized into the following:

  1. Subspace-based Adaptation: These aim to discover a common intermediate representation that is shared between domains.

  2. Transformation-based Adaptation: Feature transformation transforms the original features into a new feature representation to minimize the discrepancy between the marginal and the conditional distributions while preserving the original data’s underlying structure and characteristics.

  3. Reconstruction-based Adaptation: The feature reconstruction-based methods aim to reduce the disparity between domain distributions using a sample reconstruction in an intermediate feature representation. 

Hierarchical Bayesian Model

The Hierarchical Bayesian Model for Domain Adaptation proposed in this paper is named so because of its use of a hierarchical Bayesian prior, through which the domain-specific parameters are tied. Such a model aims to derive domain-dependent latent representations allowing both domain-specific and globally shared latent factors.

This hierarchical Bayesian prior encourages features to have similar weights across domains unless there is good contrary evidence. Hierarchical Bayesian frameworks are a more principled approach for transfer learning compared to approaches that learn parameters of each task/distribution independently and smooth parameters of tasks with more information towards coarser-grained ones.

Domain Adaptation Techniques

Domain Invariant Feature Learning

Most recent domain adaptation methods align source and target domains by creating a domain invariant feature representation, typically in the form of a feature extractor neural network. A feature representation is domain-invariant if the features follow the same distribution regardless of whether the input data is from the source or target domain. If we can train a classifier to perform well on the source data using domain-invariant features, the classifier may generalize well to the target domain, since the features of the target data match those on which the classifier was trained.

Domain Invariant Feature Learning

The general training and testing setup of these methods is illustrated above. Methods differ in how they align the domains (the Alignment Component in the figure). Some minimize divergence, some perform reconstruction, and some employ adversarial training. In addition, they differ in weight-sharing choices. The various alignment methods will be elaborated on next.

  1. Divergence-based Domain Adaptation

Divergence-based DA techniques aim to minimize some divergence criteria between the source and target domain data distributions. Four choices used in various domain adaptation approaches are maximum mean discrepancy, correlation alignment, contrastive domain discrepancy, and the Wasserstein metric.
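As a concrete example of one such criterion, here is a minimal sketch of the (biased) squared Maximum Mean Discrepancy with an RBF kernel. The kernel bandwidth and toy samples are assumptions for illustration:

```python
import math

def rbf(a, b, gamma=1.0):
    """RBF kernel between two feature vectors."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))

def mmd_squared(xs, xt, gamma=1.0):
    """Biased estimate of squared Maximum Mean Discrepancy (RBF kernel)."""
    def mean_k(p, q):
        return sum(rbf(a, b, gamma) for a in p for b in q) / (len(p) * len(q))
    return mean_k(xs, xs) + mean_k(xt, xt) - 2.0 * mean_k(xs, xt)

same = [[0.0], [0.1], [0.2]]
shifted = [[2.0], [2.1], [2.2]]
# MMD is ~0 for matching distributions and large under a shift.
print(round(mmd_squared(same, same), 4), round(mmd_squared(same, shifted), 4))
```

Divergence-based methods add such a term to the training objective, so the feature extractor is penalized whenever source and target features drift apart.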

One such framework is proposed in this paper, where the DA is performed by aligning infinite-dimensional covariance matrices (descriptors) across domains. More specifically, the authors first map the original features to a Reproducing Kernel Hilbert Space (RKHS) and then use a linear operator in the resulting space to “move” the source data to the target domain such that the RKHS covariance descriptors of the transformed data and the target data are close. Computing the pairwise inner products of the transformed and target samples, the authors obtain a new domain-invariant kernel matrix with a closed-form expression, which can be used in any kernel-based learning machine.

  2. Reconstruction-based Domain Adaptation

Rather than minimizing a divergence, alignment can be accomplished by learning a representation that both classifies the labeled source domain data well and can be used to reconstruct either the target domain data or both the source and target domain data. The alignment component in these setups is a reconstruction network: the opposite of the feature extractor, it takes the feature extractor's output and recreates the feature extractor's input.

This paper proposes Deep Reconstruction Classification Networks (DRCN), a convolutional network that jointly learns two tasks: (i) supervised source label prediction and (ii) unsupervised target data reconstruction. The aim is that the learned label prediction function can perform well in classifying images in the target domain; the data reconstruction can thus be viewed as an auxiliary task that supports the adaptation of the label prediction.

The network learning mechanism in DRCN alternates between unsupervised and supervised training, which is different from the standard pretraining and fine-tuning strategy. The illustration of their framework is shown below.

DRCN's architecture
Source: Paper
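A joint objective of this kind can be sketched as a weighted sum of the two task losses. The trade-off weight `lam` and the toy inputs below are hypothetical, not DRCN's actual values:

```python
import math

def cross_entropy(probs, label, eps=1e-12):
    """Classification loss on a labeled source sample."""
    return -math.log(probs[label] + eps)

def mse(original, reconstructed):
    """Reconstruction loss on an unlabeled target sample."""
    n = len(original)
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / n

def joint_objective(class_probs, label, target_img, reconstruction, lam=0.5):
    """Supervised source classification + unsupervised target reconstruction."""
    return lam * cross_entropy(class_probs, label) + (1 - lam) * mse(
        target_img, reconstruction
    )

loss = joint_objective([0.8, 0.2], 0, [0.5, 0.5, 0.5], [0.4, 0.5, 0.6])
print(round(loss, 4))
```

The reconstruction term only ever sees target pixels, which is what forces the shared encoder to keep features that describe the target domain.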

  3. Adversarial-based Domain Adaptation

Adversarial adaptation methods have become an increasingly popular incarnation of approaches that seek to minimize an approximate domain discrepancy distance through an adversarial objective with respect to a domain discriminator. These methods are closely related to generative adversarial learning, which pits two networks against each other— a generator and a discriminator. Adversarial domain adaptation approaches tend to minimize the distribution discrepancy between domains to obtain transferable and domain invariant features.

Several varieties of feature-level adversarial domain adaptation methods have been introduced in the Computer Vision literature. In most cases, the alignment component consists of a domain classifier (a classifier that predicts whether a feature representation was generated from source or target data). For example, this alignment component may be a network learning an approximate Wasserstein distance, or it may be a GAN (Generative Adversarial Network) model.
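At the heart of these methods is a mini-max objective: the domain classifier minimizes a binary cross-entropy on domain labels, while the feature extractor is updated to maximize that same loss (i.e., to confuse the classifier). A minimal sketch with hypothetical classifier outputs:

```python
import math

def bce(p, y, eps=1e-12):
    """Binary cross-entropy for one prediction p against domain label y."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Domain-classifier outputs P(target | features) for a tiny batch:
preds = [0.9, 0.1]      # confident: first sample target, second source
labels = [1, 0]         # true domains

disc_loss = sum(bce(p, y) for p, y in zip(preds, labels)) / len(preds)
# The feature extractor's adversarial objective is the negation: good
# features make the domain classifier's loss LARGE (domains confused).
feat_loss = -disc_loss
print(round(disc_loss, 4), round(feat_loss, 4))
```

When training converges, the domain classifier is reduced to guessing, which is precisely the point at which the features carry no domain information.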

This paper proposed the Cycle-Consistent Adversarial Domain Adaptation (CyCADA) model, which adapts representations at both the pixel-level and feature-level while enforcing semantic consistency. The authors enforce both structural and semantic consistency during adaptation using a cycle-consistency loss (that is, the source should match the source mapped to the target mapped back to the source) and semantic losses based on a particular visual recognition task. The semantic losses both guide the overall representation to be discriminative and enforce semantic consistency before and after mapping between domains. Their workflow diagram is shown below.

Source: Paper
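The cycle-consistency idea (the source mapped to the target and back should match the original source) can be sketched with toy mapping functions. The brightness-shift mappings below are purely illustrative stand-ins for CyCADA's learned generators:

```python
def l1(a, b):
    """Mean absolute (L1) difference between two flattened images."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Hypothetical toy mappings: g_st brightens an image, g_ts darkens it.
g_st = lambda img: [x + 0.3 for x in img]   # source -> target style
g_ts = lambda img: [x - 0.3 for x in img]   # target -> source style

def cycle_loss(img):
    """Source should match source mapped to target mapped back to source."""
    return l1(img, g_ts(g_st(img)))

print(cycle_loss([0.2, 0.5, 0.8]))
```

Because these toy mappings are exact inverses, the loss is essentially zero; during training the penalty is what pushes the learned generators toward being inverses of each other.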

Domain Mapping

An alternative to creating a domain-invariant feature representation is mapping from one domain to another. The mapping is typically created adversarially and at the pixel level (that is, pixel-level adversarial domain adaptation). This mapping can be accomplished with a conditional GAN.

The generator performs adaptation at the pixel level by translating a source input image to an image that closely resembles the target distribution. For example, the GAN could translate a synthetic vehicle-driving image into one that looks realistic.

Source: Paper

For example, this paper proposed the StarGAN model, which is a generative adversarial network capable of learning mappings among multiple domains.

Contrary to existing approaches for multi-domain translation, which use several separate generator and discriminator networks for cross-domain translation, the StarGAN model takes in training data of multiple domains and learns the mappings between all available domains using only one generator. The representative diagram of the StarGAN model is shown below.

Source: Paper

Instead of learning a fixed translation, StarGAN takes both image and domain information as inputs and learns to translate the input image into the corresponding domain flexibly. StarGAN uses a label (e.g., binary or one-hot encoding vector) to represent the domain information.

During training, a target domain label is randomly generated, and the model is trained to translate an input image into the target domain flexibly. By doing so, the authors can control the domain label and translate the image into any desired domain at the testing phase.

Ensemble Methods

Given a base model such as a deep neural network, an ensemble consisting of multiple models can often outperform a single model by averaging together the models’ outputs (e.g., regression) or taking a vote (e.g., classification). This is because if the models are diverse, then each individual model will likely make different mistakes. However, this performance gain corresponds with an increase in computation cost due to the large number of models to evaluate for each ensemble prediction.

💡 Pro Tip: Check out The Complete Guide to Ensemble Learning

An alternative to using multiple base models as the ensemble is using only a single model but “evaluating” the models in the ensemble at multiple points in time during training, a technique called self-ensembling.

This can be done by averaging over past predictions for each example (by recording previous predictions) or over past network weights (by maintaining a running average). This reduces the computational cost several-fold.

One such method was proposed in this paper, where the authors used self-ensembling for unsupervised Domain Adaptation. They use two networks: a student network and a teacher network. Input images are fed first to stochastic data augmentation (Gaussian noise, translations, horizontal flips, affine transforms, etc.) before being input to both networks.

Because the method is stochastic, the augmented images fed to the networks will differ. The student network is trained with gradient descent, while the teacher network weights are an exponential moving average (EMA) of the student network’s weights.

Mean teacher model and the model structure
Source: Paper
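The teacher's weight update in such mean-teacher setups can be sketched in a few lines; the decay value and toy weight vectors are assumptions for illustration:

```python
def ema_update(teacher, student, decay=0.99):
    """Teacher weights as an exponential moving average of student weights."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]

teacher = [0.0, 0.0]
student = [1.0, -1.0]
for _ in range(100):          # 100 training steps with a fixed student
    teacher = ema_update(teacher, student)
print([round(w, 3) for w in teacher])
```

The teacher drifts smoothly toward the student, so its predictions are a temporally averaged (and typically more stable) target for the consistency loss.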

Target Discriminative Methods

One assumption that has led to successes in semi-supervised learning algorithms is the cluster assumption: that data points are distributed in separate clusters, and the samples in each cluster have a common label. If this is the case, then decision boundaries should lie in low-density regions (that is, they should not pass through regions where there are many data points).

A variety of domain adaptation methods have been explored to move decision boundaries into regions of lower data density. These have typically been trained adversarially.

💡 Pro Tip: Read The Ultimate Guide to Semi-Supervised Learning

For example, this paper proposes an unsupervised DA method called “Adversarial Discriminative Domain Adaptation” (ADDA).

ADDA first learns a discriminative representation using the labels in the source domain, and then learns a separate encoding that maps the target data to the same space via an asymmetric mapping learned through a domain-adversarial loss. Their architecture is shown below.

Generalized architecture for adversarial domain adaptation
Source: Paper

Conclusion

With the increasing amount of data available in this era of information technology, it is difficult to annotate all of it and re-train existing models on the new data. Domain Adaptation provides an intelligent solution to this problem by enabling the re-use of pre-trained models on a new statistical distribution of data belonging to a related domain, without having to train deep models from scratch.

Active research in Domain Adaptation aims to enhance the capabilities of DA algorithms for tasks like bi-directional adaptation, image-to-image translation, and style transfer. Several families of techniques exist to address the domain shift problem, each with its own strengths and weaknesses. DA is thus an important area of Computer Vision that attempts to close the gap between Artificial Intelligence and humans through the kind of effective and efficient knowledge transfer we employ every day.

💡 Read next:

A Step-by-Step Guide to Text Annotation [+Free OCR Tool]

The Complete Guide to CVAT - Pros & Cons [2022]

5 Alternatives to Scale AI

9 Essential Features for a Bounding Box Annotation Tool

9 Reinforcement Learning Real-Life Applications

Mean Average Precision (mAP) Explained: Everything You Need to Know

YOLO: Real-Time Object Detection Explained

The Beginner's Guide to Deep Reinforcement Learning [2022]

Knowledge Distillation: Principles & Algorithms [+Applications]

Rohit Kundu is a Ph.D. student in the Electrical and Computer Engineering department of the University of California, Riverside. He is a researcher in the Vision-Language domain of AI and published several papers in top-tier conferences and notable peer-reviewed journals.
