Deep Learning algorithms are the go-to choice for engineers solving Computer Vision problems, from classification and segmentation to object detection and image retrieval. However, two main problems are associated with them.
Firstly, neural networks require a lot of labeled data for training, and manually annotating it is a laborious task. Secondly, a trained deep learning model performs well on test data only if that data comes from the same distribution as the training data. A dataset of photos taken on a mobile phone has a significantly different distribution from one captured with a high-end DSLR camera, and traditional Transfer Learning methods often fail under such a shift.
Thus, for every new dataset, we first need to annotate the samples and then re-train the deep learning model to adapt to the new data. Training a sizable Deep Learning model with datasets as big as the ImageNet dataset even once takes a lot of computational power (model training may go on for weeks), and training them again is infeasible.
Domain Adaptation is a method that tries to address this problem. Using domain adaptation, a model trained on one dataset does not need to be re-trained from scratch on a new dataset. Instead, the pre-trained model can be adjusted to perform well on the new data. This saves a lot of computational resources, and in techniques like unsupervised domain adaptation, the new data does not need to be labeled.
Domain Adaptation is a technique to improve the performance of a model on a target domain containing insufficient annotated data by using the knowledge learned by the model from another related domain with adequate labeled data.
Domain Adaptation is essentially a special case of transfer learning.
The mechanism of domain adaptation is to uncover the common latent factors across the source and target domains and adapt them to reduce both the marginal and conditional mismatch in terms of the feature space between domains. Following this, different domain adaptation techniques have been developed, including feature alignment and classifier adaptation.
Before diving in, let’s quickly go through some of the most important concepts regarding domain adaptation. For this, let us use an example scenario: a classification model is trained (with supervised learning) on photos captured by a mobile phone, and this model is then used to classify images captured by a DSLR camera.
Source Domain: This is the data distribution on which the model is trained using labeled examples. In the example above, the dataset created by the cellphone photos is the source domain.
Target Domain: This is the data distribution on which a model pre-trained on a different domain is used to perform a similar task. In the example above, the dataset of photos taken with the DSLR camera is the target domain.
Domain Translation: Domain Translation is the problem of finding a meaningful correspondence between two domains.
Domain Shift: A domain shift is a change in the statistical distribution of data between the different domains (like the training, validation, and test sets) for a model.
Domain Adaptation methods can be categorized into several types depending on factors like the available annotations of the target domain data, the nature of the source and target domain feature spaces, and the path along which domain adaptation is achieved. We will discuss them one by one.
First, let’s look into the types of DA based on the labeling of target domain data:
In Supervised Domain Adaptation (SDA), the target domain data is fully annotated. Unsupervised Domain Adaptation needs large amounts of target data to be effective, and even more so when deep models are used. SDA, by contrast, can work well with far less target domain training data, so labeling that data is unlikely to be very expensive.
This paper has formulated one SDA approach, where the authors introduce a classification and contrastive semantic alignment (CCSA) loss. The input images are mapped into an embedding/feature space using a deep learning model, based on which the classification prediction is computed.
They use the semantic alignment loss (along with a traditional classification loss) to encourage samples from different domains but belonging to the same category to map nearby in this embedding space.
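To make the alignment idea concrete, here is a minimal NumPy sketch of a contrastive alignment penalty over source-target pairs. It is an illustration under assumptions, not the authors' implementation: the function name, the `margin` value, the squared-distance form, and the toy embeddings are all hypothetical, and the classification loss that accompanies this term is omitted.

```python
import numpy as np

def contrastive_semantic_alignment(src_emb, src_y, tgt_emb, tgt_y, margin=1.0):
    """Average pairwise penalty over all (source, target) embedding pairs."""
    # Euclidean distance between every source and every target embedding.
    diff = src_emb[:, None, :] - tgt_emb[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    same_class = src_y[:, None] == tgt_y[None, :]
    # Same class: pull the pair together.
    align = 0.5 * dist ** 2
    # Different class: push the pair apart until it is at least `margin` away.
    separate = 0.5 * np.maximum(0.0, margin - dist) ** 2
    return float(np.where(same_class, align, separate).mean())

# Toy 2-D embeddings (hypothetical): class 0 near (0, 0), class 1 near (2, 2).
src = np.array([[0.0, 0.0], [2.0, 2.0]])
src_labels = np.array([0, 1])
tgt_labels = np.array([0, 1])
tgt_aligned = np.array([[0.1, 0.0], [2.0, 2.1]])     # classes land near their source twins
tgt_misaligned = np.array([[2.0, 2.1], [0.1, 0.0]])  # classes land on the wrong side

loss_aligned = contrastive_semantic_alignment(src, src_labels, tgt_aligned, tgt_labels)
loss_misaligned = contrastive_semantic_alignment(src, src_labels, tgt_misaligned, tgt_labels)
```

The penalty is small when same-category samples from both domains sit close together in the embedding space, and grows when they do not.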
In Semi-Supervised Domain Adaptation (SSDA), only a few data samples in the target domain are labeled. Unsupervised Domain Adaptation methods have been seen to perform poorly when only a few target domain samples are labeled (as shown in this paper), and thus SSDA approaches tend to differ substantially from such unsupervised methods.
This paper devised a cosine similarity-based classifier architecture that predicts a K-way class probability vector by computing cosine similarity between the ‘K’ class-specific weight vectors and the output of a feature extractor (lower layers), followed by a softmax.
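A cosine similarity-based classifier of this kind can be sketched in a few lines of NumPy. This is a hedged illustration: the function name, the `temperature` parameter and its value, and the choice to normalise both the features and the weight vectors are assumptions made here for clarity.

```python
import numpy as np

def cosine_classifier(features, prototypes, temperature=0.05):
    """Return K-way class probabilities from cosine similarity to K weight vectors."""
    # L2-normalise both sides so the dot product equals cosine similarity.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    logits = (f @ w.T) / temperature
    # Numerically stable softmax over the K classes.
    logits = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

features = np.array([[1.0, 0.1], [0.1, 2.0]])    # two feature-extractor outputs
prototypes = np.array([[1.0, 0.0], [0.0, 1.0]])  # K = 2 class-specific weight vectors
probs = cosine_classifier(features, prototypes)
```

Because only the direction of each vector matters, each row of `prototypes` acts as a representative point for its class.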
Each class weight vector is an estimated “prototype” that can be regarded as a representative point of that class. The approach is similar to those used in Few-Shot Learning settings. The difference between Few-Shot approaches and their approach is shown below:
The key idea in the aforementioned approach is to minimize the distance between the class prototypes and neighboring unlabeled target samples, thereby extracting discriminative features. The problem lies in the computation of domain-invariant prototypes using only a few labeled target domain samples.
Therefore, the authors move the weight vectors towards the target by maximizing the entropy of unlabeled target examples in the first adversarial step. Second, they update the feature extractor to minimize the entropy of the unlabeled examples to make them better clustered around the prototypes. This process is formulated as a mini-max game between the weight vectors and the feature extractor and applied over the unlabeled target examples. The overview of their approach is shown below.
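The quantity that both players act on is the entropy of the predictions on unlabeled target examples: the weight vectors are updated to increase it and the feature extractor to decrease it. The sketch below only computes that entropy for two hypothetical batches of predictions; the gradient updates themselves are left out, and all names are illustrative.

```python
import numpy as np

def mean_prediction_entropy(probs, eps=1e-12):
    """Average Shannon entropy (in nats) of a batch of class-probability rows."""
    return float(-(probs * np.log(probs + eps)).sum(axis=1).mean())

# Hypothetical predictions on unlabeled target samples.
uncertain = np.array([[0.5, 0.5], [0.6, 0.4]])      # far from any prototype
clustered = np.array([[0.99, 0.01], [0.02, 0.98]])  # tightly grouped around a prototype

h_uncertain = mean_prediction_entropy(uncertain)
h_clustered = mean_prediction_entropy(clustered)
```

Low entropy corresponds to target features that are well clustered around the prototypes, which is the state the feature extractor is pushed toward.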
Weakly Supervised Domain Adaptation (WSDA) refers to a problem setting wherein only “weak labels” are available in the target domain. For example, in a semantic segmentation domain adaptation problem, ground truth masks may be unavailable in the target domain while the categories of the objects to be segmented are known.
So, here the category labels are the weak labels (and the ground truth segmentation masks are called “hard labels”).
This paper proposed a WSDA approach for 3D Hand Pose Estimation on Hand Object Interaction (HOI) data. For this, the authors trained a domain adaptation network using 2D object segmentation masks and 3D pose labels for hand-only data. With this information, the network needs to annotate (estimate the hand poses) the HOI data.
They achieved the domain adaptation through two forms of guidance in image space (though both could also be applied in feature space). Two image generation methods were investigated and combined: a generative adversarial network and a mesh renderer that uses estimated 3D meshes and textures. As a result, input HOI images are transformed into segmented and de-occluded hand-only images, which effectively improves hand pose estimation (HPE) accuracy. The overview of their approach is shown below:
In Unsupervised Domain Adaptation (UDA), labels of any kind (weak or hard) are entirely absent for the target domain data. A model trained on source domain data must adapt to the target domain on its own.
One such UDA method is proposed in this paper, where the authors develop a new Residual Transfer Network (RTN) approach to domain adaptation in deep networks, which can simultaneously learn adaptive classifiers and transferable features. They relax the shared-classifier assumption made by previous methods and assume that the source and target classifiers differ by a small residual function. The schematic of their approach is shown below.
Classifier adaptation is enabled in this method by plugging several layers into deep networks to explicitly learn the residual function with reference to the target domain classifier. In this way, the source domain classifier and target domain classifier can be bridged tightly in the back-propagation procedure.
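The residual relationship between the two classifiers can be written as source classifier = target classifier + small residual. The NumPy sketch below illustrates only this structure; the layer sizes, the random weights, and the two-layer residual block are hypothetical, and no training is performed.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_classes = 8, 3

# Hypothetical target classifier: a single linear layer.
W_target = rng.normal(size=(n_features, n_classes))
# Hypothetical residual block: two small layers applied to the target output.
W_res1 = 0.01 * rng.normal(size=(n_classes, n_classes))
W_res2 = 0.01 * rng.normal(size=(n_classes, n_classes))

def target_classifier(x):
    return x @ W_target

def residual(logits):
    # ReLU between the two residual layers.
    return np.maximum(0.0, logits @ W_res1) @ W_res2

def source_classifier(x):
    # The source classifier is the target classifier plus a learned residual.
    f_t = target_classifier(x)
    return f_t + residual(f_t)

x = rng.normal(size=(4, n_features))
gap = np.abs(source_classifier(x) - target_classifier(x)).max()
```

With small residual weights the two classifiers stay close, which mirrors the assumption that they differ only by a small residual function.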
The target domain classifier is tailored to the target domain data by exploiting the low-density separation criterion. The features of multiple layers are then fused with the tensor product and embedded into a reproducing kernel Hilbert space to match distributions for feature adaptation.
Domain Adaptation can be categorized into Homogeneous and Heterogeneous DA based on different domain divergences. Both of these DA methods further have their own supervised, semi-supervised, and unsupervised categories.
Homogeneous DA refers to a problem where the feature spaces of the source and target domains are identical with identical dimensionality, and the difference lies in only the data distribution.
Homogeneous DA considers that source and target domain data are collected using the same type of features, that is, cross-domain data are observed in the same feature space but exhibit different distributions. Thus, this is also called a “distribution-shift” type Domain Adaptation problem.
In Heterogeneous DA problems, the source and target domains are non-equivalent and might have different feature space dimensionality. In heterogeneous DA, cross-domain data are described by different types of features and thus exhibit distinct distributions (for example, training and test image data with different resolutions or encoded by different codebooks). It is thus also known as a “feature space difference” type DA problem and is a much more challenging problem than Homogeneous DA.
One such Heterogeneous DA method is devised in this paper, which addresses a semi-supervised DA problem. The authors propose a learning algorithm of Cross-Domain Landmark Selection (CDLS). The overview of their method is shown below:
Instead of viewing all cross-domain data to be equally important during adaptation, the CDLS model derives a heterogeneous feature transformation which results in a domain-invariant subspace for associating cross-domain data. In addition, the representative source and target domain data are jointly exploited to improve the adaptation capability of CDLS.
Once the adaptation process is complete, one can simply project cross-domain labeled and unlabeled target domain data into the derived subspace for performing recognition.
The final form of categorization of Domain Adaptation techniques is based on how the domain adaptation is achieved: most DA settings assume that the source and target domains are directly related; thus, transferring knowledge can be accomplished in one step. We call them One-Step DA.
In reality, however, this assumption does not always hold. When there is little overlap between the two domains, performing One-Step DA will not be effective. Fortunately, some intermediate domains can bring the source and target domains closer together than they originally are. We therefore use a series of intermediate bridges to connect two seemingly unrelated domains and perform One-Step DA across each bridge, an approach named Multi-Step (or transitive) DA.
For example, face and vehicle images differ in shape and other aspects, and thus one-step DA would fail. However, intermediate images, such as “football helmet” images, can serve as an intermediate domain and enable a smooth knowledge transfer.
Reweighting algorithms work on the principle of minimizing the distribution difference by reweighting the source data and then training a classifier on the reweighted source data. This decreases the importance of data belonging to the source-only classes.
The target and reweighted source domain data are used to train the feature extractor by adversarial training or kernel mean matching to align distributions. Such methods are also called “Instance-based Adaptation.”
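The effect of reweighting can be seen in a deliberately simplified example. The sketch below assumes that both domain densities are known 1-D Gaussians, so the weights are an exact density ratio; this is a stand-in for illustration only, not kernel mean matching or adversarial training, and every name and number in it is hypothetical.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(0)
source = rng.normal(loc=0.0, scale=1.0, size=20000)  # source domain: N(0, 1)

# Assumption: the target density N(1, 1) is known, so the importance
# weight of each source sample is simply p_target(x) / p_source(x).
weights = gaussian_pdf(source, 1.0, 1.0) / gaussian_pdf(source, 0.0, 1.0)
weights /= weights.sum()

plain_mean = source.mean()                 # reflects the source distribution
reweighted_mean = np.sum(weights * source) # moves toward the target mean of 1
```

Source samples that look like target data receive large weights, so statistics computed on the reweighted source data resemble those of the target domain.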
One such technique has been proposed in this paper where the authors propose an adversarial reweighting (AR) approach, which adversarially learns to reweight the source domain data for aligning the distributions of the source and target domains.
Specifically, their approach relies on adversarial training to learn the weights of source domain data to minimize the Wasserstein distance between the reweighted source domain and target domain distributions.
Their workflow is as follows.
As the name suggests, Iterative Adaptation methods aim to “auto-label” target domain data iteratively. However, these methods generally require some labeled target samples, so they are suited to supervised and semi-supervised DA settings. Here, a deep model trains on the labeled source domain data and annotates the unlabeled target samples.
Then a new model is learned from the new target domain labeled samples.
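The label-then-retrain loop can be sketched with a toy nearest-centroid classifier standing in for the deep model. Everything here is an assumption made for illustration: the synthetic 2-D data, the size of the domain shift, the confidence score, and the threshold of 1.0. For simplicity the sketch uses no labeled target samples during training; the true target labels are used only to report accuracy.

```python
import numpy as np

def fit_centroids(x, y, n_classes, fallback):
    """Class means; reuse the previous centroid if a class has no samples."""
    return np.stack([x[y == c].mean(axis=0) if np.any(y == c) else fallback[c]
                     for c in range(n_classes)])

def predict(x, centroids):
    """Nearest-centroid label and a confidence score (gap between the two nearest)."""
    dist = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=-1)
    ranked = np.sort(dist, axis=1)
    return dist.argmin(axis=1), ranked[:, 1] - ranked[:, 0]

rng = np.random.default_rng(0)
n = 200
centers = np.array([[-2.0, 0.0], [2.0, 0.0]])
shift = np.array([1.5, 0.0])                      # hypothetical domain shift
labels = np.repeat([0, 1], n)
source_x = centers[labels] + 0.5 * rng.normal(size=(2 * n, 2))
target_x = centers[labels] + shift + 0.5 * rng.normal(size=(2 * n, 2))
target_y = labels                                 # held out, evaluation only

# Step 1: train on labeled source data.
centroids = fit_centroids(source_x, labels, 2, fallback=centers)
acc_before = (predict(target_x, centroids)[0] == target_y).mean()

# Step 2: repeatedly auto-label confident target samples and refit on them.
for _ in range(3):
    pseudo, confidence = predict(target_x, centroids)
    keep = confidence > 1.0                       # hypothetical threshold
    centroids = fit_centroids(target_x[keep], pseudo[keep], 2, fallback=centroids)

acc_after = (predict(target_x, centroids)[0] == target_y).mean()
```

Only confidently labeled target samples are used for refitting, which limits how much early labeling mistakes propagate into later rounds.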
This paper, for example, presents an algorithm that seeks to slowly adapt its training set from the source to the target domain, using ideas from co-training. The authors accomplish this in two ways: first, they train on their own output in rounds, where at each round, they include in the training data the target instances they are most confident of.
Second, they select a subset of shared source and target features based on their compatibility (measured across the training and unlabeled sets). As more target instances are added to the training set, target-specific features become compatible across the two sets and are therefore included in the predictor.
Feature-based Adaptation techniques aim to map the source data into the target data by learning a transformation that extracts invariant feature representation across domains. They usually create a new feature representation by transforming the original features into a new feature space and then minimizing the gap between domains in the new representation space in an optimization procedure while preserving the underlying structure of the original data.
Feature-based Adaptation methods can be further categorized into the following:
The Hierarchical Bayesian Model for Domain Adaptation proposed in this paper is named so because of its use of a hierarchical Bayesian prior, through which the domain-specific parameters are tied. Such a model aims to derive domain-dependent latent representations allowing both domain-specific and globally shared latent factors.
This hierarchical Bayesian prior encourages features to have similar weights across domains unless there is good contrary evidence. Hierarchical Bayesian frameworks are a more principled approach for transfer learning compared to approaches that learn parameters of each task/distribution independently and smooth parameters of tasks with more information towards coarser-grained ones.
Most recent domain adaptation methods align source and target domains by creating a domain invariant feature representation, typically in the form of a feature extractor neural network. A feature representation is domain-invariant if the features follow the same distribution regardless of whether the input data is from the source or target domain. Suppose we can train a classifier to perform well on the source data using domain-invariant features. In that case, the classifier may generalize well to the target domain since the features of the target data match those on which we trained the classifier.
The general training and testing setup of these methods is illustrated above. Methods differ in how they align the domains (the Alignment Component in the figure). Some minimize divergence, some perform reconstruction, and some employ adversarial training. In addition, they differ in weight-sharing choices. The various alignment methods will be elaborated on next.
Divergence-based DA techniques aim to minimize some divergence criteria between the source and target domain data distributions. Four choices used in various domain adaptation approaches are maximum mean discrepancy, correlation alignment, contrastive domain discrepancy, and the Wasserstein metric.
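Two of these criteria are short enough to sketch directly. The code below gives a biased estimate of the squared maximum mean discrepancy with an RBF kernel, and a correlation alignment penalty based on the difference of feature covariances. The kernel width `gamma`, the scaling of the covariance penalty, and the synthetic data are assumptions of this sketch; in practice both quantities would be computed on learned features and minimized during training.

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dist)

def mmd_squared(source, target, gamma=0.5):
    """Biased estimate of the squared maximum mean discrepancy."""
    return float(rbf_kernel(source, source, gamma).mean()
                 + rbf_kernel(target, target, gamma).mean()
                 - 2.0 * rbf_kernel(source, target, gamma).mean())

def coral_loss(source, target):
    """Squared Frobenius distance between feature covariances, scaled by 1 / (4 d^2)."""
    d = source.shape[1]
    cov_s = np.cov(source, rowvar=False)
    cov_t = np.cov(target, rowvar=False)
    return float(((cov_s - cov_t) ** 2).sum() / (4.0 * d * d))

rng = np.random.default_rng(0)
source = rng.normal(size=(300, 2))
target_same = rng.normal(size=(300, 2))           # same distribution as source
target_shifted = rng.normal(size=(300, 2)) + 2.0  # mean shift
target_scaled = 3.0 * rng.normal(size=(300, 2))   # covariance change

mmd_same = mmd_squared(source, target_same)
mmd_shifted = mmd_squared(source, target_shifted)
coral_same = coral_loss(source, target_same)
coral_scaled = coral_loss(source, target_scaled)
```

Note that the covariance-based penalty responds to the change in scale but is blind to a pure mean shift, whereas the kernel-based discrepancy picks up both.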
One such framework is proposed in this paper, where the DA is performed by aligning infinite-dimensional covariance matrices (descriptors) across domains. More specifically, the authors first map the original features to a Reproducing Kernel Hilbert Space (RKHS) and then use a linear operator in the resulting space to “move” the source data to the target domain such that the RKHS covariance descriptors of the transformed data and target data are close. By computing the pairwise inner products between the transformed and target samples, the authors obtain a new domain-invariant kernel matrix with a closed-form expression, which can be used in any kernel-based learning machine.
Rather than minimizing a divergence, alignment can be accomplished by learning a representation that both classifies the labeled source domain data well and can be used to reconstruct either the target domain data or both the source and target domain data. The alignment component in these setups is a reconstruction network, the opposite of the feature extractor network: it takes the feature extractor’s output and recreates the feature extractor’s input.
This paper proposes Deep Reconstruction Classification Networks (DRCN), a convolutional network that jointly learns two tasks: (i) supervised source label prediction and (ii) unsupervised target data reconstruction. The aim is for the learned label prediction function to perform well when classifying images in the target domain; the data reconstruction can thus be viewed as an auxiliary task that supports the adaptation of the label prediction.
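One way to picture the two tasks is as a pair of losses that share a feature extractor: a classification loss on labeled source data and a reconstruction loss on unlabeled target data. The sketch below combines them with a convex weight; the function names, the mean-squared reconstruction error, and the `trade_off` value are assumptions of this illustration, and the network producing the logits and reconstructions is not shown.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy for integer class labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def reconstruction_error(inputs, reconstructions):
    """Mean squared error between inputs and their reconstructions."""
    return float(((inputs - reconstructions) ** 2).mean())

def joint_objective(src_logits, src_labels, tgt_inputs, tgt_recon, trade_off=0.7):
    # Supervised term on source data, unsupervised term on target data.
    return (trade_off * cross_entropy(src_logits, src_labels)
            + (1.0 - trade_off) * reconstruction_error(tgt_inputs, tgt_recon))

# Hypothetical network outputs for one source and one target sample.
src_logits = np.array([[0.0, 0.0]])
src_labels = np.array([0])
tgt_inputs = np.zeros((1, 4))
tgt_recon = np.ones((1, 4))
loss = joint_objective(src_logits, src_labels, tgt_inputs, tgt_recon, trade_off=0.5)
```

Because the reconstruction term needs no labels, it lets target data shape the shared representation even though only source data is annotated.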
The network learning mechanism in DRCN alternates between unsupervised and supervised training, which is different from the standard pretraining-fine tuning strategy. The illustration of their framework is shown below.
Adversarial adaptation methods have become an increasingly popular incarnation of approaches that seek to minimize an approximate domain discrepancy distance through an adversarial objective with respect to a domain discriminator. These methods are closely related to generative adversarial learning, which pits two networks against each other: a generator and a discriminator. Adversarial domain adaptation approaches aim to minimize the distribution discrepancy between domains to obtain transferable and domain-invariant features.
Several varieties of feature-level adversarial domain adaptation methods have been introduced in the Computer Vision literature. In most cases, the alignment component consists of a domain classifier (a classifier that outputs whether a feature representation was generated from source or target data). For example, this alignment component may be represented by a network learning an approximate Wasserstein distance, or it may be a GAN (Generative Adversarial Network) model.
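The two opposing objectives in such a setup can be sketched as follows. The domain classifier is trained with a binary cross-entropy loss (source labelled 1, target labelled 0), while the feature extractor is trained to make target features look like source features. The labelling convention, the inverted-label form of the extractor's loss, and the raw scores used here are assumptions of this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def domain_classifier_loss(src_scores, tgt_scores, eps=1e-12):
    """Binary cross-entropy with source labelled 1 and target labelled 0."""
    p_src = sigmoid(src_scores)
    p_tgt = sigmoid(tgt_scores)
    return float(-(np.log(p_src + eps).mean() + np.log(1.0 - p_tgt + eps).mean()) / 2.0)

def confusion_loss(tgt_scores, eps=1e-12):
    """Feature-extractor loss: target features should be scored as 'source'."""
    return float(-np.log(sigmoid(tgt_scores) + eps).mean())

# Hypothetical domain-classifier scores (logits) on feature representations.
separable_src, separable_tgt = np.full(8, 5.0), np.full(8, -5.0)
aligned_src, aligned_tgt = np.zeros(8), np.zeros(8)

d_loss_separable = domain_classifier_loss(separable_src, separable_tgt)
d_loss_aligned = domain_classifier_loss(aligned_src, aligned_tgt)
g_loss_separable = confusion_loss(separable_tgt)
```

When the features are domain-invariant, the domain classifier can do no better than chance, which is the state adversarial training tries to reach.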
This paper proposed the Cycle-Consistent Adversarial Domain Adaptation (CyCADA) model, which adapts representations at both the pixel-level and feature-level while enforcing semantic consistency. The authors enforce both structural and semantic consistency during adaptation using a cycle-consistency loss (that is, the source should match the source mapped to the target mapped back to the source) and semantic losses based on a particular visual recognition task. The semantic losses both guide the overall representation to be discriminative and enforce semantic consistency before and after mapping between domains. Their workflow diagram is shown below.
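The cycle-consistency idea (source mapped to target and back should match the original source) can be sketched with two stand-in translation functions. The L1 form of the loss and the simple arithmetic "translators" below are assumptions for illustration; in the real setting both mappings are learned image-to-image networks.

```python
import numpy as np

def cycle_consistency_loss(x, to_target, to_source):
    """Mean absolute error between x and its source -> target -> source round trip."""
    return float(np.abs(to_source(to_target(x)) - x).mean())

# Hypothetical stand-ins for the two learned translators.
def to_target(x):
    return 2.0 * x + 1.0

def good_inverse(y):
    return (y - 1.0) / 2.0   # undoes to_target exactly

def poor_inverse(y):
    return y / 2.0           # forgets the offset, so the round trip drifts

images = np.random.default_rng(0).uniform(size=(4, 8, 8, 3))
loss_good = cycle_consistency_loss(images, to_target, good_inverse)
loss_poor = cycle_consistency_loss(images, to_target, poor_inverse)
```

A low value indicates that translation preserves enough content for the original image to be recovered.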
An alternative to creating a domain-invariant feature representation is mapping from one domain to another. The mapping is typically created adversarially and at the pixel level (that is, pixel-level adversarial domain adaptation). This mapping can be accomplished with a conditional GAN.
The generator performs adaptation at the pixel level by translating a source input image to an image that closely resembles the target distribution. For example, the GAN could change from a synthetic vehicle driving image to one that looks realistic.
For example, this paper proposed the StarGAN model, which is a generative adversarial network capable of learning mappings among multiple domains.
Contrary to existing approaches for multi-domain translation, which use several separate generator and discriminator networks for cross-domain translation, the StarGAN model takes in training data of multiple domains and learns the mappings between all available domains using only one generator. The representative diagram of the StarGAN model is shown below.
Instead of learning a fixed translation, StarGAN takes both image and domain information as inputs and learns to translate the input image into the corresponding domain flexibly. StarGAN uses a label (e.g., binary or one-hot encoding vector) to represent the domain information.
During training, a target domain label is randomly generated, and the model is trained to translate an input image into the target domain flexibly. By doing so, the authors can control the domain label and translate the image into any desired domain at the testing phase.
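Feeding both the image and the domain information to a single generator can be sketched as a simple input-construction step: the one-hot domain label is replicated over the spatial dimensions and concatenated to the image channels. The function name, the channels-last layout, and the number of domains are assumptions of this sketch; the generator itself is not shown.

```python
import numpy as np

def condition_on_domain(image, domain_index, n_domains):
    """Append a spatially replicated one-hot domain label: (H, W, C) -> (H, W, C + n_domains)."""
    h, w, _ = image.shape
    one_hot = np.zeros(n_domains, dtype=image.dtype)
    one_hot[domain_index] = 1.0
    label_maps = np.broadcast_to(one_hot, (h, w, n_domains))
    return np.concatenate([image, label_maps], axis=-1)

rng = np.random.default_rng(0)
n_domains = 3
image = rng.uniform(size=(4, 4, 3))

# During training, the target domain label is sampled at random.
target_domain = int(rng.integers(n_domains))
generator_input = condition_on_domain(image, target_domain, n_domains)
```

At test time the same function would be called with whichever `domain_index` the user wants the image translated into.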
Given a base model such as a deep neural network, an ensemble consisting of multiple models can often outperform a single model by averaging together the models’ outputs (e.g., regression) or taking a vote (e.g., classification). This is because if the models are diverse, then each individual model will likely make different mistakes. However, this performance gain comes with an increase in computation cost, because many models must be evaluated for each ensemble prediction.
An alternative to using multiple base models as the ensemble is to use a single model but “evaluate” (via a history or average of the network) the models in the ensemble at multiple points in time during training, a technique called self-ensembling.
This can be done by averaging over past predictions for each example (by recording previous predictions) or past network weights (by maintaining a running average). This reduces the computational cost severalfold.
One such method was proposed in this paper, where the authors used self-ensembling for unsupervised Domain Adaptation. They use two networks: a student network and a teacher network. Input images are fed first to stochastic data augmentation (Gaussian noise, translations, horizontal flips, affine transforms, etc.) before being input to both networks.
Because the method is stochastic, the augmented images fed to the networks will differ. The student network is trained with gradient descent, while the teacher network weights are an exponential moving average (EMA) of the student network’s weights.
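The teacher update is a one-line exponential moving average over the student's weights. In the sketch below, the weight dictionaries, the `decay` value, and the fixed student weights are hypothetical; in real training the student would change after every gradient step.

```python
import numpy as np

def ema_update(teacher, student, decay=0.99):
    """Return teacher weights moved a small step toward the student weights."""
    return {name: decay * teacher[name] + (1.0 - decay) * student[name]
            for name in teacher}

student = {"w": np.ones(3)}
teacher = {"w": np.zeros(3)}

for _ in range(2):
    # A gradient step on the student would happen here in real training.
    teacher = ema_update(teacher, student, decay=0.9)
```

Because the teacher averages many past versions of the student, it behaves like an ensemble over training time without requiring extra models to be stored or trained.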
One assumption that has led to successes in semi-supervised learning algorithms is the cluster assumption: that data points are distributed in separate clusters, and the samples in each cluster have a common label. If this is the case, then decision boundaries should lie in low-density regions (that is, they should not pass through regions where there are many data points).
A variety of domain adaptation methods have been explored to move decision boundaries into regions of lower density. These have typically been trained adversarially.
For example, this paper proposes an unsupervised DA method called “Adversarial Discriminative Domain Adaptation” (ADDA).
ADDA first learns a discriminative representation using the labels in the source domain, and then a separate encoding that maps the target data to the same space using an asymmetric mapping learned through a domain-adversarial loss. Their architecture is shown below.
With the increasing amount of data being available in this era of Information Technology, it is difficult to annotate all of them and train existing models using the new data. Domain Adaptation provides an intelligent solution to this problem by enabling the re-use of pre-trained models on a new statistical distribution of data belonging to a related domain without having to train deep models from scratch.
Active research in Domain Adaptation is being pursued to enhance the capabilities of DA algorithms for tasks like bi-directional adaptation, improved image-to-image translation, and style transfer. Several families of techniques exist to address the domain shift problem, each with its own strengths and weaknesses. Thus, DA is an important area in Computer Vision that attempts to close the gap between Artificial Intelligence and humans through the kind of effective and efficient knowledge transfer that we humans employ every day.