“Generative Adversarial Networks is the most interesting idea in the last ten years in Machine Learning.” — Yann LeCun, Director of AI Research at Facebook AI.
A GAN is about creating, like drawing a portrait or composing a symphony from scratch, which is hard compared to other deep learning tasks. It is much easier to identify a Van Gogh painting than to paint one, whether you are a computer or a person.
But it brings us closer to understanding intelligence.
GANs have attracted wide attention from the research community because of their potential. They have played a major role in solving data generation problems across domains like image, audio, video, and text.
Now, without further ado, let’s dive in!
A generative adversarial network, or GAN, is a deep neural network framework that can learn from training data and generate new data with the same characteristics as the training data. For example, generative networks trained on photographs of human faces can generate realistic-looking faces which are entirely fictitious.
Generative adversarial networks consist of two neural networks, the generator and the discriminator, which compete against each other. The generator is trained to produce fake data, and the discriminator is trained to distinguish the generator’s fake data from actual examples.
Intuitively, the generator maps random noise through a model to produce a sample, and the discriminator decides whether the sample is real or not.
The image below shows how GANs are trained. There are two fundamental blocks in a GAN: the generator and the discriminator.
Machine learning models can be classified into two types: Discriminative and Generative.
A discriminative model makes predictions on the unseen data based on conditional probability and can be used for classification or regression problems.
A generative model learns the underlying distribution of a dataset, which lets it assign a probability to a given example.
Let us understand the difference through an example. Suppose we want to classify whether an email is spam or not. Let’s model the problem.
Problem Formulation
We have a dataset of n emails. Each email has m features {f1, f2, f3, …, fm} and one label, so there are n labels in total.
The joint distribution of the model can be represented as
P(Y, X) = P(y, f1, f2, …, fm)
We aim to estimate the probability of spam email, i.e., P(Y=1|X). Both generative and discriminative models can solve this problem but in different ways.
Solving through Generative Models
In the case of generative models, we can solve it using the Bayes theorem.
To find the conditional probability P(Y|X), they estimate the prior probability P(Y) and the likelihood P(X|Y) from the training data and use the Bayes theorem to calculate the posterior probability P(Y|X):
P(Y|X) = P(X|Y) P(Y) / P(X)
The posterior probability is then used to classify a mail as spam or not. It combines how common spam is in the existing mail (the prior) with how likely the mail’s features are under each class (the likelihood).
Some generative models are Naïve Bayes, Gaussians, HMM, Mixture of Gaussians, Bayesian networks, Markov Random Fields, and multinomials.
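To make the Bayes step concrete, here is a minimal sketch of the posterior calculation for the spam example. The function name `posterior_spam` and all the probabilities are made up for illustration; in practice P(Y) and P(X|Y) would be estimated from the training data.

```python
def posterior_spam(prior_spam, lik_spam, lik_ham):
    """Return P(Y=1 | X) using the Bayes theorem.

    prior_spam: P(Y=1), the fraction of mail that is spam
    lik_spam:   P(X | Y=1), likelihood of the features if the mail is spam
    lik_ham:    P(X | Y=0), likelihood of the features if it is not spam
    """
    joint_spam = lik_spam * prior_spam
    joint_ham = lik_ham * (1.0 - prior_spam)
    # P(X) is the sum of the joint probabilities over both classes.
    return joint_spam / (joint_spam + joint_ham)


# Illustrative numbers only: 20% of mail is spam, and these features are
# 50 times more likely to appear in spam than in regular mail.
p = posterior_spam(prior_spam=0.2, lik_spam=0.05, lik_ham=0.001)
print(f"P(spam | features) = {p:.3f}")
```

Note that the denominator P(X) never has to be modeled separately here. It falls out of the two class-conditional terms.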
Solving through Discriminative Models
Discriminative models find the probability by directly assuming some functional form for P(Y|X) and then estimating its parameters from the training data.
A discriminative model learns the decision boundary between the classes, that is, the conditional probability distribution P(Y|X). Some examples of discriminative models are logistic regression, SVMs, ANN, KNN and Conditional Random Fields.
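As a contrast with the generative route, here is a small sketch of the discriminative approach using logistic regression, which assumes P(Y=1|X) = sigmoid(w*x + b) and fits w and b directly. The single feature (a count of suspicious words), the toy dataset, and the hyperparameters are all assumptions chosen for illustration.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """Fit P(Y=1 | x) = sigmoid(w*x + b) by gradient descent on cross-entropy."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # derivative of the loss w.r.t. the logit
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


# Hypothetical feature: number of suspicious words in each email.
xs = [0, 1, 2, 3, 4, 5]
ys = [0, 0, 0, 1, 1, 1]  # 1 = spam
w, b = fit_logistic(xs, ys)
p_clean = sigmoid(w * 0 + b)
p_spammy = sigmoid(w * 5 + b)
print(f"P(spam | 0 words) = {p_clean:.3f}, P(spam | 5 words) = {p_spammy:.3f}")
```

Nothing in this model describes how the emails themselves are distributed. It only learns where the boundary between the two classes lies.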
The question a generative algorithm tries to answer is: Assuming this email is spam, how likely are these features? While discriminative models care about the relation between y and x, generative models care about “how you get x.”
They allow you to capture p(x|y), the probability of x given y, or the probability of features given a label or category. (That said, generative algorithms can also be used as classifiers. It just so happens that they can do more than categorize input data.)
Explicit likelihood models are defined by an explicit specification of the density, and so their unnormalized complete likelihood can usually be expressed in closed form.
Implicit probabilistic models are defined naturally in a sampling procedure and often induce a likelihood function that cannot be expressed explicitly.
Generative models aim for a complete probabilistic description of the data. With these models, the goal is to construct the joint probability distribution P(X, Y) – either directly or by first computing P(X | Y) and P(Y) – and then inferring the conditional probabilities required to classify new data.
As per the below image, neural network architectures like GANs fall under the implicit model category as they can sample from latent space but do not evaluate the likelihood for the sample. We also lack the understanding of why they work!
Here are some examples of generative models:
Pros:
Cons:
Generative models are more complex to train than discriminative models because it is hard to model real-world data distribution and class conditional probabilities accurately.
Now, let’s discuss the architecture of GANs and how they work.
A GAN is composed of two deep networks: the generator and the discriminator. We will first examine how a generator creates images before learning how to train it.
First, we sample some noise z using a normal or uniform distribution. With z as an input, we use a generator G to create an image x (x=G(z)).
I know it sounds unreal. Let me simplify it for you.
Conceptually, z represents the latent features of the images generated, for example, the color and the shape. In deep learning classification, we don’t control the features the model is learning. Similarly, in a GAN, we don’t control the semantic meaning of z.
We let the training process learn it, i.e., we do not control which element of z determines, for example, the color of a generated cup.
To discover its meaning, the most effective way is to plot the generated images and examine them ourselves. But a generator alone will create random noise. Conceptually, the discriminator in a GAN guides the generator on what images to create.
A GAN builds a discriminator to learn what features make images real by training with real and generated images. Then the same discriminator provides feedback to the generator so that it creates realistic images.
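The sampling step x = G(z) can be sketched in a few lines. This is only a toy: `make_toy_generator` is a hypothetical helper that builds a single untrained linear layer with a tanh, and the sizes (4 noise values, 6 output "pixels") are arbitrary. A real generator is a deep network, but the data flow is the same.

```python
import math
import random


def sample_noise(dim, rng):
    """Draw z from a standard normal distribution."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def make_toy_generator(noise_dim, out_dim, rng):
    """Build an untrained one-layer generator G(z) = tanh(Wz + b)."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(noise_dim)]
               for _ in range(out_dim)]
    biases = [0.0] * out_dim

    def G(z):
        return [math.tanh(sum(w * zi for w, zi in zip(row, z)) + b)
                for row, b in zip(weights, biases)]

    return G


rng = random.Random(0)
G = make_toy_generator(noise_dim=4, out_dim=6, rng=rng)
z = sample_noise(4, rng)
x = G(z)  # a fake "image" with 6 pixel values in [-1, 1]
print(x)
```

Because the weights are random, the output here is noise shaped like an image, which is exactly why the generator needs the discriminator's feedback.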
Let me walk you through the training process. We will go through the equations and algorithms mentioned in the paper in detail.
We train the discriminator just like a deep network classifier. If the input is real, we want D(x)=1. If it is generated, we want D(x)=0. Through this process, the discriminator identifies features that contribute to real images.
On the other hand, we want the generator to create images with D(x) = 1 (matching the real image). So we can train the generator by backpropagating this target value all the way back to the generator, i.e. we train the generator to create images that move towards what the discriminator thinks is real.
We train both networks in alternating steps and lock them into a fierce competition to improve themselves. Eventually, the discriminator identifies the tiny difference between the real and the generated, and the generator creates images that the discriminator cannot tell apart from real ones. Ideally, the GAN model converges and produces natural-looking images.
This discriminator concept can be applied to many existing deep learning applications also. The discriminator in GAN acts as a critic. We can plug the discriminator into existing deep learning solutions to provide feedback to make it better.
The discriminator outputs a value D(x), indicating the chance that x is a real image. Our objective is to maximize the chance to recognize real images as real and generated images as fake.
To measure the loss, we use cross-entropy. For a real image, p (the true label for real images) equals 1. For generated images, we reverse the label (i.e. one minus label). So the discriminator’s objective becomes:
max_D V(D) = E_{x ~ p_data(x)}[log D(x)] + E_{z ~ p_z(z)}[log(1 − D(G(z)))]
On the generator side, its objective function wants the model to generate images with the highest confidence value of D(x) to fool the discriminator.
GAN loss functions are modeled as a minimax game where the performance of the generator and the discriminator is co-dependent. According to the original GAN paper, the discriminator and the generator play the following two-player minimax game with value function V(D, G):
min_G max_D V(D, G) = E_{x ~ p_data(x)}[log D(x)] + E_{z ~ p_z(z)}[log(1 − D(G(z)))]
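Here is a small sketch of how the value function can be estimated from a batch of discriminator outputs. The helper `value_function` and the example scores are illustrative assumptions. The discriminator tries to push this number up, while the generator tries to push it down.

```python
import math


def value_function(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on real samples, each in (0, 1)
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1)
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term


# A discriminator that separates real from fake well gives a value close to 0.
v_confident = value_function(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])

# A fully fooled discriminator outputs 0.5 everywhere, giving -log(4).
v_fooled = value_function(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])

print(v_confident, v_fooled)
```

The value of -log(4) for the fooled discriminator matches the equilibrium described in the original paper, where the discriminator can do no better than guessing.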
Once both objective functions are defined, they are trained jointly in an alternating fashion using gradient descent.
We train both networks in alternating steps until the generator produces good-quality images. The following summarizes the data flow and the gradients used for the backpropagation.
When we model a system that rewards two contrary outcomes simultaneously, it can lead to equilibrium issues and make model convergence tricky. We will see later in this article how this problem is addressed in recent versions of GANs.
Pseudo Code for training GANs (from original paper)
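The alternating scheme from the paper (k discriminator steps by gradient ascent, then one generator step by gradient descent) can be sketched on a one-dimensional toy problem. Everything specific here is an assumption made to keep the example tiny and dependency-free: the real data is drawn from N(4, 1), the generator is G(z) = a*z + b, the discriminator is D(x) = sigmoid(w*x + c), and the gradients are written out by hand. It is a sketch of the loop structure, not a recipe that is guaranteed to converge.

```python
import math
import random


def sigmoid(a):
    # Numerically stable logistic function.
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def train_toy_gan(steps=200, k=1, m=32, lr=0.05, seed=0):
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # generator parameters: G(z) = a*z + b
    w, c = 0.0, 0.0  # discriminator parameters: D(x) = sigmoid(w*x + c)
    for _ in range(steps):
        # k discriminator updates: ascend log D(x) + log(1 - D(G(z))).
        for _ in range(k):
            real = [rng.gauss(4.0, 1.0) for _ in range(m)]
            fake = [a * rng.gauss(0.0, 1.0) + b for _ in range(m)]
            grad_w = grad_c = 0.0
            for x in real:
                s = 1.0 - sigmoid(w * x + c)  # d log D(x) / d logit
                grad_w += s * x
                grad_c += s
            for x in fake:
                s = -sigmoid(w * x + c)  # d log(1 - D(x)) / d logit
                grad_w += s * x
                grad_c += s
            w += lr * grad_w / m
            c += lr * grad_c / m
        # One generator update: descend log(1 - D(G(z))).
        grad_a = grad_b = 0.0
        for _ in range(m):
            z = rng.gauss(0.0, 1.0)
            s = -sigmoid(w * (a * z + b) + c) * w  # chain rule through D
            grad_a += s * z
            grad_b += s
        a -= lr * grad_a / m
        b -= lr * grad_b / m
    return a, b, w, c


params = train_toy_gan()
print("a, b, w, c =", params)
```

With such a simple discriminator the parameters tend to oscillate rather than settle, which is a small-scale view of the equilibrium issues mentioned above.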
Generative Adversarial Network (GAN) and Variational Autoencoder (VAE) are popular models for generating images and sequences. As GAN and VAE share similar tasks, we might encounter the challenge of choosing between them in specific application scenarios.
In a nutshell, a VAE is an autoencoder whose encoding distribution is regularised during training to ensure that its latent space has good properties, allowing us to generate new data.
Moreover, the term “variational” comes from the close relationship between the regularisation and the variational inference method in statistics.
Two main components in VAE are
1. Encoder
2. Decoder
The encoder produces the “new features” representation from the “old features” representation (by selection or by extraction), and the decoder performs the reverse process.
Dimensionality reduction can then be interpreted as data compression where the encoder compresses the data (from the initial space to the encoded space, also called latent space) whereas the decoder decompresses them.
The reconstructed output from the decoder is a lossy version of the original input.
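To give a feel for the regularisation of the encoding distribution, here is a minimal sketch. It assumes the common VAE setup, which the text above does not spell out: the encoder outputs a mean and a log-variance per latent dimension, and the regulariser is the KL divergence between that Gaussian and a standard normal. The function names are hypothetical.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


def sample_latent(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


rng = random.Random(0)
mu, log_var = [1.0, -0.5], [0.0, 0.0]  # pretend encoder outputs
z = sample_latent(mu, log_var, rng)    # this is what the decoder would receive
penalty = kl_to_standard_normal(mu, log_var)
print(z, penalty)
```

The penalty is zero only when the encoding already matches a standard normal, so adding it to the reconstruction loss pulls the latent space towards a shape that is easy to sample from.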
As you have a clear picture of VAE, let’s look into some important differences between GANs and VAE.
DCGAN is a generative adversarial network architecture based on CNNs. It uses a couple of guidelines, in particular:
The generator takes in a 100x1 noise vector, denoted z, and maps it to the output G(z), which is a 64x64x3 image. The layer shapes go 100x1 → 1024x4x4 → 512x8x8 → 256x16x16 → 128x32x32 → 64x64x3.
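The doubling of the spatial size at every layer can be checked with the standard output-size formula for a transposed convolution. The kernel size 4, stride 2 and padding 1 used below are an assumption (a common choice in DCGAN implementations, not something stated above), and the sketch starts from the 4x4 feature map that follows the initial projection of z.

```python
def conv_transpose_out(size, kernel=4, stride=2, padding=1):
    """Output size of a transposed convolution (no dilation or output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


channels = [1024, 512, 256, 128, 3]  # feature maps per layer, as listed above
size = 4                             # spatial size after projecting z
shapes = [(channels[0], size, size)]
for c in channels[1:]:
    size = conv_transpose_out(size)
    shapes.append((c, size, size))

for shape in shapes:
    print(shape)  # (channels, height, width)
```

The shapes are printed channels-first, so the last one, (3, 64, 64), is the same 64x64x3 image as in the text.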
Generating high-resolution images is considered challenging for GAN models as the generator must learn how to output both overall structure and fine details simultaneously.
The primary contribution of the ProGAN paper is a training methodology for GANs where we start with low-resolution images, and then progressively increase the resolution by adding layers to the networks.
Progressive Growing GAN involves using a generator and discriminator model with the same general structure and starting with very small images, such as 4×4 pixels.
This approach allows the generation of high-quality images, such as 1024×1024 photorealistic faces of celebrities that do not exist.
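As a quick illustration of the schedule, the sketch below lists the resolutions visited when training starts at 4×4 and ends at 1024×1024. It assumes the resolution doubles at each stage, and `resolution_schedule` is a hypothetical helper name.

```python
def resolution_schedule(start=4, final=1024):
    """Resolutions visited when the image size doubles at every stage."""
    resolutions = []
    res = start
    while res <= final:
        resolutions.append(res)
        res *= 2
    return resolutions


stages = resolution_schedule()
print(stages)  # [4, 8, 16, 32, 64, 128, 256, 512, 1024]
```

Under that assumption, reaching 1024×1024 takes nine stages, with new layers added to both networks at each step up.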
A conditional generative adversarial network, or cGAN for short, is a type of GAN that involves the conditional generation of images by a generator model.
In cGANs, a conditional setting is applied, meaning that both the generator and discriminator are conditioned on some sort of auxiliary information (such as class labels or data from other modalities).
As a result, the ideal model can learn a multi-modal mapping from inputs to outputs by being fed with different contextual information.
You can control the generator's output at test time by giving the label for the image you want to generate.
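One simple way to condition the generator is to append a one-hot encoding of the label to the noise vector before feeding it in. This is a sketch of that idea only. The 100-dimensional noise and the 10 classes are assumptions, and real implementations often use learned label embeddings instead.

```python
import random


def one_hot(label, num_classes):
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def conditioned_input(z, label, num_classes):
    """Concatenate the noise vector with a one-hot label."""
    return z + one_hot(label, num_classes)


rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(100)]
g_in = conditioned_input(z, label=3, num_classes=10)
print(len(g_in))  # 110 values: 100 noise + 10 label
```

At test time, keeping z fixed and changing only `label` asks the generator for a different class with the same random "style".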
Pix2Pix is a Generative Adversarial Network, or GAN, model designed for general-purpose image-to-image translation.
Image-to-image translation is the problem of changing a given image in a specific or controlled way. Examples include translating a landscape photograph from day to night or a segmented image to a photograph.
cGANs are suitable for image-to-image translation tasks as we can condition on an input image and generate a corresponding output image.
Pix2Pix GAN provides a general purpose model and loss function for image-to-image translation.
The CycleGAN is a technique that involves the automatic training of image-to-image translation models without paired examples. The models are trained in an unsupervised manner using a collection of images from the source and target domain that do not need to be related in any way.
The CycleGAN is an extension of the GAN architecture that involves the simultaneous training of two generator models and two discriminator models.
Super-resolution (SR) means upsampling a low-resolution image to a higher resolution with minimal information distortion.
The generator network employs residual blocks, where the idea is to keep information from previous layers alive and allow the network to choose from more features adaptively.
Instead of adding random noise as the generator input, we pass the low-resolution image.
The discriminator network is pretty standard and works as a discriminator would work in a normal GAN.
The novel factor in SRGANs is the perceptual loss function. While the generator and discriminator will get trained based on the GAN architecture, SRGANs use the help of another loss function to reach their destination: the perceptual/content loss function.
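The idea behind a content loss is to compare images in a feature space rather than pixel by pixel. The sketch below uses a deliberately crude stand-in for the feature extractor: `toy_features` just takes differences between neighbouring pixels of a 1D "image". That extractor is purely an assumption for illustration. In practice the features come from a pretrained network.

```python
def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def toy_features(image):
    """Hypothetical feature extractor: differences between neighbouring pixels."""
    return [image[i + 1] - image[i] for i in range(len(image) - 1)]


def content_loss(generated, target, features=toy_features):
    """Mean squared error measured in feature space instead of pixel space."""
    return mse(features(generated), features(target))


high_res = [0.0, 0.0, 1.0, 1.0]   # a sharp edge
generated = [0.2, 0.2, 1.2, 1.2]  # same edge, slightly brighter overall

pixel_loss = mse(generated, high_res)
feature_loss = content_loss(generated, high_res)
print(pixel_loss, feature_loss)
```

The pixel loss penalises the brightness offset, while the feature-space loss is essentially zero because the edge structure is preserved. That difference in what gets penalised is the motivation for a perceptual loss.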
“DALL·E 2 is a new AI system that can create realistic images and art from a description in natural language.” - OpenAI
A simplified explanation of the DALL·E 2 algorithm:
If you are interested in a detailed explanation of the DALL·E 2 algorithm, check out the blog from one of the paper's co-authors.
Training GANs is a non-trivial problem because of the min-max game formulation of the GAN architecture.
Common failure modes in training GANs are:
Wasserstein loss: The Wasserstein loss is designed to prevent vanishing gradients even when you train the discriminator to optimality.
Modified minimax loss: The original GAN paper proposed a modification to minimax loss to deal with vanishing gradients.
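Both remedies can be illustrated with a few lines of arithmetic. The sketch below makes two assumptions: the discriminator output is sigmoid(logit), and the modified loss is the one from the original paper, where the generator maximises log D(G(z)) instead of minimising log(1 − D(G(z))). The function names and example scores are made up.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def minimax_generator_grad(logit):
    """Gradient of log(1 - sigmoid(logit)) with respect to the logit."""
    return -sigmoid(logit)


def modified_generator_grad(logit):
    """Gradient of -log(sigmoid(logit)) with respect to the logit."""
    return -(1.0 - sigmoid(logit))


def mean(values):
    return sum(values) / len(values)


def wasserstein_critic_loss(real_scores, fake_scores):
    """The critic minimises this: low scores for fakes, high scores for real data."""
    return mean(fake_scores) - mean(real_scores)


def wasserstein_generator_loss(fake_scores):
    """The generator minimises this: it wants high critic scores on its fakes."""
    return -mean(fake_scores)


# Early in training the discriminator easily rejects fakes (very negative logit).
logit = -6.0
g_minimax = minimax_generator_grad(logit)    # close to 0: almost no signal
g_modified = modified_generator_grad(logit)  # close to -1: strong signal

critic_loss = wasserstein_critic_loss([2.0, 3.0], [-1.0, 0.0])
gen_loss = wasserstein_generator_loss([-1.0, 0.0])
print(g_minimax, g_modified, critic_loss, gen_loss)
```

With the minimax loss, the generator gets almost no gradient exactly when it is losing badly, while the modified loss keeps the gradient near its maximum. The Wasserstein losses are linear in the critic's scores, so they do not saturate in this way either.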