
The Complete Guide to Generative Adversarial Networks [GANs]

Aug 3, 2022

Introduced in 2014 by Ian Goodfellow and his colleagues, GANs have been a hot research topic ever since, and for good reason. Read this guide to learn everything you need to know about GANs and their applications.

Deval Shah

Guest Author

“Generative Adversarial Networks is the most interesting idea in the last ten years in Machine Learning.” - Yann LeCun, Director of AI Research at Facebook AI.

GANs are about creating, like drawing a portrait or composing a symphony from scratch, which makes them harder than most other areas of deep learning. Recognizing a Van Gogh painting is much easier than painting one, for computers and people alike.

But it brings us closer to understanding intelligence.

GANs have drawn wide attention from the research community because of their potential, and they have played a major role in data generation problems across domains like image, audio, video, and text.

Here’s what we’ll cover:

What Are Generative Adversarial Networks?
Generative Models examples
How GANs work & model training
GANs vs Autoencoders vs VAEs
Popular GAN Variants
Issues with GANs

Now, without further ado, let’s dive in!

What Are Generative Adversarial Networks?

A generative adversarial network, or GAN, is a deep neural network framework that can learn from training data and generate new data with the same characteristics as the training data. For example, generative networks trained on photographs of human faces can generate realistic-looking faces which are entirely fictitious.

Generative adversarial networks consist of two neural networks, the generator and the discriminator, which compete against each other. The generator is trained to produce fake data, and the discriminator is trained to distinguish the generator’s fake data from actual examples.

Intuitively, the generator maps random noise through a model to produce a sample, and the discriminator decides whether the sample is real or not.

The image below shows how GANs are trained. There are two fundamental blocks in GANs.

Generator - The generator takes random noise as input and generates a data sample that ideally looks as if it were drawn from the input dataset. Throughout training, it tries to mimic the distribution of the input dataset.
Discriminator - The discriminator network is a binary classifier. Its input comes either from the input dataset or from the generator, and its task is to classify the sample as real or fake.

GANs training architecture

Discriminative vs Generative Models

Machine learning models can be classified into two types: Discriminative and Generative.

A discriminative model makes predictions on the unseen data based on conditional probability and can be used for classification or regression problems.

A generative model learns the underlying distribution of a dataset, so it can assign a probability to an example and generate new examples.

Let us understand the difference through an example. Suppose we want to classify whether an email is spam or not. Let’s model the problem.

Problem Formulation

We have a dataset of n emails. Each email is described by m features {f1, f2, f3, …, fm} and comes with one label, so there are n labels in total.

The joint distribution of the labels and features can be represented as

P(Y, X) = P(y, f1, f2, …, fm)

We aim to estimate the probability of spam email, i.e., P(Y=1|X). Both generative and discriminative models can solve this problem but in different ways.

Solving through Generative Models

In the case of generative models, we can solve it using the Bayes theorem.

To find the conditional probability P(Y|X), they estimate the prior probability P(Y) and likelihood probability P(X|Y) with the help of the training data and use the Bayes Theorem to calculate the posterior probability P(Y |X):
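For the spam example, Bayes' theorem in this notation reads as follows, where X stands for the feature vector (f1, …, fm):

```latex
P(Y \mid X) = \frac{P(X \mid Y)\, P(Y)}{P(X)},
\qquad
P(X) = \sum_{y} P(X \mid Y = y)\, P(Y = y)
```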

Discriminative and Generative Models approach

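The posterior probability then decides whether an email is spam: it combines how common spam is in the existing mail (the prior) with how likely the email's features are under the spam and non-spam classes (the likelihood).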
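To make the generative route concrete, here is a minimal sketch of a Naïve Bayes spam classifier. The toy emails, the two binary features, and the function names are made up for illustration. Treating the features as conditionally independent is the Naïve Bayes simplification, not something every generative model requires.

```python
from collections import Counter

# Made-up toy data: (binary feature tuple, label), where label 1 means spam.
EMAILS = [
    ((1, 1), 1), ((1, 0), 1), ((1, 1), 1),
    ((0, 0), 0), ((0, 1), 0), ((0, 0), 0), ((1, 0), 0),
]

def fit_naive_bayes(data, alpha=1.0):
    """Estimate the prior P(Y) and the likelihoods P(f_j = 1 | Y) from counts."""
    n_features = len(data[0][0])
    class_counts = Counter(label for _, label in data)
    prior = {y: count / len(data) for y, count in class_counts.items()}
    likelihood = {}
    for y, count in class_counts.items():
        likelihood[y] = []
        for j in range(n_features):
            ones = sum(x[j] for x, label in data if label == y)
            # Laplace smoothing keeps the estimates away from exactly 0 or 1.
            likelihood[y].append((ones + alpha) / (count + 2 * alpha))
    return prior, likelihood

def posterior_spam(x, prior, likelihood):
    """Bayes' theorem: P(Y=1 | X) = P(X | Y=1) P(Y=1) / P(X)."""
    joint = {}
    for y, p in prior.items():
        for j, value in enumerate(x):
            p_one = likelihood[y][j]
            p *= p_one if value == 1 else 1.0 - p_one
        joint[y] = p  # P(X, Y=y)
    return joint[1] / sum(joint.values())

prior, likelihood = fit_naive_bayes(EMAILS)
p_spam = posterior_spam((1, 1), prior, likelihood)
```

Because the model stores P(Y) and P(X|Y), the same fitted parameters could also be used to sample new feature vectors for a given class, which is what makes it generative.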

Generative Model

Some generative models are Naïve Bayes, Gaussians, HMM, Mixture of Gaussians, Bayesian networks, Markov Random Fields, and multinomials.

Solving through Discriminative Models

Discriminative models directly assume some functional form for P(Y|X) and then estimate the parameters of P(Y|X) with the help of the training data.

Discriminative Model

A discriminative model models the decision boundary between the classes and learns the conditional probability distribution p(y|x). Some examples of discriminative models are logistic regression, SVMs, artificial neural networks (ANNs), k-nearest neighbors (KNN), and Conditional Random Fields.
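For comparison, here is a minimal sketch of the discriminative route using logistic regression, which assumes the functional form P(Y=1|X) = sigmoid(w·x + b) and fits w and b directly. The toy emails, learning rate, and epoch count are illustrative assumptions.

```python
import math

# Made-up toy data: (binary feature tuple, label), where label 1 means spam.
EMAILS = [
    ((1, 1), 1), ((1, 0), 1), ((1, 1), 1),
    ((0, 0), 0), ((0, 1), 0), ((0, 0), 0), ((1, 0), 0),
]

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def predict_spam(x, weights, bias):
    """P(Y=1 | X=x) under the assumed logistic form."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def fit_logistic_regression(data, lr=0.5, epochs=2000):
    """Batch gradient descent on the average cross-entropy loss."""
    weights = [0.0] * len(data[0][0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in data:
            error = predict_spam(x, weights, bias) - y  # dLoss/dlogit
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / len(data) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(data)
    return weights, bias

weights, bias = fit_logistic_regression(EMAILS)
p_spam = predict_spam((1, 1), weights, bias)
```

Note that nothing here models how the features themselves are distributed; the model only learns where the boundary between spam and non-spam lies.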

The question a generative algorithm tries to answer is: Assuming this email is spam, how likely are these features? While discriminative models care about the relation between y and x, generative models care about “how you get x.”

They allow you to capture p(x|y), the probability of x given y, or the probability of features given a label or category. (That said, generative algorithms can also be used as classifiers. It just so happens that they can do more than categorize input data.)

Generative models come in two types:

Explicit likelihood models are defined by an explicit specification of the density, so their unnormalized complete likelihood can usually be expressed in closed form.

Implicit probabilistic models are defined naturally in terms of a sampling procedure and often induce a likelihood function that cannot be expressed explicitly.

Generative models aim for a complete probabilistic description of the data. With these models, the goal is to construct the joint probability distribution P(X, Y) – either directly or by first computing P(X | Y) and P(Y) – and then inferring the conditional probabilities required to classify new data.

As the image below shows, neural network architectures like GANs fall under the implicit model category: they can generate samples from latent noise but do not evaluate the likelihood of a sample. Why they work as well as they do is also still not fully understood.

Ian Goodfellow's presentation of Generative Models Taxonomy

Here are some examples of generative models:

Naïve Bayes
Bayesian networks
Markov random fields
Hidden Markov Models (HMMs)
Latent Dirichlet Allocation (LDA)
Generative Adversarial Networks (GANs)
Autoregressive models

Pros:

The prior probability is easy to calculate.
They reflect the features of the dataset for a specific category.
They model the joint probability distribution.
They can accommodate hidden (latent) random variables.
They tend to converge faster.

Cons:

They need a large-scale dataset to estimate a reasonable class-conditional distribution.
Datasets usually have high dimensions, which causes memory shortages and computation issues.

Generative models are more complex to train than discriminative models because it is hard to model real-world data distribution and class conditional probabilities accurately.

Pro Tip: Read this Comprehensive Guide to Convolutional Neural Networks

How do Generative Adversarial Networks work?

Now, let’s discuss the architecture of GANs and how they work.

Generative Adversarial Networks architecture

A GAN is composed of two deep networks, the generator and the discriminator. We will first examine how a generator creates images before learning how to train it.

First, we sample some noise z using a normal or uniform distribution. With z as an input, we use a generator G to create an image x (x=G(z)).
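As a rough sketch of x = G(z), the snippet below samples z from a standard normal distribution and pushes it through a toy generator. The generator here is a hypothetical stand-in (a single random linear layer followed by tanh), and the latent and output sizes are arbitrary; a real GAN generator is a trained deep network.

```python
import math
import random

def sample_noise(latent_dim, rng):
    """Draw z with one standard-normal value per latent dimension."""
    return [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]

def make_toy_generator(latent_dim, out_dim, rng):
    """Hypothetical stand-in for G: one random linear layer followed by tanh."""
    weights = [[rng.gauss(0.0, 0.5) for _ in range(latent_dim)]
               for _ in range(out_dim)]

    def generator(z):
        return [math.tanh(sum(w * zi for w, zi in zip(row, z)))
                for row in weights]

    return generator

rng = random.Random(0)
G = make_toy_generator(latent_dim=16, out_dim=8 * 8 * 3, rng=rng)
z = sample_noise(16, rng)
x = G(z)  # a flat list of 8*8*3 "pixel" values, each in [-1, 1]
```

With untrained weights the output is itself just structured noise, which is why the generator needs the discriminator's feedback described next.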

I know it sounds unreal. Let me simplify it for you.

Generative model generating target output image from random noise

Conceptually, z represents the latent features of the images generated, for example, the color and the shape. In deep learning classification, we don’t control the features the model is learning. Similarly, in a GAN, we don’t control the semantic meaning of z.

We let the training process learn it, i.e., we do not control which component of z determines, say, the color of an object such as a cup.

The most effective way to discover its meaning is to plot the generated images and examine them ourselves. But an untrained generator on its own only produces random noise. Conceptually, the discriminator in a GAN guides the generator on what images to create.

A GAN builds a discriminator that learns which features make images look real by training on both real and generated images. The same discriminator then provides feedback to the generator so that it creates realistic images.

The feedback loop in GANs

Generative Adversarial Networks Training

Let me walk you through the training process. We will go through the equations and algorithms mentioned in the paper in detail.

We train the discriminator just like a deep network classifier. If the input is real, we want D(x)=1. If it is generated, it should be zero. Through this process, the discriminator identifies features that contribute to real images.

On the other hand, we want the generator to create images with D(x) = 1 (matching the real images). So we can train the generator by backpropagating this target value all the way back to the generator, i.e., we train the generator to move its images towards what the discriminator thinks is real.

Backpropagation path for training GANs

We train both networks in alternating steps and lock them into a fierce competition to improve themselves. Eventually, the discriminator picks up on the tiniest differences between real and generated images, and the generator creates images that the discriminator cannot tell apart from real ones. Ideally, the GAN converges and produces realistic images.

This discriminator concept can also be applied to many existing deep learning applications. The discriminator in a GAN acts as a critic. We can plug the discriminator into existing deep learning solutions to provide feedback and make them better.

The discriminator outputs a value D(x), indicating the chance that x is a real image. Our objective is to maximize the chance to recognize real images as real and generated images as fake.

To measure the loss, we use cross-entropy. For a real image, p (the true label for real images) equals 1. For generated images, we reverse the label (i.e. one minus label). So the objective becomes:

Discriminator Loss Function
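In the notation of the original GAN paper, where x is a real sample, z is noise, and G(z) is a generated sample, the discriminator objective can be written as:

```latex
\max_{D} V(D) =
\mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
+ \mathbb{E}_{z \sim p_{z}(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

The first term rewards recognizing real images as real, and the second rewards recognizing generated images as fake.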

On the generator side, the objective is to generate images that receive the highest possible value of D(x), so that they fool the discriminator.

Generator Loss Function
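In the notation of the original GAN paper, with z the noise input and G(z) the generated sample, the generator objective can be written as:

```latex
\min_{G} V(G) =
\mathbb{E}_{z \sim p_{z}(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

Minimizing this term pushes D(G(z)) towards 1, which is the same as asking the discriminator to rate generated images as real.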

GAN loss functions are modeled as a minimax game in which the performance of the generator and the discriminator is co-dependent. According to the original GAN paper, the discriminator and generator play the following two-player minimax game with value function V(D, G).
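Written out, the value function from the original GAN paper is:

```latex
\min_{G} \max_{D} V(D, G) =
\mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
+ \mathbb{E}_{z \sim p_{z}(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator D tries to maximize V, while the generator G tries to minimize it.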

Once both objective functions are defined, they are trained jointly in an alternating fashion using gradient descent.

We fix the generator model’s parameters and perform a single iteration of gradient descent on the discriminator using the real and the generated images. Then we switch sides.
Fix the discriminator and train the generator for another single iteration.

We train both networks in alternating steps until the generator produces good-quality images. The following summarizes the data flow and the gradients used for the backpropagation.

Weight update in GANs training

When we model a system that rewards two contrary outcomes simultaneously, it can lead to equilibrium issues and make model convergence tricky. We will see later in this article how recent versions of GANs address this problem.

Pseudo Code for training GANs (from original paper)

Pseudo Code for training GANs
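The alternating procedure can be sketched in a few lines of plain Python. This is a toy under heavy assumptions, not the paper's setup: the data is one-dimensional and drawn from N(4, 0.5), the generator is the line G(z) = a*z + b, the discriminator is D(x) = sigmoid(w*x + c), and the gradients are derived by hand. It takes one discriminator step per generator step, as described above.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def train_toy_gan(steps=200, batch_size=64, lr=0.05, seed=0):
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # generator G(z) = a*z + b
    w, c = 0.0, 0.0  # discriminator D(x) = sigmoid(w*x + c)
    for _ in range(steps):
        # Discriminator step: the generator parameters are held fixed.
        real = [rng.gauss(4.0, 0.5) for _ in range(batch_size)]
        fake = [a * rng.gauss(0.0, 1.0) + b for _ in range(batch_size)]
        grad_w = grad_c = 0.0
        for x in real:  # d/dlogit of log D(x) is 1 - D(x)
            d = sigmoid(w * x + c)
            grad_w += (1.0 - d) * x
            grad_c += 1.0 - d
        for g in fake:  # d/dlogit of log(1 - D(g)) is -D(g)
            d = sigmoid(w * g + c)
            grad_w -= d * g
            grad_c -= d
        w += lr * grad_w / batch_size  # gradient ascent on V
        c += lr * grad_c / batch_size

        # Generator step: the discriminator parameters are held fixed.
        grad_a = grad_b = 0.0
        for _ in range(batch_size):
            z = rng.gauss(0.0, 1.0)
            d = sigmoid(w * (a * z + b) + c)
            grad_a -= d * w * z  # chain rule through the logit w*G(z) + c
            grad_b -= d * w
        a -= lr * grad_a / batch_size  # gradient descent on log(1 - D(G(z)))
        b -= lr * grad_b / batch_size
    return a, b, w, c

a, b, w, c = train_toy_gan()
```

Even in this tiny setting the parameters are not guaranteed to settle; the two players can keep chasing each other, which hints at the convergence difficulties discussed later in the article.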

GANs vs Autoencoders vs VAEs

Generative Adversarial Network (GAN) and Variational Autoencoder (VAE) are popular models for generating images and sequences. As GAN and VAE share similar tasks, we might encounter the challenge of choosing between them in specific application scenarios.

In a nutshell, a VAE is an autoencoder whose encoding distribution is regularised during training to ensure that its latent space has good properties, allowing us to generate new data.

Moreover, the term “variational” comes from the close relationship between the regularisation and the variational inference method in statistics.

Two main components in VAE are

1. Encoder

2. Decoder

The encoder produces the “new features” representation from the “old features” representation (by selection or by extraction), and the decoder performs the reverse process.

Dimensionality reduction can then be interpreted as data compression where the encoder compresses the data (from the initial space to the encoded space, also called latent space) whereas the decoder decompresses them.

The reconstruction produced by the decoder is lossy: some information from the original input is lost.

Variational Autoencoder Architecture

Pro tip: Want to learn more about autoencoders? Check out this detailed blog

Now that you have a clear picture of VAEs, let’s look at some important differences between GANs and VAEs.

The main difference between VAEs and GANs is their learning process. VAEs minimize a reconstruction loss, with each input image serving as its own target, which makes training resemble a supervised problem even though no labels are needed. GANs, on the other hand, have no per-image target and learn only through the adversarial game.

Another difference is training time and stability. GANs take longer and are more complex to train, whereas VAEs are usually a lot more stable to train. With GANs, stable training is not guaranteed.

Encoder-Decoder Representation in VAE

A VAE is a probabilistic graphical model that learns an approximate posterior p(z|x) (the encoder) and a likelihood p(x|z) (the decoder). To generate images, a VAE samples a hidden state z from the prior distribution p(z) and feeds it into the decoder.
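Two pieces of the VAE recipe are easy to show in code: sampling a latent vector and the regularisation term that keeps the encoder's distribution close to the prior. The sketch below assumes a diagonal Gaussian posterior and a standard normal prior, which is the common choice but is not spelled out in this article; the function names are illustrative.

```python
import math
import random

def sample_latent(mu, log_var, rng):
    """Reparameterised sample: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, 1)), summed over the latent dimensions."""
    return sum(0.5 * (m * m + math.exp(lv) - 1.0 - lv)
               for m, lv in zip(mu, log_var))

rng = random.Random(0)
mu, log_var = [0.5, -1.0], [0.0, -0.5]  # made-up encoder outputs for one image
z = sample_latent(mu, log_var, rng)     # would be fed into the decoder
penalty = kl_to_standard_normal(mu, log_var)
```

The KL term is zero only when the encoder outputs exactly the prior, so adding it to the reconstruction loss trades reconstruction quality against a well-behaved latent space.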

Pro tip: Looking for quality training data? Check out 65+ Best Free Datasets for Machine Learning to find the right dataset for your needs.

GAN variants

Deep Convolutional GAN (DCGAN)

DCGAN is a generative adversarial network architecture based on CNNs. It follows a few architectural guidelines, in particular:

Replacing pooling layers with strided convolutions (discriminator) and fractional-strided convolutions (generator).
Using batchnorm in both the generator and the discriminator.
Removing fully connected hidden layers for deeper architectures.
Using ReLU activation in the generator for all layers except for the output, which uses tanh.
Using LeakyReLU activation in the discriminator for all layers.

DCGAN Architecture

This network takes in a 100x1 noise vector, denoted z, and maps it into the G(z) output, which is a 64x64 image with 3 color channels. Written as channels x height x width, the feature maps go 100x1 → 1024x4x4 → 512x8x8 → 256x16x16 → 128x32x32 → 3x64x64.
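The doubling of spatial size at each stage follows from the output-size formula for a fractionally-strided (transposed) convolution. The sketch below assumes a kernel size of 4, stride of 2, and padding of 1, which is a common choice in DCGAN implementations but is not stated in this article, and it ignores dilation and output padding.

```python
def conv_transpose_out(size, kernel=4, stride=2, padding=1):
    """Spatial output size of a transposed convolution (no dilation)."""
    return (size - 1) * stride - 2 * padding + kernel

def dcgan_generator_shapes(channels=(1024, 512, 256, 128, 3), start=4):
    """List (channels, height, width) after each generator stage."""
    shapes = []
    size = start
    for i, ch in enumerate(channels):
        if i > 0:  # the first stage projects z straight to a 4x4 map
            size = conv_transpose_out(size)
        shapes.append((ch, size, size))
    return shapes

shapes = dcgan_generator_shapes()
```

With these settings each layer maps a size of n to 2n, so four upsampling layers take the 4x4 map to 64x64.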

Bedroom images generated from random noise using DCGAN

Progressive GANs

Generating high-resolution images is considered challenging for GAN models as the generator must learn how to output both overall structure and fine details simultaneously.

The primary contribution of the ProGAN paper is a training methodology for GANs where we start with low-resolution images, and then progressively increase the resolution by adding layers to the networks.

Progressive Growing GAN involves using a generator and discriminator model with the same general structure and starting with very small images, such as 4×4 pixels.

ProGAN Architecture

This approach allows the generation of high-quality images, such as 1024×1024 photorealistic faces of celebrities that do not exist.
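A small sketch of the resolution schedule: starting at 4×4 and doubling until 1024×1024. The `fade_in` helper illustrates the smooth blending of a newly added layer described in the ProGAN paper; it is not covered in this article, and the function names are illustrative.

```python
def progressive_resolutions(start=4, final=1024):
    """Resolutions visited when doubling from `start` up to `final`."""
    resolutions = []
    size = start
    while size <= final:
        resolutions.append(size)
        size *= 2
    return resolutions

def fade_in(upsampled_old, new_output, alpha):
    """Blend the old (upsampled) output with the new layer's output.

    alpha ramps from 0 to 1 while the new layer is being introduced.
    """
    return [(1.0 - alpha) * old + alpha * new
            for old, new in zip(upsampled_old, new_output)]

schedule = progressive_resolutions()
```

Going from 4×4 to 1024×1024 takes nine stages, so the networks learn coarse structure first and fine detail last.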

Conditional GANs

A conditional generative adversarial network, or cGAN for short, is a type of GAN that involves the conditional generation of images by a generator model.

In cGANs, a conditional setting is applied, meaning that both the generator and discriminator are conditioned on some sort of auxiliary information (such as class labels or data from other modalities).

As a result, the ideal model can learn a multi-modal mapping from inputs to outputs by being fed with different contextual information.

You can control the generator's output at test time by giving the label for the image you want to generate.
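One simple way to condition both networks, assumed here for illustration, is to append a one-hot encoding of the label to each network's input. Real implementations often use learned embeddings or inject the label at several layers instead.

```python
def one_hot(label, num_classes):
    """Encode an integer class label as a one-hot list."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def generator_input(z, label, num_classes):
    """Noise vector with the class label appended."""
    return list(z) + one_hot(label, num_classes)

def discriminator_input(sample, label, num_classes):
    """Flattened sample with the class label appended."""
    return list(sample) + one_hot(label, num_classes)

g_in = generator_input([0.1, 0.2], label=2, num_classes=3)
```

At test time, changing only the `label` argument while keeping z fixed asks the generator for a different class with the same underlying noise.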

Conditional GAN

Pix2Pix GAN

Pix2Pix is a Generative Adversarial Network, or GAN, model designed for general-purpose image-to-image translation.

Image-to-image translation is the problem of changing a given image in a specific or controlled way. Examples include translating a landscape photograph from day to night or a segmented image to a photograph.

cGANs are suitable for image-to-image translation tasks as we can condition on an input image and generate a corresponding output image.

Pix2Pix GAN Architecture

Pix2Pix GAN provides a general purpose model and loss function for image-to-image translation.

CycleGAN

CycleGAN is a technique for automatically training image-to-image translation models without paired examples. The models are trained in an unsupervised manner using collections of images from the source and target domains that do not need to be related in any way.

The CycleGAN is an extension of the GAN architecture that involves the simultaneous training of two generator models and two discriminator models.

One generator takes images from the first domain as input and outputs images for the second domain. The other generator takes images from the second domain as input and generates images for the first domain.
Discriminator models are then used to determine how plausible the generated images are and update the generator models accordingly.
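What ties the two generators together in the CycleGAN paper is a cycle-consistency loss: translating an image to the other domain and back should return the original. The sketch below uses trivial stand-in functions for the two generators and flat lists for images; both are assumptions for illustration only.

```python
def l1_distance(a, b):
    """Mean absolute difference between two flat images."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def cycle_consistency_loss(x, y, G, F):
    """L1 error after a round trip: x -> G -> F and y -> F -> G."""
    return l1_distance(F(G(x)), x) + l1_distance(G(F(y)), y)

# Stand-in "generators": G maps domain A to B, F maps B back to A.
G = lambda image: [2.0 * v for v in image]
F = lambda image: [v / 2.0 for v in image]

x = [0.25, 0.5, 1.0]  # made-up image from the first domain
y = [0.5, 1.0, 2.0]   # made-up image from the second domain
loss = cycle_consistency_loss(x, y, G, F)
```

Because F exactly inverts G here, the loss is zero; during real training this term is added to the adversarial losses from the two discriminators.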

CycleGAN

Super-resolution GANs

Super-resolution (SR) is the task of upsampling a low-resolution image into a higher resolution with minimal information distortion.

The generator network employs residual blocks, where the idea is to keep information from previous layers alive and allow the network to choose from more features adaptively.

Instead of adding random noise as the generator input, we pass the low-resolution image.

The discriminator network is pretty standard and works as a discriminator would work in a normal GAN.

The novel factor in SRGANs is the perceptual loss function. While the generator and discriminator are trained in the usual GAN fashion, SRGANs add a further term to the generator's objective: the perceptual/content loss function.
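In the SRGAN paper, the perceptual loss is a weighted sum of a content loss, computed on feature maps of the super-resolved and the original high-resolution image, and an adversarial loss. The sketch below takes the feature vectors and the discriminator score as plain inputs, since the feature extractor is outside its scope; the 1e-3 weight follows the paper, and the function names are illustrative.

```python
import math

def mse(a, b):
    """Mean squared error between two flat feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def perceptual_loss(features_sr, features_hr, d_score_sr, adv_weight=1e-3):
    """Content loss on features plus a weighted adversarial loss.

    d_score_sr is the discriminator's probability, in (0, 1], that the
    super-resolved image is a real high-resolution image.
    """
    content = mse(features_sr, features_hr)
    adversarial = -math.log(d_score_sr)
    return content + adv_weight * adversarial

loss = perceptual_loss([0.0, 0.0], [1.0, 1.0], d_score_sr=0.5)
```

Comparing feature maps instead of raw pixels is what lets the generator favor sharp, plausible texture over a blurry pixel-wise average.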

SRGAN Architecture

Pro Tip: Read this Comprehensive Guide to Super Resolution

DALLE-2

“DALL·E 2 is a new AI system that can create realistic images and art from a description in natural language.” - OpenAI

A simplified explanation of the DALLE 2 algorithm:

First, a text prompt is input into a text encoder that is trained to map the prompt to a representation space.
Next, a prior model maps the text encoding to a corresponding image encoding that captures the semantic information of the prompt contained in the text encoding.
Finally, an image decoder stochastically generates an image which is a visual manifestation of this semantic information.

DALLE-2 Architecture

If you are interested in a detailed explanation of the DALLE2 algorithm, check out the blog from one of the paper's co-authors.

Pro tip: Style Transfer is an important application of GANs. Check out this Comprehensive Guide to Style Transfer

Issues with GANs

Training GANs is a non-trivial problem because of the min-max game formulation of the GAN architecture.

Common failure modes in training GANs are:

Non-convergence: the model parameters oscillate, destabilize, and never converge.
Mode collapse: the generator collapses and produces only a limited variety of samples.
Diminished gradient: the discriminator becomes so successful that the generator's gradient vanishes and the generator learns nothing.
Imbalance between the generator and discriminator, which causes overfitting.
High sensitivity to hyperparameter selection.

Several changes to the loss function have been proposed to address these problems:

Wasserstein loss: The Wasserstein loss is designed to prevent vanishing gradients even when you train the discriminator to optimality.
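As a minimal sketch, the Wasserstein losses operate on unbounded critic scores rather than probabilities. The constraint that keeps the critic well-behaved (weight clipping or a gradient penalty in the WGAN papers) is left out here, and the function names are illustrative.

```python
def mean(values):
    return sum(values) / len(values)

def critic_loss(real_scores, fake_scores):
    """Minimized by the critic: pushes real scores up and fake scores down."""
    return mean(fake_scores) - mean(real_scores)

def generator_loss(fake_scores):
    """Minimized by the generator: pushes the critic's fake scores up."""
    return -mean(fake_scores)

c_loss = critic_loss(real_scores=[1.0, 3.0], fake_scores=[0.0, 1.0])
g_loss = generator_loss(fake_scores=[0.0, 1.0])
```

Since there is no sigmoid or logarithm to saturate, the generator's gradient does not shrink just because the critic is confident.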

Modified minimax loss: The original GAN paper proposed a modification to minimax loss to deal with vanishing gradients.
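The modification in the original paper is for the generator to maximize log D(G(z)) instead of minimizing log(1 - D(G(z))). The sketch below compares the gradient of each loss with respect to the discriminator's logit for a generated sample, writing D(G(z)) = sigmoid(logit); the function names are illustrative.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def saturating_grad(logit):
    """d/dlogit of log(1 - sigmoid(logit)): the original minimax loss."""
    return -sigmoid(logit)

def non_saturating_grad(logit):
    """d/dlogit of -log(sigmoid(logit)): the modified loss."""
    return -(1.0 - sigmoid(logit))

# A confident discriminator gives a generated sample a very negative logit.
confident_logit = -10.0
weak_signal = saturating_grad(confident_logit)
strong_signal = non_saturating_grad(confident_logit)
```

When the discriminator confidently rejects a sample, the original loss yields a gradient close to zero while the modified loss yields one close to -1, so the generator still gets a useful signal early in training.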

GANs: Key Takeaways

A GAN is a neural network architecture framed as a two-player min-max game that learns to model the data distribution of the input dataset.
A discriminative model makes predictions on the unseen data based on conditional probability, whereas a generative model focuses on the underlying distribution of a dataset.
GANs are harder to train than VAEs but tend to produce higher-quality samples. GAN training is non-trivial and has several failure modes that can lead to sub-optimal convergence.
GAN variants like Pix2Pix GAN and CycleGAN, along with related generative systems like DALL·E 2, have shown great promise in real-world applications like image-to-image translation, text-to-image synthesis, etc.

Deval Shah

Deval is a senior software engineer at Eagle Eye Networks and a computer vision enthusiast. He writes about complex topics related to machine learning and deep learning.
