Vision Transformer (ViT) has emerged as a competitive alternative to convolutional neural networks (CNNs), which are currently state-of-the-art in computer vision and widely used for many image recognition tasks. ViT models can match or exceed the accuracy of state-of-the-art CNNs while requiring roughly four times fewer computational resources to train.
Although convolutional neural networks have dominated the field of computer vision for years, new vision transformer models have also shown remarkable abilities, achieving comparable and even better performance than CNNs on many computer vision tasks.
Attention mechanisms combined with RNNs were the predominant architecture for tackling text-based tasks until 2017, when the paper "Attention Is All You Need" was published and changed everything, giving birth to the now widely used Transformer.
A Transformer is a deep learning model that adopts the self-attention mechanism, differentially weighting the significance of each part of the input data. Transformers are increasingly the model of choice for NLP problems, replacing RNN models such as long short-term memory (LSTM).
Transformers differ from Recurrent Neural Networks in several important ways:

- They process all tokens in a sequence in parallel rather than one step at a time, which makes training much faster on modern hardware.
- Self-attention lets any token attend directly to any other token, so long-range dependencies are captured without information having to pass through many recurrent steps.
- They avoid the vanishing-gradient issues that make very long sequences difficult for RNNs and LSTMs.
These advantages are why Transformers became popular in no time and surpassed RNNs on most NLP tasks. This is also why it is essential to understand this architecture and how it has evolved over the past five years.
There is no better model to begin a discussion of transformers with than GPT, which stands for Generative Pre-Training. Originally released on 11 June 2018, GPT has undergone many transformations in the years since. It pioneered the recipe of unsupervised pre-training followed by supervised fine-tuning, which is now commonly employed by many transformers.
It was trained on a book corpus dataset consisting of roughly 7,000 unpublished books. The architecture of GPT consists of 12 decoder blocks stacked together, making it a decoder-only model. GPT-1 has 117 million parameters, a small number compared to the transformers developed today! GPT was just the beginning of the era of transformers.
BERT stands for Bidirectional Encoder Representations from Transformers, and it was released on 11 October 2018. As the name suggests, BERT is a bidirectional model: the attention mechanism can attend to both sides of the current token, left and right. This is because BERT is built by stacking together 12 encoder blocks, making it an encoder-only model. Encoders take the full sentence as input and can reference any word in the sentence to perform the task.
BERT consists of 110 million parameters. Like GPT, it was pre-trained on a specific task and can be fine-tuned for other tasks. Specifically, the task used to pre-train BERT was masked language modeling, essentially filling in the blanks.
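To get a feel for this "fill in the blanks" objective, here is a minimal sketch using the Hugging Face transformers library (assuming it is installed; bert-base-uncased is the publicly released 110M-parameter checkpoint):

```python
# A minimal sketch of BERT's "fill in the blanks" pre-training objective,
# using the Hugging Face `transformers` library (assumed to be installed).
from transformers import pipeline

# bert-base-uncased is the publicly released 110M-parameter BERT checkpoint.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the most likely tokens for the masked position,
# attending to context on both the left and the right of [MASK].
predictions = unmasker("The capital of France is [MASK].")
for p in predictions[:3]:
    print(p["token_str"], round(p["score"], 3))
```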
The Vision Transformer (ViT) model was introduced in the research paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale," published at ICLR 2021. The fine-tuning code and pre-trained ViT models are accessible on Google Research's GitHub. The ViT models were pre-trained on the ImageNet and ImageNet-21k datasets.
Vision transformers have extensive applications in popular image recognition tasks such as object detection, image segmentation, image classification, and action recognition. Moreover, ViTs are applied in generative modeling and multi-modal tasks, including visual grounding, visual question answering, and visual reasoning.
In ViTs, images are represented as sequences, and class labels for the image are predicted from that sequence, which allows the model to learn image structure on its own rather than having it hard-coded, as in convolutions. The input image is treated as a sequence of patches, where every patch is flattened into a single vector by concatenating the channels of all pixels in the patch and then linearly projecting it to the desired input dimension.
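To make this concrete, here is a minimal sketch in PyTorch, assuming a 224x224 RGB input, 16x16 patches, and a ViT-Base-sized embedding dimension of 768:

```python
import torch
import torch.nn as nn

# Assumed toy settings: a 224x224 RGB image split into 16x16 patches.
batch, channels, height, width = 1, 3, 224, 224
patch_size, embed_dim = 16, 768

x = torch.randn(batch, channels, height, width)

# Split the image into non-overlapping patches and flatten each one:
# (B, C, H, W) -> (B, N, patch_size * patch_size * C), with N = (H/P) * (W/P).
patches = x.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(batch, -1, channels * patch_size * patch_size)
print(patches.shape)  # torch.Size([1, 196, 768]); 16*16*3 happens to equal 768 here

# Linear projection of the flattened patches to the model's embedding dimension.
projection = nn.Linear(channels * patch_size * patch_size, embed_dim)
patch_embeddings = projection(patches)  # (1, 196, 768)
```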
Let’s examine the vision transformer architecture step by step.
Image patches are the sequence tokens (like words). The encoder block is identical to the original transformer proposed by Vaswani et al. (2017).
There are multiple blocks in the ViT encoder, and each block consists of three major processing elements:
Layer Norm keeps the training process on track and lets the model adapt to the variations among the training images.
Multi-head Self-Attention (MSA) is the network responsible for generating attention maps from the given embedded visual tokens. These attention maps help the network focus on the most critical regions in the image, such as object(s). The concept of attention maps is the same as that found in the traditional computer vision literature (e.g., saliency maps and alpha-matting).
MLP is a two-layer network with a GELU (Gaussian Error Linear Unit) non-linearity. The final MLP block, also called the MLP head, is used as the output of the transformer. Applying softmax to this output provides classification labels (i.e., if the application is image classification).
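As a concrete illustration, here is a minimal sketch of one such encoder block in PyTorch, under assumed ViT-Base-like hyperparameters (embedding size 768, 12 heads, MLP size 3072); real implementations differ in details such as dropout placement:

```python
import torch
import torch.nn as nn

class ViTEncoderBlock(nn.Module):
    """One ViT encoder block: LayerNorm -> multi-head self-attention -> LayerNorm -> MLP,
    with residual (skip) connections around both sub-blocks.
    Hyperparameters below are assumptions roughly matching ViT-Base."""

    def __init__(self, embed_dim=768, num_heads=12, mlp_dim=3072, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        # Two-layer MLP with a GELU non-linearity between the layers.
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, mlp_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(mlp_dim, embed_dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        # Pre-norm attention with a residual connection.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        # Pre-norm MLP with a residual connection.
        x = x + self.mlp(self.norm2(x))
        return x

tokens = torch.randn(1, 197, 768)  # 196 patch tokens + 1 [class] token
out = ViTEncoderBlock()(tokens)    # shape preserved: (1, 197, 768)
```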
The ViT model represents an input image as a series of image patches, much like the series of word embeddings used when applying transformers to text, and directly predicts class labels for the image.
Vision Transformer (ViT) has been gaining momentum in recent years. In the following sections, we will explain the ideas from the paper entitled “Do Vision Transformers See Like Convolutional Neural Networks?” by Raghu et al., published in 2021 by Google Research and Google Brain. In particular, we’ll explore the difference between the conventionally used CNN and Vision Transformer.
The paper shares six central ideas comparing how ViTs and CNNs build their internal representations.
Understanding fundamental differences between ViTs and CNNs is essential as the transformer architecture has become more ubiquitous. Transformers have extended their reach from taking over the world of language models to usurping CNNs as the de-facto vision model.
Before diving deep into how vision Transformers work, we must understand the fundamentals of attention and multi-head attention presented in the original transformer paper.
The Transformer is a model proposed in the paper "Attention Is All You Need" (Vaswani et al., 2017). It relies on a mechanism called self-attention, is neither a CNN nor an LSTM, and significantly outperforms the existing methods built on those architectures.
Note that the part labeled Multi-Head Attention in the figure below is the core of the Transformer, but it also uses skip (residual) connections, as in ResNet.
The attention mechanism used in the Transformer relies on three variables: Q (Query), K (Key), and V (Value). Simply put, it calculates the association (attention weight) between a Query token (a token is something like a word) and each Key token, and multiplies the Value associated with each Key by that weight.
Defining the Q, K, and V calculation above as a single head, multi-head attention runs several such heads in parallel: each head has its own projection matrices W_i^Q, W_i^K, and W_i^V, and it calculates attention weights using the features projected with these matrices, as written out below.
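For reference, the scaled dot-product attention for a single head and the multi-head combination from the original paper can be written as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(QW_i^Q,\; KW_i^K,\; VW_i^V)$$

where d_k is the dimension of the key vectors and W^O is the output projection matrix.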
The intuition behind multi-head attention is that it allows the model to attend to different parts of the sequence in different ways at the same time. In practice, this means each head can learn a different kind of relationship between tokens, and their concatenated outputs give a richer representation than a single head could, as the sketch below makes explicit.
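The following is a minimal sketch of multi-head self-attention in plain PyTorch (the sizes are assumptions matching ViT-Base), making each head's projections and the final concatenation explicit:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    """Simplified multi-head self-attention: every head gets its own slice of
    the projected Q, K, V, computes scaled dot-product attention, and the
    head outputs are concatenated and projected back with W^O."""

    def __init__(self, embed_dim=768, num_heads=12):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.q_proj = nn.Linear(embed_dim, embed_dim)    # stacks all W_i^Q
        self.k_proj = nn.Linear(embed_dim, embed_dim)    # stacks all W_i^K
        self.v_proj = nn.Linear(embed_dim, embed_dim)    # stacks all W_i^V
        self.out_proj = nn.Linear(embed_dim, embed_dim)  # W^O

    def forward(self, x):
        b, n, d = x.shape
        # Project and reshape to (batch, heads, tokens, head_dim).
        q = self.q_proj(x).view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        # Scaled dot-product attention, computed per head.
        weights = F.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        out = weights @ v  # (b, heads, n, head_dim)
        # Concatenate the heads and apply the output projection.
        out = out.transpose(1, 2).reshape(b, n, d)
        return self.out_proj(out)

attn = MultiHeadSelfAttention()
print(attn(torch.randn(1, 197, 768)).shape)  # torch.Size([1, 197, 768])
```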
Vision Transformer (ViT) is a model that applies the Transformer to the image classification task; it was proposed in October 2020 (Dosovitskiy et al., 2020). The model architecture is almost the same as the original Transformer, but with a twist that allows images to be treated as input, just like text in natural language processing.
Transformer Architecture modified for images (ViT)
The paper suggests using a Transformer Encoder as a base model to extract features from the image and passing these “processed” features into a Multilayer Perceptron (MLP) head model for classification.
Transformers are already very compute-heavy—infamous for their quadratic complexity when computing the Attention matrix. This worsens as the sequence length increases.
For a 28x28 MNIST image, if we flatten it to 784 pixels, we still have to deal with a 784x784 attention matrix to see which pixels attend to one another. This is very expensive, even for today's hardware.
Hence, the paper suggests breaking the image down into square patches as a form of lightweight “windowed” Attention to address this issue.
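To see how much patching helps, compare the sequence lengths for an assumed 224x224 input with 16x16 patches:

```python
# Attention cost grows with the square of the sequence length.
pixels = 224 * 224                   # 50,176 tokens if every pixel were a token
patches = (224 // 16) * (224 // 16)  # 196 tokens with 16x16 patches
print(pixels ** 2 // patches ** 2)   # 65536: the patch version needs ~65,000x fewer attention entries
```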
The image is converted to square patches.
These patches are flattened and sent through a single Feed Forward layer to obtain a linear patch projection. This Feed Forward layer contains the embedding matrix E mentioned in the paper. The matrix E is randomly initialized and learned during training.
To help with the classification bit, the authors took inspiration from the original BERT paper and concatenated a learnable [class] embedding in front of the other patch projections.
Yet another problem with Transformers is that sequence order is not naturally enforced, since the data is passed in all at once instead of timestep by timestep, as is done in RNNs and LSTMs. To combat this, the original Transformer paper suggests using Positional Encodings/Embeddings that establish a certain order in the inputs.
The positional embedding matrix E_pos is randomly initialized (and learned during training) and added to the concatenated matrix containing the learnable class embedding and the patch projections.
D is the fixed latent vector size used throughout the Transformer. It’s what we squash the input vectors to before passing them into the Encoder.
Altogether, these patch projections and positional embeddings form a larger matrix that’ll soon be put through the Transformer Encoder.
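Putting the pieces together, here is a minimal sketch (PyTorch, assumed ViT-Base-style sizes, with a random tensor standing in for the flattened patches from the earlier sketch) of how the encoder input is assembled from the patch projections, the learnable [class] embedding, and the positional embeddings:

```python
import torch
import torch.nn as nn

num_patches, patch_dim, D = 196, 16 * 16 * 3, 768   # assumed ViT-Base-style sizes

E = nn.Linear(patch_dim, D)                          # patch embedding matrix E (learnable)
cls_token = nn.Parameter(torch.zeros(1, 1, D))       # learnable [class] embedding
E_pos = nn.Parameter(torch.randn(1, num_patches + 1, D) * 0.02)  # positional embeddings E_pos

flat_patches = torch.randn(1, num_patches, patch_dim)    # stands in for the flattened patches
x = E(flat_patches)                                      # (1, 196, D) patch projections
x = torch.cat([cls_token.expand(1, -1, -1), x], dim=1)   # prepend [class] -> (1, 197, D)
x = x + E_pos                                            # add positional embeddings
# `x` is now the sequence that gets fed into the Transformer Encoder.
```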
The outputs of the Transformer Encoder are then sent into a Multilayer Perceptron for image classification. The input features capture the image's essence very well, making the MLP head’s classification task far simpler.
The MLP Head takes as input only the Transformer output corresponding to the special [class] embedding and ignores the other outputs.
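In code, that step is just indexing out the first token; a minimal sketch with assumed sizes:

```python
import torch
import torch.nn as nn

D, num_classes = 768, 1000                  # assumed sizes
encoder_output = torch.randn(1, 197, D)     # [class] token output + 196 patch token outputs

cls_output = encoder_output[:, 0]           # keep only the [class] token's output
mlp_head = nn.Linear(D, num_classes)        # classification head
logits = mlp_head(cls_output)               # (1, num_classes)
probs = logits.softmax(dim=-1)              # class probabilities
```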
While ViT shows excellent potential for learning high-quality image features, its accuracy gains can be modest relative to its cost: a small improvement in accuracy does not always justify ViT's longer run time.
With ever-increasing dataset sizes and the continued development of unsupervised and semi-supervised methods, developing new vision architectures that train more efficiently on these datasets becomes increasingly important. ViT can be seen as a preliminary step towards generic, scalable architectures that can solve many vision tasks, and it has gained research prominence thanks to this broad applicability and scalability.
Here are some of the most prominent applications of ViT:
Image classification is the most common problem in computer vision. CNN-based methods are state-of-the-art for image classification tasks. ViTs don't produce comparable performance on small to medium datasets; however, they have outperformed CNNs on very large datasets.
This is because CNNs encode the local information in the image more effectively than ViTs due to the application of locally restricted receptive fields.
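For readers who want to try a pre-trained ViT classifier, here is a minimal, hedged sketch using the Hugging Face transformers library (assuming a recent version is installed; google/vit-base-patch16-224 is a publicly released ViT-Base checkpoint fine-tuned on ImageNet, chosen here purely as an example):

```python
from PIL import Image
from transformers import ViTImageProcessor, ViTForImageClassification

# google/vit-base-patch16-224 is a ViT-Base checkpoint fine-tuned on ImageNet.
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

image = Image.open("example.jpg")            # any local RGB image (assumed to exist)
inputs = processor(images=image, return_tensors="pt")
logits = model(**inputs).logits
predicted = logits.argmax(-1).item()
print(model.config.id2label[predicted])      # human-readable ImageNet label
```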
A more advanced form of image categorization can be achieved by generating a caption describing the content of an image instead of a one-word label. This has become possible with the use of ViTs. ViTs learn a general representation of a given data modality instead of a crude set of labels. Therefore, it is possible to generate descriptive text for a given image. We will use an implementation of ViT trained on the COCO dataset. The results of such captioning can be seen below:
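As an illustration (not necessarily the exact implementation behind the results shown), here is a sketch using a publicly available ViT-encoder + GPT-2-decoder captioning checkpoint trained on COCO, assuming the Hugging Face transformers library is installed:

```python
from PIL import Image
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer

# A publicly available ViT + GPT-2 captioning checkpoint trained on COCO
# (an illustrative choice, not necessarily the model used for the figure).
ckpt = "nlpconnect/vit-gpt2-image-captioning"
model = VisionEncoderDecoderModel.from_pretrained(ckpt)
processor = ViTImageProcessor.from_pretrained(ckpt)
tokenizer = AutoTokenizer.from_pretrained(ckpt)

image = Image.open("example.jpg")                        # any local RGB image (assumed to exist)
pixel_values = processor(images=image, return_tensors="pt").pixel_values
output_ids = model.generate(pixel_values, max_length=16)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])
```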
DPT (Dense Prediction Transformer) is a segmentation model released by Intel in March 2021 that applies vision transformers to images. It performs semantic segmentation with 49.02% mIoU on ADE20K. It can also be used for monocular depth estimation, with an improvement of up to 28% in relative performance compared to a state-of-the-art fully-convolutional network.
A transformer-based image anomaly detection and localization network combines a reconstruction-based approach and patch embedding. The use of transformer networks helps preserve the spatial information of the embedded patches, which is later processed by a Gaussian mixture density network to localize the anomalous areas.
An interesting paper by the Google Research team uses pure-transformer-based models for video classification, drawing on the recent success of such models in image classification. The model extracts spatiotemporal tokens from the input video, which are then encoded by a series of transformer layers.
To handle the long sequences of tokens encountered in video, the authors propose several efficient variants of the model that factorize the input's spatial and temporal dimensions.
Although transformer-based models are known to be effective only when large training datasets are available, the authors show how to effectively regularise the model during training and leverage pretrained image models to train on comparatively small datasets.
On Tesla AI Day in 2021, Tesla revealed many intricate inner workings of the neural network powering Tesla FSD. One of the most intriguing building blocks is the one dubbed "image-to-BEV transform + multi-camera fusion." At the center of this block is a Transformer module, or more concretely, a cross-attention module.
To recap, here are several things to keep in mind about ViTs: they treat an image as a sequence of patches rather than a grid of pixels, they rely on large-scale pre-training to outperform CNNs, and they are rapidly spreading from image classification to tasks such as captioning, segmentation, anomaly detection, video classification, and autonomous driving.