Deep learning has revolutionized the world of computer vision—the ability for machines to “see” and interpret the world around them.
In particular, Convolutional Neural Networks (CNNs) were designed to process image data more efficiently than traditional Multi-Layer Perceptrons (MLPs).
Since images contain patterns that span several neighboring pixels, processing them one pixel at a time, as MLPs do, is inefficient.
This is why CNNs that process images in patches or windows are now the de-facto choice for image processing tasks.
Digital Image processing is the class of methods that deal with manipulating digital images through the use of computer algorithms. It is an essential preprocessing step in many applications, such as face recognition, object detection, and image compression.
Image processing is done to enhance an existing image or to sift out important information from it. This is important in several Deep Learning-based Computer Vision applications, where such preprocessing can dramatically boost the performance of a model. Manipulating images, for example, adding or removing objects to images, is another application, especially in the entertainment industry.
This paper addresses a medical image segmentation problem, where the authors used image inpainting in their preprocessing pipeline for the removal of artifacts from dermoscopy images. Examples of this operation are shown below.
The authors achieved a 3% boost in performance with this simple preprocessing procedure which is a considerable enhancement, especially in a biomedical application where the accuracy of diagnosis is crucial for AI systems. The quantitative results obtained with and without preprocessing for the lesion segmentation problem in three different datasets are shown below.
Digital images are interpreted as 2D or 3D matrices by a computer, where each value or pixel in the matrix represents the amplitude, known as the “intensity” of the pixel. Typically, we are used to dealing with 8-bit images, wherein the amplitude value ranges from 0 to 255.
Thus, a computer “sees” digital images as a function: I(x, y) or I(x, y, z), where “I” is the pixel intensity and (x, y) or (x, y, z) represent the coordinates (for binary/grayscale or RGB images respectively) of the pixel in the image.
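To make the function view concrete, here is a minimal sketch using NumPy. The tiny array below is a made-up image used purely for illustration, and note that NumPy indexes arrays as [row, column], which corresponds to (y, x) rather than (x, y).

```python
import numpy as np

# A hypothetical 3x4 grayscale image: one 8-bit intensity per pixel.
gray = np.array(
    [[0, 64, 128, 255],
     [10, 20, 30, 40],
     [255, 255, 0, 0]],
    dtype=np.uint8,
)

height, width = gray.shape   # 3 rows (y), 4 columns (x)
intensity = gray[0, 2]       # I at row 0, column 2, i.e. x=2, y=0

print(height, width)         # 3 4
print(intensity)             # 128
```

The dtype `uint8` is what restricts each intensity to the 0 to 255 range mentioned above.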
Computers deal with different “types” of images based on their function representations. Let us look into them next.
Images that have only two unique values of pixel intensity, 0 (representing black) and 1 (representing white), are called binary images. Such images are generally used to highlight a discriminating portion of a colored image. For example, they are commonly used for image segmentation, as shown below.
Grayscale or 8-bit images are composed of 256 unique colors, where a pixel intensity of 0 represents the black color and pixel intensity of 255 represents the white color. All the other 254 values in between are the different shades of gray.
An example of an RGB image converted to its grayscale version is shown below. Notice that the shape of the histogram remains the same for the RGB and grayscale images.
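A common way to perform this conversion is a weighted sum of the three color channels. The sketch below assumes the ITU-R BT.601 luma weights (0.299, 0.587, 0.114), which is one widely used convention and not necessarily the one used for the figure; the function name `rgb_to_gray` is our own.

```python
import numpy as np

def rgb_to_gray(rgb):
    """Convert an (H, W, 3) uint8 RGB image to an (H, W) uint8 grayscale image."""
    weights = np.array([0.299, 0.587, 0.114])  # assumed BT.601 luma weights
    gray = rgb.astype(np.float64) @ weights    # weighted sum over the channel axis
    return np.round(gray).astype(np.uint8)

# One row with a white, a black, and a pure red pixel.
rgb = np.array([[[255, 255, 255], [0, 0, 0], [255, 0, 0]]], dtype=np.uint8)
gray = rgb_to_gray(rgb)
print(gray)  # [[255   0  76]]
```

Green gets the largest weight because the human eye is most sensitive to it, so pure red ends up as a fairly dark gray.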
The images we are used to in the modern world are RGB or colored images, which computers typically store with 24 bits per pixel: 8 bits for each of three channels. That is, 256 × 256 × 256 = 16,777,216 different colors are possible for each pixel. “RGB” represents the Red, Green, and Blue “channels” of an image.
Up until now, we had images with only one channel. That is, two coordinates could have defined the location of any value of a matrix. Now, three equal-sized matrices (called channels), each having values ranging from 0 to 255, are stacked on top of each other, and thus we require three unique coordinates to specify the value of a matrix element.
Thus, a pixel in an RGB image will be of color black when the pixel value is (0, 0, 0) and white when it is (255, 255, 255). Any combination of numbers in between gives rise to all the different colors existing in nature. For example, (255, 0, 0) is the color red (since only the red channel is activated for this pixel). Similarly, (0, 255, 0) is green and (0, 0, 255) is blue.
An example of an RGB image split into its channel components is shown below. Notice that the shapes of the histograms for each of the channels are different.
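In code, splitting an RGB image into channels is just slicing along the last axis. The following is a small sketch with a hypothetical 2x2 image, assuming the common (height, width, channel) memory layout.

```python
import numpy as np

# A hypothetical 2x2 RGB image stored as (height, width, channel).
rgb = np.zeros((2, 2, 3), dtype=np.uint8)
rgb[0, 0] = (255, 0, 0)      # red
rgb[0, 1] = (0, 255, 0)      # green
rgb[1, 0] = (0, 0, 255)      # blue
rgb[1, 1] = (255, 255, 255)  # white

# Each channel is an ordinary single-channel (grayscale-like) matrix.
red, green, blue = rgb[..., 0], rgb[..., 1], rgb[..., 2]
print(red)    # [[255   0]
              #  [  0 255]]

# Stacking the three matrices again recovers the original image.
restacked = np.stack([red, green, blue], axis=-1)
```

The white pixel shows up in all three channels, while each primary color appears in exactly one.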
RGBA images are colored RGB images with an extra channel known as “alpha” that depicts the opacity of the RGB image. Opacity ranges from a value of 0% to 100% and is essentially a “see-through” property.
Opacity in physics describes how much an object blocks light from passing through it. For instance, cellophane paper is transparent (close to 0% opacity), frosted glass is translucent, and wood is opaque (100% opacity). The alpha channel in RGBA images tries to mimic this property. An example of this is shown below.
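The usual way the alpha channel is applied is "alpha compositing": each output pixel is a mix of the foreground and whatever lies behind it, weighted by alpha. Here is a minimal sketch, assuming the common 8-bit convention where an alpha of 0 means fully see-through and 255 means fully opaque; the function name `alpha_blend` is hypothetical.

```python
import numpy as np

def alpha_blend(foreground_rgba, background_rgb):
    """Composite an (H, W, 4) uint8 RGBA image over an (H, W, 3) uint8 background."""
    fg = foreground_rgba[..., :3].astype(np.float64)
    # Scale alpha to [0, 1]; keep a trailing axis so it broadcasts over channels.
    alpha = foreground_rgba[..., 3:4].astype(np.float64) / 255.0
    bg = background_rgb.astype(np.float64)
    out = alpha * fg + (1.0 - alpha) * bg
    return np.round(out).astype(np.uint8)

# Three foreground pixels: fully opaque, fully transparent, and 20% opaque.
fg = np.array([[[255, 0, 0, 255], [255, 0, 0, 0], [200, 100, 0, 51]]], dtype=np.uint8)
bg = np.full((1, 3, 3), 100, dtype=np.uint8)  # mid-gray background
blended = alpha_blend(fg, bg)
print(blended[0, 2])  # [120 100  80]
```

The opaque pixel hides the background completely, the transparent one shows only the background, and the third is a 20/80 mix of the two.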
The fundamental steps in any typical Digital Image Processing pipeline are as follows:
The image is captured by a camera and digitized (if the camera output is not digitized automatically) using an analogue-to-digital converter for further processing in a computer.
In this step, the acquired image is manipulated to meet the requirements of the specific task for which the image will be used. Such techniques, like contrast and brightness adjustment, are primarily aimed at highlighting the hidden or important details in an image. Image enhancement is highly subjective in nature.
This step deals with improving the appearance of an image and is an objective operation since the degradation of an image can be attributed to a mathematical or probabilistic model. For example, removing noise or blur from images.
This step aims at handling the processing of colored images (24-bit RGB or 32-bit RGBA images), for example, performing color correction or color modeling in images.
Wavelets are the building blocks for representing images in various degrees of resolution. Images are successively subdivided into smaller regions for data compression and for pyramidal representation.
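As a rough illustration of the multi-resolution idea, the sketch below performs one level of a simple, unnormalized Haar-style decomposition: every 2x2 block is summarized by its average (a half-resolution image) plus three detail bands, from which the original can be rebuilt exactly. The Haar wavelet and the function names are our choice for the example; real codecs typically use more elaborate wavelets.

```python
import numpy as np

def haar_level(image):
    """One level of an unnormalized 2D Haar-style decomposition (even-sized input)."""
    a = image[0::2, 0::2].astype(np.float64)  # top-left pixel of each 2x2 block
    b = image[0::2, 1::2].astype(np.float64)  # top-right
    c = image[1::2, 0::2].astype(np.float64)  # bottom-left
    d = image[1::2, 1::2].astype(np.float64)  # bottom-right
    approx = (a + b + c + d) / 4.0      # block average: the half-resolution image
    row_diff = (a + b - c - d) / 4.0    # top row minus bottom row
    col_diff = (a - b + c - d) / 4.0    # left column minus right column
    diag_diff = (a - b - c + d) / 4.0   # diagonal minus anti-diagonal
    return approx, row_diff, col_diff, diag_diff

def haar_inverse(approx, row_diff, col_diff, diag_diff):
    """Rebuild the full-resolution image from the four bands."""
    h, w = approx.shape
    out = np.empty((2 * h, 2 * w))
    out[0::2, 0::2] = approx + row_diff + col_diff + diag_diff
    out[0::2, 1::2] = approx + row_diff - col_diff - diag_diff
    out[1::2, 0::2] = approx - row_diff + col_diff - diag_diff
    out[1::2, 1::2] = approx - row_diff - col_diff + diag_diff
    return out

image = np.arange(16).reshape(4, 4)
approx, row_diff, col_diff, diag_diff = haar_level(image)
rebuilt = haar_inverse(approx, row_diff, col_diff, diag_diff)
print(approx)  # [[ 2.5  4.5]
               #  [10.5 12.5]]
```

Applying `haar_level` again to `approx` gives the next, coarser level of the pyramid. Compression comes from the fact that the detail bands are often close to zero in smooth regions.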
For transferring images to other devices or due to computational storage constraints, images need to be compressed and cannot be kept at their original size. This is also important in displaying images over the internet; for example, on Google, a small thumbnail of an image is a highly compressed version of the original. Only when you click on the image is it shown in the original resolution. This process saves bandwidth on the servers.
Image components that are useful in the representation and description of shape need to be extracted for further processing or downstream tasks. Morphological Processing provides the tools (which are essentially mathematical operations) to accomplish this. For example, erosion shrinks the boundaries of foreground objects in an image, while dilation expands them.
This step involves partitioning an image into different key parts to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze. Image segmentation allows computers to focus on the more important parts of the image and discard the rest, which enables automated systems to have improved performance.
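To show what these two operations do, here is a minimal NumPy sketch of binary erosion and dilation with a square structuring element. It assumes a 0/1 `uint8` mask and treats everything outside the image as background; libraries such as OpenCV or scikit-image provide optimized versions.

```python
import numpy as np

def dilate(binary, size=3):
    """Binary dilation: a pixel becomes 1 if ANY pixel in its size x size window is 1."""
    pad = size // 2
    padded = np.pad(binary, pad, mode="constant", constant_values=0)
    out = np.zeros_like(binary)
    for dy in range(size):
        for dx in range(size):
            out |= padded[dy:dy + binary.shape[0], dx:dx + binary.shape[1]]
    return out

def erode(binary, size=3):
    """Binary erosion: a pixel stays 1 only if ALL pixels in its window are 1."""
    pad = size // 2
    padded = np.pad(binary, pad, mode="constant", constant_values=0)
    out = np.ones_like(binary)
    for dy in range(size):
        for dx in range(size):
            out &= padded[dy:dy + binary.shape[0], dx:dx + binary.shape[1]]
    return out

# A 3x3 white square in the middle of a 5x5 black image.
mask = np.zeros((5, 5), dtype=np.uint8)
mask[1:4, 1:4] = 1

eroded = erode(mask)    # the square shrinks to its single center pixel
dilated = dilate(mask)  # the square grows by one pixel on every side
print(eroded.sum(), dilated.sum())  # 1 25
```

Erosion removes a one-pixel layer from the object boundary and dilation adds one, which is why the two are often combined (as "opening" and "closing") to clean up segmentation masks.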
Image segmentation procedures are generally followed by this step, where the task for representation is to decide whether the segmented region should be depicted as a boundary or a complete region. Description deals with extracting attributes that result in some quantitative information of interest or are basic for differentiating one class of objects from another.
After the objects are segmented from an image and the representation and description phases are complete, the automated system needs to assign a label to the object—to let the human users know what object has been detected, for example, “vehicle” or “person”, etc.
Knowledge may be as simple as the bounding box coordinates for an object of interest that has been found in the image, along with the object label assigned to it. Anything that will help in solving the problem for the specific task at hand can be encoded into the knowledge base.
Image processing can be used to improve the quality of an image, remove undesired objects from an image, or even create new images from scratch. For example, image processing can be used to remove the background from an image of a person, leaving only the subject in the foreground.
Image processing is a vast and complex field, with many different algorithms and techniques that can be used to achieve different results. In this section, we will focus on some of the most common image processing tasks and how they are performed.
One of the most common image processing tasks is image enhancement, or improving the quality of an image. It has crucial applications in Computer Vision tasks, Remote Sensing, and surveillance. One common approach is adjusting the image's contrast and brightness.
Contrast is the difference in brightness between the lightest and darkest areas of an image. By increasing the contrast, the difference between light and dark regions becomes more pronounced, making details easier to see. Brightness is the overall lightness or darkness of an image. By increasing the brightness, an image can be made lighter, making it easier to see. Both contrast and brightness can be adjusted automatically by most image editing software, or they can be adjusted manually.
However, adjusting the contrast and brightness of an image is an elementary operation. Sometimes an image with perfect contrast and brightness, when upscaled, becomes blurry due to its lower pixel density (fewer pixels per inch). To address this issue, a relatively new and much more advanced concept of Image Super-Resolution is used, wherein a high-resolution image is obtained from its low-resolution counterpart(s). Deep Learning techniques are popularly used to accomplish this.
For example, one of the earliest uses of Deep Learning for the Super-Resolution problem is the SRCNN model, where a low-resolution image is first upscaled using traditional Bicubic Interpolation and then used as the input to a CNN model. The CNN extracts overlapping patches from the upscaled image, maps their representations non-linearly to high-resolution patch representations, and then aggregates these with a final convolution layer to obtain the reconstructed high-resolution image. The model framework is depicted visually below.
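A simple way to express both adjustments is a linear transform of the pixel values. The sketch below assumes one common formulation, in which contrast scales intensities around mid-gray (128) and brightness adds a constant offset; the function name and the pivot value are our own choices, and image editors may use different curves.

```python
import numpy as np

def adjust(image, contrast=1.0, brightness=0.0):
    """Scale intensities around mid-gray (contrast), then shift them (brightness)."""
    pixels = image.astype(np.float64)
    out = contrast * (pixels - 128.0) + 128.0 + brightness
    # Values pushed outside the 8-bit range are clipped to black or white.
    return np.clip(np.round(out), 0, 255).astype(np.uint8)

image = np.array([[50, 128, 200]], dtype=np.uint8)

more_contrast = adjust(image, contrast=2.0)
brighter = adjust(image, brightness=30)
print(more_contrast)  # [[  0 128 255]]
print(brighter)       # [[ 80 158 230]]
```

With higher contrast, dark pixels get darker and light pixels get lighter while mid-gray stays put. With higher brightness, every pixel moves up by the same amount.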
An example of the results obtained by the SRCNN model compared to its contemporaries is shown below.
The quality of images could degrade for several reasons, especially photos from the era when cloud storage was not so commonplace. For example, images scanned from hard copies taken with old instant cameras often acquire scratches on them.
Image Restoration is particularly fascinating because advanced techniques in this area could potentially restore damaged historical documents. Powerful Deep Learning-based image restoration algorithms may be able to reveal large chunks of missing information from torn documents.
Image inpainting, for example, falls under this category, and it is the process of filling in the missing pixels in an image. This can be done by using a texture synthesis algorithm, which synthesizes new textures to fill in the missing pixels. However, Deep Learning-based models are the de facto choice due to their pattern recognition capabilities.
One example is the image inpainting framework (based on the U-Net autoencoder) proposed in this paper, which uses a two-step approach to the problem: a coarse estimation step and a refinement step. The main feature of this network is the Coherent Semantic Attention (CSA) layer that fills the occluded regions in the input images through iterative optimization. The architecture of the proposed model is shown below.
Some example results obtained by the authors and other competing models are shown below.
Image segmentation is the process of partitioning an image into multiple segments or regions. Each segment represents a different object in the image, and image segmentation is often used as a preprocessing step for object detection.
There are many different algorithms that can be used for image segmentation, but one of the most common approaches is to use thresholding. Binary thresholding, for example, is the process of converting an image into a binary image, where each pixel is either black or white. The threshold value is chosen such that all pixels with a brightness level below the threshold are turned black, and all pixels with a brightness level above the threshold are turned white. This results in the objects in the image being segmented, as they are now represented by distinct black and white regions.
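Binary thresholding is a one-liner in NumPy. In this minimal sketch, the threshold value of 127 is an arbitrary choice for illustration, and pixels exactly equal to the threshold are (by our assumption) sent to black.

```python
import numpy as np

def binary_threshold(gray, threshold):
    """Pixels above the threshold become white (255); all others become black (0)."""
    return np.where(gray > threshold, 255, 0).astype(np.uint8)

gray = np.array([[10, 127, 128, 250]], dtype=np.uint8)
binary = binary_threshold(gray, threshold=127)
print(binary)  # [[  0   0 255 255]]
```

In practice the threshold is usually chosen from the image histogram rather than fixed by hand, for example with Otsu's method.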
In multi-level thresholding, as the name suggests, different parts of an image are converted to different shades of gray depending on the number of levels. This paper, for example, used multi-level thresholding for medical imaging—specifically for brain MRI segmentation, an example of which is shown below.
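The multi-level case can be sketched with `np.digitize`, which assigns every pixel to the interval between two consecutive thresholds. The three threshold values below are arbitrary illustrations, not the ones used in the paper, and the function name is hypothetical.

```python
import numpy as np

def multilevel_threshold(gray, thresholds):
    """Quantize a grayscale image into len(thresholds) + 1 evenly spaced gray levels."""
    thresholds = np.sort(np.asarray(thresholds))
    # labels[i, j] = number of thresholds that are <= gray[i, j]
    labels = np.digitize(gray, thresholds)
    levels = np.linspace(0, 255, len(thresholds) + 1)  # one output shade per interval
    return np.round(levels[labels]).astype(np.uint8)

gray = np.array([[10, 64, 100, 128, 200, 255]], dtype=np.uint8)
segmented = multilevel_threshold(gray, thresholds=[64, 128, 192])
print(segmented)  # [[  0  85  85 170 255 255]]
```

With three thresholds the image is reduced to four shades, each of which can be read as one segment label.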
Modern techniques use automated image segmentation algorithms using deep learning for both binary and multi-label segmentation problems. For example, the PFNet or Positioning and Focus Network is a CNN-based model that addresses the camouflaged object segmentation problem. It consists of two key modules—the positioning module (PM) designed for object detection (that mimics predators that try to identify a coarse position of the prey); and the focus module (FM) designed to perform the identification process in predation for refining the initial segmentation results by focusing on the ambiguous regions. The architecture of the PFNet model is shown below.
The results obtained by the PFNet model outperformed contemporary state-of-the-art models, examples of which are shown below.
Object Detection is the task of identifying objects in an image and is often used in applications such as security and surveillance. Many different algorithms can be used for object detection, but the most common approach is to use Deep Learning models, specifically Convolutional Neural Networks (CNNs).
CNNs are a type of Artificial Neural Network that were specifically designed for image processing tasks since the convolution operation in their core helps the computer “see” patches of an image at once instead of having to deal with one pixel at a time. CNNs trained for object detection will output a bounding box (as shown in the illustration above) depicting the location where the object is detected in the image along with its class label.
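To see what "looking at a patch at once" means, here is a bare-bones sketch of the sliding-window operation at the core of a convolution layer (strictly speaking a cross-correlation, which is what deep learning frameworks compute). The hand-written edge kernel is an assumption for illustration; in a CNN the kernel values are learned.

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2D cross-correlation: slide the kernel over every patch of the image."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=np.float64)
    for y in range(out_h):
        for x in range(out_w):
            patch = image[y:y + kh, x:x + kw]   # a window of neighboring pixels
            out[y, x] = np.sum(patch * kernel)  # one weighted sum per patch
    return out

# A dark region on the left and a bright region on the right.
image = np.array([[0, 0, 9, 9],
                  [0, 0, 9, 9],
                  [0, 0, 9, 9]], dtype=np.float64)
edge_kernel = np.array([[-1.0, 1.0]])  # responds to left-to-right intensity jumps

response = conv2d(image, edge_kernel)
print(response)  # [[0. 9. 0.]
                 #  [0. 9. 0.]
                 #  [0. 9. 0.]]
```

The output is non-zero only where the patch straddles the boundary, so the same small set of weights detects the edge wherever it appears in the image.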
An example of such a network is the popular Faster R-CNN (Region-based Convolutional Neural Network) model, whose Region Proposal Network (RPN) is a fully convolutional network that can be trained end-to-end. The Faster R-CNN model alternates between fine-tuning for the region proposal task (predicting regions in the image where an object might be present) and then fine-tuning for object detection (detecting what object is present) while keeping the proposals fixed. The architecture and some examples of region proposals are shown below.
Image compression is the process of reducing the file size of an image while still trying to preserve the quality of the image. This is done to save storage space, especially to run Image Processing algorithms on mobile and edge devices, or to reduce the bandwidth required to transmit the image.
Traditional approaches use lossy compression algorithms, which work by reducing the quality of the image slightly in order to achieve a smaller file size. The JPEG file format, for example, uses the Discrete Cosine Transform for image compression.
Modern approaches to image compression involve the use of Deep Learning for encoding images into a lower-dimensional feature space and then recovering that on the receiver’s side using a decoding network. Such models are called autoencoders, which consist of an encoding branch that learns an efficient encoding scheme and a decoder branch that tries to reconstruct the image from the encoded features with as little loss as possible.
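The core idea behind DCT-based compression can be sketched in a few lines: transform an 8x8 block, round the coefficients coarsely, and transform back. This is a heavily simplified illustration, not the JPEG codec itself. It assumes a single uniform quantization step of 10 (real JPEG uses per-frequency quantization tables, a level shift, and entropy coding), and the block values are made up.

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix C, so that coeffs = C @ block @ C.T."""
    k = np.arange(n).reshape(-1, 1)  # frequency index (rows)
    i = np.arange(n).reshape(1, -1)  # sample index (columns)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)       # the constant (DC) basis vector
    return c

C = dct_matrix(8)

# A hypothetical 8x8 block: left half at intensity 100, right half at 120.
block = np.full((8, 8), 100.0)
block[:, 4:] = 120.0

coeffs = C @ block @ C.T                     # forward 2D DCT
step = 10.0                                  # assumed uniform quantization step
quantized = np.round(coeffs / step)          # the lossy part: small values become 0
reconstructed = C.T @ (quantized * step) @ C  # dequantize + inverse 2D DCT

print(np.count_nonzero(quantized), "of 64 coefficients are non-zero")
print("max pixel error:", np.abs(reconstructed - block).max())
```

Most quantized coefficients are zero and therefore cheap to store, while the reconstructed block stays close to the original. A larger `step` trades more error for fewer non-zero coefficients.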
Basic framework for autoencoder training. Image by the author.
For example, this paper proposed a variable rate image compression framework using a conditional autoencoder. The conditional autoencoder is conditioned on the Lagrange multiplier, i.e., the network takes the Lagrange multiplier as input and produces a latent representation whose rate depends on the input value. The authors also train the network with mixed quantization bin sizes for fine-tuning the rate of compression. Their framework is depicted below.
The authors obtained superior results compared to popular methods like JPEG, both by reducing the bits per pixel and in reconstruction quality. An example of this is shown below.
Image manipulation is the process of altering an image to change its appearance. This may be desired for several reasons, such as removing an unwanted object from an image or adding an object that is not present in the image. Graphic designers often do this to create posters, films, etc.
An example of Image Manipulation is Neural Style Transfer, which is a technique that utilizes Deep Learning models to adapt an image to the style of another. For example, a regular image could be transferred to the style of “Starry Night” by van Gogh. Neural Style Transfer also enables AI to generate art.
Example of Neural Style Transfer. Image by the author.
An example of such a model is the one proposed in this paper that is able to transfer arbitrary new styles in real-time (other approaches often take much longer inference times) using an autoencoder-based framework. The authors proposed an adaptive instance normalization (AdaIN) layer that adjusts the mean and variance of the content input (the image that needs to be changed) to match those of the style input (image whose style is to be adopted). The AdaIN output is then decoded back to the image space to get the final style transferred image. An overview of the framework is shown below.
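The AdaIN operation itself is compact enough to sketch directly. The version below works on plain NumPy arrays of shape (channels, height, width) that stand in for encoder feature maps; the random inputs and the small `eps` stabilizer are assumptions for the example, and the encoder and decoder networks are omitted.

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """Give each content channel the mean and standard deviation of the style channel.

    content, style: feature maps of shape (C, H, W); spatial sizes may differ.
    """
    c_mean = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True) + eps  # eps avoids division by zero
    s_mean = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True)
    normalized = (content - c_mean) / c_std  # zero mean, unit variance per channel
    return s_std * normalized + s_mean       # re-scale and shift to the style statistics

rng = np.random.default_rng(0)
content = rng.normal(size=(3, 4, 4))            # stand-in for content features
style = 2.0 * rng.normal(size=(3, 5, 5)) + 7.0  # stand-in for style features

out = adain(content, style)
print(out.mean(axis=(1, 2)))    # matches the per-channel style means
print(style.mean(axis=(1, 2)))
```

Note that the layer has no learned parameters: the scale and shift come entirely from the style input, which is what allows new styles to be applied without retraining.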
Examples of images transferred to other artistic styles are shown below and compared to existing state-of-the-art methods.
Synthesis of new images is another important task in image processing, especially for Deep Learning algorithms, which require large quantities of labeled data to train. Image generation methods typically use Generative Adversarial Networks (GANs), which are another distinctive neural network architecture.
GANs consist of two separate models: the generator, which generates the synthetic images, and the discriminator, which tries to distinguish synthetic images from real images. The generator tries to synthesize images that look realistic to fool the discriminator, and the discriminator trains to better critique whether an image is synthetic or real. This adversarial game allows the generator to produce photo-realistic images after several iterations, which can then be used to train other Deep Learning models.
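The adversarial game can be made concrete by looking at the two loss functions. The sketch below assumes the standard binary cross-entropy formulation with the commonly used "non-saturating" generator loss. The discriminator outputs are made-up probabilities (1 means "real"), and the networks themselves are omitted.

```python
import numpy as np

def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Low when real images score close to 1 and synthetic images close to 0."""
    return -np.mean(np.log(d_real + eps)) - np.mean(np.log(1.0 - d_fake + eps))

def generator_loss(d_fake, eps=1e-12):
    """Low when the discriminator scores synthetic images close to 1 (it is fooled)."""
    return -np.mean(np.log(d_fake + eps))

d_real = np.array([0.9, 0.8])         # discriminator scores on real images
d_fake_caught = np.array([0.1, 0.2])  # synthetic images it correctly rejects
d_fake_fooled = np.array([0.9, 0.8])  # synthetic images it mistakes for real

# The same change that lowers the generator's loss raises the discriminator's.
print(generator_loss(d_fake_caught), generator_loss(d_fake_fooled))
print(discriminator_loss(d_real, d_fake_caught), discriminator_loss(d_real, d_fake_fooled))
```

During training the two losses are minimized in alternation, each with respect to its own network's weights, which is what drives the generator toward more realistic images.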
Image-to-Image translation is a class of vision and graphics problems where the goal is to learn the mapping between an input image and an output image using a training set of aligned image pairs. For example, a free-hand sketch can be drawn as an input to get a realistic image of the object depicted in the sketch as the output, as shown below.
Pix2pix is a popular model in this domain that uses a conditional GAN (cGAN) model for general purpose image-to-image translation, i.e., several problems in image processing like semantic segmentation, sketch-to-image translation, and colorizing images, are all solved by the same network. cGANs involve the conditional generation of images by a generator model. For example, image generation can be conditioned on a class label to generate images specific to that class.
Pix2pix consists of a U-Net generator network and a PatchGAN discriminator network, which, unlike traditional GAN discriminators, takes in N×N patches of an image and predicts whether each patch is real or fake. The authors argue that such a discriminator enforces more constraints that encourage sharp high-frequency detail. Examples of results obtained by the pix2pix model on image-to-map and map-to-image tasks are shown below.
The information technology era we live in has made visual data widely available. However, a lot of processing is required for this data to be transferred over the internet or used for purposes like information extraction, predictive modeling, etc.
The advancement of deep learning technology gave rise to CNN models, which were specifically designed for processing images. Since then, several advanced models have been developed that cater to specific tasks in the Image Processing niche. We looked at some of the most critical techniques in Image Processing and popular Deep Learning-based methods that address these problems, from image compression and enhancement to image synthesis.
Recent research is focused on reducing the need for ground truth labels for complex tasks like object detection, semantic segmentation, etc., by employing concepts like Semi-Supervised Learning and Self-Supervised Learning, which makes models more suitable for broad practical applications.