Image Classification is one of the most fundamental tasks in computer vision.
And for good reason.
Image classification has revolutionized and propelled technological advancements in the AI field, from the automobile industry to medical analysis and automated perception in robots.
But how does image classification actually work, and what are its benefits and limitations?
This guide will help you find answers to those questions and understand the following:
Ready? Let's get started!
Image classification (or Image recognition) is a subdomain of computer vision in which an algorithm looks at an image and assigns it a tag from a collection of predefined tags or categories that it has been trained on.
Vision is responsible for 80-85 percent of our perception of the world, and we, as human beings, trivially perform classification daily on whatever data we come across.
Therefore, emulating a classification task with the help of neural networks is one of the first uses of computer vision that researchers thought about. Let's explore it in more detail.
A variety of algorithms can solve image classification as a task. Broadly, we can classify them into supervised and unsupervised algorithms.
In supervised classification, the classification algorithm is trained on a set of images along with their corresponding labels.
This helps the algorithm predict the correct tag for images that it has not yet seen with the help of information it has extracted from labeled sample data.
During training, the algorithm extracts features from the image matrix, keeping the information that is most useful for telling classes apart. These features represent the image in a lower-dimensional feature space and allow the classifier to classify images based on them.
During evaluation, features of the test images are extracted and classified by the network, which now knows the typical features of every class it has been trained on.
Popular supervised methods of classification based on machine learning algorithms include:
Popular neural networks used for Supervised Image Classification include AlexNet, ResNet, DenseNet, and Inception.
As you might have guessed, data labeling is an essential part of supervised Image classification, with the accuracy of the labeled data largely determining the performance of the ML model used.
Supervised classification algorithms can be further divided into two subcategories based on the data being classified.
Single-label classification is the most common classification task in supervised image classification.
As the name suggests, a single label or annotation is present for each image in single-label classification. Therefore, the model outputs a single value or prediction for each image that it sees.
The output from the model is a vector with a length equal to the number of classes, where each value denotes the probability that the image belongs to the corresponding class. (The ground-truth label, by contrast, is usually one-hot encoded.)
A Softmax activation function is employed to make sure the probabilities sum up to one, and the class with the maximum probability is taken as the model’s output.
While the Softmax initially seems not to add any value to the prediction, as the most probable class does not change after applying it, it bounds each output between zero and one, which helps gauge the model's confidence from the Softmax score.
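To make the Softmax step concrete, here is a minimal sketch in plain Python. The scores in `logits` are made-up values for three hypothetical classes, not the output of a real model.

```python
import math

def softmax(logits):
    """Turn raw class scores into probabilities that sum to one."""
    # Subtracting the maximum keeps exp() numerically stable
    # and does not change the resulting probabilities.
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical raw scores for three classes
probs = softmax(logits)

# The predicted class is the index of the highest probability.
predicted_class = probs.index(max(probs))
print(probs, predicted_class)
```

Note that the predicted class is the same whether you take the maximum of `logits` or of `probs`; the Softmax only rescales the scores into a range that can be read as confidence.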
Some examples of single-label classification datasets include MNIST, SVHN, ImageNet, and more.
Single-label classification can be either multiclass classification, where there are more than two classes, or binary classification, where the number of classes is restricted to two.
Multi-label classification is a classification task where each image can contain more than one label, and some images can contain all the labels simultaneously.
While this seems similar to single-label classification in some respects, the problem is more complex because the model has to decide about every label independently instead of picking a single winner.
Multi-label classification tasks popularly exist in the medical imaging domain where a patient can have more than one disease to be diagnosed from visual data in the form of X-rays.
Furthermore, in natural surroundings, image labeling can also be framed as a multi-label classification problem, indicating objects present in the images.
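One common way to handle multi-label outputs, sketched below, is to score every class independently with a sigmoid and keep each label whose probability passes a threshold. This is an assumption for illustration; the class names, scores, and the 0.5 threshold are hypothetical and not taken from any specific model.

```python
import math

def sigmoid(z):
    """Squash a raw score into an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_labels(logits, class_names, threshold=0.5):
    """Return every class whose own probability meets the threshold."""
    return [
        name
        for name, z in zip(class_names, logits)
        if sigmoid(z) >= threshold
    ]

class_names = ["pneumonia", "effusion", "cardiomegaly"]  # hypothetical labels
logits = [1.2, -0.7, 0.3]                                # hypothetical raw scores

labels = predict_labels(logits, class_names)
print(labels)
```

Unlike the Softmax, the per-class probabilities here do not need to sum to one, so an image can receive zero, one, or several labels.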
Unsupervised learning is a type of learning where the algorithm uses only raw data for training.
Classification tags are typically absent in this type of learning, and the model learns by recognizing patterns in the training data used.
Like supervised classification, unsupervised-based methods also involve the initial feature extraction step with the most informative details about the image extracted in the form of features.
These features are then processed by parametric (Gaussian Mixture Models) and nonparametric (K-means) clustering methods, or other unsupervised learning algorithms.
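As a rough sketch of the clustering step, here is a tiny K-means loop in plain Python. It assumes the features have already been extracted into 2D points; the points and the starting centroids are made up for the example.

```python
def nearest(point, centroids):
    """Index of the centroid closest to the point (squared Euclidean distance)."""
    distances = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
    return distances.index(min(distances))

def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points and re-centering the centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, [nearest(p, centroids) for p in points]

# Hypothetical 2D feature vectors forming two loose groups.
points = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]
centroids, assignments = kmeans(points, centroids=[points[0], points[3]])
print(assignments)
```

No labels are used anywhere: the groups emerge only from how close the feature vectors are to each other.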
Classification in computer vision extends beyond simple 2D image classification to the classification of visual information in the form of video and 3D data.
Significantly differing from Image Classification, which only uses Image Processing algorithms and Convolutional Neural Networks to make a classification, Video Classification tasks make use of both image and temporal (relating to time) data.
Video Classification algorithms utilize the relation between the various frames in a continuous video to perform better than standard Image Classification algorithms on these tasks.
Recurrent neural networks (RNNs), including LSTMs (Long Short-Term Memory networks), are better suited to sequential data. They are used in conjunction with CNNs in video classification to exploit the temporal relations between frames that purely CNN-based methods would miss.
General video classification datasets include sports datasets and datasets obtained from YouTube.
3D data classification is very similar to 2D image classification, with the primary difference being in the structure of the CNN and the nature of the movement of the sliding kernel.
Kernels in 3D data classification are also 3D and move along all three axes, rather than along two axes as in 2D CNNs.
CNNs are adept in capturing spatial data and hence adapt easily when the data is spaced out over three axes as compared to two.
Popular 3D classification datasets are found in the medical domain, with brain data obtained from MRI scans and structural data of macromolecules obtained from Cryo-Electron Microscopy.
A computer visualizes an image in the form of pixels. In its view, the image is just an array of matrices, with the size of the matrix dependent on the image resolution.
Image processing for the computer is thus the analysis of this mathematical data with the help of algorithms.
The algorithms break down the image into a set of its most prominent features, reducing the workload on the final classifier. These features give the classifier an idea of what the image represents and what class it might be put into.
The feature extraction process forms the most crucial step in classifying an image as all further steps depend on it.
Classification, particularly supervised classification, also depends largely on the data fed to the algorithm. A well-balanced classification dataset works wonders as compared to a bad dataset with class-wise data imbalance and poor quality of images and annotations.
Data annotation is a very important step for the supervised classification of images.
Labeled data should be collected and annotated accurately for classification algorithms to do their job well.
Data should be diversified with the object to be classified present in various environments and captured from all possible angles.
The development of a diverse dataset helps the model adapt to images it has not encountered before and make sure that the predictions that the model makes are due to the presence of the object itself and not due to other factors.
A popular story is passed around to demonstrate how machine learning can pick up unrelated features in datasets that have not been diversified enough.
Here's how it goes:
The story begins with the US Army trying to use Neural Networks to detect camouflaged enemy tanks.
Researchers trained a supervised network on 50 photos of camouflaged tanks and 50 photos of trees without tanks. They then validated their network on 200 more images they had captured for testing and found that the network successfully detected camouflaged tanks.
The researchers handed over the work to the Pentagon, which handed it back right after, complaining that in their tests, the network did not work at all.
It turned out that the researchers were using a flawed dataset.
They had taken pictures of camouflaged tanks on cloudy days and pictures of trees on sunny days, leading to the network ignoring the tanks and discerning only between cloudy and sunny days.
As this story shows, a diverse dataset is necessary for a machine learning model to learn what in the picture the class label actually refers to.
In addition to the diversified dataset, you also need to collect enough data for each class.
A dataset of several classes needs to have at least a minimum amount of data per class. In the absence of proper distribution of data across classes, a class-imbalanced dataset is formed, leading to the machine learning model favoring one class over the other in training and inference.
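A quick way to spot class imbalance before training is to count the samples per class. The sketch below uses a made-up label list; the class names and counts are hypothetical.

```python
from collections import Counter

def class_balance(labels):
    """Count samples per class and report the largest-to-smallest ratio."""
    counts = Counter(labels)
    ratio = max(counts.values()) / min(counts.values())
    return counts, ratio

# Hypothetical dataset: two well-represented classes and one rare class.
labels = ["tank"] * 50 + ["trees"] * 50 + ["truck"] * 5

counts, ratio = class_balance(labels)
print(counts)
print(f"largest class is {ratio:.1f}x the smallest")
```

A ratio close to 1 means the classes are evenly represented. What counts as "too imbalanced" depends on the task, so treat any cutoff as a judgment call.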
Here's a recap of best practices for data collection and annotation:
High-quality image data for classification is readily available through public datasets like ImageNet, MNIST, SVHN, CIFAR-10, CIFAR-100, and MS-COCO.
These datasets typically contain thousands of class-balanced, high-quality images that have been accurately annotated and labeled.
One of the most popular datasets for beginners is the MNIST dataset, as it is compact and easy to train on. The MNIST dataset consists of handwritten digits that a deep learning model tries to recognize.
The number of classes in this dataset is the number of possible digits (0 through 9), i.e. 10. There are 60,000 training samples and 10,000 test samples in the entire dataset, with each sample being a 28x28-pixel image.
CNNs or Convolutional Neural Networks are the primary neural networks used in computer vision and as image classifiers.
These networks have convolutional layers that work by sliding a kernel or a filter across the input image to capture relevant details in the form of features.
To understand how CNNs work, we must first understand kernels and how they help modify an image.
A kernel is an n×n square of values called weights. The kernel slides across the image horizontally and vertically to capture each n×n window, multiplies the values in that window with its weights element-wise, and then sums all n² products. The kernel therefore reduces each window to a single value in the output, which forms a representation of the image pixel values.
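The sliding-window operation can be sketched in a few lines of plain Python. This version assumes a stride of 1 and no padding, and the tiny image and edge-detecting kernel are made up for illustration. Real CNN libraries do the same arithmetic far more efficiently.

```python
def convolve2d(image, kernel):
    """Slide an n x n kernel over the image and sum the element-wise products."""
    n = len(kernel)
    out_height = len(image) - n + 1
    out_width = len(image[0]) - n + 1
    output = []
    for i in range(out_height):
        row = []
        for j in range(out_width):
            # Multiply the n x n window by the kernel weights and add everything up.
            total = sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(n)
                for dj in range(n)
            )
            row.append(total)
        output.append(row)
    return output

# Hypothetical 4x4 image: dark on the left, bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hypothetical 2x2 kernel that responds to dark-to-bright vertical edges.
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(image, kernel)
print(feature_map)  # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The output is non-zero only in the column where the window straddles the edge, which is exactly the kind of feature a kernel captures.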
As CNNs become deeper, kernels are stacked on top of each other to capture information from the outputs of other kernels and form a knowledge representation of the input image. The learnable weights of a kernel (also called its parameters) allow it to prioritize some details in the input over others, as all details are linearly multiplied with the kernel weights.
Non-linearity is later introduced at the end of each convolutional block with the help of Tanh, Sigmoid, or the ReLU activation function.
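For reference, the three activation functions mentioned here can be written as below. The small `feature_map` is a hypothetical convolution output used only to show the effect of ReLU.

```python
import math

def relu(x):
    """Keep positive values, replace negative values with zero."""
    return max(0.0, x)

def sigmoid(x):
    """Squash any value into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Squash any value into the range (-1, 1)."""
    return math.tanh(x)

# Hypothetical output of a convolutional layer.
feature_map = [[-2.0, 0.5], [3.0, -0.1]]

# Activations are applied element-wise to the feature map.
activated = [[relu(v) for v in row] for row in feature_map]
print(activated)  # [[0.0, 0.5], [3.0, 0.0]]
```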
Apart from being able to guess what features the image contains, CNNs as classifiers are largely translation-invariant. In other words, the position of an object in the image does not affect the capability of the CNN to recognize it and classify it into a proper class.
To make CNNs more robust to such changes in position and orientation, data augmentation is performed while training them, with the augmentation involving random (and constrained) rotations, flips, and crops of the image data.
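Here is a minimal sketch of two such augmentations, a horizontal flip and a random crop, on an image stored as a nested list. The function names, the 50% flip probability, and the 4x4 image are assumptions for illustration; in practice you would typically rely on the augmentation utilities of your deep learning framework.

```python
import random

def horizontal_flip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]

def random_crop(image, size, rng):
    """Cut out a random size x size window from the image."""
    top = rng.randint(0, len(image) - size)
    left = rng.randint(0, len(image[0]) - size)
    return [row[left:left + size] for row in image[top:top + size]]

def augment(image, crop_size, rng):
    """Flip with 50% probability, then take a random crop."""
    if rng.random() < 0.5:
        image = horizontal_flip(image)
    return random_crop(image, crop_size, rng)

rng = random.Random(0)  # seeded so the example is repeatable
image = [[r * 4 + c for c in range(4)] for r in range(4)]  # hypothetical 4x4 image

augmented = augment(image, crop_size=3, rng=rng)
print(augmented)
```

Each call produces a slightly different view of the same image, so the model sees more variety without any new data being collected.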
Besides Deep Learning algorithms and CNNs, machine learning-based methods like KNNs, SVMs, and Random Forests can also perform image classification.
While these methods do not perform as accurately as deep learning methods, they are much faster to implement and run.
This gives them an edge in domains that do not require the level of accuracy offered by CNNs and would rather have the speed offered by traditional ML algorithms.
K-nearest neighbors is a non-parametric classification algorithm that takes into account the k nearest data samples to decide which class a new sample belongs to, with k being a hyperparameter to fine-tune.
The K-nearest neighbor algorithm merely forms boundaries based on the training dataset and then projects the testing data onto the feature space and checks which boundary it fits in.
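The idea fits in a few lines of plain Python. The sketch below assumes the images have already been reduced to 2D feature vectors; the points and labels are made up.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify the query by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical 2D feature vectors for two classes.
train_points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (4.0, 4.0), (4.2, 3.9), (3.8, 4.1)]
train_labels = ["cat", "cat", "cat", "dog", "dog", "dog"]

print(knn_predict(train_points, train_labels, query=(1.1, 1.0)))  # cat
print(knn_predict(train_points, train_labels, query=(4.0, 3.8)))  # dog
```

There is no training step beyond storing the data, which is why KNN is quick to set up but can become slow at prediction time on large datasets.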
Support Vector Machine is a supervised classification algorithm that works by separating samples with a hyperplane, often in a higher-dimensional space.
Not all data can be correctly classified and differentiated in the linear plane by a machine learning algorithm. SVMs segregate the data when segregation seems impossible by mapping the data into an n-dimensional space where segregation becomes easier, and a clear decision boundary can be drawn.
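The toy example below illustrates only the mapping idea, not SVM training. The 1D data cannot be split by a single threshold, but after mapping each value x to the pair (x, x²) a straight line separates the classes. The data, the mapping, and the hand-picked hyperplane are all assumptions for illustration; a real SVM would learn the maximum-margin boundary itself.

```python
def feature_map(x):
    """Lift a 1D value into 2D by adding its square as a second coordinate."""
    return (x, x * x)

def linear_decision(point, weights=(0.0, 1.0), bias=-1.0):
    """Hand-picked hyperplane x² = 1 in the mapped space (not learned)."""
    score = sum(w * v for w, v in zip(weights, point)) + bias
    return "outer" if score > 0 else "inner"

# Hypothetical 1D data: the "inner" class sits between the "outer" samples,
# so no single threshold on x can separate them.
inner = [-0.5, 0.0, 0.5]
outer = [-2.0, -1.5, 1.5, 2.0]

inner_predictions = [linear_decision(feature_map(x)) for x in inner]
outer_predictions = [linear_decision(feature_map(x)) for x in outer]
print(inner_predictions)
print(outer_predictions)
```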
To understand random forests, we must have an idea of decision trees. Decision trees are flowchart-like structures that consist of nodes representing individual “tests” on features.
These tests are simple, and branches that disperse from these nodes denote the outcome of these tests.
For example, consider the classification of dogs and cats.
Cats typically have a lighter fur coat as compared to dogs. A possible “test” on the “fur coat” feature would be: “is the fur coat dark or light?” with branches denoting “light” and “dark” dispersing from the node.
A Random Forest is a collection of decision trees and acts as an ensemble of all the trees it contains. After being ensembled, these decision trees become much more robust and accurate collectively, making the Random Forest a strong machine learning algorithm for classification.
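The voting part of the ensemble can be sketched as follows. The three "trees" here are hand-written single-test rules rather than trained decision trees, and the features (`fur`, `weight_kg`, `ears`) are hypothetical. A real Random Forest also trains each tree on a random subset of the data and features.

```python
from collections import Counter

# Each "tree" is a single hypothetical test on one feature.
def tree_fur(sample):
    return "cat" if sample["fur"] == "light" else "dog"

def tree_weight(sample):
    return "cat" if sample["weight_kg"] < 6 else "dog"

def tree_ears(sample):
    return "cat" if sample["ears"] == "pointed" else "dog"

def forest_predict(trees, sample):
    """Let every tree vote and return the most common answer."""
    votes = Counter(tree(sample) for tree in trees)
    return votes.most_common(1)[0][0]

# A dark-furred cat: the fur test alone gets it wrong.
sample = {"fur": "dark", "weight_kg": 4.0, "ears": "pointed"}

print(tree_fur(sample))                                          # dog
print(forest_predict([tree_fur, tree_weight, tree_ears], sample))  # cat
```

A single weak test misclassifies this sample, but the majority vote corrects it, which is the intuition behind the robustness of the ensemble.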
Image Classification models have to be evaluated to determine how well they perform in comparison to other models.
Here are some well-known metrics used in image classification:
Precision is a metric that is defined for each class. Precision in a class tells us what proportion of data predicted by the ML model to belong to the class was actually part of the class in the validation data.
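A simple formula can demonstrate this:

Precision = TP / (TP + FP)

Here, TP is the number of true positives (samples correctly predicted as the class) and FP is the number of false positives (samples wrongly predicted as the class).
Recall, like precision, is defined for each class.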
Recall tells us what proportion of the data from the validation set belonging to the class was identified correctly (as belonging to the class).
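Recall can be represented as:

Recall = TP / (TP + FN)

Here, TP is the number of true positives and FN is the number of false negatives (samples of the class that the model missed).
F1 Score is the harmonic mean of precision and recall, which helps us balance the two and get a single summary of how the model performs.
F1 score as a metric is calculated as follows:

F1 = 2 × Precision × Recall / (Precision + Recall)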
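All three metrics can be computed for one class from the counts of true positives, false positives, and false negatives. The sketch below uses the standard definitions; the labels and predictions are made up.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall, and F1 for one class from paired labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1

# Hypothetical validation labels and model predictions.
y_true = ["cat", "cat", "cat", "dog", "dog", "dog"]
y_pred = ["cat", "cat", "dog", "cat", "dog", "dog"]

precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive="cat")
print(precision, recall, f1)
```

For the "cat" class there are 2 true positives, 1 false positive, and 1 false negative, so precision, recall, and F1 all come out to 2/3.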
Precision and Recall scores largely depend on the problem the classification model is trying to address.
Recall is a critical metric in medical image analysis problems, like the detection of pneumonia from chest X-rays, where false negatives must be minimized to avoid diagnosing a patient as healthy when they actually have the disease.
Precision is needed where the false positives have to be avoided, like email spam detection. If an important email is classified as spam, then a user would face significant issues.
Finally, let's recap everything you've learned today about image classification: