Image Recognition: Definition, Algorithms & Uses

16 min read

—

Oct 5, 2022

Image recognition is the process of identifying objects in images. We cover the basics of the task and different approaches to it.

Pragati Baheti

Guest Author and Software Developer

Computer vision is the process of using computers to understand digital images. A core task of computer vision is image recognition, which helps to recognize and categorize elements within images.

Image recognition involves a high-level understanding of context and a great deal of parallel processing, which is why human visual performance is still far superior to that of computers. When we see an object or scene, we automatically identify objects as distinct instances and associate them with one another.

For machines, image recognition is a highly complex task requiring significant processing power. And yet the image recognition market is expected to rise globally to $42.2 billion by the end of the year.

Let’s see what makes image recognition technology so attractive and how it works.

Here’s what we’ll cover:

Image recognition definition
History of image recognition technology
How it works: image recognition algorithms
Practical applications of AI for image recognition

What is image recognition, and why does it matter?

Today, users share a massive amount of data through apps, social networks, and websites in the form of images. With the rise of smartphones and high-resolution cameras, the number of generated digital images and videos has skyrocketed. In fact, it’s estimated that there have been over 50B images uploaded to Instagram since its launch.

So, all industries have a vast volume of digital data to fall back on to deliver better and more innovative services.

Image recognition allows machines to identify objects, people, entities, and other variables in images. It is a sub-category of computer vision technology that deals with recognizing patterns and regularities in the image data, and later classifying them into categories by interpreting image pixel patterns.

Image recognition includes different methods of gathering, processing, and analyzing data from the real world. Because this data is high-dimensional, the aim is to reduce it to numerical or symbolic information, for example in the form of decisions.

New to computer vision? Have a look at our article explaining the concept: What Is Computer Vision? [Basic Tasks & Techniques]

a digital image with pixels with numerical representation

A digital image consists of pixels, each holding a finite, discrete numeric value that represents its intensity, or grey level. AI-based algorithms enable machines to understand the patterns of these pixels and recognize the image.
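To make the idea of an image as a grid of numbers concrete, here is a minimal sketch in plain Python. The 4x4 pixel values are made up for illustration, and nested lists stand in for the array types a real imaging library would use.

```python
# A tiny 4x4 grayscale "image": each value is one pixel's intensity,
# from 0 (black) to 255 (white). The values are invented for illustration.
image = [
    [0,   0,   255, 255],
    [0,   0,   255, 255],
    [128, 128, 64,  64],
    [128, 128, 64,  64],
]

height = len(image)
width = len(image[0])

# A pixel is addressed by its location (row, column).
top_right = image[0][3]

# Simple statistics over all pixel intensities.
pixels = [value for row in image for value in row]
mean_intensity = sum(pixels) / len(pixels)

print(height, width)    # 4 4
print(top_right)        # 255
print(mean_intensity)   # 111.75
```

Everything an image recognition model learns is ultimately derived from numbers like these: where a pixel sits and how bright it is.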

How image recognition evolved over time

Vision is the most amazing and complex of senses.

It took roughly 500 million years of evolution for vision to reach its current level of sophistication. In recent years, we have made vast advancements in extending this visual ability to computers and machines.

The first steps toward what would later become image recognition technology happened in the late 1950s. An influential 1959 paper is often cited as the starting point to the basics of image recognition, though it had no direct relation to the algorithmic aspect of the development.

The paper described the fundamental response properties of visual neurons as image recognition always starts with processing simple structures—such as easily distinguishable edges of objects. This principle is still the seed of the later deep learning technologies used in computer-based image recognition.

Another milestone occurred around the same time: the invention of the first digital photo scanner.

A group of researchers led by Russell Kirsch developed a machine that made it possible to convert images into grids of numbers, called pixels, that machines can understand. One of the first images to be scanned was a small, grainy photograph captured at 30,976 pixels (176 x 176), but it has become an iconic image today.

Lawrence Roberts is often regarded as the real founder of image recognition and computer vision applications, thanks to his 1963 doctoral thesis entitled "Machine perception of three-dimensional solids."

He described the process of extracting 3D information about objects from 2D photographs by converting 2D photographs into line drawings. The feature extraction and mapping into a 3-dimensional space paved the way for a better contextual representation of the images.

The processes highlighted by Roberts proved to be an excellent starting point for later research into computer-controlled 3D systems and image recognition. Low-level algorithms were developed to detect edges, corners, curves, etc., and were used as stepping stones to understanding higher-level visual data.

After 2010, developments in image recognition and object detection really took off. By then, the limit of computer storage was no longer holding back the development of machine learning algorithms.

In 2012, the AlexNet convolutional neural network (CNN) reached roughly 85% top-5 accuracy on the ImageNet classification challenge, which was a massive step in the right direction. By 2015, deeper CNNs and other deep neural networks had pushed the accuracy of image recognition tools past 95%.

Deep learning models like AlexNet, trained on large labeled datasets like ImageNet, unlocked the huge potential of the image recognition and computer vision industry. Today, many companies such as Google, Amazon, and Microsoft are focusing their R&D efforts on improving technologies capable of integrating image recognition.

Pro Tip: Before diving into how image recognition works, have a look at a comprehensive guide to neural network architectures.

How image recognition works: algorithms and technologies

Before diving into how image recognition works, let's look at the four primary tasks image recognition handles: classification, detection, tagging, and segmentation.

Classification

Artificial neural networks identify objects in the image and assign them one of the predefined groups or classifications.

Detection

The process of classifying and localizing an object is called object detection. Once the object's location is found, a bounding box with a corresponding confidence score is put around it. Depending on the complexity of the object, techniques like bounding box annotation, semantic segmentation, and key point annotation are used for detection.
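As a rough sketch of what a detector's output might look like, the snippet below models a detection as a label, a bounding box, and a score expressing the model's confidence. The `Detection` class and the sample values are hypothetical. The `iou` helper computes intersection over union, a common way to measure how well two boxes overlap; it is not mentioned above and is included only as an illustration.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str
    box: tuple         # (x_min, y_min, x_max, y_max) in pixels
    confidence: float  # between 0 and 1


def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    inter = max(0, ix_max - ix_min) * max(0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical detector output for one image.
detections = [
    Detection("car", (10, 20, 110, 80), 0.92),
    Detection("person", (120, 30, 150, 100), 0.35),
]

# Keep only the detections the model is reasonably sure about.
confident = [d for d in detections if d.confidence >= 0.5]
print([d.label for d in confident])  # ['car']
```

The 0.5 threshold is an arbitrary choice here; in practice it is tuned per application.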

Tagging

Tagging is similar to classification but aims for a more complete description of the image. It tries to identify multiple objects in an image, so an image can have one or more tags. For example, an image of a road can have tags like 'vehicles,' 'trees,' 'human,' etc.
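One common way to express the difference between classification and tagging in code is sketched below, assuming a model that outputs one raw score per label. The scores and the 0.5 threshold are hypothetical. Classification picks a single winner with a softmax, while tagging treats every label as an independent yes/no decision.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1 (single-label case)."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(score):
    """Independent probability for one tag (multi-label case)."""
    return 1.0 / (1.0 + math.exp(-score))


labels = ["vehicles", "trees", "human", "boat"]
raw_scores = [2.0, 1.5, 0.3, -3.0]  # hypothetical model outputs

# Classification: pick the single most likely label.
probs = softmax(raw_scores)
classification = labels[probs.index(max(probs))]

# Tagging: keep every label whose independent probability passes a threshold.
tags = [label for label, s in zip(labels, raw_scores) if sigmoid(s) >= 0.5]

print(classification)  # vehicles
print(tags)            # ['vehicles', 'trees', 'human']
```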

Segmentation

Segmentation is the task that attempts to locate objects in an image to the nearest pixel. Instead of aligning boxes around the objects, an algorithm identifies all pixels that belong to each class (semantic segmentation) or to each individual object (instance segmentation). Image segmentation is widely used in medical imaging, where precision is very important, to detect and label image pixels.
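A small sketch can show why a pixel-level mask is more precise than a box. The mask below is hypothetical: every pixel carries a class id, and the helper names are made up for this example.

```python
# A hypothetical 4x5 segmentation mask: every pixel carries a class id.
# 0 = background, 1 = tumor (labels are illustrative only).
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]


def pixels_of_class(mask, class_id):
    """Return the (row, col) coordinates of all pixels with the given class."""
    return [
        (r, c)
        for r, row in enumerate(mask)
        for c, value in enumerate(row)
        if value == class_id
    ]


def bounding_box(coords):
    """Tightest box (row_min, col_min, row_max, col_max) around the pixels."""
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return (min(rows), min(cols), max(rows), max(cols))


tumor_pixels = pixels_of_class(mask, 1)
print(len(tumor_pixels))           # 5
print(bounding_box(tumor_pixels))  # (1, 1, 2, 3)
```

The bounding box covers six pixels, while the mask marks only the five that actually belong to the object.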

Now, let's move on to see how image recognition works in practice:

1. Data collection

To achieve image recognition, computer vision models are fed pre-labeled data to teach them to recognize images they've never seen before.

Some of the large publicly available databases include Pascal VOC and ImageNet. They contain thousands to millions of labeled images describing the objects present in the pictures, everything from sports and pizzas to mountains and cats.

Pro Tip: Looking for the best image datasets to kickstart your computer vision project? Have a look at our selection of open source computer vision datasets.

Data collection, however, comes with challenges:

depiction of data collection challenges: viewpoint variation, deformation, inter-class variation, occlusion

Variation in the viewpoint of the image. The images can be aligned at different angles or vary in dimension, which can lead to inaccurate predictions by the machine learning model. The system may fail to understand the effect of changing the alignment and viewpoint of the image.

Pro Tip: Learn everything you need to know about data augmentation techniques for computer vision and start training your AI models on V7 today.

Deformation. Real objects can bend, stretch, or appear in unusual poses, yet training data often gives the model a biased perception that a particular object can only have a specific shape.

Occlusion. Some objects may obstruct the full view of an image and result in partial information being fed to the system. The neural network should acknowledge these variations as a part of the training process.

Pro Tip: To learn more, head straight to our introduction to image segmentation.

Intra-class variation. Some objects might vary in shape, size, and structure but can still belong to the same class. Having all the varied data points is crucial for better image processing.
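One common way to soften the viewpoint and occlusion challenges listed above is data augmentation: generating altered copies of the training images. The sketch below uses plain nested lists and hypothetical helper names, so treat it as an illustration rather than a production pipeline.

```python
def horizontal_flip(image):
    """Mirror the image left to right."""
    return [list(reversed(row)) for row in image]


def rotate_90_clockwise(image):
    """Rotate the image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def occlude(image, top, left, height, width, fill=0):
    """Blank out a rectangle to imitate an object blocking part of the view."""
    result = [list(row) for row in image]
    for r in range(top, min(top + height, len(result))):
        for c in range(left, min(left + width, len(result[0]))):
            result[r][c] = fill
    return result


image = [
    [1, 2, 3],
    [4, 5, 6],
]

print(horizontal_flip(image))      # [[3, 2, 1], [6, 5, 4]]
print(rotate_90_clockwise(image))  # [[4, 1], [5, 2], [6, 3]]
print(occlude(image, 0, 1, 1, 2))  # [[1, 0, 0], [4, 5, 6]]
```

Each function returns a new image and leaves the original untouched, so one labeled example can yield several training samples.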

2. Pre-processing of the image data

Once the dataset is ready, there are several things to be done to maximize its efficiency for model training.

Data annotation

The objects in the image that serve as the regions of interest have to be labeled (or annotated) to be detected by the computer vision system. In other words, labels have to be applied to those frames or images.

Annotations for segmentation tasks can be performed easily and precisely by making use of V7 annotation tools, specifically the polygon annotation tool and the auto-annotate tool. A label once assigned is remembered by the software in the subsequent frames.

Representation of image

Pixel representation of digital image [Source: Stanford]

A digital image has a matrix representation that records the intensity of its pixels. The information fed to image recognition models is the location and intensity of the pixels of the image. This information helps the model find patterns in the subsequent images supplied to it as part of the learning process.

3. Model architecture and training process

Due to their unique working principle, convolutional neural networks (CNNs) tend to yield the best results in deep learning image recognition.

Working of convolutional neural networks in image recognition

The complete pixel matrix is not fed to the CNN as one flat input, as it would be hard for the model to extract features and detect patterns from such high-dimensional data. Instead, small filters (or kernels) slide over the image, each looking at one small section at a time and producing a feature map that records where its pattern occurs.
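The sliding-filter idea can be sketched in a few lines of plain Python. The function name, image, and kernel below are made up for illustration. The kernel is hand-written to respond to vertical edges, whereas a CNN would learn its filter values during training.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map.

    Like most deep learning libraries, this computes cross-correlation:
    the kernel is not flipped.
    """
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    feature_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = 0
            for i in range(k_h):
                for j in range(k_w):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        feature_map.append(row)
    return feature_map


# Dark left half, bright right half: a vertical edge in the middle.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]

# A hand-written vertical-edge filter; a CNN would learn its filters from data.
kernel = [
    [-1, 1],
    [-1, 1],
]

print(convolve2d(image, kernel))  # [[0, 18, 0], [0, 18, 0]]
```

The feature map is zero over the flat regions and large exactly where the edge sits, which is the kind of simple structure early convolutional layers pick up.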

Each successive convolutional layer can recognize more complex, detailed features: visual representations of what the image depicts. Such a “hierarchy of increasing complexity and abstraction” is known as a feature hierarchy.

The corresponding smaller sections are normalized, and an activation function is applied to them. Rectified Linear Units (ReLU) are a common choice for image recognition tasks. Pooling layers then decrease the matrix size, which helps the machine learning model extract features more effectively. Depending on the labels/classes in the image classification problem, the output layer predicts which class the input image belongs to.
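Here is a minimal sketch of the three steps just described: activation, pooling, and the final class prediction. The feature map, the class names, and the output scores are hypothetical, and 2x2 max pooling with a softmax output is only one common configuration.

```python
import math


def relu(feature_map):
    """Replace negative activations with zero."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool_2x2(feature_map):
    """Keep the largest value in each non-overlapping 2x2 block."""
    pooled = []
    for r in range(0, len(feature_map) - 1, 2):
        row = []
        for c in range(0, len(feature_map[0]) - 1, 2):
            row.append(max(
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            ))
        pooled.append(row)
    return pooled


def softmax(scores):
    """Convert class scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


feature_map = [
    [-3, 5, 2, -1],
    [4, -2, 0, 7],
    [1, 1, -6, -6],
    [0, 2, -4, 3],
]

activated = relu(feature_map)
pooled = max_pool_2x2(activated)
print(pooled)  # [[5, 7], [2, 3]]

# Hypothetical output-layer scores for three classes.
class_names = ["cat", "dog", "car"]
probabilities = softmax([2.0, 0.5, -1.0])
predicted = class_names[probabilities.index(max(probabilities))]
print(predicted)  # cat
```

Pooling shrinks the 4x4 map to 2x2 while keeping the strongest responses, which is how the matrix size drops from layer to layer.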

Pro Tip: Read about the different types of activation functions in neural networks.

4. Traditional machine learning algorithms for image recognition

Before the development of the parallel processing and extensive computing capabilities required for training deep learning models, traditional machine learning models set the standard for image processing.

Let us quickly walk through some of the most widely used machine learning models:

Support Vector Machines

SVM-based approaches typically describe features by building histograms from images (for example, histograms of gradient orientations). They use a sliding detection window that moves across the image. The algorithm then takes the test picture and compares the histogram values learned during training with those of various parts of the picture to check for close matches.
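The sliding-window part of this approach can be sketched as follows. Note the simplification: no SVM is trained here. A plain intensity histogram and a simple distance measure stand in for the learned classifier, and all names and pixel values are hypothetical.

```python
def histogram(patch, bins=4, max_value=256):
    """Count how many pixels of `patch` fall into each intensity bin."""
    counts = [0] * bins
    bin_width = max_value // bins
    for row in patch:
        for value in row:
            counts[min(value // bin_width, bins - 1)] += 1
    return counts


def histogram_distance(h1, h2):
    """Sum of absolute differences; 0 means identical histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def sliding_window_search(image, reference_hist, win_h, win_w):
    """Return (distance, row, col) of the window that best matches the reference."""
    best = None
    for r in range(len(image) - win_h + 1):
        for c in range(len(image[0]) - win_w + 1):
            window = [row[c:c + win_w] for row in image[r:r + win_h]]
            dist = histogram_distance(histogram(window), reference_hist)
            if best is None or dist < best[0]:
                best = (dist, r, c)
    return best


# Reference histogram of a bright 2x2 object (all four pixels in the top bin).
reference_hist = histogram([[250, 240], [245, 255]])

image = [
    [10, 20, 10, 30],
    [15, 250, 240, 20],
    [10, 245, 255, 25],
    [30, 20, 10, 15],
]

print(sliding_window_search(image, reference_hist, 2, 2))  # (0, 1, 1)
```

In a real detector, the distance function would be replaced by the decision function of a trained SVM, and the histogram by a richer descriptor.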

Bag of Features

Bag of Features models, which build on local descriptors such as the Scale-Invariant Feature Transform (SIFT), match distinctive local features between a sample image and its reference image. The trained model then tries to match the features from the image set to various parts of the target image to see if matches are found.
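A rough sketch of the Bag of Features idea: each local descriptor is assigned to its nearest "visual word" in a vocabulary, and the image is summarized as a histogram of word counts. The vocabulary and the 2-D descriptors below are invented for illustration; real SIFT descriptors have 128 dimensions, and the vocabulary is usually obtained by clustering.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_word(descriptor, vocabulary):
    """Index of the visual word closest to the descriptor."""
    distances = [squared_distance(descriptor, word) for word in vocabulary]
    return distances.index(min(distances))


def bag_of_features(descriptors, vocabulary):
    """Histogram counting how often each visual word occurs in an image."""
    hist = [0] * len(vocabulary)
    for d in descriptors:
        hist[nearest_word(d, vocabulary)] += 1
    return hist


# Hypothetical 2-D descriptors; real SIFT descriptors have 128 dimensions.
vocabulary = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
image_descriptors = [(0.1, 0.0), (0.9, 1.1), (1.0, 0.9), (0.1, 0.8)]

print(bag_of_features(image_descriptors, vocabulary))  # [1, 2, 1]
```

Two images can then be compared, or fed to a classifier, through their histograms rather than their raw pixels.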

Some other machine learning models widely used in computer vision include:

Regression Algorithms
Instance-based Algorithms
Regularization Algorithms
Decision Tree Algorithms
Bayesian Algorithms
Clustering Algorithms

Pro Tip: Read the ultimate guide to machine learning to dive deeper.

5. Popular deep learning models for image recognition

Here's a quick look at some of the most popular deep learning models of recent years:

YOLO (You Only Look Once)

YOLO algorithm applied to an image with dense objects

This object detection algorithm divides the image into a grid of fixed size and, within each grid cell, annotates objects via bounding boxes with a confidence score. YOLO, as the name suggests, processes a frame only once and determines whether each grid cell contains an object or not.
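To illustrate the grid idea, the sketch below finds which grid cell an object's center falls into and filters predictions by confidence. The function names, the 448x448 image size, the 7x7 grid, and the sample boxes are assumptions chosen for the example; this is not a YOLO implementation.

```python
def responsible_cell(box, image_w, image_h, grid_size):
    """Return (row, col) of the grid cell containing the box center.

    `box` is (x_min, y_min, x_max, y_max) in pixels.
    """
    center_x = (box[0] + box[2]) / 2
    center_y = (box[1] + box[3]) / 2
    col = min(int(center_x / image_w * grid_size), grid_size - 1)
    row = min(int(center_y / image_h * grid_size), grid_size - 1)
    return row, col


def keep_confident(predictions, threshold=0.5):
    """Drop predicted boxes whose confidence score is below the threshold."""
    return [p for p in predictions if p["confidence"] >= threshold]


# A 448x448 image split into a 7x7 grid, so each cell covers 64x64 pixels.
print(responsible_cell((100, 200, 180, 300), 448, 448, 7))  # (3, 2)

# Hypothetical raw predictions from the network.
predictions = [
    {"label": "dog", "confidence": 0.88},
    {"label": "bicycle", "confidence": 0.12},
]
print(keep_confident(predictions))  # [{'label': 'dog', 'confidence': 0.88}]
```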

Single-shot detector (SSD)

Single-shot detectors lay a default set of bounding boxes with different aspect ratios over the image in the form of a grid. The feature maps obtained from the hidden layers of the neural network are combined across these different aspect ratios and scales to naturally handle objects of varying sizes.
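The default boxes can be generated with a few lines of code. This sketch assumes the usual convention of deriving width and height from a scale and an aspect ratio; the function name and the sample parameters are hypothetical.

```python
import math


def default_boxes(grid_size, scale, aspect_ratios):
    """Generate (center_x, center_y, width, height) boxes in relative [0, 1] units.

    One box per aspect ratio is centered on every cell of a
    grid_size x grid_size feature map. `scale` is the box size
    relative to the image.
    """
    boxes = []
    for row in range(grid_size):
        for col in range(grid_size):
            center_x = (col + 0.5) / grid_size
            center_y = (row + 0.5) / grid_size
            for ratio in aspect_ratios:
                width = scale * math.sqrt(ratio)
                height = scale / math.sqrt(ratio)
                boxes.append((center_x, center_y, width, height))
    return boxes


boxes = default_boxes(grid_size=2, scale=0.4, aspect_ratios=[1.0, 4.0])
print(len(boxes))  # 8
print(boxes[1])    # (0.25, 0.25, 0.8, 0.2)
```

A coarse grid with a large scale covers big objects, and a fine grid with a small scale covers small ones, which is how different feature maps end up handling different object sizes.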

These types of object detection algorithms are flexible and accurate and are mostly used in face recognition scenarios where the training set contains few instances of an image.

Other machine learning algorithms include Faster R-CNN (Faster Region-Based CNN), a region-based feature extraction model and one of the best-performing models in the CNN family.

A comparison of traditional machine learning and deep learning techniques in image recognition is summarized here.

Applications of image recognition in the world today

OK, now that we know how it works, let’s see some practical applications of image recognition technology across industries.

Damage assessment

Image recognition can be used to automate the process of damage assessment by analyzing the image and looking for defects, notably reducing the time needed to evaluate the cost of a damaged object.

It is used in car damage assessment by vehicle insurance companies, in product damage inspection software by e-commerce companies, and in machinery breakdown prediction based on asset images.

A research paper on deep learning-based image recognition highlights how it is being used for the detection of crack and leakage defects in metro shield tunnels.

Read More: See how one of V7’s clients, Abyss, uses V7 to advance critical infrastructure inspections.

Packaging inspection

Many companies find it challenging to ensure that product packaging (and the products themselves) leave production lines unaffected. Manual quality control tends to be costly and inefficient.

To solve this issue, Pharmacy Packaging Systems or other e-commerce platforms have developed a solution as part of the supply chain pipeline that uses cutting-edge AI technologies based on computer vision to check for broken products or quality issues.

Read More: See what V7 does to advance the development of AI in manufacturing.

Quality assurance

Image recognition applications lend themselves perfectly to the detection of deviations or anomalies on a large scale. Machines can be trained to detect blemishes in paintwork or food that has rotten spots preventing it from meeting the expected quality standard.

Automated barcode scanning using optical character recognition (OCR)

Machine vision-based technologies can read barcodes, which are unique identifiers of each item.

We have seen shopping complexes, movie theatres, and automotive industries commonly using barcode scanner-based machines to smooth the experience and automate processes.

Optical character recognition (OCR) identifies printed characters or handwritten text in images and later converts them and stores them in a text file. OCR is commonly used to scan cheques, read number plates, or transcribe handwritten text, to name a few.

Read More: Explore how AI has transformed the manufacturing industry here.

Medical image analysis in healthcare

CT and MRI scan analysis using image recognition

Image recognition has multiple applications in healthcare, including detecting bone fractures, brain strokes, tumors, or lung cancers by helping doctors examine medical images. Lung nodules, for example, vary in size and shape and can be difficult to discover with the unassisted human eye.

With social media being dominated by visual content, it isn’t that hard to imagine that image recognition technology has multiple applications in this area.

Here are three examples of how image recognition gets used in social media:

Image search

It's easier to search with an image than with words. This is why many e-commerce sites and applications are offering customers the ability to search using images.

Visual search uses features learned by a deep neural network to develop efficient and scalable methods for image retrieval. The goal of visual search is to perform content-based retrieval of images for online image recognition applications.
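A simplified sketch of content-based retrieval: every catalog image is represented by a feature vector, and the search returns the images whose vectors point in the most similar direction to the query's. The file names and the 3-D vectors are hypothetical, and cosine similarity is just one common choice of similarity measure.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query, catalog, top_k=2):
    """Return the names of the top_k catalog images most similar to the query."""
    ranked = sorted(
        catalog.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_k]]


# Hypothetical 3-D feature vectors; a real network would output far more values.
catalog = {
    "red_sneaker.jpg": [0.9, 0.1, 0.0],
    "blue_sneaker.jpg": [0.7, 0.6, 0.1],
    "leather_boot.jpg": [0.1, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]

print(search(query, catalog))  # ['red_sneaker.jpg', 'blue_sneaker.jpg']
```

At scale, the exhaustive comparison used here would typically be replaced by an approximate nearest-neighbor index.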

Social media networks have seen a significant rise in the number of users, and are one of the major sources of image data generation. These images can be used to understand their target audience and their preferences.

For example, marketers use logo recognition to determine how much exposure a brand receives from an influencer marketing campaign, which helps increase the efficiency of advertising campaigns.

Finding inappropriate content

Inappropriate content on marketing and social media could be detected and removed using image recognition technology.

One way of doing this is through logo recognition, in which the legitimate brand can find fake logos on counterfeit products and remove any inappropriate or explicit content falsely associated with that brand.

Self-driving cars

IR in Autonomous driving

Image recognition technology is used in self-driving cars. By analyzing real-time video feeds, such autonomous vehicles can navigate through traffic by analyzing the activities on the road and traffic signals. On this basis, they take necessary actions without jeopardizing the safety of passengers and pedestrians.

The technology is also used by traffic police officers to detect people disobeying traffic laws, such as using mobile phones while driving, not wearing seat belts, or exceeding the speed limit.

Applications in surveillance and security

Surveillance is largely a visual activity—and as such it’s also an area where image recognition solutions may come in handy.

Facial recognition

Facial recognition is used extensively from smartphones to corporate security for the identification of unauthorized individuals accessing personal information.

For example, Google Cloud Vision offers a variety of image detection services, which include optical character and facial recognition, explicit content detection, etc., and charges fees per photo. Microsoft Cognitive Services offers visual image recognition APIs, which include face or emotion detection, and charges a specific amount for every 1,000 transactions.

Video Surveillance

Drones equipped with high-resolution cameras can patrol a particular territory and use image recognition techniques for object detection. In fact, it’s a popular solution for military and national border security purposes.

Apart from the security aspect of surveillance, there are many other uses for image recognition. For example, pedestrians or other vulnerable road users on industrial premises can be localized to prevent incidents with heavy equipment.

Key Takeaways

Computer vision, the field concerning machines being able to understand images and videos, is one of the hottest topics in the tech industry. Robotics, self-driving cars, facial recognition, and medical image analysis all rely on computer vision to work. At the heart of computer vision is image recognition, which allows machines to understand what an image represents and classify it into a category.

The leading architecture used for image recognition and detection tasks is that of convolutional neural networks (CNNs). Convolutional neural networks consist of several layers, each of them perceiving small parts of an image. The neural network learns about the visual characteristics of each image class and eventually learns how to recognize them.

The combination of modern machine learning and computer vision has now made it possible to recognize many everyday objects, human faces, handwritten text in images, etc. We’ll continue noticing how more and more industries and organizations implement image recognition and other computer vision tasks to optimize operations and offer more value to their customers.

Pragati Baheti

Pragati is a software developer at Microsoft, and a deep learning enthusiast. She writes about the fundamental mathematics behind deep neural networks.

Next steps

Label videos with V7.

Try our free tier or talk to one of our experts.
