Let's start with a simple question: What do you see on this website?
Noticing things like a table of contents on your left, a datasets download button on the right, and a hero image above probably took you mere seconds, right?
Things aren't that fast and easy when it comes to machines. Enabling computers to see the world in the same way humans do is still a complex challenge that data scientists work hard on resolving.
Luckily, there's also good news—
The computer vision field has been rapidly developing, finding real-world applications, and even surpassing humans in solving some of the visual tasks. All of this thanks to the recent advances in artificial intelligence and deep learning.
So, in case you are still confused about this whole computer vision thing, in this article we will walk you through the most important concepts including:
Let’s get started.
Computer Vision is a subfield of Deep Learning and Artificial Intelligence where humans teach computers to see and interpret the world around them.
While humans and animals naturally solve vision as a problem from a very young age, helping machines interpret and perceive their surroundings via vision remains a largely unsolved problem.
Limited perception of the human vision along with the infinitely varying scenery of our dynamic world is what makes Machine Vision complex at its core.
Like all great things in the world of technology, computer vision started with a cat.
Two Swedish scientists, Hubel and Wiesel, placed a cat in a restricting harness and an electrode in its visual cortex.The scientists showed the cat a series of images through a projector, hoping its brain cells in the visual cortex would start firing.
With no avail with images, the eureka moment happened when a projector slide was removed, and a single horizontal line of light appeared on the wall—
Neurons fired, emitting a crackling electrical noise.
The scientists had just realized that the early layers of the visual cortex respond to simple shapes, like lines and curves, much like those in the early layers of a deep neural network.
They then used an oscilloscope to create these and observe the brain’s reaction.
This experiment marks the beginning of our understanding of the interconnection between computer vision and the human brain, which will be helpful for our understanding of artificial neural networks.
Before cat brains entered the scene, analog computer vision began as early as the 1950s at universities pioneering artificial intelligence.
The notion that machine vision must be derived from the animal vision was predominant as early as 1959—when the neurophysiologists mentioned above tried to understand cat vision.
Since then, the history of computer vision is dotted with milestones formed by the rapid development of image capturing and scanning instruments complemented by state-of-the-art image processing algorithms’ design.
The 1960s saw the emergence of AI as an academic field of study, followed by the development of the first robust Optical Character Recognition system in 1974.
By the 2000s, the focus of Computer Vision has been shifted to much more complex topics, including:
All of them have achieved commendable accuracies over the years.
The year 2010 saw the birth of the ImageNet dataset with millions of labeled images freely available for research. This led to the formation of the AlexNet architecture two years later— making it one of the biggest breakthroughs in Computer Vision, cited over 82K times.
Digital Image Processing, or Image Processing, in short, is a subset of Computer Vision. It deals with enhancing and understanding images through various algorithms.
More than just a subset, Image Processing forms the precursor of modern-day computer vision, overseeing the development of numerous rule-based and optimization-based algorithms that have led machine vision to what it is today.
Image Processing may be defined as the task of performing a set of operations on an image based on data collected by algorithms to analyze and manipulate the contents of an image or the image data.
Now that you know the theory behind computer vision let’s talk about its practical side.
Here’s a simple visual representation of how computer vision works:
While the three steps outlining the basics of computer vision seem easy, processing and understanding an image via machine vision are quite difficult. Here’s why—
An image consists of several pixels, with a pixel being the smallest quanta in which the image can be divided into.
Computers process images in the form of an array of pixels, where each pixel has a set of values, representing the presence and intensity of the three primary colors: red, green, and blue.
All pixels come together to form a digital image.
The digital image, thus, becomes a matrix, and Computer Vision becomes a study of matrices. While the simplest computer vision algorithms use linear algebra to manipulate these matrices, complex applications involve operations like convolutions with learnable kernels and downsampling via pooling.
Below is an example of how a computer “sees” a small image.
The values represent the pixel values at the particular coordinates in the image, with 255 representing a complete white point and 0 representing a complete dark point.
For larger images, matrices are much larger.
While it is easy for us to get an idea of the image by looking at it, a peek at the pixel values shows that the pixel matrix gives us no information on the image!
Therefore, the computer has to perform complex calculations on these matrices and formulate relationships with neighboring pixel elements to even say that this image represents a person’s face.
Developing algorithms for recognizing complex patterns in images might make you realize how complex our brains are to excel at pattern recognition so naturally.
Some operations commonly used in computer vision based on a Deep Learning perspective are:
The evolution of machine vision saw the large-scale formalization of difficult problems into popular solvable problem statements.
Division of topics into well-formed groups with proper nomenclature helped researchers around the globe identify problems and work on them efficiently.
The most popular computer vision tasks that we regularly find in AI jargon include:
Image classification is one of the most studied topics ever since the ImageNet dataset was released in 2010.
Being the most popular computer vision task taken up by both beginners and experts, image classification as a problem statement is quite simple.
Given a group of images, the task is to classify them into a set of predefined classes using solely a set of sample images that have already been classified.
As opposed to complex topics like object detection and image segmentation, which have to localize (or give positions for) the features they detect, image classification deals with processing the entire image as a whole and assigning a specific label to it.
Image segmentation is the division of an image into subparts or sub-objects to demonstrate that the machine can discern an object from the background and/or another object in the same image.
A “segment” of an image represents a particular class of object that the neural network has identified in an image, represented by a pixel mask that can be used to extract it.
This popular domain of Computer Vision has been studied widely both with the use of traditional image processing algorithms like watershed algorithms, clustering-based segmentation and with the use of popular modern-day deep learning architectures like PSPNet, FPN, UNet, SegNet, etc.
Object detection looks for class-specific details in an image or a video and detects them when they appear. These classes can be cars, animals, humans, or anything on which the detection model has been trained. Previously methods of object detection used Haar features, SIFT, and HOG features to detect features in an image and classify them based on classical machine learning approaches.
This process, other than being time-consuming and largely inaccurate, has severe limitations on the number of objects that can be detected. As such, Deep Learning models like YOLO, RCNN, SSD that use millions of parameters to break through these limitations are popularly employed for this task.
Often object detection is accompanied by Object Recognition, also known as Object Classification.
Facial Recognition is a subpart of object detection where the primary object being detected is the human face.
While similar to object detection as a task where features are detected and localized, facial recognition performs not only detection but also recognition of the detected face. Facial recognition systems search for common features and landmarks in faces like nose, eyes, and mouth and classify with the help of these features and the positioning of these landmarks.
Traditional Image Processing based methods for facial recognition include Haar Cascades easily accessible via the OpenCV library while more robust methods including the use of Deep Learning based algorithms are found in works like FaceNet.
Edge detection is the task of detecting boundaries in objects.
It is algorithmically performed with the help of mathematical methods that help detect sharp changes or discontinuities in the brightness of the image. Often used as a pre-processing step for many tasks, edge detection is primarily done by traditional image processing-based algorithms like Canny Edge detection and by convolutions with specially designed edge detection filters.
Furthermore, edges in an image give us paramount information about the image contents, resulting in all deep learning methods performing edge detection internally for the capture of global low-level features with the help of learnable kernels.
Image Restoration refers to the restoration or the reconstruction of faded and old image hard copies that have been captured and stored in an improper manner, leading to loss of quality of the image.
Typical image restoration processes involve the reduction of additive noise via mathematical tools, while at times, reconstruction requires major changes, leading to further analysis and the use of image inpainting.
In Image inpainting, damaged parts of an image are filled with the help of generative models that make an estimate of what the image is trying to convey. Often the restoration process is followed by a colorization process that colors the subject of the picture (if black and white) in the most realistic manner possible.
Features in computer vision are regions of an image that tell us the most about a particular object in the image.
While edges are strong indicators of object detail and therefore important features, much more localized and sharp details like corners also serve as features. Feature matching helps us to relate the features of one image with those of another image of a similar region.
The applications of feature matching are found in important computer vision tasks like object identification and camera calibration. The task of feature matching is generally performed in the following steps:
One of the most complex problems of computer vision, scene reconstruction is the digital 3D reconstruction of an object from a photograph.
Most algorithms in scene reconstruction roughly work by forming a point cloud at the surface of the object and reconstructing a mesh from this point cloud.
Video motion analysis is a task in machine vision that refers to the study of moving objects or animals and the trajectory of their bodies.
Motion analysis as a whole is a combination of many subtasks, particularly object detection, tracking, and segmentation, and pose estimation.
While human motion analysis is used in areas like sports, medicine, intelligent video analytics, and physical therapy, motion analysis is also used in other areas like manufacturing and to count and track microorganisms like bacteria and viruses.
One of the biggest challenges in machine vision is our lack of understanding of how the human brain and the human visual system works.
We have an enhanced and complex sense of vision that we can figure out at a very young age but are unable to explain the process by which we can understand what we see.
Furthermore, day-to-day tasks like walking across the street at the zebra crossing, pointing at something in the sky, checking out the time on the clock require us to know enough and to have a sense of judgement about the objects around us to understand our surroundings.
Such aspects are quite different from simple vision but are largely inseparable from it. The simulation of human vision via algorithms and mathematical representation thus requires not only the identification of an object in an image but an understanding of its presence and its behaviour.
Finally, let's discuss some of the most common computer vision use cases.
Probably one of the most popular applications of computer vision right now is the self-driving car. With companies like Tesla coming up with innovative models of autonomous vehicles, self-driving cars seem to be the biggest form of AI in day-to-day lives.
Augmented reality (AR) is a method of providing an experience of the natural surroundings with a computer-generated augmentation appropriate to the surroundings. With the help of computer vision, AR can be virtually limitless, with augmentations providing translations of written text and applying filters to objects in the world we see, directly when we see them.
Medical Imaging is an important and relevant subdiscipline of computer vision where images of X-rays and 3D scans like MRIs are classified into diseases like pneumonia and cancer.
Early diagnosis of diseases made possible with computer vision can save thousands of lives.
Computer Vision has been used to develop state of the art algorithms for the monitoring of security cameras via methods like pose estimation, face and person detection and object tracking.
Object detection is used in intelligent video analytics (IVA) anywhere CCTV cameras are present—in retail venues to understand how shoppers are interacting with products, in factories, airports and transport hubs to track queue lengths and access to restricted areas.
Computer vision is an integral part of manufacturing industries that are striving to automate their processes.
With the development of computer vision systems like defect detection and safety inspections, the quality of the manufactured goods increases.
Furthermore 3D vision systems enable efficient inspections to be carried out in a production line that would not be possible by humans.
One of the oldest applications of computer vision is optical character recognition.
With simple optical character recognition algorithms being experimented on as early as 1974, today, OCR is at a much-advanced state with Deep Learning systems being developed that can detect and translate text in natural environments and random places without human supervision.
Applying computer vision technology, low compute efficient OCR systems have been developed that can function even in smartphones and mobile devices.
Computer vision in retail can potentially transform customer experience by huge standards. With AI stores like “amazon-go” springing up throughout the US, retail seems to be possibly the most revolutionizing stop for computer vision.
Let's do a quick recap of everything we've learned in this computer vision guide: