What's the difference between object detection and object recognition?
What are bounding boxes?
Which computer vision technique should I use?
How should I build an accurate object detection model?
If you've found yourself asking these and similar questions—don't worry! You are in the right place.
In this article, we'll dive deeper into the topic of object detection and help you understand the following:
Ready? Let's get started.
Object detection is the field of computer vision that deals with the localization and classification of objects contained in an image or video.
To put it simply: object detection comes down to drawing bounding boxes around detected objects, which lets us locate them in a given scene (or track how they move through it).
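A bounding box is usually stored as just four numbers. The sketch below assumes pixel coordinates with the origin at the top-left corner of the image. The `Box` class and its method names are illustrative and not taken from any particular library. It shows the two encodings you are most likely to meet: corner format and center format.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in corner format (pixel coordinates)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def width(self) -> float:
        return self.x_max - self.x_min

    @property
    def height(self) -> float:
        return self.y_max - self.y_min

    @property
    def area(self) -> float:
        return self.width * self.height

    def to_center_format(self) -> Tuple[float, float, float, float]:
        """Return the same box as (center_x, center_y, width, height)."""
        return (
            (self.x_min + self.x_max) / 2,
            (self.y_min + self.y_max) / 2,
            self.width,
            self.height,
        )

    @classmethod
    def from_center_format(cls, cx: float, cy: float, w: float, h: float) -> "Box":
        """Build a corner-format box from (center_x, center_y, width, height)."""
        return cls(cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


# Hypothetical box around a car in a 640x480 image.
car = Box(x_min=40, y_min=60, x_max=240, y_max=160)
print(car.to_center_format())  # (140.0, 110.0, 200, 100)
```

Different tools expect different encodings, so it is worth checking which one a dataset or model uses before mixing annotations.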
Before we move on, let’s clarify the distinction between image classification and object detection.
Image classification sends a whole image through a classifier (such as a deep neural network) for it to spit out a tag. Classifiers take into consideration the whole image but don’t tell you where the tag appears in the image.
Object detection is slightly more advanced, as it creates a bounding box around the classified object.
Classification has its advantages: it’s a better option for tags that don’t really have physical boundaries, such as “blurry” or “sunny”. However, object detection systems will almost always outperform classification networks in spotting objects that do have a material presence, such as a car.
Image segmentation is the process of determining which pixels in an image belong to a given object class. Semantic image segmentation will mark all pixels belonging to that class, but won’t define the boundaries of each individual object.
Object detection instead will not segment the object, but will clearly define the location of each individual object instance with a box.
Combining semantic segmentation with object detection leads to instance segmentation, which first detects the object instances, and then segments each within the detected boxes (known in this case as regions of interest).
Object detection is very good at:
However, it is outclassed by other methods in other scenarios.
You have to always ask yourself: Do these scenarios apply to my problem?
Either way, here's a cheat sheet you can use when choosing the right computer vision techniques for your needs:
Objects that are elongated - Use Instance Segmentation.
Long, thin items such as a pencil can occupy less than 10% of a box’s area when detected. This biases the model towards background pixels rather than the object itself.
Picture: A diagonal pencil labeled on V7 using box and polygon
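The area argument for elongated objects can be checked with a little geometry. The sketch below treats the object as a thin rectangle rotated inside its axis-aligned bounding box and computes what fraction of the box it covers. The function name and the pencil dimensions (190 mm by 7 mm) are assumptions chosen for illustration.

```python
import math


def box_fill_fraction(length: float, width: float, angle_deg: float) -> float:
    """Fraction of the axis-aligned bounding box covered by a thin rectangle
    of the given length and width, rotated by angle_deg."""
    theta = math.radians(angle_deg)
    c, s = abs(math.cos(theta)), abs(math.sin(theta))
    # Extent of the rotated rectangle along each image axis.
    box_w = length * c + width * s
    box_h = length * s + width * c
    return (length * width) / (box_w * box_h)


# Hypothetical pencil: 190 mm long, 7 mm wide, at three different angles.
for angle in (0, 15, 45):
    print(f"{angle:>2} deg: {box_fill_fraction(190, 7, angle):.1%}")
```

With these numbers the pencil fills the whole box when it lies flat, but only about 7% of it at 45 degrees. Everything else inside the box is background, which is why a polygon or mask is the better label for this kind of object.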
Objects that have no physical presence - Use classification
Things in an image such as the tag “sunny”, “bright”, or “skewed” are best identified by image classification techniques - letting a network take the image and figure out which features correlate to these tags.
Objects that have no clear boundaries at different angles - Use semantic segmentation
The sky, ground, or vegetation in aerial images don’t really have a defined set of boundaries. Semantic segmentation is more efficient at “painting” pixels that belong to these classes. Object detection will still pick up the “sky” as an object, but it will struggle far more with such objects.
Objects that are often occluded - Use Instance Segmentation if possible
Occlusion is handled far better in two-stage detection networks than in single-stage (one-shot) approaches. Within this branch of detectors, instance segmentation models will do a better job at understanding and segmenting occluded objects than mere bounding-box detectors.
Before deep learning took off in 2013, almost all object detection was done through classical machine learning techniques. Common ones included the Viola-Jones object detection framework, the scale-invariant feature transform (SIFT), and histograms of oriented gradients (HOG).
These would detect a number of common features across the image, and classify their clusters using logistic regression, color histograms, or random forests. Today’s deep learning-based techniques vastly outperform these.
Deep learning-based approaches use neural network architectures such as RetinaNet, YOLO (You Only Look Once), CenterNet, SSD (Single Shot MultiBox Detector), and region-proposal methods (R-CNN, Fast R-CNN, Faster R-CNN, Cascade R-CNN) to extract features of the object and then assign it a label.
Object detectors are generally divided into two categories: single-stage detectors and two-stage detectors.
Many state-of-the-art object detection architectures are two-stage, and many of them have been pre-trained on the COCO dataset. COCO is an image dataset with 80 object categories (cars, persons, sports balls, bicycles, dogs, cats, horses, etc.).
The dataset was gathered to solve common object detection problems. Nowadays it is becoming outdated, as its images were captured mostly in the early 2000s, which makes them much smaller and grainier than today’s images and means they feature different objects. Newer datasets like Open Images are taking its spot as the de facto pre-training dataset.
A single-stage detector removes the RoI extraction process and directly classifies and regresses the candidate anchor boxes. Examples include the YOLO family (YOLOv2, YOLOv3, YOLOv4, and YOLOv5), CornerNet, CenterNet, and others. Let’s take a look at how YOLO works.
YOLO (You Only Look Once) is an object detection architecture that uses a single neural network, trained end to end, to take an image as input and directly predict bounding boxes and a class label for each box. YOLO is a typical single-stage detector.
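As a rough illustration of what "directly predicting boxes and class labels" means, the sketch below decodes the output of a hypothetical single-stage head. It assumes a simplified YOLOv1-style layout: the image is split into an S x S grid, and each cell emits one box as `(x, y, w, h, confidence)` followed by class probabilities, with `x, y` relative to the cell and `w, h` relative to the whole image. Real YOLO versions differ in the details (several boxes per cell, anchor boxes, different activations), and every name here is made up for the example.

```python
from typing import List, Tuple

Detection = Tuple[str, float, Tuple[float, float, float, float]]


def decode_grid(
    grid: List[List[List[float]]],
    class_names: List[str],
    image_size: Tuple[int, int],
    score_threshold: float = 0.5,
) -> List[Detection]:
    """Turn an S x S grid of raw cell predictions into labelled boxes.

    Each cell holds [x, y, w, h, confidence, p_class0, p_class1, ...].
    Returns (label, score, (x_min, y_min, x_max, y_max)) in pixels.
    """
    img_w, img_h = image_size
    s = len(grid)
    detections = []
    for row in range(s):
        for col in range(s):
            x, y, w, h, conf, *class_probs = grid[row][col]
            best = max(range(len(class_probs)), key=class_probs.__getitem__)
            # Class-specific score: box confidence times class probability.
            score = conf * class_probs[best]
            if score < score_threshold:
                continue
            # Cell-relative center -> absolute pixel coordinates.
            cx = (col + x) / s * img_w
            cy = (row + y) / s * img_h
            bw, bh = w * img_w, h * img_h
            box = (cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2)
            detections.append((class_names[best], score, box))
    return detections


# Toy 2x2 grid: only the bottom-right cell is confident about an object.
empty = [0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.5]
grid = [
    [empty, empty],
    [empty, [0.5, 0.5, 0.4, 0.2, 0.9, 0.1, 0.9]],
]
print(decode_grid(grid, ["person", "car"], image_size=(640, 480)))
```

The whole grid comes out of one forward pass, which is where the speed of single-stage detectors comes from. A real pipeline would typically follow this step with non-maximum suppression to merge overlapping boxes.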
Two-stage detectors divide the object detection task into two stages: extract RoIs (regions of interest), then classify and regress the RoIs. Examples of two-stage architectures include R-CNN, Fast R-CNN, Faster R-CNN, Mask R-CNN, and others. Let’s take a look at Mask R-CNN, for instance.
Mask R-CNN is a typical instance segmentation technique built on object detection. It extends Faster R-CNN by adding a branch for predicting segmentation masks on each RoI, in parallel with the existing branch for classification and bounding box regression. The mask branch is a small fully convolutional network (FCN) applied to each RoI, predicting a segmentation mask in a pixel-to-pixel manner. Below is an architectural demonstration of Mask R-CNN.
Faster R-CNN, in turn, is an object detection model that improves on Fast R-CNN by adding a region proposal network (RPN) that runs on the feature maps already produced by the convolutional layers. The RPN proposes candidate regions, which are then passed through RoI pooling and classified.
Below is an architectural diagram of Faster R-CNN.
Fast R-CNN, for its part, is an improved version of R-CNN that computes CNN features for the whole image in a single forward pass and then extracts the features for each region of interest (RoI) from that shared feature map. R-CNN (Regions with CNN features) is slow because it performs a ConvNet forward pass for each object proposal, without sharing computation. Fast R-CNN was developed to solve this problem.
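The computation sharing becomes concrete with RoI pooling, the operation Fast R-CNN uses to cut a fixed-size feature grid for each region out of one shared feature map. The following is a minimal pure-Python sketch on a single-channel feature map. The function name, the integer bin arithmetic, and the toy data are assumptions for illustration; production implementations work on multi-channel tensors and handle coordinate scaling.

```python
from typing import List, Tuple

FeatureMap = List[List[float]]


def roi_max_pool(
    features: FeatureMap,
    roi: Tuple[int, int, int, int],
    output_size: Tuple[int, int] = (2, 2),
) -> FeatureMap:
    """Max-pool the region roi = (x0, y0, x1, y1), end-exclusive and given in
    feature-map cells, down to a fixed output_size = (rows, cols)."""
    x0, y0, x1, y1 = roi
    out_rows, out_cols = output_size
    roi_h, roi_w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_rows):
        # Vertical bin edges; every bin covers at least one cell.
        top = y0 + (i * roi_h) // out_rows
        bottom = max(top + 1, y0 + ((i + 1) * roi_h) // out_rows)
        row = []
        for j in range(out_cols):
            left = x0 + (j * roi_w) // out_cols
            right = max(left + 1, x0 + ((j + 1) * roi_w) // out_cols)
            row.append(
                max(
                    features[r][c]
                    for r in range(top, bottom)
                    for c in range(left, right)
                )
            )
        pooled.append(row)
    return pooled


# One shared 4x6 feature map, computed once for the whole image.
feature_map = [
    [1, 2, 3, 4, 5, 6],
    [7, 8, 9, 10, 11, 12],
    [13, 14, 15, 16, 17, 18],
    [19, 20, 21, 22, 23, 24],
]
# Two proposals of different sizes reuse the same features.
print(roi_max_pool(feature_map, (0, 0, 4, 4)))  # [[8, 10], [20, 22]]
print(roi_max_pool(feature_map, (2, 1, 6, 3)))  # [[10, 12], [16, 18]]
```

Both proposals end up as 2x2 grids even though the regions differ in size, so the same classifier head can process them, and the expensive convolutional pass has run only once.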
Here's a quick breakdown of different family models used in object detection.
The R-CNN model family includes the following:
The YOLO model family includes the following:
The CenterNet model family includes the following:
And—don't forget that you can build your own object detection model using V7 in less than an hour 😉
Finally, let's have a look at some of the most common object detection use cases.
Most face recognition systems are powered by object detection. It can be used to detect faces, classify emotions or expressions, and feed the resulting box to an image-retrieval system to identify a specific person out of a group. Face detection is one of the most popular object detection use cases, and you are probably already using it whenever you unlock your phone with your face. Person detection is also commonly used to count the number of people in retail stores or to monitor social distancing.
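Counting people from detector output mostly comes down to filtering detections by label and confidence. Here is a minimal sketch, assuming the detector returns a list of dictionaries with `label`, `score`, and `box` keys. That output format, the threshold, and the sample values are hypothetical and will differ between libraries.

```python
from typing import Dict, List


def count_people(detections: List[Dict], min_score: float = 0.6) -> int:
    """Count detections labelled 'person' at or above min_score."""
    return sum(
        1
        for det in detections
        if det["label"] == "person" and det["score"] >= min_score
    )


# Made-up detections for a single video frame.
frame_detections = [
    {"label": "person", "score": 0.92, "box": (34, 50, 120, 310)},
    {"label": "person", "score": 0.41, "box": (300, 80, 360, 250)},  # too uncertain
    {"label": "shopping cart", "score": 0.88, "box": (140, 200, 260, 330)},
    {"label": "person", "score": 0.77, "box": (400, 60, 480, 320)},
]
print(count_people(frame_detections))  # 2
```

The threshold trades missed people against false counts, so it is usually tuned on footage from the actual camera.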
Object detection is used in intelligent video analytics (IVA) anywhere CCTV cameras are present in retail venues to understand how shoppers are interacting with products. These video streams pass through an anonymization pipeline to blur out people's faces and de-identify individuals. Some IVA use cases preserve privacy by only looking at people's shoes, by placing cameras below knee level and ensuring the system captures the presence of a person, without having to directly look at their identifiable features. IVA is often used in factories, airports and transport hubs to track queue lengths and access to restricted areas.
Self-driving cars use object detection to spot pedestrians, other cars, and obstacles on the road in order to move around safely. Autonomous vehicles equipped with LIDAR will sometimes use 3D object detection, which applies cuboids around objects.
Surgical video is very noisy data that is taken from endoscopes during crucial operations. Object detection can be used to spot hard-to-see items such as polyps or lesions that require a surgeon’s immediate attention. It’s also being used to inform hospital staff of the status of the operation.
Manufacturing companies can use object detection to spot defects in the production line. Neural networks can be trained to detect minute defects, from folds in fabric to dents or flashes in injection molded plastics. Unlike traditional machine learning approaches, deep learning-based object detection can also spot defects in heavily varying objects, such as food.
Pedestrian detection is one of the most essential computer vision tasks, with applications in robotics, video surveillance, and automotive safety. It plays a key role in object detection research, as it provides the fundamental information for the semantic understanding of video footage.
Despite its relatively high performance, this technology still faces challenges, such as varied clothing styles or occluding accessories, that decrease the accuracy of existing detectors.
Drones sport incredible cameras nowadays and can leverage models hosted in the cloud to assess any object they encounter.
For example, they can be used to inspect hard-to-reach areas in bridges for cracks and other structural damage or to inspect power lines, replacing dangerous routine helicopter operations.
Let's recap everything we've learned today: