The Ultimate Guide to Object Detection

Computer vision is currently one of the hottest fields of artificial intelligence—and object detection played a key role in its rapid development. This guide will help you understand basic object detection concepts.
June 10, 2021
Object detection with V7

What's the difference between object detection and object recognition?

What are bounding boxes?

Which computer vision technique should I use?

How should I build an accurate object detection model?

If you've found yourself asking these and similar questions—don't worry! You are in the right place.

Here’s what we’ll cover:

  1. What is object detection?
  2. Types and modes of object detection
  3. How does object detection work?
  4. Object detection model architecture
  5. 5 object detection applications

Ready? Let's get started.


What is object detection?

Object detection is the field of computer vision that deals with the localization and classification of objects contained in an image or video.


To put it simply: object detection comes down to drawing bounding boxes around detected objects, which allows us to locate them in a given scene (or track how they move through it).

Here's how you can perform object detection with V7.

Object detection vs. image classification

Before we move on, let’s clarify the distinction between image classification and object detection.

Image classification sends a whole image through a classifier (such as a deep neural network) for it to spit out a tag. Classifiers take into consideration the whole image but don’t tell you where the tag appears in the image.

Object detection is slightly more advanced, as it creates a bounding box around the classified object.

Image classification vs Object detection

Classification has its advantages—it’s a better option for tags that don’t really have physical boundaries, such as “blurry” or “sunny”. However, object detection systems will almost always outperform classification networks in spotting objects that do have a material presence, such as a car.

Object detection vs image segmentation

Image segmentation is the process of defining which pixels of an object class are found in an image.

Semantic image segmentation will mark all pixels belonging to a given class, but won’t define the boundaries of each individual object.

Object detection instead will not segment the object, but will clearly define the location of each individual object instance with a box.

Combining semantic segmentation with object detection leads to instance segmentation, which first detects the object instances, and then segments each within the detected boxes (known in this case as regions of interest).

Object detection vs. semantic segmentation vs. instance segmentation

Pros and cons of object detection

Object detection is very good at:

  • Detecting objects that take up between 2% and 60% of an image’s area.
  • Detecting objects with clear boundaries.
  • Detecting clusters of objects as 1 item.
  • Localizing objects at high speed (>15 fps).

However, it is outclassed by other methods in other scenarios.

You have to always ask yourself: Do these scenarios apply to my problem?

Either way, here's a cheat sheet you can use when choosing the right computer vision techniques for your needs.

Objects that are elongated—Use Instance Segmentation.

Long and thin items such as a pencil will occupy less than 10% of a box’s area when detected at an angle. This biases the model toward background pixels rather than the object itself.

Picture: A diagonal pencil labeled on V7 using box and polygon
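To see why elongated objects fill so little of their box, here is a minimal sketch that treats the pencil as a rotated rectangle and computes how much of its axis-aligned bounding box it covers. The 200 × 6 pixel dimensions and the function name are hypothetical, chosen only for illustration.

```python
import math

def box_fill_fraction(length, width, angle_deg):
    """Fraction of the axis-aligned bounding box covered by a rotated rectangle."""
    theta = math.radians(angle_deg)
    c, s = abs(math.cos(theta)), abs(math.sin(theta))
    # Width and height of the tightest axis-aligned box around the rotated rectangle.
    box_w = length * c + width * s
    box_h = length * s + width * c
    return (length * width) / (box_w * box_h)

# Hypothetical pencil: 200 px long, 6 px wide, at several rotations.
for angle in (0, 15, 45):
    print(f"{angle:>2} degrees: {box_fill_fraction(200, 6, angle):.1%} of the box is pencil")
```

When the pencil is axis-aligned it fills its whole box; at 45 degrees the same pencil covers well under 10% of the box, and the rest is background.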

Objects that have no physical presence—Use classification

Things in an image such as the tag “sunny”, “bright”, or “skewed” are best identified by image classification techniques, letting a network take the image and figure out which features correlate to these tags.

Objects that have no clear boundaries at different angles—Use semantic segmentation

The sky, ground, or vegetation in aerial images don’t really have a defined set of boundaries. Semantic segmentation is more efficient at “painting” pixels that belong to these classes. Object detection will still pick up the “sky” as an object, but it will struggle far more with such objects.

Objects that are often occluded—Use Instance Segmentation if possible

Occlusion is handled far better in two-stage detection networks than in single-stage approaches. Within this branch of detectors, instance segmentation models will do a better job at understanding and segmenting occluded objects than mere bounding-box detectors.

💡 Pro tip: Looking for the perfect tool for building object detection models? Check out 13 Best Image Annotation Tools.

Types and modes of object detection

Before deep learning took off in 2013, almost all object detection was done through classical machine learning techniques. Common ones included the Viola-Jones object detection framework, the scale-invariant feature transform (SIFT), and histograms of oriented gradients (HOG).

These would detect a number of common features across the image, and classify their clusters using logistic regression, color histograms, or random forests. Today’s deep learning-based techniques vastly outperform these.

Deep learning-based approaches use neural network architectures such as RetinaNet, YOLO (You Only Look Once), CenterNet, SSD (Single Shot MultiBox Detector), and region proposal-based models (R-CNN, Fast R-CNN, Faster R-CNN, Cascade R-CNN) to extract features of the object and then assign labels to it.

How does object detection work?

Object detectors are generally divided into 2 categories:

  1. Single-stage object detectors.
  2. Two-stage object detectors.
Object detection stages

Many state-of-the-art object detection architectures are two-stage, and many of them have been pre-trained on the COCO dataset. COCO is an image dataset with 80 annotated object categories (cars, persons, sports balls, bicycles, dogs, cats, horses, etc.); some tooling counts 90 classes because the category IDs run up to 90 with gaps for categories that were never annotated.

The dataset was gathered to solve common object detection problems. It is now showing its age: its images were mostly captured in the early 2000s, so they are smaller, grainier, and feature different objects than today’s images. Newer datasets like Open Images are taking its spot as the de facto pre-training dataset.

Single-stage object detectors

A single-stage detector removes the RoI extraction process and directly classifies and regresses the candidate anchor boxes. Examples include the YOLO family (YOLOv2, YOLOv3, YOLOv4, and YOLOv5), CornerNet, CenterNet, and others. For instance, let’s take a look at how YOLO works.


YOLO stands for You Only Look Once. It uses a single neural network, trained end to end, that takes a photograph as input and directly predicts bounding boxes and a class label for each bounding box. YOLO is a typical single-stage detector.
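The "regress the candidate anchor boxes" step can be sketched as follows. This is a minimal illustration that assumes the center/size offset parameterization popularized by the R-CNN family and reused by several anchor-based single-stage detectors; individual YOLO versions use their own variants, and the function name and numbers here are hypothetical.

```python
import math

def decode_anchor(anchor, deltas):
    """Turn an anchor box plus predicted offsets into corner coordinates.

    anchor: (center_x, center_y, width, height)
    deltas: (tx, ty, tw, th) as predicted by the network (assumed parameterization)
    returns: (x1, y1, x2, y2)
    """
    acx, acy, aw, ah = anchor
    tx, ty, tw, th = deltas
    # Shift the center by a fraction of the anchor size.
    cx = acx + tx * aw
    cy = acy + ty * ah
    # Scale width and height exponentially so they stay positive.
    w = aw * math.exp(tw)
    h = ah * math.exp(th)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

# Zero offsets return the anchor itself; a positive tx moves the box right.
print(decode_anchor((50, 50, 20, 40), (0.0, 0.0, 0.0, 0.0)))
print(decode_anchor((50, 50, 20, 40), (0.5, 0.0, 0.0, 0.0)))
```

In a real detector, the same decoding runs for thousands of anchors in one pass, alongside a per-anchor class score.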

Two-stage object detectors

Two-stage detectors divide the object detection task into two stages: extract RoIs (regions of interest), then classify and regress the RoIs. Examples of two-stage architectures include R-CNN, Fast R-CNN, Faster R-CNN, Mask R-CNN, and others. Let’s take a look at Mask R-CNN, for instance.

Mask R-CNN

Mask R-CNN is a typical instance segmentation architecture built on top of object detection. It extends Faster R-CNN by adding a branch for predicting segmentation masks on each RoI, in parallel with the existing branch for classification and bounding box regression. The mask branch is a small fully convolutional network (FCN) applied to each RoI, predicting a segmentation mask in a pixel-to-pixel manner. Below is an architectural demonstration of Mask R-CNN.

Mask R-CNN

Faster R-CNN, on the other hand, is an object detection model that improves on Fast R-CNN by adding a region proposal network (RPN). The RPN runs on the feature maps generated by the convolutional layers and proposes regions, which are then passed through RoI pooling and classified.

Below is an architectural diagram of Faster R-CNN.

Faster R-CNN diagram

Fast R-CNN, in turn, is an improved version of R-CNN that computes CNN features for the whole image in a single forward pass and then pools the features of each region of interest (RoI) from that shared feature map. The original R-CNN (Regions with CNN features) is slow because it performs a ConvNet forward pass for each object proposal, without sharing computation.

Hence, Fast R-CNN was developed to solve the problem of slow computation.
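The idea of sharing computation can be illustrated with a toy RoI max pooling routine: the feature map is computed once, and every proposal is pooled from it into a fixed-size grid. This is a simplified sketch in plain Python with hypothetical names; real implementations work on multi-channel tensors and map image coordinates to feature-map coordinates first.

```python
def bin_edges(start, length, n_bins):
    """Split [start, start + length) into n_bins integer ranges, each at least one cell wide."""
    edges = []
    for i in range(n_bins):
        lo = start + (i * length) // n_bins
        hi = start + max(((i + 1) * length) // n_bins, (i * length) // n_bins + 1)
        edges.append((lo, hi))
    return edges

def roi_max_pool(feature_map, roi, out_size=2):
    """Max-pool one RoI of a 2D feature map into an out_size x out_size grid.

    roi: (row0, col0, row1, col1) in feature-map cells, end-exclusive,
    assumed to lie inside the map and to be at least one cell in each direction.
    """
    r0, c0, r1, c1 = roi
    row_bins = bin_edges(r0, r1 - r0, out_size)
    col_bins = bin_edges(c0, c1 - c0, out_size)
    return [
        [max(feature_map[r][c] for r in range(rs, re) for c in range(cs, ce))
         for cs, ce in col_bins]
        for rs, re in row_bins
    ]

# One shared (toy) feature map, computed once for the whole image.
feature_map = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]

# Two proposals of different sizes reuse it and both come out as 2 x 2 grids.
for roi in [(0, 0, 4, 4), (0, 0, 2, 2)]:
    print(roi, roi_max_pool(feature_map, roi))
```

Because every RoI ends up the same size, one classifier head can process all proposals without rerunning the convolutional backbone.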

How does fast R-CNN work
💡 Pro tip: Looking for quality training data to build your object detection model? Check out 65+ Best Free Datasets for Machine Learning

Object detection model architecture

Here's a quick breakdown of the different model families used in object detection.

R-CNN Model Family

The R-CNN Model family includes the following:

  1. R-CNN uses a selective search method to locate RoIs in the input images and a DCN (Deep Convolutional Neural Network)-based region-wise classifier to classify the RoIs independently.
  2. SPPNet and Fast R-CNN improve on R-CNN by extracting the RoIs from the feature maps, which was found to be much faster than the conventional R-CNN architecture.
  3. Faster R-CNN improves on Fast R-CNN by introducing an RPN (region proposal network), which allows it to be trained end to end. An RPN generates RoIs by regressing anchor boxes, and the resulting proposals are then used in the object detection task.
  4. Mask R-CNN adds a mask prediction branch to Faster R-CNN, so it can detect objects and predict their masks at the same time.
  5. R-FCN replaces the fully connected layers with position-sensitive score maps to better detect objects.
  6. Cascade R-CNN addresses overfitting at training time and quality mismatch at inference by training a sequence of detectors with increasing IoU (intersection over union) thresholds.
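Since IoU thresholds come up in the Cascade R-CNN entry above, here is a minimal sketch of how IoU is computed for two boxes and how a proposal fares against a sequence of increasing thresholds. The thresholds 0.5, 0.6, and 0.7 and the helper names are assumptions made for the example.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap is zero when the boxes do not intersect along an axis.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def stages_accepting(proposal, ground_truth, thresholds=(0.5, 0.6, 0.7)):
    """Return the thresholds at which the proposal counts as a positive match."""
    overlap = iou(proposal, ground_truth)
    return [t for t in thresholds if overlap >= t]

ground_truth = (0, 0, 10, 10)
proposal = (0, 0, 10, 6)
print(iou(proposal, ground_truth))
print(stages_accepting(proposal, ground_truth))
```

A loose proposal passes only the early, lenient stages; later stages demand tighter boxes, which is the intuition behind training detectors in sequence.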

YOLO Model Family

The YOLO model family includes the following:

  1. YOLO divides the input image into an S × S grid and predicts only a few boxes per grid cell, so it regresses and classifies far fewer candidate boxes than anchor-dense detectors. It was built using the Darknet neural network framework.
  2. YOLOv2 improves the performance by using more anchor boxes and a new bounding box regression method.
  3. YOLOv3 is an enhanced version of the v2 variant with a deeper feature detector network and minor representational changes. It has relatively fast inference times, taking roughly 30 ms per inference.
  4. YOLOv4, an upgrade of YOLOv3, breaks the object detection task into two pieces: regression to identify object positioning via bounding boxes, and classification to determine the object's class. YOLOv4 and its successors are the product of a different set of researchers than versions 1-3.
  5. YOLOv5 builds on YOLOv4 and uses techniques such as mosaic augmentation to improve general performance.
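The S × S grid idea from the first YOLO entry can be sketched as follows: the cell that contains an object's center is the one responsible for predicting it. The grid size of 7, the 448 × 448 image, and the function name are assumptions chosen for illustration.

```python
def responsible_cell(box, image_w, image_h, s=7):
    """Find the grid cell containing the center of a box.

    box: (x1, y1, x2, y2) in pixels.
    returns: (row, col, x_offset, y_offset), where the offsets give the
    position of the center inside the cell, each in the range [0, 1).
    """
    x1, y1, x2, y2 = box
    cx = (x1 + x2) / 2
    cy = (y1 + y2) / 2
    # Center position measured in grid-cell units.
    gx = cx * s / image_w
    gy = cy * s / image_h
    # Clamp so a center on the right or bottom edge stays in the last cell.
    col = min(int(gx), s - 1)
    row = min(int(gy), s - 1)
    return row, col, gx - col, gy - row

print(responsible_cell((100, 50, 200, 150), image_w=448, image_h=448))
```

Predicting the center as an offset inside its cell keeps the regression targets small and bounded, whatever the image size.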


CenterNet Model Family

The CenterNet model family, grouped here with other single-stage detectors, includes the following:

  1. SSD places anchor boxes densely over an input image and uses features from different convolutional layers to regress and classify the anchor boxes.
  2. DSSD introduces a deconvolution module into SSD to combine low-level and high-level features, while R-SSD uses pooling and deconvolution operations in different feature layers to the same end.
  3. RON proposes a reverse connection and an objectness prior to extract multiscale features effectively.
  4. RefineDet refines the locations and sizes of the anchor boxes twice, inheriting the merits of both one-stage and two-stage approaches.
  5. CornerNet is a keypoint-based approach that directly detects an object using a pair of corners. Although CornerNet achieves high performance, it still has room to improve.
  6. CenterNet explores the visual patterns within each bounding box. To detect an object, it uses a triplet, rather than a pair, of keypoints. A separate detector that shares the CenterNet name treats each object as a single point, predicting the x and y coordinates of the object’s center and its extent (width and height). The approach has been reported to outperform variants such as SSD and the R-CNN family.
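The "objects as single points" idea in the last entry can be sketched as a decoding step: find peaks in a center heatmap and read the predicted width and height at each peak. This is a simplified illustration with hypothetical names and toy data; it leaves out details such as sub-pixel offsets, output stride, and multiple classes.

```python
def decode_center_peaks(heatmap, sizes, threshold=0.5):
    """Turn a center heatmap and a size map into boxes.

    heatmap: 2D list of center scores in [0, 1].
    sizes:   2D list of (width, height) predictions, one per heatmap location.
    returns: list of (x1, y1, x2, y2, score) in heatmap coordinates.
    """
    rows, cols = len(heatmap), len(heatmap[0])
    boxes = []
    for r in range(rows):
        for c in range(cols):
            score = heatmap[r][c]
            if score < threshold:
                continue
            # Keep the location only if it is the maximum of its 3 x 3 neighborhood.
            neighborhood = [
                heatmap[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
            ]
            if score < max(neighborhood):
                continue
            w, h = sizes[r][c]
            boxes.append((c - w / 2, r - h / 2, c + w / 2, r + h / 2, score))
    return boxes

# Toy 3 x 3 heatmap with one clear peak in the middle.
heatmap = [
    [0.1, 0.2, 0.1],
    [0.2, 0.9, 0.2],
    [0.1, 0.2, 0.1],
]
sizes = [[(2, 2)] * 3 for _ in range(3)]
print(decode_center_peaks(heatmap, sizes))
```

Because each object is a single peak, this style of decoding does not rely on dense anchor boxes.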

And—don't forget that you can build your own object detection model using V7 in less than an hour 😉

Object detection applications

Finally, let's have a look at some of the most common object detection use cases.

Face and person detection

Most face recognition systems are powered by object detection. It can be used to detect faces, classify emotions or expressions, and feed the resulting box to an image-retrieval system to identify a specific person out of a group.

Face detection is one of the most popular object detection use cases, and you are probably already using it whenever you unlock your phone with your face.

Person detection is also commonly used to count the number of people in retail stores or to monitor social distancing.
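As a rough sketch of how detector output could feed such counts, the snippet below filters detections by label and confidence, then flags pairs of people whose box centers are close together. The detection dictionary format, the thresholds, and the use of pixel distance as a stand-in for physical distance are all assumptions; a real system would need camera calibration to measure actual distances.

```python
from itertools import combinations
from math import dist

def confident_people(detections, min_score=0.5):
    """Keep detections labeled 'person' with a score at or above min_score."""
    return [d for d in detections if d["label"] == "person" and d["score"] >= min_score]

def box_center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def too_close_pairs(detections, min_distance_px, min_score=0.5):
    """Index pairs (into the filtered person list) whose centers are closer than min_distance_px."""
    persons = confident_people(detections, min_score)
    return [
        (a, b)
        for a, b in combinations(range(len(persons)), 2)
        if dist(box_center(persons[a]["box"]), box_center(persons[b]["box"])) < min_distance_px
    ]

# Hypothetical detector output for one frame.
frame_detections = [
    {"label": "person", "score": 0.9, "box": (0, 0, 10, 20)},
    {"label": "person", "score": 0.8, "box": (20, 0, 30, 20)},
    {"label": "person", "score": 0.3, "box": (40, 0, 50, 20)},
    {"label": "dog", "score": 0.95, "box": (60, 0, 80, 20)},
    {"label": "person", "score": 0.7, "box": (200, 0, 210, 20)},
]
print(len(confident_people(frame_detections)))
print(too_close_pairs(frame_detections, min_distance_px=50))
```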

Worker PPE detection in computer vision

Intelligent video analytics

Object detection is used in intelligent video analytics (IVA) wherever CCTV cameras are present in retail venues, to understand how shoppers interact with products. These video streams pass through an anonymization pipeline that blurs people's faces and de-identifies individuals. Some IVA use cases preserve privacy by looking only at people's shoes: cameras are placed below knee level so the system captures the presence of a person without looking directly at their identifiable features. IVA is also often used in factories, airports, and transport hubs to track queue lengths and access to restricted areas.

Autonomous vehicles

Self-driving cars use object detection to spot pedestrians, other cars, and obstacles on the road in order to move around safely. Autonomous vehicles equipped with LIDAR will sometimes use 3D object detection, which applies cuboids around objects.

bounding boxes autonomous cars
💡 Pro tip: Check out The Complete Guide to Object Tracking [+V7 Tutorial].

Intelligent surgical video

Surgical video is very noisy data that is taken from endoscopes during crucial operations. Object detection can be used to spot hard-to-see items such as polyps or lesions that require a surgeon’s immediate attention. It’s also being used to inform hospital staff of the status of the operation.

surgical video ai annotation and object detection

Defect Inspection

Manufacturing companies can use object detection to spot defects in the production line. Neural networks can be trained to detect minute defects, from folds in fabric to dents or flashes in injection molded plastics.

Unlike traditional machine learning approaches, deep learning-based object detection can also spot defects in heavily varying objects, such as food.

defect inspection AI demo on fruit

Pedestrian detection

Pedestrian detection is one of the most essential computer vision tasks, with applications in robotics, video surveillance, and automotive safety. It plays a key role in object detection research, as it provides fundamental information for the semantic understanding of video footage.

Despite its relatively high performance, this technology still faces challenges, such as varied clothing styles or occluding accessories, that decrease the accuracy of existing detectors.

computer vision intelligent video analytics

AI Drone Navigation

Drones sport incredible cameras nowadays and can leverage models hosted in the cloud to assess any object they encounter.

For example, they can be used to inspect hard-to-reach areas in bridges for cracks and other structural damage or to inspect power lines, replacing dangerous routine helicopter operations.

infrastructure damage detection with AI drones.
💡 Pro tip: Check out 15+ Top Computer Vision Project Ideas for Beginners to start building your own object detection models today!


Let's recap everything we've learned today:

  • Object detection is one of the most useful and popular computer vision techniques dealing with object localization and classification within an image or video.
  • Other computer vision tasks include image classification and image segmentation.
  • Image classification runs an image through a classifier for it to assign a tag, without specifying the tag's localization within an image.
  • Image segmentation defines which pixels of an object class are found in an image.
  • If your objects have no boundaries, use a classifier; if you need very high accuracy, use instance segmentation instead.
  • Object detection is the second most accessible form of image recognition (after classification) and a great way to spot many objects at high speed.
  • Deep learning-based approaches to object detection use convolutional neural network architectures such as RetinaNet, YOLO, CenterNet, SSD, and region proposal-based models (the R-CNN family).
  • Object detection finds applications in fields like self-driving cars, asset inspection, pedestrian detection, or video surveillance.


Previously CEO at Aipoly - First smartphone engine for convolutional neural networks. Management & Stats grad at Cass Business School and Singularity University. Never had a real job.
