YOLO: Real-Time Object Detection Explained

What is YOLO and how does it work? Learn about different YOLO versions and start training your own object detection models on custom datasets.

Object detection is an advanced form of image classification where a neural network predicts objects in an image and points them out in the form of bounding boxes.

Object detection thus refers to the detection and localization of objects in an image that belong to a predefined set of classes.

Tasks like detection, recognition, or localization find widespread applicability in real-world scenarios, making object detection (also referred to as object recognition) a very important subdomain of Computer Vision.

After reading this article, you'll understand the following:

  1. What is two-stage object detection?
  2. What is YOLO?
  3. YOLO vs. other detectors
  4. How does YOLO work?
  5. YOLO architecture
  6. Differences in YOLO’s versions
  7. How to work with YOLO?

Let’s get started.

Two-stage object detection

Two-stage object detection refers to the use of algorithms that break the object detection problem down into the following two stages:

  1. Detecting possible object regions.
  2. Classifying the image in those regions into object classes.

Popular two-stage algorithms like Faster R-CNN use a Region Proposal Network (RPN) that proposes regions of interest likely to contain objects (earlier models like Fast R-CNN relied on external proposals such as Selective Search instead).

One and two stage object detectors

The output from the RPN is then fed to a classifier that classifies the regions into classes.

While this approach gives accurate object detection results with a high mean Average Precision (mAP), it requires multiple passes over the same image, slowing down the detection speed of the algorithm and preventing real-time detection.

What is YOLO?

YOLO (You Only Look Once) is an algorithm proposed by Joseph Redmon et al. in a research paper published at the 2016 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), where it won the OpenCV People’s Choice Award.

Compared to the approach taken by object detection algorithms before YOLO, which repurpose classifiers to perform detection, YOLO proposes the use of an end-to-end neural network that makes predictions of bounding boxes and class probabilities all at once.

Following a fundamentally different approach to object detection, YOLO achieves state-of-the-art results beating other real-time object detection algorithms by a large margin.

YOLO vs. other detectors

In addition to increased accuracy in predictions and a better Intersection over Union in bounding boxes (compared to real-time object detectors), YOLO has the inherent advantage of speed.

YOLO is a much faster algorithm than its counterparts, running at up to 45 FPS (and up to 155 FPS for the smaller Fast YOLO variant).

Here's how YOLO works in practice.

While algorithms like Faster R-CNN work by first detecting possible regions of interest using the Region Proposal Network and then performing recognition on those regions separately, YOLO makes all of its predictions in a single network pass, with a final fully connected layer outputting the bounding boxes and class probabilities.

Methods that use Region Proposal Networks thus end up performing multiple iterations for the same image, while YOLO gets away with a single iteration.  

YOLO limitations

Although YOLO does seem to be the best algorithm to use if you have an object detection problem to solve, it comes with several limitations.

YOLO struggles to detect and separate small objects that appear in groups, as each grid cell is constrained to predicting only a single object. Small objects that naturally come in groups, such as a line of ants, are therefore hard for YOLO to detect and localize.

YOLO is also characterized by lower accuracy when compared to much slower object detection algorithms like Fast RCNN.

Now, before we deep dive into more details about the YOLO architecture and methodology, let's go over some of the important terminology.

Intersection over Union (IoU)

Intersection over Union is a popular metric to measure localization accuracy and calculate localization errors in object detection models.

To calculate the IoU between a prediction and the ground truth, we first take the area of intersection between the predicted bounding box and the ground-truth bounding box for the same object. Following this, we calculate the total area covered by the two bounding boxes together, also known as the Union.

The Intersection divided by the Union gives us the ratio of the overlap to the total area, providing a good estimate of how close the predicted bounding box is to the ground truth.

Intersection over Union (IoU) formula
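To make the metric concrete, here's a minimal sketch of IoU for two axis-aligned boxes in (x1, y1, x2, y2) format (the function name and box format are illustrative, not tied to any particular library):

```python
def iou(box_a, box_b):
    """Compute Intersection over Union for two axis-aligned boxes.

    Boxes are (x1, y1, x2, y2), with (x1, y1) the top-left corner
    and (x2, y2) the bottom-right corner.
    """
    # Coordinates of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Clamp to zero so non-overlapping boxes yield an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0


# A prediction overlapping half of a 10x10 ground-truth box:
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 50 / 150 = 0.333...
```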

Average Precision (AP)

Average Precision is calculated as the area under a precision vs recall curve for a set of predictions.

Recall is calculated as the ratio of true positive predictions for a class to the total number of ground-truth labels for that class.

Precision, on the other hand, is the ratio of true positives to the total number of predictions made by the model.

The area under the precision vs recall curve gives us the Average Precision per class for the model. The average of this value, taken over all classes, is termed as mean Average Precision (mAP).

💡 Note: In object detection, precision and recall are not computed over class predictions but over bounding box predictions, to measure detection performance. A predicted box with an IoU > 0.5 against the ground truth is counted as a positive prediction, while one with an IoU < 0.5 is counted as a negative prediction.
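Putting these definitions together, here's a hedged sketch of per-class AP as the area under the precision-recall curve. The inputs (confidence scores, true-positive flags at IoU > 0.5, and the number of ground-truth boxes) are illustrative, and benchmarks like PASCAL VOC and COCO use interpolated variants of this integral:

```python
import numpy as np

def average_precision(scores, is_true_positive, num_ground_truths):
    """Per-class AP: area under the precision-recall curve (sketch)."""
    order = np.argsort(-np.asarray(scores))             # rank by confidence
    tp = np.asarray(is_true_positive, dtype=float)[order]
    fp = 1.0 - tp

    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(fp)
    recall = cum_tp / num_ground_truths                 # TP / all ground truths
    precision = cum_tp / (cum_tp + cum_fp)              # TP / all predictions

    # Trapezoidal approximation of the area under the PR curve.
    return float(np.trapz(precision, recall))

# mAP is then simply the mean of this value over all classes.
```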

How does YOLO work?

The YOLO algorithm works by dividing the input image into an S×S grid of equally sized cells. Each cell is responsible for the detection and localization of the object whose center falls inside it.

Correspondingly, each cell predicts B bounding boxes with coordinates relative to the cell, along with a confidence score for an object being present and the class probabilities for that object.
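For a sense of scale, the original YOLO paper uses S = 7, B = 2, and C = 20 (the PASCAL VOC classes), which fixes the shape of the network's output tensor:

```python
# Output tensor layout of the original YOLO (values from the v1 paper):
S, B, C = 7, 2, 20             # grid size, boxes per cell, number of classes

# Each box contributes (x, y, w, h, confidence) = 5 numbers; each cell also
# predicts one set of C class probabilities, shared by its B boxes.
depth_per_cell = B * 5 + C     # 2 * 5 + 20 = 30

print((S, S, depth_per_cell))  # (7, 7, 30): the network's prediction tensor
```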

This process greatly lowers the computation, as both detection and recognition are handled directly by the grid cells of the image, but—

It brings forth a lot of duplicate predictions due to multiple cells predicting the same object with different bounding box predictions.

YOLO makes use of Non-Maximum Suppression (NMS) to deal with this issue.

Image divided into grids; before Non-Maximum Suppression; after Non-Maximum Suppression (final output)

In Non-Maximum Suppression, YOLO suppresses all bounding boxes that have lower probability scores.

YOLO achieves this by first looking at the probability scores associated with each prediction and picking the largest one. Following this, it suppresses all remaining bounding boxes whose Intersection over Union with the current high-probability box exceeds a threshold.

This step is repeated until the final bounding boxes are obtained.
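Here's a minimal sketch of this greedy procedure, reusing the `iou` helper from the IoU section above (the 0.5 threshold is a common default, not a fixed part of YOLO; in practice, NMS is usually run per class):

```python
def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest remaining confidence
        keep.append(best)
        # Suppress boxes that overlap the kept box too strongly.
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```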

💡 Pro tip: Would you like to start annotating with bounding boxes? Check out 9 Essential Features for a Bounding Box Annotation Tool.

Before moving on, let's have a quick look at YOLO's architecture.

YOLO Architecture

Inspired by the GoogLeNet architecture, YOLO’s architecture has a total of 24 convolutional layers followed by 2 fully connected layers at the end.

YOLO Architecture

Here's a timeline showcasing YOLO's development in recent years.

YOLO versions timeline

The differences: YOLO, YOLOv2, YOLO9000, YOLOv3, YOLOv4+

Now, let's discuss and compare different versions of YOLO.

YOLOv2

YOLOv2 was proposed to fix YOLO’s main issues—the detection of small objects in groups and the localization accuracy.

YOLOv2 increases the mean Average Precision of the network by introducing batch normalization.

Batch Norm increases the mAP value by as much as 2 percent.

A much more impactful addition to the YOLO algorithm, as proposed by YOLOv2, was the use of anchor boxes. YOLO, as we know, predicts a single object per grid cell. While this makes the model simpler, it creates issues when a single cell contains more than one object, as YOLO can only assign a single class to the cell.

YOLOv2 removes this limitation by allowing the prediction of multiple bounding boxes from a single cell. This is achieved by making the network predict 5 anchor boxes for each cell.

The number 5 was chosen empirically as a good trade-off between model complexity and prediction performance; the anchor shapes themselves are derived by running k-means clustering over the bounding boxes of the training set. Darknet-19, containing a total of 19 convolutional layers and 5 max-pooling layers, is used as the backbone of the YOLOv2 architecture.
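For reference, the YOLOv2 paper decodes each anchor prediction by passing the predicted center offsets through a sigmoid (which keeps the center inside its cell) and scaling the anchor priors exponentially. A minimal sketch, with our own function name and argument layout:

```python
import math

def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Decode one YOLOv2-style anchor prediction into a box.

    (tx, ty, tw, th) are the raw network outputs for one anchor,
    (cx, cy) is the cell's top-left offset in grid units, and
    (pw, ph) are the anchor's prior width and height.
    """
    bx = cx + _sigmoid(tx)       # center stays inside the predicting cell
    by = cy + _sigmoid(ty)
    bw = pw * math.exp(tw)       # scale the anchor prior
    bh = ph * math.exp(th)
    return bx, by, bw, bh
```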

YOLO9000


Using a network architecture similar to YOLOv2's, YOLO9000 was proposed as an algorithm that can detect far more classes than an object detection dataset like COCO could make possible on its own.

The object detection dataset these models were trained on (COCO) has only 80 classes, compared to classification datasets like ImageNet, which has around 22,000 classes.

To enable the detection of many more classes, YOLO9000 trains jointly on labels from both ImageNet and COCO, effectively merging the classification and detection tasks into a single detection model.

Since some COCO classes can be seen as supersets of ImageNet classes, YOLO9000 makes use of a hierarchical classification algorithm based on WordNet, in which classes and their subclasses are represented in a tree-based fashion (the paper calls this structure WordTree).
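Under this scheme, a node's absolute probability is the product of the conditional probabilities along its path from the root. A quick illustration (the path and the numbers are made up for the example):

```python
import math

def leaf_probability(path_conditionals):
    """WordTree-style scoring: multiply conditionals along the root-to-node path.

    e.g. P(Norfolk terrier) = P(Norfolk terrier | terrier)
                            * P(terrier | dog) * P(dog | animal) * ...
    """
    return math.prod(path_conditionals)

# P(dog | animal) = 0.9, P(terrier | dog) = 0.8,
# P(Norfolk terrier | terrier) = 0.7:
print(leaf_probability([0.9, 0.8, 0.7]))  # 0.504
```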

While YOLO9000 provides a lower mean Average Precision as compared to YOLOv2, it is capable of detecting more than 9000 classes, making it a powerful algorithm.

YOLOv3

While YOLOv2 is a superfast network, various alternatives that offer better accuracies—like Single Shot Detectors—have also entered the scene. Although much slower, they outstrip YOLOv2 and YOLO9000 in terms of accuracy.

To improve YOLO with modern CNNs that make use of residual networks and skip connections, YOLOv3 was proposed.


While YOLOv2 uses Darknet-19 as its backbone, YOLOv3 uses the much deeper Darknet-53; the full YOLOv3 detection network is 106 layers deep, complete with residual blocks and upsampling layers.

YOLOv3’s architectural novelty allows it to predict at 3 different scales, with the feature maps for these predictions extracted at layers 82, 94, and 106.

By detecting features at 3 different scales, YOLOv3 makes up for the shortcomings of YOLOv2 and YOLO, particularly in the detection of smaller objects. Because the architecture concatenates the upsampled layer outputs with features from previous layers, the fine-grained features extracted earlier are preserved, making smaller objects easier to detect.

YOLOv3 predicts only 3 bounding boxes per cell (compared to 5 in YOLOv2), but since it makes predictions at three different scales, it uses 9 anchor boxes in total.
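To see what predicting at three scales means in practice, here's a quick calculation for the common 416×416 input size (the strides of 32, 16, and 8 follow from YOLOv3's design; the formatting is ours):

```python
# YOLOv3 predicts at three scales. For a 416x416 input, the feature maps at
# strides 32, 16, and 8 work out to:
input_size = 416
for stride in (32, 16, 8):
    s = input_size // stride
    # 3 anchor boxes per cell at each scale, 9 anchors in total.
    print(f"stride {stride}: {s}x{s} grid, {s * s * 3} boxes")

# stride 32: 13x13 grid, 507 boxes
# stride 16: 26x26 grid, 2028 boxes
# stride 8: 52x52 grid, 8112 boxes
```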

YOLOv4, YOLOv5, YOLACT, and future YOLOs

Joseph Redmon left the AI community a few years back, so YOLOv4 and the versions past that are not his official work. Some of them are maintained by his co-authors, but none of the releases after YOLOv3 is considered the “official” YOLO.

However, the legacy continues through new researchers.

YOLOv4 was proposed by Bochkovskiy et al. in 2020 as an improvement on YOLOv3. The algorithm achieves state-of-the-art results at 43.5% Average Precision on COCO while running at 65 FPS on a Tesla V100 GPU.

These results are achieved by including a combination of changes in architectural design and training methodologies of YOLOv3.

YOLOv4 proposes the addition of Weighted Residual Connections, Cross mini-Batch Normalization, Cross Stage Partial connections, Self-Adversarial Training, and Mish activation as methodological changes, alongside modern regularization and data augmentation methods. The authors also make available a YOLOv4-Tiny version that provides faster object detection and a higher FPS while compromising on prediction accuracy.

YOLOv5 is an open-source project consisting of a family of object detection models and detection methods based on the YOLO model, pre-trained on the COCO dataset. It is maintained by Ultralytics and represents the organization’s open-source research into the future of computer vision.

YOLACT (You Only Look At Coefficients), proposed by Bolya et al., is an application of the YOLO principle to real-time instance segmentation.

In other words, YOLACT proposes an end-to-end convolutional network for instance segmentation that achieves 29.8 mean Average Precision at 33.5 FPS on a single Titan Xp, which is significantly faster than other instance segmentation algorithms.

YOLACT performs instance segmentation by generating a set of prototype masks and per-instance mask coefficients. The final instance masks are then produced by linearly combining the prototypes, weighted by each instance’s coefficients.
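Here's a minimal sketch of that assembly step, following the paper's M = sigmoid(P C^T) formulation (the shapes and variable names are our own):

```python
import numpy as np

def assemble_masks(prototypes, coefficients):
    """Combine prototype masks with per-instance coefficients (YOLACT-style).

    prototypes:   (h, w, k) array holding k prototype masks
    coefficients: (n, k) array, one k-vector per detected instance
    Returns an (h, w, n) array of instance masks in [0, 1].
    """
    h, w, k = prototypes.shape
    # Linearly combine the k prototypes per instance...
    logits = prototypes.reshape(h * w, k) @ coefficients.T   # (h*w, n)
    # ...then squash to [0, 1] with a sigmoid.
    masks = 1.0 / (1.0 + np.exp(-logits))
    return masks.reshape(h, w, -1)
```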

How to work with YOLO

We won't get into the nitty-gritty of working with YOLO in this article, but here's a detailed guide for training YOLOv5 on your personalized dataset: YOLOv5 Training Guide.

You can create and export datasets with V7 and train YOLOv5 to detect specific object categories.

Annotating with bounding boxes using V7

Additionally, there are pre-trained models available for download that you can use right away.
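For instance, the YOLOv5 README documents loading a COCO-pretrained model through torch.hub. A minimal sketch based on that usage (check the repository for the current syntax; the image URL is the repo's own example):

```python
import torch

# Load a COCO-pretrained YOLOv5 model from the official Ultralytics repo.
# torch.hub downloads the code and weights on first use (internet required).
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)

# Run inference on an image.
results = model('https://ultralytics.com/images/zidane.jpg')

results.print()          # per-class detections with confidences
boxes = results.xyxy[0]  # tensor of (x1, y1, x2, y2, confidence, class) rows
```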

You can also have a look at this list of 65+ Best Free Datasets for Machine Learning to find relevant data for training your models.

YOLO in a nutshell: Key Takeaways

YOLO provided a super fast and accurate object detection algorithm that revolutionized computer vision research related to object detection.

With over 5 versions (3 of them official) and more than 16 thousand citations, YOLO has evolved tremendously since it was first proposed in 2015.

YOLO has large-scale applicability with thousands of use cases, particularly for autonomous driving, vehicle detection, and intelligent video analytics.

However—

Like almost all tech, YOLO (and object detection in general), can have both positive and negative societal impact, which is why its usage should be regulated.

💡 Read next:

Annotating With Bounding Boxes: Quality Best Practices

Optical Character Recognition: What is It and How Does it Work [Guide]

An Introductory Guide to Quality Training Data for Machine Learning

Hmrishav Bandyopadhyay

Hmrishav Bandyopadhyay studies Electronics and Telecommunication Engineering at Jadavpur University. He previously worked as a researcher at the University of California, Irvine, and Carnegie Mellon University. His deep learning research revolves around unsupervised image de-warping and segmentation.
