Mean Average Precision (mAP) Explained: Everything You Need to Know

What is Mean Average Precision (mAP), how do you calculate it, and why is it important for evaluating a model's performance?
9 min read · March 7, 2022
Bounding box annotations on cars

If you’ve ever built an object detector or dabbled with projects involving information retrieval and re-identification (ReID), you’ve probably come across the metric called Mean Average Precision (mAP).

Mean Average Precision (mAP) is commonly used to analyze the performance of object detection and segmentation systems. 

Many object detection algorithms, such as Faster R-CNN, MobileNet SSD, and YOLO, use mAP to evaluate their models. The mAP is also used across several benchmark challenges such as PASCAL VOC, COCO, and more.

Here’s what we’ll cover:

  1. What is mean average precision (mAP)?
  2. AP vs. mAP: How to correctly calculate mAP?
  3. The Precision-Recall Curve breakdown
  4. Mean Average Precision (mAP) for Object Detection

And in case you are interested in building your own computer vision models—you are in for a treat! V7 gives you access to one of the best Open Datasets libraries and the tools to annotate your data and train your AI models in hours, not weeks.

V7 allows you to build image classifiers, object detectors, OCR, and semantic segmentation models.

Check out:

  1. V7 Image Annotation
  2. V7 Video Annotation
  3. V7 Dataset Management
  4. V7 Automated-Annotation
  5. 13 Best Image Annotation Tools

Now, let’s dive in!

What is Mean Average Precision (mAP)?

Mean Average Precision (mAP) is a metric used to evaluate object detection models such as Fast R-CNN, YOLO, Mask R-CNN, etc. It is the mean of the Average Precision (AP) values of all classes, where each AP is calculated over recall values from 0 to 1.

The mAP formula is based on the following sub-metrics:

  • Confusion Matrix
  • Intersection over Union (IoU)
  • Recall
  • Precision

Let’s discuss each sub-metric and how it is interpreted.

Confusion Matrix

To create a confusion matrix, we need four attributes:

True Positives (TP): The model predicted a label, and it matches the ground truth.

True Negatives (TN): The model did not predict the label, and the label is not part of the ground truth.

False Positives (FP): The model predicted a label, but it is not part of the ground truth (Type I Error).

False Negatives (FN): The model did not predict a label, but it is part of the ground truth (Type II Error).

Confusion matrix
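As a minimal sketch of the four attributes above, the counts can be tallied for a binary labelling task as follows. The function name `confusion_counts` and the sample labels are hypothetical and only serve as an illustration.

```python
from collections import Counter

def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN, and TN for one positive class."""
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            counts["TP"] += 1  # predicted, and part of the ground truth
        elif pred == positive:
            counts["FP"] += 1  # predicted, but not part of the ground truth
        elif truth == positive:
            counts["FN"] += 1  # not predicted, but part of the ground truth
        else:
            counts["TN"] += 1  # not predicted, and not part of the ground truth
    return counts

# Hypothetical ground-truth labels and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
counts = confusion_counts(y_true, y_pred)
print(dict(counts))
```

Note that object detection benchmarks usually work with TP, FP, and FN only, since the number of "boxes that were correctly not predicted" is not well defined.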

Intersection over Union (IoU)

Intersection over Union measures the overlap between the predicted bounding box and the ground truth box: the area of their intersection divided by the area of their union. A higher IoU indicates that the predicted bounding box coordinates closely resemble the ground truth box coordinates.

Intersection over Union
Ground truth box vs predicted box
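A small sketch of the IoU computation for two axis-aligned boxes is shown below. The `(x_min, y_min, x_max, y_max)` box format and the example coordinates are assumptions made for illustration; real datasets may use other conventions such as `(x, y, width, height)`.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the intersection rectangle.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Width and height are clamped to zero when the boxes do not overlap.
    intersection = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

# Hypothetical boxes: the prediction is shifted right by half the box width.
ground_truth = (0, 0, 10, 10)
prediction = (5, 0, 15, 10)
print(f"IoU = {iou(ground_truth, prediction):.3f}")  # IoU = 0.333
```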

Precision

Precision measures how well you can find true positives (TP) out of all positive predictions (TP + FP).

Precision = TP / (TP + FP)

In object detection tasks, for instance, precision is calculated using an IoU threshold: a prediction counts as a true positive only if its IoU with the ground truth exceeds that threshold.

In the image below, the cat on the left has an IoU of 0.3 (< IoU threshold) with respect to the ground truth and is classified as a false positive. In contrast, the cat on the right is classified as a true positive because it has an IoU of 0.7 (> IoU threshold) with respect to the ground truth.

Calculating IoU threshold

The precision value may vary based on the model's confidence threshold.

Recall

Recall measures how well you can find true positives (TP) out of all ground-truth positives (TP + FN).

Recall = TP / (TP + FN)
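As a quick illustration, both metrics can be computed directly from the counts: precision divides TP by all positive predictions (TP + FP), and recall divides TP by all ground-truth positives (TP + FN). The counts below are made up for the example.

```python
def precision(tp, fp):
    """Share of positive predictions that are correct: TP / (TP + FP)."""
    predicted_positives = tp + fp
    return tp / predicted_positives if predicted_positives else 0.0

def recall(tp, fn):
    """Share of ground-truth positives that were found: TP / (TP + FN)."""
    actual_positives = tp + fn
    return tp / actual_positives if actual_positives else 0.0

# Hypothetical counts: 8 correct detections, 2 spurious ones, 4 missed objects.
tp, fp, fn = 8, 2, 4
print(f"precision = {precision(tp, fp):.3f}")  # precision = 0.800
print(f"recall    = {recall(tp, fn):.3f}")     # recall    = 0.667
```

Returning 0.0 when the denominator is zero is a convention chosen for this sketch; some libraries raise a warning or return 1.0 instead.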

How to correctly calculate mAP?

Average Precision is calculated as the weighted mean of precisions at each threshold; the weight is the increase in recall from the prior threshold.

Mean Average Precision is the average of AP of each class. However, the interpretation of AP and mAP varies in different contexts. For instance, in the evaluation document of the COCO object detection challenge, AP and mAP are the same.

Here is a summary of the steps to calculate the AP:

  1. Generate the prediction scores using the model.
  2. Convert the prediction scores to class labels.
  3. Calculate the confusion matrix—TP, FP, TN, FN.
  4. Calculate the precision and recall metrics.
  5. Calculate the area under the precision-recall curve.
  6. Measure the average precision.

The mAP is calculated by finding the Average Precision (AP) for each class and then averaging over the number of classes.

mAP = (1/n) · (AP_1 + AP_2 + … + AP_n), where AP_k is the AP of class k and n is the number of classes.
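The definition above (AP as the weighted mean of precisions, weighted by the increase in recall, and mAP as the mean of the per-class APs) can be sketched as follows. This is a simplified illustration, not a benchmark implementation: the function name, the class names, and the scores are hypothetical, and each detection is assumed to have already been marked as a true positive (1) or false positive (0).

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """AP = sum over ranked detections of (recall increase) * precision."""
    if num_ground_truth == 0:
        return 0.0
    # Walk through detections from the most to the least confident.
    ranked = sorted(zip(scores, is_true_positive),
                    key=lambda pair: pair[0], reverse=True)
    true_positives = 0
    previous_recall = 0.0
    ap = 0.0
    for rank, (_, hit) in enumerate(ranked, start=1):
        true_positives += int(hit)
        precision = true_positives / rank
        recall = true_positives / num_ground_truth
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap

# Hypothetical detections for two classes.
per_class_ap = {
    "car": average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 1], num_ground_truth=3),
    "truck": average_precision([0.95, 0.5], [1, 1], num_ground_truth=2),
}
mean_ap = sum(per_class_ap.values()) / len(per_class_ap)
print(per_class_ap)
print(f"mAP = {mean_ap:.3f}")  # mAP = 0.903
```

Benchmarks such as PASCAL VOC and COCO additionally interpolate the precision values before summing, so their numbers can differ slightly from this uninterpolated version.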

The mAP incorporates the trade-off between precision and recall and considers both false positives (FP) and false negatives (FN). This property makes mAP a suitable metric for most detection applications.

💡 Pro tip: Have a look at 27+ Most Popular Computer Vision Applications and Use Cases

Precision-Recall Curve breakdown

The precision-recall curve is obtained by plotting the model's precision and recall values as a function of the model's confidence score threshold.

Precision answers the question “when your model makes a prediction, how often is it correct?” It indicates how much we can rely on the model's positive predictions.

Recall answers the question “did your model make a prediction every time it should have?” It indicates how many of the positives the model fails to detect.

Why do we need to use a precision-recall curve instead of precision and recall independently?

The Paperspace article on mAP clearly articulates the tradeoff of using precision and recall as independent metrics as follows.

“When a model has high recall but low precision, then the model classifies most of the positive samples correctly but it has many false positives (i.e. classifies many Negative samples as Positive).”

“When a model has high precision but low recall, then the model is accurate when it classifies a sample as Positive but it may classify only some of the positive samples.”

The precision-recall curve captures the tradeoff between both metrics across all confidence score thresholds. It gives us a better idea of the overall accuracy of the model.

Depending on the problem at hand, the confidence score threshold can be adjusted to trade precision for recall and vice versa. For instance, if you are dealing with a cancer tumor detection problem, avoiding false negatives is a higher priority than avoiding false positives.

We should avoid missing tumors, even at the cost of flagging more regions that turn out not to be tumors. Lowering the confidence score threshold will encourage the model to output more predictions (higher recall) at the expense of a smaller share of correct predictions (lower precision).

The precision-recall curve is typically downward sloping because as the confidence score threshold is decreased, more predictions are made (increasing recall), and a smaller share of them is correct (lowering precision).

Consider a situation where you are supposed to guess all the countries in the world. 

You will confidently predict the names of a few countries (maybe 10 or 20) quickly with maximum precision. However, with each additional guess, you will approach higher recall while lowering the precision of your guesses. If the precision-recall curve is upward sloping, then there is most likely an issue with the model's confidence score.
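To make the threshold sweep concrete, here is a rough sketch that computes one (precision, recall) point per confidence score threshold. The scores, labels, and thresholds are invented for illustration, and reporting a precision of 1.0 when no prediction passes the threshold is a convention assumed for this sketch.

```python
def pr_curve(scores, labels, thresholds):
    """Return (threshold, precision, recall) for each confidence threshold.

    labels[i] is 1 if prediction i is actually positive, else 0.
    """
    total_positives = sum(labels)
    points = []
    for threshold in thresholds:
        # Keep only the predictions the model is confident enough about.
        kept = [label for score, label in zip(scores, labels) if score >= threshold]
        tp = sum(kept)
        precision = tp / len(kept) if kept else 1.0  # assumed convention
        recall = tp / total_positives if total_positives else 0.0
        points.append((threshold, precision, recall))
    return points

# Hypothetical confidence scores and whether each prediction was correct.
scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30]
labels = [1, 1, 0, 1, 0, 1]
curve = pr_curve(scores, labels, thresholds=[0.9, 0.7, 0.5, 0.2])
for threshold, precision, recall in curve:
    print(f"threshold={threshold:.1f}  precision={precision:.2f}  recall={recall:.2f}")
```

In this toy example recall never decreases as the threshold is lowered, while precision drops overall but is not strictly monotonic, which is why real precision-recall curves often look jagged.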

Over the years, AI researchers have tried to combine precision and recall into a single metric to compare models. There are a couple of metrics that are widely used:

  • F1 score: the harmonic mean of precision and recall. It is computed at a single confidence score threshold, so it is typically reported at the threshold where it is highest. The F1 score is high only when both precision and recall are high.
F1 = 2 · (Precision · Recall) / (Precision + Recall)
  • AUC (Area Under the Curve): the area underneath the precision-recall curve.
AUC (Area Under the Curve)

The area under the precision-recall curve (PR-AUC) summarizes the PR values for different thresholds under a single metric.
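Both summary metrics can be sketched in a few lines. The example below picks the threshold with the highest F1 score and approximates the PR-AUC with the trapezoidal rule; the precision-recall points are hypothetical, and the trapezoidal rule is only one of several ways to integrate the curve.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def pr_auc(recalls, precisions):
    """Trapezoidal area under PR points sorted by increasing recall."""
    area = 0.0
    for i in range(1, len(recalls)):
        width = recalls[i] - recalls[i - 1]
        area += width * (precisions[i] + precisions[i - 1]) / 2
    return area

# Hypothetical (threshold, precision, recall) points from a threshold sweep.
points = [(0.9, 1.0, 0.5), (0.5, 0.75, 0.75), (0.2, 0.5, 1.0)]
best_threshold, best_p, best_r = max(points, key=lambda p: f1_score(p[1], p[2]))
print(f"best F1 = {f1_score(best_p, best_r):.2f} at threshold {best_threshold}")

# The curve is anchored at recall 0 with precision 1 (an assumed convention).
recalls = [0.0, 0.5, 0.75, 1.0]
precisions = [1.0, 1.0, 0.75, 0.5]
area = pr_auc(recalls, precisions)
print(f"PR-AUC = {area:.3f}")  # PR-AUC = 0.875
```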

Different score metrics and their PR curves

The above image clearly shows how precision and recall values are incorporated in each metric: F1, Area Under Curve (AUC), and Average Precision (AP). The choice of metric depends heavily on the type of problem.

AUC and AP are considered superior metrics compared to the F1 score because they summarize the whole curve rather than a single threshold. For interpretability purposes, researchers use AP as the standard metric.

Mean Average Precision for Object Detection

Object Detection is a well-known computer vision problem where models seek to localize the relevant objects in images and classify those objects into relevant classes. The mAP is used as a standard metric to analyze the accuracy of an object detection model.

Let us walk through an object detection example for mAP calculation.

Consider the image below of cars driving on a highway, where the model’s task is to detect the cars. The output of the model is shown as red boxes. The model gave seven detections from P1 to P7, and the IoU values are calculated with respect to the ground truth.

For object detection tasks, precision is calculated based on the IoU threshold, so the precision value changes with the IoU threshold.

If IoU threshold = 0.8 then precision is 66.67%. (4 out of 6 are considered correct)

If IoU threshold = 0.5 then precision is 83.33%. (5 out of 6 are considered correct)

If IoU threshold = 0.2 then precision is 100%. (6 out of 6 are considered correct)

Object detection on car images
💡 Pro tip: Have a look at 65+ Best Free Datasets for Machine Learning and 20+ Open Source Computer Vision Datasets to find more datasets to train your Object Detectors.
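The three precision values above can be reproduced with a short sketch. The per-detection IoU values below are hypothetical (the real ones are only visible in the image); they were chosen for six detections so that the result matches the "out of 6" arithmetic in the text, and each detection is assumed to be matched to a distinct ground truth box.

```python
def precision_at_iou(ious, iou_threshold):
    """Share of detections whose IoU meets the threshold: TP / (TP + FP)."""
    if not ious:
        return 0.0
    true_positives = sum(1 for value in ious if value >= iou_threshold)
    return true_positives / len(ious)

# Hypothetical IoU of each detection with its matched ground truth box.
detection_ious = [0.92, 0.88, 0.85, 0.81, 0.60, 0.30]

for threshold in (0.8, 0.5, 0.2):
    value = precision_at_iou(detection_ious, threshold)
    print(f"IoU threshold = {threshold}: precision = {value:.2%}")
# IoU threshold = 0.8: precision = 66.67%
# IoU threshold = 0.5: precision = 83.33%
# IoU threshold = 0.2: precision = 100.00%
```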

This shows that the AP metric is dependent on the IoU threshold. The threshold has to be chosen carefully for each task because the expected localization accuracy varies, which makes the choice somewhat arbitrary for the researcher. Hence, to avoid this ambiguity while evaluating an object detection model, mAP is calculated over a set of IoU thresholds.

The idea is pretty simple: consider a set of IoU thresholds in the AP calculation.

Calculate AP across a set of IoU thresholds for each class k and then take the average of all AP values. This eliminates the necessity of picking an optimal IoU threshold by using a set of IoU thresholds that covers tail ends of precision and recall values.

mAP for each class in the dataset

In the sketch above, the orange line represents the high IoU requirement (around 90%), and the blue line represents the low IoU requirement (around 10%). The set of IoU thresholds represents the number of lines in the PR curve.

For each class k, we calculate the mAP across different IoU thresholds, and the final metric mAP across test data is calculated by taking an average of all mAP values per class.

mAP multi-class formula
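The two-level averaging described above (first over IoU thresholds within each class, then over classes) can be sketched as follows. The AP values, class names, and the reduced set of three IoU thresholds are hypothetical and only illustrate the averaging.

```python
from statistics import mean

# Hypothetical AP values: ap_table[class_name][iou_threshold].
ap_table = {
    "car":   {0.5: 0.9, 0.75: 0.7, 0.9: 0.2},
    "truck": {0.5: 0.8, 0.75: 0.5, 0.9: 0.2},
}

# Step 1: average AP across the IoU thresholds for each class k.
per_class = {name: mean(aps.values()) for name, aps in ap_table.items()}

# Step 2: average the per-class values to get the final metric.
final_map = mean(per_class.values())

for name, value in per_class.items():
    print(f"{name}: {value:.2f}")
print(f"mAP = {final_map:.2f}")  # mAP = 0.55
```

Because every class uses the same number of thresholds here, averaging per class first gives the same result as averaging over all class and threshold pairs at once.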

The mAP calculation varies in different object detection challenges.

COCO mAP

According to the COCO 2017 challenge evaluation guidelines, the mAP was calculated by averaging the AP over 80 object classes and all 10 IoU thresholds from 0.5 to 0.95 with a step size of 0.05.

💡 Pro tip: Looking for the tool to annotate your data for free? Have a look at our Complete Guide to CVAT—Pros & Cons.

The primary challenge metric in COCO 2017 challenge is calculated as follows:

  1. AP is calculated for the IoU threshold of 0.5 for each class.
  2. The precision is calculated at every recall value (0 to 1 with a step size of 0.01), and the calculation is repeated for IoU thresholds of 0.55, 0.60, …, 0.95.
  3. The average is taken over all 80 classes and all 10 thresholds.
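Step 2 above, sampling the precision at recall values from 0 to 1 in steps of 0.01, can be approximated with the sketch below. It follows the general idea of 101-point interpolated AP as I understand it (precision is first made non-increasing, then sampled at 101 recall levels); it is not the official COCO evaluation code, and the function name and sample curve are hypothetical.

```python
def interpolated_ap(recalls, precisions, num_points=101):
    """Mean of interpolated precision at evenly spaced recall levels.

    recalls must be sorted in non-decreasing order, with matching precisions.
    """
    # Make precision non-increasing: each point takes the best precision
    # achieved at any equal or higher recall.
    envelope = list(precisions)
    for i in range(len(envelope) - 2, -1, -1):
        envelope[i] = max(envelope[i], envelope[i + 1])

    total = 0.0
    for k in range(num_points):
        level = k / (num_points - 1)  # 0.00, 0.01, ..., 1.00
        # Precision at the first point whose recall reaches this level;
        # recall levels the model never reaches contribute 0.
        sampled = 0.0
        for recall, precision in zip(recalls, envelope):
            if recall >= level:
                sampled = precision
                break
        total += sampled
    return total / num_points

# Hypothetical PR curve for one class at one IoU threshold.
ap = interpolated_ap(recalls=[0.5, 1.0], precisions=[1.0, 0.5])
print(f"AP = {ap:.4f}")  # AP = 0.7525
```

Repeating this for every class and for each of the 10 IoU thresholds, then averaging all the results, gives the primary metric described in step 3.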

Moreover, additional metrics are used to identify the model’s accuracy on different object scales (APsmall, APmedium, and APlarge).

Metrics used in COCO challenge

Have a look at the COCO mAP comparison table for the popular one-stage object detector YOLOv3 vs. the two-stage detector Faster R-CNN.

YOLOv3 COCO mAP results

Google Open Images Dataset V4 Competition uses mean Average Precision (mAP) over the 500 classes to evaluate the object detection algorithms.

PASCAL VOC Challenge: The current PASCAL VOC object detection challenge metrics are the Precision x Recall curve and Average Precision (AP).

💡 Pro tip: Read The Essential Guide to Neural Network Architectures.

Mean Average Precision: Key Takeaways

Here's everything we've covered so far:

  • Mean Average Precision (mAP) is the current benchmark metric used by the computer vision research community to evaluate the robustness of object detection models.
  • Precision measures how many of the model's predictions are correct, whereas recall measures how many of the ground-truth objects the model finds.
  • mAP encapsulates the tradeoff between precision and recall and takes both metrics into account.
  • In object detection, predictions are classified as true or false positives using the IoU threshold.
  • Calculating mAP over a range of IoU thresholds avoids the ambiguity of picking a single optimal IoU threshold for evaluating the model's accuracy.

Deval is a senior software engineer at Eagle Eye Networks and a computer vision enthusiast. He writes about complex topics related to machine learning and deep learning.
