What is Mean Average Precision (mAP), how to calculate it, and why is it important for evaluating models' performance?

March 7, 2022

If you’ve ever built an object detector or dabbled with projects involving information retrieval and re-identification (ReID), you’ve probably come across the metric called *Mean Average Precision (mAP)*.

Mean Average Precision (mAP) is commonly used to analyze the performance of object detection and segmentation systems.

Many object detection algorithms, such as Faster R-CNN, MobileNet SSD, and YOLO, use mAP to evaluate their models. mAP is also used across several benchmark challenges such as Pascal VOC, COCO, and more.

Here’s what we’ll cover:

- What is mean average precision (mAP)?
- AP vs. mAP: How to correctly calculate mAP?
- The Precision-Recall Curve breakdown
- Mean Average Precision (mAP) for Object Detection

And in case you are interested in building your own computer vision models—you are in for a treat! V7 gives you access to one of the best Open Datasets libraries and the tools to annotate your data and train your AI models in hours, not weeks.

V7 allows you to build image classifiers, object detectors, OCR, and semantic segmentation models.

Now, let’s dive in!

Mean Average Precision (mAP) is a metric used to evaluate object detection models such as Fast R-CNN, YOLO, and Mask R-CNN. It is the mean of the average precision (AP) values, each of which is calculated over recall values from 0 to 1.

The mAP formula is based on the following sub-metrics:

- Confusion Matrix
- Intersection over Union (IoU)
- Recall
- Precision

Let’s discuss each sub-metric and how it is interpreted.

To create a confusion matrix, we need four attributes:

**True Positives (TP)**: The model predicted a label, and it matches the ground truth.

**True Negatives (TN)**: The model does not predict the label, and the label is not part of the ground truth.

**False Positives (FP)**: The model predicted a label, but it is not part of the ground truth (Type I Error).

**False Negatives (FN)**: The model does not predict a label, but it is part of the ground truth (Type II Error).

**Intersection over Union (IoU)** indicates the overlap between the predicted bounding box and the ground truth box. A higher IoU indicates that the predicted bounding box coordinates closely resemble the ground truth box coordinates.
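As a minimal sketch of the overlap computation, assuming axis-aligned boxes given as `(x1, y1, x2, y2)` corner coordinates (the box format and the function name `iou` are illustrative assumptions, not something the article specifies):

```python
def iou(box_a, box_b):
    """Intersection over Union for two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle, if the boxes overlap at all.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Width and height are clamped at zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Two 2x2 boxes sharing a 1x1 corner: intersection 1, union 7.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

Identical boxes give an IoU of 1.0, and boxes that do not touch give 0.0.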

**Precision** measures how well you can find true positives (TP) out of all positive predictions (TP+FP).

For instance, the precision is calculated using the IoU threshold in object detection tasks.

In the image below, the cat on the left has an IoU of **0.3 (< IoU threshold)** w.r.t. the ground truth and is classified as a false positive. In contrast, the cat on the right is classified as a true positive because its IoU is above the threshold.

The precision value may vary based on the model's confidence threshold.

**Recall** measures how well you can find true positives (TP) out of all ground-truth positives (TP+FN).
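The two definitions can be written down directly from the confusion-matrix counts. The sketch below is illustrative: the function name and the choice to return 0.0 when a denominator is zero are assumptions, and the counts in the example are made up.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true positive, false positive,
    and false negative counts."""
    # Precision: share of positive predictions that are correct.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    # Recall: share of ground-truth positives that were found.
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Hypothetical counts: 8 correct detections, 2 spurious ones, 8 missed objects.
precision, recall = precision_recall(tp=8, fp=2, fn=8)
print(precision, recall)  # 0.8 0.5
```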

Average Precision is calculated as the weighted mean of precisions at each threshold; the weight is the increase in recall from the prior threshold.
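Written as a formula, this weighted mean can be sketched as follows. The symbols are notation introduced here for illustration: $P_n$ and $R_n$ stand for the precision and recall at the $n$-th threshold, with $R_0 = 0$.

```latex
\mathrm{AP} = \sum_{n} \left( R_n - R_{n-1} \right) P_n
```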

**Mean Average Precision** is the average of the AP of each class. However, the interpretation of AP and mAP varies in different contexts. For instance, in the evaluation document of the COCO object detection challenge, AP and mAP are the same.

Here is a summary of the steps to calculate the AP:

- Generate the prediction scores using the model.
- Convert the prediction scores to class labels.
- Calculate the confusion matrix—TP, FP, TN, FN.
- Calculate the precision and recall metrics.
- Calculate the area under the precision-recall curve.
- Measure the average precision.
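The steps above can be sketched for a single class as follows. This is a simplified illustration, not a benchmark implementation: it assumes binary labels (1 for positive, 0 for negative), treats every ranked prediction as its own threshold, and does not handle tied scores specially. True negatives do not enter the precision and recall calculations, so they are not counted here.

```python
def average_precision(scores, labels):
    """AP as the sum of precisions weighted by the increase in recall.

    scores: predicted confidence per sample.
    labels: ground truth per sample, 1 for positive and 0 for negative.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        return 0.0

    # Rank predictions from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    tp = fp = 0
    ap = 0.0
    prev_recall = 0.0
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / total_pos
        # Weight each precision by the recall gained at this step.
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


# Hypothetical scores and labels: AP = 1/3 + 2/9 + 1/4 = 29/36.
print(average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 1]))
```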

The mAP is calculated by finding the Average Precision (AP) for each class and then averaging over the number of classes.

The mAP incorporates the trade-off between precision and recall and considers both false positives (FP) and false negatives (FN). This property makes mAP a suitable metric for most detection applications.

The **precision-recall curve** is obtained by plotting the model's precision and recall values as a function of the model's confidence score threshold.

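**Precision** is a measure of how many of the model's positive predictions are correct.

**Recall** is a measure of how many of the ground-truth positives the model finds.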
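To make the idea of a curve concrete, the sketch below sweeps a confidence threshold over a handful of made-up scores and records precision and recall at each setting. The function name, the sample data, and the convention of reporting precision as 1.0 when nothing is predicted are all assumptions made for illustration.

```python
def pr_curve(scores, labels, thresholds):
    """Return a list of (threshold, precision, recall) tuples.

    A sample counts as a positive prediction when its score >= threshold.
    """
    total_pos = sum(labels)
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        # Convention assumed here: precision is 1.0 when nothing is predicted.
        precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
        recall = tp / total_pos if total_pos > 0 else 0.0
        points.append((t, precision, recall))
    return points


# Hypothetical data: lowering the threshold raises recall from 1/3 to 1.
for t, p, r in pr_curve([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 1], [0.85, 0.65, 0.5]):
    print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
```

Plotting the recorded recall values against the precision values gives the precision-recall curve for this toy example.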

Why do we need to use a precision-recall curve instead of precision and recall independently?

*The **Paperspace article on mAP** clearly articulates the tradeoff of using precision and recall as independent metrics as follows.*

*“When a model has **high recall but low precision**, then the model classifies most of the positive samples correctly but it has many false positives (i.e. classifies many Negative samples as Positive).”*

*“When a model has **high precision but low recall**, then the model is accurate when it classifies a sample as Positive but it may classify only some of the positive samples.”*

The precision-recall curve encapsulates the tradeoff between the two metrics across all confidence thresholds. It gives us a better idea of the overall accuracy of the model.

Depending on the problem at hand, the confidence score threshold can be adjusted to trade precision for recall and vice versa. For instance, if you are dealing with a cancer tumor detection problem, avoiding false negatives is a higher priority than avoiding false positives.

We should avoid missing tumors, even at the cost of making more detections with lower accuracy. Lowering the confidence score threshold will encourage the model to output more predictions (higher recall) at the expense of a lower share of correct predictions (lower precision).

The precision-recall curve is **downward sloping** because, as the confidence score threshold is decreased, more predictions are made (increasing recall) and a smaller share of them are correct (lowering precision).

Consider a situation where you are supposed to guess all the countries in the world.

You will confidently predict the names of a few countries (maybe 10 or 20) quickly with maximum precision. However, with each additional guess, you will approach higher recall while lowering the precision of your guesses. If the precision-recall curve is upward sloping, then there is most likely an issue with the model's confidence score.

Over the years, AI researchers have tried to combine precision and recall into a single metric to compare models. There are a couple of metrics that are widely used:

**F1 Score**: the harmonic mean of precision and recall, which measures the balance between the two. It can be used to find the confidence score threshold at which precision and recall give the highest F1 score. A high F1 score requires both precision and recall to be high.
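As an illustrative sketch, the F1 score can be computed as the harmonic mean of precision and recall, and the best threshold picked by scanning a list of precision-recall points. The helper names and the sample points are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def best_f1(pr_points):
    """Return (best F1, threshold) from (threshold, precision, recall) tuples."""
    return max((f1_score(p, r), t) for t, p, r in pr_points)


# Hypothetical precision-recall points at three confidence thresholds.
points = [(0.85, 1.0, 1 / 3), (0.65, 2 / 3, 2 / 3), (0.5, 0.75, 1.0)]
score, threshold = best_f1(points)
print(f"best F1 = {score:.3f} at threshold {threshold}")  # best F1 = 0.857 at threshold 0.5
```

Note that the F1 score summarizes a single operating point, so the result depends on which thresholds are scanned.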

**AUC (Area Under the Curve)** covers the area underneath the precision-recall curve.

The Area Under the Curve for the precision-recall curve (PR-AUC) summarizes the precision and recall values at different thresholds in a single metric.
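One simple way to approximate this area is the trapezoidal rule over the curve's points. This is a sketch under assumptions: the function name is made up, the recall values must be sorted in ascending order, and other integration schemes (such as the step-wise sum used for AP) give slightly different numbers.

```python
def pr_auc(recalls, precisions):
    """Trapezoidal area under a precision-recall curve.

    recalls must be sorted in ascending order and aligned with precisions.
    """
    area = 0.0
    for i in range(1, len(recalls)):
        width = recalls[i] - recalls[i - 1]
        # Average the two neighbouring precisions over this recall interval.
        area += width * (precisions[i] + precisions[i - 1]) / 2
    return area


# Hypothetical curve: 0.5 * 0.9 + 0.5 * 0.7, i.e. an area of about 0.8.
print(pr_auc([0.0, 0.5, 1.0], [1.0, 0.8, 0.6]))
```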

The above image clearly shows how precision and recall values are incorporated in each metric: **F1**, **Area Under Curve (AUC)**, and **Average Precision (AP)**. The choice of accuracy metric depends heavily on the type of problem.

AUC and AP are considered superior metrics compared to the F1 score because they cover the whole precision-recall curve rather than a single point on it. For interpretability purposes, researchers use AP as a standard metric.

**Object Detection** is a well-known computer vision problem where models seek to localize the relevant objects in images and classify those objects into relevant classes. The mAP is used as a standard metric to analyze the accuracy of an object detection model.

Let us walk through an object detection example for mAP calculation.

Consider the image below of cars driving on the highway, where the model’s task is to detect the cars. The output of the model is shown as red boxes. The model gave seven detections from P1 to P7, and the IoU values are calculated w.r.t. the ground truth.

For object detection tasks, precision is calculated based on the IoU threshold, so the precision value varies with the IoU threshold.

If IoU threshold = **0.8**, then precision is **66.67%** (4 out of 6 are considered correct).

If IoU threshold = **0.5**, then precision is **83.33%** (5 out of 6 are considered correct).

If IoU threshold = **0.2**, then precision is **100%** (6 out of 6 are considered correct).
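The arithmetic above can be reproduced with a short sketch. The IoU values below are hypothetical: the actual values from the image are not available here, so they were chosen only so that 4, 5, and 6 of the 6 detections pass the three thresholds. The sketch also assumes each detection is matched to a distinct ground-truth box.

```python
def precision_at_iou(ious, threshold):
    """Share of detections whose IoU with the ground truth meets the threshold."""
    correct = sum(1 for value in ious if value >= threshold)
    return correct / len(ious)


# Hypothetical IoU values for six detections (not taken from the article's image).
ious = [0.90, 0.85, 0.82, 0.81, 0.60, 0.30]

for threshold in (0.8, 0.5, 0.2):
    print(f"IoU threshold {threshold}: precision = {precision_at_iou(ious, threshold):.2%}")
```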

This shows that the AP metric depends on the IoU threshold. Choosing a single IoU threshold is somewhat arbitrary: it has to be picked carefully for each task, because the accuracy expected of the model may vary. Hence, to avoid this ambiguity while evaluating an object detection model, the mean average precision (mAP) came into existence.

The idea of mAP is pretty simple: consider a set of IoU thresholds in the AP calculation.

Calculate AP across a set of IoU thresholds for each class **k** and then take the average of all AP values. This eliminates the necessity of picking an optimal IoU threshold by using a set of IoU thresholds that covers tail ends of precision and recall values.

In the sketch above, the orange line represents the high IoU requirement (around 90%), and the blue line represents the low IoU requirement (around 10%). The set of IoU thresholds represents the number of lines in the PR curve.

For each class k, we average the AP across the different IoU thresholds, and the final mAP across the test data is calculated by taking the average of these per-class values.
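A minimal sketch of this two-level averaging, assuming the AP values have already been computed and are stored in a nested dictionary keyed by class name and IoU threshold (the data layout, function name, and numbers are all hypothetical):

```python
def mean_average_precision(ap_table):
    """ap_table maps class name -> {IoU threshold: AP}.

    Returns (mAP, per-class averages over IoU thresholds).
    """
    per_class = {}
    for cls, ap_by_iou in ap_table.items():
        # Average AP over the IoU thresholds for this class.
        per_class[cls] = sum(ap_by_iou.values()) / len(ap_by_iou)
    # Average the per-class values to get the final metric.
    map_value = sum(per_class.values()) / len(per_class)
    return map_value, per_class


# Hypothetical AP values for two classes at two IoU thresholds.
ap_table = {
    "car": {0.5: 0.8, 0.75: 0.6},
    "truck": {0.5: 0.6, 0.75: 0.4},
}
map_value, per_class = mean_average_precision(ap_table)
print(per_class, map_value)
```

When every class uses the same set of IoU thresholds, averaging per class first and then across classes gives the same result as averaging over all class and threshold pairs at once.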

The mAP calculation varies in different object detection challenges.

**COCO mAP**

According to the COCO 2017 challenge evaluation guidelines, the mAP was calculated by averaging the AP over **80 object classes** and all 10 IoU thresholds from 0.5 to 0.95.

The primary challenge metric in the COCO 2017 challenge is calculated as follows:

- AP is calculated at an IoU threshold of 0.5 for each class.
- Precision is calculated at every recall value (0 to 1 with a step size of 0.01), and the calculation is then repeated for IoU thresholds of 0.55, 0.60, …, 0.95.
- The average is taken over all 80 classes and all 10 thresholds.
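The recall sampling in the second step can be sketched as below. This is a simplified illustration in the spirit of the procedure, not the official evaluation code: it assumes the precision and recall values for one class at one IoU threshold are already available in order of increasing recall, and it assumes that precision at each sampled recall level is taken from the first curve point that reaches that recall, after making precision non-increasing. All names are hypothetical.

```python
def sampled_ap(recalls, precisions, num_points=101):
    """AP as mean precision sampled at evenly spaced recall levels (0 to 1).

    recalls must be sorted ascending and aligned with precisions.
    """
    # Interpolate: each precision becomes the maximum precision at any
    # equal or higher recall, so the curve never increases.
    interp = list(precisions)
    for i in range(len(interp) - 2, -1, -1):
        interp[i] = max(interp[i], interp[i + 1])

    total = 0.0
    for k in range(num_points):
        level = k / (num_points - 1)
        # Precision at the first point whose recall reaches this level;
        # 0.0 if the curve never gets that far.
        sampled = 0.0
        for rec, prec in zip(recalls, interp):
            if rec >= level:
                sampled = prec
                break
        total += sampled
    return total / num_points


# The 10 IoU thresholds 0.50, 0.55, ..., 0.95 over which AP is averaged.
iou_thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]

# Hypothetical curve: 51 levels sample precision 1.0 and 50 sample 0.5.
print(sampled_ap([0.5, 1.0], [1.0, 0.5]))
```

The final metric would then average `sampled_ap` over every class and every value in `iou_thresholds`.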

Moreover, additional metrics are used to identify the model’s accuracy on different object scales (APsmall, APmedium, and APlarge).

Have a look at the COCO mAP comparison table for a popular one-stage object detector, **YOLOv3**, vs. two-stage detectors

The Google Open Images Dataset V4 Competition uses mean Average Precision (mAP) over the 500 classes to evaluate the object detection algorithms.

PASCAL VOC Challenge: The current PASCAL VOC object detection challenge metrics are the Precision x Recall curve and Average Precision (AP).

Here's everything we've covered so far:

- Mean Average Precision (mAP) is the current benchmark metric used by the computer vision research community to evaluate the robustness of object detection models.

- Precision measures how many of the model's predictions are correct, whereas recall measures how many of the ground-truth objects the model finds.

- mAP encapsulates the tradeoff between precision and recall and accounts for both false positives and false negatives.

- In object detection, true and false positives are classified using the IoU threshold.

- Calculating mAP over an IoU threshold range avoids the ambiguity of picking the optimal IoU threshold for evaluating the model's accuracy.