If you’ve ever built an object detector or dabbled with projects involving information retrieval and re-identification (ReID), you’ve probably come across the metric called Mean Average Precision (mAP).
Mean Average Precision (mAP) is commonly used to analyze the performance of object detection and segmentation systems.
Many object detection algorithms, such as Faster R-CNN, MobileNet SSD, and YOLO, use mAP to evaluate their models. The mAP is also used across several benchmark challenges, such as PASCAL VOC, COCO, and more.
Here’s what we’ll cover:
And in case you are interested in building your own computer vision models—you are in for a treat! V7 gives you access to one of the best Open Datasets libraries and the tools to annotate your data and train your AI models in hours, not weeks.
V7 allows you to build image classifiers, object detectors, OCR, and semantic segmentation models.
Check out:
Now, let’s dive in!
Mean Average Precision (mAP) is a metric used to evaluate object detection models such as Fast R-CNN, YOLO, Mask R-CNN, etc. The mean of the average precision (AP) values is calculated over recall values from 0 to 1.
The mAP formula is based on the following sub-metrics:
Confusion Matrix
Intersection over Union (IoU)
Precision
Recall
Let’s discuss each sub-metric and how it is interpreted.
To create a confusion matrix, we need four attributes:
True Positives (TP): The model predicted a label that matches the ground truth.
True Negatives (TN): The model did not predict a label, and it is not part of the ground truth.
False Positives (FP): The model predicted a label, but it is not part of the ground truth (Type I error).
False Negatives (FN): The model did not predict a label that is part of the ground truth (Type II error).
Intersection over Union (IoU) measures the overlap between the predicted bounding box and the ground truth box. A higher IoU indicates that the predicted bounding box coordinates closely resemble the ground truth box coordinates.
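To make this concrete, here is a minimal sketch of an IoU computation for two axis-aligned boxes in (x1, y1, x2, y2) format (the function name and box format are just illustrative assumptions):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    # Intersection area is zero when the boxes do not overlap
    intersection = max(0.0, x2 - x1) * max(0.0, y2 - y1)

    # Union = area of A + area of B - intersection
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


print(iou([0, 0, 10, 10], [5, 5, 15, 15]))  # ~0.14: the boxes overlap only partially
```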
Precision measures how well you can find true positives (TP) out of all positive predictions (TP + FP).
In object detection tasks, for instance, precision is calculated using an IoU threshold.
In the image below, the cat on the left has an IoU of 0.3 (< IoU threshold) w.r.t. the ground truth and is classified as a false positive. In contrast, the cat on the right is classified as a true positive because it has an IoU of 0.7 (> IoU threshold) w.r.t. the ground truth.
The precision value may vary based on the model's confidence threshold.
Recall measures how well you can find true positives (TP) out of all actual positives (TP + FN).
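As a minimal sketch, both metrics follow directly from the confusion-matrix counts (the counts below are purely illustrative):

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Illustrative counts: 5 correct detections, 1 spurious detection, 2 missed objects
print(precision_recall(tp=5, fp=1, fn=2))  # (0.833..., 0.714...)
```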
Average Precision is calculated as the weighted mean of precisions at each threshold; the weight is the increase in recall from the prior threshold.
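Under that definition, a minimal sketch of the AP calculation could look like this (assuming precision/recall pairs have already been computed at each confidence threshold and are ordered by increasing recall):

```python
def average_precision(precisions, recalls):
    """Weighted mean of precisions, weighted by the increase in recall."""
    ap = 0.0
    previous_recall = 0.0
    for p, r in zip(precisions, recalls):
        ap += p * (r - previous_recall)  # weight = increase in recall
        previous_recall = r
    return ap


# Illustrative precision/recall pairs obtained by lowering the confidence threshold
print(average_precision([1.0, 0.8, 0.66, 0.57], [0.2, 0.4, 0.6, 0.8]))  # ≈ 0.606
```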
Mean Average Precision is the average of AP of each class. However, the interpretation of AP and mAP varies in different contexts. For instance, in the evaluation document of the COCO object detection challenge, AP and mAP are the same.
Here is a summary of the steps to calculate the AP:
1. Generate the prediction scores using the model.
2. Decide whether each prediction is a TP or FP (for detection, a prediction counts as correct if its IoU with the ground truth exceeds the IoU threshold).
3. Calculate the precision and recall at each confidence score threshold.
4. Compute AP as the weighted mean of precisions, where the weight is the increase in recall from the prior threshold.
The mAP is calculated by finding the Average Precision (AP) for each class and then averaging over the number of classes.
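Continuing the sketch, the per-class averaging is just a mean over the AP values (the class names and numbers below are made up):

```python
def mean_average_precision(ap_per_class):
    """mAP = mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# Made-up per-class AP values
print(mean_average_precision({"car": 0.72, "person": 0.65, "dog": 0.58}))  # 0.65
```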
The mAP incorporates the trade-off between precision and recall and considers both false positives (FP) and false negatives (FN). This property makes mAP a suitable metric for most detection applications.
The precision-recall curve is obtained by plotting the model's precision and recall values as a function of the model's confidence score threshold.
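For a model that outputs confidence scores, scikit-learn can sweep the threshold for us; here is a minimal sketch with made-up labels and scores:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

# Made-up ground-truth labels (1 = positive) and predicted confidence scores
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_score = [0.9, 0.8, 0.75, 0.7, 0.6, 0.55, 0.4, 0.35, 0.3, 0.2]

# precision_recall_curve evaluates precision and recall at every score threshold
precision, recall, thresholds = precision_recall_curve(y_true, y_score)

plt.plot(recall, precision, marker="o")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall curve")
plt.show()
```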
Precision is a measure of "when your model predicts, how often does it predict correctly?" It indicates how much we can rely on the model's positive predictions.
Recall is a measure of "has your model predicted every time that it should have predicted?" It indicates whether the model is missing predictions that it should not have missed.
Why do we need to use a precision-recall curve instead of precision and recall independently?
The Paperspace article on mAP clearly articulates the tradeoff of using precision and recall as independent metrics:
“When a model has high recall but low precision, then the model classifies most of the positive samples correctly but it has many false positives (i.e. classifies many Negative samples as Positive).”
“When a model has high precision but low recall, then the model is accurate when it classifies a sample as Positive but it may classify only some of the positive samples.”
The precision-recall curve captures the tradeoff between the two metrics and gives us a better idea of the model's overall accuracy.
Depending on the problem at hand, the confidence score threshold lets the model trade off precision for recall and vice versa. For instance, if you are dealing with a cancer tumor detection problem, avoiding false negatives is a higher priority than avoiding false positives.
We would rather detect more candidate tumors with lower precision than miss a tumor altogether. Lowering the confidence score threshold encourages the model to output more predictions (higher recall) at the expense of more incorrect ones (lower precision).
The precision-recall curve is downward sloping because, as the confidence score threshold is decreased, more predictions are made (increasing recall) and a smaller fraction of them are correct (lowering precision).
Consider a situation where you are supposed to guess all the countries in the world.
You will confidently and quickly predict the names of a few countries (maybe 10 or 20) with maximum precision. However, with each additional guess, you will increase recall while lowering the precision of your guesses. If the precision-recall curve is upward sloping, there is most likely an issue with the model's confidence scores.
Over the years, AI researchers have tried to combine precision and recall into a single metric to compare models. There are a couple of metrics that are widely used:
The Area Under the Precision-Recall Curve (PR-AUC) summarizes the PR values at different thresholds in a single metric.
The above image clearly shows how precision and recall values are incorporated in each metric: F1, Area Under the Curve (AUC), and Average Precision (AP). The choice of accuracy metric depends heavily on the type of problem.
AUC and AP are considered superior to the F1 score because they summarize the whole precision-recall curve rather than a single operating point. For interpretability purposes, researchers use AP as the standard metric.
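With scikit-learn, all three summaries can be computed from the same labels and scores; a minimal sketch (again with made-up data) looks like this:

```python
from sklearn.metrics import auc, average_precision_score, f1_score, precision_recall_curve

# Made-up ground-truth labels and predicted confidence scores
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_score = [0.9, 0.8, 0.75, 0.7, 0.6, 0.55, 0.4, 0.35, 0.3, 0.2]

# PR-AUC: area under the precision-recall curve
precision, recall, _ = precision_recall_curve(y_true, y_score)
pr_auc = auc(recall, precision)

# AP: weighted mean of precisions, weighted by recall increments
ap = average_precision_score(y_true, y_score)

# F1: harmonic mean of precision and recall at a fixed 0.5 score threshold
f1 = f1_score(y_true, [int(s >= 0.5) for s in y_score])

print(f"PR-AUC={pr_auc:.3f}  AP={ap:.3f}  F1={f1:.3f}")
```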
Object Detection is a well-known computer vision problem where models seek to localize the relevant objects in images and classify those objects into relevant classes. The mAP is used as a standard metric to analyze the accuracy of an object detection model.
Let us walk through an object detection example for mAP calculation.
Consider the below image of cars driving on the highway, and the model’s task is to detect the cars. The output of the model is shown as red boxes. The model gave seven detections from P1 to P7, and the IoU values are calculated w.r.t. ground truth.
For object detection tasks, precision is calculated based on the IoU threshold, so the precision value changes as the IoU threshold changes.
If IoU threshold = 0.8 then precision is 66.67%. (4 out of 6 are considered correct)
If IoU threshold = 0.5 then precision is 83.33%. (5 out of 6 are considered correct)
If IoU threshold = 0.2 then precision is 100%. (6 out of 6 are considered correct)
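To make the dependence on the IoU threshold explicit, here is a minimal sketch that simply counts detections above a given threshold (the IoU values below are hypothetical stand-ins, not the exact ones from the image):

```python
def precision_at_iou(ious, iou_threshold):
    """Fraction of detections whose IoU with the ground truth meets the threshold."""
    correct = sum(1 for value in ious if value >= iou_threshold)
    return correct / len(ious)


# Hypothetical IoU values for six detections
ious = [0.95, 0.88, 0.84, 0.81, 0.62, 0.35]
for threshold in (0.8, 0.5, 0.2):
    print(f"IoU threshold {threshold}: precision = {precision_at_iou(ious, threshold):.2%}")
```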
This shows that the AP metric is dependent on the IoU threshold. Choosing the IoU threshold can become an arbitrary process, as it has to be chosen carefully for each task and the accuracy expected from the model may vary. Hence, to avoid this ambiguity when evaluating an object detection model, the mean average precision (mAP) came into existence.
The idea behind mAP is pretty simple: consider a set of IoU thresholds in the AP calculation.
Calculate AP across a set of IoU thresholds for each class k and then take the average of all AP values. This eliminates the need to pick a single optimal IoU threshold by using a set of IoU thresholds that covers the tail ends of the precision and recall values.
In the sketch above, the orange line represents the high IoU requirement (around 90%), and the blue line represents the low IoU requirement (around 10%). The set of IoU thresholds represents the number of lines in the PR curve.
For each class k, we calculate the AP across the different IoU thresholds, and the final mAP over the test data is obtained by averaging these per-class AP values.
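As a sketch, the averaging over classes and IoU thresholds could be written as follows (reusing the hypothetical average_precision idea from earlier; ap_fn is an assumed callback that returns the AP for one class at one IoU threshold):

```python
import numpy as np

IOU_THRESHOLDS = np.linspace(0.5, 0.95, 10)  # 0.50, 0.55, ..., 0.95


def map_over_thresholds(ap_fn, class_names, iou_thresholds=IOU_THRESHOLDS):
    """Average the AP over every class and every IoU threshold."""
    ap_values = [ap_fn(name, t) for name in class_names for t in iou_thresholds]
    return float(np.mean(ap_values))


# Dummy ap_fn that pretends AP drops linearly as the IoU requirement grows
print(map_over_thresholds(lambda name, t: 1.0 - t, ["car", "person"]))  # ≈ 0.275
```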
The mAP calculation varies in different object detection challenges.
COCO mAP
According to the COCO 2017 challenge evaluation guidelines, the mAP was calculated by averaging the AP over 80 object classes AND all 10 IoU thresholds from 0.5 to 0.95 with a step size of 0.05.
The primary challenge metric in the COCO 2017 challenge is AP@[IoU=.50:.05:.95], i.e., the AP averaged over the 10 IoU thresholds and all 80 categories.
Moreover, additional metrics are used to measure the model's accuracy on different object scales (APsmall, APmedium, and APlarge).
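In practice, COCO-style mAP is rarely computed by hand: the pycocotools package implements the official evaluation. A minimal sketch, assuming the ground truth and the model's detections are already stored in COCO JSON format (the file names below are placeholders):

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Placeholder files: COCO-format ground-truth annotations and detection results
coco_gt = COCO("instances_val2017.json")
coco_dt = coco_gt.loadRes("detections.json")

# "bbox" evaluates bounding boxes; "segm" would evaluate segmentation masks
coco_eval = COCOeval(coco_gt, coco_dt, "bbox")
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()  # prints AP@[0.50:0.95], AP50, AP75, APsmall/medium/large, and AR metrics
```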
Have a look at the COCO mAP comparison table for the popular one-stage object detector YOLOv3 vs. the two-stage detector Faster R-CNN.
Google Open Images Dataset V4 Competition uses mean Average Precision (mAP) over the 500 classes to evaluate the object detection algorithms.
PASCAL VOC Challenge: The current PASCAL VOC object detection challenge metrics are the Precision x Recall curve and Average Precision (AP).
Here's everything we've covered so far: