mAP and mIoU for network evaluation


mean Average Precision (mAP)

Before introducing mAP, let's first review a few basic concepts:

TP: true positive, a truly positive sample that is predicted as positive.
TN: true negative, a truly negative sample that is predicted as negative.
FP: false positive, a truly negative sample that is predicted as positive.
FN: false negative, a truly positive sample that is predicted as negative.

                  Predicted positive   Predicted negative
Actual positive   TP                   FN
Actual negative   FP                   TN

From these we can derive accuracy, precision, recall, and the F1 score.

# proportion of all samples that are classified correctly
Accuracy = (TP + TN) / (TP + TN + FP + FN)

# of all samples predicted positive, the proportion that really are positive
Precision = TP / (TP + FP)

# of all truly positive samples, the proportion that are predicted positive
Recall = TP / (TP + FN)

# a trade-off between precision and recall (their harmonic mean)
F1 = 2 * Precision * Recall / (Precision + Recall)
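As a quick sanity check, the four formulas can be evaluated on toy counts (the numbers below are made up for illustration):

```python
# made-up counts for a hypothetical binary classifier
TP, TN, FP, FN = 8, 5, 2, 1

accuracy = (TP + TN) / (TP + TN + FP + FN)          # 13 / 16
precision = TP / (TP + FP)                          # 8 / 10
recall = TP / (TP + FN)                             # 8 / 9
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
```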

mAP is the standard evaluation metric in object detection tasks. What is mAP, and why is it used?
In object detection we need to judge whether a predicted bounding box is correct. We compute the IoU between the predicted box and the ground-truth box and compare it against a threshold: if IoU > threshold, the prediction is considered correct. Raising the IoU threshold increases precision but lowers recall; lowering it increases recall but lowers precision. A single operating point is therefore not enough to evaluate a model, so how do we trade precision off against recall?
Since one operating point is not enough, we take multiple thresholds on the prediction confidence to obtain multiple precision and recall values. Plotting them gives the precision-recall curve (PR curve) shown below. The area enclosed by the PR curve and the coordinate axes is the AP.
[Figure: PR curve]
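Concretely, sweeping the confidence threshold amounts to sorting detections by score and taking cumulative TP/FP counts, one PR point per threshold. A minimal sketch with made-up match flags:

```python
import numpy as np

# match flags of detections sorted by descending confidence
# (1 = true positive, 0 = false positive; a made-up toy example)
match = np.array([1, 1, 0, 1, 0])
n_gt = 4  # total number of ground-truth boxes

tp = np.cumsum(match == 1)
fp = np.cumsum(match == 0)
precision = tp / (tp + fp)  # one (precision, recall) point per threshold
recall = tp / n_gt
```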

Before VOC2010, only 11 recall values [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0] are sampled, 11 points in total; for each, the maximum precision among points with recall greater than or equal to that value is taken, and the average of these 11 precisions is the AP (an 11-point approximation of the area under the PR curve).

From VOC2010 onward, for every distinct recall value (including 0 and 1), the maximum precision among points with recall greater than or equal to that value is selected, and the exact area under this interpolated PR curve is the AP.

Each category yields its own PR curve and therefore its own AP. Averaging the APs of all categories gives the mAP.

The method described here is the interpolated average AP; there are other ways of calculating AP as well, whose differences are not covered in this article.

The figure below shows the original PR curve (green) and the interpolated PR curve (blue dashed line). Computing the area enclosed by the original PR curve and the coordinate axes directly is hard (it requires integration), whereas the area under the blue dashed line is easy to compute. Interpolation fills in the rising parts of the PR curve, ensuring that the curve is monotonically decreasing.

[Figure: original PR curve (green) and interpolated PR curve (blue dashed line)]

The mAP calculation code is as follows:
1. First, count TP and FP for each category to obtain the per-class precision and recall.

import itertools
from collections import defaultdict

import numpy as np
import six


def calc_detection_voc_prec_rec(
        pred_bboxes, pred_labels, pred_scores, gt_bboxes, gt_labels,
        gt_difficults=None, iou_thresh=0.5):
    """Evaluation code for the Pascal VOC dataset: calculates precision and recall.

    Args:
        pred_bboxes (list): an iterable of prediction boxes, each an array
        pred_labels (list): an iterable of prediction labels
        pred_scores (list): an iterable of prediction confidence scores
        gt_bboxes (list): an iterable of ground-truth boxes
        gt_labels (list): an iterable of ground-truth box labels
        gt_difficults (list): an iterable of ground-truth difficulty flags;
            defaults to None, meaning every box is easy
        iou_thresh (float): a prediction is considered correct if its IoU
            with the corresponding ground-truth box exceeds this threshold

    Returns:
        prec (list): list of arrays; prec[l] is the precision of class l,
            or None if class l does not exist
        rec (list): list of arrays; rec[l] is the recall of class l,
            or None if class l does not exist
    """
    # turn all lists into iterators
    pred_bboxes = iter(pred_bboxes)
    pred_labels = iter(pred_labels)
    pred_scores = iter(pred_scores)
    gt_bboxes = iter(gt_bboxes)
    gt_labels = iter(gt_labels)
    if gt_difficults is None:
        gt_difficults = itertools.repeat(None)
    else:
        gt_difficults = iter(gt_difficults)

    # number of easy (non-difficult) ground-truth boxes per class
    n_pos = defaultdict(int)
    score = defaultdict(list)
    # whether each prediction box matches a ground-truth box
    match = defaultdict(list)

    # the six lists have the same length;
    # each iteration processes one image
    for pred_bbox, pred_label, pred_score, gt_bbox, gt_label, gt_difficult in \
            six.moves.zip(
                pred_bboxes, pred_labels, pred_scores,
                gt_bboxes, gt_labels, gt_difficults):

        if gt_difficult is None:
            gt_difficult = np.zeros(gt_bbox.shape[0], dtype=bool)

        # process each category separately
        for l in np.unique(np.concatenate((pred_label, gt_label)).astype(int)):
            # take out the prediction boxes and scores belonging to class l
            pred_mask_l = pred_label == l
            pred_bbox_l = pred_bbox[pred_mask_l]
            pred_score_l = pred_score[pred_mask_l]

            # sort prediction boxes in descending order of confidence score
            order = pred_score_l.argsort()[::-1]
            pred_bbox_l = pred_bbox_l[order]
            pred_score_l = pred_score_l[order]

            # take out the ground-truth boxes belonging to class l
            gt_mask_l = gt_label == l
            gt_bbox_l = gt_bbox[gt_mask_l]
            gt_difficult_l = gt_difficult[gt_mask_l]

            # count non-difficult ground-truth boxes per class (default: all)
            n_pos[l] += np.logical_not(gt_difficult_l).sum()
            score[l].extend(pred_score_l)

            # no prediction boxes for this class
            if len(pred_bbox_l) == 0:
                continue
            # no ground-truth boxes for this class: nothing can match
            if len(gt_bbox_l) == 0:
                match[l].extend((0,) * pred_bbox_l.shape[0])
                continue

            # VOC boxes are integer-typed; +1 makes the bottom-right inclusive
            pred_bbox_l = pred_bbox_l.copy()
            pred_bbox_l[:, 2:] += 1
            gt_bbox_l = gt_bbox_l.copy()
            gt_bbox_l[:, 2:] += 1

            # IoU between prediction boxes and ground-truth boxes
            iou = bbox_iou(pred_bbox_l, gt_bbox_l)
            # for each prediction box, the index of the ground-truth box
            # with the largest IoU
            gt_index = iou.argmax(axis=1)
            # if the IoU is below the threshold, the prediction box has no
            # corresponding ground-truth box, so set the index to -1
            gt_index[iou.max(axis=1) < iou_thresh] = -1
            del iou

            # whether each ground-truth box has already been matched;
            # note: each ground-truth box can match only one prediction box
            selec = np.zeros(gt_bbox_l.shape[0], dtype=bool)
            for gt_idx in gt_index:
                if gt_idx >= 0:
                    # if the matched ground-truth box is difficult, ignore it
                    if gt_difficult_l[gt_idx]:
                        match[l].append(-1)
                    else:
                        # true positive only if this ground-truth box has
                        # not been matched before
                        if not selec[gt_idx]:
                            match[l].append(1)
                        else:
                            match[l].append(0)
                    # mark the ground-truth box with index gt_idx as matched
                    selec[gt_idx] = True
                else:
                    match[l].append(0)

    n_fg_class = max(n_pos.keys()) + 1
    prec = [None] * n_fg_class
    rec = [None] * n_fg_class

    for l in n_pos.keys():
        score_l = np.array(score[l])
        match_l = np.array(match[l], dtype=np.int8)
        # sort in descending order of prediction confidence
        order = score_l.argsort()[::-1]
        match_l = match_l[order]

        tp = np.cumsum(match_l == 1)
        fp = np.cumsum(match_l == 0)
        # if an element of fp + tp is 0, the corresponding prec[l] entry is nan
        prec[l] = tp / (fp + tp)
        # if n_pos[l] is 0, rec[l] stays None
        if n_pos[l] > 0:
            rec[l] = tp / n_pos[l]

    return prec, rec
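The function above calls a helper bbox_iou that is not shown in the article. A minimal sketch, assuming boxes are stored as (ymin, xmin, ymax, xmax) corner coordinates:

```python
import numpy as np

def bbox_iou(bbox_a, bbox_b):
    """Pairwise IoU: bbox_a -> (N, 4), bbox_b -> (K, 4), result -> (N, K)."""
    # top-left and bottom-right corners of every pairwise intersection
    tl = np.maximum(bbox_a[:, None, :2], bbox_b[None, :, :2])
    br = np.minimum(bbox_a[:, None, 2:], bbox_b[None, :, 2:])
    # intersection area, zero when the boxes do not overlap
    area_i = np.prod(br - tl, axis=2) * (tl < br).all(axis=2)
    area_a = np.prod(bbox_a[:, 2:] - bbox_a[:, :2], axis=1)
    area_b = np.prod(bbox_b[:, 2:] - bbox_b[:, :2], axis=1)
    return area_i / (area_a[:, None] + area_b[None, :] - area_i)
```

For example, a 2x2 box compared with itself gives IoU 1, while shifting it by one pixel in each direction leaves an overlap of 1 against a union of 7.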

2. Calculate the AP for each category from prec and rec.

def calc_detection_voc_ap(prec, rec, use_07_metric=False):
    """Calculates the AP of each class from precision and recall.

    Args:
        prec: list of arrays
        rec: list of arrays
        use_07_metric (bool): if True, use the VOC2007 11-point metric

    Returns:
        ap (array): average precision of each class, shape -> (n_fg_class,)
    """
    n_fg_class = len(prec)
    ap = np.empty(n_fg_class)
    for l in six.moves.range(n_fg_class):
        if prec[l] is None or rec[l] is None:
            ap[l] = np.nan
            continue

        if use_07_metric:
            # 11-point metric
            ap[l] = 0
            for t in np.arange(0., 1.1, 0.1):
                if np.sum(rec[l] >= t) == 0:
                    p = 0
                else:
                    p = np.max(np.nan_to_num(prec[l])[rec[l] >= t])
                ap[l] += p / 11
        else:
            # interpolation algorithm
            # insert 0 at the beginning and the end to ensure the final
            # PR curve is decreasing
            mpre = np.concatenate(([0], np.nan_to_num(prec[l]), [0]))
            mrec = np.concatenate(([0], rec[l], [1]))

            # np.maximum.accumulate takes, along the given axis, the running
            # maximum of each element and everything before it; applied back
            # to front it fills in the rising parts of the PR curve.
            # the line below is equivalent to:
            # for i in range(mpre.size - 1, 0, -1):
            #     mpre[i - 1] = np.maximum(mpre[i - 1], mpre[i])
            mpre = np.maximum.accumulate(mpre[::-1])[::-1]

            # starting from position 1, the indices where recall differs
            # from the previous value
            i = np.where(mrec[1:] != mrec[:-1])[0]
            # area under the interpolated curve
            ap[l] = np.sum((mrec[i + 1] - mrec[i]) * mpre[i + 1])

    return ap
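The interpolation branch can be traced by hand on a toy PR curve (the precision and recall values below are made up):

```python
import numpy as np

prec = np.array([1.0, 0.5, 0.67])
rec = np.array([0.5, 0.5, 1.0])

# pad with 0s, interpolate from back to front, then sum the rectangles
mpre = np.concatenate(([0], np.nan_to_num(prec), [0]))
mrec = np.concatenate(([0], rec, [1]))
mpre = np.maximum.accumulate(mpre[::-1])[::-1]
i = np.where(mrec[1:] != mrec[:-1])[0]
ap = np.sum((mrec[i + 1] - mrec[i]) * mpre[i + 1])
# ap = 0.5 * 1.0 + 0.5 * 0.67 = 0.835
```

The dip to 0.5 at recall 0.5 is lifted to 0.67 (the best precision to its right), which is exactly the "fill in the rising part" step.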

3. Average the APs of all categories to obtain the mAP.
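In code this last step is a single nanmean over the per-class AP array (the AP values below are made up); nan entries, produced for classes that never appear, are skipped:

```python
import numpy as np

ap = np.array([0.835, 0.62, np.nan])  # made-up per-class APs; nan = absent class
mAP = np.nanmean(ap)                  # nan entries are ignored in the mean
```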

Mean Intersection over Union (mIoU)

mIoU is the standard evaluation metric in semantic segmentation tasks. Averaging the IoU of each category gives the mIoU. The IoU computation is illustrated in the figure below: IoU = overlap / union.

[Figure: IoU = overlap / union]

The mIoU calculation code is as follows:

  1. Compute the confusion matrix.

import numpy as np

def gen_matrix(gt_mask, pred_mask, class_num):
    """
    gt_mask (ndarray): shape -> (height, width), ground-truth segmentation map
    pred_mask (ndarray): shape -> (height, width), predicted segmentation result
    class_num: number of classes, not counting the background
    """
    # keep only pixels whose ground-truth label is a valid class
    mask = (gt_mask >= 0) & (gt_mask < class_num)
    # np.bincount counts occurrences of every value from 0 up to the maximum;
    # encoding each pixel as class_num * gt + pred gives one bin per
    # (ground truth, prediction) pair
    count = np.bincount(
        class_num * gt_mask[mask].astype(int) + pred_mask[mask],
        minlength=class_num ** 2)
    # confusion matrix: cf_mtx[i, j] counts pixels of true class i predicted as j
    cf_mtx = count.reshape(class_num, class_num)
    return cf_mtx
  2. From the confusion matrix, compute the IoU of each class, then average to obtain the mIoU.

def mean_iou(cf_mtx):
    """
    cf_mtx (ndarray): shape -> (class_num, class_num), confusion matrix
    """
    # per-class IoU: diagonal / (row sum + column sum - diagonal)
    iou = np.diag(cf_mtx) / (np.sum(cf_mtx, axis=1) +
                             np.sum(cf_mtx, axis=0) - np.diag(cf_mtx))
    # average the IoU over all categories
    mIou = np.nanmean(iou)
    return mIou
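Putting the two steps together on a tiny made-up 2x2 mask with two classes (the computation is inlined here so the snippet runs on its own):

```python
import numpy as np

gt = np.array([[0, 0], [1, 1]])    # ground-truth segmentation map
pred = np.array([[0, 1], [1, 1]])  # predicted segmentation result
class_num = 2

# confusion matrix via the bincount trick:
# cf[i, j] counts pixels of true class i predicted as class j
mask = (gt >= 0) & (gt < class_num)
cf = np.bincount(class_num * gt[mask].astype(int) + pred[mask],
                 minlength=class_num ** 2).reshape(class_num, class_num)

# per-class IoU, then average over classes
iou = np.diag(cf) / (cf.sum(axis=1) + cf.sum(axis=0) - np.diag(cf))
miou = np.nanmean(iou)
```

Here class 0 has one hit out of an overlap-plus-union span of 2 pixels (IoU 0.5) and class 1 has two hits against three touched pixels (IoU 2/3), giving an mIoU of 7/12.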