CPNDet: crudely adding two-stage refinement to CenterNet, faster and stronger | ECCV 2020

Time:2021-4-20

This paper, from the authors of CenterNet, proposes CPN, an anchor-free, two-stage object detection algorithm that uses keypoints to extract candidate boxes and then applies a two-step classifier for prediction. The overall idea is simple, yet CPN achieves good accuracy and inference speed and runs faster than the original keypoint-based algorithms. The source code is public, so it is worth trying.

Source: Xiaofei’s algorithm Engineering Notes official account

Paper: Corner Proposal Network for Anchor-free, Two-stage Object Detection

  • Address: http://arxiv.org/abs/2007.13816
  • Paper code: http://github.com/Duankaiwen/CPNDet

Introduction


A large number of excellent detectors have emerged from both anchor-based and anchor-free approaches. The paper argues that anchor-free methods have the advantage for objects with unusual shapes, but that they usually produce many false detections, as shown in Figure 1, so an independent classifier is needed to improve detection accuracy. The paper therefore proposes CPNDet (Corner Proposal Network), which combines the anchor-free and two-stage detection paradigms. First, corners are detected as in CornerNet, and valid corner pairs are enumerated to form a large number of candidate boxes. Because the candidate boxes contain many negative samples, a binary classifier is trained to filter out most of them, and a multi-class classifier then predicts the labels.

Anchor-based or Anchor-free? One-stage or Two-stage?


This paper mainly discusses anchor-based vs. anchor-free and one-stage vs. two-stage detection.

Anchor-based or Anchor-free?

Anchor-based methods place a large number of anchors on the feature map and then predict, for each anchor, whether it contains an object and which label it has. An anchor is usually tied to a specific position in the image and has a relatively fixed size; bounding-box regression can only slightly change its geometry. Anchor-free methods are not limited by preset anchors: they directly locate the keypoints of the object and then predict its shape and label. The paper therefore argues that anchor-free methods are more flexible in locating objects of arbitrary shape and achieve higher recall.

The paper also tabulates the recall of anchor-based and anchor-free methods on objects of different sizes and aspect ratios. Anchor-free methods usually have higher recall, especially for objects with large aspect ratios, where anchor-based methods have low recall because the preset anchors differ greatly from the objects. In addition, although FCOS is also anchor-free, it has to predict the distance from the keypoint to the box boundaries, which is difficult in this case.

One-stage or Two-stage?

Although anchor-free methods remove the constraints on finding candidate boxes, they lack information about the interior of the object, so it is difficult to relate keypoints to objects, which strongly affects detection accuracy. Establishing this relationship usually requires rich semantic information.

The paper takes CornerNet and CenterNet, both of which have high recall, as experimental subjects and tabulates the results. A stronger backbone improves accuracy, but many false detections remain. If the false detections that contain no object are removed ($AP_{refined}$) and the wrongly predicted labels are then corrected ($AP_{correct}$), accuracy improves significantly. These experiments show that, to relate keypoints to objects, the two-stage paradigm should be borrowed: features of the candidate boxes are extracted and used to filter out false detections.

The Framework of Corner Proposal Network


Based on the above analysis, the paper combines the anchor-free method with the two-stage paradigm and proposes the Corner Proposal Network (CPN), whose complete structure is shown in Figure 2. First, an anchor-free method extracts keypoints, and the keypoints are enumerated and combined into candidate boxes. Then two classifiers are used to filter the candidate boxes and predict their labels.

Stage 1: Anchor-free Proposals with Corner Keypoints

The first stage is the anchor-free candidate box extraction. Each object is assumed to be located by two keypoints. Following CornerNet, two corner heatmaps are output, and the top-k top-left corners and top-k bottom-right corners are selected. Valid pairs of keypoints are combined into candidate boxes; a pair is valid if:

  1. Both keypoints belong to the same category.
  2. The top-left corner lies above and to the left of the bottom-right corner.
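
The pairing rule can be sketched in a few lines of Python. This is only an illustration of the two validity conditions, not the authors' implementation: the `Corner` type, the function name `enumerate_proposals`, and the sample coordinates are all hypothetical, and image coordinates are assumed to grow rightward in x and downward in y.

```python
from typing import List, NamedTuple, Tuple


class Corner(NamedTuple):
    """Hypothetical container for one detected corner keypoint."""
    x: float
    y: float
    cls: int      # category predicted for the corner
    score: float  # heatmap confidence


Box = Tuple[float, float, float, float, int]  # (x1, y1, x2, y2, cls)


def enumerate_proposals(top_left: List[Corner],
                        bottom_right: List[Corner]) -> List[Box]:
    """Pair every top-left corner with every bottom-right corner
    and keep only the pairs that satisfy both validity conditions."""
    proposals: List[Box] = []
    for tl in top_left:
        for br in bottom_right:
            same_class = tl.cls == br.cls                  # condition 1
            well_ordered = tl.x < br.x and tl.y < br.y     # condition 2
            if same_class and well_ordered:
                proposals.append((tl.x, tl.y, br.x, br.y, tl.cls))
    return proposals


# Made-up corners: two top-left and three bottom-right candidates.
tl = [Corner(10, 10, 0, 0.9), Corner(50, 60, 1, 0.8)]
br = [Corner(40, 45, 0, 0.7), Corner(90, 95, 1, 0.6), Corner(5, 5, 0, 0.5)]
boxes = enumerate_proposals(tl, br)
print(boxes)
```

With top-k corners on each side, this enumeration yields up to k × k candidates, which is why the second stage needs a cheap filter.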

Although the candidate boxes are extracted as in CornerNet, the post-processing is quite different. CornerNet uses embedding vectors to group the keypoints. The paper argues that such embeddings are not guaranteed to be learnable, and instead uses an independent classifier, which can exploit the complete intermediate features to improve accuracy.

Stage 2: Two-step Classification for Filtering Proposals

Although candidate box extraction with an anchor-free method has very high recall, it also brings many false detections. The paper uses a two-step classification method for filtering and correction. First, a lightweight binary classifier filters out 80% of the candidate boxes, and then a multi-class classifier predicts the categories of the remaining candidate boxes.
In the first step, the binary classifier is trained to determine whether a candidate box is an object. A $7 \times 7$ RoIAlign extracts the features of each candidate box from the box feature map, and a $32 \times 7 \times 7$ convolution layer then outputs the classification confidence of each candidate box.

In the loss function of this step, $M$ is the total number of candidate boxes, $N$ is the number of positive samples, $p_m$ is the probability that the $m$-th candidate box is an object, and $\tau$ is the IoU threshold, which is set to 0.7.
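
A loss consistent with these symbols can be sketched as follows. The sketch assumes a focal-loss-style binary cross-entropy normalized by the number of positives. The exponent `alpha` and its default of 2, the `eps` guard, and the rule that a box is positive when its best IoU with any ground-truth box reaches $\tau$ are assumptions made for illustration and are not stated in the text above.

```python
import math
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def proposal_loss(probs: Sequence[float],
                  proposals: Sequence[Box],
                  gt_boxes: Sequence[Box],
                  tau: float = 0.7,
                  alpha: float = 2.0,   # assumed focusing exponent
                  eps: float = 1e-12) -> float:
    """Focal-style binary loss over M proposals, divided by the number
    of positives N (or by 1 when there are no positives)."""
    total = 0.0
    num_pos = 0
    for p, box in zip(probs, proposals):
        best = max((iou(box, g) for g in gt_boxes), default=0.0)
        if best >= tau:   # positive sample: push p toward 1
            num_pos += 1
            total += (1.0 - p) ** alpha * math.log(p + eps)
        else:             # negative sample: push p toward 0
            total += p ** alpha * math.log(1.0 - p + eps)
    return -total / max(num_pos, 1)


gt = [(0.0, 0.0, 10.0, 10.0)]
props = [(0.0, 0.0, 10.0, 10.0), (20.0, 20.0, 30.0, 30.0)]
good = proposal_loss([0.9, 0.1], props, gt)  # confident and correct
bad = proposal_loss([0.1, 0.9], props, gt)   # confident and wrong
print(good, bad)
```

Under this assumed form, well-classified boxes contribute almost nothing, so the many easy negatives produced by corner enumeration do not dominate training.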
The second step predicts the categories of the remaining candidate boxes. Because keypoints lack information about the interior of the object, their categories are usually not accurate, so a more powerful classifier is needed to predict from the RoI features. First, a $7 \times 7$ RoIAlign extracts the features of each candidate box from the category feature map, and then a $256 \times 7 \times 7$ convolution layer outputs a $C$-dimensional vector, where $C$ is the number of categories. The loss function is analogous to that of the first step.

$\hat{M}$ and $\hat{N}$ are the numbers of candidate boxes and positive samples remaining after filtering, $q_{m,c}$ is the confidence of the $m$-th candidate box for category $c$, and the remaining parameters are as in the first step.
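
For reference, one form of the classification loss that is consistent with these symbols is written out below. It is an assumed reconstruction by analogy with a focal-style binary loss. The exponent $\beta$ and the quantity $\mathrm{IoU}_{m,c}$ (taken here to mean the best IoU between the $m$-th candidate box and the ground-truth boxes of category $c$) are assumptions and are not defined in the text above.

```latex
L_{\mathrm{class}} = -\frac{1}{\hat{N}} \sum_{m=1}^{\hat{M}} \sum_{c=1}^{C}
\begin{cases}
  (1 - q_{m,c})^{\beta} \, \log q_{m,c}, & \text{if } \mathrm{IoU}_{m,c} \ge \tau \\
  q_{m,c}^{\beta} \, \log (1 - q_{m,c}), & \text{otherwise}
\end{cases}
```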

The Inference Process

The inference process is basically the same as the training process. Because training involves many low-quality candidate boxes, the values of $p_m$ and $q_{m,c}$ are biased toward zero, so the first step of inference uses a relatively low threshold (0.2) for filtering, which keeps about 20% of the candidate boxes. In the second step, each candidate box has two labels: the label predicted from the corners, with score $s_1$, and the label predicted by the second-stage classifier, with score $s_2$. When the score of either label is greater than 0.5, the candidate box is output, and its score is computed as $s_c = (s_1 + 0.5)(s_2 + 0.5)$ and then normalized to $[0, 1]$.
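
The two inference steps can be sketched as below. The function names are hypothetical, and the normalization is an assumption: the text only says the score is normalized to $[0, 1]$, so the sketch uses min-max scaling between the smallest possible product (0.25, when both scores are 0) and the largest (2.25, when both are 1).

```python
from typing import List, Optional, Sequence


def filter_proposals(objectness: Sequence[float],
                     threshold: float = 0.2) -> List[int]:
    """Step 1: keep the indices of proposals whose binary-classifier
    probability exceeds the (deliberately low) threshold."""
    return [i for i, p in enumerate(objectness) if p > threshold]


def final_score(s1: float, s2: float,
                strong: float = 0.5) -> Optional[float]:
    """Step 2: combine the corner score s1 and the classifier score s2.

    Returns None when neither score exceeds `strong`, meaning the
    proposal is not output. The min-max normalization is an assumption.
    """
    if max(s1, s2) <= strong:
        return None
    raw = (s1 + 0.5) * (s2 + 0.5)          # lies in [0.25, 2.25]
    return (raw - 0.25) / (2.25 - 0.25)    # rescaled to [0, 1]


kept = filter_proposals([0.05, 0.3, 0.19, 0.8])
print(kept)
print(final_score(0.5, 0.9), final_score(0.4, 0.3))
```

Adding 0.5 to each score before multiplying means that a low score from one source scales the result down without zeroing it, so a box can still be output on the strength of the other source.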

Experiments


Comparison with SOTA detection algorithms; the input resolution is $511 \times 511$.

B-classifier denotes the binary classifier and M-classifier denotes the multi-class classifier.

Compared with other keypoint-based methods, CPN has a lower false detection rate.

Performance comparison between the binary classifier and the embedding vectors of CornerNet.

Comparison of inference speed.

Conclusion


This paper proposes CPN, an anchor-free, two-stage object detection algorithm that uses keypoints to extract candidate boxes and then applies a two-step classifier for prediction. The overall idea is simple, but CPN achieves good accuracy and inference speed and runs faster than the original keypoint-based algorithms. The details of the paper are also worth studying.


