The YOLO series is a classic line of architectures in object detection. Although there are now many higher-quality and more complex networks, the YOLO design can still offer plenty of inspiration to algorithm engineers. These three papers read like a parameter-tuning manual: they teach you how to use various tricks to improve the accuracy of the object detection network at hand
Source: Xiaofei’s Algorithm Engineering Notes (WeChat official account)
YOLOv1
**You Only Look Once: Unified, Real-Time Object Detection**
Paper address: https://arxiv.org/abs/1506.02640
Introduction
YOLO is very simple: a single network can classify and locate multiple objects simultaneously, without the concept of proposals. It is a milestone one-stage real-time detection network. The standard version reaches 45 FPS on a Titan X and the fast version reaches 150 FPS, but the accuracy is not as good as the SOTA networks of the time
Unified Detection
If the center point of a GT box falls inside a grid cell, that cell is responsible for predicting the GT
 Each cell predicts $B$ bboxes, and each bbox predicts five values: $x, y, w, h$ and confidence. $(x, y)$ are the center coordinates relative to the cell boundary, while the width and height are relative to the whole image. The confidence reflects whether the cell contains an object and how accurate the box is, defined as $\Pr(\text{Object}) * \text{IOU}_{pred}^{truth}$: 0 if there is no object, and the IOU if there is one
 Each cell also predicts the conditional probabilities of the $C$ classes, $\Pr(\text{Class}_i \mid \text{Object})$. Note that this prediction is made per cell, not per bbox
At test time, each bbox confidence is multiplied by the class conditional probability to get the final class-specific score, $\Pr(\text{Class}_i \mid \text{Object}) * \Pr(\text{Object}) * \text{IOU}_{pred}^{truth} = \Pr(\text{Class}_i) * \text{IOU}_{pred}^{truth}$, which combines class accuracy and localization quality
For Pascal VOC, $S = 7$, $B = 2$ and $C = 20$ classes are used, so the network finally predicts $7 \times 7 \times (2 \times 5 + 20)$ values
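As a quick sanity check of the arithmetic above, the output size can be computed directly (the function name is illustrative, not from the paper):

```python
def yolo_v1_output_size(S=7, B=2, C=20):
    # each of the S*S cells predicts B boxes (x, y, w, h, confidence)
    # plus C conditional class probabilities
    return S * S * (B * 5 + C)

print(yolo_v1_output_size())  # 7 * 7 * 30 = 1470
```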
Network Design
The backbone consists of 24 convolutional layers plus 2 fully connected layers. There is no bypass structure like the Inception module; instead, $3 \times 3$ convolutions followed by $1 \times 1$ convolutions are used for dimensionality reduction. In addition, Fast YOLO reduces the network to 9 convolutional layers
Training
The first 20 layers of the backbone, followed by an average pooling layer and a fully connected layer, are pre-trained on ImageNet. For detection training, the input is increased from $224 \times 224$ to $448 \times 448$. The last layer uses a linear activation, and the other layers use leaky ReLU
The loss function is shown in Formula 3. One GT corresponds to only one bbox. Since most locations contain no target during training and positive localization samples are few, the weights $\lambda_{coord} = 5$ and $\lambda_{noobj} = 0.5$ are used. The loss consists of three parts:
 The first part is coordinate regression, which uses a squared-error loss. To make the model pay more attention to small errors on small objects rather than small errors on large objects, the square roots of the width and height are regressed, which weights them in disguise. Here $\mathbb{1}_{ij}^{obj}$ indicates whether the current bbox is responsible for a GT, which requires two conditions: first, the GT center falls in the cell that the bbox belongs to; second, among the $B$ boxes of that cell, this bbox has the largest IOU with the GT
 The second part is bbox confidence regression. $\mathbb{1}_{ij}^{obj}$ is the same as above, and $\mathbb{1}_{ij}^{noobj}$ marks the bboxes that are not $\mathbb{1}_{ij}^{obj}$; these are given a low weight because negative samples are numerous. When there is an object, $\hat{C}$ is actually the IOU, although many implementations simply use 1
 The third part is classification confidence, computed per cell rather than per bbox; $\mathbb{1}_{i}^{obj}$ indicates whether a GT center falls in cell $i$
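Putting the three parts together, Formula 3 from the paper reads:

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
  \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
  \left[ (\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} (C_i - \hat{C}_i)^2
 + \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} (C_i - \hat{C}_i)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in classes} (p_i(c) - \hat{p}_i(c))^2
\end{aligned}
```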
Inference
For Pascal VOC, 98 bboxes are predicted, and the results are post-processed with non-maximum suppression
Experiments
Summary
The groundbreaking one-stage detector attaches two fully connected layers after the convolutional network for localization and confidence prediction, and designs a new lightweight backbone. Although the accuracy is far from SOTA, the model is genuinely fast
The author mentions several limitations of YOLO:
 Each cell predicts only one category and two boxes, which handles dense scenes poorly
 Because of its strong dependence on data, it cannot generalize to objects with uncommon aspect ratios, and the features are too coarse due to the aggressive downsampling
 The loss treats errors in small and large boxes equally, while the same error hurts a small object's IOU far more; the main source of error is incorrect localization
YOLOv2
Paper: YOLO9000: Better, Faster, Stronger
Paper address: https://arxiv.org/abs/1612.08242
Introduction
Building on YOLOv1, YOLOv2 adds a series of popular improvements, making a faster and more accurate one-stage object detection algorithm. In addition, the author proposes YOLO9000, which can detect more than 9000 kinds of objects. The model is presented in three parts, Better / Faster / Stronger, which respectively introduce the tricks that improve accuracy, the methods that accelerate the network, and the implementation of super-large-scale classification
Better
YOLOv1 is still a rather naive design, so the author adds many accuracy-improving methods in YOLOv2, turning it into a carefully considered, complete network. The specific methods are summarized in Table 2

Batch Normalization
The BN layer accelerates the convergence of the network. Adding BN layers improves YOLO by 2% mAP and allows dropout to be discarded during training

High Resolution Classifier
The original YOLO backbone is pre-trained with $224 \times 224$ inputs and then switched directly to $448 \times 448$ for detection training, which forces the network to adapt to the new resolution and learn detection at the same time. To make the transition smoother, YOLOv2 first fine-tunes the backbone for 10 epochs with $448 \times 448$ inputs before detection training, which brings about a 4% mAP improvement

Convolutional With Anchor Boxes
YOLOv1 predicts bboxes directly. Referring to Faster R-CNN, where preset anchors achieve good results, YOLOv2 removes the fully connected layers and starts to use anchors
First, the last pooling layer is removed to keep the output resolution high, and the input resolution is changed to 416 to ensure that the feature map size is odd; this way there is a single central cell, which is convenient for predicting large objects. The final feature map is 1/32 of the input, i.e. $13 \times 13$. With anchors, predictions are bound to anchors instead of cells: each anchor predicts $C + 5$ values, where the objectness confidence predicts the IOU and the class confidences predict the conditional class probabilities. After adopting anchors the accuracy drops slightly: more boxes are output, recall improves, and precision falls relatively

Dimension Clusters
Anchors are usually set manually, which may not be optimal. K-means is used to cluster the training-set boxes to obtain more suitable default anchors. The clustering uses IOU as the distance metric, specifically $d(box, centroid) = 1 - IOU(box, centroid)$. As can be seen from Figure 2, five clusters give the best trade-off, which is also the setting used by YOLOv2
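A minimal sketch of this clustering, assuming boxes are given as (width, height) pairs (names and structure are illustrative, not the author's implementation):

```python
import random

def iou_wh(a, b):
    # IoU of two boxes sharing a common corner, so only (w, h) matter,
    # matching the YOLOv2 clustering distance d = 1 - IoU
    inter = min(a[0], b[0]) * min(a[1], b[1])
    union = a[0] * a[1] + b[0] * b[1] - inter
    return inter / union

def kmeans_anchors(boxes, k, iters=100, seed=0):
    random.seed(seed)
    centroids = random.sample(boxes, k)
    for _ in range(iters):
        # assign each box to the centroid with the smallest 1 - IoU
        clusters = [[] for _ in range(k)]
        for b in boxes:
            best = max(range(k), key=lambda i: iou_wh(b, centroids[i]))
            clusters[best].append(b)
        # recompute each centroid as the mean width/height of its cluster
        for i, c in enumerate(clusters):
            if c:
                centroids[i] = (sum(b[0] for b in c) / len(c),
                                sum(b[1] for b in c) / len(c))
    return centroids
```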

Direct location prediction
After introducing anchors, YOLOv2's initial training is very unstable, and the instability mainly comes from the error in the center point $(x, y)$. Region proposal methods shift the center point by offsets scaled by the anchor width and height; since these offsets are unconstrained, the center can end up anywhere in the image, which makes early training unstable
Therefore, YOLOv2 keeps YOLO's strategy of predicting the center position relative to the cell, using a logistic function to constrain the values to $[0, 1]$, while the width and height become ratios relative to the anchor width and height. Each bbox thus predicts 5 values. Constraining the center position brings a 5% mAP improvement
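With $(c_x, c_y)$ the top-left offset of the cell and $(p_w, p_h)$ the anchor prior, the constrained decoding can be sketched as follows (the function name is illustrative):

```python
import math

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    bx = cx + sigmoid(tx)   # center constrained to lie inside the cell
    by = cy + sigmoid(ty)
    bw = pw * math.exp(tw)  # width/height scale the anchor prior
    bh = ph * math.exp(th)
    return bx, by, bw, bh

print(decode_box(0.0, 0.0, 0.0, 0.0, 3, 4, 2, 5))  # (3.5, 4.5, 2.0, 5.0)
```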

FineGrained Features
Faster R-CNN and SSD predict from feature maps of different layers, while YOLOv2 proposes a passthrough layer, which subsamples the earlier $26 \times 26$ features at intervals, turning the original $26 \times 26 \times 512$ features into $13 \times 13 \times 2048$ (the feature map is divided into $2 \times 2$ blocks, and the values at positions 1, 2, 3 and 4 of all blocks are recombined into new feature maps), and then concatenates them with the final feature map for prediction, which brings a 1% mAP improvement
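A sketch of this passthrough (space-to-depth) rearrangement, assuming an (H, W, C) feature layout (a simplified stand-in, not Darknet's actual code):

```python
import numpy as np

def passthrough(x, stride=2):
    # rearrange (H, W, C) -> (H/stride, W/stride, C*stride*stride)
    # by stacking each 2x2 block's four positions along the channel axis
    h, w, c = x.shape
    out = np.zeros((h // stride, w // stride, c * stride * stride), dtype=x.dtype)
    for i in range(stride):
        for j in range(stride):
            k = (i * stride + j) * c
            out[:, :, k:k + c] = x[i::stride, j::stride, :]
    return out

print(passthrough(np.zeros((26, 26, 512))).shape)  # (13, 13, 2048)
```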

MultiScale Training
Since YOLOv2 is a fully convolutional network, the input size can be modified freely. During training, the input resolution is randomly switched every 10 batches, with candidate resolutions being multiples of 32, i.e. $\{320, 352, \ldots, 608\}$. In practical use, different resolutions can be chosen to trade off accuracy against speed. The results are shown in Table 3
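The resolution switching can be sketched as follows (a trivial stand-in for the training-loop logic):

```python
import random

def sample_resolution(low=320, high=608, stride=32):
    # pick a random multiple of 32 in [320, 608], as done every 10 batches
    return random.choice(range(low, high + 1, stride))
```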

Main Result
Faster
To speed things up, YOLOv2 uses a new backbone, Darknet-19, which contains 19 convolutional layers and 5 max-pooling layers. It uses $1 \times 1$ convolutions to compress the outputs of $3 \times 3$ convolutions, BN layers to stabilize training, accelerate convergence and regularize the model, and global average pooling for prediction
Stronger
YOLOv2 proposes combining classification data and detection data for training, obtaining a detection model that covers a very large number of classes

Hierarchical classification
The label granularity of ImageNet and COCO differs, so the data must be relabeled with multiple hierarchical labels, similar to the levels of biological taxonomy, and a WordTree is constructed
For example, Norfolk terrier and other breeds are child nodes under the terrier node, and the classification probability of Norfolk terrier is the product of the conditional probabilities of all nodes on the path from the root to that node
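The path-product rule can be sketched on a toy WordTree (all node names and conditional probabilities below are made up for illustration):

```python
# toy WordTree: node -> (parent, P(node | parent)); values are invented
tree = {
    "physical object": (None, 1.0),
    "animal":          ("physical object", 0.9),
    "dog":             ("animal", 0.7),
    "terrier":         ("dog", 0.8),
    "Norfolk terrier": ("terrier", 0.5),
}

def path_probability(node):
    # absolute probability = product of conditional probabilities
    # along the path from the root down to the node
    prob = 1.0
    while node is not None:
        parent, cond = tree[node]
        prob *= cond
        node = parent
    return prob
```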
After relabeling ImageNet-1k, the WordTree has 1369 nodes in total, and each group of sibling classes uses its own softmax. Darknet-19 retrained on this WordTree achieves 71.9% top-1 accuracy, only slightly lower. Judging from the results, most errors are fine-grained: a wrong prediction still considers the object a dog but gets the breed wrong, so this hierarchical classification should help guide feature extraction

Dataset combination with WordTree
Merging COCO and ImageNet gives the WordTree in Figure 6, which has 9418 classes

Joint classification and detection
Because there is much more ImageNet data, the COCO dataset is oversampled 4 times. When the input image is detection data, the full loss function is backpropagated, with the classification part restricted to the GT's label level and above. When the input image is classification data, the bbox with the highest classification confidence ($\geq 0.3$) is used to backpropagate the classification part of the loss
Training
YOLOv2's training is similar to YOLOv1's. First, each GT is assigned, according to its center point, to the bbox with the largest IOU in the corresponding cell (some implementations online use the anchor with the largest IOU, while the author's implementation uses the bbox, to be verified). The loss computation includes three parts:
 Bboxes whose largest IOU with any GT is smaller than the threshold regress only objectness, with target 0
 Bboxes matched with a GT regress all loss terms
 For all boxes, in the first 12800 iterations the coordinates are additionally regressed toward the preset anchors. Since positive samples for coordinate regression are very few, letting the predictions fit the anchors early on stabilizes training
Summary
Based on YOLO, YOLOv2 integrates methods from other work and makes many improvements:
 Add batch normalization
 Fine-tune the backbone at high resolution before detection training
 Add the anchor box mechanism
 Use k-means to assist anchor setting
 Use YOLO's method of constraining the center point relative to the cell
 Use a passthrough layer to fuse fine-grained low-level features
 Use multi-scale training to improve accuracy
 Propose Darknet-19 for acceleration
 Achieve super-large-scale classification using hierarchical classification
YOLOv3
Paper: YOLOv3: An Incremental Improvement
Paper address: https://arxiv.org/abs/1804.02767
Introduction
YOLOv3's publication is not a full paper; it is the author's write-up of minor follow-up work, mainly adding some effective tricks
Bounding Box Prediction
YOLOv3's overall coordinate regression is similar to YOLOv2's. Logistic regression is still used to predict the objectness of each anchor, and each GT lets only the anchor with the largest IOU produce the full loss (the paper speaks of bounding box priors, i.e. preset boxes, rather than predicted bounding boxes). Of the other anchors, those whose IOU with a GT is greater than 0.5 produce no loss at all, while those whose IOU with every GT is less than 0.5 produce only objectness loss
Class Prediction
To support multi-label prediction, independent logistic classifiers are used for class prediction, trained with a binary cross-entropy loss
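A minimal sketch of such independent logistic classifiers with binary cross-entropy (plain Python, not the Darknet implementation):

```python
import math

def multilabel_bce(logits, targets):
    # one independent sigmoid per class + binary cross-entropy,
    # instead of a single softmax over all classes
    total = 0.0
    for z, t in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-z))
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(logits)
```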
Predictions Across Scales
YOLOv3 performs bbox prediction on three feature maps of different scales. Similar to FPN, the high-level features are upsampled and then concatenated with lower-level feature maps, and each scale of feature map has three dedicated anchors. Several convolution layers first process the merged feature map, then a 3-D tensor is predicted, containing the location, objectness and class information; for COCO, for example, its size is $N \times N \times [3 \times (4 + 1 + 80)]$
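The channel count of each output tensor follows directly from this layout (a small helper for the arithmetic; the name is illustrative):

```python
def head_channels(num_anchors=3, num_classes=80):
    # per anchor: 4 box offsets + 1 objectness + per-class scores
    return num_anchors * (4 + 1 + num_classes)

print(head_channels())       # 255 for COCO (80 classes)
print(head_channels(3, 20))  # 75 for Pascal VOC (20 classes)
```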
Feature Extractor
YOLOv3 proposes a new backbone, Darknet-53, which fuses Darknet-19 with residual networks, adding shortcut connections on top of the earlier $3 \times 3$ and $1 \times 1$ convolutions
The accuracy of Darknet-53 is similar to the SOTA classification networks of the time, but it is much faster
Main Result
Summary
YOLOv3 is an informal release with relatively small improvements by the author, mainly integrating some accuracy-improving methods:
 Change class confidence prediction to independent logistic classifiers
 Add multi-scale prediction based on an FPN-like structure
 Propose Darknet-53, adding shortcut connections to the network
Conclusion
The YOLO series is a classic line of architectures in object detection. Although there are many higher-quality and more complex networks, the YOLO design can still offer plenty of inspiration to algorithm engineers. These three papers read like a reference manual: they teach you how to improve the accuracy of the object detection network at hand, and all the tricks are worth studying