# A brief introduction to YOLOv1 / v2 / v3 | Object detection

Time：2020-11-26

The YOLO series is a classic family of architectures in object detection. Although there are now many more accurate and more complex networks, the structure of YOLO can still offer algorithm engineers a lot of inspiration. The three papers read like a tuning manual: they show how to use a variety of tricks to improve the accuracy of the detection network you have at hand.

Source: Xiaofei’s algorithm Engineering Notes official account

# YOLOv1

**Paper: You Only Look Once: Unified, Real-Time Object Detection**

### Introduction

YOLO is very simple: a single network classifies and localizes multiple objects at the same time, with no concept of region proposals. It is a milestone among one-stage real-time detectors. The standard version reaches 45 FPS on a Titan X and the fast version reaches 150 FPS, but the accuracy is not as good as the SOTA networks of the time.

### Unified Detection

The input image is divided into an $S \times S$ grid. If the center point of a ground-truth box (GT) falls in a grid cell, that cell is responsible for predicting the GT:

• Each cell predicts $B$ bboxes, and each bbox predicts five values: $x, y, w, h$ and a confidence. $(x, y)$ is the center point of the bbox and $(w, h)$ its width and height. The center coordinates are relative to the bounds of the grid cell, and the width and height are relative to the whole image. The confidence reflects both whether the cell contains an object and how accurate the box is. It is defined as $\Pr(\text{Object}) \cdot \text{IOU}_{\text{pred}}^{\text{truth}}$: 0 if there is no object, and the IOU if there is one
• Each cell predicts $C$ conditional class probabilities $\Pr(\text{Class}_i \mid \text{Object})$. Note that this prediction is made per cell, not per bbox

At test time, each bbox's confidence is multiplied by the conditional class probabilities to get the final class-specific score, which combines the accuracy of both the category and the location.
For Pascal VOC, $S = 7$, $B = 2$ and $C = 20$ classes, so the final prediction is a $7 \times 7 \times (2 \times 5 + 20)$ tensor.
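As a quick check of the numbers above, here is a minimal sketch that computes the output tensor shape and the test-time class-specific score. The function names are hypothetical and are not taken from the paper's code.

```python
def yolo_v1_output_shape(S=7, B=2, C=20):
    """Shape of the prediction tensor: S x S cells, each holding
    B boxes (5 values each) plus C conditional class probabilities."""
    return (S, S, B * 5 + C)


def class_specific_scores(box_confidence, class_probs):
    """Multiply one box's confidence by its cell's conditional class probabilities."""
    return [box_confidence * p for p in class_probs]


shape = yolo_v1_output_shape()                 # Pascal VOC settings
num_values = shape[0] * shape[1] * shape[2]    # total numbers predicted per image
scores = class_specific_scores(0.5, [0.5, 0.25, 0.25])  # toy 3-class example
print(shape, num_values, scores)
```

With the Pascal VOC settings, the network outputs 1470 numbers per image.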

### Network Design

The backbone network consists of 24 convolutional layers followed by 2 fully connected layers. There are no branch modules like the inception module. Instead, a $1 \times 1$ reduction layer followed by a $3 \times 3$ convolution is used. Fast YOLO reduces the network to 9 convolutional layers.

### Training

The first 20 convolutional layers of the backbone, followed by an average pooling layer and a fully connected layer, are pre-trained on ImageNet. For detection training, the input resolution is increased from $224 \times 224$ to $448 \times 448$. The final layer uses a linear activation, and all other layers use leaky ReLU.

The loss function is given in Formula 3 of the paper. One GT corresponds to only one bbox. Because training contains many non-object boxes and relatively few samples for localization, the weights $\lambda_{coord} = 5$ and $\lambda_{noobj} = 0.5$ are used to balance them. The loss consists of three parts:

• The first part is coordinate regression, which uses a squared-error loss. A small deviation matters more in a small box than in a large box, so the square roots of the width and height are regressed, which implicitly re-weights the width and height errors. Here $\mathbb{1}_{ij}^{obj}$ indicates whether the current bbox is responsible for predicting a GT, which requires two conditions. First, the center point of the GT lies in the cell that the bbox belongs to. Second, among the $B$ boxes of that cell, this bbox has the largest IOU with the GT
• The second part is the regression of the bbox confidence. $\mathbb{1}_{ij}^{obj}$ is the same as above, and $\mathbb{1}_{ij}^{noobj}$ marks the bboxes for which $\mathbb{1}_{ij}^{obj}$ is 0. These are given a low weight because of the large number of negative samples. If there is an object, the target $\hat{C}$ is actually the IOU, although many implementations simply use 1
• The third part is the classification confidence, which is computed per cell. $\mathbb{1}_{i}^{obj}$ indicates whether a GT center lies in the cell
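For reference, the three parts described above can be written out as follows. This is a reconstruction of the loss from the YOLOv1 paper, using its symbols: $S^2$ cells, $B$ boxes per cell, and $p_i(c)$ for the class probabilities of cell $i$. In each pair, one symbol is the prediction and the other the target. Following the text above, $\hat{C}$ is read as the target.

$$
\begin{aligned}
\mathcal{L} ={}& \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left(C_i - \hat{C}_i\right)^2
 + \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} \left(C_i - \hat{C}_i\right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in \text{classes}} \left(p_i(c) - \hat{p}_i(c)\right)^2
\end{aligned}
$$

The first two lines are the coordinate part, the third line is the confidence part (object and no-object terms), and the last line is the per-cell classification part.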

### Inference

For Pascal VOC, $7 \times 7 \times 2 = 98$ bboxes are predicted per image, and the results are then filtered with non-maximum suppression (NMS).
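The NMS step can be sketched as a simple greedy procedure. This is a minimal, class-agnostic illustration in plain Python, not the author's implementation. The `(x1, y1, x2, y2)` box format and the threshold of 0.5 are assumptions made for the example, and in practice NMS is usually run separately for each class.

```python
def iou(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes from highest to lowest score and keep a box
    only if it does not overlap an already kept box by more than the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # the second box overlaps the first heavily and is dropped
print(kept)
```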

### Summary

YOLOv1 is a groundbreaking one-stage detector: two fully connected layers after the convolutional network predict location and confidence, and a new lightweight backbone network is designed. Although its accuracy is well below SOTA, the model is really fast.
The author mentions several limitations of YOLO:

• Each grid cell predicts only one category and two boxes, which works poorly in dense scenes
• It depends strongly on the training data and does not generalize well to objects with uncommon aspect ratios, and the features are too coarse because of the heavy downsampling
• The loss treats errors in small boxes and large boxes the same, although a small error in a small box affects the IOU much more. Localization error is the main source of error

# YOLOv2

Paper: YOLO9000: Better, Faster, Stronger

### Introduction

Building on YOLOv1, YOLOv2 adds a series of then-popular improvements, giving a faster and more accurate one-stage detector. In addition, the author proposes YOLO9000, which can detect more than 9000 object categories. The paper presents the model in three parts, Better / Faster / Stronger, which cover the tricks that improve accuracy, the methods that speed up the network, and the implementation of very-large-scale classification, respectively.

### Better

YOLOv1 is still a fairly naive idea, so the author adds many accuracy-improving methods in YOLOv2, making it a carefully thought-out, complete network. The specific methods are listed in Table 2 of the paper:

• ##### Batch Normalization

A BN layer accelerates the convergence of the network. Adding BN layers to YOLO improves mAP by 2% and allows dropout to be removed from training.

• ##### High Resolution Classifier

The original YOLO backbone is pre-trained with $224 \times 224$ input and then trained for detection directly at $448 \times 448$, which forces the network to adapt to a new resolution and learn object detection at the same time. To make the transition smoother, YOLOv2 first fine-tunes the backbone on $448 \times 448$ input for 10 epochs before detection training, which brings about a 4% mAP improvement.

• ##### Convolutional With Anchor Boxes

YOLOv1 predicts bboxes directly. Following Faster R-CNN, where preset anchors achieve good results, YOLOv2 removes the fully connected layers and starts to use anchors.
First, one pooling layer is removed to keep the output at a higher resolution. The input resolution is changed to 416 so that the feature map has an odd size. That way there is a single center cell, which makes it easier to predict large objects. The final feature map is 1/32 of the input, i.e. $13 \times 13$. With anchors, the prediction mechanism changes from being bound to the grid cell to being bound to the anchor. Each anchor predicts $C + 5$ values: the objectness confidence predicts the IOU, and the class confidence predicts the conditional class probability. After introducing anchors the accuracy drops slightly: more boxes are output, so recall improves while precision falls somewhat.
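The arithmetic behind the 416 input can be checked with a small sketch. The helper names are hypothetical, and the 5 anchors and 20 classes are only example values chosen for illustration.

```python
def grid_size(input_size, stride=32):
    """Side length of the final feature map for a given input resolution."""
    return input_size // stride


def output_shape(input_size, num_anchors, num_classes, stride=32):
    """Each anchor in each cell predicts num_classes + 5 values."""
    g = grid_size(input_size, stride)
    return (g, g, num_anchors * (num_classes + 5))


even_grid = grid_size(448)   # 14: even, so there is no single center cell
odd_grid = grid_size(416)    # 13: odd, so exactly one center cell
shape = output_shape(416, num_anchors=5, num_classes=20)  # example settings
print(even_grid, odd_grid, shape)
```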

• ##### Dimension Clusters

So far the anchors have been set by hand, which may not be optimal. K-means is used to cluster the boxes of the training set to obtain better default anchors. The clustering uses IOU as the distance measure, specifically $d(\text{box}, \text{centroid}) = 1 - \text{IOU}(\text{box}, \text{centroid})$. As Figure 2 shows, five clusters give the best cost-performance trade-off, and this is the setting YOLOv2 uses.
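The clustering can be sketched in plain Python as below. This is a minimal illustration, not the author's code. It makes three assumptions: boxes are compared by width and height only, as if they shared a center; centroids are updated with the mean; and initialization is random with a fixed seed. The toy data is made up.

```python
import random


def wh_iou(a, b):
    """IOU of two (w, h) boxes that are assumed to share the same center."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_anchors(boxes, k, iters=100, seed=0):
    """K-means on (w, h) pairs with distance d = 1 - IOU(box, centroid)."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            nearest = min(range(k), key=lambda c: 1.0 - wh_iou(box, centroids[c]))
            clusters[nearest].append(box)
        updated = []
        for c, members in enumerate(clusters):
            if members:
                updated.append((sum(m[0] for m in members) / len(members),
                                sum(m[1] for m in members) / len(members)))
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids)


# Toy (w, h) data: three small boxes and three tall boxes
boxes = [(10, 10), (11, 10), (10, 11), (50, 100), (52, 98), (48, 102)]
anchors = kmeans_anchors(boxes, k=2)
print(anchors)
```

Using IOU rather than Euclidean distance keeps large boxes from dominating the clustering error.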

• ##### Direct location prediction

After introducing anchors, the early training of YOLOv2 is very unstable, mainly because of errors in the center point $(x, y)$. Region proposal methods shift the center point by an offset scaled by the anchor's width and height. Because this offset is unconstrained, the center point can end up anywhere in the image, which makes early training unstable.

Therefore, YOLOv2 keeps YOLOv1's strategy of predicting the center position relative to the grid cell, using a logistic (sigmoid) function to constrain the value to the range $[0, 1]$, while the width and height are predicted as ratios relative to the anchor's width and height. Each bbox therefore predicts 5 values: $t_x, t_y, t_w, t_h, t_o$. Constraining the center position increases mAP by about 5%.
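A minimal sketch of this decoding step follows. It uses the parameterization from the YOLOv2 paper: the sigmoid of the raw outputs $t_x, t_y$ is added to the cell's top-left offset $(c_x, c_y)$, and the anchor size $(p_w, p_h)$ is scaled by $e^{t_w}, e^{t_h}$. The function name is hypothetical, and all values are in grid-cell units.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h):
    """Turn raw network outputs into a box, in grid-cell units."""
    b_x = c_x + sigmoid(t_x)   # center stays inside cell (c_x, c_y)
    b_y = c_y + sigmoid(t_y)
    b_w = p_w * math.exp(t_w)  # width and height as ratios of the anchor size
    b_h = p_h * math.exp(t_h)
    return b_x, b_y, b_w, b_h


# All-zero outputs give the cell center and exactly the anchor size
box = decode_box(0.0, 0.0, 0.0, 0.0, c_x=3, c_y=4, p_w=2.0, p_h=5.0)
print(box)
```

Because the sigmoid output lies strictly between 0 and 1, the predicted center cannot leave its cell, which is what stabilizes early training.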

• ##### Fine-Grained Features

Faster R-CNN and SSD predict from feature maps of different layers, while YOLOv2 proposes a passthrough layer, which samples the earlier $26 \times 26$ feature map at intervals, turning the original $26 \times 26 \times 512$ feature map into $13 \times 13 \times 2048$. That is, the feature map is divided into small $2 \times 2$ grids, and the values at positions 1, 2, 3 and 4 of all grids are each gathered into a new feature map. The result is concatenated with the final feature map for prediction, which brings a 1% mAP improvement.
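The rearrangement done by the passthrough layer can be sketched on nested lists as below. This is an illustration only: the feature map is stored as `feature[row][col][channel]`, and the channel ordering is an assumption that may differ from the ordering in the author's Darknet implementation.

```python
def passthrough(feature, stride=2):
    """Space-to-depth: move each stride x stride spatial block into channels.
    An H x W x C map becomes (H/stride) x (W/stride) x (C * stride^2)."""
    height, width = len(feature), len(feature[0])
    out = []
    for i in range(0, height, stride):
        row = []
        for j in range(0, width, stride):
            channels = []
            for di in range(stride):
                for dj in range(stride):
                    channels.extend(feature[i + di][j + dj])
            row.append(channels)
        out.append(row)
    return out


# Toy 4 x 4 x 1 map whose value encodes its position (row * 4 + col)
feature = [[[r * 4 + c] for c in range(4)] for r in range(4)]
out = passthrough(feature)
print(out[0][0], out[1][1])
```

Applied to a $26 \times 26 \times 512$ map, the same function yields $13 \times 13 \times 2048$, which has the spatial size needed for concatenation with the final feature map.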

• ##### Multi-Scale Training

Since YOLOv2 is fully convolutional, the input size can be changed freely. During training, the input resolution is switched at random every 10 batches. The candidate resolutions are multiples of 32, i.e. $\{320, 352, \ldots, 608\}$. In practice, different resolutions can be used to meet different accuracy and speed requirements. The results are shown in Table 3.
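The resolution schedule can be sketched as follows. The helper names and the fixed seed are assumptions for illustration, and the training step itself is left out.

```python
import random


def candidate_sizes(lowest=320, highest=608, step=32):
    """All input resolutions that are multiples of the network stride."""
    return list(range(lowest, highest + 1, step))


def next_input_size(batch_index, current_size, sizes, rng, every=10):
    """Pick a new random resolution every `every` batches, else keep the current one."""
    if batch_index % every == 0:
        return rng.choice(sizes)
    return current_size


sizes = candidate_sizes()
rng = random.Random(0)
size = 416
schedule = []
for batch_index in range(30):
    size = next_input_size(batch_index, size, sizes, rng)
    schedule.append(size)  # a real loop would resize the batch to `size` here
print(sizes)
```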

### Faster

To speed things up, YOLOv2 uses a new backbone network, Darknet-19, which contains 19 convolutional layers and 5 max-pooling layers. It uses $1 \times 1$ convolutions to compress the output of the $3 \times 3$ convolutions, uses BN layers to stabilize training, accelerate convergence and regularize the model, and uses global average pooling to make predictions.

### Stronger

YOLOv2 proposes training jointly on classification data and detection data to obtain a detector with a very large number of classes.

• ##### Hierarchical classification

The label granularity of ImageNet and COCO is different. The data therefore needs hierarchical, multi-level labels, similar to the kingdom / phylum / class levels of biological taxonomy, and these are organized into a WordTree.

For example, Norfolk terrier and other terriers are children of the terrier node, and the classification probability of Norfolk terrier is the product of the conditional probabilities of all nodes on the path from the root node to the current node.

After ImageNet-1k is re-labeled in this way, the WordTree has 1369 nodes in total. Each group of sibling classes uses its own softmax. Darknet-19 retrained on the WordTree reaches 71.9% top-1 accuracy, only slightly lower than before. Judging from the results, most errors are fine-grained: for example, a wrong prediction still recognizes the object as a dog but gets the breed wrong. This hierarchical classification should therefore help guide feature extraction.
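The two ideas above, a softmax per sibling group and a product of conditional probabilities along the path to the root, can be sketched as below. The tiny tree, the logits and the probabilities are made-up values for illustration, and treating "dog" as a root with probability 1 is a simplification.

```python
import math


def softmax(logits):
    """Softmax over one group of sibling nodes."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def absolute_probability(node, parent, cond_prob):
    """Multiply conditional probabilities from the node up to the root."""
    p = 1.0
    while node is not None:
        p *= cond_prob[node]
        node = parent[node]
    return p


# Hypothetical path: dog -> hunting dog -> terrier -> Norfolk terrier
parent = {"dog": None, "hunting dog": "dog",
          "terrier": "hunting dog", "Norfolk terrier": "terrier"}
# Softmax over the children of "terrier" (two siblings with equal logits)
terrier_children = softmax([0.0, 0.0])
cond_prob = {"dog": 1.0, "hunting dog": 0.5, "terrier": 0.5,
             "Norfolk terrier": terrier_children[0]}
p_norfolk = absolute_probability("Norfolk terrier", parent, cond_prob)
print(p_norfolk)
```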

• ##### Dataset combination with WordTree

Merging COCO and ImageNet gives the WordTree in Figure 6, which has 9418 classes.

• ##### Joint classification and detection

Because ImageNet is much larger, COCO is oversampled so that ImageNet is only larger by a factor of 4:1. When the input image is detection data, the full loss is backpropagated, with the classification loss limited to the GT label's level in the tree and above. When the input image is classification data, the bbox with the highest confidence for that class (assumed to overlap the object with IOU $\geq 0.3$) is used to backpropagate the classification part of the loss.

### Training

YOLOv2 is similar to YOLOv1. First, each GT is assigned, according to its center point, to the bbox with the largest IOU in the corresponding cell. Some online implementations use the anchor with the largest IOU instead. The author's implementation appears to use the bbox, but this remains to be verified. The loss calculation has three parts:

• For a bbox whose largest IOU with any GT is below the threshold, only the objectness is regressed, toward 0
• For a bbox matched to a GT, all losses are regressed
• For all boxes, during the first 12800 iterations the coordinates are regressed toward the preset anchors. Very few boxes receive coordinate regression early on, so letting the predictions fit the anchors first stabilizes training

### Summary

Building on YOLO, YOLOv2 integrates methods from other work and makes many improvements:

• Adds batch normalization
• Fine-tunes the backbone at high resolution before detection training
• Uses k-means to help set the anchors
• Constrains the anchor center point using YOLO's cell-relative prediction
• Uses a passthrough layer to fuse low-level features
• Uses multi-scale training to improve accuracy
• Proposes Darknet-19 for speed
• Uses hierarchical classification to handle a very large number of classes

# YOLOv3

Paper: YOLOv3: An Incremental Improvement

### Introduction

YOLOv3 was not published as a complete paper. It is a write-up of the author's smaller pieces of work, mainly adding some effective tricks.

### Bounding Box Prediction

The coordinate regression of YOLOv3 is broadly the same as in YOLOv2, and logistic regression is still used to predict the objectness of each anchor. Each GT is assigned only the one anchor with the largest IOU, and that anchor produces the full loss. (The paper says bounding box prior, i.e. the preset anchor, rather than the predicted bounding box, which also makes it possible to work out which scale level the GT belongs to. The effect is similar, although the author's implementation appears to use the bounding box.) Other anchors whose IOU with the GT is greater than 0.5 produce no loss, while those whose IOU with the GT is less than 0.5 produce only objectness loss.
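The assignment rule can be sketched for a single GT as below. The function name and the role labels are hypothetical, and the sketch deliberately leaves open whether the IOUs come from the anchor priors or from the predicted boxes.

```python
def assign_roles(ious_with_gt, ignore_threshold=0.5):
    """Decide which loss terms each candidate box contributes for one GT."""
    best = max(range(len(ious_with_gt)), key=lambda i: ious_with_gt[i])
    roles = []
    for i, value in enumerate(ious_with_gt):
        if i == best:
            roles.append("full_loss")        # coordinates, objectness and class
        elif value > ignore_threshold:
            roles.append("ignored")          # overlaps the GT but is not the best match
        else:
            roles.append("objectness_only")  # treated as a negative sample
    return roles


roles = assign_roles([0.3, 0.8, 0.6])
print(roles)
```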

### Class Prediction

To support multi-label classification, independent logistic classifiers are used for class prediction, and a binary cross-entropy loss is used for training.
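A minimal sketch of this loss follows: each class gets its own sigmoid, so several labels can be positive at once. The function name is hypothetical, and the formula is the standard numerically stable form of binary cross-entropy on logits, summed over classes.

```python
import math


def multilabel_bce(logits, targets):
    """Sum of per-class binary cross-entropy terms, computed from raw logits.
    Equivalent to -[t*log(sigmoid(z)) + (1-t)*log(1-sigmoid(z))] for each class."""
    loss = 0.0
    for z, t in zip(logits, targets):
        loss += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
    return loss


# Two labels are positive at once, which a single softmax could not express
confident = multilabel_bce([8.0, 8.0, -8.0], [1, 1, 0])
uncertain = multilabel_bce([0.0, 0.0], [1, 0])  # equals 2 * ln(2)
print(confident, uncertain)
```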

### Predictions Across Scales

YOLOv3 predicts bboxes on three feature maps of different scales. In a way similar to FPN, the high-level features are upsampled and then concatenated with the low-level feature maps. Each scale has its own three anchors. Several convolutional layers first process the merged feature map, and then a 3-D tensor is predicted that encodes location, objectness and class information. For example, on COCO the tensor at each scale is $N \times N \times [3 \times (4 + 1 + 80)]$.
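The tensor sizes at the three scales can be worked out with a short sketch. The input size of 416 and the strides of 32, 16 and 8 are assumptions for the example, and the function name is hypothetical.

```python
def yolo_v3_output_shapes(input_size=416, num_classes=80,
                          anchors_per_scale=3, strides=(32, 16, 8)):
    """One N x N x [anchors * (4 + 1 + classes)] tensor per scale."""
    channels = anchors_per_scale * (4 + 1 + num_classes)
    return [(input_size // s, input_size // s, channels) for s in strides]


shapes = yolo_v3_output_shapes()
total_boxes = sum(h * w * 3 for h, w, _ in shapes)  # boxes predicted per image
print(shapes, total_boxes)
```

Under these assumptions, the three scales together predict far more boxes than the single $13 \times 13$ map of YOLOv2 would.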

### Feature Extractor

YOLOv3 proposes a new backbone network, Darknet-53, which combines Darknet-19 with residual networks by adding shortcut connections on top of the earlier $3 \times 3$ and $1 \times 1$ convolutions.

The accuracy of Darknet-53 is close to that of the SOTA classification networks of the time, but it is much faster.

### Summary

YOLOv3 is not a formal paper release, and the author's improvements are relatively small. It mainly integrates some methods for improving accuracy:

• Changes class prediction to independent logistic classifiers
• Makes multi-scale predictions based on an FPN-like structure
• Proposes Darknet-53, which adds shortcut connections to the network

# Conclusion

The YOLO series is a classic family of architectures in object detection. Although there are now many more accurate and more complex networks, the structure of YOLO can still offer algorithm engineers a lot of inspiration. The three papers read like a reference manual: they show how to improve the accuracy of the detection network you have at hand, and all of the tricks are worth studying.