# Generalized Focal Loss: a clever extension of focal loss that models the bounding box probability distribution while keeping the accuracy gains | NeurIPS 2020

Time: 2022-6-20

To efficiently learn accurate bounding boxes and their distributions, the paper extends focal loss and proposes Generalized Focal Loss (GFL), which can optimize continuous-valued targets and includes Quality Focal Loss (QFL) and Distribution Focal Loss (DFL). QFL learns a better joint representation of classification score and localization quality, while DFL provides richer information and more accurate predictions by modeling bounding box locations as a general distribution. Experimental results show that GFL can improve the performance of all one-stage algorithms.

Source: Xiaofei’s algorithm Engineering Notes official account

Paper: Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection

# Introduction

At present, dense detectors (one-stage) are the mainstream direction in object detection. This paper mainly discusses two of their components:

• Bounding box representation: this can be regarded as the network's output for the box location. Conventional methods model it as a simple Dirac delta distribution, i.e. directly output the location result. Some methods model it as a Gaussian distribution instead, outputting a mean and a variance that represent the location result and its uncertainty respectively, which provides additional information.

• Localization quality estimation: some recent studies add an extra localization quality prediction, such as the IoU score in IoU-Net and the centerness score in FCOS. At inference time, the localization quality and the classification score are combined into the final score.

After analysis, the paper finds that the above two practices have the following problems:

• Localization quality estimation and classification score are actually incompatible: first, they are usually trained independently but combined at inference. Second, the localization quality estimation is trained only on positive sample points, so negative sample points may be assigned a spuriously high localization quality. This gap between training and testing degrades detection performance.
• The bounding box representation is not flexible enough: most algorithms model it as a Dirac delta distribution, which ignores the ambiguity and uncertainty in the dataset: it only gives the result, without indicating how reliable the result is. Although some methods use a Gaussian distribution instead, a Gaussian is too simple and crude to reflect the true distribution of the bounding box.

In order to solve the above two problems, the paper puts forward the solutions:

• For localization quality estimation, the paper merges it directly into the classification score: the category vector is retained, but the score of each category now means the IoU between the predicted box and the GT. In addition, this representation can be trained on both positive and negative samples, so there is no gap between training and testing.

• For the bounding box representation, a general distribution is used for modeling without imposing any constraints. This not only yields reliable and accurate predictions, but also reveals the underlying true distribution. As shown in the figure above, for ambiguous or uncertain boundaries the distribution appears as a flat curve, otherwise as a sharp one.

In fact, using the two strategies above raises an optimization problem. Conventional one-stage detectors use focal loss to optimize the classification branch, but focal loss is designed for discrete classification labels. After the paper combines localization quality with the classification score, the output becomes a continuous, category-specific IoU score, and focal loss cannot be applied directly. The paper therefore extends focal loss and proposes GFL (Generalized Focal Loss), which can handle the global optimization of continuous-valued targets. GFL has two specific forms: QFL (Quality Focal Loss) and DFL (Distribution Focal Loss). QFL focuses on hard samples and predicts a continuous quality score for the corresponding category, while DFL provides richer information and more accurate location predictions by modeling bounding box locations as a general distribution.

In general, GFL has the following advantages:

• It proposes a simple and efficient joint prediction strategy that eliminates the training/testing gap of an extra quality estimation branch.
• It can flexibly model the true distribution of the bounding box to provide richer information and more accurate location predictions.
• Without introducing additional overhead, it can improve the performance of all one-stage detection algorithms.

# Method

### Focal Loss (FL)

FL is mainly used to address the imbalance between positive and negative samples in one-stage object detection:

$$\mathbf{FL}(p)=-(1-p_t)^{\gamma}\log(p_t),\quad p_t=\begin{cases}p, & y=1 \\ 1-p, & y=0\end{cases}$$

It consists of the standard cross-entropy part $-\log(p_t)$ and the scaling factor part $(1-p_t)^{\gamma}$. The scaling factor automatically down-weights easy samples so that training focuses on hard samples.
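As a sanity check, here is a minimal Python sketch of the formula above (an illustration, not the paper's code):

```python
import math

def focal_loss(p, y, gamma=2.0):
    """Focal loss for a binary label y in {0, 1}.

    p_t is the predicted probability of the true class; the
    (1 - p_t)^gamma factor down-weights easy samples.
    """
    p_t = p if y == 1 else 1.0 - p
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# An easy sample (p_t = 0.9) contributes far less to the loss
# than a hard one (p_t = 0.1); with gamma = 0, FL reduces to
# plain cross entropy.
easy = focal_loss(0.9, 1)
hard = focal_loss(0.1, 1)
```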

### Quality Focal Loss (QFL)

Since FL only supports discrete labels, it must be extended to apply its idea to the continuous labels that combine classification and localization quality. First, the cross-entropy part $-\log(p_t)$ is expanded to its complete form $-((1-y)\log(1-\sigma)+y\log(\sigma))$, and then the scaling factor $(1-p_t)^{\gamma}$ is generalized to the absolute difference between the prediction $\sigma$ and the continuous label $y$. Combining the two gives QFL:

$$\mathbf{QFL}(\sigma)=-|y-\sigma|^{\beta}((1-y)\log(1-\sigma)+y\log(\sigma))$$

$\sigma=y$ is the global minimum of QFL.

The hyperparameter $\beta$ of the scaling factor controls how fast the weight decays, as shown in the figure above. Assuming the continuous target label is $y=0.5$: the farther the prediction is from the label, the larger its weight; as the prediction approaches the label, the weight tends to 0, similar to FL.
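A minimal sketch of the QFL formula in Python (illustrative only; the clamping with `eps` is my own numerical safeguard, not from the paper):

```python
import math

def quality_focal_loss(sigma, y, beta=2.0):
    """QFL for a continuous quality label y in [0, 1].

    The complete cross entropy is weighted by |y - sigma|^beta,
    so predictions far from the soft label dominate training.
    """
    eps = 1e-12  # avoid log(0) at the boundaries
    ce = -((1.0 - y) * math.log(max(1.0 - sigma, eps))
           + y * math.log(max(sigma, eps)))
    return (abs(y - sigma) ** beta) * ce
```

At `sigma == y` the scaling factor is zero, so the loss vanishes, matching the stated global minimum.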

### Distribution Focal Loss (DFL)

Like other one-stage detection algorithms, this paper regresses the distances from the current location to the target boundaries. Conventional methods model the regression target $y$ as a Dirac delta distribution, which satisfies $\int^{+\infty}_{-\infty}\delta(x-y)dx=1$, and the label $y$ is recovered by integration:

$$y=\int^{+\infty}_{-\infty}\delta(x-y)\,x\,dx$$

As mentioned earlier, this representation does not reflect the true distribution of the bounding box and provides no extra information. The paper therefore models it as a general distribution $P(x)$. Given the range $[y_0, y_n]$ of the label $y$, the prediction $\hat{y}$ can be obtained from the modeled general distribution in the same way as for the Dirac delta distribution:

$$\hat{y}=\int^{+\infty}_{-\infty}P(x)\,x\,dx=\int^{y_n}_{y_0}P(x)\,x\,dx$$

To be compatible with neural networks, the integral over the continuous range $[y_0, y_n]$ is converted into a sum over the discrete set $\{y_0, y_1, \cdots, y_i, y_{i+1}, \cdots, y_{n-1}, y_n\}$ with interval $\Delta=1$, so the prediction $\hat{y}$ can be expressed as:

$$\hat{y}=\sum^{n}_{i=0}P(y_i)\,y_i$$

$P(x)$ can be implemented with a softmax operation $\mathcal{S}(\cdot)$, with $P(y_i)$ denoted as $\mathcal{S}_i$. The prediction $\hat{y}$ can then be trained end-to-end with conventional losses such as smooth L1, IoU loss or GIoU loss.

In practice, however, many different distributions can integrate to the same result $y$, which reduces learning efficiency. Considering that the distribution should concentrate near the regression target $y$, the paper proposes DFL, which forces the network to increase the probabilities of $y_i$ and $y_{i+1}$, the two values closest to $y$. Since regression targets involve only positive samples, there is no positive/negative imbalance and DFL only needs the cross-entropy part:

$$\mathbf{DFL}(\mathcal{S}_i,\mathcal{S}_{i+1})=-((y_{i+1}-y)\log(\mathcal{S}_i)+(y-y_i)\log(\mathcal{S}_{i+1}))$$

The global optimum of DFL is $\mathcal{S}_i=\frac{y_{i+1}-y}{y_{i+1}-y_i}$, $\mathcal{S}_{i+1}=\frac{y-y_i}{y_{i+1}-y_i}$, which makes $\hat{y}$ arbitrarily close to the label $y$.
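A small Python sketch of the discrete expectation and DFL above, assuming unit-spaced bins (my own toy implementation for illustration):

```python
import math

def softmax(logits):
    """Numerically stable softmax producing the discrete distribution S."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_value(probs, values):
    """hat_y = sum_i S_i * y_i, the discrete integral of the distribution."""
    return sum(p * v for p, v in zip(probs, values))

def distribution_focal_loss(probs, values, y):
    """DFL: cross entropy that pushes probability mass onto the two
    bins y_i <= y <= y_{i+1} surrounding the continuous target y."""
    i = max(k for k in range(len(values) - 1) if values[k] <= y)
    y_i, y_next = values[i], values[i + 1]
    return -((y_next - y) * math.log(probs[i])
             + (y - y_i) * math.log(probs[i + 1]))
```

For a target of 2.3 with bins {0, 1, 2, 3, 4}, a distribution peaked at bins 2 and 3 (with weights 0.7 and 0.3) has expectation 2.3 and a lower DFL than a flat distribution with the same expectation.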

### Generalized Focal Loss (GFL)

QFL and DFL can be unified as GFL. Suppose the prediction probabilities for the two values $y_l$ and $y_r$ are $p_{y_l}$ and $p_{y_r}$, so the final prediction is $\hat{y}=y_l p_{y_l}+y_r p_{y_r}$, and the GT label $y$ satisfies $y_l \le y \le y_r$. Taking $|y-\hat{y}|^{\beta}$ as the scaling factor, the formula of GFL is:

$$\mathbf{GFL}(p_{y_l},p_{y_r})=-|y-(y_l p_{y_l}+y_r p_{y_r})|^{\beta}((y_r-y)\log(p_{y_l})+(y-y_l)\log(p_{y_r}))$$

The global optimum of GFL is $p^{*}_{y_l}=\frac{y_r-y}{y_r-y_l}$, $p^{*}_{y_r}=\frac{y-y_l}{y_r-y_l}$.
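The unified form can also be sketched in a few lines of Python (again an illustrative toy, with an `eps` clamp of my own to keep the logarithms finite):

```python
import math

def generalized_focal_loss(p_l, p_r, y_l, y_r, y, beta=2.0):
    """GFL over two adjacent values y_l <= y <= y_r with probabilities
    p_l + p_r = 1; the prediction is hat_y = y_l*p_l + y_r*p_r."""
    eps = 1e-12  # avoid log(0)
    y_hat = y_l * p_l + y_r * p_r
    ce = -((y_r - y) * math.log(max(p_l, eps))
           + (y - y_l) * math.log(max(p_r, eps)))
    return abs(y - y_hat) ** beta * ce

# At the global optimum p_l* = (y_r - y)/(y_r - y_l) and
# p_r* = (y - y_l)/(y_r - y_l), the prediction equals y and the
# scaling factor, hence the loss, is zero.
```

Setting $y_l=0$, $y_r=1$ recovers QFL (with $p_{y_r}=\sigma$), and setting $\beta=0$ with unit-spaced $y_l$, $y_r$ recovers DFL, which is the sense in which GFL generalizes both.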

FL, QFL and DFL can all be regarded as special cases of GFL. Using GFL leads to the following differences compared with the original method:

• The output of the classification branch is used directly for NMS, with no need to merge the outputs of two branches.
• For each side of the bounding box, the regression branch outputs $n+1$ values instead of a single value.

After adopting GFL, the network loss $\mathcal{L}$ becomes:

$$\mathcal{L}=\frac{1}{N_{pos}}\sum_{z}\mathcal{L}_{\mathcal{Q}}+\frac{1}{N_{pos}}\sum_{z}\mathbf{1}_{\{c^{*}_{z}>0\}}\left(\lambda_0\mathcal{L}_{\mathcal{B}}+\lambda_1\mathcal{L}_{\mathcal{D}}\right)$$

where $\mathcal{L}_{\mathcal{Q}}$ is QFL, $\mathcal{L}_{\mathcal{D}}$ is DFL, and $\mathcal{L}_{\mathcal{B}}$ is GIoU loss, applied to positive samples only.

# Experiment

Performance comparison.

Comparative experiment.

Ablation experiments based on ATSS, and comparison with SOTA algorithms.

# Conclusion

To efficiently learn accurate bounding boxes and their distributions, the paper extends focal loss and proposes Generalized Focal Loss, which can optimize continuous-valued targets and includes Quality Focal Loss and Distribution Focal Loss. QFL learns a better joint representation of classification score and localization quality, while DFL provides richer information and more accurate predictions by modeling bounding box locations as a general distribution. Experimental results show that GFL can improve the performance of all one-stage algorithms.