A survey and analysis of loss functions in face recognition


  • Preface
  • Cross-Entropy Loss (softmax loss)
  • Contrastive Loss – CVPR2006
  • Triplet Loss – CVPR2015
  • Center Loss – ECCV2016
  • L-Softmax Loss – ICML2016
  • A-Softmax Loss – CVPR2017
  • AM-Softmax Loss – CVPR2018
  • ArcFace Loss – CVPR2019
  • Euclidean distance or angular distance and normalization
  • References


Preface

The comparison between closed-set and open-set face recognition is as follows,


The distance (similarity) between feature vectors is computed to judge whether two faces come from the same person, so it is important to choose a metric that fits the problem. Two metrics are common in face recognition: Euclidean distance and cosine distance (angular distance).

If we want to use Euclidean distance in the test phase, we should construct the loss based on the Euclidean distance in the training phase.

In fact, there are some internal relations between different measurement methods,

  • Euclidean distance depends on both the norms and the angle of the vectors. With the norms fixed, the larger the angle, the larger the Euclidean distance; with the angle fixed, the larger the norms grow, the larger the Euclidean distance becomes;
  • Cosine distance and angular distance are monotonically (negatively) related, but their “densities” differ: as the angle sweeps \([0, \pi]\) at constant (linear) speed, the cosine changes slowly near 0 and \(\pi\), and approximately linearly near \(\frac{\pi}{2}\);
  • When the vectors are normalized to unit length, Euclidean distance and cosine distance are monotonically related, so in the prediction stage normalized features can be compared with either metric.
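The last relation is easy to check numerically. A minimal sketch (my own illustration, not from the original post), assuming unit-normalized random features:

```python
import numpy as np

# For unit-norm vectors: ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b = 2 - 2*cos(theta),
# so Euclidean and cosine distance are monotonically related after normalization.
def euclidean_sq(a, b):
    return float(np.sum((a - b) ** 2))

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
a = rng.normal(size=8)
b = rng.normal(size=8)
a /= np.linalg.norm(a)   # project both features onto the unit hypersphere
b /= np.linalg.norm(b)

identity_gap = abs(euclidean_sq(a, b) - (2.0 - 2.0 * cosine_sim(a, b)))
```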

Different loss functions can be divided by measurement,

  • Euclidean distance: Contrastive Loss, Triplet Loss, Center Loss, …
  • Cosine distance (angular distance): Large-Margin Softmax Loss, Angular-Softmax Loss, Large Margin Cosine Loss, Additive Angular Margin Loss, …

Let’s start with the most basic softmax loss.

Cross-Entropy Loss (softmax loss)

Cross entropy loss, also known as softmax loss, is one of the most widely used loss functions in deep learning.

\[\mathcal{L}_{\mathrm{s}}=-\frac{1}{N_b} \sum_{i=1}^{N_b} \log \frac{e^{W_{y_{i}}^{T} x_{i}+b_{y_{i}}}}{\sum_{j=1}^{n} e^{W_{j}^{T} x_{i}+b_{j}}}\]

where,

  • \(n\) is the number of classes and \(N_b\) is the batch size
  • \(x_{i} \in \mathbb{R}^{d}\) is the \(d\)-dimensional feature of the \(i\)-th sample, which belongs to class \(y_i\)
  • \(W \in \mathbb{R}^{d \times n}\) is the weight matrix, \(W_j\) denotes the \(j\)-th column of \(W\), and \(b \in \mathbb{R}^{n}\) is the bias

The feature \(x\) passes through a fully connected layer with weight matrix \(W\) to produce \(n\) real numbers in \((-\infty, +\infty)\), one per class; the larger the number, the more the sample resembles that class. Softmax maps these \(n\) real numbers to \((0, +\infty)\) through the exponential and then normalizes them to sum to 1, yielding a probability interpretation.

The exponential enlarges small differences between the pre-mapping values. Softmax loss wants the component corresponding to the label to be as large as possible, but because of the exponential, the loss is already near its minimum once the pre-mapping gap is large enough; the model does not have to exert “all efforts”.

In face recognition, face classification can drive the model to learn a feature representation of faces. But this loss only pursues separability of the classes; it does not explicitly optimize inter-class and intra-class distances. This inspired the other loss functions below.
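The “large enough gap is enough” behavior is easy to see numerically. A minimal sketch (my own illustration):

```python
import numpy as np

def softmax_ce(logits, label):
    """Cross-entropy (softmax) loss for a single sample."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()                      # subtract max for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return float(-np.log(p[label]))

# Once the pre-softmax gap is large, the loss is already near zero, so the
# model is not pushed to keep enlarging the gap "with all efforts".
small_gap = softmax_ce([2.0, 0.0, 0.0], 0)    # modest gap -> noticeable loss
large_gap = softmax_ce([10.0, 0.0, 0.0], 0)   # large gap  -> loss near zero
```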

Contrastive Loss – CVPR2006

Contrastive loss was proposed by LeCun et al. in “Dimensionality Reduction by Learning an Invariant Mapping” (CVPR 2006). The original goal was for samples to preserve their distance relationships after dimensionality reduction: similar samples stay similar and dissimilar samples stay dissimilar, as shown below,

\[L\left(W, Y, \vec{X}_{1}, \vec{X}_{2}\right)=
(1-Y) \frac{1}{2}\left(D_{W}\right)^{2}+(Y) \frac{1}{2}\left\{\max \left(0, m-D_{W}\right)\right\}^{2}\]

where \(\vec{X}_{1}\) and \(\vec{X}_{2}\) are the sample pair, \(Y \in \{0, 1\}\) indicates whether the pair is similar (\(Y=0\): similar, \(Y=1\): dissimilar), \(D_W\) is the Euclidean distance between the features of the samples, and \(W\) is the parameter to be learned. It is equivalent to a Euclidean-distance loss plus a hinge loss.


The intra-class distance is expected to be as small as possible, and the inter-class distance as large as possible (at least larger than the margin), which is consistent with the goal of face-recognition feature learning. Contrastive loss is used in DeepID2 as the verification loss, which forms a joint loss with an identification loss in softmax form, as shown below,

\[\operatorname{Ident}\left(f, t, \theta_{id}\right)=-\sum_{i=1}^{n} p_{i} \log \hat{p}_{i}=-\log \hat{p}_{t}\]

\[\operatorname{Verif}\left(f_{i}, f_{j}, y_{i j}, \theta_{ve}\right)=\left\{\begin{array}{ll}
\frac{1}{2}\left\|f_{i}-f_{j}\right\|_{2}^{2} & \text { if } y_{i j}=1 \\
\frac{1}{2} \max \left(0, m-\left\|f_{i}-f_{j}\right\|_{2}\right)^{2} & \text { if } y_{i j}=-1
\end{array}\right.\]

Joint losses composed of softmax loss plus another loss are common. Introducing softmax loss makes training more stable and easier to converge.
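The contrastive loss above can be sketched in a few lines (my own illustration; the margin value is an arbitrary choice, and I follow LeCun's convention that \(Y=0\) marks a similar pair and \(Y=1\) a dissimilar one):

```python
import numpy as np

def contrastive_loss(x1, x2, y, m=1.0):
    """Contrastive loss: y = 0 for a similar pair, y = 1 for a dissimilar pair."""
    d = np.linalg.norm(x1 - x2)                     # Euclidean distance D_W
    return float((1 - y) * 0.5 * d ** 2             # pull similar pairs together
                 + y * 0.5 * max(0.0, m - d) ** 2)  # push dissimilar pairs past m

a = np.array([0.0, 0.0])
b = np.array([0.1, 0.0])   # close to a
c = np.array([3.0, 0.0])   # far from a
```

Note the hinge: a dissimilar pair already farther apart than the margin contributes nothing.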

Triplet Loss – CVPR2015

The input to contrastive loss is a pair of samples.

The input of triplet loss is 3 samples: 1 positive pair (the same person) and 1 negative pair (different people). The aim is to shorten the distance within the positive pair and widen the distance within the negative pair. Triplet loss is from “FaceNet: A Unified Embedding for Face Recognition and Clustering”.


The loss function is as follows,

\[\mathcal{L}_t = \sum_{i}^{N}\left[\left\|f\left(x_{i}^{a}\right)-f\left(x_{i}^{p}\right)\right\|_{2}^{2}-\left\|f\left(x_{i}^{a}\right)-f\left(x_{i}^{n}\right)\right\|_{2}^{2}+\alpha\right]_{+}\]

While pulling the positive pair together and pushing the negative pair apart, the loss also enforces a margin \(\alpha\),


The parameter count of the final fully connected layer under softmax loss is proportional to the number of identities, which challenges GPU memory on large-scale datasets.

The inputs of contrastive loss and triplet loss are pairs and triplets, which is convenient for training on large datasets. However, selecting good pairs and triplets is difficult, and training is unstable and hard to converge. They can be used together with softmax loss, either as a joint loss or by “warming up” with softmax loss first.
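A minimal sketch of the triplet hinge above (my own illustration; the margin \(\alpha\) is an arbitrary choice). It also shows why triplet selection matters: an “easy” triplet already contributes zero loss and zero gradient.

```python
import numpy as np

def triplet_loss(anchor, pos, neg, alpha=0.2):
    """FaceNet-style hinge on squared distances: [d(a,p) - d(a,n) + alpha]_+"""
    d_ap = np.sum((anchor - pos) ** 2)
    d_an = np.sum((anchor - neg) ** 2)
    return float(max(0.0, d_ap - d_an + alpha))

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])        # same identity
easy_n = np.array([1.0, 0.0])   # negative already far: zero loss
hard_n = np.array([0.2, 0.0])   # negative too close: positive loss
```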

Center Loss – ECCV2016

Because of changes in expression and pose, the intra-class distance of the same person may even be larger than the inter-class distance between different people. The starting point of center loss is that we want the classes to be not only separable but also compact within themselves; the former is achieved through softmax loss, the latter through center loss, as shown in the following figure. Each class is assigned a learnable class center; the smaller the sum of distances from features to their class centers, the more compact the classes.

\[\mathcal{L}_{ce}=\frac{1}{2} \sum_{i=1}^{N_b}\left\|x_{i}-c_{y_{i}}\right\|_{2}^{2}\]

The joint loss is as follows; the hyperparameter \(\lambda\) balances the two losses, and the two are given different learning rates,

\[\begin{aligned}
\mathcal{L}_{c} &=\mathcal{L}_{s}+\lambda \mathcal{L}_{ce} \\
&=-\sum_{i=1}^{N_{b}} \log \frac{e^{W_{y_{i}}^{T} x_{i}+b_{y_{i}}}}{\sum_{j=1}^{n} e^{W_{j}^{T} x_{i}+b_{j}}}+\frac{\lambda}{2} \sum_{i=1}^{N_b}\left\|x_{i}-c_{y_{i}}\right\|_{2}^{2}
\end{aligned}\]
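A sketch of the \(\mathcal{L}_{ce}\) term (my own illustration; in the actual method the centers \(c_{y_i}\) are learnable and updated each mini-batch, while here they are fixed for clarity):

```python
import numpy as np

def center_loss(feats, labels, centers):
    """0.5 * sum_i ||x_i - c_{y_i}||^2 over a batch."""
    diffs = feats - centers[labels]      # each feature minus its class center
    return float(0.5 * np.sum(diffs ** 2))

centers = np.array([[0.0, 0.0],
                    [4.0, 4.0]])
labels = np.array([0, 1])
compact = np.array([[0.1, 0.0], [4.0, 4.1]])  # features near their class centers
spread  = np.array([[2.0, 0.0], [2.0, 4.0]])  # features far from their centers
```

Compact classes yield a much smaller penalty, which is exactly the pressure toward intra-class compactness described above.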


We hope to achieve the following results,


The above losses are optimized on Euclidean distance. The loss function optimized on cosine distance is described below.

L-Softmax Loss – ICML2016

L-Softmax, namely large-margin softmax, is from “Large-Margin Softmax Loss for Convolutional Neural Networks”.

If the bias is ignored, FC + softmax + cross-entropy can be written as follows,

\[L_{i}=-\log \left(\frac{e^{\left\|\boldsymbol{W}_{y_{i}}\right\|\left\|\boldsymbol{x}_{i}\right\| \cos \left(\theta_{y_{i}}\right)}}{\sum_{j} e^{\left\|\boldsymbol{W}_{j}\right\|\left\|\boldsymbol{x}_{i}\right\| \cos \left(\theta_{j}\right)}}\right)\]

\(\boldsymbol{W}_{j}\) can be viewed as the class-center vector of class \(j\). For \(x_i\), we hope \(\left\|\boldsymbol{W}_{y_{i}}\right\|\left\|\boldsymbol{x}_{i}\right\| \cos \left(\theta_{y_{i}}\right)\) is as large as possible compared with \(\left\|\boldsymbol{W}_{j}\right\|\left\|\boldsymbol{x}_{i}\right\| \cos \left(\theta_{j}\right), j \neq y_i\). Two factors are involved,

  • the norm of each column of \(\boldsymbol{W}\)
  • the angle between \(\boldsymbol{x}_i\) and \(\boldsymbol{W}_j\)

L-Softmax focuses on the second factor, the angle. Compared with softmax, it wants \(\boldsymbol{x}_i\) to move closer to \(\boldsymbol{W}_{y_i}\), so a stronger constraint is applied to \(\cos \left(\theta_{y_{i}}\right)\): the angle \(\theta_{y_i}\) is multiplied by a factor \(m\). To obtain the same inner-product value as in softmax, \(\theta_{y_i}\) must then be smaller,

\[L_{i}=-\log \left(\frac{e^{\left\|\boldsymbol{W}_{y_{i}}\right\|\left\|\boldsymbol{x}_{i}\right\| \psi\left(\theta_{y_{i}}\right)}}{e^{\left\|\boldsymbol{W}_{y_{i}}\right\|\left\|\boldsymbol{x}_{i}\right\| \psi\left(\theta_{y_{i}}\right)}+\sum_{j \neq y_{i}} e^{\left\|\boldsymbol{W}_{j}\right\|\left\|\boldsymbol{x}_{i}\right\| \cos \left(\theta_{j}\right)}}\right)\]

For this purpose, the constructed \(\psi(\theta)\) should satisfy the following conditions,

  • \(\psi(\theta) < \cos(\theta)\)
  • monotonically decreasing

The \(\psi(\theta)\) constructed in the paper is as follows, with the margin size adjusted through \(m\),

\[\psi(\theta)=(-1)^{k} \cos (m \theta)-2 k, \quad \theta \in\left[\frac{k \pi}{m}, \frac{(k+1) \pi}{m}\right], k \in [0, m-1]\]

When \(m=2\), it is as shown below,


In the binary-classification case, it can be interpreted as follows,


For gradient computation and backpropagation, \(\cos(\theta)\) is replaced by an expression involving only \(\boldsymbol{W}_j\) and \(\boldsymbol{x}_i\), namely \(\frac{\boldsymbol{W}_{j}^{T} \boldsymbol{x}_{i}}{\left\|\boldsymbol{W}_{j}\right\|\left\|\boldsymbol{x}_{i}\right\|}\), and \(\cos(m \theta)\) is expanded by the multiple-angle formula,

\[\begin{aligned}
\cos \left(m \theta_{y_{i}}\right) &=C_{m}^{0} \cos ^{m}\left(\theta_{y_{i}}\right)-C_{m}^{2} \cos ^{m-2}\left(\theta_{y_{i}}\right)\left(1-\cos ^{2}\left(\theta_{y_{i}}\right)\right) \\
&+C_{m}^{4} \cos ^{m-4}\left(\theta_{y_{i}}\right)\left(1-\cos ^{2}\left(\theta_{y_{i}}\right)\right)^{2}+\cdots \\
&+(-1)^{n} C_{m}^{2 n} \cos ^{m-2 n}\left(\theta_{y_{i}}\right)\left(1-\cos ^{2}\left(\theta_{y_{i}}\right)\right)^{n}+\cdots
\end{aligned}\]

Meanwhile, to ease training, a hyperparameter \(\lambda\) is defined, and \(\left\|\boldsymbol{W}_{y_{i}}\right\|\left\|\boldsymbol{x}_{i}\right\| \psi\left(\theta_{y_{i}}\right)\) is replaced with \(f_{y_i}\),

\[f_{y_{i}}=\frac{\lambda\left\|\boldsymbol{W}_{y_{i}}\right\|\left\|\boldsymbol{x}_{i}\right\| \cos \left(\theta_{y_{i}}\right)+\left\|\boldsymbol{W}_{y_{i}}\right\|\left\|\boldsymbol{x}_{i}\right\| \psi\left(\theta_{y_{i}}\right)}{1+\lambda}\]

Training starts from a large \(\lambda\), which is then gradually decreased, similar to “warming up” with softmax.
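The piecewise \(\psi(\theta)\) can be checked numerically. A minimal sketch (my own, with \(m=2\)) confirming it is continuous, monotonically decreasing, and bounded above by \(\cos\theta\):

```python
import numpy as np

def psi(theta, m):
    """(-1)^k * cos(m*theta) - 2k  for theta in [k*pi/m, (k+1)*pi/m]."""
    k = min(int(theta * m / np.pi), m - 1)   # clamp k at the endpoint theta = pi
    return (-1) ** k * np.cos(m * theta) - 2 * k

thetas = np.linspace(0.0, np.pi, 400)
vals = np.array([psi(t, 2) for t in thetas])
```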

A-Softmax Loss – CVPR2017

A-Softmax, or angular softmax, is from “SphereFace: Deep Hypersphere Embedding for Face Recognition”.

In L-Softmax, when classifying \(x_i\), both the norms of the class-center vectors and the angles are considered.

The biggest difference in A-Softmax is that the center vector of each class is normalized, i.e. \(\|W_j\| = 1\), and the bias is set to 0, so classification considers only the angle between \(x_i\) and \(W_j\). The same margin as in L-Softmax is introduced, as shown below,

\[\mathcal{L}_{\mathrm{AS}}=-\frac{1}{N} \sum_{i=1}^{N} \log \left(\frac{e^{\left\|\boldsymbol{x}_{i}\right\| \psi\left(\theta_{y_{i}, i}\right)}}{e^{\left\|\boldsymbol{x}_{i}\right\| \psi\left(\theta_{y_{i}, i}\right)}+\sum_{j \neq y_{i}} e^{\left\|\boldsymbol{x}_{i}\right\| \cos \left(\theta_{j, i}\right)}}\right) \\
\psi(\theta_{y_i, i})=(-1)^{k} \cos (m \theta_{y_i, i})-2 k, \quad \theta_{y_i, i} \in\left[\frac{k \pi}{m}, \frac{(k+1) \pi}{m}\right], k \in [0, m-1]\]

When \(m=1\), i.e. when no margin is introduced, it is called the modified softmax loss.
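A single-sample sketch of \(\mathcal{L}_{\mathrm{AS}}\) (my own illustration; the class centers are the columns of \(W\), normalized inside the function, and the example vectors are arbitrary). Setting \(m=1\) recovers the modified softmax loss:

```python
import numpy as np

def psi(theta, m):
    k = min(int(theta * m / np.pi), m - 1)
    return (-1) ** k * np.cos(m * theta) - 2 * k

def a_softmax_loss(x, W, y, m=4):
    """A-Softmax for one sample. Columns of W are class centers; bias is 0."""
    Wn = W / np.linalg.norm(W, axis=0, keepdims=True)   # enforce ||W_j|| = 1
    xnorm = np.linalg.norm(x)
    cos = Wn.T @ x / xnorm                              # cos(theta_j)
    theta_y = np.arccos(np.clip(cos[y], -1.0, 1.0))
    logits = xnorm * cos
    logits[y] = xnorm * psi(theta_y, m)                 # margin on the true class
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()
    return float(-np.log(p[y]))

W = np.array([[1.0, 0.0],
              [0.0, 1.0]])
x = np.array([2.0, 0.5])
```

Since \(\psi(\theta) \le \cos\theta\), increasing \(m\) shrinks the true-class logit and raises the loss, which is exactly the extra pressure toward the class center.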

Softmax loss, modified softmax loss and a-softmax loss are as follows,


The visualization is as follows,


AM-Softmax Loss-CVPR2018

AM-Softmax, also called additive margin softmax, comes from the paper “Additive Margin Softmax for Face Verification”; it is essentially the same as CosFace (“CosFace: Large Margin Cosine Loss for Deep Face Recognition”), where the loss is called LMCL (large margin cosine loss).

Compared with A-Softmax, there are two changes,

  • \(x_i\) is also normalized, while the normalization of each column of \(W\) and the zero bias are retained
  • \(\cos(m \theta)\) is changed to \(s \cdot (\cos \theta - m)\)
\[\mathcal{L}_{\mathrm{AM}}=-\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{s \cdot\left(\cos \theta_{y_{i}}-m\right)}}{e^{s \cdot\left(\cos \theta_{y_{i}}-m\right)}+\sum_{j=1, j \neq y_{i}}^{n} e^{s \cdot \cos \theta_{j}}}\]

Compared with softmax, we want a larger inter-class distance and a smaller intra-class distance. In terms of cosine distance, this means that to obtain the same value as softmax for the component corresponding to \(y_i\), a smaller angle \(\theta\) is required. For this we again need to construct a \(\psi(\theta)\) satisfying the conditions mentioned earlier,

  • \(\psi(\theta) < \cos(\theta)\)
  • monotonically decreasing

The previously constructed \(\psi(\theta)\) came from \(\cos(m \theta)\), a multiplicative relationship between \(m\) and \(\theta\); here instead \(\varphi(\theta)= s(\cos (\theta)-m)\) is used,

  • \(\cos(\theta) - m\): \(m\) changes from multiplicative to additive. \(\cos(m\theta)\) applies the margin to the angle, while \(\cos(\theta) - m\) acts directly on the cosine distance. The problems with the former are that when the angle between class-center vectors is small, the penalty is small (a smaller angle yields a smaller margin), and the computation is complex. The additive form is a hard-margin softmax: it learns the feature mapping while guaranteeing a “hard” gap between the classes.


  • \(s\): after \(x_i\) is normalized, the feature is embedded on the unit hypersphere, which limits the representation space. With both features and weights normalized, \(\|\boldsymbol{W}_{j}\|\|\boldsymbol{x}_{i}\| \cos \left(\theta_{ij}\right)=\cos(\theta_{ij})\) has range \([-1, 1]\): the cosine similarity between \(x_i\) and each class-center vector is at most 1 and at least -1, so the differences among the components are too small. Multiplying by a factor \(s\) maps the feature onto a hypersphere of radius \(s\), enlarging the representation space and widening the differences among the components.
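Both points can be seen in a one-sample sketch of \(\mathcal{L}_{\mathrm{AM}}\) (my own illustration; \(s=30\), \(m=0.35\) are typical values from the paper, and the vectors are arbitrary):

```python
import numpy as np

def am_softmax_loss(x, W, y, s=30.0, m=0.35):
    """AM-Softmax / CosFace for one sample: s*(cos(theta_y) - m) vs s*cos(theta_j)."""
    Wn = W / np.linalg.norm(W, axis=0, keepdims=True)   # normalize class centers
    xn = x / np.linalg.norm(x)                          # normalize the feature
    cos = Wn.T @ xn                                     # cosine to each center
    logits = s * cos
    logits[y] = s * (cos[y] - m)                        # additive cosine margin
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()
    return float(-np.log(p[y]))

W = np.array([[1.0, 0.0],
              [0.0, 1.0]])
x = np.array([2.0, 0.5])
```

The margin \(m > 0\) raises the loss for the same input, and the scale \(s\) widens the logit gaps so that a correct prediction yields a much smaller loss than with \(s=1\).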


ArcFace Loss – CVPR2019

ArcFace loss, namely additive angular margin loss, is from “ArcFace: Additive Angular Margin Loss for Deep Face Recognition”.

AM-Softmax loss applies the margin to the cosine distance, while ArcFace applies the margin to the angle; its loss function is as follows,

\[\mathcal{L}_{\mathrm{AF}}=-\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{s \cdot\left(\cos \left(\theta_{y_{i}}+m\right)\right)}}{e^{s \cdot\left(\cos \left(\theta_{y_{i}}+m\right)\right)}+\sum_{j=1, j \neq y_{i}}^{n} e^{s \cdot \cos \theta_{j}}}\]
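A one-sample sketch of \(\mathcal{L}_{\mathrm{AF}}\) (my own illustration; implementations typically expand \(\cos(\theta+m)\) as \(\cos\theta\cos m - \sin\theta\sin m\), but taking `arccos` is clearer here; \(s=64\), \(m=0.5\) are typical values from the paper and the vectors are arbitrary):

```python
import numpy as np

def arcface_loss(x, W, y, s=64.0, m=0.5):
    """ArcFace for one sample: the additive margin m is applied to the angle."""
    Wn = W / np.linalg.norm(W, axis=0, keepdims=True)
    xn = x / np.linalg.norm(x)
    cos = Wn.T @ xn
    theta_y = np.arccos(np.clip(cos[y], -1.0, 1.0))
    logits = s * cos
    logits[y] = s * np.cos(theta_y + m)   # angular (additive) margin
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()
    return float(-np.log(p[y]))

W = np.array([[1.0, 0.0],
              [0.0, 1.0]])
x = np.array([2.0, 0.5])
```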


Whether the margin should be added to the cosine distance (CosFace) or to the angle (ArcFace) is analyzed in “Additive Margin Softmax for Face Verification”,


ArcFace does not compute an arccos, so the computation is not complicated; the margin is added to the angle, but what is optimized is still the cosine distance.

Another thing to note: whether the margin is added to the cosine distance or to the angle, is it really as easy as the schematic suggests to see that the intra-class distance shrinks and the inter-class distance grows?

The paper gives a mathematical description of the intra-class and inter-class distances,

\[L_{Intra}=\frac{1}{\pi N} \sum_{i=1}^{N} \theta_{y_{i}} \\
L_{Inter}=-\frac{1}{\pi N(n-1)} \sum_{i=1}^{N} \sum_{j=1, j \neq y_{i}}^{n} \arccos \left(W_{y_{i}}^{T} W_{j}\right)\]

\(W\) is a learnable parameter, and \(x_i\) also depends on the weights of earlier layers; during training both \(x_i\) and \(W\) change, driven by gradients in the direction that decreases the loss. Introducing the margin deliberately suppresses the logit of the ground-truth class, trying to “squeeze” the potential of the model: positions where softmax would already have converged must keep descending. The loss can decrease either by increasing the logit of the true class or by decreasing the other logits. So as \(x_i\) moves closer to \(W_{y_i}\), \(W_j, j\neq y_i\) may be moving away from \(x_i\); the final result may be that \(x_i\) is as close as possible to \(W_{y_i}\), while \(W_j, j\neq y_i\) is far from \(W_{y_i}\).

Euclidean distance or angular distance and normalization

Here, let me say a few words about why we normalize \(W\) and \(x\). These are largely subjective thoughts and have not been verified.

In the article “Why do feature normalization / standardization?”, we mentioned that,

The purpose of normalization / standardization is to obtain some kind of “independence”: bias independence, scale independence, length independence. When the physical and geometric meaning behind a normalization / standardization method is consistent with the needs of the current problem, it will have a positive effect on solving the problem; otherwise, it will have a negative effect.

The inner product between the feature \(x\) and each class-center vector depends on the norm of \(x\), the norm of \(W_j\), and the angle between them; all three affect the result.

  • Normalizing \(W_j\): if the training set has severe class imbalance, the network will tend to assign inputs to the classes with many images, and this is reflected in the norms of the class-center vectors: classes with more images get class-center vectors with larger norms. This was verified experimentally in “One-shot Face Recognition by Promoting Underrepresented Classes”. Therefore, normalizing the norm of \(W_j\) forces the network to treat every class equally; it injects the prior of treating everyone equally into the network and alleviates the class-imbalance problem.

  • Normalizing \(x\): for a specific \(x_i\), its inner product with each class center is \(x_i \cdot W_j = \|x_i\|\|W_j\|\cos \theta_{ij} = \|x_i\|\cos \theta_{ij}\) (after \(W_j\) is normalized). Since the inner product with every class contains \(\|x_i\|\), normalizing the norm of \(x_i\) does not change the ordering of the inner products, but it does affect the size of the loss: compare the logits \([4,1,1,1]\) with the doubled \([8,2,2,2]\); after softmax normalization, the latter has the smaller loss. From the discussion of convolution (its computation, role and motivation in convolutional neural networks), we know that patterns are contained in the convolution-kernel weights, and a feature measures the similarity to some pattern: the more similar, the larger the response. What kind of input images easily yield small-norm features? First, images with a small numerical scale, which can be handled by normalizing the input image (if that is not enough, normalizing the feature may alleviate the problem). Beyond that, blurry, large-pose, occluded face images will have smaller-norm features. These are the difficult samples in the training set; left alone, they may be drowned out by the easy samples. From this point of view, normalizing \(x\) may make the network pay more attention to these difficult samples, find more detailed features, focus on angular discrimination, and further squeeze the potential of the network. It is a bit like focal loss.
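The \([4,1,1,1]\) vs \([8,2,2,2]\) claim in the second bullet checks out numerically; a minimal sketch:

```python
import numpy as np

def softmax_ce(logits, label):
    z = np.asarray(logits, dtype=float)
    z = z - z.max()                      # subtract max for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return float(-np.log(p[label]))

# Same ordering of logits, but doubling the feature norm doubles every logit
# and shrinks the loss: the network can lower the loss just by growing norms.
loss_small_norm = softmax_ce([4, 1, 1, 1], 0)
loss_large_norm = softmax_ce([8, 2, 2, 2], 0)
```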

The network will use every means available to reduce the loss, and some of those means may not be what you expect. Adjusting the data, injecting priors, adding regularization and other measures suppress the behaviors you do not want the network to adopt and correct the way the network evolves, so that it grows in the direction you intend.


References

  • A Performance Comparison of Loss Functions for Deep Face Recognition
  • InsightFace: 2D and 3D Face Analysis Project
  • Loss functions in face recognition
  • Loss functions in face recognition (2)
  • Deep digging: the role of feature normalization, weight normalization and triplet in face recognition from the perspective of data