This article is reposted from the official account "Machine Learning Alchemy".
In short, GN is an improvement on BN and a middle ground between IN and LN.
1 Advantages of BN
Here is a brief introduction to BN; a previous article covered the BN algorithm and its procedure in detail.
BN was proposed by Google in 2015 and described very clearly in the ICML paper: in each SGD step, the activations are standardized over the mini-batch so that each dimension of the output has mean 0 and variance 1. The final "scale and shift" step is added deliberately, so that a trained BN layer can recover the original input if needed, ensuring that useful information in the data is retained.
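To make "standardize over the mini-batch, then scale and shift" concrete, here is a minimal sketch of a BN forward pass in training mode. The function name `batch_norm_train` and the `gamma`/`beta` parameter names are my own; this is not the paper's reference implementation.

```python
import torch

def batch_norm_train(x, gamma, beta, eps=1e-5):
    # x: (N, C, H, W); statistics are taken over N, H, W separately per channel
    mean = x.mean(dim=(0, 2, 3), keepdim=True)
    var = x.var(dim=(0, 2, 3), keepdim=True, unbiased=False)
    x_hat = (x - mean) / torch.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta                 # learnable "scale and shift"

x = torch.randn(8, 3, 4, 4)
gamma = torch.ones(1, 3, 1, 1)   # initialized so the layer starts as identity
beta = torch.zeros(1, 3, 1, 1)
y = batch_norm_train(x, gamma, beta)
print(y.mean(dim=(0, 2, 3)))     # per-channel means, all close to 0
```

With `gamma=1` and `beta=0` the output is exactly the standardized activations; during training the network is free to learn `gamma`/`beta` values that undo the standardization if that helps.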
[Benefits of BN]
- BN keeps the distribution of each layer's inputs relatively stable and accelerates model training;
- BN makes the model less sensitive to the network's parameters, simplifies tuning, and makes learning more stable;
- BN allows the network to use saturating activation functions (such as sigmoid and tanh) by alleviating the vanishing-gradient problem;
- BN has a certain regularization effect.
2 Disadvantages of BN
2.1 Limited by batch size
BN normalizes along the batch dimension and is therefore limited by the batch size. When the batch size is very small, BN obtains inaccurate statistical estimates, which leads to a significant increase in model error.
[Generally, a batch size of 32 per GPU works best.]
However, in object detection, semantic segmentation, video scenarios, and so on, the input images are relatively large, and GPU memory limits make a large batch size impossible. In the classic Faster R-CNN and Mask R-CNN networks, for example, the high image resolution forces a batch size of only 1 or 2.
2.2 Distribution of the training and test sets
At test time, BN uses the mean and variance estimated from the whole training set (the running averages accumulated during training). (If this part is unclear, see the detailed explanation of the BN algorithm.)
Therefore, if the data distributions of the test and training sets differ, training and testing become inconsistent.
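This train/test gap can be seen directly in PyTorch: in `train()` mode, `nn.BatchNorm2d` normalizes with the current batch's statistics and updates its running averages; in `eval()` mode it normalizes with those running averages instead. A small sketch with synthetic data (the distributions and loop count are my own choices):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm2d(3)

# "Training": feed batches centered at 2, so running_mean converges to ~2
bn.train()
for _ in range(100):
    bn(torch.randn(16, 3, 8, 8) + 2.0)

# "Testing": a batch from a different distribution (centered at 0)
bn.eval()
x_test = torch.randn(16, 3, 8, 8)
y = bn(x_test)

# eval mode subtracts running_mean (~2), so the output mean is shifted to ~-2
print(bn.running_mean)
print(y.mean())
```

If the test distribution matched the training distribution, `y.mean()` would be near 0; the shift is exactly the train/test inconsistency described above.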
3 Group Normalization
Group Normalization (GN) was proposed by Kaiming He's team in March 2018. GN addresses BN's weakness of performing poorly with small mini-batches.
GN is an alternative to BN: it divides the channels into several groups and computes the mean and variance within each group for normalization. GN's computation is independent of the batch size, so it is very stable for high-resolution images with a small batch size.
The figure below compares the model error rates of BN and GN as the batch size shrinks.
So we can use GN instead of BN~
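In PyTorch the swap is mechanical: replace `nn.BatchNorm2d(C)` with `nn.GroupNorm(num_groups, C)`. A hypothetical conv block of my own sketching, with both variants selectable:

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, norm="gn", num_groups=32):
    # Choose the normalization layer; GN needs out_ch divisible by num_groups
    if norm == "bn":
        norm_layer = nn.BatchNorm2d(out_ch)
    else:
        norm_layer = nn.GroupNorm(num_groups, out_ch)
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),
        norm_layer,
        nn.ReLU(inplace=True),
    )

# GN works even with batch size 1, where BN's batch statistics collapse
block = conv_block(3, 64, norm="gn")
y = block(torch.randn(1, 3, 32, 32))
print(y.shape)  # torch.Size([1, 64, 32, 32])
```

Note that with `norm="bn"` and batch size 1, each channel would be normalized against a single sample's statistics, which is exactly the failure mode discussed in section 2.1.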
In fact, it is not hard to see that GN is closely related to LN and IN.
There are four normalization methods. Let's start with the simplest, Instance Normalization (IN):
- IN: normalizes each channel of each sample independently, i.e. over the [H, W] dimensions. If a feature map has 10 channels, 10 means and 10 variances are computed; with a batch of 5 samples of 10 channels each, IN computes 50 means and variances in total;
- LN: normalizes all channels of a feature map together. For five 10-channel feature maps, LN produces 5 means and variances;
- GN: a method between LN and IN. With 2 groups, the 10 channels are split into two groups of 5; for five 10-channel feature maps, GN computes 10 means and variances;
- BN: computes along the batch dimension, one mean and variance per channel across the whole batch. Assuming five feature maps of 100 channels, 100 means and variances are computed, one for each channel over the five samples.
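The counts in the list above can be checked directly by taking statistics over the corresponding dimensions. The sketch below uses 5 samples of 10 channels throughout (numbers chosen to match the IN/LN/GN examples; BN then yields 10 statistics, one per channel):

```python
import torch

N, C, H, W = 5, 10, 4, 4        # 5 samples, 10 channels each
x = torch.randn(N, C, H, W)

# IN: one mean per (sample, channel) -> 5 * 10 = 50 statistics
in_mean = x.mean(dim=(2, 3))
print(in_mean.numel())           # 50

# LN: one mean per sample, all channels together -> 5 statistics
ln_mean = x.mean(dim=(1, 2, 3))
print(ln_mean.numel())           # 5

# GN with 2 groups: one mean per (sample, group) -> 5 * 2 = 10 statistics
gn_mean = x.view(N, 2, -1).mean(dim=2)
print(gn_mean.numel())           # 10

# BN: one mean per channel, taken across the whole batch -> 10 statistics
bn_mean = x.mean(dim=(0, 2, 3))
print(bn_mean.numel())           # 10
```

The same dimension choices apply to the variances; only the means are shown here for brevity.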
The GN paper also reports experiments on the recommended number of groups:
- The first table shows how GN degenerates into LN as the number of groups decreases; in practice, 32 groups gives the best results;
- The second table shows how GN degenerates into IN as the number of channels per group decreases; 16 channels per group gives the best results. Personally, I would try this setting of 16 channels per group.
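Following that recommendation, one can fix the group width at 16 channels and derive `num_groups` from the channel count. The helper function `make_gn` below is my own convenience wrapper, not part of PyTorch:

```python
import torch
import torch.nn as nn

def make_gn(num_channels, channels_per_group=16):
    # Derive num_groups from a fixed group width of 16 channels
    assert num_channels % channels_per_group == 0
    return nn.GroupNorm(num_channels // channels_per_group, num_channels)

gn = make_gn(64)               # 64 channels -> 4 groups of 16
print(gn.num_groups)           # 4
y = gn(torch.randn(2, 64, 8, 8))
print(y.shape)                 # torch.Size([2, 64, 8, 8])
```

This keeps the per-group statistics equally well-estimated at every depth of the network, instead of letting group width grow with the channel count.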
4 Implementing GN in PyTorch
```python
import torch
import torch.nn as nn

class GroupNorm(nn.Module):
    def __init__(self, num_features, num_groups=32, eps=1e-5):
        super(GroupNorm, self).__init__()
        # per-channel affine parameters, like BN's scale and shift
        self.weight = nn.Parameter(torch.ones(1, num_features, 1, 1))
        self.bias = nn.Parameter(torch.zeros(1, num_features, 1, 1))
        self.num_groups = num_groups
        self.eps = eps

    def forward(self, x):
        N, C, H, W = x.size()
        G = self.num_groups
        assert C % G == 0
        # normalize within each group of C // G channels
        x = x.view(N, G, -1)
        mean = x.mean(-1, keepdim=True)
        var = x.var(-1, keepdim=True)
        x = (x - mean) / (var + self.eps).sqrt()
        x = x.view(N, C, H, W)
        return x * self.weight + self.bias
```
Of course, you may ask whether PyTorch already ships with GN. It does. The following code compares PyTorch's built-in GN with a manual calculation:
```python
import torch
import torch.nn as nn

x = torch.randn([2, 10, 3, 3]) + 1

# PyTorch's built-in GN: 10 channels split into 2 groups of 5
m = nn.GroupNorm(num_channels=10, num_groups=2)

# mean of the first group (channels 0-4) of the first sample
firstDimenMean = torch.Tensor.mean(x[0, 0:5])
# biased variance (unbiased=False) of the same group, as GN uses
firstDimenVar = torch.Tensor.var(x[0, 0:5], False)

# manually normalize one element of that group, then scale and shift
y2 = ((x[0, 0, 0, 1] - firstDimenMean)
      / torch.pow(firstDimenVar + m.eps, 0.5)) * m.weight[0] + m.bias[0]
print(y2)

y1 = m(x)
print(m.weight)
print(m.bias)
print(y1[0, 0, 0, 1])  # matches the manual value y2
```
```
tensor(0.4595, grad_fn=<AddBackward0>)
Parameter containing:
tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.], requires_grad=True)
Parameter containing:
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], requires_grad=True)
tensor(0.4595, grad_fn=<SelectBackward>)
```