**Batch normalization**

Training a model is not easy, especially for very complex models, which often fail to converge well. Preprocessing the data and applying batch normalization can improve convergence considerably, and this is one important reason why convolutional networks can be trained with very many layers.

**Data preprocessing**

The most common data preprocessing methods today are centering and standardization. Centering shifts the data so that it is centered at the origin: subtract the corresponding mean from each feature dimension, which yields zero-mean features. Standardization is just as simple. Once the data has zero mean, the different feature dimensions can be put on the same scale either by dividing each one by its standard deviation, so that it approximates a standard normal distribution, or by rescaling it to the range [-1, 1] using its maximum and minimum values. Both methods are very common, and we already used them in the neural network section to standardize the data. Other methods, such as PCA or whitening, are used much less often.
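As a minimal sketch of the methods described above, the snippet below centers, standardizes, and rescales a small tensor. The toy values in `data` and the helper names `center`, `standardize`, and `scale_to_unit_range` are made up for illustration and do not appear in the rest of this article.

```python
import torch

# Toy data: 4 samples, 2 features on very different scales (hypothetical values).
data = torch.tensor([[1.0, 100.0],
                     [2.0, 200.0],
                     [3.0, 300.0],
                     [4.0, 400.0]])

def center(x):
    # Subtract the per-feature mean so that every column has zero mean.
    return x - x.mean(dim=0, keepdim=True)

def standardize(x, eps=1e-8):
    # Center, then divide by the per-feature (population) standard deviation.
    x_centered = center(x)
    std = x.std(dim=0, unbiased=False, keepdim=True)
    return x_centered / (std + eps)

def scale_to_unit_range(x):
    # Map each feature linearly to [-1, 1] using its minimum and maximum.
    x_min = x.min(dim=0, keepdim=True).values
    x_max = x.max(dim=0, keepdim=True).values
    return 2 * (x - x_min) / (x_max - x_min) - 1

standardized = standardize(data)
scaled = scale_to_unit_range(data)
print(standardized.mean(dim=0))                 # close to 0 for each feature
print(standardized.std(dim=0, unbiased=False))  # close to 1 for each feature
print(scaled.min(dim=0).values, scaled.max(dim=0).values)
```

After either transformation the two features live on the same scale, even though the raw values differ by a factor of 100.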

**Batch Normalization**

The preprocessing above tries to give the model inputs whose features are uncorrelated and follow a standard normal distribution, and models generally perform well on such inputs. For deep networks, however, the nonlinear layers make the outputs correlated again, so they no longer follow a standard N(0, 1) distribution, and even the center of the outputs shifts. This makes training difficult, especially for deep models.

Therefore, in 2015, a paper proposed batch normalization. In short, it normalizes the output of each layer so that it approximately follows a standard normal distribution. The next layer then also receives standardized input, which makes training easier and speeds up convergence. The implementation of batch normalization is very simple. For a given batch of data $B = \{x_1, x_2, \dots, x_m\}$, the algorithm is as follows:

$$
\begin{aligned}
\mu_B &= \frac{1}{m} \sum_{i=1}^{m} x_i \\
\sigma_B^2 &= \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2 \\
\hat{x}_i &= \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \\
y_i &= \gamma \hat{x}_i + \beta
\end{aligned}
$$

The first and second lines compute the mean and variance of the data in the batch. The third formula then standardizes each data point in the batch, where $\epsilon$ is a small constant introduced for numerical stability (the code below uses 1e-5). Finally, the output is obtained by scaling and shifting the result with the learnable parameters $\gamma$ and $\beta$.

Let's implement the simple one-dimensional case, that is, the case that arises in an ordinary fully connected neural network.

```
import torch

def simple_batch_norm_1d(x, gamma, beta):
    eps = 1e-5
    x_mean = torch.mean(x, dim=0, keepdim=True)  # keep the dimension for broadcasting
    x_var = torch.mean((x - x_mean) ** 2, dim=0, keepdim=True)
    x_hat = (x - x_mean) / torch.sqrt(x_var + eps)
    return gamma.view_as(x_mean) * x_hat + beta.view_as(x_mean)

# torch.mean needs a floating-point tensor, so create the data as float32
x = torch.arange(15, dtype=torch.float32).view(5, 3)
gamma = torch.ones(x.shape[1])
beta = torch.zeros(x.shape[1])
print('before bn: ')
print(x)
y = simple_batch_norm_1d(x, gamma, beta)
print('after bn: ')
print(y)
```

As you can see, there are five data points and three features, and each column holds the values of one feature across the data points. After batch normalization, each column has zero mean and unit variance. This raises a question: should batch normalization also be used at test time? The answer is yes, because using it during training but not during testing would certainly bias the results. But if there is only a single sample at test time, its mean is the sample itself and its variance is 0, which is clearly unreasonable. So at test time we do not compute the mean and variance from the test data. Instead, we use the moving averages of the mean and variance computed during training.
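The single-sample problem described above is easy to reproduce. The following sketch uses made-up toy values (including the hypothetical `moving_mean` and `moving_var` statistics) to show that normalizing a batch of one sample with its own statistics wipes out the input, while normalizing with stored statistics does not.

```python
import torch

eps = 1e-5

# A "batch" that contains a single sample with three features (toy values).
single = torch.tensor([[2.0, -1.0, 5.0]])

# Batch statistics of a single sample: the mean is the sample, the variance is 0.
mean = single.mean(dim=0, keepdim=True)
var = ((single - mean) ** 2).mean(dim=0, keepdim=True)
normalized = (single - mean) / torch.sqrt(var + eps)
print(var)         # all zeros
print(normalized)  # all zeros: the information in the input is lost

# Statistics that would have been accumulated during training (hypothetical values).
moving_mean = torch.tensor([1.0, 0.0, 3.0])
moving_var = torch.tensor([4.0, 1.0, 4.0])
with_moving_stats = (single - moving_mean) / torch.sqrt(moving_var + eps)
print(with_moving_stats)  # roughly [0.5, -1.0, 1.0]
```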

Below we implement a version of batch normalization that distinguishes between training mode and test mode.

```
def batch_norm_1d(x, gamma, beta, is_training, moving_mean, moving_var, moving_momentum=0.1):
    eps = 1e-5
    x_mean = torch.mean(x, dim=0, keepdim=True)  # keep the dimension for broadcasting
    x_var = torch.mean((x - x_mean) ** 2, dim=0, keepdim=True)
    if is_training:
        # normalize with the statistics of the current batch
        x_hat = (x - x_mean) / torch.sqrt(x_var + eps)
        # update the moving statistics in place; detach keeps them out of the autograd graph
        moving_mean[:] = moving_momentum * moving_mean + (1. - moving_momentum) * x_mean.detach().squeeze(0)
        moving_var[:] = moving_momentum * moving_var + (1. - moving_momentum) * x_var.detach().squeeze(0)
    else:
        # normalize with the statistics accumulated during training
        x_hat = (x - moving_mean) / torch.sqrt(moving_var + eps)
    return gamma.view_as(x_mean) * x_hat + beta.view_as(x_mean)
```
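One way to sanity-check the training-mode formulas is to compare them with PyTorch's built-in `torch.nn.functional.batch_norm`. The sketch below uses made-up toy tensors and recomputes the normalization inline so that it runs on its own. It is a check under these assumptions, not part of the original experiment.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(8, 4)        # 8 samples, 4 features (toy data)
gamma = torch.rand(4) + 0.5  # arbitrary scale
beta = torch.randn(4)        # arbitrary shift
eps = 1e-5

# Hand-written training-mode batch normalization, using the formulas above.
mean = x.mean(dim=0, keepdim=True)
var = ((x - mean) ** 2).mean(dim=0, keepdim=True)
manual = gamma * (x - mean) / torch.sqrt(var + eps) + beta

# Built-in functional version; the running buffers are updated in place.
running_mean = torch.zeros(4)
running_var = torch.ones(4)
builtin = F.batch_norm(x, running_mean, running_var, weight=gamma, bias=beta,
                       training=True, momentum=0.1, eps=eps)

outputs_match = torch.allclose(manual, builtin, atol=1e-5)
print(outputs_match)

# PyTorch's convention: new = (1 - momentum) * old + momentum * batch_statistic.
running_mean_matches = torch.allclose(running_mean, 0.1 * mean.squeeze(0), atol=1e-6)
print(running_mean_matches)
```

Note that the momentum conventions differ. In `batch_norm_1d` above, `moving_momentum` weights the old value, so the default of 0.1 gives the new batch a weight of 0.9. In PyTorch, `momentum` weights the new batch statistic, so 0.1 gives the new batch a weight of 0.1.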

Next, we use a neural network to classify the MNIST dataset and test whether batch normalization helps.

```
import numpy as np
from torchvision.datasets import mnist  # import the MNIST dataset bundled with torchvision
from torch.utils.data import DataLoader
from torch import nn
from torch.autograd import Variable
```

Download MNIST dataset using built-in functions

```
train_set = mnist.MNIST('./data', train=True, download=True)
test_set = mnist.MNIST('./data', train=False, download=True)

def data_tf(x):
    x = np.array(x, dtype='float32') / 255
    x = (x - 0.5) / 0.5  # data preprocessing: standardize to [-1, 1]
    x = x.reshape((-1,))  # flatten
    x = torch.from_numpy(x)
    return x

# reload the datasets, this time with the data transformation defined above
train_set = mnist.MNIST('./data', train=True, transform=data_tf, download=True)
test_set = mnist.MNIST('./data', train=False, transform=data_tf, download=True)
train_data = DataLoader(train_set, batch_size=64, shuffle=True)
test_data = DataLoader(test_set, batch_size=128, shuffle=False)

class multi_network(nn.Module):
    def __init__(self):
        super(multi_network, self).__init__()
        self.layer1 = nn.Linear(784, 100)
        self.relu = nn.ReLU(True)
        self.layer2 = nn.Linear(100, 10)

        self.gamma = nn.Parameter(torch.randn(100))
        self.beta = nn.Parameter(torch.randn(100))

        # buffers are not trained, but they follow the module across devices
        self.register_buffer('moving_mean', torch.zeros(100))
        self.register_buffer('moving_var', torch.zeros(100))

    def forward(self, x, is_train=None):
        # by default, follow the mode set by net.train() / net.eval()
        if is_train is None:
            is_train = self.training
        x = self.layer1(x)
        x = batch_norm_1d(x, self.gamma, self.beta, is_train, self.moving_mean, self.moving_var)
        x = self.relu(x)
        x = self.layer2(x)
        return x

net = multi_network()

# define the loss function
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(net.parameters(), 1e-1)  # stochastic gradient descent, learning rate 0.1

from datetime import datetime

def get_acc(output, label):
    total = output.shape[0]
    _, pred_label = output.max(1)
    num_correct = (pred_label == label).sum().item()
    return num_correct / total

# define the training function
def train(net, train_data, valid_data, num_epochs, optimizer, criterion):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    net = net.to(device)
    prev_time = datetime.now()
    for epoch in range(num_epochs):
        train_loss = 0
        train_acc = 0
        net = net.train()
        for im, label in train_data:
            im = im.to(device)
            label = label.to(device)
            # forward
            output = net(im)
            loss = criterion(output, label)
            # backward
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            train_loss += loss.item()
            train_acc += get_acc(output, label)

        cur_time = datetime.now()
        h, remainder = divmod((cur_time - prev_time).seconds, 3600)
        m, s = divmod(remainder, 60)
        time_str = "Time %02d:%02d:%02d" % (h, m, s)
        if valid_data is not None:
            valid_loss = 0
            valid_acc = 0
            net = net.eval()
            with torch.no_grad():  # no gradients are needed for evaluation
                for im, label in valid_data:
                    im = im.to(device)
                    label = label.to(device)
                    output = net(im)
                    loss = criterion(output, label)
                    valid_loss += loss.item()
                    valid_acc += get_acc(output, label)
            epoch_str = (
                "Epoch %d. Train Loss: %f, Train Acc: %f, Valid Loss: %f, Valid Acc: %f, "
                % (epoch, train_loss / len(train_data),
                   train_acc / len(train_data), valid_loss / len(valid_data),
                   valid_acc / len(valid_data)))
        else:
            epoch_str = ("Epoch %d. Train Loss: %f, Train Acc: %f, " %
                         (epoch, train_loss / len(train_data),
                          train_acc / len(train_data)))
        prev_time = cur_time
        print(epoch_str + time_str)

train(net, train_data, test_data, 10, optimizer, criterion)
```

Here γ and β are trained as parameters and are initialized from a random Gaussian distribution, while moving_mean and moving_var are initialized to 0 and are not updated as parameters. After training for 10 epochs, we can look at how much the moving mean and moving variance have changed.

Print the first 10 entries of moving_mean, then train the same network without batch normalization for comparison:

```
print(net.moving_mean[:10])

# for comparison, the same network without batch normalization
no_bn_net = nn.Sequential(
    nn.Linear(784, 100),
    nn.ReLU(True),
    nn.Linear(100, 10)
)

optimizer = torch.optim.SGD(no_bn_net.parameters(), 1e-1)  # stochastic gradient descent, learning rate 0.1
train(no_bn_net, train_data, test_data, 10, optimizer, criterion)
```

As you can see, although the final results are roughly the same in the two cases, the earlier epochs show that the network with batch normalization converges faster. Because this is only a small network, it converges with or without batch normalization. For deeper networks, however, batch normalization makes training converge much more quickly.

Above we implemented batch normalization for 2-dimensional input. The 4-dimensional case that corresponds to convolution is similar: we only need to compute the mean and variance separately for each channel. Implementing batch normalization ourselves is tedious, though, and PyTorch of course provides built-in layers for it: torch.nn.BatchNorm1d() and torch.nn.BatchNorm2d() for the one-dimensional and two-dimensional cases respectively. As in our implementation, PyTorch trains γ and β as parameters. Its running mean and variance (running_mean and running_var) are stored as buffers and are updated by a moving average during training rather than by gradient descent.
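To make the built-in layer a little more concrete, here is a small sketch with `torch.nn.BatchNorm2d` on random toy input (the shapes and values are made up for illustration). It lists which tensors are learnable parameters and which are buffers, and it checks that in training mode each channel is normalized over the batch and spatial dimensions.

```python
import torch
from torch import nn

bn = nn.BatchNorm2d(3)  # one scale and one shift per channel

# Learnable parameters: weight (gamma) and bias (beta).
param_names = [name for name, _ in bn.named_parameters()]
# Buffers: updated by a moving average, not by the optimizer.
buffer_names = [name for name, _ in bn.named_buffers()]
print(param_names)
print(buffer_names)

torch.manual_seed(0)
x = torch.randn(16, 3, 8, 8) * 2 + 5  # (batch, channel, height, width), toy data

bn.train()
y = bn(x)
# In training mode each channel is normalized over the batch and spatial dimensions.
channel_mean = y.mean(dim=(0, 2, 3))
channel_var = y.var(dim=(0, 2, 3), unbiased=False)
print(channel_mean)  # close to 0 per channel
print(channel_var)   # close to 1 per channel

bn.eval()
y_eval = bn(x)  # uses running_mean / running_var instead of the batch statistics
print(bn.running_mean)  # moved away from its initial value of 0 by the training pass
```

Calling `bn.train()` and `bn.eval()` plays the same role as the `is_training` flag in the hand-written version.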

**Let's try batch normalization in a convolutional network to see the effect**

```
def data_tf(x):
    x = np.array(x, dtype='float32') / 255
    x = (x - 0.5) / 0.5  # data preprocessing: standardize to [-1, 1]
    x = torch.from_numpy(x)
    x = x.unsqueeze(0)  # add a channel dimension: (28, 28) -> (1, 28, 28)
    return x

# reload the datasets, this time with the data transformation defined above
train_set = mnist.MNIST('./data', train=True, transform=data_tf, download=True)
test_set = mnist.MNIST('./data', train=False, transform=data_tf, download=True)
train_data = DataLoader(train_set, batch_size=64, shuffle=True)
test_data = DataLoader(test_set, batch_size=128, shuffle=False)
```

**With batch normalization**

```
class conv_bn_net(nn.Module):
    def __init__(self):
        super(conv_bn_net, self).__init__()
        self.stage1 = nn.Sequential(
            nn.Conv2d(1, 6, 3, padding=1),
            nn.BatchNorm2d(6),
            nn.ReLU(True),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(6, 16, 5),
            nn.BatchNorm2d(16),
            nn.ReLU(True),
            nn.MaxPool2d(2, 2)
        )

        self.classfy = nn.Linear(400, 10)  # 16 channels * 5 * 5 = 400 features

    def forward(self, x):
        x = self.stage1(x)
        x = x.view(x.shape[0], -1)  # flatten to (batch, 400)
        x = self.classfy(x)
        return x

net = conv_bn_net()
optimizer = torch.optim.SGD(net.parameters(), 1e-1)  # stochastic gradient descent, learning rate 0.1
train(net, train_data, test_data, 5, optimizer, criterion)
```

**Without batch normalization**

```
class conv_no_bn_net(nn.Module):
    def __init__(self):
        super(conv_no_bn_net, self).__init__()
        self.stage1 = nn.Sequential(
            nn.Conv2d(1, 6, 3, padding=1),
            nn.ReLU(True),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(6, 16, 5),
            nn.ReLU(True),
            nn.MaxPool2d(2, 2)
        )

        self.classfy = nn.Linear(400, 10)  # 16 channels * 5 * 5 = 400 features

    def forward(self, x):
        x = self.stage1(x)
        x = x.view(x.shape[0], -1)  # flatten to (batch, 400)
        x = self.classfy(x)
        return x

net = conv_no_bn_net()
optimizer = torch.optim.SGD(net.parameters(), 1e-1)  # stochastic gradient descent, learning rate 0.1
train(net, train_data, test_data, 5, optimizer, criterion)
```
