Abstract: This article introduces the spatial transformer network and explains in detail how it works. In the end, the whole mechanism reduces to two familiar concepts: affine transformation and bilinear interpolation.

`Spatial transformer networks implemented with deep learning: Part 1`

The first part introduced two key concepts, affine transformation and bilinear interpolation, both of which are essential for understanding spatial transformer networks.

In this article, we walk through the spatial transformer network paper in detail. The model was first proposed by Max Jaderberg, Karen Simonyan, Andrew Zisserman and Koray Kavukcuoglu, researchers at Google DeepMind.

After reading this article, you should have a clear understanding of the model. We will implement the network in TensorFlow in the third part.

## Objective

In a classification task, we usually want the system to be robust to variations in its input. In other words, if the input undergoes some “transformation”, the classification model should in theory output the same class label as it did before the transformation. In general, image classification models face the following “challenges”:

- Scale variation: objects vary in size, both in the real world and in images.
- Viewpoint variation: the orientation of an object changes with the angle of the observer.
- Deformation: non-rigid bodies can deform and twist into unusual shapes.

For humans, it’s easy to categorize the objects in the figure above. A computer algorithm, however, only sees the raw 3-D array of brightness values, so even a tiny change in the input image can change the pixel values in that array. An ideal image classification model should therefore be able, in theory, to disentangle the pose and deformation of an object from its texture and shape, as in the following cat image.

Wouldn’t it be ideal if our model could go from the left image to the right one using some combination of operations, so as to simplify the subsequent classification task?

## Pooling layer

It turns out that the pooling layers we use in neural network architectures already give the model a certain degree of spatial invariance. Pooling is also a downsampling mechanism: it reduces the spatial size of the feature maps layer by layer as the network gets deeper, which cuts the number of parameters and the computation cost.

The pooling layer downsamples the array spatially. On the left of the figure above, an input array of size [224 × 224 × 64] is max pooled with a filter of size 2 and stride 2, giving an output of size [112 × 112 × 64]. The right side of the figure shows 2 × 2 max pooling applied to an array.
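The shape arithmetic above can be checked with a minimal NumPy sketch. The helper name `max_pool_2x2` is hypothetical, and the input is random data, not the array from the figure.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on an array of shape (H, W, C); H and W must be even."""
    h, w, c = x.shape
    # Split each spatial axis into (blocks, 2) and take the max inside every 2x2 block.
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

x = np.random.rand(224, 224, 64)   # stand-in for the input array (random values)
y = max_pool_2x2(x)
print(x.shape, "->", y.shape)      # (224, 224, 64) -> (112, 112, 64)
```

The depth (64 channels) is unchanged; only the two spatial dimensions are halved.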

How does pooling provide invariance? The idea is to take a complex input, split it into cells, and “pool” the information within each cell to produce a simpler set of cells that describe the output. For example, suppose we have three pictures of the digit 7, each with a different orientation. By aggregating pixel values, we capture approximately the same information, so pooling over the grid of each image detects the 7 regardless of where the digit sits in the grid.
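A toy illustration of this effect, assuming 2 × 2 max pooling with stride 2 on a 4 × 4 image (the helper and the single-pixel "feature" are made up for the example): a shift that stays inside one pooling window leaves the pooled output unchanged, while a larger shift does not.

```python
import numpy as np

def max_pool_2x2(img):
    """2x2 max pooling with stride 2 on a 2-D array with even height and width."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

a = np.zeros((4, 4))
a[0, 0] = 1.0   # feature in the top-left corner of a pooling window
b = np.zeros((4, 4))
b[1, 1] = 1.0   # same feature shifted by one pixel, still inside that window
c = np.zeros((4, 4))
c[2, 2] = 1.0   # a larger shift moves the feature into another window

same_small_shift = np.array_equal(max_pool_2x2(a), max_pool_2x2(b))
same_large_shift = np.array_equal(max_pool_2x2(a), max_pool_2x2(c))
print(same_small_shift, same_large_shift)  # True False
```

The invariance is therefore only local: it covers shifts no larger than the pooling window.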

Pooling has fallen out of favor because of the following limitations. First, pooling is destructive. With 2 × 2 pooling, 75% of the feature activations are discarded, which means we lose exact location information. As mentioned before, pooling gives the network some spatial robustness, but location information is particularly important in visual recognition. Think of the cat classifier mentioned above: it may be important to know where the whiskers are relative to the nose. When we use max pooling, this information is lost.

Another limitation of pooling is that it is local and predefined. Because its receptive field is small, the effect of a pooling operation is only felt in the deeper layers of the network, which means intermediate feature maps may still suffer from large input distortions. Remember that we can’t simply enlarge the receptive field at will, because doing so would downsample our feature maps too aggressively.

The main takeaway is that convolutional networks are not invariant to relatively large input distortions. This limitation stems from having only a restricted, predefined pooling mechanism for dealing with spatial variation in the data. This is where spatial transformer networks come in!

Geoffrey Hinton once said: the pooling operation used in convolutional neural networks is a big mistake, and the fact that it works so well is a disaster.

## Spatial transformer networks (STNs)

The spatial transformer mechanism addresses the problems above by giving convolutional neural networks explicit spatial transformation capabilities. It has three defining properties.

- Modularity: a spatial transformer can be plugged into any part of an existing architecture with only minor adjustments.
- Differentiability: it can be trained with backpropagation, allowing end-to-end training of the model it is inserted into.
- Dynamic: it actively applies a spatial transformation to the feature map of each individual input sample, whereas a pooling layer acts identically on all input samples.

As you can see, spatial transformers are superior to pooling operators in all three respects. So, what is a spatial transformer?

As shown in the figure above, the spatial transformer module consists of three parts: a localization network, a grid generator and a sampler. Importantly, we do not apply the affine transformation to the input image directly. Instead, we first create a sampling grid, transform the grid, and then use it to sample the input image. Let’s look at each core component of the spatial transformer in turn.

## Localization network

The localization network takes the input feature map and outputs the affine transformation parameters θ. It is defined as follows:

1. Input: feature map U with shape (H, W, C).

2. Output: transformation parameters θ with shape (6,).

3. Architecture: a fully connected network or a convolutional network.
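The definition above can be sketched as a tiny fully connected network in NumPy. Everything beyond the input and output shapes is an assumption made for illustration: the hidden size, the `tanh` activation, the function names, and in particular the choice to initialize the last layer so that θ starts as the identity transform (a common practice, not something stated in this article).

```python
import numpy as np

def init_localization_net(h, w, c, hidden=32, seed=0):
    """Create parameters for a hypothetical two-layer fully connected localization net."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 0.01, size=(h * w * c, hidden)),
        "b1": np.zeros(hidden),
        # Assumption: zero weights plus an identity bias, so the initial
        # output is the identity affine transform [[1, 0, 0], [0, 1, 0]].
        "W2": np.zeros((hidden, 6)),
        "b2": np.array([1.0, 0.0, 0.0, 0.0, 1.0, 0.0]),
    }

def localization_net(U, params):
    """Map a feature map U of shape (H, W, C) to theta of shape (6,)."""
    x = U.reshape(-1)                                    # flatten the feature map
    hidden = np.tanh(x @ params["W1"] + params["b1"])    # hidden layer
    return hidden @ params["W2"] + params["b2"]          # 6 affine parameters

U = np.random.rand(28, 28, 1)            # random stand-in for a feature map
params = init_localization_net(28, 28, 1)
theta = localization_net(U, params)
print(theta.shape)                       # (6,)
print(theta.reshape(2, 3))               # identity affine matrix before any training
```

Starting from the identity means the module initially passes its input through unchanged, and training then moves θ away from it.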

As the network trains, we want the localization network to output more and more accurate values of θ. What does accurate mean here? Imagine a digit 7 rotated 90 degrees counterclockwise. After, say, two epochs of training, the localization network might output a transformation matrix that performs a 45-degree clockwise rotation. After five epochs, it might learn to perform the full 90-degree clockwise rotation. The output image then looks like a standard digit 7, which the network has seen in the training data and can classify easily.
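To make the rotation example concrete, here is a small sketch of how a rotation can be written as the six parameters θ, and how two 45-degree rotations compose into one 90-degree rotation. The helper names are hypothetical. Whether a positive angle looks clockwise or counterclockwise depends on the image axis convention, so the sketch makes no claim about direction.

```python
import numpy as np

def rotation_theta(degrees):
    """Six affine parameters (row-major 2x3 matrix) for a pure rotation."""
    a = np.deg2rad(degrees)
    return np.array([np.cos(a), -np.sin(a), 0.0,
                     np.sin(a),  np.cos(a), 0.0])

def compose(theta_a, theta_b):
    """Affine parameters equivalent to applying theta_b first, then theta_a."""
    # Lift both 2x3 matrices to 3x3 homogeneous form so they can be multiplied.
    A = np.vstack([theta_a.reshape(2, 3), [0.0, 0.0, 1.0]])
    B = np.vstack([theta_b.reshape(2, 3), [0.0, 0.0, 1.0]])
    return (A @ B)[:2].reshape(-1)

half = rotation_theta(45)      # a partially correct estimate
full = compose(half, half)     # two such steps
print(np.allclose(full, rotation_theta(90)))  # True
```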

Another way to look at it is that the localization network learns to store, in the weights of its layers, the knowledge of how to transform each training sample.

## Parameterized sampling grid

The grid generator outputs a parameterized sampling grid: a set of points at which the input feature map should be sampled to produce the desired transformed output.

Specifically, the grid generator first creates a normalized grid with the same size as the input image U, that is, (H, W). This grid is a set of indices (x^t, y^t) covering the entire input feature map, where the superscript t denotes target coordinates in the output feature map. Because we are applying an affine transformation to this grid and want to express translation as part of the same matrix product, we add a row of ones to the coordinate vector. Finally, we reshape the 6 parameters θ into a 2 × 3 matrix and multiply it with the coordinate vector, which gives the parameterized sampling grid we need.
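Written out, the multiplication takes the form below. The source coordinates $(x^s, y^s)$ and the subscripts on θ are notation assumed here for illustration, following the usual presentation of the spatial transformer paper.

$$
\begin{bmatrix} x^s \\ y^s \end{bmatrix}
=
\begin{bmatrix}
\theta_{11} & \theta_{12} & \theta_{13} \\
\theta_{21} & \theta_{22} & \theta_{23}
\end{bmatrix}
\begin{bmatrix} x^t \\ y^t \\ 1 \end{bmatrix}
$$

Each target coordinate $(x^t, y^t)$ in the output is mapped to a source coordinate $(x^s, y^s)$ in the input, and the third column of the matrix carries the translation.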

The resulting column vector contains a set of indices that tell us where to sample the input to obtain the desired transformed output. But what if these indices are fractional? This is exactly why bilinear interpolation comes next.
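A minimal NumPy sketch of a grid generator along these lines is shown below. The function name `affine_grid` and the choice of normalizing coordinates to [-1, 1] are assumptions made for the example, not details taken from this article.

```python
import numpy as np

def affine_grid(theta, h, w):
    """Return source coordinates (x_s, y_s), each of shape (h, w), in normalized units."""
    # Normalized target grid covering the whole output feature map (assumed range [-1, 1]).
    x_t, y_t = np.meshgrid(np.linspace(-1.0, 1.0, w), np.linspace(-1.0, 1.0, h))
    # Homogeneous coordinates: stack x, y and a row of ones -> shape (3, h*w).
    target = np.stack([x_t.ravel(), y_t.ravel(), np.ones(h * w)])
    # (2, 3) @ (3, h*w) -> (2, h*w): where to sample the input for each output pixel.
    source = np.asarray(theta, dtype=float).reshape(2, 3) @ target
    return source[0].reshape(h, w), source[1].reshape(h, w)

# Identity transform: the sampling grid equals the target grid.
x_s, y_s = affine_grid([1, 0, 0, 0, 1, 0], h=3, w=3)

# Scaling by 0.5: the grid only covers the central half of the input (a zoom-in).
x_z, y_z = affine_grid([0.5, 0, 0, 0, 0.5, 0], h=3, w=3)

print(x_s[0], x_z[0])
```

With 3 sample points per axis the zoomed grid lands on -0.5, 0 and 0.5, which in general fall between pixel centers of the input.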

## Differentiable image sampling

Because bilinear interpolation is differentiable, it is well suited to spatial transformer networks. Given the input feature map and the parameterized sampling grid, we perform bilinear sampling and obtain the output feature map V with shape (H', W', C'). This means we can downsample or upsample simply by specifying the shape of the sampling grid. We are not restricted to bilinear sampling and can use other sampling kernels, but the kernel must be differentiable so that loss gradients can flow all the way back to the localization network.
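The sampling step can be sketched in NumPy as follows. This is an illustrative forward pass only (no gradients), and it assumes normalized coordinates in [-1, 1] and clamping at the image border. Both are choices made for this example, and the function name is hypothetical.

```python
import numpy as np

def bilinear_sample(U, x_s, y_s):
    """Sample U (H, W, C) at normalized coords x_s, y_s (each H', W') -> V (H', W', C)."""
    H, W, _ = U.shape
    # Map normalized [-1, 1] coordinates to pixel coordinates [0, W-1] and [0, H-1].
    x = (x_s + 1.0) * (W - 1) / 2.0
    y = (y_s + 1.0) * (H - 1) / 2.0
    # Integer corners around each sampling point, clamped to the image border.
    x0 = np.clip(np.floor(x).astype(int), 0, W - 1)
    y0 = np.clip(np.floor(y).astype(int), 0, H - 1)
    x1 = np.clip(x0 + 1, 0, W - 1)
    y1 = np.clip(y0 + 1, 0, H - 1)
    x = np.clip(x, 0, W - 1)
    y = np.clip(y, 0, H - 1)
    # Fractional offsets are the interpolation weights (broadcast over channels).
    wx = (x - x0)[..., None]
    wy = (y - y0)[..., None]
    top = (1 - wx) * U[y0, x0] + wx * U[y0, x1]
    bottom = (1 - wx) * U[y1, x0] + wx * U[y1, x1]
    return (1 - wy) * top + wy * bottom

U = np.arange(16, dtype=float).reshape(4, 4, 1)

# Identity grid with the input's shape reproduces the input.
gx, gy = np.meshgrid(np.linspace(-1, 1, 4), np.linspace(-1, 1, 4))
V_same = bilinear_sample(U, gx, gy)

# A 2x2 grid over the same range downsamples to shape (2, 2, 1).
gx2, gy2 = np.meshgrid(np.linspace(-1, 1, 2), np.linspace(-1, 1, 2))
V_small = bilinear_sample(U, gx2, gy2)

# A fractional location: the image center averages the four middle pixels.
center = bilinear_sample(U, np.array([[0.0]]), np.array([[0.0]]))
print(V_same.shape, V_small.shape, center[0, 0, 0])
```

The output shape follows the grid, not the input, which is how the same sampler handles both downsampling and upsampling.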

The figure shows two examples of applying a parameterized sampling grid to an image U to produce the output V: (a) the identity transformation (i.e. U = V); (b) an affine transformation (here, a rotation).

That is the inner working of the spatial transformer, and it reduces to the two key concepts we have been discussing: affine transformation and bilinear interpolation. We let the network learn, on its own, the affine transformation parameters that best help it complete the classification task.

**Fun with spatial transformers**

Finally, we give two examples that illustrate spatial transformers in action.

**Distorted MNIST data set**

The following figure shows the result of classifying digits from the distorted MNIST data set using a spatial transformer as the first layer of a fully connected network.

Notice how it has learned to do exactly what we wanted from an ideal “robust” image classification model? By zooming in and eliminating background clutter, it “standardizes” the input before classification.

**German Traffic Sign Recognition Benchmark (GTSRB) data set**

Left: the behavior of the spatial transformer during training. Notice how it gradually removes the background and learns to focus on the traffic sign. Right: the output for different input images. Note that the output stays approximately constant regardless of the variation and distortion in the input.

## Summary

This article gave an overview of Google DeepMind’s spatial transformer network paper. We first introduced the challenges that classification models face, which mainly come from distortions of the input image. One solution is to use pooling layers, but they have clear limitations and have become less popular. Another solution is the spatial transformer network.

It consists of a differentiable module that can be inserted anywhere in a convolutional network to increase its geometric invariance. It gives the network the ability to spatially transform feature maps at no extra data or supervision cost. Finally, the whole mechanism boils down to two familiar concepts: affine transformation and bilinear interpolation.

Article title: Deep Learning Paper Implementations: Spatial Transformer Networks - Part II
