Time: 2019-12-2

# Capsule network

Author: Lin Zelong

# 1. Background

CNNs perform very well in image classification. They have completed many remarkable tasks and surpassed humans on some benchmarks, which has had a significant impact on the whole field of machine learning. At its core, a CNN consists of a large number of vector and matrix multiplications and additions, so the computational cost of the network is high, and passing all the pixel information of an image on to the next layer is very expensive. Convolution and pooling were introduced to simplify the computation of the network while preserving the essential information in the data.

CNNs do achieve very good results when classifying images that are close to those in the dataset, but if an image is flipped, tilted or otherwise differently oriented, the performance of a convolutional neural network degrades. This problem can be mitigated by flipping and translating the same image during training (data augmentation). The root of the problem is that the filters in the network understand the image at the level of fine local details rather than overall structure. As a simple example, the components of a face include the facial contour, two eyes, a nose and a mouth. For a CNN, the presence of these parts is enough to recognize a face, while the relative position and orientation of the components matter much less.

The main reason is that when people recognize images, they work from top to bottom in a tree-like fashion, while a CNN abstracts information step by step from bottom to top through layers of filtering. In the view of the authors of the capsule network, this is the biggest difference between human recognition and CNNs.

This article is based on the paper "Dynamic Routing Between Capsules", published by Hinton et al. in 2017, which introduces a neural network that makes up for the CNN's inability to handle the position and orientation of objects in an image, among other shortcomings. Compared with a CNN, this network is closer to the way humans recognize images.

Figure 1: The difference between human image recognition and CNN

# 2. How the capsule network works

The fundamental difference between a capsule network and a traditional artificial neural network is the basic unit of the network. In a traditional neural network, the computation of a neuron can be divided into the following three steps:

1. Each scalar input is multiplied by a scalar weight.
2. The weighted scalars are summed.
3. A scalar-to-scalar non-linearity is applied to the sum.

For capsules, the calculation is divided into the following four steps:

1. Multiply the input vectors by the weights. The inputs $v_1$ and $v_2$ are the outputs of capsules in the previous layer. Within a single capsule, $v_1$ and $v_2$ are multiplied by $w_1$ and $w_2$ respectively to get the new vectors $u_1$ and $u_2$.
2. Weight the resulting vectors by scalars: $u_1$ and $u_2$ are multiplied by $c_1$ and $c_2$ respectively, where $c_1$ and $c_2$ are scalars and $c_1 + c_2 = 1$.
3. Sum the weighted vectors to get $s = c_1 u_1 + c_2 u_2$.
4. Apply a vector-to-vector non-linear transformation to $s$. The transformed vector is the output of this capsule and can be used as the input of the next capsule.

Figure 2: Operation of a single capsule
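The three neuron steps and the four capsule steps above can be contrasted in a short NumPy sketch. It is illustrative only: the function names, the vector dimensions (8 in, 16 out), the ReLU in the scalar neuron and the fixed coefficients $c_1 = c_2 = 0.5$ are assumptions made for this example. The non-linearity is the squash function from the paper, $\mathrm{squash}(s) = \frac{\|s\|^2}{1 + \|s\|^2} \frac{s}{\|s\|}$.

```python
import numpy as np

def scalar_neuron(x, w):
    """Traditional neuron: weight the scalars, sum them, apply a scalar non-linearity."""
    z = float(np.dot(w, x))   # steps 1 and 2: weight each input and sum
    return max(z, 0.0)        # step 3: ReLU, chosen here only as an example

def squash(s, eps=1e-9):
    """Vector-to-vector non-linearity: keeps the direction, maps the length into [0, 1)."""
    sq_norm = np.sum(s ** 2)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def capsule_forward(v_in, W, c):
    """v_in: (n, d_in) input vectors, W: (n, d_out, d_in) weights, c: (n,) scalars summing to 1."""
    u = np.einsum("nij,nj->ni", W, v_in)   # step 1: u_i = w_i v_i
    s = np.sum(c[:, None] * u, axis=0)     # steps 2 and 3: s = sum_i c_i u_i
    return squash(s)                       # step 4: non-linear transformation

rng = np.random.default_rng(0)
v_in = rng.normal(size=(2, 8))     # two input capsules with 8-dimensional outputs (assumed sizes)
W = rng.normal(size=(2, 16, 8))    # one weight matrix per input capsule
c = np.array([0.5, 0.5])           # fixed here; dynamic routing would compute these
v_out = capsule_forward(v_in, W, c)
```

The output `v_out` is again a vector, so it can be passed to the next capsule in the same way.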

# 3. Details of capsule network

## 3.1 Dynamic routing algorithm of the capsule network

The previous chapter described how a capsule network works as a whole. This chapter focuses on how the parameters inside a capsule are updated. First, consider the pseudocode of the algorithm given in the paper:

Figure 3: Parameter update in a single capsule

Line by line, the pseudocode does the following:

1. The procedure takes all capsules in layer $l$ with their inputs $u_i$, and the number of iterations $r$, where $u_i$ is the result of multiplying the input vector $v_i$ by the weight $w_i$.
2. A set of temporary variables $b_{ij}$ is created. Each scalar $b_{ij}$ has an initial value of 0 and corresponds to one coefficient $c_{ij}$. When the iterations finish, the resulting values are stored in $c_{ij}$.
3. The number of internal iterations (a hyperparameter) is set, and the iteration starts.
4. Set $c_i = \mathrm{softmax}(b_i)$, so that all $c_{ij}$ are non-negative and sum to 1. In the first iteration all $c_{ij}$ are equal, because every $b_{ij}$ is 0.
5. Compute $s_j = \sum_i c_{ij} u_i$. This step is just a simple weighted sum.
6. Pass the result of the previous step through the squash function (the non-linear transformation described in Chapter 2): $v_j = \mathrm{squash}(s_j)$. This keeps the direction of the vector unchanged, combines all the inputs into a single vector, and keeps its length in the same range as the outputs of the previous capsules.
7. Update $b_{ij} \leftarrow b_{ij} + u_i \cdot v_j$. This is the most critical step of the dynamic routing algorithm. The dot product between the output of the high-level capsule and the input from the low-level capsule increases the weight of those inputs whose direction agrees with the output of the high-level capsule.
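The seven pseudocode lines described above can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `dynamic_routing` and `u_hat` are chosen for this example, and the inputs are stored as one vector per pair of low-level capsule $i$ and high-level capsule $j$ (the paper writes these as $\hat{u}_{j|i}$), so that each high-level capsule receives its own weighted version of every input.

```python
import numpy as np

def softmax(b, axis):
    e = np.exp(b - b.max(axis=axis, keepdims=True))   # shift for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

def squash(s, axis=-1, eps=1e-9):
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, num_iterations=3):
    """u_hat: (n_low, n_high, dim) weighted inputs.

    Returns the high-level outputs v with shape (n_high, dim) and the
    coefficients c with shape (n_low, n_high) used in the last iteration.
    """
    n_low, n_high, _ = u_hat.shape
    b = np.zeros((n_low, n_high))                    # line 2: all b_ij start at 0
    for _ in range(num_iterations):                  # line 3: r iterations
        c = softmax(b, axis=1)                       # line 4: coefficients of each low-level capsule sum to 1
        s = np.einsum("ij,ijd->jd", c, u_hat)        # line 5: weighted sum per high-level capsule
        v = squash(s)                                # line 6: vector non-linearity
        b = b + np.einsum("ijd,jd->ij", u_hat, v)    # line 7: add the agreement (dot product)
    return v, c

rng = np.random.default_rng(1)
u_hat = rng.normal(size=(6, 3, 4))   # 6 low-level capsules, 3 high-level capsules, 4 dimensions (toy sizes)
v, c = dynamic_routing(u_hat, num_iterations=3)
```

With `num_iterations=1` every coefficient is still uniform (here 1/3), which matches the remark that all coefficients are equal in the first iteration.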

The following figure shows the algorithm when the number of iterations is 3:

Figure 4: The dynamic routing algorithm when the number of iterations is 3

After the third iteration, the result is taken as the output $v$.

## 3.2 Intuitive understanding of the dynamic routing algorithm

The outputs of two high-level capsules are represented by the purple vectors $v_1$ and $v_2$. The orange vector represents the input received from one low-level capsule, and the black vectors represent the inputs received from other low-level capsules. The purple output $v_1$ on the left and the orange input $u_1$ point in opposite directions, so they are not similar: their dot product is negative, and updating the routing coefficient reduces $c_1$. The purple output $v_2$ on the right and the orange input $u_2$ point in the same direction, so they are similar, and the routing coefficient $c_2$ increases when the parameters are updated. Repeating this process for all high-level capsules and all of their inputs yields a set of routing coefficients that best matches the outputs of the low-level capsules to the outputs of the high-level capsules.

Figure 5: Intuitive understanding of the dynamic routing algorithm
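The effect of the dot product on the routing coefficients can be checked with a small numeric example. The two-dimensional vectors below are made-up values chosen only to mimic the situation in the figure: the input to the first high-level capsule points against its output, and the input to the second points along its output.

```python
import numpy as np

# Outputs of the two high-level capsules (purple vectors), assumed values.
v = np.array([[0.9, 0.0],
              [0.0, 0.8]])
# Inputs from one low-level capsule (orange vectors), assumed values.
u = np.array([[-0.8, 0.0],    # opposite direction to v[0]
              [0.0, 0.7]])    # same direction as v[1]

b = np.zeros(2)                              # routing logits start at 0
c_before = np.exp(b) / np.exp(b).sum()       # uniform coefficients

agreement = np.sum(u * v, axis=1)            # one dot product per high-level capsule
b = b + agreement                            # routing update
c_after = np.exp(b) / np.exp(b).sum()        # softmax over the two high-level capsules

# The first coefficient falls below its initial value, the second rises above it.
print(c_before, agreement, c_after)
```

The negative dot product lowers the first coefficient and the positive one raises the second, while the two coefficients still sum to 1.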

# 4. Network structure in the paper

## 4.1 Network structure for training

The CapsNet architecture for the MNIST dataset is shown in the figure below. The architecture can be summarized as two convolutional layers and one fully connected layer. Conv1 has 256 convolution kernels of size 9 × 9 with a stride of 1 and ReLU activation. This layer converts pixel intensities into the activations of local feature detectors, which are then used as inputs to the primary capsules.

Figure 6: Network structure for training

1. The first layer is a convolutional layer. Input: 28 × 28 image (monochrome). Output: 20 × 20 × 256 tensor. Kernels: 256 kernels of size 9 × 9 × 1 with a stride of 1. Activation function: ReLU.
2. The second layer is the PrimaryCaps layer. Input: 20 × 20 × 256 tensor. Output: 6 × 6 × 8 × 32 tensor (32 capsule channels, each a 6 × 6 grid of 8-dimensional capsules). Kernels: 8 kernels of size 9 × 9 × 256 per capsule channel, with a stride of 2.
3. The third layer is the DigitCaps layer. Input: 6 × 6 × 8 × 32 tensor (each capsule outputs an 8-dimensional vector). Output: 16 × 10 matrix (10 capsules, each a 16-dimensional vector).
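The tensor sizes listed above follow from the usual output-size formula for a convolution. The sketch below assumes no padding, which is what the stated sizes imply; the helper name `conv_output_size` is made up for this example. The final weight count assumes one 8 × 16 matrix per pair of primary capsule and digit capsule.

```python
def conv_output_size(input_size, kernel_size, stride):
    """Spatial output size of a convolution without padding."""
    return (input_size - kernel_size) // stride + 1

conv1_size = conv_output_size(28, 9, 1)              # 20, so Conv1 outputs 20 x 20 x 256
primary_size = conv_output_size(conv1_size, 9, 2)    # 6, so PrimaryCaps outputs 6 x 6 x 8 x 32

capsule_channels = 32
primary_dim = 8
num_primary_capsules = primary_size * primary_size * capsule_channels   # 1152 capsules

num_digit_capsules = 10
digit_dim = 16
# Assumption: one (8 x 16) weight matrix per pair of primary capsule and digit capsule.
digitcaps_weights = num_primary_capsules * num_digit_capsules * primary_dim * digit_dim

print(conv1_size, primary_size, num_primary_capsules, digitcaps_weights)
```

Under these assumptions the DigitCaps layer routes 1152 eight-dimensional capsule outputs to 10 sixteen-dimensional capsules.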

## 4.2 Reconstruction network structure

The decoder takes the 16-dimensional vector of the correct DigitCaps capsule and learns to decode it into an image of the digit (note that only the vector of the correct capsule is used during training; the vectors of the incorrect capsules are ignored). From this input it reconstructs a 28 × 28 pixel image. The loss function is the Euclidean distance between the reconstructed image and the input image. The decoder forces the capsules to learn features that are useful for reconstructing the original image. The closer the reconstructed image is to the input image, the better. The following figures show the structure of the reconstruction network (final output 28 × 28) and examples of reconstructed images (L, P and R correspond to the label, the prediction and the reconstruction target).

Figure 7: Reconstruction network structure

Figure 8: Initial image and reconstructed image
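The selection of the correct capsule and the reconstruction loss can be sketched as follows. Everything here is illustrative: the one-layer sigmoid decoder `decode` and its weights `W_dec` are hypothetical stand-ins for the decoder shown in the figure, the data are random, and the loss is computed as the sum of squared pixel differences (the squared Euclidean distance), which is an assumption of this sketch.

```python
import numpy as np

def select_correct_capsule(digit_caps, label):
    """digit_caps: (10, 16) DigitCaps output. Keeps only the vector of the true class."""
    return digit_caps[label]

def reconstruction_loss(reconstruction, image):
    """Sum of squared pixel differences between the reconstruction and the input image."""
    diff = reconstruction.reshape(-1) - image.reshape(-1)
    return np.sum(diff ** 2)

rng = np.random.default_rng(0)
digit_caps = rng.normal(size=(10, 16)) * 0.1   # random stand-in for the DigitCaps output
image = rng.random((28, 28))                   # random stand-in for an input image
label = 3                                      # assumed true class

W_dec = rng.normal(size=(28 * 28, 16)) * 0.01  # hypothetical one-layer decoder weights

def decode(capsule_vector):
    """Hypothetical decoder: one linear layer followed by a sigmoid, reshaped to 28 x 28."""
    logits = W_dec @ capsule_vector
    return (1.0 / (1.0 + np.exp(-logits))).reshape(28, 28)

reconstruction = decode(select_correct_capsule(digit_caps, label))
loss = reconstruction_loss(reconstruction, image)
```

Minimizing this loss with respect to both the decoder and the capsule layers is what pushes the 16-dimensional vector to carry the information needed to redraw the digit.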

# 5. Summary

The capsule network can be regarded as a revolutionary architecture at a stage where convolutional networks have become difficult to improve further. The output of a neuron changes from a scalar to a vector, which lets information flow through the network by routing. Each capsule that identifies a substructure preserves the details of the original image with high fidelity, and this fidelity comes directly from the richer structure of the capsule. Because the model learns to reconstruct the original image, it can reach the same result after a change of viewpoint through a change in its internal representation. It should also be pointed out that capsule networks and CNNs are not mutually exclusive: the bottom layers of the network can still be convolutional. After all, the capsule network is good at representing high-fidelity information with fewer bits.

Demo address of capsule network: https://momodel.cn/workspace/5da92fb8ce9f60807bbe6d33/app

# 6. References

1. Paper: Dynamic Routing Between Capsules
2. Blog Park: CapsNet capsule network (understanding)
3. GitHub: A Keras implementation of CapsNet
4. GitHub: CapsNet-Tensorflow