Neural network reading notes, Chapter 3: convolutional neural network


Chapter 3 convolutional neural network

The convolutional neural network (CNN) is one of the most widely used models at present. It is a deep feedforward neural network characterized by local connections and weight sharing.

3.1 convolution and pooling

Convolution and pooling are two core operations in CNN.

3.1.1 convolution in signal processing

Aside: the core material in this part really belongs to a signals and systems course, which I have not taken, so learning it is going to be painful. While copying the formulas from the book, I also try to add some of my own understanding.

A typical application of convolution: given an input signal \(f(\tau)\) and a system response \(g(\tau)\), find the output of the system.

The mathematical definition of convolution is \((f*g)(t)=\int_{-\infty}^{\infty}f(\tau)g(t-\tau)d\tau\). In words, \(g\) is flipped and shifted, multiplied by \(f\), and the product is integrated.

[Figure: \(g\) is flipped and shifted, then multiplied with \(f\) and integrated]
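As a rough discrete analogue of the definition above, the integral can be replaced by a sum over sample indices. The sketch below is only an illustration: the function name `conv1d_full` and the two short signals are made up for this example, and the result is compared against NumPy's `np.convolve`.

```python
import numpy as np

def conv1d_full(f, g):
    """Discrete convolution: out[t] = sum over tau of f[tau] * g[t - tau]."""
    f = np.asarray(f, dtype=float)
    g = np.asarray(g, dtype=float)
    out = np.zeros(len(f) + len(g) - 1)
    for t in range(len(out)):
        for tau in range(len(f)):
            if 0 <= t - tau < len(g):      # g is treated as zero outside its samples
                out[t] += f[tau] * g[t - tau]
    return out

f = [1.0, 2.0, 3.0]   # hypothetical input signal
g = [0.0, 1.0, 0.5]   # hypothetical system response
result = conv1d_full(f, g)
print(result)                                   # [0.  1.  2.5 4.  1.5]
print(np.allclose(result, np.convolve(f, g)))   # True
```

The index `t - tau` is where the "flip and shift" happens: as `tau` increases, `g` is read backwards, and `t` decides how far it has been shifted.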


The book then gives an example of convolution in image processing. It has little to do with GNNs, so I won't go into it. The one thing to note is the procedure: the convolution kernel is aligned with a patch of the image, corresponding positions are multiplied, and the products are summed (a matrix inner product).

Convolution theorem: a complicated convolution in the time domain becomes a simple multiplication in the frequency domain: \((f*g)(t)\Leftrightarrow F(\omega)G(\omega)\)

I remember that the first half of the GCN paper spends quite a few sections on this, but I didn't understand it at the time. Once again, the mask of pain.

The key prerequisite for the convolution theorem is the Fourier transform, which converts data from the time domain to the frequency domain by decomposing a function into a superposition of trigonometric functions of different frequencies. It can be understood as looking at the data from another perspective.
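The convolution theorem can be checked numerically on small discrete signals. This is a minimal sketch, assuming NumPy's FFT routines and two made-up signals; both signals are zero-padded to the full output length so that the circular convolution computed by the FFT matches the ordinary (linear) one.

```python
import numpy as np

f = np.array([1.0, 2.0, 3.0])   # hypothetical signal
g = np.array([0.0, 1.0, 0.5])   # hypothetical response
n = len(f) + len(g) - 1         # length of the full linear convolution

# Transform both signals, multiply pointwise, transform back.
via_fft = np.fft.ifft(np.fft.fft(f, n) * np.fft.fft(g, n)).real
direct = np.convolve(f, g)

print(np.allclose(via_fft, direct))   # True
```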

Back to the image processing example: by the convolution theorem, convolving an image with a kernel is equivalent to transforming both into the frequency domain and multiplying them, so the kernel acts as a filter. If it is a low-pass filter, the high-frequency part is filtered out and the filtered image loses some detail, because high frequencies correspond to regions of rapid change, such as the edges and fine details of the image.

3.1.2 convolution operation in deep learning

Of course, the actual convolution operation in deep learning is different from the convolution of signal processing mentioned above.

Single channel convolution

Convolution in deep learning with a single input channel is defined as \(H_{m,n}=\sum_{i=-\lfloor\frac{k}{2}\rfloor}^{\lfloor\frac{k}{2}\rfloor}\sum_{j=-\lfloor\frac{k}{2}\rfloor}^{\lfloor\frac{k}{2}\rfloor}X_{m+i,n+j}G_{i,j}\), where the convolution kernel \(G\) is a \(k\times k\) matrix. Intuitively, the kernel walks over the input matrix; at every position it takes the inner product with the patch it covers, and the saved results together form the output.

[Figure: a convolution kernel sliding over the input matrix]


The kernel moves across the input matrix, and each step is a matrix inner product that yields a single number. All of these numbers together form a new matrix.

In practice several convolution kernels may be used. The output of each kernel is stacked along a new axis, which turns the result into a three-dimensional tensor.
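The sliding-window process can be sketched in a few lines of NumPy. This is an illustrative version without padding or stride; the name `conv2d_single` is made up here, and the window is indexed from its top-left corner rather than its centre, which only shifts the indices compared with the formula.

```python
import numpy as np

def conv2d_single(X, G):
    """Slide the k x k kernel G over X; each output is sum(window * G)."""
    k = G.shape[0]
    H_out, W_out = X.shape[0] - k + 1, X.shape[1] - k + 1
    out = np.zeros((H_out, W_out))
    for m in range(H_out):
        for n in range(W_out):
            out[m, n] = np.sum(X[m:m + k, n:n + k] * G)
    return out

X = np.arange(25, dtype=float).reshape(5, 5)   # toy 5 x 5 input
G = np.ones((3, 3))                            # toy 3 x 3 kernel
H = conv2d_single(X, G)
print(H.shape)    # (3, 3)
print(H[0, 0])    # 54.0, the sum of the top-left 3 x 3 patch
```

Note that the 5 x 5 input shrinks to 3 x 3.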

The output is smaller than the input after convolution, which leads to two problems:

  1. After several convolutions the output becomes smaller and smaller
  2. Edge values have little influence on the output, because the kernel covers them only a few times while it covers the middle values at almost every step, so edge information is lost

To solve these problems, the edges of the input are usually filled with 0, which enlarges the original input matrix (this is called padding).

To keep the output the same size as the input, the number of zeros depends on the kernel shape: a \(k\times k\) kernel (with odd \(k\) and stride 1) needs \(P=\lfloor\frac{k}{2}\rfloor\), that is, \(P\) rings of zeros around the input matrix.

In addition, the kernel does not have to be applied at every position. A stride can be set so that the kernel moves several positions at a time.
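Padding and stride can be added to a naive sliding-window convolution as follows. This is a sketch under simple assumptions (square kernel, zero padding via `np.pad`, toy all-ones data); the function name `conv2d` and its parameters are illustrative only.

```python
import numpy as np

def conv2d(X, G, padding=0, stride=1):
    """Sliding-window convolution with zero padding and a stride."""
    k = G.shape[0]
    Xp = np.pad(X, padding)                 # np.pad fills with zeros by default
    H_out = (Xp.shape[0] - k) // stride + 1
    W_out = (Xp.shape[1] - k) // stride + 1
    out = np.zeros((H_out, W_out))
    for m in range(H_out):
        for n in range(W_out):
            r, c = m * stride, n * stride   # top-left corner of the window
            out[m, n] = np.sum(Xp[r:r + k, c:c + k] * G)
    return out

X = np.ones((5, 5))
G = np.ones((3, 3))
same = conv2d(X, G, padding=3 // 2)           # P = floor(k / 2) keeps 5 x 5
strided = conv2d(X, G, padding=1, stride=2)   # the kernel moves 2 positions at a time
print(same.shape, strided.shape)              # (5, 5) (3, 3)
print(same[0, 0], same[2, 2])                 # 4.0 9.0
```

The corner value 4.0 versus the centre value 9.0 shows that a corner window covers only four real input values; the rest are padded zeros.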

Multichannel convolution

Multichannel means that the input is not a two-dimensional matrix but a three-dimensional tensor \(X\in R^{H\times W\times C}\), where \(C\) is the number of input channels. If the kernel for a single channel is still \(k\times k\), the multichannel kernel is \(G^{c'}\in R^{k\times k\times C}\).

The convolution itself works as in the single-channel case, except that the sum also runs over the channel dimension.

[Figure: multichannel convolution]


Similarly, there can be multiple convolution kernels: the kernel tensor gains another dimension, and so does the output.

After convolution, a bias is usually added.

To recap the convolution in deep learning: the input is \(X\in R^{H\times W\times C}\), the convolution kernel is \(G\in R^{k\times k\times C\times C'}\) (\(C'\) different kernels), the output is \(H\in R^{H'\times W'\times C'}\), and the total number of parameters is \(k^2\times C\times C'+C'\) (the \(C'\) biases).

The kernel size, padding, and stride all affect the output dimensions. Assume the input is \(H\times W\times C\), there are \(C'\) different kernels of size \(k\times k\times C\), the padding is \(p\), the stride is \(s\), and the output is \(H'\times W'\times C'\). Then \(H'=\lfloor\frac{H+2p-k}{s}\rfloor+1,W'=\lfloor\frac{W+2p-k}{s}\rfloor+1\)
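The multichannel case with several kernels and a bias can be sketched like this. The shapes follow the notation above (input \(H\times W\times C\), kernel \(k\times k\times C\times C'\)); the function name `conv2d_multi` and the random toy data are assumptions of this example, and padding and stride are left out to keep it short.

```python
import numpy as np

def conv2d_multi(X, G, b):
    """X: (H, W, C), G: (k, k, C, C_out), b: (C_out,) -> (H-k+1, W-k+1, C_out)."""
    k = G.shape[0]
    H_out, W_out = X.shape[0] - k + 1, X.shape[1] - k + 1
    out = np.zeros((H_out, W_out, G.shape[3]))
    for m in range(H_out):
        for n in range(W_out):
            window = X[m:m + k, n:n + k, :]                      # (k, k, C)
            # Sum over both spatial axes and the channel axis, one value per kernel.
            out[m, n, :] = np.tensordot(window, G, axes=3) + b
    return out

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 6, 3))      # C = 3 input channels
G = rng.normal(size=(3, 3, 3, 4))   # C' = 4 kernels of size 3 x 3 x 3
b = np.zeros(4)                     # one bias per kernel
H = conv2d_multi(X, G, b)
num_params = G.size + b.size        # k^2 * C * C' + C'
print(H.shape, num_params)          # (4, 4, 4) 112
```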

3.1.3 pooling

The main purpose of pooling is to reduce the dimensions, and therefore the amount of computation, and to provide a first degree of translation invariance (explained later).

Common pooling operations include average pooling and maximum pooling.

Concretely, a fixed-size window slides over the input, and the elements in the window are aggregated into a single element each time (by averaging or by taking the maximum).

[Figure: average pooling and max pooling with a sliding window]


In general the sliding window is \(k\times k\), typically \(2\times 2\), and the stride equals the window size, so it is \(2\). For multichannel input, pooling is applied to each channel separately and does not change the number of channels. After such pooling, the height and width of the input are halved.
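A minimal sketch of non-overlapping pooling, assuming the stride equals the window size as described above. The function name `pool2d` and the reshape trick are choices made for this example; leftover rows and columns that do not fill a whole window are simply dropped.

```python
import numpy as np

def pool2d(X, k=2, mode="max"):
    """Non-overlapping k x k pooling (stride = k) on an (H, W, C) input."""
    H, W, C = X.shape
    Xc = X[:H - H % k, :W - W % k, :]               # drop incomplete windows
    windows = Xc.reshape(H // k, k, W // k, k, C)   # split rows and columns into blocks
    if mode == "max":
        return windows.max(axis=(1, 3))
    return windows.mean(axis=(1, 3))

X = np.arange(16, dtype=float).reshape(4, 4, 1)   # one channel, values 0..15
max_out = pool2d(X, 2, "max")[..., 0]
avg_out = pool2d(X, 2, "mean")[..., 0]
print(max_out)   # [[ 5.  7.] [13. 15.]]
print(avg_out)   # [[ 2.5  4.5] [10.5 12.5]]
```

The channel axis is untouched, so the number of channels stays the same while height and width are halved.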

3.2 convolutional neural network

A convolutional neural network is obtained by stacking convolution layers and pooling layers.

3.2.1 structure of convolutional neural network

Take AlexNet for image recognition as an example of the structure of a CNN. AlexNet consists of 5 convolution layers, 3 pooling layers, and 3 fully connected layers.

The structure of AlexNet is:

  1. The input layer is an image of size \(224\times224\times3\)
  2. Layer 1: convolution layer with kernel size \(11\times11\) and 96 kernels in total, so the kernel tensor is \(11\times11\times3\times96\); the stride is 4 and the padding is 2; by the formula in 3.1.2 the output is \(55\times55\times96\); ReLU is applied after the convolution
  3. Layer 2: pooling layer, max pooling, window size \(3\times3\), stride 2, output \(27\times27\times96\)
  4. Layer 3: convolution layer with kernel size \(5\times5\) and 256 kernels in total, kernel tensor \(5\times5\times96\times256\), stride 1, padding 2, output \(27\times27\times256\); ReLU is applied after the convolution
  5. Layer 4: pooling layer, max pooling, window size \(3\times3\), stride 2, output \(13\times13\times256\)
  6. Layers 5 ~ 7: all convolution layers; the final output is \(13\times13\times256\)
  7. Layer 8: pooling layer, window size \(3\times3\), stride 2, output \(6\times6\times256\)
  8. The last three layers: fully connected layers. The convolution output is flattened into a vector in \(R^{9216}\), and the final output has \(1000\) dimensions, used for image classification
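The output sizes in the list can be reproduced with the formula from 3.1.2. The sketch below only traces shapes; the helper name `out_size` and the layer table are written for this example. Layers 5 ~ 7 are left out because these notes only give their output (\(13\times13\times256\)), not their kernel sizes.

```python
def out_size(size, k, p=0, s=1):
    """H' = floor((H + 2p - k) / s) + 1."""
    return (size + 2 * p - k) // s + 1

# (name, kernel size, padding, stride, output channels; None means pooling)
layers = [
    ("conv1", 11, 2, 4, 96),
    ("pool1", 3, 0, 2, None),
    ("conv2", 5, 2, 1, 256),
    ("pool2", 3, 0, 2, None),
    # layers 5 ~ 7 keep the shape at 13 x 13 x 256 according to the notes
    ("pool3", 3, 0, 2, None),
]

size, channels = 224, 3
shapes = []
for name, k, p, s, c_out in layers:
    size = out_size(size, k, p, s)
    channels = channels if c_out is None else c_out
    shapes.append((name, size, size, channels))
    print(name, (size, size, channels))

flattened = size * size * channels
print(flattened)   # 9216
```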


Besides the deep CNN structure, AlexNet also introduced two improvements: ReLU to avoid vanishing gradients, and dropout to avoid overfitting (dropout randomly sets some positions to 0 during training, which amounts to discarding part of the information and forcing the model to infer from less, so that it learns more robust and discriminative features).

The CNN structure thus has two parts: the convolution + pooling layers, and the fully connected layers at the end. The fully connected layers flatten the convolutional features, discard the spatial information, aggregate the global information, and map it to the output space.
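Dropout as described here can be sketched as a random 0/1 mask. One detail in this sketch is an assumption that is not in the notes: the surviving values are rescaled by \(1/(1-p)\) (often called inverted dropout) so that the expected activation stays the same between training and inference. The function name and arguments are illustrative.

```python
import numpy as np

def dropout(x, p_drop, rng, training=True):
    """Randomly zero a fraction p_drop of the positions during training."""
    if not training or p_drop == 0.0:
        return x                              # nothing is dropped at inference time
    mask = rng.random(x.shape) >= p_drop      # True where the value is kept
    return x * mask / (1.0 - p_drop)          # rescaling is an assumption of this sketch

rng = np.random.default_rng(0)
x = np.ones(1000)
y = dropout(x, 0.5, rng)
print(np.unique(y))   # only 0.0 (dropped) and 2.0 (kept and rescaled) can appear
```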

3.2.2 characteristics of convolutional neural network

Local connection

Each convolution only looks at the part of the input covered by the kernel, so input and output are locally connected. For example, after a \(5\times5\) input matrix is convolved twice with a \(3\times3\) kernel, only one element is left, and this element depends only on that \(5\times5\) input region, which is called its receptive field.
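The size of the receptive field of stacked convolutions can be computed layer by layer. This is a small sketch using a commonly used recurrence; the handling of strides goes beyond the example in the notes (which uses stride 1), and the function name is made up.

```python
def receptive_field(kernel_sizes, strides=None):
    """Side length of the input region seen by one output element."""
    strides = strides or [1] * len(kernel_sizes)
    rf, jump = 1, 1            # jump: distance in input pixels between neighbouring outputs
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump
        jump *= s
    return rf

print(receptive_field([3, 3]))       # 5: two 3 x 3 convolutions see a 5 x 5 region
print(receptive_field([3, 3, 3]))    # 7
```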


Weight sharing

Different regions of the input use the same convolution kernel, which brings translation invariance: the same feature is detected no matter where it appears in the input. (Strictly speaking, the convolution output shifts along with the input rather than staying exactly the same.)

Pooling also brings translation invariance. With max pooling, for example, even if the convolution result changes, the pooled result stays the same as long as the maximum in each window is unchanged.

Hierarchical expression

This one is more abstract. In short, it comes from the fact that a convolutional network is built by stacking convolution layers. The low-level convolutions stay close to the raw input, while the features of the higher-level convolutions are more abstract. Generally speaking, low-level features are generic, while high-level features are more task-specific (for example, in image classification they decide which class an image belongs to).

3.3 special convolution form

3.3.1 \(1\times1\) convolution

[Figure: \(1\times1\) convolution]


Two functions:

  1. Information aggregation with added nonlinearity: the convolution can be followed by an activation function, which increases the expressive power of the model
  2. Changing the number of channels: the number of channels of the feature map can be increased or reduced
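Both functions can be seen in a tiny sketch: a \(1\times1\) convolution is the same linear map applied to the channel vector at every pixel, optionally followed by an activation. The shapes and random data below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 4, 8))    # H x W x C feature map with C = 8 channels
W = rng.normal(size=(8, 2))       # 1 x 1 kernel mapping 8 channels to 2

Y = np.maximum(X @ W, 0.0)        # per-pixel channel mixing, then ReLU
print(Y.shape)                    # (4, 4, 2): spatial size unchanged, channels reduced
```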

3.3.2 transpose convolution

It is mainly used for semantic segmentation, which is really CV material that I don't know much about.

The explanation in the book is also rather messy, but roughly speaking, transpose convolution increases the height and width of the input.

[Figure: transpose convolution, panels a and b]


In figure a, the input is \(2\times2\) and the convolution kernel is \(3\times3\). The input is padded with two rings of zeros, and the final output is \(4\times4\).

Figure b shows the more general process with stride \(s\), padding \(p\), and a \(k\times k\) kernel: the input is surrounded by \(k-1-p\) rings of zeros, \(s-1\) zeros are inserted between adjacent input elements, and then the convolution is applied.
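The zero-insertion procedure can be written out directly. This sketch follows the description in these notes only (insert \(s-1\) zeros, pad \(k-1-p\) rings, then slide the kernel with stride 1) and assumes \(p\le k-1\); it does not flip the kernel, and the function name is hypothetical.

```python
import numpy as np

def transposed_conv2d(X, G, stride=1, padding=0):
    """Upsample X by zero insertion and padding, then apply a stride-1 convolution."""
    k = G.shape[0]
    h, w = X.shape
    # Insert stride - 1 zeros between neighbouring input elements.
    Z = np.zeros(((h - 1) * stride + 1, (w - 1) * stride + 1))
    Z[::stride, ::stride] = X
    Z = np.pad(Z, k - 1 - padding)          # k - 1 - p rings of zeros
    out_h, out_w = Z.shape[0] - k + 1, Z.shape[1] - k + 1
    out = np.zeros((out_h, out_w))
    for m in range(out_h):
        for n in range(out_w):
            out[m, n] = np.sum(Z[m:m + k, n:n + k] * G)
    return out

X = np.ones((2, 2))
G = np.ones((3, 3))
a = transposed_conv2d(X, G)                        # the case of figure a
b = transposed_conv2d(X, G, stride=2, padding=1)
print(a.shape, b.shape)                            # (4, 4) (3, 3)
```

In this sketch the output side length works out to \((h-1)s+k-2p\), which gives 4 for the \(2\times2\) input with a \(3\times3\) kernel.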

3.3.3 dilated convolution

In contrast to transpose convolution, which expands the input, dilated (atrous) convolution expands the convolution kernel by filling in zeros.

[Figure: dilated convolution kernel with zeros inserted]

The parameters of the kernel stay the same, however, so dilated convolution is a way to enlarge the receptive field without adding parameters.

The dilation rate \(r\) controls the degree of expansion: \(r-1\) zeros are inserted between adjacent kernel elements, so a \(k\times k\) kernel grows to size \(k+(k-1)(r-1)\) along each dimension.
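The expanded kernel can be built explicitly to check the size formula. A minimal sketch; the function name `dilate_kernel` is made up for this example.

```python
import numpy as np

def dilate_kernel(G, r):
    """Insert r - 1 zeros between neighbouring elements of a k x k kernel."""
    k = G.shape[0]
    size = k + (k - 1) * (r - 1)
    D = np.zeros((size, size))
    D[::r, ::r] = G               # original weights land on every r-th position
    return D

G = np.ones((3, 3))
D = dilate_kernel(G, 2)
print(D.shape)                    # (5, 5)
print(int(np.count_nonzero(D)))   # 9: still only nine parameters
```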

3.3.4 grouped convolution

The input channels are split into groups, each group is convolved separately, and the outputs are concatenated at the end. Because the input is grouped, each convolution kernel becomes smaller.

[Figure: grouped convolution]

The advantage is that the groups can be computed in parallel, for example each on a different GPU, which speeds up the convolution and also reduces the number of parameters.
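The parameter saving can be made concrete with a small count. The sketch assumes that the input and output channels are both divided evenly among the groups and ignores biases; the function name and the channel numbers are illustrative.

```python
def conv_params(k, c_in, c_out, groups=1):
    """Number of kernel weights of a (grouped) convolution, without biases."""
    if c_in % groups or c_out % groups:
        raise ValueError("channels must be divisible by the number of groups")
    # Each group maps c_in / groups input channels to c_out / groups output channels.
    return groups * k * k * (c_in // groups) * (c_out // groups)

print(conv_params(3, 64, 128))             # 73728
print(conv_params(3, 64, 128, groups=4))   # 18432, a factor of 4 fewer
```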

3.3.5 depthwise separable convolution

The explanation in the book is short and clear enough, so I quote it directly instead of summarizing it myself:

Depthwise separable convolution consists of two parts: a depthwise (channel-by-channel) convolution and a 1 × 1 convolution. Depthwise convolution is a special case of grouped convolution: when g = C1 = C2, it amounts to giving each input channel its own convolution kernel. Since this convolution uses only the information of a single input channel, that is, only spatial information and no information across channels, a 1 × 1 convolution is usually applied afterwards to mix the information between channels.
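The two steps can be sketched as follows: a per-channel spatial convolution, then a \(1\times1\) convolution across channels. Names, shapes, and random data are assumptions of this example, and padding, stride, and biases are omitted.

```python
import numpy as np

def depthwise_separable(X, D, P):
    """X: (H, W, C), D: (k, k, C) one kernel per channel, P: (C, C_out) 1 x 1 kernel."""
    k = D.shape[0]
    H_out, W_out = X.shape[0] - k + 1, X.shape[1] - k + 1
    depthwise = np.zeros((H_out, W_out, X.shape[2]))
    for m in range(H_out):
        for n in range(W_out):
            # Each channel is combined only with its own k x k kernel.
            depthwise[m, n, :] = np.sum(X[m:m + k, n:n + k, :] * D, axis=(0, 1))
    return depthwise @ P          # the 1 x 1 convolution mixes the channels

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 5, 4))
D = rng.normal(size=(3, 3, 4))
P = rng.normal(size=(4, 6))
Y = depthwise_separable(X, D, P)

separable_params = D.size + P.size   # 3*3*4 + 4*6
standard_params = 3 * 3 * 4 * 6      # an ordinary 3 x 3 convolution from 4 to 6 channels
print(Y.shape, separable_params, standard_params)   # (3, 3, 6) 60 216
```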

3.4 application of convolution network in image classification

This is basically CV material. The one thing I note here is something that is also used in GNNs: ResNet.

Here is something I learned while studying GCN. A fatal weakness of GCN is over-smoothing in deep models: the representations of all nodes tend to become the same, because each GCN layer aggregates information from neighbors one hop further away. The neighborhood quickly covers the whole graph, so every node ends up learning a representation of the whole graph, the nodes become indistinguishable, and deep GCNs are hard to make work. My first GNN course project, a team task, was to try a deeper GCN. The simple idea my teammates and I had was to weight each layer so that deeper layers get lower weights; that way a node learns mostly from the neighbors close to it and only a little from the distant ones. To make these weights trainable, we added a linear layer between the GCN layers. Of course, the final result was not good: the accuracy was only slightly higher than that of a plain deep GCN and much lower than that of a 2-layer GCN. Later we read the JK-Net paper. Its general idea is the same, but the authors considered much more, implemented it better, and reached very high accuracy. Another possible approach is to introduce ResNet, but I knew nothing about it at the time, so I didn't try. In fact there have been many attempts to bring ideas from CV into GNNs; for example GAT, another very famous GNN model, adds the attention mechanism from CV to GNNs.

Back to the point: the ResNet model looks like this.

[Figure: the ResNet model]


The output of a ResNet block is determined by both the input and the result of the convolution, rather than by the convolution result alone.

The advantages: in forward propagation, the input is fused with the convolution output so that the features are used better (the convolution may have gone too far; as mentioned above, GCN aggregates too many hops of neighbor information and fails to learn from the nearby neighbors, so adding the input back recovers information that may have been lost in the convolution). In back propagation, part of the gradient flows back to the input through the direct connection between input and output, which alleviates the vanishing gradient problem.

ResNet has also inspired many improved models, such as DenseNet, which has also been used in GCN.
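The idea that the output is "input plus transformed input" can be sketched in a few lines. This is a deliberately simplified illustration: two linear layers with a ReLU stand in for the convolutions, and the function name and weight shapes are assumptions, not the actual ResNet block.

```python
import numpy as np

def residual_block(x, W1, W2):
    """y = x + F(x), where F is a small two-layer transform."""
    h = np.maximum(x @ W1, 0.0)   # first transform + ReLU
    return x + h @ W2             # the skip connection adds the input back

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))       # 5 samples with 8 features each
W1 = rng.normal(size=(8, 8))
W2 = np.zeros((8, 8))             # if F contributes nothing ...
y = residual_block(x, W1, W2)
print(np.allclose(y, x))          # True: the input still passes through unchanged
```

Because of the `x +` term, the derivative of the output with respect to the input always contains an identity part, which is the direct gradient path mentioned above.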