## Introductory notes on deep learning with Python (4): CNN

## Convolutional neural network

The theme of this part is the convolutional neural network (CNN).

### Network structure

Like the neural networks introduced before, a CNN is built by stacking layers. What is new in a CNN is the convolution layer and the pooling layer.

In the neural networks introduced earlier, all neurons in adjacent layers are connected, which is called **fully connected**. The following figure is an example of a network based on Affine layers:

The structure of CNN is usually:

In a CNN, the layers are connected in the order “Convolution - ReLU - (Pooling)” (the pooling layer is sometimes omitted). This can be understood as the earlier “Affine - ReLU” connection being replaced by a “Convolution - ReLU - (Pooling)” connection. However, the layers close to the output still use the earlier “Affine - ReLU” combination, and the output layer uses “Affine - Softmax”.

### Convolution layer

#### What’s wrong with the fully connected layer?

The fully connected layer has a problem: **the shape of the data is “ignored”**. For example, when the input is an image, the data is usually three-dimensional, with height, width and channel directions. However, to feed it into a fully connected layer, the three-dimensional data has to be flattened into one-dimensional data.

The three-dimensional shape of an image carries important spatial information: spatially adjacent pixels have similar values, the RGB channels are closely correlated, pixels far apart are barely correlated, and so on. **There may be essential patterns worth extracting in the 3D shape.** Because the fully connected layer ignores the shape and treats all input data as equivalent neurons (neurons of the same dimension), it cannot use shape-related information.

The convolution layer keeps the shape unchanged. When the input data is an image, the convolution layer receives the input as 3D data and outputs 3D data to the next layer. So **a CNN can correctly understand data that has a shape, such as images**.
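A minimal NumPy sketch of what “ignoring the shape” means. The random image and its (channel, height, width) layout are assumptions made for illustration only:

```python
import numpy as np

# Hypothetical RGB image stored as (channel, height, width).
image = np.random.rand(3, 28, 28)

# A fully connected layer needs a 1-D vector, so the spatial layout is discarded.
flat = image.reshape(-1)

# Two vertically adjacent pixels end up 28 positions apart after flattening.
idx_a = np.ravel_multi_index((0, 10, 10), image.shape)
idx_b = np.ravel_multi_index((0, 11, 10), image.shape)

print(image.shape)    # (3, 28, 28)
print(flat.shape)     # (2352,)
print(idx_b - idx_a)  # 28
```

Nothing in the flat vector tells the layer that these two positions were neighbors in the image.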

*In a CNN, the input and output data of a convolution layer are sometimes called feature maps: the input feature map and the output feature map. Below, “input / output data” and “feature map” mean the same thing.*

Some terms and concepts from digital image processing come up below. I covered them in detail in an earlier post, Digital image processing notes.

The main difference between traditional digital image processing and image processing based on deep learning is that the filters used in traditional digital image processing are fixed and general-purpose, or have been determined through empirical research, whereas with deep learning the most appropriate filters for the current task can be “learned”. Apart from that, the image processing operations themselves do not differ.

A simplified view is shown in the figure below:

#### Convolution operation

The processing performed by the convolution layer is the convolution operation. It is equivalent to the **“filter operation” in image processing**. *A “filter” is also called a “kernel”.*

An example where the input data size is (4,4), the filter size is (3,3), and the final output size is (2,2):

For the input data, the convolution operation slides the filter window at a fixed interval and applies it: **the elements of the filter at each position are multiplied by the corresponding elements of the input and then summed** (this calculation is sometimes called multiply-accumulate). **The result is then saved to the corresponding location of the output.** Carrying out this process at all positions yields the output of the convolution operation.

In a CNN, the **filter parameters** correspond to the **weights** of the earlier networks. A CNN also has a **bias**. There is usually only one bias value per filter, and this value is added to every element of the output after the filter is applied.
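The sliding-window procedure can be sketched with plain loops. This is a deliberately naive version for a single 2-D input and a single filter; the function name `conv2d_naive` and the sample values are illustrative assumptions, not taken from the original figure:

```python
import numpy as np

def conv2d_naive(x, w, b=0.0, stride=1):
    """Slide filter w over the 2-D input x and multiply-accumulate at each position."""
    H, W = x.shape
    FH, FW = w.shape
    out_h = (H - FH) // stride + 1
    out_w = (W - FW) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i*stride:i*stride+FH, j*stride:j*stride+FW]
            # Element-wise product, then sum; the single bias is added to every output.
            out[i, j] = np.sum(window * w) + b
    return out

# Illustrative (4, 4) input and (3, 3) filter.
x = np.array([[1, 2, 3, 0],
              [0, 1, 2, 3],
              [3, 0, 1, 2],
              [2, 3, 0, 1]])
w = np.array([[2, 0, 1],
              [0, 1, 2],
              [1, 0, 2]])

print(conv2d_naive(x, w))       # values: [[15, 16], [6, 15]]
print(conv2d_naive(x, w, b=3))  # values: [[18, 19], [9, 18]]
```

Note that, as is usual in CNNs, the filter is applied without flipping it.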

#### Padding

It can be seen that after filtering the image is one ring of pixels smaller than before, because the outermost elements of the image do not have a complete neighborhood to multiply with the filter. Padding is used so that the edge elements are not lost.

Before the convolution is applied, it is sometimes necessary to **fill the area around the input data with fixed data (such as 0), which is called padding**.

Filling processing of convolution operation: fill 0 around the input data (dotted line in the figure indicates filling, 0 is not displayed)
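As a small sketch of zero padding, `np.pad` can add the ring of zeros. The (4, 4) input and the (3, 3) filter size are assumed example sizes:

```python
import numpy as np

x = np.arange(1, 17).reshape(4, 4)

# Pad one row/column of zeros on every side: (4, 4) -> (6, 6).
x_padded = np.pad(x, pad_width=1, mode="constant", constant_values=0)
print(x_padded.shape)  # (6, 6)

# With a 3x3 filter and stride 1 the output side is 6 - 3 + 1 = 4,
# so the output keeps the original (4, 4) size.
out_side = x_padded.shape[0] - 3 + 1
print(out_side)  # 4
```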

#### Stride

**The position interval at which the filter is applied is called the stride.** If the stride is set to 2, as shown in the following figure, **the window to which the filter is applied moves 2 elements at a time**.

To sum up, the output size decreases when the stride is increased and grows when the padding is increased. So how is the output size calculated from these two values and the input size?

Assuming that the input size is (H, W), the filter size is (FH, FW), the output size is (OH, OW), the padding is P and the stride is S, the output size is:

$$
OH = \frac{H + 2P - FH}{S} + 1, \qquad OW = \frac{W + 2P - FW}{S} + 1
$$

Pay attention to the division in the formula. When the division leaves a remainder (the result is not an integer), a countermeasure such as reporting an error is needed.
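The size calculation and the error-reporting countermeasure can be sketched as a small helper. The name `conv_output_size` and the choice of `ValueError` are assumptions made here; some frameworks round instead of raising:

```python
def conv_output_size(input_size, filter_size, pad=0, stride=1):
    """Return the output size along one axis, or raise if it is not an integer."""
    numerator = input_size + 2 * pad - filter_size
    if numerator < 0 or numerator % stride != 0:
        raise ValueError(
            f"size {input_size}, filter {filter_size}, pad {pad}, "
            f"stride {stride} do not give an integer output size"
        )
    return numerator // stride + 1

print(conv_output_size(4, 3, pad=1, stride=1))   # 4
print(conv_output_size(7, 3, pad=0, stride=2))   # 3
print(conv_output_size(28, 5, pad=2, stride=3))  # 10
```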

#### Convolution operation of 3D data

Let’s take a look at the example of convolution operation on 3D data with channel direction.

`Note: in the convolution operation of 3D data, the number of channels of the filter can only be set to the same value as the number of channels of the input data.`

Here, we find another problem: three-dimensional data becomes two-dimensional data after being filtered by a filter.

How can this be solved? **Just use multiple filters (weights).**

By applying FN filters, FN output feature maps are generated. If the FN feature maps are collected together, a block with the shape (FN, OH, OW) is obtained. Passing this block to the next layer is the processing flow of a CNN.

Each output channel has one bias value, so the shape of the bias is (FN, 1, 1).
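The shapes involved can be checked with a naive sketch. It assumes stride 1 and no padding, and `conv3d_naive` is a hypothetical helper name:

```python
import numpy as np

def conv3d_naive(x, w, b):
    """x: (C, H, W), w: (FN, C, FH, FW), b: (FN, 1, 1) -> out: (FN, OH, OW)."""
    C, H, W = x.shape
    FN, C_w, FH, FW = w.shape
    assert C == C_w, "filter channels must match input channels"
    out_h, out_w = H - FH + 1, W - FW + 1   # stride 1, no padding
    out = np.zeros((FN, out_h, out_w))
    for n in range(FN):
        for i in range(out_h):
            for j in range(out_w):
                # Summing over all channels collapses (C, FH, FW) into one value,
                # which is why a single filter yields a 2-D feature map.
                out[n, i, j] = np.sum(x[:, i:i+FH, j:j+FW] * w[n])
    return out + b   # (FN, 1, 1) broadcasts over (FN, OH, OW)

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 4, 4))
w = rng.standard_normal((2, 3, 3, 3))
b = np.zeros((2, 1, 1))
out = conv3d_naive(x, w, b)
print(out.shape)  # (2, 2, 2)
```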

#### Batch processing

In neural network processing, input data is bundled and processed in batches. For the convolution operation to support batch processing as well, the data passed between layers needs to be stored as **4D data** (a batch of 3D data). Specifically, the data is stored in the order (batch_num, channel, height, width).
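For instance, a batch of 10 single-channel 28 × 28 images (sizes chosen here only as an example) would be stored and indexed like this:

```python
import numpy as np

# Hypothetical batch: 10 images, 1 channel, 28 x 28 pixels.
x = np.random.rand(10, 1, 28, 28)

print(x.shape)        # (10, 1, 28, 28)
print(x[0].shape)     # (1, 28, 28)  -> first sample, still 3-D
print(x[0, 0].shape)  # (28, 28)     -> first channel of the first sample
```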

### Pooling layer

**Pooling is an operation that shrinks the data in the height and width directions.** In image recognition, max pooling is mainly used, that is, the maximum value within each target region is selected.

The following example shows the processing sequence of 2 × 2 max pooling with a stride of 2:

Characteristics of the pooling layer:

- There are no parameters to learn. Pooling just takes the maximum value (or average value) from the target region.
- The number of channels does not change, and each channel is pooled separately.
- It is robust to small position changes.
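A naive sketch of 2 × 2 max pooling with stride 2, applied per channel. The helper name `max_pool_naive` and the input values are illustrative assumptions:

```python
import numpy as np

def max_pool_naive(x, pool=2, stride=2):
    """x: (C, H, W) -> (C, OH, OW); each channel is pooled independently."""
    C, H, W = x.shape
    out_h = (H - pool) // stride + 1
    out_w = (W - pool) // stride + 1
    out = np.zeros((C, out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            window = x[:, i*stride:i*stride+pool, j*stride:j*stride+pool]
            out[:, i, j] = window.max(axis=(1, 2))  # no learned parameters involved
    return out

x = np.array([[[1, 2, 1, 0],
               [0, 1, 2, 3],
               [3, 0, 1, 2],
               [2, 4, 0, 1]]])
pooled = max_pool_naive(x)
print(pooled.shape)  # (1, 2, 2): the channel count is unchanged
print(pooled)        # values: [[[2, 3], [4, 2]]]
```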

### Implementation of convolution layer and pooling layer

We will implement these two layers in Python, but we have to solve some small problems before we start.

#### Expansion based on im2col

CNN processes 4-dimensional data, so implementing the convolution operation looks very complex, but the problem becomes much simpler with im2col.

Implemented naively, the convolution operation needs several nested for statements. **Such an implementation is a little cumbersome**, and for statements over NumPy arrays are **slow** (in NumPy, it is better not to access elements with for statements).

**im2col (image to column) is a function that expands the input data** so that it fits the filter (weight).

In the above figure, the stride is set large for easy observation so that the application areas of the filter do not overlap. In actual convolution operations, the application areas of the filter almost always overlap. **When the application areas overlap, the number of elements after expansion by im2col is larger than the number of elements in the original block.** Therefore, the im2col-based implementation consumes more memory than an ordinary implementation. However, collecting everything into one large matrix suits computers well, because highly optimized matrix multiplication routines can then be used.

```
import numpy as np


def im2col(input_data, filter_h, filter_w, stride=1, pad=0):
    """Expand 4-D input data into a 2-D array, one row per filter position.

    Parameters
    ----------
    input_data : 4-D array of shape (batch size, channel, height, width)
    filter_h : filter height
    filter_w : filter width
    stride : stride
    pad : padding

    Returns
    -------
    col : 2-D array of shape (N*out_h*out_w, C*filter_h*filter_w)
    """
    N, C, H, W = input_data.shape
    out_h = (H + 2*pad - filter_h)//stride + 1
    out_w = (W + 2*pad - filter_w)//stride + 1

    # Zero-pad only the two spatial axes.
    img = np.pad(input_data, [(0, 0), (0, 0), (pad, pad), (pad, pad)], 'constant')
    col = np.zeros((N, C, filter_h, filter_w, out_h, out_w))

    # For each filter element (y, x), gather the input values it meets
    # at every output position.
    for y in range(filter_h):
        y_max = y + stride*out_h
        for x in range(filter_w):
            x_max = x + stride*out_w
            col[:, :, y, x, :, :] = img[:, :, y:y_max:stride, x:x_max:stride]

    col = col.transpose(0, 4, 5, 1, 2, 3).reshape(N*out_h*out_w, -1)
    return col
```
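The claim that the expanded array holds more elements than the input can be checked independently of the function above. The sketch below swaps in NumPy's `sliding_window_view` (available in NumPy 1.20 and later) for the stride-1, no-padding case only; the input and filter sizes are arbitrary examples:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

x = np.random.rand(1, 3, 7, 7)      # (N, C, H, W)
FH = FW = 5

# Windows over the two spatial axes: shape (N, C, out_h, out_w, FH, FW).
windows = sliding_window_view(x, (FH, FW), axis=(2, 3))
N, C, out_h, out_w = windows.shape[:4]

# One row per filter position, one column per (channel, y, x) filter element.
col = windows.transpose(0, 2, 3, 1, 4, 5).reshape(N * out_h * out_w, -1)

print(col.shape)         # (9, 75)
print(x.size, col.size)  # 147 675
```

The 147 input elements become 675 after expansion because neighboring 5 × 5 windows share most of their pixels.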

After expanding the input data with im2col, you only need to expand the filter (weight) of the convolution layer vertically into one column and calculate the product of the two matrices.

It can be seen that the matrix multiplication can be used for calculation after expansion.
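A shape-only sketch of that matrix product. A random array stands in for the im2col output, and all sizes are assumed examples (stride 1, no padding):

```python
import numpy as np

N, C, H, W = 10, 3, 7, 7
FN, FH, FW = 4, 5, 5
out_h, out_w = H - FH + 1, W - FW + 1   # 3, 3

rng = np.random.default_rng(0)
col = rng.standard_normal((N * out_h * out_w, C * FH * FW))  # stand-in for im2col output
W_filters = rng.standard_normal((FN, C, FH, FW))
b = np.zeros(FN)

col_W = W_filters.reshape(FN, -1).T      # (C*FH*FW, FN): one column per filter
out = col @ col_W + b                    # (N*out_h*out_w, FN)
out = out.reshape(N, out_h, out_w, FN).transpose(0, 3, 1, 2)

print(col_W.shape)  # (75, 4)
print(out.shape)    # (10, 4, 3, 3)
```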

#### Implementation of convolution layer

```
class Convolution:
    def __init__(self, W, b, stride=1, pad=0):
        self.W = W
        self.b = b
        self.stride = stride
        self.pad = pad

        # Intermediate data (used in backward)
        self.x = None
        self.col = None
        self.col_W = None

        # Gradients of the weight and bias parameters
        self.dW = None
        self.db = None

    def forward(self, x):
        FN, C, FH, FW = self.W.shape
        N, C, H, W = x.shape
        # Calculate the output data size
        out_h = int(1 + (H + 2*self.pad - FH) / self.stride)
        out_w = int(1 + (W + 2*self.pad - FW) / self.stride)

        # Expand the input data
        col = im2col(x, FH, FW, self.stride, self.pad)
        # Expand the filters: one column per filter
        col_W = self.W.reshape(FN, -1).T
        # Calculate with matrix multiplication
        out = np.dot(col, col_W) + self.b
        # Restore the (N, FN, out_h, out_w) shape
        out = out.reshape(N, out_h, out_w, -1).transpose(0, 3, 1, 2)

        # Keep the intermediate data that backward needs
        self.x = x
        self.col = col
        self.col_W = col_W
        return out

    def backward(self, dout):
        FN, C, FH, FW = self.W.shape
        dout = dout.transpose(0, 2, 3, 1).reshape(-1, FN)

        self.db = np.sum(dout, axis=0)
        self.dW = np.dot(self.col.T, dout)
        self.dW = self.dW.transpose(1, 0).reshape(FN, C, FH, FW)

        dcol = np.dot(dout, self.col_W.T)
        dx = col2im(dcol, self.x.shape, FH, FW, self.stride, self.pad)
        return dx
```

Note that when -1 is specified in reshape, the number of elements along that dimension is calculated automatically so that the total number of elements of the multidimensional array stays the same. transpose changes the order of the axes of a multidimensional array. For example, transpose(0, 3, 1, 2) moves the original axes 0, 3, 1, 2 to positions 0, 1, 2, 3.
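A quick check of both behaviors on a small array (the shape is chosen arbitrarily):

```python
import numpy as np

x = np.arange(2 * 3 * 4 * 5).reshape(2, 3, 4, 5)

# -1 is inferred so that the total element count (120) is preserved: 120 / 10 = 12.
print(x.reshape(10, -1).shape)         # (10, 12)

# New axis k takes old axis axes[k], so old axis 3 becomes new axis 1.
y = x.transpose(0, 3, 1, 2)
print(y.shape)                         # (2, 5, 3, 4)
print(y[1, 4, 2, 3] == x[1, 2, 3, 4])  # True
```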

The above is the implementation of forward processing of convolution layer. As for the code of back propagation of convolution layer, col2im is used, which is the inverse process of im2col. The code is as follows:

```
def col2im(col, input_shape, filter_h, filter_w, stride=1, pad=0):
    """Inverse of im2col: fold a 2-D array back into 4-D image data.

    Parameters
    ----------
    col : 2-D array in the layout produced by im2col
    input_shape : shape of the input data (for example: (10, 1, 28, 28))
    filter_h : filter height
    filter_w : filter width
    stride : stride
    pad : padding

    Returns
    -------
    img : 4-D array of shape input_shape
    """
    N, C, H, W = input_shape
    out_h = (H + 2*pad - filter_h)//stride + 1
    out_w = (W + 2*pad - filter_w)//stride + 1
    col = col.reshape(N, out_h, out_w, C, filter_h, filter_w).transpose(0, 3, 4, 5, 1, 2)

    img = np.zeros((N, C, H + 2*pad + stride - 1, W + 2*pad + stride - 1))
    for y in range(filter_h):
        y_max = y + stride*out_h
        for x in range(filter_w):
            x_max = x + stride*out_w
            # Overlapping windows contribute to the same pixel, so accumulate.
            img[:, :, y:y_max:stride, x:x_max:stride] += col[:, :, y, x, :, :]

    # Strip the padding again.
    return img[:, :, pad:H + pad, pad:W + pad]
```

#### Implementation of pooling layer

The pooled application area is expanded separately by channel.

```
class Pooling:
    def __init__(self, pool_h, pool_w, stride=1, pad=0):
        self.pool_h = pool_h
        self.pool_w = pool_w
        self.stride = stride
        self.pad = pad

        # Intermediate data (used in backward)
        self.x = None
        self.arg_max = None

    def forward(self, x):
        N, C, H, W = x.shape
        out_h = int(1 + (H - self.pool_h) / self.stride)
        out_w = int(1 + (W - self.pool_w) / self.stride)

        # Expand: one row per pooling window and channel
        col = im2col(x, self.pool_h, self.pool_w, self.stride, self.pad)
        col = col.reshape(-1, self.pool_h * self.pool_w)

        # Maximum of each row (and its position, needed in backward)
        arg_max = np.argmax(col, axis=1)
        out = np.max(col, axis=1)

        # Convert back to (N, C, out_h, out_w)
        out = out.reshape(N, out_h, out_w, C).transpose(0, 3, 1, 2)

        self.x = x
        self.arg_max = arg_max
        return out

    def backward(self, dout):
        dout = dout.transpose(0, 2, 3, 1)

        # Route each gradient to the position that held the maximum.
        pool_size = self.pool_h * self.pool_w
        dmax = np.zeros((dout.size, pool_size))
        dmax[np.arange(self.arg_max.size), self.arg_max.flatten()] = dout.flatten()
        dmax = dmax.reshape(dout.shape + (pool_size,))

        dcol = dmax.reshape(dmax.shape[0] * dmax.shape[1] * dmax.shape[2], -1)
        dx = col2im(dcol, self.x.shape, self.pool_h, self.pool_w, self.stride, self.pad)
        return dx
```

*The maximum value can be computed with NumPy's np.max function. np.max accepts an axis argument and finds the maximum along the specified axis. For example, np.max(x, axis=1) finds the maximum of the input x along axis 1, that is, one maximum per row for a 2-D array.*

By expanding the input data into a shape that is easy to pool, the rest of the implementation becomes very simple.
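A tiny illustration, assuming each row of `col` is one flattened 2 × 2 pooling window (the values are made up):

```python
import numpy as np

# Each row stands for one flattened pooling window (2 x 2 -> 4 elements).
col = np.array([[1, 5, 2, 0],
                [7, 3, 3, 4],
                [0, 0, 9, 8]])

print(np.max(col, axis=1))     # [5 7 9]
print(np.argmax(col, axis=1))  # [1 0 2]
```

The argmax positions are what the backward pass uses to route gradients to the winning elements.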

### Implementation of CNN

The network to be implemented is composed of “Convolution - ReLU - Pooling - Affine - ReLU - Affine - Softmax”. We implement it as a class named SimpleConvNet.

```
import pickle
from collections import OrderedDict

import numpy as np

# Relu, Affine, SoftmaxWithLoss and the module-level numerical_gradient are
# assumed to be the implementations from the earlier parts of these notes.


class SimpleConvNet:
    """conv - relu - pool - affine - relu - affine - softmax

    Parameters
    ----------
    input_dim : shape of the input data (channel, height, width); (1, 28, 28) for MNIST
    conv_param : dict with 'filter_num', 'filter_size', 'pad' and 'stride'
    hidden_size : number of neurons in the hidden fully connected layer
    output_size : output size (10 in the case of MNIST)
    weight_init_std : standard deviation of the initial weights (e.g. 0.01)
    """
    def __init__(self, input_dim=(1, 28, 28),
                 conv_param={'filter_num': 30, 'filter_size': 5, 'pad': 0, 'stride': 1},
                 hidden_size=100, output_size=10, weight_init_std=0.01):
        filter_num = conv_param['filter_num']
        filter_size = conv_param['filter_size']
        filter_pad = conv_param['pad']
        filter_stride = conv_param['stride']
        input_size = input_dim[1]
        # Calculate the output data size of the convolution layer
        conv_output_size = (input_size - filter_size + 2*filter_pad) / filter_stride + 1
        # Calculate the output data size of the pooling layer (2x2 pooling, stride 2)
        pool_output_size = int(filter_num * (conv_output_size/2) * (conv_output_size/2))

        # Initialize the weights
        self.params = {}
        self.params['W1'] = weight_init_std * \
            np.random.randn(filter_num, input_dim[0], filter_size, filter_size)
        self.params['b1'] = np.zeros(filter_num)
        self.params['W2'] = weight_init_std * \
            np.random.randn(pool_output_size, hidden_size)
        self.params['b2'] = np.zeros(hidden_size)
        self.params['W3'] = weight_init_std * \
            np.random.randn(hidden_size, output_size)
        self.params['b3'] = np.zeros(output_size)

        # Generate the layers
        self.layers = OrderedDict()
        self.layers['Conv1'] = Convolution(self.params['W1'], self.params['b1'],
                                           conv_param['stride'], conv_param['pad'])
        self.layers['Relu1'] = Relu()
        self.layers['Pool1'] = Pooling(pool_h=2, pool_w=2, stride=2)
        self.layers['Affine1'] = Affine(self.params['W2'], self.params['b2'])
        self.layers['Relu2'] = Relu()
        self.layers['Affine2'] = Affine(self.params['W3'], self.params['b3'])

        # The output layer is kept separately
        self.last_layer = SoftmaxWithLoss()

    def predict(self, x):
        for layer in self.layers.values():
            x = layer.forward(x)
        return x

    def loss(self, x, t):
        """Compute the loss.

        x is the input data and t is the teacher label.
        """
        y = self.predict(x)
        return self.last_layer.forward(y, t)

    def accuracy(self, x, t, batch_size=100):
        if t.ndim != 1:
            t = np.argmax(t, axis=1)

        acc = 0.0
        for i in range(int(x.shape[0] / batch_size)):
            tx = x[i*batch_size:(i+1)*batch_size]
            tt = t[i*batch_size:(i+1)*batch_size]
            y = self.predict(tx)
            y = np.argmax(y, axis=1)
            acc += np.sum(y == tt)

        return acc / x.shape[0]

    def numerical_gradient(self, x, t):
        """Compute the gradients by numerical differentiation.

        Parameters
        ----------
        x : input data
        t : teacher label

        Returns
        -------
        Dictionary holding the gradient of each layer:
            grads['W1'], grads['W2'], ... are the weight gradients
            grads['b1'], grads['b2'], ... are the bias gradients
        """
        loss_w = lambda w: self.loss(x, t)

        grads = {}
        for idx in (1, 2, 3):
            # Calls the module-level numerical_gradient function, not this method.
            grads['W' + str(idx)] = numerical_gradient(loss_w, self.params['W' + str(idx)])
            grads['b' + str(idx)] = numerical_gradient(loss_w, self.params['b' + str(idx)])

        return grads

    def gradient(self, x, t):
        """Compute the gradients by error backpropagation.

        Parameters
        ----------
        x : input data
        t : teacher label

        Returns
        -------
        Dictionary holding the gradient of each layer:
            grads['W1'], grads['W2'], ... are the weight gradients
            grads['b1'], grads['b2'], ... are the bias gradients
        """
        # forward
        self.loss(x, t)

        # backward
        dout = 1
        dout = self.last_layer.backward(dout)

        layers = list(self.layers.values())
        layers.reverse()
        for layer in layers:
            dout = layer.backward(dout)

        # Collect the gradients
        grads = {}
        grads['W1'], grads['b1'] = self.layers['Conv1'].dW, self.layers['Conv1'].db
        grads['W2'], grads['b2'] = self.layers['Affine1'].dW, self.layers['Affine1'].db
        grads['W3'], grads['b3'] = self.layers['Affine2'].dW, self.layers['Affine2'].db

        return grads

    def save_params(self, file_name="params.pkl"):
        params = {}
        for key, val in self.params.items():
            params[key] = val
        with open(file_name, 'wb') as f:
            pickle.dump(params, f)

    def load_params(self, file_name="params.pkl"):
        with open(file_name, 'rb') as f:
            params = pickle.load(f)
        for key, val in params.items():
            self.params[key] = val

        # Point the layers at the loaded parameters
        for i, key in enumerate(['Conv1', 'Affine1', 'Affine2']):
            self.layers[key].W = self.params['W' + str(i+1)]
            self.layers[key].b = self.params['b' + str(i+1)]
```

Apart from the use of the convolution layer and pooling layer in the network structure, the main flow of the CNN implementation is no different from the neural network previously implemented with fully connected layers.
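To make the size bookkeeping in `__init__` concrete, here is the same arithmetic for the default arguments, written with integer division as a small standalone sketch:

```python
input_dim = (1, 28, 28)
filter_num, filter_size, pad, stride = 30, 5, 0, 1

# Convolution: (28 - 5 + 2*0) / 1 + 1 = 24
conv_out = (input_dim[1] - filter_size + 2 * pad) // stride + 1
# 2x2 pooling with stride 2 halves each side: 24 -> 12
pool_out = conv_out // 2
# Flattened input size of the first Affine layer: 30 * 12 * 12
affine_in = filter_num * pool_out * pool_out

print(conv_out, pool_out, affine_in)  # 24 12 4320
```

So W2 has shape (4320, 100) with the default hidden_size of 100.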

### CNN visualization

What is the convolution layer used in CNN “observing”?

The following figure shows the weight of the convolution layer of the first layer before and after learning. The elements of the weight are real numbers. However, on the display of the image, the minimum value is uniformly displayed as black (0) and the maximum value is displayed as white (255):

The filters before learning are initialized randomly, so their black-and-white shading shows no pattern, but the filters after learning form regular images. In other words, **through learning, the filters are updated into regular filters**.
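The black-to-white display mapping described above (minimum to 0, maximum to 255) could be sketched as a linear rescaling. The helper name `to_grayscale_uint8` and the random stand-in weights are assumptions:

```python
import numpy as np

def to_grayscale_uint8(w):
    """Linearly rescale w so that min -> 0 (black) and max -> 255 (white)."""
    w_min, w_max = w.min(), w.max()
    if w_max == w_min:
        # Constant input: nothing to stretch, show it as black.
        return np.zeros_like(w, dtype=np.uint8)
    scaled = (w - w_min) / (w_max - w_min) * 255
    return np.round(scaled).astype(np.uint8)

# Hypothetical first-layer weights: 30 filters, 1 channel, 5 x 5.
W1 = 0.01 * np.random.randn(30, 1, 5, 5)
img = to_grayscale_uint8(W1)
print(img.shape, img.min(), img.max())  # (30, 1, 5, 5) 0 255
```

Each `img[i, 0]` is then a 5 × 5 grayscale tile that an image library can display.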

If you ask what the regular filters on the right side of the figure are “observing”, the answer is that they are observing **edges** (boundaries where the color changes) and **blobs** (local block-shaped regions), etc.

In traditional digital image processing, the filter for edge detection is generally fixed, such as a gradient operator or the Laplacian of Gaussian operator. A neural network can learn the patterns of edges from the training images and generate appropriate filters.
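For comparison, here is what a fixed, hand-designed edge filter does. The Sobel kernel is used as one example of a gradient operator (my choice for illustration), applied to a toy image with a single vertical edge:

```python
import numpy as np

# Fixed horizontal-gradient (Sobel) kernel: it responds to vertical edges.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])

# Toy image: dark left half, bright right half.
image = np.zeros((5, 6))
image[:, 3:] = 1.0

# Apply the filter at every position (stride 1, no padding).
out = np.zeros((3, 4))
for i in range(3):
    for j in range(4):
        out[i, j] = np.sum(image[i:i+3, j:j+3] * sobel_x)

print(out[0])  # values: [0, 4, 4, 0]
```

The response is non-zero only where the window straddles the edge. A learned first-layer filter can end up with a similar response pattern without being designed by hand.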

#### Information extraction based on hierarchical structure

According to research on the visualization of deep learning, the extracted information (more precisely, what the strongly responding neurons react to) becomes more and more abstract as the layers get deeper.

Consider the information extracted by the convolution layers of a CNN. Layer 1 neurons respond to **edges or blobs**, layer 3 to **textures**, layer 5 to **object parts**, and the final fully connected layer to the **category of the object** (dog or car).

If multiple convolution layers are stacked, the extracted information becomes more complex and abstract as the layers deepen, which is a very interesting aspect of deep learning. As the layers deepen, the neurons shift from responding to simple shapes to responding to “high-level” information. In other words, **just as we understand the “meaning” of things, the object of response gradually changes**.

### Representative CNN

#### LeNet

LeNet is one of the earliest CNNs. Compared with today's CNNs, LeNet differs in several ways:

- The first difference is the activation function. LeNet uses the sigmoid function, whereas today's CNNs mainly use the ReLU function.
- The original LeNet uses subsampling to reduce the size of intermediate data, whereas max pooling is the mainstream in today's CNNs.
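The activation-function difference is easy to see numerically. A minimal sketch of the two functions, with arbitrary sample inputs:

```python
import numpy as np

def sigmoid(x):
    """Smooth squashing function with outputs in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    """Passes positive inputs unchanged and outputs 0 otherwise."""
    return np.maximum(0, x)

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x))  # approximately [0.119, 0.5, 0.881]
print(relu(x))     # values: [0, 0, 2]
```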

#### AlexNet

AlexNet triggered the deep learning boom, but its network structure is basically no different from LeNet.

AlexNet stacks multiple convolution layers and pooling layers and finally outputs the result through fully connected layers. Structurally, AlexNet and LeNet are not very different, but there are the following differences:

- The activation function is ReLU.
- An LRN (local response normalization) layer is used for local normalization.

- Dropout is used (neurons are randomly dropped during learning).
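As a rough sketch of the dropout idea in the last item, written in the same forward/backward layer style as the classes above. This is one common formulation, not AlexNet's actual code; the class name, the `train_flg` argument and the test-time scaling are assumptions:

```python
import numpy as np

class Dropout:
    def __init__(self, dropout_ratio=0.5):
        self.dropout_ratio = dropout_ratio
        self.mask = None

    def forward(self, x, train_flg=True):
        if train_flg:
            # Keep each neuron with probability 1 - dropout_ratio.
            self.mask = np.random.rand(*x.shape) > self.dropout_ratio
            return x * self.mask
        # At test time all neurons are used, scaled by the keep probability.
        return x * (1.0 - self.dropout_ratio)

    def backward(self, dout):
        # Gradients flow only through the neurons that were kept.
        return dout * self.mask

layer = Dropout(0.5)
x = np.ones((4, 5))
y_train = layer.forward(x, train_flg=True)   # entries are 0 or 1
y_test = layer.forward(x, train_flg=False)   # every entry is 0.5
```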