## 1. Introduction

In image semantic segmentation, a question that puzzled computer scientists for many years is how to separate the objects we care about from everything else. Take a picture of a kitten: how can a computer recognize the image and separate the kitten from the background? As shown in the figure below:

FCN (the fully convolutional network), published in 2015, handles this problem remarkably well. According to the FCN paper, it achieved a 20% relative improvement in mean IU (mean intersection over union, which is not the same as plain recognition accuracy), reaching 62.2% on the Pascal VOC 2012 data set, with a pixel-level accuracy of about 90%. This was a very strong result for its time. As shown in the figure above, the kitten image is divided into three parts: background, kitten and edge. Each pixel therefore receives one of three predicted labels: kitten, background or edge. This is what fully convolutional networks do: a **pixel-level classification** task. So how is this network designed and implemented?
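To make the two metrics concrete, here is a minimal NumPy sketch that computes pixel accuracy and mean IU for a toy three-class label map. The function name `segmentation_scores` and the toy arrays are made up for illustration, and the definitions follow the usual confusion-matrix formulation rather than any code from the paper.

```python
import numpy as np

def segmentation_scores(y_true, y_pred, num_classes):
    """Return (pixel accuracy, mean IU) for two integer label maps."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    # confusion[i, j] = number of pixels of true class i predicted as class j
    confusion = np.bincount(
        y_true * num_classes + y_pred, minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)
    true_pos = np.diag(confusion)
    pixel_acc = true_pos.sum() / confusion.sum()
    # union = predicted pixels + true pixels - correctly predicted pixels
    union = confusion.sum(axis=0) + confusion.sum(axis=1) - true_pos
    present = union > 0  # skip classes that appear in neither map
    mean_iu = (true_pos[present] / union[present]).mean()
    return pixel_acc, mean_iu

# Toy 2 x 3 label maps: 0 = background, 1 = kitten, 2 = edge
truth = np.array([[0, 0, 1],
                  [2, 1, 1]])
pred = np.array([[0, 1, 1],
                 [2, 1, 1]])
acc, miu = segmentation_scores(truth, pred, num_classes=3)
```

One wrong pixel out of six gives a pixel accuracy of 5/6, while the per-class IU values (0.5, 0.75 and 1.0) average to 0.75, which shows why mean IU is usually the stricter of the two numbers.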

## 2. Implementing the network

The name "fully convolutional neural network" sounds imposing, but it simply reflects that the fully connected layers at the end of a classification network are replaced with 1 × 1 convolutions. In this network, the image goes through convolution, convolution and pooling, again and again, until the feature map is small enough. At that point upsampling is applied to restore the original image size. What is upsampling? If you have not come across it before, it is covered in section 3. The structure of the network is as follows (original figure from the paper):

As the figure shows, we feed in a picture containing a kitten and a dog and run forward propagation. At the penultimate stage of the forward pass, the feature map has shape n × n × 21: the VOC data set has 21 classes for the softmax (20 object classes plus background), so each class needs its own output probability (confidence). The backbone can be VGG16, AlexNet, GoogLeNet, or even ResNet. The authors of the paper experimented with the first three; ResNet had not been published at the time, so it was not tried. Once the network produces the n × n × 21 output, we upsample it, restoring the 21-class result to a map with the same height and width as the original image. Each pixel of that map holds the probabilities of the 21 classes, so we can tell which class each pixel should be assigned to. So what is upsampling of images?
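As a rough illustration of what "each pixel holds 21 probabilities" means, the sketch below applies a softmax over the channel axis of a random score map and then picks the most probable class per pixel. The 4 × 4 size and the random scores are placeholders standing in for a real network output.

```python
import numpy as np

rng = np.random.default_rng(0)
height, width, num_classes = 4, 4, 21   # toy size; 21 = 20 VOC classes + background
scores = rng.normal(size=(height, width, num_classes))  # stand-in for network output

# Softmax over the channel axis turns the 21 scores of each pixel into probabilities
shifted = scores - scores.max(axis=-1, keepdims=True)   # for numerical stability
probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

# The predicted label map keeps, for every pixel, the index of the most probable class
label_map = probs.argmax(axis=-1)
```

The label map has the same height and width as the score map but only one value per pixel, which is the form in which segmentation results are usually displayed.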

## 3. Upsampling of images

Upsampling is the opposite of ordinary convolution in its effect on size. Normal convolution makes the feature map smaller and smaller, while upsampling, which can also be implemented with convolution, makes it larger and larger until it is scaled back to the size of the original image. The arithmetic of upsampling is explained in detail in this paper: https://arxiv.org/abs/1603.07285. There are three common upsampling methods: bilinear interpolation, deconvolution (transposed convolution), and unpooling. The fully convolutional network uses deconvolution. Let us first review **forward convolution**, which acts as **downsampling**. Suppose we have a 3 × 3 convolution kernel as follows:

Then a 5 × 5 feature map is convolved with the sliding-window method, using padding = 0, stride = 1 and kernel size = 3, which yields a 3 × 3 feature map.
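The 5 × 5 to 3 × 3 example follows from the standard output-size formula, (n + 2·padding − kernel) / stride + 1. The sketch below checks it with a naive sliding-window convolution in NumPy; the helper names and the all-ones kernel are illustrative choices, not values from the figure.

```python
import numpy as np

def conv_output_size(n, kernel, stride=1, padding=0):
    """Side length of the output of a convolution on an n x n input."""
    return (n + 2 * padding - kernel) // stride + 1

def conv2d_valid(image, kernel):
    """Naive stride-1 convolution without padding
    (cross-correlation, as computed by deep learning frameworks)."""
    k = kernel.shape[0]
    out = conv_output_size(image.shape[0], k)
    result = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            # Multiply the window element-wise with the kernel and sum
            result[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return result

feature_map = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3))
output = conv2d_valid(feature_map, kernel)   # shape (3, 3)
```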

What about upsampling? Assume the input is a 2 × 2 feature map and the desired output is a 4 × 4 feature map. First we pad the original 2 × 2 map with padding = 2. After consulting several references, I found that the padding consists of zeros; the padded values are not parameters learned by the network. We then scan the padded area with a window of kernel size = 3 and stride = 1, which gives a 4 × 4 feature map:

We can also separate the pixels of the 2 × 2 feature map with a gap of one cell. The gaps are filled with zeros, and so is the padding around the border. Convolving again then gives a 5 × 5 feature map, as shown below:
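Both upsampling cases can be reproduced with a short NumPy sketch: optionally insert zeros between the input pixels, pad the border with zeros, then slide a 3 × 3 kernel with step 1. The function name `upsample_by_convolution` is hypothetical, and the all-ones kernel is symmetric, so the kernel flipping that distinguishes a true transposed convolution is ignored here.

```python
import numpy as np

def upsample_by_convolution(feature_map, kernel, stride=1, padding=2):
    """Insert (stride - 1) zeros between pixels, zero-pad the border,
    then slide the kernel over the result with step 1."""
    n = feature_map.shape[0]
    k = kernel.shape[0]
    dilated_size = (n - 1) * stride + 1
    dilated = np.zeros((dilated_size, dilated_size))
    dilated[::stride, ::stride] = feature_map   # original pixels, zeros in between
    padded = np.pad(dilated, padding)           # zeros around the border
    out = padded.shape[0] - k + 1
    result = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            result[i, j] = np.sum(padded[i:i + k, j:j + k] * kernel)
    return result

small = np.array([[1.0, 2.0],
                  [3.0, 4.0]])
kernel = np.ones((3, 3))
up4 = upsample_by_convolution(small, kernel, stride=1)   # 2 x 2 -> 4 x 4
up5 = upsample_by_convolution(small, kernel, stride=2)   # 2 x 2 -> 5 x 5
```

In a trained network the kernel values are learned, so the upsampling filter is fitted to the data rather than fixed like bilinear interpolation.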

That completes deconvolution. So what is 1 × 1 convolution?

## 4. 1 × 1 convolution

In the forward pass of our convolutional network, the final output is n × n × 21. The number of channels is something we choose ourselves through the 1 × 1 convolution, which gives us 21 classes and a probability for each. After upsampling, the output feature map has the same size as the original image, and each pixel has 21 channels holding the probability that the pixel belongs to each class. When Professor Andrew Ng (Wu Enda) explains convolutional neural networks, he uses a classic figure to illustrate 1 × 1 convolution:

The original feature map has height and width 28 and 192 channels. With 32 convolution kernels of size 1 × 1 we can turn the 28 × 28 × 192 map into a 28 × 28 × 32 feature map. With 1 × 1 convolution, the height and width stay unchanged and the number of output channels equals the number of kernels. The process can be pictured with an abstract 3D diagram:

So by controlling the number of convolution kernels we can reduce or increase the dimensionality of the data, that is, the number of channels, while the height and width of the feature map stay the same. In the forward pass of the fully convolutional network (FCN), the last step of downsampling (see the first figure of this post) turns an n × n × 4096 feature map into an n × n × 21 feature map.
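A 1 × 1 convolution can be written as a single matrix product applied to the channel vector of every pixel. The sketch below reproduces the 28 × 28 × 192 to 28 × 28 × 32 example with random values; the weights are placeholders, not trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_map = rng.normal(size=(28, 28, 192))   # height x width x input channels
weights = rng.normal(size=(192, 32))           # 32 kernels, each of size 1 x 1 x 192
bias = np.zeros(32)

# A 1 x 1 convolution applies the same linear map independently at every pixel,
# so height and width are untouched and only the channel count changes
output = feature_map @ weights + bias          # shape (28, 28, 32)
```

Replacing the 32 kernels with 21 would give the per-class score map described above, which is how the n × n × 4096 map becomes n × n × 21.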

## 5. Skip connections in the fully convolutional network

If we simply convolve first and then upsample to get a feature map the size of the original image, experiments show that the segmentation quality is not very good. While the feature map is still relatively large, the information extracted from the image is rich; the deeper we go, the more obvious the **information loss** becomes. After each of the five convolution-and-pooling stages, the resolution of the original image is reduced by a factor of 2, 4, 8, 16 and 32 respectively. The output of the last stage needs 32× upsampling to get back to the original size. Relying on that last layer alone, however, gives imprecise results in which many details are lost. The authors therefore use skip connections: feature maps extracted by earlier convolution stages are combined with the upsampled output of later stages, and upsampling then continues. After several rounds of upsampling we get a feature map the size of the original image that retains more of the original image's information. As shown in the figure below:

The first skip connection the authors propose upsamples the output of stage 5, combines it with the prediction from pooling layer 4, and then upsamples the result to the size of the original picture. This strategy is called FCN-16s. They then additionally fuse the prediction from pooling layer 3, which is called FCN-8s. This variant turns out to have the highest accuracy. As shown in the figure below:
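The names FCN-32s, FCN-16s and FCN-8s refer to the upsampling factor that is still needed after the last fusion step. The arithmetic can be sketched in plain Python; the input side of 224 is an assumption chosen so that every division is exact, and `pooled_sizes` is a hypothetical helper.

```python
def pooled_sizes(input_size, num_pools=5):
    """Side length of the feature map after each stride-2 pooling stage."""
    sizes = []
    size = input_size
    for _ in range(num_pools):
        size //= 2
        sizes.append(size)
    return sizes

input_size = 224                      # assumed input side length
pool1, pool2, pool3, pool4, pool5 = pooled_sizes(input_size)

# Upsampling factor still needed after the last fusion step of each variant
variants = {
    "FCN-32s": input_size // pool5,   # no fusion: upsample the final prediction directly
    "FCN-16s": input_size // pool4,   # fuse the 2x-upsampled final prediction with pool4
    "FCN-8s": input_size // pool3,    # additionally fuse with pool3
}
```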

Ground truth is the manual annotation of the original image; the columns before it are the predictions made by the network. For the skip connections, suppose the original image is 500 × 500 × 3. The exact size does not matter, because a fully convolutional network accepts images of any size. First we apply a convolution to the earlier green feature map, which has just come out of a max-pooling layer. Then we convolve the next green feature map. Finally we take the 16 × 16 × 21 output that has gone through the 1 × 1 convolution and add the three outputs together. This realizes the skip connections. The fused result is then upsampled to a 568 × 568 × 21 map, which is turned into a 500 × 500 × 21 feature map on the way through the softmax layer. The output therefore has exactly the same height and width as the original image. Each pixel has 21 probability values, one per class, so the output differs from the original image only in the number of channels.

Now let's look at an implementation in TensorFlow.

## 6. TensorFlow implementation of the fully convolutional network

First, import the packages and read the image data:

```
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
import os
import glob
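# sorted() keeps the image and annotation lists in the same file-name order
images=sorted(glob.glob(r"F:\UNIVERSITY STUDY\AI\dataset\FCN\images\*.jpg"))
#Then read the target (annotation) images
anno=sorted(glob.glob(r"F:\UNIVERSITY STUDY\AI\dataset\FCN\annotations\trimaps\*.png"))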
```

The glob library can be used to list the local image files from which each batch is built. I put the data set in the folder `F:\UNIVERSITY STUDY\AI\dataset\FCN`.

The images folder holds the training images, and the annotations folder holds the hand-labelled maps that mark the boundaries.

The marked pictures are as follows:

The original picture, a photo of a dog, is as follows:

Next we build the data set and its batches, together with the functions that read the image files. PNG and JPG files are each decoded into a three-dimensional tensor.

```
# Shuffle the image and annotation paths with the same permutation
np.random.seed(2019)
index = np.random.permutation(len(images))
images = np.array(images)[index]
anno = np.array(anno)[index]

# Create the dataset and split it 80/20 into training and test sets
dataset = tf.data.Dataset.from_tensor_slices((images, anno))
test_count = int(len(images) * 0.2)
train_count = len(images) - test_count
data_train = dataset.skip(test_count)
data_test = dataset.take(test_count)

def read_jpg(path):
    """Decode a JPEG file into an (H, W, 3) tensor."""
    img = tf.io.read_file(path)
    img = tf.image.decode_jpeg(img, channels=3)
    return img

def read_png(path):
    """Decode a PNG annotation into an (H, W, 1) tensor."""
    img = tf.io.read_file(path)
    img = tf.image.decode_png(img, channels=1)
    return img

def normal_img(input_images, input_anno):
    """Scale pixels to [-1, 1] and shift the labels so that they start at 0."""
    input_images = tf.cast(input_images, tf.float32)
    input_images = input_images / 127.5 - 1
    input_anno -= 1
    return input_images, input_anno

def load_images(input_images_path, input_anno_path):
    """Read one image/annotation pair and resize both to 224 x 224."""
    input_image = read_jpg(input_images_path)
    input_anno = read_png(input_anno_path)
    input_image = tf.image.resize(input_image, (224, 224))
    # Nearest-neighbour resizing keeps the labels as whole class indices
    # (the default bilinear method would blend neighbouring labels)
    input_anno = tf.image.resize(input_anno, (224, 224), method='nearest')
    return normal_img(input_image, input_anno)

data_train = data_train.map(load_images, num_parallel_calls=tf.data.experimental.AUTOTUNE)
data_test = data_test.map(load_images, num_parallel_calls=tf.data.experimental.AUTOTUNE)

# Build the batches
BATCH_SIZE = 3  # adjust according to GPU memory
data_train = data_train.repeat().shuffle(100).batch(BATCH_SIZE)
data_test = data_test.batch(BATCH_SIZE)
```
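As a quick sanity check of the normalization step, the sketch below applies the same arithmetic as `normal_img` to a few sample values in NumPy. The trimap label values 1, 2 and 3 are an assumption inferred from the `input_anno -= 1` line and the three output classes; they are not read from the data set here.

```python
import numpy as np

pixels = np.array([0.0, 127.5, 255.0])
normalized = pixels / 127.5 - 1         # same formula as normal_img: [0, 255] -> [-1, 1]

trimap_labels = np.array([1, 2, 3])     # assumed label values in the annotation PNGs
shifted_labels = trimap_labels - 1      # class indices 0, 1, 2 for the sparse loss
```

Shifting the labels to start at 0 matters because `sparse_categorical_crossentropy` expects class indices in the range 0 to num_classes − 1.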

Then we use VGG16 as the convolutional backbone, with ImageNet pre-trained weights for transfer learning, and build the network with its skip connections:

```
conv_base = tf.keras.applications.VGG16(weights='imagenet',
                                        input_shape=(224, 224, 3),
                                        include_top=False)
# A sub-model built from conv_base's input and an intermediate layer's output
# shares conv_base's weights, which gives access to intermediate outputs.
# (sub_model itself is not used again below; it only demonstrates the idea.)
sub_model = tf.keras.models.Model(inputs=conv_base.input,
                                  outputs=conv_base.get_layer('block5_conv3').output)
# Now create a multi-output model: three skip outputs plus the final pooling output
layer_names = [
    'block5_conv3',
    'block4_conv3',
    'block3_conv3',
    'block5_pool'
]
layers_output = [conv_base.get_layer(layer_name).output for layer_name in layer_names]
# With several outputs, one forward pass of a picture returns all of these
# feature maps, which are the inputs needed for the skip connections below
multiout_model = tf.keras.models.Model(inputs=conv_base.input,
                                       outputs=layers_output)
multiout_model.trainable = False
inputs = tf.keras.layers.Input(shape=(224, 224, 3))
# The multi-output model returns several values, so unpack them into separate names
out_block5_conv3, out_block4_conv3, out_block3_conv3, out = multiout_model(inputs)
# Upsample the last output, then add it to the intermediate outputs
# to realize the skip connections
# 512 convolution kernels, each of size 3 x 3
x1 = tf.keras.layers.Conv2DTranspose(512, 3,
                                     strides=2,
                                     padding='same',
                                     activation='relu')(out)
# After upsampling, add a convolution layer to extract features
x1 = tf.keras.layers.Conv2D(512, 3, padding='same',
                            activation='relu')(x1)
# Add to block5_conv3 of the multi-output results; the shape is unchanged
x2 = tf.add(x1, out_block5_conv3)
# Upsample x2
x2 = tf.keras.layers.Conv2DTranspose(512, 3,
                                     strides=2,
                                     padding='same',
                                     activation='relu')(x2)
# x3 is obtained directly by addition, without an extra convolution
x3 = tf.add(x2, out_block4_conv3)
# Upsample x3
x3 = tf.keras.layers.Conv2DTranspose(256, 3,
                                     strides=2,
                                     padding='same',
                                     activation='relu')(x3)
# Extract features with an additional convolution
x3 = tf.keras.layers.Conv2D(256, 3, padding='same', activation='relu')(x3)
x4 = tf.add(x3, out_block3_conv3)
# x4 must be upsampled further to reach the size of the original image before classification
x5 = tf.keras.layers.Conv2DTranspose(128, 3,
                                     strides=2,
                                     padding='same',
                                     activation='relu')(x4)
# Continue to extract features by convolution
x5 = tf.keras.layers.Conv2D(128, 3, padding='same', activation='relu')(x5)
# Last step: restore the original resolution, with 3 output classes per pixel
prediction = tf.keras.layers.Conv2DTranspose(3, 3,
                                             strides=2,
                                             padding='same',
                                             activation='softmax')(x5)
model = tf.keras.models.Model(
    inputs=inputs,
    outputs=prediction
)
```
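The additions in the decoder only work if the upsampled tensors match the VGG16 feature maps exactly. The plain-Python sketch below traces the shapes through the five `Conv2DTranspose` layers above, relying on the fact that `strides=2` with `padding='same'` doubles height and width. The helper `transpose_same` is hypothetical, and the VGG16 shapes listed are those for a 224 × 224 input.

```python
def transpose_same(shape, filters, stride=2):
    """Output shape of Conv2DTranspose(filters, ..., strides=stride, padding='same')."""
    height, width, _ = shape
    return (height * stride, width * stride, filters)

shape = (7, 7, 512)                       # block5_pool output for a 224 x 224 input
trace = [shape]
for filters in (512, 512, 256, 128, 3):   # the five Conv2DTranspose layers, in order
    shape = transpose_same(shape, filters)
    trace.append(shape)

# VGG16 feature maps that are added in after the first three upsampling steps
skip_shapes = {
    "block5_conv3": (14, 14, 512),
    "block4_conv3": (28, 28, 512),
    "block3_conv3": (56, 56, 256),
}
```

The final entry of the trace is 224 × 224 × 3: the original resolution with one probability per class (background, pet, edge) for every pixel.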

Compile and fit model:

```
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['acc']  # needed so that accuracy is printed during training
)
model.fit(data_train,
          epochs=1,
          steps_per_epoch=train_count // BATCH_SIZE,
          validation_data=data_test,
          # The logged run below used train_count // BATCH_SIZE here, which
          # exhausts data_test and triggers the "ran out of data" warning
          validation_steps=test_count // BATCH_SIZE)
```

Output:

```
Train for 1970 steps, validate for 1970 steps
1969/1970 [============================>.] - ETA: 1s - loss: 0.3272 - acc: 0.8699WARNING:tensorflow:Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least `steps_per_epoch * epochs` batches (in this case, 1970 batches). You may need to use the repeat() function when building your dataset.
1970/1970 [==============================] - 3233s 2s/step - loss: 0.3271 - acc: 0.8699 - val_loss: 0.0661 - val_acc: 0.8905
```

After only one epoch, the pixel accuracy on the validation set has already reached 89%. Isn't that amazing?

That covers everything about FCN! I hope you got something out of it.