Time: 2020-10-23

# Introduction

ICNet is a model for real-time semantic segmentation: it runs fast, is small, and has few parameters. Improving these metrics inevitably costs accuracy, so the authors propose a cascaded network architecture that helps recover image edges and other details. The authors of the paper are Hengshuang Zhao, Xiaojuan Qi, Xiaoyong Shen, Jianping Shi and Jiaya Jia.

Original paper

# Network structure

## Speed analysis

In a convolutional network, the input feature map $V$ has dimensions $c \times h \times w$, and the output feature map $U$ after the convolution has dimensions $c' \times h' \times w'$, where $c$, $h$ and $w$ denote the number of channels, the height and the width of the feature map. The convolution maps $V$ to $U$ with $c'$ kernels $K$ of dimension $c \times k \times k$, where $k \times k$ is the spatial size of the kernel. The number of operations is therefore $c' c k^2 h' w'$. Because $h' = \frac{h}{s}$ and $w' = \frac{w}{s}$, where $s$ is the stride, the operation count can be rewritten as $\frac{c' c k^2 h w}{s^2}$.

The final formula shows that the computational complexity depends on the resolution ($w$, $h$), the number of input channels ($c$) and the number of convolution kernels ($c'$).
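
As a rough illustration of the operation count $c' c k^2 h w / s^2$, the sketch below evaluates it for one layer at two input resolutions. The function name `conv_ops` and the channel counts are assumptions chosen for the example, not values from the paper.

```python
def conv_ops(c_in, c_out, k, h, w, stride=1):
    """Operation count of one convolution layer: c' * c * k^2 * h' * w'."""
    h_out, w_out = h // stride, w // stride  # h' = h / s, w' = w / s
    return c_out * c_in * k * k * h_out * w_out


# Same (hypothetical) layer applied to a 1024x2048 and a 512x1024 feature map.
full = conv_ops(c_in=256, c_out=256, k=3, h=1024, w=2048)
half = conv_ops(c_in=256, c_out=256, k=3, h=512, w=1024)
print(full // half)  # halving both sides divides the cost by 4
```

With everything else fixed, the cost scales with the number of pixels, which is why the input resolution matters so much for speed.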

Panel (b) of the figure above shows the computation in each layer when PSPNet50 processes images at two resolutions. The blue curve is for a resolution of $1024 \times 2048$ and the green curve for $512 \times 1024$. For the high-resolution image, the computation in stage 5 increases greatly.

## Framework

Choosing between speed and accuracy is difficult. Reducing model complexity improves speed but greatly reduces accuracy, while a complex model is very time-consuming. Rather than choosing one, the authors combine the two in a cascaded structure, the image cascade network (ICNet).

It takes three inputs at different resolutions: the original image, the image at $\frac{1}{2}$ size, and the image at $\frac{1}{4}$ size. Running a conventional large segmentation network on the original image is very expensive, so the authors run it only on the $\frac{1}{4}$ image. Because the resolution is much smaller, the computation is small (as the PSPNet50 comparison at different resolutions shows). Specifically, PSPNet downsamples the $\frac{1}{4}$ image by a factor of 8, giving a feature map at $\frac{1}{32}$ of the original size (the first row of the figure above). To get better segmentation, the $\frac{1}{2}$ image and the original image are used to restore detail. The $\frac{1}{2}$ image is downsampled by a factor of 8 and fused with the previous feature map by a CFF operation. The original image is downsampled by a factor of 8 by a very simple CNN, fused by CFF with the fused feature map of the previous level, and then upsampled back to the original size.
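
The resolutions in the three branches can be checked with a little arithmetic. The sketch below assumes, as described above, that every branch downsamples its own input by a factor of 8; the helper name `branch_feature_scales` and the 1024x2048 example size are illustrative.

```python
from fractions import Fraction


def branch_feature_scales(input_scales=(Fraction(1, 4), Fraction(1, 2), Fraction(1)),
                          downsample=8):
    """Feature-map size of each branch relative to the original image."""
    return [scale / downsample for scale in input_scales]


scales = branch_feature_scales()
for scale in scales:
    # Relative scale, then the feature-map size for a 1024x2048 input.
    print(scale, (int(1024 * scale), int(2048 * scale)))
```

The low, medium and high resolution branches therefore produce feature maps at 1/32, 1/16 and 1/8 of the original size, so each fusion step only needs a 2x upsampling of the coarser map.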

The CFF (cascade feature fusion) module in the model diagram fuses low-resolution feature maps with high-resolution ones. It also has an output that feeds an auxiliary loss function. The module is shown in the following figure:

As the figure shows, the input F1 is the lower-resolution feature map. It is upsampled by a factor of 2 so that its resolution matches the higher-level feature map, and then passes through a dilated (atrous) convolution. F2 is the feature map with higher resolution than F1. It passes through a simple $1 \times 1$ projection convolution, and the two results are added.

In addition, the CFF module has an auxiliary loss output. These auxiliary losses are added into the total loss function, which makes gradient optimization smoother and strengthens learning.
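
The data flow of the fusion can be sketched in plain NumPy. This is a simplified forward pass under stated assumptions, not the authors' implementation: upsampling is nearest-neighbour, batch normalization is omitted, the dilation rate is taken to be 2, a ReLU follows the sum, and all function names and shapes are made up for the example.

```python
import numpy as np


def upsample2x(x):
    """Nearest-neighbour 2x upsampling of an (H, W, C) feature map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)


def dilated_conv3x3(x, kernel, dilation=2):
    """3x3 convolution with dilation and zero 'same' padding.

    x: (H, W, C_in), kernel: (3, 3, C_in, C_out).
    """
    h, w, _ = x.shape
    d = dilation
    padded = np.pad(x, ((d, d), (d, d), (0, 0)))
    out = np.zeros((h, w, kernel.shape[-1]))
    for i in range(3):
        for j in range(3):
            # Kernel tap (i, j) reads the input shifted by ((i-1)*d, (j-1)*d).
            window = padded[i * d:i * d + h, j * d:j * d + w, :]
            out += window @ kernel[i, j]
    return out


def cff(f1, f2, k_dilated, k_proj):
    """Fuse low-resolution f1 with f2, which has twice the resolution."""
    a = dilated_conv3x3(upsample2x(f1), k_dilated)  # F1 path
    b = f2 @ k_proj                                 # F2 path: 1x1 projection
    return np.maximum(a + b, 0.0)                   # sum, then ReLU


rng = np.random.default_rng(0)
f1 = rng.normal(size=(10, 10, 8))    # low-resolution features
f2 = rng.normal(size=(20, 20, 4))    # features at twice the resolution
fused = cff(f1, f2, rng.normal(size=(3, 3, 8, 16)), rng.normal(size=(4, 16)))
print(fused.shape)
```

Both paths end with the same number of channels (16 here), which is what makes the element-wise sum possible.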

The final loss function is defined as:

$L = -\sum_{t=1}^{T}\lambda_t\frac{1}{Y_t X_t}\sum_{y=1}^{Y_t}\sum_{x=1}^{X_t}\log\frac{e^{F^t_{\hat{n},y,x}}}{\sum_{n=1}^{N}e^{F^t_{n,y,x}}}$

The inner part of the formula is the standard softmax cross-entropy loss, averaged over the $Y_t \times X_t$ positions of the feature map $F^t$ predicted by branch $t$. Here $N$ is the number of classes and $\hat{n}$ is the ground-truth label at position $(y, x)$. The outer sum over $t$ from 1 to $T$ runs over the $T$ branch losses, including the auxiliary ones. Adding them with weights gives the final loss $L$, where $\lambda_t$ is the weight coefficient of branch $t$.
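
A minimal NumPy sketch of this weighted sum of per-branch softmax losses is given below. The names `softmax_cross_entropy` and `cascade_loss` are illustrative, and the weights 0.4, 0.4 and 1.0 are the ones used later in this article's training code.

```python
import numpy as np


def softmax_cross_entropy(logits, labels):
    """Mean over all pixels of -log softmax(logits)[true class].

    logits: (Y, X, N) class scores, labels: (Y, X) integer class ids.
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    picked = np.take_along_axis(log_probs, labels[..., None], axis=-1)
    return -picked.mean()


def cascade_loss(branch_logits, branch_labels, weights):
    """Weighted sum over branches t of the per-branch softmax loss."""
    return sum(w * softmax_cross_entropy(f, y)
               for f, y, w in zip(branch_logits, branch_labels, weights))


# Three branches at different resolutions, N = 4 classes, all-zero scores.
sizes = [(2, 2), (4, 4), (8, 8)]
logits = [np.zeros(s + (4,)) for s in sizes]
labels = [np.zeros(s, dtype=int) for s in sizes]
total = cascade_loss(logits, labels, weights=(0.4, 0.4, 1.0))
print(total)
```

With uniform scores every branch contributes $\log N$, so the total here is $1.8 \log 4$, which makes a convenient sanity check.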

# Model comparison and analysis

We now compare ICNet with existing architectures. The classic architectures are shown in (a), (b) and (c) above, and ICNet is shown in (d). Typical models of types (a) and (b) are UNet and SegNet, and typical models of type (c) are DeepLab and PSPNet. These traditional architectures are very large and also process high-resolution images. In ICNet, only the low-resolution image (the $\frac{1}{4}$ size image) is fed into the large network, which greatly reduces the amount of computation. The high-resolution original image is fed only into a lightweight network that helps recover the edge details. Thanks to this cascade design, ICNet achieves good real-time performance.

# Experiments

The authors carried out experiments on the Cityscapes, CamVid and COCO-Stuff datasets. The detailed results can be found in the original paper. ICNet achieves a good balance between accuracy and real-time performance, and its accuracy is not much lower than that of large networks. Its speed is second only to ENet, but the accuracy of ENet is much worse.

## My experiment

This experiment uses TensorFlow 2 on a Tesla P100 allocated by Google Colab. Two versions were tried. The first optimizes only the loss of the final output layer, without auxiliary losses, and gives the better result. The second adds the auxiliary losses (I am not sure whether my implementation of the auxiliary loss is correct).

For convenience, the PSPNet applied to the low-resolution image uses only VGG16 as its backbone. The medium-resolution image is also downsampled by VGG16, sharing weights with the low-resolution VGG16.

## Without auxiliary loss

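    import tensorflow as tf
    from tensorflow import keras
    from tensorflow.keras.layers import Input, Lambda, Conv2D, BatchNormalization, Activation, AveragePooling2D, UpSampling2D, Concatenate, Add
    # VGG16 up to block3_pool is the shared backbone; it downsamples by a factor of 8.
    base_model = keras.applications.VGG16(include_top=False)
    base_model.summary()
    layer = base_model.get_layer("block3_pool").output
    down_stack = keras.Model(inputs=base_model.input, outputs=layer)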

### PPM module in PSPNet

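    def PPM(x, f):
        # Pyramid pooling at three scales. The pool sizes assume a 10x10
        # input feature map (an 80x80 image after block3_pool).
        x_1 = AveragePooling2D((10, 10))(x)  # 1x1 output
        x_2 = AveragePooling2D((5, 5))(x)    # 2x2 output
        x_5 = AveragePooling2D((2, 2))(x)    # 5x5 output

        # Project every pooled map to f channels with a 1x1 convolution.
        x_1 = Conv2D(f, (1, 1))(x_1)
        x_1 = BatchNormalization()(x_1)
        x_1 = Activation('relu')(x_1)

        x_2 = Conv2D(f, (1, 1))(x_2)
        x_2 = BatchNormalization()(x_2)
        x_2 = Activation('relu')(x_2)

        x_5 = Conv2D(f, (1, 1))(x_5)
        x_5 = BatchNormalization()(x_5)
        x_5 = Activation('relu')(x_5)

        # Upsample back to the input size and concatenate with the input.
        x_1 = UpSampling2D((10, 10), interpolation='bilinear')(x_1)
        x_2 = UpSampling2D((5, 5), interpolation='bilinear')(x_2)
        x_5 = UpSampling2D((2, 2), interpolation='bilinear')(x_5)

        x = Concatenate()([x, x_1, x_2, x_5])
        return x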

### PSPNet

    def PSPnet(input_shape):
        x_input = Input(input_shape)

        # VGG16 backbone followed by the pyramid pooling module.
        x = down_stack(x_input)
        x = PPM(x, 128)
        x = BatchNormalization()(x)
        x = Activation('relu')(x)

        model = keras.Model(x_input, x)
        return model

### ICNet

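    def ICNet(input_shape, n_class):
        x_input = Input(input_shape)

        # 1/4-size image through PSPNet (the fixed shapes assume a 320x320 input)
        x_4 = Lambda(lambda x: tf.image.resize(x, size=(int(x.shape[1]) // 4, int(x.shape[2]) // 4)))(x_input)
        x_4 = PSPnet((80, 80, 3))(x_4)

        # 1/2-size image through the same VGG16 layers (shared weights)
        x_2 = Lambda(lambda x: tf.image.resize(x, size=(int(x.shape[1]) // 2, int(x.shape[2]) // 2)))(x_input)
        x_2 = down_stack(x_2)

        x_4 = UpSampling2D((2, 2), interpolation='bilinear')(x_4)

        # Branch model for the 1/4 image, used to compute the auxiliary loss
        x_4_ = Conv2D(n_class, (1, 1), activation='softmax')(x_4)
        model_16 = keras.Model(x_input, x_4_)

        # ASSUMPTION: the Conv2D and Add lines of this fusion were missing from
        # the source. They are reconstructed from the CFF description above;
        # the 128 filters and the dilation rate of 2 are guesses.
        x_4 = Conv2D(128, (3, 3), padding='same', dilation_rate=2)(x_4)
        x_4 = BatchNormalization()(x_4)
        x_2 = Conv2D(128, (1, 1))(x_2)
        x_2 = BatchNormalization()(x_2)
        x_2 = Add()([x_4, x_2])
        x_2 = Activation('relu')(x_2)

        x_2 = UpSampling2D((2, 2), interpolation='bilinear')(x_2)

        # Branch model for the 1/2 image, used to compute the auxiliary loss
        x_2_ = Conv2D(n_class, (1, 1), activation='softmax')(x_2)
        model_8 = keras.Model(x_input, x_2_)

        # ASSUMPTION: the definition of x was missing from the source. Here the
        # original image is downsampled 8 times by three stride-2 convolutions
        # (a guess at the "very simple CNN") and fused with x_2 as in CFF.
        x = x_input
        for filters in (32, 64, 128):
            x = Conv2D(filters, (3, 3), strides=2, padding='same', activation='relu')(x)
        x_2 = Conv2D(128, (3, 3), padding='same', dilation_rate=2)(x_2)
        x_2 = BatchNormalization()(x_2)
        x = Conv2D(128, (1, 1))(x)
        x = BatchNormalization()(x)
        x = Add()([x_2, x])
        x = Activation('relu')(x)

        x = UpSampling2D((2, 2), interpolation='bilinear')(x)
        x = BatchNormalization()(x)
        x = Activation('relu')(x)
        x_ = Conv2D(n_class, (1, 1), activation='softmax')(x)
        model_4 = keras.Model(x_input, x_)

        # Optional full-resolution head, left disabled in the original:
        # x = UpSampling2D((4,4),interpolation='bilinear')(x)
        # x = Conv2D(n_class,(1,1))(x)
        # x = Activation('softmax')(x)

        # model = keras.Model(x_input,x)
        return model_4, model_8, model_16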

### Training

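    model_4, model_8, model_16 = ICNet((320, 320, 3), 17)
    # The start of the compile call was lost in the source; the Adam optimizer is an assumption.
    model_4.compile(optimizer=keras.optimizers.Adam(),
                    loss=keras.losses.CategoricalCrossentropy(),
                    metrics=['accuracy'])
    # train_x, train_y_4_oh and tensorboard_callback are defined elsewhere.
    model_4.fit(train_x, train_y_4_oh, epochs=200, batch_size=16, callbacks=[tensorboard_callback])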

Only the final output model is trained here; the other branch models are not. The label set train_y_4_oh is obtained by interpolating the original label images down to $\frac{1}{4}$ size. The model outputs a prediction at $\frac{1}{4}$ of the original image size, and upsampling it by a factor of 4 gives the original size. This is very fast but loses some accuracy. The model can also be trained directly on labels at the original image size, which is slightly slower but more accurate.
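
The article does not show how the reduced label set is built, so the following is only one possible sketch. It assumes integer class-id label maps and uses nearest-neighbour downsampling (keeping every 4th pixel), which keeps every value a valid class id. The helper names `downsample_labels` and `one_hot` are hypothetical.

```python
import numpy as np


def downsample_labels(labels, factor):
    """Nearest-neighbour downsampling: keep every `factor`-th pixel."""
    return labels[::factor, ::factor]


def one_hot(labels, n_class):
    """Turn an (H, W) map of class ids into an (H, W, n_class) one-hot map."""
    return np.eye(n_class, dtype=np.float32)[labels]


# Toy 8x8 label map with 3 classes, standing in for one training label image.
labels = np.arange(64).reshape(8, 8) % 3
small = downsample_labels(labels, 4)
small_oh = one_hot(small, 3)
print(small.shape, small_oh.shape)
```

Bilinear interpolation is avoided here because averaging class ids would produce labels that correspond to no class.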

### Results

In this experiment the model outputs a prediction map at $\frac{1}{4}$ size, which is then upsampled by a factor of 4. The edges are rough and imperfect, but the positions and outlines of objects are accurate. For a more accurate prediction map, train directly on labels at the original image size.

## With auxiliary loss

The model is built the same way as in the previous part, including the ICNet part. For training, the fit method of the Keras model is no longer used; instead, a custom training loop in TensorFlow 2's eager mode is used.

### Training

    import numpy as np
    import matplotlib.pyplot as plt
    from IPython import display

    cross_entropy = tf.losses.CategoricalCrossentropy()
    optimizer = tf.keras.optimizers.Adam(learning_rate=0.0005)
    # The three models share layers, so collect every trainable variable once.
    train_vars = list({id(v): v for m in (model_4, model_8, model_16)
                       for v in m.trainable_variables}.values())

    @tf.function
    def train_step(img, labels_16, labels_8, labels_4):
        l1, l2, l3 = 0.4, 0.4, 1.0  # weights of the 1/16, 1/8 and 1/4 losses
        # The original step only evaluated the loss and never updated the
        # weights; the tape and the optimizer update below fix that.
        with tf.GradientTape() as tape:
            loss_16 = cross_entropy(labels_16, model_16(img, training=True))
            loss_8 = cross_entropy(labels_8, model_8(img, training=True))
            loss_4 = cross_entropy(labels_4, model_4(img, training=True))
            total_loss = l3 * loss_4 + l2 * loss_8 + l1 * loss_16
        grads = tape.gradient(total_loss, train_vars)
        optimizer.apply_gradients(zip(grads, train_vars))
        return total_loss

    def train(epochs, batch_size=16):
        # train_x, the train_y_* arrays and train_summary_writer are defined elsewhere.
        m = train_x.shape[0]
        for epoch in range(epochs):
            tloss, steps = 0.0, 0
            # Stepping by batch_size also covers the final, smaller batch
            # and never produces an empty one.
            for i in range(0, m, batch_size):
                s = slice(i, i + batch_size)
                tloss += train_step(train_x[s], train_y_16_oh[s], train_y_8_oh[s], train_y_4_oh[s])
                steps += 1
            mloss = tloss / steps
            with train_summary_writer.as_default():
                tf.summary.scalar('loss', mloss, step=epoch)
            display.clear_output(wait=True)
            print('Epoch ' + str(epoch + 1))
            print('loss:' + str(float(mloss)))
            # Predict the first training image and upsample to the original size.
            r = model_4(np.expand_dims(train_x[0], axis=0))
            r = UpSampling2D((4, 4), interpolation='bilinear')(r)
            plt.imshow(train_y[0])  # ground-truth label map
            plt.show()
            # ASSUMPTION: the line that displays the prediction was missing in the source.
            plt.imshow(np.argmax(r[0], axis=-1))
            plt.show()

    train(200)