Application of Automatic Network Search (NAS) in Semantic Segmentation (2)


Preface: This article describes how to search for a semantic segmentation model based on ProxylessNAS. The final searched architecture reaches 36 FPS in tests on a CPU, demonstrating the application of automatic network search (NAS) in semantic segmentation.

With the advent of neural architecture search, deep learning has gradually moved toward automatic design of network structures and hyper-parameter configurations. Especially in the context of deploying AI in production, many models need to run on mobile devices. Given the variety of target devices (GPU, CPU, dedicated chips, etc.) and of model requirements (latency, model size, FLOPs), using NAS to automatically search for the best network structure is a promising direction. The previous article introduced the basic framework of NAS, the entry-level reference DARTS [1], and its application to semantic segmentation. In just the past few months the number of NAS papers has grown significantly. On the theoretical side, the methods for search strategy and performance evaluation seem to have stabilized; worth mentioning is the recent RegNet [2] from the FAIR team, which studies the design of the search space itself and verifies the common model-design heuristics one by one through a large number of experiments. Following its conclusions, we can shrink the search space and thereby improve search efficiency. On the application side, most work focuses on object detection, with further efforts in segmentation, ReID, GANs and other fields.

NAS is a new technology, but semantic segmentation is an old story. Since the advent of FCN, simple and straightforward encoder-decoder structures such as SegNet and U-Net have achieved acceptable results on a wide variety of images, and after the DeepLab series, performance on open-source datasets reached its peak. From an academic point of view semantic segmentation seems to have hit a bottleneck, so researchers have turned to few-shot learning, semi-supervised learning, domain adaptation, point clouds and other directions to find another way. But deploying semantic segmentation in production is very difficult. In real deployment scenarios, many object detection tasks can be handled with a common backbone (the ResNet or YOLO series), but for segmentation the results are not as good:

  1. Due to lighting and other factors, the intensity distribution of real-scene images is complex, and segmentation must resolve object boundaries precisely, so per-pixel decisions matter. Compared with detection, however, segmentation has a much higher annotation cost, which means less training data, and the improvement obtainable from data augmentation alone is limited.
  2. Segmentation is a pixel-wise task. Because it processes every single pixel, the model is generally much larger than an object detection model (these models run both long and wide). If you need real-time inference (> 16 FPS), accuracy and speed inevitably conflict. Double kill!
  3. When semantic segmentation is applied to a video stream, the accuracy requirement is even higher. Even if consecutive frames differ by only a few pixels, and the mIoU barely changes, the result looks unstable to the human eye: the boundary appears to "jitter". Triple kill!
  4. When the segmentation model leaves the cloud and is deployed on a mobile device with limited compute, the underlying chip may not support many operations; a model that runs happily on a GPU can fall apart once it lands on a CPU. Quadra kill!

Deploying semantic segmentation means balancing model accuracy against speed, and designing such a network by hand is very difficult. I tried a series of small models such as BiSeNet [3], ShuffleNetV2 [4] and MobileNetV3 [5], but none met the accuracy and speed requirements at the same time. In the end, the hope rests on NAS automatically finding a qualifying model. The NAS-for-segmentation work described in the previous article is still exploratory: it runs on GPUs and tries to reduce FLOPs or parameter counts. But FLOPs and parameters are not positively correlated with inference speed, and merely shrinking them does not achieve real-time inference. The later FasterSeg [6] reports impressive speed, but it relies on TensorRT acceleration. This article attempts real-time human-figure segmentation on a CPU, choosing ProxylessNAS as the baseline for the structure search. The experimental results show that ProxylessNAS [7] stands the test; it is, as they say, the conscience of the industry.

1.Overview of ProxylessNAS

ProxylessNAS [7] was chosen not only because it comes from a well-known group, the code is open source, and its accuracy on the CIFAR-10 and ImageNet datasets stands out among NAS models, but also because it considered deployment-oriented model performance (speed, model size, parameter count) earlier than most. In addition, unlike the DAG cells searched by DARTS [1], the backbone of ProxylessNAS [7] uses a simple chain structure. This chain structure has a clear speed advantage over DAG cells because the connections between its operators are much simpler.

1.1 Super-net setting

We again use the basic framework of NAS to dissect ProxylessNAS [7].


Figure 1: NAS framework

  • Search space: the candidate operations are blocks from MobileNetV2 [8] with different kernel sizes (3, 5, 7) and different expansion rates (3, 6), plus identity and zero operations, for 8 ops in total (c.f. Figure 1). The macro structure of the network is a common chain structure for classification, and each layer chooses among the same 8 candidate ops (c.f. Figure 2). As mentioned earlier, overly complex connections between operators slow inference down; common small models all use this chain structure.
  • Search strategy: a differentiable search strategy, very common in the past two years. Although less stable than RL or evolutionary algorithms, it greatly improves search speed.
  • Performance evaluation: one-shot weight sharing, the most common form of super-net. For teams and individuals short on compute, this approach improves search efficiency and reduces memory consumption.
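
As a concrete illustration of the search space above, here is a minimal PyTorch sketch of one layer's 8 candidate ops (the block structure is simplified; stride-1 blocks with equal input/output channels are an assumption made for brevity):

```python
import torch
import torch.nn as nn

class MBConv(nn.Module):
    """Simplified MobileNetV2 inverted-residual block (stride 1, same channels)."""
    def __init__(self, c, kernel, expand):
        super().__init__()
        hidden = c * expand
        self.block = nn.Sequential(
            nn.Conv2d(c, hidden, 1, bias=False),                   # point-wise expand
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, kernel, padding=kernel // 2,
                      groups=hidden, bias=False),                  # depth-wise conv
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, c, 1, bias=False),                   # point-wise project
            nn.BatchNorm2d(c),
        )

    def forward(self, x):
        return x + self.block(x)          # residual connection

class Zero(nn.Module):
    """The 'zero' op: drops the layer entirely."""
    def forward(self, x):
        return torch.zeros_like(x)

def candidate_ops(c):
    """3 kernel sizes x 2 expansion rates + identity + zero = 8 candidates."""
    ops = [MBConv(c, k, e) for k in (3, 5, 7) for e in (3, 6)]
    return ops + [nn.Identity(), Zero()]
```

Because every candidate preserves the tensor shape, any one of them can be swapped into a layer of the chain without touching its neighbours, which is what makes the simple chain macro-structure so convenient to search.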

1.2 Super-net training

The super-net has two sets of parameters: the weights of the operations themselves and the architecture weight of each operation (denoted {alpha, beta, sigma, ...} in Figure 2). The training data is split into two parts: one is used to train the operation weights in the super-net, the other to update the architecture weights of the ops.

  • Training: at the beginning of each iteration, one operation per layer is activated at random (c.f. the binary gates in Figure 2). The activated operations connect to form a subnet, whose weights are updated by back-propagation. Inactive ops are not loaded into memory, i.e. only one subnet is in memory during training, which allows the whole search to run on a single GPU.
  • Searching: the architecture weight alpha of each operation represents its importance, i.e. the probability of being selected in the end: probability = softmax(alpha). In other words, searching is the process of continually updating alpha. As in training, each iteration randomly activates one subnet, but this time the operation weights are frozen and alpha is updated on that subnet by back-propagation. Eq. (4) of the paper gives the computation: since the binary gate is directly proportional to the probability, the derivative of the loss with respect to the probability is rewritten as the derivative with respect to the binary gate, which has already been computed and stored during back-propagation (this part of the paper is terse; please refer to the source code).
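
The binary-gate mechanism can be sketched as follows. This is a minimal sketch, not the authors' implementation: the unbiased two-path gradient estimator of Eq. (4) is replaced here by plain sampling from softmax(alpha) for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SearchableLayer(nn.Module):
    """One super-net layer: candidate ops plus architecture weights alpha."""
    def __init__(self, ops):
        super().__init__()
        self.ops = nn.ModuleList(ops)
        self.alpha = nn.Parameter(torch.zeros(len(ops)))  # architecture weights
        self.active = 0

    def sample(self):
        # the 'binary gate': activate exactly one op, drawn from softmax(alpha)
        probs = F.softmax(self.alpha, dim=0)
        self.active = int(torch.multinomial(probs, 1))
        return self.active

    def forward(self, x):
        # only the active op runs, so only one subnet occupies memory
        return self.ops[self.active](x)
```

In the "training" phase one would call `sample()` on every layer each iteration and back-propagate into the op weights with alpha frozen; in the "searching" phase the op weights are frozen and the gradient flows into alpha instead.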


Figure 2 illustrates the architecture of the super-net: the chain-structured searchable backbone (left) and one layer of the searchable backbone (right).

The ProxylessNAS procedure shown in Figure 2, then, trains the operation parameters while updating the architecture weights alpha, and finally selects, via softmax, the highest-probability operation in each layer. After reading the paper I did find many things worth borrowing, but also some open questions (c.f. Table 1).

Table 1 discusses the advantages and remaining issues of ProxylessNAS


2.Real-time Semantic Segmentation using ProxylessNAS on CPU

Although ProxylessNAS still leaves some questions open, its single-GPU search and training saves time and effort. With the help of Intel's OpenVINO inference framework, this article uses ProxylessNAS to search for a real-time semantic segmentation model that runs on an x86 CPU, targeting human-figure segmentation. The algorithmic changes and experimental results are described in detail below.

2.1 Super-net setting

  • Search space: when setting up the search space, in a brute-force spirit I stuffed in all the commonly used operations, namely mbv3 (3×3), mbv3 (5×5), dilconv (3×3), dilconv (5×5), sepconv (3×3), sepconv (5×5) and the shuffle block, 7 ops in total. Here mbv3 is the basic block of MobileNetV3 [5], dilconv and sepconv are the dilated separable and depthwise separable convolutions from DARTS [1], and the shuffle block is the basic block of ShuffleNetV2 [4]; the first three operation types each come in two kernel sizes. The macro network structure follows DeepLabv3+ [9] (c.f. Figure 3): head + searchable backbone + ASPP + decoder. As in U-Net, encoder feature maps are passed to the decoder, but here they are "added" rather than "concatenated", to keep the model from becoming too "wide" and slow. S2, S4, S8, S16 and S32 denote feature maps downsampled by factors of 2, 4, 8, 16 and 32 respectively. As in ProxylessNAS, the super-net has two sets of parameters: the operation weights themselves and the architecture weight alpha of each operation.
  • Search strategy: the differentiable approach, continued from ProxylessNAS.
  • Performance evaluation: one-shot weight sharing.
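
To make the macro structure concrete, here is a toy forward-pass sketch of the head + backbone + ASPP + decoder layout with additive skips. All channel counts and strides are assumptions, the ASPP is a placeholder, and the plain convolutions stand in for the searchable stages:

```python
import torch
import torch.nn as nn

class MacroNet(nn.Module):
    """Sketch of the segmentation macro structure: additive skips, not concat."""
    def __init__(self, c=16):
        super().__init__()
        self.head = nn.Conv2d(3, c, 3, stride=2, padding=1)    # -> S2
        self.stage4 = nn.Conv2d(c, c, 3, stride=2, padding=1)  # -> S4 (searchable)
        self.stage8 = nn.Conv2d(c, c, 3, stride=2, padding=1)  # -> S8 (searchable)
        self.aspp = nn.Conv2d(c, c, 1)                         # placeholder for ASPP
        self.up = nn.Upsample(scale_factor=2, mode='bilinear',
                              align_corners=False)
        self.out = nn.Conv2d(c, 2, 1)                          # person / background

    def forward(self, x):
        s2 = self.head(x)
        s4 = self.stage4(s2)
        s8 = self.stage8(s4)
        d = self.up(self.aspp(s8)) + s4   # 'add' the encoder feature, not concat,
        d = self.up(d) + s2               #   keeping the decoder narrow and fast
        return self.out(self.up(d))
```

Addition requires the encoder and decoder feature maps to share channel counts, which is exactly the constraint that keeps the decoder narrow; concatenation would double the width at every skip.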


Figure 3 illustrates the macro-architecture of our super-net (top) and the searchable backbone (bottom).

2.2 Improvements over ProxylessNAS

  • Decoupling the training and searching processes: in ProxylessNAS, "training" and "searching" proceed in turn within the same run, i.e. searching while training. In my experiments I separated them completely: first 50 epochs updating only the operation parameters in the super-net, then updating the architecture weights alpha once training is done. The reason is to avoid alpha values computed while the operation parameters are still unstable having an outsized influence on later decisions.
  • Treating latency as a hard constraint: because the inference speed of a model matters and cannot be estimated by simply summing per-op latencies, the inference speed of each randomly activated subnet is measured directly. If it misses the requirement (e.g. latency > 30 ms), another subnet is sampled. This way, operations that are too slow are to some extent screened out of selection and learning.
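
The hard latency constraint can be sketched as a rejection-sampling loop. The 30 ms budget matches the text; the helper names and measurement details are illustrative:

```python
import time
import torch

@torch.no_grad()
def latency_ms(net, size=(1, 3, 224, 224), runs=10):
    """Rough wall-clock CPU latency of a subnet. Measured end to end, not
    summed per op, because per-op latencies do not simply add up on hardware."""
    x = torch.randn(size)
    net(x)                                    # warm-up pass
    t0 = time.perf_counter()
    for _ in range(runs):
        net(x)
    return (time.perf_counter() - t0) / runs * 1000

def sample_within_budget(sample_subnet, budget_ms=30.0, max_tries=50):
    """Resample random subnets until one meets the hard latency budget."""
    for _ in range(max_tries):
        subnet = sample_subnet()
        ms = latency_ms(subnet)
        if ms <= budget_ms:
            return subnet, ms
    raise RuntimeError("no subnet met the latency budget")
```

In practice one would measure on (or calibrate against) the actual deployment CPU, since latency rankings on a development GPU box can differ from the target hardware.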

2.3 Experiments

Experiment setting:

  • Task: real-time portrait segmentation on CPU (x86)
  • DL platform: Intel OpenVINO
  • Dataset: >20K images, partly from the COCO / Pascal datasets ("person" category) and partly private data
  • Data augmentation: random crop, cutout, random brightness/contrast adjustment, random Gaussian blur/sharpen
  • Search time: 2 GPU-days on a single card (K80), including training and searching

Experimental results:

With the same overall network structure, we use MobileNetV3 [5] as the backbone for comparison. The comparison results are shown in Table 2.

Table 2 illustrates the experimental results


From the experimental data, MobileNetV3 [5] has roughly half the parameters and FLOPs of our searched backbone, yet the inference speed on a K80 is very similar while the accuracy (mIoU) differs considerably. Weighing accuracy against speed, the backbone searched by ProxylessNAS [7] is clearly better than MobileNetV3 [5]. The qualitative results in Figure 4 show that on images with more complex content, MobileNetV3 [5] degrades much more.


Figure 4 compares the segmentation results of our searched network and MobileNetV3.

After converting the model to an OpenVINO-supported format and deploying it on a CPU (Intel Core i7-8700), the running speed is about 27 ms per frame (FPS = 36). The results are shown in Figure 5.


Figure 5 shows the segmentation results in real application scenario

Finally, time to show the searched backbone; it looks like this (c.f. Figure 6).


Figure 6 illustrates the searched backbone structure

3.Future work

The experiments show that the ProxylessNAS search strategy transfers from classification to segmentation: at similar speed, the searched network is considerably more accurate than the original MobileNetV3 [5]. This is limited to the present scenario, however, and does not mean hand-designed models are bad or about to be replaced (although MobileNetV3 was itself found by NAS). For specific scenarios with specific requirements, designing the network structure with NAS is indeed more efficient than manual design plus extensive tuning experiments, and it has real promise for production AI. This article is only a first exploration of ProxylessNAS; the following directions remain to be explored.

  • The experimental results suggest that weight sharing in the super-net is reasonable. However, picking the highest-probability operation of each layer as the final architecture is questionable: since subnets are coupled during search and training, the operations of a layer rise and fall together, and the per-layer winners, once combined, may not satisfy the preset hard constraint. There is room for improvement here; for example, one could score the sub-paths formed by the operations of two adjacent layers instead of scoring each layer's operations independently.
  • ProxylessNAS is early work from Song Han's team at MIT, and the follow-up OFA (Once-for-All) has since been published (also read with great admiration). In OFA the authors completely separate training and searching and combine them with knowledge distillation: a teacher model is trained first, and NAS then searches for the best student model inside the teacher. OFA can be understood as automatic network pruning or automatic distillation. If the OFA experiments go well, a follow-up post will share practical experience with it.
  • In the live results of Figure 5, the blend of portrait and background looks fairly natural, but semantic segmentation is ultimately a classification task: an edge pixel is either foreground or background. To blend naturally with the background, the transparency (alpha matte) of the foreground must be estimated. This involves another technique, background matting, which works best combined with segmentation. In fact, the lower image in Figure 5 shows that segmentation fails to carve out the hair, yet the hair survives in the result, which is exactly why background matting is used. Besides refining segmentation results, matting also enables background replacement (c.f. Figure 7), PS-style editing and other features.

The next article will cover practical experience with background matting. Stay tuned.


Figure 7 shows the demo of background matting


[1] Liu, Hanxiao, Karen Simonyan, and Yiming Yang. “Darts: Differentiable architecture search.” ICLR (2019).

[2] Radosavovic, Ilija, et al. “Designing Network Design Spaces.” arXiv preprint arXiv:2003.13678 (2020).

[3] Yu, Changqian, et al. “Bisenet: Bilateral segmentation network for real-time semantic segmentation.” Proceedings of the European conference on computer vision (ECCV). 2018.

[4] Zhang, Xiangyu, et al. “Shufflenet: An extremely efficient convolutional neural network for mobile devices.” Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR). 2018.

[5] Howard, Andrew, et al. “Searching for mobilenetv3.” Proceedings of the IEEE International Conference on Computer Vision (ICCV). 2019.

[6] Chen, Wuyang, et al. “FasterSeg: Searching for Faster Real-time Semantic Segmentation.” ICLR (2020).

[7] Cai, Han, Ligeng Zhu, and Song Han. “Proxylessnas: Direct neural architecture search on target task and hardware.” ICLR (2019).

[8] Sandler, Mark, et al. “Mobilenetv2: Inverted residuals and linear bottlenecks.” Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR). 2018.

[9] Chen, Liang-Chieh, et al. “Encoder-decoder with atrous separable convolution for semantic image segmentation.” Proceedings of the European conference on computer vision (ECCV). 2018.

