Image segmentation is a fundamental technology in computer vision and a key component of image understanding. It is the process of subdividing a digital image into multiple sub-regions; by simplifying or changing the representation of the image, it makes the image easier to understand. Put simply, image segmentation assigns a label to every pixel in a digital image, so that pixels with the same label share certain visual characteristics.
Image segmentation has been studied since the birth of digital image processing in the 1960s, and with the rise of deep learning in recent years it has advanced greatly. Early segmentation algorithms could not segment objects with abstract semantics, such as text, animals, pedestrians, and vehicles. This is because they relied on raw pixel values or low-level features such as edges and textures, which hand-designed descriptors struggle to characterize accurately. This classic problem is known as the “semantic gap”.
Thanks to deep learning’s ability to learn features automatically, the third generation of image segmentation largely avoids the “semantic gap” introduced by hand-designed features. Segmentation has progressed from relying only on pixel values and low-level features to fulfilling requirements based on high-level semantics.
Gaode Map holds big data in images and video, and many of its business scenarios require understanding image content. For example, automated data production often needs to locate targets such as text, road surfaces, houses, bridges, signboards, and pavement markings. Some of these images are captured by collection vehicles or satellites, and some by users’ mobile phones, as shown in the following figure:
Facing these images with complex semantics and huge variation in content, how does Gaode Map understand them through image segmentation? This article describes how image segmentation at Gaode Map has gradually grown from a “means” of solving small problems into a powerful technical pillar of a highly automated data production pipeline.
2. Exploration Period: Some Early Attempts
In street-level data collection, we need to automatically produce POI (point of interest) data from storefronts, residential communities, and so on. We use an OCR algorithm to recognize the characters, but the question remains of how many POIs the collected image contains. For example, with the two adjacent shops “Leading Beauty” and “Swallow Children’s Wear” below, the human eye can easily tell them apart, but a machine cannot; simple strategies, such as comparing background color, are prone to many errors.
For cases where two signboard styles are very similar, we used the unsupervised gPb-owt-ucm algorithm [1], which combines multi-level contour detection with an improved watershed transform, to segment the image into multiple regions, and used the text detection results of a cascade-boosting detector to identify the regions containing text.
3. Growth Period: Semantic Segmentation in Natural Scenes
FCN (Fully Convolutional Networks) [2], which appeared at the end of 2014, is undoubtedly another milestone of deep learning, following AlexNet’s win of the ImageNet competition in 2012. FCN provides the first end-to-end deep learning solution for image segmentation: built on CNNs, it classifies pixel by pixel on inputs of any size. We promptly applied it to Gaode’s own scenarios, such as text region segmentation, where complex backgrounds and illumination and the varied orientations and fonts of characters in natural scenes make manual feature engineering very difficult.
Before long, we found that FCN could not fully meet our needs. Although FCN addresses the semantic gap, it generally gives only a “rough” region segmentation: it cannot produce good instance segmentation, nor solve problems such as false alarms, target adhesion, multi-scale targets, and edge accuracy. A typical example is text region segmentation, where closely spaced text regions easily stick together, causing count errors when tallying the number of text lines in an image.
We therefore proposed a multi-task network implementing our own instance segmentation framework. To solve target adhesion, we added a segmentation task to the original network whose goal is to segment the “central axis” of each text line, and then split stuck-together text-line regions along those axes. The splitting method is similar in spirit to Dijkstra’s algorithm: it computes the distance from each text-region pixel to each candidate central axis and assigns the pixel to the nearest one.
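The axis-based splitting step can be sketched in a few lines of NumPy (an illustrative toy, not the production implementation; `split_by_axes` is a hypothetical helper name): each foreground pixel simply inherits the label of its nearest central-axis pixel.

```python
import numpy as np

def split_by_axes(text_mask, axis_labels):
    """Assign every foreground pixel of `text_mask` to the nearest
    labelled central-axis pixel (Euclidean distance), splitting a
    stuck-together region into per-line instances."""
    out = np.zeros_like(axis_labels)
    ys, xs = np.nonzero(text_mask)        # foreground (text) pixels
    ays, axs = np.nonzero(axis_labels)    # central-axis pixels
    for y, x in zip(ys, xs):
        d2 = (ays - y) ** 2 + (axs - x) ** 2   # squared distances to all axis pixels
        j = int(np.argmin(d2))
        out[y, x] = axis_labels[ays[j], axs[j]]  # inherit the nearest axis label
    return out
```

A brute-force nearest-neighbour search like this is quadratic; a production version would use a distance transform or a Dijkstra-style region growing from the axes, but the assignment rule is the same.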
Another problem is false alarms in FCN results, that is, non-text areas segmented as text. Although FCN produces far fewer false alarms than traditional methods, to reach better segmentation precision we added a parallel R-CNN sub-network to the original network for text detection, and used its detection results to suppress false alarms.
To achieve better end-to-end learning, we designed a consistency loss function to ensure that the segmentation and detection sub-networks sharing the backbone guide and optimize each other. The energy map of the optimized segmentation network shows that the probability of false alarms is significantly reduced. For details, please refer to our article published on arXiv in 2017 [3].
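As a rough illustration of the idea (this is not the exact formulation from the paper), a consistency-style penalty can be written as the segmentation probability mass that falls outside every detected text box, so that segmentation responses unsupported by detection are pushed down:

```python
import numpy as np

def consistency_loss(seg_prob, det_boxes):
    """Illustrative consistency term: average text probability that
    falls outside all detected boxes (given as (x0, y0, x1, y1)).
    Minimizing it suppresses segmentation false alarms."""
    outside = np.ones_like(seg_prob, dtype=bool)
    for x0, y0, x1, y1 in det_boxes:
        outside[y0:y1, x0:x1] = False      # pixels covered by a detection
    return float(seg_prob[outside].sum() / max(int(outside.sum()), 1))
```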
4. Maturity: Refinement and Instantiation of Segmentation
Thanks to the Mask R-CNN framework [4], instance segmentation has become much easier. Take the signboard segmentation mentioned earlier: signboard regions are also prone to adhesion, their styles are diverse, and there is no obvious “central axis” as with text lines. Object detection can extract a signboard’s bounding rectangle, but signboards shot in natural scenes often appear at non-frontal viewing angles, so they are not rectangular in the image, and the usual detection algorithms yield inaccurate edge estimates. By integrating detection and segmentation branches, Mask R-CNN implements a general instance segmentation framework: the detection branch extracts target regions via an RPN and classifies them, instantiating each target; segmentation within those regions then extracts accurate edges.
More complex scene understanding requirements also demand higher segmentation precision, mainly in two respects: (1) edge accuracy, and (2) the ability to recall targets at different scales.
In high-precision map data production, the road surface in the image must be segmented. High-precision maps, however, require centimeter-level accuracy, which translates to an error budget of only 1-2 pixels in the image. Inspecting the original segmentation results, it is easy to see that inaccurate segmentation generally appears at region edges, while region interiors are easy to learn.
We therefore designed a special loss function that deliberately amplifies the penalty for errors in the ground-truth edge area, strengthening what the network learns at the edges. As shown in the figure, the left side is the segmentation of the drivable road area, and the right side is the segmentation of the road and ground markings.
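A minimal NumPy sketch of such an edge-weighted loss (assuming binary masks; the 5x boundary weight and the shift-based edge detection are illustrative choices, not the production loss):

```python
import numpy as np

def edge_weighted_bce(pred, gt, edge_weight=5.0, eps=1e-7):
    """Pixel-wise binary cross-entropy where pixels on the boundary of
    the ground-truth mask are penalized `edge_weight` times as much.
    Boundary pixels are found by comparing the mask with its four
    one-pixel shifts (np.roll wraps around; fine for a sketch)."""
    edge = np.zeros(gt.shape, dtype=bool)
    for axis in (0, 1):
        for shift in (1, -1):
            edge |= (np.roll(gt, shift, axis=axis) != gt)
    w = np.where(edge, edge_weight, 1.0)          # boost boundary penalty
    p = np.clip(pred, eps, 1 - eps)
    bce = -(gt * np.log(p) + (1 - gt) * np.log(1 - p))
    return float((w * bce).mean())
```

With this weighting, the same prediction error costs more on the boundary than in the interior, so gradient descent spends its capacity where the 1-2 pixel budget matters.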
There are many kinds of objects to understand in a road scene. They differ in physical size, and changes in depth of field mean their scales in the image differ as well. Some targets, such as light poles and lane lines, are “slender”: long in one image dimension but only a few pixels wide. These characteristics make fine-grained image segmentation difficult.
First, because of the limits of the network’s receptive field, it is hard to accurately segment very large and very small objects at the same time, such as the road surface versus lamp posts in road scenes, or roads versus building complexes in satellite imagery. Network structures such as PSPNet [5], DeepLab [6], and FPN [7] address this problem to varying degrees.
Second, because target scales differ, the per-class sample counts seen by a segmentation network are highly imbalanced (each pixel can be regarded as one sample). We migrated focal loss [8], originally proposed for object detection, to the segmentation network. Focal loss concentrates the error on hard training examples, which lets small-scale targets that are difficult to learn be segmented more accurately.
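A per-pixel NumPy version of focal loss (following the formulation of Lin et al., with the usual α and γ hyper-parameters) shows the mechanism: the (1 − p_t)^γ factor shrinks the loss of easy, well-classified pixels.

```python
import numpy as np

def focal_loss(pred, gt, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss per pixel.  `pred` is the predicted
    foreground probability, `gt` the 0/1 ground truth.  The
    (1 - p_t)**gamma factor down-weights easy pixels so training
    focuses on hard ones (e.g. thin or small objects)."""
    p = np.clip(pred, eps, 1 - eps)
    pt = np.where(gt == 1, p, 1 - p)          # probability of the true class
    at = np.where(gt == 1, alpha, 1 - alpha)  # class-balancing weight
    return float((-at * (1 - pt) ** gamma * np.log(pt)).mean())
```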
5. Future Prospects
Image segmentation is currently developing toward ever higher accuracy; for example, Mask Scoring R-CNN [9] and Hybrid Task Cascade [10] continuously improve segmentation accuracy on top of Mask R-CNN. From an application perspective, however, deep-learning-based segmentation remains “heavyweight” compared with an equivalent classification task.
Because of its accuracy requirements, a segmentation task cannot compress the input image to a small size the way a classification task can, which multiplies the amount of computation and makes real-time segmentation hard to guarantee. Network structures such as ICNet and other mobile-oriented architectures reduce computation in the early convolution stages via fast downsampling, but at a cost in accuracy. Training based on knowledge distillation looks like a better optimization scheme: a small network trained with distillation outperforms the same network trained alone. Distillation also avoids the experience and skill that network pruning requires, directly using a cheaper small network to complete complex tasks previously achievable only with large networks.
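The distillation idea can be sketched as a combined loss (a Hinton-style formulation in NumPy for illustration; the temperature `T` and mixing weight `w` are hypothetical choices, not values from our production training):

```python
import numpy as np

def softmax(z, T=1.0):
    """Numerically stable softmax with temperature T."""
    e = np.exp((z - z.max(axis=-1, keepdims=True)) / T)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, labels, T=4.0, w=0.5):
    """Hinton-style distillation: cross-entropy against the hard labels
    plus KL divergence to the teacher's temperature-softened outputs
    (scaled by T^2 so gradients keep the same magnitude)."""
    ps = softmax(student_logits, T)
    pt = softmax(teacher_logits, T)
    kl = (pt * (np.log(pt) - np.log(ps))).sum(axis=-1).mean() * T * T
    hard = softmax(student_logits)                       # T = 1 for hard labels
    ce = -np.log(hard[np.arange(len(labels)), labels]).mean()
    return float(w * ce + (1 - w) * kl)
```

In segmentation the same loss is applied per pixel; the small network learns not just the hard labels but the large network's soft class boundaries.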
For Gaode Map, image segmentation has become an indispensable basic technology, widely used across automated data production lines to support highly automated data production. In the future, we will continue to build more accurate and more lightweight image segmentation solutions for map application scenarios.
References

[1] Arbelaez, Pablo, et al. “Contour detection and hierarchical image segmentation.” IEEE Transactions on Pattern Analysis and Machine Intelligence 33.5 (2010): 898-916.
[2] Long, Jonathan, Evan Shelhamer, and Trevor Darrell. “Fully convolutional networks for semantic segmentation.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
[3] Jiang, Fan, Zhihui Hao, and Xinran Liu. “Deep scene text detection with connected component proposals.” arXiv preprint arXiv:1708.05133 (2017).
[4] He, Kaiming, et al. “Mask R-CNN.” Proceedings of the IEEE International Conference on Computer Vision. 2017.
[5] Zhao, Hengshuang, et al. “Pyramid scene parsing network.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
[6] Chen, Liang-Chieh, et al. “DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs.” IEEE Transactions on Pattern Analysis and Machine Intelligence 40.4 (2017): 834-848.
[7] Lin, Tsung-Yi, et al. “Feature pyramid networks for object detection.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
[8] Lin, Tsung-Yi, et al. “Focal loss for dense object detection.” Proceedings of the IEEE International Conference on Computer Vision. 2017.
[9] Huang, Zhaojin, et al. “Mask scoring R-CNN.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.
[10] Chen, Kai, et al. “Hybrid task cascade for instance segmentation.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.
Author: a Gaode Map engineer
This article is Alibaba Cloud original content and may not be reproduced without permission.