Ali Mei’s introduction: We spend more and more time on our phones, and video takes up much of our attention. Many interesting video effects depend on segmenting objects out of the footage, a step that can consume more than 99% of a creator’s time. Today, Ren Haibing, a senior algorithm expert at Alibaba, introduces the three research directions of video object segmentation and Alibaba’s latest applications, in the hope of helping everyone who enjoys video creation.
Video object segmentation (VOS), as the name implies, means segmenting the complete regions of the objects of interest from every frame of a video.
Video object segmentation supplies important raw material for content re-creation. For example, the currently popular “naked-eye 3D video” uses changes in occlusion, based on the distance between the main objects in the video and the viewer, to produce a 3D effect. The key step is separating the foreground objects from the video, which takes more than 99% of the creator’s time.
Therefore, for video websites like Youku, video object segmentation is a very valuable algorithm that can empower content producers and improve production efficiency. In particular, an interactive video object segmentation algorithm can use a small amount of user interaction to gradually improve segmentation accuracy and enhance the user experience. No unsupervised video object segmentation algorithm can achieve this.
At present, the research on video object segmentation in CV academia is mainly divided into three directions:
- Semi-supervised video object segmentation
- Interactive video object segmentation
- Unsupervised video object segmentation
These three directions correspond to the three tracks of the 2019 DAVIS Challenge on Video Object Segmentation. Academia leans toward semi-supervised video object segmentation, because it is the most basic form of the problem and a relatively pure research topic. Below, I first introduce the three research directions, and then share the latest applications in the video field through the work of the Moku Laboratory of Ali Culture and Entertainment.
I. Semi-supervised Video Object Segmentation
Semi-supervised video object segmentation is also known as one-shot video object segmentation (OSVOS). Given the segmentation region of the object of interest in the first frame of a video, the algorithm must produce the segmentation region of that object in the subsequent frames. There can be one object or several. Over the course of a video, the object and background move, the illumination changes, and the object rotates or becomes occluded. The focus of a semi-supervised video object segmentation algorithm is therefore how to adaptively capture the changing appearance of the object. An example is shown in the following figure:
In Figure 1, the first row shows the RGB images of the sequence and the second row shows the region of the object of interest. Panel (a) is the first frame of the video, where the camel region is the given ground truth of the object. Panels (b), (c) and (d) are frames 20, 40 and 60. For these subsequent frames only the RGB images are available, and the algorithm must estimate the object region. The difficulties of this example are:
- The foreground and background colors are very similar.
- As the target camel moves, a second camel appears in the background, and the two camels must be segmented as separate regions.
At present, semi-supervised video object segmentation algorithms are divided into two categories: online learning and non-online learning.
Online learning algorithms fine-tune the segmentation model on the ground truth of the object in the first frame, following a one-shot learning strategy. Classical online learning algorithms include Lucid data dreaming, OSVOS, and PReMVOS. An online learning algorithm trains the model separately for each object and achieves high segmentation accuracy, but the online learning step is itself a fine-tuning of a deep learning model and requires a lot of computing time. Before 2019, online learning algorithms were the mainstream. This year, many non-online learning algorithms have emerged. Their models are trained in advance and need no per-sample fine-tuning, so they run faster. Examples include FEELVOS from CVPR 2019 and the space-time memory network.
The main evaluation metrics for semi-supervised video object segmentation are the mean Jaccard index (J) and the F-measure (F). The mean Jaccard is the region segmentation accuracy (intersection over union) averaged over all objects and all frames. The F-measure evaluates the accuracy of the boundary of the segmented region. Semi-supervised video object segmentation cannot be applied directly in practice because it requires the ground truth of the object region in the first frame, but it is the core component of interactive and unsupervised video object segmentation algorithms.
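To make the region metric concrete, here is a minimal sketch of the Jaccard index for binary masks, assuming masks are NumPy boolean arrays of the same shape. The function names `jaccard` and `mean_jaccard` are illustrative and not taken from any benchmark toolkit, and the handling of two empty masks is an assumption of this sketch. The boundary F-measure is not implemented here.

```python
import numpy as np

def jaccard(pred_mask, gt_mask):
    """Region similarity J: intersection over union of two binary masks."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Both masks are empty: treated as a perfect match (assumption).
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)

def mean_jaccard(pred_masks, gt_masks):
    """Average J over every (object, frame) mask pair supplied."""
    scores = [jaccard(p, g) for p, g in zip(pred_masks, gt_masks)]
    return sum(scores) / len(scores)

# Toy example: a 2x2 ground-truth square and a 2x3 prediction covering it.
gt = np.zeros((4, 4), dtype=bool)
gt[1:3, 1:3] = True      # 4 object pixels
pred = np.zeros((4, 4), dtype=bool)
pred[1:3, 1:4] = True    # 6 predicted pixels, 4 of them correct

print(jaccard(pred, gt))                    # intersection 4, union 6
print(mean_jaccard([pred, gt], [gt, gt]))   # average of two mask pairs
```

In this toy case the intersection is 4 pixels and the union is 6, so J is 4/6. Averaging such per-mask values over all objects and frames gives the mean Jaccard described above.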
II. Interactive Video Object Segmentation
Interactive video object segmentation is a more practical approach to video object segmentation, and it emerged as a research direction only last year. Here the input is not the ground truth of the object in the first frame, but user interaction that marks the object on any frame of the video. The interaction can be a bounding box, a scribble over the object area, extreme points on the object’s outer boundary, and so on.
The basic process is shown in the following figure:
Interactive video object segmentation usually includes the following five steps:
1) The user provides interaction that marks the object of interest, such as a bounding box, a scribble, or edge points.
2) Based on this interaction, an interactive image object segmentation algorithm segments the object region on that frame.
3) Starting from the object region on that frame, a semi-supervised video object segmentation algorithm propagates the object to the other frames one by one, yielding the object region on every frame. The user then checks the result and provides new interaction on a frame where the segmentation is poor.
4) Based on the new interaction, the algorithm corrects the segmentation result on that frame.
5) Steps 3 and 4 are repeated until the user is satisfied with the video object segmentation result.
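The five steps above can be sketched as a simple control loop. This is only an outline under stated assumptions: the four callables (`get_interaction`, `segment_frame`, `propagate`, `is_satisfied`) are hypothetical placeholders for the user interface and the underlying models, and `max_rounds` is an assumed cap on the number of interactions. It is not the pipeline of any specific system.

```python
def interactive_vos(frames, get_interaction, segment_frame, propagate,
                    is_satisfied, max_rounds=8):
    """Run the interaction / segmentation / propagation loop.

    All four callables are hypothetical placeholders:
      get_interaction(frames, masks) -> (frame_index, interaction)
      segment_frame(frame, interaction, old_mask) -> mask
      propagate(frames, masks, start_index) -> list of masks
      is_satisfied(masks) -> bool
    """
    masks = [None] * len(frames)
    for _ in range(max_rounds):
        # The user marks the object (first round) or corrects a poor frame.
        idx, interaction = get_interaction(frames, masks)
        # Interactive image segmentation on the chosen frame.
        masks[idx] = segment_frame(frames[idx], interaction, masks[idx])
        # Semi-supervised propagation from that frame to all other frames.
        masks = propagate(frames, masks, idx)
        # Stop as soon as the user accepts the result.
        if is_satisfied(masks):
            break
    return masks

# Toy stand-ins so the loop can be executed end to end.
frames = ["frame0", "frame1", "frame2"]
result = interactive_vos(
    frames,
    get_interaction=lambda fr, m: (0, "scribble"),
    segment_frame=lambda frame, inter, old: "mask:" + frame,
    propagate=lambda fr, m, i: [m[i]] * len(fr),
    is_satisfied=lambda m: all(x is not None for x in m),
)
print(result)
```

Keeping the models behind plain callables makes it easy to swap in a faster propagation method, which matters because the user is waiting between rounds.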
Interactive video object segmentation is not a single algorithm but an integration of several solutions, including interactive image object segmentation, semi-supervised video object segmentation, and interactive propagation of the object region across frames. The main evaluation metrics are Jaccard&F-measure@60s (J&F@60s) and Area Under Curve (AUC), both proposed by the DAVIS Challenge on Video Object Segmentation. The DAVIS challenge allows at most eight user interactions and builds a curve of accuracy over time. The area under this curve is the AUC, and the value interpolated on the curve at t = 60s is J&F@60s. The following figure is a J&F curve over time.
The evaluation metrics show that interactive video object segmentation emphasizes the speed of the segmentation algorithm, since users cannot be kept waiting for long. For this reason, semi-supervised video object segmentation algorithms based on online learning are generally not used in interactive video object segmentation. At present, there is no open source code for interactive video object segmentation. Nevertheless, interactive video object segmentation is of great significance to the industry, for the following reasons:
1) Semi-supervised video object segmentation requires the ground truth of the object in the first frame, which is difficult to obtain in practice. Interactive video object segmentation needs only simple user interaction, which is very easy to provide.
2) Interactive video object segmentation can reach very high segmentation accuracy through multiple interactions. High-precision segmentation provides a better user experience and is the result that users actually need.
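The two metrics can be illustrated with a short sketch. It assumes the curve is given as strictly increasing timestamps (in seconds) with one J&F value per timestamp, uses linear interpolation, and normalizes the area by the time span. The sample numbers are invented for illustration, and the official DAVIS evaluation may differ in its details.

```python
def interpolate(times, scores, t):
    """Linearly interpolate the J&F curve at time t (clamped at both ends)."""
    if t <= times[0]:
        return scores[0]
    if t >= times[-1]:
        return scores[-1]
    for k in range(len(times) - 1):
        t0, t1 = times[k], times[k + 1]
        if t0 <= t <= t1:
            s0, s1 = scores[k], scores[k + 1]
            return s0 + (s1 - s0) * (t - t0) / (t1 - t0)

def area_under_curve(times, scores, normalize=True):
    """Trapezoid-rule area under the J&F curve.

    Assumes strictly increasing timestamps. With normalize=True the area is
    divided by the time span (an assumption of this sketch), so the result
    stays in the same 0..1 range as J&F.
    """
    area = 0.0
    for k in range(len(times) - 1):
        width = times[k + 1] - times[k]
        area += width * (scores[k] + scores[k + 1]) / 2.0
    return area / (times[-1] - times[0]) if normalize else area

# Invented example: J&F measured after successive interactions.
times = [0.0, 20.0, 40.0, 80.0]
scores = [0.0, 0.5, 0.7, 0.8]

jf_at_60 = interpolate(times, scores, 60.0)
auc = area_under_curve(times, scores)
print(jf_at_60, auc)
```

For this invented curve, t = 60s falls halfway between the samples at 40s and 80s, so the interpolated value lies halfway between 0.7 and 0.8.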
III. Unsupervised Video Object Segmentation
Unsupervised video object segmentation is fully automatic video object segmentation with no input other than the RGB video. The aim is to segment the salient object regions in the video. Of the three directions above, unsupervised video object segmentation is the newest.
This year, the DAVIS and YouTube-VOS challenges included an unsupervised track for the first time. At the algorithm level, unsupervised video object segmentation needs an additional salient object detection module, while the other core algorithms remain unchanged.
In semi-supervised and interactive video object segmentation, the objects are specified beforehand and there is no ambiguity. In unsupervised video object segmentation, object saliency is a subjective concept, and different people may disagree. Therefore, in DAVIS VOS, participants are required to submit a total of N object segmentation results (N = 20 in DAVIS Unsupervised VOS 2019), which are then matched against the L salient object sequences annotated in the ground truth. Both matched objects and missed objects are included when computing the mean J&F. Superfluous objects among the N results are not penalized.
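One way to read this scoring rule is as a one-to-one assignment between ground-truth objects and proposals that maximizes the total J&F. The sketch below is an interpretation under that assumption, not the official evaluation code. The function name `unsupervised_score` and the score values are hypothetical, and the brute-force search is only suitable for tiny examples.

```python
from itertools import permutations

def unsupervised_score(scores):
    """Mean J&F over ground-truth objects under the best one-to-one matching.

    scores[i][j] is the J&F between ground-truth object i and proposal j.
    Ground-truth objects left without a real proposal count as 0 (missed),
    and proposals that are not matched are simply ignored (no penalty).
    """
    num_gt = len(scores)
    if num_gt == 0:
        return 0.0
    num_prop = len(scores[0])
    # Pad with zero-score dummy proposals so every object gets a slot.
    width = max(num_gt, num_prop)
    padded = [list(row) + [0.0] * (width - num_prop) for row in scores]
    best_total = 0.0
    # Try every way of giving each ground-truth object a distinct proposal.
    for assignment in permutations(range(width), num_gt):
        total = sum(padded[i][j] for i, j in enumerate(assignment))
        best_total = max(best_total, total)
    return best_total / num_gt

# Two annotated objects (L = 2), three proposals (N = 3).
example = [
    [0.9, 0.1, 0.3],
    [0.8, 0.2, 0.4],
]
print(unsupervised_score(example))

# Two annotated objects but a single proposal: one object is missed.
print(unsupervised_score([[0.9], [0.8]]))
```

In the first example, both objects prefer proposal 0, so the best assignment gives it to the first object and proposal 2 to the second. For realistic sizes such as N = 20, a bipartite matching solver (for example `scipy.optimize.linear_sum_assignment`) would replace the brute-force loop.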
IV. Research Status of Moku Laboratory of Ali Culture and Entertainment
At present, many semi-supervised video object segmentation algorithms are academically innovative, but their practical results are not good. We surveyed this year’s CVPR papers: on the DAVIS 2017 val set, no regular paper reports J&F > 0.76. FEELVOS, SiamMask, and other algorithms are very good in theory, but they run into many problems in practice. There is no open source code for interactive video object segmentation.
Therefore, the Moku Laboratory of Ali Culture and Entertainment has been working on semi-supervised and interactive video object segmentation algorithms since the end of March 2019.
In May 2019, we completed a basic semi-supervised video object segmentation algorithm and an interactive video object segmentation solution, took part in the 2019 DAVIS Challenge on Video Object Segmentation, and won fourth place in the interactive video object segmentation track.
Our proposed VOS with a robust tracking strategy greatly improves the robustness of the basic algorithm. On the DAVIS 2017 validation set, the J&F@60s of our interactive video object segmentation algorithm rose from 0.353 at the end of March to 0.761 at the beginning of May. Our semi-supervised video object segmentation algorithm now also achieves J&F = 0.763. Our results on this dataset are close to the first-class level in the industry.
V. Follow-up Plan of Moku Laboratory of Ali Culture and Entertainment
At present, we are continuing to explore the application of these algorithms in complex scenes, including small objects, highly similar foreground and background, fast motion or fast appearance change, and severe object occlusion. Next, we plan to work on online learning, space-time networks, and region proposal and verification strategies to improve the accuracy of video object segmentation in complex scenes.
In addition, image object segmentation and multi-object tracking algorithms are important foundations of video object segmentation, and we will continue to improve their accuracy as well.
S. Caelles, J. Pont-Tuset, F. Perazzi, A. Montes, K.-K. Maninis, and L. Van Gool. The 2019 DAVIS Challenge on VOS: Unsupervised Multi-Object Segmentation. arXiv:1905.00737, 2019.
A. Khoreva, R. Benenson, E. Ilg, T. Brox, and B. Schiele. Lucid data dreaming for object tracking. arXiv:1703.09554, 2017.
S. Caelles, K.-K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool. One-shot video object segmentation. CVPR, 2017.
J. Luiten, P. Voigtlaender, and B. Leibe. PReMVOS: Proposal-generation, refinement and merging for video object segmentation. arXiv:1807.09190, 2018.
P. Voigtlaender, Y. Chai, F. Schroff, H. Adam, B. Leibe, and L.-C. Chen. FEELVOS: Fast End-to-End Embedding Learning for Video Object Segmentation. CVPR, 2019.
S. W. Oh, J.-Y. Lee, N. Xu, and S. J. Kim. Fast User-Guided Video Object Segmentation by Interaction-and-Propagation Networks. CVPR, 2019.
Q. Wang, L. Zhang, L. Bertinetto, W. Hu, and P. H. S. Torr. Fast Online Object Tracking and Segmentation: A Unifying Approach. CVPR, 2019.
H. Ren, Y. Yang, and X. Liu. Robust Multiple Object Mask Propagation with Efficient Object Tracking. The 2019 DAVIS Challenge on Video Object Segmentation, CVPR Workshops, 2019.
The author of this article: Ren Haibing
This article is from Ali Technologies, Yunqi Community Partner. If you need to reprint it, please contact the original author.