Author / yassine

Original link/ https://yassouali.github.io/m…

Similar to my CVPR 2020 post, to get a sense of the overall trends at this year’s conference, I summarize in this post some of the papers that caught my attention and give an overview of the conference as a whole.

First, here are some related links:

All accepted papers: https://www.ecva.net/papers.php

Some of the presentations: https://crossminds.ai/categor…

YouTube playlist: https://www.youtube.com/playl…

One-sentence description of each paper: https://www.paperdigest.org/2…

ECCV website: https://papers.eccv2020.eu/pa…

*Disclaimer: This post is not a representative account of the papers and topics at ECCV 2020; it is only an overview of what caught my interest.*

# Overall statistical overview

The statistics in this part are taken from the official Opening & Awards presentation. Let’s start with some general statistics:

The number of submitted papers continued the growth trend of previous years, increasing by more than 200% compared with the 2018 conference and approaching the number of submissions to CVPR 2020. As expected, the number of reviewers and areas covered increased accordingly.

As expected, most accepted papers focus on topics related to deep learning, recognition, detection and understanding. Similar to CVPR 2020, researchers are increasingly interested in areas such as label-efficient methods (for example, unsupervised learning) and low-level vision.

In terms of institutions, similar to this year’s ICML, Google ranks first with 180 authors, followed by the Chinese University of Hong Kong with 140 authors and Peking University with 110 authors.

In the next sections, I summarize some of the papers by topic.

# Recognition, detection, segmentation and pose estimation


# End-to-End Object Detection with Transformers

（https://arxiv.org/abs/2005.12872）


The task of object detection consists of localizing and classifying the objects visible in a given image. Most current object detection frameworks rely on a set of predefined boxes (geometric priors called anchors, or region proposals). The network classifies these boxes, regresses offsets to adjust their bounding boxes, and then a post-processing step removes duplicate detections. Because of this post-processing, however, the whole network cannot be trained end-to-end like models for other computer vision tasks. In this paper, the authors propose a new object detection framework, DETR (DEtection TRansformer), a model that can be trained fully end-to-end without any geometric priors. The following figure compares DETR with Faster R-CNN (the figure is taken from the authors’ presentation), highlighting the holistic nature of DETR.

DETR is built on the encoder-decoder transformer architecture. The model consists of three components: a convolutional neural network feature extractor, an encoder, and a decoder. A given image is first passed through the feature extractor to obtain image features. Positional encodings generated by sine functions of different frequencies are then added to the features to retain the two-dimensional structure of the image. The resulting features are passed through the transformer encoder, which aggregates information across features and separates the different object instances. For decoding, object query vectors are passed through the decoder together with the encoded features to produce the final output feature vectors. These queries are a fixed set of learned embeddings: they are randomly initialized, optimized during training, and kept fixed at evaluation time. The number of queries also sets an upper bound on the number of objects the detector can detect. Finally, the output feature vectors are fed through a (shared) fully connected layer to predict the class and bounding box for each query. To compute the loss and train the model, the authors use the Hungarian algorithm to match the outputs one-to-one with the ground-truth annotations.
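The one-to-one matching step can be illustrated with a small sketch. This is not the authors' implementation: the cost below (negative class probability plus an L1 box distance) is a simplified assumption, the names `match_cost` and `best_matching` are hypothetical, and the optimal assignment is found by brute force over permutations rather than with the Hungarian algorithm, which computes the same optimum efficiently.

```python
from itertools import permutations


def match_cost(pred, target, box_weight=1.0):
    """Cost of assigning one prediction to one ground-truth object (assumed form)."""
    class_probs, pred_box = pred
    target_class, target_box = target
    class_cost = -class_probs[target_class]  # high probability -> low cost
    box_cost = sum(abs(p - t) for p, t in zip(pred_box, target_box))  # L1 distance
    return class_cost + box_weight * box_cost


def best_matching(preds, targets):
    """Return {target_index: pred_index} minimizing the total cost.

    Brute force over all assignments; only practical for tiny examples.
    Assumes there are at least as many predictions as targets.
    """
    best_cost, best_assignment = float("inf"), None
    for perm in permutations(range(len(preds)), len(targets)):
        cost = sum(match_cost(preds[p], targets[t]) for t, p in enumerate(perm))
        if cost < best_cost:
            best_cost, best_assignment = cost, perm
    return dict(enumerate(best_assignment))


# Three query outputs: (class probabilities, box as (cx, cy, w, h)).
preds = [
    ([0.1, 0.9], (0.50, 0.50, 0.20, 0.20)),
    ([0.8, 0.2], (0.10, 0.10, 0.30, 0.30)),
    ([0.5, 0.5], (0.90, 0.90, 0.10, 0.10)),
]
# Two annotated objects: (class index, box).
targets = [
    (0, (0.12, 0.10, 0.30, 0.28)),
    (1, (0.48, 0.52, 0.20, 0.22)),
]

matching = best_matching(preds, targets)
unmatched = sorted(set(range(len(preds))) - set(matching.values()))
```

Each annotated object ends up with exactly one query, and the queries left in `unmatched` would be trained to predict "no object", which is why no duplicate-removal post-processing is needed.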


# MutualNet: Adaptive ConvNet via Mutual Learning from Network Width and Resolution

（https://arxiv.org/abs/1909.12978）

Traditional neural networks can only be used when a specific amount of computing resources is available; if the resource requirement is not met, the model cannot be used at all. This greatly limits the use of such models in practical applications. For example, if a model runs inference on a phone, the available computing resources will vary with the load and the battery level of the phone. A simple solution is to keep several models of different sizes on the device and use the one that fits the available resources each time, but this requires a lot of memory and cannot adapt to arbitrary resource budgets. Recently, networks such as S-Net and US-Net sample sub-networks of different sizes during training, so that the network can be adjusted to different widths (i.e. numbers of channels) at deployment. However, under very low computing budgets, the performance of such networks drops sharply.

This paper proposes to exploit both the network width and the input image resolution in order to find a good balance between accuracy and computational efficiency. In each training iteration, four sub-networks are sampled: the full network and three sub-networks of different widths. The full network is trained on the original-size images and labels with a cross-entropy loss, while the other three are fed images at randomly chosen scales (the original images or downsampled versions) and are supervised with the KL divergence between their outputs and the output of the full network (i.e. a distillation loss). In this way, each sub-network learns multi-scale representations across different network widths and input sizes. At deployment, given a specific resource constraint, the best combination of network width and input resolution can be selected for inference.
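As a rough sketch of the training objective described above, the snippet below combines a cross-entropy loss for the full network with KL-divergence terms for the sub-networks. The logits are made-up numbers standing in for real network outputs, and the unweighted sum of the terms is an assumption made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Cross-entropy between the predicted distribution and a class index."""
    return -math.log(softmax(logits)[label])


def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) between the two softmax distributions."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical outputs for one image: the full network sees the original
# resolution, each sub-network sees a (possibly downsampled) copy.
full_logits = [2.0, 0.5, -1.0]
sub_logits = [[1.5, 0.7, -0.5], [1.0, 1.0, 0.0], [0.2, 0.1, 0.0]]
label = 0

loss_full = cross_entropy(full_logits, label)                     # supervised by the label
loss_subs = [kl_divergence(full_logits, s) for s in sub_logits]   # supervised by the full network
total_loss = loss_full + sum(loss_subs)
```

In a real training loop the full network's output would be treated as a fixed target when computing the sub-network terms, so that the distillation loss only updates the sub-networks.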

# Gradient Centralization: A New Optimization Technique for Deep Neural Networks

（https://arxiv.org/abs/2004.01461）

When optimizing neural networks, using first- and second-order statistics such as the mean and variance to standardize network activations or weights, as in BatchNorm and weight normalization, has become an important part of training. Gradient centralization (GC) instead operates directly on the gradients: rather than adding a normalization module on the weights or activations, it centers each gradient vector to have zero mean, which smooths and accelerates training and can even improve the generalization performance of the model.

Given the computed gradient, the GC operator first calculates the mean of the gradient vector, as shown above, and then subtracts that mean from it. Mathematically, for a weight vector $w_i$ whose gradient is $\nabla_{w_i}\mathcal{L}$ ($i = 1, 2, \dots, n$), the GC operation can be defined as:

$$\Phi_{GC}(\nabla_{w_i}\mathcal{L}) = \nabla_{w_i}\mathcal{L} - \mu_{\nabla_{w_i}\mathcal{L}}, \qquad \mu_{\nabla_{w_i}\mathcal{L}} = \frac{1}{M}\sum_{j=1}^{M}\nabla_{w_{i,j}}\mathcal{L}$$

where $M$ denotes the dimension of $w_i$.
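The centralization step is small enough to sketch in plain Python. Here the gradient is assumed to be stored as a list of rows, one row per weight vector; the function name `centralize_gradient` is hypothetical, and a real optimizer would apply the same operation to tensors just before the weight update.

```python
def centralize_gradient(grad):
    """Subtract from each row (the gradient of one weight vector) its own mean."""
    centered = []
    for row in grad:
        mean = sum(row) / len(row)
        centered.append([g - mean for g in row])
    return centered


# Gradients of two weight vectors with three components each (made-up values).
grad = [
    [0.5, -0.1, 0.2],
    [1.0, 1.0, 4.0],
]
centered = centralize_gradient(grad)
row_means = [sum(row) / len(row) for row in centered]  # all (numerically) zero
```

After this step every gradient vector has zero mean, so the update keeps the sum of each weight vector's components unchanged.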

# Smooth-AP: Smoothing the Path Towards Large-Scale Image Retrieval

（https://arxiv.org/abs/2007.12163）

In image retrieval, the goal is to retrieve, from a large collection of images, those that belong to the same category as the query image. This task differs from classification, where the categories of the test images have already been seen during training. In image retrieval, the categories of the test images may not have been seen during training, but we still need to find similar images in the collection, which makes it an open set problem. The feature extractor is trained to produce a good ranking (that is, the similarity between images of the same category should be as high as possible). The performance of the network is measured by average precision (AP), which, for each correctly retrieved image, takes the ratio of its rank among the correct results to its rank in the whole image set, and averages these ratios. Computing the rank of a given image requires a thresholding operation based on the Heaviside step function, which is non-differentiable, so the final ranking cannot be used directly to optimize the model end-to-end.

To solve this problem, the authors propose to replace the Heaviside step function with a sigmoid function controlled by a temperature parameter, so that the ranking becomes differentiable and can be used as a loss function to optimize the network end-to-end. Compared with the triplet loss, the Smooth-AP loss optimizes a ranking loss directly, whereas the triplet loss is an indirect surrogate loss optimized in the hope of obtaining a good ranking.
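To make the relaxation concrete, the sketch below computes AP for one query twice: once with the hard step function and once with a temperature-controlled sigmoid in its place. It is a minimal illustration in plain Python; the function names and the example scores are assumptions, and a training loss would be one minus the smoothed value computed on differentiable tensors.

```python
import math


def sigmoid(x, temperature):
    """Numerically stable sigmoid of x / temperature."""
    z = x / temperature
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def step(x):
    """Heaviside step used by the exact (non-differentiable) ranking."""
    return 1.0 if x > 0 else 0.0


def average_precision(scores, is_positive, indicator):
    """AP for one query; `indicator(s_j - s_i)` says whether item j outranks item i."""
    positives = [i for i, pos in enumerate(is_positive) if pos]
    total = 0.0
    for i in positives:
        rank_among_positives = 1.0 + sum(
            indicator(scores[j] - scores[i]) for j in positives if j != i)
        rank_among_all = 1.0 + sum(
            indicator(scores[j] - scores[i]) for j in range(len(scores)) if j != i)
        total += rank_among_positives / rank_among_all
    return total / len(positives)


def smooth_ap(scores, is_positive, temperature=0.01):
    """AP with the step function replaced by a sigmoid."""
    return average_precision(scores, is_positive, lambda d: sigmoid(d, temperature))


# Similarities of four gallery images to the query, and whether each is a true match.
scores = [0.9, 0.8, 0.7, 0.6]
is_positive = [True, False, True, False]

exact = average_precision(scores, is_positive, step)
approx_sharp = smooth_ap(scores, is_positive, temperature=0.01)
approx_soft = smooth_ap(scores, is_positive, temperature=1.0)
```

A small temperature keeps the smoothed value close to the exact AP, while a larger temperature gives a looser approximation but gradients that reach more of the gallery items.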

# Hybrid Models for Open Set Recognition

（https://arxiv.org/abs/2003.12506）

Existing image classification methods are usually based on the closed set assumption, that is, the training set covers all categories that may appear at test time. This assumption is clearly unrealistic, because even large-scale datasets with 1K classes such as ImageNet cannot cover all the categories that exist in the real world. This motivates open set classification, which addresses the problem by assuming that the test set contains both known and unknown classes.

In this paper, the authors use a flow-based model to tackle open set classification. Flow-based methods can fit the probability distribution of the training samples in an unsupervised way through maximum likelihood estimation. The flow model can then be used to predict the probability density of each sample. When the probability density of an input sample is large, it is likely to belong to the training distribution of a known category, whereas outliers will have a small density. While previous methods stack a classifier on top of the flow model, the authors propose to learn a joint embedding for the flow model and the classifier, because an embedding learned only from the flow-based model may not be discriminative enough for effective classification. As shown above, during training the image is mapped to a latent feature by an encoder network, and the encoded feature is then fed to both the classifier and the flow model. The classifier is supervised with a cross-entropy loss, and the flow model is responsible for density estimation. The whole architecture can be trained end-to-end. At test time, $\log p(x)$ is computed for each image and compared with the lowest $\log p(x)$ obtained on the training set. If it is greater than this threshold, the image is sent to the classifier to identify its specific known class; otherwise it is rejected as an unknown sample.
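The test-time decision rule can be sketched as follows. To keep the example self-contained, a one-dimensional Gaussian fitted to the training latents stands in for the flow model and a trivial sign rule stands in for the classifier; both are assumptions made purely for illustration, as are all the names used.

```python
import math


def gaussian_log_density(z, mean, std):
    """Log density of a 1-D Gaussian (a stand-in for the flow model's log p(x))."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (z - mean) ** 2 / (2 * std ** 2)


def open_set_predict(z, classify, log_density, threshold):
    """Reject low-density inputs as unknown, otherwise defer to the classifier."""
    if log_density(z) < threshold:
        return "unknown"
    return classify(z)


# Hypothetical 1-D latent codes of the training images.
train_latents = [-1.2, -0.9, -1.0, 1.1, 0.8, 1.0]
mean = sum(train_latents) / len(train_latents)
std = math.sqrt(sum((z - mean) ** 2 for z in train_latents) / len(train_latents))


def log_density(z):
    return gaussian_log_density(z, mean, std)


def classify(z):
    # Toy classifier over the two known classes.
    return "class_a" if z < 0 else "class_b"


# Threshold: the lowest log-density observed on the training set.
threshold = min(log_density(z) for z in train_latents)

known = open_set_predict(-1.0, classify, log_density, threshold)
other_known = open_set_predict(0.9, classify, log_density, threshold)
outlier = open_set_predict(5.0, classify, log_density, threshold)
```

With this threshold every training sample is accepted by construction, and only inputs less likely than the least likely training sample are rejected.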

# Conditional Convolutions for Instance Segmentation

（https://arxiv.org/abs/2003.05664）

Instance segmentation remains one of the challenging tasks in computer vision: each visible object in a given image must be labeled with a pixel-wise mask and a category label. The dominant method is Mask R-CNN, which consists of two steps. First, the object detector Faster R-CNN generates a bounding box for each instance. Then, for each detected instance, RoIAlign crops the region of interest from the output feature map and rescales it to a fixed resolution. The result is fed to a mask head, a small fully convolutional network that predicts the segmentation mask. However, the authors point out the following limitations of this architecture: (1) RoIAlign may pick up irrelevant features belonging to the background or to other instances, (2) the resizing operation limits the resolution of the instance segmentation, and (3) the mask head needs to stack several 3×3 convolutions to obtain a receptive field large enough to predict the mask, which greatly increases its computational cost.

In this paper, the authors propose to adapt the FCNs used in semantic segmentation to instance segmentation. For effective instance segmentation, an FCN needs two types of information: appearance information to classify the objects, and location information to distinguish different objects of the same category. The proposed network is called CondInst (conditional convolutions for instance segmentation) and builds on CondConv and HyperNetworks. For each instance, a sub-network generates the weights of the mask FCN head conditioned on the central area of that instance, and this head is then used to predict the mask of the given instance. Specifically, as shown above, the network consists of multiple heads applied at different scales of the feature map. Each head predicts the category of a given instance at predefined positions and generates the network weights to be used by the mask FCN head. The mask is then predicted by the mask FCN head using the corresponding generated parameters.
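The idea of a mask head whose weights are generated per instance can be sketched in a few lines. In this illustration a flat parameter vector (standing in for the output of the weight-generating sub-network) is split into the weights and biases of a stack of 1×1 convolutions, which is then applied to the feature vector of a single pixel. The layer sizes, parameter values and function names are all assumptions chosen to keep the example tiny.

```python
import math


def num_params(channels):
    """Number of weights and biases in a stack of 1x1 convs with these channel sizes."""
    return sum(c_in * c_out + c_out for c_in, c_out in zip(channels[:-1], channels[1:]))


def split_params(flat, channels):
    """Split a flat parameter vector into per-layer (weights, biases)."""
    layers, offset = [], 0
    for c_in, c_out in zip(channels[:-1], channels[1:]):
        weights = [flat[offset + o * c_in: offset + (o + 1) * c_in] for o in range(c_out)]
        offset += c_in * c_out
        biases = flat[offset: offset + c_out]
        offset += c_out
        layers.append((weights, biases))
    if offset != len(flat):
        raise ValueError("parameter count does not match the layer sizes")
    return layers


def mask_head(pixel_features, flat_params, channels):
    """Apply the instance-specific 1x1 conv stack to one pixel; return a mask probability."""
    x = pixel_features
    layers = split_params(flat_params, channels)
    for index, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if index < len(layers) - 1:
            x = [max(0.0, v) for v in x]  # ReLU between layers
    return 1.0 / (1.0 + math.exp(-x[0]))  # sigmoid on the single output channel


channels = [3, 2, 1]            # input features -> hidden -> mask logit
pixel = [1.0, 0.5, -0.5]        # feature vector of one pixel (shared by all instances)

# Two hypothetical parameter vectors, as if generated for two different instances.
params_a = [1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0]
params_b = [-1, 0, 0, 0, -1, 0, 0, 0, 1, 1, 0]

mask_a = mask_head(pixel, params_a, channels)
mask_b = mask_head(pixel, params_b, channels)
```

The same pixel features yield different mask probabilities for the two instances because the filters, not the input crop, carry the instance identity, so no RoI cropping or resizing is involved.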

# Multitask Learning Strengthens Adversarial Robustness

（https://arxiv.org/abs/2007.07236）

One of the main limitations of deep neural networks is their vulnerability to adversarial attacks, in which very small, imperceptible perturbations added to the image lead to completely wrong outputs even though the input looks almost identical to the naked eye. In recent years, researchers have studied the adversarial robustness of neural networks at many levels, from the input data (for example, using unlabeled data and adversarial training) to regularizing the model itself (for example, Parseval networks). However, the outputs of the model have not yet been exploited to improve robustness. In this paper, the authors study the impact of multi-task learning with multiple outputs on adversarial robustness. This setting is useful because more and more machine learning applications require models that can solve several tasks at the same time.

The authors use norm-bounded attacks, in which the adversarial perturbation is searched for within an $\ell_p$-norm ball of a given radius around a given input sample. The resulting change in the total loss is then taken as the vulnerability of the network. The authors show improved robustness when training on two tasks (for example, two tasks chosen at random from: segmentation, depth estimation, surface normal estimation, reshading, input reconstruction, 2D or 3D keypoint prediction, etc.). The improved robustness is observed both for single-task attacks (i.e. perturbations computed using one output) and for multi-task attacks (i.e. the strongest of the perturbations computed using all outputs). The authors also show theoretically that this multi-task robustness can only be obtained when the tasks are correlated.

# Dynamic Group Convolution for Accelerating Convolutional Neural Networks

（https://arxiv.org/abs/2007.04242）

Group convolution first appeared in AlexNet, where its purpose was to accelerate training; it was later adopted in the design of lightweight CNNs such as MobileNet and ShuffleNet. Group convolutions split the input and output channels of a convolution layer equally into mutually exclusive groups along the channel dimension and perform a normal convolution within each group. With $G$ groups, the computation is therefore reduced by a factor of $G$. However, the authors argue that existing group convolutions have two key drawbacks: (1) they weaken the representational capacity of the normal convolution by introducing sparse connections, and (2) they use a fixed channel grouping for every input, ignoring the properties of each individual input.

In order to adaptively select the most relevant input channels for each group while maintaining the complete structure of the original network, the authors propose dynamic group convolution (DGC). DGC is composed of two heads. Each head has a saliency score generator that produces an importance score for each channel. Using these scores, channels with low importance are pruned. A normal convolution is then applied to the remaining feature channels to obtain the output. Finally, the outputs of the different heads are concatenated along the channel dimension and the channels are shuffled.

# Disentangled Non-local Neural Networks

（https://arxiv.org/abs/2006.06668）

The non-local block uses an attention mechanism to model long-range dependencies between pixels, and has been widely used in many computer vision recognition tasks, such as object detection, semantic segmentation and video action recognition.

In this paper, the authors try to better understand the non-local block, identify its limitations, and propose an improved version. First, the authors reformulate the similarity between pixel i (called the key pixel) and pixel j (called the query pixel) as the sum of two terms: a pairwise term, which takes the form of a whitened dot product and describes the relationship between the query pixel and the key pixel, and a unary term, which captures the influence a given key pixel has on all query pixels. Then, to understand the influence and role of each term, they train with only one of them at a time, and find that the pairwise term accounts for category information while the unary term accounts for boundary information. However, by analyzing the gradient of the non-local block, the authors find that when the two terms are combined inside the attention operation, their gradients are multiplied. As a result, if the gradient of one term is close to zero, the gradient of the other term, even when it is non-zero, contributes nothing to training. To solve this problem, the authors propose a disentangled version of the non-local block in which the two terms can be optimized separately.
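The split into a whitened pairwise term and a unary term can be checked numerically. The sketch below uses the usual attention convention, in which the weights of one position are normalized with a softmax over the positions it attends to; the names `query`, `queries` and `keys` are my own labels, and the vectors are made-up values. Subtracting the means changes the dot product only by terms that are constant across the softmax, so both formulations give the same attention weights.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys):
    """Standard non-local attention: softmax over plain dot products."""
    return softmax([dot(query, k) for k in keys])


def decomposed_weights(query, queries, keys):
    """Same weights, written as a whitened pairwise term plus a unary term."""
    mu_q = mean_vector(queries)
    mu_k = mean_vector(keys)
    centered_query = [q - m for q, m in zip(query, mu_q)]
    logits = []
    for k in keys:
        pairwise = dot(centered_query, [x - m for x, m in zip(k, mu_k)])
        unary = dot(mu_q, k)  # depends only on the attended position
        logits.append(pairwise + unary)
    return softmax(logits)


queries = [[0.2, 1.0], [1.5, -0.3], [-0.7, 0.4]]
keys = [[1.0, 0.0], [0.5, 0.5], [-1.0, 2.0]]

standard = attention_weights(queries[0], keys)
decomposed = decomposed_weights(queries[0], queries, keys)
```

Because the two terms sit inside a single softmax here, they are coupled during training; the disentangled block described above separates them so that each can be optimized on its own.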

# Hard negative examples are hard, but useful

（https://arxiv.org/abs/2007.12749）

Deep metric learning aims to optimize an embedding function such that, after the mapping, semantically similar images lie close together in the high-dimensional space while semantically dissimilar images lie far apart. A common way to learn this mapping is to define a loss function over triplets of images. A triplet contains an anchor image, a positive image of the same category as the anchor, and a negative image of a different category. During optimization, the model is penalized when the distance between the anchor and the negative is smaller than the distance between the anchor and the positive. However, during training most candidate triplets already satisfy this constraint, that is, the anchor is closer to the positive than to the negative, which makes these triplets redundant for training. On the other hand, optimizing with the hardest negatives leads to bad local minima in the early stages of training. This happens when the anchor-negative similarity, computed as the cosine similarity (i.e. the dot product of the normalized feature vectors), is much greater than the anchor-positive similarity.

The authors show the problems of using hard negative mining in the standard implementation of the triplet loss. Specifically, (1) if normalization is not taken into account in the gradient computation, a large part of the gradient is lost; (2) if two images of different categories are very close in the embedding space, the gradient of the loss is likely to pull them closer together rather than push them apart. To solve this problem, the authors no longer pull the anchor-positive pair as close together as possible, as the original triplet loss does in order to form tighter clusters. Instead, they avoid updating with the loss gradient of the anchor-positive pair, so that the cluster formed by the instances of a class does not become too compact. The method focuses only on directly pushing the hard negative away from the anchor image.
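For reference, the standard triplet objective on cosine similarities can be sketched as below. This shows the baseline formulation being discussed, not the modified loss proposed in the paper; the margin value, the two-dimensional embeddings and the function names are assumptions for illustration.

```python
import math


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cosine_similarity(a, b):
    """Dot product of the two normalized vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def triplet_loss(anchor, positive, negative, margin=0.1):
    """Penalize triplets where the negative is not less similar than the positive by a margin."""
    s_ap = cosine_similarity(anchor, positive)
    s_an = cosine_similarity(anchor, negative)
    return max(0.0, s_an - s_ap + margin)


anchor = [1.0, 0.0]
positive = [0.9, 0.1]
easy_negative = [-1.0, 0.2]   # far from the anchor: constraint already satisfied
hard_negative = [1.0, 0.05]   # closer to the anchor than the positive is

loss_easy = triplet_loss(anchor, positive, easy_negative)
loss_hard = triplet_loss(anchor, positive, hard_negative)
```

The easy triplet contributes zero loss and therefore no gradient, which is the redundancy mentioned above, while the hard triplet is the kind of case where the anchor-negative similarity exceeds the anchor-positive similarity.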

# Volumetric Transformer Networks

（https://arxiv.org/abs/2007.09433）

One of the keys behind the success of convolutional neural networks (CNNs) is their ability to learn discriminative feature representations of the parts of semantic objects, which is very useful for computer vision tasks. However, CNNs still lack the ability to handle various spatial variations (such as scale, viewpoint and intra-class variations). Recent methods such as the spatial transformer network (STN) try to reduce these spatial variations by warping feature maps with different spatial distributions into a canonical form, and then classifying the standardized features. However, such methods apply the same warping to all feature channels, without taking into account that each feature channel can represent a different semantic part, so that transforming it into a canonical form may require a different spatial transformation.

To solve this problem, this paper introduces the volumetric transformer network (VTN), as shown in the figure, which is a learnable module. It predicts a warping transformation for each pixel position of each channel, which is used to transform the intermediate CNN features into a canonical form that is invariant to spatial position. VTN is an encoder-decoder network whose modules pass information between the different feature map channels in order to estimate the dependencies between the different semantic parts.

# Faster AutoAugment: Learning Augmentation Strategies Using Backpropagation

（https://arxiv.org/abs/1911.06987）

Data augmentation (DA) has become an important and indispensable part of deep learning methods. Recent studies (such as AutoAugment, Fast AutoAugment and RandAugment) show that augmentation policies found by search algorithms outperform standard augmentation policies. Such algorithms define in advance the set of all possible data transformations, such as geometric transformations (e.g. rotation) or color transformations (e.g. a negative-image effect), and search for the optimal augmentation parameters, such as the magnitude of each augmentation, the probability of applying it, and the number of transformations to combine, as shown in the left figure below. The optimal policy is learned through a double optimization loop so as to minimize the validation error of a CNN trained with the given policy. However, this optimization scheme has some limitations: when the search space of augmentation policies is very large, sophisticated search methods are required, and a single step of policy optimization requires fully training the CNN. To solve this problem, the authors propose to find the best policy by density matching between original and augmented images, using gradient-based optimization.

By treating DA as a way of filling in the missing points of the original data, the goal becomes to minimize the distance between the augmented data and the original data using adversarial learning. To learn the best augmentation policy, the transformation parameters need to be differentiable. For the probability of applying a given augmentation, the authors use random binary variables sampled from a Bernoulli distribution and optimize them with the Gumbel trick. The augmentation magnitude is approximated with a straight-through estimator, and the combination of augmentations is learned as a combination of one-hot vectors.

Because the original post is long, and to keep it readable, the remaining topics (semi-supervised, unsupervised, transfer, representation and few-shot learning; 3D computer vision and robotics; image and video synthesis; vision and language) will be published next week.
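The relaxation of a Bernoulli decision can be sketched as follows. This is a generic relaxed-Bernoulli (binary Gumbel-softmax) sampler written in plain Python, not the authors' code: the logistic noise used here is the difference of two Gumbel samples, and the temperature, seed and function names are assumptions for illustration.

```python
import math
import random


def stable_sigmoid(z):
    """Sigmoid that avoids overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def relaxed_bernoulli(prob, temperature, rng):
    """Soft sample in (0, 1) that approaches a Bernoulli(prob) draw as temperature -> 0."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)   # keep the logs finite
    logit = math.log(prob) - math.log(1.0 - prob)    # learnable parameter in practice
    noise = math.log(u) - math.log(1.0 - u)          # logistic noise (difference of two Gumbels)
    return stable_sigmoid((logit + noise) / temperature)


rng = random.Random(0)
samples = [relaxed_bernoulli(prob=0.8, temperature=0.1, rng=rng) for _ in range(5000)]
mean_sample = sum(samples) / len(samples)
```

Because each sample is a smooth function of `prob`, gradients can flow back to the application probability, which is what allows the policy to be trained by backpropagation rather than by a black-box search.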