Visual recognition needs rich representations, spanning levels from low to high, scales from small to large, and resolutions from fine to coarse. Although convolutional networks extract deep features, an isolated layer is not enough: combining and aggregating these representations improves the inference of “what” and “where”. Architecture work has explored every aspect of the network backbone, designing deeper or wider networks, but how to better aggregate the layers and blocks of a network deserves further attention. Although skip connections have been used to combine layers, these connections are themselves “shallow”: they fuse information in only a single simple step. We extend the standard architectures with deeper aggregation to better integrate information across layers. Our deep layer aggregation structures fuse the feature hierarchy iteratively and hierarchically, achieving higher accuracy with fewer parameters. Experiments show that deep layer aggregation improves both recognition and resolution.
How to connect layers and modules needs further exploration. Deeper layers extract more semantic, more global features, but this does not mean that the last layer is the final representation for a task. In fact, skip connections have proved effective for classification and regression as well as for structured prediction tasks. Aggregation, like depth and width, is a key dimension of architecture.
We study how to aggregate layers to better fuse semantic and spatial information for recognition and localization. Extending the “shallow” skip connections of current methods, we aggregate with more depth and sharing. We introduce two deep layer aggregation (DLA) structures: iterative deep aggregation (IDA) and hierarchical deep aggregation (HDA). These structures are expressed through an architectural framework, independent of the choice of backbone, so they are compatible with existing and future networks. IDA mainly fuses resolution and scale, while HDA mainly merges features across modules and channels. IDA refines resolution and aggregates scale level by level along the backbone. HDA merges its own tree-structured connections, aggregating all blocks and levels into representations at different depths (spatial-scale fusion, similar to FPN). The strategy in this paper further improves results by using IDA and HDA together.
In this paper, DLA is applied to the existing ResNet and ResNeXt architectures for large-scale image classification, fine-grained recognition, semantic segmentation, and boundary detection. The results show that DLA improves performance while reducing the number of parameters and the memory overhead relative to the existing ResNet, ResNeXt, and DenseNet structures. DLA achieves the best accuracy among compact models on current classification tasks. Without further structural changes, the same networks achieve the best accuracy on multiple fine-grained recognition tasks. DLA achieves the highest class-level accuracy on Cityscapes semantic segmentation and the best accuracy on the PASCAL Boundaries boundary-detection task. DLA is a general and effective extension technique for deep vision networks.
3. Deep Layer Aggregation
In this paper, aggregation is defined as the combination of different layers throughout a network. We focus on architectures that aggregate depth, resolution, and scale more effectively. A group of aggregations is called deep if it is compositional and nonlinear, and the earliest aggregated layer passes through multiple aggregation nodes.
Because a network can contain many layers and connections, modular design counters this complexity through grouping and reuse. Layers are grouped into blocks, and blocks are grouped into levels according to their feature resolution. This paper mainly discusses how to aggregate blocks and levels.
3.1 Iterative Deep Aggregation (IDA)
IDA follows the iterated stacking of the backbone. In this paper, the stacked blocks of the network are divided into levels according to resolution: deeper levels carry more semantic information but coarser spatial information. Skip connections from shallow to deep can merge scales and resolutions. However, such skip connections, as in FCN, U-Net, and FPN, are themselves shallow: each one fuses only the shallowest preceding information in a single step.
Figure 2: (b) parts of the network are combined by skip connections, but each such step fuses only the shallowest preceding information. (c) Iterative aggregation lets shallow features receive further processing in subsequent stages. (d) Aggregation through tree-structured blocks propagates features across different depths of the network.
Therefore, this paper proposes IDA as a more radical aggregation scheme. It starts aggregating at the shallowest, smallest scale and then iteratively merges deeper, larger-scale information. In this way, the shallow features are refined by subsequent aggregation at different levels. Figure 2(c) shows the basic structure of IDA.
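As a minimal sketch of this iterative merge (the aggregation node `N` is just a placeholder callable here; in the paper it is a convolution with batch normalization and a nonlinearity):

```python
def ida(node, features):
    """Iterative Deep Aggregation: fold features from the
    shallowest/smallest stage toward the deepest, merging two at a time.

    node     -- binary aggregation function N(x, y)
    features -- list [x1, x2, ..., xn], ordered shallow to deep
    """
    out = features[0]
    for f in features[1:]:
        out = node(out, f)   # aggregate, then continue with the next level
    return out

# toy example: the "node" simply averages its two inputs
print(ida(lambda a, b: (a + b) / 2, [8.0, 4.0, 2.0]))
```

The fold makes the sequential nature of IDA explicit: every shallow feature passes through all subsequent aggregation nodes.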
3.2 Hierarchical Deep Aggregation (HDA)
HDA merges blocks and levels in a tree structure to preserve and combine feature channels. Through HDA, shallow and deep layers are combined to learn rich combinations that span the feature hierarchy. Although IDA effectively combines levels, it is still sequential, which is not enough to fuse the information of every block of the network. The deep branching structure of HDA is shown in Figure 2(d).
With the basic HDA structure established, its depth and efficiency can be improved further. Instead of only routing intermediate aggregations up the tree, we feed the output of an aggregation node back into the backbone as the input of the next sub-tree, as shown in Figure 2(e). In this way, the information of all previous blocks is aggregated into subsequent processing and features are better preserved, rather than each block being processed in isolation. For efficiency, aggregation nodes at the same depth (i.e. with the same feature-map size) are merged by combining the parent node with the left child, as shown in Figure 2(f).
HDA is modeled as follows:
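Following the original DLA paper, HDA for a hierarchy of depth $n$, with aggregation node $N$ and convolutional block $B$, is:

```latex
T_n(x) = N\big(R^n_{n-1}(x),\; R^n_{n-2}(x),\; \dots,\; R^n_1(x),\; L^n_1(x),\; L^n_2(x)\big)
```

where the left/right branches are defined recursively:

```latex
L^n_2(x) = B\big(L^n_1(x)\big), \qquad
L^n_1(x) = B\big(R^n_1(x)\big), \qquad
R^n_m(x) =
\begin{cases}
T_m(x) & \text{if } m = n-1 \\
T_m\big(R^n_{m+1}(x)\big) & \text{otherwise}
\end{cases}
```

The $R$ terms route the outputs of shallower sub-trees into the root node, so every block contributes to the final representation.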
Q(1): What is aggregation, and what is the difference between aggregation and fusion?
My ans: fusion divides into semantic fusion and spatial fusion. Semantic fusion (object information) supports inferring “what”, and spatial fusion supports inferring “where”. Aggregation is the combination of semantic fusion and spatial fusion.
Q(2): What is the difference between blocks and levels?
My ans: a block is composed of multiple network layers, and a level is composed of multiple blocks. The input/output resolution within one level is consistent, while the resolution differs across levels.
Q(3): What are the advantages of IDA and HDA, respectively?
My ans: IDA is iterative aggregation. From Figure 2(c), the shallow information in front is continuously aggregated with the deeper information behind, so the early shallow features are continuously refined and enriched with semantic information. However, IDA is sequential, so it is difficult to combine information from different blocks. HDA is hierarchical aggregation over tree-structured blocks, which better learns rich (spatial) information across feature levels at different network depths. Together, IDA and HDA can learn both semantic and spatial information.
3.3 Aggregation Elements
Aggregation nodes: their main function is to combine and compress their inputs. These nodes are trained to select and project important information so that the output keeps the same scale as a single input. In our architectures, all IDA nodes are binary (two inputs), while HDA nodes have a variable number of inputs depending on the depth of the tree.
Although an aggregation node can adopt any block or layer structure, for simplicity this paper uses a single convolution followed by batch normalization and a nonlinearity. This avoids making the aggregation structure overly complex. In classification networks, all aggregation nodes use 1×1 convolutions. In semantic segmentation, a further level of IDA is added to interpolate the feature maps, and in that case 3×3 convolutions are used.
Because residual connections are important for training very deep networks, they can also be used in aggregation nodes; however, their necessity for aggregation is less clear. Through HDA, the shortest path from any block to the root of the network is at most the depth of the hierarchy, so the aggregation paths are unlikely to suffer from vanishing or exploding gradients. In our experiments, residual connections in the nodes help HDA when the deepest hierarchy has 4 levels or more, but they hurt smaller networks. Our basic aggregation, as used in Formulas 1 and 2, is defined by N(x_1, …, x_n) = σ(BatchNorm(Σ_i W_i x_i + b)), where σ is the nonlinearity; the residual variant adds the last input before the nonlinearity, N(x_1, …, x_n) = σ(BatchNorm(Σ_i W_i x_i + b) + x_n).
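A hedged numpy sketch of such a node: the 1×1 convolution over the concatenated inputs is equivalent to summing per-input channel matmuls, and here a simple per-channel standardization stands in for batch normalization (an illustrative simplification, not the trained layer):

```python
import numpy as np

def aggregation_node(inputs, weights, bias, residual=False):
    """Aggregation node N: 1x1 conv over the inputs, normalization,
    ReLU; optionally a residual connection from the last input.

    inputs  -- list of arrays, each (C, H, W), all the same shape
    weights -- list of (C_out, C) matrices, one per input (the 1x1 convs)
    bias    -- (C_out,) bias vector
    """
    # sum of per-input 1x1 convolutions == 1x1 conv on the concatenation
    z = sum(np.einsum('oc,chw->ohw', W, x) for W, x in zip(weights, inputs))
    z += bias[:, None, None]
    # stand-in for batch norm: per-channel standardization (illustrative only)
    z = (z - z.mean(axis=(1, 2), keepdims=True)) / (z.std(axis=(1, 2), keepdims=True) + 1e-5)
    if residual:               # residual variant: add the last (deepest) input
        z = z + inputs[-1]
    return np.maximum(z, 0.0)  # ReLU

x1 = np.random.randn(4, 8, 8)
x2 = np.random.randn(4, 8, 8)
W = [np.random.randn(4, 4) * 0.1 for _ in range(2)]
y = aggregation_node([x1, x2], W, np.zeros(4), residual=True)
print(y.shape)  # (4, 8, 8): same scale as each input
```

Note that the output keeps the spatial size of the inputs, which is what lets a node be dropped anywhere into the aggregation structure.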
Blocks and levels: DLA is a general architecture because it is compatible with different backbone networks. Our framework makes no requirements on the internal structure of blocks and levels.
In the experiments, we use three different kinds of residual blocks. Basic blocks combine stacked convolution layers with an identity skip connection. Bottleneck blocks regularize the stacked convolutions by using a 1×1 convolution to reduce dimensionality. Split blocks diversify features by grouping channels into separate paths. In this work, we halve the ratio of output to intermediate channels for bottleneck and split blocks, and the cardinality of split blocks is 32. For exact details of these blocks, please refer to the cited papers.
4.1 Classification Networks
Our classification networks apply IDA and HDA on top of ResNet and ResNeXt. These are hierarchical networks, partitioned into levels by spatial resolution, with residual connections within each block. The resolution is halved at the end of each level. There are six levels in total: the first maintains the input resolution, while the last is downsampled 32×. The final feature map is reduced by global average pooling followed by a linear layer, and the final classification is produced by a softmax.
We use IDA to connect between levels and HDA within levels. These aggregations are easily combined by sharing aggregation nodes: in this case, we only need to change the root node of each hierarchy by combining Formula 1 and Formula 2. Downsampling between levels uses max pooling with size 2 and stride 2.
The earliest levels have their own simpler structure. Like DRN, we replace max pooling with strided convolution at levels 1-2. Level 1 consists of a 7×7 convolution followed by a basic block. Level 2 is just a basic block. For the other levels, we use combined IDA and HDA over the blocks and levels of the backbone.
DLA is shown in Figure 3.
DLA learns to better extract the full spectrum of semantic and spatial information from a network. The iterative connections link adjacent levels together and progressively deepen and spatially refine the representation. The hierarchical tree-structured connections cross the levels of the network to better propagate features and gradients.
4.2 Dense Prediction Networks
Semantic segmentation, edge detection, and other image-to-image tasks can all exploit aggregation to fuse local and global information. The conversion from classification DLA to fully convolutional DLA is simple, and no different from that of other architectures. We make use of interpolation and a further level of IDA to reach the output resolution required by the task.
IDA with projection and upsampling interpolates features, increasing both depth and resolution, as shown in Figure 4. All projection and upsampling parameters are learned jointly during network optimization; the upsampling steps are initialized to bilinear interpolation. We first project the outputs of levels 3-6 to 32 channels, and then interpolate the levels up to the same resolution as level 2. Finally, these levels are iteratively aggregated to learn a deep fusion of low-level and high-level features. This serves the same purpose as FCN skip connections, hypercolumn features, and FPN top-down connections, but our aggregation differs in that it progressively refines features from shallow to deep. Note that IDA is used twice in this case: first to connect the levels within the backbone, and then to recover resolution.
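A simplified numpy sketch of this project-then-upsample decoder. Two loud assumptions: nearest-neighbor upsampling stands in for the learned, bilinear-initialized upsampling, and plain addition stands in for the trained aggregation nodes; both are simplifications for illustration only:

```python
import numpy as np

def project(x, W):
    """1x1 conv projection: (C_in, H, W) -> (C_out, H, W)."""
    return np.einsum('oc,chw->ohw', W, x)

def upsample2x(x):
    """2x nearest-neighbor upsampling (a stand-in for the learned,
    bilinear-initialized upsampling used in the paper)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def dense_ida(stages, projs):
    """Project each level to a common channel width, upsample the deeper
    feature to match, and aggregate shallow-to-deep (here: addition),
    as a simplified sketch of the IDA decoder."""
    feats = [project(x, W) for x, W in zip(stages, projs)]
    out = feats[-1]                   # deepest level, lowest resolution
    for f in reversed(feats[:-1]):    # walk back toward the shallow level
        out = upsample2x(out) + f     # match resolution, then aggregate
    return out

# toy pyramid: three levels at 16x16, 8x8, 4x4 with growing channels,
# all projected to 32 channels as in the paper
stages = [np.random.randn(c, s, s) for c, s in [(16, 16), (32, 8), (64, 4)]]
projs = [np.random.randn(32, c) * 0.1 for c in (16, 32, 64)]
out = dense_ida(stages, projs)
print(out.shape)  # (32, 16, 16): shallowest resolution recovered
```

The output sits at the resolution of the shallowest level used, which is then interpolated up to the task's output resolution.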
What problem does the article solve?
It addresses how to aggregate the layers and levels of a network.
Summarize the ideas of the article in my own words:
To improve network accuracy, the DLA extension technique is proposed to fuse semantic and spatial information. DLA comprises the IDA and HDA structures. IDA is an iterative aggregation structure: shallow information is continuously aggregated with later layers to extract better features. HDA is a tree-structured aggregation over blocks that learns feature information across levels.
Critical factor:
- Tree-structured aggregation can learn rich spatial information
For my own use:
Extend networks with the DLA technique.