Dense Prediction with Attentive Feature Aggregation
Original document: https://www.yuque.com/lart/pa…
This paper, which I stumbled upon on arXiv, can be regarded as an extension of the authors' earlier work, Hierarchical Multi-Scale Attention for semantic segmentation.
Reading the paper, starting from the abstract
Aggregating information from features across different layers is an essential operation for dense prediction models.
This paper focuses on the problem of cross-layer feature fusion.
Despite its limited expressiveness, _feature concatenation dominates the choice of aggregation operations_.
Although feature concatenation itself is simple, it is usually followed by fairly complex convolutional structures.
In this paper, we introduce Attentive Feature Aggregation (AFA) to fuse different network layers with more expressive nonlinear operations. AFA exploits both spatial and channel attention to compute weighted average of the layer activations.
The core module, AFA, uses spatial and channel attention to compute a weighted sum of the features from different layers, thus constructing a nonlinear aggregation operation.
Inspired by neural volume rendering, we extend AFA with Scale-Space Rendering (SSR) to perform _late fusion of multi-scale predictions_.
An interesting point mentioned here: the structure used to fuse multi-scale predictions borrows the idea of neural rendering (an area I don't know much about).
AFA is applicable to a wide range of existing network designs.
Because AFA is a general-purpose module, it can easily be transplanted into different models to realize cross-layer feature fusion.
Our experiments show consistent and significant improvements on challenging semantic segmentation benchmarks, including Cityscapes, BDD100K, and Mapillary Vistas, at negligible computational and parameter overhead. In particular, AFA improves the performance of the Deep Layer Aggregation (DLA) model by nearly 6% mIoU on Cityscapes. Our experimental analyses show that AFA learns to progressively refine segmentation maps and to improve boundary details, leading to new state-of-the-art results on boundary detection benchmarks on BSDS500 and NYUDv2.
Both semantic segmentation and boundary detection tasks are evaluated.
Main contributions

We propose Attentive Feature Aggregation (AFA) as a nonlinear feature fusion operation to replace the prevailing tensor concatenation or summation strategies.
 Our attention module uses both spatial and channel attention to learn and predict the importance of each input signal during fusion. Aggregation is accomplished by computing a linear combination of the input features at each spatial location, weighted by their relevance.
 Compared to linear fusion operations, our AFA module can _take into consideration complex feature interactions and attend to different feature levels depending on their importance_.
 AFA introduces negligible computation and parameter overhead and can be easily used to replace fusion operations in existing methods, such as skip connections.
 Unlike linear aggregation, our AFA module leverages extracted spatial and channel information to efficiently select the essential features and to _increase the receptive field at the same time_.

Inspired by neural volume rendering [_Volume rendering, Nerf: Representing scenes as neural radiance fields for view synthesis_], we propose Scale-Space Rendering (SSR) as a novel attention computation mechanism to fuse multi-scale predictions.
 _We treat those predictions as sampled data in scale-space and design a coarse-to-fine attention concept to render final predictions._ (This idea is very interesting: obtaining the final prediction is treated as a problem of sampling predictions at different scales from the scale space and rendering them into the final result.)
 Repeated use of attention layers may lead to numerical instability or vanishing gradients. We extend the above-mentioned attention mechanism to fuse the dense predictions from multi-scale inputs more effectively.
 Our solution resembles a volume rendering scheme applied to the scale space. This scheme provides a hierarchical, coarse-to-fine strategy to combine features, leveraging a scale-specific attention mechanism. We will also show that our approach generalizes the hierarchical multi-scale attention method [_Hierarchical multi-scale attention for semantic segmentation_].
Attentive Feature Aggregation (AFA)
Two fusion forms are designed here: one for dual inputs, and one for progressive multi-input fusion. The core is spatial attention and channel attention. Note that the computations are pairwise: after one attention map is computed, a sigmoid is used to construct the relative weights of the two inputs.
For the dual-input form, spatial attention is computed from the shallow feature, since it contains rich spatial information, while channel attention is computed from the deeper feature, since it contains more complex channel semantics. For the multi-input form (only three layers are shown in the figure; more inputs can in fact be introduced), both the channel and spatial attention are computed entirely from the current layer's input, and the resulting attention weights the current output against the aggregate of the previous layers. As for the fusion order, the original text says "a feature with higher priority will have gone through a higher number of aggregations"; my understanding is that this is a deep-to-shallow process.
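A minimal numpy sketch may make the dual-input fusion concrete. The attention generators here (`w_sp` as a toy 1×1 projection, `w_ch` as a toy linear layer on the pooled feature) are simplified stand-ins of my own, not the paper's exact convolutional attention modules; the point is only the relative weighting `a * shallow + (1 - a) * deep`:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def afa_fuse(shallow, deep, w_sp, w_ch):
    # Spatial attention (1, H, W) from the shallow feature, which
    # carries rich spatial detail; w_sp stands in for a 1x1 conv.
    sp = sigmoid(np.tensordot(w_sp, shallow, axes=([1], [0])))
    # Channel attention (C, 1, 1) from the globally pooled deep feature,
    # which carries richer semantics; w_ch stands in for a linear layer.
    C = deep.shape[0]
    ch = sigmoid(w_ch @ deep.mean(axis=(1, 2))).reshape(C, 1, 1)
    # The combined attention a in (0, 1) is the relative weight of the
    # two inputs: a weighted average, not a concatenation.
    a = sp * ch
    return a * shallow + (1.0 - a) * deep

rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
shallow = rng.normal(size=(C, H, W))
deep = rng.normal(size=(C, H, W))
fused = afa_fuse(shallow, deep, rng.normal(size=(1, C)), rng.normal(size=(C, C)))
print(fused.shape)  # (4, 8, 8)
```

Because `a` lies in (0, 1), each output element is a convex combination of the two inputs, matching the "weighted average of the layer activations" described in the abstract.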
The proposed fusion module can be used in many architectures, such as DLA, U-Net, HRNet, and FCN.
Scale-Space Rendering (SSR)
The SSR proposed here is a strategy closer to model ensembling.
It fuses multi-scale inference by computing relative weights for the predictions at different scales. Two questions therefore arise:
 How is SSR learned? The paper does not spell this out, but the figure above shows training with inputs at two scales, indicating that SSR is trainable. Because it is a learnable structure, an attention parameter is predicted automatically for each input; the parameters computed for these different-scale inputs yield the final weights over the scales.
 Into which scale are the differently sized predictions finally fused? This is not stated explicitly either, but judging from the figure, where sizes are expressed relative to the original input, they should ultimately be fused at 1.0× the original input scale (consistent with the design of hierarchical multi-scale attention).
Formulation
To express the fusion of multi-scale predictions, the author first focuses on a single pixel. Assume the model provides predictions for the target pixel at $k$ different scales.
The prediction at the $i$-th scale can be expressed as $p_i \in \mathbb{R}^{d}$. The representation of the target pixel in scale space can then be defined as $p \triangleq (p_1, \dots, p_k)$. Further, assume that $i < j$ means scale $i$ is coarser than scale $j$.
The target pixel can thus be imagined as a ray traveling through scale space, from scale $1$ to scale $k$.
Based on this idea, the attention of the proposed multi-feature fusion mechanism is redesigned to mimic the volume rendering equation, where the volume is given implicitly by the scale space.
To this end, in addition to the feature representation $p_i$ at scale $i$, the model is assumed to also predict a scalar $y_i \in \mathbb{R}$ for the target pixel. In the context of volume rendering, the probability that a particle passes through scale $i$, given some non-negative scalar function $\phi: \mathbb{R} \rightarrow \mathbb{R}_{+}$, can be expressed as $e^{-\phi(y_i)}$.
The attention for scale $i$, $\alpha_i$, can then be expressed as the probability that the particle reaches scale $i$ and stops there (each scale is a Bernoulli trial: either stop, or pass through and continue):
$\alpha_i(y) \triangleq [1 - e^{-\phi(y_i)}] \prod^{i-1}_{j=1}e^{-\phi(y_j)}, \quad y \triangleq (y_1, \dots, y_k)$
$y$ collects the scalar parameters predicted for the target pixel at each scale.
$p_{final} \triangleq \sum^{k}_{i=1}p_i \alpha_i(y)$
Finally, following the volume rendering equation, the fused multi-scale prediction for the target pixel is the sum of the per-scale predictions weighted by their attention. In other words, the final feature of the target pixel fuses the feature representations of all scales, driven by $y$.
From the context, the design should ultimately fuse all scales into scale $1$.
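A small numeric sketch of the rendering weights (using $\phi(y) = |y|$, the choice adopted in the paper's experiments; the per-scale scalars `y` and predictions `preds` here are made-up toy values):

```python
import numpy as np

def ssr_weights(y, phi=np.abs):
    # alpha_i = (1 - e^{-phi(y_i)}) * prod_{j<i} e^{-phi(y_j)}:
    # the probability that the particle stops at scale i after
    # passing through all coarser scales 1..i-1.
    pass_prob = np.exp(-phi(np.asarray(y, dtype=float)))
    survive = np.concatenate(([1.0], np.cumprod(pass_prob)[:-1]))
    return (1.0 - pass_prob) * survive

y = [0.5, -1.0, 2.0]               # toy per-scale scalars y_i
alpha = ssr_weights(y)
preds = np.array([0.1, 0.5, 0.9])  # toy per-pixel predictions p_i
p_final = np.sum(alpha * preds)    # the rendering-style weighted sum
print(alpha, alpha.sum())
```

Note that without fixing $\phi(y_k) = \infty$ the weights sum to less than 1; the residual mass is the probability of passing through all $k$ scales.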
The proposed SSR can be regarded as a generalized form of hierarchical multi-scale attention (HMA, https://github.com/NVIDIA/semantic-segmentation).
The latter is obtained by setting $\phi(y_i) \triangleq \log(1 + e^{y_i})$ and fixing $\phi(y_k) \triangleq \infty$, which gives:
$$
\alpha_i = \Big[1-\frac{1}{1+e^{y_i}}\Big] \prod^{i-1}_{j=1}\frac{1}{1+e^{y_j}}, \\
\alpha_1 = 1-\frac{1}{1+e^{y_1}}, \\
\alpha_k = \prod^{k-1}_{j=1}\frac{1}{1+e^{y_j}}.
$$
Two points here are worth examining:
 At first glance the form does not look like the original HMA, which used sigmoids to fuse different scales. But note that $1 - \frac{1}{1+e^{y_i}} = \frac{e^{y_i}}{1+e^{y_i}} = \text{sigmoid}(y_i)$, so the two are in fact consistent.
 From this form, combined with the cascading of the (sigmoid) spatial attention, the output sits at position $i = 1$; that is, the information of the other layers is fused progressively in order of decreasing layer index. This roughly matches the figure below.
The input is rescaled before being fed into the model, and the final output size corresponds to 1.0× the original input size. So assume the features are fused from scale $k$ down to scale $1$, with the result output at scale $1$.
Since the attention constructed in this paper is based on the probability of passing through (rather than stopping at) the current layer, the form corresponding to the figure above is:
$$
\alpha_i = [1-p(y_i)]\prod_{j=1}^{i-1} p(y_j), \\
\alpha_1 = 1-p(y_1), \\
\alpha_k = \prod_{j=1}^{k-1} p(y_j), \\
p(y_i) = 1-\text{sigmoid}(y_i), \\
\Rightarrow P = \sum^{k}_{i=1} P_{i}\alpha_i(y).
$$
The attention weight for the first layer is thus directly the sigmoid output, while the weight of layer $k$ is the product of the complements of each layer's sigmoid output.
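The claimed generalization can be checked numerically: with $\phi(y_i) = \log(1+e^{y_i})$ and $\phi(y_k) = \infty$, the general SSR weights coincide with the cascaded-sigmoid weights above and sum to 1. A small sketch (the test values of `y` are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hma_weights(y):
    # p(y_i) = 1 - sigmoid(y_i): probability of passing through scale i.
    p = 1.0 - sigmoid(np.asarray(y, dtype=float))
    k = len(p)
    alpha = np.empty(k)
    survive = 1.0
    for i in range(k - 1):
        alpha[i] = (1.0 - p[i]) * survive
        survive *= p[i]
    alpha[k - 1] = survive  # the last scale absorbs the remaining mass
    return alpha

def ssr_softplus_weights(y):
    # General SSR weights with phi = softplus, phi(y_k) fixed to infinity.
    phi = np.log1p(np.exp(np.asarray(y, dtype=float)))
    phi[-1] = np.inf
    pass_prob = np.exp(-phi)
    survive = np.concatenate(([1.0], np.cumprod(pass_prob)[:-1]))
    return (1.0 - pass_prob) * survive

y = [0.3, -0.7, 1.2]
a_hma = hma_weights(y)
a_ssr = ssr_softplus_weights(y)
print(np.allclose(a_hma, a_ssr), np.isclose(a_hma.sum(), 1.0))  # True True
```

The equivalence holds because $e^{-\log(1+e^{y_i})} = \frac{1}{1+e^{y_i}} = 1 - \text{sigmoid}(y_i)$, i.e., exactly the pass probability $p(y_i)$ above.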
Choice of $\phi$
The function used in the experiments is the absolute value, $\phi(y_i) \triangleq |y_i|$. This choice is motivated by an analysis of the gradient flow through the attention mechanism: the authors found that existing attention mechanisms may suffer from vanishing gradients.
Rearranging the attention coefficient derived earlier:
$$
\alpha_i(y) \triangleq [1 - e^{-\phi(y_i)}] \prod^{i-1}_{j=1}e^{-\phi(y_j)} = \prod^{i-1}_{j=1}e^{-\phi(y_j)} - \prod^{i}_{j=1}e^{-\phi(y_j)}, \quad y \triangleq (y_1, \dots, y_k)
$$
Consider the derivative of the $i$-th coefficient $\alpha_i(y)$ with respect to the learnable parameter $y_l$:
$$
J_{il} \triangleq \frac{\partial \alpha_i(y)}{\partial y_l} =
\begin{cases}
-\frac{\partial e^{-\phi(y_i)}}{\partial y_l}\prod^{i-1}_{j=1}e^{-\phi(y_j)} = \frac{\partial \phi(y_i)}{\partial y_l}\prod^{i}_{j=1}e^{-\phi(y_j)} = \phi'(y_i)\prod^{i}_{j=1}e^{-\phi(y_j)} & \text{ if } l = i \\
0 & \text{ if } l > i \\
-\phi'(y_l)\prod^{i-1}_{j=1}e^{-\phi(y_j)} + \phi'(y_l)\prod^{i}_{j=1}e^{-\phi(y_j)} = -\phi'(y_l)\alpha_i(y) & \text{ if } l < i
\end{cases}
$$
Considering two scales, i.e. $k = 2$:
$$
J =
\begin{bmatrix}
\phi'(y_1)a_1 & 0 \\
-\phi'(y_1)a_1(1-a_2) & \phi'(y_2)a_1 a_2
\end{bmatrix}, \\
a_i \triangleq e^{-\phi(y_i)}.
$$
The top-left entry is the derivative of the layer-1 attention coefficient with respect to the layer-1 parameter; the top-right entry is its derivative with respect to the layer-2 parameter. As can be seen, if $a_1 \rightarrow 0$, every entry of the Jacobian vanishes, regardless of the value of $a_2$.
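A finite-difference check of this $k = 2$ Jacobian (note the minus sign on the off-diagonal entry $\partial \alpha_2 / \partial y_1$), using $\phi(y) = |y|$ and arbitrary test values of $y$:

```python
import numpy as np

def ssr_weights(y, phi=np.abs):
    # alpha_i = (1 - e^{-phi(y_i)}) * prod_{j<i} e^{-phi(y_j)}
    pass_prob = np.exp(-phi(y))
    survive = np.concatenate(([1.0], np.cumprod(pass_prob)[:-1]))
    return (1.0 - pass_prob) * survive

def jacobian_abs(y):
    # Analytic Jacobian for k = 2 with phi(y) = |y|, phi'(y) = sign(y).
    a = np.exp(-np.abs(y))  # a_i = e^{-phi(y_i)}
    s = np.sign(y)
    return np.array([
        [s[0] * a[0], 0.0],
        [-s[0] * a[0] * (1.0 - a[1]), s[1] * a[0] * a[1]],
    ])

y = np.array([0.8, -1.3])
J = jacobian_abs(y)

# Central finite differences of alpha_i w.r.t. each y_l.
eps = 1e-6
J_num = np.zeros((2, 2))
for l in range(2):
    d = np.zeros(2)
    d[l] = eps
    J_num[:, l] = (ssr_weights(y + d) - ssr_weights(y - d)) / (2 * eps)
print(np.allclose(J, J_num, atol=1e-5))  # True
```

(The test point is chosen away from $y_i = 0$, where the absolute value is not differentiable.)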
Therefore, to avoid vanishing gradients, $\phi$ must be chosen carefully. With the absolute value function, the Jacobian here does not vanish whenever $a_1 > 0$ and $(y_1, y_2) \neq (0, 0)$.
But if the absolute value function is used, its derivative is $\pm 1$; would the vanishing-gradient problem not still occur?
Considering the HMA case, with the form given by the author, we have:
$$
\phi'(y_i) = \frac{\partial \log(1+e^{y_i})}{\partial y_i} = \frac{e^{y_i}}{1+e^{y_i}} = 1 - \frac{1}{1+e^{y_i}} = 1 - e^{-\log(1+e^{y_i})} = 1 - a_i, \\
a_2 = 0.
$$
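The identity $\phi'(y_i) = 1 - a_i$ for the softplus choice can be verified numerically with a quick central-difference check:

```python
import numpy as np

# phi(y) = log(1 + e^y)  =>  phi'(y) = e^y / (1 + e^y) = 1 - a,
# where a = e^{-phi(y)} = 1 / (1 + e^y).
y = np.linspace(-5.0, 5.0, 11)
a = 1.0 / (1.0 + np.exp(y))

softplus = lambda t: np.log1p(np.exp(t))
eps = 1e-6
phi_prime = (softplus(y + eps) - softplus(y - eps)) / (2 * eps)
print(np.allclose(phi_prime, 1.0 - a, atol=1e-6))  # True
```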
Branch 2 does not participate in the attention computation, and the gradient vanishes when $a_1 \rightarrow 1$.
According to my earlier form, we have:
$$
\phi'(y_i) = \frac{\partial \log(1+e^{y_i})}{\partial y_i} = \frac{e^{y_i}}{1+e^{y_i}}, \\
a_i = e^{-\log(1+e^{y_i})} = \frac{1}{1+e^{y_i}}, \\
\phi'(y_i) = 1 - a_i.
$$
The vanishing problem therefore appears here as well.
Links
 Paper: https://arxiv.org/abs/2111.00770
 Code: http://vis.xyz/pub/dlaafa
 The idea of this paper comes from NeRF; introductions to NeRF are worth reading alongside the design of SSR.

Some information about volume rendering:
 A very rich and comprehensive Chinese CG learning material: GPU Programming and CG language primer
 A short 2021 survey on CNKI: a review of view synthesis algorithms based on neural radiance fields