Hire-MLP: Vision MLP via Hierarchical Rearrangement
This paper is very easy to read: no convoluted wording or sentence structure, and it flows smoothly from beginning to end. I like this writing style a lot.
Learning about the paper from its abstract
This paper presents Hire-MLP, a simple yet competitive vision MLP architecture via hierarchical rearrangement.
Previous vision MLPs like MLP-Mixer are inflexible with respect to input image size and inefficient at capturing spatial information, because they flatten the tokens.
The purpose of this paper:
- Remove the dependence of the previous MLP method on the input data size
- More effective way to capture spatial information
Hire-MLP improves on existing MLP-based models by proposing hierarchical rearrangement to aggregate local and global spatial information, while remaining versatile for downstream tasks.
Aggregation of local and global spatial information: the local part may rely on a convolution-like operation; but in an MLP context, if no spatial MLP is used, how is global spatial information handled? A pooling operation? Reading further, the approach is actually closer to Swin Transformer: first densely connect tokens within local regions, then associate the regions with each other.
Being more general for downstream tasks (segmentation, detection, etc.) suggests that the method adopts a multi-scale structure.
How is feature downsampling handled here? Pooling? Strided convolution? Or patch merging? Reading the paper: strided convolution is used.
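As a tiny illustration of the strided-convolution downsampling mentioned above (the channel counts here are my own, not from the paper):

```python
import torch
import torch.nn as nn

# Strided convolution as a downsampling layer between stages;
# the 64 -> 128 channel widths are illustrative, not the paper's.
down = nn.Conv2d(64, 128, kernel_size=2, stride=2)
x = torch.randn(1, 64, 56, 56)
y = down(x)  # spatial size halved, channel count doubled
```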
Specifically, the inner-region rearrangement is designed to capture local information inside a spatial region. Moreover, to enable information communication between different regions and capture global context, the cross-region rearrangement is proposed to circularly shift all tokens along spatial directions.
Part of the processing resembles Swin Transformer: after local processing, a global shift is applied to associate the different regions.
The proposed Hire-MLP architecture is built with simple channel-mixing MLPs and rearrangement operations, thus enjoying high flexibility and inference speed.
The description does not mention a spatial MLP, so how are local regions actually processed?
Experiments show that our Hire-MLP achieves state-of-the-art performance on the ImageNet-1K benchmark. In particular, Hire-MLP achieves an 83.4% top-1 accuracy on ImageNet, which surpasses previous Transformer-based and MLP-based models with a *better trade-off between accuracy and throughput*.
As can be seen, the main change is that the original spatial MLP is replaced with the Hire module.
Since the processing is region-based, the module first has to partition the feature map into blocks along the H-axis and the W-axis.
The rearrangement operation is carried out along two dimensions: one along the height and the other along the width.
As the figure shows, for the inner-region rearrangement along the height, adjacent rows (local strip regions) on the H-axis are stacked onto the channel dimension C; the processing along the width is similar. By stacking onto the channel dimension, local-region features can be processed directly with a channel MLP.
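The height-direction rearrangement can be sketched as follows (a minimal reconstruction of the idea, not the authors' code; `inner_region_height` and its argument names are mine). Rows are folded into the channel dimension in groups of `h`, a plain channel MLP mixes the stacked strip, and the original layout is then restored:

```python
import torch
import torch.nn as nn

def inner_region_height(x: torch.Tensor, h: int, mlp: nn.Module) -> torch.Tensor:
    """Inner-region rearrangement along H (sketch): stack h adjacent rows
    onto the channel dim, apply a channel-mixing MLP, restore the layout."""
    B, C, H, W = x.shape
    assert H % h == 0, "H must be divisible by the region height"
    nh = H // h
    # (B, C, H, W) -> (B, nh, W, h*C): fold h adjacent rows into channels
    y = x.reshape(B, C, nh, h, W).permute(0, 2, 4, 3, 1).reshape(B, nh, W, h * C)
    y = mlp(y)  # a plain channel MLP now sees the whole local strip
    # invert the rearrangement back to (B, C, H, W)
    return y.reshape(B, nh, W, h, C).permute(0, 4, 1, 3, 2).reshape(B, C, H, W)

# usage: region height 4 on a 12-row feature map
x = torch.randn(2, 8, 12, 12)
out = inner_region_height(x, 4, nn.Linear(4 * 8, 4 * 8))
```

The width-direction branch is the same idea with the roles of H and W swapped.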
The idea here is very interesting.
But thinking about it carefully, this can actually be seen as a decomposition of convolution. In PyTorch, implementing convolution with [nn.Unfold](https://pytorch.org/docs/stable/generated/torch.nn.Unfold.html?highlight=unfold#torch.nn.Unfold) works in a very similar way: by stacking the data of a local window onto the channel dimension and then applying a fully-connected layer, one obtains the equivalent of a larger-kernel convolution.
Here the window partitioning is non-overlapping; perhaps follow-up work will try overlapping windows.
But done that way, it would look even more like convolution.
```python
>>> # Convolution is equivalent with Unfold + Matrix Multiplication + Fold (or view to output shape)
>>> inp = torch.randn(1, 3, 10, 12)
>>> w = torch.randn(2, 3, 4, 5)
>>> inp_unf = torch.nn.functional.unfold(inp, (4, 5))
>>> out_unf = inp_unf.transpose(1, 2).matmul(w.view(w.size(0), -1).t()).transpose(1, 2)
>>> out = torch.nn.functional.fold(out_unf, (7, 8), (1, 1))
>>> # or equivalently (and avoiding a copy),
>>> # out = out_unf.view(1, 2, 7, 8)
>>> (torch.nn.functional.conv2d(inp, w) - out).abs().max()
tensor(1.9073e-06)
```
Another point: the processing of a local square window is decomposed into one-dimensional strip windows along the two directions H and W, i.e. a K×K window is split into 1×K and K×1.
It seems the many designs of convolutional models have almost exhausted the basic building blocks of model structure (^__^).
For the cross-region rearrangement, the whole feature map is circularly shifted along the H-axis or W-axis (via torch.roll). Used alone this operation seems useless, but if the inner-region processing designed earlier is applied after the rearrangement, cross-window communication between local regions is achieved.
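A minimal sketch of this shift (my own helper name; the source only says torch.roll is used). On its own it just rotates the feature map, but wrapped around a region-wise mixing step it lets tokens near a region border interact with neighbours in the adjacent region:

```python
import torch

def cross_region_shift(x: torch.Tensor, step: int, dim: int) -> torch.Tensor:
    """Circularly shift all tokens `step` positions along one spatial axis
    of an NCHW tensor (dim=2 shifts along H, dim=3 along W)."""
    return torch.roll(x, shifts=step, dims=dim)

# Typical usage pattern (region_wise_mixing stands for any inner-region step):
#   x = cross_region_shift(x, s, dim=2)   # shift
#   x = region_wise_mixing(x)             # regions now straddle old borders
#   x = cross_region_shift(x, -s, dim=2)  # restore token positions
```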
However, there is an issue worth noting: the local-region processing here is only applied after the window features are shifted, and no local processing is applied to the features before the shift. A more reasonable form might be: inner-window processing → window feature shift → inner-window processing → shift back to restore positions → inner-window processing (optional). Such interleaved processing could cover a wider spatial extent, instead of the window processing always corresponding to fixed regions as it does now.
The ablation experiments mainly discuss the following points:
- Number of regions: by default, the region widths along the H-axis and W-axis are the same. Smaller regions emphasize local information more. In the experiments, a larger region width is used empirically in the shallower layers to obtain a larger receptive field.
The results show that as the region width keeps increasing, performance declines. The authors speculate that as the region size grows, some information may be lost in the bottleneck structure.
- Discussion of the step size s of the cross-region shift.
A slightly larger shift in the shallow layers works better, perhaps because enlarging the receptive field of shallow features brings some benefit.
- Discussion of different forms of padding. For a 224×224 input, the stage-4 feature map is 7×7, which cannot be evenly divided into regions; so under this paper's non-overlapping-window setting, padding is necessary. Several strategies are compared.
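A small sketch of the padding step (my own helper, not the paper's code): pad the H-axis up to the next multiple of the region height, with the padding `mode` playing the role of the strategies the ablation compares:

```python
import torch
import torch.nn.functional as F

def pad_to_multiple(x: torch.Tensor, h: int, mode: str = "replicate") -> torch.Tensor:
    """Pad the H axis of an NCHW tensor so it divides evenly into regions
    of height h. `mode` can be 'constant' (zero), 'replicate', 'circular',
    etc. -- the kind of choices a padding ablation would compare."""
    H = x.shape[2]
    pad = (h - H % h) % h
    if pad == 0:
        return x
    # F.pad's last-two-dims tuple is (W_left, W_right, H_top, H_bottom)
    return F.pad(x, (0, 0, 0, pad), mode=mode)

# e.g. a 7x7 stage-4 feature with region height 2 gets padded to 8x7
y = pad_to_multiple(torch.randn(1, 3, 7, 7), 2)
```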
- The importance of the different branches in the Hire module.
The within-region processing turns out to be very important, which is understandable: without this local operation, a bare cross-window shift is meaningless to a channel MLP, because the channel MLP is a point-wise operation.
- Different forms of cross-window communication.
Here the shift (which preserves some adjacency, i.e. relative position information) is compared with a ShuffleNet-style inter-group shuffle. The relative position information proves important.
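The difference between the two alternatives can be made concrete with a sketch (my own code; `spatial_shuffle` is a made-up name). A ShuffleNet-style shuffle applied along the spatial axis interleaves row groups and destroys adjacency, whereas torch.roll keeps every row's neighbours:

```python
import torch

def spatial_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """ShuffleNet-style shuffle applied along H: split the rows into
    `groups` groups and interleave them, discarding original adjacency."""
    B, C, H, W = x.shape
    return x.reshape(B, C, groups, H // groups, W).transpose(2, 3).reshape(B, C, H, W)

# torch.roll(x, s, dims=2) preserves relative row order (up to wrap-around);
# the shuffle above does not -- which the ablation finds matters.
```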
- Hire-MLP: Vision MLP via Hierarchical Rearrangement: https://arxiv.org/pdf/2108.13341.pdf