

Hire-MLP: Vision MLP via Hierarchical Rearrangement

Original document: https://www.yuque.com/lart/pa…


This article is very easy to read: no complex vocabulary or sentence structures, and it flows smoothly from beginning to end. I like this writing style very much.

Understanding the article from its abstract

This paper presents Hire-MLP, a simple yet competitive vision MLP architecture via hierarchical rearrangement.


  1. What is it? A simplified processing strategy for the spatial MLP.
  2. Why? Divide and conquer to simplify the computation.
  3. How? Replace the original undifferentiated global spatial MLP with local processing inside each region, followed by communication across regions.

Previous vision MLPs like MLP-Mixer are not flexible for various image sizes and are inefficient at capturing spatial information by flattening the tokens.

The purpose of this paper:

  1. Remove the previous MLP methods' dependence on the input size
  2. Capture spatial information more effectively

Hire-MLP innovates the existing MLP-based models by proposing the idea of hierarchical rearrangement to aggregate the local and global spatial information while being versatile for downstream tasks.

Integrating local and global spatial information: local processing might rely on convolution-like operations; in an MLP context, if a spatial MLP is not used, how is global spatial information handled? Pooling? Judging from later sections, the approach is actually closer to Swin Transformer: local regions are densely connected first, and then the regions are related to each other.
Being more general for downstream tasks (segmentation, detection, etc.) suggests that the method adopts a multi-scale structure.
How is feature downsampling handled here? Pooling? Strided convolution? Or patch merging? According to the paper, strided convolution is used.
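
The downsampling between stages can be sketched as a strided convolution. The channel counts and kernel/stride values below are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

# Hypothetical sketch: inter-stage downsampling via a strided convolution.
# Channel counts and kernel/stride here are assumptions for illustration.
downsample = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=2, stride=2)

x = torch.randn(1, 64, 56, 56)  # (B, C, H, W) feature map
y = downsample(x)               # spatial resolution halved, channels doubled
print(y.shape)                  # torch.Size([1, 128, 28, 28])
```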

Specifically, the inner-region rearrangement is designed to capture local information inside a spatial region. Moreover, to enable information communication between different regions and capture global context, the cross-region rearrangement is proposed to circularly shift all tokens along spatial directions.

Some of this processing resembles Swin Transformer: after local processing, a global shift is applied to relate the different regions.

The proposed Hire-MLP architecture is built with simple channel-mixing MLPs and rearrangement operations, and thus enjoys high flexibility and inference speed.

The description here does not seem to mention a spatial MLP, so how are local regions handled?

Experiments show that our Hire-MLP achieves state-of-the-art performance on the ImageNet-1K benchmark. In particular, Hire-MLP achieves an 83.4% top-1 accuracy on ImageNet, which surpasses previous Transformer-based and MLP-based models with a better trade-off between accuracy and throughput.

Main content

[Figure: overall Hire-MLP architecture]

As can be seen, the main change is that the original spatial MLP is replaced with the Hire module.

Hierarchical Rearrangement

The processing here is region-based, so the module first needs to divide the feature into blocks along the H and W axes.

[Figure: inner-region and cross-region rearrangement along height and width]

The rearrangement here operates along two dimensions, one along the height and the other along the width.

Inner-region Rearrangement

As the figure shows, for inner-region rearrangement in the height direction, adjacent rows (local strip regions) along the H axis are stacked onto the channel dimension C; the width direction is handled similarly. By stacking onto the channel dimension, local region features can be processed directly with a channel MLP.
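
A minimal sketch of the height-direction inner-region rearrangement described above, assuming the region height divides H evenly; the bottleneck shape of the toy channel MLP is an assumption for illustration:

```python
import torch
import torch.nn as nn

def inner_region_h(x, h, mlp):
    # x: (B, C, H, W); h: region height (assumed to divide H evenly here)
    B, C, H, W = x.shape
    # group adjacent rows into regions of height h and fold them onto the channel dim
    x = x.reshape(B, C, H // h, h, W)                              # (B, C, H/h, h, W)
    x = x.permute(0, 3, 1, 2, 4).reshape(B, h * C, H // h, W)      # (B, h*C, H/h, W)
    x = mlp(x)                                                     # channel MLP mixes within each region
    # unfold back to the original layout
    x = x.reshape(B, h, C, H // h, W).permute(0, 2, 3, 1, 4).reshape(B, C, H, W)
    return x

# toy usage: channel MLP as a 1x1-conv bottleneck (shape is an assumption)
h, C = 2, 8
mlp = nn.Sequential(nn.Conv2d(h * C, h * C // 2, 1), nn.ReLU(), nn.Conv2d(h * C // 2, h * C, 1))
x = torch.randn(1, C, 4, 4)
out = inner_region_h(x, h, mlp)
print(out.shape)  # torch.Size([1, 8, 4, 4])
```

With an identity in place of the MLP, the rearrangement and its inverse reproduce the input exactly, which confirms the fold/unfold pair is lossless.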

The idea here is very interesting.

But if you think about it carefully, this can actually be seen as a decomposition of convolution. In PyTorch, implementing convolution with [nn.Unfold](https://pytorch.org/docs/stable/generated/torch.nn.Unfold.html?highlight=unfold#torch.nn.Unfold) works in a similar way: stack the data of a local window onto the channel dimension, then apply a fully connected layer, which is equivalent to a convolution with a larger kernel.

Here, the windows are non-overlapping; perhaps follow-up work will try overlapping forms.

But in that case, it becomes even more like convolution.

```python
import torch

# Convolution is equivalent to Unfold + matrix multiplication + Fold
# (or a view to the output shape). Example from the PyTorch nn.Unfold docs.
inp = torch.randn(1, 3, 10, 12)
w = torch.randn(2, 3, 4, 5)
inp_unf = torch.nn.functional.unfold(inp, (4, 5))
out_unf = inp_unf.transpose(1, 2).matmul(w.view(w.size(0), -1).t()).transpose(1, 2)
out = torch.nn.functional.fold(out_unf, (7, 8), (1, 1))
# or equivalently (and avoiding a copy):
# out = out_unf.view(1, 2, 7, 8)
print((torch.nn.functional.conv2d(inp, w) - out).abs().max())  # ~0 (numerical error only)
```

Another point: the processing of a local square window is decomposed into one-dimensional strip windows along the two directions H and W, i.e. a K×K window is split into 1×K and K×1.

It seems that the various designs of convolutional models have almost exhausted the basic units of model structure (^__^).

Cross-region Rearrangement

For cross-region rearrangement, the whole feature is circularly shifted along the H or W axis (torch.roll). Used alone, this operation seems useless, but when the previously designed inner-region processing is applied after the rearrangement, cross-window communication between local regions is achieved.
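
The cross-region rearrangement can be sketched with torch.roll on a toy one-column feature; the shift step s here is an illustrative value:

```python
import torch

# Minimal sketch of cross-region rearrangement: circularly shift tokens along
# the H axis by step s, so that a subsequent inner-region step mixes tokens
# across the old region boundaries. The step s is a hyperparameter.
x = torch.arange(6).view(1, 1, 6, 1).float()       # (B, C, H, W) toy feature
s = 2
shifted = torch.roll(x, shifts=s, dims=2)          # circular shift along height
restored = torch.roll(shifted, shifts=-s, dims=2)  # inverse shift restores order
print(shifted.flatten().tolist())  # [4.0, 5.0, 0.0, 1.0, 2.0, 3.0]
print(torch.equal(restored, x))    # True
```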

However, note one issue: local processing is only applied after the window features are shifted; local processing of the features before the shift is not considered. A more reasonable form might be: inner-window processing -> window feature shift -> inner-window processing -> inverse shift to restore positions -> inner-window processing (optional). Such interleaved processing could cover a wider area, unlike the current design, where window processing always corresponds to fixed regions.

Experimental results


Ablation experiments mainly discussed the following points:

  • Number of regions: by default, the windows divided along the H and W axes have the same width. Smaller windows emphasize local information more. In the experiments, a larger window width is used empirically in shallower layers to obtain a larger receptive field.

[Table: ablation on region (window) size]

It can be seen that as the window width increases, performance declines. The authors speculate that as the region size grows, some information may be lost in the bottleneck structure.

  • The step s of the cross-region shift.

[Table: ablation on the shift step s]

It can be seen that a slightly larger shift in shallow layers works better, perhaps because enlarging the receptive field of shallow features brings some benefit.

  • Different forms of padding. For a 224 input, the feature size at stage 4 is 7×7, which cannot be divided evenly into windows. Under this paper's non-overlapping window setting, padding is therefore necessary, and several strategies are compared.
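
The divisibility issue can be illustrated with zero padding; the paper compares several padding modes, and zero padding plus a region height of 2 are used here only as assumptions for the sketch:

```python
import torch
import torch.nn.functional as F

# At stage 4, a 224 input gives 7x7 features, which a region height of,
# say, 2 does not divide evenly, so the H axis must be padded first.
# Zero padding is just one of the strategies compared in the ablation.
x = torch.randn(1, 16, 7, 7)
h = 2
pad = (h - x.shape[2] % h) % h     # rows needed to make H divisible by h
x_pad = F.pad(x, (0, 0, 0, pad))   # pad the bottom of the H axis with zeros
print(x_pad.shape)                 # torch.Size([1, 16, 8, 7])
```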

[Table: ablation on padding modes]

  • The importance of the different branches in the Hire module.

[Table: ablation on the branches of the Hire module]

It can be seen that inner-region processing is very important, which is understandable: without this local operation, a simple cross-window shift is meaningless to a channel MLP, because the channel MLP is a pointwise operation.

  • Different forms of cross-window communication.

[Table: shift vs. shuffle for cross-region communication]

Here, the shift (which preserves some adjacency, i.e. relative position information) is compared with the inter-group shuffle used in ShuffleNet. Relative position information is evidently important.

