Facebook AI proposes TimeSformer: a video understanding framework based entirely on the Transformer


The content of this article comes from the paper “Is Space-Time Attention All You Need for Video Understanding?”. Some of my own understanding was added while writing it up. That understanding may be incomplete or wrong in places, so you are welcome to point out mistakes in the comments. Thank you!

What can TimeSformer do?

Recently, Facebook AI proposed a new video understanding architecture called TimeSformer (Time-Space Transformer), which is built entirely on the Transformer. Since the Transformer was proposed, it has been widely used in NLP, where it is the most common approach to machine translation and language understanding.

TimeSformer achieves SOTA results on several challenging action recognition benchmarks. The paper evaluates it on Kinetics-400, Kinetics-600, Something-Something-V2, Diving-48 and HowTo100M. Compared with modern 3D convolutional neural networks, TimeSformer trains about three times faster, and its inference cost is about one tenth of theirs.

In addition, the scalability of TimeSformer makes it possible to train larger models on longer video clips. Current 3D CNNs can process clips of only a few seconds at most, whereas TimeSformer can be trained on clips that are several minutes long. This paves the way for future AI systems to understand more complex human behavior.

How does the Transformer differ from convolution?

Traditional video classification models use 3D convolution kernels to extract features, while TimeSformer is based on the self-attention mechanism of the Transformer, which lets it capture spatiotemporal dependencies across the whole video. To apply the Transformer to video, the model treats the input as a spatiotemporal sequence of patches extracted from each frame. This is very similar to how Transformers are used in NLP. An NLP Transformer infers the meaning of each word by comparing it with all the other words in the sentence, which is what self-attention means. In the same way, this model obtains the semantics of each patch by comparing it with the other patches in the video, so it can capture both the local dependencies between adjacent patches and the global dependencies between distant patches.
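
To make the idea of a "spatiotemporal sequence of patches" concrete, here is a minimal sketch that only enumerates patch positions for a toy clip. The function name `patch_sequence` and the tiny sizes (2 frames of 4×4 pixels, 2×2 patches) are assumptions chosen for illustration, not values from the paper.

```python
def patch_sequence(num_frames, height, width, patch_size):
    """List the (frame, row, col) index of every non-overlapping patch."""
    if height % patch_size or width % patch_size:
        raise ValueError("frame size must be divisible by patch size")
    rows = height // patch_size
    cols = width // patch_size
    # Frame-major order: all patches of frame 0, then frame 1, and so on.
    return [(t, r, c)
            for t in range(num_frames)
            for r in range(rows)
            for c in range(cols)]

# Toy clip: 2 frames of 4 x 4 pixels cut into 2 x 2 patches.
tokens = patch_sequence(num_frames=2, height=4, width=4, patch_size=2)
print(len(tokens))   # 8
print(tokens[:5])    # [(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 0, 0)]
```

Each tuple plays the role that a word position plays in a sentence: it is one token that self-attention can compare with the others.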

As we all know, training a Transformer consumes a lot of resources. To keep the computational cost low, TimeSformer relies on two techniques:

  1. The video is decomposed into a small set of non-overlapping patches;
  2. A special form of self-attention is used to avoid exhaustively comparing every pair of patches. This technique is called divided space-time attention. In temporal attention, each patch attends only to the patches at the same spatial position in the other frames. In spatial attention, each patch attends only to the patches of its own frame. The authors also found that divided space-time attention works better than joint space-time attention.
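
A rough way to see the saving is to count how many patches one query patch is compared with under each scheme. The sketch below is only a counting exercise under stated assumptions: it ignores any classification token, it counts the query patch itself in each step, and the example sizes (8 frames, 196 patches per frame) are illustrative values rather than numbers given in this article.

```python
def comparisons_per_patch(patches_per_frame, num_frames):
    """Query-key comparisons for one patch under joint and divided attention."""
    # Joint space-time attention: every patch of every frame.
    joint = patches_per_frame * num_frames
    # Divided attention: same position across frames, then same frame.
    divided = num_frames + patches_per_frame
    return joint, divided

# Illustrative clip: 8 frames, 14 x 14 = 196 patches per frame (assumed).
joint, divided = comparisons_per_patch(patches_per_frame=196, num_frames=8)
print(joint, divided)  # 1568 204
```

The gap widens as clips get longer, because the joint count grows with the product of the two sizes while the divided count grows with their sum.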

In the field of CV, convolution has the following drawbacks compared with the Transformer:

  1. Convolution has strong inductive biases (such as local connectivity and translation equivariance). These are undoubtedly helpful on relatively small training sets, but when enough data is available they limit the expressive power of the model. Compared with CNNs, the Transformer has weaker inductive biases, which makes it better suited to very large datasets.
  2. Convolution kernels are designed to capture local spatiotemporal information, so they cannot model dependencies beyond the receptive field. Stacking convolutions and deepening the network does enlarge the receptive field, but these strategies still model long-range dependencies only indirectly, by aggregating information over short ranges. In contrast, self-attention can capture both local and global long-range dependencies by directly comparing features at all spatial and temporal positions.
  3. Training deep CNNs on high-resolution, long videos is very expensive. Some recent studies have found that, for still images, training and inference with Transformers are faster than with CNNs, so the same computing budget can be used to train a network with greater capacity.

How is it implemented?

Building on the image model Vision Transformer (ViT), this work extends the self-attention mechanism from image space to the 3D space-time volume.

Video self attention module

In the figure below, we can see clearly how the attention mechanism works

Different ways of applying attention

In the figure, the blue patch is the query patch, the patches in other colors are the ones it attends to under each self-attention strategy, and the uncolored patches are not used. When a strategy shows several colors, each color marks attention applied separately along a different dimension or over a different neighborhood. For example, T+S means temporal attention first and then spatial attention, and L+G is read the same way. Only three frames are shown here, but the schemes are applied to the whole sequence.

After splitting the input into patches, five different attention schemes are studied:

  1. Space-only attention (S): each patch attends only to the patches in the same frame
  2. Joint space-time attention (ST): each patch attends to all patches in all frames
  3. Divided space-time attention (T+S): temporal attention is applied first, among the patches at the same spatial position in different frames, and then spatial attention is applied among all patches in the same frame
  4. Sparse local-global attention (L+G): local attention is first computed over the neighboring H/2 × W/2 patches in all frames, and then sparse global attention is computed over the whole sequence with a stride of 2 patches in space. This can be regarded as a faster approximation of full space-time attention
  5. Axial attention (T+W+H): attention is applied first along the time dimension, then among the patches in the same row (along the width), and finally among the patches in the same column (along the height)
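
The schemes differ only in which patches a query is allowed to see. The sketch below spells this out for four of them on a toy grid. It is an illustration under assumptions: the function name `neighborhood` and the grid sizes are made up for the example, each set includes the query itself, and L+G is left out because its exact windowing is not specified precisely enough here to reproduce.

```python
def neighborhood(scheme, query, num_frames, rows, cols):
    """Return one set of (frame, row, col) patches per attention step."""
    t, r, c = query
    same_frame = {(t, i, j) for i in range(rows) for j in range(cols)}
    same_position = {(k, r, c) for k in range(num_frames)}
    if scheme == "S":
        return [same_frame]
    if scheme == "ST":
        return [{(k, i, j) for k in range(num_frames)
                 for i in range(rows) for j in range(cols)}]
    if scheme == "T+S":
        return [same_position, same_frame]
    if scheme == "T+W+H":
        same_row = {(t, r, j) for j in range(cols)}   # along the width
        same_col = {(t, i, c) for i in range(rows)}   # along the height
        return [same_position, same_row, same_col]
    raise ValueError(f"unknown scheme: {scheme}")

# Toy clip: 3 frames, each a 4 x 4 grid of patches; query in the middle frame.
sizes = {s: [len(step) for step in neighborhood(s, (1, 2, 3), 3, 4, 4)]
         for s in ("S", "ST", "T+S", "T+W+H")}
print(sizes)  # {'S': [16], 'ST': [48], 'T+S': [3, 16], 'T+W+H': [3, 4, 4]}
```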

I believe most readers already understand how the Transformer uses Q, K and V. The schemes here only change which inputs take part in self-attention, so the formulas are not reproduced. Interested readers can consult the original paper.
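
For readers who want a reminder anyway, here is a minimal NumPy sketch of how restricting the inputs turns ordinary scaled dot-product attention into divided space-time attention. It is a simplified illustration, not the paper's implementation: it uses a single head, random projection matrices, no layer normalization, no MLP, no positional embedding and no classification token, and all names and sizes are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over the second-to-last axis."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def divided_space_time_attention(x, time_weights, space_weights):
    """x: (frames, patches, dim). Temporal attention first, then spatial."""
    # Temporal step: patches go on the batch axis, so each patch only sees
    # the same spatial position in every frame.
    xt = np.swapaxes(x, 0, 1)                 # (patches, frames, dim)
    xt = xt + attention(xt, *time_weights)    # residual connection
    x = np.swapaxes(xt, 0, 1)                 # back to (frames, patches, dim)
    # Spatial step: frames on the batch axis, so each patch only sees its own frame.
    return x + attention(x, *space_weights)

rng = np.random.default_rng(0)
frames, patches, dim = 4, 9, 8                # toy sizes
x = rng.standard_normal((frames, patches, dim))
time_weights = [rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3)]
space_weights = [rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3)]
out = divided_space_time_attention(x, time_weights, space_weights)
print(out.shape)  # (4, 9, 8)
```

The same `attention` function is used in both steps. Only the axis that is treated as the sequence changes, which is exactly the sense in which the schemes "change the input" rather than the formula.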

Experimental part

Analysis of the self-attention schemes

The five self-attention schemes are compared on the K400 and SSv2 datasets, and the table reports video-level classification accuracy. Divided space-time attention performs best.

The table also shows that on K400, using only spatial information already works well, which earlier work had observed too. On SSv2, however, using only spatial information performs very poorly. This shows the importance of temporal modeling.

Influence of image size and video length

With the patch size fixed, a larger image yields more patches, and more frames likewise feed more data into the attention mechanism. The authors study how both factors affect the final performance, and the result is that accuracy improves clearly as more input information is provided.

Because of GPU memory limits, clips longer than 96 frames could not be tested. The authors note that this is already a big step forward, since current convolutional models are generally limited to inputs of 8 to 32 frames.

Importance of pre-training and dataset size

Because this model needs a very large amount of data to train, the authors' attempt to train it from scratch did not succeed. Therefore, all results reported in the paper use ImageNet pre-training.

To study the influence of dataset size, two datasets were used. Four training subsets were formed, using 25%, 50%, 75% and 100% of the data respectively. The result is that TimeSformer does not perform well with less data and performs well with more data.

Comparison with SOTA

Three variants of the model are used in this section:

  1. TimeSformer: input 8×224×224, where 8 is the number of frames
  2. TimeSformer-HR: high spatial resolution, input 16×448×448
  3. TimeSformer-L: long temporal range, input 96×224×224
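
To get a feel for how much larger the two bigger variants are, the sketch below counts the patches each input produces. The patch side of 16 pixels is an assumption (this article does not state it), and the classification token is ignored.

```python
PATCH = 16  # assumed patch side in pixels; not stated in this article

variants = {
    "TimeSformer": (8, 224, 224),
    "TimeSformer-HR": (16, 448, 448),
    "TimeSformer-L": (96, 224, 224),
}

def num_patches(frames, height, width, patch=PATCH):
    """Number of non-overlapping patches in a clip of the given size."""
    return frames * (height // patch) * (width // patch)

counts = {name: num_patches(*shape) for name, shape in variants.items()}
for name, count in counts.items():
    print(name, count)
# TimeSformer 1568
# TimeSformer-HR 12544
# TimeSformer-L 18816
```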

Video classification on the K400 dataset reaches SOTA.

The results on the K600 dataset also reach SOTA.

Looking at the results on SSv2 and Diving-48, the model does not achieve the best result on SSv2. The authors point out that the proposed method uses a completely different design, so they consider these results promising for such a challenging dataset, with room for further development.

Long term modeling in video

The authors also verify the advantage of the proposed model over CNNs for long-term video modeling. This experiment uses the HowTo100M dataset.

In the table, "# input frames" is the number of frames fed to the model, "single clip coverage" is the length of video that one input clip spans, and "# test clips" is the number of clips that must be cut from the input video at prediction time so that the network sees the whole video. It can be seen that with 96 input frames, TimeSformer makes effective use of the long-range dependencies in the video and achieves the best result.
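
The relation between these three columns is simple arithmetic, sketched below. All numbers are hypothetical and chosen only for illustration: a sampling stride of 32 frames, 30 frames per second and a 400-second video are assumptions, not values taken from this article, and non-overlapping clips are assumed.

```python
import math

def clip_coverage_seconds(num_frames, frame_stride, fps):
    """Seconds of video spanned by one clip that keeps every frame_stride-th frame."""
    return num_frames * frame_stride / fps

def clips_needed(video_seconds, num_frames, frame_stride, fps):
    """Number of non-overlapping clips required to cover the whole video."""
    coverage = clip_coverage_seconds(num_frames, frame_stride, fps)
    return math.ceil(video_seconds / coverage)

# Hypothetical setting: stride 32, 30 fps, 400-second video.
for frames in (8, 96):
    coverage = clip_coverage_seconds(frames, frame_stride=32, fps=30)
    print(frames, round(coverage, 1), clips_needed(400, frames, 32, 30))
# 8 8.5 47
# 96 102.4 4
```

More input frames mean each clip covers more of the video, so fewer clips are needed at test time and each prediction can draw on longer-range context.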

References

  • ViT (Vision Transformer): https://arxiv.org/abs/2010.11929
  • The original paper: https://arxiv.org/pdf/2102.05095.pdf
  • Code: https://github.com/lucidrains/TimeSformer-pytorch

A closing note: if you found this article helpful, you are welcome to support me with a comment. Thank you! You are also welcome to follow my official account: algorithm brother Chris.
