An Image Is Worth 16×16 Words: Transformers for Image Recognition at Scale (a brief review of an ICLR 2021 paper)

Time: 2021-04-21

By Stan Kriventsov
Compiled by Flin
Source: Medium

In this blog post, I would like to explain, without going into too much technical detail, the significance of the new paper "An Image Is Worth 16×16 Words: Transformers for Image Recognition at Scale", submitted to the ICLR 2021 conference (anonymous at the time of writing).

In another article, I provide an example of using this new model (called the Vision Transformer) with PyTorch to make predictions on the standard MNIST dataset.

Deep learning (machine learning with neural networks that have more than one hidden layer) has been around since the 1960s. The real breakthrough, however, came in 2012, when AlexNet, a convolutional network designed by Alex Krizhevsky (put simply, a network that first finds small patterns in each part of an image and then combines them into a whole), won the annual ImageNet image classification competition.

In the following years, deep computer vision went through a genuine revolution, with new convolutional architectures (GoogLeNet, ResNet, DenseNet, EfficientNet, and so on) appearing every year and setting records on ImageNet and other benchmark datasets such as CIFAR-10 and CIFAR-100.

The figure below shows the progress in top-1 accuracy (the accuracy of correctly predicting the content of an image on the first attempt) of machine learning models on the ImageNet dataset since 2011.

However, the most interesting deep learning development of the past few years has come not from images but from natural language processing (NLP), in the form of the transformer, first proposed by Ashish Vaswani et al. in the 2017 paper "Attention Is All You Need".

Attention refers to trainable weights that model the importance of each connection between different parts of the input sentence. Its impact on NLP has been comparable to that of convolutional networks in computer vision, dramatically improving machine learning models on a variety of language tasks such as natural language understanding and machine translation.

Attention is particularly effective for language data because understanding human language often requires tracking long-range dependencies. We might first say "we arrived in New York" and later add "the weather in the city is fine". For any human reader it is clear that "the city" in the second sentence refers to New York, but for a model that only looks for patterns in nearby data (such as a convolutional network), this connection may go undetected.

The long-range dependency problem can be partly addressed with recurrent networks such as LSTMs, which were in fact the top models in NLP before the arrival of the transformer, but even they struggle to link specific distant words.

The global attention in the transformer measures the importance of every connection between any two words in the text, which explains its performance advantage. Recurrent networks remain competitive and may still be the better choice for sequential data where attention matters less, such as daily sales or stock prices.
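As a rough illustration of global attention (a minimal sketch in PyTorch, not code from the paper), the core idea is a pairwise score between every token and every other token:

```python
import torch
import torch.nn.functional as F

# Toy scaled dot-product attention: every token attends to every other token.
# Shapes are illustrative only; real transformers add multiple heads and
# learned projections for queries, keys and values.
tokens, dim = 6, 8                       # e.g. 6 words, 8-dimensional embeddings
x = torch.randn(tokens, dim)

q, k, v = x, x, x                        # in practice these come from learned linear layers
scores = q @ k.t() / dim ** 0.5          # (tokens, tokens) pairwise importance
weights = F.softmax(scores, dim=-1)      # each row sums to 1
output = weights @ v                     # weighted mix of information from all tokens
print(weights.shape)                     # torch.Size([6, 6])
```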

While long-range dependencies may be especially important in NLP and other sequence models, they cannot be ignored in image tasks either: understanding the various parts of an image is usually necessary to form a complete picture.

So far, the reason attention models have not performed as well in computer vision is the difficulty of scaling them: attention cost grows as N², so the full set of attention weights between the pixels of a 1000×1000 image (one million pixels) would have a trillion entries.

Perhaps more importantly, unlike words in text, individual pixels are not very meaningful by themselves, so connecting them through attention does not accomplish much.

The new paper proposes applying attention not to pixels but to small patches of the image (probably the 16×16 of the title, although in practice the optimal patch size depends on the image size and content).
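As a rough sketch of the patch idea (illustrative code, not the authors' implementation), splitting a 224×224 image into 16×16 patches shrinks the sequence the model attends over from tens of thousands of pixels to a couple of hundred patches:

```python
import torch

# Minimal sketch: split an image into non-overlapping 16x16 patches,
# the "words" the Vision Transformer attends over.
image = torch.randn(3, 224, 224)      # (channels, height, width)
patch = 16

patches = image.unfold(1, patch, patch).unfold(2, patch, patch)    # (3, 14, 14, 16, 16)
patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3 * patch * patch)
print(patches.shape)                  # torch.Size([196, 768]) -- 196 patches vs 50,176 pixels
```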

The figure above (from the paper) shows how the Vision Transformer works.

Each patch of the input image is flattened and mapped with a linear projection matrix, and a position embedding (a learned value containing information about the patch's original location in the image) is added to it. This is necessary because the transformer processes all inputs regardless of their order, so this positional information helps the model weigh the attention correctly. An additional class token is concatenated to the input (position 0 in the figure) as a placeholder for the class to be predicted in the classification task.
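A minimal sketch of this input pipeline (dimensions and parameter names are illustrative, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

# Flatten patches, project them linearly, prepend a class token,
# and add learned position embeddings.
num_patches, patch_dim, embed_dim = 196, 768, 192

proj = nn.Linear(patch_dim, embed_dim)                 # linear projection of flattened patches
cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim)) # placeholder for the class prediction
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))  # learned positions

patches = torch.randn(1, num_patches, patch_dim)       # a batch of flattened patches
tokens = proj(patches)                                 # (1, 196, 192)
tokens = torch.cat([cls_token, tokens], dim=1)         # class token at position 0
tokens = tokens + pos_embed                            # add position information
print(tokens.shape)                                    # torch.Size([1, 197, 192])
```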

As in the 2017 version, the transformer encoder consists of multiple layers of attention, normalization, and fully connected (MLP) blocks with residual (skip) connections, as shown in the right half of the figure.
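A minimal PyTorch sketch of one such encoder block (illustrative, using the built-in nn.MultiheadAttention rather than the paper's code):

```python
import torch
import torch.nn as nn

# One encoder block: layer norm, multi-head self-attention and an MLP,
# each wrapped in a residual (skip) connection.
class EncoderBlock(nn.Module):
    def __init__(self, dim=192, heads=3, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # residual connection
        x = x + self.mlp(self.norm2(x))                      # residual connection
        return x

block = EncoderBlock()
print(block(torch.randn(1, 197, 192)).shape)   # torch.Size([1, 197, 192])
```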

Within each attention layer, multiple heads can capture different patterns of connections. If you are interested in learning more about transformers, I recommend this excellent article by Jay Alammar.

The fully connected MLP head at the output provides the desired class prediction. As is common nowadays, the main model can be pretrained on large image datasets, and the final MLP head can then be fine-tuned to a specific task through standard transfer learning.
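A minimal sketch of that transfer learning step (hypothetical names; nn.Identity stands in for whatever pretrained encoder is used): freeze the pretrained backbone and train only a new head.

```python
import torch
import torch.nn as nn

# Keep the pretrained encoder frozen and train only a new head for the task.
embed_dim, num_classes = 192, 10

pretrained_encoder = nn.Identity()        # stand-in for a pretrained ViT backbone
for p in pretrained_encoder.parameters():
    p.requires_grad = False               # freeze the backbone

head = nn.Linear(embed_dim, num_classes)  # new head, trained on the target task

tokens = torch.randn(1, 197, embed_dim)   # encoder output (class token at position 0)
logits = head(pretrained_encoder(tokens)[:, 0])   # predict from the class-token output
print(logits.shape)                       # torch.Size([1, 10])
```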

One notable property of the new model is that not only does it reach the same prediction accuracy as convolutional approaches with less computation, but its performance also keeps improving as it is trained on more and more data, more so than other models.

The authors trained the Vision Transformer on Google's private JFT-300M dataset containing 300 million images, achieving state-of-the-art accuracy on a number of benchmarks. One can expect this pretrained model to be released soon so that we can all try it out.

It is really exciting to see this new application of neural attention in the field of computer vision! Hopefully even greater progress will be built on this development in the coming years!

Link to the original text: https://medium.com/swlh/an-image-is-worth-16×16-words-transformers-for-image-recognition-at-scale-brief-review-of-the-8770a636c6a8

