NLP's ImageNet era has arrived


Abstract: NLP is about to change dramatically. Are you ready?

The field of natural language processing (NLP) is changing.

Word vectors, long the core representation technique in NLP, are being challenged by a series of new methods: ELMo, ULMFiT and the OpenAI Transformer. These methods mark a watershed: their impact on NLP may be as broad as that of pre-trained ImageNet models on computer vision.

Pre-training: from shallow to deep

Pre-trained word vectors have taken NLP a long way. Word2vec, proposed in 2013 as an approximation to language modeling, was adopted for its efficiency and ease of use at a time when hardware was much slower and deep learning models were not widely supported. Since then, the standard way of running an NLP project has remained largely unchanged: word embeddings pre-trained on large amounts of unlabeled data are used to initialize the first layer of a neural network, and the remaining layers are then trained on task-specific data. On most tasks with limited training data, this improves results by two to three percentage points. Influential as they are, pre-trained word embeddings have one major limitation: they inject prior knowledge only into the first layer of the model, while the rest of the network still has to be trained from scratch.
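
The recipe described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the three-dimensional vectors in `PRETRAINED` are made-up numbers rather than real word2vec output, and the helper names `build_embedding_layer` and `init_dense_layer` are hypothetical. The point is that only the embedding rows start from prior knowledge, while every other layer starts from random values.

```python
import random

# Hypothetical pre-trained vectors (made-up values); in practice these would be
# loaded from vectors trained on a large unlabeled corpus.
PRETRAINED = {
    "king": [0.50, 0.68, -0.59],
    "queen": [0.54, 0.86, -0.32],
    "food": [-0.41, 0.10, 0.77],
}


def build_embedding_layer(vocab, pretrained, dim, seed=0):
    """Return one vector per word: copied if pre-trained, small random values otherwise."""
    rng = random.Random(seed)
    layer = []
    for word in vocab:
        if word in pretrained:
            layer.append(list(pretrained[word]))
        else:
            layer.append([rng.uniform(-0.1, 0.1) for _ in range(dim)])
    return layer


def init_dense_layer(n_in, n_out, seed=1):
    """Randomly initialize a task-specific layer; it carries no prior knowledge."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_out)] for _ in range(n_in)]


vocab = ["king", "queen", "food", "service"]
embedding = build_embedding_layer(vocab, PRETRAINED, dim=3)  # first layer: pre-trained
classifier = init_dense_layer(n_in=3, n_out=2)               # rest: trained from scratch
```

Here "service" has no pre-trained vector, so its row is random like the classifier weights; everything above the embedding layer has to be learned from the task data alone.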

Relationships captured by word2vec (source: TensorFlow tutorial)

Word2vec and related methods are shallow approaches that trade expressiveness for efficiency. Using word embeddings is like initializing a computer vision model with pre-trained representations that encode only edges: they help on many tasks, but they fail to capture higher-level information that could be even more useful. A model initialized with word embeddings has to learn from scratch not only to disambiguate words but also to derive meaning from a sequence of words, which is the core of language understanding. That requires modeling complex linguistic phenomena such as compositionality, polysemy, long-term dependencies, agreement and negation. It is therefore no surprise that NLP models initialized with these shallow representations still need a large number of examples to perform well.

At the core of the recent advances of ULMFiT, ELMo and the OpenAI Transformer is one key paradigm shift: moving from initializing only the first layer of our models to pre-training the entire model with hierarchical representations. If learning word vectors is like learning only the edges in an image, these methods are like learning the complete hierarchy of features, from edges to shapes to high-level semantic concepts.

Interestingly, the computer vision (CV) community has been pre-training entire models for years in order to learn both low-level and high-level features. In most cases, this is done by learning to classify images on the ImageNet dataset. ULMFiT, ELMo and the OpenAI Transformer have now brought the NLP community close to having an “ImageNet for language”, that is, a task that lets models learn the higher-level nuances of language, much as ImageNet enables the training of CV models that learn general-purpose image features. In the rest of this article, we explore why these approaches look so promising by extending and building on the analogy with ImageNet.


ImageNet has had an enormous influence on the course of machine learning research. The dataset was first released in 2009 and quickly evolved into the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). In 2012, the deep neural network submitted by Alex Krizhevsky, Ilya Sutskever and Geoffrey Hinton performed 41% better than the runner-up, which showed that deep learning was a viable machine learning strategy and arguably triggered the explosion of deep learning in ML research.

The success of ImageNet showed that in the era of deep learning, data is at least as important as algorithms. The ImageNet dataset not only enabled the 2012 demonstration of the power of deep learning, it also led to a breakthrough in transfer learning: researchers soon realized that the weights learned by state-of-the-art ImageNet models could be used to initialize models for entirely different datasets, and that this “fine-tuning” approach performs well.

Features trained on ILSVRC-2012 generalize to the SUN-397 dataset.

Pre-trained ImageNet models have been used with great success in tasks such as object detection, semantic segmentation, human pose estimation and video recognition. At the same time, they have made it possible to apply CV in domains where training examples are few and annotation is expensive.

What's in ImageNet?

To work out what an ImageNet for language might look like, we first have to identify what makes ImageNet so well suited to transfer learning. Previous studies shed only partial light on this question: reducing the number of examples per class or the number of classes leads to only a small drop in performance, while finer-grained classes and more data are not always better.

Rather than looking at the data directly, it is more prudent to examine what models trained on the data have learned. It is well known that the features of deep neural networks trained on ImageNet transition from general to task-specific as we move from the first layer to the last: lower layers learn to model low-level features such as edges, while higher layers model higher-level concepts such as patterns and entire parts or objects, as the figure below illustrates. Importantly, knowledge of edges, structure and the visual composition of objects is relevant to many CV tasks, which explains why these layers transfer. A key property of an ImageNet-like dataset is therefore that it encourages a model to learn features that are likely to generalize to new tasks in the problem domain.

Visualization of the information captured by features at different layers of GoogLeNet trained on ImageNet

Beyond this, it is hard to generalize about why transfer from ImageNet works so well. For example, another possible advantage of the ImageNet dataset is the quality of its data: the creators of ImageNet went to great lengths to ensure reliable and consistent annotations. Work on distant supervision, however, serves as a counterpoint, showing that large amounts of weakly labeled data may often be enough. In fact, Facebook researchers recently reported that they could pre-train a model by predicting hashtags on billions of social media images and reach state-of-the-art accuracy on ImageNet.

Without more specific insights, we are left with two key requirements:

  1. An ImageNet-like dataset should be large enough, that is, on the order of millions of training examples.
  2. It should be representative of the problem space of the discipline.

An ImageNet for language tasks

NLP models are generally much shallower than their CV counterparts. Feature analysis has therefore focused mostly on the first embedding layer, and little work has examined the properties of higher layers for transfer learning. Consider first the datasets that are large enough to meet the first requirement. In the current state of NLP, several common tasks are candidates for pre-training NLP models.

Reading comprehension is the task of answering natural language questions about a paragraph. The most popular dataset for this task is the Stanford Question Answering Dataset (SQuAD), which contains more than 100,000 question-answer pairs and requires the model to answer a question by highlighting a span in the paragraph.

Natural language inference is the task of identifying the relationship (entailment, contradiction or neutral) between a piece of text and a hypothesis. The most popular dataset for this task is the Stanford Natural Language Inference (SNLI) corpus, which contains 570k human-written English sentence pairs.

Machine translation is one of the most studied tasks in NLP. Over the years, large amounts of training data have accumulated for popular language pairs, for example the 40M English-French sentence pairs in WMT 2014.

Constituency parsing tries to extract the syntactic structure of a sentence in the form of a (linearized) parse tree. Millions of weakly labeled parses have been used to train sequence-to-sequence models for this task.

Language modeling (LM) aims to predict the next word given the words that precede it. Existing benchmark datasets contain up to about 1 billion words, but because the task is unsupervised, any number of words can be used for training. A popular example is the WikiText-2 dataset, which consists of Wikipedia articles.
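
To make next-word prediction concrete, here is a minimal count-based bigram model in plain Python. It is a deliberately simple stand-in, not the neural language models discussed in this article: it conditions only on the single previous word, and the four-sentence `corpus` is invented for illustration. It does show why no labels are needed, since the training signal comes from the raw text itself.

```python
from collections import Counter, defaultdict


def train_bigram_lm(sentences):
    """Count how often each word follows each previous word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_distribution(counts, prev):
    """Turn the raw counts after `prev` into probabilities (empty if `prev` is unseen)."""
    total = sum(counts[prev].values())
    return {w: c / total for w, c in counts[prev].items()} if total else {}


# Toy corpus, invented for this sketch; any unlabeled text would do.
corpus = [
    "the food was delicious",
    "the food was cold",
    "the food was delicious again",
    "the service was poor",
]
lm = train_bigram_lm(corpus)
dist = next_word_distribution(lm, "was")
best = max(dist, key=dist.get)  # most likely word after "was"
```

A model this shallow cannot use anything beyond the previous word, which is exactly the kind of limitation that deeper pre-trained language models are meant to overcome.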

All of these tasks provide, or allow the collection of, enough examples for training. In fact, the tasks above (and many others, such as sentiment analysis, skip-thoughts and autoencoding) have been used to pre-train representations in recent months.

Any dataset contains some bias, but human annotators may also inadvertently introduce additional signals that a model can exploit. Recent studies show that state-of-the-art models for tasks such as reading comprehension and natural language inference do not actually develop deep natural language understanding; instead, they pick up on such cues and perform shallow pattern matching. For example, Gururangan et al. (2018) showed in “Annotation Artifacts in Natural Language Inference Data” that annotators tend to produce entailment examples by removing gender or number information, and contradictions by introducing negation words. With the help of these cues, a model can classify hypotheses on the SNLI dataset with 67% accuracy without looking at the premise.
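
The kind of shortcut described here can be illustrated with a toy hypothesis-only classifier. Everything in this sketch is an assumption made for illustration: the cue list `NEGATION_CUES`, the function `hypothesis_only_guess` and the four sentence pairs are invented, and this is not the classifier used by Gururangan et al. The accuracy on such a tiny hand-made set says nothing about SNLI; the sketch only shows that a label can be guessed without ever reading the premise.

```python
# Hypothetical cue words; a real study would learn such cues from data.
NEGATION_CUES = {"no", "not", "nobody", "never", "nothing"}


def hypothesis_only_guess(hypothesis):
    """Guess a label from the hypothesis alone, never reading the premise."""
    tokens = set(hypothesis.lower().replace(".", "").split())
    if tokens & NEGATION_CUES:
        return "contradiction"
    return "entailment"


# Invented (premise, hypothesis, label) triples, not taken from SNLI.
examples = [
    ("A woman is playing a guitar on stage.", "A person is playing an instrument.", "entailment"),
    ("A woman is playing a guitar on stage.", "Nobody is playing music.", "contradiction"),
    ("Two dogs run through a field.", "The dogs are not moving.", "contradiction"),
    ("Two dogs run through a field.", "Animals are outside.", "entailment"),
]

# The premise is ignored on purpose (bound to `_`).
correct = sum(hypothesis_only_guess(h) == label for _, h, label in examples)
accuracy = correct / len(examples)
```

If a premise-blind rule like this scores well on a benchmark, high benchmark accuracy cannot be taken as evidence of language understanding.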

So the harder question is: which task is most representative of the space of NLP problems? In other words, which task lets us learn most of the knowledge and relations needed to understand natural language?

The case for language modeling

To predict the most likely next word in a sentence, a model must not only be able to express syntax (the grammatical form of the predicted word must agree with its modifier or verb) but also model semantics. What is more, the most accurate models must incorporate something that can be considered world knowledge or common sense. Consider the incomplete sentence “The service was poor, but the food was”. To predict a following word such as “delicious”, the model must not only remember which attributes are used to describe food, but also recognize that the conjunction “but” introduces a contrast, so that the new attribute carries the opposite sentiment to “poor”.

Language modeling, the last task mentioned above, has been shown to capture many aspects of language that are relevant to downstream tasks, such as long-term dependencies, hierarchical relations and sentiment. Compared with related unsupervised tasks such as skip-thoughts and autoencoding, language modeling performs better on syntactic tasks, even with less training data.

One of the greatest benefits of language modeling is that training data comes for free with any text corpus, so a potentially unlimited amount of training data is available. This matters all the more because NLP is not only about English: there are currently about 4,500 languages around the world. As a pre-training task, language modeling opens the door to developing models for languages that were previously underserved. For languages with very few resources, where even unlabeled data is scarce, multilingual language models can be trained on several related languages at once, similar to work on cross-lingual embeddings.

The different stages of ULMFiT

So far, our argument for language modeling as a pre-training task has been purely conceptual. In recent months, however, we have also obtained empirical evidence: Embeddings from Language Models (ELMo), Universal Language Model Fine-tuning (ULMFiT) and the OpenAI Transformer have demonstrated empirically how language modeling can be used for pre-training, as the figure above shows. All three methods use pre-trained language models to achieve state-of-the-art results on a wide range of NLP tasks, including text classification, question answering, natural language inference and sequence labeling.

In many cases, such as ELMo in the figure below, algorithms built around a pre-trained language model improve on the previous state of the art by 10% to 20% on widely studied benchmarks. ELMo also won the best paper award at NAACL-HLT 2018, one of the top conferences in NLP. Finally, these models have proven to be very sample-efficient: they need only hundreds of examples to reach good performance, and they can even perform zero-shot learning.

Improvements achieved by ELMo on a variety of NLP tasks

Given this step change, it is very likely that a year from now NLP practitioners will download pre-trained language models rather than pre-trained word embeddings for use in their own models, just as pre-trained ImageNet models are now the starting point for most CV projects.

However, like word2vec, the task of language modeling naturally has its own limitations: it is only a proxy for true language understanding, and a single monolithic model is ill-equipped to capture the information that some downstream tasks require. For example, to answer questions about a character in a story, or to follow that character's trajectory, the model needs to learn to perform anaphora or coreference resolution. In addition, language models capture only what they have seen. Some kinds of information, such as most common-sense knowledge, are hard to learn from text alone and require the integration of external information.

An outstanding question is how to transfer information from a pre-trained language model to a downstream task. There are two main paradigms: either use the pre-trained language model as a fixed feature extractor and feed its representations into a randomly initialized model (as ELMo does), or fine-tune the entire language model (as ULMFiT does). The latter is what is commonly done in computer vision, where either the topmost layer or several of the top layers are fine-tuned during training. NLP models are usually shallower and therefore need different fine-tuning techniques from their vision counterparts, but recent pre-trained models have been getting deeper. The coming months will show the impact of each of the core components of transfer learning for NLP: a highly expressive language model encoder (such as a deep BiLSTM or a Transformer), the amount and nature of the data used for pre-training, and the method used to fine-tune the pre-trained model.
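
The difference between the two paradigms comes down to which parameters receive updates. The sketch below illustrates this with a single gradient step in plain Python. It is a schematic under stated assumptions: the parameter groups `encoder` and `head`, the weight values and the gradients are all made up, and the code does not reproduce the actual training procedures of ELMo or ULMFiT.

```python
def sgd_step(params, grads, trainable, lr=0.1):
    """Apply one SGD update, but only to the parameter groups named in `trainable`."""
    updated = {}
    for name, values in params.items():
        if name in trainable:
            updated[name] = [v - lr * g for v, g in zip(values, grads[name])]
        else:
            updated[name] = list(values)  # frozen: copied unchanged
    return updated


# Made-up numbers: "encoder" stands for pre-trained language model weights,
# "head" for a randomly initialized task-specific layer.
params = {"encoder": [0.8, -0.3], "head": [0.05, 0.02]}
grads = {"encoder": [0.5, 0.5], "head": [1.0, -1.0]}

# Paradigm 1: fixed feature extractor, only the new head is trained.
feature_extraction = sgd_step(params, grads, trainable={"head"})

# Paradigm 2: fine-tuning, the pre-trained encoder is updated as well.
fine_tuning = sgd_step(params, grads, trainable={"encoder", "head"})
```

In the first case the encoder weights stay exactly as pre-trained; in the second they drift toward the downstream task, which is why fine-tuning calls for more care as models get deeper.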

But what is the theoretical basis?

So far, our analysis has been mainly conceptual and empirical, because it is still poorly understood why models trained on ImageNet transfer so well. A more formal way to think about the generalization ability of pre-trained models is the model of bias learning (Baxter, 2000). Suppose our problem domain covers all permutations of tasks in a particular discipline, such as computer vision, which together make up the environment. We are given a number of datasets that allow us to induce a family of hypothesis spaces ℍ = {H′}. Our goal in bias learning is to find a bias, that is, a hypothesis space H′ ∈ ℍ that maximizes performance over the whole environment.
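
One way to write this objective down, following my reading of Baxter (2000) and using notation that this article itself does not introduce: let Q be a probability distribution over the tasks P in the environment, and let er_P(h) be the expected loss of hypothesis h on task P. Maximizing performance over the environment then corresponds to choosing the hypothesis space whose best hypothesis has the lowest expected error over tasks drawn from Q.

```latex
\[
  \operatorname{er}_Q(H') \;=\; \int \inf_{h \in H'} \operatorname{er}_P(h)\, \mathrm{d}Q(P),
  \qquad
  H^{\ast} \;=\; \operatorname*{arg\,min}_{H' \in \mathbb{H}} \operatorname{er}_Q(H').
\]
```

Pre-training can be read as a search for such a space: the pre-trained weights restrict which hypotheses a downstream model can easily reach.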

Empirical and theoretical results in multi-task learning (Caruana, 1997; Baxter, 2000) show that a bias learned on sufficiently many tasks is likely to generalize to unseen tasks drawn from the same environment. Viewed through the lens of multi-task learning, a model trained on ImageNet learns a large number of binary classification tasks (one for each class). These tasks all come from the space of natural, real-world images and are likely to be representative of many other CV tasks. Similarly, by learning a large number of classification tasks (one for each word), a language model may induce representations that help with many other tasks in the natural language domain. Still, more research is needed to understand why language modeling seems to be so effective for transfer learning.
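
The classification view of language modeling can be made explicit with the standard training objective. The notation is an assumption introduced here: w_1, ..., w_T is a training sequence, V is the vocabulary, and z_{t,v} is the score the model assigns to word v at position t. Each position is a classification problem over V, and the softmax couples what the text above describes as one task per word.

```latex
\[
  \mathcal{L}(\theta) \;=\; -\sum_{t=1}^{T} \log p_\theta(w_t \mid w_1, \dots, w_{t-1}),
  \qquad
  p_\theta(w_t = v \mid w_1, \dots, w_{t-1}) \;=\; \frac{\exp(z_{t,v})}{\sum_{v' \in V} \exp(z_{t,v'})}.
\]
```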

The ImageNet era of NLP

The time is ripe for transfer learning to make its way into NLP. Given the impressive empirical results of ELMo, ULMFiT and the OpenAI Transformer, it seems only a matter of time before pre-trained word embeddings are phased out and replaced by pre-trained language models in every NLP practitioner's toolbox. This may ease the problem of insufficient labeled data in NLP.

Author: [direction]
