Four Models Advancing Applied NLP: A Look at the Latest Open-Source Pre-trained Models of Baidu Wenxin ERNIE


At WAVE SUMMIT 2021, the deep learning developer summit that concluded on May 20, Baidu Wenxin ERNIE open-sourced four pre-trained models. This article gives a detailed technical interpretation of these four models.
Since 2019, NLP pre-trained models have made continuous breakthroughs in both technical innovation and industrial application, but several pain points still trouble developers:

  • Most models consider only single-granularity semantic modeling and do not introduce multi-granularity knowledge, which limits their semantic understanding ability;
  • Constrained by the modeling-length bottleneck of the Transformer structure, they cannot process ultra-long text;
  • They focus on a single modality such as language, and lack the ability to model real application scenarios that involve multiple modalities such as language, vision, and audio.

At WAVE SUMMIT 2021, held on May 20, Baidu Wenxin ERNIE, built on the PaddlePaddle core framework, open-sourced its four latest pre-trained models: ERNIE-Gram, a multi-granularity language knowledge enhanced model; ERNIE-Doc, a long-text understanding model; ERNIE-ViL, a cross-modal understanding model that integrates scene graph knowledge; and ERNIE-UNIMO, a model that unifies language and vision.

Targeting these difficulties and pain points of current pre-trained models, the four open-source Wenxin ERNIE models make breakthroughs in three areas (text semantic understanding, long-text modeling, and cross-modal understanding), and have broad application scenarios and prospects for further advancing industrial intelligence.


Wenxin ERNIE open-source address:

Wenxin ERNIE official website address:
1、 Multi-granularity language knowledge enhanced model ERNIE-Gram

Since the first ERNIE model, Baidu researchers have been introducing knowledge into pre-trained models, using knowledge enhancement to improve their semantic modeling ability. The newly released ERNIE-Gram explicitly introduces language-granularity knowledge to improve model quality. Specifically, ERNIE-Gram proposes an explicit n-gram masked language model to learn n-gram-granularity language information. Compared with the contiguous n-gram masked language model, it greatly shrinks the semantic learning space (V^n → V_{n-gram}, where V is the vocabulary size, n is the length of the modeled n-gram, and V_{n-gram} is the size of the n-gram lexicon), which significantly speeds up the convergence of pre-training.


▲ Figure 1-1 Contiguous n-gram masked language model vs. explicit n-gram masked language model
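The difference between the two masking schemes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names, the example sentence, and both vocabulary sizes are hypothetical and are not taken from the ERNIE-Gram code.

```python
MASK = "[MASK]"


def contiguous_ngram_mask(tokens, start, n):
    """Mask every token of the n-gram separately: n [MASK] slots, n token-level targets."""
    masked = tokens[:start] + [MASK] * n + tokens[start + n:]
    targets = tokens[start:start + n]
    return masked, targets


def explicit_ngram_mask(tokens, start, n):
    """Replace the whole n-gram with one [MASK]: a single n-gram-level target."""
    masked = tokens[:start] + [MASK] + tokens[start + n:]
    target = " ".join(tokens[start:start + n])
    return masked, target


tokens = ["the", "new", "york", "times", "reported"]
cont_masked, cont_targets = contiguous_ngram_mask(tokens, start=1, n=3)
expl_masked, expl_target = explicit_ngram_mask(tokens, start=1, n=3)

# Size of the prediction space for one 3-gram (both sizes are illustrative assumptions).
token_vocab_size = 30_000      # hypothetical V
ngram_lexicon_size = 300_000   # hypothetical V_{n-gram}
contiguous_space = token_vocab_size ** 3   # V^n joint combinations
explicit_space = ngram_lexicon_size        # one label out of the n-gram lexicon
```

In the contiguous scheme the model has to recover three separate tokens, while in the explicit scheme it predicts a single n-gram identity, which is why the space it has to learn is so much smaller.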

In addition, building on explicit n-gram semantic granularity modeling, ERNIE-Gram proposes multi-level n-gram language granularity learning. It uses a two-stream mechanism to learn fine-grained semantic knowledge within n-gram language units and coarse-grained semantic knowledge between n-gram language units at the same time, thereby learning language knowledge at multiple levels of granularity.


▲ Figure 1-2 n-gram multi-level language granularity mask learning

Without adding any computational complexity, ERNIE-Gram significantly outperforms the mainstream open-source pre-trained models in the industry on several typical Chinese tasks, including natural language inference, short-text similarity, and reading comprehension. The English ERNIE-Gram pre-trained model also outperforms mainstream models on general language understanding and reading comprehension tasks.

The ERNIE-Gram paper was accepted to the NAACL 2021 main conference. Address of the paper:

2、 Long-text understanding model ERNIE-Doc

The Transformer is the basic network structure on which the ERNIE pre-trained models rely. However, because its computation and memory consumption grow quadratically with the modeling length, it is difficult for the model to handle long texts such as full-length documents and books. Inspired by the way humans read, skimming first and then reading closely, ERNIE-Doc pioneers a retrospective modeling technique that breaks through the Transformer's bottleneck on text length and achieves bidirectional modeling of arbitrarily long text.
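A small piece of arithmetic makes the quadratic growth concrete. The sketch below only counts entries of the self-attention score matrix; the function names and the lengths 4096 and 512 are assumptions chosen for illustration.

```python
def attention_score_entries(seq_len: int) -> int:
    """Entries in one self-attention score matrix: every token attends to every token."""
    return seq_len * seq_len


def segmented_score_entries(total_len: int, segment_len: int) -> int:
    """Total entries when the text is cut into fixed-size segments encoded independently."""
    num_segments = -(-total_len // segment_len)  # ceiling division
    return num_segments * attention_score_entries(segment_len)


full = attention_score_entries(4096)            # one pass over the whole text
segmented = segmented_score_entries(4096, 512)  # eight independent 512-token segments
ratio = full // segmented
```

Cutting the text into segments reduces the cost by the number of segments, but each segment then loses sight of the rest of the document, which is the context fragmentation that ERNIE-Doc sets out to avoid.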

By feeding the long text into the model twice, ERNIE-Doc learns and stores the semantic information of the whole text in the skimming stage, and then explicitly fuses this whole-text semantic information into each text segment in the close-reading stage. This achieves bidirectional modeling and avoids the context fragmentation problem.
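The effect of reading the text twice can be traced with a toy example. The sketch below does not run a model; it only records which segments each segment can draw on, assuming a simple left-to-right memory. All function names are hypothetical.

```python
def split_segments(tokens, segment_len):
    """Cut a long token list into consecutive fixed-size segments."""
    return [tokens[i:i + segment_len] for i in range(0, len(tokens), segment_len)]


def single_pass_context(num_segments):
    """One left-to-right pass: segment i sees itself and the memory of segments 0..i-1."""
    return [set(range(i + 1)) for i in range(num_segments)]


def retrospective_context(num_segments):
    """Second (close-reading) pass: the memory kept from the skim pass covers every segment."""
    skim_memory = set(range(num_segments))
    return [skim_memory | {i} for i in range(num_segments)]


segments = split_segments(list(range(10)), segment_len=4)
one_pass = single_pass_context(len(segments))
two_pass = retrospective_context(len(segments))
```

After a single pass the first segment knows nothing about what follows it, whereas in the second pass every segment, including the first, has access to information from the whole text. This is the sense in which the modeling becomes bidirectional.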

In addition, the recurrence pattern of the memory structure in conventional long-text models (Transformer-XL, etc.) limits the model's effective modeling length. ERNIE-Doc changes it to same-layer recurrence, so that the model retains higher-level semantic information and gains the ability to model ultra-long text.


▲ Figure 2-1 Retrospective modeling and enhanced memory mechanism in ERNIE-Doc
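Why the recurrence pattern matters for effective modeling length can be checked by following the memory links backwards. The sketch below assumes a simplified dependency pattern: in the Transformer-XL style, a layer reads the cached memory of the layer below it from the previous segment, while in same-layer recurrence it reads the cached memory of the same layer. The function name and the sizes used are hypothetical.

```python
def earliest_reachable_segment(num_layers, segment, same_layer):
    """Follow memory links back from the top layer; return the oldest segment reached."""
    memory_layer_shift = 0 if same_layer else 1
    stack = [(num_layers, segment)]
    seen = set()
    while stack:
        layer, seg = stack.pop()
        if (layer, seg) in seen:
            continue
        seen.add((layer, seg))
        if layer == 0:
            continue  # embedding layer: no further links
        stack.append((layer - 1, seg))  # lower layer of the same segment
        if seg > 0:
            # cached memory of the previous segment
            stack.append((layer - memory_layer_shift, seg - 1))
    return min(seg for _, seg in seen)


xl_style = earliest_reachable_segment(num_layers=3, segment=10, same_layer=False)
same_layer_style = earliest_reachable_segment(num_layers=3, segment=10, same_layer=True)
```

Under these assumptions, each hop back in time in the Transformer-XL style also costs one layer, so a 3-layer model at segment 10 reaches back only to segment 7. With same-layer recurrence the chain never runs out of layers and reaches segment 0.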

By having the model learn the order of text segments at the document level, ERNIE-Doc can better model the text as a whole.


▲ Figure 2-2 Text reordering learning
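A training example for this kind of reordering objective could be built roughly as follows. This is a sketch under assumptions: the function name is hypothetical, and representing the label as the list of original positions is one possible encoding, not necessarily the one used in ERNIE-Doc.

```python
import random


def make_reordering_example(segments, seed=0):
    """Shuffle the segments; the label records the original position of each one."""
    rng = random.Random(seed)
    order = list(range(len(segments)))
    rng.shuffle(order)
    shuffled = [segments[i] for i in order]
    return shuffled, order


segments = ["intro paragraph", "method paragraph", "results paragraph", "conclusion paragraph"]
shuffled, label = make_reordering_example(segments, seed=42)

# A model that predicts the label correctly can restore the original document.
restored = [shuffled[label.index(i)] for i in range(len(segments))]
```

To solve the task the model has to understand how the segments relate to one another across the whole document, which is the document-level signal described above.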

ERNIE-Doc significantly improves long-text modeling and can solve many application problems that conventional models cannot handle. For example, in search engines, ERNIE-Doc can understand a web page as a whole and return more systematic results to users. In intelligent writing, ERNIE-Doc can be used to generate longer and semantically richer articles.

ERNIE-Doc, the ultra-long text understanding model, achieves the best results on 13 typical Chinese and English long-text tasks, including reading comprehension, information extraction, text classification, and language modeling.

The ERNIE-Doc paper was accepted to the ACL 2021 main conference. The paper link is as follows:

3、 ERNIE-ViL, a cross-modal understanding model based on scene graph knowledge

Cross-modal information processing requires an AI model to deeply understand and integrate information from language, vision, hearing, and other modalities. Current pre-training-based cross-modal semantic understanding techniques learn a joint cross-modal representation from aligned corpora and build semantic alignment signals into that representation, thereby improving cross-modal semantic understanding. ERNIE-ViL proposes a knowledge-enhanced vision-language pre-training model. It integrates scene graph knowledge, which contains fine-grained semantic information, into the pre-training process and constructs three pre-training tasks: object prediction, attribute prediction, and relationship prediction. These tasks make the model pay more attention to fine-grained semantic knowledge during pre-training, so that it learns to capture cross-modal semantic alignment more accurately and obtains a better cross-modal semantic representation.


▲ Figure 3-1 The knowledge-enhanced cross-modal pre-training framework of ERNIE-ViL
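The three prediction tasks can be illustrated on the text side with a toy caption. The scene graph below is written by hand and the helper names are hypothetical. The sketch masks every node of a given type deterministically, which is a simplification for readability; in the actual model the prediction would also draw on the image.

```python
MASK = "[MASK]"

caption = "a black cat on a wooden table".split()

# Hand-written scene graph for the caption (an assumption for illustration).
scene_graph = {
    "objects": ["cat", "table"],
    "attributes": [("cat", "black"), ("table", "wooden")],
    "relationships": [("cat", "on", "table")],
}


def mask_words(tokens, words):
    """Replace every token that belongs to `words` with the mask symbol."""
    return [MASK if token in words else token for token in tokens]


def build_prediction_tasks(tokens, graph):
    """Build one masked caption per scene graph node type."""
    return {
        "object": mask_words(tokens, set(graph["objects"])),
        "attribute": mask_words(tokens, {attr for _, attr in graph["attributes"]}),
        "relationship": mask_words(tokens, {rel for _, rel, _ in graph["relationships"]}),
    }


tasks = build_prediction_tasks(caption, scene_graph)
```

Unlike random token masking, every masked position here carries fine-grained meaning (what is in the scene, what it looks like, how things relate), so recovering it pushes the model towards detailed alignment between words and image content.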

ERNIE-ViL is the first to integrate scene graph knowledge into the pre-training of a cross-modal model, which offers a new line of thinking for research on cross-modal semantic understanding. The model achieves leading results on five typical cross-modal tasks, including visual question answering, visual commonsense reasoning, referring expression comprehension, and cross-modal text-image retrieval. ERNIE-ViL is also gradually being deployed in real industrial application scenarios such as video search.

The ERNIE-ViL paper was accepted to the AAAI 2021 main conference. The address of the paper is:

4、 ERNIE-UNIMO, a model unifying language and vision

Big data is one of the key foundations for the success of deep learning. Current pre-training methods are usually carried out separately on data of different modalities, which makes it difficult to support language tasks and image tasks at the same time. Can a deep-learning-based AI system learn from all kinds of heterogeneous data, both unimodal and multimodal, at the same time? If so, it would undoubtedly push back the limits on how much large-scale data deep learning can use, and further improve the general perception and cognition abilities of AI systems.

ERNIE-UNIMO, a model unifying language and vision, proposes a unified-modal learning method. It trains on unimodal text, unimodal images, and multimodal image-text pairs at the same time to learn a unified semantic representation of text and images, so that it can handle a variety of unimodal and cross-modal tasks simultaneously. The core module of the method is a Transformer network. During training, the three kinds of data (text, images, and image-text pairs) are randomly mixed together: an image is converted into an object sequence, a text is converted into a token sequence, and an image-text pair is converted into the concatenation of an object sequence and a token sequence. Unified-modal learning models the three kinds of data in a unified way: it performs self-supervised learning based on mask prediction over object sequences or token sequences, and performs cross-modal learning on image-text pairs, thereby unifying the representations of images and text. Furthermore, this joint learning lets textual knowledge and visual knowledge enhance each other, which effectively improves both textual and visual semantic representation.
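The data side of this procedure, converting the three kinds of samples into sequences and mixing them, can be sketched as follows. The sample layout, the field names, and the region placeholders are assumptions made for illustration; real image inputs would be object features from a detector rather than strings.

```python
import random


def to_sequence(sample):
    """Turn a sample of any of the three kinds into one flat input sequence."""
    kind = sample["kind"]
    if kind == "text":
        return sample["tokens"]
    if kind == "image":
        return sample["regions"]
    if kind == "pair":
        # object sequence followed by the token sequence
        return sample["regions"] + sample["tokens"]
    raise ValueError(f"unknown sample kind: {kind}")


def mixed_stream(samples, seed=0):
    """Randomly mix all samples and convert each one to its sequence."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    return [to_sequence(sample) for sample in shuffled]


samples = [
    {"kind": "text", "tokens": ["a", "dog", "runs"]},
    {"kind": "image", "regions": ["<region:dog>", "<region:grass>"]},
    {"kind": "pair", "regions": ["<region:cat>"], "tokens": ["a", "cat"]},
]
stream = mixed_stream(samples, seed=1)
```

Because every sample ends up as a sequence of the same general form, a single Transformer can consume all three kinds of data, and mask prediction can be applied to any of them.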


This method surpasses mainstream text pre-trained models and multimodal pre-trained models on 13 tasks across four scenario types (language understanding, language generation, multimodal understanding, and multimodal generation), and tops both the authoritative visual question answering leaderboard VQA and the text reasoning leaderboard aNLI. It verifies for the first time that language knowledge and visual knowledge can enhance each other through non-parallel unimodal text and image data.

This work was accepted to the ACL 2021 main conference. The address of the paper is:

5、 Solving NLP technical problems and supporting industrial intelligence

By open-sourcing these four new pre-trained models, Wenxin ERNIE continues to advance innovation and application in NLP model research.

Language and knowledge technologies are regarded as the core of AI's cognitive ability. Since 2019, drawing on its deep accumulation in natural language processing, Baidu has achieved a series of world-class breakthroughs and released the Wenxin ERNIE semantic understanding platform, which is widely used in finance, telecommunications, education, the Internet, and other industries to support their intelligent upgrading.


As the “jewel in the crown of artificial intelligence”, NLP has always been at the forefront of AI research, development, and deployment. Built on leading semantic understanding technology, the Baidu Wenxin platform helps enterprises overcome the barriers of technology, tools, computing power, and talent in NLP. It is open to developers and enterprises, comprehensively accelerates the adoption of NLP technology, supports the intelligent upgrading of all industries, and gives “wings” of intelligence to the large-scale production of the AI industry.
Baidu Natural Language Processing (NLP) takes “understanding language, possessing intelligence, and changing the world” as its mission. It develops core natural language processing technologies, builds leading technology platforms and innovative products, serves users worldwide, and makes the complex world simpler.