Breaking: ERNIE-ViL, Baidu's multimodal model, sets new records on 5 tasks and tops the authoritative VCR leaderboard

Time: 2021-07-22



Read the original: https://mp.weixin.qq.com/s/nB_yCkEXkgjv7saKpcNpng

Recently, Baidu made a breakthrough in multimodal semantic understanding. Leveraging the distributed training strengths of its deep learning platform PaddlePaddle, Baidu proposed ERNIE-ViL, a knowledge-enhanced visual-language pre-training model that integrates scene graph knowledge into multimodal pre-training for the first time. ERNIE-ViL set new state-of-the-art results on five multimodal tasks and topped the authoritative multimodal leaderboard VCR, surpassing Microsoft, Google, Facebook, and other organizations. According to Machine Heart, the PaddlePaddle-based ERNIE-ViL model will also be open-sourced in the near future.

Multimodal semantic understanding is one of the important research directions in artificial intelligence. For machines to understand and reason like humans, they must fuse multimodal information such as language, speech, and vision.

In recent years, single-modality semantic understanding of vision, language, and speech has made great progress. But most real-world AI scenarios involve multiple modalities at once. For example, an ideal AI assistant needs to communicate with people based on multimodal signals such as language, voice, and gesture, which requires machines to have multimodal semantic understanding.

Recently, Baidu made a breakthrough in this field with ERNIE-ViL, the industry's first multimodal pre-training model to integrate scene graph knowledge. Baidu researchers incorporate scene graph knowledge into visual-language pre-training to learn joint representations of scene semantics, significantly strengthening cross-modal semantic understanding. ERNIE-ViL set new state-of-the-art results on five typical multimodal tasks: visual commonsense reasoning, visual question answering, referring expression comprehension, cross-modal image retrieval, and cross-modal text retrieval. It also topped the authoritative multimodal leaderboard for the visual commonsense reasoning (VCR) task.

Link to the paper:

https://arxiv.org/abs/2006.16934

ERNIE open-source repository:

https://github.com/PaddlePaddle/ERNIE

ERNIE-ViL tops the VCR leaderboard


The latest version of the VCR leaderboard. Subtask 1: Q→A (question answering). Subtask 2: QA→R (answer justification). Overall score: Q→AR, the model's combined performance (a sample scores only when both subtasks are answered correctly).
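To make the scoring rule concrete, here is a minimal sketch (not the leaderboard's actual code) of computing the three VCR metrics from per-sample correctness flags:

```python
def vcr_scores(qa_correct, qar_correct):
    """Compute VCR leaderboard metrics from per-sample correctness flags.

    qa_correct[i]  -- True if the model picked the right answer (Q -> A)
    qar_correct[i] -- True if the model picked the right rationale (QA -> R)
    """
    n = len(qa_correct)
    q_to_a = sum(qa_correct) / n
    qa_to_r = sum(qar_correct) / n
    # Joint metric: a sample scores only if BOTH subtasks are correct.
    q_to_ar = sum(a and r for a, r in zip(qa_correct, qar_correct)) / n
    return q_to_a, qa_to_r, q_to_ar
```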

In primary school, "describe the picture" exercises were a fixture of Chinese language tests: given a picture like the one below, students describe what the characters are doing, what they are thinking, and how they feel.

Similarly, in artificial intelligence, machines also need this ability to "look at a picture and speak".


As shown in the example below, the questioner asks, "How did the person on the right get the money in front of her?" and then follows up with "Why do you make that inference?" To answer, the model must not only recognize the objects in the image ("person", "musical instrument", "coins") but also understand relationships such as "the person is playing an instrument" and reason with the common sense that "street performers earn money".


VCR (Visual Commonsense Reasoning) is a dataset of more than 100,000 such images and questions, jointly created by researchers from the University of Washington and the Allen Institute for Artificial Intelligence. It tests a model's multimodal semantic understanding and reasoning ability.

Microsoft, Google, Facebook, and other technology companies, as well as top universities such as UCLA and the Georgia Institute of Technology, have taken on the task.

On June 24, the leaderboard was refreshed again. ERNIE-ViL from the Baidu ERNIE team took first place in both the single-model and ensemble settings, topping the list with an accuracy 3.7 percentage points ahead of the runner-up on the joint task and surpassing Microsoft, Google, Facebook, and other organizations.

ERNIE-ViL: built on scene graph knowledge


When people see the figure above, they first notice the objects, attributes, and relationships in it. Objects such as "car", "person", "cat", and "house" are the basic elements of the scene; object attributes such as "the cat is white" and "the car is brown" describe the objects more precisely; and spatial and semantic relationships between objects, such as "the cat is on top of the car" and "the car is in front of the house", connect the objects in the scene. Together, objects, attributes, and relationships form the fine-grained semantics that describe a visual scene.
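As a concrete illustration, the scene described above can be written down as a simple scene graph structure; the field names below are illustrative assumptions, not taken from any ERNIE-ViL code:

```python
# A minimal scene-graph representation of the example scene above.
scene_graph = {
    "objects": ["car", "person", "cat", "house"],
    "attributes": [
        ("cat", "white"),   # "the cat is white"
        ("car", "brown"),   # "the car is brown"
    ],
    "relations": [
        ("cat", "on top of", "car"),
        ("car", "in front of", "house"),
    ],
}
```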

Based on this observation, Baidu researchers integrated scene graphs carrying this prior knowledge into multimodal pre-training, modeling fine-grained semantic associations between the visual and language modalities and learning joint representations that encode fine-grained semantic alignment.

As shown in the figure below, ERNIE-ViL introduces three scene graph prediction tasks for multimodal pre-training: object prediction, attribute prediction, and relationship prediction (a code sketch follows the list).

  • Object prediction: objects in the scene graph, such as "house" in the figure, are randomly selected and the corresponding words in the sentence are masked; the model predicts the masked words from the text context and the image;

  • Attribute prediction: attribute-object pairs in the scene graph, such as "<dress, blue>" in the figure, are randomly selected and the attribute words are masked; the model predicts them from the object, the text context, and the image;

  • Relationship prediction: "object-relationship-object" triples, such as "<cat, on top of, car>" in the figure, are randomly selected and the relationship words are masked; the model predicts them from the corresponding objects, the text context, and the image.
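Below is a minimal sketch of how such masking targets might be constructed from a parsed scene graph. The function name, the word-level matching, and the masking probability are illustrative assumptions, not ERNIE-ViL's actual implementation:

```python
import random

MASK = "[MASK]"

def build_sg_prediction_targets(tokens, scene_graph, p=0.3):
    """Mask tokens that realize scene-graph nodes, so the model must
    recover them from the text context plus the image.

    tokens      -- tokenized caption, e.g. ["a", "white", "cat", "on", ...]
    scene_graph -- dict with "objects", "attributes", "relations"
                   (see the structure sketched earlier)
    Returns masked tokens and a list of (position, label, node_type) targets.
    """
    # Collect candidate words together with their scene-graph role.
    candidates = {w: "object" for w in scene_graph["objects"]}
    candidates.update({attr: "attribute" for _, attr in scene_graph["attributes"]})
    for _, rel, _ in scene_graph["relations"]:
        for w in rel.split():
            candidates[w] = "relation"

    masked, targets = list(tokens), []
    for i, tok in enumerate(tokens):
        if tok in candidates and random.random() < p:
            targets.append((i, tok, candidates[tok]))
            masked[i] = MASK
    return masked, targets
```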


Through these scene graph prediction tasks, ERNIE-ViL learns fine-grained semantic alignment across modalities, for example mapping "cat", "the car is brown", and "the cat is on top of the car" to the corresponding regions of the image.

In addition to the scene graph prediction tasks above, ERNIE-ViL also uses masked language modeling, masked region prediction, image-text matching, and other pre-training tasks.
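Conceptually, the overall pre-training objective combines all of these task losses; a minimal sketch, assuming a simple weighted sum (the weights and task names here are illustrative, not values from the paper):

```python
def pretraining_loss(losses, weights=None):
    """Combine per-task pre-training losses into one scalar.

    losses  -- dict such as {"mlm": ..., "masked_region": ...,
               "image_text_match": ..., "scene_graph": ...}
    weights -- optional per-task weights; defaults to 1.0 each.
    """
    weights = weights or {}
    return sum(weights.get(name, 1.0) * value for name, value in losses.items())
```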

Experimental results

ERNIE-ViL was evaluated on multimodal downstream tasks such as visual commonsense reasoning and visual question answering.

In addition to its SOTA result on visual commonsense reasoning, ERNIE-ViL also set new SOTA results on visual question answering, cross-modal image retrieval, cross-modal text retrieval, and referring expression comprehension.

Referring expression comprehension (RefCOCO+) asks the model to locate the image region described by a given natural language expression. Because the task hinges on fine-grained cross-modal semantic alignment between natural language phrases and image regions, it probes how fine-grained the semantics of the joint representation are. On the task's two test sets (testA and testB), ERNIE-ViL improved on the previous best results by more than 2.0 percentage points.
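Referring expression comprehension is conventionally scored by whether the predicted region overlaps the ground-truth box with an intersection-over-union (IoU) above 0.5. A minimal sketch of that standard check (not ERNIE-ViL-specific code):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def refexp_accuracy(predictions, ground_truths, threshold=0.5):
    """Fraction of predicted regions matching ground truth at IoU > threshold."""
    hits = sum(iou(p, g) > threshold for p, g in zip(predictions, ground_truths))
    return hits / len(predictions)
```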


Visual question answering (VQA) gives the model an image and a textual question and asks it to produce the answer. The task requires deeper understanding and reasoning over both text and image, and its questions involve fine-grained semantics (objects, object attributes, relationships between objects), testing how deeply the model understands the scene. ERNIE-ViL achieved the best single-model result on this task with a score of 74.93%.
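For context, the VQA benchmark scores an answer against ten human annotations, giving full credit when at least three annotators agree. A simplified sketch of that standard metric (the official version also averages over annotator subsets):

```python
def vqa_accuracy(predicted, human_answers):
    """Soft VQA accuracy: full credit if >= 3 of the 10 human
    annotators gave the predicted answer."""
    matches = sum(a == predicted for a in human_answers)
    return min(matches / 3.0, 1.0)
```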

Cross-modal image retrieval (IR) and cross-modal text retrieval (TR) are classic multimodal tasks: retrieve relevant text given an image, and relevant images given a text. At heart, both compute semantic similarity between the image and text modalities, requiring the model to capture both overall and fine-grained semantics. ERNIE-ViL improved R@1 on these two tasks by 0.56 and 0.2 percentage points respectively, setting new SOTA results.
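R@1 (Recall@1) here is the fraction of queries whose correct match is ranked first. A generic sketch given a precomputed query-candidate similarity matrix (illustrative, not the paper's evaluation code):

```python
import numpy as np

def recall_at_k(similarity, k=1):
    """Recall@K for retrieval, given a (num_queries x num_candidates)
    similarity matrix where candidate i is the true match for query i."""
    # Indices of the top-k most similar candidates for each query.
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    hits = sum(i in top_k[i] for i in range(similarity.shape[0]))
    return hits / similarity.shape[0]
```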


Model analysis

Baidu researchers verified ERNIE-ViL's stronger cross-modal knowledge reasoning by constructing a multimodal cloze test: given aligned image-text pairs, object, relationship, or attribute words in the text are masked, and the model predicts them from the context and the image. The experiments show that ERNIE-ViL predicts these fine-grained semantic words (objects, attributes, relationships) better, with accuracy gains of 2.12%, 1.31%, and 6.00% respectively.
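A minimal sketch of how per-category cloze accuracy could be tallied from model predictions (the data format is an illustrative assumption):

```python
from collections import defaultdict

def cloze_accuracy_by_category(results):
    """results -- iterable of (category, predicted_token, gold_token),
    where category is "object", "attribute", or "relation".
    Returns per-category prediction accuracy."""
    correct, total = defaultdict(int), defaultdict(int)
    for category, predicted, gold in results:
        total[category] += 1
        correct[category] += (predicted == gold)
    return {c: correct[c] / total[c] for c in total}
```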


The paper also gives examples of the cloze test. As the figure below shows, ERNIE-ViL predicts the masked objects, attributes, and relationships more accurately, while the baseline model often captures only the part of speech of the original word and struggles to predict the specific word.


Conclusion

Enabling machines to perceive, read, and understand their environment is one of the important goals of artificial intelligence, and the first step toward it is giving machines multimodal semantic understanding. ERNIE-ViL, Baidu's knowledge-enhanced multimodal model, integrates scene graph knowledge into multimodal pre-training for the first time and set new records on five tasks including visual question answering and visual commonsense reasoning, offering a new direction for research on multimodal semantic understanding. Beyond these breakthroughs on public datasets, ERNIE-ViL is also gradually being deployed in real industrial applications. Baidu will continue deeper research and application in this field so that the technology delivers greater commercial and social value.

Baidu Natural Language Processing (NLP) takes "understand language, possess intelligence, change the world" as its mission, developing core natural language processing technology and building leading technology platforms and innovative products that serve users worldwide and make a complex world simpler.
