Title: Multimodal Graph Networks for Compositional Generalization in Visual Question Answering

Source: NeurIPS 2020, https://proceedings.neurips.cc/paper/2020/hash/1fd6c4e41e2c6a6b092eb13ee72bce95-Abstract.html

Code: https://github.com/raeidsaqur/mgn

## 1. Problem Statement

Key challenge: the **compositional generalization problem**.

Example: in natural language, people can learn the meaning of a new word and immediately apply it in other linguistic contexts. If a person learns the meaning of a new verb "dax", they can at once infer the meaning of "sing and dax". Similarly, the test set may contain "combinations" of elements that never appeared together in the training set, even though each element individually did. For example, the training set contains "red dogs" and "green cats", while the test set contains "red cats".

Problem: recent research shows that existing models fail to generalize to new inputs that are merely unseen combinations of elements already present in the training distribution [6].

Conventionally, multimodal architectures use a convolutional neural network (CNN) to process the whole image into a single global representation (e.g., a vector), but such a representation cannot capture fine-grained cross-modal correspondences [29].

Neuro-symbolic VQA methods (such as NMN, NS-VQA, and NS-CL) have achieved near-perfect scores on benchmarks such as CLEVR [28, 29]. However, even when the distribution of visual inputs is unchanged (the images stay the same), these models fail to generalize to novel combinations of linguistic structure (the questions change) [6]. A key reason is the lack of fine-grained representations of image and text that would allow joint compositional reasoning across the visual and linguistic spaces.

## 2. Main Idea

A graph-based learning method for multimodal representation is proposed: the **multimodal graph network (MGN)**, which targets better generalization. The graph structure captures entities, attributes, and relationships, establishing a tighter coupling between concepts from different modalities (such as image and text).

**Motivation:**

Consider the image in the figure and the associated question: "There is a yellow rubber cube behind the large green cylinder." To answer it, first locate the green cylinder, then scan the space behind it for the yellow rubber cube. Two points stand out: 1) other objects (e.g., another ball) may be present, but the information about them can be abstracted away; 2) a fine-grained correspondence must be established between the visual and linguistic inputs representing "yellow" and "cube".

**Core idea**: representing both text and image as **graphs** naturally couples concepts across the two modalities more tightly and provides a compositional space for reasoning. Concretely, the image and text are first parsed into **separate graphs**, with object entities and attributes as nodes and relationships as edges. Then a **message-passing algorithm** similar to those used in graph neural networks [16] derives a similarity matrix between node pairs of the two modalities. Finally, a graph-based **aggregation mechanism** produces a **multimodal vector representation** of the input.

**Model details:**

**Part 1: Graph construction**

Multimodal input example: a tuple (s, t), where s is the source text input (e.g., a question or caption) and t is the corresponding target image.

**Graph parser:**

Input: tuple (s, t)

Output: the corresponding object-centric graphs \(G_s=(V_s,A_s,X_s,E_s)\) and \(G_t=(V_t,A_t,X_t,E_t)\).

Here V is the set of all nodes in the graph, A is the graph's adjacency matrix, X is the feature matrix of all nodes V in graph G, and E is the feature matrix of all edges in graph G.

Specific methods:

For the input text s, an entity-recognition module captures objects and attributes as the graph nodes V, and a relation-matching module then captures the relationships between nodes as the edges of \(G_s\).
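As a toy illustration of text-to-graph parsing (the paper's parser uses learned entity-recognition and relation-matching modules; the hard-coded vocabularies below are hypothetical stand-ins), a rule-based sketch might look like:

```python
# Hypothetical mini-vocabularies in the CLEVR style; the real parser learns these.
OBJECTS = {"cube", "cylinder", "sphere", "ball"}
ATTRIBUTES = {"yellow", "green", "red", "rubber", "metal", "large", "small"}
RELATIONS = {"behind", "front", "left", "right"}

def parse_text_graph(sentence):
    """Return (nodes, edges): objects with attributes as nodes, relations as edges."""
    tokens = sentence.lower().replace(".", "").split()
    nodes, edges = [], []
    attrs, pending_rel, prev = [], None, None
    for tok in tokens:
        if tok in ATTRIBUTES:
            attrs.append(tok)                  # queue attributes for the next object
        elif tok in RELATIONS:
            pending_rel = tok                  # remember the relation word
        elif tok in OBJECTS:
            i = len(nodes)
            nodes.append({"object": tok, "attributes": attrs})
            attrs = []
            if pending_rel is not None and prev is not None:
                edges.append((prev, pending_rel, i))  # edge: prev --rel--> current
                pending_rel = None
            prev = i
    return nodes, edges
```

Running it on the motivating caption yields two nodes (cube, cylinder) linked by a "behind" edge.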

For the image t, a pre-trained Mask R-CNN with a ResNet-50 FPN backbone performs instance segmentation to obtain each object, its attributes, and its position coordinates (x, y, z); each detected object becomes a separate node in \(G_t\).

After the nodes and edges of \(G_s\) and \(G_t\) are constructed, the feature matrices X and E are obtained by using word embeddings (of dimension D, say) from a pre-trained language model as the feature vectors of the text graph's nodes (objects, attributes) and edges (relationships). For the image scene graph, the object and attribute labels obtained from the parsed scene (from the Mask R-CNN pipeline) are fed to the same language model to obtain feature embeddings.
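A minimal sketch of building the node feature matrix X, with a toy random embedding table standing in for the pre-trained language model (the 4-dimensional vectors and the vocabulary are hypothetical; the paper uses D-dimensional pre-trained embeddings):

```python
import numpy as np

# Toy embedding table standing in for a pre-trained language model.
rng = np.random.default_rng(0)
VOCAB = ["cube", "cylinder", "yellow", "green", "rubber", "large", "behind"]
EMB = {w: rng.standard_normal(4) for w in VOCAB}  # word -> 4-dim feature vector

def node_features(labels):
    """Stack one embedding row per node (or edge) label to form X (or E)."""
    return np.stack([EMB[w] for w in labels])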

**Graph matcher:**

Input: graphs \(G_s=(V_s,A_s,X_s,E_s)\) and \(G_t=(V_t,A_t,X_t,E_t)\)

Output: a multimodal vector representation \({\vec{h}}_{s,t}\) (of dimension 2D) that captures the joint latent representation of the source (text) nodes and the matched target (image) nodes.

Specific method:

Merge the two graphs, initializing each node's features as \(h_i^{(0)} = x_i \in X\). Then a graph-neural-network message-passing algorithm iteratively updates each node's vector representation by aggregating the representations of its neighbors. When message passing ends, every node has received information from its neighborhood.

Each message-passing iteration has two steps: 1. aggregate, 2. combine. After K iterations, a node's feature vector \(h_v^{(K)}\) captures the node information in its K-hop neighborhood of the graph.

For graph-level tasks, the node features must be turned into a global feature: summation or graph pooling (a readout function) combines the node features from the final iteration into a feature representation of the whole graph \(G_s\) or \(G_t\):

\[H_{G} = \mathrm{READOUT}\left(\left\{\,h_v^{(K)} : v \in G\,\right\}\right)\]
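The aggregate/combine loop and the readout can be sketched in a few lines of numpy (a simplified scheme with sum aggregation and a ReLU combine step; the weight shape and update rule are illustrative, not the paper's exact parameterization):

```python
import numpy as np

def message_passing(A, X, W, K=2):
    """K rounds of message passing.

    A: (n, n) adjacency matrix, X: (n, d) node features, W: (2d, d) weights.
    """
    H = X
    for _ in range(K):
        M = A @ H                                   # AGGREGATE: sum of neighbor features
        H = np.maximum(np.hstack([H, M]) @ W, 0.0)  # COMBINE: ReLU(W [h || m])
    return H

def readout(H):
    """Sum pooling over nodes gives the whole-graph representation H_G."""
    return H.sum(axis=0)
```

After K = 2 rounds, each row of H mixes information from its 2-hop neighborhood, and `readout` collapses the node features into a single graph vector.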

Then, to project text-space features into the visual space, the node representations of the source and target graphs, \(H_{G_s}\) and \(H_{G_t}\), are used to compute a soft assignment matrix \(\Phi\) (a similarity matrix):

\[\Phi = H_{G_s} H_{G_t}^\top \in \mathbb{R}^{|V_s| \times |V_t|}\]

Here the ith row vector \(\Phi_i \in \mathbb{R}^{|V_t|}\) represents the probability distribution of potential similarity between node i of \(G_s\) and the nodes of \(G_t\) (it can be viewed as a likelihood score measuring how well nodes in the two graphs match). To obtain a discrete similarity distribution between source and target node features, Sinkhorn normalization (an iterative row/column normalization) is applied to the similarity matrix so that it satisfies the rectangular doubly-stochastic constraints \(\sum_{j\in V_t}\Phi_{i,j}=1,\ \forall i\in V_s\) and \(\sum_{i\in V_s}\Phi_{i,j}=1,\ \forall j\in V_t\).
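Sinkhorn normalization itself is just alternating row and column normalization of a positive matrix; a minimal numpy sketch (the iteration count is arbitrary, and for non-square matrices the two constraints can only be met approximately):

```python
import numpy as np

def sinkhorn(S, n_iters=50):
    """Alternately normalize rows and columns of exp(S) so the result
    approaches a doubly-stochastic matrix."""
    P = np.exp(S)
    for _ in range(n_iters):
        P = P / P.sum(axis=1, keepdims=True)  # rows sum to 1
        P = P / P.sum(axis=0, keepdims=True)  # columns sum to 1
    return P
```

On a square similarity matrix, the iterations converge quickly to a matrix whose rows and columns both sum to 1 while preserving which entries dominate each row.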

Finally, given \(\Phi\), a projection function can be obtained from the source (text) latent space \(L(G_s)\) to the target (image) latent space \(L(G_t)\):

The final joint multimodal representation \(h_{s,t}\) is the concatenation \([h_s, {\vec{h}}_s^\prime]\), where \({\vec{h}}_s^\prime\) is the text representation projected into the visual latent space via \(\Phi\).
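One plausible instantiation of this step (a sketch under the assumption that the projection maps each source node to the \(\Phi\)-weighted combination of target node features, with sum readout; the paper may parameterize the projection differently):

```python
import numpy as np

def joint_representation(H_s, H_t, Phi):
    """Concatenate the pooled source features with the pooled Phi-projection
    of the source graph into the target (visual) latent space."""
    H_s_proj = Phi @ H_t             # (|V_s|, d): soft-matched target features per source node
    h_s = H_s.sum(axis=0)            # readout of the source graph
    h_s_proj = H_s_proj.sum(axis=0)  # readout of the projected features
    return np.concatenate([h_s, h_s_proj])  # 2d-dimensional h_{s,t}
```

With a perfect (identity) match matrix, the second half of \(h_{s,t}\) reduces to the pooled target features, which is the intended behavior.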

**Part 2: Downstream tasks**

**Task 1: Caption classification**

In this task, given an image and a caption, the model must predict whether the caption is true (T) or false (F) in the context of the image. It tests the model's ability to handle changes in the spatial composition of images.

The images come from the CLEVR dataset [28], and the captions are template-generated positive and negative samples. To measure generalization, object attribute values are swapped at test time, and the model is evaluated on whether it can detect these novel attribute-value combinations without a drop in performance.

Model:

\(h_{s,t}\) is fed to a fully connected layer with a sigmoid activation for binary 0/1 classification (match / mismatch).

Binary cross-entropy is used as the loss function:

\[\mathcal{L} = -\big[\,y \log \hat{y} + (1-y)\log(1-\hat{y})\,\big]\]
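The classification head and loss are standard; a compact numpy sketch (the weight vector `w` and bias `b` are placeholders for the learned layer):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def classify(h_st, w, b):
    """Fully connected layer + sigmoid -> match probability in (0, 1)."""
    return sigmoid(h_st @ w + b)

def bce_loss(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy, clipped for numerical stability."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
```

A confident correct prediction drives the loss toward zero; a maximally uncertain one (probability 0.5) costs exactly log 2 per example.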

**Task 2: VQA**

CLOSURE dataset: generated from the CLEVR dataset. The CLOSURE question templates are constructed systematically from the original language so that the generated questions contain unseen combinations. Seven templates are used, each belonging to one of the five broad CLEVR question types (count, exist, number comparison, attribute comparison, and query).

Model:

\(h_{s,t}\) is fed to an attention-based seq2seq model with an encoder-decoder structure.

A bidirectional LSTM [24] is used as the encoder. At time step i, the encoder takes the padded, variable-length question tokens \(q_i\) together with \(h_{s,t}\) as input.


The bidirectional LSTM then encodes this input sequence.

The decoder is an LSTM network with an attention mechanism: given the previous output token \(y_{t-1}\), the LSTM generates a vector \(o_t\), which is passed to the attention layer to obtain the weighted encoder context vector \(c_t\).

Finally, the decoder output \(o_t\) and the context \(c_t\) are fed to a fully connected layer with softmax activation to obtain the predicted symbol sequence \(y_t\), which is then used to answer the VQA question.
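The attention step at the heart of the decoder can be sketched with simple dot-product scoring (a minimal sketch; the paper's attention layer may use a different scoring function):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # shift for numerical stability
    return e / e.sum()

def attention_context(o_t, encoder_states):
    """Score each encoder state against the decoder output o_t,
    then return the weighted encoder vector c_t."""
    scores = encoder_states @ o_t   # (T,) alignment scores
    alpha = softmax(scores)         # attention weights over encoder steps
    return alpha @ encoder_states   # context vector c_t
```

When the decoder state aligns strongly with one encoder state, the context vector collapses to (approximately) that state.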

## 3. Experiments

Task 1:

Note: dataset B contains new, unseen combinations of objects and attributes to test generalization.

Task 2:

Note: supervised pre-training and fine-tuning play an important role in effective learning.

Note: MAC models perform strongly overall, but poorly on the logical-relationship templates (embed_mat_spa, compare_mat). MGN, on the other hand, performs well on all seven templates.

## 4. Open Problems

At present, MGN is trained mainly on datasets composed of simple images. Future work could consider extending it to larger and more natural image datasets.