Paper reading: Multimodal Graph Networks for Compositional Generalization in Visual Question Answering

Time: 2022-06-10

Title: Multimodal Graph Networks for Compositional Generalization in Visual Question Answering
Source: NeurIPS 2020, https://proceedings.neurips.cc/paper/2020/hash/1fd6c4e41e2c6a6b092eb13ee72bce95-Abstract.html
Code: https://github.com/raeidsaqur/mgn

1. Problem statement

Key issue: the compositional generalization problem

[Figure: illustration of the compositional generalization problem]

Example: in natural language, people can learn the meaning of a new word and then apply it in other linguistic contexts. If a person learns the meaning of a new verb 'dax', they can immediately infer the meaning of 'sing and dax'. Similarly, the test set may contain "combinations" of elements that never appeared together in the training set, even though the individual elements did. For example, the training set contains "red dogs" and "green cats", while the test set contains "red cats".

Problem: recent research shows that models fail to generalize to new inputs that are merely unseen combinations of elements already seen in the training distribution [6].

Conventional multimodal architectures use a convolutional neural network (CNN) to process the whole image into a single global representation (e.g., a vector), but such a global representation cannot capture fine-grained correlations between visual and linguistic elements [29].

Neuro-symbolic VQA methods (such as NMN, NS-VQA, and NS-CL) have achieved near-perfect scores on benchmarks such as CLEVR [28, 29]. However, even when the distribution of visual inputs remains unchanged (the input images stay the same), these models fail to generalize to novel combinations of linguistic structures (the questions change) [6]. A key reason is the lack of fine-grained representations of image and text information that would allow joint compositional reasoning over the visual and linguistic spaces.

2. Main idea

The paper proposes a graph-based method for learning multimodal representations, the Multimodal Graph Network (MGN), aimed at better generalization. The graph structure captures entities, attributes, and relationships, establishing a tighter coupling between concepts across modalities (e.g., image and text).

Motivation

[Figure: example image and accompanying statement]

Consider the image in the figure and the accompanying statement: "There is a yellow rubber cube behind the large green cylinder." To verify it, first find the green cylinder, then scan the space behind it for a yellow rubber cube. Note that 1) although other objects may be present (e.g., another ball), information about them can be abstracted away, and 2) a fine-grained correspondence must be established between the visual and linguistic inputs representing "yellow" and "cube".

Core idea: representing both text and image as graphs naturally couples the concepts of the two modalities more tightly and provides a compositional space for reasoning. Specifically, the image and text are first parsed into separate graphs, with object entities and attributes as nodes and relationships as edges. Then, a message-passing algorithm similar to those used in graph neural networks [16] derives a similarity factor matrix between node pairs across the two modalities. Finally, a graph-based aggregation mechanism generates a multimodal vector representation of the input.

Model details

[Figure: MGN model overview]

Part 1: Graph construction

Multimodal input: a tuple (s, t), where s is the source text input (e.g., a question or caption) and t is the corresponding target image.
Graph parser:
Input: tuple (s, t)
Output: corresponding object-centric graphs \(G_s=(V_s,A_s,X_s,E_s)\) and \(G_t=(V_t,A_t,X_t,E_t)\)
Here V is the set of all nodes in the graph, A is the adjacency matrix, X is the feature matrix of the nodes V of graph G, and E is the feature matrix of the edges of graph G.
Method:
For the input text s, an entity recognition module captures objects and attributes as graph nodes V; a relation matching module then captures relations between nodes as the edges of \(G_s\).

[Figure: parsing text into a graph]

For the image t, a pre-trained Mask R-CNN with a ResNet-50 FPN backbone segments the scene to obtain objects, their attributes, and position coordinates (x, y, z). Each detected object forms a separate node in \(G_t\).
After the nodes and edges of \(G_s\) and \(G_t\) are constructed, the feature matrices X and E are obtained by using word embeddings (of dimension D) from a pre-trained language model as the feature vectors of the nodes (objects, attributes) and edges (relationships) of the text graph. For the image scene graph, the object and attribute labels obtained from the parsed scene (from the Mask R-CNN pipeline) are fed to the same language model to obtain feature embeddings.
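To make the parser's output concrete, here is a minimal Python sketch of the object-centric graph structure \(G=(V,A,X,E)\) for the text side. The `embed` helper is a hypothetical stand-in for the pre-trained language model embeddings; the repo's actual extraction pipeline is not reproduced here.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Graph:
    nodes: list             # object/attribute labels, e.g. ["cube", "yellow"]
    adj: np.ndarray         # (|V|, |V|) adjacency matrix A
    node_feats: np.ndarray  # (|V|, D) node feature matrix X
    edge_feats: np.ndarray  # (|E|, D) edge (relation) feature matrix E

def embed(tokens, dim=300):
    """Hypothetical stand-in for pre-trained word embeddings."""
    rng = np.random.default_rng(0)
    return rng.standard_normal((len(tokens), dim)).astype(np.float32)

# "There is a yellow rubber cube behind the large green cylinder."
nodes = ["cube", "yellow", "rubber", "cylinder", "large", "green"]
adj = np.zeros((6, 6), dtype=np.float32)
adj[0, [1, 2]] = 1.0   # cube -- yellow, cube -- rubber (attribute edges)
adj[3, [4, 5]] = 1.0   # cylinder -- large, cylinder -- green (attributes)
adj[0, 3] = 1.0        # cube --(behind)-- cylinder (relation edge)
adj = np.maximum(adj, adj.T)  # treat the graph as undirected
G_s = Graph(nodes, adj, embed(nodes), embed(["behind"]))
```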
Graph matcher:
Input: graphs \(G_s=(V_s,A_s,X_s,E_s)\) and \(G_t=(V_t,A_t,X_t,E_t)\)
Output: a multimodal vector representation \({\vec{h}}_{s,t}\) (dimension 2D) that captures a latent joint representation of the source (text) nodes and the matching target (image) nodes.
Method:
Merge the two graphs, initializing each node's features as \(h_i^{\left(0\right)}=x_i\in X\). Then the message-passing algorithm of graph neural networks iteratively updates each node's vector representation by aggregating the representations of its neighbors. At the end of message passing, every node has received information from its neighbors.

[Figure: message passing between the two graphs]

The message-passing algorithm has two steps: 1. aggregate, 2. combine. After k iterations, a node's feature vector \(h_v^{\left(k\right)}\) captures the information in its k-hop neighborhood. In the standard formulation:

\[a_v^{\left(k\right)}=\mathrm{AGGREGATE}^{\left(k\right)}\left(\left\{h_u^{\left(k-1\right)}:u\in N\left(v\right)\right\}\right),\qquad h_v^{\left(k\right)}=\mathrm{COMBINE}^{\left(k\right)}\left(h_v^{\left(k-1\right)},a_v^{\left(k\right)}\right)
\]

For graph-level classification, the node features must be combined into a global feature. A readout function (summation or graph pooling) aggregates the node features from the final iteration to obtain the feature representation of the whole graph \(G_s\) or \(G_t\):

\[h_G=\mathrm{READOUT}\left(\left\{h_v^{\left(K\right)}:v\in G\right\}\right)
\]
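A minimal PyTorch sketch of one aggregate/combine layer plus a sum-pooling readout, assuming mean aggregation over neighbors and a linear combine step (the exact layer used in the MGN code may differ):

```python
import torch
import torch.nn as nn

class MessagePassing(nn.Module):
    """One GNN layer in the generic aggregate/combine form (a sketch)."""
    def __init__(self, dim):
        super().__init__()
        self.combine = nn.Linear(2 * dim, dim)

    def forward(self, H, A):
        # AGGREGATE: mean over neighbors, using adjacency matrix A
        deg = A.sum(dim=1, keepdim=True).clamp(min=1.0)
        agg = (A @ H) / deg
        # COMBINE: mix each node's previous state with its aggregate
        return torch.relu(self.combine(torch.cat([H, agg], dim=-1)))

def readout(H):
    # Sum pooling over nodes yields the whole-graph representation h_G
    return H.sum(dim=0)

# Toy usage: 6 nodes, D = 300, K = 2 rounds of message passing
D, K = 300, 2
H = torch.randn(6, D)
A = torch.eye(6)  # stand-in adjacency (self-loops only)
for layer in [MessagePassing(D) for _ in range(K)]:
    H = layer(H, A)
h_G = readout(H)  # shape (D,)
```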

Then, to project features from the text space into the visual space, the node representation matrices of the source and target graphs, \(H_{G_s}\) and \(H_{G_t}\), are used to compute a soft assignment matrix \(\mathrm{\Phi}\) (a similarity matrix):

\[\mathrm{\Phi}=H_{G_s}{H_{G_t}}^T\in\ R^{|V_s|\times|V_t|}
\]

Here the i-th row vector \(\mathrm{\Phi}_i\in R^{|V_t|}\) represents a probability distribution of the potential correspondence between node i of \(G_s\) and the nodes of \(G_t\) (it can be viewed as likelihood scores measuring how well nodes in the two graphs match). To obtain a discrete correspondence between source and target node features, Sinkhorn normalization (a regularization method) is applied to the similarity matrix so that it satisfies the rectangular doubly stochastic constraints \(\sum_{j\in V_t}\mathrm{\Phi}_{i,j}=1,\ \forall i\in V_s\) and \(\sum_{i\in V_s}\mathrm{\Phi}_{i,j}=1,\ \forall j\in V_t\).

[Figure: Sinkhorn normalization of the similarity matrix \(\mathrm{\Phi}\)]
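A short sketch of the Sinkhorn step: alternately normalizing rows and columns pushes the (exponentiated) similarity matrix toward the doubly stochastic constraints above. The iteration count and the softmax-style scaling are assumptions for illustration.

```python
import torch

def sinkhorn(phi, n_iters=10):
    """Alternate row/column normalization (Sinkhorn) sketch."""
    phi = phi.exp()  # make all entries positive before normalizing
    for _ in range(n_iters):
        phi = phi / phi.sum(dim=1, keepdim=True)  # rows sum to 1
        phi = phi / phi.sum(dim=0, keepdim=True)  # columns sum to 1
    return phi

# Toy usage: 6 text nodes vs. 4 image nodes, D = 300
H_s, H_t = torch.randn(6, 300), torch.randn(4, 300)
phi = sinkhorn(H_s @ H_t.T / 300 ** 0.5)  # scaled dot-product similarity
```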

Finally, given \(\mathrm{\Phi}\), one obtains a projection function from the source (text) latent space \(L(G_s)\) to the target (image) latent space \(L(G_t)\):

[Equation: projection function from \(L(G_s)\) to \(L(G_t)\)]

The final joint multimodal representation \(h_{s,t}\) is the concatenation \([h_s,{\vec{h}}_s^\prime]\), where \({\vec{h}}_s^\prime\) is the text representation \(h_s\) projected into the visual space via \(\mathrm{\Phi}\).
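Under one plausible reading of this fusion step (not the repo's exact code), each source node is projected into the visual space as a \(\mathrm{\Phi}\)-weighted mixture of target node features, and the pooled original and projected representations are concatenated:

```python
import torch

def fuse(phi, H_s, H_t):
    """Project text nodes into the visual space via phi, pool, concat."""
    H_s_proj = phi @ H_t            # (|V_s|, D): source nodes in visual space
    h_s = H_s.sum(dim=0)            # pooled source (text) representation
    h_s_proj = H_s_proj.sum(dim=0)  # pooled projected representation
    return torch.cat([h_s, h_s_proj], dim=-1)  # h_{s,t}, dimension 2D

phi = torch.softmax(torch.randn(6, 4), dim=1)  # stand-in for Sinkhorn output
h_st = fuse(phi, torch.randn(6, 300), torch.randn(4, 300))  # shape (600,)
```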

Part 2: Downstream tasks
Task 1: caption classification
In this task, given an image and a caption, the model must predict whether the caption is true (T) or false (F) in the context of the image. It tests the model's ability to handle changes in the spatial composition of images.

[Figure: caption classification example]

The images come from the CLEVR dataset [28], and the captions are generated from templates to produce correct and incorrect samples. To measure generalization, object attribute values are swapped at test time, and the model is evaluated on whether it can detect these novel attribute-value combinations without a drop in performance.
Model:
\(h_{s,t}\) is fed to a fully connected layer with a sigmoid activation for binary 0/1 classification (match/mismatch).
The standard binary cross-entropy is used as the loss function:

\[\mathcal{L}=-\frac{1}{N}\sum_{i=1}^{N}\left[y_i\log{\hat{y}}_i+\left(1-y_i\right)\log\left(1-{\hat{y}}_i\right)\right]
\]
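A minimal sketch of this classification head, with an assumed joint dimension of 2D = 600; `BCEWithLogitsLoss` folds the sigmoid into the loss for numerical stability:

```python
import torch
import torch.nn as nn

head = nn.Linear(600, 1)            # h_{s,t} has dimension 2D = 600 here
criterion = nn.BCEWithLogitsLoss()  # sigmoid + binary cross-entropy

h_st = torch.randn(8, 600)          # a batch of joint representations
labels = torch.randint(0, 2, (8, 1)).float()  # 1 = match, 0 = mismatch
loss = criterion(head(h_st), labels)
loss.backward()
```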

Task 2: VQA
CLOSURE dataset: generated from the CLEVR dataset. The question templates in CLOSURE are systematically constructed from the original language primitives so that the generated questions contain unseen combinations. Seven different templates are used, each an instance of one of the five broad CLEVR question types (count, existence, number comparison, attribute comparison, and query).

[Figure: CLOSURE question examples]

Model:
\(h_{s,t}\) is fed to an attention-based seq2seq model with an encoder-decoder structure.
A bidirectional LSTM [24] serves as the encoder. At time step i, the encoder takes the token \(q_i\) of the padded, variable-length question together with \(h_{s,t}\) as input:

\[x_i\Leftarrow\left[q_i,h_{s,t}\right]
\]

The input is then encoded with the bidirectional LSTM:

[Equation: bidirectional LSTM encoding]
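A sketch of the encoder as described: each embedded question token is concatenated with \(h_{s,t}\) and fed to a bidirectional LSTM. All sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, joint_dim, hidden = 100, 128, 600, 256
embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
encoder = nn.LSTM(emb_dim + joint_dim, hidden, bidirectional=True,
                  batch_first=True)

tokens = torch.randint(1, vocab_size, (1, 12))  # one padded question
h_st = torch.randn(1, joint_dim)                # joint representation h_{s,t}
# x_i <= [q_i, h_{s,t}]: broadcast h_{s,t} onto every token position
x = torch.cat([embed(tokens),
               h_st.unsqueeze(1).expand(-1, 12, -1)], dim=-1)
enc_out, _ = encoder(x)                         # (1, 12, 2 * hidden)
```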

The decoder is an LSTM network with an attention mechanism: given the previous output \(y_{t-1}\), the LSTM generates a vector \(o_t\), which is sent to the attention layer to obtain a weighted encoder context vector \(c_t\):

[Equation: attention over encoder states]

Finally, the decoder output \(o_t\) and the context \(c_t\) are fed to a fully connected layer with softmax activation to obtain the predicted symbol sequence \(y_t\). \(y_t\) is then used to answer the VQA question.

[Figure: decoder with attention]
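A sketch of a single decoder step with dot-product attention over the encoder states, matching the description above; the attention variant and sizes are assumptions, not the paper's exact choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden, enc_dim, out_vocab = 256, 512, 30   # enc_dim = 2 * encoder hidden
decoder = nn.LSTMCell(enc_dim, hidden)
attn_q = nn.Linear(hidden, enc_dim)         # map o_t into encoder space
proj = nn.Linear(hidden + enc_dim, out_vocab)

enc_out = torch.randn(1, 12, enc_dim)   # encoder states (stand-in values)
y_prev = torch.randn(1, enc_dim)        # embedding of previous output y_{t-1}
o_t, _ = decoder(y_prev)                # LSTM produces o_t
scores = torch.bmm(enc_out, attn_q(o_t).unsqueeze(-1)).squeeze(-1)
alpha = F.softmax(scores, dim=-1)       # attention weights over time steps
c_t = torch.bmm(alpha.unsqueeze(1), enc_out).squeeze(1)   # context vector
y_t = proj(torch.cat([o_t, c_t], dim=-1)).argmax(dim=-1)  # predicted symbol
```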

3. Experiments

Task 1: caption classification

[Table: Task 1 (caption classification) results]

Note: dataset B contains novel, unseen combinations of objects and attributes, used to test generalization.

Task 2: VQA

[Table: Task 2 (VQA on CLOSURE) results]

Note: supervised pre-training and fine-tuning play an important role in effective learning.

[Table: per-template results]

Note: MAC models perform strongly overall, but poorly on the logical-relation templates (embed_mat_spa, compare_mat). MGN, in contrast, performs well on all 7 templates.

4. Limitations

At present, MGN has been trained mainly on datasets of simple synthetic images. Future work could consider how to extend it to larger, more natural image datasets.
