MuKEA: Multimodal Knowledge Extraction and Accumulation for Knowledge-based Visual Question Answering


Title: MuKEA: Multimodal Knowledge Extraction and Accumulation for Knowledge-based Visual Question Answering

Source: CVPR 2022

1、 Problem statement

Knowledge-based visual question answering (KB-VQA) requires a model to draw on external knowledge in order to understand open-ended, cross-modal scenes.

Existing work mainly retrieves relevant knowledge from structured knowledge graphs such as ConceptNet and DBpedia, or from unstructured or semi-structured sources such as Wikipedia and Visual Genome. These knowledge bases offer high-quality knowledge built through large-scale manual annotation, but they are purely textual: they contain only facts expressed as first-order predicates or natural-language descriptions. Such knowledge bases therefore struggle to represent the high-order predicates and multimodal knowledge needed to answer complex questions, and models that rely on them lack the visual understanding these questions require.


Few studies address how to construct vision-relevant, interpretable multimodal knowledge for VQA scenarios.

Objective: learn a comprehensive knowledge representation that covers images, questions, answers and other multimodal information directly from VQA datasets, without relying on an external text-based knowledge base.

2、 Main model


The paper presents MuKEA, a multimodal knowledge extraction and accumulation framework for KB-VQA. Its core idea is to accumulate multimodal knowledge with complex relations by observing VQA samples, independently of any existing knowledge base, and to perform interpretable reasoning over this self-accumulated knowledge.


(1) An explicit triplet representation of multimodal knowledge units is proposed.

Head entity: the embedding of the visual object that the question refers to

Tail entity: the embedding of the fact answer

Relation: the implicit relation between the image and the question

(2) Three loss functions are proposed to learn the triplet representation from coarse to fine.

(3) Building on this, a pre-training and fine-tuning strategy gradually accumulates multimodal knowledge from out-of-domain VQA samples (VQA 2.0) and in-domain VQA samples for interpretable reasoning.

2.1 Multimodal knowledge triplet extraction

(h, r, t): h contains the visual content in the image that the question focuses on, t is the representation of the answer to the given question-image pair, and r describes the implicit relation between h and t, which carries multimodal information.

Image and question encoding: because pre-trained vision-language models are good at modeling both intra-modal and cross-modal implicit correlations, the pre-trained LXMERT model is used to encode the question and the image, and the multimodal knowledge triplets are then extracted from its outputs.


Step1: apply Faster R-CNN to image \(i\) to detect a set of objects \(O=\left\{o_i\right\}_{i=1}^K\) \(\left(K=36\right)\), and represent each object by a visual feature vector \(f_i\) (2048-dimensional) and a spatial feature vector \(b_i\) (4-dimensional).

Step2: tokenize the question Q with WordPiece to obtain a sequence of D tokens.

Step3: feed the visual features \(f_i\) and \(b_i\) into the pre-trained LXMERT to obtain the visual embeddings of the objects in O, denoted \(V\in R^{K\times d_v}\) \(\left(d_v=768\right)\); similarly, obtain the embeddings of the token sequence, denoted \(Q\in R^{D\times d_v}\).

Head entity extraction: the head entity is the visual content in the image that is most relevant to the question.

Step1: compute the similarity between each object in the image and each token in the question to obtain the object-question similarity matrix A:
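
One form consistent with this description, where the projection matrices \(W_v\) and \(W_q\) are assumed symbols that these notes do not define, is:

\[A = \left(V W_v\right)\left(Q W_q\right)^{\top} \in R^{K\times D}\]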


Step2: use attention to select the most relevant visual content. The relevance of each object to the question is the row-wise maximum of A:
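
Written out, with \(A_{i,j}\) taken to be the similarity between object \(o_i\) and the j-th question token:

\[a_i^{v-q} = \max_{j} A_{i,j}\]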


Then use the attention scores \(a_i^{v-q}\) to select the most relevant object as the head entity. Gumbel-softmax is used here to obtain an approximately one-hot categorical distribution. The attention weight of object \(o_i\) is computed as follows:
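
In the standard Gumbel-softmax form, which presumes positive scores so that the logarithm is defined, and with \(\alpha_i\) as an assumed name for the attention weight, this would read:

\[\alpha_i = \frac{\exp\left(\left(\log a_i^{v-q} + g_i\right)/\tau\right)}{\sum_{j=1}^{K}\exp\left(\left(\log a_j^{v-q} + g_j\right)/\tau\right)}\]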

where \(\{g_i\}_{i=1}^K\) are independent and identically distributed random variables drawn from the standard Gumbel distribution, and τ is a temperature parameter.

Finally, the head entity representation h is obtained:
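
Assuming the head entity is the attention-weighted sum of the object embeddings \(v_i\) (the rows of V) passed through the FFN, with \(\alpha_i\) denoting the attention weight of object \(o_i\):

\[h = \mathrm{FFN}\left(\sum_{i=1}^{K}\alpha_i v_i\right)\]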


where V is the visual embedding of the objects and FFN denotes a feed-forward network with two fully connected layers.
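
The head entity steps can be sketched in NumPy as follows. This is a minimal illustration and not the authors' implementation: the projection matrices `W_v` and `W_q`, the ReLU inside the FFN, the layer sizes, and the use of the relevance scores directly as logits (which avoids taking the log of a possibly negative similarity) are all assumptions, and the inputs are random stand-ins for LXMERT outputs.

```python
import numpy as np

def gumbel_softmax(logits, tau, rng):
    """Relaxed one-hot weights: softmax((logits + g) / tau), g ~ Gumbel(0, 1)."""
    g = rng.gumbel(size=logits.shape)
    z = (logits + g) / tau
    z = z - z.max()                       # subtract the max for numerical stability
    w = np.exp(z)
    return w / w.sum()

def ffn(x, W1, b1, W2, b2):
    """Two fully connected layers; the ReLU in between is an assumption."""
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

def extract_head_entity(V, Q, W_v, W_q, ffn_params, tau, rng):
    A = (V @ W_v) @ (Q @ W_q).T           # (K, D) object-question similarity
    a = A.max(axis=1)                     # (K,) row-wise max over question tokens
    alpha = gumbel_softmax(a, tau, rng)   # (K,) approximately one-hot weights
    h = ffn(alpha @ V, *ffn_params)       # weighted object embedding -> FFN
    return h, alpha

# Toy inputs standing in for LXMERT outputs (random, for shape checking only).
rng = np.random.default_rng(0)
K, D, d_v, d_p, d_h = 36, 12, 768, 64, 128   # d_p and d_h are hypothetical sizes
V = rng.standard_normal((K, d_v))
Q = rng.standard_normal((D, d_v))
W_v = rng.standard_normal((d_v, d_p)) * 0.01
W_q = rng.standard_normal((d_v, d_p)) * 0.01
ffn_params = (
    rng.standard_normal((d_v, d_h)) * 0.01, np.zeros(d_h),
    rng.standard_normal((d_h, d_h)) * 0.01, np.zeros(d_h),
)
h, alpha = extract_head_entity(V, Q, W_v, W_q, ffn_params, tau=0.5, rng=rng)
print(h.shape, alpha.shape, int(alpha.argmax()))
```

Lowering `tau` pushes `alpha` closer to a one-hot vector, so the head entity approaches the embedding of a single object.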


Note on Gumbel-softmax: selecting the most likely category from a plain softmax always returns the same class and discards the rest of the probability information.

Gumbel softmax: for an n-dimensional probability vector π, add Gumbel noise to the log-probability of each component \(\pi_i\) of the corresponding discrete random variable, then sample:
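
The usual statement of this sampling step, with z as an assumed name for the one-hot sample, is:

\[z = \mathrm{one\_hot}\left(\arg\max_{i}\left[\log \pi_i + g_i\right]\right)\]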


where \(\{g_i\}_{i=1}^n\) are independent and identically distributed random variables drawn from the standard Gumbel distribution, whose CDF is:
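
For the standard Gumbel distribution (location 0, scale 1):

\[F(g) = e^{-e^{-g}}\]

so a sample can be drawn as \(g = -\log\left(-\log u\right)\) with \(u \sim \mathrm{Uniform}(0,1)\).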


Sampling with the argmax in this way is the Gumbel-Max trick. The argmax is not differentiable, so a softmax is used in its place; this is the Gumbel-softmax trick.


Reparameterizing with the Gumbel distribution makes the whole computation graph differentiable, and the resulting samples closely follow the true categorical distribution.
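
A small NumPy sketch can make the two tricks concrete. It draws many Gumbel-Max samples from an example vector `pi` (the values are arbitrary) to show that their frequencies track `pi`, and then draws one relaxed Gumbel-softmax sample. The function names are illustrative, not taken from the paper.

```python
import numpy as np

def gumbel_max_sample(pi, rng):
    """Exact categorical sample: argmax_i (log pi_i + g_i)."""
    g = rng.gumbel(size=pi.shape)
    return int(np.argmax(np.log(pi) + g))

def gumbel_softmax_sample(pi, tau, rng):
    """Differentiable relaxation: softmax((log pi + g) / tau)."""
    g = rng.gumbel(size=pi.shape)
    z = (np.log(pi) + g) / tau
    z = z - z.max()                      # numerical stability
    y = np.exp(z)
    return y / y.sum()

rng = np.random.default_rng(0)
pi = np.array([0.1, 0.6, 0.3])           # arbitrary example distribution
n = 20000

# Empirical frequencies of Gumbel-Max samples should approximate pi.
draws = [gumbel_max_sample(pi, rng) for _ in range(n)]
freq = np.bincount(draws, minlength=len(pi)) / n

# One relaxed sample: a probability vector that sharpens as tau decreases.
y = gumbel_softmax_sample(pi, tau=0.1, rng=rng)
print(freq, y)
```

The hard sample is an index, whereas the relaxed sample is a vector of weights through which gradients can flow.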

Relation extraction: a relation in the multimodal knowledge graph is defined as the complex implicit relation between the observed instantiated object and the answer. The multimodal representation is taken from the [CLS] token and passed through an FFN layer to obtain the relation embedding, denoted r.

Tail entity extraction: the tail entity is defined as the answer in an (image, question, answer) sample. During training, the ground-truth answer is set as the tail entity and its representation t is learned. At inference, the KB-VQA task is cast as a multimodal knowledge graph completion problem, and the predicted optimal tail entity is the answer.

2.2 Triplet representation learning

Triplet TransE Loss: given an image-question pair, let A+ and A− denote the sets of correct and incorrect answers, and let h and r denote the extracted head entity and relation representations. The distance between \(h+r\) and each t ∈ A+ should be smaller than the distance between \(h + r\) and each t ∈ A− by at least a margin γ:
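
Written as a hinge loss, with \(d(\cdot,\cdot)\) a distance function whose exact choice these notes do not specify, \([x]_+ = \max(x, 0)\), and \(L_{TransE}\) an assumed name for the term:

\[L_{TransE} = \sum_{t^{+}\in A^{+}}\sum_{t^{-}\in A^{-}}\left[\gamma + d\left(h+r,\ t^{+}\right) - d\left(h+r,\ t^{-}\right)\right]_{+}\]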


Triplet Consistency Loss: the Triplet TransE loss has a drawback. Once the margin is satisfied, that is, once the distance to a negative answer exceeds the distance to the positive answer by at least γ, the loss is zero and the model stops learning from the triplet. To push the embedding of t further toward the strict topological relation \(h + r = t\), mean squared error (MSE) is applied to the positive samples:
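
In symbols, with \(t^{+}\) the embedding of a correct answer and \(L_{Tri}\) an assumed name for this term:

\[L_{Tri} = \mathrm{MSE}\left(h + r,\ t^{+}\right)\]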


Semantic Consistency Loss: to narrow the semantic heterogeneity gap between the tail entity embedding and the head entity and relation embeddings, the triplet is classified over the tail entities with a softmax, and the negative log-likelihood loss is optimized:
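
One form consistent with this description, assuming the classifier scores each tail entity by the dot product between its embedding (a column of the tail lookup table \(T\)) and \(h + r\), and using \(L_{Sem}\) as an assumed name:

\[P\left(t^{+}\right) = \mathrm{softmax}\left(T^{\top}\left(h + r\right)\right)_{t^{+}}, \qquad L_{Sem} = -\log P\left(t^{+}\right)\]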


Final loss function:
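
Assuming the final objective is the unweighted sum of the three terms (no weights are given in these notes):

\[L = L_{TransE} + L_{Tri} + L_{Sem}\]

The three terms can be sketched in NumPy as below. Euclidean distance, the dot-product classifier, and all function names are assumptions made for illustration. In the toy example \(h + r\) coincides with the correct tail embedding, so the first two terms vanish and only the classification term stays slightly positive.

```python
import numpy as np

def dist(x, T):
    """Euclidean distance between vector x and each row of T (assumed metric)."""
    return np.linalg.norm(T - x, axis=-1)

def transe_loss(hr, T, pos_idx, neg_idx, gamma):
    """Hinge loss over all (positive, negative) answer pairs."""
    d_pos = dist(hr, T[pos_idx])[:, None]     # (P, 1)
    d_neg = dist(hr, T[neg_idx])[None, :]     # (1, N)
    return np.maximum(gamma + d_pos - d_neg, 0.0).sum()

def consistency_loss(hr, T, pos_idx):
    """MSE between h + r and the positive tail embeddings."""
    return np.mean((T[pos_idx] - hr) ** 2)

def semantic_loss(hr, T, pos_idx):
    """Negative log-likelihood of the positive tails under softmax(T @ hr)."""
    logits = T @ hr
    logits = logits - logits.max()            # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum())
    return -log_prob[pos_idx].mean()

# Toy lookup table of 5 tail entities in 4 dimensions (hypothetical values).
T = 3.0 * np.eye(5, 4)
pos_idx = np.array([2])
neg_idx = np.setdiff1d(np.arange(len(T)), pos_idx)
h = np.array([0.0, 0.0, 1.0, 0.0])
r = np.array([0.0, 0.0, 2.0, 0.0])
hr = h + r                                    # equals T[2] exactly

l_transe = transe_loss(hr, T, pos_idx, neg_idx, gamma=1.0)
l_tri = consistency_loss(hr, T, pos_idx)
l_sem = semantic_loss(hr, T, pos_idx)
total = l_transe + l_tri + l_sem
print(l_transe, l_tri, l_sem, total)
```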


2.3 Training

A two-stage training strategy gradually accumulates multimodal knowledge:

(1) Pre-train on the VQA 2.0 dataset to accumulate basic knowledge; the questions of type "Other" in VQA 2.0 are used as factual knowledge for the pre-training task.

(2) Fine-tune on the training data of the downstream KB-VQA task to accumulate more complex, domain-specific multimodal knowledge.

2.4 Prediction

Answer prediction is treated as a multimodal knowledge graph completion problem. Given an image and a question, feed them into the network to obtain the head entity embedding \(h_{inf}\) and the relation embedding \(r_{inf}\). Compute the distance between \(h_{inf}+r_{inf}\) and each tail entity \(t_i\) in the lookup table T, and select the tail entity with the smallest distance as the predicted answer:
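
In symbols, with \(d(\cdot,\cdot)\) the same distance used in training and \(t_{inf}\) an assumed name for the prediction:

\[t_{inf} = \arg\min_{t_i \in T} d\left(h_{inf} + r_{inf},\ t_i\right)\]

A minimal NumPy sketch of this nearest-tail lookup follows. Euclidean distance is an assumption, and the answer vocabulary and embeddings are made-up toy values.

```python
import numpy as np

def predict_answer(h_inf, r_inf, T, answers):
    """Return the answer whose tail embedding is closest to h_inf + r_inf."""
    d = np.linalg.norm(T - (h_inf + r_inf), axis=1)   # distance to every tail
    best = int(np.argmin(d))
    return answers[best], d

# Hypothetical lookup table: one 2-d embedding per candidate answer.
answers = ["nylon", "canvas", "leather"]
T = np.array([[0.0, 0.0],
              [1.0, 1.0],
              [5.0, 5.0]])
h_inf = np.array([0.4, 0.4])
r_inf = np.array([0.5, 0.5])

pred, d = predict_answer(h_inf, r_inf, T, answers)
print(pred)
```

Because the tail embeddings are stored in a lookup table, inference reduces to a single distance computation against every candidate answer.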


3、 Experiment

3.1 Datasets


KRVQA: an unbiased, commonsense-based visual question answering dataset that covers knowledge-independent reasoning, knowledge-related reasoning, and multi-step reasoning over external knowledge.

3.2 Experimental comparison






MuKEA greatly improves accuracy on the "knowledge-independent" questions compared with other models, which suggests that even traditional visual questions need multimodal commonsense to connect low-level visual content with high-level semantics.

On the third category of two-step reasoning questions, MuKEA falls behind some models, because the answers to these questions are mostly relations, whereas the tail entities MuKEA predicts are mostly factual entities. (This is also a direction for future improvement.)

3.3 Ablation Experiment


3.4 Long tail effect analysis


The results indicate that multimodal knowledge generalizes well to long-tail knowledge.

3.5 Model interpretability


4、 Existing problems


Figure 1: nylon (√), canvas (×)

Figure 2: Kuwait Airlines (√), United Parcel Service (×)

(1) The datasets cover a limited range of training scenarios, so the model lacks sufficient multimodal knowledge.

(2) Some triplets fail to be extracted. Because the head entity and the relation are extracted in an unsupervised manner (via LXMERT), visually similar content leads to attention bias.

5、 Tips

The paper proposes a new knowledge-based visual question answering framework that focuses on extracting and accumulating multimodal knowledge rather than relying on an external knowledge base. It adopts a pre-training and fine-tuning strategy to accumulate multimodal knowledge gradually.

Future work could explore how to effectively combine the multimodal knowledge learned by MuKEA with existing knowledge bases.