Check It Again: Progressive Visual Question Answering via Visual Entailment


Although complex visual question answering models have achieved remarkable success, they tend to answer questions based only on surface correlations between questions and answers. Several methods have recently been developed to address this language-prior problem. However, most of them predict the correct answer from a single best output without checking its authenticity, and they explore only the interaction between images and questions, ignoring the semantics of candidate answers. In this paper, we propose a select-and-rerank (SAR) progressive framework based on visual entailment. Specifically, we first select candidate answers relevant to the question or the image, and then rerank the candidates via a visual entailment task that verifies whether the image semantically entails the synthetic statement of the question and each candidate answer. Experimental results show the effectiveness of our proposed framework: it establishes a new state-of-the-art accuracy on VQA-CP v2 with a 7.55% improvement.

1 Introduction

Visual question answering (VQA) is a multimodal task that requires a comprehensive understanding of both visual and textual information. Given an image and a question, a VQA system tries to find the correct answer in a large prediction space. Recently, several studies (Jabri et al., 2016; Agrawal et al., 2016; Zhang et al., 2016; Goyal et al., 2017) have shown that there are superficial correlations between answers and questions in VQA datasets. As a result, traditional VQA models tend to output the most common answer for the question category of the input sample (Selvaraju et al., 2019), no matter what image is given. Various methods have been developed to address this language-prior problem. However, by examining the characteristics of existing methods, we find that both general VQA models such as UpDn (Anderson et al., 2018) and LXMERT (Tan and Bansal, 2019) and models carefully designed against language priors, such as LMH (Clark et al., 2019) and SSL (Zhu et al., 2020), share a problem that cannot be ignored: they predict the correct answer from a single best output without checking its authenticity. In addition, these models do not make good use of the semantic information of answers to help alleviate language priors.

As shown in figure (a) below, the correct answer often appears in the top-N rather than the top-1. Meanwhile, given top-N candidate answers, the image can further verify the visual presence/absence of concepts based on the combination of the question and each candidate answer. As shown in figure (b) below, the question is about the color of the bat, and the two candidate answers are “yellow” and “black”. After checking the correctness of the candidates, the model can rule out the wrong answer “yellow”, which is inconsistent with the image, and confirm the correct answer “black”, which is consistent with it. However, utilizing answer semantics to reduce language priors has not been fully studied.

In this paper, we propose a select-and-rerank (SAR) progressive framework based on visual entailment. The intuition behind the framework comes from two observations. First, after excluding answers irrelevant to the question and image, the prediction space is reduced and we obtain a small number of candidate answers. Second, when a question and one of its candidate answers are bridged into a complete statement, the authenticity of the statement can be inferred from the content of the image. Therefore, after selecting several possible answers as candidates, we can use visual entailment, which operates on image-text pairs, to verify whether the image semantically entails the synthetic statement. According to the entailment degree, we can further rerank the candidate answers and give the model another chance to find the correct answer. In summary, our contributions are as follows:

  1. We propose a progressive select-and-rerank framework to address the language-prior problem, and empirically investigate a range of design choices for each module of the framework. Moreover, it is a general framework that can be easily combined with existing VQA models to further boost their abilities.
  2. We highlight the verification process between text and image, and formulate the VQA task as a visual entailment problem. This process makes full use of the interactive information among images, questions, and candidate answers.
  3. Experimental results show that our framework establishes a new state-of-the-art accuracy of 66.73%, outperforming existing methods by a large margin.

2 Related Work

Language-Priors Methods

Many methods have been proposed to address the language-prior problem of VQA models. They can be roughly divided into two categories: (1) Designing specific debiasing models to reduce biases. Most works in this line are ensemble-based methods (Ramakrishnan et al., 2018; Grand and Belinkov, 2019; Belinkov et al., 2019; Cadene et al., 2019; Clark et al., 2019; Mahabadi and Henderson, 2019); among them, LMH (Clark et al., 2019) reduces all biases between question-answer pairs by penalizing samples that can be answered without using image content. (2) Data augmentation to reduce biases. The main idea of these works (Zhang et al., 2016; Goyal et al., 2017; Agrawal et al., 2018) is to carefully construct more balanced datasets to overcome priors. For example, the recent method SSL (Zhu et al., 2020) first automatically generates a set of balanced question-image pairs and then introduces an auxiliary self-supervised task to exploit the balanced data. CSS (Chen et al., 2020a) balances the data by adding more complementary samples, generated by masking objects in the image or some keywords in the question. Based on CSS, CL (Liang et al., 2020) forces the model to utilize the relationship between complementary samples and original samples. Unlike SSL and CSS, which need no extra manual annotation, Mutant (Gokhale et al., 2020) uses extra object-name labels to locate key objects and keywords of the question in the image, which directly helps the model ground the textual concepts. However, the above methods only explore the interaction between the image and the question, ignoring the semantics of candidate answers. In this paper, we propose a progressive VQA framework, SAR, which achieves better interaction among questions, images, and answers.

Answer Re-ranking

Although answer re-ranking for VQA is still in its infancy, it has been widely studied for QA tasks such as open-domain question answering, where the model needs to answer questions based on a broad range of open-domain knowledge sources. Recent works (Wang et al., 2018b,a; Kratzwald et al., 2019) address this task in a two-stage manner: extract candidate answers from all passages, then focus on these candidates and rerank them to obtain the final answer. RankVQA (Qiao et al., 2020) introduced answer re-ranking to the VQA task. However, RankVQA still predicts from a large prediction space rather than from selected candidate answers.

3 Method

The figure below gives an overview of the proposed select-and-rerank (SAR) framework, which consists of a Candidate Answer Selecting module and an Answer Re-ranking module. In the Candidate Answer Selecting module, given an image and a question, we first use a current VQA model to obtain a candidate answer set consisting of the top-N answers; this module filters out answers irrelevant to the question. Next, in the Answer Re-ranking module, we formulate VQA as a VE task in which the image is the premise and the synthetic dense caption (Johnson et al., 2016), i.e., the combination of an answer and the question, is the hypothesis. We use the cross-modality pre-trained model LXMERT (Tan and Bansal, 2019) as the VE scorer to compute the entailment score of each image-caption pair, and the answer corresponding to the dense caption with the highest score is our final prediction.
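The two-stage procedure can be sketched as follows. This is a minimal illustration only: `cas_score` and `ve_score` are hypothetical callables standing in for the trained VQA model and the LXMERT-based VE scorer, and the caption builder uses the simple concatenation strategy.

```python
def select_and_rerank(image, question, answer_space, cas_score, ve_score, n=12):
    """Two-stage SAR inference: select top-N candidates, then rerank by VE score."""
    # Stage 1: Candidate Answer Selecting -- score every answer with the VQA model
    # and keep the N highest-scoring answers as candidates.
    ranked = sorted(answer_space, key=lambda a: cas_score(image, question, a),
                    reverse=True)
    candidates = ranked[:n]

    # Stage 2: Answer Re-ranking -- bridge each candidate with the question into
    # a dense caption and score its visual entailment against the image.
    def caption(ans):
        return f"{ans} {question}"  # simple answer-first concatenation

    return max(candidates, key=lambda a: ve_score(image, caption(a)))
```

With toy scorers where the VQA model slightly prefers "yellow" but the VE scorer entails "black", the reranking stage corrects the answer to "black".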

3.1 Candidate Answer Selecting

The Candidate Answer Selector (CAS) selects several answers from all possible answers as candidates, which shrinks the huge prediction space. Given a VQA dataset \(D=\{I_i,Q_i\}_{i=1}^M\) with \(M\) samples, where \(I_i\in I\) and \(Q_i\in Q\) are the image and question of the \(i\)-th sample, and \(A\) is the prediction space containing thousands of answers, CAS is essentially an \(|A|\)-way classifier. Given an input image \(I_i\) and question \(Q_i\), CAS outputs regression scores \(P(A|Q_i,I_i)\) over all answers; the choice of network architecture is free. Finally, CAS selects the top-N answers with the highest scores from \(A\) as candidates:

\[A_i^*=\mathrm{topN}\big(\mathrm{argsort}_{a\in A}\,P(a|Q_i,I_i)\big)\]
where \(N\) is a hyperparameter. The candidate answers \(A_i^*=[A_i^1,A_i^2,\dots,A_i^N]\) are then paired with each question and image to form a new dataset containing \(M*N\) instances:

\[D'=\{I_i,Q_i,A_i^n\}_{i=1,n=1}^{M,N}\]

where \(A_i^n\in A_i^*\). In this paper, we mainly use SSL as our CAS. We also conduct experiments to analyze the effect of different CAS choices and of different values of N.
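The construction of \(D'\) can be sketched with an illustrative helper; here `cas_probs` is an assumed per-sample mapping from answers to their CAS scores \(P(A|Q_i,I_i)\).

```python
def build_candidate_dataset(samples, cas_probs, n):
    """Build D' = {(I_i, Q_i, A_i^n)}: pair each sample with its top-N CAS answers.

    `samples` is a list of (image, question) pairs; `cas_probs[i]` maps each
    answer to its CAS score for sample i (names are illustrative).
    """
    dataset = []
    for (image, question), probs in zip(samples, cas_probs):
        # Sort answers by descending score and keep the N best.
        top_n = sorted(probs, key=probs.get, reverse=True)[:n]
        for answer in top_n:
            dataset.append((image, question, answer))
    return dataset  # contains M * N instances
```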

3.2 Answer Re-ranking

3.2.1 Visual Entailment

The visual entailment (VE) task was proposed by Xie et al. (2019). The premise is a real-world image \(P_{image}\) and the hypothesis is a text \(H_{text}\). Given a sample \((P_{image},H_{text})\), the goal of the VE task is to determine whether \(H_{text}\) can be concluded from the information in \(P_{image}\). The label of a sample is assigned according to the following conventions:

1. Entailment, if there is enough evidence in \(P_{image}\) to conclude that \(H_{text}\) is true.

2. Contradiction, if there is enough evidence in \(P_{image}\) to conclude that \(H_{text}\) is false.

3. Neutral, if there is not enough evidence in \(P_{image}\) to draw a conclusion about \(H_{text}\).

3.2.2 VQA As Visual Entailment

Each candidate answer in \(A_i^*\) and its question can be bridged into a complete statement, and the image can then verify the authenticity of each statement. More specifically, the visual presence of concepts (e.g., “black bat” / “yellow bat”) based on the combination of the question and a correct/wrong candidate answer can be entailed/contradicted by the image content. In this way, we achieve better interaction among questions, images, and answers.

Therefore, we formulate VQA as a VE problem in which the image \(I_i\) is the premise and the synthetic statement of an answer \(A_i^n\) in \(A_i^*\) and the question \(Q_i\), denoted as \((Q_i,A_i^n)\), is the hypothesis. For the same image, synthetic statements of different questions describe different regions of that image. Following Johnson et al. (2016), we also call the synthetic statement a “dense caption”. We use \(A_i^+\) to denote an answer \(A_i^n\) that is correct for question \(Q_i\), and \(A_i^-\) otherwise. There is enough evidence in \(I_i\) to prove that \((Q_i,A_i^+)\) is true, i.e., \(I_i\) semantically entails \((Q_i,A_i^+)\); and there is enough evidence in \(I_i\) to prove that \((Q_i,A_i^-)\) is false, i.e., \(I_i\) semantically contradicts \((Q_i,A_i^-)\). Note that there is no neutral label in our VE task; we only have two labels: entailment and contradiction.
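Under this two-way formulation, training-label assignment reduces to a simple membership check. A trivial sketch, where `correct_answers` is an assumed set of ground-truth answers for the question:

```python
def ve_label(answer, correct_answers):
    """Assign the two-way VE label: entailment for (Q, A+), contradiction for (Q, A-).

    There is no neutral label, since every candidate is either a correct
    answer (A+) or a wrong one (A-).
    """
    return "entailment" if answer in correct_answers else "contradiction"
```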

3.2.3 Re-Ranking based on VE

We rerank the dense captions via contrastive learning: \((Q_i,A_i^+)\) should be semantically more similar to image \(I_i\) than \((Q_i,A_i^-)\). The right part of the overall architecture illustrates this idea. The more semantically similar, the higher the entailment degree. We compute the visual entailment score between \(I_i\) and each \((Q_i,A_i^n)\), and rank the candidate answers \(A_i^*\) by this score. The top-ranked candidate is our final output.

Question-Answer Combination Strategy

Answer information is meaningful only when combined with the question. We encode the combination of question and answer text to obtain the joint concept. We design three question-answer combination strategies, \(R\), \(C\), and \(R\rightarrow C\), to combine a question and an answer into a dense caption \(C_i\).

R: Replace the question-category prefix with the answer

The prefix of each question is its question category, such as “where” or “what color”. For example, given the question “How many flowers in the vase?” with answer “8”, the question category is “how many” and the dense caption is “8 flowers in the vase”. Similarly, “No a crosswalk” is generated from the question “Is this a crosswalk?” and the answer “no”. We first build a dictionary of all question categories in the training set, and then use a forward maximum-matching algorithm to determine the question category of each test sample.
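A minimal sketch of strategy R, with forward maximum matching implemented as longest-prefix-first lookup over an assumed category dictionary (the function and variable names are illustrative):

```python
def question_category(question, categories):
    """Forward maximum matching: try the longest known category prefix first."""
    q = question.lower()
    for cat in sorted(categories, key=len, reverse=True):
        if q.startswith(cat):
            return cat
    return ""

def strategy_r(question, answer, categories):
    """R: replace the question-category prefix with the answer text."""
    cat = question_category(question, categories)
    rest = question[len(cat):].strip().rstrip("?")
    return f"{answer} {rest}".strip()
```

For example, with a category dictionary containing "how many" and "is this", the two questions from the text map to "8 flowers in the vase" and "no a crosswalk".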

C: Concatenate the question and answer directly

For the two examples above, the dense captions are “8 how many flowers in the vase?” and “no is this a crosswalk?”. The dense captions produced by concatenation are effectively rhetorical questions. We deliberately put the answer text before the question text so that the answer is not removed when dense captions are trimmed to the same length.
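Strategy C is a plain concatenation with the answer placed first, precisely so that length trimming cuts from the question end rather than dropping the answer (an illustrative one-liner):

```python
def strategy_c(question, answer):
    """C: prepend the answer to the full question so it survives truncation."""
    return f"{answer} {question}"
```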


R→C: Use R in training and C in testing

We first use strategy R during training to prevent the model from paying excessive attention to the co-occurrence between question categories and answers, and then use strategy C during testing to introduce more information for inference.

With any of the above strategies, we combine \(Q_i\) with each answer in \(A_i^*\) to generate dense captions \(C_i^*\). We thereby obtain a dataset with \(M*N\) instances, \(D''=\{I_i,C_i^n\}_{i=1,n=1}^{M,N}\), for the following VE task.

VE Scorer

We use the pre-trained model LXMERT to score the visual entailment of \((I_i,C_i^n)\). LXMERT encodes the image and the caption text in two separate streams. The two streams then interact through co-attentional transformer layers. In the text stream, the dense caption is encoded into a high-level concept; the visual representations from the visual stream can then verify the visual presence or absence of that high-level concept.

We denote the VE score of the \(n\)-th candidate caption of the \(i\)-th image as \(\sigma(Trm(I_i,C_i^n))\), where \(Trm()\) is the 1-dimensional output of the dense layer on top of LXMERT and \(\sigma\) denotes the sigmoid function. The higher the score, the higher the entailment degree. We optimize the model parameters by minimizing the multi-label soft loss:

\[L_{VE}=-\frac{1}{M*N}\sum_{i=1}^{M}\sum_{n=1}^{N}\Big[t_i^n\log\big(\sigma(Trm(I_i,C_i^n))\big)+(1-t_i^n)\log\big(1-\sigma(Trm(I_i,C_i^n))\big)\Big]\]

where \(t_i^n\) is the soft target score of the \(n\)-th answer.
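The multi-label soft loss is a binary cross-entropy against soft targets. A plain-Python sketch of its shape (in practice it would be computed with tensors; `scores` stands for the sigmoid outputs \(\sigma(Trm(I_i,C_i^n))\) and `targets` for the soft labels \(t_i^n\)):

```python
import math

def ve_soft_loss(scores, targets):
    """Binary cross-entropy over soft targets, averaged over M*N caption scores.

    `scores[i][n]` is the VE score in (0, 1) for caption n of image i;
    `targets[i][n]` is the soft label t_i^n.
    """
    m = len(scores)
    n = len(scores[0])
    total = 0.0
    for s_row, t_row in zip(scores, targets):
        for s, t in zip(s_row, t_row):
            # Soft-label BCE term for one image-caption pair.
            total += -(t * math.log(s) + (1 - t) * math.log(1 - s))
    return total / (m * n)
```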

Combination with Language-Priors Method

After Candidate Answer Selecting, the answer space is reduced from all possible answers to the top-N candidates. Although some irrelevant answers are filtered out, the dataset \(D''\) for the VE system is still biased. Therefore, we can optionally apply existing language-prior methods to our framework to further reduce language priors. Taking SSL as an example, we adapt the loss function of its self-supervised task to our framework:


\[L_{self}=\frac{\alpha}{M*N}\sum_{i=1}^{M}\sum_{n=1}^{N}P(I_i',C_i^n)\]

where \((I_i',C_i^n)\) denotes an irrelevant image-caption pair and \(\alpha\) is a down-weighting factor. \(P(I_i',C_i^n)\) can be regarded as the relevance confidence of \((I_i',C_i^n)\). We then reformulate the overall loss function:

\[L=L_{VE}+L_{self}\]
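A sketch of an overall objective of this shape. The names are hypothetical: `l_ve` is an already-computed VE loss value and `relevance_scores` is an assumed list of confidences \(P(I_i',C_i^n)\) on irrelevant pairs; the exact weighting follows the SSL formulation.

```python
def combined_loss(l_ve, relevance_scores, alpha):
    """Overall loss L = L_VE + L_self, where L_self down-weights (by alpha) the
    average relevance confidence assigned to irrelevant image-caption pairs."""
    l_self = alpha * sum(relevance_scores) / len(relevance_scores)
    return l_ve + l_self
```

Minimizing the second term pushes the scorer to assign low confidence to mismatched image-caption pairs, which counteracts caption-only shortcuts.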
3.3 Inference Process

Question Type Discriminator

Intuitively, most “yes/no” questions can be answered with “yes” or “no”, so there is no need to provide many candidate answers for them during the test stage. We therefore propose a Question Type Discriminator (QTD) to determine the question type and accordingly set a different number of candidate answers \(N'\). Specifically, we roughly group question types (which include “yes/no”, “number”, and “other”) into yes/no and non-yes/no. A GRU binary classifier is trained with cross-entropy loss and evaluated via 5-fold cross-validation on the training split of each dataset. The trained QTD model, which reaches an accuracy of about 97%, is then deployed as an offline module in the test stage. We further investigate the effect of \(N'\) on each question type in the next section.
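A sketch of how QTD routes \(N'\) at test time. The keyword heuristic below merely stands in for the trained GRU classifier; the prefix list and default values of \(N'\) are illustrative assumptions:

```python
def is_yes_no_question(question):
    """Stand-in for the trained GRU question-type discriminator: a crude
    auxiliary-verb prefix check, for illustration only."""
    prefixes = ("is ", "are ", "was ", "were ", "does ", "do ", "did ",
                "has ", "have ", "can ", "could ", "will ", "would ")
    return question.lower().startswith(prefixes)

def candidates_for(question, n_yes_no=2, n_other=12):
    """Set a different number of candidate answers N' per question type."""
    return n_yes_no if is_yes_no_question(question) else n_other
```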

Final Prediction

In the inference stage, we choose the best dense caption \(\widehat{C_i}\) among all candidates \(C_i^*\) for the \(i\)-th image:

\[\widehat{C_i}=\mathrm{argmax}_{n\in N'}\,\sigma(Trm(I_i,C_i^n))\]

The answer \(\widehat{A_i}\) corresponding to \(\widehat{C_i}\) is the final prediction.

4 Experiment

4.1 Setting


Our models are trained and evaluated on the VQA-CP v2 (Agrawal et al., 2018) dataset, which is constructed by reorganizing the training and validation sets of VQA v2 (Goyal et al., 2017) so that each question category (65 categories according to the question prefix) has a different distribution in the training and test sets. VQA-CP v2 is therefore a natural choice for evaluating the generalizability of VQA models. Its questions fall into three types: “yes/no”, “number”, and “other”. Note that question type and question category (e.g., “what color”) are different. For completeness, we also evaluate our models on the VQA v2 validation set and compare the accuracy gap between the two datasets using the standard VQA evaluation metric (Antol et al., 2015).


We compare our method with the following baselines: UpDn (Anderson et al., 2018), AReg (Ramakrishnan et al., 2018), RUBi (Cadene et al., 2019), LMH (Clark et al., 2019), RankVQA (Qiao et al., 2020), SSL (Zhu et al., 2020), CSS (Chen et al., 2020a), CL (Liang et al., 2020), and LXMERT (Tan and Bansal, 2019). Most of them are designed for the language-prior problem, while LXMERT represents the recent trend of BERT-like pre-trained models (Li et al., 2019; Chen et al., 2020b; Li et al., 2020), which achieve the best performance on various downstream vision-and-language tasks (including VQA v2). Note that Mutant (Gokhale et al., 2020) uses extra object-name labels to ground textual concepts in the image; for a fair comparison, we do not compare with Mutant.

4.2 Implementation Details

In this paper, we mainly choose SSL as our CAS and set N = 12 and N = 20 for training. To extract image features, we follow previous work and use a pre-trained Faster R-CNN to encode each image as a set of 36 fixed objects with 2048-dimensional feature vectors. We use LXMERT's tokenizer to segment each dense caption into words. All dense captions are trimmed to the same length, 15 or 18, for the R and C question-answer combination strategies, respectively. In the Answer Re-ranking module, we incorporate the language-prior methods SSL and LMH into our proposed framework SAR, denoted SAR+SSL and SAR+LMH. Our models are trained on two TITAN RTX 24GB GPUs. We train SAR+SSL for 20 epochs with batch size 32, and SAR and SAR+LMH for 10 epochs with batch size 64. For SAR+SSL, we follow the same settings as the original paper (Zhu et al., 2020), except that we do not need to pre-train the model with the VQA loss before fine-tuning it with the self-supervised loss. We use the Adam optimizer with a learning rate of 1e-5.

For the Question Type Discriminator, we use 300-dimensional GloVe (Pennington et al., 2014) vectors to initialize word embeddings and feed them to a unidirectional GRU with 128 hidden units. When testing on VQA-CP v2, \(N'\) ranges from 1 to 2 for yes/no questions and from 5 to 15 for non-yes/no questions. When testing on VQA v2, \(N'\) ranges from 1 to 2 for yes/no questions and from 2 to 5 for non-yes/no questions.

4.3 Results and Analysis

4.3.1 Main Results

The performance on the two benchmarks, VQA-CP v2 and VQA v2, is shown below. We report the best results of SAR, SAR+SSL, and SAR+LMH among the three question-answer combination strategies. “Top-N” indicates how many candidate answers (selected by CAS) are fed into the Answer Re-ranking module for training. Our methods are evaluated with two settings of N (12 and 20).

From the results on VQA-CP v2 shown in the table, we can observe that: (1) Top-20 SAR+LMH establishes a new state-of-the-art accuracy of 66.73% on VQA-CP v2, beating the previous best method CL by 7.55%. Even without combining any language-prior method in the Answer Re-ranking module, our model Top-20 SAR outperforms CL by 6.26%. These results demonstrate the outstanding effectiveness of our proposed SAR framework. (2) SAR+SSL and SAR+LMH achieve better performance than SSL and LMH, which shows that SAR is compatible with current language-prior methods and can realize their full potential. (3) Compared with RankVQA, another re-ranking-based model, our method improves performance by 23.68%. This shows that our progressive select-and-rerank framework is superior to RankVQA, which uses answer re-ranking only as an auxiliary task. (4) Previous models do not generalize well across all question types: CL and LXMERT were previously the best on the “yes/no”, “Num”, and “Other” questions. In contrast, our model is not only comparable to the previous best on “yes/no” questions, but also improves the best performance on “Num” and “Other” questions by 12.45% and 3.65%, respectively. The excellent performance across all question types shows that our model makes significant progress toward a truly comprehensive VQA model.

We also evaluate our method on VQA v2, which is considered to have strong language biases. As shown in the table above, our method achieves the best accuracy of 70.63% among the baselines designed to overcome language priors, and is closest to the SOTA established by LXMERT, which is explicitly trained on this biased dataset. For completeness, we also compare the performance gap between the two datasets following the protocol of Chen et al. (2020a). Compared with most previous models, which suffer severe performance drops between VQA v2 and VQA-CP v2 (e.g., 27.93% for LXMERT), the performance drop of Top-20 SAR+LMH is significantly reduced to 2.49%, which demonstrates the effectiveness of our framework in further overcoming language biases. Although CSS achieves a smaller performance gap, it sacrifices performance on VQA v2. Meanwhile, as N increases from 12 to 20, our models achieve better accuracy on both datasets with a smaller performance gap. This shows that, unlike previous methods, ours can reduce language priors while maintaining excellent question-answering ability. Nevertheless, we believe that improving model generality and further turning the trade-off between overcoming language priors and answering questions into a win-win outcome is a promising direction for future research.

4.3.2 The Effect of N

From the figure below, we can observe that overall performance improves as N increases. The scores on “Num” and “Other” questions rise significantly, while the score on “yes/no” questions drops only slightly. We believe that SAR can achieve even better performance if N is increased appropriately. Due to resource constraints, the maximum N we use in this paper is 20.

4.3.3 The Effect of Different CAS

To find out the performance ceiling of CAS models, we examine the accuracy of three CAS models on the VQA-CP v2 test set. As shown in the figure, the top-3 accuracy of the three models is about 70% and the top-6 accuracy is about 80%, which ensures that CAS recalls enough correct answers. Therefore, the performance limitation of CAS is negligible.
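Top-N accuracy of a CAS model can be measured as simple top-k recall over the test set (an illustrative helper; `ranked_answers` is an assumed list of per-sample CAS rankings):

```python
def top_k_accuracy(ranked_answers, gold, k):
    """Fraction of samples whose gold answer appears in the top-k CAS list.

    `ranked_answers[i]` is the CAS answer ranking for sample i (best first);
    `gold[i]` is that sample's ground-truth answer.
    """
    hits = sum(1 for ranked, g in zip(ranked_answers, gold) if g in ranked[:k])
    return hits / len(gold)
```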

We also conduct experiments to study the effect of different CAS models on SAR. From the results shown in the table below, we can observe that: (1) Choosing a better VQA model as CAS does not guarantee better performance. For example, the performance based on UpDn is better than that based on LMH, even though LMH is a better VQA model than UpDn at overcoming language priors. This is because a good candidate answer selector has two requirements: (a) it should recall as many correct answers as possible; (b) under language biases, the wrong answers recalled by CAS during training should have surface correlations with the question that are as strong as possible. However, ensemble methods such as LMH are trained to pay more attention to samples that the question-only model cannot answer correctly. This severely reduces the recall rate of wrong answers with strong language priors, making the VE training data too easy and harming the model's ability to reduce language priors. (2) If CAS is the general VQA model UpDn rather than LMH or SSL, the improvement brought by combining a language-prior method in the Answer Re-ranking module is more obvious. (3) Even when we choose UpDn, the backbone of most current works, as our CAS and involve no language-prior method, SAR still achieves an accuracy 2.53% better than the previous SOTA model CL, which shows that our base framework already has an excellent ability to reduce language priors.

4.3.4 The Effect of Question-Answer Combination Strategies

From the results shown in Table 3, we can observe that: (1) Overall, R→C achieves the best performance across the three models. On average, R→C outperforms C by 2.02%, which shows that avoiding the co-occurrence of question categories and answers during training can effectively alleviate language priors; R→C outperforms R by 2.41%, which shows that question-category information is very useful in inference. (2) On SAR and SAR+SSL, C consistently outperforms R, but on SAR+LMH we see the opposite. This may be because our method and the balanced-data method SSL can learn the positive bias resulting from the surface correlation between question categories and answers, which is helpful for generalization, whereas the ensemble-based method LMH weakens this positive bias during debiasing. (3) Even without any language-prior method, R→C SAR is competitive with or better than R or C SAR+SSL and SAR+LMH, which shows that the R→C strategy itself helps the model reduce language priors. Accordingly, compared with R or C, our R→C framework gains only a slight further improvement from adding the same language-prior method.

4.3.5 Ablation Study

“CAS+” means that we use the select-and-rerank structure. From Table 4, we can find that: (1) LXM+SSL denotes applying SSL directly to LXMERT. Its poor performance shows that the main contribution of our framework does not come from simply combining the language-prior method SSL with the pre-trained model LXMERT. (2) Compared with LXM and LXM+SSL, CAS+LXM and CAS+LXM+SSL achieve significant performance gains of 9.35% and 6.32%, respectively, which demonstrates the effectiveness of our proposed select-and-rerank procedure. (3) CAS+LXM+QTD(R) and CAS+LXM+SSL+QTD(R) outperform CAS+LXM(R) and CAS+LXM+SSL(R) by 3.93% and 2.71%, respectively, which shows the contribution of the QTD module. This further indicates that choosing an appropriate \(N'\) for different question types is a useful step toward improving performance. (4) CAS+LXM+SSL+QTD improves the performance of CAS+LXM+QTD by 2.61%, which shows that current language-prior methods fit well into our framework and can further boost performance.

4.3.6 The Effect of \(N’\)

From the figure below, we can find that: (1) Due to the nature of yes/no questions, the optimal \(N'\) for yes/no questions is smaller than that for non-yes/no questions. (2) As \(N'\) increases, the accuracy on “Num” and “Other” questions first rises and then falls. There is a trade-off behind this phenomenon: when \(N'\) is too small, the correct answer may not be recalled by CAS; when \(N'\) is too large, the interference from wrong answers makes it harder for the model to choose the correct answer.

4.3.7 Qualitative Examples

We also evaluate the effectiveness of our framework qualitatively. As shown in the figure below, compared with SSL, SAR performs better not only in question answering but also in visual grounding. With the help of answer semantics, SAR can focus on the region relevant to the candidate answer and use that region to verify its correctness.

5 Conclusion

In this paper, we propose a select-and-rerank (SAR) progressive framework based on visual entailment. Specifically, we first select candidate answers to shrink the prediction space, and then rerank the candidates via a visual entailment task that verifies whether the image semantically entails the synthetic statement of the question and each candidate answer. Our framework makes full use of the interactive information among images, questions, and candidate answers. Moreover, it is a general framework that can be easily combined with existing VQA models to further boost their abilities. Extensive experiments and analyses on the VQA-CP v2 dataset demonstrate the advantages of our framework: our method establishes a new state-of-the-art accuracy of 66.73%, 7.55% higher than the previous best.