Overcoming Language Priors with Self-supervised Learning for Visual Question Answering


Most visual question answering (VQA) models suffer from the language prior problem, which is caused by inherent data biases. Specifically, VQA models tend to answer questions (e.g., what color is the banana?) based on high-frequency answers (e.g., yellow) while ignoring the image content. Existing methods tackle this problem by designing delicate models or by introducing additional visual annotations to reduce question dependence and strengthen image dependence. However, since the data biases themselves are not alleviated, these methods still suffer from the language prior problem. This paper proposes a self-supervised learning framework to solve the problem. Specifically, we first automatically generate labeled data to balance the biased data, and then propose a self-supervised auxiliary task that uses the balanced data to assist the base VQA model in overcoming language priors. Our method can compensate for the data biases by generating balanced data without introducing external annotations. Experimental results show that, on the most commonly used benchmark, VQA-CP v2, our method improves overall accuracy from 49.50% to 57.59%, significantly outperforming existing methods. In other words, we surpass annotation-based methods by 16% without using external annotations.

1 Introduction

Visual question answering (VQA), an AI-complete task, has attracted increasing attention. Its goal is to automatically answer natural language questions about images. The common VQA paradigm [Antol et al., 2015; Yang et al., 2016; Anderson et al., 2018; Kim et al., 2018] is to project the image and the question into a common feature space and fuse them into a joint vector for prediction. Recently, researchers [Agrawal et al., 2018; Goyal et al., 2017] have shown that most existing VQA models suffer from the language prior problem and ignore the image content. For example, the question “what color is the grass?” can usually be answered with “green” no matter which image is given, because most of the corresponding answers in the dataset are “green”. As a result, models that memorize language priors perform poorly on out-of-distribution test sets.

Existing methods for alleviating language priors focus on reducing question dependence and increasing image dependence. They can be roughly divided into non-annotation-based and annotation-based methods. Non-annotation-based methods mostly design delicate models and strategies. For example, [Ramakrishnan et al., 2018] proposed an adversarial learning strategy that overcomes language priors by adversarially minimizing the performance of a question-only branch. RUBi [Cadene et al., 2019] dynamically adjusts instance weights to reduce the influence of the most biased instances and increase that of the least biased ones. Annotation-based methods attempt to increase image dependence directly by introducing external visual supervision. [Selvaraju et al., 2019] use human attention maps to ensure alignment between model attention and human attention. [Wu and Mooney, 2019] maintain consistency between correct answers and the influential objects marked in human explanations. Generally, annotation-based methods achieve better performance than non-annotation-based ones because they understand images better under the guidance of visual supervision. Nevertheless, these methods require large-scale visual annotations, which are hard to obtain.

However, the inherent data biases remain: the above methods only weaken their adverse effects to some extent, so their results are not ideal. Inherent data biases inevitably push a VQA model toward high-frequency answers with high confidence, eventually leading to the language prior problem. It is therefore crucial to address the inherent data biases directly, i.e., to convert the biased data into balanced data without introducing external annotations.

To this end, we propose a self-supervised learning framework for VQA that automatically balances the biased data to overcome language priors. Our approach is inspired by a simple, intuitive observation: as shown in the figure, a question can be answered only when the given image contains the key information needed to answer it. We call such a question-image pair relevant, and otherwise irrelevant. Based on this observation, a model should estimate whether a given image is relevant to the question before answering it. We therefore introduce an auxiliary task, question-image correlation estimation, which estimates the relevance between a question and an image. Specifically, we first automatically generate a set of balanced question-image pairs with binary labels (relevant and irrelevant), which are then fed to the self-supervised auxiliary task to help the base VQA model overcome language priors. We incorporate the auxiliary task into the base VQA model by feeding it relevant and irrelevant pairs. When a relevant question-image pair is input, the VQA model is encouraged to predict the correct answer with high confidence, where the confidence score is interpreted as the probability that the question-image pair is relevant. Conversely, when an irrelevant pair is input, the VQA model is pushed to predict the correct answer with low confidence. Moreover, the confidence score on irrelevant pairs can serve as an indicator of language priors and help avoid over-fitting. By optimizing these two objectives simultaneously, we strike a balance between answering questions and overcoming language priors. Our method can thus also be interpreted as an implicit multi-task learning framework.

In conclusion, our contributions are as follows:

(1) We introduce a self-supervised framework that automatically converts the inherently biased data into balanced data, and propose an auxiliary task that uses this balanced data to overcome language priors at the root. To the best of our knowledge, this is the first work to apply self-supervised learning to this problem. (2) We conduct extensive experiments on the popular benchmark VQA-CP v2. The results show that our method significantly outperforms the state-of-the-art methods without external annotations, including models that use human supervision, improving overall accuracy from 49.50% to 57.59%.

2 Related Work

2.1 Visual Question Answering

Visual question answering (VQA) aims to answer questions about images, drawing on techniques from natural language processing and computer vision [Liu et al., 2016; Parkhi et al., 2015; Conneau et al., 2016; Liu et al., 2018]. Existing VQA methods can be roughly divided into four categories: 1) joint-embedding methods [Antol et al., 2015] first project images and questions into a common feature space and then combine them through a classifier to predict answers; 2) attention-based methods [Anderson et al., 2018] focus on learning the interaction between question words and image regions, making the answering process more interpretable; 3) compositional models [Andreas et al., 2016] exploit the compositional structure of the question to assemble modules operating in attention space; 4) knowledge-based methods [Wu et al., 2016] answer common-sense questions with the help of external knowledge.

However, existing models tend to memorize language priors during training without considering image information. Such models may achieve seemingly impressive results on test sets that share the training distribution, but they often perform poorly on out-of-distribution test sets.

2.2 Overcoming Language Priors in VQA

Existing methods for overcoming language priors can be roughly divided into non-annotation-based and annotation-based methods. Non-annotation-based methods focus on designing delicate models that reduce question dependence, while annotation-based methods strengthen visual grounding by introducing additional human visual supervision.

Among non-annotation-based methods, [Agrawal et al., 2018] proposed a hand-designed VQA framework that explicitly separates visual recognition of different question types from answer-space prediction. Similarly, [Jing et al., 2020] decoupled concept discovery from question answering. Beyond restricting the answer space, [Ramakrishnan et al., 2018] proposed an adversarial learning strategy that adversarially minimizes the performance of a question-only branch. [Guo et al., 2019] adopt a pairwise ranking scheme that forces the question-only branch to make worse predictions than the base model. RUBi [Cadene et al., 2019] dynamically re-weights training instances using a prior mask learned by a question-only branch, reducing the influence of the most biased instances and increasing that of the least biased ones. [?] proposed a neural-symbolic model that integrates a symbolic program executor into a DNN for visual reasoning; unlike the models above, it can also mitigate the bias problem. [?] combine the neural-symbolic model with curriculum concept learning to improve its generalization.

Annotation-based methods, which highlight important visual regions under the guidance of external visual supervision, have also proved effective. HINT [Selvaraju et al., 2019] increases image dependence by optimizing the alignment between human attention maps and gradient-based visual importance. SCR [Wu and Mooney, 2019] likewise enforces correspondence between correct answers and the influential objects identified in human textual explanations. However, these models rely heavily on human supervision, which is not always available.


Unlike all of these methods, our self-supervised approach needs neither a complex architecture nor external supervision. We first balance the originally biased data by automatically generating balanced labels, and then overcome language priors with an auxiliary task trained on the balanced data in a self-supervised manner.

2.3 Self-supervised Learning

Self-supervised learning automatically derives supervision signals from the input data, effectively using the input itself to learn high-level representations for downstream tasks. For example, [Gidaris et al., 2018] proposed to rotate an image by one of four possible angles and let the model predict the rotation. Besides predicting rotations, one can also recover part of the data, as in image inpainting [Pathak et al., 2016]. In this paper, we use self-supervised learning for question-image correlation estimation as an auxiliary task to help the VQA model overcome language priors: we randomly replace the image in an originally relevant question-image pair and let the model predict whether the pair is still relevant.

3 Method

The framework of our method is shown in the figure. Next, we describe in detail how it works.

The framework of our self-supervised approach. (a) depicts the base VQA model, which answers questions based on images. (b) shows how we automatically generate balanced question-image pairs. (c) shows how question-image correlation estimation works on relevant and irrelevant pairs, respectively. G-T stands for ground truth.

3.1 The Paradigm of VQA

The purpose of VQA is to automatically answer textual questions about images. Specifically, given a VQA dataset with N samples, \(D=\{I_i,Q_i,A_i\}_{i=1}^N\), where \(I_i\in I\) and \(Q_i\in Q\) are the image and question of the i-th sample and \(A_i\in A\) is the answer, a VQA model aims to learn a mapping function \(F:I\times Q \rightarrow R^{|A|}\) that predicts a distribution over the answer space. It usually consists of three parts: extracting features of the image and the question, fusing them into a joint multimodal representation, and predicting a distribution over the answer space. We write the predicted answer distribution for the \(i\)-th image and question as \(F(A|I_i,Q_i)\). Almost all existing VQA models [Yang et al., 2016; Kim et al., 2018; Anderson et al., 2018] follow this paradigm, and their parameters are usually optimized by minimizing the cross-entropy loss or the multi-label soft loss below.
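The encode-fuse-classify paradigm above can be sketched numerically. The sketch below is a toy illustration, not the cited models: `encode_image` and `encode_question` are hypothetical stand-in encoders, and element-wise product fusion is just one common choice.

```python
import math
import random

random.seed(0)

D = 8            # joint feature dimension (toy value)
NUM_ANSWERS = 4  # size of the answer vocabulary (toy value)

def encode_image(image_id):
    """Stand-in visual encoder: a fixed pseudo-random feature vector per image."""
    rng = random.Random(image_id)
    return [rng.uniform(-1, 1) for _ in range(D)]

def encode_question(question):
    """Stand-in question encoder: hash words into a fixed-size bag-of-words vector."""
    v = [0.0] * D
    for word in question.lower().split():
        v[hash(word) % D] += 1.0
    return v

def fuse(v_img, v_q):
    """Element-wise product fusion into the joint multimodal vector."""
    return [a * b for a, b in zip(v_img, v_q)]

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

# Random classifier weights mapping the joint vector to answer-space scores.
W = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(NUM_ANSWERS)]

def predict(image_id, question):
    """F(A|I,Q): a distribution over the answer space."""
    joint = fuse(encode_image(image_id), encode_question(question))
    scores = [sum(w * j for w, j in zip(row, joint)) for row in W]
    return softmax(scores)

dist = predict(42, "what color is the banana")
```

The output is always a proper distribution over the answer space, whatever the (toy) encoders produce.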

Answer prediction:

\[\hat{A}_i=\arg\max_{a\in A}F(a|I_i,Q_i)\]

Cross-entropy loss:

\[L_{vqa_{ce}}=-\frac{1}{N}\sum_{i=1}^{N}\log F(A_i|I_i,Q_i)\]

Multi-label soft loss:

\[L_{vqa_{ml}}=-\frac{1}{N}\sum_{i=1}^{N}\big[t_i\log\sigma\big(F(A|I_i,Q_i)\big)+(1-t_i)\log\big(1-\sigma\big(F(A|I_i,Q_i)\big)\big)\big]\]
where \(\sigma (\cdot)\) denotes the sigmoid function and \(t_i\) is the soft target score of each answer for the \(i\)-th sample, computed as \(t_i=\frac{number\ of\ votes}{n}\), where \(n\) is the number of valid answers to the \(i\)-th question and \(number\ of\ votes\) is the number of human annotators who gave each answer for the question.
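The multi-label soft loss can be illustrated with a small numerical sketch. The helper names (`soft_target`, `multilabel_soft_loss`) are hypothetical; the arithmetic follows the binary cross-entropy against soft targets described above.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def soft_target(votes, n_valid):
    """t = number_of_votes / n, per the definition above."""
    return votes / n_valid

def multilabel_soft_loss(logits, targets):
    """Binary cross-entropy against soft targets, summed over the answer space."""
    loss = 0.0
    for z, t in zip(logits, targets):
        p = sigmoid(z)
        loss -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return loss

# Example: 10 annotators, 3 candidate answers receiving 7/2/1 votes.
targets = [soft_target(v, 10) for v in (7, 2, 1)]
logits = [2.0, -1.0, -3.0]  # model scores for the same 3 answers
loss = multilabel_soft_loss(logits, targets)
```

Unlike the hard cross-entropy loss, this objective rewards the model for matching the full vote distribution, which is why the paper finds it more consistent with the evaluation metric.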

3.2 Question-Image Correlation Estimation

A VQA model that memorizes language priors tends to make predictions while ignoring the image. Ideally, a question can be answered only when the given image contains the relevant information. It is therefore important for a VQA model to judge whether the given image can serve as a reference before answering a specific question, something almost all previous work has ignored, because all question-image pairs are correctly matched in existing datasets. We show that this estimation is necessary for reducing language priors in VQA, because it forces the model to refer to the image content rather than answer blindly. We therefore propose an auxiliary task, question-image correlation estimation (QICE), a binary classification task that predicts whether a question-image pair is relevant before answering the question. In this paper, we define a question-image pair as relevant if the image can be used to answer the question with its annotated answer.

Generate balanced question-image pairs

First, we automatically generate a set of labeled question-image pairs from the original dataset, without human annotation, for the auxiliary task, as shown in Figure 2(b). Specifically, every question-image pair \((Q,I)\) in the training set is given the label \(c=1\), because the dataset contains an answer for this pair. Then, for each relevant pair \((Q,I)\), we randomly select an image from the image set to replace the original image. In this way we obtain another question-image pair \((Q,I')\); the probability that this pair is relevant is very small, so we label it \(c=0\), i.e., irrelevant. We thus obtain a balanced question-image matching dataset in which the number of relevant pairs equals the number of irrelevant pairs. Note that constructing the balanced data requires no manual annotation.
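The pair-generation step can be sketched as follows. This is a minimal sketch assuming an in-memory list of (question, image) tuples; the function name `build_balanced_pairs` is hypothetical, and, as the paper notes, a randomly substituted image may occasionally still be relevant — the construction only makes that unlikely.

```python
import random

def build_balanced_pairs(dataset, seed=0):
    """For every relevant (Q, I) pair (label c=1), create one irrelevant
    (Q, I') pair (label c=0) by substituting a randomly chosen other image."""
    rng = random.Random(seed)
    images = [img for _, img in dataset]
    pairs = []
    for question, image in dataset:
        pairs.append((question, image, 1))   # relevant pair, c = 1
        other = rng.choice(images)
        while other == image:                # avoid re-picking the same image
            other = rng.choice(images)
        pairs.append((question, other, 0))   # irrelevant pair, c = 0
    return pairs

data = [("what color is the banana", "img_banana"),
        ("how many dogs are there", "img_dogs"),
        ("is this a professional game", "img_game")]
balanced = build_balanced_pairs(data)
```

The result contains exactly as many irrelevant pairs as relevant ones, which is what makes the generated supervision balanced.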

Correlation estimation

Using the generated balanced data, a QICE model can be trained to predict the relevance label of each question-image pair by optimizing a binary cross-entropy loss:

\[L_{self}=-\frac{1}{2N}\sum_{i=1}^{2N}\big[c_i\log P(c_i|Q_i,I_i)+(1-c_i)\log\big(1-P(c_i|Q_i,I_i)\big)\big]\]

\(L_{self}\) can be interpreted as a self-supervised training loss, since it uses only the label supervision of the data we generate. This objective ensures the QICE model's understanding of both question and image content, because each \(Q\) corresponds to a balanced set of relevant and irrelevant images and does not depend on any language prior. In the next section, we discuss how the auxiliary task QICE and the balanced data help the VQA model eliminate language biases within a unified framework.

3.3 Unified Self-supervised Framework

In this section, we propose a unified VQA framework that answers questions and estimates question-image correlation simultaneously during training. The QICE task defined above can share the same network structure with VQA because the two tasks take exactly the same inputs and have similar outputs: both take question-image pairs \((I,Q)\) as input, and VQA predicts a distribution over the answer space while QICE produces a binary label given a specific answer. This property lets us solve the two tasks simultaneously in a unified VQA framework, as shown in the framework figure of our method.

Consider the VQA model shown in part (a) of the figure. Taking relevant question-image pairs \((Q,I)\) as input, it predicts their answer distribution \(F(A|Q,I)\) and can be trained by minimizing a VQA loss, \(L_{vqa_{ce}}\) or \(L_{vqa_{ml}}\). This objective teaches the model to answer questions. For the QICE task shown in part (c), given a question-image pair \((Q,I)\) and its specific answer, the VQA model's predicted probability \(P(A|Q,I)\) can be regarded as the confidence that the pair is relevant: the larger the probability, the better the match. Therefore \(L_{self}\) can be rewritten as:

\[L_{self}=-\frac{1}{2N}\sum_{i=1}^{2N}\big[c_i\log P(A_i|Q_i,I_i)+(1-c_i)\log\big(1-P(A_i|Q_i,I_i)\big)\big]\]

Requiring correct binary predictions on the question-image correlation estimation task, in which each question has an equal number of relevant and irrelevant image pairs, helps the model understand images better. More specifically, the first term of \(L_{self}\) maximizes the confidence on relevant question-image pairs, which is consistent with the goal of the VQA task: predicting the ground-truth answer with high confidence.

Most importantly, the second term of \(L_{self}\) is designed to minimize the confidence on irrelevant pairs, which directly serves to reduce language priors. Intuitively, the question dependence of a VQA model can be measured by its confidence in the correct answer when given an irrelevant image: the larger the confidence, the stronger the dependence. Minimizing the confidence on irrelevant pairs explicitly prevents the VQA model from being overly driven by language priors; we call this the question-dependency loss \(L_{qd}\):

\[L_{qd}=-\frac{1}{N}\sum_{i=1}^{N}\log\big(1-P(A_i|Q_i,I_i')\big)\]

We omit \(c_i\) because \(L_{qd}\) only takes effect on irrelevant question-image pairs \((Q,I')\). Mathematically, minimizing \(-log(1-P(A|Q,I'))\) is equivalent to minimizing \(P(A|Q,I')\). Empirically, however, training is more stable when minimizing \(P(A|Q,I')\) directly, because its gradient is bounded, whereas the gradient of \(-log(1-P(A|Q,I'))\) grows without bound as \(P(A|Q,I')\) approaches 1. The loss function \(L_{qd}\) is therefore updated as follows:

\[L_{qd}=\frac{1}{N}\sum_{i=1}^{N}P(A_i|Q_i,I_i')\]

The QICE task can thus be naturally regarded as implicit multi-task learning comprising two tasks: the original VQA task and a language-prior reduction task. We can reorganize \(L_{self}\) as follows:

\[L_{self}=L_{vqa}+\alpha L_{qd}\]

where \(L_{vqa}\) can be any VQA loss function, \(L_{vqa_{ce}}\) or \(L_{vqa_{ml}}\), and \(\alpha\) is a hyper-parameter. Clearly, \(L_{self}\) can be seen as an ordinary VQA loss when \(\alpha=0\), in which case it degenerates into \(L_{vqa}\). This means that the question-dependency loss \(L_{qd}\) in fact acts as a regularizer that prevents the VQA model from memorizing language priors and forces it to understand images better. \(L_{self}\) therefore provides flexibility in controlling the balance between answering questions and reducing language priors. Moreover, we obtain question-image correlation estimation implicitly, without explicitly optimizing the model for it; we only use its balanced supervision to compensate for the biases in VQA through our self-supervised loss. Our method can thus reduce language priors in a self-supervised way without external supervision.
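Putting the pieces together, the final objective \(L_{self}=L_{vqa}+\alpha L_{qd}\) can be sketched numerically. This is a minimal sketch assuming the cross-entropy variant of the VQA loss and precomputed model confidences; the function names are hypothetical.

```python
import math

def vqa_ce_loss(pred_dists, gt_indices):
    """Cross-entropy VQA loss on relevant pairs."""
    n = len(gt_indices)
    return -sum(math.log(d[g]) for d, g in zip(pred_dists, gt_indices)) / n

def qd_loss(confidences_on_irrelevant):
    """L_qd: mean confidence P(A|Q,I') in the ground-truth answer when an
    *irrelevant* image is given (the stabilized form of the loss)."""
    return sum(confidences_on_irrelevant) / len(confidences_on_irrelevant)

def self_supervised_loss(pred_dists, gt_indices, conf_irrelevant, alpha=3.0):
    """L_self = L_vqa + alpha * L_qd."""
    return vqa_ce_loss(pred_dists, gt_indices) + alpha * qd_loss(conf_irrelevant)

# Toy batch: two relevant pairs plus the model's confidence on two irrelevant ones.
dists = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
gts = [0, 1]
conf_irrelevant = [0.6, 0.4]  # high values signal reliance on language priors
loss = self_supervised_loss(dists, gts, conf_irrelevant, alpha=3.0)
```

Setting `alpha=0` recovers the plain VQA loss, mirroring the degenerate case discussed above; larger values penalize confident answers on mismatched images more heavily.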

4 Experiments

4.1 Datasets and Baselines


Our method is evaluated on the most commonly used benchmark, VQA-CP v2 [Agrawal et al., 2018], with the standard evaluation metrics [Antol et al., 2015]. The VQA-CP v2 dataset is derived from VQA v2 [Goyal et al., 2017] by reorganizing the train and validation splits so that the QA pairs in the training and test sets have different distributions; it is therefore suitable for evaluating a model's generalizability. We also evaluate our model on the strongly biased VQA v2 dataset and report results on its validation split.


We compare our method with several others, including (1) non-annotation-based methods: UpDn [Anderson et al., 2018], AdvReg [Ramakrishnan et al., 2018], RUBi [Cadene et al., 2019], and DLR [Jing et al., 2020]; and (2) annotation-based methods: HINT [Selvaraju et al., 2019] and SCR (the best-performing method) [Wu and Mooney, 2019].

4.2 Implementations Details

Our method is model-agnostic and can be applied to different VQA models. In this paper, we mainly evaluate it on UpDn [Anderson et al., 2018], with a batch-normalization layer added before the classifier. Following the baseline, we use a pre-trained Faster R-CNN to extract image features: each image is encoded as a set of 36 objects with 2048-dimensional feature vectors, and all questions are padded to the same length of 14. For each question, the word embeddings are initialized with 300-dimensional GloVe vectors and then fed into a GRU to obtain a 1280-dimensional sentence-level representation.

We pre-train the model with the VQA loss for 12 epochs and fine-tune it with the self-supervised loss for 20 epochs. The batch size is 256, and irrelevant images are randomly selected within each mini-batch. We use the Adam optimizer with an initial learning rate of 0.001, halved every 5 epochs after epoch 10. In our main experiments we evaluate our method with both VQA losses, setting \(\alpha=3\) for the multi-label VQA loss and \(\alpha=1.2\) for the cross-entropy VQA loss. Other experiments in this paper use \(\alpha=3\). The setting of the hyper-parameter \(\alpha\) is also discussed in the next section.
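The stated schedule ("initial learning rate of 0.001, halved every 5 epochs after epoch 10") admits more than one reading; the sketch below implements one plausible interpretation, with a hypothetical helper name, in which the first halving occurs at epoch 10.

```python
def learning_rate(epoch, base_lr=0.001):
    """One plausible reading of the schedule: constant base_lr for the first
    10 epochs, then halved at epoch 10 and every 5 epochs thereafter."""
    if epoch < 10:
        return base_lr
    return base_lr * 0.5 ** ((epoch - 10) // 5 + 1)

schedule = [learning_rate(e) for e in range(20)]
```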

4.3 Experimental Results and Analysis

Comparison with state-of-the-art

Our method is tested with both VQA losses (cross-entropy and multi-label). To remove the randomness of the random sampling strategy, we report the average scores of 10 runs on the test set. From the results in the table below we observe: (1) our method not only improves the overall performance of the baseline UpDn (+14.35% with the cross-entropy loss and +16.06% with the multi-label loss), but is also significantly better than the best existing method, SCR (+3.13% and +8.09%, respectively); (2) the improvements with both VQA losses are significant; overall, the multi-label loss achieves better performance, because it is consistent with the evaluation metric and considers multiple feasible answers, suggesting it is more general; (3) regardless of the VQA loss used, we achieve very high accuracy on "yes/no" questions (87.75% and 86.53%), indicating that our strategy effectively overcomes language priors, since these simple questions are the most likely to be biased; (4) on the hardest question type, "num", we also obtain surprising improvements, strongly suggesting that our method can jointly understand images and questions and reason effectively.

Performance on smaller training sets

To further demonstrate the superiority of our method, we ran a series of experiments with different amounts of training data randomly sampled from the original training set. All experiments were evaluated on the standard test set, with results shown in the table below. Our method improves the average accuracy of the baseline UpDn by +16.6%. Most importantly, even with only 20% of the training data, our method greatly exceeds the best-performing externally supervised method trained on the whole training set. We believe this is because our method effectively leverages the balanced data through the regularizer, which makes it more likely to generalize well.

Performance based on different baselines

We also conducted experiments with two other VQA models: SAN [Yang et al., 2016] and BAN [Kim et al., 2018]. The results are shown in the table below. The improvements over the different baselines are significant and consistent, indicating that our method is model-agnostic.

Performance on biased VQA dataset

We also evaluated on the VQA v2 dataset, which contains strong language biases. We pre-trained the model with the VQA loss for 6 epochs and then fine-tuned it for 10 epochs. The results are shown in Table 4. Our method achieves an accuracy improvement on the VQA v2 validation set, while other methods may cause degradation. The reason is that our self-supervised loss strikes a balance between answering questions and eliminating language priors.

Impact of different \(\alpha\)

To study the trade-off between answering questions and overcoming language priors, we ran extensive experiments with different settings of \(\alpha\). Owing to space constraints, we only analyze the multi-label VQA loss here, as shown in the figure below. The model performs best when \(\alpha=3\); a large \(\alpha\) may cause the model to collapse after several epochs, while a small \(\alpha\) leads to poor performance.

Qualitative analysis

We also qualitatively evaluated the effectiveness of our method. As shown in the figure below, our method can answer questions correctly while attending to the right regions. For example, when answering the question "is this a professional game?", our method attends to the lettering on the man's clothes, which is an important visual clue for judging whether the game is professional.

5 Conclusion

This paper proposes a novel self-supervised learning framework to overcome the language prior problem in VQA. Based on a model-agnostic auxiliary task, the framework effectively uses automatically generated balanced data to reduce the influence of dataset biases. Experimental results show that our method strikes a balance between answering questions and overcoming language priors, achieving a new state of the art on the most commonly used benchmark, VQA-CP v2. We believe our work is a meaningful step toward practical VQA and toward solving the language bias problem, and that this self-supervised approach can be extended to other tasks affected by inherent data biases (e.g., image captioning).
