Don’t Take the Easy Way Out: Ensemble-Based Methods for Avoiding Known Dataset Biases



Abstract

State-of-the-art models often exploit superficial patterns in their training data that do not generalize to out-of-domain or adversarial settings. For example, textual entailment models frequently learn that particular keywords imply entailment regardless of context, and visual question answering (VQA) models learn to predict prototypical answers without looking at the information in the image. In this paper, we show that if we have prior knowledge of such biases, we can train a model to be more robust to domain shift. Our method has two stages: we (1) train a naive model that makes predictions based exclusively on the dataset bias, and (2) train a robust model in an ensemble with the naive model, which encourages the robust model to focus on other patterns in the data that are more likely to generalize. Experiments on five out-of-domain test sets show significantly improved robustness in all settings, including a 12-point gain on a changed-priors VQA dataset and a 9-point gain on an adversarial QA test set.

1 Introduction

While recent neural models have achieved impressive results, these gains are tempered by the observation that models often exploit dataset-specific patterns that do not generalize to out-of-domain or adversarial settings. For example, entailment models trained on MNLI (Bowman et al., 2015) may guess the label based solely on the presence of particular keywords (Gururangan et al., 2018) or on whether the sentence pair contains the same words (McCoy et al., 2019), while QA models trained on SQuAD (Rajpurkar et al., 2016) tend to select text near question words as the answer, regardless of context (Jia and Liang, 2017).

We call such superficial patterns biases. Models that depend on biases can perform well on in-domain data, but they are brittle and easy to fool (for example, SQuAD models are easily distracted by irrelevant sentences containing many question words). Growing attention to dataset bias has prompted researchers to re-examine many popular datasets, uncovering a wide variety of biases (Agrawal et al., 2018; Anand et al., 2018; Min et al., 2019; Schwartz et al., 2017).

In this paper, we build on these efforts and show that, once a dataset bias has been identified, we can improve a model's out-of-domain performance by preventing the model from exploiting that bias. To do so, we leverage the fact that such biases can typically be modeled explicitly with a simple, constrained baseline, and can then be factored out of the final model through ensemble-based training.

Our method has two stages. First, we build a bias-only model that captures a solution which performs well on the training data but generalizes poorly out of domain. Second, we train the main model in an ensemble with the pre-trained bias-only model, which motivates the main model to learn alternative strategies; at test time the main model is used alone. We explore several ensembling methods built on the product-of-experts approach (Hinton, 2002; Smith et al., 2005). Figure 1 shows an example of applying this procedure to prevent a visual question answering (VQA) model from guessing answers simply because they are typical for the question, a flaw that has been observed in VQA models (Goyal et al., 2018; Agrawal et al., 2018).

We evaluate our approach on a range of tasks, each of which requires models to overcome a challenging domain shift between the training and test data. First, we construct a set of synthetic datasets containing deliberately planted biases by adding artificial features to MNLI. Then we consider three challenge datasets from prior work, designed to break models that adopt superficial strategies, built on MNLI (Bowman et al., 2015), reading comprehension on SQuAD (Rajpurkar et al., 2016), and VQA (Antol et al., 2015).

In addition, we build a new QA challenge dataset, TriviaQA-CP (for TriviaQA changed priors). It is constructed from TriviaQA (Joshi et al., 2017) by removing questions that ask about particular kinds of entities from the train set and evaluating on those questions in the dev set, thereby challenging models to generalize across question types.

We are able to improve out-of-domain performance in all settings, including gains of 6 and 9 points on the two QA datasets. On the VQA challenge set we achieve a 12-point gain, compared with the 3-point gain of prior work. In general, we find that an ensembling method which can dynamically choose when to trust the bias-only model is the most effective. We present thorough experiments and qualitative analysis to illustrate the advantages of this method, and we release our datasets and code to facilitate future work.

2 Related Work

Researchers have raised concerns about bias in many datasets. For example, many joint natural language and vision datasets can be partially solved by models that ignore the visual part of the task (Jabri et al., 2016; Zhang et al., 2016; Anand et al., 2018; Caglayan et al., 2019). Some questions in recent multi-hop QA datasets (Yang et al., 2018; Welbl et al., 2018) can be solved by single-hop models (Chen and Durrett, 2019; Min et al., 2019). Other examples include story completion (Schwartz et al., 2017) and multiple-choice questions (Clark et al., 2016, 2018). While bias is recognized as a concern across fields, our work is the first to evaluate on multiple datasets spanning language and vision.

Recent dataset construction protocols have tried to avoid certain kinds of bias. For example, CoQA (Reddy et al., 2019) and QuAC (Choi et al., 2018) took steps to prevent annotators from using words that appear in the context paragraph, VQA 2.0 (Goyal et al., 2018) selected examples to limit the effectiveness of question-only models, and others filtered out examples solvable by simple baselines (Yang et al., 2018; Zhang et al., 2018b; Clark et al., 2018; Zellers et al., 2018). While reducing bias is important, developing methods that prevent models from using known biases allows us to continue using existing datasets, and to update our methods as our understanding of which biases to avoid develops.

Recent work has focused on biases that stem from ignoring parts of the input (for example, guessing the answer to a question before seeing the evidence). Proposed solutions include forcing the model to attend to all inputs through generative objectives (Lewis and Fan, 2019), carefully designed model architectures (Agrawal et al., 2018; Zhang et al., 2016), or adversarially removing class-indicative features from the model's internal representations (Ramakrishnan et al., 2018; Zhang et al., 2018a; Belinkov et al., 2019; Grand and Belinkov, 2019). We consider biases beyond partial-input ones (Feng et al., 2019) and show that our method is superior on VQA-CP. Concurrently, He et al. (2019) also proposed using a product of experts to train unbiased models, but we consider a broader set of ensembling methods and test on additional domains.

A related task is preventing models from using particular problematic dataset features, usually studied from the perspective of fairness (Zhao et al., 2017; Burns et al., 2018). A popular approach is to use an adversary that tries to recover the target feature, typically gender or race, from the model's internal representations (Edwards and Storkey, 2016; Wang et al., 2018; Kim et al., 2019). In contrast, the biases we consider are related to features that are integral to the task, so they cannot simply be ignored.

Evaluating models on out-of-domain examples constructed by applying minor perturbations to existing examples has also been the subject of recent study (Szegedy et al., 2014; Belinkov and Bisk, 2018; Carlini and Wagner, 2018; Glockner et al., 2018). In contrast, the distribution shifts we consider involve larger changes to the data distribution, which likewise expose significant flaws in existing models.

3 Methods

This section describes the two stages of our method: (1) building a bias-only model and (2) using it to train a robust model through ensembling.

3.1 Training a Bias-Only Model

The goal of the first stage is to build a model that performs well on the training data but is likely to perform poorly on the out-of-domain test set. Since we assume no access to examples from the test set, we must apply prior knowledge to achieve this.

The most direct way to do this is to identify a set of features that correlate with the class label during training, but are known to be uncorrelated or anti-correlated with the label on the test set, and to train a classifier on those features. For example, our bias-only model for VQA-CP (Agrawal et al., 2018) (see Section 5.2) uses the question type as input, because the correlation between question type and answer differs sharply between the training and test sets (e.g., "2" is a common answer to "how many…" questions in the training set, but rare for such questions in the test set).
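To make this concrete, a question-type-only bias model of the kind described above reduces, in the simplest case, to the smoothed per-type answer frequencies observed in training. The sketch below is our own illustration, not the paper's implementation; the function name and toy data are invented:

```python
import numpy as np

def train_bias_only(question_types, answers, n_answers, alpha=1.0):
    """Fit a question-type-only bias model as smoothed per-type answer
    frequencies; a classifier trained on only the type feature converges
    to these empirical distributions."""
    counts = {t: np.full(n_answers, alpha) for t in set(question_types)}
    for t, a in zip(question_types, answers):
        counts[t][a] += 1.0
    return {t: c / c.sum() for t, c in counts.items()}

# Toy data mimicking the "how many..." example: answer 0 (say, "2")
# dominates that question type in training.
qtypes = ["how many", "how many", "how many", "what color"]
answers = [0, 0, 1, 1]
bias = train_bias_only(qtypes, answers, n_answers=2)
```

On a changed-priors test set these frequencies become anti-correlated with the true answers, which is exactly what makes such a model a useful proxy for the bias.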

A strength of our approach, however, is that the bias can be modeled by any kind of predictor, which gives us a way to capture more complex intuitions. For example, on SQuAD our bias-only model operates on only part of the input and is built from TF-IDF scores (see Section 5.4), and on our changed-priors TriviaQA dataset the bias-only model uses a pre-trained named entity recognition (NER) tagger (see Section 5.5).

3.2 Training a Robust Model

This stage trains a robust model while preventing it from learning the strategy of the bias-only model.

3.2.1 Problem Definition

The bias-only model produces a class distribution for each example:

\[b_i = b(x_i)\]

where, for the i-th training example, \(b_{ij}\) is the probability the bias-only model assigns to the j-th class.

The second model, with parameters \(\theta\), is:

\[p_i = \operatorname{softmax}(f(x_i; \theta))\]

where \(p_i\) is likewise a probability distribution over the classes.

The goal is now to learn the parameters \(\theta\) so that the model predicts answers accurately without adopting the strategy of the bias-only model.

3.2.2 General Approach

We train \(f\) in an ensemble with the frozen bias-only model. Specifically, for each example a new class distribution \(\widehat{p}_i\) is computed from \(p_i\) and \(b_i\). During training, the loss is computed on \(\widehat{p}_i\) and backpropagated to update only \(f\); during evaluation, \(f\) is used alone. We propose several methods of combining the two models.

3.2.3 Bias Product

The simplest method of combination is a product of experts (Hinton, 2002):

\[\widehat{p}_i = \operatorname{softmax}\left(\log(p_i) + \log(b_i)\right)\]

Equivalently, where \(\circ\) denotes element-wise multiplication:

\[\widehat{p}_i \propto p_i \circ b_i\]
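As a minimal numerical illustration (ours, not from the paper), the bias product can be computed in log space, treating the bias-only model's output as a constant; at test time the softmax of the robust model's logits is used alone:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def bias_product(logits_f, log_b):
    """Training-time distribution: softmax(log p + log b),
    i.e. p-hat proportional to p * b elementwise."""
    return softmax(logits_f + log_b)

# Toy example: the bias-only model strongly prefers class 0 while the
# robust model is still uninformative (uniform logits).
logits_f = np.zeros(3)
b = np.array([0.7, 0.2, 0.1])
p_hat = bias_product(logits_f, np.log(b))  # used for the training loss
p_test = softmax(logits_f)                 # evaluation: robust model alone
```

Because \(\widehat{p}_i\) already reflects the bias, the gradient on examples the bias explains well is small, which pushes \(f\) toward the remaining signal.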

3.2.4 Learned-Mixin

The conditional-independence assumption implicit in Equation 3 is often too strong. For example, in some cases the robust model can predict that the bias-only model will be unreliable on certain kinds of training examples. We find that this leads the robust model to selectively adjust its behavior to compensate for the bias-only model's inaccuracies, causing errors in out-of-domain settings (see Section 5.1).

Instead, we allow the model to explicitly determine how much to trust the bias given the input:

\[\widehat{p}_i = \operatorname{softmax}\left(\log(p_i) + g(x_i)\log(b_i)\right)\]

where \(g\) is a learned function. We compute \(g(x_i)\) as \(\operatorname{softplus}(w \cdot h_i)\), where \(w\) is a learned vector and \(h_i\) is the output of the model's last hidden layer for training example \(x_i\). The \(\operatorname{softplus}(x) = \log(1 + e^x)\) activation prevents the weight from becoming negative, which would allow the ensemble to reverse the bias; the bias product is recovered when \(g(x_i) = 1\).

One drawback of this method, however, is that the ensemble can learn to ignore the bias by setting \(g(x_i) = 0\), and we find that this does occur in practice.
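A hedged sketch of the learned-mixin combination (our illustration; the vectors `w` and `h` below are toy stand-ins for the learned weight and the model's last hidden layer):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def softplus(x):
    return np.log1p(np.exp(x))

def learned_mixin(logits_f, log_b, w, h):
    """p-hat = softmax(log p + g(x) * log b) with g = softplus(w . h) >= 0."""
    g = softplus(np.dot(w, h))
    return softmax(logits_f + g * log_b), g

logits_f = np.array([1.0, 0.0])
log_b = np.log(np.array([0.9, 0.1]))
# Large w.h: g > 1, the bias dominates the combined distribution.
p_on, g_on = learned_mixin(logits_f, log_b, np.array([1.0]), np.array([5.0]))
# Very negative w.h: g -> 0, the ensemble reduces to the robust model alone.
p_off, g_off = learned_mixin(logits_f, log_b, np.array([1.0]), np.array([-20.0]))
```

The second case is exactly the failure mode noted above: with \(g(x_i)\) near zero, the bias drops out of the ensemble entirely.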

3.2.5 Learned-Mixin+H

To prevent this, we propose a third method of combination: when using the learned-mixin ensemble, an entropy penalty is added to the loss:

\[R = w \, H\!\left(\operatorname{softmax}\left(g(x_i)\log(b_i)\right)\right)\]

where \(H(z) = -\sum_j z_j \log(z_j)\) is the entropy and \(w\) is a hyperparameter. Penalizing the entropy encourages the bias component to be non-uniform, and thus to have a larger impact on the ensemble.
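The effect of the penalty can be seen numerically (our sketch; the toy values are arbitrary): when \(g = 0\) the scaled bias term is uniform and the penalty is maximal, and it shrinks as the bias component becomes more peaked, so minimizing the loss discourages the ensemble from zeroing out the bias:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    return -np.sum(p * np.log(p + 1e-12))

def entropy_penalty(g, log_b, w=1.0):
    """R = w * H(softmax(g * log b)); minimizing it discourages g -> 0."""
    return w * entropy(softmax(g * log_b))

log_b = np.log(np.array([0.9, 0.1]))
r_ignored = entropy_penalty(0.0, log_b)  # g = 0: bias component uniform
r_used = entropy_penalty(2.0, log_b)     # g = 2: bias component peaked
```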

4 Evaluation Methodology

We evaluate our approach on several datasets with out-of-domain test sets. For some of these tasks, such as HANS (McCoy et al., 2019) or Adversarial SQuAD (Jia and Liang, 2017), performance can be improved by generating additional training examples similar to the test set (e.g., Wang and Bansal (2018)). In contrast, we show that performance on these tasks can be improved using only general knowledge of the biased strategies a model is likely to adopt.

Our evaluation setup consists of a training set, an out-of-domain test set, a bias-only model, and a main model. For each evaluation, we train the bias-only model on the training set, train the main model on the training set with one of the methods from Section 3, and evaluate the main model on the out-of-domain test set. When available, we also report performance on an in-domain test set. For the main models we use models known to work well on the respective tasks, without further hyperparameter tuning or early stopping.

We consider two extractive QA datasets, which we treat as joint classification tasks in which the model must select the start and end answer tokens (Wang and Jiang, 2017). For these datasets, we build separate bias-only models for selecting the start and end tokens, and ensemble each bias with the classifier's start- and end-token output distributions respectively. To compute the learned-mixin weight, we apply a ReLU layer to the question and passage embeddings, followed by max pooling, to build the hidden state.

We compare our method against the reweighting baseline described below, and against training the main model without any modification. On VQA, we also compare with the adversarial methods of Ramakrishnan et al. (2018) and Grand and Belinkov (2019). The other biases we consider are not based solely on observing part of the input, so these adversarial approaches cannot be applied to them directly.

4.1 Reweight Baseline

As a non-ensemble baseline, we train the main model on a reweighted version of the data, where each example \(x_i\) is given weight \(1 - b_{iy_i}\) (that is, we weight each example by one minus the probability the bias-only model assigns to the correct label). This encourages the main model to focus on examples the bias-only model gets wrong.
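A minimal sketch of this baseline (ours; the toy probabilities are invented):

```python
import numpy as np

def reweight(bias_probs, labels):
    """Per-example weights 1 - b_{i, y_i}: examples the bias-only model
    already answers confidently contribute little to the training loss."""
    b_correct = bias_probs[np.arange(len(labels)), labels]
    return 1.0 - b_correct

bias_probs = np.array([[0.9, 0.1],   # bias confidently correct -> small weight
                       [0.5, 0.5],   # bias uninformative       -> medium weight
                       [0.2, 0.8]])  # bias wrong               -> large weight
labels = np.array([0, 0, 0])
weights = reweight(bias_probs, labels)
```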

4.2 Hyperparameters

One of our methods (Learned-Mixin+H) requires hyperparameter tuning. However, hyperparameter tuning is challenging in our setting because we assume no access to out-of-domain examples during training. A reasonable choice would be to tune on a dev set whose domain shift differs from the one in the test set, but unfortunately none of our datasets come with such a set. Instead, following prior work (Grand and Belinkov, 2019; Ramakrishnan et al., 2018), we perform model selection on the test set. Although this is an important caveat on the results for this method, we think it is still of interest to observe how influential the entropy regularizer can be. Future work may be able to construct appropriate dev sets, or propose other hyperparameter tuning strategies, to alleviate this problem. See Appendix A for the selected hyperparameters.

5 Experiments

Since this summary focuses only on VQA, only the experiments related to VQA are presented here.

5.2 VQA-CP


We evaluate on the VQA-CP v2 dataset (Agrawal et al., 2018), which was constructed by re-splitting the VQA 2.0 (Goyal et al., 2018) training and validation sets into new training and test sets such that the correlations between question types and answers differ between the two splits. For example, "tennis" is the most common answer for questions beginning with "what sport…" in the train set, while "skiing" is the most common answer for those questions in the test set. Models that choose answers because they are typical in the training data will therefore perform poorly on this test set.

Bias-Only Model

VQA-CP annotates each question with one of 65 question types, corresponding to its first few words (e.g., "what color"). The bias-only model uses this categorical label as input and is trained on the same multi-label objective as the main model.

Main Model

We use a popular implementation of the Bottom-Up Top-Down (Anderson et al., 2018) VQA model. The model uses multi-label targets, so we apply our ensembling methods by treating each possible answer as a binary classification problem.
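The text does not spell out here how the ensemble is formed for each binary answer; one consistent choice, assumed in this sketch (ours, not confirmed by the source), is the two-class product of experts, which amounts to adding log-odds before the sigmoid:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p) - np.log1p(-p)

def binary_bias_product(logit_f, b):
    """Per-answer product of experts over {yes, no}: p-hat proportional to
    p * b and (1 - p-hat) proportional to (1-p)(1-b), which simplifies to
    sigmoid(logit_f + logit(b))."""
    return sigmoid(logit_f + logit(b))

# If the robust model is uninformative (logit 0), the combined score
# equals the bias-only probability for that answer.
p_hat = binary_bias_product(0.0, 0.8)
```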


The results are shown in the table below. The learned-mixin method is highly effective: it improves performance on VQA-CP by about 9 points, and the entropy regularizer adds another 3 points, significantly exceeding prior models. For the learned-mixin ensemble, we find that \(g(x_i)\) correlates closely with the expected accuracy of the bias: it has a Spearman correlation of 0.77 with bias accuracy on the test data. Qualitative examples (Figure 2) further suggest that the model increases \(g(x_i)\) when it can rely on the bias-only model.

Figure 2: \(g(x_i)\) in the learned-mixin model (labeled "g") and in the learned-mixin+H model (labeled "g+"), shown with the question type and the bias-only model's top-ranked answers for that type. We find that \(g(x_i)\) is larger when those answers are likely to be correct.

6 Conclusion

Our main contribution is a method for improving model robustness to domain shift by using human knowledge of which strategies will not generalize well. Our method trains a robust model in an ensemble with a pre-trained naive model, and uses the robust model alone at test time. Extensive experiments show that our method works well on two adversarial datasets and two changed-priors datasets, including a 12-point gain on VQA-CP. Future work includes learning to detect dataset biases automatically, which would allow our method to be applied with less specific prior knowledge.