
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 989–1002, August 1–6, 2021. ©2021 Association for Computational Linguistics

Why Machine Reading Comprehension Models Learn Shortcuts?

Yuxuan Lai, Chen Zhang, Yansong Feng∗, Quzhe Huang, and Dongyan Zhao
Wangxuan Institute of Computer Technology, Peking University, China
The MOE Key Laboratory of Computational Linguistics, Peking University, China
{erutan, zhangch, fengyansong, huangquzhe, zhaody}@pku.edu.cn

Abstract

Recent studies report that many machine reading comprehension (MRC) models can perform closely to or even better than humans on benchmark datasets. However, existing works indicate that many MRC models may learn shortcuts to outwit these benchmarks, while their performance in real-world applications remains unsatisfactory. In this work, we attempt to explore why these models learn the shortcuts instead of the expected comprehension skills. Based on the observation that a large portion of questions in current datasets have shortcut solutions, we argue that a larger proportion of shortcut questions in training data makes models rely on shortcut tricks excessively. To investigate this hypothesis, we carefully design two synthetic datasets with annotations that indicate whether a question can be answered using shortcut solutions. We further propose two new methods to quantitatively analyze the learning difficulty of shortcut and challenging questions, and to reveal the inherent learning mechanism behind the different performance on the two kinds of questions. A thorough empirical analysis shows that MRC models tend to learn shortcut questions earlier than challenging questions, and that high proportions of shortcut questions in training sets hinder models from exploring sophisticated reasoning skills in the later stage of training.

1 Introduction

The task of machine reading comprehension (MRC) aims at evaluating whether a model can understand natural language texts by answering a series of questions. Recently, MRC research has seen considerable progress in terms of model performance, and many models are reported to approach or even outperform human-level performance on

∗Corresponding author.

Figure 1: An illustration of shortcuts in Machine Reading Comprehension. P is an excerpt of the original passage.
P: ... Begun as a one-page journal in September 1876, the Scholastic magazine is issued twice monthly and claims to be the oldest continuous collegiate publication in the United States ...
Q.1: When did the Scholastic journal come out?
The figure marks two paths from question to answer: a shortcut via question word matching, and a comprehension challenge via coreference and paraphrasing.

different benchmarks. These benchmarks are designed to address challenging features, such as evidence checking in multi-document inference (Yang et al., 2018), co-reference resolution (Dasigi et al., 2019), dialog understanding (Reddy et al., 2019), symbolic reasoning (Dua et al., 2019), and so on.

However, recent analysis indicates that many MRC models unintentionally learn shortcuts to trick specific benchmarks, while having inferior performance on real comprehension challenges (Sugawara et al., 2018). For example, when answering Q.1 in Figure 1, we expect an MRC model to understand the semantic relation between come out and begun, and output the answer, September 1876, by bridging the co-reference among Scholastic journal, Scholastic magazine, and one-page journal. In fact, a model can easily find the answer without following the mentioned reasoning process, since it can just recognize September 1876 as the only time expression in the passage to answer a when question. We consider such tricks that use partial evidence to produce, perhaps unreliable, answers as shortcuts to the expected comprehension challenges, e.g., co-reference resolution in this


example. The questions with shortcut solutions are referred to as shortcut questions. For clarity, a model is considered to have learned shortcuts when it relies on those tricks to obtain correct answers for most shortcut questions while performing worse on questions where challenging skills are necessary.

Previous works have found that, by relying on shortcut tricks, models may not need to pay attention to the critical components of questions and documents (Mudrakarta et al., 2018) in order to get the correct answers. Thus, many current MRC models can be either vulnerable to disturbance (Jia and Liang, 2017) or lack flexibility to question/passage changes (Sugawara et al., 2020). These efforts disclose the impact of the shortcut phenomenon on MRC studies. However, concerns have been raised about why MRC models learn these shortcuts while ignoring the designed comprehension challenges.

To properly investigate this problem, our first obstacle is that no existing MRC datasets are labeled with whether a question has shortcut solutions. This deficiency makes it hard to formally analyze how the performance of a model is affected by shortcut questions, and almost impossible to examine whether the model correctly answers a question via shortcuts. Secondly, previous methods disclose the shortcut phenomenon by analyzing model outputs through a series of carefully designed experiments, but fail to explain how MRC models learn the shortcut tricks. We need new methods to help us quantitatively investigate the learning mechanisms that make the difference when MRC models learn to answer shortcut questions and questions that require challenging reasoning skills.

In this work, we carefully design two synthetic MRC datasets to support our controlled experimental analysis. Specifically, in these datasets, each (passage, question) instance has a shortcut version paired with a challenging one where complex comprehension skills are required to answer the question. Our construction method ensures that the two versions of questions are as close as possible in terms of style, size, and topics, which enables us to conduct controlled experiments regarding the necessary skills to obtain answers. We design a series of experiments to quantitatively explain how shortcut questions affect MRC model performance and how the models learn these tricks and challenging skills during the training process. We also propose

two evaluation methods to quantify the learning difficulty of specific question sets. We find that shortcut questions are usually easier to learn, and that the dominant gradient-based optimizers drive MRC models to learn the shortcut questions earlier in the learning process. The priority of fitting shortcut questions hinders models from exploring sophisticated reasoning skills in the later stage of training. Our code and datasets can be found at https://github.com/luciusssss/why-learn-shortcut

To summarize, our main contributions are the following: 1) We design two synthetic datasets to study two commonly seen shortcuts in MRC benchmarks, question word matching and simple matching, against a challenging reasoning pattern, paraphrasing. 2) We propose two simple methods as a probe to help investigate the inherent learning mechanism behind the different performance on shortcut questions and challenging ones. 3) We conduct thorough experiments to quantitatively explain the behaviors of MRC models under different settings, and show that the proportion of shortcut questions greatly affects model performance and may hinder MRC models from learning sophisticated reasoning skills.

2 Synthetic Dataset Construction

To study the impact of shortcut questions in model training, we require the datasets to be annotated with whether each question has shortcut solutions or can only be answered via complex reasoning. However, none of the existing MRC datasets have such annotations. We thus design two synthetic datasets where it is known whether shortcut solutions exist for a question.

More importantly, we need to conduct controlled experiments and ensure that, for each question, the existence of shortcut solutions is the only independent variable. The extraneous variables, such as dataset size, topics, answer types, and even the vocabulary, should be kept relatively steady. Thus, in our designed datasets, each entry has a shortcut version instance and a challenging version. The question of the shortcut version can be correctly solved by a certain shortcut trick, while an expected comprehension skill is required to deal with the challenging version. Note that we expect the two versions of questions to be as close as possible so that we can switch between the two versions freely while keeping other factors relatively steady.
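To make this pairing concrete, the following is a minimal Python sketch of how an entry could be represented; the class and field names are illustrative and not the exact format of our released files.

from dataclasses import dataclass

@dataclass
class MRCInstance:
    question: str
    passage: str
    answer_text: str
    answer_start: int  # character offset of the answer span in the passage

@dataclass
class Entry:
    # Both versions share topic, vocabulary, and answer; they differ only in
    # whether a shortcut solution exists.
    shortcut: MRCInstance     # answerable via a shortcut trick (QWM or SpM)
    challenging: MRCInstance  # requires the paraphrasing skill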


Figure 2: An illustration of how the instances in the synthetic datasets are constructed from original SQuAD data. Each instance has a shortcut version paired with a challenging version where comprehension skills are necessary.
Left (QWM-Para), Q.2: the original question "Who was rated as the most powerful female musician?" is paraphrased into "Who was named the most influential music girl?" (Step 1: paraphrase question). The shortcut version then drops the sentence containing the redundant person entity Lisa from the passage (Step 2: drop redundant entities), while the challenging version keeps the full passage "Forbes Magazine rated Beyonce as the most powerful female musician. She released a new album with Lisa ...".
Right (SpM-Para), Q.3: "Who was rated as the most powerful female musician?". The answer sentence is paraphrased into "Forbes Magazine named Beyonce as the most influential music creator." (Step 1: paraphrase answer sentence). The shortcut version injects the paraphrase alongside the original answer sentence and shuffles the sentences (Step 2: inject & sentence shuffle), while the challenging version substitutes the original answer sentence with the paraphrase and shuffles the sentences (Step 3: substitute & sentence shuffle).

In this work, we focus on paraphrasing (Para) as the complex reasoning challenge, since it widely exists in many recent MRC datasets (Trischler et al., 2017; Reddy et al., 2019; Clark et al., 2019). The paraphrasing challenge requires MRC models to identify the same meaning represented in different words. Regarding the shortcut tricks, we study two typical kinds: question word matching (QWM) and simple matching (SpM) (Sugawara et al., 2018). For QWM, MRC models can simply obtain an answer phrase by recognizing the expected entity type confined by the wh-question word of question Q. For SpM, a model can find the answers by identifying the word overlap between answer sentences and the questions.

QWM-Para Dataset: As elaborated in Algorithm 1, given an original instance (Q, P) from SQuAD (Rajpurkar et al., 2016), we paraphrase the question Q into Qp to embed the paraphrasing challenge, and derive the corresponding shortcut version by dropping from the given passage the sentences that contain other entities matching the type indicated by the question word.

An example is shown in the left of Figure 2. In the challenging version of Q.2, both Beyonce and Lisa are person names which match the question word who. Thus, one should at least recognize the paraphrasing relationship between named the most influential music girl and rated as the most powerful female musician to distinguish between the two names and infer the correct answer. For the

Algorithm 1 Construction of QWM-Para
Input: SQuAD
Output: QWM-Para
1: QWM-Para ← ∅
2: for each instance (Q, P) in SQuAD do
3:   if Q does not start with who, when, or where then
4:     Discard this instance.
5:   end if
6:   if the answer sentence contains other entities matching the question word then
7:     Discard this instance.
8:   end if
9:   Use back translation to paraphrase Q, obtaining Qp.
10:  if the non-stop-word overlap rate between Qp and the answer sentence > 25% then
11:    Discard this instance.
12:  end if
13:  Delete the sentences in passage P that do not contain the golden answer but contain other entities matching the question word; denote the modified passage as Ps.
14:  Is ← the shortcut instance version (Qp, Ps)
15:  Ic ← the challenging instance version (Qp, P)
16:  Append the pair of instances (Is, Ic) to QWM-Para.
17: end for

shortcut version, removing the sentence containing Lisa, which is also of the expected answer type person indicated by the question word who, from the passage would help a model easily get the correct answer, Beyonce.
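For readers who prefer code over pseudocode, the following Python sketch mirrors Algorithm 1; ner_entities, back_translate, overlap_rate, and split_sentences are assumed placeholder helpers (e.g., wrapping Stanford CoreNLP and a translation service), not calls from a specific library.

WH_TO_TYPE = {"who": "PERSON", "when": "DATE", "where": "LOCATION"}

def build_qwm_para(squad_instances):
    """squad_instances: iterable of (question, passage, answer_sentence, answer) tuples."""
    qwm_para = []
    for q, passage, answer_sent, answer in squad_instances:
        wh = q.split()[0].lower()
        if wh not in WH_TO_TYPE:                        # line 3: keep only who/when/where questions
            continue
        ans_type = WH_TO_TYPE[wh]
        if [e for e in ner_entities(answer_sent, ans_type) if e != answer]:
            continue                                    # line 6: the answer sentence must be unambiguous
        q_p = back_translate(q)                         # line 9: paraphrase the question
        if overlap_rate(q_p, answer_sent) > 0.25:       # line 10: the paraphrase must reduce word overlap
            continue
        # line 13: drop sentences containing other same-type entities to build the shortcut passage
        kept = [s for s in split_sentences(passage)
                if answer in s or not [e for e in ner_entities(s, ans_type) if e != answer]]
        p_s = " ".join(kept)
        qwm_para.append(((q_p, p_s), (q_p, passage)))   # (shortcut, challenging) pair
    return qwm_para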

SpM-Para Dataset: As shown in Algorithm 2, for instances from SQuAD, we paraphrase the answer sentences in the passage to embed the paraphrasing challenge and obtain the challenging version. We insert the paraphrased answer sentence in front of the original one in the passage to construct the corresponding shortcut version.


Algorithm 2 Construction of SpM-Para
Input: SQuAD
Output: SpM-Para
1: SpM-Para ← ∅
2: for each instance (Q, P) in SQuAD do
3:   if the non-stop-word overlap rate between Q and the answer sentence S < 75% then
4:     Discard this instance.
5:   end if
6:   Use back translation to paraphrase the answer sentence S in P, obtaining Sp.
7:   if the answer span no longer exists in Sp then
8:     Discard this instance.
9:   end if
10:  if the non-stop-word overlap rate between Q and Sp > 25% then
11:    Discard this instance.
12:  end if
13:  Replace S in P with Sp and shuffle the sentences; denote the modified passage as Pc.
14:  Append Sp to P and shuffle the sentences; denote the modified passage as Ps.
15:  Is ← the shortcut instance version (Q, Ps)
16:  Ic ← the challenging instance version (Q, Pc)
17:  Append the pair of instances (Is, Ic) to SpM-Para.
18: end for

In the shortcut version, a model can obtain the answer by either identifying the paraphrase in the passage or using the simple matching trick via the original answer sentence. We randomly shuffle all sentences in the passage to prevent models from learning the sentence-order pattern of the shortcut version, i.e., that there are two adjacent answer sentences in the passage. Here, we assume the sentence-level shuffling operation will not affect the answers and solutions for most questions, since the supporting evidence is often concentrated in a single sentence. This is also supported by Sugawara et al. (2020)'s finding that the performance of BERT-large (Devlin et al., 2019) on SQuAD only drops by around 1.2% after sentence order shuffling.
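Analogously, a sketch of the passage-editing step of Algorithm 2 follows; back_translate, overlap_rate, and split_sentences are again assumed placeholder helpers, and the use of random.shuffle is an illustrative choice rather than the exact released implementation.

import random

def build_spm_para_entry(question, passage, answer_sent, answer):
    """Return (shortcut, challenging) instances for one SQuAD example, or None if filtered out."""
    if overlap_rate(question, answer_sent) < 0.75:      # the question must nearly match the answer sentence
        return None
    para_sent = back_translate(answer_sent)             # paraphrase the answer sentence
    if answer not in para_sent or overlap_rate(question, para_sent) > 0.25:
        return None
    sents = split_sentences(passage)
    challenging = [para_sent if s == answer_sent else s for s in sents]  # substitute the original sentence
    shortcut = sents + [para_sent]                                       # inject: keep both sentences
    random.shuffle(challenging)                          # shuffle to hide sentence-order patterns
    random.shuffle(shortcut)
    return (question, " ".join(shortcut)), (question, " ".join(challenging))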

For example, in the shortcut version of Q.3 shown in the right of Figure 2, MRC models can find the answer, Beyonce, either from the matching context, rated as the most powerful female musician, or via the paraphrased one, named as the most influential music creator. For the challenging version, only the paraphrased answer sentence is provided; thus, the paraphrasing skill is necessary.

Dataset Details Our synthetic training and test sets are derived from the accessible training and development sets of SQuAD, respectively. We adopt back translation to obtain paraphrases of texts (Dong et al., 2017). A sentence is translated from English to German, then to Chinese, and finally back to English to obtain its paraphrased version.1
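The back-translation chain itself can be sketched as below, where translate is an abstract client (e.g., a thin wrapper around a translation service); we do not show the concrete request format or authentication of the Baidu Translate API.

def back_translate(text, translate):
    """Paraphrase text via an English -> German -> Chinese -> English translation chain."""
    de = translate(text, src="en", tgt="de")
    zh = translate(de, src="de", tgt="zh")
    return translate(zh, src="zh", tgt="en")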

The QWM-Para dataset contains 7072 entries, each containing two versions of (question, passage) tuples, with 6306/766 entries for training and testing, respectively. SpM-Para contains 8514 entries, with 7562/952 for training and testing, respectively.

Quality Analysis We randomly sample 20 entries from each training set of the synthetic datasets and manually analyze their answerability. We find that 76 out of 80 questions can be correctly answered; the unanswerable ones result from incorrect paraphrasing. Furthermore, among the answerable questions, the paraphrasing skill is necessary for 30 out of 36 questions in the challenging version, and 36 out of 40 questions in the shortcut version can be correctly answered via the corresponding shortcut trick.

3 How Do Shortcut Questions Affect Model Performance?

Previous efforts show that shortcut questions widely exist in current datasets (Sugawara et al., 2020). However, there is little quantitative analysis of how these shortcut questions affect model performance. A reasonable guess is that, when trained with too many shortcut questions, models tend to fit the shortcut tricks, which are possible solutions to a large number of questions in training. We thus argue that high proportions of shortcut questions in training data make models rely on shortcut tricks.

One straightforward way to elaborate on this point is to observe the model performance on challenging test questions when the model is trained with different proportions of shortcut questions. For example, if a model trained on a dataset in which 90% of the questions are shortcut ones cannot perform as well as its 10% counterpart on challenging test questions, that would indicate that higher proportions of shortcut questions in the training data may hinder the model from learning other challenging skills.
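Concretely, a training set with a given shortcut proportion can be assembled from the paired entries (as sketched in §2); the following is an illustrative recipe rather than our exact sampling code.

import random

def mix_training_set(entries, shortcut_ratio, seed=0):
    """Use the shortcut version for a fraction of entries and the challenging version for the rest."""
    rng = random.Random(seed)
    shuffled = entries[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * shortcut_ratio)
    return [e.shortcut for e in shuffled[:cut]] + [e.challenging for e in shuffled[cut:]]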

Setup We evaluate two popular MRC models, BiDAF (Minjoon et al., 2017) and BERT-base (Devlin et al., 2019), which are widely adopted in research on shortcut phenomena (Sugawara et al., 2018; Min et al., 2019; Si et al., 2019; Sugawara et al., 2020).

1 We use Baidu Translate API (http://api.fanyi.baidu.com).


Figure 3: F1 scores on challenging and shortcut questions with different proportions of shortcut questions in training: (a) BiDAF on QWM-Para, (b) BERT on QWM-Para, (c) BiDAF on SpM-Para, (d) BERT on SpM-Para. The error bars represent the standard deviations of five runs.

For each combination of model and dataset, we train 10 versions of the model, adjusting the proportion of shortcut questions in the training set from 0% to 90%, and report performance on pure challenging and pure shortcut test sets. We report the mean and standard deviation over five runs to alleviate the impact of randomness. Detailed settings are elaborated in Appendix A.
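Performance is measured with the standard SQuAD-style token-level F1 between the predicted span and the gold answer; a common implementation looks roughly like the following (answer normalization, such as punctuation and article stripping, is omitted).

from collections import Counter

def token_f1(prediction, gold):
    pred_toks, gold_toks = prediction.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)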

Results and Analysis Figure 3 shows the performance of BiDAF and BERT on QWM-Para and SpM-Para when trained with various proportions of shortcut questions. For both models, the F1 scores on the challenging versions of both test sets drop substantially as the proportion of shortcut questions in training increases (Figure 3 (a) ∼ (d)). This result indicates that higher proportions of shortcut questions in training limit the model's ability to solve challenging questions.

Take BiDAF on QWM-Para as an example (Figure 3 (a)). The F1 score on the test set of challenging questions is 69% after training BiDAF with a dataset entirely composed of challenging questions, showing that even a simple model is able to learn the paraphrasing skill from shortcut-free training data. As the proportion of shortcut training questions increases, the model tends to learn shortcut tricks and performs worse on the challenging test data. The F1 score on challenging questions drops to 55% when 90% of the training data are shortcut questions. This drop shows that training data with a high proportion of shortcuts actually hinders the model from capturing the paraphrasing skill needed to solve challenging questions. In contrast, the performance on shortcut questions is relatively steady with respect to changes in the shortcut proportion during training. When trained with sufficient challenging questions, models not only perform well on comprehension challenges, but also correctly answer the shortcut questions where only partial evidence is required.

In Figure 3, we can observe similar trends in model performance on SpM-Para. The performance on challenging questions also drops with higher proportions of shortcut training questions. Compared with BiDAF, although the overall scores of BERT are better, BERT also performs poorly on questions that require paraphrasing when trained with more shortcut questions, as shown in Figure 3 (b) & (d).

Case Study When answering the example question Q.4 from QWM-Para, BiDAF trained with pure challenging questions tends to detect the correlation between graduate and received his master's degree, and locates the correct answer 1506 even though there are two spans matching the question word when. However, when there are more than 70% shortcut questions in training, BiDAF only captures the type constraint from the question word when, and fails to identify the paraphrasing phenomenon, so it cannot answer the challenging version.

Q.4: When did Luther graduate?
P-challenging: In 1501, at the age of 19, he entered the University of Erfurt. ... He received his master's degree in [1506]Ans ...
P-shortcut: He received his master's degree in [1506]Ans ...

4 Is Question Word Matching Easier to Learn than Paraphrasing?

It remains puzzling that, with both shortcut and challenging questions coexisting in training, even in a 50%-50% split, both BERT and BiDAF still learn shortcut tricks better and thus achieve much higher performance on shortcut questions compared to the challenging ones. We think one possible reason is that MRC models may learn shortcut tricks, like QWM, with fewer computational resources than the comprehension challenges, like identifying paraphrasing. In this case, MRC models could better learn the shortcut tricks even with equal or lower proportions of shortcut questions during training.


Figure 4: F1 scores on the training sets when BERT learns challenging and shortcut questions with different optimizing steps ((a) steps for QWM-Para, (b) steps for SpM-Para) and parameter sizes, represented by the number of unmasked hidden units in the last hidden layer of BERT ((c) params for QWM-Para, (d) params for SpM-Para). The error bars represent the standard deviations of five runs.

To validate this hypothesis, we propose two simple but effective methods to measure the difference in required computational resources. Specifically, we train models with either pure shortcut questions or pure challenging ones, and compare the learning speed and the required parameter sizes when achieving certain performance levels on the training sets.

For learning speed, we train MRC models with different numbers of steps and observe how the performance changes. Intuitively, models should converge faster on easier training data.

For parameter sizes, our intuition is that models should learn the easier questions with fewer parameters. However, the high computational cost prevents us from pre-training models like BERT with different parameter sizes. To simulate BERT with fewer parameters, we mask some hidden units in the last hidden layer of BERT and use the number of unmasked units to reflect the parameter size. The information in the masked units cannot be conveyed to the span boundary prediction module, so only part of the parameters can be used to make the final predictions.
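A sketch of this masking, written with PyTorch-style tensors; the layer names and the way the mask is applied before the span prediction head are assumptions about a typical BERT span-extraction model, not an exact description of our implementation.

import torch

def masked_span_logits(last_hidden_state, qa_head, num_unmasked):
    """Zero out all but the first num_unmasked hidden units before span prediction.

    last_hidden_state: (batch, seq_len, hidden_size) output of BERT's final layer.
    qa_head: a linear layer mapping hidden_size -> 2 (start/end logits).
    """
    mask = torch.zeros(last_hidden_state.size(-1), device=last_hidden_state.device)
    mask[:num_unmasked] = 1.0                      # information in the masked units cannot reach the head
    logits = qa_head(last_hidden_state * mask)     # (batch, seq_len, 2)
    start_logits, end_logits = logits.split(1, dim=-1)
    return start_logits.squeeze(-1), end_logits.squeeze(-1)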

Setup We use BERT as the basic learner and train on the training sets of QWM-Para and SpM-Para. We report model performance on the training data under various learning settings. We use all the parameters when adjusting learning steps. When tuning the parameter size, we fix the learning steps to 400 and 450 for QWM-Para and SpM-Para, respectively. All other settings, including batch size, optimizer, and learning rate, are fixed. We report the mean and standard deviation over five runs to alleviate the impact of randomness. The implementation details are similar to §3 and elaborated in Appendix A.

Results and Analysis Figure 4 compares the performance of BERT trained on the shortcut questions and the challenging questions separately under different settings. On both QWM-Para and SpM-Para, BERT converges faster when learning shortcut questions than when learning challenging ones (Figure 4 (a) & (b)). When fixing the training steps, BERT can learn to answer the shortcut questions with fewer parameters (Figure 4 (c) & (d)). These results show that shortcut questions may be easier for models to learn than the ones requiring complex reasoning skills.

Take QWM-Para as an example. As can be seen from Figure 4 (a), BERT trained on the shortcut questions achieves a 90% F1 score on the training set after 250 steps. When trained on the challenging version, this score does not reach 90% until 350 steps. This result indicates that models can learn to answer the shortcut questions with the QWM trick faster than they learn the paraphrasing skill.

When we train BERT on QWM-Para with different numbers of output units masked (Figure 4 (c)), BERT reaches an F1 score of 91% on shortcut data with no fewer than 96 unmasked hidden units. However, when trained on the challenging questions, BERT has to use 384 hidden units to reach the 91% F1 score, which indicates that questions with the paraphrasing challenge may require more parameters to learn.

We observe similar trends on SpM-Para (Figure 4 (b) & (d)). BERT requires more parameters and training steps to learn the challenging version of SpM-Para than the shortcut version. To some extent, these results confirm our hypothesis that learning to answer questions with shortcut tricks like SpM or QWM requires fewer computational resources than learning questions that require challenging skills like paraphrasing.

Case Study For the example question Q.5, BERT trained on shortcut questions can correctly answer its shortcut version and find the only location name, Palacio da Alvorada, with only 48 unmasked hidden units. However, when trained with the challenging data only, the model predicts the other location name, the Monumental Axis, as the answer at that parameter size. BERT cannot recognize the paraphrasing relationship between the place the president lives and presidential residence and choose the correct answer, Palacio da Alvorada, instead of the distracting location name, until using all 768 hidden units.

Q.5: Where does the president of Brazil live, in Portuguese?
P-challenging: ... on a triangle of land jutting into the lake, is the Palace of the Dawn ([Palacio da Alvorada]Ans; the presidential residence). Between the federal and civic buildings on the Monumental Axis is the city's cathedral ...
P-shortcut: ... on a triangle of land jutting into the lake, is the Palace of the Dawn ([Palacio da Alvorada]Ans; the presidential residence) ...

5 How Do Models Learn Shortcuts?

In the previous section, we show that shortcut questions are easier to learn than questions that require the complex paraphrasing skill. It is then interesting to ask how, when models are trained with a mixture of both versions of questions, such a discrepancy affects or even drives the learning procedure, e.g., how increasing the number of challenging training questions alleviates the over-reliance on shortcut tricks.

We suspect one possible reason lies in how most existing MRC models are optimized. We hypothesize that, with a larger proportion of shortcut questions for training, the models learn the shortcut tricks at an early stage, which may affect their further exploration of challenging skills. In other words, in the early stage of training, models tend to find the easiest way to fit the training data with gradient descent. Since the shortcut tricks require fewer computational resources to learn, fitting these tricks may take priority. Afterwards, since the shortcut trick can already answer most of the training questions correctly, the limited number of remaining unsolved questions may not motivate the models to explore sophisticated solutions that require challenging skills.

To validate this idea, we investigate how the models converge during training with different shortcut proportions in the training data. Notice that if a model can only answer the shortcut version of a question correctly, it is highly likely that the model only adopts the shortcut trick for this question instead of performing sophisticated reasoning. Thus, we think the performance gap between the two versions of test data may indicate to what extent the model relies on shortcut tricks, e.g., the smaller the performance gap, the stronger the complex reasoning skills the model has learned.

Setup We explore how BERT and BiDAF converge with 10% and 90% shortcut training questions on QWM-Para and SpM-Para. We report the F1 scores on the challenging and shortcut test questions, respectively, together with their performance gap. We compare the model performance at different learning steps to investigate when and how well the models learn the shortcut tricks and the challenging comprehension skills. The implementation details are the same as in §3 and elaborated in Appendix A.
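The gap curves reported below (and in Figure 5) are smoothed by averaging over fixed-size windows; a sketch of this computation, with an illustrative window size:

def smoothed_gap(shortcut_f1, challenging_f1, window=5):
    """Per-checkpoint performance gap, averaged over a trailing window of checkpoints."""
    gaps = [s - c for s, c in zip(shortcut_f1, challenging_f1)]
    smoothed = []
    for i in range(len(gaps)):
        chunk = gaps[max(0, i - window + 1):i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed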

Results and Analysis Figure 5 illustrates how the MRC models converge during training under different settings. The gap line (green) shows the gap between the models' performance on shortcut questions and that on challenging ones.

For all settings, in the first few epochs, the model performance on shortcut questions increases rapidly, much faster than that on challenging questions, causing a steep rise of the performance gap. This result indicates that models may learn the shortcut tricks at the early stage of training, thus quickly and correctly answering more shortcut questions. Then, in the epochs after reaching the peaks, the gap lines slightly go down (Figure 5 (a), (c), (e), and (f)) or remain almost unchanged (Figure 5 (b), (d), (g), and (h)), which also indicates that the models may learn the challenging skills later than the shortcut tricks. One possible reason is that the gradient-based optimizer drives the model to optimize the global objective greedily via the easiest direction. Thus, trained with a mixture of shortcut and challenging questions, models choose to learn the shortcut tricks, which require fewer computational resources, earlier than the sophisticated paraphrasing skill.

Comparing models with different proportions of shortcut training questions, we find that, with 90% shortcut training questions (Figure 5 (b), (d), (f), and (h)), the performance gap remains at a high level in the later training stage, where the performance on the challenging test questions is relatively low. These results provide evidence that, in most cases, after fitting the shortcut questions, models seem to fail to explore the sophisticated reasoning skills.

When there are only 10% shortcut training data (Figure 5 (a), (c), (e), and (g)), we can observe that after a few hundred steps, the gap lines stop increasing and even slightly go down.


Figure 5: F1 scores on challenging and shortcut questions at different training steps under different settings: (a) BiDAF on QWM-Para, 10%; (b) BiDAF on QWM-Para, 90%; (c) BiDAF on SpM-Para, 10%; (d) BiDAF on SpM-Para, 90%; (e) BERT on QWM-Para, 10%; (f) BERT on QWM-Para, 90%; (g) BERT on SpM-Para, 10%; (h) BERT on SpM-Para, 90%. 10% and 90% are the proportions of shortcut questions in the training datasets. Gap lines (green) represent the performance gap between shortcut questions and challenging ones, smoothed by averaging over fixed-size windows to mitigate periodic fluctuations.

This phenomenon shows that higher proportions of challenging questions in the training set can encourage the models to explore the sophisticated reasoning skills, albeit in a later stage of training.

Take BiDAF trained on QWM-Para as an example (Figure 5 (a) & (b)). The F1 scores on shortcut test questions increase quickly in the first 300 steps, while the performance gap also widens rapidly, indicating a possible fast fitting of the shortcut tricks. In Figure 5 (b), with 90% shortcut training questions, the model performance on challenging questions is relatively steady during the next 800 steps, while the F1 score on shortcut questions stays at a high level of about 85%. This result shows that after fitting the shortcut tricks, the model trained with a higher shortcut proportion has almost correctly answered all the shortcut questions but fails to answer the challenging ones. With the gradient-based optimizer, it is difficult for the model to learn the challenging questions while keeping its high performance on the shortcut ones, which account for 90% of the training set. We suspect this is because the few unsolved challenging questions cannot motivate the model to explore sophisticated reasoning skills.

On the contrary, when 90% of the training data require challenging skills, the gap line peaks at 0.27, as shown in Figure 5 (a). Afterwards, the gap decreases to 0.24, with the F1 score on challenging questions increasing to more than 60%. Larger proportions of challenging questions in training prevent the models from relying heavily on shortcut tricks. This may be because, with fewer shortcut questions in training, fitting the shortcut tricks only benefits the training objective marginally. The large number of challenging questions that cannot be correctly answered during the early training steps now encourages models to explore more complicated reasoning skills.

Case Study When answering the example question Q.6 from SpM-Para, BERT trained with 10% shortcut questions tends to learn the simple matching trick quickly and correctly answers the shortcut version as early as step 380. However, the model cannot correctly answer the challenging variant until step 630. This difference demonstrates that, when trained with both types of questions, BERT learns the simple matching trick earlier than it learns to identify the required paraphrasing between why defections occur and errors caused by.

Q.6: Why do these defections occur?
P-challenging: ... Most of these errors are caused by [economic or financial factors]Ans ...
P-shortcut: ... Most of these defections occur because of [economic or financial factors]Ans. Most of these errors are caused by [economic or financial factors]Ans. ...


6 Related Work

Reading documents to answer natural language questions has drawn more and more attention in recent years (Xu et al., 2016; Minjoon et al., 2017; Lai et al., 2019; Glass et al., 2020). Most previous works focus on revealing the shortcut phenomenon in MRC from different perspectives. They find that manually designed features (Chen et al., 2016) or simple model architectures (Weissenborn et al., 2017) can obtain competitive performance, indicating that complicated inference procedures may be dispensable. Even without reading the entire questions or documents, models can still correctly answer a large portion of the questions (Sugawara et al., 2018; Kaushik and Lipton, 2018; Min et al., 2019). Therefore, current MRC datasets may lack benchmarking capacity on requisite skills (Sugawara et al., 2020), and models may be vulnerable to adversarial attacks (Jia and Liang, 2017; Wallace et al., 2019; Si et al., 2019). However, these works do not formally discuss or analyze why models learn shortcuts from the perspective of the learning procedure.

Toward designing better MRC datasets, Jiang and Bansal (2019) construct adversarial questions to guide models to learn multi-hop reasoning skills. Bartolo et al. (2020) propose a model-in-the-loop paradigm to annotate challenging questions. More recent works (Jhamtani and Clark, 2020; Ho et al., 2020) propose new datasets with evidence-based metrics to evaluate whether questions are solved via shortcuts. Our work aims at providing empirical evidence and analysis to the community by tracing the learning procedure and explaining how MRC models learn shortcuts, which is orthogonal to the existing works.

From a more general machine learning perspective, there are also efforts to explain how models learn easy and hard instances during training. Kalimeris et al. (2019) show that models tend to learn easier decision boundaries at the beginning of training. Our results empirically confirm this theoretical conclusion for the task of MRC and quantitatively show that larger proportions of shortcut questions in training make MRC models rely on shortcut tricks, rather than on comprehension skills like recognizing paraphrase relationships.

7 Conclusions

In this work, we try to answer why many MRC models learn shortcut tricks while ignoring the comprehension challenges that are purposely embedded in many benchmark datasets. We argue that large proportions of shortcut questions in training data push MRC models to rely on shortcut tricks excessively. To investigate this properly, we first design two synthetic datasets where each instance has a shortcut version paired with a challenging one which requires paraphrasing, a complex reasoning skill, to answer, rather than question word matching or simple matching. With these datasets, we are able to adjust the proportion of shortcut questions in both training and testing while keeping other factors relatively steady. We propose two methods to examine the model training process with respect to shortcut questions, which enables us to take a closer look at the learning mechanisms of BiDAF and BERT under different training settings. We find that learning shortcut questions generally requires fewer computational resources, and MRC models usually learn the shortcut questions in the early stage of training. Our findings reveal that, with larger proportions of shortcut questions for training, MRC models learn the shortcut tricks quickly while ignoring the designed comprehension challenges, since the remaining truly challenging questions, usually limited in number, may not motivate models to explore sophisticated solutions in the later training stage.

Acknowledgments

This work is supported in part by the National Hi-Tech R&D Program of China (No. 2020AAA0106600) and the NSFC under grant agreements 61672057 and 61672058. For any correspondence, please contact Yansong Feng.

References

Max Bartolo, A. Roberts, Johannes Welbl, Sebastian Riedel, and Pontus Stenetorp. 2020. Beat the AI: Investigating adversarial human annotation for reading comprehension. Transactions of the Association for Computational Linguistics, 8:662–678.

Danqi Chen, Jason Bolton, and Christopher D Manning. 2016. A thorough examination of the CNN/Daily Mail reading comprehension task. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2358–2367.

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936, Minneapolis, Minnesota. Association for Computational Linguistics.

Pradeep Dasigi, Nelson F. Liu, Ana Marasovic, Noah A. Smith, and Matt Gardner. 2019. Quoref: A reading comprehension dataset with questions requiring coreferential reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5925–5932, Hong Kong, China. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Li Dong, Jonathan Mallinson, Siva Reddy, and Mirella Lapata. 2017. Learning to paraphrase for question answering. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 875–886.

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2368–2378.

Michael Glass, Alfio Gliozzo, Rishav Chakravarti, Anthony Ferritto, Lin Pan, G P Shrivatsa Bhargav, Dinesh Garg, and Avi Sil. 2020. Span selection pre-training for question answering. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2773–2782, Online. Association for Computational Linguistics.

Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. 2020. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6609–6625, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Harsh Jhamtani and Peter Clark. 2020. Learning to explain: Datasets and models for identifying valid reasoning chains in multihop question-answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 137–150, Online. Association for Computational Linguistics.

Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2021–2031.

Yichen Jiang and Mohit Bansal. 2019. Avoiding reasoning shortcuts: Adversarial evaluation, training, and model development for multi-hop QA. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2726–2736, Florence, Italy. Association for Computational Linguistics.

Dimitris Kalimeris, Gal Kaplun, Preetum Nakkiran, Benjamin L Edelman, Tristan Yang, Boaz Barak, and Haofeng Zhang. 2019. SGD on neural networks learns functions of increasing complexity. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019.

Divyansh Kaushik and Zachary C Lipton. 2018. How much reading does reading comprehension require? A critical investigation of popular benchmarks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5010–5015.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466.

Yuxuan Lai, Yansong Feng, Xiaohan Yu, Zheng Wang, Kun Xu, and Dongyan Zhao. 2019. Lattice CNNs for matching based Chinese question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6634–6641.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.

Sewon Min, Eric Wallace, Sameer Singh, Matt Gardner, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2019. Compositional questions do not necessitate multi-hop reasoning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4249–4257, Florence, Italy. Association for Computational Linguistics.

Seo Minjoon, Kembhavi Aniruddha, Farhadi Ali, and Hajishirzi Hannaneh. 2017. Bidirectional attention flow for machine comprehension. In International Conference on Learning Representations.

Pramod Kaushik Mudrakarta, Ankur Taly, Mukund Sundararajan, and Kedar Dhamdhere. 2018. Did the model understand the question? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1896–1906.

Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Lixin Su, and Xueqi Cheng. 2019. HAS-QA: Hierarchical answer spans model for open-domain question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6875–6882.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392.

Siva Reddy, Danqi Chen, and Christopher D Manning. 2019. CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249–266.

Chenglei Si, Shuohang Wang, Min-Yen Kan, and Jing Jiang. 2019. What does BERT learn from multiple-choice reading comprehension datasets? ArXiv, abs/1910.12391.

Saku Sugawara, Kentaro Inui, Satoshi Sekine, and Akiko Aizawa. 2018. What makes reading comprehension questions easier? In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4208–4219.

Saku Sugawara, Pontus Stenetorp, Kentaro Inui, and Akiko Aizawa. 2020. Assessing the benchmarking capacity of machine reading comprehension datasets. In Thirty-Fourth AAAI Conference on Artificial Intelligence.

Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2017. NewsQA: A machine comprehension dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 191–200.

Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019. Universal adversarial triggers for attacking and analyzing NLP. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2153–2162, Hong Kong, China. Association for Computational Linguistics.

Dirk Weissenborn, Georg Wiese, and Laura Seiffe. 2017. Making neural QA as simple as possible but not simpler. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 271–280, Vancouver, Canada. Association for Computational Linguistics.

Jindou Wu, Yunlun Yang, Chao Deng, Hongyi Tang, Bingning Wang, Haoze Sun, Ting Yao, and Qi Zhang. 2019. Sogou machine reading comprehension toolkit. arXiv preprint arXiv:1903.11848.

Kun Xu, Yansong Feng, Songfang Huang, and Dongyan Zhao. 2016. Hybrid question answering over knowledge base and free text. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 2397–2407, Osaka, Japan. The COLING 2016 Organizing Committee.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380.


A Implementation Details

Synthetic Dataset Construction During the construction of the synthetic datasets, we used Stanford CoreNLP (Manning et al., 2014) to identify named entities and stop-words.

We set two empirical thresholds for identifying whether a question can be solved by simple matching or requires the paraphrasing skill. We consider a question as solvable via the simple matching trick if more than 75% of the non-stop words in the question exactly appear in the answer sentence. On the other hand, if the matching rate is below 25%, we consider it unsolvable via simple matching, calling for other skills like paraphrasing. Thus, during dataset construction, if the matching rate is above 25% after paraphrasing, we consider that the back translation fails to incorporate the paraphrasing challenge into the instance.
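A sketch of the non-stop-word overlap rate behind these thresholds; tokenize and STOP_WORDS stand in for the Stanford CoreNLP tokenization and stop-word list we actually use, and the denominator follows the description above (the question's non-stop words).

def overlap_rate(question, sentence):
    """Fraction of the question's non-stop-word tokens that also appear in the sentence."""
    q_toks = {t.lower() for t in tokenize(question)} - STOP_WORDS
    s_toks = {t.lower() for t in tokenize(sentence)} - STOP_WORDS
    if not q_toks:
        return 0.0
    return len(q_toks & s_toks) / len(q_toks)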

We construct the synthetic datasets from SQuAD (Rajpurkar et al., 2016). Compared with more recent MRC datasets (Yang et al., 2018; Kwiatkowski et al., 2019), most questions in SQuAD can be solved within a single sentence via simple matching, so we can conveniently use back translation to construct questions with paraphrasing challenges.

Paraphrasing in SpM-Para When constructing the SpM-Para dataset, we only select instances whose questions are very similar to the corresponding answer sentences (overlap > 75%) to ensure that a simple matching step can obtain the answers. For the shortcut version of an instance, we insert the paraphrased answer sentence into the passage and keep both the original answer sentence and the paraphrased answer sentence (see Algorithm 2). This operation ensures that the shortcut instances have both shortcut solutions and challenging solutions. For the challenging version, we only keep the paraphrased answer sentence in the passage and discard the original answer sentence, so that such instances can only be solved by identifying the embedded paraphrasing relationship.

Hyper-Parameters for QA Models We adopt the BERT (BERT-base uncased) (Devlin et al., 2019) and BiDAF (Minjoon et al., 2017) models with the implementation in the SogouMRC tools (Wu et al., 2019). The hyper-parameters are shown in Table 1. We used 100-d GloVe vectors (Pennington et al., 2014) in BiDAF. Note that these hyper-parameters are adopted in §3, §4, and §5. Our code and datasets can be found at https://github.com/luciusssss/why-learn-shortcut

For the simple matching setting, where multiple answer spans may appear in one passage, we follow Pang et al. (2019) and aggregate the probabilities of each span before computing the likelihood losses.
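In spirit, the aggregation works as sketched below; the exact loss used by Pang et al. (2019) and in our code may differ in detail.

import torch

def multi_span_nll(start_probs, end_probs, gold_spans):
    """Negative log-likelihood when several gold spans are acceptable.

    start_probs, end_probs: (seq_len,) probabilities after softmax.
    gold_spans: list of (start_idx, end_idx) pairs, one per occurrence of the answer.
    """
    total = sum(start_probs[s] * end_probs[e] for s, e in gold_spans)  # aggregate span probabilities
    return -torch.log(total + 1e-12)                                   # then compute the likelihood loss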

Data Sampling in Difficulty Evaluation In §4, we train BERT on the training sets of QWM-Para and SpM-Para and observe how the model converges with different learning steps and parameter sizes. However, we find that BERT achieves outstanding performance on both datasets within only one or two epochs. This is due to the strong learning ability of the BERT model and the fact that, with only one kind of answering pattern, both the pure shortcut and the pure challenging training sets are relatively easy to learn.

Under this circumstance, BERT's performance at most of the evaluation checkpoints after one epoch almost reaches the final performance, which makes the comparison vague. If we instead compare checkpoints within one epoch, the model has only been trained on part of the training data, so the evaluation results would reflect its generalization ability on unseen questions. This differs from our purpose, namely comparing the fitting difficulty of different kinds of questions. Therefore, we randomly sample 1000 pairs of instances for training and evaluation. With less training data, BERT does not converge within only one or two epochs, so we can properly evaluate the learning difficulty.

Computation Cost   We train the models on an NVIDIA 1080 Ti GPU. The number of parameters is 5M for BiDAF and 110M for BERT. The average training time per epoch on the synthetic datasets is 1 minute for BiDAF and 10 minutes for BERT.

                 BERT-base Uncased    BiDAF
# Epochs         3                    15
Batch size       6                    30
Optimizer        AdamWeightDecay      Adam
Learning rate    3e-5                 1e-3

Table 1: Hyper-Parameters for the experiments in §4, §5, and §6.
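For quick reference, the same settings can be written as plain configuration dictionaries; the key names below are illustrative and not tied to any particular training API.

```python
# Hyper-parameters from Table 1, as plain config dicts (illustrative key names).
BERT_CONFIG = {
    "epochs": 3,
    "batch_size": 6,
    "optimizer": "AdamWeightDecay",
    "learning_rate": 3e-5,
}

BIDAF_CONFIG = {
    "epochs": 15,
    "batch_size": 30,
    "optimizer": "Adam",
    "learning_rate": 1e-3,
    "embeddings": "glove.100d",  # 100-d GloVe vectors (Pennington et al., 2014)
}
```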

B A Variant of QWM-Para Dataset

When we train models with different proportions of shortcut questions on QWM-Para (Figure 3 (a) & (b), described in §3), we observe that even with pure challenging questions in training, BiDAF and BERT still perform much better on shortcut questions than on the challenging ones. We think this is possibly because, in these settings, models fail to exploit the paraphrasing skill but instead learn to guess one of the entities matching the question words as the answer. Using such a guessing trick instead of the paraphrasing skill can improve performance on the challenging questions to some extent, but it brings larger gains on the shortcut questions. Therefore, even with 100% challenging questions in training, the gap between the performance on challenging and shortcut test questions remains wide.

Figure 6: An illustration of how the questions in the new synthetic dataset, question word matching substituted, are constructed from the original questions. Original Q.7: "Who was rated as the most powerful female musician?"; P: "Forbes Magazine rated Beyonce as the most powerful female musician. She released a new album with Lisa...". Step 1 paraphrases the question (Q.7: "Who was named the most influential music girl?"); Step 2 drops redundant entities from the passage; Step 3 substitutes entities, yielding a shortcut version (Beyonce → Bella) and a challenging version (Beyonce → America).

To rule out these guessing solutions, we design a variant of the QWM-Para dataset, named QWM/subs-Para (subs refers to substituted, as elaborated in §B.1). We aim at investigating: 1) whether this variant can rule out the guessing alternative and decrease the performance gaps between challenging and shortcut questions when training with relatively low shortcut proportions; and 2) whether the experiments on this variant still confirm our previous findings about how shortcut questions in training affect model performance and the learning procedure, as described in §3 and §5.

B.1 Dataset Construction

The construction process of QWM/subs-Para is shown in Figure 6. The first two steps, question paraphrasing and redundant entity dropping, are the same as those used to construct shortcut questions in QWM-Para (see §2). Then, we perform entity substitution to rule out the potential guessing solutions.

Figure 7: F1 scores on challenging and shortcut questions with different proportions of shortcut questions in training. This experiment is conducted on QWM/subs-Para. Panels: (a) BiDAF, (b) BERT. The error bars represent the standard deviations of five runs.

Figure 8: F1 scores on challenging and shortcut questions with different training steps under different settings. This experiment is conducted on QWM/subs-Para. Panels: (a) BiDAF, 10%; (b) BiDAF, 90%; (c) BERT, 10%; (d) BERT, 90%, where 10% and 90% are the ratios of shortcut questions in the training datasets. Gaps (green lines) represent the performance gap between shortcut questions and challenging ones, smoothed by averaging over fixed-size windows to mitigate periodic fluctuations.

Particularly, for each candidate question, we substitute all the entities in the passage with random ones whose types are uniformly distributed over Person/Time/Location to construct the challenging questions. With this random substitution, one can hardly guess the correct answer from the question words alone. As shown in Q.7, after substituting the answer entity Beyonce with America, one cannot answer the new question by simply finding a Person entity according to the question word who. Replacing a person's name with a location may break the original semantics, but it forces the model to comprehend the context to find the answer. For the shortcut version, we also conduct random entity substitution, but within the same entity type, e.g., from Beyonce to Bella. This strategy prevents models from learning the trick of identifying replaced words as the answers.
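The substitution step can be sketched as follows, assuming entity annotations of type Person/Time/Location are already available from the NER step; the entity pool, function name, and signature are illustrative rather than the paper's actual implementation.

```python
import random

# Illustrative replacement entities; the actual pool is not specified here.
ENTITY_POOL = {
    "PERSON": ["Bella", "Marcus", "Elena"],
    "TIME": ["March 1923", "October 2005"],
    "LOCATION": ["America", "Oslo", "Nairobi"],
}

def substitute_entities(passage_entities, shortcut: bool) -> dict:
    """passage_entities: list of (surface_form, entity_type) pairs.

    Returns a mapping from original surface forms to replacement entities.
    """
    substitutions = {}
    for surface, etype in passage_entities:
        if shortcut:
            # Shortcut version: replace within the same type (e.g., Beyonce -> Bella),
            # so the question word still points at an entity of the expected type.
            new_type = etype
        else:
            # Challenging version: draw the type uniformly from Person/Time/Location,
            # so type-based guessing fails (e.g., Beyonce -> America).
            new_type = random.choice(list(ENTITY_POOL))
        substitutions[surface] = random.choice(ENTITY_POOL[new_type])
    return substitutions
```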

B.2 Results and Analysis

As shown in Figure 7, when the challenging questions are constructed with entity substitution, both the BiDAF and BERT models perform comparably on challenging and shortcut test questions when trained with 100% challenging questions. These results provide evidence that, after the substitution, models cannot use guessing as an alternative solution to the paraphrasing skill.

We conduct experiments on QWM/subs-Para similar to those in §3 and §5, with the results shown in Figure 7 and Figure 8, respectively. The tendencies also support our previous findings. For example, a larger shortcut ratio widens the performance gap between challenging and shortcut questions in Figure 7.