Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 5798–5810, November 16–20, 2020. ©2020 Association for Computational Linguistics


Tell Me How to Ask Again: Question Data Augmentation with Controllable Rewriting in Continuous Space

Dayiheng Liu♠∗  Yeyun Gong†  Jie Fu♦  Yu Yan‡  Jiusheng Chen‡  Jiancheng Lv♠  Nan Duan†  Ming Zhou†

♠College of Computer Science, Sichuan University  †Microsoft Research Asia  ♦Mila  ‡Microsoft

[email protected]

Abstract

In this paper, we propose a novel data augmentation method, referred to as Controllable Rewriting based Question Data Augmentation (CRQDA), for machine reading comprehension (MRC), question generation, and question-answering natural language inference tasks. We treat the question data augmentation task as a constrained question rewriting problem to generate context-relevant, high-quality, and diverse question data samples. CRQDA utilizes a Transformer autoencoder to map the original discrete question into a continuous embedding space. It then uses a pre-trained MRC model to revise the question representation iteratively with gradient-based optimization. Finally, the revised question representations are mapped back into the discrete space, where they serve as additional question data. Comprehensive experiments on SQuAD 2.0, SQuAD 1.1 question generation, and QNLI tasks demonstrate the effectiveness of CRQDA¹.

1 Introduction

Data augmentation (DA) is commonly used to improve the generalization ability and robustness of models by generating more training examples. Compared with the DA used in the fields of computer vision (Krizhevsky et al., 2012; Szegedy et al., 2015; Cubuk et al., 2019) and speech processing (Ko et al., 2015), how to design effective DA tailored to natural language processing (NLP) tasks remains a challenging problem. Unlike general image DA techniques such as rotation and cropping, it is more difficult to synthesize new high-quality and diverse text.

∗ Work done during an internship at Microsoft Research Asia.

¹ The source code and dataset will be available at https://github.com/dayihengliu/CRQDA.

Recently, several textual DA techniques have been proposed for NLP, mainly focused on text classification and machine translation tasks. One approach directly modifies the text locally with word deletion, word order changing, and word replacement (Fadaee et al., 2017; Kobayashi, 2018; Wei and Zou, 2019; Wu et al., 2019). Another popular approach utilizes a generative model to produce new text data, such as back-translation (Sennrich et al., 2016; Yu et al., 2018), data noising techniques (Xie et al., 2017), and pre-trained language generation models (Kumar et al., 2020; Anaby-Tavor et al., 2020).

Machine reading comprehension (MRC) (Rajpurkar et al., 2018), question generation (QG) (Du et al., 2017; Zhao et al., 2018), and question-answering natural language inference (QNLI) (Demszky et al., 2018; Wang et al., 2018) are receiving attention in the NLP community. MRC requires the model to find the answer given a paragraph² and a question, while QG aims to generate a question for a given paragraph, with or without a given answer. Given a question and a sentence from the relevant paragraph, QNLI requires the model to infer whether the sentence contains the answer to the question. Because these tasks require the model to reason about the question-paragraph pair, existing textual DA methods that directly augment question or paragraph data alone may produce irrelevant question-paragraph pairs, which cannot improve downstream model performance.

Question data augmentation (QDA) aims to automatically generate context-relevant questions to further improve model performance on the above tasks (Yang et al., 2019; Dong et al., 2019). Existing QDA methods mainly employ round-trip consistency (Alberti et al., 2019; Dong et al., 2019) to synthesize answerable questions. However, the round-trip consistency method cannot generate context-relevant unanswerable questions, and MRC with unanswerable questions is a challenging task (Rajpurkar et al., 2018; Kwiatkowski et al., 2019). Zhu et al. (2019) first studied unanswerable question DA; their method relies on annotated plausible answers to construct a small pseudo-parallel corpus of answerable-to-unanswerable questions for unanswerable question generation. Unfortunately, most question answering (QA) and MRC datasets do not provide such annotated plausible answers.

² It can also be a document span or a passage. For notational simplicity, we use "paragraph" to refer to it in the rest of this paper.

Inspired by recent progress in controllable text revision and text attribute transfer (Wang et al., 2019; Liu et al., 2020), we propose a new QDA method called Controllable Rewriting based Question Data Augmentation (CRQDA), which can generate both new context-relevant answerable questions and unanswerable questions. The main idea of CRQDA is to treat the QDA task as a constrained question rewriting problem. Instead of revising the discrete question directly, CRQDA revises the original question in a continuous embedding space under the guidance of a pre-trained MRC model. CRQDA has two components: (i) a Transformer-based autoencoder, whose encoder maps the question into a latent representation and whose decoder reconstructs the question from that representation; and (ii) an MRC model pre-trained on the original dataset, which tells us how to revise the question representation so that the reconstructed question is a context-relevant unanswerable or answerable question. The original question is first mapped into a continuous embedding space. Next, the pre-trained MRC model provides the guidance to revise the question representation iteratively with gradient-based optimization. Finally, the revised question representations are mapped back into the discrete space and act as additional question data for training.

In summary, our contributions are as follows: (1) We propose a novel controllable rewriting based QDA method, which can generate additional high-quality, context-relevant, and diverse answerable and unanswerable questions. (2) We compare the proposed CRQDA with state-of-the-art textual DA methods on the SQuAD 2.0 dataset, and CRQDA consistently outperforms all of these strong baselines. (3) In addition to MRC tasks, we further apply CRQDA to question generation and QNLI tasks, and comprehensive experiments demonstrate its effectiveness.

2 Related Work

Recently, textual data augmentation has attracted a lot of attention. One popular class of textual DA methods is confined to locally modifying text in the discrete space to synthesize new data. Wei and Zou (2019) propose a universal DA technique for NLP called easy data augmentation (EDA), which performs synonym replacement, random insertion, random swap, or random deletion operations to modify the original text. Jungiewicz and Smywinski-Pohl (2019) propose a word synonym replacement method based on WordNet, and Kobayashi (2018) relies on paradigmatic relations between words. More recently, CBERT (Wu et al., 2019) retrofits BERT (Devlin et al., 2018) into a conditional BERT that predicts masked tokens for word replacement. These DA methods are mainly designed for text classification tasks.

Unlike modifying a few local words, another commonly used textual DA strategy is to use a generative model to generate entirely new textual samples, including variational autoencoders (VAEs) (Kingma and Welling, 2013; Rezende et al., 2014), generative adversarial networks (GANs) (Tanaka and Aranha, 2019), and pre-trained language generation models (Radford et al., 2019; Kumar et al., 2020; Anaby-Tavor et al., 2020). Back-translation (Sennrich et al., 2016; Yu et al., 2018) is also a major approach to textual DA, which uses a machine translation model to translate English sentences into another language (e.g., French) and back into English. Besides, data noising techniques (Xie et al., 2017; Marivate and Sefara, 2019) and paraphrasing (Kumar et al., 2019) have been proposed to generate new textual samples. All the methods mentioned above usually generate individual sentences separately. For QDA on MRC, QG, and QNLI tasks, these DA approaches cannot guarantee that the generated questions are relevant to the given paragraph. In order to generate context-relevant answerable and unanswerable questions, our CRQDA method utilizes a pre-trained MRC model as guidance to revise the question in a continuous embedding space, which can be seen as a special constrained paraphrasing method for QDA.

Question generation (Heilman and Smith, 2010; Du et al., 2017; Zhao et al., 2018; Zhang and Bansal, 2019) is attracting attention in the field of natural language generation (NLG). However, most previous works are not designed for QDA; that is, they do not aim to generate context-relevant questions that improve downstream model performance. Compared to QG, QDA is relatively unexplored. Recently, some works (Alberti et al., 2019; Dong et al., 2019) utilize the round-trip consistency technique to synthesize answerable questions. They first use a generative model to generate a question with the paragraph and answer as model input, and then use a pre-trained MRC model to filter the synthetic question data. However, they are unable to generate context-relevant unanswerable questions. It should be noted that our method and round-trip consistency are orthogonal: CRQDA can also rewrite the synthetic question data produced by other methods to obtain new answerable and unanswerable question data. Unanswerable QDA was first explored by Zhu et al. (2019), who construct a small pseudo-parallel corpus of paired answerable and unanswerable questions and then generate relevant unanswerable questions in a supervised manner. This method relies on annotated plausible answers for the unanswerable questions, which do not exist in most QA and MRC datasets. Instead, our method rewrites the original answerable question into a relevant unanswerable question in an unsupervised paradigm; it can also rewrite the original answerable question into another new relevant answerable question.

Our method is inspired by recent progress on controllable text revision and text attribute transfer (Wang et al., 2019; Liu et al., 2020). However, our approach differs in several ways. First, those methods transfer the attribute of a single sentence alone, whereas our method considers the given paragraph to rewrite a context-relevant question. Second, existing methods jointly train an attribute classifier to revise the sentence representation, while our method utilizes a pre-trained MRC model, which shares its embedding space with the autoencoder, as the guidance to revise the question representation. Finally, the questions generated by our method serve as augmented data that benefit the downstream tasks.

3 Methodology

3.1 Problem Formulation

We consider an extractive MRC dataset D, such as SQuAD 2.0 (Rajpurkar et al., 2018), which consists of |D| 5-tuples (q, d, s, e, t), where |D| is the data size, q = {q_1, ..., q_n} is a tokenized question of length n, d = {d_1, ..., d_m} is a tokenized paragraph of length m, s, e ∈ {0, 1, ..., m−1} are inclusive indices pointing to the start and end of the answer span, and t ∈ {0, 1} indicates whether the question q is answerable or unanswerable given d. Given a data tuple (q, d, s, e, t), we aim to rewrite q into a new answerable or unanswerable question q' and obtain a new data tuple (q', d, s, e, t') that fulfills certain requirements: (i) a generated answerable question can be answered by the answer span (s, e) in d, while a generated unanswerable question cannot be answered with d; (ii) the generated question should be relevant to the original question q and paragraph d; (iii) the augmented dataset D' should further improve the performance of MRC models.
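For concreteness, each training example can be held in a simple container like the sketch below. This is a hypothetical illustration, not the paper's code; the example values paraphrase the Spectre sample shown in Figure 3.

```python
from typing import List, NamedTuple

class MRCExample(NamedTuple):
    """One SQuAD 2.0-style 5-tuple (q, d, s, e, t)."""
    q: List[str]  # tokenized question
    d: List[str]  # tokenized paragraph
    s: int        # inclusive start index of the answer span in d
    e: int        # inclusive end index of the answer span in d
    t: int        # 1 = answerable, 0 = unanswerable

# The inclusive span d[4..5] ("Eon Productions") answers q, so t = 1.
ex = MRCExample(
    q="Which company made Spectre ?".split(),
    d="Spectre was produced by Eon Productions in 2015 .".split(),
    s=4, e=5, t=1,
)
```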

3.2 Method Overview

Figure 1 shows the overall architecture of CRQDA. The proposed model consists of two components: a pre-trained language model based MRC model, described in § 3.3, and a Transformer-based autoencoder, introduced in § 3.4. Given a question q from the original dataset D, we first map q into a continuous embedding space. We then revise the question embeddings by gradient-based optimization under the guidance of the MRC model (§ 3.5). Finally, the revised question embeddings are fed into the Transformer-based autoencoder to generate new question data.

3.3 Pre-trained Language Model based MRC Model

In this paper, we adopt pre-trained language model based MRC models (e.g., BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019b)) as our baseline MRC models. Without loss of generality, we take the BERT-based MRC model as an example to introduce our method, as shown in the left part of Figure 1.

Following Devlin et al. (2018), given a data tuple (q, d, s, e, t), we concatenate a "[CLS]" token, the tokenized question q of length n, a "[SEP]" token, the tokenized paragraph d of length m, and a final "[SEP]" token, and feed the resulting sequence into the BERT model. The question q and paragraph d are first mapped into two sequences of embeddings:

E_q, E_d = BertEmbedding(q, d)    (1)


Figure 1: The architecture of CRQDA.

where BertEmbedding(·) denotes the BERT embedding layer, which sums the corresponding token, segment, and position embeddings; E_q ∈ R^{(n+2)×h} and E_d ∈ R^{m×h} denote the question embedding and the paragraph embedding, respectively.
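For illustration, here is a minimal sketch of such an embedding layer (the layer CRQDA later shares between the MRC model and the autoencoder). `BertStyleEmbedding` and its default sizes are assumptions, not the Huggingface implementation.

```python
import torch
import torch.nn as nn

class BertStyleEmbedding(nn.Module):
    """Sketch of Eq. (1): sum of token, segment, and position embeddings."""
    def __init__(self, vocab_size: int = 30522, h: int = 768, max_len: int = 512):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, h)
        self.seg = nn.Embedding(2, h)   # segment 0 = question side, 1 = paragraph side
        self.pos = nn.Embedding(max_len, h)

    def forward(self, token_ids, segment_ids):
        # token_ids, segment_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.tok(token_ids) + self.seg(segment_ids) + self.pos(positions)
```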

E_q and E_d are further fed into the BERT layers, which consist of multiple Transformer layers (Vaswani et al., 2017), to obtain the final hidden representations {C, T_{q_1}, ..., T_{q_n}, T_{[SEP]}, T_{d_1}, ..., T_{d_m}}, as shown in Figure 1. The representation vector C ∈ R^h corresponding to the first input token ([CLS]) is fed into a binary classification layer to output the probability that the question is answerable:

P_a(is-answerable) = Sigmoid(C W_c^T + b_c)    (2)

where W_c ∈ R^{2×h} and b_c ∈ R^2 are trainable parameters. The final hidden representations of the paragraph, {T_{d_1}, ..., T_{d_m}} ∈ R^{m×h}, are fed into two classification layers to output the probabilities of the start position and the end position of the answer span:

P_s(i = ⟨start⟩) = Sigmoid(T_{d_i} W_s^T + b_s)    (3)

P_e(i = ⟨end⟩) = Sigmoid(T_{d_i} W_e^T + b_e)    (4)

where W_s ∈ R^{1×h}, W_e ∈ R^{1×h}, b_s ∈ R^1, and b_e ∈ R^1 are trainable parameters.

For the data tuple (q, d, s, e, t), the total loss of the MRC model can be written as

L_mrc = λ L_a(t) + L_s(s) + L_e(e)
      = −λ log P_a(t) − log P_s(s) − log P_e(e)    (5)

where λ is a hyper-parameter.
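To make the heads and loss above concrete, here is a minimal PyTorch sketch. It is an illustration under assumptions, not the authors' released code: `MRCHeads` and its argument names are invented, and it uses the softmax cross-entropy formulation common in Huggingface-style MRC heads rather than the per-class sigmoid written in Eqs. (2)-(4).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MRCHeads(nn.Module):
    """Sketch of Eqs. (2)-(5): answerability head plus start/end span heads."""
    def __init__(self, hidden_size: int, lam: float = 1.0):
        super().__init__()
        self.answerable = nn.Linear(hidden_size, 2)  # W_c, b_c in Eq. (2)
        self.start = nn.Linear(hidden_size, 1)       # W_s, b_s in Eq. (3)
        self.end = nn.Linear(hidden_size, 1)         # W_e, b_e in Eq. (4)
        self.lam = lam                               # the lambda weight in Eq. (5)

    def forward(self, cls_vec, para_hidden, t, s, e):
        # cls_vec: (batch, h), the [CLS] vector C; para_hidden: (batch, m, h), T_{d_1..d_m}
        loss_a = F.cross_entropy(self.answerable(cls_vec), t)
        start_logits = self.start(para_hidden).squeeze(-1)  # (batch, m)
        end_logits = self.end(para_hidden).squeeze(-1)
        loss_s = F.cross_entropy(start_logits, s)
        loss_e = F.cross_entropy(end_logits, e)
        return self.lam * loss_a + loss_s + loss_e          # L_mrc, Eq. (5)
```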

3.4 Transformer-based Autoencoder

As shown in the right part of Figure 1, the original question q is first mapped into the question embedding E_q with the BERT embedding layer. Note that the Transformer encoder and the pre-trained MRC model share³ the parameters of the embedding layer, which places the question embeddings of the two models in the same continuous embedding space.

We obtain the encoder hidden states H_enc ∈ R^{(n+2)×h} from the Transformer encoder. The objective of the Transformer autoencoder is to reconstruct the input question itself, optimized with cross-entropy (Dai and Le, 2015). A trivial solution for the autoencoder would be to simply copy tokens on the decoder side. To avoid this, we do not feed the whole H_enc directly to the decoder; instead, following Wang et al. (2019), we use an RNN-GRU (Cho et al., 2014) layer with sum pooling to obtain a latent vector z ∈ R^h, which we feed to the decoder to reconstruct the question:

H_enc = TransformerEncoder(q)    (6)

z = Sum(GRU(H_enc))    (7)

q̂ = TransformerDecoder(z)    (8)

We can train the autoencoder on the question data of D, or pre-train it on other large-scale corpora such as BookCorpus (Zhu et al., 2015) and English Wikipedia.

³ The parameters of the Transformer encoder's embedding layer are copied from the pre-trained MRC model, and are fixed during training.
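As a rough sketch of this autoencoder (Eqs. (6)-(8)): the class below is an assumption-laden illustration, not the released implementation. `QuestionAutoencoder` and its defaults merely mirror the 6-layer, h = 1024, 16-head configuration reported later in § 4.1.

```python
import torch
import torch.nn as nn

class QuestionAutoencoder(nn.Module):
    """Sketch of the Transformer autoencoder with a sum-pooled GRU bottleneck."""
    def __init__(self, embedding: nn.Embedding, h: int = 1024, layers: int = 6, heads: int = 16):
        super().__init__()
        self.embedding = embedding                    # shared with the MRC model and frozen
        self.embedding.weight.requires_grad = False
        enc = nn.TransformerEncoderLayer(h, heads, dim_feedforward=4096, batch_first=True)
        dec = nn.TransformerDecoderLayer(h, heads, dim_feedforward=4096, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, layers)
        self.decoder = nn.TransformerDecoder(dec, layers)
        self.gru = nn.GRU(h, h, batch_first=True)
        self.lm_head = nn.Linear(h, embedding.num_embeddings)

    def encode(self, e_q):
        # e_q: (batch, n+2, h), the continuous question embedding E_q
        h_enc = self.encoder(e_q)                     # Eq. (6)
        out, _ = self.gru(h_enc)
        return out.sum(dim=1, keepdim=True)           # Eq. (7): sum pooling -> latent z

    def decode(self, z, tgt_ids):
        tgt = self.embedding(tgt_ids)                 # teacher forcing during training
        h_dec = self.decoder(tgt, memory=z)           # Eq. (8)
        return self.lm_head(h_dec)                    # token logits for the cross-entropy loss
```

Taking the continuous E_q (rather than token ids) as the encoder input is what lets CRQDA later pass in a revised E_q' without re-discretizing it first.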


Figure 2: The question rewriting process of CRQDA.

Algorithm 1: Question Rewriting with Gradient-based Optimization

Input: data tuple (q, d, s, e, t); original question embedding E_q; pre-trained MRC model and Transformer autoencoder; a set of initial step sizes S_η = {η_i}; step-size decay coefficient β_s; target answerable or unanswerable label t'; thresholds β_t, β_a, β_b
Output: a set of new answerable and unanswerable question data tuples D' = {(q', d, s, e, t'), ...}

 1: D' = {}
 2: for each η ∈ S_η do
 3:   for max-steps do
 4:     revise E_q' by Eq. (10) or Eq. (9)
 5:     q' = TransformerAutoencoder(E_q')
 6:     if P_a(t') > β_t and J(q, q') ∈ [β_a, β_b] then
 7:       add (q', d, s, e, t') to D'
 8:     end if
 9:     η = β_s · η
10:   end for
11: end for
12: return D'

3.5 Rewriting Question with Gradient-based Optimization

As mentioned above, the question embeddings of the Transformer encoder and of the pre-trained MRC model lie in the same continuous embedding space, so we can revise the question embedding under gradient guidance from the MRC model. The revised question embedding E_q' is fed into the Transformer autoencoder to generate a new question q'.

Figure 2 illustrates the question rewriting process. Specifically, we take rewriting an answerable question into a relevant unanswerable question as an example. Given an answerable question q, the goals of the rewriting are: (I) the revised question embedding should make the pre-trained MRC model predict the question as unanswerable (rather than answerable) given the paragraph d; (II) the modification size of E_q should be adaptive, to prevent the revision of E_q from falling into a local optimum; (III) the revised question q' should be similar to the original q, which helps to improve the robustness of the model.

For goal (I), we take the label t' = 0, which denotes that the question is unanswerable, to calculate the loss L_a(t') and the gradient with respect to E_q through the pre-trained MRC model (see the red line in Figure 2). We iteratively revise E_q with gradients from the pre-trained MRC model until the MRC model predicts the question as unanswerable given the revised E_q' as input, i.e., until P_a(t'|E_q') is larger than a threshold β_t. Note that the gradient is used only to modify E_q; all model parameters are fixed during the rewriting process. Each iteration can be written as:

E_q' = E_q − η ∇_{E_q} L_a(t')    (9)

where η is the step size. Similarly, we can revise the E_q of a data tuple (q, d, s, e, t) to generate a new answerable question whose answer is still the original answer span (s, e):

E_q' = E_q − η ∇_{E_q} (λ L_a(t) + L_s(s) + L_e(e))    (10)

Rewriting an answerable question into another answerable question can be seen as a special constrained paraphrasing, which requires that the paraphrased question is context-relevant and answerable, and that its answer remains unchanged.
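A minimal sketch of one such update step follows, assuming a closure `mrc_loss_fn` that runs the frozen MRC model from a question embedding and returns the three loss terms for the target labels; the function name and signature are illustrative.

```python
import torch

def revise_question_embedding(e_q, mrc_loss_fn, eta, toward_answerable=False, lam=1.0):
    """One step of Eq. (9) (unanswerable target) or Eq. (10) (answerable target)."""
    e_q = e_q.detach().requires_grad_(True)        # only E_q is updated; all weights stay frozen
    loss_a, loss_s, loss_e = mrc_loss_fn(e_q)      # losses w.r.t. the target labels t', s, e
    if toward_answerable:
        loss = lam * loss_a + loss_s + loss_e      # Eq. (10): keep the span (s, e) answerable
    else:
        loss = loss_a                              # Eq. (9): push toward the label t' = 0
    grad, = torch.autograd.grad(loss, e_q)
    return (e_q - eta * grad).detach()             # E_q' = E_q - eta * gradient
```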

For goal (II), we follow Wang et al. (2019) and employ the dynamic-weight-initialization method to allocate a set of step sizes S_η = {η_i} as initial step sizes. For each initial step size, we perform a pre-defined max-step revision with step-size decay (lines 2-11 of Algorithm 1) to find the target question embedding. For goal (III), we select the q' whose unigram word overlap rate with the original question q lies within a threshold range [β_a, β_b]. The unigram word overlap is computed as:

J(q, q') = count(w_q ∩ w_{q'}) / count(w_q ∪ w_{q'})    (11)

where w_q and w_{q'} denote the sets of words in q and q', respectively. The whole question rewriting procedure is summarized in Algorithm 1.
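Putting the pieces together, here is a sketch of Algorithm 1. `p_answerable` and `autoencode` are hypothetical wrappers around the frozen MRC model (returning P_a(t') from an embedding) and the autoencoder (decoding an embedding back to tokens); this is an illustration, not the released code.

```python
def unigram_overlap(q_tokens, q_prime_tokens):
    """Eq. (11): Jaccard overlap between the unigram sets of q and q'."""
    wq, wqp = set(q_tokens), set(q_prime_tokens)
    return len(wq & wqp) / len(wq | wqp)

def rewrite_question(example, e_q, step_sizes, beta_s, beta_t, beta_a, beta_b,
                     max_steps, mrc_loss_fn, p_answerable, autoencode,
                     toward_answerable=False):
    """Sketch of Algorithm 1, built from revise_question_embedding above."""
    augmented = []
    for eta in step_sizes:                         # dynamically initialized step sizes S_eta
        e_q_prime = e_q
        for _ in range(max_steps):
            e_q_prime = revise_question_embedding(e_q_prime, mrc_loss_fn, eta, toward_answerable)
            q_prime = autoencode(e_q_prime)        # decode back to a discrete question q'
            flipped = p_answerable(e_q_prime) > beta_t
            if flipped and beta_a <= unigram_overlap(example.q, q_prime) <= beta_b:
                augmented.append((q_prime, example.d, example.s, example.e,
                                  int(toward_answerable)))
            eta *= beta_s                          # step-size decay
    return augmented
```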

4 Experiments

In this section, we describe the experimental details and results. We first conduct experiments on the SQuAD 2.0 dataset (Rajpurkar et al., 2018) to compare CRQDA with other strong baselines (§ 4.1). The ablation study and further analysis are provided in § 4.2. We then evaluate our method on two additional tasks: question generation on the SQuAD 1.1 dataset (Rajpurkar et al., 2016) in § 4.3, and question-answering natural language inference on the QNLI dataset (Wang et al., 2018) in § 4.4.

Methods                                      EM     F1
BERT_large (Devlin et al., 2018) (original)  78.7   81.9
+ EDA (Wei and Zou, 2019)                    78.3   81.6
+ Back-Translation (Yu et al., 2018)         77.9   81.2
+ Text-VAE (Liu et al., 2019a)               75.3   78.6
+ AE with Noise                              76.7   79.8
+ 3M synth (Alberti et al., 2019)            80.1   82.8
+ UNANSQ (Zhu et al., 2019)                  80.0   83.0
+ CRQDA (ours)                               80.6   83.3

Table 1: Comparison results on SQuAD 2.0.

4.1 SQuAD

The extractive MRC benchmark SQuAD 2.0 contains about 100,000 answerable questions and over 50,000 unanswerable questions. Each question is paired with a Wikipedia paragraph.

Implementation  Based on the RobertaForQuestionAnswering⁴ model from Huggingface (Wolf et al., 2019), we train a RoBERTa_large model on SQuAD 2.0 as the pre-trained MRC model for CRQDA. The hyper-parameters are the same as in the original paper (Liu et al., 2019b). To train the autoencoder, we copy the word embedding parameters of the pre-trained MRC model to the autoencoder and fix them during training. Both its encoder and decoder consist of 6-layer Transformers, where the inner dimension of the feed-forward networks (FFN), the hidden state size, and the number of attention heads are set to 4096, 1024, and 16, respectively.

The autoencoder is trained on BookCorpus (Zhu et al., 2015) and English Wikipedia (Devlin et al., 2018). The sequence length, batch size, learning rate, and number of training steps are set to 64, 256, 5e-5, and 100,000, respectively. For each original answerable data sample, we use CRQDA to generate new unanswerable question data, resulting in about 220K data samples (including the original data samples). The hyper-parameters β_s, β_t, β_a, β_b, and max-step are set to 0.9, 0.5, 0.5, 0.99, and 5, respectively.

⁴ https://github.com/huggingface/transformers.

Baselines  We compare CRQDA against the following baselines: (1) EDA (Wei and Zou, 2019): augments question data by performing synonym replacement, random insertion, random swap, or random deletion operations. We implement EDA with their source code⁵ to synthesize a new question for each question in SQuAD 2.0. (2) Back-Translation (Yu et al., 2018; Prabhumoye et al., 2018): uses a machine translation model to translate questions into French and back into English; we implement it based on the source code⁶ to generate a new question for each original question. (3) Text-VAE (Bowman et al., 2016; Liu et al., 2019a): uses an RNN-based VAE to generate a new question for each question in SQuAD 2.0; the implementation is based on the source code⁷. (4) AE with Noise: uses the same autoencoder as CRQDA for question rewriting; the only difference is that it does not utilize the MRC gradient, but instead uses noise (sampled from a Gaussian distribution) to revise the question embedding. This experiment is designed to show the necessity of the pre-trained MRC model. (5) 3M synth (Alberti et al., 2019): employs the round-trip consistency technique to synthesize 3M questions on SQuAD 2.0. (6) UNANSQ (Zhu et al., 2019): employs a pair-to-sequence model to generate 69,090 unanswerable questions. Following previous methods (Zhu et al., 2019; Alberti et al., 2019), we use each augmented dataset to fine-tune a BERT_large model, where the implementation is also based on Huggingface.

Results  For SQuAD 2.0, Exact Match (EM) and F1 score are used as evaluation metrics. The results on the SQuAD 2.0 development set are shown in Table 1. The popular textual DA methods (EDA, Back-Translation, Text-VAE, and AE with Noise) do not improve the performance of the MRC model. One possible reason is that they introduce detrimental noise into the training process, since they augment question data without considering the paragraphs and the associated answers. In sharp contrast, the QDA methods (3M synth, UNANSQ, and CRQDA) improve the model performance. Moreover, CRQDA outperforms all the strong baselines, bringing about a 1.9-point absolute EM improvement and a 1.4-point absolute F1 improvement over BERT_large. We provide augmented data samples for each baseline in Appendix A.

⁵ https://github.com/jasonwei20/eda_nlp.
⁶ https://github.com/shrimai/Style-Transfer-Through-Back-Translation.
⁷ https://github.com/dayihengliu/Mu-Forcing-VRAE.

4.2 Ablation and Analysis

Our ablation study and further analysis are designed to answer the following questions. Q1: How useful is the augmented data synthesized by our method when it is used to train other MRC models? Q2: How does the choice of corpora for autoencoder training influence performance? Q3: How do different CRQDA augmentation strategies influence model performance?

Methods         EM            F1
BERT_base       73.7          76.3
+ CRQDA         75.8 (+2.1)   78.7 (+2.4)
BERT_large      78.7          81.9
+ CRQDA         80.6 (+1.9)   83.3 (+1.4)
RoBERTa_base    78.6          81.6
+ CRQDA         80.2 (+1.6)   83.1 (+1.5)
RoBERTa_large   86.0          88.9
+ CRQDA         86.4 (+0.4)   89.5 (+0.6)

Table 2: Results of different MRC models with CRQDA on SQuAD 2.0.

To answer the first question (Q1), we use the augmented SQuAD 2.0 dataset from § 4.1 to train different MRC models (BERT_base, BERT_large, RoBERTa_base, and RoBERTa_large). The hyper-parameters and implementation are based on Huggingface (Wolf et al., 2019), and the results are presented in Table 2. CRQDA improves the performance of every MRC model, yielding a 2.4-point absolute F1 improvement with BERT_base and a 1.5-point absolute F1 improvement with RoBERTa_base. Moreover, although we use a RoBERTa_large model to guide the rewriting of the question data, the augmented dataset further improves even that model's performance.

Methods               EM     F1     R-L    B4
BERT_base             73.7   76.3   -      -
+ CRQDA (SQuAD 2)     74.8   77.7   82.9   59.6
+ CRQDA (2M ques)     75.3   78.2   97.8   94.7
+ CRQDA (Wiki)        75.8   78.7   99.3   98.4
+ CRQDA (Wiki+Mask)   75.4   78.4   99.7   99.4

Table 3: Results of training the autoencoder on different corpora. R-L is short for ROUGE-L, and B4 is short for BLEU-4.

For the second question (Q2), we conduct experiments using the following corpora to train the autoencoder of CRQDA: (1) SQuAD 2.0: we use all the questions from the training set of SQuAD 2.0; (2) 2M questions: we collect 2,072,133 questions from the training sets of several MRC and QA datasets, including SQuAD 2.0, Natural Questions, NewsQA (Trischler et al., 2016), QuAC (Choi et al., 2018), TriviaQA (Joshi et al., 2017), CoQA (Reddy et al., 2019), HotpotQA (Yang et al., 2018), DuoRC (Saha et al., 2018), and MS MARCO (Bajaj et al., 2016); (3) Wiki: we use the large-scale English Wikipedia and BookCorpus (Zhu et al., 2015) corpora to train the autoencoder; (4) Wiki+Mask: we also train the autoencoder on English Wikipedia and BookCorpus as in Wiki, but additionally mask 15% of the encoder input tokens at random with a special token, similar to the masking strategy used in (Devlin et al., 2018; Song et al., 2019).

We first measure the reconstruction performance of the autoencoders on the question data of the SQuAD 2.0 development set, using BLEU-4 (Papineni et al., 2002) and ROUGE-L (Lin, 2004) as evaluation metrics. We then use these autoencoders for CRQDA question rewriting with the same settings as in § 4.1, and use the resulting augmented SQuAD 2.0 datasets to fine-tune a BERT_base model, whose performance we report in Table 3. With more training data, the reconstruction performance of the autoencoder improves, as does the performance of the fine-tuned BERT_base model. When trained on Wiki and Wiki+Mask, the autoencoders can reconstruct almost all questions well, and the model trained with Wiki+Mask reconstructs best. However, the BERT_base model fine-tuned with the autoencoder trained on Wiki outperforms the one trained on Wiki+Mask. The reason might be that an autoencoder trained with a denoising task becomes insensitive to the word embedding revisions of CRQDA; in other words, some revisions guided by the MRC gradients might be filtered out as noise by such an autoencoder.

Methods                             EM      F1
RoBERTa_large (Liu et al., 2019b)   86.00   88.94
+ CRQDA (unans, β_a = 0.7)          86.39   89.31
+ CRQDA (unans, β_a = 0.5)          86.43   89.50
+ CRQDA (unans, β_a = 0.3)          86.26   89.35
+ CRQDA (ans)                       86.22   89.30
+ CRQDA (ans+unans)                 86.36   89.38

Table 4: Results of using different CRQDA augmented datasets for MRC training.

For the last question (Q3), we use CRQDA for question data augmentation with different settings. For each answerable original question data sample from the training set of SQuAD 2.0, we use CRQDA to generate both answerable and unanswerable question examples. The augmented unanswerable question data (unans), the augmented answerable question data (ans), and their union (ans + unans) are then used to fine-tune a RoBERTa_large model. To further analyze the effect of β_a (a larger β_a means that the generated questions are closer to the original question in the discrete space), we use β_a = 0.3, 0.5, 0.7 for question rewriting. The results are reported in Table 4. The MRC model achieves the best performance when β_a = 0.5. Moreover, all of the unans, ans, and ans + unans augmented datasets further improve performance. However, the RoBERTa_large model fine-tuned on ans + unans performs worse than the one fine-tuned on unans alone; this mixed result shows that using more augmented data is not always beneficial.

Method                          B4      MTR     R-L
UniLM (Dong et al., 2019)       22.12   25.06   51.07
ProphetNet (Yan et al., 2020)   25.01   26.83   52.57
ProphetNet + CRQDA              25.95   27.40   53.15

UniLM (Dong et al., 2019)       23.75   25.61   52.04
ProphetNet (Yan et al., 2020)   26.72   27.64   53.79
ProphetNet + CRQDA              27.21   27.81   54.21

Table 5: Results on SQuAD 1.1 question generation. B4 is short for BLEU-4, MTR is short for METEOR, and R-L is short for ROUGE-L. The first block follows the data split in Du et al. (2017), while the second block is the same as in Zhao et al. (2018).

4.3 Question Generation

Answer-aware question generation (Zhou et al., 2017) aims to generate a question for a given answer span within a paragraph. We apply CRQDA to the SQuAD 1.1 (Rajpurkar et al., 2016) question generation task to further evaluate it. The settings of CRQDA are the same as in § 4.1 and § 4.2. The augmented answerable question dataset is used to fine-tune the ProphetNet (Yan et al., 2020) model, which achieves the best performance on the SQuAD 1.1 question generation task; the implementation is based on their source code⁸. We also compare with the previous state-of-the-art model UniLM (Dong et al., 2019). Following Yan et al. (2020), we use the BLEU-4, METEOR (Banerjee and Lavie, 2005), and ROUGE-L metrics for evaluation, and we split the SQuAD 1.1 dataset into training, development, and test sets. We also report results on the other data split setting from Yan et al. (2020), which reverses the development set and test set. The results are shown in Table 5: CRQDA improves ProphetNet on all three metrics and achieves a new state of the art on this task.

⁸ https://github.com/microsoft/ProphetNet.

Methods                            Accuracy
BERT_large (Devlin et al., 2018)   92.3
BERT_large + CRQDA                 93.0

Table 6: Results on QNLI.

4.4 QNLI

Given a question and a context sentence, question-answering NLI asks the model to infer whether the context sentence contains the answer to the question. The QNLI dataset (Wang et al., 2018) contains 105K data samples. We apply CRQDA to the QNLI dataset to generate new entailment and non-entailment data samples. Note that this task does not involve an MRC model, but uses a text entailment classification model instead. Accordingly, we train a BERT_large model based on the BertForSequenceClassification code in Huggingface to replace the "pre-trained MRC model" of CRQDA and guide the question data rewriting. Following the settings in § 4.1, we use CRQDA to synthesize about 42K new data samples as augmented data. Note that we only rewrite the question and keep the paired context sentence unchanged. The augmented data and the original dataset are then used to fine-tune the BERT_large model. Table 6 shows the results: CRQDA increases the accuracy of the BERT_large model by 0.7%, which further demonstrates its effectiveness.

5 Conclusion

In this work, we present a novel question data augmentation method, called CRQDA, for context-relevant answerable and unanswerable question generation. CRQDA treats the question data augmentation task as a constrained question rewriting problem. Under the guidance of a pre-trained MRC model, the original question is revised in a continuous embedding space with gradient-based optimization and then decoded back into the discrete space as a new question data sample. The experimental results demonstrate that CRQDA outperforms other strong baselines on SQuAD 2.0, and the CRQDA-augmented datasets improve multiple reading comprehension models. Furthermore, CRQDA improves model performance on question generation and question-answering natural language inference tasks, achieving a new state of the art on the SQuAD 1.1 question generation task.

Acknowledgment

This work is supported in part by the National Natural Science Fund for Distinguished Young Scholar under Grant 61625204, and in part by the State Key Program of the National Science Foundation of China under Grant 61836006.

References

Chris Alberti, Daniel Andor, Emily Pitler, Jacob Devlin, and Michael Collins. 2019. Synthetic QA corpora generation with roundtrip consistency. arXiv preprint arXiv:1906.05416.

Ateret Anaby-Tavor, Boaz Carmeli, Esther Goldbraich, Amir Kantor, George Kour, Segev Shlomov, Naama Tepper, and Naama Zwerdling. 2020. Do not have enough data? Deep learning to the rescue! In AAAI.

Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, et al. 2016. MS MARCO: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72.

Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, and Samy Bengio. 2016. Generating sentences from a continuous space. In CoNLL.

Kyunghyun Cho, Bart Van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. 2018. QuAC: Question answering in context. arXiv preprint arXiv:1808.07036.

Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V. Le. 2019. AutoAugment: Learning augmentation strategies from data. In CVPR.

Andrew M. Dai and Quoc V. Le. 2015. Semi-supervised sequence learning. In NIPS.

Dorottya Demszky, Kelvin Guu, and Percy Liang. 2018. Transforming question answering datasets into natural language inference datasets. arXiv preprint arXiv:1809.02922.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.

Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In NeurIPS.

Xinya Du, Junru Shao, and Claire Cardie. 2017. Learning to ask: Neural question generation for reading comprehension. arXiv preprint arXiv:1705.00106.

Marzieh Fadaee, Arianna Bisazza, and Christof Monz. 2017. Data augmentation for low-resource neural machine translation. In ACL.

Michael Heilman and Noah A. Smith. 2010. Good question! Statistical ranking for question generation. In HLT-NAACL.

Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551.

Michal Jungiewicz and Aleksander Smywinski-Pohl. 2019. Towards textual data augmentation for neural networks: synonyms and maximum loss. Computer Science, 20.

Diederik P. Kingma and Max Welling. 2013. Auto-encoding variational Bayes. In ICLR.

Tom Ko, Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur. 2015. Audio augmentation for speech recognition. In Sixteenth Annual Conference of the International Speech Communication Association.

Sosuke Kobayashi. 2018. Contextual augmentation: Data augmentation by words with paradigmatic relations. In NAACL.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In NIPS.

Ashutosh Kumar, Satwik Bhattamishra, Manik Bhandari, and Partha Talukdar. 2019. Submodular optimization-based diverse paraphrasing and its effectiveness in data augmentation. In NAACL.

Varun Kumar, Ashutosh Choudhary, and Eunah Cho. 2020. Data augmentation using pre-trained transformer models. arXiv preprint arXiv:2003.02245.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. 2019. Natural Questions: a benchmark for question answering research. TACL, 7:453–466.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out.

Dayiheng Liu, Jie Fu, Yidan Zhang, Chris Pal, and Jiancheng Lv. 2020. Revision in continuous space: Unsupervised text style transfer without adversarial learning. In ACL.

Dayiheng Liu, Yang Xue, Feng He, Yuanyuan Chen, and Jiancheng Lv. 2019a. Mu-forcing: Training variational recurrent autoencoders for text generation. ACM Transactions on Asian and Low-Resource Language Information Processing.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Vukosi Marivate and Tshephisho Sefara. 2019. Improving short text classification through global augmentation methods. arXiv preprint arXiv:1907.03752.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL.

Shrimai Prabhumoye, Yulia Tsvetkov, Ruslan Salakhutdinov, and Alan W. Black. 2018. Style transfer through back-translation. In ACL.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8).

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In ACL.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP.

Siva Reddy, Danqi Chen, and Christopher D. Manning. 2019. CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249–266.

Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic backpropagation and approximate inference in deep generative models. In ICML.

Amrita Saha, Rahul Aralikatte, Mitesh M. Khapra, and Karthik Sankaranarayanan. 2018. DuoRC: Towards complex language understanding with paraphrased reading comprehension. arXiv preprint arXiv:1804.07927.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving neural machine translation models with monolingual data. In ACL.

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked sequence to sequence pre-training for language generation. arXiv preprint arXiv:1905.02450.

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In CVPR.

Fabio Henrique Kiyoiti dos Santos Tanaka and Claus Aranha. 2019. Data augmentation using GANs. arXiv preprint arXiv:1904.09135.

Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2016. NewsQA: A machine comprehension dataset. arXiv preprint arXiv:1611.09830.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.

Ke Wang, Hang Hua, and Xiaojun Wan. 2019. Controllable unsupervised text attribute transfer via editing entangled latent representation. In NeurIPS.

Jason W. Wei and Kai Zou. 2019. EDA: Easy data augmentation techniques for boosting performance on text classification tasks. In ACL.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv.

Xing Wu, Shangwen Lv, Liangjun Zang, Jizhong Han, and Songlin Hu. 2019. Conditional BERT contextual augmentation. In International Conference on Computational Science.

Ziang Xie, Sida I. Wang, Jiwei Li, Daniel Levy, Aiming Nie, Dan Jurafsky, and Andrew Y. Ng. 2017. Data noising as smoothing in neural network language models. In ICLR.

Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, and Ming Zhou. 2020. ProphetNet: Predicting future n-gram for sequence-to-sequence pre-training. arXiv preprint arXiv:2001.04063.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600.

Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V. Le. 2018. QANet: Combining local convolution with global self-attention for reading comprehension. arXiv preprint arXiv:1804.09541.

Shiyue Zhang and Mohit Bansal. 2019. Addressing semantic drift in question generation for semi-supervised question answering. arXiv preprint arXiv:1909.06356.

Yao Zhao, Xiaochuan Ni, Yuanyuan Ding, and Qifa Ke. 2018. Paragraph-level neural question generation with maxout pointer and gated self-attention networks. In EMNLP.

Qingyu Zhou, Nan Yang, Furu Wei, Chuanqi Tan, Hangbo Bao, and Ming Zhou. 2017. Neural question generation from text: A preliminary study. In National CCF Conference on Natural Language Processing and Chinese Computing, pages 662–671.

Haichao Zhu, Li Dong, Furu Wei, Wenhui Wang, Bing Qin, and Ting Liu. 2019. Learning to ask unanswerable questions for machine reading comprehension. In ACL.

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pages 19–27.

A Augmented Dataset

Figure 3 and Figure 4 provide augmented data samples for each baseline on SQuAD 2.0. The EDA baseline tends to introduce noise that destroys the original sentence structure. The Text VAE, BackTranslation, and AE+Noised baselines often change important words of the original question, which can cause the augmented question to lose the original key information and no longer support inference of the original answer. In contrast, the answerable questions generated by CRQDA still maintain the key information needed to infer the original answer, and its generated unanswerable questions tend to introduce context-relevant words that convert an originally answerable question into an unanswerable one.


Title: Spectre_(2015_film)
Paragraph: Spectre (2015) is the twenty-fourth James Bond film produced by Eon Productions. It features Daniel Craig in his fourth performance as James Bond, and Christoph Waltz as Ernst Stavro Blofeld, with the film marking the character's re-introduction into the series. It was directed by Sam Mendes as his second James Bond film following Skyfall, and was written by John Logan, Neal Purvis, Robert Wade and Jez Butterworth. It is distributed by Metro-Goldwyn-Mayer and Columbia Pictures. With a budget around $245 million, it is the most expensive Bond film and one of the most expensive films ever made.

Original Question: Which company made Spectre?    Answer: Eon Productions
EDA (delete): Which company made?
EDA (add): Which company accompany made Spectre?
EDA (replacement): Which party made Spectre?
EDA (swap): Which made company Spectre?
Text VAE: Which company was excluded?
BackTranslation: Which company made spectrum?
AE+Noised: Who made company?
CRQDA (answerable): What company made Spectre?
CRQDA (unanswerable): Which company made Eon Productions?

Original Question: How many James Bond films has Eon Productions produced?    Answer: twenty-four
EDA (delete): How many Bond films has Productions produced?
EDA (add): How many James adherence Bond films moive has Eon Productions produced?
EDA (replacement): How many Bond films has Productions produced?
EDA (shuffle): How many jam Bond cinema has Eon Productions produced?
Text VAE: How many Best Picture inmates has been executed?
BackTranslation: How many films bond has produced products?
AE+Noised: How many James Eon Bond Films has produced?
CRQDA (answerable): How much James Bond films has been produced by Eon Productions?
CRQDA (unanswerable): How many Bond films has Etonv produced?

Figure 3: Augmented data samples on SQuAD 2.0.


Title: Space_Race
Paragraph: The Space Race was a 20th-century competition between two Cold War rivals, the Soviet Union (USSR) and the United States (US), for supremacy in spaceflight capability. It had its origins in the missile-based nuclear arms race between the two nations that occurred following World War II, enabled by captured German rocket technology and personnel. The technological superiority required for such supremacy was seen as necessary for national security, and symbolic of ideological superiority. The Space Race spawned pioneering efforts to launch artificial satellites, unmanned space probes of the Moon, Venus, and Mars, and human spaceflight in low Earth orbit and to the Moon. The competition began on August 2, 1955, when the Soviet Union responded to the US announcement four days earlier of intent to launch artificial satellites for the International Geophysical Year, by declaring they would also launch a satellite "in the near future". The Soviet Union beat the US to this, with the October 4, 1957 orbiting of Sputnik 1, and later beat the US to the first human in space, Yuri Gagarin, on April 12, 1961. The Space Race peaked with the July 20, 1969 US landing of the first humans on the Moon with Apollo 11. The USSR tried but failed manned lunar missions, and eventually cancelled them and concentrated on Earth orbital space stations. A period of détente followed with the April 1972 agreement on a co-operative Apollo–Soyuz Test Project, resulting in the July 1975 rendezvous in Earth orbit of a US astronaut crew with a Soviet cosmonaut crew.

Original Question: On what date did the Space Race begin?    Answer: August 2, 1955
EDA (delete): On what date did the Space Race?
EDA (add): On what date time did the Space Race begin?
EDA (replacement): On what date time did the room Race begin?
EDA (swap): On what date time did the Race Space begin?
Text VAE: On what date did the Red Death begin?
BackTranslation: On what date the Space Race begin?
AE+Noised: On what date the Space did begin?
CRQDA (answerable): When did the Space Race start?
CRQDA (unanswerable): On what date did the Space Russians begin?

Original Question: Who was the first person in space?    Answer: Yuri Gagarin
EDA (delete): Who was the person in space?
EDA (add): Who was the second first person in space?
EDA (replacement): Who was the start person in space?
EDA (shuffle): Who first was the in person space?
Text VAE: Who was the first person in room?
BackTranslation: Who was the first person in space?
AE+Noised: Who was the man in space?
CRQDA (answerable): Who was the first in space?
CRQDA (unanswerable): Who was the first Russians in space?

Figure 4: Augmented data samples on SQuAD 2.0.