11
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6729–6739 July 5 - 10, 2020. c 2020 Association for Computational Linguistics 6729 Low-Resource Generation of Multi-hop Reasoning Questions Jianxing Yu, Wei Liu, Shuang Qiu, Qinliang Su * , Kai Wang, Xiaojun Quan, Jian Yin School of Data and Computer Science, Sun Yat-sen University Guangdong Key Laboratory of Big Data Analysis and Processing, China {yujx26, liuw259, qiush9, suqliang}@mail.sysu.edu.cn {wangk73@mail2, quanxj3@mail, issjyin@mail}.sysu.edu.cn Abstract This paper focuses on generating multi-hop reasoning questions from the raw text in a low resource circumstance. Such questions have to be syntactically valid and need to logically cor- relate with the answers by deducing over mul- tiple relations on several sentences in the text. Specifically, we first build a multi-hop genera- tion model and guide it to satisfy the logical ra- tionality by the reasoning chain extracted from a given text. Since the labeled data is limit- ed and insufficient for training, we propose to learn the model with the help of a large scale of unlabeled data that is much easier to obtain. Such data contains rich expressive forms of the questions with structural patterns on syntax and semantics. These patterns can be estimat- ed by the neural hidden semi-Markov model using latent variables. With latent patterns as a prior, we can regularize the generation model and produce the optimal results. Experimental results on the HotpotQA data set demonstrate the effectiveness of our model. Moreover, we apply the generated results to the task of ma- chine reading comprehension and achieve sig- nificant performance improvements. 1 Introduction Question generation (QG) is a hot research topic that aims to create valid and fluent questions cor- responding to the answers by fully understanding the semantics on a given text. QG is widely used in many practical scenarios: including providing prac- tice exercises from course materials for education- al purposes (Lindberg et al., 2013), initiating the dialog system by asking questions (Mostafazadeh et al., 2017), and reducing the labor cost of creating large-scale labeled samples for the QA task (Duan et al., 2017). The mainstream QG methods can be summarized into the rule-based and neural-based models. The first method often transforms the input * Corresponding author. text into an intermediate symbolic representation, such as a parsing tree, and then convert the resulting form into a question by well-designed templates or general rules (Hussein et al., 2014). Since rules and templates are hand-crafted, the scalability and generalization of this method are limited. Respec- tively, the neural model usually directly maps the text to question based on neural network (Du and Cardie, 2017), which is entirely data-driven with far less labor. Such a model can be typically re- garded as learning a one-to-one mapping between the text and question. The mapping is mainly used to generate simple questions with a single sentence. However, due to the lack of fine-grained model- ing on the evidential relations on the text, such a method has minimal capability to form the multi- hop questions that require sophisticated reasoning skills. These questions have to be grammatically valid. Besides, they need to logically correlate with the answers by deducing over multiple entities and relations in several sentences and paragraphs of the given text. As shown in Fig.(1), the question asks the director of a film, where the film was shot at the Quality Cafe in Los Angeles and Todd Phillips di- rected it. These two relations can form a reasoning chain from question to answer by logically integrat- ing the pieces of evidence “Los Angeles,” “Quality Cafe,” and “Old School” as well as the pronoun “itdistributed across S 1 in paragraph 1 and S 1 , S 2 in paragraph 2. Without capturing such a chain, it is difficult to precisely produce the multi-hop ques- tion by using “Old School” as a bridging evidence and marginal entity “Todd Phillips” as the answer. For the task of multi-hop QG, a straightforward solution is to extract a reasoning chain from the in- put text. Under the guidance of the reasoning chain, we learn a neural QG model to make the result sat- isfy the logical correspondence with the answer. However, the neural model is data-hungry, and the scale of training data mostly limits its performance.

Low-Resource Generation of Multi-hop Reasoning Questions · 6730 Figure 1: Sample that requires reasoning skills. Each training example is a triple combined with the text, answer,

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Low-Resource Generation of Multi-hop Reasoning Questions · 6730 Figure 1: Sample that requires reasoning skills. Each training example is a triple combined with the text, answer,

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6729–6739July 5 - 10, 2020. c©2020 Association for Computational Linguistics

6729

Low-Resource Generation of Multi-hop Reasoning Questions

Jianxing Yu, Wei Liu, Shuang Qiu, Qinliang Su∗, Kai Wang, Xiaojun Quan, Jian YinSchool of Data and Computer Science, Sun Yat-sen University

Guangdong Key Laboratory of Big Data Analysis and Processing, China{yujx26, liuw259, qiush9, suqliang}@mail.sysu.edu.cn{wangk73@mail2, quanxj3@mail, issjyin@mail}.sysu.edu.cn

Abstract

This paper focuses on generating multi-hopreasoning questions from the raw text in a lowresource circumstance. Such questions have tobe syntactically valid and need to logically cor-relate with the answers by deducing over mul-tiple relations on several sentences in the text.Specifically, we first build a multi-hop genera-tion model and guide it to satisfy the logical ra-tionality by the reasoning chain extracted froma given text. Since the labeled data is limit-ed and insufficient for training, we propose tolearn the model with the help of a large scaleof unlabeled data that is much easier to obtain.Such data contains rich expressive forms ofthe questions with structural patterns on syntaxand semantics. These patterns can be estimat-ed by the neural hidden semi-Markov modelusing latent variables. With latent patterns as aprior, we can regularize the generation modeland produce the optimal results. Experimentalresults on the HotpotQA data set demonstratethe effectiveness of our model. Moreover, weapply the generated results to the task of ma-chine reading comprehension and achieve sig-nificant performance improvements.

1 Introduction

Question generation (QG) is a hot research topicthat aims to create valid and fluent questions cor-responding to the answers by fully understandingthe semantics on a given text. QG is widely used inmany practical scenarios: including providing prac-tice exercises from course materials for education-al purposes (Lindberg et al., 2013), initiating thedialog system by asking questions (Mostafazadehet al., 2017), and reducing the labor cost of creatinglarge-scale labeled samples for the QA task (Duanet al., 2017). The mainstream QG methods can besummarized into the rule-based and neural-basedmodels. The first method often transforms the input

∗Corresponding author.

text into an intermediate symbolic representation,such as a parsing tree, and then convert the resultingform into a question by well-designed templatesor general rules (Hussein et al., 2014). Since rulesand templates are hand-crafted, the scalability andgeneralization of this method are limited. Respec-tively, the neural model usually directly maps thetext to question based on neural network (Du andCardie, 2017), which is entirely data-driven withfar less labor. Such a model can be typically re-garded as learning a one-to-one mapping betweenthe text and question. The mapping is mainly usedto generate simple questions with a single sentence.However, due to the lack of fine-grained model-ing on the evidential relations on the text, such amethod has minimal capability to form the multi-hop questions that require sophisticated reasoningskills. These questions have to be grammaticallyvalid. Besides, they need to logically correlate withthe answers by deducing over multiple entities andrelations in several sentences and paragraphs of thegiven text. As shown in Fig.(1), the question asksthe director of a film, where the film was shot at theQuality Cafe in Los Angeles and Todd Phillips di-rected it. These two relations can form a reasoningchain from question to answer by logically integrat-ing the pieces of evidence “Los Angeles,” “QualityCafe,” and “Old School” as well as the pronoun “it”distributed across S1 in paragraph 1 and S1, S2 inparagraph 2. Without capturing such a chain, it isdifficult to precisely produce the multi-hop ques-tion by using “Old School” as a bridging evidenceand marginal entity “Todd Phillips” as the answer.

For the task of multi-hop QG, a straightforwardsolution is to extract a reasoning chain from the in-put text. Under the guidance of the reasoning chain,we learn a neural QG model to make the result sat-isfy the logical correspondence with the answer.However, the neural model is data-hungry, and thescale of training data mostly limits its performance.

Page 2: Low-Resource Generation of Multi-hop Reasoning Questions · 6730 Figure 1: Sample that requires reasoning skills. Each training example is a triple combined with the text, answer,

6730

Figure 1: Sample that requires reasoning skills.

Each training example is a triple combined withthe text, answer, and question. Since labeling sucha combination is labor-intensive, it is difficult toensure that we can always obtain sufficient train-ing data in real-world applications. We thus for-malize the problem as the low-resource generationof multi-hop questions, which is less explored byexisting work. This task has substantial researchvalue since reasoning is crucial in quantifying thehigh-level cognitive ability of machines, and lowresource is the key to promote the extensive appli-cation. In order to address the problem, we proposeto utilize unlabeled data, which is usually abundan-t and much easier to obtain. Although such datadoes not combine the questions with the texts andanswers, the unlabeled questions contain plentifulexpressive forms with structural patterns on thesyntax and semantics. These patterns can be seenas the “template” to produce the questions. Thus,we can use the patterns as the prior to regularize theQG model and obtain better results accordingly.

Motivated by the above observations, we pro-pose a practical two-stage approach to learn a multi-hop QG model from both a small-scale labeleddata and a large-size unlabeled corpus. In particu-lar, we first exploit the neural hidden semi-Markovmodel (Dai et al., 2016) to parameterize the so-phisticated structural patterns on the questions bylatent variables. Without domain knowledge, thevariables can be estimated by maximizing the like-lihood of the unlabeled data. We then heuristicallyextract a reasoning chain from the given text andbuild a holistic QG model to generate a multi-hopquestion. The evidential relations in the reasoningchain are leveraged to guide the QG model, so as tolet the generated result meet multi-hop logical cor-respondence with the answer. Simultaneously, wenaturally incorporate the prior patterns into the QG

model. In this way, we can regularize the modeland inform it to express a question reasonably. Thatcan improve the syntactic and semantic correctnessof the result. With the parameterized patterns, thewhole model can be learned from the labeled andunlabeled data in an end-to-end and explainablemanner. In order to better balance the supervisionof the labeled data and the usage of prior patterns,we propose to optimize the model by reinforcementlearning with an augmented evaluated loss. Experi-ments are conducted on the HotpotQA (Yang et al.,2018) data set, which contains a large number ofreasoning samples with manual annotation. Eval-uated results in terms of automatic metrics andhuman judgment show the effectiveness of our ap-proach. Moreover, we apply our generated resultsto the task of machine reading comprehension. Weview the results as pseudo-labeled samples to en-rich the training data for the task. That can alleviatethe labeled data shortage problem and boost the per-formance accordingly. Extensive experiments areperformed to show the efficacy of our approach inthis application with the help of low-resource QG.

The main contributions of this paper include,

• We dedicate to the topic of low-resource gen-eration of multi-hop questions from the text.

• We propose a practical approach to generatemulti-hop questions with a minimal amountof labeled data. The logical rationality of theresults is guided by the reasoning chain ex-tracted from the text. Besides, the results areregularized to ensure the correctness of syn-tax and semantics by using the prior patternsestimated from a large-size of unlabeled data.

• We show the potential of our approach in areal-world application on machine readingcomprehension by using the generated results.

The rest of this paper is organized as follows.Section 2 elaborates on the proposed low-resourceQG framework. Section 3 presents experimentalresults, while Section 4 shows the QG applicationand demonstrates its usefulness. Section 5 reviewsrelated works and Section 6 concludes this paper.

2 Approach

In this section, we first define some notations andthen present the details of the proposed QG frame-work, including the learning of prior patterns fromthe unlabeled data, and the multi-hop QG networkguided by the reasoning chain and prior patterns.

Page 3: Low-Resource Generation of Multi-hop Reasoning Questions · 6730 Figure 1: Sample that requires reasoning skills. Each training example is a triple combined with the text, answer,

6731

2.1 Notations and Problem Formulation

Let DL = {(Bi, Ai, Yi)}ni=1 denote a small set oflabeled data that consists of n examples on thetext B, answer A, and question Y . Besides, weassume that there are a large number of unlabeleddata DU = {Qj}Nj=1 available, where Qj ∈ DU

shares the same characteristics with Yi ∈ DL andN > n. Each text contains multiple paragraphs andsentences, involving several logically correlatedentities. We aim to generate the new question Y ′

and answer A′ given the evaluated text B′ by a QGmodel, where the answer A′ often is a salient entityin the text B′. The question Y ′ is produced by find-ing the best Y to maximize the conditional proba-bility in Y = argmaxY ′

∏Tt=1 p(yt|B′, A′, Y ′<t),

where Y ′<t represents the outputted 1th to (t−1)thterms, yt is the tth term. The question has to be syn-tactically and semantically correct. Also, it needsto correspond to the answer by logically deducingover multiple evidential entities and relations scat-tered across the text. Since the resource in DL maynot be enough to support accurately learning of thep(·), we transfer the linguistic knowledge in DU

and combine it with DL to enhance the training.

2.2 Learning Patterns from Unlabeled Data

The expressive pattern on the question can beviewed as a sequence of groups. Each group con-tains a set of term segments that are semanticallyand functionally similar. Such segmentation is notexplicitly given but can be inferred from the text’ssemantics. It is difficult to characterize this struc-tural pattern by simple hand-crafted rules, whilewe do not have extra labeled data to learn the pat-tern by the methods like Variational Auto-Encoder(VAE) (Kingma and Welling, 2014). In order totackle this problem, we propose to employ the neu-ral hidden semi-Markov model. The model param-eterizes the similar segments on the input questionsby probabilistic latent variables. Through unsuper-vised learning, these variables can be trained onthe unlabeled data. That can well represent theintricate structural patterns without domain knowl-edge. Besides, the variables can be incorporatedinto the generation model naturally, which makesthe results more interpretable and controllable.

Given a question Q with a sequence of terms{qt}Tt=1, we model its segmentation by two vari-ables, including a deterministic state variable zt ∈{1, · · · ,K} that indicates the segment to whichthe tth term belongs, and a length variable lt ∈

{1, · · · , L}, which specifies the length of the cur-rent segment. We assume the question is generatedbased on a joint distribution as Eq.(1) by multi-stepemissions, where i(·) is the index function; the in-dex on tth term is i(t) =

∑tj=1 lj , with i(0) = 0

and i(T ′) = T ; qi(t−1)+1:i(t) is the sequence ofterms (qi(t−1)+1, · · · , qi(t)). That is, we first pro-duce a segment based on the latent state zt, andthen emits term with a length of lt on that segment.

T ′−1∏t=0

p(zt+1, lt+1|zt, lt)T ′∏t=1

p(qi(t−1)+1:i(t)|zt, lt)

(1)p(zt+1, lt+1|zt, lt) is the transition distribution,

where the (t+ 1)th latent state and length are con-ditioned on their previous ones. Since the lengthmainly depends on the segment, we can further fac-torize the distribution as p(lt+1|zt+1)× p(zt+1|zt).p(lt+1|zt+1) is the length distribution, and we fixit to be uniform up to a maximum length L. Inthis way, the model can be encouraged to bringtogether the functionally similar emissions of dif-ferent lengths. p(zt+1|zt) is the state distribution,which can be viewed as a K × K matrix, whereeach row sums to 1. We define this matrix to beEq.(2), where eo, ej , ek ∈ Rd are the embeddingsof the state o, j, k respectively, and bo,j , bo,k are thescalar bias terms. Since the adjacent states play dif-ferent syntactic or semantic roles in the expressivepatterns, we set bo,o as negative infinity to disableself-transition. We apply a row-wise softmax to theresulting matrix to obtain the desired probabilities.

p(zt+1 = j|zt = o) =exp(eTj eo+bo,j)∑Kk=1 exp(e

Tkeo+bo,k)

(2)

p(qi(t−1)+1:i(t)|zt, lt) is the term emission distri-bution conditioned on a latent state and a length.Based on the Markov process, the distributioncan be written as a product over the probabili-ties of all the question terms, as Eq.(3). In or-der to compute the term probability, we lever-age a neural decoder like the Gated RecurrentUnit (GRU) (Cho et al., 2014). We first formu-late the hidden vector hjt for yielding jth term ashjt = GRU(hj−1t , [ezt ; eqi(t−1)+j−1

]), where [·; ·]is a concatenation operator, eqi(t−1)+j−1

and eztare the embedding of the term and correspond-ing segment, respectively. By attending over thegiven question using hjt , we can produce a con-text vector vjt , as gzt � hjt , where � refers to

Page 4: Low-Resource Generation of Multi-hop Reasoning Questions · 6730 Figure 1: Sample that requires reasoning skills. Each training example is a triple combined with the text, answer,

6732

the element-wise multiplication, gzt is a gate forthe latent state zt, and there are K gate vectorsas trainable parameters. We then pass the vectorvjt through a softmax layer to obtain the desireddistribution as p(qi(t−1)+j |qi(t−1)+j−1, zt, lt) =

softmax(Wqvjt + bq) , where Wq and bq are the

trainable parameters.

p(qi(t−1)+1:i(t)|zt, lt) = p(qi(t−1)+1|zt, lt)×∏ltj=2 p(qi(t−1)+j |qi(t−1)+j−1, zt, lt)

(3)

Considering that the latent variables are unob-served, we then learn the model by marginalizingover these variables to maximize the log marginal-likelihood of the observed question sequence Q,i.e., max(logp(Q)). p(Q) can be formulated asEq.(4) by the backward algorithm (Murphy, 2002),with the base cases βT (o) = 1,∀o ∈ {1, · · · ,K}.The quantities in Eq.(4) are obtained from a dynam-ic program, which is differentiable. Thus, we canestimate the model’s parameters from the unlabeleddata DU by back-propagation.

βt(o) = p(qt+1:T |zt = o)

=∑K

k=1 β∗t (k)p(zt+1 = k|zt = o)

β∗t (k) = p(qt+1:T |zt+1 = k)

=∑L

j=1 [βt+j(k)p(lt+1 = j|zt+1 = k)

p(qt+1:t+j |zt+1 = k, lt+1 = j)]

p(Q) =∑K

k=1 β∗0(k)p(z1 = k)

(4)

2.3 Multi-hop QG Net with RegularizationAfterward, we incorporate the learned patterns intothe generation model as the prior. Such prior canbe acted as a soft template to regularize the model.That can ensure the correctness of the results insyntax and semantics, especially when the labeleddata is insufficient to learn the correspondence be-tween the text and question. Fig.(2) illustrates thearchitecture of our model. We first estimate the pri-or pattern by sampling a sequence of latent statesz with the length l. We then extract the reasoningchain and other textual contents involved in askingand solving a specific question from the given text.Under the guidance of both the reasoning chainand the prior patterns, we build a multi-hop QGmodel on the extracted contents by the sequence-to-sequence framework (Bahdanau et al., 2015).The evidential relations in the chain are used toenhance the logical rationality of the results. Theprior pattern helps to facilitate the performance inlow-resource conditions by specifying the segmen-tation of the generated results.

Figure 2: Flow chart of the low-resource multi-hop QGnetwork.

2.3.1 Prior Patterns EstimationUsing the Viterbi algorithm (Zucchini et al., 2016),we can obtain the typed segmentation of a giv-en question. Such segmentation can be character-ized by a sequence of latent states z. Each seg-ment, like the phrase, is associated with a state,reflecting that the state frequently produces thatsegment. Based on the labeled data DL, we cancollect all sequences of latent states, which can beseen as a pool of prior patterns. We sample onefrom the pool uniformly. And then, we view it asa question template with S distinct segments, as{< zkt , l

kt >}Sk=1, where zt is a state variable for

the tth term, lt is the length variable derived bythe probability p(lt|zt), zkt and lkt are obtained bycollapsing adjacent zt and lk with the same val-ue. In order to easily incorporate into the gener-ation model, we encode the template as a vectorvmk = gzt � hmk , where hmk is the hidden vector forgenerating mth term, as GRU(hmk−1, [ezm ; eyt−1 ]),m satisfies i(m−1) < t ≤ i(m), k = t− i(m−1).

2.3.2 Question-Related Content ExtractionGiven a text, we use the method proposed by Yuet al. (2020) to extract the question-related con-tent. In order to make the paper self-contained, webriefly describe the approach in this section. It firstextracted the entities from the text, and view themas the potential answers and evidences. It thenlinks the entities to create a graph by three kind-

Page 5: Low-Resource Generation of Multi-hop Reasoning Questions · 6730 Figure 1: Sample that requires reasoning skills. Each training example is a triple combined with the text, answer,

6733

s of relations, including dependency, coreference,and synonym. Based on the graph, it heuristicallyextracts a sub-graph as the reasoning chain. Thetextual contents on the sub-graph are then gathered,including the answer, reasoning type, evidential en-tities, and sentences on the entities. The extractionis based on three question types, consisting of theSequence, Intersection, and Comparison. Thesetypes account for a large proportion of the multi-hop questions on most typical data sets, for exam-ple, 92% in HotpotQA data set (Min et al., 2019).

2.3.3 Question Generation with GuidanceWe then develop a multi-hop QG model based onthe extracted contents. This model is guided by thereasoning chain and prior pattern, so that the gener-ated results are not only logical but also fluent. Inthe pre-processing phase, we first mask the answerfrom the input contents by a special token<UNK>,to avoid the answer inclusion problem (Sun et al.,2018). That is, the answer words may appear in thequestion that would reduce the rationality.

Encoder: The reasoning chain is encoded via anN head graph transformer (Vaswani et al., 2017),so as to integrate all evidential relations fully. Eachnode is represented by contextualizing on its neigh-bors, as hgv = ev+ ‖Nn=1

∑j∈Nv a

n(ev, ej)Wnej ,

where ‖ denotes the concatenation, ev is the embed-ding of node’s entity, an(·, ·) is nth head attentionfunction, Nv is the set of neighbors. By aggrega-tion with N -head attention, we can get a relation-aware vector cg as Eq.(5), where Wn

g ,Wh,Wd aretrainable matrices, C is the set of nodes in the chain.

an(st, hgv) =

exp((Whhgv)

TWdst)∑k∈Nv exp((Whh

gk)

TWdst)

cg = st+ ‖Nn=1

∑v∈C a

n(st, hgv)Wn

ghgv

(5)

Other textual inputs are encoded in two steps:(1) each text term is embedded by looking up thepre-trained vectors, such as BERT (Devlin et al.,2019). (2) The resulting embeddings are fed in-to a bi-directional GRU to incorporate a sequen-tial context. In detail, the sentences are repre-sented by concatenating the final hidden states of

GRU, as [←−hb1;−→hbJ ], where jth term is hbj = [

←−hbj ;−→hbj ],←−

hbj = GRU(ebj ,←−−hbj+1),

−→hbj = GRU(ebj ,

−−→hbj−1); [·; ·]

denotes the concatenation of two vectors; ebj is theaugmented embedding of jth term; J is the sizeof all terms. Similarly, the answer and evidenceentities are integrally encoded as ha = [

←−ha1;−→haO].

Attention: For the textual inputs, we fully inte-grate the encodings and their correlations by atten-tion. First, we use self-attention (Wang et al., 2017)to grasp the long-term dependency in the sentences,as [hbj ]

Jj=1 = SelfAttn([hbj ]

Jj=1). Subsequently,

we exploit multi-perspective fusion (Song et al.,2018) to grasp the answer-related context in the sen-tences and strengthen their cross interactions. Thatis, [hb

′j ]Jj=1 = MulPerFuse([hbj ]

Jj=1, [h

ao]Oo=1).

By aggregating the significant information overall the terms, we can obtain a context vector ctas Eq.(6), where αtj is the normalized attentionweight, atj denotes the alignment score, st refersto the tth hidden state of the decoder, v, b,Ws,Wb

are trainable parameters.

atk = vTtanh(Wsst +Wbhb′k + b)

αtj = exp(atj)/∑J

k=1 exp(atk)

ct =∑J

j=1 αtjhb′j

(6)

Decoder: Based on the context vector ct, weexploit another GRU as the decoder. Each questionterm is yielded by the distribution in Eq.(7), whereρ is a 1-dim embedding of the reasoning type, Wo

and bo are trainable parameters. We use a copymechanism (Gu et al., 2016) to tackle unknownwords problem, where pcopy(·) denotes the copydistribution. In order to let the questions logicallycorrelate with answers, we guide the decoder bythe vector cg, which encodes the reasoning chain.Accordingly, we regularize the model to adaptivelyfit the prior pattern represented by the vector vmk .That can improve the generated quality when thelabeled data is insufficient.

pvoc(yt) = Softmax(Wo[st; ct; cg; ρ] + bo)

pcopy =∑J

j=1 αtj × 1{y == wj}pg = Sigmoid(ct, st, yt−1)st = GRU(st−1, v

mk )

p(yt) = pg · pvoc(yt) + (1− pg) · pcopy(yt)(7)

2.3.4 Learning with Limited Labeled DataA straightforward solution to train the above QGmodel is the supervised learning. It minimizes thecross-entropy loss at each generated term by refer-ring to the ground-truth in the labeled data DL, asLsl = − 1

n

∑i∈DL

∑Tit=1 log p(yit|Yi;<t, Ai, Bi).

However, since DL only contains a few samples,we would not have enough supervision from DL

to get the best results. While we leverage the un-labeled data DU to facilitate the training, it is dif-ficult to subtly balance the supervised signal from

Page 6: Low-Resource Generation of Multi-hop Reasoning Questions · 6730 Figure 1: Sample that requires reasoning skills. Each training example is a triple combined with the text, answer,

6734

DL and the prior pattern learned from DU . Inorder to address the problem, we resort to rein-forcement learning. It can globally measure theoverall quality of the results by minimizing the lossLrl = −EY s∼πθ [r(Y s)], where Y s is a sampledresult, Y ∗ is the ground-truth, θ is the parameter-s of the QG model, and π is the generation pol-icy of the model. r(·) is a function to evaluatethe generated quality. It is the weighted sum ofthree rewards, including (a) Fluency: we calculatethe negative perplexity (Zhang and Lapata, 2017)of Y s by a BERT-based language model pLM ,that is, −2−

1T

∑Tt=1 log2pLM (yt|Y s<t); (b) Answer-

ability: we use a metric QBLEU4(Ys, Y ∗) (Ne-

ma and Khapra, 2018) to measure the matchingdegree of Y s and Y ∗ by weighting on severalanswer-related factors, including question type,content words, function words, and named en-tities; (c) Semantics: we employ word mover-s distance (WMD) (Gong et al., 2019) to mea-sure the predicted result Y s, which has differentexpressive forms but same semantics with goldY ∗, as −WMD(Y s, Y ∗)/Length(Y ∗), whereLength(·) is the length function used as the nor-malization factor. By considering the metrics arenon-differentiable, we exploit the policy gradientmethod (Li et al., 2017) for optimization. In or-der to enhance readability, we train the model by amixed loss, as L = γLrl + (1− γ)Lsl, where γ isa trade-off factor.

3 Evaluations

We extensively evaluate the effectiveness of ourapproach, including the comparisons with state-of-the-art and the application on a task of MRC-QA.

3.1 Data and Experimental Settings

The evaluations were performed on three typicaldata sets, including HotpotQA (Yang et al., 2018),ComplexWebQuestions (Talmor and Berant, 2018),and DROP (Dua et al., 2019). These data setswere collected by crowd-sourcing, consisting of97k, 35k, and 97k examples, respectively. TheHotpotQA data set contained a large proportion oflabeled examples. Each comprised of the question,answer, and text with several sentences. Therefore,the HotpotQA data set was suitable to evaluatethe multi-hop QG task. The other two data setscontained abundant reasoning questions, but theyare not associated with the text and answer. Wethus viewed them as the unlabeled data. In order

to simulate the low-resource setting, we randomlysampled 10% of the HotpotQA train set to learnthe models, and evaluated them on the test set witha size of 7k. We verified the generated quality foreach evaluated method by comparing the match-ing degree between the result and gold-standard.We adopted three standard evaluation metrics inthe QG task, including BLEU-4 (Papineni et al.,2002), METEOR (Banerjee and Lavie, 2005), andROUGE-L (Lin, 2004). Furthermore, we carriedout human evaluations to analyze the generated re-sults. To avoid biases, we randomly sampled 100cases from the test set and generated questions foreach test case by all the evaluated methods. Wethen invited eight students to give the binary ratingon each question independently. The rating wasin terms of three metrics, including valid syntax,relevance to input textual sentences, and logicalrationality to the answer. We averaged the cumu-lative scores of the 100 binary judgments as theperformances corresponding to the evaluated meth-ods. The resultant scores were between 0∼100,where 0 is the worst, and 100 is the best. We usedRandolph’s free-marginal kappa (Randolph, 2005)to measure the agreements among the raters.

Model configurations were set as follows. Weleveraged 768-dimension pre-trained vectors fromthe uncased BERT to embed words. The number ofstates K and emissions L in the semi-Markov mod-el was set to 50, 4, respectively. The size of hiddenunits in both encoder and decoder was 300. Therecurrent weights were initialized by a uniform dis-tribution between−0.01 and 0.01 and updated withstochastic gradient descent. We used Adam (King-ma and Ba, 2015) as the optimizer with a learningrate of 10−3. The trade-off parameter γ was set to0.4. For pattern learning, we parsed every questionby the Stanford CoreNLP toolkit (Manning et al.,2014). We then learn better segmentation by forc-ing the model not to break syntactic elements likethe VP and NP. To reduce the bias, we carried outfive runs and reported the average performance.

3.2 Comparisons on QG State-of-the-Arts

We compared our approach against five typical andopen-source methods. These methods were basedon the sequence-to-sequence framework with atten-tion. According to the different techniques used,we summarized them as follows. (a) the basic mod-el with the copy mechanism, i.e., NQG++ (Zhouet al., 2017); (b) ASs2s (Kim et al., 2019), which

Page 7: Low-Resource Generation of Multi-hop Reasoning Questions · 6730 Figure 1: Sample that requires reasoning skills. Each training example is a triple combined with the text, answer,

6735

encoded the answer separately to form answer-focused questions; (c) CorefNQG (Du and Cardie,2018) that incorporated linguistic features to repre-sent the inputs better; (d) MaxPointer (Zhao et al.,2018) using gated self-attention to form questionsfor long text inputs; (e) MPQG+R (Song et al.,2018) that captured a broader context in the text toproduce the context-dependent results. In order tounderstand the effect of unlabeled data, we exam-ined two variants of the proposed model. That is,Ours-Pattn which was trained without unlabeleddata, and Ours-50% that used 50% unlabeled datafor training. Moreover, we performed empirical ab-lation studies to gain better insight into the relativecontributions of various components in our model,including Ours-Chain that discarded the guidanceof the reasoning chain vector and Ours-Reinf thatreplaced the reinforcement learning with a simplesupervised learning.

Table 1: Comparisons of our approach against base-lines. Statistically significant with t-test, p-value<0.01.

Methods BLEU-4 METEOR ROUGE-L

NQG++ 14.55 15.01 31.85ASs2s 16.89 17.04 34.92CorefNQG 16.16 16.53 34.30MaxPointer 17.08 17.34 35.38MPQG+R 14.90 15.46 32.39

Ours 19.07 19.16 39.41Ours-50% 18.33 18.36 37.85Ours-Pattn 17.10 17.35 35.40Ours-Chain 18.11 18.18 37.37Ours-Reinf 18.22 18.39 37.48

As reported in Tab.(1), our approach achievedthe best performance. We significantly outper-formed the best baseline (i.e., MaxPointer) byover 11.6%, 10.5%, 11.4% in terms of BLEU-4,METEOR, and ROUGE-L, respectively. From thecomparisons among Ours-Pattn, Ours-50%, andOurs, we found that the performance improveswith more unlabeled data. Although we lack anappropriate comparative model based on the unla-beled data, these results can still indicate the effec-tiveness of our model. With only limited labeleddata, our model can effectively leverage unlabeleddata to guide the generation. Besides, the ablationon all evaluated components led to a significantperformance drop. We may infer that the reasoningchain is crucial for multi-hop QG on the guidanceof logical correlations. Also, the reinforcementlearning can globally optimize the model by bal-ancing the prior patterns and labeled supervision.

3.3 Human Evaluations and Analysis

Tab.(2) illustrated the results of human evaluation.The average kappa were all above 0.6, which indi-cated substantial agreement among the raters. Con-sistent with quantitatively analyzed results in Sec-tion 3.2, our model significantly outperformed allbaselines in terms of three metrics, where the im-provement on the rationality metric was the largest.That showed the satisfied quality of our generatedresults, especially in terms of multi-hop ability.

Table 2: Human evaluations and kappa agreement. Ra-tion. is short for the rationality metric. Statisticallysignificant with t-test, p-value<0.01.

Methods Syntax Relevance Ration. Kappa

NQG++ 54.3 44.8 50.3 0.61ASs2s 61.3 50.8 55.8 0.62CorefNQG 59.0 49.4 54.8 0.64MaxPointer 61.8 51.8 56.5 0.63MPQG+R 55.5 47.5 51.3 0.64Ours 68.3 57.3 62.3 0.65

3.4 Evaluations on Value of Unlabeled Data

We investigated the value of unlabeled data for theoverall performance, especially when the labeleddata was inadequate. In particular, we randomlysampled {10%, 40%, 70%, 100%} of the labeleddata, and split the unlabeled data into ten subsets.For each scale on the labeled data, we incremental-ly added by one subset of unlabeled data to learnthe QG model. We used the same training protocoland reported the overall performance on the test set.As shown in Fig.(3), even a small amount of unla-beled data can play a decisive role in improving theperformance in terms of three metrics. The ratioof improvement was higher when the scale of thelabeled data was small. The results further verifiedthe usefulness of unlabeled data on learning theQG model with a low labeled resource.

3.5 Evaluations on the Mixed Loss Objective

In order to examine the gains of our training ap-proach with the mixed loss objective, we tuned thetrade-off parameter (i.e., γ) from [0, 1] with 0.1 asan interval. The performance change curve wasdisplayed in Fig.(4). The best performance wasobtained at γ = 0.4. The performance droppeddramatically when γ was close to 0 or 1. We wouldinfer that both objectives could help to measure thequality of the outputted results better, and thus trainthe model efficiently.

Page 8: Low-Resource Generation of Multi-hop Reasoning Questions · 6730 Figure 1: Sample that requires reasoning skills. Each training example is a triple combined with the text, answer,

6736

Figure 3: Evaluations on effectiveness of unlabeled data under different scales of labeled data.

Figure 4: Evaluations on mixed objective trade-off.

4 Application on the Task of MRC-QA

The task of machine reading comprehension (MRC-QA) aims to answer given questions by understand-ing the semantics of the text. The mainstream meth-ods are based on the neural network. These meth-ods often need a lot of labeled data for training,but the data is expensive to obtain. Thus, we areinspired to apply our generated results to enrichthe training set for the task of MRC-QA. Fig.(5)demonstrates the architecture of this application.Given a case from a small-size labeled set, we firstextracted the contents correlated to a specific ques-tion from the case’s text, including the reasoningchain, reasoning type, answer, evidential entities,and sentences on the entities. We then learned ourQG model based on the contents and generatedquestions as pseudo data to augment the labeledset. For each evaluated case, we could yield approx-imately 5∼8 pseudo samples consisted of the text,question, and answer. Later, we trained an MRC-QA model on the augmented labeled set and report-ed the performance on the test set. By referringto the leaderboard on the HotpotQA website, we

Figure 5: Apply multi-hop QG to support MRC-QA.

chose an open-source model for MRC-QA, namedDynamically Fused Graph Network (DFGN) (Qiuet al., 2019), which achieved the state-of-the-art atthe paper submission time. Considering that thesize of the training set impacted the model’s per-formance, we ran the entire pipeline with differentproportions of the labeled data, so as to verify theproposed model thoroughly. Two evaluation met-rics were employed, including exact match (EM)and F1. We examined the tasks of answer spanextraction, supporting sentence prediction, and thejoint task in the distractor setting.

Table 3: Comparison of our QG+QA model against theQA model under different proportions of labeled data.

Labeled Answer span Support pred. Joint

Data# EM F1 EM F1 EM F1

QG + QA(i.e., the DFGN model)

10% .501 .633 .469 .764 .277 .50920% .551 .672 .500 .801 .317 .57430% .567 .697 .520 .815 .339 .60040% .569 .704 .521 .829 .340 .61550% .571 .713 .531 .833 .344 .61960% .586 .717 .533 .834 .346 .62470% .593 .729 .535 .839 .353 .62680% .606 .731 .540 .845 .356 .63090% .610 .741 .550 .853 .359 .632

100% .614 .746 .558 .858 .360 .635

QA(i.e., the DFGN model)

100% .563 .697 .515 .816 .336 .598

Page 9: Low-Resource Generation of Multi-hop Reasoning Questions · 6730 Figure 1: Sample that requires reasoning skills. Each training example is a triple combined with the text, answer,

6737

Tab.(3) showed that our QG+QA model trainedon 30% labeled data obtained competitive perfor-mance against the QA model learned on the 100%labeled data. When using more labeled data, theperformance advantages of our QG+QA model con-tinued to grow. Such results showed that our QGmodel could enlarge the coverage and diversity ofthe MRC-QA training set given limited labeled da-ta. That could help to learn the state-of-the-art.Moreover, we conducted case studies to under-stand the generating behavior vividly. As exhibitedin Tab.(4), our QG model could generate massivequestions on multi-hop reasoning. Contrastively,the gold standard often contained one sample sinceit was labor-intensive to enumerate all the cases.

Table 4: Case studies on our multi-hop QG model.

Passage: ... (S1) ’The Hard Easy’ is the episode writtenby Thomas Herpich. (S2) He was born in October, 1979in Torrington, Connecticut, American, along with his twinbrother Peter who was a painter and artist. (S3) Thomasis best known for being a storyboard artist on the animatedtelevision series ’Adventure Time’. ...

Results of Ours MethodQuestion: When was the birth time for the writer of theepisode ’The Hard Easy’?Answer: October, 1979Question: Where is the birthplace for the writer of theepisode ’The Hard Easy’?Answer: Torrington, ConnecticutQuestion: What nationality was the writer of the episode

’The Hard Easy’?Answer: AmericanQuestion: Who is the twin brother for the writer of theepisode ’The Hard Easy’?Answer: PeterQuestion: What is the occupation for the twin brother ofthe episode writer of ’The Hard Easy’?Answer: painter and artist

Gold StandardQuestion: Who is the brother for the writer of the episode

’The Hard Easy’?Answer: Peter

5 Related Works

Existing models for the QG task include rule-basedand neural-based methods. Since the rules are hand-crafted, the first method is of low scalability (Chal-i and Hasan, 2015). The researcher turns to theneural model. It can directly map the inputs intoquestions by using an attention-based sequence-to-sequence framework, which is entirely data-drivenwith far less labor. Various techniques have beenapplied to this framework, including answer sepa-rately encoding, using linguistic features, capturingborder context, reinforcement learning, and em-

phasizing on question-worthy contents (Pan et al.,2019). These methods are mainly used to gener-ate simple questions with a single sentence (Yuet al., 2019). They are challenging to generate thereasoning questions accurately due to the lack offine-grained modeling on the evidential relationsin the text. In order to address the problem, Yuet al. (2020) proposed to incorporate a reasoningchain into the sequential framework, so as to guidethe generation finely. All the methods are built ofthe assumption that sufficient labeled data is avail-able. However, labeled data is quite scarce in manyreal-world applications (Yang et al., 2019). Thelow-resource problem has been studied in the taskssuch as machine translation (Gu et al., 2018), postagging (Kann et al., 2018), word embedding (Jianget al., 2018), text generation (Wiseman et al., 2018),and dialogue systems (Mi et al., 2019). To the bestof our knowledge, the low-resource multi-hop QGis untouched by existing work. We thus focus onthis topic and propose a method to fulfill the gap.

6 Conclusions and Future Works

We have proposed an approach to generate the ques-tions required multi-hop reasoning in low-resourceconditions. We first built a multi-hop QG modeland guided it to satisfy the logical rationality by thereasoning chain extracted from a given text. In or-der to tackle the labeled data shortage problem, welearned the structural patterns from the unlabeleddata by the hidden semi-Markov model. With thepatterns as a prior, we transferred this fundamentalknowledge into the generation model to producethe optimal results. Experimental results on theHotpotQA data set demonstrated the effectivenessof our approach. Moreover, we explored the gener-ated results to facilitate the real-world applicationof machine reading comprehension. We will inves-tigate the robustness and scalability of the model.

Acknowledgments

This work is supported by the Key-Area Re-search and Development Program of GuangdongProvince (2018B010107005, 2019B010120001),National Natural Science Foundation of China(61906217,61806223,61902439,U1711262,U1611264,U1711261,U1811261,U1811264,U1911203),National Key R&D Program of China (2018YFB1004404), Guangdong Basic & Applied ResearchFoundation (2019B1515130001), Fundamental Research Funds for Central Universities (19lgpy219).

Page 10: Low-Resource Generation of Multi-hop Reasoning Questions · 6730 Figure 1: Sample that requires reasoning skills. Each training example is a triple combined with the text, answer,

6738

ReferencesDzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben-

gio. 2015. Neural machine translation by jointlylearning to align and translate. In Proceedings ofICLR, pages 124–131, San Diego, CA, USA.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR:An automatic metric for MT evaluation with im-proved correlation with human judgments. In Pro-ceedings of the ACL Workshop on Intrinsic and Ex-trinsic Evaluation Measures for Machine Transla-tion and/or Summarization, pages 65–72, Ann Ar-bor, Michigan.

Yllias Chali and Sadid A. Hasan. 2015. Towards topic-to-question generation. Computational Linguistics,41(1):1–20.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gul-cehre, Dzmitry Bahdanau, Fethi Bougares, HolgerSchwenk, and Yoshua Bengio. 2014. Learningphrase representations using rnn encoder-decoderfor statistical machine translation. In Proceedings ofthe 2014 EMNLP, pages 1724–1734, Doha, Qatar.

Hanjun Dai, Bo Dai, Yan-Ming Zhang, Shuang Li, andLe Song. 2016. Recurrent hidden semi-markov mod-el. In Proceedings of the ICLR, pages 14–23, SanJuan, Puerto Rico.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, andKristina Toutanova. 2019. BERT: Pre-training ofdeep bidirectional transformers for language under-standing. In Proceedings of the 2019 NAACL, pages4171–4186, Minneapolis, Minnesota.

Xinya Du and Claire Cardie. 2017. Identifying whereto focus in reading comprehension for neural ques-tion generation. In Proceedings of the 2017 EMNLP,pages 2067–2073, Copenhagen, Denmark.

Xinya Du and Claire Cardie. 2018. Harvest-ing paragraph-level question-answer pairs fromWikipedia. In Proceedings of the 56th ACL, pages1907–1917, Melbourne, Australia.

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, GabrielStanovsky, Sameer Singh, and Matt Gardner. 2019.DROP: A reading comprehension benchmark requir-ing discrete reasoning over paragraphs. In Proceed-ings of the 2019 NAACL, pages 2368–2378, Min-neapolis, Minnesota.

Nan Duan, Duyu Tang, Peng Chen, and Ming Zhou.2017. Question generation for question answering.In Proceedings of the 2017 EMNLP, pages 866–874,Copenhagen, Denmark.

Hongyu Gong, Suma Bhat, Lingfei Wu, JinJun Xiong,and Wen-mei Hwu. 2019. Reinforcement learningbased text style transfer without parallel training cor-pus. In Proceedings of the 2019 NAACL, pages3168–3180, Minneapolis, Minnesota.

Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K.Li. 2016. Incorporating copying mechanism insequence-to-sequence learning. In Proceedings ofthe 54th ACL, pages 1631–1640, Berlin, Germany.

Jiatao Gu, Yong Wang, Yun Chen, Victor O. K. Li,and Kyunghyun Cho. 2018. Meta-learning for low-resource neural machine translation. In Proceedingsof the 2018 EMNLP, pages 3622–3631, Brussels,Belgium.

Hafedh Hussein, Mohammed Elmogy, and ShawkatGuirguis. 2014. Automatic english question gener-ation system based on template driven scheme. InInternational Journal of Computer Science Issues,pages 45–53.

Chao Jiang, Hsiang-Fu Yu, Cho-Jui Hsieh, and Kai-Wei Chang. 2018. Learning word embeddings forlow-resource languages by PU learning. In Proceed-ings of the 2018 NAACL, pages 1024–1034, New Or-leans, Louisiana.

Katharina Kann, Johannes Bjerva, Isabelle Augen-stein, Barbara Plank, and Anders Søgaard. 2018.Character-level supervision for low-resource POStagging. In Proceedings of the Workshop on DeepLearning Approaches for Low-Resource NLP, pages1–11, Melbourne.

Yanghoon Kim, Hwanhee Lee, Joongbo Shin, and Ky-omin Jung. 2019. Improving neural question gen-eration using answer separation. In Proceedings ofthe Thirty-Third AAAI, pages 6602–6609, New YorkCity, NY, USA.

Diederik Kingma and Jimmy Ba. 2015. Adam: Amethod for stochastic optimization. In Proceedingsof ICLR, pages 324–331, San Diego, CA, USA.

Diederik P Kingma and Max Welling. 2014. Auto-encoding variational bayes. In Proceedings of theICLR, pages 224–231, Banff, AB, Canada.

Jiwei Li, Will Monroe, Tianlin Shi, Sebastien Jean,Alan Ritter, and Dan Jurafsky. 2017. Adversariallearning for neural dialogue generation. In Proceed-ings of the 2017 EMNLP, pages 2157–2169, Copen-hagen, Denmark.

Chin-Yew Lin. 2004. ROUGE: A package for automat-ic evaluation of summaries. In Text SummarizationBranches Out, pages 74–81, Barcelona, Spain.

David Lindberg, Fred Popowich, John Nesbit, and PhilWinne. 2013. Generating natural language question-s to support learning on-line. In Proceedings of the14th European Workshop on Natural Language Gen-eration, pages 105–114, Sofia, Bulgaria.

Christoper Manning, Mihai Surdeanu, John Bauer, Jen-ny Finkel, Steven Bethard, and David McClosky.2014. The stanford corenlp natural language pro-cessing toolkit. In Proceedings of the 52nd ACL,pages 55–60, Baltimore, Maryland.

Page 11: Low-Resource Generation of Multi-hop Reasoning Questions · 6730 Figure 1: Sample that requires reasoning skills. Each training example is a triple combined with the text, answer,

6739

Fei Mi, Minlie Huang, Jiyong Zhang, and Boi Faltings.2019. Meta-learning for low-resource natural lan-guage generation in task-oriented dialogue system-s. In Proceedings of the IJCAI, pages 3151–3157,Macao, China.

Sewon Min, Victor Zhong, Luke Zettlemoyer, and Han-naneh Hajishirzi. 2019. Multi-hop reading compre-hension through question decomposition and rescor-ing. In Proceedings of the 57th ACL, pages 6097–6109, Florence, Italy.

Nasrin Mostafazadeh, Chris Brockett, Bill Dolan,Michel Galley, Jianfeng Gao, Georgios Spithourakis,and Lucy Vanderwende. 2017. Image-grounded con-versations: Multimodal context for natural questionand response generation. In Proceedings of theEighth IJCNLP, pages 462–472, Taipei, Taiwan.

Kevin Murphy. 2002. Hidden semi-markov models(hsmms). In unpublished notes.

Preksha Nema and Mitesh M. Khapra. 2018. Toward-s a better metric for evaluating question generationsystems. In Proceedings of the 2018 EMNLP, pages3950–3959, Brussels, Belgium.

Liangming Pan, Wenqiang Lei, Tat-Seng Chua, andMin-Yen Kan. 2019. Recent advances in neu-ral question generation. In arXiv preprint arX-iv:1905.08949.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic eval-uation of machine translation. In Proceedings of the40th ACL, pages 311–318, Philadelphia, Pennsylva-nia, USA.

Lin Qiu, Yunxuan Xiao, Yanru Qu, Hao Zhou, Lei Li,Weinan Zhang, and Yong Yu. 2019. Dynamicallyfused graph network for multi-hop reasoning. InProceedings of the 57th ACL, pages 6140–6150, Flo-rence, Italy.

Justus J Randolph. 2005. Free-marginal multirater kap-pa (multirater kfree): An alternative to fleiss’ fixed-marginal multirater kappa.

Linfeng Song, Zhiguo Wang, Wael Hamza, Yue Zhang,and Daniel Gildea. 2018. Leveraging context infor-mation for natural question generation. In Proceed-ings of the 2018 NAACL, pages 569–574, New Or-leans, Louisiana.

Xingwu Sun, Jing Liu, Yajuan Lyu, Wei He, YanjunMa, and Shi Wang. 2018. Answer-focused andposition-aware neural question generation. In Pro-ceedings of the 2018 EMNLP, pages 3930–3939,Brussels, Belgium.

Alon Talmor and Jonathan Berant. 2018. The web asa knowledge-base for answering complex questions.In Proceedings of the 2018 NAACL, pages 641–651,New Orleans, Louisiana.

Ashish Vaswani, Noam Shazeer, Niki Parmar, JakobUszkoreit, Llion Jones, Aidan N. Gomez, LukaszKaiser, and Illia Polosukhin. 2017. Attention is al-l you need. In Proceedings of the 31st NIPS, pages6000–6010, Vancouver, BC, Canada.

Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang,and Ming Zhou. 2017. Gated self-matching net-works for reading comprehension and question an-swering. In Proceedings of the 55th ACL, pages189–198, Vancouver, Canada.

Sam Wiseman, Stuart Shieber, and Alexander Rush.2018. Learning neural templates for text generation.In Proceedings of the 2018 EMNLP, pages 3174–3187, Brussels, Belgium.

Ze Yang, Wei Wu, Jian Yang, Can Xu, and ZhoujunLi. 2019. Low-resource response generation withtemplate prior. In Proceedings of the 2019 EMNLP),pages 1886–1897, Hong Kong, China.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio,William Cohen, Ruslan Salakhutdinov, and Christo-pher D. Manning. 2018. HotpotQA: A dataset for di-verse, explainable multi-hop question answering. InProceedings of the 2018 EMNLP, pages 2369–2380,Brussels, Belgium.

Jianxing Yu, Quan Xiaojun, Su Qinliang, and Yin Jian.2020. Generating multi-hop reasoning questions toimprove machine reading comprehension. In Pro-ceedings of the WWW, pages 550–561, Taipei, Tai-wan.

Jianxing Yu, Zheng-Jun Zha, and Tat-Seng Chua. 2012.Answering opinion questions on products by exploit-ing hierarchical organization of consumer reviews.In Proceedings of the EMNLP, pages 391–401, JejuIsland, Korea.

Jianxing Yu, Zhengjun Zha, and Jian Yin. 2019. Infer-ential machine comprehension: Answering question-s by recursively deducing the evidence chain fromtext. In Proceedings of the 57th ACL, pages 2241–2251, Florence, Italy.

Xingxing Zhang and Mirella Lapata. 2017. Sentencesimplification with deep reinforcement learning. InProceedings of the 2017 EMNLP, pages 584–594,Copenhagen, Denmark.

Yao Zhao, Xiaochuan Ni, Yuanyuan Ding, and QifaKe. 2018. Paragraph-level neural question genera-tion with maxout pointer and gated self-attention net-works. In Proceedings of the 2018 EMNLP, pages3901–3910, Brussels, Belgium.

Qingyu Zhou, Nan Yang, Furu Wei, Chuanqi Tan,Hangbo Bao, and Ming Zhou. 2017. Neural ques-tion generation from text: A preliminary study.In Proceedings of NLPCC, pages 662–671, Dalian,China.

Walter Zucchini, Iain L MacDonald, and Roland Lan-grock. 2016. Hidden markov models for time series:an introduction using r. In Chapman and Hall/CRC.