Unsupervised Injection of Knowledge into Dialogue Generation via Language Models

Yi-Lin Tuan¹, Wei Wei², William Yang Wang¹

¹University of California, Santa Barbara, USA
²Google Research, Mountain View, USA

{ytuan, william}@cs.ucsb.edu, [email protected]

Abstract

Neural conversation models have shown the power to produce more meaningful and engaging responses given external knowledge. Specifically, the knowledge we experiment on is in textual form, for example, a personality description. Despite the success of training and testing with external knowledge, in reality we do not always have sufficient background knowledge about the discussed topic. Therefore, it is also crucial to have the models generate captivating responses without external knowledge. To achieve this, we propose a unified training method, Decoupling, which induces a knowledge-related sentence and couples it with the dialogue history to generate a response in an unsupervised fashion. Its effect is further analyzed by testing the models with no knowledge, partial knowledge, and the full text of the knowledge. Empirically, we observe that the variance of the performance given different amounts of knowledge is significant. Also, our method performs more closely to the supervised method (the upper bound) than the baselines.

1 Introduction

Chit-chat dialogue generation is often seen as an end application of natural language processing. It aggregates a proper understanding of the dialogue history and the ability to generate engaging responses (Vinyals and Le, 2015). So far, existing chit-chat generation models suffer from some well-known problems: one of them is that models tend to produce general responses without useful information, e.g., "I don't know." (Li et al., 2016a; Ghazvininejad et al., 2018).

Recent work has collected data with external knowledge to make responses more vivid and informative, such as Persona-Chat (Zhang et al., 2018; Dinan et al., 2020), LIGHT (Urbanek et al., 2019), and Wizard of Wikipedia (Dinan et al., 2018).

[Figure 1: A real example of the introduced knowledge gap. The responses are generated by a model tested with or without external knowledge (a specified persona here). The model is trained with a paired personality for each conversation. In the figure, the human says "Hey, are you a student? I traveled a lot, I even studied abroad"; with no specified persona, the model replies "I'm studying abroad but I did not go to university.", whereas given the specified persona "I work as a nurses aide in a nursing home." it replies "Hi, I am a nurse, I love it."]

Take Persona-Chat as an example: in addition to the dialogue history, the responses condition on the profile of the speaker, e.g., "my favorite sport is ultimate frisbee". They demonstrated that with the specified information, the models respond more perceptively.

However, in reality, we do not always have sufficient prior knowledge of the discussed topic. This leads to some defects in both the training and the inference stages. To cope with the insufficient external knowledge problem in the training stage, Zhao et al. (2020) applied a copy mechanism module to disentangle the knowledge reasoning part, and then expanded the training to unlabeled data. Nonetheless, as far as we have observed, the impact of insufficient knowledge in the inference stage has not been surveyed.

In this paper, we define and analyze the knowledge gap on various conversational data with external knowledge. The knowledge gap, as illustrated in Figure 1, is the variance of the performance of the models with varying amounts of external knowledge. This measurement can quantify the significance of using external knowledge for each model in both the training and the testing phases, thus suggesting a way to explore the causes of each model's behavior by giving it insufficient knowledge. Meanwhile, the knowledge gap reveals that when tackling the low-resource problem, testing the models with full external knowledge may be misleading.

Other than evaluation by the knowledge gap, we propose Decoupling, a unified method that does not require paired external knowledge for training. Decoupling first de-couples some knowledge-related information from the input. Then, the extracted information is formulated as knowledge. Specifically, we take a pretrained LM as the knowledge base (Petroni et al., 2019) and maximize the probability of the extracted information in the knowledge base. Finally, the method re-couples the information back to the input. Empirically, Decoupling shows convincing results across all data. Moreover, we provide some analyses of the significantly different trends of the datasets when sweeping through the amount of external knowledge.

The main contributions are three-fold.

• We leverage the LM as a knowledge base for neural response generation, and propose a unified framework, Decoupling, to inject knowledge into a single conversation model.

• We conduct a comprehensive evaluation on the knowledge gap problem, and provide a systematic explanation.

• We demonstrate that Decoupling can outperform other unsupervised methods and perform closely to the upper bound.

2 Related Work

Recent work on open-domain dialogue generation has grown around personalized (Li et al., 2016b), knowledge-grounded (Ghazvininejad et al., 2018), and diversity-promoting (Li et al., 2016a) tasks. These directions are beneficial for informing a dialogue agent. While they share a similar target, their settings differ: the first two need extra annotations, while the last one does not. However, the last one has limited overall performance due to the absence of extra information. Below we discuss research on the above three tasks and other work related to our approach.

As inconsistent personality causes a big problem in multi-turn dialogue generation, Li et al. (2016b) proposed to assist generation with a persona label during decoding. The persona labels used were from the key-value attributes of TV show characters, which were useful but limited. After that, Zhang et al. (2018); Dinan et al. (2020) collected a dataset with paired personality descriptions. This enhanced the abundance of persona and demonstrated the advantage of personality for increasing information in responses. In addition to personalized dialogues, Ghazvininejad et al. (2018) presented that knowledge-grounded ones can also boost the contained information. Dinan et al. (2018) collected a wizard-apprentice conversation dataset grounded on Wikipedia articles. Due to the effort of collecting large enough knowledge-annotated dialogue datasets, Zhao et al. (2020) proposed to use a copy mechanism module to disentangle the knowledge reasoning part. Instead of the unstructured knowledge (text) used in the above work, Zhou et al. (2018); Moon et al. (2019); Tuan et al. (2019) utilized structured knowledge graphs, which can also be formulated as text using templates to be fine-tuned on the LM in this work.

As for diversity-promoting dialogue generation, some methods are adopted in the inference stage, such as maximizing the mutual information to minimize the probability of general responses (Li et al., 2016a); some methods are applied in the training stage, such as using latent variables. The usage of latent variables in dialogues is mainly divided into two topics: (a) assuming the prior in Generative Adversarial Networks (GAN) (Li et al., 2017a; Tuan and Lee, 2019) and (b) approximating the posterior in Variational Autoencoders (VAE). These training approaches are claimed to reduce the problem of general responses. GAN is claimed to be beneficial for replacing the forward Kullback-Leibler divergence (KLD) with the Jensen-Shannon divergence (Goodfellow et al., 2014), which considers both forward and reverse KLD. In the meantime, VAE is claimed to supplement diversity by adding the evidence lower bound (ELBO) to the original loss. Zhao et al. (2017) formulated a conditional variational autoencoder augmented with dialogue-act labels for dialogue generation. Gao et al. (2019) modified it by replacing the continuous latent variable with a discrete distribution over words and leveraging the similarity of word embeddings. Regularizing the intermediate generation can also be seen in other tasks; for example, Hu et al. (2017) tried to disentangle a latent code to make it controllable. In this work, we use the latent variable differently.


[Figure 2: The framework of Decoupling training for a transformer. Given the input message "Hi, how is your day?", a GPT-1 model first generates a knowledge sentence ("I love music.") following a special token <persona>; this knowledge generator is regularized toward a pretrained LM with a KL-divergence loss (Loss 2). The knowledge is then coupled with the dialogue history to generate the response "Great, I just played the guitar." under a negative log-likelihood plus classification loss (Loss 1). Green parts are related to personality/knowledge. Red parts work as the encoder in Seq2Seq models to process the input message. Yellow parts work as the decoder in Seq2Seq to infer the response. The orange cell classifies whether the response is authentically paired with the input message, as in (Wolf et al., 2019).]

We suppose the prior of one latent code is the estimated distribution of knowledge. Moreover, compared to the above tasks, the training in our task is given no paired external information but only a collection of it.

3 Method

This work aims to tackle two-agent dialogues without paired textual external knowledge but with a collection of the supposedly useful knowledge.

A two-agent dialogue is a sequence of utterances swapping between two speakers, {x_1, x_2, ..., x_T}, where each x_{2t-1} is an utterance from agent A and each x_{2t} is an utterance from agent B. The collection of textual knowledge is denoted as the set Z. For datasets using personality as the external knowledge, each z ∈ Z is a sentence that describes an arbitrary agent, such as "I have two dogs." (more variations are in Section 4.1). Given any dialogue history x_{<t} before the t-th utterance, we hope to find a possible personality description z ∈ Z that is able to generate a response y as close as possible to x_t. In the following, we replace the notation x_{<t} with x and x_t with y for simplicity. Meanwhile, the real paired personality description is denoted as z̄. Note that z̄ is used only for description; we do not actually have it in the data.

As depicted in Figure 2, the idea of our proposed approach, Decoupling, is to first find a possible subset Zx ⊂ Z conditioned on the dialogue history x, and then generate y by conditioning on the joint distribution of z ∼ Zx and x.

Algorithm 1 Decoupling
1:  pre-train an LM (PZ) on Z and fix it
2:  initialize σ and φ
3:  given hyperparameters α, β, γ
4:  for number of training iterations do
5:      sample (x, y) from real data
6:      sample z ∼ Pσ(z|x)
7:      compute ∇φ = ∇φ log Pφ(y|x, z)
8:      compute ∇σ = β Pφ(y|x, z) ∇σ log Pσ(z|x) + γ DKL(Pσ(z|x) || PZ(z))
9:      update φ, σ by α∇φ + ∇σ
10: end for

The formulation can be written as

\[
z \sim P_\sigma(z \mid x), \qquad y \sim P_\phi(y \mid x, z)
\tag{1}
\]

where σ and φ are respectively the parameters of the knowledge generation model and the response generation model.

The core of Decoupling is to minimize the negative log-likelihood of y while regularizing the generated z with a pretrained language model (LM). The loss function is as follows:

\[
\mathcal{L}_{\phi,\sigma} = -\log P_\phi(y \mid x, z) + D_{\mathrm{KL}}\big(P_\sigma(z \mid x) \,\Vert\, P_Z(z)\big)
\tag{2}
\]

where PZ denotes the LM pretrained upon the collection of personalities Z, and z is sampled from Pσ(z|x) (as in Equation 1). The derivation of Equation 2 is left to Subsection 3.3. The procedure is summarized in Algorithm 1. Note that ∇σ needs further simplification, as in Subsection 3.2. Further, in our experiments, the weights α and β are normally set to 1, and the weight γ ∈ [0.1, 1] has been tested to work well.

3.1 Motivation

Decoupling is motivated by the concept of causality. We take the personality z as an unobserved confounding variable that, besides x, causes the response y (Hlavackova-Schindler et al., 2007). This is plausible because the knowledge meets the conditions specified for causal modeling (Selltiz et al., 1976): (1) there is a time ordering between z and y, (2) there exists covariation between z and y, and (3) the covariation between the two variables z and y still exists when other confounding variables (x here) are removed. As in the concept of causality, the premise that p(y|x, z) > p(y|x) is also necessary for claiming that variable z is a cause of y. This holds because in the used data the existence of z̄ ∈ Z has been proved to be beneficial, and z ∈ Zx ⊂ Z. Then, because we can only access x at inference, we model z conditioned on x, aiming to extract the knowledge-related information from x, which might be ignored in conventional maximum likelihood estimation (Vinyals and Le, 2015). Note that since z is derived from x, we assume that the real one, z̄, must have partial information available in x, or the derived z will be meaningless. Finally, we minimize the mutual information between x and z, that is, we maximize the likelihood of z under PZ. This ensures z contains as much extra information beyond x as possible. The argument induces the Kullback-Leibler divergence (KLD) (Kullback and Leibler, 1951) regularization term.

An intuitive way to interpret the formulation is that the probability distribution Pσ(z|x) captures the least descriptive part of x that still relates to the personality domain Z to some extent. Because Pσ(z|x) re-couples z to the input, Pφ(y|x, z) is enforced to utilize information in the personality domain. Thus, the pretrained LM not only injects personality into Pσ(z|x) but also enables Pφ(y|x, z) to do reasoning even when trained without paired personality.

3.2 The Gradients

Equation 2 is an ideal loss function, but it is one step short of being tractable. Because φ is only related to the first term in Equation 2, its gradient is the same as in maximum likelihood estimation. On the other side, the parameters σ relate to both the likelihood term and the KLD term. For the first term, we use the policy gradient (Sutton and Barto, 2018; Ranzato et al., 2016) by taking Pφ(y|x, z) as the reward and Pσ(z|x) as the policy. The detailed derivation is in Subsection 3.3. For the second term, because the textual knowledge z is a sequence of words, its domain grows exponentially with increasing length. We tackle this using an upper bound of the KLD:

\[
D_{\mathrm{KL}}(P_\sigma \,\Vert\, P_Z) \le \sum_{\tau} D_{\mathrm{KL}}(P_{\sigma,\tau} \,\Vert\, P_{Z,\tau})
\tag{3}
\]

where the KLD subscripted by τ refers to Σ_{z_τ} Pσ(z_τ|z_{<τ}) log(Pσ(z_τ|z_{<τ}) / PZ(z_τ|z_{<τ})), and z_τ is the τ-th token of the sequence z. Therefore the gradients can be rewritten as

\[
\nabla_\phi = \nabla \log P_\phi(y \mid x, z)
\]
\[
\nabla_\sigma = P_\phi(y \mid x, z)\, \nabla \log P_\sigma(z \mid x) + \nabla \sum_{\tau=1}^{\mathrm{len}(z)} D_{\mathrm{KL}}\big(P_{\sigma,\tau}(z) \,\Vert\, P_{Z,\tau}(z)\big)
\tag{4}
\]

As depicted in Figure 2, we implement the Decoupling method upon the Transformer (Vaswani et al., 2017). Particularly, we adopt the TransferTransfo model (Wolf et al., 2019; Radford et al., a), where the optimization considers both the likelihood loss and the classification loss. Hence, our training also adds the classification loss to the originally derived likelihood terms. Although we shape Decoupling to be directly used on TransferTransfo, it is generally applicable to other autoregressive models (Mikolov et al., 2010; Sutskever et al., 2014; Radford et al., b).

3.3 Derivation

Suppose z is a confounding variable of y besides x, but an unobserved one. Also, as defined, variable z is a cause of y other than x if and only if P(y|x, z) > P(y|x). When applying the inequality to real scenarios, we can rewrite it as follows.

Lemma 1. Given (x, y, z̄) ∼ PD, where PD is the real data distribution, and z̄ contains information useful to y and different from x,

\[
\max_\phi P_\phi(y \mid x) < \max_\theta P_\theta(y \mid x, \bar{z})
\tag{5}
\]

where θ is the parameter of a generation model that takes the real personality z̄ as an input.

Definition 1. Define ε(z), for (x, y) ∼ PD and z ∈ Z, as the residual log Pθ∗(y|x, z) − log Pφ∗(y|x), where θ∗ is the optimal parameter of P(y|x, z) and φ∗ is the optimal parameter of P(y|x). The lower bound of ε(z) can be derived as follows:

\[
\varepsilon(z) \ge \log \frac{P_{\phi^*}(x, y, z)}{P_D(x, y)\, P_D(z)}
\tag{6}
\]

The derivation is as follows. Note that the prior distributions P(x), P(y), P(z) are available from the dataset. Therefore we label them with the subscript D, and we do not need to model these priors. In addition, we assume that P(x) and P(z) are independent, i.e., P(x, z) = P(x)P(z), which holds in some real-world cases and also satisfies the definition in Granger causality (Granger, 2004), which indicates that z should contain some unique information that does not exist in other variables.

\[
\begin{aligned}
\varepsilon(z) &:= \log P_{\theta^*}(y \mid x, z) - \log P_{\phi^*}(y \mid x) \\
&\ge \log P_{\phi^*}(y \mid x, z) - \log P_{\phi^*}(y \mid x) \\
&= \log \frac{P_{\phi^*}(x, y, z)}{P_{\phi^*}(x, z)} \cdot \frac{P_{\phi^*}(x)}{P_{\phi^*}(x, y)} \\
&\ge \log \frac{P_{\phi^*}(x, y, z)\, P_D(x)}{P_D(x)\, P_D(z)\, P_D(x, y)} \\
&= \log \frac{P_\phi(x, y, z)}{P_D(x, y)\, P_D(z)}
\end{aligned}
\tag{7}
\]

To find the latent z that contributes most to the generation of y, our aim is to sample a z ∼ Pσ(z|x) which makes Pφ∗(y|x, z) as close as possible to Pθ∗(y|x, z̄), that is, to maximize the residual ε(z).

Theorem 1. The target of finding z that makes log Pφ(y|x, z) as close as possible to the upper bound log Pθ∗(y|x, z̄) is formulated as

\[
\min_\sigma \; \big[\log P_{\theta^*}(y \mid x, \bar{z}) - \log P_\phi(y \mid x, z)\big]
\tag{8}
\]

A tighter lower bound is

\[
\operatorname*{arg\,max}_\sigma \; \log P_\phi(x, y, z), \quad \text{subject to} \ \min_\sigma I(x; z)
\tag{9}
\]

Note that we suppose a φ that is able to utilize z.

Proof. Because z̄ is the optimal knowledge and z is an arbitrary one, the following inequalities hold:

\[
P_\phi(y \mid x) < P_\phi(y \mid x, z) \le P_{\theta^*}(y \mid x, \bar{z})
\tag{10}
\]

Therefore, optimizing the model with respect to z is the same as minimizing the difference between the residuals ε(z̄) and ε(z), since z̄ does not depend on σ and does not change the lower bound of ε in Definition 1:

\[
\operatorname*{arg\,min}_\sigma \; \big[\log P_{\theta^*}(y \mid x, \bar{z}) - \log P_\phi(y \mid x, z)\big]
= \operatorname*{arg\,min}_\sigma \; \big[\varepsilon(\bar{z}) - \varepsilon(z)\big]
= \operatorname*{arg\,max}_\sigma \; \varepsilon(z)
\tag{11}
\]

We are not able to compute the gradient with respect to σ when maximizing ε(z), since the process z ∼ Pσ(z|x) is not differentiable. Therefore, we optimize its lower bound, at least giving ε(z) a tighter constraint:

\[
\operatorname*{arg\,max}_\sigma \; \log P_\phi(x, y, z), \quad \text{subject to} \ P_\phi(x, z) = P_D(x)\, P_D(z)
\tag{12}
\]

Because the logarithm is an increasing function, we choose to optimize its argument:

\[
\nabla_\sigma P_\phi(x, y, z) \approx \frac{1}{m} \sum_{i=1}^{m} P_\phi(y_i \mid x_i, z_i)\, \nabla_\sigma \log P_\sigma(z_i \mid x_i)
\tag{13}
\]

where m is the batch size.

The constraint Pφ(x, z) = PD(x)PD(z) can be rewritten as minimizing the mutual information of the variables x and z:

\[
I(X; Z) = \mathbb{E}_{p(x,z)} \log \frac{p(x, z)}{p(x)\, p(z)} = \mathbb{E}_{p(x)}\, D_{\mathrm{KL}}\big(p(z \mid x) \,\Vert\, p(z)\big)
\]

The optimization problem is then implemented as Equation 4.

4 Experiment Setup

We test the proposed method on Persona-Chat (Zhang et al., 2018), LIGHT (Urbanek et al., 2019), Wizard of Wikipedia (Dinan et al., 2018), and Dailydialog (Li et al., 2017b) for a thorough understanding. For fairness, the proposed approach and the baselines are based on TransferTransfo (Wolf et al., 2019), which starts from a pretrained GPT model (Radford et al., a) and incorporates the personality description with the dialogue history. Moreover, to save memory, we use the same parameters for the models σ and φ, but use different dialogue state embeddings to distinguish them.
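A sketch of how one set of transformer parameters can serve both roles is given below; the segment ids and the helper name are illustrative assumptions, with Pσ and Pφ distinguished only by which dialogue-state embedding ids their tokens carry.

```python
# Hypothetical dialogue-state (segment) ids for sharing one GPT between the
# knowledge model P_sigma and the response model P_phi; ids are illustrative.
SEG_HISTORY, SEG_KNOWLEDGE, SEG_RESPONSE = 0, 1, 2

def build_inputs(history_ids, knowledge_ids, response_ids):
    """Concatenate segments and tag each token with a dialogue-state id.

    The same transformer weights process all segments; the distinct segment
    embeddings tell the model whether it is currently generating knowledge
    (the P_sigma role) or a response (the P_phi role).
    """
    input_ids = history_ids + knowledge_ids + response_ids
    segment_ids = ([SEG_HISTORY] * len(history_ids)
                   + [SEG_KNOWLEDGE] * len(knowledge_ids)
                   + [SEG_RESPONSE] * len(response_ids))
    return input_ids, segment_ids
```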

4.1 Datasets

To more comprehensively understand the usage of external knowledge in dialogue generation, we performed experiments on three (in general) knowledge-grounded conversation datasets: Persona-Chat (Zhang et al., 2018), LIGHT (Urbanek et al., 2019), and Wizard of Wikipedia (WOW) (Dinan et al., 2018). They provide extra information (personalities in Persona-Chat and LIGHT; Wikipedia articles in WOW) for each conversation, which enables us to do automatic evaluation with respect to it. In addition, to further demonstrate that the Decoupling method has some advantage in adding information, we also evaluated the methods on a conversation dataset designed without external knowledge: Dailydialog (Li et al., 2017b).


data          recall  precision  f1
Persona-Chat  21.71   33.70      26.41
LIGHT         19.56   20.79      20.16
WOW-seen       9.82   24.26      13.99
WOW-unseen    10.22   24.47      14.42

Table 1: The unigram F1 of the knowledge-paired datasets. The significant difference of the WOWs from the others can help explain the results.

Persona-Chat and LIGHT. They contain open-domain conversations with profiles of the speakers. Each profile is about 3-5 sentences describing the speaker's personality, such as "my favorite sport is ultimate frisbee." The LM is a GPT model fine-tuned on the collection of personalities in the training set.

Wizard of Wikipedia (WOW). Each conversation is between an apprentice and a wizard, who can access Wikipedia articles related to a topic. The test set of WOW is split into WOW-seen and WOW-unseen sets, where WOW-unseen contains only topics not included in the training set. In our experiment, we directly use the most related sentence in the Wikipedia articles as the external knowledge.

Dailydialog. It is an open-domain conversation dataset with emotion and dialogue-act labels. We discard the metadata and use only the conversations, to validate whether the approximated distribution of personality can in general assist response generation by decoupling with the LM of Persona-Chat.

For the knowledge-paired datasets, we computed the unigram overlaps between x and z in Table 1. The recalls are much lower for the WOWs, indicating that the WOWs might have less mutual information between x and z. These statistics can help explain the completely different trends in their performance.
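For reference, a small Python sketch of the unigram overlap statistics in Table 1 follows; the whitespace tokenization and the orientation of precision versus recall (z treated as the prediction, x as the reference) are our assumptions, since the paper does not spell them out.

```python
from collections import Counter

def unigram_overlap(history: str, knowledge: str):
    """Unigram overlap between dialogue history x and knowledge z (Table 1).

    Here z is treated as the prediction and x as the reference, so
    precision = |overlap| / |z tokens| and recall = |overlap| / |x tokens|;
    this orientation and the tokenization are assumptions for illustration.
    """
    x_counts = Counter(history.lower().split())
    z_counts = Counter(knowledge.lower().split())
    overlap = sum((x_counts & z_counts).values())  # clipped common counts
    if overlap == 0:
        return 0.0, 0.0, 0.0
    recall = overlap / sum(x_counts.values())
    precision = overlap / sum(z_counts.values())
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1
```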

4.2 Baselines

We take four variations of TransferTransfo (Wolf et al., 2019) as baselines. They are respectively trained with full external knowledge (denoted as Full), random subsequences of the external knowledge (we chose the length of 10; denoted as 10len), and no external knowledge (Vanilla). To validate the benefits of Decoupling, we adopt another baseline called RealLM that trains the knowledge generation model directly with paired external knowledge, but uses the z sampled from the model to learn responses. Note that Full and 10len are the upper bounds since they are trained with real knowledge. RealLM is the lower bound of Decoupling, demonstrating the results of training the conversation model with an optimal knowledge generation model.

Because the formulation of Decoupling is somewhat similar to the conditional variational autoencoder (CVAE) (Zhao et al., 2017), we take a recent variation, the discrete CVAE (DCVAE) (Gao et al., 2019), as a baseline. Note that DCVAE targets any underlying variable that is in y but not in x, while Decoupling only considers z ∈ Z that is a cause of y. That is, the two methods are independent, and they can be used together.

4.3 Evaluation Metrics

As a basis, we chose the metrics adopted in the dataset papers, i.e., the unigram F1 of the predictions against the ground truths, the perplexity (PPL) of the ground-truth responses, and hits@1. Hits@1 is also called Recall@1 (Dinan et al., 2018). The selection is from 100 candidates in the WOWs, and 20 candidates in the other datasets.
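A minimal sketch of the hits@1 computation, assuming a hypothetical log_prob scoring interface on the model:

```python
def hits_at_1(model, history, knowledge, candidates, gold_index):
    """hits@1: 1 if the gold response is ranked first among the candidates.

    candidates: 20 responses (100 for the WOWs), one of which is the gold one.
    model.log_prob is a hypothetical interface scoring a candidate response
    given the dialogue history and the (possibly empty) external knowledge.
    """
    scores = [model.log_prob(c, history, knowledge) for c in candidates]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return int(best == gold_index)
```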

We conduct the evaluation considering the knowledge gap. Specifically, the metrics are drawn as curves with regard to varying amounts of external knowledge. The curves show a more comprehensive evaluation of the models, clarifying how effectively the models utilize external knowledge; a sketch of the procedure follows.
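Sketched below is how such a curve, and the knowledge gap as its variance, can be computed; model.perplexity, the token-level truncation, and the sweep granularity are hypothetical choices, not the paper's exact protocol.

```python
def knowledge_gap_curve(model, dialogues, max_len=30, step=5):
    """Evaluate a model while truncating the external knowledge to k tokens.

    dialogues: list of (history, knowledge, response) triples.
    Returns (k, ppl) pairs that trace one curve of Figures 3-4;
    model.perplexity is a hypothetical evaluation interface.
    """
    curve = []
    for k in range(0, max_len + 1, step):
        ppls = []
        for history, knowledge, response in dialogues:
            truncated = " ".join(knowledge.split()[:k])  # first k tokens only
            ppls.append(model.perplexity(history, truncated, response))
        curve.append((k, sum(ppls) / len(ppls)))
    return curve

def knowledge_gap(curve):
    """The knowledge gap: variance of the metric across knowledge amounts."""
    values = [v for _, v in curve]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)
```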

5 Results

Following the concept of the knowledge gap, the performance curves are shown in Figures 3 and 4, where the x-axis is the length of the external knowledge given in the inference stage.

Perplexity. As shown in Figure 3, most of the curves are monotonically decreasing when given more information at inference, except for the Vanilla method trained on LIGHT. This demonstrates that Vanilla is generally less able to catch a proper usage of external information. For the WOWs, the curves of RealLM are the worst, which indicates that the external knowledge is not closely related to the contexts. That is, they have less mutual information, and we are not able to infer some extent of the knowledge z using the contexts x. This phenomenon is consistent with the unigram F1 shown in Table 1. Overall, Decoupling always performs better than RealLM. This shows that Decoupling does not merely infer the paired knowledge, but finds a way to utilize knowledge more efficiently.


[Figure 3: The perplexity of different methods (Full, 10len, Vanilla, Decoupling, RealLM) given variations of sub-information. From left to right: Persona-Chat, LIGHT, WOW-seen, and WOW-unseen.]

[Figure 4: The hits@1 of different methods (Full, 10len, Vanilla, Decoupling, RealLM) given variations of sub-information. From left to right: Persona-Chat, LIGHT, WOW-seen, and WOW-unseen.]

[Figure 5: The variances of (left) perplexity and (right) hits@1 on each dataset in the training and testing stages.]

Also, when testing with labels of length under 10, Decoupling shows performance closer to 10len than the other methods. This probably demonstrates that Decoupling can successfully learn how to utilize external knowledge without training with the paired one, but it is limited by the sampling length of the knowledge generation model.

Hits@1. As shown in Figure 4, most of the curves are monotonically increasing with the length of the labels. However, for LIGHT, all methods except Full drop after a certain label length. Among them, Vanilla is the worst. We conjecture that the z in LIGHT is noisier when it exceeds a certain length, so the usage of z becomes more difficult. In addition, RealLM performs the worst on the WOWs. Decoupling also outperforms 10len under the length of 10, except on Persona-Chat. These characteristics are generally consistent with those of perplexity.

Method                F1     Hits@1  PPL

(Persona-Chat, w/ knowledge)
Full*                 20.83  76.67   16.18
Vanilla               19.89  62.70   19.48
DCVAE                 20.30  62.94   18.89
Decoupling            19.90  66.67   18.68

(LIGHT, w/ knowledge)
Full*                 18.96  66.34   17.06
Vanilla               18.61  53.37   23.98
DCVAE                 18.64  55.08   23.5
Decoupling            18.67  59.63   23.15

(WOW-seen, w/ knowledge)
(Dinan et al., 2018)  35.5   92.3    23.1
(Li et al., 2019)     16.2   -       17.8
(Zhao et al., 2020)   18.5   -       23.0
Full*                 32.84  91.80   12.00
Vanilla               20.46  91.23   18.24
DCVAE                 20.44  90.87   18.41
Decoupling            20.13  90.84   18.25

(WOW-unseen, w/ knowledge)
(Dinan et al., 2018)  32.2   83.1    32.8
(Li et al., 2019)     11.4   -       44.8
(Zhao et al., 2020)   16.5   -       25.6
Full*                 32.59  85.14   13.29
Vanilla               19.79  78.13   22.19
DCVAE                 19.60  78.01   22.35
Decoupling            19.54  76.99   22.12

(Dailydialog, w/o knowledge)
Vanilla               21.96  78.01   11.97
DCVAE                 22.02  77.92   11.96
Decoupling            21.66  76.57   12.60

Table 2: The results of response generation. The superscript * indicates the method is trained with external knowledge, i.e., it is an upper bound.

Other comparisons. The automatic evaluation results are shown in Table 2.


Input       if you could do anything in the world what would you do?
Vanilla     i'd go crazy. i'd go so angry.
DCVAE       i am a veteran of a major army.
Decoupling  i would probably write a book about my favorite food.

Input       yes, fine. i had something wrong with my stomach. i won't go into detail, but it wasn't serious.
Vanilla     oh, i'm sorry to hear that. is there anything i can do for you?
DCVAE       really? did it hurt?
Decoupling  have you had any tests done on your stomach?

Table 3: Some examples. The first one is from Persona-Chat; the second one is from Dailydialog.

[Figure 6: The knowledge gaps of the methods trained on each dataset. The left plot shows the gaps in perplexity; the right one, in hits@1.]

According to the results tested with external knowledge, Decoupling significantly facilitates the model's use of the knowledge description, even without training on it, on Persona-Chat and LIGHT. With the assistance of Figures 3 and 4, Decoupling also shows its power when tested with insufficient knowledge on those datasets. The defect of Decoupling is displayed on the WOWs and Dailydialog. In the WOWs, the external knowledge cannot be observed from the dialogue history (see Table 1); in Dailydialog, the data is not designed with external knowledge, so it intrinsically does not exist. On such data, Decoupling only preserves a performance similar to Vanilla. To conclude, on the automatic evaluation metrics, Decoupling can maintain the generation ability with only the dialogue history while strengthening the ability to utilize external knowledge.

6 Discussion

There are a few more aspects worth discussing.

The difference of knowledge gaps between the training and testing phases. Figure 5 plots the variances of Full, 10len, and Vanilla in the training and the testing stages. For both metrics, the variances in the testing phase are always similar to or much higher than those in the training phase. Therefore, on a dataset collected with paired knowledge,

data          LM     RealLM  Decoupling
Persona-Chat  24.85  23.51   21.88
LIGHT         42.56   4.22   23.61
WOW-seen      44.08  20.59   25.98
WOW-unseen    46.51  46.47   43.98

Table 4: The perplexity of the real paired knowledge on the four datasets.

testing with the knowledge is equally or even more important than training with it.

The knowledge gaps in the four knowledge-grounded datasets. The knowledge gaps of the datasets are drawn in Figure 6. We can conclude that (1) Full and 10len, the models trained with external knowledge, highly depend on it rather than effectively utilizing the dialogue history; (2) Vanilla, Decoupling, and RealLM, with their narrower knowledge gaps, can optimize the usage of the dialogue history; and (3) together with Table 2, among them Decoupling also enhances the usage of external knowledge while maintaining the usage of the dialogue history.

The performance of the knowledge generation models. Table 4 lists the perplexity of the real paired knowledge evaluated on the pretrained LM (PZ), the knowledge generation model trained by RealLM, and the knowledge generation model trained by Decoupling. To evaluate RealLM and Decoupling, we first feed the dialogue history (x) and then compute Pσ(z|x). The results indicate that the knowledge generation models trained by RealLM and Decoupling can catch the conditional information of the knowledge given the dialogue history, although they share parameters with the response generation parts.
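A sketch of how the numbers in Table 4 can be computed for the conditional models is below; the per-token log_prob interface and the whitespace token counting are assumptions for illustration.

```python
import math

def knowledge_ppl(knowledge_model, pairs):
    """Perplexity of the real paired knowledge under P_sigma(z|x) (Table 4).

    pairs: list of (history, real_knowledge) tuples. knowledge_model.log_prob
    is a hypothetical interface returning the summed token log-likelihood of
    the knowledge sentence conditioned on the dialogue history.
    """
    total_logp, total_tokens = 0.0, 0
    for history, z_bar in pairs:
        total_logp += knowledge_model.log_prob(z_bar, history)
        total_tokens += len(z_bar.split())
    return math.exp(-total_logp / total_tokens)
```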

7 Conclusion

This work analyzes the impact of using external knowledge in training and testing respectively. The results show that external knowledge introduces a consistent effect in testing, but training with labels, which are traditionally thought to be important, makes the performance highly dependent on whether the model is tested with external information. The unified Decoupling method shows its potential to bridge the testing gap between having labels or not. It also shows results comparable to the upper bound without training with paired labels.

References

Emily Dinan, Varvara Logacheva, Valentin Malykh, Alexander Miller, Kurt Shuster, Jack Urbanek, Douwe Kiela, Arthur Szlam, Iulian Serban, Ryan Lowe, et al. 2020. The second conversational intelligence challenge (ConvAI2). In The NeurIPS'18 Competition, pages 187-208. Springer.

Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2018. Wizard of Wikipedia: Knowledge-powered conversational agents. In International Conference on Learning Representations.

Marco Dinarelli and Sophie Rosset. 2011. Models cascade for tree-structured named entity detection. In Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 1269-1278, Chiang Mai, Thailand. Asian Federation of Natural Language Processing.

Jun Gao, Wei Bi, Xiaojiang Liu, Junhui Li, Guodong Zhou, and Shuming Shi. 2019. A discrete CVAE for response generation on short-text conversation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1898-1908.

Marjan Ghazvininejad, Chris Brockett, Ming-Wei Chang, Bill Dolan, Jianfeng Gao, Wen-tau Yih, and Michel Galley. 2018. A knowledge-grounded neural conversation model. In Thirty-Second AAAI Conference on Artificial Intelligence.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672-2680.

Clive W. J. Granger. 2004. Time series analysis, cointegration, and applications. American Economic Review, 94(3):421-425.

Katerina Hlavackova-Schindler, Milan Palus, Martin Vejmelka, and Joydeep Bhattacharya. 2007. Causality detection based on information-theoretic approaches in time series analysis. Physics Reports, 441(1):1-46.

Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. 2017. Toward controlled generation of text. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1587-1596. JMLR.org.

Solomon Kullback and Richard A. Leibler. 1951. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79-86.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016a. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 110-119.

Jiwei Li, Michel Galley, Chris Brockett, Georgios Spithourakis, Jianfeng Gao, and Bill Dolan. 2016b. A persona-based neural conversation model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 994-1003.

Jiwei Li, Will Monroe, Tianlin Shi, Sebastien Jean, Alan Ritter, and Dan Jurafsky. 2017a. Adversarial learning for neural dialogue generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2157-2169.

Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017b. DailyDialog: A manually labelled multi-turn dialogue dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 986-995.

Zekang Li, Cheng Niu, Fandong Meng, Yang Feng, Qian Li, and Jie Zhou. 2019. Incremental transformer with deliberation decoder for document grounded conversations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 12-21.

Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association.

Seungwhan Moon, Pararth Shah, Anuj Kumar, and Rajen Subba. 2019. OpenDialKG: Explainable conversational reasoning with attention-based walks over knowledge graphs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 845-854.

Fabio Petroni, Tim Rocktaschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463-2473.


Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. a. Improving language understanding by generative pre-training.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. b. Language models are unsupervised multitask learners.

Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2016. Sequence level training with recurrent neural networks. In International Conference on Learning Representations.

Claire Selltiz, Lawrence S. Wrightsman, and Stuart Wellford Cook. 1976. Research Methods in Social Relations. Holt, Rinehart and Winston.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104-3112.

Richard S. Sutton and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction. MIT Press.

Yi-Lin Tuan, Yun-Nung Chen, and Hung-yi Lee. 2019. DyKgChat: Benchmarking dialogue generation grounding on dynamic knowledge graphs. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1855-1865.

Yi-Lin Tuan and Hung-Yi Lee. 2019. Improving conditional sequence generative adversarial networks by stepwise evaluation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(4):788-798.

Jack Urbanek, Angela Fan, Siddharth Karamcheti, Saachi Jain, Samuel Humeau, Emily Dinan, Tim Rocktaschel, Douwe Kiela, Arthur Szlam, and Jason Weston. 2019. Learning to speak and act in a fantasy text adventure game. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 673-683.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998-6008.

Oriol Vinyals and Quoc Le. 2015. A neural conversational model. ICML Deep Learning Workshop.

Thomas Wolf, Victor Sanh, Julien Chaumond, and Clement Delangue. 2019. TransferTransfo: A transfer learning approach for neural network based conversational agents. NeurIPS 2018 CAI Workshop.

Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2204-2213.

Tiancheng Zhao, Ran Zhao, and Maxine Eskenazi. 2017. Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 654-664.

Xueliang Zhao, Wei Wu, Chongyang Tao, Can Xu, Dongyan Zhao, and Rui Yan. 2020. Low-resource knowledge-grounded dialogue generation. In International Conference on Learning Representations.

Hao Zhou, Tom Young, Minlie Huang, Haizhou Zhao, Jingfang Xu, and Xiaoyan Zhu. 2018. Commonsense knowledge aware conversation generation with graph attention. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 4623-4629.