
Mem2Seq: Effectively Incorporating Knowledge Bases into End-to-End Task-Oriented Dialog Systems

Andrea Madotto∗, Chien-Sheng Wu∗, Pascale Fung
Human Language Technology Center
Center for Artificial Intelligence Research (CAiRE)
Department of Electronic and Computer Engineering
The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong
[eeandreamad,cwuak,pascale]@ust.hk

Abstract

End-to-end task-oriented dialog systems usually suffer from the challenge of incorporating knowledge bases. In this paper, we propose a novel yet simple end-to-end differentiable model called memory-to-sequence (Mem2Seq) to address this issue. Mem2Seq is the first neural generative model that combines the multi-hop attention over memories with the idea of pointer networks. We empirically show how Mem2Seq controls each generation step, and how its multi-hop attention mechanism helps in learning correlations between memories. In addition, our model is quite general, without complicated task-specific designs. As a result, we show that Mem2Seq can be trained faster and attain state-of-the-art performance on three different task-oriented dialog datasets.

1 Introduction

Task-oriented dialog systems help users to achieve specific goals with natural language, such as restaurant reservation and schedule arrangement. Traditionally, they have been built with several pipelined modules: language understanding, dialog management, knowledge query, and language generation (Williams and Young, 2007; Hori et al., 2009; Lee et al., 2009; Levin et al., 2000; Young et al., 2013). Moreover, the ability to query external Knowledge Bases (KBs) is essential in task-oriented dialog systems, since the responses are guided not only by the dialog history but also by the query results (e.g., Table 1). However, despite the stability of such pipelined systems via combining domain-specific knowledge and slot-filling techniques, modeling the dependencies between modules is complex and the KB interpretation requires human effort.

∗ These two authors contributed equally.

| Point of interest (poi) | Distance | Traffic info | Poi type | Address |
|---|---|---|---|---|
| The Westin | 5 miles | moderate traffic | rest stop | 329 El Camino Real |
| Round Table | 4 miles | no traffic | pizza restaurant | 113 Anton Ct |
| Mandarin Roots | 5 miles | no traffic | chinese restaurant | 271 Springer Street |
| Palo Alto Cafe | 4 miles | moderate traffic | coffee or tea place | 436 Alger Dr |
| Dominos | 6 miles | heavy traffic | pizza restaurant | 776 Arastradero Rd |
| Stanford Express Care | 6 miles | no traffic | hospital | 214 El Camino Real |
| Hotel Keen | 2 miles | heavy traffic | rest stop | 578 Arbol Dr |

1st Turn
DRIVER: Where can I get tea?
Seq2Seq: I have a away from away would you like the address
+Attn: I have a listing for a place that serves tea that is 5 miles away
Ptr-Unk: There is a away you like would you like more info
Mem2Seq: The nearest tea shop is Palo Alto Cafe located 4 miles away, would you like directions there?
GOLD: Palo Alto Cafe is 4 miles away and serves coffee and tea. Do you want the address?

2nd Turn
DRIVER: Yes.
Seq2Seq: The is at at
+Attn: The address is 329 El Camino Real and it's 3 miles away and there is no traffic
Ptr-Unk: Palo Alto Cafe is 4 miles away and PAD is no traffic
Mem2Seq: Palo Alto Cafe is 4 miles away at 436 Alger Drive
GOLD: Palo Alto is located at 436 Alger Dr.

Table 1: Example of generated responses for the In-Car Assistant on the navigation domain.


Recently, end-to-end approaches for dialog modeling, which use recurrent neural network (RNN) encoder-decoder models, have shown promising results (Serban et al., 2016; Wen et al., 2017; Zhao et al., 2017). Since they can directly map the plain-text dialog history to the output responses, and the dialog states are latent, there is no need for hand-crafted state labels. Moreover, attention-based copy mechanisms (Gulcehre et al., 2016; Eric and Manning, 2017) have recently been introduced to copy words directly from the input sources to the output responses. Using such mechanisms, even when unknown tokens appear in the dialog history, the models are still able to produce correct and relevant entities.

However, although the above-mentioned approaches have been successful, they still suffer from two main problems: 1) They struggle to effectively incorporate external KB information into the RNN hidden states (Sukhbaatar et al., 2015), since RNNs are known to be unstable over long sequences.



Figure 1: The proposed Mem2Seq architecture for task-oriented dialog systems. (a) Memory encoder with 3 hops; (b) memory decoder over a 2-step generation.

2) Processing long sequences is very time-consuming, especially when using attention mechanisms.

On the other hand, end-to-end memory networks (MemNNs) are recurrent attention models over a possibly large external memory (Sukhbaatar et al., 2015). They write external memories into several embedding matrices, and use query vectors to read the memories repeatedly. This approach can memorize external KB information and rapidly encode long dialog histories. Moreover, the multi-hop mechanism of MemNNs has been empirically shown to be essential for achieving high performance on reasoning tasks (Bordes and Weston, 2017). Nevertheless, a MemNN simply chooses its responses from a predefined candidate pool rather than generating word-by-word. In addition, the memory queries need explicit design rather than being learned, and the copy mechanism is absent.

To address these problems, we present a novel architecture that we call Memory-to-Sequence (Mem2Seq) to learn task-oriented dialogs in an end-to-end manner. In short, our model augments the existing MemNN framework with a sequential generative architecture, using global multi-hop attention mechanisms to copy words directly from the dialog history or the KBs. We summarize our main contributions as follows: 1) Mem2Seq is the first model to combine multi-hop attention mechanisms with the idea of pointer networks, which allows us to effectively incorporate KB information. 2) Mem2Seq learns how to generate dynamic queries to control the memory access. In addition, we visualize and interpret the model dynamics among hops for both the memory controller and the attention. 3) Mem2Seq can be trained faster and achieves state-of-the-art results on several task-oriented dialog datasets.

2 Model Description

Mem2Seq (the code is available at https://github.com/HLTCHKUST/Mem2Seq) is composed of two components: the MemNN encoder and the memory decoder, as shown in Figure 1. The MemNN encoder creates a vector representation of the dialog history. The memory decoder then reads and copies the memory to generate a response. We define all the words in the dialog history as a sequence of tokens $X = \{x_1, \ldots, x_n, \$\}$, where \$ is a special character used as a sentinel, and the KB tuples as $B = \{b_1, \ldots, b_l\}$. We further define $U = [B; X]$ as the concatenation of the two sets $X$ and $B$, $Y = \{y_1, \ldots, y_m\}$ as the set of words in the expected system response, and $PTR = \{ptr_1, \ldots, ptr_m\}$ as the pointer index set:

$$ptr_i = \begin{cases} \max(z) & \text{if } \exists z \text{ s.t. } y_i = u_z \\ n + l + 1 & \text{otherwise,} \end{cases} \qquad (1)$$

where $u_z \in U$ is the input sequence and $n + l + 1$ is the sentinel position index.

2.1 Memory Encoder

Mem2Seq uses a standard MemNN with adjacent weighted tying (Sukhbaatar et al., 2015) as an encoder. The input of the encoder is the word-level information in $U$. The memories of the MemNN are represented by a set of trainable embedding matrices $C = \{C^1, \ldots, C^{K+1}\}$, where each $C^k$ maps tokens to vectors, and a query vector $q^k$ is used as a reading head. The model loops over $K$ hops and computes the attention weights at hop $k$ for each memory $i$ using:


$$p_i^k = \mathrm{Softmax}\left((q^k)^{T} C_i^k\right), \qquad (2)$$

where $C_i^k = C^k(x_i)$ is the memory content in position $i$, and $\mathrm{Softmax}(z_i) = e^{z_i} / \sum_j e^{z_j}$. Here, $p^k$ is a soft memory selector that decides the memory relevance with respect to the query vector $q^k$. Then, the model reads out the memory $o^k$ by the weighted sum over $C^{k+1}$ (it is $C^{k+1}$ because we use adjacent weighted tying):

$$o^k = \sum_i p_i^k C_i^{k+1}. \qquad (3)$$

The query vector is then updated for the next hop using $q^{k+1} = q^k + o^k$. The result of the encoding step is the memory vector $o^K$, which becomes the input for the decoding step.
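As an illustration of Equations 2-3 and the query update, here is a PyTorch-style sketch of the K-hop encoder read with adjacent weighted tying. Tensor shapes, names, and the zero-initialized query are our own assumptions, not the released implementation:

```python
import torch
import torch.nn as nn

class MemNNEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hops):
        super().__init__()
        self.emb_dim = emb_dim
        self.hops = hops
        # C^1 .. C^{K+1}: one embedding matrix per hop plus one for tying.
        self.C = nn.ModuleList(
            [nn.Embedding(vocab_size, emb_dim) for _ in range(hops + 1)])

    def forward(self, story):  # story: (batch, mem_size) token ids of U = [B; X]
        # Initial query q^1 (its initialization is not specified in the paper).
        q = torch.zeros(story.size(0), self.emb_dim, device=story.device)
        for k in range(self.hops):
            m_k = self.C[k](story)                                # C^k_i
            logits = torch.bmm(m_k, q.unsqueeze(2)).squeeze(2)    # (q^k)^T C^k_i
            p_k = torch.softmax(logits, dim=1)                    # Eq. 2
            m_next = self.C[k + 1](story)                         # adjacent tying
            o_k = torch.bmm(p_k.unsqueeze(1), m_next).squeeze(1)  # Eq. 3
            q = q + o_k                                           # q^{k+1} = q^k + o^k
        return o_k  # o^K, the encoder output fed to the decoder
```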

2.2 Memory Decoder

The decoder uses an RNN and a MemNN. The MemNN is loaded with both $X$ and $B$, since we use both the dialog history and the KB information to generate a proper system response. A Gated Recurrent Unit (GRU) (Chung et al., 2014) is used as a dynamic query generator for the MemNN. At each decoding step $t$, the GRU takes the previously generated word and the previous query as input, and generates the new query vector. Formally:

$$h_t = \mathrm{GRU}\left(C^1(y_{t-1}), h_{t-1}\right), \qquad (4)$$

where $h_0$ is the encoder vector $o^K$. The query $h_t$ is then passed to the MemNN, which produces the next token. At each time step, two distributions are generated: one over all the words in the vocabulary ($P_{vocab}$) and one over the memory contents ($P_{ptr}$), i.e., the dialog history and KB information. The first, $P_{vocab}$, is generated by concatenating the first-hop attention read-out with the current query vector:

$$P_{vocab}(y_t) = \mathrm{Softmax}\left(W_1 [h_t; o^1]\right), \qquad (5)$$

where $W_1$ is a trainable parameter. On the other hand, $P_{ptr}$ is generated using the attention weights at the last MemNN hop of the decoder: $P_{ptr} = p_t^K$. Our decoder generates tokens by pointing to input words in the memory, a mechanism similar to the attention used in pointer networks (Vinyals et al., 2015).

We designed our architecture this way because we expect the attention weights in the first and the last hop to show a "looser" and a "sharper" distribution, respectively. To elaborate, the first hop focuses more on retrieving memory information, while the last one tends to choose the exact token, leveraging the pointer supervision. Hence, during training all the parameters are jointly learned by minimizing the sum of two standard cross-entropy losses: one between $P_{vocab}(y_t)$ and $y_t \in Y$ for the vocabulary distribution, and one between $P_{ptr}(y_t)$ and $ptr_t \in PTR$ for the memory distribution.
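Combining Equations 4-5 with the pointer distribution and the joint loss, one decoding step could look like the sketch below. The `read` helper is assumed to run the K-hop memory read of Section 2.1 and return both the hop-1 read-out and the last-hop attention; all names are ours:

```python
import torch
import torch.nn.functional as F

# Assumed components (for illustration only):
#   gru:  torch.nn.GRUCell(emb_dim, emb_dim), the dynamic query generator
#   C1:   the decoder's hop-1 embedding matrix
#   W1:   torch.nn.Linear(2 * emb_dim, vocab_size)
#   read: K-hop memory read returning (o1, last_hop_attention)

def decode_step(gru, C1, W1, read, y_prev, h_prev):
    h_t = gru(C1(y_prev), h_prev)          # Eq. 4: h_t = GRU(C^1(y_{t-1}), h_{t-1})
    o1, p_ptr = read(h_t)                  # hop-1 read-out; P_ptr = last-hop attention
    p_vocab = F.softmax(W1(torch.cat([h_t, o1], dim=-1)), dim=-1)  # Eq. 5
    return p_vocab, p_ptr, h_t

# Training objective: the sum of two standard cross-entropies, e.g.
#   loss_t = F.nll_loss(p_vocab.log(), y_t) + F.nll_loss(p_ptr.log(), ptr_t)
```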

2.2.1 Sentinel

If the expected word does not appear in the memories, then $P_{ptr}$ is trained to produce the sentinel token \$, as shown in Equation 1. Once the sentinel is chosen, our model generates the token from $P_{vocab}$; otherwise, it takes the memory content using the $P_{ptr}$ distribution. Essentially, the sentinel token is used as a hard gate to control which distribution to use at each time step. A similar approach was used in Merity et al. (2017) to control a soft gate in a language modeling task. With this method, the model does not need to learn a gating function separately, as in Gulcehre et al. (2016), and is not constrained by a soft gate function, as in See et al. (2017).
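At inference time the sentinel therefore acts as a hard gate between the two distributions; a minimal sketch of the resulting generation rule (names are ours):

```python
def generate_token(p_vocab, p_ptr, memory_tokens, sentinel_idx, id2word):
    # If P_ptr points at the sentinel "$", generate from the vocabulary;
    # otherwise copy the pointed-to content from the dialog history or KB.
    ptr_pos = int(p_ptr.argmax())
    if ptr_pos == sentinel_idx:
        return id2word[int(p_vocab.argmax())]
    return memory_tokens[ptr_pos]
```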

2.3 Memory Content

We store the word-level content $X$ in the memory module. Similar to Bordes and Weston (2017), we add temporal information and speaker information to each token of $X$ to capture the sequential dependencies. For example, "hello t1 \$u" means "hello" at time step 1 spoken by the user.

To store $B$, the KB information, we follow Miller et al. (2016) and Eric et al. (2017) in using a (subject, relation, object) representation. For example, we represent the information of The Westin in Table 1 as (The Westin, Distance, 5 miles). We then sum the word embeddings of the subject, relation, and object to obtain each KB memory representation. During the decoding stage, the object part is used as the generated word for $P_{ptr}$. For instance, when the KB tuple (The Westin, Distance, 5 miles) is pointed to, our model copies "5 miles" as the output word. Notice that only the section of the KB relevant to a specific dialog is loaded into the memory.
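For concreteness, the sketch below shows one possible realization of these memory contents (our own rendering, not the released code): dialog words tagged with temporal and speaker features, and KB triples embedded as the sum of subject, relation, and object embeddings. `embed` is an assumed token-to-vector lookup:

```python
def dialog_memory(turns):
    # turns: list of (speaker, utterance) pairs, e.g. [("u", "hello"), ("s", "hi")]
    # Each word becomes one memory slot tagged with its time step and speaker,
    # so "hello t1 $u" encodes "hello" spoken by the user at time step 1.
    slots = []
    for t, (speaker, utterance) in enumerate(turns, start=1):
        for word in utterance.split():
            slots.append((word, f"t{t}", f"${speaker}"))
    return slots

def kb_memory_embedding(embed, subject, relation, obj):
    # A KB tuple such as (The Westin, Distance, 5 miles) is stored as the sum
    # of its three embeddings; the object ("5 miles") is the word that gets
    # copied when P_ptr points at this slot.
    return embed(subject) + embed(relation) + embed(obj)
```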


| | T1 | T2 | T3 | T4 | T5 | DSTC2 | In-Car |
|---|---|---|---|---|---|---|---|
| Avg. user turns | 4 | 6.5 | 6.4 | 3.5 | 12.9 | 6.7 | 2.6 |
| Avg. sys turns | 6 | 9.5 | 9.9 | 3.5 | 18.4 | 9.3 | 2.6 |
| Avg. KB results | 0 | 0 | 24 | 7 | 23.7 | 39.5 | 66.1 |
| Avg. sys words | 6.3 | 6.2 | 7.2 | 5.7 | 6.5 | 10.2 | 8.6 |
| Max. sys words | 9 | 9 | 9 | 8 | 9 | 29 | 87 |
| Pointer ratio | .23 | .53 | .46 | .19 | .60 | .46 | .42 |

| | bAbI (T1-T5) | DSTC2 | In-Car |
|---|---|---|---|
| Vocabulary | 3747 | 1229 | 1601 |
| Train dialogs | 1000 | 1618 | 2425 |
| Val dialogs | 1000 | 500 | 302 |
| Test dialogs | 1000 (+1000 OOV) | 1117 | 304 |

Table 2: Dataset statistics for the 3 different datasets.

3 Experimental Setup

3.1 Dataset

We use three public multi-turn task-oriented dialog datasets to evaluate our model: bAbI dialog (Bordes and Weston, 2017), DSTC2 (Henderson et al., 2014), and the In-Car Assistant (Eric et al., 2017). The train/validation/test splits of all three datasets are provided in advance by their authors. The dataset statistics are reported in Table 2.

The bAbI dialog dataset includes five end-to-end dialog learning tasks in the restaurant domain, built on simulated dialog data. Tasks 1 to 4 are about calling APIs, refining API calls, recommending options, and providing additional information, respectively. Task 5 is the union of tasks 1-4. There are two test sets for each task: one follows the same distribution as the training set, and the other contains out-of-vocabulary (OOV) entity values that do not exist in the training set.

We also used dialogs extracted from the Dialog State Tracking Challenge 2 (DSTC2), in the refined version of Bordes and Weston (2017), which ignores the dialog state annotations. The main difference from bAbI dialog is that this dataset is extracted from real human-bot dialogs, which are noisier and harder since the bots made mistakes due to speech recognition errors or misinterpretations.

The recently released In-Car Assistant dataset is a human-human, multi-domain dialog dataset collected from Amazon Mechanical Turk. It has three distinct domains: calendar scheduling, weather information retrieval, and point-of-interest navigation. This dataset has shorter conversation turns, but the user and system behaviors are more diverse. In addition, the system responses are more varied and the KB information is much more complicated. Hence, this dataset requires a stronger ability to interact with KBs, rather than dialog state tracking.

3.2 Training

We trained our model end-to-end using the Adam optimizer (Kingma and Ba, 2015), and chose the learning rate between [1e-3, 1e-4]. The MemNNs, both encoder and decoder, use K = 1, 3, or 6 hops to show the performance difference. We use simple greedy search without any re-scoring techniques. The embedding size, which also equals the memory size and the RNN hidden size (including for the baselines), was selected between [64, 512]. The dropout rate is set between [0.1, 0.4], and we also randomly mask some input words as unknown tokens to simulate the OOV situation, with the same rate as the dropout. For all the datasets, we tuned the hyper-parameters with grid search over the validation set, using per-response accuracy as the measure for bAbI dialog and DSTC2, and the BLEU score for the In-Car Assistant.
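The random input-word masking mentioned above could be implemented as follows (a sketch under our assumptions; the masking probability mirrors the dropout rate, and the `UNK` token name is hypothetical):

```python
import random

def mask_oov(tokens, unk_token="UNK", p=0.2):
    # Randomly replace input words with the unknown token at training time
    # to simulate out-of-vocabulary conditions; p mirrors the dropout rate.
    return [unk_token if random.random() < p else tok for tok in tokens]
```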

3.3 Evaluation Metrics

Per-response/dialog Accuracy: A generated response is correct only if it is exactly the same as the gold response. A dialog is correct only if every generated response of the dialog is correct, which can be considered the task-completion rate. Note that Bordes and Weston (2017) test their model by selecting the system response from predefined response candidates; that is, their system solves a multi-class classification task. Since Mem2Seq generates each token individually, evaluating with this metric is much more challenging for our model.

BLEU: This measure is commonly used for machine translation systems (Papineni et al., 2002), but it has also been used to evaluate dialog systems (Eric and Manning, 2017; Zhao et al., 2017) and chat-bots (Ritter et al., 2011; Li et al., 2016). Moreover, the BLEU score is a relevant measure in task-oriented dialog, as there is not a large variance between the generated answers, unlike in open-domain generation (Liu et al., 2016). Hence, we include the BLEU score in our evaluation (using the Moses multi-bleu.perl script).

Entity F1: We micro-average over the entire set of system responses and compare the entities in plain text. The entities in each gold system response are selected by a predefined entity list. This metric evaluates the ability to generate relevant entities from the provided KBs and to capture the semantics of the dialog (Eric and Manning, 2017; Eric et al., 2017).
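Concretely, a micro-averaged entity F1 pools entity counts over the whole test set before computing precision and recall. A small sketch under our assumptions (plain-text matching against a predefined `entity_list`):

```python
def micro_entity_f1(gold_responses, gen_responses, entity_list):
    # Pool matches and counts over all responses (micro-average).
    entities = set(entity_list)
    tp = gold_total = gen_total = 0
    for gold, gen in zip(gold_responses, gen_responses):
        gold_ents = [w for w in gold.split() if w in entities]
        gen_ents = [w for w in gen.split() if w in entities]
        gold_total += len(gold_ents)
        gen_total += len(gen_ents)
        pool = list(gen_ents)  # do not credit the same generated entity twice
        for e in gold_ents:
            if e in pool:
                pool.remove(e)
                tp += 1
    precision = tp / gen_total if gen_total else 0.0
    recall = tp / gold_total if gold_total else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```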


| Task | QRN | MemNN | GMemNN | Seq2Seq | Seq2Seq+Attn | Ptr-Unk | Mem2Seq H1 | Mem2Seq H3 | Mem2Seq H6 |
|---|---|---|---|---|---|---|---|---|---|
| T1 | 99.4 (-) | 99.9 (99.6) | 100 (100) | 100 (100) | 100 (100) | 100 (100) | 100 (100) | 100 (100) | 100 (100) |
| T2 | 99.5 (-) | 100 (100) | 100 (100) | 100 (100) | 100 (100) | 100 (100) | 100 (100) | 100 (100) | 100 (100) |
| T3 | 74.8 (-) | 74.9 (2.0) | 74.9 (0) | 74.8 (0) | 74.8 (0) | 85.1 (19.0) | 87.0 (25.2) | 94.5 (59.6) | 94.7 (62.1) |
| T4 | 57.2 (-) | 59.5 (3.0) | 57.2 (0) | 57.2 (0) | 57.2 (0) | 100 (100) | 97.6 (91.7) | 100 (100) | 100 (100) |
| T5 | 99.6 (-) | 96.1 (49.4) | 96.3 (52.5) | 98.8 (81.5) | 98.4 (87.3) | 99.4 (91.5) | 96.1 (45.3) | 98.2 (72.9) | 97.9 (69.6) |
| T1-OOV | 83.1 (-) | 72.3 (0) | 82.4 (0) | 79.9 (0) | 81.7 (0) | 92.5 (54.7) | 93.4 (60.4) | 91.3 (52.0) | 94.0 (62.2) |
| T2-OOV | 78.9 (-) | 78.9 (0) | 78.9 (0) | 78.9 (0) | 78.9 (0) | 83.2 (0) | 81.7 (1.2) | 84.7 (7.3) | 86.5 (12.4) |
| T3-OOV | 75.2 (-) | 74.4 (0) | 75.3 (0) | 74.3 (0) | 75.3 (0) | 82.9 (13.4) | 86.6 (26.2) | 93.2 (53.3) | 90.3 (38.7) |
| T4-OOV | 56.9 (-) | 57.6 (0) | 57.0 (0) | 57.0 (0) | 57.0 (0) | 100 (100) | 97.3 (90.6) | 100 (100) | 100 (100) |
| T5-OOV | 67.8 (-) | 65.5 (0) | 66.7 (0) | 67.4 (0) | 65.7 (0) | 73.6 (0) | 67.6 (0) | 78.1 (0.4) | 84.5 (2.3) |

Table 3: Per-response and per-dialog (in parentheses) accuracy on bAbI dialogs. Mem2Seq achieves the highest average per-response accuracy and has the smallest out-of-vocabulary performance drop.

| | Ent. F1 | BLEU | Per-Resp. | Per-Dial. |
|---|---|---|---|---|
| Rule-Based | - | - | 33.3 | - |
| QRN | - | - | 43.8 | - |
| MemNN | - | - | 41.1 | 0.0 |
| GMemNN | - | - | 47.4 | 1.4 |
| Seq2Seq | 69.7 | 55.0 | 46.4 | 1.5 |
| +Attn | 67.1 | 56.6 | 46.0 | 1.4 |
| +Copy | 71.6 | 55.4 | 47.3 | 1.3 |
| Mem2Seq H1 | 72.9 | 53.7 | 41.7 | 0.0 |
| Mem2Seq H3 | 75.3 | 55.3 | 45.0 | 0.5 |
| Mem2Seq H6 | 72.8 | 53.6 | 42.8 | 0.7 |

Table 4: Evaluation on DSTC2. Seq2Seq (+Attn and +Copy) results are reported from Eric and Manning (2017).

| | BLEU | Ent. F1 | Sch. F1 | Wea. F1 | Nav. F1 |
|---|---|---|---|---|---|
| Human* | 13.5 | 60.7 | 64.3 | 61.6 | 55.2 |
| Rule-Based* | 6.6 | 43.8 | 61.3 | 39.5 | 40.4 |
| KV Retrieval Net* | 13.2 | 48.0 | 62.9 | 47.0 | 41.3 |
| Seq2Seq | 8.4 | 10.3 | 9.7 | 14.1 | 7.0 |
| +Attn | 9.3 | 19.9 | 23.4 | 25.6 | 10.8 |
| Ptr-Unk | 8.3 | 22.7 | 26.9 | 26.7 | 14.9 |
| Mem2Seq H1 | 11.6 | 32.4 | 39.8 | 33.6 | 24.6 |
| Mem2Seq H3 | 12.6 | 33.4 | 49.3 | 32.8 | 20.0 |
| Mem2Seq H6 | 9.9 | 23.6 | 34.3 | 33.0 | 4.4 |

Table 5: Evaluation on the In-Car Assistant. Human, Rule-Based, and KV Retrieval Net evaluations (*) are reported from Eric et al. (2017) and are not directly comparable. Mem2Seq achieves the highest BLEU and entity F1 scores among the baselines.

Note that the original In-Car Assistant F1 scores reported in Eric et al. (2017) use the entities in their canonicalized forms, which are not computed from the real entity values. Since the datasets are not designed for slot tracking, we report entity F1 rather than the slot-tracking accuracy used in Wen et al. (2017) and Zhao et al. (2017).

4 Experimental Results

We mainly compare Mem2Seq with 1, 3, and 6 hops against several existing models: query-reduction networks (QRN; Seo et al., 2017), end-to-end memory networks (MemNN; Sukhbaatar et al., 2015), and gated end-to-end memory networks (GMemNN; Liu and Perez, 2017). We also implemented the following baseline models: standard sequence-to-sequence (Seq2Seq) models with and without attention (Luong et al., 2015), and pointer to unknown (Ptr-Unk; Gulcehre et al., 2016). Note that the results we list in Table 3 and Table 4 for QRN differ from the original paper because, based on their released code,³ we discovered that the per-response accuracy was not correctly computed.

³ We simply modified the evaluation part and reported the results (https://github.com/uwnlp/qrn).

bAbI Dialog: In Table 3, we follow Bordes and Weston (2017) and compare performance based on per-response and per-dialog accuracy. Mem2Seq with 6 hops achieves 97.9% per-response and 69.6% per-dialog accuracy on T5, and 84.5% and 2.3% on T5-OOV, which surpasses existing methods by far. Our model achieves particularly promising results on T3, the task of recommending restaurants based on their ranks, thanks to the memory pointer. In terms of per-response accuracy, this indicates that our model generalizes well, with little performance loss on the OOV test data, while others drop by around 15-20%. The performance gain on OOV data is also mainly attributed to the use of the copy mechanism. In addition, the effectiveness of hops is demonstrated on tasks 3-5, since they require reasoning ability over the KB information. Note that QRN, MemNN, and GMemNN viewed the bAbI dialog tasks as classification problems. Although their tasks are easier compared to our generative approach, Mem2Seq still surpasses their performance. Finally, Seq2Seq and Ptr-Unk are also strong baselines, which further confirms that generative methods can achieve good performance in task-oriented dialog systems (Eric and Manning, 2017).


DSTC2: In Table 4, the Seq2Seq models from Eric and Manning (2017) and the rule-based model from Bordes and Weston (2017) are reported. Mem2Seq has the highest entity F1 score (75.3%) and a high BLEU score of 55.3. This further confirms that Mem2Seq can retrieve the correct entity, using the multi-hop mechanism, without losing language modeling ability. Here, we do not report results using the match type (Bordes and Weston, 2017) or entity type (Eric and Manning, 2017) features, since this meta-information is not commonly available and we want to evaluate on plain input-output pairs. One can also see that Mem2Seq has a per-response accuracy comparable to the other existing solutions (within a 2% margin). Note that the per-response accuracy of every model is below 50%, since the dataset is quite noisy and it is hard to generate a response that is exactly the same as the gold one.

In-Car Assistant: In Table 5, our model achieves the highest BLEU score of 12.6. In addition, Mem2Seq shows promising results in terms of entity F1 scores (33.4%), which are, in general, much higher than those of the other baselines. Note that the numbers reported from Eric et al. (2017) are not directly comparable to ours, as we explain below. The other baselines such as Seq2Seq and Ptr-Unk perform especially poorly on this dataset, since it is very inefficient for RNN methods to encode long KB information, which is exactly where Mem2Seq has the advantage.

Furthermore, we observe an interesting phenomenon: humans can easily achieve a high entity F1 score with a low BLEU score. This implies that stronger reasoning ability over entities (hops) is crucial, even when the results are not similar to the gold answer. We believe humans can produce good answers even with a low BLEU score, since there can be different ways to express the same concepts. Therefore, Mem2Seq shows the potential to successfully choose the correct entities.

Note that the results of the KV Retrieval Net baseline reported in Table 5 come from the original In-Car Assistant paper (Eric et al., 2017), where the task was simplified by mapping entity expressions to a canonical form using named entity recognition (NER) and linking. Hence, the evaluation is not directly comparable to our system. For example, their model learned to generate responses such as "You have a football game at football time with football party," instead of a sentence such as "You have a football game at 7 pm with John."

[Figure 2 plot: training time per epoch in minutes (y-axis) vs. maximum input length in tokens (x-axis: T1/T4 = 70, T2 = 124, T3 = 280, T5 = 403, In-Car = 648, DSTC2 = 1557) for Mem2Seq H6, Seq2Seq, Seq2Seq+Attn, and Ptr-Unk.]

Figure 2: Training time per epoch for different tasks (lower is better). The speed difference becomes larger as the maximal input length increases.

Since there could be more than one football party or football time, their model does not learn how to access the KBs, but rather learns a canonicalized language model.

Time Per-Epoch: We also compare the training time⁴ in Figure 2. The experiments use batch size 16, and we report each model with the hyper-parameters that achieved the highest performance. One can observe that the training times are similar for short input lengths (bAbI dialog tasks 1-4), and that the gap grows as the maximal input length increases. Mem2Seq is around 5 times faster than Seq2Seq with attention on the In-Car Assistant and DSTC2. This difference in training efficiency is mainly attributed to the fact that Seq2Seq models have sequential dependencies on the input, which limit parallelization. Moreover, Seq2Seq models cannot avoid encoding the KBs as part of the input sequence, whereas Mem2Seq accesses them through its memory instead.

5 Analysis and Discussion

Memory Attention: Analyzing attention weights has frequently been used to show memory read-outs, since it is an intuitive way to understand the model dynamics. Figure 3 shows the attention vector at the last hop for each generated token. Each column represents the $P_{ptr}$ vector at the corresponding generation step. Our model has a sharp distribution over the memory, which implies that it is able to select the right token from the memory.

⁴ Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz, with a GeForce GTX 1080 Ti.


[Figure 3 heatmap: each column is the P_ptr distribution at one generation step (0-13); rows are the memory contents (dialog history words and KB triples such as (civic center garage, address, 270 altaire walk)), plus the sentinel.]

COR: the closest parking garage is civic center garage located 4 miles away at 270 altaire walk
GEN: the closest parking garage is civic center garage at 270 altaire walk 4 miles away through the directions

Figure 3: Last-hop memory attention visualization from the In-Car dataset. COR and GEN at the top are the correct response and our generated one.

For example, the KB information "270 altaire walk" was retrieved at the sixth step; it is the address of "civic center garage". On the other hand, if the sentinel is triggered, the generated word comes from the vocabulary distribution $P_{vocab}$. For instance, the third generation step triggered the sentinel, and "is" was generated from the vocabulary, as the word is not present in the dialog history.

Multiple Hops: Mem2Seq shows how multiple hops improve the model performance on several datasets. Task 3 of the bAbI dialog dataset serves as an example, in which the system needs to recommend restaurants to users based on restaurant ranking, from highest to lowest. Users can reject the recommendation and the system has to reason over the next highest-ranked restaurant. We found two common patterns across hops among different samples: the first hop is usually used to score all the relevant memories and retrieve information, while the last hop tends to focus on a specific token and makes mistakes when the attention is not sharp.

Figure 4: Principal component analysis of query vectors at hops (a) 1 and (b) 6 for bAbI dialog.

| room | time | date | party | event |
|---|---|---|---|---|
| conference room 100 | 3pm | Thursday | sales team | conference |
| - | 11am | Tuesday | - | tennis activity |
| - | 7pm | Wednesday | - | dentist appointment |
| conference room 100 | 1pm | Tuesday | HR | meeting |
| - | 7pm | Friday | - | doctor appointment |
| - | 11am | Tuesday | - | lab appointment |

DRIVER: When's my tennis match?
Seq2Seq: Your lab appointment is on Monday at
+Attn: Your lab appointment is on Tuesday at 11am
Ptr-Unk: Your tennis match is on PAD at 1pm
Mem2Seq H1: Your is very welcome
Mem2Seq H3: Your tennis is on Tuesday at 11am
Mem2Seq H6: Your tennis is Monday at 11am
GOLD: Your tennis match is Tuesday at 11am

Table 6: Example of generated responses for the In-Car Assistant on the scheduling domain.

Such mistakes can be attributed, for some samples, to a lack of hops. For more information, we report two figures in the supplementary material.

Query Vectors: In Figure 4, a principal component analysis of the Mem2Seq query vectors is shown for different hops. Each dot is a query vector $h_t$ at a decoding time step, with its corresponding generated word $y_t$. The blue dots are words generated from $P_{vocab}$, which triggered the sentinel, and the orange ones are from $P_{ptr}$. One can see that in (a) hop 1, there is no clear separation between the two colors, though dots of each color tend to group together. On the other hand, the separation becomes clearer in (b) hop 6, as each color clusters into several groups such as locations, cuisines, and numbers. Our model tends to retrieve more information in the first hop, and points into the memories in the last hop.


Examples: Tables 1 and 6 show the responses generated by different models for two test samples from the In-Car Assistant dataset. We report examples from this dataset since its answers are more human-like and not as structured and repetitive as the others. Seq2Seq generally cannot produce related information and sometimes fails at language modeling. Using attention helps with this issue, but it still rarely produces the correct entities. For example, Seq2Seq with attention generated "5 miles" in Table 1, but the correct answer is "4 miles". In addition, Ptr-Unk often cannot copy the correct token from the input, as shown by "PAD" in Table 1. On the other hand, Mem2Seq is able to produce correct responses in these two examples. In particular, in the navigation domain shown in Table 1, Mem2Seq produces a different but still correct utterance. We report further examples from all the domains in the supplementary material.

Discussion: Conventional task-oriented dialog systems (Williams and Young, 2007), which are still widely used in commercial systems, require considerable human effort in system design and data collection. On the other hand, although end-to-end dialog systems are not perfect yet, they require much less human interference, especially in dataset construction, as raw conversational text and KB information can be used directly without heavy preprocessing (e.g., NER, dependency parsing). To this extent, Mem2Seq is a simple generative model that is able to incorporate KB information with promising generalization ability. We also discovered that the entity F1 score may be a more comprehensive evaluation metric than per-response accuracy or the BLEU score, as humans can normally choose the right entities but give very diversified responses. Indeed, we want to highlight that humans may have a low BLEU score despite their correctness, because there may not be a large n-gram overlap between the given response and the expected one. However, this does not imply that there is no correlation between the BLEU score and human evaluation. In fact, unlike for chat-bots and open-domain dialogs, where the BLEU score does not correlate with human evaluation (Liu et al., 2016), in task-oriented dialogs the answers are constrained to particular entities and recurrent patterns. Thus, we believe the BLEU score can still be considered a relevant measure. In future work, several methods could be applied (e.g., reinforcement learning (Ranzato et al., 2016), beam search (Wiseman and Rush, 2016)) to improve both response relevance and entity F1 score. However, we preferred to keep our model as simple as possible in order to show that it works well even without advanced training methods.

6 Related Works

End-to-end task-oriented dialog systems train a single model directly on text transcripts of dialogs (Wen et al., 2017; Serban et al., 2016; Williams et al., 2017; Zhao et al., 2017; Seo et al., 2017; Serban et al., 2017). Here, RNNs play an important role due to their ability to create a latent representation, avoiding the need for artificial state labels. End-to-end memory networks (Bordes and Weston, 2017; Sukhbaatar et al., 2015) and their variants (Liu and Perez, 2017; Wu et al., 2017, 2018) have also shown good results on such tasks. In each of these architectures, the output is produced either by generating a sequence of tokens or by selecting from a set of predefined utterances.

Sequence-to-sequence (Seq2Seq) models have also been used in task-oriented dialog systems (Zhao et al., 2017). These architectures have better language modeling ability, but they do not work well for KB retrieval. Even with sophisticated attention models (Luong et al., 2015; Bahdanau et al., 2015), Seq2Seq fails to map the correct entities into the generated output. To alleviate this problem, copy-augmented Seq2Seq models (Eric and Manning, 2017) were used. These models outperform utterance-selection methods by copying relevant information directly from the KBs. Copy mechanisms have also been used in question answering tasks (Dehghani et al., 2017; He et al., 2017), neural machine translation (Gulcehre et al., 2016; Gu et al., 2016), language modeling (Merity et al., 2017), and summarization (See et al., 2017).

Less related to dialog systems, but related to our work, are memory-based decoders and non-recurrent generative models: 1) The query generation phase that Mem2Seq uses to access its memories can be seen as the memory controller used in Memory Augmented Neural Networks (MANNs) (Graves et al., 2014, 2016). Similarly, memory encoders have been used in neural machine translation (Wang et al., 2016) and meta-learning applications (Kaiser et al., 2017). However, Mem2Seq differs from these models in that it uses multi-hop attention in combination with a copy mechanism, whereas the other models use a single matrix representation.


2) Non-recurrent generative models (Vaswani et al., 2017), which rely only on self-attention mechanisms, are related to the multi-hop attention mechanism used in MemNNs.

7 Conclusion

In this work, we presented Mem2Seq, an end-to-end trainable memory-to-sequence model for task-oriented dialog systems. Mem2Seq combines the multi-hop attention mechanism of end-to-end memory networks with the idea of pointer networks to incorporate external information. We empirically showed our model's ability to produce relevant answers using both external KB information and a predefined vocabulary, and visualized how the multi-hop attention mechanisms help in learning correlations between memories. Mem2Seq is fast, general, and able to achieve state-of-the-art results on three different datasets.

Acknowledgments

This work is partially funded by ITS/319/16FP of the Innovation Technology Commission, HKUST 16214415 & 16248016 of the Hong Kong Research Grants Council, and RDC 1718050-0 of EMOS.AI.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. International Conference on Learning Representations.

Antoine Bordes and Jason Weston. 2017. Learning end-to-end goal-oriented dialog. International Conference on Learning Representations.

Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. NIPS Deep Learning and Representation Learning Workshop.

Mostafa Dehghani, Sascha Rothe, Enrique Alfonseca, and Pascal Fleury. 2017. Learning to attend, copy, and generate for session-based query suggestion. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM '17, pages 1747–1756, New York, NY, USA. ACM.

Mihail Eric, Lakshmi Krishnan, Francois Charette, and Christopher D. Manning. 2017. Key-value retrieval networks for task-oriented dialogue. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 37–49. Association for Computational Linguistics.

Mihail Eric and Christopher Manning. 2017. A copy-augmented sequence-to-sequence architecture gives good performance on task-oriented dialogue. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 468–473, Valencia, Spain. Association for Computational Linguistics.

Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural Turing machines. CoRR.

Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwinska, Sergio Gomez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, et al. 2016. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471–476.

Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1631–1640, Berlin, Germany. Association for Computational Linguistics.

Caglar Gulcehre, Sungjin Ahn, Ramesh Nallapati, Bowen Zhou, and Yoshua Bengio. 2016. Pointing the unknown words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 140–149, Berlin, Germany. Association for Computational Linguistics.

Shizhu He, Cao Liu, Kang Liu, and Jun Zhao. 2017. Generating natural answers by incorporating copying and retrieving mechanisms in sequence-to-sequence learning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 199–208, Vancouver, Canada. Association for Computational Linguistics.

Matthew Henderson, Blaise Thomson, and Jason D Williams. 2014. The second dialog state tracking challenge. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pages 263–272.

Chiori Hori, Kiyonori Ohtake, Teruhisa Misu, Hideki Kashioka, and Satoshi Nakamura. 2009. Statistical dialog management applied to WFST-based dialog systems. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2009), pages 4793–4796. IEEE.

Lukasz Kaiser, Ofir Nachum, Aurko Roy, and Samy Bengio. 2017. Learning to remember rare events. International Conference on Learning Representations.

Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. International Conference on Learning Representations.

Cheongjae Lee, Sangkeun Jung, Seokhwan Kim, and Gary Geunbae Lee. 2009. Example-based dialog modeling for practical multi-domain dialog system. Speech Communication, 51(5):466–484.

Esther Levin, Roberto Pieraccini, and Wieland Eckert. 2000. A stochastic model of human-machine interaction for learning dialog strategies. IEEE Transactions on Speech and Audio Processing, 8(1):11–23.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 110–119, San Diego, California. Association for Computational Linguistics.

Chia-Wei Liu, Ryan Lowe, Iulian Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2122–2132, Austin, Texas. Association for Computational Linguistics.

Fei Liu and Julien Perez. 2017. Gated end-to-end memory networks. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1–10, Valencia, Spain. Association for Computational Linguistics.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal. Association for Computational Linguistics.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2017. Pointer sentinel mixture models. International Conference on Learning Representations.

Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. 2016. Key-value memory networks for directly reading documents. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1400–1409, Austin, Texas. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2016. Sequence level training with recurrent neural networks. International Conference on Learning Representations.

Alan Ritter, Colin Cherry, and William B. Dolan. 2011. Data-driven response generation in social media. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 583–593, Edinburgh, Scotland, UK. Association for Computational Linguistics.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada. Association for Computational Linguistics.

Minjoon Seo, Sewon Min, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Query-reduction networks for question answering. International Conference on Learning Representations.

Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C Courville, and Joelle Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI, pages 3776–3784.

Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron C Courville, and Yoshua Bengio. 2017. A hierarchical latent variable encoder-decoder model for generating dialogues. In AAAI, pages 3295–3301.

Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. 2015. End-to-end memory networks. In Advances in Neural Information Processing Systems, pages 2440–2448.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010.

Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In Advances in Neural Information Processing Systems 28, pages 2692–2700. Curran Associates, Inc.

Mingxuan Wang, Zhengdong Lu, Hang Li, and Qun Liu. 2016. Memory-enhanced decoder for neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 278–286, Austin, Texas. Association for Computational Linguistics.

Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Lina Maria Rojas-Barahona, Pei-Hao Su, Stefan Ultes, David Vandyke, and Steve J. Young. 2017. A network-based end-to-end trainable task-oriented dialogue system. In EACL.

Jason D Williams, Kavosh Asadi, and Geoffrey Zweig. 2017. Hybrid code networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 665–677, Vancouver, Canada. Association for Computational Linguistics.

Jason D Williams and Steve Young. 2007. Partially observable Markov decision processes for spoken dialog systems. Computer Speech & Language, 21(2):393–422.

Sam Wiseman and Alexander M. Rush. 2016. Sequence-to-sequence learning as beam-search optimization. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1296–1306, Austin, Texas. Association for Computational Linguistics.

Chien-Sheng Wu, Andrea Madotto, Genta Winata, and Pascale Fung. 2017. End-to-end recurrent entity network for entity-value independent goal-oriented dialog learning. In Dialog System Technology Challenges Workshop, DSTC6.

Chien-Sheng Wu, Andrea Madotto, Genta Winata, and Pascale Fung. 2018. End-to-end dynamic query memory network for entity-value independent task-oriented dialog. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

Steve Young, Milica Gasic, Blaise Thomson, and Jason D Williams. 2013. POMDP-based statistical spoken dialog systems: A review. Proceedings of the IEEE, 101(5):1160–1179.

Tiancheng Zhao, Allen Lu, Kyusong Lee, and Maxine Eskenazi. 2017. Generative encoder-decoder models for task-oriented spoken dialog systems with chatting capability. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 27–36. Association for Computational Linguistics.


8 Tables

8.1 Time per Epoch

| | T1 | T2 | T3 | T4 | T5 | DSTC2 | In-Car |
|---|---|---|---|---|---|---|---|
| Seq2Seq | 0.7 | 1.22 | 0.85 | 0.15 | 1.57 | 9.36 | 2.00 |
| +Attn | 1.18 | 2.38 | 2.2 | 0.32 | 6.04 | 22.07 | 6.18 |
| Ptr-Unk | 1.28 | 2.5 | 2.21 | 0.38 | 6.10 | 20.39 | 6.48 |
| Mem2Seq H1 | 0.73 | 1.10 | 0.55 | 0.15 | 1.05 | 1.49 | 0.68 |
| Mem2Seq H3 | 0.97 | 1.48 | 0.77 | 0.21 | 1.33 | 2.36 | 1.02 |
| Mem2Seq H6 | 1.23 | 2.14 | 1.11 | 0.33 | 2.20 | 4.22 | 1.43 |

Table 7: Minutes per epoch. Mem2Seq is faster than the others, especially for longer inputs.

8.2 Hyper-parameters

| | T1 | T2 | T3 | T4 | T5 | DSTC2 | In-Car | lr |
|---|---|---|---|---|---|---|---|---|
| Seq2Seq | 256 (0.1) | 256 (0.1) | 128 (0.1) | 256 (0.1) | 256 (0.1) | - | 512 (0.3) | 0.0001 |
| +Attn | 256 (0.1) | 256 (0.1) | 256 (0.1) | 256 (0.1) | 128 (0.1) | - | 512 (0.0) | 0.0001 |
| Ptr-Unk | 256 (0.1) | 256 (0.1) | 256 (0.1) | 256 (0.1) | 128 (0.1) | - | 512 (0.0) | 0.0001 |
| Mem2Seq H1 | 256 (0.3) | 256 (0.2) | 128 (0.2) | 256 (0.2) | 256 (0.1) | 256 (0.1) | 256 (0.2) | 0.001 |
| Mem2Seq H3 | 256 (0.1) | 256 (0.3) | 128 (0.2) | 256 (0.3) | 128 (0.1) | 128 (0.2) | 256 (0.1) | 0.001 |
| Mem2Seq H6 | 256 (0.2) | 128 (0.3) | 256 (0.1) | 128 (0.3) | 256 (0.2) | 256 (0.1) | 256 (0.2) | 0.001 |

Table 8: Selected hyper-parameters for each dataset and number of hops. Each value is the embedding dimension (which also equals the GRU hidden size); the value in parentheses is the dropout rate. All models used a learning rate decay of 0.5.


9 Visualization

9.1 Multiple Hops Attention

[Figure 5 heatmap: attention over the dialog history and KB memory contents of a bAbI task 3 sample, shown for hops 1-3 at each of 8 generation steps.]

CORRECT: what do you think of this option: resto bombay cheap italian 8stars

Figure 5: 3-hop memory attention visualization from bAbI dialog task 3. At generation step 8, the attention at hop 1 is shallow and divided almost equally over the 5 possible choices, and becomes very sharp at hop 3.


[Figure 6 omitted in text extraction. Same layout as Figure 5: hop-1/2/3 attention over the dialog history, sentinel, and KB triples for the "resto paris expensive british" restaurants. The correct response is "what do you think of this option: resto paris expensive british 4stars", but the model generated "resto paris expensive british 5stars".]

Figure 6: 3-hop memory attention visualization from bAbI dialog task 3. The model tends to make mistakes when the attention at the last hop is not sharp.
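As a rough illustration of how such hop-wise plots can be produced, here is a hedged matplotlib sketch. The helper name and array shapes are our assumptions, not the authors' plotting code:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical helper: attn[h, i, t] is the hop-h attention weight on memory
# slot i at generation step t; names and shapes are our assumptions.
def plot_hop_attention(attn, memory_tokens, generated_tokens):
    n_hops = attn.shape[0]
    fig, axes = plt.subplots(1, n_hops, sharey=True, figsize=(3 * n_hops, 10))
    axes = np.atleast_1d(axes)
    for h, ax in enumerate(axes):
        # One heatmap per hop: memory slots on the y-axis, steps on the x-axis.
        ax.imshow(attn[h], aspect="auto", cmap="Blues", vmin=0.0, vmax=1.0)
        ax.set_title(f"hop {h + 1}")
        ax.set_xticks(range(len(generated_tokens)))
        ax.set_xticklabels(generated_tokens, rotation=90)
        ax.set_xlabel("Generation Step")
    axes[0].set_yticks(range(len(memory_tokens)))
    axes[0].set_yticklabels(memory_tokens)
    axes[0].set_ylabel("Memory Content")
    fig.tight_layout()
    return fig
```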


9.2 Last Hop Attention

[Figure 7 omitted in text extraction. It shows a heatmap (color scale 0.0 to 1.0) of the last-hop attention over the memory contents (y-axis: the dialog history tokens plus a sentinel) at each generation step (x-axis). COR: "api call italian london eight expensive"; GEN: "api call italian london eight expensive".]

Figure 7: Last-hop memory attention visualization from the bAbI dataset. COR and GEN on the top are the correct response and our generated one.


[Figure 8 omitted in text extraction. It shows a heatmap (color scale 0.0 to 0.8) of the last-hop attention over the memory contents (y-axis: the dialog history plus In-Car KB triples such as "3 miles distance chevron") at each generation step (x-axis). COR: "the closest parking garage is civic center garage located 4 miles away at 270 altaire walk"; GEN: "the closest parking garage is civic center garage at 270 altaire walk 4 miles away through the directions".]

Figure 8: Last-hop memory attention visualization from the In-Car dataset. COR and GEN on the top are the correct response and our generated one.
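These last-hop plots directly visualize the pointer distribution used at decoding time: when the sentinel receives the highest probability the decoder generates from the vocabulary distribution, otherwise it copies the word in the pointed-to memory slot. A minimal sketch of that decoding rule follows; the tensor and variable names are our notation, not the released implementation:

```python
import torch

def decode_token(p_ptr, p_vocab, memory_words, sentinel_idx, vocab):
    """p_ptr: last-hop attention over memory slots (including the sentinel);
    p_vocab: distribution over the output vocabulary. Names are illustrative."""
    top = int(torch.argmax(p_ptr))
    if top == sentinel_idx:
        # Sentinel chosen: fall back to generating from the vocabulary.
        return vocab[int(torch.argmax(p_vocab))]
    # Otherwise copy the word stored in the pointed-to memory slot.
    return memory_words[top]
```

A sharp peak on a memory slot, as in Figure 8, thus copies that slot's word into the response.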


10 In-Car Assistant Dataset Examples

Table 9: Example of generated responses for the In-Car Assistant on the weather domain. Each KB cell gives the forecast condition with its low-high temperature range.

location      | monday           | tuesday             | wednesday         | thursday            | friday               | saturday             | sunday
grand rapids  | hot, 50-70F      | raining, 60-80F     | rain, 20-30F      | clear skies, 60-70F | warm, 70-90F         | foggy, 50-60F        | overcast, 50-60F
new york      | misty, 30-50F    | snow, 70-80F        | cloudy, 20-30F    | rain, 80-100F       | rain, 40-60F         | cloudy, 30-50F       | snow, 20-40F
boston        | hail, 90-100F    | overcast, 60-70F    | rain, 50-60F      | dew, 20-30F         | cloudy, 90-100F      | overcast, 50-70F     | overcast, 80-90F
durham        | hot, 90-100F     | dry, 60-80F         | misty, 60-80F     | misty, 90-100F      | hot, 70-90F          | hail, 30-40F         | rain, 60-80F
san francisco | rain, 60-70F     | cloudy, 30-40F      | overcast, 90-100F | stormy, 90-100F     | cloudy, 90-100F      | clear skies, 90-100F | foggy, 30-50F
carson        | raining, 70-80F  | humid, 90-100F      | frost, 40-60F     | dry, 50-60F         | foggy, 80-100F       | overcast, 50-70F     | overcast, 60-70F
san jose      | blizzard, 40-50F | snow, 90-100F       | overcast, 60-80F  | rain, 30-50F        | rain, 80-100F        | dry, 50-60F          | dry, 90-100F

DRIVER   Is it gonna snow in Durham during the week?
Seq2Seq  there is no snow in the forecast for this week
+Attn    there is no snow in the forecast for durham for durham
Ptr-Unk  yes there is no snow in durham on week
Mem2Seq  no snow, the forecast does not predict any snow on Durham
GOLD     Durham does not have any snow predicted this week

Table 10: Example of generated answers for the In-Car dataset on the weather domain. Each KB cell gives the forecast condition with its low-high temperature range; the column order reflects the original KB.

monday              | tuesday             | friday              | wednesday           | thursday         | sunday               | location      | saturday
windy, 40-50F       | snow, 40-50F        | frost, 30-40F       | hail, 70-80F        | raining, 80-90F  | foggy, 40-50F        | san francisco | snow, 40-50F
rain, 30-40F        | foggy, 60-70F       | stormy, 50-60F      | snow, 80-90F        | drizzle, 50-60F  | clear skies, 90-100F | manhattan     | dry, 70-90F
cloudy, 20-40F      | clear skies, 70-90F | clear skies, 60-70F | dry, 40-60F         | stormy, 40-50F   | hail, 40-60F         | seattle       | windy, 60-70F
warm, 90-100F       | windy, 50-60F       | blizzard, 20-30F    | rain, 60-70F        | overcast, 80-90F | blizzard, 20-40F     | oakland       | clear skies, 50-70F
snow, 60-80F        | clear skies, 30-50F | clear skies, 30-40F | dry, 30-50F         | raining, 50-60F  | windy, 30-50F        | exeter        | overcast, 60-70F
windy, 50-60F       | drizzle, 80-100F    | windy, 80-100F      | clear skies, 50-70F | frost, 40-50F    | drizzle, 40-60F      | menlo park    | foggy, 50-70F
clear skies, 60-70F | rain, 90-100F       | foggy, 20-40F       | hail, 20-30F        | dry, 20-30F      | cloudy, 20-40F       | carson        | windy, 80-100F

DRIVER   Is it going to snow today?
CAR      What city do you wanna know if it will snow today?
DRIVER   Please check for Oakland
GOLD     Today is gonna be warm in Oakland
Seq2Seq  It is not snow in today in
+Attn    No there will be snow in Oakland today
Ptr-Unk  Today is gonna be PAD in
Mem2Seq  it will not snow in oakland today


Table 11: Example of generated responses for the In-Car Assistant on the navigation domain.

poi                   | distance | traffic info     | poi type            | address
Stanford Express Care | 3 miles  | no traffic       | hospital            | 214 El Camino Real
home                  | 4 miles  | heavy traffic    | home                | 5671 barringer street
Whole Foods           | 2 miles  | heavy traffic    | grocery store       | 819 Alma St
P.F. Changs           | 6 miles  | moderate traffic | chinese restaurant  | 669 El Camino Real
Jing Jing             | 5 miles  | heavy traffic    | chinese restaurant  | 113 Arbol Dr
Pizza My Heart        | 6 miles  | moderate traffic | pizza restaurant    | 528 Anton Ct
Cafe Venetia          | 4 miles  | no traffic       | coffee or tea place | 269 Alger Dr

DRIVER   Find me a tea house
Seq2Seq  We're 2 miles away from from and there is a road block nearby
+Attn    We're 4 miles away from Cafe Venetia and it's at 269 Alger drive
Ptr-Unk  We're 6 miles away from whole foods and you like the address on screen
Mem2Seq  We're 4 miles away from Cafe Venetia and there is no traffic
GOLD     There is Cafe Venetia that's 4 miles away at 269 Alger dr
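Each KB row above is loaded into the model's memory as (subject, relation, object) triples keyed on the point of interest, as the axis labels of Figure 8 suggest. A hedged sketch of that flattening, with field names of our choosing:

```python
# Hypothetical sketch: flatten one navigation KB row into memory triples.
row = {
    "poi": "cafe venetia",
    "distance": "4 miles",
    "traffic info": "no traffic",
    "poi type": "coffee or tea place",
    "address": "269 alger dr",
}

# One (subject, relation, object) triple per non-key field.
triples = [(row["poi"], rel, val) for rel, val in row.items() if rel != "poi"]
# e.g. ("cafe venetia", "distance", "4 miles")
```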

Table 12: Example of generated responses for the In-Car Assistant on the navigation domain.

distance | traffic info     | poi type            | address          | poi
4 miles  | heavy traffic    | rest stop           | 578 Arbol Dr     | Hotel Keen
1 miles  | moderate traffic | hospital            | 899 Ames Ct      | Stanford Childrens Health
2 miles  | no traffic       | grocery store       | 452 Arcadia Pl   | Safeway
1 miles  | moderate traffic | coffee or tea place | 436 Alger Dr     | Palo Alto Cafe
1 miles  | heavy traffic    | grocery store       | 819 Alma St      | Whole Foods
6 miles  | moderate traffic | coffee or tea place | 394 Van Ness Ave | Coupa
6 miles  | heavy traffic    | coffee or tea place | 145 Amherst St   | Teavana

DRIVER   Take me to a place where i can get some hot tea
Seq2Seq  There is a away
+Attn    There is a Pizza Chicago and Teavana
Ptr-Unk  I have a PAD away away
Mem2Seq  We're 6 miles away from Palo Alto Cafe and 6 miles away from Coupa, which one do you prefer?
GOLD     Coupa is nearby and serves tea


Table 13: Example of generated responses for the In-Car Assistant on the navigation domain.

distance | traffic info      | poi type           | address            | poi
3 miles  | no traffic        | chinese restaurant | 593 Arrowhead Way  | Chef Chu's
1 miles  | no traffic        | chinese restaurant | 669 El Camino Real | P.F. Changs
5 miles  | road block nearby | gas station        | 200 Alester Ave    | Valero
3 miles  | no traffic        | pizza restaurant   | 528 Anton Ct       | Pizza My Heart
4 miles  | heavy traffic     | rest stop          | 753 University Ave | Comfort Inn
5 miles  | heavy traffic     | chinese restaurant | 842 Arrowhead Way  | Panda Express
2 miles  | heavy traffic     | pizza restaurant   | 704 El Camino Real | Pizza Hut
6 miles  | no traffic        | friends house      | 864 Almanor Ln     | jacks house

DRIVER   Find me the nearest gas station.
Seq2Seq  the nearest parking garage is from at away
+Attn    The is one is away away
Ptr-Unk  PAD nearest gas station is is is
Mem2Seq  The nearest gas station is gas station which is you you
GOLD     The nearest gas station is located 5 miles away. Need more info?

Table 14: Example of generated responses for the In-Car Assistant on the navigation domain.

poi                      | distance | traffic info     | poi type           | address
The Westin               | 4 miles  | no traffic       | rest stop          | 329 El Camino Real
Palo Alto Garage R       | 2 miles  | moderate traffic | parking garage     | 481 Amaranta Ave
Mandarin Roots           | 4 miles  | moderate traffic | chinese restaurant | 271 Springer Street
Sigona Farmers Market    | 4 miles  | no traffic       | grocery store      | 638 Amherst St
Whole Foods              | 2 miles  | heavy traffic    | grocery store      | 819 Alma St
Stanford Shopping Center | 2 miles  | moderate traffic | shopping center    | 773 Alger Dr
toms house               | 1 miles  | no traffic       | friends house      | 580 Van Ness Ave

DRIVER   Give me directions to the closest grocery store
CAR      There are Whole Foods 2 miles away and Sigona Farmers Market 4 miles away. Where do we go?
DRIVER   I need the one that avoids all heavy traffic
Seq2Seq  he is is is is away at
+Attn    Chef Chu s is located at 593 Arrowhead way
Ptr-Unk  PAD is the is PAD the to
Mem2Seq  Sigona Farmers Market is the only option i is 4 miles away with no traffic, setting navigation now
GOLD     Sigona Farmers Market is located 4 miles away with no traffic. It's located at 638 Amherst St.


Table 15: Example of generated responses for the In-Car Assistant on the scheduling domain.

DRIVER   Set me a reminder for football activity at 1pm this month on the 11th with marie
Seq2Seq  Okay I am scheduling a reminder with your for at on with at
+Attn    Okay I am scheduling a football activity with on the 15th at 1pm this month on the 15th
Ptr-Unk  Okay setting PAD reminder reminder football activity football activity for this this on the 11th month 1pm
Mem2Seq  Setting a reminder for football activity with marie on the 11th at 1pm
GOLD     Reminder for football on the 11th of this month at 1pm with marie is set

Table 16: Example of generated responses for the In-Car Assistant on the scheduling domain.

room               | agenda             | time | date      | party               | event
conference room 50 | discuss the merger | 1pm  | Wednesday | infrastructure team | conference
-                  | -                  | 2pm  | Wednesday | -                   | yoga activity
-                  | -                  | 2pm  | Wednesday | -                   | doctor appointment
conference room 50 | go over budget     | 7pm  | Monday    | infrastructure team | meeting
-                  | -                  | 5pm  | Monday    | -                   | dinner
-                  | -                  | 10am | Thursday  | -                   | medicine

DRIVER   Tell me when is my next meeting and who are invited in the meeting and what are the agenda of the next meeting
Seq2Seq  The meeting is the meeting in conference room 100 at 1pm
+Attn    Your next meeting is on wednesday at 1pm
Ptr-Unk  your PAD meeting PAD the in in the to to the the the the
Mem2Seq  Your next meeting is on Monday at 7pm hr is to hr in attendance to go over the budget
GOLD     Your next meeting is on Monday at 7 pm with the infrastructure team to go over budget