
End-to-End Reinforcement Learning of Dialogue Agents for Information Access

Bhuwan Dhingra⋆∗  Lihong Li†  Xiujun Li†  Jianfeng Gao†  Yun-Nung Chen‡∗  Faisal Ahmed†  Li Deng†

⋆School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
†Microsoft Research, Redmond, WA, USA
‡National Taiwan University, Taipei, Taiwan

⋆[email protected]  †{lihongli,xiul,jfgao,vivic,fiahmed,deng}@microsoft.com

Abstract

This paper proposes KB-InfoBot—a dialogue agent that provides users with an entity from a knowledge base (KB) by interactively asking for its attributes. All components of the KB-InfoBot are trained in an end-to-end fashion using reinforcement learning. Goal-oriented dialogue systems typically need to interact with an external database to access real-world knowledge (e.g., movies playing in a city). Previous systems achieved this by issuing a symbolic query to the database and adding retrieved results to the dialogue state. However, such symbolic operations break the differentiability of the system and prevent end-to-end training of neural dialogue agents. In this paper, we address this limitation by replacing symbolic queries with an induced “soft” posterior distribution over the KB that indicates which entities the user is interested in. We also provide a modified version of the episodic REINFORCE algorithm, which allows the KB-InfoBot to explore and learn both the policy for selecting dialogue acts and the posterior over the KB for retrieving the correct entities. Experimental results show that the end-to-end trained KB-InfoBot outperforms competitive rule-based baselines, as well as agents which are not end-to-end trainable.

1 Introduction

Goal-oriented dialogue systems help users complete specific tasks, such as booking a flight or searching a database, by interacting with them via natural language.

∗Work completed while BD and YNC were with Microsoft.

Figure 1: A dialogue example between a user looking for a movie and the KB-InfoBot. The knowledge base, shown above the KB-InfoBot, consists of (head, relation, tail) triples such as (Groundhog Day, actor, Bill Murray), (Groundhog Day, release year, 1993), (Australia, actor, Nicole Kidman), and (Mad Max: Fury Road, release year, 2015). The user goal is Movie=?, Actor=Bill Murray, Release Year=1993; the agent asks “When was it released?”, the user replies “I think it came out in 1993.”, and the agent responds “Groundhog Day is a Bill Murray movie which came out in 1993.”

In this work, we present KB-InfoBot, a dialogue agent that identifies entities of interest to the user from a knowledge base (KB), by interactively asking for attributes of that entity which helps constrain the search. Such an agent finds application in interactive search settings. Figure 1 shows a dialogue example between a user searching for a movie and the proposed KB-InfoBot.

A typical goal-oriented dialogue system consists of four basic components: a language understanding (LU) module for identifying user intents and extracting associated slots (Yao et al., 2014; Hakkani-Tur et al., 2016; Chen et al., 2016), a dialogue state tracker which tracks the user goal and dialogue history (Henderson et al., 2014; Henderson, 2015), a dialogue policy which selects the next system action based on the current state (Young et al., 2013), and a natural language generator (NLG) for converting dialogue acts into natural language (Wen et al., 2015; Wen et al., 2016a). For successful completion of user goals, it is also necessary to equip the dialogue policy with real-world knowledge from a database. Previous end-to-end systems achieved this by constructing a symbolic query from the current belief states of the agent and retrieving results from the database which match the query (Wen et al., 2016b; Williams and Zweig, 2016; Zhao and Eskenazi, 2016). Unfortunately, such operations make the model non-differentiable, and various components in a dialogue system are usually trained separately.

In our work, we replace SQL-like queries with a probabilistic framework for inducing a posterior distribution of the user target over KB entities. We build this distribution from the belief tracker multinomials over attribute-values and binomial probabilities of the user not knowing the value of an attribute. The policy network receives as input this full distribution to select its next action. In addition to making the model end-to-end trainable, this operation also provides a principled framework to propagate the uncertainty inherent in language understanding to the dialogue policy, making the agent robust to LU errors.

Our entire model is differentiable, which means that in theory our system can be trained completely end-to-end using only a reinforcement signal from the user that indicates whether a dialogue is successful or not. However, in practice, we find that with random initialization the agent is unable to see any rewards if the database is large; even when it does, credit assignment is tough. Hence, at the beginning of training, we first have an imitation-learning phase (Argall et al., 2009) where both the belief tracker and policy network are trained to mimic a rule-based agent. Then, on switching to reinforcement learning, the agent is able to improve further and increase its average reward. Such a bootstrapping approach has been shown effective when applying reinforcement learning to solve hard problems, especially those with long decision horizons (Silver et al., 2016).

Our key contributions are three-fold. First, we present a probabilistic framework for inducing a posterior distribution over the entities in a knowledge base. Second, we use the above framework to develop, to our knowledge, the first fully end-to-end differentiable model of a multi-turn information providing dialogue agent (KB-InfoBot), whose parameters can be tuned using standard gradient descent methods. Third, we present a modified version of the episodic REINFORCE (Williams, 1992) update rule for training the above model based on user feedback, which allows the agent to explore both the set of possible dialogue acts at each turn and the set of possible entity results from the KB at the final turn.

2 Related Work

Statistical goal-oriented dialogue systems have long been modeled as partially observable Markov decision processes (POMDPs) (Young et al., 2013), and are trained using reinforcement learning based on user feedback. Recently, there has been growing interest in designing “end-to-end” systems, which combine feature extraction and policy optimization using deep neural networks, with the aim of eliminating the need of hand-crafted representations. We discuss these works below and highlight their methods of interfacing with the external database.

Cuayahuitl (2016) proposed SimpleDS, which uses a multi-layer feed-forward network to directly map environment states to agent actions. The network is trained using Q-learning and a simulated user; however, it does not interact with a structured database, leaving that task to a server, which may be suboptimal as we show in our experiments below.

Wen et al. (2016b) introduced a modular dialogue agent, which consists of several neural-network-based components trained using supervised learning. One key component is the database operator, which forms a query

$$q^t = \bigcup_{s' \in S_I} \arg\max_v p^t_{s'},$$

where $p^t_{s'}$ are distributions over the possible values of each slot and are output from the belief tracker. The query is issued to the database, which returns a list of the matched entries. We refer to this operation henceforth as a Hard-KB lookup. Hard-KB lookup breaks the differentiability of the whole system, and as a result training of various components of the dialogue system needs to be performed separately. The intent network and belief trackers are trained using supervised labels specifically collected for them; while the policy network and generation network are trained separately on the system utterances. In this paper, we retain modularity of the network by keeping the belief trackers separate, but replace the query with a differentiable lookup over the database which computes a posterior distribution denoting the probability that the user is looking for a particular entry.

An alternative way for the dialogue agent to interface with the database is by augmenting its action space with predefined API calls (Williams and Zweig, 2016; Zhao and Eskenazi, 2016; Bordes and Weston, 2016). The API calls modify a query hypothesis maintained outside the end-to-end system which is used to retrieve results from this KB. These results are then appended to the next system input, based on which the agent selects its next action. The resulting model is end-to-end differentiable, albeit with the database falling out of it. This framework does not deal with uncertainty in language understanding since the query hypothesis can only hold one slot-value at a time. Our approach, on the other hand, directly models the uncertainty to come up with a posterior distribution over entities in the knowledge base.

Wu et al. (2015) recently presented an entropy minimization dialogue management (EMDM) strategy for KB-InfoBots. The agent always asks for the value of the slot with maximum entropy over the remaining entries in the database. This approach is optimal in the absence of LU errors, but suffers from error propagation in their presence. This rule-based policy serves as a baseline to compare to our proposed approach.

Our work is motivated by the neural GenQA (Yin et al., 2016a) and neural enquirer (Yin et al., 2016b) models for querying KBs and tables via natural language in a fully “neuralized” way. These works handle single-turn dialogues and are trained using supervised learning, while our model is designed for multi-turn dialogues and trained using reinforcement learning. Moreover, instead of defining an attention distribution directly over the KB entities, which could be very large, we instead induce it from the smaller distributions over each relation (or slot in dialogue terminology) in the KB. A separate line of work—TensorLog (Cohen, 2016)—investigates reasoning over chains of KB facts in a differentiable manner to retrieve new facts. Instead, our focus is on retrieving entities present in the KB given some of the facts they participate in.

Figure 2: An entity-centric knowledge base, where head entities are movies. Top: conventional (h, r, t) format—(Groundhog Day, actor, Bill Murray), (Groundhog Day, release year, 1993), (Australia, actor, Nicole Kidman), (Mad Max: Fury Road, release year, 2015). Bottom: the same KB in table format, with 3 rows (unique head entities: Groundhog Day, Australia, Mad Max: Fury Road), 2 columns (unique relation types: Actor, Release Year), and missing values denoted by X.

Reinforcement Learning Neural Turing Machines (RL-NTM), introduced by Zaremba and Sutskever (2015), also allow neural controllers to interact with discrete external interfaces. The particular form of interface considered in that work is a one-dimensional memory tape along which a read head can move. Our work is in a similar vein, but assumes a different interface—an entity-centric KB. We exploit the structure of such KBs to provide differentiable access to the KB-InfoBot agent for making decisions.

Li et al. (2016) recently applied deep reinforcement learning successfully to train non-goal-oriented chatbot-type dialogue agents. They show that reinforcement learning allows the agent to model long-term rewards and generate more diverse and coherent responses as compared to supervised learning. Chatbot systems, however, typically do not need to interface with an external database, which is the primary focus of this paper.

3 Probabilistic Framework for KB Lookup

In this section we describe a probabilistic framework for querying a KB given the agent’s beliefs over the slots or attributes in the KB.

3.1 Entity-Centric Knowledge Base (EC-KB)

A Knowledge Base consists of triples of the form (h, r, t), which denotes that relation r holds between the head h and tail t. In this work we assume that the KB-InfoBot has access to a domain-specific entity-centric knowledge base (EC-KB) (Zwicklbauer et al., 2013) where all head entities are of a particular type, and the relations correspond to attributes of these head entities. Examples of the type include movies, persons, or academic papers. Such a KB can be converted to a table format whose rows correspond to the unique head entities, columns correspond to the unique relation types (slots henceforth), and some of the entries may be missing. A small example is shown in Figure 2.
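To make the conversion concrete, the following is a minimal sketch (plain Python) of turning (head, relation, tail) triples into the table format of Figure 2, with a placeholder token for missing entries. The example triples and the `MISSING` marker are illustrative rather than taken from the paper's actual Movies-KB.

```python
MISSING = "X"  # placeholder for missing entries (Psi in the paper's notation)

triples = [
    ("Groundhog Day", "actor", "Bill Murray"),
    ("Groundhog Day", "release year", "1993"),
    ("Australia", "actor", "Nicole Kidman"),
    ("Mad Max: Fury Road", "release year", "2015"),
]

def triples_to_table(triples):
    """Rows = unique head entities, columns = unique relation types (slots)."""
    heads = sorted({h for h, _, _ in triples})
    slots = sorted({r for _, r, _ in triples})
    table = {h: {r: MISSING for r in slots} for h in heads}
    for h, r, t in triples:
        table[h][r] = t
    return heads, slots, table

heads, slots, table = triples_to_table(triples)
# table["Australia"]["release year"] == "X"   -> a missing value
```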

3.2 Notations and Assumptions

Let $T$ denote the KB table described above and $T_{i,j}$ denote the $j$-th slot-value of the $i$-th entity, where $1 \le i \le N$ and $1 \le j \le M$. We let $V^j$ denote the vocabulary of each slot, i.e. the set of all distinct values in the $j$-th column. We denote missing values from the table with a special token and write $T_{i,j} = \Psi$. $M_j = \{i : T_{i,j} = \Psi\}$ denotes the set of entities for which the value of slot $j$ is missing. Note that the user may still know the actual value of $T_{i,j}$, and we assume this lies in $V^j$. Hence, we do not deal with OOV entities or relations at test time.

The user goal is sampled uniformly $G \sim \mathcal{U}[\{1, \ldots, N\}]$ and points to a particular row in the table $T$. To make the problem realistic we also sample binary random variables $\Phi_j \in \{0, 1\}$ to indicate whether the user knows the value of slot $j$ or not. The agent maintains $M$ multinomial distributions for its belief over user goals given user utterances $U^t_1$ till turn $t$. A slot distribution $p^t_j(v)$ for $v \in V^j$ is the probability at turn $t$ that the user constraint for slot $j$ is $v$. The agent also maintains $M$ binomials $q^t_j = \Pr(\Phi_j = 1)$ which denote the probability that the user knows the value of slot $j$.

We also assume that column values are distributed independently of each other. This is a strong assumption, but it allows us to model the user goal for each slot independently, as opposed to modeling the user goal over KB entities directly. Typically $\max_j |V^j| < N$, and hence this assumption reduces the number of parameters in the belief tracker.

3.3 Soft-KB Lookup

Let $p^t_{\mathcal{T}}(i) = \Pr(G = i \mid U^t_1)$ be the posterior probability that the user is interested in row $i$ of the table, given the utterances up to turn $t$. We assume all probabilities are conditioned on user inputs $U^t_1$ and drop it from the notation below. From our assumption of independence of slot values:

$$p^t_{\mathcal{T}}(i) \propto \prod_{j=1}^{M} \Pr(G_j = i), \qquad (1)$$

where $\Pr(G_j = i)$ denotes the posterior probability of the user goal for slot $j$ pointing to $T_{i,j}$. We can marginalize this over $\Phi_j$ to get:

$$\Pr(G_j = i) = \sum_{\phi=0}^{1} \Pr(G_j = i, \Phi_j = \phi) = q^t_j \Pr(G_j = i \mid \Phi_j = 1) + (1 - q^t_j) \Pr(G_j = i \mid \Phi_j = 0). \qquad (2)$$

For $\Phi_j = 0$, the user does not know the value of the slot, hence we assume a uniform prior over the rows of $T$:

$$\Pr(G_j = i \mid \Phi_j = 0) = \frac{1}{N}, \quad 1 \le i \le N. \qquad (3)$$

For $\Phi_j = 1$, the user knows the value of slot $j$, but this may be missing from $T$, and we again have two cases:

$$\Pr(G_j = i \mid \Phi_j = 1) = \begin{cases} \dfrac{1}{N}, & i \in M_j \\[6pt] \dfrac{p^t_j(v)}{N_j(v)}\left(1 - \dfrac{|M_j|}{N}\right), & i \notin M_j \end{cases} \qquad (4)$$

Here, $p^t_j(v)$ is the slot distribution from the belief tracker, and $N_j(v)$ is the count of value $v$ in slot $j$. Detailed derivation for (4) is provided in the appendix. Combining (1), (2), (3), and (4) gives us the procedure for computing the posterior over KB entities.
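The computation above is simple enough to express directly in code. Below is a minimal numpy sketch of the Soft-KB lookup, assuming `p[j]` is the belief tracker's multinomial over the values $V^j$ of slot $j$, `q[j]` is the probability that the user knows slot $j$, and `table[i][j]` holds either an integer index into $V^j$ or `None` for a missing value ($\Psi$); these names and the data layout are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def soft_kb_lookup(p, q, table):
    """Posterior p_T over the N rows of the KB, following eqs. (1)-(4).

    p:     list of M arrays, p[j][v] = belief that the user constraint for slot j is value v
    q:     list of M floats, q[j] = Pr(user knows the value of slot j)
    table: list of N rows; row[j] is an int index into V^j, or None if missing (Psi)
    """
    N, M = len(table), len(p)
    log_post = np.zeros(N)
    for j in range(M):
        values = [row[j] for row in table]
        n_missing = sum(v is None for v in values)                  # |M_j|
        counts = np.bincount(np.array([v for v in values if v is not None], dtype=int),
                             minlength=len(p[j]))                   # N_j(v)
        pr_j = np.empty(N)
        for i, v in enumerate(values):
            if v is None:                                           # i in M_j  (eq. 4, top case)
                known = 1.0 / N
            else:                                                   # i not in M_j (eq. 4, bottom case)
                known = p[j][v] / counts[v] * (1.0 - n_missing / N)
            unknown = 1.0 / N                                       # eq. (3)
            pr_j[i] = q[j] * known + (1.0 - q[j]) * unknown         # eq. (2)
        log_post += np.log(pr_j + 1e-12)                            # product over slots, eq. (1)
    post = np.exp(log_post - log_post.max())
    return post / post.sum()
```

The product over slots is accumulated in the log domain for numerical stability before normalizing, which leaves the posterior of equation (1) unchanged.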

4 End-to-End KB-InfoBot

Figure 3: High-level overview of the end-to-end KB-InfoBot. User utterances pass through the Feature Extractor and Belief Trackers, the Soft-KB Lookup produces a posterior over the KB, a Beliefs Summary is computed, and the Policy Network selects the system action. Components with trainable parameters are highlighted in gray.

Figure 3 shows an overview of the end-to-end KB-InfoBot. At each turn, the agent receives a natural language utterance $u^t$ as input, and selects an action $a^t$ as output. The action space, denoted by $\mathcal{A}$, consists of $M + 1$ actions — request(slot=i) for $1 \le i \le M$ will ask the user for the value of slot $i$, and inform(I) will inform the user with an ordered list of results $I$ from the KB. The dialogue ends once the agent chooses inform. We describe each of the components in detail below.

Feature Extractor: The feature extractor converts user input $u^t$ into a vector representation $x^t$. In our implementation we use a simple bag of n-grams (with $n = 2$) representation, where each element of $x^t$ is an integer indicating the count of a particular n-gram in $u^t$. We let $V^n$ denote the number of unique n-grams, hence $x^t \in \mathbb{R}^{V^n}$. This module could potentially be replaced with a more sophisticated NLU unit, but for the user simulator we consider below the vocabulary size is relatively small ($V^n = 3078$), and doing so did not yield any improvements.
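A minimal sketch of such a bag-of-n-grams featurizer is shown below; whether unigrams are counted alongside bigrams, and how the vocabulary is built, are assumptions here (the paper constructs its 3078-entry vocabulary from the NLG vocabulary and bigrams in the KB).

```python
from collections import Counter

def ngrams(tokens, n=2):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def featurize(utterance, vocab, n=2):
    """Map an utterance to an integer count vector x^t over the n-gram vocabulary."""
    tokens = utterance.lower().split()
    counts = Counter(tokens + ngrams(tokens, n))   # unigrams + bigrams (an assumption)
    return [counts.get(g, 0) for g in vocab]
```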

Belief Trackers: The KB-InfoBot consists of $M$ belief trackers, one for each slot. Each tracker has input $x^t$ and produces two outputs, $p^t_j$ and $q^t_j$, which we shall collectively call the belief state: $p^t_j$ is a multinomial distribution over the slot values $v$, and $q^t_j$ is a scalar probability of the user knowing the value of that slot. It is common to use recurrent neural networks for belief tracking (Henderson et al., 2014; Wen et al., 2016b) since the output distribution at turn $t$ depends on all user inputs till that turn. We use a Gated Recurrent Unit (GRU) (Cho et al., 2014) for each tracker, which, starting from $h^0_j = 0$, maintains a summary state $h^t_j$ as follows:

$$\begin{aligned}
r^t_j &= \sigma(W^r_j x^t + U^r_j h^{t-1}_j + b^r) \\
z^t_j &= \sigma(W^z_j x^t + U^z_j h^{t-1}_j + b^z) \\
\tilde{h}^t_j &= \tanh(W^h_j x^t + U^h_j (r^t_j \cdot h^{t-1}_j) + b^h) \\
h^t_j &= (1 - z^t_j) \cdot h^{t-1}_j + z^t_j \cdot \tilde{h}^t_j. \qquad (5)
\end{aligned}$$

Here the subscript $j$ and superscript $t$ stand for the tracker index and dialogue turn respectively, and $\sigma$ denotes the sigmoid nonlinearity. The output $h^t_j \in \mathbb{R}^d$ can be interpreted as a summary of what the user has said about slot $j$ till turn $t$. The belief states are computed from this vector as follows:

$$p^t_j = \mathrm{softmax}(W^p_j h^t_j + b^p_j) \qquad (6)$$

$$q^t_j = \sigma(W^{\Phi}_j h^t_j + b^{\Phi}_j) \qquad (7)$$

The key differences between the belief tracker described here and the one presented in (Wen et al., 2016b) are as follows: (1) we model the probability that the user does not know the value of a slot separately, as opposed to treating it as a special value for the slot, since this is a very different type of object; (2) we use GRU units instead of a Jordan-type RNN, and use summary states $h^t_j$ instead of tying together RNN weights; (3) we use n-gram features for simplicity instead of CNN features.
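The following is a minimal numpy sketch of one per-slot belief tracker implementing equations (5)–(7) directly. The class name, parameter initialization, and shapes are illustrative; in the paper these weights are learned by imitation and reinforcement learning rather than used with random values.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class SlotBeliefTracker:
    """One GRU belief tracker per slot, following eqs. (5)-(7)."""

    def __init__(self, input_dim, hidden_dim, num_values, seed=0):
        rng = np.random.default_rng(seed)
        init = lambda *shape: 0.1 * rng.standard_normal(shape)
        d = hidden_dim
        self.Wr, self.Ur, self.br = init(d, input_dim), init(d, d), np.zeros(d)
        self.Wz, self.Uz, self.bz = init(d, input_dim), init(d, d), np.zeros(d)
        self.Wh, self.Uh, self.bh = init(d, input_dim), init(d, d), np.zeros(d)
        self.Wp, self.bp = init(num_values, d), np.zeros(num_values)   # eq. (6)
        self.Wq, self.bq = init(1, d), np.zeros(1)                     # eq. (7)
        self.h = np.zeros(d)                                           # h_j^0 = 0

    def step(self, x):
        r = sigmoid(self.Wr @ x + self.Ur @ self.h + self.br)
        z = sigmoid(self.Wz @ x + self.Uz @ self.h + self.bz)
        h_cand = np.tanh(self.Wh @ x + self.Uh @ (r * self.h) + self.bh)
        self.h = (1 - z) * self.h + z * h_cand                         # eq. (5)
        p = softmax(self.Wp @ self.h + self.bp)                        # multinomial over slot values
        q = sigmoid(self.Wq @ self.h + self.bq)[0]                     # Pr(user knows the slot)
        return p, q
```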

Soft-KB Lookup: This module uses the procedure described in Section 3.3 to compute the posterior over the EC-KB, $p^t_{\mathcal{T}} \in \mathbb{R}^N$, from the belief states described above. Note that this is a fixed, differentiable operation without any trainable parameters.

Collectively, outputs of the belief trackers and the soft-KB lookup can be viewed as the current dialogue state internal to the KB-InfoBot. Let $s^t = [p^t_1, p^t_2, \ldots, p^t_M, q^t_1, q^t_2, \ldots, q^t_M, p^t_{\mathcal{T}}]$ be the vector of size $\sum_j |V^j| + M + N$ denoting this state.

Beliefs Summary: At this stage it is possible for the agent to directly use the state vector $s^t$ to select its next action $a^t$. However, the large size of the state vector would lead to a large number of parameters in the policy network. To improve efficiency we extract summary statistics from the belief states, similar to (Williams and Young, 2005; Gasic et al., 2009).

We summarize each slot into an entropy statistic over a distribution $w^t_j$ computed from elements of the KB posterior $p^t_{\mathcal{T}}$ as follows:

$$w^t_j(v) \propto \sum_{i: T_{i,j} = v} p^t_{\mathcal{T}}(i) \;+\; p^0_j(v) \sum_{i: T_{i,j} = \Psi} p^t_{\mathcal{T}}(i). \qquad (8)$$

Here, $p^0_j$ is a prior distribution over the values of slot $j$, estimated using counts of each value in the KB. Intuitively, the probability mass of $v$ in this distribution is the agent's confidence that the user goal has value $v$ in slot $j$. This confidence is a sum of two terms: (i) the sum of KB posterior probabilities of rows which have value $v$, and (ii) the sum of KB posterior probabilities of rows whose value is unknown, multiplied by the prior probability that an unknown might in fact be $v$. These two terms correspond to the two terms in (8) respectively. The summary statistic for slot $j$ is then the entropy $H(w^t_j)$ of this weighted probability distribution. The KB posterior $p^t_{\mathcal{T}}$ is also summarized into an entropy statistic $H(p^t_{\mathcal{T}})$.


The scalar probabilities of the user knowing the value of a slot are passed as is to the policy network. Hence, the final summary vector which is input to the policy network is $s^t = [H(p^t_1), \ldots, H(p^t_M), q^t_1, \ldots, q^t_M, H(p^t_{\mathcal{T}})]$. Note that this vector has size $2M + 1$.
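A minimal numpy sketch of this summarization step is given below, following equation (8); `prior[j]` stands for the count-based prior $p^0_j$ over values of slot $j$, and the data layout (value indices or `None` for missing entries) matches the earlier Soft-KB sketch. The names are illustrative.

```python
import numpy as np

def entropy(w):
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    return -np.sum(w * np.log(w + 1e-12))

def summarize(p_T, q, table, prior):
    """Summary vector [H(w_1), ..., H(w_M), q_1, ..., q_M, H(p_T)] of size 2M + 1."""
    M = len(prior)
    slot_entropies = []
    for j in range(M):
        mass_missing = sum(p_T[i] for i, row in enumerate(table) if row[j] is None)
        w = np.zeros(len(prior[j]))
        for i, row in enumerate(table):
            if row[j] is not None:
                w[row[j]] += p_T[i]           # rows known to have value v
        w += prior[j] * mass_missing          # unknown rows, weighted by the prior p_j^0 (eq. 8)
        slot_entropies.append(entropy(w))
    return np.array(slot_entropies + list(q) + [entropy(p_T)])
```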

Policy Network: The policy network's job is to select the next action based on the current summary state $s^t$ and the dialogue history. Similar to (Williams and Zweig, 2016; Zhao and Eskenazi, 2016), we use an RNN to allow the network to maintain an internal state of dialogue history. Specifically, we use a GRU unit (see eq. 5) followed by a fully-connected layer and softmax nonlinearity to model the policy ($W^\pi \in \mathbb{R}^{|\mathcal{A}| \times d}$, $b^\pi \in \mathbb{R}^{|\mathcal{A}|}$):

$$h^t_\pi = \mathrm{GRU}(s^1, \ldots, s^t) \qquad (9)$$

$$\pi = \mathrm{softmax}(W^\pi h^t_\pi + b^\pi). \qquad (10)$$
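As a rough sketch, the policy computation of equations (9)–(10) can be written as below, where `gru_step` stands in for a GRU update of the form in equation (5) with its own parameters, and `W_pi`, `b_pi` are the output-layer parameters; all names are illustrative, not the paper's code.

```python
import numpy as np

def policy_distribution(summary_states, gru_step, W_pi, b_pi):
    """pi = softmax(W_pi h_pi^t + b_pi), where h_pi^t = GRU(s^1, ..., s^t)."""
    h = np.zeros(W_pi.shape[1])
    for s in summary_states:          # s^1, ..., s^t
        h = gru_step(s, h)            # one GRU update as in eq. (5)
    e = np.exp(W_pi @ h + b_pi)
    return e / e.sum()                # distribution over the |A| = M + 1 actions
```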

Action Selection: During the course of the dialogue, the agent samples its actions from the policy $\pi$. If this action is inform(), it must also provide an ordered set $I = (i_1, i_2, \ldots, i_R)$ of indices from the KB to the user. We assume a search-engine type setting where the agent returns a list of entities and the dialogue is considered a success if the correct entity is in the top $R$ items in the list. Since we want to learn the KB posterior $p^t_{\mathcal{T}}$ using reinforcement learning, we can view it as another policy, and sample results from the following distribution:

$$\mu(I) = p^t_{\mathcal{T}}(i_1) \times \frac{p^t_{\mathcal{T}}(i_2)}{1 - p^t_{\mathcal{T}}(i_1)} \times \cdots. \qquad (11)$$
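Sampling an ordered result list from $\mu$ amounts to drawing $R$ rows from $p^t_{\mathcal{T}}$ without replacement, renormalizing after each draw, as in the sketch below; at evaluation time one would instead return the top-$R$ rows by posterior probability. This is a hedged illustration of equation (11), not the paper's implementation.

```python
import numpy as np

def sample_results(p_T, R=5, rng=np.random.default_rng()):
    """Sample an ordered list I = (i_1, ..., i_R) of KB rows without replacement (eq. 11)."""
    p = np.array(p_T, dtype=float)
    results = []
    for _ in range(R):
        p = p / p.sum()                      # renormalize over the remaining rows
        i = int(rng.choice(len(p), p=p))
        results.append(i)
        p[i] = 0.0                           # remove the sampled row
    return results
```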

5 Training

5.1 Reinforcement Learning

The KB-InfoBot agent described above samples system actions from the policy $\pi$ and KB results from the distribution $\mu$. This allows the agent to explore both the space of actions $\mathcal{A}$ as well as the space of all possible KB results. This formulation leads to a modified version of the episodic REINFORCE algorithm (Williams, 1992) which we describe below.

We can write the expected discounted return of the agent under policy $\pi$ as follows:

$$J(\theta) = \mathbb{E}\left[\sum_{h=0}^{H} \gamma^h r_h\right] \qquad (12)$$

Here, the expectation is over all possible trajectories $\tau$ of the dialogue, $\theta$ denotes the parameters of the end-to-end system, $H$ is the maximum length of an episode, $\gamma$ is the discounting factor, and $r_h$ the reward observed at turn $h$. We can use the likelihood ratio trick (Glynn, 1990) to write the gradient of the objective as follows:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\nabla_\theta \log p_\theta(\tau) \sum_{h=0}^{H} \gamma^h r_h\right], \qquad (13)$$

where $p_\theta(\tau)$ is the probability of observing a particular trajectory under the current policy. With a Markovian assumption, we can write

$$p_\theta(\tau) = \left[p(s_0) \prod_{k=0}^{H} p(s_{k+1} \mid s_k, a_k)\, \pi_\theta(a_k \mid s_k)\right] \mu_\theta(I), \qquad (14)$$

where $\theta$ denotes dependence on the neural network parameters. Notice the last term $\mu_\theta$ above, which is the posterior of a set of results from the KB. From (13) and (14) we obtain

$$\nabla_\theta J(\theta) = \mathbb{E}_{a \sim \pi, I \sim \mu}\left[\left(\nabla_\theta \log \mu_\theta(I) + \sum_{h=0}^{H} \nabla_\theta \log \pi_\theta(a_h)\right) \sum_{k=0}^{H} \gamma^k r_k\right], \qquad (15)$$

where the expectation is now over all possible action sequences and the KB results, since the gradient of the other terms in $p_\theta(\tau)$ is 0. This expectation is estimated using a mini-batch of dialogues of size $B$, and we use RMSProp (Hinton et al., 2012) updates to train the parameters $\theta$.
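A minimal PyTorch-style sketch of this estimator is given below: the log-probabilities of the sampled dialogue acts and of the sampled KB results are weighted by the discounted return and averaged over a minibatch, so that backpropagation yields the gradient in equation (15). The use of PyTorch, the data structures, and the RMSprop call in the usage comment are assumptions for illustration, not the paper's code.

```python
import torch

def reinforce_loss(dialogues, gamma=0.99):
    """Surrogate loss whose gradient matches eq. (15), averaged over a minibatch.

    dialogues: list of (log_pi_list, log_mu, rewards) tuples, where
      log_pi_list: per-turn log pi_theta(a_h | s_h) as scalar tensors requiring grad
      log_mu:      log mu_theta(I) of the sampled KB result list (scalar tensor)
      rewards:     list of observed rewards r_h (floats)
    """
    losses = []
    for log_pi_list, log_mu, rewards in dialogues:
        ret = sum(r * gamma ** h for h, r in enumerate(rewards))   # discounted return
        log_prob = torch.stack(list(log_pi_list)).sum() + log_mu
        losses.append(-log_prob * ret)       # minimize the negative expected return
    return torch.stack(losses).mean()

# Sketch of one update, assuming `model` holds all trainable parameters theta:
# optimizer = torch.optim.RMSprop(model.parameters(), lr=0.005)
# optimizer.zero_grad(); reinforce_loss(minibatch).backward(); optimizer.step()
```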

5.2 Imitation Learning

In theory, both the belief trackers and the policy network can be trained from scratch using only the reinforcement learning objective described above. In practice, however, for a moderately sized KB, the agent almost always fails if starting from random initialization. In this case, credit assignment is difficult for the agent, since it does not know whether the failure is due to an incorrect sequence of actions or an incorrect set of results from the KB. Hence, at the beginning of training we have an imitation learning phase where the belief trackers and policy network are trained to mimic a simple hand-designed rule-based agent. The rule-based agent is described in detail in the next section. Here, we give the imitation learning objective used to bootstrap the KB-InfoBot.

Assume that $\hat{p}^t_j$ and $\hat{q}^t_j$ are the belief states from the rule-based agent, and $\hat{a}^t$ its action at turn $t$. Then the loss function in imitation learning is:

$$\mathcal{L}(\theta) = \mathbb{E}\left[D(\hat{p}^t_j \,\|\, p^t_j(\theta)) + H(\hat{q}^t_j, q^t_j(\theta)) - \log \pi_\theta(\hat{a}^t)\right], \qquad (16)$$

where $D(p\|q)$ denotes the Kullback-Leibler divergence between $p$ and $q$, and $H(p, q)$ denotes the cross-entropy between $p$ and $q$. The last term is a standard supervised cross-entropy loss between the rule-based agent's action and its probability in the KB-InfoBot's policy. The expectation is estimated as an average over the minibatch, and we use stochastic gradient descent to optimize the loss.
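The loss in equation (16) can be sketched in PyTorch as follows, with a KL term per slot distribution, a binary cross-entropy term for the "user knows the slot" probabilities, and a cross-entropy term on the rule-based action. The tensor shapes, batching, and reduction choices are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def imitation_loss(p_hat, p, q_hat, q, pi, a_hat):
    """Loss of eq. (16): KL on slot distributions + cross-entropy on q + action cross-entropy.

    p_hat, p: lists of M tensors of shape (batch, |V^j|) -- rule-based targets and model outputs
    q_hat, q: tensors of shape (batch, M) -- targets and model outputs for Pr(user knows slot j)
    pi:       tensor of shape (batch, |A|) -- model policy; a_hat: (batch,) rule-based actions
    """
    kl = sum(F.kl_div(pj.log(), pj_hat, reduction="batchmean")   # D(p_hat || p(theta))
             for pj_hat, pj in zip(p_hat, p))
    bce = F.binary_cross_entropy(q, q_hat)                       # H(q_hat, q(theta))
    nll = F.nll_loss(pi.log(), a_hat)                            # -log pi_theta(a_hat)
    return kl + bce + nll
```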

5.3 User Simulator

To evaluate the performance of the KB-InfoBot, we make use of a rule-based stochastic simulated user. At the beginning of each dialogue, the simulated user randomly samples a target entity from the EC-KB and a random combination of informable slots for which it knows the value of the target. The remaining slot-values are unknown to the user. The user initiates the dialogue by providing a subset of its informable slots to the agent and requesting an entity which matches them. In subsequent turns, if the agent requests the value of a slot, the user complies by providing it or informs the agent that it does not know that value. If the agent informs results from the KB, the simulator checks whether the target is among them and provides the reward.

We convert dialogue acts from the user into natural language utterances using a separately trained natural language generator (NLG). The NLG is trained in a sequence-to-sequence fashion, using conversations between humans collected by crowd-sourcing. It takes dialogue actions (DAs) as input, and generates template-like sentences with slot placeholders via an LSTM decoder. Then, a post-processing scan is performed to replace the slot placeholders with their actual values, which is similar to the decoder module in (Wen et al., 2015; Wen et al., 2016a). In the LSTM decoder, we apply beam search, which iteratively considers the top k best sentences up to time step t when generating the token of time step t+1. As a trade-off between speed and performance, we use a beam size of 3 in the following experiments.

There are several sources of error in user utterances. Any value provided by the user may be corrupted by noise, or substituted completely with an incorrect value of the same type (e.g., “Bill Murray” might become just “Bill” or “Tom Cruise”). The NLG described above is inherently stochastic, and may sometimes generate utterances irrelevant to the agent request. By increasing the temperature of the output softmax in the NLG we can increase the noise in user utterances.

6 Experiments and Results

6.1 Baselines

We compare our end-to-end model with two sets of baselines. Rule-Based agents consist of hand-designed belief trackers and a hand-designed policy. The belief trackers search for tokens in the user input which match a slot-value in the KB, and do a Bayesian update on the probability mass $p^t_j(v)$ associated with the values found. If the agent asks for a slot, but does not find any value for that slot in the user response, then the corresponding don't-care probability $q^t_j$ for that slot is set to 1. We compare three variants of the hand-designed policy, which differ in terms of the KB-lookup method. The No-KB version ignores the KB and selects its actions by always asking for the slot with maximum entropy $H(p^t_j)$. The Hard-KB version performs a hard-KB lookup and selects the next action based on the entropy of the slots and the number of retrieved results. Finally, the Soft-KB version computes the full posterior over the KB and selects actions based on the summary statistics described in Section 4. All these agents are variants of the EMDM strategy proposed in (Wu et al., 2015), with the difference being in the way the entropy is computed. At the end of the dialogue, all three agents inform the user with the top results from the KB posterior $p^t_{\mathcal{T}}$, hence the difference only lies in the policy for action selection.

The second set of baselines, Simple-RL agents, retain the hand-designed belief trackers as described above, but use the GRU policy network described in Section 4 instead of a hand-designed policy. We again compare three variants of these agents, which differ only in the inputs to the policy network. The No-KB version only takes the entropy $H(p^t_j)$ of each of the slot distributions. The Hard-KB version takes the entropy of the slots along with a 6-bin one-hot encoding of the number of retrieved results (no match, 1 match, ..., or more than 5 matches). This is the same approach as in (Wen et al., 2016b), except that we take entropy instead of summing probabilities. Lastly, the Soft-KB version takes the summary statistics of the slots and posterior over the KB described in Section 4. In addition, we also append the agent beliefs $q^t_j$ of whether the user knows the value of a slot, and a one-hot encoding of the previous agent action, to the input for each of these versions. The policy network produces a distribution over the $M + 1$ valid actions available to the agent. The inform action is accompanied with results from the posterior $p^t_{\mathcal{T}}$. During training actions are sampled from the output for exploration, but in evaluation actions are determined via argmax.

Table 1: Movies-KB statistics. The total number of movies and the number of unique values for each slot are given.

Relation Type    # Unique Values
Actor            51
Director         51
MPAA Rating      67
Critic Rating    68
Genre            21
Release Year     10
Movie            428

6.2 Movies-KB

In our experiments, we use a movie-centric knowledge base constructed using the IMDBPy¹ package. We selected a subset of movies released after 2007, and retained 6 slots. Statistics for this KB are given in Table 1. The original KB was modified to reduce the number of actors and directors in order to make the task more challenging². We also randomly remove 20% of the values from the agent's copy of the KB to simulate a real-world scenario where the KB may be incomplete. The user, however, may still know these values.

¹ http://imdbpy.sourceforge.net/
² We restricted the vocabulary to the first few unique values of these slots and replaced all other values with a random value from this set.

Figure 4: Average rewards and their std error during training for each of the models. Evaluation done at intervals of 100 updates, by choosing the optimal policy actions for 2000 simulations.

6.3 Hyperparameters

We use a GRU hidden state size of $d = 50$ for the Simple-RL baselines and $d = 100$ for the end-to-end system, a learning rate of 0.05 for the imitation learning phase and 0.005 for the reinforcement learning phase, and a minibatch size of 128. Imitation learning was performed for 500 updates, after which the agent switched to reinforcement learning. The maximum length of a dialogue is limited to 10 turns,³ beyond which the dialogue is deemed a failure. The input vocabulary is constructed from the NLG vocabulary and bigrams in the KB, and its size is 3078. The agent receives a positive reward if the user target is in the top $R = 5$ results returned by it; this reward is computed as $2(1 - (r - 1)/R)$, where $r$ is the actual rank of the target. For a failed dialogue the agent receives a reward of $-1$, and at each turn it receives a reward of $-0.1$, since we want it to complete the task in the shortest time possible. The discounting factor $\gamma$ is set to 0.99.
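For concreteness, the reward scheme above can be written as the small helpers below; the function names are illustrative, and the logic follows this section (per-turn penalty of -0.1, rank-based final reward over the top R = 5 results, and -1 for a failed dialogue).

```python
def turn_reward():
    return -0.1                            # small per-turn penalty

def final_reward(results, target, R=5):
    """2 * (1 - (r - 1) / R) if the target is at rank r of the top R results, else -1."""
    if target in results[:R]:
        r = results.index(target) + 1      # 1-based rank of the user target
        return 2.0 * (1.0 - (r - 1) / R)
    return -1.0
```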

6.4 Performance Comparison

We compare each of the discussed models along three metrics: the average reward obtained, success rate (where success is defined as providing the user target among the top $R$ results), and the average number of turns per dialogue. Parameters of the rule-based baselines control the trade-off between the average number of turns and the success rate. We tuned these using grid search and selected the combination with the highest average reward.

³ A turn consists of one user action and one agent action.


Table 2: Performance comparison. Average and std error for 5000 runs after choosing the best model during training.

Agent        KB Lookup   Success Rate   Avg Turns     Avg Reward
Rule-based   No-KB       0.77 ± 0.01    5.05 ± 0.01   0.74 ± 0.02
Rule-based   Hard-KB     0.73 ± 0.01    3.65 ± 0.01   0.75 ± 0.02
Rule-based   Soft-KB     0.76 ± 0.01    3.94 ± 0.03   0.83 ± 0.02
Simple-RL    No-KB       0.76 ± 0.01    3.32 ± 0.02   0.87 ± 0.02
Simple-RL    Hard-KB     0.75 ± 0.01    3.07 ± 0.01   0.86 ± 0.02
Simple-RL    Soft-KB     0.80 ± 0.01    3.37 ± 0.03   0.98 ± 0.02
End2End-RL   Soft-KB     0.83 ± 0.01    3.27 ± 0.03   1.10 ± 0.02

Figure 5: Variation in average rewards as the temperature of the softmax in the NLG output is increased. Higher temperature leads to more noise in the output. Average over 5000 simulations after selecting the best model during training.

Figure 4 shows how the reinforcement-learning agents perform as training progresses. This figure was generated by fixing the model every 100 updates, and performing 2000 simulations while selecting greedy policy actions. Table 2 shows the performance of each model over a further 5000 simulations, after selecting the best model during training, and selecting greedy policy actions.

The Soft-KB versions outperform both the Hard-KB and No-KB counterparts, which perform similarly, in terms of average reward. The benefit comes from achieving a similar or higher success rate in a reduced number of turns. Note that all baseline agents share the same belief trackers, but by re-asking values of some slots they can have different posteriors $p^t_{\mathcal{T}}$ to inform the results. Having full information about the current state of beliefs over the KB helps the Soft-KB agent discover better policies. Further, reinforcement learning helps discover better policies than the hand-crafted rule-based agents, and hence Simple-RL agents outperform the Rule-Based agents. All baseline agents, however, are limited by the rule-based belief trackers which remain fixed during training. The end-to-end agent is not limited as such, and is able to achieve a higher success rate and a higher average reward. This validates our motivation for introducing the Soft-KB lookup — the agent is able to improve both the belief trackers and policy network from user feedback directly.

Figure 5 shows the average reward of three of the agents as the temperature of the output softmax in the user simulator NLG is increased. A higher temperature means a more uniform output distribution, which leads to generic user responses irrelevant to the agent questions. This is a simple way of introducing noise in user responses. The performance of all three agents drops as the temperature is increased, but less so for the end-to-end agent, which can adapt its belief tracker to the inputs it receives.

7 Conclusion

We have presented an end-to-end differentiable dialogue agent for multi-turn information access. All components of the agent are trained using reinforcement learning from user feedback, by optimizing a modified version of the episodic REINFORCE objective. We have shown that, starting from an imitation learning phase where the agent learns to mimic rule-based belief trackers and policy, the agent can successfully improve on its own through reinforcement learning. The gain in performance is especially high when the noise in user inputs is high.

A KB-InfoBot is a specific type of goal-oriented dialogue agent. Future work should focus on extending the techniques described here to other, more general dialogue agents, such as a restaurant reservation agent or a flight booking agent. We have also ignored scalability issues in this work, which are important for real-world sized knowledge bases and are a direction for future research.

References

Brenna D. Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. 2009. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483.

Antoine Bordes and Jason Weston. 2016. Learning end-to-end goal-oriented dialog. arXiv preprint arXiv:1605.07683.

Yun-Nung Chen, Dilek Hakkani-Tur, Gokhan Tur, Jianfeng Gao, and Li Deng. 2016. End-to-end memory networks with knowledge carryover for multi-turn spoken language understanding. In Proceedings of The 17th Annual Meeting of the International Speech Communication Association.

Kyunghyun Cho, Bart Van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. EMNLP.

William W. Cohen. 2016. TensorLog: A differentiable deductive database. arXiv preprint arXiv:1605.06523.

Heriberto Cuayahuitl. 2016. SimpleDS: A simple deep reinforcement learning dialogue system. International Workshop on Spoken Dialogue Systems (IWSDS).

Milica Gasic, Fabrice Lefevre, Filip Jurcicek, Simon Keizer, Francois Mairesse, Blaise Thomson, Kai Yu, Steve Young, et al. 2009. Back-off action selection in summary space-based POMDP dialogue systems. In Automatic Speech Recognition & Understanding, 2009. ASRU 2009. IEEE Workshop on, pages 456–461. IEEE.

Peter W. Glynn. 1990. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33(10):75–84.

Dilek Hakkani-Tur, Gokhan Tur, Asli Celikyilmaz, Yun-Nung Chen, Jianfeng Gao, Li Deng, and Ye-Yi Wang. 2016. Multi-domain joint semantic frame parsing using bi-directional RNN-LSTM. In Proceedings of The 17th Annual Meeting of the International Speech Communication Association.

Matthew Henderson, Blaise Thomson, and Steve Young. 2014. Word-based dialog state tracking with recurrent neural networks. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pages 292–299.

Matthew Henderson. 2015. Machine learning for dialog state tracking: A review. Machine Learning in Spoken Language Processing Workshop.

Geoffrey Hinton, N. Srivastava, and Kevin Swersky. 2012. Lecture 6a: Overview of mini-batch gradient descent. Coursera lecture slides, https://class.coursera.org/neuralnets-2012-001/lecture. [Online].

Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky. 2016. Deep reinforcement learning for dialogue generation. EMNLP.

David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. 2016. Mastering the game of Go with deep neural networks and tree search. Nature, 529:484–489.

Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Pei-Hao Su, David Vandyke, and Steve Young. 2015. Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. EMNLP.

Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Lina M. Rojas-Barahona, Pei-Hao Su, Stefan Ultes, David Vandyke, and Steve Young. 2016a. Conditional generation and snapshot learning in neural dialogue systems. EMNLP.

Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Lina M. Rojas-Barahona, Pei-Hao Su, Stefan Ultes, David Vandyke, and Steve Young. 2016b. A network-based end-to-end trainable task-oriented dialogue system. arXiv preprint arXiv:1604.04562.

Jason D. Williams and Steve Young. 2005. Scaling up POMDPs for dialog management: The “Summary POMDP” method. In IEEE Workshop on Automatic Speech Recognition and Understanding, 2005, pages 177–182. IEEE.

Jason D. Williams and Geoffrey Zweig. 2016. End-to-end LSTM-based dialog control optimized with supervised and reinforcement learning. arXiv preprint arXiv:1606.01269.

Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256.

Ji Wu, Miao Li, and Chin-Hui Lee. 2015. A probabilistic framework for representing dialog systems and entropy-based dialog management through dynamic stochastic state evolution. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(11):2026–2035.

Kaisheng Yao, Baolin Peng, Yu Zhang, Dong Yu, Geoffrey Zweig, and Yangyang Shi. 2014. Spoken language understanding using long short-term memory neural networks. In Spoken Language Technology Workshop (SLT), 2014 IEEE, pages 189–194. IEEE.

Jun Yin, Xin Jiang, Zhengdong Lu, Lifeng Shang, Hang Li, and Xiaoming Li. 2016a. Neural generative question answering. International Joint Conference on Artificial Intelligence.

Pengcheng Yin, Zhengdong Lu, Hang Li, and Ben Kao. 2016b. Neural enquirer: Learning to query tables. International Joint Conference on Artificial Intelligence.

Steve Young, Milica Gasic, Blaise Thomson, and Jason D. Williams. 2013. POMDP-based statistical spoken dialog systems: A review. Proceedings of the IEEE, 101(5):1160–1179.

Wojciech Zaremba and Ilya Sutskever. 2015. Reinforcement learning neural Turing machines—revised. arXiv preprint arXiv:1505.00521.

Tiancheng Zhao and Maxine Eskenazi. 2016. Towards end-to-end learning for dialog state tracking and management using deep reinforcement learning. arXiv preprint arXiv:1606.02560.

Stefan Zwicklbauer, Christin Seifert, and Michael Granitzer. 2013. Do we need entity-centric knowledge bases for entity disambiguation? In Proceedings of the 13th International Conference on Knowledge Management and Knowledge Technologies, page 4. ACM.


Table 3: Sample dialogues between the user simulator and the SimpleRL-SoftKB and End2End-RL agents. At the end of each dialogue, the agent informs the top 5 results from the KB posterior.

Dialogue 1 — User: I would like to know a movie whose critic rating is 5.5 and actor is fifer.

SimpleRL-SoftKB:
Agent: request(actor)          User: chad fifer
Agent: request(mpaa rating)    User: i cannot remember
Agent: request(critic rating)  User: i think it is 5.2
Agent: request(genre)          User: i cannot remember
Agent: inform(extranjera, promise, bling, the neighbor, lesson no.one)

End-to-End:
Agent: request(actor)          User: chad fifer
Agent: request(critic rating)  User: 5.2
Agent: request(mpaa rating)    User: i cannot remember
Agent: request(critic rating)  User: i think it is 5.5
Agent: inform(le prince de ce monde, tel pre telle fille, the neighbor, promise, krakatoa)

Dialogue 2 — User: Which movie has sabine as director and comedy as genre?

SimpleRL-SoftKB:
Agent: request(actor)          User: i cannot remember
Agent: request(mpaa rating)    User: i cannot remember
Agent: request(critic rating)  User: i think it is 6.9
Agent: request(critic rating)  User: i think it is 7.2
Agent: inform(ruddy hell! it's harry and paul, homerazzi gotthardlange, bathtime, rome-old and juli-eh, numb)

End-to-End:
Agent: request(critic rating)  User: i think it is 6.9
Agent: inform(ruddy hell! it's harry and paul, made in america, rome-old and juli-eh, bathtime, numb)

A Sample Dialogues

Table 3 shows some sample dialogues between the user simulator and the SimpleRL-SoftKB and End2End-RL agents. User utterances are generated using the NLG described in the text. The value of the critic rating slot is a common source of error in the user simulator, and hence all learned policies tend to ask for this value multiple times.

B Posterior Derivation

Here, we present a derivation for equation (4), i.e., the posterior over the KB slot when the user knows the value of that slot. For brevity, we drop $\Phi_j = 1$ from the condition in all probabilities below. For the case when $i \in M_j$, we can write:

$$\Pr(G_j = i) = \Pr(G_j \in M_j) \Pr(G_j = i \mid G_j \in M_j) = \frac{|M_j|}{N} \cdot \frac{1}{|M_j|} = \frac{1}{N}, \qquad (17)$$

where we assume all missing values to be equally likely, and estimate the prior probability of the goal being missing from the count of missing values in that slot. For the case when $i \notin M_j$ and $T_{i,j} = v$:

$$\Pr(G_j = i) = \Pr(G_j \notin M_j) \Pr(G_j = i \mid G_j \notin M_j) = \left(1 - \frac{|M_j|}{N}\right) \times \frac{p^t_j(v)}{N_j(v)}, \qquad (18)$$

where the second term comes from taking the probability mass associated with $v$ in the belief tracker and dividing it equally among all rows with value $v$.

where the second term comes from taking the prob-ability mass associated with v in the belief trackerand dividing it equally among all rows with value v.