
Proceedings of the SIGDIAL 2017 Conference, pages 331–341, Saarbrücken, Germany, 15-17 August 2017. ©2017 Association for Computational Linguistics

Using Reinforcement Learning to Model Incrementality in a Fast-Paced Dialogue Game

Ramesh Manuvinakurike 1,2, David DeVault 2, Kallirroi Georgila 1,2

1 Institute for Creative Technologies, University of Southern California
2 Computer Science Department, University of Southern California

manuvina|[email protected], [email protected]

Abstract

We apply Reinforcement Learning (RL) to the problem of incremental dialogue policy learning in the context of a fast-paced dialogue game. We compare the policy learned by RL with a high performance baseline policy which has been shown to perform very efficiently (nearly as well as humans) in this dialogue game. The RL policy outperforms the baseline policy in offline simulations (based on real user data). We provide a detailed comparison of the RL policy and the baseline policy, including information about how much effort and time it took to develop each one of them. We also highlight the cases where the RL policy performs better, and show that understanding the RL policy can provide valuable insights which can inform the creation of an even better rule-based policy.

1 Introduction

Building incremental spoken dialogue systems (SDSs) has recently attracted much attention. One reason for this is that incremental dialogue processing allows for increased responsiveness, which in turn improves task efficiency and user satisfaction. Incrementality in dialogue has been studied in the context of turn-taking, predicting the next user utterances/actions, and generating fast system responses (Skantze and Schlangen, 2009; Schlangen et al., 2009; Selfridge and Heeman, 2010; DeVault et al., 2011; Dethlefs et al., 2012a,b; Selfridge et al., 2012, 2013; Hastie et al., 2013; Baumann and Schlangen, 2013; Paetzel et al., 2015). Over the years researchers have tried a variety of approaches to incremental dialogue processing. One such approach is using rules whose parameters may be optimized using real user data (Buß et al., 2010; Ghigi et al., 2014; Paetzel et al., 2015). Reinforcement Learning (RL) is another method that has been used to learn policies regarding when the system should interrupt the user (barge-in), stay silent, or generate backchannels in order to improve the responsiveness of the SDS or increase task success (Kim et al., 2014; Khouzaimi et al., 2015; Dethlefs et al., 2016).

We apply RL to the problem of incremental dialogue policy learning in the context of a fast-paced dialogue game. We use a corpus of real user data for both training and testing. We compare the policies learned by RL with a high performance baseline policy which uses parameterized rules (whose parameters have been optimized using real user data) and has a carefully designed rule (CDR) structure. From now on, we will refer to this baseline as the CDR baseline.

Our contributions are as follows: We provide an RL method for incremental dialogue processing based on simplistic features which performs better in offline simulations (based on real user data) than the high performance CDR baseline. Note that this is a very strong baseline which has been shown to perform very efficiently (nearly as well as humans) in this dialogue game (Paetzel et al., 2015). In many studies that use RL for dialogue policy learning, the focus is on the RL algorithms, the state-action space representation, and the reward function. As a result, the rule-based baselines used for comparing the RL policies against are not as carefully engineered as they could be, i.e., they are not the result of iterative improvement and optimization using insights learned from data or user testing. This is understandable since building a very strong baseline would be a big project by itself and would detract attention from the RL problem. In our case, there was a pre-existing strong CDR baseline policy which inspired us to investigate whether it could be outperformed by an RL policy.


One of our main contributions is that we provide a detailed comparison of the RL policy and the CDR baseline policy, including information about how much effort and time it took to develop each one of them. We also highlight the cases where the RL policy performs better, and show that understanding the RL policy can provide valuable insights which can inform the creation of an even better rule-based policy.

2 RDG-Image Game

For this study we used the RDG-Image (Rapid Dialogue Game) dataset (Paetzel et al., 2014) and the high performance baseline Eve system (Section 2.2). RDG-Image is a collaborative, two-player, time-constrained, incentivized rapid conversational game with two player roles, the Director and the Matcher. The players are given 8 images, as shown in Figure 1, in a randomized order. One of the images is highlighted with a red border on the Director's screen (called the target image, TI). The Matcher sees the same 8 images in a different order but does not know the TI. The Director has to describe the TI in such a way that the Matcher can identify it from the distractors as quickly as possible. The Director and Matcher can talk back-and-forth freely to accomplish the task. Once the Matcher believes that he has made the right selection, he clicks on the image and communicates this to the Director. If the guess is correct then the team earns 1 point, otherwise 0 points. The Director can then press a button so that the game continues with a new TI. The game consists of 4 rounds called Sets (1 to 4) with varying levels of complexity. Each round has a predefined time limit. The goal is to complete as many images as possible, and thus as a team earn as many points as possible.

2.1 Human-Human Data

The RDG-Image data comes in two flavors, human-human (HH) and human-agent (HA) spoken conversations. The HH data was collected by pairing 2 human players in real time and having their conversation recorded. The HA conversations were recorded by pairing a human Director with the agent Matcher (Section 2.2). In this section, we describe the HH part of the corpus. The HH data was collected in two separate experiments, in-lab (Paetzel et al., 2014) and over the web (Manuvinakurike and DeVault, 2015). Figure 1 shows an excerpt from the HH corpus.

The HH corpus contains transcribed user speech and labeled dialogue acts (DAs), along with carefully annotated time stamps, as shown in Figure 1. This timing information is important for modeling incrementality. We can observe that the game conversation involves rapid exchanges with frequent overlaps. Each episode (the dialogue exchange for each TI) typically begins with the Director describing the TI and ends with the Matcher acknowledging the TI selection with the Assert-Identified (As-I) DA (e.g., "got it") or the As-S (skipping action) DA (e.g., "let's move on to the next image"). The Director then requests the next TI and the game continues until time runs out. Sometimes the Matcher may interrupt the Director with questions or other illocutionary acts. A complete list of DAs can be found in (Manuvinakurike et al., 2016).

In this paper, we are interested in modeling incrementality for DAs related to TI selection by the Matcher. As-I is the most common DA used by the human Matchers. As-S was not frequently used by the human Matchers but is used by the baseline matcher agent to give up on the current TI and proceed to the next TI, trying to increase the total points scored. Further distinctions between As-I and As-S are made in Section 2.2. The most common DA generated by the Director was D-T (Describe-Target).

2.2 Eve

The baseline agent, called Eve (Paetzel et al., 2015), was developed to play the role of the Matcher using the HH data. Eve relies on several kinds of incremental processing. It obtains the 1-best automatic speech recognition (ASR) hypothesis every 100ms and forwards it to the natural language understanding (NLU) module. The NLU module is a Naive Bayes classifier trained on bag-of-words features, generated using a frequency threshold (frequency > 5) on unigrams and bigrams of the partial description $d_t$. The NLU assigns confidence values to the 8 images (called the image set). Let the image set at time $t$ be $I_t = \{i_1, \ldots, i_8\}$, with the correct target image $T \in I_t$ unknown to the agent. The maximum probability assigned to any image at time $t$ is $P^*_t = \max_j P(T = i_j \mid d_t)$. We refer to these probability values $P(T = i_j \mid d_t)$ as confidences. The image with the highest confidence is chosen as the agent's best selection for the TI. Let $t_c$ be the time consumed on the current TI.
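A rough sketch of this NLU step (not the authors' implementation) is shown below; the scikit-learn pipeline and the `min_df` cutoff only approximate the frequency threshold described above, and the training call is left as a comment since the corpus is not reproduced here.

```python
# Illustrative sketch: a Naive Bayes NLU over unigram/bigram counts that maps a
# partial description d_t to a distribution over the 8 images; P*_t is the
# highest class probability. min_df=6 only approximates "frequency > 5".
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

nlu = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), min_df=6),
    MultinomialNB(),
)
# nlu.fit(train_texts, train_labels)   # Director descriptions -> image ids 1..8

def confidence(d_t):
    probs = nlu.predict_proba([d_t])[0]                 # P(T = i_j | d_t) for each image
    return nlu.classes_[probs.argmax()], probs.max()    # (best selection, P*_t)
```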


Figure 1: Example interaction for a set of images in the human-human corpus.

Eve's policy decides between waiting and interrupting the user with As-I or As-S in order to maximize the score in the game. She can do this by taking one of three actions: i) WAIT: listen more in the hope that the user provides more information; ii) As-I: make the selection and request the next TI; iii) As-S: make the selection and request the next TI, as it might not be fruitful to wait more.1 Eve's policy, depicted in Algorithm 1, uses two threshold values, namely the identification threshold (IT) and the give-up threshold (GT), to select these actions. The IT is the lowest confidence value ($P^*_t$) above which the agent uses the As-I action. The GT is the maximum time the agent should WAIT before giving up on the current image set and requesting the human Director to move on to the next TI. The IT and GT values are learned using an offline policy optimization method called the Eavesdropper simulation, which performs an exhaustive grid search to find the optimal values of IT and GT for each image set (Paetzel et al., 2015). In this simulation, the agent is trained offline on the HH conversations and learns the best values of IT and GT, i.e., the values that result in scoring the maximum points in the game. For example, the optimal values learned for the image set shown in Figure 1 were IT=0.8 and GT=18sec.
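A hedged sketch of such a grid search is shown below; the candidate grids and the `replay_points` helper (which would replay Algorithm 1 over the recorded partials of one image set) are illustrative assumptions, not the exact setup of Paetzel et al. (2015).

```python
# Sketch of an Eavesdropper-style exhaustive grid search over (IT, GT):
# replay the recorded episodes under every candidate pair and keep the pair
# that yields the most points. `replay_points` is a hypothetical helper.
import itertools

def optimize_thresholds(episodes, replay_points):
    candidate_its = [i / 100 for i in range(50, 100, 5)]   # confidence thresholds (assumed grid)
    candidate_gts = range(5, 46)                           # give-up times in seconds (assumed grid)
    best_points, best_it, best_gt = float("-inf"), None, None
    for it, gt in itertools.product(candidate_its, candidate_gts):
        points = replay_points(episodes, it, gt)
        if points > best_points:
            best_points, best_it, best_gt = points, it, gt
    return best_it, best_gt
```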

The Eve agent is very efficient and carefully engineered to perform well in this task, and serves as a very strong baseline. In the real user study reported in Paetzel et al. (2015), Eve in the HA gameplay scored nearly as well as human users in HH gameplay. Thus this study provides an opportunity to compare an RL policy with a strong baseline policy that uses a hand-crafted, carefully designed rule (CDR) structure.

1 For As-S Eve's utterance is "I don't think I can get that one. Let's move on. I clicked randomly" and for As-I it is "Got it".

Algorithm 1 Eve's dialogue policy
if $P^*_t$ > IT & |filtered($d_t$)| ≥ 1 then
    Assert-Identified (As-I)
else if elapsed(t) < GT then
    WAIT (continue listening)
else
    Request-Skip (As-S)
end if
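For concreteness, Algorithm 1 amounts to the following sketch; `filtered` is assumed to strip uninformative words from the current partial description $d_t$, and IT/GT are the per-image-set thresholds described above.

```python
# Minimal sketch of Algorithm 1 (Eve's threshold policy); `filtered` and the
# exact threshold values are assumptions outside this paper's text.
def eve_policy(p_star_t, elapsed, d_t, IT, GT, filtered):
    if p_star_t > IT and len(filtered(d_t)) >= 1:
        return "As-I"    # commit to the highest-confidence image, request next TI
    elif elapsed < GT:
        return "WAIT"    # keep listening to the Director
    else:
        return "As-S"    # give up on this TI and ask to move on
```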

In the Appendix, Figure 6 shows an example from the HA corpus. The data used in the current work comes from both the HH and HA datasets (see Table 1).

Branch             # users    # sub-dialogues
Human-Human lab    64         1485
Human-Human web    196        5642
Human-Agent web    175        7393

Table 1: Number of users and number of TI sub-dialogues used for our study.

2.3 Improving NLU with Agent Conversation Data

Obviously, the success of the agent heavily depends on the accuracy of the NLU module. In the earlier work by Paetzel et al. (2015), the NLU module was trained on HH conversations. We investigated whether using HA data would improve the NLU accuracy. Using the Directors' speech for all TIs from the HH branch only, the NLU accuracy was 59.72%. Using data from the HA branch only resulted in a lower NLU accuracy of 48.70%. Combining the HH and HA training data resulted in a higher accuracy of 61.89%. The improvement associated with training on both HH and HA data is significant across all sets of images.2


Thus in this work we use the best performing NLU, trained on both the HH and HA subsets of the corpus. The overall reported NLU accuracy was averaged across all the image sets. The NLU module was trained with the same method as in Paetzel et al. (2015). Note that for all our experiments, 10% of the HH data and 10% of the HA data was used for testing, and the rest was used for training.
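A minimal sketch of such a per-image-set comparison is shown below; whether the test was paired over image sets is our assumption, not stated in the paper.

```python
# Sketch: compare per-image-set NLU accuracies of two training conditions
# with a Student's t-test (paired over image sets is an assumption here).
from scipy import stats

def compare_nlu_accuracies(acc_hh_only, acc_hh_plus_ha):
    """Return (t statistic, p-value) for the per-image-set accuracy comparison."""
    t_stat, p_value = stats.ttest_rel(acc_hh_plus_ha, acc_hh_only)
    return t_stat, p_value
```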

2.4 Room for Improvement

Though the baseline agent is impressive in its performance, there are a few shortcomings. We investigated the errors made by the baseline policy and identified four primary limitations in its decision-making. Examples of these limitations are shown in Figure 2, which depicts the NLU-assigned confidence (y-axis) for the human TI descriptions plotted against the time steps (x-axis).

First, the baseline commits to As-I as soon as the confidence reaches a high enough value (the IT threshold), or to As-S when the time consumed exceeds the GT threshold. In Case 1 the agent decides to skip (As-S) because the time consumed has exceeded the GT threshold, instead of waiting longer, which would allow a more distinguishing description to come from the human Director.

Second, its performance can be negatively affected by instability in the partial ASR results. Examples of partial ASR results are shown in Figure 8 in the Appendix. In Case 2, the agent could learn to wait for longer time intervals as the ASR partial outputs become more stable.

Third, the baseline only commits at high confidence values. Case 3 shows an instance where the agent can save time by committing to a selection at a much lower confidence value.

Fourth, as we can see from Algorithm 1, the baseline policy does not use "combinations" (or joint values) of time and confidence to make detailed decisions.

Perhaps using RL can not only help the agent learn a more complex strategy but could also provide insights into developing a better engineered policy which would not have been intuitive for a dialogue designer to come up with. That is, RL could potentially help in building better rules that would be much easier to incorporate into the agent and thus improve its performance. For example, is there a combination of time and confidence which is not currently used by the baseline, i.e., not committing at some initial time slices for high confidence values and committing at lower confidence values as the user consumes more time?

2 All the significance tests are performed using Student's t-test.

3 Design of the RL Policy

The incremental policy decision making is modeled as an MDP (Markov decision process), i.e., a tuple $(S, A, TP, R, \gamma)$. $S$ is the set of states that the agent can be in. In this task a state is represented by the features $(P^*_t, t_c)$, where $P^*_t$ is the highest confidence score assigned by the NLU to any image in the image set ($0.0 \le P^*_t \le 1.0$) and $t_c$ is the time consumed for the current TI ($0.0 \le t_c \le 45.0$).3 RL learns a policy $\pi$ mapping states to actions, $\pi: S \to A$, where $A$ = {As-I, As-S, WAIT} are the actions to be performed by the agent to maximize the overall reward in the game. The As-I and As-S actions map to their corresponding utterances. $R$ is the reward function and $\gamma$ is a discount factor weighting long-term rewards. $TP$ is the set of transition probabilities after taking an action.

When the agent is in state $S_t = (P^*_t, t_c)$, executing the WAIT action moves it to state $S_{t+1}$, which corresponds to a new partial $d_{t+1}$, i.e., a new utterance (see Section 2.2), and thus yields new values of $P^*_t$ and $t_c$ for the given episode. The As-I and As-S actions result in goal states for the agent. Separate policies are trained per image set, similar to the baseline. The difference between the As-I and As-S actions lies in the rewards assigned. The reward function $R$ is as follows. After the agent performs the As-I action, it receives a high positive reward for a correct image selection and a high negative penalty for a wrong selection. This encourages the agent to learn to guess at the right point in time. There is a small positive reward of $\delta$ for WAIT actions, to encourage the agent to wait before committing to As-I selections. No reward is provided for the As-S action. This is to discourage the agent from choosing to skip and scoring points by chance, while at the same time not penalizing the agent for wanting to skip when it is really necessary. The reward for As-S prevents the agent from receiving heavy negative penalties in cases where the wrong images are selected by the NLU; in those cases the confidence would probably be low and thus the agent would not commit to As-I but choose the action As-S instead.

3 Each round lasts a maximum of 45 seconds.
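For concreteness, a minimal sketch of this state/action space and of the WAIT transition is given below; the class and helper names are our own, not the paper's.

```python
# Minimal sketch of the state/action space and episode dynamics (names assumed).
from dataclasses import dataclass

ACTIONS = ("WAIT", "As-I", "As-S")

@dataclass(frozen=True)
class State:
    p_star: float   # highest NLU confidence P*_t for the current partial, in [0, 1]
    t_c: float      # time consumed on the current TI, in [0, 45] seconds

def step(state, action, next_partial):
    """WAIT advances to the state of the next ASR partial; As-I / As-S are terminal."""
    if action == "WAIT":
        return State(next_partial.p_star, next_partial.t_c), False
    return state, True   # the agent commits (As-I) or skips (As-S): goal state
```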


[Figure 2 shows three panels (Cases 1–3), each plotting NLU confidence (y-axis, 0–1) against time (x-axis) for a human TI description ("Yeah so it's a gray cat with blue eyes", "It's a cat laying with eyes closed black", "Cat blue eyes on blanket black stripes like tiger"), contrasting the baseline action (As-S in Case 1, As-I otherwise) with a better As-I action; the baseline scores 0, 0, 1 points in the three cases, while the better actions score 1, 1, 1.]

Figure 2: Examples where the agent can do better. The red boxes show the wrong selection by the agent.

$$
R = \begin{cases}
+\delta & \text{if the action is WAIT} \\
+100 & \text{if As-I is right} \\
-100 & \text{if As-I is wrong} \\
0 & \text{if the action is As-S}
\end{cases}
$$
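In code, this reward function could be written as follows; the numeric value of the small WAIT reward $\delta$ is not reported in the paper, so the constant below is only a placeholder.

```python
# Direct transcription of the reward function above; DELTA is a placeholder value.
DELTA = 1.0

def reward(action, as_i_correct=False):
    if action == "WAIT":
        return DELTA
    if action == "As-I":
        return 100.0 if as_i_correct else -100.0
    return 0.0   # As-S is neither rewarded nor penalized
```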

In this work we use the Least Squares Policy Iteration (LSPI) (Lagoudakis and Parr, 2003) RL algorithm, implemented in the BURLAP4 Java code library, to learn the optimal policy. LSPI is a sample-efficient, model-free, off-policy method that combines policy iteration with linear value function approximation. LSPI in our work uses State-Action-Reward-State (SARS) transitions sampled from the human interaction data (HH and HA). We use a Gaussian radial basis function (RBF) representation for the confidence ($P^*_t$) and time consumed ($t_c$) features. We treat the state features as continuous values within the bounds defined above, i.e., $0.0 \le P^*_t \le 1.0$ and $0.0 \le t_c \le 45.0$. We define 10 basis functions distributed uniformly for the confidence feature ($P^*_t$) and 45 basis functions for the time consumed feature ($t_c$). A basis function returns a value between 0 and 1, with a value of 1 when the query state has a distance of zero from the function's "center" state; as the state gets further away, the basis function's returned value degrades towards zero.

4 http://burlap.cs.brown.edu/
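As an illustration of this featurization (not the authors' code), the sketch below builds such a Gaussian RBF feature vector over the two state features and replicates it per action so that a linear Q-function can differ across actions; the RBF widths and the per-action replication are assumptions.

```python
# Sketch of an RBF feature map: 10 centers over confidence in [0, 1] and
# 45 centers over time consumed in [0, 45], replicated per action.
import numpy as np

CONF_CENTERS = np.linspace(0.0, 1.0, 10)
TIME_CENTERS = np.linspace(0.0, 45.0, 45)
ACTIONS = ("WAIT", "As-I", "As-S")

def rbf(value, centers, width):
    # 1.0 when the value sits on a center, decaying towards 0 with distance
    return np.exp(-((value - centers) ** 2) / (2.0 * width ** 2))

def phi(state, action):
    p_star, t_c = state                     # state = (P*_t, t_c)
    feats = np.concatenate([rbf(p_star, CONF_CENTERS, 0.1),
                            rbf(t_c, TIME_CENTERS, 1.0)])
    out = np.zeros(feats.size * len(ACTIONS))
    idx = ACTIONS.index(action)
    out[idx * feats.size:(idx + 1) * feats.size] = feats
    return out

N_FEATURES = (CONF_CENTERS.size + TIME_CENTERS.size) * len(ACTIONS)
```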

Initial experimentation with the vanilla Q-learning algorithm (Sutton and Barto, 1998) did not yield good results, due to the very large state space and the consequent data sparsity. Binning the features, in order to transform their continuous values into discrete values and thus reduce the size of the state space, did not help either. That is, having a large number of bins did not solve the data sparsity problem, and having a small number of bins made it much harder to learn fine-grained distinctions between states. Note that LSPI is generally considered a more sample-efficient algorithm than Q-learning.

We run LSPI with a discount factor of 0.99 until convergence occurs or a maximum of 50 iterations is reached, whichever happens first. We use the 250k available SARS transitions from the HH and HA interactions to train the policy. LSPI returns a greedy-Q policy which we use on the test data.
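The implementation in this work relies on BURLAP's LSPI; as a language-agnostic illustration (not the BURLAP API), the following Python sketch shows the LSTDQ update at the heart of LSPI, assuming a batch of (s, a, r, s', done) transitions and a feature map such as the `phi` sketched above.

```python
# Compact LSPI sketch: repeatedly fit a linear Q-function for the current
# greedy policy (LSTDQ) until the weights stop changing.
import numpy as np

def lstdq(transitions, phi, n_features, policy, gamma=0.99):
    """One LSTDQ step: fit linear Q-function weights for the given policy."""
    A = np.zeros((n_features, n_features))
    b = np.zeros(n_features)
    for s, a, r, s_next, done in transitions:
        x = phi(s, a)
        x_next = np.zeros(n_features) if done else phi(s_next, policy(s_next))
        A += np.outer(x, x - gamma * x_next)
        b += r * x
    return np.linalg.lstsq(A, b, rcond=None)[0]

def lspi(transitions, phi, n_features, actions, gamma=0.99, max_iter=50, tol=1e-4):
    """Least Squares Policy Iteration over a fixed batch of SARS transitions."""
    w = np.zeros(n_features)
    greedy = lambda s: max(actions, key=lambda a: phi(s, a) @ w)  # greedy w.r.t. current w
    for _ in range(max_iter):
        w_new = lstdq(transitions, phi, n_features, greedy, gamma)
        converged = np.linalg.norm(w_new - w) < tol
        w = w_new
        if converged:
            break
    return w, greedy
```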

Figure 3 shows the modus operandi of the policy in this domain. For every time step the ASR provides a 1-best partial hypothesis for the speech uttered by the test user. This partial speech recognition hypothesis is input to the NLU module, which returns the confidence value ($P^*_t$). The time consumed ($t_c$) for the current TI is tracked by the game logic.


Figure 3: Actions taken by the baseline and the RL agent.

            pets        zoo         kitten      cocktail    bikes       yoga        necklace
            PPS    P    PPS    P    PPS    P    PPS    P    PPS    P    PPS    P    PPS    P
Baseline    0.22  37    0.28  27    0.14  14    0.18  23    0.09  13    0.20   3    0.20   4
RL agent    0.23  39    0.31  32    0.13  16    0.19  25    0.14  22    0.11  18    0.12  20

Table 2: Comparison of points per second (PPS) and points (P) earned by the baseline and the RL agent on the test set.

4 Experimental Setup

For testing, we use the held-out real user conversation data from the HH and HA datasets. The IT and GT thresholds for the baseline Eve were also retrained (Paetzel et al., 2015) using the same data and NLU as used to train the RL policy. Figure 3 shows the setup for testing and comparing the actions of the RL policy and the baseline. Every ASR partial corresponds to a state. For every ASR partial we obtain the highest assigned confidence score from the NLU, use the time consumed feature from the game, and obtain the action from the policy. If the action chosen by the policy is WAIT then we sample the next state. For each pair of confidence and time consumed values we obtain the actions from the baseline and the RL policy separately and compare them with the ground truth to evaluate which policy performs better. Once the policy decides to take either the As-I or As-S action, we advance the simulated game time by an additional interval of 750ms or 1500ms respectively. This simulates the conditions of the real user game, where we found that users take on average 500ms to click the button to load the next set of TIs, and the agent takes 250ms to say the As-I utterance and 1000ms to say the As-S utterance. The next TI is loaded at this point and the process is repeated until the game time runs out for each user round.
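Under these assumptions, the offline evaluation loop can be sketched as follows; the episode/partial attribute names are hypothetical, and the 750ms/1500ms advances correspond to the delays described above.

```python
# Sketch of the offline simulation loop: step through the recorded ASR partials
# of each TI episode, query the policy, and advance simulated game time.
def simulate_round(episodes, policy, round_seconds=45.0):
    clock, points = 0.0, 0
    for episode in episodes:                  # recorded TI sub-dialogues, in game order
        action, elapsed, correct = "As-S", 0.0, False
        for partial in episode:               # assumed fields: p_star, t_c, is_correct_guess
            action = policy(partial.p_star, partial.t_c)
            elapsed, correct = partial.t_c, partial.is_correct_guess
            if action != "WAIT":
                break                         # the policy commits or skips
        if action == "As-I":
            points += 1 if correct else 0
            clock += elapsed + 0.75           # 500ms button click + 250ms "Got it"
        else:                                 # As-S (or the policy waited to the end)
            clock += elapsed + 1.5            # 500ms button click + 1000ms skip utterance
        if clock >= round_seconds:
            break                             # the round's game time is up
    return points, clock
```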

5 Results

The policy learned using RL (LSPI with RBF functions) performs significantly better (p<0.01) at scoring points than the baseline agent in offline simulations. The RL policy also takes relatively more time to commit (As-I or As-S) than the baseline.5 The idea of setting the IT and GT threshold values in the baseline (Section 2.2) originally aimed at scoring points rapidly in the game, i.e., the baseline agent was optimized to score the highest number of points per second (PPS). The PPS metric measures how effective the agent is at scoring points overall, and is calculated as the total points scored by the agent divided by the total time consumed.

5 p=0.06; we cannot claim that the time taken is significantly higher, but there is a trend.


[Figure 4 plots time consumed against average points scored for the Baseline and RL (LSPI) policies.]

Figure 4: The RL policy scores significantly more points than the baseline by investing slightly more time (graph generated for one of the image sets).

Table 2 shows the points per second and the total points scored on some of the image sets by the baseline and the RL policy. We can observe that the RL policy consistently scores more points than the baseline; however, this comes at the cost of additional time. By scoring more points overall than the baseline, the RL policy also scores higher on the PPS metric (p<0.05). Table 3 shows the total points scored and the total time spent across all the users by the baseline and the agent. Each set here refers to one round in a game.

        Baseline          RL
Set     P     t (s)       P     t (s)
1       96    510.8       107   528.1
2       75    525.0       85    537.9
3       42    298.9       74    595.2
4       49    531.9       76    592.3

Table 3: The points scored (P) and the time consumed (t) in seconds for different image sets (Set).

Figure 5: Decisions of the RL policy (in blue) vs. the baseline policy (in red).

Figure 4 depicts this result for an image set of bikes (images shown in Figure 1). We plot the total time spent by the agent against the total points scored. Clearly, the RL policy manages to score more points than the baseline in a given amount of time. In order to understand the differences between the actions taken by the RL policy and the baseline policy, we plot, on a 3-dimensional scatter plot, the action taken by the policy for confidence values between 0 and 1 (spaced at 0.1 intervals) and time consumed between 0s and 15s (spaced at 100ms intervals) for one of the image sets (bikes). Figure 5 shows the decisions made by the RL policy (in blue) compared to the decisions made by the baseline (in red). As we can see, there is not much variety in the decisions of the baseline policy; it basically uses thresholds (see Algorithm 1) optimized using real user data. Below we summarize our observations regarding the actions taken by the RL policy.

i) Regardless of whether the confidence value is high or low, the RL policy learns to wait at low values of time consumed. This may help the RL policy avoid the problem illustrated in Case 2 of Figure 2, where instability in the early ASR results for a description can lead an incorrect guess to be momentarily associated with high confidence. The RL policy is more keen on waiting and decides to commit early only when the confidence value is really high (almost 1.0). ii) Requiring a lower degree of confidence when the time consumed is high was also found to be an effective strategy for scoring more points in the game. Thus the RL policy learns to guess (As-I) even at lower confidence values when the time consumed reaches high values. This combination of time and confidence values helps the RL agent perform better w.r.t. points and consequently PPS in the task.

It is also important to note that the agent does not wait forever to make its selection. The human TI descriptions are collected from real user gameplay that lasts for a limited number of time steps. That is, the maximum number of points that the RL policy can score in simulation is limited by the number of images described in the real user gameplay. If the agent keeps choosing the WAIT action beyond this point, it fails to gather high rewards, as the As-I action is never selected. By virtue of this design feature, the RL agent has implicitly learned the notion of playing the game at a high pace.

Note also that the RL agent has not learned to always commit at a later time than the baseline. Table 4 shows the percentage of times (in the test games) where the RL policy chooses a different strategy than the baseline.


% times same commit times              48.06
% times baseline has faster commit     44.77
% times RL has faster commit            7.17

Table 4: Comparison of commit strategies between baseline and RL (%).

We can see that the RL policy commits at the same time instances as the baseline about 48% of the time. 44.77% of the time the baseline commits to the TI faster, and about 7% of the time the RL policy decides to commit to the TI earlier than the baseline.

6 Discussion & Future Work

The cases shown in Figure 2 provide examples of how the RL policy can outperform the baseline. i) As the RL agent has learned not to commit to a decision early, it can wait long enough to observe more user words and thus reach higher confidence (Case 1). ii) The RL agent is not keen on committing when it sees an early high confidence value (like the IT for the baseline) but rather waits, which may allow the ASR partials to become more stable (Case 2). iii) The RL agent also learns to commit at low confidence values as the time consumed increases, sometimes even committing earlier than the baseline (Case 3).

6.1 Contrasting Baseline and RL Policy Building Efforts

Building an SDS with carefully crafted rules has often been criticized as a laborious and time consuming exercise. This is in contrast to the alternative data-oriented approaches, which are often argued to require less time to engineer a solution and to be more scalable. Development of the baseline system's policy component took an NLP researcher approximately two months, including experimentation with alternative rule structures and development of the parameter optimization framework. Note that this effort does not include data collection. The same amount of effort was put into developing the RL policy by a researcher with similar skills. Building the RL policy involved experimenting with various reward functions to suit the task. Though the reward function is simplistic in our case, a high negative reward for wrong As-I actions was required for RL to learn useful policies. It also takes effort and experimentation to select the right algorithm (LSPI with value function approximation vs. vanilla Q-learning). It is thus hard to claim which approach is more time-efficient (in terms of development effort). Figure 7 in the Appendix shows a comparison of the baseline policy and the RL policy learned with the vanilla Q-learning algorithm, which did not perform well; it performed worse than the baseline. We also need to keep in mind that: i) We cannot claim that the rules learned by the RL policy could not be implemented in the hand-crafted system. Bounds on the time and confidence (for example: do not commit as soon as the confidence exceeds a threshold but rather wait for a few additional time steps; it is okay to commit at lower confidence values for higher time values to perform better, etc.) can be included in Algorithm 1 and the system can be deployed with ease. ii) It usually takes time and effort to build a common infrastructure for experimenting with the two strategies. In this case, experimenting with the incremental RL policy was simpler because the infrastructure and the methodology existed from the previous work by Manuvinakurike et al. (2015) and Paetzel et al. (2015). Despite the fact that both approaches required similar development effort, in the end RL did learn a better strategy automatically, at least in our offline simulations (based on real user data). RL also provides advantages compared to the baseline method. Adding new constraints to the baseline can be hard, because the baseline method uses exhaustive grid search to set its parameter values, and it might be exponentially costly to do this with more constraints. On the other hand, RL is more scalable, as adding features is relatively easy with RL.
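For illustration, a hypothetical hybrid rule of this kind (a time-dependent confidence threshold plus a give-up time) could look as follows; the numeric thresholds are illustrative only, not values learned in the paper.

```python
# Hypothetical hybrid rule inspired by the RL policy: demand near-certain
# confidence early, relax the confidence threshold as time is consumed,
# and skip once the give-up time is reached. Thresholds are illustrative.
def hybrid_policy(p_star_t, t_c, GT=18.0):
    if t_c >= GT:
        return "As-S"                                       # give up, request the next TI
    if t_c < 2.0:
        threshold = 0.95                                    # early: commit only when almost certain
    else:
        threshold = max(0.5, 0.95 - 0.03 * (t_c - 2.0))     # relax the threshold over time
    return "As-I" if p_star_t > threshold else "WAIT"
```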

6.2 Future Work

In this work we have shown that RL has the potential to learn policies for making incremental decisions that yield better results than a high performance CDR baseline. Our experiments were performed in simulation (albeit using real user data), and the next step is to investigate whether these improvements transfer to real-time experiments (real-time interaction of the agent with human users). Another interesting avenue for future work is to implement a hybrid approach of engineering a hand-crafted policy using the intuitions learned from RL. There are still regions of the state space that were not fully explored by RL. On the other hand, as we saw, RL can potentially learn interesting policies which would not have been intuitive for a dialogue designer to come up with. Therefore, we plan to explore incorporating intuitions from the RL policy into the high performance CDR baseline and see which avenue would be more fruitful and whether we can get the best of both worlds.


Finally, another idea for future work is to experiment with Inverse Reinforcement Learning (Abbeel and Ng, 2004; Nouri et al., 2012; Kim et al., 2014) in order to potentially learn a better reward function directly from the data.

Acknowledgments

This work was supported by the National Science Foundation under Grant No. IIS-1219253 and by the U.S. Army Research Laboratory (ARL) under contract number W911NF-14-D-0005. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views, position, or policy of the National Science Foundation or the United States Government, and no official endorsement should be inferred. We would like to thank Maike Paetzel. Thanks to Hugger Industries, Eric Sorenson, and fixedgear for images published under CC BY-NC-SA 2.0. Thanks to Eric Parker and cosmo flash for images published under CC BY-NC 2.0, and to Richard Masonder / Cyclelicious and Florian for images published under CC BY-SA 2.0. Thanks to Wikimedia for images published under CC BY-SA 2.0.

References

Pieter Abbeel and Andrew Y. Ng. 2004. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML). Banff, Alberta, Canada.

Timo Baumann and David Schlangen. 2013. Open-ended, extensible system utterances are preferred, even if they require filled pauses. In Proceedings of the Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL). Metz, France, pages 280–283.

Okko Buß, Timo Baumann, and David Schlangen. 2010. Collaborating on utterances with a spoken dialogue system using an ISU-based approach to incremental dialogue management. In Proceedings of the Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL). pages 233–236.

Nina Dethlefs, Helen Hastie, Heriberto Cuayáhuitl, Yanchao Yu, Verena Rieser, and Oliver Lemon. 2016. Information density and overlap in spoken dialogue. Computer Speech & Language 37:82–97.

Nina Dethlefs, Helen Hastie, Verena Rieser, and Oliver Lemon. 2012a. Optimising incremental dialogue decisions using information density for interactive systems. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL). Jeju Island, South Korea, pages 82–93.

Nina Dethlefs, Helen Hastie, Verena Rieser, and Oliver Lemon. 2012b. Optimising incremental generation for spoken dialogue systems: Reducing the need for fillers. In Proceedings of the International Natural Language Generation Conference (INLG). Utica, IL, USA, pages 49–58.

David DeVault, Kenji Sagae, and David Traum. 2011. Incremental interpretation and prediction of utterance meaning for interactive dialogue. Dialogue & Discourse 2(1).

Fabrizio Ghigi, Maxine Eskenazi, M. Ines Torres, and Sungjin Lee. 2014. Incremental dialog processing in a task-oriented dialog. In Proceedings of INTERSPEECH. Singapore, pages 308–312.

Helen Hastie, Marie-Aude Aufaure, Panos Alexopoulos, Heriberto Cuayáhuitl, Nina Dethlefs, Milica Gašić, James Henderson, Oliver Lemon, Xingkun Liu, Peter Mika, Nesrine Ben Mustapha, Verena Rieser, Blaise Thomson, Pirros Tsiakoulis, Yves Vanrompay, Boris Villazon-Terrazas, and Steve Young. 2013. Demonstration of the Parlance system: a data-driven, incremental, spoken dialogue system for interactive search. In Proceedings of the Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL). Metz, France, pages 154–156.

Hatim Khouzaimi, Romain Laroche, and Fabrice Lefèvre. 2015. Optimising turn-taking strategies with reinforcement learning. In Proceedings of the Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL). Prague, Czech Republic, pages 315–324.

Dongho Kim, Catherine Breslin, Pirros Tsiakoulis, Milica Gašić, Matthew Henderson, and Steve Young. 2014. Inverse reinforcement learning for micro-turn management. In Proceedings of INTERSPEECH. Singapore, pages 328–332.

Michail G. Lagoudakis and Ronald Parr. 2003. Least-squares policy iteration. Journal of Machine Learning Research 4:1107–1149.

Ramesh Manuvinakurike and David DeVault. 2015. Pair Me Up: A web framework for crowd-sourced spoken dialogue collection. In Natural Language Dialog Systems and Intelligent Assistants, pages 189–201.

Ramesh Manuvinakurike, Maike Paetzel, and David DeVault. 2015. Reducing the cost of dialogue system training and evaluation with online, crowd-sourced dialogue data collection. In Proceedings of the Workshop on the Semantics and Pragmatics of Dialogue (SemDial). Gothenburg, Sweden.

Ramesh Manuvinakurike, Maike Paetzel, Cheng Qu, David Schlangen, and David DeVault. 2016. Toward incremental dialogue act segmentation in fast-paced interactive dialogue systems. In Proceedings of the Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL). Los Angeles, CA, USA, pages 252–262.

Elnaz Nouri, Kallirroi Georgila, and David Traum. 2012. A cultural decision-making model for negotiation based on inverse reinforcement learning. In Proceedings of the Annual Meeting of the Cognitive Science Society (CogSci). Sapporo, Japan, pages 2097–2102.

Maike Paetzel, Ramesh Manuvinakurike, and David DeVault. 2015. "So, which one is it?" The effect of alternative incremental architectures in a high-performance game-playing agent. In Proceedings of the Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL). Prague, Czech Republic, pages 77–86.

Maike Paetzel, David Nicolas Racca, and David DeVault. 2014. A multimodal corpus of rapid dialogue games. In Proceedings of the Language Resources and Evaluation Conference (LREC). Reykjavik, Iceland, pages 4189–4195.

David Schlangen, Timo Baumann, and Michaela Atterer. 2009. Incremental reference resolution: The task, metrics for evaluation, and a Bayesian filtering model that is sensitive to disfluencies. In Proceedings of the Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL). London, UK, pages 30–37.

Ethan Selfridge, Iker Arizmendi, Peter Heeman, and Jason Williams. 2013. Continuously predicting and processing barge-in during a live spoken dialogue task. In Proceedings of the Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL). Metz, France, pages 384–393.

Ethan Selfridge and Peter Heeman. 2010. Importance-driven turn-bidding for spoken dialogue systems. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL). Uppsala, Sweden, pages 177–185.

Ethan O. Selfridge, Iker Arizmendi, Peter A. Heeman, and Jason D. Williams. 2012. Integrating incremental speech recognition and POMDP-based dialogue systems. In Proceedings of the Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL). Seoul, South Korea, pages 275–279.

Gabriel Skantze and David Schlangen. 2009. Incremental dialogue processing in a micro-domain. In Proceedings of the 12th Conference of the European Association for Computational Linguistics (EACL). Athens, Greece, pages 745–753.

Richard S. Sutton and Andrew G. Barto. 1998. Reinforcement Learning: An Introduction. MIT Press, Cambridge.

Appendix

Example dialogue

Figure 6: Example dialogue for an episode in the human-agent corpus for the same TI as in Figure 1.

Policy differences

[Figure 7 plots the policy action (WAIT, As-S, As-I) over confidence and time.]

Figure 7: Policy learned by the Vanilla Q-learning algorithm (blue) compared to the baseline (red).


ASR partials

Figure 8: Actions taken by the baseline and the RL agent for the 1-best ASR increments. The image set is also shown.
