

Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2200–2209, Copenhagen, Denmark, September 7–11, 2017. © 2017 Association for Computational Linguistics

Affordable On-line Dialogue Policy Learning

Cheng Chang∗, Runzhe Yang∗, Lu Chen, Xiang Zhou and Kai Yu
Key Lab. of Shanghai Education Commission for Intelligent Interaction and Cognitive Eng.

SpeechLab, Department of Computer Science and Engineering
Brain Science and Technology Research Center
Shanghai Jiao Tong University, Shanghai, China

{cheng.chang,yang runzhe,chenlusz,owenzx,kai.yu}@sjtu.edu.cn

Abstract

The key to building an evolvable dialogue system in real-world scenarios is to ensure affordable on-line dialogue policy learning, which requires the on-line learning process to be safe, efficient and economical. In reality, however, due to the scarcity of real interaction data, the dialogue system usually grows slowly. Moreover, a poor initial dialogue policy easily leads to bad user experience and fails to attract users to contribute training data, so the learning process is unsustainable. To depict this accurately, two quantitative metrics are proposed to assess the safety and efficiency issues. To solve the unsustainable learning problem, we propose a complete companion teaching framework incorporating guidance from a human teacher. Since human teaching is expensive, we compare various teaching schemes that answer the questions of how and when to teach, in order to use the teaching budget economically and make the on-line learning process affordable.

1 Introduction

A task-oriented dialogue system is designed to interact with human users to accomplish several predefined domains or tasks (Young et al., 2013; Daubigney et al., 2012). The dialogue manager is the core component of a typical dialogue system: it controls the flow of dialogue through a state tracker and a policy module (Levin et al., 1997). The state tracker tracks the internal state of the system, while the policy module decides the response to the user according to the tracked state (Sun et al., 2014a; Thomson and Young, 2010).

∗Both authors contributed equally to this work.

The approaches to constructing a policy module can be divided into two categories: rule-based and statistical. Rule-based policies are usually hand-crafted by domain experts, which makes them inconvenient and difficult to optimize (Williams and Young, 2007; Wang and Lemon, 2013). In recent mainstream statistical studies, the Partially Observable Markov Decision Process (POMDP) framework has been applied to model dialogue management with unobservable states, where policy training can be formulated as a Reinforcement Learning (RL) problem, which enables the policy to be optimized automatically (Kaelbling et al., 1998; Arnold, 1998; Young et al., 2013).

Though RL-based approaches have the potential to improve themselves as they interact more with human users and achieve better performance than rule-based approaches, they are rarely used in real-world applications, especially in on-line scenarios, since the training process is unsustainable.

[Figure: the vicious cycle of unsustainable on-line learning — a poor (initial) policy causes bad user experience, which leads to insufficient real users (data). Possible points to break the cycle: unsafe policy behavior (solvable) and inefficient learning process (solvable); individual rationality is unsolvable.]

The main causes of unsustainable on-line dialogue policy learning are two-fold:

• Safety issue: the initial policy trained from scratch may lead to terrible user experience in the early training period, and thus fail to attract sufficient users for more dialogues to do further policy training.

• Efficiency issue: if the progress of policy learning is not efficient enough, it will exhaust users' patience before the policy reaches a desirable performance level.



Prior works have mainly focused on improving efficiency, such as Gaussian Process RL (Gasic et al., 2010) and deep RL (Fatemi et al., 2016). For deep RL approaches, recent research on the student-teacher RL framework has shown prominent acceleration of the policy learning process (Torrey and Taylor, 2013; Williams and Zweig, 2016; Amir et al., 2016). In such a framework, the teacher agent instructs the student agent by providing suggestions on what actions should be taken next (Clouse, 1996).

For the safety issue, Chen et al. (2017) developed several teaching strategies answering "how" the human teacher should guide the learning process.

However, those previous teaching methods exclude "when" to teach from consideration. They simply exhaust all the budget continuously from the beginning, which is wasteful and places a heavy workload on the human teacher. Affordable dialogue policy learning with human teaching should require a lighter workload and utilize the teaching budget economically.

Furthermore, as for safety and efficiency evaluation, previous works have observed the training and testing curves to tell which one is better, or evaluated policy performance after a certain number of training dialogues, both of which are subjective and error-prone (Chen et al., 2015a; Su et al., 2016; Chen et al., 2017).

Our contribution is to address the above problems. We propose a complete framework of companion teaching, and develop various teaching schemes which combine different teaching strategies and teaching heuristics to answer the questions of "how" and "when" to teach, in order to achieve affordable dialogue policy learning (section 2). Specifically, a novel failure-prognosis-based teaching heuristic is proposed, where MultiTask Learning (MTL) is utilized to predict the dialogue success reward (section 3). To avoid the drawbacks of traditional subjective measurements, we propose two evaluation metrics, called Risk Index (RI) and Hitting Time (HT), to quantify the safety and efficiency of on-line policy learning respectively (section 4). Simulation experiments show that, with the proposed companion teaching schemes, sustainable and affordable on-line dialogue policy learning can be achieved (section 5).

2 Companion Teaching Framework

The companion teaching framework is an on-line policy training framework with three intelligent participants: the machine dialogue manager, the human user, and the human teacher (Chen et al., 2017). Under this framework, the human teacher is able to accompany the dialogue manager to guide policy learning with a limited teaching budget. Based on the real work mode in a call center, this framework makes the reasonable assumption that the human teacher has access to the extracted dialogue states from the dialogue state tracker as well as the system's dialogue act, and can also reply in the same format.

However, there are two major problems in the previous framework. First, the system judges whether a dialogue session succeeds by several simple rules and then determines whether to feed a success reward signal to the dialogue manager for reinforcement learning. In fact, the success feedback made by the system lacks flexibility and credibility, and it could mislead policy learning; a more suitable judge is the user or the human teacher. Second, the previous framework only answers in which way the human teacher can guide on-line dialogue policy learning, but another essential question, when the human teacher should give guidance, remains undiscussed.

[Figure: wiring diagram of the dialogue manager (dialogue state tracking, policy model with parameters θ, reward function) connected to the human user through the input (ASR/SLU) and output (NLG/TTS) modules, with the human teacher attached via switch positions 1, 2 and 3 and the reward signals r^usr_t and r^tea_t.]

Figure 1: Companion Teaching Framework for On-line Policy Learning

In this paper, we propose a complete framework of companion teaching, depicted in Figure 1. At each turn, the input module receives a speech input from the human user and produces possible utterances $u_t$ of the speech in text. After that, the dialogue state tracker extracts the dialogue state $s_t$ from the possible utterances. This dialogue state is shared with the policy model and, if needed, with the human teacher. When the final response $a_t$ has been determined, the output module translates this dialogue act into natural language and replies to the human user.



The success signal is fed back by the user or the human teacher as an important part of the system reward at the end of each session. The human teacher can take the initiative or be activated by a student-initiated heuristic to give dialogue guidance with strategies corresponding to different configurations of the switches in the illustration. We call a combination of strategy and heuristic a teaching scheme.

2.1 Teaching Strategies

The teacher can choose among three teaching strategies corresponding to different configurations of the switches in the wiring diagram shown in Figure 1. The left switch is a Single-Pole, Double-Throw (SPDT) switch, which controls whether the answer is made by the system (connected to 1) or given by the teacher as an example (connected to 2). The right switch is a simple on-off switch, which represents whether there is an extra reward signal from the teacher (ON) or not (OFF). The strategy related to the right switch is called teaching via Critic Advice (CA), also known as turn-level reward shaping (Thomaz and Breazeal, 2008; Judah et al., 2010). When the switch at position 3 is turned on, the teacher gives the policy model an extra turn-level reward to distinguish good actions from bad ones. The left switch corresponds to teaching via Example Action (EA), which means the teacher gives an example action for the student to take according to the student's state.

The third strategy, proposed by Chen et al. (2017), takes advantage of both EA and CA and is named teaching via Example Action with Predicted Critique (EAPC). With this strategy, the human teacher gives example actions; meanwhile, a weak action predictor is trained using this teaching information to provide the extra reward even in the teacher's absence.
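To make the three strategies concrete, the following is a minimal sketch (our own illustration, not code from the paper; all names are hypothetical) of how each strategy could change the (state, action, reward) triple that the learner stores for a turn.

```python
from enum import Enum

class Strategy(Enum):
    CA = "critic_advice"       # extra turn-level reward from the teacher
    EA = "example_action"      # teacher supplies the action to execute
    EAPC = "ea_pred_critique"  # EA plus a weak predictor for extra reward

def build_transition(strategy, state, student_action, teacher_action,
                     env_reward, teacher_reward=0.0, predictor=None):
    """Assemble the transition the learner stores; purely illustrative."""
    if strategy is Strategy.CA:
        # Student's own action is executed; the teacher shapes the reward.
        return state, student_action, env_reward + teacher_reward
    if strategy is Strategy.EA:
        # The teacher's example action replaces the student's choice.
        return state, teacher_action, env_reward
    if strategy is Strategy.EAPC:
        # Example action is used; a weak predictor trained on the teacher's
        # examples adds a predicted critique when the teacher is absent.
        extra = predictor(state, teacher_action) if predictor else 0.0
        return state, teacher_action, env_reward + extra
    raise ValueError(strategy)
```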

2.2 Teaching Heuristics

The strategies only answer how the human teacher can offer companion teaching to the system. However, the timing of teaching should not be ignored if the limited teaching budget is to be used well. Exhausting all the budget at the early training stage, called the Early teaching heuristic (Early), is simple and straightforward but wastes teaching opportunities on unnecessary cases. Thus, it is imperative to design effective heuristics that decide when the teacher should give the student a hand.

In addition to early teaching, teaching heuristics can be broadly divided into two categories: teacher-initiated heuristics and student-initiated heuristics (Amir et al., 2016). However, the teacher-initiated approaches require the constant long-term attention of the teacher to monitor the dialogue process (Torrey and Taylor, 2013; Amir et al., 2016), which is costly and impractical for real applications. Therefore, in this paper, we only discuss student-initiated heuristics, shown as the line with a stopwatch in Figure 1, which means that the student agent decides when to ask for the teacher's help.

Previous works have presented several effective heuristics based on state importance, $I(s)$, which is determined by the Q-values of the RL agent:

$I(s) = \max_a Q(s,a) - \min_a Q(s,a)$

Torrey and Taylor (2013) proposed the State Importance based Teaching heuristic (SIT), which makes the student ask for advice only when the current state is important:

$I(s) > t_{si}$, (1)

where $t_{si}$ is a fixed threshold for importance. Clouse (1996) proposed the State Uncertainty based Teaching heuristic (SUT), which asks for advice when the student is uncertain about which action to take:

$I(s) < t_{su}$, (2)

where $t_{su}$ is a given threshold for uncertainty.

Though teaching effort can be conserved by only applying it to those important or uncertain states, advice may still be wasted if the dialogue is likely to succeed without teaching. In this paper, we propose a novel Failure Prognosis based Teaching heuristic (FPT) for on-line policy learning to reduce such unnecessary advice. The details are given in section 3. For comparison, we also investigate a Random teaching heuristic (Rand), in which the student seeks advice with a fixed probability $p_r$.
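For concreteness, here is a minimal sketch (our own illustration, not the paper's code) of how the student-initiated SIT, SUT, Rand and Early triggers could be computed from a turn's Q-values; the threshold values mirror Table 1.

```python
import random

def state_importance(q_values):
    """I(s) = max_a Q(s, a) - min_a Q(s, a) for one dialogue state."""
    return max(q_values) - min(q_values)

def should_ask_teacher(q_values, heuristic, t_si=5.0, t_su=10.0, p_r=0.6):
    """Student-initiated decision to request advice for the current turn.

    Threshold values follow Table 1; the function itself is only a sketch.
    """
    importance = state_importance(q_values)
    if heuristic == "SIT":    # ask only in important states
        return importance > t_si
    if heuristic == "SUT":    # ask when uncertain which action to take
        return importance < t_su
    if heuristic == "Rand":   # ask with a fixed probability
        return random.random() < p_r
    if heuristic == "Early":  # spend the budget from the very beginning
        return True
    raise ValueError(f"unknown heuristic: {heuristic}")
```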

3 Failure Prognosis Based Teaching Heuristic

To make better use of teaching advice, we propose to use an on-line turn-level task success predictor to predict whether the ongoing dialogue will end in success, and to ask for advice only when the current prediction is a failure. The proposed approach utilizes MultiTask Learning (MTL) in the policy model to estimate the future dialogue success reward and is compatible with various RL algorithms. In this paper, we implement the policy model with a Deep Q-Network (DQN), in which a neural network function approximator, named the Q-network, is used to estimate the action-value function (Mnih et al., 2013).

3.1 Multitask Deep Q-Network

The goal of the policy model is to interact with the human user by choosing actions in each turn to maximize future rewards. We define the dialogue state shared by the dialogue state tracker in the $t$-th turn as $s_t$, and the action taken by the policy model under the current policy $\pi_\theta$ with parameters $\theta$ in the $t$-th turn as $a_t$, where $a_t \sim \pi_\theta(\cdot|s_t)$. In an ideal dialogue environment, once the policy model emits an action $a_t$, the human user gives explicit feedback, such as a normal response or feedback on whether the dialogue is successful, which is converted to a reward signal $r_t$ delivered to the policy model immediately, and then the policy model transitions to the next state $s_{t+1}$. The reward $r_t$ is composed of two parts:

$r_t = r_t^{\mathrm{turn}} + r_t^{\mathrm{succ}}$,

where $r_t^{\mathrm{turn}}$ is the turn penalty reward and $r_t^{\mathrm{succ}}$ is the dialogue success reward. Typically, $r_t^{\mathrm{turn}}$ is fixed for each turn as a negative constant $R^{\mathrm{turn}}$, while $r_t^{\mathrm{succ}}$ equals a predefined positive constant $R^{\mathrm{succ}}$ only when the dialogue terminates and receives successful user feedback, and is zero otherwise.
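As a small illustration (not from the paper), the per-turn reward could be assembled as follows; the constants $R^{\mathrm{turn}} = -1$ and $R^{\mathrm{succ}} = 30$ are taken from the experimental settings in section 5.1.

```python
R_TURN = -1.0   # per-turn penalty, encourages short dialogues (section 5.1)
R_SUCC = 30.0   # success reward, paid only at a successful termination

def turn_reward(dialogue_terminated, user_reported_success):
    """r_t = r_t^turn + r_t^succ, a minimal sketch of the reward signal."""
    r_succ = R_SUCC if (dialogue_terminated and user_reported_success) else 0.0
    return R_TURN + r_succ
```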

In the DQN algorithm, all transitions $(s_t, a_t, r_t, s_{t+1})$ are stored in a replay memory $\mathcal{D}$, and the objective is to minimize the mean squared error between the Q-network $Q(s, a; \theta)$ and the Q-learning target $Q^e$. The loss function $L(\theta)$ is defined as:

$L(\theta) = \mathbb{E}_{s,a \sim \pi_\theta}\left[ \left(Q^e - Q(s, a; \theta)\right)^2 \right]$. (3)

During the training period, $Q^e$ is estimated with old fixed parameters $\theta^-$ and sampled transitions $e \sim \mathcal{D}$:

$Q^e = r + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}; \theta^-)$, (4)

where $\gamma$ is the discount factor.
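The following is a minimal sketch of this target and loss computation, written in PyTorch purely for illustration (the paper's implementation uses MXNet); the batch layout and helper names are assumptions.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=1.0):
    """MSE loss between Q(s, a; θ) and the target r + γ max_a' Q(s', a'; θ⁻).

    `batch` is assumed to hold tensors sampled from the replay memory D:
    states (B, state_dim), actions (B,) as int64, rewards (B,),
    next_states (B, state_dim), and dones (B,) marking terminal transitions.
    """
    states, actions, rewards, next_states, dones = batch
    # Q(s, a; θ) for the actions actually taken
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bootstrapped target from the frozen parameters θ⁻ (equation 4);
        # masking terminal states is standard practice, not stated in the paper.
        max_next_q = target_net(next_states).max(dim=1).values
        target = rewards + gamma * (1.0 - dones) * max_next_q
    # Equation 3: squared error between target and current estimate
    return F.mse_loss(q_sa, target)
```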

The reward $Q(s, a)$ estimated by the original Q-learning algorithm is essentially a combination of the future turn penalty reward $Q^{\mathrm{turn}}(s, a)$ and the future dialogue success reward $Q^{\mathrm{succ}}(s, a)$. For a task-oriented dialogue system, the prediction of $Q^{\mathrm{succ}}(s, a)$ is much more important because it reflects the possibility of the dialogue being successful. If these two rewards are estimated separately, the objective of $Q^{\mathrm{succ}}(s, a)$ can be optimized explicitly, and we can gain more insight into the estimated future. We found that, in practice, optimizing these two objectives with MultiTask Learning (MTL) converges faster and more stably than using two separate models; the reason may be that MTL learns different related tasks in parallel using shared representations, which helps each task to be learned better (Caruana, 1997). The structure of the proposed MTL-DQN is depicted in Figure 2.

Figure 2: MTL-DQN structure
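The figure image is not reproduced here. As an illustration of the described structure — one shared hidden layer feeding two task-specific heads for $Q^{\mathrm{turn}}$ and $Q^{\mathrm{succ}}$ (cf. section 5.2) — a sketch could look like the following; the layer sizes are assumptions, since the paper does not report them.

```python
import torch.nn as nn

class MTLDQN(nn.Module):
    """Multitask DQN: shared representation, two heads for Q_turn and Q_succ."""

    def __init__(self, state_dim, n_actions, shared_dim=128, head_dim=64):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(state_dim, shared_dim), nn.ReLU())
        # One dependent hidden layer per task, as described in section 5.2
        self.turn_head = nn.Sequential(
            nn.Linear(shared_dim, head_dim), nn.ReLU(), nn.Linear(head_dim, n_actions))
        self.succ_head = nn.Sequential(
            nn.Linear(shared_dim, head_dim), nn.ReLU(), nn.Linear(head_dim, n_actions))

    def forward(self, state):
        h = self.shared(state)
        q_turn = self.turn_head(h)   # estimate of the future turn-penalty reward
        q_succ = self.succ_head(h)   # estimate of the future dialogue-success reward
        return q_turn, q_succ

    def q_values(self, state):
        # The overall action value is the sum of the two task-specific estimates.
        q_turn, q_succ = self.forward(state)
        return q_turn + q_succ
```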

3.2 Failure Prognosis

In the proposed multitask DQN, we define the on-line task success predictor $T(s_t)$ as:

$T(s_t) = Q^{\mathrm{succ}}(s_t, a_t)$,

where $a_t$ is the action taken under state $s_t$. It is reasonable to assume that the dialogue is going to fail if $T(s_t)$ is relatively small. Based on this task success predictor, we propose a novel student-initiated heuristic, named the Failure Prognosis based Teaching heuristic (FPT).

The key to the proposed heuristic is to define failure prognosis quantitatively. A straightforward way is to set a ratio threshold $\alpha$ and consider it a failure prognosis when $T(s_t) < \alpha R^{\mathrm{succ}}$. However, this assumes that the numerical scale of $Q^{\mathrm{succ}}$ is consistent throughout the training period, which is not always the case, and the student's noisy estimation of $Q^{\mathrm{succ}}$ in the early training period would make the learning process unstable. To smooth the teaching, we instead use a turn-level sliding window near the current state to calculate an average value as a replacement for the fixed $R^{\mathrm{succ}}$.



Thus, in the $t$-th turn, the failure prognosis for the student holds if and only if:

$T(s_t) < \alpha \frac{1}{w} \sum_{j=t-w}^{t-1} T(s_j)$, (5)

where $w$ is the size of the sliding window.
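A minimal sketch (our own illustration) of this sliding-window check, using $\alpha = 1.2$ and $w = 25$ as in Table 1:

```python
from collections import deque

class FailurePrognosis:
    """Turn-level failure prognosis from a sliding window of T(s_j) = Q_succ."""

    def __init__(self, alpha=1.2, window=25):
        self.alpha = alpha
        self.history = deque(maxlen=window)  # recent T(s_j) values

    def is_failure(self, t_st):
        """Return True when equation 5 holds: T(s_t) < alpha * window mean."""
        if not self.history:
            prognosis = False  # no history yet, cannot judge this turn
        else:
            mean = sum(self.history) / len(self.history)
            prognosis = t_st < self.alpha * mean
        self.history.append(t_st)
        return prognosis
```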

4 Quantitative Measurements for Safety and Efficiency

The performance of different teaching strategies and heuristics should be measured in both the safety and efficiency dimensions. However, the measurements in these two dimensions were subjective and error-prone in previous work (Chen et al., 2017). In particular, when assessing the degree of safety of various teaching strategies and heuristics by simply observing the training curves, we cannot tell which of two interleaving curves corresponds to the safer training process. Thus, it is imperative to set up quantitative measurements for both safety and efficiency evaluations. In this paper, we propose two scalar criteria: Risk Index (RI) and Hitting Time (HT).

4.1 Risk Index

The Risk Index is a nonnegative index designed to indicate how risky the training process could be, for evaluating the safety issue over the whole on-line dialogue policy learning process. We expect the system to satisfy a quality-of-service requirement in the early training period; specifically, we hope it can maintain a relatively acceptable success rate. It is therefore straightforward to set a success rate threshold for the training process. In a real application scenario, this threshold can be obtained by an appropriate user study.

If the success rate over a training process stays above this threshold all the time, we consider the training process absolutely safe. Therefore, its RI should equal zero.

On the other hand, if the success rate over a training process rises and falls and is sometimes below the threshold, the process is risky. The riskiness consists of two parts:

• Disruptiveness: Sometimes the success rate during a certain period will fall much lower than the threshold, which can be very disruptive. To quantify the disruptiveness, we consider the function

  $\mathrm{dis}(t) = \mathrm{threshold} - \%\mathrm{succ}(t)$

  over the training process. The higher the value of $\mathrm{dis}(t')$, the riskier the training process is during a period of a certain length centered at time $t'$.

• Persistence: Another aspect we should take into account is the duration of the time at high risk. Let $\delta_{\mathrm{risk}}(t)$ be the indicator of whether $\mathrm{threshold} \geq \%\mathrm{succ}(t)$. Then the persistence can be quantified as the total risky time

  $\mathrm{per}(T) = \int_{0}^{T} \delta_{\mathrm{risk}}(t)\,dt$.

  The longer the danger persists over the training process, the larger the persistence value, and the riskier the process is.

Our Risk Index integrates these two aspects of riskiness. It is the nonnegative scalar

$\mathrm{RI} = \int_{0}^{T} \mathrm{dis}(t)\,\delta_{\mathrm{risk}}(t)\,dt$,

which measures the integrated riskiness of an on-line training process of total length $T$. The RI also has an intuitive interpretation as the area of the region that lies below the threshold line and above the training curve. Straightforwardly, a high RI indicates poor safety.
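For a success-rate curve sampled at regular intervals, a discrete version of this integral can be computed as in the following sketch (our own illustration; the 65% threshold follows section 5.3.1):

```python
def risk_index(success_rates, threshold=0.65, dt=1.0):
    """Discrete Risk Index: area below the threshold line and above the curve.

    `success_rates` is a sequence of %succ(t) values sampled every `dt`
    sessions; 0.65 matches the empirical safety threshold in section 5.3.1.
    """
    ri = 0.0
    for succ in success_rates:
        dis = threshold - succ      # dis(t); counted only when positive
        if dis > 0:                 # i.e. delta_risk(t) = 1
            ri += dis * dt
    return ri
```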

4.2 Hitting Time

To measure efficiency, we propose the Hitting Time, which shows how fast the system learns and reaches a satisfactory performance level.

The difficulty of designing such a criterion lies in the dramatic and undamped fluctuation of the test curves, which is inherent in the instability of the dialogue task. Therefore, many popular criteria for evaluating dynamic performance in control theory, such as "settling time" and "rise time", cannot be applied to evaluate efficiency here.

To compute the Hitting Time over a fluctuating testing curve, we first fit it to the empirical learning curve

$f(t) = a - b \cdot e^{-(t/c)^2}$,

where the parameter $a$ is the stationary performance, predicted as the asymptotic goal of the system, $b$ relates to the initial performance, and $c$ relates to the climbing speed.



This empirical model forces the fitted learning curve to be an "S"-shaped curve satisfying the constraints $f'(0) = 0$ and $f''(c/\sqrt{2}) = 0$. We then observe when this fitted learning curve hits the target performance $\tau$; this time, measured in sessions, is the Hitting Time. It can be calculated analytically as follows:

$\mathrm{HT} = c \sqrt{\ln\!\left(\dfrac{b}{a - \tau}\right)}$.

Ideally, the ultimate success rate $a$ should be very similar under different settings, given sufficient training. However, if the success rate remains very poor during the given sessions, the fitted $a$ will be very low, possibly even less than the target satisfactory performance $\tau$. In this situation $a$ is meaningless and HT becomes a complex number, which indicates that the real hitting time is far larger than the given number of sessions $T$. We denote the HT in this case as ULT (Unacceptably Large Time).

In this way, we overcome the fluctuation and let the HT tell us how much time the system takes to hit and surpass the target success rate.
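As an illustration (our own sketch, not the paper's code), the curve fit and the Hitting Time computation could be done as follows; the initial parameter guesses are assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit

def empirical_curve(t, a, b, c):
    """f(t) = a - b * exp(-(t/c)^2), the assumed S-shaped learning curve."""
    return a - b * np.exp(-(t / c) ** 2)

def hitting_time(sessions, success_rates, tau=0.70):
    """Fit the test curve and return HT = c * sqrt(ln(b / (a - tau))).

    Returns "ULT" when the fitted asymptote a does not exceed the target tau
    (HT would be complex). tau = 0.70 follows section 5.3.2.
    """
    # Initial guesses for (a, b, c) are assumptions, not values from the paper.
    (a, b, c), _ = curve_fit(empirical_curve, np.asarray(sessions),
                             np.asarray(success_rates),
                             p0=[0.75, 0.3, 1000.0], maxfev=10000)
    if a <= tau or b <= 0:
        return "ULT"  # Unacceptably Large Time
    return c * np.sqrt(np.log(b / (a - tau)))
```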

5 Experiments and Results

Three objectives are set for our experiments: (1) observing the effect of the multitask DQN; (2) contrasting the performance of different teaching schemes (strategies and heuristics) under the companion teaching framework; (3) observing the safety and efficiency issues under sparse user feedback scenarios.

5.1 Experimental Setup

Our experiments are conducted with the Dialogue State Tracking Challenge 2 (DSTC2) dataset, which is in the restaurant information domain (Henderson and Thomson, 2014). The human user is emulated by an agenda-based user simulator with an error model (Schatzmann et al., 2007), while the human teacher is emulated by a policy model pre-trained with the multitask DQN approach without teaching, which achieves a success rate of about 0.78. A rule-based tracker is used for dialogue state tracking (Sun et al., 2014b). The semantic parser is implemented according to an SVM-based method proposed by Henderson et al. (2012). The natural language generator is implemented and modified based on the RNNLG toolkit (Wen et al., 2016, 2015a,c,b).

Early   Rand          SIT            SUT             FPT
None    $p_r = 0.6$   $t_{si} = 5$   $t_{su} = 10$   $\alpha = 1.2$, $w = 25$

Table 1: Experimental configurations of the teaching heuristics introduced in sections 2.2 and 3.2.

In our experiments, all dialogues are limited to twenty turns. "Dialogue success" is judged by the user simulator according to whether all user goals are satisfied. For policy learning, we set a small per-turn penalty of one to encourage short interactions, i.e. $R^{\mathrm{turn}} = -1$, a large dialogue success reward of thirty to encourage successful interactions, i.e. $R^{\mathrm{succ}} = 30$, and a discount factor $\gamma$ of one. Table 1 summarizes the heuristics studied in our experiments, together with the corresponding configurations, which are chosen empirically.

5.2 Observing the Effect of MTL-DQN

The MTL-DQN described in section 3.1 estimates $Q^{\mathrm{turn}}$ and $Q^{\mathrm{succ}}$ separately. In our experiments, it was implemented with one shared hidden layer and two dependent hidden layers for the two different tasks, using MXNet (Chen et al., 2015b).

Figure 3 shows a typical failure in dialogue policy training. The policy shown in the example has not been trained well: it tends to ask the user to repeat over and over again when the confidence score of the user utterance is not high, which causes the user to terminate the dialogue impatiently.

Figure 3: An example of a failed dialogue during training without teaching. The labels "Score" and "FP" represent the confidence score of the user utterance and the value of the failure prognosis of the current turn, respectively.



This kind of failure can be predicted and corrected in advance. By equation 5, the third turn is flagged as a failure prognosis, which can be a sign for the teacher to intervene and correct the following actions to avoid dialogue failure. Besides, the explicit separate estimation of $Q^{\mathrm{turn}}$ and $Q^{\mathrm{succ}}$ provides a better understanding of the state of the current turn. For example, although the first and second turns have similar Q-values ($Q^{\mathrm{turn}} + Q^{\mathrm{succ}}$), the latter turn is predicted to have fewer remaining turns and a lower probability of leading to dialogue success. See appendix A for an additional successful example.

5.3 Comparing Different Teaching Schemes

Our proposed complete companion teaching framework allows us to teach dialogue systems with different teaching schemes, consisting of various strategies and heuristics. In our experiments, we compared 18 schemes consisting of three teaching strategies (CA, EA and EAPC) and six teaching heuristics (Early, Rand, SIT, SUT, FPT and SUT&FPT). The SUT&FPT heuristic means the student only asks for advice when equations 2 and 5 are both satisfied. For comparison, we use No Teaching (NoTeaching) as the baseline.

To verify the effects of different companion teaching schemes, we conduct a set of experiments to examine their performance in the safety and efficiency dimensions. During training, the teacher can only teach for a limited budget of 1000 turns. All the training curves shown in this paper are moving average curves with a window size of 250 dialogues, averaged over eight runs with an acceptable standard error.

5.3.1 Safety Evaluation

To compare the effects of different teaching schemes in the safety dimension, we use the Risk Index (RI) from section 4.1 to quantitatively measure each training process. We set the empirical safety threshold to 65% here. The results are shown in Table 2.

As the RIs imply, schemes using the EAPC strategy are much safer than those using the other strategies. As for teaching heuristics, FPT, SUT and SUT&FPT are the three relatively safer heuristics across strategies. One exception is that Early teaching looks more suitable for CA. A possible explanation is that when the teacher gives critique earlier, the student minds its behavior earlier, thereby increasing safety. Figure 4 shows the training curves of the on-line learning process under EAPC with various heuristics.

            CA      EA      EAPC
Early       98.5    110.6   56.1
Rand        193.4   102.4   65.5
FPT         154.4   86.2    53.6
SIT         230.8   121.7   66.0
SUT         183.5   95.8    44.5∗
SUT&FPT     131.6   101.8   54.6
NoTeaching          202.9

Table 2: RIs of learning processes under different teaching schemes. The least risky teaching scheme is annotated with ∗. For comparing different teaching heuristics with a fixed teaching strategy, the smallest RIs in each column are bold and underlined, the 2nd smallest are bold only, and the 3rd smallest are underlined only. See abbreviations of schemes in sections 2.1 and 2.2.

Among all 18 teaching schemes, EAPC+SUT is the safest teaching scheme, reducing about 78% of the risk of no-teaching learning.

5.3.2 Efficiency Evaluation

We use the Hitting Time (HT) from section 4.2 to measure the efficiency of the learning process under different teaching schemes. The empirical satisfactory target success rate for the student is 70% in our experimental settings.

            CA      EA      EAPC
Early       3390.9  3479.4  4354.7
Rand        3669.0  3518.5  2979.2
FPT         3089.4  2921.1  2798.4
SIT         3576.4  4339.7  3768.7
SUT         3230.4  2954.5  3300.2
SUT&FPT     2890.7  3393.0  2702.2∗
NoTeaching          3204.1

Table 3: HTs of test curves under different teaching schemes. The most efficient teaching scheme is annotated with ∗. For comparing different teaching heuristics with a fixed strategy on the efficiency issue, the smallest HTs in each column are bold and underlined, and the 2nd smallest are bold only. See abbreviations of schemes in sections 2.1 and 2.2.

Table 3 contains the HTs of the learning processes under all 18 teaching schemes. Intuitively, the numbers in the table reflect the number of sessions at which the model achieves the target success rate. As it shows, not every teaching scheme improves learning efficiency: if the teacher intervenes at an improper time, it will distract or confuse the system even with correct guidance. However, teaching when a potential failure exists (FPT) is consistently good for improving learning efficiency.



[Figure: two panels of %succ versus #session training curves — one comparing EAPC+FPT, EAPC+SIT, EAPC+SUT and EAPC+SUT&FPT, the other comparing NoTeaching, EAPC+Early and EAPC+Rand.]

Figure 4: On-line learning process under different teaching schemes (EAPC + different heuristics). The yellow dashed line indicates the safe success rate threshold. The area in gray indicates how risky a training process is. See abbreviations of schemes in sections 2.1 and 2.2.

EAPC+SUT&FPT is the teaching scheme that leads to the most efficient learning process in our experiments. Figure 6 gives example test curves and fitted empirical learning curves of the learning processes under EAPC with various heuristics.

5.3.3 Teacher's Workload

We also observe the teacher's workload under all teaching schemes, since economical use of the teaching budget is one of our goals.

[Figure: three panels of cumulative #teached turns versus #session, one per strategy (CA, EA, EAPC), each comparing the Early, Rand, FPT, SIT, SUT and SUT&FPT heuristics.]

Figure 5: Cumulative usage of the teaching budget. The total teaching budget is 1000 for every teaching scheme. See abbreviations of schemes in sections 2.1 and 2.2.

Figure 5 illustrates the cumulative usage of the teaching budget under the 18 teaching schemes. It shows that early teaching is the most costly teaching heuristic, so the teaching budget is soon used up. SIT looks a bit lazy at the beginning and consumes the teaching budget slowly.

[Figure: %succ versus #session test curves with their fitted empirical learning curves for NoTeaching, EAPC+Early, EAPC+Rand(0.6), EAPC+FPT, EAPC+SIT, EAPC+SUT and EAPC+SUT&FPT.]

Figure 6: Test curves and fitted empirical learning curves of the learning processes with different teaching schemes (EAPC + different heuristics). See abbreviations of schemes in sections 2.1 and 2.2.

When the teaching strategy is EA or EAPC, the FPT-based schemes do not use up the full teaching budget in our experiments. Combining SUT and FPT, the workload is relatively lighter than with the other heuristics. Through proper teaching schemes, we can make better use of the teaching budget and reduce the teacher's workload.

5.4 Safety and Efficiency Issues under Sparse User Feedback Scenarios

In real application scenarios, the user rarely provides feedback at the end of the dialogue, so the safety and efficiency issues are even more serious. To observe the effectiveness of different teaching schemes in this setting, we conducted experiments with sparse user feedback.

The user feedback rate is set to 30% empirically, and experiments are carried out under teaching schemes consisting of the EAPC strategy and different heuristics, since EAPC is much safer and more efficient than the other teaching strategies.

            RIs     HTs
NoTeaching  608.2   ULT
Early       223.0   6881.8
Rand        226.6   ULT
FPT         171.5   6753.0
SIT         308.8   7868.4
SUT         183.3   5876.9
SUT&FPT     155.4   8420.9

Table 4: RIs and HTs of learning processes under the EAPC strategy and different heuristics when the user feedback rate is 30%. See abbreviations of schemes in sections 2.1 and 2.2.

Table 4 records the RIs and HTs of these different learning processes when user feedback is sparse.



We can see that when the user feedback rate drops from 100% to 30%, the RIs and HTs increase dramatically. The NoTeaching baseline is very risky and inefficient (its hitting time is even unpredictable within 10000 sessions of learning). However, with a teaching scheme such as EAPC+FPT, both safety and efficiency can be improved considerably.

6 Conclusions and Future Work

This paper addressed the safety and efficiency issues of sustainable on-line dialogue policy learning with different teaching schemes, which answer both "how" and "when" to teach, within a complete companion teaching framework. To evaluate the policy learning process precisely, we proposed two measurements, Risk Index (RI) and Hitting Time (HT), to quantify the degree of safety and efficiency. In particular, through multitask learning, we managed to optimize the predicted remaining turns and the dialogue success reward explicitly, based on which we developed a novel Failure Prognosis based Teaching (FPT) heuristic to better utilize the fixed teaching budget and make the teaching affordable.

Experiments showed that different teaching schemes have different effects on the safety and efficiency dimensions, and that they also require different workloads from the teacher. Among the 18 compared teaching schemes, FPT-based heuristics combined with the EAPC strategy achieved promising performance on RI and HT, and required a relatively light workload. This result indicates that a proper teaching scheme under the companion teaching framework is able to guarantee a sustainable and affordable on-line dialogue policy learning process.

There are several directions for future work. We plan to deploy the proposed framework in real-world scenarios, collaborating with real human teachers, to verify the results presented in this paper and to discover more potential challenges of on-line dialogue system development. Furthermore, the current study focuses on the dialogue success rate, which is a simplification of human satisfaction evaluation, so future work is needed to take more qualities into consideration to achieve better user experience.

Acknowledgments

This work was supported by the Shanghai Sailing Program No. 16YF1405300, the China NSFC projects (No. 61573241 and No. 61603252) and the Interdisciplinary Program (14JCZ03) of Shanghai Jiao Tong University in China. Experiments were carried out on the PI supercomputer at Shanghai Jiao Tong University.

References

Ofra Amir, Ece Kamar, Andrey Kolobov, and Barbara Grosz. 2016. Interactive teaching strategies for agent training. In IJCAI. International Joint Conferences on Artificial Intelligence.

Benedict Arnold. 1998. Reinforcement learning: An introduction (adaptive computation and machine learning). IEEE Transactions on Neural Networks, 9(5):1054.

Rich Caruana. 1997. Multitask learning. Machine Learning, 28(1):41–75.

Lu Chen, Pei Hao Su, and Milica Gasic. 2015a. Hyper-parameter optimisation of gaussian process reinforcement learning for statistical dialogue management. In Meeting of the Special Interest Group on Discourse and Dialogue, pages 407–411.

Lu Chen, Runzhe Yang, Cheng Chang, Zihao Ye, Xiang Zhou, and Kai Yu. 2017. On-line dialogue policy learning with companion teaching. In EACL.

Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. 2015b. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. Statistics.

Jeffery Allen Clouse. 1996. On integrating apprentice learning and reinforcement learning. University of Massachusetts.

L. Daubigney, M. Geist, S. Chandramohan, and O. Pietquin. 2012. A comprehensive reinforcement learning framework for dialogue management optimization. IEEE Journal of Selected Topics in Signal Processing, 6(8):891–902.

Mehdi Fatemi, Layla El Asri, Hannes Schulz, Jing He, and Kaheer Suleman. 2016. Policy networks with two-stage training for dialogue systems. arXiv.org.

Milica Gasic, Filip Jurcicek, Simon Keizer, Francois Mairesse, Blaise Thomson, Kai Yu, and Steve Young. 2010. Gaussian processes for fast policy optimisation of POMDP-based dialogue managers. pages 201–204.

Matthew Henderson, Milica Gasic, Blaise Thomson, and Pirros Tsiakoulis. 2012. Discriminative spoken language understanding using word confusion networks. In Spoken Language Technology Workshop, pages 176–181.

Matthew Henderson and Blaise Thomson. 2014. The second dialog state tracking challenge. In SIGDIAL, volume 263, Stroudsburg, PA, USA. Association for Computational Linguistics.

Kshitij Judah, Saikat Roy, Alan Fern, and Thomas G. Dietterich. 2010. Reinforcement learning via practice and critique advice. In Twenty-Fourth AAAI Conference on Artificial Intelligence, pages 481–486.

Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. 1998. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1):99–134.

E. Levin, R. Pieraccini, and W. Eckert. 1997. Learning dialogue strategies within the Markov decision process framework. In Automatic Speech Recognition and Understanding, 1997 IEEE Workshop on, pages 72–79.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing Atari with deep reinforcement learning. Computer Science.

Jost Schatzmann, Blaise Thomson, Karl Weilhammer, Hui Ye, and Steve Young. 2007. Agenda-based user simulation for bootstrapping a POMDP dialogue system. In NAACL, pages 149–152, Morristown, NJ, USA. Association for Computational Linguistics.

Pei Hao Su, Milica Gasic, Nikola Mrksic, Lina Rojas-Barahona, Stefan Ultes, David Vandyke, Tsung Hsien Wen, and Steve Young. 2016. On-line active reward learning for policy optimisation in spoken dialogue systems. In Meeting of the Association for Computational Linguistics, pages 2431–2441.

Kai Sun, Lu Chen, Su Zhu, and Kai Yu. 2014a. The SJTU system for dialog state tracking challenge 2. In Meeting of the Special Interest Group on Discourse and Dialogue, pages 318–326.

Kai Sun, Lu Chen, Su Zhu, and Kai Yu. 2014b. The SJTU system for dialog state tracking challenge 2. In SIGDIAL, pages 318–326, Stroudsburg, PA, USA. Association for Computational Linguistics.

A. L. Thomaz and C. Breazeal. 2008. Teachable robots: Understanding human teaching behavior to build more effective robot learners. Artificial Intelligence, 172(6):716–737.

Blaise Thomson and Steve Young. 2010. Bayesian update of dialogue state: A POMDP framework for spoken dialogue systems. Computer Speech and Language, 24(4):562–588.

Lisa Torrey and Matthew Taylor. 2013. Teaching on a budget: Agents advising agents in reinforcement learning. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-agent Systems, pages 1053–1060. International Foundation for Autonomous Agents and Multiagent Systems.

Zhuoran Wang and Oliver Lemon. 2013. A simple and generic belief tracking mechanism for the dialog state tracking challenge: On the believability of observed information. In The SIGDIAL Meeting on Discourse and Dialogue.

Tsung-Hsien Wen, Milica Gasic, Dongho Kim, Nikola Mrksic, Pei-Hao Su, David Vandyke, and Steve Young. 2015a. Stochastic language generation in dialogue using recurrent neural networks with convolutional sentence reranking. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL). Association for Computational Linguistics.

Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Lina M. Rojas-Barahona, Pei-Hao Su, David Vandyke, and Steve Young. 2015b. Toward multi-domain language generation using recurrent neural networks. NIPS Workshop on Machine Learning for Spoken Language Understanding and Interaction.

Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Lina M. Rojas-Barahona, Pei-Hao Su, David Vandyke, and Steve Young. 2016. Multi-domain neural network language generation for spoken dialogue systems. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). Association for Computational Linguistics.

Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Pei-Hao Su, David Vandyke, and Steve Young. 2015c. Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.

Jason D. Williams and Steve Young. 2007. Partially observable Markov decision processes for spoken dialog systems. Computer Speech and Language, 21(2):393–422.

Jason D. Williams and Geoffrey Zweig. 2016. End-to-end LSTM-based dialog control optimized with supervised and reinforcement learning. CoRR.

Steve Young, Milica Gasic, Blaise Thomson, and Jason D. Williams. 2013. POMDP-based statistical spoken dialog systems: A review. Proceedings of the IEEE, 101(5):1160–1179.
