Statistical Spoken Dialogue Systems, Talk 2 – Belief tracking. CLARA Workshop. Presented by Blaise Thomson, Cambridge University Engineering Department. [email protected], http://mi.eng.cam.ac.uk/~brmt2



Human-machine spoken dialogue

[Diagram: typical structure of a spoken dialogue system. The user's waveform ("I want a restaurant") passes through the Recognizer (words) and Semantic Decoder (dialogue acts, e.g. inform(type=restaurant)) to the Dialog Manager; the system's reply act (e.g. request(food)) goes through the Message Generator and Synthesizer back to the user as "What kind of food do you want?".]

Spoken Dialogue Systems – State of the art

Outline
Introduction
  An example user model (spoken dialogue model)
  The Partially Observable Markov Decision Process (POMDP)
  POMDP models for dialogue systems
  POMDP models for off-line experiments
  POMDP models for simulation
Inference
  Belief propagation (fixed parameters)
  Expectation Propagation (learning parameters)
Optimisations
Results

Intro – An example user model
The Partially Observable Markov Decision Process (POMDP): a probabilistic model of what the user will say.

Variables:
  Dialogue state, s_t (e.g. the user wants a restaurant)
  System action, a_t (e.g. "What type of food?")
  Observation of what was said, o_t (e.g. an N-best semantic list)

Assumes Input-Output Hidden Markov structure:

[Diagram: Input-Output HMM with states s_1, ..., s_T, observations o_1, ..., o_T and actions a_1, ..., a_T.]

Intro – Simplifying the POMDP user model
Typically split the dialogue state, s_t, into:
  True user goal, g_t
  True user act, u_t

[Diagram: the same chain with the state split into goals g_1, ..., g_T and user acts u_1, ..., u_T, plus observations and actions.]

Further split the goal, g_t, into sub-goals g_t,c
e.g. the user wants a Chinese restaurant → food=Chinese, type=restaurant
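As a rough sketch of what a factored goal looks like in code (the slot names, values and the joint_prob helper below are illustrative, not taken from the talk), each sub-goal keeps its own distribution:

```python
# Hypothetical factored belief state: one distribution per sub-goal slot.
belief = {
    "type": {"restaurant": 0.6, "bar": 0.3, "hotel": 0.1},
    "food": {"Chinese": 0.5, "Indian": 0.3, "other": 0.2},
}

def joint_prob(goal, belief):
    """P(goal) under the assumption that sub-goals are independent."""
    p = 1.0
    for slot, value in goal.items():
        p *= belief[slot].get(value, 0.0)
    return p

# e.g. the user wants a Chinese restaurant
print(joint_prob({"type": "restaurant", "food": "Chinese"}, belief))  # 0.3
```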

Intro – Simplifying the POMDP user model
[Diagram: the goal g_t factorised into sub-goals g_t,food, g_t,type, g_t,stars, g_t,area.]

[Diagram: factored network over g_type, u_type, g_food, u_food, with observation o and action a.]

System: How can I help you?
User (N-best list): "I'm looking for a beer" [0.5], "I'm looking for a bar" [0.4]

System: Sorry, what did you say?
User (N-best list): "bar" [0.3], "bye" [0.3]

When decisions are based on probabilistic user goals, we have a Partially Observable Markov Decision Process (POMDP).

Intro – POMDP models for dialogue systems
[Figure: belief histograms over the goals Beer, Bar and Bye after each turn.]

Intro – belief model for dialogue systems
[Figure: belief histogram over Beer, Bar and Bye; the system chooses confirm(beer).]
Choose actions according to the beliefs in the goal instead of the most likely hypothesis.
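A hypothetical hand-crafted rule of this kind (the thresholds and action names below are my own, not the trained policy from the talk) acts on the whole distribution rather than just the top hypothesis:

```python
# Hypothetical belief-based action selection (not the trained POMDP policy).
def choose_action(belief):
    """belief: dict mapping goal hypotheses (e.g. 'beer', 'bar', 'bye') to probabilities."""
    top, p_top = max(belief.items(), key=lambda kv: kv[1])
    if p_top > 0.8:
        return f"inform({top})"   # confident enough to act
    elif p_top > 0.4:
        return f"confirm({top})"  # moderately confident: confirm first
    else:
        return "repeat()"         # too uncertain: ask the user to repeat

print(choose_action({"beer": 0.5, "bar": 0.4, "bye": 0.1}))  # confirm(beer)
```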

More robust – some key reasons:
  Full hypothesis list
  User model

Intro – POMDP models for off-line experiments
System: How can I help you?
User: "I'm looking for a beer" / "I'm looking for a bar"

System: Sorry, what did you say?
User: "bar" / "bye"

[Figure: belief histograms over Beer, Bar and Bye after each turn (values 0.5, 0.4, 0.2, 0.7, 0.3, 0.3, 0.5, 0.1).]

Intro – POMDP models for simulation
Often useful to be able to simulate how people behave:
  For reinforcement learning
  For testing a given system
In theory, simply generate from the POMDP user model.
[Diagram: factored network with g_type=restaurant and g_food=Chinese generating u_type=inform(type=restaurant) and u_food=silence().]

An example – voicemail
We have a voicemail system with 2 possible user goals:
  g = SAVE: the user wants to save
  g = DEL: the user wants to delete

In each turn, until we save or delete, we observe one of two things:
  o = OSAVE: the user said "save"
  o = ODEL: the user said "delete"

We assume that the goal can change between turns, and for the moment we only look at two turns.

We start by being completely unsure what the user wants.

An example – exercise
Observation probability, P(o | g):

  g \ o    OSAVE   ODEL
  SAVE     0.8     0.2
  DEL      0.2     0.8

If we observe the user saying they want to save, what is the probability they want to save? i.e. what is

  P(g1 | o1 = OSAVE)

Hint: use Bayes' Theorem, P(A|B) = P(B|A) P(A) / P(B).

An example – exercise
Observation probability, P(o | g), as above.

Transition probability, P(g' | g):

  g \ g'   SAVE    DEL
  SAVE     0.9     0.1
  DEL      0.0     1.0

If we observe the user saying they want to save and then saying they want to delete, what is the probability they want to save in the second turn? i.e. what is

  P(g2 | o1 = OSAVE, o2 = ODEL)

An example – answer

  g2      from g1=SAVE        from g1=DEL         TOTAL    PROB
  SAVE    0.5*0.8*0.9*0.2     0.5*0.2*0.0*0.2     0.072    0.39
  DEL     0.5*0.8*0.1*0.8     0.5*0.2*1.0*0.8     0.112    0.61

An example – expanding further
In general we will want to compute probabilities conditional on the observations (we will call this the data D).

This always becomes a marginal on the joint distribution with the observation probabilities fixed, e.g.

  P(g2 | o1, o2) ∝ Σ_g1 P(g1) P(o1 | g1) P(g2 | g1) P(o2 | g2)
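A minimal sketch of that sum for the voicemail example above, using the probabilities from the tables (the dictionary and function names are mine):

```python
# Two-turn voicemail example: marginalise the joint over the hidden first-turn goal g1.
PRIOR = {"SAVE": 0.5, "DEL": 0.5}                     # initial belief: completely unsure
P_OBS = {"SAVE": {"OSAVE": 0.8, "ODEL": 0.2},         # P(o | g)
         "DEL":  {"OSAVE": 0.2, "ODEL": 0.8}}
P_TRANS = {"SAVE": {"SAVE": 0.9, "DEL": 0.1},         # P(g2 | g1)
           "DEL":  {"SAVE": 0.0, "DEL": 1.0}}

def posterior_g2(o1, o2):
    """P(g2 | o1, o2), by summing the joint over g1 and normalising."""
    joint = {}
    for g2 in ("SAVE", "DEL"):
        joint[g2] = sum(PRIOR[g1] * P_OBS[g1][o1] * P_TRANS[g1][g2] * P_OBS[g2][o2]
                        for g1 in ("SAVE", "DEL"))
    total = sum(joint.values())
    return {g: p / total for g, p in joint.items()}

print(posterior_g2("OSAVE", "ODEL"))  # {'SAVE': ~0.39, 'DEL': ~0.61}, matching the table
```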

These sums can be computed much more cleverly using dynamic programming

Belief Propagation
Interested in the marginals p(x|D).
Assume the network is a tree, with observations above and below x: D = {Da, Db}. Then

  p(x | D) ∝ p(x | Da) p(Db | x)

Belief Propagation
When we split Db = {Dc, Dd}:

  p(x | D) ∝ p(x | Da) p(Dc | x) p(Dd | x)

These are called the messages into x. We have one message for every probability factor connected to x.


Belief Propagation – message passing
[Diagram: a chain a – b – c, with data Da above a and Db, Dc below; messages are passed along the chain.]

Belief Propagation
We can do the same thing repeatedly.

Start on one side, and keep getting p(x|Da)

Then start at the other end and keep getting p(Db|x)

To get a marginal, simply multiply these together.

Belief Propagation – our example
[Diagram: the voicemail chain g1 – g2 with observations o1, o2.]
Write probabilities as vectors with SAVE on top.
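A rough sketch of that message passing for the voicemail chain, with every probability written as a vector [SAVE, DEL]; this is the standard forward-backward recursion rather than code from the talk, and the function and variable names are mine:

```python
import numpy as np

prior = np.array([0.5, 0.5])                 # vectors are [SAVE, DEL], SAVE on top
obs = {"OSAVE": np.array([0.8, 0.2]),        # p(o | g) as a vector over g
       "ODEL":  np.array([0.2, 0.8])}
trans = np.array([[0.9, 0.1],                # trans[i, j] = p(g_t = j | g_{t-1} = i)
                  [0.0, 1.0]])

def marginals(observations):
    """p(g_t | all observations) for every turn t, via forward/backward messages."""
    T = len(observations)
    fwd = [prior * obs[observations[0]]]                  # p(g_1, o_1)
    for t in range(1, T):
        fwd.append((fwd[-1] @ trans) * obs[observations[t]])
    bwd = [np.ones(2) for _ in range(T)]                  # p(future observations | g_t)
    for t in range(T - 2, -1, -1):
        bwd[t] = trans @ (bwd[t + 1] * obs[observations[t + 1]])
    return [f * b / np.sum(f * b) for f, b in zip(fwd, bwd)]

for t, m in enumerate(marginals(["OSAVE", "ODEL"]), start=1):
    print(f"p(g{t} | D) =", m.round(3))   # g1: [0.565 0.435], g2: [0.391 0.609]
```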

Parameter Learning – the problem
[Diagram: the factored network over g_type, u_type, g_food, u_food with observation o and action a, repeated over two turns.]
For every (action, goal, goal) triple there is a parameter.

The parameters form a probability table P(g' | g, a).

The goals are all hidden and factorised, and there are many of them.
[Diagram: transition factor over g_t-1, g_t and a_t.]
Need to tie parameters.
Must allow for factorised hidden variables.

Parameter Learning – some options
Hand-craft
  Roy et al, Zhang et al, Young et al, Thomson et al, Bui et al
Annotate the user goal and use Maximum Likelihood
  Williams et al, Kim et al, Henderson & Lemon
  Isn't always possible
Expectation Maximisation
  Doshi & Roy (7 states), Syed et al (no goal changes)
  Uses an unfactorised state
  Intractable
Expectation Propagation (EP)
  Allows parameter tying (details in the paper)
  Handles factorised hidden variables
  Handles large state spaces
  Doesn't require any annotations (including of the user act), though it does use the semantic decoder output

There is actually a fifth option, Variational Inference. Don't use that, because we want an algorithm that matches the expectations; VB tends to converge to modes of the distribution, which can be problematic for us.

Belief Propagation as message passing

[Diagram: factor between a and b, with Da above a and Db below b.]
Message from outside the factor, q\(a): the input message from above a.
Message from outside the factor, q\(b): the product of the input messages below b.
Message from this factor to b: q*(b).
Message from this factor to a: q*(a).

Belief Propagation as message passing

[Diagram: the same factor between a and b.]
Message from outside the network: q\(a) = p(a | Da)
Message from outside the network: q\(b) = p(Db | b)
Message from this factor: q*(b) = p(b | Da)
Message from this factor: q*(a) = p(Db | a)

Think in terms of approximations from each probability factor.

Belief Propagation – unknown parameters?
Imagine we have a discrete choice for the parameters θ.

Integrate over our estimate from the rest of the network:

To estimate θ, we want to sum over a and b:

Belief Propagation – unknown parameters?
But we actually have continuous parameters.

Integrate over our estimate from the rest of the network:

To estimate θ, we want to sum over a and b:

Expectation Propagation
This doesn't make sense: θ is a probability! Multiplying by q\(θ) gives:

Choose q*(θ) to minimize the KL divergence with this.
If we restrict ourselves to Dirichlet distributions, we need to find the Dirichlet that best matches a mixture of Dirichlets.
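As an illustration of that projection step, the sketch below collapses a mixture of Dirichlets to a single Dirichlet by simple moment matching on the component means and second moments. Exact EP would instead match E[log θ], so treat this as an approximation under my own assumptions; all function names are mine:

```python
import numpy as np

def dirichlet_moments(alpha):
    """Mean and second moment E[theta_k], E[theta_k^2] of a Dirichlet(alpha)."""
    a0 = alpha.sum()
    mean = alpha / a0
    second = alpha * (alpha + 1) / (a0 * (a0 + 1))
    return mean, second

def match_mixture_of_dirichlets(weights, alphas):
    """Approximate a mixture of Dirichlets by one Dirichlet via moment matching.

    Stand-in for the KL projection used in EP; the exact version matches
    E[log theta] rather than these raw moments.
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    moments = [dirichlet_moments(np.asarray(a, dtype=float)) for a in alphas]
    mean = sum(w * m for w, (m, _) in zip(weights, moments))
    second = sum(w * s for w, (_, s) in zip(weights, moments))
    # For a Dirichlet with total concentration s: E[theta_k^2] = mean_k*(s*mean_k + 1)/(s + 1)
    s_per_dim = (mean - second) / (second - mean ** 2)
    s = s_per_dim.mean()              # average the per-dimension estimates of s
    return s * mean                   # matched Dirichlet parameters

print(match_mixture_of_dirichlets([0.6, 0.4], [[3.0, 1.0], [1.0, 2.0]]))
```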

Expectation Propagation – example
[Diagram: a slice of the factored network (goal, user act, observation, action) with the parameter θ. Messages passed in sequence:]
  p(o | inform(type=bar)) [0.5], p(o | inform(type=hotel)) [0.2]
  inform(type=bar) [0.5], inform(type=hotel) [0.2]
  p(u=bar | g) [0.4], p(u=hotel | g) [0.1]
  type=bar [0.45], type=hotel [0.18]
  type=bar [0.44], type=hotel [0.17]
  p(o | inform(type=bar)) [0.6], p(o | inform(type=rest)) [0.3]

Expectation Propagation – Optimisation 1
In dialogue systems, most of the values are equally likely.

We can use this to reduce computations:
  Compute the q distributions only once
  Multiply instead of summing the same value repeatedly

[Figure: belief histogram over the number of stars (1–5) for the utterance "Twee stars please".]

Expectation Propagation – Optimisation 2
For each value, assume the transition to most other values is the same (a mostly constant factor), e.g. a constant probability of change.
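A small sketch of why this helps (the parameter name p_change and the function are mine): with one shared probability of change, the transition update for an N-valued slot only needs the belief total, so it costs O(N) rather than O(N^2):

```python
# "Constant probability of change" transition model: every value moves to any
# *other* value with the same probability, so the update needs only the total.
def transition_update(belief, p_change):
    """belief: dict value -> probability. Returns the predicted belief one turn later.

    Full model:  b'(v) = sum_u P(v | u) * b(u)                              -- O(N^2)
    Tied model:  b'(v) = (1 - p_change) * b(v) + p_change * (total - b(v)) / (N - 1)
    """
    n = len(belief)
    total = sum(belief.values())   # 1.0 for a normalised belief
    return {v: (1 - p_change) * b + p_change * (total - b) / (n - 1)
            for v, b in belief.items()}

print(transition_update({"1": 0.7, "2": 0.1, "3": 0.1, "4": 0.05, "5": 0.05}, 0.1))
```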

The reduced number of parameters means we can speed up learning too!

Results – computation times

[Figure: computation times for No opt, Grouping, Const Change and Both.]

Results – simulated re-ranking
Train on 1000 simulated dialogues.
Re-rank simulated semantics on 1000 dialogues.
Oracle accuracy is 93.5%.

TAcc: semantic accuracy of the top hypothesis. NCE: Normalized Cross Entropy score (confidence scores). ICE: Item Cross Entropy score (accuracy + confidence).

                                      TAcc    NCE     ICE
  No rescoring                        75.7    0.541   0.921
  Trained with noisy semantics        81.7    0.650   0.870
  Trained with semantic annotations   81.5    0.632   0.903

Results – data re-ranking
Train on the Mar09 TownInfo trial data (720 dialogues).
Test on the Feb08 TownInfo trial data (648 dialogues).
Oracle accuracy is 79.2%.

                                      TAcc    NCE     ICE
  No rescoring                        73.3    -0.033  1.687
  Trained with noisy semantics        73.4    0.327   1.586
  Trained with semantic annotations   73.9    0.338   1.655

Results – simulated dialogue management
Use reinforcement learning (the Natural Actor Critic algorithm) to train two systems:
  One uses hand-crafted parameters
  One uses parameters learned from 1000 simulated dialogues

Results – live evaluations (control)
Tested in the Spoken Dialogue Challenge.

Provide bus timetables in Pittsburgh

800 road names (pairs represent a stop). Required to get the place from, the place to, and the time.

All parameters of the Cambridge system were hand-crafted.

              # Dial   # Succ   % Succ          WER
  BASELINE    91       59       64.8 +/- 5.0    42.35
  System 2    61       23       37.7 +/- 6.2    60.66
  Cambridge   75       67       89.3 +/- 3.6    32.65
  System 4    83       62       74.7 +/- 4.8    34.34

Results – live evaluations (control)

[Figure: estimated success rate plotted against WER for the CAM and BASELINE systems, with individual CAM/BASELINE successes and failures marked.]

Summary
POMDP models are an effective model of dialogue:
  For use in dialogue systems
  For re-ranking semantic hypotheses off-line

Expectation Propagation allows parameter learning for complex models, without annotations of dialogue state

Experiments show:
  EP gives improvements in re-ranked hypotheses
  EP gives improvements in simulated dialogue management performance
  Probabilistic belief gives improvements in live dialogue management performance

Current/Future work
Using the POMDP as a simulator too.

Need to change the model to better handle user acts (the sub-acts are not independent!)
[Diagram: the factored network over g_type, u_type, g_food, u_food, and a variant with a single user act u shared across sub-goals.]

The End – Thanks!

Dialogue Group homepage: http://mi.eng.cam.ac.uk/research/dialogue/

My homepage: http://mi.eng.cam.ac.uk/~brmt2/

Expectation Propagation – Optimisations
[Diagram: values A and B.]

Assume this is constant for A - A*. Compute this offline.