

CS 224S / LINGUIST 285
Spoken Language Processing
Dan Jurafsky
Stanford University
Spring 2014

Lecture 10: Information State and Utility-based dialogue systems

Outline
Dialogue manager design
  Finite-state
  Frame-based
  Information-state
Dialogue-act detection
Dialogue-act generation
Utility-based conversational agents: MDP, POMDP
Evaluation

Information-State and Dialogue Acts
For more than just form-filling
Need to:
  Decide when the user has asked a question, made a proposal, rejected a suggestion
  Ground a user's utterance, ask clarification questions, suggest plans
Need models of interpretation and generation:
  Speech acts and grounding
  A more sophisticated representation of dialogue context than just a list of slots

Information-state architecture
  Information state
  Dialogue act interpreter
  Dialogue act generator
  Set of update rules
    Update the dialogue state as acts are interpreted
    Generate dialogue acts
  Control structure to select which update rules to apply

Information-state (architecture diagram)
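To make the information-state idea concrete, here is a minimal sketch (mine, not from the lecture) of an information state plus two update rules in Python; the field names, rule names, and dialogue-act labels are all hypothetical.

    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class InfoState:
        """Hypothetical information state: richer than a flat list of slots."""
        slots: dict = field(default_factory=dict)     # task slots filled so far
        grounded: set = field(default_factory=set)    # slots the user has confirmed
        last_user_act: Optional[str] = None           # most recent interpreted dialogue act
        agenda: list = field(default_factory=list)    # dialogue acts the system plans to generate

    def rule_answer_fills_slot(state: InfoState, act: str, slot: str, value: str) -> None:
        """Update rule: an ANSWER act fills a slot and queues a confirmation act."""
        if act == "ANSWER":
            state.slots[slot] = value
            state.agenda.append(("CONFIRM", slot, value))

    def rule_accept_grounds_slot(state: InfoState, act: str, slot: str) -> None:
        """Update rule: an ACCEPT act after a confirmation grounds the slot."""
        if act == "ACCEPT":
            state.grounded.add(slot)

    # Control structure: apply whichever update rules match the interpreted act.
    state = InfoState()
    rule_answer_fills_slot(state, "ANSWER", "dest_city", "Berlin")
    rule_accept_grounds_slot(state, "ACCEPT", "dest_city")
    print(state)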

Dialogue acts
Also called "conversational moves"
An act with (internal) structure related specifically to its dialogue function
Incorporates ideas of grounding
Incorporates other dialogue and conversational functions that Austin and Searle didn't seem interested in

Verbmobil task
Two-party scheduling dialogues
Speakers were asked to plan a meeting at some future date
Data used to design conversational agents which would help with this task
(cross-language, translating, scheduling assistant)

Verbmobil Dialogue Acts
  THANK            thanks
  GREET            Hello Dan
  INTRODUCE        It's me again
  BYE              Alright, bye
  REQUEST-COMMENT  How does that look?
  SUGGEST          June 13th through 17th
  REJECT           No, Friday I'm booked all day
  ACCEPT           Saturday sounds fine
  REQUEST-SUGGEST  What is a good day of the week for you?
  INIT             I wanted to make an appointment with you
  GIVE_REASON      Because I have meetings all afternoon
  FEEDBACK         Okay
  DELIBERATE       Let me check my calendar here
  CONFIRM          Okay, that would be wonderful
  CLARIFY          Okay, do you mean Tuesday the 23rd?

DAMSL: forward looking function
  STATEMENT           a claim made by the speaker
  INFO-REQUEST        a question by the speaker
    CHECK             a question for confirming information
  INFLUENCE-ON-ADDRESSEE (= Searle's directives)
    OPEN-OPTION       a weak suggestion or listing of options
    ACTION-DIRECTIVE  an actual command
  INFLUENCE-ON-SPEAKER (= Austin's commissives)
    OFFER             speaker offers to do something
    COMMIT            speaker is committed to doing something
  CONVENTIONAL        other
    OPENING           greetings
    CLOSING           farewells
    THANKING          thanking and responding to thanks

DAMSL: backward looking function
  AGREEMENT           speaker's response to previous proposal
    ACCEPT            accepting the proposal
    ACCEPT-PART       accepting some part of the proposal
    MAYBE             neither accepting nor rejecting the proposal
    REJECT-PART       rejecting some part of the proposal
    REJECT            rejecting the proposal
    HOLD              putting off response, usually via subdialogue
  ANSWER              answering a question
  UNDERSTANDING       whether speaker understood previous
    SIGNAL-NON-UNDER. speaker didn't understand
    SIGNAL-UNDER.     speaker did understand
      ACK             demonstrated via continuer or assessment
      REPEAT-REPHRASE demonstrated via repetition or reformulation
      COMPLETION      demonstrated via collaborative completion

A DAMSL Labeling (figure)

Conversation Acts
Traum and Hinkelman (1992) (figure)

Generating Dialogue Acts
Two examples:
  Confirmation
  Rejection

Confirmation
Errors: speech is a pretty errorful channel
  Humans use grounding to confirm that they heard correctly
ASR is way worse than humans!
  Dialogue systems need to do even more grounding and confirmation than humans
Users are confused when the system doesn't give an explicit acknowledgement signal
  Stifelman et al. (1993), Yankelovich et al. (1995)

Explicit confirmation
S: Which city do you want to leave from?
U: Baltimore
S: Do you want to leave from Baltimore?
U: Yes

Explicit confirmation
U: I'd like to fly from Denver Colorado to New York City on September 21st in the morning on United Airlines
S: Let's see then. I have you going from Denver Colorado to New York on September 21st. Is that correct?
U: Yes

Implicit confirmation: display
U: I'd like to travel to Berlin
S: When do you want to travel to Berlin?

U: Hi, I'd like to fly to Seattle Tuesday morning
S: Traveling to Seattle on Tuesday, August eleventh in the morning. Your name?

Implicit vs. Explicit
Complementary strengths
Explicit: easier for users to correct the system's mistakes (can just say no)
  But explicit is cumbersome and long
Implicit: much more natural, quicker, simpler (if the system guesses right)

Implicit and Explicit
Early systems: all-implicit or all-explicit
Modern systems: adaptive
How to decide?
  The ASR system can give a confidence metric
  This expresses how convinced the system is of its transcription of the speech
  If high confidence, use implicit confirmation
  If low confidence, use explicit confirmation

Computing confidence
Simplest: use the acoustic log-likelihood of the user's utterance
More features:
  Prosodic: utterances with longer pauses, F0 excursions, longer durations
  Backoff: did we have to back off in the LM?
  Cost of an error: explicit confirmation before moving money or booking flights

Rejection
"I'm sorry, I didn't understand that."
Reject when:
  ASR confidence is low
  The best interpretation is semantically ill-formed
Might have a four-tiered level of confidence (see the sketch below):
  Below the confidence threshold: reject
  Above the threshold: explicit confirmation
  If even higher: implicit confirmation
  Even higher: no confirmation
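A minimal sketch of this four-tiered policy in Python; the threshold values and the function name are hypothetical, not from the lecture.

    def confirmation_action(asr_confidence, semantically_ok,
                            reject_t=0.3, explicit_t=0.6, implicit_t=0.85):
        """Map ASR confidence to a grounding action (threshold values are made up)."""
        if not semantically_ok or asr_confidence < reject_t:
            return "REJECT"              # "I'm sorry, I didn't understand that."
        if asr_confidence < explicit_t:
            return "EXPLICIT_CONFIRM"    # "Do you want to leave from Baltimore?"
        if asr_confidence < implicit_t:
            return "IMPLICIT_CONFIRM"    # "Traveling to Seattle on Tuesday ... Your name?"
        return "NO_CONFIRM"

    print(confirmation_action(0.72, True))   # IMPLICIT_CONFIRM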

Automatic Interpretation of Dialogue Acts
How do we automatically identify dialogue acts?
Given an utterance:
  Decide whether it is a QUESTION, STATEMENT, SUGGEST, or ACK
Perhaps we can just look at the form of the utterance to decide?

Can we just use the surface syntactic form?
YES-NO-Qs have auxiliary-before-subject syntax:
  Will breakfast be served on USAir 1557?
STATEMENTs have declarative syntax:
  I don't care about lunch
COMMANDs have imperative syntax:
  Show me flights from Milwaukee to Orlando on Thursday night

Surface form != speech act type
  Utterance                                Surface form   Speech act
  Can I have the rest of your sandwich?    Question       Request
  I want the rest of your sandwich         Declarative    Request
  Give me your sandwich!                   Imperative     Request

Dialogue Act ambiguity
Can you give me a list of the flights from Atlanta to Boston?
This looks like an INFO-REQUEST.
  If so, the answer is: YES.
But really it's a DIRECTIVE or REQUEST, a polite form of:
  Please give me a list of the flights
What looks like a QUESTION can be a REQUEST

Dialogue Act ambiguity
Similarly, what looks like a STATEMENT can be a QUESTION:

Us  OPEN-OPTION  I was wanting to make some arrangements for a trip that I'm going to be taking uh to LA uh beginning of the week after next
Ag  HOLD         OK uh let me pull up your profile and I'll be right with you here. [pause]
Ag  CHECK        And you said you wanted to travel next week?
Us  ACCEPT       Uh yes.

Indirect speech acts
Utterances which use a surface statement to ask a question
Utterances which use a surface question to issue a request

DA interpretation as statistical classification: Features
Words and collocations:
  "Please" or "would you": good cue for REQUEST
  "Are you": good cue for INFO-REQUEST
Prosody:
  Rising pitch is a good cue for INFO-REQUEST
  Loudness/stress can help distinguish yeah/AGREEMENT from yeah/BACKCHANNEL
Conversational structure:
  "Yeah" following a proposal is probably AGREEMENT; "yeah" following an INFORM is probably a BACKCHANNEL

An example of dialogue act detection: Correction Detection
If the system misrecognizes an utterance, and either
  Rejects, or
  Via confirmation, displays its misunderstanding
Then the user has a chance to make a correction:
  Repeating themselves
  Rephrasing
  Saying "no" to the confirmation question

Corrections
Unfortunately, corrections are harder to recognize than normal sentences!
Swerts et al. (2000): corrections are misrecognized twice as often (in terms of WER) as non-corrections!
Why? Prosody seems to be the largest factor: hyperarticulation
  Liz Shriberg example: "NO, I am DE-PAR-TING from Jacksonville"
  Bettina Braun example from a talking elevator: "In den VIERTEN Stock" ("to the FOURTH floor")
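As a concrete illustration (not from the lecture), here is a minimal sketch of turning lexical, prosodic, and conversational-structure cues like those above into features for a statistical dialogue-act / correction classifier; the feature names, threshold values, training examples, and classifier choice are all hypothetical.

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    def da_features(words, f0_slope, mean_energy, prev_act):
        """Hypothetical feature extractor for dialogue-act classification."""
        return {
            "has_please_or_would": any(w in words for w in ("please", "would")),
            "starts_are_you": words[:2] == ["are", "you"],
            "starts_no": bool(words) and words[0] == "no",   # possible correction cue
            "rising_pitch": f0_slope > 0,                    # cue for INFO-REQUEST
            "loud": mean_energy > 0.7,                       # crude hyperarticulation cue
            "prev_act": prev_act,                            # yeah after PROPOSAL vs. INFORM
            "length": len(words),
        }

    # Tiny invented training set: (features, label)
    X = [da_features("are you open on sunday".split(), +0.4, 0.5, "INFORM"),
         da_features("no i am departing from jacksonville".split(), -0.2, 0.9, "CONFIRM"),
         da_features("yeah".split(), -0.1, 0.3, "PROPOSAL")]
    y = ["INFO-REQUEST", "CORRECTION", "AGREEMENT"]

    clf = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(X, y)
    print(clf.predict([da_features("would you list the flights".split(), -0.3, 0.4, "INFORM")]))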

A Labeled Dialogue (Swerts et al.) (figure)

Machine learning to detect user corrections: features
  Lexical information ("no", "correction", "I don't", swear words)
  Prosodic indicators of hyperarticulation
    increases in F0 range, pause duration, word duration
  Length
  ASR confidence
  LM probability
  Various dialogue features (repetition)

Prosodic Features
Shriberg et al. (1998)
Decision tree trained on simple acoustically-based prosodic features:
  Slope of F0 at the end of the utterance
  Average energy at different places in the utterance
  Various duration measures
  All normalized in various ways
These helped distinguish (a decision-tree sketch follows below):
  Statement (S)
  Yes-no question (QY)
  Declarative question (QD) ("You're going to the store?")
  Wh-question (QW)
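A minimal sketch of the kind of prosodic decision tree described above, using scikit-learn; the feature values and training examples are invented for illustration and are not Shriberg et al.'s data.

    from sklearn.tree import DecisionTreeClassifier, export_text

    # Features per utterance: [final F0 slope, mean energy, duration in seconds]
    # (the values are made up for illustration)
    X = [[-0.8, 0.4, 2.1],   # falling pitch, statement-like
         [+0.9, 0.5, 1.0],   # rising pitch, yes-no question
         [+0.7, 0.4, 1.8],   # rising pitch over declarative syntax
         [-0.3, 0.6, 1.2]]   # wh-question, typically falling pitch
    y = ["S", "QY", "QD", "QW"]

    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
    print(export_text(tree, feature_names=["f0_slope_end", "mean_energy", "duration"]))
    print(tree.predict([[+0.8, 0.45, 1.1]]))   # classify a new utterance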

Prosodic Decision Tree for making the S/QY/QW/QD decision (figure)

Dialogue System Evaluation
Always two kinds of evaluation:
  Extrinsic: embedded in some external task
  Intrinsic: evaluating the component as such

What constitutes success or failure for a dialogue system?

Reasons for Dialogue System Evaluation
A metric to compare systems
  Can't improve it if we don't know where it fails
  Can't decide between two systems without a goodness metric
A metric as an input to reinforcement learning: automatically improve conversational agent performance via learning

PARADISE evaluation
Maximize Task Success
Minimize Costs:
  Efficiency measures
  Quality measures
PARADISE (PARAdigm for DIalogue System Evaluation) (Walker et al. 2000)

Task Success
% of subtasks completed
Correctness of each question/answer/error message
Correctness of the total solution
  Error rate in final slots
  Generalization of Slot Error Rate
User's perception of whether the task was completed

Efficiency Cost
Polifroni et al. (1992), Danieli and Gerbino (1995), Hirschman and Pao (1993)

Total elapsed time in seconds or turns
Number of queries
Turn correction ratio: the number of system or user turns used solely to correct errors, divided by the total number of turns

Quality Cost
# of times the ASR system failed to return any sentence
# of ASR rejection prompts
# of times the user had to barge in
# of time-out prompts
Inappropriateness (verbose, ambiguous) of the system's questions, answers, error messages

Concept accuracy (concept error rate)
% of semantic concepts that the NLU component returns correctly
Example: "I want to arrive in Austin at 5:00" recognized as
  DESTCITY: Boston
  TIME: 5:00
  Concept accuracy = 50%
Average this across the entire dialogue: how many of the sentences did the system understand correctly?
Can be used as either a quality cost or a task success measure (see the sketch below)
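A minimal sketch of computing concept accuracy for a single utterance, assuming hypothetical slot-value dictionaries for the reference and the NLU hypothesis.

    def concept_accuracy(reference, hypothesis):
        """Fraction of reference slot values that the NLU hypothesis got right."""
        correct = sum(1 for slot, value in reference.items()
                      if hypothesis.get(slot) == value)
        return correct / len(reference)

    ref = {"DESTCITY": "Austin", "TIME": "5:00"}
    hyp = {"DESTCITY": "Boston", "TIME": "5:00"}
    print(concept_accuracy(ref, hyp))   # 0.5, matching the slide's example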

PARADISE: Regress against user satisfaction (figure)

Regressing against user satisfaction
Questionnaire to assign each dialogue a user satisfaction rating: this is the dependent measure
The set of cost and success factors are the independent measures
Use regression to train weights for each factor

Experimental Procedures
Subjects given specified tasks
Spoken dialogues recorded
Cost factors, states, dialogue acts automatically logged; ASR accuracy and barge-in hand-labeled
Users specify the task solution via a web page
Users complete User Satisfaction surveys
Use multiple linear regression to model User Satisfaction as a function of Task Success and Costs; test for significant predictive factors
(Slide from Julia Hirschberg)

User Satisfaction: Sum of Many Measures
Was the system easy to understand? (TTS Performance)
Did the system understand what you said? (ASR Performance)
Was it easy to find the message/plane/train you wanted? (Task Ease)
Was the pace of interaction with the system appropriate? (Interaction Pace)
Did you know what you could say at each point of the dialog? (User Expertise)
How often was the system sluggish and slow to reply to you? (System Response)
Did the system work the way you expected it to in this conversation? (Expected Behavior)
Do you think you'd use the system regularly in the future? (Future Use)

Performance Functions from Three Systems
  ELVIS:  User Sat. = .21*COMP + .47*MRS - .15*ET
  TOOT:   User Sat. = .35*COMP + .45*MRS - .14*ET
  ANNIE:  User Sat. = .33*COMP + .25*MRS - .33*Help
where
  COMP: user perception of task completion (task success)
  MRS: mean (concept) recognition accuracy (cost)
  ET: elapsed time (cost)
  Help: help requests (cost)
(Slide from Julia Hirschberg)
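As an illustration of the PARADISE-style regression (not the actual study data), here is a minimal sketch that fits user satisfaction from task success and cost factors with ordinary least squares; all numbers are invented.

    import numpy as np

    # Each row: [COMP, MRS, ET] for one dialogue (invented values);
    # y: the user-satisfaction rating from the questionnaire.
    X = np.array([[1.0, 0.90, 0.2],
                  [1.0, 0.75, 0.5],
                  [0.0, 0.60, 0.9],
                  [1.0, 0.85, 0.4],
                  [0.0, 0.50, 1.0]])
    y = np.array([4.5, 3.8, 2.0, 4.2, 1.5])

    # Add an intercept column and solve the least-squares regression.
    A = np.hstack([X, np.ones((len(X), 1))])
    weights, *_ = np.linalg.lstsq(A, y, rcond=None)
    print(dict(zip(["COMP", "MRS", "ET", "intercept"], weights.round(2))))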

Evaluation Summary
Best predictors of User Satisfaction:
  Perceived task completion
  Mean recognition score (concept accuracy)
The performance model is useful for system development:
  Making predictions about system modifications
  Distinguishing good dialogues from bad dialogues
  As part of a learning model

Now that we have a success metric
Could we use it to help drive learning?
Learn an optimal policy or strategy for how the conversational agent should behave

New Idea: Modeling a dialogue system as a probabilistic agent
A conversational agent can be characterized by:
  The current knowledge of the system
  A set of states S the agent can be in
  A set of actions A the agent can take
  A goal G, which implies
    A success metric that tells us how well the agent achieved its goal
    A way of using this metric to create a strategy or policy π for what action to take in any particular state

What do we mean by actions A and policies π?
Kinds of decisions a conversational agent needs to make:
  When should I ground/confirm/reject/ask for clarification on what the user just said?
  When should I ask a directive prompt, when an open prompt?
  When should I use user, system, or mixed initiative?

A threshold is already a policy, a human-designed one!
Could we learn what the right action is
  Rejection
  Explicit confirmation
  Implicit confirmation
  No confirmation
by learning a policy which, given various information about the current state, dynamically chooses the action which maximizes dialogue success?

Another strategy decision
Open versus directive prompts
When to do mixed initiative

How do we do this optimization?
Markov Decision Processes

Review: Open vs. Directive Prompts
Open prompt
  System gives the user very few constraints
  User can respond how they please:
    "How may I help you?" "How may I direct your call?"
Directive prompt
  Explicitly instructs the user how to respond:
    "Say yes if you accept the call; otherwise, say no."

Review: Restrictive vs. Non-restrictive grammars
Restrictive grammar
  Language model which strongly constrains the ASR system, based on dialogue state
Non-restrictive grammar
  Open language model which is not restricted to a particular dialogue state

Kinds of Initiative
How do I decide which of these initiatives to use at each point in the dialogue?

  Grammar           Open Prompt          Directive Prompt
  Restrictive       Doesn't make sense   System Initiative
  Non-restrictive   User Initiative      Mixed Initiative

Modeling a dialogue system as a probabilistic agent
A conversational agent can be characterized by:
  The current knowledge of the system
  A set of states S the agent can be in
  A set of actions A the agent can take
  A goal G, which implies
    A success metric that tells us how well the agent achieved its goal
    A way of using this metric to create a strategy or policy π for what action to take in any particular state

Goals are not enough
Goal: user satisfaction
OK, that's all very well, but:
  Many things influence user satisfaction
  We don't know user satisfaction till after the dialogue is done
  How do we know, state by state and action by action, what the agent should do?
We need a more helpful metric that can apply to each state

Utility
A utility function maps a state or state sequence onto a real number describing the goodness of that state
  i.e., the resulting happiness of the agent
Principle of Maximum Expected Utility:
  A rational agent should choose an action that maximizes the agent's expected utility

Maximum Expected Utility
Principle of Maximum Expected Utility:
  A rational agent should choose an action that maximizes the agent's expected utility
Action A has possible outcome states Result_i(A)
E: the agent's evidence about the current state of the world
Before doing A, the agent estimates the probability of each outcome:
  P(Result_i(A) | Do(A), E)
Thus we can compute the expected utility:
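In the standard notation (as in Russell and Norvig), the expected utility of action A given evidence E is:

    EU(A \mid E) = \sum_i P(\mathrm{Result}_i(A) \mid \mathrm{Do}(A), E)\; U(\mathrm{Result}_i(A))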

Utility (Russell and Norvig) (figure)

Markov Decision Processes
Or "MDP"
Characterized by:
  a set of states S an agent can be in
  a set of actions A the agent can take
  a reward r(a,s) that the agent receives for taking an action in a state

(+ some other things I'll come back to: gamma, state transition probabilities)

A brief tutorial example
Levin et al. (2000)
A Day-and-Month dialogue system
Goal: fill in a two-slot frame:
  Month: November
  Day: 12th
via the shortest possible interaction with the user

What is a state?
In principle, the MDP state could include any possible information about the dialogue
  Complete dialogue history so far
Usually we use a much more limited set:
  Values of slots in the current frame
  Most recent question asked to the user
  User's most recent answer
  ASR confidence
  etc.

State in the Day-and-Month example
Values of the two slots day and month
Total:
  2 special states: initial s_i and final s_f
  365 states with a day and month, plus 1 state for leap year
  12 states with a month but no day
  31 states with a day but no month
  411 total states (2 + 366 + 12 + 31 = 411)
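A quick sanity check of the 411 figure, as a sketch; treating the extra leap-year state as February 29 is my assumption about the slide's count.

    import calendar

    days_in_month = [calendar.monthrange(2000, m)[1] for m in range(1, 13)]  # 2000 is a leap year
    day_and_month = sum(days_in_month)          # 366 valid (day, month) pairs incl. Feb 29
    month_only, day_only, special = 12, 31, 2   # plus the initial and final states
    print(day_and_month + month_only + day_only + special)   # 411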

Actions in MDP models of dialogue
Speech acts!
  Ask a question
  Explicit confirmation
  Rejection
  Give the user some database information
  Tell the user their choices
Do a database query

Actions in the Day-and-Month example
  a_d: a question asking for the day
  a_m: a question asking for the month
  a_dm: a question asking for the day and the month
  a_f: a final action submitting the form and terminating the dialogue

A simple reward function
For this example, let's use a cost function
A cost function for the entire dialogue
Let
  N_i = number of interactions (duration of dialogue)
  N_e = number of errors in the obtained values (0-2)
  N_f = expected distance from goal
    (0 for a complete date, 1 if either day or month is missing, 2 if both are missing)
Then the (weighted) cost is:
  C = w_i*N_i + w_e*N_e + w_f*N_f

2 possible policies
  P_o = probability of error in the open prompt
  P_d = probability of error in the directive prompt

2 possible policies
Strategy 1 is better than strategy 2 when the improved error rate justifies the longer interaction:
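One plausible reconstruction of that comparison under the cost function above, assuming the directive strategy asks two questions (one per slot) with per-slot error probability p_d, the open strategy asks a single question with per-slot error probability p_o, and both strategies end with a complete date (N_f = 0):

    C_{\mathrm{directive}} = 2\,w_i + 2\,p_d\,w_e, \qquad C_{\mathrm{open}} = w_i + 2\,p_o\,w_e

    C_{\mathrm{directive}} < C_{\mathrm{open}} \iff w_i < 2\,w_e\,(p_o - p_d)

That is, the longer directive interaction pays off only when the error-rate improvement p_o - p_d is large relative to the per-turn cost w_i.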

That was an easy optimization
Only two actions, and only a tiny number of policies
In general, the number of actions, states, and policies is quite large
So finding the optimal policy π* is harder
We need reinforcement learning
Back to MDPs:

MDP
We can think of a dialogue as a trajectory in state space

The best policy π* is the one with the greatest expected reward over all trajectories
How do we compute a reward for a state sequence?

Reward for a state sequence
One common approach: discounted rewards
The cumulative reward Q of a sequence is a discounted sum of the utilities of the individual states:
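In the usual notation, with per-step rewards R and discount factor \gamma:

    Q([s_0, a_0, s_1, a_1, s_2, a_2, \ldots]) = R(s_0, a_0) + \gamma\, R(s_1, a_1) + \gamma^2 R(s_2, a_2) + \cdots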

The discount factor γ is between 0 and 1
It makes the agent care more about current than future rewards; the more future a reward, the more discounted its value

The Markov assumption
The MDP assumes that state transitions are Markovian:
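That is, the next state depends only on the current state and action:

    P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = P(s_{t+1} \mid s_t, a_t)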

Expected reward for an action
The expected cumulative reward Q(s,a) for taking a particular action from a particular state can be computed by the Bellman equation:
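In its standard form (matching the word-by-word description that follows):

    Q(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, \max_{a'} Q(s', a')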

The expected cumulative reward for a given state/action pair is:
  the immediate reward for the current state
  + the expected discounted utility of all possible next states s'
    weighted by the probability of moving to that state s'
    and assuming that once there we take the optimal action a'

What we need for the Bellman equation
A model of P(s'|s,a)
An estimate of R(s,a)
How to get these?
If we had labeled training data:
  P(s'|s,a) = C(s,a,s') / C(s,a)
If we knew the final reward for the whole dialogue R(s_1,a_1,s_2,a_2,...,s_n)
Given these parameters, we can use the value iteration algorithm to learn Q values (pushing back reward values over state sequences) and hence the best policy (a sketch follows below)

Final reward
What is the final reward for the whole dialogue R(s_1,a_1,s_2,a_2,...,s_n)?
This is what our automatic evaluation metric PARADISE computes: the general goodness of a whole dialogue!
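A minimal value-iteration sketch over a toy MDP; the states, transition probabilities, and rewards here are invented for illustration and are not the Day-and-Month system.

    # Toy MDP: P[s][a] = list of (next_state, probability), rewards R[(s, a)],
    # and a discount factor gamma. All values are made up.
    P = {
        "no_slots":   {"ask_month": [("month_only", 0.8), ("no_slots", 0.2)]},
        "month_only": {"ask_day":   [("both", 0.8), ("month_only", 0.2)]},
        "both":       {"submit":    [("done", 1.0)]},
        "done":       {},
    }
    R = {("no_slots", "ask_month"): -1, ("month_only", "ask_day"): -1, ("both", "submit"): 20}
    gamma = 0.9

    # Value iteration: repeatedly apply the Bellman update to Q(s, a).
    Q = {(s, a): 0.0 for s in P for a in P[s]}
    for _ in range(100):
        for (s, a) in Q:
            Q[(s, a)] = R[(s, a)] + gamma * sum(
                prob * max((Q[(s2, a2)] for a2 in P[s2]), default=0.0)
                for s2, prob in P[s][a])

    # Greedy policy: in each state, take the action with the highest Q value.
    policy = {s: max(P[s], key=lambda a: Q[(s, a)]) for s in P if P[s]}
    print(policy)   # {'no_slots': 'ask_month', 'month_only': 'ask_day', 'both': 'submit'}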

How to estimate P(s'|s,a) without labeled data
Have random conversations with real people:
  Carefully hand-tune a small number of states and policies
  Then build a dialogue system which explores the state space by generating a few hundred random conversations with real humans
  Set the probabilities from this corpus
Have random conversations with simulated people:
  Now you can have millions of conversations with simulated people
  So you can have a slightly larger state space

An example
Singh, S., D. Litman, M. Kearns, and M. Walker. 2002. Optimizing Dialogue Management with Reinforcement Learning: Experiments with the NJFun System. Journal of AI Research.

The NJFun system let people ask questions about recreational activities in New Jersey
Idea of the paper: use reinforcement learning to make a small set of optimal policy decisions

Very small # of states and acts
States: specified by values of 8 features
  Which slot in the frame is being worked on (1-4)
  ASR confidence value (0-5)
  How many times the current slot's question had been asked
  Restrictive vs. non-restrictive grammar
  Result: 62 states
Actions: each state has only 2 possible actions
  Asking questions: system versus user initiative
  Receiving answers: explicit versus no confirmation

Ran the system with real users
311 conversations
Simple binary reward function:
  1 if the task was completed (finding museums, theater, winetasting in the NJ area)
  0 if not
The system learned a good dialogue strategy. Roughly:
  Start with user initiative
  Back off to mixed or system initiative when re-asking for an attribute
  Confirm only at lower confidence values

State of the art
Only a few MDP systems were built
Current direction: partially observable MDPs (POMDPs)
  We don't REALLY know the user's state (we only know what we THOUGHT the user said)
  So we need to take actions based on our BELIEF, i.e., a probability distribution over states rather than the true state

Summary
Utility-based conversational agents
Policy/strategy for:
  Confirmation
  Rejection
  Open/directive prompts
  Initiative
  + ???
MDP

Summary
Dialogue manager design
  Finite-state
  Frame-based
  Information-state
Dialogue-act detection
Dialogue-act generation
Utility-based conversational agents: MDP, POMDP
Evaluation