misunderstandings, corrections and beliefs in spoken language interfaces Dan Bohus Computer Science...

misunderstandings, corrections and beliefs in spoken language interfaces

Dan Bohus Computer Science Departmentwww.cs.cmu.edu/~dbohus Carnegie Mellon Universitydbohus@cs.cmu.edu Pittsburgh, PA 15213

problem

spoken language interfaces lack robustness when faced with understanding errors

stems mostly from speech recognition spans most domains and interaction types exacerbated by operating conditions

more concretely …

S: What city are you leaving from?U: Urbana Champaign [OKAY IN THAT SAME PAY]S: Sorry I’m not sure I understood what you said. What city are you

leaving from ?U: Urbana Champaign [FOR MINUTE SINCE HEY]S: I’m still having trouble understanding you. To go on, I need you to

answer the following question. What city are you leaving from?U: Chicago [CHICAGO]S: traveling from Chicago. Where would you like to go?U: Huntsville [SEOUL]S: traveling to Seoul. What day did you need to travel?U: no no I’m traveling to Birmingham [THE TRAVELING TO BERLIN P_M]S: traveling in the afternoon. Okay what day would you be departing

chicagoU: /uh/ the tenth of august [AT THE TENTH OF AUGUST]S: a flight tomorrow… I have a Korean Airlines flight departing Chicago at

1:40pm, arrives Seoul at 5pm the next day. Do you want to take that?U: my destination is Birmingham [FLIGHT DESTINATION MR

WEEKEND]S: traveling on Saturday, August 12th … I have a flight departing Chicago

at 1:40pm arrives Seoul at ………

some statistics …

corrections [Krahmer, Swerts, Litman, Levow]

30% of utterances correct system mistakes 2-3 times more likely to be misrecognized

semantic error rates: ~25-35%

SpeechActs [SRI] 25%

CU Communicator [CU] 27%

Jupiter [MIT] 28%

CMU Communicator [CMU] 32%

How May I Help You? [AT&T] 36%

two types of understanding errors

S: What city are you leaving from?U: Urbana Champaign [OKAY IN THAT SAME PAY]

NON-understanding

System cannot extract any meaningful information from the user’s turn

S: What city are you leaving from?U: Birmingham [BERLIN PM]

System extracts incorrect information from the user’s turn

MIS-understanding

misunderstandings

S: What city are you leaving from?U: Birmingham [BERLIN PM]

System extracts incorrect information from the user’s turn

MIS-understanding

detect potential misunderstandings; do something about them

fix recognition

outline

detecting misunderstandings

detecting user corrections[late-detection of misunderstandings]

belief updating[construct accurate beliefs by integrating information from multiple turns]

recognition confidence scoresS: What city are you leaving from?U: Birmingham [BERLIN PM]

conf=0.63

traditionally [Bansal, Chase, Cox, Kemp, many others]

speech recognition confidence scores use acoustic, language model and search info frame, phoneme, word-level

“semantic” confidence scores

we’re interested in semantics, not words YES = YEAH, NO = NO WAY

use machine learning to build confidence annotators in-domain, manually labeled data

utterance: [BERLIN PM] Birmingham

labels: correct / misunderstood

features from different knowledge sources binary classification problem probability of misunderstanding: regression

problem

a typical result

Identifying User Corrections Automatically in a Spoken Dialog System [Walker, Wright, Langkilde]

HowMayIHelpYou corpus: call routing for phone services 11787 turns

features ASR: recog, numwords, duration, dtmf, rg-grammar, tempo … understanding: confidence, context-shift, top-task, diff-conf, … dialog & history: sys-label, confirmation, num-reprompts,

num-confirms, num-subdials, …

binary classification task majority baseline (error): 36.5% RIPPER (error): 14%

outline

belief updating [construct accurate beliefs by integrating information from multiple turns]

detect user corrections is the user trying to correct the system?

S: Where would you like to go?U: Huntsville [SEOUL]S: traveling to Seoul. What day did you need to travel?U: no no I’m traveling to Birmingham [THE TRAVELING TO BERLIN P_M]

user correction

misunderstanding

same story: use machine learning in-domain, manually labeled data features from different knowledge sources binary classification problem probability of correction: regression problem

typical result

Identifying User Corrections Automatically in a Spoken Dialog System [Hirschberg, Litman, Swerts]

TOOT corpus: access to train information 2328 turns, 152 dialogs

features prosodic: f0max, f0mn, rmsmax, dur, ppau, tempo

… ASR: gram, str, conf, ynstr, … dialog position: diadist dialog history: preturn, prepreturn, pmeanf

binary classification task majority baseline: 29% RIPPER: 15.7%

outline

belief updating[construct accurate beliefs by integrating information from multiple turns]

belief updating problem: an easy case

S: on which day would you like to travel?U: on September 3rd

[AN DECEMBER THIRD] {CONF=0.25}

S: did you say you wanted to leave on December 3rd?

departure_date = {Dec-03/0.25}

departure_date = {Ø}

[NO] {CONF=0.88}

belief updating problem: a trickier case

S: Where would you like to go?U: Huntsville

[SEOUL] {CONF=0.65}

S: traveling to Seoul. What day did you need to travel?

destination = {seoul/0.65}

destination = {?}

U: no no I’m traveling to Birmingham

[THE TRAVELING TO BERLIN P_M] {CONF=0.60} {COR=0.35}

given: an initial belief Pinitial(C) over

concept C a system action SA a user response R

construct an updated belief: Pupdated(C) ← f (Pinitial(C), SA, R)

belief updating problem formalized

S: traveling to Seoul. What day did you need to travel?

destination = {seoul/0.65}

destination = {?}

[THE TRAVELING TO BERLIN P_M] {CONF=0.60} {COR=0.35}

outline

belief updating [construct accurate beliefs by integrating information from multiple turns] current solutions a restricted version data user response analysis experiments and results discussion. caveats. future work

belief updating: current solutions

most systems only track values, not beliefs

new values overwrite old values explicit confirm + yes → trust hypothesis explicit confirm + no → kill hypothesis explicit confirm + “other” → non-understanding implicit confirm: not much

“users who discover errors through incorrect implicitconfirmations have a harder time getting back on track”[Shin et al, 2002]

outline

belief updating: general form

given: an initial belief Pinitial(C) over concept C a system action SA a user response R

construct an updated belief: Pupdated(C) ← f (Pinitial(C), SA, R)

restricted version: 2 simplifications

1. compact belief system unlikely to “hear” more than 3 or 4

values single vs. multiple recognition results

in our data: max = 3 values, only 6.9% have >1 value

confidence score of top hypothesis

2. updates after confirmation actions

reduced problem ConfTopupdated(C) ← f (ConfTopinitial(C), SA, R)

outline

I found 10 rooms for Friday between 1 and 3 p.m. Would like a small room or a large one?

collected with RoomLine a phone-based mixed-initiative spoken dialog

system conference room reservation

search and negotiation

explicit and implicit confirmations confidence threshold model (+ some

exploration)

implicit confirmation task

user study 46 participants, 1st time users 10 scenarios, fixed order presented graphically (explained during briefing)

compensated per task success

corpus statistics

449 sessions, 8848 user turns orthographically transcribed manually annotated

misunderstandings (concept-level) non-understandings user corrections correct concept values

outline

user response types

following Krahmer and Swerts study on Dutch train-table information system

3 user response types YES: yes, right, that’s right, correct, etc. NO: no, wrong, etc. OTHER

cross-tabulated against correctness of confirmations

user responses to explicit confirmations

YES NO Other

CORRECT94%

[93%]0% [0%] 5% [7%]

INCORRECT 1% [6%]72%

[57%]27%

[37%]~10%

from transcripts

[numbers in brackets from Krahmer&Swerts]

from decoded YES NO Other

CORRECT 87% 1% 12%

INCORRECT 1% 61% 38%

other responses to explicit confirmations

~70% users repeat the correct value ~15% users don’t address the question

attempt to shift conversation focus

User does not correct

User corrects

CORRECT 1159 0

INCORRECT 29 [10% of incor]

250[90% of incor]

user responses to implicit confirmations

YES NO Other

CORRECT30% [0%]

7% [0%]63%

[100%]

INCORRECT 6% [0%]33%

[15%]61%

transcripts

[numbers in brackets from Krahmer&Swerts]

decodedYES NO Other

CORRECT 28% 5% 67%

INCORRECT 7% 27% 66%

ignoring errors in implicit confirmations

User does not correct

User corrects

CORRECT 552 2

INCORRECT 118 [51% of incor]

111[49% of incor]

users correct later (40% of 118) users interact strategically

correct only if essential

~correct later

correct later

~critical 55 2

critical 14 47

outline

machine learning approach

need good probability outputs low cross-entropy between model

predictions and reality cross-entropy = negative average log

posterior

logistic regression sample efficient stepwise approach → feature selection

logistic model tree for each action root splits on response-type

features. target.

initial situation initial confidence score concept identity, dialog state, turn number

system action other actions performed in parallel

features of the user response acoustic / prosodic features lexical features grammatical features dialog-level features

target: was the value correct?

baselines

initial baseline accuracy of system beliefs before the update

heuristic baseline accuracy of heuristic rule currently used in

the system

oracle baseline accuracy if we knew exactly when the user is

correcting the system

results: explicit confirmation

InitialHeuristicLMTOracle

InitialHeuristicLMT

3.57 2.71

Explicit ConfirmationHard error (%) Soft error

InitialHeuristicLMT

InitialHeuristicLMTOracle

16.1515.33

0.610.67

Implicit Confirmation

results: implicit confirmation

Hard error (%) Soft error

InitialHeuristicLMT

InitialHeuristicLMTOracle15.40

14.3612.64

Unplanned Implicit Confirmation

0.430.46

results: unplanned implicit confirmation

Hard error (%) Soft error

informative features

initial confidence score prosody features barge-in expectation match repeated grammar slots concept id priors on concept values [not included in these

results]

outline

discussion

evaluation does it make sense? what would be a better evaluation?

current limitation: belief compression extending models to N hypothesis + other

current limitation: system actions extending models to cover all system actions

thank you!

a more subtle caveat distribution of training data

confidence annotator + heuristic update rules

distribution of run-time data confidence annotator + learned model

always a problem when interacting with the world!

hopefully, distribution shift will not cause large degradation in performance remains to validate empirically maybe a bootstrap approach?

KL-divergence & cross-entropy KL divergence: D(p||q)

Cross-entropy: CH(p, q) = H(p) + D(p||q)

Negative log likelihood

)(log)()||(

xpxpqpD

)(log)(),( xqxpqpCH

)(log)( xqqLL

logistic regression regression model for binomial (binary) dependent

variables

fwefxP

1)|1( fw

)1(log

fit a model using max likelihood (avg log-likelihood) any stats package will do it for you

no R2 measure test fit using “likelihood ratio” test stepwise logistic regression

keep adding variables while data likelihood increases signif. use Bayesian information criterion to avoid overfitting

logistic regression

0 10% 20% 30% 40% 50%0

% Nonunderstandings (FNON)

logistic model tree

0 10% 20% 30% 40% 50%0

regression tree, but with logistic models on leaves

f=0 f=1

g>10g<=10

misunderstandings, corrections and beliefs in spoken language interfaces Dan Bohus Computer Science...

Documents

MISUNDERSTANDINGS AND UNOFFICIAL KNOWLEDGE IN

Spoken Dialog Management for an Astronaut’s Procedure Assistant Presented by: Dan Bohus (dbohus@cs.cmu.edu) Collaborators: Gregory Aist, RIALIST Group

“k hypotheses + other” belief updating in spoken dialog systems Dialogs on Dialogs Talk, March 2006 Dan Bohus Computer Science Department dbohus

COVID-19 misunderstandings cleared up

On the Most Common Misunderstandings of High-Lift …elib.dlr.de/109654/1/On the Most Common Misunderstandings of High... · On the Most Common Misunderstandings of High-Lift Flows

Traditional Scholarship and Modern Misunderstandings

UNDERSTANDINGS AND MISUNDERSTANDINGS OF TRIGONOMETRY …

Bohus Engagement Model

7 Common Misunderstandings about China

10 common misunderstandings about the WTO - APEDA misunderst… · Criticisms of the WTO are often based on fundamental misunderstandings of the way the WTO works. 10 common misunderstandings

Intercultural Misunderstandings

Marriage Misunderstandings

1569 W6th Ave, Technical/Design Comment, S. Bohus

Online supervised learning of non-understanding recovery policies Dan Bohus dbohus dbohus@cs.cmu.edu Computer Science Department Carnegie

Bohus Collection 2014

Chemistry Myths and Misunderstandings

Misunderstandings 5DEC12 HIS 225-01

Cost of Misunderstandings Modeling the Cost of Misunderstanding Errors in the CMU Communicator Dialog System Presented by: Dan Bohus (dbohus@cs.cmu.edu)

Common Misunderstandings

Some photos from Bohus Malmön