Speech-to-Speech Translation with Clarifications

Julia Hirschberg, Svetlana Stoyanchev

Columbia University

September 18, 2013

Outline

Main Problem

Key Ideas

Solution Details

Impact

Issues, Gaps, and Future work

Speech Translation

Speech-to-Speech translation system

[Diagram: L1 Speaker → Speech Question (L1) → Translation System → Translated Question (L2) → L2 Speaker; Answer (L2) → Translation System → Translated Answer (L1)]

Speech Translation

Translation may be impaired by:

Speech recognition errors

Word error rate on the English side of TRANSTAC is 9%

Word error rate in the Let’s Go bus information system is 50%

A speaker may use ambiguous language

A speech recognition error may be caused by use of out-of-vocabulary words

Speech Translation

Speech-to-Speech translation system

Introduce a clarification component

[Diagram: L1 Speaker, Speech Question (L1), Translated Question (L2), Answer (L2), Translated Answer (L1), L2 Speaker; two Dialogue Manager components, each with a clarification sub-dialogue]

Key Ideas

Use targeted clarifications

Address challenges with targeted clarifications

Data collection for system evaluation

Most Common Clarification Strategies in Dialogue Systems

“Please repeat”

“Please rephrase”

System repeats the previous question

What Clarification Questions Do Human Speakers Ask?

Targeted reprise questions (M. Purver)

o Ask a targeted question about the part of an utterance that was misheard or misunderstood, including understood portions of the utterance

o Speaker: Do you have anything other than these XXX plans?

o Non-Reprise: What did you say?/Please repeat.

o Reprise: What kind of plans?

88% of human clarification questions are reprise

12% non-reprise

• Goal: Introduce targeted (reprise) questions into a spoken system

Advantages of Targeted Clarifications

More natural

User does not have to repeat the whole utterance/command

Provides grounding and implicit confirmation

Useful in systems that handle natural language user responses/commands/queries and a wide range of topics and vocabulary

Speech-to-speech translation

Tutoring system

Virtual assistants (in car, in home): a user command may contain ASR error due to noise, background speech, etc.

Types of Clarification Questions in the TBOLT System

Rephrase part

• Used when an error is OOV and NOT a name (works on difficult non-OOV words as well)

• Asks to rephrase the error segment

• “I did not understand when you said: fiscal. Please give me another word or phrase for it.”

Spelling

• Used for names

• “Please spell ‘Rockefeller’.”

Disambiguation

• Used to disambiguate between homophones

• “Did you mean plain as in extensive tract of level open land, or, plane as in an aircraft?”

Types of Questions (cont.)

• Reprise (as found in human-human communication)

o Repeats part of the utterance before the error segment

o User: We will search some of the XXX to make sure everyone is safe.

o System: We will search some of the what?

• Reprise/Rephrase-part

o Combines a targeted question with a rephrase question

o System: We will search some of the what? Please say another word or phrase for this: ‘vehicles’.

• Confirmation

o A yes/no question to confirm an utterance

o “Did you say ‘the breach is located here’?”

Requirement for a Targeted Question

Error Detection

Error segment boundaries

Error type

Does the error contain a proper name?

Does the error contain an out-of-vocabulary (OOV) word?
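The question types and requirements above suggest a simple dispatch from a detected error to a question type. The following is a minimal sketch, not the TBOLT implementation: the `DetectedError` record, the `choose_question_type` function, the priority order of the checks, and the fallback to a reprise question are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class DetectedError:
    """Hypothetical output of an error detection component."""
    segment: str                                   # words inside the detected error boundaries
    is_name: bool = False                          # does the error contain a proper name?
    is_oov: bool = False                           # does the error contain an OOV word?
    homophones: Optional[Tuple[str, str]] = None   # candidate pair, if the word is ambiguous


def choose_question_type(error: DetectedError) -> str:
    """Map a detected error to one of the question types listed in the slides.

    The order of the checks is an assumption, not taken from the slides.
    """
    if error.is_name:
        return "spelling"          # e.g. "Please spell 'Rockefeller'."
    if error.homophones is not None:
        return "disambiguation"    # e.g. "Did you mean plain ..., or, plane ...?"
    if error.is_oov:
        return "rephrase-part"     # e.g. "Please give me another word or phrase for it."
    return "reprise"               # assumed fallback, e.g. "We will search some of the what?"


print(choose_question_type(DetectedError("Rockefeller", is_name=True, is_oov=True)))
print(choose_question_type(DetectedError("fiscal", is_oov=True)))
```

A wrong `is_name` or `is_oov` flag leads directly to the inappropriate questions discussed in the following slides, which is why the dispatch depends so heavily on error type detection.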

Challenges with Using Targeted Clarifications

1) A system's clarification question may not be appropriate because:

An error segment may not be detected correctly

An error type may not be identified correctly

2) How should a system construct a targeted/reprise clarification question?

3) How long should the system persist in its clarification?

Our goal is to address these challenges empirically

Examples of Inappropriate Clarification Questions

Rephrase-part (incorrect error type detected)

• User: Captain Bell wants to ask you some questions.

• Error Detection: Captain <OOV> to ask you some questions.

• System: I did not understand when you said "Bell wants". Please give me another word or phrase for it.

Reprise/Targeted Rephrase (error segment is identified incorrectly)

• User: The insurgents are denouncing us.

• ASR/Error detection: The insurgents are doing now and <ERROR>

• System: The insurgents are doing now and what? Please say another word or phrase for "cing us".

Examples (cont.)

Spelling (error type is incorrectly detected as a name)

• User: It sounds like there is a kernel of truth to your story.

• Error Detection: It sounds like there is a <NAME> of truth to your story.

• System: I think I heard a name when you said this: "kernel". Please spell that name.

Examples (cont.)

Disambiguation (disambiguated word is misunderstood)

• User: How often do you have problems with generators?

• ASR/Error Detection: How often do you have problems with general?

• System: Did you mean general as in broad or general as in a military officer?

Causes of Inappropriate Questions

Rephrase part

• Partial word is detected as an error

• Detected segment contains a name

• Detected segment is a function word (to, from, the …)

Disambiguate

• Neither choice for disambiguation is correct

• Disambiguated word is misrecognized

Spell

• Non-name

• Long segment

Reprise

• Contains undetected recognition error

Goal

• Develop a method to automatically identify when an inappropriate question is asked

Use user’s answers to detect if a question was inappropriate

Data Collection

• Simulated clarification system

• Users were asked to read a sentence and then were played a pre-recorded question

• Users were led to believe they were interacting with the actual system

Data Collection (cont.)

• Prepared 228 questions

o 84 appropriate

o 144 inappropriate

o For each type of clarification question, created appropriate and inappropriate questions

o Total: 19 categories of clarification questions

• Each subject was asked 144 questions

• Recorded their initial utterances and their answers to the questions

User Responses

• Subjects tended to be cooperative

• Answers varied from subject to subject

• Example: “I did not understand when you said: ‘Betirma’. Please give me another word or phrase for it.”

o “No”

o “Betirma”

o “Betirma bravo echo tango india romeo mike alpha”

User Responses (cont.)

• Example 2:

User: “How often do you have problems with generators?”

System: “Did you mean general as in broad or general as in a military officer?”

o "generator as in a machine for making electricity"

o "no"

o "generators"

Method

Extract lexical and prosodic features from responses

Number of pauses, speech energy, speech tempo

Lexical and prosodic difference between initial response and an answer to clarification

Measure number of times subjects replay each question

Measure latency: length of pause before answer

• Determine whether questions are appropriate or inappropriate based on user responses
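Some of the features listed above can be computed from a time-aligned transcript of the response. The sketch below is illustrative only: it assumes word-level timestamps are already available (for example from forced alignment), the function name `response_features` and the 0.3 s pause threshold are hypothetical, lexical difference is approximated as one minus the Jaccard overlap of the two word sets, and speech energy is left out because it requires the audio signal.

```python
from typing import Dict, List, Tuple

Word = Tuple[str, float, float]  # (token, start time in seconds, end time in seconds)


def response_features(initial: List[Word], answer: List[Word],
                      question_end: float,
                      pause_threshold: float = 0.3) -> Dict[str, float]:
    """Compute simple timing and lexical features for an answer to a clarification."""
    if not answer:
        raise ValueError("answer must contain at least one word")

    # Latency: silence between the end of the system question and the answer onset.
    latency = answer[0][1] - question_end

    # Pauses: gaps between consecutive words that reach the (assumed) threshold.
    pauses = sum(1 for prev, cur in zip(answer, answer[1:])
                 if cur[1] - prev[2] >= pause_threshold)

    # Tempo: words per second over the span of the answer.
    duration = answer[-1][2] - answer[0][1]
    tempo = len(answer) / duration if duration > 0 else 0.0

    # Lexical difference: 1 - Jaccard overlap between the two vocabularies.
    initial_vocab = {w.lower() for w, _, _ in initial}
    answer_vocab = {w.lower() for w, _, _ in answer}
    union = initial_vocab | answer_vocab
    overlap = len(initial_vocab & answer_vocab) / len(union) if union else 1.0

    return {
        "latency": latency,
        "num_pauses": float(pauses),
        "tempo": tempo,
        "lexical_difference": 1.0 - overlap,
    }


initial = [("how", 0.0, 0.2), ("often", 0.2, 0.5), ("generators", 0.5, 1.1)]
answer = [("no", 10.5, 10.7), ("generators", 11.3, 11.9)]
print(response_features(initial, answer, question_end=10.0))
```

Feature dictionaries of this kind could then be paired with the appropriate/inappropriate label of each prepared question to train a classifier.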

Challenge 2: Constructing Targeted Clarification Questions

Previous work: collected clarification questions using mturk (Stoyanchev et al. 2012, 2013)

Using the human-generated questions, manually created a set of generation rules

Evaluated generated questions with human subjects

Types of Questions

R_GEN Generic: <context before error> what?

o Applies if no other rules apply

o Sentence: The doctor will most likely prescribe XXX

o Question: the doctor will most likely prescribe WHAT?

R_SYN Syntactic: about <context before error> what about <context after error> ?

o Applies when: there is VB after error; VB and error share a parent

o Sentence: When was the XXX contacted?

o Question: When was WHAT contacted?

R_NMOD: which <parent word>?

o Applies when: DEP TAG error = NMOD and parent POS = NN | NNS

o Sentence: Do you have anything other than these XXX plans

o Question: Which plans?

R_START: what about <context after error>
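The rules above can be sketched as a small rule cascade. This is a hedged illustration rather than the authors' generator: the `ErrorSentence` record and `generate_question` function are hypothetical, the dependency tag and parent word are assumed to come from an external parser, the order in which rules are tried is an assumption, and R_START is assumed to apply when the error opens the sentence (the slides do not state its condition). R_SYN is omitted because its condition needs a full dependency parse.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ErrorSentence:
    """Recognized sentence with the error segment collapsed into one token."""
    tokens: List[str]
    error_index: int                    # position of the error token in tokens
    dep_tag: Optional[str] = None       # dependency tag of the error (from a parser)
    parent_word: Optional[str] = None   # head word of the error segment
    parent_pos: Optional[str] = None    # part-of-speech tag of the head word


def generate_question(s: ErrorSentence) -> str:
    before = s.tokens[: s.error_index]
    after = s.tokens[s.error_index + 1:]

    # R_NMOD: the error modifies a noun, so ask "Which <parent word>?"
    if s.dep_tag == "NMOD" and s.parent_pos in ("NN", "NNS") and s.parent_word:
        return f"Which {s.parent_word}?"

    # R_START (assumed condition): the error is the first token.
    if not before and after:
        return "What about " + " ".join(after) + "?"

    # R_GEN: fallback when no other rule applies.
    return " ".join(before + ["what?"])


print(generate_question(ErrorSentence(
    "The doctor will most likely prescribe XXX".split(), 6)))
print(generate_question(ErrorSentence(
    "Do you have anything other than these XXX plans".split(), 7,
    dep_tag="NMOD", parent_word="plans", parent_pos="NNS")))
```

Keeping R_GEN as the final fallback mirrors the slide's statement that it applies only when no other rule does.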

Evaluation Questionnaire

• Generated questions automatically using the rules for a set of 84 sentences

• Asked humans (mturk) to create clarification questions for the same sentences

• Questionnaire applied to both human and computer-generated questions

Subjects

Mturk

Recruited 6 subjects from the lab

Inter-annotator Agreement

Results

Results

Discussion

R_GEN and R_SYN performance is comparable to human-generated questions

R_NMOD (which …?) outperforms all other question types including human-generated questions

R_START rule did not work

Key Ideas

Use Targeted Clarifications

Address challenges with targeted clarifications

Experiment on automatic detection of inappropriate questions

Experiment on automatic detection of when to terminate clarification

Data collection for system evaluation

Image Description and Questioning

Show the user an image and ask them to describe it and construct questions

Speaker1: A car is burning behind the girl. The girl looks startled. There was a massive explosion.

Speaker2: A woman is standing in front of a burning car. Everything around her seems to have been destroyed. What caused this destruction?

Data Collection for System Evaluation

• Advantages:

o Does not prime users with words in a verbally described scenario

o Elicits natural speech compared to reading

o Can be extended to a 2-way dialogue where the interviewee is given a narrative or video information for answering the interviewer's questions

• Disadvantages:

o Uncontrolled vocabulary (cannot force users to mispronounce words)

o No control across subject pairs

Impact

Impact on Speech-to-Speech Translation

Detecting when a targeted clarification question was inappropriate is an important feature for determining next dialogue move in clarification

Impact beyond Speech-to-Speech Translation

Targeted clarifications can be used in spoken dialogue systems

Especially useful for non-slot-filling (tutoring, virtual assistants)

Future Work

Appropriate and inappropriate questions

Analyze the data collected in responses to appropriate and inappropriate clarification questions

Use machine learning to predict whether an utterance is an answer to an appropriate or an inappropriate clarification question

Targeted (reprise) clarification questions

Which information from an initial sentence should a reprise clarification question contain?

Using human-constructed questions, determine which information is essential to be repeated in a targeted question

Clarification length

How long should the system focus on a targeted clarification before backing off?

Collect data and use machine learning to predict at each system turn whether a clarification should continue or stop

Conclusions

• Used an error-simulation system to collect data

Data collection experiment for automatic detection of answers to 'inappropriate' system clarifications

Evaluation of automatically generated reprise clarification questions shows that they could be used in a system

Proposed an experiment for determining an optimal length of targeted clarification

• Collected audio data for system evaluation using an image description method

Thank you

Questions?

Challenge 3: Clarification Length

How long should the system focus on a targeted clarification before backing off?

In speech-to-speech translation: back-off = translate

In spoken dialogue systems: back-off = ask a generic question such as 'please rephrase'.

• The answer depends on how patient and cooperative users are.

Evaluation of Clarification Length

• BOLT 2012 system behaviour: System asks targeted clarification at most 3 times before translating.

• Goal: Determine dynamically at each clarification turn whether the system should terminate clarification process.

• Use data to learn the dialogue strategy
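The fixed baseline described above (at most three targeted clarifications, then translate) can be written as a short loop. This is a minimal sketch under stated assumptions: `run_clarification` and the `ask` callback are hypothetical names, and `ask` stands in for asking one targeted question and judging whether the user's answer resolved the error.

```python
from typing import Callable, Tuple


def run_clarification(ask: Callable[[int], bool],
                      max_attempts: int = 3) -> Tuple[bool, int]:
    """Ask targeted questions until one succeeds or the attempt limit is reached.

    ask(attempt) asks the attempt-th question (1-based) and returns True if the
    answer resolved the error. Returns (resolved, number of questions asked).
    In either case the caller proceeds to translation afterwards.
    """
    for attempt in range(1, max_attempts + 1):
        if ask(attempt):
            return True, attempt
    # Back off: translate the utterance without a resolved clarification.
    return False, max_attempts


# Simulated user whose second answer resolves the error.
print(run_clarification(lambda attempt: attempt == 2))
# Simulated sequence of unsuccessful clarifications.
print(run_clarification(lambda attempt: False))
```

A learned strategy would replace the fixed `max_attempts` limit with a per-turn decision about whether to continue.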

Experiment Design

• Simulate sequence of unsuccessful clarification questions.

• Give user an option to hit “translate” button

• Distractor cases: Simulate successful clarification

User: This computer is not operational

System: Please rephrase “not operational”

User: not working

System: Thank you (translate and show next question)

• Experimental case:

Loop asking 3 – 5 different targeted questions

Clarification dialogue continues until the user hits “translate”

• Use a combination of distractor and experimental cases

Method

• Use data to determine when the system should give up on a targeted clarification

Apply machine learning

Features:

o Dialogue length (more likely to give up as dialogue continues to fail)

o Question type

o Appropriateness of a clarification question

o Confidences of error detection and classification components
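One possible way to present the listed features to a learner is as a fixed-length numeric vector per clarification turn. The sketch below only shows the encoding, not a trained model; the `ClarificationTurn` record, the `to_feature_vector` function, and the one-hot encoding of the question type are assumptions, and the question type names are taken from the earlier slides.

```python
from dataclasses import dataclass
from typing import List

# Question types named in the slides; the ordering here is arbitrary.
QUESTION_TYPES = ["rephrase-part", "spelling", "disambiguation",
                  "reprise", "confirmation"]


@dataclass
class ClarificationTurn:
    turns_so_far: int                 # dialogue length: clarification turns already taken
    question_type: str                # one of QUESTION_TYPES
    question_was_appropriate: bool    # e.g. predicted from the user's answer
    detection_confidence: float       # confidence of the error detection component
    classification_confidence: float  # confidence of the error type classifier


def to_feature_vector(turn: ClarificationTurn) -> List[float]:
    """Encode one turn as [length, one-hot question type..., appropriate, conf, conf]."""
    if turn.question_type not in QUESTION_TYPES:
        raise ValueError(f"unknown question type: {turn.question_type}")
    one_hot = [1.0 if turn.question_type == t else 0.0 for t in QUESTION_TYPES]
    return ([float(turn.turns_so_far)] + one_hot +
            [1.0 if turn.question_was_appropriate else 0.0,
             turn.detection_confidence,
             turn.classification_confidence])


print(to_feature_vector(ClarificationTurn(2, "spelling", False, 0.4, 0.7)))
```

Each vector would be labeled with whether the user hit “translate” at that turn, giving training data for a continue-or-stop classifier.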
