Multimodal Dialog System
Intelligent Robot Lecture Note
Slide 3
Multimodal Dialog System
A system which supports human-computer interaction over multiple different input and/or output modes.
- Input: voice, pen, gesture, facial expression, etc.
- Output: voice, graphical output, etc.
Applications: GPS, information guide systems, smart home control, etc.
Slide 4
Motivations: Speech, the Ultimate Interface?
+ Natural interaction style (free speech), with a natural repair process for error recovery.
+ A richer channel: it conveys the speaker's disposition and emotional state (if systems knew how to deal with that).
- Inconsistent input (high error rates) that is hard to correct; e.g., we may get a different result each time we speak the same words.
- Slow (sequential) output style when using TTS (text-to-speech).
How do we overcome these weak points? A multimodal interface!
Slide 5
Advantages of a Multimodal Interface
- Task performance and user preference
- Migration of human-computer interaction away from the desktop
- Adaptation to the environment
- Error recovery and handling
- Special situations where mode choice helps
Slide 6
Task Performance and User Preference
Task performance and user preference for multimodal over speech-only interfaces [Oviatt et al., 1997]:
- 10% faster task completion
- 23% fewer words (shorter and simpler linguistic constructions)
- 36% fewer task errors
- 35% fewer spoken disfluencies
- 90-100% user preference for interacting this way
Example. Speech-only dialog system: "Bring the drink on the table to the side of the bed." Multimodal dialog system: "Bring this to here" plus a pen gesture, an easy, simplified user utterance!
Slide 7
Migration of Human-Computer Interaction Away from the Desktop
Small portable computing devices, such as PDAs, organizers, and smartphones:
- Limited screen real estate for graphical output
- Limited input: no keyboard or mouse (only arrow keys, a thumbwheel)
- Complex GUIs are not feasible
Augmenting the limited GUI with natural modalities such as speech and pen uses less space and allows rapid navigation over the menu hierarchy.
Other devices (kiosks, car navigation systems) have no mouse or keyboard: use speech + pen gesture.
Slide 8
Adaptation to the Environment
Multimodal interfaces enable rapid adaptation to changes in the environment by allowing the user to switch modes, e.g., on mobile devices that are used in multiple environments. Environmental conditions can be either physical or social.
Physical:
- Noise: increases in ambient noise can degrade speech performance; switch to GUI or stylus/pen input.
- Brightness: bright light in an outdoor environment can limit the usefulness of a graphical display.
Social:
- Speech may be easiest for a password, an account number, etc., but in public places users may be uncomfortable being overheard; switch to GUI or keypad input.
Slide 9
Error Recovery and Handling
Advantages for recovery and reduction of errors:
- Users intuitively pick the mode that is less error-prone.
- Language is often simplified.
- Users intuitively switch modes after an error, so the same problem is not repeated (multimodal error correction).
- Cross-mode compensation (complementarity): combining inputs from multiple modalities can reduce the overall error rate.
A multimodal interface therefore has the potential to keep overall error rates low.
Slide 10
Special Situations Where Mode Choice Helps
- Users with disabilities
- People with a strong accent or a cold
- People with RSI (repetitive strain injury)
- Young children or non-literate users
- Other users who have problems handling the standard devices, mouse and keyboard
A multimodal interface lets people choose their preferred interaction style depending on the actual task, the context, and their own preferences and abilities.
Slide 11
Multimodal Dialog System Architecture
Architecture of QuickSet: a multi-agent architecture.
(Figure: a Facilitator, responsible for routing, triggering, and dispatching, connects agents over the Inter-agent Communication Language (ICL, Horn clauses): speech/TTS, sketch/gesture, natural language, a map interface, multimodal integration, simulators, VR/AR interfaces (MAVEN, BARS), web services (XML, SOAP), databases, a CORBA bridge, COM objects, Java-enabled web pages, other user interfaces, and other facilitators.)
Multimodal Reference Resolution
We need to resolve references (what the user is referring to) across modalities: a user may refer to an item in a display by using speech, by pointing, or both. Closely related to multimodal integration.
Slide 14
Multimodal Reference Resolution
Finds the most proper referents for referring expressions [Chai et al., 2004].
- Referring expression: refers to a specific entity or entities; given by the user's input (most likely in the speech input).
- Referent: an entity to which the user refers. A referent can be an object that is not specified by the current utterance.
(Figure: speech aligned with gestures g1 and g2, each pointing to candidate objects.)
Slide 15
Multimodal Reference Resolution
Hard cases: multiple and complex gesture inputs, e.g., in an information guide system.
(Figure: two timelines on which gestures g1, g2, and g3 overlap the user's and system's utterances, with an example user/system dialogue.)
Slide 16
Multimodal Reference Resolution
Uses linguistic theories to guide the reference resolution process [Chai et al., 2005]: conversational implicature and the givenness hierarchy.
A greedy algorithm finds the best assignment for a referring expression given a cognitive status:
- Calculate the matching score between referring expressions and referent candidates (object selectivity, likelihood of status, and a compatibility measurement).
- Find the best assignments with the greedy algorithm.
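The greedy assignment step can be sketched as follows. The matching function `toy_score` and the candidate attributes are illustrative assumptions, not the scoring model of [Chai et al., 2005].

```python
def greedy_resolve(expressions, candidates, match_score):
    """Greedily assign each referring expression to the highest-scoring
    unused candidate referent."""
    assignments = {}
    used = set()
    for expr in expressions:
        best, best_score = None, float("-inf")
        for cand in candidates:
            if cand["id"] in used:
                continue
            score = match_score(expr, cand)
            if score > best_score:
                best, best_score = cand, score
        if best is not None:
            assignments[expr["id"]] = best["id"]
            used.add(best["id"])
    return assignments

def toy_score(expr, cand):
    """Toy matching score: type compatibility minus a small distance penalty."""
    type_match = 1.0 if expr["type"] == cand["type"] else 0.0
    return type_match - 0.1 * cand["distance"]

exprs = [{"id": "this", "type": "cup"}, {"id": "that", "type": "book"}]
cands = [{"id": "o1", "type": "cup", "distance": 1},
         {"id": "o2", "type": "book", "distance": 2}]
```

Each expression is resolved in turn; once a candidate is used, it is removed from consideration, which is what makes the algorithm greedy rather than globally optimal.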
Slide 17
Multimodal Integration
Combining information from multiple input modalities to understand the user's intention and attention. Multimodal reference resolution is a special case of multimodal integration: speech + pen gesture, where the pen gestures can express only deictic or grouping meanings.
(Diagram: the meaning from each modality feeds into multimodal integration/fusion, which produces a combined meaning.)
Slide 18
Multimodal Integration
Issues:
- The nature of the multimodal integration mechanism: algorithmic/procedural vs. declarative (parsers/grammars).
- Does the approach treat one mode as primary? Is gesture a secondary, dependent mode? (Cf. multimodal reference resolution.)
- How temporal and spatial constraints are expressed.
- A common meaning representation for speech and gesture.
Two main approaches:
- Unification-based multimodal parsing and understanding [Johnston, 1998]
- Finite-state transducers for multimodal parsing and understanding [Johnston et al., 2000]
Slide 19
Unification-Based Multimodal Parsing and Understanding
- Parallel recognizers and understanders produce time-stamped meaning fragments for each stream.
- Common framework for meaning representation: typed feature structures.
- Meaning fusion operation: unification. Unification determines the consistency of two pieces of partial information and, if they are consistent, combines them into a single result; here, it checks whether a given gestural input is compatible with a given piece of spoken input and, if so, combines them.
- Semantic and spatiotemporal constraints; statistical ranking.
- Flexible asynchronous architecture: must handle both unimodal and multimodal input.
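A minimal sketch of the unification operation, with plain nested dicts standing in for typed feature structures. Real systems such as QuickSet also enforce type hierarchies and spatiotemporal constraints; the example inputs are hypothetical.

```python
def unify(fs1, fs2):
    """Unify two feature structures represented as nested dicts.
    Returns the combined structure, or None if they are inconsistent."""
    result = dict(fs1)
    for key, val in fs2.items():
        if key not in result:
            result[key] = val
        elif isinstance(result[key], dict) and isinstance(val, dict):
            sub = unify(result[key], val)
            if sub is None:
                return None
            result[key] = sub
        elif result[key] != val:
            return None  # conflicting atomic values: unification fails
    return result

# Spoken "move this to here" leaves the object identity and location unfilled.
speech = {"cmd": "move", "object": {"type": "cup"}}
# A deictic gesture contributes the object's identity and a target location.
gesture = {"object": {"type": "cup", "id": "o3"}, "location": (4, 7)}
```

Compatible fragments merge into one meaning; a type clash (e.g., a gesture on a book when the speech demands a cup) makes unification fail, which is exactly the compatibility check described above.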
Slide 20
Unification-Based Multimodal Parsing and Understanding
Temporal constraints [Oviatt et al., 1997]: speech and gesture overlap, or the gesture precedes the speech by a bounded interval.
Handcrafted Finite-State Edit Machines
Edit-based multimodal understanding: "smart edit", a 4-edit machine plus heuristics and refinements.
- Deletion of SLM-only words (words not found in the grammar): "thai restaurant listings in midtown" -> "thai restaurant in midtown"
- Deletion of doubled words: "subway to to the cloisters" -> "subway to the cloisters"
- Subdivided cost classes (icost, dcost; 3 classes): high cost for slot fillers (e.g., "chinese", "cheap", "downtown"), low cost for dispensable words (e.g., "please", "would"), medium cost for all other words.
- Auto-completion of place names: the algorithm enumerates all possible shortenings of place names, e.g., "Metropolitan Museum of Art" -> "Metropolitan Museum".
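Two of the smart-edit heuristics, SLM-only word deletion and doubled-word deletion, can be sketched directly on token lists. This is a toy sketch; the real system implements these edits as a weighted finite-state machine with the cost classes described above.

```python
def smart_edit(tokens, grammar_vocab):
    """Apply two smart-edit heuristics: drop words the grammar does not
    know (SLM-only words) and collapse immediately doubled words."""
    out = []
    for tok in tokens:
        if tok not in grammar_vocab:
            continue            # delete SLM-only word
        if out and out[-1] == tok:
            continue            # delete doubled word
        out.append(tok)
    return out

# Hypothetical grammar vocabulary covering the slide's two examples.
vocab = {"thai", "restaurant", "in", "midtown",
         "subway", "to", "the", "cloisters"}
```

Applied to the slide's examples, "listings" is removed as an SLM-only word and the repeated "to" collapses to one occurrence.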
Slide 38
Learning Edit Patterns
The user's input is considered a noisy version of the parsable (clean) input:
- Noisy (S): "show cheap restaurants thai places in in chelsea"
- Clean (T): "show cheap thai places in chelsea"
The task is translating the user's input into a string that can be assigned a meaning representation by the grammar.
Slide 39
Learning Edit Patterns
Noisy-channel model for error correction. Let S_g be a string that can be assigned a meaning representation by the grammar and S_u the user's input utterance, with S_u = s_u,1 s_u,2 ... s_u,n and S_g = s_g,1 s_g,2 ... s_g,m. We seek the S_g that maximizes the translation probability P(S_g | S_u), which under a Markov (trigram) assumption decomposes over aligned word pairs (s_u,i, s_g,i). Word alignments are obtained with GIZA++.
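Under a one-to-one word alignment, the noisy-channel score can be sketched as a product of per-word channel probabilities and a trigram prior over grammar strings. The probability tables below are toy assumptions, not learned GIZA++ alignments.

```python
import math

def noisy_channel_score(s_u, s_g, channel, trigram):
    """Log-score of P(S_u | S_g) * P(S_g) with a word-aligned channel
    model and a trigram language model over grammar strings.
    Unseen events back off to a small floor probability."""
    score = 0.0
    # Channel: aligned word pairs (s_u_i, s_g_i).
    for u, g in zip(s_u, s_g):
        score += math.log(channel.get((u, g), 1e-6))
    # Prior: trigram model over the clean string, padded with <s>.
    padded = ["<s>", "<s>"] + s_g
    for i in range(2, len(padded)):
        score += math.log(trigram.get(tuple(padded[i - 2:i + 1]), 1e-6))
    return score

# Toy tables favoring the identity channel and one clean word order.
channel = {("show", "show"): 0.9, ("cheap", "cheap"): 0.9, ("thai", "thai"): 0.9}
trigram = {("<s>", "<s>", "show"): 0.5, ("<s>", "show", "cheap"): 0.5,
           ("show", "cheap", "thai"): 0.5}
```

With these tables, the clean string that matches the noisy input word-for-word outscores a permuted candidate, which is the selection criterion the model encodes.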
Slide 40
Learning Edit Patterns
Deriving a translation corpus: the finite-state transducer (multimodal grammar) can generate the input strings for a given meaning. To train the translation model, for each meaning string in the corpus, generate the strings the grammar accepts for that meaning and select the generated string closest to the user's string as the target string.
Slide 41
Experiments and Results
- 16 first-time users (8 male, 8 female).
- 833 user interactions (218 multimodal / 491 speech-only / 124 pen-only).
- Tasks: finding restaurants of various types and getting their names, phone numbers, and addresses; getting subway directions between locations.
- Avg. ASR sentence accuracy: 49%; avg. ASR word accuracy: 73.4%.
Slide 42
Experiments and Results
Improvements in concept accuracy (6-fold cross validation):

  Method                ConAcc   Rel. Impr.
  No edits              38.9%    0%
  Basic edit            51.5%    32%
  4-edit                53.0%    36%
  Smart edit            60.2%    55%
  Smart edit (lattice)  63.2%    62%
  MT edit               50.3%    29%

Result of 10-fold cross validation:

  Smart edit            67.4%
  MT edit               61.1%
Slide 43
A Salience-Driven Approach
Modify the language model score and rescore the recognized hypotheses using information from the gesture input: a primed language model in the decoding rule

  W* = argmax_W P(O|W) P(W)
Slide 44
A Salience-Driven Approach
People do not make unnecessary deictic gestures (cognitive theory of conversational implicature): speakers tend to make their contribution as informative as is required, and not more informative than is required.
Speech and gesture tend to complement each other: when a speech utterance is accompanied by a deictic gesture, the speech input issues commands or inquiries about properties of an object, while the deictic gesture indicates the objects of interest.
Gesture is an early indicator that anticipates the content of the subsequent spoken utterance: 85% of the time, gestures occurred before the corresponding speech unit.
Slide 45
A Salience-Driven Approach
A deictic gesture can activate several objects on the graphical display; it signals a distribution over the objects that are salient.
(Figure: for "Move this to here", the gesture precedes the speech on the timeline, and the salience weights over the graphical display peak at the salient object, a cup.)
Slide 46
A Salience-Driven Approach
The salient object (a cup) is mapped to the physical-world representation to indicate a salient part of that representation, such as relevant properties of, or tasks related to, the salient objects. This salient part of the physical world is likely to be the potential content of the speech.
(Figure: for "Move this to here", the gesture activating the cup precedes the speech on the timeline.)
Slide 47
A Salience-Driven Approach: Physical World Representation
Domain model: relevant knowledge about the domain, i.e., domain objects, properties of objects, relations between objects, and task models related to objects. Frame-based representation: a frame is a domain object; frame elements are the attributes and tasks related to that object.
Domain grammar: specifies the grammar and vocabulary used to process language inputs. Semantics-based context-free grammar: non-terminals are semantic tags; terminals are words (values of semantic tags). N-grams are estimated from annotated user spoken utterances with the relevant semantic information.
Slide 48
Salience Modeling
Calculating a salience distribution over entities in the physical world: the salience value of an entity at time t_n is influenced by the joint effect of the sequence of gestures that happen before t_n.
Slide 49
Salience Modeling
The salience of entity e_k at time t_n is the summation of P(e_k | g) over all gestures g before t_n, weighted by a time-decay factor so that closer gestures have a higher impact on the salience distribution, and divided by a normalizing factor: the summation of the salience values of all entities at time t_n.
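The salience computation above can be sketched as follows. The exponential decay is an assumed functional form; the slides only state that closer gestures have a higher impact.

```python
import math

def salience(entities, gestures, t_n, decay=1.0):
    """Salience distribution at time t_n: for each entity, sum
    P(entity | gesture) over gestures before t_n, weighted by an
    exponential time decay, then normalize over all entities."""
    raw = {e: 0.0 for e in entities}
    for g in gestures:
        if g["time"] >= t_n:
            continue  # only gestures that happen before t_n contribute
        weight = math.exp(-decay * (t_n - g["time"]))
        for e, p in g["p_entity"].items():
            raw[e] += weight * p
    z = sum(raw.values()) or 1.0  # normalizing factor
    return {e: v / z for e, v in raw.items()}

# Two hypothetical gestures, each activating the cup more than the book.
gestures = [{"time": 1.0, "p_entity": {"cup": 0.8, "book": 0.2}},
            {"time": 2.0, "p_entity": {"cup": 0.6, "book": 0.4}}]
```

Because the second gesture is closer to t_n, it carries the larger weight, and the normalization makes the result a proper distribution over entities.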
Slide 50
Salience-Driven Spoken Language Understanding
Maps the salience distribution to the physical-world representation and uses this salient world to influence spoken language understanding: it primes the language models to facilitate understanding, rescoring the speech recognizer's hypotheses with the primed language model score.
Slide 51
Primed Language Model
The primed language model is based on a class-based bigram model, where a class is a semantic or functional class for the domain (e.g., "this" -> Demonstrative, "price" -> AttrPrice). It modifies the word-class probability: originally this measures the probability of seeing a word w_i given a class c_i; in the primed model, the choice of word w_i depends on the salient physical world, represented by the salience distribution P(e). P(w_i, c_i | e_k) and P(c_i | e_k) do not depend on the time t_i and can be estimated from the training data. Speech hypotheses are reordered according to the primed language model (class transition probability times word-class probability).
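The rescoring idea can be sketched as a class-based bigram whose word-class term is averaged over the salience distribution. All probability tables and class names here are toy assumptions, not the estimates from the paper.

```python
def primed_lm_score(words, classes, trans, word_given_entity, salience):
    """Class-based bigram score where the word-class probability is
    primed by the salience distribution:
      P(w_i | c_i) is replaced by sum_k P(e_k) * P(w_i | c_i, e_k).
    Unseen events back off to a small floor probability."""
    score = 1.0
    prev = "<s>"
    for w, c in zip(words, classes):
        score *= trans.get((prev, c), 1e-6)   # class transition P(c_i | c_{i-1})
        score *= sum(salience[e] * word_given_entity.get((w, c, e), 1e-6)
                     for e in salience)       # salience-primed word probability
        prev = c
    return score

# Hypothetical domain classes and tables; the gesture has made "cup" salient.
tags = ["Verb", "Demonstrative", "Object"]
trans = {("<s>", "Verb"): 0.5, ("Verb", "Demonstrative"): 0.5,
         ("Demonstrative", "Object"): 0.5}
wge = {("move", "Verb", "cup"): 0.9, ("this", "Demonstrative", "cup"): 0.9,
       ("cup", "Object", "cup"): 0.9, ("cap", "Object", "cup"): 0.01}
```

When the salience distribution concentrates on the cup, a hypothesis mentioning "cup" outscores an acoustically similar one mentioning "cap", which is how the gesture information reorders the recognizer's hypotheses.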
Slide 52
Evaluation: WER
- Domain: real-estate properties; interface: speech + pen gesture.
- 11 users tested: five non-native speakers and six native speakers.
- 226 user inputs, with an average of 8 words per utterance.
- Average WER reduction is about 12% (t = 4.75, p