Multimodal Dialog System
Intelligent Robot Lecture Note
Slide 3
Multimodal Dialog System
A system which supports human-computer interaction over multiple different input and/or output modes.
- Input: voice, pen, gesture, facial expression, etc.
- Output: voice, graphical output, etc.
Applications: GPS, information guide systems, smart home control, etc.
Slide 4
Motivations: Speech, the Ultimate Interface?
+ Natural interaction style (free speech), with a natural repair process for error recovery.
+ A richer channel: it conveys the speaker's disposition and emotional state (if systems knew how to deal with that).
- Inconsistent input (high error rates) that is hard to correct; e.g., we may get a different result each time we speak the same words.
- Slow (sequential) output style when using TTS (text-to-speech).
How do we overcome these weak points? A multimodal interface!
Slide 5
Advantages of a Multimodal Interface
- Task performance and user preference
- Migration of human-computer interaction away from the desktop
- Adaptation to the environment
- Error recovery and handling
- Special situations where mode choice helps
Slide 6
Task Performance and User Preference
Task performance and user preference for multimodal over speech-only interfaces [Oviatt et al., 1997]:
- 10% faster task completion
- 23% fewer words (shorter and simpler linguistic constructions)
- 36% fewer task errors
- 35% fewer spoken disfluencies
- 90-100% user preference for interacting this way
Example. Speech-only dialog system: "Bring the drink on the table to the side of the bed." Multimodal dialog system: "Bring this to here" plus a pen gesture, an easy, simplified user utterance!
Slide 7
Migration of Human-Computer Interaction Away from the Desktop
Small portable computing devices, such as PDAs, organizers, and smartphones:
- Limited screen real estate for graphical output
- Limited input: no keyboard or mouse (only arrow keys, a thumbwheel)
- Complex GUIs are not feasible
Augmenting the limited GUI with natural modalities such as speech and pen uses less space and allows rapid navigation over the menu hierarchy.
Other devices (kiosks, car navigation systems) have no mouse or keyboard: use speech + pen gesture.
Slide 8
Adaptation to the Environment
Multimodal interfaces enable rapid adaptation to changes in the environment by allowing the user to switch modes, e.g., on mobile devices that are used in multiple environments. Environmental conditions can be either physical or social.
Physical:
- Noise: increases in ambient noise can degrade speech performance; switch to GUI or stylus/pen input.
- Brightness: bright light in an outdoor environment can limit the usefulness of a graphical display.
Social:
- Speech may be easiest for a password, an account number, etc., but in public places users may be uncomfortable being overheard; switch to GUI or keypad input.
Slide 9
Error Recovery and Handling
Advantages for recovery and reduction of errors:
- Users intuitively pick the mode that is less error-prone.
- Language is often simplified.
- Users intuitively switch modes after an error, so the same problem is not repeated (multimodal error correction).
- Cross-mode compensation (complementarity): combining inputs from multiple modalities can reduce the overall error rate.
A multimodal interface therefore has the potential to keep overall error rates low.
Slide 10
Special Situations Where Mode Choice Helps
- Users with disabilities
- People with a strong accent or a cold
- People with RSI (repetitive strain injury)
- Young children or non-literate users
- Other users who have problems handling the standard devices, mouse and keyboard
A multimodal interface lets people choose their preferred interaction style depending on the actual task, the context, and their own preferences and abilities.
Slide 11
Multimodal Dialog System Architecture
Architecture of QuickSet: a multi-agent architecture.
(Figure: a Facilitator, responsible for routing, triggering, and dispatching, connects agents over the Inter-agent Communication Language (ICL, Horn clauses): speech/TTS, sketch/gesture, natural language, a map interface, multimodal integration, simulators, VR/AR interfaces (MAVEN, BARS), web services (XML, SOAP), databases, a CORBA bridge, COM objects, Java-enabled web pages, other user interfaces, and other facilitators.)
Multimodal Reference Resolution
We need to resolve references (what the user is referring to) across modalities: a user may refer to an item in a display by using speech, by pointing, or both. Closely related to multimodal integration.
Slide 14
Multimodal Reference Resolution
Finds the most proper referents for referring expressions [Chai et al., 2004].
- Referring expression: refers to a specific entity or entities; given by the user's input (most likely in the speech input).
- Referent: an entity to which the user refers. A referent can be an object that is not specified by the current utterance.
(Figure: speech aligned with gestures g1 and g2, each pointing to candidate objects.)
Slide 15
Multimodal Reference Resolution
Hard cases: multiple and complex gesture inputs, e.g., in an information guide system.
(Figure: two timelines on which gestures g1, g2, and g3 overlap the user's and system's utterances, with an example user/system dialogue.)
Slide 16
Multimodal Reference Resolution
Uses linguistic theories to guide the reference resolution process [Chai et al., 2005]: conversational implicature and the givenness hierarchy.
A greedy algorithm finds the best assignment for a referring expression given a cognitive status:
- Calculate the matching score between referring expressions and referent candidates (object selectivity, likelihood of status, and a compatibility measurement).
- Find the best assignments with the greedy algorithm.
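The greedy assignment step can be sketched as follows. The matching function `toy_score` and the candidate attributes are illustrative assumptions, not the scoring model of [Chai et al., 2005].

```python
def greedy_resolve(expressions, candidates, match_score):
    """Greedily assign each referring expression to the highest-scoring
    unused candidate referent."""
    assignments = {}
    used = set()
    for expr in expressions:
        best, best_score = None, float("-inf")
        for cand in candidates:
            if cand["id"] in used:
                continue
            score = match_score(expr, cand)
            if score > best_score:
                best, best_score = cand, score
        if best is not None:
            assignments[expr["id"]] = best["id"]
            used.add(best["id"])
    return assignments

def toy_score(expr, cand):
    """Toy matching score: type compatibility minus a small distance penalty."""
    type_match = 1.0 if expr["type"] == cand["type"] else 0.0
    return type_match - 0.1 * cand["distance"]

exprs = [{"id": "this", "type": "cup"}, {"id": "that", "type": "book"}]
cands = [{"id": "o1", "type": "cup", "distance": 1},
         {"id": "o2", "type": "book", "distance": 2}]
```

Each expression is resolved in turn; once a candidate is used, it is removed from consideration, which is what makes the algorithm greedy rather than globally optimal.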
Slide 17
Multimodal Integration
Combining information from multiple input modalities to understand the user's intention and attention. Multimodal reference resolution is a special case of multimodal integration: speech + pen gesture, where the pen gestures can express only deictic or grouping meanings.
(Diagram: the meaning from each modality feeds into multimodal integration/fusion, which produces a combined meaning.)
Slide 18
Multimodal Integration
Issues:
- The nature of the multimodal integration mechanism: algorithmic/procedural vs. declarative (parsers/grammars).
- Does the approach treat one mode as primary? Is gesture a secondary, dependent mode? (Cf. multimodal reference resolution.)
- How temporal and spatial constraints are expressed.
- A common meaning representation for speech and gesture.
Two main approaches:
- Unification-based multimodal parsing and understanding [Johnston, 1998]
- Finite-state transducers for multimodal parsing and understanding [Johnston et al., 2000]
Slide 19
Unification-Based Multimodal Parsing and Understanding
- Parallel recognizers and understanders produce time-stamped meaning fragments for each stream.
- Common framework for meaning representation: typed feature structures.
- Meaning fusion operation: unification. Unification determines the consistency of two pieces of partial information and, if they are consistent, combines them into a single result; here, it checks whether a given gestural input is compatible with a given piece of spoken input and, if so, combines them.
- Semantic and spatiotemporal constraints; statistical ranking.
- Flexible asynchronous architecture: must handle both unimodal and multimodal input.
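A minimal sketch of the unification operation, with plain nested dicts standing in for typed feature structures. Real systems such as QuickSet also enforce type hierarchies and spatiotemporal constraints; the example inputs are hypothetical.

```python
def unify(fs1, fs2):
    """Unify two feature structures represented as nested dicts.
    Returns the combined structure, or None if they are inconsistent."""
    result = dict(fs1)
    for key, val in fs2.items():
        if key not in result:
            result[key] = val
        elif isinstance(result[key], dict) and isinstance(val, dict):
            sub = unify(result[key], val)
            if sub is None:
                return None
            result[key] = sub
        elif result[key] != val:
            return None  # conflicting atomic values: unification fails
    return result

# Spoken "move this to here" leaves the object identity and location unfilled.
speech = {"cmd": "move", "object": {"type": "cup"}}
# A deictic gesture contributes the object's identity and a target location.
gesture = {"object": {"type": "cup", "id": "o3"}, "location": (4, 7)}
```

Compatible fragments merge into one meaning; a type clash (e.g., a gesture on a book when the speech demands a cup) makes unification fail, which is exactly the compatibility check described above.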
Slide 20
Unification-Based Multimodal Parsing and Understanding
Temporal constraints [Oviatt et al., 1997]: speech and gesture overlap, or the gesture precedes the speech by a bounded interval.
Handcrafted Finite-State Edit Machines
Edit-based multimodal understanding: "smart edit", a 4-edit machine plus heuristics and refinements.
- Deletion of SLM-only words (words not found in the grammar): "thai restaurant listings in midtown" -> "thai restaurant in midtown"
- Deletion of doubled words: "subway to to the cloisters" -> "subway to the cloisters"
- Subdivided cost classes (icost, dcost; 3 classes): high cost for slot fillers (e.g., "chinese", "cheap", "downtown"), low cost for dispensable words (e.g., "please", "would"), medium cost for all other words.
- Auto-completion of place names: the algorithm enumerates all possible shortenings of place names, e.g., "Metropolitan Museum of Art" -> "Metropolitan Museum".
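Two of the smart-edit heuristics, SLM-only word deletion and doubled-word deletion, can be sketched directly on token lists. This is a toy sketch; the real system implements these edits as a weighted finite-state machine with the cost classes described above.

```python
def smart_edit(tokens, grammar_vocab):
    """Apply two smart-edit heuristics: drop words the grammar does not
    know (SLM-only words) and collapse immediately doubled words."""
    out = []
    for tok in tokens:
        if tok not in grammar_vocab:
            continue            # delete SLM-only word
        if out and out[-1] == tok:
            continue            # delete doubled word
        out.append(tok)
    return out

# Hypothetical grammar vocabulary covering the slide's two examples.
vocab = {"thai", "restaurant", "in", "midtown",
         "subway", "to", "the", "cloisters"}
```

Applied to the slide's examples, "listings" is removed as an SLM-only word and the repeated "to" collapses to one occurrence.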
Slide 38
Learning Edit Patterns
The user's input is considered a noisy version of the parsable (clean) input:
- Noisy (S): "show cheap restaurants thai places in in chelsea"
- Clean (T): "show cheap thai places in chelsea"
The task is translating the user's input into a string that can be assigned a meaning representation by the grammar.
Slide 39
Learning Edit Patterns
Noisy-channel model for error correction. Let S_g be a string that can be assigned a meaning representation by the grammar and S_u the user's input utterance, with S_u = s_u,1 s_u,2 ... s_u,n and S_g = s_g,1 s_g,2 ... s_g,m. We seek the S_g that maximizes the translation probability P(S_g | S_u), which under a Markov (trigram) assumption decomposes over aligned word pairs (s_u,i, s_g,i). Word alignments are obtained with GIZA++.
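Under a one-to-one word alignment, the noisy-channel score can be sketched as a product of per-word channel probabilities and a trigram prior over grammar strings. The probability tables below are toy assumptions, not learned GIZA++ alignments.

```python
import math

def noisy_channel_score(s_u, s_g, channel, trigram):
    """Log-score of P(S_u | S_g) * P(S_g) with a word-aligned channel
    model and a trigram language model over grammar strings.
    Unseen events back off to a small floor probability."""
    score = 0.0
    # Channel: aligned word pairs (s_u_i, s_g_i).
    for u, g in zip(s_u, s_g):
        score += math.log(channel.get((u, g), 1e-6))
    # Prior: trigram model over the clean string, padded with <s>.
    padded = ["<s>", "<s>"] + s_g
    for i in range(2, len(padded)):
        score += math.log(trigram.get(tuple(padded[i - 2:i + 1]), 1e-6))
    return score

# Toy tables favoring the identity channel and one clean word order.
channel = {("show", "show"): 0.9, ("cheap", "cheap"): 0.9, ("thai", "thai"): 0.9}
trigram = {("<s>", "<s>", "show"): 0.5, ("<s>", "show", "cheap"): 0.5,
           ("show", "cheap", "thai"): 0.5}
```

With these tables, the clean string that matches the noisy input word-for-word outscores a permuted candidate, which is the selection criterion the model encodes.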
Slide 40
Learning Edit Patterns
Deriving a translation corpus: the finite-state transducer (multimodal grammar) can generate the input strings for a given meaning. To train the translation model, for each meaning string in the corpus, generate the strings the grammar accepts for that meaning and select the generated string closest to the user's string as the target string.
Slide 41
Experiments and Results
- 16 first-time users (8 male, 8 female).
- 833 user interactions (218 multimodal / 491 speech-only / 124 pen-only).
- Tasks: finding restaurants of various types and getting their names, phone numbers, and addresses; getting subway directions between locations.
- Avg. ASR sentence accuracy: 49%; avg. ASR word accuracy: 73.4%.
Slide 42
Experiments and Results
Improvements in concept accuracy (6-fold cross validation):

  Method                ConAcc   Rel. Impr.
  No edits              38.9%    0%
  Basic edit            51.5%    32%
  4-edit                53.0%    36%
  Smart edit            60.2%    55%
  Smart edit (lattice)  63.2%    62%
  MT edit               50.3%    29%

Result of 10-fold cross validation:

  Smart edit            67.4%
  MT edit               61.1%
Slide 43
A Salience-Driven Approach
Modify the language model score and rescore the recognized hypotheses using information from the gesture input: a primed language model in the decoding rule

  W* = argmax_W P(O|W) P(W)
Slide 44
A Salience-Driven Approach
People do not make unnecessary deictic gestures (cognitive theory of conversational implicature): speakers tend to make their contribution as informative as is required, and not more informative than is required.
Speech and gesture tend to complement each other: when a speech utterance is accompanied by a deictic gesture, the speech input issues commands or inquiries about properties of an object, while the deictic gesture indicates the objects of interest.
Gesture is an early indicator that anticipates the content of the subsequent spoken utterance: 85% of the time, gestures occurred before the corresponding speech unit.
Slide 45
A Salience-Driven Approach
A deictic gesture can activate several objects on the graphical display; it signals a distribution over the objects that are salient.
(Figure: for "Move this to here", the gesture precedes the speech on the timeline, and the salience weights over the graphical display peak at the salient object, a cup.)
Slide 46
A Salience-Driven Approach
The salient object (a cup) is mapped to the physical-world representation to indicate a salient part of that representation, such as relevant properties of, or tasks related to, the salient objects. This salient part of the physical world is likely to be the potential content of the speech.
(Figure: for "Move this to here", the gesture activating the cup precedes the speech on the timeline.)
Slide 47
A Salience-Driven Approach: Physical World Representation
Domain model: relevant knowledge about the domain, i.e., domain objects, properties of objects, relations between objects, and task models related to objects. Frame-based representation: a frame is a domain object; frame elements are the attributes and tasks related to that object.
Domain grammar: specifies the grammar and vocabulary used to process language inputs. Semantics-based context-free grammar: non-terminals are semantic tags; terminals are words (values of semantic tags). N-grams are estimated from annotated user spoken utterances with the relevant semantic information.
Slide 48
Salience Modeling
Calculating a salience distribution over entities in the physical world: the salience value of an entity at time t_n is influenced by the joint effect of the sequence of gestures that happen before t_n.
Slide 49
Salience Modeling
The salience of entity e_k at time t_n is the summation of P(e_k | g) over all gestures g before t_n, weighted by a time-decay factor so that closer gestures have a higher impact on the salience distribution, and divided by a normalizing factor: the summation of the salience values of all entities at time t_n.
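The salience computation above can be sketched as follows. The exponential decay is an assumed functional form; the slides only state that closer gestures have a higher impact.

```python
import math

def salience(entities, gestures, t_n, decay=1.0):
    """Salience distribution at time t_n: for each entity, sum
    P(entity | gesture) over gestures before t_n, weighted by an
    exponential time decay, then normalize over all entities."""
    raw = {e: 0.0 for e in entities}
    for g in gestures:
        if g["time"] >= t_n:
            continue  # only gestures that happen before t_n contribute
        weight = math.exp(-decay * (t_n - g["time"]))
        for e, p in g["p_entity"].items():
            raw[e] += weight * p
    z = sum(raw.values()) or 1.0  # normalizing factor
    return {e: v / z for e, v in raw.items()}

# Two hypothetical gestures, each activating the cup more than the book.
gestures = [{"time": 1.0, "p_entity": {"cup": 0.8, "book": 0.2}},
            {"time": 2.0, "p_entity": {"cup": 0.6, "book": 0.4}}]
```

Because the second gesture is closer to t_n, it carries the larger weight, and the normalization makes the result a proper distribution over entities.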
Slide 50
Salience-Driven Spoken Language Understanding
Maps the salience distribution to the physical-world representation and uses this salient world to influence spoken language understanding: it primes the language models to facilitate understanding, rescoring the speech recognizer's hypotheses with the primed language model score.
Slide 51
Primed Language Model
The primed language model is based on a class-based bigram model, where a class is a semantic or functional class for the domain (e.g., "this" -> Demonstrative, "price" -> AttrPrice). It modifies the word-class probability: originally this measures the probability of seeing a word w_i given a class c_i; in the primed model, the choice of word w_i depends on the salient physical world, represented by the salience distribution P(e). P(w_i, c_i | e_k) and P(c_i | e_k) do not depend on the time t_i and can be estimated from the training data. Speech hypotheses are reordered according to the primed language model (class transition probability times word-class probability).
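The rescoring idea can be sketched as a class-based bigram whose word-class term is averaged over the salience distribution. All probability tables and class names here are toy assumptions, not the estimates from the paper.

```python
def primed_lm_score(words, classes, trans, word_given_entity, salience):
    """Class-based bigram score where the word-class probability is
    primed by the salience distribution:
      P(w_i | c_i) is replaced by sum_k P(e_k) * P(w_i | c_i, e_k).
    Unseen events back off to a small floor probability."""
    score = 1.0
    prev = "<s>"
    for w, c in zip(words, classes):
        score *= trans.get((prev, c), 1e-6)   # class transition P(c_i | c_{i-1})
        score *= sum(salience[e] * word_given_entity.get((w, c, e), 1e-6)
                     for e in salience)       # salience-primed word probability
        prev = c
    return score

# Hypothetical domain classes and tables; the gesture has made "cup" salient.
tags = ["Verb", "Demonstrative", "Object"]
trans = {("<s>", "Verb"): 0.5, ("Verb", "Demonstrative"): 0.5,
         ("Demonstrative", "Object"): 0.5}
wge = {("move", "Verb", "cup"): 0.9, ("this", "Demonstrative", "cup"): 0.9,
       ("cup", "Object", "cup"): 0.9, ("cap", "Object", "cup"): 0.01}
```

When the salience distribution concentrates on the cup, a hypothesis mentioning "cup" outscores an acoustically similar one mentioning "cap", which is how the gesture information reorders the recognizer's hypotheses.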
Slide 52
Evaluation: WER
- Domain: real-estate properties; interface: speech + pen gesture.
- 11 users tested: five non-native speakers and six native speakers.
- 226 user inputs, with an average of 8 words per utterance.
- Average WER reduction is about 12% (t = 4.75, p