USI module U1-5: Multimodal interaction
Jacques Terken
USI module U1, lecture 5
SAI User-System Interaction U1, Speech in the interface: 5. Multimodal Interfaces
Contents
• Demos and video clips
• Multimodal behaviour
• Multimodal interaction, architecture and multimodal fusion
• Design heuristics, guidelines and tools
• http://www.nuance.com/xmode/demo/#
• http://www.csee.ogi.edu/CHCC/ (Video Quickset)
• RASA (combination of tangible and multimodal interaction)

• May also be of interest:
– http://www.gvu.gatech.edu/gvu/events/demo-days/2001/demos010930.html
– http://ligwww.epfl.ch/~thalmann/research.html
QuickSet on iPAQ (OGI – CHCC)
Multimodal behaviour
• The development of multimodal systems depends on knowledge about the natural integration patterns that are characteristic for the combined use of different modalities
• Dealing with myths about multimodal interaction:
– Oviatt, S.L., “Ten Myths of Multimodal Interaction”, Communications of the ACM 42(11), 1999, pp. 74-81
Myth 1: If you build a multimodal system, users will interact multimodally.
Dependent on domain:
• Spatial domain: 95-100% of users have a preference for multimodal interaction
• Other domains: 20% of commands are multimodal
Dependent on type of action:
• High MM: adding, moving, modifying objects, calculating distances between objects
• Low MM: printing, scrolling, etc.
Multi-Modal Interaction (0H640)
• Distinction between general, selective and spatial actions
• General: non-object-directed actions (printing, etc.)
• Selective: choosing objects
• Spatial: manipulation of objects (adding, etc.)
Myth 2: Speech & pointing is the dominant multimodal integration pattern.
• Central in Bolt’s speak-and-point interface (“Put That There”)
• Speak-and-point covers only 14% of spontaneous multimodal actions
• In human communication, pointing accounts for approx. 20% of all gestures
• Other actions: handwriting, hand gestures, facial expressions (“rich” interaction)
Myth 3: Multimodal input involves simultaneous signals.
• Information from different modalities is often sequential
• Often gestures precede speech
Myth 4: Speech is the primary input mode in any multimodal system that includes it; gestures, head and body movement, gaze direction and other input are secondary.
• Often speech cannot carry all the information (cf. the combination of pen + speech)
• Gestures are better for some kinds of information
• Gestures often indicate the context for speech
Myth 5: Multimodal language does not differ linguistically from unimodal language.
• Users often avoid complicated commands in multimodal interaction
• Multimodal language is often shorter, syntactically simpler, and more fluent
– Unimodal: “place a boat dock on the east, no, west end of Reward Lake”
– Multimodal: [draws rectangle] “add rectangle”
• Multimodal language is easier to process
– Less anaphora and indirectness
Myth 6: Multimodal integration involves redundancy of content between modes.
• Different modalities contribute complementary information:
– Speech: subject, object, verb (objects, actions/operations)
– Gesture: location (spatial information)
• Even in the case of corrections, only 1% redundancy
Myth 7: Individual error-prone recognition technologies combine multimodally to produce even greater unreliability.
• Combining inputs enables mutual disambiguation
• Users choose the least error-prone modality (“leveraging from users’ natural intelligence about when and how to deploy input modes effectively”)
• Combining error-prone modalities in fact yields a more stable system
Myth 8: All users’ multimodal commands are integrated in a uniform way.
• Integration patterns differ between people
• Usage is consistent within a person
• Detecting a user’s integration pattern in advance can result in better recognition
Myth 9: Different input modes are capable of transmitting comparable content (the “alt-mode” hypothesis).
• Modalities differ in:
– Type of information
– Functionality during communication
– Accuracy of expression
– Manner of integration with other modalities
Myth 10: Enhanced speed and efficiency are the main advantages of multimodal systems.
Applies indeed (to a limited extent) to the spatial domain:
• In multimodal pen/speech interaction, speed increases by approx. 10%
More important advantages in other domains:
• Errors and non-fluent speech decrease by 35-50%
• Possibility of choice of input mode:
– Less chance of fatigue per modality
– Better opportunities for repair
– Larger range of users
Advantages: Robustness
• Individual signal-processing technologies are error-prone
• Integration of complementary modalities yields synergy, capitalizing on the strengths of each modality and overcoming weaknesses in the other:
– Users will select the input mode that they consider less error-prone for particular lexical content
– Users’ language is simplified when interacting multimodally
– Users tend to switch modes after system errors, facilitating error recovery
– Users report less frustration when interacting multimodally (greater sense of control)
– Mutual compensation/disambiguation
Technologies: Types of multimodality
W3C (see http://www.w3.org/TR/mmi-reqs/ ), seen from the perspective of the system (how the input is handled):
• Sequential multimodal input
Modality A for action a, then Modality B for action b; each event is handled as a separate event
• Simultaneous (uncoordinated) multimodal input
Each event is handled as a separate event; choice between different modalities at each moment in time
• Composite (coordinated simultaneous) multimodal input
Events are integrated into a single event before interpretation (“true” multimodality)
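The three W3C categories boil down to two yes/no questions about the input stream. The sketch below makes that explicit; the `classify` helper and its boolean parameters are illustrative assumptions, not part of the W3C requirements document.

```python
from enum import Enum, auto


class InputType(Enum):
    """The three ways a system can handle multimodal input,
    following the W3C MMI requirements terminology."""
    SEQUENTIAL = auto()    # one modality at a time, separate events
    SIMULTANEOUS = auto()  # overlapping modalities, still separate events
    COMPOSITE = auto()     # overlapping modalities fused into one event


def classify(events_overlap: bool, fused_before_interpretation: bool) -> InputType:
    # Hypothetical helper: the two questions behind the W3C categories.
    if not events_overlap:
        return InputType.SEQUENTIAL
    if fused_before_interpretation:
        return InputType.COMPOSITE
    return InputType.SIMULTANEOUS
```

Only the composite case requires a fusion component; the other two can be handled by per-modality event handlers.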
The Coutaz & Nigay taxonomy, with the corresponding W3C terms:

                                        Sequential                     Simultaneous
Non-coordinated (W3C: supplementary)    Exclusive (W3C: sequential)    Concurrent (W3C: simultaneous)
Coordinated (W3C: complementary)        Alternate                      Synergistic (W3C: composite)
Mutual disambiguation (MD)
• Speech input: n-best list
1. Ditch
2. Ditches
• Gestural input
• Joint interpretation:
1. Ditches
• The benefit may be dependent on the situation (e.g. larger for non-native speakers)
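The ditch/ditches example can be sketched as a toy joint ranking over the two n-best lists. Everything here is invented for illustration (the probabilities, the gesture hypotheses, and the number-agreement test); the point is only that an incompatible top speech hypothesis gets discarded, so the second-best one wins.

```python
# Hypothetical n-best lists: (hypothesis, posterior probability) pairs.
speech_nbest = [("ditch", 0.6), ("ditches", 0.4)]
gesture_nbest = [("two areas", 0.7), ("one area", 0.3)]


def compatible(word: str, gesture: str) -> bool:
    # Toy compatibility constraint: number agreement between the spoken
    # noun and the number of areas gestured.
    plural = word.endswith("s")
    return plural == (gesture == "two areas")


def disambiguate(speech, gestures):
    """Rank joint hypotheses; incompatible pairs are discarded, so a
    lower-ranked speech hypothesis can win (mutual disambiguation)."""
    joint = [(w, g, pw * pg)
             for w, pw in speech
             for g, pg in gestures
             if compatible(w, g)]
    return max(joint, key=lambda t: t[2])
```

With these lists, "ditch" + "two areas" is ruled out by the constraint, and the joint winner is "ditches" even though it was only second on the speech n-best list.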
Early fusion
• For closely coupled and synchronized modalities such as speech and lip movements
• “Feature-level” fusion
• Based on multiple hidden Markov models or temporal neural networks; the correlation structure between modes can be taken into account automatically via learning
• Problems: modelling complexity, computational intensity, training difficulty
Late fusion
• “Semantic-level” fusion
• Individual recognizers
• Sequential integration
• Advantage: scalable – individual recognizers don’t need to be retrained
• Early approaches: a multimodal command’s posterior probability is the cross-product of the posterior probabilities of the associated constituents → no advantage taken of the mutual-compensation phenomenon
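The cross-product approach in the last bullet can be sketched in a few lines. The command and gesture hypotheses and their posteriors are made up for illustration; note that, unlike mutual disambiguation, nothing here lets one modality veto an incompatible hypothesis from the other.

```python
from itertools import product

# Hypothetical constituent n-best lists with posterior probabilities.
speech = {"add line": 0.7, "add mine": 0.3}
gesture = {"point at (3, 4)": 0.8, "area around (3, 4)": 0.2}

# Early late-fusion approach: the posterior of a multimodal command is
# simply the product of its constituents' posteriors.
commands = {(s, g): ps * pg
            for (s, ps), (g, pg) in product(speech.items(), gesture.items())}
best = max(commands, key=commands.get)
```

The best joint command is always built from the individually best constituents, which is exactly why this scheme cannot exploit mutual compensation.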
Architectural requirements for late semantic fusion
• Fine-grained timestamping
• Handling of both sequentially-integrated and simultaneously-delivered input
• Common representational format for different modalities
• Frame-based representation (multimodal fusion through unification of feature structures) → mutual disambiguation
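The first three requirements can be sketched as a common, timestamped event format plus a temporal test for deciding which events are candidates for fusion. The event format, field names, and the fixed 4-second window are illustrative assumptions; real systems derive the integration window from observed user behaviour.

```python
from dataclasses import dataclass


@dataclass
class InputEvent:
    # Common representational format across modalities, carrying the
    # fine-grained timestamps that late semantic fusion needs.
    modality: str
    content: str
    t_start: float  # seconds
    t_end: float


def candidates_for_fusion(a: InputEvent, b: InputEvent, window: float = 4.0) -> bool:
    """Hypothetical temporal test: treat two events as fusion candidates
    if they overlap, or follow each other within `window` seconds.
    Sequential delivery must be allowed: gestures often precede speech."""
    gap = max(a.t_start, b.t_start) - min(a.t_end, b.t_end)
    return gap <= window
```

An overlapping pen stroke and utterance, or a stroke shortly before an utterance, both pass the test; events far apart in time are handled separately.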
Unification
[figure: feature structures for an utterance and a gesture, unified into a single joint interpretation]
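Unification of feature structures can be sketched with flat dictionaries. The `unify` helper and the feature names are illustrative simplifications (real systems use typed, nested feature structures): unification merges compatible structures and fails on a feature clash, which is what gives fusion its disambiguating power.

```python
def unify(f1: dict, f2: dict):
    """Unify two feature structures (flat dicts for simplicity):
    merge them, failing if any shared feature has conflicting values."""
    result = dict(f1)
    for key, value in f2.items():
        if key in result and result[key] != value:
            return None  # unification fails on a feature clash
        result[key] = value
    return result


# Hypothetical example: speech supplies the action and object type,
# the pen gesture supplies the location; unification combines them.
utterance = {"action": "add", "object": "rectangle"}
gesture = {"object": "rectangle", "location": (120, 45)}
joint = unify(utterance, gesture)
```

If the gesture had instead been interpreted as selecting a line, the shared `object` feature would clash and unification would reject that joint reading.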
Design of multimodal interfaces
1. Task analysis
What are the actions that need to be performed?
2. Task allocation
Which party is the most suitable candidate for performing particular actions?
3. Modality allocation
Which modality or combination of modalities is best suited to particular actions?
The current presentation focuses on 3.
Definition of ‘modality’
• Modality as sensory channel
However, stating that particular numeric information should be presented in the visual modality provides little grip
• Hence, the notion of ‘representational modality’ has been proposed (Bernsen), which distinguishes e.g. table and graph as two different modalities
• For the time being, we use ‘modality’ in the more restricted sense of sensory channel, and look for mappings between actions and modalities
Relevant dimensions
• Nature of the information
• Interaction paradigm
• Physical and dialogue context
• Platform
• Accessibility
• Multitasking
Rules of thumb, heuristics
• Michaelis and Wiggins (1982)
• Cohen and Oviatt (1994)
• Suhm (2000)
• Larsson (2003)
• Reeves, Lai et al. (2004)

• For references see Terken, J., “Guidelines and Tools for the Design of Multimodal Interfaces”, Workshop ASIDE 2005, Aalborg (DK)
Michaelis and Wiggins (1982)
• Speech generation is preferable when the
– message is short;
– message will not be referred to later;
– message deals with events in time;
– message requires an immediate response;
– visual channels of communication are overloaded;
– environment is too brightly lit, too poorly lit, subject to severe vibration, or otherwise unsuitable for transmission of visual information;
– user must be free to move around;
– user is subjected to high G forces or anoxia.
• Tentative guidelines for when NOT to use speech may be derived from these suggestions through negation.
Cohen and Oviatt (1994)
• Spoken communication with machines (both input and output) may be advantageous:
– when the user’s hands or eyes are busy
– when only a limited keyboard and/or screen is available
– when the user is disabled
– when pronunciation is the subject matter of computer use
– when natural language interaction is preferred
Suhm (2000)
Principles for choosing the set of modalities:
1. Consider speech input for entry of textual data, dialogue-oriented tasks, and command & control. Speech input is generally less efficient for navigation, manipulation of image data, and resolution of object references.
2. Consider written input for corrections, entry of digits, and entry of graphical data (formulas, sketches, etc.)
3. Consider gesture input for indicating the scope or type of commands and for resolving deictic object references
4. Consider the traditional modalities (keyboard and mouse input) as the alternative, unless the superiority of novel modalities (speech, pen input) is proven.
Further principle groups:
• Principles to circumvent limitations of recognition technology
• Principles for the implementation of pen-speech interfaces
Larsson (2003)
• Satisfy Real-world Constraints
– Task-oriented Guidelines
– Physical Guidelines
– Environmental Guidelines
• Communicate Clearly, Concisely, and Consistently with Users
– Consistency Guidelines
– Organizational Guidelines
• Help Users Recover Quickly and Efficiently from Errors
– Conversational Guidelines
– Reliability Guidelines
• Make Users Comfortable
– System Status
– Human-memory Constraints
– Social Guidelines
– …
Reeves, Lai et al. (2004)
Propose a set of multimodal design principles that are founded in perception and cognition science (but the motivation remains implicit)
Five general areas:
• Designing multimodal input and output
• Adaptivity
• Consistency
• Feedback
• Error prevention/handling
Designing Multimodal Input and Output
• Maximize human cognitive and physical abilities. Designers need to determine how to support intuitive, streamlined interactions based on users’ human information-processing abilities (including attention, working memory, and decision making). For example:
– Avoid unnecessarily presenting information in two different modalities in cases where the user must simultaneously attend to both sources to comprehend the material being presented; such redundancy can increase cognitive load at the cost of learning the material.
– Maximize the advantages of each modality to reduce the user’s memory load in certain tasks and situations:
– System visual presentation coupled with user manual input for spatial information and parallel processing;
– System auditory presentation coupled with user speech input for state information, serial processing, attention alerting, or issuing commands.
• Integrate modalities in a manner compatible with user preferences, context, and system functionality. Additional modalities should be added to the system only if they improve satisfaction, efficiency, or other aspects of performance for a given user and context. When using multiple modalities:
– Match output to acceptable user input style (for example, if the user is constrained by a set grammar, do not design a virtual agent to use unconstrained natural language);
– Use multimodal cues to improve collaborative speech (for example, a virtual agent’s gaze direction or gesture can guide user turn-taking);
– Ensure system output modalities are well synchronized temporally (for example, map-based display and spoken directions, or virtual display and non-speech audio);
– Ensure that the current system interaction state is shared across modalities and that appropriate information is displayed in order to support:
• users in choosing alternative interaction modalities;
• multidevice and distributed interaction.
3. Theoretical approaches
• Modality theory (Bernsen et al.)
‘Modality’ defined as ‘representational modality’
Modality theory (Bernsen)
Aim
• Given any particular class of task-domain information which needs to be exchanged between user and system during task performance, identify the set of input/output modalities which constitute an optimal solution to the representation and exchange of that information (Bernsen, 2001).
• Taxonomic analyses:
– (Representational) input and output modalities are characterized in terms of a limited number of basic features, such as:
– linguistic/non-linguistic,
– analogue/non-analogue,
– arbitrary/non-arbitrary,
– static/dynamic.
• Modality properties can then be applied according to the following procedure:
1. Requirements Specification >
2. Modality Properties + Natural Intelligence >
3. Advice/Insight with respect to modality choice.
• [MP1] Linguistic input/output modalities have interpretational scope, which makes them eminently suited for conveying abstract information. They are therefore unsuited for conveying high-specificity information including detailed information on spatial manipulation and location.
• [MP2] Linguistic input/output modalities, being unsuited for specifying detailed information on spatial manipulation, lack an adequate vocabulary for describing the manipulations.
• [MP3] Arbitrary input/output modalities impose a learning overhead which increases with the number of arbitrary items to be learned.
• [MP4] Acoustic input/output modalities are omnidirectional.
• [MP5] Acoustic input/output modalities do not require limb (including haptic) or visual activity.
4. Tools
• SMALTO (Bernsen)
• Multimodal property flowchart (Williams et al., 2002)
SMALTO
• Addresses the “Speech Functionality Problem”
• SMALTO was created by taking a large number of claims and findings from the literature on designing speech or speech-centric interfaces and casting these claims into a structured representation expressing the Speech Functionality Problem
• [Combined speech input/output, speech output, or speech input modalities M1, M2 and/or M3 etc.] or [speech modality M1, M2 and/or M3 etc. in combination with non-speech modalities NSM1, NSM2 and/or NSM3 etc.]
• are [useful or not useful]
• for [generic task: GT]
• and/or [speech act type: SA]
• and/or [user group: UG]
• and/or [interaction mode: IM]
• and/or [work environment: WE]
• and/or [generic system: GS]
• and/or [performance parameter: PP]
• and/or [learning parameter: LP]
• and/or [cognitive property: CP]
• and/or [preferable or non-preferable] to [alternative modalities AM1, AM2 and/or AM3 etc.]
• and/or [useful on conditions C1, C2 and/or C3 etc.]
• SMALTO has been evaluated within the framework of projects involving its creators, and in the DISC project
• Informal evidence indicates that it is difficult for “linguistically naïve” designers to apply, because of the way the modality properties are formulated
• This was also the motivation for the Modality Property Flowchart (Williams et al., 2002)
Multimodal property flowchart
• Multimodal interfaces are a particular type of interface → the multimodal property flowchart needs to be combined with general usability heuristics for interface design (e.g. Nielsen)
Main points
• Multimodal interfaces match the natural expressivity of human beings
• Taxonomy of multimodal interaction
• Limitations of signal processing in one modality can be overcome by taking into consideration input from another modality (mutual disambiguation)
• Mapping of functionalities onto modalities is not always straightforward → support from guidelines and tools