
J Multimodal User Interfaces (2012) 6:73–85
DOI 10.1007/s12193-011-0088-y

ORIGINAL PAPER

Measuring the Quality of Service and Quality of Experience of multimodal human–machine interaction

Ina Wechsung · Klaus-Peter Engelbrecht · Christine Kühnel · Sebastian Möller · Benjamin Weiss

Received: 29 April 2011 / Accepted: 27 December 2011 / Published online: 15 February 2012
© OpenInterface Association 2012

Abstract Quality of Service (QoS) and Quality of Experience (QoE) have to be considered when designing, building and maintaining services involving multimodal human–machine interaction. In order to guide the assessment and evaluation of such services, we first develop a taxonomy of the most relevant QoS and QoE aspects which result from multimodal human–machine interactions. It consists of three layers: (1) the quality factors influencing QoS and QoE related to the user, the system, and the context of use; (2) the QoS interaction performance aspects describing user and system behavior and performance; and (3) the QoE aspects related to the quality perception and judgment processes taking place within the user. For each of these layers, we then provide metrics which are able to capture the QoS and QoE aspects in a quantitative way, either via questionnaires or performance measures. The metrics are meant to guide system evaluation and make it more systematic and comparable.

Keywords Quality assessment · Multimodal interfaces · Usability

I. Wechsung · K.-P. Engelbrecht · C. Kühnel · S. Möller · B. Weiss (✉)
Quality and Usability Lab, Deutsche Telekom Laboratories, TU Berlin, Berlin, Germany
e-mail: [email protected]

I. Wechsung
e-mail: [email protected]

K.-P. Engelbrecht
e-mail: [email protected]

C. Kühnel
e-mail: [email protected]

S. Möller
e-mail: [email protected]

1 Introduction

Whereas the quality of unimodal interfaces and multimedial interactive systems has been addressed for several decades, the quality of multimodal human–machine interaction is a relatively new topic. The reason is that multimodal dialogue systems have only recently reached a level of maturity which allows for widespread application. Examples include information kiosks at airports or train stations, navigation systems, media guides, entertainment and education systems, or smart home environments [1–5].

In the frame of this paper, we define multimodal dialog systems as computer systems with which human users interact on a turn-by-turn basis, using several different modalities for information input and/or receiving information from the system in different modalities. By interaction modality, we mean the sensory channel used by a communicating agent to convey information to a communication partner, e.g. spoken language, intonation, gaze, hand gestures, body gestures, or facial expressions [1]. These channels may be used sequentially or in parallel, and they may provide complementary or redundant information to the user [7]. It is commonly expected that the use of multiple modalities makes better use of the human cognitive resources, and will thus result in a lower cognitive load on the user during the interaction [8]. In addition, the use of different modalities may provide better recognition and interpretation performance on the system side, in particular in adverse environments where ambient noise and illumination degrade the information input performance. Information provided by the system can better be tailored to the user and the situational context if several output channels are available. Thus, multimodal systems have some principal advantages over comparable unimodal systems.

In order to reach high quality and usability, each system ideally passes several assessment and evaluation cycles during its development: Individual components (such as a speech or gesture recognizer) are assessed as to whether they provide sufficient performance; design concepts are evaluated with respect to their functional requirements as well as with respect to the expected user experience; initial prototypes are tested in terms of mock-ups or through Wizard-of-Oz simulations; preliminary system versions are evaluated in tests with “friendly” users; and rolled-out systems are evaluated with their first customers. For IP-based services, network performance is determined, verified and controlled with regard to the provision and processing of information for the multimodal interface. Such huge efforts should lead to systems and services which provide a high quality to their users; however, the high percentage of unsuccessful innovations in this area shows that the quality of multimodal dialogue systems is still limited, and users might revert to well-established unimodal systems instead. The fact that a substantial number of commercial systems reach their customers without a thorough evaluation is in our experience only partially due to the lack of time and financial effort spent on such evaluations. Rather, we think that a significant part of the problem is due to insufficient evaluation techniques.

Despite several efforts made in the past, either for multimodal systems [6, 9] or for the general system development process [10], most evaluations are still individual undertakings: Test protocols and metrics are developed on the spot, with a particular system and user group in focus, and with limited budget and time. As a result, we see a multitude of highly interesting, but virtually incomparable, evaluation exercises, which address different aspects of quality, and which rely on different evaluation criteria. Already in 1998, Gray and Salzman [11] reviewed papers investigating usability evaluation methods and concluded that most of these studies are misleading. As a result, they recommended following a strict experimental approach. Although the paper was widely discussed in the community [12], efforts to standardize evaluation methods were apparently only seldom made, as shown in a meta-analysis reviewing 180 studies [13]. In this meta-analysis, the author criticizes the diversity in measuring user satisfaction; in particular, he states that not employing standardized questionnaires leads to severe difficulties in comparing different studies [13]. The authors of [14] also discuss the large variety of usability evaluation methods and the resulting lack of understanding of each approach.

An evaluation criterion commonly used by system designers is performance: the degree to which the system provides the function it has been built for. A collection of such performance criteria can result in Quality of Service (QoS), i.e. “the collective effect of service performance which determines the degree of satisfaction of the user” [15]. QoS can be viewed both in terms of the prerequisites, i.e. the influencing factors of the system, and in terms of the resulting performance metrics.

Obviously, a certain level of system performance is necessary to fulfill the user’s needs, as stated in the above definition. However, QoS does not determine user satisfaction in terms of a strict cause-and-effect relationship. On the contrary, user perception and satisfaction come in a multitude of different aspects, each of which may (or may not) be influenced by the performance of individual service components. An established way to partly deal with this is to find relationships between single system factors and user perception by applying standardized tests; this is named “User-perceived QoS” or “Quality of Perception” by ETSI [16] and is limited to user perception [17].

In telecommunications, the term Quality of Experience (QoE) has recently been used to describe all such aspects which finally result in the acceptability of the service [18]. In other areas (such as speech or sound quality engineering), the term “quality” is used instead, being defined as the “result of appraisal of the perceived composition of a unit with respect to its desired composition” [19]. Quality is thus far more than performance: it is the degree of fulfillment of the user’s expectations and needs. Note that, following this definition, the measurement of quality requires a perception and a judgment process to take place inside the user. Thus, measurement of QoE usually relies on interaction experiments with real or test users and subsequent analysis of both questionnaire data and performance measures. In contrast to this, QoS can be quantified by a person external to the interaction process, e.g. by the system developer, relying solely on performance measures.

For interactive systems based on the speech modality alone (so-called spoken dialogue systems), efforts have been made to come up with a standard set of QoS and QoE metrics. These include interaction parameters describing the performance of system components and the behavior of user and system during the interaction [20, 21], as well as questionnaires for collecting quality judgments [20, 22, 23]. This is a distinguishing feature compared to other taxonomies defined for the general case of human–computer interaction, which typically define components of usability on a rather abstract level, i.e. without defining instrumentation (e.g., [24, 25]). The taxonomy presented here follows the former approach, that is, to systematise concepts and parameters which can be measured in an interaction, focusing on the special case of multimodal systems. For multimodal dialogue systems, there is only one proposal for standard metrics [24, 26]. One reason for this is a lack of understanding of the relevant QoS and QoE aspects of such systems.

Our final aim is to bundle evaluation efforts so that the insights gained in individual campaigns can be applied to other systems in the future. A prerequisite to this is to agree on which aspects to evaluate, and how to evaluate. As a first step towards this aim, we propose a taxonomy of QoS aspects (system factors and performance aspects of the interaction) and QoE aspects (related to human perception and judgment of the interaction). This taxonomy is described in Sect. 2. In the subsequent Sects. 3–5, we provide definitions and examples for the individual items of the taxonomy. As far as they are available, we provide metrics for quantifying them in the evaluation. Finally, Sect. 6 defines the next steps which are necessary to take full benefit of the approach.

Fig. 1 Taxonomy of QoS and QoE aspects of multimodal human–machine interaction, adapted from [30]

2 Taxonomy of QoS and QoE aspects

Our taxonomy is based on a similar one developed for spoken dialogue systems in [27], but considers a broader range of input and output modalities (such as touch input, gesture recognition, audio-visual speech output, etc.) and more quality aspects. It is intended to serve as a framework for evaluation, and not as a precise protocol: Whereas standardized protocols such as the ones followed by DARPA or Blizzard [28, 29] are very helpful for the development of core technologies, they provide little insight into the appropriateness of this technology for a system to be developed. Via a framework, in turn, developers are able to tailor evaluation to their individual needs.

The use of the taxonomy is threefold: System developers may search for the QoS and QoE aspect they are interested in and find the appropriate evaluation metrics. It could also serve as a basis for a systematic collection of evaluation data. And finally, if no evaluation metric is available for a given quality aspect, necessary methodological research can be identified.

The taxonomy consists of three layers, two of them addressing QoS and one addressing QoE, see Fig. 1:

• Quality factors influencing QoS and QoE;
• QoS interaction performance aspects describing user and system performance and behavior;
• QoE aspects related to the quality perception and judgment processes. As a result of the interaction described on the second layer, quality “happens” as a multidimensional perceptual event in a particular context of use.

The layers are necessary to make a clear distinction between QoS and performance aspects (which can be determined by an external observer) on the one hand, and QoE and quality aspects (which have to be acquired from the user) on the other. By using three instead of two (QoS and QoE) layers, we can better differentiate between the influencing factors (which are defined by the service and its users) and the resulting interaction performance aspects (which can be measured during the interaction).

It is obvious from Fig. 1 that there are relationships between quality factors, QoS interaction performance aspects, and QoE aspects. These relationships are mostly not one-to-one and can vary in their strength depending on the system, user and context. Inside each layer, however, relations between aspects are better defined and are therefore indicated as far as possible.


As a result of the relationships, there are also dependencies between the associated metrics. Sometimes, metrics may point in contradictory directions, and the system developer has to carefully review the metrics in order not to take wrong design decisions. Metrics may also be combined in order to obtain estimations of global quality aspects, but care should be taken in the integration; simple linear combinations of different metrics might not be adequate.

Please note that we propose both questionnaires and performance metrics here, as only the combination of the two can give a comprehensive and valid picture, for several reasons: (a) users might be influenced not to honestly report their experience; (b) a combination gives the system developer more insight into which system factors influenced noteworthy user ratings; (c) for most of the QoE aspects described there exist no valid and reliable metrics for the case of multimodal systems, so a mixture can help to interpret the results better.

Also note that the metrics defined will not automatically result in better systems. For efficiently developing systems which provide a high quality to their users, usability engineering methods have to be used during the entire system lifecycle, involving proper analysis, design, prototyping, expert evaluation, user evaluation, and feedback cycles [24]. The metrics defined here will certainly help to quantify the progress made during the usability evaluation cycle, and to uncover the characteristics of the system, the user and the context of use which are relevant for achieving high quality. However, we think it is not possible to relate individual concepts and related metrics to each step of the system lifecycle, as the exact concepts to be measured and the results expected will largely depend on the type of system to be developed.

In the following sections, we provide definitions and examples for the individual items of the taxonomy. As far as they are available, we also present and explain metrics for quantifying them in the evaluation.

3 Quality factors

Quality factors exercise an influence on QoE through the interaction performance aspects. They include the characteristics of the user, the system and the context of use which have an impact on perceived quality.

3.1 User factors

User factors include all characteristics of the user which carry an influence on his/her interaction behavior and quality judgment. Some of these characteristics are static (e.g. age, gender, native language), others change dynamically from interaction to interaction, or even within an interaction (e.g. motivation, emotional status). Because systems cannot be designed for an individual user, users are commonly classified into groups which are relevant for the purpose of the evaluation. Such a classification can e.g. be performed on the basis of

• perceptional characteristics (e.g. visual or auditory impairments, typical for elderly users),
• behavioral characteristics (e.g. left-/right-handed users, accented vs. non-accented speech),
• experience (e.g. with the system under investigation, with similar systems, with the task domain, with technology in general),
• motivation (for using the system), and
• individual preferences, capabilities or knowledge.

These characteristics carry an influence not only on the interaction behavior (and thus the interaction performance), but also on the quality, which is influenced by the reference internal to the user (i.e. the “desired composition” in the definition of Sect. 1).

With particular relevance for multimodal information and communication technologies (ICT), Hermann et al. [31] and Naumann et al. [32] developed a scheme which classifies users according to their affinity towards technology, their general interaction methods and capabilities (cognitive capabilities, problem-solving strategies, purposefulness), as well as (less importantly) their domain knowledge, language competence, age, and orientation towards social norms. It results in seven user groups which show a distinct behavior towards, and experience with, such systems. However, there currently exists no screening questionnaire to classify users according to this scheme. Instead, there are several different questionnaires to assess user aspects like computer anxiety [33], computer literacy [34], attitudes towards computers [35], computer self-efficacy [36], computer experience [37], mental abilities [38] and so on. This situation has to be considered quite problematic: the different questionnaires not only cover varying domains, ranging from basic electric or electronic devices to current mobile multimedia interfaces, but the most validated ones are also the oldest, and thus may no longer be applicable 20 years after their creation and validation, as experience with technology and expectations constantly change.

3.2 System factors

System factors are the characteristics of the system as an interaction partner (agent factors) and those related to its functional capabilities (functional factors). Agent factors include the technical characteristics of the individual system modules (speech, gesture and/or face recognition; multimodal fusion, dialogue management, multimodal fission, etc.) as well as their aesthetic appearance [27]. Functional factors include the functionalities covered, the type of task (well- or ill-structured, homo- or heterogeneous, see [39]), the number of available tasks, the task complexity, the task frequency, and task consequences (particularly important for security-critical systems) [27]. For less task-directed systems, the domain characteristics gain importance (e.g. education or entertainment systems).

Both agent and functional factors are commonly specified in advance, resulting in specification documents which list the characteristics, but mostly in a qualitative, not in a quantitative way. Most agent factors have to be specified by the system developer; however, aesthetics can usually better be specified by design experts or experienced salesmen. Functional factors can best be specified by domain experts; they may be the outcome of concept-testing phases or focus group discussions where potential users try to find core and secondary functions which should be implemented in the final system, and weight them according to their importance for the later usage scenario.

3.3 Context factors

Context factors consist of two types. The so-called environmental factors capture the physical usage environment (home, office, mobile, or public usage), as well as any transmission channels involved in the interaction. These characteristics include any space, acoustic, and lighting conditions which might exercise an influence on the performance of the system or on the behavior of the user. The usage environment may also include potential parallel activities of the user; such activities have to be taken into account when evaluating the system, as they may reduce the cognitive resources which can be dedicated to the interaction with the system under test. A second class of context factors are the so-called service factors, i.e. non-physical characteristics of the system and its usage which may carry an influence on how the user judges its quality, like access restrictions (from where the system can be reached), the availability of the system (restricted opening hours), any security or privacy issues resulting from the use of the system, and the resulting costs. The latter are very important for the final acceptance, as the user will try to find a balance between the value provided by the system and the price s/he is willing to pay for it.

Like the system factors, context factors are usually specified prior to system design, and mostly in a qualitative way. Usage contexts cannot always be anticipated by the developers, and sometimes it makes sense to use task analysis methods as in [24] in order to find out about the user’s functional needs in specific usage contexts.

4 QoS interaction performance aspects

It is during the interaction that the perception and judgment processes forming quality take place. Interaction performance aspects are organized into two cycles (one for the system and one for the user), their order reflecting the processing step they are located in. These cycles and the respective interaction performance aspects are described in the following sub-sections.

4.1 System interaction performance aspects

System interaction performance aspects can be quantified with the help of interaction parameters, which are either logged during the interaction or annotated by an expert afterwards. While these interaction parameters are not directly linked to the perceived quality, their interpretation can offer useful information to system developers. They are particularly useful to assess and compare the performance of the involved technologies, such as recognizers or fusion and fission components. Thus, a meaningful parameter set can only be selected (or defined) under consideration of the specifics of the technologies actually used. However, some general principles apply at each processing stage, which will be named in this section. For further reference, a list of multimodal interaction parameters has been published in [24], and a first application to an evaluation of a multimodal dialogue system in [40].

4.1.1 Input performance

Input performance can be quantified e.g. in terms of its accuracy or error rate, as is common practice for speech, gesture and facial expression recognizers. To compute those parameters, an expert has to transcribe the user utterance or the handwriting input, or annotate the facial expressions and gestures concerning their beginning and end as well as their interpretation. Typically, the number of correctly determined words, gestures or expressions, of substitutions, of insertions, and of deletions is counted. These counts can be divided by the total number of words, gestures or facial expressions in the reference to produce error rates. These measures will mostly be computed separately for each modality, an exception being the case of signal-level fusion. In contrast to fusion on the semantic level, which has to be evaluated as part of the interpretation performance, fusion on the signal level can be considered as forming part of the input performance. The performance of the signal-level fusion can then be measured by comparing the fusion results with the recognition results obtained with each modality separately (see [39] for examples).
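The counting scheme above amounts to an edit-distance alignment between the expert reference and the recognizer output. The following is a minimal sketch of such a computation; the function names and example data are ours and not taken from the paper or any standard toolkit:

```python
def align_counts(reference, hypothesis):
    """Count substitutions, insertions and deletions via Levenshtein
    alignment of reference and hypothesis tokens (words, gestures
    or facial-expression labels)."""
    m, n = len(reference), len(hypothesis)
    # d[i][j] = minimal edit cost between reference[:i] and hypothesis[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = d[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    # Backtrack through the cost matrix to count the error types.
    i, j, subs, ins, dels = m, n, 0, 0, 0
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                d[i][j] == d[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])):
            subs += reference[i - 1] != hypothesis[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return subs, ins, dels

def error_rate(reference, hypothesis):
    """Error rate = (substitutions + insertions + deletions) / reference length."""
    subs, ins, dels = align_counts(reference, hypothesis)
    return (subs + ins + dels) / len(reference)

# Hypothetical example: word error rate of a speech recognizer.
ref = "switch the kitchen light on".split()
hyp = "switch the kitten light".split()
print(error_rate(ref, hyp))  # 1 substitution + 1 deletion over 5 words = 0.4
```

The same function applies unchanged to gesture or facial-expression label sequences, since only token equality is used.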

In addition to that, the degree of coverage of the user’s behavior (vocabulary, gestures, facial expressions) as well as the system’s real-time performance are indicators of system input performance. The real-time performance can be captured in terms of system feedback delay and system response delay, measured from the end of user input to the beginning of system feedback, such as the display of the loading status of a web page, or to the beginning of the system response.

Concerning special multimodal input like face detection and person or hand tracking, see [41] for metrics and even a corpus to evaluate system components in a comparable way.

4.1.2 Input modality appropriateness

Multimodal dialogue systems may offer a set of modalities for user input. These modalities may be used sequentially, simultaneously or compositely. Depending on the content, the environment and the user, it can be determined (e.g. guided by modality properties as described in [42]) whether the offered input modalities are appropriate for every given turn in a given context. For example, spoken input is inappropriate for secret information like a PIN when it occurs in public spaces. Appropriateness can be annotated per modality or, in the case of composite input, for the multimodal input as a whole. In the first case, each modality can be appropriate or inappropriate. In the second case, the multimodal input can be appropriate, partially appropriate or inappropriate. From the annotations, rates of appropriate input modalities can be calculated by dividing by the number of times the system asked the user for input.
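Written out as a formula (our notation, not taken from a standard):

\[
R_{\mathrm{approp}} = \frac{N_{\text{input turns annotated as appropriate}}}{N_{\text{system requests for user input}}},
\]

where partially appropriate composite inputs could e.g. be counted with a fractional weight if a graded rate is desired.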

4.1.3 Interpretation performance

The performance of the system in extracting meaning from the user input can be quantified in terms of accuracy when a limited set of underlying semantic concepts is used for meaning description. Example: counting the errors in filling the correct attribute–value pairs, on the basis of an expert-derived reference of the “correct” interpretation. For independent input modalities, the concept error rate is typically calculated [21]. However, in the case of multimodal input, the performance of the modality fusion component should also be considered, as fusion on the semantic level can help to reduce the impact of recognition errors. This gain in accuracy (or fusion gain) can be evaluated by comparing the fused result with the results of the recognition modules for the different modalities.
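Analogously to the word error rate, the concept error rate is commonly written as

\[
\mathit{CER} = \frac{S_c + I_c + D_c}{N_c},
\]

where \(S_c\), \(I_c\) and \(D_c\) are the numbers of substituted, inserted and deleted attribute–value pairs relative to the expert reference, and \(N_c\) is the number of attribute–value pairs in the reference (notation ours).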

4.1.4 Dialogue management performance

Depending on the function of interest, different metrics can be defined for the dialogue manager. Its main function is to drive the dialogue to the intended goal; this function can be assessed only indirectly, namely in terms of dialogue success (see below). In addition, goals should be achievable in an efficient and elegant way. Efficiency can be indicated by the dialog duration or the number of dialog turns. In addition, the query density can be computed as the quotient of unique concepts introduced by the user and the total number of concepts input to the system. Regarding the elegance of the dialog, several metrics are listed in [21]. These include the average user turn duration and system turn duration, as well as the number of user input modality changes and system output modality changes. In addition, counts of special actions can be collected, such as the number of system questions, the number of diagnostic system error messages, the number of timeouts, the number of help requests or help messages, or the number of cancel attempts. The dialogue manager’s ability to correct misunderstandings can e.g. be quantified by computing the user and system correction rate as the number of turns concerned with corrections in relation to the total number of turns.
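To make the bookkeeping concrete, here is a small sketch computing the query density and a correction rate from an annotated dialogue log; the log format and field names are invented for illustration and do not follow [21]:

```python
# Hypothetical annotated log: one dict per turn.
# "concepts" lists the semantic concepts in the turn;
# "correction" marks turns concerned with correcting a misunderstanding.
log = [
    {"speaker": "user",   "concepts": ["date", "time"], "correction": False},
    {"speaker": "system", "concepts": ["date"],         "correction": False},
    {"speaker": "user",   "concepts": ["time"],         "correction": True},
    {"speaker": "system", "concepts": ["time"],         "correction": True},
]

user_turns = [t for t in log if t["speaker"] == "user"]
all_user_concepts = [c for t in user_turns for c in t["concepts"]]

# Query density: unique concepts introduced by the user divided by
# the total number of concepts input to the system.
query_density = len(set(all_user_concepts)) / len(all_user_concepts)

# Correction rate: turns concerned with corrections in relation to
# the total number of turns of that speaker.
def correction_rate(turns):
    return sum(t["correction"] for t in turns) / len(turns)

print(query_density)                # 2 unique / 3 total ≈ 0.67
print(correction_rate(user_turns))  # 1 of 2 user turns = 0.5
```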

4.1.5 Contextual appropriateness

The system response should be appropriate in a given context, where appropriateness can be defined based on Grice’s Cooperativity Principle [43] and quantified in terms of violations of this principle [42]. Based on this, the contextual appropriateness parameter defined in [21] involves first an expert rating the appropriateness of each system response, and second calculating the rate of appropriate responses among all system responses.

4.1.6 Output modality appropriateness

As for the input side, output modality appropriateness can be checked on the basis of modality properties as defined in [42], taking into account the interrelations between simultaneous modalities. In the case of multimodal output, the assignment of an output modality to a system response is the task of the output modality fission. Its performance can be judged by an expert: has the appropriate modality been chosen for every bit of information? As for contextual appropriateness, the rate of appropriate modalities can then be determined.

4.1.7 Form appropriateness

Form appropriateness refers to the surface form of the output provided to the user. For example, the form appropriateness of spoken output can be measured via its intelligibility, comprehensibility, or the required listening effort. The appropriateness of an Embodied Conversational Agent (ECA) can be assessed by its ability to convey specific information, including emotions, turn-taking backchannels, etc. The synchrony of the output modalities, especially in the case of ECAs, can be measured in terms of the time lag of each modality compared to the other modalities [24].

4.2 User interaction performance aspects

On the user’s side, interaction performance can be quantified by the effort required from the user to interact with the system, as well as by the freedom of interaction. Aspects include:

Perceptual effort: Effort required to decode the system messages and to understand and interpret their meaning [44], e.g. listening effort or reading effort. Metrics: Borg’s category-ratio scale [45].

Cognitive workload: Specification of the costs of task performance (e.g. necessary information-processing capacity and resources) [46]. There are several ways to measure workload. A simple and cheap option is to use questionnaires like the NASA-TLX [47] or the RSME [48]; a scoring sketch for the NASA-TLX follows this list. More elaborate measures are psycho-physiological parameters like pupil diameter [49], or test settings employing the dual-task paradigm. An overview of methods assessing cognitive workload is given in [46].

Physical response effort: Physical effort required to communicate with the system. Example: effort required for entering information into a mobile phone. Metrics: questionnaires, e.g. [22].
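As an illustration of the questionnaire route, the following is a minimal sketch of the standard NASA-TLX weighted scoring procedure (six subscale ratings on a 0–100 scale, weighted by 15 pairwise comparisons); the variable names and example data are ours:

```python
# The six NASA-TLX dimensions.
DIMENSIONS = ["mental", "physical", "temporal", "performance", "effort", "frustration"]

def tlx_score(ratings, pairwise_winners):
    """Weighted NASA-TLX workload score.

    ratings: dict mapping each dimension to a rating on a 0-100 scale.
    pairwise_winners: list of the dimensions chosen in the 15 pairwise
        comparisons (each dimension's weight = number of times chosen).
    """
    assert len(pairwise_winners) == 15
    weights = {d: pairwise_winners.count(d) for d in DIMENSIONS}
    # Weighted mean: sum of rating * weight, divided by the 15 comparisons.
    return sum(ratings[d] * weights[d] for d in DIMENSIONS) / 15.0

# Hypothetical participant data.
ratings = {"mental": 70, "physical": 10, "temporal": 55,
           "performance": 30, "effort": 60, "frustration": 40}
winners = (["mental"] * 5 + ["effort"] * 4 + ["temporal"] * 3
           + ["frustration"] * 2 + ["performance"] * 1)
print(tlx_score(ratings, winners))  # overall workload on a 0-100 scale, here ~57.7
```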

5 QoE aspects

So far, we have limited ourselves to QoS, both in terms of influencing factors and of performance metrics. However, the ultimate aim of a system developer should be to satisfy the user’s needs. According to Hassenzahl et al. [50], the user evaluation of a system is influenced by pragmatic and hedonic quality aspects. These quality aspects have to be evaluated with the help of real or test users providing judgments on what they perceive. Such judgments can be seen as “direct” QoE measurements. In addition to that, “indirect” QoE measurements can be obtained by logging user behavior and relating it to perceived QoE [51, 52].

5.1 Interaction quality

This term relates to the quality of the pure interaction between user and system, disregarding any other aspects of system usage. It includes the perceived input quality, the perceived output quality, as well as the system’s cooperativity.

Input quality relates to the perceived system understanding and input comfort; in contrast to input performance, it reflects how the user thinks s/he is understood by the system. Studies have shown that the relationship between e.g. perceived system performance and actual performance is weak at best [53]. One underlying reason is that the user frequently does not know which concepts have been understood by the system, unless there is direct feedback. In other cases, this might only be detected in later stages of the dialogue, or a misunderstanding might even never be detected during the interaction, potentially resulting in a task failure unnoticed by the user.

Output quality refers to the perceived system understandability, and to its form appropriateness. This includes whether the meaning of a system message can be discerned, and whether the form supports meaning extraction by the user. Meaning extraction may be limited by the output performance of the system (e.g. legibility of characters on the screen, intelligibility of synthesized speech), but goes significantly beyond this by taking into account the content of the system message in the immediate dialogue context, which will bear on the perceived transparency of the system.

Output quality is strongly related to system cooperativity, as the user can judge the system’s support in reaching a joint goal mainly via the system output messages. Beyond the perceived quality of the system output, cooperativity includes the distribution of initiative between the partners (which may be asymmetric because of different roles and expectations), the consideration of background knowledge of the user and the system, and the ability for repair and clarification.

Questionnaires have been developed to quantify a variety of interaction quality aspects. For speech-based interfaces, the framework provided in ITU-T Rec. P.851 [22] captures the most relevant aspects, such as the input and output quality, the perceived speed/pace of the interaction, and the smoothness and naturalness of the interaction. It is mainly based on the SASSI [23] questionnaire, which was developed for systems with speech input capabilities, and which has been extended towards output quality. The framework is currently being extended towards multimodal systems.

5.2 Usability

According to the ISO definition, usability is the “extent to which a product can be used by specified users to achieve specified goals with effectiveness, efficiency and satisfaction in a specified context of use” [54]. Although this is probably the most common definition, the relevant literature offers a large variety of additional definitions. Apart from this, it has to be mentioned that the term User eXperience (UX) has become increasingly popular over the last decade. The ISO standard 9241-210:2010 [55] defines UX as “a person’s perceptions and responses that result from the use or anticipated use of a product, system or service”. According to [56], this definition permits three different interpretations of the term UX: First of all, UX can be understood as “an umbrella term for all the user’s perceptions and responses [. . .]”. Secondly, UX can be understood as a different, maybe even a counter-concept to usability, as historically the focus of usability is mainly on performance. The third interpretation sees UX as “an elaboration of the satisfaction component of usability”. In order to be consistent with the other concepts defined on this layer of the taxonomy and to have a more fine-grained distinction, we adopted the last view, and will thus consider two aspects of usability: the Ease of Use, which is influenced by the mentioned consequences of interaction quality, and the Joy of Use, which is often associated with UX. Joy of Use depends not only on the quality of the interaction; hedonic aspects like the appeal or the “personality” of the system will largely influence this sub-aspect. Both Ease of Use and Joy of Use may determine the satisfaction of the user, which can be considered a consequence of good usability.

Our concept of usability follows [56] and incorporates hedonic as well as pragmatic qualities; thus we go far beyond a strict performance-related approach. We do not use the term User Experience within the taxonomy, as the definition provided by ISO is very broad and allows for different interpretations (see the previous paragraph). Instead, the last layer tries to provide a structured picture of the UX sub-concepts.

Following the definition above, questionnaires need to measure both Joy and Ease of Use. Although several questionnaires measure the “Ease of Use” part of usability, only few include the Joy of Use. An affect scale is included in the SUMI [57]; the PSSUQ [58] and the CSUQ [58] additionally measure frustration. The AttrakDiff’s “attractiveness” scale, measuring pragmatic (Ease of Use) as well as hedonic qualities (Joy of Use), is probably closest to our conception of usability [59].

Apart from the questionnaires presented above, other suitable methods to assess Joy and Ease of Use include qualitative approaches like the Repertory Grid Technique [60], the Valence Method proposed by Burmester [61], and the UX Curve [62]. The Repertory Grid Technique has its origin in the psychology of personal constructs by Kelly [60]. Constructs in Kelly’s sense are bipolar (dis)similarity dimensions [63]. According to Kelly [56, 60], every human owns an individual and characteristic system of constructs, through which he/she perceives and evaluates his/her experiences. The Repertory Grid Technique aims to assess these individual constructs in two phases. In the elicitation phase, the person is presented with triads of the relevant objects (e.g. three websites as in [64]) and is asked to verbalize what two objects have in common and how they differ from the third. This way, bipolar constructs in the form of a semantic differential are generated. These bipolar constructs are later used as the rating scale for all objects. The result is an individual construct-based description of the objects [63].

The Valence Method [61] is a two-phase measure based on the theoretical work of [65]. In the first phase, the users are asked to set positive and negative valence markers while exploring the interface for up to eight minutes. The whole session is videotaped. In the second phase, the marked situations are presented to the participants again, while the interviewer asks which design aspect was the reason for setting the marker. The laddering interviewing technique is employed to uncover the underlying need by repeatedly asking why a certain attribute was mentioned, until the affected need is identified.

The main limitation according to the authors [61] is that the method is currently recommendable for first-usage situations only, as the number of markers increases substantially if the product is already known to the user. Valuable insights in the form of quantitative data (numbers of positive and negative markers) and qualitative data (interviews) can be gained with this method, one disadvantage probably being the relatively high resources required.

The UX Curve [62] is used to retrospectively assess the system’s quality over time. Participants are asked to draw curves describing their perceptions of the system’s attractiveness, the system’s ease of use, the system’s utility, as well as the usage frequency and their general experience over time. Users are also asked to explain major changes in the curves. According to the authors, the UX Curve allows measuring the long-term experience and the influences that improve or decrease the perceived quality of the experience. A similar method is offered with iScale [66]. Again, users are asked to draw a curve reflecting their experience. However, iScale is an online survey instrument, while the UX Curve was developed for face-to-face settings. The assessed dimensions also differ.

5.3 Ease of use

The perceived Ease of Use describes the extent to which users assume that the usage of a system will be effortless [67]. Relevant determinants for this construct are the aspects described in the ISO 9241 standard [54], namely efficiency and effectiveness and, moreover, learnability and intuitivity.

The effectiveness refers to the accuracy and completeness with which specified users can reach specified goals in particular environments [54]. Efficiency is the effort and resources required in relation to the accuracy and completeness achieved [54]. The vast majority of standardized usability questionnaires cover these two constructs. Examples are the QUIS [68], the SUS [69], the IsoMetrics Usability Inventory [70], the AttrakDiff [59], the SASSI [23], and the USE [71]. It has to be noted that the questionnaires’ subscales are not necessarily named efficiency or effectiveness: the SASSI subscale “speed” is strongly related to efficiency, and the pragmatic-qualities scale of the AttrakDiff refers to both efficiency and effectiveness. Performance data can also be used to operationalize these aspects: task duration might serve as an efficiency measure, task success as an effectiveness measure (see the sketch after this paragraph). Learnability, the ease with which novice users can start effective interactions and maximize performance, is also covered by most usability questionnaires. This cannot be said for intuitivity, the degree to which the user is able to interact with a system effectively by applying knowledge unconsciously. Intuitivity might be associated with constructs covered by established questionnaires, like familiarity or self-descriptiveness. However, despite intuitivity often being considered an important determinant of a product’s quality, it is not as commonly included in usability evaluations as the above-mentioned aspects. Only recently have questionnaires specially focusing on intuitivity been developed [72, 73].
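A minimal sketch of such an operationalization from hypothetical test logs, with effectiveness taken as the task success rate and efficiency as successful tasks per unit of time (one of several common conventions; the field names are invented):

```python
# Hypothetical per-task usability test log.
tasks = [
    {"solved": True,  "duration_s": 42.0},
    {"solved": True,  "duration_s": 67.5},
    {"solved": False, "duration_s": 120.0},
]

# Effectiveness: share of tasks completed successfully.
effectiveness = sum(t["solved"] for t in tasks) / len(tasks)

# Efficiency: effectiveness in relation to the time invested,
# here expressed as successful tasks per minute.
total_minutes = sum(t["duration_s"] for t in tasks) / 60.0
efficiency = sum(t["solved"] for t in tasks) / total_minutes

print(f"effectiveness = {effectiveness:.2f}")                  # 0.67
print(f"efficiency    = {efficiency:.2f} successful tasks/min")  # ~0.52
```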

Other methods regularly used for the evaluation of Ease of Use are expert-oriented procedures like the Cognitive Walkthrough [74] and modelling approaches, e.g. GOMS [75]. The Cognitive Walkthrough is rooted in theories of explorative learning. Experts, usually designers or psychologists, analyse the system’s functionalities based on a description of the interface, the tasks, the actions necessary to perform the tasks, and information about the user and the usage context. Critical information is recorded by the experts using a standardized protocol.

With the GOMS method, the interaction with a system is reduced to basic elements: goals, methods, operators and selection rules. Goals are the basic goals of the user, what he/she wants to achieve while using the system [75]. Operators are the actions offered by the system to accomplish the goals. Methods are well-learned sequences of sub-goals and operators suitable to achieve a goal [75]. Selection rules apply if several methods are possible and reflect the user’s personal preferences. These four basic elements describe the procedural knowledge necessary to perform the tasks. This knowledge is applied to the design to check whether the system provides methods for all user goals; furthermore, execution times of well-trained, error-free expert users can be predicted.
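Execution-time predictions of this kind are often made with the keystroke-level model (KLM), a simplified member of the GOMS family. The sketch below uses the classic operator times reported in the literature; the encoded method for a hypothetical point-and-type task is our own example:

```python
# Classic keystroke-level model (KLM) operator times in seconds
# (Card, Moran & Newell): K = keystroke/tap, P = point with a device,
# H = home hands on a device, M = mental preparation.
# System response waits are added separately.
KLM_TIMES = {"K": 0.28, "P": 1.10, "H": 0.40, "M": 1.35}

def predict_execution_time(method, response_times=()):
    """Predict expert execution time for a well-learned method,
    given as a sequence of KLM operators plus system response waits."""
    return sum(KLM_TIMES[op] for op in method) + sum(response_times)

# Hypothetical method: mentally prepare, point at a field, tap it,
# mentally prepare again, then type a four-character entry.
method = ["M", "P", "K", "M"] + ["K"] * 4
print(f"{predict_execution_time(method, response_times=[0.5]):.2f} s")  # 5.70 s
```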

In the case of multimodal systems, GOMS analyses can become quite extensive due to the complexity of such systems. As multimodal systems allow for parallel, serial or combined usage of different modalities, multiple methods for one goal are possible, which then require the definition of multiple selection rules. The EPIC framework [76] is a more sophisticated architecture, better suited for predicting execution times for interactions with multimodal systems; however, EPIC is first and foremost a research system and thus not focused on being a tool for evaluation purposes [76].

5.4 Joy of use

According to Schleicher and Trösterer [77], Joy of Use is the conscious positive experience of a system’s quality. An important determinant is the product’s aesthetics. Although the term aesthetics often refers to visual aesthetics only, we go along with the following, broader definition proposed by Hekkert [78]: “aesthetic is the pleasure attained from sensory perception, as opposed to anesthetic. An experience of any kind [. . .] comprises an aesthetic part, but the experience as a whole is not aesthetic” [78]. This definition implies that aesthetics can be experienced through all our senses, respectively modalities, and is not limited to visual aesthetics. The system’s personality refers to the users’ perception of the system characteristics originating from the current combination of agent factors and surface form. This includes system factors like the chosen gender, appearance or voice of embodied conversational agents, the wording of voice prompts, and the structure, color and icon scheme of graphical displays, which should exhibit a consistent and application-adequate personality (also called character) [79–81].

The appeal is a result of the aesthetics of the product, its physical factors, and the extent to which the product incorporates interesting, novel, and surprising features [50, 82].

It is noteworthy that there is an ongoing debate concerning the relationship between hedonic qualities, the aspects related to Joy of Use, and pragmatic qualities, the aspects related to Ease of Use. While some findings provide evidence for the claim that “what is beautiful is usable” [83] and for the underlying assumption that Joy of Use and Ease of Use are interdependent, other studies could not confirm these results [84]. Hassenzahl [85] suggests that these ambiguous results are caused by different approaches in understanding and measuring aesthetics. Accordingly, a variety of methods is available to measure Joy of Use related aspects, but before deciding on a measurement method it has to be defined which aspect should be assessed. The questionnaire proposed by Lavie and Tractinsky [86] is suitable for measuring visual aesthetics, but not for aesthetics perceived via other modalities. The AttrakDiff [59] measures hedonic qualities on a higher level and is not limited to unimodal interfaces. For measuring hedonic qualities during the interaction, the Joy of Use-Button [77] and psycho-physiological parameters are available options, with the latter being the most resource-intensive method.

Another well-validated and widely used instrument is the Self-Assessment Manikin [87], which measures the arousal, pleasure and dominance linked to affective reactions on three non-verbal scales.

If the aim is to measure specific emotions, the LemTool [88] and PrEmo [89] can be used. However, both tools are so far validated for specific application areas only: the LemTool for websites [88] and PrEmo for non-interactive products [89].

However, although a wide range of methods assessing hedonic, affective qualities is nowadays available, a recent review [90] indicates that questionnaires, more specifically the hedonic-qualities subscales of Hassenzahl’s AttrakDiff [50] and the Self-Assessment Manikin [87], are by far the most popular instruments.


5.5 Utility and usefulness

In order to judge whether a system is useful, we have to compare the functional requirements of the user with the functions offered by the system. Utility answers the question: can a specific user resolve his personal task with the help of the system [91]? Usefulness relates this to the usability: how well can a user resolve the task, considering the effort spent in the interaction, but also the joy experienced herein? Questionnaires focusing on utility and usefulness are rather rare. However, usefulness as understood in our framework is a subscale in the PSSUQ as well as in the CSUQ [58]. These questionnaires are identical, except that the PSSUQ was developed for use after a usability test and thus addresses specific tasks, whereas the CSUQ asks about the general system and is suitable for surveys [58].

5.6 Acceptability

The term acceptability describes how readily a user will actually use the system. There are some disputes about whether acceptability is still a part of QoE, as it may be represented as a purely economic measure, relating the number of potential users to the size of the target group [92]. Thus, we consider it to be a consequence of the quality aspects and the influencing factors described in the taxonomy.

Unfortunately, it is still ill understood how acceptability is really influenced by QoE aspects; user needs as well as economic considerations may be dominant in certain fields and rule out the advantages of good QoE, at least temporarily. For example, Chateau et al. [52] have postulated that the relative importance of factors on acceptability varies roughly from decade to decade, and that the price may significantly outperform perceived quality, as long as the latter is still above a certain threshold. In-depth and long-term analyses of the relationship between QoE and acceptability are thus needed to further clarify the value of high QoE for success on the market.

One of the most influential approaches to determining acceptance is the Technology Acceptance Model (TAM) [93]. The theoretical base of the model is the theory of reasoned action (TRA) by Ajzen and Fishbein [94]. According to the TRA, actual behavior is determined by behavioral intentions. Behavioral intentions depend on attitudes towards the behaviors and on subjective norms. In terms of the TAM, the attitudes towards the behavior are the perceived usefulness and the perceived Ease of Use. Subjective norms are not included in the TAM. It is important to note that the perceived usefulness described in the TAM does not exactly match our understanding of this term. More precisely, perceived usefulness is defined as “the degree to which a person believes that using a particular system would enhance his or her job performance”. Thus the subscale usefulness in the TAM questionnaire is not to be used to measure the conceptualization of usefulness given in Sect. 5.5. However, to measure acceptance the questionnaire is helpful, although it will not provide as detailed information about the system’s quality as the other questionnaires mentioned.

As acceptability is also affected by social constraints beyond the experienced system’s quality, it is a complex and multidimensional QoE aspect that can only be assessed with valid results in the field, or even in the market. In our opinion, acceptability is primarily a topic for marketing and not for interaction research.

6 Conclusions and future work

The presented taxonomy provides definitions of factors and aspects as well as information about their relationships. On this common ground, comparable evaluations can be performed, their results can be identified and categorized, and metrics for specific purposes (or the lack thereof) can be identified. A first example of an application of the taxonomy (however without relating it to QoS and QoE) is the evaluation of multimodal interfaces for intelligent environments presented in [95].

Still, the presented metrics are not exhaustive, and many of the established metrics are not sufficiently evaluated for their application to multimodal systems. For example, there is no standardized questionnaire available for assessing the interaction quality of multimodal applications. As current systems cover a wide range of modalities, applications and domains, we anticipate that an open framework will be needed to enable meaningful evaluation for specific contexts. Also, concerning performance measures describing HCI, interaction parameters are already defined for multimodal interfaces [24]. However, there exist only preliminary results concerning the relationships between these measures and aspects of QoS and QoE [40].

The taxonomy has been developed on the basis of our own evaluation exercises, as well as on the basis of a broad literature survey. The attribution of concepts to the different layers was guided by the definitions cited above, as well as by the intuition of usability engineers in our department. It would be good to validate this classification with further usability experts from other domains, so that the taxonomy becomes more generic and stable. This could e.g. be done in a card-sorting experiment where usability experts have to organize QoS and QoE concepts according to certain rules.

Up to now, practical applications of the taxonomy are still rare, and they are limited to a few systems in the telecommunication sector. In order to further substantiate the taxonomy, our aim is to classify further evaluations described in the literature according to the taxonomy. This will help to identify further evaluation metrics for individual QoS and QoE aspects, and to validate them on a larger basis of systems and user groups. Furthermore, we will use such evaluations to systematically identify relationships between quality factors on the one hand, and QoS and QoE aspects on the other. As soon as such relationships are identified, it will be possible to judge the impact of a certain quality factor in advance, and to design the system accordingly.

Overall, we aim at an integration of the taxonomy described here into the usability engineering lifecycle. We expect that for the early steps of the lifecycle (analysis, design, prototyping), a list of quality factors can be defined which facilitates the proper specification of all characteristics of the system that are relevant for its quality. For the later steps of the lifecycle (expert evaluation and user testing), the metrics defined for the QoS and QoE layers will help to capture the relevant aspects, so that adequate conclusions can be drawn for the (re-)design of the system.

As stated above, the taxonomy is designed as a framework facilitating evaluation, not as a strict model which could be implemented in order to predict acceptability. Still, we see the potential that algorithmic relationships can be defined between quality factors as input elements, QoS interaction performance metrics as mediating elements, and QoE aspects as output results. These relationships could be described by deterministic linear or non-linear formulae, or by a probabilistic framework such as a Hidden Markov Model. Although some of the concepts addressed in the taxonomy are still immature (such as aesthetics and appeal), others, such as interaction quality, are well operationalized, so that meaningful predictions of quality aspects come within reach; a sketch of such a linear prediction is given below. The definition of algorithmic relationships will be very helpful both for target-oriented design and optimization, and for the online adaptation of future multimodal dialogue systems.
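
As an illustration of the simplest of these options, a deterministic linear formula, the following sketch fits a least-squares model that maps interaction performance metrics (mediating elements) onto mean user quality judgments (a QoE aspect). All metrics and numbers are invented for demonstration; real data would come from evaluations using parameters such as those defined in [24, 40].

import numpy as np

# Hypothetical evaluation data: one row per interaction; the columns
# are invented interaction performance metrics:
# task duration [s], error rate, number of modality switches.
X = np.array([[120.0, 0.10, 3.0],
              [ 90.0, 0.05, 1.0],
              [200.0, 0.30, 5.0],
              [150.0, 0.15, 2.0],
              [ 80.0, 0.02, 0.0]])
# Mean user quality judgments (e.g. on a 5-point scale), also invented.
y = np.array([3.5, 4.2, 2.1, 3.0, 4.6])

# Fit a deterministic linear formula y ~ X_aug @ w by least squares.
X_aug = np.hstack([X, np.ones((len(X), 1))])   # append intercept column
w, *_ = np.linalg.lstsq(X_aug, y, rcond=None)

# Predict the quality judgment for a new, unseen interaction.
new = np.array([100.0, 0.08, 2.0, 1.0])        # metrics + intercept term
print("predicted quality judgment:", round(float(new @ w), 2))

A non-linear or probabilistic model (e.g. a Hidden Markov Model over interaction states) could replace the linear map without changing the overall input-mediator-output structure described above.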

References

1. Lamel L, Bennacef S, Gauvain JL, Dartigues H, Temem JN (2002) User evaluation of the MASK kiosk. Speech Commun 38:131–139

2. Bernsen NO, Dybkjær L, Kiilerich S (2004) Evaluating conversation with Hans Christian Andersen. In: Proc LREC, vol 3, Lisbon, pp 1011–1014

3. Wahlster W (2006) SmartKom: foundations of multimodal dialogue systems. Springer, Berlin

4. Turunen M, Hakulinen J, Ståhl O, Gambäck B, Hansen P, Rodriguez Gancedo M, Santos de la Camara R, Smith C, Charlton D, Cavazza M (2011) Multimodal and mobile conversational health and fitness companions. Comput Speech Lang 25:192–209

5. Cassell J, Kopp S, Tepper P, Ferriman K, Striegnitz K (2007) Trading spaces: how humans and humanoids use speech and gesture to give directions. In: Nishida T (ed) Conversational informatics. Wiley, New York, pp 133–160

6. Gibbon D, Mertins I, Moore RK (eds) (2000) Handbook of multimodal and spoken dialogue systems. Kluwer Academic, Boston

7. Coutaz J, Nigay L, Salber D, Blandford AE, May J, Young RM (1995) Four easy pieces for assessing the usability of multimodal interaction: the CARE properties. In: Nordby K, Helmersen PH, Gilmore DJ, Arnesen SA (eds) Human–computer interaction, proc interact 1995. Chapman & Hall, London, pp 115–120

8. Wickens CD (2008) Multiple resources and mental workload. Hum Factors 50:449–455

9. Bernsen NO, Dybkjær L (2010) Multimodal usability. Human–computer interaction series. Springer, Berlin

10. ISO 25000 (2005) Software engineering—software product quality requirements and evaluation (SQuaRE)—guide to SQuaRE. International Organization for Standardization, Geneva

11. Gray WD, Salzman MC (1998) Damaged merchandise? A review of experiments that compare usability evaluation methods. Hum-Comput Interact 13(3):203–261

12. Olson GM, Moran TP (1998) Commentary on Damaged Merchandise? Hum-Comput Interact 13(3):263–323

13. Hornbæk K (2006) Current practice in measuring usability: challenges to usability studies and research. Int J Man-Mach Stud 64(2):79–102

14. Hartson HR, Andre TS, Williges RC (2003) Criteria for evaluating usability evaluation methods. Int J Hum-Comput Interact 15(1):145–181

15. ITU-T Rec. E.800 (1994) Terms and definitions related to Quality of Service and network performance including dependability. International Telecommunication Union, Geneva

16. ETSI TR 102 643 (2010) Human factors (HF); Quality of Experience (QoE) requirements for real-time communication services. European Telecommunications Standards Institute, Sophia Antipolis

17. Brooks P, Hestnes B (2010) User measures of quality of experience: why being objective and quantitative is important. IEEE Netw 24:8–13

18. ITU-T Rec. P.10 (2007) Vocabulary for performance and Quality of Service. International Telecommunication Union, Geneva

19. Jekosch U (2005) Voice and speech quality perception. Assessment and evaluation. Springer, Berlin

20. Fraser N (1997) Assessment of interactive systems. In: Gibbon D, Moore R, Winski R (eds) Handbook on standards and resources for spoken language systems. Mouton de Gruyter, Berlin, pp 564–615

21. ITU-T Suppl. 24 to P-Series Rec. (2005) Parameters describing the interaction with spoken dialogue systems. International Telecommunication Union, Geneva

22. ITU-T Rec. P.851 (2003) Subjective quality evaluation of telephone services based on spoken dialogue systems. International Telecommunication Union, Geneva

23. Hone KS, Graham R (2000) Towards a tool for the Subjective Assessment of Speech System Interfaces (SASSI). Nat Lang Eng 6(3/4):287–303

24. Nielsen J (1993) Usability engineering. Academic Press, Boston

25. Alonso-Ríos D, Vázquez-García A, Mosqueira-Rey E, Moret-Bonillo V (2010) Usability: a critical analysis and a taxonomy. Int J Hum-Comput Interact 26(1):53–74

26. Möller S, Kühnel C, Weiss B (2010) Extending Suppl 24 to P-series towards multimodal systems and services. International Telecommunication Union, Geneva. Source: Deutsche Telekom Laboratories, ITU-T SG12 Meeting, 18–27 May 2010

27. Möller S (2005) Quality of telephone-based spoken dialogue systems. Springer, New York

28. Pallett D, Fiscus J, Fisher W, Garofolo J (1993) Benchmark tests for the DARPA spoken language program. In: Proc DARPA human language technology workshop, Princeton, pp 7–18

29. Bennett CL, Black AW (2006) The blizzard challenge 2006. In: Proc blizzard satellite workshop to interspeech, Pittsburgh

30. Möller S (2005) Quality of telephone-based spoken dialogue systems. Springer, New York


31. Hermann F, Niedermann I, Peissner M, Henke K, Naumann A (2007) Users interact differently: towards a usability-oriented taxonomy. In: Jacko J (ed) Interaction design and usability, proc HCII 2007, vol 1. Springer, Heidelberg, pp 812–817

32. Naumann AB, Hermann F, Peissner M, Henke K (2008) Interaktion mit Informations- und Kommunikationstechnologie: Eine Klassifikation von Benutzertypen [interaction with information and communication technology: a classification of user types]. In: Herczeg M, Kindsmüller MC (eds) Mensch & computer 2008: Viel Mehr interaktion. Oldenbourg Wissenschaftsverlag, München, pp 37–45. http://mc.informatik.uni-hamburg.de/konferenzbaende/mc2008/konferenzband/mc2008_05_naumann.pdf

33. Heinssen RK, Glass CR, Knight LA (1987) Assessing computer anxiety: development and validation of the computer anxiety rating scale. Comput Hum Behav 3(1):49–59

34. Van Vliet PJA, Kletke MG, Chakraborty G (1994) The measurement of computer literacy—a comparison of self-appraisal and objective tests. Int J Hum-Comput Stud 40(5):835–857

35. Richter T, Naumann J, Groeben N (2000) Attitudes toward the computer: construct validation of an instrument with scales differentiated by content. Comput Hum Behav 16:473–491

36. Compeau DR, Higgins CA (1995) Computer self efficacy: development of a measure and initial test. Manag Inf Syst Q June:189–211

37. Smith B, Caputi P, Rawstorne P (2007) The development of a measure of subjective computer experience. Comput Hum Behav 23(1):127–145

38. Kalbe E, Kessler J, Calabrese P, Smith R, Passmore AP, Brand M, Bullock R (2004) DemTect: a new, sensitive cognitive screening test to support the diagnosis of mild cognitive impairment and early dementia. Int J Geriatr Psychiatry 19:136–143

39. Lopez-Cozar Delgado R, Araki M (2005) Spoken, multilingual and multimodal dialogue systems—development and assessment. Wiley, New York

40. Kühnel C, Weiss B, Möller S (2010) Parameters describing multimodal interaction—definitions and three usage scenarios. In: Proc INTERSPEECH, pp 2014–2017

41. Mostefa D, Garcia M-N, Choukri K (2006) Evaluation of multimodal components within CHIL: the evaluation packages and results. In: Proc LREC, Genoa, pp 915–918

42. Bernsen NO (2002) Multimodality in language and speech systems—from theory to design support tool. In: Granström B, House D, Karlsson I (eds) Multimodality in language and speech systems. Kluwer Academic, Dordrecht, pp 93–148

43. Grice H (1975) Logic and conversation. In: Cole P, Morgan JL (eds) Syntax and semantics. Speech acts, vol 3. Academic Press, New York, pp 41–58

44. Zimbardo PG (1995) Psychologie. Springer, Berlin

45. Borg G (1982) Psychophysical bases of perceived exertion. Med Sci Sports Exerc 14:377–381

46. De Waard D (1996) The measurement of drivers' mental workload. PhD thesis, University of Groningen, Haren

47. Hart SG, Staveland LE (1988) Development of NASA-TLX (task load index): results of empirical and theoretical research. In: Hancock P, Meshkati N (eds) Human mental workload. North Holland, Amsterdam, pp 139–183

48. Zijlstra FRH (1993) Efficiency in work behavior. A design approach for modern tools. PhD thesis, Delft University of Technology, Delft University Press, Delft

49. Cacioppo JT, Tassinary LG, Berntson GG (eds) (2002) Handbook of psychophysiology, 3rd edn. Cambridge University Press, New York

50. Hassenzahl M, Platz A, Burmester M, Lehner K (2000) Hedonic and ergonomic quality aspects determine a software's appeal. In: Proc CHI 2000, Den Haag, pp 201–208

51. Mandryk RL, Inkpen K, Calvert TW (2006) Using psychophysiological techniques to measure user experience with entertainment technologies. Behav Inf Technol 25(2):141–158

52. Chateau N, Gros L, Durin V, Macé A (2006) Redrawing the link between customer satisfaction and speech quality. In: Proc 2nd ISCA/DEGA tutorial and research workshop on perceptual quality of systems, Berlin, pp 88–94

53. Hornbæk K, Law EL (2007) Meta-analysis of correlations among usability measures. In: Proc CHI 2007. ACM Press, New York, pp 617–626

54. ISO 9241-11 (1999) Ergonomic requirements for office work with visual display terminals (VDTs). Part 11: guidance on usability. International Organization for Standardization, Geneva

55. ISO DIS 9241-210 (2010) Ergonomics of human–system interaction—part 210: human-centred design for interactive systems (formerly known as 13407). International Organization for Standardization (ISO), Geneva

56. Bevan N (2009) What is the difference between the purpose of usability and user experience evaluation methods. In: Proc UXEM'09 (INTERACT'09), Uppsala

57. Kirakowski J, Corbett M (1993) SUMI: the software usability measurement inventory. Br J Educ Technol 24(3):210–212

58. Lewis JR (1995) IBM computer usability satisfaction questionnaires: psychometric evaluation and instructions for use. Int J Hum-Comput Interact 7:57–78

59. Hassenzahl M (2004) The interplay of beauty, goodness, and usability in interactive products. Hum-Comput Interact 19:319–349

60. Kelly GA (1955) The psychology of personal constructs. Norton, New York

61. Burmester M, Mast M, Jäger K, Homans H (2010) Valence method for formative evaluation of user experience. In: Proc DIS '10. ACM Press, New York, pp 364–367

62. Kujala S, Roto V, Vaananen-Vainio-Mattila K, Karapanos E, Sinnela A (2011) UX curve: a method for evaluating long-term user experience. Interact Comput. doi:10.1016/j.intcom.2011.06.005

63. Hassenzahl M, Wessler R (2000) Capturing design space from a user perspective: the repertory grid technique revisited. Int J Hum-Comput Interact 12(3/4):441–459

64. Hassenzahl M, Trautmann T (2001) Analysis of web sites with the repertory grid technique. In: Proc CHI 2001. ACM Press, New York, pp 167–168

65. Hassenzahl M, Diefenbach S, Göritz AS (2010) Needs, affect, and interactive products—facets of user experience. Interact Comput 22(5):353–362

66. Karapanos E, Martens J-B, Hassenzahl M (2010) On the retrospective assessment of users' experiences over time: memory or actuality? In: CHI 2010 extended abstracts on human factors in computing systems, Atlanta. ACM Press, New York, pp 4075–4080

67. Davis F (1989) Perceived usefulness, perceived ease of use, and user acceptance of information technology. Manag Inf Syst Q 13(3):319–340

68. Chin J, Diehl V, Norman K (1988) Development of an instrument measuring user satisfaction of the human–computer interface. In: Proceedings of SIGCHI 1988, pp 213–218

69. Brooke J (1996) SUS: A "quick and dirty" usability scale. In: Jordan P, Thomas B, Weerdmeester B, McClelland I (eds) Usability evaluation in industry. Taylor & Francis, London, pp 189–194

70. Gediga G, Hamborg K-C, Düntsch I (1999) The IsoMetrics usability inventory: an operationalisation of ISO 9241-10. Behav Inf Technol 18:151–164

71. Lund AM (2001) Measuring usability with the USE questionnaire. In: Usability and user experience 8, STC usability SIG newsletter

72. Ullrich D, Diefenbach S (2010) INTUI. Exploring the facets of intuitive interaction. In: Mensch & computer 2010 interaktive kulturen


73. Naumann A, Hurtienne J, Israel JH, Mohs C, Kindsmüller MC, Meyer HA, Husslein S (2007) Intuitive use of user interfaces: defining a vague concept. In: Harris D (ed) Engineering psychology and cognitive ergonomics, vol 13, Proc HCII 2007. LNAI, vol 4562. Springer, Heidelberg, pp 128–136

74. Polson PG, Lewis C, Rieman J, Wharton C (1992) Cognitive walkthroughs: a method for theory-based evaluation of user interfaces. Int J Man-Mach Stud 36:741–773

75. John BE, Kieras DE (1996) Using GOMS for user interface design and evaluation: which technique? ACM Trans Comput-Hum Interact 3(4):287–319

76. Kieras DE, Meyer DE (1997) An overview of the EPIC architecture for cognition and performance with application to human–computer interaction. Hum-Comput Interact 12(4):391–438

77. Schleicher R, Trösterer S (2009) The 'joy-of-use'-button: recording pleasant moments while using a PC. In: Human–computer interaction—INTERACT 2009, vol 5727/2009. Springer, Heidelberg

78. Hekkert P (2006) Design aesthetics: principles of pleasure in product design. Psychol Sci 48:157–172

79. Isbister K, Nass C (2000) Consistency of personality in interactive characters: verbal cues, non-verbal cues, and user characteristics. Int J Hum-Comput Stud 53(2):251–267

80. Marcus A (1995) Principles of effective visual communication for graphical user interface design. In: Baecker R, Grudin J, Buxton W, Greenberg S (eds) Human-computer interaction: toward the year 2000. Morgan Kaufmann, San Francisco, pp 425–441

81. Mairesse F, Walker M, Mehl M, Moore R (2007) Using linguistic cues for the automatic recognition of personality in conversation and text. J Artif Intell Res 30. doi:10.1613/jair.2349

82. Stelmaszweska H, Fields B, Blandford A (2004) Conceptualising user hedonic experience. In: Reed DJ, Baxter G, Blythe M (eds) Proc ECCE-12, living and working with technology. EACE, York, pp 83–89

83. Tractinsky N, Katz AS, Ikar D (2000) What is beautiful is usable. Interact Comput 13(2):127–145

84. Mahlke S, Lindgaard G (2007) Emotional experiences and quality perceptions of interactive products. In: Jacko JA (ed) Proceedings of the 12th international conference on human–computer interaction: interaction design and usability. Lecture notes in computer science. Springer, Berlin/Heidelberg, pp 164–173

85. Hassenzahl M (2008) Aesthetics in interactive products: correlates and consequences of beauty. In: Schifferstein HNJ, Hekkert P (eds) Product experience. Elsevier, San Diego, pp 287–302

86. Lavie T, Tractinsky N (2004) Assessing dimensions of perceived visual aesthetics of web sites. Int J Hum-Comput Stud 60:269–298

87. Bradley MM, Lang PJ (1994) Measuring emotion: the self-assessment Manikin and the semantic differential. J Behav Ther Exp Psychiatry 25:49–59

88. Huisman G, Van Hout M (2008) The development of a graphical emotion measurement instrument using caricatured expressions: the LEMtool. In: Peter C, Crane E, Fabri M, Agius H, Axelrod L (eds) Emotion in HCI—designing for people: proceedings of the 2008 international workshop. Fraunhofer, Stuttgart, pp 5–7

89. Desmet PMA (2004) Measuring emotions: development and application of an instrument to measure emotional responses to products. In: Blythe MA, Overbeeke K, Monk AF, Wright PC (eds) Funology: from usability to enjoyment. Kluwer Academic, Norwell, pp 111–123

90. Bargas-Avila JA, Hornbæk K (2011) Old wine in new bottles or novel challenges: a critical analysis of empirical studies of user experience. In: Proc CHI 2011, pp 2689–2698

91. Bevan N (1995) Usability is quality of use. In: Proc HCII 1995. Elsevier, Amsterdam, pp 349–354

92. Eurescom Project P.807 Deliverable 1 (1998) Jupiter II—usability, performability and interoperability trials in Europe. European Institute for Research and Strategic Studies in Telecommunications, Heidelberg

93. Davis FD (1993) User acceptance of information technology: system characteristics, user perceptions and behavioral impacts. Int J Man-Mach Stud 38(3):475–487

94. Ajzen I, Fishbein M (1980) Understanding attitudes and predicting social behavior. Prentice Hall, Englewood Cliffs/New York

95. Möller S, Engelbrecht K-P, Kühnel C, Wechsung I, Weiss B (2009) Evaluation of multimodal interfaces for ambient intelligence. In: Aghajan H, López-Cózar Delgado R, Augusto JC (eds) Human-centric interfaces for ambient intelligence. Elsevier, Amsterdam, pp 347–370