Multimodal emotion expressions of virtual agents: Mimic and vocal emotion expressions and their effects on emotion recognition

Benny Liebold & Peter Ohler
Institute for Media Research
Chemnitz University of Technology
Chemnitz, Germany

[email protected]

Abstract— Emotional expressions of virtual agents are widely believed to enhance the interaction with the user by utilizing more natural means of communication. However, as a result of current technology, virtual agents are often only able to produce facial expressions to convey emotional meaning. The presented research investigates the effects of unimodal vs. multimodal expressions of emotions on the users’ recognition of the respective emotional state. We found that multimodal expressions of emotions yield the highest recognition rates. Additionally, emotionally neutral cues in one modality, when presented together with emotionally relevant cues in the other modality, impair the recognition of the correct emotion category as well as of intense emotional states.

Keywords- intelligent agent; emotion expression; emotion recognition; multimodality

I. INTRODUCTION

In our effort to create more naturalistic interfaces, virtual agents (VAs) are widely considered a promising technology for Human-Computer Interaction, because they do not require the user to learn the way an interface operates but utilize naturalistic means of communication [1]. In particular, the emotional intelligence that the user attributes to a VA is believed to further enrich the interaction process between user and VA by increasing the VA’s behavioral realism and thus its believability [2]. Whether this impression of emotional intelligence stems from models of emotion or motivation [3] is not important from the user’s perspective at first glance, because she only perceives the VA’s socially situated expressive behavior, i.e., expressions of emotions. Following this assumption, developers of VA platforms often seek to improve the perceived behavioral realism of VAs by integrating the possibility to communicate emotional states. In face-to-face communication, emotions are typically expressed as multimodal arrangements of nonverbal behavior across several communication channels such as mimic, gestures, and prosody [4]. While current technology provides the necessary means to produce realistic synthetic facial expressions and gestures, we are not yet able to create equally realistic synthetic emotional speech. Although several approaches have been brought forward, synthetic vocal emotion expressions have not yet reached a perceptual quality that is comparable to the highly realistic facial expressions of state-of-the-art VA models [5]. While simple unit-selection approaches can easily achieve acceptable speech quality, only significantly less realistic rule-based speech synthesizers are able to account for the full range of complex voice properties that are necessary to produce authentic emotional speech [6]. Accordingly, developers of virtual agent platforms who aim for a high degree of believability often implement facial expressions but at the same time have to choose whether to integrate emotional speech lacking important speech properties or not to integrate emotional speech at all.

Consequently, research on emotion expressions of virtual agents has mainly focused on separate information channels, with visual cues (mimic, gestures) representing the major part. Research investigated both the recognition and the effects of virtual emotion displays. Regarding emotion recognition, even simple displays of emotion were found to yield good recognition rates above a certain threshold of expression intensity [7]. Krumhuber et al. [8] demonstrated that more authentic facial expressions of VAs based on the Facial Action Coding System (FACS) [9] achieve recognition rates comparable to those of human actors. Regarding effects of emotion expressions on social interaction, facial expressions of emotion by a VA were found to affect participants’ decision making during negotiation [10]. However, facial expressions should be logically matched to the communication scenario in order to avoid negative effects [11]. Additionally, the timing of emotional expressions is believed to strongly moderate the effects of emotional expressions on the interaction partner’s perception of the VA [cf. 2]. Further, de Melo, Kenny, and Gratch [12] demonstrated that the modeling of autonomous emotion indicators, such as respiratory frequency, blushing, and sweating, positively influenced the participants’ ability to identify several emotion categories.

Consequently, visual cues alone have a considerable impact on the agent-user interaction process. Although evidence from nonverbal behavior research in humans suggests that emotion recognition rates benefit from emotional expressions across several modalities [13-15], it could easily be argued that facial expressions of emotion are sufficient to communicate a VA’s emotional state. However, research has so far not systematically investigated the paradoxical situation in which emotional information is conveyed in only one modality while the other modality remains neutral, resulting in a disparity of emotionality between communication channels. Natural expressions of emotion are dynamic and situationally specific multimodal arrangements of expressive behaviors that utilize at least visual and auditory cues. Emotion expressions conveying emotional information in only one modality, with the other one remaining neutral, are relatively unusual in face-to-face communication, at least for sincere expressions of emotions.

TABLE I. EMPLOYED CONDITIONS OF MODALITY

                                   Vocal Emotional Cues
                                   Yes          No
Mimic Emotional Cues   Yes         AV           VN, V0
                       No          AN, A0       C

Note: AV = emotional expression in mimic and voice; VN = emotional mimic and neutral speech; V0 = emotional mimic and no sound; AN = emotional speech and neutral mimic; A0 = emotional speech and no picture; C = neutral mimic and voice, but meaningful speech content.

Related work by Creed and Beale [16] already demonstrated the impact of combined multimodal emotion displays from different emotion categories, suggesting that users tend to integrate conflicting information into a coherent framework in accordance with cognitive dissonance theory [17]. With this paper we seek to close the gap for combined emotional and neutral emotion displays by comparing different strategies of integrating unimodal and multimodal emotion expressions into VAs with regard to facial and auditory cues, with the goal of better understanding how users react to emotion expressions of VAs. We argue that emotionally incongruent modalities impair the user’s ability to correctly recognize the VA’s respective emotional state. This misidentification of emotional states could further result in negative effects on the agent-user interaction process.

In order to understand the relevance of different information channels for emotion recognition, we need to employ a framework that allows us to conceptualize emotion recognition as a process. One convenient approach was presented by Scherer [e.g. 6, 18], who considered the flow of information as a modified Brunswikian lens model. In this techno-communicative metaphor, emotion recognition is the result of four subsequent processes: the emotional state of the communicator, the encoding of emotional information into several communication channels (distal cues), the perception of this information (proximal cues), and the integration of the perceived information into a coherent judgment of the communicator’s emotional state. In the case of VAs, both the emotional state and the encoded emotional information are predetermined. Thus, we can focus on the perceived emotional cues and their integration into a judgment about the VA’s emotional state.
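
The following minimal Python sketch is not part of the original lens model or of this paper; it merely traces the four stages with illustrative placeholder values to make the flow of information explicit.

    from dataclasses import dataclass

    @dataclass
    class DistalCues:      # what the VA encodes into its channels
        mimic: str
        voice: str

    def encode(state: str) -> DistalCues:
        # Stage 1 -> 2: the (predetermined) emotional state is expressed; values are illustrative.
        return DistalCues(mimic=f"{state} face", voice=f"{state} prosody")

    def perceive(cues: DistalCues) -> DistalCues:
        # Stage 2 -> 3: transmission to the observer; channels may arrive degraded or neutral.
        return cues

    def integrate(percepts: DistalCues) -> str:
        # Stage 3 -> 4: the observer combines the perceived cues into one judgment.
        return percepts.mimic.split()[0] if percepts.mimic else percepts.voice.split()[0]

    judgment = integrate(perceive(encode("happiness")))   # -> "happiness"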

In situations in which VAs convey emotional information in only one modality with the other modalities remaining neutral, an appropriate reaction of the user would be to implicitly evaluate the neutral cue as not relevant to the emotional state. However, because human emotions are typically expressed as multimodal arrangements, we are accustomed to the fact that all expressive behavior in humans (i.e., also choosing not to express something) is to some extent motivated by mental states that are related to emotions. We should, therefore, tend to interpret the neutral expression in a single information channel of a virtual agent as a relevant component of the agent’s emotional state.

We argue that in the above-mentioned situations, emotional cues that are presented together with neutral cues should achieve lower recognition rates than the emotional cues alone (hypothesis 1), because such expressive patterns differ systematically from human expressions of emotion. As a possible indicator of conscious problem solving, the reaction times of user decisions should increase when neutral cues are presented together with emotional cues (hypothesis 2), because users would have to integrate conflicting information into their judgment. Finally, in line with previous research [19], multimodal emotion expressions should achieve higher recognition rates than unimodal expressions (hypothesis 3).

II. METHOD

We conducted a laboratory experiment in which German participants were asked to indicate the emotional states of a VA that were presented via several video clips. We recruited 88 participants, of which 5 were excluded because of missing values or misunderstood instructions, resulting in N = 83 participants (58 female; age: M = 24.31, SD = 7.76). They received course credits for their participation.

We varied the VA’s modality of expressive behavior in a modified 2×2 within-subjects design with the presence or absence of emotional cues in mimic and prosody as within-subjects factors. This approach has already been used by Bänziger, Grandjean, and Scherer [19] in the Multimodal Emotion Recognition Test (MERT). They presented short video clips of real actors with mimic (no sound), vocal (no video), or multimodal (both) emotion expressions as well as pictures of the respective expressions’ apex, i.e., the point of the strongest expression. The vocal expressions did not contain any meaningful content in order to isolate the effect of vocal emotion expressions from context influences. MERT further differentiates five emotions at two intensity levels, resulting in a 4 (modality) × 5 (type of emotion) × 2 (intensity) within-subjects design. The resulting stimuli have to be rated according to their emotion quality and intensity by choosing one out of ten emotion descriptions.

We used a similar design but slightly modified the modality conditions: We changed the picture condition into a context condition (denoted as C) that contained neutral mimic and neutral vocal expressions but, unlike the other conditions, used meaningful speech content. Further, we added a new condition for each case in which only one modality contained emotional cues: Each of the two unimodal emotion expressions was presented either together with neutral expressions in the other modality (AN, VN) or without the other modality (i.e., no video or no sound; A0, V0). The multimodal condition with both mimic and vocal expressions (AV) remained the same (Table I).
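
For illustration, a minimal Python sketch (ours, not from the paper) that enumerates the stimulus set implied by this design; the condition labels follow Table I:

    from itertools import product

    modalities  = ["AV", "VN", "V0", "AN", "A0", "C"]                    # Table I
    emotions    = ["anger", "disgust", "sadness", "fear", "happiness"]
    intensities = ["low", "high"]

    # 6 modality conditions x 5 emotions x 2 intensities = 60 video clips
    stimuli = [{"modality": m, "emotion": e, "intensity": i}
               for m, e, i in product(modalities, emotions, intensities)]
    assert len(stimuli) == 60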

A. Stimulus Materials

Video clips were based on an animated VA from the computer game Half Life 2 [20]. The game engine’s ability to display authentic mimic expressions is based on FACS [9], which allows facial parameters (action units) to be manipulated according to reported research results. We used one of the game’s female main characters (Alyx) as the enacting VA, because it was attractive and the most sophisticated animated model in terms of facial expressions. Empirically derived action unit parameters for the emotions were drawn from the same FACS-coded videos that have been used in MERT [4]. We then created identical facial expressions at different intensity levels and chose appropriate video clips for low and high emotion intensities via a pretest (Fig. 1).

Figure 1. Facial expressions used in the study (low intensity: happiness, cold anger, sadness, disgust, fear; high intensity: elation, hot anger, despair, contempt, panic). Emotion categories were adopted from MERT [19].

TABLE II. TRANSLATED GERMAN CONTEXT SENTENCES AND MEANINGLESS SENTENCES USED IN THE STUDY

Emotion        Low intensity                                     High intensity
anger          He has been unfriendly to me.                     Tim intentionally broke my cellphone.
disgust        There is rotten meat in the fridge.               He has been sentenced to lifelong prison.
sadness        The storm demolished my garden.                   My father suddenly died in a traffic accident.
fear           There are said to be wolves in the forest again.  The plane’s engines failed.
happiness      I received a present.                             I won a lot of money in the lottery.
meaningless 1  Hät sandig pron you venzy.
meaningless 2  Fee gött laich jonkill gosterr.

As stated above, current technology does not provide the necessary means to fully synthesize emotional voice samples in an appropriate and authentic way. We therefore asked moderators of the local university radio station to perform the necessary voice samples. They were instructed to perform two meaningless sentences that were used in MERT. Both were reported to be recognized as a foreign language. Four female moderators were recruited to perform both sentences several times for all emotion × intensity combinations as well as ten context sentences that implied attributions consistent with specific emotions according to cognitive emotion theories [e.g. 21]. The translations of the German sentences are presented in Table II. We then selected the most appropriate speaker in a pretest. The facial animations and the voice samples were integrated, and manual lip synchronization was applied. Only the apex of each facial expression was presented in the video clips; thus, no facial dynamics were modeled. From these video clips, which represent the AV condition, the video clips for the conditions A0 and V0 were derived by removing the video or the audio stream, respectively. For the conditions AN, VN, and C, the video stream (AN), the audio stream (VN), or both streams (C) were substituted with neutral expressions. The resulting video clips lasted between 3 s and 4 s and were presented via E-Prime 2.0 [22], which allowed us to log the participants’ responses and reaction times.
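
Schematically, the derivation of the remaining conditions from one AV master clip can be sketched as below. This is our illustration only; the Clip type, the stream labels, and the substitution operations are hypothetical stand-ins for the manual editing that was actually performed.

    from dataclasses import dataclass, replace
    from typing import Optional

    @dataclass(frozen=True)
    class Clip:
        video: Optional[str]   # label of the video stream; None = no picture
        audio: Optional[str]   # label of the audio stream; None = no sound

    def derive_conditions(av: Clip, neutral_face: str, neutral_voice: str) -> dict:
        # Derive the modality conditions of Table I from one multimodal master clip.
        return {
            "AV": av,                                # emotional mimic and emotional voice
            "V0": replace(av, audio=None),           # emotional mimic, no sound
            "A0": replace(av, video=None),           # emotional voice, no picture
            "VN": replace(av, audio=neutral_voice),  # emotional mimic, neutral speech
            "AN": replace(av, video=neutral_face),   # emotional voice, neutral mimic
            # C clips combined neutral mimic and voice with meaningful sentence content
        }

    conditions = derive_conditions(
        Clip(video="alyx_anger_high_video", audio="anger_high_voice"),
        neutral_face="alyx_neutral_video", neutral_voice="neutral_voice")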

B. Procedure

Participants took part in group tests of up to eight persons using headphones and computer stations that were visually separated from one another. After a brief introduction, participants received two questionnaires for administrative purposes and two questionnaires that were employed to assess inter-individual differences. After completing the questionnaires, participants were introduced to the computer-based test and were given a speed-accuracy instruction. The program first explained the test and then presented four practice items so that the participants could become familiar with the testing procedure. The participants were then asked to indicate the VA’s emotional state for 6 (modality) × 5 (type of emotion) × 2 (intensity) = 60 video clips via mouse click. The cursor was centered between the answer boxes after each click to provide roughly equal distances for the reaction time measures. To avoid order effects, the video clips were presented in one of two randomized orders, with the constraint that no emotion or modality could appear in two subsequent video clips. The two orders did not affect classification performance, t < 0.01, n.s.
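
The ordering constraint (no emotion and no modality repeated in two subsequent clips) can be realized, for example, with a greedy shuffle and restarts. The sketch below is ours and only illustrates the constraint; it is not the procedure implemented in E-Prime. The stimuli argument corresponds to the list of 60 clips sketched in Section II.

    import random

    def constrained_order(stimuli, max_restarts=1000):
        """Random order in which consecutive clips never share emotion or modality."""
        for _ in range(max_restarts):
            remaining = list(stimuli)
            random.shuffle(remaining)
            order = [remaining.pop()]
            while remaining:
                candidates = [s for s in remaining
                              if s["emotion"] != order[-1]["emotion"]
                              and s["modality"] != order[-1]["modality"]]
                if not candidates:
                    break                     # dead end: restart with a fresh shuffle
                nxt = random.choice(candidates)
                remaining.remove(nxt)
                order.append(nxt)
            if not remaining:
                return order
        raise RuntimeError("No valid order found within the restart limit.")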

III. RESULTS

Recognition accuracy is reported for individual emotion recognition (emotion and intensity correct) and family recognition (emotion correct), as suggested by [19], to account for wrong judgments due to intensity misclassification. Alpha levels for all statistical analyses were set at .05. For effects of repeated-measures ANOVAs violating the sphericity assumption, Greenhouse-Geisser-corrected degrees of freedom are reported.
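
As an aside that may help readers relate the reported effect sizes to the test statistics, partial eta squared can be recovered from an F value and its degrees of freedom via SS_effect / (SS_effect + SS_error) = (F × df_effect) / (F × df_effect + df_error). A minimal Python check (ours) against the first effect reported below:

    def partial_eta_squared(f_value, df_effect, df_error):
        """Partial eta squared computed from an F ratio and its degrees of freedom."""
        return (f_value * df_effect) / (f_value * df_effect + df_error)

    # Modality main effect: F(4.18, 342.98) = 89.07  ->  partial eta squared of about .52
    print(round(partial_eta_squared(89.07, 4.18, 342.98), 2))   # 0.52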

A. Recognition Accuracy – Individual Emotion Recognition

To analyze participants’ individual emotion recognition performance, we employed a three-way repeated-measures ANOVA with modality, type of emotion, and intensity as within-subjects factors. We found significant main effects for all factors. First, modality significantly affected participants’ performance, F(4.18, 342.98) = 89.07, p < .001, ηp² = .52. Planned comparisons revealed that authentic displays of emotions yielded higher classification performance than inferences about emotions from context information (C), F(1, 82) = 71.99, p < .001, ηp² = .59. Also, multimodal expressions of emotions (AV) were recognized significantly better than unimodal expressions, F(1, 82) = 225.98, p < .001, ηp² = .73. We further found that unimodal displays of emotion are recognized better when no parallel neutral information channel is presented (i.e., A0 and V0), F(1, 82) = 78.99, p < .001, ηp² = .49. The latter effect was consistent for individual contrasts for both AN vs. A0 and VN vs. V0. Additionally, auditory cues were classified better than visual cues, F(1, 82) = 46.37, p < .001, ηp² = .36. The recognition rates for A0 and V0 were well above chance level, indicating a reasonable fit of our stimulus selection. The main effect for emotion indicated that the different emotion categories affected participants’ recognition performance, F(4, 328) = 90.04, p < .001, ηp² = .52. Contrasts revealed that especially disgust was recognized less accurately compared to the other emotions, F(1, 82) = 204.21, p < .001, ηp² = .71. The confusion matrix (Table III) revealed that disgust was often misinterpreted as anger, whereas for the other emotions mainly problems with identifying the correct emotion intensity appeared. However, disgust portrayals significantly exceeded chance level (.20) when only the emotion category is analyzed, M = .49, t = 14.75, p < .001. The significant main effect of intensity indicated that emotions were better recognized at their correct intensities in the low-intensity condition, F(1, 82) = 83.38, p < .001, ηp² = .50.

We further found all interaction terms to yield significant results. Due to our research question, we only report the modality × intensity interaction, F(5, 328) = 49.32, p < .001, ηp² = .38, indicating that the pattern of the modality main effect reported above differed between the presentation of low vs. high intensity emotions. As a follow-up analysis, we conducted separate repeated-measures ANOVAs for low and high emotion intensity. We found that the described effects of modality only remained stable for the high-intensity condition. For low emotion intensities, authentic emotion expressions did not differ from context information (C), F(1, 82) = 0.03, n.s., but multimodal expressions still yielded slightly better recognition rates, F(1, 82) = 25.16, p < .001, ηp² = .24. Neutral cues no longer affected recognition rates of unimodal expressions of low intensity, F(1, 82) = 1.64, n.s. Further results from the analysis can be obtained from Fig. 2.

TABLE III. CONFUSION MATRIX AVERAGED FOR ALL MODALITIES

                               Emotion Display in Video Clip
Participants’        anger       disgust     sadness     fear        happiness
Response            low  high   low  high   low  high   low  high   low  high
anger      low       49   32     23   22      8    1      8    6      3    1
           high       9   50      2   14      0    0      0    0      0    0
disgust    low        3    2     28   20      0    0      1    1      0    0
           high      20   12     26   26      0    1      4    2      2    1
sadness    low        9    3      8    5     73   47     11    6      2    2
           high       1    1      3    4     16   42      9   13      0    2
fear       low        2    1      7    3      1    6     51   32      0    3
           high       0    0      1    0      0    2      3   39      0    7
happiness  low        6    1      2    4      1    1     11    2     79   36
           high       0    0      0    0      0    0      0    0     13   47

Note: Values are rounded. Correct classifications are those where both emotion and intensity match the display; cells with the correct emotion but wrong intensity reflect intensity misclassifications.

Fig. 2. Estimated marginal means for the three-way interaction modality × emotion × intensity for individual emotion recognition (separate panels for low and high intensity).
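
The link between individual and family recognition can be illustrated directly on Table III: summing the two intensity rows and columns of each emotion collapses the 10 × 10 matrix into a 5 × 5 family-level confusion matrix. The following Python sketch is ours; the array simply restates the rounded values of Table III.

    import numpy as np

    # Rows = participants' responses, columns = displayed emotions,
    # each in the order anger-low, anger-high, ..., happiness-high (rounded values).
    confusion = np.array([
        [49, 32, 23, 22,  8,  1,  8,  6,  3,  1],
        [ 9, 50,  2, 14,  0,  0,  0,  0,  0,  0],
        [ 3,  2, 28, 20,  0,  0,  1,  1,  0,  0],
        [20, 12, 26, 26,  0,  1,  4,  2,  2,  1],
        [ 9,  3,  8,  5, 73, 47, 11,  6,  2,  2],
        [ 1,  1,  3,  4, 16, 42,  9, 13,  0,  2],
        [ 2,  1,  7,  3,  1,  6, 51, 32,  0,  3],
        [ 0,  0,  1,  0,  0,  2,  3, 39,  0,  7],
        [ 6,  1,  2,  4,  1,  1, 11,  2, 79, 36],
        [ 0,  0,  0,  0,  0,  0,  0,  0, 13, 47],
    ])

    # Collapse the intensity distinction: sum each 2x2 block.
    family = confusion.reshape(5, 2, 5, 2).sum(axis=(1, 3))
    print(family.diagonal())   # family-level hits per displayed emotion (summed rounded %)

For disgust, for example, the collapsed diagonal value corresponds to a family recognition rate of about .50, close to the M = .49 reported above.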

B. Recognition Accuracy – Family Recognition

To analyze participants’ family recognition performance, we employed a three-way repeated-measures ANOVA with modality, type of emotion, and intensity as within-subjects factors. Modality significantly affected the ability to recognize the correct emotion family, F(3.80, 311.52) = 33.42, p < .001, ηp² = .29. We again applied planned contrast analyses and found results similar to those for individual emotion recognition: Authentic emotion expressions led to higher recognition rates than mere context information, F(1, 82) = 31.30, p < .001, ηp² = .28, multimodal emotion expressions achieved higher recognition rates compared to unimodal expressions, F(1, 82) = 162.33, p < .001, ηp² = .66, and again, neutral cues impaired the recognition of unimodal emotional cues, F(1, 82) = 20.56, p < .001, ηp² = .20, an effect that remained consistent within both modalities. Overall, visual cues were recognized better than auditory cues, F(1, 82) = 11.94, p < .001, ηp² = .13. The main effect of emotion also yielded significant results, F(2.86, 234.36) = 237.12, p < .001, ηp² = .74, again with disgust deviating the most from the other emotions in terms of recognition rates, F(1, 82) = 320.20, p < .001, ηp² = .80. The intensity of the displayed emotions did not significantly affect the recognition of emotion families, F(1, 82) = 0.08, n.s.

All interaction terms, including the three-way interaction, were found to have a significant impact on family recognition rates. However, due to our research questions, we again focus only on the modality × intensity interaction, F(5, 410) = 13.23, p < .001, ηp² = .14, which indicates that recognition patterns between modalities differ for low vs. high intensities. As a follow-up analysis, we conducted two separate repeated-measures ANOVAs for low and high intensity expressions of emotions. The above-mentioned main effects of modality remained stable at both levels of intensity. Further results from the analysis can be obtained from Fig. 3.

Fig. 3. Estimated marginal means for the three-way interaction modality × emotion × intensity for family recognition (separate panels for low and high intensity). Note: The low recognition rate in condition C for low-intensity sadness reflects the fact that participants tended to infer low-intensity anger.

Fig. 4. Judgments of correct type of emotion and correct intensity (corr) compared to correct type of emotion but wrong intensity (false) for both groups of unimodal presentation (N = neutral cues; 0 = no neutral cue).

C. Reaction Times

To analyze participants’ reaction times, we employed a three-way repeated-measures ANOVA with modality, type of emotion, and intensity as within-subjects factors. Due to space constraints and our research focus, we only report the main effect for modality: Modality significantly affected participants’ reaction times, F(3.39, 277.80) = 67.81, p < .001, ηp² = .45. Planned contrasts revealed that context information required significantly longer to be recognized compared to authentic emotion expressions, F(1, 82) = 117.54, p < .001, ηp² = .59. Multimodal expressions were recognized significantly faster than unimodal expressions, F(1, 82) = 44.02, p < .001, ηp² = .35. Participants required significantly more time to identify unimodal emotion cues when presented together with neutral cues, F(1, 82) = 86.98, p < .001, ηp² = .52. The latter effect did not differ between the two levels of intensity, F(1, 82) = 2.36, n.s. Interestingly, isolated facial expressions of emotions were recognized significantly faster than vocal expressions, F(1, 82) = 17.73, p < .001, ηp² = .18, but the two modalities did not differ in their respective reaction times when presented together with neutral information channels, F(1, 82) = 2.20, n.s.

IV. DISCUSSION

The aim of the presented research was to systematically compare emotion recognition from multimodal vs. unimodal emotion expressions in order to further understand user reactions towards a VA’s expressions of emotions. We found that neutral cues impaired the recognition of unimodal emotion expressions (hypothesis 1). Although the results from the family recognition analysis support hypothesis 1, the analysis of individual emotion recognition suggests that not only identifying the correct type of emotion but also identifying an emotion’s intensity is impaired by neutral cues: Further inspection of the confusion matrices suggested that participants judged the VA’s emotional state to be systematically less intense when neutral cues were presented together with high-intensity expressions (see Fig. 4). One possible explanation would be that human expressions of intense emotions are usually accompanied by marked behavior in all modalities and cannot be expressed sufficiently in only one modality when both modalities are presented. We also found that participants required significantly more time to classify unimodal emotional cues that are presented together with neutral cues (hypothesis 2). While these findings could be interpreted as first evidence of increased cognitive effort due to the necessary integration of both modalities, as suggested for mismatched emotion expressions by Creed and Beale [16], the reaction times are confounded with the participants’ perceived certainty of their judgment. Thus, further research is required in order to understand deviations of emotional expressions of VAs from face-to-face communication. Finally, we found that multimodal expressions of emotions yielded the highest recognition rates (hypothesis 3), which replicates the findings from other research on multimodal emotion recognition [19]. We did not yet systematically investigate interactions of our findings with specific emotions. However, this task should yield fruitful findings because emotions differ in their dependence on specific modalities of expression. Therefore, neutral cues should only affect certain unimodal emotion expressions.

The impairment effect of neutral cues should decrease the longer the user interacts with a VA that is capable of unimodal emotion expressions, because the user should get accustomed to the way a VA expresses emotional states. In our experiment the effect was artificially amplified because with each video clip the way our VA expressed emotions changed. Additionally, our findings only account for prototypic categories of emotion. Thus, the effect on more ecologically valid forms of interaction remains to be investigated.

The impairment effect of neutral cues suggests that developers aiming for high believability of VAs designed for short-term interaction should implement emotion expressions across several modalities in order to avoid misclassifications of emotion categories by users. Especially in entertainment contexts, where intense emotional reactions are produced more frequently, VAs should be capable of producing multimodal emotion expressions. However, developers would have to forego a certain degree of their VA’s believability by using emotional speech modulations instead of high-quality speech samples in order to achieve appropriate user perceptions of emotional reactions. Still, we assume that even less accurate vocal expressions of emotion should facilitate recognition rates; this assumption would have to be tested in further research.

ACKNOWLEDGMENT

We thank Tanja Bänziger, Didier Grandjean, and Klaus Scherer for the opportunity to use the Multimodal Emotion Recognition Test as well as Oliver Rossett for his support with the test handling. We also thank Georg Valtin and Daniel Pietschmann for their continued feedback. The presented research was funded by the German Research Foundation as part of the postgraduate program “Connecting Real and Virtual Social Worlds” under grant #1780.

REFERENCES

[1] J. Cassell, E. Churchill, S. Prevost, and J. Sullivan, Embodied conversational agents. Cambridge, MA: MIT Press, 2000.
[2] F. D. Schönbrodt and J. B. Asendorpf, "The challenge of constructing psychologically believable agents," J. Media Psychology, vol. 23, pp. 100-107, Apr. 2011.
[3] N. C. Krämer, I. A. Iurgel, and G. Bente, "Emotion and motivation in embodied conversational agents," presented at the AISB'05 Conv., Symp. on Agents that Want and Like: Motivational and Emotional Roots of Cognition and Action, Hatfield, UK, 2005.
[4] K. R. Scherer and H. Ellgring, "Multimodal expression of emotion: Affect programs or componential appraisal patterns?," Emotion, vol. 7, pp. 158-171, Feb. 2007.
[5] M. Schröder, F. Burkhardt, and S. Krstulovic, "Synthesis of emotional speech," in Blueprint for affective computing: A sourcebook, K. R. Scherer, T. Bänziger, and E. B. Roesch, Eds. New York, NY: Oxford Univ. Press, 2010, pp. 222-231.
[6] P. N. Juslin and K. R. Scherer, "Vocal expression of affect," in The new handbook of methods in nonverbal behavior research, J. A. Harrigan, R. Rosenthal, and K. R. Scherer, Eds. Oxford, UK: Oxford Univ. Press, 2005, pp. 65-135.
[7] J. M. Beer, A. D. Fisk, and W. A. Rogers, "Emotion recognition of virtual agents facial expressions: The effects of age and emotion intensity," in Proc. Human Factors and Ergonomics Society Annu. Meeting, San Antonio, TX, 2009, pp. 131-135.
[8] E. G. Krumhuber, L. Tamarit, E. B. Roesch, and K. R. Scherer, "FACSGen 2.0 animation software: Generating three-dimensional FACS-valid facial expressions for emotion research," Emotion, vol. 12, pp. 351-363, Apr. 2012.
[9] P. Ekman, W. Friesen, and J. Hager, Facial Action Coding System, 2nd ed. Salt Lake City, UT: Research Nexus, 2002.
[10] C. M. de Melo, P. Carnevale, and J. Gratch, "The effect of virtual agents' emotion displays and appraisals on people's decision making in negotiation," in Intelligent Virtual Agents, 12th Int. Conf., Santa Cruz, CA, Y. Nakano, M. Neff, A. Paiva, and M. Walker, Eds. New York, NY: Springer, 2012, pp. 53-66.
[11] D. C. Berry, L. T. Butler, and F. de Rosis, "Evaluating a realistic agent in an advice-giving task," Int. J. Human-Computer Studies, vol. 63, pp. 304-327, May 2005.
[12] C. M. de Melo, P. Kenny, and J. Gratch, "Influence of autonomic signals on perception of emotions in embodied agents," Appl. Artificial Intell., vol. 24, pp. 494-509, Jul. 2010.
[13] C. Regenbogen, D. A. Schneider, A. Finkelmeyer, N. Kohn, B. Derntl, T. Kellermann, et al., "The differential contribution of facial expressions, prosody, and speech content to empathy," Cognition and Emotion, vol. 26, pp. 995-1014, Jan. 2012.
[14] B. Kreifelts, T. Ethofer, W. Grodd, M. Erb, and D. Wildgruber, "Audiovisual integration of emotional signals in voice and face: An event-related fMRI study," Neuroimage, vol. 37, pp. 1445-1456, Oct. 2007.
[15] B. de Gelder and J. Vroomen, "The perception of emotions by ear and by eye," Cognition & Emotion, vol. 14, pp. 289-311, Aug. 2000.
[16] C. Creed and R. Beale, "Psychological responses to simulated displays of mismatched emotional expressions," Interacting with Comput., vol. 20, pp. 225-239, Mar. 2008.
[17] L. Festinger, A theory of cognitive dissonance. Stanford, CA: Stanford University Press, 1957.
[18] K. R. Scherer, "Personality inference from voice quality: The loud voice of extroversion," European J. Social Psychology, vol. 8, pp. 467-487, 1978.
[19] T. Bänziger, D. Grandjean, and K. R. Scherer, "Emotion recognition from expressions in face, voice, and body: The Multimodal Emotion Recognition Test (MERT)," Emotion, vol. 9, pp. 691-704, Oct. 2009.
[20] Valve, "Half Life 2," Washington, USA: Vivendi Universal, 2004, video game.
[21] A. Ortony, G. L. Clore, and A. Collins, The cognitive structure of emotions. Cambridge, UK: Cambridge University Press, 1990.
[22] Psychology Software Tools, "E-Prime 2.0," Sharpsburg, PA, 2008, software.
