CONTEMPORARY MULTIMODAL DATA COLLECTION METHODOLOGY FOR RELIABLE INFERENCE OF AUTHENTIC SURPRISE

Jordan Edward Shea, Cecilia Ovesdotter Alm, Reynold Bailey

Rochester Institute of Technology
{jes7923∗|coagla∗|rbj†}@{rit.edu∗|cs.rit.edu†}

ABSTRACT

The need for intelligent systems that can understand and convey human emotional expression is becoming increasingly prevalent. Unfortunately, most datasets for developing such systems rely on acted or exaggerated emotions, or utilize subjective labels obtained from possibly unreliable sources. This paper reports on an innovative data collection methodology for capturing multimodal human signals of authentic surprise. We introduce two tasks with a facilitator to elicit genuine reactions of surprise while co-collecting data from three human modalities: speech, facial expressions, and galvanic skin response. Our work highlights the methodological potential of biophysical measurement-based validation for enabling reliable inference. A case study is presented which provides baseline results for Random Forest classification. Using features gathered from the three modalities, our baseline system is able to identify surprise instances with approximately 20% absolute increase in accuracy compared to random assignment on a balanced dataset.

1. INTRODUCTION

As the use of artificial intelligence in collaborative contexts with humans becomes more prominent, it is critical that multimodal data collection strategies effectively support these systems. Data utilized in affective corpora often involve acted or extreme emotions, focusing on contrastive emotional behaviors and not on increasing the breadth or depth of representative data for specific affective states. Additionally, most automated classification work involving behavioral data utilizes subjective labels obtained either from interpretation by third-party annotators or from self-report.

This paper makes a focused contribution to methodology in two respects. First, we report on a data collection methodology for capturing multimodal human signals of authentic surprise, which features the use of games to elicit genuine emotions in an innovative procedure (see Figure 1). Second, our work highlights the potential of galvanic skin response as a means of objectively validating subjective and interpretative verbal, prosodic, and facial cues.

(a) Facilitator and subject (b) View of computer game

Fig. 1: A trained facilitator engaged in games with a subject volunteer who was unaware of the experiment's intention to elicit surprised reactions until post-experiment debriefing.

2. RELATED WORK

While often considered a fundamental emotion [1, 2], surprise is characterized by an intricate complexity. Surprised reactions can be positive (amazement, astonishment), negative (disappointment, shock), or of ambiguous sentiment (bewilderment, perplexion) [3]. It is challenging to capture genuine reactions, as surprise is a fleeting state and is difficult to reproduce within a short period of time [4].

As for other affective classification problems, making inferences about and detecting surprise involves automating understanding of a multimodal activity [5]. While there exist popular affective datasets that are generated by human actors [6], studies that elicit acted data often have to take special care in instructing actors how to behave [7]. There are concerns that "acted datasets are recorded under very constrained conditions and resulting expressions are exaggerated" [8].

While it may at times be possible to identify surprise in a unimodal format, such as by relying on prosodic and lexical cues via speech [9], a multidimensional sensing approach is beneficial given that unimodal cues are also used to convey other forms of expression. Unlike other studies interested in eliciting emotions firsthand [10], our approach includes the capture of galvanic skin response (GSR) in addition to facial expressions and speech. Our goal is not only to corroborate the findings that others have had on the distinguishability of surprise [11, 12], but also to explore additional indicators of surprise possibly hidden among these three modalities.

978-1-7281-0255-9/18/$31.00 © 2018 IEEE


(a) The Forest Temple - Cooperative computer game task (b) War - Competitive card game task

Fig. 2: Experimental setup of the computer and card game tasks, featuring a trained facilitator. Subject data is collected via a head-worn Shure BETA 54 microphone, a Shimmer3 GSR sensor worn on the central fingers of the non-dominant hand, and two standard webcams capturing the subject's face and the environment scene. An audio signal is used to time-synchronize speech data with the GSR and facial data.
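As a rough illustration of this kind of audio-based alignment, the sketch below locates a known marker tone in a recording by cross-correlation and returns its offset in seconds, so the other streams can be shifted onto a shared timeline. This is a minimal sketch under assumed inputs (hypothetical file names, equal sampling rates), not the tooling used in the study.

```python
# Sketch: estimate where a known synchronization marker occurs in a recording
# using cross-correlation. File names are hypothetical placeholders.
import numpy as np
from scipy.io import wavfile
from scipy.signal import correlate

def marker_offset_seconds(recording_path, marker_path):
    sr_rec, rec = wavfile.read(recording_path)
    sr_mark, marker = wavfile.read(marker_path)
    assert sr_rec == sr_mark, "resample first if the sampling rates differ"
    rec = rec.astype(np.float64)
    marker = marker.astype(np.float64)
    if rec.ndim == 2:                      # mix down stereo to mono
        rec = rec.mean(axis=1)
    if marker.ndim == 2:
        marker = marker.mean(axis=1)
    # The lag of the correlation peak is where the marker best matches.
    corr = correlate(rec, marker, mode="valid")
    return int(np.argmax(corr)) / sr_rec

# offset_s = marker_offset_seconds("session01_audio.wav", "sync_marker.wav")
```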

3. DATA COLLECTION

Data is collected in an IRB-approved experiment by having subjects participate in two 10-minute activities with a facilitator. The first is a collaborative computer game titled Fireboy & Watergirl in the Forest Temple [13], where the subject and facilitator must work together to navigate levels as quickly as possible (setup in Figure 2a). The second is the well-known card game War, in which the subject and facilitator compete for each other's cards (Figure 2b). To increase emotional commitment, subjects are told they will be given a chocolate bar for each game that they win.

Half of the subjects begin with the computer game task followed by the card game task; the other half complete the tasks in the reverse order. Both games feature pre-planned events intended to elicit surprise from the subject. In the computer game, the facilitator is uncooperative (giving false directions, purposefully failing the level, etc.), whereas in the card game, the facilitator engages in cheating and the deck is set up to trigger unlikely card sequences (triple-war scenarios where many cards are at stake, etc.).

For this study, we rely on two methods of annotation. For the computer game activity, annotation was done in real time by the facilitator discreetly pressing a key at each point where an action was taken to elicit surprise. By using this approach, we avoid the possibility of third-party annotator bias. The more conventional approach of post-hoc annotation is still applied to the card game, but is not examined within this paper.

Speech: Subjects wear a headset microphone attached to a high-quality TASCAM DR-100 MK III audio recorder. Speech data is synchronized with other modalities using a predefined salient audio signal. After experimentation, speech data is transcribed with IBM Watson's ASR system [14], and then time-aligned with the Montreal Forced Aligner [15]. Prosodic feature extraction is done using Praat [16, 17].
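To make the prosodic side of this pipeline concrete, the following sketch extracts pitch and intensity summaries for a 6-second window around a logged event using the praat-parselmouth Python bindings to Praat. The study itself uses Praat directly, so the specific calls, file name, and event time here are illustrative assumptions rather than the paper's implementation.

```python
# Sketch: pitch (Hz) and intensity (dB) features for a 6-second window
# centred on a logged surprise event, via praat-parselmouth.
import numpy as np
import parselmouth

def prosody_window(wav_path, event_time_s, half_window_s=3.0):
    snd = parselmouth.Sound(wav_path)
    start = max(0.0, event_time_s - half_window_s)
    end = min(snd.duration, event_time_s + half_window_s)
    clip = snd.extract_part(from_time=start, to_time=end)

    pitch = clip.to_pitch()
    f0 = pitch.selected_array["frequency"]
    f0 = f0[f0 > 0]                          # drop unvoiced frames (0 Hz)

    intensity = clip.to_intensity()
    db = intensity.values.flatten()

    return {
        "pitch_mean_hz": float(np.mean(f0)) if f0.size else np.nan,
        "pitch_max_hz": float(np.max(f0)) if f0.size else np.nan,
        "intensity_mean_db": float(np.mean(db)),
        "intensity_max_db": float(np.max(db)),
    }

# features = prosody_window("subject01.wav", event_time_s=132.4)
```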

Facial Expressions: Subjects are monitored by two webcams, one focused on the subject's face and one on the scene. We use the iMotions software platform to capture and process the facial data [18].

Galvanic Skin Response: A Shimmer3 wearable sensor is placed on the middle and index fingers of the subject's non-dominant hand [19] and linked to iMotions for calibration of the GSR sensor and synchronization with the subject's facial expressions. The sensor monitors the subject's skin conductance (sweat levels) by passing small continuous electrical signals, with calibrated readings reported as resistance in kOhms.

Subjects: Forty-one subjects (university age) completed the study, yielding approximately 820 minutes of data for the computer and card activities combined. Data from five subjects were excluded for reasons including poor face tracking, extended periods of track loss, and insufficient live markers, leaving data for 8 women and 28 men.
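For readers more accustomed to skin conductance than resistance, a calibrated GSR value in kOhms can be converted to conductance in microsiemens as 1000 / kOhms. The helper below is a small illustrative sketch; the column names and sampling period are assumptions, not the actual iMotions export schema.

```python
# Sketch: convert calibrated GSR resistance (kOhms) to conductance (uS)
# and lightly smooth it. Column names are assumed placeholders.
import pandas as pd

def gsr_to_microsiemens(df, kohm_col="GSR_CAL_kOhms", window_ms=500, period_ms=10):
    out = df.copy()
    # Conductance (uS) = 1e6 / resistance (Ohms) = 1000 / resistance (kOhms).
    out["gsr_uS"] = 1000.0 / out[kohm_col]
    # Rolling-mean smoothing, with the window expressed in samples.
    samples = max(1, window_ms // period_ms)
    out["gsr_uS_smooth"] = out["gsr_uS"].rolling(samples, min_periods=1).mean()
    return out
```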

4. RESULTS AND DISCUSSION

After initial exploration, the dataset of the 36 subjects retained after pruning was split into approximately 80% for the development set (29 subjects) and 20% for the final test set (7 subjects). Dataset analysis is done on the development set to ensure that the final test set is entirely held out.

Given that our dataset is more modest in size than crowdsourced or unimodal datasets, we report binary classification results on a dataset with balanced classes using a Random Forest classifier. This approach reveals each feature's relative importance (standard to the scikit-learn framework [20]), helping to indicate the usefulness of each modality.

Data Processing: Captured data was aggregated into 6-second windows as units of analysis, obtained by taking the 3 seconds before and after each point where the facilitator logged an action intended to cause surprise.


Fig. 3: Spectrogram of a 6-second window depicting raised intensity (dB) and pitch (Hz) (indicated by the yellow and blue lines, respectively), verbal markers (I actually got it, too easy... huh, what?), and exaggeration of phones (for example at the end of what). The red vertical dashed line indicates the moment when the facilitator expected an instance of surprise to occur. A notable observation is that a rise in speech intensity occurs at approximately the same time as a drop in the GSR signal.

(a) Facial reaction (b) GSR reaction

Fig. 4: Surprised facial expressions highlight lip and eye movements during the computer game. Galvanic skin response drops for the 6-second window of surprise (kOhms; a drop signals a reaction). The central line in the GSR plot indicates the moment when the facilitator expected an instance of surprise to occur, which reflects the speech trend in Figure 3.

Within each window, 30 different facial attributes were recorded at millisecond intervals by iMotions, along with GSR measurements, speech pitch, and speech intensity. These attributes could then be aggregated over the defined window in different ways, such as by taking the minimum, maximum, mean, median, standard deviation, or slope. Three additional features (GSR peaks, speech shimmer, and speech jitter [21]) could not be broken down using these aggregation techniques, resulting in approximately 200 features.
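A minimal sketch of this windowed aggregation is given below, assuming the synchronized signals sit in a pandas DataFrame with a shared time column; the column names and signal list are placeholders rather than the exact iMotions and Praat outputs.

```python
# Sketch: aggregate each signal over a 6-second window centred on a logged
# surprise event, using min/max/mean/median/std plus a least-squares slope.
import numpy as np
import pandas as pd

def slope(values, times):
    if len(values) < 2:
        return np.nan
    return np.polyfit(times, values, deg=1)[0]   # rate of change per second

def window_features(df, event_time_s, signals, half_window_s=3.0, time_col="t"):
    win = df[(df[time_col] >= event_time_s - half_window_s) &
             (df[time_col] <= event_time_s + half_window_s)]
    feats = {}
    for col in signals:
        vals = win[col].dropna()
        t = win.loc[vals.index, time_col]
        feats[f"{col}_min"] = vals.min()
        feats[f"{col}_max"] = vals.max()
        feats[f"{col}_mean"] = vals.mean()
        feats[f"{col}_median"] = vals.median()
        feats[f"{col}_std"] = vals.std()
        feats[f"{col}_slope"] = slope(vals.to_numpy(), t.to_numpy())
    return feats

# feats = window_features(synced_df, event_time_s=132.4,
#                         signals=["gsr_kohms", "pitch_hz", "intensity_db", "cheek_raise"])
```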

As articulated by Grossman and Kegl [22], more can be learned by observing how features change over time rather than in independent contexts. Figures 3 and 4 depict an instance of the same surprise captured across multiple modalities, where a notable correlation between a rise in speech intensity and pitch and a fall in the GSR signal can be seen. These findings are supported by the results shown in Table 1, which highlights the relative importance that was assigned to GSR by the trained classifier. In particular, the classifier weighs maxima in speech intensity and the slope of GSR more heavily than many other features.

Modality   Feature       Type    Feature Importance
Facial     Cheek Raise   Slope   4.2%
Speech     Intensity     Max     1.2%
GSR        CAL kOhms     Slope   0.5%

Table 1: Per modality, cheek movement, speech intensity, and GSR CAL kOhms were the most indicative features of surprise, where feature importance is assigned by the classifier and higher importance has greater decision-making precedence. These features were ranked 1st, 12th, and 41st, respectively. We note that many features were deemed less important than the leading GSR feature.
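For reference, a ranking of this kind can be read directly from scikit-learn's impurity-based feature importances after fitting the forest. The sketch below uses placeholder inputs (X, y, feature_names) and is not the study's actual pipeline.

```python
# Sketch: fit a Random Forest on the windowed feature matrix and rank
# features by the impurity-based importances exposed by scikit-learn.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rank_features(X, y, feature_names, n_estimators=100, random_state=0):
    clf = RandomForestClassifier(n_estimators=n_estimators, random_state=random_state)
    clf.fit(X, y)
    order = np.argsort(clf.feature_importances_)[::-1]
    return [(feature_names[i], clf.feature_importances_[i]) for i in order]

# for name, importance in rank_features(X, y, feature_names)[:10]:
#     print(f"{name}: {importance:.1%}")
```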

The leave-one-subject-out classification experiment involved the development set of n = 29 subjects. For each fold, we train on n - 1 subjects and test on the held-out subject using an equal number of windows of surprise and non-surprise.


(a) Leave-one-subject-out on development set. (b) Leave-one-subject-out on test set.

Fig. 5: Left: Accuracy ranged from the best-predicted subject (Subject 49, 86.5%) to the worst (Subject 33, 50%). Right: On the final held-out test set, the average of 69% was impacted by the low performance of one subject (Subject 42, 54%).

This helps us understand how variation among subjects (Figure 5) impacts classification performance (Table 2). Classification on the final held-out test set of 7 subjects yields similar results (Table 2). Inspection of poorly performing subjects in Figure 5 suggests that seating orientation and engagement may have played a role in how well facial data was captured.
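A leave-one-subject-out protocol of this kind can be expressed with scikit-learn's LeaveOneGroupOut, treating each subject's windows as one group. The sketch below is an approximation under assumed inputs (X, y, subject_ids), with the per-subject class balancing assumed to have been applied beforehand.

```python
# Sketch: leave-one-subject-out evaluation with a Random Forest, where each
# subject's windows form one group. X, y, subject_ids are placeholder arrays.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import LeaveOneGroupOut

def loso_accuracies(X, y, subject_ids):
    logo = LeaveOneGroupOut()
    per_subject = {}
    for train_idx, test_idx in logo.split(X, y, groups=subject_ids):
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(X[train_idx], y[train_idx])
        held_out = np.unique(subject_ids[test_idx])[0]
        per_subject[held_out] = accuracy_score(y[test_idx], clf.predict(X[test_idx]))
    return per_subject

# accs = loso_accuracies(X, y, subject_ids)
# print(f"unweighted mean accuracy: {np.mean(list(accs.values())):.1%}")
```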

Facial features (especially ones associated with happiness) were often the best indicators of surprise (see Table 3), which may indicate that our experiment was better at capturing positive instances of surprise than negative ones. Furthermore, calculating the rate of change as a slope was consistently the best way to aggregate features during windows of surprise, as we often found that salient changes would occur over a short period of time.

             Unweighted            Weighted
Metric       Development   Test    Development   Test
Accuracy     71%           69%     71%           70%
Precision    74%           72%     73%           73%
Recall       66%           65%     66%           65%
F-Score      68%           65%     69%           69%

Table 2: Leave-one-subject-out classification for both the development and test sets yields approximately 70% accuracy, roughly 20% over random choice. Weighted metrics give subjects who had more instances of surprise higher precedence, while unweighted metrics treat every subject equally.
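The distinction between the weighted and unweighted columns can be made concrete with a small helper: the unweighted score averages per-subject metrics directly, while the weighted score weights each subject by their number of surprise windows. This is a sketch with assumed inputs, not the paper's exact bookkeeping.

```python
# Sketch: unweighted vs weighted averaging of per-subject scores.
# per_subject_score and n_windows are placeholder dicts keyed by subject id.
import numpy as np

def average_scores(per_subject_score, n_windows):
    subjects = list(per_subject_score)
    scores = np.array([per_subject_score[s] for s in subjects])
    weights = np.array([n_windows[s] for s in subjects], dtype=float)
    unweighted = scores.mean()                       # every subject counts equally
    weighted = np.average(scores, weights=weights)   # subjects with more windows count more
    return unweighted, weighted
```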

Modality   Feature       Type    Feature Importance
Facial     Cheek Raise   Slope   4.2%
Facial     Smile         Slope   4.0%
Facial     Joy           Slope   3.2%
Facial     Eye Closure   Slope   2.7%
Facial     Valence       Slope   2.4%

Table 3: The most useful features for classification were either facial movements (Cheek Raise, Smile, Eye Closure) or facial expressions inferred by iMotions' Affectiva engine (Joy, Valence).

5. CONCLUSION

Our data highlights the importance of considering multiple modalities for the identification of surprise. Not only were the instances of surprise distinct from those of non-surprise, but we were also able to see that classification positively weighed the biophysical modality of galvanic skin response. Furthermore, by marking instances of surprise in real time, we avoid third-party annotator bias, through which subtler instances of surprise could be missed. We also avoid self-report bias related to the facilitator's or the subject's recollection of what transpired.

Identifying novel methods for further pruning of the elicited data may increase the accuracy of classifiers, as not all attempts at surprising subjects were successful. Additionally, expanding the number of subjects in under-represented demographic groups would further enhance the inclusiveness of the dataset. Lastly, future work should involve analyzing the elicited data from the perspective of lexical cues of surprise, and also study variation in the expression of surprise across subjects in more detail.


6. REFERENCES

[1] Virginia Francisco, Raquel Hervas, Federico Peinado, and Pablo Gervas, "EmoTales: Creating a corpus of folk tales with emotional annotations," Language Resources and Evaluation, vol. 46, no. 3, pp. 341–381, September 2012.

[2] Mohammad Rabiei and Alessandro Gasparetto, "System and method for recognizing human emotion state based on analysis of speech and facial feature extraction; Applications to human-robot interaction," in 2016 4th International Conference on Robotics and Mechatronics (ICROM), October 2016, pp. 266–271.

[3] Barbara Kryk-Kastovsky, "Surprise, surprise: The iconicity-conventionality scale of emotions," The Language of Emotions: Conceptualization, Expression and Theoretical Foundation, pp. 155–169, 1997.

[4] Zoltan Kovecses, "Surprise as a conceptual category," Review of Cognitive Linguistics, vol. 13, no. 2, p. 270, 2015.

[5] Soujanya Poria, Erik Cambria, Rajiv Bajpai, and Amir Hussain, "A review of affective computing: From unimodal analysis to multimodal fusion," Information Fusion, vol. 37, pp. 98–125, 2017.

[6] Felix Burkhardt, Astrid Paeschke, Miriam Rolfes, Walter Sendlmeier, and Benjamin Weiss, "A database of German emotional speech," in Proceedings of Interspeech, Lissabon, 2005, pp. 1517–1520.

[7] Adela Barbulescu, Remi Ronfard, and Gerard Bailly, "A generative audio-visual prosodic model for virtual actors," IEEE Computer Graphics and Applications, vol. 37, no. 6, pp. 40–51, November 2017.

[8] Sara Zhalehpour, Onur Onder, Zahid Akhtar, and Cigdem Eroglu Erdem, "BAUM-1: A spontaneous audio-visual face database of affective and mental states," IEEE Transactions on Affective Computing, vol. 8, no. 3, pp. 300–313, July 2017.

[9] Bjorn W. Schuller, "Speech emotion recognition: Two decades in a nutshell, benchmarks, and ongoing trends," Communications of the ACM, vol. 61, no. 5, pp. 90–99, 2018.

[10] Vanus Vachiratamporn, Roberto Legaspi, Koichi Moriyama, and Masayuki Numao, "Towards the design of affective survival horror games: An investigation on player affect," in 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction, September 2013, pp. 576–581.

[11] Jingying Chen, Dan Chen, Xiaoli Li, and Kun Zhang, "Towards improving social communication skills with multimodal sensory information," IEEE Transactions on Industrial Informatics, vol. 10, no. 1, pp. 323–33, February 2014.

[12] Carlos T. Ishi, Takashi Minato, and Hiroshi Ishiguro, "Motion analysis in vocalized surprise expressions and motion generation in android robots," IEEE Robotics and Automation Letters, vol. 2, no. 3, pp. 1748–1754, July 2017.

[13] Oslo Albert, "Fireboy and Watergirl in the Forest Temple," Armor Games, 2009.

[14] IBM, IBM Watson Speech to Text, Armonk, North Castle, NY, 2018.

[15] Michael McAuliffe, Michaela Socolof, Sarah Mihuc, Michael Wagner, and Morgan Sonderegger, "Montreal Forced Aligner: Trainable text-speech alignment using Kaldi," in Proceedings of Interspeech, 2017.

[16] Paul Boersma and David Weenink, "Praat, a system for doing phonetics by computer," Glot International, vol. 5, pp. 341–345, 2001.

[17] Will Styler, "Using Praat for linguistic research," University of Colorado at Boulder Phonetics Lab, 2013.

[18] iMotions A/S, iMotions Biometric Research Platform 6.0, Copenhagen, Denmark, 2016.

[19] Adrian Burns, Barry R. Greene, Michael J. McGrath, Terrance J. O'Shea, Benjamin Kuris, Steven M. Ayer, Florin Stroiescu, and Victor Cionca, "Shimmer - a wireless sensor platform for noninvasive biomedical research," IEEE Sensors Journal, vol. 10, no. 9, pp. 1527–1534, September 2010.

[20] Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Edouard Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, November 2011.

[21] Mireia Farrus, Javier Hernando, and Pascual Ejarque, "Jitter and shimmer measurements for speaker recognition," in Proceedings of Interspeech 2007, 2007, pp. 778–781.

[22] Ruth B. Grossman and Judy Kegl, "To capture a face: A novel technique for the analysis and quantification of facial expressions in American Sign Language," Sign Language Studies, vol. 6, no. 3, pp. 273–305, Spring 2006.