
A Multi-task Learning Framework for Time-continuous Emotion Estimation from Crowd Annotations

Mojtaba Khomami Abadi 1,2, Azad Abad 1, Ramanathan Subramanian 3, Negar Rostamzadeh 1, Elisa Ricci 4,5, Jagannadan Varadarajan 3, Nicu Sebe 1

1 Department of Information Engineering and Computer Science, University of Trento, Italy. 2 Semantic, Knowledge and Innovation Lab (SKIL), Telecom Italia, Trento, Italy.

3 Advanced Digital Sciences Center (ADSC), University of Illinois at Urbana-Champaign, Singapore. 4 University of Perugia, Italy. 5 Fondazione Bruno Kessler, Trento, Italy.

khomamiabadi, abad, rostamzadeh, [email protected], [email protected], Subramanian.R, [email protected]

ABSTRACT
We propose Multi-task learning (MTL) for time-continuous or dynamic emotion (valence and arousal) estimation in movie scenes. Since compiling annotated training data for dynamic emotion prediction is tedious, we employ crowdsourcing for the same. Even though the crowdworkers come from various demographics, we demonstrate that MTL can effectively discover (1) consistent patterns in their dynamic emotion perception, and (2) the low-level audio and video features that contribute to their valence, arousal (VA) elicitation. Finally, we show that MTL-based regression models, which simultaneously learn the relationship between low-level audio-visual features and high-level VA ratings from a collection of movie scenes, can predict VA ratings for time-contiguous snippets from each scene more effectively than scene-specific models.

Categories and Subject Descriptors
H.1.2 [User/Machine Systems]: Human information processing; I.5.2 [Pattern Recognition Design Methodology]: Pattern analysis

General Terms
Measurement, Algorithms, Verification, Human Factors

Keywords
Multi-task Learning, Time-continuous emotion estimation, Crowd annotation, Movie clips

1. INTRODUCTION
Affective video tagging has long been acknowledged as an important multimedia problem, given its utility for applications such as personalized media recommendation. However, most content and user-based media tagging approaches seek to recognize the general emotion of a stimulus (typically a movie or audio/video music clip), and only a few methods such as [7] have attempted to determine the dynamic, time-continuous emotion profile of the stimulus. This limitation is partly attributed to the fact that interpreting and measuring emotion is an inherently difficult problem: emotion is a highly subjective feeling, and the discrepancy between the emotion envisioned by the content creator and the actual emotion evoked in consumers has been highlighted by many works. Also, learning the relationship between low-level content (typically in the form of audio-visual effects) and the high-level emotional feeling over time requires extensive training data, with annotations typically performed by multiple annotators for reliability, which is both difficult and expensive to acquire.

Recently, crowdsourcing (CS) has become popular for performing tedious tasks through extensive human collaboration via the Internet. When it is difficult to employ experts for analyzing large-scale data, CS is an attractive alternative, as many individuals work on smaller data chunks to provide useful information in the form of annotations or tags. CS has been successfully employed to develop data-driven solutions for computationally difficult problems in multiple domains like natural language processing [30] and computer vision [33]. Two reasons mainly contribute to the success of CS: (1) crowdworkers are paid a fraction of the wages that experts are entitled to, thereby achieving cost efficiency, and (2) the experimenter's task becomes scalable when the original task is split into smaller, manageable micro-tasks distributed among crowdworkers. Nevertheless, cost-effectiveness in CS is achieved at the expense of expertise: crowdworkers may lack the technical and cognitive skills or the motivation to effectively perform a given task [19]. Therefore, efficient methodologies that are robust to noisy data are crucial to the success of CS approaches.

In this paper, we propose Multi-task learning (MTL) for time-continuous valence, arousal (VA) estimation from movie scenes for which dynamic emotion annotations are acquired from crowdworkers. Given a set of related tasks, MTL seeks to simultaneously learn all tasks by modeling the similarities as well as differences among them to build task-specific classification or regression models. This joint learning procedure, accounting for task relationships, leads to more efficient models as compared to learning each task independently. For the purpose of learning the relationship between low-level audio-visual features and corresponding crowdworker VA annotations over time, we ask the following questions: (1) Given that emotion perception is highly subjective, and biases relating to crowdworker demographics may additionally exist, can we discover any patterns relating to their dynamic emotional perception? The exercise of seeking to acquire a gold standard annotation for each movie scene (or clip) via crowdsourcing is meaningful only if such patterns can be discovered. (2) If the emotional ground truth corresponding to each movie clip can be represented by a single, gold standard emotional profile, can we discover corresponding audio-visual correlates for a movie clip collection (as against a single movie clip), which in turn can be more effective for predicting the VA profile of a novel clip? Through extensive experiments, we demonstrate that MTL effectively answers the above questions and is superior to single-task learning for VA prediction in novel scene segments. To summarize, this paper makes the following contributions:

1. This is the first work to employ MTL for time-continuous emotion prediction.

2. This is also one of the first works to attempt dynamic affect prediction for movie clips.

The paper is organized as follows: Section 2 overviews the literature. The experimental protocol employed for recording crowdworkers' affective responses is described in Section 3, and a brief overview of MTL is provided in Section 4. Annotation data analysis and emotion prediction experiments are presented in Section 5, and conclusions are stated in Section 6.

2. RELATED WORK
We now examine related work on (1) crowdsourcing, (2) affective movie analysis, (3) CS for affective media tagging and (4) multi-task learning.

2.1 Crowdsourcing
Steiner et al. [25] defined three types of video events and showed that these events can be detected from video sequences via crowdsourcing upon combining textual, visual and behavioral cues. Vondrick et al. [28] argued that frame-by-frame video annotation is essential for a variety of tasks, as in the case of time-continuous emotion measurement, even if it is difficult for human annotators. An online framework to collect valid facial responses to media content was proposed in the work of McDuff et al. [16], who found significant differences between subgroups who liked/disliked or were familiar/unfamiliar with a particular commercial.

2.2 Affective movie analysis
A primary issue in affective multimedia analysis is the paucity of reliable annotators to generate sufficient training data; in most studies, only a few annotators are used [21, 29]. Also, emotion perception varies with individual traits such as personality [10], and significant differences may be observed in affective ratings compiled from different persons over a small population. To address this problem, a number of studies have turned to crowdsourcing. In a seminal affective movie study, Gross et al. [6] compiled a collection of movie clips to evoke eight emotional states such as anger, disgust, fear and neutral, based on emotion ratings compiled for 250 movie clips from 954 subjects. [2, 11, 24] are three recent works that have attempted affect recognition from physiological responses of a large population of users to music and movie stimuli.

2.3 Crowdsourcing for affective media tagging
Soleymani et al. [22] performed crowdsourcing on a limited scale to collect 1300 affective annotations from 40 volunteers for 155 Hollywood movie clips. In another CS-based affective video annotation study, Soleymani et al. [23] compiled annotations for the MediaEval 2010 Affect Task Corpus on AMT, and asked workers to self-report their boredom levels. In a recent CS-based media tagging work, Soleymani et al. [20] presented a dataset of 1000 songs for music emotion analysis, each annotated continuously over time by at least 10 users.

Table 1: Video clip details. HALV, LALV, HAHV and LAHV respectively correspond to high-arousal low-valence, low-arousal low-valence, high-arousal high-valence and low-arousal high-valence labels.

                    HALV   LALV   HAHV   LAHV
No. of video clips     3      3      3      3
Min. length (sec)     79     80     86     59
Max. length (sec)     91    121    109     92
Avg. length (sec)  86.66  97.33    101  76.33

Nevertheless, movies denote multimedia stimuli that best approximate the real world, and movie clips have been found to be more effective for eliciting emotions in viewers as compared to music video clips in [1]; that is why we believe continuous emotion prediction with movie stimuli is important in the context of affective media representation and modeling.

2.4 Multi-task learning
Recently, multi-task learning (MTL) has been employed in several computer vision applications such as image classification [32], head pose estimation [31] and visual tracking [34]. Given a set of related tasks, MTL [4] seeks to simultaneously learn a set of task-specific classification or regression models. The intuition behind MTL is simple: a joint learning procedure which accounts for task relationships is expected to lead to more accurate models as compared to learning each task separately. While MTL has been used previously for learning from noisy crowd annotations [9], we present the first work that employs MTL for time-continuous emotion prediction from movie clips.

3. EXPERIMENTAL PROTOCOL
In this study, we asked crowdworkers to continuously annotate 12 emotional movie scenes adopted from [1] via a web-based user interface; they were not allowed to access the scene content prior to the rating task. Our objective was to understand and model their emotional state over time as they viewed the movie clips.

3.1 Dataset
We selected 12 video clips from [1], equally distributed among the four quadrants of the valence-arousal space. Table 1 presents characteristics of video clips from the different quadrants. All video clips were hosted on YouTube for access during the CS task.

3.2 Experimental Protocol
We posted the annotation task on Amazon Mechanical Turk (AMT) and other CS channels via the CrowdFlower (CF) platform. CF is an intermediate platform that posts the AMT task on our behalf. Moreover, CF provides a simple gold standard qualification mechanism to discard outliers: if workers passed the qualification test, they were considered qualified to perform a given task. However, such pre-designed tests are very generic and limited to simple tasks, and do not allow for trivially discarding low-quality annotations. Therefore, we used PHP server-side scripting and redirection to collect and evaluate all annotations in real time on our server via HTTP requests, before letting workers submit the task. The architecture of the designed CS platform is shown in Fig. 1(a).

To ensure annotation quality, each crowdworker could only annotate 5 video clips, and at least 15 judgments were requested and collected for each video clip. We also recorded facial expressions of workers (not used in this work) as they performed the annotation.


Figure 1: (a) Architecture of the designed CrowdFlower platform (Turkers are the AMT crowdworkers). (b) User interface for recording workers' emotional ratings and facial expressions. (c) Age, (d) gender and (e) locality distributions of crowdworkers.

Informed consent was obtained from workers and, prior to the task, workers had to provide their demographics (age, gender and location). Time-continuous valence/arousal annotations from workers were compiled over separate sessions (a worker need not annotate both valence and arousal for the same clip under this setting), and workers were also required to rate each clip for overall emotional valence or arousal. Each worker was paid 10 cents per video as remuneration upon successful task completion.

Workers did not get paid if their annotations and webcam facial videos were not recorded on our server. To evaluate annotation quality, each video annotation was logged in XML format and analyzed. A continuous slider was used to record the emotional rating; if the slider had not moved for more than 80% of the clip duration, or if more than 20% of the data was lost, the annotation was automatically discarded. Also, files smaller than a pre-defined threshold were discarded. If the annotation task was left incomplete, a warning message notified the worker about the missing annotations; workers could then re-annotate the missing videos and get paid. To motivate workers to provide good-quality annotations, we rewarded them with online gift vouchers for high-quality annotations. Furthermore, we introduced some constraints: (1) workers could not play (or rate) multiple video clips simultaneously; (2) workers could annotate a video as many times as they wanted to; (3) workers were allowed to use only the Chrome browser for annotation, due to the unavailability of HTML5 technology support in other browsers; (4) media player controls were removed from the interface so that workers could not fast-forward/rewind the movie clips; and finally, (5) if the annotation stopped midway, it had to be redone from scratch.
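As a concrete illustration, the automatic checks above can be summarized in a few lines. The sketch below is in Python rather than the PHP actually used on the server; only the 80% idle-slider and 20% data-loss rules come from the text, while the file-size threshold value and the way slider idleness is measured are our assumptions.

```python
def annotation_valid(slider_log, clip_len_sec, file_bytes, min_file_bytes=1024):
    """Sketch of the automatic quality checks: discard the annotation if the
    slider stayed still for more than 80% of the clip, if more than 20% of the
    per-second samples were lost, or if the uploaded log file is too small."""
    lost = sum(1 for s in slider_log if s is None) / clip_len_sec
    if lost > 0.20 or file_bytes < min_file_bytes:
        return False
    values = [s for s in slider_log if s is not None]
    # seconds during which the slider value did not change from the previous second
    idle = sum(1 for a, b in zip(values, values[1:]) if a == b) / clip_len_sec
    return idle <= 0.80
```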

3.3 Annotation Mechanism
A screenshot of the user interface for recording annotations is presented in Fig. 1(b). The following components were part of the continuous annotation and facial expression recording process.
Video Player: To provide an uninterrupted video stream for workers with low bandwidth, we uploaded all movie clips onto YouTube. On the client side, the YouTube JavaScript player API was integrated into our web-based user interface.
Slider: The slider was used to collect the time-continuous VA ratings of workers while watching the video clips. The slider values ranged from -2 to 2 for both factors (very unpleasant to very pleasant for valence, and calm to highly excited for arousal). In order to facilitate workers' decision making, a standard visual scale, the Self-Assessment Manikin (SAM) image, was displayed to the workers.
Webcam Panel: To upload facial expressions of workers in real time, we used HTML5 technology to buffer the worker's webcam recording on the client side when the play button was pressed. The buffered video was automatically uploaded in the compressed, open VP8 codec format to our server when the video clip finished playing. Videos were recorded at 320x240 resolution, 30 fps.
Questionnaires: Workers needed to report (1) their overall emotional (valence or arousal) rating for the movie clip on a scale of -2 to 2, and (2) their familiarity with the clip, to avoid the effect of such bias on their ratings.

3.4 Annotation statistics and pre-processing
Overall, 1012 and 527 workers provided continuous valence and arousal ratings respectively. Their age, gender and locality distributions are shown in Fig. 1(c), (d) and (e). As a preliminary step towards ensuring good-quality VA labels, we discarded those time-continuous annotations with (1) more missing values than a threshold, (2) standard deviation below a threshold, or (3) missing overall or general VA ratings.
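A minimal sketch of this pre-processing step is given below; the paper does not state the actual threshold values, so the ones used here are placeholders.

```python
import numpy as np

def keep_annotation(trace, overall_rating, max_missing=0.2, min_std=0.05):
    """Drop a time-continuous annotation if (1) too many samples are missing,
    (2) it is nearly flat (std below a threshold), or (3) no overall VA rating
    was given. Threshold values are placeholders."""
    trace = np.asarray(trace, dtype=float)  # NaN marks missing samples
    return (np.isnan(trace).mean() <= max_missing
            and np.nanstd(trace) >= min_std
            and overall_rating is not None)
```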

4. MULTI-TASK LEARNING
As mentioned previously, Multi-task learning (MTL) models both similarities as well as differences among a set of related tasks, which is more beneficial as compared to learning task-specific models. Given a set of tasks t = 1..T, with X(t) denoting the training data for task t and Y(t) the corresponding labels (ratings), MTL seeks to jointly learn a set of weights W = [W_1 .. W_T], where W_t models task t. For the problem of time-continuous VA prediction, the 12 movie clips used for crowdsourcing denote the related tasks.


In this work, we used the publicly available MALSAR library [35], which contains a host of MTL algorithms for analysis. We were particularly interested in the following MTL variants:

Multi-task Lasso: which extends the Lasso algorithm [26] to MTL, and assumes that sparsity is shared among all tasks.

ℓ2,1 norm-regularized MTL [3]: which attempts to minimize the objective function
$$\sum_{t=1}^{T} \|W_t^{\top} X_t - Y_t\|_F^2 + \alpha \|W\|_{2,1} + \beta \|W\|_F^2,$$
where ||·||_F and ||·||_{2,1} denote the matrix Frobenius norm and the ℓ2,1 norm respectively. The basic assumption in this model is that all tasks are related, which is not always true; α and β denote the regularization parameters controlling group sparsity and norm sparsity respectively.

Dirty MTL [8]: where the weight matrix is decomposed as W = P + Q, with P and Q denoting the group-wise and task-wise sparse components.

Sparse graph Regularization (SR-MTL): where a priori knowledge concerning task-relatedness is modeled in terms of a graph R in the objective function. This way, similarity is only enforced between the W_t's corresponding to related tasks. The minimized objective function in this case is
$$\sum_{t=1}^{T} \|W_t^{\top} X_t - Y_t\|_F^2 + \alpha \|WR\|_F^2 + \beta \|W\|_1 + \gamma \|W\|_F^2,$$
where R is the graph encoding task relationships, and α, β, γ denote regularization parameters as above.
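To make these objectives concrete, here is a small NumPy sketch, independent of the MATLAB MALSAR implementation actually used: a proximal-gradient solver for the ℓ2,1-regularized objective, and an evaluation of the SR-MTL objective given a task-relationship graph R. The step size, iteration count, and the edge-based construction of R are our assumptions.

```python
import numpy as np

def l21_mtl(X_list, y_list, alpha=0.1, beta=1.0, lr=1e-4, n_iter=5000):
    """Proximal-gradient sketch of
       sum_t ||W_t' X_t - Y_t||^2 + alpha*||W||_{2,1} + beta*||W||_F^2,
    with X_list[t] of shape (d, n_t) and y_list[t] of shape (n_t,)."""
    d, T = X_list[0].shape[0], len(X_list)
    W = np.zeros((d, T))
    for _ in range(n_iter):
        G = np.zeros_like(W)
        for t, (X, y) in enumerate(zip(X_list, y_list)):
            # gradient of the smooth part: squared loss + Frobenius penalty
            G[:, t] = 2.0 * X @ (X.T @ W[:, t] - y) + 2.0 * beta * W[:, t]
        W -= lr * G
        # proximal step for alpha*||W||_{2,1}: row-wise (feature-wise) shrinkage,
        # which zeroes out features jointly across all tasks
        row_norms = np.maximum(np.linalg.norm(W, axis=1, keepdims=True), 1e-12)
        W *= np.maximum(0.0, 1.0 - lr * alpha / row_norms)
    return W

def edge_graph(T, related_pairs):
    """Encode task relations as columns of R with +1/-1 so that
    ||W R||_F^2 equals the sum of ||W_i - W_j||^2 over related pairs (i, j)."""
    R = np.zeros((T, len(related_pairs)))
    for k, (i, j) in enumerate(related_pairs):
        R[i, k], R[j, k] = 1.0, -1.0
    return R

def sr_mtl_objective(W, X_list, y_list, R, alpha, beta, gamma):
    """Value of the SR-MTL objective for a given weight matrix W (d x T)."""
    loss = sum(np.sum((X.T @ W[:, t] - y) ** 2)
               for t, (X, y) in enumerate(zip(X_list, y_list)))
    return (loss + alpha * np.sum((W @ R) ** 2)
            + beta * np.sum(np.abs(W)) + gamma * np.sum(W ** 2))
```

In our setting, `related_pairs` would connect clips from the same affective group (e.g., the three HAHV clips), mirroring the HV/LV and HA/LA groupings fed to SR-MTL in Section 5.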

5. DATA ANALYSIS AND EXPERIMENTS
Upon compiling VA ratings from crowdworkers, we first examined whether any patterns existed in the dynamic annotations. This examination was important for two reasons: (1) predicting dynamic VA levels for a stimulus instead of the overall rating is useful as it allows for determining the 'emotional highlight' in the scene, and comparing dynamic vs. static ratings could help us understand how dynamic emotion perception influenced crowdworkers' overall impression of a scene, and (2) given the subjectivity associated with emotions and the uncontrolled worker population, patterns in dynamic VA annotations would indicate that a reliable, gold standard annotation for a clip is achievable in spite of these biases.

Given the 12 clips (tasks) related in terms of valence and arousal (from now on, clips/tasks 1-3, 4-6, 7-9 and 10-12 respectively correspond to HAHV, LAHV, LALV and HALV labels), we employed MTL to determine whether some time-points influenced the overall emotional perception of a movie clip more than others. To this end, we used the time-continuous VA ratings to predict the overall VA rating for each clip. For dimensional consistency, we only used the VA ratings for the final 50 sec of each clip for this experiment, and so the x-axis in Fig. 2 denotes time to clip completion (between 50 and 1 sec). Weights learnt using the different MTL variants consistently suggest that the continuous VA ratings provided in the latter half of all of the movie scenes predict the overall rating better. The third and fourth columns respectively depict the learnt weights for the six HV and LV (respectively HA and LA) clips, and represent the situation where a priori knowledge regarding task-relatedness is fed to SR-MTL. Examining the SR-MTL valence weights (cols 3, 4 in row 1), one can infer that general affective impressions are created earlier in time for high-valence stimuli as compared to low-valence stimuli. Examination of the SR-MTL arousal weights suggests that the most influential impressions regarding high-arousal stimuli are also created a few seconds before clip completion.
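Our reading of this experiment's data layout (an assumption, since the paper does not spell out the exact design matrices) is that, for each clip, every worker contributes one training example whose features are that worker's final 50 seconds of continuous ratings and whose target is their overall rating:

```python
import numpy as np

def overall_from_continuous_task(traces, overall_ratings, window=50):
    """Per-clip data for the 'continuous ratings -> overall rating' experiment:
    `traces` is a list of per-second rating arrays (one per worker), and
    `overall_ratings` the corresponding overall clip ratings. Returns X of
    shape (window, n_workers) and y of shape (n_workers,), matching the
    X_t / y_t convention of the MTL sketch above."""
    X = np.stack([np.asarray(tr, dtype=float)[-window:] for tr in traces], axis=1)
    y = np.asarray(overall_ratings, dtype=float)
    return X, y
```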

Therefore, MTL enables effective characterization of patterns concerning the dynamic VA levels of crowdworkers, and this in turn implies that deriving a representative, gold standard annotation from worker annotations for each movie clip is meaningful. While MTL has been used to learn from noisy crowd data [9], we simply used the median value of the annotations at each time-point to derive the ground truth emotional profile for each movie clip. Next, we will briefly describe the audio-visual features extracted from each clip, and show how the joint learning of the relationship between audio-visual features and VA ratings allows for more effective dynamic emotion prediction.
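The gold standard profile itself is straightforward; assuming the per-second annotations are aligned in an (n_workers x n_seconds) array, it is simply the per-column median:

```python
import numpy as np

def gold_standard_profile(annotations):
    """Median across workers at each time-point, ignoring missing (NaN) samples,
    gives the representative dynamic VA profile of a clip."""
    return np.nanmedian(np.asarray(annotations, dtype=float), axis=0)
```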

5.1 Multimedia Feature Extraction
Inspired by previous affective studies [18, 29], we extracted low-level audio-visual features that have been found to correlate well with the VA dimensions. In particular, we extracted the features used in [7, 12] on a per-second basis for our regression experiments.

5.1.1 Video Features
Lighting key and color variance [29] are well-known video features known to evoke emotions. Therefore, we extracted the lighting key from each frame in the HSV space by multiplying the mean by the standard deviation of the V values. Color variance [12] is defined as the determinant of the covariance matrix of L, U, and V in the CIE LUV color space. Also, the amount of motion in a movie scene is indicative of its excitement level [12]. Therefore, we computed the optical flow [15] between consecutive frames of a video segment to obtain the motion magnitude for each frame. The proportions of colors are important elements for evoking emotions [27]. A 20-bin color histogram of hue and lightness values in the HSV space was computed for each frame of a segment and averaged over all frames; the mean of the bins reflects the variation in the video content. For each frame in a segment, the medians of the L and S values in HSL space were computed; their average over all frames of a segment is an indication of the segment's lightness and saturation [12]. We also used the definitions in [29] to calculate shadow proportion, visual excitement, grayness and visual detail. The extracted video features are listed in Table 2.
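As an illustration, the two simplest of these descriptors can be computed per frame as below. This is an OpenCV-based sketch; the paper's exact implementation may differ, and dense Farneback flow is used here as a stand-in for the Lucas-Kanade flow of [15].

```python
import cv2
import numpy as np

def lighting_key_and_color_variance(frame_bgr):
    """Lighting key = mean(V) * std(V) in HSV [29]; colour variance = determinant
    of the covariance of the L, U, V channels in CIE LUV [12]."""
    v = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)[:, :, 2].astype(np.float64) / 255.0
    lighting_key = v.mean() * v.std()
    luv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2Luv).reshape(-1, 3).astype(np.float64)
    color_variance = np.linalg.det(np.cov(luv, rowvar=False))
    return lighting_key, color_variance

def motion_magnitude(prev_gray, curr_gray):
    """Mean optical-flow magnitude between consecutive grayscale frames."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    return float(np.linalg.norm(flow, axis=2).mean())
```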

5.1.2 Audio Features
Sound information in the form of loudness of speech (energy of sound) is related to arousal, while rhythm and average pitch in speech relate to valence [18]; Mel-frequency cepstrum components (MFCCs) [14] are representative of the short-term sound power spectrum. Commonly used features in audio and speech processing [14] were extracted from the audio channels. To extract MFCCs, we divided the audio segment into 20 divisions and then extracted the first 13 MFCC components from each division. Using the sequence of MFCC components over a segment, we computed 13 derivatives of MFCC (DMFCC) and the mean autocorrelation (AMFCC) proposed in [14]. Upon calculating MFCC, DMFCC and AMFCC (13 values each), we used their means as features. The implementation in [17] was used to extract formants up to 4400 Hz over the audio segment, and formant means were used as features. Moreover, we used the ACA toolbox [13] to calculate the mean and standard deviation (std) of (i) spectral flux, (ii) spectral centroid and (iii) time-domain zero crossing rate [14] over the 20 audio segment divisions. We also calculated the power spectral density as well as the bandwidth, band energy ratio (BER) and delta spectrum magnitude (DSM) according to [14]. Finally, we also computed the mean proportion of silence as defined in [5]. All in all, 56 audio features listed in Table 2 were extracted.
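A rough equivalent of the MFCC-based portion of these descriptors, using librosa as a stand-in for the implementations of [14] and the ACA toolbox [13], could look as follows; the per-second segmentation and the remaining descriptors are omitted.

```python
import numpy as np
import librosa

def mfcc_segment_features(y, sr, n_divisions=20, n_mfcc=13):
    """13 MFCCs per division of the audio segment, their division-to-division
    derivative (DMFCC), plus spectral centroid and zero-crossing statistics;
    means over the divisions are used as features, as described in the text."""
    divisions = np.array_split(y, n_divisions)
    mfcc = np.stack([librosa.feature.mfcc(y=d, sr=sr, n_mfcc=n_mfcc).mean(axis=1)
                     for d in divisions])              # (n_divisions, 13)
    dmfcc = np.diff(mfcc, axis=0)                      # derivative across divisions
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
    zcr = librosa.feature.zero_crossing_rate(y)
    return np.concatenate([mfcc.mean(axis=0), dmfcc.mean(axis=0),
                           [centroid.mean(), centroid.std(), zcr.mean()]])
```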

5.2 Experiments and Results
In this section, we attempt to predict the gold standard (or ground-truth) dynamic V/A ratings for each clip from audio-visual features using MTL, and show why learning the audio-visual feature-emotion relationship simultaneously for the 12 movie scenes is more effective than learning scene-specific models.


Figure 2: Predicting overall clip emotion from dynamic annotations: (Top) W matrix learnt for valence using (from left to right) ℓ2,1, dirty and SR MTL (HV, LV). (Bottom) W learnt for arousal using (from left to right) ℓ2,1, dirty and SR MTL (HA, LA). Larger weights are denoted using darker shades.

Table 2: Extracted audio-visual features from each movie clip (feature dimension listed in parentheses).

Audio features              Description
MFCC features (39)          MFCC coefficients [14], derivative of MFCC (DMFCC), MFCC autocorrelation (AMFCC)
Energy (1) and Pitch (1)    Average energy of audio signal [14] and first pitch frequency
Formants (4)                Formants up to 4400 Hz
Time frequency (8)          Mean and std of: spectrum flux, spectral centroid, delta spectrum magnitude, band energy ratio [14]
Zero crossing rate (1)      Average zero crossing rate of audio signal [14]
Silence ratio (2)           Mean and std of proportion of silence in a time window [5, 14]

Video features              Description
Brightness (6)              Mean of: lighting key, shadow proportion, visual details, grayness, median lightness per frame, mean of median saturation per frame
Color features (41)         Color variance, 20-bin histograms for hue and lightness in HSV space
Visual excitement (1)       Features as defined in [29]
Motion (1)                  Mean inter-frame motion [15]

Fig. 3 shows the model weights learnt by the various MTL approaches when they are trained with features and VA ratings over the entire clip duration for all clips. Here again, some interesting correlates between audio-visual features and VA ratings are observed over all scenes. Considering video features, color descriptors are found to be salient for valence, while motion and visual excitement correlate better with arousal, especially for HA stimuli, as noted from the SR MTL (HA) weights (column 7). Among audio features, the first few MFCC components correlate well with both V and A.

Then, we examined whether learning prediction models jointly for all movie clips was more beneficial than training a Lasso regressor per movie clip. To this end, we held out time-contiguous data of length 5, 10 or 15 seconds from the first half (front) or second half (back) of each of the clips for testing, while the remainder of the clips was used for training. The optimal group-sparsity regularization parameter for the different MTL methods, as well as the optimal Lasso parameter, were chosen from [0.01, 0.1, 1, 5] employing 5-fold cross-validation, and all other parameters (where necessary) were set to 1. The root mean square error (RMSE) observed for V/A estimates over all clips (tasks) is shown in Table 3. MTL methods clearly outperform single-task Lasso and, consequent to our earlier finding that the latter half of all clips is emotionally salient, larger prediction errors are observed for the back portion. Also, prediction errors increase with the test clip size, and predictions are more accurate for arousal and with audio features. Finally, sophisticated MTL methods such as dirty and SR-MTL outperform MT-Lasso and ℓ2,1 MTL. Overall, these results demonstrate efficient MTL-based learning utilizing relatively few training examples.
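For reference, the hold-out protocol for the single-task baseline can be sketched as follows, using scikit-learn's Lasso with the same parameter grid and 5-fold cross-validation. The exact position of the held-out block within the chosen half is our assumption, and the MTL variants would replace the regressor while sharing training data across all twelve clips.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

def lasso_holdout_rmse(X, y, hold_len=10, half="back", seed=0):
    """X: (n_seconds, n_features) per-second audio-visual features of one clip,
    y: (n_seconds,) gold standard V or A profile. A contiguous block of
    hold_len seconds from the front or back half is held out for testing."""
    rng = np.random.default_rng(seed)
    n = len(y)
    lo, hi = (0, n // 2 - hold_len) if half == "front" else (n // 2, n - hold_len)
    start = rng.integers(lo, hi + 1)
    test = np.arange(start, start + hold_len)
    train = np.setdiff1d(np.arange(n), test)
    # per-clip Lasso tuned over the same grid as in the paper
    model = GridSearchCV(Lasso(), {"alpha": [0.01, 0.1, 1, 5]}, cv=5)
    model.fit(X[train], y[train])
    err = model.predict(X[test]) - y[test]
    return float(np.sqrt(np.mean(err ** 2)))
```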

6. CONCLUSION AND FUTURE WORK
This paper explores Multi-task learning to estimate dynamic VA levels for movie scenes. Since time-continuous VA annotations are highly difficult to acquire, we employ crowdsourcing for the same. Though emotion is a subjective feeling and the crowdworkers came from varied demographics, MTL could effectively capture patterns concerning their dynamic emotion perception. The latter half of all clips was found to be more emotionally salient, and influenced the affective impression of the clip. We again utilized MTL to model the relationship between the representative dynamic VA profile for each clip and the underlying audio-visual effects, and observed that MTL approaches considerably outperformed clip-specific Lasso models, implying that jointly learning the characteristics of a collection of scenes is beneficial. Future work involves the use of (1) MTL for cleaning crowd annotations, and (2) the face videos compiled in this work as an additional affective cue.

7. ACKNOWLEDGEMENT
The authors acknowledge that this research was funded by the research grant for ADSC's Human Sixth Sense Programme from Singapore's Agency for Science, Technology and Research (A*STAR), the FIRB 2008 project S-PATTERNS, and the cluster project "Ageing at Home".

8. REFERENCES
[1] M. Abadi, S. Kia, R. Subramanian, P. Avesani, and N. Sebe. User-centric affective video tagging from MEG and peripheral physiological responses. In Affective Computing and Intelligent Interaction, pages 582–587, 2013.
[2] M. K. Abadi, M. Kia, S. Ramanathan, P. Avesani, and N. Sebe. Decoding affect in videos employing the MEG brain signal. pages 1–6, 2013.
[3] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In Neural Information Processing Systems, 2007.
[4] R. Caruana. Multitask learning. Springer, 1998.
[5] L. Chen, S. Gunduz, and M. T. Ozsu. Mixed type audio classification with support vector machine. In IEEE Int'l Conference on Multimedia and Expo, pages 781–784, 2006.
[6] J. J. Gross and R. W. Levenson. Emotion elicitation using films. Cognition & Emotion, 9(1):87–108, 1995.
[7] A. Hanjalic and L.-Q. Xu. Affective video content representation and modeling. IEEE Transactions on Multimedia, 7(1):143–154, 2005.
[8] A. Jalali, P. Ravikumar, S. Sanghavi, and C. Ruan. A dirty model for multi-task learning. In Neural Information Processing Systems, 2010.


Figure 3: Predicting dynamic V/A ratings using (top) video and (bottom) audio features; columns (left to right): ℓ2,1-V, Dirty-V, SR-HV, SR-LV, ℓ2,1-A, Dirty-A, SR-HA, SR-LA. Larger weights shown using darker shades (best viewed under zoom).

Table 3: RMSE-based V/A prediction performance of task-specific vs. multi-task methods. RMSE mean and standard deviation over five runs are reported. Best model RMSE is shown in bold.

                            Front                                        Back
               5 s          10 s         15 s          5 s          10 s         15 s
Valence / Video
  Lasso        0.429±0.041  0.816±0.583  1.189±0.625  0.584±0.024  0.881±0.057  1.125±0.064
  MT-Lasso     0.191±0.028  0.319±0.064  0.549±0.042  0.206±0.014  0.443±0.067  0.593±0.108
  ℓ2,1 MTL     0.193±0.030  0.326±0.063  0.565±0.047  0.207±0.015  0.450±0.066  0.606±0.113
  Dirty MTL    0.452±0.141  0.840±0.293  1.179±0.400  0.308±0.105  0.607±0.140  0.801±0.129
  SR MTL       0.193±0.030  0.325±0.064  0.563±0.046  0.207±0.015  0.450±0.066  0.607±0.113
Valence / Audio
  Lasso        0.475±0.030  0.712±0.069  0.851±0.081  0.634±0.016  0.860±0.027  1.174±0.034
  MT-Lasso     0.241±0.023  0.348±0.014  0.487±0.038  0.237±0.024  0.400±0.039  0.527±0.033
  ℓ2,1 MTL     0.243±0.020  0.359±0.012  0.520±0.029  0.247±0.026  0.392±0.029  0.553±0.028
  Dirty MTL    0.299±0.023  0.473±0.019  0.751±0.060  0.312±0.043  0.524±0.042  0.692±0.072
  SR MTL       0.248±0.017  0.365±0.015  0.526±0.027  0.252±0.027  0.404±0.033  0.567±0.026
Arousal / Video
  Lasso        0.429±0.041  0.816±0.583  1.189±0.625  0.584±0.024  0.881±0.057  1.125±0.064
  MT-Lasso     0.191±0.028  0.319±0.064  0.549±0.042  0.206±0.014  0.443±0.067  0.593±0.108
  ℓ2,1 MTL     0.193±0.030  0.326±0.063  0.565±0.047  0.207±0.015  0.450±0.066  0.606±0.113
  Dirty MTL    0.452±0.141  0.840±0.293  1.179±0.400  0.308±0.105  0.607±0.140  0.801±0.129
  SR MTL       0.193±0.030  0.325±0.064  0.563±0.046  0.207±0.015  0.450±0.066  0.607±0.113
Arousal / Audio
  Lasso        0.435±0.038  0.599±0.050  0.727±0.058  0.556±0.033  0.807±0.037  1.004±0.033
  MT-Lasso     0.212±0.026  0.339±0.020  0.406±0.019  0.243±0.039  0.353±0.028  0.464±0.029
  ℓ2,1 MTL     0.212±0.028  0.345±0.022  0.437±0.027  0.238±0.032  0.358±0.022  0.475±0.027
  Dirty MTL    0.249±0.015  0.420±0.036  0.618±0.051  0.304±0.043  0.430±0.029  0.568±0.019
  SR MTL       0.214±0.027  0.350±0.031  0.431±0.026  0.242±0.031  0.363±0.026  0.487±0.030

[9] H. Kajino, Y. Tsuboi, and H. Kashima. A convex formulation for learning from crowds. In AAAI Conference on Artificial Intelligence, 2012.
[10] E. G. Kehoe, J. M. Toomey, J. H. Balsters, and A. L. W. Bokde. Personality modulates the effects of emotional arousal and valence on brain activation. Social Cognitive & Affective Neuroscience, 7:858–870, 2012.
[11] S. Koelstra, C. Mühl, M. Soleymani, J.-S. Lee, A. Yazdani, T. Ebrahimi, T. Pun, A. Nijholt, and I. Patras. DEAP: A database for emotion analysis using physiological signals. IEEE Transactions on Affective Computing, 3(1):18–31, 2012.
[12] S. Koelstra, C. Mühl, M. Soleymani, J.-S. Lee, A. Yazdani, T. Ebrahimi, T. Pun, A. Nijholt, and I. Patras. DEAP: A database for emotion analysis; using physiological signals. IEEE Transactions on Affective Computing, 3(1):18–31, 2012.
[13] A. Lerch. An introduction to audio content analysis: Applications in signal processing and music informatics. John Wiley & Sons, 2012.
[14] D. Li, I. K. Sethi, N. Dimitrova, and T. McGee. Classification of general audio data for content-based retrieval. Pattern Recognition Letters, 22(5):533–544, 2001.
[15] B. D. Lucas, T. Kanade, et al. An iterative image registration technique with an application to stereo vision. In Int'l Joint Conference on Artificial Intelligence, volume 81, pages 674–679, 1981.
[16] D. McDuff, R. Kaliouby, and R. W. Picard. Crowdsourcing facial responses to online videos. IEEE Transactions on Affective Computing, 3(4):456–468, 2012.
[17] K. Mustafa and I. C. Bruce. Robust formant tracking for continuous speech with speaker variability. IEEE Transactions on Audio, Speech, and Language Processing, 14(2):435–444, 2006.
[18] R. W. Picard. Affective Computing. MIT Press, 2000.
[19] J. Ross, L. Irani, M. Silberman, A. Zaldivar, and B. Tomlinson. Who are the crowdworkers?: Shifting demographics in Mechanical Turk. In Human Factors in Computing Systems, pages 2863–2872, 2010.
[20] M. Soleymani, M. N. Caro, E. M. Schmidt, C.-Y. Sha, and Y.-H. Yang. 1000 songs for emotional analysis of music. In ACM International Workshop on Crowdsourcing for Multimedia, pages 1–6, 2013.
[21] M. Soleymani, G. Chanel, J. J. Kierkels, and T. Pun. Affective characterization of movie scenes based on multimedia content analysis and user's physiological emotional responses. In IEEE Int'l Symposium on Multimedia, pages 228–235, 2008.

[22] M. Soleymani, J. Davis, and T. Pun. A collaborative personalized affective video retrieval system. In Affective Computing and Intelligent Interaction, pages 1–2, 2009.
[23] M. Soleymani and M. Larson. Crowdsourcing for affective annotation of video: Development of a viewer-reported boredom corpus. In Workshop on Crowdsourcing for Search Evaluation, 2010.
[24] M. Soleymani, J. Lichtenauer, T. Pun, and M. Pantic. A multimodal database for affect recognition and implicit tagging. IEEE Transactions on Affective Computing, 3(1):42–55, 2012.
[25] T. Steiner, R. Verborgh, R. Van de Walle, M. Hausenblas, and J. G. Vallés. Crowdsourcing event detection in YouTube video. In M. Van Erp, W. R. Van Hage, L. Hollink, A. Jameson, and R. Troncy, editors, Workshop on Detection, Representation, and Exploitation of Events in the Semantic Web, pages 58–67, 2011.
[26] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267–288, 1996.
[27] P. Valdez and A. Mehrabian. Effects of color on emotions. Journal of Experimental Psychology: General, 123(4):394, 1994.
[28] C. Vondrick, D. Patterson, and D. Ramanan. Efficiently scaling up crowdsourced video annotation. Int'l Journal of Computer Vision, 101(1):184–204, 2013.
[29] H. L. Wang and L.-F. Cheong. Affective understanding in film. IEEE Transactions on Circuits and Systems for Video Technology, 16(6):689–704, 2006.
[30] J. D. Williams, I. D. Melamed, T. Alonso, B. Hollister, and J. Wilpon. Crowd-sourcing for difficult transcription of speech. In IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pages 535–540, 2011.
[31] Y. Yan, E. Ricci, R. Subramanian, O. Lanz, and N. Sebe. No matter where you are: Flexible graph-guided multi-task learning for multi-view head pose classification under target motion. In IEEE Int'l Conference on Computer Vision, pages 1177–1184, 2013.
[32] X. Yuan and S. Yan. Visual classification with multi-task joint sparse representation. In Computer Vision and Pattern Recognition, 2010.
[33] J. Yuen, B. C. Russell, C. Liu, and A. Torralba. LabelMe video: Building a video database with human annotations. In Int'l Conference on Computer Vision, pages 1451–1458, 2009.
[34] T. Zhang, B. Ghanem, S. Liu, and N. Ahuja. Robust visual tracking via structured multi-task sparse learning. Int'l Journal of Computer Vision, 101(2):367–383, 2013.
[35] J. Zhou, J. Chen, and J. Ye. MALSAR: Multi-tAsk Learning via StructurAl Regularization. Arizona State University, 2011.