
FROM CROWDSOURCED RANKINGS TO AFFECTIVE RATINGS

Yoann Baveye⋆†, Emmanuel Dellandrea†, Christel Chamaret⋆, Liming Chen†

⋆Technicolor, 975 avenue des Champs Blancs, 35576 Cesson-Sevigne, France
{yoann.baveye, christel.chamaret}@technicolor.com

†Universite de Lyon, CNRS, Ecole Centrale de Lyon, LIRIS, UMR5205, F-69134, Lyon, France
{emmanuel.dellandrea, liming.chen}@ec-lyon.fr

ABSTRACT

Automatic prediction of emotions requires reliably annotated data, which can be achieved using scoring or pairwise ranking. But can we predict an emotional score using a ranking-based annotation approach? In this paper, we propose to answer this question by describing a regression analysis to map crowdsourced rankings into affective scores in the induced valence-arousal emotional space. This process takes advantage of Gaussian Processes for regression, which can take into account the variance of the ratings and thus the subjectivity of emotions. The regression models successfully learn to fit the input data and provide valid predictions. Two distinct experiments were carried out using a small subset of the publicly available LIRIS-ACCEDE affective video database, for which crowdsourced ranks, as well as affective ratings, are available for arousal and valence. This allows us to enrich LIRIS-ACCEDE by providing absolute video ratings for the whole database in addition to the video rankings that are already available.

Index Terms— Gaussian Processes for Regression, Outlier detection, Affective video database, Affective computing

1. INTRODUCTION

Large multimedia databases are essential in various fields. They can be used to measure the performance of different works, enable benchmarks, and also to train and test computational models using machine learning. The Affective Computing field is no exception to the rule, and some well-known picture databases are already intensively used by researchers, such as the IAPS database [1]. But until very recently, no publicly available affective video database was large enough to be used for machine learning. Existing databases were too small, not sufficiently realistic, or suffered from copyright issues. That is why Soleymani et al. wrote in [2] that

“High-quality corpora will help to push forwardthe state of the art in affective video indexing.”

Recently, a large affective video database called LIRIS-ACCEDE [3] has been made publicly available¹ in an attempt to solve most of these issues. In this database, 9,800 video excerpts have been annotated with pairwise comparisons using crowdsourcing along the induced dimensions of the 2D valence-arousal emotional space. Valence can range from negative (e.g., sad, disappointed) to positive (e.g., happy, elated), whereas arousal varies from inactive (e.g., calm, bored) to active (e.g., excited, alarmed). Since the video clips are only ranked along both dimensions, the rankings provide no information about the distances between them. Furthermore, these ranks are relative to this particular database, which prevents comparison with other video clips annotated with absolute valence and arousal values.

This is why the goal of this paper is to enrich the LIRIS-ACCEDE database by providing absolute video ratings in addition to the video rankings that are already available. The new absolute video ratings are generated thanks to a regression analysis that maps the ranked database into the 2D valence-arousal affective space. Gaussian Processes for Regression were preferred over other existing regression techniques since they can model the noisiness of measurements and thus take into account the subjectivity of emotions. The proposed regression analysis is performed using rating values collected on a subset of the database in a previous experiment. Results show that the predictions of our models are in line with these affective rating values, and thus that affective ratings can be estimated from crowdsourced ranks.

The paper is organized as follows. Section 2 provides background material on affective multimedia self-reporting methods. Section 3 presents the database, the crowdsourced experiment that led to the ranking of the whole database, and the controlled experiment in which arousal and valence ratings were collected for a subset of the database. To estimate scores in the affective space from crowdsourced ranks, a regression analysis is performed in Section 4. Next, in Section 5, outliers are detected and removed from the process. Results are discussed in Section 6 and finally, conclusions and future work end the paper in Section 7.

¹ http://liris-accede.ec-lyon.fr/

2. RELATED WORK

In previous works, three different self-reporting methods with different advantages and drawbacks have been used to annotate affective multimedia databases: ratings, continuous annotations and ranking approaches.

Rating-based approaches are the most popular in experiments annotating a collection of movies. Annotators are asked to select, on a rating scale, the score that best represents the emotion induced by the video segment. For example, this process has been used to annotate, among others, the HUMAINE [4], FilmStim [5] and DEAP [6] affective video databases. The advantage of the scoring method is that annotators rate a video clip independently of any other video clip; thus, the scores can be easily compared, even to those of video clips annotated in other experiments. However, requesting an affective score requires annotators to understand the range of the emotional scale, which is a sizable cognitive load. Furthermore, it may be quite difficult to ensure that the scale is used consistently [7], especially in experiments using crowdsourcing [3].

Continuous ratings are even more difficult to implement. Recently, Metallinou and Narayanan investigated the challenges of continuous assessments of emotional states [8]. This technique is promising since the precision of the annotations is better than with any other technique where a single annotation is made for a large video segment. However, several issues have to be addressed, such as user-specific delays or the aggregation of multiple continuous subjective self-assessments.

Ranking approaches, and more specifically pairwise comparisons, are easier as they require less cognitive load. Annotators agree more when describing emotions in relative terms; thus, pairwise comparisons enhance the reliability of the ground truth. This method has been used by Yang and Chen [9] to annotate the emotions induced by a collection of songs, and more recently in [3] to rank the LIRIS-ACCEDE affective video database described in the next section. Some approaches, such as [10], are able to derive ratio-scale measures of the stimuli from binary judgments alone in pairwise comparisons, but they are not applicable in experiments dealing with a large number of stimuli. To deal with a huge number of stimuli, the quicksort algorithm can be used to reduce the number of pairwise comparisons needed to sort them. But the main disadvantage of rating-by-comparison experiments, for which comparisons are generated using the quicksort algorithm as in [3], is that relative ranks provide no indication of the distance between the excerpts, i.e., we do not know how much lower or higher a valence or arousal rank is than another. Nor is it certain that the median value represents neutral data. In order to solve these drawbacks, Yang and Chen explored the possibility of representing songs in the emotion space according to relative emotion rankings, but the ranks were converted to emotion scores in a linear way.

Fig. 1. Some examples of key frames extracted from several video clips included in the LIRIS-ACCEDE database. Credits can be found at http://liris-accede.ec-lyon.fr/database.php.

In this work, we show that it is possible to map the relative ranks of the 9,800 excerpts of the LIRIS-ACCEDE database into the valence-arousal emotional space in a finer way, thus combining the advantages of both the rating-based and ranking-based self-reporting methods, while considering the variability of emotional annotations.

3. LIRIS-ACCEDE DATABASE

First, we describe the LIRIS-ACCEDE database used in this work and the previous experiments that led to the annotation of the whole database.

3.1. Description

LIRIS-ACCEDE is composed of 9,800 video clips extracted from 160 movies shared under Creative Commons licenses [3]. Thus, the database can be shared publicly without copyright issues. The excerpts included in the database have been extracted from movies labeled under several movie genres, such as comedy, drama, horror, documentary or even animation. Thus, the database is representative of the current most popular movie genres. The 9,800 video segments last between 8 and 12 seconds. This duration is long enough to obtain consistent video clips allowing the viewer to feel emotions, and short enough to reduce the probability that a viewer feels more than one emotion per excerpt. In order to extract consistent segments from movies, a fade and scene-cut detection based on [11] has been implemented to make sure that each extracted segment starts and ends with a fade or a scene cut. There are numerous different video scenes reflecting the variety of the selected movies. This variety can be seen in Figure 1, which shows key frames of several excerpts included in the database.

Annotating such a large amount of data in a laboratory setting would be very time-consuming. That is why crowdsourcing has been used to rank the whole database along the induced arousal and valence axes.

3.2. Crowdsourced ranking experiment

The annotations were gathered using the CrowdFlower platform, since it distributes tasks to several crowdsourcing services. Thus, the collected annotations came from workers with more diverse cultural backgrounds than in experiments run on a single crowdsourcing service that is popular in only a few countries. Figure 2 shows that, for the experiment run on CrowdFlower to rank the database along the induced arousal axis, the distribution of annotator countries was very different depending on the crowdsourcing service. For example, most workers on Amazon Mechanical Turk were American and Indian, whereas instaGC workers were mostly American, Canadian and British.

To keep reliable annotations as simple as possible, pairwise comparisons were proposed on CrowdFlower in order to rank the database along the induced arousal and valence axes. The ranking of the whole database was a two-stage process. First, annotations to rank the database along the induced valence axis were gathered: workers were simply asked to select the video clip inducing the most positive emotion. Then, in the second stage, workers were asked to select the excerpt inducing the calmest emotion, making it possible to rank the database along the arousal axis. For both ranking experiments, the pairwise comparisons were generated and sorted using the quicksort algorithm. Workers were paid $0.01 per comparison. More than 4,000 annotators participated in the two stages, resulting in the ranking of the LIRIS-ACCEDE database along the induced valence and arousal axes. Note that a more detailed description of the ranking process is given in [3].
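As a rough illustration of why comparison-sorting keeps the annotation budget manageable, the sketch below ranks a small set of simulated clips through pairwise judgments. Everything here is a hypothetical stand-in: the comparator simulates a crowd worker via hidden scores, and Python's built-in sort (Timsort) plays the role of the paper's quicksort — both need on the order of n log n comparisons rather than all n(n−1)/2 pairs.

```python
# Sketch: ranking stimuli from pairwise comparisons generated by a
# comparison sort. The "worker" is simulated with hidden arousal scores
# (hypothetical data, not from the paper).
import functools
import random

random.seed(0)
hidden_arousal = {clip: random.random() for clip in range(20)}  # simulated ground truth
comparisons = 0

def worker_compare(a, b):
    """Simulated pairwise judgment: which clip induces the calmer emotion?"""
    global comparisons
    comparisons += 1
    return -1 if hidden_arousal[a] < hidden_arousal[b] else 1

clips = list(hidden_arousal)
# sorted() is Timsort, standing in for quicksort: O(n log n) judgments.
ranked = sorted(clips, key=functools.cmp_to_key(worker_compare))

print(f"{comparisons} comparisons instead of "
      f"{len(clips) * (len(clips) - 1) // 2} exhaustive pairs")
```

The saving grows with n: for the 9,800 clips of LIRIS-ACCEDE, exhaustive pairing would require roughly 48 million judgments, whereas a comparison sort needs on the order of 130,000.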

3.3. Controlled rating experiment

To cross-validate the annotations gathered in various uncontrolled environments using crowdsourcing, another experiment was created to collect ratings for a subset of the database in a controlled environment.

Fig. 2. Distribution of most represented countries for major crowdsourcing services used in October 2013. (Vertical axis: number of annotators, from 0 to 400; countries shown: ARG, AUS, BGD, BGR, BRA, CAN, DEU, EGY, ESP, FIN, FRA, GBR, GRC, HRV, HUN, IDN, IND, ITA, JPN, LTU, MAR, MEX, MYS, NLD, PAK, PHL, POL, PRT, ROU, RUS, SWE, TUR, USA, VNM.)

In this controlled experiment, 28 volunteers were asked to rate a carefully selected subset of the database using the 5-point discrete Self-Assessment Manikin scales for valence and arousal [12]. For each axis, 20 regularly distributed excerpts were selected, enough to represent the whole database while keeping the experiment at an acceptable duration. Thus, their optimum rank positions are those with value (9800/19) × n with n ∈ {0, . . . , 19}. For each axis and each optimum rank, we select the excerpt i ∈ {0, . . . , 9799} with αi ≥ 0.6 as close as possible to the optimum rank, where αi is Krippendorff's alpha [13] measuring the inter-rater reliability of excerpt i during the crowdsourced ranking experiment. This process ensures that the selected film clips have different levels of valence and arousal, and thus are representative of the full database, while being highly reliable in eliciting the induced emotions. To rate the selected excerpts, participants were instructed to focus on the emotion they felt while watching the 40 video clips. All the videos were presented in random order, and the participants performed a self-assessment of their level of arousal or valence immediately after viewing each film clip. Finally, all the annotations of a video clip were averaged to compute the affective rating of the excerpt. Due to the 5-point discrete scales used in this experiment, affective ratings range from 1 to 5.
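The selection rule described above can be sketched as follows. The Krippendorff's alpha values here are simulated (the paper uses the reliabilities measured during the crowdsourced ranking experiment), so the selected indices are purely illustrative of the procedure.

```python
# Sketch of the excerpt-selection rule: 20 optimum rank positions spread
# over the 9,800 ranks, then for each the nearest excerpt whose inter-rater
# reliability (Krippendorff's alpha) is at least 0.6. Alpha is simulated.
import random

random.seed(42)
N_CLIPS, N_TARGETS = 9800, 20

# Optimum rank positions: (9800 / 19) * n for n in 0..19.
optimum_ranks = [9800 / 19 * n for n in range(N_TARGETS)]

# Simulated inter-rater reliability per excerpt (hypothetical values).
alpha = [random.uniform(0.0, 1.0) for _ in range(N_CLIPS)]

def select_excerpt(target):
    """Excerpt i in 0..9799 with alpha_i >= 0.6 closest to the target rank."""
    reliable = [i for i in range(N_CLIPS) if alpha[i] >= 0.6]
    return min(reliable, key=lambda i: abs(i - target))

selected = [select_excerpt(t) for t in optimum_ranks]
```

The reliability floor matters: without it, an excerpt near an optimum rank could be one whose crowdsourced position is itself unreliable.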

The Spearman's rank correlation coefficient (SRCC) between the rankings of the film clips in the LIRIS-ACCEDE database and the ratings collected in this experiment exhibited a statistically highly significant correlation for both arousal (SRCC = 0.751) and valence (SRCC = 0.795), thus validating the annotations made in various uncontrolled environments.

Hence, the goal of this paper is to use the rankings and ratings available for the 40 video clips selected for this second experiment to perform a regression analysis between the rankings and the ratings, converting the relative rankings into absolute scores.

4. REGRESSION ANALYSIS

Among all existing regression models, we used Gaussian Processes for Regression as they can model the noisiness of measurements and thus take into account the subjectivity of emotions.

Two different Gaussian Process regression models are learned in this section, one for the valence axis and a second one for arousal. From the rank given as input (ranging from 0 to 9799), the goal of each model is to predict the affective rating for the dedicated axis (ranging from 1 to 5). To learn the models, we use the crowdsourced ranks and the corresponding affective ratings gathered in Section 3.3. The variance of the annotations gathered in this controlled rating experiment is used to guide the learning of the models. Thus, these variances are needed only during the learning step and are no longer necessary to predict new affective ratings.

Knowing the rank x for valence or arousal of a video clip in the database, the goal of the Gaussian Process regression models is to estimate the score g(x) of the video clip in the affective space. Rasmussen and Williams [14] define a Gaussian Process (GP) as a collection of random variables, any finite number of which have a joint Gaussian distribution. The predictions from a GP model take the form of a full predictive distribution:

g(x) = f(x) + h(x)ᵀβ,  with  f(x) ∼ GP(0, k(x, x′))    (1)

where f(x) is a zero-mean GP, h(x) is a set of fixed basis functions, and β are additional parameters. For valence we used linear basis functions, whereas quadratic basis functions were selected for arousal.

We used the squared exponential kernel, for which, during interpolation at new values, distant observations have a negligible effect:

k(x, x′) = σf² exp( −(x − x′)² / (2l²) ) + σn² δ(x, x′)    (2)

where the length-scale l and the signal variance σf are hyperparameters, σn² is the noise variance and δ(x, x′) is the Kronecker delta. All the parameters are estimated using the maximum likelihood principle. In this work, the σn values are not hyperparameters since they represent the known variance of the annotations gathered in the controlled rating experiment

Fig. 3. Mahalanobis distances between the 40 video clips and the estimated center of mass, with respect to the estimated covariance in the ranking/rating space, for (a) valence and (b) arousal. Red points are the video clips considered as outliers.

described in Section 3.3. They are added to the diagonal of the assumed training covariance. As a consequence, the GP is also able to model the subjectivity of emotions in this experiment.
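A minimal sketch of this setup using scikit-learn [16] is given below, on simulated rank/rating pairs. The known per-clip annotation variances are passed through the `alpha` parameter of `GaussianProcessRegressor`, which adds them to the diagonal of the training covariance, playing the role of σn² in Eq. (2). Note the assumptions: scikit-learn's GP has no explicit basis-function mean h(x)ᵀβ from Eq. (1), so `normalize_y=True` is used as a crude stand-in, and all data and initial hyperparameter values are illustrative, not the paper's.

```python
# Sketch: squared-exponential GP mapping a crowdsourced rank to an affective
# rating, with known per-clip annotation variances on the covariance diagonal.
# Data are simulated; hyperparameters are fit by maximum likelihood.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

rng = np.random.default_rng(0)
ranks = np.sort(rng.choice(9800, size=38, replace=False)).reshape(-1, 1)
ratings = 1 + 4 * ranks.ravel() / 9799 + rng.normal(0, 0.2, 38)  # simulated 1-5 ratings
rating_var = rng.uniform(0.05, 0.5, 38)          # known annotation variances (simulated)

kernel = ConstantKernel(1.0) * RBF(length_scale=1000.0)  # sigma_f^2 * squared exponential
gp = GaussianProcessRegressor(kernel=kernel,
                              alpha=rating_var,  # per-point noise on the diagonal
                              normalize_y=True)
gp.fit(ranks, ratings)

# Predicted rating and predictive uncertainty for a mid-rank clip.
mean, std = gp.predict(np.array([[4900]]), return_std=True)
```

Because the annotation variances sit on the covariance diagonal rather than being fitted, clips with noisier (more subjective) ratings are automatically down-weighted during learning.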

5. OUTLIER DETECTION

To perform a regression analysis on clean data, the first stepis to detect outliers.

The Minimum Covariance Determinant (MCD) estimator introduced by Rousseeuw in [15] is a highly robust estimator of the center and scatter of a high-dimensional data set that is not influenced by outliers. Assuming that the inlier data are Gaussian distributed, it consists in finding a subset of observations whose empirical covariance has the smallest determinant. The MCD estimate of location µ is then the average of the "pure" observations in the selected subset, and the MCD estimate of scatter is their covariance matrix Σ. In this work, we used the fast MCD algorithm implemented in [16] to estimate the covariance of the 40 video clips defined in Section 3.3, each described in the 2D ranking/rating space by its rank and rating score.

Once the center and covariance matrix have been estimated, the Mahalanobis distance of the centered observations can be computed. It provides a relative measure of the distance of an observation from the center of mass that takes the correlations of the data set into account. The Mahalanobis distance of a video clip xi is defined as:

d(µ,Σ)(xi)² = (xi − µ)ᵀ Σ⁻¹ (xi − µ)    (3)

where µ and Σ are the estimated center of mass and covariance of the underlying Gaussian distribution. Figure 3 shows the shape of the Mahalanobis distances for the valence and arousal data sets.

Fig. 4. Box plots of the Mahalanobis distances (plotted as ∛d(µ,Σ)(xi)) for (a) valence and (b) arousal. The whiskers show the lowest and highest values still within 1.5 IQR. Red points are the video clips considered as outliers.

By considering the covariance of the data and the scales of the different variables, the Mahalanobis distance is useful for detecting outliers in such cases. As a rule of thumb, a video clip xi is considered an outlier if d(µ,Σ)(xi) < Q1 − 1.5 × IQR or if d(µ,Σ)(xi) > Q3 + 1.5 × IQR, with Q1 and Q3 the first and third quartiles and IQR the inter-quartile range. The box plots showing the outliers detected for valence and arousal during this process are illustrated in Figure 4.

In our experiments, two video clips are categorized as outliers for valence and three video clips for arousal. These video clips are removed from the data sets in order to perform the regression analysis only on "clean" data. As a consequence, 38 video clips are used to perform the regression analysis for valence, while for arousal the data set is composed of 37 video clips.
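The outlier step can be sketched with scikit-learn's `MinCovDet` [16], here on simulated rank/rating data with two planted outliers. The clip count matches the paper's 40, but the data, the planted indices, and the noise levels are illustrative assumptions.

```python
# Sketch of the outlier step: fast MCD estimate of location/scatter in the
# 2D rank/rating space, Mahalanobis distances as in Eq. (3), then the
# 1.5-IQR rule on those distances. Data are simulated.
import numpy as np
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(1)
ranks = np.sort(rng.choice(9800, size=40, replace=False)).astype(float)
ratings = 1 + 4 * ranks / 9799 + rng.normal(0, 0.3, 40)
ratings[[5, 30]] += [1.8, -1.8]          # two planted outliers (hypothetical)
X = np.column_stack([ranks, ratings])

mcd = MinCovDet(random_state=0).fit(X)   # fast MCD estimate of mu and Sigma
d = np.sqrt(mcd.mahalanobis(X))          # mahalanobis() returns squared distances

# Rule of thumb: flag distances outside [Q1 - 1.5 IQR, Q3 + 1.5 IQR].
q1, q3 = np.percentile(d, [25, 75])
iqr = q3 - q1
outliers = (d < q1 - 1.5 * iqr) | (d > q3 + 1.5 * iqr)
clean = X[~outliers]
```

Because the MCD covariance is estimated from the "pure" subset only, the planted outliers cannot inflate Σ and mask themselves, which is the failure mode of the classical sample covariance.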

6. RESULTS

Figure 5 shows the regression models trained on all "clean" video clips for both the valence and arousal axes. The 95% confidence interval shows that the models successfully exploited the variance of the annotations during learning.

To measure the predictive power of the learned regression models, we computed, in addition to the well-known conventional squared correlation coefficient R², the predictive leave-one-out squared correlation coefficient Q²loo, defined as:

Q²loo = 1 − [ Σᵢ₌₁ᴺ ( yᵢ^pred(N−1) − yᵢ )² ] / [ Σᵢ₌₁ᴺ ( yᵢ − y_mean^(N−1,i) )² ]    (4)

with yᵢ the true rating value of video clip i and yᵢ^pred(N−1) the prediction of the model learned on the initial training set from which video clip i was removed. Note that the arithmetic mean used in equation (4), y_mean^(N−1,i), is different for each test set and is calculated over the observed values comprised in the training set.

Fig. 5. Gaussian Process models learned for (a) valence and (b) arousal, converting ranks (horizontal axis, 0 to 9799) into ratings (vertical axis, 1 to 5). Each plot shows the prediction, its 95% confidence interval, and the video clips; black bars show the variance of the annotations gathered in Section 3.3.

Table 1. Performance of the Gaussian Process models learned to predict valence and arousal.

Measure   Valence   Arousal
R²        0.657     0.632
Q²loo     0.621     0.586

R² measures the goodness of fit of a model, while Q²loo, computed using the leave-one-out cross-validation technique, measures the model's predictive power. Both values for valence and arousal are shown in Table 1.

These results are remarkably high considering that we are modeling crowdsourced ranks and affective ratings that are both subject to the subjectivity of human emotions. Thus, our proposed regression models successfully learned to fit the input observations. Furthermore, the Q²loo values show that the models are also able to provide valid predictions for new observations.
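The leave-one-out statistic of Eq. (4) can be computed as sketched below. A simple least-squares line (`np.polyfit`) stands in for the paper's GP model, and the rank/rating data are simulated, so the resulting value only illustrates the procedure, not the paper's figures.

```python
# Sketch of Q^2_loo from Eq. (4): refit the model N times, each time without
# clip i, predict the held-out rating, and compare squared prediction errors
# to squared deviations from the training-fold mean. Data are simulated.
import numpy as np

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 9799, 30))
y = 1 + 4 * x / 9799 + rng.normal(0, 0.2, 30)   # simulated ratings

press, total = 0.0, 0.0
for i in range(len(x)):
    mask = np.ones(len(x), dtype=bool)
    mask[i] = False                              # leave clip i out
    slope, intercept = np.polyfit(x[mask], y[mask], 1)  # stand-in model
    y_pred = slope * x[i] + intercept            # prediction for held-out clip i
    press += (y[i] - y_pred) ** 2
    total += (y[i] - y[mask].mean()) ** 2        # mean over the training fold only

q2_loo = 1 - press / total
```

Note that, as in Eq. (4), the denominator uses the mean of each training fold rather than the global mean, so both the model and the baseline are evaluated without ever seeing the held-out clip.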


7. CONCLUSION

In this work, we have shown that it is possible to estimate absolute values in the emotional space using affective ranks while taking into account the subjectivity of emotions.

First, we used the relative ranks available for the 9,800 video clips of the LIRIS-ACCEDE database as well as the absolute scores available for a subset of the database. Outliers were detected using the minimum covariance determinant estimator and removed from the data set in order to create a subset of "clean" observations. Finally, a regression analysis was performed for valence and arousal. The Gaussian process regression models, which take into account the variance of the annotations of the absolute scores, achieved good performance, confirming our intuition that absolute scores in the affective space can be estimated from relative ranks.

In the near future, we plan to consider the continuous annotation process as a way to obtain finer annotations for longer video clips. Comparing these continuous annotations with absolute scores will make it possible to study the effect of memorability on the induced emotions.

8. REFERENCES

[1] P. J. Lang, M. M. Bradley, and B. N. Cuthbert, International affective picture system (IAPS): Technical manual and affective ratings. The Center for Research in Psychophysiology, University of Florida, 1999.

[2] M. Soleymani, M. Larson, T. Pun, and A. Hanjalic, "Corpus development for affective video indexing," IEEE Transactions on Multimedia, pp. 1–1, 2014.

[3] Y. Baveye, J.-N. Bettinelli, E. Dellandrea, L. Chen, and C. Chamaret, "A large video data base for computational models of induced emotion," in Affective Computing and Intelligent Interaction (ACII), 2013 Humaine Association Conference on, 2013, pp. 13–18.

[4] E. Douglas-Cowie, R. Cowie, I. Sneddon, C. Cox, O. Lowry, M. McRorie, J.-C. Martin, L. Devillers, S. Abrilian, A. Batliner, N. Amir, and K. Karpouzis, "The HUMAINE database: Addressing the collection and annotation of naturalistic and induced emotional data," in Affective Computing and Intelligent Interaction, ser. Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2007, vol. 4738, pp. 488–500.

[5] A. Schaefer, F. Nils, X. Sanchez, and P. Philippot, "Assessing the effectiveness of a large database of emotion-eliciting films: A new tool for emotion researchers," Cognition & Emotion, vol. 24, no. 7, pp. 1153–1172, Nov. 2010.

[6] S. Koelstra, C. Muhl, M. Soleymani, J.-S. Lee, A. Yazdani, T. Ebrahimi, T. Pun, A. Nijholt, and I. Patras, "DEAP: A database for emotion analysis using physiological signals," IEEE Transactions on Affective Computing, vol. 3, no. 1, pp. 18–31, Jan. 2012.

[7] S. Ovadia, "Ratings and rankings: Reconsidering the structure of values and their measurement," International Journal of Social Research Methodology, vol. 7, no. 5, pp. 403–414, 2004.

[8] A. Metallinou and S. Narayanan, "Annotation and processing of continuous emotional attributes: Challenges and opportunities," in Automatic Face and Gesture Recognition (FG), 2013 10th IEEE International Conference and Workshops on, Apr. 2013, pp. 1–8.

[9] Y.-H. Yang and H. Chen, "Ranking-based emotion recognition for music organization and retrieval," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 762–774, 2011.

[10] F. Wickelmaier and C. Schmid, "A Matlab function to estimate choice model parameters from paired-comparison data," Behavior Research Methods, Instruments, & Computers, vol. 36, no. 1, pp. 29–40, Feb. 2004.

[11] R. W. Lienhart, "Comparison of automatic shot boundary detection algorithms," in Electronic Imaging '99, 1998, pp. 290–301.

[12] M. M. Bradley and P. J. Lang, "Measuring emotion: The Self-Assessment Manikin and the semantic differential," Journal of Behavior Therapy and Experimental Psychiatry, vol. 25, no. 1, pp. 49–59, Mar. 1994.

[13] K. Krippendorff, "Estimating the reliability, systematic error and random error of interval data," Educational and Psychological Measurement, vol. 30, no. 1, pp. 61–70, Apr. 1970.

[14] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning, ser. Adaptive Computation and Machine Learning. Cambridge, Mass.: MIT Press, 2006.

[15] P. J. Rousseeuw, "Least median of squares regression," Journal of the American Statistical Association, vol. 79, no. 388, pp. 871–880, Dec. 1984.

[16] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.

