
SSPNet
Social Signal Processing Network

http://www.sspnet.eu/

Funded under the 7th FP (Seventh Framework Programme), Theme ICT-2007.2.2

[Network of Excellence]

D8.2: Progress, compendium of work done during months 1-12

Due date: 31/01/2010
Submission date: 29/01/2010
Project start date: 01/02/2009
Duration: 60 months
WP Manager: E. Hendriks
Revision: 1

Author(s): E.A. Hendriks, A. Vinciarelli, I. Poggi, M. Mehu, M. Pantic

Project funded by the European Commission in the 7th Framework Programme (2009-2013)

Dissemination Level:
PU Public: No
RE Restricted to a group specified by the consortium (includes Commission Services): No
CO Confidential, only for members of the consortium (includes Commission Services): Yes


D8.2: Progress, compendium of work done during months 1-12

Abstract: A collaborative effort within WP8 of the SSPNet Network of Excellence is to automatically model, detect, and interpret social signals displayed in political debates. This report describes the results achieved during year 1 of the SSPNet.


Contents

1 Introduction

2 Integration Efforts

3 Achievements

4 Future Achievements

References


1 Introduction

Within SSPNet, the objective of WP8 is to automatically model, detect and interpret social signals displayed in political debates. The three main goals identified are to: 1) identify the participants that have the same opinion on a subject (i.e., measure agreement), 2) identify the speaker(s) that convey their message best (i.e., measure communication effectiveness), and 3) rank the participants according to their status.

As the goal of WP8 is the modelling, detection and automatic interpretation of social signals in political debates, the data used for research in WP8 should mainly consist of recordings of political debates. After analysing a number of databases (see also deliverable D8.1), we decided to use the Canal 9 political debates as the integration data that all partners will use as their main source of data for political debates.

At the beginning of the project it turned out that the goals identified above are rather ambitious. We therefore decided, at least in the first year, to focus on the (automatic) modelling, detection and interpretation of agreement/disagreement in the Canal 9 political debates. To work on this, it is important to have enough recorded material of agreement/disagreement events annotated and available in an easily readable format, to be used e.g. as ground truth. A major effort for the first year was therefore to select enough clips containing agreement/disagreement and to annotate these in terms of pose, gesture, facial action and related speech. Annotation of Facial Action Units requires trained experts; since at the beginning of the project only a few researchers could do this, we decided to organise a course to enlarge the group of potential FACS annotators within (and outside) our WP.

For computer-supported detection of agreement/disagreement, many basic tools need to be developed to process audio and video streams (e.g. speaker separation and detection of overlapping speech in audio streams; skin segmentation in video streams for the detection of hand/face blobs; single-person video shot cuts; face/person recognition; face pose detection; detection of facial feature points and facial action units). So another major effort of the first year was to start collaboration between partners to develop these tools and make them available to all partners or even to the whole community.

In this deliverable we will report on the achievements that have been realized this year.

2 Integration Efforts

Since the WP8 partners come from rather different scientific disciplines (e.g. psychology on the one hand and computer science on the other), another important goal of the first year was to understand each other's scientific problems better, to learn each other's 'language', and to start interdisciplinary collaboration on the issues above. To facilitate this process it was decided to have regular meetings, preferably face-to-face. We had face-to-face meetings in February (London), April (London), September (Delft) and December (Rome); in between we also organised a video meeting in June. All WP8 partners were represented in all of these meetings. To stimulate the integration of research efforts, we encouraged researchers to visit each other's laboratories to discuss collaboration and research results and to make collaboration plans for the near future. Exchanges/visits of researchers took place between:

• University of Geneva and University Roma Tre


• University of Delft and Idiap

• University of Delft and Imperial College

• Idiap and Imperial College

Furthermore, joint publications are in progress or already submitted (Imperial College/Idiap, University of Delft/Imperial College, University of Geneva/Imperial College). Integration with the other partners comes more or less naturally, since most of the WP8 partners are also involved in the other Thematic Workpackages. The overview of achievements below indicates which other thematic workpackages can and will benefit from the efforts and results in WP8.

3 Achievements

In this section we report the achievements of WP8 obtained by the partners in the first year of the project. Each achievement starts with a short description and then lists the tangible outcome, dissemination activities, partners involved, related publications and the relationship with the other thematic WPs.

1. Collection of a bibliography on linguistic aspects of social interactions (with particular attention to agreement and disagreement).

Description: We have identified the linguistic aspects most important for the analysis of political debates and, in more general terms, of conversations. We paid particular attention to (dis-)agreement, dominance, status, role, and emotions. The collected references come in particular from the following areas: conversation analysis, pragmatics, phonetics and linguistics. This work has also been useful to identify expertise gaps in the consortium.

Tangible outcome: A bibliography containing 150 references.

Partners involved: Idiap Research Institute, Università Roma Tre

Dissemination: The bibliography is available on the portal.

Related Publication(s): N/A

Links to other thematic WPs: Useful for group interactions (WP10).

2. Collection and annotation of the Canal9 database of political debates.


Description: Seventy political debates were collected and annotated in a way that will facilitate the analysis of socially relevant phenomena (for a total of 43 hours and 10 minutes). The annotations include:

• Manual speaker segmentation

• Role of debate participants (moderator vs guest)

• Agreement and disagreement (composition of groups opposing one another)

• Automatic speaker segmentation (output of an automatic speaker clustering system)

• Manual video shot segmentation

• Automatic video shot segmentation

• Manual video shot classification: each shot is classed as 'personal shot' (a shot showing only one person) or 'other'

• Manual identification of people in personal shots

Tangible outcome:

• The Canal 9 Database

• A paper reporting the full description of the database and its annotations.

Partners involved: Idiap Research Institute

Dissemination: The corpus is available on the SSPNet Portal at the following URL: http://sspnet.eu/data/canal-9-political-debates/

Related Publication(s):

• A. Vinciarelli, A. Dielmann, S. Favre, and H. Salamin, "Canal9: A Database of Political Debates for Analysis of Social Interactions," in Proceedings of the International Conference on Affective Computing and Intelligent Interaction, Vol. 2, pp. 96-99, 2009.

Links to other thematic WPs: The political debates are useful for WP10 (they portray conflictual group interactions) and for WP9 (the debates are a source of nonverbal cues related to polite behavior).

3. Development of an approach for automatic role recognition in conversations.


Description: We have developed an automatic approach for the recognition of roles in broadcast data (news and talk-shows) and in meetings. In broadcast data, the roles correspond to the functions that people fulfill in the structure of a given program (e.g., 'Anchorman', 'Guest', 'Headline Person', etc.). In meetings, roles correspond to the positions of meeting participants in the ideal company in the framework of which the meetings are held (e.g., 'Project Manager', 'Marketing Expert', etc.). While the broadcast data is real, the meetings are acted.

The approach includes three major steps:

• Segmentation of the data into single-speaker intervals, i.e. time segments where only one person talks or holds the floor. The result is the turn-taking, i.e. the sequence of turns composing the conversation portrayed in the data.

• Extraction of a social network from the turn-taking. People that talk in the same time intervals are assumed to interact, and this information is sufficient to extract a social network. The network is used to represent each interaction participant with a feature vector (a sketch follows below).

• Role assignment. This step assigns a role to each interaction participant using machine learning and pattern recognition techniques (in particular, Bayesian classifiers and Hidden Markov Models).

The results show that roles are recognized with an accuracy (percentage of correctly labeled time in terms of role) of around 80% in the case of broadcast data and 45% in the case of meeting recordings. The difference is due in part to the fact that roles in broadcast data have a stronger influence on the behavior of people, and in part to the fact that the meetings are simulated and people do not play the roles they play in their real life.
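To make the second step concrete, the following sketch shows one simple way to derive social affiliation network features from a turn-taking segmentation: speakers active within the same fixed-length time window are assumed to interact, and each speaker is represented by co-occurrence counts with all other speakers. This is an illustrative reconstruction in Python, not the actual Idiap package; the turn representation and the 30-second window are assumptions.

```python
from collections import defaultdict

def affiliation_features(turns, window=30.0):
    """Represent each speaker by co-occurrence counts with the other
    speakers: people talking within the same fixed-length window are
    assumed to interact (a social affiliation network derived from
    the turn-taking).

    turns: list of (speaker, start_sec, end_sec) single-speaker segments.
    """
    # Assign each turn to every window it overlaps.
    window_members = defaultdict(set)
    for speaker, start, end in turns:
        for w in range(int(start // window), int(end // window) + 1):
            window_members[w].add(speaker)

    # Count pairwise co-occurrences inside each window.
    features = defaultdict(lambda: defaultdict(int))
    for members in window_members.values():
        for a in members:
            for b in members:
                if a != b:
                    features[a][b] += 1
    return {a: dict(b) for a, b in features.items()}

turns = [("moderator", 0.0, 12.5), ("guest1", 12.5, 40.0),
         ("moderator", 40.0, 55.0), ("guest2", 55.0, 95.0)]
print(affiliation_features(turns))  # feature vectors for a classifier
```

The resulting per-speaker vectors would then feed the classifiers of the third step.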

Tangible outcome:

• A package for analysis and representation of turn-taking

• Publications (see below)

Partners involved: Idiap Research Institute

Dissemination: The publications are available on the SSPNet Portal

Related Publication(s):

• H. Salamin, S. Favre, and A. Vinciarelli, "Automatic Role Recognition in Multiparty Recordings: Using Social Affiliation Networks for Feature Extraction," IEEE Transactions on Multimedia, Vol. 11, no. 7, pp. 1373-1380, November 2009.

• A. Vinciarelli, "Capturing Order in Social Interactions," IEEE Signal Processing Magazine, Vol. 26, no. 5, pp. 133-137, September 2009.

• S. Favre, A. Dielmann, and A. Vinciarelli, "Automatic Role Recognition in Multiparty Recordings Using Social Networks and Probabilistic Sequential Models," in Proceedings of the ACM International Conference on Multimedia, pp. 585-588, 2009.


Links to other thematic WPs: This work is relevant to WP10 because roles are a key aspect of group interactions.

4. Development of an approach for the detection of agreement and disagreement in competitive discussions.

Description: We developed an approach for the detection of disagreement between participants of competitive discussions, i.e. discussions where people defend different opinions in more or less conflictual terms. The approach is based on a preference structure (observed in pragmatics) typical of conflicts: people tend to react to someone they disagree with rather than to someone they agree with. This means that the speaker talking at turn n is statistically dependent on the speaker talking at turn n-1, and that a Markov Chain can capture the structure of agreement and disagreement (who agrees and disagrees with whom) in a discussion.

The approach includes two major steps:

• Segmentation of the data into single-speaker intervals, i.e. time segments where only one person talks or holds the floor. The result is the turn-taking, i.e. the sequence of turns composing the conversation portrayed in the data.

• Alignment of the sequence of turns with a Markov Chain whose states correspond to the two fronts opposing one another in the discussion (a simplified sketch follows below).

The approach has been tested on a subset of the Canal 9 database (see the corresponding achievement of this WP) including the 45 debates with 4 participants (the rest of the corpus consists of debates with 2 or 3 participants). The results show that 66% of the debates are correctly reconstructed in terms of who agrees and disagrees with whom. The performance of a system reconstructing the same structure randomly is only 7%, and the difference with respect to our approach is statistically significant.
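A minimal sketch of the underlying idea, assuming the turn-taking is given as a list of speaker labels: enumerate the possible partitions of the participants into two fronts and keep the partition under which the largest number of adjacent turns cross fronts. This simplified cross-front count stands in for the full Markov Chain alignment, which maximizes the likelihood of the turn sequence under the chain.

```python
from itertools import combinations

def reconstruct_fronts(turn_sequence):
    """Pick the two-front partition of the participants that best
    matches the pragmatic preference structure: speakers tend to
    react to opponents, so adjacent turns should cross fronts."""
    speakers = sorted(set(turn_sequence))
    others = speakers[1:]            # fix speakers[0] in front A to
    best, best_score = None, -1      # avoid counting partitions twice
    for k in range(len(others)):     # front B stays non-empty
        for combo in combinations(others, k):
            front_a = {speakers[0], *combo}
            # Number of adjacent turn pairs taken by opposite fronts.
            score = sum((prev in front_a) != (cur in front_a)
                        for prev, cur in zip(turn_sequence, turn_sequence[1:]))
            if score > best_score:
                best, best_score = front_a, score
    return best, set(speakers) - best

print(reconstruct_fronts(["A", "B", "A", "B", "C", "D", "C", "B"]))
```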

Tangible outcome: An automatic system for the reconstruction of agreement and disagreement structures in conversations.

Partners involved: Idiap Research Institute

Dissemination: Related publications are available on the SSPNet portal. Furthermore, the Canal 9 database is available on the SSPNet portal, so the results of our papers can be reproduced.

Related Publication(s):

• A. Vinciarelli, "Capturing Order in Social Interactions," IEEE Signal Processing Magazine, Vol. 26, no. 5, pp. 133-137, September 2009.

• A. Vinciarelli, H. Salamin, and M. Pantic, "Social Signal Processing: Understanding social interactions through nonverbal behavior analysis," in Proceedings of the Workshop on Computer Vision and Pattern Recognition for Human Behavior, pp. 42-49, 2009.


Links to other thematic WPs: This work is relevant to WP10 as well, as disagreement can have a major impact on the life of a group.

5. Implementation of an audio speaker segmenter (suitable for the automatic segmentation of the debate videos)

Description: An audio segmenter based on a Hidden Markov Model with a minimum duration constraint, as given in [Ajmera and Wooters, 2003] and [Ajmera, 2005], has been implemented in Matlab. The unsupervised segmentation automatically determines the most likely number of speakers (clusters) in the sequence. The Hidden Markov Model represents each cluster/speaker by a Gaussian Mixture Model with a varying number of Gaussians. Through the subsequent merging of similar clusters, the optimal number of speakers is found when the likelihood is maximized. This approach is in principle applicable not only to audio, but also to other types of sequence data.
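As an illustration of the minimum duration constraint, the sketch below builds the transition matrix of an HMM in which every speaker state is expanded into a chain of sub-states, so that any visit to a state lasts at least d_min frames. This is a hedged Python reconstruction (the actual package is in Matlab), and p_stay and d_min are placeholder values; the GMM emission models and the likelihood-driven merging of clusters would be layered on top of this transition structure.

```python
import numpy as np

def min_duration_transitions(n_states, d_min, p_stay=0.9):
    """Expand each HMM state into a chain of d_min sub-states so that
    every visit to a speaker state lasts at least d_min frames.
    Returns the transition matrix over the n_states * d_min sub-states."""
    n = n_states * d_min
    A = np.zeros((n, n))
    for s in range(n_states):
        base = s * d_min
        # Forced forward moves through the first d_min - 1 sub-states.
        for i in range(d_min - 1):
            A[base + i, base + i + 1] = 1.0
        # Last sub-state: self-loop, or jump to the entry of any state.
        last = base + d_min - 1
        A[last, last] = p_stay
        p_leave = (1.0 - p_stay) / n_states
        for t in range(n_states):
            A[last, t * d_min] += p_leave
    return A

A = min_duration_transitions(n_states=3, d_min=4)
assert np.allclose(A.sum(axis=1), 1.0)  # each row is a distribution
```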

Tangible outcome: A Matlab software package for constructing, training and evaluating HMMs with a minimum duration constraint.

Partners involved: TU Delft, Idiap Research Institute

Dissemination: The software is currently available to partners.

Related Publications:

• J. Ajmera and C. Wooters, "A robust speaker clustering algorithm," in IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2003.

• J. Ajmera, Robust audio segmentation, PhD thesis, 2005.

Links to other thematic WPs: Useful for group interactions (WP10).

6. Setting up the Canal 9 Database for public use.

Description: The Canal 9 Political Debates corpus is the dataset chosen for WP8, as outlined in Section 3 of D8.1. The Canal 9 Database has now become available online through the SSPNet portal, providing a common corpus not only for all partners but also for others outside the network to work on when researching issues related to WP8 (or any other issues pertaining to the analysis of real-life face-to-face interaction).

Tangible outcome: The Canal 9 Database is available online and is easy to access and download.

Partners involved: Imperial College, IDIAP


Dissemination: The corpus is available on the SSPNet Portal at the following URL:

http://sspnet.eu/data/canal-9-political-debates/

Related Publications:

• A. Vinciarelli, A. Dielmann, S. Favre, and H. Salamin, "Canal9: A Database of Political Debates for Analysis of Social Interactions," in Proceedings of the International Conference on Affective Computing and Intelligent Interaction, Vol. 2, pp. 96-99, 2009.

Links to other thematic WPs: Canal 9 is a database containing over 70 real political debates, so it could naturally be used for other thematic WPs as well, although this is not explicitly specified in those WPs.

7. A survey of the nonverbal auditory and visual cues that could be present during (dis)agreement, and of possible tools to detect them.

Description: This was the first step towards our eventual objective: the automatic detection of (dis)agreement based on the presence and temporal dynamics of the relevant behavioral cues. A review of the social psychology literature was completed in order to survey all visual and auditory cues which could be present during (dis)agreement. Furthermore, a list of tools that could be used for the detection of these cues was compiled.

Tangible outcome:

• Publications (see below).

• Bibliography of 83 references

• A list of tools ready to use towards the WP8 objective, after minor adaptations.

Partners involved: Imperial College, University of Geneva

Dissemination: The bibliography is available via the following URL (to be copied to the SSPNet web portal as well): http://www.doc.ic.ac.uk/~kb709/a_da_bib.htm

Related Publications:

• K. Bousmalis, M. Mehu, and M. Pantic, "Spotting Agreement and Disagreement: A Survey of Nonverbal Audiovisual Cues and Tools," in Proceedings of the International Conference on Affective Computing and Intelligent Interaction, Vol. 2, 2009.

Links to other thematic WPs: Agreement and disagreement are among the most common interactive signals and, hence, detecting them is useful for the analysis of any interpersonal interaction, including group interactions (WP10).

8. Segmentation and annotation of 67 (dis)agreement episodes from Canal 9 debates


Description: We went through 5 debates of the Canal 9 corpus and found 11 agreement episodes and 56 disagreement episodes. These episodes were annotated for events of smile, overlapping speech, up-down hand movement, front-back head movement, sudden 'cut-off' of gaze, large head shift, and suddenly-parted lips.

Tangible outcome: A set of 67 segmented and annotated episodes of spontaneous agreement and disagreement from the Canal 9 dataset, to be used for training and testing tools for automatic detection of (dis)agreement.

Partners involved: Imperial College

Dissemination: The set will be made available to the interested WP8 partners once it is complete (additional episodes of agreement need to be segmented and annotated).

Related Publications: N/A

Links to other thematic WPs: Events like smiles, overlapping speech, front-back head movement, etc., are present not only in episodes of agreement and disagreement but in other interpersonal interactions as well. Hence, the annotation of agreement and disagreement episodes in terms of these events could be useful for all other thematic WPs.

9. Segmentation and annotation of various hand-gesture events in (dis)agreement episodes from Canal 9 debates

Description: We went through 20 debates of the Canal 9 corpus and segmented various events of (dis)agreement-related hand gestures:

• 162 instances of Forefinger Raise (vertical movement of the arm with the forefinger erect)

• 130 instances of Half-Forefinger Raise (similar to the Forefinger Raise, but only half the forefinger is erect, with the upper part curled down towards the fist)

• 45 instances of Folded Arms (static)

• 17 instances of Hand Wag (left-right movement of the hand with an open palm facing away from the actor)

• 13 instances of Arm Folding (movement)

• 8 instances of Forefinger Wag (movement of an erect forefinger left and right, with the palm facing away from the actor)

• 3 instances of Hand Scissor (movement of the hands imitating a pair of scissors; the starting position finds the hands crossed at the wrists with open palms; after a sudden outward movement the palms move away from each other)

• 2 instances of Hand Cross (double-handed version of the Hand Wag)

Tangible outcome: A set of segmented and annotated episodes of various hand gestures related to spontaneous agreement and disagreement episodes from the Canal 9 dataset, to be used for training and testing tools for automatic detection of hand actions/gestures.


Partners involved: Imperial College

Dissemination: The set will be made available to the interested WP8 partners once it is complete (additional episodes of certain hand gestures need to be segmented and annotated).

Related Publications: N/A

Links to other thematic WPs: The annotated hand gestures may be present not only in episodes of agreement and disagreement but in other interpersonal interactions as well. Hence, their annotation could be useful for all other thematic WPs.

10. FACS analysis of facial behaviour in instances of agreement/disagreement.

Description: We went through 7 political debates of the Canal 9 database and selected 88 subclips representing agreements and disagreements (36 agreements, 52 disagreements). Of these, 61 subclips (29 agreements, 32 disagreements) have been FACS coded, and the facial expressions shown in agreement and disagreement have been analysed, yielding preliminary results on the prevalence of individual facial action units in agreement and disagreement.

The selection was based on verbal content, i.e. moments when individuals express their agreement/disagreement directly ('I agree', 'I don't agree') or indirectly (they contradict or approve an argument made earlier in the debate). Once these instances were spotted, we extracted the video portion in which the individuals can be seen in frontal view (represented in the time code information). The facial expression information is operationalized in 5 different ways: the frequency of occurrence of each movement, its relative duration in the subclip, the average duration per movement, the intensity of the movement, and the proportion of people showing the movement. For each individual we tried to obtain instances of agreement and disagreement, as well as a neutral statement, to be used in a within-subject analysis of variance.
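For illustration, the sketch below computes four of the five measures for one subclip from a hypothetical list of coded AU events (the fifth, the proportion of people showing the movement, is aggregated across subclips). The event tuple format and the mapping of the FACS A-E intensity scale to 1-5 are assumptions.

```python
from collections import defaultdict

def au_statistics(annotations, clip_duration):
    """Per-AU measures for one subclip: frequency of occurrence,
    relative duration in the subclip, average duration per occurrence,
    and mean intensity.

    annotations: list of (au, start_sec, end_sec, intensity) events,
    with intensity as the FACS A-E scale mapped to 1-5 (assumed)."""
    acc = defaultdict(lambda: {"n": 0, "dur": 0.0, "ints": []})
    for au, start, end, intensity in annotations:
        a = acc[au]
        a["n"] += 1
        a["dur"] += end - start
        a["ints"].append(intensity)
    return {au: {"frequency": a["n"],
                 "relative_duration": a["dur"] / clip_duration,
                 "avg_duration": a["dur"] / a["n"],
                 "mean_intensity": sum(a["ints"]) / len(a["ints"])}
            for au, a in acc.items()}

clip = [("AU12", 1.0, 2.5, 3), ("AU12", 4.0, 4.8, 2), ("AU4", 0.5, 1.2, 4)]
print(au_statistics(clip, clip_duration=10.0))
```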

Tangible outcome: A list of 88 subclips (with time codes) representing agreements and disagreements, and the FACS analysis of 61 of them. Preliminary results on the occurrence of individual facial movements in agreement and disagreement.

Partners involved: Idiap Research Institute, University of Geneva, University Roma Tre

Dissemination: The selection of 88 subclips and the FACS coding of 61 of them are available on the portal.

Related Publications: Article in preparation: An exploratory analysis of facial movements involved in agreement and disagreement.


Links to other thematic WPs: Related to the facial expression of politeness (WP9) and to the facial expressions present in group interactions (WP10).

11. Implementation of a tracker that tracks 24 facial feature points and constructs a facial appearance model for all frames in the debate movies in which a frontal face shot is shown.

Description: The tracker is a Lucas-Kanade style tracker that fits an active appearance model to each video frame in order to (1) identify the locations of 24 facial feature points and (2) build an appearance model for the depicted face in each frame of the debate videos. The tracker runs at about 8 frames per second in its current Matlab implementation. The active appearance model we have used so far consists of 9 parameters that model the locations of the facial feature points and 25 parameters that model the appearance of the depicted face. These 9 + 25 parameters are computed for each video frame and can be used to, e.g., identify the depicted person, detect nodding, recognize whether the depicted person is speaking, perform basic AU classification, and perform gaze tracking.

Two example fits of the active appearance model on frames from the Canal 9 debate videos are shown in Figures 1 and 2. The left column of each figure shows the facial feature point annotations of the active appearance model, overlaid on the original video frame as red crosses; the right column shows the computer-generated appearance model overlaid onto the frame on which it was generated. Note that the underlying model generating the facial appearance consists of only 9 + 25 parameters, and that these parameters capture information on, for instance, identity, face orientation, the presence of glasses, and the position of the mouth (open or closed).

We also developed an extension of the standard active appearance model that uses a more sophisticated description of both facial feature point and facial appearance variation, without significantly increasing the computational requirements of fitting the model to video frames. We showed that the extended model outperforms the standard model on standard reference datasets. The results of experiments with the tracker are described in an extensive technical report; a paper on the extended active appearance model has been submitted to the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR).
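The sketch below illustrates only the linear-model core of such a tracker: given a PCA mean and basis, it recovers the parameters that best reconstruct an observation, which is done separately for the feature-point shape (9 parameters) and the face texture (25 parameters). The full tracker additionally runs the iterative Lucas-Kanade search over warps; the toy dimensions and random data are placeholders.

```python
import numpy as np

def fit_linear_model(mean, basis, observation):
    """Recover parameters of a linear (PCA) model so that
    observation ~ mean + basis @ params, in the least-squares sense."""
    params, *_ = np.linalg.lstsq(basis, observation - mean, rcond=None)
    return params, mean + basis @ params

rng = np.random.default_rng(0)
# Toy stand-ins: 24 (x, y) feature points -> 48 values, 9 shape modes.
shape_mean = rng.normal(size=48)
shape_basis = np.linalg.qr(rng.normal(size=(48, 9)))[0]  # orthonormal modes
observed = shape_mean + shape_basis @ rng.normal(size=9)  # synthetic frame
b_shape, recon = fit_linear_model(shape_mean, shape_basis, observed)
assert np.allclose(recon, observed)  # 9 numbers describe this shape
```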

Tangible outcome: Implementation of active appearance models, a Lucas-Kanade tracker that employs these models, and a large number of related utility functions. A paper presenting an extended variant of the standard active appearance model that is better capable of representing the large variations in facial feature point locations and facial appearance present in the debate data.

Partners involved: Delft University of Technology.


Figure 1: Active appearance model fit on Canal 9 debate frame (example 1).

Figure 2: Active appearance model fit on Canal 9 debate frame (example 2).


Dissemination: The paper will be posted on the portal. The implementation of the active appearance model will also be made publicly available in due course; at present, it is mainly useful for researchers with some technical background. We aim to shortly post on the portal the 9 + 25 face features that our tracker produces for all debate movies. These features may be used, e.g., by Imperial College to enhance the quality of the AU classifications.

Related Publications:

• Laurens van der Maaten and Emile Hendriks. Capturing Appearance Variation in Active Appearance Models. Technical Report EWI-ICT TR 2009-002, Delft University of Technology, 2009.

• A short version has been submitted to the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR 2010).

Links to other thematic WPs: The implementation of active appearance models may be useful to WP10, in particular as one of the modes in multimodal recognition of social signals and in the synthesis of virtual characters.

12. Implementation of an easy-to-use semi-automatic shot detection and annotation tool.

Description: The provided tool (a Windows executable) performs rapid automatic shot change detection on a debate movie. After the automatic shot change detection is complete, the user can rapidly adapt the resulting annotations in a graphical user interface; a screenshot of this interface is shown in Figure 3. The interface makes it possible to rapidly merge shots that are similar (e.g., that depict the same person) and to split shots when a shot change was missed. Overall, non-technical users can perform the shot annotation of a 40-minute debate movie in approximately 30 minutes. The resulting shot annotation is a basic requirement for subsequent automatic computerized analysis of the videos, but can also be used by social psychologists to quickly select shots in which relevant facial expressions or gestures are present.
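A minimal sketch of the automatic first stage, assuming grey-level frames as NumPy arrays: successive frames whose intensity histograms differ by more than a threshold are flagged as shot changes. The actual tool's detector and threshold are not documented here, so this is an illustrative stand-in; the manual merge/split stage then corrects its mistakes.

```python
import numpy as np

def detect_shot_changes(frames, threshold=0.4, n_bins=32):
    """Flag frames whose grey-level histogram differs strongly from
    the previous frame. Returns indices where a new shot starts."""
    cuts, prev_hist = [], None
    for i, frame in enumerate(frames):
        hist, _ = np.histogram(frame, bins=n_bins, range=(0, 256))
        hist = hist / hist.sum()  # normalise to compare across frames
        if prev_hist is not None:
            # L1 distance lies in [0, 2]; large values indicate a cut.
            if np.abs(hist - prev_hist).sum() > threshold:
                cuts.append(i)
        prev_hist = hist
    return cuts

# Demo on synthetic frames: two flat 'shots' with a cut at frame 5.
frames = ([np.full((120, 160), 40, np.uint8)] * 5
          + [np.full((120, 160), 200, np.uint8)] * 5)
print(detect_shot_changes(frames))  # -> [5]
```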

Tangible outcome: A Windows executable that performs semi-automatic shot change annotation.

Partners involved: Delft University of Technology.

Dissemination: The shot detection tool is available on the portal.

Related Publications: None.


Figure 3: Screenshot of the shot detection tool.

Links to other thematic WPs: May be useful to WP10, in particular as a basis for algorithms for multimodal recognition of social signals.

13. Implementation of an automatic skin detection tool.

Description: The tool automatically detects skin-colored blobs in a video. It combines a skin color model developed by Lichtenauer (currently at Imperial College, previously Delft University of Technology) with heuristics on the size, shape, and location of the detected skin-colored pixels. The tool runs faster than real time, so it can be used for rapid analysis of, for instance, hand movements. The skin-colored blob detections can serve as a basis for subsequent automatic analyses, such as hand or head tracking. The detections are saved in a text file that can easily be imported into, for instance, Microsoft Excel for further analysis of head or hand movements. The simplicity of the tool makes it usable by non-computer experts. The software also provides a graphical user interface that allows the user to retrain the color model, which may be useful, for instance, when the lighting conditions of the videos are very different; retraining is performed by manually selecting skin-colored blobs in an annotation tool. Like the skin detection tool, the annotation tool is simple enough to be used by non-computer experts such as social psychologists. A screenshot of the annotation tool and an example skin detection result on a similar frame are shown in Figures 4 and 5, respectively.
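The sketch below shows the general shape of such a detector, assuming SciPy is available for connected-component labelling. A fixed chrominance rule (a common textbook range in YCbCr space) stands in for Lichtenauer's trainable, self-calibrating colour model, and the min_area heuristic is a placeholder.

```python
import numpy as np
from scipy import ndimage  # connected-component labelling

def detect_skin_blobs(rgb, min_area=400):
    """Threshold chrominance to find skin-coloured pixels, then keep
    only large connected blobs (likely faces or hands). Returns
    bounding boxes (x_min, y_min, x_max, y_max)."""
    rgb = rgb.astype(np.float32)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    cb = 128 - 0.169 * r - 0.331 * g + 0.500 * b
    cr = 128 + 0.500 * r - 0.419 * g - 0.081 * b
    mask = (77 <= cb) & (cb <= 127) & (133 <= cr) & (cr <= 173)

    labels, n = ndimage.label(mask)  # group skin pixels into blobs
    blobs = []
    for lab in range(1, n + 1):
        ys, xs = np.nonzero(labels == lab)
        if ys.size >= min_area:  # drop tiny, noisy detections
            blobs.append((xs.min(), ys.min(), xs.max(), ys.max()))
    return blobs  # e.g. input for hand or head tracking
```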

Tangible outcome: Windows executable that performs automatic skin detection.


Figure 4: Example of skin annotation tool that is used to train the skin color model.

Figure 5: Example of skin detection result on a Canal 9 debate video frame.


Partners involved: Delft University of Technology and Imperial College.

Dissemination: The tool is available on the web portal.

Related Publications:

• Jeroen Lichtenauer, Marcel J.T. Reinders, and Emile A. Hendriks. A self-calibrating chrominance model applied to skin color detection. In Proceedings of the VISAPP Conference, Vol. 1, pp. 115-120, 2007.

Links to other thematic WPs: May be useful to WP10, in particular as a basis for algorithms for multimodal recognition of social signals.

14. Detection of Action Units in video sequences.

Description: For the automatic analysis of human behavior in video content, the detection and classification of facial expressions (in terms of Action Units, AUs) are important elements. The first question is whether detecting the presence of an AU requires modeling the full time series, or whether the presence of a single key frame may be sufficient. The paper investigates two different approaches to classifying sequences of events: in the first, the time series is explicitly modeled using a Conditional Random Field (CRF) (Lafferty et al., 2001); in the second, a multi-instance learning approach is followed, in which the object or event is represented not by a single feature vector but by a bag of feature vectors. The results in the paper show that for most AU detection problems the detection of a key frame is to be preferred over modeling the full time series.
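The key-frame finding maps naturally onto the multi-instance view, sketched below: a video segment is a bag of per-frame feature vectors, and its score is the maximum of the per-frame scores, so a single convincing key frame suffices to flag the AU. The per-frame classifier and the 2-D features here are hypothetical stand-ins, not the submitted paper's models.

```python
import numpy as np

def bag_score(frame_features, frame_classifier):
    """Multi-instance scoring: label the bag (video segment) by its
    best frame. Returns the bag score and the key-frame index."""
    scores = np.array([frame_classifier(f) for f in frame_features])
    return float(scores.max()), int(scores.argmax())

# Hypothetical per-frame AU classifier: a linear score on a 2-D
# feature (say, lip-corner displacement and its speed).
w = np.array([1.5, -0.5])
classifier = lambda f: float(w @ f)

segment = [np.array([0.1, 0.2]), np.array([0.9, 0.1]), np.array([0.3, 0.4])]
score, key = bag_score(segment, classifier)
print(f"AU present: {score > 0.5} (key frame {key})")
```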

Partners involved: Delft University of Technology, Imperial College

Dissemination: Paper submitted to ICPR.

Related Publications:

• J. Lafferty, A. McCallum, and F. Pereira, "Conditional random fields: Probabilistic models for segmenting and labeling sequence data," in Proceedings of the International Conference on Machine Learning (ICML), 2001.

Links to other thematic WPs: Useful for group interactions (WP10)

15. Research on the notions of opinion, agreement/disagreement and their signals, with a special focus on the head nod as a polysemic social signal.


Description: The research on agreement, opinion and head nods went through various steps. Bibliographic research was conducted on agreement, opinion and head movements, and in particular on the head nod. 50 nods from the Canal 9 debates were annotated considering, beyond the semantic core of acceptance (as in Darwinian speculation), the meaning added by the previous turn and the other signals occurring simultaneously (e.g. gaze or expressivity parameters). An annotation scheme was tuned which interprets each nod in terms of its goal or meaning. Thanks to it, various types of nods were identified, taking into account the differences between nods of the speaker, of the listener, and of a third listener not involved in the dialogue act. Based on this typology, a procedure for the detection, recognition and interpretation of nods was proposed, considering visual, audio and semantic cues.

Tangible outcome:

• Selection of 50 nods

• Annotation scheme of head nods, distinguished on the basis of head movements, expressivity parameters, the previous speech act, and other simultaneous signals like gaze.

• An oral presentation at the Workshop "Social Signals Foundation. An outline": D'Errico F., Vincze L. and Poggi I., "Opinion, agreement and nods", Rome, 3-5 December 2009.

• Three papers ([Poggi et al., 2009a], [Poggi et al., b], [Poggi et al., c])

Partners involved: Roma Tre University

Dissemination: Laboratories with students at Roma Tre University:

• "The passion of debates" (Poggi, Vincze). Audience: undergraduate students of Education Sciences.

• "To agree or not agree?" (Poggi, D'Errico). Audience: undergraduate students of Education Sciences.

• Workshop "Social Signals Foundation. An outline" (Rome, 3-5 December). Audience: international workshop, plus the Virtual Learning Center of the SSPNet portal.

Related Publications:

• Poggi I., D'Errico F. and Vincze L., "Fare sì col capo. Polisemia di un segnale sociale." Paper presented at the VI AISC Congress (Italian Cognitive Science Association), Naples, 25-26 November 2009.

• Poggi I., D'Errico F. and Vincze L., "Nodding: not only agreement." Paper submitted to LREC (Language Resources and Evaluation Conference), Malta, 22-24 May 2010.

• Poggi I., D'Errico F. and Vincze L., "Types of nod. Polysemy of a social signal." Paper submitted to a Special Issue of the Journal on Multimodal User Interfaces.


Links to other thematic WPs: Useful for group interactions (WP10) and politeness (WP9).

16. Research on the morphemes of gaze. Empirical studies on the meaning of specific aspects of gaze communication, with particular focus on the position of the eyelids.

Description: In two studies, exploiting quantitative and qualitative methodologies, two questionnaires, one with multiple-choice questions and one with open questions, asked subjects to attribute meanings to items of gaze with different positions of the eyelids. Results show that some eyelid positions have a morphemic value, in that they recurrently convey one and the same meaning whatever the item of gaze they belong to.

Tangible outcome: A clearer view of the meanings of gaze, possibly useful for gaze recognition, and two papers in press ([Poggi et al., a], [Poggi et al., 2009b]).

Partners involved: Roma Tre University

Dissemination: Laboratories with students at Roma Tre University: "Talking eyes. Playing with Greta" (Poggi, D'Errico, Spagnolo).

Related Publications:

• Poggi I., D'Errico F. and Spagnolo A. (in press). "The embodied morphemes of gaze." In S. Kopp (Ed.), Proceedings of the Gestures Workshop 09. Springer-Verlag.

• Poggi I., Spagnolo A. and D'Errico F. (in press). "The morphemes of the eyelids." In Proceedings of "Comunicazione Parlata." Napoli: Liguori.

Links to other thematic WPs: Useful for WP9 and WP10

17. Research on social emotions.

Description: In view of multimodal interfaces capable of a detailed representation of the user's possible emotions, the emotion of bitterness was analyzed in terms of its mental ingredients, i.e. the beliefs and goals represented in the mind of a person when feeling the emotion. The ingredients found in a pilot study were tested through qualitative analysis of a further questionnaire, which confirmed the hypothesized ingredients and further revealed the different nature of bitterness across ages and across types of work.

Tangible outcome: A model of an emotion unfortunately widespread in everyday life.

Partners involved: Roma Tre University


Dissemination: A paper ([Poggi and D'Errico, 2009]).

Related Publications:

• Poggi I. and D'Errico F. (2009). "The mental ingredients of bitterness." Journal on Multimodal User Interfaces. Springer. DOI: 10.1007/s12193-009-0021-9

Links to other thematic WPs: Useful for WP9 and WP10.

18. Research on the use of facial expression to convey emotions to the audience during persuasive political discourse.

Description: The research on the use of facial expression in persuasion went through several steps. First, the literature on facial expression was reviewed, from the ancient Roman rhetorical texts (Cicero, Quintilian) to the modern studies on facial expression (Ekman and Friesen). Second, the insights from the literature were applied to our study. Through an analysis of the persuasive uses of facial expression and gaze by Ségolène Royal, the opposing candidate to Nicolas Sarkozy in the French presidential elections (2007), the research aimed to emphasize the importance of nonverbal strategies within her persuasive political discourse.

Tangible outcome: N/A

Partners involved: Roma Tre

Dissemination: The audience of the International Conference "La comunicazione parlata"; a paper in press ([Vincze and Poggi, b]).

Related Publications:

• L. Vincze and I. Poggi (in press). "Movere e delectare. Persuadere con lo sguardo e l'espressione del viso." In Proceedings of the International Conference "La comunicazione parlata", Naples, 23-25 February 2009.

Links to other thematic WPs: Useful for WP9 and WP10.

19. Analysis of the persuasive import of gesture and gaze with the help of an annotation scheme.

Description: The study investigated the use of gesture and gaze in political discourse and presents an annotation scheme for the analysis of their persuasive import. Two studies were reported on electoral debates of three politicians in Italy and France (Achille Occhetto, Romano Prodi and Ségolène Royal), and an annotation scheme was presented through which the gesture and gaze items produced in some fragments of political discourse were analyzed as to their signal and their literal and indirect meanings, and classified in terms of the persuasive strategies pursued: logos, ethos or pathos.


Tangible outcome: Annotation scheme for the analysis of persuasive gesture.

Partners involved: Roma Tre University

Dissemination: A paper in press ([Vincze and Poggi, a])

Related Publications:

• L. Vincze and I. Poggi (in press). "Il gesto, lo sguardo e i loro significati: uno schema di annotazione." In Proceedings of the Workshop Teorie e trascrizione. Trascrizioni e teoria, Bolzano, December 2, 2009.

Links to other thematic WPs: WP10

20. Analysis of humour, irony and ridicule, their communication, and their function in power comparison.

Description: The work analysed the multimodal communication of the prosecutor Di Pietro and the accused Cirino Pomicino during a trial of high political importance, the 'Clean Hands' trial. It defines the notions of humour, irony and ridicule, presents cases of them in the trial, and illustrates their function in the persuasive structure of a judicial debate.

Tangible outcome: N/A

Partners involved: Roma Tre

Dissemination: A paper in press ([Poggi, in press]).

Related Publications:

• I. Poggi (in press). "Irony, humour and ridicule. Power, image and judicial rhetoric in an Italian political trial." In R. Vion (Ed.), La corporalité du langage. Multimodalité, discours, écriture. Hommage à Claire Maury-Rouan. Aix-en-Provence: Publications de l'Université de Provence.

Links to other thematic WPs: WP10

4 Future Achievements

For the next year we foresee three major directions:


Research into the (automatic) detection and understanding of agreement and disagreement in political debates using Facial Action Units: So far, 30 labelled clips of agreement and 30 labelled clips of disagreement are available. Although labelling is rather labour intensive, the objective is to have more labelled data available; the aim is to have multiple clips (preferably 10) annotated for each selected person in the video data. The available data is currently processed mainly by hand, with the analysis of individual AUs carried out on Excel/XML files; to be able to find correlations between AUs, more automatic processing of the data is necessary. Once more data is available, pattern recognition tools can be used to investigate whether and how agreement/disagreement can be automatically recognized from AU detections. In the next year we will look further into the automatic detection of AUs in the available Canal 9 videos and also investigate the dynamics of the Facial Action Units in relation to agreement/disagreement. Since speaker recognition and the recognition of overlapping speech are important for the detection of agreement/disagreement, speaker analysis will also be investigated further.

Role of nodding in relation to agreement/disagreement in political debates: In the next year we will further research the (automatic) detection of nods and seek a better understanding of their role in agreement and disagreement in political debates. The University of Roma Tre developed a first version of a taxonomy of the nods used in a number of Canal 9 political debate videos; this will be further investigated and improved. At the same time we will look into how the main properties of nodding, such as duration, speed and frequency of the nods, as well as important related features like gaze, head orientation and corresponding gestures, can be automatically detected from the video.


Effect of dominance-related vocal parameters on the outcome of conflicts in political debates: We observe that debate participants tend to display dominance as a means to appear the winners of conflictual discussions. This idea also rests on two observations: in the Canal 9 debates we have continuous access to vocal signals (unlike visual signals), which makes it easier to evaluate the social consequences of utterances; and some people rarely express their agreement or disagreement explicitly, which is likely to be related to dominance, as agreeing is like conceding a victory to the opponent and is therefore not perceived as dominant. This direction thus complements the detection of agreement and disagreement in those cases where these are not expressed explicitly. A preliminary analysis of this question would be to:

1. define vocal parameters that have been associated with dominance in past studies.

2. measure these parameters in participants of the debate.

3. collect all the instances when a conflict occurs, using overlapping speech as a criterion.

4. determine who "won"/"lost" the conflict (the person who has the last word, or who continues speaking, or who is given the floor by the moderator).

5. investigate the relationship between the voice parameters and the outcome of the conflict in terms of winning/losing.

Further analysis will investigate the relationship between personal demographic data and the conflict outcomes.
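As an illustration of step 2, the sketch below measures some simple vocal parameters per participant from the speaker segmentation. The chosen parameters (speaking time, mean energy, energy variability) are assumptions standing in for whichever dominance-related parameters step 1 selects from the literature.

```python
import numpy as np

def vocal_dominance_features(audio, sr, turns):
    """Per-participant vocal statistics over single-speaker segments.
    audio: mono signal; sr: sample rate; turns: (speaker, start, end)."""
    raw = {}
    for speaker, start, end in turns:
        seg = audio[int(start * sr):int(end * sr)].astype(np.float64)
        entry = raw.setdefault(speaker, {"time": 0.0, "energies": []})
        entry["time"] += end - start
        entry["energies"].append(float(np.mean(seg ** 2)))
    return {s: {"speaking_time": e["time"],
                "mean_energy": float(np.mean(e["energies"])),
                "energy_std": float(np.std(e["energies"]))}
            for s, e in raw.items()}

# Synthetic demo: 10 s of noise at 16 kHz with two annotated speakers.
sr = 16000
audio = np.random.default_rng(1).normal(size=10 * sr)
turns = [("guest1", 0.0, 4.0), ("guest2", 4.0, 7.5), ("guest1", 7.5, 10.0)]
print(vocal_dominance_features(audio, sr, turns))
```

These per-participant measures could then be related to the conflict outcomes of step 4.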

References

[Ajmera, 2005] Ajmera, J. (2005). Robust audio segmentation. PhD thesis.

[Ajmera and Wooters, 2003] Ajmera, J. and Wooters, C. (2003). A robust speaker clustering algorithm. In IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[Poggi, in press] Poggi, I. (in press). Irony, humour and ridicule. Power, image and judicial rhetoric in an Italian political trial. In R. Vion (Ed.), La corporalité du langage. Multimodalité, discours, écriture. Hommage à Claire Maury-Rouan. Aix-en-Provence: Publications de l'Université de Provence.

[Poggi and D'Errico, 2009] Poggi, I. and D'Errico, F. (2009). The mental ingredients of bitterness. Journal on Multimodal User Interfaces. Springer. DOI: 10.1007/s12193-009-0021-9.

[Poggi et al., a] Poggi, I., D'Errico, F., and Spagnolo, A. (in press). The embodied morphemes of gaze. In S. Kopp (Ed.), Proceedings of the Gestures Workshop 09. Springer-Verlag.

[Poggi et al., b] Poggi, I., D'Errico, F., and Vincze, L. Nodding: not only agreement. Paper submitted to LREC (Language Resources and Evaluation Conference), Malta, 22-24 May 2010.

[Poggi et al., c] Poggi, I., D'Errico, F., and Vincze, L. Types of nod. Polysemy of a social signal. Paper submitted to a Special Issue of the Journal on Multimodal User Interfaces.

[Poggi et al., 2009a] Poggi, I., D'Errico, F., and Vincze, L. (2009a). Fare sì col capo. Polisemia di un segnale sociale. VI AISC Congress (Italian Cognitive Science Association), Naples, 25-26 November 2009.

[Poggi et al., 2009b] Poggi, I., Spagnolo, A., and D'Errico, F. (2009b). The morphemes of the eyelids. In Proceedings of "Comunicazione Parlata." Napoli: Liguori.


[Vincze and Poggi, a] Vincze, L. and Poggi, I. (in press). Il gesto, lo sguardo e i loro significati: uno schema di annotazione. In Proceedings of the Workshop Teorie e trascrizione. Trascrizioni e teoria, Bolzano, December 2, 2009.

[Vincze and Poggi, b] Vincze, L. and Poggi, I. (in press). Movere e delectare. Persuadere con lo sguardo e l'espressione del viso. In Proceedings of the International Conference "La comunicazione parlata", Naples, 23-25 February 2009.
