
Nov 22nd, 2006

Multimodal Analysis of Expressive Human Communication:

Speech and gesture interplay

Ph.D. Dissertation Proposal

Carlos Busso

Adviser: Dr. Shrikanth S. Narayanan


Outline

Introduction Analysis Recognition Synthesis Conclusions


Introduction

• Gestures and speech are intricately coordinated to express messages

• Affective and articulatory goals jointly modulate these channels in a non-trivial manner

• A joint analysis of these modalities is needed to better understand expressive human communication

• Goals:
  • Understand how to model the spatio-temporal modulation of these communicative goals in gestures and speech
  • Use these models to improve human-machine interfaces
    • Computers could give specific and appropriate help to users
    • Realistic facial animation could be improved by learning human-like gestures

Motivation

This proposal focuses on the analysis, recognition and synthesis of expressive human communication under a multimodal framework.

Introduction

• How to model the spatio-temporal emotional modulation
  • If audio-visual models do not consider how the coupling between gestures and speech changes in the presence of emotion, they will not accurately reflect the manner in which humans communicate
• Which interdependencies between the various communicative channels appear in conveying verbal and non-verbal messages?
  • Interplay between communicative, affective and social goals
• How can meta-information (emotion, engagement) be inferred from speakers?
• How are gestures used to respond to the feedback given by the listener?
• How are the verbal and non-verbal messages conveyed by one speaker perceived by others?
• How can these models be used to design and enhance applications that help and engage users?

Open challenges

Introduction

Proposed Approach

Analysis

Introduction

Analysis
• Facial Gesture/speech Interrelation

• Affective/Linguistic Interplay

Recognition Synthesis Conclusions

• Facial Gesture/speech Interrelation

C. Busso and S.S. Narayanan. Interrelation between Speech and Facial Gestures in Emotional Utterances. Under submission to IEEE Transactions on Audio, Speech and Language Processing.

Facial gestures/speech interrelation

• Gestures and speech interact and cooperate to convey a desired message [McNeill,1992], [Vatikiotis,1996], [Cassell,1994]

• Notable among communicative components are the linguistic, emotional and idiosyncratic aspects of human communication

• Both gestures and speech are affected by these modulations

• It is important to understand the interrelation between facial gestures and speech in terms of all these aspects of human communication

Motivation

Facial gestures/speech interrelation

Goals
• To focus on the linguistic and emotional aspects of human communication
• To investigate the relation between certain gestures and acoustic features
• To propose recommendations for synthesis and recognition applications

Related work
• Relationship between gestures and speech as conversational functions [Ekman,1979], [Cassell,1999], [Valbonesi,2002], [Graf,2002], [Granstrom,2005]
• Relationship between gestures and speech as a result of articulation [Vatikiotis,1996], [Yeshia,1998], [Jiang,2002], [Barker,1999]
• Relationship between gestures and speech as influenced by emotions [Nordstrand,2003], [Caldognetto,2003], [Bevacqua,2004], [Lee,2005]

Facial gestures/speech interrelation

• Pearson's correlation is used to quantify the relationship between speech and facial features
• An affine minimum mean-square error (MMSE) estimator is used to estimate the facial gestures from speech

• Sentence-level mapping

• Global-level mapping

Proposed Framework: Data-driven approach
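As an illustration of this data-driven framework, the following sketch (my own, with synthetic data; not the code used in the thesis) fits an affine MMSE mapping from speech features to one facial feature trajectory and reports Pearson's correlation between the true and estimated trajectories.

```python
# Illustrative sketch of the framework above: affine MMSE estimation of a facial
# feature from speech features, with Pearson's correlation as the coupling measure.
import numpy as np
from scipy.stats import pearsonr

def affine_mmse(X, y):
    """Fit y ~ A·x + b in the least-squares (MMSE) sense.
    X: (T, d) speech features; y: (T,) one facial feature trajectory."""
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])        # append bias term
    params, *_ = np.linalg.lstsq(Xa, y, rcond=None)      # stacked [A | b]
    return params

def coupling(X, y, params=None):
    """Pearson correlation between the facial feature and its estimate."""
    if params is None:                                   # sentence-level mapping:
        params = affine_mmse(X, y)                       # parameters fit per sentence
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])
    y_hat = Xa @ params
    r, _ = pearsonr(y, y_hat)
    return r, params

# Synthetic example: MFCC-like speech features vs. one marker trajectory
T, d = 200, 12
X = np.random.randn(T, d)
y = X @ np.random.randn(d) + 0.3 * np.random.randn(T)
r_sentence, theta = coupling(X, y)                       # sentence-level fit
# A "global-level" mapping would instead reuse one theta fit on all sentences pooled.
print(f"sentence-level correlation: {r_sentence:.2f}")
```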

Facial gestures/speech interrelation

• Four emotions are targeted
  • Sadness
  • Anger
  • Happiness
  • Neutral state
• 102 markers to track facial expressions
• Single subject
• Phoneme-balanced corpus (258 sentences)
• Facial motion and speech are simultaneously captured

Audio-Visual Database

Facial gestures/speech interrelation

• Speech
  • Prosodic features (speech source): pitch, energy, and their first and second derivatives
  • MFCC coefficients (vocal tract)
• Facial features
  • Head motion
  • Eyebrows
  • Lips
• Each marker is grouped into the upper, middle or lower face region

Facial and acoustic features

Facial gestures/speech interrelation

• High levels of correlation

• Correlation levels are higher when MFCC features are used

• Clear emotional effects
  • Correlation levels are equal to or greater than in the neutral case
  • Happiness and anger behave similarly

Correlation results: Sentence-level

[Figure: per-emotion correlation levels (Neutral, Sad, Happy, Angry) for prosodic features and MFCCs]

Facial gestures/speech interrelation

• Correlation levels decrease compared to the sentence-level mapping

• Link between facial gestures and speech varies from sentence to sentence

• Correlation levels are higher when MFCC features are used

• The lower face region presents the highest correlation

• Clear emotional effects
  • Correlation levels for neutral speech are higher than for the emotional categories

Correlation results: Global-level

[Figure: per-emotion correlation levels (Neutral, Sad, Happy, Angry) for prosodic features and MFCCs]

Facial gestures/speech interrelation

• Goal: study the structure of the mapping parameters
• Approach: Principal Component Analysis (PCA)
• For each facial feature, find the number of principal components P that covers 90% of the variance
• Emotion-dependent vs. emotion-independent analysis

Mapping parameter
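A minimal sketch of the PCA step described above (synthetic data, assuming the per-sentence mapping parameters for one facial feature are stacked into a matrix): it computes how many principal components, and what fraction of the eigenvectors, are needed to cover 90% of the variance.

```python
# Illustrative PCA analysis of the mapping parameters (not the original analysis code).
import numpy as np
from sklearn.decomposition import PCA

def components_for_variance(param_matrix, target=0.90):
    """param_matrix: (n_sentences, n_params) mapping parameters for one facial feature."""
    pca = PCA().fit(param_matrix)
    cum = np.cumsum(pca.explained_variance_ratio_)
    P = int(np.searchsorted(cum, target) + 1)            # smallest P reaching the target
    return P, P / param_matrix.shape[1]                  # count and fraction of eigenvectors

# Synthetic example: 258 sentences, 13-dimensional mapping parameters
params = np.random.randn(258, 13) @ np.diag([5, 3, 2] + [0.1] * 10)
P, fraction = components_for_variance(params)
print(P, round(fraction, 2))   # a small fraction indicates the parameters cluster in a subspace
```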

Facial gestures/speech interrelation

• Parameters cluster in a small subspace
  • Prosodic-based parameters cluster in a smaller subspace than MFCC-based parameters
• Further evidence of an emotion-dependent influence on the relationship between facial gestures and speech

Mapping parameter results

[Table: fraction of eigenvectors used to span 90% or more of the variance of the mapping parameters]

Facial gestures/speech interrelation

• Correlation levels as a function of P
• The slope for prosodic-based features is lower than for MFCCs
  • Smaller dimension of the cluster
• The slope depends on the facial region
  • Different levels of coupling

Mapping parameter results

[Figure: correlation as a function of P for the upper, middle and lower face regions, for prosodic features and MFCCs]

Affective/Linguistic Interplay

Introduction Analysis

• Facial Gesture/speech Interrelation

• Affective/Linguistic Interplay

Recognition Synthesis Conclusions

• Affective/Linguistic Interplay

C. Busso and S.S. Narayanan. Interplay between linguistic and affective goals in facial expression during emotional utterances. To appear in International Seminar on Speech Production (ISSP 2006).

Linguistic/affective interplay

Motivation
• Linguistic and emotional goals jointly modulate speech and gestures to convey the desired messages
• Articulatory and affective goals co-occur during normal human interaction, sharing the same channels
• Some control needs to buffer, prioritize and execute these communicative goals in a coherent manner

Hypotheses
• Linguistic and affective goals interplay interchangeably as primary and secondary controls
• During speech, affective goals are displayed under articulatory constraints
  • Some facial areas have more degrees of freedom to display non-verbal cues

Linguistic/affective interplay

Previous results
• Low vowels (/a/), with a less restrictive tongue position, show greater emotional coloring than high vowels (/i/) [Yildirim,2004], [Lee,2005], [Lee,2004]

Approach
• The focus of this analysis is on the interplay in facial expressions
• Compare facial expressions of neutral and emotional utterances with the same semantic content
  • Correlation
  • Euclidean distance
• The database is a subset of the MOCAP data

Linguistic/affective interplay

Facial activation analysis

[Figure: facial activation maps for the Neutral, Sad, Happy and Angry conditions]

• Measure of facial motion
• The lower face area has the highest activeness levels
  • Articulatory processes play a crucial role
• Emotional modulation
  • Happy and angry are more active
  • Sadness is less active than neutral
  • Activeness in the upper face region increases more than in other regions

Linguistic/affective interplay

• Goal: Compare in detail the facial expressions displayed during neutral and emotional utterances with similar semantic content

• Dynamic Time Warping (DTW) is used to align the utterances

Neutral vs. emotional analysis

Linguistic/affective interplay

• Higher correlation implies stronger articulatory constraints
• The lower facial region has the highest correlation levels
  • More constrained
• The upper facial region has the lowest correlation levels
  • It can communicate non-verbal information regardless of the linguistic content

Correlation analysis: neutral vs. emotional

[Figure: median correlation between neutral and emotional facial features for the Neutral-Sad, Neutral-Happy and Neutral-Angry pairs]

Linguistic/affective interplay

Euclidean distance analysis: neutral vs. emotional

[Figure: median Euclidean distance between neutral and emotional facial features for the Neutral-Sad, Neutral-Happy and Neutral-Angry pairs]

• After scaling the facial features, the Euclidean distance was estimated
• High values indicate that the facial features are more independent of the articulation
• The results are similar to the correlation analysis
• The upper face region is less constrained by articulatory processes

Analysis

• Facial gestures and speech are strongly interrelated

• The correlation levels present inter-emotion differences

• There is an emotion-dependent structure in the mapping parameter that may be learned

• The prosodic-based mapping parameter set is grouped in a small cluster

• Facial areas and speech are coupled at different resolutions

Remarks from analysis section

Analysis

• During speech, facial activeness is mainly driven by articulation

• However, linguistic and affective goals co-occur during active speech.

• There is an interplay between linguistic and affective goals in facial expression

• The forehead and cheeks have more degrees of freedom to convey non-verbal messages

• The lower face region is more constrained by the articulatory process

Remarks from analysis section

Recognition

Introduction Analysis Recognition

• Emotion recognition

• Engagement recognition

Synthesis Conclusions

• Emotion recognition

C. Busso, Z. Deng, S. Yildirim, M. Bulut, C.M. Lee, A. Kazemzadeh, S. Lee, U. Neumann, and S. Narayanan, “Analysis of emotion recognition using facial expressions, speech and multimodal information,” in Sixth International Conference on Multimodal Interfaces (ICMI 2004), State College, PA, 2004, pp. 205–211, ACM Press.

Multimodal Emotion Recognition

Motivation
• Emotions are an important element of human-human interaction
  • Design improved human-machine interfaces
  • Give specific and appropriate help to users

Hypotheses
• Modalities give complementary information
  • Some emotions are better recognized in a particular domain
  • A multimodal approach provides better performance and robustness

Related work
• Decision-level fusion systems (rule-based) [Chen,1998], [DeSilva,2000], [Yoshitomi,2000]
• Feature-level fusion systems [Chen,1998_2], [Huang,1998]

Multimodal Emotion Recognition

• Analyze the strengths and limitations of unimodal systems to recognize emotional states
• Study the performance of the multimodal system
• The MOCAP database is used
• Sentence-level features (e.g. mean, variance, range)
  • Speech: prosodic features
  • Facial expression: upper and middle face areas
• Sequential backward feature selection
• Support vector machine classifier (SVC)
• Decision- and feature-level integration

Proposed work
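A sketch of this pipeline is given below (assuming scikit-learn and synthetic data; the thesis does not specify the toolkit): sentence-level statistics, sequential backward feature selection, and an SVM classifier.

```python
# Illustrative emotion-recognition pipeline, not the original system.
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import cross_val_score

def sentence_level_stats(frames):
    """frames: (T, d) frame-level features -> fixed-length utterance descriptor."""
    return np.concatenate([frames.mean(0), frames.var(0),
                           frames.max(0) - frames.min(0)])   # mean, variance, range

# Synthetic data: 200 utterances, 30-dim sentence-level vectors, 4 emotion classes
rng = np.random.default_rng(0)
X = np.vstack([sentence_level_stats(rng.normal(size=(120, 10))) for _ in range(200)])
y = rng.integers(0, 4, size=200)

svc = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
selector = SequentialFeatureSelector(svc, n_features_to_select=10,
                                     direction="backward", cv=3)
selector.fit(X, y)                          # sequential backward feature selection
X_sel = selector.transform(X)
print(cross_val_score(svc, X_sel, y, cv=3).mean())
# Feature-level fusion would concatenate acoustic and facial descriptors before
# classification; decision-level fusion would combine per-modality posteriors.
```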

Multimodal Emotion Recognition

Emotion recognition results

• From speech (average ~70%)
  • Confusion between sadness and neutral
  • Confusion between happiness and anger

              Anger  Sadness  Happiness  Neutral
  Anger        0.68    0.05     0.21       0.05
  Sadness      0.07    0.64     0.06       0.22
  Happiness    0.19    0.04     0.70       0.08
  Neutral      0.04    0.14     0.01       0.81

• From facial expressions (average ~85%)
  • Confusion between anger and sadness
  • Confusion between neutral and happiness
  • Confusion between sadness and neutral

              Anger  Sadness  Happiness  Neutral
  Anger        0.79    0.18     0.00       0.03
  Sadness      0.06    0.81     0.00       0.13
  Happiness    0.00    0.00     1.00       0.00
  Neutral      0.00    0.04     0.15       0.81

• Multimodal system, feature-level fusion (average ~90%)
  • Remaining confusion between neutral and sadness
  • Other pairs are correctly separated

              Anger  Sadness  Happiness  Neutral
  Anger        0.95    0.00     0.03       0.03
  Sadness      0.00    0.79     0.03       0.18
  Happiness    0.02    0.00     0.91       0.08
  Neutral      0.01    0.05     0.02       0.92

Inferring participants’ engagement

Introduction Analysis Recognition

• Emotion recognition

• Engagement recognition

Synthesis Conclusions

• Engagement recognition

C. Busso, S. Hernanz, C.W. Chu, S. Kwon, S. Lee, P.G. Georgiou, I. Cohen, and S. Narayanan. Smart Room: Participant and Speaker Localization and Identification. In Proc. ICASSP, Philadelphia, PA, March 2005.

C. Busso, P.G. Georgiou and S.S. Narayanan. Real-time monitoring of participants’ interaction in a meeting using audio-visual sensors. Under submission to International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2007).

Inferring participants’ engagement

Motivation
• At the small-group level, the strategies of one participant are affected by the strategies of the other participants
• Automatic annotation of human interaction will provide better tools for analyzing teamwork and collaboration strategies
• Examples of applications in which monitoring human interaction is useful include summarization, retrieval and classification of meetings

Goals
• Infer meta-information from participants in a multiperson meeting
• Monitor and track the behaviors, strategies and engagement of the participants
• Infer the interaction flow of the discussion

Inferring participants’ engagement

Approach
• Extract high-level features from automatic annotations of speaker activity (e.g. number and average duration of turns)
• Use an intelligent environment equipped with audio-visual sensors to obtain the annotations

Related work
• Intelligent environments [Checka,2004], [Gatica-Perez,2003], [Pingali,1999]
• Monitoring human interaction [McCowan,2005], [Banerjee,2004], [Zhang,2006], [Basu,2001]

Inferring participants’ engagement

• Visual
  • 4 FireWire CCD cameras
  • 360° omnidirectional camera
• Audio
  • 16-channel microphone array
  • Directional microphone (SID)

Smart Room

Inferring participants’ engagement

• After fusing the audio-visual data streams, the system provides
  • Participants’ locations
  • Seating arrangement
  • Speaker identity
  • Speaker activity
• Testing (~85% accuracy)
  • Three 20-minute meetings (4 participants)
  • Casual conversation with interruptions and overlap

Localization and identification

Inferring participants’ engagement

• High-level features per participant
  • Number of turns
  • Average duration of turns
  • Amount of time as active speaker
  • Transition matrix depicting turn-taking between participants
• Evaluation
  • Hand-labeled annotation of speaker activity
  • The results described here correspond to one of the meetings

Participants’ interaction
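The high-level features listed above are straightforward to compute once a per-frame speaker-activity annotation is available. The sketch below assumes such an annotation (one speaker id per frame, -1 for silence); the data layout is an illustration, not the deployed system.

```python
# Illustrative computation of turn counts, speaking time and the turn-taking
# transition matrix from a per-frame speaker-activity sequence.
import numpy as np

def interaction_features(activity, frame_rate=10.0, n_participants=4):
    """activity: array of speaker ids per frame (-1 = silence)."""
    turns = []                                            # list of [speaker, frames]
    for spk in activity:
        if spk >= 0 and turns and turns[-1][0] == spk:
            turns[-1][1] += 1                             # same speaker keeps the turn
        elif spk >= 0:
            turns.append([spk, 1])                        # a new turn starts
    n_turns = np.bincount([s for s, _ in turns], minlength=n_participants)
    speaking_time = np.bincount(activity[activity >= 0], minlength=n_participants) / frame_rate
    avg_turn = np.where(n_turns > 0, speaking_time / np.maximum(n_turns, 1), 0.0)
    # Turn-taking transition matrix: who speaks after whom
    trans = np.zeros((n_participants, n_participants))
    for (a, _), (b, _) in zip(turns[:-1], turns[1:]):
        if a != b:
            trans[a, b] += 1
    trans = trans / np.maximum(trans.sum(axis=1, keepdims=True), 1)
    return n_turns, avg_turn, speaking_time, trans

activity = np.array([0, 0, 0, -1, 1, 1, 0, 0, 2, 2, 2, -1, 0])   # toy annotation
print(interaction_features(activity))
```

Estimating these quantities over short sliding windows, as proposed later in the talk, gives a dynamic picture of each participant's engagement.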

Inferring participants’ engagement

• The automatic annotations are a good approximation
• The distribution of time spent as active speaker correlates with dominance [Rienks,2006]
  • Subject 1 spoke more than 65% of the time
• Discussions are characterized by many short turns to show agreement (e.g. “uh-huh”) and longer turns taken by mediators [Burger,2002]
  • Subject 1 was leading the discussion
  • Subject 3 was mainly an active listener

Results: Participants’ interaction

[Figure: estimated vs. ground-truth turn durations, speaking-time distributions and number of turns per participant]

Inferring participants’ engagement

• The transition matrix gives the interaction flow and turn-taking patterns
• Claim: transitions between speakers approximate who was being addressed
  • To evaluate this hypothesis, the addressee was manually annotated and compared with the transition matrix
• The transition matrix provides a good first approximation for identifying the interlocutor dynamics
• The discussion was mainly between subjects 1 and 3

Results: Participants’ interaction

[Figure: ground-truth vs. estimated transition matrices]

Inferring participants’ engagement

• These high-level features can be estimated over short time windows to infer participants’ engagement
  • Subject 4 was not engaged
  • Subjects 1, 2 and 3 were engaged

Results: Participants’ interaction

[Figure: dynamic behavior of speakers’ activeness over time]

Recognition

• Multimodal approaches to infer meta-information from speakers give better performance than unimodal systems

• When acoustic and facial features are fused, the performance and the robustness of the emotion recognition system improve measurably

• In small group meetings, it is possible to accurately estimate in real-time not only the flow of the interaction, but also how dominant and engaged each participant was during the discussion

Remarks from recognition section


Synthesis

Introduction Analysis Recognition Synthesis

• Head motion synthesis

Conclusions Future Work

• Head motion synthesis

C. Busso, Z. Deng, U. Neumann, and S.S. Narayanan, “Natural head motion synthesis driven by acoustic prosodic features,” Computer Animation and Virtual Worlds, vol. 16, no. 3-4, pp. 283–290, July 2005.

C. Busso, Z. Deng, M. Grimm, U. Neumann and S. Narayanan. Rigid Head Motion in Expressive Speech Animation: Analysis and Synthesis. IEEE Transactions on Audio, Speech and Language Processing, March 2007.

Natural Head Motion Synthesis

• The mapping between facial gestures and speech can be learned using a more sophisticated framework
• A useful and practical application is speech-driven avatars
• Engaging human-computer interfaces and applications such as animated feature films have motivated realistic avatars

• Focus of this section: head motion

Motivation

Natural Head Motion Synthesis

• Head motion has received little attention compared to other gestures
• It is important for acknowledging active listening
• It improves acoustic perception [Munhall,2004]
• It helps distinguish interrogative and declarative statements [Munhall,2004]
• It helps recognize speaker identity [Hill,2001]
• It helps segment spoken content [Graf,2002]

Why head motion?

Natural Head Motion Synthesis

Hypotheses
• Head motion is important for human-like facial animation
• Head motion changes the perception of the emotion
• Head motion can be synthesized from acoustic features

Related work
• Rule-based systems [Pelachaud,1994]
• Gaussian mixture models [Costa,2001]
• Specific head motions (e.g. ‘nod’) [Cassell,1994], [Graf,2002]
• Example-based systems [Deng,2004], [Chuang,2004]

Natural Head Motion Synthesis


• Hidden Markov Models are trained to capture the temporal relation between the prosodic features and the head motion sequence

• Vector quantization is used to produce a discrete representation of head poses

• Two-step smoothing techniques are used, based on a first-order Markov model and spherical cubic interpolation
• Emotion perception is studied by rendering deliberate mismatches between the emotional speech and the emotional head motion sequences

Proposed Framework

Natural Head Motion Synthesis

• Same audio-visual database
• Acoustic features ~ prosody (6D)
  • Pitch
  • RMS energy
  • First and second derivatives
• Head motion ~ head rotation (3 DOF)
  • Reduces the number of HMMs
  • For close-up views of the face, translation effects are less important

Database and features

Natural Head Motion Synthesis


Head motion analysis in expressive speech

• Prosodic features are coupled with head motion (emotion dependent)

• Emotional patterns in activeness, range and velocity

• Discriminant analysis ~ 65.5%
• Emotion-dependent models are needed

Natural Head Motion Synthesis

• Head motions are modeled with HMMs
  • HMMs provide a suitable and natural framework to model the temporal relation between prosodic features and head motions
  • HMMs will be used as sequence generators (head motion sequences)
• Discrete head pose representation
  • The 3D head motion data is quantized using K-dimensional vector quantization:
    HeadPose = (α, β, γ) ≈ V_i,  i ∈ {1, ..., K}
  • Each cluster V_i is characterized by its mean U_i and covariance Σ_i

Head motion analysis in expressive speech
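A short sketch of the quantization step described above (using scikit-learn's K-means on synthetic Euler-angle data; the original clustering method is not specified in the talk):

```python
# Illustrative vector quantization of head poses into K clusters, keeping each
# cluster's mean U_i and covariance Sigma_i.
import numpy as np
from sklearn.cluster import KMeans

K = 16                                            # number of head-pose clusters
poses = np.random.randn(5000, 3) * [10, 5, 3]     # synthetic (alpha, beta, gamma) angles

km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(poses)
labels = km.labels_                               # discrete head-pose sequence V_i
means = km.cluster_centers_                       # U_i
covs = np.stack([np.cov(poses[labels == i].T) for i in range(K)])   # Sigma_i
print(means.shape, covs.shape)
```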

Natural Head Motion Synthesis

• The observations O are the acoustic prosodic features
• One HMM will be trained for each head-pose cluster V_i
  • Likelihood distribution: P(O|V_i)
  • It is modeled as a Markov process
  • A mixture of M Gaussian densities is used to model the pdf of the observations
  • Standard algorithms are used to train the parameters (forward-backward, Baum-Welch re-estimation)

P(V_i|O) = c · P(O|V_i) · P(V_i)

Learning natural head motion
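The following sketch assumes the third-party hmmlearn package (not the toolkit used in the thesis) and random data: one HMM per head-pose cluster is trained on prosodic observations, and log P(O|V_i) is then scored for a new sequence. The left-to-right topology constraint mentioned later is omitted here for brevity.

```python
# Illustrative per-cluster HMM training for P(O | V_i).
import numpy as np
from hmmlearn.hmm import GMMHMM

K, S, M = 16, 2, 2          # clusters, states per HMM, Gaussian mixtures (values from the talk)

def train_cluster_hmms(segments_per_cluster):
    """segments_per_cluster[i]: list of (T_j, 6) prosodic segments assigned to cluster i."""
    models = []
    for segments in segments_per_cluster:
        X = np.vstack(segments)
        lengths = [len(s) for s in segments]
        m = GMMHMM(n_components=S, n_mix=M, covariance_type="diag", n_iter=20)
        models.append(m.fit(X, lengths))    # Baum-Welch re-estimation
    return models

def logliks(models, O):
    """log P(O | V_i) for a (T, 6) prosodic observation sequence O."""
    return np.array([m.score(O) for m in models])

# Toy usage with random data: 5 short segments per cluster
rng = np.random.default_rng(0)
data = [[rng.normal(size=(30, 6)) for _ in range(5)] for _ in range(K)]
models = train_cluster_hmms(data)
print(logliks(models, rng.normal(size=(40, 6))))
```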

Natural Head Motion Synthesis

• Prior distribution: P(V_i)
  • It is built as a bi-gram model learned from the data (1st smoothing step)
  • Transitions between clusters that do not appear in the training data are penalized
  • This smoothing constraint is imposed in the decoding step

P(V_i|O) = c · P(O|V_i) · P(V_i)

Learning natural head motion
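A minimal sketch of the decoding step with the bigram prior (my own Viterbi-style formulation under the notation above; the thesis does not publish this code): the chosen cluster sequence maximizes the per-segment HMM log-likelihoods plus a bigram log-prior, with unseen transitions penalized.

```python
# Illustrative decoding of the most likely head-pose cluster sequence.
import numpy as np

def decode(loglik, logprior, penalty=-10.0):
    """loglik: (N_segments, K) log P(O_t | V_i) per analysis segment;
    logprior: (K, K) bigram log P(V_j | V_i); unseen transitions may be -inf."""
    N, K = loglik.shape
    prior = np.where(np.isfinite(logprior), logprior, penalty)  # penalize unseen transitions
    delta = loglik[0].copy()
    back = np.zeros((N, K), dtype=int)
    for t in range(1, N):
        scores = delta[:, None] + prior
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + loglik[t]
    path = [int(delta.argmax())]
    for t in range(N - 1, 0, -1):                                # backtrack
        path.append(int(back[t][path[-1]]))
    return path[::-1]                                            # cluster sequence V_i

ll = np.random.randn(8, 16)                     # toy log-likelihoods for 8 segments
lp = np.log(np.full((16, 16), 1 / 16))          # toy uniform bigram prior
print(decode(ll, lp))
```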

Natural Head Motion Synthesis

Synthesis of natural head motion
• For a novel sentence, the HMMs generate the most likely head motion sequence
• Interpolation is used to smooth the cluster transition regions (2nd smoothing step)

Natural Head Motion Synthesis

• 2nd smoothing constraint: spherical cubic interpolation
  • Removes the breaks at the cluster transitions of the new sequences
  • The interpolation takes place on the quaternion unit sphere [Shoemake,1985]

Natural Head Motion Synthesis

• Configuration
  • Left-to-right topology
  • K = 16 (number of clusters)
  • S = 2 (number of states)
  • M = 2 (number of mixtures)
  • 80% training set, 20% test set
• A set of HMMs was built for each emotion
• From Euler angles to talking avatars
  • The Euler angles are directly applied to the control parameters of the face model
  • The face is synthesized with the techniques given in [Deng,2004], [Deng,2005], [Deng,2005_2], [Deng,2006]

Natural Head Motion Synthesis

• Canonical correlation between the original and synthesized sequences

Results

[Figure: canonical correlation results for the Neutral and Happiness conditions]

Natural Head Motion Synthesis

Results

[Figure: results for the Sadness and Anger conditions]

Subjective naturalness assessment

Natural Head Motion Synthesis

• Approach: Render animations with deliberate mismatches between the emotional content of the speech and the emotional pattern of head motion

• Dynamic Time Warping for alignment

• 17 human subjects assessed the videos

• Evaluation is performed in the emotion-primitive attribute domain (valence, activation and dominance)

Emotional Perception


Natural Head Motion Synthesis

Results: Valence (Positive-Negative)

• Happy head motion makes the attitude of the animation more positive (statistically significant)
• Angry head motion makes the attitude of the animation more negative (not statistically significant)

[Figure: valence ratings on a 1-5 scale]

Natural Head Motion Synthesis

Results: Activation (Excited-Calm)

• Angry head motion makes the animation appear more excited than happy head motion (statistically significant)
• Happy speech with sad head motion is perceived as more excited (statistically significant); possible explanations:
  • An artifact of the approach
  • A true effect generated by the combination of modalities (McGurk effect)

[Figure: activation ratings on a 1-5 scale]

Natural Head Motion Synthesis

Results: Dominance (Weak-Strong)

• Head motion does not modify this attribute
• Neutral speech with happy head motion is perceived as stronger (not statistically significant)
• Happy speech synthesized with angry head motion is perceived as stronger (not statistically significant)

[Figure: dominance ratings on a 1-5 scale]

Synthesis

Remarks from synthesis section: re-visiting the hypotheses

• Head motion is important for human-like facial animation
  • The animation is perceived as more natural with head motion
• Head motion changes the perception of the emotion
  • Especially in the valence and activation domains
  • Head motion needs to be designed to convey the desired emotion
• Head motion can be synthesized from acoustic features
  • The synthesized sequences were perceived as being as natural as the original sequences
  • HMMs capture the relation between the prosodic and head motion features

Conclusions

Introduction Analysis Recognition Synthesis Conclusions

Proposed work Timeline


Proposed work

Research goals
• To jointly model different modalities within an integrated framework
  • Gestures and speech are not synchronous, and they are coupled at different resolutions
• To explore human emotional perception
  • Different combinations of modalities may create different emotion percepts
• The idiosyncratic influence in expressive human communication
  • How speaker-dependent are the results presented here?
• At the dyad level, how gestures and speech of a speaker are affected by the feedback provided by the other interlocutor
• At the small-group level, to infer meta-information from participants’ gestures

Proposed work

Interactive and emotional motion capture database
• Features
  • Dyad interaction
  • 5 sessions, 2 actors each
  • Emotions were elicited in context
  • ~14 hours of data
  • Markers on the face and on the hands
  • Happiness, sadness, anger, frustration and neutral state
• Still under preparation

Proposed work

Individual level

Gesture and speech framework
• To model the different modalities with a single framework that considers the asynchrony and interrelation between modalities
  • Coupled HMMs and graphical models

Multiresolution analysis of gestures and speech during expressive utterances
• Facial gestures and speech are systematically synchronized at different scales (phonemes, words, phrases, sentences) [Cassell,1994]
• The lower face area is strongly constrained by articulation
• The upper face area has more degrees of freedom to communicate non-linguistic messages
• A multiresolution decomposition approach may provide a better framework to analyze the interrelation between facial and acoustic features
• We will study the correlation levels of coarse-to-fine representations of acoustic and facial features

Proposed work

Individual level

Emotion perception
• Evaluate the hypothesis that different combinations of modalities create different emotion percepts (McGurk effect)
• Approach: design controlled experiments using facial animations
  • Create deliberate mismatches between the emotional speech and specific facial gestures (e.g. eyebrows)
  • Human raters will assess the emotions conveyed in these animations
  • For facial animation, the open-source software Xface will be used

Emotion perception in different modalities
• We will compare acoustic versus visual emotion perception
• We will evaluate the importance of content in emotion perception
• Approach: assess the IEMOCAP database

Proposed work

Individual level

Gestures and speech driven by discourse functions
• Study the influence of high-level linguistic functions on the relationship between gestures and speech
• We propose to analyze gestures that are generated as discourse functions (e.g. a head nod for “yes”)
• Application: improve facial animations
  • e.g. head motion sequences driven by prosody and discourse functions

Analysis of personal styles
• We propose to study the idiosyncratic aspects of expressive human communication
• Since the IEMOCAP database has 10 subjects, the results can be generalized
  • To learn inter-personal similarities (speaker-independent emotion recognition systems)
  • To learn inter-personal differences (better human-like facial animation)

Proposed work

Extension of interplay theory

• Analyze facial expressions during acoustic silence

• The lower face area may be modulated as much as the upper/middle face areas

Gestures of active listeners

• Active listeners respond with non-verbal gestures
  • These gestures appear at specific points in the structure of the speaker’s words [Ekman,1979]
• Application: design active-listener virtual agents
• We propose to analyze the gestures and speech of the subjects when they are trying to positively affect the mood of the other interlocutor
  • Hypothesis: particular gestures are used, which can be learned and synthesized

Dyad level

Proposed work

Gestures of the participants

• Rough estimates of the participants’ gestures will be extracted
• We propose to include this information as an additional cue to measure speaker engagement
• Use gestures to improve the fusion algorithm
  • A measure of hand activeness can be used for speaker localization
  • Head postures of the participants can improve the detection of turn changes

Smart room as training tool
• Evaluate whether the report provided by the smart room can be used as a training tool for improving participants’ skills during discussions

Small group level

Timeline

January-March

• Multiresolution analysis of gestures and speech during expressive utterances

• Analysis of McGurk effects in emotional perception of expressive facial animations

• Relation between visual and acoustic boundaries

April-June

• Gesture and Speech framework (e.g. CHMM, graphical models)

• Emotion perception in different modalities (context vs. isolated emotional assessments)
• Extension of the interplay theory during acoustic silence

Timeline

July-September

• Facial Animation driven by discourse function and acoustic features

• Study of the idiosyncratic aspects of human communication

• Engagement analysis in multiparty discussion

October-November

• Active listeners analysis

• Hand gesture analysis and its relationship with speech

Publications

Journal Articles
[1] C. Busso and S.S. Narayanan. “Interrelation between Speech and Facial Gestures in Emotional Utterances”. Submitted to IEEE Transactions on Audio, Speech and Language Processing.
[2] C. Busso, Z. Deng, M. Grimm, U. Neumann and S. Narayanan. “Rigid Head Motion in Expressive Speech Animation: Analysis and Synthesis”. IEEE Transactions on Audio, Speech and Language Processing, March 2007.
[3] C. Busso, Z. Deng, U. Neumann, and S.S. Narayanan, “Natural head motion synthesis driven by acoustic prosodic features,” Computer Animation and Virtual Worlds, vol. 16, no. 3-4, pp. 283-290, July 2005.

Conference Proceedings
[1] C. Busso, P.G. Georgiou and S.S. Narayanan. “Real-time monitoring of participants’ interaction in a meeting using audio-visual sensors”. Submitted to International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2007).
[2] C. Busso and S.S. Narayanan. “Interplay between linguistic and affective goals in facial expression during emotional utterances”. To appear in International Seminar on Speech Production (ISSP 2006).
[3] M. Bulut, C. Busso, S. Yildirim, A. Kazemzadeh, C.M. Lee, S. Lee, and S. Narayanan, “Investigating the role of phoneme-level modifications in emotional speech resynthesis,” in 9th European Conference on Speech Communication and Technology (Interspeech 2005 - Eurospeech), Lisbon, Portugal, September 2005, pp. 801-804.

Publications

Conference Proceedings (cont.)
[4] C. Busso, S. Hernanz, C.W. Chu, S. Kwon, S. Lee, P. G. Georgiou, I. Cohen, S. Narayanan. “Smart Room: Participant and Speaker Localization and Identification”. In Proc. ICASSP, Philadelphia, PA, March 2005.
[5] C. Busso, Z. Deng, S. Yildirim, M. Bulut, C.M. Lee, A. Kazemzadeh, S. Lee, U. Neumann, and S. Narayanan, “Analysis of emotion recognition using facial expressions, speech and multimodal information,” in Sixth International Conference on Multimodal Interfaces (ICMI 2004), State College, PA, 2004, pp. 205-211, ACM Press.
[6] Z. Deng, C. Busso, S. Narayanan, and U. Neumann, “Audio-based head motion synthesis for avatar-based telepresence systems,” in ACM SIGMM 2004 Workshop on Effective Telepresence (ETP 2004), New York, NY, 2004, pp. 24-30, ACM Press.
[7] C.M. Lee, S. Yildirim, M. Bulut, A. Kazemzadeh, C. Busso, Z. Deng, S. Lee, and S.S. Narayanan, “Emotion recognition based on phoneme classes,” in 8th International Conference on Spoken Language Processing (ICSLP 04), Jeju Island, Korea, 2004.
[8] S. Yildirim, M. Bulut, C.M. Lee, A. Kazemzadeh, C. Busso, Z. Deng, S. Lee, and S.S. Narayanan, “An acoustic study of emotions expressed in speech,” in 8th International Conference on Spoken Language Processing (ICSLP 04), Jeju Island, Korea, 2004.

Publications

Abstracts
[1] S. Yildirim, M. Bulut, C. Busso, C.M. Lee, A. Kazamzadeh, S. Lee, and S. Narayanan. “Study of acoustic correlates associated with emotional speech”. J. Acoust. Soc. Am., 116:2481, 2004.
[2] C. M. Lee, S. Yildirim, M. Bulut, C. Busso, A. Kazamzadeh, S. Lee, and S. Narayanan. “Effects of emotion on different phoneme classes”. J. Acoust. Soc. Am., 116:2481, 2004.
[3] M. Bulut, S. Yildirim, S. Lee, C.M. Lee, C. Busso, A. Kazamzadeh, and S. Narayanan. “Emotion to emotion speech conversion in phoneme level”. J. Acoust. Soc. Am., 116:2481, 2004.

Thanks!

Spherical Cubic Interpolation

• Interpolation procedure
  • Euler angles are transformed to quaternions
  • Key points are selected by down-sampling the quaternion sequence
  • Spherical cubic interpolation (squad) is used to interpolate the key points
  • The interpolated results are transformed back to Euler angles
• Motivation for spherical cubic interpolation
  • Interpolation in Euler space introduces jerky movements
  • It introduces undesired effects such as gimbal lock

squad(q1, q2, q3, q4, u) = slerp( slerp(q1, q4, u), slerp(q2, q3, u), 2u(1 - u) )

slerp(q1, q2, u) = [ sin((1 - u)θ) / sin(θ) ] · q1 + [ sin(uθ) / sin(θ) ] · q2
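The two formulas above translate directly into code. The sketch below (my own, assuming a [w, x, y, z] unit-quaternion convention; not the original implementation) implements slerp and squad as defined on the slide.

```python
# Illustrative slerp/squad used to smooth cluster transitions in the head-motion sequence.
import numpy as np

def slerp(q1, q2, u):
    """Spherical linear interpolation between unit quaternions q1 and q2."""
    q1, q2 = np.asarray(q1, float), np.asarray(q2, float)
    dot = np.clip(np.dot(q1, q2), -1.0, 1.0)
    if dot < 0:                      # take the shorter arc
        q2, dot = -q2, -dot
    theta = np.arccos(dot)
    if theta < 1e-6:                 # nearly identical: fall back to linear interpolation
        q = (1 - u) * q1 + u * q2
    else:
        q = (np.sin((1 - u) * theta) * q1 + np.sin(u * theta) * q2) / np.sin(theta)
    return q / np.linalg.norm(q)

def squad(q1, q2, q3, q4, u):
    """Spherical cubic interpolation, as defined on the slide."""
    return slerp(slerp(q1, q4, u), slerp(q2, q3, u), 2 * u * (1 - u))

# Toy usage: interpolate halfway between two key poses
qa = np.array([1.0, 0.0, 0.0, 0.0])
qb = np.array([np.cos(0.25), np.sin(0.25), 0.0, 0.0])
print(squad(qa, qa, qb, qb, 0.5))
```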

Techniques

Canonical correlation analysis (CCA)
• A scale-invariant optimal linear framework to measure the correlation between two streams of data with different dimensions [Dehon,2000]
• Basic idea: project the features into a common space in which Pearson’s correlation can be computed

Motion coefficient
• Definition: the standard deviation of the sentence-level mean-removed signal
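Both measures can be sketched in a few lines. The snippet below uses scikit-learn's CCA as a stand-in for the formulation cited from [Dehon, 2000] and synthetic data in place of the real feature streams.

```python
# Illustrative CCA and motion-coefficient computation (not the original analysis code).
import numpy as np
from scipy.stats import pearsonr
from sklearn.cross_decomposition import CCA

def canonical_correlation(X, Y):
    """First canonical correlation between X (T, dx) and Y (T, dy)."""
    cca = CCA(n_components=1)
    u, v = cca.fit_transform(X, Y)        # project both streams into a common space
    r, _ = pearsonr(u[:, 0], v[:, 0])     # Pearson's correlation in that space
    return r

def motion_coefficient(y):
    """Standard deviation of the sentence-level mean-removed signal (per dimension)."""
    return np.std(y - y.mean(axis=0), axis=0)

T = 300
X = np.random.randn(T, 6)                                       # e.g. prosodic features
Y = X @ np.random.randn(6, 3) + 0.5 * np.random.randn(T, 3)     # e.g. head motion
print(canonical_correlation(X, Y), motion_coefficient(Y))
```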

HMM Configuration

• Using generic emotion-independent models, 8 configurations were tested
• For emotion-dependent models, less training data is available
• The HMMs were set with a left-to-right (LR) topology, S=2, M=2, K=16

From Euler Angles to Talking Avatars

• Improve

• The avatar is synthesized using Maya
• A model with 46 blend shapes is used
  • Lip and eye motions are also included [Deng,2004], [Deng,2005], [Deng,2005_2], [Deng,2006]

Dynamic Time Warping

• The two acoustic signals are aligned by finding the optimum path (dynamic programming)
• The optimum path is then used to modify the facial gestures
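A minimal DTW sketch follows (illustrative only; the talk does not specify the implementation): two 1-D acoustic feature sequences are aligned by dynamic programming, and the resulting warping path can then be applied to the corresponding facial features.

```python
# Illustrative dynamic time warping between two feature contours.
import numpy as np

def dtw_path(a, b):
    """a, b: 1-D feature sequences (e.g. energy contours)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack the optimum path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        i, j = (i - 1, j - 1) if step == 0 else (i - 1, j) if step == 1 else (i, j - 1)
    return path[::-1]

print(dtw_path(np.sin(np.linspace(0, 3, 20)), np.sin(np.linspace(0, 3, 28))))
```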


MFCC features

• The first coefficient of the MFCCs was removed (Energy)

• The velocity and acceleration coefficients were also included

• Feature vector was reduced from 36 to 12 using Principal Component Analysis (95% of the variance)

• This post-processed feature vector is what will be referred to from here on as MFCCs
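The post-processing above can be sketched as follows (assuming librosa for MFCC extraction and a synthetic placeholder signal; the talk does not name a toolkit): drop the energy coefficient c0, append velocity and acceleration, then reduce the 36-D vector to 12-D with PCA.

```python
# Illustrative MFCC post-processing (12 coefficients + deltas, then PCA to 12-D).
import numpy as np
import librosa
from sklearn.decomposition import PCA

sr = 16000
y = np.sin(2 * np.pi * 220 * np.arange(sr) / sr).astype(np.float32)   # placeholder audio
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)[1:]    # 12 coeffs, c0 (energy) removed
feats = np.vstack([mfcc,
                   librosa.feature.delta(mfcc),           # velocity
                   librosa.feature.delta(mfcc, order=2)]) # acceleration -> 36-D per frame

pca = PCA(n_components=12).fit(feats.T)                   # the talk keeps 95% of the variance in 12-D
mfcc_features = pca.transform(feats.T)                    # referred to as "MFCCs" in the analysis
print(mfcc_features.shape, pca.explained_variance_ratio_.sum())
```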


Marker post-processing

• Translate the markers (the nose marker is the reference)
• Each frame is multiplied by a rotation matrix
  • Choose a neutral pose as reference (102×3 matrix M_ref)
  • For frame t, construct a similar matrix M_t
  • Compute the singular value decomposition, U·D·V^T, of the matrix (M_ref)^T · M_t
  • The product V·U^T gives the rotation matrix R_t [Stegmann,2002]
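The normalization above maps directly to a few lines of NumPy. This is an illustrative sketch of the slide's recipe (and [Stegmann,2002]) with toy data standing in for the motion-capture frames.

```python
# Illustrative head-motion normalization: remove translation via the nose marker,
# then estimate the per-frame rotation from the SVD of (M_ref)^T · M_t and undo it.
import numpy as np

def normalize_frame(M_t, M_ref, nose_idx=0):
    """M_t, M_ref: (102, 3) marker coordinates for the current and neutral frames."""
    M_t = M_t - M_t[nose_idx]                 # translate: nose marker as the reference
    M_ref = M_ref - M_ref[nose_idx]
    U, D, Vt = np.linalg.svd(M_ref.T @ M_t)   # SVD of (M_ref)^T · M_t
    R_t = Vt.T @ U.T                          # rotation matrix V·U^T, as on the slide
    return M_t @ R_t, R_t                     # head-motion-compensated markers, rotation

M_ref = np.random.randn(102, 3)               # toy neutral pose
rot = np.array([[0.8, -0.6, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]])
M_t = M_ref @ rot + 0.5                       # toy frame: rotated and translated neutral pose
markers, R = normalize_frame(M_t, M_ref)
print(R.round(2))                             # recovers the inverse of the applied rotation
```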

Statistically significant correlation analysis
• Sentence-level mapping
• Global-level mapping

Phoneme level analysis of correlation


Features from Facial Expression

• 4-D feature vector

• Data is normalized to remove head motion

• Five facial areas are defined

• 3-D coordinates are concatenated

• PCA is used to reduce to a 10-D vector
• Frame-level classification (K-nearest neighbor)
• The statistics of the frames are aggregated at the utterance level, forming a 4-D feature vector

More details on emotion recognition

Multimodal Systems

• Feature-level integration (89%)
  • High performance for anger, happiness and the neutral state
  • Lower performance for sadness (79%)
  • The performance for happiness decreased

              Anger  Sadness  Happiness  Neutral
  Anger        0.95    0.00     0.03       0.03
  Sadness      0.00    0.79     0.03       0.18
  Happiness    0.02    0.00     0.91       0.08
  Neutral      0.01    0.05     0.02       0.92

• Decision-level integration (89%)
  • The product of the posterior probabilities was the best combining criterion
  • Product criterion: same overall results, but big differences per emotion

                        Overall  Anger  Sadness  Happiness  Neutral
  Maximum combining      0.84     0.82    0.81     0.92       0.81
  Averaging combining    0.88     0.84    0.84     1.00       0.84
  Product combining      0.89     0.84    0.90     0.98       0.84
  Weight combining       0.86     0.89    0.75     1.00       0.81

  In detail (product combining):

              Anger  Sadness  Happiness  Neutral
  Anger        0.84    0.08     0.00       0.08
  Sadness      0.00    0.90     0.00       0.10
  Happiness    0.00    0.00     0.98       0.02
  Neutral      0.00    0.02     0.14       0.84

Smart room fusion