Modeling Human Communication Dynamics - Allen School · Albert (Skip) Rizzo Louis-Philippe Morency ... II. Hidden substructure Nonlinear input fusion. Multi-View Hidden Conditional

USC Multimodal Communication and

Machine Learning Lab

Modeling Human Communication Dynamics

[MultiComp Lab]

PhD students: Derya Ozkan and Sunghyun Park

Master student: Moitreya Chatterjee

Post-doctoral researcher: Stefan Scherer

Research programmer: Giota Stratou

Project manager: Alesia Egan

Louis-Philippe Morency

Multimodal Communication and Machine Learning Lab

© Keith Schaffer

Multimodal Perception of User State: Applications

Engagement Dominance Empathy

Social

Disorders

Depression PTSD Distress

Medical

Distress Indicators Suicide prevention

Education

Group learning analytics Virtual Learning Peer

Business

with MIT and CogitoHealth with Cincinnati Hospital

with Stanford and UCSD

YouTube: Opinion mining with UNT and UT Dallas

with CMU

Affect

Frustration Agreement Sentiment

Negotiation outcomes

with USC business school

Multimodal Perception of Distress Indicators

Multimodal Communicative Behaviors

Gestures Head gestures Eye gestures Arm gestures

Body language Body posture Proxemics

Eye contact Head gaze Eye gaze

Facial expressions FACS action units Smile, frowning

Verbal Visual

Prosody Intonation Voice quality

Auditory

Vocal expressions Laughter, moans

Lexicon Words

Syntax Part-of-speech Dependencies

Pragmatics Discourse acts

From Audio-Visual Signals to Perceived User State

Gestures Head gestures Eye gestures Arm gestures

Body language Body posture Proxemics

Eye contact Head gaze Eye gaze

Facial expressions FACS action units Smile, frowning

Verbal Visual

Prosody Intonation Voice quality

Auditory

Vocal expressions Laughter, moans

Lexicon Words

Syntax Part-of-speech Dependencies

Pragmatics Discourse acts

Engagement Dominance Empathy

Social

Disorders

Depression PTSD Distress

Affect

Frustration Agreement Sentiment


Audio signals

Head pose

Perceived User State

Distress Engagement Sentiment

Visual signals

Voice pitch

• Low-cost depth sensor • Articulated body tracking

• 3D head pose estimation • Real-time facial feature tracker

[FG 2008, best paper award]

[CVPR 2012]


Audio signals

Head pose

Visual signals

Voice pitch

• Pitch, energy and speaking rate • Automatic voice quality analysis


Human Communication Dynamics

• Audio • Visual • Verbal

Perceived User State

Distress Engagement Sentiment

Audio signals

Head pose

Visual signals

Voice pitch

http://www.google.com/url?sa=i&rct=j&q=&esrc=s&frm=1&source=images&cd=&cad=rja&docid=2rvbNGpBInNIvM&tbnid=oKgOgEW517A9IM:&ved=0CAUQjRw&url=http://www.parexcellencemagazine.com/society-and-politics/culture-controls-communication.html&ei=FbcmUa_FBo3LigLhsICwAg&bvm=bv.42768644,d.cGE&psig=AFQjCNEDxh_qAnVihTDETIMIG1fPOUfhFg&ust=1361578046916891

Computational Behavior Indicators

Human in the loop:

Identify behaviors useful to human task

Behavior quantification

Quantify changes in human behaviors

Visualization of Computational Behavior Indicators

Transparent comparisons of observed behavioral indicators

Analogous to medical lab result sheets

Low Normal High

Detection and Computational Analysis of

Psychological Signals (DCAPS)

Albert (Skip) Rizzo Louis-Philippe Morency

Jonathan Gratch Arno Hartholt David Traum Stacy Marsella

Multimodal Perception

Animation Evaluation Integration Dialogue

Visualization & Audio Analysis

Clinical Expert

+ 27 researchers, programmers, artists

and clinicians

Psychological Distress: Datasets

Aims: Study behaviors associated to psychological distress

Depression (PHQ-9)

Trait Anxiety (STAI)

PTSD (PCL-C/PCL-M)

Examine how these cues may differ

Across different interaction settings

Face-to-face: expected to evoke strongest indicators

Computer-mediated (TeleCoach): intended use case

Human-computer (SimSensei): intended use case

Across different populations

General Los Angeles population (Craigslist)

Recent veterans (US Vets)

Multimodal Perception of Distress Indicators

Psychological Distress Indicators

Vertical eye gaze Smile intensity

Distress

Hand self-adaptor Legs fidgeting

Distress No-distress

Distress No-distress Distress No-distress

No-distress

[IEEE FG 2013 – Best paper award]

Distress Distress No-distress No-distress

Voice energy std. Voice quality (NAQ)

Distress Distress No-distress No-distress

Joy – Facial expr. Sad – Facial expr.

• Distress

• Anxiety

• Depression

• PTSD

Effect of Gender on Distress Indicators [ACII 2013]

• Distress

• Anxiety

• Depression

• PTSD

Effect of Gender on Distress Indicators [ACII 2013]

• Distress

• Anxiety

• Depression

• PTSD

Suicide Prevention

Nonverbal indicators of suicidal ideations

Dataset: 30 suicidal adolescents/30 non-suicidal adolescents

Suicidal teenagers use more breathy tones

Sourc

e: C

DC

0.1

0.3

0.5

0.7

0.9

Non-Suicidal Suicidal

Open Quotient (OQ)

0

0.1

0.2

0.3

0.4


Std. Open Quotient (OQ std.)


Std. Norm. Amplitude Quotient (NAQ std.)

0

0.04

0.08

0.12

0.16

****

**

[ICASSP 2013]

MultiSense: Multimodal Perception Library

TOOLS\MODULES

TRANSFORMERS:

Facetrackers

•Gavam

•CLM/CLMZ

•Okao

•Shore

Real-time hCRF

ActiveMQ –VHMessenger

EmoVoice

CONSUMER

*Each one exported as .dll

PROVIDERS:

Audio Capture

Webcam Capture

Mouse

Kinect( Depth/Intensity/IR

image/Skeleton)

CONSUMERS:

Image Painter

Signal Painter

Sensing Layer

Behavior Layer

<person id=“subjectA”>

<sensingLayer>

<headPose>

<position z="223" y="345" x="193" />

<rotation rotZ="15" rotY="35" rotX="10" />

<confidence>0.34<confidence/>

</headPose>

...

</sensingLayer> </person>

<person id=“subjectB”>

<behaviorLayer>

<behavior>

<type>attention</type>

<level>high</level>

<value>0.6</value>

<confidence>0.46<confidence/>

</behavior>

...

</behaviorLayer>

</person>

PML:

Perception

Markup

language

Temporal Probabilistic Learning

x1

y1

x2

y2

x3

y3

xn

yn

Naïve Bayes Classifier

x1

y1

x2

y2

x3

y3

xn

yn

Maximum Entropy Model &

Support Vector Machine

x1 x2 x3 xn

h1 h2 h3 hn

Hidden Markov Model

x1 x2 x3 xn

y1 y2 y3 yn

Conditional Random Field

Observations (e.g., yaw, roll pitch)

Labels (e.g., head-nod,other-gesture)

Caption

Generative models

yi

xi

Discriminative models

• Audio

• Visual

• Verbal

I. Temporal dynamic

Temporal Probabilistic Learning

x1 x2 x3 xn

h1 h2 h3 hn

y

Hidden


x1 x2 x3 xn

h1 h2 h3 hn

y1 y2 y3 yn

Latent-Dynamic


[CVPR 2007] [CVPR 2006,PAMI 2007]

Model Accuracy

HMM 84.2%

CRF 86.0%

HCRF (w=0) 91.6%

HCRF (w=1) 93.9% 0 0.2 0.4 0.6 0.8 1

False positive rate

HMM

SVM

CRF

LDCRF

Tru

e p

osi

tiv

e ra

te

• Audio

• Visual

• Verbal

I. Temporal dynamic

II. Hidden substructure

Latent-Dynamic Conditional Neural Field

h1 h3 h3 hn

y1 y2 y3 yn

g3

x3 x3

x3

gn

xn xn

xn

g2

x2 x2

x2 x1

g1

x1 x1

x1

70

80

90

100

Rapportdataset

Taskardataset

LDCRF

LDCNF

60

65

70

75

80

Audioonly

Visualonly

Earlyfusion

Latefusion

Acc

ura

cy (

%)

Audio-visual sub-challenge

LDCNFAVEC dataset (95 videos)

[FG 2013, submitted]

• Audio

• Visual

• Verbal

I. Temporal dynamic


III. Nonlinear input fusion

Multi-View Hidden Conditional Random Field

y

x1 x2 x3 xn

h1 h2 h3 hn

x1 x2 x3 xn

h1 h2 h3 hn

Multi-View HCRF

y1

x1 x2 x3 xn

h1 h2 h3 hn

x1 x2 x3 xn

h1 h2 h3 hn

y2 y3 yn

Multi-View LDCRF

[CVPR 2012]

• Audio

• Visual

• Verbal

I. Temporal dynamic



IV. Multi-stream models

Canal 9 debate dataset 50

55

60

65

70

HMM CRF HCRF MV-HCRF

Multi-View Hidden Conditional Random Field

y

x1 x2 x3 xn

h1 h2 h3 hn

x1 x2 x3 xn

h1 h2 h3 hn

Multi-View HCRF

y1

x1 x2 x3 xn

h1 h2 h3 hn

x1 x2 x3 xn

h1 h2 h3 hn

y2 y3 yn

Multi-View LDCRF

[ICMI 2012]

Canal 9 debate dataset 50

55

60

65

70

HMM CRF HCRF MV-HCRF MV-HCRF+KCCA

• Audio

• Visual

• Verbal

I. Temporal dynamic



IV. Multi-stream models

V. Multimodal synchrony

Multimodal Machine Learning

• Audio

• Visual

• Verbal

I. Temporal dynamic in label sequence

II. Hidden substructure in label sequence

III. Nonlinear modeling of instantaneous input features

IV. Multi-stream modeling of hidden substructure

V. Synchrony in multimodal input streams

VI. Multi-label structure and correlation

VII. Symbol and signal integration

VIII.Uncertainty in behavior labels

HCRF Library: ICT open-source machine learning library

http://hcrf.sf.net

http://hcrf.sf.net/

Cognition Un

de

rsta

nd

ing

Ge

ne

rati

on

Decision-making

Dialog

Emotion

SimSensei Virtual Human Interaction Loop

Perception

Social Cues

Affective Cues

Physical Cues

Action

Social Cues

Affective Cues

Physical Cues Virtual World

Physical World

Cognition Un

de

rsta

nd

ing

Ge

ne

rati

on

Action

Virtual World

Decision-making

Dialog

Emotion

SimSensei Virtual Human Interaction Loop

Physical World Perception

Social Cues

Affective Cues

Physical Cues

Social Cues

Affective Cues

Physical Cues

MultiSense

Cerebella + Smartbody

Flores Dialogue manager

MultiSense + SimSensei: Video Demonstration

MultiSense + SimSensei: Video Demonstration

Collaborations and Available Technologies

MultiSense: standardized perception framework Real-time facial tracking, articulated body tracking and auditory analysis

Modular and multi-threaded architecture for easy extension

http://multicomp.ict.usc.edu/

hCRF library, machine learning library Matlab and C++ implementations of LDCRF, HCRF and CRF.

2559 downloads during the last year

http://hcrf.sf.net/

GAVAM + CLM-Z, real-time nonverbal behavior recognition Real-time head position and orientation estimation

66 facial feature tracking with automatic initialization




http://hcrf.sf.net/


Thank you!

Documents

Modeling Human Communication Dynamics - Allen School · Albert (Skip) Rizzo Louis-Philippe Morency ... II. Hidden substructure Nonlinear input fusion. Multi-View Hidden Conditional