
Dynamic Bayesian Networks for Multimodal Interaction


Page 1: Dynamic Bayesian Networks for Multimodal Interaction

Tony Jebara, Columbia University

Dynamic Bayesian Networks for Multimodal Interaction

Tony JebaraMachine Learning LabColumbia University

joint work with A. Howard and N. Gu

Page 2: Dynamic Bayesian Networks for Multimodal Interaction


Outline•Introduction: Multi-Modal and Multi-Person•Bayesian Networks and the Junction Tree Algorithm•Maximum Likelihood and Expectation Maximization•Dynamic Bayesian Networks (HMMs, Kalman Filters)•Hidden ARMA Models

•Maximum Conditional Likelihood and Conditional EM•Two-Person Visual Interaction (Gesture Games)

•Input-Output Hidden Markov Models•Audio-Visual Interaction (Conversation)

•Intractable DBNs, Minimum Free Energy, Generalized EM•Dynamical System Trees

•Multi-Person Visual Interaction (Football Plays)•Haptic-Visual Modeling (Surgical Drills)

•Ongoing Directions

Page 3: Dynamic Bayesian Networks for Multimodal Interaction


Introduction•Simplest Dynamical Systems (single Markovian Process)

•Hidden Markov Model and Kalman Filter•But Multi-modal data (audio, video and haptics) have:

•Processes at different time scales•Processes at different amplitude scales•Processes with different noise characteristics

•Also, Multi-person data (multi-limb, two-person, group)

•Weakly coupled•Conditionally Dependent

•Dangerous to slam all time data into one single series:•Find new ways to zipper multiple interacting processes

Page 4: Dynamic Bayesian Networks for Multimodal Interaction


Bayesian Networks•Also called Graphical Models•Marry graph theory & statistics•Directed graph which efficiently encodes large p(x1,…,xN) as product of conditionals of node given parents•Avoids storing huge hypercube over all variables x1,…,xN

•Here, xi discrete (multinomial) or continuous (Gaussian)•Split BNs over sets of hidden XH and observed XV variables•Three basic operations for BNs

1) Infer marginals/conditionals of hidden variables (JTA): p(X_H | X_V, θ)

2) Compute likelihood of the data (JTA): p(X_V | θ) = Σ_{X_H} p(X_H, X_V | θ)

3) Maximize likelihood of the data (EM): max_θ Σ_{X_H} p(X_H, X_V | θ)

[Figure: example Bayesian network over nodes x1, …, x6]

Factorization: p(x_1, …, x_n) = ∏_{i=1}^n p(x_i | π_i), where π_i denotes the parents of x_i
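The product-of-conditionals factorization can be checked numerically; below is a minimal sketch with a hypothetical 3-node chain x1 -> x2 -> x3 and made-up CPT values (not a network from the talk):

```python
import itertools

# Hypothetical chain BN: x1 -> x2 -> x3, all binary; CPT numbers are made up.
p_x1 = {0: 0.6, 1: 0.4}
p_x2_given_x1 = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8}  # key: (x2, x1)
p_x3_given_x2 = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.5, (1, 1): 0.5}  # key: (x3, x2)

def joint(x1, x2, x3):
    # p(x1, x2, x3) = p(x1) p(x2 | x1) p(x3 | x2): each node given its parents.
    return p_x1[x1] * p_x2_given_x1[(x2, x1)] * p_x3_given_x2[(x3, x2)]

# Storing three small tables replaces one 2x2x2 hypercube; with n binary nodes
# the saving is exponential. The factored product is still a valid distribution:
total = sum(joint(a, b, c) for a, b, c in itertools.product((0, 1), repeat=3))
```

Here `total` comes out to 1, confirming the factored form normalizes without ever materializing the full joint table.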

Page 5: Dynamic Bayesian Networks for Multimodal Interaction


Bayes Nets to Junction Trees

[Figure: converting a Bayes net over x1, …, x6 into a junction tree:
1) Bayes Net -> 2) Moral Graph -> 3) Triangulated -> 4) Junction Tree,
ending with cliques {x1,x2,x3}, {x2,x3,x5}, {x2,x5,x6}, {x2,x4}
joined by separators {x2,x3}, {x2,x5}, {x2}]

•Workhorse of BNs is Junction Tree Algorithm

Page 6: Dynamic Bayesian Networks for Multimodal Interaction


Junction Tree Algorithm

Cliques and separator: V = {A,B}, S = {B}, W = {B,C}

If the cliques agree:
  Σ_{V\S} ψ_V = φ_S = Σ_{W\S} ψ_W = p(S)

Else, send a message from V to W:
  φ*_S = Σ_{V\S} ψ_V
  ψ*_W = (φ*_S / φ_S) ψ_W
  ψ*_V = ψ_V

Send a message from W to V:
  φ**_S = Σ_{W\S} ψ*_W
  ψ**_V = (φ**_S / φ*_S) ψ*_V
  ψ**_W = ψ*_W

Then, Cliques Agree

•The JTA sends messages from cliques through separators (these are just tables or potential functions)•Ensures that the various tables in the junction tree agree (are consistent) over shared variables via their marginals.
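The message rule above can be traced on a hypothetical two-clique tree (all potential entries are made-up numbers): a collect pass V -> W followed by a distribute pass W -> V leaves both cliques with the same marginal over the shared variable.

```python
# Two cliques V={A,B}, W={B,C} with separator S={B}; binary variables.
# Initial potentials are arbitrary illustrative numbers.
psi_V = {(a, b): [[2.0, 1.0], [0.5, 3.0]][a][b] for a in (0, 1) for b in (0, 1)}
psi_W = {(b, c): [[1.0, 4.0], [2.0, 1.0]][b][c] for b in (0, 1) for c in (0, 1)}
phi_S = {0: 1.0, 1: 1.0}

def send(src, keep_idx, phi, dst, dst_idx):
    # Message src -> dst through the separator: phi* = marginalize src onto S,
    # then rescale dst by phi*/phi (the update rule on the slide).
    phi_new = {b: sum(v for k, v in src.items() if k[keep_idx] == b) for b in (0, 1)}
    for k in dst:
        dst[k] *= phi_new[k[dst_idx]] / phi[k[dst_idx]]
    return phi_new

phi_S = send(psi_V, 1, phi_S, psi_W, 0)   # collect:    V -> W
phi_S = send(psi_W, 0, phi_S, psi_V, 1)   # distribute: W -> V

# After both passes the cliques agree on the shared variable B:
marg_V = {b: sum(v for k, v in psi_V.items() if k[1] == b) for b in (0, 1)}
marg_W = {b: sum(v for k, v in psi_W.items() if k[0] == b) for b in (0, 1)}
```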

Page 7: Dynamic Bayesian Networks for Multimodal Interaction


Junction Tree Algorithm•On trees, JTA is guaranteed: 1) Init 2) Collect 3) Distribute

•Ends with potentials as marginals or conditionals of hidden variables given data: p(Xh1|Xv), p(Xh2|Xv), p(Xh1, Xh2|Xv)

•And the likelihood p(Xv) is the potential normalizer

Page 8: Dynamic Bayesian Networks for Multimodal Interaction


Maximum Likelihood with EM

•We wish to maximize the likelihood over θ for learning:

•EM instead iteratively maximizes a lower bound on the log-likelihood:

•E-step: maximize L(q, θ^t) over q

•M-step: maximize L(q^{t+1}, θ) over θ

[Figure: the bound L(q, θ) touches the log-likelihood l(θ) from below at the current q(z)]

Goal: max_θ Σ_{X_H} p(X_H, X_V | θ)

Bound: log Σ_{X_H} p(X_H, X_V | θ) = L(q, θ) + KL( q(X_H) || p(X_H | X_V, θ) ) ≥ L(q, θ),

where L(q, θ) = Σ_{X_H} q(X_H) log [ p(X_H, X_V | θ) / q(X_H) ]

E-step: q^{t+1} = argmax_q L(q, θ^t) = p(X_H | X_V, θ^t)

M-step: θ^{t+1} = argmax_θ L(q^{t+1}, θ)
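The bound-and-gap relationship can be verified numerically on a toy two-component 1D Gaussian mixture (all parameter values below are illustrative): setting q to the exact posterior makes L(q, θ) touch the log-likelihood, and any other q falls strictly below.

```python
import math

# Toy mixture p(x, z=k) = pi_k N(x | mu_k, 1); parameter values are illustrative.
pis, mus = [0.5, 0.5], [-1.0, 2.0]
x = 0.3

def gauss(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

joint = [pis[k] * gauss(x, mus[k]) for k in (0, 1)]   # p(x, z=k | theta)
loglik = math.log(sum(joint))                          # log p(x | theta)

def bound(q):
    # L(q, theta) = sum_z q(z) log [ p(x, z | theta) / q(z) ]
    return sum(q[k] * (math.log(joint[k]) - math.log(q[k])) for k in (0, 1))

posterior = [j / sum(joint) for j in joint]            # E-step: q = p(z | x, theta)
gap_at_posterior = loglik - bound(posterior)           # = KL(q || posterior) = 0
gap_elsewhere = loglik - bound([0.9, 0.1])             # > 0 for any other q
```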

Page 9: Dynamic Bayesian Networks for Multimodal Interaction


Dynamic Bayes Nets

Hidden Markov Model

State Transition Model: P(S_t = i | S_{t-1} = j) = α(i, j), with P(S_0 = i) = π(i)

Emission Model: P(Y_t = i | S_t = j) = β(i, j), or P(Y_t = y_t | S_t = i) = N(y_t | μ_i, Σ_i)

Linear Dynamical System

State Transition Model: P(X_t = x_t | X_{t-1} = x_{t-1}) = N(x_t | A x_{t-1}, Q), with P(X_0 = x_0) = N(x_0 | μ_0, Q_0)

Emission Model: P(Y_t = y_t | X_t = x_t) = N(y_t | C x_t, R)

[Figure: HMM state chain s0…s3 emitting y0…y3; LDS state chain x0…x3 emitting y0…y3]

•Dynamic Bayesian Networks are BNs unrolled in time•The simplest and most classical examples are the HMM and the LDS above
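For the discrete HMM, the likelihood p(Y) that the JTA delivers can also be obtained with the classic forward recursion; a minimal sketch with made-up transition and emission tables:

```python
# Hypothetical 2-state HMM over 2 observation symbols; all numbers illustrative.
pi = [0.6, 0.4]                      # P(S_0 = i)
A = [[0.7, 0.3], [0.2, 0.8]]         # A[j][i] = P(S_t = i | S_{t-1} = j)
B = [[0.9, 0.1], [0.3, 0.7]]         # B[i][y] = P(Y_t = y | S_t = i)
obs = [0, 1, 1]

# Forward recursion: alpha_t(i) = P(y_0..y_t, S_t = i); likelihood = sum_i alpha_T(i)
alpha = [pi[i] * B[i][obs[0]] for i in range(2)]
for y in obs[1:]:
    alpha = [sum(alpha[j] * A[j][i] for j in range(2)) * B[i][y] for i in range(2)]
likelihood = sum(alpha)
```

The recursion costs O(T·K²) for K states versus O(K^T) for brute-force enumeration over state paths.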

Page 10: Dynamic Bayesian Networks for Multimodal Interaction


Two-Person Interaction

Interact with a single user via p(y|x); learn from two users to get p(y|x)

•Learn from two interacting people (person Y and person X) to mimic the interaction via a simulated person Y.

•One hidden Markov model for each user… no coupling!

[Figure: two separate HMM chains, one per user]

•One time series for both users… too rigid!

[Figure: one HMM chain over the concatenated two-user series]

Page 11: Dynamic Bayesian Networks for Multimodal Interaction


DBN: Hidden ARMA Model

Learn to imitate behavior by watching a teacher exhibit it.

E.g. unsupervised observation of 2-agent interaction

E.g. track lip motion

Discover correlations between past action & subsequent reaction

Estimate p(Y | past X, past Y)

[Figure: the two interacting signals X and Y]

Page 12: Dynamic Bayesian Networks for Multimodal Interaction


DBN: Hidden ARMA Model

[Figure: hidden ARMA model; state chain s0…s6, person Y's emissions y0…y6, person X's inputs x0…x6]

•Focus on predicting person Y from the past of both X and Y•Have multiple linear models from the past to the future•Use a window for the moving average (compressed with PCA)•But select among them using S (nonlinear)•Here, we show only a 2nd-order moving average to predict the next Y given the past two Y's, the past two X's, the current X, and a random choice of ARMA linear model
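The prediction step described above, a bank of linear regressors onto the recent window with the hidden state s picking which one fires, can be sketched as follows (the two models and all coefficients are hypothetical, not fitted values):

```python
# Hypothetical bank of 2nd-order ARMA predictors, one per hidden state s.
models = {
    0: {"a": [0.5, 0.2], "b": [0.3, 0.0, 0.1]},    # made-up coefficients
    1: {"a": [-0.1, 0.6], "b": [0.0, 0.4, 0.2]},
}

def predict_y(s, y_past, x_window):
    # y_t-hat = sum_i a_i y_{t-i} + sum_j b_j x_{t-j}: linear in the past two Y's
    # and in the current X plus past two X's; the nonlinearity is the switch s.
    m = models[s]
    return (sum(a * y for a, y in zip(m["a"], y_past)) +
            sum(b * x for b, x in zip(m["b"], x_window)))

yhat = predict_y(0, y_past=[1.0, 0.5], x_window=[0.2, 0.8, 0.1])
```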

Page 13: Dynamic Bayesian Networks for Multimodal Interaction


Hidden ARMA Features:•Model skin color as mixture of RGB Gaussians•Track person as mixture of spatial Gaussians

•But, want to predict only Y from X… Be discriminative•Use maximum conditional likelihood (CEM)

Page 14: Dynamic Bayesian Networks for Multimodal Interaction


Conditional EM

EM:
divide & conquer

CEM:
discriminative divide & conquer

[Figure: EM vs. CEM fits of a mixture; the EM solution attains joint likelihood l = -8.0 but conditional likelihood l_c = -1.7, while the CEM solution attains l = -54.7 but l_c = +0.4]

•Only need the conditional p(Y_V | X_V, θ)? •Then maximize conditional likelihood:

max_θ { log Σ_{X_H} p(X_H, Y_V, X_V | θ) - log Σ_{X_H} p(X_H, X_V | θ) }
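The conditional objective log p(Y|X) = log p(X,Y) - log p(X) can be evaluated on a toy discrete joint (no hidden variables for simplicity; the table values are illustrative):

```python
import math

# Toy discrete joint p(x, y | theta) as a table; the numbers are illustrative.
p_xy = {(0, 0): 0.30, (0, 1): 0.10, (1, 0): 0.15, (1, 1): 0.45}

def cond_loglik(pairs):
    # CEM objective shape from the slide: sum_n [ log p(x_n, y_n) - log p(x_n) ],
    # i.e. log of the joint minus log of the marginal over the predicted variable.
    total = 0.0
    for x, y in pairs:
        log_joint = math.log(p_xy[(x, y)])
        log_marg = math.log(sum(p_xy[(x, yy)] for yy in (0, 1)))
        total += log_joint - log_marg
    return total

data = [(0, 0), (1, 1), (1, 1)]
score = cond_loglik(data)
```

Maximizing `score` over the table entries would reward only the mapping from x to y, not how well x itself is modeled, which is exactly the discriminative trade-off shown in the figure above.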

Page 15: Dynamic Bayesian Networks for Multimodal Interaction


Conditional EM

[Figure: CEM vs. EM on p(c|x,y); CEM accuracy = 100%, EM accuracy = 51%; panels show the EM fit of p(y|x) and the CEM fit of p(y|x)]

Page 16: Dynamic Bayesian Networks for Multimodal Interaction


Conditional EM for hidden ARMA

Nearest Neighbor: 1.57% RMS
Constant Velocity: 0.85% RMS
Hidden ARMA: 0.64% RMS

2 users gesture to each other for a few minutes
Model: mixture of 25 Gaussians, STM: T=120, Dims=22+15

Estimate Prediction Discriminatively/Conditionally p(future|past)

Page 17: Dynamic Bayesian Networks for Multimodal Interaction


Hidden ARMA on Gesture

[Figure: gesture predictions; SCARE, WAVE, CLAP]

Page 18: Dynamic Bayesian Networks for Multimodal Interaction


DBN: Input-Output HMM

-Sony Picturebook laptop
-2 cameras (7 Hz) (USB & analog)
-2 microphones (USB & analog)
-100 MB per hour ($10/GB)

•Similarly, learn a person's responses to audio-video stimuli to predict Y (the agent A) from X (the world W)•A wearable collects the audio & video A, W

Page 19: Dynamic Bayesian Networks for Multimodal Interaction


DBN: Input-Output HMM

log p(A | W) = log p(A, W) - log p(W)

[Figure: input-output HMM; state chain s0…s3 with world inputs w0…w3 and agent outputs a0…a3]

•Consider simulating the agent given the world•A hidden Markov model on its own is insufficient since it does not distinguish between the input role the world plays and the output we need to generate•Instead, form an input-output HMM•One IOHMM predicts the agent's audio using all 3 past channels•One IOHMM predicts the agent's video•Use CEM to learn the IOHMMs discriminatively

Page 20: Dynamic Bayesian Networks for Multimodal Interaction


Input-Output HMM Data

Video: -Histogram lighting correction
-RGB mixture of Gaussians to detect skin
-Face: 2000 pixels at 7 Hz (X, Y, Intensity)

Audio: -Hamming window, FFT, equalization
-Spectrograms at 60 Hz
-200 bands (Amplitude, Frequency)

Very noisy data set!

Page 21: Dynamic Bayesian Networks for Multimodal Interaction


Video Representation
- Principal Components Analysis: linear vectors in Euclidean space
- Images, spectrograms, time series as vectors
- But vectorization is bad: it ignores nonlinear structure
- Images = collections of (X, Y, I) tuples ("pixels")
- Spectrograms = collections of (A, F) tuples… therefore…
- Corresponded Principal Components Analysis (CPCA)

PCA: min Σ_{i=1}^T Σ_{d=1}^D ( X_n^{id} - Σ_{m=1}^K c_n^{im} V^{md} )²

CPCA: min_M Σ_{i=1}^T Σ_{d=1}^D ( Σ_{j=1}^T M_n^{ij} X_n^{jd} - Σ_{m=1}^K c_n^{im} V^{md} )²

where the M_n are soft permutation matrices

Page 22: Dynamic Bayesian Networks for Multimodal Interaction


Video Representation

[Figure: Original vs. PCA vs. CPCA reconstructions; 2000 XYI pixels compressed to 20 dims]

Page 23: Dynamic Bayesian Networks for Multimodal Interaction


Input-Output HMM

For agent and world:
1 loudness scalar
20 spectrogram coefficients
20 face coefficients

Estimate the hidden trellis from partial data

Page 24: Dynamic Bayesian Networks for Multimodal Interaction


Input-Output HMM with CEM

log p(A | W) = log p(A, W) - log p(W)

Conditionally model:
p(Agent Audio | World Audio, World Video)
p(Agent Video | World Audio, World Video)

Don't care how well we can model world audio and video, just as long as we can map it to agent audio or agent video. Avoids temporal scale problems too (Video 5 Hz, Audio 60 Hz)

CEM: 60-state 82-dim HMM
Diagonal Gaussian emissions
90,000 samples train / 36,000 test

Audio IOHMM: [Figure: state chain s0…s3 with world inputs w0…w3 and agent audio outputs a0…a3; a second chain s0…s3 over world inputs w0…w3 for the video IOHMM]

Page 25: Dynamic Bayesian Networks for Multimodal Interaction


Input-Output HMM with CEM

           EM (red)   CEM (blue)
Audio        99.61       100.58
Video      -122.46     -121.26

(joint likelihood training vs. conditional likelihood training)

Spectrograms from eigenspace

KD-tree on video coefficients to the closest image in training (point cloud too confusing)

TRAINING & TESTING

RESYNTHESIS

Page 26: Dynamic Bayesian Networks for Multimodal Interaction


Input-Output HMM Results

Test

Train

Page 27: Dynamic Bayesian Networks for Multimodal Interaction


Intractable Dynamic Bayes Nets

Factorial Hidden Markov Model: Interaction Through Output
[Figure: three hidden state chains s^1, s^2, s^3 jointly emitting one output sequence y0…y3]

Coupled Hidden Markov Model: Interaction Through Hidden States
[Figure: two hidden state chains s^1, s^2 coupled across chains, each emitting its own output sequence y^1_0…y^1_3 and y^2_0…y^2_3]

Page 28: Dynamic Bayesian Networks for Multimodal Interaction


Intractable DBNs: Generalized EM•As before, we use a bound on the likelihood:

•But the best q over hidden variables that minimizes the KL is intractable!•Thus, restrict q to explore only factorized distributions•EM still converges under partial E-steps & partial M-steps

[Figure: the restricted bound L(q, θ) against the log-likelihood l(θ)]

log Σ_{X_H} p(X_H, X_V | θ) = L(q, θ) + KL( q(X_H) || p(X_H | X_V, θ) ) ≥ L(q, θ)

E-step (restricted): q^{t+1} = argmax_{q ∈ FACTORIZED} L(q, θ^t)

M-step: θ^{t+1} = argmax_θ L(q^{t+1}, θ)

Page 29: Dynamic Bayesian Networks for Multimodal Interaction


Intractable DBNs Variational EM

Factorial Hidden Markov Model
[Figure: the three state chains s^1, s^2, s^3 with the coupling removed in q]

Coupled Hidden Markov Model
[Figure: the two state chains s^1, s^2 decoupled in q]

•Now, the q distributions are limited to chains•Tractable as an iterative method•Also known as variational EM or structured mean field
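A fully factorized variant of this idea (plain mean field rather than the structured, chain-shaped q used here) fits in a few lines; for two coupled binary variables with made-up coupling weights, coordinate updates on q monotonically shrink KL(q || p):

```python
import math

# Coupled p(s1, s2) over binary states via made-up log-potentials.
logp = {(a, b): 1.5 * a * b + 0.2 * a - 0.3 * b for a in (0, 1) for b in (0, 1)}
logZ = math.log(sum(math.exp(v) for v in logp.values()))

def kl(q1, q2):
    # KL( q1(s1) q2(s2) || p(s1, s2) ): the quantity the restricted E-step
    # minimizes when q is confined to the factorized family.
    return sum(q1[a] * q2[b] * (math.log(q1[a] * q2[b]) - (logp[(a, b)] - logZ))
               for a in (0, 1) for b in (0, 1))

q1, q2 = [0.5, 0.5], [0.5, 0.5]
kl_start = kl(q1, q2)
for _ in range(50):
    # Coordinate update: q1(a) proportional to exp( E_{q2} log p~(a, s2) ), then q2.
    u = [math.exp(sum(q2[b] * logp[(a, b)] for b in (0, 1))) for a in (0, 1)]
    q1 = [x / sum(u) for x in u]
    u = [math.exp(sum(q1[a] * logp[(a, b)] for a in (0, 1))) for b in (0, 1)]
    q2 = [x / sum(u) for x in u]
kl_end = kl(q1, q2)
```

Each coordinate update can only decrease the KL, which is the same partial-E-step guarantee that makes generalized EM converge.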

Page 30: Dynamic Bayesian Networks for Multimodal Interaction


•How to handle more people and a hierarchy of coupling?•DSTs consider hierarchical coupling, e.g. in a university: students -> department -> school -> university

Dynamical System Trees

Interaction Through Aggregated Community State

[Figure: a dynamical system tree unrolled over 2 time steps; four leaf state chains s^1…s^4 with their emissions, aggregated by internal states s^{1,2} and s^{3,4} under a root state s^0]

Internal nodes are states. Leaf nodes are emissions. Any subtree is also a DST. The DST above is unrolled over 2 time steps.

Page 31: Dynamic Bayesian Networks for Multimodal Interaction


•Also apply the generalization of EM and do variational structured mean field for the q distribution.•Becomes formulaic for any DST topology!•Code available at http://www.cs.columbia.edu/~jebara/dst

Dynamical System Trees

[Figure: the DST again with its decoupled variational q; leaf chains s^1…s^4 and emissions x^1…x^4 over 2 time steps]

Page 32: Dynamic Bayesian Networks for Multimodal Interaction


DSTs and Generalized EM

[Figure: one slice of the DST under structured mean field; decoupled subchains, each with its own variational parameters]

Structured Mean Field:
Use a tractable distribution Q to approximate P
Introduce variational parameters
Find min KL(Q||P) by alternating between introducing variational parameters and running inference on each subchain

Page 33: Dynamic Bayesian Networks for Multimodal Interaction


DSTs for American Football

Initial frame of a typical play

Trajectories of players

Page 34: Dynamic Bayesian Networks for Multimodal Interaction


DSTs for American Football

~20 time series of two types of plays (wham and digs)
Likelihood ratio of models used as classifier
DST1 puts all players into 1 game state
DST2 combines players into two teams and then into a game

Page 35: Dynamic Bayesian Networks for Multimodal Interaction


DSTs for Gene Networks•Time series of cell cycle•Hundreds of gene expression levels over time•Use given hierarchical clustering•DST with hierarchical clustering structure gives best test likelihood

Page 36: Dynamic Bayesian Networks for Multimodal Interaction


Robotic Surgery, Haptics & Video•da Vinci Laparoscopic Robot•Used in hundreds of hospitals•Surgeon works at a console•Robot mimics the movements on the (local) patient•Captures all actuator/robot data as a 300 Hz time series•Multi-channel video from cameras inside the patient

Page 37: Dynamic Bayesian Networks for Multimodal Interaction


Robotic Surgery, Haptics & Video

Page 38: Dynamic Bayesian Networks for Multimodal Interaction


Robotic Surgery, Haptics & Video

Suturing

Expert Novice

64-dimensional time series @ 300 Hz
Console and actuator parameters

Page 39: Dynamic Bayesian Networks for Multimodal Interaction


•Compress haptic & video data with PCA to 60 dims.•Collected data from novices and experts and built several DBNs (IOHMMs, DSTs, etc.) of expert and novice for 3 different drills (6 models total).•Preliminary results:

Minefield, Russian Roulette, Suture

Robotic Surgical Drills Results

Page 40: Dynamic Bayesian Networks for Multimodal Interaction


Conclusion•Dynamic Bayesian networks are a natural upgrade to HMMs.•Relevant for structured, multi-modal and multi-person temporal data.•Several examples of dynamic Bayesian networks for:

audio, video and haptic channels
single, two-person and multi-person activity

•DBNs: HMMs, Kalman filters, hidden ARMA, input-output HMMs.•Use maximum likelihood (EM) or maximum conditional likelihood (CEM).•Intractable DBNs: switched Kalman filters, dynamical system trees.•Use minimum free energy (GEM) and structured mean field.•Examples of applications:

gesture interaction (gesture games)
audio-video interaction (social conversation)
multi-person game playing (American football)
haptic-video interaction (robotic laparoscopy)

•Funding provided in part by the National Science Foundation, the Central Intelligence Agency, Alphastar and Microsoft.