
Dynamic Bayesian Networks for Multimodal Interaction


Page 1: Dynamic Bayesian Networks for Multimodal Interaction

Tony Jebara, Columbia University

Dynamic Bayesian Networks for Multimodal Interaction

Tony JebaraMachine Learning LabColumbia University

joint work with A. Howard and N. Gu

Page 2: Dynamic Bayesian Networks for Multimodal Interaction


Outline•Introduction: Multi-Modal and Multi-Person•Bayesian Networks and the Junction Tree Algorithm•Maximum Likelihood and Expectation Maximization•Dynamic Bayesian Networks (HMMs, Kalman Filters)•Hidden ARMA Models

•Maximum Conditional Likelihood and Conditional EM•Two-Person Visual Interaction (Gesture Games)

•Input-Output Hidden Markov Models•Audio-Visual Interaction (Conversation)

•Intractable DBNs, Minimum Free Energy, Generalized EM•Dynamical System Trees

•Multi-Person Visual Interaction (Football Plays)•Haptic-Visual Modeling (Surgical Drills)

•Ongoing Directions

Page 3: Dynamic Bayesian Networks for Multimodal Interaction


Introduction•Simplest Dynamical Systems (single Markovian Process)

•Hidden Markov Model and Kalman Filter•But Multi-modal data (audio, video and haptics) have:

•Processes at different time scales•Processes at different amplitude scales•Processes with different noise characteristics

•Also, Multi-person data (multi-limb, two-person, group)

•Weakly coupled•Conditionally Dependent

•Dangerous to slam all time data into one single series:•Find new ways to zipper multiple interacting processes

Page 4: Dynamic Bayesian Networks for Multimodal Interaction


Bayesian Networks•Also called Graphical Models•Marry graph theory & statistics•Directed graph which efficiently encodes large p(x1,…,xN) as product of conditionals of node given parents•Avoids storing huge hypercube over all variables x1,…,xN

•Here, xi discrete (multinomial) or continuous (Gaussian)•Split BNs over sets of hidden XH and observed XV variables•Three basic operations for BNs

1) Infer marginals/conditionals of hidden variables (JTA): p(X_H | X_V, θ)

2) Compute likelihood of the data (JTA): p(X_V | θ) = Σ_{X_H} p(X_H, X_V | θ)

3) Maximize likelihood of the data (EM): max_θ Σ_{X_H} p(X_H, X_V | θ)

[Figure: example Bayesian network over nodes x1, …, x6]

Factorization: p(x_1, …, x_n) = ∏_{i=1}^n p(x_i | π_i), where π_i denotes the parents of x_i
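The product-of-conditionals factorization can be checked numerically; below is a minimal sketch with a hypothetical 3-node chain x1 -> x2 -> x3 and made-up CPT values (not a network from the talk):

```python
import itertools

# Hypothetical chain BN: x1 -> x2 -> x3, all binary; CPT numbers are made up.
p_x1 = {0: 0.6, 1: 0.4}
p_x2_given_x1 = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8}  # key: (x2, x1)
p_x3_given_x2 = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.5, (1, 1): 0.5}  # key: (x3, x2)

def joint(x1, x2, x3):
    # p(x1, x2, x3) = p(x1) p(x2 | x1) p(x3 | x2): each node given its parents.
    return p_x1[x1] * p_x2_given_x1[(x2, x1)] * p_x3_given_x2[(x3, x2)]

# Storing three small tables replaces one 2x2x2 hypercube; with n binary nodes
# the saving is exponential. The factored product is still a valid distribution:
total = sum(joint(a, b, c) for a, b, c in itertools.product((0, 1), repeat=3))
```

Here `total` comes out to 1, confirming the factored form normalizes without ever materializing the full joint table.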

Page 5: Dynamic Bayesian Networks for Multimodal Interaction


Bayes Nets to Junction Trees

[Figure: converting a Bayes net over x1, …, x6 into a junction tree:
1) Bayes Net -> 2) Moral Graph -> 3) Triangulated -> 4) Junction Tree,
ending with cliques {x1,x2,x3}, {x2,x3,x5}, {x2,x5,x6}, {x2,x4}
joined by separators {x2,x3}, {x2,x5}, {x2}]

•Workhorse of BNs is Junction Tree Algorithm

Page 6: Dynamic Bayesian Networks for Multimodal Interaction


Junction Tree Algorithm

Cliques and separator: V = {A,B}, S = {B}, W = {B,C}

If the cliques agree:
  Σ_{V\S} ψ_V = φ_S = Σ_{W\S} ψ_W = p(S)

Else, send a message from V to W:
  φ*_S = Σ_{V\S} ψ_V
  ψ*_W = (φ*_S / φ_S) ψ_W
  ψ*_V = ψ_V

Send a message from W to V:
  φ**_S = Σ_{W\S} ψ*_W
  ψ**_V = (φ**_S / φ*_S) ψ*_V
  ψ**_W = ψ*_W

Then, Cliques Agree

•The JTA sends messages from cliques through separators (these are just tables or potential functions)•Ensures that the various tables in the junction tree agree (are consistent) over shared variables via their marginals.
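The message rule above can be traced on a hypothetical two-clique tree (all potential entries are made-up numbers): a collect pass V -> W followed by a distribute pass W -> V leaves both cliques with the same marginal over the shared variable.

```python
# Two cliques V={A,B}, W={B,C} with separator S={B}; binary variables.
# Initial potentials are arbitrary illustrative numbers.
psi_V = {(a, b): [[2.0, 1.0], [0.5, 3.0]][a][b] for a in (0, 1) for b in (0, 1)}
psi_W = {(b, c): [[1.0, 4.0], [2.0, 1.0]][b][c] for b in (0, 1) for c in (0, 1)}
phi_S = {0: 1.0, 1: 1.0}

def send(src, keep_idx, phi, dst, dst_idx):
    # Message src -> dst through the separator: phi* = marginalize src onto S,
    # then rescale dst by phi*/phi (the update rule on the slide).
    phi_new = {b: sum(v for k, v in src.items() if k[keep_idx] == b) for b in (0, 1)}
    for k in dst:
        dst[k] *= phi_new[k[dst_idx]] / phi[k[dst_idx]]
    return phi_new

phi_S = send(psi_V, 1, phi_S, psi_W, 0)   # collect:    V -> W
phi_S = send(psi_W, 0, phi_S, psi_V, 1)   # distribute: W -> V

# After both passes the cliques agree on the shared variable B:
marg_V = {b: sum(v for k, v in psi_V.items() if k[1] == b) for b in (0, 1)}
marg_W = {b: sum(v for k, v in psi_W.items() if k[0] == b) for b in (0, 1)}
```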

Page 7: Dynamic Bayesian Networks for Multimodal Interaction


Junction Tree Algorithm•On trees, JTA is guaranteed: 1) Init 2) Collect 3) Distribute

•Ends with potentials as marginals or conditionals of hidden variables given data: p(Xh1|Xv), p(Xh2|Xv), p(Xh1, Xh2|Xv)

•And the likelihood p(Xv) is the potential normalizer

Page 8: Dynamic Bayesian Networks for Multimodal Interaction


Maximum Likelihood with EM

•We wish to maximize the likelihood over θ for learning:

•EM instead iteratively maximizes a lower bound on the log-likelihood:

•E-step: maximize L(q, θ^t) over q

•M-step: maximize L(q^{t+1}, θ) over θ

[Figure: the bound L(q, θ) touches the log-likelihood l(θ) from below at the current q(z)]

Goal: max_θ Σ_{X_H} p(X_H, X_V | θ)

Bound: log Σ_{X_H} p(X_H, X_V | θ) = L(q, θ) + KL( q(X_H) || p(X_H | X_V, θ) ) ≥ L(q, θ),

where L(q, θ) = Σ_{X_H} q(X_H) log [ p(X_H, X_V | θ) / q(X_H) ]

E-step: q^{t+1} = argmax_q L(q, θ^t) = p(X_H | X_V, θ^t)

M-step: θ^{t+1} = argmax_θ L(q^{t+1}, θ)
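The bound-and-gap relationship can be verified numerically on a toy two-component 1D Gaussian mixture (all parameter values below are illustrative): setting q to the exact posterior makes L(q, θ) touch the log-likelihood, and any other q falls strictly below.

```python
import math

# Toy mixture p(x, z=k) = pi_k N(x | mu_k, 1); parameter values are illustrative.
pis, mus = [0.5, 0.5], [-1.0, 2.0]
x = 0.3

def gauss(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

joint = [pis[k] * gauss(x, mus[k]) for k in (0, 1)]   # p(x, z=k | theta)
loglik = math.log(sum(joint))                          # log p(x | theta)

def bound(q):
    # L(q, theta) = sum_z q(z) log [ p(x, z | theta) / q(z) ]
    return sum(q[k] * (math.log(joint[k]) - math.log(q[k])) for k in (0, 1))

posterior = [j / sum(joint) for j in joint]            # E-step: q = p(z | x, theta)
gap_at_posterior = loglik - bound(posterior)           # = KL(q || posterior) = 0
gap_elsewhere = loglik - bound([0.9, 0.1])             # > 0 for any other q
```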

Page 9: Dynamic Bayesian Networks for Multimodal Interaction


Dynamic Bayes Nets

Hidden Markov Model

State Transition Model: P(S_t = i | S_{t-1} = j) = α(i, j), with P(S_0 = i) = π(i)

Emission Model: P(Y_t = i | S_t = j) = β(i, j), or P(Y_t = y_t | S_t = i) = N(y_t | μ_i, Σ_i)

Linear Dynamical System

State Transition Model: P(X_t = x_t | X_{t-1} = x_{t-1}) = N(x_t | A x_{t-1}, Q), with P(X_0 = x_0) = N(x_0 | μ_0, Q_0)

Emission Model: P(Y_t = y_t | X_t = x_t) = N(y_t | C x_t, R)

[Figure: HMM state chain s0…s3 emitting y0…y3; LDS state chain x0…x3 emitting y0…y3]

•Dynamic Bayesian Networks are BNs unrolled in time•The simplest and most classical examples are the HMM and the LDS above
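For the discrete HMM, the likelihood p(Y) that the JTA delivers can also be obtained with the classic forward recursion; a minimal sketch with made-up transition and emission tables:

```python
# Hypothetical 2-state HMM over 2 observation symbols; all numbers illustrative.
pi = [0.6, 0.4]                      # P(S_0 = i)
A = [[0.7, 0.3], [0.2, 0.8]]         # A[j][i] = P(S_t = i | S_{t-1} = j)
B = [[0.9, 0.1], [0.3, 0.7]]         # B[i][y] = P(Y_t = y | S_t = i)
obs = [0, 1, 1]

# Forward recursion: alpha_t(i) = P(y_0..y_t, S_t = i); likelihood = sum_i alpha_T(i)
alpha = [pi[i] * B[i][obs[0]] for i in range(2)]
for y in obs[1:]:
    alpha = [sum(alpha[j] * A[j][i] for j in range(2)) * B[i][y] for i in range(2)]
likelihood = sum(alpha)
```

The recursion costs O(T·K²) for K states versus O(K^T) for brute-force enumeration over state paths.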

Page 10: Dynamic Bayesian Networks for Multimodal Interaction


Two-Person Interaction

Interact with a single user via p(y|x); learn from two users to get p(y|x)

•Learn from two interacting people (person Y and person X) to mimic the interaction via a simulated person Y.

•One hidden Markov model for each user… no coupling!

[Figure: two separate HMM chains, one per user]

•One time series for both users… too rigid!

[Figure: one HMM chain over the concatenated two-user series]

Page 11: Dynamic Bayesian Networks for Multimodal Interaction


DBN: Hidden ARMA Model

Learn to imitate behavior by watching a teacher exhibit it.

E.g. unsupervised observation of 2-agent interaction

E.g. track lip motion

Discover correlations between past action & subsequent reaction

Estimate p(Y | past X, past Y)

[Figure: the two interacting signals X and Y]

Page 12: Dynamic Bayesian Networks for Multimodal Interaction


DBN: Hidden ARMA Model

[Figure: hidden ARMA model; state chain s0…s6, person Y's emissions y0…y6, person X's inputs x0…x6]

•Focus on predicting person Y from the past of both X and Y•Have multiple linear models from the past to the future•Use a window for the moving average (compressed with PCA)•But select among them using S (nonlinear)•Here, we show only a 2nd-order moving average to predict the next Y given the past two Y's, the past two X's, the current X, and a random choice of ARMA linear model
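The prediction step described above, a bank of linear regressors onto the recent window with the hidden state s picking which one fires, can be sketched as follows (the two models and all coefficients are hypothetical, not fitted values):

```python
# Hypothetical bank of 2nd-order ARMA predictors, one per hidden state s.
models = {
    0: {"a": [0.5, 0.2], "b": [0.3, 0.0, 0.1]},    # made-up coefficients
    1: {"a": [-0.1, 0.6], "b": [0.0, 0.4, 0.2]},
}

def predict_y(s, y_past, x_window):
    # y_t-hat = sum_i a_i y_{t-i} + sum_j b_j x_{t-j}: linear in the past two Y's
    # and in the current X plus past two X's; the nonlinearity is the switch s.
    m = models[s]
    return (sum(a * y for a, y in zip(m["a"], y_past)) +
            sum(b * x for b, x in zip(m["b"], x_window)))

yhat = predict_y(0, y_past=[1.0, 0.5], x_window=[0.2, 0.8, 0.1])
```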

Page 13: Dynamic Bayesian Networks for Multimodal Interaction


Hidden ARMA Features:•Model skin color as mixture of RGB Gaussians•Track person as mixture of spatial Gaussians

•But, want to predict only Y from X… Be discriminative•Use maximum conditional likelihood (CEM)

Page 14: Dynamic Bayesian Networks for Multimodal Interaction


Conditional EM

EM:
divide & conquer

CEM:
discriminative divide & conquer

[Figure: EM vs. CEM fits of a mixture; the EM solution attains joint likelihood l = -8.0 but conditional likelihood l_c = -1.7, while the CEM solution attains l = -54.7 but l_c = +0.4]

•Only need the conditional p(Y_V | X_V, θ)? •Then maximize conditional likelihood:

max_θ { log Σ_{X_H} p(X_H, Y_V, X_V | θ) - log Σ_{X_H} p(X_H, X_V | θ) }
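The conditional objective log p(Y|X) = log p(X,Y) - log p(X) can be evaluated on a toy discrete joint (no hidden variables for simplicity; the table values are illustrative):

```python
import math

# Toy discrete joint p(x, y | theta) as a table; the numbers are illustrative.
p_xy = {(0, 0): 0.30, (0, 1): 0.10, (1, 0): 0.15, (1, 1): 0.45}

def cond_loglik(pairs):
    # CEM objective shape from the slide: sum_n [ log p(x_n, y_n) - log p(x_n) ],
    # i.e. log of the joint minus log of the marginal over the predicted variable.
    total = 0.0
    for x, y in pairs:
        log_joint = math.log(p_xy[(x, y)])
        log_marg = math.log(sum(p_xy[(x, yy)] for yy in (0, 1)))
        total += log_joint - log_marg
    return total

data = [(0, 0), (1, 1), (1, 1)]
score = cond_loglik(data)
```

Maximizing `score` over the table entries would reward only the mapping from x to y, not how well x itself is modeled, which is exactly the discriminative trade-off shown in the figure above.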

Page 15: Dynamic Bayesian Networks for Multimodal Interaction


Conditional EM

[Figure: CEM vs. EM on p(c|x,y); CEM accuracy = 100%, EM accuracy = 51%; panels show the EM fit of p(y|x) and the CEM fit of p(y|x)]

Page 16: Dynamic Bayesian Networks for Multimodal Interaction


Conditional EM for hidden ARMA

Nearest Neighbor: 1.57% RMS
Constant Velocity: 0.85% RMS
Hidden ARMA: 0.64% RMS

2 users gesture to each other for a few minutes
Model: mixture of 25 Gaussians, STM: T=120, Dims=22+15

Estimate Prediction Discriminatively/Conditionally p(future|past)

Page 17: Dynamic Bayesian Networks for Multimodal Interaction


Hidden ARMA on Gesture

[Figure: gesture predictions; SCARE, WAVE, CLAP]

Page 18: Dynamic Bayesian Networks for Multimodal Interaction


DBN: Input-Output HMM

-Sony Picturebook laptop
-2 cameras (7 Hz) (USB & analog)
-2 microphones (USB & analog)
-100 MB per hour ($10/GB)

•Similarly, learn a person's responses to audio-video stimuli to predict Y (the agent A) from X (the world W)•A wearable collects the audio & video A, W

Page 19: Dynamic Bayesian Networks for Multimodal Interaction


DBN: Input-Output HMM

log p(A | W) = log p(A, W) - log p(W)

[Figure: input-output HMM; state chain s0…s3 with world inputs w0…w3 and agent outputs a0…a3]

•Consider simulating the agent given the world•A hidden Markov model on its own is insufficient since it does not distinguish between the input role the world plays and the output we need to generate•Instead, form an input-output HMM•One IOHMM predicts the agent's audio using all 3 past channels•One IOHMM predicts the agent's video•Use CEM to learn the IOHMMs discriminatively

Page 20: Dynamic Bayesian Networks for Multimodal Interaction


Input-Output HMM Data

Video: -Histogram lighting correction
-RGB mixture of Gaussians to detect skin
-Face: 2000 pixels at 7 Hz (X, Y, Intensity)

Audio: -Hamming window, FFT, equalization
-Spectrograms at 60 Hz
-200 bands (Amplitude, Frequency)

Very noisy data set!

Page 21: Dynamic Bayesian Networks for Multimodal Interaction


Video Representation
- Principal Components Analysis: linear vectors in Euclidean space
- Images, spectrograms, time series as vectors
- But vectorization is bad: it ignores nonlinear structure
- Images = collections of (X, Y, I) tuples ("pixels")
- Spectrograms = collections of (A, F) tuples… therefore…
- Corresponded Principal Components Analysis (CPCA)

PCA: min Σ_{i=1}^T Σ_{d=1}^D ( X_n^{id} - Σ_{m=1}^K c_n^{im} V^{md} )²

CPCA: min_M Σ_{i=1}^T Σ_{d=1}^D ( Σ_{j=1}^T M_n^{ij} X_n^{jd} - Σ_{m=1}^K c_n^{im} V^{md} )²

where the M_n are soft permutation matrices

Page 22: Dynamic Bayesian Networks for Multimodal Interaction


Video Representation

[Figure: Original vs. PCA vs. CPCA reconstructions; 2000 XYI pixels compressed to 20 dims]

Page 23: Dynamic Bayesian Networks for Multimodal Interaction


Input-Output HMM

For agent and world:
1 loudness scalar
20 spectrogram coefficients
20 face coefficients

Estimate the hidden trellis from partial data

Page 24: Dynamic Bayesian Networks for Multimodal Interaction


Input-Output HMM with CEM

log p(A | W) = log p(A, W) - log p(W)

Conditionally model:
p(Agent Audio | World Audio, World Video)
p(Agent Video | World Audio, World Video)

Don't care how well we can model world audio and video, just as long as we can map it to agent audio or agent video. Avoids temporal scale problems too (Video 5 Hz, Audio 60 Hz)

CEM: 60-state 82-dim HMM
Diagonal Gaussian emissions
90,000 samples train / 36,000 test

Audio IOHMM: [Figure: state chain s0…s3 with world inputs w0…w3 and agent audio outputs a0…a3; a second chain s0…s3 over world inputs w0…w3 for the video IOHMM]

Page 25: Dynamic Bayesian Networks for Multimodal Interaction


Input-Output HMM with CEM

           EM (red)   CEM (blue)
Audio        99.61       100.58
Video      -122.46     -121.26

(joint likelihood training vs. conditional likelihood training)

Spectrograms from eigenspace

KD-tree on video coefficients to the closest image in training (point cloud too confusing)

TRAINING & TESTING

RESYNTHESIS

Page 26: Dynamic Bayesian Networks for Multimodal Interaction


Input-Output HMM Results

Test

Train

Page 27: Dynamic Bayesian Networks for Multimodal Interaction


Intractable Dynamic Bayes Nets

Factorial Hidden Markov Model: Interaction Through Output
[Figure: three hidden state chains s^1, s^2, s^3 jointly emitting one output sequence y0…y3]

Coupled Hidden Markov Model: Interaction Through Hidden States
[Figure: two hidden state chains s^1, s^2 coupled across chains, each emitting its own output sequence y^1_0…y^1_3 and y^2_0…y^2_3]

Page 28: Dynamic Bayesian Networks for Multimodal Interaction


Intractable DBNs: Generalized EM•As before, we use a bound on the likelihood:

•But the best q over hidden variables that minimizes the KL is intractable!•Thus, restrict q to explore only factorized distributions•EM still converges under partial E-steps & partial M-steps

[Figure: the restricted bound L(q, θ) against the log-likelihood l(θ)]

log Σ_{X_H} p(X_H, X_V | θ) = L(q, θ) + KL( q(X_H) || p(X_H | X_V, θ) ) ≥ L(q, θ)

E-step (restricted): q^{t+1} = argmax_{q ∈ FACTORIZED} L(q, θ^t)

M-step: θ^{t+1} = argmax_θ L(q^{t+1}, θ)

Page 29: Dynamic Bayesian Networks for Multimodal Interaction


Intractable DBNs Variational EM

Factorial Hidden Markov Model
[Figure: the three state chains s^1, s^2, s^3 with the coupling removed in q]

Coupled Hidden Markov Model
[Figure: the two state chains s^1, s^2 decoupled in q]

•Now, the q distributions are limited to chains•Tractable as an iterative method•Also known as variational EM or structured mean field
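A fully factorized variant of this idea (plain mean field rather than the structured, chain-shaped q used here) fits in a few lines; for two coupled binary variables with made-up coupling weights, coordinate updates on q monotonically shrink KL(q || p):

```python
import math

# Coupled p(s1, s2) over binary states via made-up log-potentials.
logp = {(a, b): 1.5 * a * b + 0.2 * a - 0.3 * b for a in (0, 1) for b in (0, 1)}
logZ = math.log(sum(math.exp(v) for v in logp.values()))

def kl(q1, q2):
    # KL( q1(s1) q2(s2) || p(s1, s2) ): the quantity the restricted E-step
    # minimizes when q is confined to the factorized family.
    return sum(q1[a] * q2[b] * (math.log(q1[a] * q2[b]) - (logp[(a, b)] - logZ))
               for a in (0, 1) for b in (0, 1))

q1, q2 = [0.5, 0.5], [0.5, 0.5]
kl_start = kl(q1, q2)
for _ in range(50):
    # Coordinate update: q1(a) proportional to exp( E_{q2} log p~(a, s2) ), then q2.
    u = [math.exp(sum(q2[b] * logp[(a, b)] for b in (0, 1))) for a in (0, 1)]
    q1 = [x / sum(u) for x in u]
    u = [math.exp(sum(q1[a] * logp[(a, b)] for a in (0, 1))) for b in (0, 1)]
    q2 = [x / sum(u) for x in u]
kl_end = kl(q1, q2)
```

Each coordinate update can only decrease the KL, which is the same partial-E-step guarantee that makes generalized EM converge.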

Page 30: Dynamic Bayesian Networks for Multimodal Interaction


•How to handle more people and a hierarchy of coupling?•DSTs consider hierarchical coupling, e.g. in a university: students -> department -> school -> university

Dynamical System Trees

Interaction Through Aggregated Community State

[Figure: a dynamical system tree unrolled over 2 time steps; four leaf state chains s^1…s^4 with their emissions, aggregated by internal states s^{1,2} and s^{3,4} under a root state s^0]

Internal nodes are states. Leaf nodes are emissions. Any subtree is also a DST. The DST above is unrolled over 2 time steps.

Page 31: Dynamic Bayesian Networks for Multimodal Interaction


•Also apply the generalization of EM and do variational structured mean field for the q distribution.•Becomes formulaic for any DST topology!•Code available at http://www.cs.columbia.edu/~jebara/dst

Dynamical System Trees

[Figure: the DST again with its decoupled variational q; leaf chains s^1…s^4 and emissions x^1…x^4 over 2 time steps]

Page 32: Dynamic Bayesian Networks for Multimodal Interaction


DSTs and Generalized EM

[Figure: one slice of the DST under structured mean field; decoupled subchains, each with its own variational parameters]

Structured Mean Field:
Use a tractable distribution Q to approximate P
Introduce variational parameters
Find min KL(Q||P) by alternating between introducing variational parameters and running inference on each subchain

Page 33: Dynamic Bayesian Networks for Multimodal Interaction


DSTs for American Football

Initial frame of a typical play

Trajectories of players

Page 34: Dynamic Bayesian Networks for Multimodal Interaction


DSTs for American Football

~20 time series of two types of plays (wham and digs)
Likelihood ratio of models used as classifier
DST1 puts all players into 1 game state
DST2 combines players into two teams and then into a game

Page 35: Dynamic Bayesian Networks for Multimodal Interaction


DSTs for Gene Networks•Time series of cell cycle•Hundreds of gene expression levels over time•Use given hierarchical clustering•DST with hierarchical clustering structure gives best test likelihood

Page 36: Dynamic Bayesian Networks for Multimodal Interaction


Robotic Surgery, Haptics & Video•da Vinci Laparoscopic Robot•Used in hundreds of hospitals•Surgeon works at a console•Robot mimics the movements on the (local) patient•Captures all actuator/robot data as a 300 Hz time series•Multi-channel video from cameras inside the patient

Page 37: Dynamic Bayesian Networks for Multimodal Interaction


Robotic Surgery, Haptics & Video

Page 38: Dynamic Bayesian Networks for Multimodal Interaction


Robotic Surgery, Haptics & Video

Suturing

Expert Novice

64-dimensional time series @ 300 Hz
Console and actuator parameters

Page 39: Dynamic Bayesian Networks for Multimodal Interaction


•Compress haptic & video data with PCA to 60 dims.•Collected data from novices and experts and built several DBNs (IOHMMs, DSTs, etc.) of expert and novice for 3 different drills (6 models total).•Preliminary results:

Minefield, Russian Roulette, Suture

Robotic Surgical Drills Results

Page 40: Dynamic Bayesian Networks for Multimodal Interaction


Conclusion•Dynamic Bayesian networks are a natural upgrade to HMMs.•Relevant for structured, multi-modal and multi-person temporal data.•Several examples of dynamic Bayesian networks for:

audio, video and haptic channels
single, two-person and multi-person activity

•DBNs: HMMs, Kalman filters, hidden ARMA, input-output HMMs.•Use maximum likelihood (EM) or maximum conditional likelihood (CEM).•Intractable DBNs: switched Kalman filters, dynamical system trees.•Use minimum free energy (GEM) and structured mean field.•Examples of applications:

gesture interaction (gesture games)
audio-video interaction (social conversation)
multi-person game playing (American football)
haptic-video interaction (robotic laparoscopy)

•Funding provided in part by the National Science Foundation, the Central Intelligence Agency, Alphastar and Microsoft.