Tony Jebara, Columbia University
Dynamic Bayesian Networks for Multimodal Interaction
Tony Jebara, Machine Learning Lab, Columbia University
joint work with A. Howard and N. Gu
Outline
•Introduction: Multi-Modal and Multi-Person
•Bayesian Networks and the Junction Tree Algorithm
•Maximum Likelihood and Expectation Maximization
•Dynamic Bayesian Networks (HMMs, Kalman Filters)
•Hidden ARMA Models
•Maximum Conditional Likelihood and Conditional EM
•Two-Person Visual Interaction (Gesture Games)
•Input-Output Hidden Markov Models
•Audio-Visual Interaction (Conversation)
•Intractable DBNs, Minimum Free Energy, Generalized EM
•Dynamical System Trees
•Multi-Person Visual Interaction (Football Plays)
•Haptic-Visual Modeling (Surgical Drills)
•Ongoing Directions
Introduction
•Simplest dynamical systems (a single Markovian process):
•Hidden Markov Model and Kalman Filter
•But multi-modal data (audio, video and haptics) have:
•Processes at different time scales
•Processes at different amplitude scales
•Processes with different noise characteristics
•Also, multi-person data (multi-limb, two-person, group) are:
•Weakly coupled
•Conditionally dependent
•Dangerous to slam all the data into one single time series:
•Find new ways to zipper together multiple interacting processes
Bayesian Networks
•Also called graphical models
•Marry graph theory & statistics
•Directed graph which efficiently encodes a large p(x1,…,xN) as a product of conditionals of each node given its parents
•Avoids storing a huge hypercube over all variables x1,…,xN
•Here, the xi are discrete (multinomial) or continuous (Gaussian)
•Split BNs over sets of hidden XH and observed XV variables
•Three basic operations for BNs:
1) Infer marginals/conditionals of hidden variables (JTA)
2) Compute likelihood of data (JTA)
3) Maximize likelihood of the data (EM)
[Figure: directed graph over nodes x1,…,x6]

$p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid x_{\pi_i})$

Joint: $p(X_V, X_H \mid \theta)$
Posterior: $p(X_H \mid X_V, \theta)$
Likelihood: $\sum_{X_H} p(X_H, X_V \mid \theta)$
Maximum likelihood: $\max_\theta \sum_{X_H} p(X_H, X_V \mid \theta)$
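As a sketch of this factorization, a tiny three-node chain with invented probability tables (illustrative only, not from the talk) shows how the joint is a product of conditionals and how a marginal is obtained by summing out the other variables:

```python
import numpy as np

# Hypothetical network x1 -> x2 -> x3 over binary variables:
# p(x1, x2, x3) = p(x1) p(x2|x1) p(x3|x2)
p_x1 = np.array([0.6, 0.4])
p_x2_given_x1 = np.array([[0.7, 0.3],   # rows: x1, cols: x2
                          [0.2, 0.8]])
p_x3_given_x2 = np.array([[0.9, 0.1],   # rows: x2, cols: x3
                          [0.5, 0.5]])

def joint(x1, x2, x3):
    """Joint as a product of conditionals of each node given its parents."""
    return p_x1[x1] * p_x2_given_x1[x1, x2] * p_x3_given_x2[x2, x3]

# Marginal p(x3): sum the other variables out of the joint
p_x3 = np.zeros(2)
for x1 in range(2):
    for x2 in range(2):
        for x3 in range(2):
            p_x3[x3] += joint(x1, x2, x3)
```

Storing the three small conditionals here replaces the full 2x2x2 hypercube, which is the saving the slide describes.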
Bayes Nets to Junction Trees
[Figure: junction tree construction over x1,…,x6: 1) Bayes Net, 2) Moral Graph, 3) Triangulated, 4) Junction tree with cliques {x1 x2 x3}, {x2 x3 x5}, {x2 x5 x6}, {x2 x4} joined by separators {x2 x3}, {x2 x5}, {x2}]
•Workhorse of BNs is Junction Tree Algorithm
Junction Tree Algorithm
$V = \{A, B\}$, $S = \{B\}$, $W = \{B, C\}$

If the cliques agree: $\sum_{V \setminus S} \psi_V = \phi_S = \sum_{W \setminus S} \psi_W = p(S)$. Else:

Send message from V to W:
$\phi_S^* = \sum_{V \setminus S} \psi_V, \qquad \psi_W^* = \frac{\phi_S^*}{\phi_S} \psi_W$

Send message from W to V:
$\phi_S^{**} = \sum_{W \setminus S} \psi_W^*, \qquad \psi_V^{**} = \frac{\phi_S^{**}}{\phi_S^*} \psi_V$

Then, cliques agree.
•The JTA sends messages from cliques through separators (these are just tables or potential functions)
•Ensures that the various tables in the junction tree agree/are consistent over shared variables (via marginals)
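The two-message exchange above can be sketched numerically. The clique tables below are invented for illustration; the point is that after one message in each direction the two cliques marginalize to the same separator table:

```python
import numpy as np

# Clique potentials over V = {A,B} and W = {B,C}; separator S = {B}.
# Tables are illustrative, not from the talk.
psi_V = np.array([[0.5, 0.1],   # psi_V[a, b]
                  [0.2, 0.2]])
psi_W = np.array([[0.3, 0.7],   # psi_W[b, c]
                  [0.6, 0.4]])
phi_S = np.ones(2)              # separator potential, initialized to 1

# Message V -> W: marginalize V onto S, then rescale W's table
phi_S_star = psi_V.sum(axis=0)                      # sum over a
psi_W = (phi_S_star / phi_S)[:, None] * psi_W

# Message W -> V: marginalize W onto S, then rescale V's table
phi_S_2star = psi_W.sum(axis=1)                     # sum over c
psi_V = psi_V * (phi_S_2star / phi_S_star)[None, :]

# The cliques now agree on the shared variable B
agree = np.allclose(psi_V.sum(axis=0), psi_W.sum(axis=1))
```

On a tree of cliques, repeating this collect/distribute pattern leaves every table consistent with its neighbors, which is the JTA guarantee stated on the next slide.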
Junction Tree Algorithm
•On trees, JTA is guaranteed: 1) Init 2) Collect 3) Distribute
Ends with potentials as marginals or conditionals of hidden variables given data:
p(Xh1|Xv), p(Xh2|Xv), p(Xh1, Xh2|Xv)
And the likelihood p(Xv) is the potential normalizer
Maximum Likelihood with EM
•We wish to maximize the likelihood over $\theta$ for learning:
$\max_\theta \sum_{X_H} p(X_H, X_V \mid \theta)$
•EM instead iteratively maximizes a lower bound on the log-likelihood:
$\log \sum_{X_H} p(X_H, X_V \mid \theta) \geq \mathcal{L}(q, \theta)$, with gap $KL\big(q(X_H) \,\|\, p(X_H \mid X_V, \theta)\big)$
•E-step: $q^{t+1} = \arg\max_q \mathcal{L}(q, \theta^t) = p(X_H \mid X_V, \theta^t)$
•M-step: $\theta^{t+1} = \arg\max_\theta \mathcal{L}(q^{t+1}, \theta)$
[Figure: the bound $\mathcal{L}(q, \theta)$ touching the log-likelihood $l(\theta)$ at the current $q(z)$]
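The E-step/M-step alternation above can be sketched on a toy mixture of two unit-variance Gaussians, where the hidden variable is the component label. All numbers below (data, initialization) are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy observed data X_V drawn from two Gaussians; the component is hidden X_H
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])

mu = np.array([-1.0, 1.0])   # initial means
pi = np.array([0.5, 0.5])    # initial mixing weights
for _ in range(50):
    # E-step: q(X_H) = p(X_H | X_V, theta_t), the posterior responsibilities
    logp = -0.5 * (x[:, None] - mu) ** 2 + np.log(pi)
    q = np.exp(logp - logp.max(axis=1, keepdims=True))
    q /= q.sum(axis=1, keepdims=True)
    # M-step: maximize the bound L(q, theta) over theta in closed form
    pi = q.mean(axis=0)
    mu = (q * x[:, None]).sum(axis=0) / q.sum(axis=0)
```

Each iteration raises the bound, so the data log-likelihood never decreases; the recovered means approach the generating values.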
Dynamic Bayes Nets
•Dynamic Bayesian Networks are BNs unrolled in time
•Simplest and most classical examples are:

Hidden Markov Model
State Transition Model: $P(s_{t+1} = i \mid s_t = j) = \Phi(i, j)$, with $P(s_0 = i) = \pi(i)$
Emission Model: $P(y_t = i \mid s_t = j) = \Theta(i, j)$, or $P(y_t \mid s_t = i) = N(y_t \mid \mu_i, \Sigma_i)$

Linear Dynamical System
State Transition Model: $P(x_{t+1} \mid x_t) = N(x_{t+1} \mid A x_t, Q)$, with $P(x_0) = N(x_0 \mid \mu_0, Q_0)$
Emission Model: $P(y_t \mid x_t) = N(y_t \mid C x_t, R)$

[Figure: HMM chain $s_0 \ldots s_3$ emitting $y_0 \ldots y_3$; LDS chain $x_0 \ldots x_3$ emitting $y_0 \ldots y_3$]
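Under the HMM parameterization above, the likelihood of an observed sequence is computed by the forward recursion. The 2-state model below is hypothetical (numbers invented for illustration):

```python
import numpy as np

# Hypothetical 2-state discrete HMM in the slide's notation:
# Phi(i,j) = P(s_{t+1}=i | s_t=j), pi(i) = P(s_0=i), Theta(y,i) = P(y_t=y | s_t=i)
Phi = np.array([[0.9, 0.2],
                [0.1, 0.8]])       # columns sum to 1
pi0 = np.array([0.5, 0.5])
Theta = np.array([[0.7, 0.1],
                  [0.3, 0.9]])     # rows index y, columns index state

def log_likelihood(ys):
    """Scaled forward algorithm: log p(y_0, ..., y_T)."""
    alpha = pi0 * Theta[ys[0]]          # joint over s_0 and y_0
    ll = np.log(alpha.sum())
    alpha /= alpha.sum()
    for y in ys[1:]:
        alpha = Theta[y] * (Phi @ alpha)  # propagate state, weight by emission
        ll += np.log(alpha.sum())
        alpha /= alpha.sum()
    return ll
```

The per-step normalization keeps the recursion numerically stable on long sequences while accumulating the exact log-likelihood.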
Two-Person Interaction
•Learn from two interacting people (person Y and person X) to mimic the interaction via a simulated person Y.
Interact with a single user via p(y|x); learn from two users to get p(y|x)
•One hidden Markov model for each user… no coupling!
•One time series for both users… too rigid!
[Figure: two separate HMM chains, one per user, vs. a single HMM chain over the concatenated series]
DBN: Hidden ARMA Model
Learn to imitate behavior by watching a teacher exhibit it.
E.g. unsupervised observation of 2-agent interaction
E.g. track lip motion
Discover correlations between past action & subsequent reaction
Estimate p(Y | past X, past Y)
[Figure: interacting time series of person X and person Y]
DBN: Hidden ARMA Model
[Figure: switch states $s_0 \ldots s_6$ selecting among ARMA models for person Y's series $y_0 \ldots y_6$, driven also by the input series $x_0 \ldots x_6$]
•Focus on predicting person Y from the past of both X and Y
•Have multiple linear models mapping the past to the future
•Use a window for the moving average (compressed with PCA)
•But select among them using S (nonlinear)
•Here, we show only a 2nd-order moving average to predict the next Y given the past two Y's, the past two X's, the current X and a random choice of ARMA linear model
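The prediction step of such a switching model can be sketched as a discrete switch choosing one linear map from the windowed past to the next Y. The model names and coefficients below are invented for illustration:

```python
import numpy as np

# Minimal sketch of a 2nd-order switching ARMA prediction: the hidden switch s
# picks one linear model mapping (past two y's, current and past two x's) to
# the next y. Coefficients are illustrative, not learned values from the talk.
models = {
    0: {"a": np.array([0.6, 0.3]),          # weights on y_{t-1}, y_{t-2}
        "b": np.array([0.4, 0.1, 0.05])},   # weights on x_t, x_{t-1}, x_{t-2}
    1: {"a": np.array([-0.2, 0.9]),
        "b": np.array([0.0, 0.5, 0.2])},
}

def predict_y(s, y_past, x_window):
    """Predict the next y under the ARMA model selected by switch state s."""
    m = models[s]
    return m["a"] @ y_past + m["b"] @ x_window

y_next = predict_y(0, np.array([1.0, 0.5]), np.array([0.2, 0.1, 0.0]))
```

In the full model the switch itself evolves as a Markov chain, so the nonlinear choice of regime is inferred jointly with the linear predictions.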
Hidden ARMA Features:
•Model skin color as a mixture of RGB Gaussians
•Track the person as a mixture of spatial Gaussians
•But we want to predict only Y from X… be discriminative
•Use maximum conditional likelihood (CEM)
Conditional EM
EM: divide & conquer
CEM: discriminative divide & conquer
[Figure: toy fits where the EM solution attains $l = -8.0$, $l_c = -1.7$ while the CEM solution attains $l = -54.7$, $l_c = +0.4$]
•Only need a conditional $p(Y_V \mid X_V, \theta)$?
•Then maximize conditional likelihood:
$\max_\theta \; \log \sum_{X_H} p(X_H, Y_V, X_V \mid \theta) - \log \sum_{X_H} p(X_H, X_V \mid \theta)$
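The objective above is just a difference of two log-marginals. A toy numeric check with an invented joint table over (hidden h, input x, output y) makes this concrete:

```python
import numpy as np

# Illustrative joint p(h, x, y) over binary variables (numbers invented).
# log p(y | x) = log sum_h p(h, x, y) - log sum_h,y' p(h, x, y')
p = np.array([[[0.10, 0.05],    # p[h, x, y]
               [0.20, 0.15]],
              [[0.05, 0.25],
               [0.10, 0.10]]])

x, y = 1, 0
# Numerator: marginalize the hidden variable with y clamped
num = p[:, x, y].sum()
# Denominator: marginalize the hidden variable and the output
den = p[:, x, :].sum()
log_cond = np.log(num) - np.log(den)
```

CEM climbs this difference directly, which is why it can trade joint likelihood (the $l$ values in the figure) for conditional accuracy (the $l_c$ values).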
Conditional EM
CEM vs. EM on p(c|x,y): CEM accuracy = 100%, EM accuracy = 51%
[Figure: regression fits of EM p(y|x) vs. CEM p(y|x)]
Conditional EM for hidden ARMA
2 users gesture to each other for a few minutes
Model: mixture of 25 Gaussians, STM: T=120, Dims=22+15
Estimate the prediction discriminatively/conditionally: p(future|past)
Nearest Neighbor: 1.57% RMS
Constant Velocity: 0.85% RMS
Hidden ARMA: 0.64% RMS
Hidden ARMA on Gesture
[Figure: gesture sequences: SCARE, WAVE, CLAP]
DBN: Input-Output HMM
•Similarly, learn a person's response to audio-video stimuli to predict Y (or agent A) from X (or world W)
•Wearable collects audio & video A, W:
-Sony Picturebook laptop
-2 cameras (7 Hz) (USB & analog)
-2 microphones (USB & analog)
-100 MB per hour ($10/GB)
DBN: Input-Output HMM
$\log p(A \mid W) = \log p(A, W) - \log p(W)$
•Consider simulating the agent given the world
•A hidden Markov model on its own is insufficient since it does not distinguish between the input role the world plays and the output we need to generate
•Instead, form an input-output HMM
•One IOHMM predicts the agent's audio using all 3 past channels
•One IOHMM predicts the agent's video
•Use CEM to learn the IOHMM discriminatively
[Figure: a plain HMM over the agent's outputs vs. an IOHMM whose states $s_0 \ldots s_3$ condition on world inputs $w_0 \ldots w_3$ and emit agent outputs $a_0 \ldots a_3$]
Input-Output HMM Data
Video: -Histogram lighting correction
-RGB mixture of Gaussians to detect skin
-Face: 2000 pixels at 7 Hz (X, Y, Intensity)
Audio: -Hamming window, FFT, equalization
-Spectrograms at 60 Hz
-200 bands (Amplitude, Frequency)
Very noisy data set!
Video Representation
-Principal Components Analysis: linear vectors in Euclidean space
-Images, spectrograms, time-series vectors
-Vectorization is bad, nonlinear:
-Images = collections of (X,Y,I) tuples "pixels"
-Spectrograms = collections of (A,F) tuples
…therefore…
-Corresponded Principal Components Analysis

PCA minimizes reconstruction error over the vectorized tuples:
$\min \sum_{i=1}^{T} \sum_{d=1}^{D} \Big( X_n^{id} - \sum_{m=1}^{K} c_n^m V_m^{id} \Big)^2$

CPCA first corresponds tuples across frames via soft permutation matrices $M$:
$\min \sum_{i=1}^{T} \sum_{j=1}^{T} M_n^{ij} \sum_{d=1}^{D} \Big( X_n^{id} - \sum_{m=1}^{K} c_n^m V_m^{jd} \Big)^2$

$M$ are soft permutation matrices
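The plain-PCA baseline the slide contrasts with CPCA can be sketched via the SVD; the data here are random stand-ins for vectorized frames (CPCA would additionally learn the soft permutation matrices M before projecting):

```python
import numpy as np

# Sketch of PCA compression of vectorized frames to K coefficients.
rng = np.random.default_rng(1)
frames = rng.normal(size=(50, 30))   # 50 frames, 30-dim vectorized tuples
K = 5

# Center, take the top-K right singular vectors as the basis V
Xc = frames - frames.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
coeffs = Xc @ Vt[:K].T               # K coefficients c per frame
recon = coeffs @ Vt[:K] + frames.mean(axis=0)
err = np.mean((frames - recon) ** 2)
```

This is the "compress to 20 dims" step applied to 2000-pixel faces in the talk; CPCA's extra correspondence step lets the same K capture far more structure when the tuples are unordered.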
Video Representation
[Figure: Original vs. PCA vs. CPCA reconstructions] 2000 XYI pixels: compress to 20 dims
Input-Output HMM
For agent and world: 1 loudness scalar, 20 spectrogram coefficients, 20 face coefficients
Estimate the hidden trellis from partial data
Input-Output HMM with CEM
$\log p(A \mid W) = \log p(A, W) - \log p(W)$
Conditionally model:
p(Agent Audio | World Audio, World Video)
p(Agent Video | World Audio, World Video)
Don't care how well we can model world audio and video, just as long as we can map it to agent audio or agent video
Avoids temporal scale problems too (Video 5 Hz, Audio 60 Hz)
CEM: 60-state, 82-dim HMM with diagonal Gaussian emissions
90,000 samples train / 36,000 test
[Figure: audio IOHMM with states $s_0 \ldots s_3$, world inputs $w_0 \ldots w_3$ and agent outputs $a_0 \ldots a_3$]
Input-Output HMM with CEM

TRAINING & TESTING (joint likelihood vs. conditional likelihood):
         EM (red)   CEM (blue)
Audio      99.61       100.58
Video    -122.46      -121.26

RESYNTHESIS:
Spectrograms from eigenspace
KD-tree on video coefficients to the closest image in training (point-cloud too confusing)
Input-Output HMM Results
[Figure: resynthesized agent responses on Train and Test sequences]
Intractable Dynamic Bayes Nets
[Figure: Factorial Hidden Markov Model: three hidden chains $s^1, s^2, s^3$ jointly emitting one observation series $y_0 \ldots y_3$ (interaction through output)]
[Figure: Coupled Hidden Markov Model: two hidden chains with cross-coupled transitions, each emitting its own series $y^1, y^2$ (interaction through hidden states)]
Intractable DBNs: Generalized EM
•As before, we use a bound on the likelihood:
$\log \sum_{X_H} p(X_H, X_V \mid \theta) \geq \mathcal{L}(q, \theta)$, with gap $KL\big(q(X_H) \,\|\, p(X_H \mid X_V, \theta)\big)$
•But the best q over hidden vars that minimizes the KL is intractable!
•Thus, restrict q to only explore factorized distributions:
E-step: $q^{t+1} = \arg\max_{q \in \mathrm{FACTORIZED}} \mathcal{L}(q, \theta^t)$
M-step: $\theta^{t+1} = \arg\max_\theta \mathcal{L}(q^{t+1}, \theta)$
•EM still converges under partial E-steps & partial M-steps
[Figure: the restricted bound $\mathcal{L}(q, \theta)$ lying below the log-likelihood $l(\theta)$]
Intractable DBNs: Variational EM
[Figure: Factorial HMM and Coupled HMM with q restricted to independent chains over $s^1, s^2, s^3$]
•Now, the q distributions are limited to be chains
•Tractable as an iterative method
•Also known as variational EM or structured mean field
Dynamical System Trees
•How to handle more people and a hierarchy of coupling?
•DSTs consider hierarchical coupling, e.g. a university's staff: students -> department -> school -> university
Internal nodes are states. Leaf nodes are emissions. Any subtree is also a DST.
[Figure: a DST unrolled over 2 time steps: leaf chains $s^1 \ldots s^4$ emit $x^1 \ldots x^4$ (through $y$), aggregate into community states $s^{1,2}$ and $s^{3,4}$, which couple through a root state $s$ (interaction through aggregated community state)]
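The recursive structure (internal nodes are states, leaves emit, any subtree is itself a DST) can be sketched as a small tree type; the node names below are illustrative, echoing the football hierarchy used later:

```python
# Minimal sketch of a DST topology: internal nodes carry aggregated states,
# leaves carry emitting chains, and any subtree is itself a DST.
class DSTNode:
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []   # empty list => leaf (emitting chain)

    def leaves(self):
        """Collect the emitting leaf chains of this (sub)tree."""
        if not self.children:
            return [self.name]
        return [l for c in self.children for l in c.leaves()]

# Illustrative hierarchy: players -> teams -> game state
game = DSTNode("game", [
    DSTNode("team1", [DSTNode("p1"), DSTNode("p2")]),
    DSTNode("team2", [DSTNode("p3"), DSTNode("p4")]),
])
```

Because every subtree is a valid DST, inference and learning recurse over this structure uniformly, which is what makes the variational updates "formulaic for any topology" on the next slide.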
Dynamical System Trees
•Also apply the generalization of EM and do variational structured mean field for the q distribution.
•Becomes formulaic for any DST topology!
•Code available at http://www.cs.columbia.edu/~jebara/dst
[Figure: the DST unrolled over 2 time steps]
DSTs and Generalized EM
Structured Mean Field:
•Use a tractable distribution Q to approximate P
•Introduce variational parameters
•Find min KL(Q||P)
•Alternate: introduce variational parameters at each node, then run inference on each decoupled chain
[Figure: a DST with variational parameters introduced at the root and community states, decoupling the leaf chains for inference]
DSTs for American Football
Initial frame of a typical play
Trajectories of players
DSTs for American Football
~20 time series of two types of plays (wham and digs)
Likelihood ratio of the two models used as classifier
DST1 puts all players into 1 game state
DST2 combines players into two teams and then into the game
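The likelihood-ratio classification described above can be sketched with stand-in scoring functions; a real system would use the log-likelihoods of the two trained DSTs, and the functions and threshold here are invented for illustration:

```python
import numpy as np

# Sketch of likelihood-ratio classification: one model per play type,
# label by whichever model assigns the trajectory higher log-likelihood.
def log_lik_wham(traj):
    return -np.sum((traj - 1.0) ** 2)   # hypothetical "wham" model score

def log_lik_dig(traj):
    return -np.sum((traj + 1.0) ** 2)   # hypothetical "dig" model score

def classify(traj):
    """Decide via the sign of the log-likelihood ratio."""
    return "wham" if log_lik_wham(traj) > log_lik_dig(traj) else "dig"

label = classify(np.array([0.9, 1.1, 0.8]))
```

Comparing DST1 (one game state) with DST2 (team-level states) then amounts to asking which topology's likelihoods separate the two play types better.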
DSTs for Gene Networks
•Time series of the cell cycle
•Hundreds of gene expression levels over time
•Use a given hierarchical clustering
•DST with the hierarchical clustering structure gives the best test likelihood
Robotic Surgery, Haptics & Video
•Davinci laparoscopic robot
•Used in hundreds of hospitals
•Surgeon works at a console
•Robot mimics the movement on the (local) patient
•Captures all actuator/robot data as 300 Hz time series
•Multi-channel video from cameras inside the patient
Robotic Surgery, Haptics & Video
Robotic Surgery, Haptics & Video
Suturing: Expert vs. Novice
64-dimensional time series @ 300 Hz (console and actuator parameters)
Robotic Surgical Drills Results
•Compress haptic & video data with PCA to 60 dims.
•Collected data from novices and experts and built several DBNs (IOHMMs, DSTs, etc.) of expert and novice for 3 different drills (6 models total).
•Preliminary results on the Minefield, Russian Roulette and Suture drills:
Conclusion
•Dynamic Bayesian networks are a natural upgrade to HMMs.
•Relevant for structured, multi-modal and multi-person temporal data.
•Several examples of dynamic Bayesian networks for:
audio, video and haptic channels
single, two-person and multi-person activity.
•DBNs: HMMs, Kalman filters, hidden ARMA, input-output HMMs.
•Use max likelihood (EM) or max conditional likelihood (CEM).
•Intractable DBNs: switched Kalman filters, dynamical system trees.
•Use minimum free energy (GEM) and structured mean field.
•Examples of applications:
gesture interaction (gesture games)
audio-video interaction (social conversation)
multi-person game playing (American football)
haptic-video interaction (robotic laparoscopy).
•Funding provided in part by the National Science Foundation, the Central Intelligence Agency, Alphastar and Microsoft.