0162-8828 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2019.2911937, IEEE Transactions on Pattern Analysis and Machine Intelligence
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 1
A Novel Dynamic Model Capturing Spatial and Temporal Patterns for Facial Expression Analysis
Shangfei Wang, Zhuangqiang Zheng, Shi Yin, Jiajia Yang, Qiang Ji
Abstract—Facial expression analysis could be greatly improved by incorporating the spatial and temporal patterns present in facial behavior, but these patterns have not yet been utilized to their full advantage. We remedy this via a novel dynamic model, an interval temporal restricted Boltzmann machine (IT-RBM), that is able to capture both universal spatial patterns and complicated temporal patterns in facial behavior for facial expression analysis. We regard a facial expression as a multifarious activity composed of sequential or overlapping primitive facial events. Allen's interval algebra is implemented to portray these complicated temporal patterns via a two-layer Bayesian network. The nodes in the upper-most layer represent the primitive facial events, and the nodes in the lower layer depict the temporal relationships between those events. Our model also captures inherent universal spatial patterns via a multi-value restricted Boltzmann machine in which the visible nodes are facial events, and the connections between hidden and visible nodes model intrinsic spatial patterns. Efficient learning and inference algorithms are proposed. Experiments on posed and spontaneous expression distinction and on expression recognition demonstrate that our proposed IT-RBM achieves superior performance compared to state-of-the-art research due to its ability to incorporate these facial behavior patterns.
Index Terms—interval temporal restricted Boltzmann machine, global spatial and temporal patterns, posed and spontaneous expression distinction, expression category recognition
I. INTRODUCTION
THERE has been a proliferation of research on facial expression analysis recently, since facial expression is a crucial channel for both human-human communication and human-robot interaction.
Current works on facial expression recognition may be categorized into one of two approaches: frame-based or sequence-based. A frame-based approach recognizes facial expressions from static facial images, usually from the manually annotated apex frame. This approach completely disregards the important dynamic patterns inherent in facial behavior. A sequence-based approach relies on the whole image sequence, and thus has the potential to model both spatial and temporal patterns through features or dynamic classifiers. Current works either employ hand-crafted spatial and temporal features or use representations learned through deep networks. Several dynamic classifier models, such as hidden Markov models (HMMs), dynamic Bayesian networks (DBNs), latent conditional random fields (LCRFs), long short-term memory networks (LSTMs), or gated recurrent unit networks (GRUs), are frequently used. All of these works try to find more discriminative features or more powerful classifiers to explore embedded spatial and temporal patterns, and they have been successful for facial expression analysis. We refer to these approaches as feature-driven methods.

Shangfei Wang is the corresponding author. Shangfei Wang, Zhuangqiang Zheng, Shi Yin and Jiajia Yang are with the School of Computer Science and Technology, University of Science and Technology of China, 230027, Hefei, China. E-mail: [email protected]; [email protected]; [email protected]; [email protected]

Qiang Ji is with the Department of Electrical, Computer, and Systems Engineering, Rensselaer Polytechnic Institute, NY 12180-3590. E-mail: [email protected]
Few works consider the underlying anatomic mechanisms governing facial muscular interactions. Nearly any facial expression can be deconstructed into the contraction or relaxation of one or more facial muscles. These facial muscle movements interact in space and time to convey different expressions. At each time slice, facial muscle motions may co-occur or be mutually exclusive. For example, as shown in Figure 1(a) and Figure 1(b), most people raise the inner brow and outer brow simultaneously, since both motions are related to the frontalis muscle group. The lip corner puller rarely occurs in tandem with the lip corner depressor, as shown in Figure 1(c) and Figure 1(d): the lip corner puller uses the zygomaticus major muscle group, while the latter is produced by the depressor anguli oris muscle group. Temporally, the movement of one facial muscle can activate, meet, overlap, or succeed another muscle. As shown in Figure 2, for example, most people show happiness by stretching their mouths while raising their cheeks. Moreover, the contraction of the zygomatic major is more likely to occur asymmetrically if a smile is posed rather than spontaneous. When an expression is natural and spontaneous, its trajectory is typically smoother, its duration is shorter, and its onset is gradual rather than immediate. Such spatial and temporal patterns caused by the interaction of facial muscles are extremely complex, time-dependent, and global, yet have not been fully modeled by current facial expression analysis methods.
We propose a novel dynamic model that leverages the complex spatial and temporal patterns caused by the underlying anatomic mechanism for expression analysis. We assume an expression is a multifarious activity made up of sequential or overlapping primitive facial events, and that each event takes place over a certain interval of time. First, we introduce Allen's interval algebra to capture several types of temporal relationships, including A takes place before B, A meets B, A overlaps with B, A initiates B, A occurs during B, A finishes B, A is equal to B, and the inverses of the first six relations. We implement the complex temporal relations using a Bayesian network incorporating primitive facial event nodes and temporal relationship nodes. The links connecting the two types of nodes characterize their temporal relationships. Next,
Fig. 1. Sample images demonstrating spatial patterns inherent in expressions: (a) neutral frame image of surprise; (b) peak frame image of surprise; (c) peak frame image of happy; (d) peak frame image of sad.
Fig. 2. Image sequences (posed smile vs. spontaneous smile) demonstrating temporal patterns inherent in expressions. The x-axis is the frame number (frames 1, 18, 47, 94, and 111 shown).
a restricted Boltzmann machine (RBM) is adopted to represent the global spatial patterns among primitive facial events. The visible nodes of the RBM depict primitive facial events, and the connections between hidden nodes and visible nodes model the spatial patterns inherent in expressions. During training, we build an IT-RBM model for each type of expression, and the parameters and structures of the proposed IT-RBM are learned through maximum likelihood. During testing, a sample is assigned the label of the model with the largest log likelihood.
The proposed IT-RBM differs from other dynamic models in that it introduces Allen's interval algebra to capture all 13 temporal relations. Unlike current dynamic models, which are limited to a time-slice structure and must assume stationary and time-independent temporal relations, the suggested model can capture more complex global temporal relationships.
The paper is organized as follows. Section II is a brief review of related works on expression analysis, including expression recognition as well as posed and spontaneous expression distinction. Section III details the proposed IT-RBM model. Section IV outlines the experiments and analysis, with posed and spontaneous expression distinction experiments outlined in Section IV-A and expression recognition experiments detailed in Section IV-B. Section V summarizes our work.
II. RELATED WORK
A. Posed and Spontaneous Expression Distinction
Inner feelings may be disguised with a posed expression, but true emotions are conveyed via spontaneous expressions. It is difficult to distinguish one from the other, since expressions vary by subject and condition and the differences between a spontaneous and a posed expression are subtle. The inherent spatial and temporal patterns in facial expressions can be leveraged to improve distinction between these similar types of expressions.
Behavioral research has found slight but distinctive differences between the temporal and spatial patterns of posed and spontaneous expressions. Examples of temporal patterns include the speed, trajectory, amplitude, and duration of expression onset and offset. For example, Ekman et al. [1][2] found that spontaneous expressions usually have a smoother trajectory and shorter duration than posed expressions. Schmidt et al. [3] revealed that for posed smiles, the maximum speed of movement onset is greater than it is for spontaneous smiles. Deliberate eyebrow raises are shorter in duration and have a greater maximum speed and amplitude than spontaneous, natural eyebrow raises. Spatial patterns mainly consist of the movements of facial muscles. For example, Ekman et al.'s work [1] found that the orbicularis oculi only contract during spontaneous smiles. When smiling, the contraction of the zygomatic major muscle is more likely to be asymmetric for a posed expression than a spontaneous one [4]. Ross and Pulusu [5] indicated that posed expressions typically commence on the right side of the face, while spontaneous expressions originate on the left side; this is especially true for upper facial expressions. Namba et al.'s work [6] compared the morphological and dynamic properties of spontaneous and posed facial expressions as they related to surprise, amusement, disgust, and sadness. For amusement, AUs yield no significant differences. For disgust, AU10 and AU12 occur more frequently when an expression is spontaneous rather than posed, while AU17 appears more often in posed expressions. For sadness, morphological properties specific to spontaneous facial expressions are not observed, while AU4, AU7, and AU17 are most frequently observed in posed facial expressions.
Most research uses certain features to distinguish between posed and spontaneous expressions. Cohn and Schmidt [7] adopted temporal features, including duration, amplitude, and the ratio between the two. Valstar [8] utilized features such as speed, duration, trajectory, intensity, symmetry, and the occurrence order of brow actions based on fiducial facial point displacement. Dibeklioglu et al. [9] described the dynamics of eyelid, cheek, and lip corner movements using amplitude, duration, speed, and acceleration. Seckington [10] represented temporal dynamics using six features (i.e., morphology, apex overlap, symmetry, total duration, onset speed, and offset speed).
Static classifiers (e.g., linear discriminant classifiers [7], support vector machines [11], k-NN [12], and naive Bayesian classifiers [12]) and dynamic classifiers (e.g., continuous hidden Markov models [12] and dynamic Bayesian networks [10]) were investigated for the task of distinguishing between
posed and spontaneous expressions. Static classifiers model the mapping between features and expression types, while dynamic classifiers model the temporal relationships.
Progress has been made in distinguishing between posed and spontaneous expressions. However, these feature-driven methods do not explicitly leverage the underlying interactions between facial muscles and their influence on posed and spontaneous expressions.
Recently, Wang et al. [13] proposed a model-based method using multiple Bayesian networks (BNs) to capture spatial patterns of expressions given gender and expression categories. This model only includes local dependencies due to the first-order Markov assumption of BNs; it cannot capture high-order or global relations. Wu et al. [14] proposed to address this issue by implementing a restricted Boltzmann machine (RBM) to explicitly model complex joint distributions over feature points. RBMs introduce a layer of latent units, allowing them to model high-order dependencies among variables [15]. Although this model is an improvement, it does not leverage the dependencies among hidden units. Quan et al. [16] employed a latent regression Bayesian network (LRBN) to leverage higher-order and global dependencies among facial features. A latent regression Bayesian network differs from an RBM in that it is a directed rather than undirected model. The "explaining away" effect in Bayesian networks allows LRBNs to capture dependencies among both latent and visible variables; these dependencies are vital to accurately represent the data. The success of each of these three model-based works proves that spatial patterns can contribute to the differentiation of posed from spontaneous expressions.
Thus far, there have not been many attempts to capture and leverage the spatial and temporal patterns embedded in posed and spontaneous facial expressions. We propose an interval temporal restricted Boltzmann machine (IT-RBM) to jointly capture these global and complex patterns and improve the task of expression distinction.
B. Expression Recognition
Expression recognition has attracted much more attention than the distinction between posed and spontaneous expressions. Corneanu et al.'s work [17] and Martinez et al.'s work [18] provided literature reviews of facial expression recognition.
Mainstream facial expression recognition works regard facial expression recognition as a pattern recognition problem and focus primarily on discriminative features and powerful classifiers. For features, both engineered dynamic representations and representations learned from video volumes are exploited to encode temporal variations among sequence frames. Engineered dynamic representations such as LBP-TOP [19] and Gabor motion energy [20] do not require labelled sequences for training, and thus are simple and generic for any expression analysis task; however, their optimality is questionable. Learned representations may attain state-of-the-art performance, but they require many training videos with ground-truth labels. Dynamic graphical models, such as HMMs [21][22][23] and DBNs [24], have commonly been used for facial expression analysis tasks. As time-slice (time-point based) graphical models, these dynamic models represent each activity as a sequence of events occurring instantaneously, and thus offer only three time-point relations (i.e., before, follows, and equals). Since facial expressions are complex and consist of facial events that may be sequential or temporally overlapping, current dynamic graphical models are unable to represent several of the temporal relations occurring between events throughout the activity. Recently, deep dynamic models such as LSTM [25] have been adopted for facial expression analysis. Usually, a convolutional neural network (CNN) is used to obtain static representations from each frame, and then the learned static representations are fed into the LSTM to learn dynamic representations and expression classifiers simultaneously. In spite of its good performance, LSTM requires a lot of training data. Furthermore, LSTM is also a time-slice model and cannot successfully represent the global and complex temporal relations between primitive facial events inherent in facial expressions. Just as in expression distinction, these feature-driven expression recognition methods ignore the underlying interactions among facial muscles.
A facial expression is defined as at least one motion of the facial muscles over a period of time. These muscle movements commonly appear in certain patterns to communicate different expressions. For example, the facial expression of happiness is characterized by raised cheeks and a stretched mouth. Surprise is usually displayed by widened eyes and a gaping mouth. A look of sadness is easily identified by upwardly slanted eyebrows and a frown. An expression of anger is often determined by squeezed eyebrows as well as tight and straight eyelids. Fear typically includes widened eyes and eyebrows slanted upward. These expression-dependent temporal and spatial patterns are essential for expression recognition, but have yet to be exploited thoroughly.
As far as we know, only one related work attempts to capture these patterns for expression recognition. Wang et al. [26] suggested an interval temporal Bayesian network (ITBN) including temporal entity nodes and temporal relation nodes. The links between the temporal entity nodes represent spatial dependencies among temporal entities. Links joining temporal relation nodes with their corresponding temporal entities represent the temporal relationships between the two connected temporal entities. Thus, the ITBN is able to leverage spatial and temporal patterns. Due to the Markov assumption, however, Bayesian networks only capture local dependencies. Therefore, instead of a BN, we employ an RBM to capture and depict global spatial patterns. Since Allen's interval algebra defines complete temporal relations between two events and a BN can fully capture dependencies between two events, we still employ a BN to model the temporal patterns embedded in expression changes.
The proposed IT-RBM is a novel dynamic model that can capture complex and global relations through the use of interval algebra, which defines complete temporal relations between two events. This is an improvement over typical dynamic models like HMMs and DBNs, which use a time-slice structure to represent three time-point relations and are only able to capture stationary dynamics.
This paper makes the following contributions to this field of study:
1. A novel dynamic model, IT-RBM, is proposed to jointly capture both global spatial patterns and complex temporal patterns.
2. We explicitly model the spatial-temporal patterns innate to various expression categories for expression recognition.
3. We explicitly model the spatial-temporal patterns found in posed and spontaneous expressions to better distinguish between those expressions.
A previous version of this paper appeared as Yang et al.'s work [27], which proposed an IT-RBM to capture and utilize the spatial and temporal patterns embedded in posed and spontaneous expressions for expression distinction. Unlike the previous version, which only focuses on posed and spontaneous expression distinction, this paper extends the proposed IT-RBM to expression recognition. To show the effectiveness of the proposed IT-RBM for expression recognition, experiments are conducted on the CK+ and MMI databases. We have also added two models for posed and spontaneous expression distinction (i.e., the PS gender model and the PS exp model), since the spatial and temporal patterns embedded in expressions are influenced by gender and type of expression. For the PS gender model, we train four models from male posed, male spontaneous, female posed, and female spontaneous samples. For the PS exp model, we train a posed model and a spontaneous model for each expression type.
III. PROPOSED METHOD
Facial expressions are the results of a set of muscle movements over a period of time. At each time slice, facial muscle motions can co-occur or be mutually exclusive. From a temporal perspective, the movement of one facial muscle can activate, overlap, or follow the movement of another muscle. Because of the difficulty of measuring minute facial muscle motions, the movements of facial feature points are used to define primitive facial events, as recommended by Wang et al.'s work [26]. Each feature point movement is a singular primitive facial event. The interval relation between each pair of events can be defined as one of the 13 interval relations of Allen's interval algebra [28]. First, we select the primitive event pairs with the largest interval relation variance among the different expressions. For each type of expression, an IT-RBM model is constructed using the selected events and interval relations. The global spatial and temporal patterns are jointly captured during training. During testing, a sample is assigned the label of the model with the largest likelihood. The method framework is illustrated in Figure 3.
This section focuses first on the extraction of primitive facial events. Then, we describe the definition and selection of temporal relations. After that, the proposed IT-RBM model is presented in detail.
A. Primitive facial events extraction
Fig. 3. Outline of the recognition system: primitive events are extracted from the facial expression video database, primitive event pairs and temporal relations are selected, and C IT-RBM models (with gender or expression labels) are trained; in the test phase, the test primitive events are scored by the likelihood of each model.

Given the data set of sample videos with different types of expression, denoted as D = {(x(1), y(1)), (x(2), y(2)), ..., (x(N), y(N))}, where x(i) is the ith video with frame length fi, y(i) ∈ {0, 1, ..., C} is its expression label (C is the total number of expression classes), and N is the total number of sample videos. Each frame is a facial image with Pno facial points. Each video is assumed to contain primitive facial events, which are either sequential or temporally overlapping. A primitive facial event is the movement of one feature point and includes the motion state, the commencement time when the feature point is no longer in the neutral position, and the moment when the point returns to neutral. Figure 4 depicts a primitive event corresponding to the ith facial point, denoted as Vi = 〈tsi, tei, vi〉 (tsi, tei ∈ R, tsi < tei, vi ∈ {1, 2, ..., K}), where tsi and tei represent the start and end times respectively, vi represents the motion state, and {1, 2, ..., K} is the set of all possible primitive event states. As expression videos differ in frame length, we normalize all videos to the shortest frame length len in the training set: samples are equidistantly down-sampled to len frames. We obtain the K movement states by applying K-means clustering to the feature point displacement sequences.
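The start and end times that delimit a primitive event can be sketched as follows. This is a minimal illustration, not the paper's implementation: `extract_event` and the neutral-position threshold `eps` are hypothetical names, since the paper does not specify how departure from the neutral position is thresholded.

```python
def extract_event(displacements, eps=0.5):
    """Extract the interval <ts, te> of a primitive facial event from one
    feature point's displacement trajectory (distance from its position in
    the neutral frame).

    ts: first frame where the point leaves the neutral position (|d| > eps);
    te: first frame after ts where it returns to neutral (or the last frame).
    Returns None if the point never leaves neutral.
    """
    ts = next((t for t, d in enumerate(displacements) if abs(d) > eps), None)
    if ts is None:
        return None
    te = next((t for t in range(ts + 1, len(displacements))
               if abs(displacements[t]) <= eps), len(displacements) - 1)
    return ts, te
```

For instance, a point that rises and falls back to neutral yields one interval, while a point that never moves yields no event at all.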
Figure 4 illustrates some example primitive facial events. In (a), facial points P1 and P2 correspond to events V1 and V2, representing the muscle motions of the right wing of the nose and the right mouth corner respectively. (b) shows the traces of V1 and V2 along the vertical direction, where T1 and T2 are their corresponding durations. Each event has K possible states representing its movement pattern throughout the duration, as (c) shows. The flat red line depicts a point that remains in neutral for the entire process, and the other states represent the K − 1 movement patterns. For example, a point that moves up and then returns to neutral would be represented by state Sm (shown as the dotted black line). A more complex pattern is depicted by state S2 (the solid line), in which a point moves upward and then downward.
B. Temporal relations definition and selection
According to Allen’s interval algebra [28], there are 13potential temporal relationships between two primitive eventsas illustrated in Table I. The 13 possible relations I ={b, bi,m,mi, o, oi, s, si, d, di, f, fi, eq}, representing before,meets, overlaps, starts, during, finishes, equals and their in-verses. The temporal relationships between pairs of facialevents Vi and Vj can be obtained by calculating the temporaldistance dis(Vi, Vj) according to Eq.1.
Fig. 4. (a) Facial muscle movement as captured by the movement of facial points P1 and P2. (b) Durations T1 and T2 of events E1 and E2 and their temporal relation (R12: E1 during E2). (c) Example movement states S1, S2, ..., Sm of a primitive facial event.
dis(Vi, Vj) = [tsi − tsj , tsi − tej , tei − tsj , tei − tej ] (1)
TABLE I
TR AND INTERVAL RELATION MAPPING TABLE

No | TR | tsi − tsj | tei − tej | tsi − tej | tei − tsj
 1 | b  |  < 0 |  < 0 |  < 0 |  < 0
 2 | bi |  > 0 |  > 0 |  > 0 |  > 0
 3 | d  |  > 0 |  < 0 |  < 0 |  > 0
 4 | di |  < 0 |  > 0 |  < 0 |  > 0
 5 | o  |  < 0 |  < 0 |  < 0 |  > 0
 6 | oi |  > 0 |  > 0 |  < 0 |  > 0
 7 | m  |  < 0 |  < 0 |  < 0 |  = 0
 8 | mi |  > 0 |  > 0 |  = 0 |  > 0
 9 | s  |  = 0 |  < 0 |  < 0 |  > 0
10 | si |  = 0 |  > 0 |  < 0 |  > 0
11 | f  |  > 0 |  = 0 |  < 0 |  > 0
12 | fi |  < 0 |  = 0 |  < 0 |  > 0
13 | eq |  = 0 |  = 0 |  −   |  −
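Given Eq. 1 and Table I, the mapping from an interval pair to its Allen relation can be sketched in Python. `allen_relation` is a hypothetical helper, not from the paper; the sign-tuple keys follow the rows of Table I, with 0 standing for "= 0" (for `eq`, the two unconstrained columns are fixed to the values implied by tsi < tei).

```python
def sign(x):
    """Return -1, 0, or 1, matching the <0 / =0 / >0 columns of Table I."""
    return (x > 0) - (x < 0)

# Table I: Allen relation keyed by the sign pattern of
# (tsi - tsj, tei - tej, tsi - tej, tei - tsj).
ALLEN = {
    (-1, -1, -1, -1): 'b',  (1, 1, 1, 1):   'bi',
    (1, -1, -1, 1):   'd',  (-1, 1, -1, 1): 'di',
    (-1, -1, -1, 1):  'o',  (1, 1, -1, 1):  'oi',
    (-1, -1, -1, 0):  'm',  (1, 1, 0, 1):   'mi',
    (0, -1, -1, 1):   's',  (0, 1, -1, 1):  'si',
    (1, 0, -1, 1):    'f',  (-1, 0, -1, 1): 'fi',
    (0, 0, -1, 1):    'eq',
}

def allen_relation(vi, vj):
    """Classify the temporal relation of intervals vi = (tsi, tei) and
    vj = (tsj, tej) by the sign pattern of the distance vector of Eq. 1."""
    (tsi, tei), (tsj, tej) = vi, vj
    key = (sign(tsi - tsj), sign(tei - tej), sign(tsi - tej), sign(tei - tsj))
    return ALLEN[key]
```

For example, two intervals sharing a single boundary frame map to `m` (meets), while a strictly nested pair maps to `d` (during).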
The extraction of primitive facial events yields Pno ∗ (Pno − 1) pairs of events and the corresponding temporal relations for each sample. It is expected that discriminative temporal relations will have a wider variance between expression types, so we propose a Kullback-Leibler divergence-based score [29] to measure the difference between two probability distributions. The score of event pair Vi, Vj is defined in Eq. 2, where TRij represents the relation between primitive event pair Vi, Vj, Px(TRij) and Py(TRij) are the probability distributions of TRij for expressions x and y respectively, and DKL stands for the KL divergence. Primitive event pairs are ranked by the score, and the top ξ pairs, involving m events, are selected.

Sij = ∑_{x,y∈{1,2,...,C}} (DKL(Px(TRij)‖Py(TRij)) + DKL(Py(TRij)‖Px(TRij)))    (2)
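The inner term of Eq. 2, the symmetric KL divergence for one expression pair (x, y), can be sketched as follows. The smoothing constant `eps` is an assumption of this sketch, since the paper does not say how zero-probability relations are handled.

```python
from math import log

def sym_kl_score(p, q, eps=1e-8):
    """Symmetric KL divergence DKL(p||q) + DKL(q||p) between two discrete
    distributions over the 13 temporal relations (one summand of Eq. 2).
    eps smooths zero-probability relations (an assumed choice)."""
    dkl = lambda a, b: sum(ai * log((ai + eps) / (bi + eps))
                           for ai, bi in zip(a, b))
    return dkl(p, q) + dkl(q, p)
```

Summing this score over all expression pairs (x, y) gives Sij; pairs whose relation distribution barely changes across expressions score near zero and are pruned.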
C. Capturing spatial and temporal patterns through the IT-RBM model

Our proposed hybrid graphical model, the IT-RBM, is shown in Figure 5. The upper section is a multi-value RBM and the lower part is a Bayesian network. The uppermost layer contains n binary latent variables hj ∈ {0, 1} (j ∈ {1, 2, ..., n}). The layer below contains m visible nodes; vi ∈ {1, ..., K} (i ∈ {1, 2, ..., m}) describes the m selected facial events. Each facial event takes one of K motion states represented by a one-hot vector. Specifically, vi consists of binary nodes vi1, vi2, ..., viK; thus vi = k can be represented with a one-hot vector by setting vik = 1 and the other K − 1 binary nodes to zero. The bottom layer contains ξ temporal relation nodes, TR ∈ I, representing the 13 temporal relations. Complex temporal relations are captured by the lower part; the spatial dependencies among facial events are modeled by the upper part. Eq. 3 gives the joint probability of the suggested model.
Fig. 5. An example of the IT-RBM model: n hidden nodes h1, h2, ..., hn are fully connected by weights W to m visible nodes v1, v2, ..., vm, and temporal relation nodes TR12, TR13, ..., TR2m, TR3m are each connected to their corresponding pair of visible nodes.
P(v, TR) = P(TR|v)P(v) = P(TR|v) ∑_h P(v, h)    (3)

where

P(TR|v) = ∏_{r=1}^{R} P(TRr|π(TRr)),    (4)

TRr represents the rth temporal relation node, and π(TRr) are the two primitive event nodes that produce TRr.
After primitive events and temporal relations are extracted, we are given training data Dt = {(v(1), TR(1)), (v(2), TR(2)), ..., (v(Nt), TR(Nt))}, where Nt is the number of training samples for one expression, and v(i) and TR(i) represent the motion states and temporal relations of the ith sample. The goal of model learning is log likelihood maximization, shown as follows:

θ∗ = argmax_θ (1/Nt) ∑ (log P(v; θ) + log P(TR|v; θ))    (5)

Eq. 5 shows that the log likelihood of the IT-RBM factorizes into the sum of the log likelihood of the RBM and the log likelihood of the BN. Since the RBM parameters θRBM are independent of the BN parameters θBN, we can train the RBM and the BN separately. Training of the multi-value RBM only concerns the motion states of primitive events, so we denote DtRBM = {v(1), v(2), ..., v(Nt)}. The marginal distribution of the visible units is calculated as Eq. 6,
P(v) = ∑_h P(v, h) = (1/Z) ∑_h e^{−E(v,h)} = (1/Z) e^{∑_{i=1}^{m} ∑_{k=1}^{K} bik vik} ∏_{j=1}^{n} (1 + e^{aj + ∑_{i=1}^{m} ∑_{k=1}^{K} wjik vik})    (6)
where E is the energy function of the multi-value RBM, defined in Eq. 7. {W, a, b} are the model parameters: wjik is a symmetric interaction term between visible unit i taking value k and hidden unit j, bik is the bias of visible unit i taking value k, and aj is the bias of hidden unit j.
E(v, h) = − ∑_{i=1}^{m} ∑_{j=1}^{n} ∑_{k=1}^{K} vik wjik hj − ∑_{i=1}^{m} ∑_{k=1}^{K} vik bik − ∑_{j=1}^{n} hj aj    (7)

The gradient with respect to θRBM = {W, a, b} can be calculated as Eq. 8, where P(v, h) and P(h|v) denote the model-defined distributions; v in the first term is from the training set, and v in the second term is sampled from the model-defined distribution P(v).
∆θRBM = ε ∂log P(v)/∂θRBM = ε (−∑_h P(h|v) ∂E(v, h)/∂θRBM + ∑_{v,h} P(v, h) ∂E(v, h)/∂θRBM)    (8)
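As a concrete check of Eqs. 6 and 7, the energy and the unnormalized marginal of the multi-value RBM can be sketched as below. The nested-list parameter layout (`W[j][i][k]`, `b[i][k]`, `a[j]`) is an assumed convention of this sketch, with each visible event stored as its state index rather than an explicit one-hot vector; summing e^{−E(v,h)} over all h must agree with the closed form of Eq. 6 up to the constant Z.

```python
from math import exp, log

def neg_energy(v, h, W, a, b):
    """-E(v, h) from Eq. 7, where v[i] = k encodes the one-hot setting
    v_ik = 1. W[j][i][k], a[j], b[i][k] is the assumed parameter layout."""
    m, n = len(v), len(h)
    return (sum(W[j][i][v[i]] * h[j] for j in range(n) for i in range(m))
            + sum(b[i][v[i]] for i in range(m))
            + sum(a[j] * h[j] for j in range(n)))

def log_unnorm_marginal(v, W, a, b):
    """log of the unnormalized marginal in Eq. 6, i.e. log(Z * P(v)):
    sum_ik b_ik v_ik + sum_j log(1 + exp(a_j + sum_ik w_jik v_ik))."""
    m = len(v)
    return (sum(b[i][v[i]] for i in range(m))
            + sum(log(1 + exp(a[j] + sum(W[j][i][v[i]] for i in range(m))))
                  for j in range(len(a))))
```

Because the hidden units are binary and conditionally independent, marginalizing h factorizes into the product over j in Eq. 6, which is why the closed form needs no explicit sum over the 2^n hidden configurations.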
The contrastive divergence (CD) algorithm is used to overcome the challenge of computing the second term of Eq. 8, which is intractable but needed for the gradient calculation [30]. The conditional distribution of visible nodes given hidden nodes and the conditional distribution of hidden nodes given visible nodes are a softmax function and a logistic function respectively, as follows:
P(vi = k|h) = exp(bik + ∑_{j=1}^{n} hj wjik) / ∑_{l=1}^{K} exp(bil + ∑_{j=1}^{n} hj wjil)    (9)

P(hj = 1|v) = σ(aj + ∑_{i=1}^{m} ∑_{k=1}^{K} wjik vik)    (10)
The detailed algorithm for learning the multi-value RBM is shown as Algorithm 1.

Algorithm 1 The training algorithm for multi-value RBM using CD learning

Require: training data D_t^RBM = {v^(1), v^(2), ..., v^(N_t)}, number of latent nodes n, learning rate ε, maximum number of training iterations T
Ensure: w_jik, a_j, b_ik
1: Initialize: set w, a, b to small random values
2: for t = 1, 2, ..., T do
3:   sample one example v from D_t^RBM
4:   for j = 1, 2, ..., n do
5:     Sample h_j ~ p(h_j | v) with Eq. 10
6:   end for
7:   for i = 1, 2, ..., m do
8:     Sample v'_i ~ p(v_i | h) with Eq. 9
9:   end for
10:  parameter update:
11:    w_jik ← w_jik + ε(P(h_j = 1|v) v_ik − P(h_j = 1|v') v'_ik)
12:    b_ik ← b_ik + ε(v_ik − v'_ik)
13:    a_j ← a_j + ε(P(h_j = 1|v) − P(h_j = 1|v'))
14: end for
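The CD learning loop above can be sketched in numpy as follows. This is a minimal one-step-CD illustration under assumed one-hot (m × K) visible configurations; `cd1_train` is a hypothetical name, not from the paper:

```python
import numpy as np

def cd1_train(data, n_hidden, lr=0.05, epochs=50, seed=0):
    """One-step contrastive divergence for a multi-value RBM (sketch of Algorithm 1).

    data : (N, m, K) array of one-hot visible configurations.
    Returns learned parameters W (n, m, K), a (n,), b (m, K).
    """
    rng = np.random.default_rng(seed)
    N, m, K = data.shape
    n = n_hidden
    W = rng.normal(scale=0.01, size=(n, m, K))
    a = np.zeros(n)
    b = np.zeros((m, K))
    for _ in range(epochs):
        v = data[rng.integers(N)]                                # one training example
        # positive phase: P(h_j = 1 | v), Eq. 10
        ph = 1.0 / (1.0 + np.exp(-(a + np.einsum('jik,ik->j', W, v))))
        h = (rng.random(n) < ph).astype(float)                   # sample hidden states
        # negative phase: softmax over the K values of each visible unit, Eq. 9
        logits = b + np.einsum('j,jik->ik', h, W)                # (m, K)
        probs = np.exp(logits - logits.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)
        v2 = np.zeros_like(v)
        for i in range(m):                                       # sample v'_i ~ P(v_i | h)
            v2[i, rng.choice(K, p=probs[i])] = 1.0
        ph2 = 1.0 / (1.0 + np.exp(-(a + np.einsum('jik,ik->j', W, v2))))
        # parameter updates (lines 11-13 of Algorithm 1)
        W += lr * (ph[:, None, None] * v[None] - ph2[:, None, None] * v2[None])
        b += lr * (v - v2)
        a += lr * (ph - ph2)
    return W, a, b

# toy run: 20 samples, m = 3 visible units, K = 4 values each
gen = np.random.default_rng(1)
toy = np.zeros((20, 3, 4))
for s in range(20):
    for i in range(3):
        toy[s, i, gen.integers(4)] = 1.0
W, a, b = cd1_train(toy, n_hidden=5)
```

The update rule uses the expected hidden activations P(h_j = 1 | ·) rather than the sampled binary states in the outer products, as in the algorithm's lines 11 and 13.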
The conditional probability distributions of each temporal relation node TR_ij given its parent nodes v_i and v_j define the parameters of the BN. The structure of the BN and the number of parameters can be determined once the temporal relations are selected. The goal of parameter estimation is to find the maximum-likelihood estimate of the parameters θ_BN given training data D_t = {(v^(1), TR^(1)), (v^(2), TR^(2)), ..., (v^(N_t), TR^(N_t))}, as depicted in Eq. 11.
\theta_{BN}^* = \arg\max_{\theta_{BN}} \sum_{s=1}^{N_t} \log P(TR^{(s)}|v^{(s)}; \theta_{BN})   (11)
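For discrete nodes, the maximum-likelihood estimate in Eq. 11 reduces to normalized co-occurrence counts of each temporal relation given its parent states. A minimal sketch, with hypothetical toy parent states and relation names rather than the paper's actual relation set:

```python
from collections import Counter, defaultdict

def estimate_cpt(samples):
    """MLE of P(TR_ij | v_i, v_j) by counting.

    samples: iterable of ((v_i, v_j), tr) pairs observed in the training data.
    Returns a dict mapping each parent configuration to a distribution over relations.
    """
    counts = defaultdict(Counter)
    for parents, tr in samples:
        counts[parents][tr] += 1            # co-occurrence counts
    return {p: {tr: c / sum(ctr.values()) for tr, c in ctr.items()}
            for p, ctr in counts.items()}   # normalize per parent configuration

# toy data: hypothetical event states and relation labels
data = [((1, 0), 'before'), ((1, 0), 'before'), ((1, 0), 'overlaps'), ((0, 1), 'after')]
cpt = estimate_cpt(data)
```

With the toy data above, P('before' | v_i=1, v_j=0) comes out as 2/3, the fraction of matching observations.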
D. Expression analyses

An IT-RBM model is obtained for each expression after training. During testing, a test sample t is labeled with the class that has the largest log-likelihood value, according to Eq. 12, where y* represents the predicted label and C is the number of expression categories (as well as the number of IT-RBM models).
y^* = \arg\max_{y \in \{1, \dots, C\}} \{\log P(t|\theta_y)\}   (12)
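The decision rule of Eq. 12 is an argmax over per-class log-likelihoods. In the minimal sketch below, `log_likelihoods` stands in for the scores log P(t | θ_y) produced by the C trained per-class models; the class names and numbers are hypothetical:

```python
def classify(log_likelihoods):
    """Eq. 12: pick the class whose model assigns the test sample the highest
    log-likelihood. log_likelihoods: dict mapping label y -> log P(t | theta_y)."""
    return max(log_likelihoods, key=log_likelihoods.get)

# hypothetical per-class scores for one test sample
scores = {'happy': -12.3, 'sad': -15.1, 'surprise': -11.8}
print(classify(scores))   # 'surprise' has the largest log-likelihood
```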
The log-likelihood that the IT-RBM trained on class y assigns to test sample t is as follows:

\log P(t|\theta_y) = \log\Big(\sum_h \exp(-E(t, h; \theta_y))\Big) - \log Z(\theta_y) + \log P(TR|t; \theta_y)   (13)
in which the first and third terms can be directly calculated, while the partition function Z is intractable. An extended AIS method inspired by annealed importance sampling (AIS) [31] is used to compute the partition function of the multi-value RBM.
AIS approximates the ratio of the partition function of the objective RBM to that of the base-rate RBM. For example, suppose there are two multi-value RBMs with parameters θ_A = {W^A, b^A, a^A} and θ_B = {W^B, b^B, a^B}. These RBMs define probability distributions P_A and P_B over the same v ∈ {0, 1, ..., K}^m, with h^A ∈ {0, 1}^{n_A} and h^B ∈ {0, 1}^{n_B}.
First, the sequence of intermediate distributions for τ = 0, ..., n is defined as:

P_\tau(v) = \frac{P_\tau^*(v)}{Z_\tau} = \frac{1}{Z_\tau} \sum_h \exp(-E_\tau(v, h))   (14)
where the energy function is defined in Eq. 15, P_0(v) = P_A, and P_n(v) = P_B. Eq. 16 gives the unnormalized probability over the visible units, where 0 = β_0 < β_1 < ... < β_τ < ... < β_n = 1.
E_\tau(v, h) = (1 - \beta_\tau) E(v, h^A; \theta_A) + \beta_\tau E(v, h^B; \theta_B)   (15)
P_\tau^*(v) = e^{(1-\beta_\tau) \sum_i \sum_k b_{ik}^A v_{ik}} \prod_{j=1}^{n_A} \Big(1 + e^{(1-\beta_\tau)(\sum_i \sum_k w_{jik}^A v_{ik} + a_j^A)}\Big) \cdot e^{\beta_\tau \sum_i \sum_k b_{ik}^B v_{ik}} \prod_{j=1}^{n_B} \Big(1 + e^{\beta_\tau(\sum_i \sum_k w_{jik}^B v_{ik} + a_j^B)}\Big)   (16)
Next, we establish a Markov chain transition operator T_\tau(v'; v) that leaves P_\tau(v) invariant. Logistic and softmax functions yield the conditional distributions as follows:
P(h_j^A = 1|v) = \sigma\Big((1 - \beta_\tau)\big(a_j^A + \sum_i \sum_k w_{jik}^A v_{ik}\big)\Big)   (17)
P(h_j^B = 1|v) = \sigma\Big(\beta_\tau\big(a_j^B + \sum_i \sum_k w_{jik}^B v_{ik}\big)\Big)   (18)
P(v_i = k|h^A, h^B) = \frac{\exp\Big((1 - \beta_\tau)\big(b_{ik}^A + \sum_{j=1}^{n_A} h_j^A w_{jik}^A\big) + \beta_\tau\big(b_{ik}^B + \sum_{j=1}^{n_B} h_j^B w_{jik}^B\big)\Big)}{\sum_{l=1}^{K} \exp\Big((1 - \beta_\tau)\big(b_{il}^A + \sum_{j=1}^{n_A} h_j^A w_{jil}^A\big) + \beta_\tau\big(b_{il}^B + \sum_{j=1}^{n_B} h_j^B w_{jil}^B\big)\Big)}   (19)
Hidden units hA and hB are stochastically activated usingEq. 17 and Eq. 18. A new sample is drawn using Eq. 19.
Finally, with the initial base-rate parameters θ_A = {0, b^A, 0}, Z_A is calculated in closed form as Eq. 20; Z_B then follows from Eq. 21. We write ω^(i) for the importance weight of the i-th AIS run, defined in Algorithm 2, to keep the notation compact.
Z_A = 2^{n_A} \prod_i \sum_k e^{b_{ik}^A}   (20)
\frac{Z_B}{Z_A} \approx \frac{1}{M_r} \sum_{i=1}^{M_r} \omega^{(i)} = r_{AIS}   (21)
The detailed algorithm is outlined below in Algorithm 2.
Algorithm 2 The AIS algorithm for estimating the partition function Z [31]

Require: base-rate RBM parameters θ_A = θ_0, objective RBM parameters θ_B = θ_1
Ensure: objective RBM's Z_B
1: for i = 1 to M_r do
2:   for β = 0 to 1 do
3:     Generate v_1, v_2, ..., v_τ, ..., v_n using T_τ as follows:
4:     Sample v_1 from P_A = P_0
5:     Sample v_2 given v_1 using T_1
6:     ...
7:     Sample v_τ given v_{τ−1} using T_{τ−1}
8:     ...
9:     Sample v_n given v_{n−1} using T_{n−1}
10:  end for
11:  ω^(i) = [P*_1(v_1)/P*_0(v_1)] [P*_2(v_2)/P*_1(v_2)] ... [P*_τ(v_τ)/P*_{τ−1}(v_τ)] ... [P*_n(v_n)/P*_{n−1}(v_n)]
12: end for
13: r_AIS = (1/M_r) Σ_{i=1}^{M_r} ω^(i)
14: Z_B = Z_A · r_AIS
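Algorithm 2 can be sketched in numpy as follows. The function names are hypothetical and the model is a tiny multi-value RBM with one-hot (m × K) visible states; in the degenerate case W = 0, a = 0 the exact ratio Z_B/Z_A is available in closed form from Eq. 20, which allows a sanity check:

```python
import numpy as np

def log_pstar(v, beta, thA, thB):
    """log of the intermediate unnormalized probability P*_tau(v) in Eq. 16."""
    WA, aA, bA = thA; WB, aB, bB = thB
    la = (1 - beta) * np.sum(bA * v) + np.sum(
        np.log1p(np.exp((1 - beta) * (aA + np.einsum('jik,ik->j', WA, v)))))
    lb = beta * np.sum(bB * v) + np.sum(
        np.log1p(np.exp(beta * (aB + np.einsum('jik,ik->j', WB, v)))))
    return la + lb

def ais_log_ratio(thA, thB, betas, rng):
    """One AIS run: returns log omega^(i) for the ratio Z_B / Z_A (Algorithm 2)."""
    WA, aA, bA = thA; WB, aB, bB = thB
    m, K = bA.shape
    # sample v_1 from the base-rate distribution P_A (independent softmax per unit)
    pv = np.exp(bA - bA.max(axis=1, keepdims=True)); pv /= pv.sum(axis=1, keepdims=True)
    v = np.zeros((m, K))
    for i in range(m):
        v[i, rng.choice(K, p=pv[i])] = 1.0
    logw = 0.0
    for t in range(1, len(betas)):
        # weight update: P*_t(v_t) / P*_{t-1}(v_t)
        logw += log_pstar(v, betas[t], thA, thB) - log_pstar(v, betas[t - 1], thA, thB)
        # transition T_t: Gibbs step through h^A, h^B (Eqs. 17-18) back to v (Eq. 19)
        sA = (1 - betas[t]) * (aA + np.einsum('jik,ik->j', WA, v))
        sB = betas[t] * (aB + np.einsum('jik,ik->j', WB, v))
        hA = (rng.random(len(aA)) < 1 / (1 + np.exp(-sA))).astype(float)
        hB = (rng.random(len(aB)) < 1 / (1 + np.exp(-sB))).astype(float)
        logits = ((1 - betas[t]) * (bA + np.einsum('j,jik->ik', hA, WA))
                  + betas[t] * (bB + np.einsum('j,jik->ik', hB, WB)))
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        v = np.zeros((m, K))
        for i in range(m):
            v[i, rng.choice(K, p=p[i])] = 1.0
    return logw

# sanity check with a closed-form answer: W = 0, a = 0 for both RBMs, so
# Z = 2^n * prod_i sum_k e^{b_ik} (Eq. 20) and the hidden factors cancel in the ratio
rng = np.random.default_rng(0)
m, K, n = 2, 3, 2
thA = (np.zeros((n, m, K)), np.zeros(n), np.zeros((m, K)))
bB = 0.05 * rng.normal(size=(m, K))
thB = (np.zeros((n, m, K)), np.zeros(n), bB)
betas = np.linspace(0.0, 1.0, 51)
logws = [ais_log_ratio(thA, thB, betas, rng) for _ in range(50)]
est = np.log(np.mean(np.exp(logws)))
exact = float(np.sum(np.log(np.exp(bB).sum(axis=1)) - np.log(3.0)))
```

With the nearly identical base and objective models used in the check, the per-run weights stay close to one, so even 50 runs estimate the log-ratio closely.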
IV. EXPERIMENTS
The proposed IT-RBM model can be applied to both posed versus spontaneous expression distinction and expression recognition. To validate it, we therefore conduct experiments on both tasks.
A. Posed and spontaneous expression distinction experiments
1) Experimental Conditions: We use two benchmark databases for the posed and spontaneous expression distinction experiments: the Extended DISFA (DISFA+) database [32] and the SPOS database [33]. The DISFA+ database is composed of 572 posed expression videos and 252 spontaneous expression videos. Disgust, fear, happiness, sadness, and surprise are exhibited by 9 young adults (4 male and 5 female). The SPOS database contains 84 posed expression samples and 150 spontaneous expression samples. It covers the same expression categories as the DISFA+ database, with the addition of anger. Expressions are made by 7 subjects (4 male and 3 female). The data distribution of these databases is shown in Table II.
TABLE II
DATA DISTRIBUTION OF SPOS AND DISFA+

                    SPOS          DISFA+
Expression        P      S       P      S
Anger(An)        14     13       -      -
Disgust(Di)      14     23     163     81
Fear(Fe)         14     32     163     63
Happy(Ha)        14     66      42     18
Sadness(Sa)      14      5     122     54
Surprise(Su)     14     11      82     36
Total            84    150     572    252
We extracted facial feature points from the images to collect the facial events defined in Section III-A. The supervised descent method (SDM) [34] extracts 49 facial feature points for the SPOS database, as seen on the left side of Figure 6. The DISFA+ database provides 68 feature points extracted by the database constructors; we ignore the facial outline and use the interior 49 points, shown on the right side of Figure 6.
Fig. 6. Facial feature points. Left: SPOS, CK+, MMI; right: DISFA+.
We adopt recognition accuracy and F1-score as performance metrics. We use five-fold subject-independent cross-validation on the SPOS database and ten-fold subject-independent cross-validation on the DISFA+ database.
To compare the performance of our method to state-of-the-art research, we conduct expression distinction experiments with five methods. We use our proposed IT-RBM method, which simultaneously captures global spatial patterns and complex temporal patterns. We compare it to the upper layer of the IT-RBM, a multi-value RBM modelling high-order spatial patterns only. The third method is HMM, a popular dynamic model capturing local temporal patterns. The fourth and fifth methods are LSTM and GRU. The first three methods are generative models, while the last two are discriminative models. The displacements of feature points are used as features for all five methods.
For experiments with LSTM and GRU, we adopt Princi-pal Component Analysis (PCA) to further reduce feature
dimension of the landmark displacements of consecutive frames. We then obtain time series of length T as input to the LSTM and GRU. Due to the small data size, both the LSTM and GRU have only one layer of hidden units. The hidden states of the LSTM and GRU are then fed into a fully-connected network to classify expressions. For cross-validation on the DISFA+ and SPOS databases, one fold from the training set is used as a validation set for parameter selection; these conditions are the same as those used for the proposed IT-RBM. A grid search strategy is used for hyperparameter selection. Specifically, for feature dimension reduction by PCA, we obtain a certain number of principal components by setting cumulative variance contribution rates in {0.8, 0.85, 0.9, 0.95, 0.99, 0.999, 1}; for the length of the input time series, T ∈ {5, 10, 20, 30, 40, 50}; for the dimension of the hidden states of the LSTM (GRU), n_h ∈ {5, 10, 15, 20, 25, 30, 35, 40}; for the learning rate, ε ∈ {0.001, 0.005, 0.01, 0.05, 0.1}; and for the mini-batch size, b_n ∈ {20, 40, 60, 80, 100}.
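The exhaustive grid search described above can be sketched as follows; `train_and_eval` is a hypothetical stand-in for training an LSTM/GRU on the training folds and scoring it on the held-out validation fold (it is not defined in the paper):

```python
from itertools import product

# candidate values listed in the text
pca_ratios = [0.8, 0.85, 0.9, 0.95, 0.99, 0.999, 1]
seq_lengths = [5, 10, 20, 30, 40, 50]            # T
hidden_dims = [5, 10, 15, 20, 25, 30, 35, 40]    # n_h
learning_rates = [0.001, 0.005, 0.01, 0.05, 0.1] # epsilon
batch_sizes = [20, 40, 60, 80, 100]              # b_n

def grid_search(train_and_eval):
    """Exhaustively try every hyperparameter combination, keep the best one."""
    best_score, best_cfg = float('-inf'), None
    for cfg in product(pca_ratios, seq_lengths, hidden_dims,
                       learning_rates, batch_sizes):
        score = train_and_eval(*cfg)             # validation accuracy for this setting
        if score > best_score:
            best_score, best_cfg = score, cfg
    return best_cfg, best_score
```

With a dummy scorer that prefers r = 1 and T = 20, the search returns exactly that configuration, confirming the traversal covers the full grid (7 x 6 x 8 x 5 x 5 = 8400 combinations).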
Since the embedded spatial and temporal patterns are affected by many factors, including gender and type of expression, we add two additional models using the proposed IT-RBM. One is referred to as the PS gender model, in which we train four models: one from male posed samples, one from male spontaneous samples, one from female posed samples, and one from female spontaneous samples. The other is called the PS exp model, in which we train a posed model and a spontaneous model for every expression type. We also examine the IT-RBM model trained on all posed and spontaneous samples, denoted as the PS model.
2) Experimental results and analyses: Table III shows the results of our experiments. From Table III, we observe the following:
Firstly, the proposed IT-RBM achieves higher accuracies and F1-scores than the multi-value RBM. The IT-RBM takes both spatial and temporal patterns into account, while the multi-value RBM models spatial patterns only. The better performance of IT-RBM demonstrates the importance of temporal patterns when distinguishing between posed and spontaneous expressions.
Secondly, the proposed IT-RBM achieves higher accuracies and F1-scores than HMM on both databases. As Table III illustrates, the accuracy of the HMM method is lower than that of IT-RBM by 0.1444 on the DISFA+ database and by 0.1069 on the SPOS database; the F1-scores of the HMM method are lower than those of IT-RBM by 0.0636 and 0.0432, respectively. HMM is a popular dynamic model, but it can only handle three temporal relationships - precedes, follows, or equals. It is also limited to capturing local stationary dynamics because of its first-order Markov and stationary-transition assumptions. The proposed model uses interval algebra to depict 13 complex interval relations, and can model global rather than only local temporal relations. This results in improved distinction between posed and spontaneous expressions.
Thirdly, IT-RBM is superior to LSTM and GRU. Specifically, compared to LSTM, IT-RBM increases the distinction accuracies by 0.0133 and 0.2650 and the F1-scores by 0.0504 and 0.1251 on the DISFA+ and SPOS databases, respectively. Compared to GRU, IT-RBM increases the accuracies by 0.0048 and 0.2522, and the F1-scores by 0.0004 and 0.0851, respectively. Although LSTM and GRU are state-of-the-art discriminative dynamic models, they are still time-slice models. Therefore, they cannot represent the global and complex temporal relations between primitive facial events inherent in facial expressions, as IT-RBM does. Furthermore, recurrent neural networks require larger amounts of data than these databases possess to achieve their best performance.
Figure 7 is a graphic depiction of primitive event pairs and their corresponding temporal relations from the DISFA+ database. Figure 7 (a) displays the 40 selected pairs of events. Points around the eyebrows, eyelids, and lips have the most links, as these areas are crucial for expressions. Just as Ekman et al.'s research [1] [4] showed, the most telling muscles when distinguishing between expressions are the orbicularis oculi and the zygomatic major. Our findings are consistent with that observation.
Figure 7 (b-1) and Figure 7 (b-2) illustrate the temporal relations between points 20 and 29 for posed and spontaneous expressions, respectively. Figure 7 (c-1) and Figure 7 (c-2) are histograms displaying the frequencies of the 13 relations between feature points 20 and 29 for the two expression types. They show that for posed expressions, relations 4 and 12 occur more frequently than relations 3 and 11; the inverse is true for spontaneous expressions. For relations 3 and 11, ts20 − ts29 > 0, which means that event v29 starts before event v20, while for relations 4 and 12, ts20 − ts29 < 0, meaning that event v20 starts before v29, as shown in Table I. This indicates that in a posed expression v20 starts before v29 in most cases, while in a genuine expression v29 is likely to start before v20. Since points 20 and 29 represent the right eye and the left eye respectively, we can conclude that a posed expression is more likely to begin on the right side of the face, while a genuine expression commences on the left side. This corroborates the findings of Ross and Pulusu [5].
Table IV shows the experimental results of the PS model, the PS gender model, and the PS exp model. We make the following observations. First, the PS gender model performs better than the PS model on both databases, with higher accuracies and F1-scores. Specifically, compared to the PS model, the PS gender model increases recognition accuracies by 0.0134 and 0.0085 and F1-scores by 0.0152 and 0.0028 on the DISFA+ and SPOS databases, respectively. This indicates that gender information, available only during training, is useful for capturing the innate spatial and temporal patterns of posed and spontaneous expressions for different genders, and thus improves the distinction task.
Figure 8 graphically depicts the average weights of the different movement patterns to further analyze these spatial and temporal patterns. The x-axis represents the 20 movement patterns on the DISFA+ and SPOS databases; the y-axis represents the average value of the weight w_jik for the kth movement pattern.
TABLE III
RESULTS OF POSED AND SPONTANEOUS DISTINCTION EXPERIMENTS

Database                    DISFA+                                      SPOS
Method       HMM     LSTM    GRU     RBM*    IT-RBM    HMM     LSTM    GRU     RBM*    IT-RBM
Accuracy   0.8046  0.9357  0.9442  0.9211  0.9490    0.7222  0.5641  0.5769  0.7735  0.8291
F1-score   0.8768  0.8900  0.9400  0.9095  0.9404    0.7619  0.6800  0.7200  0.7427  0.8051
* RBM is the proposed multi-value RBM
Fig. 7. (a) Graphical depiction of the temporal relations selected in DISFA+. (b) Examples of the relation between points 20 and 29 for a posed and a spontaneous example. (c) Frequencies of the thirteen relations between points 20 and 29 with respect to posed and genuine expressions; the x-axis represents the index of the relationships.
Fig. 8. The mean weight of the PS gender models for the different movement patterns over all facial points and hidden nodes: (a) DISFA+ PS gender (posed and spontaneous, male vs. female); (b) SPOS PS gender (posed and spontaneous, male vs. female). The x-axis indexes the K movement patterns; the y-axis represents the mean value of w_jik at each k.
TABLE IV
POSED AND SPONTANEOUS DISTINCTION

Database             DISFA+                          SPOS
Model          PS     PS gender  PS exp       PS     PS gender  PS exp
Accuracy    0.9490     0.9624    0.9515     0.8291     0.8376    0.8333
F1-score    0.9404     0.9556    0.9420     0.8051     0.8079    0.8093
The brown bars represent the weights of the PS male model, and the blue bars the weights of the PS female model. From Figure 8, we find that for some movement patterns the weights of the male and female models are either both positive or both negative, while for other movement patterns the weight signs of the male and female models are opposite. This confirms that females and males may display different spatial and temporal patterns. Therefore, the gender information available during training is beneficial for capturing more specific and precise spatial and temporal patterns embedded in posed and spontaneous expressions, and results in better distinction between posed and spontaneous expressions.
Table IV shows that the PS exp model also performs better than the PS model, achieving superior accuracies and F1-scores in most cases. Specifically, compared to the PS model, the accuracy of the PS exp model is 0.0025 higher and the F1-score 0.0016 higher on the DISFA+ database; on the SPOS database, the accuracy improves by 0.0042 and the F1-score by 0.0042. This demonstrates that the expression information, available only during training, is helpful for capturing the inherent spatial and temporal patterns, and thus improves the performance of posed and spontaneous expression distinction.

Fig. 9. On the DISFA+ database, the mean weight of the PS exp models at every selected facial point for a certain movement state, over all hidden nodes of the trained IT-RBMs: (a) posed happy, (b) spontaneous happy, (c) posed surprise, (d) spontaneous surprise, (e) posed sad, (f) spontaneous sad, (g) posed fear, (h) spontaneous fear, (i) posed disgust, (j) spontaneous disgust. The z-axis represents the mean value of w_jik at each k for every facial point.
To analyze the effect of expression type when modeling spatial and temporal patterns, we graphically depict the average weight of the hidden nodes at every selected facial feature point for a certain movement pattern. Figure 9 shows an example of this on the DISFA+ database. Comparing the bar graphs in the left column to those in the right, we find significant differences between the weight distributions of posed and spontaneous expressions. This indicates that the spatial patterns of posed and spontaneous expressions are clearly different. In addition, the W values differ significantly based on expression type; sadness is a good example. Most of the weights of the spontaneous model are negative, while most of the weights of the posed model are positive, indicating that fewer facial events are observed in the spontaneous expression. Namba's [6] research shows similar results, noting that some morphological properties are not observed in spontaneous facial expressions. One possible reason is that the video clip used in this study is too short to elicit visible expressions of sadness from the viewer. This explanation is supported by Ekman et al., who posit that the nature of sadness necessitates a longer-term or more personal experience [35]. For the spontaneous disgust expression, the weights on the lips are positive, while not all of the lip weights are positive for the posed disgust expression. Namba's [6] research showed that AU10 and AU12 were more frequently present in spontaneous disgust. Our finding is consistent with Namba's work [6], corroborating that there are expression-dependent and posed- or spontaneous-dependent differences in AUs. Adding expression information can therefore more precisely depict the detailed patterns inherent in posed and spontaneous expressions.
TABLE V
COMPARISON WITH RELATED WORK ON POSED AND SPONTANEOUS EXPRESSION DISTINCTION ON SPOS

Method                     Accuracy
Cohn et al. [7]             0.7250
Dibeklioglu et al. [9]      0.7875
Dibeklioglu et al. [36]     0.7500
Wu et al. [37]              0.7950
Wu et al. [38]              0.8125
Wang et al. [13]            0.7479
Wang et al. [14]            0.7607
Quan et al. [16]            0.7607
IT-RBM                      0.8291
3) Comparison with related work: For the task of distinguishing posed versus spontaneous expressions, we compare our method to both model-based and feature-driven methods. Most recent model-based methods for expression distinction conducted experiments on the NVIE and SPOS databases. As the NVIE database only provides onset and apex frames for posed expressions, it is not a viable option for the proposed IT-RBM. Instead, we use the SPOS database to compare the performance of the proposed method to current methods, as shown in Table V. The DISFA+ database is relatively new, having opened to researchers in June 2016. To date, no experiments on posed versus spontaneous expression distinction have been performed on it; therefore, we are unable to compare our work to others on this database.
From Table V, we find that state-of-the-art model-based methods do not perform as well as the suggested IT-RBM.
Compared to Wang et al.'s work [14] and Quan et al.'s work [16], which model spatial patterns only, the IT-RBM is able to more fully represent posed and spontaneous expressions by jointly modeling spatial and temporal patterns. This results in superior performance. The proposed method also outperforms Wu et al.'s work [38], the highest-performing feature-driven method. Wu et al. [38] proposed a region-specific texture descriptor that represents local pattern changes in different areas of the face. The temporal phase of each facial region was determined by calculating the intensity of the corresponding facial region. They then used a mid-level SVM fusion strategy to combine the two feature types. By defining discriminative features, their method models the innate spatial and temporal patterns to a certain extent. However, it does not take full advantage of the embedded spatial and temporal patterns as the IT-RBM does via its parameters and structure. Hence, the proposed method achieves superior performance.
B. Experiments and Analyses of Expression Recognition
1) Experimental conditions: Expression recognition experiments are conducted on the extended Cohn-Kanade (CK+) database [39][40] and the MMI database [41]. The CK+ database is composed of 327 posed expression samples collected from 118 subjects. It includes seven expression categories: anger, contempt, disgust, fear, happiness, sadness, and surprise. The image sequences in this database begin at the onset frame and end with the apex frame. Therefore, only three temporal relations, i.e., before, at the same time, and after, exist in the image sequences. The MMI database is updated continuously. As of April 2017, there were 236 sequences labeled with expressions; 208 of those sequences showed the front of the face. We use these 208 image sequences from 31 subjects. There are six expression categories: anger, disgust, fear, happiness, sadness, and surprise. Table VI shows the data distribution of the two databases. SDM is used to extract the 49 facial feature points shown on the left side of Figure 6 [34].
Recognition accuracy is used as the performance metric. We adopt five-fold subject-independent cross-validation on the CK+ database and ten-fold subject-independent cross-validation on the MMI database.
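Subject-independent cross-validation means samples from the same subject never appear in both the training and test folds. A minimal sketch of such a fold assignment, with hypothetical subject IDs standing in for the database annotations:

```python
def subject_independent_folds(subject_ids, n_folds):
    """Assign each sample a fold so that all samples of a subject share one fold.

    subject_ids: one subject label per sample.
    Returns a list with the fold index of each sample.
    """
    subjects = sorted(set(subject_ids))
    fold_of_subject = {s: i % n_folds for i, s in enumerate(subjects)}
    return [fold_of_subject[s] for s in subject_ids]

subjects = ['s1', 's1', 's2', 's3', 's3', 's4', 's5']   # one entry per sample
folds = subject_independent_folds(subjects, n_folds=5)
# all samples of a given subject land in the same fold
assert len({f for s, f in zip(subjects, folds) if s == 's1'}) == 1
```

In practice one would balance the number of samples per fold as well; the round-robin assignment here only illustrates the subject-disjointness constraint.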
As with the posed and spontaneous expression distinction experiments, we conduct expression recognition experiments using five methods: IT-RBM, multi-value RBM, HMM, LSTM, and GRU. For the experiments using HMM, the experimental results listed in [26] are used directly. For the experiments using LSTM and GRU, the same network structure and hyperparameter selection strategy as in the posed and spontaneous expression distinction experiments are used.
2) Experimental results and analyses: The results of our expression recognition experiments are found in Table VII. From Table VII, we make the following observations:
Firstly, compared to HMM [26], the accuracy of IT-RBM is higher by 0.0366 on the CK+ database and by 0.3071 on the MMI database. As described in Section II-B, HMM is a time-slice graphical model and can only capture three time-point
TABLE VI
DATA DISTRIBUTION OF CK+ AND MMI

Expression       CK+    MMI
Anger(An)         45     33
Contempt(Co)      18      -
Disgust(Di)       59     32
Fear(Fe)          25     28
Happy(Ha)         69     42
Sadness(Sa)       28     32
Surprise(Su)      83     41
Fig. 10. Graphical depiction of the selected event pairs in the MMI database.
Fig. 11. Frequencies of the thirteen relations among a pair of events with respect to the different expressions (anger, disgust, fear, happy, sadness, surprise) in MMI. The x-axis represents the index of the relationships.
relations. IT-RBM can not only capture the 13 complex temporal relations defined by Allen's interval algebra, but also capture the complex spatial patterns in facial behavior. Thus IT-RBM achieves better performance than HMM.
Secondly, the proposed IT-RBM outperforms the multi-value RBM, with accuracies higher by 0.0612 on the CK+ database and by 0.0481 on the MMI database. The IT-RBM captures not only global spatial relations but also the temporal patterns embedded in different expressions, while the multi-value RBM can only model the inherent spatial patterns. The IT-RBM leverages these additional temporal patterns for improved expression recognition.
Lastly, the suggested method outperforms both LSTM and GRU. Compared to LSTM, IT-RBM increases the recognition accuracy by 0.0184 on the CK+ database and by 0.2452 on the MMI database. Compared to GRU, IT-RBM increases the accuracy by 0.0092 and 0.2259 on the CK+
TABLE VII
RESULTS OF EXPRESSION CATEGORY RECOGNITION EXPERIMENTS

CK+ (confusion matrices in %; rows are ground-truth classes)

RBM*       An     Co     Di     Fe     Ha     Sa     Su     Acc: 0.8104
An       84.44   4.44   8.89   0.00   0.00   2.22   0.00
Co       11.11  83.33   0.00   5.56   0.00   0.00   0.00
Di       13.56   0.00  76.27   1.69   0.00   5.08   3.39
Fe        0.00   4.00   4.00  68.00  16.00   8.00   0.00
Ha        0.00   1.45   0.00   1.45  97.10   0.00   0.00
Sa       25.00   0.00  17.86   0.00   0.00  57.14   0.00
Su        2.41   1.20   9.64   1.20   2.41   2.41  80.72

IT-RBM     An     Co     Di     Fe     Ha     Sa     Su     Acc: 0.8716
An       91.11   8.89   0.00   0.00   0.00   0.00   0.00
Co        5.56  94.44   0.00   0.00   0.00   0.00   0.00
Di        6.78   0.00  86.44   1.69   0.00   1.69   3.39
Fe        0.00   4.00   0.00  72.00  16.00   8.00   0.00
Ha        0.00   1.45   0.00   1.45  97.10   0.00   0.00
Sa       17.86   0.00   3.57   0.00   0.00  78.57   0.00
Su        2.41   1.20   8.43   1.20   1.20   2.41  83.13

LSTM       An     Co     Di     Fe     Ha     Sa     Su     Acc: 0.8532
An       84.44   2.22   6.67   0.00   0.00   6.67   0.00
Co        5.56  44.44  16.67   0.00   5.56  11.11  16.67
Di        5.08   1.70  81.36   1.69   1.69   1.69   6.78
Fe        0.00   4.00   0.00  84.00   8.00   0.00   4.00
Ha        0.00   1.45   0.00   4.35  94.20   0.00   0.00
Sa       10.71   3.57   3.57   3.57   0.00  78.57   0.00
Su        1.20   3.61   2.41   0.00   0.00   0.00  92.78

GRU        An     Co     Di     Fe     Ha     Sa     Su     Acc: 0.8624
An       82.22   2.22   6.67   0.00   0.00   8.89   0.00
Co        0.00  50.00   0.00   0.00   0.00  11.11  38.89
Di        5.08   5.08  84.74   0.00   0.00   0.00   5.08
Fe        0.00   8.00   0.00  76.00   8.00   0.00   8.00
Ha        0.00   0.00   0.00   2.90  97.10   0.00   0.00
Sa       10.71   7.14   0.00   0.00   0.00  82.14   0.00
Su        0.00   4.82   0.00   0.00   1.20   1.20  92.78

HMM [26]: 0.835    ITBN [26]: 0.863    Elaiwat et al. [43]: 0.9566    Sariyanidi et al. [42]: 0.9602

MMI (confusion matrices in %; rows are ground-truth classes)

RBM*       An     Di     Fe     Ha     Sa     Su     Acc: 0.7740
An       84.85  12.12   0.00   0.00   3.03   0.00
Di        9.38  68.75   3.13   9.38   0.00   9.38
Fe        0.00   3.57  71.43  10.71   3.57  10.71
Ha        2.38   0.00   4.76  83.33   9.52   0.00
Sa        9.38   0.00   3.13   3.13  81.25   3.13
Su        9.76   4.88   7.32   2.44   2.44  73.17

IT-RBM     An     Di     Fe     Ha     Sa     Su     Acc: 0.8221
An       90.91   9.09   0.00   0.00   0.00   0.00
Di        6.25  81.25   0.00   3.13   0.00   9.38
Fe        0.00   3.57  75.00  10.71   3.57   7.14
Ha        2.38   0.00   7.14  83.30   7.14   0.00
Sa        9.38   0.00   0.00   3.13  84.38   3.13
Su        9.76   4.88   2.44   2.44   2.44  78.05

LSTM       An     Di     Fe     Ha     Sa     Su     Acc: 0.5769
An       54.55  21.21   3.03   3.03  12.12   6.06
Di       28.13  40.63   3.13  18.75   0.00   9.38
Fe        3.57  10.71  28.57  14.29  14.29  28.57
Ha        0.00   9.52   9.52  73.81   4.76   2.38
Sa       21.88   9.38  25.00   3.13  40.63   0.00
Su        4.88   4.88  12.20   4.88   9.76  63.41

GRU        An     Di     Fe     Ha     Sa     Su     Acc: 0.5962
An       54.55  21.21   3.03   3.03  12.12   6.06
Di       28.13  40.63   3.13  18.75   0.00   9.38
Fe        3.57  10.71  28.57  14.29  14.29  28.57
Ha        0.00   9.52   9.52  73.81   4.76   2.38
Sa       21.88   9.38  25.00   3.13  40.63   0.00
Su        4.88   4.88  12.20   4.88   9.76  63.41

HMM [26]: 0.515    ITBN [26]: 0.597    Elaiwat et al. [43]: 0.8163    Sariyanidi et al. [42]: 0.7512
and the MMI databases, respectively. This further proves the superiority of our proposed IT-RBM in capturing and leveraging the complex spatial and temporal patterns inherent in expressions for expression recognition.
To demonstrate the validity of the IT-RBM for expression category recognition, Figure 10 graphically depicts all 30 selected event pairs in the MMI database. Figure 10 shows that the selected facial points involve all components of the face. This is reasonable, since there are six expressions in the database and different expressions are related to different facial muscles. Unlike Figure 7(a), in which most links appear on the left side of the face, the distribution of links in Figure 10 is more homogeneous. This may further indicate that the spatial-temporal patterns in posed and spontaneous expressions are not symmetrical between the left and right sides of the face, while the spatial-temporal patterns in different emotion expressions are symmetrical.
Figure 11 shows the frequencies of the 13 relations between feature point 6 and feature point 32. From Figure 11, we find that the frequencies of the 13 interval relations vary greatly across expressions. This indicates that the selected temporal relations provide discriminative information for expression recognition.
3) Comparison with related work: To illustrate the superiority of the proposed IT-RBM, we compare it with the most related work (i.e., ITBN [26]) and state-of-the-art feature-based methods [43], [42]. From Table VII, we have the following findings:
Firstly, compared with ITBN, IT-RBM achieves better performance on both the CK+ and MMI databases. As described in Section II-B, although both ITBN and IT-RBM capture the complex relations defined by Allen's interval algebra, ITBN uses a Bayesian network to model local spatial patterns, while IT-RBM uses an RBM to capture the global spatial patterns inherent in facial behavior. IT-RBM is therefore more successful at capturing spatial patterns than ITBN; its ability to capture the complex spatio-temporal patterns inherent in facial behavior contributes to its superior performance.
Secondly, compared with state-of-the-art feature-based methods, the proposed method achieves the best performance on the MMI database but the worst performance on the CK+ database. On the CK+ database, sequences begin at neutral and conclude at the peak frame. The image sequences encompass only the first half of the expression change, which limits the temporal patterns to just three relations: A precedes B, B precedes A, and A and B commence simultaneously. Therefore, IT-RBM's ability to capture complex temporal patterns cannot be fully exploited, and it performs poorly on the CK+ database. The MMI database captures the whole course of the expression change; IT-RBM can therefore capture the full temporal and spatial patterns in the facial behavior, and it achieves the best performance on the MMI database. Compared with HMM, IT-RBM obtains a moderate improvement on the CK+ database but a significant improvement on the MMI database. This is because both IT-RBM and HMM can capture only three temporal relations on the CK+ database, whereas IT-RBM captures more complex temporal patterns than
HMM on the MMI database. This further demonstrates the importance of capturing the complex temporal relations defined by Allen's interval algebra, and the superiority of the proposed method.
V. CONCLUSION
In this paper, a novel dynamic model called IT-RBM is proposed to jointly capture and leverage embedded global spatial patterns and complex temporal patterns for improved expression analysis. A facial expression is defined as a complex activity made up of sequential or temporally overlapping primitive facial events, which can further be delineated as the motion of feature points. Allen's interval algebra is used to represent the complex temporal patterns via a two-layer Bayesian network, in which the upper-layer nodes represent primitive facial events, the bottom-layer nodes are the temporal relations between facial events, and the links between the two layers capture the temporal dependencies among primitive facial events. We also propose a multi-value RBM to capture and utilize the intrinsic global spatial patterns among facial events. The visible nodes of the restricted Boltzmann machine are facial events, and the connections between hidden and visible nodes model the spatial patterns inherent in expressions. In the training phase, an efficient learning algorithm simultaneously learns the spatial and temporal patterns by maximizing the log likelihood. In the testing phase, samples are classified according to the IT-RBM with the largest likelihood. We also propose an efficient inference algorithm that extends annealed importance sampling to the IT-RBM to calculate the partition function of the multi-value RBM. The results of our experiments on both expression recognition and posed versus spontaneous expression distinction demonstrate that the proposed method captures the intrinsic facial spatial-temporal patterns, leading to superior performance compared to state-of-the-art works.
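As a concrete illustration of the multi-value RBM component described above, the following numpy sketch computes the free energy of an RBM whose visible units are categorical (multi-value); lower free energy means higher unnormalized probability. The dimensions, parameterization, and names are our illustrative assumptions, not the authors' implementation:

```python
import numpy as np

# Sketch of a multi-value RBM: n_vis categorical visible units (one of
# n_states states each) connected to n_hid binary hidden units.
# Free energy of a visible configuration v:
#   F(v) = -sum_i b[i, v_i] - sum_j log(1 + exp(c_j + sum_i W[i, v_i, j]))
rng = np.random.default_rng(0)
n_vis, n_states, n_hid = 6, 3, 4        # e.g. 6 facial events, 3 motion states
W = rng.normal(scale=0.1, size=(n_vis, n_states, n_hid))  # vis-hid weights
b = rng.normal(scale=0.1, size=(n_vis, n_states))         # visible biases
c = np.zeros(n_hid)                                       # hidden biases

def free_energy(v):
    """v: integer state index per visible unit, shape (n_vis,)."""
    idx = np.arange(n_vis)
    vis_term = b[idx, v].sum()
    hid_input = c + W[idx, v].sum(axis=0)       # total input to each hidden unit
    return -vis_term - np.logaddexp(0.0, hid_input).sum()
```

Comparing the free energies of two configurations scores which is more probable under the model without ever computing the partition function; the partition function itself is only needed for classification by likelihood, which is where annealed importance sampling enters.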
ACKNOWLEDGMENT
This work has been supported by the National Key R&D Program of China (2018YFB1307102) and the National Science Foundation of China (917418129).
REFERENCES
[1] P. Ekman and W. V. Friesen, "Felt, false, and miserable smiles," Journal of Nonverbal Behavior, vol. 6, no. 4, pp. 238–252, 1982.
[2] P. Ekman, "Darwin, deception, and facial expression," Annals of the New York Academy of Sciences, vol. 1000, no. 1, pp. 205–221, 2003.
[3] K. L. Schmidt, S. Bhattacharya, and R. Denlinger, "Comparison of deliberate and spontaneous facial movement in smiles and eyebrow raises," Journal of Nonverbal Behavior, vol. 33, no. 1, pp. 35–45, 2009.
[4] P. Ekman, J. C. Hager, and W. V. Friesen, "The symmetry of emotional and deliberate facial actions," Psychophysiology, vol. 18, no. 2, pp. 101–106, 1981.
[5] E. D. Ross and V. K. Pulusu, "Posed versus spontaneous facial expressions are modulated by opposite cerebral hemispheres," Cortex, vol. 49, no. 5, pp. 1280–1291, 2013.
[6] S. Namba, S. Makihara, R. S. Kabir, M. Miyatani, and T. Nakao, "Spontaneous facial expressions are different from posed facial expressions: Morphological properties and dynamic sequences," Current Psychology, pp. 1–13, 2016.
[7] J. F. Cohn and K. L. Schmidt, "The timing of facial motion in posed and spontaneous smiles," International Journal of Wavelets, Multiresolution and Information Processing, vol. 2, no. 02, pp. 121–132, 2004.
[8] M. F. Valstar, M. Pantic, Z. Ambadar, and J. F. Cohn, "Spontaneous vs. posed facial behavior: Automatic analysis of brow actions," in Proceedings of the 8th International Conference on Multimodal Interfaces. ACM, 2006, pp. 162–170.
[9] H. Dibeklioglu, A. A. Salah, and T. Gevers, "Recognition of genuine smiles," IEEE Transactions on Multimedia, vol. 17, no. 3, pp. 279–294, 2015.
[10] M. Seckington, "Using dynamic Bayesian networks for posed versus spontaneous facial expression recognition," Master Thesis, Department of Computer Science, Delft University of Technology, 2011.
[11] G. C. Littlewort, M. S. Bartlett, and K. Lee, "Automatic coding of facial expressions displayed during posed and genuine pain," Image and Vision Computing, vol. 27, no. 12, pp. 1797–1803, 2009.
[12] H. Dibeklioglu, R. Valenti, A. A. Salah, and T. Gevers, "Eyes do not lie: Spontaneous versus posed smiles," in Proceedings of the 18th ACM International Conference on Multimedia. ACM, 2010, pp. 703–706.
[13] S. Wang, C. Wu, M. He, J. Wang, and Q. Ji, "Posed and spontaneous expression recognition through modeling their spatial patterns," Machine Vision and Applications, vol. 26, no. 2-3, pp. 219–231, 2015.
[14] S. Wang, C. Wu, and Q. Ji, "Capturing global spatial patterns for distinguishing posed and spontaneous expressions," Computer Vision and Image Understanding, vol. 147, pp. 69–76, 2016.
[15] G. Hinton, "A practical guide to training restricted Boltzmann machines," Momentum, vol. 9, no. 1, p. 926, 2010.
[16] Q. Gan, S. Nie, S. Wang, and Q. Ji, "Differentiating between posed and spontaneous expressions with latent regression Bayesian network," in AAAI, 2017, pp. 4039–4045.
[17] C. A. Corneanu, M. O. Simon, J. F. Cohn, and S. E. Guerrero, "Survey on RGB, 3D, thermal, and multimodal approaches for facial expression recognition: History, trends, and affect-related applications," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 8, pp. 1548–1568, 2016.
[18] B. Martinez, M. F. Valstar, B. Jiang, and M. Pantic, "Automatic analysis of facial actions: A survey," IEEE Transactions on Affective Computing, 2017.
[19] G. Zhao and M. Pietikainen, "Dynamic texture recognition using local binary patterns with an application to facial expressions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 6, pp. 915–928, 2007.
[20] T. Wu, M. S. Bartlett, and J. R. Movellan, "Facial expression recognition using Gabor motion energy filters," in Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on. IEEE, 2010, pp. 42–47.
[21] I. Cohen, N. Sebe, A. Garg, L. S. Chen, and T. S. Huang, "Facial expression recognition from video sequences: Temporal and static modeling," Computer Vision and Image Understanding, vol. 91, no. 1, pp. 160–187, 2003.
[22] L. Shang and K.-P. Chan, "Nonparametric discriminant HMM and application to facial expression recognition," in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 2090–2096.
[23] M. F. Valstar and M. Pantic, "Fully automatic recognition of the temporal phases of facial actions," IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 42, no. 1, pp. 28–43, 2012.
[24] R. El Kaliouby and P. Robinson, "Real-time inference of complex mental states from facial expressions and head gestures," in Real-Time Vision for Human-Computer Interaction. Springer, 2005, pp. 181–200.
[25] P. Rodriguez, G. Cucurull, J. Gonzalez, J. M. Gonfaus, K. Nasrollahi, T. B. Moeslund, and F. X. Roca, "Deep pain: Exploiting long short-term memory networks for facial expression classification," IEEE Transactions on Cybernetics, 2017.
[26] Z. Wang, S. Wang, and Q. Ji, "Capturing complex spatio-temporal relations among facial muscles for facial expression recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3422–3429.
[27] J. Yang and S. Wang, "Capturing spatial and temporal patterns for distinguishing between posed and spontaneous expressions," in Proceedings of the 2017 ACM on Multimedia Conference. ACM, 2017, pp. 469–477.
[28] J. F. Allen, "Maintaining knowledge about temporal intervals," Communications of the ACM, vol. 26, no. 11, pp. 832–843, 1983.
[29] J. M. Joyce, "Kullback-Leibler divergence," in International Encyclopedia of Statistical Science. Springer, 2011, pp. 720–722.
[30] G. E. Hinton, "Training products of experts by minimizing contrastive divergence," Neural Computation, vol. 14, no. 8, pp. 1771–1800, 2002.
[31] R. Salakhutdinov and I. Murray, "On the quantitative analysis of deep belief networks," in Proceedings of the 25th International Conference on Machine Learning. ACM, 2008, pp. 872–879.
[32] M. Mavadati, P. Sanger, and M. H. Mahoor, "Extended DISFA dataset: Investigating posed and spontaneous facial expressions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2016, pp. 1–8.
[33] T. Pfister, X. Li, G. Zhao, and M. Pietikainen, "Differentiating spontaneous from posed facial expressions within a generic facial expression recognition framework," in Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on. IEEE, 2011, pp. 868–875.
[34] X. Xiong and F. De la Torre, "Supervised descent method and its applications to face alignment," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 532–539.
[35] P. Ekman, "Emotions revealed," St. Martin's Griffin, New York, 2003.
[36] H. Dibeklioglu, A. Salah, and T. Gevers, "Are you really smiling at me? Spontaneous versus posed enjoyment smiles," Computer Vision–ECCV 2012, pp. 525–538, 2012.
[37] P. Wu, H. Liu, and X. Zhang, "Spontaneous versus posed smile recognition using discriminative local spatial-temporal descriptors," in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 1240–1244.
[38] P. Wu, H. Liu, X. Zhang, and Y. Gao, "Spontaneous versus posed smile recognition via region-specific texture descriptor and geometric facial dynamics," Frontiers, vol. 1, 2016.
[39] T. Kanade, J. F. Cohn, and Y. Tian, "Comprehensive database for facial expression analysis," in Automatic Face and Gesture Recognition, 2000. Proceedings. Fourth IEEE International Conference on. IEEE, 2000, pp. 46–53.
[40] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews, "The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression," in Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on. IEEE, 2010, pp. 94–101.
[41] M. Valstar and M. Pantic, "Induced disgust, happiness and surprise: An addition to the MMI facial expression database," in Proc. 3rd Intern. Workshop on EMOTION (satellite of LREC): Corpora for Research on Emotion and Affect, 2010, p. 65.
[42] E. Sariyanidi, H. Gunes, and A. Cavallaro, "Learning bases of activity for facial expression recognition," IEEE Transactions on Image Processing, 2017.
[43] S. Elaiwat, M. Bennamoun, and F. Boussaid, "A spatio-temporal RBM-based model for facial expression recognition," Pattern Recognition, vol. 49, pp. 152–161, 2016.
Shangfei Wang received her BS in Electronic Engineering from Anhui University, Hefei, Anhui, China, in 1996. She received her MS in circuits and systems and her PhD in signal and information processing from the University of Science and Technology of China (USTC), Hefei, Anhui, China, in 1999 and 2002, respectively. From 2004 to 2005, she was a postdoctoral research fellow at Kyushu University, Japan. Between 2011 and 2012, Dr. Wang was a visiting scholar at Rensselaer Polytechnic Institute in Troy, NY, USA. She is currently an Associate Professor in the School of Computer Science and Technology, USTC. Her research interests cover affective computing and probabilistic graphical models. She has authored or co-authored over 90 publications. She is a senior member of the IEEE and a member of the ACM.
Zhuangqiang Zheng received his BS in mathematics from Liaoning Technical University in 2017, and he is currently pursuing his MS in Computer Science at the University of Science and Technology of China, Hefei, China. His research interest is affective computing.
Shi Yin received his BS in Automation from Central South University in 2016, and he is now a PhD student majoring in Computer Science and Technology at the University of Science and Technology of China, Hefei, China. His research interest is affective computing.
Jiajia Yang received her BS in Software Engineering from Dalian Maritime University in 2015, and she is currently pursuing her MS in Computer Science at the University of Science and Technology of China, Hefei, China. Her research interest is affective computing.
Qiang Ji received the PhD degree in electrical engineering from the University of Washington. He is currently a professor with the Department of Electrical, Computer, and Systems Engineering, Rensselaer Polytechnic Institute (RPI). He recently served as a program director at the National Science Foundation (NSF), where he managed NSF's computer vision and machine learning programs. He also held teaching and research positions with the Beckman Institute at the University of Illinois at Urbana-Champaign, the Robotics Institute at Carnegie Mellon University, the Department of Computer Science at the University of Nevada at Reno, and the US Air Force Research Laboratory. He currently serves as the director of the Intelligent Systems Laboratory (ISL) at RPI. His research interests are in computer vision, probabilistic graphical models, information fusion, and their applications in various fields. He has published more than 160 papers in peer-reviewed journals and conferences. His research has been supported by major governmental agencies including NSF, NIH, DARPA, ONR, ARO, and AFOSR, as well as by major companies including Honda and Boeing. He is an editor of several related IEEE and international journals, and he has served as a general chair, program chair, technical area chair, and program committee member for numerous international conferences and workshops. He is a fellow of the IAPR and the IEEE.
Deep Structured Prediction for Facial Landmark Detection

Lisha Chen, Hui Su, and Qiang Ji
Rensselaer Polytechnic Institute
Abstract
Existing deep learning based facial landmark detection methods have achieved excellent performance. These methods, however, do not explicitly embed the structural dependencies among landmark points. They hence cannot preserve the geometric relationships between landmark points or generalize well to challenging conditions or unseen data. This paper proposes a method for deep structured facial landmark detection based on combining a deep Convolutional Network with a Conditional Random Field. We demonstrate its superior performance to existing state-of-the-art techniques in facial landmark detection, especially a better generalization ability on challenging datasets that include large pose and occlusion.
1 Introduction
Facial landmark detection aims to automatically localize the fiducial facial landmark points around facial components and the facial contour. It is essential for various facial analysis tasks such as facial expression analysis, head pose estimation, and face recognition. With the development of deep learning techniques, traditional facial landmark detection approaches that rely on hand-crafted low-level features have been outperformed by deep feature based approaches. Purely deep learning based methods, however, cannot effectively capture the structural dependencies among landmark points. They hence cannot perform well under challenging conditions, such as large head pose, occlusion, and large expression variation. Probabilistic graphical models, such as Conditional Random Fields (CRFs), have been widely applied to various computer vision tasks. They can systematically capture the structural relationships among random variables and perform structured prediction. Recently, there have been works that combine deep models (e.g., CNNs) with CRFs to simultaneously leverage CNNs' representation power and CRFs' structure modeling power [11, 10, 53]. Their combination has yielded significant performance improvement over methods that use either a CNN or a CRF alone. These works have so far mainly been applied to classification tasks such as semantic image segmentation. Besides classification, there are also works that apply the CNN-CRF model to human pose estimation and facial landmark detection [44, 14, 13]. To reduce computational complexity, their CRF model is typically of special structure (e.g., a tree) and, moreover, they employ approximate learning and inference criteria. In this work, we propose to combine a CNN with a fully-connected CRF to jointly perform facial landmark detection in a regression framework using exact learning and inference methods.
Compared to the existing works, the contributions of our work are summarized as follows:

1) We introduce the fully-connected CNN-CRF that produces structured probabilistic predictions of facial landmark locations.

2) Our model explicitly captures the structural relationship variations caused by pose and deformation, unlike some previous works that combine a CNN with a CRF using a fixed pairwise relationship represented by a convolution kernel.
3) We derive closed-form solutions for learning and inference given the deformation parameters, unlike previous works that use approximate methods such as energy minimization ignoring the partition function for learning and mean-field for inference. And instead of a discriminative criterion or another approximated loss function, we employ the exact negative log likelihood loss function, without any assumption.

4) Experiments on benchmark face alignment datasets demonstrate the advantages of the proposed method in achieving better prediction accuracy and generalization to challenging or unseen data than current state-of-the-art (SoA) models.

Submitted to the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019).
2 Related Work

2.1 Facial Landmark Detection
Classic facial landmark detection methods, including the Active Shape Model (ASM) [16, 30], Active Appearance Model (AAM) [15, 26, 29, 40], Constrained Local Model (CLM) [27, 41], and Cascade Regression [9, 6, 55, 7, 48], rely on hand-crafted shallow image features and are usually sensitive to initialization. They are outperformed by modern deep learning based methods.
Deep learning based face alignment was first proposed in [42] and achieved better performance than classic methods. This purely deep appearance based approach uses a deep cascade convolutional network and coordinate regression in each cascade level. Later on, more work using a purely deep appearance based framework for coordinate regression was explored. The Tasks-Constrained Deep Convolutional Network (TCDCN) [52] was proposed to jointly optimize facial landmark detection with correlated tasks such as head pose estimation and facial attribute inference. The Mnemonic Descent Method (MDM) [45], an end-to-end trainable deep convolutional Recurrent Neural Network (RNN), was proposed, where the cascade regression was implemented by an RNN. Recently, heatmap learning based methods established a new state of the art for face alignment and body pose estimation [44, 32, 46], and most of these face alignment methods [5, 47] follow the architecture of the Stacked Hourglass [32]. The stacked modules refine the network predictions after each stack. Different from direct coordinate regression, this architecture predicts a heatmap with the same size as the input image. The landmark location is predicted either in a regression framework by the coordinate on the heatmap with the largest response [44, 32, 4, 12, 8] or in a classification framework by segmenting the heatmap pixels into different facial parts [34, 22, 36]. Hybrid deep methods combine deep models with face shape models. One strategy is to directly predict 3D deformable parameters instead of landmark locations in a cascaded deep regression framework, e.g., 3D Dense Face Alignment (3DDFA) [56] and Pose-Invariant Face Alignment (PIFA) [25]. Another strategy is to use the deformable model as a constraint to limit the face shape search space and thereby refine the predictions from the appearance features, e.g., the Convolutional Experts Constrained Local Model (CE-CLM) [50].
As deep learning techniques have developed, more works take advantage of expressive deep features and combine them with graphical models to produce structured predictions. Early work like [33] jointly trains a CNN and a graphical model for image segmentation. Do et al. [18] introduced the NeuralCRF for sequence labeling, and various works have explored other tasks: for instance, Jain et al. [24] and Eigen et al. [19] for image restoration, Yao et al. and Morin et al. [49, 31] for language understanding, and Bengio et al., Peng et al., and Jaderberg et al. [3, 35, 23] for handwriting or text recognition. Recently, for human body pose estimation, Chen et al. [11] use a CNN to output image-dependent part presence as the unary term and spatial relationships as the pairwise potential in a tree-structured CRF, and use dynamic programming for inference. Tompson et al. [44, 43] jointly trained a CNN and a fully-connected MRF by using a convolution kernel to capture pairwise relationships among different body joints and an iterative convolution process to implement belief propagation. The idea of using convolution to implement message passing has also been explored in [14], where structural relationships are captured at the body joint feature level, rather than the output level, in a bi-directional tree-structured model. The work of Chu et al. [14] has been applied to face alignment [47] to pass messages between facial part boundary feature maps. As an extension to [14], [13] models structures in both the output and hidden feature layers of a CNN. Similarly, for image segmentation, DeepLab [10] uses a fully-connected CRF with binary cliques and mean-field inference, and [28] uses efficient piecewise training to avoid repeated inference during training. In [53], the CRF mean-field inference is implemented by an RNN, and the network is end-to-end trainable by directly optimizing the performance of the mean-field inference. Using an RNN to implement message passing has also been applied to facial action unit recognition [17]. In [21], the MRF deformable part model is implemented as a layer in a CNN.
Comparison. Compared to previous models serving a similar purpose, such as [14, 13, 47], which assume a tree-structured model with belief propagation as the inference method, we use a fully-connected model. With a fully-connected model, we do not need to specify a particular tree structure; the model learns the strong or weak relationships from data, so the method is more generalizable to different tasks. Moreover, the works [44, 14, 13, 47, 53] use convolution to implement the pairwise term and the message passing process. The pairwise term, once trained, is independent of the input image, and thus cannot capture the pairwise constraint variations across different conditions, such as target object rotation and object shape. In contrast, we explicitly capture the object pose and deformation variations. They also employ approximate methods, such as energy minimization ignoring the partition function for learning and mean-field for inference; in this paper we show that we can perform exact learning and inference. Lastly, compared to traditional CRF models [37, 38], the weights for the unary terms in our model are also outputs of the neural network, whose inverses quantify the heteroscedastic aleatoric uncertainty of the unary predictions.
3 Method

This section presents the proposed structured deep probabilistic facial landmark detection model. In this model, the joint probability distribution of facial landmark locations and deformable parameters is captured by a conditional random field model.
3.1 Model definition

Denote the face image as x and the 2D facial landmark locations as y, where each landmark is y_i, i = 1, ..., N. The deformable model parameters that capture pose, identity, and expression variation are denoted as ζ. The model parameters we want to learn are denoted as Θ. Assuming ζ is marginally dependent on x but conditionally independent of x given y, the graphical model is shown in Fig. 1.
Figure 1: Overview of the graphical model. Dashed lines represent dependencies between each pair of landmarks, dotted lines represent dependencies between landmarks and deformable parameters, and solid lines represent dependencies between landmarks and the face image.
Based on this definition and assumption, the joint distribution of the landmarks y and deformable parameters ζ conditioned on the face image x can be formulated in a CRF framework and written as

$$p_\Theta(\mathbf{y}, \zeta \mid \mathbf{x}) = \frac{1}{Z_\Theta(\mathbf{x})} \exp\Big\{ -\sum_{i=1}^{N} \phi_{\theta_1}(\mathbf{y}_i \mid \mathbf{x}) - \sum_{i=1}^{N} \sum_{j=i+1}^{N} \psi_{C_{ij}}(\mathbf{y}_i, \mathbf{y}_j, \zeta) \Big\} \quad (1)$$

where Θ = [θ_1, C_ij], θ_1 is the neural network parameter, C_ij is a 2 × 2 symmetric positive definite matrix that captures the spatial relationships between a pair of landmark points, and Z_Θ(x) is the partition function. φ_{θ_1}(y_i | x) is the unary potential function with parameter θ_1, and ψ_{C_ij}(y_i, y_j, ζ) is the triple-wise potential function with parameter C_ij.
3.2 Potential functions

We define the unary and triple-wise potentials in Eq. (2) and Eq. (3), respectively:

$$\phi_{\theta_1}(\mathbf{y}_i \mid \mathbf{x}) = \frac{1}{2} [\mathbf{y}_i - \mu_i(\mathbf{x}, \theta_1)]^T \Sigma_i^{-1}(\mathbf{x}, \theta_1) [\mathbf{y}_i - \mu_i(\mathbf{x}, \theta_1)] \quad (2)$$

$$\psi_{C_{ij}}(\mathbf{y}_i, \mathbf{y}_j, \zeta) = [\mathbf{y}_i - \mathbf{y}_j - \mu_{ij}(\zeta)]^T C_{ij} [\mathbf{y}_i - \mathbf{y}_j - \mu_{ij}(\zeta)] \quad (3)$$
where μ_i(x, θ_1) and Σ_i(x, θ_1) are the outputs of the CNN that represent the mean and covariance matrix of each landmark given the image x, and μ_ij(ζ) represents the mean difference between two landmarks. It is fully determined by the 3D deformable face shape parameters ζ, which contain the rigid parameters, rotation R and scale S, and the non-rigid parameters q:

$$\begin{bmatrix} \mu_{ij}(\zeta) \\ 1 \end{bmatrix} = \frac{1}{\lambda} S R \big( \mathbf{y}^{3d}_i + \Phi_i \mathbf{q} - \mathbf{y}^{3d}_j - \Phi_j \mathbf{q} \big) \quad (4)$$
where y^{3d} is the 3D mean face shape and Φ contains the bases of the deformable model; both are learned from data. The deformable parameters ζ = [S, R, q] are jointly estimated with the 2D landmark locations during inference. In this work, we assume a weak perspective projection model. S is a 3 × 3 diagonal matrix that contains 2 independent parameters s_x, s_y as the scaling factors (encoding the camera intrinsic parameters) for the column and row directions, respectively, while R is a 3 × 3 orthonormal matrix with 3 independent parameters γ_1, γ_2, γ_3 as the pitch, yaw, and roll rotation angles. Note that the translation vector is canceled by taking the difference of two landmark points.
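The projection in Eq. (4) can be sketched numerically as follows. This is an illustrative implementation under the stated weak-perspective assumption; the Euler-angle convention, the composition order S·R, and the omission of the homogeneous normalization by λ are our simplifying assumptions, not details given by the paper:

```python
import numpy as np

# Sketch of Eq. (4): the expected 2D offset between landmarks i and j under a
# weak-perspective model. R is built from pitch/yaw/roll; S holds scales sx, sy.
def euler_to_R(pitch, yaw, roll):
    cx, sx_ = np.cos(pitch), np.sin(pitch)
    cy, sy_ = np.cos(yaw), np.sin(yaw)
    cz, sz_ = np.cos(roll), np.sin(roll)
    Rx = np.array([[1, 0, 0], [0, cx, -sx_], [0, sx_, cx]])
    Ry = np.array([[cy, 0, sy_], [0, 1, 0], [-sy_, 0, cy]])
    Rz = np.array([[cz, -sz_, 0], [sz_, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx          # orthonormal 3x3 rotation

def mu_ij(y3d_i, y3d_j, Phi_i, Phi_j, q, sx, sy, angles):
    d3 = (y3d_i + Phi_i @ q) - (y3d_j + Phi_j @ q)  # 3D offset after deformation
    S = np.diag([sx, sy, 1.0])
    p = S @ euler_to_R(*angles) @ d3
    return p[:2]                 # weak perspective: keep x, y; drop depth
```

With zero rotation, unit scales, and q = 0, mu_ij reduces to the x-y offset of the mean 3D shape, which matches the intuition that the pairwise term anchors landmark offsets to a pose-adjusted face shape.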
3.3 Learning and Inference

We propose to implement the conditional probability distribution in Eq. (1) with a CNN-CRF model. As shown in Fig. 2, the CNN with parameter θ_1 outputs the mean μ_i(x, θ_1) and covariance matrix Σ_i(x, θ_1) for each facial landmark y_i, which together form the unary potential function φ_{θ_1}(y_i | x). A fully-connected (FC) graph with parameters C_ij gives the triple-wise potential ψ_{C_ij}(y_i, y_j, ζ); given ζ as well as the output from the unary part, the FC graph can output E(x, ζ, Θ) and Λ_p(x, ζ, Θ), the mean and precision matrix of the conditional distribution p_Θ(y | ζ, x). The FC graph can be implemented as another layer following the CNN. Combining the unary and the triple-wise potentials, we obtain the joint distribution p_Θ(y, ζ | x). However, directly inferring y*, ζ* from p_Θ(y, ζ | x) is difficult; we therefore iteratively infer from the conditional distributions p_Θ(y | ζ, x) and p_Θ(ζ | y).
Figure 2: Overall flowchart of the proposed CNN-CRF model.
Mean and Precision matrix
During learning and inference, we need to compute the conditional probability p_Θ(y | ζ, x). With the quadratic unary and triple-wise potential functions, p_Θ(y | ζ, x) is a multivariate Gaussian distribution that can be written as

$$p_\Theta(\mathbf{y} \mid \zeta, \mathbf{x}) = \frac{1}{Z'_\Theta(\mathbf{x})} \exp\Big\{ -\sum_{i=1}^{N} \phi_{\theta_1}(\mathbf{y}_i \mid \mathbf{x}) - \sum_{i=1}^{N} \sum_{j=i+1}^{N} \psi_{C_{ij}}(\mathbf{y}_i, \mathbf{y}_j, \zeta) \Big\}$$
$$= \exp\Big\{ \frac{1}{2} \ln |\Lambda_p(\mathbf{x}, \Theta, \zeta)| - \frac{1}{2} [\mathbf{y} - E(\mathbf{x}, \Theta, \zeta)]^T \Lambda_p(\mathbf{x}, \Theta, \zeta) [\mathbf{y} - E(\mathbf{x}, \Theta, \zeta)] \Big\} \quad (5)$$
where E(x, Θ, ζ) and Λ_p(x, Θ, ζ) are the mean and precision matrix of the multivariate Gaussian distribution. They are computed exactly during learning and inference. The mean E can be computed by solving the linear system of equations Λ_p E = b, where Λ_p, the precision matrix, is a symmetric positive definite matrix that can be directly computed from the coefficients in the unary and pairwise terms, as shown in Eq. (6), from which b can also be computed.
$$\Lambda_p = \begin{bmatrix} \Lambda_{p_{11}} & \cdots & \Lambda_{p_{1N}} \\ \vdots & \ddots & \vdots \\ \Lambda_{p_{N1}} & \cdots & \Lambda_{p_{NN}} \end{bmatrix}, \quad \begin{cases} \Lambda_{p_{ii}} = \Sigma_i^{-1} + \sum_{j \neq i} C_{ij} \\ \Lambda_{p_{ij}} = -C_{ij} \end{cases}, \qquad \mathbf{b} = \begin{bmatrix} \mathbf{b}_1 \\ \mathbf{b}_2 \\ \vdots \\ \mathbf{b}_N \end{bmatrix}, \quad \mathbf{b}_i = \Sigma_i^{-1} \mu_i + \sum_{j \neq i} C_{ij} \mu_{ij} \quad (6)$$
From Eq. (6) we can see that the final inference result E_i is a linear combination of µ_i and µ_j + µ_ij, j ∈ {1, . . . , N}, j ≠ i. To solve this linear system of equations, we use a direct method that yields an exact solution, with a fast implementation based on Cholesky factorization requiring O(N³) FLOPs. For a practical implementation of the determinant that avoids numerical issues, we again use the Cholesky factorization LLᵀ = Λp and compute the log-determinant as ln |Λp| = 2 ∑ ln diag(L), where diag(·) takes the diagonal elements of a matrix.
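As a concrete sketch of this step (our own construction under stated assumptions, not the authors' released code), the assembly of Λp and b from Eq. (6) and the Cholesky-based solve and log-determinant can be written as follows; the function name `gaussian_crf_mean_logdet` and the 2-D array layout are illustrative choices, and C is assumed symmetric so that Λp is symmetric positive definite:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gaussian_crf_mean_logdet(Sigma, mu, C, mu_off):
    """Assemble Lambda_p and b as in Eq. (6), then solve Lambda_p E = b.

    Sigma:  (N, 2, 2) per-landmark covariances from the CNN heatmaps
    mu:     (N, 2)    per-landmark means
    C:      (N, N, 2, 2) coefficients C_ij (assumed symmetric; C[i, i] unused)
    mu_off: (N, N, 2) offsets mu_ij
    Returns the Gaussian mean E, shape (N, 2), and ln|Lambda_p|.
    """
    N = mu.shape[0]
    Lam = np.zeros((2 * N, 2 * N))
    b = np.zeros(2 * N)
    for i in range(N):
        Si_inv = np.linalg.inv(Sigma[i])
        diag_blk = Si_inv.copy()        # Lambda_p,ii = Sigma_i^-1 + sum_j C_ij
        b_i = Si_inv @ mu[i]            # b_i = Sigma_i^-1 mu_i + sum_j C_ij mu_ij
        for j in range(N):
            if j == i:
                continue
            diag_blk += C[i, j]
            b_i += C[i, j] @ mu_off[i, j]
            Lam[2 * i:2 * i + 2, 2 * j:2 * j + 2] = -C[i, j]  # Lambda_p,ij = -C_ij
        Lam[2 * i:2 * i + 2, 2 * i:2 * i + 2] = diag_blk
        b[2 * i:2 * i + 2] = b_i
    # Direct, exact solve via Cholesky (O(N^3) FLOPs), and the numerically
    # stable log-determinant ln|Lambda_p| = 2 * sum(ln diag(L)), L L^T = Lambda_p.
    L, lower = cho_factor(Lam, lower=True)
    E = cho_solve((L, lower), b).reshape(N, 2)
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    return E, logdet
```

The Cholesky factor is reused for both the solve and the log-determinant, which is what makes the exact computation practical at this scale.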
Learning
During learning, our goal is to optimize Θ given training data D = {x_m, y_m, m = 1, . . . , M}. However, since our training data does not contain ground truth for ζ_m, we jointly optimize Θ and ζ_m, m = 1, . . . , M. We therefore define the learning problem as
$$
\Theta^*, \zeta^* = \arg\min_{\Theta, \zeta} -\sum_{m=1}^{M} \ln p_\Theta(\mathbf{y}_m, \zeta_m \mid \mathbf{x}_m)
\tag{7}
$$
where ζ = {ζ_1, . . . , ζ_M}. We use an alternating method: based on the current Θ^t, we optimize ζ by
$$
\zeta_m^{t+1} = \arg\min_{\zeta_m} -\ln p_{\Theta^t}(\mathbf{y}_m, \zeta_m \mid \mathbf{x}_m)
= \arg\min_{\zeta_m} \sum_{i=1}^{N} \sum_{j=i+1}^{N} \psi_{C_{ij}^t}(\mathbf{y}_{mi}, \mathbf{y}_{mj}, \zeta_m)
\tag{8}
$$
Then, based on the current ζ^t, we optimize Θ by
$$
\begin{aligned}
\Theta^{t+1} &= \arg\min_{\Theta} -\sum_{m=1}^{M} \ln p_\Theta(\mathbf{y}_m, \zeta_m^t \mid \mathbf{x}_m)
= \arg\min_{\Theta} -\sum_{m=1}^{M} \ln p_\Theta(\mathbf{y}_m \mid \zeta_m^t, \mathbf{x}_m) \\
&= \arg\min_{\Theta} \sum_{m=1}^{M} -\tfrac{1}{2} \ln |\Lambda_p(\mathbf{x}_m, \Theta, \zeta_m^t)|
+ \tfrac{1}{2} [\mathbf{y}_m - E(\mathbf{x}_m, \Theta, \zeta_m^t)]^T \Lambda_p(\mathbf{x}_m, \Theta, \zeta_m^t) [\mathbf{y}_m - E(\mathbf{x}_m, \Theta, \zeta_m^t)]
\end{aligned}
\tag{9}
$$

The algorithm for this problem first initializes C_ij and optimizes ζ, then alternately fixes a subset of the parameters in Θ and optimizes the others; its pseudo code is shown in Algorithm 1.
Algorithm 1: Learning CNN-CRF
Input: training data {x_m, y_m, m = 1, . . . , M}
Initialization: parameters Θ^0 = {θ_1^0 = randn, C_ij^0 = I}, t = 0
while not converged do
    Stage 1: fix the parameters Θ; optimize ζ by Eq. (8); t = t + 1
    Stage 2: fix ζ = ζ^t and C_ij = C_ij^t; update θ_1 using Eq. (9) by back propagation:
        while not converged do
            θ_1^{t+1} = θ_1^t − η_1^t ∂Loss/∂θ_1; t = t + 1
        end
        [ζ^t, C_ij^t] = [ζ, C_ij]
    Stage 3: fix ζ = ζ^t and θ_1 = θ_1^t; update C_ij using Eq. (9) by back propagation:
        while not converged do
            C_ij^{t+1} = C_ij^t − η_2^t ∂Loss/∂C_ij; t = t + 1
        end
        [ζ^t, θ_1^t] = [ζ, θ_1]
end
Inference
The inference problem is a joint inference of ζ and y for each input face image x, and is defined as
$$
\mathbf{y}^*, \zeta^* = \arg\max_{\mathbf{y}, \zeta} \ln p_\Theta(\mathbf{y}, \zeta \mid \mathbf{x})
\tag{10}
$$
We use an alternating method. Based on the current y^t, we optimize ζ^t by
$$
\zeta^t = \arg\max_{\zeta} \ln p_\Theta(\mathbf{y}^t, \zeta \mid \mathbf{x})
= \arg\min_{\zeta} \sum_{i=1}^{N} \sum_{j=i+1}^{N} \psi_{C_{ij}}(\mathbf{y}_i^t, \mathbf{y}_j^t, \zeta)
\tag{11}
$$
Then, based on the current ζ^t, we optimize y^{t+1} by
$$
\mathbf{y}^{t+1} = \arg\max_{\mathbf{y}} \ln p_\Theta(\mathbf{y}, \zeta^t \mid \mathbf{x})
= \arg\max_{\mathbf{y}} \ln p_\Theta(\mathbf{y} \mid \zeta^t, \mathbf{x})
= E(\mathbf{x}, \Theta, \zeta^t)
\tag{12}
$$
The algorithm is shown in Algorithm 2.
Algorithm 2: Inference for CNN-CRF
Input: one face image x
Initialization: y_i^0 = µ_i, i = 1, . . . , N; t = 0
while not converged do
    Update ζ by Eq. (11): ζ^t = arg min_ζ Σ_{i=1}^{N} Σ_{j=i+1}^{N} ψ_{C_ij}(y_i^t, y_j^t, ζ)
    Update y by Eq. (12): y^{t+1} = E(x, Θ, ζ^t)
    t = t + 1
end
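The alternation in Algorithm 2 can be illustrated with a toy instance (our own construction, not the authors' implementation): two scalar landmarks with unit unary precision, a pairwise potential ψ(y1, y2, ζ) = 0.5·c·(y2 − y1 − ζ)², and a hypothetical deformation variable ζ picked from a small candidate set standing in for the deformable-model fit:

```python
import numpy as np

# Toy instance of Algorithm 2: unary means mu, coupling c = C_12, and a
# candidate set for the illustrative deformation variable zeta.
mu = np.array([0.0, 2.2])
c = 4.0
zeta_candidates = np.array([0.0, 1.0, 2.0, 3.0])

y = mu.copy()                                   # initialization: y_i^0 = mu_i
for _ in range(10):
    # zeta-step (Eq. 11): minimize the pairwise energy at the current y
    zeta = zeta_candidates[np.argmin((y[1] - y[0] - zeta_candidates) ** 2)]
    # y-step (Eq. 12): conditional Gaussian mean E = Lambda_p^{-1} b, with
    # Lambda_p and b assembled as in Eq. (6) (here mu_12 = -zeta, mu_21 = +zeta)
    Lam = np.array([[1 + c, -c], [-c, 1 + c]])
    b = np.array([mu[0] - c * zeta, mu[1] + c * zeta])
    y = np.linalg.solve(Lam, b)
```

Starting from y = µ, the ζ-step snaps to the candidate 2.0 and the y-step pulls the landmark pair toward that offset; the iteration reaches a fixed point after one round.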
4 Experiments
Datasets. We evaluate our methods on popular facial landmark detection benchmarks, including 300W [39], Menpo [51], COFW [6], and 300VW [1].
300W has 68-landmark annotations. We first train on the 300W-LP dataset [56], which augments the original 300W dataset with large yaw poses, and then fine-tune on the original dataset (3,837 faces). Testing is performed on the 300W test set, which contains 600 images.
Menpo contains images from AFLW and FDDB re-annotated following the 68-landmark scheme. It has two subsets: frontal, with 68-landmark annotations for near-frontal faces (6,679 samples), and profile, with 39-landmark annotations for profile faces (2,300 samples). We use it as a test set for cross-dataset evaluation.
COFW has 1,345 training samples and 507 testing samples, all of which are partially occluded facial images. The original dataset is annotated with 29 landmarks. We also use the COFW-68 test set [20], which provides 68-landmark re-annotations, for cross-dataset evaluation.
300VW is a facial video dataset with 68-landmark annotations. It contains three scenarios: 1) constrained laboratory and naturalistic well-lit conditions; 2) unconstrained real-world conditions with varying illumination, dark rooms, overexposed shots, etc.; and 3) completely unconstrained arbitrary conditions, including various illumination, occlusions, make-up, expressions, head poses, etc.
Evaluation metrics. We evaluate our algorithm using the standard normalized mean error (NME) and the cumulative error distribution (CED) curve. In addition, the area under the curve (AUC) and the failure rate (FR) for a maximum error of 0.07 are reported. As in [5], the NME is defined as the average point-to-point Euclidean distance between the ground-truth (y_gt) and predicted (y_pred) landmark locations, normalized by the ground-truth bounding-box size d = √(w_bbox · h_bbox):

$$
\mathrm{NME} = \frac{1}{N} \sum_{i=1}^{N} \frac{\| \mathbf{y}^{(i)}_{pred} - \mathbf{y}^{(i)}_{gt} \|_2}{d}, \qquad d = \sqrt{w_{bbox} \cdot h_{bbox}}
$$

Based on the NMEs over a test dataset, we draw a CED curve with NME on the horizontal axis and the percentage of test images on the vertical axis. The AUC is then computed as the area under that curve for each test dataset.
Implementation details. For a fair comparison with the state-of-the-art purely deep-learning-based method [5], we use the same training and testing procedure for 2D landmark detection. The 3D deformable model was trained on the 300W training set. For the CNN, we use one stack of Hourglass with the same structure as [5], followed by a softmax layer that outputs a probability map for each facial landmark. From the probability map we compute the mean µ_i and covariance Σ_i. We also use an additional softmax cross-entropy loss to assist training, which empirically gives better performance.
Training procedure: The initial learning rate η_1 is 10⁻⁴ for 15 epochs with a minibatch size of 10; it is then dropped to 10⁻⁵ and 10⁻⁶ after every further 15 epochs, and training continues until convergence. The learning rate η_2 is set to 10⁻³. We apply random augmentations such as random cropping, rotation, etc.
Testing procedure: We follow the same testing procedure as [5]. The face is cropped using the ground-truth bounding box defined in 300W, and the crop is rescaled to 256 × 256 before being passed to the network. Since the Menpo-profile dataset uses a different annotation scheme, we evaluate on the 26 overlapping points, i.e., we remove all points on the face contour and on each eyebrow except the two end points, and remove the 5th point on the nose contour.
4.1 Comparison with existing approaches
We compare with state-of-the-art facial landmark detection algorithms, including purely deep-learning-based methods such as TCDCN [52] and FAN [5], as well as hybrid methods such as CLNF [2] and CE-CLM [50]. These methods are evaluated using the code provided by the authors under the same experimental protocol, i.e., the same bounding boxes and evaluation metrics. The results on the 300W test set are shown in Table 1, and the corresponding CED curves in Fig. 3a.
Table 1: 300W test set prediction results (%)

                 300W-test-indoor      300W-test-outdoor     300W-test-all
Method           NME   AUC   FR        NME   AUC   FR        NME   AUC   FR
TCDCN [52]       4.16  42.3  5.33      4.14  41.8  4.33      4.15  42.1  4.83
CFSS [54]        3.19  56.1  2.33      2.98  57.4  1.33      3.09  56.7  1.83
CLNF [2]         4.38  47.2  7.67      4.06  48.1  5.67      4.22  47.6  6.67
CE-CLM [50]      3.02  57.5  2.67      3.07  56.3  2.00      3.05  56.9  2.33
FAN [5]          2.52  63.6  1.00      2.50  64.0  0.00      2.51  63.8  0.50
proposed         2.34  67.0  1.00      2.25  67.5  0.00      2.28  67.2  0.50
Figure 3: CED curves on different datasets (better viewed in color and magnified): (a) 300W test set; (b) Menpo-frontal dataset; (c) Menpo-profile dataset; (d) COFW-68 test set. Each panel plots NME (%) on the horizontal axis against the proportion of images (%) on the vertical axis for TCDCN, CFSS, CLNF, CE-CLM, FAN, and the proposed method.
Cross-dataset Evaluation
Besides the 300W test set, we evaluate the proposed method on the Menpo dataset, the COFW-68 test set, and the 300VW test set for cross-dataset evaluation. The results are shown in Table 2 for Menpo and COFW-68 and in Table 3 for 300VW, with the corresponding CED curves in Figs. 3b, 3c, and 3d. The method is trained on 300W-LP and fine-tuned on the 300W Challenge training set for 68 landmarks. Compared to the 300W test set and the Menpo-frontal dataset, where state-of-the-art methods attain saturating performance as noted in [5], under more challenging cross-dataset conditions such as COFW with heavy occlusion and Menpo-profile with large poses, the proposed method shows better generalization ability with a significant performance improvement. In addition, the proposed method achieves the smallest failure rate (FR) on all evaluated datasets.
Table 2: Cross-dataset prediction results on the Menpo dataset and COFW-68 test set (%)

                 Menpo-frontal         Menpo-profile          COFW-68 test
Method           NME   AUC   FR        NME    AUC   FR        NME   AUC   FR
TCDCN [52]       4.04  46.2  5.84      15.79  4.1   83.35     4.71  35.8  8.68
CFSS [54]        3.91  57.4  9.75      16.96  10.4  71.35     3.79  49.0  4.34
CLNF [2]         3.74  55.4  5.82      10.63  20.5  46.09     4.75  42.9  10.65
CE-CLM [50]      2.78  63.3  1.66      7.40   33.4  29.83     3.36  52.4  2.37
FAN [5]          2.34  66.3  0.33      6.12   42.4  27.30     2.99  57.0  0.00
proposed         2.35  66.2  0.18      4.75   48.0  24.30     2.73  60.6  0.00
Table 3: 300VW test set prediction results for cross-dataset evaluation (%)

                 300VW-category1       300VW-category2       300VW-category3
Method           NME   AUC   FR        NME   AUC   FR        NME   AUC   FR
TCDCN [52]       3.49  51.2  1.74      3.80  45.8  1.76      4.45  43.8  8.85
CFSS [54]        2.44  67.0  1.66      2.49  64.3  0.77      3.26  60.5  5.18
FAN [5]          2.10  71.0  0.33      2.21  68.1  0.07      2.93  63.7  2.77
proposed         2.11  70.5  0.29      2.09  69.8  0.05      2.59  66.8  2.03
4.2 Analysis
In this section, we report the results of a sensitivity analysis and an ablation study. Unless otherwise specified, the analysis is performed on the test datasets with models trained on 300W-LP and fine-tuned on the 300W training set.
Sensitivity to challenging conditions. We evaluate different methods under challenging conditions caused by high noise, low resolution, or different initializations in Fig. 4. In general, the proposed CNN-CRF model is more robust under challenging conditions than a pure CNN model with the same structure, i.e., the CNN-CRF model with C_ij = 0.
Figure 4: Prediction error sensitivity to challenging conditions: (a) noise; (b) lower resolution; (c) different initialization.
Ablation study. The improvement of the proposed method lies in two aspects. On the one hand, the proposed softmax + Gaussian NLL loss gives better results empirically. On the other hand, joint training of the CNN-CRF model, assisted by the deformable model, captures structured relationships with pose and deformation awareness. To analyze these effects, Table 4 reports the accuracy of a plain CNN prediction, the 3D deformable model fitting, and the joint CNN-CRF prediction.
Table 4: Ablation study on the 300W test set (%)

Method                                                       NME   AUC   FR
Plain CNN with softmax cross-entropy loss                    2.72  58.6  1.00
Plain CNN with softmax + Gaussian NLL loss (proposed loss)   2.52  62.3  1.00
Separately trained CNN and CRF with proposed loss            2.44  66.0  0.50
3D deformable model fitting                                  1.39  79.8  0.00
Jointly trained CNN-CRF with proposed loss (proposed method) 2.28  67.2  0.50
5 Conclusion
In this paper, we propose a method that combines a CNN with a fully-connected CRF model for facial landmark detection. Compared to state-of-the-art purely deep-learning-based methods, our method explicitly captures the structural relationships between facial landmark locations. Compared to previous methods that combine CNNs with CRFs for human body pose estimation, which learn a fixed pairwise relationship representation shared across test samples and implemented by convolution, our method captures the variations in structural relationships caused by pose and deformation. Moreover, we use a fully-connected model instead of a tree-structured model, obtaining better representational ability. Lastly, compared to previous methods that perform approximate learning (e.g., omitting the partition function) and approximate inference (e.g., mean-field methods), we perform exact learning and inference. Experiments on benchmark datasets demonstrate that the proposed method outperforms existing state-of-the-art methods, particularly under challenging conditions, in both within-dataset and cross-dataset evaluations.
References
[1] 300VW dataset. http://ibug.doc.ic.ac.uk/resources/300-VW/, 2015.

[2] Tadas Baltrušaitis, Peter Robinson, and Louis-Philippe Morency. Continuous conditional neural fields for structured regression. In David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors, Computer Vision – ECCV 2014, pages 593–608, Cham, 2014. Springer International Publishing.

[3] Yoshua Bengio, Yann LeCun, and Donnie Henderson. Globally trained handwritten word recognizer using spatial representation, convolutional neural networks, and hidden Markov models. In J. D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems 6, pages 937–944. Morgan-Kaufmann, 1994.

[4] Adrian Bulat and Georgios Tzimiropoulos. Human pose estimation via convolutional part heatmap regression. In ECCV, 2016.

[5] Adrian Bulat and Georgios Tzimiropoulos. How far are we from solving the 2D & 3D face alignment problem? (And a dataset of 230,000 3D facial landmarks). In International Conference on Computer Vision, 2017.

[6] Xavier P. Burgos-Artizzu, Pietro Perona, and Piotr Dollár. Robust face landmark estimation under occlusion. In Proceedings of the 2013 IEEE International Conference on Computer Vision, ICCV '13, pages 1513–1520, Washington, DC, USA, 2013. IEEE Computer Society.

[7] Xudong Cao, Yichen Wei, Fang Wen, and Jian Sun. Face alignment by explicit shape regression. International Journal of Computer Vision, 107(2):177–190, Apr. 2014.

[8] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2D pose estimation using part affinity fields. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1302–1310, 2017.

[9] Dong Chen, Shaoqing Ren, Yichen Wei, Xudong Cao, and Jian Sun. Joint cascade face detection and alignment. In David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors, Computer Vision – ECCV 2014, pages 109–122, Cham, 2014. Springer International Publishing.

[10] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, Apr. 2018.

[11] Xianjie Chen and Alan Yuille. Articulated pose estimation by a graphical model with image dependent pairwise relations. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1, NIPS'14, pages 1736–1744, Cambridge, MA, USA, 2014. MIT Press.

[12] Yu Chen, Chunhua Shen, Xiu-Shen Wei, Lingqiao Liu, and Jian Yang. Adversarial PoseNet: A structure-aware convolutional network for human pose estimation. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 1221–1230, 2017.

[13] Xiao Chu, Wanli Ouyang, Hongsheng Li, and Xiaogang Wang. CRF-CNN: Modeling structured information in human pose estimation. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 316–324. Curran Associates, Inc., 2016.

[14] Xiao Chu, Wanli Ouyang, Hongsheng Li, and Xiaogang Wang. Structured feature learning for pose estimation. In CVPR, 2016.

[15] T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearance models. In Hans Burkhardt and Bernd Neumann, editors, Computer Vision – ECCV '98, pages 484–498, Berlin, Heidelberg, 1998. Springer Berlin Heidelberg.

[16] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham. Active shape models - their training and application. Computer Vision and Image Understanding, 61(1):38–59, 1995.

[17] Ciprian A. Corneanu, Meysam Madadi, and Sergio Escalera. Deep structure inference network for facial action unit recognition. In ECCV, 2018.

[18] Trinh-Minh-Tri Do and Thierry Artieres. Neural conditional random fields. In Yee Whye Teh and Mike Titterington, editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 177–184, Chia Laguna Resort, Sardinia, Italy, May 2010. PMLR.

[19] David Eigen, Dilip Krishnan, and Rob Fergus. Restoring an image taken through a window covered with dirt or rain. In Proceedings of the 2013 IEEE International Conference on Computer Vision, ICCV 2013, pages 633–640. IEEE, 2013.

[20] Golnaz Ghiasi and Charless C. Fowlkes. Occlusion coherence: Detecting and localizing occluded faces. CoRR, abs/1506.08347, 2015.

[21] R. Girshick, F. Iandola, T. Darrell, and J. Malik. Deformable part models are convolutional neural networks. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 437–446, June 2015.

[22] Eldar Insafutdinov, Leonid Pishchulin, Bjoern Andres, Mykhaylo Andriluka, and Bernt Schiele. DeeperCut: A deeper, stronger, and faster multi-person pose estimation model. In ECCV, 2016.

[23] Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep structured output learning for unconstrained text recognition. Dec. 2014.

[24] V. Jain, J. F. Murray, F. Roth, S. Turaga, V. Zhigulin, K. L. Briggman, M. N. Helmstaedter, W. Denk, and H. S. Seung. Supervised learning of image restoration with convolutional networks. In 2007 IEEE 11th International Conference on Computer Vision, pages 1–8, Oct. 2007.

[25] Amin Jourabloo and Xiaoming Liu. Pose-invariant face alignment via CNN-based dense 3D model fitting. Int. J. Comput. Vision, 124(2):187–203, Sept. 2017.

[26] F. Kahraman, G. Muhitin, S. Darkner, and R. Larsen. An active illumination and appearance model for face alignment. Turkish Journal of Electrical Engineering and Computer Science, 18(4):677–692, 2010.

[27] Neeraj Kumar, Peter N. Belhumeur, and Shree K. Nayar. FaceTracer: A search engine for large collections of images with faces. In The 10th European Conference on Computer Vision (ECCV), Oct. 2008.

[28] G. Lin, C. Shen, A. Hengel, and I. Reid. Efficient piecewise training of deep structured models for semantic segmentation. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3194–3203, Los Alamitos, CA, USA, June 2016. IEEE Computer Society.

[29] Iain Matthews and Simon Baker. Active appearance models revisited. International Journal of Computer Vision, 60(2):135–164, Nov. 2004.

[30] Stephen Milborrow and Fred Nicolls. Locating facial features with an extended active shape model. In Proceedings of the 10th European Conference on Computer Vision: Part IV, ECCV '08, pages 504–513, Berlin, Heidelberg, 2008. Springer-Verlag.

[31] Frederic Morin and Yoshua Bengio. Hierarchical probabilistic neural network language model. In Robert G. Cowell and Zoubin Ghahramani, editors, Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, pages 246–252. Society for Artificial Intelligence and Statistics, 2005.

[32] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In Computer Vision – ECCV 2016, 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part VIII, pages 483–499, 2016.

[33] Feng Ning, D. Delhomme, Y. LeCun, F. Piano, L. Bottou, and P. E. Barbano. Toward automatic phenotyping of developing embryos from videos. Trans. Img. Proc., 14(9):1360–1371, Sept. 2005.

[34] George Papandreou, Tyler Zhu, Nori Kanazawa, Alexander Toshev, Jonathan Tompson, Christoph Bregler, and Kevin P. Murphy. Towards accurate multi-person pose estimation in the wild. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3711–3719, 2017.

[35] Jian Peng, Liefeng Bo, and Jinbo Xu. Conditional neural fields. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1419–1427. Curran Associates, Inc., 2009.

[36] Leonid Pishchulin, Eldar Insafutdinov, Siyu Tang, Bjoern Andres, Mykhaylo Andriluka, Peter V. Gehler, and Bernt Schiele. DeepCut: Joint subset partition and labeling for multi person pose estimation. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4929–4937, 2016.

[37] Vladan Radosavljevic, Slobodan Vucetic, and Zoran Obradovic. Continuous conditional random fields for regression in remote sensing. In Proceedings of the 2010 Conference on ECAI 2010: 19th European Conference on Artificial Intelligence, pages 809–814, Amsterdam, The Netherlands, 2010. IOS Press.

[38] Kosta Ristovski, Vladan Radosavljevic, Slobodan Vucetic, and Zoran Obradovic. Continuous conditional random fields for efficient regression in large fully connected graphs. In AAAI, 2013.

[39] Christos Sagonas, Epameinondas Antonakos, Georgios Tzimiropoulos, Stefanos Zafeiriou, and Maja Pantic. 300 faces in-the-wild challenge. Image Vision Comput., 47(C):3–18, March 2016.

[40] J. Saragih and R. Goecke. A nonlinear discriminative approach to AAM fitting. In 2007 IEEE 11th International Conference on Computer Vision, pages 1–8, 2007.

[41] Jason M. Saragih, Simon Lucey, and Jeffrey F. Cohn. Deformable model fitting by regularized landmark mean-shift. International Journal of Computer Vision, 91(2):200–215, Jan. 2011.

[42] Yi Sun, Xiaogang Wang, and Xiaoou Tang. Deep convolutional network cascade for facial point detection. In 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3476–3483, 2013.

[43] Jonathan Tompson, Ross Goroshin, Arjun Jain, Yann LeCun, and Christoph Bregler. Efficient object localization using convolutional networks. In CVPR, 2015.

[44] Jonathan Tompson, Arjun Jain, Yann LeCun, and Christoph Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1, NIPS'14, pages 1799–1807, Cambridge, MA, USA, 2014. MIT Press.

[45] George Trigeorgis, Patrick Snape, Mihalis A. Nicolaou, Epameinondas Antonakos, and Stefanos Zafeiriou. Mnemonic descent method: A recurrent process applied for end-to-end face alignment. In CVPR, pages 4177–4187. IEEE Computer Society, 2016.

[46] Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. Convolutional pose machines. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27–30, 2016, pages 4724–4732, 2016.

[47] Wayne Wu, Chen Qian, Shuo Yang, Quan Wang, Yici Cai, and Qiang Zhou. Look at boundary: A boundary-aware face alignment algorithm. In CVPR, 2018.

[48] Xuehan Xiong and Fernando De la Torre. Global supervised descent method. In CVPR, pages 2664–2673. IEEE Computer Society, 2015.

[49] K. Yao, B. Peng, G. Zweig, D. Yu, X. Li, and F. Gao. Recurrent conditional random field for language understanding. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4077–4081, May 2014.

[50] Amir Zadeh, Yao Chong Lim, Tadas Baltrusaitis, and Louis-Philippe Morency. Convolutional experts constrained local model for 3D facial landmark detection. In The IEEE International Conference on Computer Vision (ICCV) Workshops, Oct. 2017.

[51] S. Zafeiriou, G. Trigeorgis, G. Chrysos, J. Deng, and J. Shen. The Menpo facial landmark localisation challenge: A step towards the solution. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 2116–2125, July 2017.

[52] Zhanpeng Zhang, Ping Luo, Chen Change Loy, and Xiaoou Tang. Facial landmark detection by deep multi-task learning. In David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors, Computer Vision – ECCV 2014, pages 94–108, Cham, 2014. Springer International Publishing.

[53] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip H. S. Torr. Conditional random fields as recurrent neural networks. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), ICCV '15, pages 1529–1537, Washington, DC, USA, 2015. IEEE Computer Society.

[54] Shizhan Zhu, Cheng Li, Chen Change Loy, and Xiaoou Tang. Face alignment by coarse-to-fine shape searching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4998–5006, 2015.

[55] Shizhan Zhu, Cheng Li, Chen Change Loy, and Xiaoou Tang. Unconstrained face alignment via cascaded compositional learning. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3409–3417, 2016.

[56] Xiangyu Zhu, Zhen Lei, Stan Z. Li, et al. Face alignment in full pose range: A 3D total solution. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
Hierarchical Context Modeling for Video Event Recognition
Xiaoyang Wang, Member, IEEE, and Qiang Ji, Fellow, IEEE
Abstract—Current video event recognition research remains largely target-centered. For real-world surveillance videos, target-centered event recognition faces great challenges due to large intra-class target variation, limited image resolution, and poor detection and tracking results. To mitigate these challenges, we introduce a context-augmented video event recognition approach. Specifically, we explicitly capture different types of contexts from three levels: the image level, the semantic level, and the prior level. At the image level, we introduce two types of contextual features, the appearance context features and the interaction context features, to capture the appearance of context objects and their interactions with the target objects. At the semantic level, we propose a deep model based on a deep Boltzmann machine to learn event object representations and their interactions. At the prior level, we utilize two types of prior-level contexts: scene priming and dynamic cueing. Finally, we introduce a hierarchical context model that systematically integrates the contextual information at the different levels. Through the hierarchical context model, contexts at different levels jointly contribute to event recognition. We evaluate the hierarchical context model for event recognition on benchmark surveillance video datasets. Results show that incorporating contexts at each level improves event recognition performance, and jointly integrating the three levels of contexts through our hierarchical model achieves the best performance.
Index Terms—Hierarchical context model, event recognition, image context, semantic context, priming context
1 INTRODUCTION
VISUAL event recognition is attracting growing interest from both academia and industry [1]. Various approaches have been developed for event recognition, and they can generally be divided into descriptor-based approaches and model-based approaches. Descriptor-based approaches build descriptors or features to capture the local appearance or motion patterns of the target object. These approaches usually employ various descriptors as features and recognize the events using classifiers such as Support Vector Machines (SVMs). Several widely used features include the histogram of oriented gradients (HOG) [2], the Spatio-Temporal Interest Point (STIP) [3], and optical flow [4]. These descriptors generally focus more on the target objects.
Model-based approaches utilize probabilistic graphical models to encode the appearance or motion patterns of the target object. These approaches generally build either static models, including Markov Random Fields (MRFs) [5] and Conditional Random Fields (CRFs) [6], or dynamic models, including Hidden Markov Models (HMMs) [7], Dynamic Bayesian Networks (DBNs) [8], and their variants. These models are used to encode the spatial and temporal interactions, and are combined with local measurements for event recognition. The event recognition is then performed through model inference.
Despite these efforts, surveillance video event recognition remains extremely challenging even with well-constructed descriptors or models for describing the events. The first difficulty arises from the tremendous intra-class variations in events. The same category of events can have huge variations in their observations due to not only large target variabilities in shape, appearance, and motion, but also large environmental variabilities such as viewpoint change, illumination change, and occlusion. Fig. 1 gives examples of the event "loading" with large appearance variations. Second, poor target tracking results and the often low video resolution further aggravate the problem. These challenges force us to rethink the existing data-driven, target-centered event recognition approach and to look for extra information to help mitigate them. Contextual information serves this purpose well.
Contexts in general can be grouped into image-level context [9], [10], [11], semantic-level context [5], [12], [13], and prior-level context [14], [15]. These three levels of contexts have also been investigated for event recognition. For example, at the image level, Wang et al. [10] present a multi-scale spatio-temporal context feature that captures the spatio-temporal interest points in event neighborhoods. At the semantic relationship level, Yao et al. [5] propose a context model that makes human pose estimation and activity recognition mutual context to help each other. At the prior/priming information level, scene priming [14] has been proven effective for event recognition in [16], [17], [18].
However, existing work on contexts generally incorporates one type of context, or context information at one level. There is not much work that simultaneously exploits
• X. Wang is with Nokia Bell Labs, Murray Hill, NJ 07974. E-mail: [email protected].
• Q. Ji is with the Department of Electrical, Computer and Systems Engineering, Rensselaer Polytechnic Institute, Troy, NY 12180. E-mail: [email protected].
Manuscript received 24 July 2015; revised 6 July 2016; accepted 7 Sept. 2016. Date of publication 10 Oct. 2016; date of current version 11 Aug. 2017. Recommended for acceptance by G. Mori. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TPAMI.2016.2616308
1770 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 39, NO. 9, SEPTEMBER 2017
0162-8828 © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
different types of contexts at different levels. Since context exists at different levels and comes in different types, we believe event recognition can benefit greatly if we can simultaneously exploit contexts at different levels and systematically incorporate them into event recognition.
To this goal, we introduce unified hierarchical context modeling that systematically captures contexts at different levels and principally integrates the captured contexts with the image measurements for robust event recognition from surveillance videos. We first propose two types of context features: the appearance context feature and the interaction context feature. These image-level contexts exploit the contextual neighborhood of the event instead of the target as in [19]. Next, we introduce the deep hierarchical context model that integrates the proposed context features with semantic-level context and prior-level context. In the proposed model, the semantic-level context captures the interactions among the entities of an event (e.g., person and object), and the prior-level context includes scene priming and dynamic cueing. Through this hierarchical context model, context at the bottom (feature) level provides diagnostic support for the event, while context at the top (prior) level provides predictive knowledge of the event. The top-down and bottom-up contexts meet at the middle (semantic relationship) level, where the three levels of contexts are systematically integrated to yield a comprehensive characterization of events and their context.
We evaluate the proposed hierarchical context models on the VIRAT 1.0 Ground Dataset, the VIRAT 2.0 Ground Dataset with six and with all events [1], as well as the UT-Interaction Dataset [20]. Experimental results show that the proposed context features improve event recognition performance when combined with the target-centered event feature. Moreover, by capturing three levels of context information, the proposed hierarchical context models markedly improve the recognition accuracy for most of the events.
2 RELATED WORK
We discuss related work in event recognition with contexts, integrating multiple levels of contexts, and deep models in Sections 2.1, 2.2, and 2.3 respectively.
2.1 Event Recognition with Contexts
In recent years, there have been increasing efforts to apply context to event recognition. In an event recognition system, contextual information can exist at different levels, including the image level [9], [10], [11], the semantic relationship level [5], [12], [21], and the prior/priming information level [14], [22]. Below, we briefly summarize the work in each category as well as the latest efforts in integrating contexts from different levels.
At the image level, context features capture information regarding the context, and serve as a necessary addition to traditional event features, which are solely extracted within the event bounding box. Many context features have been introduced for activity/event recognition. Kovashka et al. [9] propose to learn the shapes of space-time feature neighborhoods that are most discriminative for a given category. Wang et al. [10] present a representation that captures the contextual interactions between interest points in both local and neighborhood spatio-temporal domains. Zhu et al. [11] propose both intra-activity and inter-activity context feature descriptors for activity recognition. Escorcia and Niebles [23] propose an action descriptor to capture the evolution of human-object interactions with respect to time.
At the semantic level, context captures relationships among basic elements of events, such as the semantic relationships between actions/activities, objects, human poses, and social roles. Different approaches have been proposed to capture these types of semantic relationships. Gupta et al. [12] present a BN approach for combining action understanding with object perception. Yao et al. [5] propose a Markov Random Field model to encode the mutual context of objects and human poses in human-object interaction activities. Yuan et al. [24] propose to capture human activities by a set of mid-level components obtained by trajectory clustering, and use the spatio-temporal context kernel to encode both local properties and context information. Ramanathan et al. [25] propose a Conditional Random Field model to capture the interactions between the event and the social roles of different persons in a weakly supervised manner. In general, approaches that model semantic relationships as contexts utilize probabilistic graphical models like BN, MRF, or CRF, and capture co-occurrence and mutually exclusive relationships to boost the corresponding recognition performance.
At the prior level, context captures the global spatial or temporal environments within which events may happen. The scene priming used by Torralba et al. [14] and Sudderth et al. [22] demonstrates that the scene provides good prior information for object recognition and object detection. The scene priming information [14] has also proven to be effective for event recognition in [16], [17].
In general, the existing work in context-aided event recognition focuses mostly on context at an individual level. There are few works studying the integration of multiple levels of contexts. We discuss these works in Section 2.2.
2.2 Integrating Multiple Levels of Contexts
Several approaches integrate multiple levels of contexts for action, activity, and event recognition applications. Li et al. [13] try to capture the semantic co-occurrence relationships between event, scene, and objects with a Bayesian topic model for static image event recognition. This model essentially captures semantic level context, and incorporates hierarchical priors in the model. Sun et al. [26] propose to combine the point-level context feature, the intra-trajectory context feature, and the inter-trajectory context feature through a multiple kernel learning model for human action recognition. These multiple levels of context are all at the image level. The approach by Zhu et al. [11] also exploits contexts for surveillance video event recognition. While [11] is similar in spirit, our approach differs from it in the following aspects: 1) we propose a probabilistic deep model to learn the latent representation for semantic and prior level contexts, and to integrate multiple levels of contexts; by contrast, their model is a structural linear
Fig. 1. "Loading" events with large intra-class variation.
WANG AND JI: HIERARCHICAL CONTEXT MODELING FOR VIDEO EVENT RECOGNITION 1771
model. 2) We integrate contexts from all three levels, while their model only integrates contexts at the feature and semantic relationship levels.
Approaches like [27], [28] utilize hierarchical probabilistic models for event and action recognition. However, these two approaches focus on capturing the hierarchy of features, body parts, and human actions, without incorporating context information beyond the target. Also, dynamic topic models, including the Markov clustering topic model by Hospedales et al. [29] and the sequential topic model by Varadarajan et al. [30], have been proposed for video activity modeling. Different from our approach, which integrates context at three levels, these models capture the Bayesian model prior distribution at the prior level, and integrate the temporal transition at the semantic level.
In summary, the existing work on integrating contexts at different levels is limited to two levels. By contrast, we propose a unified model that integrates contexts from all three levels simultaneously. Experiments demonstrate significant performance improvement over existing models on challenging real-world benchmark surveillance videos.
2.3 Deep Models for Event Recognition
Recently, deep models, including probabilistic models like deep belief networks [31] and deep Boltzmann machines (DBMs) [32], [33], as well as non-probabilistic models like stacked auto-encoders [34], [35] and convolutional neural networks (CNNs) [36], have been used in different applications. For action and activity recognition, ConvNets [37], [38], the convolutional gated restricted Boltzmann machine [39], independent subspace analysis [40], and auto-encoder approaches [41] have been developed. However, these deep models are designed as feature learning approaches to learn a deep representation of the target. They are generally data-driven and target-centered, without explicitly incorporating context information. Comparatively, our proposed deep context model utilizes the deep structure to explicitly capture the prior level, semantic level, and image level contexts for event recognition.
There are several differences between our model and popular discriminative deep neural networks (NNs) like CNNs. First, the structure of our model and its connections are specifically designed for modeling events with two interacting entities and carry clear semantics, while the structure of a conventional NN is more general, without much semantic meaning, and tends to be fully connected. Second, our model is probabilistic, while deep models like CNNs are deterministic. Third, compared to deep NNs, our model is rather shallow, consisting of only a few layers. Finally, in terms of learning and inference methods, we employ probabilistic methods such as maximum likelihood during learning and MAP for inference, while a CNN minimizes an empirical loss function during learning.
Among the existing deep models, multi-modal deep learning approaches, including the multi-modal stacked autoencoders [42] and the multi-modal DBM models [43], [44], are related to but different from our proposed approach. The multi-modal deep models aim at learning a joint feature representation over multiple modalities. Comparatively, our approach aims at utilizing the deep structure to intrinsically capture three levels of contexts (i.e., image level, semantic level, and prior level) through the proposed model. Their inputs are multi-modal, while our inputs are only from images. Their goal of learning is to learn a joint middle level representation that captures both modalities, while our goal of learning is to capture contextual information at different levels. Their goal of inference is to use the learned middle level representation for classification, or to infer one modality given the other, while our goal of inference is to infer the most likely event.
There is little work on utilizing deep models to capture contexts for visual recognition tasks. He et al. [45] utilize an RBM model to capture pixel level interactions for image labeling. Zeng et al. [46] build a multi-stage contextual deep model that uses the score map outputs from multi-stage classifiers as contextual information for a pedestrian detection deep model. However, neither of these two models is designed to capture three levels of contexts, and neither is for event recognition.
This paper presents our most recent and comprehensive work on three-level hierarchical context modeling, based on our previous studies in [47], [48]. Different from [47], this paper develops two novel context features at the bottom level, and builds the deep context model instead of the BN based context model in [47] to capture contexts. This work also extends [48] with significantly more details on methodology, more details on the learning and inference methods, greatly extended experimental evaluations with a new dataset, enhanced discussions in the introduction, evaluation, and conclusion, as well as greatly extended related work.
3 HIERARCHICAL CONTEXT MODELING
Fig. 2 illustrates the overall idea of this approach. We propose to model three levels of contexts: image level context, semantic level context, and prior level context. The image level context at the bottom level provides diagnostic support information for the event, while the prior level context at the top level supplies top-down predictive knowledge about the event. The top-down and bottom-up contexts meet at the middle level (semantic level context), where the three levels of contexts are systematically integrated to give a comprehensive characterization of the events and their overall contexts.
3.1 Image Level Context Features
Our image level context can be categorized into two different types: the appearance context feature and the interaction
Fig. 2. The integration of contexts from three levels.
context feature. Here, we first define the event neighborhood based on the event bounding box. From there, both appearance and interaction context features can be extracted.
3.1.1 Definition of Event Neighborhood
For event recognition in a video sequence, the event bounding box is a set of rectangles over the video sequence frames, from frame 1 to frame $T$, with one rectangle assigned to each of the $T$ frames. The event bounding box contains the event objects; for each frame, the event occurs within the corresponding rectangle. The rectangle in frame $t$ can be represented by its upper-left corner point with coordinate $(x_t, y_t)$, its width $w_t$, and its height $h_t$. In this way, the event bounding box can be represented by the set $\{(x_t, y_t, w_t, h_t)\}_{t=1}^{T}$, which includes the rectangles over all $T$ frames.

Given the event bounding box, we can further define the
spatial neighborhoods of the event. As shown in Fig. 3a, for frame $t$, the event bounding box rectangle is extended to a larger rectangle by increasing the width by $\Delta w_t$ on both the left and right sides of the original rectangle. Similarly, the height is increased by $\Delta h_t$ on both the top and bottom sides. The event neighborhood of an event in frame $t$ is then the region within the extended rectangle but outside of the event bounding box, as presented by the shaded region of Fig. 3a.
We use the ratio $\lambda$ to determine the relative scope of the event neighborhood with respect to the event bounding box size:

$$\lambda = \frac{\Delta h_t}{h_t} = \frac{\Delta w_t}{w_t}. \quad (1)$$
Our event spatial neighborhood is then extended to the spatial-temporal neighborhoods over $T$ frames, as shown in the shaded areas of Fig. 3b.
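As a concrete sketch of the neighborhood construction (function and variable names are ours, not the paper's), the extension of Eq. (1) for one frame can be computed as:

```python
def extend_bbox(x, y, w, h, ratio):
    """Grow a frame-t bounding box by the ratio of Eq. (1): the width grows
    by ratio*w on both the left and right sides, and the height by ratio*h
    on the top and bottom. Returns the extended rectangle (x', y', w', h')."""
    dw, dh = ratio * w, ratio * h
    return (x - dw, y - dh, w + 2 * dw, h + 2 * dh)

def neighborhood_area(w, h, ratio):
    """Area of the shaded region: extended rectangle minus the original box."""
    _, _, we, he = extend_bbox(0.0, 0.0, w, h, ratio)
    return we * he - w * h
```

Applying this per frame and taking the union of the shaded regions over the $T$ frames yields the spatio-temporal event neighborhood.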
3.1.2 Appearance Context Feature
The appearance context feature captures the appearance of contextual objects, which are defined as nearby non-target objects located within the event neighborhood. Since our event neighborhood is a direct spatial extension of the event bounding box, it naturally contains both the contextual objects and the background. To efficiently extract and capture the contextual objects from the background, we utilize Scale Invariant Feature Transform (SIFT) descriptors [49] to detect key points in the event neighborhood. Fig. 4a gives an illustration of SIFT key points in the event neighborhood of a frame.
SIFT extracts 128 dimensional scale and orientation invariant local texture features surrounding the detected key points. These features provide an appearance-based description of the contextual objects. For each event sequence, the standard Bag-of-Words (BOW) approach is used to transform the key points into a fixed-length histogram context feature using k-means clustering. Suppose an event sequence contains $M$ key points $p_1, \ldots, p_M$, with each point assigned a cluster label from 1 to $K$. The $K$ dimensional histogram $h$ for this event sequence is

$$h(k) = \#\{p_i : p_i \in \mathrm{bin}(k)\}, \quad k = 1, \ldots, K. \quad (2)$$

This $K$ dimensional histogram captures the appearance of the contextual objects. After normalization, it is used as the appearance context feature.
3.1.3 Interaction Context Feature
The interaction context feature captures the interactions between event objects and contextual objects, as well as among contextual objects. The contextual objects are represented by the SIFT key points extracted in the event neighborhood, as discussed in Section 3.1.2. We use SIFT key points detected within the event bounding box to represent the event objects.

Then, k-means clustering is applied to the 128 dimensional features of key points in the event bounding box and the event neighborhood of all training sequences to generate a joint dictionary matrix $D_I$. With this dictionary, key points inside and outside the event bounding box can be assigned to the same set of words. As in Fig. 5, we use a 2D histogram to capture the co-occurrence frequencies of words inside and outside the event bounding box over frames.
Specifically, for an event sequence with $T$ frames in total, denote the key points inside and outside of the event
Fig. 3. The definition of the event neighborhood, in which the blue rectangle indicates the event bounding box, and the dashed green rectangle is the extended rectangle. The shaded region within the extended rectangle but outside of the event bounding box is the spatial neighborhood. The event neighborhood is the union of the spatial neighborhoods over $T$ frames.
Fig. 4. Extracting the appearance context feature from the event neighborhood. (a) SIFT key points in the neighborhood of each frame; (b) BOW histogram feature.
Fig. 5. Extracting the interaction context feature with a 2D histogram that captures the co-occurrence frequencies of words of event objects and contextual objects.
bounding box in frame $t$ as $p_{it}$ and $q_{jt}$ respectively. The point pair $(p_{it}, q_{jt})$ is counted only when the two points appear in the same frame. In this sense, the $K \times K$ dimensional histogram $h$ for this event sequence is

$$h(k, k') = \#\{(p_{it}, q_{jt}) : p_{it} \in \mathrm{bin}(k),\ q_{jt} \in \mathrm{bin}(k')\}, \quad k, k' = 1, \ldots, K,\ t = 1, \ldots, T. \quad (3)$$
We normalize this 2D histogram so that all its elements sum to 1. After being reshaped into a $K^2$ dimensional vector, it constitutes the interaction context feature we use for event recognition.
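A numpy sketch of the Eq. (3) co-occurrence histogram, again assuming 0-based dictionary labels and taking per-frame lists of labels as input (the interface is our own illustration):

```python
import numpy as np

def interaction_histogram(inside_labels, outside_labels, K):
    """Eq. (3): per-frame co-occurrence counts of dictionary words for key
    points inside vs. outside the event bounding box.

    inside_labels / outside_labels: lists over frames; element t holds the
    cluster labels (0..K-1) of key points detected in frame t.
    """
    H = np.zeros((K, K))
    for p_t, q_t in zip(inside_labels, outside_labels):
        for k in p_t:           # event-object key points in frame t
            for kp in q_t:      # contextual key points in the same frame
                H[k, kp] += 1
    total = H.sum()
    if total > 0:
        H = H / total           # normalize so all elements sum to 1
    return H.reshape(-1)        # K^2-dimensional interaction context feature
```

Note that pairs are formed only within a frame, matching the constraint that $(p_{it}, q_{jt})$ is counted only when both points appear in frame $t$.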
3.2 Model Capturing Semantic Level Context
The semantic level contexts stand for the semantic interactions among event entities. Since both the person and the object are important entities of an event, the semantic level contexts in this work capture the interactions between person and object for an event. For example, the event "person getting into vehicle" is highly correlated with the human state "facing towards vehicle" and the object state "door open"; likewise, the event "person opening a trunk" has strong relations with the human state "at tail of vehicle" and the object state "trunk open". The semantic context modeling should therefore capture the event, person, object, and their interactions. Different from the semantic level context modeling in [47], we learn a set of middle level representations for the person and object entities through the deep structure, and capture the interactions between the event and the learned middle level representations. Existing approaches like [47] instead utilize a single discrete variable to describe the person and object states.
Suppose we have $K$ types of events to recognize. We use a $K$ dimensional vector $y$ with binary units to represent the event label through the 1-of-$K$ coding scheme, in which an event belonging to class $C_k$ is represented by a vector with element $k$ set to "1" and all remaining elements set to "0". We use binary hidden units $h_p$ and $h_o$ to represent the latent states of the person and object. We treat both $h_p$ and $h_o$ as hidden units so that their optimal states can be learned automatically as latent topics during training.
3.2.1 Semantic Context Modeling
The model structure shown in Fig. 6 is used to capture the semantic level contexts. In this structure, the event label vector $y$ lies in the top layer, and the hidden units $h_p$ for the person and $h_o$ for the object both lie in the bottom layer. Another set of hidden units $h_r$ in the intermediate layer is incorporated to capture the interactions between person and object. Here, every single hidden unit in $h_r$ is connected to all the units in $h_p$, $h_o$, and $y$. In this way, the global interactions among units from the person, object, and event label are captured through the intermediate hidden layer $h_r$.
3.2.2 Combining with Observations
The observation vectors for the event, person, and object can further be added to the semantic level context model of Fig. 6, resulting in the context model shown in Fig. 7. The vectors $p$ and $o$ denote the person and object observation vectors as continuous STIP features. Both the person observation vector $p$ and the object observation vector $o$ are connected only to their corresponding hidden units $h_p$ and $h_o$ respectively. In this way, the middle level representations for the person and object can be obtained from their corresponding observations. In addition, the event observation $e$, as the STIP event feature, and the context feature $c$ introduced in Section 3.1 are directly connected to the event label $y$.
The model in Fig. 7 combines semantic contexts at the middle level with the context feature $c$ at the bottom level. This model is called the Model-BM context model, and is compared to other models in the experiment section.
3.3 Model Capturing Prior Level Context
The prior level contexts capture the prior information of events. They reflect the related high level context that determines the likelihood of the occurrence of certain events. For this research, we utilize two types of prior contexts, scene priming [14] and dynamic cueing, though the model is generic enough to apply to other high level priming contexts.
3.3.1 Scene Priming
The scene priming context refers to the scene information obtained from the global image. It provides an environmental context, such as location (e.g., parking lot, shop entrance), within which events occur. Hence, it can serve as a spatial prior that dictates whether certain events would occur.
To capture the scene context as a prior context, we utilize the hidden units $h_s$ to represent different possible scene states. As shown in Fig. 8, each hidden unit in $h_s$
Fig. 6. The model capturing semantic level contexts, where $h_p$ and $h_o$ are the first layer hidden units representing the person and object middle level representations, $h_r$ is the second layer of hidden units capturing interactions, and $y$ stands for the event class label.
Fig. 7. The model combining semantic level contexts with observations, where vectors $p$ and $o$ denote the person and object observations, and $e$ and $c$ represent the event and context observations respectively.
Fig. 8. The model capturing prior level contexts, where $s$ represents the global scene observation, $m_{-1}$ denotes the recognition measurement of the previous event, $h_s$ denotes the hidden units representing different possible scene states, $y_{-1}$ denotes the previous event, and $y$ stands for the current event.
is connected to all the elements within the event label vector $y$. In this way, the state of the scene has a direct impact on the event label. The observation vector $s$ represents the GIST feature extracted from the global scene image. Each element in $s$ is connected to each unit in $h_s$ to provide a global observation for the hidden scene states.
3.3.2 Dynamic Cueing
The second prior level context is dynamic cueing. It provides temporal support as to what event is likely to happen given the events that have happened up to now. The event at the current time is influenced by events at previous times. For example, the event "loading/unloading a vehicle" typically precedes the event "closing a trunk". Information on the previous events thus provides a beneficial cue for recognizing the current event, so dynamic context can serve as a temporal prior on the current event.

We capture dynamic context through Markov chain modeling, which is especially useful for a series of happenings with a temporal order. For example, people typically "get out of the vehicle" first, then "open the trunk", "load the vehicle", and finally "close the trunk". In this work, the previous event is represented by the $K$ dimensional binary vector $y_{-1}$ in the 1-of-$K$ coding scheme. Moreover, $y_{-1}$ is further connected to the previous event measurement vector $m_{-1}$, which denotes the recognition measurement of the previous event. As shown in Fig. 8, both $h_s$ and $y_{-1}$ provide top-down prior information for the inference of the current event.
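As a toy illustration of such a Markov temporal prior (the transition values below are invented purely for illustration; in the model, the coupling between $y_{-1}$ and $y$ is learned jointly with the other parameters):

```python
import numpy as np

# Hypothetical transition matrix over three events (row = previous event,
# column = current event). Values are made up for illustration only.
T = np.array([[0.1, 0.7, 0.2],   # get-out-of-vehicle -> likely open-trunk
              [0.2, 0.1, 0.7],   # open-trunk -> likely close-trunk
              [0.5, 0.3, 0.2]])  # close-trunk

def temporal_prior(m_prev):
    """Soft temporal prior on the current event, given a (possibly noisy)
    measurement distribution m_prev over the previous event."""
    return m_prev @ T
```

With a confident previous-event measurement, the prior is simply the corresponding row of the transition matrix; with an uncertain measurement, the rows are mixed accordingly.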
3.4 Integrating Contexts in Three Levels
Given the contexts at three levels as introduced previously, we now discuss the formulation of the proposed deep hierarchical context model for integrating contexts from all three levels. We first present the deep model, and then discuss its learning and inference.
3.4.1 Deep Hierarchical Context Model
We introduce the deep context model to systematically incorporate three levels of contexts. As shown in Fig. 9, the model consists of six layers. From bottom to top, the first layer at the bottom includes the target and contextual measurement vectors $p$, $o$, $e$, and $c$, which are visible in both learning and testing. The vectors $p$ and $o$ denote the person and object observations, and the vectors $e$ and $c$ denote the event and context features. The second layer includes binary hidden units $h_p$ and $h_o$ as middle level representations for the person and object. In the third layer, the binary hidden units $h_r$ are incorporated as an intermediate layer to capture the interactions between the event, person, and object. The fourth layer, denoted by vector $y$, represents the event label through the 1-of-$K$ coding scheme. In the top two layers, the hidden units $h_s$ represent the scene states, and the vector $s$ is the scene observation. Also, $y_{-1}$ represents the previous event state, with its measurement $m_{-1}$. This model is essentially the combination of the Model-BM and the prior model in Figs. 7 and 8.
The proposed model is an undirected model. With the structure in Fig. 9, the model energy function is:

$$
\begin{aligned}
E(y, h_r, h_p, h_o, p, o, e, c, y_{-1}, m_{-1}, h_s, s; \theta) =\ & -\tilde{p}^{\top} W^1 h_p - \tilde{o}^{\top} W^2 h_o - h_p^{\top} Q^1 h_r - h_o^{\top} Q^2 h_r - h_r^{\top} L y \\
& - \tilde{e}^{\top} U^1 y - \tilde{c}^{\top} U^2 y - y_{-1}^{\top} D y - h_s^{\top} T y - m_{-1}^{\top} F y_{-1} \\
& - \tilde{s}^{\top} G h_s - b_{h_p}^{\top} h_p - b_{h_o}^{\top} h_o - b_{h_r}^{\top} h_r - b_y^{\top} y - b_{h_s}^{\top} h_s \\
& - b_{y_{-1}}^{\top} y_{-1} - b_{m_{-1}}^{\top} m_{-1} \\
& + \sum_i \frac{(p_i - b_{p_i})^2}{2\sigma_{p_i}^2} + \sum_j \frac{(o_j - b_{o_j})^2}{2\sigma_{o_j}^2} + \sum_k \frac{(e_k - b_{e_k})^2}{2\sigma_{e_k}^2} + \sum_{i'} \frac{(c_{i'} - b_{c_{i'}})^2}{2\sigma_{c_{i'}}^2} + \sum_{j'} \frac{(s_{j'} - b_{s_{j'}})^2}{2\sigma_{s_{j'}}^2},
\end{aligned} \quad (4)
$$

where $W^1$, $W^2$, $Q^1$, $Q^2$, $L$, $U^1$, $U^2$, $T$, $D$, $F$, and $G$ are the weight matrices between the groups of visible or hidden units; $b_{h_p}$, $b_{h_o}$, $b_{h_r}$, $b_y$, $b_{h_s}$, $b_{y_{-1}}$, and $b_{m_{-1}}$ are the bias terms for the discrete units; and $b_p$, $\sigma_p$, $b_o$, $\sigma_o$, $b_e$, $\sigma_e$, $b_c$, $\sigma_c$, $b_s$, and $\sigma_s$ are the parameters for the continuous units, similar to those in a Gaussian-Bernoulli RBM. We use $\theta$ to represent the whole model parameter set, which includes all the parameters in the weight matrices and the bias terms.

For convenience, Eq. (4) uses the vectors $\tilde{p}, \tilde{o}, \tilde{e}, \tilde{c}$, and $\tilde{s}$, which are the original observation vectors $p, o, e, c, s$ divided element-wise by $\sigma_p, \sigma_o, \sigma_e, \sigma_c$, and $\sigma_s$ respectively; for instance, $\tilde{p}_i = p_i / \sigma_{p_i}$.

Given the energy function, the joint probability of all the variables $y$, $h_r$, $h_p$, $h_o$, $p$, $o$, $e$, $c$, $y_{-1}$, $m_{-1}$, $h_s$, and $s$ can be written as

$$P(y, h_r, h_p, h_o, p, o, e, c, y_{-1}, m_{-1}, h_s, s; \theta) = \frac{1}{Z(\theta)} \exp\big(-E(y, h_r, h_p, h_o, p, o, e, c, y_{-1}, m_{-1}, h_s, s; \theta)\big). \quad (5)$$
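For concreteness, a numpy sketch that evaluates the energy of Eq. (4) for one joint configuration. The dictionary keys mirror the symbols of Eq. (4) but the naming convention is ours; this is an illustrative sketch, not the authors' implementation:

```python
import numpy as np

def energy(v, h, params):
    """Evaluate E of Eq. (4) for one configuration of all variables.

    v: observation vectors p, o, e, c, s, m1 (= m_{-1});
    h: discrete vectors hp, ho, hr, y, y1 (= y_{-1}), hs;
    params: weight matrices W1, W2, Q1, Q2, L, U1, U2, D, T, F, G,
            biases b_* for all units, and sig_* for the continuous units.
    """
    P = params
    # standardized continuous observations, e.g. p_tilde_i = p_i / sigma_p_i
    pt, ot = v['p'] / P['sig_p'], v['o'] / P['sig_o']
    et, ct, st = v['e'] / P['sig_e'], v['c'] / P['sig_c'], v['s'] / P['sig_s']
    E = (-pt @ P['W1'] @ h['hp'] - ot @ P['W2'] @ h['ho']
         - h['hp'] @ P['Q1'] @ h['hr'] - h['ho'] @ P['Q2'] @ h['hr']
         - h['hr'] @ P['L'] @ h['y']
         - et @ P['U1'] @ h['y'] - ct @ P['U2'] @ h['y']
         - h['y1'] @ P['D'] @ h['y'] - h['hs'] @ P['T'] @ h['y']
         - v['m1'] @ P['F'] @ h['y1'] - st @ P['G'] @ h['hs'])
    # linear bias terms for the discrete units
    for name in ['hp', 'ho', 'hr', 'y', 'hs', 'y1']:
        E -= P['b_' + name] @ h[name]
    E -= P['b_m1'] @ v['m1']
    # quadratic (Gaussian) terms for the continuous units
    for x, b, s in [(v['p'], P['b_p'], P['sig_p']), (v['o'], P['b_o'], P['sig_o']),
                    (v['e'], P['b_e'], P['sig_e']), (v['c'], P['b_c'], P['sig_c']),
                    (v['s'], P['b_s'], P['sig_s'])]:
        E += np.sum((x - b) ** 2 / (2 * s ** 2))
    return E
```

With all weights and biases set to zero and unit variances, only the Gaussian terms remain, which is a convenient sanity check.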
3.4.2 Model Learning
The model learning process learns the model parameter set $\theta$, which includes all the weight matrices and bias terms in Eq. (4). With training data $\{y_i, y_{-1,i}, p_i, o_i, e_i, c_i, s_i, m_{-1,i}\}_{i=1}^N$, these parameters are learned by maximizing the log likelihood:

$$\theta^* = \arg\max_{\theta} \mathcal{L}(\theta), \qquad \mathcal{L}(\theta) = \sum_{i=1}^{N} \log P(y_i, y_{-1,i}, p_i, o_i, e_i, c_i, s_i, m_{-1,i}; \theta). \quad (6)$$
In the remainder of this section, we refer to all the hidden units in the model as $h$, and to all the
Fig. 9. The proposed deep context model integrating image level, semantic level, and prior level contexts, where the shaded units represent the hidden units, the striped units represent the observed units that are available in both training and testing, and the units in grid are the event label units, which are available in training but not in testing.
visible units in the model as $v$. The optimization in Eq. (6) can be solved via stochastic gradient ascent, in which the gradients are calculated as [32]:

$$\frac{\partial \mathcal{L}(\theta)}{\partial \theta} = \left\langle \frac{\partial E}{\partial \theta} \right\rangle_{P_{\mathrm{data}}} - \left\langle \frac{\partial E}{\partial \theta} \right\rangle_{P_{\mathrm{model}}}, \quad (7)$$

where $E$ is the model energy function defined in Eq. (4). The operator $\langle \cdot \rangle_{P_{\mathrm{data}}}$ denotes the expectation with respect to the data distribution $P_{\mathrm{data}}(h, v) = P(h|v) P_{\mathrm{data}}(v)$, where $P_{\mathrm{data}}(v) = \frac{1}{N}\sum_i \delta(v - v_i)$ is the empirical distribution. The operator $\langle \cdot \rangle_{P_{\mathrm{model}}}$ is the expectation with respect to the model distribution defined in Eq. (5). The expectation $\langle \cdot \rangle_{P_{\mathrm{data}}}$ is usually called the data-dependent expectation, and $\langle \cdot \rangle_{P_{\mathrm{model}}}$ is usually called the model's expectation.

Since the computational cost of directly calculating both expectations is exponential in the number of hidden units, exact learning is intractable. Hence, we learn the proposed model with the approximate learning method of [32], in which mean-field variational inference is used to estimate the data-dependent expectation, and Markov chain Monte Carlo (MCMC) based stochastic approximation is used to estimate the model's expectation.
For pre-training, the model in Fig. 9 is first divided into sub-models. Layer-wise pre-training [34] is then performed independently for each sub-model. Specifically, pre-training starts with training the RBM formed by the pair of vectors $p$ and $h_p$, as well as the RBM formed by the pair of vectors $o$ and $h_o$. Once trained, the two RBMs can be sampled to obtain samples of $h_p$ and $h_o$, which are then used jointly with the ground truth $y$ to train the sub-model represented by Fig. 6, with $h_r$ as hidden units. Next, we train the RBM formed by the pair of vectors $s$ and $h_s$. Once trained, samples of $h_s$ from this RBM can further be used to train the sub-model consisting of $h_s$ and $y$. The remaining sub-models consist of visible layers only; we learn them by individually maximizing the log likelihood of the connected visible vector pairs (e.g., $y$ and $e$).
With the model parameters initialized through pre-training, we then learn the model parameters jointly by solving the optimization in Eq. (6) through stochastic gradient ascent. As shown in Eq. (7), the gradient calculation requires estimating the data-dependent expectation and the model's expectation. For the data-dependent expectation, we replace the true posterior $P(h|v; \theta)$ with the variational posterior $Q(h|v; \mu)$. As in Eq. (8), we use the mean-field approximation, which assumes all the hidden units are fully factorized, ignoring the dependencies between them:

$$Q(h|v; \mu) = \left(\prod_i q(h_{p_i}|v)\right)\left(\prod_j q(h_{o_j}|v)\right)\left(\prod_k q(h_{r_k}|v)\right)\left(\prod_g q(h_{s_g}|v)\right), \quad (8)$$
where $\mu = \{\mu_p, \mu_o, \mu_r, \mu_s\}$ are the mean-field variational parameters with $q(h_i = 1) = \mu_i$. The estimation then finds the parameters $\mu$ that maximize the variational lower bound of the log likelihood in Eq. (6) with $\theta$ fixed. In this estimation, given $y_i$, the variational parameters $\mu_s$ at the top are independent of the remaining variational parameters at the bottom, so the parameters for the top and bottom can be estimated separately. The method iteratively updates $\mu$ for the different hidden units through the following mean-field fixed-point equations:

$$
\begin{aligned}
\mu_{p_i} &\leftarrow \sigma\Big(\textstyle\sum_j W^1_{ji}\, p_j/\sigma_{p_j} + \sum_k Q^1_{ik}\,\mu_{r_k} + b_{h_{p_i}}\Big) \\
\mu_{o_j} &\leftarrow \sigma\Big(\textstyle\sum_i W^2_{ij}\, o_i/\sigma_{o_i} + \sum_k Q^2_{jk}\,\mu_{r_k} + b_{h_{o_j}}\Big) \\
\mu_{r_k} &\leftarrow \sigma\Big(\textstyle\sum_i Q^1_{ik}\,\mu_{p_i} + \sum_j Q^2_{jk}\,\mu_{o_j} + \sum_{k'} L_{kk'}\, y_{k'} + b_{h_{r_k}}\Big) \\
\mu_{s_g} &\leftarrow \sigma\Big(\textstyle\sum_j G_{jg}\, s_j/\sigma_{s_j} + \sum_k T_{gk}\, y_k + b_{h_{s_g}}\Big),
\end{aligned} \quad (9)
$$
where $\sigma(x) = 1/(1 + \exp(-x))$ is the logistic function. The estimated variational parameters are then used to calculate the data-dependent expectation, as in the examples shown below:

$$\left\langle \frac{\partial E}{\partial W^1} \right\rangle_{P_{\mathrm{data}}} = \frac{1}{N}\sum_{n=1}^{N} \tilde{p}\, \mu_p^{\top}, \qquad \left\langle \frac{\partial E}{\partial Q^1} \right\rangle_{P_{\mathrm{data}}} = \frac{1}{N}\sum_{n=1}^{N} \mu_p\, \mu_r^{\top}.$$
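The fixed-point iteration of Eq. (9) can be sketched in numpy as follows. The parameter-dictionary keys, the initialization of $\mu_r$ at 0.5, and the fixed iteration count are our own assumptions; the paper only specifies the update equations:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field_update(obs, y, P, n_iter=50):
    """Iterate the Eq. (9) fixed-point updates given the event label y
    (available at training time) and the standardized observations.

    obs: dict with p/sigma_p, o/sigma_o, s/sigma_s under keys
         'pt', 'ot', 'st'. Weight-matrix names mirror Eq. (4)/(9).
    Returns the variational parameters mu_p, mu_o, mu_r, mu_s.
    """
    mu_r = np.full(P['Q1'].shape[1], 0.5)   # uninformative initialization
    for _ in range(n_iter):
        mu_p = sigmoid(obs['pt'] @ P['W1'] + P['Q1'] @ mu_r + P['b_hp'])
        mu_o = sigmoid(obs['ot'] @ P['W2'] + P['Q2'] @ mu_r + P['b_ho'])
        mu_r = sigmoid(P['Q1'].T @ mu_p + P['Q2'].T @ mu_o
                       + P['L'] @ y + P['b_hr'])
    # given y, mu_s decouples from the bottom parameters (no iteration needed)
    mu_s = sigmoid(obs['st'] @ P['G'] + P['T'] @ y + P['b_hs'])
    return mu_p, mu_o, mu_r, mu_s
```

In practice the loop would be run until the updates change by less than a tolerance rather than for a fixed number of sweeps.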
For estimating the model's expectation, we use the MCMC based stochastic approximation procedure. It first randomly initializes $M$ Markov chains with samples $y^{0,j}$, $h_r^{0,j}$, $h_p^{0,j}$, $h_o^{0,j}$, $p^{0,j}$, $o^{0,j}$, $e^{0,j}$, $c^{0,j}$, $y_{-1}^{0,j}$, $m_{-1}^{0,j}$, $h_s^{0,j}$, $s^{0,j}$. For each Markov chain $j$ from 1 to $M$, the $(t+1)$th step samples $y^{t+1,j}$, $h_r^{t+1,j}$, $h_p^{t+1,j}$, $h_o^{t+1,j}$, $p^{t+1,j}$, $o^{t+1,j}$, $e^{t+1,j}$, $c^{t+1,j}$, $y_{-1}^{t+1,j}$, $m_{-1}^{t+1,j}$, $h_s^{t+1,j}$, $s^{t+1,j}$ given the samples from the $t$th step by running a Gibbs sampler. The $M$ sampled Markov particles are then used to estimate the model's expectation in the model optimization, as in the examples shown below:

$$\left\langle \frac{\partial E}{\partial W^1} \right\rangle_{P_{\mathrm{model}}} = \frac{1}{M}\sum_{j=1}^{M} \tilde{p}^{t+1,j} \big(h_p^{t+1,j}\big)^{\top}, \qquad \left\langle \frac{\partial E}{\partial Q^1} \right\rangle_{P_{\mathrm{model}}} = \frac{1}{M}\sum_{j=1}^{M} h_p^{t+1,j} \big(h_r^{t+1,j}\big)^{\top}.$$
The learning procedure of the proposed model is summarized in Algorithm 1.
With the mean-field based variational inference and the MCMC based stochastic approximation approaches, the computational complexity of learning is $O(RNM^2)$, where $R$ is the number of learning iterations, $N$ is the number of training samples, and $M$ is the number of nodes in the model.
3.4.3 Model Inference
Given a query event sequence with event observation vector $e$, context observation vector $c$, person observation vector $p$, object observation vector $o$, global scene observation vector $s$, and previous event measurement $m_{-1}$, the
model can recognize the event category $k^*$ by maximizing its posterior probability given all the observation vectors:

$$k^* = \arg\max_k P(y_k = 1 \mid e, c, p, o, s, m_{-1}; \theta). \quad (10)$$
We emphasize that $y_{-1}$, the previous event label, is available during model learning but not during testing. However, its measurement $m_{-1}$ is available during testing. In model inference, $m_{-1}$ can influence the current event $y$ through $y_{-1}$, thereby providing diagnostic support for $y$. We do not update the decision on the previous event $y_{-1}$ during testing.
Algorithm 1. Learning of the Proposed Model
Data: {y_i, y_{-1,i}, p_i, o_i, e_i, c_i, s_i, m_{-1,i}}_{i=1}^{N} as the training set with N training samples, and M as the number of Markov chains.
Result: model parameter set θ
Initialize θ^0 with layer-wise pre-training of RBMs;
Initialize Markov chains {v^{0,j}, h^{0,j}}_{j=1}^{M} randomly;
for t = 0 → T do
  // Variational inference
  for each training sample i = 1 → N do
    Run Eq. (9) updates till convergence;
  end
  // MCMC stochastic approximation
  for each chain j = 1 → M do
    Sample {v^{t+1,j}, h^{t+1,j}} given {v^{t,j}, h^{t,j}};
  end
  Update parameters θ^{t+1} = θ^t + η_t {∂L(θ)/∂θ};
end
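Algorithm 1's stochastic-approximation update can be illustrated on a single RBM block with persistent chains. This is a minimal sketch (toy binary data, one Gibbs sweep per iteration, no bias terms, all sizes illustrative), not the full multi-layer learning procedure with variational inference.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Toy binary data standing in for the visible units of one RBM block.
V = (rng.random((100, 8)) < 0.5).astype(float)        # N x nv
nv, nh, M, eta = 8, 4, 10, 0.05                       # illustrative sizes
W = 0.01 * rng.standard_normal((nv, nh))

# Persistent chains: initialize M fantasy particles randomly.
v_chain = (rng.random((M, nv)) < 0.5).astype(float)

for t in range(50):                                   # learning iterations (R in the text)
    # Data-dependent (positive-phase) expectation.
    h_data = sigmoid(V @ W)
    pos = V.T @ h_data / len(V)
    # Model expectation via one Gibbs sweep on each persistent chain.
    h_chain = (rng.random((M, nh)) < sigmoid(v_chain @ W)).astype(float)
    v_chain = (rng.random((M, nv)) < sigmoid(h_chain @ W.T)).astype(float)
    h_model = sigmoid(v_chain @ W)
    neg = v_chain.T @ h_model / M
    # Stochastic approximation step: theta <- theta + eta * gradient estimate.
    W += eta * (pos - neg)
```

The outer loop over t, the persistent chains, and the gradient step mirror the structure of Algorithm 1; the variational inner loop is replaced here by the exact positive phase available for a single RBM.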
Computing this posterior probability requires marginalizing over all the hidden units in h_p, h_o, h_r, and h_s. Its exact calculation is intractable. However, the inference can be efficiently solved using the Gibbs sampling method. Given the observation vectors e, c, p, o, s, and m_{-1} during testing, Gibbs sampling first randomly initializes h_p, h_o, and y, and then iteratively samples h_r, h_p, h_o, h_s, y_{-1} and y given their adjacent units. During this process, the hidden units h_r, h_p, h_o, and h_s are actively involved in each iteration of Gibbs sampling. After the burn-in period, samples of y are collected and used to approximate the marginal probability of y through their frequency in the Gibbs samples.
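The frequency-based marginal estimate can be sketched as follows. The `gibbs_step` below is a hypothetical stand-in for the model's true conditional sweep, and K, the transition probabilities, and the chain length are all illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
K = 6                        # number of event classes (illustrative)
T, burn_in = 2000, 500       # chain length and burn-in (illustrative)

def gibbs_step(y, rng):
    # Stand-in for one Gibbs sweep: in the real model this would resample
    # h_r, h_p, h_o, h_s, y_-1 and y given their adjacent units.
    # Hypothetical conditional: sticky toward the current state, pulled toward class 2.
    probs = np.full(K, 0.05)
    probs[y] += 0.25
    probs[2] += 0.45
    return rng.choice(K, p=probs / probs.sum())

y = rng.integers(K)          # random initialization of y
samples = []
for t in range(T):
    y = gibbs_step(y, rng)
    if t >= burn_in:         # keep only post burn-in samples
        samples.append(y)

# Marginal P(y_k = 1 | observations) approximated by sample frequency.
p_y = np.bincount(samples, minlength=K) / len(samples)
k_star = int(np.argmax(p_y))
```

The final argmax over the estimated marginal implements the decision rule of Eq. (10).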
The detailed inference algorithm is presented in Algorithm 2. The computational complexity of the model inference is O(CTM^2), where C is the number of Markov chains, T is the chain length, and M is the number of nodes in the model.
4 EXPERIMENTS
We demonstrate the effectiveness of the proposed approach on four event recognition benchmark datasets. The first two datasets are the VIRAT 1.0 Ground Dataset and the VIRAT 2.0 Ground Dataset [1], both with six types of person-vehicle interaction events: Loading a Vehicle (LAV), Unloading a Vehicle (UAV), Opening a Trunk (OAT), Closing a Trunk (CAT), Getting into a Vehicle (GIV), and Getting out of a Vehicle (GOV). These two datasets are state-of-the-art real-world surveillance video datasets focusing on events that involve interactions between persons and objects. The VIRAT 1.0 Ground Dataset includes around 3 hours of videos recorded from different school parking lots. The VIRAT 2.0 Ground Dataset includes over 8 hours of surveillance videos from school parking lots, shop entrances, outdoor dining areas, and construction sites. For both datasets, we use half of the event sequences for training, and the rest of the sequences for testing.
Algorithm 2. Inference of P(y | e, c, p, o, s, m_{-1})
Data: the input observation vectors e, c, p, o, s, and m_{-1} for the query event sequence; model parameter set θ
Result: P(y_k = 1 | e, c, p, o, s, m_{-1}, θ) for k = 1, ..., K
for chain = 1 → C do
  Randomly initialize h_p^0, h_o^0, and y^0;
  for t = 0 → T do
    Sample h_r^t given h_p^t, h_o^t, and y^t;
    Sample h_p^{t+1} given h_r^t and p;
    Sample h_o^{t+1} given h_r^t and o;
    Sample h_s^{t+1} given y^t and s;
    Sample y_{-1}^{t+1} given y^t and m_{-1};
    Sample y^{t+1} given h_r^t, h_s^{t+1}, y_{-1}^{t+1}, e and c;
  end
end
Collect the last T' samples of y from each chain;
Calculate P(y | e, c, p, o, s, m_{-1}) with the samples;
We further experiment with the Full VIRAT 2.0 Ground Dataset as the third dataset. Besides the six person-object interaction events included in the VIRAT 1.0 and 2.0 Ground Datasets, this dataset includes five types of additional events from VIRAT [1]: Gesturing (GST), Carrying an Object (CAO), Running (RUN), Entering a Facility (EAF), and Exiting a Facility (XAF). Here, half of these event sequences are used for training, and the remaining sequences are used for testing.
The fourth dataset is the UT-Interaction Dataset [20]. This is a surveillance video dataset with person-person interaction events. It consists of six person-person interaction events: "hand shaking, hugging, kicking, pointing, punching, and pushing" [20]. The dataset includes two sets, each with 10 video sequences of around 1 minute in length. To compare with state-of-the-art methods, we use the standard 10-fold leave-one-out cross validation for evaluation on set 1.
In this work, we focus on event recognition from pre-segmented video sequences with the event bounding boxes provided by the dataset. We further assume there is only one event per segment to recognize. Hence, the average recognition accuracy over all event categories, the recognition accuracy for each event, and the confusion matrices can effectively reflect the recognition performance. With the provided event bounding box, we further obtain the person and object (vehicle) bounding boxes through detection in the event region. The descriptors for person and object (vehicle) are then extracted within their corresponding bounding boxes. Erroneous person and object (vehicle) detections would have negative effects on event recognition. However, the holistic use of contexts at all three levels can compensate for the errors in person and object detection.
We implement the proposed deep hierarchical context model based upon the deep Boltzmann machine library
WANG AND JI: HIERARCHICAL CONTEXT MODELING FOR VIDEO EVENT RECOGNITION 1777
provided by Salakhutdinov [50]. Additionally, the RBF kernel SVM is used repeatedly in the following experiments. In each experiment, the optimal C and γ hyperparameters of the SVM are determined by cross validation on the training set.
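This hyperparameter selection step can be sketched with a standard cross-validated grid search. The parameter grids, the toy features, and the labels below are illustrative assumptions, not the actual values or data used in the paper.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(4)

# Toy stand-ins for BOW event feature vectors and their event labels.
X = rng.random((60, 20))
y = np.repeat(np.arange(6), 10)            # six event classes, 10 samples each

# Hypothetical search grids for the RBF-SVM hyperparameters C and gamma.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)
search.fit(X, y)

best = search.best_params_                 # cross-validated C and gamma
```

With random features the selected values are meaningless; on real BOW features the same procedure yields the per-experiment optimal C and γ described in the text.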
4.1 Experiments on the Context Features
The effectiveness of the proposed context features discussed in Section 3.1 is demonstrated by experiments on the VIRAT 2.0 Ground Dataset [1] with six types of events: LAV, UAV, OAT, CAT, GIV, and GOV.
4.1.1 Baseline Event Features
The baseline event features are traditional event features extracted only within the event bounding box. Here, we utilize the most widely used STIP [51] feature. A 500-word BOW model is used to transform the STIP points into the baseline event feature vector. It reaches 41.74 percent average accuracy with an RBF kernel SVM, as shown in Table 1.
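As a sketch of this encoding step, the following hard-assignment BOW histogram assumes 162-dimensional STIP-style descriptors (the usual HOG+HOF size) and a 500-word vocabulary. The random descriptors and the pre-built vocabulary are illustrative stand-ins; the paper's pipeline learns the vocabulary from training data.

```python
import numpy as np

rng = np.random.default_rng(3)

def bow_encode(descriptors, vocabulary):
    """Hard-assign each local descriptor to its nearest visual word and
    return an L1-normalized histogram over the vocabulary."""
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / max(hist.sum(), 1.0)

vocab = rng.random((500, 162))     # hypothetical 500-word visual vocabulary
stips = rng.random((80, 162))      # descriptors detected in one event clip
feature = bow_encode(stips, vocab)

assert feature.shape == (500,)
```

The resulting 500-dimensional histogram is the per-clip event feature vector fed to the SVM.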
4.1.2 Performance of Appearance Context Features
The appearance context feature is evaluated with different neighborhood sizes determined by the relative neighborhood size ratio. The results for the appearance context feature used alone and combined with the baseline event features are given in Table 1. We always use the RBF kernel SVM classifier, with the coefficients C and γ cross-validated to their optimal values on the training set each time. The BOW vocabulary size is set to 100.
From Table 1, we can see that using the appearance context features alone does not perform as well as the baseline event features for event recognition. However, combining the baseline event features with the appearance context features can always improve the recognition performance. Table 1 also indicates that the ratio 0.35 performs best: it significantly improves the baseline method, by about 6 percent.
4.1.3 Performance of Interaction Context Features
We also evaluate the performance of the interaction context features with the relative neighborhood size set to 0.35, which is the optimal neighborhood size according to Table 1. We test the feature performance with different vocabulary sizes K to find the best tradeoff between the vocabulary size and the recognition performance. Table 2 gives the results of this experiment.
Results in Table 2 tell us two things. First, the interaction context feature generally performs worse than the baseline event feature when used alone. However, the combination of the baseline event feature with the interaction context feature generally improves the performance, by up to 6 percent over the baseline method. Second, the vocabulary size K = 10 gives the best performance for the combined feature.
4.1.4 Performance of Combined Features
We further test combining the baseline event feature with both types of context features, where we choose the neighborhood size ratio 0.35 and the vocabulary size K = 10 for the interaction context feature. The final results are shown in Table 3.
From Table 3, we can see that combining either the appearance or the interaction context feature already improves the performance of the baseline event feature for event recognition. Combining both context features with the baseline event feature further improves the recognition accuracy. In all, combining our proposed context features can significantly improve the event recognition performance, by over 10 percent.
4.2 Experiments with Proposed Model
After discussing the performances of the proposed context features, we proceed to demonstrate the effectiveness of the proposed deep hierarchical context model that integrates three levels of contexts. For this model, h_p, h_o, h_r, and h_s have 50, 50, 100, and 20 hidden units respectively. These experiments are performed on VIRAT 1.0, VIRAT 2.0 with six and all events respectively, and the UT-Interaction Dataset.
4.2.1 Baselines and State-of-the-Art Methods
Three baseline approaches are used in our experiments to evaluate the effectiveness of the proposed deep hierarchical context model, denoted as Deep Model. The first baseline uses the SVM classifier with the STIP event feature, and is denoted as SVM-STIP. This approach does not use any contexts for event recognition. The second baseline, denoted as SVM-Context, concatenates the event feature with both the appearance and interaction context features, and also uses an SVM as the classifier. It hence evaluates the effectiveness of the proposed context features. The third baseline is the
TABLE 1
Performance of Appearance Context Feature with Different Neighborhood Sizes on Six Events of VIRAT 2.0

  Ratio        0.25     0.30     0.35     0.40     0.45
  App. Cont.   15.71%   19.54%   22.56%   20.59%   17.83%
  Baseline     41.74%   41.74%   41.74%   41.74%   41.74%
  Combined     46.47%   45.08%   47.87%   46.51%   42.18%

Ratio: the relative neighborhood size. App.: Appearance; Cont.: Context. Combined: the combined feature.
TABLE 2
Performance of Interaction Context Features with Different Vocabulary Sizes on Six Events of VIRAT 2.0

  K            8        10       12       15       18
  Int. Cont.   17.15%   19.64%   19.94%   21.13%   21.66%
  Baseline     41.74%   41.74%   41.74%   41.74%   41.74%
  Combined     43.99%   47.54%   44.70%   43.70%   40.42%

K: BOW vocabulary size. "Int. Cont." stands for the interaction context feature; its dimension is K^2. Combined: the combined feature.
TABLE 3
Performance of Proposed Context Features Combined with Baseline STIP Features on Six Events of VIRAT 2.0

             STIP     STIP + App. Cont.   STIP + Int. Cont.   STIP + App. & Int. Cont.
  Accuracy   41.74%   47.87%              47.54%              51.91%

App.: Appearance; Int.: Interaction; Cont.: Context.
Model-BM model in Fig. 7, which simultaneously integrates image level contexts and semantic level contexts. These baseline approaches are compared to our proposed model, which systematically integrates image, semantic, and prior level contexts.
For both the Model-BM and Deep Model, the event temporal orders are needed to incorporate the dynamic context. The UT-Interaction Dataset has a simple linear order of events without temporal overlapping. On the other hand, the VIRAT datasets contain multiple ongoing events partially overlapping in time. However, these temporally-overlapping events are largely uncorrelated due to their large spatial distance. Hence, in this approach, we model the temporal dependencies between event segments only when these segments are also spatially close. The sequence of event segments is then built according to the event temporal order with respect to the event starting time. If two events have the same starting time, the ending time is used to decide the order. If two spatially-close events are completely temporally overlapping, which is extremely rare, the algorithm uses the annotation order of the two video segments. The proposed model then learns the temporal dependencies between these event segments in sequence.
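This ordering rule can be sketched as a sort keyed on start time, then end time; since Python's sort is stable, fully overlapping segments keep their annotation order automatically. The segment tuples below are hypothetical.

```python
# Hypothetical event segments: (start_time, end_time, annotation_index, label).
segments = [
    (12.0, 20.0, 0, "GIV"),
    (5.0, 9.0, 1, "OAT"),
    (12.0, 18.0, 2, "CAT"),
    (5.0, 9.0, 3, "LAV"),   # completely overlaps the OAT segment in time
]

# Order by start time, break ties by end time; the stable sort preserves
# the annotation order for completely temporally-overlapping segments.
ordered = sorted(segments, key=lambda s: (s[0], s[1]))
```

Here the two segments starting at t = 5.0 with the same end time keep their annotation order, while the two starting at t = 12.0 are ordered by their ending times.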
Besides these baselines, we also compare our results to multiple state-of-the-art performances, including [11], [47], [52], [53] on the VIRAT 1.0 and 2.0 Ground Datasets, as well as [54], [55], [56], [57], [58], [59], [60] on the UT-Interaction Dataset.
4.2.2 Performance on VIRAT 1.0 Ground Dataset
We first compare our proposed Deep Model with the three baseline approaches (SVM-STIP, SVM-Context, and Model-BM) on the VIRAT 1.0 Ground Dataset. We show the recognition accuracy for each event and the average recognition accuracy over the six events. The detailed recognition accuracies for the three baselines and the proposed model are presented in Table 4.
In this comparison, the Model-BM baseline performs better than the SVM-Context baseline by over 8 percent. This result indicates that incorporating the semantic level context between events and the middle level representations of person and object can clearly improve the event recognition performance. More importantly, our proposed Deep Model outperforms the three baselines for four of the six events. In this experiment, the SVM-STIP approach faces great difficulty in distinguishing pairs of events that are similar in appearance, such as LAV and UAV. The proposed approach, by utilizing contexts, can alleviate the mismatches and improve the recognition for both events. For the average recognition accuracy over the six events, SVM-STIP reaches 39.91 percent, SVM-Context reaches 53.21 percent, and Model-BM reaches 62.15 percent. Our Deep Model performs the best at 69.88 percent. This is a 29 percent absolute improvement over SVM-STIP, and a 16 percent absolute improvement over SVM-Context.
Table 5 gives the comparison of our Deep Model with state-of-the-art performances on the VIRAT 1.0 Ground Dataset. Here, our Deep Model performs the best for three of the six events, and outperforms the BN model [47] by over 4 percent in the overall performance. This result demonstrates that our proposed model is more effective than the traditional BN based hierarchical context model in integrating three levels of context information for event recognition.
4.2.3 Performance on VIRAT 2.0 Ground Dataset
We further compare the performance of the proposed Deep Model with the three baselines SVM-STIP, SVM-Context, and Model-BM on the VIRAT 2.0 Ground Dataset for the recognition of six person-vehicle interaction events. As shown in Table 6, our Deep Model consistently outperforms the baseline approaches for each event, and improves the average recognition accuracy from 41.74 percent (SVM-STIP), 51.91 percent (SVM-Context), and 58.75 percent (Model-BM) to 66.45 percent (Deep Model). This is close to a 25 percent absolute improvement over SVM-STIP, and close to a 15 percent absolute improvement over SVM-Context.
The confusion matrices for SVM-STIP, SVM-Context, and the proposed Deep Model are further provided in Fig. 10. From Fig. 10a, we can see the SVM-STIP approach still faces difficulties in distinguishing pairs of events that are similar
TABLE 4
Performances of SVM-STIP, SVM-Context, Model-BM and Deep Model on VIRAT 1.0 Ground Dataset

  Accuracy %   SVM-STIP   SVM-Context   Model-BM   Deep Model
  LAV          33.33      33.33         66.67      66.67
  UAV          42.86      57.14         85.71      85.71
  OAT          10.00      60.00         40.00      50.00
  CAT          27.27      36.36         54.55      81.82
  GIV          61.29      67.74         61.29      64.52
  GOV          64.71      64.71         64.71      70.59
  Average      39.91      53.21         62.15      69.88
TABLE 5
The Comparison of Our Model with State-of-the-Art Approaches on VIRAT 1.0 Ground Dataset

  Accuracy %   Reddy et al. [52]   Zhu et al. [11]   BN [47]   Deep Model
  LAV          10.0                52.1              100       66.67
  UAV          16.3                57.5              71.4      85.71
  OAT          20.0                69.1              50.0      50.00
  CAT          34.4                72.8              54.5      81.82
  GIV          38.1                61.3              45.2      64.52
  GOV          61.3                64.6              73.5      70.59
  Average      35.6                62.9              65.8      69.88
TABLE 6
Performances of SVM-STIP, SVM-Context, Model-BM Baselines and the Proposed Deep Model Compared with BN [47] for Six Events on VIRAT 2.0 Dataset

  Accuracy %   SVM-STIP   SVM-Context   Model-BM   Deep Model   BN [47]
  LAV          44.44      66.67         66.67      66.67        77.78
  UAV          51.72      62.07         68.97      68.97        58.62
  OAT          10.00      15.00         25.00      45.00        35.00
  CAT          52.63      63.16         84.21      89.47        63.16
  GIV          58.33      64.58         52.08      70.83        68.75
  GOV          33.33      40.00         55.56      57.78        48.89
  Average      41.74      51.91         58.75      66.45        58.70
in appearance (e.g., "getting into a vehicle" (GIV) and "getting out of a vehicle" (GOV)). On the other hand, the SVM-Context approach can alleviate such mismatches between similar events. Moreover, the proposed Deep Model clearly reduces the mismatches between similar pairs of events by incorporating prior, semantic, and feature level contexts simultaneously.
On the VIRAT 2.0 dataset, we also experiment with three different variants of the proposed Deep Model as extra baselines. The first model variant is the baseline model excluding all hidden layers in Fig. 9. This model reaches 52.54 percent average accuracy, which is slightly better than SVM-Context, but around 14 percent worse than our proposed model. This result suggests that, with the introduction of hidden layers, the proposed deep model can effectively learn salient representations from the input and improve recognition performance.
The second model variant is a standard DBM model taking only e as input. To compare with our proposed model, we use two layers of hidden units for this standard DBM model, which is trained with the standard mean-field method and then fine-tuned. The model reaches 42.12 percent average accuracy, which is only slightly better than the SVM-STIP baseline, but much worse than our proposed context approaches. This result indicates that the improvement by our model is mainly attributed to the integration of contexts from all three levels, rather than merely to the DBM formulation.
In the third model variant, we incorporate two sets of hidden units h_e and h_c for the inputs e and c respectively. Both h_e and h_c are further connected to y, and serve as the latent representations of e and c. Beyond capturing the image level representation, such representations do not capture any semantic contextual information. Furthermore, they make the model more complex. This model reaches 64.67 percent average accuracy, which is close to but slightly worse than the proposed model. Hence, we do not introduce the additional latent layers for e and c.
4.2.4 Performance on Full VIRAT 2.0 Ground Dataset
We further experiment on the full VIRAT ground dataset with all provided events, and compare with the performance of state-of-the-art methods including the BN model [47]. For the events without an "interacting object", we take the event region excluding the person region as the "object" input. The recall (ratio of correct classifications over all test samples of the event) and precision (ratio of correct classifications over all test samples classified into the event), averaged over all 11 types of events, are given in Table 7. In this comparison, our proposed model has a higher averaged precision, and slightly outperforms the BN [47] in averaged recall. This indicates a smaller improvement when the "interacting object" is lacking.
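Under the definitions just given, per-event recall and precision follow directly from a confusion matrix, with recall normalizing each row and precision each column. The 3-class matrix below is illustrative, not the paper's results.

```python
import numpy as np

# Hypothetical confusion matrix C: C[i, j] counts test samples of
# event i that were classified as event j.
C = np.array([[8, 2, 0],
              [1, 6, 3],
              [0, 2, 8]], dtype=float)

# Recall: correct classifications over all test samples of the event (row sums).
recall = np.diag(C) / C.sum(axis=1)
# Precision: correct classifications over all samples classified into the event (column sums).
precision = np.diag(C) / C.sum(axis=0)

avg_recall, avg_precision = recall.mean(), precision.mean()
```

Averaging these per-event values over all 11 event types yields the summary numbers reported in Table 7.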
4.2.5 Performance on UT-Interaction Dataset
The UT-Interaction Dataset is a surveillance video dataset with person-person interaction events. To experiment on this dataset, we turn the object input of the model into a secondary person input. Specifically, we first utilize HOG feature based person detectors to detect the two persons within the event bounding box of the video. The STIP features for each of the two persons are then extracted accordingly. To compare with state-of-the-art performances, we use the Fisher Vector encoding method [61], [62] for the STIP event feature. We experiment on Set 1 of this dataset. The overall performances of SVM-STIP and our proposed Deep Model, as well as different state-of-the-art performances, are listed in Table 8.
The state-of-the-art performances listed in Table 8 are mainly target-centered, descriptor-based approaches. Our SVM-STIP baseline, which is the most standard target-centered descriptor-based approach, does not perform as well as many of these approaches. However, our Deep Model can further improve on the SVM-STIP baseline, and reaches the best performance. In addition, our Deep Model outperforms
Fig. 10. Confusion matrices for the recognition of six person-vehicle interaction events on VIRAT 2.0 Ground Dataset with the SVM-STIP, SVM-Context, and the proposed Deep Model.
TABLE 7
Comparisons with State-of-the-Art Methods for Recognition of All Events from VIRAT

               Amer et al. [53]   Zhu et al. [11]   BN [47]   Our Model
  Precision    72%                71.8%             74.73%    76.50%
  Recall       70%                73.5%             77.42%    77.47%
TABLE 8
Overall Recognition Accuracies Compared to State-of-the-Art Methods on UT-Interaction Dataset
the approach by Raptis and Sigal [58], which captures thetemporal context between key frames.
In this work, we use "human-object" interactions as example events. However, we have shown through these experiments that the proposed model applies not only to human-object interaction events (as in the VIRAT events in Sections 4.2.2 and 4.2.3), but also to human-human interaction events (as in the UT-Interaction events in Section 4.2.5). Furthermore, as discussed in Section 4.2.4, the model can also be applied to events without an "interacting object".
5 CONCLUSION
In this paper, we propose a deep Boltzmann machine based context model to integrate image level, semantic level, and prior level contexts. We first introduce two new image context features: the appearance context feature and the interaction context feature. These features capture the appearance of the contextual objects and their interactions with the event objects. Then, we introduce a deep context model to learn the semantic context. We further introduce two prior level contexts: scene priming and dynamic cueing. Finally, we introduce a hierarchical deep model to integrate contexts at all three levels. The model is trained with a mean-field based approximate learning method, and can be directly used to infer event classes through Gibbs sampling. We evaluate our model's performance on the VIRAT 1.0 Ground Dataset, the VIRAT 2.0 Ground Dataset, and the UT-Interaction Dataset for recognizing real-world surveillance video events with complex backgrounds. The results with the proposed deep context model show significant improvements over baseline approaches that also utilize multiple levels of contexts. In addition, the proposed model outperforms state-of-the-art methods on the benchmark datasets.
Despite the significant improvements our methods have achieved on benchmark datasets, they still have several limitations, which we will address in the future. First, this work relies on pre-segmented video sequences and bounding boxes provided by others. This could limit our methods' practical utility. One possible future direction is simultaneous event bounding box detection and event recognition. Second, our dynamic cueing modeling is standard Markov chain modeling. The benefit from the dynamic cueing context could be limited when the events do not have a temporal order. More complex temporal modeling could be developed to extend the Markov chain modeling. Finally, the current model is designed for two interacting entities, typically one person and one object. It cannot be directly applied to scenarios where more than two entities are involved. Besides extending the model to capture interactions among multiple entities by adding additional visible and hidden nodes, another direction is to model interactions among entities whose number is unknown and varies over time with nonparametric Bayesian models.
ACKNOWLEDGMENTS
This work is supported in part by the Defense Advanced Research Projects Agency under grants HR0011-08-C-0135-S8 and HR0011-10-C-0112, and by the Army Research Office under grant W911NF-13-1-0395.
REFERENCES
[1] S. Oh, et al., "A large-scale benchmark dataset for event recognition in surveillance video," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2011, pp. 3153–3160.
[2] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2005, pp. 886–893.
[3] I. Laptev, "On space-time interest points," Int. J. Comput. Vis., vol. 64, no. 2/3, pp. 107–123, 2005.
[4] R. Cutler and M. Turk, "View-based interpretation of real-time optical flow for gesture recognition," in Proc. IEEE Int. Conf. Automatic Face and Gesture Recognit., 1998, pp. 416–416.
[5] B. Yao and L. Fei-Fei, "Modeling mutual context of object and human pose in human-object interaction activities," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2010, pp. 17–24.
[6] D. Vail, M. M. Veloso, and J. D. Lafferty, "Conditional random fields for activity recognition," in Proc. 6th Int. Joint Conf. Auton. Agents Multiagent Syst., 2007, pp. 1331–1338.
[7] F. Lv and R. Nevatia, "Recognition and segmentation of 3D human action using HMM and multi-class AdaBoost," in Proc. 9th Eur. Conf. Comput. Vis., 2006, pp. 359–372.
[8] G. Yang, Y. Lin, and P. Bhattacharya, "A driver fatigue recognition model based on information fusion and dynamic Bayesian network," Inf. Sci., vol. 180, no. 10, pp. 1942–1954, 2010.
[9] A. Kovashka and K. Grauman, "Learning a hierarchy of discriminative space-time neighborhood features for human action recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2010, pp. 2046–2053.
[10] J. Wang, Z. Chen, and Y. Wu, "Action recognition with multiscale spatio-temporal contexts," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2011, pp. 3185–3192.
[11] Y. Zhu, N. M. Nayak, and A. K. R. Chowdhury, "Context-aware modeling and recognition of activities in video," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013, pp. 2491–2498.
[12] A. Gupta and L. Davis, "Objects in action: An approach for combining action understanding and object perception," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2007, pp. 1–8.
[13] L. Li and L. Fei-Fei, "What, where and who? Classifying events by scene and object recognition," in Proc. IEEE Int. Conf. Comput. Vis., 2007, pp. 1–8.
[14] A. Torralba, "Contextual priming for object detection," Int. J. Comput. Vis., vol. 53, no. 2, pp. 169–191, 2003.
[15] A. Gallagher and T. Chen, "Estimating age, gender, and identity using first name priors," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2008, pp. 1–8.
[16] S. Oh and A. Hoogs, "Unsupervised learning of activities in video using scene context," in Proc. 20th Int. Conf. Pattern Recognit., 2010, pp. 3579–3582.
[17] X. Wang and Q. Ji, "Incorporating contextual knowledge to dynamic Bayesian networks for event recognition," in Proc. 21st Int. Conf. Pattern Recognit., 2012, pp. 3378–3381.
[18] X. Wang and Q. Ji, "Context augmented dynamic Bayesian networks for event recognition," Pattern Recognit. Lett., vol. 43, pp. 62–70, 2014.
[19] S. Belongie, J. Malik, and J. Puzicha, "Shape matching and object recognition using shape contexts," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 4, pp. 509–522, Apr. 2002.
[20] M. Ryoo and J. Aggarwal, "UT-Interaction dataset, ICPR contest on semantic description of human activities," 2010. [Online]. Available: http://cvrc.ece.utexas.edu/SDHA2010/Human_Interaction.html
[21] T. Lan, Y. Wang, W. Yang, S. N. Robinovitch, and G. Mori, "Discriminative latent models for recognizing contextual group activities," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 8, pp. 1549–1562, Aug. 2012.
[22] E. Sudderth, A. Torralba, W. T. Freeman, and A. S. Willsky, "Learning hierarchical models of scenes, objects, and parts," in Proc. IEEE Int. Conf. Comput. Vis., 2005, pp. 1331–1338.
[23] V. Escorcia and J. C. Niebles, "Spatio-temporal human-object interactions for action recognition in videos," in Proc. IEEE Int. Conf. Comput. Vis. Workshop, 2013, pp. 508–514.
[24] F. Yuan, G. S. Xia, H. Sahbi, and V. Prinet, "Mid-level features and spatio-temporal context for activity recognition," Pattern Recognit., vol. 45, no. 12, pp. 4182–4191, 2012.
[25] V. Ramanathan, B. Yao, and L. Fei-Fei, "Social role discovery in human events," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013, pp. 2475–2482.
[26] J. Sun, et al., "Hierarchical spatio-temporal context modeling for action recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2009, pp. 2004–2011.
[27] S. Park and J. Aggarwal, "A hierarchical Bayesian network for event recognition of human actions and interactions," Multimedia Syst., vol. 10, no. 2, pp. 164–179, 2004.
[28] J. Niebles and L. Fei-Fei, "A hierarchical model of shape and appearance for human action classification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2007, pp. 1–8.
[29] T. Hospedales, S. Gong, and T. Xiang, "Video behaviour mining using a dynamic topic model," Int. J. Comput. Vis., vol. 98, no. 3, pp. 303–323, 2012.
[30] J. Varadarajan, R. Emonet, and J. M. Odobez, "A sequential topic model for mining recurrent activities from long term video logs," Int. J. Comput. Vis., vol. 103, no. 1, pp. 100–126, 2013.
[31] G. Hinton, S. Osindero, and Y. W. Teh, "A fast learning algorithm for deep belief nets," Neural Comput., vol. 18, no. 7, pp. 1527–1554, 2006.
[32] R. Salakhutdinov and G. Hinton, "Deep Boltzmann machines," in Proc. 12th Int. Conf. Artif. Intell. Statist., 2009, pp. 448–455.
[33] R. Salakhutdinov and G. Hinton, "An efficient learning procedure for deep Boltzmann machines," Neural Comput., vol. 24, no. 8, pp. 1967–2006, 2012.
[34] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, "Greedy layer-wise training of deep networks," in Proc. Advances Neural Inf. Process. Syst., 2007, pp. 153–160.
[35] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, "Extracting and composing robust features with denoising autoencoders," in Proc. 25th Int. Conf. Mach. Learn., 2008, pp. 1096–1103.
[36] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[37] S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 1, pp. 221–231, Jan. 2013.
[38] A. Karpathy, G. Toderici, S. Shetty, T. Leung, and R. Sukthankar, "Large-scale video classification with convolutional neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 1725–1732.
[39] G. Taylor, R. Fergus, Y. LeCun, and C. Bregler, "Convolutional learning of spatio-temporal features," in Proc. 11th Eur. Conf. Comput. Vis., 2010, pp. 140–153.
[40] Q. Le, W. Y. Zou, S. Y. Yeung, and A. Y. Ng, "Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2011, pp. 3361–3368.
[41] M. Hasan and A. Roy-Chowdhury, "Continuous learning of human activity models using deep nets," in Proc. 13th Eur. Conf. Comput. Vis., 2014, pp. 705–720.
[42] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, "Multimodal deep learning," in Proc. 28th Int. Conf. Mach. Learn., 2011, pp. 689–696.
[43] N. Srivastava and R. Salakhutdinov, "Multimodal learning with deep Boltzmann machines," in Proc. Advances Neural Inf. Process. Syst., 2012, pp. 2222–2230.
[44] N. Srivastava and R. Salakhutdinov, "Multimodal learning with deep Boltzmann machines," J. Mach. Learn. Res., vol. 15, no. 9, pp. 2949–2980, 2014.
[45] X. He, R. S. Zemel, and M. A. Carreira-Perpinan, "Multiscale conditional random fields for image labeling," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2004, pp. 695–702.
[46] X. Zeng, W. Ouyang, and X. Wang, "Multi-stage contextual deep learning for pedestrian detection," in Proc. IEEE Int. Conf. Comput. Vis., 2013, pp. 121–128.
[47] X. Wang and Q. Ji, "A hierarchical context model for event recognition in surveillance video," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 2561–2568.
[48] X. Wang and Q. Ji, "Video event recognition with deep hierarchical context model," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 4418–4427.
[49] D. Lowe, "Object recognition from local scale-invariant features," in Proc. IEEE Int. Conf. Comput. Vis., 1999, pp. 1150–1157.
[50] R. Salakhutdinov, Learning deep Boltzmann machines, 2012. [Online]. Available: http://www.cs.toronto.edu/~rsalakhu/DBM.html
[51] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, "Learning realistic human actions from movies," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2008, pp. 1–8.
[52] K. Reddy, N. Cuntoor, A. Perera, and A. Hoogs, "Human action recognition in large-scale datasets using histogram of spatiotemporal gradients," in Proc. IEEE 9th Int. Conf. Advanced Video Signal-Based Surveillance, 2012, pp. 106–111.
[53] M. Amer and S. Todorovic, "Sum-product networks for modeling activities with stochastic structure," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2012, pp. 1314–1321.
[54] M. Ryoo and J. Aggarwal, "Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities," in Proc. IEEE Int. Conf. Comput. Vis., 2009, pp. 1593–1600.
[55] M. Ryoo, "Human activity prediction: Early recognition of ongoing activities from streaming videos," in Proc. IEEE Int. Conf. Comput. Vis., 2011, pp. 1036–1043.
[56] D. Waltisberg, A. Yao, J. Gall, and L. van Gool, "Variations of a Hough-voting action recognition system," in Proc. Recognizing Patterns Signals Speech Images Videos, 2010, pp. 306–312.
[57] G. Yu, J. Yuan, and Z. Liu, "Propagative Hough voting for human activity recognition," in Proc. 12th Eur. Conf. Comput. Vis., 2012, pp. 693–706.
[58] M. Raptis and L. Sigal, "Poselet key-framing: A model for human activity recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013, pp. 2650–2657.
[59] S. Shariat and V. Pavlovic, "A new adaptive segmental matching measure for human activity recognition," in Proc. IEEE Int. Conf. Comput. Vis., 2013, pp. 3583–3590.
[60] Y. Zhang, X. Liu, M.-C. Chang, W. Ge, and T. Chen, "Spatio-temporal phrases for activity recognition," in Proc. 12th Eur. Conf. Comput. Vis., 2012, pp. 707–721.
[61] F. Perronnin and C. Dance, "Fisher kernels on visual vocabularies for image categorization," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2007, pp. 1–8.
[62] F. Perronnin, J. Sanchez, and T. Mensink, “Improving the Fisherkernel for large-scale image classification,” in Proc. 11th Eur. Conf.Comput. Vis., 2010, pp. 143–156.
Xiaoyang Wang received the BS and MS degrees, both from Tsinghua University, Beijing, China, in 2007 and 2010, respectively, and the PhD degree from Rensselaer Polytechnic Institute, Troy, New York, in 2015. He currently works with Nokia Bell Labs, Murray Hill, New Jersey, as a researcher. His research interests include video event recognition, object recognition, attribute prediction, context modeling, and probabilistic graphical models. He received the ICPR Piero Zamperoni Best Student Paper Award in 2012. He is a member of the IEEE.
Qiang Ji received the PhD degree from the University of Washington. He is currently a Professor in the Department of Electrical, Computer, and Systems Engineering, RPI. From January 2009 to August 2010, he served as a program director at the National Science Foundation, managing NSF's machine learning and computer vision programs. Prior to joining RPI in 2001, he was an assistant professor in the Department of Computer Science, University of Nevada, Reno. He also held research and visiting positions at the Beckman Institute, University of Illinois at Urbana-Champaign, the Robotics Institute, Carnegie Mellon University, and the US Air Force Research Laboratory. He is a fellow of the IEEE and the IAPR.
1782 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 39, NO. 9, SEPTEMBER 2017
Local Causal Discovery of Direct Causes and Effects
Tian Gao, Qiang Ji
Department of ECSE
Rensselaer Polytechnic Institute, Troy, NY 12180
{gaot, jiq}@rpi.edu
Abstract
We focus on the discovery and identification of direct causes and effects of a target variable in a causal network. State-of-the-art causal learning algorithms generally need to find the global causal structures in the form of complete partial directed acyclic graphs (CPDAGs) in order to identify the direct causes and effects of a target variable. While these algorithms are effective, it is often unnecessary and wasteful to find the global structures when we are only interested in the local structure of one target variable (such as class labels). We propose a new local causal discovery algorithm, called Causal Markov Blanket (CMB), to identify the direct causes and effects of a target variable based on Markov blanket discovery. CMB is designed to conduct causal discovery among multiple variables, but focuses only on finding causal relationships between a specific target variable and other variables. Under standard assumptions, we show both theoretically and experimentally that the proposed local causal discovery algorithm can obtain identification accuracy comparable to global methods while significantly improving their efficiency, often by more than one order of magnitude.
1 Introduction
Causal discovery is the process of identifying the causal relationships among a set of random variables. It not only can aid predictions and classifications like feature selection [4], but can also help predict the consequences of given actions, facilitate counterfactual inference, and help explain the underlying mechanisms of the data [13]. A lot of research effort has been focused on predicting causality from observational data [13, 18]. This work can be roughly divided into two sub-areas: causal discovery between a pair of variables and among multiple variables. We focus on multivariate causal discovery, which searches for correlations and dependencies among variables in causal networks [13]. Causal networks can be used for local or global causal prediction, and thus they can be learned locally and globally. Many causal discovery algorithms for causal networks have been proposed, and the majority of them are global learning algorithms, as they seek to learn global causal structures. The Spirtes-Glymour-Scheines (SGS) [18] and Peter-Clark (P-C) [19] algorithms test for the existence of edges between every pair of nodes in order to first find the skeleton, or undirected edges, of a causal network, and then discover all the V-structures, resulting in a partially directed acyclic graph (PDAG). The last step of these algorithms is then to orient the rest of the edges as much as possible using the Meek rules [10] while maintaining consistency with the existing edges. Given a causal network, causal relationships among variables can be directly read off the structure.
Due to the complexity of the P-C algorithm and unreliable high-order conditional independence tests [9], several works [23, 15] have incorporated Markov blanket (MB) discovery into causal discovery with a local-to-global approach. The Growth and Shrink (GS) [9] algorithm uses the MBs of each node to build the skeleton of a causal network, discovers all the V-structures, and then uses the Meek rules to complete the global causal structure. The max-min hill climbing (MMHC) [23] algorithm also finds the MBs of each variable first, but then uses the MBs as constraints to reduce the search space for score-based standard hill climbing structure learning methods. In [15], the authors
Advances in Neural Information Processing Systems 28 (NIPS 2015)
use Markov blanket with Collider Sets (CS) to improve the efficiency of the GS algorithm by combining the spouse and V-structure discovery. All these local-to-global methods rely on the global structure to find the causal relationships and require finding the MBs for all nodes in a graph, even if the interest is the causal relationships between one target variable and other variables. Different MB discovery algorithms can be used, and they can be divided into two different approaches: non-topology-based and topology-based. Non-topology-based methods [5, 9], used by the CS and GS algorithms, greedily test the independence between each variable and the target by directly using the definition of the Markov blanket. In contrast, more recent topology-based methods [22, 1, 11] aim to improve the data efficiency while maintaining a reasonable time complexity by finding the parents and children (PC) set first and then the spouses to complete the MB.
Local learning of causal networks generally aims to identify a subset of causal edges in a causal network. The Local Causal Discovery (LCD) algorithm and its variants [3, 17, 7] aim to find causal edges by testing the dependence/independence relationships among every four-variable set in a causal network. Bayesian Local Causal Discovery (BLCD) [8] explores the Y-structures among MB nodes to infer causal edges [6]. While LCD/BLCD algorithms aim to identify a subset of causal edges via special structures among all variables, we focus on finding all the causal edges adjacent to one target variable. In other words, we want to find the causal identity of each node, in terms of direct causes and effects, with respect to one target node. We first use Markov blankets to find the direct causes and effects, and then propose a new Causal Markov Blanket (CMB) discovery algorithm, which determines the exact causal identities of the MB nodes of a target node by tracking their conditional independence changes, without finding the global causal structure of the causal network. The proposed CMB algorithm is a complete local discovery algorithm and can identify the same direct causes and effects for a target variable as global methods under standard assumptions. CMB is more scalable than global methods, more efficient than local-to-global methods, and is complete in identifying the direct causes and effects of one target while other local methods are not.
2 Background
We use V to represent the variable space, capital letters (such as X, Y) to represent variables, bold letters (such as Z, MB) to represent variable sets, and |Z| to represent the size of set Z. X ⊥⊥ Y and X ⊥̸⊥ Y represent independence and dependence between X and Y, respectively. We assume readers are familiar with the related concepts in causal network learning and only review a few major ones here. In a causal network or causal Bayesian network [13], nodes correspond to the random variables in a variable set V. Two nodes are adjacent if they are connected by an edge. A directed edge from node X to node Y, (X, Y) ∈ V, indicates that X is a parent or direct cause of Y and Y is a child or direct effect of X [12]. Moreover, if there is a directed path from X to Y, then X is an ancestor of Y and Y is a descendant of X. If nonadjacent X and Y have a common child, X and Y are spouses. Three nodes X, Y, and Z form a V-structure [12] if Y has two incoming edges from X and Z, forming X → Y ← Z, and X is not adjacent to Z. Y is a collider in a path if Y has two incoming edges in this path. A Y with nonadjacent parents X and Z is an unshielded collider. A path J from node X to Y is blocked [12] by a set of nodes Z if either of the following holds: 1) there is a non-collider node in J belonging to Z; 2) there is a collider node C on J such that neither C nor any of its descendants belongs to Z. Otherwise, J is unblocked or active.
A PDAG is a graph that may have both undirected and directed edges and has at most one edge between any pair of nodes [10]. CPDAGs [2] represent Markov equivalence classes of DAGs, capturing the same conditional independence relationships with the same skeleton but potentially different edge orientations. CPDAGs contain directed edges that have the same orientation for every DAG in the equivalence class and undirected edges that have reversible orientations in the equivalence class. Let G be the causal DAG of a causal network with variable set V and P be the joint probability distribution over the variables in V. G and P satisfy the causal Markov condition [13] if and only if, ∀X ∈ V, X is independent of the non-effects of X given its direct causes. The causal faithfulness condition [13] states that G and P are faithful to each other if each and every independence and conditional independence entailed by P is present in G. It enables the recovery of G from sampled data of P. Another widely used assumption of existing causal discovery algorithms is causal sufficiency [12]. A set of variables X ⊆ V is causally sufficient if no set of two or more variables in X shares a common cause variable outside V. Without the causal sufficiency assumption, latent confounders between adjacent nodes would be modeled by bi-directed edges [24]. We also assume no selection bias [20], and
we can capture the same independence relationships among variables from the sampled data as theones from the entire population.
Many concepts and properties of a DAG hold in causal networks, such as d-separation and the MB. A Markov blanket [12] of a target variable T, MBT, in a causal network is the minimal set of nodes conditioned on which all other nodes are independent of T, denoted as X ⊥⊥ T | MBT, ∀X ⊆ {V \ T} \ MBT. Given an unknown distribution P that satisfies the Markov condition with respect to an unknown DAG G0, Markov blanket discovery is the process used to estimate the MB of a target node in G0 from independently and identically distributed (i.i.d.) data D of P. Under the causal faithfulness assumption between G0 and P, the MB of a target node T is unique and is the set of parents, children, and spouses of T (i.e., other parents of children of T) [12]. In addition, the parents and children set of T, PCT, is also unique. Intuitively, the MB can directly facilitate causal discovery. If conditioning on the MB of a target variable T renders a variable X independent of T, then X cannot be a direct cause or effect of T. From the local causal discovery point of view, although the MB may contain nodes with different causal relationships with the target, it is reasonable to believe that we can identify their relationships exactly, up to the Markov equivalence, with further tests.
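Given a known structure, this characterization of the MB (parents ∪ children ∪ spouses) is mechanical. A minimal Python sketch over the DAG of Figure 1a (edges A→C, B→C, C→D, D→E, D→G, E→F, as inferred from the case study in Section 3; the representation and function names are our own illustration, not the authors' code):

```python
# DAG of Figure 1a, hand-coded as a parent map (illustrative only).
PARENTS = {"A": set(), "B": set(), "C": {"A", "B"}, "D": {"C"},
           "E": {"D"}, "F": {"E"}, "G": {"D"}}

def children(t):
    """All nodes that list t among their parents."""
    return {n for n, ps in PARENTS.items() if t in ps}

def markov_blanket(t):
    """MB(T) = parents(T) ∪ children(T) ∪ spouses(T),
    where the spouses are the other parents of T's children."""
    ch = children(t)
    spouses = set().union(*(PARENTS[c] for c in ch)) - {t} if ch else set()
    return set(PARENTS[t]) | ch | spouses
```

For example, `markov_blanket("E")` yields `{"D", "F"}`: E's parent D and child F, with no spouses since F has no other parent.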
Lastly, existing causal network learning algorithms all use the three Meek rules [10], which we assume the readers are familiar with, to orient as many edges as possible given all V-structures in PDAGs to obtain the CPDAG. The basic idea is to orient the edges so that 1) the edge directions do not introduce new V-structures, 2) the no-cycle property of a DAG is preserved, and 3) 3-fork V-structures are enforced.
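As an illustration of the first of these rules (no new V-structures), a minimal Python sketch over explicit edge sets; the representation and names are ours, and the other two rules are omitted:

```python
def meek_rule1(directed, undirected):
    """Repeatedly orient b - c as b -> c whenever a -> b, b - c exist and
    a, c are nonadjacent: orienting c -> b instead would create a new
    V-structure a -> b <- c. directed holds (a, b) pairs meaning a -> b;
    undirected holds frozenset({b, c}) pairs."""
    def adjacent(x, y):
        return (frozenset({x, y}) in undirected
                or (x, y) in directed or (y, x) in directed)

    changed = True
    while changed:
        changed = False
        for a, b in list(directed):
            for e in list(undirected):
                if b in e:
                    c = next(iter(e - {b}))
                    if not adjacent(a, c):
                        undirected.discard(e)   # orient b - c ...
                        directed.add((b, c))    # ... as b -> c
                        changed = True
    return directed, undirected
```

With X → Y given and Y − Z, Z − W undirected (X, Z and Y, W nonadjacent), the rule propagates to orient Y → Z and then Z → W.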
3 Local Causal Discovery of Direct Causes and Effects
Existing MB discovery algorithms do not directly offer the exact causal identities of the learned MB nodes of a target. Although the topology-based methods can find the PC set of the target within the MB set, they can only provide the causal identities of some children and spouses that form V-structures. Nevertheless, following existing works [4, 15], under standard assumptions, every PC variable of a target can only be its direct cause or effect:

Theorem 1. Causality within an MB. Under the causal faithfulness, sufficiency, correct independence tests, and no selection bias assumptions, the parent and child nodes within a target's MB set in a causal network contain all and only the direct causes and effects of the target variable.
The proof can be directly derived from the PC set definition of a causal network. Therefore, using the topology-based MB discovery methods, if we can discover the exact causal identities of the PC nodes within the MB, causal discovery of the direct causes and effects of the target can be successfully accomplished.
Building on MB discovery, we propose a new local causal discovery algorithm, Causal Markov Blanket (CMB) discovery, as shown in Algorithm 1. It identifies the direct causes and effects of a target variable without the need to find the global structure or the MBs of all other variables in a causal network. CMB has three major steps: 1) find the MB set of the target and identify some direct causes and effects by tracking the independence relationship changes among a target's PC nodes before and after conditioning on the target node; 2) repeat Step 1 but conditioned on one PC node's MB set; and 3) repeat Steps 1 and 2 with unidentified neighboring nodes as new targets to identify more direct causes and effects of the original target.
Step 1: Initial identification. CMB first finds the MB nodes of a target T, MBT, using a topology-based MB discovery algorithm that also finds PCT. CMB then uses the CausalSearch subroutine, shown in Algorithm 2, to get the initial causal identities of variables in PCT by checking every variable pair in PCT according to Lemma 1.

Lemma 1. Let (X, Y) ∈ PCT, the PC set of the target T ∈ V in a causal DAG. The independence relationships between X and Y can be divided into the following four conditions:
C1 X ⊥⊥ Y and X ⊥⊥ Y | T; this condition cannot happen.

C2 X ⊥⊥ Y and X ⊥̸⊥ Y | T ⇒ X and Y are both parents of T.

C3 X ⊥̸⊥ Y and X ⊥⊥ Y | T ⇒ at least one of X and Y is a child of T.

C4 X ⊥̸⊥ Y and X ⊥̸⊥ Y | T ⇒ their identities are inconclusive and need further tests.
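With a d-separation oracle standing in for the statistical independence tests, Lemma 1's case analysis can be checked mechanically. A minimal sketch using the moralized-ancestral-graph criterion for the oracle, on the hand-coded Figure 1a DAG (A→C, B→C, C→D, D→E, D→G, E→F); the DAG and all names are our own illustration:

```python
from itertools import combinations

PARENTS = {"A": set(), "B": set(), "C": {"A", "B"}, "D": {"C"},
           "E": {"D"}, "F": {"E"}, "G": {"D"}}

def _ancestors(nodes):
    seen, stack = set(), list(nodes)
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(PARENTS[n])
    return seen

def independent(x, y, z=()):
    """d-separation oracle for X ⊥⊥ Y | Z: restrict to ancestors of
    {x, y} ∪ z, moralize (marry co-parents, drop directions), remove z,
    and test whether x can still reach y."""
    sub = _ancestors({x, y} | set(z))
    adj = {n: set() for n in sub}
    for n in sub:
        ps = PARENTS[n] & sub
        for p in ps:                       # connect each node to its parents
            adj[n].add(p); adj[p].add(n)
        for p, q in combinations(ps, 2):   # "marry" co-parents
            adj[p].add(q); adj[q].add(p)
    blocked, stack, seen = set(z), [x], {x}
    while stack:
        n = stack.pop()
        if n == y:
            return False                   # still reachable -> dependent
        for m in adj[n]:
            if m not in seen and m not in blocked:
                seen.add(m); stack.append(m)
    return True                            # separated -> independent

def lemma1(x, y, t):
    """Classify a PC pair (x, y) of target t into C1-C4."""
    marg, cond = independent(x, y), independent(x, y, (t,))
    return {(True, True): "C1", (True, False): "C2",
            (False, True): "C3", (False, False): "C4"}[(marg, cond)]
```

For target C, the pair (A, B) comes out as C2 (collider at C), and for target E the pair (D, F) comes out as C3, matching the examples in the text.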
Algorithm 1 Causal Markov Blanket Discovery Algorithm
1: Input: D: data; T: target variable
2: Output: IDT: the causal identities of all nodes with respect to T
   {Step 1: Establish initial ID}
3: IDT = zeros(|V|, 1);
4: (MBT, PCT) ← FindMB(T, D);
5: Z ← ∅;
6: IDT ← CausalSearch(D, T, PCT, Z, IDT);
   {Step 2: Further test variables with idT = 4}
7: for one X in each pair (X, Y) with idT = 4 do
8:   MBX ← FindMB(X, D);
9:   Z ← {MBX \ T} \ Y;
10:  IDT ← CausalSearch(D, T, PCT, Z, IDT);
11:  if no element of IDT is equal to 4, break;
12: for every pair of parents (X, Y) of T do
13:   if ∃Z s.t. (X, Z) and (Y, Z) are idT = 4 pairs then
14:     IDT(Z) = 1;
15: IDT(X) ← 3, ∀X s.t. IDT(X) = 4;
   {Step 3: Resolve variable set with idT = 3}
16: for each X with idT = 3 do
17:   recursively find IDX, without going back to the already queried variables;
18:   update IDT according to IDX;
19:   if IDX(T) = 2 then
20:     IDT(X) = 1;
21:     for every Y in idT = 3 variable pairs (X, Y) do
22:       IDT(Y) = 2;
23:   if no element of IDT is equal to 3, break;
24: Return: IDT
Algorithm 2 CausalSearch Subroutine
1: Input: D: data; T: target variable; PCT: the PC set of T; Z: the conditioned variable set; ID: current ID
2: Output: IDT: the new causal identities of all nodes with respect to T
   {Step 1: Single PC}
3: if |PCT| = 1 then
4:   IDT(PCT) ← 3;
   {Step 2: Check C2 & C3}
5: for every X, Y ∈ PCT do
6:   if X ⊥⊥ Y | Z and X ⊥̸⊥ Y | T ∪ Z then
7:     IDT(X) ← 1; IDT(Y) ← 1;
8:   else if X ⊥̸⊥ Y | Z and X ⊥⊥ Y | T ∪ Z then
9:     if IDT(X) = 1 then
10:      IDT(Y) ← 2
11:    else if IDT(Y) ≠ 2 then
12:      IDT(Y) ← 3
13:    if IDT(Y) = 1 then
14:      IDT(X) ← 2
15:    else if IDT(X) ≠ 2 then
16:      IDT(X) ← 3
17:    add (X, Y) to pairs with idT = 3
18:  else
19:    if IDT(X) & IDT(Y) = 0 or 4 then
20:      IDT(X) ← 4; IDT(Y) ← 4
21:      add (X, Y) to pairs with idT = 4
   {Step 3: Identify idT = 3 pairs with known parents}
22: for every X such that IDT(X) = 1 do
23:   for every Y in idT = 3 variable pairs (X, Y) do
24:     IDT(Y) ← 2;
25: Return: IDT
C1 does not happen because the path X − T − Y is unblocked whether or not we condition on T, and the unblocked path makes X and Y dependent on each other. C2 implies that X and Y form a V-structure with T as the corresponding collider, such as node C in Figure 1a, which has two parents A and B. C3 indicates that the paths between X and Y are blocked conditioned on T, which means that either one of (X, Y) is a child of T and the other is a parent, or both of (X, Y) are children of T. For example, nodes D and F in Figure 1a satisfy this condition with respect to E. C4 shows that there may be another unblocked path between X and Y besides X − T − Y. For example, in Figure 1b, nodes D and C have multiple paths between them besides D − T − C. Further tests are needed to resolve this case.
Notation-wise, we use IDT to represent the causal identities of all the nodes with respect to T, IDT(X) as variable X's causal identity to T, and the lower-case idT as the individual ID of a node to T. We also use IDX to represent the causal identities of nodes with respect to node X. To avoid changing the already identified PCs, CMB establishes a priority system¹. We use idT = 1 to represent nodes that are parents of T, idT = 2 for children of T, idT = 3 to represent a pair of nodes that cannot both be parents (and/or ambiguous pairs from Markov equivalent structures, to be discussed in Step 2), and idT = 4 to represent inconclusiveness. A lower-numbered id cannot be changed
¹Note that the identification number is slightly different from the condition number in Lemma 1.
Figure 1: a) A sample causal network. b) A sample network with C4 nodes. The only active path between D and C conditioned on MBC \ {T, D} is D − T − C.
into a higher number (shown by Lines 11∼15 of Algorithm 2). If a variable pair satisfies C2, both will be labeled as parents (Line 7 of Algorithm 2). If a variable pair satisfies C3, one of them is labeled as idT = 2 only if the other variable within the pair is already identified as a parent; otherwise, they are both labeled as idT = 3 (Lines 9∼12 and 15∼17 of Algorithm 2). If a PC node remains inconclusive with idT = 0, it is labeled as idT = 4 in Line 20 of Algorithm 2. Note that if T has only one PC node, it is labeled as idT = 3 (Line 4 of Algorithm 2). Non-PC nodes always have idT = 0.
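The priority rule, under which a lower-numbered id cannot be changed into a higher number, amounts to a one-line guarded update. A minimal Python sketch (the naming is ours, not the authors'; 0 means unidentified, and 1, the parent label, is the most certain identity):

```python
def update_id(ids, node, new_id):
    """Apply the CausalSearch priority rule: a lower-numbered (more
    certain) id is never overwritten by a higher-numbered one.
    ids: dict mapping node -> current id, with 0 / absent = unidentified."""
    cur = ids.get(node, 0)
    if cur == 0 or new_id < cur:
        ids[node] = new_id
    return ids
```

Starting unidentified, an inconclusive pair first gets id 4; a later C2 test upgrades the node to a parent (id 1); a subsequent id-3 suggestion is then ignored.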
Step 2: Resolve idT = 4. Lemma 1 alone cannot identify the variable pairs in PCT with idT = 4 due to other possible unblocked paths, and we have to seek other information. Fortunately, by definition, the MB set of one of the target's PC nodes can block all paths to that PC node.
Lemma 2. Let (X, Y) ∈ PCT, the PC set of the target T ∈ V in a causal DAG. The independence relationships between X and Y, conditioned on the MB of X minus {Y, T}, MBX \ {Y, T}, can be divided into the following four conditions:
C1 X ⊥⊥ Y | MBX \ {Y, T} and X ⊥⊥ Y | T ∪ MBX \ Y; this condition cannot happen.

C2 X ⊥⊥ Y | MBX \ {Y, T} and X ⊥̸⊥ Y | T ∪ MBX \ Y ⇒ X and Y are both parents of T.

C3 X ⊥̸⊥ Y | MBX \ {Y, T} and X ⊥⊥ Y | T ∪ MBX \ Y ⇒ at least one of X and Y is a child of T.

C4 X ⊥̸⊥ Y | MBX \ {Y, T} and X ⊥̸⊥ Y | T ∪ MBX \ Y ⇒ X and Y are directly connected.
C1∼C3 are very similar to those in Lemma 1. C4 is true because, conditioned on T and the MB of X minus Y, the only potentially unblocked paths between X and Y are X − T − Y and/or X − Y. If C4 happens, then the path X − T − Y has no impact on the relationship between X and Y, and hence X − Y must be directly connected. If X and Y are not directly connected and the only potentially unblocked path between X and Y is X − T − Y, then X and Y will be identified by Line 10 of Algorithm 1 with idT ∈ {1, 2, 3}. For example, in Figure 1b, conditioned on MBC \ {T, D}, i.e., {A, B}, the only path between C and D is through T. However, if X and Y are directly connected, they will remain with idT = 4 (such as nodes D and E in Figure 1b). In this case, X, Y, and T form a fully connected clique, and edges among the variables that form a fully connected clique can have many different orientation combinations without affecting the conditional independence relationships. Therefore, this case needs further tests to ensure the Meek rules are satisfied. The third Meek rule (enforcing 3-fork V-structures) is first enforced by Line 14 of Algorithm 1. Then the rest of the idT = 4 nodes are changed to idT = 3 by Line 15 of Algorithm 1 and further processed (even though they could both be parents at the same time) with neighboring nodes' causal identities. Therefore, Step 2 of Algorithm 1 makes all variable pairs with idT = 4 become identified either as parents, as children, or with idT = 3 after taking some neighbors' MBs into consideration. Note that Step 2 of CMB only needs to find the MBs for a small subset of the PC variables (in fact, only one MB for each variable pair with idT = 4).
Step 3: Resolve idT = 3. After Step 2, some PC variables may still have idT = 3. This can happen because of the existence of Markov equivalent structures. Below we show the condition under which CMB can resolve the causal identities of all PC nodes.
Lemma 3. The Identifiability Condition. For Algorithm 1 to fully identify all the causal relationships within the PC set of a target T, 1) T must have at least two nonadjacent parents, 2) one of T's single ancestors must contain at least two nonadjacent parents, or 3) T must have three parents that form a 3-fork pattern as defined in the Meek rules.
We use single ancestors to denote ancestor nodes that do not have a spouse with a mutual child that is also an ancestor of T. If the target does not meet any of the conditions in Lemma 3, C2 will never be satisfied and all PC variables within an MB will have idT = 3. Without a single parent identified, it is impossible to infer the identities of children nodes using C3. Therefore, all the identities of the PC nodes are uncertain, even though the resulting structure could be a CPDAG.
Step 3 of CMB searches for a non-single ancestor of T to infer the causal directions. For each node X with idT = 3, CMB tries to identify its local causal structure recursively. If X's PC nodes are all identified, it returns to the target with the resolved identities; otherwise, it continues to search for a non-single ancestor of X. Note that CMB will not go back to already-searched variables with unresolved PC nodes without providing new information. Step 3 of CMB checks the identifiability condition for all the ancestors of the target. If a graph structure does not meet the conditions of Lemma 3, the final IDT will contain some idT = 3, which indicates reversible edges in CPDAGs. The causal graph found by CMB will be a PDAG after Step 2 of Algorithm 1, and it will be a CPDAG after Step 3 of Algorithm 1.
Case Study. The procedure of using CMB to identify the direct causes and effects of E in Figure 1a has the following three steps. Step 1: CMB finds the MB and PC set of E. The PC set contains nodes D and F. Then IDE(D) = 3 and IDE(F) = 3. Step 2: to resolve the variable pair D and F with idE = 3, 1) CMB finds the PC set of D, containing C, E, and G. Their idD are all 3's, since D contains only one parent. 2) To resolve IDD, CMB checks the causal identities of nodes C and G (without going back to E). The PC set of C contains A, B, and D. CMB identifies IDC(A) = 1, IDC(B) = 1, and IDC(D) = 2. Since C resolves all its PC nodes, CMB returns to node D with IDD(C) = 1. 3) With the new parent C, IDD(G) = 2, IDD(E) = 2, and CMB returns to node E with IDE(D) = 1. Step 3: IDE(D) = 1, and after resolving the pair with idE = 3, IDE(F) = 2.
Theorem 2. The Soundness and Completeness of the CMB Algorithm. If the identifiability condition is satisfied, using a sound and complete MB discovery algorithm, CMB will identify the direct causes and effects of the target under the causal faithfulness, sufficiency, correct independence tests, and no selection bias assumptions.
Proof. A sound and complete MB discovery algorithm finds all and only the MB nodes of a target. Using it and under the causal sufficiency assumption, the learned PC set contains all and only the cause-effect variables by Theorem 1. When Lemma 3 is satisfied, all parent nodes are identifiable through V-structure independence changes, either by Lemma 1 or by Lemma 2. Also, since children cannot be conditionally independent of another PC node given its MB minus the target node (C2), all parents identified by Lemmas 1 and 2 will be true positive direct causes. Therefore, all and only the true positive direct causes will be correctly identified by CMB. Since PC variables can only be direct causes or direct effects, all and only the direct effects are identified correctly by CMB.
In the cases where CMB fails to identify all the PC nodes, global causal discovery methods cannot identify them either. Specifically, structures failing to satisfy Lemma 3 can have different orientations on some edges while preserving the skeleton and V-structures, hence leading to Markov equivalent structures. For the cases where T has all single ancestors, the edge directions among all single ancestors can always be reversed without introducing new V-structures and DAG violations, in which cases the Meek rules cannot identify the causal directions either. For the cases with fully connected cliques, these cliques do not meet the nonadjacent-parents requirement of the first Meek rule (no new V-structures), and the second Meek rule (preserving DAGs) can always be satisfied within a clique by changing the direction of one edge. Since CMB orients the 3-fork V-structure in the third Meek rule correctly by Lines 12∼14 of Algorithm 1, CMB can identify the same structure as the global methods that use the Meek rules.
Theorem 3. Consistency between CMB and Global Causal Discovery Methods. For the same DAG G, Algorithm 1 will correctly identify all the direct causes and effects of a target variable T
as the global and local-to-global causal discovery methods² that use the Meek rules [10], up to G's CPDAG, under the causal faithfulness, sufficiency, correct independence tests, and no selection bias assumptions.
Proof. It has been shown that causal methods using the Meek rules [10] can identify up to a graph's CPDAG. Since the Meek rules cannot identify the structures that fail Lemma 3, the global and local-to-global methods can only identify the same structures as CMB. Since CMB is sound and complete in identifying these structures by Theorem 2, CMB will identify all direct causes and effects up to G's CPDAG.
3.1 Complexity
The complexity of the CMB algorithm is dominated by the step of finding the MB, which can have an exponential complexity [1, 16]. All other steps of CMB are trivial in comparison. If we assume a uniform distribution on the neighbor sizes in a network with N nodes, then the expected time complexity of Step 1 of CMB is O((1/N) ∑_{i=1}^{N} 2^i) = O(2^N / N), while local-to-global methods are O(2^N). In later steps, CMB also needs to find the MBs for a small subset of nodes that includes 1) one node between every pair of nodes that meet C4, and 2) a subset of the target's neighboring nodes that provide additional clues for the target. Let l be the total size of these nodes; then CMB reduces the cost by N/l times asymptotically.
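The closed form behind this estimate, (1/N) · ∑_{i=1}^{N} 2^i = (2^{N+1} − 2)/N = O(2^N / N), can be checked numerically. A small sketch under the same uniform-neighbor-size assumption (function names are ours):

```python
def expected_step1_cost(N):
    """Expected MB-discovery cost for one target when the neighbor size is
    uniform over 1..N: (1/N) * sum_{i=1}^{N} 2**i = (2**(N+1) - 2) / N."""
    return sum(2 ** i for i in range(1, N + 1)) / N

def local_to_global_cost(N):
    """Local-to-global methods find the MBs of all N nodes: sum_{i=1}^{N} 2**i."""
    return sum(2 ** i for i in range(1, N + 1))
```

The ratio of the two costs is exactly N, matching the asymptotic speedup of Step 1; accounting for the extra l MB calls in Steps 2 and 3 brings the overall saving down to roughly N/l.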
4 Experiments
We use benchmark causal learning datasets to evaluate the accuracy and efficiency of CMB against the four other causal discovery algorithms discussed, P-C, GS, MMHC, and CS, and the local causal discovery algorithm LCD2 [7]. Due to the page limit, we show the results of the causal algorithms on four medium-to-large datasets: ALARM, ALARM3, CHILD3, and INSUR3. They contain 37 to 111 nodes. We use 1000 data samples for all datasets. For each global or local-to-global algorithm, we find the global structure of a dataset and then extract the causal identities of all nodes with respect to a target node. CMB finds the causal identities of every variable with respect to the target directly. We repeat the discovery process for each node in the datasets and compare the discovered causal identities of all the algorithms to all the Markov equivalent structures of the known ground truth structure. We use the edge scores [15] to measure the number of missing edges, extra edges, and reversed edges³ in each node's local causal structure, and report average values along with their standard deviations for all the nodes in a dataset. We use the existing implementation [21] of the HITON-MB discovery algorithm to find the MB of a target variable for all the algorithms. We also use the existing implementations [21] of the P-C, MMHC, and LCD2 algorithms. We implement GS, CS, and the proposed CMB algorithm in MATLAB on a machine with a 2.66 GHz CPU and 24 GB memory. Following the existing protocol [15], we use the number of conditional independence tests needed (or scores computed, for the score-based search method MMHC) to find the causal structures given the MBs⁴, and the number of times that MB discovery algorithms are invoked, to measure the efficiency of the various algorithms. We also use mutual-information-based conditional independence tests with a standard significance level of 0.02 for all the datasets, without worrying about parameter tuning.
As shown in Table 1, CMB consistently outperforms the global discovery algorithms on the benchmark causal networks and has edge accuracy comparable to the local-to-global algorithms. Although CMB makes slightly more total edge errors than CS on the ALARM and ALARM3 datasets, it is the best method on CHILD3 and INSUR3. Since LCD2 is an incomplete algorithm, it never finds extra or reversed edges but misses the most edges. Efficiency-wise, CMB achieves more than one order of magnitude speedup over the global methods, and sometimes two orders of magnitude, as shown on CHILD3 and INSUR3. Compared to local-to-global methods, CMB can also achieve
2 We specify the global and local-to-global causal methods to be P-C [19], GS [9], and CS [15].
3 If an edge is reversible in the equivalence class of the original graph but not in the equivalence class of the learned graph, it is also counted as reversed.
4 For global methods, this is the number of tests needed or scores computed given the moral graph of the global structure. For LCD2, it is the total number of tests, since LCD2 uses neither moral graphs nor MBs.
Table 1: Performance of Various Causal Discovery Algorithms on Benchmark Networks
(Extra/Missing/Reversed/Total: edge errors. No. Tests/No. MB: efficiency.)

Dataset  Method  Extra      Missing    Reversed   Total      No. Tests    No. MB
ALARM    P-C     1.59±0.19  2.19±0.14  0.32±0.10  4.10±0.19  4.0e3±4.0e2  -
         MMHC    1.29±0.18  1.94±0.09  0.24±0.06  3.46±0.23  1.8e3±1.7e3  37±0
         GS      0.39±0.44  0.87±0.48  1.13±0.23  2.39±0.44  586.5±72.2   37±0
         CS      0.42±0.10  0.64±0.10  0.38±0.08  1.43±0.10  331.4±61.9   37±0
         LCD2    0.00±0.00  2.49±0.00  0.00±0.00  2.49±0.00  1.4e3±0      -
         CMB     0.69±0.13  0.61±0.11  0.51±0.10  1.81±0.11  53.7±4.5     2.61±0.12
ALARM3   P-C     3.71±0.57  2.21±0.25  1.37±0.04  7.30±0.68  1.6e4±4.0e2  -
         MMHC    2.36±0.11  2.45±0.08  0.72±0.08  5.53±0.27  3.7e3±6.1e2  111±0
         GS      1.24±0.23  1.41±0.05  0.99±0.14  3.64±0.13  2.1e3±1.2e2  111±0
         CS      1.26±0.16  1.47±0.08  0.63±0.14  3.38±0.13  699.1±60.4   111±0
         LCD2    0.00±0.00  3.85±0.00  0.00±0.00  3.85±0.00  1.2e4±0      -
         CMB     1.41±0.13  1.55±0.27  0.78±0.25  3.73±0.11  50.3±6.2     2.58±0.09
CHILD3   P-C     4.32±0.68  2.69±0.08  0.84±0.10  7.76±0.98  8.3e4±2.9e3  -
         MMHC    1.98±0.10  1.57±0.04  0.43±0.04  4.00±0.93  6.6e3±8.2e2  60±0
         GS      0.88±0.04  0.75±0.08  1.03±0.08  2.66±0.33  2.1e3±2.5e2  60±0
         CS      0.94±0.20  0.91±0.14  0.53±0.08  2.37±0.33  1.0e3±4.8e2  60±0
         LCD2    0.00±0.00  2.63±0.00  0.00±0.00  2.63±0.00  3.6e3±0      -
         CMB     0.92±0.12  0.84±0.16  0.60±0.10  2.36±0.31  78.2±15.2    2.53±0.15
INSUR3   P-C     4.76±1.33  2.50±0.11  1.29±0.11  8.55±0.81  2.5e5±1.2e4  -
         MMHC    2.39±0.18  2.53±0.06  0.76±0.07  5.68±0.43  3.1e4±5.2e2  81±0
         GS      1.94±0.06  1.44±0.05  1.19±0.10  4.57±0.33  4.5e4±2.2e3  81±0
         CS      1.92±0.08  1.56±0.06  0.89±0.09  4.37±0.23  2.6e4±3.9e3  81±0
         LCD2    0.00±0.00  5.03±0.00  0.00±0.00  5.03±0.00  6.6e3±0      -
         CMB     1.72±0.07  1.39±0.06  1.19±0.05  4.30±0.21  159.8±38.5   2.46±0.11
more than one order of magnitude speedup on ALARM3, CHILD3, and INSUR3. In addition, on these datasets CMB invokes the MB discovery algorithm only 2 to 3 times on average, drastically reducing the MB calls of local-to-global algorithms. Since a comparison based on independence tests is unfair to LCD2, which uses neither MB discovery nor moral graphs, we also compared the time efficiency of LCD2 and CMB: CMB is 5 times faster than LCD2 on ALARM, 4 times faster on ALARM3 and CHILD3, and 8 times faster on INSUR3.

In practice, the performance of CMB depends on two factors: the accuracy of the independence tests and of the MB discovery algorithm. First, independence tests may not always be accurate and can introduce errors when checking the four conditions of Lemmas 1 and 2, especially with insufficient data samples. Second, causal discovery performance depends heavily on the MB discovery step, as its errors can propagate to the later steps of CMB. Improvements in both areas would further improve CMB's accuracy. Efficiency-wise, CMB's complexity can still be exponential and is dominated by the MB discovery phase, so its worst-case complexity is the same as that of local-to-global approaches for some special structures.
5 Conclusion
We propose a new local causal discovery algorithm, CMB. We show that CMB identifies the same causal structure as the global and local-to-global causal discovery algorithms under the same identification condition, but at a fraction of their cost. We further prove the soundness and completeness of CMB. Experiments on benchmark datasets show the comparable accuracy and greatly improved efficiency of CMB for local causal discovery. Future work could study relaxing the assumptions, especially the causal sufficiency assumption, for example by using a procedure similar to the FCI algorithm and the improved CS algorithm [14] to handle latent variables in CMB.
References

[1] Constantin F. Aliferis, Ioannis Tsamardinos, and Alexander Statnikov. HITON, a novel Markov blanket algorithm for optimal variable selection, 2003.
[2] David Maxwell Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 2002.
[3] Gregory F. Cooper. A simple constraint-based algorithm for efficiently mining observational databases for causal relationships. Data Mining and Knowledge Discovery, 1(2):203-224, 1997.
[4] Isabelle Guyon, Andre Elisseeff, and Constantin Aliferis. Causal feature selection. 2007.
[5] Daphne Koller and Mehran Sahami. Toward optimal feature selection. In ICML 1996, pages 284-292. Morgan Kaufmann, 1996.
[6] Subramani Mani, Constantin F. Aliferis, Alexander R. Statnikov, and MED NYU. Bayesian algorithms for causal data mining. In NIPS Causality: Objectives and Assessment, pages 121-136, 2010.
[7] Subramani Mani and Gregory F. Cooper. A study in causal discovery from population-based infant birth and death records. In Proceedings of the AMIA Symposium, page 315. American Medical Informatics Association, 1999.
[8] Subramani Mani and Gregory F. Cooper. Causal discovery using a Bayesian local causal discovery algorithm. Medinfo, 11(Pt 1):731-735, 2004.
[9] Dimitris Margaritis and Sebastian Thrun. Bayesian network induction via local neighborhoods. In Advances in Neural Information Processing Systems 12, pages 505-511. MIT Press, 1999.
[10] Christopher Meek. Causal inference and causal explanation with background knowledge. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pages 403-410. Morgan Kaufmann Publishers Inc., 1995.
[11] Teppo Niinimaki and Pekka Parviainen. Local structure discovery in Bayesian networks. In Proceedings of Uncertainty in Artificial Intelligence, Workshop on Causal Structure Learning, pages 634-643, 2012.
[12] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, Inc., 2nd edition, 1988.
[13] Judea Pearl. Causality: Models, Reasoning and Inference, volume 29. Cambridge University Press, 2000.
[14] Jean-Philippe Pellet and Andre Elisseeff. Finding latent causes in causal networks: an efficient approach based on Markov blankets. In Advances in Neural Information Processing Systems, pages 1249-1256, 2009.
[15] Jean-Philippe Pellet and Andre Elisseeff. Using Markov blankets for causal structure learning. Journal of Machine Learning Research, 2008.
[16] Jose M. Peña, Roland Nilsson, Johan Bjorkegren, and Jesper Tegner. Towards scalable and data efficient learning of Markov boundaries. International Journal of Approximate Reasoning, 45(2):211-232, July 2007.
[17] Craig Silverstein, Sergey Brin, Rajeev Motwani, and Jeff Ullman. Scalable techniques for mining causal structures. Data Mining and Knowledge Discovery, 4(2-3):163-192, 2000.
[18] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. The MIT Press, 2nd edition, 2000.
[19] Peter Spirtes, Clark Glymour, Richard Scheines, Stuart Kauffman, Valerio Aimale, and Frank Wimberly. Constructing Bayesian network models of gene expression networks from microarray data, 2000.
[20] Peter Spirtes, Christopher Meek, and Thomas Richardson. Causal inference in the presence of latent variables and selection bias. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pages 499-506. Morgan Kaufmann Publishers Inc., 1995.
[21] Alexander Statnikov, Ioannis Tsamardinos, Laura E. Brown, and Constantin F. Aliferis. Causal Explorer: A MATLAB library of algorithms for causal discovery and variable selection for classification. In Causation and Prediction Challenge at WCCI, 2008.
[22] Ioannis Tsamardinos, Constantin F. Aliferis, and Alexander Statnikov. Time and sample efficient discovery of Markov blankets and direct causal relations. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '03, pages 673-678, New York, NY, USA, 2003. ACM.
[23] Ioannis Tsamardinos, Laura E. Brown, and Constantin F. Aliferis. The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 65(1):31-78, 2006.
[24] Jiji Zhang. On the completeness of orientation rules for causal discovery in the presence of latent confounders and selection bias. Artificial Intelligence, 172(16):1873-1896, 2008.
Neuro-inspired Eye Tracking with Eye Movement Dynamics

Kang Wang (RPI)    Hui Su (RPI and IBM)    Qiang Ji (RPI)
Abstract
Generalizing eye tracking to new subjects and environments remains challenging for existing appearance-based methods. To address this issue, we propose to leverage eye movement dynamics inspired by neurological studies. These studies show that there exist several common eye movement types, independent of viewing content and subject, such as fixation, saccade, and smooth pursuit. Incorporating generic eye movement dynamics can therefore improve generalization. In particular, we propose a novel Dynamic Gaze Transition Network (DGTN) to capture the underlying eye movement dynamics and serve as a top-down gaze prior. Combined with the bottom-up gaze measurements from a deep convolutional neural network, our method achieves better performance than the state-of-the-art in both within-dataset and cross-dataset evaluations. In addition, a new DynamicGaze dataset is constructed to study eye movement dynamics and eye gaze estimation.
1. Introduction
Eye gaze is one of the most important channels for people to interact with each other and with the visual world. Eye tracking has been applied to many fields, including psychological studies [1], social networks [2, 3, 4, 5], web search [6, 7, 8], marketing and advertising [9], and human-computer interaction [10, 11, 12]. In addition, since neurological activities affect the way we process visual information (reflected by eye movements), eye tracking has become one of the most effective tools for studying neuroscience. Estimated eye movements and gaze patterns can support attentional studies such as object-search mechanisms [6], the understanding of neurological functions during perceptual decision making [13], and the medical diagnosis of schizophrenia, post-concussive syndrome, autism, Fragile X, etc. Despite the importance of eye tracking to neuroscience, researchers have largely ignored the fact that neurological studies of the eyes can also benefit eye tracking. These studies reveal that eye movement is not a random process but involves strong dynamics. There exist common eye movement dynamics1 that are independent of the viewing content and the subject, and exploiting them can significantly improve the performance of eye tracking.

From neuroanatomy studies, there are several major types of eye movements2: vergence, saccade, fixation, and smooth pursuit. Vergence movements fixate on objects at different distances, with the two eyes moving in opposite directions. As vergence is less common in natural viewing scenarios, we mainly focus on fixation, saccade, and smooth pursuit. A saccade is a rapid eye movement from one fixation to another; its duration is short, and its amplitude is linearly correlated with its duration. There is also work on microsaccades [14], which are not the focus of this paper. A fixation keeps the gaze on the same object for a period of time; the eye movements are very small (miniature) and can be modeled as stationary or as a random walk. Smooth pursuit is an eye movement that smoothly tracks a slowly moving object; it cannot be triggered voluntarily and typically requires a moving object.
Existing work on eye gaze estimation (see [15] for a comprehensive survey) is static and frame-based, without explicitly considering the underlying dynamics. Model-based methods [16, 17, 18, 19, 20, 21, 22, 23, 24, 25] estimate eye gaze from a geometric 3D eye model by detecting key points of that model. In contrast, appearance-based methods [26, 27, 28, 29, 30, 31] directly learn a mapping function from eye appearance to eye gaze.
Unlike traditional static frame-based methods, we propose to estimate eye gaze with the help of eye movement dynamics. Since eye movement dynamics generalize across subjects and environments, the proposed method achieves better generalization. The system is illustrated in Fig. 1. For online eye tracking, the static gaze estimation network first estimates the raw gaze x_t from the input frame. Next, we combine the top-down eye movement dynamics with the bottom-up image measurements (Alg. 1) to obtain a more accurate prediction y_t. In addition, y_t is fed back to refine the static network so that we can better generalize to

1 In this work, eye movement refers to actual gaze movement on screens.
2 https://www.ncbi.nlm.nih.gov/books/NBK10991/
[Figure 1 diagram omitted. Pipeline: input video stream -> static gaze estimation network, x_t = f(I_t; w_{t-1}) -> eye gaze estimation (Alg. 1), y_t = g({x_i}_{i=t-k+1}^t, G(α)), using the dynamic gaze transition network G(α); the output gaze stream feeds back for model refinement (Alg. 2), updating w_t.]
Figure 1. Overview of the proposed system. For online eye tracking, we combine the static gaze estimation network with the dynamic gaze transition network to obtain better gaze estimation. In addition, the feedback mechanism of the system allows model refinement, so that we can better generalize the static network to unseen subjects or environments.
current user and environment (Alg. 2). The proposed method makes the following contributions:

• To the best of our knowledge, we are the first to exploit dynamic information to improve gaze estimation. Combining top-down eye movement dynamics with bottom-up image measurements gives better generalization and accuracy (15% improvement) and can automatically adapt to unseen subjects and environments.

• We propose the DGTN, which effectively captures the transitions between different eye movements as well as their underlying dynamics.

• We construct the DynamicGaze dataset, which not only provides another benchmark for evaluating static gaze estimation but also benefits the community for studying eye gaze and eye movement dynamics.
2. Related Work
Static eye gaze estimation. The most relevant work to
our static gaze estimation is from [27]. The authors proposed
to estimate gaze on mobile devices with face, eye and head
pose information using a deep convolutional neural network.
Though they can achieve good performance within-dataset,
they cannot generalize well to other datasets.
Eye gaze estimation with eye movement dynamics.
Eye movement is a spatial-temporal process. Most exist-
ing work only uses spatial eye movements, also known as
saliency map. In [32, 18, 33], the authors approximated
the spatial gaze distribution with the saliency map extracted
from image/video stimulus. However, their purpose is to
perform implicit personal calibration instead of improving
gaze estimation accuracy, since spatial saliency map is scene-
dependent. In [34], the authors used the fact that over 80%chance that first two fixations are on faces to help estimate
eye gaze. However, their approximation is too simple and
cannot apply to more natural scenarios.
For temporal eye movements, the authors in [35] pro-
posed to estimate the future gaze positions for recommender
systems with a Hidden Markov Model (HMM), where fixa-
tion is assumed to be a latent state, and user actions (clicking,
rating, dwell time, etc) are the observations. Their method is
however very much task-dependent and cannot generalize
to different tasks. In [36], the authors proposed to use a
similar HMM to predict gaze positions to reduce the delay
of networked video streaming. They also considered three
states corresponding to fixation, saccade, and smooth pursuit.
However, their approach ignores the different duration for
the three states, and their detailed modeling of the dynamics
for each state is relatively simpler. In addition, it requires
a commercial eye tracker, while the proposed method is an
appearance-based gaze estimator, which can perform on-
line real-time eye tracking with a simple web-camera. Fur-
thermore, the proposed method supports model-refinement
which can generalize to new subjects and environments.
Eye Movement Analysis. Besides eye tracking, there are
plenty of work on identifying the eye movement types given
eye tracking data. It includes threshold-based [37, 38] and
probabilistic-based [39, 40, 41]. Both methods require mea-
surements from eye tracking data like dispersion, velocity or
acceleration. Analyzing the underlying distribution of these
measurements can help identify the eye movement types.
However, these approaches are not interested in modeling
the gaze transitions for improving eye tracking.
3. Proposed Framework
We first discuss the eye movement dynamics and the
DGTN in Sec. 3.1. Next, we briefly introduce the static gaze
estimation network in Sec. 3.2. Then we talk about how to
perform online eye tracking with top-down eye movement
dynamics and bottom-up gaze measurements in Sec. 3.3.
Finally in Sec. 3.4, we focus on the refinement of the static
gaze estimation network.
Figure 2. Eye movement dynamics. (a) Illustration of eye movements while watching a video. (b) Graphical representation of the dynamic gaze transition network.
3.1. Eye Movement Dynamics and DGTN
We first take a look at the eye movements while watching
a video. As shown in Fig. 2 (a), the user is first attracted
by the motorcyclist on the sky. After spending some time
fixating on the motorcyclist, the user shifts the focus on the
recently appeared car (due to shooting angle change). A
saccade is in between of the two fixations. Next, the user
turns the focus back to the motorcyclist and starts following
the motion with smooth pursuit eye movement. We have
three observations regarding the eye movements: 1) each eye
movement has its own unique dynamic pattern, 2) different
eye movements have different durations, and 3) there exists
special transition patterns across different eye movements.
These observations inspire us to construct the dynamic model
shown in Fig. 2 (b) to model the overall gaze transitions.
Specifically, we employ the semi-Markov model to model
the durations for each eye movement type. In Fig. 2 (b), the
red curve on the top shows a sample gaze pattern with 3segments of fixation, saccade, and smooth pursuit respec-
tively. The top row represents the state chain st, where
st = {fix, sac, sp} can take three values corresponding
to fixation, saccade, and smooth pursuit respectively. Each
state can generate a sequence of true gaze positions {yt}dt=1,
where d represents the duration for the state. Though the
state st is constant for a long period, its value is copied for
all time slices within the state to ensure a regular structure.
The true gaze yt not only depends on the current state but
also depends on previous gaze positions. For example, the
moving direction for smooth pursuit is determined by sever-
al previous gaze positions. Given the true gaze yt, we can
generate the noisy measurements xt, which are the outputs
from the static gaze estimation methods.
In the following, we will discuss in details 1) within-
state dynamics (Sec. 3.1.1), 2) eye movement duration and
transition (Sec. 3.1.2), 3) measurement model (Sec. 3.1.3),
and 4) parameter learning (Sec. 3.1.4).
3.1.1 Within-state Dynamics
Figure 3. Visualization of eye movements. Top-left: 3D plot of x-y-t; top-right: projected 2D plot on the y-t plane; bottom-left: projected 2D plot on the x-t plane; bottom-right: projected 2D plot on the x-y plane.
Fixation. A fixation keeps the eye gaze on the same static object for a period of time (Fig. 3 (d)). We model it as a random walk: y_t = y_{t-1} + w_fix, where w_fix is zero-mean Gaussian noise with covariance matrix Σ_fix.
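A minimal simulation of this random-walk fixation model can be sketched as follows; Σ_fix here is an assumed toy value rather than a learned parameter:

```python
import numpy as np

# Fixation model y_t = y_{t-1} + w_fix, with w_fix ~ N(0, Sigma_fix).
# Sigma_fix is an illustrative placeholder (the paper learns it from data).
rng = np.random.default_rng(0)
Sigma_fix = np.diag([1e-4, 1e-4])  # tiny variance: miniature eye movements

def simulate_fixation(y0, n_steps, rng):
    """Generate one fixation segment as a 2-D Gaussian random walk."""
    ys = [np.asarray(y0, dtype=float)]
    for _ in range(n_steps):
        w_fix = rng.multivariate_normal(np.zeros(2), Sigma_fix)
        ys.append(ys[-1] + w_fix)
    return np.stack(ys)

traj = simulate_fixation([0.5, 0.5], 100, rng)  # gaze in normalized screen coords
```

Because the per-step noise is tiny, the simulated gaze stays near the fixation point, matching the "miniature movements" description above.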
Saccade. Typically, a saccade is a fast eye movement between two fixations. The trajectory is usually a straight line or a generalized exponential curve (Fig. 3). In this work, we approximate the trajectory with piece-wise linear functions. The first saccade point y_1 is the end point of the last fixation. Predicting the position of the second saccade point y_2 is difficult without knowing the image content. However, according to [42], horizontal saccades are more frequent than vertical saccades, which provides a strong cue to the second saccade point. Specifically, we assume the second point can be estimated by translating the first point by a certain amplitude and direction (angle) in the 2D plane: y_2 = y_1 + λ[cos(θ), sin(θ)]^T, where the amplitude λ ~ N(μ_λ, σ_λ) and the angle θ ~ N(μ_θ, σ_θ) both follow Gaussian distributions. The histogram plots of amplitude (Fig. 4 (a)) and angle (Fig. 4 (b)) from real data also validate the feasibility of the Gaussian assumption.
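Sampling the second saccade point under this model can be sketched as follows; the Gaussian parameters are illustrative placeholders, not values fitted to Fig. 4:

```python
import numpy as np

# Second saccade point: y_2 = y_1 + lambda * [cos(theta), sin(theta)]^T.
# mu/sigma values below are assumed for illustration only.
rng = np.random.default_rng(1)
mu_lam, sigma_lam = 120.0, 40.0                # amplitude in pixels (assumed)
mu_theta, sigma_theta = 0.0, np.deg2rad(25.0)  # angle near 0: horizontal bias (assumed)

def sample_second_saccade_point(y1, rng):
    lam = rng.normal(mu_lam, sigma_lam)        # lambda ~ N(mu_lambda, sigma_lambda)
    theta = rng.normal(mu_theta, sigma_theta)  # theta  ~ N(mu_theta, sigma_theta)
    return np.asarray(y1, dtype=float) + lam * np.array([np.cos(theta), np.sin(theta)])

y2 = sample_second_saccade_point([800.0, 600.0], rng)
```

Setting μ_θ near zero encodes the horizontal-saccade bias reported in [42]; in practice all four parameters would be estimated from the histograms in Fig. 4.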
Figure 4. Saccade characteristics. (a) Amplitude distribution, (b) Angle distribution, (c) Amplitude change from adjacent saccade points.
The remaining saccade points can be estimated from the previous two points: y_t = B_1^i y_{t-1} + B_2^i y_{t-2} + w_sac, where B_1^i and B_2^i are regression matrices and the superscript i indicates the index of the current saccade point, i.e., how many frames have passed since entering the state. The value of i equals the duration variable d in Eq. (1). It would be simpler to assume B_1^i and B_2^i remain the same for different indexes i, but saccade movements have particular characteristics. For example, as shown in Fig. 4 (c), the amplitude change between adjacent saccade points first increases and then decreases. Index-dependent regression matrices can better capture these underlying dynamics. w_sac is zero-mean Gaussian noise with covariance matrix Σ_sac.

Smooth Pursuit. Smooth pursuit keeps track of a slowly moving object, so we approximate the moving trajectory by piece-wise linear functions, similar to saccades. For the second smooth pursuit point, we introduce amplitude and angle variables {λ_sp, θ_sp}. For the remaining smooth pursuit points, we introduce index-dependent regression matrices: y_t = C_1^i y_{t-1} + C_2^i y_{t-2} + w_sp, where w_sp is zero-mean Gaussian noise with covariance matrix Σ_sp.
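A sketch of this second-order regression; for brevity the matrices are fixed to the constant-velocity choice C_1 = 2I, C_2 = -I rather than learned per index, so this only illustrates the recursion's form:

```python
import numpy as np

# Smooth-pursuit recursion y_t = C_1^i y_{t-1} + C_2^i y_{t-2} + w_sp.
# C1 = 2I, C2 = -I extrapolates a straight line at constant velocity;
# the paper instead learns an index-dependent pair (C_1^i, C_2^i).
rng = np.random.default_rng(2)
Sigma_sp = np.diag([1e-6, 1e-6])
C1, C2 = 2.0 * np.eye(2), -1.0 * np.eye(2)

def pursuit_step(y_prev, y_prev2, rng):
    w_sp = rng.multivariate_normal(np.zeros(2), Sigma_sp)
    return C1 @ y_prev + C2 @ y_prev2 + w_sp

ys = [np.array([0.0, 0.0]), np.array([0.01, 0.0])]  # first two pursuit points
for i in range(3, 20):  # i indexes frames since entering the state
    ys.append(pursuit_step(ys[-1], ys[-2], rng))
traj = np.stack(ys)
```

With the constant-velocity matrices the trajectory drifts steadily along the initial direction, which is the piece-wise linear behavior the text describes.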
3.1.2 Eye Movement Duration and Transition
The hidden semi-Markov model has been well studied in [43]; we adopt a similar formulation for state duration and transition modeling. Besides the random variables s_t, y_t, and x_t for the state, true gaze position, and measured gaze position, we introduce another discrete random variable d_t (with range {0, 1, ..., D}) representing the remaining duration of state s_t. The state s_t and the remaining duration d_t are discrete random variables and follow multinomial (categorical) distributions. The CPDs for the state transition are defined as follows:

P(s_t = j | s_{t-1} = i, d_t = d) = δ(i, j) if d > 0;  A(i, j) if d = 0
P(d_t = d' | d_{t-1} = d, s_t = k) = δ(d', d - 1) if d > 0;  p_k(d') if d = 0        (1)

where δ(i, j) = 1 if i = j and 0 otherwise. When we enter a new state s_t = i, the duration d_t is drawn from a prior multinomial distribution q_i(·) = [p_i(1), ..., p_i(D)]. The duration then counts down to 0. When d_t = 0, the state transits to a different state according to the state transition matrix A, and the duration for the new state is drawn again from q_i(·).
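The countdown semantics of Eq. (1) can be sketched by sampling a toy state chain; the transition matrix A and the duration priors are assumed values, not learned ones:

```python
import numpy as np

# Sampling the (state, duration) chain of Eq. (1).
# States 0/1/2 stand for fix/sac/sp; A and dur_prior are toy values.
rng = np.random.default_rng(3)
A = np.array([[0.0, 0.7, 0.3],   # no self-transitions: dwell time is handled by durations
              [0.6, 0.0, 0.4],
              [0.8, 0.2, 0.0]])
D_MAX = 10
dur_prior = np.full((3, D_MAX), 1.0 / D_MAX)  # q_i = [p_i(1), ..., p_i(D)], uniform here

def sample_chain(T, rng):
    s = int(rng.integers(3))
    d = int(rng.choice(D_MAX, p=dur_prior[s])) + 1  # duration drawn on entering the state
    seq = []
    for _ in range(T):
        seq.append(s)
        d -= 1
        if d == 0:                                   # countdown hit 0: transit via A
            s = int(rng.choice(3, p=A[s]))
            d = int(rng.choice(D_MAX, p=dur_prior[s])) + 1
    return seq

seq = sample_chain(200, rng)
```

Zeroing the diagonal of A makes the duration priors, rather than self-transitions, control how long each eye movement lasts, which is exactly the semi-Markov design choice motivated above.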
3.1.3 Measurement Model
The measurement model P(x_t | y_t) is independent of the type of eye movement, and we assume x_t = D y_t + w_n, where D is the regression matrix and w_n is multivariate zero-mean Gaussian noise with covariance matrix Σ_n.
3.1.4 Parameter Learning
The DGTN parameters are summarized in Table 1.
For simplicity, we denote all the parameters as α = [α_st, α_sd, α_fix, α_sac, α_sp, α_m], and the DGTN is represented as G(α). All the random variables in Fig. 2 (b) are observed during learning (the states and true gaze are not known during online gaze tracking). Given K fully observed sequences {s_t^k, y_t^k, x_t^k}_{t=1}^{T_k}, each of length T_k, we use maximum log-likelihood estimation for all the parameters:

α* = argmax_α log Π_{k=1}^K P({s_t^k, y_t^k, x_t^k}_{t=1}^{T_k} | α)
   = argmax_α Σ_{k=1}^K log Π_{t=1}^{T_k} Σ_{d_t^k} P(s_t^k, d_t^k) P(y_t^k | s_t^k, d_t^k) P(x_t^k | y_t^k)        (2)

With fully observed data, the above optimization problem factorizes into the following sub-problems, each of which can be solved independently:

α_m* = argmax_{α_m} Σ_{k=1}^K log Π_{t=1}^{T_k} P(x_t^k | y_t^k, α_m)        (3)

{α_st, α_sd}* = argmax_{α_st, α_sd} Σ_{k=1}^K log Π_{t=1}^{T_k} Σ_{d_t^k} P(s_t^k, d_t^k)        (4)

α_j* = argmax_{α_j} Σ_{n=1}^{N_j} log Π_{t=1}^{T_n} P(y_t^n | s_t^n = j, d_t^n = T_n, α_j),  ∀j ∈ {fix, sac, sp}        (5)
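As an example of this decomposition, the measurement sub-problem (3) has a closed form: with (y_t, x_t) fully observed and Gaussian noise, the MLE of D is ordinary least squares, and Σ_n is the residual covariance. A minimal numpy sketch on synthetic data (D_true and the noise level are assumed, standing in for annotated sequences):

```python
import numpy as np

# Closed-form solution of Eq. (3): fit x_t = D y_t + w_n by least squares.
rng = np.random.default_rng(4)
D_true = np.array([[0.9, 0.05],
                   [0.02, 1.1]])
Y = rng.uniform(0.0, 1.0, size=(500, 2))                 # true gaze y_t (synthetic)
X = Y @ D_true.T + 0.01 * rng.standard_normal((500, 2))  # measurements x_t

D_hat = np.linalg.lstsq(Y, X, rcond=None)[0].T           # MLE of D under Gaussian noise
resid = X - Y @ D_hat.T
Sigma_n = resid.T @ resid / len(Y)                       # MLE of the noise covariance
```

The state/duration sub-problem (4) likewise reduces to counting transitions and durations, and each α_j in (5) to a per-state linear regression.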
Table 1. Summary of model parameters.
  State transition α_st:  A
  State duration α_sd:    q_i = [p_i(1), ..., p_i(D_i)] for i ∈ {fix, sac, sp}
  Fixation α_fix:         Σ_fix
  Saccade α_sac:          {μ_λ, σ_λ, μ_θ, σ_θ}_sac, {B_1^i, B_2^i}_{i=3}^{D_sac}, Σ_sac
  Smooth pursuit α_sp:    {μ_λ, σ_λ, μ_θ, σ_θ}_sp, {C_1^i, C_2^i}_{i=3}^{D_sp}, Σ_sp
  Measurement α_m:        D, Σ_n
3.2. Static Eye Gaze Estimation
Figure 5. Architecture of static gaze estimation network.
The raw gaze measurement x_t is estimated with a standard deep convolutional neural network (Fig. 5) [44, 45]. The inputs are the left and right eye images (both of size 36 × 60) and the 6-dimensional head pose (rotation and translation: pitch, yaw, and roll angles, and x, y, z). The left- and right-eye branches share the weights of the convolutional layers. Each convolutional layer is followed by a max-pooling layer of size 2. ReLU is used as the activation of the fully connected layers. The detailed layer configurations are: CONV-R1, CONV-L1: 5 × 5/50; CONV-R2, CONV-L2: 5 × 5/100; FC-RT1: 512; FC-E1, FC-RT2: 256; FC-1: 500; FC-2: 300; FC-3: 100. For simplicity, we denote static gaze estimation as x_t = f(I_t; w), where I_t and w are the input frame and model parameters, respectively.
3.3. Online Eye Gaze Tracking
Traditional static methods output only the measured gaze x from the static gaze estimation network. In this work, we propose to output the true gaze y with the help of the DGTN:

y_t = argmax_{y_t} p(y_t | x_1, x_2, ..., x_t)
    = argmax_{y_t} ∫_{s_t} p(y_t, s_t | x_1, x_2, ..., x_t) ds_t        (6)

Solving Eq. (6) directly is intractable because of the marginalization over the hidden state. Alternatively, we first draw samples of the possible state s_t from its posterior ([43]). Given the state, gaze estimation becomes a standard inference problem for a linear dynamical system (LDS) or Kalman filter ([46]). The algorithm is summarized in Alg. 1.
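The estimate-and-average structure of Alg. 1 can be sketched as follows; for brevity, the per-state LDS/Kalman predictions are replaced by simple exponentially weighted means, so this illustrates only the sampling-and-averaging skeleton, not the paper's full inference:

```python
import numpy as np

# Skeleton of Alg. 1: each sampled state s_t^i selects a different
# estimator over the last k raw measurements; the final gaze averages
# the N per-sample predictions. The per-state smoothing factors are
# assumed stand-ins for the per-state LDS models.
rng = np.random.default_rng(5)

def ewma_estimate(xs, alpha):
    """Weighted estimate of the latest gaze; small alpha trusts the newest frame."""
    w = alpha ** np.arange(len(xs))[::-1]   # newest frame gets weight alpha^0 = 1
    return (w[:, None] * xs).sum(axis=0) / w.sum()

# 0=fixation (average the window), 1=saccade (trust the latest frame), 2=pursuit.
state_alpha = {0: 0.95, 1: 0.1, 2: 0.6}

def estimate_gaze(xs, state_samples):
    preds = [ewma_estimate(xs, state_alpha[s]) for s in state_samples]
    return np.mean(preds, axis=0)           # y_t ~ (1/N) sum_i y_t^i

xs = np.array([[0.40, 0.40], [0.41, 0.40], [0.42, 0.41]])  # last k raw gazes x_t
state_samples = rng.choice(3, size=20, p=[0.8, 0.1, 0.1])  # s_t^i from the posterior (assumed)
y_t = estimate_gaze(xs, state_samples)
```

Because every per-sample prediction is a convex combination of the raw measurements, the averaged output stays within the window's range while being pulled toward the behavior of the most probable state.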
3.4. Model Refinement
The static gaze estimation network is learned from subjects during the offline stage. It may not generalize well
Algorithm 1: Online eye tracking
while getting a new frame I_t do
  - Draw N samples of the state s_t from its posterior ([43]): s_t^i ~ P(s_t | x_{t-k}, ..., x_t), ∀i = 1, ..., N.
  - For each sampled state, use the corresponding LDS of Eq. (1) ([46]) to predict the true gaze: y_t^i = argmax_{y_t^i} P(y_t^i | x_{t-k}, ..., x_t, s_t^i), ∀i = 1, ..., N.
  - Average the results of the N samples: y_t ≈ (1/N) Σ_{i=1}^N y_t^i.
to new subjects or environments. We therefore propose to leverage the refined true gaze to refine the static gaze estimation network (its last two fully connected layers). The algorithm is given in Alg. 2. Note that we do not use the exact values of y; instead, we assume the temporal gaze distribution from the static network, p(x_t), matches the true gaze distribution, p(y_t). Similar to Fig. 3 (b) and (c), we treat the x-t curve and the y-t curve as two categorical distributions p = [p_1, ..., p_T], whose range is from 1 to T and whose value p_i equals the normalized gaze position. By minimizing the KL-divergence between the two gaze distributions, we gradually refine the parameters of the static network. The proposed algorithm may not be accurate in the beginning, but it runs incrementally and gives better predictions as more frames are collected.
Algorithm 2: Model refinement for the static gaze estimation network
1. Input: static gaze estimation network f(·) with initial parameters w_0.
2. while getting a new frame I_t do
  - Gather the last k true gaze points y_t = (a_t, b_t) from Alg. 1 and construct two categorical distributions for horizontal and vertical gaze: p_x = (1/Σ a_i)[a_{t-k}, ..., a_t], p_y = (1/Σ b_i)[b_{t-k}, ..., b_t].
  - Gather the last k raw gaze points (a_t, b_t) = f(I_t; w) and construct the bottom-up categorical distributions: q_x(w) = (1/Σ a_i)[a_{t-k}, ..., a_t], q_y(w) = (1/Σ b_i)[b_{t-k}, ..., b_t].
  - Update the static gaze estimation network: w_t = argmin_w D_KL(p_x || q_x(w)) + D_KL(p_y || q_y(w)), where D_KL(p || q) = Σ_i p(i) log(p(i)/q(i)).
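The refinement objective of Alg. 2 reduces to a KL-divergence between two normalized gaze histograms; a minimal sketch with illustrative gaze values (not data from the paper):

```python
import numpy as np

# Alg. 2 loss: normalize the last k horizontal gaze values from Alg. 1 (p)
# and from the static network (q) into categoricals, then D_KL(p || q).
def to_categorical(vals):
    vals = np.asarray(vals, dtype=float)
    return vals / vals.sum()

def kl_divergence(p, q, eps=1e-12):
    p = np.clip(p, eps, None)   # guard against log(0) / division by zero
    q = np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))

true_x = [0.30, 0.32, 0.35, 0.35, 0.36]  # refined horizontal gaze y from Alg. 1 (illustrative)
raw_x = [0.28, 0.35, 0.31, 0.38, 0.34]   # static-network output x (illustrative)
loss = kl_divergence(to_categorical(true_x), to_categorical(raw_x))
```

In the full algorithm this scalar would be minimized over the weights w of the last two fully connected layers, e.g. by gradient descent, with the same term added for the vertical gaze.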
4. DynamicGaze Dataset
Existing datasets for gaze estimation and eye movement
dynamics have little overlap. On one hand, gaze-related
benchmark datasets are all frame-based. Subjects are asked
to look at markers on the screen, where their face images
and groundtruth gaze are recorded. However, there are no
natural dynamic gaze patterns in the dataset. On the other
hand, eye movement related datasets focus on collecting data
while subjects watch natural video stimulus. Though the col-
lected data involves dynamics, there are no bottom-up image
measurements. To bridge the gap between these two fields,
we construct a new dataset which records both images and
groundtruth gaze positions while subjects perform natural
operations (browsing websites, watching videos). Clear eye
movement dynamics can be observed from the dataset.
To acquire the groundtruth gaze positions, we use a com-
mercial eye tracker which runs at the back-end. In the mean-
time, the front-facing camera of the laptop records the video
stream of the subjects. The video stream and the gaze stream
are synchronized during post-processing. The Tobii 4C eye
tracker gives less than 0.5 error after calibration, and we
believe the accuracy is sufficient to construct a dataset for
the webcam-based eye gaze tracking system.
4.1. Data collection procedure
We invite 15 male subjects and 5 female subjects, whose
age ranges from 20 to 30, to participate in the dataset con-
struction. We collected 3 sessions of data: 1) frame-based;
2) video-watching 3) website-browsing.
Frame-based. There are two purposes: 1) provide anoth-
er benchmark for static eye gaze estimation and 2) train our
generic static gaze estimation network. Subjects are asked
to look at some random moving objects on the screen, the
random moving objects are to ensure subjects’ gaze spread
on the entire screen. Each subject takes 3-6 trials at differ-
ent days, locations. We also ask subjects to sit at different
positions in front of the laptop to introduce more variations.
Finally, we end up with around 370000 valid frames.
Video-watching. The subjects are asked to watch 10 video stimuli (Tab. 2) from 3 eye tracking research datasets.
The collection procedure is similar to the previous session,
and we finally collect a total of around 145,000 valid frames.
Website-browsing. Similarly, subjects are asked to
browse websites freely on the laptop for around 5-6 minutes,
and a total of around 130,000 frames are collected.
4.2. Data visualization and statistics
Fig. 6 shows sample eye images from the 20 subjects.
There are occlusions like glasses and reflections. Fig. 7
shows the spatial gaze distributions on a monitor with resolution
2880×1620. For frame-based data, the gaze appears
uniformly distributed. For video-watching data, the gaze
Table 2. Information about the different video stimuli.
Dataset       Name                          Description
CRCNS [47]    1. saccadetest                Dots moving across the screen.
              2. beverly07                  People walking and running.
[48]          3. 01-car-pursuit             Car driving in a roundabout.
              4. 02-turning-car             Car turning around.
DIEM [49]     5. advert bbc4 bees           Flying bees on BBC logo.
              6. arctic bears               Arctic bears in the ocean.
              7. nightlife in mozambique    One crab hunting for fishes.
              8. pingpong no bodies         Pingpong bouncing around.
              9. sport barcelona extreme    Extreme sports cut.
              10. sport scramblers          Extreme sports for scramblers.
Figure 6. Sample eye images from the dataset.
Figure 7. Spatial gaze distributions (x / pixel vs. y / pixel) for the DynamicGaze dataset: (a) frame-based (pearsonr = 0.036), (b) video-watching (pearsonr = -0.05), (c) website-browsing (pearsonr = 0.075).
Figure 8. Sample dynamic gaze patterns (horizontal and vertical gaze position over time) from 8 subjects watching the same video.
appears center-biased, which is the most common pattern
when watching videos. Finally, for website-browsing, the
gaze pattern is focused on the left side of the screen mainly
due to the design of the website. Since the major goal of
the dataset is to explore gaze dynamics, we also take a look
at the dynamic gaze patterns from 8 subjects watching the
same video stimuli. As shown in Fig. 8, different subjects
share similar overall gaze patterns, though the exact values
of horizontal and vertical gaze positions are different.
5. Experiments and Analysis
For DGTN, the measurement model P (xt|yt) is learned
with the data from DynamicGaze, where we have both
groundtruth gaze yt and measured gaze from the static gaze
estimation network. The remaining part of the model is
learned with the data from CRCNS [47], where we have the
groundtruth state annotations st and the groundtruth gaze.
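The way DGTN combines a top-down dynamic prediction with the bottom-up measurement from the static network can be illustrated with a minimal Gaussian-fusion sketch. This is an assumption for illustration only: the actual measurement model P(xt|yt) is learned from DynamicGaze, and the paper's model is richer than a single Gaussian product.

```python
import numpy as np

def fuse_gaussian(pred_mean, pred_var, meas, meas_var):
    """Fuse a top-down gaze prediction with a bottom-up measurement.

    Assuming both the dynamic prediction over the true gaze yt and the
    measurement model P(xt | yt) are Gaussian, the posterior over yt is
    the precision-weighted combination of the two sources.
    """
    w = meas_var / (pred_var + meas_var)        # weight on the prediction
    post_mean = w * pred_mean + (1 - w) * meas  # posterior mean
    post_var = pred_var * meas_var / (pred_var + meas_var)
    return post_mean, post_var
```

With equal variances the posterior mean is simply the average of prediction and measurement, and the posterior variance is halved.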
CRCNS consists of 50 video clips and 235 valid eye move-
ment traces from 8 subjects. For the static gaze estimation
network, we use TensorFlow as our backend engine.
Fixation is modeled as a first-order LDS, while saccade and smooth pursuit
can be considered second-order LDSs; therefore the value
k in Alg. 1 is either 1 or 2. The value k in Alg. 2 is set to 50 (around 2 seconds of data), which is the amount of data used to update the
parameters of the static network. For overall gaze estimation,
the static gaze estimation (on a Tesla K40c GPU) takes less
than 1 ms, while the online part (Alg. 1) takes around 50-60 ms
on an Intel Xeon CPU E5-2620 v3 @ 2.4 GHz. In practice,
for real-time processing, the model refinement runs in a
separate thread from the gaze estimation thread.
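The order-k LDS prediction underlying the eye-movement dynamics above can be sketched as follows. The transition matrices here are placeholders, not the learned per-eye-movement-type parameters:

```python
import numpy as np

def lds_predict(history, coeffs):
    """One-step prediction of an order-k linear dynamic system.

    history : list of the k most recent gaze states, most recent last.
    coeffs  : list of k (2x2) transition matrices (placeholders; the actual
              matrices are learned per eye-movement type). Fixation uses
              k=1, while saccade and smooth pursuit use k=2, per the text.
    """
    k = len(coeffs)
    pred = np.zeros_like(history[-1], dtype=float)
    for i in range(1, k + 1):
        # coeffs[0] multiplies the most recent state, coeffs[1] the next, etc.
        pred += coeffs[i - 1] @ history[-i]
    return pred
```

A process-noise term would be added on top of this deterministic prediction in the full filtering formulation.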
The performance is evaluated using the angular error in
degrees. We first compute the Euclidean pixel error on the
monitor (2880×1620), which can be converted to a centimeter
error errd given the monitor dimensions. The angular error
is then approximated by erra = arctan(errd/tz), where tz is
the estimated depth of the subject's head relative to the camera.
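The pixel-to-angular conversion above can be written directly as code. The pixels-per-centimeter factor is an assumed helper quantity derived from the monitor dimensions:

```python
import numpy as np

def angular_error(pred_px, gt_px, px_per_cm, tz_cm):
    """Approximate angular gaze error in degrees.

    pred_px, gt_px : (N, 2) arrays of on-screen gaze points in pixels.
    px_per_cm      : pixels per centimeter of the monitor (assumed known
                     from the monitor's physical dimensions).
    tz_cm          : estimated head-to-camera depth in centimeters.
    """
    err_px = np.linalg.norm(pred_px - gt_px, axis=1)  # Euclidean pixel error
    err_cm = err_px / px_per_cm                       # centimeter error errd
    return np.degrees(np.arctan(err_cm / tz_cm))      # erra = arctan(errd/tz)
```

For example, a 50 cm on-screen error at 50 cm depth corresponds to arctan(1) = 45 degrees.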
5.1. Baseline for Static Gaze Estimation Network
Table 3. Comparison of different input data channels.
L R F L,R L,R,F L,R,P L,R,F,P
Error 5.38 5.27 5.56 4.70 5.29 4.27 4.47
We experiment with different input combinations. As
shown in Table 3, the symbols L, R, F, and P represent the left eye
image, right eye image, face image, and head pose, respectively.
Based on the results, we decide to use both eyes and the head
pose. To obtain the head pose, we perform offline detection of
the facial landmarks [50] and then solve the head pose
angles with a 3D shape model [51, 52]. Note that adding the
face is not helpful, since facial texture varies far more across
subjects than eye texture does, which makes it hard to generalize
to new subjects. In addition, adding the face may significantly
increase the inference time.
5.2. Evaluation on Different Model Components
The proposed model consists of two major components:
1) gaze estimation with eye movement dynamics and 2)
a refinement model to better fit the current users/environments.
To study the contribution of each component, we compare the
following three variants of the proposed model:
• Static: this model outputs the raw gaze prediction x
and serves as the baseline.
• EMD (Eye movement dynamics): this model only uses
eye movement dynamics (Alg. 1) without model
refinement and outputs the true gaze prediction y.
• Full: this is our full model containing both eye movement
dynamics and model refinement.
Figure 9. Gaze estimation error (degrees) for all 20 subjects. (a) video-watching: Static avg = 5.34, EMD avg = 4.97, Full avg = 4.65. (b) website-browsing: Static avg = 4.97, EMD avg = 4.58, Full avg = 4.07.
We perform cross-subject evaluation, and Fig. 9 shows
the performance of the 3 models. First, the Full model
shows improved performance over the Static model for most
subjects. The average estimation error reduces from 5.34
degrees to 4.65 degrees ((pitch, yaw) = (2.67, 3.81); a 13%
improvement) for video-watching and from 4.97 degrees to 4.07
degrees ((pitch, yaw) = (2.23, 3.41); an 18% improvement)
for web-browsing. Second, comparing EMD (gray bar) with
Static (black bar), we always achieve better results for
both scenarios, demonstrating the importance of incorporating
dynamics, especially in practical scenarios where the user's
gaze patterns have strong dynamics. The average improvements
with eye movement dynamics are 6.9% and 7.9% for
video-watching and website-browsing respectively. Third,
the difference between Full (white bar) and EMD (gray
bar) demonstrates the effect of Model Refinement. We can
clearly observe that the Static model cannot generalize well
to some subjects. With Model Refinement, we significantly
reduce the error for some subjects (e.g. Subjects 6, 15, 16, and 18
in video-watching and Subjects 15, 16, and 18 in website-browsing).
We also observe that model refinement may not always help;
it may increase the error for some subjects (e.g. Subjects 4, 5, and 7
in video-watching). On average, Model Refinement
improves the error by 6.4% and 11.2% for video-watching and website-browsing
respectively. Overall, both components help
reduce the error of eye gaze estimation, and combining the
two further reduces the error.
5.3. Performance of gaze estimation over time
Fig. 10 shows the gaze estimation error over time. The error
is averaged over all subjects across their first 8000 frames.
For both scenarios, the improvement during the first period of
time is small (the error sometimes even increases), but the
improvement gradually becomes more significant as more data arrive.
Figure 10. Gaze estimation error (degrees) over the first 8000 frames for (a) video-watching and (b) website-browsing. The red curve represents the error from the Static model, the green curve the error from the Full model, and the third curve the reduced error (static − dynamic).
This demonstrates that with enough frames, the proposed
method can significantly improve the accuracy of eye gaze
estimation.
5.4. Comparison with different dynamic models
Table 4. Average error of all subjects with different dynamic models.
        Static  Mean  Median  LDS   s-LDS  RNN   Ours
Video   5.34    5.18  5.16    5.20  5.14   5.15  4.97
Web     4.97    4.85  4.84    4.70  4.66   4.71  4.58
In this experiment, we compare with several baseline
dynamic models. The experimental results are illustrated in
Table 4. First, we find that incorporating dynamics outperforms
the static method; even simple mean/median filters
can improve the results. The LDS model, trained on the entire
sequence without considering different eye movement
types, does not give good results. Once different
eye movement types are considered, the switching LDS improves the
results even without duration modeling. The RNN [53, 54] gives
reasonably good results but ignores the characteristics of
different eye movements, and therefore our proposed method
still outperforms it. Overall, we believe the proposed
dynamic modeling can better explain the underlying eye
movement dynamics and help improve the accuracy of eye
gaze estimation.
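For reference, the mean/median filter baselines in Table 4 amount to simple temporal smoothing of the raw gaze trajectory. A minimal median-filter sketch follows; the window length is a hypothetical choice, not reported in the text:

```python
import numpy as np

def median_filter_gaze(gaze, win=5):
    """Sliding-window median filter over a gaze trajectory.

    gaze : (T, 2) array of raw per-frame gaze estimates.
    win  : odd window size (hypothetical; the paper does not report
           the window length used for its mean/median baselines).
    """
    half = win // 2
    # Pad with edge values so the output length matches the input.
    padded = np.pad(gaze, ((half, half), (0, 0)), mode="edge")
    out = np.empty_like(gaze, dtype=float)
    for t in range(len(gaze)):
        out[t] = np.median(padded[t:t + win], axis=0)
    return out
```

Such a filter suppresses isolated outlier frames but, unlike the proposed model, treats all eye movement types identically.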
5.5. Comparison with state-of-the-art
We compare with the state-of-the-art appearance-based
method [27] for both within-dataset and cross-dataset ex-
periments. Specifically, we re-implement the model in [27]
using TensorFlow, following the same architecture and
architecture-related hyperparameters. For the training-related
Table 5. Comparison with state-of-the-art.
                                   Within-dataset      Cross-dataset
Exp.                               Video    Website    Video    Website
1. Static network (ours)           5.34     4.97       9.12     9.65
2. Static network ([27])           4.97     4.86       8.73     9.17
3. Static network (ours) + DGTN    4.65     4.07       7.15     7.87
4. Static network ([27]) + DGTN    4.51     4.00       7.05     7.59
hyperparameters (e.g. learning rate, epochs), we do not
follow those in [27] but adjust them based on cross-validation.
For within-dataset experiments, the two models are
trained on the frame-based data from DynamicGaze and
are tested on web and video data from DynamicGaze. For
cross-dataset experiments, the two models are trained with
data from EyeDiap ([55]) and are tested on web and video
data from DynamicGaze.
The results are shown in Table 5. We have the following
observations: 1) Comparing Exp. 1 and Exp. 2, we can see that
both static networks give reasonable accuracy, and the more
complex one ([27]) gives better performance than ours; 2)
comparing Exp. 1 and Exp. 3, adding DGTN to the static network
significantly reduces the gaze estimation error; 3) similarly,
comparing Exp. 2 and Exp. 4, adding the DGTN module to the
state-of-the-art static network still achieves better performance; 4)
the improvement in the cross-dataset setting is more significant
than in the within-dataset case, demonstrating better generalization
by incorporating eye movement dynamics; 5) comparing
Exp. 2 and Exp. 3, we find that our proposed method (Exp. 3)
outperforms the current state-of-the-art (Exp. 2), especially
in the cross-dataset case.
6. Conclusion
In this paper, we propose to leverage eye movement
dynamics to improve eye gaze estimation. By analyzing the
eye movement patterns when naturally interacting with the
computer, we construct a dynamic gaze transition network
that captures the underlying dynamics of fixation, saccade,
smooth pursuit, as well as their durations and transitions.
Combining the top-down gaze transition prior from DGTN
with the bottom-up gaze measurements from the deep model,
we can significantly improve the eye tracking performance.
Furthermore, the proposed method allows online model re-
finement which helps generalize to unseen subjects or new
environments. Quantitative results demonstrate the effec-
tiveness of the proposed method and the significance of
incorporating eye movement dynamics into eye tracking.
Acknowledgments: The work described in this paper
is supported in part by NSF award (IIS 1539012) and by
RPI-IBM Cognitive Immersive Systems Laboratory (CISL),
a center in IBM’s AI Horizon Network.
References
[1] A. L. Yarbus, “Eye movements during perception of complex objects,”
in Eye movements and vision, pp. 171–211, Springer, 1967. 1
[2] W. A. W. Adnan, W. N. H. Hassan, N. Abdullah, and J. Taslim, “Eye
tracking analysis of user behavior in online social networks,” in Inter-
national Conference on Online Communities and Social Computing,
pp. 113–119, Springer, 2013. 1
[3] G.-J. Qi, C. C. Aggarwal, and T. S. Huang, “Online community detec-
tion in social sensing,” in Proceedings of the sixth ACM international
conference on Web search and data mining, pp. 617–626, ACM, 2013.
1
[4] J. Tang, X. Shu, G.-J. Qi, Z. Li, M. Wang, S. Yan, and R. Jain,
“Tri-clustered tensor completion for social-aware image tag refinement,”
IEEE transactions on pattern analysis and machine intelligence,
vol. 39, no. 8, pp. 1662–1674, 2017. 1
[5] G.-J. Qi, C. C. Aggarwal, and T. Huang, “Link prediction across
networks by biased cross-network sampling,” in 2013 IEEE 29th
International Conference on Data Engineering (ICDE), pp. 793–804,
IEEE, 2013. 1
[6] J. H. Goldberg, M. J. Stimson, M. Lewenstein, N. Scott, and A. M.
Wichansky, “Eye tracking in web search tasks: design implications,”
in Proceedings of the 2002 symposium on Eye tracking research &
applications, pp. 51–58, ACM, 2002. 1
[7] X. Wang, T. Zhang, G.-J. Qi, J. Tang, and J. Wang, “Supervised quan-
tization for similarity search,” in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pp. 2018–2026, 2016.
1
[8] S. Chang, G.-J. Qi, C. C. Aggarwal, J. Zhou, M. Wang, and T. S.
Huang, “Factorized similarity learning in networks,” in 2014 IEEE
International Conference on Data Mining, pp. 60–69, IEEE, 2014. 1
[9] C. H. Morimoto and M. R. Mimica, “Eye gaze tracking techniques for
interactive applications,” Computer vision and image understanding,
vol. 98, no. 1, pp. 4–24, 2005. 1
[10] K. Wang, R. Zhao, and Q. Ji, “Human computer interaction with head
pose, eye gaze and body gestures,” in 2018 13th IEEE International
Conference on Automatic Face & Gesture Recognition (FG 2018),
pp. 789–789, IEEE, 2018. 1
[11] R. Zhao, K. Wang, R. Divekar, R. Rouhani, H. Su, and Q. Ji, “An
immersive system with multi-modal human-computer interaction,”
in 2018 13th IEEE International Conference on Automatic Face &
Gesture Recognition (FG 2018), pp. 517–524, IEEE, 2018. 1
[12] R. R. Divekar, M. Peveler, R. Rouhani, R. Zhao, J. O. Kephart,
D. Allen, K. Wang, Q. Ji, and H. Su, “Cira: An architecture for
building configurable immersive smart-rooms,” in Proceedings of SAI
Intelligent Systems Conference, pp. 76–95, Springer, 2018. 1
[13] S. Fiedler and A. Glockner, “The dynamics of decision making in
risky choice: An eye-tracking analysis,” Frontiers in psychology,
vol. 3, p. 335, 2012. 1
[14] S. Martinez-Conde, J. Otero-Millan, and S. L. Macknik, “The impact
of microsaccades on vision: towards a unified theory of saccadic
function,” Nature Reviews Neuroscience, vol. 14, no. 2, p. 83, 2013. 1
[15] D. Hansen and Q. Ji, “In the eye of the beholder: A survey of models
for eyes and gaze,” 2010. 1
[16] K. Wang and Q. Ji, “3d gaze estimation without explicit personal
calibration,” Pattern Recognition, 2018. 1
[17] D. Beymer and M. Flickner, “Eye gaze tracking using an active stereo
head,” in Computer vision and pattern recognition, 2003. Proceedings.
2003 IEEE computer society conference on, vol. 2, pp. II–451, IEEE,
2003. 1
[18] K. Wang, S. Wang, and Q. Ji, “Deep eye fixation map learning for
calibration-free eye gaze tracking,” in Proceedings of the Ninth Bi-
ennial ACM Symposium on Eye Tracking Research & Applications,
pp. 47–55, ACM, 2016. 1, 2
[19] E. D. Guestrin and M. Eizenman, “General theory of remote gaze
estimation using the pupil center and corneal reflections,” IEEE Trans-
actions on biomedical engineering, vol. 53, no. 6, pp. 1124–1133,
2006. 1
[20] K. Wang and Q. Ji, “Hybrid model and appearance based eye tracking
with kinect,” in Proceedings of the Ninth Biennial ACM Symposium
on Eye Tracking Research & Applications, pp. 331–332, ACM, 2016.
1
[21] X. Xiong, Q. Cai, Z. Liu, and Z. Zhang, “Eye gaze tracking using an
rgbd camera: A comparison with a rgb solution,” UBICOMP, 2014. 1
[22] K. Wang and Q. Ji, “Real time eye gaze tracking with 3d deformable
eye-face model,” in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pp. 1003–1011, 2017. 1
[23] L. Jianfeng and L. Shigang, “Eye-model-based gaze estimation by
rgb-d camera,” in CVPR Workshops, 2014. 1
[24] K. Wang and Q. Ji, “Real time eye gaze tracking with kinect,” in
Pattern Recognition (ICPR), 2016 23rd International Conference on,
pp. 2752–2757, IEEE, 2016. 1
[25] K. Wang, R. Zhao, and Q. Ji, “A hierarchical generative model for
eye image synthesis and eye gaze estimation,” in Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 440–448, 2018. 1
[26] X. Zhang, Y. Sugano, M. Fritz, and A. Bulling, “Appearance-based
gaze estimation in the wild,” in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pp. 4511–4520, 2015.
1
[27] K. Krafka, A. Khosla, P. Kellnhofer, H. Kannan, S. Bhandarkar,
W. Matusik, and A. Torralba, “Eye tracking for everyone,” in Pro-
ceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 2176–2184, 2016. 1, 2, 8
[28] Q. Huang, A. Veeraraghavan, and A. Sabharwal, “Tabletgaze: dataset
and analysis for unconstrained appearance-based gaze estimation in
mobile tablets,” Machine Vision and Applications, vol. 28, no. 5-6,
pp. 445–461, 2017. 1
[29] T. Fischer, H. Jin Chang, and Y. Demiris, “Rt-gene: Real-time eye
gaze estimation in natural environments,” in Proceedings of the Eu-
ropean Conference on Computer Vision (ECCV), pp. 334–352, 2018.
1
[30] Y. Cheng, F. Lu, and X. Zhang, “Appearance-based gaze estimation
via evaluation-guided asymmetric regression,” in Proceedings of the
European Conference on Computer Vision (ECCV), pp. 100–115,
2018. 1
[31] K. Wang, R. Zhao, H. Su, and Q. Ji, “Generalizing eye tracking with
bayesian adversarial learning,” in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, 2019. 1
[32] Y. Sugano, Y. Matsushita, and Y. Sato, “Appearance-based gaze esti-
mation using visual saliency,” IEEE transactions on pattern analysis
and machine intelligence, vol. 35, no. 2, pp. 329–341, 2013. 2
[33] J. Chen and Q. Ji, “A probabilistic approach to online eye gaze track-
ing without explicit personal calibration,” IEEE Transactions on Im-
age Processing, vol. 24, no. 3, pp. 1076–1086, 2015. 2
[34] M. Cerf, J. Harel, W. Einhauser, and C. Koch, “Predicting human gaze
using low-level saliency combined with face detection,” in Advances
in neural information processing systems, pp. 241–248, 2008. 2
[35] Q. Zhao, S. Chang, F. M. Harper, and J. A. Konstan, “Gaze predic-
tion for recommender systems,” in Proceedings of the 10th ACM
Conference on Recommender Systems, pp. 131–138, ACM, 2016. 2
[36] Y. Feng, G. Cheung, W.-t. Tan, and Y. Ji, “Hidden markov model for
eye gaze prediction in networked video streaming,” in Multimedia
and Expo (ICME), 2011 IEEE International Conference on, pp. 1–6,
IEEE, 2011. 2
[37] A. T. Duchowski, “Eye tracking methodology,” Theory and practice,
vol. 328, 2007. 2
[38] M. Nystrom and K. Holmqvist, “An adaptive algorithm for fixation,
saccade, and glissade detection in eyetracking data,” 2010. 2
[39] E. Tafaj, G. Kasneci, W. Rosenstiel, and M. Bogdan, “Bayesian online
clustering of eye movement data,” in Proceedings of the Symposium
on Eye Tracking Research and Applications, pp. 285–288, ACM,
2012. 2
[40] L. Larsson, M. Nystrom, R. Andersson, and M. Stridh, “Detection of
fixations and smooth pursuit movements in high-speed eye-tracking
data,” Biomedical Signal Processing and Control, vol. 18, pp. 145–
152, 2015. 2
[41] T. Santini, W. Fuhl, T. Kubler, and E. Kasneci, “Bayesian identifi-
cation of fixations, saccades, and smooth pursuits,” in Proceedings
of the Ninth Biennial ACM Symposium on Eye Tracking Research &
Applications, pp. 163–170, ACM, 2016. 2
[42] O. Le Meur and Z. Liu, “Saccadic model of eye movements for free-
viewing condition,” Vision research, vol. 116, pp. 152–164, 2015.
3
[43] K. P. Murphy, “Hidden semi-markov models (hsmms),” 2002. 4, 5
[44] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification
with deep convolutional neural networks,” in Advances in neural
information processing systems, pp. 1097–1105, 2012. 5
[45] T. Zhang, G.-J. Qi, B. Xiao, and J. Wang, “Interleaved group con-
volutions,” in Proceedings of the IEEE International Conference on
Computer Vision, pp. 4373–4382, 2017. 5
[46] K. P. Murphy and S. Russell, “Dynamic bayesian networks: represen-
tation, inference and learning,” 2002. 5
[47] R. Carmi and L. Itti, “The role of memory in guiding attention during
natural vision,” Journal of vision, vol. 6, no. 9, pp. 4–4, 2006. 6
[48] K. Kurzhals, C. F. Bopp, J. Bassler, F. Ebinger, and D. Weiskopf,
“Benchmark data for evaluating visualization and analysis techniques
for eye tracking for video stimuli,” in Proceedings of the fifth work-
shop on beyond time and errors: novel evaluation methods for visual-
ization, pp. 54–60, ACM, 2014. 6
[49] P. K. Mital, T. J. Smith, R. L. Hill, and J. M. Henderson, “Clustering of
gaze during dynamic scene viewing is predicted by motion,” Cognitive
Computation, vol. 3, no. 1, pp. 5–24, 2011. 6
[50] A. Bulat and G. Tzimiropoulos, “How far are we from solving the
2d & 3d face alignment problem?(and a dataset of 230,000 3d facial
landmarks),” in International Conference on Computer Vision, vol. 1,
p. 4, 2017. 7
[51] E. Murphy-Chutorian and M. M. Trivedi, “Head pose estimation in
computer vision: A survey,” IEEE transactions on pattern analysis
and machine intelligence, vol. 31, no. 4, pp. 607–626, 2009. 7
[52] K. Wang, Y. Wu, and Q. Ji, “Head pose estimation on low-quality
images,” in 2018 13th IEEE International Conference on Automatic
Face & Gesture Recognition (FG 2018), pp. 540–547, IEEE, 2018. 7
[53] T. Mikolov, M. Karafiat, L. Burget, J. Cernocky, and S. Khudanpur,
“Recurrent neural network based language model,” in Eleventh annual
conference of the international speech communication association,
2010. 8
[54] H. Hu and G.-J. Qi, “State-frequency memory recurrent neural net-
works,” in Proceedings of the 34th International Conference on Ma-
chine Learning-Volume 70, pp. 1568–1577, JMLR. org, 2017. 8
[55] K. A. F. Mora, F. Monay, and J.-M. Odobez, “Eyediap: A database for
the development and evaluation of gaze estimation algorithms from
rgb and rgb-d cameras,” in Proceedings of the Symposium on Eye
Tracking Research and Applications, pp. 255–258, ACM, 2014. 8
An Adversarial Hierarchical Hidden Markov Model for Human Pose Modeling and Generation
Rui Zhao, Qiang Ji
Rensselaer Polytechnic Institute
Troy, NY, USA
{zhaor,jiq}@rpi.edu
Abstract
We propose a hierarchical extension to the hidden Markov model (HMM) under the Bayesian framework to overcome its limited model capacity. The model parameters are treated as random variables whose distributions are governed by hyperparameters. Therefore the variation in data can be modeled at both the instance level and the distribution level. We derive a novel learning method for estimating the parameters and hyperparameters of our model based on the adversarial learning framework, which has shown promising results in generating photorealistic images and videos. We demonstrate the benefit of the proposed method on human motion capture data through comparison with both state-of-the-art methods and the same model learned by maximizing likelihood. The first experiment on reconstruction shows the model's capability of generalizing to novel testing data. The second experiment on synthesis shows the model's capability of generating realistic and diverse data.
Introduction

In recent years, generative dynamic models have attracted a lot of attention due to their potential for learning representations from unlabeled sequential data as well as their capability of data generation (Gan et al. 2015; Srivastava, Mansimov, and Salakhudinov 2015; Mittelman et al. 2014; Xue et al. 2016; Walker et al. 2016). Sequential data introduce additional challenges for modeling due to temporal dependencies and significant intra-class variation. Consider human action as an example. Even though the underlying dynamic pattern remains similar for the same type of action, the actual pose and speed vary for different people. Even if the same person performs the action repeatedly, there will be noticeable differences. This motivates us to design a probabilistic dynamic model that not only can capture the consistent dynamic pattern across different data instances, but also can accommodate the variation therein.
A widely used dynamic model such as the HMM models a dynamic process through transitions among different discrete states. In order to encode N bits of information, an HMM needs 2^N states. Therefore, the model complexity increases exponentially with the model capacity. A linear dynamic system (LDS) uses continuous states to capture dynamics, which avoids the exponential increase of model complexity. However,
Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
the LDS assumes the underlying dynamics can be described by a linear model, which may not be sufficient for cases like human motion data. On the other hand, more complex models such as recurrent neural network (RNN) based deep models often have an exceedingly large number of parameters. Without a sufficiently large amount of data or careful regularization, training such a model is prone to overfitting. In addition, the model is deterministic. Simply reducing the model complexity compromises the capability of capturing the randomness and variation present in the data. We instead propose a hierarchical HMM (HHMM), which extends the shallow HMM by leveraging the Bayesian framework. The proposed HHMM allows model parameters to vary as random variables among data instances. Given the same number of parameters, the HHMM has a much larger capacity compared to the HMM. Besides, the HHMM retains the inference methods available for the HMM, allowing us to do various inference tasks efficiently. Finally, as a probabilistic generative model, the HHMM can capture spatio-temporal dependencies in the dynamic process and model variations in a principled way.
As for model learning, the maximum likelihood estimate (MLE) has been the de facto learning objective for probabilistic generative models. Despite its wide adoption, MLE tends to fit a diffused distribution on the data (Theis, Oord, and Bethge 2015). For static image synthesis, the results often look blurred. Recently, adversarial learning has emerged as a popular criterion for learning generative models. Variants of generative adversarial networks (GAN) (Goodfellow et al. 2014; Radford, Metz, and Chintala 2015; Reed et al. 2016; Nowozin, Cseke, and Tomioka 2016; Nguyen et al. 2017) show promising results in generating both sharp and realistic-looking images of faces, objects, and indoor/outdoor scenes. There is also an increasing interest in extending the framework to dynamic data (Vondrick, Pirsiavash, and Torralba 2016; Saito, Matsumoto, and Saito 2017; Tulyakov et al. 2017). In this work, we explore the idea of training the proposed HHMM using an adversarial objective, which has two major benefits. First, it bypasses the intractable objective of MLE in a hierarchical model, where the integration over parameters introduces additional dependencies among random variables. Second, it aims at learning a model that can generate realistic-looking data. Following the adversarial learning framework, we introduce a separate discriminative dynamic model to guide the learning of the HHMM, which serves
The Thirty-Second AAAI Conferenceon Artificial Intelligence (AAAI-18)
as the generator. While the generator tries to generate data that look as realistic as possible, the discriminator tries to classify the generated data as fake. The two models compete against each other in order to reach an equilibrium. We derive a gradient ascent based optimization method for updating the parameters of both models. To the best of our knowledge, this is the first work that exploits adversarial learning for modeling dynamic data with a fully probabilistic generator and discriminator.
Related work

Probabilistic dynamic models. HMM and its variants (Rabiner 1989; Fine, Singer, and Tishby 1998; Brand, Oliver, and Pentland 1997; Ghahramani, Jordan, and Smyth 1997; Yu 2010) are widely used to model sequential data, where the dynamics change according to transitions among different discrete states. The observations are then emitted from a state-dependent distribution. The state can also be continuous, as modeled in the LDS, which is also known as the Kalman filter (Kalman and others 1960). In a more general formulation, both the HMM and the LDS can be considered special variants of dynamic Bayesian networks (DBN) (Murphy 2002). Our model expands the model capacity through its hierarchical structure instead of increasing the complexity, which is ineffective for the HMM. With the enhanced model capacity, our model can better accommodate the variation and non-linearity of the dynamics. Another major type of dynamic model consists of undirected graphical models such as temporal extensions of the restricted Boltzmann machine (RBM) (Taylor, Hinton, and Roweis 2006; Sutskever and Hinton 2007; Mittelman et al. 2014) and the dynamic conditional random field (DCRF) (Sutton, McCallum, and Rohanimanesh 2007; Tang, Fei-Fei, and Koller 2012). While the RBM can capture non-linearity and expand capacity through vectorized hidden states, its learning requires approximating an intractable partition function, and the choice of hidden state dimension may not be trivial. The DCRF model is trained discriminatively given class labels and is not suitable for the data generation task.
More recently, models that combine the probabilistic framework with deterministic models such as neural networks (NN) have been proposed. (Krishnan, Shalit, and Sontag 2015) proposed deep Kalman filters, which use a NN to parameterize the transition and emission probabilities. (Johnson et al. 2016) used a variational autoencoder to specify the emission distribution of a switching LDS. (Gan et al. 2015) proposed the deep temporal sigmoid belief network (TSBN), where the hidden nodes are binary and their conditional distributions are specified by sigmoid functions. Variants of RNNs with additional stochastic nodes have been introduced to improve the capability of modeling randomness (Bayer and Osendorfer 2014; Chung et al. 2015; Fraccaro et al. 2016). To better account for intra-class variation, (Wang, Fleet, and Hertzmann 2008) modeled dynamics using a Gaussian process, where the uncertainty is handled by marginalizing out the parameter space imposed with a Gaussian process prior. (Joshi et al. 2017) proposed a Bayesian NN which can adapt to subject-dependent variation for action recognition. Deep learning based models typically require a large amount of training data. For smaller datasets, careful regularization or other auxiliary techniques such as data augmentation, pre-training, drop-out, etc., are needed. In contrast, our HHMM has built-in regularization through the hyperparameters learned using all the intra-class data, and it is less prone to overfitting. Besides, the HHMM can handle missing data, as the probabilistic inference can be carried out in the absence of some observations. Furthermore, the HHMM is easier to interpret, as its nodes are associated with semantic meanings.
Learning methods of dynamic models. Maximum likelihood learning is widely used to obtain point estimates of model parameters. For models with a tractable likelihood function, numerical optimization techniques such as gradient ascent can be used to maximize the likelihood function directly with respect to the parameters. In general, for models with hidden variables, whose values are always unknown during training, expectation maximization (EM) (Dempster, Laird, and Rubin 1977) is often used, which optimizes a tight lower bound of the model log-likelihood. Bayesian parameter estimation can also be used as an alternative to MLE when prior information on the parameters needs to be incorporated, resulting in a maximum a posteriori (MAP) estimate. For instance, (Brand and Hertzmann 2000) introduced a prior on HMM parameters to encourage a smaller cross entropy between a specific stylistic motion model and a generic motion model. When the goal is to classify data into different categories, a generative dynamic model can also be learned with discriminative criteria, such as maximizing the conditional likelihood of one of the categories (Wang and Ji 2012). Our work provides another objective for learning generative dynamic models by adopting the adversarial learning framework: the generative model has to compete against another discriminative model in order to fit the data distribution well. An important difference of our method from existing adversarially learned dynamic models like TGAN (Saito, Matsumoto, and Saito 2017) is that both our generator and discriminator are fully probabilistic models which explicitly model the variation of the data distribution.
Methods

We first present the proposed dynamic model. Then we briefly review the adversarial learning framework and describe the learning algorithm in detail. Finally, we discuss the inference methods used for various tasks.
Bayesian Hierarchical HMM

We now describe the proposed HHMM, which models the dynamics and variation of data at two levels. First, the random variables capture the spatial distribution and temporal evolution of dynamic data. Second, the parameters specifying the model are themselves treated as random variables with prior distributions. Note that the term HHMM was first used in (Fine, Singer, and Tishby 1998), where the hierarchy is applied only to hidden or observed nodes with fixed parameters in order to model multi-scale structure in data. Our model instead constructs the hierarchy within a Bayesian framework in order to handle large variation in data. Specifically, we define X = {X_1, ..., X_T} as the sequence of observations and Z = {Z_1, ..., Z_T} as the hidden state chain. The joint distribution of the HHMM under the
Figure 1: Topology of the HHMM, in plate notation. T is the length of a sequence, N the number of sequences, and Q the number of hidden states. The self-edge on Z_t denotes the temporal link from Z_{t-1} to Z_t. Circle-shaped nodes represent variables; diamond-shaped nodes represent parameters or hyperparameters.
first-order Markov assumption is given by
$$P(X, Z, \theta \mid \bar{\alpha}) = P(Z_1 \mid \pi) \prod_{t=2}^{T} P(Z_t \mid Z_{t-1}, A) \prod_{t=1}^{T} P(X_t \mid Z_t, \mu, \Sigma) \; P(A \mid \eta) \, P(\mu \mid \lambda) \quad (1)$$

where π is a stochastic vector specifying the initial state distribution, i.e., P(Z_1 = i) = π_i. A is a stochastic matrix whose i-th row specifies the probabilities of transitioning from state i to the other states, i.e., P(Z_t = j | Z_{t-1} = i) = A_ij. μ and Σ are the emission distribution parameters. We use Gaussian emissions since the observations are continuous, i.e., P(X_t | Z_t = i) = N(μ_i, Σ_i), and assume diagonal covariance matrices. θ = {A, μ} and θ̄ = {π, Σ} are the model parameters, and α = {η, λ} are the model hyperparameters. We denote ᾱ = {α, θ̄} = {η, λ, π, Σ} as the augmented set of hyperparameters obtained by including θ̄. The model topology is shown in Figure 1.
We use conjugate priors for θ. Specifically, we place a Dirichlet prior on the transition parameter A with hyperparameter η, and a Normal prior on the emission mean μ with hyperparameter λ = {μ_0, Σ_0}.
$$P(A_{i:} \mid \eta_i) \propto \prod_{j=1}^{Q} A_{ij}^{\eta_{ij}-1}, \qquad P(\mu_i \mid \lambda) \propto \exp\Big(-\tfrac{1}{2}(\mu_i - \mu_{i0})^{\top} \Sigma_{i0}^{-1} (\mu_i - \mu_{i0})\Big)$$

where i = 1, ..., Q, η_ij > 0, μ_i0 ∈ R^O, and Σ_i0 ∈ R^(O×O). Q is the number of hidden states and O is the dimension of the data. The benefit of the hierarchical model can be seen from its structure: under the same model complexity, i.e., the same number of hidden states and data dimension, the parameters of the HHMM can further vary with each data instance. Thus the HHMM has increased modeling capacity compared to an HMM, which is crucial for modeling data variation.
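To make Eq. (1) and the priors concrete, the joint log-density can be sketched as follows. This is a minimal NumPy sketch with illustrative names, assuming diagonal covariances stored as variance vectors; the prior terms are computed only up to their normalizing constants.

```python
import numpy as np

def gaussian_logpdf(x, mu, var):
    # Diagonal-covariance Gaussian log-density, as assumed in the paper.
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def hhmm_joint_loglik(X, Z, pi, A, mu, var, eta, mu0, var0):
    """log P(X, Z, theta | alpha-bar) of Eq. (1): chain terms plus parameter priors."""
    T = len(X)
    ll = np.log(pi[Z[0]])                  # initial state term P(Z_1 | pi)
    for t in range(1, T):
        ll += np.log(A[Z[t - 1], Z[t]])    # transition terms P(Z_t | Z_{t-1}, A)
    for t in range(T):
        ll += gaussian_logpdf(X[t], mu[Z[t]], var[Z[t]])  # emission terms
    # Dirichlet prior on each row of A (up to the normalizing constant)
    ll += np.sum((eta - 1.0) * np.log(A))
    # Normal prior on the emission means (up to the normalizing constant)
    ll += np.sum(-0.5 * (mu - mu0) ** 2 / var0)
    return ll
```

With uniform Dirichlet hyperparameters (η_ij = 1) the transition prior term vanishes, leaving only the chain terms and the Normal prior on the means.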
Adversarial learning of HHMM

The adversarial learning approach uses a novel objective to train a generative model G by introducing a discriminative model D. Intuitively, G aims to generate samples that resemble the real data distribution, while D tries to differentiate whether a sample comes from the real data or was generated by G. When both G and D are parameterized by neural networks, this yields the GAN (Goodfellow et al. 2014). Leveraging the adversarial learning framework, we develop a method for learning the HHMM, which we use as the generator. The discriminator is a pair of HMMs trained with a discriminative objective. We describe the overall optimization formulation first, followed by a detailed discussion of generator and discriminator learning. We introduce an additional binary variable y associated with X to indicate whether X is real (y = 1) or fake (y = −1). The overall optimization objective is defined by Eq. (2).
$$\min_{\bar{\alpha}} \max_{\phi} \; \mathbb{E}_{X \sim P_{data}(X)}[\log D(X \mid \phi)] + \mathbb{E}_{X \sim P_G(X \mid \bar{\alpha})}[\log(1 - D(X \mid \phi))] \quad (2)$$

where D(X|φ) ≜ P_D(y = 1 | X, φ) is the output of the discriminator, specifying the probability that X is real data, and φ are the discriminator parameters. P_data(X) is the real data distribution and P_G(X|ᾱ) is the likelihood of the generator G on X. Compared to a GAN, the probabilistic generative model directly specifies the distribution of X, with randomness and dependency encoded through the latent variables. The goal of learning is to estimate ᾱ and φ. The optimization alternates between the two models, optimizing one while holding the other fixed at each iteration.
Generator We now discuss generator learning in detail; the generator is the HHMM in our case. The benefit of a probabilistic dynamic model is that we can model data variation and randomness in a principled way. In addition, we can generate sequences of different lengths. Finally, we can evaluate the data likelihood under the learned model, as described later under inference. When optimizing ᾱ in Eq. (2), we hold φ fixed. We use the same approximate objective as (Goodfellow et al. 2014), resulting in the following objective.
$$\max_{\bar{\alpha}} \; L_G(\bar{\alpha}) \triangleq \mathbb{E}_{X \sim P_G(X \mid \bar{\alpha})}[\log D(X \mid \phi)] \approx \frac{1}{MN} \sum_{i=1}^{N} \sum_{j=1}^{M} \log D(X_{ij} \mid \phi), \qquad \theta_i \sim P(\theta \mid \alpha), \; X_{ij} \sim P(X \mid \theta_i, \bar{\theta}) \quad (3)$$

However, the sample-based approximation no longer explicitly depends on ᾱ. We use the identity ∇_X f(X) = f(X) ∇_X log f(X) to derive an unbiased estimate of the gradient of L_G(ᾱ) by directly taking the derivative of Eq. (3); a similar strategy is used in (Williams 1992).
$$\frac{\partial L_G(\bar{\alpha})}{\partial \alpha} \approx \sum_{i=1}^{N} \sum_{j=1}^{M} \frac{\log D(X_{ij} \mid \phi)}{MN} \frac{\partial \log P(\theta_i \mid \alpha)}{\partial \alpha} \quad (4)$$

$$\frac{\partial L_G(\bar{\alpha})}{\partial \bar{\theta}} \approx \sum_{i=1}^{N} \sum_{j=1}^{M} \frac{\log D(X_{ij} \mid \phi)}{MN} \frac{\partial \log P(X_{ij} \mid \theta_i, \bar{\theta})}{\partial \bar{\theta}} \quad (5)$$

where θ_i ∼ P(θ|α) and X_ij ∼ P(X|θ_i, θ̄). In Eq. (4), the partial derivative is taken through the prior distribution of the parameters,
which has an analytical form given our parameterization. In Eq. (5), the partial derivative corresponds to the gradient of the log-likelihood of an HMM, which can be computed by exploiting the chain structure of the hidden states as described in (Cappe, Buchoux, and Moulines 1998). SGD with RMSProp (Tieleman and Hinton 2012) for adaptive gradient magnitudes is used to update ᾱ. We also reparameterize κ = log σ, where σ² denotes the diagonal entries of Σ, which is assumed diagonal. Intuitively, given a fixed D, samples with D(X_ij|φ) → 0 are weighted heavily to encourage improvement, while samples with D(X_ij|φ) → 1 have ∂L_G(ᾱ)/∂ᾱ → 0 and thus contribute little to the update.
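The score-function identity behind Eqs. (4) and (5) can be checked on a toy problem where the true gradient is known. This is only an illustrative sketch, not the paper's estimator: here x ∼ N(α, 1) and f plays the role of log D.

```python
import numpy as np

def score_function_grad(f, alpha, n_samples=100000, seed=0):
    # Estimates d/d_alpha E_{x ~ N(alpha, 1)}[f(x)] without differentiating f,
    # using grad E[f] = E[f(x) * d log p(x|alpha)/d alpha] = E[f(x) * (x - alpha)].
    rng = np.random.default_rng(seed)
    x = rng.normal(alpha, 1.0, size=n_samples)
    return float(np.mean(f(x) * (x - alpha)))
```

For f(x) = x the true gradient of E[x] with respect to α is exactly 1, so the estimate should concentrate near 1 as the sample count grows; for a constant f it should concentrate near 0.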
Discriminator Our discriminator consists of a pair of HMMs with parameters φ⁺ and φ⁻, respectively. The use of a dynamic-model-based discriminator is largely motivated by the need to work with sequential data: to differentiate whether a motion sequence looks realistic, the discriminator should be able to recognize the underlying motion pattern subject to variation. In addition, a dynamic discriminator can accommodate sequences of different lengths. Specifically, the output of the discriminator is defined as follows.
$$D(X \mid \phi) = \frac{P(X \mid \phi^{+})}{P(X \mid \phi^{+}) + P(X \mid \phi^{-}) \frac{P(y=-1)}{P(y=1)}} \quad (6)$$
where P(y) is the prior probability of the labels. Since we choose the same number of real and fake samples at each update, we can assume a uniform label distribution, namely P(y = 1) = P(y = −1) = 1/2. P(X|φ⁺) and P(X|φ⁻) are the likelihoods of the two HMMs evaluated on X. The two HMMs are trained discriminatively under the objective of Eq. (2) with ᾱ held fixed. Specifically, given a set of M randomly generated samples {X⁻_j} from the generator and a set of M randomly selected real data samples {X⁺_j}, the objective for learning φ is equivalent to the negative cross-entropy loss, as follows.
$$\max_{\phi} \; L_D(\phi) \triangleq \mathbb{E}_{X \sim P_{data}(X)}[\log D(X \mid \phi)] + \mathbb{E}_{X \sim P_G(X \mid \bar{\alpha})}[\log(1 - D(X \mid \phi))] \approx \frac{1}{M} \sum_{j=1}^{M} \log D(X_j^{+} \mid \phi) + \log(1 - D(X_j^{-} \mid \phi)) \quad (7)$$
By substituting Eq. (6) into Eq. (7), we can compute the gradient of L_D(φ) with respect to φ.
$$\frac{\partial L_D(\phi)}{\partial \phi^{+}} \approx \frac{1}{M} \sum_{j=1}^{M} \Big[ \frac{P(X_j^{+} \mid \phi^{-})}{s(X_j^{+})} \frac{\partial \log P(X_j^{+} \mid \phi^{+})}{\partial \phi^{+}} - \frac{P(X_j^{-} \mid \phi^{+})}{s(X_j^{-})} \frac{\partial \log P(X_j^{-} \mid \phi^{+})}{\partial \phi^{+}} \Big] \quad (8)$$

$$\frac{\partial L_D(\phi)}{\partial \phi^{-}} \approx \frac{1}{M} \sum_{j=1}^{M} \Big[ \frac{P(X_j^{-} \mid \phi^{+})}{s(X_j^{-})} \frac{\partial \log P(X_j^{-} \mid \phi^{-})}{\partial \phi^{-}} - \frac{P(X_j^{+} \mid \phi^{-})}{s(X_j^{+})} \frac{\partial \log P(X_j^{+} \mid \phi^{-})}{\partial \phi^{-}} \Big] \quad (9)$$
where s(X) = P(X|φ⁺) + P(X|φ⁻). Again, ∂ log P(X|φ⁺)/∂φ⁺ and ∂ log P(X|φ⁻)/∂φ⁻ are the gradients of the log-likelihoods of the two HMMs, for which an analytical form is available, as described in the generator update. The overall procedure is summarized in Algorithm 1.
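Since Eq. (6) combines two HMM likelihoods that easily underflow for long sequences, it is natural to evaluate it from log-likelihoods. A minimal sketch under the uniform label prior P(y = 1) = P(y = −1) = 1/2 (function names are illustrative):

```python
import numpy as np

def discriminator_output(loglik_pos, loglik_neg):
    # Eq. (6) with a uniform label prior reduces to a logistic function of the
    # log-likelihood ratio: D = 1 / (1 + exp(llh_neg - llh_pos)).
    return 1.0 / (1.0 + np.exp(loglik_neg - loglik_pos))
```

When the two HMMs assign equal likelihood the output is exactly 1/2, and D depends only on the difference of log-likelihoods, so very small absolute likelihoods cause no numerical trouble.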
Algorithm 1 Adversarial learning of HHMM

Require: {X}: real dataset. Q: number of hidden states. M: number of samples. N: number of parameter sets. k: update steps for φ. l: update steps for ᾱ.
Ensure: Generator ᾱ. Discriminator φ.
1: Initialize ᾱ and φ
2: repeat
3:   for k steps do
4:     Draw M samples each from P_G and the real dataset.
5:     Update discriminator φ using RMSProp with the gradients defined by Eq. (8) and Eq. (9).
6:   end for
7:   for l steps do
8:     Draw N samples of θ; for each θ, draw M samples.
9:     Update generator ᾱ using RMSProp with the gradients defined by Eq. (4) and Eq. (5).
10:  end for
11: until convergence or the maximum number of iterations is reached
12: return ᾱ
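Algorithm 1 can be illustrated end-to-end on a one-dimensional toy problem, with the HHMM generator replaced by a single Gaussian N(α, 1) and the HMM-pair discriminator by a logistic classifier. This is only an analogue of the alternating updates, not the paper's model; all names and learning rates are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_adversarial(data_mean=3.0, iters=200, M=64, seed=0):
    rng = np.random.default_rng(seed)
    alpha = 0.0          # generator parameter: mean of N(alpha, 1)
    w, b = 0.0, 0.0      # discriminator D(x) = sigmoid(w*x + b)
    lr_d, lr_g = 0.1, 0.05
    for _ in range(iters):
        real = rng.normal(data_mean, 1.0, M)
        fake = rng.normal(alpha, 1.0, M)
        # discriminator step: ascend log D(real) + log(1 - D(fake))
        d_real, d_fake = sigmoid(w * real + b), sigmoid(w * fake + b)
        w += lr_d * (np.mean((1 - d_real) * real) - np.mean(d_fake * fake))
        b += lr_d * (np.mean(1 - d_real) - np.mean(d_fake))
        # generator step: score-function gradient of E[log D(fake)],
        # weighting each sample's score (x - alpha) by log D, as in Eq. (4)
        fake = rng.normal(alpha, 1.0, M)
        alpha += lr_g * np.mean(np.log(sigmoid(w * fake + b)) * (fake - alpha))
    return alpha
```

After training, α should have moved from 0 toward the data mean, mirroring how the HHMM generator is pushed toward the real sequence distribution by the discriminator's feedback.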
Inference

We describe our methods for three inference problems associated with the HHMM when applied to different data analysis applications, as described later in the experiments.
Data synthesis One of the major applications of a generative model is to synthesize data automatically. A potential use of synthetic motion data is to supply training data for deep learning models on tasks such as action recognition. We use an ancestral-sampling-based approach to generate synthetic motion data. Specifically, we first sample the parameters A, μ from their prior distributions given the learned hyperparameters, i.e., A ∼ P(A|η), μ ∼ P(μ|μ_0, Σ_0). Second, we sample the hidden state chain given the sampled parameters A and the learned parameters π, i.e., Z_1 ∼ P(Z_1|π), Z_t ∼ P(Z_t|Z_{t−1}, A). Finally, we compute the most likely observation sequence X_1, ..., X_T conditioned on Z_1, ..., Z_T and the parameters μ, Σ. Due to the model structure, the observed nodes are independent of each other given the hidden states, so a naive solution that maximizes the conditional likelihood P(X|Z) yields the mean of the corresponding Gaussian at each frame, i.e., X_t = μ_{Z_t}. For motion capture data, this results in non-smooth changes between poses. We alleviate this issue by augmenting the features X_t to include both first-order (position) and second-order (speed) information, as suggested in (Brand 1999), where the speed is computed as the difference of consecutive positions. Then we solve the following inference problem.
$$\max_{X} \; \log P(\hat{X} \mid Z) = \sum_{t} \log \mathcal{N}(\hat{X}_t \mid \mu_{Z_t}, \Sigma_{Z_t}) \quad (10)$$

where X = {X_t}, X̂ = {X̂_t}, and X̂_t = [X_t, X_t − X_{t−1}]. Eq. (10) is a quadratic system with respect to X, whose
Figure 2: Reconstruction experiment results on Berkeley's jumping action versus the number of hidden states: (a) PCC, (b) MSE, (c) log-likelihood on training data, (d) log-likelihood on testing data. HHMM-A refers to the adversarial learning variant and HHMM-M to the maximum likelihood learning variant. (Best viewed in color.)
closed-form solution can be obtained by setting the derivative to zero.
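The three sampling steps described above (parameters from their priors, then the state chain, then the per-frame Gaussian means) can be sketched as follows. The speed-feature smoothing of Eq. (10) is omitted, and all names are illustrative.

```python
import numpy as np

def ancestral_sample(pi, eta, mu0, sig0, T, seed=0):
    """Hierarchical ancestral sampling: theta ~ P(theta | alpha), then Z, then X."""
    rng = np.random.default_rng(seed)
    Q = len(pi)
    # Step 1: sample parameters from their priors given the hyperparameters
    A = np.vstack([rng.dirichlet(eta[i]) for i in range(Q)])  # A ~ Dir(eta)
    mu = rng.normal(mu0, sig0)                                # mu ~ N(mu0, Sigma0)
    # Step 2: sample the hidden state chain
    Z = [rng.choice(Q, p=pi)]
    for t in range(1, T):
        Z.append(rng.choice(Q, p=A[Z[-1]]))
    Z = np.array(Z)
    # Step 3: the most likely observation per frame is the Gaussian mean
    X = mu[Z]
    return Z, X
```

With tight prior variances the sampled means stay close to μ_0, so the emitted frames cluster around the per-state prior means; larger prior variances yield the intra-class variation the hierarchy is designed to model.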
Reconstruction The goal of reconstruction is to generate a novel sequence that resembles the input sequence. Reconstruction evaluates the model's ability to capture the dynamics of sequential data. Since the hidden state chain is the primary source encoding the dynamic change of the data, we first infer its most probable configuration. We compute the MAP estimate θ* by solving the following problem using MAP-EM (Gauvain and Lee 1994), where the E-step has complexity O(Q²T) and the M-step has a closed-form solution.
$$\theta^{*} = \arg\max_{\theta} \; \log \sum_{Z} P(X, Z \mid \theta, \bar{\theta}) + \log P(\theta \mid \alpha) \quad (11)$$
We then run the Viterbi decoding algorithm (Rabiner 1989) on the observed testing sequence given θ*. Finally, we compute the most likely observations given the decoded states, in the same way as described for data synthesis.
Compute data likelihood The marginal likelihood of the model evaluated on data X is defined as follows.
$$llh(X) = \log P_G(X \mid \bar{\alpha}) = \log \int_{\theta} \sum_{Z} P(X, Z \mid \theta, \bar{\theta}) \, P(\theta \mid \alpha) \, d\theta \quad (12)$$
Exact computation of Eq. (12) is intractable because the integration over θ introduces additional dependencies among Z. We use the following approximation.
$$llh(X) \approx \widehat{llh}(X) = \log \sum_{Z} P(X, Z \mid \theta^{*}, \bar{\theta}) \quad (13)$$
where θ* is defined by Eq. (11). Eq. (13) can then be computed using the forward-backward algorithm (Rabiner 1989).
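The approximate likelihood of Eq. (13) needs the standard forward recursion. A log-space sketch, with log_B[t, j] = log P(X_t | Z_t = j) precomputed from the Gaussian emissions (names are illustrative):

```python
import numpy as np

def forward_loglik(log_pi, log_A, log_B):
    """log sum_Z P(X, Z) via the forward algorithm, kept in log space for stability."""
    a = log_pi + log_B[0]                       # alpha_1(j) = pi_j * b_j(X_1)
    for t in range(1, log_B.shape[0]):
        # alpha_t(j) = b_j(X_t) * sum_i alpha_{t-1}(i) * A_ij, in log space
        a = log_B[t] + np.logaddexp.reduce(a[:, None] + log_A, axis=0)
    return float(np.logaddexp.reduce(a))        # log sum_j alpha_T(j)
```

For short chains the result can be verified against brute-force enumeration of all Q^T state paths; the recursion reduces the cost to O(Q²T).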
Experiments

We evaluate the model on two tasks related to motion capture data analysis. For each type of real motion capture data, we fit one model to capture the specific dynamic process of the action. We first quantitatively evaluate the model's ability to capture dynamics through reconstruction experiments. We then show that the learned model can be used to synthesize novel motion data with intra-class variation, with both quantitative and qualitative results.
Datasets: The CMU Motion Capture database (CMU) contains a diverse collection of human motion data captured by a commercial motion capture system. To date, it contains 2605 sequences in 6 major categories and 23 subcategories collected from 112 subjects. We select a subset of the database to train our model, including walking, running, and boxing actions from 31 subjects, averaging 101 sequences per action. UC Berkeley MHAD (Ofli et al. 2013) contains motion data collected with multiple modalities; we use only the motion capture data. Twelve subjects perform 11 types of actions, and each action is repeated 5 times, yielding large intra-class variation. We select three actions for our experiments, namely jumping in place, jumping jack, and boxing, which involve substantial whole-body movement.
Preprocessing: We subtract the root joint location from each frame to make the skeleton pose invariant to position changes. We further convert the rotation angles to the exponential map representation in the same way as (Taylor, Hinton, and Roweis 2006), which makes the skeleton pose invariant to orientation about the gravitational vertical. We exclude features that are mostly constant (standard deviation < 0.5), resulting in 53 and 60 feature dimensions per frame on the CMU and Berkeley datasets, respectively. The feature dimension is then doubled by including speed features obtained as the difference of consecutive frames along each dimension. All features are scaled to have unit standard deviation within each dimension. Finally, we divide the original sequences into overlapping segments of the same length for simplicity, so that the model likelihood on different data is unaffected by sequence length, although our model can take input sequences of different lengths. The preprocessed data is then used to train the HHMM and the compared methods. We evaluate performance in feature space for all methods.
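The feature pipeline described above (dropping near-constant dimensions, appending frame-difference speed features, and unit-variance scaling) can be sketched as follows. Root-joint subtraction, the exponential-map conversion, and segmentation are omitted; names are illustrative.

```python
import numpy as np

def preprocess(seq, std_thresh=0.5):
    # seq: (T, O) array of per-frame features
    keep = seq.std(axis=0) >= std_thresh           # drop mostly-constant dims
    pos = seq[:, keep]
    speed = np.diff(pos, axis=0, prepend=pos[:1])  # frame differences; first row is 0
    feat = np.hstack([pos, speed])                 # doubles the feature dimension
    return feat / feat.std(axis=0)                 # unit std per dimension
```

The output has twice as many columns as the retained position features, matching the paper's doubled feature dimension.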
Implementation: For Algorithm 1, we use k = 1, l = 1, M = 10, N = 100. The RMSProp decay is 0.9 and the perturbation is 10⁻⁶. The learning rate is 10⁻³ for the generator and 10⁻⁴ for the discriminator. The maximum number of epochs is set to 100. To initialize ᾱ, we cluster the observations with K-means and use the cluster assignments as hidden state values, from which we estimate the model parameters and hyperparameters. To initialize φ⁺, we use the MLE on the first batch of real and synthetic data; φ⁻ is set equal to φ⁺. Our Matlab
code runs on a PC with a 3.4 GHz CPU and 8 GB RAM. The average training time per class is 1.3 hours on the CMU dataset and 1.9 hours on the Berkeley dataset.
Data reconstruction

In this experiment, we demonstrate that the learned HHMM has a large capacity to handle intra-class variation in motion capture data. For each action category, we divide the data into 4 folds, each containing distinct subjects. Reconstruction is performed cross-fold: each fold is used as testing data once, with the remaining folds as training data. We report results averaged over all folds and all input dimensions.
Quantitative metrics: We use the Pearson correlation coefficient (PCC) and mean squared error (MSE), computed in feature space between the reconstructed and actual values. PCC measures how well the prediction captures the trend of motion change; it lies between −1 and 1, the larger the better. MSE is a positive number measuring the deviation between the reconstructed and actual values, the smaller the better. We also report the approximate log-likelihood of the model evaluated on the reconstructed data.
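The two reconstruction metrics can be computed per feature dimension as follows (a straightforward sketch; the paper additionally averages over folds and dimensions):

```python
import numpy as np

def pcc_mse(pred, actual):
    # pred, actual: (T, O) arrays; returns per-dimension PCC and MSE
    pm, am = pred - pred.mean(axis=0), actual - actual.mean(axis=0)
    pcc = (pm * am).sum(axis=0) / np.sqrt((pm ** 2).sum(axis=0) * (am ** 2).sum(axis=0))
    mse = ((pred - actual) ** 2).mean(axis=0)
    return pcc, mse
```

Note the two metrics are complementary: a prediction that is a scaled and shifted copy of the target has PCC 1 but nonzero MSE, while MSE penalizes any pointwise deviation.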
First, we compare with two baselines, namely an HMM and an HHMM both learned by maximizing likelihood. While MLE of an HMM is done with EM, MLE of the HHMM is intractable; we approximate it through a two-step optimization process as described in the inference method. We vary the number of hidden states for all methods and evaluate their performance, as shown in Figure 2.
We observe that both HHMM variants consistently outperform the HMM in PCC and MSE across different state numbers. In addition, when the state number is small, increasing it helps both methods. As it keeps increasing, HHMM performance reaches a plateau while HMM performance starts to drop, a symptom of overfitting to the training data. The overfitting of the HMM becomes clearer when looking at the likelihoods, which drop significantly from training to testing data. This shows that, compared to its non-hierarchical counterpart, the HHMM has a larger capacity, which allows the model to adapt to novel data and makes it less prone to overfitting. Comparing the two variants, HHMM-M consistently achieves higher likelihood on training data across actions and datasets than HHMM-A, consistent with its maximum likelihood objective. On testing data, the likelihood gap between HHMM-M and HHMM-A becomes smaller. For PCC and MSE, HHMM-A consistently outperforms HHMM-M. Overall, these results show that the adversarially learned HHMM generalizes better to novel data by capturing the dynamic data distribution well.
We then compare our method with several state-of-the-art dynamic generative models: GPDM (Wang, Fleet, and Hertzmann 2008), a non-parametric model; ERD (Fragkiadaki et al. 2015), an RNN/LSTM-based method; and TSBN (Gan et al. 2015), which combines neural networks and graphical models. We set the number of hidden states to 20 for the HHMM throughout the remaining experiments. For the other methods, we use author-provided code. The results, averaged over the different actions, are shown in Table 1.
Table 1: Reconstruction results of different methods, averaged over features and actions. Numbers in brackets are standard deviations.

Dataset        CMU                        Berkeley
Metric         PCC          MSE           PCC          MSE
HMM            0.36[0.46]   1.12[1.54]    0.43[0.46]   0.87[1.95]
GPDM           0.70[0.24]   0.24[0.36]    0.47[0.35]   0.51[1.03]
ERD            0.66[0.34]   0.61[1.15]    0.75[0.30]   0.31[1.17]
TSBN           0.79[0.24]   0.27[0.92]    0.81[0.25]   0.18[0.64]
HHMM           0.81[0.22]   0.20[0.77]    0.81[0.26]   0.12[0.30]
On average, on the CMU dataset we achieve a 2% absolute improvement in PCC over the second-best TSBN and a 0.04 absolute reduction in MSE over the second-best GPDM. On the Berkeley dataset, we achieve PCC comparable to the second-best TSBN and a 6% improvement over ERD, and we reduce MSE by 0.06 compared to the closest competitor, TSBN. On both datasets, we outperform the baseline method by a large margin.
Figure 3: Average largest pairwise SSIM between synthetic motion sequences and real sequences from (a) CMU and (b) Berkeley datasets. Bars compare CRBM, ERD, TSBN, HHMM-M, HHMM-A, and the training data for each action (running, walking, and boxing on CMU; jumping, jumping jack, and boxing on Berkeley) and on average. (Best viewed in color.)
Data synthesis

In this experiment, we demonstrate that the adversarially learned HHMM can generate motion sequences that are both realistic and diverse. For each type of action, we train a model, which is then used to generate motion of the same type following the description in the inference method.
Quantitative results: Sequential data brings additional challenges to quality evaluation due to large variation and dependencies in both space and time. Motivated by the need to consider both fidelity and diversity of the generated sequential data, we adopt the structural similarity index (SSIM) (Wang et al. 2004) to evaluate synthesized data quality. SSIM was originally proposed for evaluating the quality of a corrupted image against an intact reference image; it is easy to compute and correlates well with perceptual image quality. It is a value between 0 and 1: the larger the value, the more perceptually similar the images. (Odena, Olah, and Shlens 2017) adopted it to evaluate the overall diversity of images generated by a GAN. To adapt SSIM to sequential data, we concatenate the features over time so that a sequence can be viewed as an image, where each pixel corresponds to a joint angle at a time step. For each method, we generate 1000 sequences. For each sequence, we compute the pairwise SSIM against all training sequences and keep the largest value. Finally, we use the average largest SSIM as a measure of the diversity of the synthesized sequences. As a reference, we also compute the pairwise SSIM among all training sequences. The results are shown in Figure 3. For both datasets, the average training-data SSIM is the lowest among all results, indicating significant intra-class variation. Among the competing methods, HHMM-A achieves the lowest average SSIM, showing that the adversarially learned HHMM generates the most diverse set of motion sequences. Comparing action categories, a more complex action such as boxing usually yields a lower SSIM; from this point of view, the HHMM generates data consistent with the training set. A method producing a high SSIM value, e.g., TSBN on Berkeley's boxing, indicates overfitting to some training instances and a failure to generate diverse synthetic data.
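The diversity measure described above can be sketched with a single-window SSIM over the whole feature-time "image". This is a simplification: the standard SSIM of (Wang et al. 2004) averages over local windows, and the constants here are illustrative.

```python
import numpy as np

def ssim_global(x, y, c1=1e-4, c2=9e-4):
    # Single-window SSIM over the whole array: luminance term times a
    # combined contrast/structure term, as in the SSIM definition.
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def diversity_score(synthetic, training):
    # For each synthetic sequence keep its largest SSIM against any training
    # sequence; a lower average indicates more diverse synthetic samples.
    return float(np.mean([max(ssim_global(s, t) for t in training)
                          for s in synthetic]))
```

By construction the score is 1 when every synthetic sequence duplicates a training sequence, and it decreases as the generated set moves away from memorized instances.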
Qualitative results: Figure 4 shows examples of synthetic sequences for different actions, where different rows show different samples drawn from the same motion category. Note that, due to the random nature of the hidden state transitions, the sampling process may not always generate meaningful motion in terms of pose change. We therefore use SSIM as a reference and select generated sequences whose largest SSIM is above a threshold. On one hand, the different motion sequences are clearly distinguishable, indicating that the data look physically meaningful and realistic. On the other hand, the sequences show various motion styles, which shows that the HHMM can generate different variations of the same action.
Conclusion

In this paper, we enhanced the HMM through a Bayesian hierarchical framework to improve its capability in modeling dynamics under intra-class variation. We proposed a novel learning method that trains the HHMM under an adversarial

(a) Walking (b) Running
(c) Boxing (d) Jumping

Figure 4: Synthetic motion sequences. Each row is a uniformly downsampled skeletal sequence from one synthetic action. Different rows are different samples.

objective, which has shown promising results in data generation applications compared to conventional maximum likelihood learning. Through both quantitative and qualitative evaluations, we showed that the learned model captures the dynamic process of human motion data well and can generate realistic motion sequences with intra-class variation. For future work, we plan to introduce higher-order dependency structure to better capture long-term dependencies. We are also interested in training with different types of actions together instead of fitting one model at a time.
Acknowledgment

This work is partially supported by the Cognitive Immersive Systems Laboratory (CISL), a collaboration between IBM and RPI, and also a center in IBM's Cognitive Horizon Network (CHN).
References

Bayer, J., and Osendorfer, C. 2014. Learning stochastic recurrent networks. arXiv.
Brand, M., and Hertzmann, A. 2000. Style machines. In SIGGRAPH. ACM.
Brand, M.; Oliver, N.; and Pentland, A. 1997. Coupled hidden Markov models for complex action recognition. In CVPR.
Brand, M. 1999. Voice puppetry. In SIGGRAPH.
Cappe, O.; Buchoux, V.; and Moulines, E. 1998. Quasi-Newton method for maximum likelihood estimation of hidden Markov models. In ICASSP.
Chung, J.; Kastner, K.; Dinh, L.; Goel, K.; Courville, A. C.; and Bengio, Y. 2015. A recurrent latent variable model for sequential data. In NIPS.
CMU. CMU mocap database. http://mocap.cs.cmu.edu/.
Dempster, A. P.; Laird, N. M.; and Rubin, D. B. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society.
Fine, S.; Singer, Y.; and Tishby, N. 1998. The hierarchical hidden Markov model: Analysis and applications. Machine Learning.
Fraccaro, M.; Sønderby, S. K.; Paquet, U.; and Winther, O. 2016. Sequential neural models with stochastic layers. In NIPS.
Fragkiadaki, K.; Levine, S.; Felsen, P.; and Malik, J. 2015. Recurrent network models for human dynamics. In ICCV.
Gan, Z.; Li, C.; Henao, R.; Carlson, D. E.; and Carin, L. 2015. Deep temporal sigmoid belief networks for sequence modeling. In NIPS.
Gauvain, J.-L., and Lee, C.-H. 1994. Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. TSAP.
Ghahramani, Z.; Jordan, M. I.; and Smyth, P. 1997. Factorial hidden Markov models. Machine Learning.
Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In NIPS.
Johnson, M.; Duvenaud, D. K.; Wiltschko, A.; Adams, R. P.; and Datta, S. R. 2016. Composing graphical models with neural networks for structured representations and fast inference. In NIPS.
Joshi, A.; Ghosh, S.; Betke, M.; Sclaroff, S.; and Pfister, H. 2017. Personalizing gesture recognition using hierarchical Bayesian neural networks. In CVPR.
Kalman, R. E., et al. 1960. A new approach to linear filtering and prediction problems. Journal of Basic Engineering.
Krishnan, R. G.; Shalit, U.; and Sontag, D. 2015. Deep Kalman filters. arXiv.
Mittelman, R.; Kuipers, B.; Savarese, S.; and Lee, H. 2014. Structured recurrent temporal restricted Boltzmann machines. In ICML.
Murphy, K. P. 2002. Dynamic Bayesian networks: representation, inference and learning. Ph.D. Dissertation, University of California, Berkeley.
Nguyen, A.; Clune, J.; Bengio, Y.; Dosovitskiy, A.; and Yosinski, J. 2017. Plug & play generative networks: Conditional iterative generation of images in latent space. In CVPR.
Nowozin, S.; Cseke, B.; and Tomioka, R. 2016. f-GAN: Training generative neural samplers using variational divergence minimization. In NIPS.
Odena, A.; Olah, C.; and Shlens, J. 2017. Conditional image synthesis with auxiliary classifier GANs. In ICML.
Ofli, F.; Chaudhry, R.; Kurillo, G.; Vidal, R.; and Bajcsy, R. 2013. Berkeley MHAD: A comprehensive multimodal human action database. In WACV.
Rabiner, L. R. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE.
Radford, A.; Metz, L.; and Chintala, S. 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv.
Reed, S.; Akata, Z.; Yan, X.; Logeswaran, L.; Schiele, B.; and Lee, H. 2016. Generative adversarial text to image synthesis. In ICML.
Saito, M.; Matsumoto, E.; and Saito, S. 2017. Temporal generative adversarial nets with singular value clipping. In ICCV.
Srivastava, N.; Mansimov, E.; and Salakhudinov, R. 2015. Unsupervised learning of video representations using LSTMs. In ICML.
Sutskever, I., and Hinton, G. 2007. Learning multilevel distributed representations for high-dimensional sequences. In AISTATS.
Sutton, C.; McCallum, A.; and Rohanimanesh, K. 2007. Dynamic conditional random fields: Factorized probabilistic models for labeling and segmenting sequence data. JMLR.
Tang, K.; Fei-Fei, L.; and Koller, D. 2012. Learning latent temporal structure for complex event detection. In CVPR.
Taylor, G. W.; Hinton, G. E.; and Roweis, S. T. 2006. Modeling human motion using binary latent variables. In NIPS.
Theis, L.; Oord, A. v. d.; and Bethge, M. 2015. A note on the evaluation of generative models. arXiv.
Tieleman, T., and Hinton, G. 2012. Lecture 6.5-RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning.
Tulyakov, S.; Liu, M.-Y.; Yang, X.; and Kautz, J. 2017. MoCoGAN: Decomposing motion and content for video generation. arXiv.
Vondrick, C.; Pirsiavash, H.; and Torralba, A. 2016. Generating videos with scene dynamics. In NIPS.
Walker, J.; Doersch, C.; Gupta, A.; and Hebert, M. 2016. An uncertain future: Forecasting from static images using variational autoencoders. In ECCV.
Wang, X., and Ji, Q. 2012. Learning dynamic Bayesian network discriminatively for human activity recognition. In ICPR.
Wang, Z.; Bovik, A. C.; Sheikh, H. R.; and Simoncelli, E. P. 2004. Image quality assessment: from error visibility to structural similarity. TIP.
Wang, J. M.; Fleet, D. J.; and Hertzmann, A. 2008. Gaussian process dynamical models for human motion. TPAMI.
Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning.
Xue, T.; Wu, J.; Bouman, K.; and Freeman, B. 2016. Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In NIPS.
Yu, S.-Z. 2010. Hidden semi-Markov models. Artificial Intelligence.