0162-8828 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2019.2911937, IEEE Transactions on Pattern Analysis and Machine Intelligence
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 1
A Novel Dynamic Model Capturing Spatial and Temporal Patterns for Facial Expression Analysis
Shangfei Wang, Zhuangqiang Zheng, Shi Yin, Jiajia Yang, Qiang Ji
Abstract—Facial expression analysis could be greatly improved by incorporating the spatial and temporal patterns present in facial behavior, but these patterns have not yet been utilized to their full advantage. We remedy this via a novel dynamic model, an interval temporal restricted Boltzmann machine (IT-RBM), that is able to capture both universal spatial patterns and complicated temporal patterns in facial behavior for facial expression analysis. We regard a facial expression as a multifarious activity composed of sequential or overlapping primitive facial events. Allen's interval algebra is implemented to portray these complicated temporal patterns via a two-layer Bayesian network. The nodes in the upper-most layer represent the primitive facial events, and the nodes in the lower layer depict the temporal relationships between those events. Our model also captures inherent universal spatial patterns via a multi-value restricted Boltzmann machine in which the visible nodes are facial events, and the connections between hidden and visible nodes model intrinsic spatial patterns. Efficient learning and inference algorithms are proposed. Experiments on posed and spontaneous expression distinction and on expression recognition demonstrate that our proposed IT-RBM achieves superior performance compared to state-of-the-art research due to its ability to incorporate these facial behavior patterns.
Index Terms—interval temporal restricted Boltzmann machine, global spatial and temporal patterns, posed and spontaneous expression distinction, expression category recognition
I. INTRODUCTION
THERE has been a proliferation of research on facial expression analysis recently, since facial expression is a crucial channel for both human-human communication and human-robot interaction.
Current works on facial expression recognition may be categorized into one of two approaches: frame-based or sequence-based. A frame-based approach recognizes facial expressions from static facial images, usually from the manually annotated apex frame. This approach completely disregards the important dynamic patterns inherent in facial behavior. A sequence-based approach relies on the whole image sequence, and thus has the potential to model both spatial and temporal patterns through features or dynamic classifiers. Current works either employ hand-crafted spatial and temporal features or use representations learned through deep networks. Several dynamic classifier models, such as hidden Markov models (HMMs), dynamic Bayesian networks (DBNs), latent conditional random fields (LCRFs), long short-term memory networks (LSTMs), or gated recurrent unit networks (GRUs), are frequently used. All of these works try to find more discriminative features or more powerful classifiers to explore embedded spatial and temporal patterns, and they have been successful for facial expression analysis. We refer to these approaches as feature-driven methods.

Shangfei Wang is the corresponding author. Shangfei Wang, Zhuangqiang Zheng, Shi Yin and Jiajia Yang are with the School of Computer Science and Technology, University of Science and Technology of China, 230027, Hefei, China. E-mail: [email protected]; [email protected]; [email protected]; [email protected]

Qiang Ji is with the Department of Electrical, Computer, and Systems Engineering, Rensselaer Polytechnic Institute, NY 12180-3590. E-mail: [email protected]
Few works consider the underlying anatomic mechanisms governing facial muscular interactions. Nearly any facial expression can be deconstructed into the contraction or relaxation of one or more facial muscles. These facial muscle movements interact in space and time to convey different expressions. At each time slice, facial muscle motions may co-occur or be mutually exclusive. For example, as shown in Figure 1(a) and Figure 1(b), most people raise the inner brow and outer brow simultaneously, since both motions are related to the frontalis muscle group. The lip corner puller rarely occurs in tandem with the lip corner depressor, as shown in Figure 1(c) and Figure 1(d): the lip corner puller uses the zygomaticus major muscle group, while the latter is produced by the depressor anguli oris muscle group. Temporally, the movement of one facial muscle can activate, meet, overlap, or succeed another muscle. As shown in Figure 2, for example, most people show happiness by stretching their mouths while raising their cheeks. Moreover, the contraction of the zygomatic major is more likely to occur asymmetrically if a smile is posed rather than spontaneous. When an expression is natural and spontaneous, its trajectory is typically smoother, its duration is shorter, and its onset is gradual rather than immediate. Such spatial and temporal patterns caused by the interaction of facial muscles are extremely complex, time-dependent, and global, yet have not been fully modeled by current facial expression analysis methods.
We propose a novel dynamic model that leverages the complex spatial and temporal patterns caused by the underlying anatomic mechanism for expression analysis. We assume an expression is a multifarious activity made up of sequential or overlapping primitive facial events, and that each event takes place over a certain interval of time. First, we introduce Allen's interval algebra to capture several types of temporal relationships, including A takes place before B, A meets B, A overlaps with B, A initiates B, A occurs during B, A finishes B, A is equal to B, and the inverses of the first six relations. We implement the complex temporal relations using a Bayesian network incorporating primitive facial event nodes and temporal relationship nodes. The links connecting the two types of nodes characterize their temporal relationships. Next,
Fig. 1. Sample images demonstrating spatial patterns inherent in expressions: (a) neutral frame image of surprise; (b) peak frame image of surprise; (c) peak frame image of happy; (d) peak frame image of sad.
Fig. 2. Image sequences (posed smile vs. spontaneous smile) demonstrating temporal patterns inherent in expressions. The x-axis is the frame number (frames 1, 18, 47, 94, and 111 shown).
a restricted Boltzmann machine (RBM) is adopted to represent the global spatial patterns among primitive facial events. The visible nodes of the RBM depict primitive facial events, and the connections between hidden nodes and visible nodes model the spatial patterns inherent in expressions. During training, we build an IT-RBM model for each type of expression, and the parameters and structures of the proposed IT-RBM are learned through maximum likelihood. During testing, a sample is assigned the label of the model with the largest log likelihood.
The proposed IT-RBM differs from other dynamic models in that it introduces Allen's interval algebra to capture all 13 temporal relations. Unlike current dynamic models, which are limited to a time-slice structure and must assume stationary and time-independent temporal relations, the suggested model can capture more complex global temporal relationships.
The paper is organized as follows. Section II is a brief review of related works on expression analysis, including expression recognition as well as posed and spontaneous expression distinction. Section III details the proposed IT-RBM model. Section IV outlines the experiments and analysis, with posed and spontaneous expression distinction experiments outlined in Section IV-A and expression recognition experiments detailed in Section IV-B. Section V summarizes our work.
II. RELATED WORK
A. Posed and Spontaneous Expression Distinction
Inner feelings may be disguised with a posed expression, but true emotions are conveyed via spontaneous expressions. It is difficult to distinguish one from the other, since expressions vary by subject and condition and the differences between a spontaneous and a posed expression are subtle. The inherent spatial and temporal patterns in facial expressions can be leveraged to improve distinction between these similar types of expressions.
Behavioral research has found slight but distinctive differences between the temporal and spatial patterns of posed and spontaneous expressions. Examples of temporal patterns include the speed, trajectory, amplitude, and duration of expression onset and offset. For example, Ekman et al. [1][2] found that spontaneous expressions usually have a smoother trajectory and shorter duration than posed expressions. Schmidt et al. [3] revealed that for posed smiles, the maximum speed of movement onset is greater than it is for spontaneous smiles. Deliberate eyebrow raises are shorter in duration and have a greater maximum speed and amplitude than spontaneous, natural eyebrow raises. Spatial patterns mainly consist of the movements of facial muscles. For example, Ekman et al.'s work [1] found that the orbicularis oculi only contract during spontaneous smiles. When smiling, the contraction of the zygomatic major muscle is more likely to be asymmetric for a posed expression than a spontaneous one [4]. Ross and Pulusu [5] indicated that posed expressions typically commence on the right side of the face, while spontaneous expressions originate on the left side; this is especially true for upper facial expressions. Namba et al.'s work [6] compared the morphological and dynamic properties of spontaneous and posed facial expressions as they related to surprise, amusement, disgust, and sadness. For amusement, AUs yield no significant differences. For disgust, AU10 and AU12 occur more frequently when an expression is spontaneous rather than posed, while AU17 appears more often in posed expressions. For sadness, morphological properties specific to spontaneous facial expressions are not observed, while AU4, AU7, and AU17 are most frequently observed in posed facial expressions.
Most research uses certain features to distinguish between posed and spontaneous expressions. Cohn and Schmidt [7] adopted temporal features, including duration, amplitude, and the ratio between the two. Valstar [8] utilized features such as speed, duration, trajectory, intensity, symmetry, and the occurrence order of brow actions based on fiducial facial point displacement. Dibeklioglu et al. [9] described the dynamics of eyelid, cheek, and lip corner movements using amplitude, duration, speed, and acceleration. Seckington [10] represented temporal dynamics using six features (i.e., morphology, apex overlap, symmetry, total duration, onset speed, and offset speed).
Static classifiers (e.g., linear discriminant classifiers [7], support vector machines [11], k-NN [12], and naive Bayesian classifiers [12]) and dynamic classifiers (e.g., continuous hidden Markov models [12] and dynamic Bayesian networks [10]) were investigated for the task of distinguishing between
posed and spontaneous expressions. Static classifiers model the mapping between features and expression types, while dynamic classifiers model the temporal relationships.
Progress has been made in distinguishing between posed and spontaneous expressions. However, these feature-driven methods do not explicitly leverage the underlying interactions between facial muscles and their influence on posed and spontaneous expressions.
Recently, Wang et al. [13] proposed a model-based method using multiple Bayesian networks (BNs) to capture spatial patterns of expressions given gender and expression categories. This model only includes local dependencies due to the first-order Markov assumption of BNs; it cannot capture high-order or global relations. Wu et al. [14] proposed to address this issue by implementing a restricted Boltzmann machine (RBM) to explicitly model complex joint distributions over feature points. RBMs introduce a layer of latent units, allowing them to model high-order dependencies among variables [15]. Although this model is an improvement, it does not leverage the dependencies among hidden units. Quan et al. [16] employed a latent regression Bayesian network (LRBN) to leverage higher-order and global dependencies among facial features. A latent regression Bayesian network differs from an RBM in that it is a directed rather than undirected model. The "explaining away" effect in Bayesian networks allows LRBNs to capture dependencies among both latent and visible variables; these dependencies are vital to accurately represent the data. The success of each of these three model-based works proves that spatial patterns can contribute to the differentiation of posed from spontaneous expressions.
Thus far, there have not been many attempts to capture and leverage the spatial and temporal patterns embedded in posed and spontaneous facial expressions. We propose an interval temporal restricted Boltzmann machine (IT-RBM) to jointly capture these global and complex patterns and improve the task of expression distinction.
B. Expression Recognition
Expression recognition has attracted much more attention than the distinction between posed and spontaneous expressions. Corneanu et al.'s work [17] and Martinez et al.'s work [18] provided literature reviews of facial expression recognition.
Mainstream facial expression recognition works regard facial expression recognition as a pattern recognition problem and focus primarily on discriminative features and powerful classifiers. For features, both engineered dynamic representations and representations learned from video volumes are exploited to encode temporal variations among sequence frames. Engineered dynamic representations such as LBP-TOP [19] and Gabor motion energy [20] do not require labelled sequences for training, and thus are simple and generic for any expression analysis task; however, their optimality is questionable. Learned representations may attain state-of-the-art performance, but they require many training videos with ground-truth labels. Dynamic graphical models, such as HMMs [21][22][23] and DBNs [24], have commonly been used for facial expression analysis tasks. As time-slice (time-point based) graphical models, these dynamic models represent each activity as a sequence of events occurring instantaneously, and thus offer only three time-point relations (i.e., before, follows, and equals). Since facial expressions are complex and consist of facial events that may be sequential or temporally overlapping, current dynamic graphical models are unable to represent several of the temporal relations occurring between events throughout the activity. Recently, deep dynamic models such as LSTM [25] have been adopted for facial expression analysis. Usually, a convolutional neural network (CNN) is used to obtain static representations from each frame, and then the learned static representations are fed into the LSTM to learn dynamic representations and expression classifiers simultaneously. In spite of its good performance, LSTM requires a lot of training data. Furthermore, LSTM is also a time-slice model and cannot successfully represent the global and complex temporal relations between primitive facial events inherent in facial expressions. Just as in expression distinction, these feature-driven expression recognition methods ignore the underlying interactions among facial muscles.
A facial expression is defined as at least one motion of the facial muscles over a period of time. These muscle movements commonly appear in certain patterns to communicate different expressions. For example, the facial expression of happiness is characterized by raised cheeks and a stretched mouth. Surprise is usually displayed by widened eyes and a gaping mouth. A look of sadness is easily identified by upwardly slanted eyebrows and a frown. An expression of anger is often determined by squeezed eyebrows as well as tight and straight eyelids. Fear typically includes widened eyes and eyebrows slanted upward. These expression-dependent temporal and spatial patterns are essential for expression recognition, but have yet to be exploited thoroughly.
As far as we know, only one related work attempts to capture these patterns for expression recognition. Wang et al. [26] suggested an interval temporal Bayesian network (ITBN) including temporal entity nodes and temporal relation nodes. The links between the temporal entity nodes represent spatial dependencies among temporal entities. Links joining temporal relation nodes with their corresponding temporal entities represent the temporal relationships between the two connected temporal entities. Thus, the ITBN is able to leverage spatial and temporal patterns. Due to the Markov assumption, however, Bayesian networks only capture local dependencies. Therefore, instead of a BN, we employ an RBM to capture and depict global spatial patterns. Since Allen's interval algebra defines complete temporal relations between two events and a BN can fully capture dependencies between two events, we still employ a BN to model the temporal patterns embedded in expression changes.
The proposed IT-RBM is a novel dynamic model that can capture complex and global relations through the use of interval algebra, which defines complete temporal relations between two events. This is an improvement over typical dynamic models like HMMs and DBNs, which use a time-slice structure to represent three time-point relations and are only able to capture stationary dynamics.
This paper makes the following contributions to this field of study:
1. A novel dynamic model, IT-RBM, is proposed to jointly capture both global spatial patterns and complex temporal patterns.
2. We explicitly model the spatial-temporal patterns innate to various expression categories for expression recognition.
3. We explicitly model the spatial-temporal patterns found in posed and spontaneous expressions to better distinguish between those expressions.
A previous version of this paper appeared as Yang et al.'s work [27], which proposed an IT-RBM to capture and utilize the spatial and temporal patterns embedded in posed and spontaneous expressions for expression distinction. Unlike the previous version, which only focuses on posed and spontaneous expression distinction, this paper extends the proposed IT-RBM to expression recognition. To show the effectiveness of the proposed IT-RBM for expression recognition, experiments are conducted on the CK+ and MMI databases. We have also added two models for posed and spontaneous expression distinction (i.e., the PS gender model and the PS exp model), since the spatial and temporal patterns embedded in expressions are influenced by gender and type of expression. For the PS gender model, we train four models from male posed, male spontaneous, female posed, and female spontaneous samples. For the PS exp model, we train a posed model and a spontaneous model for each expression type.
III. PROPOSED METHOD
Facial expressions are the results of a set of muscle movements over a period of time. At each time slice, facial muscle motions can co-occur or be mutually exclusive. From a temporal perspective, the movement of one facial muscle can activate, overlap, or follow the movement of another muscle. Because of the difficulty of measuring minute facial muscle motions, the movements of facial feature points are used to define primitive facial events, as recommended by Wang et al.'s work [26]. Each feature point movement is a singular primitive facial event. The interval relation between each pair of events can be defined as one of the 13 interval relations of Allen's interval algebra [28]. First, we select the primitive event pairs with the largest interval relation variance among the different expressions. For each type of expression, an IT-RBM model is constructed using the selected events and interval relations. The global spatial and temporal patterns are jointly captured during training. During testing, a sample is assigned the label of the model with the largest likelihood. The method framework is illustrated in Figure 3.
This section focuses first on the extraction of primitive facial events. Then, we describe the definition and selection of temporal relations. After that, the proposed IT-RBM model is presented in detail.
A. Primitive facial events extraction
Fig. 3. Outline of the recognition system: primitive events are extracted from the facial expression video database, primitive event pairs and temporal relations are selected, and C IT-RBM models (with gender or expression labels) are trained; in the test phase, the test primitive events are scored by the likelihood of each model.

Given the data set of sample videos with different types of expression, denoted as D = {(x(1), y(1)), (x(2), y(2)), ..., (x(N), y(N))}, where x(i) is the ith video with frame length fi, y(i) ∈ {0, 1, ..., C} is its expression label (C is the total number of expression classes), and N is the total number of sample videos. Each frame is a facial image with Pno facial points. Each video is assumed to contain primitive facial events, which are either sequential or temporally overlapping. A primitive facial event is the movement of one feature point and includes the motion state, the commencement time when the feature point is no longer in the neutral position, and the moment when the point returns to neutral. Figure 4 depicts a primitive event corresponding to the ith facial point, denoted as Vi = 〈tsi, tei, vi〉 (tsi, tei ∈ R, tsi < tei, vi ∈ {1, 2, ..., K}), where tsi and tei represent the start and end times respectively, vi represents the motion state, and {1, 2, ..., K} is the set of all possible primitive event states. As expression videos differ in frame length, we normalize all videos to the shortest frame length len in the training set: samples are equidistantly down-sampled to len frames. We obtain the K movement states by applying K-means clustering to the feature point displacement sequences.
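The start and end times that delimit a primitive event can be sketched as follows. This is a minimal illustration, not the paper's implementation: `extract_event` and the neutral-position threshold `eps` are hypothetical names, since the paper does not specify how departure from the neutral position is thresholded.

```python
def extract_event(displacements, eps=0.5):
    """Extract the interval <ts, te> of a primitive facial event from one
    feature point's displacement trajectory (distance from its position in
    the neutral frame).

    ts: first frame where the point leaves the neutral position (|d| > eps);
    te: first frame after ts where it returns to neutral (or the last frame).
    Returns None if the point never leaves neutral.
    """
    ts = next((t for t, d in enumerate(displacements) if abs(d) > eps), None)
    if ts is None:
        return None
    te = next((t for t in range(ts + 1, len(displacements))
               if abs(displacements[t]) <= eps), len(displacements) - 1)
    return ts, te
```

For instance, a point that rises and falls back to neutral yields one interval, while a point that never moves yields no event at all.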
Figure 4 illustrates some example primitive facial events. In (a), facial points P1 and P2 correspond to events V1 and V2, representing the muscle motions of the right wing of the nose and the right mouth corner respectively. (b) shows the traces of V1 and V2 along the vertical direction, where T1 and T2 are their corresponding durations. Each event has K possible states representing its movement pattern throughout the duration, as (c) shows. The flat red line depicts a point that remains in neutral for the entire process, and the other states represent the K − 1 movement patterns. For example, a point that moves up and then returns to neutral would be represented by state Sm (shown as the dotted black line). A more complex pattern is depicted by state S2 (the solid line), in which a point moves upward and then downward.
B. Temporal relations definition and selection
According to Allen’s interval algebra [28], there are 13potential temporal relationships between two primitive eventsas illustrated in Table I. The 13 possible relations I ={b, bi,m,mi, o, oi, s, si, d, di, f, fi, eq}, representing before,meets, overlaps, starts, during, finishes, equals and their in-verses. The temporal relationships between pairs of facialevents Vi and Vj can be obtained by calculating the temporaldistance dis(Vi, Vj) according to Eq.1.
Fig. 4. (a) Facial muscle movement as captured by the movement of facial points P1 and P2. (b) Durations T1 and T2 of events E1 and E2 and their temporal relation (R12: E1 during E2). (c) Example movement states S1, S2, ..., Sm of a primitive facial event.
dis(Vi, Vj) = [tsi − tsj , tsi − tej , tei − tsj , tei − tej ] (1)
TABLE I
TR AND INTERVAL RELATION MAPPING TABLE

No | TR | tsi − tsj | tei − tej | tsi − tej | tei − tsj
 1 | b  |  < 0 |  < 0 |  < 0 |  < 0
 2 | bi |  > 0 |  > 0 |  > 0 |  > 0
 3 | d  |  > 0 |  < 0 |  < 0 |  > 0
 4 | di |  < 0 |  > 0 |  < 0 |  > 0
 5 | o  |  < 0 |  < 0 |  < 0 |  > 0
 6 | oi |  > 0 |  > 0 |  < 0 |  > 0
 7 | m  |  < 0 |  < 0 |  < 0 |  = 0
 8 | mi |  > 0 |  > 0 |  = 0 |  > 0
 9 | s  |  = 0 |  < 0 |  < 0 |  > 0
10 | si |  = 0 |  > 0 |  < 0 |  > 0
11 | f  |  > 0 |  = 0 |  < 0 |  > 0
12 | fi |  < 0 |  = 0 |  < 0 |  > 0
13 | eq |  = 0 |  = 0 |  −   |  −
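Given Eq. 1 and Table I, the mapping from an interval pair to its Allen relation can be sketched in Python. `allen_relation` is a hypothetical helper, not from the paper; the sign-tuple keys follow the rows of Table I, with 0 standing for "= 0" (for `eq`, the two unconstrained columns are fixed to the values implied by tsi < tei).

```python
def sign(x):
    """Return -1, 0, or 1, matching the <0 / =0 / >0 columns of Table I."""
    return (x > 0) - (x < 0)

# Table I: Allen relation keyed by the sign pattern of
# (tsi - tsj, tei - tej, tsi - tej, tei - tsj).
ALLEN = {
    (-1, -1, -1, -1): 'b',  (1, 1, 1, 1):   'bi',
    (1, -1, -1, 1):   'd',  (-1, 1, -1, 1): 'di',
    (-1, -1, -1, 1):  'o',  (1, 1, -1, 1):  'oi',
    (-1, -1, -1, 0):  'm',  (1, 1, 0, 1):   'mi',
    (0, -1, -1, 1):   's',  (0, 1, -1, 1):  'si',
    (1, 0, -1, 1):    'f',  (-1, 0, -1, 1): 'fi',
    (0, 0, -1, 1):    'eq',
}

def allen_relation(vi, vj):
    """Classify the temporal relation of intervals vi = (tsi, tei) and
    vj = (tsj, tej) by the sign pattern of the distance vector of Eq. 1."""
    (tsi, tei), (tsj, tej) = vi, vj
    key = (sign(tsi - tsj), sign(tei - tej), sign(tsi - tej), sign(tei - tsj))
    return ALLEN[key]
```

For example, two intervals sharing a single boundary frame map to `m` (meets), while a strictly nested pair maps to `d` (during).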
The extraction of primitive facial events yields Pno ∗ (Pno − 1) pairs of events and the corresponding temporal relations for each sample. It is expected that discriminative temporal relations will have a wider variance between expression types, so we propose a Kullback-Leibler divergence-based score [29] to measure the difference between two probability distributions. The score of event pair Vi, Vj is defined in Eq. 2, where TRij represents the relation between primitive event pair Vi, Vj, Px(TRij) and Py(TRij) are the probability distributions of TRij for expressions x and y respectively, and DKL stands for the KL divergence. Primitive event pairs are ranked by the score, and the top ξ pairs, involving m events, are selected.

Sij = ∑_{x,y∈{1,2,...,C}} (DKL(Px(TRij)‖Py(TRij)) + DKL(Py(TRij)‖Px(TRij)))    (2)
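The inner term of Eq. 2, the symmetric KL divergence for one expression pair (x, y), can be sketched as follows. The smoothing constant `eps` is an assumption of this sketch, since the paper does not say how zero-probability relations are handled.

```python
from math import log

def sym_kl_score(p, q, eps=1e-8):
    """Symmetric KL divergence DKL(p||q) + DKL(q||p) between two discrete
    distributions over the 13 temporal relations (one summand of Eq. 2).
    eps smooths zero-probability relations (an assumed choice)."""
    dkl = lambda a, b: sum(ai * log((ai + eps) / (bi + eps))
                           for ai, bi in zip(a, b))
    return dkl(p, q) + dkl(q, p)
```

Summing this score over all expression pairs (x, y) gives Sij; pairs whose relation distribution barely changes across expressions score near zero and are pruned.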
C. Capturing spatial and temporal patterns through the IT-RBM model

Our proposed hybrid graphical model, the IT-RBM, is shown in Figure 5. The upper section is a multi-value RBM and the lower part is a Bayesian network. The uppermost layer contains n binary latent variables hj ∈ {0, 1} (j ∈ {1, 2, ..., n}). The layer below contains m visible nodes; vi ∈ {1, ..., K} (i ∈ {1, 2, ..., m}) describes the m selected facial events. Each facial event takes one of K motion states represented by a one-hot vector. Specifically, vi consists of binary nodes vi1, vi2, ..., viK; thus vi = k can be represented with a one-hot vector by setting vik = 1 and the other K − 1 binary nodes to zero. The bottom layer contains ξ temporal relation nodes, TR ∈ I, representing the 13 temporal relations. Complex temporal relations are captured by the lower part; the spatial dependencies among facial events are modeled by the upper part. Eq. 3 gives the joint probability of the suggested model.
Fig. 5. An example of the IT-RBM model: n hidden nodes h1, h2, ..., hn are fully connected by weights W to m visible nodes v1, v2, ..., vm, and temporal relation nodes TR12, TR13, ..., TR2m, TR3m are each connected to their corresponding pair of visible nodes.
P(v, TR) = P(TR|v)P(v) = P(TR|v) ∑_h P(v, h)    (3)

where

P(TR|v) = ∏_{r=1}^{R} P(TRr|π(TRr)),    (4)

TRr represents the rth temporal relation node, and π(TRr) are the two primitive event nodes that produce TRr.
After primitive events and temporal relations are extracted, we are given training data Dt = {(v(1), TR(1)), (v(2), TR(2)), ..., (v(Nt), TR(Nt))}, where Nt is the number of training samples for one expression, and v(i) and TR(i) represent the motion states and temporal relations of the ith sample. The goal of model learning is log likelihood maximization, shown as follows:

θ∗ = argmax_θ (1/Nt) ∑ (log P(v; θ) + log P(TR|v; θ))    (5)

Eq. 5 shows that the log likelihood of the IT-RBM factorizes into the sum of the log likelihood of the RBM and the log likelihood of the BN. Since the RBM parameters θRBM are independent of the BN parameters θBN, we can train the RBM and the BN separately. Training of the multi-value RBM only concerns the motion states of primitive events, so we denote DtRBM = {v(1), v(2), ..., v(Nt)}. The marginal distribution of the visible units is calculated as Eq. 6,
P(v) = ∑_h P(v, h) = (1/Z) ∑_h e^{−E(v,h)} = (1/Z) e^{∑_{i=1}^{m} ∑_{k=1}^{K} bik vik} ∏_{j=1}^{n} (1 + e^{aj + ∑_{i=1}^{m} ∑_{k=1}^{K} wjik vik})    (6)
where E is the energy function of the multi-value RBM, defined in Eq. 7. {W, a, b} are the model parameters: wjik is a symmetric interaction term between visible unit i taking value k and hidden unit j, bik is the bias of visible unit i taking value k, and aj is the bias of hidden unit j.
E(v, h) = − ∑_{i=1}^{m} ∑_{j=1}^{n} ∑_{k=1}^{K} vik wjik hj − ∑_{i=1}^{m} ∑_{k=1}^{K} vik bik − ∑_{j=1}^{n} hj aj    (7)

The gradient with respect to θRBM = {W, a, b} can be calculated as Eq. 8, where P(v, h) and P(h|v) denote the model-defined distributions; v in the first term is from the training set, and v in the second term is sampled from the model-defined distribution P(v).
∆θRBM = ε ∂log P(v)/∂θRBM = ε (−∑_h P(h|v) ∂E(v, h)/∂θRBM + ∑_{v,h} P(v, h) ∂E(v, h)/∂θRBM)    (8)
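As a concrete check of Eqs. 6 and 7, the energy and the unnormalized marginal of the multi-value RBM can be sketched as below. The nested-list parameter layout (`W[j][i][k]`, `b[i][k]`, `a[j]`) is an assumed convention of this sketch, with each visible event stored as its state index rather than an explicit one-hot vector; summing e^{−E(v,h)} over all h must agree with the closed form of Eq. 6 up to the constant Z.

```python
from math import exp, log

def neg_energy(v, h, W, a, b):
    """-E(v, h) from Eq. 7, where v[i] = k encodes the one-hot setting
    v_ik = 1. W[j][i][k], a[j], b[i][k] is the assumed parameter layout."""
    m, n = len(v), len(h)
    return (sum(W[j][i][v[i]] * h[j] for j in range(n) for i in range(m))
            + sum(b[i][v[i]] for i in range(m))
            + sum(a[j] * h[j] for j in range(n)))

def log_unnorm_marginal(v, W, a, b):
    """log of the unnormalized marginal in Eq. 6, i.e. log(Z * P(v)):
    sum_ik b_ik v_ik + sum_j log(1 + exp(a_j + sum_ik w_jik v_ik))."""
    m = len(v)
    return (sum(b[i][v[i]] for i in range(m))
            + sum(log(1 + exp(a[j] + sum(W[j][i][v[i]] for i in range(m))))
                  for j in range(len(a))))
```

Because the hidden units are binary and conditionally independent, marginalizing h factorizes into the product over j in Eq. 6, which is why the closed form needs no explicit sum over the 2^n hidden configurations.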
The contrastive divergence (CD) algorithm is used to overcome the challenge of computing the second term of Eq. 8, which is intractable but needed for the gradient calculation [30]. The conditional distribution of visible nodes given hidden nodes and the conditional distribution of hidden nodes given visible nodes are a softmax function and a logistic function respectively, as follows:
P(vi = k|h) = exp(bik + ∑_{j=1}^{n} hj wjik) / ∑_{l=1}^{K} exp(bil + ∑_{j=1}^{n} hj wjil)    (9)

P(hj = 1|v) = σ(aj + ∑_{i=1}^{m} ∑_{k=1}^{K} wjik vik)    (10)
The detailed algorithm for learning the multi-value RBM is shown as Algorithm 1.

Algorithm 1 The training algorithm for multi-value RBM using CD learning

Require: training data D_t^RBM = {v^(1), v^(2), ..., v^(N_t)}, number of latent nodes n, learning rate ε, maximum number of training iterations T
Ensure: w_jik, a_j, b_ik
1: Initialize: set w, a, b to small random values
2: for t = 1, 2, ..., T do
3:   sample one example v from D_t^RBM
4:   for j = 1, 2, ..., n do
5:     Sample h_j ~ p(h_j | v) with Eq. 10
6:   end for
7:   for i = 1, 2, ..., m do
8:     Sample v'_i ~ p(v_i | h) with Eq. 9
9:   end for
10:  parameter update:
11:    w_jik ← w_jik + ε(P(h_j = 1|v) v_ik − P(h_j = 1|v') v'_ik)
12:    b_ik ← b_ik + ε(v_ik − v'_ik)
13:    a_j ← a_j + ε(P(h_j = 1|v) − P(h_j = 1|v'))
14: end for
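The CD learning loop above can be sketched in numpy as follows. This is a minimal one-step-CD illustration under assumed one-hot (m × K) visible configurations; `cd1_train` is a hypothetical name, not from the paper:

```python
import numpy as np

def cd1_train(data, n_hidden, lr=0.05, epochs=50, seed=0):
    """One-step contrastive divergence for a multi-value RBM (sketch of Algorithm 1).

    data : (N, m, K) array of one-hot visible configurations.
    Returns learned parameters W (n, m, K), a (n,), b (m, K).
    """
    rng = np.random.default_rng(seed)
    N, m, K = data.shape
    n = n_hidden
    W = rng.normal(scale=0.01, size=(n, m, K))
    a = np.zeros(n)
    b = np.zeros((m, K))
    for _ in range(epochs):
        v = data[rng.integers(N)]                                # one training example
        # positive phase: P(h_j = 1 | v), Eq. 10
        ph = 1.0 / (1.0 + np.exp(-(a + np.einsum('jik,ik->j', W, v))))
        h = (rng.random(n) < ph).astype(float)                   # sample hidden states
        # negative phase: softmax over the K values of each visible unit, Eq. 9
        logits = b + np.einsum('j,jik->ik', h, W)                # (m, K)
        probs = np.exp(logits - logits.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)
        v2 = np.zeros_like(v)
        for i in range(m):                                       # sample v'_i ~ P(v_i | h)
            v2[i, rng.choice(K, p=probs[i])] = 1.0
        ph2 = 1.0 / (1.0 + np.exp(-(a + np.einsum('jik,ik->j', W, v2))))
        # parameter updates (lines 11-13 of Algorithm 1)
        W += lr * (ph[:, None, None] * v[None] - ph2[:, None, None] * v2[None])
        b += lr * (v - v2)
        a += lr * (ph - ph2)
    return W, a, b

# toy run: 20 samples, m = 3 visible units, K = 4 values each
gen = np.random.default_rng(1)
toy = np.zeros((20, 3, 4))
for s in range(20):
    for i in range(3):
        toy[s, i, gen.integers(4)] = 1.0
W, a, b = cd1_train(toy, n_hidden=5)
```

The update rule uses the expected hidden activations P(h_j = 1 | ·) rather than the sampled binary states in the outer products, as in the algorithm's lines 11 and 13.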
The conditional probability distributions of each temporal relation node TR_ij given its parent nodes v_i and v_j define the parameters of the BN. The structure of the BN and the number of parameters can be determined once the temporal relations are selected. The goal of parameter estimation is to find the maximum-likelihood estimate of the parameters θ_BN given training data D_t = {(v^(1), TR^(1)), (v^(2), TR^(2)), ..., (v^(N_t), TR^(N_t))}, as depicted in Eq. 11.
\theta_{BN}^* = \arg\max_{\theta_{BN}} \sum_{s=1}^{N_t} \log P(TR^{(s)}|v^{(s)}; \theta_{BN})   (11)
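For discrete nodes, the maximum-likelihood estimate in Eq. 11 reduces to normalized co-occurrence counts of each temporal relation given its parent states. A minimal sketch, with hypothetical toy parent states and relation names rather than the paper's actual relation set:

```python
from collections import Counter, defaultdict

def estimate_cpt(samples):
    """MLE of P(TR_ij | v_i, v_j) by counting.

    samples: iterable of ((v_i, v_j), tr) pairs observed in the training data.
    Returns a dict mapping each parent configuration to a distribution over relations.
    """
    counts = defaultdict(Counter)
    for parents, tr in samples:
        counts[parents][tr] += 1            # co-occurrence counts
    return {p: {tr: c / sum(ctr.values()) for tr, c in ctr.items()}
            for p, ctr in counts.items()}   # normalize per parent configuration

# toy data: hypothetical event states and relation labels
data = [((1, 0), 'before'), ((1, 0), 'before'), ((1, 0), 'overlaps'), ((0, 1), 'after')]
cpt = estimate_cpt(data)
```

With the toy data above, P('before' | v_i=1, v_j=0) comes out as 2/3, the fraction of matching observations.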
D. Expression analyses

An IT-RBM model is obtained for each expression after training. During testing, a test sample t is labeled with the class that has the largest log-likelihood value, according to Eq. 12, where y* represents the predicted label and C is the number of expression categories (as well as the number of IT-RBM models).
y^* = \arg\max_{y \in \{1, \dots, C\}} \{\log P(t|\theta_y)\}   (12)
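The decision rule of Eq. 12 is an argmax over per-class log-likelihoods. In the minimal sketch below, `log_likelihoods` stands in for the scores log P(t | θ_y) produced by the C trained per-class models; the class names and numbers are hypothetical:

```python
def classify(log_likelihoods):
    """Eq. 12: pick the class whose model assigns the test sample the highest
    log-likelihood. log_likelihoods: dict mapping label y -> log P(t | theta_y)."""
    return max(log_likelihoods, key=log_likelihoods.get)

# hypothetical per-class scores for one test sample
scores = {'happy': -12.3, 'sad': -15.1, 'surprise': -11.8}
print(classify(scores))   # 'surprise' has the largest log-likelihood
```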
The log-likelihood that the IT-RBM trained on class y assigns to test sample t is as follows:

\log P(t|\theta_y) = \log\Big(\sum_h \exp(-E(t, h; \theta_y))\Big) - \log Z(\theta_y) + \log P(TR|t; \theta_y)   (13)
in which the first and third terms can be directly calculated, while the partition function Z is intractable. An extended AIS method inspired by annealed importance sampling (AIS) [31] is used to compute the partition function of the multi-value RBM.
AIS approximates the ratio of the partition function of the objective RBM to that of the base-rate RBM. For example, suppose there are two multi-value RBMs with parameters θ_A = {W^A, b^A, a^A} and θ_B = {W^B, b^B, a^B}. These RBMs define probability distributions P_A and P_B over the same v ∈ {0, 1, ..., K}^m, with h^A ∈ {0, 1}^{n_A} and h^B ∈ {0, 1}^{n_B}.
First, the sequence of intermediate distributions for τ = 0, ..., n is defined as:

P_\tau(v) = \frac{P_\tau^*(v)}{Z_\tau} = \frac{1}{Z_\tau} \sum_h \exp(-E_\tau(v, h))   (14)
where the energy function is defined in Eq. 15, P_0(v) = P_A, and P_n(v) = P_B. Eq. 16 gives the unnormalized probability over the visible units, where 0 = β_0 < β_1 < ... < β_τ < ... < β_n = 1.
E_\tau(v, h) = (1 - \beta_\tau) E(v, h^A; \theta_A) + \beta_\tau E(v, h^B; \theta_B)   (15)
P_\tau^*(v) = e^{(1-\beta_\tau) \sum_i \sum_k b_{ik}^A v_{ik}} \prod_{j=1}^{n_A} \Big(1 + e^{(1-\beta_\tau)(\sum_i \sum_k w_{jik}^A v_{ik} + a_j^A)}\Big) \cdot e^{\beta_\tau \sum_i \sum_k b_{ik}^B v_{ik}} \prod_{j=1}^{n_B} \Big(1 + e^{\beta_\tau(\sum_i \sum_k w_{jik}^B v_{ik} + a_j^B)}\Big)   (16)
Next, we establish a Markov chain transition operator T_\tau(v'; v) that leaves P_\tau(v) invariant. Logistic and softmax functions yield the conditional distributions as follows:
P(h_j^A = 1|v) = \sigma\Big((1 - \beta_\tau)\big(a_j^A + \sum_i \sum_k w_{jik}^A v_{ik}\big)\Big)   (17)
P(h_j^B = 1|v) = \sigma\Big(\beta_\tau\big(a_j^B + \sum_i \sum_k w_{jik}^B v_{ik}\big)\Big)   (18)
P(v_i = k|h^A, h^B) = \frac{\exp\Big((1 - \beta_\tau)\big(b_{ik}^A + \sum_{j=1}^{n_A} h_j^A w_{jik}^A\big) + \beta_\tau\big(b_{ik}^B + \sum_{j=1}^{n_B} h_j^B w_{jik}^B\big)\Big)}{\sum_{l=1}^{K} \exp\Big((1 - \beta_\tau)\big(b_{il}^A + \sum_{j=1}^{n_A} h_j^A w_{jil}^A\big) + \beta_\tau\big(b_{il}^B + \sum_{j=1}^{n_B} h_j^B w_{jil}^B\big)\Big)}   (19)
Hidden units hA and hB are stochastically activated usingEq. 17 and Eq. 18. A new sample is drawn using Eq. 19.
Finally, with the initial base-rate parameters θ_A = {0, b^A, 0}, Z_A is calculated in closed form as Eq. 20; Z_B then follows from Eq. 21. We write ω^(i) for the importance weight of the i-th AIS run, defined in Algorithm 2, to keep the notation compact.
Z_A = 2^{n_A} \prod_i \sum_k e^{b_{ik}^A}   (20)
\frac{Z_B}{Z_A} \approx \frac{1}{M_r} \sum_{i=1}^{M_r} \omega^{(i)} = r_{AIS}   (21)
The detailed algorithm is outlined below in Algorithm 2.
Algorithm 2 The AIS algorithm for estimating the partition function Z [31]

Require: base-rate RBM parameters θ_A = θ_0, objective RBM parameters θ_B = θ_1
Ensure: objective RBM's Z_B
1: for i = 1 to M_r do
2:   for β = 0 to 1 do
3:     Generate v_1, v_2, ..., v_τ, ..., v_n using T_τ as follows:
4:     Sample v_1 from P_A = P_0
5:     Sample v_2 given v_1 using T_1
6:     ...
7:     Sample v_τ given v_{τ−1} using T_{τ−1}
8:     ...
9:     Sample v_n given v_{n−1} using T_{n−1}
10:  end for
11:  ω^(i) = [P*_1(v_1)/P*_0(v_1)] [P*_2(v_2)/P*_1(v_2)] ... [P*_τ(v_τ)/P*_{τ−1}(v_τ)] ... [P*_n(v_n)/P*_{n−1}(v_n)]
12: end for
13: r_AIS = (1/M_r) Σ_{i=1}^{M_r} ω^(i)
14: Z_B = Z_A · r_AIS
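Algorithm 2 can be sketched in numpy as follows. The function names are hypothetical and the model is a tiny multi-value RBM with one-hot (m × K) visible states; in the degenerate case W = 0, a = 0 the exact ratio Z_B/Z_A is available in closed form from Eq. 20, which allows a sanity check:

```python
import numpy as np

def log_pstar(v, beta, thA, thB):
    """log of the intermediate unnormalized probability P*_tau(v) in Eq. 16."""
    WA, aA, bA = thA; WB, aB, bB = thB
    la = (1 - beta) * np.sum(bA * v) + np.sum(
        np.log1p(np.exp((1 - beta) * (aA + np.einsum('jik,ik->j', WA, v)))))
    lb = beta * np.sum(bB * v) + np.sum(
        np.log1p(np.exp(beta * (aB + np.einsum('jik,ik->j', WB, v)))))
    return la + lb

def ais_log_ratio(thA, thB, betas, rng):
    """One AIS run: returns log omega^(i) for the ratio Z_B / Z_A (Algorithm 2)."""
    WA, aA, bA = thA; WB, aB, bB = thB
    m, K = bA.shape
    # sample v_1 from the base-rate distribution P_A (independent softmax per unit)
    pv = np.exp(bA - bA.max(axis=1, keepdims=True)); pv /= pv.sum(axis=1, keepdims=True)
    v = np.zeros((m, K))
    for i in range(m):
        v[i, rng.choice(K, p=pv[i])] = 1.0
    logw = 0.0
    for t in range(1, len(betas)):
        # weight update: P*_t(v_t) / P*_{t-1}(v_t)
        logw += log_pstar(v, betas[t], thA, thB) - log_pstar(v, betas[t - 1], thA, thB)
        # transition T_t: Gibbs step through h^A, h^B (Eqs. 17-18) back to v (Eq. 19)
        sA = (1 - betas[t]) * (aA + np.einsum('jik,ik->j', WA, v))
        sB = betas[t] * (aB + np.einsum('jik,ik->j', WB, v))
        hA = (rng.random(len(aA)) < 1 / (1 + np.exp(-sA))).astype(float)
        hB = (rng.random(len(aB)) < 1 / (1 + np.exp(-sB))).astype(float)
        logits = ((1 - betas[t]) * (bA + np.einsum('j,jik->ik', hA, WA))
                  + betas[t] * (bB + np.einsum('j,jik->ik', hB, WB)))
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        v = np.zeros((m, K))
        for i in range(m):
            v[i, rng.choice(K, p=p[i])] = 1.0
    return logw

# sanity check with a closed-form answer: W = 0, a = 0 for both RBMs, so
# Z = 2^n * prod_i sum_k e^{b_ik} (Eq. 20) and the hidden factors cancel in the ratio
rng = np.random.default_rng(0)
m, K, n = 2, 3, 2
thA = (np.zeros((n, m, K)), np.zeros(n), np.zeros((m, K)))
bB = 0.05 * rng.normal(size=(m, K))
thB = (np.zeros((n, m, K)), np.zeros(n), bB)
betas = np.linspace(0.0, 1.0, 51)
logws = [ais_log_ratio(thA, thB, betas, rng) for _ in range(50)]
est = np.log(np.mean(np.exp(logws)))
exact = float(np.sum(np.log(np.exp(bB).sum(axis=1)) - np.log(3.0)))
```

With the nearly identical base and objective models used in the check, the per-run weights stay close to one, so even 50 runs estimate the log-ratio closely.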
IV. EXPERIMENTS
The proposed IT-RBM model can be applied to both posed versus spontaneous expression distinction and expression recognition. To validate it, we therefore conduct experiments on both tasks.
A. Posed and spontaneous expression distinction experiments
1) Experimental Conditions: We use two benchmark databases for the posed and spontaneous expression distinction experiments: the Extended DISFA (DISFA+) database [32] and the SPOS database [33]. The DISFA+ database is composed of 572 posed expression videos and 252 spontaneous expression videos. Disgust, fear, happiness, sadness, and surprise are exhibited by 9 young adults (4 male and 5 female). The SPOS database contains 84 posed expression samples and 150 spontaneous expression samples. It covers the same expression categories as the DISFA+ database, with the addition of anger. Expressions are made by 7 subjects (4 male and 3 female). The data distribution of these databases is shown in Table II.
TABLE II
DATA DISTRIBUTION OF SPOS AND DISFA+

                    SPOS          DISFA+
Expression        P      S       P      S
Anger(An)        14     13       -      -
Disgust(Di)      14     23     163     81
Fear(Fe)         14     32     163     63
Happy(Ha)        14     66      42     18
Sadness(Sa)      14      5     122     54
Surprise(Su)     14     11      82     36
Total            84    150     572    252
We extracted facial feature points from the images to collect the facial events defined in Section III-A. The supervised descent method (SDM) [34] extracts 49 facial feature points for the SPOS database, as seen on the left side of Figure 6. The DISFA+ database provides 68 feature points extracted by the database constructors; we ignore the facial outline and use the interior 49 points, shown on the right side of Figure 6.
Fig. 6. Facial feature points. Left: SPOS, CK+, MMI; right: DISFA+.
We adopt recognition accuracy and F1-score as performance metrics. We use five-fold subject-independent cross-validation on the SPOS database and ten-fold subject-independent cross-validation on the DISFA+ database.
To compare the performance of our method to state-of-the-art research, we conduct expression distinction experiments with five methods. We use our proposed IT-RBM method, which simultaneously captures global spatial patterns and complex temporal patterns. We compare it to the upper layer of the IT-RBM, a multi-value RBM modelling high-order spatial patterns only. The third method is HMM, a popular dynamic model capturing local temporal patterns. The fourth and fifth methods are LSTM and GRU. The first three methods are generative models, while the last two are discriminative models. The displacements of feature points are used as features for all five methods.
For experiments with LSTM and GRU, we adopt Princi-pal Component Analysis (PCA) to further reduce feature
dimension of the landmark displacements of consecutive frames. We then obtain time series of length T as input to the LSTM and GRU. Due to the small data size, both the LSTM and GRU have only one layer of hidden units. The hidden states of the LSTM and GRU are then fed into a fully-connected network to classify expressions. For cross-validation on the DISFA+ and SPOS databases, one fold from the training set is used as a validation set for parameter selection; these conditions are the same as those used for the proposed IT-RBM. A grid search strategy is used for hyperparameter selection. Specifically, for feature dimension reduction by PCA, we obtain a certain number of principal components by setting cumulative variance contribution rates in {0.8, 0.85, 0.9, 0.95, 0.99, 0.999, 1}; for the length of the input time series, T ∈ {5, 10, 20, 30, 40, 50}; for the dimension of the hidden states of the LSTM (GRU), n_h ∈ {5, 10, 15, 20, 25, 30, 35, 40}; for the learning rate, ε ∈ {0.001, 0.005, 0.01, 0.05, 0.1}; and for the mini-batch size, b_n ∈ {20, 40, 60, 80, 100}.
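The exhaustive grid search described above can be sketched as follows; `train_and_eval` is a hypothetical stand-in for training an LSTM/GRU on the training folds and scoring it on the held-out validation fold (it is not defined in the paper):

```python
from itertools import product

# candidate values listed in the text
pca_ratios = [0.8, 0.85, 0.9, 0.95, 0.99, 0.999, 1]
seq_lengths = [5, 10, 20, 30, 40, 50]            # T
hidden_dims = [5, 10, 15, 20, 25, 30, 35, 40]    # n_h
learning_rates = [0.001, 0.005, 0.01, 0.05, 0.1] # epsilon
batch_sizes = [20, 40, 60, 80, 100]              # b_n

def grid_search(train_and_eval):
    """Exhaustively try every hyperparameter combination, keep the best one."""
    best_score, best_cfg = float('-inf'), None
    for cfg in product(pca_ratios, seq_lengths, hidden_dims,
                       learning_rates, batch_sizes):
        score = train_and_eval(*cfg)             # validation accuracy for this setting
        if score > best_score:
            best_score, best_cfg = score, cfg
    return best_cfg, best_score
```

With a dummy scorer that prefers r = 1 and T = 20, the search returns exactly that configuration, confirming the traversal covers the full grid (7 x 6 x 8 x 5 x 5 = 8400 combinations).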
Since the embedded spatial and temporal patterns are affected by many factors, including gender and type of expression, we add two additional models using the proposed IT-RBM. One is referred to as the PS gender model, in which we train four models: one from male posed samples, one from male spontaneous samples, one from female posed samples, and one from female spontaneous samples. The other is called the PS exp model, in which we train a posed model and a spontaneous model for every expression type. We also examine the IT-RBM model trained on all posed and spontaneous samples, denoted as the PS model.
2) Experimental results and analyses: Table III shows the results of our experiments. From Table III, we observe the following:
Firstly, the proposed IT-RBM achieves higher accuracies and F1-scores than the multi-value RBM. The IT-RBM takes both spatial and temporal patterns into account, while the multi-value RBM models spatial patterns only. The better performance of IT-RBM demonstrates the importance of temporal patterns when distinguishing between posed and spontaneous expressions.
Secondly, the proposed IT-RBM achieves higher accuracies and F1-scores than HMM on both databases. As Table III illustrates, the accuracy of the HMM method is lower than that of IT-RBM by 0.1444 on the DISFA+ database and by 0.1069 on the SPOS database; the F1-scores of the HMM method are lower than those of IT-RBM by 0.0636 and 0.0432, respectively. HMM is a popular dynamic model, but it can only handle three temporal relationships - precedes, follows, or equals. It is also limited to capturing local stationary dynamics because of its first-order Markov and stationary-transition assumptions. The proposed model uses interval algebra to depict 13 complex interval relations, and can model global rather than only local temporal relations. This results in improved distinction between posed and spontaneous expressions.
Thirdly, IT-RBM is superior to LSTM and GRU. Specifically, compared to LSTM, IT-RBM increases the distinction accuracies by 0.0133 and 0.2650 and the F1-scores by 0.0504 and 0.1251 on the DISFA+ and SPOS databases, respectively. Compared to GRU, IT-RBM increases the accuracies by 0.0048 and 0.2522, and the F1-scores by 0.0004 and 0.0851, respectively. Although LSTM and GRU are state-of-the-art discriminative dynamic models, they are still time-slice models. Therefore, they cannot represent the global and complex temporal relations between primitive facial events inherent in facial expressions, as IT-RBM does. Furthermore, recurrent neural networks require larger amounts of data than these databases possess to achieve their best performance.
Figure 7 is a graphic depiction of primitive event pairs and their corresponding temporal relations from the DISFA+ database. Figure 7 (a) displays the 40 selected pairs of events. Points around the eyebrows, eyelids, and lips have the most links, as these areas are crucial for expressions. Just as Ekman et al.'s research [1] [4] showed, the most telling muscles when distinguishing between expressions are the orbicularis oculi and the zygomatic major. Our findings are consistent with that observation.
Figure 7 (b-1) and Figure 7 (b-2) illustrate the temporal relations between points 20 and 29 for posed and spontaneous expressions, respectively. Figure 7 (c-1) and Figure 7 (c-2) are histograms displaying the frequencies of the 13 relations between feature points 20 and 29 for the two expression types. They show that for posed expressions, relations 4 and 12 occur more frequently than relations 3 and 11; the inverse is true for spontaneous expressions. For relations 3 and 11, ts20 − ts29 > 0, which means that event v29 starts before event v20, while for relations 4 and 12, ts20 − ts29 < 0, meaning that event v20 starts before v29, as shown in Table I. This indicates that in a posed expression v20 starts before v29 in most cases, while in a genuine expression v29 is likely to start before v20. Since points 20 and 29 represent the right eye and the left eye respectively, we can conclude that a posed expression is more likely to begin on the right side of the face, while a genuine expression commences on the left side. This corroborates the findings of Ross and Pulusu [5].
Table IV shows the experimental results of the PS model, the PS gender model, and the PS exp model. We make the following observations. First, the PS gender model performs better than the PS model on both databases, with higher accuracies and F1-scores. Specifically, compared to the PS model, the PS gender model increases recognition accuracies by 0.0134 and 0.0085 and F1-scores by 0.0152 and 0.0028 on the DISFA+ and SPOS databases, respectively. This indicates that gender information, available only during training, is useful for capturing the innate spatial and temporal patterns of posed and spontaneous expressions for different genders, and thus improves the distinction task.
Figure 8 graphically depicts the average weights of the different movement patterns to further analyze these spatial and temporal patterns. The x-axis represents the 20 movement patterns on the DISFA+ and SPOS databases; the y-axis represents the average value of the weight w_jik for the kth movement pattern.
TABLE III
RESULTS OF POSED AND SPONTANEOUS DISTINCTION EXPERIMENTS

Database                    DISFA+                                      SPOS
Method       HMM     LSTM    GRU     RBM*    IT-RBM    HMM     LSTM    GRU     RBM*    IT-RBM
Accuracy   0.8046  0.9357  0.9442  0.9211  0.9490    0.7222  0.5641  0.5769  0.7735  0.8291
F1-score   0.8768  0.8900  0.9400  0.9095  0.9404    0.7619  0.6800  0.7200  0.7427  0.8051
* RBM is the proposed multi-value RBM
Fig. 7. (a) Graphical depiction of the temporal relations selected in DISFA+. (b) Examples of the relation between points 20 and 29 for a posed and a spontaneous example. (c) Frequencies of the thirteen relations between points 20 and 29 with respect to posed and genuine expressions; the x-axis represents the index of the relationships.
Fig. 8. The mean weight of the PS gender models for the different movement patterns over all facial points and hidden nodes: (a) DISFA+ PS gender (posed and spontaneous, male vs. female); (b) SPOS PS gender (posed and spontaneous, male vs. female). The x-axis indexes the K movement patterns; the y-axis represents the mean value of w_jik at each k.
TABLE IV
POSED AND SPONTANEOUS DISTINCTION

Database             DISFA+                          SPOS
Model          PS     PS gender  PS exp       PS     PS gender  PS exp
Accuracy    0.9490     0.9624    0.9515     0.8291     0.8376    0.8333
F1-score    0.9404     0.9556    0.9420     0.8051     0.8079    0.8093
The brown bars represent the weights of the PS male model, and the blue bars the weights of the PS female model. From Figure 8, we find that for some movement patterns the weights of the male and female models are either both positive or both negative, while for other movement patterns the weight signs of the male and female models are opposite. This confirms that females and males may display different spatial and temporal patterns. Therefore, the gender information available during training is beneficial for capturing more specific and precise spatial and temporal patterns embedded in posed and spontaneous expressions, and results in better distinction between posed and spontaneous expressions.
Table IV shows that the PS exp model also performs better than the PS model, achieving superior accuracies and F1-scores in most cases. Specifically, compared to the PS model, the accuracy of the PS exp model is 0.0025 higher and the F1-score 0.0016 higher on the DISFA+ database; on the SPOS database, the accuracy improves by 0.0042 and the F1-score by 0.0042. This demonstrates that the expression information, available only during training, is helpful for capturing the inherent spatial and temporal patterns, and thus improves the performance of posed and spontaneous expression distinction.

Fig. 9. On the DISFA+ database, the mean weight of the PS exp models at every selected facial point for a certain movement state, over all hidden nodes of the trained IT-RBMs: (a) posed happy, (b) spontaneous happy, (c) posed surprise, (d) spontaneous surprise, (e) posed sad, (f) spontaneous sad, (g) posed fear, (h) spontaneous fear, (i) posed disgust, (j) spontaneous disgust. The z-axis represents the mean value of w_jik at each k for every facial point.
To analyze the effect of expression type when modeling spatial and temporal patterns, we graphically depict the average weight of the hidden nodes at every selected facial feature point for a certain movement pattern. Figure 9 shows an example of this on the DISFA+ database. Comparing the bar graphs in the left column to those in the right, we find significant differences between the weight distributions of posed and spontaneous expressions. This indicates that the spatial patterns of posed and spontaneous expressions are clearly different. In addition, the W values differ significantly based on expression type; sadness is a good example. Most of the weights of the spontaneous model are negative, while most of the weights of the posed model are positive, indicating that fewer facial events are observed in the spontaneous expression. Namba's [6] research shows similar results, noting that some morphological properties are not observed in spontaneous facial expressions. One possible reason is that the video clip used in this study is too short to elicit visible expressions of sadness from the viewer. This explanation is supported by Ekman et al., who posit that the nature of sadness necessitates a longer-term or more personal experience [35]. For the spontaneous disgust expression, the weights on the lips are positive, while not all of the lip weights are positive for the posed disgust expression. Namba's [6] research showed that AU10 and AU12 were more frequently present in spontaneous disgust. Our finding is consistent with Namba's work [6], corroborating that there are expression-dependent and posed- or spontaneous-dependent differences in AUs. Adding expression information can therefore more precisely depict the detailed patterns inherent in posed and spontaneous expressions.
TABLE V
COMPARISON WITH RELATED WORK ON POSED AND SPONTANEOUS EXPRESSION DISTINCTION ON SPOS

Method                     Accuracy
Cohn et al. [7]             0.7250
Dibeklioglu et al. [9]      0.7875
Dibeklioglu et al. [36]     0.7500
Wu et al. [37]              0.7950
Wu et al. [38]              0.8125
Wang et al. [13]            0.7479
Wang et al. [14]            0.7607
Quan et al. [16]            0.7607
IT-RBM                      0.8291
3) Comparison with related work: For the task of distinguishing posed versus spontaneous expressions, we compare our method to both model-based and feature-driven methods. Most recent model-based methods for expression distinction conducted experiments on the NVIE and SPOS databases. As the NVIE database only provides onset and apex frames for posed expressions, it is not a viable option for the proposed IT-RBM. Instead, we use the SPOS database to compare the performance of the proposed method to current methods, as shown in Table V. The DISFA+ database is relatively new, having opened to researchers in June 2016. To date, no experiments on posed versus spontaneous expression distinction have been performed on it; therefore, we are unable to compare our work to others on this database.
From Table V, we find that state-of-the-art model-based methods do not perform as well as the suggested IT-RBM.
Compared to Wang et al.'s work [14] and Quan et al.'s work [16], which model spatial patterns only, the IT-RBM is able to more fully represent posed and spontaneous expressions by jointly modeling spatial and temporal patterns. This results in superior performance. The proposed method also outperforms Wu et al.'s work [38], the highest-performing feature-driven method. Wu et al. [38] proposed a region-specific texture descriptor that represents local pattern changes in different areas of the face. The temporal phase of each facial region was determined by calculating the intensity of the corresponding facial region. They then used a mid-level SVM fusion strategy to combine the two feature types. By defining discriminative features, their method models the innate spatial and temporal patterns to a certain extent. However, it does not take full advantage of the embedded spatial and temporal patterns as the IT-RBM does via its parameters and structure. Hence, the proposed method achieves superior performance.
B. Experiments and Analyses of Expression Recognition
1) Experimental conditions: Expression recognition experiments are conducted on the extended Cohn-Kanade (CK+) database [39][40] and the MMI database [41]. The CK+ database is composed of 327 posed expression samples collected from 118 subjects. It includes seven expression categories: anger, contempt, disgust, fear, happiness, sadness, and surprise. The image sequences in this database begin at the onset frame and end with the apex frame. Therefore, only three temporal relations, i.e., before, at the same time, and after, exist in the image sequences. The MMI database is updated continuously. As of April 2017, there were 236 sequences labeled with expressions; 208 of those sequences showed the front of the face. We use these 208 image sequences from 31 subjects. There are six expression categories: anger, disgust, fear, happiness, sadness, and surprise. Table VI shows the data distribution of the two databases. SDM is used to extract the 49 facial feature points shown on the left side of Figure 6 [34].
Recognition accuracy is used as the performance metric. We adopt five-fold subject-independent cross-validation on the CK+ database and ten-fold subject-independent cross-validation on the MMI database.
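Subject-independent cross-validation means samples from the same subject never appear in both the training and test folds. A minimal sketch of such a fold assignment, with hypothetical subject IDs standing in for the database annotations:

```python
def subject_independent_folds(subject_ids, n_folds):
    """Assign each sample a fold so that all samples of a subject share one fold.

    subject_ids: one subject label per sample.
    Returns a list with the fold index of each sample.
    """
    subjects = sorted(set(subject_ids))
    fold_of_subject = {s: i % n_folds for i, s in enumerate(subjects)}
    return [fold_of_subject[s] for s in subject_ids]

subjects = ['s1', 's1', 's2', 's3', 's3', 's4', 's5']   # one entry per sample
folds = subject_independent_folds(subjects, n_folds=5)
# all samples of a given subject land in the same fold
assert len({f for s, f in zip(subjects, folds) if s == 's1'}) == 1
```

In practice one would balance the number of samples per fold as well; the round-robin assignment here only illustrates the subject-disjointness constraint.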
As with the posed and spontaneous expression distinction experiments, we conduct expression recognition experiments using five methods: IT-RBM, multi-value RBM, HMM, LSTM, and GRU. For the experiments using HMM, the experimental results listed in [26] are used directly. For the experiments using LSTM and GRU, the same network structure and hyperparameter selection strategy as in the posed and spontaneous expression distinction experiments are used.
2) Experimental results and analyses: The results of our expression recognition experiments are found in Table VII. From Table VII, we make the following observations:
Firstly, compared to HMM [26], the accuracy of IT-RBM is higher by 0.0366 on the CK+ database and by 0.3071 on the MMI database. As described in Section II-B, HMM is a time-slice graphical model and can only capture three time-point
TABLE VI
DATA DISTRIBUTION OF CK+ AND MMI

Expression       CK+    MMI
Anger(An)         45     33
Contempt(Co)      18      -
Disgust(Di)       59     32
Fear(Fe)          25     28
Happy(Ha)         69     42
Sadness(Sa)       28     32
Surprise(Su)      83     41
Fig. 10. Graphical depiction of the selected event pairs in the MMI database.
Fig. 11. Frequencies of the thirteen relations among a pair of events with respect to the different expressions (anger, disgust, fear, happy, sadness, surprise) in MMI. The x-axis represents the index of the relationships.
relations. IT-RBM can not only capture the 13 complex temporal relations defined by Allen's interval algebra, but also capture the complex spatial patterns in facial behavior. Thus IT-RBM achieves better performance than HMM.
Secondly, the proposed IT-RBM outperforms the multi-value RBM, with accuracies higher by 0.0612 on the CK+ database and by 0.0481 on the MMI database. The IT-RBM captures not only global spatial relations but also the temporal patterns embedded in different expressions, while the multi-value RBM can only model the inherent spatial patterns. The IT-RBM leverages these additional temporal patterns for improved expression recognition.
Lastly, the suggested method outperforms both LSTM and GRU. Compared to LSTM, IT-RBM increases the recognition accuracy by 0.0184 on the CK+ database and by 0.2452 on the MMI database. Compared to GRU, IT-RBM increases the accuracy by 0.0092 and 0.2259 on the CK+
TABLE VII
RESULTS OF EXPRESSION CATEGORY RECOGNITION EXPERIMENTS

CK+ (confusion matrices in %; rows are ground-truth classes)

RBM*       An     Co     Di     Fe     Ha     Sa     Su     Acc: 0.8104
An       84.44   4.44   8.89   0.00   0.00   2.22   0.00
Co       11.11  83.33   0.00   5.56   0.00   0.00   0.00
Di       13.56   0.00  76.27   1.69   0.00   5.08   3.39
Fe        0.00   4.00   4.00  68.00  16.00   8.00   0.00
Ha        0.00   1.45   0.00   1.45  97.10   0.00   0.00
Sa       25.00   0.00  17.86   0.00   0.00  57.14   0.00
Su        2.41   1.20   9.64   1.20   2.41   2.41  80.72

IT-RBM     An     Co     Di     Fe     Ha     Sa     Su     Acc: 0.8716
An       91.11   8.89   0.00   0.00   0.00   0.00   0.00
Co        5.56  94.44   0.00   0.00   0.00   0.00   0.00
Di        6.78   0.00  86.44   1.69   0.00   1.69   3.39
Fe        0.00   4.00   0.00  72.00  16.00   8.00   0.00
Ha        0.00   1.45   0.00   1.45  97.10   0.00   0.00
Sa       17.86   0.00   3.57   0.00   0.00  78.57   0.00
Su        2.41   1.20   8.43   1.20   1.20   2.41  83.13

LSTM       An     Co     Di     Fe     Ha     Sa     Su     Acc: 0.8532
An       84.44   2.22   6.67   0.00   0.00   6.67   0.00
Co        5.56  44.44  16.67   0.00   5.56  11.11  16.67
Di        5.08   1.70  81.36   1.69   1.69   1.69   6.78
Fe        0.00   4.00   0.00  84.00   8.00   0.00   4.00
Ha        0.00   1.45   0.00   4.35  94.20   0.00   0.00
Sa       10.71   3.57   3.57   3.57   0.00  78.57   0.00
Su        1.20   3.61   2.41   0.00   0.00   0.00  92.78

GRU        An     Co     Di     Fe     Ha     Sa     Su     Acc: 0.8624
An       82.22   2.22   6.67   0.00   0.00   8.89   0.00
Co        0.00  50.00   0.00   0.00   0.00  11.11  38.89
Di        5.08   5.08  84.74   0.00   0.00   0.00   5.08
Fe        0.00   8.00   0.00  76.00   8.00   0.00   8.00
Ha        0.00   0.00   0.00   2.90  97.10   0.00   0.00
Sa       10.71   7.14   0.00   0.00   0.00  82.14   0.00
Su        0.00   4.82   0.00   0.00   1.20   1.20  92.78

HMM [26]: 0.835    ITBN [26]: 0.863    Elaiwat et al. [43]: 0.9566    Sariyanidi et al. [42]: 0.9602

MMI (confusion matrices in %; rows are ground-truth classes)

RBM*       An     Di     Fe     Ha     Sa     Su     Acc: 0.7740
An       84.85  12.12   0.00   0.00   3.03   0.00
Di        9.38  68.75   3.13   9.38   0.00   9.38
Fe        0.00   3.57  71.43  10.71   3.57  10.71
Ha        2.38   0.00   4.76  83.33   9.52   0.00
Sa        9.38   0.00   3.13   3.13  81.25   3.13
Su        9.76   4.88   7.32   2.44   2.44  73.17

IT-RBM     An     Di     Fe     Ha     Sa     Su     Acc: 0.8221
An       90.91   9.09   0.00   0.00   0.00   0.00
Di        6.25  81.25   0.00   3.13   0.00   9.38
Fe        0.00   3.57  75.00  10.71   3.57   7.14
Ha        2.38   0.00   7.14  83.30   7.14   0.00
Sa        9.38   0.00   0.00   3.13  84.38   3.13
Su        9.76   4.88   2.44   2.44   2.44  78.05

LSTM       An     Di     Fe     Ha     Sa     Su     Acc: 0.5769
An       54.55  21.21   3.03   3.03  12.12   6.06
Di       28.13  40.63   3.13  18.75   0.00   9.38
Fe        3.57  10.71  28.57  14.29  14.29  28.57
Ha        0.00   9.52   9.52  73.81   4.76   2.38
Sa       21.88   9.38  25.00   3.13  40.63   0.00
Su        4.88   4.88  12.20   4.88   9.76  63.41

GRU        An     Di     Fe     Ha     Sa     Su     Acc: 0.5962
An       54.55  21.21   3.03   3.03  12.12   6.06
Di       28.13  40.63   3.13  18.75   0.00   9.38
Fe        3.57  10.71  28.57  14.29  14.29  28.57
Ha        0.00   9.52   9.52  73.81   4.76   2.38
Sa       21.88   9.38  25.00   3.13  40.63   0.00
Su        4.88   4.88  12.20   4.88   9.76  63.41

HMM [26]: 0.515    ITBN [26]: 0.597    Elaiwat et al. [43]: 0.8163    Sariyanidi et al. [42]: 0.7512
and the MMI databases, respectively. This further proves the superiority of our proposed IT-RBM in capturing and leveraging the complex spatial and temporal patterns inherent in expressions for expression recognition.
To demonstrate the validity of the IT-RBM for expression category recognition, Figure 10 graphically depicts all 30 selected event pairs in the MMI database. Figure 10 shows that the selected facial points involve all components of the face. This is reasonable, since there are six expressions in the database and different expressions are related to different facial muscles. Unlike Figure 7(a), in which most links appear on the left side of the face, the distribution of links in Figure 10 is more homogeneous. This may further indicate that the spatial-temporal patterns in posed and spontaneous expressions are not symmetrical between the left and right sides of the face, while the spatial-temporal patterns in different emotion expressions are symmetrical.
Figure 11 shows the frequencies of the 13 relations between feature point 6 and feature point 32. From Figure 11, we find that the frequencies of the 13 interval relations vary greatly across expressions. This indicates that the selected temporal relations provide discriminative information for expression recognition.
3) Comparison with related work: To illustrate the superiority of the proposed IT-RBM, we compare it with the most related work (i.e., ITBN [26]) and state-of-the-art feature-based methods [43], [42]. From Table VII, we have the following findings:
Firstly, compared with ITBN, IT-RBM achieves better performance on both the CK+ and MMI databases. As described in Section II-B, although both ITBN and IT-RBM capture the complex relations defined by Allen's interval algebra, ITBN uses a Bayesian network to model local spatial patterns, while IT-RBM uses an RBM to capture the global spatial patterns inherent in facial behavior. IT-RBM is therefore more successful at capturing spatial patterns than ITBN; its ability to capture the complex spatio-temporal patterns inherent in facial behavior contributes to its superior performance.
Secondly, compared with state-of-the-art feature-based methods, the proposed method achieves the best performance on the MMI database but the worst performance on the CK+ database. On the CK+ database, sequences begin at neutral and conclude at the peak frame. The image sequences encompass only the first half of the expression change, which limits the temporal patterns to just three relations: A precedes B, B precedes A, and A and B commence simultaneously. Therefore, IT-RBM's ability to capture complex temporal patterns cannot be fully exploited, and it performs poorly on the CK+ database. The MMI database captures the whole course of the expression change; IT-RBM can therefore capture the full temporal and spatial patterns in the facial behavior, and it achieves the best performance on the MMI database. Compared with HMM, IT-RBM obtains a moderate improvement on the CK+ database but a significant improvement on the MMI database. This is because both IT-RBM and HMM can capture only three temporal relations on the CK+ database, whereas IT-RBM captures more complex temporal patterns than
HMM on the MMI database. This further demonstrates the importance of capturing the complex temporal relations defined by Allen's interval algebra, and the superiority of the proposed method.
V. CONCLUSION
In this paper, a novel dynamic model called IT-RBM is proposed to jointly capture and leverage embedded global spatial patterns and complex temporal patterns for improved expression analysis. A facial expression is defined as a complex activity made up of sequential or temporally overlapping primitive facial events, which can further be delineated as the motion of feature points. Allen's interval algebra is used to represent the complex temporal patterns via a two-layer Bayesian network, in which the upper-layer nodes represent primitive facial events, the bottom-layer nodes are the temporal relations between facial events, and the links between the two layers capture the temporal dependencies among primitive facial events. We also propose a multi-value RBM to capture and utilize the intrinsic global spatial patterns among facial events. The visible nodes of the restricted Boltzmann machine are facial events, and the connections between hidden and visible nodes model the spatial patterns inherent in expressions. In the training phase, an efficient learning algorithm simultaneously learns the spatial and temporal patterns by maximizing the log likelihood. In the testing phase, samples are classified according to the IT-RBM with the largest likelihood. We also propose an efficient inference algorithm that extends annealed importance sampling to the IT-RBM to calculate the partition function of the multi-value RBM. The results of our experiments on both expression recognition and posed versus spontaneous expression distinction demonstrate that the proposed method captures the intrinsic facial spatial-temporal patterns, leading to superior performance compared to state-of-the-art works.
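As a concrete illustration of the multi-value RBM component described above, the following numpy sketch computes the free energy of an RBM whose visible units are categorical (multi-value); lower free energy means higher unnormalized probability. The dimensions, parameterization, and names are our illustrative assumptions, not the authors' implementation:

```python
import numpy as np

# Sketch of a multi-value RBM: n_vis categorical visible units (one of
# n_states states each) connected to n_hid binary hidden units.
# Free energy of a visible configuration v:
#   F(v) = -sum_i b[i, v_i] - sum_j log(1 + exp(c_j + sum_i W[i, v_i, j]))
rng = np.random.default_rng(0)
n_vis, n_states, n_hid = 6, 3, 4        # e.g. 6 facial events, 3 motion states
W = rng.normal(scale=0.1, size=(n_vis, n_states, n_hid))  # vis-hid weights
b = rng.normal(scale=0.1, size=(n_vis, n_states))         # visible biases
c = np.zeros(n_hid)                                       # hidden biases

def free_energy(v):
    """v: integer state index per visible unit, shape (n_vis,)."""
    idx = np.arange(n_vis)
    vis_term = b[idx, v].sum()
    hid_input = c + W[idx, v].sum(axis=0)       # total input to each hidden unit
    return -vis_term - np.logaddexp(0.0, hid_input).sum()
```

Comparing the free energies of two configurations scores which is more probable under the model without ever computing the partition function; the partition function itself is only needed for classification by likelihood, which is where annealed importance sampling enters.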
ACKNOWLEDGMENT
This work has been supported by the National Key R&D Program of China (2018YFB1307102) and the National Science Foundation of China (917418129).
REFERENCES
[1] P. Ekman and W. V. Friesen, "Felt, false, and miserable smiles," Journal of Nonverbal Behavior, vol. 6, no. 4, pp. 238–252, 1982.
[2] P. Ekman, "Darwin, deception, and facial expression," Annals of the New York Academy of Sciences, vol. 1000, no. 1, pp. 205–221, 2003.
[3] K. L. Schmidt, S. Bhattacharya, and R. Denlinger, "Comparison of deliberate and spontaneous facial movement in smiles and eyebrow raises," Journal of Nonverbal Behavior, vol. 33, no. 1, pp. 35–45, 2009.
[4] P. Ekman, J. C. Hager, and W. V. Friesen, "The symmetry of emotional and deliberate facial actions," Psychophysiology, vol. 18, no. 2, pp. 101–106, 1981.
[5] E. D. Ross and V. K. Pulusu, "Posed versus spontaneous facial expressions are modulated by opposite cerebral hemispheres," Cortex, vol. 49, no. 5, pp. 1280–1291, 2013.
[6] S. Namba, S. Makihara, R. S. Kabir, M. Miyatani, and T. Nakao, "Spontaneous facial expressions are different from posed facial expressions: Morphological properties and dynamic sequences," Current Psychology, pp. 1–13, 2016.
[7] J. F. Cohn and K. L. Schmidt, "The timing of facial motion in posed and spontaneous smiles," International Journal of Wavelets, Multiresolution and Information Processing, vol. 2, no. 02, pp. 121–132, 2004.
[8] M. F. Valstar, M. Pantic, Z. Ambadar, and J. F. Cohn, "Spontaneous vs. posed facial behavior: Automatic analysis of brow actions," in Proceedings of the 8th International Conference on Multimodal Interfaces. ACM, 2006, pp. 162–170.
[9] H. Dibeklioglu, A. A. Salah, and T. Gevers, "Recognition of genuine smiles," IEEE Transactions on Multimedia, vol. 17, no. 3, pp. 279–294, 2015.
[10] M. Seckington, "Using dynamic Bayesian networks for posed versus spontaneous facial expression recognition," Master Thesis, Department of Computer Science, Delft University of Technology, 2011.
[11] G. C. Littlewort, M. S. Bartlett, and K. Lee, "Automatic coding of facial expressions displayed during posed and genuine pain," Image and Vision Computing, vol. 27, no. 12, pp. 1797–1803, 2009.
[12] H. Dibeklioglu, R. Valenti, A. A. Salah, and T. Gevers, "Eyes do not lie: Spontaneous versus posed smiles," in Proceedings of the 18th ACM International Conference on Multimedia. ACM, 2010, pp. 703–706.
[13] S. Wang, C. Wu, M. He, J. Wang, and Q. Ji, "Posed and spontaneous expression recognition through modeling their spatial patterns," Machine Vision and Applications, vol. 26, no. 2-3, pp. 219–231, 2015.
[14] S. Wang, C. Wu, and Q. Ji, "Capturing global spatial patterns for distinguishing posed and spontaneous expressions," Computer Vision and Image Understanding, vol. 147, pp. 69–76, 2016.
[15] G. Hinton, "A practical guide to training restricted Boltzmann machines," Momentum, vol. 9, no. 1, p. 926, 2010.
[16] Q. Gan, S. Nie, S. Wang, and Q. Ji, "Differentiating between posed and spontaneous expressions with latent regression Bayesian network," in AAAI, 2017, pp. 4039–4045.
[17] C. A. Corneanu, M. O. Simon, J. F. Cohn, and S. E. Guerrero, "Survey on RGB, 3D, thermal, and multimodal approaches for facial expression recognition: History, trends, and affect-related applications," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 8, pp. 1548–1568, 2016.
[18] B. Martinez, M. F. Valstar, B. Jiang, and M. Pantic, "Automatic analysis of facial actions: A survey," IEEE Transactions on Affective Computing, 2017.
[19] G. Zhao and M. Pietikainen, "Dynamic texture recognition using local binary patterns with an application to facial expressions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 6, pp. 915–928, 2007.
[20] T. Wu, M. S. Bartlett, and J. R. Movellan, "Facial expression recognition using Gabor motion energy filters," in Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on. IEEE, 2010, pp. 42–47.
[21] I. Cohen, N. Sebe, A. Garg, L. S. Chen, and T. S. Huang, "Facial expression recognition from video sequences: Temporal and static modeling," Computer Vision and Image Understanding, vol. 91, no. 1, pp. 160–187, 2003.
[22] L. Shang and K.-P. Chan, "Nonparametric discriminant HMM and application to facial expression recognition," in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 2090–2096.
[23] M. F. Valstar and M. Pantic, "Fully automatic recognition of the temporal phases of facial actions," IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 42, no. 1, pp. 28–43, 2012.
[24] R. El Kaliouby and P. Robinson, "Real-time inference of complex mental states from facial expressions and head gestures," in Real-Time Vision for Human-Computer Interaction. Springer, 2005, pp. 181–200.
[25] P. Rodriguez, G. Cucurull, J. Gonzalez, J. M. Gonfaus, K. Nasrollahi, T. B. Moeslund, and F. X. Roca, "Deep pain: Exploiting long short-term memory networks for facial expression classification," IEEE Transactions on Cybernetics, 2017.
[26] Z. Wang, S. Wang, and Q. Ji, "Capturing complex spatio-temporal relations among facial muscles for facial expression recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3422–3429.
[27] J. Yang and S. Wang, "Capturing spatial and temporal patterns for distinguishing between posed and spontaneous expressions," in Proceedings of the 2017 ACM on Multimedia Conference. ACM, 2017, pp. 469–477.
[28] J. F. Allen, "Maintaining knowledge about temporal intervals," Communications of the ACM, vol. 26, no. 11, pp. 832–843, 1983.
[29] J. M. Joyce, "Kullback-Leibler divergence," in International Encyclopedia of Statistical Science. Springer, 2011, pp. 720–722.
[30] G. E. Hinton, "Training products of experts by minimizing contrastive divergence," Neural Computation, vol. 14, no. 8, pp. 1771–1800, 2002.
[31] R. Salakhutdinov and I. Murray, "On the quantitative analysis of deep belief networks," in Proceedings of the 25th International Conference on Machine Learning. ACM, 2008, pp. 872–879.
[32] M. Mavadati, P. Sanger, and M. H. Mahoor, "Extended DISFA dataset: Investigating posed and spontaneous facial expressions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2016, pp. 1–8.
[33] T. Pfister, X. Li, G. Zhao, and M. Pietikainen, "Differentiating spontaneous from posed facial expressions within a generic facial expression recognition framework," in Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on. IEEE, 2011, pp. 868–875.
[34] X. Xiong and F. De la Torre, "Supervised descent method and its applications to face alignment," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 532–539.
[35] P. Ekman, "Emotions revealed," St. Martin's Griffin, New York, 2003.
[36] H. Dibeklioglu, A. Salah, and T. Gevers, "Are you really smiling at me? Spontaneous versus posed enjoyment smiles," Computer Vision–ECCV 2012, pp. 525–538, 2012.
[37] P. Wu, H. Liu, and X. Zhang, "Spontaneous versus posed smile recognition using discriminative local spatial-temporal descriptors," in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 1240–1244.
[38] P. Wu, H. Liu, X. Zhang, and Y. Gao, "Spontaneous versus posed smile recognition via region-specific texture descriptor and geometric facial dynamics," Frontiers, vol. 1, 2016.
[39] T. Kanade, J. F. Cohn, and Y. Tian, "Comprehensive database for facial expression analysis," in Automatic Face and Gesture Recognition, 2000. Proceedings. Fourth IEEE International Conference on. IEEE, 2000, pp. 46–53.
[40] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews, "The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression," in Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on. IEEE, 2010, pp. 94–101.
[41] M. Valstar and M. Pantic, "Induced disgust, happiness and surprise: An addition to the MMI facial expression database," in Proc. 3rd Intern. Workshop on EMOTION (satellite of LREC): Corpora for Research on Emotion and Affect, 2010, p. 65.
[42] E. Sariyanidi, H. Gunes, and A. Cavallaro, "Learning bases of activity for facial expression recognition," IEEE Transactions on Image Processing, 2017.
[43] S. Elaiwat, M. Bennamoun, and F. Boussaid, "A spatio-temporal RBM-based model for facial expression recognition," Pattern Recognition, vol. 49, pp. 152–161, 2016.
Shangfei Wang received her BS in Electronic Engineering from Anhui University, Hefei, Anhui, China, in 1996. She received her MS in circuits and systems and her PhD in signal and information processing from the University of Science and Technology of China (USTC), Hefei, Anhui, China, in 1999 and 2002, respectively. From 2004 to 2005, she was a postdoctoral research fellow at Kyushu University, Japan. Between 2011 and 2012, Dr. Wang was a visiting scholar at Rensselaer Polytechnic Institute in Troy, NY, USA. She is currently an Associate Professor in the School of Computer Science and Technology, USTC. Her research interests cover affective computing and probabilistic graphical models. She has authored or co-authored over 90 publications. She is a senior member of the IEEE and a member of the ACM.
Zhuangqiang Zheng received his BS in mathematics from Liaoning Technical University in 2017, and he is currently pursuing his MS in Computer Science at the University of Science and Technology of China, Hefei, China. His research interest is affective computing.
Shi Yin received his BS in Automation from Central South University in 2016, and he is now a PhD student majoring in Computer Science and Technology at the University of Science and Technology of China, Hefei, China. His research interest is affective computing.
Jiajia Yang received her BS in Software Engineering from Dalian Maritime University in 2015, and she is currently pursuing her MS in Computer Science at the University of Science and Technology of China, Hefei, China. Her research interest is affective computing.
Qiang Ji received the PhD degree in electrical engineering from the University of Washington. He is currently a professor with the Department of Electrical, Computer, and Systems Engineering, Rensselaer Polytechnic Institute (RPI). He recently served as a program director at the National Science Foundation (NSF), where he managed NSF's computer vision and machine learning programs. He also held teaching and research positions with the Beckman Institute at the University of Illinois at Urbana-Champaign, the Robotics Institute at Carnegie Mellon University, the Department of Computer Science at the University of Nevada at Reno, and the US Air Force Research Laboratory. He currently serves as the director of the Intelligent Systems Laboratory (ISL) at RPI. His research interests are in computer vision, probabilistic graphical models, information fusion, and their applications in various fields. He has published more than 160 papers in peer-reviewed journals and conferences. His research has been supported by major governmental agencies including NSF, NIH, DARPA, ONR, ARO, and AFOSR, as well as by major companies including Honda and Boeing. He is an editor of several related IEEE and international journals, and he has served as a general chair, program chair, technical area chair, and program committee member for numerous international conferences and workshops. He is a fellow of the IAPR and the IEEE.
Deep Structured Prediction for Facial Landmark Detection

Lisha Chen, Hui Su, and Qiang Ji
Rensselaer Polytechnic Institute
Abstract
Existing deep learning based facial landmark detection methods have achieved excellent performance. These methods, however, do not explicitly embed the structural dependencies among landmark points. They hence cannot preserve the geometric relationships between landmark points or generalize well to challenging conditions or unseen data. This paper proposes a method for deep structured facial landmark detection based on combining a deep Convolutional Network with a Conditional Random Field. We demonstrate its superior performance to existing state-of-the-art techniques in facial landmark detection, especially a better generalization ability on challenging datasets that include large pose and occlusion.
1 Introduction
Facial landmark detection aims to automatically localize the fiducial facial landmark points around facial components and the facial contour. It is essential for various facial analysis tasks such as facial expression analysis, head pose estimation, and face recognition. With the development of deep learning techniques, traditional facial landmark detection approaches that rely on hand-crafted low-level features have been outperformed by deep feature based approaches. Purely deep learning based methods, however, cannot effectively capture the structural dependencies among landmark points. They hence cannot perform well under challenging conditions, such as large head pose, occlusion, and large expression variation. Probabilistic graphical models, such as Conditional Random Fields (CRFs), have been widely applied to various computer vision tasks. They can systematically capture the structural relationships among random variables and perform structured prediction. Recently, there have been works that combine deep models (e.g., CNNs) with CRFs to simultaneously leverage CNNs' representation power and CRFs' structure modeling power [11, 10, 53]. Their combination has yielded significant performance improvement over methods that use either a CNN or a CRF alone. These works have so far mainly been applied to classification tasks such as semantic image segmentation. Besides classification, there are also works that apply the CNN-CRF model to human pose estimation and facial landmark detection [44, 14, 13]. To reduce computational complexity, their CRF model is typically of special structure (e.g., a tree) and, moreover, they employ approximate learning and inference criteria. In this work, we propose to combine a CNN with a fully-connected CRF to jointly perform facial landmark detection in a regression framework using exact learning and inference methods.
Compared to the existing works, the contributions of our work are summarized as follows:

1) We introduce the fully-connected CNN-CRF that produces structured probabilistic predictions of facial landmark locations.

2) Our model explicitly captures the structural relationship variations caused by pose and deformation, unlike some previous works that combine a CNN with a CRF using a fixed pairwise relationship represented by a convolution kernel.
3) We derive closed-form solutions for learning and inference given the deformation parameters, unlike previous works that use approximate methods such as energy minimization ignoring the partition function for learning and mean-field for inference. And instead of a discriminative criterion or another approximated loss function, we employ the exact negative log likelihood loss function, without any assumption.

4) Experiments on benchmark face alignment datasets demonstrate the advantages of the proposed method in achieving better prediction accuracy and generalization to challenging or unseen data than current state-of-the-art (SoA) models.

Submitted to the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019).
2 Related Work

2.1 Facial Landmark Detection
Classic facial landmark detection methods, including the Active Shape Model (ASM) [16, 30], Active Appearance Model (AAM) [15, 26, 29, 40], Constrained Local Model (CLM) [27, 41], and Cascade Regression [9, 6, 55, 7, 48], rely on hand-crafted shallow image features and are usually sensitive to initialization. They are outperformed by modern deep learning based methods.
Deep learning based face alignment was first proposed in [42] and achieved better performance than classic methods. This purely deep appearance based approach uses a deep cascade convolutional network and coordinate regression in each cascade level. Later on, more work using a purely deep appearance based framework for coordinate regression was explored. The Tasks-Constrained Deep Convolutional Network (TCDCN) [52] was proposed to jointly optimize facial landmark detection with correlated tasks such as head pose estimation and facial attribute inference. The Mnemonic Descent Method (MDM) [45], an end-to-end trainable deep convolutional Recurrent Neural Network (RNN), was proposed, where the cascade regression was implemented by an RNN. Recently, heatmap learning based methods established a new state of the art for face alignment and body pose estimation [44, 32, 46], and most of these face alignment methods [5, 47] follow the architecture of the Stacked Hourglass [32]. The stacked modules refine the network predictions after each stack. Different from direct coordinate regression, this architecture predicts a heatmap with the same size as the input image. The landmark location is predicted either in a regression framework by the coordinate on the heatmap with the largest response [44, 32, 4, 12, 8] or in a classification framework by segmenting the heatmap pixels into different facial parts [34, 22, 36]. Hybrid deep methods combine deep models with face shape models. One strategy is to directly predict 3D deformable parameters instead of landmark locations in a cascaded deep regression framework, e.g., 3D Dense Face Alignment (3DDFA) [56] and Pose-Invariant Face Alignment (PIFA) [25]. Another strategy is to use the deformable model as a constraint to limit the face shape search space and thereby refine the predictions from the appearance features, e.g., the Convolutional Experts Constrained Local Model (CE-CLM) [50].
As deep learning techniques have developed, more works take advantage of expressive deep features and combine them with graphical models to produce structured predictions. Early work like [33] jointly trains a CNN and a graphical model for image segmentation. Do et al. [18] introduced the NeuralCRF for sequence labeling, and various works have explored other tasks: for instance, Jain et al. [24] and Eigen et al. [19] for image restoration, Yao et al. and Morin et al. [49, 31] for language understanding, and Bengio et al., Peng et al., and Jaderberg et al. [3, 35, 23] for handwriting or text recognition. Recently, for human body pose estimation, Chen et al. [11] use a CNN to output image-dependent part presence as the unary term and spatial relationships as the pairwise potential in a tree-structured CRF, and use dynamic programming for inference. Tompson et al. [44, 43] jointly trained a CNN and a fully-connected MRF by using a convolution kernel to capture pairwise relationships among different body joints and an iterative convolution process to implement belief propagation. The idea of using convolution to implement message passing has also been explored in [14], where structural relationships are captured at the body joint feature level, rather than the output level, in a bi-directional tree-structured model. The work of Chu et al. [14] has been applied to face alignment [47] to pass messages between facial part boundary feature maps. As an extension to [14], [13] models structures in both the output and hidden feature layers of a CNN. Similarly, for image segmentation, DeepLab [10] uses a fully-connected CRF with binary cliques and mean-field inference, and [28] uses efficient piecewise training to avoid repeated inference during training. In [53], the CRF mean-field inference is implemented by an RNN, and the network is end-to-end trainable by directly optimizing the performance of the mean-field inference. Using an RNN to implement message passing has also been applied to facial action unit recognition [17]. In [21], the MRF deformable part model is implemented as a layer in a CNN.
Comparison. Compared to previous models serving a similar purpose, such as [14, 13, 47], which assume a tree-structured model with belief propagation as the inference method, we use a fully-connected model. With a fully-connected model, we do not need to specify a particular tree structure; the model learns the strong or weak relationships from data, so the method is more generalizable to different tasks. Moreover, the works [44, 14, 13, 47, 53] use convolution to implement the pairwise term and the message passing process. The pairwise term, once trained, is independent of the input image, and thus cannot capture the pairwise constraint variations across different conditions, such as target object rotation and object shape. In contrast, we explicitly capture the object pose and deformation variations. They also employ approximate methods, such as energy minimization ignoring the partition function for learning and mean-field for inference; in this paper we show that we can perform exact learning and inference. Lastly, compared to traditional CRF models [37, 38], the weights for the unary terms in our model are also outputs of the neural network, whose inverses quantify the heteroscedastic aleatoric uncertainty of the unary predictions.
3 Method

This section presents the proposed structured deep probabilistic facial landmark detection model. In this model, the joint probability distribution of facial landmark locations and deformable parameters is captured by a conditional random field model.
3.1 Model definition

Denote the face image as x and the 2D facial landmark locations as y, where each landmark is y_i, i = 1, ..., N. The deformable model parameters that capture pose, identity, and expression variation are denoted as ζ. The model parameters we want to learn are denoted as Θ. Assuming ζ is marginally dependent on x but conditionally independent of x given y, the graphical model is shown in Fig. 1.
Figure 1: Overview of the graphical model. Dashed lines represent dependencies between each pair of landmarks, dotted lines represent dependencies between landmarks and deformable parameters, and solid lines represent dependencies between landmarks and the face image.
Based on this definition and assumption, the joint distribution of the landmarks y and deformable parameters ζ conditioned on the face image x can be formulated in a CRF framework and written as

$$p_\Theta(\mathbf{y}, \zeta \mid \mathbf{x}) = \frac{1}{Z_\Theta(\mathbf{x})} \exp\Big\{ -\sum_{i=1}^{N} \phi_{\theta_1}(\mathbf{y}_i \mid \mathbf{x}) - \sum_{i=1}^{N} \sum_{j=i+1}^{N} \psi_{C_{ij}}(\mathbf{y}_i, \mathbf{y}_j, \zeta) \Big\} \quad (1)$$

where Θ = [θ_1, C_ij], θ_1 is the neural network parameter, C_ij is a 2 × 2 symmetric positive definite matrix that captures the spatial relationships between a pair of landmark points, and Z_Θ(x) is the partition function. φ_{θ_1}(y_i | x) is the unary potential function with parameter θ_1, and ψ_{C_ij}(y_i, y_j, ζ) is the triple-wise potential function with parameter C_ij.
3.2 Potential functions

We define the unary and triple-wise potentials in Eq. (2) and Eq. (3), respectively:

$$\phi_{\theta_1}(\mathbf{y}_i \mid \mathbf{x}) = \frac{1}{2} [\mathbf{y}_i - \mu_i(\mathbf{x}, \theta_1)]^T \Sigma_i^{-1}(\mathbf{x}, \theta_1) [\mathbf{y}_i - \mu_i(\mathbf{x}, \theta_1)] \quad (2)$$

$$\psi_{C_{ij}}(\mathbf{y}_i, \mathbf{y}_j, \zeta) = [\mathbf{y}_i - \mathbf{y}_j - \mu_{ij}(\zeta)]^T C_{ij} [\mathbf{y}_i - \mathbf{y}_j - \mu_{ij}(\zeta)] \quad (3)$$
where μ_i(x, θ_1) and Σ_i(x, θ_1) are the outputs of the CNN that represent the mean and covariance matrix of each landmark given the image x, and μ_ij(ζ) represents the mean difference between two landmarks. It is fully determined by the 3D deformable face shape parameters ζ, which contain the rigid parameters, rotation R and scale S, and the non-rigid parameters q:

$$\begin{bmatrix} \mu_{ij}(\zeta) \\ 1 \end{bmatrix} = \frac{1}{\lambda} S R \big( \mathbf{y}^{3d}_i + \Phi_i \mathbf{q} - \mathbf{y}^{3d}_j - \Phi_j \mathbf{q} \big) \quad (4)$$
where y^{3d} is the 3D mean face shape and Φ contains the bases of the deformable model; both are learned from data. The deformable parameters ζ = [S, R, q] are jointly estimated with the 2D landmark locations during inference. In this work, we assume a weak perspective projection model. S is a 3 × 3 diagonal matrix that contains 2 independent parameters s_x, s_y as the scaling factors (encoding the camera intrinsic parameters) for the column and row directions, respectively, while R is a 3 × 3 orthonormal matrix with 3 independent parameters γ_1, γ_2, γ_3 as the pitch, yaw, and roll rotation angles. Note that the translation vector is canceled by taking the difference of two landmark points.
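The projection in Eq. (4) can be sketched numerically as follows. This is an illustrative implementation under the stated weak-perspective assumption; the Euler-angle convention, the composition order S·R, and the omission of the homogeneous normalization by λ are our simplifying assumptions, not details given by the paper:

```python
import numpy as np

# Sketch of Eq. (4): the expected 2D offset between landmarks i and j under a
# weak-perspective model. R is built from pitch/yaw/roll; S holds scales sx, sy.
def euler_to_R(pitch, yaw, roll):
    cx, sx_ = np.cos(pitch), np.sin(pitch)
    cy, sy_ = np.cos(yaw), np.sin(yaw)
    cz, sz_ = np.cos(roll), np.sin(roll)
    Rx = np.array([[1, 0, 0], [0, cx, -sx_], [0, sx_, cx]])
    Ry = np.array([[cy, 0, sy_], [0, 1, 0], [-sy_, 0, cy]])
    Rz = np.array([[cz, -sz_, 0], [sz_, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx          # orthonormal 3x3 rotation

def mu_ij(y3d_i, y3d_j, Phi_i, Phi_j, q, sx, sy, angles):
    d3 = (y3d_i + Phi_i @ q) - (y3d_j + Phi_j @ q)  # 3D offset after deformation
    S = np.diag([sx, sy, 1.0])
    p = S @ euler_to_R(*angles) @ d3
    return p[:2]                 # weak perspective: keep x, y; drop depth
```

With zero rotation, unit scales, and q = 0, mu_ij reduces to the x-y offset of the mean 3D shape, which matches the intuition that the pairwise term anchors landmark offsets to a pose-adjusted face shape.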
3.3 Learning and Inference

We propose to implement the conditional probability distribution in Eq. (1) with a CNN-CRF model. As shown in Fig. 2, the CNN with parameter θ_1 outputs the mean μ_i(x, θ_1) and covariance matrix Σ_i(x, θ_1) for each facial landmark y_i, which together form the unary potential function φ_{θ_1}(y_i | x). A fully-connected (FC) graph with parameters C_ij gives the triple-wise potential ψ_{C_ij}(y_i, y_j, ζ); given ζ as well as the output from the unary part, the FC graph can output E(x, ζ, Θ) and Λ_p(x, ζ, Θ), the mean and precision matrix of the conditional distribution p_Θ(y | ζ, x). The FC graph can be implemented as another layer following the CNN. Combining the unary and the triple-wise potentials, we obtain the joint distribution p_Θ(y, ζ | x). However, directly inferring y*, ζ* from p_Θ(y, ζ | x) is difficult; we therefore iteratively infer from the conditional distributions p_Θ(y | ζ, x) and p_Θ(ζ | y).
Figure 2: Overall flowchart of the proposed CNN-CRF model.
Mean and Precision matrix
During learning and inference, we need to compute the conditional probability p_Θ(y | ζ, x). With the quadratic unary and triple-wise potential functions, p_Θ(y | ζ, x) is a multivariate Gaussian distribution that can be written as

$$p_\Theta(\mathbf{y} \mid \zeta, \mathbf{x}) = \frac{1}{Z'_\Theta(\mathbf{x})} \exp\Big\{ -\sum_{i=1}^{N} \phi_{\theta_1}(\mathbf{y}_i \mid \mathbf{x}) - \sum_{i=1}^{N} \sum_{j=i+1}^{N} \psi_{C_{ij}}(\mathbf{y}_i, \mathbf{y}_j, \zeta) \Big\}$$
$$= \exp\Big\{ \frac{1}{2} \ln |\Lambda_p(\mathbf{x}, \Theta, \zeta)| - \frac{1}{2} [\mathbf{y} - E(\mathbf{x}, \Theta, \zeta)]^T \Lambda_p(\mathbf{x}, \Theta, \zeta) [\mathbf{y} - E(\mathbf{x}, \Theta, \zeta)] \Big\} \quad (5)$$
where E(x, Θ, ζ) and Λ_p(x, Θ, ζ) are the mean and precision matrix of the multivariate Gaussian distribution. They are computed exactly during learning and inference. The mean E can be computed by solving the linear system of equations Λ_p E = b, where Λ_p, the precision matrix, is a symmetric positive definite matrix that can be directly computed from the coefficients in the unary and pairwise terms, as shown in Eq. (6), from which b can also be computed.
$$\Lambda_p = \begin{bmatrix} \Lambda_{p_{11}} & \cdots & \Lambda_{p_{1N}} \\ \vdots & \ddots & \vdots \\ \Lambda_{p_{N1}} & \cdots & \Lambda_{p_{NN}} \end{bmatrix}, \quad \begin{cases} \Lambda_{p_{ii}} = \Sigma_i^{-1} + \sum_{j \neq i} C_{ij} \\ \Lambda_{p_{ij}} = -C_{ij} \end{cases}, \qquad \mathbf{b} = \begin{bmatrix} \mathbf{b}_1 \\ \mathbf{b}_2 \\ \vdots \\ \mathbf{b}_N \end{bmatrix}, \quad \mathbf{b}_i = \Sigma_i^{-1} \mu_i + \sum_{j \neq i} C_{ij} \mu_{ij} \quad (6)$$
From Eq. (6) we can see that the final inference result E_i is a linear combination of µ_i and µ_j + µ_ij, j ∈ {1, . . . , N}, j ≠ i. To solve this linear system of equations, we use a direct method that yields an exact solution, with a fast implementation based on Cholesky factorization requiring O(N³) FLOPs. For a practical implementation of the determinant that avoids numerical issues, we again use the Cholesky factorization LLᵀ = Λp and compute the log-determinant as ln |Λp| = 2 ∑ ln diag(L), where diag(·) takes the diagonal elements of a matrix.
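As a concrete sketch of this step (our own construction under stated assumptions, not the authors' released code), the assembly of Λp and b from Eq. (6) and the Cholesky-based solve and log-determinant can be written as follows; the function name `gaussian_crf_mean_logdet` and the 2-D array layout are illustrative choices, and C is assumed symmetric so that Λp is symmetric positive definite:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gaussian_crf_mean_logdet(Sigma, mu, C, mu_off):
    """Assemble Lambda_p and b as in Eq. (6), then solve Lambda_p E = b.

    Sigma:  (N, 2, 2) per-landmark covariances from the CNN heatmaps
    mu:     (N, 2)    per-landmark means
    C:      (N, N, 2, 2) coefficients C_ij (assumed symmetric; C[i, i] unused)
    mu_off: (N, N, 2) offsets mu_ij
    Returns the Gaussian mean E, shape (N, 2), and ln|Lambda_p|.
    """
    N = mu.shape[0]
    Lam = np.zeros((2 * N, 2 * N))
    b = np.zeros(2 * N)
    for i in range(N):
        Si_inv = np.linalg.inv(Sigma[i])
        diag_blk = Si_inv.copy()        # Lambda_p,ii = Sigma_i^-1 + sum_j C_ij
        b_i = Si_inv @ mu[i]            # b_i = Sigma_i^-1 mu_i + sum_j C_ij mu_ij
        for j in range(N):
            if j == i:
                continue
            diag_blk += C[i, j]
            b_i += C[i, j] @ mu_off[i, j]
            Lam[2 * i:2 * i + 2, 2 * j:2 * j + 2] = -C[i, j]  # Lambda_p,ij = -C_ij
        Lam[2 * i:2 * i + 2, 2 * i:2 * i + 2] = diag_blk
        b[2 * i:2 * i + 2] = b_i
    # Direct, exact solve via Cholesky (O(N^3) FLOPs), and the numerically
    # stable log-determinant ln|Lambda_p| = 2 * sum(ln diag(L)), L L^T = Lambda_p.
    L, lower = cho_factor(Lam, lower=True)
    E = cho_solve((L, lower), b).reshape(N, 2)
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    return E, logdet
```

The Cholesky factor is reused for both the solve and the log-determinant, which is what makes the exact computation practical at this scale.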
Learning
During learning, our goal is to optimize Θ given training data D = {x_m, y_m, m = 1, . . . , M}. However, since our training data does not contain ground truth for ζ_m, we jointly optimize Θ and ζ_m, m = 1, . . . , M. We therefore define the learning problem as
$$
\Theta^*, \zeta^* = \arg\min_{\Theta, \zeta} -\sum_{m=1}^{M} \ln p_\Theta(\mathbf{y}_m, \zeta_m \mid \mathbf{x}_m)
\tag{7}
$$
where ζ = {ζ_1, . . . , ζ_M}. We use an alternating method: based on the current Θ^t, we optimize ζ by
$$
\zeta_m^{t+1} = \arg\min_{\zeta_m} -\ln p_{\Theta^t}(\mathbf{y}_m, \zeta_m \mid \mathbf{x}_m)
= \arg\min_{\zeta_m} \sum_{i=1}^{N} \sum_{j=i+1}^{N} \psi_{C_{ij}^t}(\mathbf{y}_{mi}, \mathbf{y}_{mj}, \zeta_m)
\tag{8}
$$
Then, based on the current ζ^t, we optimize Θ by
$$
\begin{aligned}
\Theta^{t+1} &= \arg\min_{\Theta} -\sum_{m=1}^{M} \ln p_\Theta(\mathbf{y}_m, \zeta_m^t \mid \mathbf{x}_m)
= \arg\min_{\Theta} -\sum_{m=1}^{M} \ln p_\Theta(\mathbf{y}_m \mid \zeta_m^t, \mathbf{x}_m) \\
&= \arg\min_{\Theta} \sum_{m=1}^{M} -\tfrac{1}{2} \ln |\Lambda_p(\mathbf{x}_m, \Theta, \zeta_m^t)|
+ \tfrac{1}{2} [\mathbf{y}_m - E(\mathbf{x}_m, \Theta, \zeta_m^t)]^T \Lambda_p(\mathbf{x}_m, \Theta, \zeta_m^t) [\mathbf{y}_m - E(\mathbf{x}_m, \Theta, \zeta_m^t)]
\end{aligned}
\tag{9}
$$

The algorithm for this problem first initializes C_ij and optimizes ζ, then alternately fixes a subset of the parameters in Θ and optimizes the others; its pseudo code is shown in Algorithm 1.
Algorithm 1: Learning CNN-CRF
Input: training data {x_m, y_m, m = 1, . . . , M}
Initialization: parameters Θ^0 = {θ_1^0 = randn, C_ij^0 = I}, t = 0
while not converged do
    Stage 1: fix the parameters Θ; optimize ζ by Eq. (8); t = t + 1
    Stage 2: fix ζ = ζ^t and C_ij = C_ij^t; update θ_1 using Eq. (9) by back propagation:
        while not converged do
            θ_1^{t+1} = θ_1^t − η_1^t ∂Loss/∂θ_1; t = t + 1
        end
        [ζ^t, C_ij^t] = [ζ, C_ij]
    Stage 3: fix ζ = ζ^t and θ_1 = θ_1^t; update C_ij using Eq. (9) by back propagation:
        while not converged do
            C_ij^{t+1} = C_ij^t − η_2^t ∂Loss/∂C_ij; t = t + 1
        end
        [ζ^t, θ_1^t] = [ζ, θ_1]
end
Inference
The inference problem is a joint inference of ζ and y for each input face image x, and is defined as
$$
\mathbf{y}^*, \zeta^* = \arg\max_{\mathbf{y}, \zeta} \ln p_\Theta(\mathbf{y}, \zeta \mid \mathbf{x})
\tag{10}
$$
We use an alternating method. Based on the current y^t, we optimize ζ^t by
$$
\zeta^t = \arg\max_{\zeta} \ln p_\Theta(\mathbf{y}^t, \zeta \mid \mathbf{x})
= \arg\min_{\zeta} \sum_{i=1}^{N} \sum_{j=i+1}^{N} \psi_{C_{ij}}(\mathbf{y}_i^t, \mathbf{y}_j^t, \zeta)
\tag{11}
$$
Then, based on the current ζ^t, we optimize y^{t+1} by
$$
\mathbf{y}^{t+1} = \arg\max_{\mathbf{y}} \ln p_\Theta(\mathbf{y}, \zeta^t \mid \mathbf{x})
= \arg\max_{\mathbf{y}} \ln p_\Theta(\mathbf{y} \mid \zeta^t, \mathbf{x})
= E(\mathbf{x}, \Theta, \zeta^t)
\tag{12}
$$
The algorithm is shown in Algorithm 2.
Algorithm 2: Inference for CNN-CRF
Input: one face image x
Initialization: y_i^0 = µ_i, i = 1, . . . , N; t = 0
while not converged do
    Update ζ by Eq. (11): ζ^t = arg min_ζ Σ_{i=1}^{N} Σ_{j=i+1}^{N} ψ_{C_ij}(y_i^t, y_j^t, ζ)
    Update y by Eq. (12): y^{t+1} = E(x, Θ, ζ^t)
    t = t + 1
end
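The alternation in Algorithm 2 can be illustrated with a toy instance (our own construction, not the authors' implementation): two scalar landmarks with unit unary precision, a pairwise potential ψ(y1, y2, ζ) = 0.5·c·(y2 − y1 − ζ)², and a hypothetical deformation variable ζ picked from a small candidate set standing in for the deformable-model fit:

```python
import numpy as np

# Toy instance of Algorithm 2: unary means mu, coupling c = C_12, and a
# candidate set for the illustrative deformation variable zeta.
mu = np.array([0.0, 2.2])
c = 4.0
zeta_candidates = np.array([0.0, 1.0, 2.0, 3.0])

y = mu.copy()                                   # initialization: y_i^0 = mu_i
for _ in range(10):
    # zeta-step (Eq. 11): minimize the pairwise energy at the current y
    zeta = zeta_candidates[np.argmin((y[1] - y[0] - zeta_candidates) ** 2)]
    # y-step (Eq. 12): conditional Gaussian mean E = Lambda_p^{-1} b, with
    # Lambda_p and b assembled as in Eq. (6) (here mu_12 = -zeta, mu_21 = +zeta)
    Lam = np.array([[1 + c, -c], [-c, 1 + c]])
    b = np.array([mu[0] - c * zeta, mu[1] + c * zeta])
    y = np.linalg.solve(Lam, b)
```

Starting from y = µ, the ζ-step snaps to the candidate 2.0 and the y-step pulls the landmark pair toward that offset; the iteration reaches a fixed point after one round.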
4 Experiments
Datasets. We evaluate our methods on popular facial landmark detection benchmarks, including 300W [39], Menpo [51], COFW [6], and 300VW [1].
300W has 68-landmark annotations. We first train on the 300W-LP dataset [56], which augments the original 300W dataset with large yaw poses, and then fine-tune on the original dataset (3,837 faces). Testing is performed on the 300W test set, which contains 600 images.
Menpo contains images from AFLW and FDDB re-annotated following the 68-landmark scheme. It has two subsets: frontal, with 68-landmark annotations for near-frontal faces (6,679 samples), and profile, with 39-landmark annotations for profile faces (2,300 samples). We use it as a test set for cross-dataset evaluation.
COFW has 1,345 training samples and 507 testing samples, all of which are partially occluded facial images. The original dataset is annotated with 29 landmarks. We also use the COFW-68 test set [20], which provides 68-landmark re-annotations, for cross-dataset evaluation.
300VW is a facial video dataset with 68-landmark annotations. It contains three scenarios: 1) constrained laboratory and naturalistic well-lit conditions; 2) unconstrained real-world conditions with varying illumination, dark rooms, overexposed shots, etc.; and 3) completely unconstrained arbitrary conditions, including various illumination, occlusions, make-up, expressions, head poses, etc.
Evaluation metrics. We evaluate our algorithm using the standard normalized mean error (NME) and the cumulative error distribution (CED) curve. In addition, the area under the curve (AUC) and the failure rate (FR) for a maximum error of 0.07 are reported. As in [5], the NME is defined as the average point-to-point Euclidean distance between the ground-truth (y_gt) and predicted (y_pred) landmark locations, normalized by the ground-truth bounding-box size d = √(w_bbox · h_bbox):

$$
\mathrm{NME} = \frac{1}{N} \sum_{i=1}^{N} \frac{\| \mathbf{y}^{(i)}_{pred} - \mathbf{y}^{(i)}_{gt} \|_2}{d}, \qquad d = \sqrt{w_{bbox} \cdot h_{bbox}}
$$

Based on the NMEs over a test dataset, we draw a CED curve with NME on the horizontal axis and the percentage of test images on the vertical axis. The AUC is then computed as the area under that curve for each test dataset.
Implementation details. For a fair comparison with the state-of-the-art purely deep-learning-based method [5], we use the same training and testing procedure for 2D landmark detection. The 3D deformable model was trained on the 300W training set. For the CNN, we use one stack of Hourglass with the same structure as [5], followed by a softmax layer that outputs a probability map for each facial landmark. From the probability map we compute the mean µ_i and covariance Σ_i. We also use an additional softmax cross-entropy loss to assist training, which empirically gives better performance.
Training procedure: The initial learning rate η_1 is 10⁻⁴ for 15 epochs with a minibatch size of 10; it is then dropped to 10⁻⁵ and 10⁻⁶ after every further 15 epochs, and training continues until convergence. The learning rate η_2 is set to 10⁻³. We apply random augmentations such as random cropping, rotation, etc.
Testing procedure: We follow the same testing procedure as [5]. The face is cropped using the ground-truth bounding box defined in 300W, and the crop is rescaled to 256 × 256 before being passed to the network. Since the Menpo-profile dataset uses a different annotation scheme, we evaluate on the 26 overlapping points, i.e., we remove all points on the face contour and on each eyebrow except the two end points, and remove the 5th point on the nose contour.
4.1 Comparison with existing approaches
We compare with state-of-the-art facial landmark detection algorithms, including purely deep-learning-based methods such as TCDCN [52] and FAN [5], as well as hybrid methods such as CLNF [2] and CE-CLM [50]. These methods are evaluated using the code provided by the authors under the same experimental protocol, i.e., the same bounding boxes and evaluation metrics. The results on the 300W test set are shown in Table 1, and the corresponding CED curves in Fig. 3a.
Table 1: 300W test set prediction results (%)

                 300W-test-indoor      300W-test-outdoor     300W-test-all
Method           NME   AUC   FR        NME   AUC   FR        NME   AUC   FR
TCDCN [52]       4.16  42.3  5.33      4.14  41.8  4.33      4.15  42.1  4.83
CFSS [54]        3.19  56.1  2.33      2.98  57.4  1.33      3.09  56.7  1.83
CLNF [2]         4.38  47.2  7.67      4.06  48.1  5.67      4.22  47.6  6.67
CE-CLM [50]      3.02  57.5  2.67      3.07  56.3  2.00      3.05  56.9  2.33
FAN [5]          2.52  63.6  1.00      2.50  64.0  0.00      2.51  63.8  0.50
proposed         2.34  67.0  1.00      2.25  67.5  0.00      2.28  67.2  0.50
Figure 3: CED curves on different datasets (better viewed in color and magnified): (a) 300W test set; (b) Menpo-frontal dataset; (c) Menpo-profile dataset; (d) COFW-68 test set. Each panel plots NME (%) on the horizontal axis against the proportion of images (%) on the vertical axis for TCDCN, CFSS, CLNF, CE-CLM, FAN, and the proposed method.
Cross-dataset Evaluation
Besides the 300W test set, we evaluate the proposed method on the Menpo dataset, the COFW-68 test set, and the 300VW test set for cross-dataset evaluation. The results are shown in Table 2 for Menpo and COFW-68 and in Table 3 for 300VW, with the corresponding CED curves in Figs. 3b, 3c, and 3d. The method is trained on 300W-LP and fine-tuned on the 300W Challenge training set for 68 landmarks. Compared to the 300W test set and the Menpo-frontal dataset, where state-of-the-art methods attain saturating performance as noted in [5], under more challenging cross-dataset conditions such as COFW with heavy occlusion and Menpo-profile with large poses, the proposed method shows better generalization ability with a significant performance improvement. In addition, the proposed method achieves the smallest failure rate (FR) on all evaluated datasets.
Table 2: Cross-dataset prediction results on the Menpo dataset and COFW-68 test set (%)

                 Menpo-frontal         Menpo-profile          COFW-68 test
Method           NME   AUC   FR        NME    AUC   FR        NME   AUC   FR
TCDCN [52]       4.04  46.2  5.84      15.79  4.1   83.35     4.71  35.8  8.68
CFSS [54]        3.91  57.4  9.75      16.96  10.4  71.35     3.79  49.0  4.34
CLNF [2]         3.74  55.4  5.82      10.63  20.5  46.09     4.75  42.9  10.65
CE-CLM [50]      2.78  63.3  1.66      7.40   33.4  29.83     3.36  52.4  2.37
FAN [5]          2.34  66.3  0.33      6.12   42.4  27.30     2.99  57.0  0.00
proposed         2.35  66.2  0.18      4.75   48.0  24.30     2.73  60.6  0.00
Table 3: 300VW test set prediction results for cross-dataset evaluation (%)

                 300VW-category1       300VW-category2       300VW-category3
Method           NME   AUC   FR        NME   AUC   FR        NME   AUC   FR
TCDCN [52]       3.49  51.2  1.74      3.80  45.8  1.76      4.45  43.8  8.85
CFSS [54]        2.44  67.0  1.66      2.49  64.3  0.77      3.26  60.5  5.18
FAN [5]          2.10  71.0  0.33      2.21  68.1  0.07      2.93  63.7  2.77
proposed         2.11  70.5  0.29      2.09  69.8  0.05      2.59  66.8  2.03
4.2 Analysis
In this section, we report the results of a sensitivity analysis and an ablation study. Unless otherwise specified, the analysis is performed on the test datasets with models trained on 300W-LP and fine-tuned on the 300W training set.
Sensitivity to challenging conditions. We evaluate different methods under challenging conditions caused by high noise, low resolution, or different initializations in Fig. 4. In general, the proposed CNN-CRF model is more robust under challenging conditions than a pure CNN model with the same structure, i.e., the CNN-CRF model with C_ij = 0.
Figure 4: Prediction error sensitivity to challenging conditions: (a) noise; (b) lower resolution; (c) different initialization.
Ablation study. The improvement of the proposed method lies in two aspects. On the one hand, the proposed softmax + Gaussian NLL loss gives better results empirically. On the other hand, joint training of the CNN-CRF model, assisted by the deformable model, captures structured relationships with pose and deformation awareness. To analyze these effects, Table 4 reports the accuracy of a plain CNN prediction, the 3D deformable model fitting, and the joint CNN-CRF prediction.
Table 4: Ablation study on the 300W test set (%)

Method                                                       NME   AUC   FR
Plain CNN with softmax cross-entropy loss                    2.72  58.6  1.00
Plain CNN with softmax + Gaussian NLL loss (proposed loss)   2.52  62.3  1.00
Separately trained CNN and CRF with proposed loss            2.44  66.0  0.50
3D deformable model fitting                                  1.39  79.8  0.00
Jointly trained CNN-CRF with proposed loss (proposed method) 2.28  67.2  0.50
5 Conclusion
In this paper, we propose a method that combines a CNN with a fully-connected CRF model for facial landmark detection. Compared to state-of-the-art purely deep-learning-based methods, our method explicitly captures the structural relationships between facial landmark locations. Compared to previous methods that combine CNNs with CRFs for human body pose estimation, which learn a fixed pairwise relationship representation shared across test samples and implemented by convolution, our method captures the variations in structural relationships caused by pose and deformation. Moreover, we use a fully-connected model instead of a tree-structured model, obtaining better representational ability. Lastly, compared to previous methods that perform approximate learning (e.g., omitting the partition function) and approximate inference (e.g., mean-field methods), we perform exact learning and inference. Experiments on benchmark datasets demonstrate that the proposed method outperforms existing state-of-the-art methods, particularly under challenging conditions, in both within-dataset and cross-dataset evaluations.
References
[1] 300VW dataset. http://ibug.doc.ic.ac.uk/resources/300-VW/, 2015.

[2] Tadas Baltrušaitis, Peter Robinson, and Louis-Philippe Morency. Continuous conditional neural fields for structured regression. In David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors, Computer Vision – ECCV 2014, pages 593–608, Cham, 2014. Springer International Publishing.

[3] Yoshua Bengio, Yann LeCun, and Donnie Henderson. Globally trained handwritten word recognizer using spatial representation, convolutional neural networks, and hidden Markov models. In J. D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems 6, pages 937–944. Morgan-Kaufmann, 1994.

[4] Adrian Bulat and Georgios Tzimiropoulos. Human pose estimation via convolutional part heatmap regression. In ECCV, 2016.

[5] Adrian Bulat and Georgios Tzimiropoulos. How far are we from solving the 2D & 3D face alignment problem? (And a dataset of 230,000 3D facial landmarks). In International Conference on Computer Vision, 2017.

[6] Xavier P. Burgos-Artizzu, Pietro Perona, and Piotr Dollár. Robust face landmark estimation under occlusion. In Proceedings of the 2013 IEEE International Conference on Computer Vision, ICCV '13, pages 1513–1520, Washington, DC, USA, 2013. IEEE Computer Society.

[7] Xudong Cao, Yichen Wei, Fang Wen, and Jian Sun. Face alignment by explicit shape regression. International Journal of Computer Vision, 107(2):177–190, Apr. 2014.

[8] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2D pose estimation using part affinity fields. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1302–1310, 2017.

[9] Dong Chen, Shaoqing Ren, Yichen Wei, Xudong Cao, and Jian Sun. Joint cascade face detection and alignment. In David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors, Computer Vision – ECCV 2014, pages 109–122, Cham, 2014. Springer International Publishing.

[10] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, Apr. 2018.

[11] Xianjie Chen and Alan Yuille. Articulated pose estimation by a graphical model with image dependent pairwise relations. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1, NIPS'14, pages 1736–1744, Cambridge, MA, USA, 2014. MIT Press.

[12] Yu Chen, Chunhua Shen, Xiu-Shen Wei, Lingqiao Liu, and Jian Yang. Adversarial PoseNet: A structure-aware convolutional network for human pose estimation. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 1221–1230, 2017.

[13] Xiao Chu, Wanli Ouyang, Hongsheng Li, and Xiaogang Wang. CRF-CNN: Modeling structured information in human pose estimation. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 316–324. Curran Associates, Inc., 2016.

[14] Xiao Chu, Wanli Ouyang, Hongsheng Li, and Xiaogang Wang. Structured feature learning for pose estimation. In CVPR, 2016.

[15] T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearance models. In Hans Burkhardt and Bernd Neumann, editors, Computer Vision – ECCV '98, pages 484–498, Berlin, Heidelberg, 1998. Springer Berlin Heidelberg.

[16] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham. Active shape models - their training and application. Computer Vision and Image Understanding, 61(1):38–59, 1995.

[17] Ciprian A. Corneanu, Meysam Madadi, and Sergio Escalera. Deep structure inference network for facial action unit recognition. In ECCV, 2018.

[18] Trinh-Minh-Tri Do and Thierry Artieres. Neural conditional random fields. In Yee Whye Teh and Mike Titterington, editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 177–184, Chia Laguna Resort, Sardinia, Italy, May 2010. PMLR.

[19] David Eigen, Dilip Krishnan, and Rob Fergus. Restoring an image taken through a window covered with dirt or rain. In Proceedings of the 2013 IEEE International Conference on Computer Vision, ICCV 2013, pages 633–640. IEEE, 2013.

[20] Golnaz Ghiasi and Charless C. Fowlkes. Occlusion coherence: Detecting and localizing occluded faces. CoRR, abs/1506.08347, 2015.

[21] R. Girshick, F. Iandola, T. Darrell, and J. Malik. Deformable part models are convolutional neural networks. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 437–446, June 2015.

[22] Eldar Insafutdinov, Leonid Pishchulin, Bjoern Andres, Mykhaylo Andriluka, and Bernt Schiele. DeeperCut: A deeper, stronger, and faster multi-person pose estimation model. In ECCV, 2016.

[23] Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep structured output learning for unconstrained text recognition. Dec. 2014.

[24] V. Jain, J. F. Murray, F. Roth, S. Turaga, V. Zhigulin, K. L. Briggman, M. N. Helmstaedter, W. Denk, and H. S. Seung. Supervised learning of image restoration with convolutional networks. In 2007 IEEE 11th International Conference on Computer Vision, pages 1–8, Oct. 2007.

[25] Amin Jourabloo and Xiaoming Liu. Pose-invariant face alignment via CNN-based dense 3D model fitting. Int. J. Comput. Vision, 124(2):187–203, Sept. 2017.

[26] F. Kahraman, G. Muhitin, S. Darkner, and R. Larsen. An active illumination and appearance model for face alignment. Turkish Journal of Electrical Engineering and Computer Science, 18(4):677–692, 2010.

[27] Neeraj Kumar, Peter N. Belhumeur, and Shree K. Nayar. FaceTracer: A search engine for large collections of images with faces. In The 10th European Conference on Computer Vision (ECCV), Oct. 2008.

[28] G. Lin, C. Shen, A. Hengel, and I. Reid. Efficient piecewise training of deep structured models for semantic segmentation. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3194–3203, Los Alamitos, CA, USA, June 2016. IEEE Computer Society.

[29] Iain Matthews and Simon Baker. Active appearance models revisited. International Journal of Computer Vision, 60(2):135–164, Nov. 2004.

[30] Stephen Milborrow and Fred Nicolls. Locating facial features with an extended active shape model. In Proceedings of the 10th European Conference on Computer Vision: Part IV, ECCV '08, pages 504–513, Berlin, Heidelberg, 2008. Springer-Verlag.

[31] Frederic Morin and Yoshua Bengio. Hierarchical probabilistic neural network language model. In Robert G. Cowell and Zoubin Ghahramani, editors, Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, pages 246–252. Society for Artificial Intelligence and Statistics, 2005.

[32] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In Computer Vision – ECCV 2016, 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part VIII, pages 483–499, 2016.

[33] Feng Ning, D. Delhomme, Y. LeCun, F. Piano, L. Bottou, and P. E. Barbano. Toward automatic phenotyping of developing embryos from videos. Trans. Img. Proc., 14(9):1360–1371, Sept. 2005.

[34] George Papandreou, Tyler Zhu, Nori Kanazawa, Alexander Toshev, Jonathan Tompson, Christoph Bregler, and Kevin P. Murphy. Towards accurate multi-person pose estimation in the wild. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3711–3719, 2017.

[35] Jian Peng, Liefeng Bo, and Jinbo Xu. Conditional neural fields. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1419–1427. Curran Associates, Inc., 2009.

[36] Leonid Pishchulin, Eldar Insafutdinov, Siyu Tang, Bjoern Andres, Mykhaylo Andriluka, Peter V. Gehler, and Bernt Schiele. DeepCut: Joint subset partition and labeling for multi person pose estimation. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4929–4937, 2016.

[37] Vladan Radosavljevic, Slobodan Vucetic, and Zoran Obradovic. Continuous conditional random fields for regression in remote sensing. In Proceedings of the 2010 Conference on ECAI 2010: 19th European Conference on Artificial Intelligence, pages 809–814, Amsterdam, The Netherlands, 2010. IOS Press.

[38] Kosta Ristovski, Vladan Radosavljevic, Slobodan Vucetic, and Zoran Obradovic. Continuous conditional random fields for efficient regression in large fully connected graphs. In AAAI, 2013.

[39] Christos Sagonas, Epameinondas Antonakos, Georgios Tzimiropoulos, Stefanos Zafeiriou, and Maja Pantic. 300 faces in-the-wild challenge. Image Vision Comput., 47(C):3–18, March 2016.

[40] J. Saragih and R. Goecke. A nonlinear discriminative approach to AAM fitting. In 2007 IEEE 11th International Conference on Computer Vision, pages 1–8, 2007.

[41] Jason M. Saragih, Simon Lucey, and Jeffrey F. Cohn. Deformable model fitting by regularized landmark mean-shift. International Journal of Computer Vision, 91(2):200–215, Jan. 2011.

[42] Yi Sun, Xiaogang Wang, and Xiaoou Tang. Deep convolutional network cascade for facial point detection. In 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3476–3483, 2013.

[43] Jonathan Tompson, Ross Goroshin, Arjun Jain, Yann LeCun, and Christoph Bregler. Efficient object localization using convolutional networks. In CVPR, 2015.

[44] Jonathan Tompson, Arjun Jain, Yann LeCun, and Christoph Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1, NIPS'14, pages 1799–1807, Cambridge, MA, USA, 2014. MIT Press.

[45] George Trigeorgis, Patrick Snape, Mihalis A. Nicolaou, Epameinondas Antonakos, and Stefanos Zafeiriou. Mnemonic descent method: A recurrent process applied for end-to-end face alignment. In CVPR, pages 4177–4187. IEEE Computer Society, 2016.

[46] Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. Convolutional pose machines. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27–30, 2016, pages 4724–4732, 2016.

[47] Wayne Wu, Chen Qian, Shuo Yang, Quan Wang, Yici Cai, and Qiang Zhou. Look at boundary: A boundary-aware face alignment algorithm. In CVPR, 2018.

[48] Xuehan Xiong and Fernando De la Torre. Global supervised descent method. In CVPR, pages 2664–2673. IEEE Computer Society, 2015.

[49] K. Yao, B. Peng, G. Zweig, D. Yu, X. Li, and F. Gao. Recurrent conditional random field for language understanding. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4077–4081, May 2014.

[50] Amir Zadeh, Yao Chong Lim, Tadas Baltrusaitis, and Louis-Philippe Morency. Convolutional experts constrained local model for 3D facial landmark detection. In The IEEE International Conference on Computer Vision (ICCV) Workshops, Oct. 2017.

[51] S. Zafeiriou, G. Trigeorgis, G. Chrysos, J. Deng, and J. Shen. The Menpo facial landmark localisation challenge: A step towards the solution. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 2116–2125, July 2017.

[52] Zhanpeng Zhang, Ping Luo, Chen Change Loy, and Xiaoou Tang. Facial landmark detection by deep multi-task learning. In David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors, Computer Vision – ECCV 2014, pages 94–108, Cham, 2014. Springer International Publishing.

[53] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip H. S. Torr. Conditional random fields as recurrent neural networks. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), ICCV '15, pages 1529–1537, Washington, DC, USA, 2015. IEEE Computer Society.

[54] Shizhan Zhu, Cheng Li, Chen Change Loy, and Xiaoou Tang. Face alignment by coarse-to-fine shape searching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4998–5006, 2015.

[55] Shizhan Zhu, Cheng Li, Chen Change Loy, and Xiaoou Tang. Unconstrained face alignment via cascaded compositional learning. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3409–3417, 2016.

[56] Xiangyu Zhu, Zhen Lei, Stan Z. Li, et al. Face alignment in full pose range: A 3D total solution. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
Hierarchical Context Modeling for Video Event Recognition
Xiaoyang Wang, Member, IEEE, and Qiang Ji, Fellow, IEEE
Abstract—Current video event recognition research remains largely target-centered. For real-world surveillance videos, target-centered event recognition faces great challenges due to large intra-class target variation, limited image resolution, and poor detection and tracking results. To mitigate these challenges, we introduce a context-augmented video event recognition approach. Specifically, we explicitly capture different types of contexts from three levels: the image level, the semantic level, and the prior level. At the image level, we introduce two types of contextual features, the appearance context features and the interaction context features, to capture the appearance of context objects and their interactions with the target objects. At the semantic level, we propose a deep model based on a deep Boltzmann machine to learn event object representations and their interactions. At the prior level, we utilize two types of prior-level contexts: scene priming and dynamic cueing. Finally, we introduce a hierarchical context model that systematically integrates the contextual information at the different levels. Through the hierarchical context model, contexts at different levels jointly contribute to event recognition. We evaluate the hierarchical context model for event recognition on benchmark surveillance video datasets. Results show that incorporating contexts at each level improves event recognition performance, and jointly integrating the three levels of contexts through our hierarchical model achieves the best performance.
Index Terms—Hierarchical context model, event recognition, image context, semantic context, priming context
1 INTRODUCTION
VISUAL event recognition is attracting growing interest from both academia and industry [1]. Various approaches have been developed for event recognition, and they can generally be divided into descriptor-based approaches and model-based approaches. Descriptor-based approaches build descriptors or features to capture the local appearance or motion patterns of the target object. These approaches usually employ various descriptors as features and recognize the events using classifiers such as Support Vector Machines (SVMs). Several widely used features include the histogram of oriented gradients (HOG) [2], the Spatio-Temporal Interest Point (STIP) [3], and optical flow [4]. These descriptors generally focus more on the target objects.
Model-based approaches utilize probabilistic graphical models to encode the appearance or motion patterns of the target object. These approaches generally build either static models, including Markov Random Fields (MRFs) [5] and Conditional Random Fields (CRFs) [6], or dynamic models, including Hidden Markov Models (HMMs) [7], Dynamic Bayesian Networks (DBNs) [8], and their variants. These models are used to encode the spatial and temporal interactions, and are combined with local measurements for event recognition. The event recognition is then performed through model inference.
Despite these efforts, surveillance video event recognition remains extremely challenging even with well-constructed descriptors or models for describing the events. The first difficulty arises from the tremendous intra-class variations in events. The same category of events can have huge variations in their observations due to not only large target variabilities in shape, appearance, and motion, but also large environmental variabilities such as viewpoint change, illumination change, and occlusion. Fig. 1 gives examples of the event "loading" with large appearance variations. Second, poor target tracking results and the often low video resolution further aggravate the problem. These challenges force us to rethink the existing data-driven, target-centered event recognition approach and to look for extra information to help mitigate them. Contextual information serves this purpose well.
Contexts in general can be grouped into image-level context [9], [10], [11], semantic-level context [5], [12], [13], and prior-level context [14], [15]. These three levels of contexts have also been investigated for event recognition. For example, at the image level, Wang et al. [10] present a multi-scale spatio-temporal context feature that captures the spatio-temporal interest points in event neighborhoods. At the semantic relationship level, Yao et al. [5] propose a context model that makes human pose estimation and activity recognition mutual context to help each other. At the prior/priming information level, scene priming [14] has been proven effective for event recognition in [16], [17], [18].
However, existing work on contexts generally incorporates one type of context, or context information at one level. There is not much work that simultaneously exploits
• X. Wang is with Nokia Bell Labs, Murray Hill, NJ 07974. E-mail: [email protected].
• Q. Ji is with the Department of Electrical, Computer and Systems Engineering, Rensselaer Polytechnic Institute, Troy, NY 12180. E-mail: [email protected].
Manuscript received 24 July 2015; revised 6 July 2016; accepted 7 Sept. 2016. Date of publication 10 Oct. 2016; date of current version 11 Aug. 2017. Recommended for acceptance by G. Mori. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TPAMI.2016.2616308
1770 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 39, NO. 9, SEPTEMBER 2017
0162-8828 © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
different types of contexts at different levels. Since context exists at different levels and comes in different types, we believe event recognition can benefit greatly if we can simultaneously exploit contexts at different levels and systematically incorporate them into event recognition.
To this goal, we introduce unified hierarchical context modeling that systematically captures contexts at different levels and principally integrates the captured contexts with the image measurements for robust event recognition from surveillance videos. We first propose two types of context features: the appearance context feature and the interaction context feature. These image-level contexts exploit the contextual neighborhood of the event instead of the target as in [19]. Next, we introduce the deep hierarchical context model that integrates the proposed context features with semantic-level context and prior-level context. In the proposed model, the semantic-level context captures the interactions among the entities of an event (e.g., person and object), and the prior-level context includes scene priming and dynamic cueing. Through this hierarchical context model, context at the bottom (feature) level provides diagnostic support for the event, while context at the top (prior) level provides predictive knowledge of the event. The top-down and bottom-up contexts meet at the middle (semantic relationship) level, where the three levels of contexts are systematically integrated to yield a comprehensive characterization of events and their context.
We evaluate the proposed hierarchical context models on the VIRAT 1.0 Ground Dataset, the VIRAT 2.0 Ground Dataset with six and with all events [1], as well as the UT-Interaction Dataset [20]. Experimental results show that the proposed context features improve event recognition performance when combined with the target-centered event feature. Moreover, by capturing three levels of context information, the proposed hierarchical context models markedly improve the recognition accuracy for most of the events.
2 RELATED WORK
We discuss related work in event recognition with contexts, integrating multiple levels of contexts, and deep models in Sections 2.1, 2.2, and 2.3 respectively.
2.1 Event Recognition with Contexts
In recent years, there have been increasing efforts to apply context to event recognition. In an event recognition system, contextual information can exist at different levels, including the image level [9], [10], [11], the semantic relationship level [5], [12], [21], and the prior/priming information level [14], [22]. Below, we briefly summarize the work in each category as well as the latest efforts in integrating contexts from different levels.
At the image level, context features capture information regarding the context, and serve as a necessary addition to traditional event features, which are solely extracted within the event bounding box. Many context features have been introduced for activity/event recognition. Kovashka et al. [9] propose to learn the shapes of space-time feature neighborhoods that are most discriminative for a given category. Wang et al. [10] present a representation that captures the contextual interactions between interest points in both local and neighborhood spatio-temporal domains. Zhu et al. [11] propose both intra-activity and inter-activity context feature descriptors for activity recognition. Escorcia and Niebles [23] propose an action descriptor to capture the evolution of human-object interactions with respect to time.
At the semantic level, context captures relationships among basic elements of events, such as the semantic relationships between actions/activities, objects, human poses, and social roles. Different approaches have been proposed to capture these types of semantic relationships. Gupta et al. [12] present a BN approach for combining action understanding with object perception. Yao et al. [5] propose a Markov Random Field model to encode the mutual context of objects and human poses in human-object interaction activities. Yuan et al. [24] propose to capture human activities by a set of mid-level components obtained by trajectory clustering, and use the spatio-temporal context kernel to encode both local properties and context information. Ramanathan et al. [25] propose a Conditional Random Field model to capture the interactions between the event and the social roles of different persons in a weakly supervised manner. In general, approaches that model semantic relationships as contexts utilize probabilistic graphical models like BN, MRF, or CRF, and capture co-occurrence and mutually exclusive relationships to boost the corresponding recognition performance.
At the prior level, context captures the global spatial or temporal environments within which events may happen. The scene priming used by Torralba et al. [14] and Sudderth et al. [22] demonstrates that the scene provides good prior information for object recognition and object detection. The scene priming information [14] has also proven to be effective for event recognition in [16], [17].
In general, the existing work in context-aided event recognition focuses mostly on context at an individual level. There are few works studying the integration of multiple levels of contexts. We discuss these works in Section 2.2.
2.2 Integrating Multiple Levels of Contexts
Several approaches integrate multiple levels of contexts for action, activity, and event recognition applications. Li et al. [13] try to capture the semantic co-occurrence relationships between event, scene, and objects with a Bayesian topic model for static image event recognition. This model essentially captures semantic level context, and incorporates hierarchical priors in the model. Sun et al. [26] propose to combine the point-level context feature, the intra-trajectory context feature, and the inter-trajectory context feature through a multiple kernel learning model for human action recognition. These multiple levels of context are all at the image level. The approach by Zhu et al. [11] also exploits contexts for surveillance video event recognition. While [11] is similar in spirit, our approach differs from it in the following aspects: 1) we propose a probabilistic deep model to learn the latent representation for semantic and prior level contexts, and to integrate multiple levels of contexts; by contrast, their model is a structural linear
Fig. 1. "Loading" events with large intra-class variation.
WANG AND JI: HIERARCHICAL CONTEXT MODELING FOR VIDEO EVENT RECOGNITION 1771
model. 2) We integrate contexts from all three levels, while their model only integrates contexts at the feature and semantic relationship levels.
Approaches like [27], [28] utilize hierarchical probabilistic models for event and action recognition. However, these two approaches focus on capturing the hierarchy of features, body parts, and human actions, without incorporating context information beyond the target. Also, dynamic topic models, including the Markov clustering topic model by Hospedales et al. [29] and the sequential topic model by Varadarajan et al. [30], have been proposed for video activity modeling. Different from our approach, which integrates context at three levels, these models capture the Bayesian model prior distribution at the prior level, and integrate the temporal transition at the semantic level.
In summary, the existing work on integrating contexts at different levels is limited to two levels. By contrast, we propose a unified model that integrates contexts from all three levels simultaneously. Experiments demonstrate significant performance improvement over existing models on challenging real-world benchmark surveillance videos.
2.3 Deep Models for Event Recognition
Recently, deep models, including probabilistic models like deep belief networks [31] and deep Boltzmann machines (DBMs) [32], [33], as well as non-probabilistic models like stacked auto-encoders [34], [35] and convolutional neural networks (CNNs) [36], have been used in different applications. For action and activity recognition, ConvNets [37], [38], the convolutional gated restricted Boltzmann machine [39], independent subspace analysis [40], and auto-encoder approaches [41] have been developed. However, these deep models are designed as feature learning approaches to learn a deep representation of the target. They are generally data-driven and target-centered, without explicitly incorporating context information. Comparatively, our proposed deep context model utilizes the deep structure to explicitly capture the prior level, semantic level, and image level contexts for event recognition.
There are several differences between our model and popular discriminative deep neural networks (NNs) like CNNs. First, the structure of our model and its connections are specifically designed for modeling events with two interacting entities and carry clear semantics, while the structure of a conventional NN is more general, without much semantic meaning, and tends to be fully connected. Second, our model is probabilistic, while deep models like CNNs are deterministic. Third, compared to deep NNs, our model is rather shallow, consisting of only a few layers. Finally, in terms of learning and inference methods, we employ probabilistic methods such as maximum likelihood during learning and MAP for inference, while a CNN minimizes an empirical loss function during learning.
Among the existing deep models, multi-modal deep learning approaches, including the multi-modal stacked autoencoders [42] and the multi-modal DBM models [43], [44], are related to but different from our proposed approach. The multi-modal deep models aim at learning a joint feature representation over multiple modalities. Comparatively, our approach aims at utilizing the deep structure to intrinsically capture three levels of contexts (i.e., image level, semantic level, and prior level) through the proposed model. Their inputs are multi-modal, while our inputs are only from images. Their goal of learning is to learn a joint middle level representation that captures both modalities, while our goal of learning is to capture contextual information at different levels. Their goal of inference is to use the learned middle level representation for classification, or to infer one modality given the other, while our goal of inference is to infer the most likely event.
There is little work on utilizing deep models to capture contexts for visual recognition tasks. He et al. [45] utilize an RBM model to capture pixel level interactions for image labeling. Zeng et al. [46] build a multi-stage contextual deep model that uses the score map outputs from multi-stage classifiers as contextual information for a pedestrian detection deep model. However, neither of these two models is designed to capture three levels of contexts, and neither is for event recognition.
This paper presents our most recent and comprehensive work on three-level hierarchical context modeling, based on our previous studies in [47], [48]. Different from [47], this paper develops two novel context features at the bottom level, and builds the deep context model instead of the BN based context model in [47] to capture contexts. This work also extends [48] with significantly more details on methodology, more details on the learning and inference methods, greatly extended experimental evaluations with a new dataset, enhanced discussions in the introduction, evaluation, and conclusion, as well as greatly extended related work.
3 HIERARCHICAL CONTEXT MODELING
Fig. 2 illustrates the overall idea of this approach. We propose to model three levels of contexts: image level context, semantic level context, and prior level context. The image level context at the bottom level provides diagnostic support information for the event, while the prior level context at the top level supplies top-down predictive knowledge about the event. The top-down and bottom-up contexts meet at the middle level (semantic level context), where the three levels of contexts are systematically integrated to give a comprehensive characterization of the events and their overall contexts.
3.1 Image Level Context Features
Our image level context can be categorized into two different types: the appearance context feature and the interaction
Fig. 2. The integration of contexts from three levels.
context feature. Here, we first define the event neighborhood based on the event bounding box. From there, both appearance and interaction context features can be extracted.
3.1.1 Definition of Event Neighborhood
For event recognition in a video sequence, the event bounding box is a set of rectangles over the video sequence frames, from frame 1 to frame $T$, with one rectangle assigned to each of the $T$ frames. The event bounding box contains the event objects; for each frame, the event occurs within the corresponding rectangle. The rectangle in frame $t$ can be represented by its upper-left corner point with coordinate $(x_t, y_t)$, its width $w_t$, and its height $h_t$. In this way, the event bounding box can be represented by the set $\{(x_t, y_t, w_t, h_t)\}_{t=1}^{T}$, which includes the rectangles over all $T$ frames.

Given the event bounding box, we can further define the
spatial neighborhoods of the event. As shown in Fig. 3a, for frame $t$, the event bounding box rectangle is extended to a larger rectangle by increasing the width by $\Delta w_t$ on both the left and right sides of the original rectangle. Similarly, the height is increased by $\Delta h_t$ on both the top and bottom sides. The event neighborhood of an event in frame $t$ is then the region within the extended rectangle but outside of the event bounding box, as presented by the shaded region of Fig. 3a.
We use the ratio $\lambda$ to determine the relative scope of the event neighborhood with respect to the event bounding box size:

$$\lambda = \frac{\Delta h_t}{h_t} = \frac{\Delta w_t}{w_t}. \quad (1)$$
Our event spatial neighborhood is then extended to the spatial-temporal neighborhoods over $T$ frames, as shown in the shaded areas of Fig. 3b.
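As a concrete sketch of the neighborhood construction (function and variable names are ours, not the paper's), the extension of Eq. (1) for one frame can be computed as:

```python
def extend_bbox(x, y, w, h, ratio):
    """Grow a frame-t bounding box by the ratio of Eq. (1): the width grows
    by ratio*w on both the left and right sides, and the height by ratio*h
    on the top and bottom. Returns the extended rectangle (x', y', w', h')."""
    dw, dh = ratio * w, ratio * h
    return (x - dw, y - dh, w + 2 * dw, h + 2 * dh)

def neighborhood_area(w, h, ratio):
    """Area of the shaded region: extended rectangle minus the original box."""
    _, _, we, he = extend_bbox(0.0, 0.0, w, h, ratio)
    return we * he - w * h
```

Applying this per frame and taking the union of the shaded regions over the $T$ frames yields the spatio-temporal event neighborhood.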
3.1.2 Appearance Context Feature
The appearance context feature captures the appearance of contextual objects, which are defined as nearby non-target objects located within the event neighborhood. Since our event neighborhood is a direct spatial extension of the event bounding box, it naturally contains both the contextual objects and the background. To efficiently extract and capture the contextual objects from the background, we utilize Scale Invariant Feature Transform (SIFT) descriptors [49] to detect key points in the event neighborhood. Fig. 4a gives an illustration of SIFT key points in the event neighborhood of a frame.
SIFT extracts 128 dimensional scale and orientation invariant local texture features surrounding the detected key points. These features provide an appearance-based description of the contextual objects. For each event sequence, the standard Bag-of-Words (BOW) approach is used to transform the key points into a fixed-length histogram context feature using k-means clustering. Suppose an event sequence contains $M$ key points $p_1, \ldots, p_M$, with each point assigned a cluster label from 1 to $K$. The $K$ dimensional histogram $h$ for this event sequence is

$$h(k) = \#\{p_i : p_i \in \mathrm{bin}(k)\}, \quad k = 1, \ldots, K. \quad (2)$$

This $K$ dimensional histogram captures the appearance of the contextual objects. After normalization, it is used as the appearance context feature.
3.1.3 Interaction Context Feature
The interaction context feature captures the interactions between event objects and contextual objects, as well as among contextual objects. The contextual objects are represented by the SIFT key points extracted in the event neighborhood, as discussed in Section 3.1.2. We use SIFT key points detected within the event bounding box to represent the event objects.

Then, k-means clustering is applied to the 128 dimensional features of key points in the event bounding box and the event neighborhood of all training sequences to generate a joint dictionary matrix $D_I$. With this dictionary, key points inside and outside the event bounding box can be assigned to the same set of words. As in Fig. 5, we use a 2D histogram to capture the co-occurrence frequencies of words inside and outside the event bounding box over frames.
Specifically, for an event sequence with $T$ frames in total, denote the key points inside and outside of the event
Fig. 3. The definition of the event neighborhood, in which the blue rectangle indicates the event bounding box, and the dashed green rectangle is the extended rectangle. The shaded region within the extended rectangle but outside of the event bounding box is the spatial neighborhood. The event neighborhood is the union of the spatial neighborhoods over $T$ frames.
Fig. 4. Extracting the appearance context feature from the event neighborhood. (a) SIFT key points in the neighborhood of each frame; (b) BOW histogram feature.
Fig. 5. Extracting the interaction context feature with a 2D histogram that captures the co-occurrence frequencies of words of event objects and contextual objects.
bounding box in frame $t$ as $p_{it}$ and $q_{jt}$ respectively. The point pair $(p_{it}, q_{jt})$ is counted only when the two points appear in the same frame. In this sense, the $K \times K$ dimensional histogram $h$ for this event sequence is

$$h(k, k') = \#\{(p_{it}, q_{jt}) : p_{it} \in \mathrm{bin}(k),\ q_{jt} \in \mathrm{bin}(k')\}, \quad k, k' = 1, \ldots, K,\ t = 1, \ldots, T. \quad (3)$$
We normalize this 2D histogram so that all its elements sum to 1. After being reshaped into a $K^2$ dimensional vector, it constitutes the interaction context feature we use for event recognition.
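A numpy sketch of the Eq. (3) co-occurrence histogram, again assuming 0-based dictionary labels and taking per-frame lists of labels as input (the interface is our own illustration):

```python
import numpy as np

def interaction_histogram(inside_labels, outside_labels, K):
    """Eq. (3): per-frame co-occurrence counts of dictionary words for key
    points inside vs. outside the event bounding box.

    inside_labels / outside_labels: lists over frames; element t holds the
    cluster labels (0..K-1) of key points detected in frame t.
    """
    H = np.zeros((K, K))
    for p_t, q_t in zip(inside_labels, outside_labels):
        for k in p_t:           # event-object key points in frame t
            for kp in q_t:      # contextual key points in the same frame
                H[k, kp] += 1
    total = H.sum()
    if total > 0:
        H = H / total           # normalize so all elements sum to 1
    return H.reshape(-1)        # K^2-dimensional interaction context feature
```

Note that pairs are formed only within a frame, matching the constraint that $(p_{it}, q_{jt})$ is counted only when both points appear in frame $t$.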
3.2 Model Capturing Semantic Level Context
The semantic level contexts stand for the semantic interactions among event entities. Since both the person and the object are important entities of an event, the semantic level contexts in this work capture the interactions between person and object for an event. For example, the event "person getting into vehicle" is highly correlated with the human state "facing towards vehicle" and the object state "door open"; likewise, the event "person opening a trunk" has strong relations with the human state "at tail of vehicle" and the object state "trunk open". The semantic context modeling should therefore capture the event, person, object, and their interactions. Different from the semantic level context modeling in [47], we learn a set of middle level representations for the person and object entities through the deep structure, and capture the interactions between the event and the learned middle level representations. Existing approaches like [47] instead utilize a single discrete variable to describe the person and object states.
Suppose we have $K$ types of events to recognize. We use a $K$ dimensional vector $y$ with binary units to represent the event label through the 1-of-$K$ coding scheme, in which an event belonging to class $C_k$ is represented by a vector with element $k$ set to "1" and all remaining elements set to "0". We use binary hidden units $h_p$ and $h_o$ to represent the latent states of the person and object. We treat both $h_p$ and $h_o$ as hidden units so that their optimal states can be learned automatically as latent topics during training.
3.2.1 Semantic Context Modeling
The model structure shown in Fig. 6 is used to capture the semantic level contexts. In this structure, the event label vector $y$ lies in the top layer, and the hidden units $h_p$ for the person and $h_o$ for the object both lie in the bottom layer. Another set of hidden units $h_r$ in the intermediate layer is incorporated to capture the interactions between person and object. Here, every single hidden unit in $h_r$ is connected to all the units in $h_p$, $h_o$, and $y$. In this way, the global interactions among units from the person, object, and event label are captured through the intermediate hidden layer $h_r$.
3.2.2 Combining with Observations
The observation vectors for the event, person, and object can further be added to the semantic level context model of Fig. 6, resulting in the context model shown in Fig. 7. The vectors $p$ and $o$ denote the person and object observation vectors as continuous STIP features. Both the person observation vector $p$ and the object observation vector $o$ are connected only to their corresponding hidden units $h_p$ and $h_o$ respectively. In this way, the middle level representations for the person and object can be obtained from their corresponding observations. In addition, the event observation $e$, as the STIP event feature, and the context feature $c$ introduced in Section 3.1 are directly connected to the event label $y$.
The model in Fig. 7 combines semantic contexts at the middle level with the context feature $c$ at the bottom level. This model is called the Model-BM context model, and is compared to other models in the experiment section.
3.3 Model Capturing Prior Level Context
The prior level contexts capture the prior information of events. They reflect the related high level context that determines the likelihood of the occurrence of certain events. For this research, we utilize two types of prior contexts, scene priming [14] and dynamic cueing, though the model is generic enough to apply to other high level priming contexts.
3.3.1 Scene Priming
The scene priming context refers to the scene information obtained from the global image. It provides an environmental context, such as location (e.g., parking lot, shop entrance), within which events occur. Hence, it can serve as a spatial prior that dictates whether certain events would occur.
To capture the scene context as a prior context, we utilize the hidden units $h_s$ to represent different possible scene states. As shown in Fig. 8, each hidden unit in $h_s$
Fig. 6. The model capturing semantic level contexts, where $h_p$ and $h_o$ are the first layer hidden units representing the person and object middle level representations, $h_r$ is the second layer of hidden units capturing interactions, and $y$ stands for the event class label.
Fig. 7. The model combining semantic level contexts with observations, where vectors $p$ and $o$ denote the person and object observations, and $e$ and $c$ represent the event and context observations respectively.
Fig. 8. The model capturing prior level contexts, where $s$ represents the global scene observation, $m_{-1}$ denotes the recognition measurement of the previous event, $h_s$ denotes the hidden units representing different possible scene states, $y_{-1}$ denotes the previous event, and $y$ stands for the current event.
is connected to all the elements within the event label vector $y$. In this way, the state of the scene has a direct impact on the event label. The observation vector $s$ represents the GIST feature extracted from the global scene image. Each element in $s$ is connected to each unit in $h_s$ to provide a global observation for the hidden scene states.
3.3.2 Dynamic Cueing
The second prior level context is dynamic cueing. It provides temporal support as to what event is likely to happen given the events that have happened up to now. The event at the current time is influenced by events at previous times. For example, the event "loading/unloading a vehicle" typically precedes the event "closing a trunk". Information on the previous events thus provides a beneficial cue for recognizing the current event, so dynamic context can serve as a temporal prior on the current event.

We capture dynamic context through Markov chain modeling, which is especially useful for a series of happenings with a temporal order. For example, people typically "get out of the vehicle" first, then "open the trunk", "load the vehicle", and finally "close the trunk". In this work, the previous event is represented by the $K$ dimensional binary vector $y_{-1}$ in the 1-of-$K$ coding scheme. Moreover, $y_{-1}$ is further connected to the previous event measurement vector $m_{-1}$, which denotes the recognition measurement of the previous event. As shown in Fig. 8, both $h_s$ and $y_{-1}$ provide top-down prior information for the inference of the current event.
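As a toy illustration of such a Markov temporal prior (the transition values below are invented purely for illustration; in the model, the coupling between $y_{-1}$ and $y$ is learned jointly with the other parameters):

```python
import numpy as np

# Hypothetical transition matrix over three events (row = previous event,
# column = current event). Values are made up for illustration only.
T = np.array([[0.1, 0.7, 0.2],   # get-out-of-vehicle -> likely open-trunk
              [0.2, 0.1, 0.7],   # open-trunk -> likely close-trunk
              [0.5, 0.3, 0.2]])  # close-trunk

def temporal_prior(m_prev):
    """Soft temporal prior on the current event, given a (possibly noisy)
    measurement distribution m_prev over the previous event."""
    return m_prev @ T
```

With a confident previous-event measurement, the prior is simply the corresponding row of the transition matrix; with an uncertain measurement, the rows are mixed accordingly.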
3.4 Integrating Contexts in Three Levels
Given the contexts at three levels as introduced previously, we now discuss the formulation of the proposed deep hierarchical context model for integrating contexts from all three levels. We first present the deep model, and then discuss its learning and inference.
3.4.1 Deep Hierarchical Context Model
We introduce the deep context model to systematically incorporate three levels of contexts. As shown in Fig. 9, the model consists of six layers. From bottom to top, the first layer at the bottom includes the target and contextual measurement vectors $p$, $o$, $e$, and $c$, which are visible in both learning and testing. The vectors $p$ and $o$ denote the person and object observations, and the vectors $e$ and $c$ denote the event and context features. The second layer includes binary hidden units $h_p$ and $h_o$ as middle level representations for the person and object. In the third layer, the binary hidden units $h_r$ are incorporated as an intermediate layer to capture the interactions between the event, person, and object. The fourth layer, denoted by vector $y$, represents the event label through the 1-of-$K$ coding scheme. In the top two layers, the hidden units $h_s$ represent the scene states, and the vector $s$ is the scene observation. Also, $y_{-1}$ represents the previous event state, with its measurement $m_{-1}$. This model is essentially the combination of the Model-BM and the prior model in Figs. 7 and 8.
The proposed model is an undirected model. With the structure in Fig. 9, the model energy function is:

$$
\begin{aligned}
E(y, h_r, h_p, h_o, p, o, e, c, y_{-1}, m_{-1}, h_s, s; \theta) =\ & -\tilde{p}^{\top} W^1 h_p - \tilde{o}^{\top} W^2 h_o - h_p^{\top} Q^1 h_r - h_o^{\top} Q^2 h_r - h_r^{\top} L y \\
& - \tilde{e}^{\top} U^1 y - \tilde{c}^{\top} U^2 y - y_{-1}^{\top} D y - h_s^{\top} T y - m_{-1}^{\top} F y_{-1} \\
& - \tilde{s}^{\top} G h_s - b_{h_p}^{\top} h_p - b_{h_o}^{\top} h_o - b_{h_r}^{\top} h_r - b_y^{\top} y - b_{h_s}^{\top} h_s \\
& - b_{y_{-1}}^{\top} y_{-1} - b_{m_{-1}}^{\top} m_{-1} \\
& + \sum_i \frac{(p_i - b_{p_i})^2}{2\sigma_{p_i}^2} + \sum_j \frac{(o_j - b_{o_j})^2}{2\sigma_{o_j}^2} + \sum_k \frac{(e_k - b_{e_k})^2}{2\sigma_{e_k}^2} + \sum_{i'} \frac{(c_{i'} - b_{c_{i'}})^2}{2\sigma_{c_{i'}}^2} + \sum_{j'} \frac{(s_{j'} - b_{s_{j'}})^2}{2\sigma_{s_{j'}}^2},
\end{aligned} \quad (4)
$$

where $W^1$, $W^2$, $Q^1$, $Q^2$, $L$, $U^1$, $U^2$, $T$, $D$, $F$, and $G$ are the weight matrices between the groups of visible or hidden units; $b_{h_p}$, $b_{h_o}$, $b_{h_r}$, $b_y$, $b_{h_s}$, $b_{y_{-1}}$, and $b_{m_{-1}}$ are the bias terms for the discrete units; and $b_p$, $\sigma_p$, $b_o$, $\sigma_o$, $b_e$, $\sigma_e$, $b_c$, $\sigma_c$, $b_s$, and $\sigma_s$ are the parameters for the continuous units, similar to those in a Gaussian-Bernoulli RBM. We use $\theta$ to represent the whole model parameter set, which includes all the parameters in the weight matrices and the bias terms.

For convenience, Eq. (4) uses the vectors $\tilde{p}, \tilde{o}, \tilde{e}, \tilde{c}$, and $\tilde{s}$, which are the original observation vectors $p, o, e, c, s$ divided element-wise by $\sigma_p, \sigma_o, \sigma_e, \sigma_c$, and $\sigma_s$ respectively; for instance, $\tilde{p}_i = p_i / \sigma_{p_i}$.

Given the energy function, the joint probability of all the variables $y$, $h_r$, $h_p$, $h_o$, $p$, $o$, $e$, $c$, $y_{-1}$, $m_{-1}$, $h_s$, and $s$ can be written as

$$P(y, h_r, h_p, h_o, p, o, e, c, y_{-1}, m_{-1}, h_s, s; \theta) = \frac{1}{Z(\theta)} \exp\big(-E(y, h_r, h_p, h_o, p, o, e, c, y_{-1}, m_{-1}, h_s, s; \theta)\big). \quad (5)$$
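For concreteness, a numpy sketch that evaluates the energy of Eq. (4) for one joint configuration. The dictionary keys mirror the symbols of Eq. (4) but the naming convention is ours; this is an illustrative sketch, not the authors' implementation:

```python
import numpy as np

def energy(v, h, params):
    """Evaluate E of Eq. (4) for one configuration of all variables.

    v: observation vectors p, o, e, c, s, m1 (= m_{-1});
    h: discrete vectors hp, ho, hr, y, y1 (= y_{-1}), hs;
    params: weight matrices W1, W2, Q1, Q2, L, U1, U2, D, T, F, G,
            biases b_* for all units, and sig_* for the continuous units.
    """
    P = params
    # standardized continuous observations, e.g. p_tilde_i = p_i / sigma_p_i
    pt, ot = v['p'] / P['sig_p'], v['o'] / P['sig_o']
    et, ct, st = v['e'] / P['sig_e'], v['c'] / P['sig_c'], v['s'] / P['sig_s']
    E = (-pt @ P['W1'] @ h['hp'] - ot @ P['W2'] @ h['ho']
         - h['hp'] @ P['Q1'] @ h['hr'] - h['ho'] @ P['Q2'] @ h['hr']
         - h['hr'] @ P['L'] @ h['y']
         - et @ P['U1'] @ h['y'] - ct @ P['U2'] @ h['y']
         - h['y1'] @ P['D'] @ h['y'] - h['hs'] @ P['T'] @ h['y']
         - v['m1'] @ P['F'] @ h['y1'] - st @ P['G'] @ h['hs'])
    # linear bias terms for the discrete units
    for name in ['hp', 'ho', 'hr', 'y', 'hs', 'y1']:
        E -= P['b_' + name] @ h[name]
    E -= P['b_m1'] @ v['m1']
    # quadratic (Gaussian) terms for the continuous units
    for x, b, s in [(v['p'], P['b_p'], P['sig_p']), (v['o'], P['b_o'], P['sig_o']),
                    (v['e'], P['b_e'], P['sig_e']), (v['c'], P['b_c'], P['sig_c']),
                    (v['s'], P['b_s'], P['sig_s'])]:
        E += np.sum((x - b) ** 2 / (2 * s ** 2))
    return E
```

With all weights and biases set to zero and unit variances, only the Gaussian terms remain, which is a convenient sanity check.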
3.4.2 Model Learning
The model learning process learns the model parameter set $\theta$, which includes all the weight matrices and bias terms in Eq. (4). With training data $\{y_i, y_{-1,i}, p_i, o_i, e_i, c_i, s_i, m_{-1,i}\}_{i=1}^N$, these parameters are learned by maximizing the log likelihood:

$$\theta^* = \arg\max_{\theta} \mathcal{L}(\theta), \qquad \mathcal{L}(\theta) = \sum_{i=1}^{N} \log P(y_i, y_{-1,i}, p_i, o_i, e_i, c_i, s_i, m_{-1,i}; \theta). \quad (6)$$
In the remainder of this section, we refer to all the hidden units in the model as $h$, and to all the
Fig. 9. The proposed deep context model integrating image level, semantic level, and prior level contexts, where the shaded units represent the hidden units, the striped units represent the observed units that are available in both training and testing, and the units in grid are the event label units, which are available in training but not in testing.
visible units in the model as $v$. The optimization in Eq. (6) can be solved via stochastic gradient ascent, in which the gradients are calculated as [32]:

$$\frac{\partial \mathcal{L}(\theta)}{\partial \theta} = \left\langle \frac{\partial E}{\partial \theta} \right\rangle_{P_{\mathrm{data}}} - \left\langle \frac{\partial E}{\partial \theta} \right\rangle_{P_{\mathrm{model}}}, \quad (7)$$

where $E$ is the model energy function defined in Eq. (4). The operator $\langle \cdot \rangle_{P_{\mathrm{data}}}$ denotes the expectation with respect to the data distribution $P_{\mathrm{data}}(h, v) = P(h|v) P_{\mathrm{data}}(v)$, where $P_{\mathrm{data}}(v) = \frac{1}{N}\sum_i \delta(v - v_i)$ is the empirical distribution. The operator $\langle \cdot \rangle_{P_{\mathrm{model}}}$ is the expectation with respect to the model distribution defined in Eq. (5). The expectation $\langle \cdot \rangle_{P_{\mathrm{data}}}$ is usually called the data-dependent expectation, and $\langle \cdot \rangle_{P_{\mathrm{model}}}$ is usually called the model's expectation.

Since the computational cost of directly calculating both expectations is exponential in the number of hidden units, exact learning is intractable. Hence, we learn the proposed model with the approximate learning method of [32], in which mean-field variational inference is used to estimate the data-dependent expectation, and Markov chain Monte Carlo (MCMC) based stochastic approximation is used to estimate the model's expectation.
For pre-training, the model in Fig. 9 is first divided into sub-models. Layer-wise pre-training [34] is then performed independently for each sub-model. Specifically, pre-training starts with training the RBM formed by the pair of vectors $p$ and $h_p$, as well as the RBM formed by the pair of vectors $o$ and $h_o$. Once trained, the two RBMs can be sampled to obtain samples of $h_p$ and $h_o$, which are then used jointly with the ground truth $y$ to train the sub-model represented by Fig. 6, with $h_r$ as hidden units. Next, we train the RBM formed by the pair of vectors $s$ and $h_s$. Once trained, samples of $h_s$ from this RBM can further be used to train the sub-model consisting of $h_s$ and $y$. The remaining sub-models consist of visible layers only; we learn them by individually maximizing the log likelihood of the connected visible vector pairs (e.g., $y$ and $e$).
With the model parameters initialized through pre-training, we then learn the model parameters jointly by solving the optimization in Eq. (6) through stochastic gradient ascent. As shown in Eq. (7), the gradient calculation requires estimating the data-dependent expectation and the model's expectation. For the data-dependent expectation, we replace the true posterior $P(h|v; \theta)$ with the variational posterior $Q(h|v; \mu)$. As in Eq. (8), we use the mean-field approximation, which assumes all the hidden units are fully factorized, ignoring the dependencies between them:

$$Q(h|v; \mu) = \left(\prod_i q(h_{p_i}|v)\right)\left(\prod_j q(h_{o_j}|v)\right)\left(\prod_k q(h_{r_k}|v)\right)\left(\prod_g q(h_{s_g}|v)\right), \quad (8)$$
where $\mu = \{\mu_p, \mu_o, \mu_r, \mu_s\}$ are the mean-field variational parameters with $q(h_i = 1) = \mu_i$. The estimation then finds the parameters $\mu$ that maximize the variational lower bound of the log likelihood in Eq. (6) with $\theta$ fixed. In this estimation, given $y_i$, the variational parameters $\mu_s$ at the top are independent of the remaining variational parameters at the bottom, so the parameters for the top and bottom can be estimated separately. The method iteratively updates $\mu$ for the different hidden units through the following mean-field fixed-point equations:

$$
\begin{aligned}
\mu_{p_i} &\leftarrow \sigma\Big(\textstyle\sum_j W^1_{ji}\, p_j/\sigma_{p_j} + \sum_k Q^1_{ik}\,\mu_{r_k} + b_{h_{p_i}}\Big) \\
\mu_{o_j} &\leftarrow \sigma\Big(\textstyle\sum_i W^2_{ij}\, o_i/\sigma_{o_i} + \sum_k Q^2_{jk}\,\mu_{r_k} + b_{h_{o_j}}\Big) \\
\mu_{r_k} &\leftarrow \sigma\Big(\textstyle\sum_i Q^1_{ik}\,\mu_{p_i} + \sum_j Q^2_{jk}\,\mu_{o_j} + \sum_{k'} L_{kk'}\, y_{k'} + b_{h_{r_k}}\Big) \\
\mu_{s_g} &\leftarrow \sigma\Big(\textstyle\sum_j G_{jg}\, s_j/\sigma_{s_j} + \sum_k T_{gk}\, y_k + b_{h_{s_g}}\Big),
\end{aligned} \quad (9)
$$
where $\sigma(x) = 1/(1 + \exp(-x))$ is the logistic function. The estimated variational parameters are then used to calculate the data-dependent expectation, as in the examples shown below:

$$\left\langle \frac{\partial E}{\partial W^1} \right\rangle_{P_{\mathrm{data}}} = \frac{1}{N}\sum_{n=1}^{N} \tilde{p}\, \mu_p^{\top}, \qquad \left\langle \frac{\partial E}{\partial Q^1} \right\rangle_{P_{\mathrm{data}}} = \frac{1}{N}\sum_{n=1}^{N} \mu_p\, \mu_r^{\top}.$$
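The fixed-point iteration of Eq. (9) can be sketched in numpy as follows. The parameter-dictionary keys, the initialization of $\mu_r$ at 0.5, and the fixed iteration count are our own assumptions; the paper only specifies the update equations:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field_update(obs, y, P, n_iter=50):
    """Iterate the Eq. (9) fixed-point updates given the event label y
    (available at training time) and the standardized observations.

    obs: dict with p/sigma_p, o/sigma_o, s/sigma_s under keys
         'pt', 'ot', 'st'. Weight-matrix names mirror Eq. (4)/(9).
    Returns the variational parameters mu_p, mu_o, mu_r, mu_s.
    """
    mu_r = np.full(P['Q1'].shape[1], 0.5)   # uninformative initialization
    for _ in range(n_iter):
        mu_p = sigmoid(obs['pt'] @ P['W1'] + P['Q1'] @ mu_r + P['b_hp'])
        mu_o = sigmoid(obs['ot'] @ P['W2'] + P['Q2'] @ mu_r + P['b_ho'])
        mu_r = sigmoid(P['Q1'].T @ mu_p + P['Q2'].T @ mu_o
                       + P['L'] @ y + P['b_hr'])
    # given y, mu_s decouples from the bottom parameters (no iteration needed)
    mu_s = sigmoid(obs['st'] @ P['G'] + P['T'] @ y + P['b_hs'])
    return mu_p, mu_o, mu_r, mu_s
```

In practice the loop would be run until the updates change by less than a tolerance rather than for a fixed number of sweeps.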
For estimating the model's expectation, we use the MCMC based stochastic approximation procedure. It first randomly initializes $M$ Markov chains with samples $y^{0,j}$, $h_r^{0,j}$, $h_p^{0,j}$, $h_o^{0,j}$, $p^{0,j}$, $o^{0,j}$, $e^{0,j}$, $c^{0,j}$, $y_{-1}^{0,j}$, $m_{-1}^{0,j}$, $h_s^{0,j}$, $s^{0,j}$. For each Markov chain $j$ from 1 to $M$, the $(t+1)$th step samples $y^{t+1,j}$, $h_r^{t+1,j}$, $h_p^{t+1,j}$, $h_o^{t+1,j}$, $p^{t+1,j}$, $o^{t+1,j}$, $e^{t+1,j}$, $c^{t+1,j}$, $y_{-1}^{t+1,j}$, $m_{-1}^{t+1,j}$, $h_s^{t+1,j}$, $s^{t+1,j}$ given the samples from the $t$th step by running a Gibbs sampler. The $M$ sampled Markov particles are then used to estimate the model's expectation in the model optimization, as in the examples shown below:

$$\left\langle \frac{\partial E}{\partial W^1} \right\rangle_{P_{\mathrm{model}}} = \frac{1}{M}\sum_{j=1}^{M} \tilde{p}^{t+1,j} \big(h_p^{t+1,j}\big)^{\top}, \qquad \left\langle \frac{\partial E}{\partial Q^1} \right\rangle_{P_{\mathrm{model}}} = \frac{1}{M}\sum_{j=1}^{M} h_p^{t+1,j} \big(h_r^{t+1,j}\big)^{\top}.$$
The learning procedure of the proposed model is summarized in Algorithm 1.
With the mean-field based variational inference and the MCMC based stochastic approximation approaches, the computational complexity of learning is $O(RNM^2)$, where $R$ is the number of learning iterations, $N$ is the number of training samples, and $M$ is the number of nodes in the model.
3.4.3 Model Inference
Given a query event sequence with event observation vector $e$, context observation vector $c$, person observation vector $p$, object observation vector $o$, global scene observation vector $s$, and previous event measurement $m_{-1}$, the
model can recognize the event category $k^*$ by maximizing its posterior probability given all the observation vectors:

$$k^* = \arg\max_k P(y_k = 1 \mid e, c, p, o, s, m_{-1}; \theta). \quad (10)$$
We emphasize that $y_{-1}$, the previous event label, is available during model learning but not during testing. However, its measurement $m_{-1}$ is available during testing. In model inference, $m_{-1}$ can influence the current event $y$ through $y_{-1}$, thereby providing diagnostic support for $y$. We do not update the decision on the previous event $y_{-1}$ during testing.
Algorithm 1. Learning of the Proposed Model
Data: {y_i, y_{-1,i}, p_i, o_i, e_i, c_i, s_i, m_{-1,i}}_{i=1}^{N} as the training set with N training samples, and M as the number of Markov chains.
Result: model parameter set θ
Initialize θ^0 with layer-wise pre-training of RBMs;
Initialize Markov chains {v^{0,j}, h^{0,j}}_{j=1}^{M} randomly;
for t = 0 → T do
  // Variational inference
  for each training sample i = 1 → N do
    Run Eq. (9) updates till convergence;
  end
  // MCMC stochastic approximation
  for each chain j = 1 → M do
    Sample {v^{t+1,j}, h^{t+1,j}} given {v^{t,j}, h^{t,j}};
  end
  Update parameters θ^{t+1} = θ^t + η_t {∂L(θ)/∂θ};
end
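Algorithm 1's stochastic-approximation update can be illustrated on a single RBM block with persistent chains. This is a minimal sketch (toy binary data, one Gibbs sweep per iteration, no bias terms, all sizes illustrative), not the full multi-layer learning procedure with variational inference.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Toy binary data standing in for the visible units of one RBM block.
V = (rng.random((100, 8)) < 0.5).astype(float)        # N x nv
nv, nh, M, eta = 8, 4, 10, 0.05                       # illustrative sizes
W = 0.01 * rng.standard_normal((nv, nh))

# Persistent chains: initialize M fantasy particles randomly.
v_chain = (rng.random((M, nv)) < 0.5).astype(float)

for t in range(50):                                   # learning iterations (R in the text)
    # Data-dependent (positive-phase) expectation.
    h_data = sigmoid(V @ W)
    pos = V.T @ h_data / len(V)
    # Model expectation via one Gibbs sweep on each persistent chain.
    h_chain = (rng.random((M, nh)) < sigmoid(v_chain @ W)).astype(float)
    v_chain = (rng.random((M, nv)) < sigmoid(h_chain @ W.T)).astype(float)
    h_model = sigmoid(v_chain @ W)
    neg = v_chain.T @ h_model / M
    # Stochastic approximation step: theta <- theta + eta * gradient estimate.
    W += eta * (pos - neg)
```

The outer loop over t, the persistent chains, and the gradient step mirror the structure of Algorithm 1; the variational inner loop is replaced here by the exact positive phase available for a single RBM.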
Computing this posterior probability requires marginalizing over all the hidden units in h_p, h_o, h_r, and h_s. Its exact calculation is intractable. However, the inference can be efficiently solved using the Gibbs sampling method. Given the observation vectors e, c, p, o, s, and m_{-1} during testing, Gibbs sampling first randomly initializes h_p, h_o, and y, and then iteratively samples h_r, h_p, h_o, h_s, y_{-1} and y given their adjacent units. During this process, the hidden units h_r, h_p, h_o, and h_s are actively involved in each iteration of Gibbs sampling. After the burn-in period, samples of y are collected and used to approximate the marginal probability of y through their frequency in the Gibbs samples.
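The frequency-based marginal estimate can be sketched as follows. The `gibbs_step` below is a hypothetical stand-in for the model's true conditional sweep, and K, the transition probabilities, and the chain length are all illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
K = 6                        # number of event classes (illustrative)
T, burn_in = 2000, 500       # chain length and burn-in (illustrative)

def gibbs_step(y, rng):
    # Stand-in for one Gibbs sweep: in the real model this would resample
    # h_r, h_p, h_o, h_s, y_-1 and y given their adjacent units.
    # Hypothetical conditional: sticky toward the current state, pulled toward class 2.
    probs = np.full(K, 0.05)
    probs[y] += 0.25
    probs[2] += 0.45
    return rng.choice(K, p=probs / probs.sum())

y = rng.integers(K)          # random initialization of y
samples = []
for t in range(T):
    y = gibbs_step(y, rng)
    if t >= burn_in:         # keep only post burn-in samples
        samples.append(y)

# Marginal P(y_k = 1 | observations) approximated by sample frequency.
p_y = np.bincount(samples, minlength=K) / len(samples)
k_star = int(np.argmax(p_y))
```

The final argmax over the estimated marginal implements the decision rule of Eq. (10).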
The detailed inference algorithm is presented in Algorithm 2. The computational complexity of the model inference is O(CTM^2), where C is the number of Markov chains, T is the chain length, and M is the number of nodes in the model.
4 EXPERIMENTS
We demonstrate the effectiveness of the proposed approach on four event recognition benchmark datasets. The first two datasets are the VIRAT 1.0 Ground Dataset and the VIRAT 2.0 Ground Dataset [1], both with six types of person-vehicle interaction events: Loading a Vehicle (LAV), Unloading a Vehicle (UAV), Opening a Trunk (OAT), Closing a Trunk (CAT), Getting into a Vehicle (GIV), and Getting out of a Vehicle (GOV). These two datasets are state-of-the-art real-world surveillance video datasets focusing on events that involve interactions between persons and objects. The VIRAT 1.0 Ground Dataset includes around 3 hours of videos recorded from different school parking lots. The VIRAT 2.0 Ground Dataset includes over 8 hours of surveillance videos from school parking lots, shop entrances, outdoor dining areas, and construction sites. For both datasets, we use half of the event sequences for training, and the rest of the sequences for testing.
Algorithm 2. Inference of P(y | e, c, p, o, s, m_{-1})
Data: the input observation vectors e, c, p, o, s, and m_{-1} for the query event sequence; model parameter set θ
Result: P(y_k = 1 | e, c, p, o, s, m_{-1}, θ) for k = 1, ..., K
for chain = 1 → C do
  Randomly initialize h_p^0, h_o^0, and y^0;
  for t = 0 → T do
    Sample h_r^t given h_p^t, h_o^t, and y^t;
    Sample h_p^{t+1} given h_r^t and p;
    Sample h_o^{t+1} given h_r^t and o;
    Sample h_s^{t+1} given y^t and s;
    Sample y_{-1}^{t+1} given y^t and m_{-1};
    Sample y^{t+1} given h_r^t, h_s^{t+1}, y_{-1}^{t+1}, e and c;
  end
end
Collect the last T' samples of y from each chain;
Calculate P(y | e, c, p, o, s, m_{-1}) with the samples;
We further experiment with the Full VIRAT 2.0 Ground Dataset as the third dataset. Besides the six person-object interaction events included in the VIRAT 1.0 and 2.0 Ground Datasets, this dataset includes five types of additional events from VIRAT [1]: Gesturing (GST), Carrying an Object (CAO), Running (RUN), Entering a Facility (EAF), and Exiting a Facility (XAF). Here, half of these event sequences are used for training, and the remaining sequences are used for testing.
The fourth dataset is the UT-Interaction Dataset [20]. This is a surveillance video dataset with person-person interaction events. It consists of six person-person interaction events: "hand shaking, hugging, kicking, pointing, punching, and pushing" [20]. The dataset includes two sets, each with 10 video sequences of around 1 minute in length. To compare with state-of-the-art methods, we use the standard 10-fold leave-one-out cross validation for evaluation on set 1.
In this work, we focus on event recognition from pre-segmented video sequences with the event bounding boxes provided by the dataset. We further assume there is only one event per segment to recognize. Hence, the average recognition accuracy over all event categories, the recognition accuracy for each event, and the confusion matrices can effectively reflect the recognition performance. With the provided event bounding box, we further obtain the person and object (vehicle) bounding boxes through detection in the event region. The descriptors for person and object (vehicle) are then extracted within their corresponding bounding boxes. Erroneous person and object (vehicle) detections would have negative effects on event recognition. However, the holistic use of contexts at all three levels can compensate for the errors in person and object detection.
We implement the proposed deep hierarchical context model based upon the deep Boltzmann machine library
WANG AND JI: HIERARCHICAL CONTEXT MODELING FOR VIDEO EVENT RECOGNITION 1777
provided by Salakhutdinov [50]. Additionally, the RBF kernel SVM is used repeatedly in the following experiments. In each experiment, the optimal C and γ hyperparameters of the SVM are determined by cross validation on the training set.
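This hyperparameter selection step can be sketched with a standard cross-validated grid search. The parameter grids, the toy features, and the labels below are illustrative assumptions, not the actual values or data used in the paper.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(4)

# Toy stand-ins for BOW event feature vectors and their event labels.
X = rng.random((60, 20))
y = np.repeat(np.arange(6), 10)            # six event classes, 10 samples each

# Hypothetical search grids for the RBF-SVM hyperparameters C and gamma.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)
search.fit(X, y)

best = search.best_params_                 # cross-validated C and gamma
```

With random features the selected values are meaningless; on real BOW features the same procedure yields the per-experiment optimal C and γ described in the text.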
4.1 Experiments on the Context Features
The effectiveness of the proposed context features discussed in Section 3.1 is demonstrated by experiments on the VIRAT 2.0 Ground Dataset [1] with six types of events: LAV, UAV, OAT, CAT, GIV, and GOV.
4.1.1 Baseline Event Features
The baseline event features are traditional event features extracted only within the event bounding box. Here, we utilize the most widely used STIP [51] feature. A 500-word BOW model is used to transform the STIP points into the baseline event feature vector. It reaches 41.74 percent average accuracy with an RBF kernel SVM, as shown in Table 1.
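As a sketch of this encoding step, the following hard-assignment BOW histogram assumes 162-dimensional STIP-style descriptors (the usual HOG+HOF size) and a 500-word vocabulary. The random descriptors and the pre-built vocabulary are illustrative stand-ins; the paper's pipeline learns the vocabulary from training data.

```python
import numpy as np

rng = np.random.default_rng(3)

def bow_encode(descriptors, vocabulary):
    """Hard-assign each local descriptor to its nearest visual word and
    return an L1-normalized histogram over the vocabulary."""
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / max(hist.sum(), 1.0)

vocab = rng.random((500, 162))     # hypothetical 500-word visual vocabulary
stips = rng.random((80, 162))      # descriptors detected in one event clip
feature = bow_encode(stips, vocab)

assert feature.shape == (500,)
```

The resulting 500-dimensional histogram is the per-clip event feature vector fed to the SVM.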
4.1.2 Performance of Appearance Context Features
The appearance context feature is evaluated with different neighborhood sizes determined by the relative neighborhood size ratio. The results for the appearance context feature used alone and combined with the baseline event features are given in Table 1. We always use the RBF kernel SVM classifier, with the coefficients C and γ cross-validated to their optimal values on the training set each time. The BOW vocabulary size is set to 100.
From Table 1, we can see that using the appearance context features alone does not perform as well as the baseline event features for event recognition. However, combining the baseline event features with the appearance context features can always improve the recognition performance. Table 1 also indicates that the ratio 0.35 performs best: it significantly improves the baseline method, by about 6 percent.
4.1.3 Performance of Interaction Context Features
We also evaluate the performance of the interaction context features with the relative neighborhood size set to 0.35, which is the optimal neighborhood size according to Table 1. We test the feature performance with different vocabulary sizes K to find the best tradeoff between the vocabulary size and the recognition performance. Table 2 gives the results of this experiment.
Results in Table 2 tell us two things. First, the interaction context feature generally performs worse than the baseline event feature when used alone. However, the combination of the baseline event feature with the interaction context feature generally improves the performance, by up to 6 percent over the baseline method. Second, the vocabulary size K = 10 gives the best performance for the combined feature.
4.1.4 Performance of Combined Features
We further test combining the baseline event feature with both types of context features, where we choose the neighborhood size ratio 0.35 and the vocabulary size K = 10 for the interaction context feature. The final results are shown in Table 3.
From Table 3, we can see that combining either the appearance or the interaction context feature already improves the performance of the baseline event feature for event recognition. Combining both context features with the baseline event feature further improves the recognition accuracy. In all, combining our proposed context features can significantly improve the event recognition performance, by over 10 percent.
4.2 Experiments with Proposed Model
After discussing the performances of the proposed context features, we proceed to demonstrate the effectiveness of the proposed deep hierarchical context model that integrates three levels of contexts. For this model, h_p, h_o, h_r, and h_s have 50, 50, 100, and 20 hidden units respectively. These experiments are performed on VIRAT 1.0, VIRAT 2.0 with six and all events respectively, and the UT-Interaction Dataset.
4.2.1 Baselines and State-of-the-Art Methods
Three baseline approaches are used in our experiments to evaluate the effectiveness of the proposed deep hierarchical context model, denoted as Deep Model. The first baseline uses the SVM classifier with the STIP event feature, and is denoted as SVM-STIP. This approach does not use any contexts for event recognition. The second baseline, denoted as SVM-Context, concatenates the event feature with both the appearance and interaction context features, and also uses an SVM as the classifier. It hence evaluates the effectiveness of the proposed context features. The third baseline is the
TABLE 1
Performance of Appearance Context Feature with Different Neighborhood Sizes on Six Events of VIRAT 2.0

  Ratio        0.25     0.30     0.35     0.40     0.45
  App. Cont.   15.71%   19.54%   22.56%   20.59%   17.83%
  Baseline     41.74%   41.74%   41.74%   41.74%   41.74%
  Combined     46.47%   45.08%   47.87%   46.51%   42.18%

Ratio: the relative neighborhood size. App.: Appearance; Cont.: Context. Combined: the combined feature.
TABLE 2
Performance of Interaction Context Features with Different Vocabulary Sizes on Six Events of VIRAT 2.0

  K            8        10       12       15       18
  Int. Cont.   17.15%   19.64%   19.94%   21.13%   21.66%
  Baseline     41.74%   41.74%   41.74%   41.74%   41.74%
  Combined     43.99%   47.54%   44.70%   43.70%   40.42%

K: BOW vocabulary size. "Int. Cont." stands for the interaction context feature; its dimension is K^2. Combined: the combined feature.
TABLE 3
Performance of Proposed Context Features Combined with Baseline STIP Features on Six Events of VIRAT 2.0

             STIP     STIP + App. Cont.   STIP + Int. Cont.   STIP + App. & Int. Cont.
  Accuracy   41.74%   47.87%              47.54%              51.91%

App.: Appearance; Int.: Interaction; Cont.: Context.
Model-BM model in Fig. 7, which simultaneously integrates image level contexts and semantic level contexts. These baseline approaches are compared to our proposed model, which systematically integrates image, semantic, and prior level contexts.
For both the Model-BM and Deep Model, the event temporal orders are needed to incorporate the dynamic context. The UT-Interaction Dataset has a simple linear order of events without temporal overlapping. On the other hand, the VIRAT datasets contain multiple ongoing events partially overlapping in time. However, these temporally-overlapping events are largely uncorrelated due to their large spatial distance. Hence, in this approach, we model the temporal dependencies between event segments only when these segments are also spatially close. The sequence of event segments is then built according to the event temporal order with respect to the event starting time. If two events have the same starting time, the ending time is used to decide the order. If two spatially-close events are completely temporally overlapping, which is extremely rare, the algorithm uses the annotation order of the two video segments. The proposed model then learns the temporal dependencies between these event segments in sequence.
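This ordering rule can be sketched as a sort keyed on start time, then end time; since Python's sort is stable, fully overlapping segments keep their annotation order automatically. The segment tuples below are hypothetical.

```python
# Hypothetical event segments: (start_time, end_time, annotation_index, label).
segments = [
    (12.0, 20.0, 0, "GIV"),
    (5.0, 9.0, 1, "OAT"),
    (12.0, 18.0, 2, "CAT"),
    (5.0, 9.0, 3, "LAV"),   # completely overlaps the OAT segment in time
]

# Order by start time, break ties by end time; the stable sort preserves
# the annotation order for completely temporally-overlapping segments.
ordered = sorted(segments, key=lambda s: (s[0], s[1]))
```

Here the two segments starting at t = 5.0 with the same end time keep their annotation order, while the two starting at t = 12.0 are ordered by their ending times.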
Besides these baselines, we also compare our results to multiple state-of-the-art performances, including [11], [47], [52], [53] on the VIRAT 1.0 and 2.0 Ground Datasets, as well as [54], [55], [56], [57], [58], [59], [60] on the UT-Interaction Dataset.
4.2.2 Performance on VIRAT 1.0 Ground Dataset
We first compare our proposed Deep Model with the three baseline approaches (SVM-STIP, SVM-Context, and Model-BM) on the VIRAT 1.0 Ground Dataset. We show the recognition accuracy for each event and the average recognition accuracy over the six events. The detailed recognition accuracies for the three baselines and the proposed model are presented in Table 4.
In this comparison, the Model-BM baseline performs better than the SVM-Context baseline by over 8 percent. This result indicates that incorporating the semantic level context between events and the middle level representations of person and object can clearly improve the event recognition performance. More importantly, our proposed Deep Model outperforms the three baselines for four of the six events. In this experiment, the SVM-STIP approach faces great difficulty in distinguishing pairs of events that are similar in appearance, such as LAV and UAV. The proposed approach, by utilizing contexts, can alleviate the mismatches and improve the recognition for both events. For the average recognition accuracy over the six events, SVM-STIP reaches 39.91 percent, SVM-Context reaches 53.21 percent, and Model-BM reaches 62.15 percent. Our Deep Model performs the best at 69.88 percent. This is a 29 percent absolute improvement over SVM-STIP, and a 16 percent absolute improvement over SVM-Context.
Table 5 gives the comparison of our Deep Model with state-of-the-art performances on the VIRAT 1.0 Ground Dataset. Here, our Deep Model performs the best for three of the six events, and outperforms the BN model [47] by over 4 percent in the overall performance. This result demonstrates that our proposed model is more effective than the traditional BN based hierarchical context model in integrating three levels of context information for event recognition.
4.2.3 Performance on VIRAT 2.0 Ground Dataset
We further compare the performance of the proposed Deep Model with the three baselines SVM-STIP, SVM-Context, and Model-BM on the VIRAT 2.0 Ground Dataset for the recognition of six person-vehicle interaction events. As shown in Table 6, our Deep Model consistently outperforms the baseline approaches for each event, and improves the average recognition accuracy from 41.74 percent (SVM-STIP), 51.91 percent (SVM-Context), and 58.75 percent (Model-BM) to 66.45 percent (Deep Model). This is close to a 25 percent absolute improvement over SVM-STIP, and close to a 15 percent absolute improvement over SVM-Context.
The confusion matrices for SVM-STIP, SVM-Context, and the proposed Deep Model are further provided in Fig. 10. From Fig. 10a, we can see the SVM-STIP approach still faces difficulties in distinguishing pairs of events that are similar
TABLE 4
Performances of SVM-STIP, SVM-Context, Model-BM and Deep Model on VIRAT 1.0 Ground Dataset

  Accuracy %   SVM-STIP   SVM-Context   Model-BM   Deep Model
  LAV          33.33      33.33         66.67      66.67
  UAV          42.86      57.14         85.71      85.71
  OAT          10.00      60.00         40.00      50.00
  CAT          27.27      36.36         54.55      81.82
  GIV          61.29      67.74         61.29      64.52
  GOV          64.71      64.71         64.71      70.59
  Average      39.91      53.21         62.15      69.88
TABLE 5
The Comparison of Our Model with State-of-the-Art Approaches on VIRAT 1.0 Ground Dataset

  Accuracy %   Reddy et al. [52]   Zhu et al. [11]   BN [47]   Deep Model
  LAV          10.0                52.1              100       66.67
  UAV          16.3                57.5              71.4      85.71
  OAT          20.0                69.1              50.0      50.00
  CAT          34.4                72.8              54.5      81.82
  GIV          38.1                61.3              45.2      64.52
  GOV          61.3                64.6              73.5      70.59
  Average      35.6                62.9              65.8      69.88
TABLE 6
Performances of SVM-STIP, SVM-Context, Model-BM Baselines and the Proposed Deep Model Compared with BN [47] for Six Events on VIRAT 2.0 Dataset

  Accuracy %   SVM-STIP   SVM-Context   Model-BM   Deep Model   BN [47]
  LAV          44.44      66.67         66.67      66.67        77.78
  UAV          51.72      62.07         68.97      68.97        58.62
  OAT          10.00      15.00         25.00      45.00        35.00
  CAT          52.63      63.16         84.21      89.47        63.16
  GIV          58.33      64.58         52.08      70.83        68.75
  GOV          33.33      40.00         55.56      57.78        48.89
  Average      41.74      51.91         58.75      66.45        58.70
in appearance (e.g., "getting into a vehicle" (GIV) and "getting out of a vehicle" (GOV)). On the other hand, the SVM-Context approach can alleviate such mismatches between similar events. Moreover, the proposed Deep Model clearly reduces the mismatches between similar pairs of events by incorporating prior, semantic, and feature level contexts simultaneously.
On the VIRAT 2.0 dataset, we also experiment with three different variants of the proposed Deep Model as extra baselines. The first model variant is the baseline model excluding all hidden layers in Fig. 9. This model reaches 52.54 percent average accuracy, which is slightly better than SVM-Context, but around 14 percent worse than our proposed model. This result suggests that, with the introduction of hidden layers, the proposed deep model can effectively learn salient representations from the input and improve recognition performance.
The second model variant is a standard DBM model taking only e as input. To compare with our proposed model, we use two layers of hidden units for this standard DBM model, which is trained with the standard mean-field method and then fine-tuned. The model reaches 42.12 percent average accuracy, which is only slightly better than the SVM-STIP baseline, but much worse than our proposed context approaches. This result indicates that the improvement by our model is mainly attributed to the integration of contexts from all three levels, rather than merely to the DBM formulation.
In the third model variant, we incorporate two sets of hidden units h_e and h_c for the inputs e and c respectively. Both h_e and h_c are further connected to y, and serve as the latent representations of e and c. Beyond capturing the image level representation, such representations do not capture any semantic contextual information. Furthermore, they make the model more complex. This model reaches 64.67 percent average accuracy, which is close to but slightly worse than the proposed model. Hence, we do not introduce the additional latent layers for e and c.
4.2.4 Performance on Full VIRAT 2.0 Ground Dataset
We further experiment on the full VIRAT ground dataset with all provided events, and compare with the performance of state-of-the-art methods including the BN model [47]. For the events without an "interacting object", we take the event region excluding the person region as the "object" input. The recall (ratio of correct classifications over all test samples of the event) and precision (ratio of correct classifications over all test samples classified into the event), averaged over all 11 types of events, are given in Table 7. In this comparison, our proposed model has a higher averaged precision, and slightly outperforms the BN [47] in averaged recall. This indicates a smaller improvement when the "interacting object" is lacking.
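Under the definitions just given, per-event recall and precision follow directly from a confusion matrix, with recall normalizing each row and precision each column. The 3-class matrix below is illustrative, not the paper's results.

```python
import numpy as np

# Hypothetical confusion matrix C: C[i, j] counts test samples of
# event i that were classified as event j.
C = np.array([[8, 2, 0],
              [1, 6, 3],
              [0, 2, 8]], dtype=float)

# Recall: correct classifications over all test samples of the event (row sums).
recall = np.diag(C) / C.sum(axis=1)
# Precision: correct classifications over all samples classified into the event (column sums).
precision = np.diag(C) / C.sum(axis=0)

avg_recall, avg_precision = recall.mean(), precision.mean()
```

Averaging these per-event values over all 11 event types yields the summary numbers reported in Table 7.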
4.2.5 Performance on UT-Interaction Dataset
The UT-Interaction Dataset is a surveillance video dataset with person-person interaction events. To experiment on this dataset, we turn the object input of the model into a secondary person input. Specifically, we first utilize HOG feature based person detectors to detect the two persons within the event bounding box of the video. The STIP features for each of the two persons are then extracted accordingly. To compare with state-of-the-art performances, we use the Fisher Vector encoding method [61], [62] for the STIP event feature. We experiment on Set 1 of this dataset. The overall performances of SVM-STIP and our proposed Deep Model, as well as different state-of-the-art performances, are listed in Table 8.
The state-of-the-art performances listed in Table 8 are mainly target-centered, descriptor-based approaches. Our SVM-STIP baseline, which is the most standard target-centered descriptor-based approach, does not perform as well as many of these approaches. However, our Deep Model can further improve on the SVM-STIP baseline, and reaches the best performance. In addition, our Deep Model outperforms
Fig. 10. Confusion matrices for the recognition of six person-vehicle interaction events on VIRAT 2.0 Ground Dataset with the SVM-STIP, SVM-Context, and the proposed Deep Model.
TABLE 7
Comparisons with State-of-the-Art Methods for Recognition of All Events from VIRAT

               Amer et al. [53]   Zhu et al. [11]   BN [47]   Our Model
  Precision    72%                71.8%             74.73%    76.50%
  Recall       70%                73.5%             77.42%    77.47%
TABLE 8
Overall Recognition Accuracies Compared to State-of-the-Art Methods on UT-Interaction Dataset
the approach by Raptis and Sigal [58], which captures thetemporal context between key frames.
In this work, we use "human-object" interactions as example events. However, we have shown through these experiments that the proposed model applies not only to human-object interaction events (as in the VIRAT events in Sections 4.2.2 and 4.2.3), but also to human-human interaction events (as in the UT-Interaction events in Section 4.2.5). Furthermore, as discussed in Section 4.2.4, the model can also be applied to events without an "interacting object".
5 CONCLUSION
In this paper, we propose a deep Boltzmann machine based context model to integrate image level, semantic level, and prior level contexts. We first introduce two new image context features: the appearance context feature and the interaction context feature. These features capture the appearance of the contextual objects and their interactions with the event objects. Then, we introduce a deep context model to learn the semantic context. We further introduce two prior level contexts: scene priming and dynamic cueing. Finally, we introduce a hierarchical deep model to integrate contexts at all three levels. The model is trained with a mean-field based approximate learning method, and can be directly used to infer event classes through Gibbs sampling. We evaluate our model's performance on the VIRAT 1.0 Ground Dataset, the VIRAT 2.0 Ground Dataset, and the UT-Interaction Dataset for recognizing real-world surveillance video events with complex backgrounds. The results with the proposed deep context model show significant improvements over baseline approaches that also utilize multiple levels of contexts. In addition, the proposed model outperforms state-of-the-art methods on the benchmark datasets.
Despite the significant improvements our methods have achieved on benchmark datasets, they still have several limitations, which we will address in the future. First, this work relies on pre-segmented video sequences and bounding boxes provided by others. This could limit our methods' practical utility. One possible future direction is simultaneous event bounding box detection and event recognition. Second, our dynamic cueing modeling is standard Markov chain modeling. The benefit from the dynamic cueing context could be limited when the events do not have a temporal order. More complex temporal modeling could be developed to extend the Markov chain modeling. Finally, the current model is designed for two interacting entities, typically one person and one object. It cannot be directly applied to scenarios where more than two entities are involved. Besides extending the model to capture interactions among multiple entities by adding additional visible and hidden nodes, another direction is to model interactions among entities whose number is unknown and varies over time with nonparametric Bayesian models.
ACKNOWLEDGMENTS
This work is supported in part by the Defense Advanced Research Projects Agency under grants HR0011-08-C-0135-S8 and HR0011-10-C-0112, and by the Army Research Office under grant W911NF-13-1-0395.
REFERENCES
[1] S. Oh, et al., "A large-scale benchmark dataset for event recognition in surveillance video," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2011, pp. 3153–3160.
[2] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2005, pp. 886–893.
[3] I. Laptev, "On space-time interest points," Int. J. Comput. Vis., vol. 64, no. 2/3, pp. 107–123, 2005.
[4] R. Cutler and M. Turk, "View-based interpretation of real-time optical flow for gesture recognition," in Proc. IEEE Int. Conf. Automatic Face and Gesture Recognit., 1998, pp. 416–416.
[5] B. Yao and L. Fei-Fei, "Modeling mutual context of object and human pose in human-object interaction activities," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2010, pp. 17–24.
[6] D. Vail, M. M. Veloso, and J. D. Lafferty, "Conditional random fields for activity recognition," in Proc. 6th Int. Joint Conf. Auton. Agents Multiagent Syst., 2007, pp. 1331–1338.
[7] F. Lv and R. Nevatia, "Recognition and segmentation of 3D human action using HMM and multi-class AdaBoost," in Proc. 9th Eur. Conf. Comput. Vis., 2006, pp. 359–372.
[8] G. Yang, Y. Lin, and P. Bhattacharya, "A driver fatigue recognition model based on information fusion and dynamic Bayesian network," Inf. Sci., vol. 180, no. 10, pp. 1942–1954, 2010.
[9] A. Kovashka and K. Grauman, "Learning a hierarchy of discriminative space-time neighborhood features for human action recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2010, pp. 2046–2053.
[10] J. Wang, Z. Chen, and Y. Wu, "Action recognition with multiscale spatio-temporal contexts," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2011, pp. 3185–3192.
[11] Y. Zhu, N. M. Nayak, and A. K. R. Chowdhury, "Context-aware modeling and recognition of activities in video," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013, pp. 2491–2498.
[12] A. Gupta and L. Davis, "Objects in action: An approach for combining action understanding and object perception," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2007, pp. 1–8.
[13] L. Li and L. Fei-Fei, "What, where and who? Classifying events by scene and object recognition," in Proc. IEEE Int. Conf. Comput. Vis., 2007, pp. 1–8.
[14] A. Torralba, "Contextual priming for object detection," Int. J. Comput. Vis., vol. 53, no. 2, pp. 169–191, 2003.
[15] A. Gallagher and T. Chen, "Estimating age, gender, and identity using first name priors," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2008, pp. 1–8.
[16] S. Oh and A. Hoogs, "Unsupervised learning of activities in video using scene context," in Proc. 20th Int. Conf. Pattern Recognit., 2010, pp. 3579–3582.
[17] X. Wang and Q. Ji, "Incorporating contextual knowledge to dynamic Bayesian networks for event recognition," in Proc. 21st Int. Conf. Pattern Recognit., 2012, pp. 3378–3381.
[18] X. Wang and Q. Ji, "Context augmented dynamic Bayesian networks for event recognition," Pattern Recognit. Lett., vol. 43, pp. 62–70, 2014.
[19] S. Belongie, J. Malik, and J. Puzicha, "Shape matching and object recognition using shape contexts," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 4, pp. 509–522, Apr. 2002.
[20] M. Ryoo and J. Aggarwal, "UT-Interaction dataset, ICPR contest on semantic description of human activities," 2010. [Online]. Available: http://cvrc.ece.utexas.edu/SDHA2010/Human_Interaction.html
[21] T. Lan, Y. Wang, W. Yang, S. N. Robinovitch, and G. Mori, "Discriminative latent models for recognizing contextual group activities," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 8, pp. 1549–1562, Aug. 2012.
[22] E. Sudderth, A. Torralba, W. T. Freeman, and A. S. Willsky, "Learning hierarchical models of scenes, objects, and parts," in Proc. IEEE Int. Conf. Comput. Vis., 2005, pp. 1331–1338.
[23] V. Escorcia and J. C. Niebles, "Spatio-temporal human-object interactions for action recognition in videos," in Proc. IEEE Int. Conf. Comput. Vis. Workshop, 2013, pp. 508–514.
[24] F. Yuan, G. S. Xia, H. Sahbi, and V. Prinet, "Mid-level features and spatio-temporal context for activity recognition," Pattern Recognit., vol. 45, no. 12, pp. 4182–4191, 2012.
[25] V. Ramanathan, B. Yao, and L. Fei-Fei, "Social role discovery in human events," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013, pp. 2475–2482.
[26] J. Sun, et al., "Hierarchical spatio-temporal context modeling for action recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2009, pp. 2004–2011.
[27] S. Park and J. Aggarwal, "A hierarchical Bayesian network for event recognition of human actions and interactions," Multimedia Syst., vol. 10, no. 2, pp. 164–179, 2004.
[28] J. Niebles and L. Fei-Fei, "A hierarchical model of shape and appearance for human action classification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2007, pp. 1–8.
[29] T. Hospedales, S. Gong, and T. Xiang, "Video behaviour mining using a dynamic topic model," Int. J. Comput. Vis., vol. 98, no. 3, pp. 303–323, 2012.
[30] J. Varadarajan, R. Emonet, and J. M. Odobez, "A sequential topic model for mining recurrent activities from long term video logs," Int. J. Comput. Vis., vol. 103, no. 1, pp. 100–126, 2013.
[31] G. Hinton, S. Osindero, and Y. W. Teh, "A fast learning algorithm for deep belief nets," Neural Comput., vol. 18, no. 7, pp. 1527–1554, 2006.
[32] R. Salakhutdinov and G. Hinton, "Deep Boltzmann machines," in Proc. 12th Int. Conf. Artif. Intell. Statist., 2009, pp. 448–455.
[33] R. Salakhutdinov and G. Hinton, "An efficient learning procedure for deep Boltzmann machines," Neural Comput., vol. 24, no. 8, pp. 1967–2006, 2012.
[34] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, "Greedy layer-wise training of deep networks," in Proc. Advances Neural Inf. Process. Syst., 2007, pp. 153–160.
[35] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, "Extracting and composing robust features with denoising autoencoders," in Proc. 25th Int. Conf. Mach. Learn., 2008, pp. 1096–1103.
[36] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[37] S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 1, pp. 221–231, Jan. 2013.
[38] A. Karpathy, G. Toderici, S. Shetty, T. Leung, and R. Sukthankar, "Large-scale video classification with convolutional neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 1725–1732.
[39] G. Taylor, R. Fergus, Y. LeCun, and C. Bregler, "Convolutional learning of spatio-temporal features," in Proc. 11th Eur. Conf. Comput. Vis., 2010, pp. 140–153.
[40] Q. Le, W. Y. Zou, S. Y. Yeung, and A. Y. Ng, "Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2011, pp. 3361–3368.
[41] M. Hasan and A. Roy-Chowdhury, "Continuous learning of human activity models using deep nets," in Proc. 13th Eur. Conf. Comput. Vis., 2014, pp. 705–720.
[42] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, "Multimodal deep learning," in Proc. 28th Int. Conf. Mach. Learn., 2011, pp. 689–696.
[43] N. Srivastava and R. Salakhutdinov, "Multimodal learning with deep Boltzmann machines," in Proc. Advances Neural Inf. Process. Syst., 2012, pp. 2222–2230.
[44] N. Srivastava and R. Salakhutdinov, "Multimodal learning with deep Boltzmann machines," J. Mach. Learn. Res., vol. 15, no. 9, pp. 2949–2980, 2014.
[45] X. He, R. S. Zemel, and M. A. Carreira-Perpinan, "Multiscale conditional random fields for image labeling," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2004, pp. 695–702.
[46] X. Zeng, W. Ouyang, and X. Wang, "Multi-stage contextual deep learning for pedestrian detection," in Proc. IEEE Int. Conf. Comput. Vis., 2013, pp. 121–128.
[47] X. Wang and Q. Ji, "A hierarchical context model for event recognition in surveillance video," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 2561–2568.
[48] X. Wang and Q. Ji, "Video event recognition with deep hierarchical context model," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 4418–4427.
[49] D. Lowe, "Object recognition from local scale-invariant features," in Proc. IEEE Int. Conf. Comput. Vis., 1999, pp. 1150–1157.
[50] R. Salakhutdinov, Learning deep Boltzmann machines, 2012. [Online]. Available: http://www.cs.toronto.edu/~rsalakhu/DBM.html
[51] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, "Learning realistic human actions from movies," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2008, pp. 1–8.
[52] K. Reddy, N. Cuntoor, A. Perera, and A. Hoogs, "Human action recognition in large-scale datasets using histogram of spatiotemporal gradients," in Proc. IEEE 9th Int. Conf. Advanced Video Signal-Based Surveillance, 2012, pp. 106–111.
[53] M. Amer and S. Todorovic, "Sum-product networks for modeling activities with stochastic structure," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2012, pp. 1314–1321.
[54] M. Ryoo and J. Aggarwal, "Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities," in Proc. IEEE Int. Conf. Comput. Vis., 2009, pp. 1593–1600.
[55] M. Ryoo, "Human activity prediction: Early recognition of ongoing activities from streaming videos," in Proc. IEEE Int. Conf. Comput. Vis., 2011, pp. 1036–1043.
[56] D. Waltisberg, A. Yao, J. Gall, and L. van Gool, "Variations of a Hough-voting action recognition system," in Proc. Recognizing Patterns Signals Speech Images Videos, 2010, pp. 306–312.
[57] G. Yu, J. Yuan, and Z. Liu, "Propagative Hough voting for human activity recognition," in Proc. 12th Eur. Conf. Comput. Vis., 2012, pp. 693–706.
[58] M. Raptis and L. Sigal, "Poselet key-framing: A model for human activity recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013, pp. 2650–2657.
[59] S. Shariat and V. Pavlovic, "A new adaptive segmental matching measure for human activity recognition," in Proc. IEEE Int. Conf. Comput. Vis., 2013, pp. 3583–3590.
[60] Y. Zhang, X. Liu, M.-C. Chang, W. Ge, and T. Chen, "Spatio-temporal phrases for activity recognition," in Proc. 12th Eur. Conf. Comput. Vis., 2012, pp. 707–721.
[61] F. Perronnin and C. Dance, "Fisher kernels on visual vocabularies for image categorization," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2007, pp. 1–8.
[62] F. Perronnin, J. Sanchez, and T. Mensink, “Improving the Fisherkernel for large-scale image classification,” in Proc. 11th Eur. Conf.Comput. Vis., 2010, pp. 143–156.
Xiaoyang Wang received the BS and MS degrees, both from Tsinghua University, Beijing, China, in 2007 and 2010, respectively, and the PhD degree from Rensselaer Polytechnic Institute, Troy, New York, in 2015. He currently works with Nokia Bell Labs, Murray Hill, New Jersey, as a researcher. His research interests include video event recognition, object recognition, attribute prediction, context modeling, and probabilistic graphical models. He received the ICPR Piero Zamperoni Best Student Paper Award in 2012. He is a member of the IEEE.
Qiang Ji received the PhD degree from the University of Washington. He is currently a Professor in the Department of Electrical, Computer, and Systems Engineering, RPI. From January 2009 to August 2010, he served as a program director at the National Science Foundation, managing NSF's machine learning and computer vision programs. Prior to joining RPI in 2001, he was an assistant professor in the Department of Computer Science, University of Nevada, Reno. He also held research and visiting positions at the Beckman Institute, University of Illinois at Urbana-Champaign, the Robotics Institute, Carnegie Mellon University, and the US Air Force Research Laboratory. He is a fellow of the IEEE and the IAPR.
1782 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 39, NO. 9, SEPTEMBER 2017
Local Causal Discovery of Direct Causes and Effects
Tian Gao, Qiang Ji
Department of ECSE
Rensselaer Polytechnic Institute, Troy, NY 12180
{gaot, jiq}@rpi.edu
Abstract
We focus on the discovery and identification of direct causes and effects of a target variable in a causal network. State-of-the-art causal learning algorithms generally need to find the global causal structures in the form of complete partial directed acyclic graphs (CPDAGs) in order to identify the direct causes and effects of a target variable. While these algorithms are effective, it is often unnecessary and wasteful to find the global structures when we are only interested in the local structure of one target variable (such as class labels). We propose a new local causal discovery algorithm, called Causal Markov Blanket (CMB), to identify the direct causes and effects of a target variable based on Markov blanket discovery. CMB is designed to conduct causal discovery among multiple variables, but focuses only on finding causal relationships between a specific target variable and other variables. Under standard assumptions, we show both theoretically and experimentally that the proposed local causal discovery algorithm can obtain identification accuracy comparable to global methods while significantly improving their efficiency, often by more than one order of magnitude.
1 Introduction
Causal discovery is the process of identifying the causal relationships among a set of random variables. It not only can aid predictions and classifications like feature selection [4], but can also help predict the consequences of given actions, facilitate counterfactual inference, and help explain the underlying mechanisms of the data [13]. A lot of research effort has been focused on predicting causality from observational data [13, 18]. This work can be roughly divided into two sub-areas: causal discovery between a pair of variables and among multiple variables. We focus on multivariate causal discovery, which searches for correlations and dependencies among variables in causal networks [13]. Causal networks can be used for local or global causal prediction, and thus they can be learned locally and globally. Many causal discovery algorithms for causal networks have been proposed, and the majority of them are global learning algorithms, as they seek to learn global causal structures. The Spirtes-Glymour-Scheines (SGS) [18] and Peter-Clark (P-C) [19] algorithms test for the existence of edges between every pair of nodes in order to first find the skeleton, or undirected edges, of a causal network, and then discover all the V-structures, resulting in a partially directed acyclic graph (PDAG). The last step of these algorithms is then to orient the rest of the edges as much as possible using the Meek rules [10] while maintaining consistency with the existing edges. Given a causal network, causal relationships among variables can be directly read off the structure.
Due to the complexity of the P-C algorithm and unreliable high-order conditional independence tests [9], several works [23, 15] have incorporated Markov blanket (MB) discovery into causal discovery with a local-to-global approach. The Growth and Shrink (GS) [9] algorithm uses the MBs of each node to build the skeleton of a causal network, discovers all the V-structures, and then uses the Meek rules to complete the global causal structure. The max-min hill climbing (MMHC) [23] algorithm also finds the MBs of each variable first, but then uses the MBs as constraints to reduce the search space for score-based standard hill climbing structure learning methods. In [15], the authors
Advances in Neural Information Processing Systems 28 (NIPS 2015)
use Markov blanket with Collider Sets (CS) to improve the efficiency of the GS algorithm by combining the spouse and V-structure discovery. All these local-to-global methods rely on the global structure to find the causal relationships and require finding the MBs for all nodes in a graph, even if the interest is the causal relationships between one target variable and other variables. Different MB discovery algorithms can be used, and they can be divided into two different approaches: non-topology-based and topology-based. Non-topology-based methods [5, 9], used by the CS and GS algorithms, greedily test the independence between each variable and the target by directly using the definition of the Markov blanket. In contrast, more recent topology-based methods [22, 1, 11] aim to improve the data efficiency while maintaining a reasonable time complexity by finding the parents and children (PC) set first and then the spouses to complete the MB.
Local learning of causal networks generally aims to identify a subset of causal edges in a causal network. The Local Causal Discovery (LCD) algorithm and its variants [3, 17, 7] aim to find causal edges by testing the dependence/independence relationships among every four-variable set in a causal network. Bayesian Local Causal Discovery (BLCD) [8] explores the Y-structures among MB nodes to infer causal edges [6]. While LCD/BLCD algorithms aim to identify a subset of causal edges via special structures among all variables, we focus on finding all the causal edges adjacent to one target variable. In other words, we want to find the causal identity of each node, in terms of direct causes and effects, with respect to one target node. We first use Markov blankets to find the direct causes and effects, and then propose a new Causal Markov Blanket (CMB) discovery algorithm, which determines the exact causal identities of the MB nodes of a target node by tracking their conditional independence changes, without finding the global causal structure of the causal network. The proposed CMB algorithm is a complete local discovery algorithm and can identify the same direct causes and effects for a target variable as global methods under standard assumptions. CMB is more scalable than global methods, more efficient than local-to-global methods, and is complete in identifying the direct causes and effects of one target while other local methods are not.
2 Background
We use V to represent the variable space, capital letters (such as X, Y) to represent variables, bold letters (such as Z, MB) to represent variable sets, and |Z| to represent the size of set Z. X ⊥⊥ Y and X ⊥̸⊥ Y represent independence and dependence between X and Y, respectively. We assume readers are familiar with the related concepts in causal network learning and only review a few major ones here. In a causal network or causal Bayesian network [13], nodes correspond to the random variables in a variable set V. Two nodes are adjacent if they are connected by an edge. A directed edge from node X to node Y, (X, Y) ∈ V, indicates that X is a parent or direct cause of Y and Y is a child or direct effect of X [12]. Moreover, if there is a directed path from X to Y, then X is an ancestor of Y and Y is a descendant of X. If nonadjacent X and Y have a common child, X and Y are spouses. Three nodes X, Y, and Z form a V-structure [12] if Y has two incoming edges from X and Z, forming X → Y ← Z, and X is not adjacent to Z. Y is a collider in a path if Y has two incoming edges in this path. A Y with nonadjacent parents X and Z is an unshielded collider. A path J from node X to Y is blocked [12] by a set of nodes Z if either of the following holds: 1) there is a non-collider node in J belonging to Z; 2) there is a collider node C on J such that neither C nor any of its descendants belongs to Z. Otherwise, J is unblocked or active.
A PDAG is a graph that may have both undirected and directed edges and has at most one edge between any pair of nodes [10]. CPDAGs [2] represent Markov equivalence classes of DAGs, capturing the same conditional independence relationships with the same skeleton but potentially different edge orientations. CPDAGs contain directed edges that have the same orientation for every DAG in the equivalence class and undirected edges that have reversible orientations in the equivalence class. Let G be the causal DAG of a causal network with variable set V and P be the joint probability distribution over the variables in V. G and P satisfy the causal Markov condition [13] if and only if, ∀X ∈ V, X is independent of the non-effects of X given its direct causes. The causal faithfulness condition [13] states that G and P are faithful to each other if each and every independence and conditional independence entailed by P is present in G. It enables the recovery of G from sampled data of P. Another widely used assumption of existing causal discovery algorithms is causal sufficiency [12]. A set of variables X ⊆ V is causally sufficient if no set of two or more variables in X shares a common cause variable outside V. Without the causal sufficiency assumption, latent confounders between adjacent nodes would be modeled by bi-directed edges [24]. We also assume no selection bias [20], and
we can capture the same independence relationships among variables from the sampled data as theones from the entire population.
Many concepts and properties of a DAG hold in causal networks, such as d-separation and the MB. A Markov blanket [12] of a target variable T, MBT, in a causal network is the minimal set of nodes conditioned on which all other nodes are independent of T, denoted as X ⊥⊥ T | MBT, ∀X ⊆ {V \ T} \ MBT. Given an unknown distribution P that satisfies the Markov condition with respect to an unknown DAG G0, Markov blanket discovery is the process used to estimate the MB of a target node in G0 from independently and identically distributed (i.i.d.) data D of P. Under the causal faithfulness assumption between G0 and P, the MB of a target node T is unique and is the set of parents, children, and spouses of T (i.e., other parents of children of T) [12]. In addition, the parents and children set of T, PCT, is also unique. Intuitively, the MB can directly facilitate causal discovery. If conditioning on the MB of a target variable T renders a variable X independent of T, then X cannot be a direct cause or effect of T. From the local causal discovery point of view, although the MB may contain nodes with different causal relationships with the target, it is reasonable to believe that we can identify their relationships exactly, up to the Markov equivalence, with further tests.
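Given a known structure, this characterization of the MB (parents ∪ children ∪ spouses) is mechanical. A minimal Python sketch over the DAG of Figure 1a (edges A→C, B→C, C→D, D→E, D→G, E→F, as inferred from the case study in Section 3; the representation and function names are our own illustration, not the authors' code):

```python
# DAG of Figure 1a, hand-coded as a parent map (illustrative only).
PARENTS = {"A": set(), "B": set(), "C": {"A", "B"}, "D": {"C"},
           "E": {"D"}, "F": {"E"}, "G": {"D"}}

def children(t):
    """All nodes that list t among their parents."""
    return {n for n, ps in PARENTS.items() if t in ps}

def markov_blanket(t):
    """MB(T) = parents(T) ∪ children(T) ∪ spouses(T),
    where the spouses are the other parents of T's children."""
    ch = children(t)
    spouses = set().union(*(PARENTS[c] for c in ch)) - {t} if ch else set()
    return set(PARENTS[t]) | ch | spouses
```

For example, `markov_blanket("E")` yields `{"D", "F"}`: E's parent D and child F, with no spouses since F has no other parent.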
Lastly, existing causal network learning algorithms all use the three Meek rules [10], which we assume the readers are familiar with, to orient as many edges as possible given all V-structures in PDAGs to obtain the CPDAG. The basic idea is to orient the edges so that 1) the edge directions do not introduce new V-structures, 2) the no-cycle property of a DAG is preserved, and 3) 3-fork V-structures are enforced.
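As an illustration of the first of these rules (no new V-structures), a minimal Python sketch over explicit edge sets; the representation and names are ours, and the other two rules are omitted:

```python
def meek_rule1(directed, undirected):
    """Repeatedly orient b - c as b -> c whenever a -> b, b - c exist and
    a, c are nonadjacent: orienting c -> b instead would create a new
    V-structure a -> b <- c. directed holds (a, b) pairs meaning a -> b;
    undirected holds frozenset({b, c}) pairs."""
    def adjacent(x, y):
        return (frozenset({x, y}) in undirected
                or (x, y) in directed or (y, x) in directed)

    changed = True
    while changed:
        changed = False
        for a, b in list(directed):
            for e in list(undirected):
                if b in e:
                    c = next(iter(e - {b}))
                    if not adjacent(a, c):
                        undirected.discard(e)   # orient b - c ...
                        directed.add((b, c))    # ... as b -> c
                        changed = True
    return directed, undirected
```

With X → Y given and Y − Z, Z − W undirected (X, Z and Y, W nonadjacent), the rule propagates to orient Y → Z and then Z → W.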
3 Local Causal Discovery of Direct Causes and Effects
Existing MB discovery algorithms do not directly offer the exact causal identities of the learned MB nodes of a target. Although the topology-based methods can find the PC set of the target within the MB set, they can only provide the causal identities of some children and spouses that form V-structures. Nevertheless, following existing works [4, 15], under standard assumptions, every PC variable of a target can only be its direct cause or effect:

Theorem 1. Causality within an MB. Under the causal faithfulness, sufficiency, correct independence tests, and no selection bias assumptions, the parent and child nodes within a target's MB set in a causal network contain all and only the direct causes and effects of the target variable.
The proof can be directly derived from the PC set definition of a causal network. Therefore, using the topology-based MB discovery methods, if we can discover the exact causal identities of the PC nodes within the MB, causal discovery of the direct causes and effects of the target can be successfully accomplished.
Building on MB discovery, we propose a new local causal discovery algorithm, Causal Markov Blanket (CMB) discovery, as shown in Algorithm 1. It identifies the direct causes and effects of a target variable without the need to find the global structure or the MBs of all other variables in a causal network. CMB has three major steps: 1) find the MB set of the target and identify some direct causes and effects by tracking the independence relationship changes among a target's PC nodes before and after conditioning on the target node; 2) repeat Step 1 but conditioned on one PC node's MB set; and 3) repeat Steps 1 and 2 with unidentified neighboring nodes as new targets to identify more direct causes and effects of the original target.
Step 1: Initial identification. CMB first finds the MB nodes of a target T, MBT, using a topology-based MB discovery algorithm that also finds PCT. CMB then uses the CausalSearch subroutine, shown in Algorithm 2, to get the initial causal identities of variables in PCT by checking every variable pair in PCT according to Lemma 1.

Lemma 1. Let (X, Y) ∈ PCT, the PC set of the target T ∈ V in a causal DAG. The independence relationships between X and Y can be divided into the following four conditions:
C1 X ⊥⊥ Y and X ⊥⊥ Y | T; this condition cannot happen.

C2 X ⊥⊥ Y and X ⊥̸⊥ Y | T ⇒ X and Y are both parents of T.

C3 X ⊥̸⊥ Y and X ⊥⊥ Y | T ⇒ at least one of X and Y is a child of T.

C4 X ⊥̸⊥ Y and X ⊥̸⊥ Y | T ⇒ their identities are inconclusive and need further tests.
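With a d-separation oracle standing in for the statistical independence tests, Lemma 1's case analysis can be checked mechanically. A minimal sketch using the moralized-ancestral-graph criterion for the oracle, on the hand-coded Figure 1a DAG (A→C, B→C, C→D, D→E, D→G, E→F); the DAG and all names are our own illustration:

```python
from itertools import combinations

PARENTS = {"A": set(), "B": set(), "C": {"A", "B"}, "D": {"C"},
           "E": {"D"}, "F": {"E"}, "G": {"D"}}

def _ancestors(nodes):
    seen, stack = set(), list(nodes)
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(PARENTS[n])
    return seen

def independent(x, y, z=()):
    """d-separation oracle for X ⊥⊥ Y | Z: restrict to ancestors of
    {x, y} ∪ z, moralize (marry co-parents, drop directions), remove z,
    and test whether x can still reach y."""
    sub = _ancestors({x, y} | set(z))
    adj = {n: set() for n in sub}
    for n in sub:
        ps = PARENTS[n] & sub
        for p in ps:                       # connect each node to its parents
            adj[n].add(p); adj[p].add(n)
        for p, q in combinations(ps, 2):   # "marry" co-parents
            adj[p].add(q); adj[q].add(p)
    blocked, stack, seen = set(z), [x], {x}
    while stack:
        n = stack.pop()
        if n == y:
            return False                   # still reachable -> dependent
        for m in adj[n]:
            if m not in seen and m not in blocked:
                seen.add(m); stack.append(m)
    return True                            # separated -> independent

def lemma1(x, y, t):
    """Classify a PC pair (x, y) of target t into C1-C4."""
    marg, cond = independent(x, y), independent(x, y, (t,))
    return {(True, True): "C1", (True, False): "C2",
            (False, True): "C3", (False, False): "C4"}[(marg, cond)]
```

For target C, the pair (A, B) comes out as C2 (collider at C), and for target E the pair (D, F) comes out as C3, matching the examples in the text.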
Algorithm 1 Causal Markov Blanket Discovery Algorithm
1: Input: D: data; T: target variable
2: Output: IDT: the causal identities of all nodes with respect to T
   {Step 1: Establish initial ID}
3: IDT = zeros(|V|, 1);
4: (MBT, PCT) ← FindMB(T, D);
5: Z ← ∅;
6: IDT ← CausalSearch(D, T, PCT, Z, IDT);
   {Step 2: Further test variables with idT = 4}
7: for one X in each pair (X, Y) with idT = 4 do
8:   MBX ← FindMB(X, D);
9:   Z ← {MBX \ T} \ Y;
10:  IDT ← CausalSearch(D, T, PCT, Z, IDT);
11:  if no element of IDT is equal to 4, break;
12: for every pair of parents (X, Y) of T do
13:   if ∃Z s.t. (X, Z) and (Y, Z) are idT = 4 pairs then
14:     IDT(Z) = 1;
15: IDT(X) ← 3, ∀X s.t. IDT(X) = 4;
   {Step 3: Resolve variable set with idT = 3}
16: for each X with idT = 3 do
17:   recursively find IDX, without going back to the already queried variables;
18:   update IDT according to IDX;
19:   if IDX(T) = 2 then
20:     IDT(X) = 1;
21:     for every Y in idT = 3 variable pairs (X, Y) do
22:       IDT(Y) = 2;
23:   if no element of IDT is equal to 3, break;
24: Return: IDT
Algorithm 2 CausalSearch Subroutine
1: Input: D: data; T: target variable; PCT: the PC set of T; Z: the conditioned variable set; ID: current ID
2: Output: IDT: the new causal identities of all nodes with respect to T
   {Step 1: Single PC}
3: if |PCT| = 1 then
4:   IDT(PCT) ← 3;
   {Step 2: Check C2 & C3}
5: for every X, Y ∈ PCT do
6:   if X ⊥⊥ Y | Z and X ⊥̸⊥ Y | T ∪ Z then
7:     IDT(X) ← 1; IDT(Y) ← 1;
8:   else if X ⊥̸⊥ Y | Z and X ⊥⊥ Y | T ∪ Z then
9:     if IDT(X) = 1 then
10:      IDT(Y) ← 2
11:    else if IDT(Y) ≠ 2 then
12:      IDT(Y) ← 3
13:    if IDT(Y) = 1 then
14:      IDT(X) ← 2
15:    else if IDT(X) ≠ 2 then
16:      IDT(X) ← 3
17:    add (X, Y) to pairs with idT = 3
18:  else
19:    if IDT(X) & IDT(Y) = 0 or 4 then
20:      IDT(X) ← 4; IDT(Y) ← 4
21:      add (X, Y) to pairs with idT = 4
   {Step 3: Identify idT = 3 pairs with known parents}
22: for every X such that IDT(X) = 1 do
23:   for every Y in idT = 3 variable pairs (X, Y) do
24:     IDT(Y) ← 2;
25: Return: IDT
C1 does not happen because the path X − T − Y is unblocked whether or not we condition on T, and the unblocked path makes X and Y dependent on each other. C2 implies that X and Y form a V-structure with T as the corresponding collider, such as node C in Figure 1a, which has two parents A and B. C3 indicates that the paths between X and Y are blocked conditioned on T, which means that either one of (X, Y) is a child of T and the other is a parent, or both of (X, Y) are children of T. For example, nodes D and F in Figure 1a satisfy this condition with respect to E. C4 shows that there may be another unblocked path between X and Y besides X − T − Y. For example, in Figure 1b, nodes D and C have multiple paths between them besides D − T − C. Further tests are needed to resolve this case.
Notation-wise, we use IDT to represent the causal identities of all the nodes with respect to T, IDT(X) as variable X's causal identity to T, and the lower-case idT as the individual ID of a node to T. We also use IDX to represent the causal identities of nodes with respect to node X. To avoid changing the already identified PCs, CMB establishes a priority system¹. We use idT = 1 to represent nodes that are parents of T, idT = 2 for children of T, idT = 3 to represent a pair of nodes that cannot both be parents (and/or ambiguous pairs from Markov equivalent structures, to be discussed in Step 2), and idT = 4 to represent inconclusiveness. A lower-numbered id cannot be changed
¹Note that the identification number is slightly different from the condition number in Lemma 1.
Figure 1: a) A sample causal network. b) A sample network with C4 nodes. The only active path between D and C conditioned on MBC \ {T, D} is D − T − C.
into a higher number (shown by Lines 11∼15 of Algorithm 2). If a variable pair satisfies C2, both will be labeled as parents (Line 7 of Algorithm 2). If a variable pair satisfies C3, one of them is labeled as idT = 2 only if the other variable within the pair is already identified as a parent; otherwise, they are both labeled as idT = 3 (Lines 9∼12 and 15∼17 of Algorithm 2). If a PC node remains inconclusive with idT = 0, it is labeled as idT = 4 in Line 20 of Algorithm 2. Note that if T has only one PC node, it is labeled as idT = 3 (Line 4 of Algorithm 2). Non-PC nodes always have idT = 0.
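The priority rule, under which a lower-numbered id cannot be changed into a higher number, amounts to a one-line guarded update. A minimal Python sketch (the naming is ours, not the authors'; 0 means unidentified, and 1, the parent label, is the most certain identity):

```python
def update_id(ids, node, new_id):
    """Apply the CausalSearch priority rule: a lower-numbered (more
    certain) id is never overwritten by a higher-numbered one.
    ids: dict mapping node -> current id, with 0 / absent = unidentified."""
    cur = ids.get(node, 0)
    if cur == 0 or new_id < cur:
        ids[node] = new_id
    return ids
```

Starting unidentified, an inconclusive pair first gets id 4; a later C2 test upgrades the node to a parent (id 1); a subsequent id-3 suggestion is then ignored.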
Step 2: Resolve idT = 4. Lemma 1 alone cannot identify the variable pairs in PCT with idT = 4 due to other possible unblocked paths, and we have to seek other information. Fortunately, by definition, the MB set of one of the target's PC nodes can block all paths to that PC node.
Lemma 2. Let (X, Y) ∈ PCT, the PC set of the target T ∈ V in a causal DAG. The independence relationships between X and Y, conditioned on the MB of X minus {Y, T}, MBX \ {Y, T}, can be divided into the following four conditions:
C1 X ⊥⊥ Y | MBX \ {Y, T} and X ⊥⊥ Y | T ∪ MBX \ Y; this condition cannot happen.

C2 X ⊥⊥ Y | MBX \ {Y, T} and X ⊥̸⊥ Y | T ∪ MBX \ Y ⇒ X and Y are both parents of T.

C3 X ⊥̸⊥ Y | MBX \ {Y, T} and X ⊥⊥ Y | T ∪ MBX \ Y ⇒ at least one of X and Y is a child of T.

C4 X ⊥̸⊥ Y | MBX \ {Y, T} and X ⊥̸⊥ Y | T ∪ MBX \ Y ⇒ X and Y are directly connected.
C1∼C3 are very similar to those in Lemma 1. C4 is true because, conditioned on T and the MB of X minus Y, the only potentially unblocked paths between X and Y are X − T − Y and/or X − Y. If C4 happens, then the path X − T − Y has no impact on the relationship between X and Y, and hence X − Y must be directly connected. If X and Y are not directly connected and the only potentially unblocked path between X and Y is X − T − Y, then X and Y will be identified by Line 10 of Algorithm 1 with idT ∈ {1, 2, 3}. For example, in Figure 1b, conditioned on MBC \ {T, D}, i.e., {A, B}, the only path between C and D is through T. However, if X and Y are directly connected, they will remain with idT = 4 (such as nodes D and E in Figure 1b). In this case, X, Y, and T form a fully connected clique, and edges among the variables that form a fully connected clique can have many different orientation combinations without affecting the conditional independence relationships. Therefore, this case needs further tests to ensure the Meek rules are satisfied. The third Meek rule (enforcing 3-fork V-structures) is first enforced by Line 14 of Algorithm 1. Then the rest of the idT = 4 nodes are changed to idT = 3 by Line 15 of Algorithm 1 and further processed (even though they could both be parents at the same time) with neighboring nodes' causal identities. Therefore, Step 2 of Algorithm 1 makes all variable pairs with idT = 4 become identified either as parents, as children, or with idT = 3 after taking some neighbors' MBs into consideration. Note that Step 2 of CMB only needs to find the MBs for a small subset of the PC variables (in fact, only one MB for each variable pair with idT = 4).
Step 3: Resolve idT = 3. After Step 2, some PC variables may still have idT = 3. This can happen because of the existence of Markov equivalent structures. Below we show the condition under which CMB can resolve the causal identities of all PC nodes.
Lemma 3. The Identifiability Condition. For Algorithm 1 to fully identify all the causal relationships within the PC set of a target T, 1) T must have at least two nonadjacent parents, 2) one of T's single ancestors must contain at least two nonadjacent parents, or 3) T must have three parents that form a 3-fork pattern as defined in the Meek rules.
We use single ancestors to denote ancestor nodes that do not have a spouse with a mutual child that is also an ancestor of T. If the target does not meet any of the conditions in Lemma 3, C2 will never be satisfied and all PC variables within an MB will have idT = 3. Without a single parent identified, it is impossible to infer the identities of children nodes using C3. Therefore, all the identities of the PC nodes are uncertain, even though the resulting structure could be a CPDAG.
Step 3 of CMB searches for a non-single ancestor of T to infer the causal directions. For each node X with idT = 3, CMB tries to identify its local causal structure recursively. If X's PC nodes are all identified, it returns to the target with the resolved identities; otherwise, it continues to search for a non-single ancestor of X. Note that CMB will not go back to already-searched variables with unresolved PC nodes without providing new information. Step 3 of CMB checks the identifiability condition for all the ancestors of the target. If a graph structure does not meet the conditions of Lemma 3, the final IDT will contain some idT = 3, which indicates reversible edges in CPDAGs. The causal graph found by CMB will be a PDAG after Step 2 of Algorithm 1, and it will be a CPDAG after Step 3 of Algorithm 1.
Case Study. The procedure of using CMB to identify the direct causes and effects of E in Figure 1a has the following three steps. Step 1: CMB finds the MB and PC set of E. The PC set contains nodes D and F. Then IDE(D) = 3 and IDE(F) = 3. Step 2: to resolve the variable pair D and F with idE = 3, 1) CMB finds the PC set of D, containing C, E, and G. Their idD are all 3's, since D contains only one parent. 2) To resolve IDD, CMB checks the causal identities of nodes C and G (without going back to E). The PC set of C contains A, B, and D. CMB identifies IDC(A) = 1, IDC(B) = 1, and IDC(D) = 2. Since C resolves all its PC nodes, CMB returns to node D with IDD(C) = 1. 3) With the new parent C, IDD(G) = 2, IDD(E) = 2, and CMB returns to node E with IDE(D) = 1. Step 3: IDE(D) = 1, and after resolving the pair with idE = 3, IDE(F) = 2.
Theorem 2. The Soundness and Completeness of the CMB Algorithm. If the identifiability condition is satisfied, using a sound and complete MB discovery algorithm, CMB will identify the direct causes and effects of the target under the causal faithfulness, sufficiency, correct independence tests, and no selection bias assumptions.
Proof. A sound and complete MB discovery algorithm finds all and only the MB nodes of a target. Using it and under the causal sufficiency assumption, the learned PC set contains all and only the cause-effect variables by Theorem 1. When Lemma 3 is satisfied, all parent nodes are identifiable through V-structure independence changes, either by Lemma 1 or by Lemma 2. Also, since children cannot be conditionally independent of another PC node given its MB minus the target node (C2), all parents identified by Lemmas 1 and 2 will be true positive direct causes. Therefore, all and only the true positive direct causes will be correctly identified by CMB. Since PC variables can only be direct causes or direct effects, all and only the direct effects are identified correctly by CMB.
In the cases where CMB fails to identify all the PC nodes, global causal discovery methods cannot identify them either. Specifically, structures failing to satisfy Lemma 3 can have different orientations on some edges while preserving the skeleton and V-structures, hence leading to Markov equivalent structures. For the cases where T has all single ancestors, the edge directions among all single ancestors can always be reversed without introducing new V-structures and DAG violations, in which cases the Meek rules cannot identify the causal directions either. For the cases with fully connected cliques, these cliques do not meet the nonadjacent-parents requirement of the first Meek rule (no new V-structures), and the second Meek rule (preserving DAGs) can always be satisfied within a clique by changing the direction of one edge. Since CMB orients the 3-fork V-structure in the third Meek rule correctly by Lines 12∼14 of Algorithm 1, CMB can identify the same structure as the global methods that use the Meek rules.
Theorem 3. Consistency between CMB and Global Causal Discovery Methods. For the same DAG G, Algorithm 1 will correctly identify all the direct causes and effects of a target variable T
as the global and local-to-global causal discovery methods² that use the Meek rules [10], up to G's CPDAG, under the causal faithfulness, sufficiency, correct independence tests, and no selection bias assumptions.
Proof. It has been shown that causal methods using the Meek rules [10] can identify up to a graph's CPDAG. Since the Meek rules cannot identify the structures that fail Lemma 3, the global and local-to-global methods can only identify the same structures as CMB. Since CMB is sound and complete in identifying these structures by Theorem 2, CMB will identify all direct causes and effects up to G's CPDAG.
3.1 Complexity
The complexity of the CMB algorithm is dominated by the step of finding the MB, which can have an exponential complexity [1, 16]. All other steps of CMB are trivial in comparison. If we assume a uniform distribution on the neighbor sizes in a network with N nodes, then the expected time complexity of Step 1 of CMB is O((1/N) ∑_{i=1}^{N} 2^i) = O(2^N / N), while local-to-global methods are O(2^N). In later steps, CMB also needs to find the MBs for a small subset of nodes that includes 1) one node between every pair of nodes that meet C4, and 2) a subset of the target's neighboring nodes that provide additional clues for the target. Let l be the total size of these nodes; then CMB reduces the cost by N/l times asymptotically.
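The closed form behind this estimate, (1/N) · ∑_{i=1}^{N} 2^i = (2^{N+1} − 2)/N = O(2^N / N), can be checked numerically. A small sketch under the same uniform-neighbor-size assumption (function names are ours):

```python
def expected_step1_cost(N):
    """Expected MB-discovery cost for one target when the neighbor size is
    uniform over 1..N: (1/N) * sum_{i=1}^{N} 2**i = (2**(N+1) - 2) / N."""
    return sum(2 ** i for i in range(1, N + 1)) / N

def local_to_global_cost(N):
    """Local-to-global methods find the MBs of all N nodes: sum_{i=1}^{N} 2**i."""
    return sum(2 ** i for i in range(1, N + 1))
```

The ratio of the two costs is exactly N, matching the asymptotic speedup of Step 1; accounting for the extra l MB calls in Steps 2 and 3 brings the overall saving down to roughly N/l.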
4 Experiments
We use benchmark causal learning datasets to evaluate the accuracy and efficiency of CMB against the four other causal discovery algorithms discussed, P-C, GS, MMHC, and CS, and the local causal discovery algorithm LCD2 [7]. Due to the page limit, we show the results of the causal algorithms on four medium-to-large datasets: ALARM, ALARM3, CHILD3, and INSUR3. They contain 37 to 111 nodes. We use 1000 data samples for all datasets. For each global or local-to-global algorithm, we find the global structure of a dataset and then extract the causal identities of all nodes with respect to a target node. CMB finds the causal identities of every variable with respect to the target directly. We repeat the discovery process for each node in the datasets and compare the discovered causal identities of all the algorithms to all the Markov equivalent structures of the known ground truth structure. We use the edge scores [15] to measure the number of missing edges, extra edges, and reversed edges³ in each node's local causal structure, and report average values along with their standard deviations for all the nodes in a dataset. We use the existing implementation [21] of the HITON-MB discovery algorithm to find the MB of a target variable for all the algorithms. We also use the existing implementations [21] of the P-C, MMHC, and LCD2 algorithms. We implement GS, CS, and the proposed CMB algorithm in MATLAB on a machine with a 2.66 GHz CPU and 24 GB memory. Following the existing protocol [15], we use the number of conditional independence tests needed (or scores computed, for the score-based search method MMHC) to find the causal structures given the MBs⁴, and the number of times that MB discovery algorithms are invoked, to measure the efficiency of the various algorithms. We also use mutual-information-based conditional independence tests with a standard significance level of 0.02 for all the datasets, without worrying about parameter tuning.
As shown in Table 1, CMB consistently outperforms the global discovery algorithms on the benchmark causal networks and has edge accuracy comparable to the local-to-global algorithms. Although CMB makes slightly more total edge errors than CS on the ALARM and ALARM3 datasets, it is the best method on CHILD3 and INSUR3. Since LCD2 is an incomplete algorithm, it never finds extra or reversed edges but misses the most edges. Efficiency-wise, CMB achieves more than one order of magnitude speedup over the global methods, and sometimes two orders of magnitude, as shown on CHILD3 and INSUR3. Compared to local-to-global methods, CMB can also achieve
2 We specify the global and local-to-global causal methods to be P-C [19], GS [9], and CS [15].
3 If an edge is reversible in the equivalence class of the original graph but not in the equivalence class of the learned graph, it is also counted as reversed.
4 For global methods, this is the number of tests needed or scores computed given the moral graph of the global structure. For LCD2, it is the total number of tests, since LCD2 uses neither moral graphs nor MBs.
Table 1: Performance of Various Causal Discovery Algorithms on Benchmark Networks
(Extra/Missing/Reversed/Total: edge errors. No. Tests/No. MB: efficiency.)

Dataset  Method  Extra      Missing    Reversed   Total      No. Tests    No. MB
ALARM    P-C     1.59±0.19  2.19±0.14  0.32±0.10  4.10±0.19  4.0e3±4.0e2  -
         MMHC    1.29±0.18  1.94±0.09  0.24±0.06  3.46±0.23  1.8e3±1.7e3  37±0
         GS      0.39±0.44  0.87±0.48  1.13±0.23  2.39±0.44  586.5±72.2   37±0
         CS      0.42±0.10  0.64±0.10  0.38±0.08  1.43±0.10  331.4±61.9   37±0
         LCD2    0.00±0.00  2.49±0.00  0.00±0.00  2.49±0.00  1.4e3±0      -
         CMB     0.69±0.13  0.61±0.11  0.51±0.10  1.81±0.11  53.7±4.5     2.61±0.12
ALARM3   P-C     3.71±0.57  2.21±0.25  1.37±0.04  7.30±0.68  1.6e4±4.0e2  -
         MMHC    2.36±0.11  2.45±0.08  0.72±0.08  5.53±0.27  3.7e3±6.1e2  111±0
         GS      1.24±0.23  1.41±0.05  0.99±0.14  3.64±0.13  2.1e3±1.2e2  111±0
         CS      1.26±0.16  1.47±0.08  0.63±0.14  3.38±0.13  699.1±60.4   111±0
         LCD2    0.00±0.00  3.85±0.00  0.00±0.00  3.85±0.00  1.2e4±0      -
         CMB     1.41±0.13  1.55±0.27  0.78±0.25  3.73±0.11  50.3±6.2     2.58±0.09
CHILD3   P-C     4.32±0.68  2.69±0.08  0.84±0.10  7.76±0.98  8.3e4±2.9e3  -
         MMHC    1.98±0.10  1.57±0.04  0.43±0.04  4.00±0.93  6.6e3±8.2e2  60±0
         GS      0.88±0.04  0.75±0.08  1.03±0.08  2.66±0.33  2.1e3±2.5e2  60±0
         CS      0.94±0.20  0.91±0.14  0.53±0.08  2.37±0.33  1.0e3±4.8e2  60±0
         LCD2    0.00±0.00  2.63±0.00  0.00±0.00  2.63±0.00  3.6e3±0      -
         CMB     0.92±0.12  0.84±0.16  0.60±0.10  2.36±0.31  78.2±15.2    2.53±0.15
INSUR3   P-C     4.76±1.33  2.50±0.11  1.29±0.11  8.55±0.81  2.5e5±1.2e4  -
         MMHC    2.39±0.18  2.53±0.06  0.76±0.07  5.68±0.43  3.1e4±5.2e2  81±0
         GS      1.94±0.06  1.44±0.05  1.19±0.10  4.57±0.33  4.5e4±2.2e3  81±0
         CS      1.92±0.08  1.56±0.06  0.89±0.09  4.37±0.23  2.6e4±3.9e3  81±0
         LCD2    0.00±0.00  5.03±0.00  0.00±0.00  5.03±0.00  6.6e3±0      -
         CMB     1.72±0.07  1.39±0.06  1.19±0.05  4.30±0.21  159.8±38.5   2.46±0.11
more than one order of magnitude speedup on ALARM3, CHILD3, and INSUR3. In addition, on these datasets CMB invokes the MB discovery algorithm only 2 to 3 times on average, drastically reducing the MB calls of local-to-global algorithms. Since a comparison based on independence tests is unfair to LCD2, which uses neither MB discovery nor moral graphs, we also compared the time efficiency of LCD2 and CMB: CMB is 5 times faster than LCD2 on ALARM, 4 times faster on ALARM3 and CHILD3, and 8 times faster on INSUR3.

In practice, the performance of CMB depends on two factors: the accuracy of the independence tests and of the MB discovery algorithm. First, independence tests may not always be accurate and can introduce errors when checking the four conditions of Lemmas 1 and 2, especially with insufficient data samples. Second, causal discovery performance depends heavily on the MB discovery step, as its errors can propagate to the later steps of CMB. Improvements in both areas would further improve CMB's accuracy. Efficiency-wise, CMB's complexity can still be exponential and is dominated by the MB discovery phase, so its worst-case complexity is the same as that of local-to-global approaches for some special structures.
5 Conclusion
We propose a new local causal discovery algorithm, CMB. We show that CMB identifies the same causal structure as the global and local-to-global causal discovery algorithms under the same identification condition, but at a fraction of their cost. We further prove the soundness and completeness of CMB. Experiments on benchmark datasets show the comparable accuracy and greatly improved efficiency of CMB for local causal discovery. Future work could study relaxing the assumptions, especially the causal sufficiency assumption, for example by using a procedure similar to the FCI algorithm and the improved CS algorithm [14] to handle latent variables in CMB.
References

[1] Constantin F. Aliferis, Ioannis Tsamardinos, and Alexander Statnikov. HITON, a novel Markov blanket algorithm for optimal variable selection, 2003.
[2] David Maxwell Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 2002.
[3] Gregory F. Cooper. A simple constraint-based algorithm for efficiently mining observational databases for causal relationships. Data Mining and Knowledge Discovery, 1(2):203-224, 1997.
[4] Isabelle Guyon, Andre Elisseeff, and Constantin Aliferis. Causal feature selection. 2007.
[5] Daphne Koller and Mehran Sahami. Toward optimal feature selection. In ICML 1996, pages 284-292. Morgan Kaufmann, 1996.
[6] Subramani Mani, Constantin F. Aliferis, Alexander R. Statnikov, and MED NYU. Bayesian algorithms for causal data mining. In NIPS Causality: Objectives and Assessment, pages 121-136, 2010.
[7] Subramani Mani and Gregory F. Cooper. A study in causal discovery from population-based infant birth and death records. In Proceedings of the AMIA Symposium, page 315. American Medical Informatics Association, 1999.
[8] Subramani Mani and Gregory F. Cooper. Causal discovery using a Bayesian local causal discovery algorithm. Medinfo, 11(Pt 1):731-735, 2004.
[9] Dimitris Margaritis and Sebastian Thrun. Bayesian network induction via local neighborhoods. In Advances in Neural Information Processing Systems 12, pages 505-511. MIT Press, 1999.
[10] Christopher Meek. Causal inference and causal explanation with background knowledge. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pages 403-410. Morgan Kaufmann Publishers Inc., 1995.
[11] Teppo Niinimaki and Pekka Parviainen. Local structure discovery in Bayesian networks. In Proceedings of Uncertainty in Artificial Intelligence, Workshop on Causal Structure Learning, pages 634-643, 2012.
[12] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, Inc., 2nd edition, 1988.
[13] Judea Pearl. Causality: Models, Reasoning and Inference, volume 29. Cambridge University Press, 2000.
[14] Jean-Philippe Pellet and Andre Elisseeff. Finding latent causes in causal networks: an efficient approach based on Markov blankets. In Advances in Neural Information Processing Systems, pages 1249-1256, 2009.
[15] Jean-Philippe Pellet and Andre Elisseeff. Using Markov blankets for causal structure learning. Journal of Machine Learning Research, 2008.
[16] Jose M. Peña, Roland Nilsson, Johan Bjorkegren, and Jesper Tegner. Towards scalable and data efficient learning of Markov boundaries. International Journal of Approximate Reasoning, 45(2):211-232, July 2007.
[17] Craig Silverstein, Sergey Brin, Rajeev Motwani, and Jeff Ullman. Scalable techniques for mining causal structures. Data Mining and Knowledge Discovery, 4(2-3):163-192, 2000.
[18] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. The MIT Press, 2nd edition, 2000.
[19] Peter Spirtes, Clark Glymour, Richard Scheines, Stuart Kauffman, Valerio Aimale, and Frank Wimberly. Constructing Bayesian network models of gene expression networks from microarray data, 2000.
[20] Peter Spirtes, Christopher Meek, and Thomas Richardson. Causal inference in the presence of latent variables and selection bias. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pages 499-506. Morgan Kaufmann Publishers Inc., 1995.
[21] Alexander Statnikov, Ioannis Tsamardinos, Laura E. Brown, and Constantin F. Aliferis. Causal Explorer: A MATLAB library of algorithms for causal discovery and variable selection for classification. In Causation and Prediction Challenge at WCCI, 2008.
[22] Ioannis Tsamardinos, Constantin F. Aliferis, and Alexander Statnikov. Time and sample efficient discovery of Markov blankets and direct causal relations. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '03, pages 673-678, New York, NY, USA, 2003. ACM.
[23] Ioannis Tsamardinos, Laura E. Brown, and Constantin F. Aliferis. The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 65(1):31-78, 2006.
[24] Jiji Zhang. On the completeness of orientation rules for causal discovery in the presence of latent confounders and selection bias. Artificial Intelligence, 172(16):1873-1896, 2008.
Neuro-inspired Eye Tracking with Eye Movement Dynamics

Kang Wang (RPI)    Hui Su (RPI and IBM)    Qiang Ji (RPI)
Abstract
Generalizing eye tracking to new subjects and environments remains challenging for existing appearance-based methods. To address this issue, we propose to leverage eye movement dynamics inspired by neurological studies. These studies show that there exist several common eye movement types, independent of viewing content and subject, such as fixation, saccade, and smooth pursuit. Incorporating generic eye movement dynamics can therefore improve generalization. In particular, we propose a novel Dynamic Gaze Transition Network (DGTN) to capture the underlying eye movement dynamics and serve as a top-down gaze prior. Combined with the bottom-up gaze measurements from a deep convolutional neural network, our method achieves better performance than the state-of-the-art in both within-dataset and cross-dataset evaluations. In addition, a new DynamicGaze dataset is constructed to study eye movement dynamics and eye gaze estimation.
1. Introduction
Eye gaze is one of the most important channels for people to interact with each other and with the visual world. Eye tracking has been applied to many fields, including psychological studies [1], social networks [2, 3, 4, 5], web search [6, 7, 8], marketing and advertising [9], and human-computer interaction [10, 11, 12]. In addition, since neurological activities affect the way we process visual information (reflected by eye movements), eye tracking has become one of the most effective tools for studying neuroscience. Estimated eye movements and gaze patterns can support attentional studies such as object-search mechanisms [6], the understanding of neurological functions during perceptual decision making [13], and the medical diagnosis of schizophrenia, post-concussive syndrome, autism, Fragile X, etc. Despite the importance of eye tracking to neuroscience, researchers have largely ignored the fact that neurological studies of the eyes can also benefit eye tracking. These studies reveal that eye movement is not a random process but involves strong dynamics. There exist common eye movement dynamics1 that are independent of the viewing content and the subject, and exploiting them can significantly improve the performance of eye tracking.

From neuroanatomy studies, there are several major types of eye movements2: vergence, saccade, fixation, and smooth pursuit. Vergence movements fixate on objects at different distances, with the two eyes moving in opposite directions. As vergence is less common in natural viewing scenarios, we mainly focus on fixation, saccade, and smooth pursuit. A saccade is a rapid eye movement from one fixation to another; its duration is short, and its amplitude is linearly correlated with its duration. There is also work on microsaccades [14], which are not the focus of this paper. A fixation keeps the gaze on the same object for a period of time; the eye movements are very small (miniature) and can be modeled as stationary or as a random walk. Smooth pursuit is an eye movement that smoothly tracks a slowly moving object; it cannot be triggered voluntarily and typically requires a moving object.
Existing work on eye gaze estimation (see [15] for a comprehensive survey) is static and frame-based, without explicitly considering the underlying dynamics. Model-based methods [16, 17, 18, 19, 20, 21, 22, 23, 24, 25] estimate eye gaze from a geometric 3D eye model by detecting key points of that model. In contrast, appearance-based methods [26, 27, 28, 29, 30, 31] directly learn a mapping function from eye appearance to eye gaze.
Unlike traditional static frame-based methods, we propose to estimate eye gaze with the help of eye movement dynamics. Since eye movement dynamics generalize across subjects and environments, the proposed method achieves better generalization. The system is illustrated in Fig. 1. For online eye tracking, the static gaze estimation network first estimates the raw gaze x_t from the input frame. Next, we combine the top-down eye movement dynamics with the bottom-up image measurements (Alg. 1) to obtain a more accurate prediction y_t. In addition, y_t is fed back to refine the static network so that we can better generalize to

1 In this work, eye movement refers to actual gaze movement on screens.
2 https://www.ncbi.nlm.nih.gov/books/NBK10991/
[Figure 1 diagram omitted. Pipeline: input video stream -> static gaze estimation network, x_t = f(I_t; w_{t-1}) -> eye gaze estimation (Alg. 1), y_t = g({x_i}_{i=t-k+1}^t, G(α)), using the dynamic gaze transition network G(α); the output gaze stream feeds back for model refinement (Alg. 2), updating w_t.]
Figure 1. Overview of the proposed system. For online eye tracking, we combine the static gaze estimation network with the dynamic gaze transition network to obtain better gaze estimation. In addition, the feedback mechanism of the system allows model refinement, so that we can better generalize the static network to unseen subjects or environments.
current user and environment (Alg. 2). The proposed method makes the following contributions:

• To the best of our knowledge, we are the first to exploit dynamic information to improve gaze estimation. Combining top-down eye movement dynamics with bottom-up image measurements gives better generalization and accuracy (15% improvement) and can automatically adapt to unseen subjects and environments.

• We propose the DGTN, which effectively captures the transitions between different eye movements as well as their underlying dynamics.

• We construct the DynamicGaze dataset, which not only provides another benchmark for evaluating static gaze estimation but also benefits the community for studying eye gaze and eye movement dynamics.
2. Related Work
Static eye gaze estimation. The most relevant work to
our static gaze estimation is from [27]. The authors proposed
to estimate gaze on mobile devices with face, eye and head
pose information using a deep convolutional neural network.
Though they can achieve good performance within-dataset,
they cannot generalize well to other datasets.
Eye gaze estimation with eye movement dynamics.
Eye movement is a spatial-temporal process. Most exist-
ing work only uses spatial eye movements, also known as
saliency map. In [32, 18, 33], the authors approximated
the spatial gaze distribution with the saliency map extracted
from image/video stimulus. However, their purpose is to
perform implicit personal calibration instead of improving
gaze estimation accuracy, since spatial saliency map is scene-
dependent. In [34], the authors used the fact that over 80%chance that first two fixations are on faces to help estimate
eye gaze. However, their approximation is too simple and
cannot apply to more natural scenarios.
For temporal eye movements, the authors in [35] pro-
posed to estimate the future gaze positions for recommender
systems with a Hidden Markov Model (HMM), where fixa-
tion is assumed to be a latent state, and user actions (clicking,
rating, dwell time, etc) are the observations. Their method is
however very much task-dependent and cannot generalize
to different tasks. In [36], the authors proposed to use a
similar HMM to predict gaze positions to reduce the delay
of networked video streaming. They also considered three
states corresponding to fixation, saccade, and smooth pursuit.
However, their approach ignores the different duration for
the three states, and their detailed modeling of the dynamics
for each state is relatively simpler. In addition, it requires
a commercial eye tracker, while the proposed method is an
appearance-based gaze estimator, which can perform on-
line real-time eye tracking with a simple web-camera. Fur-
thermore, the proposed method supports model-refinement
which can generalize to new subjects and environments.
Eye Movement Analysis. Besides eye tracking, there are
plenty of work on identifying the eye movement types given
eye tracking data. It includes threshold-based [37, 38] and
probabilistic-based [39, 40, 41]. Both methods require mea-
surements from eye tracking data like dispersion, velocity or
acceleration. Analyzing the underlying distribution of these
measurements can help identify the eye movement types.
However, these approaches are not interested in modeling
the gaze transitions for improving eye tracking.
3. Proposed Framework
We first discuss the eye movement dynamics and the
DGTN in Sec. 3.1. Next, we briefly introduce the static gaze
estimation network in Sec. 3.2. Then we talk about how to
perform online eye tracking with top-down eye movement
dynamics and bottom-up gaze measurements in Sec. 3.3.
Finally in Sec. 3.4, we focus on the refinement of the static
gaze estimation network.
Figure 2. Eye movement dynamics. (a) Illustration of eye movements while watching a video. (b) Graphical representation of the dynamic gaze transition network.
3.1. Eye Movement Dynamics and DGTN
We first take a look at the eye movements while watching
a video. As shown in Fig. 2 (a), the user is first attracted
by the motorcyclist on the sky. After spending some time
fixating on the motorcyclist, the user shifts the focus on the
recently appeared car (due to shooting angle change). A
saccade is in between of the two fixations. Next, the user
turns the focus back to the motorcyclist and starts following
the motion with smooth pursuit eye movement. We have
three observations regarding the eye movements: 1) each eye
movement has its own unique dynamic pattern, 2) different
eye movements have different durations, and 3) there exists
special transition patterns across different eye movements.
These observations inspire us to construct the dynamic model
shown in Fig. 2 (b) to model the overall gaze transitions.
Specifically, we employ the semi-Markov model to model
the durations for each eye movement type. In Fig. 2 (b), the
red curve on the top shows a sample gaze pattern with 3segments of fixation, saccade, and smooth pursuit respec-
tively. The top row represents the state chain st, where
st = {fix, sac, sp} can take three values corresponding
to fixation, saccade, and smooth pursuit respectively. Each
state can generate a sequence of true gaze positions {yt}dt=1,
where d represents the duration for the state. Though the
state st is constant for a long period, its value is copied for
all time slices within the state to ensure a regular structure.
The true gaze yt not only depends on the current state but
also depends on previous gaze positions. For example, the
moving direction for smooth pursuit is determined by sever-
al previous gaze positions. Given the true gaze yt, we can
generate the noisy measurements xt, which are the outputs
from the static gaze estimation methods.
In the following, we will discuss in details 1) within-
state dynamics (Sec. 3.1.1), 2) eye movement duration and
transition (Sec. 3.1.2), 3) measurement model (Sec. 3.1.3),
and 4) parameter learning (Sec. 3.1.4).
3.1.1 Within-state Dynamics
Figure 3. Visualization of eye movements. Top-left: 3D plot of x-y-t; top-right: projected 2D plot on the y-t plane; bottom-left: projected 2D plot on the x-t plane; bottom-right: projected 2D plot on the x-y plane.
Fixation. A fixation keeps the eye gaze on the same static object for a period of time (Fig. 3 (d)). We model it as a random walk: y_t = y_{t-1} + w_fix, where w_fix is zero-mean Gaussian noise with covariance matrix Σ_fix.
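A minimal simulation of this random-walk fixation model can be sketched as follows; Σ_fix here is an assumed toy value rather than a learned parameter:

```python
import numpy as np

# Fixation model y_t = y_{t-1} + w_fix, with w_fix ~ N(0, Sigma_fix).
# Sigma_fix is an illustrative placeholder (the paper learns it from data).
rng = np.random.default_rng(0)
Sigma_fix = np.diag([1e-4, 1e-4])  # tiny variance: miniature eye movements

def simulate_fixation(y0, n_steps, rng):
    """Generate one fixation segment as a 2-D Gaussian random walk."""
    ys = [np.asarray(y0, dtype=float)]
    for _ in range(n_steps):
        w_fix = rng.multivariate_normal(np.zeros(2), Sigma_fix)
        ys.append(ys[-1] + w_fix)
    return np.stack(ys)

traj = simulate_fixation([0.5, 0.5], 100, rng)  # gaze in normalized screen coords
```

Because the per-step noise is tiny, the simulated gaze stays near the fixation point, matching the "miniature movements" description above.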
Saccade. Typically, a saccade is a fast eye movement between two fixations. The trajectory is usually a straight line or a generalized exponential curve (Fig. 3). In this work, we approximate the trajectory with piece-wise linear functions. The first saccade point y_1 is the end point of the last fixation. Predicting the position of the second saccade point y_2 is difficult without knowing the image content. However, according to [42], horizontal saccades are more frequent than vertical saccades, which provides a strong cue to the second saccade point. Specifically, we assume the second point can be estimated by translating the first point by a certain amplitude and direction (angle) in the 2D plane: y_2 = y_1 + λ[cos(θ), sin(θ)]^T, where the amplitude λ ~ N(μ_λ, σ_λ) and the angle θ ~ N(μ_θ, σ_θ) both follow Gaussian distributions. The histogram plots of amplitude (Fig. 4 (a)) and angle (Fig. 4 (b)) from real data also validate the feasibility of the Gaussian assumption.
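Sampling the second saccade point under this model can be sketched as follows; the Gaussian parameters are illustrative placeholders, not values fitted to Fig. 4:

```python
import numpy as np

# Second saccade point: y_2 = y_1 + lambda * [cos(theta), sin(theta)]^T.
# mu/sigma values below are assumed for illustration only.
rng = np.random.default_rng(1)
mu_lam, sigma_lam = 120.0, 40.0                # amplitude in pixels (assumed)
mu_theta, sigma_theta = 0.0, np.deg2rad(25.0)  # angle near 0: horizontal bias (assumed)

def sample_second_saccade_point(y1, rng):
    lam = rng.normal(mu_lam, sigma_lam)        # lambda ~ N(mu_lambda, sigma_lambda)
    theta = rng.normal(mu_theta, sigma_theta)  # theta  ~ N(mu_theta, sigma_theta)
    return np.asarray(y1, dtype=float) + lam * np.array([np.cos(theta), np.sin(theta)])

y2 = sample_second_saccade_point([800.0, 600.0], rng)
```

Setting μ_θ near zero encodes the horizontal-saccade bias reported in [42]; in practice all four parameters would be estimated from the histograms in Fig. 4.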
Figure 4. Saccade characteristics. (a) Amplitude distribution, (b) Angle distribution, (c) Amplitude change from adjacent saccade points.
The remaining saccade points can be estimated from the previous two points: y_t = B_1^i y_{t-1} + B_2^i y_{t-2} + w_sac, where B_1^i and B_2^i are regression matrices and the superscript i indicates the index of the current saccade point, i.e., how many frames have passed since entering the state. The value of i equals the duration variable d in Eq. (1). It would be simpler to assume B_1^i and B_2^i remain the same for different indexes i, but saccade movements have particular characteristics. For example, as shown in Fig. 4 (c), the amplitude change between adjacent saccade points first increases and then decreases. Index-dependent regression matrices can better capture these underlying dynamics. w_sac is zero-mean Gaussian noise with covariance matrix Σ_sac.

Smooth Pursuit. Smooth pursuit keeps track of a slowly moving object, so we approximate the moving trajectory by piece-wise linear functions, similar to saccades. For the second smooth pursuit point, we introduce amplitude and angle variables {λ_sp, θ_sp}. For the remaining smooth pursuit points, we introduce index-dependent regression matrices: y_t = C_1^i y_{t-1} + C_2^i y_{t-2} + w_sp, where w_sp is zero-mean Gaussian noise with covariance matrix Σ_sp.
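A sketch of this second-order regression; for brevity the matrices are fixed to the constant-velocity choice C_1 = 2I, C_2 = -I rather than learned per index, so this only illustrates the recursion's form:

```python
import numpy as np

# Smooth-pursuit recursion y_t = C_1^i y_{t-1} + C_2^i y_{t-2} + w_sp.
# C1 = 2I, C2 = -I extrapolates a straight line at constant velocity;
# the paper instead learns an index-dependent pair (C_1^i, C_2^i).
rng = np.random.default_rng(2)
Sigma_sp = np.diag([1e-6, 1e-6])
C1, C2 = 2.0 * np.eye(2), -1.0 * np.eye(2)

def pursuit_step(y_prev, y_prev2, rng):
    w_sp = rng.multivariate_normal(np.zeros(2), Sigma_sp)
    return C1 @ y_prev + C2 @ y_prev2 + w_sp

ys = [np.array([0.0, 0.0]), np.array([0.01, 0.0])]  # first two pursuit points
for i in range(3, 20):  # i indexes frames since entering the state
    ys.append(pursuit_step(ys[-1], ys[-2], rng))
traj = np.stack(ys)
```

With the constant-velocity matrices the trajectory drifts steadily along the initial direction, which is the piece-wise linear behavior the text describes.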
3.1.2 Eye Movement Duration and Transition
The hidden semi-Markov model has been well studied in [43]; we adopt a similar formulation for state duration and transition modeling. Besides the random variables s_t, y_t, and x_t for the state, true gaze position, and measured gaze position, we introduce another discrete random variable d_t (with range {0, 1, ..., D}) representing the remaining duration of state s_t. The state s_t and the remaining duration d_t are discrete random variables and follow multinomial (categorical) distributions. The CPDs for the state transition are defined as follows:

P(s_t = j | s_{t-1} = i, d_t = d) = δ(i, j) if d > 0;  A(i, j) if d = 0
P(d_t = d' | d_{t-1} = d, s_t = k) = δ(d', d - 1) if d > 0;  p_k(d') if d = 0        (1)

where δ(i, j) = 1 if i = j and 0 otherwise. When we enter a new state s_t = i, the duration d_t is drawn from a prior multinomial distribution q_i(·) = [p_i(1), ..., p_i(D)]. The duration then counts down to 0. When d_t = 0, the state transits to a different state according to the state transition matrix A, and the duration for the new state is drawn again from q_i(·).
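The countdown semantics of Eq. (1) can be sketched by sampling a toy state chain; the transition matrix A and the duration priors are assumed values, not learned ones:

```python
import numpy as np

# Sampling the (state, duration) chain of Eq. (1).
# States 0/1/2 stand for fix/sac/sp; A and dur_prior are toy values.
rng = np.random.default_rng(3)
A = np.array([[0.0, 0.7, 0.3],   # no self-transitions: dwell time is handled by durations
              [0.6, 0.0, 0.4],
              [0.8, 0.2, 0.0]])
D_MAX = 10
dur_prior = np.full((3, D_MAX), 1.0 / D_MAX)  # q_i = [p_i(1), ..., p_i(D)], uniform here

def sample_chain(T, rng):
    s = int(rng.integers(3))
    d = int(rng.choice(D_MAX, p=dur_prior[s])) + 1  # duration drawn on entering the state
    seq = []
    for _ in range(T):
        seq.append(s)
        d -= 1
        if d == 0:                                   # countdown hit 0: transit via A
            s = int(rng.choice(3, p=A[s]))
            d = int(rng.choice(D_MAX, p=dur_prior[s])) + 1
    return seq

seq = sample_chain(200, rng)
```

Zeroing the diagonal of A makes the duration priors, rather than self-transitions, control how long each eye movement lasts, which is exactly the semi-Markov design choice motivated above.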
3.1.3 Measurement Model
The measurement model P(x_t | y_t) is independent of the type of eye movement, and we assume x_t = D y_t + w_n, where D is the regression matrix and w_n is multivariate zero-mean Gaussian noise with covariance matrix Σ_n.
3.1.4 Parameter Learning
The DGTN parameters are summarized in Table 1.
For simplicity, we denote all the parameters as α = [α_st, α_sd, α_fix, α_sac, α_sp, α_m], and the DGTN is represented as G(α). All the random variables in Fig. 2 (b) are observed during learning (the states and true gaze are not known during online gaze tracking). Given K fully observed sequences {s_t^k, y_t^k, x_t^k}_{t=1}^{T_k}, each of length T_k, we use maximum log-likelihood estimation for all the parameters:

α* = argmax_α log Π_{k=1}^K P({s_t^k, y_t^k, x_t^k}_{t=1}^{T_k} | α)
   = argmax_α Σ_{k=1}^K log Π_{t=1}^{T_k} Σ_{d_t^k} P(s_t^k, d_t^k) P(y_t^k | s_t^k, d_t^k) P(x_t^k | y_t^k)        (2)

With fully observed data, the above optimization problem factorizes into the following sub-problems, each of which can be solved independently:

α_m* = argmax_{α_m} Σ_{k=1}^K log Π_{t=1}^{T_k} P(x_t^k | y_t^k, α_m)        (3)

{α_st, α_sd}* = argmax_{α_st, α_sd} Σ_{k=1}^K log Π_{t=1}^{T_k} Σ_{d_t^k} P(s_t^k, d_t^k)        (4)

α_j* = argmax_{α_j} Σ_{n=1}^{N_j} log Π_{t=1}^{T_n} P(y_t^n | s_t^n = j, d_t^n = T_n, α_j),  ∀j ∈ {fix, sac, sp}        (5)
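As an example of this decomposition, the measurement sub-problem (3) has a closed form: with (y_t, x_t) fully observed and Gaussian noise, the MLE of D is ordinary least squares, and Σ_n is the residual covariance. A minimal numpy sketch on synthetic data (D_true and the noise level are assumed, standing in for annotated sequences):

```python
import numpy as np

# Closed-form solution of Eq. (3): fit x_t = D y_t + w_n by least squares.
rng = np.random.default_rng(4)
D_true = np.array([[0.9, 0.05],
                   [0.02, 1.1]])
Y = rng.uniform(0.0, 1.0, size=(500, 2))                 # true gaze y_t (synthetic)
X = Y @ D_true.T + 0.01 * rng.standard_normal((500, 2))  # measurements x_t

D_hat = np.linalg.lstsq(Y, X, rcond=None)[0].T           # MLE of D under Gaussian noise
resid = X - Y @ D_hat.T
Sigma_n = resid.T @ resid / len(Y)                       # MLE of the noise covariance
```

The state/duration sub-problem (4) likewise reduces to counting transitions and durations, and each α_j in (5) to a per-state linear regression.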
Table 1. Summary of model parameters.
  State transition α_st:  A
  State duration α_sd:    q_i = [p_i(1), ..., p_i(D_i)] for i ∈ {fix, sac, sp}
  Fixation α_fix:         Σ_fix
  Saccade α_sac:          {μ_λ, σ_λ, μ_θ, σ_θ}_sac, {B_1^i, B_2^i}_{i=3}^{D_sac}, Σ_sac
  Smooth pursuit α_sp:    {μ_λ, σ_λ, μ_θ, σ_θ}_sp, {C_1^i, C_2^i}_{i=3}^{D_sp}, Σ_sp
  Measurement α_m:        D, Σ_n
3.2. Static Eye Gaze Estimation
Figure 5. Architecture of static gaze estimation network.
The raw gaze measurement x_t is estimated with a standard deep convolutional neural network (Fig. 5) [44, 45]. The inputs are the left and right eye images (both of size 36 × 60) and the 6-dimensional head pose (rotation and translation: pitch, yaw, and roll angles, and x, y, z). The left- and right-eye branches share the weights of the convolutional layers. Each convolutional layer is followed by a max-pooling layer of size 2. ReLU is used as the activation of the fully connected layers. The detailed layer configurations are: CONV-R1, CONV-L1: 5 × 5/50; CONV-R2, CONV-L2: 5 × 5/100; FC-RT1: 512; FC-E1, FC-RT2: 256; FC-1: 500; FC-2: 300; FC-3: 100. For simplicity, we denote static gaze estimation as x_t = f(I_t; w), where I_t and w are the input frame and model parameters, respectively.
3.3. Online Eye Gaze Tracking
Traditional static methods output only the measured gaze x from the static gaze estimation network. In this work, we propose to output the true gaze y with the help of the DGTN:

y_t = argmax_{y_t} p(y_t | x_1, x_2, ..., x_t)
    = argmax_{y_t} ∫_{s_t} p(y_t, s_t | x_1, x_2, ..., x_t) ds_t        (6)

Solving Eq. (6) directly is intractable because of the marginalization over the hidden state. Alternatively, we first draw samples of the possible state s_t from its posterior ([43]). Given the state, gaze estimation becomes a standard inference problem for a linear dynamical system (LDS) or Kalman filter ([46]). The algorithm is summarized in Alg. 1.
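The estimate-and-average structure of Alg. 1 can be sketched as follows; for brevity, the per-state LDS/Kalman predictions are replaced by simple exponentially weighted means, so this illustrates only the sampling-and-averaging skeleton, not the paper's full inference:

```python
import numpy as np

# Skeleton of Alg. 1: each sampled state s_t^i selects a different
# estimator over the last k raw measurements; the final gaze averages
# the N per-sample predictions. The per-state smoothing factors are
# assumed stand-ins for the per-state LDS models.
rng = np.random.default_rng(5)

def ewma_estimate(xs, alpha):
    """Weighted estimate of the latest gaze; small alpha trusts the newest frame."""
    w = alpha ** np.arange(len(xs))[::-1]   # newest frame gets weight alpha^0 = 1
    return (w[:, None] * xs).sum(axis=0) / w.sum()

# 0=fixation (average the window), 1=saccade (trust the latest frame), 2=pursuit.
state_alpha = {0: 0.95, 1: 0.1, 2: 0.6}

def estimate_gaze(xs, state_samples):
    preds = [ewma_estimate(xs, state_alpha[s]) for s in state_samples]
    return np.mean(preds, axis=0)           # y_t ~ (1/N) sum_i y_t^i

xs = np.array([[0.40, 0.40], [0.41, 0.40], [0.42, 0.41]])  # last k raw gazes x_t
state_samples = rng.choice(3, size=20, p=[0.8, 0.1, 0.1])  # s_t^i from the posterior (assumed)
y_t = estimate_gaze(xs, state_samples)
```

Because every per-sample prediction is a convex combination of the raw measurements, the averaged output stays within the window's range while being pulled toward the behavior of the most probable state.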
3.4. Model Refinement
The static gaze estimation network is learned from subjects during the offline stage. It may not generalize well
Algorithm 1: Online eye tracking
while getting a new frame I_t do
  - Draw N samples of the state s_t from its posterior ([43]): s_t^i ~ P(s_t | x_{t-k}, ..., x_t), ∀i = 1, ..., N.
  - For each sampled state, use the corresponding LDS of Eq. (1) ([46]) to predict the true gaze: y_t^i = argmax_{y_t^i} P(y_t^i | x_{t-k}, ..., x_t, s_t^i), ∀i = 1, ..., N.
  - Average the results of the N samples: y_t ≈ (1/N) Σ_{i=1}^N y_t^i.
to new subjects or environments. We therefore propose to leverage the refined true gaze to refine the static gaze estimation network (its last two fully connected layers). The algorithm is given in Alg. 2. Note that we do not use the exact values of y; instead, we assume the temporal gaze distribution from the static network, p(x_t), matches the true gaze distribution, p(y_t). Similar to Fig. 3 (b) and (c), we treat the x-t curve and the y-t curve as two categorical distributions p = [p_1, ..., p_T], whose range is from 1 to T and whose value p_i equals the normalized gaze position. By minimizing the KL-divergence between the two gaze distributions, we gradually refine the parameters of the static network. The proposed algorithm may not be accurate in the beginning, but it runs incrementally and gives better predictions as more frames are collected.
Algorithm 2: Model refinement for the static gaze estimation network
1. Input: static gaze estimation network f(·) with initial parameters w_0.
2. while getting a new frame I_t do
  - Gather the last k true gaze points y_t = (a_t, b_t) from Alg. 1 and construct two categorical distributions for horizontal and vertical gaze: p_x = (1/Σ a_i)[a_{t-k}, ..., a_t], p_y = (1/Σ b_i)[b_{t-k}, ..., b_t].
  - Gather the last k raw gaze points (a_t, b_t) = f(I_t; w) and construct the bottom-up categorical distributions: q_x(w) = (1/Σ a_i)[a_{t-k}, ..., a_t], q_y(w) = (1/Σ b_i)[b_{t-k}, ..., b_t].
  - Update the static gaze estimation network: w_t = argmin_w D_KL(p_x || q_x(w)) + D_KL(p_y || q_y(w)), where D_KL(p || q) = Σ_i p(i) log(p(i)/q(i)).
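The refinement objective of Alg. 2 reduces to a KL-divergence between two normalized gaze histograms; a minimal sketch with illustrative gaze values (not data from the paper):

```python
import numpy as np

# Alg. 2 loss: normalize the last k horizontal gaze values from Alg. 1 (p)
# and from the static network (q) into categoricals, then D_KL(p || q).
def to_categorical(vals):
    vals = np.asarray(vals, dtype=float)
    return vals / vals.sum()

def kl_divergence(p, q, eps=1e-12):
    p = np.clip(p, eps, None)   # guard against log(0) / division by zero
    q = np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))

true_x = [0.30, 0.32, 0.35, 0.35, 0.36]  # refined horizontal gaze y from Alg. 1 (illustrative)
raw_x = [0.28, 0.35, 0.31, 0.38, 0.34]   # static-network output x (illustrative)
loss = kl_divergence(to_categorical(true_x), to_categorical(raw_x))
```

In the full algorithm this scalar would be minimized over the weights w of the last two fully connected layers, e.g. by gradient descent, with the same term added for the vertical gaze.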
4. DynamicGaze Dataset
Existing datasets for gaze estimation and eye movement
dynamics have little overlap. On one hand, gaze-related
benchmark datasets are all frame-based. Subjects are asked
to look at markers on the screen, where their face images
and groundtruth gaze are recorded. However, there are no
natural dynamic gaze patterns in the dataset. On the other
hand, eye movement related datasets focus on collecting data
while subjects watch natural video stimulus. Though the col-
lected data involves dynamics, there are no bottom-up image
measurements. To bridge the gap between these two fields,
we construct a new dataset which records both images and
groundtruth gaze positions while subjects perform natural
operations (browsing websites, watching videos). Clear eye
movement dynamics can be observed from the dataset.
To acquire the groundtruth gaze positions, we use a com-
mercial eye tracker which runs at the back-end. In the mean-
time, the front-facing camera of the laptop records the video
stream of the subjects. The video stream and the gaze stream
are synchronized during post-processing. The Tobii 4C eye
tracker gives less than 0.5 error after calibration, and we
believe the accuracy is sufficient to construct a dataset for
the webcam-based eye gaze tracking system.
4.1. Data collection procedure
We invite 15 male subjects and 5 female subjects, whose
age ranges from 20 to 30, to participate in the dataset con-
struction. We collected 3 sessions of data: 1) frame-based;
2) video-watching 3) website-browsing.
Frame-based. There are two purposes: 1) provide anoth-
er benchmark for static eye gaze estimation and 2) train our
generic static gaze estimation network. Subjects are asked
to look at some random moving objects on the screen, the
random moving objects are to ensure subjects’ gaze spread
on the entire screen. Each subject takes 3-6 trials at differ-
ent days, locations. We also ask subjects to sit at different
positions in front of the laptop to introduce more variations.
Finally, we end up with around 370000 valid frames.
Video-watching. The subjects are asked to watch 10 video stimuli (Tab. 2) from 3 eye tracking research datasets.
The collection procedure is similar to the previous session,
and we finally collect a total of around 145,000 valid frames.
Website-browsing. Similarly, subjects are asked to
browse websites freely on the laptop for around 5-6 minutes,
and a total of around 130,000 frames are collected.
4.2. Data visualization and statistics
Fig. 6 shows sample eye images from the 20 subjects.
There are occlusions like glasses and reflections. Fig. 7
shows the spatial gaze distributions on a monitor with resolution
2880×1620. For frame-based data, the gaze appears
uniformly distributed. For video-watching data, the gaze
Table 2. Information about the different video stimuli.
Dataset       Name                          Description
CRCNS [47]    1. saccadetest                Dots moving across the screen.
              2. beverly07                  People walking and running.
[48]          3. 01-car-pursuit             Car driving in a roundabout.
              4. 02-turning-car             Car turning around.
DIEM [49]     5. advert bbc4 bees           Flying bees on BBC logo.
              6. arctic bears               Arctic bears in the ocean.
              7. nightlife in mozambique    One crab hunting for fishes.
              8. pingpong no bodies         Pingpong bouncing around.
              9. sport barcelona extreme    Extreme sports cut.
              10. sport scramblers          Extreme sports for scramblers.
Figure 6. Sample eye images from the dataset.
Figure 7. Spatial gaze distributions (x / pixel vs. y / pixel) for the DynamicGaze dataset: (a) frame-based (pearsonr = 0.036), (b) video-watching (pearsonr = -0.05), (c) website-browsing (pearsonr = 0.075).
Figure 8. Sample dynamic gaze patterns (horizontal and vertical gaze position over time) from 8 subjects watching the same video.
appears center-biased, which is the most common pattern
when watching videos. Finally, for website-browsing, the
gaze pattern is focused on the left side of the screen mainly
due to the design of the website. Since the major goal of
the dataset is to explore gaze dynamics, we also take a look
at the dynamic gaze patterns from 8 subjects watching the
same video stimuli. As shown in Fig. 8, different subjects
share similar overall gaze patterns, though the exact values
of horizontal and vertical gaze positions are different.
5. Experiments and Analysis
For DGTN, the measurement model P (xt|yt) is learned
with the data from DynamicGaze, where we have both
groundtruth gaze yt and measured gaze from the static gaze
estimation network. The remaining part of the model is
learned with the data from CRCNS [47], where we have the
groundtruth state annotations st and the groundtruth gaze.
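The way DGTN combines a top-down dynamic prediction with the bottom-up measurement from the static network can be illustrated with a minimal Gaussian-fusion sketch. This is an assumption for illustration only: the actual measurement model P(xt|yt) is learned from DynamicGaze, and the paper's model is richer than a single Gaussian product.

```python
import numpy as np

def fuse_gaussian(pred_mean, pred_var, meas, meas_var):
    """Fuse a top-down gaze prediction with a bottom-up measurement.

    Assuming both the dynamic prediction over the true gaze yt and the
    measurement model P(xt | yt) are Gaussian, the posterior over yt is
    the precision-weighted combination of the two sources.
    """
    w = meas_var / (pred_var + meas_var)        # weight on the prediction
    post_mean = w * pred_mean + (1 - w) * meas  # posterior mean
    post_var = pred_var * meas_var / (pred_var + meas_var)
    return post_mean, post_var
```

With equal variances the posterior mean is simply the average of prediction and measurement, and the posterior variance is halved.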
CRCNS consists of 50 video clips and 235 valid eye move-
ment traces from 8 subjects. For the static gaze estimation
network, we use TensorFlow as our backend engine.
Fixation is modeled as a first-order LDS, while saccade and smooth pursuit
can be considered second-order LDSs; therefore the value
k in Alg. 1 is either 1 or 2. The value k in Alg. 2 is set to 50 (around 2 seconds of data), which is the amount of data used to update the
parameters of the static network. For overall gaze estimation,
the static gaze estimation (on a Tesla K40c GPU) takes less
than 1 ms, while the online part (Alg. 1) takes around 50-60 ms
on an Intel Xeon CPU E5-2620 v3 @ 2.4 GHz. In practice,
for real-time processing, the model refinement runs in a
separate thread from the gaze estimation thread.
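The order-k LDS prediction underlying the eye-movement dynamics above can be sketched as follows. The transition matrices here are placeholders, not the learned per-eye-movement-type parameters:

```python
import numpy as np

def lds_predict(history, coeffs):
    """One-step prediction of an order-k linear dynamic system.

    history : list of the k most recent gaze states, most recent last.
    coeffs  : list of k (2x2) transition matrices (placeholders; the actual
              matrices are learned per eye-movement type). Fixation uses
              k=1, while saccade and smooth pursuit use k=2, per the text.
    """
    k = len(coeffs)
    pred = np.zeros_like(history[-1], dtype=float)
    for i in range(1, k + 1):
        # coeffs[0] multiplies the most recent state, coeffs[1] the next, etc.
        pred += coeffs[i - 1] @ history[-i]
    return pred
```

A process-noise term would be added on top of this deterministic prediction in the full filtering formulation.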
The performance is evaluated using the angular error in
degrees. We first compute the Euclidean pixel error on the
monitor (2880×1620), which can be converted to a centimeter
error errd given the monitor dimensions. The angular error
is then approximated by erra = arctan(errd/tz), where tz is
the estimated depth of the subject's head relative to the camera.
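The pixel-to-angular conversion above can be written directly as code. The pixels-per-centimeter factor is an assumed helper quantity derived from the monitor dimensions:

```python
import numpy as np

def angular_error(pred_px, gt_px, px_per_cm, tz_cm):
    """Approximate angular gaze error in degrees.

    pred_px, gt_px : (N, 2) arrays of on-screen gaze points in pixels.
    px_per_cm      : pixels per centimeter of the monitor (assumed known
                     from the monitor's physical dimensions).
    tz_cm          : estimated head-to-camera depth in centimeters.
    """
    err_px = np.linalg.norm(pred_px - gt_px, axis=1)  # Euclidean pixel error
    err_cm = err_px / px_per_cm                       # centimeter error errd
    return np.degrees(np.arctan(err_cm / tz_cm))      # erra = arctan(errd/tz)
```

For example, a 50 cm on-screen error at 50 cm depth corresponds to arctan(1) = 45 degrees.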
5.1. Baseline for Static Gaze Estimation Network
Table 3. Comparison of different input data channels.
L R F L,R L,R,F L,R,P L,R,F,P
Error 5.38 5.27 5.56 4.70 5.29 4.27 4.47
We experiment with different input combinations. As
shown in Table 3, the symbols L, R, F, and P represent the left eye
image, right eye image, face image, and head pose, respectively.
Based on the results, we decide to use both eyes and the head
pose. To obtain the head pose, we perform offline detection of
the facial landmarks [50] and then solve the head pose
angles with a 3D shape model [51, 52]. Note that adding the
face is not helpful, since facial texture varies far more across
subjects than eye texture does, which makes it hard to generalize
to new subjects. In addition, adding the face may significantly
increase the inference time.
5.2. Evaluation on Different Model Components
The proposed model consists of two major components:
1) gaze estimation with eye movement dynamics and 2)
a refinement model to better fit the current users/environments.
To study the contribution of each component, we compare the
following three variants of the proposed model:
• Static: this model outputs the raw gaze prediction x
and serves as the baseline.
• EMD (Eye movement dynamics): this model only uses
eye movement dynamics (Alg. 1) without model
refinement and outputs the true gaze prediction y.
• Full: this is our full model containing both eye movement
dynamics and model refinement.
Figure 9. Gaze estimation error (degrees) for all 20 subjects. (a) video-watching: Static avg = 5.34, EMD avg = 4.97, Full avg = 4.65. (b) website-browsing: Static avg = 4.97, EMD avg = 4.58, Full avg = 4.07.
We perform cross-subject evaluation, and Fig. 9 shows
the performance of the 3 models. First, the Full model
shows improved performance over the Static model for most
subjects. The average estimation error reduces from 5.34
degrees to 4.65 degrees ((pitch, yaw) = (2.67, 3.81); a 13%
improvement) for video-watching and from 4.97 degrees to 4.07
degrees ((pitch, yaw) = (2.23, 3.41); an 18% improvement)
for web-browsing. Second, comparing EMD (gray bar) with
Static (black bar), we always achieve better results for
both scenarios, demonstrating the importance of incorporating
dynamics, especially in practical scenarios where the user's
gaze patterns have strong dynamics. The average improvements
with eye movement dynamics are 6.9% and 7.9% for
video-watching and website-browsing respectively. Third,
the difference between Full (white bar) and EMD (gray
bar) demonstrates the effect of Model Refinement. We can
clearly observe that the Static model cannot generalize well
to some subjects. With Model Refinement, we significantly
reduce the error for some subjects (e.g. Subjects 6, 15, 16, and 18
in video-watching and Subjects 15, 16, and 18 in website-browsing).
We also observe that model refinement may not always help;
it may increase the error for some subjects (e.g. Subjects 4, 5, and 7
in video-watching). On average, Model Refinement
improves the error by 6.4% and 11.2% for video-watching and website-browsing
respectively. Overall, both components help
reduce the error of eye gaze estimation, and combining the
two further reduces the error.
5.3. Performance of gaze estimation over time
Fig. 10 shows the gaze estimation error over time. The error
is averaged over all subjects across their first 8000 frames.
For both scenarios, the improvement during the first period of
time is small (the error sometimes even increases), but the
improvement gradually becomes more significant as more data arrive.
Figure 10. Gaze estimation error (degrees) over the first 8000 frames for (a) video-watching and (b) website-browsing. The red curve represents the error from the Static model, the green curve the error from the Full model, and the third curve the reduced error (static − dynamic).
This demonstrates that with enough frames, the proposed
method can significantly improve the accuracy of eye gaze
estimation.
5.4. Comparison with different dynamic models
Table 4. Average error of all subjects with different dynamic models.
        Static  Mean  Median  LDS   s-LDS  RNN   Ours
Video   5.34    5.18  5.16    5.20  5.14   5.15  4.97
Web     4.97    4.85  4.84    4.70  4.66   4.71  4.58
In this experiment, we compare with several baseline
dynamic models. The experimental results are illustrated in
Table 4. First, we find that incorporating dynamics outperforms
the static method; even simple mean/median filters
can improve the results. The LDS model, trained on the entire
sequence without considering different eye movement
types, does not give good results. Once different
eye movement types are considered, the switching LDS improves the
results even without duration modeling. The RNN [53, 54] gives
reasonably good results but ignores the characteristics of
different eye movements, and therefore our proposed method
still outperforms it. Overall, we believe the proposed
dynamic modeling can better explain the underlying eye
movement dynamics and help improve the accuracy of eye
gaze estimation.
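For reference, the mean/median filter baselines in Table 4 amount to simple temporal smoothing of the raw gaze trajectory. A minimal median-filter sketch follows; the window length is a hypothetical choice, not reported in the text:

```python
import numpy as np

def median_filter_gaze(gaze, win=5):
    """Sliding-window median filter over a gaze trajectory.

    gaze : (T, 2) array of raw per-frame gaze estimates.
    win  : odd window size (hypothetical; the paper does not report
           the window length used for its mean/median baselines).
    """
    half = win // 2
    # Pad with edge values so the output length matches the input.
    padded = np.pad(gaze, ((half, half), (0, 0)), mode="edge")
    out = np.empty_like(gaze, dtype=float)
    for t in range(len(gaze)):
        out[t] = np.median(padded[t:t + win], axis=0)
    return out
```

Such a filter suppresses isolated outlier frames but, unlike the proposed model, treats all eye movement types identically.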
5.5. Comparison with state-of-the-art
We compare with the state-of-the-art appearance-based
method [27] for both within-dataset and cross-dataset ex-
periments. Specifically, we re-implement the model in [27]
using TensorFlow, following the same architecture and
architecture-related hyperparameters. For the training-related
Table 5. Comparison with state-of-the-art.
                                   Within-dataset      Cross-dataset
Exp.                               Video    Website    Video    Website
1. Static network (ours)           5.34     4.97       9.12     9.65
2. Static network ([27])           4.97     4.86       8.73     9.17
3. Static network (ours) + DGTN    4.65     4.07       7.15     7.87
4. Static network ([27]) + DGTN    4.51     4.00       7.05     7.59
hyperparameters (e.g. learning rate, epochs), we do not
follow those in [27] but adjust them based on cross-validation.
For within-dataset experiments, the two models are
trained on the frame-based data from DynamicGaze and
are tested on web and video data from DynamicGaze. For
cross-dataset experiments, the two models are trained with
data from EyeDiap ([55]) and are tested on web and video
data from DynamicGaze.
The results are shown in Table 5. We have the following
observations: 1) Comparing Exp. 1 and Exp. 2, we can see that
both static networks give reasonable accuracy, and the more
complex one ([27]) gives better performance than ours; 2)
comparing Exp. 1 and Exp. 3, adding DGTN to the static network
significantly reduces the gaze estimation error; 3) similarly,
comparing Exp. 2 and Exp. 4, adding the DGTN module to the
state-of-the-art static network still achieves better performance; 4)
the improvement in the cross-dataset setting is more significant
than in the within-dataset case, demonstrating better generalization
by incorporating eye movement dynamics; 5) comparing
Exp. 2 and Exp. 3, we find that our proposed method (Exp. 3)
outperforms the current state-of-the-art (Exp. 2), especially
in the cross-dataset case.
6. Conclusion
In this paper, we propose to leverage eye movement
dynamics to improve eye gaze estimation. By analyzing the
eye movement patterns when naturally interacting with the
computer, we construct a dynamic gaze transition network
that captures the underlying dynamics of fixation, saccade,
smooth pursuit, as well as their durations and transitions.
Combining the top-down gaze transition prior from DGTN
with the bottom-up gaze measurements from the deep model,
we can significantly improve the eye tracking performance.
Furthermore, the proposed method allows online model re-
finement which helps generalize to unseen subjects or new
environments. Quantitative results demonstrate the effec-
tiveness of the proposed method and the significance of
incorporating eye movement dynamics into eye tracking.
Acknowledgments: The work described in this paper
is supported in part by NSF award (IIS 1539012) and by
RPI-IBM Cognitive Immersive Systems Laboratory (CISL),
a center in IBM’s AI Horizon Network.
References
[1] A. L. Yarbus, “Eye movements during perception of complex objects,”
in Eye movements and vision, pp. 171–211, Springer, 1967. 1
[2] W. A. W. Adnan, W. N. H. Hassan, N. Abdullah, and J. Taslim, “Eye
tracking analysis of user behavior in online social networks,” in Inter-
national Conference on Online Communities and Social Computing,
pp. 113–119, Springer, 2013. 1
[3] G.-J. Qi, C. C. Aggarwal, and T. S. Huang, “Online community detec-
tion in social sensing,” in Proceedings of the sixth ACM international
conference on Web search and data mining, pp. 617–626, ACM, 2013.
1
[4] J. Tang, X. Shu, G.-J. Qi, Z. Li, M. Wang, S. Yan, and R. Jain,
“Tri-clustered tensor completion for social-aware image tag refinement,”
IEEE transactions on pattern analysis and machine intelligence,
vol. 39, no. 8, pp. 1662–1674, 2017. 1
[5] G.-J. Qi, C. C. Aggarwal, and T. Huang, “Link prediction across
networks by biased cross-network sampling,” in 2013 IEEE 29th
International Conference on Data Engineering (ICDE), pp. 793–804,
IEEE, 2013. 1
[6] J. H. Goldberg, M. J. Stimson, M. Lewenstein, N. Scott, and A. M.
Wichansky, “Eye tracking in web search tasks: design implications,”
in Proceedings of the 2002 symposium on Eye tracking research &
applications, pp. 51–58, ACM, 2002. 1
[7] X. Wang, T. Zhang, G.-J. Qi, J. Tang, and J. Wang, “Supervised quan-
tization for similarity search,” in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pp. 2018–2026, 2016.
1
[8] S. Chang, G.-J. Qi, C. C. Aggarwal, J. Zhou, M. Wang, and T. S.
Huang, “Factorized similarity learning in networks,” in 2014 IEEE
International Conference on Data Mining, pp. 60–69, IEEE, 2014. 1
[9] C. H. Morimoto and M. R. Mimica, “Eye gaze tracking techniques for
interactive applications,” Computer vision and image understanding,
vol. 98, no. 1, pp. 4–24, 2005. 1
[10] K. Wang, R. Zhao, and Q. Ji, “Human computer interaction with head
pose, eye gaze and body gestures,” in 2018 13th IEEE International
Conference on Automatic Face & Gesture Recognition (FG 2018),
pp. 789–789, IEEE, 2018. 1
[11] R. Zhao, K. Wang, R. Divekar, R. Rouhani, H. Su, and Q. Ji, “An
immersive system with multi-modal human-computer interaction,”
in 2018 13th IEEE International Conference on Automatic Face &
Gesture Recognition (FG 2018), pp. 517–524, IEEE, 2018. 1
[12] R. R. Divekar, M. Peveler, R. Rouhani, R. Zhao, J. O. Kephart,
D. Allen, K. Wang, Q. Ji, and H. Su, “Cira: An architecture for
building configurable immersive smart-rooms,” in Proceedings of SAI
Intelligent Systems Conference, pp. 76–95, Springer, 2018. 1
[13] S. Fiedler and A. Glockner, “The dynamics of decision making in
risky choice: An eye-tracking analysis,” Frontiers in psychology,
vol. 3, p. 335, 2012. 1
[14] S. Martinez-Conde, J. Otero-Millan, and S. L. Macknik, “The impact
of microsaccades on vision: towards a unified theory of saccadic
function,” Nature Reviews Neuroscience, vol. 14, no. 2, p. 83, 2013. 1
[15] D. Hansen and Q. Ji, “In the eye of the beholder: A survey of models
for eyes and gaze,” 2010. 1
[16] K. Wang and Q. Ji, “3d gaze estimation without explicit personal
calibration,” Pattern Recognition, 2018. 1
[17] D. Beymer and M. Flickner, “Eye gaze tracking using an active stereo
head,” in Computer vision and pattern recognition, 2003. Proceedings.
2003 IEEE computer society conference on, vol. 2, pp. II–451, IEEE,
2003. 1
[18] K. Wang, S. Wang, and Q. Ji, “Deep eye fixation map learning for
calibration-free eye gaze tracking,” in Proceedings of the Ninth Bi-
ennial ACM Symposium on Eye Tracking Research & Applications,
pp. 47–55, ACM, 2016. 1, 2
[19] E. D. Guestrin and M. Eizenman, “General theory of remote gaze
estimation using the pupil center and corneal reflections,” IEEE Trans-
actions on biomedical engineering, vol. 53, no. 6, pp. 1124–1133,
2006. 1
[20] K. Wang and Q. Ji, “Hybrid model and appearance based eye tracking
with kinect,” in Proceedings of the Ninth Biennial ACM Symposium
on Eye Tracking Research & Applications, pp. 331–332, ACM, 2016.
1
[21] X. Xiong, Q. Cai, Z. Liu, and Z. Zhang, “Eye gaze tracking using an
rgbd camera: A comparison with a rgb solution,” UBICOMP, 2014. 1
[22] K. Wang and Q. Ji, “Real time eye gaze tracking with 3d deformable
eye-face model,” in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pp. 1003–1011, 2017. 1
[23] L. Jianfeng and L. Shigang, “Eye-model-based gaze estimation by
rgb-d camera,” in CVPR Workshops, 2014. 1
[24] K. Wang and Q. Ji, “Real time eye gaze tracking with kinect,” in
Pattern Recognition (ICPR), 2016 23rd International Conference on,
pp. 2752–2757, IEEE, 2016. 1
[25] K. Wang, R. Zhao, and Q. Ji, “A hierarchical generative model for
eye image synthesis and eye gaze estimation,” in Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 440–448, 2018. 1
[26] X. Zhang, Y. Sugano, M. Fritz, and A. Bulling, “Appearance-based
gaze estimation in the wild,” in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pp. 4511–4520, 2015.
1
[27] K. Krafka, A. Khosla, P. Kellnhofer, H. Kannan, S. Bhandarkar,
W. Matusik, and A. Torralba, “Eye tracking for everyone,” in Pro-
ceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 2176–2184, 2016. 1, 2, 8
[28] Q. Huang, A. Veeraraghavan, and A. Sabharwal, “Tabletgaze: dataset
and analysis for unconstrained appearance-based gaze estimation in
mobile tablets,” Machine Vision and Applications, vol. 28, no. 5-6,
pp. 445–461, 2017. 1
[29] T. Fischer, H. Jin Chang, and Y. Demiris, “Rt-gene: Real-time eye
gaze estimation in natural environments,” in Proceedings of the Eu-
ropean Conference on Computer Vision (ECCV), pp. 334–352, 2018.
1
[30] Y. Cheng, F. Lu, and X. Zhang, “Appearance-based gaze estimation
via evaluation-guided asymmetric regression,” in Proceedings of the
European Conference on Computer Vision (ECCV), pp. 100–115,
2018. 1
[31] K. Wang, R. Zhao, H. Su, and Q. Ji, “Generalizing eye tracking with
bayesian adversarial learning,” in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, 2019. 1
[32] Y. Sugano, Y. Matsushita, and Y. Sato, “Appearance-based gaze esti-
mation using visual saliency,” IEEE transactions on pattern analysis
and machine intelligence, vol. 35, no. 2, pp. 329–341, 2013. 2
[33] J. Chen and Q. Ji, “A probabilistic approach to online eye gaze track-
ing without explicit personal calibration,” IEEE Transactions on Im-
age Processing, vol. 24, no. 3, pp. 1076–1086, 2015. 2
[34] M. Cerf, J. Harel, W. Einhauser, and C. Koch, “Predicting human gaze
using low-level saliency combined with face detection,” in Advances
in neural information processing systems, pp. 241–248, 2008. 2
[35] Q. Zhao, S. Chang, F. M. Harper, and J. A. Konstan, “Gaze predic-
tion for recommender systems,” in Proceedings of the 10th ACM
Conference on Recommender Systems, pp. 131–138, ACM, 2016. 2
[36] Y. Feng, G. Cheung, W.-t. Tan, and Y. Ji, “Hidden markov model for
eye gaze prediction in networked video streaming,” in Multimedia
and Expo (ICME), 2011 IEEE International Conference on, pp. 1–6,
IEEE, 2011. 2
[37] A. T. Duchowski, “Eye tracking methodology,” Theory and practice,
vol. 328, 2007. 2
[38] M. Nystrom and K. Holmqvist, “An adaptive algorithm for fixation,
saccade, and glissade detection in eyetracking data,” 2010. 2
[39] E. Tafaj, G. Kasneci, W. Rosenstiel, and M. Bogdan, “Bayesian online
clustering of eye movement data,” in Proceedings of the Symposium
on Eye Tracking Research and Applications, pp. 285–288, ACM,
2012. 2
[40] L. Larsson, M. Nystrom, R. Andersson, and M. Stridh, “Detection of
fixations and smooth pursuit movements in high-speed eye-tracking
data,” Biomedical Signal Processing and Control, vol. 18, pp. 145–
152, 2015. 2
[41] T. Santini, W. Fuhl, T. Kubler, and E. Kasneci, “Bayesian identifi-
cation of fixations, saccades, and smooth pursuits,” in Proceedings
of the Ninth Biennial ACM Symposium on Eye Tracking Research &
Applications, pp. 163–170, ACM, 2016. 2
[42] O. Le Meur and Z. Liu, “Saccadic model of eye movements for free-
viewing condition,” Vision research, vol. 116, pp. 152–164, 2015.
3
[43] K. P. Murphy, “Hidden semi-markov models (hsmms),” 2002. 4, 5
[44] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification
with deep convolutional neural networks,” in Advances in neural
information processing systems, pp. 1097–1105, 2012. 5
[45] T. Zhang, G.-J. Qi, B. Xiao, and J. Wang, “Interleaved group con-
volutions,” in Proceedings of the IEEE International Conference on
Computer Vision, pp. 4373–4382, 2017. 5
[46] K. P. Murphy and S. Russell, “Dynamic bayesian networks: represen-
tation, inference and learning,” 2002. 5
[47] R. Carmi and L. Itti, “The role of memory in guiding attention during
natural vision,” Journal of vision, vol. 6, no. 9, pp. 4–4, 2006. 6
[48] K. Kurzhals, C. F. Bopp, J. Bassler, F. Ebinger, and D. Weiskopf,
“Benchmark data for evaluating visualization and analysis techniques
for eye tracking for video stimuli,” in Proceedings of the fifth work-
shop on beyond time and errors: novel evaluation methods for visual-
ization, pp. 54–60, ACM, 2014. 6
[49] P. K. Mital, T. J. Smith, R. L. Hill, and J. M. Henderson, “Clustering of
gaze during dynamic scene viewing is predicted by motion,” Cognitive
Computation, vol. 3, no. 1, pp. 5–24, 2011. 6
[50] A. Bulat and G. Tzimiropoulos, “How far are we from solving the
2d & 3d face alignment problem?(and a dataset of 230,000 3d facial
landmarks),” in International Conference on Computer Vision, vol. 1,
p. 4, 2017. 7
[51] E. Murphy-Chutorian and M. M. Trivedi, “Head pose estimation in
computer vision: A survey,” IEEE transactions on pattern analysis
and machine intelligence, vol. 31, no. 4, pp. 607–626, 2009. 7
[52] K. Wang, Y. Wu, and Q. Ji, “Head pose estimation on low-quality
images,” in 2018 13th IEEE International Conference on Automatic
Face & Gesture Recognition (FG 2018), pp. 540–547, IEEE, 2018. 7
[53] T. Mikolov, M. Karafiat, L. Burget, J. Cernocky, and S. Khudanpur,
“Recurrent neural network based language model,” in Eleventh annual
conference of the international speech communication association,
2010. 8
[54] H. Hu and G.-J. Qi, “State-frequency memory recurrent neural net-
works,” in Proceedings of the 34th International Conference on Ma-
chine Learning-Volume 70, pp. 1568–1577, JMLR. org, 2017. 8
[55] K. A. F. Mora, F. Monay, and J.-M. Odobez, “Eyediap: A database for
the development and evaluation of gaze estimation algorithms from
rgb and rgb-d cameras,” in Proceedings of the Symposium on Eye
Tracking Research and Applications, pp. 255–258, ACM, 2014. 8
An Adversarial Hierarchical Hidden Markov Model for Human Pose Modeling and Generation
Rui Zhao, Qiang Ji
Rensselaer Polytechnic Institute
Troy, NY, USA
{zhaor,jiq}@rpi.edu
Abstract
We propose a hierarchical extension to the hidden Markov model (HMM) under the Bayesian framework to overcome its limited model capacity. The model parameters are treated as random variables whose distributions are governed by hyperparameters. Therefore the variation in data can be modeled at both the instance level and the distribution level. We derive a novel learning method for estimating the parameters and hyperparameters of our model based on the adversarial learning framework, which has shown promising results in generating photorealistic images and videos. We demonstrate the benefit of the proposed method on human motion capture data through comparison with both state-of-the-art methods and the same model learned by maximizing likelihood. The first experiment on reconstruction shows the model's capability of generalizing to novel testing data. The second experiment on synthesis shows the model's capability of generating realistic and diverse data.
Introduction

In recent years, generative dynamic models have attracted a lot of attention due to their potential for learning representations from unlabeled sequential data as well as their capability of data generation (Gan et al. 2015; Srivastava, Mansimov, and Salakhudinov 2015; Mittelman et al. 2014; Xue et al. 2016; Walker et al. 2016). Sequential data introduce additional challenges for modeling due to temporal dependencies and significant intra-class variation. Consider human action as an example. Even though the underlying dynamic pattern remains similar for the same type of action, the actual pose and speed vary for different people. Even if the same person performs the action repeatedly, there will be noticeable differences. This motivates us to design a probabilistic dynamic model that not only can capture the consistent dynamic pattern across different data instances, but also can accommodate the variation therein.
A widely used dynamic model such as the HMM models a dynamic process through transitions among different discrete states. In order to encode N bits of information, an HMM needs 2^N states. Therefore, the model complexity increases exponentially with the model capacity. A linear dynamic system (LDS) uses continuous states to capture dynamics, which avoids the exponential increase of model complexity. However,
Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
the LDS assumes the underlying dynamics can be described by a linear model, which may not be sufficient for cases like human motion data. On the other hand, more complex models such as recurrent neural network (RNN) based deep models often have an exceedingly large number of parameters. Without a sufficiently large amount of data or careful regularization, training such a model is prone to overfitting. In addition, the model is deterministic. Simply reducing the model complexity compromises the capability of capturing the randomness and variation present in the data. We instead propose a hierarchical HMM (HHMM), which extends the shallow HMM by leveraging the Bayesian framework. The proposed HHMM allows model parameters to vary as random variables among data instances. Given the same number of parameters, the HHMM has a much larger capacity compared to the HMM. Besides, the HHMM retains the inference methods available for the HMM, allowing us to do various inference tasks efficiently. Finally, as a probabilistic generative model, the HHMM can capture spatio-temporal dependencies in the dynamic process and model variations in a principled way.
As for model learning, the maximum likelihood estimate (MLE) has been the de facto learning objective for probabilistic generative models. Despite its wide adoption, MLE tends to fit a diffused distribution on the data (Theis, Oord, and Bethge 2015). For static image synthesis, the results often look blurred. Recently, adversarial learning has emerged as a popular criterion for learning generative models. Variants of generative adversarial networks (GAN) (Goodfellow et al. 2014; Radford, Metz, and Chintala 2015; Reed et al. 2016; Nowozin, Cseke, and Tomioka 2016; Nguyen et al. 2017) show promising results in generating both sharp and realistic-looking images of faces, objects, and indoor/outdoor scenes. There is also an increasing interest in extending the framework to dynamic data (Vondrick, Pirsiavash, and Torralba 2016; Saito, Matsumoto, and Saito 2017; Tulyakov et al. 2017). In this work, we explore the idea of training the proposed HHMM using an adversarial objective, which has two major benefits. First, it bypasses the intractable objective of MLE in a hierarchical model, where the integration over parameters introduces additional dependencies among random variables. Second, it aims at learning a model that can generate realistic-looking data. Following the adversarial learning framework, we introduce a separate discriminative dynamic model to guide the learning of the HHMM, which serves
The Thirty-Second AAAI Conferenceon Artificial Intelligence (AAAI-18)
as the generator. While the generator tries to generate data that look as realistic as possible, the discriminator tries to classify the generated data as fake. The two models compete against each other in order to reach an equilibrium. We derive a gradient ascent based optimization method for updating the parameters of both models. To the best of our knowledge, this is the first work that exploits adversarial learning for modeling dynamic data with a fully probabilistic generator and discriminator.
Related work

Probabilistic dynamic models. HMM and its variants (Rabiner 1989; Fine, Singer, and Tishby 1998; Brand, Oliver, and Pentland 1997; Ghahramani, Jordan, and Smyth 1997; Yu 2010) are widely used to model sequential data, where the dynamics change according to transitions among different discrete states. The observations are then emitted from a state-dependent distribution. The state can also be continuous, as modeled in the LDS, which is also known as the Kalman filter (Kalman and others 1960). In a more general formulation, both the HMM and the LDS can be considered special variants of dynamic Bayesian networks (DBN) (Murphy 2002). Our model expands the model capacity through its hierarchical structure instead of increasing the complexity, which is ineffective for the HMM. With the enhanced model capacity, our model can better accommodate the variation and non-linearity of the dynamics. Another major type of dynamic model consists of undirected graphical models such as temporal extensions of the restricted Boltzmann machine (RBM) (Taylor, Hinton, and Roweis 2006; Sutskever and Hinton 2007; Mittelman et al. 2014) and the dynamic conditional random field (DCRF) (Sutton, McCallum, and Rohanimanesh 2007; Tang, Fei-Fei, and Koller 2012). While the RBM can capture non-linearity and expand capacity through vectorized hidden states, its learning requires approximating an intractable partition function, and the choice of hidden state dimension may not be trivial. The DCRF model is trained discriminatively given class labels and is not suitable for the data generation task.
More recently, models that combine the probabilistic framework with deterministic models such as neural networks (NN) have been proposed. (Krishnan, Shalit, and Sontag 2015) proposed deep Kalman filters, which use a NN to parameterize the transition and emission probabilities. (Johnson et al. 2016) used a variational autoencoder to specify the emission distribution of a switching LDS. (Gan et al. 2015) proposed the deep temporal sigmoid belief network (TSBN), where the hidden nodes are binary and their conditional distributions are specified by sigmoid functions. Variants of RNNs with additional stochastic nodes have been introduced to improve the capability of modeling randomness (Bayer and Osendorfer 2014; Chung et al. 2015; Fraccaro et al. 2016). To better account for intra-class variation, (Wang, Fleet, and Hertzmann 2008) modeled dynamics using a Gaussian process, where the uncertainty is handled by marginalizing out the parameter space imposed with a Gaussian process prior. (Joshi et al. 2017) proposed a Bayesian NN which can adapt to subject-dependent variation for action recognition. Deep learning based models typically require a large amount of training data. For smaller datasets, careful regularization or other auxiliary techniques such as data augmentation, pre-training, drop-out, etc., are needed. In contrast, our HHMM has built-in regularization through the hyperparameters learned using all the intra-class data, and it is less prone to overfitting. Besides, the HHMM can handle missing data, as the probabilistic inference can be carried out in the absence of some observations. Furthermore, the HHMM is easier to interpret, as its nodes are associated with semantic meanings.
Learning methods of dynamic models. Maximum likelihood learning is widely used to obtain point estimates of model parameters. For models with a tractable likelihood function, numerical optimization techniques such as gradient ascent can be used to maximize the likelihood function directly with respect to the parameters. In general, for models with hidden variables, whose values are always unknown during training, expectation maximization (EM) (Dempster, Laird, and Rubin 1977) is often used, which optimizes a tight lower bound of the model log-likelihood. Bayesian parameter estimation can also be used as an alternative to MLE when prior information on the parameters needs to be incorporated, resulting in a maximum a posteriori (MAP) estimate. For instance, (Brand and Hertzmann 2000) introduced a prior on HMM parameters to encourage a smaller cross entropy between a specific stylistic motion model and a generic motion model. When the goal is to classify data into different categories, a generative dynamic model can also be learned with discriminative criteria, such as maximizing the conditional likelihood of one of the categories (Wang and Ji 2012). Our work provides another objective for learning generative dynamic models by adopting the adversarial learning framework: the generative model has to compete against another discriminative model in order to fit the data distribution well. An important difference of our method from existing adversarially learned dynamic models like TGAN (Saito, Matsumoto, and Saito 2017) is that both our generator and discriminator are fully probabilistic models which explicitly model the variation of the data distribution.
Methods

We first present the proposed dynamic model. Then we briefly review the adversarial learning framework and describe the learning algorithm in detail. Finally, we discuss the inference methods used for various tasks.
Bayesian Hierarchical HMM

We now describe the proposed HHMM, which models the dynamics and variation of data at two levels. First, the random variables capture the spatial distribution and temporal evolution of dynamic data. Second, the parameters specifying the model are themselves treated as random variables with prior distributions. Note that the term HHMM was first used in (Fine, Singer, and Tishby 1998), where the hierarchy is applied only to hidden or observed nodes with fixed parameters in order to model multi-scale structure in data. Our model instead constructs the hierarchy within a Bayesian framework in order to handle large variation in data. Specifically, we define X = {X_1, ..., X_T} as the sequence of observations and Z = {Z_1, ..., Z_T} as the hidden state chain. The joint distribution of the HHMM under the
Figure 1: Topology of the HHMM, in plate notation. T is the length of a sequence, N the number of sequences, and Q the number of hidden states. The self-edge on Z_t denotes the temporal link from Z_{t-1} to Z_t. Circle-shaped nodes represent variables; diamond-shaped nodes represent parameters or hyperparameters.
first-order Markov assumption is given by
$$P(X, Z, \theta \mid \bar{\alpha}) = P(Z_1 \mid \pi) \prod_{t=2}^{T} P(Z_t \mid Z_{t-1}, A) \prod_{t=1}^{T} P(X_t \mid Z_t, \mu, \Sigma) \; P(A \mid \eta) \, P(\mu \mid \lambda) \quad (1)$$

where π is a stochastic vector specifying the initial state distribution, i.e., P(Z_1 = i) = π_i. A is a stochastic matrix whose i-th row specifies the probabilities of transitioning from state i to the other states, i.e., P(Z_t = j | Z_{t-1} = i) = A_ij. μ and Σ are the emission distribution parameters. We use Gaussian emissions since the observations are continuous, i.e., P(X_t | Z_t = i) = N(μ_i, Σ_i), and assume diagonal covariance matrices. θ = {A, μ} and θ̄ = {π, Σ} are the model parameters, and α = {η, λ} are the model hyperparameters. We denote ᾱ = {α, θ̄} = {η, λ, π, Σ} as the augmented set of hyperparameters obtained by including θ̄. The model topology is shown in Figure 1.
We use conjugate priors for θ. Specifically, we place a Dirichlet prior on the transition parameter A with hyperparameter η, and a Normal prior on the emission mean μ with hyperparameter λ = {μ_0, Σ_0}.
$$P(A_{i:} \mid \eta_i) \propto \prod_{j=1}^{Q} A_{ij}^{\eta_{ij}-1}, \qquad P(\mu_i \mid \lambda) \propto \exp\Big(-\tfrac{1}{2}(\mu_i - \mu_{i0})^{\top} \Sigma_{i0}^{-1} (\mu_i - \mu_{i0})\Big)$$

where i = 1, ..., Q, η_ij > 0, μ_i0 ∈ R^O, and Σ_i0 ∈ R^(O×O). Q is the number of hidden states and O is the dimension of the data. The benefit of the hierarchical model can be seen from its structure: under the same model complexity, i.e., the same number of hidden states and data dimension, the parameters of the HHMM can further vary with each data instance. Thus the HHMM has increased modeling capacity compared to an HMM, which is crucial for modeling data variation.
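To make Eq. (1) and the priors concrete, the joint log-density can be sketched as follows. This is a minimal NumPy sketch with illustrative names, assuming diagonal covariances stored as variance vectors; the prior terms are computed only up to their normalizing constants.

```python
import numpy as np

def gaussian_logpdf(x, mu, var):
    # Diagonal-covariance Gaussian log-density, as assumed in the paper.
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def hhmm_joint_loglik(X, Z, pi, A, mu, var, eta, mu0, var0):
    """log P(X, Z, theta | alpha-bar) of Eq. (1): chain terms plus parameter priors."""
    T = len(X)
    ll = np.log(pi[Z[0]])                  # initial state term P(Z_1 | pi)
    for t in range(1, T):
        ll += np.log(A[Z[t - 1], Z[t]])    # transition terms P(Z_t | Z_{t-1}, A)
    for t in range(T):
        ll += gaussian_logpdf(X[t], mu[Z[t]], var[Z[t]])  # emission terms
    # Dirichlet prior on each row of A (up to the normalizing constant)
    ll += np.sum((eta - 1.0) * np.log(A))
    # Normal prior on the emission means (up to the normalizing constant)
    ll += np.sum(-0.5 * (mu - mu0) ** 2 / var0)
    return ll
```

With uniform Dirichlet hyperparameters (η_ij = 1) the transition prior term vanishes, leaving only the chain terms and the Normal prior on the means.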
Adversarial learning of HHMM

The adversarial learning approach uses a novel objective to train a generative model G by introducing a discriminative model D. Intuitively, G aims to generate samples that resemble the real data distribution, while D tries to differentiate whether a sample comes from the real data or was generated by G. When both G and D are parameterized by neural networks, this yields the GAN (Goodfellow et al. 2014). Leveraging the adversarial learning framework, we develop a method for learning the HHMM, which we use as the generator. The discriminator is a pair of HMMs trained with a discriminative objective. We describe the overall optimization formulation first, followed by a detailed discussion of generator and discriminator learning. We introduce an additional binary variable y associated with X to indicate whether X is real (y = 1) or fake (y = −1). The overall optimization objective is defined by Eq. (2).
$$\min_{\bar{\alpha}} \max_{\phi} \; \mathbb{E}_{X \sim P_{data}(X)}[\log D(X \mid \phi)] + \mathbb{E}_{X \sim P_G(X \mid \bar{\alpha})}[\log(1 - D(X \mid \phi))] \quad (2)$$

where D(X|φ) ≜ P_D(y = 1 | X, φ) is the output of the discriminator, specifying the probability that X is real data, and φ are the discriminator parameters. P_data(X) is the real data distribution and P_G(X|ᾱ) is the likelihood of the generator G on X. Compared to a GAN, the probabilistic generative model directly specifies the distribution of X, with randomness and dependency encoded through the latent variables. The goal of learning is to estimate ᾱ and φ. The optimization alternates between the two models, optimizing one while holding the other fixed at each iteration.
Generator We now discuss generator learning in detail; the generator is the HHMM in our case. The benefit of a probabilistic dynamic model is that we can model data variation and randomness in a principled way. In addition, we can generate sequences of different lengths. Finally, we can evaluate the data likelihood under the learned model, as described later under inference. When optimizing ᾱ in Eq. (2), we hold φ fixed. We use the same approximate objective as (Goodfellow et al. 2014), resulting in the following objective.
$$\max_{\bar{\alpha}} \; L_G(\bar{\alpha}) \triangleq \mathbb{E}_{X \sim P_G(X \mid \bar{\alpha})}[\log D(X \mid \phi)] \approx \frac{1}{MN} \sum_{i=1}^{N} \sum_{j=1}^{M} \log D(X_{ij} \mid \phi), \qquad \theta_i \sim P(\theta \mid \alpha), \; X_{ij} \sim P(X \mid \theta_i, \bar{\theta}) \quad (3)$$

However, the sample-based approximation no longer explicitly depends on ᾱ. We use the identity ∇_X f(X) = f(X) ∇_X log f(X) to derive an unbiased estimate of the gradient of L_G(ᾱ) by directly taking the derivative of Eq. (3); a similar strategy is used in (Williams 1992).
$$\frac{\partial L_G(\bar{\alpha})}{\partial \alpha} \approx \sum_{i=1}^{N} \sum_{j=1}^{M} \frac{\log D(X_{ij} \mid \phi)}{MN} \frac{\partial \log P(\theta_i \mid \alpha)}{\partial \alpha} \quad (4)$$

$$\frac{\partial L_G(\bar{\alpha})}{\partial \bar{\theta}} \approx \sum_{i=1}^{N} \sum_{j=1}^{M} \frac{\log D(X_{ij} \mid \phi)}{MN} \frac{\partial \log P(X_{ij} \mid \theta_i, \bar{\theta})}{\partial \bar{\theta}} \quad (5)$$

where θ_i ∼ P(θ|α) and X_ij ∼ P(X|θ_i, θ̄). In Eq. (4), the partial derivative is taken through the prior distribution of the parameters,
which has an analytical form given our parameterization. In Eq. (5), the partial derivative corresponds to the gradient of the log-likelihood of an HMM, which can be computed by exploiting the chain structure of the hidden states as described in (Cappe, Buchoux, and Moulines 1998). SGD with RMSProp (Tieleman and Hinton 2012) for adaptive gradient magnitudes is used to update ᾱ. We also reparameterize κ = log σ, where σ² denotes the diagonal entries of Σ, which is assumed diagonal. Intuitively, given a fixed D, samples with D(X_ij|φ) → 0 are weighted heavily to encourage improvement, while samples with D(X_ij|φ) → 1 have ∂L_G(ᾱ)/∂ᾱ → 0 and thus contribute little to the update.
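The score-function identity behind Eqs. (4) and (5) can be checked on a toy problem where the true gradient is known. This is only an illustrative sketch, not the paper's estimator: here x ∼ N(α, 1) and f plays the role of log D.

```python
import numpy as np

def score_function_grad(f, alpha, n_samples=100000, seed=0):
    # Estimates d/d_alpha E_{x ~ N(alpha, 1)}[f(x)] without differentiating f,
    # using grad E[f] = E[f(x) * d log p(x|alpha)/d alpha] = E[f(x) * (x - alpha)].
    rng = np.random.default_rng(seed)
    x = rng.normal(alpha, 1.0, size=n_samples)
    return float(np.mean(f(x) * (x - alpha)))
```

For f(x) = x the true gradient of E[x] with respect to α is exactly 1, so the estimate should concentrate near 1 as the sample count grows; for a constant f it should concentrate near 0.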
Discriminator Our discriminator consists of a pair of HMMs with parameters φ⁺ and φ⁻, respectively. The use of a dynamic-model-based discriminator is largely motivated by the need to work with sequential data: to differentiate whether a motion sequence looks realistic, the discriminator should be able to recognize the underlying motion pattern subject to variation. In addition, a dynamic discriminator can accommodate sequences of different lengths. Specifically, the output of the discriminator is defined as follows.
$$D(X \mid \phi) = \frac{P(X \mid \phi^{+})}{P(X \mid \phi^{+}) + P(X \mid \phi^{-}) \frac{P(y=-1)}{P(y=1)}} \quad (6)$$
where P(y) is the prior probability of the labels. Since we choose the same number of real and fake samples at each update, we can assume a uniform label distribution, namely P(y = 1) = P(y = −1) = 1/2. P(X|φ⁺) and P(X|φ⁻) are the likelihoods of the two HMMs evaluated on X. The two HMMs are trained discriminatively under the objective of Eq. (2) with ᾱ held fixed. Specifically, given a set of M randomly generated samples {X⁻_j} from the generator and a set of M randomly selected real data samples {X⁺_j}, the objective for learning φ is equivalent to the negative cross-entropy loss, as follows.
$$\max_{\phi} \; L_D(\phi) \triangleq \mathbb{E}_{X \sim P_{data}(X)}[\log D(X \mid \phi)] + \mathbb{E}_{X \sim P_G(X \mid \bar{\alpha})}[\log(1 - D(X \mid \phi))] \approx \frac{1}{M} \sum_{j=1}^{M} \log D(X_j^{+} \mid \phi) + \log(1 - D(X_j^{-} \mid \phi)) \quad (7)$$
By substituting Eq. (6) into Eq. (7), we can compute the gradient of L_D(φ) with respect to φ.
$$\frac{\partial L_D(\phi)}{\partial \phi^{+}} \approx \frac{1}{M} \sum_{j=1}^{M} \Big[ \frac{P(X_j^{+} \mid \phi^{-})}{s(X_j^{+})} \frac{\partial \log P(X_j^{+} \mid \phi^{+})}{\partial \phi^{+}} - \frac{P(X_j^{-} \mid \phi^{+})}{s(X_j^{-})} \frac{\partial \log P(X_j^{-} \mid \phi^{+})}{\partial \phi^{+}} \Big] \quad (8)$$

$$\frac{\partial L_D(\phi)}{\partial \phi^{-}} \approx \frac{1}{M} \sum_{j=1}^{M} \Big[ \frac{P(X_j^{-} \mid \phi^{+})}{s(X_j^{-})} \frac{\partial \log P(X_j^{-} \mid \phi^{-})}{\partial \phi^{-}} - \frac{P(X_j^{+} \mid \phi^{-})}{s(X_j^{+})} \frac{\partial \log P(X_j^{+} \mid \phi^{-})}{\partial \phi^{-}} \Big] \quad (9)$$
where s(X) = P(X|φ⁺) + P(X|φ⁻). Again, ∂ log P(X|φ⁺)/∂φ⁺ and ∂ log P(X|φ⁻)/∂φ⁻ are the gradients of the log-likelihoods of the two HMMs, for which an analytical form is available, as described in the generator update. The overall procedure is summarized in Algorithm 1.
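Since Eq. (6) combines two HMM likelihoods that easily underflow for long sequences, it is natural to evaluate it from log-likelihoods. A minimal sketch under the uniform label prior P(y = 1) = P(y = −1) = 1/2 (function names are illustrative):

```python
import numpy as np

def discriminator_output(loglik_pos, loglik_neg):
    # Eq. (6) with a uniform label prior reduces to a logistic function of the
    # log-likelihood ratio: D = 1 / (1 + exp(llh_neg - llh_pos)).
    return 1.0 / (1.0 + np.exp(loglik_neg - loglik_pos))
```

When the two HMMs assign equal likelihood the output is exactly 1/2, and D depends only on the difference of log-likelihoods, so very small absolute likelihoods cause no numerical trouble.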
Algorithm 1 Adversarial learning of HHMM

Require: {X}: real dataset. Q: number of hidden states. M: number of samples. N: number of parameter sets. k: update steps for φ. l: update steps for ᾱ.
Ensure: Generator ᾱ. Discriminator φ.
1: Initialize ᾱ and φ
2: repeat
3:   for k steps do
4:     Draw M samples each from P_G and the real dataset.
5:     Update discriminator φ using RMSProp with the gradients defined by Eq. (8) and Eq. (9).
6:   end for
7:   for l steps do
8:     Draw N samples of θ; for each θ, draw M samples.
9:     Update generator ᾱ using RMSProp with the gradients defined by Eq. (4) and Eq. (5).
10:  end for
11: until convergence or the maximum number of iterations is reached
12: return ᾱ
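Algorithm 1 can be illustrated end-to-end on a one-dimensional toy problem, with the HHMM generator replaced by a single Gaussian N(α, 1) and the HMM-pair discriminator by a logistic classifier. This is only an analogue of the alternating updates, not the paper's model; all names and learning rates are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_adversarial(data_mean=3.0, iters=200, M=64, seed=0):
    rng = np.random.default_rng(seed)
    alpha = 0.0          # generator parameter: mean of N(alpha, 1)
    w, b = 0.0, 0.0      # discriminator D(x) = sigmoid(w*x + b)
    lr_d, lr_g = 0.1, 0.05
    for _ in range(iters):
        real = rng.normal(data_mean, 1.0, M)
        fake = rng.normal(alpha, 1.0, M)
        # discriminator step: ascend log D(real) + log(1 - D(fake))
        d_real, d_fake = sigmoid(w * real + b), sigmoid(w * fake + b)
        w += lr_d * (np.mean((1 - d_real) * real) - np.mean(d_fake * fake))
        b += lr_d * (np.mean(1 - d_real) - np.mean(d_fake))
        # generator step: score-function gradient of E[log D(fake)],
        # weighting each sample's score (x - alpha) by log D, as in Eq. (4)
        fake = rng.normal(alpha, 1.0, M)
        alpha += lr_g * np.mean(np.log(sigmoid(w * fake + b)) * (fake - alpha))
    return alpha
```

After training, α should have moved from 0 toward the data mean, mirroring how the HHMM generator is pushed toward the real sequence distribution by the discriminator's feedback.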
Inference

We describe our methods for three inference problems associated with the HHMM when applied to different data analysis applications, as described later in the experiments.
Data synthesis One of the major applications of a generative model is to synthesize data automatically. A potential use of synthetic motion data is to supply training data for deep learning models on tasks such as action recognition. We use an ancestral-sampling-based approach to generate synthetic motion data. Specifically, we first sample the parameters A, μ from their prior distributions given the learned hyperparameters, i.e., A ∼ P(A|η), μ ∼ P(μ|μ_0, Σ_0). Second, we sample the hidden state chain given the sampled parameters A and the learned parameters π, i.e., Z_1 ∼ P(Z_1|π), Z_t ∼ P(Z_t|Z_{t−1}, A). Finally, we compute the most likely observation sequence X_1, ..., X_T conditioned on Z_1, ..., Z_T and the parameters μ, Σ. Due to the model structure, the observed nodes are independent of each other given the hidden states, so a naive solution that maximizes the conditional likelihood P(X|Z) yields the mean of the corresponding Gaussian at each frame, i.e., X_t = μ_{Z_t}. For motion capture data, this results in non-smooth changes between poses. We alleviate this issue by augmenting the features X_t to include both first-order (position) and second-order (speed) information, as suggested in (Brand 1999), where the speed is computed as the difference of consecutive positions. Then we solve the following inference problem.
$$\max_{X} \; \log P(\hat{X} \mid Z) = \sum_{t} \log \mathcal{N}(\hat{X}_t \mid \mu_{Z_t}, \Sigma_{Z_t}) \quad (10)$$

where X = {X_t}, X̂ = {X̂_t}, and X̂_t = [X_t, X_t − X_{t−1}]. Eq. (10) is a quadratic system with respect to X, whose
Figure 2: Reconstruction experiment results on Berkeley's jumping action versus the number of hidden states: (a) PCC, (b) MSE, (c) log-likelihood on training data, (d) log-likelihood on testing data. HHMM-A refers to the adversarial learning variant and HHMM-M to the maximum likelihood learning variant. (Best viewed in color.)
closed-form solution can be obtained by setting the derivative to zero.
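The three sampling steps described above (parameters from their priors, then the state chain, then the per-frame Gaussian means) can be sketched as follows. The speed-feature smoothing of Eq. (10) is omitted, and all names are illustrative.

```python
import numpy as np

def ancestral_sample(pi, eta, mu0, sig0, T, seed=0):
    """Hierarchical ancestral sampling: theta ~ P(theta | alpha), then Z, then X."""
    rng = np.random.default_rng(seed)
    Q = len(pi)
    # Step 1: sample parameters from their priors given the hyperparameters
    A = np.vstack([rng.dirichlet(eta[i]) for i in range(Q)])  # A ~ Dir(eta)
    mu = rng.normal(mu0, sig0)                                # mu ~ N(mu0, Sigma0)
    # Step 2: sample the hidden state chain
    Z = [rng.choice(Q, p=pi)]
    for t in range(1, T):
        Z.append(rng.choice(Q, p=A[Z[-1]]))
    Z = np.array(Z)
    # Step 3: the most likely observation per frame is the Gaussian mean
    X = mu[Z]
    return Z, X
```

With tight prior variances the sampled means stay close to μ_0, so the emitted frames cluster around the per-state prior means; larger prior variances yield the intra-class variation the hierarchy is designed to model.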
Reconstruction The goal of reconstruction is to generate a novel sequence that resembles the input sequence. Reconstruction evaluates the model's ability to capture the dynamics of sequential data. Since the hidden state chain is the primary source encoding the dynamic change of the data, we first infer its most probable configuration. We compute the MAP estimate θ* by solving the following problem using MAP-EM (Gauvain and Lee 1994), where the E-step has complexity O(Q²T) and the M-step has a closed-form solution.
$$\theta^{*} = \arg\max_{\theta} \; \log \sum_{Z} P(X, Z \mid \theta, \bar{\theta}) + \log P(\theta \mid \alpha) \quad (11)$$
We then run the Viterbi decoding algorithm (Rabiner 1989) on the observed testing sequence given θ*. Finally, we compute the most likely observations given the decoded states, in the same way as described for data synthesis.
Compute data likelihood The marginal likelihood of the model evaluated on data X is defined as follows.
$$llh(X) = \log P_G(X \mid \bar{\alpha}) = \log \int_{\theta} \sum_{Z} P(X, Z \mid \theta, \bar{\theta}) \, P(\theta \mid \alpha) \, d\theta \quad (12)$$
Exact computation of Eq. (12) is intractable because the integration over θ introduces additional dependencies among Z. We use the following approximation.
$$llh(X) \approx \widehat{llh}(X) = \log \sum_{Z} P(X, Z \mid \theta^{*}, \bar{\theta}) \quad (13)$$
where θ* is defined by Eq. (11). Eq. (13) can then be computed using the forward-backward algorithm (Rabiner 1989).
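The approximate likelihood of Eq. (13) needs the standard forward recursion. A log-space sketch, with log_B[t, j] = log P(X_t | Z_t = j) precomputed from the Gaussian emissions (names are illustrative):

```python
import numpy as np

def forward_loglik(log_pi, log_A, log_B):
    """log sum_Z P(X, Z) via the forward algorithm, kept in log space for stability."""
    a = log_pi + log_B[0]                       # alpha_1(j) = pi_j * b_j(X_1)
    for t in range(1, log_B.shape[0]):
        # alpha_t(j) = b_j(X_t) * sum_i alpha_{t-1}(i) * A_ij, in log space
        a = log_B[t] + np.logaddexp.reduce(a[:, None] + log_A, axis=0)
    return float(np.logaddexp.reduce(a))        # log sum_j alpha_T(j)
```

For short chains the result can be verified against brute-force enumeration of all Q^T state paths; the recursion reduces the cost to O(Q²T).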
Experiments

We evaluate the model on two tasks related to motion capture data analysis. For each type of real motion capture data, we fit one model to capture the specific dynamic process of the action. We first quantitatively evaluate the model's ability to capture dynamics through reconstruction experiments. We then show that the learned model can be used to synthesize novel motion data with intra-class variation, with both quantitative and qualitative results.
Datasets: The CMU Motion Capture database (CMU) contains a diverse collection of human motion data captured by a commercial motion capture system. To date, it contains 2605 sequences in 6 major categories and 23 subcategories collected from 112 subjects. We select a subset of the database to train our model, including walking, running, and boxing actions from 31 subjects, averaging 101 sequences per action. UC Berkeley MHAD (Ofli et al. 2013) contains motion data collected with multiple modalities; we use only the motion capture data. Twelve subjects perform 11 types of actions, and each action is repeated 5 times, yielding large intra-class variation. We select three actions for our experiments, namely jumping in place, jumping jack, and boxing, which involve substantial whole-body movement.
Preprocessing: We subtract the root joint location from each frame to make the skeleton pose invariant to position changes. We further convert the rotation angles to the exponential map representation in the same way as (Taylor, Hinton, and Roweis 2006), which makes the skeleton pose invariant to orientation about the gravitational vertical. We exclude features that are mostly constant (standard deviation < 0.5), resulting in 53 and 60 feature dimensions per frame on the CMU and Berkeley datasets, respectively. The feature dimension is then doubled by including speed features obtained as the difference of consecutive frames along each dimension. All features are scaled to have unit standard deviation within each dimension. Finally, we divide the original sequences into overlapping segments of the same length for simplicity, so that the model likelihood on different data is unaffected by sequence length, although our model can take input sequences of different lengths. The preprocessed data is then used to train the HHMM and the compared methods. We evaluate performance in feature space for all methods.
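The feature pipeline described above (dropping near-constant dimensions, appending frame-difference speed features, and unit-variance scaling) can be sketched as follows. Root-joint subtraction, the exponential-map conversion, and segmentation are omitted; names are illustrative.

```python
import numpy as np

def preprocess(seq, std_thresh=0.5):
    # seq: (T, O) array of per-frame features
    keep = seq.std(axis=0) >= std_thresh           # drop mostly-constant dims
    pos = seq[:, keep]
    speed = np.diff(pos, axis=0, prepend=pos[:1])  # frame differences; first row is 0
    feat = np.hstack([pos, speed])                 # doubles the feature dimension
    return feat / feat.std(axis=0)                 # unit std per dimension
```

The output has twice as many columns as the retained position features, matching the paper's doubled feature dimension.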
Implementation: For Algorithm 1, we use k = 1, l = 1, M = 10, N = 100. The RMSProp decay is 0.9 and the perturbation is 10⁻⁶. The learning rate is 10⁻³ for the generator and 10⁻⁴ for the discriminator. The maximum number of epochs is set to 100. To initialize ᾱ, we cluster the observations with K-means and use the cluster assignments as hidden state values, from which we estimate the model parameters and hyperparameters. To initialize φ⁺, we use the MLE on the first batch of real and synthetic data; φ⁻ is set equal to φ⁺. Our Matlab
code runs on a PC with a 3.4 GHz CPU and 8 GB RAM. The average training time per class is 1.3 hours on the CMU dataset and 1.9 hours on the Berkeley dataset.
Data reconstruction

In this experiment, we demonstrate that the learned HHMM has a large capacity to handle intra-class variation in motion capture data. For each action category, we divide the data into 4 folds, each containing distinct subjects. Reconstruction is performed cross-fold: each fold is used as testing data once, with the remaining folds as training data. We report results averaged over all folds and all input dimensions.
Quantitative metrics: We use the Pearson correlation coefficient (PCC) and mean squared error (MSE), computed in feature space between the reconstructed and actual values. PCC measures how well the prediction captures the trend of motion change; it lies between −1 and 1, the larger the better. MSE is a positive number measuring the deviation between the reconstructed and actual values, the smaller the better. We also report the approximate log-likelihood of the model evaluated on the reconstructed data.
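The two reconstruction metrics can be computed per feature dimension as follows (a straightforward sketch; the paper additionally averages over folds and dimensions):

```python
import numpy as np

def pcc_mse(pred, actual):
    # pred, actual: (T, O) arrays; returns per-dimension PCC and MSE
    pm, am = pred - pred.mean(axis=0), actual - actual.mean(axis=0)
    pcc = (pm * am).sum(axis=0) / np.sqrt((pm ** 2).sum(axis=0) * (am ** 2).sum(axis=0))
    mse = ((pred - actual) ** 2).mean(axis=0)
    return pcc, mse
```

Note the two metrics are complementary: a prediction that is a scaled and shifted copy of the target has PCC 1 but nonzero MSE, while MSE penalizes any pointwise deviation.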
First, we compare with two baselines, namely an HMM and an HHMM both learned by maximizing likelihood. While MLE of an HMM is done with EM, MLE of the HHMM is intractable; we approximate it through a two-step optimization process as described in the inference method. We vary the number of hidden states for all methods and evaluate their performance, as shown in Figure 2.
We observe that both HHMM variants consistently outperform the HMM in PCC and MSE across different state numbers. In addition, when the state number is small, increasing it helps both methods. As it keeps increasing, HHMM performance reaches a plateau while HMM performance starts to drop, a symptom of overfitting to the training data. The overfitting of the HMM becomes clearer when looking at the likelihoods, which drop significantly from training to testing data. This shows that, compared to its non-hierarchical counterpart, the HHMM has a larger capacity, which allows the model to adapt to novel data and makes it less prone to overfitting. Comparing the two variants, HHMM-M consistently achieves higher likelihood on training data across actions and datasets than HHMM-A, consistent with its maximum likelihood objective. On testing data, the likelihood gap between HHMM-M and HHMM-A becomes smaller. For PCC and MSE, HHMM-A consistently outperforms HHMM-M. Overall, these results show that the adversarially learned HHMM generalizes better to novel data by capturing the dynamic data distribution well.
We then compare our method with several state-of-the-art dynamic generative models: GPDM (Wang, Fleet, and Hertzmann 2008), a non-parametric model; ERD (Fragkiadaki et al. 2015), an RNN/LSTM-based method; and TSBN (Gan et al. 2015), which combines neural networks and graphical models. We set the number of hidden states to 20 for the HHMM throughout the remaining experiments. For the other methods, we use author-provided code. The results, averaged over the different actions, are shown in Table 1.
Table 1: Reconstruction results of different methods, averaged over features and actions. Numbers in brackets are standard deviations.

Dataset        CMU                        Berkeley
Metric         PCC          MSE           PCC          MSE
HMM            0.36[0.46]   1.12[1.54]    0.43[0.46]   0.87[1.95]
GPDM           0.70[0.24]   0.24[0.36]    0.47[0.35]   0.51[1.03]
ERD            0.66[0.34]   0.61[1.15]    0.75[0.30]   0.31[1.17]
TSBN           0.79[0.24]   0.27[0.92]    0.81[0.25]   0.18[0.64]
HHMM           0.81[0.22]   0.20[0.77]    0.81[0.26]   0.12[0.30]
On average, on the CMU dataset we achieve a 2% absolute improvement in PCC over the second-best TSBN and a 0.04 absolute reduction in MSE over the second-best GPDM. On the Berkeley dataset, we achieve PCC comparable to the second-best TSBN and a 6% improvement over ERD, and we reduce MSE by 0.06 compared to the closest competitor, TSBN. On both datasets, we outperform the baseline method by a large margin.
Figure 3: Average largest pairwise SSIM between synthetic motion sequences and real sequences from (a) CMU and (b) Berkeley datasets. Bars compare CRBM, ERD, TSBN, HHMM-M, HHMM-A, and the training data for each action (running, walking, and boxing on CMU; jumping, jumping jack, and boxing on Berkeley) and on average. (Best viewed in color.)
Data synthesis

In this experiment, we demonstrate that the adversarially learned HHMM can generate motion sequences that are both realistic and diverse. For each type of action, we train a model, which is then used to generate motion of the same type following the description in the inference method.
Quantitative results: Sequential data brings additional challenges to quality evaluation due to large variation and dependencies in both space and time. Motivated by the need to consider both fidelity and diversity of the generated sequential data, we adopt the structural similarity index (SSIM) (Wang et al. 2004) to evaluate synthesized data quality. SSIM was originally proposed for evaluating the quality of a corrupted image against an intact reference image; it is easy to compute and correlates well with perceptual image quality. It is a value between 0 and 1: the larger the value, the more perceptually similar the images. (Odena, Olah, and Shlens 2017) adopted it to evaluate the overall diversity of images generated by a GAN. To adapt SSIM to sequential data, we concatenate the features over time so that a sequence can be viewed as an image, where each pixel corresponds to a joint angle at a time step. For each method, we generate 1000 sequences. For each sequence, we compute the pairwise SSIM against all training sequences and keep the largest value. Finally, we use the average largest SSIM as a measure of the diversity of the synthesized sequences. As a reference, we also compute the pairwise SSIM among all training sequences. The results are shown in Figure 3. For both datasets, the average training-data SSIM is the lowest among all results, indicating significant intra-class variation. Among the competing methods, HHMM-A achieves the lowest average SSIM, showing that the adversarially learned HHMM generates the most diverse set of motion sequences. Comparing action categories, a more complex action such as boxing usually yields a lower SSIM; from this point of view, the HHMM generates data consistent with the training set. A method producing a high SSIM value, e.g., TSBN on Berkeley's boxing, indicates overfitting to some training instances and a failure to generate diverse synthetic data.
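The diversity measure described above can be sketched with a single-window SSIM over the whole feature-time "image". This is a simplification: the standard SSIM of (Wang et al. 2004) averages over local windows, and the constants here are illustrative.

```python
import numpy as np

def ssim_global(x, y, c1=1e-4, c2=9e-4):
    # Single-window SSIM over the whole array: luminance term times a
    # combined contrast/structure term, as in the SSIM definition.
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def diversity_score(synthetic, training):
    # For each synthetic sequence keep its largest SSIM against any training
    # sequence; a lower average indicates more diverse synthetic samples.
    return float(np.mean([max(ssim_global(s, t) for t in training)
                          for s in synthetic]))
```

By construction the score is 1 when every synthetic sequence duplicates a training sequence, and it decreases as the generated set moves away from memorized instances.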
Qualitative results: Figure 4 shows examples of synthetic sequences for different actions, where different rows show different samples drawn from the same motion category. Note that, due to the random nature of the hidden state transitions, the sampling process may not always generate meaningful motion in terms of pose change. We therefore use SSIM as a reference and select generated sequences whose largest SSIM is above a threshold. On one hand, the different motion sequences are clearly distinguishable, indicating that the data look physically meaningful and realistic. On the other hand, the sequences show various motion styles, which shows that the HHMM can generate different variations of the same action.
Conclusion

In this paper, we enhanced the HMM through a Bayesian hierarchical framework to improve its capability in modeling dynamics under intra-class variation. We proposed a novel learning method that trains the HHMM under an adversarial

(a) Walking (b) Running
(c) Boxing (d) Jumping

Figure 4: Synthetic motion sequences. Each row is a uniformly downsampled skeletal sequence from one synthetic action. Different rows are different samples.

objective, which has shown promising results in data generation applications compared to conventional maximum likelihood learning. Through both quantitative and qualitative evaluations, we showed that the learned model captures the dynamic process of human motion data well and can generate realistic motion sequences with intra-class variation. For future work, we plan to introduce higher-order dependency structure to better capture long-term dependencies. We are also interested in training with different types of actions together instead of fitting one model at a time.
Acknowledgment

This work is partially supported by the Cognitive Immersive Systems Laboratory (CISL), a collaboration between IBM and RPI, and also a center in IBM's Cognitive Horizon Network (CHN).
References

Bayer, J., and Osendorfer, C. 2014. Learning stochastic recurrent networks. arXiv.
Brand, M., and Hertzmann, A. 2000. Style machines. In SIGGRAPH. ACM.
Brand, M.; Oliver, N.; and Pentland, A. 1997. Coupled hidden Markov models for complex action recognition. In CVPR.
Brand, M. 1999. Voice puppetry. In SIGGRAPH.
Cappe, O.; Buchoux, V.; and Moulines, E. 1998. Quasi-Newton method for maximum likelihood estimation of hidden Markov models. In ICASSP.
Chung, J.; Kastner, K.; Dinh, L.; Goel, K.; Courville, A. C.; and Bengio, Y. 2015. A recurrent latent variable model for sequential data. In NIPS.
CMU. CMU mocap database. http://mocap.cs.cmu.edu/.
Dempster, A. P.; Laird, N. M.; and Rubin, D. B. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society.
Fine, S.; Singer, Y.; and Tishby, N. 1998. The hierarchical hidden Markov model: Analysis and applications. Machine Learning.
Fraccaro, M.; Sønderby, S. K.; Paquet, U.; and Winther, O. 2016. Sequential neural models with stochastic layers. In NIPS.
Fragkiadaki, K.; Levine, S.; Felsen, P.; and Malik, J. 2015. Recurrent network models for human dynamics. In ICCV.
Gan, Z.; Li, C.; Henao, R.; Carlson, D. E.; and Carin, L. 2015. Deep temporal sigmoid belief networks for sequence modeling. In NIPS.
Gauvain, J.-L., and Lee, C.-H. 1994. Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. TSAP.
Ghahramani, Z.; Jordan, M. I.; and Smyth, P. 1997. Factorial hidden Markov models. Machine Learning.
Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In NIPS.
Johnson, M.; Duvenaud, D. K.; Wiltschko, A.; Adams, R. P.; and Datta, S. R. 2016. Composing graphical models with neural networks for structured representations and fast inference. In NIPS.
Joshi, A.; Ghosh, S.; Betke, M.; Sclaroff, S.; and Pfister, H. 2017. Personalizing gesture recognition using hierarchical Bayesian neural networks. In CVPR.
Kalman, R. E., et al. 1960. A new approach to linear filtering and prediction problems. Journal of Basic Engineering.
Krishnan, R. G.; Shalit, U.; and Sontag, D. 2015. Deep Kalman filters. arXiv.
Mittelman, R.; Kuipers, B.; Savarese, S.; and Lee, H. 2014. Structured recurrent temporal restricted Boltzmann machines. In ICML.
Murphy, K. P. 2002. Dynamic Bayesian networks: representation, inference and learning. Ph.D. Dissertation, University of California, Berkeley.
Nguyen, A.; Clune, J.; Bengio, Y.; Dosovitskiy, A.; and Yosinski, J. 2017. Plug & play generative networks: Conditional iterative generation of images in latent space. In CVPR.
Nowozin, S.; Cseke, B.; and Tomioka, R. 2016. f-GAN: Training generative neural samplers using variational divergence minimization. In NIPS.
Odena, A.; Olah, C.; and Shlens, J. 2017. Conditional image synthesis with auxiliary classifier GANs. In ICML.
Ofli, F.; Chaudhry, R.; Kurillo, G.; Vidal, R.; and Bajcsy, R. 2013. Berkeley MHAD: A comprehensive multimodal human action database. In WACV.
Rabiner, L. R. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE.
Radford, A.; Metz, L.; and Chintala, S. 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv.
Reed, S.; Akata, Z.; Yan, X.; Logeswaran, L.; Schiele, B.; and Lee, H. 2016. Generative adversarial text to image synthesis. In ICML.
Saito, M.; Matsumoto, E.; and Saito, S. 2017. Temporal generative adversarial nets with singular value clipping. In ICCV.
Srivastava, N.; Mansimov, E.; and Salakhudinov, R. 2015. Unsupervised learning of video representations using LSTMs. In ICML.
Sutskever, I., and Hinton, G. 2007. Learning multilevel distributed representations for high-dimensional sequences. In AISTATS.
Sutton, C.; McCallum, A.; and Rohanimanesh, K. 2007. Dynamic conditional random fields: Factorized probabilistic models for labeling and segmenting sequence data. JMLR.
Tang, K.; Fei-Fei, L.; and Koller, D. 2012. Learning latent temporal structure for complex event detection. In CVPR.
Taylor, G. W.; Hinton, G. E.; and Roweis, S. T. 2006. Modeling human motion using binary latent variables. In NIPS.
Theis, L.; Oord, A. v. d.; and Bethge, M. 2015. A note on the evaluation of generative models. arXiv.
Tieleman, T., and Hinton, G. 2012. Lecture 6.5-RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning.
Tulyakov, S.; Liu, M.-Y.; Yang, X.; and Kautz, J. 2017. MoCoGAN: Decomposing motion and content for video generation. arXiv.
Vondrick, C.; Pirsiavash, H.; and Torralba, A. 2016. Generating videos with scene dynamics. In NIPS.
Walker, J.; Doersch, C.; Gupta, A.; and Hebert, M. 2016. An uncertain future: Forecasting from static images using variational autoencoders. In ECCV.
Wang, X., and Ji, Q. 2012. Learning dynamic Bayesian network discriminatively for human activity recognition. In ICPR.
Wang, Z.; Bovik, A. C.; Sheikh, H. R.; and Simoncelli, E. P. 2004. Image quality assessment: from error visibility to structural similarity. TIP.
Wang, J. M.; Fleet, D. J.; and Hertzmann, A. 2008. Gaussian process dynamical models for human motion. TPAMI.
Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning.
Xue, T.; Wu, J.; Bouman, K.; and Freeman, B. 2016. Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In NIPS.
Yu, S.-Z. 2010. Hidden semi-Markov models. Artificial Intelligence.