A Geometric Particle Filter for Template-Based Visual Tracking

Junghyun Kwon, Hee Seok Lee, Frank C. Park, and Kyoung Mu Lee

Abstract—Existing approaches to template-based visual tracking, in which the objective is to continuously estimate the spatial transformation parameters of an object template over video frames, have primarily been based on deterministic optimization, which as is well known can result in convergence to local optima. To overcome this limitation of the deterministic optimization approach, in this paper we present a novel particle filtering approach to template-based visual tracking. We formulate the problem as a particle filtering problem on matrix Lie groups, specifically the Special Linear group SL(3) and the two-dimensional affine group Aff(2). Computational performance and robustness are enhanced through a number of features: (i) Gaussian importance functions on the groups are iteratively constructed via local linearization; (ii) the inverse formulation of the Jacobian calculation is used; (iii) template resizing is performed; and (iv) parent-child particles are developed and used. Extensive experimental results using challenging video sequences demonstrate the enhanced performance and robustness of our particle filtering-based approach to template-based visual tracking. We also show that our approach outperforms several state-of-the-art template-based visual tracking methods via experiments using the publicly available benchmark data set.

Index Terms—Visual tracking, object template, particle filtering, Lie group, special linear group, affine group, Gaussian importance function


1 INTRODUCTION

TEMPLATE-BASED object tracking is a particular form of visual tracking, in which the objective is to continuously fit an object template to a sequence of video frames [62]. Beginning with the seminal work of Lucas and Kanade [37], template-based visual tracking has primarily been formulated as a nonlinear optimization problem, e.g., [3], [8], [9], [11], [18]. As is well known, however, nonlinear optimization-based tracking methods have the disadvantage of potentially becoming trapped in local minima, particularly when inter-frame object motions are large. Global optimization strategies such as multi-scale pyramids and simulated annealing, despite their considerably larger computational requirements, also cannot consistently ensure better tracking performance.

In this paper we propose an alternative approach to template-based visual tracking, based on geometric particle filtering. Particle filtering can be regarded as a Monte Carlo method for solving the (unnormalized) conditional density equations for state estimation [14], in which the state and observation equations are modelled as nonlinear stochastic differential equations. Because of its stochastic nature, particle filtering can, when done properly, be made more robust to the problem of local minima than existing deterministic optimization-based methods, without sacrificing much in the way of computational efficiency. Moreover, particle filtering yields better estimation performance for nonlinear problems like template-based visual tracking than, e.g., extensions of the linear Kalman filter, because of its nonparametric density representation.

In applying particle filtering to template-based visual tracking, the first issue that needs to be addressed is how to represent the state. The estimated transformation parameters are typically in the form of matrices satisfying certain group properties—in our case these will be homographies and affine transformations—and the resulting state space is no longer a vector space, but a curved space. To be able to directly apply existing particle filtering algorithms, which are primarily developed in the context of systems whose state space is a vector space, one must therefore choose a local coordinate representation for the curved state space, but there are several well-known problems associated with this approach: (i) several local coordinate patches may be required to cover the entire state space, and (ii) the choice of local coordinates will strongly affect tracking performance. See [2] for a detailed discussion on coordinate parametrizations for homographies.

A second issue involves the choice of importance function, a critical step in particle filtering. The optimal importance function minimizes the particle weight variance, and hence keeps the number of effective particles as large as possible [15], so that accurate tracking is ensured even with a limited number of particles. Since for general nonlinear systems the optimal importance function cannot be determined in closed form, a computationally efficient and sufficiently accurate approximation is required.

• J. Kwon is with TeleSecurity Sciences, Inc., 7391 Prairie Falcon Road, Suite 150-B, Las Vegas, NV 89128. E-mail: [email protected].

• H.S. Lee and K.M. Lee are with the Automation System Research Institute, the Department of Electrical Engineering and Computer Science, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 151-744, Korea. E-mail: {ultra21, kyoungmu}@snu.ac.kr.

• F.C. Park is with the School of Mechanical and Aerospace Engineering, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 151-744, Korea. E-mail: [email protected].

Manuscript received 19 Oct. 2011; revised 30 Aug. 2012; accepted 20 Aug. 2013; published online xxxxx. Recommended for acceptance by I. Reid. Digital Object Identifier no. 10.1109/TPAMI.2013.170



The main contributions of this paper directly address these two fundamental issues:

• Rather than applying conventional vector space particle filtering to some local coordinate formulation of the template-based visual tracking problem, we instead develop a geometric particle filtering algorithm that directly takes into account the curved nature of the state space. Specifically, homographies can be identified with the three-dimensional Special Linear group SL(3), while two-dimensional affine transformation matrices can be identified with the two-dimensional affine group Aff(2). As is well known, both SL(3) and Aff(2) possess the structure of an algebraic group and a differentiable manifold, i.e., a Lie group. Our resulting algorithms are geometric (in the sense of being local coordinate-invariant), so that their performance does not depend on the choice of local coordinates.

• We derive an exponential coordinate-based approximation of Gaussian distributions for Lie groups, and use these as optimal importance functions for particle sampling. This approach can in some sense be regarded as a generalization of the Gaussian importance function of [15] to the Lie groups in question.

• We develop several techniques that, collectively, enhance the accuracy and computational efficiency of our template-based visual tracking in a significant way: iterative Gaussian importance function generation, an inverse formulation of the Jacobian calculation, template resizing, and the use of parent-child particles.

The overall performance of our tracking framework is demonstrated in Section 7 via experiments with various challenging test video sequences. A quantitative performance comparison between the previously employed importance functions and ours, as well as between different parameter settings, is also performed in Section 7. We finally compare the performance of our tracking framework with several state-of-the-art template-based visual tracking methods using the Metaio benchmark data set [35] in Section 7. The compared methods include the ESM tracker [8] and L1 tracker [4], which are widely regarded as state-of-the-art template-based visual tracking algorithms based on deterministic optimization and particle filtering, respectively. Key-point matching-based motion estimation methods using descriptors such as SIFT [36], SURF [6], and FERNS [40] are also compared with our framework.

2 RELATED WORK

While regarded as more straightforward relative to other visual tracking problems [24], [41], [63], template-based visual tracking continues to arise in a wide range of mainstream applications, e.g., visual servoing [8], augmented reality [21], [45], and visual localization and mapping [29], [51]. What is common to these applications is that the 3D camera motion must be estimated from the tracking results. To do so, we should accurately estimate the object template motion up to a homography, the most general spatial transformation of planar object images; i.e., we should perform projective motion tracking.

However, previous template-based visual tracking algorithms employing particle filtering [4], [31], [32], [33], [46], [64] are all confined to affine motion tracking, i.e., template-based visual tracking with affine transformations. The exception is [60], but the accuracy of the tracking results as presented leaves something to be desired. To the best of our knowledge, ours is the first work to solve projective motion tracking via particle filtering, with sufficient accuracy and efficiency, by employing Gaussian importance functions to approximate the optimal importance function.

Following the seminal work of Isard and Blake, who successfully applied particle filtering to contour tracking [24], efforts to approximate the optimal importance function for visual tracking have mainly concentrated on contour tracking [34], [44], [47], [49] by means of the unscented transformation (UT) [25]. One notable work is [1], in which the optimal importance function is analytically obtained with linear Gaussian measurement equations for point set tracking. When approximating a nonlinear function, UT is generally recognized as more accurate than first-order Taylor expansion [25]. However, in our approach we rely on Taylor expansion instead of UT, because UT requires repeated trials of the measurement process involving image warping with interpolation, which may prohibit real-time tracking. Moreover, it is not clear how to choose in a systematic way the many parameters required in UT.

Regression-based learning has also been successfully applied to template-based visual tracking [22], [26], [33], [57], [65]. Regression methods estimate a regression relation between the spatial transformation parameters and image similarity values using training samples, which is then directly used in tracking. It is notable that the regression is performed on Aff(2) in [57] and [33]. Since our framework does not require learning with training samples, a comparison with regression-based methods is not considered here.

Previous work on particle filtering on Lie groups has focused on the special orthogonal group SO(3) representing 3D rotation matrices [12], [53] and the special Euclidean group SE(3) representing 3D rigid body transformation matrices [28], [52]. Recently, Gaussian importance functions on SE(3) obtained by UT were utilized for visual localization and mapping with a monocular camera [29]. In [33], particle filtering on Aff(2) is employed as a basic inference tool for affine motion tracking. It is worth noting that unscented Kalman filtering (or equivalently UT) on general Riemannian manifolds has been formulated in [20], which can be used to approximate the optimal importance function for particle filtering on Lie groups.

Two previous papers by the authors, [31] and [30], are the most closely related to this paper. Compared with [31], we extend the framework from affine motion tracking to projective motion tracking, and utilize Gaussian importance functions instead of the state transition density-based importance functions. [30] is a preliminary conference version of this paper. In addition to the extension to projective motion tracking, we have made several important improvements over [30] to increase accuracy and efficiency, including the iterative Gaussian importance function generation, the inverse Jacobian formulation, and the use of parent-child particles.

3 BACKGROUND

3.1 Lie Group

A Lie group $G$ is a group which is also a differentiable manifold, with smooth product and inverse group operations. The Lie algebra $\mathfrak{g}$ associated with $G$ is identified with the tangent vector space at the identity element of $G$. A Lie group $G$ and its Lie algebra $\mathfrak{g}$ are related via the exponential map $\exp : \mathfrak{g} \to G$. The logarithm map $\log$ is defined as the inverse (near the identity) of $\exp$, i.e., $\log : G \to \mathfrak{g}$. For the matrix Lie groups we address, $\exp$ and $\log$ are given by the ordinary matrix exponential and logarithm, defined as

$$\exp(x) = \sum_{m=0}^{\infty} \frac{x^m}{m!}, \qquad (1)$$

$$\log(X) = \sum_{m=1}^{\infty} (-1)^{m+1} \frac{(X-I)^m}{m}, \qquad (2)$$

where $x \in \mathfrak{g}$, $X \in G$, and $I$ is the identity matrix. For sufficiently small $x, y \in \mathfrak{g}$, the Baker-Campbell-Hausdorff (BCH) formula states that the $z$ satisfying $\exp(z) = \exp(x)\exp(y)$ is given by

$$z = x + y + \tfrac{1}{2}[x,y] + \tfrac{1}{12}[x,[x,y]] - \tfrac{1}{12}[y,[x,y]] + \cdots, \qquad (3)$$

where $[\cdot,\cdot]$ is the matrix commutator (Lie bracket) given by $[A,B] = AB - BA$. A more detailed description of Lie groups and Lie algebras can be found in, e.g., [19].

As is well known, the exponential map $\exp$ locally defines a diffeomorphism between a neighborhood of $G$ containing the identity and an open set of the Lie algebra $\mathfrak{g}$ centered at the origin, i.e., given $X \in G$ sufficiently near the identity, the map $X = \exp(\sum_i u_i E_i)$, where the $E_i$ are the basis elements of $\mathfrak{g}$, is a local diffeomorphism. These local exponential coordinates can be extended to cover the entire group by left or right multiplication, e.g., the neighborhood of any $Y \in G$ can be well defined as $Y(u) = Y \cdot \exp(\sum_i u_i E_i)$. Note also that in a neighborhood of the identity, the minimal geodesics (i.e., paths of shortest distance) are given precisely by these exponential trajectories (and their left- and right-translated versions).
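To make these maps concrete, the following sketch (our own illustration, not from the paper) numerically checks the second-order truncation of the BCH formula (3) against the exact matrix logarithm, using two of the sl(3) generators defined in (4) below; all function and variable names are ours.

import numpy as np
from scipy.linalg import expm, logm

# Two sl(3) generators from (4): rotation and x-translation.
E3 = np.array([[0., -1., 0.], [1., 0., 0.], [0., 0., 0.]])
E5 = np.array([[0., 0., 1.], [0., 0., 0.], [0., 0., 0.]])

def bracket(a, b):
    # Matrix commutator (Lie bracket): [A, B] = AB - BA.
    return a @ b - b @ a

x, y = 0.05 * E3, 0.03 * E5
z_bch = x + y + 0.5 * bracket(x, y)     # BCH truncated after the bracket term
z_true = logm(expm(x) @ expm(y))        # exact z with exp(z) = exp(x) exp(y)
print(np.linalg.norm(z_bch - z_true))   # small residual: higher-order terms only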

3.2 Transformation Matrices as Lie Groups

A projective transformation is represented by a linear equation in homogeneous coordinates with a nonsingular matrix $H \in \mathbb{R}^{3\times3}$ called a homography. To remove the scale ambiguity of homogeneous coordinates, homographies can be normalized to have unit determinant. In this case, homographies can be identified with the 3D special linear group SL(3), defined as $SL(3) = \{X \in GL(3) : \det(X) = +1\}$, where GL(3) is the 3D general linear group corresponding to the set of $3\times3$ invertible real matrices. The Lie algebra associated with SL(3), denoted sl(3), corresponds to the set of $3\times3$ real matrices with zero trace. We choose the basis elements of sl(3) as

$$E_1 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & 0 \end{bmatrix},\quad E_2 = \begin{bmatrix} 0 & 0 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & 1 \end{bmatrix},\quad E_3 = \begin{bmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix},$$

$$E_4 = \begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix},\quad E_5 = \begin{bmatrix} 0 & 0 & 1 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix},\quad E_6 = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix},$$

$$E_7 = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 1 & 0 & 0 \end{bmatrix},\quad E_8 = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}. \qquad (4)$$

The spatial transformations induced by exponentials of the $E_i$ are shown in Fig. 1. We can define the map $v_{sl(3)} : sl(3) \to \mathbb{R}^8$ to represent sl(3) elements as column vectors with respect to the basis elements given in (4). For

$$x = \begin{bmatrix} x_1 & x_4 & x_7 \\ x_2 & x_5 & x_8 \\ x_3 & x_6 & x_9 \end{bmatrix} \in sl(3),$$

the map is given by $v_{sl(3)}(x) = \left(x_1,\; x_9,\; \tfrac{x_2 - x_4}{2},\; \tfrac{x_2 + x_4}{2},\; x_7,\; x_8,\; x_3,\; x_6\right)^{\top}$.

Note that our choice of sl(3) basis elements in (4) is different from the one in [7] and [8]. The difference is that we separately represent rotation and skew by $E_3$ and $E_4$, respectively. This helps in setting the state covariance for particle filtering properly in practice. For example, if we can expect the object's rotational motion to be insignificant, we can set the covariance value corresponding to $E_3$ smaller, to maximize tracking performance with limited computational power.
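A minimal sketch of this basis and the corresponding hat/vee maps $v_{sl(3)}^{-1}$ and $v_{sl(3)}$ (our own code and naming; the round-trip assertion follows directly from the formula above):

import numpy as np

E = np.zeros((8, 3, 3))
E[0][0, 0], E[0][1, 1] = 1, -1   # E1
E[1][1, 1], E[1][2, 2] = -1, 1   # E2
E[2][0, 1], E[2][1, 0] = -1, 1   # E3 (rotation)
E[3][0, 1], E[3][1, 0] = 1, 1    # E4 (skew)
E[4][0, 2] = 1                   # E5
E[5][1, 2] = 1                   # E6
E[6][2, 0] = 1                   # E7
E[7][2, 1] = 1                   # E8

def hat(u):
    # v_sl(3)^{-1}: R^8 -> sl(3), the basis expansion sum_i u_i E_i.
    return np.tensordot(u, E, axes=1)

def vee(x):
    # v_sl(3): sl(3) -> R^8, per the formula above (x indexed column-wise).
    return np.array([x[0, 0], x[2, 2],
                     (x[1, 0] - x[0, 1]) / 2, (x[1, 0] + x[0, 1]) / 2,
                     x[0, 2], x[1, 2], x[2, 0], x[2, 1]])

u = 0.1 * np.random.randn(8)
assert np.allclose(vee(hat(u)), u)   # round-trip consistency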

An affine transformation is also frequently used in template-based visual tracking; such a transformation is represented by a matrix of the form $\left[\begin{smallmatrix} A & b \\ 0 & 1 \end{smallmatrix}\right]$, where $A$ belongs to GL(2), the set of $2\times2$ invertible real matrices, and $b \in \mathbb{R}^2$. We can regard the set of affine transformation matrices as the affine group Aff(2), which is a semi-direct product of GL(2) and $\mathbb{R}^2$. The Lie algebra associated with Aff(2), denoted aff(2), is represented as $\left[\begin{smallmatrix} U & v \\ 0 & 0 \end{smallmatrix}\right]$, where $U$ belongs to gl(2) and $v \in \mathbb{R}^2$; gl(2) is the Lie algebra associated with GL(2) and corresponds to the set of $2\times2$ real matrices.

Fig. 1. The spatial transformations induced by exponentials of the $E_i$ of sl(3) in (4). The images are made by transforming a square centered at the origin with $\exp(\epsilon_i E_i)$, where the $\epsilon_i$ are small positive (dotted lines) or negative (dash-dot lines) scalars.


The six basis elements of aff(2) can be chosen to be the same as the first six basis elements of sl(3) in (4), with the exception of $E_2$. The $E_2$ of aff(2) is given by

$$E_2 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix},$$

representing isotropic scaling. The map $v_{aff(2)} : aff(2) \to \mathbb{R}^6$ is given by $v_{aff(2)}(x) = \left(\tfrac{x_1 + x_4}{2},\; \tfrac{x_1 - x_4}{2},\; \tfrac{x_2 - x_3}{2},\; \tfrac{x_2 + x_3}{2},\; x_5,\; x_6\right)^{\top}$, where

$$x = \begin{bmatrix} x_1 & x_3 & x_5 \\ x_2 & x_4 & x_6 \\ 0 & 0 & 0 \end{bmatrix} \in aff(2).$$

4 PROBABILISTIC TRACKING FRAMEWORK

In this section, we describe how projective motion tracking can be realized by particle filtering on SL(3), with appropriate definitions of the state and measurement equations. As shown in Fig. 2, the initial state $X_0$ is the $3\times3$ identity matrix. Representing the image pixel coordinates by $p = (p_x, p_y)^{\top}$, the objective is to find the $X_k \in SL(3)$ that transforms the object template $T(p)$ to the most similar image region of the $k$th video frame $I_k$.

4.1 State Equation on SL(3)

In order to formulate a state equation on SL(3), we first consider a left-invariant stochastic differential equation on SL(3) of the form

$$dX = X \cdot A(X)\,dt + X \sum_{i=1}^{8} b_i E_i\, dw_i, \qquad (5)$$

where $X \in SL(3)$ is the state, the map $A : SL(3) \to sl(3)$ is possibly nonlinear, the $b_i$ are scalar constants, the $dw_i \in \mathbb{R}$ denote independent Wiener process noise, and the $E_i$ are the basis elements of sl(3) given in (4). The continuous state equation (5) can be represented in a more tractable form by a simple first-order Euler discretization:

$$X_k = X_{k-1} \cdot \exp\!\left(A(X,t)\Delta t + v_{sl(3)}^{-1}(dW_k)\sqrt{\Delta t}\right), \qquad (6)$$

where $dW_k \in \mathbb{R}^8$ is the Wiener process noise with a covariance $P \in \mathbb{R}^{8\times8}$ rather than the identity matrix, in which the $b_i$ terms in (5) are reflected.

$A(X,t)$ in (6) can be regarded as the state dynamics. Here we use a simple state dynamics based on a first-order auto-regressive (AR) process on SL(3), approximating the generalized AR process on Riemannian manifolds of [61] using the logarithm map. The state equation on SL(3) that we actually use for projective motion tracking is then

$$X_k = X_{k-1} \cdot \exp\!\left(A_{k-1} + v_{sl(3)}^{-1}(dW_k)\sqrt{\Delta t}\right), \qquad (7)$$

$$A_{k-1} = a \cdot \log\!\left(X_{k-2}^{-1} X_{k-1}\right), \qquad (8)$$

where $a$ is the AR parameter, usually set smaller than 1. We set $a = 0.5$ for all experiments in Section 7. We can assume in general tracking applications that $X_{k-2}^{-1} X_{k-1}$ lies within the injectivity radius of $\exp$, because video frame rates typically range at least between 15 and 60 frames per second (fps) and object speeds are not excessively fast; thus $A_{k-1}$ can be considered uniquely determined.
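A sketch of one step of this AR(1) state propagation, reusing hat() from the sl(3) sketch above; the determinant renormalization is our own numerical safeguard, and all names are illustrative:

import numpy as np
from scipy.linalg import expm, logm

def propagate(X_prev2, X_prev1, P, dt=1.0, a=0.5, rng=np.random):
    # Eq. (8): AR term from the two previous states.
    A = a * np.real(logm(np.linalg.inv(X_prev2) @ X_prev1))
    # Eq. (7): exponential update with Wiener noise of covariance P.
    dW = rng.multivariate_normal(np.zeros(8), P)
    X = X_prev1 @ expm(A + hat(dW) * np.sqrt(dt))
    return X / np.cbrt(np.linalg.det(X))   # guard det(X) = 1 against roundoff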

4.2 Measurement Equation

To represent the measurement equation in a form for which an analytic Taylor expansion is possible, we utilize the warping function $W(p, X_k)$ employed in [3]. For a projective transformation of the pixel coordinates $p = (p_x, p_y)^{\top}$ of the template $T(p)$ with

$$X_k = \begin{bmatrix} a_1 & a_4 & a_7 \\ a_2 & a_5 & a_8 \\ a_3 & a_6 & a_9 \end{bmatrix},$$

$W(p, X_k)$ is defined as

$$W(p, X_k) = \begin{bmatrix} p'_x \\ p'_y \end{bmatrix} = \begin{bmatrix} \dfrac{a_1 p_x + a_4 p_y + a_7}{a_3 p_x + a_6 p_y + a_9} \\[2mm] \dfrac{a_2 p_x + a_5 p_y + a_8}{a_3 p_x + a_6 p_y + a_9} \end{bmatrix}. \qquad (9)$$

The image region corresponding to $X_k$ can be expressed as $I_k(W(p, X_k))$, which is warped back to the template coordinates, a process involving image interpolation. In our implementation, we simply use nearest-neighbor interpolation for computational efficiency. The measurement is then a measure of the image similarity between $T(p)$ and $I_k(W(p, X_k))$, and can be represented by $g(T(p), I_k(W(p, X_k)))$.
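A sketch of this warp applied to the whole template grid at once, with the nearest-neighbor sampling mentioned above; the array conventions and the clamping at the image border are our own choices:

import numpy as np

def warp_template(I, X, tmpl_h, tmpl_w):
    ys, xs = np.mgrid[0:tmpl_h, 0:tmpl_w]
    ones = np.ones(tmpl_h * tmpl_w)
    p = np.stack([xs.ravel(), ys.ravel(), ones])   # homogeneous template coords
    q = X @ p                                      # Eq. (9): linear in homogeneous coords
    px = np.clip(np.rint(q[0] / q[2]).astype(int), 0, I.shape[1] - 1)
    py = np.clip(np.rint(q[1] / q[2]).astype(int), 0, I.shape[0] - 1)
    return I[py, px].reshape(tmpl_h, tmpl_w)       # I_k(W(p, X_k)) on template coords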

We regard the object template $T(p)$ as a set composed of the initial template $T_{init}(p)$ and the incremental PCA model of [46], with mean $\bar{T}(p)$ and $M$ principal eigenvectors $b_i(p)$. Our measurement equation is then given by

$$y_k = g(T(p), I_k(W(p, X_k))) + n_k = \begin{bmatrix} g_{NCC} \\ g_{PCA} \end{bmatrix} + n_k, \qquad (10)$$

where $n_k \sim N(0, R)$ and $R = \mathrm{diag}(\sigma^2_{NCC}, \sigma^2_{PCA})$. Here, $g_{NCC}$ is the normalized cross correlation (NCC) score between $T_{init}(p)$ and $I_k(W(p, X_k))$, while $g_{PCA}$ is a reconstruction error determined by $\sum_p \left[I_k(W(p, X_k)) - \bar{T}(p)\right]^2 - \sum_{i=1}^{M} c_i^2$, where the $c_i$ are the projection coefficients of the mean-normalized image onto the $b_i(p)$.

The roles of NCC and PCA are mutually complementary: incremental PCA deals with appearance changes such as illumination change and pose change of non-planar objects [46], which cannot be properly handled by NCC, while NCC minimizes the drift problems associated with incremental PCA. A possible drawback of the conjunctive use of NCC and PCA is tracking failure due to too large an NCC error under severe local appearance changes such as specular reflection. As a simple remedy, we detect outliers whose PCA reconstruction error is above a threshold (0.15 in our implementation) and exclude them in calculating $g_{NCC}$.

Fig. 2. The graphical representation of the state and measurement for our probabilistic template-based visual tracking framework.
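A sketch of the two-channel measurement $g = (g_{NCC}, g_{PCA})^{\top}$ of (10) with the outlier masking just described; the PCA basis layout (rows orthonormal) and the helper names are our assumptions, while the 0.15 threshold is the paper's:

import numpy as np

def measure(T_init, T_mean, B, patch, thresh=0.15):
    # B: M x Npix matrix of principal eigenvectors b_i(p), rows orthonormal.
    d = patch.ravel() - T_mean.ravel()
    c = B @ d                                # projection coefficients c_i
    g_pca = d @ d - c @ c                    # reconstruction error in (10)
    err = (d - B.T @ c) ** 2                 # per-pixel squared residual
    inlier = err < thresh                    # mask outliers (e.g., specularities)
    a = patch.ravel()[inlier]; a = a - a.mean()
    b = T_init.ravel()[inlier]; b = b - b.mean()
    g_ncc = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return np.array([g_ncc, g_pca])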

4.3 State Estimation via Particle Filtering

Tracking is now equivalent to estimating the filtering density $\rho(X_k | y_{1:k})$, where $y_{1:k} \equiv \{y_1, \ldots, y_k\}$, via particle filtering on SL(3). Assume that $\{X^{(i)}_{k-1}, w^{(i)}_{k-1},\; i = 1, \ldots, N\}$, a set of SL(3) samples called particles together with their weights, approximates $\rho(X_{k-1} | y_{1:k-1})$ well. The new particles $X^{(i)}_k$ are then sampled from the importance function $\pi(X_k | X^{(i)}_{0:k-1}, y_{1:k})$, and the weights $w^{(i)}_k$ for $X^{(i)}_k$ are calculated as

$$w^{(i)}_k = w^{(i)}_{k-1} \frac{\rho\big(y_k | X^{(i)}_k\big)\, \rho\big(X^{(i)}_k | X^{(i)}_{k-1}\big)}{\pi\big(X^{(i)}_k | X^{(i)}_{0:k-1}, y_{1:k}\big)}. \qquad (11)$$

Then $\{X^{(i)}_k, w^{(i)}_k,\; i = 1, \ldots, N\}$ is considered to be distributed so as to approximate $\rho(X_k | y_{1:k})$.

$\rho(X^{(i)}_k | X^{(i)}_{k-1})$ in (11) is calculated with (7) and (8) as

$$\rho\big(X^{(i)}_k | X^{(i)}_{k-1}\big) = \frac{1}{\sqrt{(2\pi)^8 \det(Q)}} \exp\!\left(-\frac{1}{2} d_1^{\top} Q^{-1} d_1\right), \qquad (12)$$

where $d_1 = v_{sl(3)}\big(\log\big((X^{(i)}_{k-1})^{-1} X^{(i)}_k\big) - A^{(i)}_{k-1}\big)$ and $Q = P\Delta t$. The measurement likelihood $\rho(y_k | X^{(i)}_k)$ in (11) is similarly determined from (10) as

$$\rho\big(y_k | X^{(i)}_k\big) = \frac{1}{\sqrt{(2\pi)^2 \det(R)}} \exp\!\left(-\frac{1}{2} d_2^{\top} R^{-1} d_2\right), \qquad (13)$$

where $d_2 = y_k - y^{(i)}_k$ and $y^{(i)}_k = g(T(p), I_k(W(p, X^{(i)}_k)))$. The real measurement $y_k$ is given by the value of $g$ when $I_k(W(p, X_k)) = T(p)$; that is, the $g_{NCC}$ and $g_{PCA}$ components of $y_k$ are always given by 1 and 0, respectively.

Usually, the particles $X^{(i)}_k$ are resampled in proportion to their weights $w^{(i)}_k$ in order to minimize the degeneracy problem [14]. Note that a particle can be copied multiple times as a result of resampling. Therefore, the same indices at times $k-2$ and $k-1$ can actually represent different particle histories. Since for this reason it is not possible to compute $A^{(i)}_{k-1}$ in (8) directly from $X^{(i)}_{k-2}$ and $X^{(i)}_{k-1}$, the state is augmented as $\{X_k, A_k\}$ in practice.

Assuming that particle resampling is conducted at every time step, the state estimate $\hat{X}_k$ can be given by the sample mean of the resampled particles. An efficient gradient descent algorithm to obtain the mean of SL(3) samples, using the Riemannian exponential map of the form $\exp(-x^{\top})\exp(x + x^{\top})$ and its inverse log map, is presented in [7]. For fast computation, we instead use the ordinary matrix exponential and logarithm as approximations of the Riemannian exponential and log maps, via the BCH formula (3). The corresponding algorithm is given in Algorithm 1. Although the convergence of Algorithm 1 cannot be guaranteed in the most general case, in our extensive experiments the sample mean of the resampled particles typically converges within just three or four iterations, as long as we choose as the initial mean the particle whose weight before resampling is greatest among all particles.
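A sketch of this sample-mean iteration (the essence of Algorithm 1): average the particles' log-coordinates about the current mean, move the mean along that average, and repeat; the tolerance and iteration cap are ours.

import numpy as np
from scipy.linalg import expm, logm

def sl3_mean(particles, mu0, max_iter=10, tol=1e-8):
    mu = mu0   # initialize with the particle of greatest pre-resampling weight
    for _ in range(max_iter):
        d = np.mean([np.real(logm(np.linalg.inv(mu) @ X)) for X in particles],
                    axis=0)
        mu = mu @ expm(d)             # move mean along the mean tangent direction
        if np.linalg.norm(d) < tol:   # typically converges in 3-4 iterations
            break
    return mu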

5 GAUSSIAN IMPORTANCE FUNCTION FOR PROJECTIVE MOTION TRACKING

The performance of particle filtering depends heavily on the choice of importance function. The state transition density $\rho(X_k | X^{(i)}_{k-1})$ is popularly used as the importance function because of its simplicity in implementation, as the particle weights are then calculated simply in proportion to the measurement likelihood $\rho(y_k | X^{(i)}_k)$. However, since the object can move unpredictably (i.e., differently from our assumption of smooth motion implied by the AR process-based state equation in (7) and (8)), and a peaked measurement likelihood (i.e., a small covariance $R$ for $n_k$ in (10)) is inevitable for accurate tracking, most particles sampled from $\rho(X_k | X^{(i)}_{k-1})$ will have negligible weights and be wasted; this results in tracking inaccuracy.

In the ideal case, the variance of the particle weights is minimal, i.e., the particle weights are evenly distributed. The optimal importance function minimizing the particle weight variance is of the form $\rho(X_k | X^{(i)}_{k-1}, y_k)$ [15], which clearly states that the current measurement $y_k$ should be considered in particle sampling. Unfortunately, this optimal importance function cannot be obtained in closed form for general nonlinear state space models (except those with linear Gaussian measurement equations). Our measurement equation (10) is clearly nonlinear, because the warping function $W(p, X_k)$ itself is nonlinear and arbitrary video frames $I_k$ are involved. Therefore, the optimal importance function must be properly approximated for accurate template-based visual tracking with a limited number of particles.

In this section, we present in detail how we approximate the optimal importance function by Gaussian distributions on SL(3), following the Gaussian importance function approach of [15].

5.1 Gaussian Importance Function Approach

We consider the following vector state space model:

$$x_k = f(x_{k-1}) + w_k, \qquad (14)$$

$$y_k = g(x_k) + n_k, \qquad (15)$$


where $x_k \in \mathbb{R}^{N_x}$, $y_k \in \mathbb{R}^{N_y}$, and $w_k$ and $n_k$ are zero-mean Gaussian noise with covariances $P \in \mathbb{R}^{N_x \times N_x}$ and $R \in \mathbb{R}^{N_y \times N_y}$, respectively. For this model, we briefly review the principal idea of the Gaussian importance function of [15].

In [15], $\rho(x_k, \tilde{y}_k | x^{(i)}_{k-1})$ is first assumed to be jointly Gaussian with the following approximated mean $m$ and covariance $\Sigma$:

$$m = \begin{bmatrix} m_1 \\ m_2 \end{bmatrix} = \begin{bmatrix} E\big[x_k | x^{(i)}_{k-1}\big] \\ E\big[\tilde{y}_k | x^{(i)}_{k-1}\big] \end{bmatrix} = \begin{bmatrix} f\big(x^{(i)}_{k-1}\big) \\ g\big(f\big(x^{(i)}_{k-1}\big)\big) \end{bmatrix}, \qquad (16)$$

$$\Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12}^{\top} & \Sigma_{22} \end{bmatrix} = \begin{bmatrix} \mathrm{Var}\big(x_k | x^{(i)}_{k-1}\big) & \mathrm{Cov}\big(x_k, \tilde{y}_k | x^{(i)}_{k-1}\big) \\ \mathrm{Cov}\big(x_k, \tilde{y}_k | x^{(i)}_{k-1}\big)^{\top} & \mathrm{Var}\big(\tilde{y}_k | x^{(i)}_{k-1}\big) \end{bmatrix} = \begin{bmatrix} P & PJ^{\top} \\ JP & JPJ^{\top} + R \end{bmatrix}, \qquad (17)$$

where $\tilde{y}_k$ is the approximation of $y_k$ by the first-order Taylor expansion at $f(x^{(i)}_{k-1})$ with Jacobian $J \in \mathbb{R}^{N_y \times N_x}$.

Then, by conditioning $\rho(x_k, \tilde{y}_k | x^{(i)}_{k-1})$ on the recent measurement $y_k$ via the conditional distribution formula for multivariate Gaussians, $\rho(x_k | x^{(i)}_{k-1}, y_k)$ is approximated as $N(\mu_k, \Sigma_k)$, where

$$\mu_k = m_1 + \Sigma_{12} (\Sigma_{22})^{-1} (y_k - m_2), \qquad (18)$$

$$\Sigma_k = \Sigma_{11} - \Sigma_{12} (\Sigma_{22})^{-1} (\Sigma_{12})^{\top}. \qquad (19)$$

If $g$ in (15) is linear, then $\rho(x_k | x^{(i)}_{k-1}, y_k)$ is determined exactly without approximation, as in the Kalman filter. In the subsequent subsections, we show how this Gaussian importance function approach can be applied to our SL(3) case.
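A sketch of this conditioning step in the vector case, with $J$ supplied externally (in our setting it will come from (33)-(35)); the names are ours:

import numpy as np

def gaussian_proposal(x_prev, y, f, g, J, P, R):
    m1 = f(x_prev)                       # prior mean, Eq. (16)
    m2 = g(m1)                           # predicted measurement, Eq. (16)
    S12 = P @ J.T                        # cross-covariance, Eq. (17)
    S22 = J @ P @ J.T + R                # innovation covariance, Eq. (17)
    K = np.linalg.solve(S22, S12.T).T    # S12 S22^{-1} (S22 symmetric)
    mu_k = m1 + K @ (y - m2)             # Eq. (18)
    Sigma_k = P - K @ S12.T              # Eq. (19)
    return mu_k, Sigma_k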

5.2 Gaussian Distributions on SL(3)

We first consider proper notions of Gaussian distributions on SL(3). Definitions of Gaussian distributions on general Riemannian manifolds are given in, e.g., [17] and [42]. These coordinate-free characterizations, while geometrically appealing, are much too computationally involved to be used for template-based visual tracking, where fast computation is essential. Instead, we make use of the exponential coordinates discussed in Section 3.1. Note that Gaussian distributions on the Lie algebra are well-defined because the Lie algebra is a vector space. We therefore construct Gaussian distributions on SL(3) as the exponential of Gaussian distributions on sl(3), provided that the covariance values for the Gaussian distributions on sl(3) are sufficiently small to guarantee the local diffeomorphism property of the exponential map. The exponential coordinates are preferable to other coordinates, since (locally, in a neighborhood of the identity) the minimal geodesics are given by exponentials of the form $\exp(xt)$, where $x \in sl(3)$ and $t \in \mathbb{R}$; noting that minimal geodesics are the equivalent of straight lines on curved spaces, the exponential map has the desirable property of mapping "straight lines" to "straight lines".

With this background, we approximately construct a Gaussian distribution on SL(3) as

$$N_{SL(3)}(X; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^8 \det(\Sigma)}} \exp\!\left(-\frac{1}{2} d^{\top} \Sigma^{-1} d\right), \qquad (20)$$

where $d = v_{sl(3)}(\log(\mu^{-1} X))$. Here $\mu \in SL(3)$ is the mean and $\Sigma \in \mathbb{R}^{8\times8}$ is the covariance of the corresponding zero-mean Gaussian distribution on sl(3). Random sampling from $N_{SL(3)}(\mu, \Sigma)$ can then be performed according to

$$\mu \cdot \exp\!\big(v_{sl(3)}^{-1}(\epsilon)\big), \qquad (21)$$

where $\epsilon \in \mathbb{R}^8$ is sampled from the zero-mean Gaussian distribution on sl(3) with covariance $\Sigma$.

This definition can be regarded as an approximation of the one in [7], where we use the ordinary matrix exponential map instead of the Riemannian exponential map, as in the case of the SL(3) sample mean in Section 4.3. On the rotation group, where a bi-invariant metric exists, the definition of a Gaussian distribution given in [7] is identical to ours. We remark that the underlying notion behind our construction of Gaussian distributions on Lie groups is quite similar to the nonlinear mean shift concept on Lie groups of [54], in which mean shift on a Lie group is realized via the exponential mapping of ordinary vector space mean shift on the Lie algebra.

5.3 SL(3) Case

To make our problem more tractable, we apply the BCH formula to (7) using only the first two terms of (3). The resulting approximated state equation is

$$X_k = f(X_{k-1}) \cdot \exp\!\big(v_{sl(3)}^{-1}(dW_k)\sqrt{\Delta t}\big), \qquad (22)$$

where $f(X_{k-1}) = X_{k-1} \cdot \exp(A_{k-1})$. Comparing (22) with the definition of random sampling from a Gaussian on SL(3) in (21), we can easily identify that $\rho(X_k | X^{(i)}_{k-1})$ corresponds to $N_{SL(3)}(\bar{X}_k, Q)$, where $\bar{X}_k = f(X^{(i)}_{k-1})$ and $Q = P\Delta t$. We then seek the Taylor expansion of the measurement equation (10) at $\bar{X}_k$.¹

As discussed in Section 3.1, the neighborhood of $\bar{X}_k$ is parametrized by the exponential coordinates as $\bar{X}_k(u) = \bar{X}_k \cdot \exp(v_{sl(3)}^{-1}(u))$. Thus the first-order Taylor expansion of (10) at $\bar{X}_k$ can be realized by differentiating $g(\bar{X}_k(u))$ with respect to $u$. This exponential coordinate-based approach to the Taylor expansion can also be found in the literature on optimization methods on Lie groups [38], [56], [58].

With the Jacobian $J$ of $g(\bar{X}_k(u))$ evaluated at $u = 0$, $y_k$ at $\bar{X}_k$ can be approximated as

$$\tilde{y}_k = g(\bar{X}_k) + Ju + n_k. \qquad (23)$$

1. For notational convenience we shall also write $g(X_k)$ for $g(T(p), I_k(W(p, X_k)))$.


Note that $\tilde{y}_k$ in (23) is essentially an equation of $u$ representing the local neighborhood of $\bar{X}_k$. With a slight abuse of notation, we let the same $u$ also denote the random variable of the Gaussian distribution on sl(3) for $\rho(X_k | X^{(i)}_{k-1}) = N_{SL(3)}(\bar{X}_k, Q)$, i.e., $\rho(u) = N(0, Q)$. Then the analysis of $\rho(X_k, \tilde{y}_k | X^{(i)}_{k-1})$ becomes equivalent to that of $\rho(u, \tilde{y}_k)$ with $\bar{X}_k$.

We now regard $\rho(u, \tilde{y}_k)$ as jointly Gaussian with the following mean $m$ and covariance $\Sigma$:

$$m = \begin{bmatrix} m_1 \\ m_2 \end{bmatrix} = \begin{bmatrix} E[u] \\ E[\tilde{y}_k] \end{bmatrix}, \qquad (24)$$

$$\Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12}^{\top} & \Sigma_{22} \end{bmatrix} = \begin{bmatrix} \mathrm{Var}(u) & \mathrm{Cov}(u, \tilde{y}_k) \\ \mathrm{Cov}(u, \tilde{y}_k)^{\top} & \mathrm{Var}(\tilde{y}_k) \end{bmatrix}. \qquad (25)$$

Then the conditional distribution $\rho(u | \tilde{y}_k = y_k)$ is determined as $N(\bar{u}, \Sigma_k)$, with

$$\bar{u} = m_1 + \Sigma_{12} (\Sigma_{22})^{-1} (y_k - m_2), \qquad (26)$$

$$\Sigma_k = \Sigma_{11} - \Sigma_{12} (\Sigma_{22})^{-1} \Sigma_{12}^{\top}. \qquad (27)$$

The above implies that the $\rho(u)$ on sl(3) used for sampling $X_k$ from $N_{SL(3)}(\bar{X}_k, Q)$ is transformed from $N(0, Q)$ to $N(\bar{u}, \Sigma_k)$ by incorporating $y_k$.

The sampling of $X_k$ with $\bar{X}_k$ and the updated $N(\bar{u}, \Sigma_k)$ can be realized as

$$\bar{X}_k \cdot \exp\!\big(v_{sl(3)}^{-1}(\bar{u} + \epsilon)\big), \qquad (28)$$

where $\epsilon \sim N(0, \Sigma_k)$. Again using the BCH formula approximation, (28) can be rewritten as

$$\bar{X}_k \cdot \exp\!\big(v_{sl(3)}^{-1}(\bar{u})\big) \cdot \exp\!\big(v_{sl(3)}^{-1}(\epsilon)\big). \qquad (29)$$

This implies that $\rho(X_k | X^{(i)}_{k-1}, \tilde{y}_k = y_k)$ is eventually given by $N_{SL(3)}(\mu_k, \Sigma_k)$, where $\mu_k = \bar{X}_k \cdot \exp(v_{sl(3)}^{-1}(\bar{u}))$. That is, the optimal importance function $\rho(X_k | X^{(i)}_{k-1}, y_k)$ is finally approximated as $N_{SL(3)}(\mu_k, \Sigma_k)$ with

$$\bar{u} = m_1 + \Sigma_{12} (\Sigma_{22})^{-1} (y_k - m_2), \qquad (30)$$

$$\mu_k = \bar{X}_k \cdot \exp\!\big(v_{sl(3)}^{-1}(\bar{u})\big), \qquad (31)$$

$$\Sigma_k = \Sigma_{11} - \Sigma_{12} (\Sigma_{22})^{-1} \Sigma_{12}^{\top}. \qquad (32)$$

Here, $m_1$ and $\Sigma_{11}$ are immediately given by 0 and $Q$, respectively, because $\rho(u) = N(0, Q)$. The remaining $m_2$, $\Sigma_{12}$, and $\Sigma_{22}$ are likewise straightforwardly determined as $g(\bar{X}_k)$, $QJ^{\top}$, and $JQJ^{\top} + R$, respectively. Particle sampling from the obtained $N_{SL(3)}(\mu_k, \Sigma_k)$ is realized by (21), and the denominator of (11) can be straightforwardly calculated by (20).
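A sketch tying (30)-(32) and the sampling step (21) together on SL(3), reusing hat() from the earlier sketches; g and J denote the measurement function and its Jacobian at $\bar{X}_k$, and all names are ours:

import numpy as np
from scipy.linalg import expm

def sl3_proposal_sample(X_bar, y, g, J, Q, R, rng=np.random):
    m2 = g(X_bar)                          # predicted measurement at X_bar
    S12 = Q @ J.T
    S22 = J @ Q @ J.T + R
    K = np.linalg.solve(S22, S12.T).T      # S12 S22^{-1}
    u_bar = K @ (y - m2)                   # Eq. (30), with m1 = 0
    mu_k = X_bar @ expm(hat(u_bar))        # Eq. (31)
    Sigma_k = Q - K @ S12.T                # Eq. (32)
    eps = rng.multivariate_normal(np.zeros(8), Sigma_k)
    return mu_k @ expm(hat(eps)), mu_k, Sigma_k   # particle drawn via (21)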

5.4 Jacobian Calculation

To obtain the Jacobian $J$ of $g(\bar{X}_k(u))$ in (23), we employ the chain rule:

$$J_i = \frac{\partial g_i(\bar{X}_k(u))}{\partial u}\bigg|_{u=0} = \frac{\partial g_i(\bar{X}_k)}{\partial I_k(p')} \cdot \nabla_u I_k(p'_u), \qquad (33)$$

where $p' = W(p, \bar{X}_k)$, $p'_u = W(p, \bar{X}_k(u))$, and $\nabla_u I_k(p'_u) = \frac{\partial I_k(p'_u)}{\partial u}\big|_{u=0}$. Here, $J_i$ and $g_i$ denote the $i$th row of $J$ and the $i$th element of $g$, respectively. The terms $\frac{\partial g_i(\bar{X}_k)}{\partial I_k(p')}$ for $g_{NCC}$ and $g_{PCA}$ in (10) are given in [11] and [30], respectively.

$\nabla_u I_k(p'_u)$ in (33) is related to the image gradient of $I_k$, and can be further decomposed via the chain rule as

$$\nabla_u I_k(p'_u) = \nabla I_k(p') \cdot \frac{\partial p'}{\partial \bar{X}_k} \cdot \frac{\partial \bar{X}_k(u)}{\partial u}\bigg|_{u=0}. \qquad (34)$$

The first term $\nabla I_k(p')$ of (34) is the image gradient of $I_k$, calculated at the coordinates of $I_k$ and warped back to the coordinates of $T(p)$. The second term $\frac{\partial p'}{\partial \bar{X}_k}$ of (34) can be obtained by differentiating (9) with respect to $a = (a_1, \ldots, a_9)^{\top}$, where $\bar{X}_k$ is given by

$$\bar{X}_k = \begin{bmatrix} a_1 & a_4 & a_7 \\ a_2 & a_5 & a_8 \\ a_3 & a_6 & a_9 \end{bmatrix};$$

the result is

$$\begin{bmatrix} \frac{p_x}{p_3} & 0 & -\frac{p_1 p_x}{p_3^2} & \frac{p_y}{p_3} & 0 & -\frac{p_1 p_y}{p_3^2} & \frac{1}{p_3} & 0 & -\frac{p_1}{p_3^2} \\ 0 & \frac{p_x}{p_3} & -\frac{p_2 p_x}{p_3^2} & 0 & \frac{p_y}{p_3} & -\frac{p_2 p_y}{p_3^2} & 0 & \frac{1}{p_3} & -\frac{p_2}{p_3^2} \end{bmatrix}, \qquad (35)$$

with $p_1 = a_1 p_x + a_4 p_y + a_7$, $p_2 = a_2 p_x + a_5 p_y + a_8$, and $p_3 = a_3 p_x + a_6 p_y + a_9$. The last term $\frac{\partial \bar{X}_k(u)}{\partial u}\big|_{u=0}$ of (34) is a $9\times8$ matrix, each of whose columns is given by $\bar{X}_k \cdot E_i$, with the $E_i$ of (4) represented as vectors in $\mathbb{R}^9$ via $a$.
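A sketch of the $2\times9$ warp derivative (35) at a single template pixel; here a is the column-wise flattening of $\bar{X}_k$ as in the text, and the function name is ours:

import numpy as np

def dwarp_da(a, px, py):
    # a = (a1, ..., a9): the column-wise entries of X_bar_k.
    p1 = a[0] * px + a[3] * py + a[6]
    p2 = a[1] * px + a[4] * py + a[7]
    p3 = a[2] * px + a[5] * py + a[8]
    r = p3 * p3
    return np.array([
        [px / p3, 0, -p1 * px / r, py / p3, 0, -p1 * py / r, 1 / p3, 0, -p1 / r],
        [0, px / p3, -p2 * px / r, 0, py / p3, -p2 * py / r, 0, 1 / p3, -p2 / r]])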

6 ENHANCEMENT OF ACCURACY AND EFFICIENCY

6.1 Iterative Gaussian Importance Function Generation

In [30], it was shown that Gaussian importance functions obtained by local linearization yield better results for affine motion tracking than the state transition density. However, such Gaussian importance functions would poorly approximate the optimal importance function for projective motion tracking, whose warping function is nonlinear. Although a second-order approximation using the Hessian is possible [48], the exact Hessian computation is time-consuming in our case, and its approximation as in [11] would not much enhance the accuracy.

To increase the accuracy of Gaussian importance functions obtained by local linearization, we can apply the local linearization-based approximation iteratively, as in the iterated extended Kalman filter (EKF) [5]. However, when $\bar{X}_k$ ventures out of the basin of convergence, this can result in importance functions even worse than the state transition density because of possible divergence.


To overcome this divergence problem, we select the best among all Gaussian importance functions obtained over the iterations using the following two measures:

$$s_1 = y_k - g(\mu_{k,j}), \qquad (36)$$

$$s_2 = v_{sl(3)}\big(\log\big(\bar{X}_k^{-1} \mu_{k,j}\big)\big), \qquad (37)$$

where $\mu_{k,j}$ denotes the mean of the Gaussian importance function obtained at the $j$th iteration. $s_1$ represents how dissimilar the predicted measurement of $\mu_{k,j}$ is to $y_k$, while $s_2$ represents how far $\mu_{k,j}$ is from $\bar{X}_k$. To balance the influences of $s_1$ and $s_2$, we define a criterion $C$ in the form of a Gaussian, using $R$ in (10) and $Q = P\Delta t$ in (22):

$$C(j) = \exp\!\left(-\frac{1}{2} s_1^{\top} R^{-1} s_1\right) \exp\!\left(-\frac{1}{2} s_2^{\top} Q^{-1} s_2\right). \qquad (38)$$

Among all Gaussian importance functions obtained at each iteration, we select the one maximizing $C$. The selected Gaussian importance function can be intuitively understood as the one whose predicted measurement is most similar to $y_k$ while its divergence from $\bar{X}_k$ is minimized. The procedure is given in Algorithm 2.

In this case, the $w^{(i)}_k$ are calculated via (11) with

$$\rho\big(X^{(i)}_k | X^{(i)}_{k-1}\big) = \frac{1}{\sqrt{(2\pi)^8 \det(Q)}} \exp\!\left(-\frac{1}{2} d_1^{\top} Q^{-1} d_1\right), \qquad (39)$$

where $Q = \Sigma_{k,j-1}$ and $d_1 = v_{sl(3)}\big(\log\big(\mu_{k,j-1}^{-1} X^{(i)}_k\big)\big)$, in light of the fact that $\rho(X_k)$ has been deterministically transformed to $N_{SL(3)}(\mu_{k,j-1}, \Sigma_{k,j-1})$ before the calculation of $J_j$.

Note that in Algorithm 2, the uncertainty of $m_1$ decreases as the iteration proceeds, because $\Sigma_{11}$ at the $j$th iteration is given by $\Sigma_{k,j-1}$. In the general iterated EKF, $\Sigma_{11}$ is fixed to its initial value ($Q$ in our case), because a decrease in the determinant of $\Sigma_{11}$ during iteration would imply that there are multiple $y_k$'s at one time step, which does not hold in general. However, since $y_k$ in our framework is always given by the value of $g$ when $I_k(W(p, X_k)) = T(p)$, we can freely assume multiple $y_k$'s at one time step and gradually decrease the uncertainty of $m_1$. This contributes to accurate tracking with a limited number of particles.
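A sketch of the selection criterion (36)-(38) used to pick the best iterate, reusing vee() from the earlier sketches; mus denotes the list of iterate means $\mu_{k,j}$, and the names are ours:

import numpy as np
from scipy.linalg import logm

def criterion(y, g, mu_kj, X_bar, R, Q):
    s1 = y - g(mu_kj)                                      # Eq. (36)
    s2 = vee(np.real(logm(np.linalg.inv(X_bar) @ mu_kj)))  # Eq. (37)
    return (np.exp(-0.5 * s1 @ np.linalg.solve(R, s1)) *
            np.exp(-0.5 * s2 @ np.linalg.solve(Q, s2)))    # Eq. (38)

# best_j = max(range(len(mus)), key=lambda j: criterion(y, g, mus[j], X_bar, R, Q))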

Though the proposed iterative Gaussian importance function generation can approximate the optimal importance function more accurately than a single local linearization, the required iteration lowers overall computational efficiency. In the subsequent subsections, we describe a collective set of methods that together increase computational efficiency.

6.2 Inverse Formulation of Jacobian Calculation

In analogy with the inverse approach of [3], we utilize an inverse formulation of the Jacobian calculation. The idea is that the derivative of $g(T(p), I_k(W(p, \bar{X}_k(u))))$ with respect to $u$ is equivalent to that of $g(T(W(p, X_0(-u))), I_k(p'))$, where $X_0$ is the initial identity matrix.

Denoting $W(p, X_0(-u))$ by $p''_u$, the Jacobian $J$ of $g(T(p''_u), I_k(p'))$ at $u = 0$ is given by

$$J_i = \frac{\partial g_i(T(p), I_k(p'))}{\partial T(p)} \cdot \nabla_u T(p''_u), \qquad (40)$$

where $\nabla_u T(p''_u) = \frac{\partial T(p''_u)}{\partial u}\big|_{u=0}$. Note that $\nabla_u T(p''_u)$ is constant and can be pre-computed, because it depends only on $T(p)$ and $X_0$, which are always constant. To obtain the $J_i$ of $g_i(T(p''_u), I_k(p'))$, we only need to evaluate $\frac{\partial g_i(T(p), I_k(p'))}{\partial T(p)}$ with $T(p)$ and $I_k(p')$.

6.3 Template Resizing

The computational complexity of our framework increases with the size of $T(p)$, because the measurement process involves image interpolation. To avoid the increased computational complexity with large templates, we resize the template to a specified size, as described in Fig. 3a. Since the affine transformation matrix $A$ for coordinate transformation only affects the actual image interpolation, all calculations including the Jacobian can be done on the transformed template coordinates $p$ without additional care related to template resizing.

Template resizing in our framework has further significance beyond reducing computational complexity. Consider the two initial templates of different sizes shown in solid lines in Fig. 3b. The dotted quadrilaterals in Fig. 3b represent the templates transformed by

$$X_1 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0.005 & 0 & 1 \end{bmatrix}.$$

The shapes of the transformed templates are different from each other despite the same state. This means we would have to set the covariance $P$ in (7) differently according to the template size, even for the same expected object movement. By resizing the template, we can set $P$ reasonably according to the expected object movement regardless of the original template size. To reduce the computational complexity, one can alternatively use a subset of template pixels, as in [13]. However, this still suffers from the problem described in Fig. 3b and thus is not suitable for our framework.

Fig. 3. (a) The template resizing employed in our framework. $A$ and $p$ represent the affine transformation matrix for coordinate transformation and the transformed template coordinates, respectively. (b) Solid and dotted quadrilaterals represent the initial templates and the templates transformed by $X_1$, respectively. The transformed shape of the blue template is different from that of the red one because of the different template size, despite the same state.

6.4 Use of Parent-Child Particles

To further reduce computational complexity, we propose the use of parent-child particles, as described in Fig. 4. In Fig. 4, blue circles denote the $N$ parent particles $X^{(i)}_k$ to which the iterative Gaussian importance function generation is applied. Green circles are the $N_c$ child particles $X^{(i,j)}_k$ sampled from the Gaussian importance function $N^{(i)}_{SL(3)}$ of each parent particle $X^{(i)}_k$. Thus, $N \cdot N_c$ particles are sampled while the iterative Gaussian importance function generation is applied only $N$ times. After weighting all $N \cdot N_c$ particles, we resample only $N$ parent particles using residual systematic resampling [10], which is most suitable when the number of particles varies during resampling.

Theoretically, it is possible to reduce the number of particles during resampling as long as the true filtering density is well represented by the reduced set of particles. Our underlying assumption in using parent-child particles is that the true filtering density has a single mode for projective motion tracking of a single object. If a density is single-mode, it can be represented well by a smaller number of particles than in the multi-modal case.

One empirical measure of how well a probability density is represented by particles and their weights is the number of effective particles, defined as $N_{eff} = \big[\sum_i (\tilde{w}^{(i)}_k)^2\big]^{-1}$ [15], where the $\tilde{w}^{(i)}_k$ are the normalized weights. The number of effective particles varies between one (in the worst case) and the total number of particles (in the ideal case). If the number of effective particles does not change much under the proposed parent-child scheme, it can be said that parent-child particles do not alter the statistical properties of particle filtering. In Section 7.2, we experimentally verify that the use of parent-child particles is effective in reducing computational complexity without altering the statistical properties of particle filtering, by showing that the effective numbers of particles for different pairs $(N, N_c)$ with the same total number of particles are consistent.
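A sketch of the effective particle count and of residual systematic resampling; the RSR recursion below follows the standard single-uniform-draw formulation (our reading of [10]) and can output a different number of particles (N parents) than it consumes (N x Nc weighted children):

import numpy as np

def n_eff(w):
    w = w / w.sum()
    return 1.0 / np.sum(w ** 2)               # N_eff = [sum_i w_i^2]^{-1}

def rsr(w, N, rng=np.random):
    w = w / w.sum()
    u = rng.uniform(0.0, 1.0 / N)             # single uniform draw
    out = []
    for m, wm in enumerate(w):                # one pass over the weighted input
        k = int(np.floor((wm - u) * N)) + 1   # number of copies of particle m
        u += k / float(N) - wm
        out.extend([m] * k)
    return np.array(out)                      # indices of the N resampled parents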

We now present the overall algorithm of projective motion tracking via particle filtering on SL(3) with Gaussian importance functions in Algorithm 3.

6.5 Adaptation to Affine Motion Tracking

We now show how to adapt our projective motion tracking framework to the group of affine transformations. The major difference in performing affine motion tracking via particle filtering on Aff(2) lies in the warping function $W(p, X_k)$. All other details, such as the state equation and Gaussian importance functions on Aff(2), can be easily adapted from the SL(3) case.

The warping function $W(p, X_k)$ for an affine transformation with

$$X_k = \begin{bmatrix} a_1 & a_3 & a_5 \\ a_2 & a_4 & a_6 \\ 0 & 0 & 1 \end{bmatrix}$$

is defined as

$$W(p, X_k) = \begin{bmatrix} a_1 p_x + a_3 p_y + a_5 \\ a_2 p_x + a_4 p_y + a_6 \end{bmatrix}.$$

Accordingly, the second term $\frac{\partial p'}{\partial X_k}$ of (34) for Aff(2) is given by differentiating $W(p, X_k)$ with respect to $a = (a_1, \ldots, a_6)^{\top}$; the result is

$$\begin{bmatrix} p_x & 0 & p_y & 0 & 1 & 0 \\ 0 & p_x & 0 & p_y & 0 & 1 \end{bmatrix}.$$

Fig. 4. The use of parent-child particles with residual systematic resampling. Blue and green circles represent parent and child particles, respectively. The iterative Gaussian importance function generation is applied only to parent particles.

The last term of (34) for Aff(2) becomes a $6\times6$ matrix. We can also perform similarity motion tracking via particle filtering on Aff(2) by setting the state covariance as $P = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_6^2)$ with nearly zero $\sigma_1$ and $\sigma_4$. Note that this is not possible with particle filtering on SL(3), because isotropic scaling is not present in (4).
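A sketch of the affine warp and of the similarity-tracking restriction just described; the covariance values are illustrative only:

import numpy as np

def warp_affine(p, X):
    px, py = p
    return np.array([X[0, 0] * px + X[0, 1] * py + X[0, 2],
                     X[1, 0] * px + X[1, 1] * py + X[1, 2]])

# Similarity motion tracking on Aff(2): suppress the two non-similarity
# directions with nearly zero sigma_1 and sigma_4, per the text.
sigma = np.array([1e-8, 1e-2, 1e-2, 1e-8, 1.0, 1.0])
P_similarity = np.diag(sigma ** 2)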

7 EXPERIMENTS

In this section, we demonstrate the feasibility of our particle filtering-based probabilistic visual tracking framework through extensive experiments using various test video sequences. For all test sequences, we perform projective motion tracking by particle filtering on SL(3), with a C implementation on a desktop PC equipped with an Intel Core i7-2600K CPU and 16 GB RAM. The experimental results can also be seen in the supplementary video.²

7.1 Overall Tracking Performance

In order to demonstrate the overall performance of the proposed tracking framework, we use the following video sequences of planar objects: "Towel" and "MousePad" from [65], "Bear" and "Book" from [50], and "Tea" and "Acronis" of our own. For non-rigid (originally planar) objects, we use "Paper" from [50] and "Magazine" from [39], while we use "Doll" and "David" from [46] for rigid non-planar objects.

We crop the initial object template $T_{init}(p)$ from the first frame manually and resize it to $40\times40$. We set $(N, N_c) = (40, 10)$ for the parent-child particles and $N_{iter} = 5$ for the iterative Gaussian importance function generation. The PCA bases are incrementally updated at every fifth frame after the initial PCA basis extraction from the first 15 frames. The state covariance $P$ and measurement covariance $R$ are tuned specifically for each sequence to yield satisfactory tracking results. The parameters tuned for each sequence are summarized in the separate supplementary document, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPAMI.2013.170. Our C implementation with this setting runs at about 15 fps. With the inverse formulation of the Jacobian calculation, the tracking speed is increased from 8 to 15 fps. There is no noticeable performance degradation due to template resizing.

Fig. 5 shows the tracking results of our framework for the test sequences of planar objects. Please also watch the supplementary video for a better understanding of the tracking performance of our framework. Figs. 5a and 5b show that our framework accurately tracks the planar objects in "MousePad" and "Towel" despite the motion blur caused by fast and large inter-frame object motion. Our framework is also able to track the planar object in "Tea" consistently despite temporal occlusion, as Fig. 5c shows. The robustness to motion blur and temporal occlusion can be considered to originate from the stochastic nature of particle filtering.

The effect of the conjunctive use of NCC and PCA in our measurements can be verified from the results for "Bear" in Fig. 5d. When using only $g_{NCC}$ for the measurement (red lines in Fig. 5d), tracking failure occurs because of local illumination changes. On the contrary, our framework using both $g_{NCC}$ and $g_{PCA}$ (yellow lines in Fig. 5d) accurately tracks the object by accounting for the local illumination change through the incrementally updated PCA model.

The effect of outlier detection can be identified from the tracking result for "Acronis" in Fig. 5e. Without outlier detection (red lines in Fig. 5e), our framework fails to track the object because of the large NCC error induced by severe local illumination changes. On the contrary, by detecting outliers whose PCA reconstruction error is above the threshold and excluding them in calculating $g_{NCC}$ (yellow lines in Fig. 5e), our framework is able to successfully track the object.

2. The video is also available at http://cv.snu.ac.kr/jhkwon/tracking_video.avi.

Fig. 5. Tracking results of our framework for (a) "MousePad", (b) "Towel", (c) "Tea", (d) "Bear", (e) "Acronis", and (f) "Book", all of which contain planar object images. The red lines in (d) and (e) represent the results of our framework when "using only NCC" and "without outlier detection", respectively.


Fig. 5f shows the tracking failure for “Book” even withthe conjunctive use of NCC and PCA with outlier detection.In “Book”, severe specular reflection changes the objectappearance so rapidly that even the incremental PCA modelcannot capture such appearance changes adequately, whichleads to the tracking failure.

Although our framework is inherently best suited for tracking rigid planar objects, it can also yield satisfactory tracking results, to a certain extent, for non-rigid and non-planar objects. Figs. 6a and 6b show the tracking results for “Paper” and “Magazine”, in which originally planar objects are deformed non-rigidly. For both sequences, our framework estimates planar homographies that are locally optimal in minimizing the difference to the initial object template. We can confirm that such homography estimation is reasonably acceptable via a visual check of the tracking results in the supplementary video, available online.

Figs. 6c and 6d show the tracking results for “Doll” and “David”, in which rigid non-planar objects change their poses frequently under varying illumination conditions. In spite of the appearance changes owing to object pose and illumination changes, our framework estimates the object image region quite accurately for both sequences. These successful tracking results can mainly be attributed to g_PCA in the measurement, as such appearance changes are captured by the incrementally updated PCA bases. The tracking results by our framework for “Doll” and “David” are in good agreement with the results in [46], where the incremental PCA model for visual tracking was originally proposed.

7.2 Tracking Performance Analysis with Parameter Variations

In this subsection, we analyze the tracking performance of our framework under parameter variations. Fixing the template size and N_iter to 40 × 40 and five, respectively, we vary (N, N_c) for parent-child particles, the state covariance P, and the measurement covariance R. For this experiment, we use our own video sequences, “Mascot”, “Bass”, and “Cube”, shown in Fig. 7, with the ground truth corner points extracted manually. Since there is little local illumination change in these test sequences, we use only g_NCC for the measurement in this experiment to simplify the performance analysis.

In order to analyze the tracking performance quantitatively, we consider the following: “Time”, “Error”, “Std”, and N_eff. “Time” is the average computation time per frame in milliseconds. With the ground truth corner points c_k^j and the estimated corner points ĉ_k^j at time k, we calculate the RMS error as $e_k = \sqrt{\frac{1}{4}\sum_{j=1}^{4} | c_k^j - \hat{c}_k^j |^2 / s_k}$, where s_k is the ground truth object area at time k. The purpose of the error normalization by the object area is to prevent the error from being scaled according to the object scale change. “Error” and “Std” are the mean and standard deviation of e_k. As aforementioned, N_eff represents how well a probability density is represented by particles and their weights; the larger N_eff is, the more stable the tracking is.
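For reference, a sketch of this error computation follows; the shoelace formula for the ground truth object area s_k is our assumption, as the text does not specify how the area is obtained.

import numpy as np

def polygon_area(corners):
    # Shoelace formula for the area of the quadrilateral given as a 4 x 2 array.
    x, y = corners[:, 0], corners[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, 1)) - np.dot(y, np.roll(x, 1)))

def corner_rms_error(gt, est):
    # e_k: mean squared corner distance, normalized by the ground truth
    # object area s_k, then square-rooted.
    s_k = polygon_area(gt)
    sq_dist = np.sum((gt - est) ** 2, axis=1)  # |c_k^j - c_hat_k^j|^2, j = 1..4
    return np.sqrt(np.mean(sq_dist) / s_k)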

We first vary (N, N_c) for parent-child particles, which are in practice the most important factors for tracking performance. We fix P and R to values that yield satisfactory tracking results for each sequence; the values of P and R tuned for each sequence can be found in the separate supplementary document, available online. There are 12 combinations of (N, N_c), corresponding to 200, 400, and 600 particles in total.

Table 1 shows the combinations of (N, N_c) and the corresponding tracking performance. The numbers in Table 1 are obtained by averaging over 10 independent runs for each combination of (N, N_c). It is clear from Table 1 that as the total number of particles increases, “Error” and “Std” decrease while “Time” increases. On the other hand, for the same total number of particles, there is a trend that as N_c increases, “Error” and “Std” increase while “Time” decreases.3

We can verify the effectiveness of the use of parent-child particles by specifically comparing the cases of (N, N_c) = (400, 1) and (N, N_c) = (60, 10). For all test sequences, “Time” for (N, N_c) = (60, 10) is much less than that for (N, N_c) = (400, 1), while both cases yield similar “Error” and “Std”. This implies that by employing parent-child particles we can enhance the computational efficiency considerably without sacrificing tracking accuracy. It is also worth noting that in Table 1 the values of N_eff for different values of N_c are quite similar for the same total number of particles, i.e., the same N × N_c. This means that the use of parent-child particles does not alter the statistical properties of particle filtering or the tracking stability.
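Here N_eff is the standard effective sample size estimate of [15], computed from the normalized particle weights; a one-line sketch:

import numpy as np

def effective_sample_size(weights):
    # N_eff = 1 / sum_i w_i^2 for normalized weights; it approaches the total
    # particle count for uniform weights and 1 when one particle dominates.
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return 1.0 / np.sum(w ** 2)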

Fig. 7. Test video sequences used in Sections 7.2 and 7.3. The red crosses are the manually extracted ground truth object corner points.

Fig. 6. Tracking results by our framework for (a) “Paper” and (b) “Magazine”, which contain non-rigid object images, and (c) “Doll” and (d) “David”, which contain rigid non-planar object images.

3. Note that the tracking speed gain is not strictly inversely proportional to the number of parent particles, because particles copied from the same particle during resampling can share the same importance function without re-computation.


It is necessary in practice to find an appropriate combination of (N, N_c) that balances tracking accuracy and computation time. Fig. 8 shows the scatter plots of “Error” and “Time” for each combination of (N, N_c). From Fig. 8, we can see that our choice of (N, N_c) = (40, 10) for the experiments in Section 7.1 is reasonable, because the green circles in Fig. 8, representing the cases of (N, N_c) = (40, 10), can be regarded as points of good compromise between “Error” and “Time”.

We now check how the tracking performance changes with the state and measurement covariances, P and R. For each sequence, we vary the previously tuned P as P_1 = 0.8^2 P, P_2 = 0.9^2 P, P_3 = P, P_4 = 1.1^2 P, and P_5 = 1.2^2 P, and similarly for R. We set (N, N_c) = (40, 10) for this experiment. Tables 2 and 3 summarize the tracking performance comparison between the different state and measurement covariances. The numbers in Tables 2 and 3 are obtained by averaging over 10 independent runs for each case.

It is hard to identify clear trends in Table 2, except that N_eff decreases as P increases. This can be intuitively understood: a larger P results in larger variations of the sampled SL(3) particles, and accordingly fewer of the sampled particles will be similar to the ground truth. In practice, it is important to set P to cover the expected range of the object motion. As we show in Fig. 3, our template resizing approach is important for setting P correctly regardless of the initial size of the object image. Note that the values of P for the various test video sequences, given in the separate supplementary document, available online, are in very similar ranges.
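To make the role of P concrete: particles live on SL(3) and are perturbed through exponential coordinates, so P is the covariance of an 8-dimensional Gaussian in the Lie algebra. The following sketch illustrates this sampling step; the particular sl(3) basis below is our own choice for illustration and need not match the basis fixed earlier in the paper.

import numpy as np
from scipy.linalg import expm

def sl3_basis():
    # An 8-dimensional basis of sl(3), the traceless 3 x 3 matrices.
    E = []
    for i in range(3):
        for j in range(3):
            if i != j:
                B = np.zeros((3, 3))
                B[i, j] = 1.0
                E.append(B)  # six off-diagonal generators
    E.append(np.diag([1.0, -1.0, 0.0]))  # two traceless diagonal generators
    E.append(np.diag([0.0, 1.0, -1.0]))
    return E

def sample_sl3_particle(X_prev, P, rng):
    # Draw exponential coordinates eps ~ N(0, P) in R^8 and perturb the
    # previous homography on the group: X = X_prev * exp(sum_i eps_i E_i).
    eps = rng.multivariate_normal(np.zeros(8), P)
    A = sum(e * B for e, B in zip(eps, sl3_basis()))
    return X_prev @ expm(A)

A larger P spreads the sampled homographies more widely over SL(3), which is consistent with the monotone decrease of N_eff observed in Table 2.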

There is a clear trend in Table 3: “Error”, “Std”, and N_eff all increase as R increases. This can be intuitively understood: if R is set to a very large value, all particles will have essentially the same weights regardless of their similarity to the ground truth.

TABLE 1 Tracking Performance Comparison between Different Combinations of (N, N_c) for Parent-Child Particles

Fig. 8. Scatter plots of “Error” and “Time” for each combination of (N, N_c) in Table 1. The legend for (a) applies to all subfigures.

TABLE 3 Tracking Performance Comparison between Different Measurement Covariances

TABLE 2 Tracking Performance Comparison between Different State Covariances


In this case, a larger N_eff does not necessarily result in more stable tracking. It should be understood that the comparison of different values of N_eff is only meaningful under the same P and R. In practice, we should set R to balance the tracking accuracy and stability. The difficulty is that the proper value of R depends on the object image characteristics, as identified in the separate supplementary document, available online. Therefore, it would be an interesting topic to find a way to set a proper R automatically based on object image properties such as gradient and texture.

7.3 Comparison with Different Importance Functions

We now compare the performance of our Gaussian importance functions with that of the state transition density (“STD”) and the Gaussian importance functions by local linearization (“LL”) employed in [30]. We implement “LL” with the inverse formulation of the Jacobian calculation as well, to make its computation time comparable to ours. We use the same test sequences as in Section 7.2. For “STD” and “LL”, we set N to 400, 800, and 1,200 without the use of parent-child particles, while we set (N, N_c) = (40, 10) for ours. All the remaining parameters are the same as in Section 7.2.

Table 4 summarizes the performance comparison between the different importance functions. The numbers in Table 4 are obtained by averaging over 10 independent runs for each case. For “STD” and “LL”, it is clear that as N increases, “Error” and “Std” decrease while “Time” and N_eff increase. Comparing the cases of “STD” and “LL” with the same N, we can confirm from the decreased “Error” and “Std” and the increased N_eff that the local linearization-based Gaussian importance function enhances the tracking performance.

However, “LL” fails to outperform our framework in terms of “Error” and “Std” for all test sequences. Note that our framework yields the smallest “Error” and “Std” for all test sequences, with the greatest N_eff. It is remarkable that N_eff for our framework with 400 particles in total is even greater than that for “LL” with 1,200 particles.

Fig. 9 shows the scatter plots of “Error” and “Time” for each case. We can infer from Fig. 9 that “STD” and “LL” with the same “Time” as ours would yield greater “Error” than ours, and that they would require much greater “Time” to yield the same “Error” as ours. This is better illustrated by the supplementary video, available online, which shows tracking results for “Mascot”, “Bass”, and “Cube” by each importance function at a similar computational complexity, i.e., N = 1,200 for “STD”, N = 600 for “LL”, and (N, N_c) = (40, 10) for ours. The result videos clearly indicate that our framework yields more accurate and more stable tracking results than “STD” and “LL” when run at a similar computational complexity.

7.4 Benchmark Test Using Metaio Data Set

We finally evaluate the tracking performance of our framework using the Metaio benchmark data set [35], which was developed to evaluate template-based tracking algorithms. The benchmark data set is composed of four groups classified according to the object texture property: “Low”, “High”, “Repetitive”, and “Normal”. Each group contains two different objects, as shown in Fig. 10. For each object, there are five different video sequences focusing on different types of dynamic behavior, namely, “Angle” for camera viewing angle changes, “Range” for object image size changes, “FastFar” for a fast moving object far from the camera, “FastClose” for a fast moving object close to the camera, and “Illumination” for illumination changes. Therefore, there are 40 test video sequences in total in the benchmark data set.

For each sequence, the initial coordinates of four corner fiducial points outside the object image region are given.

TABLE 4 Tracking Performance Comparison between Different Importance Functions

Fig. 9. Scatter plots of “Error” and “Time” for each case in Table 4. The legend for (a) applies to all subfigures.


The goal is to estimate those fiducial point coordinates for the remaining frames. We manually extract the object template from the initial frame and perform projective motion tracking for the remaining frames. We then estimate the fiducial point coordinates by transforming their initial coordinates using the estimated homographies. In [35], the tracking at a specific frame is considered successful if the root-mean-square (RMS) error of the estimated fiducial point coordinates is less than 10 pixels. There is no consideration of the object scale change in calculating the tracking error.
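A minimal sketch of this evaluation protocol, assuming the estimated homography maps initial-frame coordinates to the current frame:

import numpy as np

def transform_points(H, pts):
    # Apply the homography H (3 x 3) to points (n x 2) in homogeneous coordinates.
    ph = np.hstack([pts, np.ones((len(pts), 1))])
    q = ph @ H.T
    return q[:, :2] / q[:, 2:3]

def is_success(H_est, fiducials_init, fiducials_gt, thresh=10.0):
    # A frame counts as successfully tracked if the RMS error of the estimated
    # fiducial coordinates is below the threshold (10 pixels in [35]).
    est = transform_points(H_est, fiducials_init)
    rms = np.sqrt(np.mean(np.sum((est - fiducials_gt) ** 2, axis=1)))
    return rms < thresh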

For the experiments using the Metaio benchmark data set, we set (N, N_c) = (80, 10) to increase the tracking accuracy at a reasonable computational complexity. The template size is 40 × 40 and N_iter is set to five, as before. The state covariance P and measurement covariance R are tuned specifically for each sequence to yield satisfactory tracking results. The parameter settings for each sequence are summarized in the separate supplementary document, available online. Although the video sequences in the Metaio benchmark data set are in color, we use their grayscale versions.

We compare our tracking performance with that of five different template tracking methods: the ESM tracker based on nonlinear optimization [8], the L1 tracker based on particle filtering [4], and motion estimation by key-point matching using SIFT [36], SURF [6], and FERNS [40]. The ESM tracker and the L1 tracker are generally considered state-of-the-art tracking methods in their respective categories, i.e., nonlinear optimization and particle filtering. The main strength of the ESM tracker is a large basin of convergence, achieved by using both forward and inverse gradients of the object template, while that of the L1 tracker is robustness to object appearance changes, achieved by an advanced object appearance model based on sparse representation. SIFT, SURF, and FERNS are also widely used for key-point matching because of their ability to stably extract distinctive local features.

We reuse the tracking results of the ESM tracker and the key-point matching-based motion estimation methods reported in [35] without re-running them. For the L1 tracker, we use the source code provided by the authors of [4]. In order to achieve higher tracking accuracy with the L1 tracker, we use 2,000 particles, which is rather many, with the state covariance values somewhat increased from the default and tuned differently for each sequence. The other parameters remain unchanged from the defaults.

It is particularly meaningful to compare our tracking performance with the ESM tracker because it also treats SL(3) as an optimization variable. That is, it is a comparison between deterministic and probabilistic template-based visual tracking algorithms employing the same Lie group-based representation. The comparison with the ESM tracker can verify our claim that well-designed particle filtering can be more robust to the problem of local optima than deterministic optimization.

It is also meaningful to compare our tracking performance with the L1 tracker because it likewise employs particle filtering as the state estimation tool, but only with the state transition density-based importance function. Furthermore, the L1 tracker can estimate the object motion only up to affine transformations. Therefore, the comparison with the L1 tracker can verify whether our two contributions, namely the use of the Gaussian importance functions and the extension of our affine motion tracking framework of [30] to projective motion tracking, are really important for highly accurate template tracking.

7.4.1 Overall Performance

The bar graphs in Fig. 11 summarize the evaluation of the tracking results by each tracking method in terms of the tracking success rate. Table 5 shows the average tracking success rates for each specific sub-group. Our framework yields higher tracking success rates for “Repetitive” and “Normal” than for “Low”. It is rather surprising that the success rate for “High” is the lowest. Fig. 11d shows that this low success rate for “High” is due to the quite low success rate for the grass image in Fig. 10d, in which no distinct structure exists.

On the other hand, our framework yields higher tracking success rates for “Angle”, “Range”, and “Illumination” than for “FastFar” and “FastClose”. The reason for the lower success rate for “FastFar” is severe motion blur caused by fast camera motion, while that for “FastClose” is partially visible objects due to the closeness to the camera. The higher tracking success rate for “Illumination” implies that our measurement using NCC and PCA can properly deal with illumination changes.

In comparison with the other template tracking methods, our framework yields the highest tracking success rate averaged over all 40 test sequences, as Table 5 shows. Specifically, our framework yields the highest tracking success rates for “Low”, “Repetitive”, “Normal”, “Angle”, and “FastFar”, while it is the second best for “High”, “Range”, and “Illumination”, following SIFT. The reason for the best tracking success rates of SIFT for “Range” and “FastClose” is that it estimates homographies using only the visible key-points, unlike our framework. Overall, the performance ranking is as follows: ours (74.53 percent), SIFT (64.35 percent), ESM (47.28 percent), FERNS (44.92 percent), SURF (31.95 percent), and L1 (21.10 percent).

7.4.2 Comparison with ESM Tracker

Our framework consistently outperforms the ESM tracker for all sub-groups. The performance difference for “Repetitive” and “FastFar” well characterizes the different inherent properties of deterministic optimization and stochastic particle filtering.

Fig. 10. Target objects in the Metaio benchmark data set [35]. There are four groups of objects classified according to the object texture property. For each object, there are five different video sequences focusing on different types of dynamic behavior.


The lower tracking success rate of the ESM tracker for “Repetitive” is due to its being trapped in local optima with similar appearances, while our framework suffers less from such local optima problems because of the stochastic nature of particle filtering. Although neither method employs a specific model of motion blur, our framework yields far better tracking success rates than the ESM tracker for “FastFar”, where severe motion blur exists. This can also be attributed to the stochastic nature of particle filtering.

7.4.3 Comparison with L1 Tracker

Our framework also consistently outperforms the L1 tracker for all sub-groups, with considerable margins. The average success rate of our framework is 74.53 percent, while that of the L1 tracker is 21.10 percent. Compared with our framework, the main reason for the poor performance of the L1 tracker is that it can estimate the object motion only up to affine transformations. Among “Angle”, “Range”, “FastFar”, “FastClose”, and “Illumination”, the L1 tracker performs worst for “Angle”, where the perspective effect is severe due to large camera viewing angle changes. Conversely, the L1 tracker performs best for “Range”, where the camera viewing angle change is more limited than in the other cases.

TABLE 5 Tracking Performance Comparison between FERNS, SIFT, SURF, ESM, L1, and Our Framework for the Metaio Benchmark Data Set
The numbers are average tracking success rates for each specific sub-group.

Fig. 11. Evaluation of tracking results for the Metaio benchmark data set by FERNS, SIFT, SURF, ESM, L1, and our framework in terms of the tracking success rate. The legend for “Low #1” in (a) applies to all subfigures.


Another reason for the poor performance of the L1 tracker is that it employs the state transition density-based importance function, unlike our framework. Considering only the frames where the perspective effect is not severe, the qualitative tracking performance of the L1 tracker is far better than the numbers in Table 5 suggest, i.e., it tracks the object with seemingly good accuracy, but the RMS tracking errors are often greater than the threshold (10 pixels). We could enhance the tracking accuracy of the L1 tracker with the state transition density-based importance function by using more particles, e.g., 5,000, but this is not practical considering the computational complexity. The average tracking error of the L1 tracker for the successfully tracked frames is 7.02 pixels, while that of our framework is 4.88 pixels. These results suggest that the state transition density-based importance function employed by the L1 tracker is not suitable for applications where high tracking accuracy is required, such as augmented reality and visual servoing.

8 DISCUSSION

From experiments using various test video sequences, we have demonstrated that our probabilistic template-based visual tracking framework can provide quite accurate tracking results with sufficient computational efficiency. Compared with [30], such satisfactory performance is largely derived from a collection of techniques, including the iterative Gaussian importance function generation, the inverse formulation of the Jacobian calculation, template resizing, and the use of parent-child particles.

The benchmark test using the Metaio data set clearly demonstrates the practical value of our framework for applications such as visual servoing, augmented reality, and visual localization and mapping, where accurate projective motion tracking is necessary for 3D camera motion estimation. Note that the Metaio benchmark data set was developed specifically with augmented reality applications in mind. We have shown that our framework yields far better tracking performance for the Metaio benchmark data set than the tracking methods currently popular for augmented reality applications, namely, the deterministic optimization-based ESM tracker and motion estimation by key-point matching using SIFT, SURF, and FERNS.

The extension of our previous affine motion tracking algorithm of [30] to projective motion tracking is another important factor in the satisfactory tracking performance for the Metaio benchmark data set. When the perspective effect is not perceivable (e.g., when the object is far enough from the camera or its size is small), an affine motion tracker can estimate the object template motion rather accurately. However, in most augmented reality applications, usually targeted at indoor environments, the perspective effect nearly always exists, as shown by our test video images in Fig. 5 and the Metaio benchmark data set images. This also holds true for visual servoing and indoor visual localization and mapping. Affine motion tracking should not be used in such applications, where the 3D camera motion is estimated from tracking results, because of its limited ability to estimate the object template motion accurately. Therefore, what we have done to our previous work [30] in this paper, i.e., the extension of the affine motion tracking framework to projective motion tracking with increased computational efficiency and tracking accuracy, has not only theoretical merit but also considerable practical importance.

Especially for visual tracking, it is difficult to model the object dynamics precisely for freely moving objects. For this reason, efforts to improve the performance of particle filtering-based visual tracking have focused on improving the object appearance model, e.g., [4], [23], [46], [59]. However, we argue that improving the particle sampling, rather than only the object appearance model, is also important, especially for projective motion tracking. The degrees of freedom of projective motion, eight, are higher than those of other forms of motion such as similarity and affine. When we sample particles naively using the state transition density-based importance function, the number of particles necessary to obtain satisfactory results increases exponentially with the state dimension. Therefore, the improvement of particle sampling is essential for the practicality of projective motion tracking via particle filtering. The poor performance of the L1 tracker of [4] for the Metaio benchmark data set also suggests that, even with an advanced object appearance model, naive particle sampling using the state transition density-based importance function can lead to tracking inaccuracy. In this paper, we have considerably improved the particle sampling by the proposed iterative Gaussian importance function generation.

The way the state is represented also affects the tracking performance considerably. In this paper, we have not presented a performance comparison between our Lie group-based state representation and a conventional vector space one. The superiority of our Lie group-based state representation to a conventional vector space one has been well demonstrated, both conceptually and experimentally, in [31] and [30], and the reader is referred to these sources for further details.

The question of how to design an image similarity measure robust to various challenging factors such as motion blur, temporal occlusion, and illumination change is obviously relevant and important to template-based visual tracking performance. However, the design of image similarity measures is not the focus of this paper. Instead, we have shown how well-established image similarity measures suitable for template-based visual tracking, such as NCC and the PCA reconstruction error, can easily fit into our framework and yield satisfactory tracking performance. The only restriction our framework places on image similarity measures is that they should be differentiable, and that the derivatives can be obtained at least approximately. The practical performance will depend on the linearity of the employed similarity measure. In this sense, entropy-based similarity measures such as mutual information [43] and disjoint information [55] can also be used for measurements in our framework because their derivatives can be obtained approximately [11], [16].

We have not addressed the occlusion issue explicitly in this paper, since the stochastic nature of particle filtering can deal (at least partly) with short-term and partial occlusion, as experimentally shown in Section 7.1 and previously in [31], [46]. One can apply more specific strategies, such as the robust error norm [9] and sub-templates [27], to our framework in order to deal effectively with long-term occlusion.


In [50], as an efficient alternative to global optimization strategies, it is suggested to run a Kalman filter as a pre-processing step to provide a reasonable initial point for optimization. Similarly, to refine our tracking results, we can run an optimization procedure as a post-processing step using the estimated state as the initial point. We can also consider fusion with a detection-based approach such as [21] to be able to escape from temporary tracking failure. Although these conjunctive uses of different kinds of methods for template-based visual tracking may not be so appealing from a theoretical point of view, in practice they can potentially result in considerable gains in both accuracy and efficiency.

9 CONCLUSION

In this paper, we have proposed a novel particle filtering framework to solve template-based visual tracking probabilistically. We have formulated particle filtering on SL(3), with appropriate state and measurement equations for projective motion tracking, that does not require a vector parametrization of homographies. In order to utilize the optimal importance function, we have clarified the notions of Gaussian distributions and Taylor expansions on SL(3) using the exponential coordinates, and approximated the optimal importance function as a Gaussian distribution on SL(3) by local linearization. We have also proposed an iterative Gaussian importance function generation method to increase the accuracy of the Gaussian importance functions. The computational efficiency is enhanced by the inverse formulation of the Jacobian calculation, template resizing, and the use of parent-child particles. We have also extended the SL(3) particle filtering framework to affine motion tracking on the group Aff(2). The improved performance of our probabilistic tracking framework, and the superiority of our Gaussian importance functions to other importance functions, have been experimentally demonstrated. Finally, we have shown via experiments using the publicly available benchmark data set that our framework yields superior tracking results to several state-of-the-art tracking methods.

ACKNOWLEDGMENTS

Frank C. Park was supported by the Center for Advanced Intelligent Manipulation, BK21+, and SNU-IAMD.

REFERENCES

[1] E. Arnaud and E. Memin, “Partial Linear Gaussian Models for Tracking in Image Sequences Using Sequential Monte Carlo Methods,” Int’l J. Computer Vision, vol. 74, no. 1, pp. 75-102, 2007.
[2] S. Baker, A. Datta, and T. Kanade, “Parameterizing Homographies,” technical report, The Robotics Inst., Carnegie Mellon Univ., 2006.
[3] S. Baker and I. Matthews, “Lucas-Kanade 20 Years On: A Unifying Framework,” Int’l J. Computer Vision, vol. 56, no. 3, pp. 221-255, 2004.
[4] C. Bao, Y. Wu, H. Ling, and H. Ji, “Real Time Robust L1 Tracker Using Accelerated Proximal Gradient Approach,” Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2012.
[5] Y. Bar-Shalom, T. Kirubarajan, and X.-R. Li, Estimation with Applications to Tracking and Navigation. John Wiley & Sons, 2002.
[6] H. Bay, T. Tuytelaars, and L. Van Gool, “SURF: Speeded Up Robust Features,” Proc. Ninth European Conf. Computer Vision (ECCV), 2006.
[7] E. Begelfor and M. Werman, “How to Put Probabilities on Homographies,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 10, pp. 1666-1670, Oct. 2005.
[8] S. Benhimane and E. Malis, “Homography-Based 2D Visual Tracking and Servoing,” Int’l J. Robotics Research, vol. 26, no. 7, pp. 661-676, 2007.
[9] M. Black and A. Jepson, “EigenTracking: Robust Matching and Tracking of Articulated Objects Using a View-Based Representation,” Int’l J. Computer Vision, vol. 26, no. 1, pp. 63-84, 1998.
[10] M. Bolic, P. Djuric, and S. Hong, “New Resampling Algorithms for Particle Filters,” Proc. Int’l Conf. Acoustics, Speech, and Signal Processing (ICASSP), 2003.
[11] R. Brooks and T. Arbel, “Generalizing Inverse Compositional and ESM Image Alignment,” Int’l J. Computer Vision, vol. 87, no. 3, pp. 191-212, 2010.
[12] A. Chiuso and S. Soatto, “Monte Carlo Filtering on Lie Groups,” Proc. IEEE 39th Conf. Decision and Control (CDC), pp. 304-309, 2000.
[13] F. Dellaert and R. Collins, “Fast Image-Based Tracking by Selective Pixel Integration,” Proc. IEEE Int’l Conf. Computer Vision Workshop Frame Rate Processing, 1999.
[14] Sequential Monte Carlo Methods in Practice, A. Doucet, N. Freitas, and N. Gordon, eds., Springer-Verlag, 2001.
[15] A. Doucet, S. Godsill, and C. Andrieu, “On Sequential Monte Carlo Sampling Methods for Bayesian Filtering,” Statistics and Computing, vol. 10, pp. 197-208, 2000.
[16] N. Dowson and R. Bowden, “Mutual Information for Lucas-Kanade Tracking (MILK): An Inverse Compositional Formulation,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 30, no. 1, pp. 180-185, Jan. 2008.
[17] U. Grenander, Probabilities on Algebraic Structures. John Wiley & Sons, 1963.
[18] G. Hager and P. Belhumeur, “Efficient Region Tracking with Parametric Models of Geometry and Illumination,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 20, no. 10, pp. 1025-1039, Oct. 1998.
[19] B. Hall, Lie Groups, Lie Algebras, and Representations: An Elementary Introduction. Springer, 2004.
[20] S. Hauberg, F. Lauze, and K. Steenstrup, “Unscented Kalman Filtering on Riemannian Manifolds,” J. Math. Imaging and Vision, vol. 46, no. 1, pp. 103-120, 2013.
[21] S. Hinterstoisser, O. Kutter, N. Navab, P. Fua, and V. Lepetit, “Real-Time Learning of Accurate Patch Rectification,” Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2009.
[22] S. Holzer, S. Ilic, and N. Navab, “Adaptive Linear Predictors for Real-Time Tracking,” Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2010.
[23] W. Hu, X. Li, W. Luo, X. Zhang, S. Maybank, and Z. Zhang, “Single and Multiple Object Tracking Using Log-Euclidean Riemannian Subspace and Block-Division Appearance Model,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 34, no. 12, pp. 2420-2440, Dec. 2012.
[24] M. Isard and A. Blake, “Condensation–Conditional Density Propagation for Visual Tracking,” Int’l J. Computer Vision, vol. 29, pp. 5-28, 1998.
[25] S. Julier and J. Uhlmann, “Unscented Filtering and Nonlinear Estimation,” Proc. IEEE, vol. 92, no. 3, pp. 401-422, Mar. 2004.
[26] F. Jurie and M. Dhome, “Hyperplane Approximation for Template Matching,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 996-1000, July 2002.
[27] F. Jurie and M. Dhome, “Real Time Robust Template Matching,” Proc. 13th British Machine Vision Conf. (BMVC), 2002.
[28] J. Kwon, M. Choi, F. Park, and C. Chun, “Particle Filtering on the Euclidean Group: Framework and Applications,” Robotica, vol. 25, no. 6, pp. 725-737, 2007.
[29] J. Kwon and K. Lee, “Monocular SLAM with Locally Planar Landmarks via Geometric Rao-Blackwellized Particle Filtering on Lie Groups,” Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2010.
[30] J. Kwon, K. Lee, and F. Park, “Visual Tracking via Geometric Particle Filtering on the Affine Group with Optimal Importance Functions,” Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2009.


[31] J. Kwon and F. Park, “Visual Tracking via Particle Filtering on the Affine Group,” Int’l J. Robotics Research, vol. 29, no. 2/3, pp. 198-217, 2010.
[32] K.-C. Lee, J. Ho, M.-H. Yang, and D. Kriegman, “Visual Tracking and Recognition Using Probabilistic Appearance Manifolds,” Computer Vision and Image Understanding, vol. 99, no. 3, pp. 303-331, 2005.
[33] M. Li, T. Tan, W. Chen, and K. Huang, “Efficient Object Tracking by Incremental Self-Tuning Particle Filtering on the Affine Group,” IEEE Trans. Image Processing, vol. 21, no. 3, pp. 1298-1313, Mar. 2012.
[34] P. Li, T. Zhang, and A. Pece, “Visual Contour Tracking Based on Particle Filters,” Image and Vision Computing, vol. 21, pp. 111-123, 2003.
[35] S. Lieberknecht, S. Benhimane, P. Meier, and N. Navab, “A Dataset and Evaluation Methodology for Template-Based Tracking Algorithms,” Proc. IEEE Eighth Int’l Symp. Mixed and Augmented Reality (ISMAR), 2009.
[36] D. Lowe, “Distinctive Image Features from Scale-Invariant Keypoints,” Int’l J. Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.
[37] B. Lucas and T. Kanade, “An Iterative Image Registration Technique with an Application to Stereo Vision,” Proc. Seventh Int’l Joint Conf. Artificial Intelligence (IJCAI), pp. 674-679, 1981.
[38] R. Mahony and J. Manton, “The Geometry of the Newton Method on Non-Compact Lie Groups,” J. Global Optimization, vol. 23, pp. 309-327, 2002.
[39] E. Malis, “ESM Software Development Kits,” http://esm.gforge.inria.fr/ESM.html.
[40] M. Ozuysal, P. Fua, and V. Lepetit, “Fast Keypoint Recognition in Ten Lines of Code,” Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2007.
[41] N. Paragios and R. Deriche, “Geodesic Active Contours and Level Sets for the Detection and Tracking of Moving Objects,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 3, pp. 266-280, Mar. 2000.
[42] X. Pennec, “Intrinsic Statistics on Riemannian Manifolds: Basic Tools for Geometric Measurements,” J. Math. Imaging and Vision, vol. 25, pp. 127-154, 2006.
[43] J. Pluim, J. Maintz, and M. Viergever, “Mutual-Information-Based Registration of Medical Images: A Survey,” IEEE Trans. Medical Imaging, vol. 22, no. 8, pp. 986-1004, Aug. 2003.
[44] D. Ponsa and A. Lopez, “Variance Reduction Techniques in Particle-Based Visual Contour Tracking,” Pattern Recognition, vol. 42, pp. 2372-2391, 2009.
[45] S. Prince, K. Xu, and A. Cheok, “Augmented Reality Camera Tracking with Homographies,” IEEE Computer Graphics and Applications, vol. 22, no. 6, pp. 39-45, Nov./Dec. 2002.
[46] D. Ross, J. Lim, R.-S. Lin, and M.-H. Yang, “Incremental Learning for Robust Visual Tracking,” Int’l J. Computer Vision, vol. 77, no. 1-3, pp. 125-141, 2008.
[47] Y. Rui and Y. Chen, “Better Proposal Distributions: Object Tracking Using Unscented Particle Filter,” Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2001.
[48] S. Saha, P. Mandal, Y. Boers, and H. Driessen, “Gaussian Proposal Density Using Moment Matching in SMC,” Statistics and Computing, vol. 19, pp. 203-208, 2009.
[49] C. Shen, A. van den Hengel, A. Dick, and M. Brooks, “Enhanced Importance Sampling: Unscented Auxiliary Particle Filtering for Visual Tracking,” Proc. 17th Australian Joint Conf. Advances in Artificial Intelligence, pp. 180-191, 2005.
[50] G. Silveira and E. Malis, “Unified Direct Visual Tracking of Rigid and Deformable Surfaces under Generic Illumination Changes in Grayscale and Color Images,” Int’l J. Computer Vision, vol. 89, no. 1, pp. 84-105, 2010.
[51] G. Silveira, E. Malis, and P. Rives, “An Efficient Direct Approach to Visual SLAM,” IEEE Trans. Robotics, vol. 24, no. 5, pp. 969-979, Oct. 2008.
[52] A. Srivastava, “Bayesian Filtering for Tracking Pose and Location of Rigid Targets,” Proc. SPIE AeroSense, vol. 4052, pp. 160-171, 2000.
[53] A. Srivastava, U. Grenander, G. Jensen, and M. Miller, “Jump-Diffusion Markov Processes on Orthogonal Groups for Object Pose Estimation,” J. Statistical Planning and Inference, vol. 103, pp. 15-37, 2002.
[54] R. Subbarao and P. Meer, “Nonlinear Mean Shift over Riemannian Manifolds,” Int’l J. Computer Vision, vol. 84, no. 1, pp. 1-20, 2009.
[55] Z. Sun and A. Hoogs, “Image Comparison by Compound Disjoint Information with Applications to Perceptual Visual Quality Assessment, Image Registration and Tracking,” Int’l J. Computer Vision, vol. 88, no. 3, pp. 461-488, 2010.
[56] C. Taylor and D. Kriegman, “Minimization on the Lie Group SO(3) and Related Manifolds,” technical report, 1994.
[57] O. Tuzel, F. Porikli, and P. Meer, “Learning on Lie Groups for Invariant Detection and Tracking,” Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2008.
[58] T. Vercauteren, X. Pennec, E. Malis, A. Perchant, and N. Ayache, “Insight into Efficient Image Registration Techniques and the Demons Algorithm,” Proc. Conf. Information Processing in Medical Imaging, pp. 495-506, 2008.
[59] H. Wang, D. Suter, K. Schindler, and C. Shen, “Adaptive Object Tracking Based on an Effective Appearance Filter,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 29, no. 9, pp. 1661-1667, Sept. 2007.
[60] Y. Wang and R. Schultz, “Tracking Occluded Targets in High-Similarity Background: An Online Training, Non-Local Appearance Model and Periodic Hybrid Particle Filter with Projective Transformation,” Proc. IEEE Int’l Conf. Image Processing (ICIP), 2009.
[61] J. Xavier and J. Manton, “On the Generalization of AR Processes to Riemannian Manifolds,” Proc. IEEE Int’l Conf. Acoustics, Speech and Signal Processing (ICASSP), 2006.
[62] A. Yilmaz, O. Javed, and M. Shah, “Object Tracking: A Survey,” ACM Computing Surveys, vol. 38, no. 4, article 13, 2006.
[63] A. Yilmaz, L. Xin, and M. Shah, “Contour-Based Object Tracking with Occlusion Handling in Video Acquired Using Mobile Cameras,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 26, no. 11, pp. 1531-1536, Nov. 2004.
[64] S. Zhou, R. Chellappa, and B. Moghaddam, “Visual Tracking and Recognition Using Appearance-Adaptive Models in Particle Filters,” IEEE Trans. Image Processing, vol. 13, no. 11, pp. 1491-1506, Nov. 2004.
[65] K. Zimmermann, J. Matas, and T. Svoboda, “Tracking by an Optimal Sequence of Linear Predictors,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 31, no. 4, pp. 677-692, Apr. 2009.

Junghyun Kwon received the BS and PhD degrees in mechanical and aerospace engineering from Seoul National University, Seoul, Korea, in 2002 and 2008, respectively. During the PhD study, he was a research assistant for the Robotics Laboratory in the School of Mechanical and Aerospace Engineering at Seoul National University. From September 2008 to January 2011, he was a Brain Korea 21 postdoctoral fellow with the Computer Vision Laboratory in the Department of Electrical Engineering and Computer Science at Seoul National University. He is currently with TeleSecurity Sciences, Inc., Las Vegas. His research interests are in visual tracking and visual SLAM, novel approaches to image reconstruction for X-ray imaging systems, and automatic threat detection and recognition for aviation security.

Hee Seok Lee received the BS degree in electrical engineering from Seoul National University, Seoul, Korea, in 2006. He is currently working toward the PhD degree in the area of computer vision and working as a research assistant for the Computer Vision Laboratory in the Department of Electrical Engineering and Computer Science at Seoul National University. His research interests are in object tracking and simultaneous localization and mapping for mobile robots.


Frank C. Park received the BS degree in electrical engineering from MIT in 1985, and the PhD degree in applied mathematics from Harvard University in 1991. From 1991 to 1995, he was an assistant professor of mechanical and aerospace engineering at the University of California, Irvine. Since 1995, he has been a professor of mechanical and aerospace engineering at Seoul National University. From 2009 to 2012, he was an adjunct professor in the Department of Interactive Computing at the Georgia Institute of Technology, and in 2001-2002 he was a visiting professor at the Courant Institute of Mathematical Sciences at New York University. His research interests are in robot mechanics, planning and control, visual tracking, and related areas of applied mathematics. In 2007-2008, he was an IEEE Robotics and Automation Society (RAS) Distinguished Lecturer, and he served as secretary of RAS in 2009-2010 and 2012-2013. He has served as a senior editor of the IEEE Transactions on Robotics, an area editor of the Springer Handbook of Robotics and Springer Tracts in Advanced Robotics (STAR), and as an associate editor of the ASME Journal of Mechanisms and Robotics. He is a fellow of the IEEE, and an incoming editor-in-chief of the IEEE Transactions on Robotics.

Kyoung Mu Lee received the BS and MS degrees in control and instrumentation engineering from Seoul National University (SNU), Seoul, Korea, in 1984 and 1986, respectively, and the PhD degree in electrical engineering from the University of Southern California (USC), Los Angeles, California, in 1993. He was awarded the Korean Government Overseas Scholarship during the PhD course. From 1993 to 1994, he was a research associate in the Signal and Image Processing Institute (SIPI) at USC. He was with Samsung Electronics Co. Ltd. in Korea as a senior researcher from 1994 to 1995. In August 1995, he joined the Department of Electronics and Electrical Engineering of Hong-Ik University, where he was an assistant and associate professor. Since September 2003, he has been with the Department of Electrical and Computer Engineering at Seoul National University as a professor and the director of the Automation Systems and Research Institute. His primary research interests include object recognition, MRF optimization, tracking, and visual navigation. He is currently serving as an associate editor of the IPSJ Transactions on CVA, MVA, the Journal of HMSP, and IEEE Signal Processing Letters. He has also served as a program chair and area chair of ACCV, CVPR, and ICCV. He received the Okawa Foundation Research Grant Award in 2006, an Honorable Mention Award at ACCV 2007, the Most Influential Paper over the Decade Award at IAPR MVA 2009, and the Outstanding Research Award from the College of Engineering of SNU in 2010. He is a distinguished lecturer of the Asia-Pacific Signal and Information Processing Association (APSIPA) for 2012-2013. He has (co)authored more than 150 publications in refereed journals and conferences including PAMI, IJCV, CVPR, ICCV, and ECCV. More information can be found on his homepage http://cv.snu.ac.kr/kmlee.

" For more information on this or any other computing topic,please visit our Digital Library at www.computer.org/publications/dlib.

KWON ET AL.: A GEOMETRIC PARTICLE FILTER FOR TEMPLATE-BASED VISUAL TRACKING 19