

Enforcing Template Representability and Temporal Consistency for Adaptive Sparse Tracking

Xue Yang, Fei Han, Hua Wang, and Hao Zhang∗

Department of Electrical Engineering and Computer Science, Colorado School of Mines, Golden, Colorado 80401

[email protected], [email protected], [email protected], [email protected]

Abstract

Sparse representation has been widely studied in visual tracking, which has shown promising tracking performance. Despite a lot of progress, the visual tracking problem is still a challenging task due to appearance variations over time. In this paper, we propose a novel sparse tracking algorithm that well addresses temporal appearance changes, by enforcing template representability and temporal consistency (TRAC). By modeling temporal consistency, our algorithm addresses the issue of drifting away from a tracking target. By exploring the templates' long-term-short-term representability, the proposed method adaptively updates the dictionary using the most descriptive templates, which significantly improves the robustness to target appearance changes. We compare our TRAC algorithm against the state-of-the-art approaches on 12 challenging benchmark image sequences. Both qualitative and quantitative results demonstrate that our algorithm significantly outperforms previous state-of-the-art trackers.

1 Introduction
Visual tracking is one of the most important topics in computer vision with a variety of applications such as surveillance, robotics, and motion analysis. Over the years, numerous visual tracking methods have been proposed with demonstrated success [Yilmaz et al., 2006; Salti et al., 2012]. However, tracking a target object robustly under different circumstances remains a challenging task due to challenges like occlusion, pose variation, background clutter, varying viewpoint, illumination and scale change. In recent years, sparse representation and particle filtering have been widely studied to solve the visual tracking problem [Mei and Ling, 2011; Mei et al., 2011]. In this framework, particles are randomly sampled around the previous target state according to Gaussian distributions, each particle is sparsely represented by a dictionary of templates, and the particle with the smallest representation error is selected as the tracking result.

∗Corresponding Author. This project was partially supported by the grant NSF-IIS 1423591.

The sparse representation of each particle can be solved using $\ell_1$ minimization. Multi-task learning improves the performance by solving all particles together as a multi-task problem using the mixed $\ell_{2,1}$ norm, which can exploit the intrinsic relationship among all particles [Zhang et al., 2012b]. The sparse trackers have demonstrated robustness to image occlusion and lighting changes. However, the temporal consistency of target appearances over time was not well investigated, which is critical to track deformable/changing objects in cluttered environments. In addition, previous template update schemes based only on an importance weight can result in a set of similar templates, which limits the representability of the templates and makes the trackers sensitive to appearance changes over time.

To make visual tracking robust to appearance changes like pose changes, rotation, and deformation, we introduce a novel sparse tracking algorithm that incorporates template representability and temporal consistency (TRAC). Our contributions are threefold: (1) We propose a novel method to model temporal consistency of target appearances in a short time period via sparsity-inducing norms, which can well address the problem of tracker drifting. (2) We introduce a novel adaptive template update scheme that considers the representability of the templates beyond only using traditional importance weights, which significantly improves the templates' discriminative power. (3) We develop a new optimization algorithm to efficiently solve the formulated problems, with a theoretical guarantee to converge to the global optimal solution.

The remainder of the paper is organized as follows. Related background is discussed in Section 2. Our novel TRAC-based tracking is proposed in Section 3. After showing experimental results in Section 4, we conclude the paper in Section 5.

2 Background
2.1 Related Work
Visual tracking has been extensively studied over the last few decades. Comprehensive surveys of tracking methods can be found in [Salti et al., 2012; Smeulders et al., 2014]. In general, existing tracking methods can be categorized as either discriminative or generative. Discriminative tracking methods formulate the tracking problem as a binary classification task that separates a target from the background. [Babenko et al., 2009] proposed a multiple instance learning algorithm that trained a discriminative classifier in an online manner to separate the object from the background.


[Kalal et al., 2010] used a bootstrapping binary classifier with positive and negative constraints for object tracking by detection. An online SVM solver was extended with latent variables in [Yao et al., 2013] for structural learning of the tracking target. Generative tracking techniques [Zhang et al., 2013], on the other hand, are based on appearance models of target objects and search for the most similar image region. The appearance model can either rely on key points and finding correspondences on deformable objects [Nebehay and Pflugfelder, 2015] or on image features extracted from a bounding box [Zhang et al., 2013]. We focus on appearance models relying on image features, which can be used to construct a descriptive representation of target objects.

Recently, sparse representation was introduced in generative tracking methods, which demonstrated promising performance [Mei and Ling, 2011; Liu et al., 2010; Li et al., 2011]. In sparse trackers, a candidate is represented by a sparse linear combination of target templates and trivial templates. The trivial templates can handle occlusion by activating a limited number of trivial template coefficients, while the whole coefficients are sparse. The sparse representation can be learned by solving an optimization problem regularized by sparsity-inducing norms. Techniques using the $\ell_1$ norm regularization to build sparse representation models are often referred to as the L1 tracker. [Bao et al., 2012] improved the L1 tracker by adding an $\ell_2$ norm regularization on the trivial templates to increase tracking performance when no occlusion is present. Considering the inherent low-rank structure of particle representations that can be learned jointly, [Zhang et al., 2012a] formulated the sparse representation problem as a low-rank matrix learning problem. A multi-task learning method was proposed to jointly learn the sparse representation of all particles under this tracking framework based on particle filters [Zhang et al., 2012b], which imposed joint sparsity using a mixed $\ell_{p,1}$ norm to encourage the sparseness of particles' representations that share only a few target templates. Besides developing sparse representation models, much research focused on studying effective visual features that can well distinguish the target from the background. [Jia et al., 2012] proposed a local structural model that samples overlapped image patches within the target region to locate the target and handle partial occlusion. [Hong et al., 2013] utilized multiple types of features, including color, shape, and texture, in jointly sparse representations shared among all particles. In [Zhang et al., 2015], global and local features were imposed together with predefined spatial layouts considering the relationship among global and local appearance as well as the spatial structure of local patches. Global and local sparse representations were also developed in [Zhong et al., 2012], using feature selection and a combination of generative and discriminative learning methods. However, the previous sparse trackers generally ignore the temporal consistency of the target in a short history of frames, which is addressed in this work.

For accurate visual tracking, templates must be updated to account for target appearance changes and prevent drift problems. Most of the sparse-based trackers adopted the template update scheme from the work in [Mei and Ling, 2011], which assigns an importance weight to each template based on its utilization during tracking. The template having the smallest weight is then replaced by the current tracking result. However, this scheme cannot model the templates' representability and cannot adapt to the degree of the target's appearance changes, and thus lacks discriminative power. Our TRAC algorithm addresses both issues and can robustly track targets with appearance changes over time.

2.2 Particle Filter
The particle filter is widely used in visual tracking; it combines sequential importance sampling and resampling methods to solve the filtering problem. It estimates the posterior distribution of state variables in a hidden Markov chain. Let $\mathbf{s}_t$ and $\mathbf{y}_t$ denote the state variable at time $t$ and its observation, respectively. The prediction of the state $\mathbf{s}_t$ given all previous observations up to time $t-1$ is given by

$$p(\mathbf{s}_t|\mathbf{y}_{1:t-1}) = \int p(\mathbf{s}_t|\mathbf{s}_{t-1})\, p(\mathbf{s}_{t-1}|\mathbf{y}_{1:t-1})\, d\mathbf{s}_{t-1} \qquad (1)$$

where $\mathbf{y}_{1:t-1} := (\mathbf{y}_1, \mathbf{y}_2, \cdots, \mathbf{y}_{t-1})$. In the update step, when the observation $\mathbf{y}_t$ is available, the state probability can be updated using the Bayes rule

$$p(\mathbf{s}_t|\mathbf{y}_{1:t}) = \frac{p(\mathbf{y}_t|\mathbf{s}_t)\, p(\mathbf{s}_t|\mathbf{y}_{1:t-1})}{p(\mathbf{y}_t|\mathbf{y}_{1:t-1})} \qquad (2)$$

In the particle filter, the posterior $p(\mathbf{s}_t|\mathbf{y}_{1:t})$ is estimated by sequential importance sampling: we select an importance density $q(\mathbf{s}_{1:t}|\mathbf{y}_{1:t})$, from which it is easy to draw samples, such that $p(\mathbf{s}_{1:t},\mathbf{y}_{1:t}) = w_t\, q(\mathbf{s}_{1:t}|\mathbf{y}_{1:t})$, where $q(\mathbf{s}_{1:t}|\mathbf{y}_{1:t}) = q(\mathbf{s}_{1:t-1}|\mathbf{y}_{1:t-1})\, q(\mathbf{s}_t|\mathbf{s}_{1:t-1},\mathbf{y}_t)$. To generate $n$ independent samples (particles) $\{\mathbf{s}_t^i\}_{i=1}^n \sim q(\mathbf{s}_{1:t}|\mathbf{y}_{1:t})$ at time $t$, we generate $\mathbf{s}_1^i \sim q(\mathbf{s}_1|\mathbf{y}_1)$ at time 1, then $\mathbf{s}_k^i \sim q(\mathbf{s}_k|\mathbf{s}_{1:k-1}^i,\mathbf{y}_k)$ at time $k$, for $k = 2, \cdots, t$. The weight of the particle $\mathbf{s}_t^i$ at time $t$ is updated as

$$w_t^i = w_{t-1}^i\, \frac{p(\mathbf{y}_t|\mathbf{s}_t^i)\, p(\mathbf{s}_t^i|\mathbf{s}_{t-1}^i)}{q(\mathbf{s}_t^i|\mathbf{s}_{1:t-1}^i,\mathbf{y}_t)} \qquad (3)$$

At each time step, the particles are resampled according to their importance weights to generate new equally weighted particles. In order to minimize the variance of the importance weights at time $t$, the importance density is selected according to $q(\mathbf{s}_t|\mathbf{s}_{1:t-1},\mathbf{y}_t) = p(\mathbf{s}_t|\mathbf{s}_{t-1},\mathbf{y}_t)$.

An affine motion model between consecutive frames is assumed in particle filters for visual tracking, as introduced in [Mei and Ling, 2011]. That is, the state variable $\mathbf{s}_t$ is defined as a vector that consists of the six parameters of the affine transformation that transforms the bounding box within each image frame to obtain an image patch of the target. The state transition $p(\mathbf{s}_t|\mathbf{s}_{t-1})$ is defined as a multivariate Gaussian distribution with a different standard deviation for each affine parameter. Since the velocity of the tracking target is unknown and can change during tracking, it is modeled within the variance of the position parameters in the state transition. In this way, tracking techniques based on particle filters need a variety of state parameters, which requires a large number of particles to represent this distribution. The observation $\mathbf{y}_t$ encodes the cropped region of interest obtained by applying the affine transformation. In practice, $\mathbf{y}_t$ is represented by the normalized features extracted from the region of interest.
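For concreteness, the sketch below outlines one resample-predict-update step of such a particle filter with a six-parameter affine state and a Gaussian state transition. It is an illustrative Python/NumPy sketch under the assumptions just described, not the authors' implementation; the callable `observation_prob` (returning $p(\mathbf{y}_t|\mathbf{s}_t^i)$, e.g., via Eq. (5) below) and the per-parameter standard deviations `affine_std` are placeholders.

```python
import numpy as np

def particle_filter_step(particles, weights, frame, affine_std, observation_prob, rng):
    """One resample-predict-update step of the particle filter (illustrative sketch).

    particles: (n, 6) affine state vectors; weights: (n,) normalized importance weights;
    observation_prob: callable giving p(y_t | s_t^i) for each particle (e.g., Eq. (5));
    affine_std: per-parameter standard deviations of the Gaussian state transition.
    """
    n = particles.shape[0]

    # Resample according to the importance weights, producing equally weighted particles.
    idx = rng.choice(n, size=n, p=weights)
    particles = particles[idx]

    # Prediction: propagate each particle through p(s_t | s_{t-1}).
    particles = particles + rng.normal(scale=affine_std, size=particles.shape)

    # Update: re-weight each particle by its observation likelihood.
    likelihoods = observation_prob(particles, frame)
    weights = likelihoods / likelihoods.sum()

    # The particle with the highest observation probability is taken as the tracking result.
    best_state = particles[np.argmax(weights)]
    return particles, weights, best_state
```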


3 TRAC-Based Sparse Tracking
3.1 Sparse Tracking
Under the tracking framework based on particle filtering, the particles are randomly sampled around the current state of the target object according to $p(\mathbf{s}_t|\mathbf{s}_{t-1})$. At time $t$, we consider $n$ particle samples $\{\mathbf{s}_t^i\}_{i=1}^n$, which are sampled from the states of the resampled particles at time $t-1$, according to the predefined multivariate Gaussian distribution $p(\mathbf{s}_t|\mathbf{s}_{t-1})$. The observations of these particles (i.e., the image features of the particles) in the $t$-th frame are denoted as $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_n] \in \mathbb{R}^{d \times n}$, where $\mathbf{x}_i$ represents the image features of the particle $\mathbf{s}_t^i$, and $d$ is the dimension of the features. In the noiseless case, each $\mathbf{x}_i$ approximately lies in the linear span of a low-dimensional subspace, which is encoded as a dictionary $\mathbf{D} = [\mathbf{d}_1, \mathbf{d}_2, \cdots, \mathbf{d}_m] \in \mathbb{R}^{d \times m}$ containing $m$ templates of the target, such that $\mathbf{X} = \mathbf{D}\mathbf{Z}$, where $\mathbf{Z} \in \mathbb{R}^{m \times n}$ is a weight matrix of $\mathbf{X}$ with respect to $\mathbf{D}$.

When targets are partially occluded or corrupted by noise, the negative effect can be modeled as sparse additive noise that can take a large value anywhere [Mei and Ling, 2011]. To address this issue, the dictionary is augmented with trivial templates $\mathbf{I}_d = [\mathbf{i}_1, \mathbf{i}_2, \cdots, \mathbf{i}_d] \in \mathbb{R}^{d \times d}$, where a trivial template $\mathbf{i}_i \in \mathbb{R}^d$ is a vector with only one nonzero entry that can capture occlusion and pixel corruption at the $i$-th location:

$$\mathbf{X} = [\,\mathbf{D}\ \ \mathbf{I}_d\,] \begin{bmatrix} \mathbf{Z} \\ \mathbf{E} \end{bmatrix} = \mathbf{B}\mathbf{W} \qquad (4)$$

Because the particles $\{\mathbf{s}_t^i\}_{i=1}^n$ are represented by the corresponding image features $\{\mathbf{x}_i\}_{i=1}^n$, the observation probability $p(\mathbf{y}_t|\mathbf{s}_t^i)$ becomes $p(\mathbf{y}_t|\mathbf{x}_i)$, which reflects the similarity between a particle and the templates. The probability $p(\mathbf{y}_t|\mathbf{x}_i)$ is inversely proportional to the reconstruction error obtained by this linear representation:

$$p(\mathbf{y}_t|\mathbf{s}_t^i) = \exp\left(-\gamma \|\mathbf{x}_i - \hat{\mathbf{x}}_i\|_2^2\right) \qquad (5)$$

where $\gamma$ is a predefined parameter and $\hat{\mathbf{x}}_i$ is the value of the particle representation predicted by Eq. (4). Then, the particle with the highest probability is selected as the target object at time $t$.
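As a small illustration of Eq. (5), the following sketch computes the observation probability of every particle from its reconstruction error and picks the most likely one. The matrices follow the notation of Eqs. (4)-(5); the default value of `gamma` is an arbitrary placeholder, not a parameter reported by the paper.

```python
import numpy as np

def observation_probabilities(X, B, W, gamma=1.0):
    """Eq. (5): p(y_t | s_t^i) = exp(-gamma * ||x_i - x_hat_i||_2^2) for each particle (sketch).

    X: (d, n) particle features; B: (d, m+d) augmented dictionary [D, I_d];
    W: (m+d, n) sparse coefficients, so x_hat_i is the i-th column of B @ W.
    """
    residuals = X - B @ W                       # per-particle reconstruction errors
    errors = np.sum(residuals ** 2, axis=0)     # squared l2 error of each column
    probs = np.exp(-gamma * errors)
    return probs, int(np.argmax(probs))         # probabilities and index of the tracking result
```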

To integrate multimodal features in multi-task sparse tracking, $n$ particles are jointly considered in estimating $\mathbf{W}$, and each particle has $K$ modalities of features. When multimodal features are applied, the particle representation $\mathbf{X}$ can be denoted as $\mathbf{X} = [\mathbf{X}^1, \mathbf{X}^2, \cdots, \mathbf{X}^K]^\top$. For each modality, the particle observation matrix $\mathbf{X}^k \in \mathbb{R}^{d_k \times n}$ has $n$ columns of normalized feature vectors for the $n$ particles, and $d_k$ is the dimensionality of the $k$-th modality such that $\sum_{k=1}^{K} d_k = d$. Then, the dictionary of the $k$-th modality is $\mathbf{B}^k = [\mathbf{D}^k, \mathbf{I}_{d_k}]$, thus Eq. (4) becomes $\mathbf{X}^k = \mathbf{B}^k\mathbf{W}^k$. The resulting representation coefficient matrix is a combination of all modality coefficients, $\mathbf{W} = [\mathbf{W}^1, \mathbf{W}^2, \cdots, \mathbf{W}^K] \in \mathbb{R}^{m \times (n \times K)}$. In the multimodal sparse tracking framework, $\mathbf{W}$ is computed by:

$$\min_{\mathbf{W}} \sum_{k=1}^{K} \|\mathbf{B}^k\mathbf{W}^k - \mathbf{X}^k\|_F^2 + \lambda \|\mathbf{W}\|_{2,1} \qquad (6)$$

where $\lambda$ is the trade-off parameter, and the $\ell_{2,1}$ norm is denoted by $\|\mathbf{W}\|_{2,1} = \sum_i \big(\sqrt{\sum_j w_{i,j}^2}\big)$ (with $w_{i,j}$ representing the element of the $i$-th row and $j$-th column in $\mathbf{W}$), which enforces an $\ell_2$ norm on each row and an $\ell_1$ norm among rows, thereby introducing sparsity over the target templates.
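To make the $\ell_{2,1}$ regularization and the multi-task objective of Eq. (6) concrete, the sketch below simply evaluates both terms; it is only an illustration of the formulation, not the solver (the actual optimization is derived in Section 3.4). The argument layout (lists of per-modality arrays) is an assumption for readability.

```python
import numpy as np

def l21_norm(W):
    """||W||_{2,1}: the l2 norm of each row, summed over rows (encourages row sparsity)."""
    return float(np.sum(np.sqrt(np.sum(W ** 2, axis=1))))

def multimodal_objective(Bs, Ws, Xs, lam):
    """Objective value of Eq. (6) for K modalities (sketch).

    Bs, Ws, Xs: lists holding the per-modality dictionary B^k, coefficients W^k,
    and observations X^k; the W^k are stacked column-wise into W = [W^1, ..., W^K].
    """
    data_term = sum(np.linalg.norm(B @ W - X, 'fro') ** 2 for B, W, X in zip(Bs, Ws, Xs))
    W_all = np.hstack(Ws)   # joint row sparsity is imposed across all modalities
    return data_term + lam * l21_norm(W_all)
```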

3.2 Temporal Consistency
To robustly track deformable or changing objects in cluttered environments and address tracker drifting, it is important to model the consistency of target appearances over a history of recent image frames. While particle filters model the time propagation of each individual particle, they cannot model the consistency of multiple particles. In visual tracking, the particles selected as the tracking results at different times are typically different (especially when severe appearance change occurs), which is critical but cannot be addressed by particle filters. This shows that, although the idea of temporal consistency is intuitive, the solution is not obvious. In our TRAC algorithm, we propose a novel sparsity regularization to enforce temporal consistency. Because the observation probability $p(\mathbf{y}_t|\mathbf{s}_t^i)$ is inversely proportional to the model error in Eq. (5), we enforce selecting the particles that are consistent with recent tracking results by applying temporal consistency in the objective function in Eq. (6).

We denote $\mathbf{W}_t$ as the coefficient matrix of all particles with respect to $\mathbf{B}_t$ in the $t$-th frame, $\mathbf{w}_{t-l}$ as the coefficient vector of the tracking result (i.e., the selected particle encoding the target object) in the $(t-l)$-th frame with respect to $\mathbf{B}_t$, and $\mathbf{W}_{t-l} = \mathbf{w}_{t-l}\mathbf{1}_n$ denotes the coefficient matrix for the target with the same rank as $\mathbf{w}_{t-l}$. Based on the insight that a target object usually has a higher similarity to a more recent tracking result and this similarity decreases over time, we employ a time decay factor to model the temporal correlation. Then, the temporal consistency can be modeled using an autoregressive model as $\sum_{l=1}^{T} \alpha^l \|\mathbf{W}_t - \mathbf{W}_{t-l}\|_{2,1}$, where $\alpha$ is the time decay parameter. Thus, our multimodal sparse tracking task at time $t$ is formulated as:

$$\min_{\mathbf{W}_t} \sum_{k=1}^{K} \|\mathbf{B}_t^k \mathbf{W}_t^k - \mathbf{X}_t^k\|_F^2 + \lambda_1 \|\mathbf{W}_t\|_{2,1} + \lambda_2 \sum_{l=1}^{T} \alpha^l \|\mathbf{W}_t - \mathbf{W}_{t-l}\|_{2,1} \qquad (7)$$

and $\mathbf{W}_{t-l}$ is computed by:

$$\min_{\mathbf{W}_{t-l}} \sum_{k=1}^{K} \|\mathbf{B}_t^k \mathbf{W}_{t-l}^k - \mathbf{X}_{t-l}^k\|_F^2 + \lambda_1 \|\mathbf{W}_{t-l}\|_{2,1}$$

The $i$-th row of the coefficient difference matrix $\mathbf{W}_t - \mathbf{W}_{t-l}$ in Eq. (7) denotes the weight differences of the $i$-th template between the target in the $t$-th frame and the previous tracking result in the $(t-l)$-th frame. The $\ell_{2,1}$ norm of the coefficient difference $\|\mathbf{W}_t - \mathbf{W}_{t-l}\|_{2,1}$ enforces a small number of rows to have non-zero values, i.e., only a small set of the templates can be different to represent the targets in frames $t$ and $t-l$. In other words, this regularization term encourages the target appearance in the current frame to be similar to the previous tracking results. Thus, using this regularization, the particles with appearances that are similar to the recent tracking results can be better modeled, and the corresponding observation probability $p(\mathbf{y}_t|\mathbf{s}_t^i)$ is higher. The particle with


the highest observation probability in Eq. (7) is then chosen as the tracking result. When templates are updated (Sec. 3.3), the coefficient matrices $\mathbf{W}_{t-l}$ $(l = 1, \ldots, T)$ need to be recalculated. If the tracking result in frame $t-l$ is included in the current dictionary, we do not use its coefficient to enforce consistency, to avoid overfitting (i.e., the dictionary can perfectly encode the tracking result at $t-l$ with no errors).
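The temporal-consistency regularizer added in Eq. (7) can be evaluated as in the short sketch below; this only illustrates the penalty term, with `W_prev_list[l-1]` assumed to hold the replicated coefficient matrix $\mathbf{W}_{t-l}$ of the tracking result $l$ frames ago.

```python
import numpy as np

def temporal_consistency_term(W_t, W_prev_list, alpha):
    """Regularizer of Eq. (7): sum over l of alpha^l * ||W_t - W_{t-l}||_{2,1} (sketch).

    W_t: current coefficient matrix; W_prev_list[l-1]: W_{t-l}; alpha: time decay in (0, 1).
    """
    total = 0.0
    for l, W_prev in enumerate(W_prev_list, start=1):
        diff = W_t - W_prev
        # l2,1 norm of the row-wise coefficient differences, decayed by alpha^l.
        total += (alpha ** l) * np.sum(np.sqrt(np.sum(diff ** 2, axis=1)))
    return total
```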

3.3 Adaptive Template Update
The target appearance usually changes over time; thus fixed templates typically cause the tracking drift problem. To model the appearance variation of the target, the dictionary needs to be updated. Previous techniques [Mei and Ling, 2011] for template update assign each template an importance weight to prefer frequently used templates, and replace the template with the smallest weight by the current tracking result if it is different from the highest weighted template. However, these methods suffer from two key issues. First, the update scheme does not consider the representability of the templates, but relies only on their frequency of being used. Thus, similar templates are usually included in the dictionary, which decreases the discriminative power of the templates. Second, previous update techniques are not adaptive; they update the templates at the same frequency without modeling the target's changing speed. Consequently, they are incapable of capturing the insight that when the target's appearance changes faster, the templates must be updated more frequently, and vice versa.

To address these issues, we propose a novel adaptive template update scheme that allows our TRAC algorithm to adaptively select target templates, based on their representativeness and importance, according to the degree of appearance changes during tracking. When updating templates, we consider their long-term-short-term representativeness. The observations of recent tracking results are represented by $\mathbf{Y} = [\mathbf{y}_t, \mathbf{y}_{t-1}, \cdots, \mathbf{y}_{t-(l-1)}] \in \mathbb{R}^{d \times l}$, where $\mathbf{y}_t$ is the observation (i.e., feature vector) of the particle chosen as the tracking target at time $t$, which is used as the template candidate to update the dictionary $\mathbf{D} \in \mathbb{R}^{d \times m}$. Then, the objective is to select $r$ ($r < l$, $r < m$) templates that are most representative in the short term from the recent tracking results, which can be formulated as:

$$\min_{\mathbf{U}} \|\mathbf{Y} - \mathbf{Y}\mathbf{U}\|_F^2 + \lambda_3 \|\mathbf{U}\|_{2,1} \qquad (8)$$

where $\mathbf{U} = [\mathbf{u}_1, \mathbf{u}_2, \cdots, \mathbf{u}_l] \in \mathbb{R}^{l \times l}$, and $\mathbf{u}_i$ is the weight of the template candidates to represent the $i$-th candidate in $\mathbf{Y}$. The $\ell_{2,1}$ norm enforces sparsity among the candidates, which enables selecting a small set of representative candidates. After solving Eq. (8), we can sort the rows $\mathbf{U}^i$ ($i = 1, \ldots, l$) by the row-sum values of the absolute $\mathbf{U}$ in decreasing order, resulting in a row-sorted matrix $\mathbf{U}'$. A key contribution of our TRAC algorithm is its capability to adaptively select a number of templates that varies according to the degree of the target's appearance variation. Given $\mathbf{U}'$, our algorithm determines the minimum $r$ value that satisfies $\frac{1}{l}\sum_{i=1}^{r} \|\mathbf{U}'_i\|_1 \geq \gamma$, and selects the $r$ template candidates corresponding to the top $r$ rows of $\mathbf{U}'$, where $\gamma$ is a threshold encoding our expectation of the overall representativeness of the selected candidates (e.g., $\gamma = 0.75$). When the target's appearance remains the same in the recent tracking results, one candidate will obtain a high row-sum value (while the others have values close to 0, due to the $\ell_{2,1}$ norm), and it will be selected as the single candidate. On the other hand, when the target's appearance significantly changes, since no single candidate can well represent the others, the rows of $\mathbf{U}$ become less sparse and a set of candidates can have high row-sum values, so multiple candidates in the top rows of $\mathbf{U}'$ will be selected. Therefore, our TRAC method is able to adaptively select a varying number of template candidates based on their short-term representability, according to the degree of the target's appearance changes.

To update the dictionary $\mathbf{D}$, the adaptively selected $r$ candidates are added to $\mathbf{D}$, while the same number of templates must be removed from $\mathbf{D}$. To select the templates to remove, we compute the representativeness weight of the templates in $\mathbf{D}$, using the same formulation as in Eq. (8). Since the dictionary incorporates template information from the beginning of tracking, we call this weight the long-term representativeness. Then, the templates to remove from $\mathbf{D}$ are selected according to a combined weight:

$$w = \beta\, w_{\mathrm{rep}} + (1 - \beta)\, w_{\mathrm{imp}} \qquad (9)$$

where $w_{\mathrm{rep}}$ denotes the normalized long-term representativeness weight, $w_{\mathrm{imp}}$ denotes the traditional normalized importance weight, and $\beta$ is a trade-off parameter. The $r$ templates in $\mathbf{D}$ with the minimum weights are removed.
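The adaptive selection rule and the combined removal weight can be sketched as follows, assuming Eq. (8) has already been solved for `U` (the solver in Section 3.4 covers it as the special case $\lambda_2 = 0$ of Eq. (7)); the function names and the default `gamma` and `beta` values are illustrative, not part of the original implementation.

```python
import numpy as np

def select_template_candidates(U, gamma=0.75):
    """Adaptively choose template candidates from the solution U of Eq. (8) (sketch).

    Rows of U are ranked by their absolute row sums; the smallest r is returned such
    that (1/l) * (sum of the top-r row sums) reaches the threshold gamma.
    """
    l = U.shape[0]
    row_scores = np.sum(np.abs(U), axis=1)            # l1 norm of each row of |U|
    order = np.argsort(-row_scores)                    # decreasing order (rows of U')
    cumulative = np.cumsum(row_scores[order]) / l
    r = min(int(np.searchsorted(cumulative, gamma)) + 1, l)
    return order[:r]                                   # indices of the selected candidates

def templates_to_remove(w_rep, w_imp, r, beta=0.5):
    """Eq. (9): combine long-term representativeness and importance weights; drop the r smallest."""
    w = beta * w_rep + (1.0 - beta) * w_imp
    return np.argsort(w)[:r]
```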

3.4 Optimization Algorithm
Although the optimization problems in Eqs. (7) and (8) are convex, since their objective functions contain non-smooth terms, they are still challenging to solve. We introduce a new efficient algorithm to solve both problems, and provide a theoretical analysis to prove that the algorithm converges to the global optimal solution. Since Eq. (8) is a special case of Eq. (7) when $\lambda_2 = 0$, we derive the solution according to the notation used in Eq. (7). For a given matrix $\mathbf{W} = [w_{i,j}]$, we represent its $i$-th row as $\mathbf{w}^i$ and its $j$-th column as $\mathbf{w}_j$. Given $\mathbf{W}_t^k = [\mathbf{w}_{t1}^k, \mathbf{w}_{t2}^k, \cdots, \mathbf{w}_{tn}^k]$, taking the derivative of the objective with respect to $\mathbf{W}_t^k$ ($1 \leq k \leq K$), and setting it to zero, we obtain

$$(\mathbf{B}_t^k)^\top \mathbf{B}_t^k \mathbf{W}_t^k - (\mathbf{B}_t^k)^\top \mathbf{X}_t^k + \lambda_1 \mathbf{D}\mathbf{W}_t^k + \lambda_2 \sum_{l=1}^{T} \alpha^l \mathbf{D}^l (\mathbf{W}_t^k - \mathbf{W}_{t-l}^k) = 0 \qquad (10)$$

where $\mathbf{W}_{t-l}^k$ is the coefficient of the $k$-th view in the tracking result at time $t-l$, $\mathbf{D}$ is a diagonal matrix with the $i$-th diagonal element as $\frac{1}{2\|\mathbf{w}_t^i\|_2}$, and $\mathbf{D}^l$ is a diagonal matrix with the $i$-th diagonal element as $\frac{1}{2\|\mathbf{w}_t^i - \mathbf{w}_{t-l}^i\|_2}$. Thus we have:

$$\mathbf{W}_t^k = \left((\mathbf{B}_t^k)^\top \mathbf{B}_t^k + \lambda_1 \mathbf{D} + \lambda_2 \sum_{l=1}^{T} \alpha^l \mathbf{D}^l\right)^{-1} \left((\mathbf{B}_t^k)^\top \mathbf{X}_t^k + \lambda_2 \sum_{l=1}^{T} \alpha^l \mathbf{D}^l \mathbf{W}_{t-l}^k\right) \qquad (11)$$


Note that $\mathbf{D}$ and $\mathbf{D}^l$ ($1 \leq l \leq T$) depend on $\mathbf{W}_t$ and thus are also unknown variables. We propose an iterative algorithm to solve this problem, described in Algorithm 1.

Convergence analysis. The following theorem guarantees the convergence of Algorithm 1.

Theorem 1. Algorithm 1 decreases the objective value of Eq. (7) in each iteration.

Proof. In each iteration of Algorithm 1, according to Steps 3 to 5, we know that

$$(\mathbf{W}_t)_{s+1} = \arg\min_{\mathbf{W}_t} \sum_{k=1}^{K} \|\mathbf{B}_t^k \mathbf{W}_t^k - \mathbf{X}_t^k\|_F^2 + \lambda_1 \operatorname{Tr}\!\left(\mathbf{W}_t^\top \mathbf{D}_{s+1} \mathbf{W}_t\right) + \lambda_2 \sum_{l=1}^{T} \alpha^l \operatorname{Tr}\!\left((\mathbf{W}_t - \mathbf{W}_{t-l})^\top \mathbf{D}_{s+1}^l (\mathbf{W}_t - \mathbf{W}_{t-l})\right)$$

Thus, we can derive:

$$\sum_{k=1}^{K} \|\mathbf{B}_t^k (\mathbf{W}_t^k)_{s+1} - \mathbf{X}_t^k\|_F^2 + \lambda_1 \operatorname{Tr}\!\left((\mathbf{W}_t)_{s+1}^\top \mathbf{D}_{s+1} (\mathbf{W}_t)_{s+1}\right) + \lambda_2 \sum_{l=1}^{T} \alpha^l \operatorname{Tr}\!\left(((\mathbf{W}_t)_{s+1} - \mathbf{W}_{t-l})^\top \mathbf{D}_{s+1}^l ((\mathbf{W}_t)_{s+1} - \mathbf{W}_{t-l})\right)$$
$$\leq \sum_{k=1}^{K} \|\mathbf{B}_t^k (\mathbf{W}_t^k)_{s} - \mathbf{X}_t^k\|_F^2 + \lambda_1 \operatorname{Tr}\!\left((\mathbf{W}_t)_{s}^\top \mathbf{D}_{s+1} (\mathbf{W}_t)_{s}\right) + \lambda_2 \sum_{l=1}^{T} \alpha^l \operatorname{Tr}\!\left(((\mathbf{W}_t)_{s} - \mathbf{W}_{t-l})^\top \mathbf{D}_{s+1}^l ((\mathbf{W}_t)_{s} - \mathbf{W}_{t-l})\right)$$

Substituting $\mathbf{D}$ and $\mathbf{D}^l$ by their definitions, we obtain:

$$L_{s+1} + \lambda_1 \sum_{i=1}^{m} \frac{\|(\mathbf{w}_t^i)_{s+1}\|_2^2}{2\|(\mathbf{w}_t^i)_{s}\|_2} + \lambda_2 \sum_{l=1}^{T} \alpha^l \sum_{i=1}^{m} \frac{\|(\mathbf{w}_t^i)_{s+1} - \mathbf{w}_{t-l}^i\|_2^2}{2\|(\mathbf{w}_t^i)_{s} - \mathbf{w}_{t-l}^i\|_2} \leq L_{s} + \lambda_1 \sum_{i=1}^{m} \frac{\|(\mathbf{w}_t^i)_{s}\|_2^2}{2\|(\mathbf{w}_t^i)_{s}\|_2} + \lambda_2 \sum_{l=1}^{T} \alpha^l \sum_{i=1}^{m} \frac{\|(\mathbf{w}_t^i)_{s} - \mathbf{w}_{t-l}^i\|_2^2}{2\|(\mathbf{w}_t^i)_{s} - \mathbf{w}_{t-l}^i\|_2}$$

where $L_s = \sum_{k=1}^{K} \|\mathbf{B}_t^k (\mathbf{W}_t^k)_s - \mathbf{X}_t^k\|_F^2$. Since it can be easily verified that for the function $f(x) = x - \frac{x^2}{2\alpha}$, given any $x \neq \alpha \in \mathbb{R}$, $f(x) \leq f(\alpha)$ holds, we can derive:

$$\sum_{i=1}^{m} \|(\mathbf{w}_t^i)_{s+1}\|_2 - \sum_{i=1}^{m} \frac{\|(\mathbf{w}_t^i)_{s+1}\|_2^2}{2\|(\mathbf{w}_t^i)_{s}\|_2} \leq \sum_{i=1}^{m} \|(\mathbf{w}_t^i)_{s}\|_2 - \sum_{i=1}^{m} \frac{\|(\mathbf{w}_t^i)_{s}\|_2^2}{2\|(\mathbf{w}_t^i)_{s}\|_2}$$

and

$$\sum_{i=1}^{m} \|(\mathbf{w}_t^i)_{s+1} - \mathbf{w}_{t-l}^i\|_2 - \sum_{i=1}^{m} \frac{\|(\mathbf{w}_t^i)_{s+1} - \mathbf{w}_{t-l}^i\|_2^2}{2\|(\mathbf{w}_t^i)_{s} - \mathbf{w}_{t-l}^i\|_2} \leq \sum_{i=1}^{m} \|(\mathbf{w}_t^i)_{s} - \mathbf{w}_{t-l}^i\|_2 - \sum_{i=1}^{m} \frac{\|(\mathbf{w}_t^i)_{s} - \mathbf{w}_{t-l}^i\|_2^2}{2\|(\mathbf{w}_t^i)_{s} - \mathbf{w}_{t-l}^i\|_2} \qquad (12)$$

Adding the previous three inequalities on both sides (note that Eq. (12) is repeated for $1 \leq l \leq T$), we have

$$L_{s+1} + \lambda_1 \sum_{i=1}^{m} \|(\mathbf{w}_t^i)_{s+1}\|_2 + \lambda_2 \sum_{l=1}^{T} \alpha^l \sum_{i=1}^{m} \|(\mathbf{w}_t^i)_{s+1} - \mathbf{w}_{t-l}^i\|_2 \leq L_{s} + \lambda_1 \sum_{i=1}^{m} \|(\mathbf{w}_t^i)_{s}\|_2 + \lambda_2 \sum_{l=1}^{T} \alpha^l \sum_{i=1}^{m} \|(\mathbf{w}_t^i)_{s} - \mathbf{w}_{t-l}^i\|_2$$

Therefore, the algorithm decreases the objective value in each iteration. Since the problem in Eq. (7) is convex, the algorithm converges to the global solution.

Algorithm 1: An efficient iterative algorithm to solve the optimization problems in Eqs. (7) and (8).
Input: $\mathbf{B}_t$, $\mathbf{X}_t$
Output: $(\mathbf{W}_t)_s \in \mathbb{R}^{m \times (nK)}$
1. Let $s = 1$. Initialize $(\mathbf{W}_t)_s$ by solving $\min_{\mathbf{W}_t} \sum_{k=1}^{K} \|\mathbf{B}_t^k \mathbf{W}_t^k - \mathbf{X}_t^k\|_F^2$.
2. while not converged do
3.   Calculate the diagonal matrix $\mathbf{D}_{s+1}$, where the $i$-th diagonal element is $\frac{1}{2\|(\mathbf{w}_t^i)_s\|_2}$.
4.   Calculate the diagonal matrices $\mathbf{D}_{s+1}^l$ ($1 \leq l \leq T$), where the $i$-th diagonal element is $\frac{1}{2\|(\mathbf{w}_t^i)_s - \mathbf{w}_{t-l}^i\|_2}$.
5.   For each $\mathbf{W}_t^k$ ($1 \leq k \leq K$), calculate $(\mathbf{W}_t^k)_{s+1}$ using Eq. (11).
6.   $s = s + 1$
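A compact NumPy rendering of Algorithm 1 is sketched below. It assumes, for simplicity, that all modalities share dictionaries of the same size so the per-modality coefficient matrices can be stacked row-wise, uses a small `eps` to guard the divisions in the reweighting matrices, and runs a fixed number of iterations instead of an explicit convergence test; it illustrates the update rule of Eq. (11) and is not the authors' released code.

```python
import numpy as np

def trac_solver(Bs, Xs, W_prev, lam1, lam2, alpha, n_iters=20, eps=1e-8):
    """Sketch of Algorithm 1: iteratively reweighted solver for Eq. (7).

    Bs, Xs: per-modality dictionaries B_t^k and observations X_t^k (lists of arrays);
    W_prev: list over l = 1..T, each entry a list of the K matrices W_{t-l}^k;
    returns the list of coefficient matrices W_t^k.
    """
    K = len(Bs)
    # Step 1: initialize W_t^k with the unregularized least-squares solution.
    Ws = [np.linalg.lstsq(B, X, rcond=None)[0] for B, X in zip(Bs, Xs)]

    for _ in range(n_iters):
        W_all = np.hstack(Ws)
        # Step 3: D with i-th diagonal entry 1 / (2 ||w_t^i||_2).
        D = np.diag(1.0 / (2.0 * np.linalg.norm(W_all, axis=1) + eps))
        # Step 4: D^l with i-th diagonal entry 1 / (2 ||w_t^i - w_{t-l}^i||_2).
        Dls = []
        for prev_k in W_prev:
            diff = W_all - np.hstack(prev_k)
            Dls.append(np.diag(1.0 / (2.0 * np.linalg.norm(diff, axis=1) + eps)))
        # Step 5: closed-form update of each W_t^k via Eq. (11).
        for k in range(K):
            A = Bs[k].T @ Bs[k] + lam1 * D
            rhs = Bs[k].T @ Xs[k]
            for l, prev_k in enumerate(W_prev, start=1):
                A = A + lam2 * (alpha ** l) * Dls[l - 1]
                rhs = rhs + lam2 * (alpha ** l) * (Dls[l - 1] @ prev_k[k])
            Ws[k] = np.linalg.solve(A, rhs)
    return Ws
```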

4 Experiments
To evaluate the performance of the proposed TRAC method, we performed extensive validation on twelve challenging image sequences that are publicly available from the widely used Visual Tracker Benchmark dataset [Wu et al., 2013]¹. The image sequences contain a variety of target objects under static or dynamic backgrounds. The length of the image sequences also varies, with the shortest under 100 frames and the longest over 1000 frames. Each frame of a sequence is manually annotated with the corresponding ground-truth bounding box for the tracking target; the attributes and challenges of each sequence that may affect tracking performance are also provided in the dataset.

Throughout the experiments, we employed the parameter set of $\lambda_1 = 0.5$, $\lambda_2 = 0.1$, $\lambda_3 = 0.5$, $\alpha = 0.1$, $\beta = 0.5$, $n = 400$, and $m = 10$. To represent the tracking targets, we employed four popular visual features that were widely used in previous sparse tracking methods: color histograms, intensity, histograms of oriented gradients (HOG), and local binary patterns (LBP). We compared our TRAC algorithm with ten state-of-the-art methods, including trackers based on (1) multiple instance learning (MIL) [Babenko et al., 2009], (2) online Adaboost boosting (OAB) [Grabner et al., 2006], (3) the L1 accelerated proximal gradient tracker (L1APG) [Bao et al., 2012], (4) Struck [Hare et al., 2011], (5) circulant structure tracking with kernels (CSK) [Henriques et al., 2012], (6) local sparse and K-selection tracking (LSK) [Liu et al., 2011], (7) multi-task tracking (MTT) [Zhang et al., 2012b], (8) incremental visual tracking (IVT) [Ross et al., 2008], (9) fragments-based tracking (Frag) [Adam et al., 2006], and (10) visual tracking decomposition (VTD) [Kwon and Lee, 2010].

¹ The Visual Tracker Benchmark: www.visual-tracking.net.


Figure 1: Tracking results of 11 trackers (denoted in different colors) on 12 image sequences. Frame indices are shown in the top left corner in yellow. Results are best viewed in color on high-resolution displays.

4.1 Qualitative Evaluation
The qualitative tracking results obtained by our TRAC algorithm are shown in Figure 1. We analyze and compare the performance when various challenges are present, as follows.

Occlusion: The walking2 and girl sequences track a person's body or a human face while occluded by another person. In the walking2 sequence, the OAB, Frag, MIL, CT, LSK, and VTD methods fail when the walking woman is occluded by a man. The Struck method shows larger tracking errors from the accurate position. On the other hand, the TRAC, L1APG, MTT, and IVT methods successfully track the target throughout the entire sequence. The main challenge of the girl sequence is occlusion and pose variation. Frag fails when the girl starts to rotate; LSK fails when the girl completely turns her back towards the camera. The IVT method fails around frame 125 when the girl keeps rotating, and the CT and MIL methods experience significant drift at the same time. When the man's face occludes the girl, the VTD method starts to track the man but comes back to the target when the man disappears. The TRAC, L1APG, MTT, OAB, and Struck methods accurately track the target face in the entire sequence.

Background Clutter: The basketball and skating1 sequences track a fast moving human among other people, with significant background clutter, occlusion and deformation. In the basketball sequence, the TRAC, VTD, and Frag methods track the correct target throughout the entire sequence, while Frag suffers more errors from the accurate position. Other trackers fail to track the target at different time frames. Due to enforcing temporal consistency and adaptively updating templates, our TRAC method accurately tracks the fast moving human body. In the skating1 sequence, the TRAC and VTD methods can track the target most of the time. The LSK and OAB trackers keep tracking most of the time but significantly drift away at the frames where the background is dark. Struck fails when the target is occluded by another person. Other trackers fail at earlier time frames due to the target or background motion.

Illumination Variation: The main challenge of the shaking and fish sequences is illumination change. In shaking, the OAB, CT, IVT, Frag and MTT trackers fail to track the target face in frames around 17, 21, 25, 53, and 60, respectively. Struck cannot track the accurate position most of the time and drifts far away. LSK fails in frame 18 but recovers in frame 59; it also suffers tracking drift when the hat occludes the man's face. In contrast, TRAC and VTD successfully track the target for the whole video. In the fish sequence, OAB and LSK fail in frames 25 and 225, respectively. L1APG, MTT, Frag, MIL, and VTD track part of the target but gradually drift away. The TRAC, IVT, Struck, and CT methods accurately track the entire sequence despite large illumination changes, while CT is less accurate compared to the other successful methods.

Figure 2: Overall tracking performance of our TRAC algorithm and comparison with previous state-of-the-art methods: (a) precision, (b) success rate.

Pose Variation: The david2, dudek, and trellis sequences track human faces in different situations with significant pose changes. In david2, CT fails at the very beginning; Frag fails around frame 165; OAB and LSK start to drift at frames 159 and 341, respectively, and then fail. MIL roughly tracks the target but exhibits significant drifts. In the dudek sequence, occlusion by hands occurs at frame 205, where the CT and OAB methods start to drift shortly after. The Frag approach suffers more drift than other trackers when the pose changes, and fails around frame 906. The OAB method fails around frame 975, when the target is partially out of view. The L1APG method experiences significant drift at frame 1001 and keeps drifting from the accurate position to the end of the sequence.


Figure 3: Precision and success plots evaluated on image sequences with the challenges of (a) occlusion, (b) rotation (including in-plane and out-of-plane rotation), (c) illumination variation, and (d) background clutter.

In the trellis sequence, the OAB, MTT, IVT, Frag, L1APG, MIL, CT, and VTD methods fail around frames 115, 192, 210, 212, 239, 240, 321, and 332, respectively. Struck successfully tracks the moving faces with slight tracking drift. The proposed TRAC tracker accurately tracks the moving targets with significant pose changes in all three videos, due to its ability to adaptively update templates and enforce temporal consistency.

4.2 Quantitative Evaluation
We also quantitatively evaluate our TRAC method's performance using the precision and success rate metrics [Wu et al., 2013]. The precision metric is computed using the center location error, which is the Euclidean distance between the center of the tracked target and the ground truth in each frame. The plot is generated as the percentage of frames whose center location error is within a given threshold versus the predefined threshold. The representative precision score is calculated with the threshold set to 20 pixels. The success rate metric is used to evaluate the bounding box overlap. The overlap score is defined as the Jaccard similarity: given the tracked bounding box $\mathrm{ROI}_T$ and the ground truth bounding box $\mathrm{ROI}_G$, it is calculated as $s = \frac{|\mathrm{ROI}_T \cap \mathrm{ROI}_G|}{|\mathrm{ROI}_T \cup \mathrm{ROI}_G|}$. The success plot is generated as the ratio of successful frames at a threshold versus the predefined overlap score threshold ranging from 0 to 1.
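For reference, both per-frame metrics can be computed as in the sketch below, using [x, y, width, height] boxes; this is a generic illustration of the center location error and the Jaccard overlap, not the benchmark's own evaluation code.

```python
import numpy as np

def center_error(box_t, box_g):
    """Euclidean distance between the centers of tracked and ground-truth boxes [x, y, w, h]."""
    ct = np.array([box_t[0] + box_t[2] / 2.0, box_t[1] + box_t[3] / 2.0])
    cg = np.array([box_g[0] + box_g[2] / 2.0, box_g[1] + box_g[3] / 2.0])
    return float(np.linalg.norm(ct - cg))

def overlap_score(box_t, box_g):
    """Jaccard similarity |ROI_T ∩ ROI_G| / |ROI_T ∪ ROI_G| of two boxes [x, y, w, h]."""
    x1, y1 = max(box_t[0], box_g[0]), max(box_t[1], box_g[1])
    x2 = min(box_t[0] + box_t[2], box_g[0] + box_g[2])
    y2 = min(box_t[1] + box_t[3], box_g[1] + box_g[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = box_t[2] * box_t[3] + box_g[2] * box_g[3] - inter
    return inter / union if union > 0 else 0.0

def precision_at(center_errors, threshold=20.0):
    """Fraction of frames whose center location error is within the threshold (20 px default)."""
    return float(np.mean(np.asarray(center_errors) <= threshold))
```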

To quantitatively analyze our algorithm's performance and compare with other methods, we compute the average frame ratio for the center location error and the bounding box overlap score, using the 12 image sequences. The overall performance is demonstrated in Figure 2. The results show that our TRAC algorithm achieves state-of-the-art tracking performance, and significantly outperforms the previous 10 methods on all image sequences. To evaluate the robustness of the proposed tracker in different challenging conditions, we evaluate the performance according to the attributes provided by the image sequences, including occlusion, rotation, illumination variation, and background clutter. As illustrated by the results in Figure 3, our TRAC algorithm performs significantly better than previous methods, which validates the benefit of enforcing temporal consistency and adaptively updating target templates.

5 Conclusion

In this paper, we introduce a novel sparse tracking algorithm that is able to model the temporal consistency of the targets and adaptively update the templates based on their long-term-short-term representability. By introducing a novel structured norm as a temporal regularization, our TRAC algorithm can effectively enforce temporal consistency, thus alleviating the issue of tracking drift. The proposed template update strategy considers the long-term-short-term representability of the target templates and is capable of selecting an adaptive number of templates, which varies according to the degree of the tracking target's appearance variations. This strategy makes our approach highly robust to the target's appearance changes due to occlusion, deformation, and pose changes. Both abilities are achieved via structured sparsity-inducing norms, and tracking is performed using particle filters. To solve the formulated sparse tracking problem, we implement a new optimization solver that offers a theoretical guarantee to efficiently find the optimal solution. Extensive empirical studies have been conducted using the Visual Tracker Benchmark dataset. The qualitative and quantitative results have validated that the proposed TRAC approach obtains very promising visual tracking performance, and significantly outperforms previous state-of-the-art techniques. The proposed strategies not only address the visual tracking task, but can also benefit a wide range of problems involving smooth temporal sequence modeling in artificial intelligence.


References

[Adam et al., 2006] Amit Adam, Ehud Rivlin, and Ilan Shimshoni. Robust fragments-based tracking using the integral histogram. In IEEE Conference on Computer Vision and Pattern Recognition, 2006.

[Babenko et al., 2009] Boris Babenko, Ming-Hsuan Yang, and Serge Belongie. Visual tracking with online multiple instance learning. In IEEE Conference on Computer Vision and Pattern Recognition, 2009.

[Bao et al., 2012] Chenglong Bao, Yi Wu, Haibin Ling, and Hui Ji. Real time robust L1 tracker using accelerated proximal gradient approach. In IEEE Conference on Computer Vision and Pattern Recognition, 2012.

[Grabner et al., 2006] Helmut Grabner, Michael Grabner, and Horst Bischof. Real-time tracking via on-line boosting. In British Machine Vision Conference, 2006.

[Hare et al., 2011] Sam Hare, Amir Saffari, and Philip H. S. Torr. Struck: Structured output tracking with kernels. In IEEE International Conference on Computer Vision, 2011.

[Henriques et al., 2012] Joao F. Henriques, Rui Caseiro, Pedro Martins, and Jorge Batista. Exploiting the circulant structure of tracking-by-detection with kernels. In European Conference on Computer Vision, 2012.

[Hong et al., 2013] Zhibin Hong, Xue Mei, Danil Prokhorov, and Dacheng Tao. Tracking via robust multi-task multi-view joint sparse representation. In IEEE International Conference on Computer Vision, 2013.

[Jia et al., 2012] Xu Jia, Huchuan Lu, and Ming-Hsuan Yang. Visual tracking via adaptive structural local sparse appearance model. In IEEE Conference on Computer Vision and Pattern Recognition, 2012.

[Kalal et al., 2010] Zdenek Kalal, Jiri Matas, and Krystian Mikolajczyk. P-N learning: Bootstrapping binary classifiers by structural constraints. In IEEE Conference on Computer Vision and Pattern Recognition, 2010.

[Kwon and Lee, 2010] Junseok Kwon and Kyoung Mu Lee. Visual tracking decomposition. In IEEE Conference on Computer Vision and Pattern Recognition, 2010.

[Li et al., 2011] Hanxi Li, Chunhua Shen, and Qinfeng Shi. Real-time visual tracking using compressive sensing. In IEEE Conference on Computer Vision and Pattern Recognition, 2011.

[Liu et al., 2010] Baiyang Liu, Lin Yang, Junzhou Huang, Peter Meer, Leiguang Gong, and Casimir Kulikowski. Robust and fast collaborative tracking with two stage sparse optimization. In European Conference on Computer Vision, 2010.

[Liu et al., 2011] Baiyang Liu, Junzhou Huang, Lin Yang, and Casimir Kulikowski. Robust tracking using local sparse appearance model and k-selection. In IEEE Conference on Computer Vision and Pattern Recognition, 2011.

[Mei and Ling, 2011] Xue Mei and Haibin Ling. Robust visual tracking and vehicle classification via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(11):2259-2272, 2011.

[Mei et al., 2011] Xue Mei, Haibin Ling, Yi Wu, Erik Blasch, and Li Bai. Minimum error bounded efficient ℓ1 tracker with occlusion detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2011.

[Nebehay and Pflugfelder, 2015] Georg Nebehay and Roman Pflugfelder. Clustering of static-adaptive correspondences for deformable object tracking. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.

[Ross et al., 2008] David A. Ross, Jongwoo Lim, Ruei-Sung Lin, and Ming-Hsuan Yang. Incremental learning for robust visual tracking. International Journal of Computer Vision, 77(1-3):125-141, 2008.

[Salti et al., 2012] Samuele Salti, Andrea Cavallaro, and Luigi Di Stefano. Adaptive appearance modeling for video tracking: Survey and evaluation. IEEE Transactions on Image Processing, 21(10):4334-4348, 2012.

[Smeulders et al., 2014] Arnold W. M. Smeulders, Dung M. Chu, Rita Cucchiara, Simone Calderara, Afshin Dehghan, and Mubarak Shah. Visual tracking: An experimental survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1442-1468, 2014.

[Wu et al., 2013] Yi Wu, Jongwoo Lim, and Ming-Hsuan Yang. Online object tracking: A benchmark. In IEEE Conference on Computer Vision and Pattern Recognition, 2013.

[Yao et al., 2013] Rui Yao, Qinfeng Shi, Chunhua Shen, Yanning Zhang, and Anton van den Hengel. Part-based visual tracking with online latent structural learning. In IEEE Conference on Computer Vision and Pattern Recognition, 2013.

[Yilmaz et al., 2006] Alper Yilmaz, Omar Javed, and Mubarak Shah. Object tracking: A survey. ACM Computing Surveys, 38(4):13, 2006.

[Zhang et al., 2012a] Tianzhu Zhang, Bernard Ghanem, Si Liu, and Narendra Ahuja. Low-rank sparse learning for robust visual tracking. In European Conference on Computer Vision, pages 470-484. Springer, 2012.

[Zhang et al., 2012b] Tianzhu Zhang, Bernard Ghanem, Si Liu, and Narendra Ahuja. Robust visual tracking via multi-task sparse learning. In IEEE Conference on Computer Vision and Pattern Recognition, 2012.

[Zhang et al., 2013] Hao Zhang, Christopher Reardon, and Lynne E. Parker. Real-time multiple human perception with color-depth cameras on a mobile robot. IEEE Transactions on Cybernetics, 43(5):1429-1441, 2013.

[Zhang et al., 2015] Tianzhu Zhang, Si Liu, Changsheng Xu, Shuicheng Yan, Bernard Ghanem, Narendra Ahuja, and Ming-Hsuan Yang. Structural sparse tracking. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.

[Zhong et al., 2012] Wei Zhong, Huchuan Lu, and Ming-Hsuan Yang. Robust object tracking via sparsity-based collaborative model. In IEEE Conference on Computer Vision and Pattern Recognition, 2012.