Multiple Feature Fusion for Object Tracking
Yu Zhou, Cong Rao, Qin Lu, Xiang Bai*, and Wenyu Liu
Department of Electronics and Information Engineering, Huazhong University of Science and Technology, Wuhan, China
{zhouyu.hust,cong.rao,luqin.hust,xiang.bai}@gmail.com
Abstract. In this paper, we propose a novel object tracking method by fusing multiple features. The tracking task is formulated under a Bayesian inference framework, in which the posterior probability is resolved as a sum of weighted likelihood observations. A graph-based semi-supervised learning method is used for likelihood evaluation, and the distance between foreground and background histograms is used for weight estimation. We evaluate our tracking algorithm on popular benchmark videos and achieve competitive results compared with several state-of-the-art algorithms.
Keywords: Visual Tracking, Multiple Feature Fusion.
1 Introduction
Visual tracking is an important problem in computer vision with many practical applications. The challenges in designing a tracking system arise from intrinsic appearance changes (shape deformation) and extrinsic noise (occlusion, viewpoint variation, or background clutter). Many robust tracking algorithms have been proposed in the recent literature; they can generally be divided into two categories. The first category comprises matching-based tracking algorithms, e.g. [1]: an appearance model is extracted from the target, and a matching strategy is used to locate the target within a candidate set. The second category comprises classification-based tracking algorithms, such as [2], [3], which transform tracking into a binary foreground/background classification problem.
Hybrid discriminative strategies for object tracking have become well established in recent years. The well-known ensemble tracker [4] combines HOG and RGB features using the AdaBoost algorithm, and [5] fuses different kinds of likelihood maps in one unified framework. The product rule, sum rule, min rule, median rule, and majority vote rule are often used to fuse multiple strategies, among which the sum rule outperforms the others in most cases.
Semi-supervised learning [6] has also attracted much attention in the recent literature; both labeled and unlabeled data are used to predict the labels of the unlabeled data. In traditional supervised-learning-based tracking algorithms, only a few labeled data points (image pixels, image patches, etc.) can be obtained,
* Corresponding author.
Y. Zhang et al. (Eds.): IScIDE 2011, LNCS 7202, pp. 145–152, 2012. © Springer-Verlag Berlin Heidelberg 2012
[Figure: pipeline diagram. HOG, LBP, and Haar features are extracted from frame t and frame t+1; transductive learning yields a per-feature likelihood and weight, which are fused to predict the result in frame t+1.]

Fig. 1. Flow chart of the multiple feature fusion tracking algorithm
which limits the performance of classification. Thus, semi-supervised learning methods were introduced and developed, in which unlabeled data are utilized to improve classification accuracy. This learning paradigm naturally suits the object tracking task.
In this paper, we propose a visual tracking algorithm based on multiple feature fusion under a semi-supervised learning framework. The flow chart of our algorithm is shown in Fig. 1. Given a frame t in which the location of the object of interest is labeled and a new unlabeled frame (t + 1), the algorithm predicts the location of the object in the new frame under the transductive learning paradigm.
Compared with traditional methods, our work has the following characteristics and contributions:
(1) Object tracking is formulated under a Bayesian framework, and the posterior probability is decomposed into a combination of weighted likelihood confidences. Compared with previous methods, different local information (including HOG [7], LBP [8], and Haar-like features [9]) is used to locate the target jointly, which is more robust than methods that combine different classifiers.
(2) A graph-based semi-supervised learning method is used to evaluate the likelihood map, exploiting not only the appearance model in frame t but also the hidden manifold structure in frame (t + 1). A simple closed-form solution is given, which makes our tracking algorithm computationally efficient. Furthermore, a novel weight estimation strategy is proposed based on the distance between foreground and background quantification histograms, which is simple but effective.
The remainder of this paper is organized as follows: Section 2 introduces the proposed tracking method based on multiple feature fusion; Section 3 discusses the evaluation and performance of our algorithm on several conventional but still challenging videos; and Section 4 concludes.
2 Tracking with Multiple Feature Fusion
2.1 Problem Formulation
Consider a dynamic tracking system. Let F_t be the image frame at time t with a given target; our tracking goal is to infer the state variable S from the given F_{t+1}, where the state variable could refer to position drift, rotation, and scaling. In our case, we focus only on the object's position drift, and the objective of visual tracking (accurately predicting the target location S* in frame F_{t+1}) is to maximize the posterior probability
S*_{t+1} = argmax_{S ∈ Ω} P(S | F_{t+1})    (1)
where Ω is the candidate set of target locations at frame F_{t+1}. In practice, a hidden variable L (denoting the likelihood based on a particular image descriptor extraction strategy) is used to estimate the posterior probability. We define a likelihood map P(S|L) based on such an L and, under the Bayesian inference rule, transform the original objective into the following form:
P(S | F_{t+1}) = ∫ P(S | L_{t+1}) P(L_{t+1} | F_{t+1}) dL ≈ Σ_{i=1}^{M} ω_{(i,t+1)} P(S | L_{(i,t+1)})    (2)
where P(S|L_{t+1}) represents the observation model for a specific feature extraction strategy and is the crucial part of finding the ideal posterior distribution, P(L_{t+1}|F_{t+1}) = ω_{t+1} is the degree of belief in the validity of the likelihood map for a given feature extraction strategy, and M is the number of feature extraction strategies. This formulation makes the simple assumptions that S is independent of F and that all the likelihoods are independent of each other.
Our measurement of the observation comes from transductive learning models, and the weight is estimated by a simple foreground/background histogram distance. In the following, we detail the strategies for learning the likelihood map P(S|L_i) and the weight ω_i, respectively.
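To make the fusion of Eq. (2) concrete, here is a minimal sketch: each feature strategy contributes one likelihood value per candidate location, and the posterior is their weight-normalized sum, maximized over candidates as in Eq. (1). The array sizes and values below are hypothetical, not from the paper.

```python
import numpy as np

def fuse_likelihoods(likelihoods, weights):
    """Approximate P(S|F_{t+1}) as the weighted sum of per-feature
    likelihood maps (Eq. 2)."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()          # normalize so the weights sum to 1
    return sum(w * lm for w, lm in zip(weights, likelihoods))

def best_candidate(fused):
    """Pick S* = argmax over the candidate set (Eq. 1)."""
    return int(np.argmax(fused))

# Three hypothetical likelihood maps over 5 candidate locations
# (e.g. one each for HOG, LBP, Haar), plus hypothetical weights.
maps = [np.array([0.1, 0.6, 0.2, 0.05, 0.05]),
        np.array([0.2, 0.5, 0.2, 0.05, 0.05]),
        np.array([0.1, 0.4, 0.4, 0.05, 0.05])]
fused = fuse_likelihoods(maps, [0.5, 0.3, 0.2])
```

All three maps agree that candidate 1 is most likely, so the fused posterior also peaks there.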
2.2 Likelihood Map
First we introduce our adaptive appearance model. Given an observed image frame F_t at time t, the target object is located within a rectangle area W_t with center coordinate S*_t ∈ R². Our appearance model A_t then consists of a series of image patches:

A_t = {[I_(t,i), V_(t,i), S_(t,i)], i ∈ {1, ..., L}},  ||S_(t,i) − S*_(t−1)|| ≤ α    (3)

where I_(t,i) is an image patch in F_t, V_(t,i) ∈ R^d is the d-dimensional descriptor vector of the image patch I_(t,i), and S_(t,i) ∈ R² is the center coordinate of image patch I_(t,i).
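A minimal sketch of collecting the appearance model of Eq. (3): sample patches whose centers lie within α pixels of the previous target center. The patch size and sampling step are illustrative choices, not taken from the paper.

```python
import numpy as np

def sample_patches(frame, center, patch_size=8, radius=4, step=2):
    """Collect A_t of Eq. (3): image patches whose centers lie within
    `radius` (= alpha) pixels of the target center.
    `patch_size` and `step` are hypothetical parameters."""
    h, w = frame.shape[:2]
    half = patch_size // 2
    cy, cx = center
    patches = []
    for y in range(cy - radius, cy + radius + 1, step):
        for x in range(cx - radius, cx + radius + 1, step):
            if np.hypot(y - cy, x - cx) > radius:
                continue  # enforce ||S_(t,i) - S*_(t-1)|| <= alpha
            if half <= y <= h - half and half <= x <= w - half:
                patch = frame[y - half:y + half, x - half:x + half]
                patches.append((patch, (y, x)))
    return patches

frame = np.zeros((64, 64))
model = sample_patches(frame, center=(32, 32))
```

Each entry pairs a patch with its center coordinate; in the full method, a descriptor V_(t,i) (HOG, LBP, or Haar) would also be computed per patch.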
[Figure: a graph whose vertices are positive (labeled) and unlabeled patches; edges encode the similarity between patch features.]

Fig. 2. Graph model to learn the likelihood map
Unlike traditional motion models for visual tracking, which take affine transformation into account and use position drift and shape variance as constraints for locating the target, in this paper only a simple motion hypothesis M, as in [2], is used:
M(S_{t+1} | S*_t) = 1 if ||S_{t+1} − S*_t|| < s, and 0 otherwise    (4)
where s is the search radius. Under this simple motion hypothesis, we extract the unlabeled instances in image frame F_{t+1} as follows:
U_{t+1} = {[I_(t+1,j), V_(t+1,j), S_(t+1,j)], j ∈ {1, ..., U}},  ||S_(t+1,j) − S*_t|| < s    (5)
Note that S_{t+1} is the simplified form of S_(t+1,j) (j ∈ {1, ..., U}). Unlabeled image patches play a key role in our method.
With the appearance model and motion model in hand, we estimate the likelihood using graph-based semi-supervised learning [10]. Suppose there are L labeled image patches A_t and U unlabeled image patches U_{t+1}; then N = L + U is the total number of image patches. We define a fully connected graph G = <V, E> over the N image patches, where each vertex in V represents an image patch I_k (k ∈ {1, ..., N}) and each edge in E represents the similarity between a pair of image patches. An intuitive illustration of this graph is shown in Fig. 2, which depicts the relationship between the image patches of interest in frames F_t and F_{t+1}.
Let X ∈ R^{N×d} be the feature set and Y ∈ N^{N×C} the label matrix, with Y_{ij} = 1 if I_i (i ∈ {1, ..., N}) has label j and 0 otherwise. Let F ⊂ R^{N×C} denote the set of all matrices with nonnegative entries; a matrix F ∈ F labels I_i (i ∈ {1, ..., N}) with label y_i = argmax_{j≤C} F_{ij}. Learning the likelihood map is then transformed into the following optimization problem:

F* = argmin_{F ∈ F} Q(F)    (6)
Following the regularization framework, a quadratic energy is used to measure F:

Q(F) = ⟨F, LF⟩ + μ ||F − Y||²    (7)

where ⟨·, ·⟩ is the inner product operator and L is the normalized graph Laplacian, defined as L := I − S with S := D^{−1/2} W D^{−1/2}; here D is the diagonal matrix of vertex degrees and W is an affinity matrix. L is a symmetric, positive semi-definite matrix that induces a semi-norm on F penalizing changes between adjacent vertices [11]. The first term of Eqn. (7) is the smoothness constraint, which captures the local variation that gives L its character, and the second term is the fitting constraint. The trade-off between these two constraints is controlled by a constant parameter μ. To give a closed-form solution for Eqn. (7), we decompose it into the following form:
Q(F) = ⟨F, (I − D^{−1/2} W D^{−1/2}) F⟩ + μ ||F − Y||²
     = (1/2) Σ_{i,j=1}^{N} W_{ij} || F_i / √D_{ii} − F_j / √D_{jj} ||² + μ Σ_{i=1}^{N} ||F_i − Y_i||²    (8)
where W_{ij} is the affinity between patches, computed with a Gaussian kernel:

W_{ij} = exp( −d(V_i, V_j) / σ² )    (9)
where σ is the bandwidth hyperparameter and d(V_i, V_j) measures the distance between image patches I_i and I_j (each patch is encoded as a feature vector). In our case, the common L2 norm is used to calculate this distance.
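The affinity matrix of Eq. (9) and its symmetric normalization S = D^{−1/2} W D^{−1/2} can be sketched as follows. Zeroing the diagonal (no self-loops) is a common convention we assume here; the paper does not state it.

```python
import numpy as np

def affinity_matrix(X, sigma=1.0):
    """W of Eq. (9): W_ij = exp(-d(V_i, V_j) / sigma^2), with d the
    L2 distance between descriptor vectors, as in the paper."""
    diff = X[:, None, :] - X[None, :, :]
    d = np.linalg.norm(diff, axis=2)           # pairwise L2 distances
    W = np.exp(-d / sigma**2)
    np.fill_diagonal(W, 0.0)                   # no self-loops (our assumption)
    return W

def normalized_similarity(W):
    """S = D^{-1/2} W D^{-1/2}; the normalized Laplacian is then L = I - S."""
    deg = W.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    return W * inv_sqrt[:, None] * inv_sqrt[None, :]

# Three hypothetical 2-D descriptor vectors.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
W = affinity_matrix(X)
S = normalized_similarity(W)
```

In the tracker, each row of X would be the descriptor V of one labeled or unlabeled patch, and σ is learned adaptively as in [15].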
To find the optimal solution of Eqn. (7), under the decomposed form of Eqn. (8), we differentiate Q(F) with respect to F:

∂Q/∂F |_{F=F*} = F* − SF* + μ(F* − Y) = 0    (10)

Then the final F* is derived as:

F* = μ/(1+μ) · (I − S/(1+μ))^{−1} Y    (11)
where I is an identity matrix. Since I_i and S_i are in one-to-one correspondence according to Eqn. (3), we obtain the candidate likelihood confidences for a single feature extraction strategy:

P(S | L_{t+1}) = μ/(1+μ) · (I − S/(1+μ))^{−1} Y  if ||S_{t+1} − S*_t|| < s, and 0 otherwise    (12)
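A minimal sketch of the closed-form solution, written with S = D^{−1/2} W D^{−1/2} so that it matches the stationarity condition F* − SF* + μ(F* − Y) = 0 of Eq. (10). The four-patch toy graph below is hypothetical: two tightly linked pairs of patches, with one labeled patch per pair.

```python
import numpy as np

def propagate_labels(S, Y, mu=0.99):
    """F* = mu/(1+mu) * (I - S/(1+mu))^{-1} Y  (Eq. 11).
    Rows of Y are one-hot for labeled patches and all-zero for unlabeled."""
    n = S.shape[0]
    A = np.eye(n) - S / (1.0 + mu)
    return (mu / (1.0 + mu)) * np.linalg.solve(A, Y)

# Toy graph: patches 0-1 form one cluster, 2-3 another; only 0 and 2 labeled.
W = np.array([[0., 1., 0., 0.],
              [1., 0., 0., 0.],
              [0., 0., 0., 1.],
              [0., 0., 1., 0.]])
deg = W.sum(axis=1)
S = W / np.sqrt(np.outer(deg, deg))      # D^{-1/2} W D^{-1/2}
Y = np.array([[1., 0.],                  # patch 0: foreground
              [0., 0.],                  # patch 1: unlabeled
              [0., 1.],                  # patch 2: background
              [0., 0.]])                 # patch 3: unlabeled
F = propagate_labels(S, Y)
```

Labels flow across edges: the unlabeled patch in each pair inherits the class of its labeled neighbor, which is exactly how candidate patches in F_{t+1} receive foreground confidences from the labeled patches in F_t.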
Algorithm 1: MFF Tracking Algorithm

Input: video frames F_t, t ∈ {1, ..., T}, and the initial location S*_1
Output: S*_t, t ∈ {2, ..., T}

1. Initialize the appearance model A_1 = {[I_(1,i), V_(1,i), S_(1,i)], i ∈ {1, ..., L}};
2. for each video frame F_t, t ∈ {1, ..., T − 1} do
3.   Build the candidate location set U_{t+1} = {[I_(t+1,j), V_(t+1,j), S_(t+1,j)], j ∈ {1, ..., U}}, ||S_(t+1,j) − S*_t|| < s;
4.   for each feature extraction strategy e do
5.     Evaluate the likelihood confidences: P(S | L_(e,t+1)) = μ/(1+μ) · (I − S/(1+μ))^{−1} Y if ||S_{t+1} − S*_t|| < s, and 0 otherwise;
6.     Evaluate the weight: ω_e = D_(e,(O,B)) / Σ_{e=1}^{E} D_(e,(O,B));
7.   end
8.   Update the target location: S*_{t+1} = argmax_{S ∈ Ω} Σ_{i=1}^{M} ω_(i,t+1) P(S | L_(i,t+1));
9.   Update the appearance model: A_{t+1} = {[I_(t+1,i), V_(t+1,i), S_(t+1,i)], i ∈ {1, ..., L}}, ||S_(t+1,i) − S*_{t+1}|| ≤ α;
10. end
2.3 Weight Estimation
Suppose we have the likelihood confidences of frame F_t. The degree of belief in the validity of a likelihood map is measured by the distance between foreground and background quantification histograms. The foreground quantification histogram is bounded by the maximum and minimum of the posterior probability within the area W, and this range is quantized into K bins. We count the posterior probability values falling into each bin and denote the resulting histogram by Λ_O; analogously, we obtain the background histogram Λ_B. Given these foreground and background histograms, it is natural to use the χ² statistic to measure a feature extraction strategy's ability to distinguish foreground from background:
D_(O,B) = (1/2) Σ_{k=1}^{K} [Λ_O(k) − Λ_B(k)]² / [Λ_O(k) + Λ_B(k)]    (13)
where D_(O,B) is the distance between the foreground and background quantification histograms. This distance is powerful enough to measure the confidence of a given likelihood map. Based on it, we calculate the weight for each feature extraction strategy as:
ω_e = D_(e,(O,B)) / Σ_{e=1}^{E} D_(e,(O,B))    (14)
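Eqs. (13)-(14) amount to a χ² distance between per-feature foreground/background histograms, normalized into weights. A sketch follows; the bin counts are hypothetical, and skipping empty bins is our assumption (the paper does not address zero-count bins).

```python
import numpy as np

def chi2_distance(hist_fg, hist_bg):
    """D_(O,B) of Eq. (13): chi-square statistic between the foreground
    and background quantification histograms."""
    num = (hist_fg - hist_bg) ** 2
    den = hist_fg + hist_bg
    nonzero = den > 0                    # skip empty bins (our assumption)
    return 0.5 * np.sum(num[nonzero] / den[nonzero])

def feature_weights(fg_hists, bg_hists):
    """omega_e of Eq. (14): per-feature distances normalized to sum to one."""
    d = np.array([chi2_distance(f, b) for f, b in zip(fg_hists, bg_hists)])
    return d / d.sum()

# Hypothetical K = 2-bin histograms for two feature strategies: the first
# separates foreground from background sharply, the second only mildly.
fg = [np.array([4., 0.]), np.array([3., 1.])]
bg = [np.array([0., 4.]), np.array([1., 3.])]
w = feature_weights(fg, bg)
```

The feature whose likelihood map separates foreground from background more sharply receives the larger weight, which is the intended behavior of Eq. (14).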
The whole Multiple Feature Fusion (MFF) tracking algorithm is shown in Alg.1.
[Figure: four plots of Center Location Error (pixel) versus Frame #, panels (A) David Indoor, (B) Tiger2, (C) Sylvester, (D) Coke Can; curves: SemiBoost, Frag, IVT, OAB, Our Method.]
Fig. 3. Center Location Error (CLE) versus frame sequence number
3 Experimental Results
We test our Multiple Feature Fusion (MFF) tracking algorithm on four challenging video sequences¹: Sylvester, David Indoor, Coke Can, and Tiger2, and compare it with four well-known tracking algorithms: FragTracker [3], IVT [12], Online AdaBoost (OAB) [13], and the SemiBoost tracker [14]. For the parameters, we set α = 4 pixels, s = 15 pixels, μ = 0.99, and learn σ adaptively as in [15]. Note that all the parameters in our algorithm were fixed across all experiments.
Table 1. Average Center Location Error (ACLE, measured in pixels). Red indicates best performance, blue second best.

Video         OAB[13]  IVT[12]  SemiBoost[14]  Frag[3]  MFF(Ours)
Coke Can       25.08    37.31      40.56        69.11     16.34
Tiger2         13.25    98.54      39.33        38.68      6.19
Sylvester      35.08    96.19      21.08        23.04     11.31
David Indoor   51.09     8.10      44.46        70.55     20.19
Sylvester and David Indoor are challenging because of large lighting, scale, and pose variations, while Tiger2 and Coke Can are challenging because of occlusion and fast motion. Our evaluation methodology includes a quantitative comparison of CLE versus frame number and the mean error over all video frames (ACLE). As shown in Fig. 3 and Tab. 1, we obtain the best performance in the majority of cases, and our CLE curves are smoother than those of the other methods (indicating that our tracking algorithm is more robust and stable). Meanwhile, it should be noted that our method does not perform as well as IVT on David Indoor. This is mainly due to the PCA method used in IVT, which has proved very effective for face identification.
¹ All four video sequences and their ground truth are available at http://vision.ucsd.edu/~bbabenko/project_miltrack.shtml
4 Conclusion
In this paper, a novel tracking algorithm based on Multiple Feature Fusion is proposed. The object tracking task is formulated under a Bayesian inference framework. Instead of embedding heterogeneous features into one vector, we calculate likelihood confidences for each feature using a graph-based semi-supervised learning method and combine them with weight coefficients, estimated from the distance between foreground and background histograms, for posterior probability estimation. We test our tracking algorithm on several video benchmarks and achieve competitive results.
Acknowledgments. The authors would like to thank the anonymous referees who gave us many helpful comments and suggestions. This work was supported by the Fundamental Research Funds for the Central Universities (HUST 2011TS110) and the National Natural Science Foundation of China (#60903096 and #60873127).
References
1. Jiang, N., Liu, W.-Y., Wu, Y.: Learning Adaptive Metric for Robust Visual Tracking. TIP 20(8), 2288–2300 (2011)
2. Babenko, B., Yang, M.-H., Belongie, S.: Robust Object Tracking with Online Multiple Instance Learning. TPAMI 33(8), 1619–1632 (2011)
3. Adam, A., Rivlin, E., Shimshoni, I.: Robust Fragment-Based Tracking Using the Integral Histogram. In: CVPR (2006)
4. Avidan, S.: Ensemble Tracking. TPAMI 29(2), 261–271 (2007)
5. Yin, Z., Porikli, F., Collins, R.: Likelihood Map Fusion for Visual Object Tracking. In: WACV (2008)
6. Zhu, X.: Semi-Supervised Learning Literature Survey. Technical Report, Department of Computer Sciences, University of Wisconsin, Madison (2005)
7. Dalal, N., Triggs, B.: Histograms of Oriented Gradients for Human Detection. In: CVPR (2005)
8. Ojala, T., Pietikainen, M., Maenpaa, T.: Multiresolution Gray-Scale and Rotation Invariant Texture Classification with Local Binary Patterns. TPAMI 24(7), 971–987 (2002)
9. Viola, P., Jones, M.: Rapid Object Detection Using a Boosted Cascade of Simple Features. In: CVPR, pp. 511–518 (2001)
10. Zhou, D., Bousquet, O., Lal, T.N., Weston, J., Scholkopf, B.: Learning with Local and Global Consistency. In: NIPS (2004)
11. Smola, A., Kondor, R.: Kernels and Regularization on Graphs. In: COLT (2003)
12. Ross, D., Kim, J., Lin, R.-S., Yang, M.-H.: Incremental Learning for Robust Visual Tracking. IJCV 77(1), 125–141 (2008)
13. Grabner, H., Grabner, M., Bischof, H.: Real-Time Tracking via On-Line Boosting. In: BMVC, pp. 47–56 (2006)
14. Grabner, H., Leistner, C., Bischof, H.: Semi-supervised On-Line Boosting for Robust Tracking. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 234–247. Springer, Heidelberg (2008)
15. Zelnik-Manor, L., Perona, P.: Self-Tuning Spectral Clustering. In: NIPS (2004)