
Background Subtraction via Coherent Trajectory Decomposition

Zhixiang Ren, Liang-Tien Chia and Deepu Rajan

School of Computer Engineering, Nanyang Technological University, Singapore

{renz0002,asltchia,asdrajan}@ntu.edu.sg

Shenghua Gao, Advanced Digital Sciences Center, Singapore

[email protected]

ABSTRACT

Background subtraction, the task of detecting moving objects in a scene, is an important step in video analysis. In this paper, we propose an efficient background subtraction method based on coherent trajectory decomposition. We assume that the trajectories from the background lie in a low-rank subspace, and that foreground trajectories are sparse outliers in this background subspace. Meanwhile, a Markov Random Field (MRF) is used to encode spatial coherency and trajectory consistency. With the low-rank decomposition and the MRF, our method can better handle videos with a moving camera and obtain coherent foreground. Experimental results on a video dataset show our method achieves very competitive performance.

Categories and Subject Descriptors

I.4.8 [Image Processing and Computer Vision]: Scene Analysis

Keywords

background subtraction, trajectory, low-rank, sparse

1. INTRODUCTION

Background subtraction, the task of detecting moving objects in a scene, plays an important role in many video analysis tasks such as video surveillance and vehicle/robot navigation. In the literature, many algorithms have been proposed for background subtraction. Among them, a typical and intuitive category [4][16] first builds a background model from a manually labeled frame and then compares the incoming frames against it. The areas deviating from the estimated background model are detected as the moving objects/foreground. However, these models have two drawbacks which limit their applicability: (1) They are semi-supervised and require a training phase to learn the initial background model. (2) Their performance suffers under large camera motion.


For videos taken by a moving camera, the background itself undergoes severe motion, which makes it hard to fit the background model estimated from previous frames. In general, it is infeasible for these methods to automatically deal with videos with large camera motion.

Recently, the effectiveness of Robust Principal Component Analysis (RPCA) has inspired several automated background subtraction methods [13]. Most of these methods assume the vectorized image frames can be decomposed into a low-rank matrix representing the background and a sparse matrix indicating the foreground objects. To enforce the spatial connectivity of the foreground pixels, a block-sparse constraint is considered in the decomposition process [5][6]. These methods achieve promising results on videos with a static camera and illumination variation. However, they still suffer under camera motion, which makes the background no longer low-rank. To alleviate this problem, [15] and [9] use 2D parametric transforms to align the vectorized video frames so that the background remains low-rank. However, for videos with large camera motion, the background changes significantly, so these methods may fail to align the vectorized frames. As stated in [15], it performs poorly on videos with large camera motion, such as the "cars1", "cars9" and "cars10" sequences in Fig. 3(c). In addition, it is not easy for these methods to encode long-term temporal consistency, due to the lack of correspondence between nonadjacent frames and the extremely high computational complexity. For this reason, Cui et al. [3] propose to use a trajectory matrix as input and decompose it into a low-rank matrix (background) and a group-sparse matrix (foreground). Although [3] preserves temporal consistency by enforcing the pixels belonging to the same trajectory to have similar labels, it cannot model spatial connectivity, which is another important attribute of foreground objects.

Motion segmentation [1][14][12] is another feasible solution for background subtraction under large camera motion. Since the foreground moves differently from the background, we can separate them according to their motion patterns. These methods usually first derive an affinity matrix by measuring the similarity between trajectories, and then use a clustering technique (e.g., spectral clustering) to obtain the segmentation. However, most of these methods are sensitive to non-Gaussian noise.

In this paper, we present a novel algorithm to automatically detect moving objects with spatial and temporal consistency via motion decomposition. Leveraging large displacement optical flow [2], we use trajectories as the input and decompose them into background and foreground.


We assume that the trajectories from the background lie in a low-rank subspace, and that foreground trajectories are sparse outliers in this background subspace. Meanwhile, the trajectories belonging to the same moving object should be similar and spatially close. Therefore, we use a Markov Random Field (MRF) to simultaneously model the sparsity and coherency of the foreground trajectories.

Compared to previous work, our model offers the following advantages: 1) Due to the trajectory representation, the proposed method can not only handle videos with large camera motion, but also preserve temporal consistency by considering each trajectory as a whole. 2) We incorporate spatial connectivity and trajectory similarity into the framework, so our method can detect moving objects with spatial coherence.

2. TRAJECTORY DECOMPOSITION

2.1 Problem Formulation

In our framework, we first use the dense point tracker [2] to get the trajectories of P tracked points through F image frames. The trajectory of the ith point can be represented by a vector M_i = [x_{1i}, y_{1i}, x_{2i}, y_{2i}, . . . , x_{Fi}, y_{Fi}]^T ∈ R^{2F×1}, where x and y denote the 2D coordinates in each frame. We define a 2F × P matrix M = [M_1, . . . , M_P] to represent the P collected trajectories. Note that some trajectories obtained by the dense point tracker [2] may be incomplete; we complete the missing entries of such trajectories with the sparse reconstruction method proposed in [8]. Given the complete motion trajectory matrix M, we aim to estimate the background trajectory matrix B ∈ R^{2F×P} as well as the foreground support S ∈ R^{P×1}:

S_i = 1, if the ith trajectory is foreground,
S_i = 0, if the ith trajectory is background.   (1)
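To make the data layout concrete, here is a minimal sketch of how M and S could be assembled; the variable names (tracks, F, P) and the random stand-in data are ours, not the authors':

import numpy as np

# Sketch: assemble the 2F x P trajectory matrix M from tracker output.
# `tracks` is assumed to be a list of P arrays of shape (F, 2), each
# holding the (x, y) position of one tracked point in every frame; the
# random data below stands in for real tracker output.
F, P = 30, 500
tracks = [np.random.rand(F, 2) for _ in range(P)]

# Column i is [x_{1i}, y_{1i}, x_{2i}, y_{2i}, ..., x_{Fi}, y_{Fi}]^T.
M = np.stack([t.reshape(-1) for t in tracks], axis=1)   # shape (2F, P)

# Foreground support: S[i] = 1 if trajectory i is foreground, else 0.
S = np.zeros(P, dtype=int)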

Background Constraint As indicated in [8][11], the trajectories of a background undergoing a single rigid motion lie in a low-rank subspace. Specifically, under the assumption of orthographic projection, this subspace is spanned by three basis trajectories [11], i.e., the rank is 3. Therefore, some methods [10][3] impose a specific rank constraint on the background trajectory matrix B. Following them, we require the rank of the background trajectory matrix B to be:

rank(B) = 3. (2)

The experiments show that this constraint achieves satisfactory performance for most sequences.

Foreground Constraint The foreground is usually assumed to occupy a small portion of the scene, so sparsity [13] or group sparsity [3][5] is commonly used to constrain the foreground matrix. In our model, we also require the foreground support to be sparse. Moreover, the foreground trajectories belonging to the same object should occupy a contiguous spatial region. Thus, we model the binary states of the entries in the foreground support S with an MRF framework [7]. Our task is to infer a binary vector S = [S_1, . . . , S_P]^T such that S_i = 1 or 0 indicates whether the ith trajectory belongs to the foreground or not. We state the problem as the minimization of the following energy function:

E(S) = ∑_i S_i + γ ∑_{i,j} W_{ij} (S_i − S_j)^2,   (3)

where the first term (unary term) penalizes a high proportion of foreground trajectories, and the second term (smoothness term) encourages similar and proximate trajectories to have similar labels. γ is the parameter controlling the strength of mutual interaction between trajectories. The affinity matrix W is defined as

W_{ij} = exp(−d^2(i, j)),   (4)

where d^2(i, j) is the distance between the ith and jth trajectories. We use the pairwise distance definition in [1], which measures the difference between trajectories as the maximum difference of their motion over time:

d^2(i, j) = max_t d_t^2(i, j)   (5)

where the distance between two trajectories at a particularinstant t is calculated as:

d_t^2(i, j) = d_{sp}(i, j) · [(u_t^i − u_t^j)^2 + (v_t^i − v_t^j)^2] / (5σ_t^2).   (6)

d_{sp}(i, j) is the average spatial Euclidean distance between the ith and jth trajectories over 5 frames. u_t and v_t are the optical flows aggregated over 5 frames, and σ_t denotes the normalization parameter. Note that this definition not only preserves the motion consistency between two trajectories over time, but also maintains spatial coherency by taking the spatial distance of the two trajectories into account.
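As an illustration of Eqs. (4)-(6), the following is a minimal numpy sketch of the affinity computation. It assumes the trajectory positions are stored as (F × P) arrays X and Y, approximates the aggregated flow by the displacement across each 5-frame window, uses a single constant σ_t instead of a per-instant normalization, and also forms the graph Laplacian L = D − W used later in Eq. (8). All names are ours:

import numpy as np

def trajectory_affinity(X, Y, sigma_t=1.0, step=5):
    """Pairwise trajectory affinities following Eqs. (4)-(6).

    X, Y : (F, P) arrays with the x/y positions of P trajectories
    over F frames.
    """
    F, P = X.shape
    d2 = np.zeros((P, P))
    for t in range(0, F - step, step):
        u = X[t + step] - X[t]                     # aggregated flow u_t, shape (P,)
        v = Y[t + step] - Y[t]                     # aggregated flow v_t
        motion = ((u[:, None] - u[None, :]) ** 2 +
                  (v[:, None] - v[None, :]) ** 2)
        # Average spatial distance d_sp(i, j) over the 5-frame window.
        dsp = np.zeros((P, P))
        for f in range(t, t + step):
            dsp += np.sqrt((X[f][:, None] - X[f][None, :]) ** 2 +
                           (Y[f][:, None] - Y[f][None, :]) ** 2)
        dsp /= step
        # Eq. (6) at instant t; Eq. (5) keeps the maximum over time.
        d2 = np.maximum(d2, dsp * motion / (step * sigma_t ** 2))
    W = np.exp(-d2)                                # Eq. (4)
    np.fill_diagonal(W, 0.0)                       # our choice: no self-affinity
    L = np.diag(W.sum(axis=1)) - W                 # Laplacian used in Eq. (8)
    return W, L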

Objective Formulation We combine the above two constraints and obtain the following objective function:

min_{B,S} (1/2) ∑_{S_i=0} ‖M_i − B_i‖_F^2 + β ∑_i S_i + γ ∑_{i,j} W_{ij} (S_i − S_j)^2
s.t. rank(B) = 3, S_i ∈ {0, 1}   (7)

where B_i is the ith column of the background matrix B. The weight factors β and γ are trade-off parameters. The first term in Eq. (7) requires the estimated background trajectories to stay close to the corresponding original trajectories. The second and third terms come from the foreground energy function: the former expresses the sparsity of the foreground and the latter preserves motion consistency and spatial coherency.

By rewriting Eq. (7), we obtain the matrix form:

min_{B,S} (1/2) ‖(M − B)(I − diag(S))‖_F^2 + β ‖S‖_1 + γ S^T L S
s.t. rank(B) = 3, S_i ∈ {0, 1}   (8)

where diag(S) denotes a P × P diagonal matrix with S as the diagonal elements and I is the identity matrix. L is the Laplacian matrix defined as L = D − W, where D is a diagonal matrix with D_{ii} = ∑_j W_{ij}. Note that S is a binary vector, so ‖S‖_0 = ‖S‖_1.
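As a quick sanity check on this rewriting, the following sketch (sizes and names are ours) verifies numerically that the matrix form matches the summation forms. Note that the double sum ∑_{i,j} W_{ij}(S_i − S_j)^2 counts each pair twice and therefore equals 2·S^T L S for symmetric W, the constant factor being absorbed into γ:

import numpy as np

# Sketch: check the matrix form (8) against the summation forms (3) and (7).
rng = np.random.default_rng(0)
F2, P = 60, 40                                  # 2F = 60 rows, P = 40 trajectories
M, B = rng.normal(size=(F2, P)), rng.normal(size=(F2, P))
S = rng.integers(0, 2, size=P)
W = rng.random((P, P)); W = (W + W.T) / 2; np.fill_diagonal(W, 0)
L = np.diag(W.sum(axis=1)) - W

fit_7 = 0.5 * sum(np.sum((M[:, i] - B[:, i]) ** 2) for i in range(P) if S[i] == 0)
fit_8 = 0.5 * np.linalg.norm((M - B) @ (np.eye(P) - np.diag(S)), 'fro') ** 2
smooth_3 = np.sum(W * (S[:, None] - S[None, :]) ** 2)   # double sum of Eq. (3)
smooth_8 = S @ L @ S                                    # S^T L S of Eq. (8)
assert np.isclose(fit_7, fit_8) and np.isclose(smooth_3, 2 * smooth_8)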

Figure 1: Intermediate results of our iteration process on "cars3"; panels (a)-(c) show the 4th, 5th and 6th iterations, and (d) the final result. Green/red points denote foreground/background. Please enlarge for a better view.


2.2 Objective Optimization

The objective function defined in Eq. (8) is non-convex because of the low-rank constraint. Moreover, it contains both continuous and discrete variables, which makes simultaneous optimization extremely difficult. We therefore adopt an alternating algorithm to efficiently optimize B and S. The optimization is divided into two steps, a B-step and an S-step, each solved with the other variable fixed. We initialize B to M and S to a zero vector.

B-step Given the estimated S, the objective function with respect to B can be rewritten as:

min_B ‖M(I − diag(S)) + B diag(S) − B‖_F^2
s.t. rank(B) = 3   (9)

The optimal B can be derived iteratively via the Singular Value Decomposition (SVD). Specifically, we iteratively update B by applying the SVD to M(I − diag(S)) + B_k diag(S), and then use the three left/right singular vectors with the largest singular values to reconstruct B_{k+1}.
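A minimal sketch of this B-step, assuming numpy and the M, B, S layout introduced in Section 2.1 (the function name and iteration count are ours):

import numpy as np

def b_step(M, B, S, n_iter=10):
    """Rank-3 background update (B-step) for Eq. (9).

    M : (2F, P) trajectory matrix; B : current background estimate;
    S : (P,) binary foreground support.
    """
    for _ in range(n_iter):
        # Background columns come from M, foreground columns from the
        # current estimate B, i.e. M(I - diag(S)) + B diag(S).
        A = np.where(S[None, :] == 1, B, M)
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        # Project onto the closest rank-3 matrix (rank(B) = 3).
        B = U[:, :3] @ np.diag(s[:3]) @ Vt[:3, :]
    return B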

S-step Given the estimated B, the objective function (Eq. (8)) with respect to S can be rewritten as follows:

min_S ∑_i (β − (1/2) ‖M_i − B_i‖_F^2) S_i + γ S^T L S + C
s.t. S_i ∈ {0, 1},   (10)

where C = (1/2) ∑_i ‖M_i − B_i‖_F^2 is a constant. The above energy function has the standard form of [7] and can be solved using graph cuts [7].
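The paper solves Eq. (10) exactly with graph cuts [7]. For illustration only, the sketch below substitutes a greedy iterated-conditional-modes (ICM) pass, which decreases the same energy but is not guaranteed to reach the graph-cut optimum; it assumes a symmetric W with zero diagonal, and all names are ours:

import numpy as np

def s_step_icm(M, B, W, beta, gamma, n_sweeps=10):
    """Greedy ICM stand-in for the graph-cut S-step of Eq. (10).

    The factor 2 on the pairwise term reflects that each pair (i, j)
    appears twice in the double sum of Eq. (3) for symmetric W.
    """
    P = M.shape[1]
    # Unary coefficient of S_i in Eq. (10): beta - 0.5 * ||M_i - B_i||^2.
    unary = beta - 0.5 * np.sum((M - B) ** 2, axis=0)
    S = (unary < 0).astype(int)        # start from the unary-only minimizer
    for _ in range(n_sweeps):
        changed = False
        for i in range(P):
            cost0 = 2 * gamma * np.sum(W[i] * S ** 2)                   # S_i = 0
            cost1 = unary[i] + 2 * gamma * np.sum(W[i] * (1 - S) ** 2)  # S_i = 1
            new_label = int(cost1 < cost0)
            if new_label != S[i]:
                S[i] = new_label
                changed = True
        if not changed:
            break
    return S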

We iteratively apply these two steps until the algorithm converges to a local minimum¹. Fig. 1 shows some intermediate results of the iteration process on the video "cars3". We can see that foreground regions are gradually detected over the iterations. After we obtain the labels of the tracked points, it is straightforward to derive a pixel-level segmentation based on color or edge information.

Figure 2: Quantitative comparison (Precision, Recall and F-measure, on a 0-1 scale, for LSA, RANSAC, BROX, LRGS and OURS).

3. EXPERIMENTS

We quantitatively evaluate our background subtraction model on the Berkeley motion segmentation dataset², which provides pixel-accurate annotation of moving objects. This dataset contains 26 videos, of which 12 sequences (10 car and 2 people sequences) are taken from the Hopkins 155 dataset [12]. Since these videos are annotated for motion segmentation rather than background subtraction, we follow [3] and treat all moving objects as foreground and the rest as background.

¹ The convergence of the B-step and S-step has been proved for the SVD and in [7] respectively, so the algorithm must converge to a local minimum.
² http://lmb.informatik.uni-freiburg.de/resources/datasets/

We evaluate the results of the different models using Precision and Recall, defined as:

Precision = TP / (TP + FP),  Recall = TP / (TP + FN)   (11)

where TP is the number of trajectories with support value 1 falling into the foreground regions, FP is the number of trajectories with support value 1 but falling into the background regions, and FN is the number of trajectories with support value 0 but falling into the foreground regions. We also define the F-measure as:

F-measure = (2 × Precision × Recall) / (Precision + Recall).   (12)
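A direct transcription of Eqs. (11)-(12) on trajectory labels might look as follows (names are ours; is_foreground stands for the ground-truth region test):

import numpy as np

def precision_recall_f(S_pred, is_foreground):
    """Compute Eqs. (11)-(12) from trajectory labels.

    S_pred        : (P,) predicted support values (1 = foreground).
    is_foreground : (P,) ground-truth foreground mask per trajectory.
    """
    S_pred = np.asarray(S_pred)
    gt = np.asarray(is_foreground, dtype=bool)
    tp = np.sum((S_pred == 1) & gt)
    fp = np.sum((S_pred == 1) & ~gt)
    fn = np.sum((S_pred == 0) & gt)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure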

We compare the proposed method with four state-of-the-art algorithms: (1) Local Subspace Affinity (LSA) [14], (2) RANSAC [12], (3) long-term analysis of point trajectories (BROX) [1]³, and (4) background subtraction using Low-Rank and Group Sparsity constraints (LRGS) [3]. LSA and RANSAC are downloaded from their website⁴; we implement the LRGS method ourselves. Since LSA and RANSAC require the number of segments in advance, we provide them with the correct number. We generate the trajectories using the dense point tracker [2] with the sampling parameter set to 8. Except for the BROX method, all other methods require the trajectories to have full length, so we use a sparse reconstruction strategy [8] to complete the incomplete trajectories. However, this strategy fails on some sequences due to insufficient complete trajectories. For this reason, we only conduct experiments on the 12 sequences from the Hopkins 155 dataset. For the LSA, RANSAC and BROX methods, following [3], we take the segment with the largest number of trajectories as the background and the rest as the foreground. Please note that the DECOLOR method proposed in [15] is based on pixel labeling, not trajectory labeling, so we do not compare our method against it quantitatively; we only show some of its visual results in the third column of Fig. 3.

For the proposed method, there exist two parameters that affect performance: β and γ. The parameter β controls the sparsity of the foreground support. As seen from Eq. (10), if (1/2)‖M_i − B_i‖_F^2 > β, then S_i is more likely to be 1. Hence the choice of β depends on the proportion of foreground, which is unknown in advance. We therefore start our algorithm with a relatively large β = max_i ‖M_i − B_i^0‖_F^2, where B_i^0 is the estimate of B_i in the first iteration. We then reduce β by a factor of 0.5 after each iteration until it reaches the lower bound (set to 0.001). In this conservative manner, the foreground is gradually detected as β decreases (see Fig. 1). In our experiments, γ is empirically set to 2β. To be fair, we tune the parameters of all compared methods to obtain their highest F-measure.
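Putting the pieces together, here is a sketch of the full alternating optimization with this β schedule, reusing the hypothetical b_step and s_step_icm helpers from the earlier sketches (all names are ours):

import numpy as np

def decompose(M, W, gamma_scale=2.0, beta_min=1e-3):
    """Alternating optimization with the annealed beta schedule."""
    P = M.shape[1]
    S = np.zeros(P, dtype=int)                     # S initialized to zero
    B = b_step(M, M.copy(), S)                     # first rank-3 estimate B^0
    beta = np.max(np.sum((M - B) ** 2, axis=0))    # beta = max_i ||M_i - B_i^0||^2
    while beta >= beta_min:
        S = s_step_icm(M, B, W, beta, gamma=gamma_scale * beta)  # gamma = 2*beta
        B = b_step(M, B, S)
        beta *= 0.5                                # halve beta each iteration
    return B, S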

³ http://lmb.informatik.uni-freiburg.de/resources/binaries/
⁴ http://www.vision.jhu.edu/code/

Figure 3: Results of the compared methods: (a) Frame, (b) GT, (c) DECOLOR [15], (d) LSA [14], (e) RANSAC [12], (f) BROX [1], (g) LRGS [3], (h) OURS. Green/red points denote foreground/background. From top to bottom, the sequences are cars1, cars9, cars10 and people2. Please enlarge for a better view.

Fig. 2 shows the Precision, Recall and F-measure of all compared methods. It can be seen that most methods achieve a high Recall but a relatively low Precision. In contrast, the proposed method achieves much better Precision and comparable Recall. Overall, our method outperforms all the other methods in terms of F-measure.

In Fig. 3, we show some results of the compared methods. All these sequences are captured by a moving camera; in "cars9" and "cars10" in particular, the background is a 3D scene with large depth and the camera moves significantly. Due to the large camera motion in "cars1" and "cars10", [15] works poorly on these sequences (Fig. 3(c)). For the difficult sequence "cars9", [15] even fails to detect the moving objects. As seen, with the help of the trajectory representation, the proposed method can segment foreground from background in these sequences. Therefore, instead of raw pixels, it is desirable to use trajectories for background subtraction on sequences with large camera motion. Both our method and LRGS [3] use a specific rank constraint to model the background. However, since LRGS only requires the points belonging to the same trajectory to have similar labels, it still misclassifies some trajectories; for example, the small car is partially missed in "cars1". Thanks to the spatial coherency and trajectory consistency, the proposed method significantly outperforms all the compared methods on the two difficult sequences "cars9" and "cars10". LSA [14] and RANSAC [12] detect many background points as foreground. BROX [1] works better than LSA and RANSAC on "cars10", but still worse than our method. We show some video results of our method in the supplementary materials.

4. CONCLUSION

In this paper, we propose an efficient approach to automatically perform background subtraction based on a trajectory decomposition framework. We leverage the fact that trajectories from a single rigid motion lie in a low-rank subspace to estimate the background motion and detect outliers as foreground. We incorporate spatial connectivity and trajectory similarity into the decomposition process using an MRF model. In this manner, the proposed method can detect moving objects with spatial and temporal consistency. Extensive experiments illustrate the effectiveness of our method.

5. REFERENCES

[1] T. Brox and J. Malik. Object segmentation by long term analysis of point trajectories. In ECCV, 2010.
[2] T. Brox and J. Malik. Large displacement optical flow: Descriptor matching in variational motion estimation. PAMI, 33(3), 2011.
[3] X. Cui, J. Huang, S. Zhang, and D. Metaxas. Background subtraction using group sparsity and low rank constraint. In ECCV, 2012.
[4] A. M. Elgammal, D. Harwood, and L. S. Davis. Non-parametric model for background subtraction. In ECCV, 2000.
[5] Z. Gao, L.-F. Cheong, and M. Shan. Block-sparse RPCA for consistent foreground detection. In ECCV, 2012.
[6] C. Guyon, T. Bouwmans, and E.-H. Zahzah. Foreground detection based on low-rank and block-sparse matrix decomposition. In ICIP, 2012.
[7] V. Kolmogorov and R. Zabih. What energy functions can be minimized via graph cuts? PAMI, 26(2), 2004.
[8] S. Rao, R. Tron, R. Vidal, and Y. Ma. Motion segmentation in the presence of outlying, incomplete, or corrupted trajectories. PAMI, 32(10), 2010.
[9] Z. Ren, L.-T. Chia, and D. Rajan. Video saliency detection with robust temporal alignment and local-global spatial contrast. In ICMR, 2012.
[10] Y. Sheikh, O. Javed, and T. Kanade. Background subtraction for freely moving cameras. In ICCV, 2009.
[11] C. Tomasi and T. Kanade. Shape and motion from image streams under orthography: a factorization method. IJCV, 9(2), 1992.
[12] R. Tron and R. Vidal. A benchmark for the comparison of 3-D motion segmentation algorithms. In CVPR, 2007.
[13] J. Wright, A. Ganesh, S. Rao, Y. Peng, and Y. Ma. Robust principal component analysis: Exact recovery of corrupted low-rank matrices via convex optimization. In NIPS, 2009.
[14] J. Yan and M. Pollefeys. A general framework for motion segmentation: Independent, articulated, rigid, non-rigid, degenerate and non-degenerate. In ECCV, 2006.
[15] X. Zhou, C. Yang, and W. Yu. Moving object detection by detecting contiguous outliers in the low-rank representation. PAMI, in press, 2012.
[16] Z. Zivkovic. Improved adaptive Gaussian mixture model for background subtraction. In ICPR, 2004.
