
Dynamic Background Learning through Deep Auto-encoder Networks

Pei Xu, School of CSE, University of Electronic Science and Technology of China, P.R. China, [email protected]
Mao Ye*, School of CSE, University of Electronic Science and Technology of China, P.R. China, [email protected]
Xue Li, School of ITEE, The University of Queensland, Australia, [email protected]
Qihe Liu, School of CSE, University of Electronic Science and Technology of China, P.R. China, [email protected]
Yi Yang, School of ITEE, The University of Queensland, Australia, [email protected]
Jian Ding, Tencent Group, [email protected]

ABSTRACT
Background learning is a pre-processing step of motion detection, which is itself a basic step of video analysis. For static backgrounds, many previous works have already achieved good performance. However, results on learning dynamic backgrounds still leave much room for improvement. To address this challenge, in this paper a novel and practical method is proposed based on deep auto-encoder networks. Firstly, dynamic background images are extracted by a deep auto-encoder network (called the Background Extraction Network) from video frames containing moving objects. Then, a dynamic background model is learned by another deep auto-encoder network (called the Background Learning Network) using the extracted background images as input. To be more flexible, our background model can be updated online to absorb more training samples. Our main contributions are: 1) a cascade of two deep auto-encoder networks which can separate dynamic background and foregrounds very efficiently; 2) an online learning method that accelerates the training of the Background Extraction Network. Compared with previous algorithms, our approach obtains the best performance over six benchmark data sets. In particular, the experiments show that our algorithm handles backgrounds with large variations very well.

*Mao Ye is the corresponding author and works for the School of Computer Science and Engineering, Center for Robotics, Key Laboratory for Neuro-Information of Ministry of Education, University of Electronic Science and Technology of China, Chengdu 611731, P.R. China.

MM'14, November 3-7, 2014, Orlando, FL, USA. Copyright 2014 ACM 978-1-4503-3063-3/14/11. http://dx.doi.org/10.1145/2647868.2654914

Categories and Subject Descriptors
I.4.8 [Image Processing and Computer Vision]: Scene Analysis

Keywords
Motion detection; Dynamic background; Deep auto-encoder network

1. INTRODUCTION
Extracting moving objects from a video is a fundamental and basic step in multimedia applications such as crowd detection, action recognition, etc. [36, 42]. There already exist many works on this problem. Roughly, they can be classified into two categories. The first category learns a foreground model to obtain the moving objects. The other category extracts the moving objects by learning a background model.

In the first category, there are two approaches. The first approach is called motion segmentation [26, 5, 6, 18, 33, 40]. Under the assumption of rigid or smooth motion, segmentation or optical flow is used to model the foreground motion. However, the moving objects are often complicated, with non-rigid shapes, complicated texture, and illumination changes caused by the dynamic background. Recently, another approach has been proposed based on deep models: human motion, viewed as foreground, is learned by a factored conditional Restricted Boltzmann Machine [23, 24]. In real-world applications, these methods are not practical because moving objects are infinite in variety, and it would be a huge undertaking to learn a neural network for each kind of motion.

Instead of extracting the moving objects directly, the other category of methods tries to model the dynamic background. There are two kinds of approaches in this category. The first is probability-model based, such as the Mixture of Gaussians (MOG) [31, 22, 11] and the Hidden Markov Model (HMM) [19], which are used to model the background. The parameters of these models are learned to describe the background. In this approach, the background is assumed to fit a specific probability model, which is limiting when the distribution of the real background is unknown, because the real background can be very complex, including dynamic texture variations and illumination changes.

The second approach is based on a deterministic model, such as a Support Vector Machine (SVM), a linear regression model, a sparse dictionary, or a low-rank strategy. In [3, 4, 41, 16, 38, 9, 14, 2], the authors proposed to learn the background with an SVM classifier, a linear regression model, and sparse dictionary learning based methods, respectively. These methods need clean background images (without foregrounds), which cannot be obtained in real-world applications; for example, in traffic monitoring of a crossroad, there are always moving cars in the video. The low-rank strategy assumes the background images are linearly correlated with each other; this assumption tolerates some linear variations of the background [42]. In [35], a couple of auto-encoder networks is proposed to separate foregrounds from background images. This method can achieve good performance with a static background. When the variation of the background is nonlinear, such as a water surface, all of these methods perform poorly.

As stated above, the above-mentioned methods cannot simultaneously achieve: 1) learning a background with large variations and 2) separating foregrounds from background images. To address these challenges, in this paper a novel and practical method is proposed based on deep auto-encoder networks with a carefully designed cost function. Firstly, background images are extracted using a deep auto-encoder network (called the Background Extraction Network, BEN). Then, a subsequent deep auto-encoder network (called the Background Learning Network, BLN) is used to model the dynamic background with the background images from the BEN as input. Using the deep architecture, large dynamic background changes can be learned. Furthermore, to be more flexible with respect to dynamic background changes, a method of searching for and substituting the parameters with minimal effect in different hidden layers is used to learn the background online.

Our main contributions are: 1) A learning method for dynamic backgrounds based on deep auto-encoder networks is proposed. A Background Extraction Network is designed to extract background images from video frames which may contain moving objects, and the Background Learning Network is used to learn the dynamic background from the extracted background images. This strategy can handle large variations of the background. 2) A method of searching for and substituting the parameters with minimal effect in different hidden layers is adopted to accelerate the online training of the Background Extraction Network. Our experiments demonstrate superior performance over the state-of-the-art approaches.

In the rest of this paper, we first review the deep auto-encoder network in Section 2. Our approach is introduced in Section 3. The experimental results and conclusions are given in Section 4 and Section 5, respectively.

2. PRELIMINARIES
In this paper, we use the deep auto-encoder network in [1, 17] as a building block. In the encoding stage, the input data x ∈ [0,1]^N is encoded by a function h_1 = f(x) (h_1 ∈ [0,1]^{M_1}), which is defined as follows:

    h_1 = f(x) = sigm(W_1 x + b_1),    (1)

where W_1 ∈ R^{M_1×N} is a weight matrix, b_1 ∈ R^{M_1} is a hidden bias vector, and sigm(z) = 1/(1 + exp(−z)) is the sigmoid function. Then, h_1 is in turn encoded by another function h_2 = f(h_1) (h_2 ∈ [0,1]^{M_2}), which is written as

    h_2 = f(h_1) = sigm(W_2 h_1 + b_2),    (2)

where W_2 ∈ R^{M_2×M_1} is a weight matrix and b_2 ∈ R^{M_2} is a bias vector.

In the decoding stage, h_2 is the input of the function

    h'_1 = g(h_2) = sigm(W_2^T h_2 + b_3),    (3)

where b_3 ∈ R^{M_1}. Then the reconstructed output x̂ ∈ [0,1]^N is computed by the decoding function

    x̂ = g(h'_1) = sigm(W_1^T h'_1 + b_4),    (4)

where b_4 ∈ R^N is a bias vector. Here, W_i (i = 1, 2) and b_j (j = 1, ..., 4) are model parameters which are learned by minimizing a cost function defined as

    E(x) = − Σ_{i=1}^{N} ( x_i log x̂_i + (1 − x_i) log(1 − x̂_i) ),    (5)

which is called the cross-entropy function [43, 27].
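As a concrete illustration of formulas (1)-(5), the following is a minimal NumPy sketch of one encode-decode pass of the tied-weight auto-encoder and its cross-entropy cost. The array shapes, default sizes, and helper names are our own illustration, not code from the paper.

```python
import numpy as np

def sigm(z):
    # sigmoid activation used in formulas (1)-(4)
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2, b3, b4):
    """One encode-decode pass of the tied-weight auto-encoder, formulas (1)-(4)."""
    h1  = sigm(W1 @ x + b1)        # (1) first encoding, W1: M1 x N
    h2  = sigm(W2 @ h1 + b2)       # (2) second encoding, W2: M2 x M1
    h1d = sigm(W2.T @ h2 + b3)     # (3) first decoding (tied weights)
    xr  = sigm(W1.T @ h1d + b4)    # (4) reconstruction, same size as x
    return xr

def cross_entropy(x, xr, eps=1e-8):
    """Reconstruction cost E(x) of formula (5)."""
    xr = np.clip(xr, eps, 1.0 - eps)
    return -np.sum(x * np.log(xr) + (1.0 - x) * np.log(1.0 - xr))

# toy usage: a random "frame" of N pixels in [0, 1]
N, M1, M2 = 4800, 400, 100
rng = np.random.default_rng(0)
x = rng.random(N)
W1 = 0.01 * rng.standard_normal((M1, N))
W2 = 0.01 * rng.standard_normal((M2, M1))
b1, b2, b3, b4 = np.zeros(M1), np.zeros(M2), np.zeros(M1), np.zeros(N)
print(cross_entropy(x, forward(x, W1, b1, W2, b2, b3, b4)))
```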

3. PROPOSED METHOD

3.1 Dynamic Background Modeling
Our cascade architecture of two deep auto-encoder networks is shown in Fig. 1. Firstly, the video frames are reshaped into 1-dimensional vectors and the pixel values of each frame are mapped into the interval [0, 1]. In this paper, RGB images are converted to gray ones. Thus, the video frames are denoted as x = {x^1, ..., x^D} (x^j ∈ [0,1]^N), where D is the number of frames and N is the number of pixels of a frame. The video frame set x is the input of the Background Extraction Network (BEN). The vector B^0 ∈ [0,1]^N represents the extracted background image. The outputs h_1 and h_2 come from the two encoding stages, and h'_1 and x̂ are the decoding and reconstruction outputs, respectively. A separation function S(x̂^j_i, B^0_i) is used to form the input data B of the Background Learning Network (BLN), where i = 1, 2, ..., N. The outputs h_B1 and h_B2 come from the two encoding layers of the BLN, and h'_B1 and B̂ are the outputs of its hidden decoding and reconstruction layers, respectively. In this work, we assume the neurons in the same layer are independent of each other.
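Below is a minimal sketch of the frame preprocessing just described (grayscale conversion, flattening, and scaling to [0, 1]). The helper name, the use of OpenCV, and the default working resolution are our own assumptions, chosen only for illustration.

```python
import numpy as np
import cv2  # assumed available for reading and converting frames

def frame_to_vector(frame_bgr, size=(80, 60)):
    """Preprocess one video frame as described in Section 3.1.

    Convert the frame to gray, resize it to the working resolution, flatten it
    to a 1-D vector, and scale the pixel values into [0, 1].
    """
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.resize(gray, size)
    return gray.reshape(-1).astype(np.float64) / 255.0
```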

3.1.1 Dynamic Background Extraction
Detecting foregrounds against a dynamic background is a challenging task because it is very difficult to find a set of clean background images and the background also changes dynamically. Some parts of the video frames are clean background when the foreground vanishes or appears in the video sequence, so each pixel belongs to the background most of the time. This is the basic observation of our work.

Inspired by previous denoising works using deep auto-encoder networks, we view the dynamic background and the foregrounds as 'clean' data and 'noise' data, respectively. We use a deep auto-encoder network (named the Background Extraction Network, BEN) to obtain background images.


Figure 1: The cascade architecture of two deep auto-encoder networks. Video frames enter the Background Extraction Network (encoding and decoding stages); the separation function S(x̂^j_i, B^0_i) produces dynamic background images, which are the input of the Background Learning Network (encoding and decoding stages).

The cost function of the BEN is defined as

    min_{θ_E, B^0, σ}  L(x^j; θ_E, B^0, σ) = E(x^j) + Σ_{i=1}^{N} | (x̂^j_i − B^0_i) / σ_i | + λ Σ_{i=1}^{N} |σ_i|,    (6)

where j = 1, 2, ..., D (D is the number of video frames) and N is the dimensionality. In formula (6), B^0 is a parameter vector representing a background image, and σ is the tolerance value vector of B^0.

The first term on the right hand side of formula (6) is the network reconstruction part of the BEN, which contains the standard deep auto-encoder network parameters θ_E = {W_Ei (i = 1, 2), b_Ej (j = 1, ..., 4)}. The second term forces the reconstructed frames to approach a background image B^0. In the second and third terms, the parameter vector σ represents the tolerance of dynamic background variances; the tolerances differ from pixel to pixel. To be resilient to large variance tolerances, in the second term we divide the approximation error by the parameter σ_i at the i-th pixel for i = 1, ..., N. However, large variances should be controlled, otherwise good background images cannot be obtained. The third term is a regularization term which controls the solution range of σ. Since σ is trained by the BEN, the tolerance threshold of background variances is adaptive.

Next, we train the parameters of the BEN by optimizing the cost function (6). The parameters θ_E, B^0 and σ are trained in turn. The parameter θ_E collects the weight matrices and bias vectors of the BEN. The update of θ_E is

    θ_E = θ_E − η ∇θ_E,    (7)

Here, η is the learning rate and ∇θ_E is defined as follows:

    ∇θ_E = ∂L(x^j; θ_E, B^0, σ) / ∂θ_E = ∂E(x^j)/∂θ_E + ∂( Σ_{i=1}^{N} | (x̂^j_i − B^0_i) / σ_i | ) / ∂θ_E.    (8)

The second term on the right hand side of formula (6) is not differentiable. We adopt a sign function to roughly compute the derivative, so we obtain

    ∇θ_E = ∂E(x^j)/∂θ_E + Σ_{i=1}^{N} sign( (x̂^j_i − B^0_i) / σ_i ) ∂x̂^j_i/∂θ_E,    (9)

where sign(a) = 1 (a > 0), sign(a) = 0 (a = 0) and sign(a) = −1 (a < 0). Then, the parameter θ_E can be updated by formulas (7) and (9).
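To make the rough gradient (9) concrete, the sketch below computes the error signal at the BEN's reconstruction layer: the cross-entropy part of (6) contributes the usual (x̂ − x) term, and the |·|/σ part contributes the sign term of (9) passed through the output sigmoid. Backpropagation through the tied-weight layers then proceeds as for a standard auto-encoder. This is our own illustration under those assumptions, not code from the paper.

```python
import numpy as np

def ben_output_delta(x, B0, sigma, xr):
    """Error signal at the BEN reconstruction layer for the cost (6).

    The cross-entropy part of (6) gives the usual (xr - x); the |.|/sigma part
    contributes the sign term of the rough gradient (9), pushed through the
    output sigmoid via xr * (1 - xr). Backpropagation through the tied-weight
    layers then proceeds exactly as for a standard auto-encoder.
    """
    ce_delta   = xr - x
    sign_delta = np.sign((xr - B0) / sigma) * xr * (1.0 - xr)
    return ce_delta + sign_delta
```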

Now, after fixing the parameter θ_E, we show the details of the training of B^0. In this paper, we view the D frames as a batch data set, so the optimization problem for B^0 from formula (6) can equivalently be written as

    Σ_{i=1}^{N} ( min_{B^0_i} Σ_{j=1}^{D} | x̂^j_i − B^0_i | ).    (10)

It can also be written as

    min_{B^0_i} Σ_{j=1}^{D} | x̂^j_i − B^0_i |,    (11)

for i = 1, ..., N. According to previous works [28, 14] on l1-norm optimization, the optimal B^0_i is the median of {x̂^1_i, x̂^2_i, ..., x̂^D_i} for i = 1, ..., N.

After B^0 and θ_E are fixed, the cost function for each σ_i can be rewritten as

    L(σ_i) = | (x̂^j_i − B^0_i) / σ_i | + λ |σ_i|,    (12)

for i = 1, ..., N, j = 1, ..., D. Optimizing L(σ_i) is equivalent to minimizing its logarithmic form, written as ln L(σ_i). Setting the derivative of ln L(σ_i) to zero gives

    ∂ ln L(σ_i) / ∂σ_i = 2λσ_i / ( λσ_i^2 + |x̂^j_i − B^0_i| ) − 1/σ_i = 0,    (13)

for j = 1, ..., D. The optimal σ_i is

    σ*_i = sqrt( | x̂^j_i − B^0_i | / λ ),    (14)

for j = 1, ..., D, where λ is a tunable parameter and 0 < λ < 1 in our work. For the batch data set of D frames, we average the value of σ*_i over j = 1, ..., D. It follows that

    σ*_i = Average_{j=1,...,D} sqrt( | x̂^j_i − B^0_i | / λ ).    (15)

When the error Σ_{j=1}^{D} ‖x^j − x̂^j‖² < δ_E (δ_E is a threshold close to zero), the updating is terminated. The details can be found in Algorithm 1 (lines 2-12).
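Below is a small sketch of the two closed-form updates, under the assumption that the batch of reconstructed frames is available as a (D, N) array: B^0 is the per-pixel median (the l1-optimal solution of (11)) and σ is the batch average of formula (15). The function and argument names are ours.

```python
import numpy as np

def update_background_and_tolerance(xr_batch, lam=0.5):
    """Closed-form updates of B^0 and sigma from a batch of reconstructions.

    xr_batch: array of shape (D, N) holding the D reconstructed frames x_hat^j.
    B^0_i is the per-pixel median (formula (11)); sigma_i is the average of
    sqrt(|x_hat^j_i - B^0_i| / lambda) over the batch (formula (15)).
    """
    B0 = np.median(xr_batch, axis=0)                               # (N,)
    sigma = np.mean(np.sqrt(np.abs(xr_batch - B0) / lam), axis=0)  # (N,)
    return B0, sigma
```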

Different from previous works ([23, 24, 8, 38, 9, 14, 2]), we do not need clean background images. Our assumption is that each pixel belongs to the background most of the time, which is reasonable in practical applications. When the number D of training frames is larger, the learned result is better. Compared with the method in [35], the tolerance measure of dynamic background variations, σ, is learned adaptively by our proposed method. Thus, our deep auto-encoder network can handle large variations of dynamic background very efficiently.

After the training of the BEN is finished, for the video frames x^j, j = 1, ..., D, we obtain a clean and static background B^0 and the tolerance measure σ of background variations. However, the reconstructed output x̂^j is not exactly a background image, even though the BEN removes some foregrounds to some extent. So we adopt a separation function to clean the output further, which is

    B^j_i = S(x̂^j_i, B^0_i) = { x̂^j_i,  if |x̂^j_i − B^0_i| ≤ σ_i,
                                 B^0_i,  if |x̂^j_i − B^0_i| > σ_i,    (16)

where B^j (j = 1, ..., D) are the cleaned background images. If |x̂^j_i − B^0_i| ≤ σ_i, then the i-th pixel of the j-th background image, B^j_i, equals x̂^j_i; otherwise, B^j_i equals B^0_i. For these D video frames, we thus obtain an approximately clean background image set B = {B^1, ..., B^D} (B^j ∈ [0,1]^N).
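The separation function (16) is a simple per-pixel selection; a minimal sketch (names ours) follows.

```python
import numpy as np

def separate(xr, B0, sigma):
    """Separation function of formula (16), applied to one reconstructed frame xr."""
    within = np.abs(xr - B0) <= sigma        # pixels within the learned tolerance
    return np.where(within, xr, B0)          # keep xr there, fall back to B^0 elsewhere
```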

Algorithm 1 Dynamic Background Modeling
1: Input: samples x = {x^1, ..., x^D}; randomly initialized matrices W_Ei (i = 1, 2) and B^0; zero vectors b_Ej (j = 1, ..., 4); learning rate η_E; parameter λ.
2: repeat
3:   for each j = 1, ..., D do
4:     Compute h_1 ← sigm(W_E1 x^j + b_E1)
5:     Compute h_2 ← sigm(W_E2 h_1 + b_E2)
6:     Compute h'_1 ← sigm(W_E2^T h_2 + b_E3)
7:     Compute x̂^j ← sigm(W_E1^T h'_1 + b_E4)
8:     Update parameters θ_E ← θ_E − η_E ∇θ_E, where θ_E stands for W_Ei (i = 1, 2) and b_Ej (j = 1, ..., 4)
9:     B^0_i ← median(x̂^1_i, ..., x̂^j_i) (i = 1, ..., N)
10:   end for
11:   σ is updated according to formula (15)
12: until Σ_{j=1}^{D} ‖x^j − x̂^j‖² < δ_E
13: Obtain the background images B = {B^1, ..., B^D} based on formula (16). Initialize the parameters of the BLN as W_Li ← W_Ei (i = 1, 2) and b_Lj ← b_Ej (j = 1, ..., 4).
14: repeat
15:   for each j = 1, ..., D do
16:     Compute h_B1 ← sigm(W_L1 B^j + b_L1)
17:     Compute h_B2 ← sigm(W_L2 h_B1 + b_L2)
18:     Compute h'_B1 ← sigm(W_L2^T h_B2 + b_L3)
19:     Compute B̂^j ← sigm(W_L1^T h'_B1 + b_L4)
20:     Update parameters θ_L ← θ_L − η_L ∇θ_L, where θ_L stands for W_Li (i = 1, 2) and b_Lj (j = 1, ..., 4)
21:   end for
22: until Σ_{j=1}^{D} ‖B^j − B̂^j‖² < δ_L
23: Output: θ_E, θ_L, B^0, σ

3.1.2 Dynamic Background Learning
We use another deep auto-encoder network (called the Background Learning Network, BLN) to further learn the dynamic background model from the extracted background images. The set B is used as input data to train the parameters of the BLN. To decrease the training time, the parameters of this deep auto-encoder network, written as θ_L = {W_Li (i = 1, 2), b_Lj (j = 1, ..., 4)}, are initialized by θ_E = {W_Ei (i = 1, 2), b_Ej (j = 1, ..., 4)}. The cost function of the Background Learning Network is

    L(B^j, θ_L) = − Σ_{i=1}^{N} ( B^j_i log B̂^j_i + (1 − B^j_i) log(1 − B̂^j_i) ),    (17)

for j = 1, 2, ..., D, where B̂^j is the reconstruction result of the BLN.

The gradient descent method is used to train θ_L, where ∇θ_L is computed as

    ∇θ_L = ∂L(B^j, θ_L) / ∂θ_L,    (18)

where θ_L stands for W_Li (i = 1, 2) and b_Lj (j = 1, ..., 4). When Σ_{j=1}^{D} ‖B^j − B̂^j‖² < δ_L is satisfied, the iteration is finished; δ_L is a small threshold parameter. The training details can also be found in Algorithm 1 (lines 13-22).
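As a concrete illustration of formulas (17)-(18) and Algorithm 1 (lines 15-21), here is a minimal NumPy sketch of one BLN training pass over the cleaned background images. The tied-weight backpropagation and all function names are our own; the paper does not prescribe this exact implementation.

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def bln_epoch(B, W1, b1, W2, b2, b3, b4, lr=0.1):
    """One pass of BLN training (formulas (17)-(18); Algorithm 1, lines 15-21).

    B: (D, N) array of cleaned background images from formula (16).
    The cost is plain cross-entropy, so the output-layer error is simply Br - Bj.
    """
    total_err = 0.0
    for Bj in B:
        # forward pass (same tied-weight architecture as the BEN)
        h1  = sigm(W1 @ Bj + b1)
        h2  = sigm(W2 @ h1 + b2)
        h1d = sigm(W2.T @ h2 + b3)
        Br  = sigm(W1.T @ h1d + b4)
        # backward pass for the cross-entropy cost (17)
        d_out = Br - Bj
        d_h1d = (W1 @ d_out) * h1d * (1 - h1d)
        d_h2  = (W2 @ d_h1d) * h2 * (1 - h2)
        d_h1  = (W2.T @ d_h2) * h1 * (1 - h1)
        # tied weights collect gradients from both encoder and decoder roles
        W1 -= lr * (np.outer(d_h1, Bj) + np.outer(h1d, d_out))
        W2 -= lr * (np.outer(d_h2, h1) + np.outer(h2, d_h1d))
        b1 -= lr * d_h1; b2 -= lr * d_h2; b3 -= lr * d_h1d; b4 -= lr * d_out
        total_err += np.sum((Bj - Br) ** 2)
    return total_err   # compare against delta_L to decide whether to stop
```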

The first auto-encoder performs like a denoising auto-encoder which learns the 'clean' background image and the tolerance of background variations for each pixel. The tolerance helps obtain the background images for each of the first D frames. Simply fitting B^0 to the frames using l1 optimization alone cannot achieve good results, because the extracted background images would still contain foregrounds. The deep architecture learns a more invariant representation of the background, which enables us to better reconstruct the 'clean' background image and remove the foregrounds; this makes our algorithm more robust.

The reason we use two deep auto-encoder networks is that the background of different frames may have dramatic variations, so the background images learned from the first D frames may not fit the subsequent frames well. In the next section, we introduce the online learning method, which is essential for obtaining robust performance under dramatic inter-frame variations and adaptively learns the network parameters. Besides, the second network is used to model the dynamic background, which can further reduce the noise of the background images by implicitly learning the variation distributions of the background. Although two auto-encoders are used, our algorithm still runs in real time.

3.2 Online Learning
In the previous section, D video frames are used to train the dynamic background model. This limited number of samples may lead to overfitting, so we propose an online learning method to incorporate more data such that a more general model can be obtained.

The online learning is executed on the BEN to update the trained parameters θ_E, B^0 and σ. To distinguish the symbols of these parameters, we denote them as θ^o_E = {W^o_Ei (i = 1, 2), b^o_Ej (j = 1, ..., 4)}, B^{o0} and σ^o. Firstly, the parameters θ^o_E, B^{o0} and σ^o are initialized by θ^o_{E,t−h}, B^{o0}_{t−h} and σ^o_{t−h}, where t−h refers to the previously trained BEN; in each round, h new frames are used to update the network. To speed up the online learning of the two layers of weights W^o_Ei (i = 1, 2), we are motivated by the method in [34], which searches for and eliminates the weights that lead to the minimal change of the cost function; instead of eliminating these weights, we substitute them with weights that have large change effects. This strategy helps us use all connections between neurons more efficiently, which also helps the training stop earlier.

Firstly, the weight matrices W^o_Ei (i = 1, 2) are rewritten as W^o_E1 = [W^o_{E1,1}, W^o_{E1,2}, ..., W^o_{E1,M_1}] and W^o_E2 = [W^o_{E2,1}, W^o_{E2,2}, ..., W^o_{E2,M_2}], where M_1 and M_2 are the numbers of neurons in the first two hidden layers, respectively. W^o_{E1,·} is an N-dimensional vector and W^o_{E2,·} is an M_1-dimensional vector. Based on the method of [34] for a single hidden layer, our optimization problem on two hidden layers is to solve

    min_{ΔW^o_Ei}  ΔL(W^o_E1, W^o_E2) = Σ_{i=1}^{2} tr( (1/2) ΔW^{oT}_Ei H_i ΔW^o_Ei )
    s.t.  ΔW^o_E1 e_j = −W^o_{E1,j},   ΔW^o_E2 e'_k = −W^o_{E2,k},    (19)

where e_j is the j-th column of the N × M_1 identity matrix, e'_k is the k-th column of the M_1 × M_2 identity matrix, and H_i is the Hessian matrix with respect to the i-th layer weights. ΔL denotes the disturbance caused by the weight changes. We assume the weight matrices W^o_E1 and W^o_E2 are independent of each other, which means ΔL(W^o_E1, W^o_E2) = ΔL(W^o_E1) + ΔL(W^o_E2), abbreviated as ΔL = ΔL_1 + ΔL_2.

Let

    ΔL_{1j} = W^{oT}_{E1,j} W^o_{E1,j} / H^{−1}_{1,jj},
    ΔL_{2k} = W^{oT}_{E2,k} W^o_{E2,k} / H^{−1}_{2,kk},    (20)

where H^{−1}_{i,jj} denotes the j-th diagonal element of H^{−1}_i. The computation of H^{−1}_i is achieved by several iterative operations. Minimizing the cost functions ΔL_1 and ΔL_2 amounts to finding the vectors W^o_{E1,j*} and W^o_{E2,k*} such that ΔL_{1j} and ΔL_{2k} are minimal for j = 1, ..., M_1 and k = 1, ..., M_2, respectively; more details are given in [34]. Thus, we sort the values ΔL_{ij} (i = 1, 2) for j = 1, ..., M_1 and j = 1, ..., M_2, respectively. Each vector W^o_{Ei,j} satisfying ΔL_{ij} < ζ is substituted with a randomly chosen vector W^o_{Ei,r} satisfying ΔL_{ir} > ζ, where ζ is a user-specified parameter. Then, we retrain the BEN on the new samples. The details are shown in Algorithm 2.

There exist a few other online learning methods ([43, 20, 39, 32]). These methods find a metric to merge similar weights. In fact, merging similar weights may affect the cost function considerably, and such an operation may cause the network to need more iterations to converge. Different from these works, our online strategy is to find the weights with the minimal effect on the cost function, in order to update and fully use all connections. Saturation effects are avoided by substituting these weight vectors with the weight vectors that have large change effects on the cost function.
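The following is a rough sketch of the weight-substitution step of formulas (19)-(20): each per-neuron weight vector is scored by w^T w / [H^{-1}]_{jj}, and low-scoring vectors are overwritten with randomly chosen high-scoring ones. The paper computes H^{-1}_i iteratively; here the diagonal of the inverse Hessian is simply passed in as an argument, and the matrix orientation (one weight vector per column) as well as all names are our own assumptions, not the paper's code.

```python
import numpy as np

def substitute_minimal_effect_columns(W, H_inv_diag, zeta, rng=None):
    """Score columns by Delta L of formula (20) and substitute the low-effect ones.

    W:          weight matrix whose columns are the per-neuron weight vectors.
    H_inv_diag: diagonal of the inverse Hessian for this layer (precomputed).
    zeta:       threshold separating minimal-effect from large-effect columns.
    """
    rng = np.random.default_rng() if rng is None else rng
    scores = np.einsum('ij,ij->j', W, W) / H_inv_diag      # Delta L per column, formula (20)
    low  = np.flatnonzero(scores < zeta)                   # minimal-effect columns
    high = np.flatnonzero(scores > zeta)                   # large-effect columns
    for j in low:
        if high.size:                                      # replace with a random large-effect column
            W[:, j] = W[:, rng.choice(high)]
    return W
```

For the BEN's W_E1 ∈ R^{M_1×N}, one would pass the transpose so that each column is one hidden neuron's incoming weight vector.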

3.3 Foreground Detection
In this section, we introduce the foreground detection flow based on the method proposed in the previous sections. For an input image y, the outputs of the BEN and the BLN are ŷ and B̂^y, respectively. Combining y and B̂^y, the foreground is computed as

    F_i = { 0,  if |y_i − B̂^y_i| ≤ ε,
            1,  if |y_i − B̂^y_i| > ε,    (21)

for i = 1, 2, ..., N, where ε is a threshold on the difference between the BEN input and the BLN output.

Suppose there are D′ frames in a video. Firstly, we use D (D < D′) frames to pre-train the BEN and the BLN using Algorithm 1.

Algorithm 2 Online Learning of BEN
1: Input: samples x = {x^{t−h+1}, ..., x^t}; parameters initialized as W^o_Ei ← W_{Ei,t−h}, b^o_Ej ← b^o_{Ej,t−h}, B^{o0} ← B^{o0}_{t−h}, σ^o ← σ^o_{t−h}
2: Use (20) to find the minimal-effect vectors in W^o_Ei (i = 1, 2).
3: Substitute these vectors with randomly chosen vectors in W^o_Ei (i = 1, 2) satisfying ΔL > ζ.
4: repeat
5:   for each j = t − h + 1, ..., t do
6:     Compute h_1 ← sigm(W^o_E1 x^j + b^o_E1)
7:     Compute h_2 ← sigm(W^o_E2 h_1 + b^o_E2)
8:     Compute h'_1 ← sigm(W^{oT}_E2 h_2 + b^o_E3)
9:     Compute x̂^j ← sigm(W^{oT}_E1 h'_1 + b^o_E4)
10:    Update weight matrices θ^o_E ← θ^o_E − η ∇θ^o_E using formula (9), where θ^o_E stands for W^o_Ei and b^o_Ej
11:    B^{o0}_i ← median(x̂^1_i, ..., x̂^j_i) (i = 1, ..., N)
12:   end for
13:   σ^o is updated according to formula (15)
14: until Σ_{j=t−h+1}^{t} ‖x^j − x̂^j‖² < δ_o
15: Output: θ^o_E, B^{o0}, σ^o

Then, D foreground images F = {F^1, ..., F^D} are computed according to formula (21). In each round of the online learning, h frames are used to train the BEN by Algorithm 2, and the corresponding foreground images are appended to F. These details are shown in Algorithm 3.

Algorithm 3 Foreground Detection
1: Input: samples x = {x^1, ..., x^{D′}}
2: Use D samples to initialize the BEN and BLN by Algorithm 1.
3: Use formula (21) to obtain D foreground images F = {F^1, ..., F^D}.
4: t ← D + 1
5: repeat
6:   Obtain samples {x^{t−h+1}, ..., x^t} from x
7:   Use these h samples for online learning by Algorithm 2.
8:   Re-train the BLN using Algorithm 1, lines 13-22.
9:   F ← F ∪ {F^{t−h+1}, ..., F^t}
10:  t ← t + h
11: until t ≥ D′
12: Output: F
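Below is a minimal sketch of the per-pixel decision of formula (21), as used in step 3 of Algorithm 3. It assumes the frame y and the background reconstruction B̂^y are already flattened to [0, 1]^N vectors; the function name and the default value of ε are our own illustration.

```python
import numpy as np

def foreground_mask(y, By_hat, eps=0.05):
    """Per-pixel foreground decision of formula (21).

    y:      input frame, flattened to [0, 1]^N (the BEN input).
    By_hat: background reconstruction for this frame (the BLN output).
    eps:    tolerance threshold; pixels differing by more than eps are foreground.
    """
    return (np.abs(y - By_hat) > eps).astype(np.uint8)   # 1 = foreground, 0 = background
```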

4. EXPERIMENTAL RESULTS
In this section, we evaluate the performance of our method for extracting foregrounds from dynamic backgrounds. We first introduce the data sets and some experimental details in Section 4.1, then compare our experimental results with previous works in Section 4.2, and finally discuss the online learning strategy in Section 4.3.

4.1 Experiment Setup
We use six publicly available video data sets in our experiments, namely Jug [41], Railway [21], Lights [3], Trees [25], Water-Surface [12], and Fountain [21], to evaluate the performance of Algorithm 3, which details the foreground detection from dynamic backgrounds.


Figure 2: TPR vs. λ on the six data sets (Jug, Lights, Fountain, Railway, WaterSurface, Trees).

The Jug data set [41] contains 207 frames; its dynamic background is a moving river surface and the foreground is a bottle floating on the surface. The Railway data set [21] contains 500 images; the dynamic background consists of moving trees and the foreground is a walking person and a moving car. The Lights data set [3] contains 1546 images, whose dynamic background is a room scene with illumination variations and whose foregrounds are pedestrians. Trees [25] contains 287 frames, with a moving tree as part of the dynamic background and a walking man as the foreground. Water-Surface [12] contains 432 frames, whose dynamic background consists of moving water and whose foreground is a walking man. Fountain [21] contains 1957 frames, with several moving fountains in the background and walking men as foregrounds.

We measure the detection accuracy to evaluate quantitatively. Foreground detection is in fact a classification problem: if a pixel is classified as foreground, the output is positive; otherwise it is negative. We evaluate detection results using TPR (True Positive Rate) and FPR (False Positive Rate), which are defined as

    TPR = TP / (TP + FP),   FPR = FN / (TP + FN),    (22)

where TP, FP, TN, and FN are the numbers of true positives, false positives, true negatives, and false negatives at the pixel level over all test frames, respectively.

Some previous works use the F-measure (also called the F-score) to evaluate performance. The F-measure is defined as

    F-measure = 2 · TPR · (1 − FPR) / ( TPR + (1 − FPR) ).    (23)
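For reference, here is a small sketch computing TPR, FPR, and the F-measure exactly as defined in formulas (22)-(23) from pooled binary masks; the function name is ours.

```python
import numpy as np

def detection_scores(pred, gt):
    """TPR, FPR and F-measure as defined in formulas (22)-(23).

    pred, gt: binary arrays (1 = foreground) of equal shape, pooled over all test frames.
    """
    tp = np.sum((pred == 1) & (gt == 1))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    tpr = tp / (tp + fp)                              # formula (22), as defined in the paper
    fpr = fn / (tp + fn)
    f = 2 * tpr * (1 - fpr) / (tpr + (1 - fpr))       # formula (23)
    return tpr, fpr, f
```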

The higher the F-measure, the better the performance.

For each of the data sets mentioned above, we trained our deep auto-encoder networks separately. The input dimensionality N on these data sets is 80 × 60 (Jug), 180 × 120 (Railway), 80 × 64 (Lights), 80 × 60 (Water-Surface), and 180 × 120 (Fountain), respectively. The number of pre-training frames, D in Algorithm 3, is 50 on all these data sets. According to our method, a larger D gives better learning results; we find that on these six data sets the experimental results are almost the same once D exceeds 50, so no further accuracy is gained. In Algorithm 3, each round of online learning uses h = 35 frames to re-train the network. In Algorithm 1, the parameters δ_E and δ_L are set to 5% and 10% of the initial error between the input and the reconstruction output, respectively. This setting adapts to different data sets, unlike the previous work [35], whose δ is a fixed value. The parameter ε in formula (21) is tunable; the ROC curves shown in the next section are obtained by varying ε.

The parameter λ in Algorithm 1 is also tunable. Different values of λ provide different tolerances of the dynamic background. We compute the TPR on the six data sets with different λ (see Figure 2). In the discussion below, we choose the value of λ corresponding to the highest TPR on each data set. Specifically, λ = 0.5, 0.4, 0.4, 0.5, 0.6, 0.4 correspond to Jug, Lights, Fountain, Railway, Water-Surface, and Trees, respectively.

4.2 Comparisons to Previous Works
In this section, we compare against benchmark methods for which the authors have provided ROC curves in their papers. Firstly, we compare our method to previous works: 1-SVM [3], the Bayesian method of [21], the struct-SVM (S-SVM) method of [4], the Mixture of Gaussians (MOG) in [11], and DCOLOR (Detecting Contiguous Outliers of Low-Rank) [42] on the Jug, Lights, Railway, and Trees data sets. Sheikh et al. use a Bayesian model to learn the background, but this work cannot obtain competitive results on dynamic backgrounds. 1-SVM and S-SVM are two previous methods which use support vector machines to subtract the background from dynamic scenes online; these two methods obtain competitive results on dynamic scenes. The Mixture of Gaussians based method is a popular method for background subtraction and has been proven to be very competitive compared to some sophisticated methods, but it is not good enough for dynamic scenes. DCOLOR is a state-of-the-art method for dynamic background learning and obtains good results on many complex and dynamic background data sets, but it is based on the assumption that the background images are linearly correlated with each other, which limits the method when the background has large variations.

The comparisons of the ROC curves are shown in Fig. 3. It can be observed that these methods do not perform as well as ours on the Jug, Lights, and Trees data sets. On the Railway data set, only the results of S-SVM are competitive with ours. Our method uses a deep auto-encoder network (the BLN) to model the dynamic background; unlike DCOLOR, there is no linear assumption about the background variations, and the BLN can learn large variations of the dynamic background. So our method performs better.

We also compare our method to further works on the WaterSurface and Fountain data sets. On the WaterSurface data set, our method obtains the best performance among RPMF (Robust Probabilistic Matrix Factorization) [29], RPCA (Robust PCA) [13], GRASTA (Grassmannian Robust Adaptive Subspace Tracking) [7], and DCOLOR [42]. RPMF, RPCA, and GRASTA are three baseline methods for background subtraction. The WaterSurface data set contains nonlinear motion and large variations of the moving water; RPCA and DCOLOR are not robust to this kind of dynamically changing background. GRASTA [7] is also based on the low-rank model, with a sparsity penalty imposed on its norm, but it only works well for static backgrounds, so on a dynamic background data set such as WaterSurface it cannot achieve good performance.

Figure 3: Comparisons of ROC curves (TPR vs. FPR) on the Fountain, Jug, Lights, Railway, Trees, and WaterSurface data sets. The compared methods include Sheikh et al., MOG, 1-SVM, S-SVM, DCOLOR, RPMF, RPCA, GRASTA, FCH, BOF, and ours.

RPMF [29] is based on non-negative matrix factorization, which provides competitive performance on static background data sets but is not good enough for dynamic backgrounds. The reason for our better performance is that the BEN extracts dynamic background images while taking the variations into account, and the BLN trains on these dynamic background images with a deep auto-encoder network that can learn large variations of the background.

For the Fountain data set, we compare our results to FCH (Fuzzy Color Histogram) [10], BOF (Bag of Features), MOG [11], and DCOLOR [42]. FCH learns the color information of the background, and BOF is based on learning the background texture; feature selection is very important for the good results of FCH and BOF. Our deep learning strategy combines the two phases of feature selection and object detection: deep neural networks learn object information directly from pixels and show powerful classification ability.

Some previous works use the F-measure (also called the F-score) to evaluate performance. To compare with these works directly, we also compute the F-measure of our method. The comparison results are shown in Table 1 and Table 2. In Table 1, we compare our work to MOG, FCH, BOF, ROFS, DCOLOR, RPCA, RPMF, and GRASTA on four data sets. On the Fountain and WaterSurface data sets, the F-measure values of MOG, FCH, BOF, and ROFS are copied from [37], and the F-measure values of RPCA, RPMF, and GRASTA are copied from [7]. On the Trees and Lights data sets, the F-measure values of FCH, BOF, and ROFS are from our implementation of their algorithms, and the F-measure values of RPCA, RPMF, and GRASTA are also from [7]. In Table 1, the F-measure values of DCOLOR are all obtained with the open code provided by the authors. From Table 1, one can see that our method is consistently better than all of these baseline methods, although RPMF and DCOLOR obtain rather good performance on these data sets. This demonstrates that our method is more competitive for motion detection from dynamic backgrounds.

In Table 2, we compare our work to MOG, ORDL (Online Robust Dictionary Learning) [14], BRDL (Batch Robust Dictionary Learning) [38], Mairal et al. [15], DECOLOR, and MDAEN (Motion Detection with Auto-encoder Network) [35] on the Jug and Railway data sets. ORDL learns the background dictionary based on the l1 norm, and BRDL is based on the l2 norm. Mairal et al. proposed to learn the background by matrix factorization and sparse coding. These three methods always need clean background images to learn the dictionary, a limitation that does not fit real-world applications; they can perform very well on a static background when the learned dictionary expresses the static background well enough. The authors of MDAEN also proposed to use a couple of auto-encoder networks to learn the background, but that work relies on a fixed vector to describe the tolerances of background variations, while our strategy is to learn an adaptive tolerance vector; on dynamic background data sets its performance is therefore worse than ours. From Table 2, one can see that our method obtains the best performance.

In Fig. 4, we show some examples of foreground extraction for the different methods on the six data sets. The first row in Fig. 4 shows a bottle floating on the moving surface of water: our method subtracts the background accurately, MOG and DCOLOR output some false negatives, and the work of Sheikh et al., 1-SVM, and S-SVM produce some false positives. The fourth row in Fig. 4 is a walking person against the dynamic background of a waving tree, which waves faster than the man walks: MOG, Sheikh et al.'s work, 1-SVM, and S-SVM show some false negatives, and DECOLOR shows both false negatives and false positives. On these six dynamic background data sets, our method gives better foreground extraction results than all of the previous benchmark methods.


Table 1: Comparisons of F-measure on Fountain, WaterSurface, Trees, and Lights
Video         MOG    FCH    BOF    ROFS    DCOLOR  RPCA   RPMF   GRASTA  Ours
Fountain      0.601  0.900  0.901  0.859   0.860   0.940  0.940  0.690   0.962
WaterSurface  0.534  0.721  0.819  0.926   0.910   0.730  0.840  0.870   0.950
Trees         0.553  0.732  0.795  0.8785  0.79    0.74   0.840  0.870   0.970
Lights        0.63   0.846  0.871  0.864   0.88    0.87   0.92   0.620   0.943

Table 2: Comparisons of F-measure on Jug and Railway
Video     MOG    ORDL   BRDL   Mairal et al  DECOLOR  MDAEN  Ours
Jug       0.613  0.712  0.711  0.623         0.76     0.853  0.92
Railway   0.736  0.886  0.833  0.794         0.932    0.914  0.954

Figure 4: Comparisons of foreground extraction on the six data sets (Jug, Railway, Lights, Trees, WaterSurface, Fountain). The first column contains the original images. The second to the last columns are the extracted results of the different methods (MOG, Sheikh et al., 1-SVM, S-SVM, RPCA, FCH, ROFS, BOF, DCOLOR, and ours).

4.3 Online Learning Strategy Comparison
In this section, we compare our online strategy to previous works. Our aim is to compare the number of iterations needed in Algorithm 2 with other strategies. ESCWL (Exact Soft Confidence-Weighted Learning) [30] is one of the newest online learning methods; it assumes the weight vector follows a Gaussian distribution, constructs a loss function for the weight vector based on this assumption, and updates the weights according to the mean and covariance of the weight vectors. OI (Online Incremental) [43] is an online method that computes the similarity between two weight vectors and combines similar ones. Besides, we also compare with the online learning strategy of directly updating the weights (DUW) using

    W = W − η ΔW,    (24)

without any preprocessing.

Figure 5: Comparisons of online learning strategies: averaged iterative times over nine experimental runs for OI, DUW, ESCWL, and our strategy.

On all six data sets, we run Algorithm 2 nine times. The number of frames h in each run is 35, 40, 27, 43, 50, 28, 32, 47, and 23, respectively, on each data set. For each run we average the numbers of iterations over the six data sets, since in Algorithm 2 the parameters ζ and δ differ across data sets.

The results are shown in Fig. 5. The OI strategy is better than DUW in almost every run except the seventh test, which means that merging similar weights cannot always obtain better results than directly updating the weights. The ESCWL method is better than or comparable with OI; ESCWL is based on the assumption that the weight vector follows a Gaussian distribution. Our strategy needs the fewest iterations, fewer than ESCWL, OI, and DUW. Because the minimal-effect weights are substituted with large-effect weights according to the cost function, this strategy reduces the number of iterations and makes the training converge faster.

5. CONCLUSIONS
We propose a novel and practical method for moving object detection from dynamic backgrounds based on deep auto-encoder networks. Firstly, dynamic background images are extracted using a deep auto-encoder network (called the Background Extraction Network, BEN). Then, another deep auto-encoder network (called the Background Learning Network, BLN) is used to model the dynamic background using the dynamic background images as input. To be more resilient to large dynamic background variances, a method of searching for and substituting the parameters with minimal effect, in different hidden layers, is used for online learning. In our framework, a specific method is used to learn the variance tolerance measure adaptively. Experiments on six benchmark data sets confirm the superior performance of our method on dynamic backgrounds.

6. ACKNOWLEDGMENTS
This work was supported in part by the National Natural Science Foundation of China (61375038), the 973 National Basic Research Program of China (2010CB732501), and the Fundamental Research Funds for the Central Universities (ZYGX2012YB028). The work of Yi Yang is in part supported by the ARC DECRA project.

7. REFERENCES
[1] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems (NIPS), pages 153-160, 2007.
[2] V. Cevher, A. Sankaranarayanan, M. Duarte, D. Reddy, R. Baraniuk, and R. Chellappa. Compressive sensing background subtraction. In European Conference on Computer Vision (ECCV), pages 155-168, 2008.
[3] L. Cheng, S. V. N. Vishwanathan, D. Schuurmans, et al. Implicit online learning with kernels. In Advances in Neural Information Processing Systems (NIPS), pages 249-256, 2006.
[4] L. Cheng and M. Gong. Realtime background subtraction from dynamic scenes. In IEEE International Conference on Computer Vision (ICCV), pages 2066-2073, 2009.
[5] D. Cremers and S. Soatto. Motion competition: A variational approach to piecewise parametric motion segmentation. International Journal of Computer Vision (IJCV), 62(3):249-265, 2005.
[6] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning hierarchical features for scene labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 35(8):1915-1929, 2013.
[7] J. He, L. Balzano, and A. Szlam. Incremental gradient on the Grassmannian for online foreground and background separation in subsampled video. In IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pages 1568-1575, 2012.
[8] N. Heess, N. L. Roux, and J. Winn. Weakly supervised learning of foreground-background segmentation using masked RBMs. In International Conference on Artificial Neural Networks (ICANN), pages 9-16, 2011.
[9] J. Huang, X. Huang, and D. N. Metaxas. Learning with dynamic group sparsity. In IEEE International Conference on Computer Vision (ICCV), pages 64-71, 2009.
[10] W. Kim and C. Kim. Background subtraction for dynamic texture scenes using fuzzy color histograms. IEEE Signal Processing Letters, 19(3):127-130, 2012.
[11] D. Lee. Effective Gaussian mixture learning for video background subtraction. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 27(5):827-832, 2005.
[12] L. Li, W. Huang, I. Gu, and Q. Tian. Statistical modeling of complex backgrounds for foreground object detection. IEEE Transactions on Image Processing (TIP), 13(11):1459-1472, 2004.
[13] Z. Lin, M. Chen, L. Wu, and Y. Ma. The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. UIUC Technical Report UILU-ENG-09-2215, arXiv:1009.5055, 2009.
[14] C. Lu, J. Shi, and J. Jia. Online robust dictionary learning. In IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pages 415-422, 2013.
[15] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. The Journal of Machine Learning Research, 11(3):19-60, 2010.
[16] A. Monnet, A. Mittal, N. Paragios, and V. Ramesh. Background modeling and subtraction of dynamic scenes. In IEEE International Conference on Computer Vision (ICCV), pages 1305-1312, 2005.
[17] M. Ranzato, F. J. Huang, Y. L. Boureau, and Y. LeCun. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pages 1-8, 2007.
[18] Z. Ren, L. Chia, D. Rajan, and S. Gao. Background subtraction via coherent trajectory decomposition. In ACM International Conference on Multimedia (MM), pages 545-548, 2013.
[19] J. Rittscher, J. Kato, S. Joga, and A. Blake. A probabilistic background model for tracking. In European Conference on Computer Vision (ECCV), pages 336-350, 2000.
[20] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386-408, 1958.
[21] Y. Sheikh and M. Shah. Bayesian object detection in dynamic scenes. In IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pages 74-79, 2005.
[22] C. Stauffer and W. Grimson. Adaptive background mixture models for real-time tracking. In IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pages 2246-2252, 1999.
[23] G. W. Taylor and G. E. Hinton. Factored conditional restricted Boltzmann machines for modeling motion style. In International Conference on Machine Learning (ICML), pages 129-137, 2009.
[24] G. W. Taylor, G. E. Hinton, and S. Roweis. Modeling human motion using binary latent variables. In Advances in Neural Information Processing Systems (NIPS), pages 1345-1352, 2006.
[25] K. Toyama, J. Krumm, B. Brumitt, and B. Meyers. Wallflower: Principles and practice of background maintenance. In IEEE International Conference on Computer Vision (ICCV), pages 255-261, 1999.
[26] R. Vidal and Y. Ma. A unified algebraic approach to 2-D and 3-D motion segmentation. In European Conference on Computer Vision (ECCV), pages 1-15, 2004.
[27] P. Vincent, H. Larochelle, Y. Bengio, and P. Manzagol. Extracting and composing robust features with denoising auto-encoders. In International Conference on Machine Learning (ICML), pages 1096-1103, 2008.
[28] A. Wagner, J. Wright, A. Ganesh, Z. Zhou, H. Mobahi, and Y. Ma. Towards a practical face recognition system: Robust alignment and illumination by sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 34(2):372-386, 2012.
[29] N. Wang, T. Yao, J. Wang, and D.-Y. Yeung. A probabilistic approach to robust matrix factorization. In European Conference on Computer Vision (ECCV), pages 126-139, 2012.
[30] J. Wang, P. Zhao, and S. C. H. Hoi. Exact soft confidence-weighted learning. In International Conference on Machine Learning (ICML), pages 121-128, 2012.
[31] C. R. Wren, A. Azarbayejani, T. Darrell, and A. P. Pentland. Pfinder: Real-time tracking of the human body. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 19(2):780-785, 2002.
[32] P. Wu, S. C. H. Hoi, H. Xia, et al. Online multimodal deep similarity learning with application to image retrieval. In ACM International Conference on Multimedia (MM), pages 153-162, 2013.
[33] Q. Wu, P. Boulanger, and W. F. Bischof. Bi-layer video segmentation with foreground and background infrared illumination. In ACM International Conference on Multimedia (MM), pages 1025-1026, 2012.
[34] J. Xu and D. W. C. Ho. A new training and pruning algorithm based on node dependence and Jacobian rank deficiency. Neurocomputing, 70(1):544-558, 2006.
[35] P. Xu, M. Ye, Q. H. Liu, et al. Motion detection via a couple of auto-encoder networks. In IEEE International Conference on Multimedia and Expo (ICME), 2014.
[36] A. Yilmaz, O. Javed, and M. Shah. Object tracking: A survey. ACM Computing Surveys, 38(4):1-45, 2006.
[37] S. Yoo and C. Kim. Background subtraction using hybrid feature coding in the bag-of-features framework. Pattern Recognition Letters, 34(16):2086-2093, 2013.
[38] C. Zhao, X. Wang, and W. K. Cham. Background subtraction via robust dictionary learning. EURASIP Journal on Image and Video Processing, 2011(972961):1-12, 2011.
[39] P. Zhao, S. C. H. Hoi, and R. Jin. Double updating online learning. Journal of Machine Learning Research, 12(5):1587-1615, 2011.
[40] Y. Zheng, S. Gu, and C. Tomasi. Detecting motion synchrony by video tubes. In ACM International Conference on Multimedia (MM), pages 1197-1200, 2011.
[41] J. Zhong and S. Sclaroff. Segmenting foreground objects from a dynamic textured background via a robust Kalman filter. In IEEE International Conference on Computer Vision (ICCV), pages 44-50, 2003.
[42] X. Zhou, C. Yang, and W. Yu. Moving object detection by detecting contiguous outliers in the low-rank representation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 35(3):597-610, 2013.
[43] G. Zhou, K. Sohn, and H. Lee. Online incremental feature learning with denoising auto-encoders. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1453-1461, 2012.
