Scene Aware Detection and Block Assignment Tracking in Crowded Scenes

Genquan Duan a,⁎, Haizhou Ai a, Junliang Xing a, Song Cao b, Shihong Lao c

a Computer Science and Technology Department, Tsinghua University, Beijing, China
b Electronic Engineering Department, Tsinghua University, Beijing, China
c Development Center, OMRON Social Solutions Co., LTD, Kyoto, Japan

Article info

    Article history:

    Received 18 July 2011

    Received in revised form 7 February 2012

    Accepted 10 February 2012

    Keywords:

    Visual surveillance

    Object detection

    Object tracking

Particle filter

Abstract

How far can human detection and tracking go in real world crowded scenes? Many algorithms often fail in such scenes due to frequent and severe occlusions as well as viewpoint changes. In order to handle these difficulties, we propose Scene Aware Detection (SAD) and Block Assignment Tracking (BAT), which incorporate several available scene models (e.g. background, layout, ground plane and camera models). The SAD is proposed for accurate detection through utilizing 1) a camera model to deal with viewpoint changes by rectifying sub-images, 2) a structural filter approach to handle occlusions based on a feature sharing mechanism in which a three-level hierarchical structure is built for humans, and 3) foregrounds for pruning negative and false positive samples and merging intermediate detection results. Many detection or appearance based tracking systems are prone to errors in occluded scenes because of detector failures and interactions of multiple objects. In contrast, the BAT formulates tracking as a block assignment process, where blocks with the same label form the appearance of one object. In the BAT, we model objects on two levels: the ensemble level measures how much a region is like an object by discriminative models, and the block level measures how much it is like a target object by appearance and motion models. The main advantage of the BAT is that it can track an object even when all the part detectors fail, as long as the object has assigned blocks. Extensive experiments in many challenging real world scenes demonstrate the efficiency and effectiveness of our approach.

© 2012 Elsevier B.V. All rights reserved.

    1. Introduction

Human detection and tracking are classic problems in computer vision, with applications in visual surveillance, driver-aided systems and traffic management, and have achieved significant progress recently. Many existing detection and tracking methods, however, encounter great challenges from radial distortions, illumination variations, viewpoint changes and occlusions, all of which are quite common in real world scenes.

The goal of our work is to cope with these difficulties to detect and track multiple humans in surveillance scenes using a single stationary camera. Many detection and tracking systems developed so far assume that the viewpoint is frontal, that a person enters the scene without occlusions, that a person appears or disappears only in some special locations, that a person will remain in the scene for a given number of frames, or that the human flow is gentle. In this paper, we present a robust detection and tracking system attempting to minimize such constraining assumptions, which is able to handle the following difficulties: 1) occlusion, when multiple persons crowdedly enter and move in the scene; 2) relatively unconstrained camera viewpoints, rotations and heights; 3) relatively unconstrained human motions, appearances and positions with respect to the camera; 4) humans appearing for only a small number of frames; and 5) relatively slowly moving humans. We only assume that humans stand on the ground plane in the scene, and ignore those below this ground plane or standing in other places such as rooftops, windows or the sky. This is a very reasonable assumption which is applicable in most surveillance scenes.

We innovate in both detection and tracking for scenes with occlusions and viewpoint changes. Our main contributions include the following two aspects.

A Scene Aware Detection for accurate detection. Specifically, it includes: (1) a simple but efficient learning algorithm that uses foregrounds to prune negative and false positive samples; (2) a structural filter approach to detect occluded humans in a feature sharing mechanism; and (3) a foreground aware merging strategy to explain foregrounds by detected results.

A Block Assignment Tracking for robust tracking, where tracking is formulated as a block assignment process and objects are modeled on different levels, i.e. the block level and the ensemble level. Blocks with the same label form the appearance of one object, from which robust appearance and motion models can be established. Its main advantage is that it can track an object even when all the part detectors fail, as long as the object has assigned blocks.



The rest of this paper is organized as follows. Related work is discussed in the next section. Our system is overviewed in Section 3. Scene Aware Detection is presented in Section 4. Block Assignment Tracking is described in Section 5. Experimental results on many challenging real world datasets are provided along with some discussions in Section 6. Conclusions and future work are given in Section 7.

    2. Related work

There is a great deal of work in the literature on object detection, such as faces [1] and pedestrians [2–4], and on multiple target tracking, such as vehicles [5] and humans [6–8]. Here we first review some robust detection methods that cope with occlusions and viewpoint changes, and then discuss some detection related and detection free tracking algorithms.

    2.1. Robust detection

    2.1.1. Occlusion handling

Using multiple part detectors, Wu et al. [2] proposed a Bayesian approach for combination, while Huang et al. [3] introduced a dynamic search. Wang et al. [9] proposed a global-part occlusion handling method, where an occlusion likelihood map is first produced from HOG feature responses and then segmented by a mean shift approach.

    2.1.2. Viewpoint change handling

Due to changes of viewpoint, human appearances and poses vary a lot. To address this difficulty, Li et al. [10] detected objects in rectified sub-images with a learned frontal viewpoint detector. Another approach is to learn one powerful detector for all possible viewpoints, as in [11,12]. Duan et al. [11] first clustered the complex multiple viewpoint samples into several sub-categories and then learned a classifier for each sub-category. Felzenszwalb et al. [12] proposed a more efficient model, the Deformable Part based Model, in which a root filter and several part models are learned for each object category, which can detect objects with some pose changes.

    2.1.3. Integration with other models

Beleznai et al. [13] used local shape descriptors to infer human locations in images of absolute background difference from a background model. Hoiem et al. [14] and Huang et al. [15] utilized a scene geometric model to restrict object locations and a ground plane model to restrict objects' heights at a particular location.

    2.2. Robust tracking

    2.2.1. Detection free tracking

Some techniques assume that objects enter the scene at some specific location [5], or appear in the scene without occlusions [5,16] for a period of time that allows object models to be built up while they are isolated. Some techniques (e.g. [5,6]) depend on accurate segmentation of moving foreground objects from a background color intensity model: Kamijo et al. [5] segmented foreground blocks into vehicles using spatial–temporal information, and Zhao et al. [6] developed a tracker based on a human shape model. All of them rely on the inherent assumption that there will be a significant difference in color intensity information between foreground and background. Unfortunately, background modeling suffers from many problems, such as inaccuracy, noise sensitivity, and weakness to shadows. Similar assumptions are made in [17–20], where the authors extracted features, e.g. intensity, colors, edges, contours and feature points, and used them to establish correspondences between model images and target images. Moreover, shape based approaches [6,21] encounter challenges when body parts are not isolated, which may cause significant occlusions, and appearance based ones [16] often fail when several objects get close together, as this kind of algorithm fails to allocate pixels to the correct object. To overcome some of these problems, Kelly et al. [22] used 3D stereo information to detect pedestrians via a 3D clustering process and tracked them by a weighted maximum cardinality matching scheme.

    2.2.2. Detection related tracking

2.2.2.1. Detection based tracking. With the fast development of object detection techniques, object detectors play an important role in many tracking algorithms. Some tracking algorithms use detection as their observation model. One of the most successful techniques is the particle filter [23]. The particle filter is based on Sequential Monte Carlo sampling, and has gained much attention because of its simplicity, generality, and extensibility in a wide range of challenging applications. Xing et al. [7] combined multiple part detectors with a particle filter to track multiple objects with occlusions. Another line of work associates detected results of video frames locally [24] or globally [8,15,25–27]. Wu et al. [24] associated detection results in two consecutive frames. Jiang et al. [25] adapted Linear Programming for association, while Zhang et al. [26] used min-cost flow. Andriluka et al. [8] tailored the Viterbi algorithm to link detection results, which combined the advantages of both detection and tracking. Huang et al. [15] presented a three-level hierarchical association approach where they obtained short tracks and long tracks at the low level and middle level separately, and refined the final trajectories with the estimated scene knowledge at the high level. Pirsiavash et al. [27] proposed globally optimal greedy algorithms to estimate the number of tracks and their birth and death states in a cost function. Global association based tracking methods could theoretically obtain a global optimum, since the results of all frames are available before tracking. However, the cost of heavy computation and temporal delay limits their use in real time applications.

2.2.2.2. Online learning. Avidan [28] trained an ensemble of weak classifiers online to distinguish between the object and the background. Grabner et al. [29] described an online boosting algorithm for real-time tracking, which was very adaptive but prone to drift. To limit the drifting problem, Grabner et al. [30] introduced a semi-supervised learning algorithm using unlabeled data explored in a principled manner, while Babenko et al. [31] proposed online Multiple Instance Learning using one positive bag consisting of several image patches to update a learned classifier. However, manual initialization and the focus on single object tracking prevent their application in our scenes of interest.

    3. System overview

We propose to detect and track multiple humans in surveillance scenes with occlusions and viewpoint changes using a single stationary camera, by taking advantage of some available scene models (e.g. background, camera, layout and ground plane models). We believe that the models we use are generic and applicable to a wide variety of situations. The models used are listed as follows.

(a) A camera model to rectify an image with large viewpoint changes into a frontal viewpoint;

(b) A background model to direct the system's attention to the regions showing difference from the background;

(c) A layout model to restrict objects in the scene;

(d) A ground plane model to restrict objects to standing on the ground.

The whole system is overviewed in Fig. 1 and mainly includes two components, Scene Aware Detection and Block Assignment Tracking. The three key factors of the SAD are foreground aware pruning to prune negative and false positive samples, a structural filter approach based on our previous work [4] to detect occluded objects, and foreground aware merging to explain foregrounds by detected results.


The BAT formulates tracking as a block assignment process, which can track an object even when all the part detectors fail, as long as the object has assigned blocks. The BAT proceeds as follows. It first maintains the spatial and temporal consistency at the block level (Block Tracking), then precisely estimates the locations and sizes of objects at the ensemble level using appearance, motion and discriminative models (Ensemble Tracking), and finally assigns blocks so that blocks with the same label look like a part of a human, by combining both previous results (Ensemble to Block Assignment). In our implementation, we split each frame into 8×8 blocks; a typical 640×480 image thus contains 80×60 = 4800 blocks. A block is called a foreground block if the number of its pixels in the foreground region is larger than 20% of the pixels in the whole block. Similar to [5], the BAT takes foreground blocks into account and ignores background ones.
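As a concrete illustration of this block bookkeeping, the sketch below (ours, not the authors' code; all names are illustrative) splits a binary foreground mask into 8×8 blocks and applies the 20% rule:

```python
import numpy as np

def foreground_blocks(fg_mask, block=8, min_ratio=0.2):
    """Split a binary foreground mask into block x block cells and mark a
    cell as a foreground block if more than min_ratio of its pixels are
    foreground (the 20% rule from Section 3)."""
    h, w = fg_mask.shape
    hb, wb = h // block, w // block
    # Reshape into (hb, block, wb, block) cells and count foreground pixels.
    cells = fg_mask[:hb * block, :wb * block].reshape(hb, block, wb, block)
    counts = cells.sum(axis=(1, 3))
    return counts > min_ratio * block * block  # boolean (hb, wb) grid

# A 640x480 frame yields an 80x60 grid, i.e. 4800 blocks.
mask = np.zeros((480, 640), dtype=np.uint8)
print(foreground_blocks(mask).shape)  # (60, 80)
```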

The BAT solves a particular segmentation problem, coarser than pixel level segmentation but finer than bounding boxes, as illustrated in Fig. 2. Pixel level segmentations are defined to achieve the most accurate results. But they are somewhat prone to errors under occlusions, particularly for non-rigid objects like humans with viewpoint changes, as their contours are disturbed and vary drastically. These restrictions prevent such methods from being applied in the scenes we concentrate on. Bounding boxes may take extra (non-object or other object) pixels into account and miss some real pixels. These drawbacks also exist in the BAT but are relatively more moderate, since the BAT considers foreground blocks and ignores background ones. More importantly, the BAT can build more robust appearance and motion models for objects from these blocks than from bounding boxes.

    4. Scene aware detection

    4.1. Scene models

A background model is widely used in many tracking systems. In order to establish a background model robust to noise, motion and illumination variations, we employ the lifespan background modeling algorithm from our previous work [32], where short, middle and long life span models are adaptively built and updated online in a collaborative manner.

Fig. 1. System overview. Round rectangle box: inputs and outputs. Rectangle box: procedure. Solid arrow: data flow. Double-line arrow: extra input models. The key factors of our system are marked out in bold.

Fig. 2. Comparisons of BAT, bounding boxes and pixel level segmentations on one object. (a) an image; (b) the foreground image; (c) ideal pixel level segmentations labeled manually; (d) bounding boxes with extra pixels (left) and missed pixels (right); and (e) BAT with extra blocks (left) and missed blocks (right). Please see Section 3 for more discussions.


Camera models are utilized to handle viewpoint changes in detection. We follow the method of [10], which first detects objects in sub-images rectified from a changed viewpoint to a frontal viewpoint, and then projects the detection results back into the original image. This kind of method is able to take advantage of detectors learned for a frontal viewpoint and avoids the more difficult training over multiple viewpoint samples. During detection, the sampling in 3D space is projected into the image coordinates as shown in Fig. 3(d) (bottom). For frontal viewpoint scenes, there is no need for such rectifications. To speed up detection in these scenes, we assume a linear mapping from the 2D coordinate $(x, y)$ to the human height $L_h$: $c_1 x + c_2 y + c_3 = L_h$, where $c_1$, $c_2$ and $c_3$ are unknown parameters that can be estimated through a RANSAC style algorithm like [33]. During detection, the sampling in 2D space is a scanning window process restrained by the linear mapping, as shown in Fig. 3(d) (top). Please refer to [33,10] for details.
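A minimal sketch of how the linear mapping $c_1 x + c_2 y + c_3 = L_h$ could be fit in a RANSAC style from (foot position, height) observations. This is our illustration of the idea in [33]; the iteration count and inlier tolerance are assumptions:

```python
import numpy as np

def fit_height_mapping(points, heights, iters=500, tol=3.0, rng=None):
    """RANSAC-style estimate of (c1, c2, c3) such that
    c1*x + c2*y + c3 ~= Lh for foot positions (x, y) with heights Lh."""
    rng = rng or np.random.default_rng(0)
    pts = np.column_stack([points, np.ones(len(points))])  # rows [x, y, 1]
    best, best_inliers = None, 0
    for _ in range(iters):
        idx = rng.choice(len(pts), 3, replace=False)
        try:
            c = np.linalg.solve(pts[idx], heights[idx])  # exact fit to 3 samples
        except np.linalg.LinAlgError:
            continue  # degenerate (collinear) sample, try again
        inliers = np.abs(pts @ c - heights) < tol
        if inliers.sum() > best_inliers:
            # Refit by least squares on all inliers of the best model.
            best = np.linalg.lstsq(pts[inliers], heights[inliers], rcond=None)[0]
            best_inliers = inliers.sum()
    return best  # array [c1, c2, c3]
```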

Layout models can be easily marked out for stationary scenes such as Fig. 3(c). We assume that humans stand on the ground plane in the layout. After integrating these two models with the linear mapping or camera model mentioned earlier, we obtain the sampled searching points and corresponding human heights in scenes, as illustrated in Fig. 3(d).

    4.2. Foreground aware pruning (FAP)

This step prunes negative and false positive samples by foregrounds, as shown in Fig. 4(a). We cast this pruning problem as a 2-class classification problem on binary images, and design a simple discriminative learning algorithm under the boosting framework [1]. The aim is to mine some features to learn a fast and effective pruning detector.

Our features are based on the zero moment of a region $RG$, $M(RG) = \sum_{(x,y) \in RG} I_B(x,y)$, in a binary image $I_B$. Each feature $r$ is a sub-region of $I_B$, as shown in black in Fig. 4(c). The feature value is calculated as

$$f(r, I_B) = \frac{M(r) - M(I_B \setminus r)}{|I_B|} \qquad (1)$$

where $|I_B|$ is the total number of pixels in $I_B$. We restrict $r$ to be a rectangle, and hence Eq. (1) can be calculated efficiently through an integral image without generating image pyramids as in [1].
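A sketch, under our reading of Eq. (1), of computing the zero-moment feature with an integral image; function and variable names are ours:

```python
import numpy as np

def integral_image(binary):
    """Summed-area table with an extra zero row/column on top/left."""
    ii = np.zeros((binary.shape[0] + 1, binary.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = np.cumsum(np.cumsum(binary, axis=0), axis=1)
    return ii

def region_moment(ii, x0, y0, x1, y1):
    """Zero moment M(r): number of foreground pixels in the rectangle
    [x0, x1) x [y0, y1), computed in O(1) via the integral image."""
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]

def feature_value(ii, rect, total_pixels):
    """f(r, IB) = (M(r) - M(IB \\ r)) / |IB|, using M(IB \\ r) = M(IB) - M(r)."""
    x0, y0, x1, y1 = rect
    m_r = region_moment(ii, x0, y0, x1, y1)
    m_rest = ii[-1, -1] - m_r  # ii[-1, -1] is M(IB), the total foreground count
    return (m_r - m_rest) / total_pixels
```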

Positive samples for the pruning can be obtained by manual labeling, as shown in Fig. 4(b). However, collecting negative samples is impractical for two reasons. One reason is that negative samples can take any form, which is too time consuming to label manually. The other reason is that when applying the pruning detector, negative samples themselves are always inaccurate because of noise in background modeling; thus parts of real objects are likely missing from the foreground and some background is included in objects. In fact, negative samples are not necessary because 1) a small amount of negative samples may cause overfitting, and 2) a large amount of negative samples might make the pruning detectors very complex and thus inefficient at pruning negative and false positive samples.

Fig. 3. Models in detection: (a) original images; (b) foregrounds; (c) scene layouts; (d) some searching points in red with lines whose lengths indicate the corresponding human heights; (e) cropped sub-images and their foregrounds; and (f) detection results projected as quadrangles in original images. The top and bottom rows show a common frontal viewpoint scene and a changed viewpoint one separately. Note that, in the latter case, camera models are adopted to handle the difficulty of viewpoint changes.

Fig. 4. Foreground pruning. (a) Typical pruned negative and false positive examples. (b) Whole body positive masks, from which other part positive masks can be generated. (c) Five used features.


Motivated by the above, pruning classifiers are learned with positive samples only. The classifier on feature $r$ is determined as

$$h_r(I_B) = \begin{cases} 1, & f(r, I_B) - T_r > 0 \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$

where $T_r = \min_{x_i^B} f(r, x_i^B) - \epsilon$, $\epsilon$ is a small positive value ($10^{-2}$), and $x_i^B$ is a positive sample. In consideration of the inaccuracy of background modeling, positive samples are perturbed by moving 3 pixels left or right, or 2 pixels up or down.
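A minimal sketch of learning the per-feature threshold $T_r$ from positives only, including the ±3/±2 pixel perturbations; it reuses `integral_image` and `feature_value` from the sketch above, and the wrap-around shift is a simplification of ours:

```python
import numpy as np

def learn_threshold(positive_masks, rect, eps=1e-2):
    """T_r = min over (perturbed) positive samples of f(r, x), minus eps,
    so that every positive passes h_r in Eq. (2)."""
    values = []
    for mask in positive_masks:
        for dx in (-3, 0, 3):          # horizontal perturbation
            for dy in (-2, 0, 2):      # vertical perturbation
                # np.roll wraps at borders; a real implementation would pad.
                shifted = np.roll(np.roll(mask, dy, axis=0), dx, axis=1)
                ii = integral_image(shifted)
                values.append(feature_value(ii, rect, mask.size))
    return min(values) - eps

def h_r(ii, rect, total_pixels, T_r):
    """Pruning classifier on feature r: 1 keeps the window, 0 prunes it."""
    return 1 if feature_value(ii, rect, total_pixels) - T_r > 0 else 0
```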

This pruning should be fast and effective. Instead of automatically selecting good features from a large feature pool as in [1], we simply design several features as shown in Fig. 4(c). All classifiers learned on these features are combined into one strong detector, whose order is not constrained. A searching window is then considered only if its corresponding foreground passes this strong detector. For an $n \times m$ image, the pre-processing of an integral image costs $O(nm)$ time and space. Each feature can then be calculated in $O(1)$ time, and thus the set of classifiers costs approximately constant time per window. Its effectiveness is evaluated in the experiments.

4.3. Structural filter approach

The detection is based on our previous work [4,34]. We proposed to learn an Integral Structural Filter (ISF) detector in [4] to detect humans with occlusions and articulated poses in a feature sharing mechanism. We build a three-level hierarchical model for humans: words, sentences and paragraphs, where words are the most basic units, sentences are meaningful sub-structures, and paragraphs are the appearance statuses (e.g., head–shoulder, upper-body, left-part, right-part and whole-body in occluded scenes). An example is shown in Fig. 5. We integrate the detectors for the three levels through inference from word to sentence, from sentence to paragraph and from word to paragraph. All detectors for structures (words, sentences and paragraphs) are based on the Real AdaBoost algorithm and Associated Pairing Comparison Features (APCFs) [34]. APCF describes the invariance of color and gradient of an object to some extent, and contains two essential elements, Pairing Comparison of Color (PCC) and Pairing Comparison of Gradient (PCG). A PCC (or PCG) is a Boolean color (or gradient) comparison of two granules, where a granule is a square window patch. Please refer to [4,34] for more details.

    4.4. Foreground aware merging (FAM)

We now discuss the merging strategy applied after obtaining all detected results. Different from previous approaches (e.g. [2,3]) which stick to detection results, we integrate foreground information into the post-processing. We consider objects one by one after extending them to the whole body, through adding and deleting operations defined on the visible and invisible parts of objects. To reduce the computational complexity, the two operations are based on blocks as defined in Section 3.

A hypothesis $h$ is a detected response. We denote the block set and foreground block set of $h$ as $B_h$ and $F_h$ respectively. For a hypothesis set $H$, we have $B_H = \cup_{h \in H} B_h$ and $F_H = \cup_{h \in H} F_h$ correspondingly. The score of adding $h$ into $H$ is defined as

$$sc_{add}(h) = \begin{cases} \dfrac{|F_{H \cup \{h\}}| - |F_H|}{|B_{H \cup \{h\}}| - |B_H|}, & |F_{H \cup \{h\}}| - |F_H| > T_M |F_h|,\ h \notin H \\ 0, & \text{otherwise.} \end{cases} \qquad (3)$$

$h$ can be added if $sc_{add}(h) > T_{add}$. $T_M$ is a threshold. The score of deleting $h$ from $H$ is defined as

$$sc_{del}(h) = \begin{cases} \dfrac{|F_H| - |F_{H \setminus \{h\}}|}{|B_H| - |B_{H \setminus \{h\}}|}, & |F_H| - |F_{H \setminus \{h\}}| > T_M |F_h|,\ h \in H \\ 0, & \text{otherwise.} \end{cases} \qquad (4)$$

$h$ can be deleted if $sc_{del}(h) < T_{del}$. $T_{add}$ and $T_{del}$ are empirical parameters: the smaller $T_{add}$, the more objects are added; the larger $T_{del}$, the more objects are deleted. In the implementation, we propose a greedy procedure that first uses the adding operation to find possible hypotheses and then the deleting operation to remove bad ones. Although the strategy is very simple, it yields promising detection results in the experiments.
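A sketch of the greedy merging strategy under our reading of Eqs. (3) and (4). Hypotheses are represented by their block and foreground-block sets; the threshold values and the ordering heuristic are placeholders of ours:

```python
def sc_add(h_blocks, h_fg, H_blocks, H_fg, T_M):
    """Score of adding hypothesis h to the set H (Eq. (3))."""
    gain_fg = len(H_fg | h_fg) - len(H_fg)        # newly explained fg blocks
    gain_bk = len(H_blocks | h_blocks) - len(H_blocks)
    if gain_fg > T_M * len(h_fg) and gain_bk > 0:
        return gain_fg / gain_bk
    return 0.0

def greedy_merge(hypotheses, T_M=0.3, T_add=0.5):
    """hypotheses: list of (blocks, fg_blocks) frozensets.
    Greedily add hypotheses whose adding score is high enough; the
    subsequent deletion pass (Eq. (4)) mirrors sc_add and is omitted."""
    H, H_blocks, H_fg = [], set(), set()
    for h_blocks, h_fg in sorted(hypotheses,
                                 key=lambda h: -len(h[1])):  # large fg first
        if sc_add(h_blocks, h_fg, H_blocks, H_fg, T_M) > T_add:
            H.append((h_blocks, h_fg))
            H_blocks |= h_blocks
            H_fg |= h_fg
    return H
```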

    5. Block Assignment Tracking

The previous section mainly discussed accurately locating objects in scenes with occlusions and viewpoint changes. In this section, we concentrate on robustly tracking them. In what follows, we first derive the formulation of our block assignment tracking problem, and then present our solution.

Fig. 5. The hierarchical structure of a pedestrian [4].


    5.1. Problem formulation

Denoting the object state sequences from frame 1 to frame $T$ as $S_{1:T} = \{S_1, \ldots, S_T\}$ and the corresponding observation sequences collected from the frame data as $O_{1:T} = \{O_1, \ldots, O_T\}$, a tracking problem can be formulated as the following MAP (maximum a posteriori) problem:

$$S_t^{*} = \arg\max_{S_t} p(S_t \mid O_{1:t}). \qquad (5)$$

Generally, an object state can be modeled as the location and size of the object at the ensemble level as in [7], or as a set of blocks forming the appearance as in [5]. Tracking at the ensemble level is efficient when objects are isolated. However, it tends to make errors when objects interact with each other, since ensemble observations can be ambiguous or missing because of occlusions. When objects are well initialized, tracking at the block level is efficient even with heavy occlusions, since it mainly considers block persistence in the spatial and temporal spaces. But it cannot guarantee that a segmented region looks like an object part; in fact, a region might contain no object or several. Moreover, it has no explicit correcting mechanism to rectify errors that arise during initialization and tracking. In order to combine their merits and get rid of their restrictions, we propose to model object states on both the ensemble and block levels as $S_t = \{Z_t, V_t\}$, where $Z_t = \{z_{t,k}\}_{k=1}^{K}$ is the ensemble level state of all $K$ objects and $V_t = \{v_{t,i}\}_{i=1}^{N}$ is the block level state of all $N$ blocks. $v_{t,i}$ is the label for block $b_{t,i}$, indicating that $b_{t,i}$ belongs to object $z_{t,v_{t,i}}$ if $v_{t,i} \geq 0$, or to the background if $v_{t,i} < 0$. All blocks with the same label form the appearance of an object, while the ensembles describe coarse shapes of objects and cover some blocks assigned to them, as illustrated in Fig. 6. Therefore, we modify Eq. (5) and formulate our problem as

$$(Z_t^{*}, V_t^{*}) = \arg\max_{Z_t, V_t} p(Z_t, V_t \mid O_{1:t}, V_{t-1}). \qquad (6)$$

Compared to Eq. (5), $V_{t-1}$ on the right side of Eq. (6) takes the previous assignment into account. However, the optimization of Eq. (6) is not tractable because $V_{1:t}$ and $Z_t$ are closely intertwined at time $t$. The inference between $V_t$ and $V_{t-1}$ should maintain the spatial and temporal persistence of block assignments. Meanwhile, $Z_t$ encourages blocks with the same label in $V_t$ to look like an object. Moreover, $V_{1:t-1}$ can provide robust appearance and motion models of objects for inferring $Z_t$. To make the optimization tractable, we propose to split Eq. (6) into three steps. The first step obtains an intermediate assignment $\tilde{V}_t$ through inference at the block level over two sequential frames, ignoring $Z_t$:

$$\tilde{V}_t^{*} = \arg\max_{\tilde{V}_t} p(\tilde{V}_t \mid V_{t-1}, O_{t-1:t}). \qquad (7)$$

This step maintains the persistence of block assignments in the spatial and temporal spaces. The second step then focuses on inferring $Z_t$ with the aid of robust appearance and motion models of objects estimated from $V_{1:t-1}$:

$$Z_t^{*} = \arg\max_{Z_t} p(Z_t \mid O_{1:t}). \qquad (8)$$

Afterwards, the third step achieves the final assignment by combining the previous results $\tilde{V}_t$ and $Z_t$:

$$V_t^{*} = \arg\max_{V_t} p(V_t \mid Z_t, O_t, \tilde{V}_t). \qquad (9)$$

The third step builds on the other two steps, making blocks with the same label look like a part of some object and potentially rectifying possible errors from initialization and tracking. Integrating these three steps into Eq. (6), we obtain

$$p(Z_t, V_t \mid O_{1:t}, V_{t-1}) \propto p(\tilde{V}_t \mid V_{t-1}, O_{t-1:t}) \, p(Z_t \mid O_{1:t}) \, p(V_t \mid Z_t, O_t, \tilde{V}_t). \qquad (10)$$

Therefore, Eq. (10) can be efficiently solved by the max-product algorithm. These three steps are further explained in the next section. This completes our problem formulation; since the last step assigns blocks at each time step, we term the method Block Assignment Tracking. Compared to [5,7], our formulation provides a simple way to integrate block and ensemble level information.
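The three-step factorization of Eq. (10) can be read as a per-frame pipeline. The skeleton below is our paraphrase of that data flow; the three stage functions are passed in as callables, since each is a full algorithm in its own right (Sections 5.2.1 to 5.2.3):

```python
def bat_step(block_tracking, ensemble_tracking, ensemble_to_block,
             V_prev, Z_prev, frames, detections=None):
    """One frame of Block Assignment Tracking, following Eqs. (7)-(9).

    V_prev: previous block labels; Z_prev: previous ensemble states.
    frames: (previous frame, current frame) observations.
    """
    # Step 1, Eq. (7): propagate block labels between consecutive frames.
    V_tilde = block_tracking(V_prev, frames)
    # Step 2, Eq. (8): estimate object locations/sizes with particles.
    Z = ensemble_tracking(Z_prev, frames, detections)
    # Step 3, Eq. (9): reconcile blocks with ensembles (graph cuts).
    V = ensemble_to_block(V_tilde, Z, frames[-1])
    return V, Z
```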

    5.2. Solution

In this subsection, we present the details of the three steps in Eq. (10), namely Block Tracking, Ensemble Tracking and Ensemble to Block Assignment. At the end, we give a summary of our tracking algorithm.

    5.2.1. Block Tracking

This step predicts an intermediate result by taking advantage of label, color and shape constraints. Inspired by the similar problem in [5], we define

$$-\ln p(\tilde{V}_t \mid V_{t-1}, O_{t-1:t}) \propto \sum_{k=0}^{K} \sum_{i=1}^{N} \alpha_i(b_{t,i}, z_{t,k} \mid V_{t-1}, O_{t-1:t}) \, \delta(v_{t,i}, k) + \sum_{i=1}^{N} \beta_i(b_{t,i}, b_{t,j_1^i}, \ldots, b_{t,j_l^i} \mid V_{t-1}, O_{t-1:t}) \qquad (11)$$

where $\alpha_i$ is the penalty if $b_{t,i}$ is assigned to $z_{t,k}$, $\beta_i$ is the penalty when $b_{t,i}$ and its neighbors are assigned to different objects, $\delta(i,j)$ is a Kronecker function equaling 1 if $i = j$ and 0 otherwise, $l = |N_{b_{t,i}}|$, and $N_{b_{t,i}}$ are the 8-neighbor blocks of $b_{t,i}$. The observations here are image sequences, and object states are updated straightforwardly from their previous states as $z_{t,k} = z_{t-1,k} + r_{z_{t,k}}$ by their motions $r_{z_{t,k}} = (r_{z_{t,k}}^x, r_{z_{t,k}}^y)$. The motion of an object is represented by the most frequent motion

Fig. 6. Tracking problem formulation. Left: original image. Middle: foreground block image. Right: an assignment where blocks in the same color (label) form the appearance of one object and the quadrangles indicate coarse shapes of objects.


among all its blocks, where the motion of one block is obtained by block matching. We now give the definitions of $\alpha_i$ and $\beta_i$:

$$\alpha_i(b_{t,i}, z_{t,k} \mid V_{t-1}, O_{t-1:t}) = a \, DL_{t,i,k} + b \, DM_{t,i,k} + c \, DA_{t,i,k}. \qquad (12)$$

$DL_{t,i,k}$ is a rough shape constraint that restricts the spread of block labels. As object shapes are quadrangles, we need to eliminate the effects of scale along the axes and rotation in the 2D plane. Our idea is to use a normalization matrix

$$\tilde{x} = \begin{bmatrix} 1/W & 0 \\ 0 & 1/H \end{bmatrix} \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$$

where $[W, H]^T$ is the minimum size of detection and $\theta$ is the angle between an object and the vertical. Let the centers of $b_{t,i}$ and $z_{t,k}$ be $x_{t,i} = (x_{t,i}, y_{t,i})^T$ and $x_{z_{t,k}} = (x_{z_{t,k}}, y_{z_{t,k}})^T$ respectively. We define $DL_{t,i,k} = \exp\left(\left\| \tilde{x} \left( x_{t,i} - x_{z_{t,k}} \right) \right\|^2\right)$.

$DM_{t,i,k}$ is a temporal constraint on label consistency, defined as $DM_{t,i,k} = (M_{t,i,k}/M_i - 1)^2$, where $M_{t,i,k}$ is the number of pixels in the overlapped area between $z_{t-1,k}$ and the region obtained by moving $b_{t,i}$ by $r_{z_{t,k}}$, and $M_i$ is the total number of pixels in a block.

$DA_{t,i,k}$ is a color constraint, which measures temporal color coherence. Letting $I_t$ be the gray scale frame at time $t$, we define

$$DA_{t,i,k} = \sum_{0 \le dx < 8} \sum_{0 \le dy < 8} \left| I_t(x + dx, y + dy) - I_{t-1}(x + dx - r_{z_{t,k}}^x, y + dy - r_{z_{t,k}}^y) \right|. \qquad (13)$$

$\beta_i$ is the spatial constraint on label consistency:

$$\beta_i(b_{t,i}, b_{t,j_1^i}, \ldots, b_{t,j_l^i} \mid V_{t-1}, O_{t-1:t}) = d \sum_{k=1}^{K} (N_{i,k} - N_k)^2 + g \sum_{n=1}^{l} \left\| r_{t,i} - r_{t,j_n^i} \right\|^2. \qquad (14)$$

Similar to [5], we set $a = 1$, $b = 1$, $c = 0.125$, $d = 0.00000025$ and $g = 0.5$, and adopt the Gibbs sampler algorithm [35] to solve Eq. (11). Please refer to [5,35] for more details.
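As an illustration of the unary penalty in Eqs. (12) and (13), the sketch below computes the color term DA as a sum of absolute differences over an 8×8 block and combines it with precomputed shape and overlap terms; the weights follow the a, b, c values quoted from [5], and everything else is our naming:

```python
import numpy as np

A_W, B_W, C_W = 1.0, 1.0, 0.125  # weights a, b, c from the text

def da_color(I_t, I_prev, x, y, motion, block=8):
    """DA (Eq. (13)): sum of absolute differences between the block at
    (x, y) in frame t and the motion-compensated block in frame t-1.
    Assumes the compensated block stays inside the image."""
    mx, my = motion
    cur = I_t[y:y + block, x:x + block].astype(np.int32)
    prev = I_prev[y - my:y - my + block, x - mx:x - mx + block].astype(np.int32)
    return np.abs(cur - prev).sum()

def alpha(dl, dm, da):
    """Unary penalty of assigning block b_{t,i} to object z_{t,k} (Eq. (12))."""
    return A_W * dl + B_W * dm + C_W * da
```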

    5.2.2. Ensemble Tracking

This step estimates object locations accurately at the ensemble level and offers the potential to amend possible errors in initialization and tracking, as discussed earlier. Such errors are not notable over a short time ($N_E$ frames, for simplicity), but are magnified vastly as time passes. In the former situation, object states updated by their motions are adequate. In the latter situation, we refer to the update step of a sequential Bayesian estimation problem:

$$p(Z_t \mid O_{1:t}) \propto L(O_t \mid Z_t) \, p(Z_t \mid O_{1:t-1}) \qquad (15)$$

in which $p(Z_t \mid O_{1:t-1})$ is the prediction step

$$p(Z_t \mid O_{1:t-1}) = \int D(Z_t \mid Z_{t-1}) \, p(Z_{t-1} \mid O_{1:t-1}) \, dZ_{t-1} \qquad (16)$$

where $L(O_t \mid Z_t)$ is the observation likelihood and $D(Z_t \mid Z_{t-1})$ is the dynamic model of the system, modeled as a first-order Gaussian by considering object motions.

To approximate the filtering distribution, the Particle Filter (PF) approach [23] uses a set of weighted particles. Its direct extension to multiple object tracking models objects as unrelated. However, this may cause ID switches when tracking adjacent objects, because it is ambiguous which object an observation should be assigned to. In contrast, we do not distinguish particles generated from different objects. Fig. 7 compares the two strategies. Formally, we extend [23] by

$$p(Z_t \mid O_{1:t}) \approx \sum_{n=1}^{N_p} \pi_{t,k}^n \, \delta_{z_t^n}(z_t) \qquad (17)$$

in which $N_p$ is the total number of particles and $\delta_z(\cdot)$ denotes the Dirac delta function at position $z$. The $n$th particle is denoted as $p_n = (x_t^n, s_t^n, H_t^n, \{\pi_{t,k}^n\}_{k=1}^K)$, where $x_t^n = (x_t^n, y_t^n)$ is the location, $s_t^n$ is the scale, $H_t^n$ is the appearance model, and $\pi_{t,k}^n$ is the weight for $z_{t,k}$. Motivated by the successes of [7,16], we define

$$\pi_{t,k}^n = \begin{cases} \lambda \, \pi_{t,k}^{n,D} + (1 - \lambda) \, \pi_{t,k}^{n,G}, & \left\| x_t^n - x_{z_{t,k}} \right\|^2 < \tau \\ 0, & \text{otherwise} \end{cases} \qquad (18)$$

where $\pi_{t,k}^{n,D}$ is a discriminative weight modeled using the detector confidence and $\pi_{t,k}^{n,G}$ is an appearance weight measured from an online learned appearance model. $\lambda$ is a parameter ($\lambda = 0.5$ here) and $\tau$ is a distance threshold. The appearance models for particles or objects come from pixels in foreground blocks. We utilize the HSV color space, and the number of bins for each channel is set to 16.
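A sketch of the particle weight of Eq. (18); `lam` and `tau` correspond to the λ = 0.5 and distance-threshold parameters in the text, while the Bhattacharyya similarity over 16-bin HSV histograms is our choice of appearance comparison, not necessarily the authors':

```python
import numpy as np

def hsv_hist(pixels_hsv, bins=16):
    """Concatenated per-channel histogram (16 bins each) over the pixels
    of an object's foreground blocks, L1-normalized."""
    hs = [np.histogram(pixels_hsv[:, c], bins=bins, range=(0, 256))[0]
          for c in range(3)]
    h = np.concatenate(hs).astype(np.float64)
    return h / max(h.sum(), 1)

def particle_weight(det_conf, part_hist, obj_hist, part_xy, obj_xy,
                    lam=0.5, tau=5.0):
    """pi_{t,k}^n (Eq. (18)): zero if the particle's squared distance to
    object k exceeds tau, else a mix of detector confidence and appearance."""
    if np.sum((np.asarray(part_xy) - np.asarray(obj_xy)) ** 2) >= tau:
        return 0.0
    appearance = np.sum(np.sqrt(part_hist * obj_hist))  # Bhattacharyya
    return lam * det_conf + (1 - lam) * appearance
```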

Objects may sometimes get lost during tracking. If an object cannot gather enough support from particles (those with $\pi_{t,k}^n$ above a small threshold), it is declared lost and buffered for possible matching against newly detected objects. We perform object detection (SAD) every $N_F$ frames to find new objects. If a lost object cannot be matched within $T_W$ frames, it is discarded.

    5.2.3. Ensemble to Block Assignment

This step achieves the final result from the intermediate assignment $\tilde{V}_t$ and the estimated object state $Z_t$. This is a multi-label problem, which can easily be converted into 2-label problems by adding objects one by one and then solved by graph cut algorithms. Suppose the object map $V_t$ has been obtained after adding objects $z_{t,1}, \ldots, z_{t,k-1}$, and object $z_{t,k}$ is to be added. The target is then to minimize the following energy function at each step:

$$E_k = \sum_{i=1}^{N} \phi_i(b_{t,i}, z_{t,k}) + \sum_{b_{t,j} \in N_{b_{t,i}}} \phi_{i,j}(b_{t,i}, b_{t,j}). \qquad (19)$$

Fig. 7. Comparisons of sampling strategies. (a) shows a scene with six persons. (b) PF [23] models objects as unrelated. (c) In our strategy, particles from different objects are not distinct, but those far away from the concerned object are ignored (e.g., only particles from objects D, C and E contribute to D).


The unary term $\phi_i$ encodes the data likelihood, which imposes penalties for assigning block $b_{t,i}$ to object $z_{t,k}$. We consider the shape model and the prior knowledge in

$$\phi_i(b_{t,i}, z_{t,k}) = \kappa(x_{t,i}, x_{z_{t,k}}) \, \gamma^{-n_i} \left( 1 - \delta(\tilde{v}_{t,i}, k) \right) \qquad (20)$$

where $\kappa(\cdot,\cdot)$ is a kernel function defined as $\kappa(x_{t,i}, x_{z_{t,k}}) = DL_{t,i,k}$ and $\gamma$ is an occlusion factor. Let $n_i$ be the number of objects that occlude $z_{t,k}$ in block $b_{t,i}$, where an object is occluded by others if they overlap and its $y$-axis value is larger. Intuitively, the larger $n_i$, the lower $\phi_i$. We have $\gamma \geq 1$ (set to 1.25 in our experiments).

The pairwise term $\phi_{i,j}$ encourages spatial coherence and imposes penalties when $b_{t,i}$ and $b_{t,j}$ are assigned different labels. As a sub-modular energy function can be solved by graph cut algorithms, we adopt the Potts model for simplicity:

$$\phi_{i,j} = \begin{cases} \mu \exp\left( -\dfrac{\chi^2(A_{t,i}, A_{t,j})}{\sigma_A} \right) + (1 - \mu) \exp\left( -\dfrac{\left\| r_{t,i} - r_{t,j} \right\|^2}{\sigma_r} \right), & v_{t,i} \neq v_{t,j} \\ 0, & \text{otherwise} \end{cases} \qquad (21)$$

where $A_{t,l}$ and $r_{t,l}$ are the appearance and motion of $b_{t,l}$ ($l = i, j$), $\mu$ is a parameter ($\mu = 0.5$ here), and $\sigma_A$ and $\sigma_r$ are normalization factors. Here the appearances of blocks are modeled as 4-bin histograms on gray images. $\sigma_A$ is set to the number of pixels in a block (64 here). Supposing the maximum motion of a block is the block size (8×8), we set $\sigma_r = 8^2 + 8^2 = 128$.
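A sketch of the pairwise Potts term of Eq. (21). The constants follow the values in the text; the chi-square histogram distance is our assumption for the appearance comparison. The paper minimizes the resulting energy with graph cuts (a library such as PyMaxflow could be used), which we do not reproduce here:

```python
import numpy as np

MU, SIGMA_A, SIGMA_R = 0.5, 64.0, 128.0  # parameters from the text

def chi2(hist_a, hist_b, eps=1e-9):
    """Chi-square distance between two (4-bin, gray-level) block histograms."""
    return 0.5 * np.sum((hist_a - hist_b) ** 2 / (hist_a + hist_b + eps))

def pairwise(label_i, label_j, hist_i, hist_j, motion_i, motion_j):
    """phi_{i,j} (Eq. (21)): zero when neighboring blocks share a label,
    otherwise a Potts penalty mixing appearance and motion similarity."""
    if label_i == label_j:
        return 0.0
    app = np.exp(-chi2(hist_i, hist_j) / SIGMA_A)
    mot = np.exp(-np.sum((np.asarray(motion_i) - np.asarray(motion_j)) ** 2)
                 / SIGMA_R)
    return MU * app + (1 - MU) * mot
```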

After achieving the final assignment, we update the appearance models of objects. Intuitively, if an object is occluded by others, meaning that some of its overlapped foreground blocks are not assigned to it, the update ratio should be small: the more occlusion, the smaller the update ratio. Based on this, we define the update ratio as $\rho = 0.5 N_k / N_a$, where $N_k$ is the number of blocks assigned to $z_{t,k}$ and $N_a$ is the total number of blocks overlapped by $z_{t,k}$. Given the previous and current appearance models $A_p$ and $A_c$ for $z_{t,k}$, the update is $A = (1 - \rho) A_p + \rho A_c$.
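The occlusion-aware update reduces the learning rate when an object's blocks are taken by occluders; a minimal sketch, with names of our choosing:

```python
def update_appearance(A_prev, A_cur, n_assigned, n_overlapped):
    """A = (1 - rho) * A_prev + rho * A_cur with rho = 0.5 * N_k / N_a:
    the more occluded the object (fewer assigned blocks), the smaller
    the update ratio rho."""
    rho = 0.5 * n_assigned / max(n_overlapped, 1)
    return (1.0 - rho) * A_prev + rho * A_cur
```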

We have now described the three key components of the BAT. For easy reference, the entire procedure of the BAT is summarized in Fig. 8.

    6. Experiments

In this section, we carry out extensive experiments to evaluate our proposed detection and tracking system. We first describe the training and testing datasets, then list some detection and tracking metrics for evaluation, then evaluate the performance of our system, and finally make some discussions.

    6.1. Datasets

We have labeled 2470 positive masks of size 24×58, as shown in Fig. 4(b), for training the foreground pruning detector. We have also collected 18,474 whole body positive samples of size 24×58 for learning object detectors, as shown in Fig. 9. The positive masks and samples of the other parts can be generated from those of the whole body using the definitions in Fig. 5.

We use a large variety of challenging test datasets with different situations of occlusions and viewpoints for evaluation, as summarized in Fig. 10. The occlusions and viewpoint changes in these real world datasets make them valuable for evaluating detection and tracking systems. As the viewpoint in CAVIAR1 is frontal, learned detectors can be applied directly.

Fig. 8. The algorithm of our system.


But since the viewpoints in CAVIAR2, PETS2007 and our dataset are tilted, we utilize camera models to cope with them. In our experiments, we aim at improving both detection and tracking performance with off-line discriminative models. Therefore, the test datasets are totally independent from the training set, and we apply the generally trained detectors to all test sequences without retraining them for any specific scene.

    6.2. Metrics

We use False Positives Per Image (FPPI) for detection evaluation. When the intersection between a detection response and a ground-truth box is larger than 50% of their union, we consider it a successful detection. Only one detection per annotation is counted as correct.
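This criterion is the standard intersection-over-union test; a minimal sketch for axis-aligned boxes (x0, y0, x1, y1), with names of our choosing:

```python
def is_match(det, gt, thr=0.5):
    """A detection matches a ground-truth box when the area of their
    intersection exceeds thr (50%) of the area of their union."""
    ix0, iy0 = max(det[0], gt[0]), max(det[1], gt[1])
    ix1, iy1 = min(det[2], gt[2]), min(det[3], gt[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(det) + area(gt) - inter
    return union > 0 and inter / union > thr
```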

For multi-object tracking, there is no single established protocol. We follow two existing sets of metrics. The metrics of [36] count the number of mostly tracked (MT), partially tracked (PT) and mostly lost (ML) trajectories, as well as the number of track fragmentations (FM) and identity switches (IDS). The CLEAR metrics [37] calculate the Multiple Object Tracking Accuracy (MOTA), which takes into account false positives, missed targets and identity switches, and the Multiple Object Tracking Precision (MOTP), which measures the precision with which objects are located using the intersection of the estimated region with the ground truth region.

    6.3. Performance evaluations

    6.3.1. Detection evaluations

In this subsection, we concentrate on evaluating the key components of our SAD: foreground aware pruning (FAP), the structural filter approach (ISF) and foreground aware merging (FAM). Since the number of available frames in the test datasets is quite large, we select only 200 representative frames from each test dataset for evaluation.

6.3.1.1. Efficiency of FAP. The aim of FAP is to efficiently prune negatives and false positives. Table 1 shows the pruned window proportions and saved times on these datasets with default detection parameters. In Table 1, we can see that about 79%–94.4% of windows are pruned, which yields plenty of time savings (0.29 s–4.6 s). Since there are lots of people in our dataset, its pruned proportion is less than those of the other datasets. Compared to CAVIAR1, the other three datasets need to rectify sub-images, and thus they cost much more time than CAVIAR1. However, as there are only a few (<4) persons in CAVIAR2, the time cost is not as large as on S02 and our dataset. This experiment sufficiently demonstrates the efficiency of FAP.

6.3.1.2. Efficiency of SAD. We choose two state-of-the-art works [12,38] for comparison with our SAD. ACF [38] has achieved good performance for pedestrian detection and is a strong competitor for frontal viewpoint detection. ACF is learned on the same training dataset as our ISF for a fair comparison. Since there are no publicly available detectors for multiple viewpoints of humans, we use the Deformable Part Model (DPM) [12] as a baseline, which is well known for detecting objects with large variations. The original DPM detector is provided by the authors and trained on Pascal VOC 2008. For a fair comparison, we also train a new DPM detector on the same training dataset as our ISF. To distinguish them, we denote them as DPM1 and DPM2 respectively. In the following, for concise descriptions, we let MAP denote the Bayesian method in [2] and NAIVE the simplest strategy of combining

Fig. 9. Positive samples for the whole body.

Fig. 10. Test datasets. The CAVIAR dataset can be downloaded from http://homepages.inf.ed.ac.uk/rbf/CAVIAR/. PETS2007 can be downloaded from http://www.cvg.rdg.ac.uk/PETS2007/. Humans in CAVIAR2 are too small, and therefore we double the original video size (384×288).

Table 1
Evaluations of foreground aware pruning. N_H is the average number of humans. PPW is the average proportion of pruned windows among all scanned windows. T is the detection time without foreground aware pruning and t is the time saved when using it.

          CAVIAR1   S02      Our dataset   CAVIAR2
N_H       6         9        11            4
PPW       94.4%     90%      79%           86%
t (ms)    700       4600     1200          290
T (ms)    1210      10,400   7560          650


detection results at nearby locations. Note that except for CAVIAR1, the other three test datasets need rectification with camera models. The methods using camera models are indicated by CAM. The ROC curves are shown in Fig. 11.

6.3.1.2.1. Improvements of FAP. Compared to ISF+NAIVE, ISF+NAIVE+FAP improves the detection rate by about 3% on CAVIAR1. Compared to ISF+NAIVE+CAM, ISF+NAIVE+CAM+FAP improves the performance by about 4% on S02, 4% on our dataset and 1% on CAVIAR2. Similar performance improvements are achieved by ISF+MAP+CAM+FAP. From the experiments in Fig. 11 and Table 1, we can see that FAP not only works well for pruning but also improves the detection performance.

6.3.1.2.2. Improvements of ISF and scene models. ISF+MAP performs better than or comparably to ACF+MAP on CAVIAR1, S02 and our dataset, demonstrating that ISF can detect occluded humans in scenes without large viewpoint changes. ISF (ISF+MAP and ISF+NAIVE) is also better than DPM (DPM1 and DPM2), for which there might be two main reasons: 1) the ability of the deformable part based model is limited on strongly labeled samples like our training dataset, and 2) the weak feature in DPM uses only gradient information, while the weak feature in ISF combines both color and gradient information, which is more discriminative for pedestrian detection. Note that, as DPM2 is more focused on pedestrians than DPM1, it performs better than DPM1 on S02, and comparably to DPM1 on CAVIAR1 and our dataset. But all these detectors fail on CAVIAR2 because of large viewpoint changes, which are better handled by camera models. Compared with ISF+NAIVE, ISF+NAIVE+CAM improves the performance by about 3% on S02 and 26% on our dataset, and it works well on CAVIAR2. Similar improvements are achieved by ISF+MAP+CAM. As the viewpoint of CAVIAR1 is frontal, the linear mapping from 2D coordinates to human height is used there. In the experiments, we find that the linear mapping does not reduce the detection performance, while it speeds up detection by about 0.6 s on average compared to using ISF alone.

6.3.1.2.3. Improvements of FAM. We replace the post-processing

method with FAM to show further performance improvements. Compared to ISF+MAP+FAP, our approach (ISF+FAM+FAP) improves the detection rate by about 11% on CAVIAR1. Compared to ISF+MAP+FAP+CAM, our approach (ISF+FAM+FAP+CAM) improves the detection rate by about 16% on our dataset and 14% on S02. As MAP adds objects in y-descending order, which does not hold in scenes with large viewpoint changes, it does not work well on CAVIAR2 and is sometimes even worse than NAIVE. In contrast, our approach still works well in such scenes and achieves a 52% detection rate at FPPI = 0.1 on CAVIAR2. We also observe another interesting phenomenon: the curves of our approach are much steeper than the others. This indicates that we can detect more objects with fewer false samples, mainly because of the pruned false positive samples and the scene models used. We zoom in on the curves of Fig. 11(b) and (c) to illustrate more details in Fig. 11(e) and (f) respectively.

6.3.1.2.4. Summary. These experiments have shown the effectiveness of the key components (FAP, ISF and FAM) of our SAD in occluded and viewpoint-changed scenes. As a whole, our SAD therefore outperforms many state-of-the-art detection algorithms such as [12,38]. But the speed is not yet satisfactory: detection costs on average about 0.51 s, 5.8 s, 6.36 s and 0.36 s on CAVIAR1, our dataset, S02 and CAVIAR2 respectively. Because of changed viewpoints and heavy occlusions, it costs much more time on our dataset and S02 than on CAVIAR1 and CAVIAR2. For further speedup and performance improvements, we recommend our proposed BAT, which is evaluated in the next subsection.

    6.3.2. Tracking evaluations

In this section, we report the tracking performance of our BAT on all test datasets, based on the SAD results and without retraining detectors for specific scenes. For concise descriptions, we denote our BAT with and without camera models as BAT+3D and BAT+2D respectively.

6.3.2.1. Algorithms for comparisons. We compare our approach with some state-of-the-art tracking algorithms [26,24,15,7,36,27]. We utilize the implementation of [27] (http://www.ics.uci.edu/~dramanan/) to carry out experiments


Fig. 11. Evaluation of our SAD compared to DPM [12] and ACF [38]. (a), (b), (c) and (d) compare our approach with several state-of-the-art works on CAVIAR1, S02, our dataset and CAVIAR2 separately. (e) and (f) zoom in on our approach on S02 and our dataset respectively to illustrate more details.



ourselves. In this implementation, the authors do not use appearance after detecting objects; therefore it produces relatively more fragments and ID switches, as well as missed detections. We improve its performance by (1) utilizing background modeling to remove false positive samples, (2) building appearance models for detected objects to associate them, and (3) adjusting some parameters to achieve better tracking results. After these improvements, it can track more humans, but there are still too many fragments and ID switches. Thus, we only use it for comparisons on the following metrics: MT, PT, ML and MOTP.

Besides these state-of-the-art algorithms, we also use two simplified versions of our BAT as baselines to demonstrate the improvement of combining both block and ensemble information. One baseline uses only Ensemble Tracking, shortened as BAT(ET). BAT(ET)+2D, where camera models are not used in detection, is similar to [7]. BAT(ET)+3D, where camera models are used in detection, is a better way to show the improvement of BAT+3D over ensemble information alone. The other baseline uses only Block Tracking, shortened as BAT(BT). Objects can be well initialized in CAVIAR2 because of the few occlusions, but not in CAVIAR1, our dataset and S02 because of severe and frequent occlusions. Therefore, BAT(BT)+3D is a fair comparison with BAT+3D on CAVIAR2, where camera models are used in detection.

6.3.2.2. Quantitative results. The obtained results are shown in Figs. 12 and 13.

6.3.2.2.1. CAVIAR1. We compare our BAT with [26,24,15,7,36] in Fig. 12. Among them, our method achieves the highest MT. Our FM and IDS are a little higher than [36], mainly because we handle sequences online while [36] used all detection results to obtain a global optimization. The MOTA and MOTP of our approach are both better than those of [7], showing the efficiency of combining block and ensemble information. In general, CAVIAR1 is relatively easy for many tracking systems; the remaining test datasets are more challenging.

6.3.2.2.2. Our dataset and S02. We compare our approach with [7,27] on these two datasets in Fig. 13 (top) and (middle). As described in Section 6.3.1, many state-of-the-art detection algorithms do not perform as well as our detection approach in scenes with heavy occlusions and slightly changed viewpoints. The detection processes in [7,27] lose many humans on our dataset and S02, which reduces their tracking performance.

Fig. 12. Quantitative results on CAVIAR1. *The fragment and IDS numbers for [26,7] are obtained with looser evaluation metrics.

Fig. 13. Quantitative results of our method on our dataset, S02 and CAVIAR2.


Meanwhile, our SAD can detect many humans, and our BAT generally performs much better and more stably than [7,27]. BAT(ET)+3D can track more objects than [7,27], but it produces many fragments and ID switches. Compared to BAT(ET)+3D, BAT+3D achieves better performance: higher MT/PT/MOTP/MOTA and lower FM and IDS. This improvement shows that combining block and ensemble information is superior to using ensemble information alone for tracking. Compared to BAT+2D, the improvement of BAT+3D lies mainly in the use of camera models under slightly changed viewpoints. Furthermore, BAT+3D is always better than BAT+2D in MT/PT/ML, but not in the other metrics. Part of the reason is that the ground truths are labeled as rectangles, while the humans tracked by BAT+3D are quadrangles. However, because our dataset is much more crowded than S02, there are still many partially tracked objects.

6.3.2.2.3. CAVIAR2. Because of the extremely large viewpoint changes, methods that do not use camera models (such as [7,27] and BAT+2D) fail totally on this dataset. As far as we know, there are no publicly available implementations that deal with tracking multiple humans in such scenes.


Fig. 14. Tracking results. (a) and (b) compare [7] (top) and our approach (bottom) on OneStopMoveEnter1cor of CAVIAR1 and on S02. (c) and (d) illustrate sample results on our dataset and on Meet_crowd of CAVIAR2 separately. The layouts of (a) and (b) are already shown in Fig. 3(c). The layouts of (c) and (d) are illustrated at the far left. More descriptions can be found in Section 6.3.2.


Thus, we compare our BAT+3D with BAT(BT)+3D and BAT(ET)+3D in Fig. 13 (bottom). Compared to BAT(ET)+3D, BAT(BT)+3D achieves higher MT, MOTA and MOTP, but more IDS and FM. Our BAT+3D integrates both of their advantages and achieves better performance: it improves MT by 13.8% and MOTA by 7.2%, and reduces PT by 18.2%, compared to the second best.

6.3.2.2.4. Summary. As described earlier, the applicability of Block Tracking alone is limited because it requires good initialization, which is difficult to achieve in occluded scenes. Comparing BAT+2D with [7] on CAVIAR1, and BAT+3D with BAT(ET)+3D on the other three datasets (our dataset, S02 and CAVIAR2), we can conclude that combining block and ensemble information improves tracking performance. From the experiments in Figs. 12 and 13, we can see that our proposed detection and tracking system works robustly in scenes with heavy occlusions and viewpoint changes.

    6.3.2.3. Sample results. Fig. 14 shows some tracking results of our tracking algorithm, where the green and red arrows mark ID switches, the purple dotted ellipses mark missed or lost targets, and the blue arrow marks false alarms. Panels (a) and (b) illustrate scenes with targets walking against a crowd; we compare [7] (top) with our approach (bottom). Our method consistently tracks these objects, while [7] suffers several ID switches, lost targets and false alarms. Panel (c) features a subway scene with many walking people, where the occlusions are very severe and the viewpoint is slightly changed; our tracker succeeds in tracking many of them. Panel (d) shows a scene with several walking people where the viewpoint is extremely changed; our tracker tracks them successfully over the whole sequence.

    Fig. 14. Tracking results. (a) and (b) compare [7] (top) and our approach (bottom) on OneStopMoveEnter1cor of CAVIAR1 and on S02. (c) and (d) illustrate sample results on our dataset and on Meet_crowd of CAVIAR2, respectively. The layouts of (a) and (b) are already shown in Fig. 3(c); the layouts of (c) and (d) are illustrated at the far left. More descriptions can be found in Section 6.3.2.

    6.4. Discussions

    6.4.1. Parameters

    There are some parameters in the SAD and BAT, listed in Fig. 15 with corresponding descriptions and default values. The effects of the key parameters on our framework are as follows. For SAD, the parameters TM, Tadd and Tdel directly affect the post-processing of detection: a smaller TM raises the probability of adding or deleting a detected response at each step, a smaller Tadd adds more objects, and a larger Tdel deletes more objects. For BAT, NO and the temporal consistency parameter are key. A larger NO (i.e. more particles) can improve the performance but costs more time. A larger temporal consistency parameter enforces more consistency across the video, which improves tracking when detection is not so accurate, especially on CAVIAR2 because of its large viewpoint changes. We therefore keep most parameters at their defaults, which are relatively robust across the experiments, except that we set NO = 300 and the temporal consistency parameter to 5 for CAVIAR2.

    Fig. 15. Default parameters.
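    To make the add/delete logic concrete, the following C++ sketch mimics the thresholding described above. The parameter names TM, Tadd and Tdel follow Fig. 15, but the surrounding structure, the default values and the assumption that matching (against TM) happens upstream are all hypothetical illustrations, not the actual implementation.

    #include <vector>

    // Illustrative sketch only: names follow Fig. 15, values are placeholders.
    struct Response { double score = 0.0; bool matched = false; };

    struct SADPostProcess {
        double TM   = 0.5;  // match threshold between detections and objects;
                            // a smaller TM lets add/delete decisions fire more
                            // often (matching itself is assumed to happen upstream)
        double Tadd = 0.6;  // an unmatched detection scoring above Tadd spawns
                            // a new object (smaller Tadd => more objects added)
        double Tdel = 0.3;  // an unmatched object scoring below Tdel is dropped
                            // (larger Tdel => more objects deleted)

        // One post-processing step over the current frame's detections.
        void step(const std::vector<Response>& detections,
                  std::vector<Response>& objects) const {
            // Add: promote confident unmatched detections to new objects.
            for (const auto& d : detections)
                if (!d.matched && d.score > Tadd) objects.push_back(d);
            // Delete: drop unmatched objects whose support has decayed.
            for (auto it = objects.begin(); it != objects.end(); )
                if (!it->matched && it->score < Tdel) it = objects.erase(it);
                else ++it;
        }
    };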

    6.4.2. Processing speeds

    The entire system is implemented in a single thread in C++, without special code optimization and without GPU processing. On a workstation with an Intel Core(TM)2 2.33 GHz CPU and 2 GB of memory, we achieve processing speeds of 2.7–15 fps (depending on the video size, the number of objects and the viewpoint change), as shown in Fig. 16, compared with detection alone. The current bottleneck is the detection stage. As not all speedup possibilities have been explored yet, the current run-time raises hope that online experiments in real-world applications are not too far away.

    Fig. 16. Speed comparisons of detection and tracking (ms).
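    Per-frame numbers of this kind can be gathered by timing the pipeline directly; the sketch below is a minimal illustration, where processFrame and the sequence length are hypothetical stand-ins for the actual detection and tracking pass.

    #include <chrono>
    #include <cstdio>

    // Hypothetical stand-in for one detection + tracking pass over a frame.
    static void processFrame() { /* detection + tracking work */ }

    int main() {
        using clock = std::chrono::steady_clock;
        const int numFrames = 500;            // illustrative sequence length
        const auto start = clock::now();
        for (int i = 0; i < numFrames; ++i) processFrame();
        const double totalMs =
            std::chrono::duration<double, std::milli>(clock::now() - start).count();
        std::printf("avg %.1f ms/frame => %.1f fps\n",
                    totalMs / numFrames, 1000.0 * numFrames / totalMs);
        return 0;
    }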

    6.4.3. Failure cases

    Objects are initialized by detection in our system, so detection failures (e.g. missed detections and false alarms) cannot be avoided in the tracking process. If an initialized object is not accurate, such as object 8 in Frame 20 of Fig. 14(c), it drifts easily and tends to be lost. In particular, camera models have a large impact on detection in viewpoint-changing scenes: badly estimated camera parameters lead to unexpected detection results. Besides, our system cannot handle the near-vertical viewpoint where the camera is directly above the objects, since it is impossible to recover the objects' frontal viewpoint in this situation, as pointed out in [10].

    7. Conclusion

    In this paper, we propose a robust system for multi-object detection and tracking in surveillance scenes with occlusions and viewpoint changes. Our SAD achieves robust detection through: (1) camera models to cope with viewpoint changes; (2) a structural filter approach to handle occlusions; and (3) foreground-aware pruning



    and foreground-aware merging with the aid of some scene models.

    Our BAT, which formulates tracking as a block assignment process, can track objects robustly even when all the part detectors fail, as long as the object has assigned blocks. Its key factors are: (1) Block Tracking, to maintain the spatial and temporal consistency of labels; (2) Ensemble Tracking, to precisely estimate the locations and sizes of objects; and (3) Ensemble-to-Block Assignment, to keep the blocks with the same label looking like a part of a human.
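    Read procedurally, these three factors amount to the following per-frame outline; every type and function name below is an illustrative paraphrase of the steps above, not the authors' code.

    #include <vector>

    // Hypothetical per-frame outline of BAT; all names are illustrative.
    struct Frame {};
    struct Object {};
    struct BlockMap {};

    void trackBlocks(const Frame&, BlockMap&) {}             // factor (1)
    void ensembleTrack(const Frame&, Object&) {}             // factor (2)
    void assignEnsembleToBlocks(const std::vector<Object>&,
                                BlockMap&) {}                // factor (3)

    void batFrame(const Frame& frame,
                  std::vector<Object>& objects, BlockMap& blocks) {
        // (1) Block Tracking: keep block labels spatially/temporally consistent.
        trackBlocks(frame, blocks);
        // (2) Ensemble Tracking: estimate each object's location and size.
        for (auto& obj : objects) ensembleTrack(frame, obj);
        // (3) Ensemble-to-Block Assignment: re-label blocks so that blocks
        //     sharing a label still look like a part of a human.
        assignEnsembleToBlocks(objects, blocks);
    }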

    Although our method tracks remarkably well even through occlusions and viewpoint changes, one unavoidable drawback is fuzzy object boundaries. To overcome this, we could learn and extract discriminative patches to represent and track objects. Another drawback is that the tracking results jitter, which could be remedied by estimating object trajectories. For detection improvement, online algorithms could make the offline, general detectors adapt to a fixed scene. Although the current system only considers humans, the proposed mechanism can easily be extended to other kinds of objects. Based on the detection and tracking results, some high-level analysis of object behaviors becomes possible. Furthermore, we hope to make our approach applicable to real-world needs.

    Acknowledgements

    This work is supported in part by the National Science Foundation of China under grant No. 61075026 and the National Basic Research Program of China under grant No. 2011CB302203. Mr. Shihong Lao is partially supported by the R&D Program for Implementation of Anti-Crime and Anti-Terrorism Technologies for a Safe and Secure Society, Special Coordination Fund for Promoting Science and Technology of MEXT, the Japanese Government.

    Appendix A. Supplementary data

    Supplementary data to this article can be found online at doi:10.1016/j.imavis.2012.02.008.

    References

    [1] P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., Kauai, HI, USA, 2001, pp. I-511–I-518.
    [2] B. Wu, R. Nevatia, Detection of multiple, partially occluded humans in a single image by Bayesian combination of edgelet part detectors, in: Proc. IEEE Int. Conf. Comput. Vis., Beijing, China, 2005, pp. 90–97.
    [3] C. Huang, R. Nevatia, High performance object detection by collaborative learning of joint ranking of granules features, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., San Francisco, California, USA, 2010, pp. 41–48.
    [4] G. Duan, H. Ai, S. Lao, A structural filter approach to human detection, in: Proc. Eur. Conf. Comput. Vis., Crete, Greece, 2010, pp. 238–251.
    [5] S. Kamijo, Y. Matsushita, K. Ikeuchi, M. Sakauchi, Traffic monitoring and accident detection at intersections, IEEE Trans. Intell. Transp. Syst. 1 (2000) 108–118.
    [6] T. Zhao, R. Nevatia, Tracking multiple humans in complex situations, IEEE Trans. Pattern Anal. Mach. Intell. 26 (2004) 1208–1221.
    [7] J. Xing, H. Ai, S. Lao, Multi-object tracking through occlusions by local tracklets filtering and global tracklets association with detection responses, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., Miami, FL, USA, 2009, pp. 1200–1207.
    [8] M. Andriluka, S. Roth, B. Schiele, People-tracking-by-detection and people-detection-by-tracking, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., Anchorage, Alaska, USA, 2008, pp. 1–8.
    [9] X. Wang, T.X. Han, S. Yan, An HOG-LBP human detector with partial occlusion handling, in: Proc. IEEE Int. Conf. Comput. Vis., Kyoto, Japan, 2009, pp. 32–39.
    [10] Y. Li, B. Wu, R. Nevatia, Human detection by searching in 3D space using camera and scene knowledge, in: Proc. IEEE Int. Conf. Image Process., Tampa, Florida, USA, 2008, pp. 1–5.
    [11] G. Duan, H. Ai, S. Lao, Human detection in video over large viewpoint changes, in: Proc. IEEE Asi. Conf. Comput. Vis., Queenstown, New Zealand, 2010, pp. 683–696.
    [12] P. Felzenszwalb, D. McAllester, D. Ramanan, A discriminatively trained, multiscale, deformable part model, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., Anchorage, Alaska, USA, 2008, pp. 1–8.
    [13] C. Beleznai, H. Bischof, Fast human detection in crowded scenes by contour integration and local shape estimation, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., Miami, FL, USA, 2009, pp. 2246–2253.
    [14] D. Hoiem, A.A. Efros, M. Hebert, Putting objects in perspective, Int. J. Comput. Vis. 80 (2008) 3–15.
    [15] C. Huang, B. Wu, R. Nevatia, Robust object tracking by hierarchical association of detection responses, in: Proc. Eur. Conf. Comput. Vis., Marseille, France, 2008, pp. 788–801.
    [16] A. Senior, Tracking with probabilistic appearance models, in: ECCV Workshop on Performance Evaluation of Tracking and Surveillance Systems, Copenhagen, Denmark, 2002, pp. 48–55.
    [17] D. Comaniciu, V. Ramesh, P. Meer, Kernel-based object tracking, IEEE Trans. Pattern Anal. Mach. Intell. 25 (2003) 564–577.
    [18] P. Fieguth, D. Terzopoulos, Color based tracking of heads and other mobile objects at video frame rates, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., San Juan, Puerto Rico, 1997, pp. 21–27.
    [19] M. Isard, A. Blake, Contour tracking by stochastic propagation of conditional density, in: Proc. Eur. Conf. Comput. Vis., Cambridge, UK, 1996, pp. 343–356.
    [20] J.C. Clarke, A. Zisserman, Detection and tracking of independent motion, Image Vis. Comput. 14 (1996) 565–572.
    [21] M.D. Rodriguez, M. Shah, Detecting and segmenting humans in crowded scenes, in: Proc. IEEE Int. Conf. Multimed., Augsburg, Germany, 2007, pp. 353–356.
    [22] P. Kelly, N.E. O'Connor, A.F. Smeaton, Robust pedestrian detection and tracking in crowded scenes, Image Vis. Comput. 27 (2009) 1445–1458.
    [23] M. Isard, A. Blake, Condensation-conditional density propagation for visual tracking, Int. J. Comput. Vis. 28 (1998) 5–28.
    [24] B. Wu, R. Nevatia, Detection and tracking of multiple, partially occluded humans by Bayesian combination of edgelet based part detectors, Int. J. Comput. Vis. 75 (2007) 247–266.
    [25] H. Jiang, S. Fels, J.J. Little, A linear programming approach for multiple object tracking, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., Minneapolis, MN, USA, 2007, pp. 1–8.
    [26] L. Zhang, Y. Li, R. Nevatia, Global data association for multi-object tracking using network flows, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., Anchorage, Alaska, USA, 2008, pp. 1–8.
    [27] H. Pirsiavash, D. Ramanan, C.C. Fowlkes, Globally-optimal greedy algorithms for tracking a variable number of objects, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., Colorado Springs, CO, USA, 2011, pp. 1201–1208.
    [28] S. Avidan, Ensemble tracking, IEEE Trans. Pattern Anal. Mach. Intell. 29 (2007) 261–271.
    [29] H. Grabner, M. Grabner, H. Bischof, Real-time tracking via on-line boosting, in: British Machine Vision Conference, Edinburgh, UK, 2006.
    [30] H. Grabner, C. Leistner, H. Bischof, Semi-supervised on-line boosting for robust tracking, in: Proc. Eur. Conf. Comput. Vis., Marseille, France, 2008, pp. 234–247.
    [31] B. Babenko, M.-H. Yang, S. Belongie, Visual tracking with online multiple instance learning, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., Miami, FL, USA, 2009, pp. 983–990.
    [32] J. Xing, L. Liu, H. Ai, Background subtraction through multiple life span modeling, in: Proc. IEEE Int. Conf. Image Process., Brussels, Belgium, 2011.
    [33] B. Wu, R. Nevatia, Y. Li, Segmentation of multiple, partially occluded objects by grouping, merging, assigning part detection responses, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., Anchorage, Alaska, USA, 2008, pp. 1–8.
    [34] G. Duan, C. Huang, H. Ai, S. Lao, Boosting associated pairing comparison features for pedestrian detection, in: Proc. IEEE Workshop Visual Surveillance, Kyoto, Japan, 2009, pp. 1097–1104.
    [35] S. Geman, D. Geman, Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images, IEEE Trans. Pattern Anal. Mach. Intell. 6 (1984) 721–741.
    [36] Y. Li, C. Huang, R. Nevatia, Learning to associate: hybrid boosted multi-target tracker for crowded scene, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., Miami, FL, USA, 2009, pp. 2953–2960.
    [37] K. Bernardin, R. Stiefelhagen, Evaluating multiple object tracking performance: the CLEAR MOT metrics, J. Image Video Process. 2008 (2008).
    [38] W. Gao, H. Ai, S. Lao, Adaptive contour features in oriented granular space for human detection and segmentation, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., Miami, FL, USA, 2009, pp. 1786–1793.

