Scene Aware Detection and Block Assignment Tracking in Crowded Scenes

Genquan Duan a,⁎, Haizhou Ai a, Junliang Xing a, Song Cao b, Shihong Lao c

a Computer Science and Technology Department, Tsinghua University, Beijing, China
b Electronic Engineering Department, Tsinghua University, Beijing, China
c Development Center, OMRON Social Solutions Co., LTD, Kyoto, Japan

Article info

    Article history:

    Received 18 July 2011

    Received in revised form 7 February 2012

    Accepted 10 February 2012

    Keywords:

    Visual surveillance

    Object detection

    Object tracking

Particle filter

Abstract

How far can human detection and tracking go in real world crowded scenes? Many algorithms often fail in such scenes due to frequent and severe occlusions as well as viewpoint changes. In order to handle these difficulties, we propose Scene Aware Detection (SAD) and Block Assignment Tracking (BAT), which incorporate several available scene models (e.g. background, layout, ground plane and camera models). The SAD is proposed for accurate detection through utilizing 1) a camera model to deal with viewpoint changes by rectifying sub-images, 2) a structural filter approach to handle occlusions based on a feature sharing mechanism in which a three-level hierarchical structure is built for humans, and 3) foregrounds for pruning negative and false positive samples and merging intermediate detection results. Many detection or appearance based tracking systems are prone to errors in occluded scenes because of detector failures and interactions of multiple objects. In contrast, the BAT formulates tracking as a block assignment process, where blocks with the same label form the appearance of one object. In the BAT, we model objects on two levels: the ensemble level measures how much a region is like an object by discriminative models, and the block level measures how much it is like a target object by appearance and motion models. The main advantage of the BAT is that it can track an object even when all the part detectors fail, as long as the object has assigned blocks. Extensive experiments in many challenging real world scenes demonstrate the efficiency and effectiveness of our approach.

© 2012 Elsevier B.V. All rights reserved.

    1. Introduction

Human detection and tracking are classic problems in computer vision, with applications in visual surveillance, driver-aided systems and traffic management, and have achieved significant progress recently. Many existing detection and tracking methods, however, encounter great challenges from radial distortions, illumination variations, viewpoint changes and occlusions, all of which are quite common in real world scenes.

The goal of our work is to cope with these difficulties to detect and track multiple humans in surveillance scenes using a single stationary camera. Many detection and tracking systems developed so far assume that the viewpoint is frontal, that a person enters the scene without occlusions, that a person appears or disappears only in some special locations, that a person will remain in the scene for a given number of frames, or that the human flow is gentle. In this paper, we present a robust detection and tracking system attempting to minimize such constraining assumptions, which is able to handle the following difficulties: 1) occlusion, when multiple persons crowdedly enter and move in the scene; 2) relatively unconstrained camera viewpoints, rotations and heights; 3) relatively unconstrained human motions, appearances and positions with respect to the camera; 4) humans appearing for only a small number of frames; and 5) relatively slowly moving humans. We only assume that humans stand on the ground plane in the scene, and ignore those below this ground plane or standing in other places such as rooftops, windows or the sky. This is a very reasonable assumption which is applicable in most surveillance scenes.

We innovate in both detection and tracking for scenes with occlusions and viewpoint changes. Our main contributions include the following two aspects.

A Scene Aware Detection for accurate detection. Specifically, it includes: (1) a simple but efficient learning algorithm that uses foregrounds to prune negative and false positive samples; (2) a structural filter approach to detect occluded humans in a feature sharing mechanism; and (3) a foreground aware merging strategy to explain foregrounds by detected results.

A Block Assignment Tracking for robust tracking, where tracking is formulated as a block assignment process and objects are modeled on different levels, i.e. the block level and the ensemble level. Blocks with the same label form the appearance of one object, from which robust appearance and motion models can be established. Its main advantage is that it can track an object even when all the part detectors fail, as long as the object has assigned blocks.



The rest of this paper is organized as follows. Related work is discussed in the next section. Our system is overviewed in Section 3. Scene Aware Detection is presented in Section 4. Block Assignment Tracking is described in Section 5. Experimental results on many challenging real world datasets are provided along with some discussions in Section 6. Conclusions and future work are given in Section 7.

    2. Related work

There is a great deal of work in the literature on object detection, such as faces [1] and pedestrians [2–4], and on multiple target tracking, such as vehicles [5] and humans [6–8]. Here we first review some robust detection methods that cope with occlusions and viewpoint changes, and then discuss some detection related and detection free tracking algorithms.

    2.1. Robust detection

    2.1.1. Occlusion handling

Using multiple part detectors, Wu et al. [2] proposed a Bayesian approach for combination, while Huang et al. [3] introduced a dynamic search. Wang et al. [9] proposed a global-part occlusion handling method, where an occlusion likelihood map is first produced from HOG feature responses and then segmented by a mean shift approach.

    2.1.2. Viewpoint change handling

Due to changes of viewpoint, human appearances and poses vary a lot. To address this difficulty, Li et al. [10] detected objects in rectified sub-images with a learned frontal viewpoint detector. Another approach is to learn one powerful detector for all possible viewpoints, as in [11,12]. Duan et al. [11] first clustered the complex multiple viewpoint samples into several sub-categories and then learned a classifier for each sub-category. Felzenszwalb et al. [12] proposed a more efficient model, the Deformable Part based Model, in which a root filter and several part models are learned for each object category, which can detect objects with some pose changes.

    2.1.3. Integration with other models

Beleznai et al. [13] used local shape descriptors to infer human locations in images of absolute background difference from a background model. Hoiem et al. [14] and Huang et al. [15] utilized a scene geometric model to restrict object locations and a ground plane model to restrict objects' heights at a particular location.

    2.2. Robust tracking

    2.2.1. Detection free tracking

Some techniques assume that objects enter the scene at some specific location [5], or appear in the scene without occlusions [5,16] for a period of time that allows object models to be built up while they are isolated. Some techniques (e.g. [5,6]) depend on accurate segmentation of moving foreground objects from a background color intensity model: Kamijo et al. [5] segmented foreground blocks into vehicles using spatial–temporal information, and Zhao et al. [6] developed a tracker based on a human shape model. All of them rely on the inherent assumption that there will be a significant difference in color intensity information between foreground and background. Unfortunately, background modeling suffers from many problems, such as inaccuracy, noise sensitivity, and weakness to shadows. Similar assumptions are made in [17–20], where the authors extracted features, e.g. intensity, colors, edges, contours and feature points, and used them to establish correspondences between model images and target images. Moreover, shape based approaches [6,21] encounter challenges when body parts are not isolated, which may cause significant occlusions, and appearance based ones [16] often fail when several objects get close together, as this kind of algorithm fails to allocate pixels to the correct object. To overcome some of these problems, Kelly et al. [22] used 3D stereo information to detect pedestrians via a 3D clustering process and tracked them by a weighted maximum cardinality matching scheme.

    2.2.2. Detection related tracking

2.2.2.1. Detection based tracking. With the fast development of object detection techniques, object detectors play an important role in many tracking algorithms. Some tracking algorithms use detection as their observation model. One of the most successful techniques is the particle filter [23]. The particle filter is based on Sequential Monte Carlo sampling, and has gained much attention because of its simplicity, generality, and extensibility in a wide range of challenging applications. Xing et al. [7] combined multiple part detectors with a particle filter to track multiple objects with occlusions. Another line of work associates detected results of video frames locally [24] or globally [8,15,25–27]. Wu et al. [24] associated detection results in two consecutive frames. Jiang et al. [25] adapted Linear Programming for association, while Zhang et al. [26] used min-cost flow. Andriluka et al. [8] tailored the Viterbi algorithm to link detection results, which combined the advantages of both detection and tracking. Huang et al. [15] presented a three-level hierarchical association approach where they obtained short tracks and long tracks at the low level and middle level separately, and refined the final trajectories with the estimated scene knowledge at the high level. Pirsiavash et al. [27] proposed globally optimal greedy algorithms to estimate the number of tracks and their birth and death states in a cost function. Global association based tracking methods could theoretically obtain a global optimum, since the results of all frames are available before tracking. However, the cost of heavy computation and temporal delay limits their use in real time applications.

2.2.2.2. Online learning. Avidan [28] trained an ensemble of weak classifiers online to distinguish between the object and the background. Grabner et al. [29] described an online boosting algorithm for real-time tracking, which was very adaptive but prone to drift. To limit the drifting problem, Grabner et al. [30] introduced a semi-supervised learning algorithm using unlabeled data explored in a principled manner, while Babenko et al. [31] proposed online Multiple Instance Learning using one positive bag consisting of several image patches to update a learned classifier. However, manual initialization and the focus on single object tracking prevent their application in our scenes of interest.

    3. System overview

We propose to detect and track multiple humans in surveillance scenes with occlusions and viewpoint changes using a single stationary camera, by taking advantage of some available scene models (e.g. background, camera, layout and ground plane models). We believe that the models we use are generic and applicable to a wide variety of situations. The models used are listed as follows.

(a) A camera model to rectify an image with large viewpoint changes into a frontal viewpoint;

(b) A background model to direct the system's attention to the regions showing difference from the background;

(c) A layout model to restrict objects in the scene;

(d) A ground plane model to restrict objects to standing on the ground.

The whole system is overviewed in Fig. 1 and mainly includes two components, Scene Aware Detection and Block Assignment Tracking. The three key factors of the SAD are foreground aware pruning to prune negative and false positive samples, a structural filter approach based on our previous work [4] to detect occluded objects, and foreground aware merging to explain foregrounds by detected results.


The BAT formulates tracking as a block assignment process, which can track an object even when all the part detectors fail, as long as the object has assigned blocks. The BAT proceeds as follows. It first maintains the spatial and temporal consistency at the block level (Block Tracking), then precisely estimates the locations and sizes of objects at the ensemble level using appearance, motion and discriminative models (Ensemble Tracking), and finally assigns blocks so that blocks with the same label look like a part of a human, by combining both previous results (Ensemble to Block Assignment). In our implementation, we split each frame into 8×8 blocks; a typical 640×480 image thus contains 80×60 = 4800 blocks. A block is called a foreground block if the number of its pixels in the foreground region is larger than 20% of the pixels in the whole block. Similar to [5], the BAT takes foreground blocks into account and ignores background ones.
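As a concrete illustration of this block bookkeeping, the sketch below (ours, not the authors' code; all names are illustrative) splits a binary foreground mask into 8×8 blocks and applies the 20% rule:

```python
import numpy as np

def foreground_blocks(fg_mask, block=8, min_ratio=0.2):
    """Split a binary foreground mask into block x block cells and mark a
    cell as a foreground block if more than min_ratio of its pixels are
    foreground (the 20% rule from Section 3)."""
    h, w = fg_mask.shape
    hb, wb = h // block, w // block
    # Reshape into (hb, block, wb, block) cells and count foreground pixels.
    cells = fg_mask[:hb * block, :wb * block].reshape(hb, block, wb, block)
    counts = cells.sum(axis=(1, 3))
    return counts > min_ratio * block * block  # boolean (hb, wb) grid

# A 640x480 frame yields an 80x60 grid, i.e. 4800 blocks.
mask = np.zeros((480, 640), dtype=np.uint8)
print(foreground_blocks(mask).shape)  # (60, 80)
```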

The BAT solves a particular segmentation problem, coarser than pixel level segmentation but finer than bounding boxes, as illustrated in Fig. 2. Pixel level segmentations are defined to achieve the most accurate results. But they are somewhat prone to errors under occlusions, particularly for non-rigid objects like humans with viewpoint changes, as their contours are disturbed and vary drastically. These restrictions prevent such methods from being applied in the scenes we concentrate on. Bounding boxes may take extra (non-object or other object) pixels into account and miss some real pixels. These drawbacks also exist in the BAT but are relatively more moderate, since the BAT considers foreground blocks and ignores background ones. More importantly, the BAT can build more robust appearance and motion models for objects from these blocks than from bounding boxes.

    4. Scene aware detection

    4.1. Scene models

A background model is widely used in many tracking systems. In order to establish a background model robust to noise, motion and illumination variations, we employ the lifespan background modeling algorithm from our previous work [32], where short, middle and long life span models are adaptively built and updated online in a collaborative manner.

Fig. 1. System overview. Round rectangle box: inputs and outputs. Rectangle box: procedure. Solid arrow: data flow. Double-line arrow: extra input models. The key factors of our system are marked out in bold.

Fig. 2. Comparisons of BAT, bounding boxes and pixel level segmentations on one object. (a) an image; (b) the foreground image; (c) ideal pixel level segmentations labeled manually; (d) bounding boxes with extra pixels (left) and missed pixels (right); and (e) BAT with extra blocks (left) and missed blocks (right). Please see Section 3 for more discussions.


Camera models are utilized to handle viewpoint changes in detection. We follow the method of [10], which first detects objects in sub-images rectified from a changed viewpoint to a frontal viewpoint, and then projects the detection results back into the original image. This kind of method is able to take advantage of detectors learned for a frontal viewpoint and avoids the more difficult training over multiple viewpoint samples. During detection, the sampling in 3D space is projected into the image coordinates as shown in Fig. 3(d) (bottom). For frontal viewpoint scenes, there is no need for such rectifications. To speed up detection in these scenes, we assume a linear mapping from the 2D coordinate $(x, y)$ to the human height $L_h$: $c_1 x + c_2 y + c_3 = L_h$, where $c_1$, $c_2$ and $c_3$ are unknown parameters that can be estimated through a RANSAC style algorithm like [33]. During detection, the sampling in 2D space is a scanning window process restrained by the linear mapping, as shown in Fig. 3(d) (top). Please refer to [33,10] for details.
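A minimal sketch of how the linear mapping $c_1 x + c_2 y + c_3 = L_h$ could be fit in a RANSAC style from (foot position, height) observations. This is our illustration of the idea in [33]; the iteration count and inlier tolerance are assumptions:

```python
import numpy as np

def fit_height_mapping(points, heights, iters=500, tol=3.0, rng=None):
    """RANSAC-style estimate of (c1, c2, c3) such that
    c1*x + c2*y + c3 ~= Lh for foot positions (x, y) with heights Lh."""
    rng = rng or np.random.default_rng(0)
    pts = np.column_stack([points, np.ones(len(points))])  # rows [x, y, 1]
    best, best_inliers = None, 0
    for _ in range(iters):
        idx = rng.choice(len(pts), 3, replace=False)
        try:
            c = np.linalg.solve(pts[idx], heights[idx])  # exact fit to 3 samples
        except np.linalg.LinAlgError:
            continue  # degenerate (collinear) sample, try again
        inliers = np.abs(pts @ c - heights) < tol
        if inliers.sum() > best_inliers:
            # Refit by least squares on all inliers of the best model.
            best = np.linalg.lstsq(pts[inliers], heights[inliers], rcond=None)[0]
            best_inliers = inliers.sum()
    return best  # array [c1, c2, c3]
```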

Layout models can be easily marked out for stationary scenes such as Fig. 3(c). We assume that humans stand on the ground plane in the layout. After integrating these two models with the linear mapping or camera model mentioned earlier, we obtain the sampled searching points and corresponding human heights in scenes, as illustrated in Fig. 3(d).

    4.2. Foreground aware pruning (FAP)

This step prunes negative and false positive samples by foregrounds, as shown in Fig. 4(a). We cast this pruning problem as a 2-class classification problem on binary images, and design a simple discriminative learning algorithm under the boosting framework [1]. The aim is to mine some features to learn a fast and effective pruning detector.

Our features are based on the zero moment of a region $RG$, $M(RG) = \sum_{(x,y) \in RG} I_B(x,y)$, in a binary image $I_B$. Each feature $r$ is a sub-region of $I_B$, as shown in black in Fig. 4(c). The feature value is calculated as

$$f(r, I_B) = \frac{M(r) - M(I_B \setminus r)}{|I_B|} \qquad (1)$$

where $|I_B|$ is the total number of pixels in $I_B$. We restrict $r$ to be a rectangle, and hence Eq. (1) can be calculated efficiently through an integral image without generating image pyramids as in [1].
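A sketch, under our reading of Eq. (1), of computing the zero-moment feature with an integral image; function and variable names are ours:

```python
import numpy as np

def integral_image(binary):
    """Summed-area table with an extra zero row/column on top/left."""
    ii = np.zeros((binary.shape[0] + 1, binary.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = np.cumsum(np.cumsum(binary, axis=0), axis=1)
    return ii

def region_moment(ii, x0, y0, x1, y1):
    """Zero moment M(r): number of foreground pixels in the rectangle
    [x0, x1) x [y0, y1), computed in O(1) via the integral image."""
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]

def feature_value(ii, rect, total_pixels):
    """f(r, IB) = (M(r) - M(IB \\ r)) / |IB|, using M(IB \\ r) = M(IB) - M(r)."""
    x0, y0, x1, y1 = rect
    m_r = region_moment(ii, x0, y0, x1, y1)
    m_rest = ii[-1, -1] - m_r  # ii[-1, -1] is M(IB), the total foreground count
    return (m_r - m_rest) / total_pixels
```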

Positive samples for the pruning can be obtained by manual labeling, as shown in Fig. 4(b). However, collecting negative samples is impractical for two reasons. One reason is that negative samples can take any form, which is too time consuming to label manually. The other reason is that when applying the pruning detector, negative samples themselves are always inaccurate because of noise in background modeling; thus parts of real objects are likely missing from the foreground and some background is included in objects. In fact, negative samples are not necessary because 1) a small amount of negative samples may cause overfitting, and 2) a large amount of negative samples might make the pruning detectors very complex and thus inefficient at pruning negative and false positive samples.

Fig. 3. Models in detection: (a) original images; (b) foregrounds; (c) scene layouts; (d) some searching points in red with lines whose lengths indicate the corresponding human heights; (e) cropped sub-images and their foregrounds; and (f) detection results projected as quadrangles in original images. The top and bottom rows show a common frontal viewpoint scene and a changed viewpoint one separately. Note that, in the latter case, camera models are adopted to handle the difficulty of viewpoint changes.

Fig. 4. Foreground pruning. (a) Typical pruned negative and false positive examples. (b) Whole body positive masks, from which other part positive masks can be generated. (c) Five used features.


Motivated by the above, pruning classifiers are learned with positive samples only. The classifier on feature $r$ is determined as

$$h_r(I_B) = \begin{cases} 1, & f(r, I_B) - T_r > 0 \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$

where $T_r = \min_{x_i^B} f(r, x_i^B) - \epsilon$, $\epsilon$ is a small positive value ($10^{-2}$), and $x_i^B$ is a positive sample. In consideration of the inaccuracy of background modeling, positive samples are perturbed by moving 3 pixels left or right, or 2 pixels up or down.
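A minimal sketch of learning the per-feature threshold $T_r$ from positives only, including the ±3/±2 pixel perturbations; it reuses `integral_image` and `feature_value` from the sketch above, and the wrap-around shift is a simplification of ours:

```python
import numpy as np

def learn_threshold(positive_masks, rect, eps=1e-2):
    """T_r = min over (perturbed) positive samples of f(r, x), minus eps,
    so that every positive passes h_r in Eq. (2)."""
    values = []
    for mask in positive_masks:
        for dx in (-3, 0, 3):          # horizontal perturbation
            for dy in (-2, 0, 2):      # vertical perturbation
                # np.roll wraps at borders; a real implementation would pad.
                shifted = np.roll(np.roll(mask, dy, axis=0), dx, axis=1)
                ii = integral_image(shifted)
                values.append(feature_value(ii, rect, mask.size))
    return min(values) - eps

def h_r(ii, rect, total_pixels, T_r):
    """Pruning classifier on feature r: 1 keeps the window, 0 prunes it."""
    return 1 if feature_value(ii, rect, total_pixels) - T_r > 0 else 0
```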

This pruning should be fast and effective. Instead of automatically selecting good features from a large feature pool as in [1], we simply design several features as shown in Fig. 4(c). All classifiers learned on these features are combined into one strong detector, whose order is not constrained. A searching window is then considered only if its corresponding foreground passes this strong detector. For an $n \times m$ image, the pre-processing of an integral image costs $O(nm)$ time and space. Each feature can then be calculated in $O(1)$ time, and thus the set of classifiers costs approximately constant time per window. Its effectiveness is evaluated in the experiments.

4.3. Structural filter approach

The detection is based on our previous work [4,34]. We proposed to learn an Integral Structural Filter (ISF) detector in [4] to detect humans with occlusions and articulated poses in a feature sharing mechanism. We build a three-level hierarchical model for humans: words, sentences and paragraphs, where words are the most basic units, sentences are meaningful sub-structures, and paragraphs are the appearance statuses (e.g., head–shoulder, upper-body, left-part, right-part and whole-body in occluded scenes). An example is shown in Fig. 5. We integrate the detectors for the three levels through inference from word to sentence, from sentence to paragraph and from word to paragraph. All detectors for structures (words, sentences and paragraphs) are based on the Real AdaBoost algorithm and Associated Pairing Comparison Features (APCFs) [34]. APCF describes the invariance of color and gradient of an object to some extent, and contains two essential elements, Pairing Comparison of Color (PCC) and Pairing Comparison of Gradient (PCG). A PCC (or PCG) is a Boolean color (or gradient) comparison of two granules, where a granule is a square window patch. Please refer to [4,34] for more details.

    4.4. Foreground aware merging (FAM)

We now discuss the merging strategy applied after obtaining all detected results. Different from previous approaches (e.g. [2,3]) which stick to detection results, we integrate foreground information into the post-processing. We consider objects one by one after extending them to the whole body, through adding and deleting operations defined on the visible and invisible parts of objects. To reduce the computational complexity, the two operations are based on blocks as defined in Section 3.

A hypothesis $h$ is a detected response. We denote the block set and foreground block set of $h$ as $B_h$ and $F_h$ respectively. For a hypothesis set $H$, we have $B_H = \cup_{h \in H} B_h$ and $F_H = \cup_{h \in H} F_h$ correspondingly. The score of adding $h$ into $H$ is defined as

$$sc_{add}(h) = \begin{cases} \dfrac{|F_{H \cup \{h\}}| - |F_H|}{|B_{H \cup \{h\}}| - |B_H|}, & |F_{H \cup \{h\}}| - |F_H| > T_M |F_h|,\ h \notin H \\ 0, & \text{otherwise.} \end{cases} \qquad (3)$$

$h$ can be added if $sc_{add}(h) > T_{add}$. $T_M$ is a threshold. The score of deleting $h$ from $H$ is defined as

$$sc_{del}(h) = \begin{cases} \dfrac{|F_H| - |F_{H \setminus \{h\}}|}{|B_H| - |B_{H \setminus \{h\}}|}, & |F_H| - |F_{H \setminus \{h\}}| > T_M |F_h|,\ h \in H \\ 0, & \text{otherwise.} \end{cases} \qquad (4)$$

$h$ can be deleted if $sc_{del}(h) < T_{del}$. $T_{add}$ and $T_{del}$ are empirical parameters: the smaller $T_{add}$, the more objects are added; the larger $T_{del}$, the more objects are deleted. In the implementation, we propose a greedy procedure that first uses the adding operation to find possible hypotheses and then the deleting operation to remove bad ones. Although the strategy is very simple, it yields promising detection results in the experiments.
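A sketch of the greedy merging strategy under our reading of Eqs. (3) and (4). Hypotheses are represented by their block and foreground-block sets; the threshold values and the ordering heuristic are placeholders of ours:

```python
def sc_add(h_blocks, h_fg, H_blocks, H_fg, T_M):
    """Score of adding hypothesis h to the set H (Eq. (3))."""
    gain_fg = len(H_fg | h_fg) - len(H_fg)        # newly explained fg blocks
    gain_bk = len(H_blocks | h_blocks) - len(H_blocks)
    if gain_fg > T_M * len(h_fg) and gain_bk > 0:
        return gain_fg / gain_bk
    return 0.0

def greedy_merge(hypotheses, T_M=0.3, T_add=0.5):
    """hypotheses: list of (blocks, fg_blocks) frozensets.
    Greedily add hypotheses whose adding score is high enough; the
    subsequent deletion pass (Eq. (4)) mirrors sc_add and is omitted."""
    H, H_blocks, H_fg = [], set(), set()
    for h_blocks, h_fg in sorted(hypotheses,
                                 key=lambda h: -len(h[1])):  # large fg first
        if sc_add(h_blocks, h_fg, H_blocks, H_fg, T_M) > T_add:
            H.append((h_blocks, h_fg))
            H_blocks |= h_blocks
            H_fg |= h_fg
    return H
```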

    5. Block Assignment Tracking

The previous section mainly discussed accurately locating objects in scenes with occlusions and viewpoint changes. In this section, we concentrate on robustly tracking them. In what follows, we first derive the formulation of our block assignment tracking problem, and then present our solution.

Fig. 5. The hierarchical structure of a pedestrian [4].


    5.1. Problem formulation

Denoting the object state sequences from frame 1 to frame $T$ as $S_{1:T} = \{S_1, \ldots, S_T\}$ and the corresponding observation sequences collected from the frame data as $O_{1:T} = \{O_1, \ldots, O_T\}$, a tracking problem can be formulated as the following MAP (maximum a posteriori) problem:

$$S_t^{*} = \arg\max_{S_t} p(S_t \mid O_{1:t}). \qquad (5)$$

Generally, an object state can be modeled as the location and size of the object at the ensemble level as in [7], or as a set of blocks forming the appearance as in [5]. Tracking at the ensemble level is efficient when objects are isolated. However, it tends to make errors when objects interact with each other, since ensemble observations can be ambiguous or missing because of occlusions. When objects are well initialized, tracking at the block level is efficient even with heavy occlusions, since it mainly considers block persistence in the spatial and temporal spaces. But it cannot guarantee that a segmented region looks like an object part; in fact, a region might contain no object or several. Moreover, it has no explicit correcting mechanism to rectify errors that arise during initialization and tracking. In order to combine their merits and get rid of their restrictions, we propose to model object states on both the ensemble and block levels as $S_t = \{Z_t, V_t\}$, where $Z_t = \{z_{t,k}\}_{k=1}^{K}$ is the ensemble level state of all $K$ objects and $V_t = \{v_{t,i}\}_{i=1}^{N}$ is the block level state of all $N$ blocks. $v_{t,i}$ is the label for block $b_{t,i}$, indicating that $b_{t,i}$ belongs to object $z_{t,v_{t,i}}$ if $v_{t,i} \geq 0$, or to the background if $v_{t,i} < 0$. All blocks with the same label form the appearance of an object, while the ensembles describe coarse shapes of objects and cover some blocks assigned to them, as illustrated in Fig. 6. Therefore, we modify Eq. (5) and formulate our problem as

$$(Z_t^{*}, V_t^{*}) = \arg\max_{Z_t, V_t} p(Z_t, V_t \mid O_{1:t}, V_{t-1}). \qquad (6)$$

Compared to Eq. (5), $V_{t-1}$ on the right side of Eq. (6) takes the previous assignment into account. However, the optimization of Eq. (6) is not tractable because $V_{1:t}$ and $Z_t$ are closely intertwined at time $t$. The inference between $V_t$ and $V_{t-1}$ should maintain the spatial and temporal persistence of block assignments. Meanwhile, $Z_t$ encourages blocks with the same label in $V_t$ to look like an object. Moreover, $V_{1:t-1}$ can provide robust appearance and motion models of objects for inferring $Z_t$. To make the optimization tractable, we propose to split Eq. (6) into three steps. The first step obtains an intermediate assignment $\tilde{V}_t$ through inference at the block level over two sequential frames, ignoring $Z_t$:

$$\tilde{V}_t^{*} = \arg\max_{\tilde{V}_t} p(\tilde{V}_t \mid V_{t-1}, O_{t-1:t}). \qquad (7)$$

This step maintains the persistence of block assignments in the spatial and temporal spaces. The second step then focuses on inferring $Z_t$ with the aid of robust appearance and motion models of objects estimated from $V_{1:t-1}$:

$$Z_t^{*} = \arg\max_{Z_t} p(Z_t \mid O_{1:t}). \qquad (8)$$

Afterwards, the third step achieves the final assignment by combining the previous results $\tilde{V}_t$ and $Z_t$:

$$V_t^{*} = \arg\max_{V_t} p(V_t \mid Z_t, O_t, \tilde{V}_t). \qquad (9)$$

The third step builds on the other two steps, making blocks with the same label look like a part of some object and potentially rectifying possible errors from initialization and tracking. Integrating these three steps into Eq. (6), we obtain

$$p(Z_t, V_t \mid O_{1:t}, V_{t-1}) \propto p(\tilde{V}_t \mid V_{t-1}, O_{t-1:t}) \, p(Z_t \mid O_{1:t}) \, p(V_t \mid Z_t, O_t, \tilde{V}_t). \qquad (10)$$

Therefore, Eq. (10) can be efficiently solved by the max-product algorithm. These three steps are further explained in the next section. This completes our problem formulation; since the last step assigns blocks at each time step, we term the method Block Assignment Tracking. Compared to [5,7], our formulation provides a simple way to integrate block and ensemble level information.
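The three-step factorization of Eq. (10) can be read as a per-frame pipeline. The skeleton below is our paraphrase of that data flow; the three stage functions are passed in as callables, since each is a full algorithm in its own right (Sections 5.2.1 to 5.2.3):

```python
def bat_step(block_tracking, ensemble_tracking, ensemble_to_block,
             V_prev, Z_prev, frames, detections=None):
    """One frame of Block Assignment Tracking, following Eqs. (7)-(9).

    V_prev: previous block labels; Z_prev: previous ensemble states.
    frames: (previous frame, current frame) observations.
    """
    # Step 1, Eq. (7): propagate block labels between consecutive frames.
    V_tilde = block_tracking(V_prev, frames)
    # Step 2, Eq. (8): estimate object locations/sizes with particles.
    Z = ensemble_tracking(Z_prev, frames, detections)
    # Step 3, Eq. (9): reconcile blocks with ensembles (graph cuts).
    V = ensemble_to_block(V_tilde, Z, frames[-1])
    return V, Z
```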

    5.2. Solution

In this subsection, we present the details of the three steps in Eq. (10), namely Block Tracking, Ensemble Tracking and Ensemble to Block Assignment. At the end, we give a summary of our tracking algorithm.

    5.2.1. Block Tracking

This step predicts an intermediate result by taking advantage of label, color and shape constraints. Inspired by the similar problem in [5], we define

$$-\ln p(\tilde{V}_t \mid V_{t-1}, O_{t-1:t}) \propto \sum_{k=0}^{K} \sum_{i=1}^{N} \alpha_i(b_{t,i}, z_{t,k} \mid V_{t-1}, O_{t-1:t}) \, \delta(v_{t,i}, k) + \sum_{i=1}^{N} \beta_i(b_{t,i}, b_{t,j_1^i}, \ldots, b_{t,j_l^i} \mid V_{t-1}, O_{t-1:t}) \qquad (11)$$

where $\alpha_i$ is the penalty if $b_{t,i}$ is assigned to $z_{t,k}$, $\beta_i$ is the penalty when $b_{t,i}$ and its neighbors are assigned to different objects, $\delta(i,j)$ is a Kronecker function equaling 1 if $i = j$ and 0 otherwise, $l = |N_{b_{t,i}}|$, and $N_{b_{t,i}}$ are the 8-neighbor blocks of $b_{t,i}$. The observations here are image sequences, and object states are updated straightforwardly from their previous states as $z_{t,k} = z_{t-1,k} + r_{z_{t,k}}$ by their motions $r_{z_{t,k}} = (r_{z_{t,k}}^x, r_{z_{t,k}}^y)$. The motion of an object is represented by the most frequent motion

Fig. 6. Tracking problem formulation. Left: original image. Middle: foreground block image. Right: an assignment where blocks in the same color (label) form the appearance of one object and the quadrangles indicate coarse shapes of objects.


among all its blocks, where the motion of one block is obtained by block matching. We now give the definitions of $\alpha_i$ and $\beta_i$:

$$\alpha_i(b_{t,i}, z_{t,k} \mid V_{t-1}, O_{t-1:t}) = a \, DL_{t,i,k} + b \, DM_{t,i,k} + c \, DA_{t,i,k}. \qquad (12)$$

$DL_{t,i,k}$ is a rough shape constraint that restricts the spread of block labels. As object shapes are quadrangles, we need to eliminate the effects of scale along the axes and rotation in the 2D plane. Our idea is to use a normalization matrix

$$\tilde{x} = \begin{bmatrix} 1/W & 0 \\ 0 & 1/H \end{bmatrix} \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$$

where $[W, H]^T$ is the minimum size of detection and $\theta$ is the angle between an object and the vertical. Let the centers of $b_{t,i}$ and $z_{t,k}$ be $x_{t,i} = (x_{t,i}, y_{t,i})^T$ and $x_{z_{t,k}} = (x_{z_{t,k}}, y_{z_{t,k}})^T$ respectively. We define $DL_{t,i,k} = \exp\left(\left\| \tilde{x} \left( x_{t,i} - x_{z_{t,k}} \right) \right\|^2\right)$.

$DM_{t,i,k}$ is a temporal constraint on label consistency, defined as $DM_{t,i,k} = (M_{t,i,k}/M_i - 1)^2$, where $M_{t,i,k}$ is the number of pixels in the overlapped area between $z_{t-1,k}$ and the region obtained by moving $b_{t,i}$ by $r_{z_{t,k}}$, and $M_i$ is the total number of pixels in a block.

$DA_{t,i,k}$ is a color constraint, which measures temporal color coherence. Letting $I_t$ be the gray scale frame at time $t$, we define

$$DA_{t,i,k} = \sum_{0 \le dx < 8} \sum_{0 \le dy < 8} \left| I_t(x + dx, y + dy) - I_{t-1}(x + dx - r_{z_{t,k}}^x, y + dy - r_{z_{t,k}}^y) \right|. \qquad (13)$$

$\beta_i$ is the spatial constraint on label consistency:

$$\beta_i(b_{t,i}, b_{t,j_1^i}, \ldots, b_{t,j_l^i} \mid V_{t-1}, O_{t-1:t}) = d \sum_{k=1}^{K} (N_{i,k} - N_k)^2 + g \sum_{n=1}^{l} \left\| r_{t,i} - r_{t,j_n^i} \right\|^2. \qquad (14)$$

Similar to [5], we set $a = 1$, $b = 1$, $c = 0.125$, $d = 0.00000025$ and $g = 0.5$, and adopt the Gibbs sampler algorithm [35] to solve Eq. (11). Please refer to [5,35] for more details.
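As an illustration of the unary penalty in Eqs. (12) and (13), the sketch below computes the color term DA as a sum of absolute differences over an 8×8 block and combines it with precomputed shape and overlap terms; the weights follow the a, b, c values quoted from [5], and everything else is our naming:

```python
import numpy as np

A_W, B_W, C_W = 1.0, 1.0, 0.125  # weights a, b, c from the text

def da_color(I_t, I_prev, x, y, motion, block=8):
    """DA (Eq. (13)): sum of absolute differences between the block at
    (x, y) in frame t and the motion-compensated block in frame t-1.
    Assumes the compensated block stays inside the image."""
    mx, my = motion
    cur = I_t[y:y + block, x:x + block].astype(np.int32)
    prev = I_prev[y - my:y - my + block, x - mx:x - mx + block].astype(np.int32)
    return np.abs(cur - prev).sum()

def alpha(dl, dm, da):
    """Unary penalty of assigning block b_{t,i} to object z_{t,k} (Eq. (12))."""
    return A_W * dl + B_W * dm + C_W * da
```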

    5.2.2. Ensemble Tracking

This step estimates object locations accurately at the ensemble level and offers the potential to amend possible errors in initialization and tracking, as discussed earlier. Such errors are not notable over a short time ($N_E$ frames, for simplicity), but are magnified vastly as time passes. In the former situation, object states updated by their motions are adequate. In the latter situation, we refer to the update step of a sequential Bayesian estimation problem:

$$p(Z_t \mid O_{1:t}) \propto L(O_t \mid Z_t) \, p(Z_t \mid O_{1:t-1}) \qquad (15)$$

in which $p(Z_t \mid O_{1:t-1})$ is the prediction step

$$p(Z_t \mid O_{1:t-1}) = \int D(Z_t \mid Z_{t-1}) \, p(Z_{t-1} \mid O_{1:t-1}) \, dZ_{t-1} \qquad (16)$$

where $L(O_t \mid Z_t)$ is the observation likelihood and $D(Z_t \mid Z_{t-1})$ is the dynamic model of the system, modeled as a first-order Gaussian by considering object motions.

To approximate the filtering distribution, the Particle Filter (PF) approach [23] uses a set of weighted particles. Its direct extension to multiple object tracking models objects as unrelated. However, this may cause ID switches when tracking adjacent objects, because it is ambiguous which object an observation should be assigned to. In contrast, we do not distinguish particles generated from different objects. Fig. 7 compares the two strategies. Formally, we extend [23] by

$$p(Z_t \mid O_{1:t}) \approx \sum_{n=1}^{N_p} \pi_{t,k}^n \, \delta_{z_t^n}(z_t) \qquad (17)$$

in which $N_p$ is the total number of particles and $\delta_z(\cdot)$ denotes the Dirac delta function at position $z$. The $n$th particle is denoted as $p_n = (x_t^n, s_t^n, H_t^n, \{\pi_{t,k}^n\}_{k=1}^K)$, where $x_t^n = (x_t^n, y_t^n)$ is the location, $s_t^n$ is the scale, $H_t^n$ is the appearance model, and $\pi_{t,k}^n$ is the weight for $z_{t,k}$. Motivated by the successes of [7,16], we define

$$\pi_{t,k}^n = \begin{cases} \lambda \, \pi_{t,k}^{n,D} + (1 - \lambda) \, \pi_{t,k}^{n,G}, & \left\| x_t^n - x_{z_{t,k}} \right\|^2 < \tau \\ 0, & \text{otherwise} \end{cases} \qquad (18)$$

where $\pi_{t,k}^{n,D}$ is a discriminative weight modeled using the detector confidence and $\pi_{t,k}^{n,G}$ is an appearance weight measured from an online learned appearance model. $\lambda$ is a parameter ($\lambda = 0.5$ here) and $\tau$ is a distance threshold. The appearance models for particles or objects come from pixels in foreground blocks. We utilize the HSV color space, and the number of bins for each channel is set to 16.
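A sketch of the particle weight of Eq. (18); `lam` and `tau` correspond to the λ = 0.5 and distance-threshold parameters in the text, while the Bhattacharyya similarity over 16-bin HSV histograms is our choice of appearance comparison, not necessarily the authors':

```python
import numpy as np

def hsv_hist(pixels_hsv, bins=16):
    """Concatenated per-channel histogram (16 bins each) over the pixels
    of an object's foreground blocks, L1-normalized."""
    hs = [np.histogram(pixels_hsv[:, c], bins=bins, range=(0, 256))[0]
          for c in range(3)]
    h = np.concatenate(hs).astype(np.float64)
    return h / max(h.sum(), 1)

def particle_weight(det_conf, part_hist, obj_hist, part_xy, obj_xy,
                    lam=0.5, tau=5.0):
    """pi_{t,k}^n (Eq. (18)): zero if the particle's squared distance to
    object k exceeds tau, else a mix of detector confidence and appearance."""
    if np.sum((np.asarray(part_xy) - np.asarray(obj_xy)) ** 2) >= tau:
        return 0.0
    appearance = np.sum(np.sqrt(part_hist * obj_hist))  # Bhattacharyya
    return lam * det_conf + (1 - lam) * appearance
```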

Objects may sometimes get lost during tracking. If an object cannot gather enough support from particles (those with $\pi_{t,k}^n$ above a small threshold), it is declared lost and buffered for possible matching against newly detected objects. We perform object detection (SAD) every $N_F$ frames to find new objects. If a lost object cannot be matched within $T_W$ frames, it is discarded.

    5.2.3. Ensemble to Block Assignment

This step achieves the final result from the intermediate assignment $\tilde{V}_t$ and the estimated object state $Z_t$. This is a multi-label problem, which can easily be converted into 2-label problems by adding objects one by one and then solved by graph cut algorithms. Suppose the object map $V_t$ has been obtained after adding objects $z_{t,1}, \ldots, z_{t,k-1}$, and object $z_{t,k}$ is to be added. The target is then to minimize the following energy function at each step:

$$E_k = \sum_{i=1}^{N} \phi_i(b_{t,i}, z_{t,k}) + \sum_{b_{t,j} \in N_{b_{t,i}}} \phi_{i,j}(b_{t,i}, b_{t,j}). \qquad (19)$$

Fig. 7. Comparisons of sampling strategies. (a) shows a scene with six persons. (b) PF [23] models objects as unrelated. (c) In our strategy, particles from different objects are not distinct, but those far away from the concerned object are ignored (e.g., only particles from objects D, C and E contribute to D).


The unary term $\phi_i$ encodes the data likelihood, which imposes penalties for assigning block $b_{t,i}$ to object $z_{t,k}$. We consider the shape model and the prior knowledge in

$$\phi_i(b_{t,i}, z_{t,k}) = \kappa(x_{t,i}, x_{z_{t,k}}) \, \gamma^{-n_i} \left( 1 - \delta(\tilde{v}_{t,i}, k) \right) \qquad (20)$$

where $\kappa(\cdot,\cdot)$ is a kernel function defined as $\kappa(x_{t,i}, x_{z_{t,k}}) = DL_{t,i,k}$ and $\gamma$ is an occlusion factor. Let $n_i$ be the number of objects that occlude $z_{t,k}$ in block $b_{t,i}$, where an object is occluded by others if they overlap and its $y$-axis value is larger. Intuitively, the larger $n_i$, the lower $\phi_i$. We have $\gamma \geq 1$ (set to 1.25 in our experiments).

The pairwise term $\phi_{i,j}$ encourages spatial coherence and imposes penalties when $b_{t,i}$ and $b_{t,j}$ are assigned different labels. As a sub-modular energy function can be solved by graph cut algorithms, we adopt the Potts model for simplicity:

$$\phi_{i,j} = \begin{cases} \mu \exp\left( -\dfrac{\chi^2(A_{t,i}, A_{t,j})}{\sigma_A} \right) + (1 - \mu) \exp\left( -\dfrac{\left\| r_{t,i} - r_{t,j} \right\|^2}{\sigma_r} \right), & v_{t,i} \neq v_{t,j} \\ 0, & \text{otherwise} \end{cases} \qquad (21)$$

where $A_{t,l}$ and $r_{t,l}$ are the appearance and motion of $b_{t,l}$ ($l = i, j$), $\mu$ is a parameter ($\mu = 0.5$ here), and $\sigma_A$ and $\sigma_r$ are normalization factors. Here the appearances of blocks are modeled as 4-bin histograms on gray images. $\sigma_A$ is set to the number of pixels in a block (64 here). Supposing the maximum motion of a block is the block size (8×8), we set $\sigma_r = 8^2 + 8^2 = 128$.
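A sketch of the pairwise Potts term of Eq. (21). The constants follow the values in the text; the chi-square histogram distance is our assumption for the appearance comparison. The paper minimizes the resulting energy with graph cuts (a library such as PyMaxflow could be used), which we do not reproduce here:

```python
import numpy as np

MU, SIGMA_A, SIGMA_R = 0.5, 64.0, 128.0  # parameters from the text

def chi2(hist_a, hist_b, eps=1e-9):
    """Chi-square distance between two (4-bin, gray-level) block histograms."""
    return 0.5 * np.sum((hist_a - hist_b) ** 2 / (hist_a + hist_b + eps))

def pairwise(label_i, label_j, hist_i, hist_j, motion_i, motion_j):
    """phi_{i,j} (Eq. (21)): zero when neighboring blocks share a label,
    otherwise a Potts penalty mixing appearance and motion similarity."""
    if label_i == label_j:
        return 0.0
    app = np.exp(-chi2(hist_i, hist_j) / SIGMA_A)
    mot = np.exp(-np.sum((np.asarray(motion_i) - np.asarray(motion_j)) ** 2)
                 / SIGMA_R)
    return MU * app + (1 - MU) * mot
```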

After achieving the final assignment, we update the appearance models of objects. Intuitively, if an object is occluded by others, meaning that some of its overlapped foreground blocks are not assigned to it, the update ratio should be small: the more occlusion, the smaller the update ratio. Based on this, we define the update ratio as $\rho = 0.5 N_k / N_a$, where $N_k$ is the number of blocks assigned to $z_{t,k}$ and $N_a$ is the total number of blocks overlapped by $z_{t,k}$. Given the previous and current appearance models $A_p$ and $A_c$ for $z_{t,k}$, the update is $A = (1 - \rho) A_p + \rho A_c$.
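The occlusion-aware update reduces the learning rate when an object's blocks are taken by occluders; a minimal sketch, with names of our choosing:

```python
def update_appearance(A_prev, A_cur, n_assigned, n_overlapped):
    """A = (1 - rho) * A_prev + rho * A_cur with rho = 0.5 * N_k / N_a:
    the more occluded the object (fewer assigned blocks), the smaller
    the update ratio rho."""
    rho = 0.5 * n_assigned / max(n_overlapped, 1)
    return (1.0 - rho) * A_prev + rho * A_cur
```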

We have now described the three key components of the BAT. For easy reference, the entire procedure of the BAT is summarized in Fig. 8.

    6. Experiments

In this section, we carry out extensive experiments to evaluate our proposed detection and tracking system. We first describe the training and testing datasets, then list some detection and tracking metrics for evaluation, then evaluate the performance of our system, and finally make some discussions.

    6.1. Datasets

We have labeled 2470 positive masks of size 24×58, as shown in Fig. 4(b), for training the foreground pruning detector. We have also collected 18,474 whole body positive samples of size 24×58 for learning object detectors, as shown in Fig. 9. The positive masks and samples of the other parts can be generated from those of the whole body using the definitions in Fig. 5.

We use a large variety of challenging test datasets with different situations of occlusions and viewpoints for evaluation, as summarized in Fig. 10. The occlusions and viewpoint changes in these real world datasets make them valuable for evaluating detection and tracking systems. As the viewpoint in CAVIAR1 is frontal, learned detectors can be applied directly.

Fig. 8. The algorithm of our system.


But since the viewpoints in CAVIAR2, PETS2007 and our dataset are tilted, we utilize camera models to cope with them. In our experiments, we aim at improving both detection and tracking performance with off-line discriminative models. Therefore, the test datasets are totally independent from the training set, and we apply the generally trained detectors to all test sequences without retraining them for any specific scene.

    6.2. Metrics

We use False Positives Per Image (FPPI) for detection evaluation. When the intersection between a detection response and a ground-truth box is larger than 50% of their union, we consider it a successful detection. Only one detection per annotation is counted as correct.
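This criterion is the standard intersection-over-union test; a minimal sketch for axis-aligned boxes (x0, y0, x1, y1), with names of our choosing:

```python
def is_match(det, gt, thr=0.5):
    """A detection matches a ground-truth box when the area of their
    intersection exceeds thr (50%) of the area of their union."""
    ix0, iy0 = max(det[0], gt[0]), max(det[1], gt[1])
    ix1, iy1 = min(det[2], gt[2]), min(det[3], gt[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(det) + area(gt) - inter
    return union > 0 and inter / union > thr
```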

For multi-object tracking, there is no single established protocol. We follow two existing sets of metrics. The metrics of [36] count the number of mostly tracked (MT), partially tracked (PT) and mostly lost (ML) trajectories, as well as the number of track fragmentations (FM) and identity switches (IDS). The CLEAR metrics [37] calculate the Multiple Object Tracking Accuracy (MOTA), which takes into account false positives, missed targets and identity switches, and the Multiple Object Tracking Precision (MOTP), which measures the precision with which objects are located using the intersection of the estimated region with the ground truth region.

    6.3. Performance evaluations

    6.3.1. Detection evaluations

In this subsection, we concentrate on evaluating the key components of our SAD: foreground aware pruning (FAP), the structural filter approach (ISF) and foreground aware merging (FAM). Since the number of available frames in the test datasets is quite large, we select only 200 representative frames from each test dataset for evaluation.

6.3.1.1. Efficiency of FAP. The aim of FAP is to efficiently prune negatives and false positives. Table 1 shows the pruned window proportions and saved times on these datasets with default detection parameters. In Table 1, we can see that about 79%–94.4% of windows are pruned, which yields plenty of time savings (0.29 s–4.6 s). Since there are lots of people in our dataset, its pruned proportion is less than those of the other datasets. Compared to CAVIAR1, the other three datasets need to rectify sub-images, and thus they cost much more time than CAVIAR1. However, as there are only a few (<4) persons in CAVIAR2, the time cost is not as large as on S02 and our dataset. This experiment sufficiently demonstrates the efficiency of FAP.

6.3.1.2. Efficiency of SAD. We choose two state-of-the-art works [12,38] for comparison with our SAD. ACF [38] has achieved good performance for pedestrian detection and is a strong competitor for frontal viewpoint detection. ACF is learned on the same training dataset as our ISF for a fair comparison. Since there are no publicly available detectors for multiple viewpoints of humans, we use the Deformable Part Model (DPM) [12] as a baseline, which is well known for detecting objects with large variations. The original DPM detector is provided by the authors and trained on Pascal VOC 2008. For a fair comparison, we also train a new DPM detector on the same training dataset as our ISF. To distinguish them, we denote them as DPM1 and DPM2 respectively. In the following, for concise descriptions, we let MAP denote the Bayesian method in [2] and NAIVE the simplest strategy of combining

Fig. 9. Positive samples for the whole body.

Fig. 10. Test datasets. The CAVIAR dataset can be downloaded from http://homepages.inf.ed.ac.uk/rbf/CAVIAR/. PETS2007 can be downloaded from http://www.cvg.rdg.ac.uk/PETS2007/. Humans in CAVIAR2 are too small, and therefore we double the original video size (384×288).

Table 1
Evaluations of foreground aware pruning. N_H is the average number of humans. PPW is the average proportion of pruned windows among all scanned windows. T is the detection time without foreground aware pruning and t is the time saved when using it.

          CAVIAR1   S02      Our dataset   CAVIAR2
N_H       6         9        11            4
PPW       94.4%     90%      79%           86%
t (ms)    700       4600     1200          290
T (ms)    1210      10,400   7560          650


detection results at nearby locations. Note that except for CAVIAR1, the other three test datasets need rectification with camera models. The methods using camera models are indicated by CAM. The ROC curves are shown in Fig. 11.

6.3.1.2.1. Improvements of FAP. Compared to ISF+NAIVE, ISF+NAIVE+FAP improves the detection rate by about 3% on CAVIAR1. Compared to ISF+NAIVE+CAM, ISF+NAIVE+CAM+FAP improves the performance by about 4% on S02, 4% on our dataset and 1% on CAVIAR2. Similar performance improvements are achieved by ISF+MAP+CAM+FAP. From the experiments in Fig. 11 and Table 1, we can see that FAP not only works well for pruning but also improves the detection performance.

6.3.1.2.2. Improvements of ISF and scene models. ISF+MAP performs better than or comparably to ACF+MAP on CAVIAR1, S02 and our dataset, demonstrating that ISF can detect occluded humans in scenes without large viewpoint changes. ISF (ISF+MAP and ISF+NAIVE) is also better than DPM (DPM1 and DPM2), for which there might be two main reasons: 1) the ability of the deformable part based model is limited on strongly labeled samples like our training dataset, and 2) the weak feature in DPM uses only gradient information, while the weak feature in ISF combines both color and gradient information, which is more discriminative for pedestrian detection. Note that, as DPM2 is more focused on pedestrians than DPM1, it performs better than DPM1 on S02, and comparably to DPM1 on CAVIAR1 and our dataset. But all these detectors fail on CAVIAR2 because of large viewpoint changes, which are better handled by camera models. Compared with ISF+NAIVE, ISF+NAIVE+CAM improves the performance by about 3% on S02 and 26% on our dataset, and it works well on CAVIAR2. Similar improvements are achieved by ISF+MAP+CAM. As the viewpoint of CAVIAR1 is frontal, the linear mapping from 2D coordinates to human height is used there. In the experiments, we find that the linear mapping does not reduce the detection performance, while it speeds up detection by about 0.6 s on average compared to using ISF alone.

6.3.1.2.3. Improvements of FAM. We replace the post-processing

method with FAM to show further performance improvements. Compared to ISF+MAP+FAP, our approach (ISF+FAM+FAP) improves the detection rate by about 11% on CAVIAR1. Compared to ISF+MAP+FAP+CAM, our approach (ISF+FAM+FAP+CAM) improves the detection rate by about 16% on our dataset and 14% on S02. As MAP adds objects in y-descending order, which does not hold in scenes with large viewpoint changes, it does not work well on CAVIAR2 and is sometimes even worse than NAIVE. In contrast, our approach still works well in such scenes and achieves a 52% detection rate at FPPI = 0.1 on CAVIAR2. We also observe another interesting phenomenon: the curves of our approach are much steeper than the others. This indicates that we can detect more objects with fewer false samples, mainly because of the pruned false positive samples and the scene models used. We zoom in on the curves of Fig. 11(b) and (c) to illustrate more details in Fig. 11(e) and (f) respectively.

6.3.1.2.4. Summary. These experiments have shown the effectiveness of the key components (FAP, ISF and FAM) of our SAD in occluded and viewpoint-changed scenes. As a whole, our SAD therefore outperforms many state-of-the-art detection algorithms such as [12,38]. But the speed is not yet satisfactory: detection costs on average about 0.51 s, 5.8 s, 6.36 s and 0.36 s on CAVIAR1, our dataset, S02 and CAVIAR2 respectively. Because of changed viewpoints and heavy occlusions, it costs much more time on our dataset and S02 than on CAVIAR1 and CAVIAR2. For further speedup and performance improvements, we recommend our proposed BAT, which is evaluated in the next subsection.

    6.3.2. Tracking evaluations

In this section, we report the tracking performance of our BAT on all test datasets, based on the SAD results and without retraining detectors for specific scenes. For concise descriptions, we denote our BAT with and without camera models as BAT+3D and BAT+2D respectively.

6.3.2.1. Algorithms for comparisons. We compare our approach with some state-of-the-art tracking algorithms [26,24,15,7,36,27]. We utilize the implementation of [27] (http://www.ics.uci.edu/~dramanan/) to carry out experiments


Fig. 11. Evaluation of our SAD compared to DPM [12] and ACF [38]. (a), (b), (c) and (d) compare our approach with several state-of-the-art works on CAVIAR1, S02, our dataset and CAVIAR2 separately. (e) and (f) zoom in on our approach on S02 and our dataset respectively to illustrate more details.



ourselves. In this implementation, the authors do not use appearance after detecting objects; therefore it produces relatively more fragments and ID switches, as well as missed detections. We improve its performance by (1) utilizing background modeling to remove false positive samples, (2) building appearance models for detected objects to associate them, and (3) adjusting some parameters to achieve better tracking results. After these improvements, it can track more humans, but there are still too many fragments and ID switches. Thus, we only use it for comparisons on the following metrics: MT, PT, ML and MOTP.

Besides these state-of-the-art algorithms, we also use two simplified versions of our BAT as baselines to demonstrate the improvement of combining both block and ensemble information. One baseline uses only Ensemble Tracking, shortened as BAT(ET). BAT(ET)+2D, where camera models are not used in detection, is similar to [7]. BAT(ET)+3D, where camera models are used in detection, is a better way to show the improvement of BAT+3D over ensemble information alone. The other baseline uses only Block Tracking, shortened as BAT(BT). Objects can be well initialized in CAVIAR2 because of the few occlusions, but not in CAVIAR1, our dataset and S02 because of severe and frequent occlusions. Therefore, BAT(BT)+3D is a fair comparison with BAT+3D on CAVIAR2, where camera models are used in detection.

6.3.2.2. Quantitative results. The obtained results are shown in Figs. 12 and 13.

6.3.2.2.1. CAVIAR1. We compare our BAT with [26,24,15,7,36] in Fig. 12. Among them, our method achieves the highest MT. Our FM and IDS are a little higher than [36], mainly because we handle sequences online while [36] used all detection results to obtain a global optimization. The MOTA and MOTP of our approach are both better than those of [7], showing the efficiency of combining block and ensemble information. In general, CAVIAR1 is relatively easy for many tracking systems; the remaining test datasets are more challenging.

6.3.2.2.2. Our dataset and S02. We compare our approach with [7,27] on these two datasets in Fig. 13 (top) and (middle). As described in Section 6.3.1, many state-of-the-art detection algorithms do not perform as well as our detection approach in scenes with heavy occlusions and slightly changed viewpoints. The detection processes in [7,27] lose many humans on our dataset and S02, which reduces their tracking performance.

Fig. 12. Quantitative results on CAVIAR1. *The fragment and IDS numbers for [26,7] are obtained with looser evaluation metrics.

Fig. 13. Quantitative results of our method on our dataset, S02 and CAVIAR2.


Meanwhile, our SAD can detect many humans, and our BAT generally performs much better and more stably than [7,27]. BAT(ET)+3D can track more objects than [7,27], but it produces many fragments and ID switches. Compared to BAT(ET)+3D, BAT+3D achieves better performance: higher MT/PT/MOTP/MOTA and lower FM and IDS. This improvement shows that combining block and ensemble information is superior to using ensemble information alone for tracking. Compared to BAT+2D, the improvement of BAT+3D lies mainly in the use of camera models under slightly changed viewpoints. Furthermore, BAT+3D is always better than BAT+2D in MT/PT/ML, but not in the other metrics. Part of the reason is that the ground truths are labeled as rectangles, while the humans tracked by BAT+3D are quadrangles. However, because our dataset is much more crowded than S02, there are still many partially tracked objects.

6.3.2.2.3. CAVIAR2. Because of the extremely large viewpoint changes, methods that do not use camera models (such as [7,27] and BAT+2D) fail totally on this dataset. As far as we know, there are no publicly available implementations that deal with tracking multiple humans in such scenes.


Fig. 14. Tracking results. (a) and (b) compare [7] (top) and our approach (bottom) on OneStopMoveEnter1cor of CAVIAR1 and on S02. (c) and (d) illustrate sample results on our dataset and on Meet_crowd of CAVIAR2 separately. The layouts of (a) and (b) are already shown in Fig. 3(c). The layouts of (c) and (d) are illustrated at the far left. More descriptions can be found in Section 6.3.2.


Thus, we compare our BAT+3D with BAT(BT)+3D and BAT(ET)+3D in Fig. 13 (bottom). Compared to BAT(ET)+3D, BAT(BT)+3D achieves higher MT, MOTA and MOTP, but more IDS and FM. Our BAT+3D integrates both of their advantages and achieves better performance: it improves MT by 13.8% and MOTA by 7.2%, and reduces PT by 18.2%, compared to the second best.

6.3.2.2.4. Summary. As described earlier, the applicability of Block Tracking alone is limited because it requires good initialization, which is difficult to achieve in occluded scenes. Comparing BAT+2D with [7] on CAVIAR1, and BAT+3D with BAT(ET)+3D on the other three datasets (our dataset, S02 and CAVIAR2), we can conclude that combining block and ensemble information improves tracking performance. From the experiments in Figs. 12 and 13, we can see that our proposed detection and tracking system works robustly in scenes with heavy occlusions and viewpoint changes.

    6.3.2.3. Sample results. Fig. 14 shows some tracking results of our tracking algorithm, where the green and red arrows mark ID switches, the purple dotted ellipses mark missed or lost targets, and the blue arrow marks false alarms. Panels (a) and (b) illustrate scenes with targets walking against a crowd; we compare [7] (top) with our approach (bottom). Our method consistently tracks these objects, while [7] suffers several ID switches, lost targets and false alarms. Panel (c) features a subway scene with many walking people, where the occlusions are very severe and the viewpoint is slightly changed; our tracker succeeds in tracking many of them. Panel (d) shows a scene with several walking people where the viewpoint is extremely changed; our tracker tracks them successfully over the whole sequence.

    Fig. 14. Tracking results. (a) and (b) compare [7] (top) and our approach (bottom) on OneStopMoveEnter1cor of CAVIAR1 and on S02. (c) and (d) illustrate sample results on our dataset and on Meet_crowd of CAVIAR2, respectively. The layouts of (a) and (b) are already shown in Fig. 3(c); the layouts of (c) and (d) are illustrated at the far left. More descriptions can be found in Section 6.3.2.

    6.4. Discussions

    6.4.1. Parameters

    There are some parameters in the SAD and BAT, listed in Fig. 15 with corresponding descriptions and default values. The effects of the key parameters on our framework are as follows. For SAD, the parameters TM, Tadd and Tdel directly affect the post-processing of detection: a smaller TM raises the probability of adding or deleting a detected response at each step, a smaller Tadd adds more objects, and a larger Tdel deletes more objects. For BAT, NO and the temporal consistency parameter are key. A larger NO (i.e. more particles) can improve the performance but costs more time. A larger temporal consistency parameter enforces more consistency across the video, which improves tracking when detection is not so accurate, especially on CAVIAR2 because of its large viewpoint changes. We therefore keep most parameters at their defaults, which are relatively robust across the experiments, except that we set NO = 300 and the temporal consistency parameter to 5 for CAVIAR2.

    Fig. 15. Default parameters.
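    To make the add/delete logic concrete, the following C++ sketch mimics the thresholding described above. The parameter names TM, Tadd and Tdel follow Fig. 15, but the surrounding structure, the default values and the assumption that matching (against TM) happens upstream are all hypothetical illustrations, not the actual implementation.

    #include <vector>

    // Illustrative sketch only: names follow Fig. 15, values are placeholders.
    struct Response { double score = 0.0; bool matched = false; };

    struct SADPostProcess {
        double TM   = 0.5;  // match threshold between detections and objects;
                            // a smaller TM lets add/delete decisions fire more
                            // often (matching itself is assumed to happen upstream)
        double Tadd = 0.6;  // an unmatched detection scoring above Tadd spawns
                            // a new object (smaller Tadd => more objects added)
        double Tdel = 0.3;  // an unmatched object scoring below Tdel is dropped
                            // (larger Tdel => more objects deleted)

        // One post-processing step over the current frame's detections.
        void step(const std::vector<Response>& detections,
                  std::vector<Response>& objects) const {
            // Add: promote confident unmatched detections to new objects.
            for (const auto& d : detections)
                if (!d.matched && d.score > Tadd) objects.push_back(d);
            // Delete: drop unmatched objects whose support has decayed.
            for (auto it = objects.begin(); it != objects.end(); )
                if (!it->matched && it->score < Tdel) it = objects.erase(it);
                else ++it;
        }
    };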

    6.4.2. Processing speeds

    The entire system is implemented in a single thread in C++, without special code optimization and without GPU processing. On a workstation with an Intel Core(TM)2 2.33 GHz CPU and 2 GB of memory, we achieve processing speeds of 2.7–15 fps (depending on the video size, the number of objects and the viewpoint change), as shown in Fig. 16, compared with detection alone. The current bottleneck is the detection stage. As not all speedup possibilities have been explored yet, the current run-time raises hope that online experiments in real-world applications are not too far away.

    Fig. 16. Speed comparisons of detection and tracking (ms).
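    Per-frame numbers of this kind can be gathered by timing the pipeline directly; the sketch below is a minimal illustration, where processFrame and the sequence length are hypothetical stand-ins for the actual detection and tracking pass.

    #include <chrono>
    #include <cstdio>

    // Hypothetical stand-in for one detection + tracking pass over a frame.
    static void processFrame() { /* detection + tracking work */ }

    int main() {
        using clock = std::chrono::steady_clock;
        const int numFrames = 500;            // illustrative sequence length
        const auto start = clock::now();
        for (int i = 0; i < numFrames; ++i) processFrame();
        const double totalMs =
            std::chrono::duration<double, std::milli>(clock::now() - start).count();
        std::printf("avg %.1f ms/frame => %.1f fps\n",
                    totalMs / numFrames, 1000.0 * numFrames / totalMs);
        return 0;
    }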

    6.4.3. Failure cases

    Objects are initialized by detection in our system, so detection failures (e.g. missed detections and false alarms) cannot be avoided in the tracking process. If an initialized object is not accurate, such as object 8 in Frame 20 of Fig. 14(c), it drifts easily and tends to be lost. In particular, camera models have a large impact on detection in viewpoint-changing scenes: badly estimated camera parameters lead to unexpected detection results. Besides, our system cannot handle the near-vertical viewpoint where the camera is directly above the objects, since it is impossible to recover the objects' frontal viewpoint in this situation, as pointed out in [10].

    7. Conclusion

    In this paper, we propose a robust system for multi-object detection and tracking in surveillance scenes with occlusions and viewpoint changes. Our SAD achieves robust detection through: (1) camera models to cope with viewpoint changes; (2) a structural filter approach to handle occlusions; and (3) foreground-aware pruning



    and foreground-aware merging with the aid of some scene models.

    Our BAT, which formulates tracking as a block assignment process, can track objects robustly even when all the part detectors fail, as long as the object has assigned blocks. Its key factors are: (1) Block Tracking, to maintain the spatial and temporal consistency of labels; (2) Ensemble Tracking, to precisely estimate the locations and sizes of objects; and (3) Ensemble-to-Block Assignment, to keep the blocks with the same label looking like a part of a human.
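    Read procedurally, these three factors amount to the following per-frame outline; every type and function name below is an illustrative paraphrase of the steps above, not the authors' code.

    #include <vector>

    // Hypothetical per-frame outline of BAT; all names are illustrative.
    struct Frame {};
    struct Object {};
    struct BlockMap {};

    void trackBlocks(const Frame&, BlockMap&) {}             // factor (1)
    void ensembleTrack(const Frame&, Object&) {}             // factor (2)
    void assignEnsembleToBlocks(const std::vector<Object>&,
                                BlockMap&) {}                // factor (3)

    void batFrame(const Frame& frame,
                  std::vector<Object>& objects, BlockMap& blocks) {
        // (1) Block Tracking: keep block labels spatially/temporally consistent.
        trackBlocks(frame, blocks);
        // (2) Ensemble Tracking: estimate each object's location and size.
        for (auto& obj : objects) ensembleTrack(frame, obj);
        // (3) Ensemble-to-Block Assignment: re-label blocks so that blocks
        //     sharing a label still look like a part of a human.
        assignEnsembleToBlocks(objects, blocks);
    }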

    Although our method tracks remarkably well even through occlusions and viewpoint changes, one unavoidable drawback is fuzzy object boundaries. To overcome this, we could learn and extract discriminative patches to represent and track objects. Another drawback is that the tracking results jitter, which could be remedied by estimating object trajectories. For detection improvement, online algorithms could make the offline, general detectors adapt to a fixed scene. Although the current system only considers humans, the proposed mechanism can easily be extended to other kinds of objects. Based on the detection and tracking results, some high-level analysis of object behaviors becomes possible. Furthermore, we hope to make our approach applicable to real-world needs.

    Acknowledgements

    This work is supported in part by the National Science Foundation of China under grant No. 61075026 and the National Basic Research Program of China under grant No. 2011CB302203. Mr. Shihong Lao is partially supported by the R&D Program for Implementation of Anti-Crime and Anti-Terrorism Technologies for a Safe and Secure Society, Special Coordination Fund for Promoting Science and Technology of MEXT, the Japanese Government.

    Appendix A. Supplementary data

    Supplementary data to this article can be found online at doi:10.1016/j.imavis.2012.02.008.

    References

    [1] P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., Kauai, HI, USA, 2001, pp. I-511–I-518.
    [2] B. Wu, R. Nevatia, Detection of multiple, partially occluded humans in a single image by Bayesian combination of edgelet part detectors, in: Proc. IEEE Int. Conf. Comput. Vis., Beijing, China, 2005, pp. 90–97.
    [3] C. Huang, R. Nevatia, High performance object detection by collaborative learning of joint ranking of granules features, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., San Francisco, California, USA, 2010, pp. 41–48.
    [4] G. Duan, H. Ai, S. Lao, A structural filter approach to human detection, in: Proc. Eur. Conf. Comput. Vis., Crete, Greece, 2010, pp. 238–251.
    [5] S. Kamijo, Y. Matsushita, K. Ikeuchi, M. Sakauchi, Traffic monitoring and accident detection at intersections, IEEE Trans. Intell. Transp. Syst. 1 (2000) 108–118.
    [6] T. Zhao, R. Nevatia, Tracking multiple humans in complex situations, IEEE Trans. Pattern Anal. Mach. Intell. 26 (2004) 1208–1221.
    [7] J. Xing, H. Ai, S. Lao, Multi-object tracking through occlusions by local tracklets filtering and global tracklets association with detection responses, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., Miami, FL, USA, 2009, pp. 1200–1207.
    [8] M. Andriluka, S. Roth, B. Schiele, People-tracking-by-detection and people-detection-by-tracking, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., Anchorage, Alaska, USA, 2008, pp. 1–8.
    [9] X. Wang, T.X. Han, S. Yan, An HOG-LBP human detector with partial occlusion handling, in: Proc. IEEE Int. Conf. Comput. Vis., Kyoto, Japan, 2009, pp. 32–39.
    [10] Y. Li, B. Wu, R. Nevatia, Human detection by searching in 3D space using camera and scene knowledge, in: Proc. IEEE Int. Conf. Image Process., Tampa, Florida, USA, 2008, pp. 1–5.
    [11] G. Duan, H. Ai, S. Lao, Human detection in video over large viewpoint changes, in: Proc. IEEE Asi. Conf. Comput. Vis., Queenstown, New Zealand, 2010, pp. 683–696.
    [12] P. Felzenszwalb, D. McAllester, D. Ramanan, A discriminatively trained, multiscale, deformable part model, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., Anchorage, Alaska, USA, 2008, pp. 1–8.
    [13] C. Beleznai, H. Bischof, Fast human detection in crowded scenes by contour integration and local shape estimation, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., Miami, FL, USA, 2009, pp. 2246–2253.
    [14] D. Hoiem, A.A. Efros, M. Hebert, Putting objects in perspective, Int. J. Comput. Vis. 80 (2008) 3–15.
    [15] C. Huang, B. Wu, R. Nevatia, Robust object tracking by hierarchical association of detection responses, in: Proc. Eur. Conf. Comput. Vis., Marseille, France, 2008, pp. 788–801.
    [16] A. Senior, Tracking with probabilistic appearance models, in: ECCV Workshop on Performance Evaluation of Tracking and Surveillance Systems, Copenhagen, Denmark, 2002, pp. 48–55.
    [17] D. Comaniciu, V. Ramesh, P. Meer, Kernel-based object tracking, IEEE Trans. Pattern Anal. Mach. Intell. 25 (2003) 564–577.
    [18] P. Fieguth, D. Terzopoulos, Color based tracking of heads and other mobile objects at video frame rates, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., San Juan, Puerto Rico, 1997, pp. 21–27.
    [19] M. Isard, A. Blake, Contour tracking by stochastic propagation of conditional density, in: Proc. Eur. Conf. Comput. Vis., Cambridge, UK, 1996, pp. 343–356.
    [20] J.C. Clarke, A. Zisserman, Detection and tracking of independent motion, Image Vis. Comput. 14 (1996) 565–572.
    [21] M.D. Rodriguez, M. Shah, Detecting and segmenting humans in crowded scenes, in: Proc. IEEE Int. Conf. Multimed., Augsburg, Germany, 2007, pp. 353–356.
    [22] P. Kelly, N.E. O'Connor, A.F. Smeaton, Robust pedestrian detection and tracking in crowded scenes, Image Vis. Comput. 27 (2009) 1445–1458.
    [23] M. Isard, A. Blake, Condensation-conditional density propagation for visual tracking, Int. J. Comput. Vis. 28 (1998) 5–28.
    [24] B. Wu, R. Nevatia, Detection and tracking of multiple, partially occluded humans by Bayesian combination of edgelet based part detectors, Int. J. Comput. Vis. 75 (2007) 247–266.
    [25] H. Jiang, S. Fels, J.J. Little, A linear programming approach for multiple object tracking, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., Minneapolis, MN, USA, 2007, pp. 1–8.
    [26] L. Zhang, Y. Li, R. Nevatia, Global data association for multi-object tracking using network flows, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., Anchorage, Alaska, USA, 2008, pp. 1–8.
    [27] H. Pirsiavash, D. Ramanan, C.C. Fowlkes, Globally-optimal greedy algorithms for tracking a variable number of objects, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., Colorado Springs, CO, USA, 2011, pp. 1201–1208.
    [28] S. Avidan, Ensemble tracking, IEEE Trans. Pattern Anal. Mach. Intell. 29 (2007) 261–271.
    [29] H. Grabner, M. Grabner, H. Bischof, Real-time tracking via on-line boosting, in: British Machine Vision Conference, Edinburgh, UK, 2006.
    [30] H. Grabner, C. Leistner, H. Bischof, Semi-supervised on-line boosting for robust tracking, in: Proc. Eur. Conf. Comput. Vis., Marseille, France, 2008, pp. 234–247.
    [31] B. Babenko, M.-H. Yang, S. Belongie, Visual tracking with online multiple instance learning, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., Miami, FL, USA, 2009, pp. 983–990.
    [32] J. Xing, L. Liu, H. Ai, Background subtraction through multiple life span modeling, in: Proc. IEEE Int. Conf. Image Process., Brussels, Belgium, 2011.
    [33] B. Wu, R. Nevatia, Y. Li, Segmentation of multiple, partially occluded objects by grouping, merging, assigning part detection responses, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., Anchorage, Alaska, USA, 2008, pp. 1–8.
    [34] G. Duan, C. Huang, H. Ai, S. Lao, Boosting associated pairing comparison features for pedestrian detection, in: Proc. IEEE Workshop Visual Surveillance, Kyoto, Japan, 2009, pp. 1097–1104.
    [35] S. Geman, D. Geman, Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images, IEEE Trans. Pattern Anal. Mach. Intell. 6 (1984) 721–741.
    [36] Y. Li, C. Huang, R. Nevatia, Learning to associate: hybrid boosted multi-target tracker for crowded scene, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., Miami, FL, USA, 2009, pp. 2953–2960.
    [37] K. Bernardin, R. Stiefelhagen, Evaluating multiple object tracking performance: the CLEAR MOT metrics, J. Image Video Process. 2008 (2008).
    [38] W. Gao, H. Ai, S. Lao, Adaptive contour features in oriented granular space for human detection and segmentation, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recogni., Miami, FL, USA, 2009, pp. 1786–1793.

