
Segmentation and Tracking of Multiple Humans in Crowded Environments

Tao Zhao, Member, IEEE, Ram Nevatia, Fellow, IEEE, and Bo Wu, Student Member, IEEE

Abstract—Segmentation and tracking of multiple humans in crowded situations is made difficult by interobject occlusion. We propose a model-based approach to interpret the image observations by multiple partially occluded human hypotheses in a Bayesian framework. We define a joint image likelihood for multiple humans based on the appearance of the humans, the visibility of the body obtained by occlusion reasoning, and foreground/background separation. The optimal solution is obtained by using an efficient sampling method, data-driven Markov chain Monte Carlo (DDMCMC), which uses image observations for proposal probabilities. Knowledge of various aspects, including human shape, the camera model, and image cues, is integrated in one theoretically sound framework. We present experimental results and a quantitative evaluation, demonstrating that the resulting approach is effective for very challenging data.

Index Terms—Multiple human segmentation, multiple human tracking, Markov chain Monte Carlo.


1 INTRODUCTION AND MOTIVATION

SEGMENTATION and tracking of humans in video sequences is important for a number of applications, such as visual surveillance and human-computer interaction. This has been a topic of considerable research in the recent past, and robust methods exist for tracking isolated humans, or a small number of them, for which only transient occlusion exists. However, tracking in a more crowded situation, where several people are present and occlusion is persistent, remains challenging. The goal of this work is to develop a method to detect and track humans in the presence of persistent and temporarily heavy occlusion. We do not require that humans be isolated, that is, unoccluded, when they first enter the scene. However, in order to "see" a person, we require that at least the head-shoulder region be visible. We assume a stationary camera so that motion can be detected by comparison with a background model. We do not require the foreground detection to be perfect, e.g., the foreground blobs may be fragmented, but we assume that there are no significant false alarms due to shadows, reflections, or other causes. We also assume that the camera model is known and that people walk on a known ground plane.

Fig. 1a shows a sample frame of a crowded environment, and Fig. 1b shows the motion blobs detected by comparison with the learned background. It is apparent that segmenting humans from such blobs is not straightforward. One blob may include multiple objects, while one object may split into multiple blobs. Blob tracking over extended periods, e.g., [20], may resolve some of these ambiguities, but such approaches are likely to fail when occlusion is persistent. Some approaches have been developed to handle occlusion, for example, [9], but they require the objects to be initialized before occlusion happens, which is usually infeasible for a crowded scene. We believe that the use of a shape model is necessary to achieve individual human segmentation and tracking in crowded scenes.

In earlier related work [54], Zhao and Nevatia modeled the human body as a 3D ellipsoid, and human hypotheses were proposed based on head-top detection from foreground boundary peaks. This method works reasonably well in the presence of partial occlusion if the number of people in the field of view is small. As the complexity of the scene grows, head tops cannot be obtained by simple foreground boundary analysis, and more complex shape models are needed to fit the observed shapes more accurately. Also, joint reasoning about the collection of objects is needed, rather than the simpler one-by-one verification of [54]. The consequence of this joint consideration is that the optimal solution has to be computed in the joint parameter space of all of the objects. To track the objects over multiple frames, temporal coherence is another desired property besides the accuracy of the spatial segmentation. We adapt a data-driven Markov chain Monte Carlo (MCMC) approach to explore this complex solution space. To improve computational efficiency, we use direct image features from bottom-up image analysis as importance proposal probabilities to guide the moves of the Markov chain. The main features of this work include

1. a three-dimensional part-based human body model, which enables the segmentation and tracking of humans in 3D and the inference of interobject occlusion naturally,

1198 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 30, NO. 7, JULY 2008

. T. Zhao is with Intuitive Surgical Inc., 950 Kifer Road, Sunnyvale, CA 94086. E-mail: [email protected].

. R. Nevatia and B. Wu are with the Institute for Robotics and Intelligent Systems, USC Viterbi School of Engineering, University of Southern California, 3737 Watt Way, Los Angeles, CA 90089. E-mail: {nevatia, bowu}@usc.edu.

Manuscript received 18 Sept. 2006; revised 24 Apr. 2007; accepted 13 Aug. 2007; published online 31 Aug. 2007. Recommended for acceptance by C. Kambhamettu. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TPAMI-0668-0906. Digital Object Identifier no. 10.1109/TPAMI.2007.70770.

0162-8828/08/$25.00 © 2008 IEEE. Published by the IEEE Computer Society.


2. a Bayesian framework that integrates segmentation and tracking based on a joint likelihood for the appearance of multiple objects,

3. the design of efficient Markov chain dynamics, directed by proposal probabilities based on image cues, and

4. the incorporation of a color-based background model in a mean-shift tracking step.

Our method is able to successfully detect and track humans in scenes of the complexity shown in Fig. 1 with high detection and low false-alarm rates; the tracking results for the frame in Fig. 1a are shown in Fig. 1c (the result includes the integration of multiple frames during tracking). In Section 6, we give graphical and quantitative results on a number of sequences. Parts of our system have been described in [53] and [55]; this paper provides a unified presentation of the methodology, additional results, and discussions. This approach has been built on by other researchers, for example, [41]. The same framework has also been successfully applied to vehicle segmentation and tracking in challenging cases [43].

The rest of the paper is organized as follows: Section 2 gives a brief review of related work. Section 3 presents an overview of our method. Section 4 describes the probabilistic modeling of the problem. Section 5 describes our MCMC-based solution. Section 6 shows experimental results and evaluation. Conclusions and discussions are given in the last section.

2 RELATED WORK

We summarize related work in this section; some of it is referred to in more detail in the following sections. Due to the amount of literature in this field, it is not possible for us to provide a comprehensive survey, but we attempt to include the major trends.

The observations for human hypotheses may come from multiple cues. Many previous approaches [20], [9], [54], [37], [44], [15], [18], [40], [24], [3], [45] use motion blobs detected by comparing pixel colors in a frame to learned models of the stationary background. When the scene is not highly crowded, most parts of the humans in the scene are detected in the foreground motion blobs; multiple humans may be merged into a single blob, but they can be separated by rather simple processing. For example, Haritaoglu et al. [15] use the vertical projection of a blob to help segment a big blob into multiple humans. Siebel and Maybank [40] and Zhao and Nevatia [54] detect head candidates by analyzing the foreground boundaries. Since different humans have small overlapping foreground regions, they can be segmented in a greedy way. However, the utility of these methods in crowded environments such as in Fig. 1 is likely to be limited.
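The vertical-projection heuristic of [15] can be sketched in a few lines. The code below is an illustrative reimplementation, not the authors' code; the simple flat-top peak test and the `min_separation` threshold are our assumptions.

```python
import numpy as np

def vertical_projection_peaks(blob_mask, min_separation=20):
    """Segment a wide foreground blob into person candidates by finding
    peaks in its vertical projection (column-wise foreground pixel count),
    in the spirit of [15]. A sketch, not the original implementation."""
    projection = blob_mask.sum(axis=0)  # foreground pixels per image column
    peaks = []
    for x in range(1, len(projection) - 1):
        # flat-topped local maximum: non-decreasing on the left, falling on the right
        if projection[x] >= projection[x - 1] and projection[x] > projection[x + 1]:
            if not peaks or x - peaks[-1] >= min_separation:
                peaks.append(x)
    return peaks

# Two merged silhouettes in one blob produce two projection peaks:
mask = np.zeros((40, 100), dtype=np.uint8)
mask[10:40, 20:35] = 1   # person 1
mask[5:40, 55:72] = 1    # person 2
print(vertical_projection_peaks(mask))
```

As the surrounding text notes, such column-wise analysis breaks down once silhouettes overlap heavily, which motivates the model-based approach of this paper.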

Some methods, for example, [50], [31], [7], [13], detect appearance or shape-based patterns of humans directly. Those in [50] and [31] learn human detectors from local shape features; those in [7] and [13] build contour templates for pedestrians. These learning-based methods need a large number of training samples and may be sensitive to imaging viewpoint variations, as they learn 2D patterns. Besides motion and shape, face and skin color are also useful cues for human detection, but the environments where these cues can be utilized are limited, usually to indoor scenes where illumination is controlled and the objects are imaged at high resolution, for example, [42] and [12].

Without a specific model of objects, tracking methods are limited to blob tracking, for example, [3]. The main advantage of model-based tracking is that it can solve the blob merge and split problems by enforcing a global shape constraint. The shape models can be either parametric, for example, an ellipsoid as in [54], or nonparametric, for example, the edge template as in [13], and either in 2D, for example, [46], or in 3D, for example, [54]. Parametric models are usually generative and of high dimensionality, while nonparametric models are usually learned from real samples. Two-dimensional models make the matching of hypotheses and image observations straightforward, while 3D models are more natural for occlusion reasoning. The choice of model complexity depends on both the application and the video resolution. For human tracking from a mid-distant camera, we do not need to capture detailed body articulation; a rough body model, such as the generic cylinder in [19], the ellipsoid in [54], or the multiple rectangles in [46], suffices. When the body pose of humans is desired and the video resolution is high enough, more complex models can be used, such as the articulated models in [54] and [34].

Tracking of multiple objects requires matching hypotheses with observations both spatially and temporally. When objects are highly interoccluded, their image observations are far from independent; hence, a joint likelihood for multiple objects is necessary [46], [27], [19], [35], [30], [51]. Smith et al. [41] use a pairwise Markov Random Field (MRF) to model the interaction between humans and define the joint likelihood. Rittscher et al. [36] include in the state vector a hidden variable which indicates a global mapping from the observed features to human hypotheses.


Fig. 1. A sample frame, the corresponding motion blobs, and our segmentation and tracking result for a crowded situation. (a) Sample frame. (b) Motion blobs. (c) Our result.


As the solution space is of high dimension, searching for the best interpretation by brute force is not feasible. Particle filter-based methods, for example, [19], [46], [30], [51], [27], become unsuitable when the dimensionality of the search space is high, as the number of samples needed usually grows exponentially with the dimension. The methods in [41], [21] use variations of the MCMC algorithm to sample the solution space, while those in [45], [36] use an EM-style method. For efficiency, the candidate solutions can be generated from image cues rather than purely randomly; for example, the work in [36] proposes hypotheses from local silhouette features.

Information from multiple cameras with overlapping views can reduce the ambiguity of a single camera. Such methods usually assume that the object can be detected successfully from at least one viewpoint (for example, [11]) or that many cameras are available for 3D reconstruction (for example, [28]). The difficulty in segmenting multiple humans that overlap in images from a stereo camera is alleviated by analyzing where in the 3D space they are separable [52]. In a multicamera context, an object can be tracked even when it is fully occluded from some of the views; however, many real environments do not permit the use of multiple cameras with overlapping views. In this paper, we consider situations where video from only one camera is available. However, our approach can utilize multiple cameras with little modification.

MCMC-based methods are gaining popularity for computer vision problems due to their flexibility in optimizing an arbitrary energy function, as opposed to energy functions of a specific type as in graph cut [2] or belief propagation [49]. They have been used for various applications, including segmenting multiple cells [38], image parsing [48], multiobject tracking [21], estimating articulated structures [23], and so forth. Data-driven MCMC was proposed in [48] to utilize bottom-up image cues to speed up the sampling process.

We want to point out the difference between our approach and another independently developed work [21] that also used MCMC for multiobject tracking. The work in [21] assumes that the objects do not overlap, applying a penalty term for overlap, while our approach explicitly uses a likelihood of appearance under occlusion. Our approach focuses on the domain of tracking humans, the most important subject for visual surveillance. We consider the three-dimensional perspective effect of a typical camera setting, while the ant tracking problem described in [21] is almost a 2D problem. We utilize the acquired appearance, where each object has a different appearance, while the ants in [21] are assumed to have the same appearance. We developed a full set of effective bottom-up cues for human segmentation and hypothesis generation.

3 OVERVIEW

Our approach to segmenting and tracking multiple humans emphasizes the use of shape models. An overview diagram is given in Fig. 2. Based on a background model, the foreground blobs are extracted as the basic observation. Using the camera model and the assumption that objects move on a known ground plane, multiple 3D human hypotheses are projected onto the image plane and matched with the foreground blobs. Since the hypotheses are in 3D, occlusion reasoning is straightforward. In one frame, we segment the foreground blobs into multiple humans and associate the segmented humans with the existing trajectories. Then, the tracks are used to propose human hypotheses in the next frame. Segmentation and tracking are integrated in a unified framework and interoperate over time.

We formulate the problem of segmentation and tracking as one of Bayesian inference: find the best interpretation given the image observations, the prior models, and the estimates from the previous frame's analysis (that is, the maximum a posteriori (MAP) estimate). The state to be estimated at each frame includes the number of objects, their correspondences to the objects in the previous frame (if any), their parameters (for example, positions), and the uncertainty of the parameters. We define a color-based joint likelihood model that considers all of the objects and the background together and encodes both the constraint that an object should be different from the background and the constraint that it should be similar to its correspondence. Using this likelihood model gracefully integrates segmentation and tracking and avoids a separate, sometimes ad hoc, initialization step. Given multiple human hypotheses, interobject occlusion reasoning is done before calculating the joint image likelihood; the occluded parts of a human should not have corresponding image observations.

The solution space contains subspaces of varying dimensions, each corresponding to a different number of objects. The state vector consists of both discrete and continuous variables. This rules out many optimization techniques. Therefore, we use a highly general reversible jump/diffusion MCMC-based method to compute the MAP estimate. We design dynamics for the multiobject tracking problem and use various direct image features to make the Markov chain more efficient. Direct image features alone do not guarantee optimality because they are usually computed locally or using partial cues. Using them as proposal probabilities of the Markov chain results in an integrated top-down/bottom-up approach that has both the computational efficiency of image features and the optimality of a Bayesian formulation. A mean-shift technique [5] is used as an efficient diffusion for the Markov chain. The data-driven dynamics and the in-depth exploration of the solution space make the approach less sensitive to dimensionality compared to particle filters. Our experiments show that the described approach works robustly in very challenging situations with affordable computation; some results are shown in Section 6.
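The Metropolis-Hastings core of such a data-driven sampler can be sketched as follows. This is a generic illustration under our own assumptions, not the paper's reversible-jump implementation: the toy 1D Gaussian target stands in for the multihuman posterior, and a symmetric random walk stands in for the image-cue-driven proposals.

```python
import math, random

def mh_step(state, log_posterior, propose):
    """One Metropolis-Hastings move. propose(state) returns
    (candidate, log_q_forward, log_q_backward); with data-driven
    proposals, the bottom-up image cues shape q while the posterior
    remains the target, so the chain stays theoretically sound."""
    cand, log_qf, log_qb = propose(state)
    log_alpha = log_posterior(cand) - log_posterior(state) + log_qb - log_qf
    if math.log(random.random()) < min(0.0, log_alpha):
        return cand
    return state

# Toy run: standard-normal target, symmetric random-walk proposal
# (so log_qb - log_qf = 0). The chain drifts from a bad start toward
# the high-posterior region, which is the behavior the paper relies on.
random.seed(0)
log_post = lambda x: -0.5 * x * x
prop = lambda x: (x + random.gauss(0.0, 1.0), 0.0, 0.0)
x, samples = 10.0, []
for i in range(5000):
    x = mh_step(x, log_post, prop)
    if i >= 2500:
        samples.append(x)
print(abs(sum(samples) / len(samples)) < 0.5)
```

In the paper, the proposals additionally add and delete object hypotheses (reversible jumps) and diffuse object parameters, with proposal probabilities derived from bottom-up image features.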


Fig. 2. Overview diagram of our approach.


4 PROBABILISTIC MODELING

Let $\theta$ represent the state of the objects in the scene at time $t$; it consists of the number of objects in the scene, their 3D positions, and other parameters describing their size, shape, and pose. Our goal is to estimate the state at time $t$, $\theta^{(t)}$, given the image observations $I^{(1)}, \ldots, I^{(t)}$, abbreviated as $I^{(1,\ldots,t)}$. We formulate the tracking problem as computing the MAP estimate, $\theta^{(t)*}$:

$$\theta^{(t)*} = \arg\max_{\theta^{(t)} \in \Theta} P\big(\theta^{(t)} \mid I^{(1,\ldots,t)}\big) = \arg\max_{\theta^{(t)} \in \Theta} \Big\{ P\big(I^{(t)} \mid \theta^{(t)}\big)\, P\big(\theta^{(t)} \mid I^{(1,\ldots,t-1)}\big) \Big\}, \quad (1)$$

where $\Theta$ is the solution space. Denote by $\mathbf{m}$ the state vector of one individual object. A state containing $n$ objects can be written as $\theta = \{(k_1, \mathbf{m}_1), \ldots, (k_n, \mathbf{m}_n)\} \in \Theta_n$, where $k_i$ is the unique identity of the $i$th object whose parameters are $\mathbf{m}_i$, and $\Theta_n$ is the solution space of exactly $n$ objects. The entire solution space is $\Theta = \bigcup_{n=0}^{N_{\max}} \Theta_n$, where $N_{\max}$ is the upper bound on the number of objects. In practice, we compute an approximation of $P(\theta^{(t)} \mid I^{(1,\ldots,t-1)})$ (details are given later in Section 4.4).

4.1 3D Human Shape Model

The parameters of an individual human, $\mathbf{m}$, are defined based on a 3D human shape model. The human body is highly articulated; however, in our case, human motion is mostly limited to standing or walking, and we do not attempt to capture the detailed shape and articulation parameters of the body. Thus, we use a number of low-dimensional models to capture the gross shape of human bodies (Fig. 3).

Ellipsoids fit human body parts well and have the property that their projection is an ellipse with a convenient form [16]. Therefore, we model human shape by a composition of multiple ellipsoids, corresponding to the head, the torso, and the legs, with a fixed spatial relationship. A few such models in characteristic poses are sufficient to capture the gross shape variations of most humans in the scene for midresolution images. We use the multi-ellipsoid model to control the model complexity while maintaining a reasonable level of fidelity. We used three such models (one for legs close to each other and two for legs well split) in our previous work on multihuman segmentation [53]. However, in this work, we use only a single model with three ellipsoids, which we found sufficient for tracking.

The model is controlled by two parameters called size and thickness. The size parameter is the 3D height of the model; it also controls the overall scaling of the object in the three directions. The thickness parameter captures extra scaling in the horizontal directions. Besides size and thickness, the parameters also include the image position of the head,¹ the 3D orientation of the body, and the 2D inclination of the body. The orientations of the models are quantized into a few levels for computational efficiency. The origin of the rotation is chosen so that 0 degrees corresponds to a human facing the camera. We use 0 and 90 degrees to represent the front/back and side views in this work. The 3D models assume that humans are perfectly upright, but there is a chance that the body may be inclined slightly. We use one parameter to capture the inclination in 2D (as opposed to two parameters in 3D). Therefore, the parameters of the $i$th human are $\mathbf{m}_i = \{o_i, x_i, y_i, h_i, f_i, i_i\}$, which are orientation, position, size, thickness, and inclination, respectively. We also write $(x_i, y_i)$ as $\mathbf{u}_i$.

With a given camera model and a known ground plane, the 3D shape models automatically incorporate the perspective effect of camera projection (change in object image size and shape due to change in object position and/or camera viewpoint). Compared to 2D shape models (for example, [13]) or prelearned 2D appearance models (for example, [50]), the 3D models are more easily applicable to a novel viewpoint.
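The head-position relation of footnote 1 can be made concrete with a small numerical sketch. The camera matrix below is entirely hypothetical (a camera 3 m above a $z = 0$ ground plane looking along $+y$), chosen only to illustrate the mapping; it is not from the paper.

```python
import numpy as np

# Hypothetical camera: intrinsics K; world z-up; ground plane z = 0;
# camera 3 m above the ground looking along +y. Illustrative numbers only.
K = np.array([[800.,   0., 320.],
              [  0., 800., 240.],
              [  0.,   0.,   1.]])
Rt = np.array([[1., 0.,  0., 0.],
               [0., 0., -1., 3.],
               [0., 1.,  0., 0.]])
P = K @ Rt  # 3x4 camera projection matrix

def head_image_position(xw, yw, h, P):
    """Footnote 1: [x, y, 1]^T ~ [p1, p2, p3*h + p4] [xw, yw, 1]^T maps a
    ground-plane position (xw, yw) to the image position of the head of a
    person of height h, where p_i is the ith column of P."""
    p1, p2, p3, p4 = P.T                      # columns of P
    H = np.column_stack([p1, p2, p3 * h + p4])
    x, y, w = H @ np.array([xw, yw, 1.0])
    return x / w, y / w

print(head_image_position(0.0, 10.0, 1.7, P))  # a 1.7 m person 10 m away
```

The same homography-like matrix $[\mathbf{p}_1, \mathbf{p}_2, \mathbf{p}_3 h + \mathbf{p}_4]$ is what makes the projected model size shrink automatically with distance, which is the perspective effect the paragraph describes.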

4.2 Object Appearance Model

Besides the shape model, we also use a color histogram of the object, $\tilde{\mathbf{p}} = \{\tilde{p}_1, \ldots, \tilde{p}_m\}$ (where $m$ is the number of bins of the color histogram), defined within the object shape, as a representation of its appearance; this helps establish correspondence in tracking. We use a color histogram because it is insensitive to the nonrigidity of human motion. Furthermore, there exists an efficient algorithm, the mean-shift technique [5], to optimize a histogram-based objective function. When calculating the color histogram, a kernel function $K_E(\cdot)$ with an Epanechnikov profile [5] is applied to weight pixel locations so that the center has a higher weight than the boundary. Such a representation has been used in [6]. Our implementation uses a single red, green, blue (RGB) histogram with 512 bins (eight for each dimension) over all of the samples within the three elliptic regions of our object model.
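A minimal sketch of such a kernel-weighted 512-bin histogram, assuming a rectangular patch for simplicity; the paper instead samples within the three projected elliptic regions of the body model.

```python
import numpy as np

def kernel_rgb_histogram(patch, bins_per_channel=8):
    """512-bin RGB histogram (8 bins per channel) of an image patch, with
    an Epanechnikov profile K_E(r^2) = max(1 - r^2, 0) weighting pixels by
    normalized distance from the patch center, as in mean-shift tracking [5].
    A sketch over a rectangle, not the paper's elliptic sampling."""
    hgt, wid, _ = patch.shape
    ys, xs = np.mgrid[0:hgt, 0:wid]
    cy, cx = (hgt - 1) / 2.0, (wid - 1) / 2.0
    r2 = ((ys - cy) / (hgt / 2.0)) ** 2 + ((xs - cx) / (wid / 2.0)) ** 2
    weights = np.maximum(1.0 - r2, 0.0)  # center weighted most, boundary ~0
    # quantize each channel to 8 levels and form a single 512-bin index
    idx = (patch.astype(int) // (256 // bins_per_channel)).reshape(-1, 3)
    flat = idx[:, 0] * bins_per_channel**2 + idx[:, 1] * bins_per_channel + idx[:, 2]
    hist = np.bincount(flat, weights=weights.ravel(),
                       minlength=bins_per_channel**3)
    return hist / hist.sum()  # normalized: bins sum to 1

patch = np.random.randint(0, 256, (40, 20, 3), dtype=np.uint8)
h = kernel_rgb_histogram(patch)
print(h.shape, round(float(h.sum()), 6))
```

The center weighting makes the histogram robust to boundary pixels that may belong to the background or to an occluder.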

4.3 Background Appearance Model

The background appearance model is a modified version of a Gaussian distribution. Denote by $(\mu_{rj}, \mu_{gj}, \mu_{bj})$ and $\Sigma_j = \mathrm{diag}\{\sigma_{rj}^2, \sigma_{gj}^2, \sigma_{bj}^2\}$ the mean and the covariance of the color at pixel $j$. The probability of pixel $j$ being from the background is


1. The image head location is an equivalent parameterization of the world location on the ground plane $(x_w, y_w)$ given the human height. The two are related by $[x, y, 1]^T \sim [\mathbf{p}_1, \mathbf{p}_2, \mathbf{p}_3 h + \mathbf{p}_4][x_w, y_w, 1]^T$, where $\mathbf{p}_i$ is the $i$th column of the camera projection matrix and $h$ is the height of the human. For clarity of presentation, we chose the ground plane to be $z = 0$.

Fig. 3. A number of 3D human models to capture the gross shape of human bodies.


$$P_b(I_j) = P_b(r_j, g_j, b_j) \propto \max\left\{ \exp\left[ -\left(\frac{r_j - \mu_{rj}}{\sigma_{rj}}\right)^2 - \left(\frac{g_j - \mu_{gj}}{\sigma_{gj}}\right)^2 - \left(\frac{b_j - \mu_{bj}}{\sigma_{bj}}\right)^2 \right],\ \epsilon \right\}, \quad (2)$$

where $\epsilon$ is a small constant. This is a composition of a Gaussian distribution and a uniform distribution. The uniform distribution captures the outliers that are not modeled by the Gaussian distribution, making the model more robust. The Gaussian parameters (mean and covariance) are updated continuously from the video stream using only the nonmoving regions. A more sophisticated background model (for example, a mixture of Gaussians [44] or a nonparametric model [10]) could be used to account for more variation, but this is not the focus of this work; we assume that comparison with a background model yields adequate foreground blobs.
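A per-pixel sketch of (2); the mean, variance, and $\epsilon$ values below are illustrative (the paper does not specify $\epsilon$).

```python
import numpy as np

def background_prob(pixel, mean, sigma, eps=1e-3):
    """Unnormalized background likelihood of (2): a Gaussian term in each
    of R, G, B, floored by a small constant eps. The floor plays the role
    of the uniform component that absorbs outliers. eps is illustrative."""
    z2 = ((np.asarray(pixel, dtype=float) - mean) / sigma) ** 2
    return max(float(np.exp(-z2.sum())), eps)

mean = np.array([100.0, 110.0, 90.0])    # learned background color at pixel j
sigma = np.array([5.0, 5.0, 5.0])        # learned per-channel std. deviations
near = background_prob([101, 109, 91], mean, sigma)   # background-like pixel
far = background_prob([200, 30, 40], mean, sigma)     # foreground-like pixel
print(near > far, far)
```

Without the floor, a single specular highlight or noise spike would drive the joint likelihood of an otherwise correct interpretation toward zero; the $\epsilon$ floor caps that penalty.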

4.4 The Prior Distribution

The prior distribution $P(\theta^{(t)} \mid I^{(1,\ldots,t-1)})$ is decomposed into two parts given by

$$P\big(\theta^{(t)} \mid I^{(1,\ldots,t-1)}\big) \propto P\big(\theta^{(t)}\big)\, P\big(\theta^{(t)} \mid I^{(1,\ldots,t-1)}\big). \quad (3)$$

$P(\theta^{(t)})$ is independent of time and is defined by $\prod_{i=1}^n P(|S_i|) P(\mathbf{m}_i)$, where $S_i$ is the projected image of the $i$th object and $|S_i|$ is its area. The prior on the image area, $P(|S_i|)$, is modeled as proportional to $\exp(-\lambda_1 |S_i|)[1 - \exp(-\lambda_2 |S_i|)]$.² The first term here penalizes a large total object size, to avoid situations where two hypotheses overlap a large portion of an image blob, while the second term penalizes objects with small image sizes, as they are more likely to be due to image noise. Although the prior on 2D image size could be converted to the 3D space, defining this prior in 2D is more natural because these properties model the reliability of image evidence independent of the camera models. The priors on the human body parameters are considered independent. Thus, we have $P(\mathbf{m}_i) = P(o_i) P(x_i, y_i) P(h_i) P(f_i) P(i_i)$. We set $P(o_{\mathrm{frontal}}) = P(o_{\mathrm{profile}}) = 1/2$. $P(x_i, y_i)$ is a uniform distribution over the image region where a human head is plausible. $P(h_i)$ is a Gaussian distribution $N(\mu_h, \sigma_h^2)$ truncated to the range $[h_{\min}, h_{\max}]$, and $P(f_i)$ is a Gaussian distribution $N(\mu_f, \sigma_f^2)$ truncated to the range $[f_{\min}, f_{\max}]$. $P(i_i)$ is a Gaussian distribution $N(\mu_i, \sigma_i^2)$. In our experiments, we use $\mu_h = 1.7$ m, $\sigma_h = 0.2$ m, $h_{\min} = 1.5$ m, $h_{\max} = 1.9$ m, $\mu_f = 1$, $\sigma_f = 0.2$, $f_{\min} = 0.8$, $f_{\max} = 1.2$, $\mu_i = 0$, and $\sigma_i = 3$ degrees. These parameters correspond to common adult body sizes.
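The shape of the image-area prior can be sketched as follows; the $\lambda$ values are our illustrative choices, as the paper does not give them in this section.

```python
import numpy as np

def area_prior(area, lam1=1e-4, lam2=1e-2):
    """Unnormalized P(|S_i|) ∝ exp(-λ1·|S_i|)·(1 - exp(-λ2·|S_i|)):
    the first factor penalizes very large projected areas, the second
    very small ones. λ1, λ2 here are illustrative, not from the paper."""
    area = np.asarray(area, dtype=float)
    return np.exp(-lam1 * area) * (1.0 - np.exp(-lam2 * area))

areas = np.array([10.0, 500.0, 50000.0])   # tiny, plausible, huge (pixels)
print(area_prior(areas))                   # the middle value scores highest
```

The product form gives the prior a single interior maximum, so both noise-sized hypotheses and hypotheses that swallow large portions of a blob are discouraged.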

We approximate the second term on the right side of (3), $P(\theta^{(t)} \mid I^{(1,\ldots,t-1)})$, by $P(\theta^{(t)} \mid \theta^{(t-1)})$, assuming that $\theta^{(t-1)}$ encodes the necessary information from the past observations. For convenience of expression, we rearrange $\theta^{(t)}$ and $\theta^{(t-1)}$ as $\tilde{\theta}^{(t)} = \{(\tilde{k}_i^{(t)}, \tilde{\mathbf{m}}_i^{(t)})\}_{i=1}^N$ and $\tilde{\theta}^{(t-1)} = \{(\tilde{k}_i^{(t-1)}, \tilde{\mathbf{m}}_i^{(t-1)})\}_{i=1}^N$, where $N$ is the overall number of objects present in the two frames, so that one of $\{\tilde{k}_i^{(t)} = \tilde{k}_i^{(t-1)},\ \tilde{\mathbf{m}}_i^{(t)} = \emptyset,\ \tilde{\mathbf{m}}_i^{(t-1)} = \emptyset\}$ is true for each $i$. $\tilde{k}_i^{(t)} = \tilde{k}_i^{(t-1)}$ means that object $\tilde{k}_i^{(t)}$ is a tracked object, $\tilde{\mathbf{m}}_i^{(t)} = \emptyset$ means that object $\tilde{k}_i^{(t-1)}$ is a dead object (that is, its trajectory is terminated), and $\tilde{\mathbf{m}}_i^{(t-1)} = \emptyset$ means that object $\tilde{k}_i^{(t)}$ is a new object. With the rearranged state vector, we have

$$P\big(\theta^{(t)} \mid \theta^{(t-1)}\big) = P\big(\tilde{\theta}^{(t)} \mid \tilde{\theta}^{(t-1)}\big) = \prod_{i=1}^N P\big(\tilde{\mathbf{m}}_i^{(t)} \mid \tilde{\mathbf{m}}_i^{(t-1)}\big).$$

The temporal prior of each object follows the definition

$$P\big(\tilde{\mathbf{m}}_i^{(t)} \mid \tilde{\mathbf{m}}_i^{(t-1)}\big) \propto \begin{cases} P_{\mathrm{assoc}}\big(\tilde{\mathbf{m}}_i^{(t)} \mid \tilde{\mathbf{m}}_i^{(t-1)}\big), & \tilde{k}_i^{(t)} = \tilde{k}_i^{(t-1)} \\ P_{\mathrm{new}}\big(\tilde{\mathbf{m}}_i^{(t)}\big), & \tilde{\mathbf{m}}_i^{(t-1)} = \emptyset \\ P_{\mathrm{dead}}\big(\tilde{\mathbf{m}}_i^{(t-1)}\big), & \tilde{\mathbf{m}}_i^{(t)} = \emptyset. \end{cases} \quad (4)$$

We assume that the position and the inclination of an objectfollow constant velocity models with Gaussian noise andthat the height and thickness follow a Gaussian distribution(for simplicity of presentation, we omit the velocity terms inthe state). We use Kalman filters for temporal estimation;Passoc is therefore a Gaussian distribution. Pnewð ~m

ðtÞi Þ ¼

Pnewð~uðtÞi Þ and Pdeadð ~mðt�1Þi Þ ¼ Pdeadð~uðt�1Þ

i Þ are the likeli-hoods of the initialization of a new track at position ~u

ðtÞi and

the termination of an existing track at position ~uðt�1Þi ,

respectively. They are set empirically according to thedistance of the object to the entrances/exits (the boundariesof the image and other areas that people move in/out of).PnewðuÞ � N ð�ðuÞ;�eÞ, where �ðuÞ is the location of theclosest entrance point to u and �e is its associatedcovariance matrix, which is set manually or through alearning phase. PdeadðÞ follows a similar definition.
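The text states that $P_{assoc}$ comes from Kalman filters with constant-velocity models. The sketch below is a minimal 1-D constant-velocity Kalman predict/update cycle, not the authors' implementation; the process-noise and measurement-noise values are illustrative assumptions:

```python
def cv_kalman_step(x, P, z, q=1e-2, r=1.0, dt=1.0):
    """One predict+update cycle of a 1-D constant-velocity Kalman filter.

    x = [pos, vel]; P = 2x2 covariance (list of lists); z = measured
    position. The predicted position and its variance are what would
    define a Gaussian association likelihood such as P_assoc.
    """
    # Predict: x' = F x, P' = F P F^T + q*I, with F = [[1, dt], [0, 1]]
    xp = [x[0] + dt * x[1], x[1]]
    Pp = [[P[0][0] + dt * (P[1][0] + P[0][1]) + dt * dt * P[1][1] + q,
           P[0][1] + dt * P[1][1]],
          [P[1][0] + dt * P[1][1],
           P[1][1] + q]]
    # Update with position measurement z (H = [1, 0])
    s = Pp[0][0] + r                      # innovation variance
    k0, k1 = Pp[0][0] / s, Pp[1][0] / s   # Kalman gain
    y = z - xp[0]                         # innovation
    xn = [xp[0] + k0 * y, xp[1] + k1 * y]
    Pn = [[(1 - k0) * Pp[0][0], (1 - k0) * Pp[0][1]],
          [Pp[1][0] - k1 * Pp[0][0], Pp[1][1] - k1 * Pp[0][1]]]
    return xn, Pn
```

Feeding it noiseless measurements from an object moving at a constant velocity, the state estimate converges to the true position and velocity.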

4.5 Joint Image Likelihood for Multiple Objects and the Background

The image likelihood $P(I|\theta)$ reflects the probability that we observe image $I$ (or some features extracted from $I$) given state $\theta$. Here, we develop a likelihood model based on the color information of the background and objects. Given a state vector $\theta$, we partition the image into different regions corresponding to the different objects and the background. Denote by $\tilde{S}_i$ the visible part of the $i$th object defined by $m_i$. The visible part of an object is determined by the depth order of all of the objects, which can be inferred from their 3D positions and the camera model. The entire object region $S = \cup_{i=1}^n S_i = \sum_{i=1}^n \tilde{S}_i$ since the $\tilde{S}_i$ are disjoint regions. We use $\bar{S}$ to denote the complementary region of $S$, that is, the nonobject region. The relationship of the regions is illustrated in Fig. 4.

In the case of multiple objects, which can possibly overlap in the image, the likelihood of the image given the state cannot simply be decomposed into the likelihood of each individual object. Instead, a joint likelihood of the whole image, given all objects and the background model, needs to be considered. The joint likelihood $P(I|\theta)$ consists of two terms corresponding to the object region and the nonobject region:

$P(I|\theta) = P(I_S|\theta)\, P(I_{\bar{S}}|\theta). \quad (5)$

1202 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 30, NO. 7, JULY 2008

2. We used a prior on the number of objects in [53] to constrain oversegmentation. However, we found that the prior on the area is more effective due to the large variation in the image sizes of the objects (due to the camera perspective effect) and, therefore, their different contributions to the likelihood.


After obtaining $\tilde{S}_i$ by occlusion reasoning, the object region likelihood can be calculated by

$P(I_S|\theta) = \prod_{i=1}^n P(I_{\tilde{S}_i}|m_i) \propto \exp\Big\{ \alpha_S \sum_{i=1}^n |\tilde{S}_i| \big[ \underbrace{-\lambda_b B(p_i, d_i)}_{(1)} + \underbrace{\lambda_f B(p_i, \tilde{p}_i)}_{(2)} \big] \Big\}, \quad (6)$

where $d_i$ is the color histogram of the background image within the visibility mask of object $i$ and $\tilde{p}_i$ is the color histogram of the object; both are weighted by the kernel function $K_E(\cdot)$. $B(p, d) = \sum_{j=1}^m \sqrt{p_j d_j}$ is the Bhattacharyya coefficient, which reflects the similarity of the two histograms.
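The Bhattacharyya coefficient used in (6) is straightforward to compute for normalized histograms; a minimal sketch:

```python
import math

def bhattacharyya(p, d):
    """Bhattacharyya coefficient B(p, d) = sum_j sqrt(p_j * d_j).

    p and d are normalized m-bin histograms. B is 1 for identical
    histograms and 0 for histograms with disjoint support.
    """
    return sum(math.sqrt(pj * dj) for pj, dj in zip(p, d))
```

In the likelihood above, a low value against the background histogram and a high value against the learned object histogram both raise the score.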

This likelihood favors both the difference of an object hypothesis from the background and its similarity to its corresponding object in a previous frame (Fig. 4). This enables simultaneous segmentation and tracking with the same objective function. We call the two terms background exclusion and object attraction, respectively. The background exclusion concept was also proposed in [33]. $\lambda_b$ and $\lambda_f$ weight the relative contributions of the two terms (we constrain $\lambda_b + \lambda_f = 1$). The object attraction term is the same as the likelihood function used in [6]. For an object without a correspondence, that is, a new object, only the background exclusion part is used.

The nonobject likelihood is calculated by

$P(I_{\bar{S}}|\theta) = \prod_{j \in \bar{S}} P_b(I_j)^{\alpha_{\bar{S}}} \propto \exp\Big( -\alpha_{\bar{S}} \sum_{j \in \bar{S}} e_j \Big), \quad (7)$

where $e_j = -\log(P_b(I_j))$ and $P_b(I_j)$ is the probability of pixel $j$ belonging to the background model, as defined in (2). $\alpha_S$ in (6) and $\alpha_{\bar{S}}$ in (7) weight the balance of the foreground and the background, considering the different probabilistic models being used. The posterior probability is obtained by combining the prior, (3), and the likelihood, (5).

5 COMPUTING MAP BY EFFICIENT MCMC

Computing the MAP is an optimization problem. Due to the joint consideration of an unknown number of objects, the solution space contains subspaces of varying dimensions. It also includes both discrete and continuous variables. These properties make the optimization challenging. We use an MCMC method with jump/diffusion dynamics to sample the posterior probability. Jumps cause the Markov chain to move between subspaces with different dimensions and to traverse the discrete variables; diffusions make the Markov chain sample the continuous variables. In the process of sampling, the best solution is recorded and the uncertainty associated with the solution is also obtained.

Fig. 5 gives a block diagram of the computation process. The MCMC-based algorithm is an iterative process, starting from an initial state. In each iteration, a candidate is proposed from the state in the previous iteration, assisted by image features. The candidate is accepted probabilistically according to the Metropolis-Hastings rule [17]. The state corresponding to the maximum posterior value is recorded and becomes the solution.

Suppose we want to design a Markov chain with stationary distribution $P(\theta) = P(\theta^{(t)}|I^{(t)}, \theta^{(t-1)})$. At the $g$th iteration, we sample a candidate state $\theta'$ from $\theta_{g-1}$ according to a proposal distribution $q(\theta_g|\theta_{g-1})$. The candidate state $\theta'$ is accepted with probability

$p = \min\left\{1, \frac{P(\theta')\, q(\theta_{g-1}|\theta')}{P(\theta_{g-1})\, q(\theta'|\theta_{g-1})}\right\}.3$

If the candidate state $\theta'$ is accepted, $\theta_g = \theta'$; otherwise, $\theta_g = \theta_{g-1}$. It can be proven that the Markov chain constructed in this way has its stationary distribution equal to $P(\cdot)$, independent of the choice of the proposal probability $q(\cdot)$ and the initial state $\theta_0$ [47]. However, the choice of the proposal probability $q(\cdot)$ can affect the efficiency of MCMC significantly. Random proposal probabilities will lead to a very slow mixing rate. Using more informed proposal probabilities, for example, as in data-driven MCMC [48], will make the Markov chain traverse the solution space more efficiently. Therefore, the proposal distribution is written as $q(\theta_g|\theta_{g-1}, I)$. If the proposal probability is informative enough that each sample can be thought of as a hypothesis, then the MCMC approach becomes a stochastic version of the hypothesize-and-test approach. In general, the


Fig. 4. First pane: the relationship of the visible object regions and the nonobject region. Remaining panes: the color likelihood model. In $\tilde{S}_i$, the likelihood favors both the difference of an object hypothesis from the background and its similarity to its corresponding object in a previous frame. In $\bar{S}$, the likelihood penalizes the difference from the background model. Note that the elliptic models are used for illustration.

3. Based on our experiments, we find that approximating the ratio in the second term with just the posterior probability ratio, $P(\theta')/P(\theta_{g-1})$, gives almost the same results as the complete computation; hence, we use this approximation in our implementation.

Fig. 5. The block diagram of the MCMC tracking algorithm.


original version of MCMC has a dimension-matching problem in a solution space with varying dimensionality. A variation of MCMC, called trans-dimensional MCMC [14], has been proposed to solve this problem. However, with appropriate assumptions and simplifications, trans-dimensional MCMC can be reduced to the standard MCMC. We address this issue later in this section.

5.1 Markov Chain Dynamics

We design the following reversible dynamics for the Markov chain to traverse the solution space. The dynamics correspond to a proposal distribution with the mixture density $q(\theta'|\theta_{g-1}, I) = \sum_{a \in A} p_a q_a(\theta'|\theta_{g-1}, I)$, where $A$ is the set of all dynamics:

$A = \{\mathrm{add}, \mathrm{remove}, \mathrm{establish}, \mathrm{break}, \mathrm{exchange}, \mathrm{diff}\}.$

The mixing probabilities $p_a$ are the chances of selecting the different dynamics, and $\sum_{a \in A} p_a = 1$.

We assume that we have the sample of the $(g-1)$th iteration, $\theta_{g-1}^{(t)} = \{(k_1, m_1), \dots, (k_n, m_n)\}$, and now propose a candidate $\theta'$ for the $g$th iteration ($t$ is omitted where there is no ambiguity).

Object hypothesis addition. Sample the parameters of a new human hypothesis $(k_{n+1}, m_{n+1})$ and add it to $\theta_{g-1}$. $q_{add}(\theta_{g-1} \cup \{(k_{n+1}, m_{n+1})\}|\theta_{g-1}, I)$ is defined in a data-driven way whose details are given later.

Object hypothesis removal. Randomly select an existing human hypothesis $r \in [1, n]$ with a uniform distribution and remove it: $q_{remove}(\theta_{g-1} \setminus \{(k_r, m_r)\}|\theta_{g-1}) = 1/n$. If $k_r$ has a correspondence in $\theta^{(t-1)}$, then that object becomes dead.

Establish correspondence. Randomly select a new object $r$ in $\theta_{g-1}^{(t)}$ and a dead object $r'$ in $\theta^{(t-1)}$ and establish their temporal correspondence: $q_{establish}(\theta'|\theta_{g-1}) \propto \|u_r - u_{r'}\|^{-2}$ over all of the qualified pairs.

Break correspondence. Randomly select an object $r$ with $k_r \in \theta^{(t-1)}$ with a uniform distribution and change $k_r$ to a new object (the same object in $\theta^{(t-1)}$ becomes dead): $q_{break}(\theta'|\theta_{g-1}) = 1/n'$, where $n'$ is the number of objects in $\theta_{g-1}^{(t)}$ that have correspondences in the previous frame.

Exchange identity. Exchange the IDs of two nearby objects. Randomly select two objects $r_1, r_2 \in [1, n]$ and exchange their IDs: $q_{exchange}(r_1, r_2) \propto \|u_{r_1} - u_{r_2}\|^{-2}$. Identity exchange could also be realized as the composition of breaking and establishing correspondences. It is included to ease the traversal, since breaking and then establishing correspondences may lead to a big decrease in the probability and is less likely to be accepted.

Parameter update. Update the continuous parameters of an object. Randomly select an existing human hypothesis $r \in [1, n]$ with a uniform distribution and update its continuous parameters: $q_{diff}(\theta'|\theta_{g-1}) = (1/n)\, q_d(m_r'|m_r)$.

Among the above, addition and removal are a pair ofreverse moves, as are establishing and breaking correspon-dences; exchanging identity and parameter updating aretheir own reverse moves.

5.2 Informed Proposal Probability

In theory, the proposal probability $q(\cdot)$ does not affect the stationary distribution. However, different choices of $q(\cdot)$ lead to different performance: the number of samples needed to obtain a good solution strongly depends on the proposal probabilities. In this application, the proposal probability for adding a new object and that for updating the object parameters are the two most important ones. We use the following informed proposal probabilities to make the Markov chain more intelligent and thus achieve a higher acceptance rate.

Object addition. We add human hypotheses from three cues: foreground boundaries, intensity edges, and foreground residue (foreground with the existing objects carved out). In [54], a method to detect the heads that lie on the boundary of the foreground is described. The basic idea is to find the local vertical peaks of the boundary. The peaks are further verified by checking whether there are enough foreground pixels below them according to a human height range and the camera model. This detector has a high detection rate and is also effective when the human is small and image edges are not reliable; however, it cannot detect heads in the interior of the foreground blobs. Fig. 6a shows an example of head detection from foreground boundaries.

The second head detection method is based on an "Ω"-shape head-shoulder model (this term was first introduced in [53]). This detector matches the Ω-shape edge template with the image intensity edges to find the head candidates. First, the Canny edge detector is applied to the foreground region of the input image. A distance transformation [1] is computed on the edge map. Fig. 6b shows the exponential edge map, where $E(x, y) = \exp(-\gamma D(x, y))$ ($D(x, y)$ is the distance to the closest edge point and $\gamma$ is a factor to control the response field depending on the object scale in the image; we use $\gamma = 0.25$). In addition, the coordinates of the closest edge pixel are recorded as $\tilde{C}(x, y)$. The unit image gradient vector $\tilde{O}(x, y)$ is computed only at edge pixels. The "Ω"-shape model, see Fig. 6c, is derived by projecting a generic 3D human model to the image and taking the contour of the whole head and the upper quarter of the torso as the shoulder. The normals of the contour points are also computed. The size of the human


Fig. 6. Head detection. (a) Head detection from foreground blob boundaries. (b) Distance transformation on the Canny edge detection result. (c) The Ω-shape head-shoulder model (black: head-shoulder shape; white: normals). (d) Head detection from intensity edges.


model is determined by the camera calibration assuming an average human height.

Denote by $\{\tilde{u}_1, \dots, \tilde{u}_k\}$ and $\{\tilde{v}_1, \dots, \tilde{v}_k\}$ the positions and the unit normals of the model points, respectively, when the head top is at $(x, y)$. The model is matched with the image as $S(x, y) = (1/k) \sum_{i=1}^k e^{-\gamma D(\tilde{u}_i)} \left(\tilde{v}_i \cdot \tilde{O}(\tilde{C}(\tilde{u}_i))\right)$. A head candidate map is constructed by evaluating $S(x, y)$ at every pixel in the dilated foreground region. After smoothing it, we find all of the peaks above a threshold chosen so that it yields a very high detection rate, although possibly also a high false alarm rate. An example is shown in Fig. 6d. The false alarms tend to occur in areas of rich texture, where there are abundant edges of various orientations.

Finally, after the human objects obtained from the first two methods are hypothesized and removed from the foreground, the foreground residue map $R = F - S$ is computed. A morphological "open" operation with a vertically elongated structuring element is applied to remove thin bridges and small/thin residues. From each connected component $c$, human candidates can be generated, assuming that 1) the centroid of $c$ is aligned with the center of the human body, 2) the top center point of $c$ is aligned with the human head, or 3) the bottom center point of $c$ is aligned with the human feet.
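The Ω-template matching score $S(x, y)$ combines edge proximity (through the distance transform) with agreement between model normals and image gradients. The following is a brute-force sketch of that score on a toy edge map; the data structures and example template are illustrative assumptions, and a real implementation would precompute the distance transform and nearest-edge map instead of scanning all edge pixels:

```python
import math

def omega_head_score(edges, normals, model_pts, model_normals,
                     top, gamma=0.25):
    """Score S(x, y) for an edge template placed with its head top at `top`.

    edges: list of (x, y) edge-pixel coordinates.
    normals: dict mapping edge pixel -> unit gradient (ox, oy).
    model_pts / model_normals: template contour offsets from the head
    top and their unit normals (from a projected 3D human model).
    """
    tx, ty = top
    total = 0.0
    for (mx, my), (vx, vy) in zip(model_pts, model_normals):
        ux, uy = tx + mx, ty + my
        # Brute-force nearest edge pixel C(u) and distance D(u).
        cx, cy = min(edges, key=lambda e: (e[0] - ux) ** 2 + (e[1] - uy) ** 2)
        d = math.hypot(cx - ux, cy - uy)
        ox, oy = normals[(cx, cy)]
        # Edge proximity times normal/gradient agreement.
        total += math.exp(-gamma * d) * (vx * ox + vy * oy)
    return total / len(model_pts)
```

Placed directly on an edge whose gradients align with the template normals, the score reaches 1; it decays as the template moves away from supporting edges.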

The proposal probability for addition combines these three head detection methods: $q_a(k, m) = \sum_{i=1}^3 \alpha_{a_i} q_{a_i}(k, m)$, where $\alpha_{a_i}$, $i = 1, 2, 3$, are the mixing probabilities of the three methods; we use $\alpha_{a_i} = 1/3$. $q_{a_i}(\cdot)$ samples $m$ first and then $k$: $q_{a_i}(k, m) = q_{a_i}(m)\, q_{a_i}(k|m)$, and

$q_{a_i}(m) = q_o(o)\, q_{a_i}(u)\, q_h(h)\, q_f(f)\, q_i(i).$

$q_{a_i}(u)$ answers the question "where do we add a new human hypothesis?" In practice, $q_o(o)$, $q_h(h)$, $q_f(f)$, and $q_i(i)$ use their respective prior distributions, and $q_{a_i}(u)$ is a mixture of Gaussians based on the bottom-up detection results. For example, denote by $HC_1 = \{(x_i, y_i)\}_{i=1}^{N_1}$ the head candidates obtained by the first method; then, $q_{a_1}(u) = q_{a_1}(x, y) \sim \sum_{i=1}^{N_1} N((x_i, y_i), \mathrm{diag}\{\sigma_x^2, \sigma_y^2\})$. The definitions of $q_{a_2}(u)$ and $q_{a_3}(u)$ are similar. After $u'$ is sampled, $q(k|m) \propto q(k|u')$ samples $k$ from $\{k_{d_1}^{(t-1)}, \dots, k_{d_{n_d}}^{(t-1)}, \mathrm{new}\}$ according to $P(u|u_{d_i}^{(t-1)})$, see (4), $i = 1, \dots, n_d$, and $P_{new}(u)$, where $n_d$ is the number of dead objects.

The addition and removal actions change the dimension of the state vector. When calculating the acceptance probability, we need to compute the ratio of probabilities from spaces with different dimensions. Smith et al. [41] use an explicit strategy of trans-dimensional MCMC [14] to deal with the dimension-matching problem. We do not need an explicit strategy to match the dimension: since the trans-dimensional actions only add or remove one object per iteration, leaving the other objects unchanged, the Jacobian in [14] is unity, as in [41]. Therefore, our formulation is just a special case of the more general theory.

Parameter update. We use two ways to update the model parameters:

$q_{diff}(m_r'|m_r) = \alpha_{d_1} q_{d_1}(m_r'|m_r) + \alpha_{d_2} q_{d_2}(m_r'|m_r),$

with $\alpha_{d_i} = 1/2$. $q_{d_1}(\cdot)$ uses stochastic gradient descent to update the object parameters: $q_{d_1}(m_r'|m_r) \sim N(m_r - \kappa \frac{dE}{dm}, w)$, where $E = -\log P(\theta^{(t)}|I^{(t)}, \theta^{(t-1)})$ is the energy function, $\kappa$ is a scalar to control the step size, and $w$ is random noise to avoid local maxima.

A mean-shift vector computed in the visible region provides an approximation of the gradient of the object likelihood with respect to the position: $q_{d_2}(m_r'|m_r) \sim N(m_r^{ms}, w)$, where $m_r^{ms}$ is the new location computed by the mean-shift procedure (details are given in the Appendix). We assume that the change in the posterior probability due to the other components and due to occlusion can be absorbed in the noise term. The mean shift has an adaptive step size and better convergence behavior than numerically computed gradients. The rest of the parameters follow their numerically computed gradients. Compared to the original color-based mean-shift tracking, the background exclusion term in (6) can utilize a known background model, which is available for a stationary camera. As we observe in our experiments, tracking using the above likelihood is more robust to changes in the appearance of the object, for example, when going into shadow, compared to using the object attraction term alone.

Theoretically, the Markov chain designed should be irreducible and reversible; however, the use of the above data-driven proposal probabilities makes the approach not conform to the theory exactly. First, irreducibility requires the Markov chain to be able to reach any possible point in the solution space. In practice, however, the proposal probability of some points is very small, close to zero. For example, the proposal probability of adding a hypothesis at a position where no head candidate is detected nearby is extremely low. With a finite number of iterations, a state including such a hypothesis will never be sampled. Although this breaks the completeness of the Markov chain, we argue that skipping the parts of the solution space where no sign of objects is observed brings no harm to the quality of the final solution and makes the search process more efficient. Second, the use of the mean shift, which is a nonparametric method, makes the chain irreversible. Mean shift can be seen as an approximation of the gradient, while stochastic gradient descent is essentially a Gibbs sampler [39], which is a special case of the Metropolis-Hastings sampler with an acceptance ratio always equal to one [25]. However, mean shift is much faster than a random walk at estimating the parameters of the object. We choose to use these techniques, at the cost of some theoretical elegance, because, experimentally, they make our method much more efficient and the results are good.

5.3 Incremental Computation

As the MCMC process may need hundreds or more samples to approximate the distribution, we need an efficient method to compute the likelihood for each proposed state. In one iteration of the algorithm, at most two objects may change. Such a change affects the likelihood only locally; therefore, the new likelihood can be computed more efficiently by incrementally evaluating it only within the affected neighborhood (the area associated with the changed objects and those overlapping with them).


Take the addition action as an example. When a new human hypothesis is added to the state vector, for the likelihood of the nonobject region $P(I_{\bar{S}}|\theta)$, we only need to remove those background pixels taken by the new hypothesis. For the likelihood of the object region $P(I_S|\theta)$, as the new hypothesis may overlap with some existing hypotheses, we need to recompute the visibility of the object regions connected to the new hypothesis and then update the likelihoods of these neighboring objects. The incremental computations of the likelihood for the other actions are similar. Although a joint state and a joint likelihood are used, the computation of each iteration is greatly reduced through the incremental computation. This is in contrast to the particle filter, where the evaluation of each particle (joint state) requires the computation of the full joint likelihood.
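The bookkeeping behind this incremental computation can be sketched as a cache of per-object likelihood terms plus an overlap graph. This is an illustrative skeleton, not the paper's implementation; the dict-based representation and the `recompute` callback are assumptions:

```python
def total_loglik(contribs):
    """Joint log-likelihood as the sum of cached per-object terms."""
    return sum(contribs.values())

def update_object(contribs, overlaps, changed, recompute):
    """Incrementally re-evaluate the likelihood after `changed` moves.

    contribs: {obj_id: cached log-likelihood term}
    overlaps: {obj_id: set of obj_ids whose visible regions touch it}
    recompute(obj_id): re-evaluates one object's term (the costly step).
    Only the changed object and its overlap neighborhood are recomputed;
    all other cached terms are reused.
    """
    affected = {changed} | overlaps.get(changed, set())
    for oid in affected:
        contribs[oid] = recompute(oid)
    return total_loglik(contribs)
```

The point of the design is that `recompute` runs only for the small affected set, while a naive joint evaluation would touch every object each iteration.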

The appearance models of the tracked objects are updated after processing each frame to adapt to changes in object appearance. We update the object color histogram using an Infinite Impulse Response (IIR) filter: $\tilde{p}^{(t)} = \alpha_p p^{(t)} + (1 - \alpha_p)\tilde{p}^{(t-1)}$. We choose to update the appearance conservatively: we use a small $\alpha_p = 0.01$ and stop updating if the object is occluded by more than 25 percent or if its position covariance is too big.
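The IIR appearance update, including the conservative occlusion gate, can be written directly; representing the histogram as a plain list is an assumption for illustration:

```python
def update_appearance(p_prev, p_obs, alpha=0.01, occluded_frac=0.0):
    """IIR update of the object color histogram:
    p~(t) = alpha * p(t) + (1 - alpha) * p~(t-1).

    Following the text, the update is skipped (the old model is kept)
    when more than 25 percent of the object is occluded.
    """
    if occluded_frac > 0.25:
        return list(p_prev)
    return [alpha * o + (1.0 - alpha) * p for o, p in zip(p_obs, p_prev)]
```

With the small default `alpha`, the learned histogram drifts slowly toward the current observation, which limits contamination from transient errors.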

6 EXPERIMENTAL RESULTS

We have experimented with the system on many types of data; here, we show only some representative results. We first show results on an outdoor scene video and then on a standard evaluation data set of indoor scene videos.

Among all of the parameters of our approach, many are "natural," meaning that they correspond to measurable physical quantities (for example, 3D human height); therefore, setting their values is straightforward. We use the same set of parameters for all of the sequences. This means that our approach is not sensitive to the choice of parameter values. We list here the values of the parameters that are not mentioned in the previous sections. For the size prior (in Section 4.4), $\lambda_1 = 0.04$ and $\lambda_2 = 0.002$. For the likelihood, $\lambda_f = 0.5$ and $\lambda_b = 0.5$ in (6), $\alpha_S = 25$ in (6), and $\alpha_{\bar{S}} = 0.005$ in (7). For the mixing probabilities of the different types of dynamics, we use $p_{add} = 0.1$, $p_{remove} = 0.1$, $p_{establish} = 0.1$, $p_{break} = 0.1$, $p_{exchange} = 0.1$, and $p_{diff} = 0.5$. We also apply a hard constraint of 25 pixels on the minimum image height of a human.

We also want to comment here on the choice of parameters related to the peakedness of a distribution in sampling algorithms. The image likelihood is usually a combination of a number of components (sites, e.g., pixels). Inevitable simplifications in probabilistic modeling (for example, independence assumptions) may result in excessive peakedness of the distribution, which hurts sampling algorithms such as MCMC and the particle filter: the samples concentrate in one location (that is, the highest peak) of the state space, and the algorithms degenerate into greedy ones. Eliminating the dependencies among the components can be extremely difficult, if not infeasible. From an engineering point of view, one should set the values of the parameters (for example, $\alpha_S$ and $\alpha_{\bar{S}}$, while keeping their ratio constant) so that the likelihood ratio of different hypotheses is reasonable, allowing Markov chains to traverse the space efficiently and particle filters to maintain multiple hypotheses. In a similar fashion, simulated annealing has been used in the sampling process to reduce the effect of peakedness and force convergence [48], [8]; however, the varying temperature means that the samples are not drawn from a single posterior distribution.

6.1 Evaluation on an Outdoor Scene

We show results on an outdoor video sequence, which we call the "Campus Plaza" sequence and which contains 900 frames. This sequence is captured from a camera above a building gate with a 40-degree camera tilt angle. The frame size is 360 × 240 pixels and the sampling rate is 30 fps. In this sequence, 33 humans pass through the scene, with 23 going out of the field of view and 10 going inside a building. The interhuman occlusions in this sequence are large. There are 20 occlusion events overall, nine of which are heavy occlusions (over 50 percent of the object is occluded). For MCMC sampling, we use 500 iterations per frame. We show in Fig. 7 some sample frames from the result on this sequence. The identities of the objects are shown by their ID numbers displayed on the head.

We evaluate the results by trajectory-based errors. Trajectories whose lengths are less than 10 frames are discarded. Among the 33 human objects, the trajectories of three objects are broken once (ID 28 → ID 35, ID 31 → ID 32, and ID 30 → ID 41, all between frames 387 and 447, as marked with arrows in Fig. 7); the rest of the trajectories are correct. Usually, the trajectories are initialized once the humans are fully in the scene; some start when the objects are only partially inside. Only the initializations of three objects (objects 31, 50, and 52) are noticeably delayed (by 50, 55, and 60 frames, respectively, after they are fully in the scene). Partial occlusion and/or a lack of contrast with the background are the causes of the delays. To justify our approach of integrated segmentation and tracking, we compare the tracking result with the result of using frame-by-frame segmentation as in [53], where we use frame-based evaluation metrics. The detection rate and the false-alarm rate are 98.13 and 0.27 percent, respectively. The detection rate and the false-alarm rate on the same sequence using segmentation alone are 92.82 and 0.18 percent. With tracking, not only are the temporal correspondences obtained but the detection rate is also increased by a large margin, while the false-alarm rate is kept low.

6.2 Evaluation on Indoor Scene Sequences

Next, we describe the results of our method on an indoor video set, the Context-Aware Vision using Image-based Active Recognition (CAVIAR) video corpus4 [56]. We test our system on the 26 "shopping center corridor view" sequences, 36,292 frames overall, captured by a camera looking down toward a corridor. The frame size is 384 × 288 pixels and the sampling rate is 25 fps. Some 2D-3D point correspondences are given from which the camera can be


4. In the provided ground truth, there are 232 trajectories overall.However, five of these are mostly out of sight, for example, only one arm orthe head top is visible; we set these as “do not care.”


calibrated. However, we compute the camera parameters by an interactive method [26].

The interobject occlusion in this set is also intensive. Overall, there are 96 occlusion events in this set; 68 of the 96 are heavy occlusions, and 19 of the 96 are almost full occlusions (more than 90 percent of the object is occluded). Many interactions between humans, such as talking and handshaking, make this set very difficult for tracking. For MCMC sampling, we again use 500 iterations per frame. For such a big data set, it is infeasible to enumerate the errors as


Fig. 7. Selected frames of the tracking results from "Campus Plaza." The numbers on the heads show identities. (Please note that the two people who are sitting on the two sides are in the background model and, therefore, not detected.)


we did for the "Campus Plaza" sequence. Instead, we defined five statistical criteria:

1. the number of mostly tracked trajectories,
2. the number of mostly lost trajectories,
3. the number of trajectory fragments,
4. the number of false trajectories (a result trajectory corresponding to no object), and
5. the frequency of identity switches (identity exchanges between a pair of result trajectories).

Fig. 8 illustrates their definitions. These five categories are by no means a complete classification; however, they cover most of the typical errors observed on this set. There are other performance measures that have been proposed in recent evaluations, such as the Multiple Object Tracking Precision and Accuracy in the CLEAR 2006 evaluation [57]. We do not use these measures because they are less intuitive, as they try to integrate multiple factors into one scalar-valued measure.

Table 1 gives the performance of our method. We developed evaluation software to count the numbers of mostly tracked trajectories, mostly lost trajectories, false alarms, and fragments automatically. Denote a ground-truth trajectory by $\{G^{(i)}, \dots, G^{(i+n)}\}$, where $G^{(t)}$ is the object state at the $t$th frame; denote a hypothesized trajectory by $\{H^{(j)}, \dots, H^{(j+m)}\}$. The overlap ratio of the ground-truth object and the hypothesized object at the $t$th frame is defined by

$Overlap(G^{(t)}, H^{(t)}) = \frac{Reg(G^{(t)}) \cap Reg(H^{(t)})}{Reg(G^{(t)}) \cup Reg(H^{(t)})}, \quad (8)$

where $Reg(\cdot)$ is the image region of the object. If $Overlap(G^{(t)}, H^{(t)}) > 0.5$, we say $\{G^{(t)}, H^{(t)}\}$ is a potential match. The overlap ratio of the ground-truth trajectory and the hypothesized trajectory is defined by

$Overlap(G^{(i:i+n)}, H^{(j:j+m)}) = \frac{\sum_{t=\max(i,j)}^{\min(i+n,\, j+m)} \delta\left(Overlap(G^{(t)}, H^{(t)}) > 0.5\right)}{\max(i+n,\, j+m) - \min(i, j) + 1}, \quad (9)$

where $\delta(\cdot)$ is an indicator function. Given that one sequence has $N_G$ ground-truth trajectories, $\{G_k\}_{k=1}^{N_G}$, and $N_H$ hypothesized trajectories, $\{H_k\}_{k=1}^{N_H}$, we compute the overlap ratios for all ground-truth/hypothesis pairs $\{G_k, H_l\}$; the pairs whose overlap ratios are larger than 0.8 are considered potential matches. Then, the Hungarian matching algorithm [22] is used to find the best matches, which are considered mostly tracked. To count the mostly lost trajectories, we define a recall ratio by replacing the denominator of (9) with $n + 1$. If, for $G_k$, there is no $H_l$ such that the recall ratio between them is larger than 0.2, we consider $G_k$ mostly lost. To count the false alarms and fragments, we define a precision ratio by replacing the denominator of (9) with $m + 1$. If, for $H_l$, there is no $G_k$ such that the precision ratio between them is larger than 0.2, we consider $H_l$ a false alarm; if there is a $G_k$ such that the precision ratio between them is larger than 0.8 but the overlap ratio is smaller than 0.8, we consider $H_l$ a fragment of $G_k$. We first count the mostly tracked trajectories and remove the matched parts of the ground-truth tracks. Second, we count the trajectory fragments with a greedy iterative algorithm: at each round, the fragment with the highest overlap ratio is found and the matched part of the ground-truth track is removed; this procedure is repeated until there are no more valid fragments. Last, we count the mostly lost trajectories and the false alarms. This algorithm cannot classify all ground-truth and hypothesized tracks; the unlabeled ones are mainly due to identity switches. We count the frequency of identity switches visually.
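The trajectory-level overlap ratio of (9) can be sketched as follows; representing trajectories as frame-indexed dicts and passing the per-frame overlap of (8) in as a callback are illustrative choices, not the authors' data structures:

```python
def trajectory_overlap(gt, hyp, frame_overlap, thresh=0.5):
    """Trajectory overlap ratio of (9).

    gt, hyp: dicts mapping frame index -> object state (contiguous
    frame ranges assumed); frame_overlap(g, h): per-frame region
    overlap ratio of (8). The denominator spans the union of the two
    trajectories' frame ranges, so frames missing on either side
    count against the match.
    """
    i, iend = min(gt), max(gt)
    j, jend = min(hyp), max(hyp)
    hits = sum(
        1
        for t in range(max(i, j), min(iend, jend) + 1)
        if frame_overlap(gt[t], hyp[t]) > thresh
    )
    return hits / (max(iend, jend) - min(i, j) + 1)
```

Replacing the denominator with the ground-truth length $n + 1$ or the hypothesis length $m + 1$ yields the recall and precision ratios described in the text.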

Some sample frames and results are shown in Fig. 9. Most of the missed detections are due to humans wearing clothing with a color very similar to that of the background, so that part of the object is misclassified as background; see frame 1,413 in Fig. 9b for an example. Trajectory fragmentation and ID switches are mainly due to full occlusions; see frame 496 in Fig. 9a and frame 316 in Fig. 9b for examples. Our method can deal with partial occlusion well. For full occlusion, classifying an object as entering an "occluded" state and reassociating it when it reappears could potentially improve the performance. The false alarms are mainly due to shadows, reflections, and sudden brightness changes that are misclassified as foreground; see frame 563 in Fig. 9a. A more sophisticated background model and shadow model (for example, [32]) could be used to improve the result. In general, our method performs reasonably well on the CAVIAR set, though not as well as on the "Campus Plaza" sequence, mainly due to the abovementioned difficulties. The running speed of the system is about 2 fps on a 2.8 GHz Pentium IV CPU. The implementation is in C++ without any special optimization.

7 CONCLUSION AND FUTURE WORK

We have presented a principled approach to simulta-neously detect and track humans in a crowded sceneacquired from a single stationary camera. We take a model-based approach and formulate the problem as a BayesianMAP estimation problem to compute the best interpretationof the image observations collectively by the 3D humanshape model, the acquired human appearance model, thebackground appearance model, the camera model, theassumption that humans move on a known ground plane,and the object priors. The image is modeled as a composi-tion of an unknown number of possibly overlapping objects

1208 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 30, NO. 7, JULY 2008

Fig. 8. Tracking evaluation criteria.

TABLE 1
Results of Performance Evaluations on the CAVIAR Set (277 Trajectories)


and a background. The inference is performed by an MCMC-based approach to explore the joint solution space. Data-driven proposal probabilities are used to direct the Markov chain dynamics. Experiments and evaluations on challenging real-life data show promising results.

The success of our approach lies mainly in the integration of the top-down Bayesian formulation, which follows the image formation process, with the bottom-up features that are directly extracted from images. The integration has the benefit of both the computational efficiency of image features and the optimality of a Bayesian formulation.

This work could be improved or extended in several ways: 1) Extension to track multiple classes of objects (for example, humans and cars) can be done by adding model switching to the MCMC dynamics. 2) Tracking, operating in a two-frame interval, has a very local view; therefore, ambiguities inevitably exist, especially in the case of tracking fully occluded objects. Analysis at the level of trajectories may resolve these local ambiguities (for example, [29]). Such analysis may take into account prior knowledge of valid object trajectories, including their starting and ending points.

APPENDIX

SINGLE OBJECT TRACKING WITH BACKGROUND KNOWLEDGE USING MEAN SHIFT

Denote by $\tilde{p}$, $p(u)$, and $b(u)$ the color histogram of the object learned online, the color histogram of the object at location $u$, and the color histogram of the background at the corresponding region, respectively. Let $\{x_i\}_{i=1,\dots,n}$ be the pixel locations in the region with the object center at $u$. A kernel with profile $k(\cdot)$ is used to assign smaller weights to the pixels farther away from the center. An $m$-bin color histogram $p(u) = \{p_j(u)\}_{j=1,\dots,m}$ is constructed as $p_j(u) = \sum_{i=1}^{n} k(\|x_i\|^2)\,\delta[b_f(x_i) - j]$, where the function $b_f(\cdot)$ maps a pixel location to the corresponding histogram bin ($b_b(\cdot)$ is its counterpart for the background image) and $\delta$ is the delta function. The histograms $\tilde{p}$ and $b$ are constructed in the same way.

Fig. 9. Selected frames of the tracking results from the CAVIAR set. (a) Sequence "ThreePastShop2cor." (b) Sequence "TwoEnterShop2cor."

We would like to optimize
$$L(u) = -\lambda_b \underbrace{B\big(p(u), b(u)\big)}_{L_1(u)} + \lambda_f \underbrace{B\big(p(u), \tilde{p}\big)}_{L_2(u)}, \qquad (10)$$
where $B(\cdot,\cdot)$ is the Bhattacharyya coefficient. By applying a Taylor expansion at $p(u_0)$ and $b(u_0)$ ($u_0$ is a predicted position of the object), we have
$$
\begin{aligned}
L_1(u) = B\big(p(u), b(u)\big) &\approx B(u_0) + B'_p(u_0)\big(p(u) - p(u_0)\big) + B'_b(u_0)\big(b(u) - b(u_0)\big) \\
&= c_1 + \frac{1}{2}\sum_{j=1}^{m} \sqrt{\frac{b_j(u_0)}{p_j(u_0)}}\; p_j(u) + \frac{1}{2}\sum_{j=1}^{m} \sqrt{\frac{p_j(u_0)}{b_j(u_0)}}\; b_j(u) \\
&= c_1 + \sum_{i=1}^{n} k\!\left(\left\|\frac{u - x_i}{h}\right\|^2\right) w_i^b,
\end{aligned} \qquad (11)
$$
where
$$w_i^b = \frac{1}{2}\sum_{j=1}^{m} \left\{ \sqrt{\frac{b_j(u_0)}{p_j(u_0)}}\;\delta\big[b_f(x_i) - j\big] + \sqrt{\frac{p_j(u_0)}{b_j(u_0)}}\;\delta\big[b_b(x_i) - j\big] \right\}.$$
Similarly, following [6],
$$L_2(u) = B\big(p(u), \tilde{p}\big) \approx \frac{1}{2}\sum_{j=1}^{m} \sqrt{p_j(u_0)\,\tilde{p}_j} + \frac{1}{2}\sum_{j=1}^{m} p_j(u)\sqrt{\frac{\tilde{p}_j}{p_j(u_0)}} = c_2 + \sum_{i=1}^{n} w_i^f\, k\!\left(\left\|\frac{u - x_i}{h}\right\|^2\right), \qquad (12)$$
where $w_i^f = \frac{1}{2}\sum_{j=1}^{m} \sqrt{\tilde{p}_j / p_j(u_0)}\;\delta\big[b_f(x_i) - j\big]$; therefore,
$$L(u) = -\lambda_b c_1 + \lambda_f c_2 + \sum_{i=1}^{n} \underbrace{\big(\lambda_f w_i^f - \lambda_b w_i^b\big)}_{w_i}\, k\!\left(\left\|\frac{u - x_i}{h}\right\|^2\right). \qquad (13)$$
The last term of $L(u)$ is a kernel density estimate computed with profile $k(\cdot)$ at $u$. The mean-shift algorithm with negative weights [4] therefore applies. Using the Epanechnikov profile [6], $L(u)$ is increased by moving to the new location
$$u_0 \leftarrow \frac{\sum_{i=1}^{n} x_i w_i}{\sum_{i=1}^{n} |w_i|}. \qquad (14)$$
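The update (10)-(14) can be illustrated numerically. Below is a minimal one-dimensional Python sketch, not the authors' C++ implementation: the grayscale bins, bandwidth `h`, and weight values `lam_f`, `lam_b` are invented for the example, and, following the negative-weight mean shift of [4], pixel coordinates are taken relative to the current center so that the sum of $|w_i|$ in the denominator yields a bounded step.

```python
import numpy as np

def khist(bins, k, m):
    """Kernel-weighted m-bin color histogram, normalized to sum to 1."""
    hist = np.zeros(m)
    np.add.at(hist, bins, k)   # unbuffered indexed accumulation
    s = hist.sum()
    return hist / s if s > 0 else hist

def mean_shift_step(u0, x, fg, bg, p_tilde, m,
                    h=5.0, lam_f=1.0, lam_b=0.5, eps=1e-9):
    """One background-aware mean-shift update of the object center u0.

    x: 1D pixel positions; fg/bg: per-pixel color bins of the foreground
    and background images; p_tilde: the object histogram learned online.
    """
    r2 = ((x - u0) / h) ** 2
    k = np.maximum(0.0, 1.0 - r2)          # Epanechnikov profile
    p = khist(fg, k, m)                    # object histogram p(u0)
    b = khist(bg, k, m)                    # background histogram b(u0)
    # w_i = lam_f * w_i^f - lam_b * w_i^b as in (13); the delta functions
    # in (11)-(12) reduce to looking up each pixel's own histogram bin.
    wf = 0.5 * np.sqrt(p_tilde[fg] / (p[fg] + eps))
    wb = 0.5 * (np.sqrt(b[fg] / (p[fg] + eps)) +
                np.sqrt(p[bg] / (b[bg] + eps)))
    w = lam_f * wf - lam_b * wb
    inside = r2 < 1.0                      # kernel support
    denom = np.abs(w[inside]).sum()
    if denom < eps:
        return u0                          # no informative pixels; stay put
    return u0 + ((x[inside] - u0) * w[inside]).sum() / denom
```

With an object whose color matches $\tilde{p}$ placed to the right of the current center, the mixed-sign weights pull the estimate toward the object and push it away from background-colored pixels, as the derivation above intends.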

ACKNOWLEDGMENTS

This research was funded in part by the US Government's Video Analysis and Content Extraction (VACE) program.

REFERENCES

[1] G. Borgefors, "Distance Transformations in Digital Images," Computer Vision, Graphics, and Image Processing, vol. 34, no. 3, pp. 344-371, 1986.
[2] Y. Boykov, O. Veksler, and R. Zabih, "Fast Approximate Energy Minimization via Graph Cuts," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 11, pp. 1222-1239, Nov. 2001.
[3] I. Cohen and G. Medioni, "Detecting and Tracking Moving Objects for Video Surveillance," Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 2, pp. 2319-2326, 1999.
[4] R.T. Collins, "Mean-Shift Blob Tracking through Scale Space," Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 2, pp. 234-240, 2003.
[5] D. Comaniciu and P. Meer, "Mean Shift: A Robust Approach toward Feature Space Analysis," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 5, pp. 603-619, May 2002.
[6] D. Comaniciu, V. Ramesh, and P. Meer, "Kernel-Based Object Tracking," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 25, no. 5, pp. 564-577, May 2003.
[7] L. Davis, V. Philomin, and R. Duraiswami, "Tracking Humans from a Moving Platform," Proc. Int'l Conf. Pattern Recognition, vol. 4, pp. 171-178, 2000.
[8] J. Deutscher, A. Blake, and I. Reid, "Articulated Body Motion Capture by Annealed Particle Filtering," Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 2, pp. 126-133, 2000.
[9] A. Elgammal and L. Davis, "Probabilistic Framework for Segmenting People under Occlusion," Proc. Eighth Int'l Conf. Computer Vision, vol. 2, pp. 145-152, 2001.
[10] A. Elgammal, R. Duraiswami, D. Harwood, and L. Davis, "Background and Foreground Modeling Using Non-Parametric Kernel Density Estimation for Visual Surveillance," Proc. IEEE, vol. 90, no. 7, pp. 1151-1163, 2002.
[11] F. Fleuret, R. Lengagne, and P. Fua, "Fixed Point Probability Field for Complex Occlusion Handling," Proc. 10th Int'l Conf. Computer Vision, vol. 1, pp. 694-700, 2005.
[12] D. Gatica-Perez, J.-M. Odobez, S. Ba, K. Smith, and G. Lathoud, "Tracking People in Meetings with Particles," Proc. Int'l Workshop Image Analysis for Multimedia Interactive Services, 2005.
[13] D. Gavrila and V. Philomin, "Real-Time Object Detection for 'Smart' Vehicles," Proc. Seventh Int'l Conf. Computer Vision, vol. 1, pp. 87-93, 1999.
[14] P. Green, Trans-Dimensional Markov Chain Monte Carlo. Oxford Univ. Press, 2003.
[15] I. Haritaoglu, D. Harwood, and L. Davis, "W4: Real-Time Surveillance of People and Their Activities," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 809-830, Aug. 2000.
[16] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge Univ. Press, 2000.
[17] W.K. Hastings, "Monte Carlo Sampling Methods Using Markov Chains and Their Applications," Biometrika, vol. 57, no. 1, pp. 97-109, 1970.
[18] S. Hongeng and R. Nevatia, "Multi-Agent Event Recognition," Proc. Eighth Int'l Conf. Computer Vision, vol. 2, pp. 84-91, 2001.
[19] M. Isard and J. MacCormick, "BraMBLe: A Bayesian Multiple-Blob Tracker," Proc. Eighth Int'l Conf. Computer Vision, vol. 2, pp. 34-41, 2001.
[20] J. Kang, I. Cohen, and G. Medioni, "Continuous Tracking within and across Camera Streams," Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, pp. 267-272, 2003.
[21] Z. Khan, T. Balch, and F. Dellaert, "MCMC-Based Particle Filtering for Tracking a Variable Number of Interacting Targets," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 11, pp. 1805-1819, Nov. 2005.
[22] H.W. Kuhn, "The Hungarian Method for the Assignment Problem," Naval Research Logistics Quarterly, vol. 2, pp. 83-87, 1955.
[23] M.-W. Lee and I. Cohen, "A Model-Based Approach for Estimating Human 3D Poses in Static Images," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, no. 6, pp. 905-916, June 2006.
[24] A. Lipton, H. Fujiyoshi, and R. Patil, "Moving Target Classification and Tracking from Real-Time Video," Proc. DARPA Image Understanding Workshop, pp. 129-136, 1998.
[25] J. Liu, "Metropolized Gibbs Sampler," Monte Carlo Strategies in Scientific Computing, Springer, 2001.
[26] F. Lv, T. Zhao, and R. Nevatia, "Self-Calibration of a Camera from Video of a Walking Human," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, no. 9, pp. 1513-1518, Sept. 2006.
[27] J. MacCormick and A. Blake, "A Probabilistic Exclusion Principle for Tracking Multiple Objects," Proc. Seventh Int'l Conf. Computer Vision, vol. 1, pp. 572-578, 1999.
[28] A. Mittal and L. Davis, "M2Tracker: A Multi-View Approach to Segmenting and Tracking People in a Cluttered Scene Using Region-Based Stereo," Proc. Seventh European Conf. Computer Vision, vol. 2, pp. 18-33, 2002.
[29] P. Nillius, J. Sullivan, and S. Carlsson, "Multi-Target Tracking - Linking Identities Using Bayesian Network Inference," Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 2, pp. 2187-2194, 2006.
[30] K. Okuma, A. Taleghani, N. de Freitas, J. Little, and D. Lowe, "A Boosted Particle Filter: Multitarget Detection and Tracking," Proc. Eighth European Conf. Computer Vision, vol. 1, pp. 28-39, 2004.
[31] C. Papageorgiou, T. Evgeniou, and T. Poggio, "A Trainable Pedestrian Detection System," Proc. IEEE Intelligent Vehicles Symp., pp. 241-246, 1998.
[32] A. Prati, I. Mikic, M. Trivedi, and R. Cucchiara, "Detecting Moving Shadows: Algorithms and Evaluation," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 25, no. 7, pp. 918-923, July 2003.
[33] P. Pérez, C. Hue, J. Vermaak, and M. Gangnet, "Color-Based Probabilistic Tracking," Proc. Seventh European Conf. Computer Vision, vol. 1, pp. 661-675, 2002.
[34] D. Ramanan, D. Forsyth, and A. Zisserman, "Strike a Pose: Tracking People by Finding Stylized Poses," Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, pp. 271-278, 2005.
[35] C. Rasmussen and G.D. Hager, "Probabilistic Data Association Methods for Tracking Complex Visual Objects," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 560-576, June 2001.
[36] J. Rittscher, P. Tu, and N. Krahnstoever, "Simultaneous Estimation of Segmentation and Shape," Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 2, pp. 487-493, 2005.
[37] R. Rosales and S. Sclaroff, "3D Trajectory Recovery for Tracking Multiple Objects and Trajectory Guided Recognition of Actions," Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 2, pp. 2117-2123, 1999.
[38] H. Rue and M.A. Hurn, "Bayesian Object Identification," Biometrika, vol. 86, no. 3, pp. 649-660, 1999.
[39] S. Geman and C.-R. Hwang, "Diffusion for Global Optimization," SIAM J. Control and Optimization, vol. 24, no. 5, pp. 1031-1043, 1986.
[40] N. Siebel and S. Maybank, "Fusion of Multiple Tracking Algorithms for Robust People Tracking," Proc. Seventh European Conf. Computer Vision, vol. 4, pp. 373-387, 2002.
[41] K. Smith, D. Gatica-Perez, and J.-M. Odobez, "Using Particles to Track Varying Numbers of Interacting People," Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, pp. 962-969, 2005.
[42] X. Song and R. Nevatia, "Combined Face-Body Tracking in Indoor Environment," Proc. 17th Int'l Conf. Pattern Recognition, vol. 4, pp. 159-162, 2004.
[43] X. Song and R. Nevatia, "A Model-Based Vehicle Segmentation Method for Tracking," Proc. 10th Int'l Conf. Computer Vision, vol. 2, pp. 1124-1131, 2005.
[44] C. Stauffer and E. Grimson, "Learning Patterns of Activity Using Real-Time Tracking," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 747-757, Aug. 2000.
[45] H. Tao, H. Sawhney, and R. Kumar, "Object Tracking with Bayesian Estimation of Dynamic Layer Representations," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 1, pp. 75-89, Jan. 2002.
[46] H. Tao, H. Sawhney, and R. Kumar, "A Sampling Algorithm for Tracking Multiple Objects," Proc. Workshop Vision Algorithms, 1999.
[47] L. Tierney, "Markov Chain Concepts Related to Sampling Algorithms," Markov Chain Monte Carlo in Practice, pp. 59-74, 1996.
[48] Z.W. Tu and S.C. Zhu, "Image Segmentation by Data-Driven Markov Chain Monte Carlo," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 5, pp. 651-673, May 2002.
[49] Y. Weiss, "Correctness of Local Probability Propagation in Graphical Models with Loops," Neural Computation, vol. 12, no. 1, pp. 1-41, 2000.
[50] B. Wu and R. Nevatia, "Detection of Multiple, Partially Occluded Humans in a Single Image by Bayesian Combination of Edgelet Part Detectors," Proc. 10th Int'l Conf. Computer Vision, vol. 1, pp. 90-97, 2005.
[51] T. Yu and Y. Wu, "Collaborative Tracking of Multiple Targets," Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, pp. 834-841, 2004.
[52] T. Zhao, M. Aggarwal, R. Kumar, and H. Sawhney, "Real-Time Wide Area Multi-Camera Stereo Tracking," Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, pp. 976-983, 2005.
[53] T. Zhao and R. Nevatia, "Bayesian Human Segmentation in Crowded Situations," Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 2, pp. 459-466, 2003.
[54] T. Zhao and R. Nevatia, "Tracking Multiple Humans in Complex Situations," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 26, no. 9, pp. 1208-1221, Sept. 2004.
[55] T. Zhao and R. Nevatia, "Tracking Multiple Humans in Crowded Environment," Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 2, pp. 406-413, 2004.
[56] The CAVIAR Data Set, http://homepages.inf.ed.ac.uk/rbf/CAVIAR/, 2008.
[57] CLEAR06 Evaluation Campaign and Workshop, http://isl.ira.uka.de/clear06/, 2008.

Tao Zhao received the BEng degree from the Department of Computer Science and Technology at Tsinghua University, China, in 1998 and the MSc and PhD degrees from the Department of Computer Science at the University of Southern California in 2001 and 2003, respectively. He was with Sarnoff Corp., Princeton, New Jersey, from 2003 to 2006. He is currently with Intuitive Surgical Inc., Sunnyvale, California, working on computer vision applications for medicine and surgery. His research interests include computer vision, machine learning, and pattern recognition. His experience has been in visual surveillance, human motion analysis, aerial image analysis, and medical image analysis. He is a member of the IEEE and the IEEE Computer Society.

Ram Nevatia received the PhD degree from Stanford University with a specialty in the area of computer vision. He has been with the University of Southern California since 1975, where he is currently a professor of computer science and electrical engineering. He is also the director of the Institute for Robotics and Intelligent Systems. He has been a principal investigator of major government-funded computer vision research programs for more than 25 years. He has made important contributions to several areas of computer vision, including the topics of shape description, object recognition, stereo analysis, aerial image analysis, tracking of humans, and event recognition. He is an associate editor of the Pattern Recognition and the Computer Vision and Image Understanding journals. He is the author of two books, several book chapters, and more than 100 refereed technical papers. He is a fellow of the IEEE and of the American Association for Artificial Intelligence (AAAI).

Bo Wu received the BEng and MEng degrees from the Department of Computer Science and Technology at Tsinghua University, Beijing, in 2002 and 2004, respectively. He is currently a PhD candidate in the Computer Science Department at the University of Southern California, Los Angeles. His research interests include computer vision, machine learning, and pattern recognition. He is a student member of the IEEE and the IEEE Computer Society.

