
Probabilistic Modeling of Scene Dynamics for Applications in Visual Surveillance

Imran Saleemi, Khurram Shafique, and Mubarak Shah, Fellow, IEEE

Abstract—We propose a novel method to model and learn the scene activity observed by a static camera. The proposed model is very general and can be applied to a variety of problems. The motion patterns of objects in the scene are modeled in the form of a multivariate nonparametric probability density function of spatiotemporal variables (object locations and transition times between them). Kernel Density Estimation is used to learn this model in a completely unsupervised fashion. Learning is accomplished by observing the trajectories of objects with a static camera over extended periods of time. The model encodes the probabilistic nature of the behavior of moving objects in the scene and is useful for activity analysis applications, such as persistent tracking and anomalous motion detection. In addition, the model captures salient scene features, such as areas of occlusion and the most likely paths. Once the model is learned, we use a unified Markov Chain Monte Carlo (MCMC)-based framework for generating the most likely paths in the scene, improving foreground detection, persistently labeling objects during tracking, and deciding whether a given trajectory represents an anomaly with respect to the observed motion patterns. Experiments with real-world videos that validate the proposed approach are reported.

Index Terms—Vision and scene understanding, Markov processes, machine learning, tracking, kernel density estimation, Metropolis-Hastings, Markov Chain Monte Carlo.


1 INTRODUCTION

1.1 Problem Description

Recently, there has been a major effort underway in the vision community to develop fully automated surveillance and monitoring systems [1], [2]. Such systems have the advantage of providing continuous 24-hour active warning capabilities and are especially useful in the areas of law enforcement, national defense, border control, and airport security. Current systems are efficient and robust in their handling of common issues, such as illumination changes, shadows, short-term occlusions, weather conditions, and noise in the imaging process [3]. However, most current systems have short or no memory in terms of the observables in the scene. Due to this memoryless behavior, these systems lack the capability of learning the environment parameters and of intelligent reasoning based on those parameters. Such learning and reasoning is an important characteristic of all cognitive systems and increases their adaptability and thus their practicality. A number of studies have provided strong psychophysical evidence of the importance of context for scene understanding in humans, for tasks such as handling long-term occlusions, detecting anomalous behavior, and even improving the existing low-level vision tasks of object detection and tracking [4], [5].

Recent works in the area of scene modeling are limited to the detection of entry and exit points and to finding the likely paths in the scene [6], [7], [8], [9]. We argue that, over the period of its operation, an intelligent tracking system should be able to model the scene from its observables and to improve its performance based on this model. The high-level knowledge necessary to make such inferences derives from domain knowledge and past experiences, as well as from scene geometry, learned traffic, target behavior patterns in the area, etc. For example, consider a scene containing bushes that allow only partial or no observation of targets passing behind them. Most existing systems detect the target only when it emerges from the bushes and are unable to link the observations of the target before and after the long-term occlusion. Given that this behavior of targets disappearing and reappearing after a certain interval at a certain place is consistently observed, an intelligent system should be able to infer the correlation between these observations and use it to correctly identify targets at reappearance. We believe that the identification, modeling, and analysis of target behavior in the scene are the key to achieving autonomous intelligent decision-making capabilities. The work presented here is a step forward in this direction. Specifically, we present a framework to automatically learn a probabilistic model of the traffic patterns in a scene. We show that the proposed model can be applied to various visual surveillance applications, including behavior prediction, detection of anomalous patterns in the scene, improvement of foreground detection, and persistent tracking.

1472 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 31, NO. 8, AUGUST 2009

. I. Saleemi and M. Shah are with the School of Electrical Engineering and Computer Sciences, University of Central Florida, 4000 Central Florida Blvd., Harris Corporation Engineering Center (HEC), Orlando, FL 32816. E-mail: [email protected], [email protected].

. K. Shafique is with ObjectVideo Inc., 11600 Sunrise Valley Drive, Suite 290, Reston, VA 20191. E-mail: [email protected].

Manuscript received 18 Oct. 2006; revised 10 Sept. 2007; accepted 7 Apr. 2008; published online 8 July 2008. Recommended for acceptance by H. Sawhney. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TPAMI-0736-1006. Digital Object Identifier no. 10.1109/TPAMI.2008.175.

0162-8828/09/$25.00 © 2009 IEEE Published by the IEEE Computer Society


1.2 Proposed Approach

In this paper, we present a novel framework to learn traffic patterns in the form of a multivariate nonparametric probability density function (pdf) of spatiotemporal variables that include the locations of objects detected in the scene and their transition times from one location to another. The model is learned in a fully unsupervised fashion by observing the trajectories of objects with a static camera over extended periods of time. The model also captures salient scene features, such as the usually adopted paths, frequently visited areas, occlusion areas, entry/exit points, etc.

We learn the scene model by using the observations of tracks over a long period of time. These tracks may have errors due to clutter and may also be broken due to short-term and long-term occlusions; however, by observing enough tracks, one can get a fairly good understanding of the scene and infer the above-described scene properties and salient features. We assume that the tracks of the moving objects are available for training (we use the KNIGHT system for generating these tracks [3]). Two scenes and the observed tracks are shown in Fig. 1. We use these tracks in a training phase to discover the correlation in the observations by learning the motion pattern model in the form of a multivariate pdf of spatiotemporal parameters, i.e., the joint probability density of pairs of observations of an object occurring within certain time intervals, (x, y, x', y', Δt). Instead of imposing assumptions about the form of this pdf, we estimate it using kernel density estimators. Each track on the image lattice can then be seen as a random walk where the probabilities of transition at each state of the walk are given by the learned spatiotemporal kernel density estimate. Once the model is learned, we use a unified Markov Chain Monte Carlo (MCMC) sampling-based scheme to generate the most likely paths in the scene, to decide whether a given path is an anomaly with respect to the learned model, and to estimate the probability density of the next state of the random walk based on its previous states. The predictions based on the model are then used to improve the detection of foreground objects as well as to persistently track targets through short-term and long-term occlusions. We show quantitatively that the proposed system improves both the precision and recall of foreground detection and can handle long-term occlusions that cannot be handled using the constant-dynamics models commonly used in the literature. We also show that the proposed algorithm can handle occlusions that were not observed during training. Examples of these types of occlusions include vehicles parked in the scene after training or a bug sitting on the camera lens.
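The MCMC machinery mentioned above can be illustrated, in highly simplified form, by a generic Metropolis-Hastings random walk; the target density `p` and proposal `q_sample`/`q_pdf` below are hypothetical stand-ins for the learned transition model of Section 3, not the authors' exact formulation.

```python
import numpy as np

def metropolis_hastings_path(p, q_sample, q_pdf, x0, n_steps, rng=None):
    """Generic Metropolis-Hastings sampler (illustrative sketch only).

    p:        unnormalized target density over states,
    q_sample: draws a proposal x' given the current state x,
    q_pdf:    proposal density q(x_new | x_old).
    Returns the chain of visited states (repeats on rejection).
    """
    rng = np.random.default_rng(rng)
    x = np.asarray(x0, dtype=float)
    chain = [x]
    for _ in range(n_steps):
        x_new = q_sample(x)
        # Acceptance ratio a = [p(x') q(x|x')] / [p(x) q(x'|x)]
        a = (p(x_new) * q_pdf(x, x_new)) / max(p(x) * q_pdf(x_new, x), 1e-300)
        if rng.random() < min(1.0, a):
            x = x_new
        chain.append(x)
    return np.array(chain)
```

With a symmetric proposal, the `q_pdf` terms cancel and the sampler reduces to the familiar Metropolis rule.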

2 RELATED WORK

Trajectory and path modeling is an important step in various applications, many of which are crucial to surveillance and monitoring systems. Such models can be used to filter tracking algorithms, generate likely paths, find the locations of sources and sinks in the scene, detect anomalous tracks, etc. This kind of modeling can be directly used as feedback to the initial stages of a tracking algorithm and applied to resolve short-term and long-term occlusions. In recent years, a number of different methods and features for trajectory and path modeling of traffic in the scene have been proposed. These methods differ in their choice of features, models, learning algorithms, applications, and training data. A detailed review of these models is presented in [10]. We now describe some of these methods.

Neural network-based approaches for the learning of typical paths and trajectory modeling are proposed in [11], [12], [13]. Other than the computational complexity and lack of adaptability of neural networks, a major disadvantage of these methods is their inability to handle incomplete and partial trajectories. Fernyhough et al. [14] use the spatial model presented in [15] as the basis for a learning algorithm that can automatically learn object paths by accumulating the traces of targets. Koller-Meier and Van Gool [16] use a node-based model to represent the average of trajectory clusters. A similar technique is suggested by Lou et al. [17]. Although both methods successfully identify the mean of common trajectory patterns, no explicit information is derived regarding the distribution of trajectories around the mean. A hierarchical clustering of trajectories is proposed in [18], [19], where trajectories are represented as sequences of states in a quantized 6D space for trajectory classification; the method is based on a cooccurrence matrix that assumes that all trajectory sequences are equally long. However, this assumption is usually not true in real sequences.

Detection of sources and sinks in a scene as a preliminary step allows robust tracking. In [20], a hidden Markov model (HMM)-based scheme for learning sources and sinks is presented, where all sequences are two states long. The knowledge of sources and sinks is used to correct and stitch tracks in a closed-loop manner. Another HMM-based approach is to model trajectories as transitions between states representing Gaussian distributions on the 2D image plane [21], [22]. Galata et al. [23] propose a mechanism that automatically acquires stochastic models of any behavior. Unlike HMMs, the proposed variable-length Markov model can capture dependencies that may have a variable time scale.

Another body of work on modeling motion patterns in a scene lies in the literature of multicamera tracking [9], [24], [25], where the objective is to use these models to solve for handoff between cameras. The motion patterns are modeled only between the fields of view of the cameras. These models can be seen as special cases of the proposed model, which is richer in the sense that it captures the motion pattern both in visible and occluded regions. This richness allows the proposed model to handle dynamic occlusions, such as person-to-person occlusions, as well as occlusions that occur after the model is learned. Dockstader and Tekalp [26] use Bayesian networks for tracking and occlusion reasoning across calibrated cameras with overlapping views, where sparse motion estimation and appearance are used as features. Tieu et al. [6] infer the topology of a network of nonoverlapping cameras by computing the statistical dependence between observations in different cameras.

Fig. 1. Two scenes used for the testing of the proposed framework, with some of the tracks observed by the tracker during training.

Another interesting piece of work in this area is by Ellis et al. [7], who determine the topology of a camera network using a two-stage algorithm. First, the entry and exit zones of each camera are determined using an Expectation-Maximization technique to cluster the start and end points of object tracks. The links between these zones across cameras are then found using the cooccurrence of entry and exit events. The method assumes that correct correspondences will cluster in the feature space (location and time), while wrong correspondences will generally be scattered across the feature space. Stauffer [8] proposes an improved linking method that tests the hypothesis that the correlation between exit and entry events is similar to the expected correlation when there are no valid transitions. This allows the algorithm to handle cases where exit-entrance events may be correlated, but the correlation is not due to valid object transitions. Both of these methods assume a fixed set of entry and exit locations after initial training and, hence, cannot deal with newly formed occlusions without retraining the model.

Hoiem et al. [27] take a step forward in image and scene understanding and propose improving object recognition by modeling the correlation between objects, surface geometry, and camera viewpoint. Rosales and Sclaroff [28] estimate motion trajectories using an Extended Kalman Filter to enable improved tracking before, during, and after occlusions. Kaucic et al. [29] present a modular framework that handles tracking through occlusions and the blind regions between cameras by the initialization, tracking, and linking of high-confidence smaller track sections. Wang et al. [30] propose measures of similarity between tracks that take into account spatial distribution, velocity, and object size. The trajectories are then clustered based on object type and spatial and velocity distributions. Perera et al. [31] present a method to reliably link track segments during the linking stage. Splitting and merging of tracks is handled to achieve multiobject tracking through occlusion. Recently, Hu et al. [32] have presented an algorithm for learning motion patterns in which foreground pixels are first clustered using the fuzzy K-means algorithm and trajectories are then hierarchically clustered based on the results of the previous step. Trajectory clustering is performed in two layers, for spatial and temporal clustering. Each pattern of clustered trajectories is then assumed to be a link in a chain of Gaussian distributions, the parameters of which are estimated using features of each clustered trajectory. Results of experiments on anomaly detection and behavior prediction are reported.

As opposed to explicitly modeling trajectories, or smaller sections of trajectories, of objects in the scene, we propose modeling the joint distribution of the initial and final locations of every possible transition and the transition times. Our approach is original in the following ways:

. We propose a novel motion model that learns not only the scene semantics but also the behavior of traffic along arbitrary paths. This learning is not limited, like other approaches, to well-defined paths such as roads and walkways.

. The learning uses a joint 5D model, unlike pixel-wise models and mixture or chain models. The proposed model represents the joint probability of a transition from any point in the image to any other point, and the time taken to complete that transition.

. The temporal dimension of traffic patterns is explicitly modeled and included in the feature vector, enabling us to distinguish patterns of activity that correspond to the same trajectory cluster but deviate significantly in the temporal dimension. This is a more general method than modeling pixel-wise velocities.

. Instead of fitting parametric models to the data, we propose learning the track information using Kernel Density Estimation. This allows for a richer model, and the density retained at each point in the feature space accurately reflects the training data.

. Rather than exhaustively searching for predictions in the feature space based on their probabilities, we propose using stochastic methods to sample from the learned distribution and using the samples as predictions with computed probabilities. Sampling is thus used as the process propagation function in our state estimation framework.

. Unlike most of the previous work reported in this section, which is targeted toward one or two similar applications, we apply the proposed probabilistic framework to solve a variety of problems that are commonly encountered in surveillance and scene analysis.

3 PROBABILISTIC TRAFFIC MODEL

We now propose a model that learns traffic patterns as the joint probability of the initial and final locations and the duration of object transitions, and we describe our method of learning and sampling from the distribution. In this section, we discuss the feature to be learned and explain how it is computed. The sampling process and its use in the state estimation framework are also explained.

3.1 Learning the Transition Distribution Using KDE

We use the surveillance and monitoring system KNIGHT [3] to collect tracking data. These data consist of time stamps and object labels, with the locations of object centroids for each frame of video. KNIGHT assigns a unique label to each object present in the scene and attempts to persistently track the object using a joint motion and appearance model. A frame of video may contain multiple objects. Given these data, we associate an observation vector O(x, y, t, l) with each detected object. In this vector, x and y are the spatial coordinates of the object centroid on the image lattice, t is the time instant at which O is observed (accurate to a millisecond), and l is the label of the object in O as per the tracking. We seek to build a 5D estimate of the transition pdf p(X, Y, τ), where X and Y are the initial and final states representing the object centroid locations in 2D image coordinates and τ is the time taken, in milliseconds, to make the transition from X to Y.

In the proposed work, the analysis is performed on a feature space where the n transitions are represented by z_i ∈ R^5, i ∈ {1, 2, ..., n}. The vector z consists of a pair of 2D locations of the centroid of the object before and after a transition, and the time taken to execute the transition. Any two observations O_i and O_j, in frames i and j, respectively, comprise a feature point Z(x_i, y_i, x_j, y_j, t_j − t_i) for the probability estimate if they satisfy the following:

. 0 < t_j − t_i = τ < T; thus, the implicit assumption is that O_i and O_j are uncorrelated if they occur simultaneously or more than a time interval T apart. For our experiments, we chose T = 5,000 ms.

. l_i = l_j, i.e., both O_i and O_j are observations of the same object as per the tracker's labeling.

. If l_j ∉ L_k, where L_k is the set of all object labels in frame k, then all labels in frame k make valid data points with l_j, provided 0 < τ < T.
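As a concrete sketch of this pairing rule, the following helper (hypothetical, not from the paper) builds the 5D feature points from time-sorted tracker output; it implements the first two criteria above, with the interval bound T = 5,000 ms.

```python
import numpy as np

T_MAX_MS = 5000  # maximum transition interval T (5,000 ms in the paper)

def build_transition_features(observations):
    """Build 5-D feature points Z = (x_i, y_i, x_j, y_j, t_j - t_i).

    `observations` is a list of (x, y, t_ms, label) tuples, one per
    detected object per frame, sorted by time stamp t_ms.
    A pair qualifies if both observations carry the same tracker label
    and 0 < t_j - t_i < T_MAX_MS.
    """
    feats = []
    for i, (xi, yi, ti, li) in enumerate(observations):
        for (xj, yj, tj, lj) in observations[i + 1:]:
            dt = tj - ti
            if dt >= T_MAX_MS:
                break  # later observations are even further in time
            if dt > 0 and li == lj:
                feats.append((xi, yi, xj, yj, dt))
    return np.array(feats, dtype=float)
```

Each row of the returned array is one training sample z_i for the kernel density estimate described next.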

Kernel Density Estimation [33], [34] is used to estimate the pdf p(X, Y, τ) nonparametrically. Suppose we have a sample consisting of n d-dimensional data points z_1, z_2, ..., z_n from a multivariate distribution p(z); then an estimate p̂(z) of the density at z can be computed using

\hat{p}(z) = \frac{1}{n |H|^{1/2}} \sum_{i=1}^{n} K\big( H^{-1/2} (z - z_i) \big),   (1)

where the d-variate kernel K(z) is a bounded function satisfying \int K(z)\,dz = 1, K(z) = K(-z), \int z K(z)\,dz = 0, \int z z^T K(z)\,dz = I_d, and H is the symmetric positive definite d × d bandwidth matrix. The multivariate kernel K(z) can be generated from a product of symmetric univariate kernels K_u, i.e.,

K(z) = \prod_{j=1}^{d} K_u\big( z_{(j)} \big).   (2)
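Equation (1) with a diagonal bandwidth matrix H = diag[h_1^2, ..., h_d^2] and a product of univariate Gaussian kernels reduces to the following sketch (an assumed NumPy implementation, not the authors' code):

```python
import numpy as np

def kde_pdf(z, samples, h):
    """Kernel density estimate at query point(s) z, per Eqs. (1)-(2).

    samples: (n, d) training feature points z_1..z_n,
    h:       length-d per-dimension bandwidths (diagonal of H^{1/2}).
    Uses a product of univariate Gaussian kernels.
    """
    z = np.atleast_2d(z)                                  # (m, d)
    diff = (z[:, None, :] - samples[None, :, :]) / h      # (m, n, d)
    log_k = -0.5 * np.sum(diff ** 2, axis=2)              # Gaussian exponent
    # |H|^{1/2} = prod(h_j); kernel normalizer = (2*pi)^{d/2}
    norm = np.prod(h) * (2 * np.pi) ** (samples.shape[1] / 2)
    return np.exp(log_k).sum(axis=1) / (samples.shape[0] * norm)
```

For the transition model, `samples` would hold the 5D features (x_i, y_i, x_j, y_j, τ) collected during training.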

The selection of the kernel bandwidth H is the single most important criterion in kernel density estimation. Asymptotically, the selected bandwidth H does not affect the estimate but, in practice, sample sizes are limited. Theoretically, the ideal or optimal H that balances the bias and variance globally can be computed by minimizing the mean squared error, MSE{p̂_H(z)} = E{[p̂_H(z) − p(z)]^2}, where p̂ is the estimated density, p is the true density, and the subscript H indicates the use of the bandwidth H in computing the estimate. However, the true density p is not known in practice. Instead, various heuristic approaches have been proposed for finding H (see [35] for a survey of these approaches). We use the Rule of Thumb method [36], which is a fast standard-deviation-based method that uses the following Asymptotic Mean Integrated Squared Error (AMISE) criterion:

\mathrm{AMISE}\{H\} = \frac{1}{n} R(K) |H|^{-1/2} + \int_{\mathbb{R}^5} \big[ (K_H * p)(z) - p(z) \big]^2 dz,   (3)

where R(K) = \int_{\mathbb{R}^5} K(z)^2 dz and * is the convolution operator. The data-driven bandwidth selector is then H = \arg\min_H \widehat{\mathrm{AMISE}}\{H\}. To reduce the complexity, H is assumed to be a diagonal matrix, i.e., H = diag[h_1^2, h_2^2, ..., h_d^2]. We use a product of univariate Gaussian kernels to generate K(z), i.e.,

K_u\big( z_{(j)} \big) = \frac{1}{h_j \sqrt{2\pi}} \exp\left( -\frac{z_{(j)}^2}{2 h_j^2} \right),   (4)

where h_j is the jth nonzero element of H and K_u is centered at z_{(j)}. It is emphasized here that using a Gaussian kernel does not make any assumption about the scatter of the data in the feature space. The kernel function only defines the effective region in the 5D space in which each data point has an influence while computing the probability estimate. Each time a pair of observations satisfying the described criteria becomes available during training, the observed feature z is added to the sample. p(X, Y, τ) then gives the probability of a transition from point X to point Y in τ milliseconds.
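A standard-deviation-based Rule of Thumb selector of this kind can be sketched as the normal-reference rule applied per dimension (a common Scott/Silverman-style heuristic; the exact constants used in [36] may differ):

```python
import numpy as np

def rule_of_thumb_bandwidths(samples):
    """Per-dimension Gaussian rule-of-thumb bandwidths (sketch).

    Normal-reference rule: h_j = sigma_j * (4 / ((d + 2) n))^(1/(d+4)),
    giving the diagonal bandwidth matrix H = diag(h**2) for the
    product Gaussian kernel.
    """
    n, d = samples.shape
    sigma = samples.std(axis=0, ddof=1)  # per-dimension sample std dev
    return sigma * (4.0 / ((d + 2) * n)) ** (1.0 / (d + 4))
```

For the 5D transition features, this yields one bandwidth per coordinate (two start coordinates, two end coordinates, and the transition time τ).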

Fig. 2 illustrates the maps of p(X, Y, τ) marginalized over the 2D vectors X and Y. Fig. 2a represents the probability of an object reaching a point in the map from any location, i.e., \int_X \int_\tau p(X, Y, \tau)\, d\tau\, dX. It is important to note that the areas with higher densities (brighter colors) cannot be labeled as entry or exit points. The intensity at a point in the image only illustrates the marginal probability that an observed state transition has of starting or ending at that point. The dominant points in the maps only mean that a higher number of object observations were made at these points. In practice, this could mean, for example, that these areas are places where people stand or sit while exhibiting nominal motion that is nonetheless significant enough to be noticed by the tracker. Similarly, each point in Fig. 2b represents the probability of an object starting from that particular point and ending anywhere in the image, i.e., \int_Y \int_\tau p(X, Y, \tau)\, d\tau\, dY. Notice that the similarity between the two maps manifests a correlation between incoming and outgoing object transitions at the same locations. It should also be pointed out that, since p(X, Y, τ) is a high-dimensional space, the probabilities from a particular location to another location in a specific interval can be very low, and a high dynamic range is required to display these maps correctly. Thus, the dark blue regions in these maps are not necessarily absolutely zero. Indeed, many dark blue regions in these maps can have relatively high probabilities of transition if the pdf is marginalized over specific time intervals or a specific starting or ending point, as discussed later. Figs. 2c and 2d show similar maps for the scene in Fig. 1b.

Fig. 2. These maps represent the marginal probability of any object (a) reaching each point in the image and (b) starting from each point in the image, in any arbitrary duration of time. (c) and (d) show these maps for the scene in Fig. 1b, respectively.

The pdf p(X, Y, τ) now represents the distribution of transitions from any point X in the scene to any other point Y, given the time taken to complete the transition, τ. Our next step is to predict the next state of an object, given the current state and an approximate duration of the jump. This is achieved by sampling the marginal of this distribution function, as described in the following section.

3.2 Construction of Predicted Trajectories

The learned model can be used to find a likely next state given the current state of an object. The motivation is that we want to construct a likely trajectory from the current object position onward, which acts as the predicted behavior of the object. Instead of finding the probabilities of transitions to different locations, we use the learned distribution as a generative model and sample feature points from it to construct a sequence of tracks based on the training observations.

Once the distribution has been learned, each track can be viewed as a Markov process {X_k}, k ∈ {1, 2, ...}, on the image lattice I. By marginalizing out τ, or integrating p(X, Y, τ) over an appropriate range of τ, the learned distribution can be seen as a transition probability model, i.e., p̂(X_{t+1} | X_t), where X_t is the current state and X_{t+1} is the next state. Fig. 3 illustrates the marginal probability of reaching any state Y in the image starting from state X = G, in any arbitrary duration of time, which can be written as \int_\tau p(X = G, Y, \tau)\, d\tau. The relatively bright spots farther away show the possible states that a series of transitions can lead to. These correspond to higher values of τ, the time dimension of the distribution, for which the probability of transition from G is naturally low. These maps do not represent the whole scene but only small regions of it.

To compute a next state given the current state, we propose sampling from the learned transition distribution. Instead of sampling only one prediction, we propose sampling multiple distinct future states from the distribution so that the multimodal nature of the learned density is fully utilized.

In real-world scenarios, the assumption that tracks are first-order Markov processes is not usually valid. It can, however, be noted that, in our scene traffic model, the choice of different values of τ (the time duration required to complete a transition) can be used to achieve higher order Markov chains. Furthermore, while dealing with applications such as track filtering, anomaly detection, and tracking through occlusions, it must be noted that sampling high probability future states is not always enough; for example, in human traffic scenarios, it is possible for people to take less traveled but possible paths, which we do not want to ignore in our algorithm. So, instead of only one prediction for the next location, multiple distinct future states are sampled from the distribution. The track is initialized with any location, and the representative future states are initially chosen with uniform distribution in the neighborhood of the current state, with equal weights. At any given step, the weighted mean of all of these gives the next candidate state. The transitions of this random walk will not always be the most likely ones, but rather possible ones that are in accordance with the training data.

A sample set containing N future states is sampled from the distribution to represent P(X_0) using the algorithm described in Fig. 4. These samples can be written as {(u_0^{k,−}, w_0^{k,−})}, where u_0^{k,−} ∼ N(X_0, Σ), w_0^{k,−} = 1/N, N is the number of possible future states, Σ = diag[σ_max², σ_max²], and k = {1, ..., N}. τ_max is the maximum possible time for which we seek to sample transitions; its value is set to the average duration in which a person takes a step, and it corresponds to the maximum time allowed for a sampled transition to occur. The symbols "−" and "+" indicate the representation of the state before and after the measurement has arrived, respectively. Since only N states are used to represent the whole distribution given X_0 as the starting state, the weights are normalized such that Σ_k w_i^{k,+} = 1; thus, w_0^{k,−} = 1/N initially. At each step i, we have the representation of the pdf P(X_i | X_0, ..., X_{i−1}) in the form of a set of future states {(u_i^{k,−}, w_i^{k,−})}, where u_i^{k,−} = f(u_{i−1}^{k,+}) and w_i^{k,−} = w_{i−1}^{k,+}; f is a function that samples from the state transition distribution according to the conditional probability P(· | u_{i−1}^{k,+}) using the Metropolis-Hastings algorithm, which is described in detail in Fig. 4. Given this

1476 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 31, NO. 8, AUGUST 2009

Fig. 3. Regions of maps that represent the probability of reaching any state in the image from state G.

Fig. 4. Metropolis-Hastings sampling. See Section 6 for details about the choice of q, the proposal density.
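As a concrete illustration of the sampling step, the following is a minimal sketch of a Metropolis-Hastings sampler drawing a next state from an unnormalized transition density. The names (`metropolis_hastings_next_state`, `p_hat`), the toy density, and the symmetric Gaussian random-walk proposal are all illustrative assumptions, not the paper's exact Fig. 4 procedure or its proposal density q:

```python
import math
import random

def metropolis_hastings_next_state(p_hat, x, n_iter=200, step=5.0, seed=0):
    """Sample a next state y from the unnormalized transition density
    p_hat(x, y) via Metropolis-Hastings with a symmetric Gaussian
    random-walk proposal (a sketch, not the paper's exact procedure)."""
    rng = random.Random(seed)
    # Initialize the chain near the current state x.
    y = (x[0] + rng.gauss(0, step), x[1] + rng.gauss(0, step))
    for _ in range(n_iter):
        cand = (y[0] + rng.gauss(0, step), y[1] + rng.gauss(0, step))
        p_old, p_new = p_hat(x, y), p_hat(x, cand)
        # Symmetric proposal: accept with probability min(1, p_new / p_old).
        if p_old == 0.0 or rng.random() < min(1.0, p_new / p_old):
            y = cand
    return y

# Toy unnormalized transition density: objects tend to move ~20 px right.
def p_hat(x, y):
    dx, dy = y[0] - x[0] - 20.0, y[1] - x[1]
    return math.exp(-(dx * dx + dy * dy) / (2.0 * 8.0 ** 2))
```

A call such as `metropolis_hastings_next_state(p_hat, (100.0, 50.0))` then returns a plausible next location concentrated around the density's mode.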


representation of P(X_i | X_0, ..., X_{i−1}), we update it to {(u_i^{k,+}, w_i^{k,+})} as u_i^{k,+} = u_i^{k,−} and

w_i^{k,+} = w_i^{k,−} ∫_0^{τ_max} p(X_{i−1}, u_i^{k,−}; τ) dτ.    (5)

An estimate of X_i is now given by the weighted mean of all the sampled future states as X_i = Σ_{m=1}^{N} u_i^{m,+} w_i^{m,+}, where X_i serves as the distribution's prediction until the measurement from the tracker arrives, which is denoted by ω_i (not to be confused with w_i, which is the weight of a particular sample). The update step then involves the computation of the final estimate of the object's new location, φ_i, as a weighted mean of X_i and ω_i:

φ_i = [X_i p(X_i | φ_{i−1}) + ω_i p(ω_i | φ_{i−1})] / [p(X_i | φ_{i−1}) + p(ω_i | φ_{i−1})],    (6)

where the probabilities of transition from the previous state estimate φ_{i−1} to the new potential states X_i and ω_i serve as the weights, which are normalized so that they sum to 1.

After the update, the weights are normalized again so that Σ_k w_i^{k,+} = 1, and their variance σ_w² is calculated. If σ_w² > σ_th², a predefined maximum threshold, then a new set of samples is constructed by drawing (sampling) without replacement from the distribution using φ_i as the current position. The probability of a sample being drawn is assigned as its initial weight. Thus, the predicted states with relatively lower weights are discarded, and the others are retained to represent the next state. This process is continued to propagate the trajectory.
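In outline, the weight update of (5), the weighted-mean prediction, and the measurement fusion of (6) can be sketched as follows. This is a minimal 2D illustration; `transition_prob` is a hypothetical stand-in for the time-marginalized transition density, and the function names are assumptions:

```python
def update_weights(samples, weights, prev_state, transition_prob):
    """Eq. (5): scale each sample's weight by the probability of the
    transition from the previous state, then renormalize."""
    w = [wk * transition_prob(prev_state, u) for u, wk in zip(samples, weights)]
    total = sum(w) or 1.0
    return [wk / total for wk in w]

def weighted_mean(samples, weights):
    """Prediction X_i: weighted mean of the sampled future states."""
    return tuple(sum(w * u[d] for u, w in zip(samples, weights))
                 for d in range(2))

def fuse(pred, meas, p_pred, p_meas):
    """Eq. (6): final estimate phi_i as a mean of prediction and measurement,
    weighted by their transition probabilities from the previous estimate."""
    z = p_pred + p_meas
    return tuple((p_pred * pred[d] + p_meas * meas[d]) / z for d in range(2))
```

With two samples at (0, 0) and (10, 0) and a transition density favoring the latter, the updated weights shift toward (10, 0) and the prediction moves accordingly.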

The choice of τ_max depends on the desired maximum duration between two consecutive points on a track. Assuming a first-order Markov process in a human traffic environment, τ_max corresponds to the average time taken by a person to take one step. For most of the experiments reported in this paper, the value of τ_max was chosen to be 800 ms, since this was empirically found to be the average time taken by a person to take one step. The importance of this value is illustrated in Fig. 5. Different values of τ can be used to specify the time interval of the transition from X to Y. Figs. 5a, 5b, and 5c depict the marginal probabilities of an object starting from G (shown by an ×) and reaching any state in the image in 0-1.5, 1.5-3, and 3-5 s, respectively. These maps can be represented mathematically as ∫_0^1500 p(X = G, Y; τ) dτ, ∫_1500^3000 p(X = G, Y; τ) dτ, and ∫_3000^5000 p(X = G, Y; τ) dτ. The maps show the paths that objects usually follow in these areas. Thus, the time dimension of the learned distribution encodes additional information that is vital to solving short-term occlusions using the distribution, as described in Section 4.4. Notice that the modes, or areas of dominant probability, are in general not equidistant from G. Furthermore, the locations of these modes and their distances from G can be different for different choices of G, depending on the usual speeds of objects in that area, in contrast to tracking using Gaussian or other symmetric distributions as process noise functions.

We now have a mechanism for sampling from the distribution that enables us to predict object behavior in space and time, which is then used for likely (typical) path generation, anomaly detection, and persistent tracking of objects through occlusions, as described in the next section. The learned transition probabilities are also used to improve foreground detection.
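Such time-windowed marginal maps can be approximated directly from the kernel density estimate. Below is a hedged sketch: the function names, the product of 1D Gaussian kernels, and the bandwidths `s_xy` and `s_t` are assumptions, not the paper's actual KDE bandwidth selection. For each grid location Y, it sums the mass each 5D training feature (x1, y1, x2, y2, τ) contributes to ∫_{t0}^{t1} p(X = G, Y; τ) dτ:

```python
import math

def tau_mass(tau, t0, t1, s):
    """Mass of a 1D Gaussian kernel centered at tau inside [t0, t1]."""
    z = s * math.sqrt(2.0)
    return 0.5 * (math.erf((t1 - tau) / z) - math.erf((t0 - tau) / z))

def marginal_map(transitions, G, t0, t1, grid, s_xy=10.0, s_t=200.0):
    """Unnormalized map of reaching each Y in `grid` from G within
    [t0, t1] ms, built from 5D training features (x1, y1, x2, y2, tau)."""
    def k(d, s):  # 1D Gaussian kernel
        return math.exp(-d * d / (2.0 * s * s))
    out = {}
    for Y in grid:
        total = 0.0
        for (x1, y1, x2, y2, tau) in transitions:
            w_start = k(G[0] - x1, s_xy) * k(G[1] - y1, s_xy)  # match X = G
            w_end = k(Y[0] - x2, s_xy) * k(Y[1] - y2, s_xy)    # match Y
            total += w_start * w_end * tau_mass(tau, t0, t1, s_t)
        out[Y] = total
    return out
```

A single training transition from (0, 0) to (20, 0) taking 800 ms, for instance, yields a map whose mass over the 0-1500 ms window concentrates near (20, 0).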

4 APPLICATIONS OF THE PROPOSED MODEL

It is important that the motion patterns of objects be modeled in a way that enables the solution of multiple problems encountered in scene analysis and surveillance. We next describe some of these problems and the application of the proposed model and state estimation framework to their solution.

4.1 Generating Likely Paths

An important aspect of modeling traffic patterns is the generation of likely paths. Given the current location of an object, such a simulated path amounts to a prediction of the object's future behavior. We sample from the learned model of transition patterns to generate these behavior predictions. We expect that a small number of paths should adequately reflect the observations of trajectories through walkways, roads, and so forth. The sampling method of Section 3.2 is used, except that no measurements are available, which translates to acceptance of the weighted mean of all samples at each step as the next state. More specifically, (6) reduces to φ_i = X_i. For our experiments, we start by manually choosing X_0 as the initial state in a region of significant traffic. The trajectory is propagated by accepting predicted states as next states. The chain is stopped whenever σ_w² > σ_th², i.e., there is a need to resample candidate predictions, or when the Metropolis-Hastings chain fails to move for more than a specified number of iterations; e.g., we used a maximum of 200 iterations for this purpose.
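The path-generation loop then reduces to repeatedly accepting the prediction as the next state. A minimal sketch follows; the sampler interface `next_state(x, rng)`, returning None when the chain fails to move, is a hypothetical stand-in for the full machinery of Section 3.2:

```python
import random

def generate_likely_path(next_state, x0, max_steps=500, seed=0):
    """Propagate a simulated trajectory with phi_i = X_i (no measurements).
    Stops when the sampler reports failure to move or max_steps is reached."""
    rng = random.Random(seed)
    path = [x0]
    for _ in range(max_steps):
        nxt = next_state(path[-1], rng)
        if nxt is None:  # e.g., the MH chain is stuck or resampling is needed
            break
        path.append(nxt)
    return path
```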

4.2 Improvement of Foreground Detection

The intensity difference of objects from the background has been a widely used criterion for object detection, but it can


Fig. 5. Marginal probability maps of the learned pdf at different time intervals. The maps represent the probability of reaching any state in the image from state G (shown by a white × in the first row and a black × in the second row) in (a) 0-1.5 s, (b) 1.5-3 s, and (c) 3-5 s. Each row shows a different starting state in the image. These maps show only small regions of the actual scene.


be noted that temporal persistence is also an intrinsic property of foreground objects, i.e., unless an object exits the scene or becomes occluded, it must either stay at the same place or move to a location within the spatial vicinity of the current observation. Since our proposed transition model incorporates the probabilities of movement of objects from one location to another, it can be used to improve the foreground models. We now present the formulation for this application. It should be noted, however, that this method alone cannot be used to model the foreground; instead, it needs to be used in conjunction with an appearance-based model, such as a mixture of Gaussians [19]. We seek to find the likelihood of a pixel u belonging to the foreground at time instant t. Let u be a random variable that is true if and only if the pixel u is a foreground pixel. Also, let ρ be the feature set used to model the background (for example, color/gray scale in an appearance-based model) and Ψ = {ψ_1, ψ_2, ..., ψ_Q} be the set of pixels detected as foreground at time instant t−1, where Q is the total number of foreground pixels in the previous frame. Then, according to Bayes rule,

P(u = true | ρ, Ψ) ∝ P(ρ, Ψ | u = true) P(u = true).    (7)

Assuming independence of the appearance ρ and the foreground history Ψ, we may write

P(u = true | ρ, Ψ) ∝ P(ρ | u = true) P(Ψ | u = true) P(u = true).    (8)

Here, we use the framework of aggregating expert opinions [37] to combine the decisions from different features. Specifically, we consider logarithmic opinion pooling [38], i.e.,

P(ρ, Ψ | u = true) ∝ P(ρ | u = true)^α P(Ψ | u = true)^(1−α).    (9)

The weight α represents the expert's reliability. We chose α = Q / (Q + Q_a), where Q is the number of pixels classified as foreground in the previous frame, while Q_a is the number of foreground pixels detected in the last frame using only the appearance-based model.

The KNIGHT object detection and tracking system [3] is used to extract the appearance-based likelihood, P(ρ | u = true). KNIGHT performs background subtraction at multiple levels using statistical models of gradient and color data. Specifically, a mixture of Gaussians model is used with four Gaussians per pixel. The learned pdf p(X, Y; τ) is used to estimate the remaining terms as follows:

P(Ψ | u = true) = Σ_{j=1}^{Q} p(ψ_j, u; δt)    (10)

and the marginal (prior) of u being a foreground pixel is

P(u = true) = ∫_X ∫_τ p(X, Y = u; τ) dτ dX.    (11)

The value of δt represents the length of time between two consecutive frames. It should be pointed out here that, although not all previous foreground pixels should contribute significantly to the pixel u, the value of δt is typically small enough to inhibit contributions from faraway pixels. In other words, only the pixels in Ψ that are spatially close to u contribute substantial values to the sum.

Fig. 6 shows the results of the proposed foreground model. It can be seen that the number of false positives has decreased and the true silhouettes are much more pronounced. The scheme presents a simple and effective way of reducing errors and improving background subtraction. Many commonly encountered false detections, such as motion of static background objects, are removed using this approach. More experiments and a quantitative analysis are reported in Section 5.
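The logarithmic pooling of (9) amounts to a weighted geometric mean of the two experts. A minimal sketch follows; the function name is illustrative, and the inputs are the two experts' likelihoods together with the frame counts that define α:

```python
def pooled_likelihood(p_appearance, p_motion, Q, Qa):
    """Eq. (9): logarithmic opinion pool P^alpha * P^(1 - alpha),
    with expert reliability alpha = Q / (Q + Qa)."""
    alpha = Q / float(Q + Qa)
    return (p_appearance ** alpha) * (p_motion ** (1.0 - alpha))
```

Note that, unlike a linear (arithmetic) pool, the geometric pool vetoes a pixel when either expert assigns it near-zero likelihood, which is what suppresses appearance-only false positives such as swaying vegetation.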

4.3 Anomaly Detection

If the tracking data used to model the state transition distribution span sufficiently large periods of time, it is obvious that a sampling algorithm will not be able to sample a track that is anomalous with respect to the usual patterns of activity and motion in that scene. This observation forms the basis of our anomaly detection algorithm. The anomaly detection algorithm generates its own predictions for future states using the method described in Section 3.2, without using the current observation of the tracker. It then compares the actual measurements of objects with the predicted tracks and computes a difference measure between them. Let the set of predicted states of an object be Φ and the actual measurements as observed by the tracking algorithm be Ω. Using the framework described earlier, we can compute Φ = {φ_1, φ_2, ..., φ_{i+1}}, the predicted or estimated states. Notice that, at step i, we have a prediction for the location


Fig. 6. Foreground modeling: Each column shows an example of improvement in foreground probability estimation. (a) Original image. (b) Probability map using only the mixture of Gaussians method. (c) Foreground probability map using the proposed model in conjunction with the mixture of Gaussians model. (d) and (e) show foreground masks obtained using the maps in (b) and (c), respectively. (b) and (c) show an improvement in foreground probability modeling, while (d) and (e) show a reduction in the number of false positives.


at step i+1. Then, the observed track, described by Ω = {ω_1, ω_2, ..., ω_i}, is labeled as anomalous if d_i > d_th, where

d_i = (1 / (n+1)) Σ_{j=i−n}^{i} [(ω_j − φ_j)^T Σ_Φ^{−1} (ω_j − φ_j)]^{1/2},    (12)

Σ_Φ is the covariance matrix of the x and y coordinates of all the potential (candidate) next states, n is the number of previous states included in the calculation of the Mahalanobis distance, and d_th is a predefined maximum difference between observed and predicted tracks. Furthermore, the plot of unnormalized distances between observed anomalous and predicted trajectories versus the number of states, n, used in the computation of the distance is sublinear with respect to n.

This proposed approach is sufficient to find a sequence of transitions significantly different from the predictions of the state transition distribution and can easily identify an anomalous event in terms of motion patterns. Using this formulation, trajectories that are spatially incoherent or temporally inconsistent with normal behavior can be identified, e.g., the presence of objects in unusual areas or significant speed variation, respectively. Increasing the value of n in (12) helps detect suspicious behavior where the activity is not necessarily unusual in either the spatial or temporal domain, but the object has been in view for a long time and the error (distance between actual and predicted paths) has accumulated and crossed the d_th barrier. Hence, the approach for anomaly detection is able to handle several different kinds of anomalous behavior.
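The distance measure of (12) can be sketched as follows. This is a minimal 2D illustration in which the inverse covariance Σ_Φ^{−1} is passed as the entries (a, b, c) of a symmetric 2×2 matrix [[a, b], [b, c]]; the names are illustrative:

```python
import math

def anomaly_score(observed, predicted, cov_inv, n):
    """Eq. (12): average Mahalanobis distance between the last n+1
    observed states omega_j and predicted states phi_j."""
    a, b, c = cov_inv  # inverse covariance [[a, b], [b, c]]
    i = len(observed) - 1
    total = 0.0
    for j in range(i - n, i + 1):
        dx = observed[j][0] - predicted[j][0]
        dy = observed[j][1] - predicted[j][1]
        total += math.sqrt(a * dx * dx + 2.0 * b * dx * dy + c * dy * dy)
    return total / (n + 1)

def is_anomalous(observed, predicted, cov_inv, n, d_th):
    """A track is labeled anomalous when d_i exceeds the threshold d_th."""
    return anomaly_score(observed, predicted, cov_inv, n) > d_th
```

With the identity covariance, the Mahalanobis distance reduces to the Euclidean one, so an observation at (3, 4) against a prediction at the origin contributes a distance of 5.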

4.4 Persistent Tracking through Occlusions

Finally, we present the application of the proposed scene model to persistent tracking through occlusions and the completion of missing tracks. Persistent tracking requires modeling of the spatiotemporal and appearance properties of the targets. Traditionally, parametric motion models, such as constant velocity or constant acceleration, are used to enforce spatiotemporal constraints. These models usually fail when the paths adopted by objects are arbitrary. The proposed model of learning traffic parameters is able to handle these shortcomings when occlusions are not permanently present in the scene and the patterns of motion through these occlusions have previously been learned, e.g., person-to-person occlusions, or large objects like vehicles that hide smaller moving objects from view. We use the proposed distribution to describe a solution to these problems.

Let p̂(Y | X) = ∫_τ p(X, Y; τ) dτ / ∫_Y ∫_τ p(X, Y; τ) dτ dY, and let Ω be an observed track. A sample set containing N samples is generated from the learned pdf to represent p̂(X_0), where X_0 is the initial location of the track. These samples are represented as {(u_0^{k,−}, w_0^{k,−})}, where u_0^{k,−} ∼ N(X_0, Σ), N(·) represents a Gaussian distribution with mean X_0 and covariance matrix Σ = diag[σ², σ²], and k = {1, ..., N}. For the initial set of samples, the weight is assigned as w_0^{k,−} = ∫_X p̂(u_0^{k,−} | X) dX / P(ζ = u_0^{k,−}), where ζ is a random variable that represents a Gaussian distribution in the neighborhood of X_0. The symbols "−" and "+" indicate the representation of the state before and after the availability of the measurement, respectively. Since only N samples are used to represent the whole distribution, the weights are normalized such that Σ_k w_i^{k,+} = 1. At each step i, a representation of the pdf p̂(X_i | ω_0, ω_1, ..., ω_{i−1}) is retained in the form of weighted candidate samples {(u_i^{k,−}, w_i^{k,−})}, where u_i^{k,−} = f(u_{i−1}^{k,+}) and w_i^{k,−} = w_{i−1}^{k,+}; f is a function that samples from the state transition distribution, according to the conditional probability p̂(· | u_{i−1}^{k,+}) and the smoothness of motion constraint, using the Metropolis-Hastings algorithm described in detail in Fig. 4.

At any time, we want to be able to find correspondences between successive instances of objects. This task proves to be very difficult when the color and shape distributions of objects are very similar, or when the objects are in very close proximity to each other, including in person-to-person occlusion scenarios. However, in the proposed model, a hierarchical use of the transition density enables our scheme to assign larger weights to more likely paths. Assume now that the observed measurements of m objects at time instant i are given by Ω = {ω_i^1, ω_i^2, ..., ω_i^m} and the predicted states of the previously observed s objects are Φ = {φ_i^1, φ_i^2, ..., φ_i^s}, respectively. We must find the mapping function from the set {ω_i^k}, 1 ≤ k ≤ m, to the set {φ_i^l}, 1 ≤ l ≤ s. To establish this mapping function, we use the graph-based algorithm proposed in [39]. The weights of the edges corresponding to the mapping of φ_i^l and ω_i^k are given by the Mahalanobis distance between the estimated position φ_i^l and the observed position ω_i^k, for each l and k. Once the correspondences are established for each object, given the representation of p̂(φ_i | φ_0, ..., φ_{i−1}) and the corresponding new observation ω_i, we update it to {(u_i^{k,+}, w_i^{k,+})} as u_i^{k,+} = u_i^{k,−} and

w_i^{k,+} = w_i^{k,−} Σ_{k=1}^{N} p̂(ω_i | u_i^{k,−}) = Π_{j=0}^{i} Σ_{k=1}^{N} p̂(ω_j | u_j^{k,−}).    (13)

The product of the old weight with the sum of the probabilities of reaching ω_i from each sample state u_i^{k,−}, for all k, translates into a product of transition probabilities for all jumps taken by an object. This results in the assignment of larger weights to higher probability tracks and, in the next time step, this weight helps establish correspondences such that the probability of an object's track is increased. Once updated, the candidate locations are resampled as described in Section 3.2. This approach enables us to simultaneously solve short-term occlusions and person-to-person occlusions, as well as persistent labeling.
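The prediction-to-observation assignment can be sketched as a minimum-cost bipartite matching. The toy below brute-forces the matching over permutations, with Euclidean distance standing in for the Mahalanobis edge weights; the graph-based algorithm of [39] used in the paper scales far better:

```python
import itertools
import math

def correspondences(predicted, observed):
    """Match predicted states phi^l to observations omega^k by minimizing
    the total edge weight; returns a list of (l, k) index pairs.
    For simplicity, assumes no more predictions than observations."""
    def dist(p, o):
        return math.hypot(p[0] - o[0], p[1] - o[1])
    r = min(len(predicted), len(observed))
    best, best_cost = [], float("inf")
    # Try every injective assignment of the first r predictions.
    for perm in itertools.permutations(range(len(observed)), r):
        cost = sum(dist(predicted[l], observed[k]) for l, k in enumerate(perm))
        if cost < best_cost:
            best, best_cost = list(enumerate(perm)), cost
    return best
```

For two predictions at (0, 0) and (10, 10) and two observations at (11, 9) and (1, 0), the minimum-cost matching pairs each prediction with its nearby observation rather than the distant one.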

5 RESULTS

In this section, we present our experimental results. The collected tracking data is derived from two video sequences taken at the scenes shown in Fig. 1. Some figures related to



these data sets and their density estimates are summarized in Fig. 7. The video sequences were acquired by static cameras, and the scenes consist of high human traffic with occlusions and both commonly and sparingly used paths. Initial tracking is done using the algorithm in [3]. As can be seen in Fig. 1, there are numerous entry and exit points in the scenes, and multiple paths can be chosen between different pairs of entry and exit points.

The organization of this section is as follows: First, the results of our methods for the generation of likely paths and the improvement of foreground modeling and detection are presented. The results of our experiments on anomaly detection are then reported. Finally, we show a few examples of persistent tracking of objects through occlusions using the proposed distribution.

5.1 Likely Paths Generation

As described in Section 4.1, starting at random initial states in the image, sampling from the distribution gives possible paths that are usually followed by the traffic. Fig. 8 shows some of these random walks. It should be noted that no observations from the tracking algorithm have been used in this experiment; the likely paths are purely simulated from the learned distribution. It can, however, be seen in Fig. 8 that some tracks are not very smooth and loop at some locations. We want to point out that it is not our purpose to generate smooth paths; rather, we want a depiction of actual tracking scenarios. A comparison of Fig. 8 with Fig. 1b, which shows the actual tracks used during training, demonstrates the similarity of the two and validates the proposed model and sampling method.

5.2 Foreground Detection

A scheme for improving the detection of foreground objects using the proposed model was described in Section 4.2. Fig. 9 presents the results of that scheme. The top row of Figs. 9a and 9b shows images that represent the results of the object detection algorithm in [3]. It can be seen in these

images that parts of trees, a garbage dump, and grass have been detected as objects due to nominal motion caused by wind. The false detections are shown by red bounding boxes. On the other hand, during the training phase, there were very few object detections in these areas, giving a low probability of object presence at these locations. The bottom row of Figs. 9a and 9b shows the results after misdetections have been removed using the combination of the appearance and proposed motion models. The scheme has removed these false positives owing to their low probability of transition or existence, or both. Figs. 9c and 9d show tracks from the tracking algorithm in [3], which are broken in different places because of a significant number of false negatives in the foreground detection. Again, these missed detections were corrected after incorporating the motion-based foreground model.

For a quantitative analysis of the foreground modeling scheme, we manually segmented a 500 frame video sequence into foreground and background regions. The first 200 frames were used to initialize the mixture of Gaussians-based background model. The foreground was estimated on the remaining frames using 1) the mixture of Gaussians model and 2) the proposed logarithmic pooling of the appearance and motion models. Pixel-level precision and recall were computed for both cases. The results of this experiment are shown in Fig. 10. It can be seen from the plots that both the precision and recall of the combined model are consistently better than those of the mixture of Gaussians approach alone. The number of false positives for the combined model was reduced by 79.65 percent on


Fig. 7. Data sets: Observations list the number of times any object was detected, and feature samples are the number of 5D features computed and added to the density estimate (regardless of resampling). Learning time is the time taken to compute the samples and generate their kernel density estimate.

Fig. 8. Examples of likely paths generated by the proposed algorithm using Metropolis-Hastings sampling. Tracks are initialized from random locations selected manually.

Fig. 9. Foreground modeling results. (a) and (b) Results of object detection. The images in the top row are without, and those in the bottom row are with, the use of the proposed model. Green and red bounding boxes show true and false detections, respectively. (c) and (d) Results of blob tracking. Red tracks are the original broken tracks, and black ones are those after improved foreground modeling.

Fig. 10. Foreground modeling results. (a) Number of foreground pixels detected by the mixture of Gaussians method, the proposed method in conjunction with the mixture of Gaussians method, and the ground truth. The pixel-level foreground detection recall and precision using the mixture of Gaussians approach only and the proposed formulation are shown in (b) and (c), respectively.


average, as compared to the mixture of Gaussians model. The average reduction in the number of false negatives was 58.27 percent.

5.3 Anomaly Detection

Fig. 11 shows the results of our anomaly detection algorithm described in Section 4.3. The normal tracks adopted in each scene can be seen in Figs. 1 and 8. In the top row of Fig. 11a, a person walks through the paved region and then into the grass. The algorithm separates this trajectory into typical and atypical parts. The blue track shown in the figure is the track of concern, the red track is the area where the prediction follows the observations, while dotted black shows the prediction once an anomaly is detected. Notice how the blue track is closely followed by the red one as long as the behavior is typical. At the time of training, the dump trailer was at a different position, as seen in Fig. 1a, resulting in the classification of the track observed in the middle row of Fig. 11a as anomalous. The bottom row of Fig. 11a shows another spatially incoherent track through an area where no motion is usually observed.

The top row of Fig. 11b shows an anomaly due to the unusual speed of a person riding a bicycle in an area where people usually only walk. The middle row of Fig. 11b shows the results of our algorithm on another example, where a person is running in the same paved area. Again, this behavior has not been observed in the tracks used for training because people usually do not run in this area. The bottom row of Fig. 11b shows a stopping action, where a person sits down to tie their shoelaces. Three more examples of the second type of anomaly (temporal inconsistency) are shown in Fig. 11c. The top and middle rows in the third column show two persons skating and riding a bicycle, respectively, while the bottom row shows a person walking on the road, where usually only cars are observed. Since the speeds of the objects in Figs. 11b and 11c are significantly different from the average speed of objects during training, the prediction either lags behind the actual measurements from the tracker or hurries ahead, resulting in the observed trajectory being labeled anomalous. Three examples of the third type of anomaly are shown in Fig. 11d. In the top row of Fig. 11d, a person is walking slowly in a circular path for a few minutes. The motion pattern itself is not anomalous to the learned distribution, but the presence of the object in the scene for a relatively long period of time results in the accumulation of the error, calculated as d_i using larger values of n in (12), which eventually becomes greater than d_th, resulting in the classification of the sequence as an anomaly. The middle row of Fig. 11d shows a person jumping over an area which people usually only sit on. Even though the person is not present in the scene for very long, the error d_i quickly becomes significant due to the incoherence of his actions relative to the training examples. The bottom row shows a zigzag pattern of walking, a motion that has not been observed before, resulting in the classification of the trajectory as anomalous.

Since the decision as to whether a given trajectory is anomalous is subjective and differs from one observer to another, we asked three human observers to classify 19 sequences as either normal or anomalous. The proposed method of anomaly detection was run on these 19 sequences and labeled 14 of them as anomalous. The quantitative results of these experiments are summarized in Fig. 12.

5.4 Persistent Tracking

In Section 4.4, we described our method to track objects persistently through occlusions. In the absence of measurements, our algorithm propagates the track based on samples from the distribution. Fig. 13 shows two examples of tracks that were erroneously labeled by the tracking algorithm because both of them had considerable missing portions due to total occlusion. In the top row of Fig. 13, the tracker assumes that two objects walked toward


Fig. 11. Results of anomaly detection. (a) Spatially anomalous, (b) and (c) temporally anomalous, and (d) suspicious behavior due to presence over a large distance or an extended period. The blue track represents the actual (observed) track. Red and black tracks correspond to typical and atypical (anomalous) predicted paths, respectively.

Fig. 12. Quantitative analysis of anomaly detection results using classification by three human observers as the ground truth. The proposed system labeled 14 of the 19 sequences as anomalous.

Fig. 13. For each row, (a) shows the observed tracks in blue and red that have been labeled wrong, and (b) and (c) show the stitched part of the tracks in black and the actual tracks in red and blue, respectively.


the golf car and then returned to their original locations, which is not correct. Each of these tracks is propagated through predicted locations to join the correct track. The result of the propagation based on the weighted mean of these samples is shown in the top row of Figs. 13b and 13c, where both tracks are shown separately for clarity. Red and blue show two different tracks, and black shows the part where the trajectory has been stitched. The bottom row of Fig. 13 shows a similar example, where a truck is obstructing the camera's view.

Another example of persistent tracking through occlusions is shown in Figs. 14 and 15, where we created a considerably large simulated occlusion in a small video sequence, shown in Fig. 14b as a black box in the middle of the image. The location of this occlusion makes the task challenging since most of the likely paths in the scene pass through this region. This sequence contains four tracks that have significant portions under occlusion. For comparison with other algorithms, we also tested Kalman filter-based tracking on the same sequence, where the process noise function is assumed to be Gaussian and trajectories are assumed to be locally linear, as in [28]. The results on the four tracks are shown separately in Fig. 15. The direction of movement in all tracks is from the top of the image downward, except in Fig. 15c, where it is from left to right. The results of Kalman filter tracking are shown as white tracks, the proposed approach as yellow, and the ground truth as green tracks. The ground truth trajectories are available since the occlusion is only simulated. The reason for the sharp turns at the end of the predicted tracks is the recovery from error once the object comes out of occlusion. It is fairly obvious that Kalman filter-based tracks tend to follow linear constant-motion dynamics, which actually works better than our algorithm in Fig. 15b, but, in the other three tracks, the proposed approach works significantly better than the Kalman filter. The reason for the deviation of the track generated using motion patterns from the ground truth (Fig. 15b) is that the length of the track before the occlusion is very small. The algorithm is therefore unable to differentiate between the actual track coming from the top and a track that goes from that position to the left. Consequently, the predicted track goes left, since the small track length makes it seem that the direction of movement is toward the left (Fig. 16).
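The Kalman filter baseline used for comparison assumes Gaussian process noise and locally linear trajectories, as in [28]. The following is a minimal constant-velocity sketch of such a predictor, not the code used in the experiments; the matrix layout and the noise magnitudes are illustrative assumptions.

```python
import numpy as np

def make_cv_kalman(dt=1.0, q=1e-2, r=1.0):
    """Constant-velocity Kalman matrices for 2D image positions.

    State: [x, y, vx, vy]; measurement: [x, y]. The noise magnitudes
    q and r are illustrative assumptions.
    """
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)
    Q = q * np.eye(4)  # Gaussian process noise, as assumed in [28]
    R = r * np.eye(2)  # measurement noise
    return F, H, Q, R

def kalman_step(x, P, z, F, H, Q, R):
    """One predict/update cycle; pass z=None while the object is occluded."""
    x, P = F @ x, F @ P @ F.T + Q              # predict
    if z is not None:
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)         # Kalman gain
        x = x + K @ (z - H @ x)
        P = (np.eye(4) - K @ H) @ P
    return x, P
```

During occlusion, only the predict step runs, so the state extrapolates linearly, which is exactly the constant-motion behavior visible in the white tracks of Fig. 15.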

Finally, we report some of the runtime performance data. All the experiments were run on a Pentium 4 2.53 GHz machine, and the code was written in Matlab without any optimizations for speed. For track generation (used in likely track generation, anomaly detection, and persistent tracking), the proposed algorithm performed at an average of 0.422 fps. The Metropolis-Hastings algorithm failed to sample a new state approximately 11.72 percent of the time. The average time taken to compute the probability of a transition, p(X, Y, τ), was 35.78 ms, but with a high standard deviation, as the time taken was lower in areas with less traffic. The proposed foreground detection algorithm performed at an average of 0.637 fps. The performance of the algorithm using a histogram approximation of the kernel density estimate is reported in Fig. 17.
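The Metropolis-Hastings failure (rejection) rate quoted above can be illustrated with a minimal random-walk sampler. The sketch below is not the authors' implementation (the experiments were in MATLAB); it uses an unnormalized 1D target density for simplicity and reports the rejection rate alongside the samples.

```python
import numpy as np

def metropolis_hastings(target_pdf, x0, n_samples, proposal_sigma, seed=0):
    """Random-walk Metropolis-Hastings with a Gaussian proposal.

    The Gaussian proposal is symmetric, so it cancels in the acceptance
    ratio. The returned failure rate is the fraction of proposals that
    were rejected, analogous to the 11.72 percent figure reported above.
    """
    rng = np.random.default_rng(seed)
    x, px = float(x0), target_pdf(x0)
    samples, rejected = [], 0
    for _ in range(n_samples):
        y = x + proposal_sigma * rng.standard_normal()  # propose a move
        py = target_pdf(y)
        accept = True if px == 0 else rng.random() < min(1.0, py / px)
        if accept:
            x, px = y, py
        else:
            rejected += 1
        samples.append(x)
    return np.array(samples), rejected / n_samples
```

A poorly scaled `proposal_sigma` still yields correct samples in the limit, but drives the rejection rate up, which is the speed effect discussed in the response to Comment 9 below.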

6 DISCUSSION

In this section, we discuss some of the questions and concerns raised by the anonymous reviewers and the associate editor.

Comment 1. The proposed technique to handle problems of object association across short/long-term occlusions is not convincing, especially when there are multiple objects within the scope of this association problem. The underlying track breaks may require sophisticated appearance matching over and above the kinematic constraints learned from the track distributions.

Response. We agree that appearance is a key feature for object association and fully understand that successful object association requires spatiotemporal and appearance features to complement each other.

1482 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 31, NO. 8, AUGUST 2009

Fig. 14. Example of persistent tracking for multiple simultaneous objects with overlapping or intersecting tracks undergoing occlusion. (a) Actual original tracks (ground truth). (b) Broken tracks due to simulated occlusion, shown as a black region.

Fig. 15. Results for the scenario shown in Fig. 14. The green track is the ground truth. Tracking through occlusion using the Kalman filter is shown in white, and yellow tracks are generated using the proposed approach. Notice that both methods can recover well once the measurement is available again, but, during occlusion, the proposed method stays closer to the ground truth.

Fig. 16. Mahalanobis distances of tracks generated using the proposed approach and Kalman filter-based tracking, from the ground truth.

Fig. 17. Performance of the proposed algorithm when using the histogram approximation of kernel density. The metrics used are as follows: (a) track generation (in frames per second), (b) Metropolis-Hastings failure rate, (c) time taken to compute probability (in milliseconds), and (d) foreground detection (in frames per second).

Traditionally, appearance models are used along with kinematic constraints (imposed by an assumed motion model, such as constant velocity or constant acceleration) for object association. The contribution of this paper is to learn these kinematic constraints directly from the track distribution instead of assuming some motion model to be valid everywhere in a scene. The kinematic model learned this way not only includes the above-mentioned motion properties but also embodies scene-specific knowledge, such as the presence of paths, entry/exit points, and people's preferences. Similar to traditional kinematic models, the proposed spatiotemporal model can be complemented with more sophisticated appearance matching to solve complex occlusions, e.g., occlusions due to group motion.
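Learning kinematic constraints directly from the track distribution can be sketched as a kernel density estimate over observed transitions. The feature layout (x1, y1, x2, y2, tau) and the isotropic Gaussian kernel below are simplifying assumptions for illustration, not the paper's exact multivariate model or bandwidth selection.

```python
import numpy as np

def kde_transition_pdf(samples, bandwidth):
    """Kernel density estimate over transition features.

    Each row of `samples` is one observed transition (x1, y1, x2, y2, tau):
    a start location, an end location, and the transit time between them.
    A product Gaussian kernel with per-dimension `bandwidth` is used here
    for simplicity.
    """
    samples = np.asarray(samples, dtype=float)
    h = np.asarray(bandwidth, dtype=float)
    norm = (2 * np.pi) ** (samples.shape[1] / 2) * np.prod(h)

    def pdf(query):
        # Average the kernel responses of all stored transitions.
        d = (samples - np.asarray(query, dtype=float)) / h
        return np.exp(-0.5 * np.sum(d * d, axis=1)).mean() / norm
    return pdf
```

Transitions that resemble frequently observed motion (along learned paths, with typical transit times) then score high under `pdf`, while transitions that cut across the scene's traffic patterns score low.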

Comment 2. Are you assuming ideal tracking? What influence does imperfect tracking have on your algorithm's performance?

Response. No, we are not assuming the input tracks to be ideal and, like other real-time systems, the KNIGHT system is also susceptible to errors due to illumination changes, shadows, and severe occlusions. However, these errors do not greatly affect the performance of the proposed algorithm because the system is trained on video sequences collected over extensive periods of time. Over time, the feature samples in the density estimate are resampled and outliers are rejected. Consequently, the true patterns of object dynamics are retained.

Comment 3. Although you state that the proposed algorithm is resilient to failures of the underlying tracker, it is not clear from the formulation that this is so. If a tracker such as KNIGHT tends to fragment tracks a lot, then transition times beyond a small value will almost never get good support. This will not reflect the true distributions. It is not clear from this paper how this aspect is handled. Or is it handled at all?

Response. The proposed framework does not rely on tracking information. As a matter of fact, the distributions can be generated by using the detections alone (by assigning a link between each pair of detections that occur within a given time period and using it as a data point for the model). Over time, the actual links will persist and the noisy links will average out. The tracking information is used only when it is available, to reduce the influence of noise on the model. For example, consider a high-precision tracker that links two observations with high precision but may fragment tracks a lot. If the tracker establishes a link between two observations, then we ignore other possible links. However, if such a link for an observation is not present, then all possible pairs are considered (for example, for a time interval of n frames, the first and last n points in a track do not have links established by the tracker). In this way, the algorithm is resilient to failures of the underlying tracker.
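The linking scheme described in this response can be sketched as follows. The detection format and the function below are hypothetical illustrations of the idea, assuming temporal gating only; a real system would also gate spatially.

```python
def candidate_links(detections, max_gap, tracker_links=frozenset()):
    """Enumerate training transitions from raw detections.

    `detections` is a list of (frame, x, y) tuples; `tracker_links` holds
    index pairs (i, j) the tracker has already associated. When a detection
    has a tracker-supplied link, other pairings for it are ignored;
    otherwise every pair within `max_gap` frames becomes a (possibly noisy)
    training sample, on the premise that true links persist over time
    while spurious ones average out in the density estimate.
    """
    linked = {i for pair in tracker_links for i in pair}
    links = list(tracker_links)
    for i, (ti, _, _) in enumerate(detections):
        if i in linked:
            continue  # trust the tracker's association for this detection
        for j, (tj, _, _) in enumerate(detections):
            if j != i and 0 < tj - ti <= max_gap:
                links.append((i, j))
    return links
```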

Comment 4. If there is a static object, like a truck or a building, that occludes objects consistently over the course of their motion, how can the proposed algorithm connect these objects, especially in situations when objects close to each other are moving, for instance, groups of people?

Response. As mentioned in the response to Comment 3, the algorithm does not fully rely on the track links and thus can connect objects across long-term occlusions due to a static object. The algorithm, however, cannot solve occlusions caused by objects moving together in groups. Resolving such occlusions would require the use of appearance features along with the proposed spatiotemporal model.

Comment 5. Are there some anomalous trajectories in your training data set?

Response. There are some anomalous trajectories in our training sets, for example, people walking through the grassy areas. However, by definition, anomalous trajectories are very sparse and, hence, have very low probability in the model as compared to normal trajectories or events.

Comment 6. The determination of d_th is tricky. How can it be determined automatically?

Response. The threshold d_th depends on several factors, including the distance of an object from the camera. In other words, the farther away the object is, the smaller the value of d_th should be, especially for scenes where the sizes of objects vary greatly from one region to another within the image lattice. The threshold can be determined automatically if either the camera calibration is known or the object sizes are also used in the feature vector.

Comment 7. The computational cost of the system is very high.

Response. The system has a high computational cost for training, but training can be performed offline. The runtime performance of the system can be significantly improved through software optimization and by using well-known approximations of KDE, such as histograms, the Fast Gauss Transform [40], [41], and mixture models. In our experiments, the use of histograms (in unoptimized MATLAB code) improved the runtime performance of foreground estimation from 0.6 to 2.75 fps (see Fig. 17).
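The histogram approximation mentioned in this response can be sketched as precomputing the density on a fixed grid so that each query becomes a table lookup instead of a kernel sum. This is an illustrative sketch; the bin counts and feature ranges are assumptions, and the paper's actual feature space is higher-dimensional.

```python
import numpy as np

def histogram_approximation(samples, bins, ranges):
    """Precompute a normalized histogram over KDE feature samples.

    Replaces per-query kernel sums with an O(1) table lookup, trading
    smoothness and accuracy for speed.
    """
    hist, edges = np.histogramdd(np.asarray(samples, dtype=float),
                                 bins=bins, range=ranges, density=True)

    def pdf(query):
        # Locate the bin containing the query in each dimension.
        idx = tuple(
            int(np.clip(np.searchsorted(e, v, side='right') - 1, 0, len(e) - 2))
            for v, e in zip(query, edges))
        return hist[idx]
    return pdf
```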

Comment 8. How do you apply incremental learning to update the model adaptively?

Response. Traditionally, kernel density estimation is made adaptive to changes in the data by keeping a fixed-size queue of n data points. Once k new data points are observed, the oldest k points in the queue are replaced with the new data points and the model is updated. The model updates can be done on a low-priority thread, although online updating will not have any significant effect on performance. Note that the proposed system is based on static scenes and the availability of large training data and, hence, does not have to be updated very frequently.
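The fixed-size queue described above can be sketched with a bounded buffer; the class name and interface below are hypothetical, chosen only to illustrate the eviction behavior.

```python
from collections import deque

class SlidingWindowKDE:
    """Fixed-size sample buffer for adaptive kernel density estimation.

    Keeps only the most recent n feature samples; extending the buffer
    with k new points silently evicts the k oldest, so a density estimate
    rebuilt from `data()` tracks slow changes in the scene while stale
    observations age out.
    """
    def __init__(self, n):
        self.samples = deque(maxlen=n)

    def update(self, new_points):
        self.samples.extend(new_points)  # oldest points drop automatically

    def data(self):
        return list(self.samples)
```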

Comment 9. Why use the Gaussian distribution as the proposal density for Metropolis-Hastings sampling?

Response. The reason we choose the Gaussian distribution as the proposal density is the observation that the true conditional density of an object transition, given the current state, often resembles a Gaussian distribution. In addition to the Gaussian distribution, we experimented with Uniform and Epanechnikov densities centered on the current location. Assume that the initial location X of a transition can be written as X = (x_1, x_2). The results of sampling from the target density $\int_Y \int_\tau \int_{x_1} p(X, Y, \tau)\, dx_1\, d\tau\, dY$ are shown in Fig. 18. The 1D marginal density of the proposed distribution is chosen for ease of visualization. As can be seen in Fig. 18, the quality of the samples is not affected by the choice of proposal density; however, the performance in terms of speed is significantly reduced by a bad choice, since more samples are rejected.

In conclusion, we have introduced a simple and intuitive novel method for modeling scene dynamics. The proposed model is rich and accurately reflects the transition likelihoods in the scene. We have shown the effectiveness and validity of the method experimentally for diverse surveillance applications.

SALEEMI ET AL.: PROBABILISTIC MODELING OF SCENE DYNAMICS FOR APPLICATIONS IN VISUAL SURVEILLANCE 1483

ACKNOWLEDGMENTS

This research was supported in part by the US Government VACE Program. The authors would like to thank Dr. Xin Li, Dr. Yaser Sheikh, and Dr. Marshall Tappen for their valuable comments throughout this research.

REFERENCES

[1] “Special Issue on Video Communications, Processing, and Understanding for Third Generation Surveillance Systems,” Proc. IEEE, vol. 89, no. 10, Oct. 2001.

[2] R. Collins, J. Lipton, and T. Kanade, “Introduction to the Special Section on Video Surveillance,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 8, Aug. 2000.

[3] O. Javed, K. Shafique, and M. Shah, “Automated Visual Surveillance in Realistic Scenarios,” IEEE Multimedia, pp. 30-39, Jan.-Mar. 2007.

[4] I. Biederman, “On the Semantics of a Glance at a Scene,” Perceptual Organization, pp. 213-253, Lawrence Erlbaum Assoc., 1981.

[5] A. Torralba, “Contextual Influences on Saliency,” Neurobiology of Attention, 2005.

[6] K. Tieu, G. Dalley, and W.E.L. Grimson, “Inference of Non-Overlapping Camera Network Topology by Measuring Statistical Dependence,” Proc. 10th IEEE Int’l Conf. Computer Vision, pp. 1842-1849, 2005.

[7] T.J. Ellis, D. Makris, and J.K. Black, “Learning a Multi-Camera Topology,” Proc. Joint IEEE Int’l Workshop Visual Surveillance and Performance Evaluation of Tracking and Surveillance, 2003.

[8] C. Stauffer, “Learning to Track Objects through Unobserved Regions,” Proc. IEEE Workshop Motion and Video Computing, vol. 2, pp. 96-102, 2005.

[9] O. Javed, K. Shafique, Z. Rasheed, and M. Shah, “Modeling Inter-Camera Space-Time and Appearance Relationships for Tracking across Non-Overlapping Views,” Computer Vision and Image Understanding, vol. 109, no. 2, Feb. 2008.

[10] H. Buxton, “Generative Models for Learning and Understanding Dynamic Scene Activity,” Proc. Generative Model Based Vision Workshop, 2002.

[11] A. Hunter, J. Owens, and M. Carpenter, “A Neural System for Automated CCTV Surveillance,” IEE Intelligent Distributed Surveillance Systems, 2003.

[12] J. Owens and A. Hunter, “Application of the Self-Organising Map to Trajectory Classification,” Proc. Third IEEE Int’l Workshop Visual Surveillance, 2000.

[13] N. Johnson and D.C. Hogg, “Learning the Distribution of Object Trajectories for Event Recognition,” Image and Vision Computing, vol. 14, no. 8, pp. 609-615, Aug. 1996.

[14] J.H. Fernyhough, A.G. Cohn, and D.C. Hogg, “Generation of Semantic Regions from Image Sequences,” Proc. Fourth European Conf. Computer Vision, 1996.

[15] R.J. Howard and H. Buxton, “Analogical Representation of Spatial Events for Understanding Traffic Behaviour,” Proc. 10th European Conf. Artificial Intelligence, 1992.

[16] E.B. Koller-Meier and L. Van Gool, “Modeling and Recognition of Human Actions Using a Stochastic Approach,” Proc. Second European Workshop Advanced Video-Based Surveillance Systems, 2001.

[17] J. Lou, Q. Liu, T. Tan, and W. Hu, “Semantic Interpretation of Object Activities in a Surveillance System,” Proc. 16th Int’l Conf. Pattern Recognition, 2002.

[18] W.E.L. Grimson, C. Stauffer, R. Romano, and L. Lee, “Using Adaptive Tracking to Classify and Monitor Activities in a Site,” Proc. Int’l Conf. Computer Vision and Pattern Recognition, 1998.

[19] C. Stauffer and W.E.L. Grimson, “Learning Patterns of Activity Using Real Time Tracking,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 747-767, Aug. 2000.

[20] C. Stauffer, “Estimating Tracking Sources and Sinks,” Proc. Second IEEE Event Mining Workshop, 2003.

[21] P. Remagnino and G.A. Jones, “Classifying Surveillance Events from Attributes and Behaviour,” Proc. British Machine Vision Conf., 2001.

[22] M. Walter, A. Psarrou, and S. Gong, “Learning Prior and Observation Augmented Density Models for Behaviour Recognition,” Proc. British Machine Vision Conf., 1999.

[23] A. Galata, N. Johnson, and D. Hogg, “Learning Variable Length Markov Models of Behaviour,” Computer Vision and Image Understanding, vol. 81, no. 3, pp. 398-413, 2001.

[24] T. Huang and S. Russell, “Object Identification in a Bayesian Context,” Proc. 15th Int’l Joint Conf. Artificial Intelligence, 1997.

[25] V. Kettnaker and R. Zabih, “Bayesian Multi-Camera Surveillance,” Proc. Int’l Conf. Computer Vision and Pattern Recognition, 1999.

[26] S.L. Dockstader and A.M. Tekalp, “Multiple Camera Fusion for Multi-Object Tracking,” Proc. IEEE Workshop Multi-Object Tracking, 2001.

[27] D. Hoiem, A. Efros, and M. Hebert, “Putting Objects in Perspective,” Proc. Int’l Conf. Computer Vision and Pattern Recognition, 2006.

[28] R. Rosales and S. Sclaroff, “Improved Tracking of Multiple Humans with Trajectory Prediction and Occlusion Modeling,” Proc. CVPR Workshop Interpretation of Visual Motion, 1998.

[29] R. Kaucic, A. Perera, G. Brooksby, J. Kaufhold, and A. Hoogs, “A Unified Framework for Tracking through Occlusions and across Sensor Gaps,” Proc. Int’l Conf. Computer Vision and Pattern Recognition, 2005.

[30] X. Wang, K. Tieu, and E. Grimson, “Learning Semantic Scene Models by Trajectory Analysis,” Proc. Ninth European Conf. Computer Vision, 2006.

[31] A. Perera, C. Srinivas, A. Hoogs, G. Brooksby, and W. Hu, “Multi-Object Tracking through Simultaneous Long Occlusions and Split-Merge Conditions,” Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, pp. 666-673, 2006.

[32] W. Hu, X. Xiao, Z. Fu, D. Xie, T. Tan, and S. Maybank, “A System for Learning Statistical Motion Patterns,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, no. 9, pp. 1450-1464, Sept. 2006.

[33] E. Parzen, “On the Estimation of a Probability Density Function and Mode,” Annals of Math. Statistics, vol. 33, pp. 1065-1076, 1962.

[34] R. Duda, P. Hart, and D. Stork, Pattern Classification, second ed. Wiley Interscience, 2001.

[35] B. Turlach, “Bandwidth Selection in Kernel Density Estimation: A Review,” Institut de Statistique, 1993.

[36] B.W. Silverman, Density Estimation for Statistics and Data Analysis. Chapman and Hall, 1986.


Fig. 18. A comparison of different proposal densities. The first row shows histograms of the 10,000 samples from the respective proposal distributions. The second row shows histograms of the samples that were accepted for the target density, plotted in red. All histograms contain 200 bins. (a) Uniform. (b) Epanechnikov. (c) Gaussian.


[37] J.A. Benediktsson and P.H. Swain, “Consensus Theoretic Classification Methods,” IEEE Trans. Systems, Man, and Cybernetics, vol. 22, pp. 688-704, 1992.

[38] G. Hinton, “Products of Experts,” Proc. Ninth Int’l Conf. Artificial Neural Networks, pp. 1-6, 1999.

[39] K. Shafique and M. Shah, “A Non-Iterative Greedy Algorithm for Multi-Frame Point Correspondence,” Proc. Ninth IEEE Int’l Conf. Computer Vision, 2003.

[40] L. Greengard and J. Strain, “The Fast Gauss Transform,” SIAM J. Scientific and Statistical Computing, vol. 12, no. 1, pp. 79-94, 1991.

[41] A. Elgammal, R. Duraiswami, and L. Davis, “The Fast Gauss Transform for Efficient Kernel Density Evaluation with Applications in Computer Vision,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 25, no. 11, pp. 1499-1504, Nov. 2003.

Imran Saleemi received the BS degree in computer system engineering from the Ghulam Ishaq Khan Institute of Engineering Sciences and Technology, Topi, Pakistan, in 2004 and the MS degree in computer science from the University of Central Florida in 2008. He is currently working toward the PhD degree in the Computer Vision Laboratory, University of Central Florida. His research interests include visual tracking and surveillance, probabilistic graphical models, and multiview object detection and categorization.

Khurram Shafique received the BE degree in computer systems engineering from NED University of Engineering and Technology, Karachi, Pakistan, in 1998 and the MS and PhD degrees in computer science from the University of Central Florida in 2001 and 2004, respectively. He is currently a research scientist at the Center for Video Understanding Excellence, ObjectVideo, Reston, Virginia. His research interests include graph theory, discrete optimization, tracking, correspondence across multiple cameras, and real-time surveillance systems. He was a recipient of the Hillman Fellowship Award for excellence in research in computer science in 2003.

Mubarak Shah is an Agere chair professor of computer science and the founding director of the Computer Vision Laboratory at the University of Central Florida (UCF). He is a coauthor of two books, namely, Motion-Based Recognition (Kluwer Academic, 1997) and Video Registration (Kluwer Academic, 2003). In 2006, he was awarded the Pegasus Professor Award, the highest faculty award at UCF. He was an IEEE distinguished visitor speaker for 1997-2000, was nominated for the Best Paper Award at the ACM Multimedia Conference in 2005, and is a recipient of the IEEE Outstanding Engineering Educator Award in 1997, the Harris Corporation's Engineering Achievement Award in 1999, the TOKTEN awards from UNDP in 1995, 1997, and 2000, the Teaching Incentive Program Award in 1995 and 2003, the Research Incentive Award in 2003, the Millionaires' Club awards in 2005 and 2006, the University Distinguished Researcher Award in 2007, and an honorable mention for the 10th IEEE International Conference on Computer Vision (ICCV '05) Where Am I? Challenge Problem. He is an editor of the international book series on video computing, editor-in-chief of Machine Vision and Applications, and an associate editor of ACM Computing Surveys. He was also an associate editor of the IEEE Transactions on Pattern Analysis and Machine Intelligence and a guest editor of the special issue of the International Journal of Computer Vision on video computing. He is a fellow of the IEEE, the International Association for Pattern Recognition (IAPR), and the International Society for Optical Engineering (SPIE).

