

JOURNAL OF AERONAUTICS AND SPACE TECHNOLOGIES JULY 2010 VOLUME 4 NUMBER 4 (11-22)

REVIEW AND EVALUATION OF WELL-KNOWN METHODS FOR MOVING OBJECT DETECTION AND TRACKING IN VIDEOS

Bahadır KARASULU

Ege University, Engineering Faculty, Computer Engineering Department, IZMIR
[email protected]

Received: 1st November 2009, Accepted: 26th March 2010

ABSTRACT
Moving object detection and tracking (D&T) are important initial steps in object recognition, context analysis and indexing processes for visual surveillance systems. It is a big challenge for researchers to decide which D&T algorithm is more suitable for which situation and/or environment, and to determine how accurately object D&T (real-time or non-real-time) is performed. There is a variety of object D&T algorithms (i.e. methods) and of publications on their performance comparison and evaluation via performance metrics. This paper provides a systematic review of these algorithms and performance measures and assesses their effectiveness via metrics.

Keywords: Image processing, Object detection, Object tracking, Performance metrics, Evaluation.

EVALUATION AND REVIEW OF WELL-KNOWN METHODS FOR MOVING OBJECT DETECTION AND TRACKING IN VIDEOS

ÖZET
Moving object detection and tracking (D&T) are the most important initial steps in the context analysis, indexing and object recognition processes for visual surveillance systems. For researchers, determining which D&T algorithm is more suitable for which environment and/or situation, and with what accuracy it can perform object D&T (real-time or non-real-time), is quite a difficult endeavor. There are many varieties of object D&T algorithms (methods) and of their performance comparisons and evaluations via performance metrics. This study provides a systematic review of these algorithms and performance measurements, and a comparison of their effectiveness via metrics.

Anahtar Kelimeler: Image processing, Object detection, Object tracking, Performance metrics, Evaluation.

1. INTRODUCTION
Video object segmentation, detection and tracking processes are the basic starting steps for more complex processes, such as video context analysis and multimedia indexing. Object tracking in videos can be defined as the process of segmenting an object of interest from a sequence of video scenes. This process should keep track of the object's motion, orientation, occlusion, etc., in order to extract useful context information to be used in higher-level processes. In Dey's study [1], context is defined as "any information that can be used to characterize the situation of an entity". Moreover, an entity can be a person, place or object that is considered to be related to users and applications. These applications aim at visual or non-visual information extraction and/or retrieval.

When the camera is fixed and the number of targets is small, objects can easily be tracked using simple methods. Computer vision-based methods often provide the only non-invasive solution. Their applications can be divided into three different groups: surveillance, control and analysis. The object detection and/or tracking (D&T) process is an essential item for surveillance applications. The control applications, which use some parameters to control motion estimation, etc., are used to control the relevant vision system. The analysis applications are often automatic and are used to optimize and/or diagnose a system's performance. For well-predefined (namely, annotated) datasets, the object recognition algorithms rarely give good accuracy [2, 3, 4].


In the literature, previous works have concentrated mainly on moving-object D&T in videos. One can find a number of methods dedicated to generic-object D&T in video processing, such as Background Subtraction (BS) [5, 6], Mean-Shift (MS) and/or Continuously Adaptive Mean-Shift (CMS) [7-9], Optical Flow (OF) [10, 11], Active Contour Models (i.e. Snakes) [12, 13], etc. Template matching is an essential object D&T method, but it is simpler than the others; it is generally based on matching a given template as an object in a given frame. This paper provides a comparison of the aforementioned video object D&T algorithms and their application areas. The rest of this paper is organized as follows: Section 2 gives a general overview of well-known object D&T algorithms. Section 3 provides detailed information about performance metrics and evaluation, common video databases and/or datasets, and evaluation projects in the literature. Section 4 provides comparison results, and finally the conclusion is drawn in Section 5.

2. OBJECT DETECTION AND TRACKING METHODS
Computer and human vision work similarly in terms of functionality, but they do not have exactly the same functions [14]. A camera is the basic sensing element, and it is the first step for a good visual surveillance system's object D&T process. Digital signal and image processing are the starting levels of digital video processing. In digital video processing, the object detection process affects the object tracking and classification processes as well. Manual object D&T is a tedious task. For this reason, experts in the computer vision research area have long studied semiautomatic or automatic D&T techniques. These techniques often involve maintaining a model related to the spatial relationships between the various features [15]. Some image features, such as color, motion, and edges, can be used to track the moving object in videos [16]. In addition, video segmentation has two major types: spatial segmentation and temporal segmentation. Spatial segmentation is based on the digital image segmentation approach, which is used locally or globally to partition an image, at bi-level or a higher level, into different regions belonging to the image. Automatic video segmentation aims at separating moving objects from the background and identifying accurate boundaries of the objects. The performance of tracking algorithms usually also depends on the measure characterizing the similarity or dissimilarity between two subsequent images/video frames [16]. The intra- or inter-frame distance metrics are often based on the Bhattacharyya [17], Hamming, Mahalanobis, Manhattan (taxicab) or simple Euclidean distance [18].
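As an aside (not part of the original study), the Bhattacharyya coefficient mentioned above can be computed from two histograms in a few lines; the helper below is a sketch with illustrative data:

```python
import numpy as np

def bhattacharyya_coefficient(p, q):
    """Similarity between two histograms: 1.0 for identical
    distributions, 0.0 for non-overlapping ones."""
    p = p / p.sum()  # normalize to probability distributions
    q = q / q.sum()
    return float(np.sum(np.sqrt(p * q)))

# Example: compare two 8-bin grayscale histograms of consecutive frames
h1 = np.array([10, 30, 50, 40, 20, 10, 5, 5], dtype=float)
h2 = np.array([12, 28, 48, 42, 22, 9, 6, 3], dtype=float)
print(bhattacharyya_coefficient(h1, h2))  # close to 1.0 -> very similar
```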

Comparison of these metrics is only possible by applying them independently of the application and on a scientific basis [19]. A video's hierarchical structure units are scene, shot, and frame; the frame is the lowest level in this hierarchy. Video-content analysis, content-based video browsing and retrieval use these units. In such contexts, video applications must partition a video sequence into shots. In the study of Camara-Chavez et al. [20], a shot is defined as an image sequence that presents continuous action, captured from a single operation of a single camera. Shots are joined to form the complete sequence. Furthermore, Camara-Chavez et al. expressed that shots can be effectively considered the smallest indexing unit where no change in the scene content can be perceived, and they indicated that higher-level concepts are often constructed by combining and analyzing the inter- and intra-shot relationships [20]. Generally, each shot can be represented by key frames and indexed according to the spatial and temporal features in a video or multimedia indexing/editing application. Most surveillance systems involve video sequences that are captured from a fixed (i.e. stationary) camera and based on a single shot.

2.1. Background Subtraction
In video processing applications, variants of the background subtraction (BS) method are broadly used for the detection of moving objects in video sequences. The BS method's speed in locating moving objects makes it attractive for users. Unfortunately, a simple inter-frame difference with a global threshold reveals itself to be sensitive to violations of the basic assumptions of BS. These assumptions rely on a firmly fixed camera with a static, noise-free background; real-life systems have camera jitter, illumination changes, etc. [21, 22]. In object detection, a scene can usually be represented by a model called the background model, and the related algorithm (or method) finds the deviations from the background model for each incoming frame (i.e. frame differencing). A pixel-level background model is generated and maintained to keep track of the time-evolving background. A moving object can be defined as any significant change in an image region compared to the background model. Pixels undergoing changes within regions are marked for further processing. Usually, a connected component algorithm is applied to obtain connected regions corresponding to the objects. Background maintenance is the essential part, which may affect the performance of BS in time-varying situations. This process is referred to as BS (as mentioned in the survey study of Yilmaz et al.) [21]. Basic BS methods usually employ a single reference image corresponding to an empty scene as the background model.


This kind of simple model is not suitable for the real world's much more complex surveillance systems.

Fig. 1 shows the BS-based object D&T system's architecture.

Figure 1. BS based object D&T system’s architecture.

In the study by Benezeth et al. [22], common BS techniques were reviewed. In principle, according to Benezeth et al., these techniques rest on the hypothesis that the observed video sequence I is made of a fixed background B, in front of which moving objects are observed. With the assumption that a moving object at time t has a color (or a color distribution, or any other desired feature) different from the one observed in B, the principle of BS methods can be summarized by the following formulation [22]:

$$\psi_t(s) = \begin{cases} 1 & \text{if } d(I_{s,t}, B_s) > \tau \\ 0 & \text{otherwise} \end{cases} \quad (1)$$

where $\psi_t$ is the motion label field at time t (also called the motion mask) and a function of the spatial location $s = (x, y)$; $d$ is a distance between $I_{s,t}$, the video frame at time t at pixel s, and $B_s$, the background at pixel s; and $\tau$ is a threshold. The main difference among most of the BS methods is how well B is modeled and which distance metric d is used (e.g. Euclidean, Mahalanobis or Manhattan, etc.) [22]. In the literature, there are various BS techniques, such as basic motion detection, one Gaussian, the Gaussian mixture model, kernel density estimation, inter-frame minimum and maximum difference, etc. [22]. In a BS algorithm, the four main steps are preprocessing, background modeling, foreground detection, and data validation, as declared in Cheung and Kamath's study [23]. The preprocessing step involves some simple image processing tasks that change the raw input video sequence into a format used in subsequent steps. In background modeling, the new video frame is used to calculate and update a background model, which provides a statistical description of the entire background scene (which may be static or dynamic). Much research has been devoted to developing a background model that is robust against environmental changes in the background, but sensitive enough to identify all moving objects of interest [22, 23]. One can classify background modeling techniques into two broad categories (non-recursive and recursive).

Non-recursive techniques include frame differencing, the median filter, the linear predictive filter and the non-parametric model. Recursive techniques include the approximated median filter, the Kalman filter and the Mixture of Gaussians (MoG). Details of the aforementioned techniques can be found in the studies of Yilmaz et al., Benezeth et al., Cheung and Kamath, and Fuentes and Velastin [21-24]. In the foreground detection step, pixels in the video frame which are not explained well enough by the background model [5] are identified in a binary candidate foreground mask [23]. The most important limitation of background subtraction is the requirement of fixed (i.e. stationary) cameras; camera motion usually distorts the background models and causes false or partial object detection. All the aforementioned techniques use a single image as their background model, except the non-parametric model and the MoG model [21, 22]. There are several foreground detection approaches, and the most commonly used one is to check whether the input pixel is significantly different from the corresponding background estimate:

$$|I_t(x, y) - B_t(x, y)| > \tau \quad (2)$$

where $I_t(x, y)$ and $B_t(x, y)$ denote the luminance pixel intensity and its background estimate at spatial location $(x, y)$ and time t. In addition, another popular foreground detection scheme is to apply a threshold based on the normalized statistics [23]:

$$\frac{|I_t(x, y) - B_t(x, y) - \mu_d|}{\sigma_d} > \tau_s \quad (3)$$

where $\mu_d$ and $\sigma_d$ are the mean and the standard deviation of $I_t(x, y) - B_t(x, y)$ over all spatial locations $(x, y)$. In these formulations, $\tau$ and $\tau_s$ denote the foreground threshold and the statistical foreground threshold, which are experimentally determined by most foreground detection schemes.


Ideally, the threshold should be a function of the spatial location $(x, y)$; for example, the threshold should be smaller for regions with low contrast. Sometimes this is advantageous for object detection in low-contrast scenes. Fuentes and Velastin [24] proposed one possible modification for threshold determination [23]:

$$\frac{|I_t(x, y) - B_t(x, y)|}{B_t(x, y)} > \tau_c \quad (4)$$

where $\tau_c$ denotes the contrast threshold. The contrast enhancement of bright images, such as an outdoor scene under heavy fog or a bright spot (e.g. a sun spot or another flash light source spot), is not possible with this technique. In the data validation step, the method reviews the candidate mask, eliminates those pixels that do not correspond to actual moving objects, and outputs the final foreground mask. Real-time processing is still feasible, as computationally-intensive vision algorithms are applied only to the small number of candidate foreground pixels [21, 23]. There are some alternate approaches to background subtraction, which represent the intensity variations of a pixel in an image sequence as discrete states corresponding to events in the environment. In some studies [25, 26], Hidden Markov Models (HMMs) are used to classify small blocks of an image as belonging to one of three different states: background, foreground and shadow. HMMs are successful for certain events that are hard to model correctly using unsupervised background modeling approaches and can be learned using training samples [21].
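As a concrete (and deliberately simple) illustration, the following Python/OpenCV sketch combines a recursive running-average background model with the thresholding of equation (2); the threshold tau and learning rate alpha are illustrative values, not settings from this study:

```python
import cv2
import numpy as np

def bs_demo(video_path, tau=30.0, alpha=0.05):
    """Basic background subtraction: |I_t - B_t| > tau (Eq. 2),
    with a running-average background model B_t (a recursive technique)."""
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    if not ok:
        return
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
    background = gray.copy()
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
        # Foreground mask per Eq. (2): pixels deviating from the model
        mask = (np.abs(gray - background) > tau).astype(np.uint8) * 255
        # Background maintenance: blend the new frame into the model
        background = (1 - alpha) * background + alpha * gray
        cv2.imshow("foreground mask", mask)
        if cv2.waitKey(30) & 0xFF == 27:  # Esc to quit
            break
    cap.release()
    cv2.destroyAllWindows()
```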

2.2. Mean-shift based Trackers
Mean-Shift (MS) is a clustering approach for image segmentation in the joint spatial-color space. The mean-shift segmentation approach is used to analyze a complex multi-modal feature space and to identify feature clusters. It is a non-parametric technique. Its Region of Interest (ROI) size and shape parameters are the only free parameters in the mean-shift process, i.e., the multivariate density kernel estimator. A two-step sequence of discontinuity-preserving filtering and mean-shift clustering is used for mean-shift image segmentation [27, 28]. The MS algorithm is initialized with a large number of hypothesized cluster centers randomly chosen from the data of the given image. The MS algorithm aims at finding the nearest stationary point of the underlying density function of the data; hence its utility in detecting the modes of the density. For this purpose, each cluster center is moved to the mean of the data lying inside the multidimensional ellipsoid centered on the cluster center [21]. Meanwhile, the algorithm builds up a vector defined by the old and the new cluster centers. This vector is called the MS vector, and it is computed iteratively until the cluster centers do not change their positions. Some clusters may get merged during the MS iterations [9, 21, 27].

MS-based image segmentation (or, in the sense of a video frame, spatial segmentation) is a straightforward extension of the discontinuity-preserving smoothing algorithm. After nearby modes are pruned, as in the generic feature space analysis technique [27], each pixel is associated with a significant mode of the joint domain density located in its neighborhood. In Comaniciu and Meer's study [27], the MS image segmentation algorithm is explained in four steps. Let $x_i$ and $z_i$, $i = 1, \ldots, n$, be the d-dimensional input and filtered image pixels in the joint spatial-range domain, and $L_i$ the label of the ith pixel in the segmented image.

Step 1. Run the mean-shift filtering procedure for the image and store all the information about the d-dimensional convergence point in $z_i$, i.e., $z_i = y_{i,c}$.
Step 2. Delineate in the joint domain the clusters $\{C_p\}_{p=1,\ldots,m}$ by grouping together all $z_i$ which are closer than $h_s$ in the spatial domain and $h_r$ in the range domain, i.e., concatenate the basins of attraction of the corresponding convergence points.
Step 3. For each $i = 1, \ldots, n$, assign $L_i = \{p \mid z_i \in C_p\}$.
Step 4. Optional: Eliminate spatial regions containing fewer than M pixels.
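For reference, the discontinuity-preserving filtering of Step 1 is available directly in OpenCV; a minimal sketch is shown below, where sp and sr play the role of the bandwidths $h_s$ and $h_r$, and the parameter values and file names are illustrative assumptions:

```python
import cv2

# Step 1 (mean-shift filtering) via OpenCV's built-in routine.
img = cv2.imread("frame.png")  # hypothetical input frame
filtered = cv2.pyrMeanShiftFiltering(img, sp=16, sr=32, maxLevel=1)
cv2.imwrite("filtered.png", filtered)
# Steps 2-4 (grouping filtered pixels into labeled clusters) would follow,
# e.g. via connected components over the filtered colors.
```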

The cluster delineation step can be refined according to a priori information and, thus, physics-based segmentation algorithms can be incorporated. In MS trackers, the Bhattacharyya coefficient or other similarity measures can be used to measure the similarity between the template (or model) region and the current target region. MS is a non-parametric method which has fast convergence speed and low computational cost and provides a general optimization solution independent of target features, but it cannot guarantee global optimality. The MS method is susceptible to falling into local maxima in case of clutter or occlusion [28]. CAMShift (CMS) is a tracking method which is a modified form of the MS method: the MS algorithm operates on color probability distributions (CPDs), and CMS modifies it to deal with dynamic changes of the CPDs. To track colored objects in video frame sequences, the color image data has to be represented as a probability distribution [7, 8]. To accomplish this, Bradski used color histograms in his study [7]. Color distributions derived from video image sequences change over time, so the MS algorithm has to be modified to adapt dynamically to the probability distribution it is tracking.


In Bradski's study, the new algorithm that meets all these requirements is called CMS. In a single image, the CMS process is iterated until convergence (or until an upper bound on the number of iterations is reached). If the tracked target's color does not change, the MS and CMS trackers are quite robust. However, they are easily distracted when similar colors appear in the background. Fig. 2 shows an example CAMShift tracker's work flow.

Figure 2. Example work flow of CAMShift tracker.
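A rough sketch of this work flow in OpenCV's Python binding is given below (the original demo is a C application; the manually chosen ROI, parameter values and key handling here are illustrative assumptions):

```python
import cv2
import numpy as np

def camshift_track(video_path, roi):
    """CAMShift sketch: the hue histogram of a manually selected ROI is
    back-projected onto each frame, and cv2.CamShift adapts the window."""
    x, y, w, h = roi
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    hsv_roi = cv2.cvtColor(frame[y:y+h, x:x+w], cv2.COLOR_BGR2HSV)
    # Hue histogram of the target region (e.g. a skin-tone color model)
    hist = cv2.calcHist([hsv_roi], [0], None, [180], [0, 180])
    cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
    term = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
    window = (x, y, w, h)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        backproj = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
        # CamShift shifts and resizes the search window on the back-projection
        track_box, window = cv2.CamShift(backproj, window, term)
        pts = cv2.boxPoints(track_box).astype(np.int32)
        cv2.polylines(frame, [pts], True, (0, 0, 255), 2)
        cv2.imshow("CAMShift", frame)
        if cv2.waitKey(30) & 0xFF == 27:
            break
    cap.release()
    cv2.destroyAllWindows()
```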

The Coupled CMS algorithm as described in Bradski's study [7] (and as reviewed in François's study [8]) is demonstrated in a real-time head tracking application, which is part of the Intel OpenCV Computer Vision library [29]. Some implementations of this algorithm are related to face tracking and are used in the Perceptual User Interfaces (PUIs) area. In this particular case, head movements observed in a face-shot video stream are used to control various interactive programs, as described in François's study. The head's color model is actually a skin color tone model in the HSI (or HSV) color space, initialized by sampling an area specified manually in one image (i.e. it works with the selected region's features). HSI or HSV stands for Hue-Saturation-Intensity (or Value), a color space which attempts to describe perceptual color relationships more accurately than the RGB (Red-Green-Blue) color space, and its computational complexity is low. Best results are obtained for skin area tracking when using the Hue component of the color images and the histogram. In addition, the CMS tracker tracks a specified object which is defined or detected by the system.

2.3. Optical Flow based Trackers
An optical flow (OF) method can be used to track a region defined by a primitive shape, whose translation is computed by use of the OF [21]. The OF is a differential method. In Porikli's study [30], it is stated that OF is based on the idea that the brightness is continuous for most of the points in the image, and neighboring points have approximately the same brightness. In the OF method, the given scene contains continuous objects over which brightness varies smoothly. In successive frames, pixels belonging to the same objects have the same brightness, similar to the conservation of mass law in fluid dynamics. The continuity equation for the optical term, omitting the second-order terms, is given below [30]:

$$\frac{\partial g}{\partial t} + \nabla g \cdot u = 0 \quad (5)$$

where g is the brightness function and u is the velocity vector. In the one-dimensional case, the above equation takes the simple form

$$\frac{\partial g}{\partial t} + u_x \frac{\partial g}{\partial x} = 0 \quad (6)$$

from which one can directly determine the one-dimensional velocity [30],

$$u_x = -\frac{\partial g}{\partial t} \Big/ \frac{\partial g}{\partial x} \quad (7)$$

provided that the spatial derivative does not vanish, i.e. the brightness is continuous. According to Yilmaz et al. [21], the OF methods are used for generating dense flow fields by computing the flow vector of each pixel under the brightness constancy constraint $I(x, y, t) - I(x + dx, y + dy, t + dt) = 0$ [31]. This computation is always carried out in the neighborhood of the pixel, either algebraically [32] or geometrically [33]. Extending OF methods to compute the translation of a rectangular region is trivial. Shi and Tomasi [34] proposed the Kanade-Lucas-Tomasi (KLT) tracker, which iteratively computes the translation $(du, dv)$ of a region (e.g. a 25 x 25 patch) centered on an interest point:

$$\begin{pmatrix} \sum I_x^2 & \sum I_x I_y \\ \sum I_x I_y & \sum I_y^2 \end{pmatrix} \begin{pmatrix} du \\ dv \end{pmatrix} = -\begin{pmatrix} \sum I_x I_t \\ \sum I_y I_t \end{pmatrix} \quad (8)$$

Equation (8) is similar in construction to the OF method proposed by Lucas and Kanade [32].


Once the new location of the interest point is obtained, the KLT tracker evaluates the quality of the tracked patch [21]. This tracker's algorithm detects interest points in the first given frame and propagates them to succeeding frames based on local appearance matching. Propagated unreliable points are discarded and replaced with new points, and the output of the algorithm is a set of reliable points (e.g. features of tracked points). There are two main options for OF: sparse OF and dense OF. The most popular sparse OF tracking technique is the Lucas-Kanade OF, and this technique is referred to as CVLK in our study. Another popular OF technique is a dense one: the Horn-Schunck OF technique.
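For illustration, the following is a minimal sketch of a CVLK-style sparse tracker using OpenCV's pyramidal Lucas-Kanade implementation (a modern Python binding is assumed, and all parameter values and file handling are illustrative rather than the settings used in this study):

```python
import cv2

def cvlk_track(video_path):
    """Sparse Lucas-Kanade OF: detect interest points in the first frame,
    then propagate them frame to frame, discarding unreliable points."""
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    prev = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Shi-Tomasi "good features to track" [34] as interest points
    pts = cv2.goodFeaturesToTrack(prev, maxCorners=200,
                                  qualityLevel=0.01, minDistance=7)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Pyramidal LK solves Eq. (8) in a coarse-to-fine manner
        nxt, status, err = cv2.calcOpticalFlowPyrLK(prev, gray, pts, None,
                                                    winSize=(25, 25))
        pts = nxt[status.ravel() == 1].reshape(-1, 1, 2)
        for p in pts:
            x, y = p.ravel()
            cv2.circle(frame, (int(x), int(y)), 3, (0, 255, 0), -1)
        cv2.imshow("CVLK", frame)
        if cv2.waitKey(30) & 0xFF == 27:
            break
        prev = gray
        if len(pts) < 50:  # replace discarded points with fresh ones
            pts = cv2.goodFeaturesToTrack(gray, maxCorners=200,
                                          qualityLevel=0.01, minDistance=7)
    cap.release()
    cv2.destroyAllWindows()
```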

2.4. Active Contour Model
Another image segmentation approach is the active contour model (ACM), which falls within the scope of edge-based segmentation and is used in the object tracking process as well. A snake is an energy-minimizing spline and a kind of active contour model. The snake's energy is based on its shape and location within the image. Desired image properties are usually relevant to the local minima of this energy. The ACM, suggested by Kass et al. [12] in 1987, is also known as the Snake method. A snake can be considered as a group of control points (snaxels) connected to each other, which can easily be deformed under an applied force. According to the study of Dagher and Tom [13], a snake works best when the control points are at adequate distances from each other and the coordinates of the initial position are controlled. The total energy of a snake is defined as in [13] in equation (9):

$$E_{Snake} = \frac{1}{2} \int_S \left[ \alpha \, |v_s(s)|^2 + \beta \, |v_{ss}(s)|^2 + \gamma \, E_{External}(v(s)) \right] ds \quad (9)$$

where

$$v_s = \frac{dv(s)}{ds} \quad (10)$$

and

$$v_{ss} = \frac{d^2 v(s)}{ds^2} \quad (11)$$

The existence of a spline on which the energy E is constant is a problem to which the Euler-Lagrange differential equation can be applied. Accordingly, it turns into:

$$\alpha \, v_{ss}(s) - \beta \, v_{ssss}(s) - \nabla E_{External} = 0 \quad (12)$$

The third term in equation (12) has been normalized by γ, the external energy weight. By solving equation (12), the final contour, which provides the minimization of $E_{Snake}$, is obtained. It is also possible to interpret equation (12) as an equation of force balance [13]. According to this:

$$F_{Internal} + F_{External} = 0 \quad (13)$$

where

$$F_{Internal} = \alpha \, v_{ss}(s) - \beta \, v_{ssss}(s) \quad (14)$$

and

$$F_{External} = -\nabla E_{External} \quad (15)$$

The ssss notation denotes the fourth-order partial derivative. Using the final solution, gradient forces $F_x$ and $F_y$ are applied on the point (x, y) in the x and y directions, respectively [12, 13]. There are some snake-based methods and implementations in video object segmentation and tracking. In Gouet-Brunet and Lameyre's study [35], a snake model's control points (snaxels) are taken as global descriptors describing the global shape of objects (e.g. interest points of objects of interest).
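To make the force-balance formulation concrete, the following NumPy sketch implements the classic semi-implicit discretization of equation (12) for a closed snake; here gamma plays the role of a time-step weight (an implementation detail, not the external-energy weight of equation (9)), and the external force fields are assumed to be precomputed by the caller, e.g. from the gradient of an edge map:

```python
import numpy as np

def snake_system_inverse(n, alpha, beta, gamma):
    """(A + gamma*I)^-1 for a closed snake of n snaxels (n >= 5), where A
    is the circulant pentadiagonal matrix discretizing the internal terms
    alpha*v_ss - beta*v_ssss of equation (12) with finite differences."""
    row = np.zeros(n)
    row[0] = 2 * alpha + 6 * beta
    row[1] = row[-1] = -(alpha + 4 * beta)
    row[2] = row[-2] = beta
    A = np.stack([np.roll(row, i) for i in range(n)])
    return np.linalg.inv(A + gamma * np.eye(n))

def evolve_snake(x, y, force_x, force_y, alpha=0.1, beta=0.5,
                 gamma=1.0, iters=200):
    """Semi-implicit iterations x_t = (A + gamma*I)^-1 (gamma*x_{t-1} + F),
    a discrete form of the force balance in equations (13)-(15).
    force_x/force_y are caller-supplied callables sampling the external
    force field at the current snaxel coordinates."""
    inv = snake_system_inverse(len(x), alpha, beta, gamma)
    for _ in range(iters):
        fx = force_x(x, y)
        fy = force_y(x, y)
        x = inv @ (gamma * x + fx)
        y = inv @ (gamma * y + fy)
    return x, y
```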

3. COMPARISON AND EVALUATION
In several surveys and studies [36-38], the issue of evaluating the performance of video surveillance systems (a system's total performance) has become more and more important [37]. In this paper, we review a number of well-known methods and their algorithms, together with appropriate ways of evaluating the performance of moving object D&T. Performance evaluation has two types: evaluation with ground truth (GT) or without ground truth. Without-ground-truth evaluation is based on color and motion features or on a statistically defined model. Conventional object D&T metrics for evaluation with ground truth fall into two major types: frame-based and object-based D&T metrics [37]. Each type has certain advantages and disadvantages. Frame-based D&T metrics focus on a pixel-position-based comparison between the GT image (or GT data) and the frame containing the moving object (or objects). They are used to measure the performance of the whole system, but they do not take the object's identity into account over its lifespan. Object-based D&T metrics, in contrast, focus on the spatio-temporal coherency between the GT object's data and the object (or objects) moving within the given frame. In these metrics, the object's identity and trajectory over its lifespan is an important issue. The object D&T process and the generation of the GT image (or GT data) can be both semiautomatic [37, 38] and manually directed. These manual generations (GT video images or data) can be used on real-life or (pseudo-)synthetic video. In order to test an algorithm of the object D&T methods, we need a standard format (mostly in eXtensible Markup Language, XML) of GT data for videos. In the literature, the D&T results of some performance evaluation systems were published in their respective portals, websites and papers. This gives us an opportunity to compare some equivalent and different systems or methods.

3.1. Video Databases and Evaluation Projects
The kind of GT used in a comparison is important for the interpretation of evaluation results. The GT generation process contains some ambiguities, such as accounting only for individual objects or for groups of objects, or whether to look at exact shapes or bounding boxes of the objects. In the literature, commonly-used GT generation (e.g. annotation) tools are available, such as Context Aware Vision using Image-based Active Recognition (CAVIAR) [39], Viewpoint Invariant Pedestrian Recognition (VIPeR) [40], etc. Some of these tools have their own video datasets as well. CAVIAR addresses indoor and/or city center surveillance and retail applications (e.g. people tracking). There are also large frameworks like Performance Evaluation of Tracking and Surveillance (PETS) [41] and Video Analysis and Content Extraction (VACE) [42]. The VACE program has progressed through several phases; during 2006-2009 it was in Phase III, administered by the US Government. According to Kasturi et al. [43], during VACE Phases I and II, the program made significant progress in the automated detection and tracking of moving objects such as faces, hands, people, and vehicles, as well as text, in four primary video domains: broadcast news, meetings, surveillance, and Unmanned Aerial Vehicle (UAV) motion imagery. The PETS program began in March 2000 as a workshop series launched with the goal of evaluating visual tracking and the performance of surveillance algorithms [43].

3.2. Performance Evaluation Metrics
The aforementioned performance evaluation projects for video surveillance systems provide benchmark sets of annotated video sequences. For that reason, the results of the measurement and evaluation process still depend heavily on the chosen benchmark dataset. These benchmark datasets involve several different scenarios. For each of these scenarios (for example, indoor or outdoor scenarios), object D&T algorithms have to be evaluated separately. There are several metrics in these performance evaluations. Thirde et al. [44] stated that the quality of motion segmentation can be described in principle by two characteristics: the spatial deviation from the reference segmentation (e.g. GT data), and the fluctuation of the spatial deviation over time. This kind of evaluation methodology is based on object-level evaluation [44, 45]. Fig. 3 shows a typical performance evaluation system.

Figure 3. A typical performance evaluation system.

In the literature, there are two types of metrics: frame-based metrics and object-based metrics. Frame-based metrics are computed for every frame in the sequence. There are many definitions for frame- and/or object-based metrics. Common definitions are as follows: True Negative (TN) is the number of frames where both GT and algorithm results (AR) agree on the absence of any object. True Positive (TP) is the number of frames where both GT and AR agree on the presence of one or more objects, and the bounding box of at least one object coincides between GT and AR. False Negative (FN) is the number of frames where GT contains at least one object, while AR either does not contain any object or none of the AR's objects fall within the bounding box of any GT object. False Positive (FP) is the number of frames where AR contains at least one object, while GT either does not contain any object or none of the GT's objects fall within the bounding box of any AR object [36]. For details of the aforementioned and other performance metrics, the reader may refer to the study of Baumann et al. [36] and the study of Lazarevic-McManus et al. [45].
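For concreteness, a small sketch of how these frame-based counts might be accumulated is given below; the function and parameter names are ours, not from the paper, and frames where both GT and AR contain objects whose boxes never coincide are counted here as FN, one way of resolving an ambiguity the definitions leave open:

```python
def frame_based_counts(gt_frames, ar_frames, overlaps):
    """Count frame-level TP/TN/FP/FN as defined above.
    gt_frames, ar_frames: per-frame lists of object bounding boxes;
    overlaps(gt_boxes, ar_boxes) -> True if any GT/AR box pair coincides."""
    tp = tn = fp = fn = 0
    for gt, ar in zip(gt_frames, ar_frames):
        if not gt and not ar:
            tn += 1            # both agree: no object present
        elif gt and ar and overlaps(gt, ar):
            tp += 1            # both agree and boxes coincide
        elif gt:
            fn += 1            # GT object missed (or not matched) by AR
        else:
            fp += 1            # AR object with no GT support
    return tp, tn, fp, fn
```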


3.3. Performance Comparison of Well-known Methods
In our study, we show quantitative performance comparisons for the well-known methods (e.g. the BS, CMS, OF and ACM methods) via Precision vs. Recall statistics, which are based on the TP, FP, FN and TN metrics. In the following subsections of Section 3, the pixel-based Precision (PP) and pixel-based Recall (PR) performance measures are given in detail. Fig. 4 shows the main concept of Precision vs. Recall statistics. In this figure, the cyan colored bounding box represents the GT object, and the red colored bounding box represents the AR object.

Figure 4. Precision vs. Recall main concept.

3.3.1. Pixel-based Precision
Let the ratio of detected areas in GT to the total detection be the Pixel-based Precision (PP). The PP for frame t is defined as:

$$PP(t) = \frac{|TP|}{|TP| + |FP|} \quad (16)$$

where the | · | operator represents the number of pixels in the relevant area.

3.3.2. Pixel-based Recall
The Pixel-based Recall (PR) measure quantifies how well the algorithm minimizes the FNs. The PR for frame t is defined as:

$$PR(t) = \frac{|TP|}{|TP| + |FN|} \quad (17)$$

where the | · | operator represents the number of pixels in the relevant area. The results of these comparisons are given in Section 4 with their details.
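A per-frame computation of these two measures from binary foreground masks could look as follows (a sketch with hypothetical names; gt_mask and ar_mask are boolean arrays of the same shape):

```python
import numpy as np

def pp_pr(gt_mask, ar_mask):
    """Pixel-based Precision and Recall for one frame, per Eqs. (16)-(17).
    gt_mask, ar_mask: boolean arrays marking GT and AR foreground pixels."""
    tp = np.logical_and(gt_mask, ar_mask).sum()    # pixels detected in both
    fp = np.logical_and(~gt_mask, ar_mask).sum()   # AR pixels outside GT
    fn = np.logical_and(gt_mask, ~ar_mask).sum()   # GT pixels missed by AR
    pp = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    pr = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return pp, pr
```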

4. RESULTS
Some applications of the well-known methods and algorithms are given in the literature. The OpenCV Computer Vision library [29] includes implementations of some of these methods. We implemented these algorithms via the OpenCV library infrastructure. For each method, some sample screenshots were taken from our test implementations in action, using a fixed (i.e. stationary) web camera working in real-time. These screenshots are shown in the following figures. These implementations are built on the MS-DOS platform of Microsoft Corporation. Fig. 5 shows the BS method's output. Fig. 6 shows the BS method's foreground mask. Fig. 7 shows the CAMShift tracker's output. Fig. 8 shows the CAMShift tracker's back-projection image. Fig. 9 shows the OF method's output with the object D&T process output items, such as corner points and their cluster's minimum rectangular area. Fig. 10 shows the OF method's output with only the corner points and their cluster's minimum rectangular area (i.e. the essential bounding box). Fig. 11 shows the output of the ACM (i.e. Snake) method. In these figures, the green numbers in the middle of the screen are AR object identification (ID) numbers, which are assigned by the relevant algorithm to the object in the D&T process.

Figure 5. BS method’s output.

Figure 6. BS method’s foreground mask output.


Figure 7. CAMShift tracker’s output.

Figure 8. CAMShift tracker’s back-projection image.

Figure 9. OF CVLK method’s output.

Figure 10. OF CVLK method's output with only corner points and the minimum-area rectangle.

Figure 11. ACM (i.e. Snake) method’s output.

In this paper, all of our comparisons are made on videoclips from the CAVIAR database [39]. We have chosen three of them for testing, namely the "Browse4", "Browse2" and "Fight_Runaway1" videoclips. These video sequences were captured from an inclined, downward-looking camera with a wide angle. The performance results of these methods are given as plots, produced using the MATLAB software of The MathWorks. Figs. 12, 13 and 14 show comparisons of all the aforementioned methods together in precision-recall space, one plot per videoclip.

Figure 12. Performance comparison of well-known methods for the Browse4 videoclip.

Figure 13. Performance comparison of well-known methods for the Browse2 videoclip.


Figure 14. Performance comparison of well-known methods for the Fight_Runaway1 videoclip.

In our tests, we see that the ACM (i.e. Snake) method has successive zero values for the PP or PR measures throughout the video sequence in "Browse4" and the other selected CAVIAR videoclips. This is because the bounding box size(s) of the GT object(s) are sometimes much smaller than the Snake algorithm's AR bounding box size(s); for the ACM method, the GT and AR bounding boxes do not match each other accurately throughout the given video sequence. The ACM (i.e. Snake) method is successful when a detected and tracked object is close to the camera (e.g. a web camera face D&T example) and its bounding box is big enough, but it is unsuccessful when a detected and tracked object is very small or far away from the camera. In the "Browse4" video, the other methods generally do not exhibit this kind of failure or misdetection, but their performance is fairly low. In the "Browse2" and "Fight_Runaway1" videoclips, the performance results of the BS and CMS methods are generally better than those of the OF and ACM methods, respectively.

5. CONCLUSION
In this paper, we reviewed a number of commonly-implemented object D&T algorithms. The novelty of this study relative to other reviews [21] is that we compared and evaluated all of the above-mentioned D&T methods, presenting the performance results in plots for the reader's reference. In the literature, common evaluations are performed on large datasets, which usually include real, synthetic and semi-synthetic video sequences. As a result of our comparisons, no method outperforms the others in every video category. More research, however, is needed to improve robustness against the effects of the environment, such as noise, illumination changes, occlusions, etc. The variety of metrics and datasets allows us to reason about the weaknesses of particular algorithms against specific challenges.

The aim of this paper is to provide a better understanding of the performance of video surveillance systems in the literature via published measures and computational and environmental details. These details have a large impact on the performance and evaluation results of the given algorithms. Therefore, comparing published algorithms requires understanding the details of the performance metrics and evaluation methodologies.

6. REFERENCES
[1] Dey, A. K., "Understanding and using context", Journal of Personal and Ubiquitous Computing, Vol. 5, No. 1, pp. 4-7, 2001.
[2] Brdiczka, O., Yuen, P. C., Zaidenberg, S., Reignier, P., and Crowley, J. L., "Automatic acquisition of context models and its application to video surveillance", In Proceedings of the 18th International Conference on Pattern Recognition (ICPR'06), pp. 1175-1178, Hong Kong, August 2006.
[3] Sánchez, A. M., Patricio, M. A., Garcia, J., and Molina, J. M., "A Context Model and Reasoning System to improve object tracking in complex scenarios", Expert Systems with Applications, doi: 10.1016/j.eswa.2009.02.096, 2009.
[4] Bennett, B., Magee, D. R., Cohn, A. G., and Hogg, D. C., "Enhanced tracking and recognition of moving objects by reasoning about spatio-temporal continuity", Image and Vision Computing, Cognitive Vision Special Issue, Vol. 26, No. 1, pp. 67-81, 2008.
[5] Carmona, E. J., Martínez-Cantos, J., and Mira, J., "A new video segmentation method of moving objects based on blob-level knowledge", Pattern Recognition Letters, Vol. 29, No. 3, pp. 272-285, 2008.
[6] Kim, J. B., Kim, H. J., "Efficient region-based motion segmentation for a video monitoring system", Pattern Recognition Letters, Vol. 24, No. 1-3, pp. 113-128, 2003.
[7] Bradski, G., "Computer Vision Face Tracking For Use in a Perceptual User Interface", Intel Technology Journal, http://developer.intel.com/technology/itj/q21998/articles/art_2.htm, Q2 1998.
[8] François, A. R. J., "CAMSHIFT Tracker Design Experiments with Intel OpenCV and SAI", IRIS Technical Report IRIS-04-423, University of Southern California, Los Angeles, 2004.
[9] Comaniciu, D., Meer, P., "Mean Shift Analysis and Applications", IEEE International Conference on Computer Vision (ICCV'99), Kerkyra, Greece, pp. 1197-1203, 1999.


[10] Jodoin, P. M., Mignotte, M., "Optical-flow based on an edge-avoidance procedure", Computer Vision and Image Understanding, Vol. 113, No. 4, pp. 511-531, 2009.
[11] Pauwels, K., Van Hulle, M. M., "Optic flow from unstable sequences through local velocity constancy maximization", Image and Vision Computing (The 17th British Machine Vision Conference, BMVC 2006), Vol. 27, No. 5, pp. 579-587, 2009.
[12] Kass, M., Witkin, A., and Terzopoulos, D., "Snakes: active contour models", International Journal of Computer Vision, Vol. 1, No. 4, pp. 321-331, 1988.
[13] Dagher, I., Tom, K. E., "WaterBalloons: A hybrid watershed Balloon Snake segmentation", Image and Vision Computing, Vol. 26, pp. 905-912, doi: 10.1016/j.imavis.2007.10.010, 2008.
[14] Nixon, M. S., Aguado, A. S., "Feature Extraction and Image Processing", Elsevier Science Ltd, ISBN: 0750650788, 2002.
[15] Sankaranarayanan, A. C., Veeraraghavan, A., Chellappa, R., "Object Detection, Tracking and Recognition for Multiple Smart Cameras", Proceedings of the IEEE, Vol. 96, No. 10, pp. 1606-1624, ISSN: 0018-9219, 2008.
[16] Loza, A., Mihaylova, L., Canagarajah, N., and Bull, D., "Structural Similarity-Based Object Tracking in Video Sequences", In: The 9th International Conference on Information Fusion, Florence, Italy, 10-13 July 2006.
[17] Aherne, F., Thacker, N., and Rockett, P., "The Bhattacharyya metric as an absolute similarity measure for frequency coded data", Kybernetika, Vol. 32, No. 4, pp. 1-7, 1997.
[18] Russell, S. J., Norvig, P., "Artificial Intelligence: A Modern Approach, Second Edition", Pearson Education, Inc., Upper Saddle River, New Jersey, ISBN: 0-13-790395-2, 2003.
[19] Bowyer, K. W., Phillips, P. J., "Empirical evaluation techniques in computer vision", Wiley-IEEE Computer Society Press, 1998.
[20] Camara-Chavez, G., Precioso, F., Cord, M., Phillip-Foliguet, S., and de A. Araujo, A., "An interactive video content-based retrieval system", In 15th International Conference on Systems, Signals and Image Processing (IWSSIP 2008), 25-28 June 2008, pp. 133-136, 2008.
[21] Yilmaz, A., Javed, O., Shah, M., "Object tracking: A survey", ACM Computing Surveys, Vol. 38, No. 4, Article No. 13, 45 pages, doi: 10.1145/1177352.1177355, Dec. 2006.
[22] Benezeth, Y., Jodoin, P. M., Emile, B., Laurent, H., and Rosenberger, C., "Review and evaluation of commonly-implemented background subtraction algorithms", In 19th International Conference on Pattern Recognition (ICPR 2008), 8-11 Dec. 2008, pp. 1-4, 2008.
[23] Cheung, S.-C., Kamath, C., "Robust techniques for background subtraction in urban traffic video", Video Communications and Image Processing, SPIE Electronic Imaging, San Jose, UCRL-JC-153846-ABS, UCRL-CONF-200706, January 2004.
[24] Fuentes, L., Velastin, S., "From tracking to advanced surveillance", In Proceedings of the IEEE International Conference on Image Processing, Barcelona, Spain, 2003.
[25] Rittscher, J., Kato, J., Joga, S., and Blake, A., "A probabilistic background model for tracking", In European Conference on Computer Vision (ECCV), Vol. 2, pp. 336-350, 2000.
[26] Stenger, B., Ramesh, V., Paragios, N., Coetzee, F., and Buhmann, J., "Topology free hidden Markov models: Application to background modeling", In IEEE International Conference on Computer Vision (ICCV), pp. 294-301, 2001.
[27] Comaniciu, D., Meer, P., "Mean shift: A robust approach toward feature space analysis", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 5, pp. 603-619, 2002.
[28] Shan, C., Tan, T., Wei, Y., "Real-time hand tracking using a mean shift embedded particle filter", Pattern Recognition, Vol. 40, No. 7, pp. 1958-1970, 2007.
[29] Bradski, G., Kaehler, A., "Learning OpenCV: Computer Vision with the OpenCV Library", O'Reilly Media, Inc., Sebastopol, CA, 2008.
[30] Porikli, F., "Automatic Video Object Segmentation", Ph.D. Thesis, Electrical and Computer Engineering, Polytechnic University, Brooklyn, NY, 2002.
[31] Horn, B., Schunck, B., "Determining optical flow", Artificial Intelligence, Vol. 17, pp. 185-203, 1981.
[32] Lucas, B., Kanade, T., "An iterative image registration technique with an application to stereo vision", In Proceedings of the Image Understanding Workshop, pp. 121-130 (In International Joint Conference on Artificial Intelligence), 1981.


[33] Schunck, B., "The image flow constraint equation", Computer Vision, Graphics, and Image Processing, Vol. 35, pp. 20-46, 1986.
[34] Shi, J., Tomasi, C., "Good features to track", In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 593-600, 1994.
[35] Gouet-Brunet, V., Lameyre, B., "Object recognition and segmentation in videos by connecting heterogeneous visual features", Computer Vision and Image Understanding, Vol. 111, No. 1, pp. 86-109, Special Issue on Intelligent Visual Surveillance (IEEE), 2008.
[36] Baumann, A., Boltz, M., Ebling, J., et al., "A Review and Comparison of Measures for Automatic Video Surveillance Systems", EURASIP Journal on Image and Video Processing, Vol. 2008, Article ID 824726, 30 pages, doi: 10.1155/2008/824726, 2008.
[37] Bashir, F., Porikli, F., "Performance Evaluation of Object Detection and Tracking Systems", In Proceedings of the 9th IEEE International Workshop on PETS, pp. 7-14, New York, June 18, 2006.
[38] Black, J., Ellis, T., Rosin, P., "A novel method for video tracking performance evaluation", In Proceedings of the IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance (VS-PETS 03), pp. 125-132, Nice, France, October 2003.
[39] CAVIAR, "Context aware vision using image-based active recognition", URL: http://homepages.inf.ed.ac.uk/rbf/CAVIAR, 2009.
[40] VIPeR, "Viewpoint Invariant Pedestrian Recognition", URL: http://vision.soe.ucsc.edu/node/178, 2009.
[41] IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (PETS), URL: http://pets2007.net, 2009.
[42] VACE, "Video analysis and content extraction", URL: http://www.perceptual-vision.com/vt4ns/vace_brochure.pdf, 2009.
[43] Kasturi, R., Goldgof, D., Soundararajan, P., Manohar, V., Garofolo, J., Bowers, R., Boonstra, M., Korzhova, V., and Zhang, J., "Framework for Performance Evaluation of Face, Text, and Vehicle Detection and Tracking in Video: Data, Metrics, and Protocol", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 31, No. 2, pp. 319-336, 2009.

[44] Thirde, D., Borg, M., Aguilera, J., Wildenauer, H., Ferryman, J., and Kampel, M., "Robust Real-Time Tracking for Visual Surveillance", EURASIP Journal on Advances in Signal Processing, Vol. 2007, Article ID 96568, 23 pages, doi: 10.1155/2007/96568, 2007.
[45] Lazarevic-McManus, N., Renno, J. R., Makris, D., and Jones, G. A., "An object-based comparative methodology for motion detection based on the F-Measure", Computer Vision and Image Understanding, Vol. 111, No. 1, pp. 74-85, Special Issue on Intelligent Visual Surveillance (IEEE), 2008.

VITAE
Res. Assist. Bahadır KARASULU graduated from the Physics Department of the Science and Arts Faculty at Kocaeli University, Kocaeli, Turkey, in 2003. In 2006, he completed an M.Sc. thesis titled 'Application of Parallel Computing Technique to Monte Carlo Simulation' at the Natural Sciences Institute of Maltepe University, Istanbul, Turkey. He continues his Ph.D. study, as well as a research assistantship, in the Computer Engineering Department of the Natural Sciences Institute of Ege University, Izmir, Turkey. His research interests are artificial intelligence, artificial neural networks, simulation and parallel computing, computer vision, pattern recognition and optimization.