
Abstract

The mean shift tracking algorithm has become a standard in the field of visual object tracking, owing to its real-time capability and its robustness to object changes in pose, size, or illumination. The standard mean shift tracking approach is an iterative procedure based on kernel-weighted color histograms for object modelling and the Bhattacharyya coefficient as a similarity measure between the target and candidate histogram models. These benefits could not be transferred to monochrome vision systems yet, because the loss of information from color to grey-scale histogram object models is too high and the system performance drops seriously. We propose a new framework that solves this problem by using histograms of HoG-features as the object model and the SOAMST approach by Ning et al. for track estimation. Mean shift tracking requires a histogram for object modelling. In the proposed framework, a set of high-dimensional HoG-features is clustered via K-means, and features inside the object area are matched to the cluster centers via a nearest neighbor search. This procedure is comparable to a Bag of Words algorithm. The proposed system is evaluated for advanced driver assistance systems, and it is shown that the framework can be used as a reliable visual tracking system for a pedestrian recognition module.

Introduction

Many modern advanced driver assistance systems rely on low-cost monochrome vision systems for analyzing the vehicle surroundings. Especially for pedestrian warning or automatic braking systems, vision sensors deliver highly required information that makes these additional functionalities possible. Therefore a pedestrian object needs to be recognized and followed over several consecutive image frames. This so-called tracking is necessary for the calculation of several attributes of

the object, like distance, velocity and acceleration. Tracking in general is the calculation and estimation of an object path. Visual tracking algorithms determine the motion of an object between two video frames and thereby associate detections in the two frames with the same object.

In the field of computer vision, visual tracking still is an important and relevant research area [1]. Different methods like KLT [2], Mean Shift (MS) [3] or Template Matching [4] belong to this algorithm class and have become standard algorithms. The main challenges for visual tracking algorithms are real-time capability, accuracy and robustness to changes in the object appearance, due to motion or illumination. Especially the mean shift algorithm, also commonly known as kernel based tracking, fulfils these criteria adequately.

MS was first introduced by Fukunaga [5] for data analysis in the field of pattern recognition, as a gradient ascent local mode search algorithm for probability density distributions. Comaniciu et al. developed the MS algorithm into a kernel based visual object tracking method in 1999 [6], [7], using color histograms of image regions for target and candidate modelling and the Bhattacharyya coefficient, a similarity measure for probability density distributions. This approach is limited to spatial tracking, and the object scale adaptation is estimated by a brute force search with different kernel bandwidths. Later, different approaches were introduced to handle the scale estimation problem, like mean shift based blob tracking [8] or multi kernel trackers [9], which divide the target into several blocks, track each block separately and have the advantage of additional spatial information from the separated object blocks.

In the field of face tracking, CAMSHIFT [10], [11] and EM-shift [12] have been introduced since 1998, to estimate the scale and orientation changes adaptively, by calculating moment

Adaptation of the Mean Shift Tracking Algorithm to Monochrome Vision Systems for Pedestrian Tracking Based on HoG-Features

2014-01-0170

Published 04/01/2014

Daniel Schugk and Anton Kummert
Univ. of Wuppertal

Christian Nunn
Delphi Automotive

CITATION: Schugk, D., Kummert, A., and Nunn, C., "Adaptation of the Mean Shift Tracking Algorithm to Monochrome Vision Systems for Pedestrian Tracking Based on HoG-Features," SAE Technical Paper 2014-01-0170, 2014, doi:10.4271/2014-01-0170.

Copyright © 2014 SAE International


functions on weight values, which are estimated during the MS-procedure. Ning et al. have transferred this calculation to SOAMST [13] as a pedestrian tracking module.

The performance of all MS algorithms strongly depends on the modelling of the target and its separability against the background. Different approaches try to handle these problems, like CBWH-MS by Ning et al. [14], who corrected the BWH-MS approach of Comaniciu et al. [3] by reducing the influence of prominent background features in the target model calculation.

Object modelling is commonly done with color histograms, so they have become a standard in combination with MS tracking algorithms and have achieved reliable and robust tracking results. For grey-scale image sensors, as used in state-of-the-art driver assistance systems, object modelling based on pixel intensity histograms is not discriminant enough to reach a comparable performance [9].

Many different pedestrian detection and classification systems are already based on HoG-features (Histograms of Oriented Gradients), a standard feature in this research area [15], [16], [17]. Introducing HoG-features for object modelling inside the MS algorithms requires a method to build up object model histograms. We propose to use a Bag of Words (BoW) approach that clusters similar HoG-features and summarizes the object features inside a histogram based on a nearest neighbor matching to the cluster centers. The BoW-histogram of HoG-features fits perfectly to the requirements of the MS algorithms.

The proposed algorithm adapts state-of-the-art MS algorithms, described in the following part, to a monochrome image sensor system for pedestrian tracking. As in SOAMST [13], the target is modelled with a weighted kernel function, but prominent background features are additionally suppressed as in CBWH [14]. In the next part, the introduction of HoG-features to the modelling process inside MS is described. The system is evaluated on selected video sequences of a driver assistance system in the Experimental Results part. The paper closes with conclusions and a description of future work.

Mean Shift Tracking

MS tracking is commonly used for color vision systems. The basic algorithm is illustrated in the block diagram in Figure 1. At its first appearance the object is detected with a detector module and defined by a rectangular bounding box. Based on the color information of the n features f inside the object area, the target object is modelled as a weighted histogram. The weight of a feature is obtained by a kernel profile function k and decreases with the distance to the object center.

Figure 1. Generalized block diagram of the standard mean shift tracking algorithm. The MS tracking algorithm requires a histogram object model. By default, color intensities are used as input for the histogram modelling, but more powerful features can also be used. The feature extraction block can also be combined with the histogram model blocks to reduce the number of features that are calculated.

The object model histogram is equal to the probability distribution of the object features weighted by a kernel function. The target object model q is defined as

q_u = c \sum_{i=1}^{n} k( \| (x_i - x)/r \|^2 ) \, \delta[ b(f_i) - u ]   (1)

with \sum_{u=1}^{m} q_u = 1,

wherein q_u is the kernel-weighted probability of the u-th feature bin of the m-sized histogram, x is the object center, x_i is the position of the feature f_i in pixel coordinates and r is the kernel bandwidth describing the object size. The function b matches the feature f_i to a histogram bin u, δ denotes the Kronecker delta function, and c is a normalization constant defined as c = 1 / \sum_{i=1}^{n} k( \| (x_i - x)/r \|^2 ).
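For illustration, the target model of eq. (1) can be sketched in a few lines of Python. The Epanechnikov profile k(t) = 1 − t for t ≤ 1 and a uniform mapping b from grey values to m bins are assumptions of this example, not requirements of the algorithm.

```python
import numpy as np

def target_model(image, center, r, m=16):
    """Kernel-weighted intensity histogram q of eq. (1) (sketch).

    Assumes the Epanechnikov profile k(t) = 1 - t for t <= 1 and a
    uniform bin mapping b: intensity -> one of m histogram bins.
    """
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # squared distance of every pixel to the object center, scaled by r
    d2 = ((xs - center[0]) ** 2 + (ys - center[1]) ** 2) / float(r) ** 2
    k = np.clip(1.0 - d2, 0.0, None)        # kernel profile weights
    bins = (image.astype(int) * m) // 256   # b(f_i): intensity -> bin u
    q = np.bincount(bins.ravel(), weights=k.ravel(), minlength=m)
    return q / q.sum()                      # normalization constant c
```

Pixels outside the kernel support receive zero weight, so the final division realizes the normalization constant c.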

The candidate model is built up the same way, at candidate position y with the candidate-size-correlated bandwidth r_y and the features f_i at positions x_i inside the candidate region. It is defined as

p_u(y) = c_y \sum_{i=1}^{n_y} k( \| (x_i - y)/r_y \|^2 ) \, \delta[ b(f_i) - u ]   (2)


The key step for the tracking algorithm is the computation of a displacement vector Δy based on the mean shift procedure to obtain the object movement from the initial candidate position to the target object center in the new frame:

w_i = \sum_{u=1}^{m} \sqrt{ q_u / p_u(y_0) } \, \delta[ b(f_i) - u ]   (3)

\Delta y = \frac{ \sum_{i=1}^{n_y} x_i w_i g( \| (x_i - y_0)/r_y \|^2 ) }{ \sum_{i=1}^{n_y} w_i g( \| (x_i - y_0)/r_y \|^2 ) } - y_0   (4)

The kernel function g(x) is defined as the negative derivative of the kernel profile function, g(x) = −k′(x). Comaniciu et al. [7] derive equation (4) based on the Bhattacharyya coefficient

\rho(y) = \rho( p(y), q ) = \sum_{u=1}^{m} \sqrt{ p_u(y) \, q_u }   (5)

a similarity measure between the target and the candidate model.

MST is an iterative process over equations (2), (3) and (4), updating the candidate position in every step with y_{j+1} = y_j + Δy until convergence or a fixed number of iterations is reached. The initial candidate model is calculated at y_0 = x.
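One iteration of this loop can be sketched as follows; grayscale intensities as features, a uniform bin mapping and the Epanechnikov profile (for which g(x) = −k′(x) = 1 inside the kernel support) are assumptions of this example, not prescriptions of the paper.

```python
import numpy as np

def mean_shift_step(image, q, y, r, m=16, eps=1e-10):
    """One mean shift update Delta-y, eqs. (2)-(4) (simplified sketch).

    q is the target histogram, y the current candidate center and r
    the kernel bandwidth; pixel intensities serve as features here.
    """
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    d2 = ((xs - y[0]) ** 2 + (ys - y[1]) ** 2) / float(r) ** 2
    support = d2 <= 1.0
    k = np.where(support, 1.0 - d2, 0.0)       # Epanechnikov profile
    bins = (image.astype(int) * m) // 256      # b(f_i)
    # candidate model p(y), eq. (2)
    p = np.bincount(bins[support], weights=k[support], minlength=m)
    p = p / p.sum()
    # mean shift weights w_i, eq. (3)
    wmap = np.sqrt(q[bins] / (p[bins] + eps))
    g = support.astype(float)                  # g = -k' = 1 on the support
    den = (wmap * g).sum()
    new_x = (xs * wmap * g).sum() / den
    new_y = (ys * wmap * g).sum() / den
    return np.array([new_x, new_y]) - np.asarray(y, dtype=float)
```

A tracker would call this in a loop, y_{j+1} = y_j + Δy, until ‖Δy‖ falls below a threshold or the iteration limit is hit.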

Scale and Orientation Estimation (Ning et al.)

Up to now, only the new object position is estimated, without any scale adaptation. The first algorithm containing an adaptive scale and orientation estimation was introduced by Bradski [10] as the face-tracking algorithm CAMSHIFT. Bradski made the assumption that the target will fit inside the candidate area if its size is enlarged before the MST. This assumption is valid for most moving objects if the frame rate of the camera system is high enough, so that object shrinking or enlarging between consecutive frames is a gradual process and scale changes can be assumed to be smooth. The problem then is to estimate the real object area and orientation of the candidate object. Here CAMSHIFT uses the pixel value probability distribution p(I(x, y)) of the image or the target model q inside mathematical moment functions, which can estimate the shape of an ellipsoidal point cloud.

Ning et al. [13] examined the weight values from eq. (3) with different object sizes, compare Figure 2, and pointed out that the mean shift weight values (MSWV) depend on the candidate scale. Furthermore, the MSWV converge to 1 if the estimated object size fits the current object appearance. In contrast to the MSWV, the values of p(I(x, y)) are constant and independent of the candidate object size. Ning et al. [13] use this adaptive character of the MSWV inside the moment calculation.

Figure 2. (a) Image of the target object. (b) - (d) Images of the candidate bounding box (dashed lines) with the corresponding MSWV. For scale estimation the object area is enlarged before the object movement is tracked. This introduces background features to the candidate region and results in lower weight values, which can be clearly seen in the images above. If the candidate region fits the real object size, the weights converge to the value 1. This observation leads to the assumption that the object scaling can be estimated from the MSWV [13].

By enlarging the object bounding box, background features are introduced to the candidate model automatically. The probability of the target model features is reduced, which leads to higher MSWV for them and lowers the influence of background features in the area estimation of eq. (6), which is based on the 0th-order moment of eq. (7), defined as the sum of all MSWV of the features inside the bounding box.

A = c(\rho) \cdot M_{00}, \quad c(\rho) = \exp( (\rho - 1)/\sigma )   (6)

M_{00} = \sum_{i} w_i   (7)

The Bhattacharyya coefficient of eq. (5) is an indicator for the quality of the area estimation by the 0th-order moment and can be used to adjust the estimate of eq. (6), wherein σ is an adaptive variable that should be chosen relative to the object size.

To estimate the height, width and orientation of the object, first (M01, M10) and second (M02, M11, M20) order moments need to be calculated

M_{10} = \sum_{i} x_i w_i, \quad M_{01} = \sum_{i} y_i w_i, \quad M_{20} = \sum_{i} x_i^2 w_i, \quad M_{02} = \sum_{i} y_i^2 w_i, \quad M_{11} = \sum_{i} x_i y_i w_i   (8)

wherein (x_i, y_i) denote the pixel coordinates of feature f_i,

together with the centroid of the distribution of MSWVs

(\bar{x}, \bar{y}) = ( M_{10}/M_{00}, \, M_{01}/M_{00} )   (9)


Furthermore the second order central moments

\mu_{20} = M_{20}/M_{00} - \bar{x}^2, \quad \mu_{02} = M_{02}/M_{00} - \bar{y}^2, \quad \mu_{11} = M_{11}/M_{00} - \bar{x}\bar{y}   (10)

can be combined into a covariance matrix

Cov = [ \mu_{20}, \mu_{11}; \, \mu_{11}, \mu_{02} ]   (11)

whose eigenvalues λ1² and λ2² are used to estimate the object parameters width and height and are calculated by the singular value decomposition,

Cov = U \, \mathrm{diag}( \lambda_1^2, \lambda_2^2 ) \, U^T, \quad U = [ u_{11}, u_{12}; \, u_{21}, u_{22} ]   (12)

together with the eigenvectors (u_{11}, u_{21})^T and (u_{12}, u_{22})^T that represent the two orientation vectors of the target's main axes in the candidate region. Due to the usage of a kernel function, the rectangular object area is reduced to an ellipsoidal area, which is a necessary condition for shape calculation based on moment functions. The ellipsoid is defined by the two main axes z_1 and z_2, whose lengths are proportional to λ_1 and λ_2, i.e. z_1 = κλ_1 and z_2 = κλ_2 for a common scale factor κ. Together with the definition of the ellipse area a = πz_1z_2 it can be derived that

z_1 = \sqrt{ \lambda_1 a / (\pi \lambda_2) }   (13)

z_2 = \sqrt{ \lambda_2 a / (\pi \lambda_1) }   (14)
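The moment pipeline of eqs. (7)-(14) maps directly to a few numpy calls. The sketch below takes the feature coordinates, the MSWV and a precomputed area estimate from eq. (6); the function name and interface are choices of this example.

```python
import numpy as np

def estimate_scale_orientation(xs, ys, w, area):
    """Centroid, main axes and orientation from MSWV moments (sketch).

    xs, ys: feature coordinates; w: mean shift weight values;
    area: the area estimate A of eq. (6).
    """
    M00 = w.sum()                                            # eq. (7)
    xbar, ybar = (xs * w).sum() / M00, (ys * w).sum() / M00  # eq. (9)
    mu20 = (xs ** 2 * w).sum() / M00 - xbar ** 2             # eq. (10)
    mu02 = (ys ** 2 * w).sum() / M00 - ybar ** 2
    mu11 = (xs * ys * w).sum() / M00 - xbar * ybar
    cov = np.array([[mu20, mu11], [mu11, mu02]])             # eq. (11)
    U, S, _ = np.linalg.svd(cov)            # S holds lambda1^2, lambda2^2
    lam1, lam2 = np.sqrt(S)
    z1 = np.sqrt(lam1 * area / (np.pi * lam2))               # eq. (13)
    z2 = np.sqrt(lam2 * area / (np.pi * lam1))               # eq. (14)
    return (xbar, ybar), (z1, z2), U  # U columns: the two main-axis vectors
```

numpy's SVD returns the singular values in descending order, so z1 is always the major axis.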

After calculating the width and height of the estimated candidate area, the covariance matrix needs to be updated

Cov' = U \, \mathrm{diag}( z_1^2, z_2^2 ) \, U^T   (15)

to be reused in the next frame, since it contains the orientation estimation of the current candidate object.

The features inside the candidate region can be calculated using the ellipsoid equation

[ x - \bar{x}, \, y - \bar{y} ] \, Cov'^{-1} \, [ x - \bar{x}, \, y - \bar{y} ]^T \le 1   (16)

Background Adaptation

The modelling of the target and candidate will contain background features, since the detected bounding box may neither be positioned at the real object center nor fit the exact object shape, or the candidate region is enlarged before

tracking. Comaniciu et al. [3] tried to reduce the influence of prominent background features by estimating a background histogram model and introducing background weights v_u to gain the background-weighted histograms (BWH), multiplying target and candidate model with v_u. Ning et al. [14] proved that the application to both models has no effect on the mean shift tracking algorithm and corrected the calculation by applying the background weights just to the target model (CBWH).

The background model v is calculated on the n features around the target object in an area that is three times bigger than the object, without any kernel weighting,

\hat{o}_u = \frac{1}{n} \sum_{i=1}^{n} \delta[ b(f_i) - u ], \quad v_u = \min( \hat{o}^* / \hat{o}_u, \, 1 )   (17)

with \hat{o}^* denoting the minimal nonzero value of the background histogram \hat{o},

so that the CBWH-target model results in

q'_u = c' \, v_u \sum_{i=1}^{n} k( \| (x_i - x)/r \|^2 ) \, \delta[ b(f_i) - u ]   (18)

The relationship between the standard mean shift feature weights wi in equation (3) and the CBWH feature weights w′i can be calculated as

w'_i = \sqrt{ v_u } \, w_i, \quad u = b(f_i)   (19)
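A sketch of the background weighting; the concrete form v_u = min(ô*/ô_u, 1), with ô* the smallest nonzero bin of the background histogram, follows the BWH/CBWH formulation of [3] and [14].

```python
import numpy as np

def cbwh_weights(background_hist, eps=1e-10):
    """Background weights v_u for the CBWH target model (sketch).

    Bins that are prominent in the background get a small weight;
    bins absent from the background keep the full weight 1.
    """
    o = background_hist / background_hist.sum()  # normalized background model
    o_star = o[o > 0].min()                      # smallest nonzero bin
    return np.where(o > 0, np.minimum(o_star / (o + eps), 1.0), 1.0)
```

The CBWH target model then multiplies every bin of eq. (1) with v_u, and the feature weights follow eq. (19).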

Figure 3. (a) Standard mean shift target model. (b) Background histogram model. (c) Background weights calculated by eq. (17). (d) CBWH histogram model. Comparing (a) and (b), it is obvious that the target cannot be separated from the background easily based on the pixel intensity model. After the background adaptation of CBWH the model becomes more discriminative, since the influence of prominent background features (bins 30 - 60) is reduced.


Introducing Histograms of Oriented Gradients to Mean Shift

The state-of-the-art mean shift tracking algorithms, like SOAMST, show good performance and reliable results when they are used with color sensor systems. As already mentioned, the object modelling is not discriminative enough for monochrome vision systems, and stronger features are needed. Therefore we introduce the HoG-features, developed by Dalal and Triggs [15], to the MST algorithm. They have the advantage of describing an image patch instead of a single pixel, by summarizing the gradient information of the image patch in a histogram. Furthermore, HoG-features are less sensitive to noise and more independent of illumination changes due to the coarse orientation quantization.

To calculate HoG-features, the image derivatives in x and y direction are computed pixel-wise by the standard filters g_x = [1, 0, −1] and g_y = [1, 0, −1]^T. Based on g_x and g_y, pixel gradients can be calculated, and the resulting pixel orientation is quantized into 9 directions. A HoG-feature is defined by its position (f_x, f_y) and size (s_x, s_y), describing an image patch of s_x × s_y pixels, wherein the orientations of all pixels are summarized into a histogram.

For MST, highly discriminative features are needed, which we obtain by sub-dividing the image patch into four rectangular regions and calculating another four orientation histograms, as visualized in Figure 4. Therefore, the minimum patch size is 4 × 4 pixels. Together with a 15-dimensional Haar-like feature vector [18], calculated on the intensity values, the HoG-feature results in a highly discriminant 60-dimensional feature f.

Figure 4. HoG-features are calculated on the oriented gradients inside an image patch (red). All pixel gradients are quantized into nine directions, summarized in a histogram and normalized; the dot stands for "no direction", which is the case if the magnitude of the gradient is below a threshold. To generate more discriminative features, the patch is subdivided into four regions (green). Again the gradients are summarized in histograms and normalized. Together with a 15-dimensional Haar-like feature, the HoG-feature results in a 60-dimensional vector.
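The per-patch building block can be sketched as follows. The simple [1, 0, −1] derivative filters and the nine-bin histogram follow the description above; treating "no direction" as bin 0 and the concrete magnitude threshold are assumptions of this example.

```python
import numpy as np

def hog_patch(patch, mag_thresh=4.0):
    """9-bin orientation histogram of one image patch (sketch).

    Bin 0 is the 'no direction' bin for weak gradients; bins 1-8
    hold the quantized unsigned gradient orientation.
    """
    patch = patch.astype(float)
    gx = np.zeros_like(patch)
    gy = np.zeros_like(patch)
    gx[:, 1:-1] = patch[:, 2:] - patch[:, :-2]   # filter [1, 0, -1]
    gy[1:-1, :] = patch[2:, :] - patch[:-2, :]   # filter [1, 0, -1]^T
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)      # unsigned orientation [0, pi)
    idx = 1 + np.minimum((ang / np.pi * 8).astype(int), 7)
    idx = np.where(mag > mag_thresh, idx, 0)     # weak gradient -> bin 0
    hist = np.bincount(idx.ravel(), minlength=9).astype(float)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist
```

The full 60-dimensional feature would concatenate this histogram for the whole patch and its four sub-regions with the 15-dimensional Haar-like vector.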

With these high-dimensional features, simple histogram quantization modelling is not suitable anymore and other methods need to be found. As a first step, we propose to cluster the features to achieve a quantization similar to the histogram modelling in color feature space. We use K-means clustering, so that every cluster center corresponds to one histogram bin in the model representation. As a second step, every feature of the object area is matched to one cluster by a nearest neighbor matching algorithm and weighted with a kernel function, and the modelling process is finished. This procedure is comparable to the Bag of Words idea, wherein features are clustered and then summarized inside a histogram, based on their cluster matching, to obtain a visual word. In our case, all features inside the surrounding area of the pedestrian are used for clustering, so a more general, but still discriminative and individual clustering is reached.

The combination of all cluster centers is often referred to as a dictionary. We propose to use this dictionary with a nearest neighbor search algorithm inside the mean shift tracking algorithm as a substitute for the feature matching function b in eqs. (1) and (2). The target model gained this way is comparable to a visual "word". Calculating a dictionary for every single object can be very inefficient and may corrupt the real-time capability of the algorithm. To overcome this problem, it is possible to train a "global" dictionary based on a feature set extracted from a training database with positive and negative pedestrian examples. In this case it is also possible to evaluate the optimal number of cluster centers used for the modelling in an offline stage.
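The dictionary construction and the matching function b can be sketched as follows; a plain Lloyd-style K-means is used for illustration, and both function names are hypothetical.

```python
import numpy as np

def build_dictionary(features, k, iters=20, seed=0):
    """K-means over HoG feature vectors; the k cluster centers form
    the dictionary (sketch, plain Lloyd iterations)."""
    features = np.asarray(features, dtype=float)
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), k, replace=False)].copy()
    for _ in range(iters):
        d = np.linalg.norm(features[:, None, :] - centers[None], axis=2)
        assign = d.argmin(axis=1)                # nearest center per feature
        for j in range(k):
            members = features[assign == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers

def bow_histogram(features, centers, weights):
    """BoW object model: nearest-neighbor matching to the cluster
    centers replaces the bin function b of eqs. (1) and (2)."""
    features = np.asarray(features, dtype=float)
    d = np.linalg.norm(features[:, None, :] - centers[None], axis=2)
    bins = d.argmin(axis=1)
    hist = np.bincount(bins, weights=weights, minlength=len(centers))
    return hist / hist.sum()
```

For a per-object dictionary the features around the pedestrian are clustered online; a global dictionary trained offline keeps the same interface.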

The proposed algorithm uses a constant grid of HoG-features with a distance of d_f = 2 pixels. Because of the change in feature sampling, the scale estimation needs to be adapted to the feature distance. In the previous scale estimation, the pixel features used in the moment calculation were sampled on the same dense grid as the image coordinate system used for position and scale estimation. Since every pixel is a feature, a dense grid of feature weights can be calculated and the object area relative to the coordinate system can be computed. If the feature grid is subsampled, not every coordinate can be matched to a feature weight directly. The moment calculation is based on a dense feature space, and a missing feature weight would equal a zero weighting of that feature position. This problem can be solved by interpolating the feature weights to the coordinate system. The interpolation can easily be done by a nearest neighbor matching, which can be approximated by multiplying the relative feature position with the feature distance. The new moments result in:

M_{00} = d_f^2 \sum_{i} w_i   (20)

M_{10} = d_f^2 \sum_{i} (d_f x_i) w_i, \quad M_{01} = d_f^2 \sum_{i} (d_f y_i) w_i   (21)

M_{20} = d_f^2 \sum_{i} (d_f x_i)^2 w_i, \quad M_{02} = d_f^2 \sum_{i} (d_f y_i)^2 w_i, \quad M_{11} = d_f^2 \sum_{i} (d_f x_i)(d_f y_i) w_i   (22)

wherein (x_i, y_i) now denote the feature grid indices.
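Under this nearest-neighbor interpolation, each feature weight stands in for a d_f × d_f block of pixels located at pixel position d_f·(x_i, y_i). The low-order moments can then be sketched as follows; the scaling is a reconstruction consistent with that interpretation, not a formula quoted from the paper.

```python
import numpy as np

def grid_moments(ix, iy, w, df=2):
    """0th/1st-order moments of feature weights on a subsampled grid
    (sketch): each weight covers df*df pixels at position df*(ix, iy)."""
    x, y = df * ix, df * iy          # feature indices -> pixel coordinates
    M00 = df ** 2 * w.sum()          # every weight counts for df^2 pixels
    M10 = df ** 2 * (x * w).sum()
    M01 = df ** 2 * (y * w).sum()
    return M00, M10, M01
```

The centroid M10/M00, M01/M00 is unaffected by the common d_f² factor, while the area-like M00 is scaled to pixel units; second-order moments follow the same pattern, and d_f = 1 recovers the dense-grid moments.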


The orientation calculation is not influenced by the different feature sampling, and since we do not expect the object to rotate inside the image plane, it is neglected anyway.

Furthermore, the density of the feature grid influences the accuracy of the position and scale estimation as well as the calculation time. Obviously, fewer features result in a reduced calculation time, but also reduce the accuracy of the system.

Scale Estimation and Background Adaptation

Obviously, the benefits of a combined SOAMST-CBWH approach would be scale-adaptive and robust position estimation. However, the results of the MST evaluation in the next section show that the benefit of the background adaptation can be neglected in most cases. Furthermore, the manipulated target model influences the moment-based scale estimation if the target model is mainly built on a group of features that are also prominent in the background. The CBWH is used to suppress the influence of these background features during the position estimation by reducing their proportion inside the target model. This results in smaller MSWV, comparing equations (3) and (19).

The scale estimation of SOAMST relies on the area estimation based on the sum of the MSWV inside the object area. If the weights are reduced not by the relation between target and candidate model but by the background weighting, the assumption of a relation between the real object area and the estimated area fails, and the estimation results in a smaller candidate size.

A solution to this problem is to use the different target models for the different tasks they have been designed for: the CBWH target model for position estimation and the standard MST target model for scale estimation.

Experimental Results

The proposed MST system is evaluated in an advanced driver assistance system for pedestrian tracking. For this purpose, a set of 5 videos was recorded, containing typical pedestrian situations during daytime, like crossing pedestrians, approaching pedestrians, groups of pedestrians and overlapping pedestrians. Summed up, 9 pedestrians (PED) are tracked. For comparison to the real object, the position and scale of all pedestrians are hand-labeled in every 5th frame and interpolated in between. The labeled data is used as ground truth. The initialization and modelling of a pedestrian object is done in the first frame of its appearance.

Sequence (Seq.) 1 contains three PEDs, see Figure 5(a). The vehicle is standing in front of a crosswalk; one big PED (height 340 px) is crossing in the foreground, one PED in a group is approaching the vehicle (100 - 200 px) and the last one is standing in the background (40 px). In Seq. 2 the vehicle is driving and again three PEDs are walking on the right walkway, see Figure 6(a): one large (70 - 130 px) and a group of two, which are initialized with a height of just 30 px (30 - 90 px). The other sequences each contain one walking PED on the right walkway, while the vehicle is approaching. The background ranges from very dark (Seq. 3) to daylight (Seq. 4) or contains structured background (Seq. 5), see Figure 6(b).

(a). Sequence 1, frame 6

(b). Sequence 1, frame 60

Figure 5. Tracking results of sequence 1. SOAMST using the grey intensity value target model loses all objects almost immediately. (a) shows the results up to frame 6. (b) shows the superior performance of the HoG-feature modelling up to frame 60, in combination with the SOAMST position and scale estimation.

The proposed algorithm is evaluated in two different implementations, as SOAMST based on HoG-features with and without background adaptation using the CBWH target model. Furthermore, the results are compared to the state-of-the-art MST algorithms SOAMST [13] and CBWH [14], and additionally to the standard MST [7], adapted to monochrome vision systems. A target is defined as lost if the quotient between the estimated scale and the ground truth scale is greater than 1.5 or smaller than 0.5, or if the relative position error is greater than 0.5 times the ground truth width.
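The lost-track criterion can be stated as a small predicate; the function name and the Euclidean position error are choices of this sketch.

```python
def track_lost(est_scale, gt_scale, est_pos, gt_pos, gt_width):
    """True if a track counts as lost under the evaluation criterion:
    scale ratio outside [0.5, 1.5], or position error above half the
    ground truth width."""
    ratio = est_scale / gt_scale
    dx = est_pos[0] - gt_pos[0]
    dy = est_pos[1] - gt_pos[1]
    pos_err = (dx * dx + dy * dy) ** 0.5
    return ratio > 1.5 or ratio < 0.5 or pos_err > 0.5 * gt_width
```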

Due to the random character of the K-means clustering in the modelling process, the HoG-feature based results are not identical for every initialization of the same pedestrian. Therefore, the system is initialized multiple times and the results are averaged, see Table 1.


(a). Sequence 2, frame 18

(b). Sequence 5, frame 32

Figure 6. Tracking results of the sequences 2 and 5. (a) Even when the position error becomes larger for some consecutive frames, the HoG-feature based tracking can recover the object (big PED). This is caused by the object modelling being based on one single frame and the pose of the pedestrian in this frame. If the PED returns to this pose, the model matching can be done more accurately. (b) The proposed algorithm is also able to handle vehicle pitching.

Figure 5 shows the results of all tracking algorithms on Seq. 1. As in Table 1, it points out clearly that the tracking algorithms based on intensity values cannot track the targets properly. The SOAMST (yellow) loses the crossing PED directly after 3 frames and the other two PEDs after a few frames due to completely wrong scale estimation. The MST (red) and the CBWH-MST (magenta) can track the targets longer, but do not estimate the scale change.

With a moving vehicle, the performance of all intensity based algorithms drops significantly, and tracking over several frames is not possible in most sequences. Only the CBWH algorithm can track the object if the background is not structured. These sequences show the advantage of the background adaptation, see Table 1.

The HoG-features improve the tracking a lot. SOAMST-HoG tracks 7 of the 9 PEDs almost over the complete distance. The remaining 2 tracks fail due to wrong scale estimation. In the sequences 2-5 all PEDs approach the vehicle. This causes a higher position error, due to the initialization of the target model in the first frame and the missing model update strategy, as can be seen in Figure 6.

Slightly better position estimates can be achieved using the CBWH target model inside the HoG algorithm. The difference between CBWH and SOAMST becomes clear looking at the scale estimation error. SOAMST often results in a slightly larger scale estimate, especially when the object does not change its size, compare Seq. 1, PED 1. The average scale estimate of CBWH instead is always too small, due to the reduced influence of features that are prominent in the background.

The presented results are based on a single target initialization and do not contain any model update strategies. This results in two different effects. On the one side, an approaching PED will get lost if the scale change gets too large compared to the initial size, due to the fact that the HoG-features are calculated with a fixed size: for example, during target modelling one feature may contain all gradients describing a head (a full circle), which later may be described by a handful of features (circle parts) that are matched to different cluster centers and histogram bins in the candidate model. The other effect is a positive one: since the PED is initialized in one special pose (walking with spread legs) that will return after some frames, the PED may be recovered with high accuracy if it returns to this initial pose, compare Figure 6(a).

Table 1. Statistical results of the mean shift algorithms for the PED sequences. Evaluation criteria are: Track Length (TL), Position Error (PE) and Scale Error (SE). The ground truth track length is given in column GT.

Conclusions

In this paper we present a method that uses high-dimensional HoG-features for target modelling inside a mean shift tracking algorithm, adapting MST to the monochrome vision systems commonly used in driver assistance systems. The HoG-based object modelling is needed because the reduced information content of grey-scale image pixels compared to color pixels is not discriminant enough to guarantee robust and accurate tracking. The HoG-features need to be preprocessed to fit the required histogram object models. Therefore, we propose to use a Bag of Words algorithm that clusters all extracted features with K-means and summarizes the object features into a histogram based on a nearest neighbor cluster matching.

The system is evaluated on typical pedestrian appearances in various situations on monochrome sequences, in two different implementations of the adapted SOAMST by Ning et al. [13]: with (HoG-CBWH) and without (HoG-SOAMST) a background adaptation. It is shown that the proposed MST based on HoG-features is superior to the standard methods adapted to grey-scale histogram modelling. The position estimation based on the HoG-SOAMST turns out to be the most reliable and robust, tracking 7 of the 9 evaluated pedestrian appearances nearly over the full length. The position estimation still contains a relatively small error compared to the hand labelled ground truth, which cannot be reduced significantly by the use of the background adaptive CBWH target model. Since the scale estimation scheme is designed for the SOAMST, it is most reliably used with that variant.

The HoG-features can be used without a considerable increase in processing power, since they are already computed for pedestrian detection. Only the clustering process in the modelling stage is costly in terms of calculation time. Using a global dictionary, trained offline on a large dataset of pedestrians and their surroundings, would solve this problem.

Future Work

The results reached so far are satisfying in terms of track length and robustness in different situations, but can still be improved in terms of position and scale accuracy. Since the results are based on a single target initialization, a model update strategy can be introduced to generate longer tracks and to handle the problem of losing approaching pedestrians with a small initialization size. This requires a reliable confidence estimate; using the Bhattacharyya coefficient as a similarity measure between the final candidate model and the target model may be an option. Additionally, the current model is based only on pixel orientations and leaves the pixel intensities out. A combination of both features may improve the estimation. Furthermore, the position of the features inside the object is neglected by the use of histograms as model structures. This loss of information may be compensated by an offline-trained voting scheme that defines an offset vector from every feature towards the object center, which can be used to recalculate the kernel weights or as an additional feature dimension.
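The Bhattacharyya coefficient between two normalized histograms p and q is rho = sum over u of sqrt(p_u * q_u); it reaches 1.0 for identical histograms and 0.0 when no bins overlap, which makes it a natural candidate for such a confidence measure. A minimal sketch (Python/NumPy; the function name is an assumption for illustration):

```python
import numpy as np

def bhattacharyya(p, q):
    """Bhattacharyya coefficient of two normalized histograms:
    rho = sum_u sqrt(p_u * q_u); 1.0 for identical histograms,
    0.0 for histograms with no overlapping bins."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(np.sqrt(p * q)))

# toy example: a candidate histogram close to the target histogram
target = np.array([0.5, 0.3, 0.2])
candidate = np.array([0.4, 0.4, 0.2])
rho = bhattacharyya(target, candidate)  # close to 1 -> high confidence
```

A tracker could, for instance, only update the target model when rho exceeds a fixed threshold, so that occluded or drifting candidates do not corrupt the model.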

References

1. Yilmaz A., Javed O. and Shah M., “Object tracking: A survey,” ACM Comput. Surv., vol. 38, no. 4, p. 13, December 2006, doi:10.1145/1177352.1177355.

2. Lucas B. D. and Kanade T., “An iterative image registration technique with an application to stereo vision,” in Proceedings of the 7th international joint conference on Artificial intelligence - Volume 2, San Francisco, CA, USA, Morgan Kaufmann Publishers Inc., 1981, pp. 674-679.

3. Comaniciu D., Ramesh V. and Meer P., “Kernel-based object tracking,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 25, no. 5, pp. 564-577, 2003, doi:10.1109/TPAMI.2003.1195991.

4. Zitová B. and Flusser J., “Image registration methods: a survey,” Image Vision Comput., vol. 21, no. 11, pp. 977-1000, 2003, doi:10.1016/S0262-8856(03)00137-9.

5. Fukunaga K. and Hostetler L., “The estimation of the gradient of a density function, with applications in pattern recognition,” Information Theory, IEEE Transactions on, vol. 21, no. 1, pp. 32-40, 1975, doi:10.1109/TIT.1975.1055330.

6. Comaniciu D. and Meer P., “Mean shift analysis and applications,” in Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, Kerkyra, 1999, doi:10.1109/ICCV.1999.790416.

7. Comaniciu D., Ramesh V. and Meer P., “Real-time tracking of non-rigid objects using mean shift,” in Computer Vision and Pattern Recognition, 2000. Proceedings. IEEE Conference on, Hilton Head Island, SC, 2000, doi:10.1109/CVPR.2000.854761.

8. Collins R., “Mean-shift blob tracking through scale space,” in Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, Madison, Wisconsin, 2003, doi:10.1109/CVPR.2003.1211475.

9. Yan Y., Huang X., Xu W. and Shen L., “Robust Kernel-Based Tracking with Multiple Subtemplates in Vision Guidance System,” Sensors, vol. 12, no. 2, pp. 1990-2004, 2012, doi:10.3390/s120201990.

10. Bradski G. R., “Computer vision face tracking for use in a perceptual user interface,” Intel Technology Journal, vol. 2, 1998.


11. Lee L.-K., An S.-Y. and Oh S.-y., “Robust visual object tracking with extended CAMShift in complex environments,” in IECON 2011 - 37th Annual Conference on IEEE Industrial Electronics Society, Melbourne, VIC, 2011, doi:10.1109/IECON.2011.6120057.

12. Zivkovic Z. and Krose B., “An EM-like algorithm for color-histogram-based object tracking,” in Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, Washington, DC, 2004, doi:10.1109/CVPR.2004.1315113.

13. Ning J., Zhang L., Zhang D. and Wu C., “Scale and orientation adaptive mean shift tracking,” Computer Vision, IET, vol. 6, no. 1, pp. 52-61, 2012, doi:10.1049/iet-cvi.2010.0112.

14. Ning J., Zhang L., Zhang D. and Wu C., “Robust mean-shift tracking with corrected background-weighted histogram,” Computer Vision, IET, vol. 6, no. 1, pp. 62-69, 2012, doi:10.1049/iet-cvi.2009.0075.

15. Dalal N. and Triggs B., “Histograms of oriented gradients for human detection,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, San Diego, CA, USA, 2005, doi:10.1109/CVPR.2005.177.

16. Enzweiler M. and Gavrila D., “Monocular Pedestrian Detection: Survey and Experiments,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 31, no. 12, pp. 2179-2195, 2009, doi:10.1109/TPAMI.2008.260.

17. Zheng G. and Chen Y., “A review on vision-based pedestrian detection,” in Global High Tech Congress on Electronics (GHTCE), 2012 IEEE, Shenzhen, 2012, doi:10.1109/GHTCE.2012.6490122.

18. Viola P. and Jones M., “Rapid object detection using a boosted cascade of simple features,” in Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, Kauai, HI, 2001, doi:10.1109/CVPR.2001.990517.

The Engineering Meetings Board has approved this paper for publication. It has successfully completed SAE’s peer review process under the supervision of the session organizer. The process requires a minimum of three (3) reviews by industry experts.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of SAE International.

Positions and opinions advanced in this paper are those of the author(s) and not necessarily those of SAE International. The author is solely responsible for the content of the paper.

ISSN 0148-7191

http://papers.sae.org/2014-01-0170