
A Real-time Visual Action-Recognition Framework for Time-Varying-Geometry Objects

Matthew Mackay, Robert G. Fenton, and Beno Benhabib

Department of Mechanical & Industrial Engineering University of Toronto, 5 King's College Road

Toronto, ON, Canada, M5S 3G8

Abstract – This paper develops a real-world, real-time sensing-system reconfiguration framework for time-varying-geometry (TVG) object action recognition. Currently, a method specifically tailored for sensing such objects, especially humans, does not exist. A 10-level superscalar pipeline structure is proposed to address real-time concerns. Carefully designed evaluation criteria are used with real-world experiments to validate the method. Overall, it was concluded that this implementation (of our more generic framework) was successful in its task-specific goals, and that the framework can be applied to many other TVG object action-sensing tasks.

I. INTRODUCTION AND LITERATURE REVIEW

Recognizing time-varying-geometry (TVG) objects is a complex task. In the past, algorithms have been developed to recognize such objects under carefully controlled conditions, and, later, under real-world conditions [1]. However, shape and action dimensionality is a key concern - the human form, for instance, can exhibit an essentially infinite pose space, owing to a large number of redundant degrees of freedom (DOF). For example, [2] presents a 74 DOF partial human model. Past work has presented systems which mainly focus on relatively time-invariant features (such as the face) for human form recognition, such as the agent-based work in [3]. However, recent research has again focused on the form of a TVG object – specifically, many objects are known to exhibit actions one might wish to recognize [4]. As such, this review will first examine the methods for TVG object form and action recognition.

A. Form and Action Recognition

Early efforts in static form recognition focused on single-form TVG object identification using fixed-view images. Many algorithms also dealt with reconstructing models for a priori unknown objects (e.g., [5]). Modern approaches use static-environment reconfiguration methods to construct a 3-D model and recognize the current object form [6]. However, many of these remain quasi-static, static-form systems.

As mentioned above, TVG objects often exhibit recognizable actions. Action-recognition approaches have commonly been placed into three general categories: template matching, semantic, and statistical [7]. Most modern works can be classified under one or more of these categories (e.g., [8]-[10]). However, in these works there is an underlying assumption that the input data is fixed; static-camera methods are prevalent (e.g., [11], [12]). Indeed, the above form-recognition methods do not consider sensor-pose choice as part of their methodology, and leave no opportunity for selecting optimal sensor poses. A few works acknowledge the benefit of improved input data, but present no specific framework to obtain it [13]. An established need exists for a formal reconfiguration methodology tailored to the sensing of time-varying-geometry objects.

B. Sensing-System Reconfiguration

Sensing-system reconfiguration is a method to improve sensing-system input data by repositioning sensors to optimal locations. Optimality is typically defined as performance optimality for a particular sensing task, and is quantified through physical measures directly related to task performance (e.g., visibility). Many seminal works in reconfiguration focused on static-environment sensor-planning methods. These are commonly classified as either generate-and-test or synthesis [12], and are often offline methods. Recent examples for static environments have often been application-specific.

Sensor planning for dynamic cameras surveying an otherwise static scene has become a primary interest of computer-vision specialists studying image or scene understanding. Such works often focus on the well-known next-best-view problem [14]. However, sensing dynamic targets moving in a dynamic environment is more complex – early methods addressed an object/subject that may be moving or maneuvering. Static obstacles have been addressed by multiple mobile sensors, as in [15]. Dynamic obstacles require a detection step (assuming no a priori knowledge) and path prediction. Multiple, dynamic obstacles have been considered, as in [3]. These methods inherently assume a uniform target – recent methods have considered that an articulated object may self-occlude, and as such, the next-best-view problem might be solved online as part of the sensing solution [16]. It has also been shown in [17] that the problem of action recognition and sensor placement is intertwined – reconfiguration algorithms depend on accurate information about the current pose, form, and action of a TVG object to make near-optimal sensor placements. Formal attention must be given to the real-time operation of these systems.

Many past works in action recognition attempt to minimize processing time as an inherent part of their solution. For example, statistical and semantic recognition methods employ dimensionality reduction to solve an otherwise computationally intractable problem [14]. However, relatively few methods give attention to situations where hard time deadlines are enforced (i.e., where the timeliness of the decision itself is part of the system problem). One rare example is [3], where a rolling time horizon is used, and a constrained, decision-time-aware search method allows hard deadlines to be enforced. However, action recognition as a system task represents a continuous-time problem, where the task of the system does not have a specific start and end. Previous methods for fixed-geometry objects and static-form sensing cannot be directly applied, even though some modern works have achieved results in doing so [18]. As such, the goal of this work is to develop a real-time, real-world, generic sensing-system reconfiguration methodology, designed specifically for systems sensing the actions of time-varying-geometry objects under a wide range of task requirements.

II. PROBLEM FORMULATION

Real-time sensing-system reconfiguration for a generic sensing application can be formulated as a global optimization problem, maximizing recognition performance [19]. For TVG object action recognition, one considers the 'success rate in recognizing the current object-of-interest (OoI) form and action' as the metric of performance [19]. This rate, in turn, depends on the visibility of the OoI to any given camera. A mathematical formulation of the generic optimization problem is given below; it is inherently solved by any implementation of the generic framework we propose herein.

We define visibility by a visibility metric, V. The visibility metric for the ith sensor, expressed at the jth demand instant, $t_j$, is defined as $V_i^j = V_i\!\left(\mathbf{p}_i^j\right)$. It is a function of $\mathbf{p}_i^j$, the pose of the ith sensor, $S_i$, at the jth instant, with pose defined as a 6D vector representing position, $(x, y, z)$, and orientation, $(\alpha, \beta, \gamma)$. Evaluation of this metric is examined in more detail in the methodology section.

The overall optimization can be stated for a system with $n_{sens}$ sensors, an environment with $n_{obs}$ obstacles, and with prediction over the time horizon ending at the $m$th demand instant, as:

For each demand instant, $t_j$, $j = 1$ to $m$, perform the following:

For each sensor, $S_i$, $i = 1$ to $n_{sens}$, solve the following:

Given $\mathbf{p}_i^j$, $\mathbf{x}_{OoI}^j$, $\mathbf{f}^j$, $\mathbf{x}_{obs,k}^j$; $k = 1$ to $n_{obs}$  (1)

Maximize $P_r\!\left(V_i^l\right)$; $l = 1$ to $j$  (2)

Subject to $\mathbf{p}_i^l \in P_i$, $\mathbf{p}_i^l \in A_i^l$, and $V_i^l \ge V_{min}$; $l = 1$ to $j$  (3)

End of loop. Continue while $t_{proc} < t_{max}$.

Above, $\mathbf{x}_{OoI}^j$ is the pose of the OoI at the jth demand instant, $\mathbf{x}_{obs,k}^j$ is the pose of the kth obstacle at the jth demand instant, $\mathbf{f}^j$ is the feature vector of the OoI at the jth demand instant, $P_i$ is the discretized set of feasible poses for the ith sensor, $A_i^j$ is the discretized set of achievable poses for the ith sensor at the jth demand instant, $V_{min}$ refers to a user-defined threshold of minimum visibility, $t_{proc}$ is the time spent processing data, and $t_{max}$ is the maximum decision-time limit.

A pose is feasible (belongs to $P_i$) if it is within the physical motion limits of the sensor. The achievable poses ($A_i^j$) are a strict subset of the feasible poses, given by the motion model used (which can be application and equipment dependent). They represent poses that can be reached in the time remaining before a decision must be made. The performance objective function, $P_r$, depends on the visibility metric of each sensor at all demand instants on the horizon. It is a measure of success in achieving the sensing objective. An implementation capturing this problem formulation would optimize performance by first maximizing the visibility of the OoI at the immediate-future demand instant ($t_1$) for all sensors. If sufficient time remains, the system would maximize expected visibility for the combination ($t_1$ and $t_2$), then ($t_1$, $t_2$, and $t_3$), and so on. As such, a higher overall metric value may be achieved at later demand instants, possibly at the expense of immediate-future visibility. This is the base optimization problem formulation, presented in [19].
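To make the structure of Eqs. (1)-(3) concrete, the sketch below (Python, not from the original work) implements a simplified, per-instant greedy version of the rolling-horizon search: the joint optimization over ($t_1$, $t_2$, ...) is reduced to extending the horizon one instant at a time while the processing-time budget has not been exhausted. The visibility function, the feasible and achievable pose sets, and the time budget are assumed to be supplied by the application.

```python
# Illustrative sketch of the rolling-horizon optimization of Eqs. (1)-(3).
# 'visibility', 'achievable', and 'feasible' are application-supplied callables (assumptions).
import time

def plan_poses(sensors, horizon, visibility, achievable, feasible, v_min, t_max):
    """Greedily extend the optimized horizon while decision time remains."""
    start = time.monotonic()
    decisions = {}                                    # (sensor, instant) -> chosen pose
    for instant in horizon:                           # t_1, t_2, ..., t_m
        for s in sensors:
            best_pose, best_v = None, -1.0
            # Candidate poses must be feasible, achievable in the remaining time,
            # and satisfy the minimum-visibility constraint of Eq. (3).
            for pose in achievable(s, instant) & feasible(s):
                v = visibility(s, pose, instant)      # V_i^j for this candidate
                if v >= v_min and v > best_v:
                    best_pose, best_v = pose, v
            if best_pose is not None:
                decisions[(s, instant)] = best_pose
        if time.monotonic() - start >= t_max:         # continue only while t_proc < t_max
            break                                     # later instants fall back to defaults
    return decisions
```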

However, on its own, this optimization does not truly capture the real-time nature of the problem. Since this paper presents a generalized framework, many choices are necessarily application-dependent, and hence not specified. In these cases, once the user has implemented a system, it is useful to provide a measure of feedback to gauge the suitability of the implementation to the given application. As such, we define several key metrics and associated constraints:

1. The effective average update rate of the system, $\bar{f}_{upd}$, must be greater than a minimum, $\bar{f}_{min} = 2/t_{min}$. The value of $t_{min}$ is the minimum separation in time between any two key feature vectors in the action library. Under basic sampling theory, the update rate must be greater than $2/t_{min}$ to avoid aliasing. However, since this constraint is based on an average update rate, a more conservative choice of $3/t_{min}$ to $4/t_{min}$ is recommended.

2. The minimum update rate, $f_{min}$, must be constrained if there is variation in the update rate: $f_{min} \ge f_{min,min}$. Note that if $f_{min}$ cannot be concretely defined (as is typically the case), then the update rate, $f$, should be treated as a Gaussian random variable, and $f_{min}$ is taken as $f_{min} = \bar{f} - 3\sigma_f$, where $\bar{f}$ is the average of $f$ over a sufficiently large number of samples, and $\sigma_f$ is the sample SD.

3. The maximum update rate, $f_{max}$, should also be constrained such that $f_{max} \le f_{max,max}$, where $f_{max,max} = N_h / \overline{(t_a - t_b)}$. Here, $N_h$ is the length (in number of demand instants) of the rolling horizon, which is typically equal to the length (in key frames) of the longest action in the library. Also, $\overline{(t_a - t_b)}$ is the average key-frame time length of the longest action in the library. Note that this constraint may be omitted in cases where there is a large difference in action lengths in the action library. The optimal case is where the action library consists mostly of actions of equal length in key-frames.

4. The average per-frame error, $\bar{e} = \overline{\left\|\mathbf{f}_j - \mathbf{f}_{lib}\right\|}$, taken as the straight-line Euclidean distance between the recovered, normalized subject feature vector for a given demand instant ($\mathbf{f}_j$) and the closest interpolated library form feature vector ($\mathbf{f}_{lib}$), should be less than a maximum, $e_{max}$. Typically, $e_{max}$ will be chosen such that $e_{max} < \frac{1}{2}\left\|\mathbf{f}_{lib,a} - \mathbf{f}_{lib,b}\right\|_{min}$, where $\left\|\mathbf{f}_{lib,a} - \mathbf{f}_{lib,b}\right\|_{min}$ is the minimum distance between any two library feature vectors. Again, if this distance is artificially small, the constraint may be waived.

5. The decision ratio, $R_d = \dfrac{n_d}{n_{nd} + n_d} \ge R_{d,min}$. Here, $n_d$ is the frequency of demand instants with a definite sensor-pose decision, and $n_{nd}$ is the frequency of instants where a default pose decision is used. Typically, $R_{d,min} = 0.95$.

If these constraints are carefully considered for a given application space, it is possible to design an implementation that can yield a success rate in excess of 99% for a given TVG action recognition task. At a minimum, they will provide a clear picture of the suitability of the implementation.
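As an illustration of how these feedback metrics might be computed for a candidate implementation, the following sketch (with hypothetical field names, not part of the original framework) derives the average update rate, a conservative minimum update rate, the average per-frame error, and the decision ratio from a log of demand instants.

```python
# Hedged sketch: post-hoc computation of the Section II feedback metrics from a trial log.
import statistics

def implementation_metrics(timestamps, decided, per_frame_errors):
    """timestamps: demand-instant times [s]; decided: True if a definite pose decision
    was made at that instant; per_frame_errors: ||f_j - f_lib|| for each instant."""
    intervals = [b - a for a, b in zip(timestamps, timestamps[1:])]
    rates = [1.0 / dt for dt in intervals if dt > 0]
    f_avg = statistics.mean(rates)                     # effective average update rate
    f_min = f_avg - 3 * statistics.stdev(rates)        # conservative minimum (mean - 3 SD)
    e_avg = statistics.mean(per_frame_errors)          # average per-frame error
    r_d = sum(decided) / len(decided)                  # decision ratio n_d / (n_d + n_nd)
    return {"f_avg": f_avg, "f_min": f_min, "e_avg": e_avg, "R_d": r_d}
```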

III. PROPOSED REAL-TIME METHODOLOGY

It was determined that the methodology originally presented in [19], which uses a central-planning agent as a 'decision hub,' contains a critical path that can benefit from parallelization. The benefit of parallelization is that one can potentially increase the effective update rate of the system significantly. An increased update rate reduces reliance on predicted data: the system processes more updates and, hence, produces more near-future predictions of the current poses, forms, and actions of all objects in the workspace. A TVG object's action is a continuous variable, so increasing the effective system update rate also acts to prevent aliasing and increases the available time resolution. In addition, the proposed methodology is a generic framework that can be customized for a wide variety of TVG object action-sensing applications. The pipeline architecture [20] we examine has established theory and can be formally benchmarked. As such, the chosen architecture closely resembles a traditional CPU pipeline (Figure 1). The redesigned, generalized framework consists of a 10-stage pipeline, whose stages are explained in sequence below.

A. Update Structure and General Considerations

Conceptually, a single demand instant decision moves from left to right in the pipeline. In general, each successive pipeline stage will use an equal or more compact data representation. Minimal representation size reduces the overhead associated with data transfer between stages [20].

Figure 1 – Overview of generalized sensing system reconfiguration pipeline framework for TVG action recognition


Furthermore, for the proposed system, data moves between agents only at a predefined time (the clock), which is the update interval (demand instant spacing). Note that the possibility of using an asynchronous pipeline [21] was investigated, but due to the multiple synchronization points used, and the hard deadlines on sensor motion decisions, a synchronous pipeline is more stable and consistent.
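A minimal sketch of this synchronous hand-off follows, assuming Python worker threads with a shared barrier standing in for the pipeline clock; the original agent-based implementation is not specified at this level of detail.

```python
# Hedged sketch: each pipeline stage runs in its own thread; results move to the next
# stage only on the common clock edge, modeled here by a shared barrier. The stage
# functions and queue wiring are illustrative assumptions.
import threading
import queue

def stage_worker(stage_fn, inbox, outbox, clock: threading.Barrier):
    """Process one demand instant per clock period and hand off on the clock edge."""
    while True:
        item = inbox.get()           # data from the previous stage (or the imaging agents)
        result = stage_fn(item)      # this stage's processing for the current instant
        clock.wait()                 # all stages block here until the leading clock edge
        outbox.put(result)           # synchronous hand-off to the next stage

# Example wiring for three consecutive (hypothetical) stages sharing one pipeline clock:
# clock = threading.Barrier(parties=3)
# q01, q12, q23, q34 = (queue.Queue() for _ in range(4))
# for fn, qin, qout in [(sync_stage, q01, q12), (track_stage, q12, q23), (solve_stage, q23, q34)]:
#     threading.Thread(target=stage_worker, args=(fn, qin, qout, clock), daemon=True).start()
```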

B. Pipeline Structure

The framework has been carefully designed to be adaptable to a variety of TVG object action-sensing tasks. More importantly, it has been specifically designed with real-time operation in mind, as per the above constraints. The pipeline stages, from left to right in Figure 1, operate as follows:

Stage L1 - The Imaging Agents asynchronously capture and compress time-stamped input images. Data must be aligned on the 'leading edge' of the pipeline clock for consistency. This agent is also responsible for data pre-processing, which typically involves two steps: Image Correction and Region-of-Interest (RoI) Identification. For Image Correction, a series of image filters is applied to enhance the image and remove noise. These will vary depending on the application, but for general purposes a simple Gaussian low-pass filter (LPF) is sufficient. Depending on the segmentation, tracking, and camera-calibration algorithms used, brightness normalization and camera-distortion removal can also be performed. Region-of-Interest Identification is a data-reduction method, applied to reduce the data throughput required between Stages L1 and L2. Raw images are the least efficient data representation used, and a physical implementation is likely to connect these agents by the slowest communication media present in the overall system, compounding the problem. To avoid excessive overhead, the image is pre-segmented into areas of interest using a series of customizable 'Interest Filters,' similar to those in [22]. In this framework, we define the interest of pixel $(u, v)$ as $I(u, v)$:

$$I(u,v) = \begin{cases} 1, & \max\!\left(I_1(u,v), \ldots, I_K(u,v)\right) \ge I_{min} \\ 0, & \text{else} \end{cases} \quad (4)$$

Here, $0 \le I_k(u,v) \le 1$ is the response of the kth input filter (of K total filters), and $I_{min}$ is a user-defined minimum level of interest. The 'OR'-logic effect of the maximum-value operator selects all regions that could be of interest under any of the chosen filters. Typical filters include:

1. Active Tracking Area – An elliptical area of interest is centered at the predicted 2-D location of each currently tracked feature point.

2. Gradient-Based Edge Detection – Edges are detected through application of a high-pass filter (HPF) to the gradient of the input image. The resulting edgels are thresholded and bled using Gaussian LPFs until a general RoI around any significant edges is formed.

3. Predicted RoI – An estimate of overall 2D RoI motion is produced, and this estimate is combined with the previous instant's interest map to produce an elliptical area of interest at the expected location for the current image.

Many other filters are available, depending on the exact input images that are expected. Empirical selection of one or more filters pertinent to the application at hand can significantly reduce the total image area that must be passed to the Synchronization Agent, greatly reducing pipeline overhead.
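A compact sketch of Eq. (4) is given below, assuming each interest filter is available as a function returning a per-pixel response in [0, 1]; the resulting binary mask would gate which image regions are forwarded to the Synchronization Agent.

```python
# Hedged sketch of Eq. (4): per-pixel 'OR' of filter responses via a maximum operator.
# The filter callables and the threshold value are illustrative assumptions.
import numpy as np

def interest_map(image, filters, i_min=0.2):
    """filters: callables mapping the image to a per-pixel response in [0, 1]."""
    responses = np.stack([f(image) for f in filters], axis=0)   # K x H x W responses
    return (responses.max(axis=0) >= i_min).astype(np.uint8)    # 1 where any filter fires
```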

Stage L2 - The Synchronization Agent aligns the images by time-stamp and selects the 'world time' that they represent. Image coherency can be enhanced by delaying the alignment time slightly from the leading edge. This ensures that one or more images before, during, and after the synchronization point are available, and prevents sensors from completely missing a synchronization time. Image-wide interpolation can be used to construct a synthetic image that 'occurs' exactly at the desired time boundary:

$$P_s = P_{t^-} + \left(P_{t^+} - P_{t^-}\right)\left[\frac{t_s - t_1}{t_2 - t_1}\right] \quad (5)$$

Here, $P_s$ (for each of the red, green, and blue channels) is the final pixel value for any pixel at the synchronized time $t_s$, where $t_1 \le t_s \le t_2$. The other pixel values, $P_{t^-}$ and $P_{t^+}$, come from the images just before ($t_1$) and just after ($t_2$) the synchronization point, respectively. This method assumes that the capture rate of the cameras is sufficiently high, such that the average pixel motion between frames is relatively small. Otherwise, 'ghosting' of moving edges will occur.
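The per-pixel interpolation of Eq. (5) can be sketched as below, assuming the bracketing frames are already aligned numpy arrays with known timestamps; this is an illustration, not the authors' code.

```python
# Hedged sketch of Eq. (5): linear blend of the two frames bracketing the sync time t_s.
import numpy as np

def synthesize_frame(frame_before, t_before, frame_after, t_after, t_sync):
    """Blend two time-stamped frames to approximate an image captured exactly at t_sync."""
    alpha = (t_sync - t_before) / (t_after - t_before)          # 0 at t_before, 1 at t_after
    blended = frame_before + (frame_after.astype(float) - frame_before) * alpha
    return np.clip(blended, 0, 255).astype(frame_before.dtype)  # back to the input dtype
```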

Stage L3 - At this point, the pipeline splits into a two-stage, 4-parallel pipeline (essentially, a superscalar architecture [20]). Stage L3 contains the Point Tracking Agents, which detect and track the 2D feature points that comprise the articulated model of the subject. If a point is positively tracked, a simple method, such as optical flow, may be used to determine its 2D pixel coordinates. If a tracking loss is detected (on a per-feature basis), a blind local search is instead performed (a modified PCA-based algorithm is used in the experimental implementation, but this choice is not critical). If the blind search is unsuccessful for a pre-determined number of cycles, a general search is performed instead (also performed when a point has not yet been detected). This search is run as a background task, and is shared between the Point Tracking Agents using cycle-stealing – it operates asynchronously, outside of the pipeline. The use of a superscalar architecture leverages the separable nature of the 2D tracking step. On its own, the feature-point tracking operation is essentially atomic: it cannot be further subdivided into pipeline functional units, but it is usually one of the most resource- and processing-intensive steps, especially if relatively naïve algorithms are used. However, individual points can usually be tracked independently of one another. As such, their tracking can be executed in parallel. The number of parallel stages can be adjusted to fit the application, although four stages offered the best tradeoff between fork/join overhead and parallelism in our particular experiments.
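As a sketch of the per-feature parallelism this stage relies on, the snippet below tracks each feature point in its own worker using pyramidal Lucas-Kanade optical flow from OpenCV; the specific tracker, the thread pool, and the four-worker default are stand-in assumptions rather than the paper's prescribed implementation.

```python
# Hedged sketch: independent per-feature 2D tracking executed in parallel workers.
from concurrent.futures import ThreadPoolExecutor
import cv2
import numpy as np

def track_points(prev_gray, curr_gray, prev_pts, n_workers=4):
    """Track each feature point independently; lost points are flagged for a local search."""
    def track_one(pt):
        p0 = np.asarray(pt, dtype=np.float32).reshape(1, 1, 2)
        p1, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, p0, None)
        return p1.reshape(2), bool(status[0, 0])       # (new 2D location, tracked?)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:   # four parallel tracking units
        return list(pool.map(track_one, prev_pts))
```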

Stage L4 - The De-projection Agents are responsible for reversing the calibrated camera/stage motion transformations for each given feature (i.e., conversion to a 3D world coordinate constraint). This stage may be omitted in designs using a simplified calibration method with a closed-form inverse solution. This stage also provides a buffer time for potential forwarding and feedback paths that may be necessary in special pipeline cases, and participates in cycle stealing when slow tracking algorithms are used.

Stage L5 - By fusing 2D feature point locations, uncertainty estimates, and the articulated model as constraints, the 3D Solver Agent recovers true world coordinates for all possible features [19].

Stage L6 - The Form Recovery Agent first acts as in [19], using known information from the previous agent, the articulated model, and the current action to iteratively reconstruct the best estimate of the current subject form, given the input images. It is responsible for removing rotation, translation, and scaling effects. To begin, an initial estimate is extracted from known feature points using the algorithm in [19]. An uncertainty vector is then formed, with elements $u_i$:

$$u_i = C_1\left|f_i - m_i\right| + C_2\,n_{ua} + C_3\,n_{ms} \quad (6)$$

Here, $C_1$, $C_2$, and $C_3$ are proportionality constants, $n_{ua}$ is the number of unassigned points, and $n_{ms}$ is the number of missing points. The value of $m_i$ is the ith element of the predicted action feature vector at this instant, projected into the 3D scene. The value of $f_i$ is the ith element of the recovered feature vector. The agent iteratively examines interpolated feature vectors from the action library. It attempts to fit previously unassigned points to their closest match in the model feature vector, or to positively reject them as false positives. Similarly, model constraints are used to produce estimated positions for all missing points. An intelligent hypothesis selector chooses the order in which to examine feature vectors, and performs time normalization and interpolation on the action database. This process continues until an internal deadline slightly before the system clock. At that time, the solution with the lowest per-element average uncertainty, $\bar{u}$, is selected as the final form fit.
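A sketch of the per-element uncertainty of Eq. (6) follows; the proportionality constants $C_1$-$C_3$ are application-dependent, and the values shown are placeholders.

```python
# Hedged sketch of Eq. (6): uncertainty per element plus its per-element average,
# which is used to rank competing form hypotheses. C1-C3 values are placeholders.
import numpy as np

def form_uncertainty(recovered, predicted, n_unassigned, n_missing, c1=1.0, c2=0.5, c3=0.5):
    """recovered, predicted: feature vectors (numpy arrays of equal length)."""
    u = c1 * np.abs(recovered - predicted) + c2 * n_unassigned + c3 * n_missing
    return u, float(u.mean())        # (uncertainty vector, per-element average)
```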

Stage L7 - The Prediction Agent specification is unchanged from past work [19] – it predicts all motions necessary (2D and 3D). A typical implementation would use the Kalman filter, or one of its variants.
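As one common realization of the Kalman-filter variants the text mentions, the sketch below implements a constant-velocity filter for a single scalar coordinate of a tracked feature; the noise parameters are illustrative assumptions.

```python
# Hedged sketch: constant-velocity Kalman filter for one coordinate of a tracked feature.
import numpy as np

class ConstantVelocityKF:
    def __init__(self, dt, q=1e-2, r=1e-1):
        self.F = np.array([[1.0, dt], [0.0, 1.0]])     # state transition (position, velocity)
        self.H = np.array([[1.0, 0.0]])                # only position is observed
        self.Q, self.R = q * np.eye(2), np.array([[r]])  # process / measurement noise
        self.x, self.P = np.zeros((2, 1)), np.eye(2)

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return float(self.x[0, 0])                     # predicted position at the next instant

    def update(self, z):
        y = np.array([[z]]) - self.H @ self.x          # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)       # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(2) - K @ self.H) @ self.P
```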

Stage L8 - The Central Planner Agent (CPA) performs constrained optimization to determine near-optimal sensor placement, given all previous information in the pipeline. Since this is a constrained optimization, the Flexible Tolerance Method search engine [23] was chosen as a general-purpose reference for real-time operation – other methods can be applied, but were not formally investigated.
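The sketch below illustrates the CPA's constrained search in a hedged form: the Flexible Tolerance Method used in the paper is replaced by SciPy's SLSQP solver purely for illustration, with a hypothetical visibility objective and a minimum-visibility constraint.

```python
# Hedged sketch: constrained pose optimization for one sensor (SLSQP as a stand-in for FTM).
import numpy as np
from scipy.optimize import minimize

def plan_sensor_pose(visibility_fn, pose0, pose_bounds, v_min):
    """Maximize predicted visibility subject to V >= v_min and feasible-pose bounds."""
    result = minimize(
        lambda p: -visibility_fn(p),                   # maximize V by minimizing -V
        np.asarray(pose0, dtype=float),
        method="SLSQP",
        bounds=pose_bounds,                            # per-axis feasible-pose limits
        constraints=[{"type": "ineq", "fun": lambda p: visibility_fn(p) - v_min}],
    )
    return result.x, -result.fun                       # (chosen pose, achieved visibility)
```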

Stage L9 - The Action Recognition Agent is responsible for detecting the current target action, and also for maintaining the current time-normalization factor. If the current feature vector recovered by the Form Recovery Agent matches an appropriate interpolation of any adjacent feature vectors in the action library, a key-frame match is recorded. Subsequent matches for sequential feature vectors of the same action constitute a positive action match. The time-normalization constant is estimated continuously as new feature-vector estimates arrive. Feature vectors can use multiple-action coding to sense simultaneous actions.
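A simplified sketch of this key-frame matching rule follows: a recovered feature vector matches a key frame if it lies within $e_{max}$ of an interpolation of two adjacent library key frames, and a fixed number of sequential matches is treated as a positive action match. The threshold, interpolation step count, and required match count are illustrative assumptions.

```python
# Hedged sketch: key-frame matching and sequential-match action recognition.
import numpy as np

def match_keyframe(recovered, kf_a, kf_b, e_max, steps=10):
    """True if 'recovered' lies within e_max of some interpolation between kf_a and kf_b."""
    for alpha in np.linspace(0.0, 1.0, steps):
        interp = (1.0 - alpha) * kf_a + alpha * kf_b       # time-normalized interpolation
        if np.linalg.norm(recovered - interp) <= e_max:
            return True
    return False

def recognize(history, action_keyframes, e_max, needed=3):
    """Positive action match after 'needed' sequential key-frame matches."""
    matches = 0
    for recovered, (kf_a, kf_b) in zip(history, zip(action_keyframes, action_keyframes[1:])):
        matches = matches + 1 if match_keyframe(recovered, kf_a, kf_b, e_max) else 0
        if matches >= needed:
            return True
    return False
```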

Stage L10 - The Referee Agent enforces global rules on all final sensor-pose decisions. It also maintains fallback poses and enforces pipeline coherency.

IV. SAMPLE IMPLEMENTATION AND RESULTS

To demonstrate the viability of the proposed framework, a specific implementation has been developed and tested using carefully controlled experiments. The sensing task at hand is the recognition of an accurately scaled walking human analogue. The 1:6 scale human analogue has 14 degrees of freedom. A motion-capture library was used to generate realistic walking motions from multiple human subjects. The human follows a semi-randomized path.

A physical system with four cameras was used. All cameras were mounted on 1-dof rotational stages, and two were also mounted on 1-dof (linear) X stages. The layout and capabilities of the cameras are selected to be representative of a system to sense an average room (around 100 m²), scaled to the size of our human analogue. Two cylindrical obstacles were utilized, which follow semi-randomized linear paths chosen to ensure subject occlusion. In all experiments, the OoI maintained a constant speed of 150 mm/s on a straight-line path through the center of the workspace (a scaled, fast-paced human walking speed). The system had a limited maximum camera speed of 250 mm/s, with a maximum acceleration of 9000 mm/s².

The pipeline design was implemented and tested in real-time on a 5-processor cluster using 100 trials of 100 frames each. For each trial, the system update rate was gradually increased to determine a maximum 'successful' update rate. Ideal form recovery with simulated noise was used to automate the trials in the absence of a fully automated human analogue. A two-level system of local PCA and optical flow was used for feature tracking.

Overall, average maximum update rate was improved from 0.47 Hz to 1.56 Hz, and worst-case maximum update rate improved from 0.15 Hz to 1.14 Hz. Best-case maximum update rate fell slightly from 3.31 Hz to 3.15 Hz, owing to increased overhead from the pipeline. The results showed


that with an update rate equal to the average of 1.56 Hz, in 86% of trials the poses selected at each instant were essentially identical (within a ±5% margin of error) to those of the previous central-planning architecture [19].

Figure 2 – Critical values of system update rate

Additional testing, shown in Figure 2, demonstrates that lowering the update rate towards the worst-case maximum value increases the number of trials with poses identical to those selected by the non-pipelined system. This is due to the time-limiting effect of higher update rates – some stages of the pipeline, typically the L3 or L8 stages, are pre-empted, resulting in differing decisions. This effect will be investigated further as future work. Note that the system was still successful in recognizing the walking action for 97% of all trials, even with an update rate of 1.56 Hz.

V. CONCLUSIONS

In this paper, a novel sensing-system reconfiguration framework, designed specifically for systems sensing TVG object actions in real time, was presented. The method uses a software-pipeline approach to mitigate the problems typically encountered in real-time operation, and to provide an improved average update rate over previous systems. Controlled experiments demonstrated that the system operates correctly under real-time conditions when compared to an existing quasi-static system. It was also shown that the method provides a tangible improvement in the real-time performance metrics used to evaluate previous systems.

ACKNOWLEDGMENT

We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC).

REFERENCES

[1] K. A. Tarabanis, P. K. Allen, and R. Y. Tsai, "A Survey of Sensor Planning in Computer Vision," IEEE Transactions on Robotics and Automation, vol. 11, no. 1, pp. 86-104, Feb. 1995.

[2] T. K. Capin, I. S. Pandzic, N. M. Thalmann, D. Thalmann, “A Dead-Reckoning Algorithm for Virtual Human Figures,” In Proc. of VRAIS 1997, pp. 161-169, 1997.

[3] A. Bakhtari, M. Mackay, and B. Benhabib, "Active-Vision for the Autonomous Surveillance of Dynamic, Multi-Object Environments," J. of Intelligent and Robotic Systems [Available On-line], Jul. 2008.

[4] T. Urano, T. Matsui, T. Nakata, and H. Mizoguchi, “Human Pose Recognition by Memory-Based Hierarchical Feature Matching,” IEEE Intl. Conf. on Systems, Man, and Cybernetics, pp. 6412-6416, The Hague, Netherlands, 2004.

[5] R. Pito, "A Solution to the Next Best View Problem for Automated Surface Acquisition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 10, pp. 1016-1030, Oct. 1999.

[6] W. R. Scott, G. Roth, and J. Rivest, "View Planning for Automated Three-Dimensional Object Reconstruction and Inspection," ACM Computing Surveys (CSUR), vol. 35, no. 1, pp. 64-96, Mar. 2003.

[7] G. V. Veres, L. Gordon, J. N. Carter, and M. S. Nixon, "What Image Information is Important in Silhouette-Based Gait Recognition?," in IEEE Conference on Computer Vision and Pattern Recognition, Washington, D.C., 2004, pp. 776-782.

[8] X. Wang, S. Wang, and D. Bi, “Distributed Visual-Target-Surveillance System in Wireless Sensor Networks,” IEEE Transactions on Systems, Man, and Cybernetics – Part B: Cybernetics, vol. 39, no. 5, Oct. 2009.

[9] A. Roy, and S. Sural, “A Fuzzy Interfacing System for Gait Recognition,” In Proc. of 28th North American Fuzzy Information Processing Society Annual Conference, pp. 1-6, June 2009.

[10] C. Lee and A. Elgammal, "Dynamic Shape Outlier Detection for Human Locomotion," Computer Vision and Image Understanding, vol. 113, no. 3, pp. 332-344, March 2009.

[11] M. Saini, M. Kankanhalli, and R. Jain, "A Flexible Surveillance System Architecture," in Proc. of 2009 IEEE Conference on Advanced Video and Signal Based Surveillance, pp. 571-576, Sept. 2009.

[12] L. Bixio, L. Ciardelli, M. Ottonello, and C. S. Regazzoni, "Distributed Cognitive Sensor Network Approach for Surveillance Applications," IEEE International Conference on Advanced Video and Signal Based Surveillance, pp. 232-237, Sept. 2009.

[13] S. Yu, D. Tan, and T. Tan, "A Framework for Evaluating the Effect of View Angle, Clothing, and Carrying Condition on Gait Recognition," in International Conference on Pattern Recognition, Hong Kong, 2006, pp. 441-444.

[14] E. Gonzalez, A. Adan, V. Feliu, and L. Sanchez, "A Solution to the Next Best View Problem Based on D-Spheres for 3D Object Recognition," in Conference on Computer Graphics and Imaging, Innsbruck, Austria, 2008.

[15] R. Murrieta-Cid, B. Tovar, and S. Hutchinson, "A Sampling-Based Motion Planning Approach to Maintain Visibility of Unpredictable Targets," J. of Autonomous Robots, vol. 19, no. 3, pp. 285-300, 2005.

[16] M. A. Otaduy and M. C. Lin, "User-Centric Viewpoint Computation for Haptic Exploration and Manipulation," in Conference on Visualization, San Diego, CA., 2001, pp. 311-318.

[17] M. Mackay and B. Benhabib, "Active-Vision System Reconfiguration for Form Recognition in the Presence of Dynamic Obstacles," in Lec. Notes on Computer Science, Conference on Articulated Motion and Deformable Objects, Andratx, Mallorca, Spain, 2008, pp. 188-207.

[18] Y. K. Yu, K. H. Wong, et al., "Controlling Virtual Cameras Based on a Robust Model-Free Pose Acquisition Technique," IEEE Transactions on Multimedia, vol. 11, no. 1, pp. 184-190, Jan. 2009.

[19] M. Mackay, R. G. Fenton, and B. Benhabib, "Time-Varying-Geometry Object Surveillance Using a Multi-Camera Active-Vision System," International Journal on Smart Sensing and Intelligent Systems, vol. 1, no. 3, pp. 679-704, Sep. 2008.

[20] S. A. Williams, Programming Models for Parallel Systems. John Wiley & Sons Ltd., 1990.

[21] J. Ebergen, and R. Berks, “Response-Time Properties of Linear Asynchronous Pipelines,” Proceedings of the IEEE, vol. 87, no. 2, pp.308-318, Feb. 1999.

[22] H. de Ruiter, and B. Benhabib, “Object-of-Interest Selection for Model-Based 3D Pose Tracking with Background Clutter,” In Novel Algorithms and Techniques in Telecommunications, Automation and Industrial Electronics, pp. 93-98, 2008.

[23] D. M. Himmelblau, Applied Nonlinear Programming. McGraw-Hill, 1972.

[Figure 2 plots the percentage of trials with identical pose decisions (%) against the system/pipeline update rate (Hz).]