
1.1 Theme structure

This five-year research project addresses fundamental issues in the prevention of human-made disasters, namely the environment- and context-dependent, real-time detection and identification of potentially threatening activity and emotional behaviour of human actors, acting individually or in groups, in crowded environments.

MUSES_SECRET architectural-level specifications are based on field operational requirements from the public security application sector. The proposed solution is innovative in both its content and its context. To our knowledge, the multimodal fusion of all the proposed video, audio, biometric and other sensed data has not been used before in designing a 3D-based decision support system for threat assessment. The interest of our industrial, government and institutional users (see letters) testifies to the recognition of the proposed innovation and its great commercialization potential.

The proposed research program is organized around six interrelated theme areas, all of which are vital for the design and development of ideas and solutions that help ensure the safety and security of the Canadian public and infrastructure, and that give our industry partners a competitive edge in today's global economy.

The highly integrated thematic structure enables the delineation of well-defined projects while ensuring the constant feedback, cross-fertilization and validation with the user sector necessary to the realization of the MUSES_SECRET vision.

Theme 1. Distributed Multimodal Surveillance Sensor Network [U of Toronto, U of Ottawa, Waterloo U]

MUSES_SECRET will use a distributed intelligent surveillance system that combines visual and audio surveillance, based on wireless sensor nodes equipped with video or infrared (IR) cameras, audio detectors, or other object detection and motion sensors, with location-aware wireless sensor network solutions. The integration of visual, sound and radio tracking methods results in a highly intelligent, proactive, and adaptive surveillance and security sensor network.

Task-directed sensor data collection and observation planning algorithms will be developed to allow a more elastic and efficient use of the inherently limited sensing and processing capabilities. Each task a sensor has to carry out determines the nature and the level of the information that is actually needed. We will study "selective environment perception" methods that focus on the object parameters that matter for the decision to be made for the task at hand and avoid wasting effort on irrelevant data.

Task 1.1. Sensor network design [Boukerche]

Coverage of the sensor network, quality of the sensor data, optimal utilization of network resources [bouk06], quality of communication service, and network lifetime are among the most important requirements to be taken into account in the design of a multimodal sensor network [bouk05], [tia05].

The following open research problems will be addressed: a) the use and positioning of multiple video or IR cameras and of sound and vibration detectors for accurately tracking moving objects and producing real-time, adaptive representations of the environment; b) radio-based localization of mobile sensor nodes and other active devices within the network; c) event detection, classification and recognition; and d) multi-sensor fusion of video, audio, and wireless information for enhanced tracking and prediction.

A new optimization framework will be developed to jointly optimize the source rate, the encoding power and the routing scheme. The framework is based on convex optimization and the associated dual decomposition, which decouples the problem into multiple sub-problems solved at individual sensors [he07], [bouk06]. Because of the generic nature of the framework, we expect it to optimize the performance of the proposed distributed network of sensors connected via wired and wireless links for multimodal environment monitoring and human crowd surveillance.
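As an illustration of the dual-decomposition principle on a deliberately simplified instance (a single shared capacity constraint with log utilities; all values below are assumptions for the sketch, not the project's actual formulation):

```python
import numpy as np

# Toy network utility maximization solved by dual decomposition:
# maximize sum_i log(r_i) subject to sum_i r_i <= C. The dual variable
# lam acts as a "link price"; each sensor solves its own sub-problem
# given lam, and a subgradient step updates the price.

C, n = 10.0, 4          # shared capacity and number of sensors (assumed)
lam, step = 1.0, 0.05   # initial price and subgradient step size

for _ in range(500):
    # per-sensor sub-problem: max_r log(r) - lam*r  =>  r = 1/lam
    rates = np.full(n, 1.0 / lam)
    # price update from the coupling constraint's subgradient
    lam = max(1e-6, lam + step * (rates.sum() - C))

print(rates, rates.sum())  # rates approach C/n each; total approaches C
```

Each sensor solves its own sub-problem from the shared price alone, which is what makes the scheme distributable across nodes.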

Task 1.2. Event detection and classification [Hatzinakos]

Energy-efficient acquisition protocols suggest that the surveillance nodes should be turned


off most of the time when the events under surveillance are not present. In particular, the operation of each video camera is decided by duty-cycled sensing. In the proposed research, we aim to develop a lightweight event detection mechanism that can provide a yes/no decision on whether the surveillance events are present, based on a short-term image snapshot. The algorithm to be developed needs to exploit the tradeoffs between processing complexity, the amount of sensor data, and detection performance.
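Purely as an illustration of the intended yes/no decision (the task itself leaves the detector design open), a background-differencing check of this kind trades almost no computation for coarse detection performance:

```python
import numpy as np

def event_present(snapshot, background, threshold=12.0):
    """Hypothetical lightweight detector: flag an event when the mean
    absolute difference between a short-term image snapshot and a stored
    background frame exceeds a threshold (grayscale arrays assumed)."""
    diff = np.abs(snapshot.astype(np.float32) - background.astype(np.float32))
    return float(diff.mean()) > threshold
```

A node would run such a test on a duty-cycled snapshot and power up the full video pipeline only on a positive decision.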

We will develop a wireless transmission architecture based on the reference model of Embedded Wireless Interconnect (EWI) [son07]. Depending on the number of dedicated wireless channels, the surveillance nodes with the highest priorities (e.g., the best views of the surveillance events) will locate an idle wireless channel and transmit the acquired sensor data or visual streams to the central processing unit. In addition, the decentralized, power-constrained processing envisioned under the low-energy self-organizing surveillance protocol will be contrasted against a centralized approach in which location-related information is aggregated and processed at a central node. The authentication of video, audio and other sensor data is essential to ensuring the authenticity of its source and its integrity. We will investigate the use of industry standards such as Secure JPEG 2000, as well as encryption, watermarking, and authentication techniques for secure multimedia transmission over wireless links [lee05].

Task 1.3. Multi-sensor data fusion [Kamel, Basir]

On the merits of accuracy, reliability, ease of implementation and maintenance, and cost, video and IR cameras and wireless radio-based devices are selected as the main sensing modalities. These two modalities are also selected because they suffer unique limitations while enjoying complementary advantages. For example, visual surveillance [hu04], [pla05] can produce highly accurate positioning results but suffers from several disadvantages. In particular, multiple cameras are needed to resolve tracking ambiguities that result from occlusion and limited fields of view. An important limitation of visual surveillance is the complexity of the higher-level processing needed to extract semantics such as user identity and actions. In contrast, radio-based surveillance offers a natural way of analyzing such high-level semantics because user identities are readily established from network MAC addresses.

The first step in radio-based localization is the construction and maintenance of a radio map; the proposed research shall introduce a novel multi-sensor system for managing and updating this map by fusing radio, visual, and audio data. The second step is the formulation of a real-time location estimate by comparing the wireless channel values observed from a mobile terminal to those stored in the radio map. The proposed research shall investigate kernel-based methods for modeling the mapping between signal strength and physical locations. We will extend the Kalman filtering framework, employing a zero-memory estimator as a pre-processor to the tracking filter.
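A minimal sketch of this two-stage idea, assuming a constant-velocity state model and treating the zero-memory estimator (e.g., a nearest radio-map fingerprint match) as the source of noisy position measurements; all matrices and noise levels are illustrative assumptions:

```python
import numpy as np

# Constant-velocity Kalman filter sketch (one spatial dimension): the
# zero-memory stage supplies noisy position measurements z, which the
# tracking filter smooths over time.

dt = 1.0
F = np.array([[1.0, dt], [0.0, 1.0]])   # state transition (pos, vel)
H = np.array([[1.0, 0.0]])              # only position is measured
Q = 0.01 * np.eye(2)                    # process noise covariance (assumed)
R = np.array([[4.0]])                   # fingerprint measurement noise (assumed)

x = np.zeros((2, 1))                    # initial state estimate
P = np.eye(2)                           # initial state covariance
for z in [2.1, 3.0, 4.2, 4.9, 6.1]:     # outputs of the zero-memory stage
    x, P = F @ x, F @ P @ F.T + Q                    # predict
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)     # Kalman gain
    x = x + K @ (np.array([[z]]) - H @ x)            # measurement update
    P = (np.eye(2) - K @ H) @ P
print(x.ravel())   # smoothed position and velocity
```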

We will extend the measurement data fusion approach using the cooperative neural networks (CNL) developed by Xia and Kamel [kam1]. The CNL algorithm adaptively combines three modular neural networks and is suitable for parallel implementation in software and hardware. This approach will be further investigated to deal with sensor fusion in the presence of different types of noise.

Theme 2: Real-time Tracking and Recognition of Human Body Movements, Hand Gestures, Facial Expressions, and Voice Emotions [Ryerson U, U of Ottawa, U of Toronto]

Real-time computer vision and signal processing algorithms will be developed for the identification and evaluation of environmental parameters and biometric features (such as facial expressions, human gait, hand gestures, human voice inflexions, background sound, ambient light, etc.) that provide the contextual information for the specific surveillance activity.

Task 2.1. Tracking individuals moving in a crowded environment [Dubois, Laganière]

Robust tracking of people in crowds represents an important challenge. The numerous sources of occlusion and the wide diversity of interactions that might occur make it difficult to track a particular individual over an extended period of time using a network of sensors.

Our research will build upon preliminary work done by Laganière and his collaborators at the University


of Ottawa. We will develop a robust tracking algorithm that combines monocular short-term tracking strategies with long-term multi-camera approaches. Short-term tracking can be achieved using histogram-based methods that use a target's local appearance to locate it from frame to frame [woj02]. In addition to providing information about an individual's behaviour in a crowd, short-term tracking will allow the extraction of biometric features that can then be used in monitoring a large area and detecting potential threats. However, to avoid overloading the system with multiple face occurrences of the same individual, an online face-quality assessment measure will be applied [fou07]. The different captured face images of an individual can also be combined to generate a 3D face, thus addressing the problem of high-quality image generation from multiple lower-resolution views.
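As a sketch of the histogram-based short-term stage, here using OpenCV's back-projection and CamShift as one concrete realization (the clip name and initial target box are assumptions):

```python
import cv2

# Histogram-based short-term tracking in the spirit of [woj02]: model the
# target by a hue histogram, then relocate it frame-to-frame via
# back-projection and CamShift.

cap = cv2.VideoCapture("crowd.avi")                # hypothetical input clip
ok, frame = cap.read()
x, y, w, h = 200, 150, 40, 80                      # assumed initial box
hsv_roi = cv2.cvtColor(frame[y:y+h, x:x+w], cv2.COLOR_BGR2HSV)
hist = cv2.calcHist([hsv_roi], [0], None, [32], [0, 180])  # hue histogram
cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
term = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)

window = (x, y, w, h)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    back = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
    _, window = cv2.CamShift(back, window, term)   # relocate the target
```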

The results of each short-term tracker will then be combined to constitute a global interpretation of the extended scene monitored by the network of sensors. Predictions and hypotheses will be generated concerning the motion of each tracked individual. These predictions can then be verified using appearance models and the biometric features extracted during the short-term tracking periods. These validations must, however, take into account the geometric relations between the different sensors spatially distributed over the scene. Off-line calibration procedures exist that can locate each sensor with respect to the others [sun06]; in a real deployment context, however, self-calibration methods will be preferred. An online calibration approach also has the advantage of coping with the unavoidable changes in sensor characteristics that occur during long-term deployment of surveillance systems.

Task 2.2. Human Gait [Plataniotis]

Human gait is a spatio-temporal characteristic of human motion that allows humans to recognize others by their walking style. Gait is a non-intrusive biometric characteristic that can be detected and measured even in low-resolution video and IR images. Gait is harder to disguise than static biometric features, such as facial characteristics, and it does not require a cooperating subject.

Our research will build upon preliminary work done by Venetsanopoulos, Hatzinakos, Plataniotis and their collaborators at the University of Toronto [bou05], [bou06], [lu06], who developed a new and improved coarse-to-fine silhouette extraction algorithm for the National Institute of Standards and Technology (NIST) Gait Challenge data sets [bou06] that robustly extracts gait silhouette sequences from low-resolution surveillance video in various challenging outdoor application scenarios. Based on the previously developed solution of [lu06], a four-layer articulated human body model that allows for limb deformation will be researched, analyzed and evaluated. The layered deformable model (LDM) will be used to automatically extract silhouettes from low-resolution visual surveillance data and in the development of model-based tracking solutions for human behaviour recognition.

We intend to research and develop the multilinear principal component analysis (MPCA) framework [lu07], which extracts gait features directly from gait silhouette sequences in their original 3D tensor representation. A multilinear discriminant analysis algorithm that extracts distinguishing features directly from the tensor-style gait representation will also be researched, developed and analyzed. Finally, the gait recognition solution will be augmented by combining the proposed multilinear subspace learning framework with popular boosting solutions, such as the AdaBoost algorithm.
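A minimal sketch of the multilinear idea, assuming a simple one-pass variant that projects each tensor mode onto the leading eigenvectors of its mode-unfolding covariance (the MPCA of [lu07] iterates such projections to convergence; this single pass is a simplification):

```python
import numpy as np

# One-pass multilinear projection sketch for third-order gait tensors
# (rows x cols x frames): keep the leading eigenvectors of each
# mode-unfolding covariance and project every mode onto them.

def mode_unfold(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def fit_projections(samples, ranks):
    # samples: list of equally-shaped tensors; ranks: kept dims per mode
    mean = sum(samples) / len(samples)
    U = []
    for mode, r in enumerate(ranks):
        S = sum(mode_unfold(X - mean, mode) @ mode_unfold(X - mean, mode).T
                for X in samples)
        _, vecs = np.linalg.eigh(S)        # eigenvectors, ascending order
        U.append(vecs[:, -r:])             # keep the r leading ones
    return mean, U

def project(X, mean, U):
    Y = X - mean
    for mode, Um in enumerate(U):          # multiply each mode by Um^T
        Y = np.moveaxis(np.tensordot(Um.T, np.moveaxis(Y, mode, 0), axes=1),
                        0, mode)
    return Y                               # reduced tensor feature
```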

Task 2.3. Body activity recognition [Guan]

Our proposed approach to the recognition of human body activities is based on the preliminary work done by Green and Guan at Ryerson University on human movement modeling and recognition using the novel alphabet-of-dynemes paradigm [gre04]. They showed that any human skill can be decomposed into a series of dynemes, the smallest contrastive dynamic units of human movement, belonging to a finite set of 35 dynemes. Surveillance activities would require only a subset of dynemes, which will substantially simplify the semantics used for the identification of these activities.

A second important issue is accurate initialization for tracking human body movement. We will adopt the target-tracking initialization procedure proposed by Enzweiler et al. [enz05], based on the "coherent motion energy" concept and characterized by the ratio of the difference between two opponent energy measures to their sum. The position and initial movement of the face and the global estimate of the human body will serve to accurately initialize the tracking of individual human body parts using


the DE-MC particle filter [du06]. Based on the theory of belief propagation and factor graphs, we will develop a probabilistic inference engine that can effectively represent and process the simultaneous presence of multiple dynemes/skills. These skills jointly contribute to the presence of an intention/action that could potentially be a threat to the public or the environment.

Task 2.4. Face expression recognition [Venetsanopoulos]

For the proposed biometric identification in crowd monitoring and surveillance, the objective is to identify who or what could be a potential threat to the public or the environment, rather than who is who. The role of face recognition accordingly becomes recognizing gender, age group, and other facial characteristics that can contribute to the identification of suspicious persons. Another issue needing urgent attention is the considerable variation in illumination and pose. Recent work by Venetsanopoulos and his colleagues on face recognition using so-called generic learning, which learns the intrinsic properties of the subjects to be recognized from a generic training database consisting of images of subjects other than those under consideration, has shown its effectiveness in face recognition. We will combine the findings of this work with emotion recognition techniques to identify human intention.

Our research will investigate all three major issues involved in the recognition of human facial expressions: (i) detecting human faces in crowded scenes; (ii) tracking human faces in crowded scenes; and (iii) recognizing the facial characteristics and states that may play prominent roles in a security-sensitive context, and recognizing them reliably. For face detection, we will adopt the Viola-Jones approach, which applies the AdaBoost learning algorithm to the "integral image" representation and can reliably detect multi-view human faces in crowded scenes in real time [min05]. We have developed a differential evolution Markov chain (DE-MC) particle filter, which provides an effective solution to the adaptive tracking of human head and body movement [du06].
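For reference, a minimal detection sketch in this spirit, using OpenCV's stock Haar cascade (a trained AdaBoost classifier over integral-image features); the input image name is hypothetical:

```python
import cv2

# Viola-Jones style face detection with the Haar cascade shipped
# with OpenCV; one bounding box is drawn per detected face.

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
frame = cv2.imread("crowd.jpg")            # hypothetical input image
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
```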

Task 2.5. Hand Gestures [Georganas, Basir]

Hand gestures are a very expressive and powerful human-to-human communication modality and a rich source of biometric features. The human hand is a complex articulated object consisting of many connected parts and joints. With the latest advances in computer vision, image processing and pattern recognition, real-time vision-based hand gesture classification is becoming increasingly feasible. We will study vision-based hand tracking and gesture classification, focusing on tracking the bare hand and recognizing hand gestures without the help of markers or gloves. Our research will address both aspects of hand gesture recognition: static hand postures and dynamic hand gestures.

We will adopt a two-level approach to real-time vision-based hand gesture recognition: (i) Haar-like features and the AdaBoost learning algorithm will be used for hand posture recovery; (ii) linguistic pattern recognition will be used for hand gesture syntactic analysis [che07]. Given a sequence of hand postures, the composite gesture can be recognized with a set of grammatical primitives and production rules.
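A toy sketch of the second, syntactic level, assuming the first level already emits posture labels; the posture and gesture names and the production rules below are invented for illustration:

```python
# Toy syntactic level: posture labels from the Haar/AdaBoost stage are
# the terminals, and production rules map posture sequences to composite
# gestures.

GRAMMAR = {
    "grab":   ["palm", "fist"],
    "wave":   ["palm", "palm", "palm"],
    "select": ["point", "fist"],
}

def parse_gesture(posture_seq):
    """Return the first gesture whose rule matches the tail of the
    observed posture sequence, or None if nothing matches."""
    for gesture, rule in GRAMMAR.items():
        if posture_seq[-len(rule):] == rule:
            return gesture
    return None

print(parse_gesture(["point", "palm", "fist"]))   # -> "grab"
```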

Task 2.6. Voice emotion [Karray]

Our research will build upon preliminary work by Kamel and his colleagues at the University of Waterloo [sha05], [aya07].

We will develop real-time algorithms for the recognition of voice emotions based on the segment-based approach and vector autoregressive methods for speech emotion classification [sha05], [aya07], and on the new multi-classifier aggregation architectures [wan06] that allow active cooperation by sharing training patterns and algorithms. This will help in developing next-generation natural speech understanding systems, allowing for the design of natural man-machine interfaces. The model built here will, in a certain way, mimic the human capability of recognizing particular emotions, which will help in loading adequate acoustic models to enhance speech understanding capabilities.
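As a simplified sketch of the segment-based pipeline (plain Gaussian mixtures stand in here for the Gaussian mixture vector autoregressive models of [aya07]; file names and emotion labels are hypothetical):

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

# Segment-based speech emotion classification sketch: split each
# utterance into short segments, extract MFCC features per segment, and
# score them against one Gaussian mixture per emotion class.

def segment_features(wav_path, seg_seconds=0.5):
    y, sr = librosa.load(wav_path, sr=16000)
    hop = int(seg_seconds * sr)
    segments = [y[i:i + hop] for i in range(0, len(y) - hop + 1, hop)]
    # 13 MFCCs averaged within each segment -> one vector per segment
    return np.array([librosa.feature.mfcc(y=s, sr=sr, n_mfcc=13).mean(axis=1)
                     for s in segments])

def train(emotion_files):
    # emotion_files: dict emotion label -> list of training wav paths
    return {emo: GaussianMixture(n_components=8).fit(
                np.vstack([segment_features(f) for f in files]))
            for emo, files in emotion_files.items()}

def classify(models, wav_path):
    feats = segment_features(wav_path)
    scores = {emo: gmm.score(feats) for emo, gmm in models.items()}
    return max(scores, key=scores.get)
```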

Theme 3: Context-Aware Assessment of Threatening Human Actions and Emotional Behaviours [U of Ottawa, Ryerson U]

An Intelligent Decision Support System will be developed for the real-time assessment of threatening situations involving human subjects by analyzing invariant features of the emotional and activity

behaviours identified as being potentially of security interest in the monitored environment.

Task 3.1. Bimodal video and audio human intention identification [Guan]

We will investigate bimodal human emotion recognition by jointly studying auditory and facial cues extracted from basic human emotional states [wan05].

We will apply the new knowledge discovery method recently proposed in [kya06] to naturally cluster human emotional states based on the Facial Action Coding System (FACS) and to discover their relationship to intentions.

[Figure 1. Hierarchical Fused HMM architecture: audio and visual cues feed the analysis of human emotions, while hand gesture and body movement cues feed the analysis of human actions; the two analyses are fused to infer possible human intention.]

We will study dynamic facial features, with Action Units of the FACS as the basic emotional descriptors, in bimodal emotion recognition systems using kernel canonical correlation analysis [zhe06]. We will investigate the effective fusion of auditory and facial features using a fused Hidden Markov Model (F-HMM). The F-HMM, a derivative of the coupled hidden Markov models, uses a probabilistic fusion method to determine the optimum connections between HMMs, applying the maximum entropy principle and the maximum mutual information criterion to select dimension-reduction transforms. The F-HMM model utilizes cross-observation links to capture the statistical dependencies between multiple HMMs, allowing each HMM to be trained separately and enabling the use of well-established methods such as Baum-Welch for training and Expectation-Maximization for decoding.
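A simplified stand-in for this fusion scheme, assuming one independently trained HMM per modality and per emotion class with log-likelihoods fused at decision time (the actual F-HMM additionally couples the chains through cross-observation links; hmmlearn is used for the per-modality chains):

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

# One HMM per modality and per emotion class, trained separately (as the
# F-HMM also permits), with per-class log-likelihoods fused by a
# weighted sum at decision time.

def train_models(class_sequences):
    # class_sequences: dict emotion -> list of (T_i, d) feature arrays
    models = {}
    for emotion, seqs in class_sequences.items():
        X = np.vstack(seqs)
        lengths = [len(s) for s in seqs]
        m = GaussianHMM(n_components=3, covariance_type="diag")
        models[emotion] = m.fit(X, lengths)
    return models

def classify(audio_models, face_models, audio_seq, face_seq, w=0.5):
    # fuse per-modality log-likelihoods; w weights the audio stream
    scores = {e: w * audio_models[e].score(audio_seq)
                 + (1.0 - w) * face_models[e].score(face_seq)
              for e in audio_models}
    return max(scores, key=scores.get)
```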

We will develop a hierarchical F-HMM with a two-level hierarchy: (i) because of the apparently close tie between the auditory and facial emotional cues, an F-HMM will be established to fuse these cues for emotion recognition; (ii) similarly, a second F-HMM will be used to fuse hand gesture [min05] and body movement features for activity recognition; and (iii) at the global level, an F-HMM framework consisting of two HMMs will be used to fuse the cues extracted from the multimodal information shown in Figure 1.

Task 3.2. Context-dependent situation assessment [Petriu]

Building upon the preliminary work on the cyber-psychology of human activity and behaviour by Whalen at CRC and by Petriu and his collaborators at the University of Ottawa [wha03], [yan05], we will use linguistic pattern recognition techniques and semantic model representations to develop a semantic-level situation assessment system that will allow understanding of the dynamics of a complex scene based on multimodal surveillance data streams.

Body posture and gait, hand gestures, and vocal and facial emotions are powerful, strongly context-dependent, non-verbal human-to-human communication modalities. While understanding them may come naturally to humans, describing them in an unambiguous algorithmic way is not an easy task. We will use Fuzzy Neural Networks and Fuzzy Cognitive Maps [moh03] to develop an expert system that captures the collective wisdom of human experts, psychologists and security surveillance specialists [van06] on the best procedures to follow when assessing the level of threat based on the semantic information extracted from the multimodal surveillance data streams.

Theme 4: Synthetic Environment to Support Human Decision Makers [U of Ottawa, NRC]

The partial and heterogeneous sensor views of the environment are fused into a coherent Virtualized Reality™ Environment (VRE) model of the explored environment. Being based on information about real, physical-world objects and phenomena, as captured by a variety of sensors, VREs have more "real content" than pure Virtual Reality environments, which are entirely based on computer simulations.


The VRE model of the explored environment allows human operators to combine their intrinsic reactive behaviour with the higher-order world model representations of immersive VRE systems.

A synthetic environment will be developed to provide efficient, multi-granularity, function-specific feedback and human-computer interaction interfaces for the human users, who are the final assessors and decision makers in the specific security monitoring situation. We will study VREs that allow users to see the synthetic world models as a whole or to concentrate their attention on specific details or portions of the world.

Through in-depth on-site studies, we will carefully design the most representative test scenarios, which will give us first-hand knowledge about capturing human behaviours and emotions from the best viewing angles so as to better facilitate the analysis and recognition of human intentions under different conditions. The virtual testing environments will also allow quick and cost-effective simulation of multiple fixed and mobile sensors carrying out collaborative vigilance tasks in a changing environment. The infrastructures of the DISCOVER Lab at the University of Ottawa and the C-iM2 Lab at Ryerson University form networked VREs capable of carrying out large, realistic assessment and decision-making tasks in specific security monitoring situations.

Many applications such as MUSES_SECRET deal with complex situations in which objects are described by a large collection of heterogeneous properties. Integrating such different kinds of dynamic data, coming from many different sources, and understanding their internal structure presents a considerable challenge. An additional difficulty is presenting the information in such a way as to make it understandable to a domain expert or decision maker who is not necessarily a mathematician or data mining specialist. This is where visual data mining techniques, in particular those based on virtual reality, represent a suitable and promising approach.

We will study and develop a VRE for the integration of large, heterogeneous, time-varying data to assist the human situation assessors and decision makers in security surveillance applications. Our research will build upon the preliminary work on distributed cooperative interactive virtual environments done by Georganas and his collaborators at the University of Ottawa [tia05], [ahm07], and on Valdés' work at NRC Ottawa on visual data mining of complex and large data [val03], [val05], [val06].

The idea is to construct a virtual reality space in which the complex properties of the data are represented as geometric entities, so that their interplay can be processed at the perceptual level by the user. In this approach, the built-in capabilities of human eyesight, combined with the fast and powerful pattern recognition capabilities of the human brain, complement the data processing performed by the computer. Such a man-machine system benefits from the features in which each entity excels. Because the essential process of understanding the data structure is performed at the perceptual level, large amounts of complex information can be processed without overwhelming the user.

In particular, similarity relations between the data objects can be exploited because they are simple, very intuitive and definable for any kind of data. Ultimately, the data integration process leads to a complex space of original data from which a virtual reality 3D space is constructed through a high-performance computational procedure and presented as a multimedia user interface suitable for data analysis and decision making.
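One plausible sketch of such a construction, using classical multidimensional scaling to map objects with pairwise dissimilarities into a 3D space for rendering (the synthetic data and the Euclidean dissimilarity are assumptions; the cited work [val03], [val05] uses related nonlinear mappings):

```python
import numpy as np
from sklearn.manifold import MDS

# Embed objects described by heterogeneous features into a 3D virtual
# reality space so that pairwise dissimilarities become geometric
# distances the user can perceive directly.

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 40))        # 200 objects, 40 mixed properties

embedder = MDS(n_components=3, dissimilarity="euclidean", random_state=0)
coords_3d = embedder.fit_transform(data) # one 3D point per object
print(coords_3d.shape)                   # (200, 3) -- ready to render
```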

Real-time processing of the huge volume of heterogeneous data involved in the surveillance process cannot be achieved without a divide-and-conquer strategy, such as "zoning" the virtual environment, applying "area of interest" techniques, and enhancing them with filtering methods that draw attention to specific details or portions of the virtual reality space. We will build upon our research experience with "massively multiuser virtual environments" [ahm07] to address this issue.

Theme 5: Design and Development of a Service-Oriented Architecture for Intelligent Surveillance [U of Ottawa / El Saddik, Shirmohammadi]

Our main goal encompasses the seamless integration of new and improved surveillance techniques and methodologies while the surveillance network continues its ordinary operation. We intend to support


both functional and non-functional requirements of surveillance networks. Functional requirements include signal processing and data fusion functions, archiving and tracking of human behaviours, assessment and interpretation of the data, and support for human decision makers, among others. Non-functional requirements include interoperability, scalability, availability, and manageability.

The rate at which current surveillance systems can disseminate data to evaluate new threats is limited mainly by the way existing systems have been developed and implemented and by their limited ability to interoperate with other systems. Service-oriented architectures (SOA) are designed to support loosely coupled systems and to integrate them using a standard protocol for communication between the available services [sad06], [ala06]. In fact, service providers only need to understand the messages produced and consumed by the services they use. Adopting IBM's SOA platform (WebSphere) [ibm07] to support the distribution of information within our proposed project prepares it for deployment and for the integration of external systems developed by our diverse industrial and institutional partners. Surveillance data will be stored using IBM's DB2 Content Manager. By leveraging the latest standardized communication protocols, such as web services, as illustrated in Figure 2, we allow heterogeneous operating environments to co-exist within the infrastructure (to support our diverse industrial partners' solutions). For example, a government database would be able to expose its information via self-hosted web services. The integration performed is seamless.

[Figure 2. The SOA architecture (WebSphere) for MUSES_SECRET: image, audio and video streams enter an SOA-based distributed environment offering media feature extraction, media feature fusion, trust, domain ontology, and intelligent decision services, accessed via SOAP through user query and multimodal user interfaces; multimodal sensory data and knowledge are stored on distributed databases.]

There are several aspects to consider in this design. First, data stemming from diverse sources such as live cameras, microphones, or other sensory systems are streamed to the middleware (SOA layer) to be passed to the different subsystems implemented in this project or by our industrial partners, such as data fusion algorithms. Data are then filtered and stored on an IBM DB2 content server and/or sent to the visualization system. Although the design incorporates several machines to provide the various functionalities, it is possible to cluster the services and use one or two machines for these tasks.
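As a minimal, language-neutral illustration of the self-hosted web service idea (the endpoint, port, and payload are invented for the sketch; the project itself targets SOAP services hosted on the WebSphere stack):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical self-hosted service exposing sensor metadata as JSON over
# HTTP, standing in for the standardized service interfaces described
# above.

SENSORS = [{"id": "cam-017", "type": "video", "zone": "gate-B"}]

class SensorService(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/sensors":
            body = json.dumps(SENSORS).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), SensorService).serve_forever()
```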

The objectives of this theme are to:
• Design and develop an SOA architecture for tele-surveillance applications based on WebSphere. We will build a layer of software on top of existing legacy systems, using a black-box software development process, to support communication between the SOA-based system and existing non-SOA systems and algorithms. We will use a combination of the Rational Unified Process (including RUP for SOA) and IBM's Service-Oriented Modeling and Architecture (SOMA) methodology [ars04]. New requirements, such as those derived from tele-surveillance applications, might lead to further development and enhancement of SOMA.

• Study the suitability of SOA for 3D visualization of sensory data and develop SOA-based visualization of sensory data.


• Develop semantic representation and data fusion for distributed sensor-based applications using

IBM's Unstructured Information Management Architecture (UIMA). UIMA has proven itself in the area of unstructured textual data. One of our goals is to work together with IBM on analysing and evaluating the suitability of UIMA as a means to semantically structure sensory data. We intend to enhance UIMA with functionalities and libraries to support fast access to tele-surveillance sensory data.

• Develop tools to support the sharing of knowledge from multiple autonomous domains without violating privacy and related constraints.

• Support trust in context-aware distributed tele-surveillance environments by further investigating confidence between different media streams [atr07].

Theme 6: User Centered Case Studies and Field Trials [U of Ottawa, CRC]

This theme enhances the close connection with the user sector and our industrial partners by way of practical implementation, demonstrations, and field trials of the technologies, systems, and architectures to be developed in all the previous themes. Theme 6 activities will be carried out by all relevant researchers and teams from the five other themes. These activities will serve as validation and feedback mechanisms for all the research and industrial partners, as well as for the users. Field trials will be conducted specifically on the multi-modal security surveillance system at the University of Ottawa campus (619 cameras and other sensors). These trials will be conducted in collaboration with the surveillance specialists of the Risk Management Service of the University of Ottawa.

REFERENCES

[ars04] A. Arsanjani, "Service-oriented modeling and architecture: How to identify, specify, and realize services for your SOA," http://www-128.ibm.com/developerworks/webservices/library/ws-soa-design1/, 9 Nov. 2004, last viewed May 2007.

[ahm07] D.T. Ahmed, S. Shirmohammadi, J.C. Oliveira, "Supporting Large-Scale Networked Virtual Environments", Proc. IEEE Conf. Virtual Environments, Human-Computer Interfaces, and Measurement Systems, Ostuni, Italy, June 2007

[ala06] A. Alamri, M. Eid, A. El Saddik, “Classification of the State-of-the-Art Dynamic Web Services Composition Techniques”, Int. J. Web and Grid Services, Vol. 2, No. 2, pp.148 – 166, 2006.

[atr07] P. K. Atrey, M. S. Kankanhalli, A. El Saddik, “Confidence building among correlated streams in multimedia surveillance systems”, Proc. 13th Int. Conf. on Multimedia Modeling, MMM'2007, Singapore, Jan. 2007.

[aya07] M.M.H. El Ayadi, M.S. Kamel, F. Karray, "Speech emotion recognition using Gaussian mixture vector autoregressive models," Proc. IEEE ICASSP 2007.

[bou05] N.V. Boulgouris, D. Hatzinakos, K.N. Plataniotis, "Gait recognition: a challenging signal processing technology for biometric identification," IEEE Signal Processing Magazine, vol. 22, no. 6, pp. 78-90, November 2005.

[bou06] N.V. Boulgouris, K.N. Plataniotis, D. Hatzinakos, "Gait recognition using linear time normalization," Pattern Recognition, vol. 39, no. 5, pp. 969-979, May 2006.

[bouk05] A. Boukerche, X. Fei, R.B. Araujo, "An energy-aware coverage-preserving scheme for wireless sensor networks," Proc. ACM PE-WASUN 2005, pp. 205-213.

[bouk06] A. Boukerche, R.B. Araujo, F.H.S. Silva, "A context interpretation based wireless sensor network for the emergency preparedness class of applications," Proc. ALGOSENSORS 2006, pp. 25-34.

[che07] Q. Chen, N.D. Georganas, E.M. Petriu, "Real-time vision-based hand gesture recognition with Haar-like features and grammars," Proc. IMTC/2007, IEEE Instrum. Meas. Technol. Conf., Warsaw, Poland, May 2007.

[du06] M. Du and L. Guan, “Monocular human motion tracking with the DE-MC particle filter,” Proc. ICASSP, Toulouse, 2006.

[enz05] M. Enzweiler et al., "Unified target detection and tracking using motion coherence," Proc. IEEE MVC Workshop, 2005.

[fou07] A. Fourney, R. Laganiere, “Constructing Face Image Logs that are Both Complete and Concise”, Int. Workshop on video processing and recognition, Montreal, Canada, May 2007.

[gre04] R.D. Green, L. Guan, "Quantifying and recognizing human movement patterns from monocular video images, Part I: A new framework for modeling human motion," IEEE Trans. Circuits Syst. Video Technol., 13(2): 154-165, 2004; "Part II: Applications to biometrics," IEEE Trans. Circuits Syst. Video Technol., 13(2): 166-173, 2004.

[he07] Y. He, I. Lee, L. Guan, "Optimized multi-path routing using dual decomposition for wireless video streaming," Proc. IEEE ISCAS 2007, New Orleans, USA, 2007.

[hu04] W. Hu, T. Tan, L. Wang, S. Maybank, "A survey on visual surveillance of object motion and behaviours," IEEE Trans. SMC, Part C, vol. 34, no. 3, pp. 334-352, Aug. 2004.

[ibm07] IBM WebSphere, http://www-306.ibm.com/software/websphere/, last visited May 12, 2007.

[lee05] D.T. Lee, "JPEG 2000: retrospective and new developments," Proc. IEEE, vol. 93, no. 1, pp. 32-41, 2005.

[kya06] M. Kyan, K. Jarrah, P. Muneesawang, L. Guan, "Self-organizing trees and forests: Strategies for unsupervised multimedia processing," IEEE Computational Intelligence Magazine, 1(2): 27-40, 2006.

[lu07] H. Lu, K.N. Plataniotis, A.N. Venetsanopoulos, "MPCA: Multilinear principal component analysis of tensor objects," IEEE Trans. Neural Networks, vol. 18, no. 6, November 2007, to appear.

[lu06] H. Lu, K.N. Plataniotis, A.N. Venetsanopoulos, "A layered deformable model for gait analysis," Proc. IEEE Int. Conf. Automatic Face and Gesture Recognition (FG 2006), pp. 249-254, Southampton, UK, Apr. 2006.

[min05] B. Miner, O. Basir, M.S. Kamel, “Understanding hand gestures using approximate graph matching”, IEEE Trans SMC, Part A, 35(2): 239-248, 2005.

[moh03] S. Mohr, The use and Interpretation of Fuzzy Cognitive Maps, Rensselaer Polytechnic Institute, June 2003.

[pla05] K.N. Plataniotis, C. Regazzoni, (Editors) Special issue on “Video and Signal Processing for Surveillance Networks and Services”, IEEE Signal Processing Magazine, vol. 22, no. 2, Mar. 2005.

[sad06] A. El Saddik, "Performance measurement of web services-based applications," IEEE Trans. Instrum. Meas., vol. 55, no. 5, pp. 1599-1605, 2006.

[sha05] M.T. Shami, M.S. Kamel, "Segment-based approach to the recognition of emotions in speech," Proc. 5th Int. Conf. Multimedia and Expo, Amsterdam, pp. 366-369, July 2005.

[son07] L. Song, D. Hatzinakos, “A cross layer architecture of wireless sensor networks for target tracking,” to appear in IEEE/ACM Trans. on Networking, Apr. 2007.

[sun06] A. Sundaresan, R. Chellappa, "Multi-camera Tracking of Articulated Human Motion Using Motion and Shape Cues ", Asian Conference on Computer Vision, pp. 131-140, 2006.

[tia05] D. Tian, N.D. Georganas "Connectivity Maintenance and Coverage Preservation in Wireless Sensor Networks", AdHoc Networks Journal (Elsevier Science), pp. 744-761, 2005

[van06] G. Vanderveen, Interpreting Fear, Crime, Risk and Unsafety: Conceptualisation and Measurement, Willan Publishing, 2006.

[val03] J.J. Valdés, “Virtual Reality Representation of Information Systems and Decision Rules: An Exploratory Tool for Understanding Data and Knowledge”, Artificial Intelligence LNAI 2639, pp. 615-618. Springer-Verlag, 2003.

[val05] J.J. Valdés, A.J. Barton, “Virtual Reality Visual Data Mining with Nonlinear Discriminant Neural Networks: Application to Leukemia and Alzheimer Gene Expression Data”, Proc. IJCNN’05 Int. Joint Conf. Neural Networks, Montreal, Canada, July 2005.

[val06] J.J. Valdés, A.J. Barton, “Virtual Reality Spaces for Visual Data Mining with Multiobjective Evolutionary Optimization: Implicit and Explicit Function Representations Mixing Unsupervised and Supervised”, 2006 IEEE Congress of Evolutionary Computation - CEC 2006, Vancouver, BC, Canada. July 2006.

[wan05] Y. Wang, L. Guan, “Recognizing human emotion from audiovisual information,” Proc. IEEE ICASSP, 2: 1125-1128, Philadelphia, 2005.

[wan06] N. Wanas, R. Dara, M.S. Kamel, "Adaptive fusion and co-operative training for classifier ensembles," Pattern Recognition, vol. 39, no. 9, pp. 1781-1794, 2006.

[wha03] T.E. Whalen, D.C. Petriu, L. Yang, E.M. Petriu, M.D. Cordea, “Capturing Behaviour for the Use of Avatars in Virtual Environments,” CyberPsychology & Behavior, Vol. 6, No. 5, pp. 537-544, 2003.

[woj02] D. Wojtaszek, R. Laganière, “Using Color Histograms to Recognize People in Real Time Visual Surveillance”, Int. Conf. Multimedia, Internet and Video technologies, vol. 3, Greece, pp. 261-264, September 2002.

[yan05] X. Yang, D.C. Petriu, T.E. Whalen, E.M. Petriu, "Hierarchical Animation Control of Avatars in 3D Virtual Environments," IEEE Trans. Instrum. Meas., Vol. 54, No. 3, pp. 1333 – 1341, 2005.

[zhe06] W. Zheng, et al, “Facial Expression Recognition Using Kernel Canonical Correlation Analysis (KCCA)”, IEEE Trans. NN, vol. 17, no. 1, pp. 233-238, 2006.