
The Department of Defense (DoD), Fiscal Year 2004
Defense University Research Instrumentation Program (DURIP), AFOSR BAA 2003-5

Sensor Networks of Clustered Smart Microphone Arrays Using Robust Key Word Recognition and Video-Surveillance System for Monitoring and Security Applications

Abstract

Technologies for monitoring human activities and recognizing possible ill intentions are increasingly necessary to protect and defend our national interests. More recently, these technologies have become important in controlled access, security surveillance, and situational monitoring due to the national interest in combating possible terrorist attacks.

In this proposal, we describe a novel system for security and monitoring based on at least seven innovations:

1 Use of microphone arrays to locate and separate the source of each sound in its coverage area.

2 Use of key-word recognition technology to identify potentially ill intentions.
3 Integration of microphone arrays and the video-surveillance system to overlay/indicate the source of sound in the image of the corresponding monitor.
4 Use of the location information to perform face recognition and/or automatic suspect tagging (e.g., height, color of the clothes, etc.).
5 Multi-modal recognition, e.g., using the audio signal for speech recognition as well as images for lip-reading.
6 Automatic behavioral analysis of tagged individuals.
7 Integration of the overall system in a responsive network system.

Research areas required to develop the proposed system into an integrated platform include:

- A clustered microphone-array sensor network for robust extraction of speech signals received from adverse and noisy environments, as the basis for a robust speech recognition system capable of identifying sensitive words.
- Beam-forming with microphone array signals for sound separation and sound location.
- Key-word speech recognition technology implemented on embedded processors, allowing this technology to be deployed in the sensor front end.
- Image recognition features for developing algorithms that detect abnormal behavior (e.g., leaving a bag at the airport and walking away).
- Detection of hazardous circumstances (e.g., fire) and medical emergencies.

Preliminary work in several of these areas has already been done and will be leveraged by the research enabled by the proposed equipment.


1 Introduction

In this proposal, we describe a novel system for security and monitoring based on an innovative key-word recognition technology. This technology will be integrated with a clustered microphone array sensor network to form a robust speech recognition system capable of identifying sensitive words in speech signals received from adverse and noisy environments. In addition to security monitoring, several other application areas are also envisioned. Integration of several technologies is required to fully exploit the capabilities of such a system, namely:

- Use of microphone arrays to locate and separate the source of each sound in its coverage area.
- Use of key-word recognition technology to identify potentially ill intentions.
- Integration of microphone arrays and the video-surveillance system to overlay/indicate the source of sound in the image of the corresponding monitor.
- Use of the location information to perform face recognition and/or automatic suspect tagging (e.g., height, color of the clothes, etc.).
- Multi-modal recognition, e.g., using the audio signal for speech recognition as well as images for lip-reading.
- Automatic behavioral analysis of tagged individuals.
- Integration of the overall system in a responsive network system.

2 System Description

In the following sections, the Monitoring Center and its basic functions are described as they are envisioned. The whole system is linked together via a wireless network with a monitoring station, as described in Section 4.4 (Wireless Sensor Network Management). The core key-word technology is outlined as the centerpiece that makes realization of the proposed Monitoring Center possible. Furthermore, since the system must handle noise in public places, it necessitates the use of microphone arrays. However, a single array of microphones may be insufficient for accurately locating the source of the speech (i.e., the speaker). Thus, multiple and redundant sets of microphone arrays are desired, as described in the following section.

2.1 The Monitoring Center

A sketch of the system is depicted in Figure 1 below. Microphone Array Clusters are placed in a configuration ensuring proper coverage of the whole area. The cluster arrays have a particular range of coverage, depicted with the green lines (color is used here only for clarity) originating from each cluster. Each array captures sounds originating from its corresponding coverage area and provides the most likely direction and approximate location of the source of each sound. This measurement has limited precision, bounded by the error region depicted with blue lines forming a cone from the microphone array sensor toward the source of the sound. To locate the source more accurately, more than one microphone array sensor has to detect it; triangulation can be performed if at least one additional microphone array sensor can be focused on the same target. Multiple arrays of microphones, as well as a larger number of microphones in each array, increase the quality of the signal by suppressing/filtering out other, undesired sources of sound, which are treated as noise pollution. This separation would not be possible with a single microphone. Currently, microphone arrays are used to separate the single most dominant sound in the array's coverage area. The problem that we propose to solve is significantly more difficult: our solution should be capable of separating each source of sound within the coverage area of the microphone array(s). Significant work has already been done in related fields using beam-forming techniques, and this existing body of knowledge will provide a basis for methods that can be applied to the presented problem. Moreover, related research has already been done at our institution by Dr. Ham et al. [1]. The solution presented in [1] is almost directly applicable to our microphone array sensor cluster solution.

Figure 1. Key-Word Recognition and Video Tracking Based System Monitoring Center.

In order to capture simultaneous sources of speech, the embedded system should be able to handle a number of different speakers concurrently. Although the proposers have significant experience in applying solutions to embedded systems, not all details of the algorithms have been worked out or implemented; the proposed hardware is therefore estimated to provide sufficient resources to perform the beam-forming function as well as the speech and image recognition tasks. Furthermore, some coordination might be necessary to ensure that standalone systems are able to share useful information, for example, to further enhance the location estimate for a particular speaker. For this reason, it becomes necessary that specific data be shared. One solution is to collect and analyze the data at a centralized location, e.g., a server at the monitoring station.

Each sensor array cluster will perform its microphone beam-forming algorithm independently (see Section 3.2), in its embedded hardware, as well as key-word recognition. This hardware (a floating-point device) is housed on a client computer (cluster node). This computer will also serve as a host to the two video cameras attached to that sensor array cluster. The video cameras and the sensor array of a cluster will have the same coverage area. The server is connected via a wireless connection with each cluster node. The server's task is to coordinate resources and allocate tasks as needed. For example, when more than one speaker is to be tracked at the same time, it is the server's task to provide additional information to each sensor array cluster to aid in distributing resources for each individual task/speaker as required. In addition, information provided by each sensor array cluster may be triangulated by the server to provide more accurate location(s) of speaker(s). This information is then projected onto the corresponding monitor(s), and proper signaling is provided when a particular key-word is recognized. The security operator can then be alerted and take appropriate action.

2.2 The Key Word Recognition Core

The current state of the art in speech recognition technology has made significant improvements toward natural language understanding (spontaneous speech) because of advances in acoustic and language modeling. However, very little has been done to improve key-word recognition and/or its utilization. It is the contention of the proposers that the key to natural language understanding by a machine is proper utilization of key-word recognition. Key-word recognition should provide a natural gateway to more elaborate natural understanding tasks, similar to human-to-human interaction. Furthermore, key-word recognition opens numerous application possibilities, including the one being proposed. For example, one could envision a centralized system that controls all home appliances, including the telephone, TV, entertainment system, etc. Interaction with such a system, for example “call Mom at work”, “play an Eric Clapton CD”, or “turn off the TV”, can only be realized with key-word recognition. Even if one hypothetically assumes that a natural language speech understanding system could perform with a 100% recognition rate, resolving contexts in spontaneous speech (e.g., who speaks to whom) can only be done with key-word recognition. The concept of a “Smart Speech Recognition Agent” (SSRA) that has a “name” (e.g., the computer “HAL” in Arthur C. Clarke's “2001: A Space Odyssey”) provides a basis for key-word recognition (e.g., the dialog between HAL and David: “David: ‘HAL’”; “HAL: ‘Yes, David’”, …). Recognizing the “name” of the SSRA in a proper context (e.g., “[silence] ‘Operator’ [silence]” vs. “… the operation went very well and, the operator said that …”) is in essence the definition of the key-word recognition problem. Clearly, even from these simplistic examples, it can be concluded that a crucial component for wide acceptance of speech recognition technology as a mode of interaction with computers is key-word recognition. Furthermore, the same solution can be applied to any sensitive word that may provide context for a potentially dangerous activity that poses a security threat. In this case the “name” of the SSRA is replaced with a number of sensitive words or phrases. Government agencies may already have lists of words/languages of special interest that could be used to demonstrate the proposed system's performance. Note that the proposer's previous work on this topic provides a solution that is language-independent.

Current solutions to key-word recognition (from research institutions such as MIT, SRI, and CMU, and commercial speech recognition corporations such as SpeechWorks, Nuance, ScanSoft, Microsoft, and Philips) are derived from the framework of Large Vocabulary Speech Recognition systems. Thus, the solutions are largely complex in terms of code and image size as well as CPU and memory resource requirements. Furthermore, these solutions fail to perform satisfactorily, and thus there are currently no successful applications that are commercially available. It is the contention of the proposer that a custom solution tailored to this task (i.e., key-word recognition) is necessary and must be developed from the ground up. A solution already developed by the proposer for ThinkEngine Networks Inc., 100 Nickerson Rd., Marlborough, MA 01752, will be outlined in this proposal. The ThinkEngine Networks hardware/software system is under development and is expected to be available by the end of 2003 (www.thinkengine.net). Note that ThinkEngine's hardware platform is designed as a server solution for telephony applications only, and is not applicable to this proposal. This solution and the proposer's experience provide a basis for future research in developing new solutions for the proposed Monitoring System.

2.3 The Microphone Array Sensor Cluster

Array signal processing is a very mature technology field, with applications ranging from speech processing to radar, sonar, and other areas. In fact, many successful microphone array systems are used for non-speech signals. The main advantage of a microphone array system is that it makes full use of well-known array properties to enhance any distant, noisy target signal. In security and monitoring applications, we are interested in capturing speech signals at a distance from the speech source, without the speaker's awareness, and usually in a noisy public environment. With such constraints, it is natural to adopt the microphone array as the front-end speech signal acquisition device, since a single microphone sensor will not be able to acquire speech signals with adequate quality for robust recognition.

In this research, the proposed sensor network consists of a number of clusters. Each cluster consists of several microphones that form an array serving as the speech signal acquisition front end. The speech signal acquired at each microphone will be processed by an embedded system (running a real-time operating system) using array processing techniques to form the speech signal output(s). Extracted speech signal(s) will then be fed into the core key-word recognition algorithm residing on the embedded hardware for further processing. Once a sensitive key-word has been spotted, the cluster head sensor will initiate the data transfer and forward the recognition results to the remote base station for further decision and action. The key-word recognition system is remotely reprogrammable as key words and languages of interest change over time. A minimal sketch of this per-cluster processing flow is given below.
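To make the per-cluster flow concrete, here is a minimal Python sketch of the acquire, beamform, spot, and report loop described above. All names (delay_sum, recognize, report, the SENSITIVE list) are illustrative assumptions, not the actual implementation.

```python
import numpy as np

SENSITIVE = {"keyword"}  # illustrative, remotely reprogrammable word list

def delay_sum(frames: np.ndarray) -> np.ndarray:
    """Placeholder beamformer: average the already time-aligned channels."""
    return frames.mean(axis=0)

def process_cluster(mic_stream, recognize, report):
    """Per-cluster loop: beamform, spot key-words, report hits upstream.

    mic_stream : iterable yielding (n_mics, frame_len) arrays of samples
    recognize  : embedded key-word recognizer, returns a word or None
    report     : callback forwarding results to the monitoring center
    """
    for frames in mic_stream:
        enhanced = delay_sum(frames)   # array processing (Section 3.2)
        word = recognize(enhanced)     # key-word spotting (Section 4.1)
        if word in SENSITIVE:
            report(word)               # only recognition results cross the link
```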

2.4 Microphone Array Based Wireless Sensor Network

The proposed sensor network of sensor clusters requires robust wireless communications for transferring the speech and video signals to the monitoring center station. Speech data is pre-processed and encoded at the cluster level; therefore the system generates recognition results at the cluster level. However, communication between a cluster and the monitoring station needs to be bidirectional: operators will be able to disseminate control information from the monitoring center to the clusters. For example, cameras can be oriented and focused on a given direction and area. Furthermore, one research area made possible with this equipment will focus on automating this procedure without human intervention.

The first design issue we will address is the selection of a data routing protocol. We propose to adopt a time-division multiple access (TDMA) media access control (MAC) protocol [2] for managing the proposed wireless network. Such a MAC protocol works best when a clustering architecture is adopted, since clustering facilitates frequency reuse and therefore increases system capacity. The research described in this proposal focuses on a fixed cluster infrastructure deployed in known environments; however, it can be extended to an ad-hoc cluster architecture that can be dynamically deployed in an unknown spatial environment [3].

The second design issue we will address is system latency. The proposed fixed cluster architecture allows for local processing of the speech data and recognition of key words within the cluster. After the local recognition process, only sensitive key-word recognition results will be transmitted to the remote monitoring center. The proposed system and the cameras should adaptively track objects, areas, or persons. To achieve this goal, the communication protocol should support video signal transfer as well. However, video signals can be encoded after being processed (e.g., face recognition) and transmitted to the monitoring station at significantly lower bandwidth. Note that each cluster node will be capable of storing the original high-fidelity signals for data collection, research and development of various algorithms, and record keeping.

The third design issue in wireless network management is the accuracy of speech key-word recognition, which is to say the perceived application quality. We believe that the proposed microphone arrays facilitate the application of effective array processing techniques for speech enhancement. The speech signals enhanced by array processing at the cluster level will result in higher recognition performance. Furthermore, it is possible to jointly design the array-processing speech enhancement and the speech recognition to take full advantage of the array architecture in the proposed wireless network. Further design issues include wireless network security, optimal cluster deployment, and system coverage specification.

3 Existing Approaches

3.1 Existing Speech Recognition Schemes

Providing the computer with the ability to communicate naturally with humans has been a research goal for almost 40 years. The ability to understand human speech is necessary to achieving this goal. Although speech recognition, a crucial component of such a system, has been gaining ground in research, its commercial applications are limited, and thus it has not yet achieved the status of a widely accepted technology. The research focus has mainly been concentrated on large vocabulary speech recognition problems, as well as speech understanding. Most of the advancement of such systems has come from improved statistical modeling of speech and from a significant increase in the amount of speech data used for training models, not from new inventions, techniques, or technologies. Researchers in the field have optimistic predictions and claim that, even without radical advancement in the field, if the current rate of progress is maintained, the objective of having a computer recognize human speech at the same level as humans should be within reach in less than a decade.

Because of the lack of a universal solution, speech recognition technologies can be classified into several areas depending on their use, speaking environment, recognition task, or any combination of these factors. Speech recognition tasks range from isolated word recognition and continuous word recognition to dictation and spontaneous speech. Each of these tasks may be approached in a speaker-dependent or speaker-independent mode. A speaker-dependent system is trained to recognize only a particular speaker and performs poorly for other speakers; a speaker-independent system can, in theory, recognize any speaker. For tasks such as isolated-word or limited command-and-control word recognition, speaker-independent solutions are feasible and have been developed commercially (ScanSoft, Nuance, etc.). On the other hand, tasks such as dictation work reasonably well in real time in speaker-dependent mode.

Another classification of speech recognition technology is based on the speaking environment. All the applications mentioned above can be applied in a controlled environment that typically makes use of speech signals at a high bandwidth: a 16 kHz sample rate. An example is a desktop system with a nearby microphone (typically held close to the mouth, like a head-mounted earphone-and-microphone set). Another example is a telephone line, which uses a lower-bandwidth 8 kHz sample rate. However, the real bandwidth of telephone speech can effectively be lower depending on the device used: landline, cellular, cordless, or speaker phone. Cellular devices in particular use special coding techniques to lower the transmission bit-rate, which in turn effectively reduces the bandwidth of the signal. This is also true for cordless phones that need to communicate with a base unit. Lower-bandwidth signals for large vocabulary speech recognition require solutions that cannot run in real time.

Most recognition systems use Hidden Markov Models (HMM), developed in the late 70's by James Baker, the founder of Dragon (a dictation technology provider). HMMs are what is known as a frame-based solution, where each frame is used for classification [4], [5]. There are a few exceptions to frame-based recognizers, notably MIT's segmental approach to recognition [6], where multiple frames that are acoustically similar are grouped into one segment and classified as such. More specialized and embedded solutions utilize Dynamic Time Warping, a dynamic programming technique for matching two patterns. This technique has been successfully used for key-word recognition by the proposer, although in its standard form it would provide a poor solution for such a complex task. The inventions enumerated in the list of patents presented in Section 4.1 make possible:

- A significantly simpler algorithmic solution for the VAD (Voice Activity Detector) that also achieves better performance. This translates directly into a smaller footprint for the executable code and simplifies the matching stage of the system.
- Higher recognition performance combined with very few false recognitions, which in turn allows simpler scoring and faster recognition by using only one model per key-word vs. multiple models (e.g., gender-specific models, dialect-specific models, etc.).
- Simple scoring, a direct consequence of the inventive classification scheme.

The speech recognition community has not previously viewed key-word recognition as a necessary solution for the general speech recognition problem. For research institutions dealing with well-defined problems (e.g., Broadcast News transcription), this approach is understandably irrelevant, since those problems do not involve real-time human-machine solutions. Some leading companies developing speech recognition technology (SpeechWorks, Nuance, ScanSoft, etc.) offer key-word recognition; however, their solutions offer performance too low to be of any value for this project. Furthermore, some of these companies have publicly announced that they will not develop a specialized solution to this problem. Therefore, to the proposers' knowledge, there are at this moment no commercial applications using key-word recognition, with the exception of Wildfire, Waltham, MA (whose technology is tuned for only one key word, namely “Wildfire”, and whose performance is still unsatisfactory) and ThinkEngine Networks, Inc., Marlborough, MA, whose platform is not applicable to this proposal.


3.2 Microphone Arrays for Speech Enhancement

There are numerous array processing techniques that have been applied to enhance speech signals collected through microphone arrays. The most widely used method is known as beamforming. Beamforming refers to a method that steers the sensors in an array toward a target signal algorithmically rather than physically. The simplest form is delay-sum beamforming, in which signals from the various microphones are first time-aligned to adjust for the delays caused by the different path lengths between the target source and each microphone. The average, or weighted sum, of these time-aligned speech signals is an enhanced version of the speech signal because, in general, the noises corrupting each microphone are uncorrelated and are attenuated by averaging. Ideally, for every doubling of the number of microphones in the array, there will be a 3 dB increase in the signal-to-noise ratio (SNR) of the enhanced output speech signal (a minimal sketch of this principle is given at the end of this subsection).

Many successful array processing techniques are based on this simple delay-sum beamforming principle. One such extension is filter-sum array processing, in which, instead of simple alignment, speech signals from individual microphones are first filtered and then combined to produce enhanced speech signals. Another natural extension of simple delay-sum array processing is to allow the parameters of the array processing to be adjusted according to some optimization criterion or to the particular speech acquisition environment.

Reverberation is often the major cause of poor speech recognition performance when a microphone array is deployed within an enclosed environment. The echoes of speech signals negatively affect the combination of speech signals from individual microphones. Traditional beamforming is unable to compensate for such effects and is therefore inapplicable to speech enhancement within a reverberant environment. Recent approaches to this class of speech enhancement challenges have centered on estimating the impulse response of a reverberant environment and using the inverse impulse response to compensate for the reverberation effects. Dr. Ham, Harris Professor of Engineering at FIT, has applied a method that is directly applicable to this issue [1]. In addition, Intersil (www.intersil.com), a local corporate member of FIT's Wireless Center of Excellence and an active member of the ECE Department's advisory board, has done substantial work on dealing with internally reflected signals for WLAN applications. They are able to ignore reflected signals arriving a short time after the principal signal, albeit for RF rather than audio; however, it is expected that their solution is analogous or portable to the problem at hand.

Other array processing techniques include blind source separation and auditory model-based array processing. Blind source separation can be very useful when we need to identify individual speakers within the neighborhood of the microphone array sensor cluster. The auditory model-based approach usually involves much more complicated processing, which may not be realistic for sensor network applications in which computational resources are greatly constrained. All the above techniques provide solutions for the single most dominant speech sound; the solution presented in [1] is appropriate for the task at hand, separating each source of speech signal independently.
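The following is a minimal delay-and-sum sketch, assuming the per-microphone delays for the target source are already known (in practice they come from the array geometry or from cross-correlation); the signal and noise values in the example are illustrative.

```python
import numpy as np

def delay_sum_beamform(signals: np.ndarray, delays: list[int]) -> np.ndarray:
    """Delay-and-sum beamforming.

    signals : (n_mics, n_samples) raw microphone signals
    delays  : non-negative per-microphone shifts (in samples) that
              time-align the target source across all channels
    Returns the enhanced single-channel signal.
    """
    n_mics, n = signals.shape
    aligned = np.zeros_like(signals, dtype=float)
    for m, d in enumerate(delays):
        aligned[m, d:] = signals[m, :n - d]   # shift channel m by d samples
    # Averaging attenuates uncorrelated noise: ideally ~3 dB per doubling.
    return aligned.mean(axis=0)

# Example: two noisy copies of a 440 Hz tone, the second lagging by 5 samples.
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(1600) / 16000)
x0 = clean + 0.3 * rng.standard_normal(1600)
x1 = np.roll(clean, 5) + 0.3 * rng.standard_normal(1600)
enhanced = delay_sum_beamform(np.stack([x0, x1]), delays=[5, 0])
```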

3.3 Microphone Arrays for Speaker Location

The problem of locating a speaker from suitably placed clusters of microphone arrays is analogous to the triangulation problem in wireless networks of mobile phones. There have been some attempts to solve this problem, see for example [7]; this work, however, did not draw on the experience of the wireless community. Thus, there already exists a large body of knowledge that will be utilized. In addition, many software tools have been developed to assist designers of such networks in placing receiver stations in a given area so as to pinpoint, within specified requirements, the location of a mobile phone. Locating a source (speaker) requires several microphone arrays to be strategically placed in a room. The placement of the arrays can be optimized using the existing configuration of the room, furniture, possible obstructions, etc. It is envisioned that a custom tool will be developed to deal with this issue by assisting in generating an optimized placement that meets specified criteria. Existing tools used in wireless networks to design the placement of receiver stations for a given topology may serve as a starting point.

Each array provides information on the direction and approximate location of the sound(s) at a particular instant of time [1]. For increased accuracy, the signals from several arrays can be correlated and the time differential of the sound measured. This information, along with the direction information obtained by each array, is used to generate the most likely location of the source, within certain error bounds (a minimal sketch of this step is given below). This information can be overlaid in real time on the images from cameras covering the area being surveyed. Based on the speech that triggers the action and the information provided on the monitors, security personnel will be able to decide on the most appropriate action.
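A minimal sketch of the correlate-and-locate step, assuming free-field propagation: the time differential between two array outputs is estimated from the cross-correlation peak, and the source position is then chosen from a candidate grid by a least-squares fit of the predicted time differences. The function names and the grid-search approach are illustrative, not the proposed algorithm.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s at room temperature

def time_differential(sig_a: np.ndarray, sig_b: np.ndarray, fs: float) -> float:
    """Estimate the arrival-time difference between two array outputs
    from the peak of their cross-correlation."""
    corr = np.correlate(sig_a, sig_b, mode="full")
    lag = int(np.argmax(corr)) - (len(sig_b) - 1)
    return lag / fs

def locate_source(array_pos: np.ndarray, tdoas: np.ndarray,
                  grid: np.ndarray) -> np.ndarray:
    """Pick the candidate position whose predicted time differences
    (relative to array 0) best match the measured ones.

    array_pos : (k, 2) cluster coordinates in meters
    tdoas     : (k-1,) measured delays of arrays 1..k-1 vs. array 0, seconds
    grid      : (n, 2) candidate source positions
    """
    dist = np.linalg.norm(grid[:, None, :] - array_pos[None, :, :], axis=2)
    predicted = (dist[:, 1:] - dist[:, :1]) / SPEED_OF_SOUND
    residual = ((predicted - tdoas) ** 2).sum(axis=1)
    return grid[int(np.argmin(residual))]
```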

4 Proposed Research

4.1 Innovative Key Word Recognition Algorithm

The proposed key-word recognition system is based on Këpuska’s recent work in this area. Note that the following is proprietary information. This technology is founded on three patented key inventions:

1. Voice Activity Detection (VAD) based on cepstral features,
2. Dynamic Time Warping (DTW) matching using reverse-ordered feature vectors,
3. Re-scoring using distribution distortion measurements of Dynamic Time Warping matching.

This solution and its implementation have been extensively tested and shown to achieve high recognition rates while maintaining a practically 0% false alarm rate. It must be noted that this performance has been achieved with what is currently considered a minuscule amount of training data. Furthermore, all tests were conducted with only one (speaker-independent) model per key-word. This technology was developed to support real-time implementation on embedded DSPs as well as on standard microprocessor platforms. In addition, the above-mentioned inventions make possible a high-density implementation of the technology. The VAD solution is algorithmically simple, and performance-wise it surpasses other, more complex solutions. A more accurate VAD, in turn, simplifies the recognition and matching components of the system. Furthermore, because of the innovative solution to the recognition algorithm itself, the correct-recognition performance is very high and false rejection is very low, occurring only in pathological cases. False acceptance is practically zero.

The current implementation of the key-word recognition front-end uses the standard MEL-cepstrum algorithm, compliant with ETSI Aurora Version 1. The standard MEL-cepstrum is further processed to achieve higher robustness of key-word recognition and VAD. More specifically, the cepstral coefficients are filtered first by a high-pass filter and then by a low-pass filter to maximize the accuracy of VAD detection. VAD/end-pointing information from the VAD detector is passed to the key-word recognizer as part of back-end processing. The only purpose of the VAD is to provide accurate markers of the beginning and end of a speech event (e.g., an utterance/word). In addition to conventional features (i.e., the energy of the speech signal), the proposed solution uses features derived from the cepstral coefficients generated by the front-end. In the front-end, the standard MEL-cepstrum coefficients are generated for each segment in a stream of segments of the incoming signal. The front-end derives thirteen cepstral coefficients, c0 and c1-c12, as well as the energy level of the signal using an energy detector. The thirteen coefficients and the energy signal are provided to a VAD processor.

Cepstral coefficients capture signal features that are useful for representing speech. Most speech recognition systems classify short-term speech segments into acoustic classes by applying a maximum likelihood approach to the cepstrum (the set of cepstral coefficients) of each segment/frame. In theory, such a classifier could be used for the simple function of discriminating speech from non-speech segments, but that function would require a substantial amount of processing time and memory. To reduce the processing and memory requirements, a simpler classification system may be used to discriminate between speech and non-speech segments of a signal. The simpler system uses a function that combines only a subset of cepstral coefficients that optimally represent general properties of speech as opposed to non-speech.
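The patented design is proprietary, but a generic sketch of the idea just described, temporally filtering the cepstral tracks and combining an assumed subset of them with frame energy into a speech/non-speech decision, might look as follows. The filter constants, coefficient subset, and threshold are illustrative assumptions, not the actual parameters.

```python
import numpy as np
from scipy.signal import lfilter

def vad_decisions(cepstra: np.ndarray, energy: np.ndarray,
                  threshold: float = 1.0) -> np.ndarray:
    """Toy cepstral VAD.

    cepstra : (n_frames, 13) MEL-cepstral coefficients c0..c12 per frame
    energy  : (n_frames,) frame energy from the front-end
    Returns a boolean speech/non-speech flag per frame.
    """
    # High-pass, then low-pass, each cepstral track over time (Section 4.1).
    highpassed = lfilter([1.0, -0.95], [1.0], cepstra, axis=0)
    smoothed = lfilter([0.5], [1.0, -0.5], highpassed, axis=0)
    weights = np.zeros(13)
    weights[1:4] = 1.0                      # assumed informative subset
    score = smoothed @ weights + np.log(energy + 1e-10)
    return score > threshold                # illustrative fixed threshold
```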

The developed key-word speech recognition system involves comparing two instances of a word, for example an uttered word against a model of the word. The durational variations of uttered words and parts of words can be accommodated by a non-linear time warping designed to align the speech features of two speech instances that correspond to the same acoustic events before the two instances are compared. Dynamic time warping (DTW) is a dynamic programming technique suitable for matching time-dependent patterns [8]. The result of applying DTW is a measure of similarity between a test pattern (for example, an uttered word) and a reference pattern (e.g., a template or model of a word). Each test pattern and each reference pattern may be represented as a sequence of vectors. The two speech patterns are aligned in time, and DTW measures a global distance between the two sequences of vectors.

DTW is based on a time-time matrix alignment process. The uttered word is represented by a sequence of feature vectors (also called frames) arrayed along one axis, e.g., the horizontal axis; the template or model of the word is represented by a sequence of feature vectors arrayed along the other, e.g., vertical, axis. The feature vectors are generated at intervals of, for example, 0.01 sec (i.e., 100 feature vectors per second). Each feature vector captures properties of speech typically centered within a 20-30 msec window; properties of the speech signal generally do not change significantly within this analysis window. The analysis window is shifted by 0.01 sec to capture the properties of the speech signal at each successive time instance. Details of how a raw signal may be represented as a set of features are provided in [9] and [10].

A test utterance will typically be compared to all templates (i.e., reference patterns or models that correspond to other words) in a repository to find the template that is the best match. The best-matching template is deemed the one that has the lowest global distance from the utterance, computed along the path [8] that best aligns the utterance with a given template, i.e., produces the lowest global distance of any alignment path between the utterance and that template. By a path we mean a series of associations between frames of the utterance and corresponding frames of the template; the complete universe of possible paths includes every possible set of associations between the frames. The global distance of a path is the sum of the local distances of each of the path's associations. The DTW algorithm works well for:

- tasks having a relatively small number of possible choices, for example, recognizing one word from among 10-100 possible ones, and
- performing speaker-dependent recognition.

Furthermore, note that the minimum global matching score for a template need be compared only with a relatively limited number of alternative score values representing other minimum global matching scores (that is, even though there may be 1000 templates, many templates will share the same value for the minimum global score). All that is needed for a correct recognition is for the best matching score to be produced by the corresponding template. The best matching score is simply the one that is lowest relative to the other scores; thus, we may call this a “relative scoring” approach to word matching. The situation in which two matches share a common score can be resolved in various ways, for example by a tie-breaking rule, by asking the user to confirm, or by picking the score with the lowest maximal local distance. However, the case of ties is irrelevant for recognizing a single key-word and practically never occurs.

The standard DTW algorithm is not practical for real-time tasks that have nearly infinite perplexity, for example, correctly detecting and recognizing a specific word/command phrase (a so-called wake-up word, hot word, or key-word) among all other possible words/phrases/sounds. It is impractical to have a corresponding model for every possible word/phrase/sound that is not the word to be recognized. Furthermore, absolute values of matching scores are not suited to selecting the correct word recognition because of the wide variability in the scores. Those problems have been avoided by extending the standard DTW algorithm with the inventions mentioned at the beginning of this section. For reference, a minimal sketch of the standard algorithm follows.
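The sketch below implements only the standard DTW global distance described above (Euclidean local distances and the usual three-predecessor recursion); the patented reverse-ordering and re-scoring extensions are deliberately not reproduced here.

```python
import numpy as np

def dtw_global_distance(test: np.ndarray, template: np.ndarray) -> float:
    """Standard DTW global distance between two feature-vector sequences.

    test, template : (n, d) and (m, d) arrays of per-frame feature vectors
                     (e.g., 13 MEL-cepstral coefficients every 10 ms)
    """
    n, m = len(test), len(template)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = np.linalg.norm(test[i - 1] - template[j - 1])
            # Extend the cheapest of the three allowed predecessor paths.
            D[i, j] = local + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

# "Relative scoring": the recognized word is the template with the
# lowest global distance, e.g.:
#   best = min(templates, key=lambda t: dtw_global_distance(utterance, t))
```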

4.2 Embedded Implementation of Key Word Recognition Algorithm

The inventions described in Section 4.1 enabled the development of a key-word recognition technology that has the following properties:

- Compact: the implementation occupies less than 45 Kbytes of total memory space, and
- Efficient: the CPU solution achieves a throughput of over 50 channels per TI C62x 200 MHz DSP.

The above features of the current solution nevertheless leave room for further improvement in two areas:

- Replacing Dynamic Time Warping (DTW) matching with Hidden Markov Models (HMMs). HMMs are a significantly more powerful modeling technique; they are particularly useful when speaker independence and tolerance of various speaking styles are required. It is therefore expected that this change would further improve the already high recognition performance of the system.

- HMMs are parametric and thus more compact than DTW models. For example, a typical DTW model of a 1-sec-long key-word occupies 3.8 Kbytes, whereas a 5-state HMM with 5 Gaussian mixtures per state would require 65 bytes at the same precision.

In general, one can view any recognition system as a black box that produces a likelihood score that a particular test utterance matches a respective model/hypothesis. The reversal principle, indicated in Section 4.1, adds further information to the matching process, namely:

- Is the match poor because of a degraded signal/realization of the utterance? Or
- Is the match poor because the test utterance does not correspond to the word against which it is being matched?

Current methods are not able to distinguish between those two cases, and thus the match is commonly rejected or misclassified, depending on the threshold. Applying the Reversal Principle can achieve performance gains in the form of error reductions of over 50%. The Reversal Principle is not restricted to key-word recognition; it can also be used for the recognition of any sequence of vectors, such as DNA sequence classification and Large Vocabulary Speech Recognition.

4.3 Microphone Arrays as a Cluster of Smart Sensors

Wireless networked sensors will have a significant impact on security and monitoring applications. In order to monitor sensitive key words spoken in public areas or selected surveillance locations, the speech sensors, or microphones, need to be easily deployed and are often required to be invisible to the population being monitored. For the key-word recognition algorithm to achieve a high recognition rate with virtually zero false alarms, the speech signals received by the recognition algorithm at the Monitoring Center (MC) must be of high fidelity. Because the environments in which these microphones are deployed are highly noisy and may be reverberant, a cluster of sensors is required to reconstruct high-quality speech signals. This consideration motivates us to adopt a microphone array as a cluster of smart sensors.

Consider a given location that is under surveillance and monitoring, whose spatial extent we assume is known. The placement of microphones and the formation of clusters may be designed such that the entire sensor network has full coverage of the location (a simple placement sketch is given below). The placement algorithm will utilize the geometry of the environment and the properties of the selected microphones. Therefore, unlike typical wireless sensor networks, in which the sensors are randomly deployed, the location of each microphone sensor cluster can be made known a priori to the Monitoring Center as well as to the microphones within a cluster. The wireless sensor cluster network's data routing protocol and resource management will use the location information for each cluster. The microphone sensor arrays form the speech signal acquisition front end and can be considered a special detection cluster of smart sensors. A given location under monitoring and surveillance may require many clusters of microphone sensors. The number of microphones in an array is limited by the hardware, and further research will determine the optimal number of microphones for the task. If research shows that a larger number of microphones is necessary, a more expensive hardware solution is possible that would allow the integration of more microphones in the array. Furthermore, the proposed system integrates additional signals from two (stereo) cameras attached to the cluster's host microcomputer.
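As an illustration of the placement problem, here is a greedy sketch that selects cluster positions from a candidate set until every monitored point is covered by at least two clusters (the minimum needed for triangulation). The coverage-radius model and the greedy strategy are assumptions for illustration, not the proposed placement algorithm.

```python
import numpy as np

def greedy_placement(candidates: np.ndarray, targets: np.ndarray,
                     radius: float, min_cover: int = 2) -> np.ndarray:
    """Choose cluster positions until each target point is covered by
    at least `min_cover` clusters.

    candidates : (c, 2) admissible cluster positions (meters)
    targets    : (t, 2) points that must be monitored
    radius     : assumed coverage radius of a single cluster
    """
    in_range = np.linalg.norm(
        targets[None, :, :] - candidates[:, None, :], axis=2) <= radius
    cover = np.zeros(len(targets), dtype=int)
    chosen = []
    while (cover < min_cover).any():
        # Pick the candidate that helps the most under-covered targets.
        gain = (in_range & (cover < min_cover)).sum(axis=1)
        best = int(np.argmax(gain))
        if gain[best] == 0:
            raise ValueError("full coverage infeasible with these candidates")
        chosen.append(best)
        cover += in_range[best].astype(int)
        in_range[best] = False        # a position can be chosen only once
    return candidates[chosen]
```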


We call these smart sensor clusters because all sensors within a cluster work collaboratively, in four different respects, toward a high-performance and energy-efficient key-word recognition system:

The first aspect of the intelligence is that the microphone array allows high performance speech enhancement based on array processing. The array-based speech enhancement techniques have been discussed in Section 3.2. The resultant high fidelity speech signal will then enable more robust key-word recognition.

The second aspect of the intelligence is that the microphone cluster array also facilitates the speaker location, which will become very useful in locating the source of the speech when sensitive key words are spotted. The speaker location techniques have been described in Section 3.3.

The third aspect of the intelligence is that, given the location of the speaker, this information can be used to tag the indicated speaker using image processing techniques, as well as to perform face recognition, effectively integrating the video and audio components of the system. Stereo cameras will enable three-dimensional modeling.

The fourth aspect of the intelligence is that, with a wireless networked cluster of microphone sensors, we shall be able to design energy-efficient wireless communication between the clusters and the monitoring center.

4.3.1 Calculation of Bandwidth Allocation

In the proposed system, audio and video signals are processed at full bandwidth in the corresponding embedded systems, as depicted in Figure 2. In addition, they are stored unaltered on the cluster's computer storage device. For real-time monitoring it is not necessary to preserve the full bandwidth of the original signals; thus, compression techniques can be used to lower the audio and video signal bandwidths for efficient wireless communication.

[Figure 2 diagram: a cluster microcomputer with its microphone array, two cameras, and wireless network access.]

Figure 2. Cluster Sensor comprised of Array Microphones and Stereo Cameras

Audio information of the speech is digitized, encoded, and compressed at the sensor cluster's computer. After compression, typical data rates for speech lie between 12 kbit/s (highly compressed) and 64 kbit/s (uncompressed at an 8 kHz sampling rate and 8-bit amplitude resolution). Furthermore, the digitized audio signal from a microphone array will be transmitted to the monitoring center only when a key-word is spotted (the advantages of using microphone arrays are discussed in Section 3.2). In contrast, video information is sent continuously to the monitoring station. Existing wireless network standards (IEEE 802.11, WiFi, etc.) offer from 1-2 Mbit/s up to 54 Mbit/s of bandwidth; however, based on our experience, a 10 Mbit/s transmission rate is the practical empirical constraint (a back-of-the-envelope budget is sketched below). Using the proposed system, we intend to develop new adaptive communication protocols to optimize data transfers (of audio as well as video signals). As indicated in Section 2.4, we will adopt the TDMA MAC protocol for wireless sensor network management. With this protocol, the operation of the sensor network is divided into rounds. Each round begins with a set-up phase in which the network assigns clusters to communicate, followed by a steady-state phase in which data is collected, processed, and forwarded to the monitoring center.
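A back-of-the-envelope check of the numbers above; the per-camera video rate is an assumed illustrative value, while the speech rates and the 10 Mbit/s constraint come from the text.

```python
# Illustrative link-budget arithmetic for one wireless cell.
SPEECH_KBPS = 64    # worst case: 8 kHz * 8 bit, uncompressed (key-word hits only)
VIDEO_KBPS = 384    # assumed compressed rate per continuous camera feed
LINK_KBPS = 10_000  # empirical usable rate cited above (10 Mbit/s)

per_cluster = 2 * VIDEO_KBPS + SPEECH_KBPS   # two cameras + worst-case audio
print(f"peak load per cluster: {per_cluster} kbit/s")        # 832 kbit/s
print(f"clusters per cell    : {LINK_KBPS // per_cluster}")  # about 12
```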

4.4 Wireless Sensor Network Management

The Monitoring Center (MC) acts as a base station and controls the wireless data transfers from/to the sensor clusters. The communication between the MC and the clusters is bidirectional. The maximum number of sensor clusters that share the wireless medium at one time will be determined through empirical studies. It is expected that the antenna characteristics of each device are the same, and thus the wireless network is considered homogeneous. The system allows new clusters to join, and operating clusters to leave, the wireless network. Sensor clusters can autonomously self-configure and set themselves up in the network when deployed. The topology is semi-static: deployment of the clusters is statically pre-set and should follow certain rules, but the topology may change dynamically.

We call the cluster-to-MC communication direction the uplink and the MC-to-cluster direction the downlink. Sensed data is transferred on the uplink; the downlink is mainly for control. In the proposed application the uplink must have significantly higher bandwidth, making optimal use of asymmetric communication. A wireless sensor network is reactive if nodes send data referring to events occurring in the environment (which in turn may trigger specific actions), and programmed-active (pro-active) if nodes collect data according to application conditions (the application decides when, what, and how to collect, regardless of environmental conditions). The majority of the audio-video information processing is done at the cluster level; therefore the proposed network is of the pro-active type. However, sensor clusters also look for special “events” and trigger a system response that may be coordinated through the monitoring center. A minimal sketch of the round-based schedule follows.
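For concreteness, here is a minimal sketch of one round of the TDMA schedule described in Section 4.3.1: a set-up slot owned by the MC (downlink control), followed by one uplink data slot per active cluster. The slot durations and frame layout are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Slot:
    start_ms: int
    length_ms: int
    owner: str          # "MC" for the set-up/control slot, else a cluster id

def tdma_round(cluster_ids: list[str], setup_ms: int = 10,
               data_ms: int = 50) -> list[Slot]:
    """Build one TDMA round: set-up phase, then one data slot per cluster."""
    schedule = [Slot(0, setup_ms, "MC")]      # MC assigns slots (downlink)
    t = setup_ms
    for cid in cluster_ids:                   # steady-state uplink slots
        schedule.append(Slot(t, data_ms, cid))
        t += data_ms
    return schedule

# Example: three clusters sharing the medium in one round.
for slot in tdma_round(["cluster-A", "cluster-B", "cluster-C"]):
    print(slot)
```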

5 Integration of Research with Education

Dealing with the current realities of terrorist threats requires changes in the way we conduct our everyday life. A significant part of that change should come through education. By exposing students to the research work that is being proposed, two goals can be achieved:

- Introduction to innovative technologies, and
- Exposure to the ways these technologies are used to increase safety, which in turn will motivate behavior appropriate for the times in which we live.

The first task, introducing the innovative technologies, will aid the already-initiated process of incorporating speech recognition and related areas into the curriculum. Classes covering broad areas of speech recognition have already been introduced in the ECE and CSE departments of Florida Tech. ECE courses will cover a wide range of topics related to speech, speech recognition, and synthesis. This project will facilitate the introduction of higher-level research into those classes.

5.1 Integration in graduate education

This research will initially include a selected number of graduate students directly involved with the projects listed in the proposal, as well as others supported by this equipment. In addition, once established, the lab may be used for a wide variety of research projects in the areas of speech recognition, image processing, machine learning, etc. Furthermore, the laboratory would enable the involvement of a wider number of students interested in those and related areas through various class projects.

5.2 Integration in undergraduate education

Collection of suitable test data is a large undertaking: speech data collection (e.g., building a multi-channel microphone-array Speech Data Corpus), image and video data collection, etc. This activity will require the participation of a large number of students and other willing parties as donors of their voices as well as their presence. The proposer has no knowledge of any publicly available microphone-array speech data. Such data can become publicly available through our efforts and may trigger a proliferation of research into the use of microphone arrays for speech recognition and/or speaker location. Collecting the data requires setting up a laboratory with variable topologies to study various effects of microphone array placement and speaker location. Maintaining this laboratory will largely involve a number of graduate and undergraduate students, thus facilitating the transfer of knowledge. In addition, a project involving speech recognition is now in progress in the departmental Senior Design course, and future projects related to or affiliated with the proposed research will be considered.


On a larger scale, a transfer of knowledge to students at large will occur through the publishing of papers triggered by this research.

5.3 Industrial collaborations

The proposer is negotiating with his industrial contacts to help a company get started by providing key-word recognition technology, a crucial component of the product that they intend to introduce. Their application does not involve security; it targets home markets. Many engineering solutions expected to come out of this research will be directly applicable to that application. Furthermore, many other applications can be envisioned as results of the research efforts made possible by this laboratory equipment, and one can easily envision founding a new company to produce products based on the technology developed through the outlined projects and the proposed hardware.

5.4 Qualifications of PI

Dr. Veton Këpuska received his Ph.D. in ECE from Clemson University in 1990. He also completed post-doctoral studies in image understanding at the Swiss Federal Institute of Technology in Zurich from 1990 to 1993. Since then he has worked in R&D at various high-tech companies developing speech recognition technology. From 1993 to 2003 he was not allowed to publish anything related to his work due to company policies; thus, although he is well regarded in the industrial community, he is not well known in academic circles. He is the inventor of three patents, one of which is expected to be shown to be a fundamental contribution to pattern recognition for the class of problems where the pattern is composed of a sequence of features. Most recently he has joined FIT in a quest for more freedom to conduct research, as well as to transfer some of his industrial experience through teaching.

Dr. Këpuska's interests include the area of Human-Machine Interaction and Communication, which covers speech recognition, text-to-speech, speaker identification (biometrics) and telematics, digital signal processing, adaptive filtering, pattern recognition, neural networks, and language modeling. See Dr. Veton Këpuska's biography (Section 9) for further details.

6 Significance and Impact of the Proposed Research

6.1 Significance in national security and monitoring

It is expected that the proposed solution will enhance current security and monitoring capabilities for the nation, thus contributing to an overall improvement in our national security. In addition to providing technical solutions to some of the security problems we currently face, the full impact of the proposed solution will largely depend on its ultimate deployment. This might be an appropriate topic for follow-on research and development.

6.2 Impact on combating terrorism, intelligence gathering

The proposed integrated system can be used directly to combat terrorism through tactical intelligence gathering and response. Furthermore, it is reasonable to expect that the proposed research will generate many new and novel applications and scenarios as it progresses. These new ideas may also be directly applicable to providing new solutions for intelligence gathering, thus adding value in new ways to combating terrorism, monitoring vulnerable areas, and, in general, acquiring new intelligence information.

6.3 Impact on human computer interface, entertainment, and education

6.3.1 Speech Recognition - Smart Speech Recognition Agent

One desired and expected result of this research is to encourage the speech recognition community to accept the notion of a "Smart Speech Recognition Agent" made possible through robust key-word recognition. Acceptance of this concept would enable integration of existing speech recognition technologies into a single application that provides multiple services. This trend has started at a rudimentary level (without key-word recognition) through some commercial so-called voice-portal applications. However, in all these applications one has to remember to dial a particular number to gain access to


these services. These services are usually very specific (travel reservations, weather forecasts, movie listings, etc.) and limited in scope. When one needs to gain access to some other information or service outside the offered packages, the whole interaction with the system becomes at a minimum cumbersome; more often than not, it simply breaks down. Additional services may each require key-word recognition. The concept of a "Smart Speech Recognition Agent" would enable the integration of all these technologies under one top-level application that unifies all sub-services under one transparent interface. At any point in time the user would be able to invoke the services of the "Smart Speech Recognition Agent", interrupting one activity and initiating another until the transaction is fully complete. For example, let us assume that our "Speech Recognition Agent" is called Aristotle:

Speech Recognition Agent (Aristotle): <Monitors all speech activity continuously and at all times>
User (at home): "Aristotle"
Aristotle: <Tone>  (a suitably designed tone acknowledging recognition of the key word "Aristotle")
User: "Call Linda"
Aristotle: <Dials Linda>
User: "Hi Linda ..."
Linda: "Have you converged with John regarding ..."
User: "Actually we can resolve that now ... Aristotle"
Aristotle: <Tone>
User: "Conference in John"
Aristotle: <Dials John>
...
John: "Perhaps you should fly in next week?"
User: "Hold on, let me see what my schedule looks like ... Aristotle"
Aristotle: <Tone>
User: "Calendar"
Aristotle: <Invokes Calendar Application>
User: "What does my schedule look like for next week?"
...
User: "OK, good, I can travel on ..."
User: "Aristotle"
Aristotle: <Tone>
User: "Call travel agent"
Aristotle: <Calls travel agent>
...
John: "You should formally inform Mr. Doe about our discussion"
User: "Aristotle"
Aristotle: <Tone>
User: "Compose E-mail"


Aristotle: <Initiates Dictation Program>
...

Note that all of the activities invoked by Aristotle (e.g., dialing a specific user, calling a travel agent, taking dictation of an e-mail, etc.) are activities for which speech recognition already exists, but not as a single, monolithic application or service provider. The results of the proposed research will provide the fundamental science and education needed to make this seamless application of speech recognition a reality, in addition to the proposed national-security focus of our immediate activities.
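To make the control flow of such an agent concrete, the following is a minimal Python sketch of the dispatch loop implied by the dialogue above. All names here (SERVICES, handle_utterance, the pre-transcribed audio_stream) are hypothetical illustrations, not components of the proposed system; in particular, the sketch assumes transcription has already happened, whereas the proposed system would run the key-word recognizer directly on the microphone-array signal.

    from typing import Callable, Dict

    # Hypothetical sub-services that existing recognizers already provide.
    SERVICES: Dict[str, Callable[[str], str]] = {
        "call":     lambda args: f"<Dials {args}>",
        "calendar": lambda args: "<Invokes Calendar Application>",
        "compose":  lambda args: "<Initiates Dictation Program>",
    }

    def handle_utterance(utterance: str) -> str:
        """Route one post-key-word utterance to the matching sub-service."""
        command, _, args = utterance.strip().partition(" ")
        service = SERVICES.get(command.lower())
        return service(args) if service else "<Unknown command>"

    def agent_loop(audio_stream, key_word: str = "aristotle"):
        """Idle until the key word is heard, acknowledge, then dispatch one command.

        audio_stream is simply an iterable of already-transcribed utterances in
        this sketch; only the key-word detector runs continuously.
        """
        armed = False
        for utterance in audio_stream:
            if not armed:
                # The key word may occur in isolation or embedded in running speech.
                if key_word in utterance.lower():
                    print("<Tone>")   # acknowledge recognition of the key word
                    armed = True
            else:
                print(handle_utterance(utterance))
                armed = False         # return to passive monitoring

    # Replaying a fragment of the dialogue above:
    agent_loop(["Hold on ... Aristotle", "Calendar", "Aristotle", "Call Linda"])

The essential design point is the single top-level loop: each sub-service remains an independent, off-the-shelf recognizer, and only the lightweight key-word detector needs to listen all the time.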

6.3.2 Speaker Recognition – Speaker Authentication

If speech patterns of a person are available, a natural extension of the speech recognition task is to estimate the likelihood that a specific voice belongs to a specific person in the database. This task can be used to validate the identity of an individual, or it can be used to authenticate a specific user.
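As an illustration only, under strongly simplified assumptions (one-dimensional features and single Gaussians standing in for real enrolled-speaker and background models, e.g. GMMs), the authentication decision can be sketched as a log-likelihood-ratio test in Python:

    import math

    def log_gaussian(x: float, mean: float, var: float) -> float:
        """Log density of a 1-D Gaussian, standing in for a real acoustic model."""
        return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

    def verify(features, speaker_model, background_model, threshold=0.0) -> bool:
        """Accept the claimed identity if the average log-likelihood ratio
        log p(X | speaker) - log p(X | background) exceeds the threshold."""
        llr = sum(
            log_gaussian(x, *speaker_model) - log_gaussian(x, *background_model)
            for x in features
        ) / len(features)
        return llr > threshold

    # Toy usage: features drawn near the enrolled speaker's mean are accepted.
    enrolled, background = (1.0, 0.5), (0.0, 1.0)    # (mean, variance) pairs
    print(verify([0.9, 1.1, 1.0], enrolled, background))     # True
    print(verify([-1.2, -0.8, -1.0], enrolled, background))  # False

The threshold trades off false acceptances against false rejections and would be tuned on held-out data in any real deployment.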

6.3.3 Image Recognition - Face Recognition

At Boston's Logan Airport an experimental system that performs face recognition has been installed. Our proposed system will enable further research in this area. One advantage of the proposed system over the overt system installed at Logan is that it will provide an integrated framework for evaluating multiple sources of information, including sound as well as video. In addition to face recognition, video images can be used to develop behavioral models that could detect potentially ill-intended or abnormal behavior; for example, a person who places a personal item (e.g., a bag) in the airport and walks away would exhibit suspicious behavior. Furthermore, video and voice could be used concurrently to perform multi-modal recognition by combining voice recognition with lip-movement. Another research area for which the proposed system can be used is automatic human tracking, whereby an indicated individual (triggered by suspicious behavior, use of a specific word/phrase, a match to a particular face profile, etc.) would be automatically traced by the system.
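As a minimal sketch of how the "item left behind" rule above might be expressed, assuming a hypothetical video tracker that already yields (x, y) positions for a tagged person and a placed item (the function name, coordinates, and thresholds are all placeholders, not part of the proposed system), the behavior reduces to a simple distance-and-dwell test:

    import math

    def flag_abandoned(person_track, item_pos, radius=5.0, patience=3):
        """Flag the item if its owner stays farther than `radius` away
        for `patience` consecutive frames.

        person_track: sequence of (x, y) positions for the tagged individual.
        item_pos: (x, y) position where the personal item was placed.
        """
        away = 0
        for (px, py) in person_track:
            if math.hypot(px - item_pos[0], py - item_pos[1]) > radius:
                away += 1
                if away >= patience:
                    return True   # owner has walked away: raise an alert
            else:
                away = 0          # owner returned; reset the dwell counter
        return False

    # Toy usage: a person places a bag at (0, 0) and walks off.
    track = [(0, 0), (2, 1), (6, 4), (9, 7), (12, 9)]
    print(flag_abandoned(track, (0, 0)))   # True

Real behavioral models would of course be learned rather than hand-coded, but even simple rules of this form can generate useful alerts once reliable tracking is available.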


8 Budget

Contact Info:

Vince Accardi, District Sales Manager, National Instruments
218A E. Eau Gallie Blvd., Indian Harbor Beach, FL 32937
Tel. (321) 751-7771, E-mail: [email protected]

Part Number | Description | Unit Price | Qty | Price | Link
776678-03 | NI LabVIEW Professional Dev. System for Win2000/NT/XP/ME/98 | $873.75 | 1 | $873.75 | http://sine.ni.com/apps/we/nioc.vp?lang=US&cid=2441
777859-03 | NI Vision Development Module for LabVIEW | $648.75 | 1 | $648.75 |
777844-03 | NI LabVIEW Real-Time Module for Windows | $648.75 | 1 | $648.75 | http://sine.ni.com/apps/we/nioc.vp?lang=US&cid=11523
777849-01 | NI LabVIEW RT Run-Time License (Quantity 1) | $495.00 | 4 | $1,980.00 |
777970-03 | NI LabVIEW Sound and Vibration Toolkit for Win 2000/NT/XP/Me/9x | $373.75 | 1 | $373.75 | http://sine.ni.com/apps/we/nioc.vp?lang=US&cid=3122
778042-01 | NI LabVIEW Sound and Vibration Toolkit Run-time License | $98.75 | 1 | $98.75 | http://sine.ni.com/apps/we/nioc.vp?lang=US&cid=3122
960596-01 | Factory Installation Services | $595.00 | 4 | $2,380.00 | http://sine.ni.com/apps/we/nioc.vp?lang=US&cid=3405
778636-01 | NI PXI-1042 8-Slot 3U Chassis with Universal AC Power Supply | $1,777.50 | 4 | $7,110.00 | http://sine.ni.com/apps/we/nioc.vp?lang=US&cid=11649
778468-02 | NI 8176, 1.26 GHz PXI Embedded Controller with Windows XP | $3,595.50 | 4 | $14,382.00 | http://sine.ni.com/apps/we/nioc.vp?lang=US&cid=10575
778469-512 | 512 MB RAM for NI-8171 Series | $445.50 | 4 | $1,782.00 |
778279-01 | NI PXI-4472, 8 Inputs, 24-bit Dynamic Signal Acquisition | $3,595.50 | 12 | $43,146.00 | http://sine.ni.com/apps/we/nioc.vp?lang=US&cid=5031
778492-01 | External USB CD-ROM for use with PXI & VXI Embedded Controllers | $175.50 | 4 | $702.00 |
778415-01 | SMB 100, SMB Female to BNC Female Coax Cable, 50 ohm, 2 ft., Qty 8 | $265.50 | 4 | $1,062.00 | http://sine.ni.com/apps/we/nioc.vp?cid=10812&lang=US


Contact Info:

Mark Valentino, Vibration Product Manager, PCB Piezotronics, Inc.
3425 Walden Avenue, Depew, New York 14043
Phone: 888-684-0013, Fax: 716-685-3886
www.pcb.com, E-mail: [email protected]

Part Number | Description | Unit Price | Qty | Price | Link
300M162 | 32 Channel Array Microphone System with Signal Conditioner and Power Supplies | $11,875.00 | 4 | $47,500.00 |
    System contents: 32 - 130D20 Array Microphones and Preamps; 32 - 002T20 Output Cables; 2 - 070C29 Input Patch Cables; 2 - 009H25 Cables; 2 - 441A39 Modular Conditioning Mainframes; 2 - 441A101 AC Power Modules; 2 - 442B117 16 Channel ICP Power Supplies; 8 - 009L05 Output Cables; 2 - 1110 Mounting Grids; 200 - Microphone Array Clips; 2 - CAL200 Calibrators
WUSB11 | Wireless Networking Card | $49.99 | 5 | $249.95 | http://www.linksys.com/products/product.asp?grid=33&scid=36&prid=435
USB2S | Serial/USB port Controller | $19.99 | 4 | $79.96 | http://www.deluo.com/Merchant2/merchant.mv?Screen=PROD&Store_Code=DE&Product_Code=USB2S&Category_Code=GPSOEM
Dell | Dell Precision™ Workstation 360 Minitower | $6,800.39 | 2 | $13,600.78 |
Dell | UltraSharp 1800FP 18-inch Flat Panel LCD Monitor | $529.00 | 4 | $2,116.00 |

Total: $138,734.44


9 Biography

Dr. Veton Z. Këpuska

EDUCATION:

1990  Ph.D., Computer Engineering, Clemson University. Dissertation: Artificial Neural Networks for Speech Recognition Applications. Advisor: John N. Gowdy

1986  M.S., Computer Engineering, Clemson University. Advisor: John N. Gowdy

1981  Dipl. Eng. (B.S.), Electrical Engineering, University of Prishtina. Thesis: The Use of Analog Computers for Simulation and Automatic Control. Advisor: Abdurrahman Grapci

1976  Diploma, Mathematical Gymnasium. Diploma work: Experimental Methods for Measurement of the Speed of Light. Advisor: Skender Skenderi

FELLOWSHIPS AND HONORS

1984 – 1985 Fulbright Fellow

1987 – 1988 Harris Fellow

1977 – 1979 University of Prishtina Fellow

EMPLOYMENT:

2003 - Present  Associate Professor - Florida Institute of Technology, Electrical and Computer Engineering Department

2001 - 2003 Speech Recognition Scientist - ThinkEngine Networks, Inc., 100 Nickerson Rd., Marlborough, MA 01745. USA.

Invented, designed, and developed a unique solution to "Key Word" Recognition, or "OnWord™" Spotting Technology. Key-word spotting entails recognition of a specific word/phrase uttered in isolation or in the context of continuous speech. Currently this technology is not as widely used as other speech recognition applications/tasks because of the poor performance of the speech recognition systems offering it, whether commercially (Nuance, SpeechWorks, Philips, Conversay, ART, etc.) or as research tools from research and development institutions (Byblos (BBN), Sphinx (CMU), the HTK Speech Recognition Tool Kit (Cambridge University, Entropic and Microsoft), etc.). Furthermore, all those systems require computers with powerful CPUs (~1.5 GHz Pentiums) and large memory (512 MB RAM), with the speech recognition process itself requiring tens to hundreds of MB for this feature alone to run in real time. An additional advantage of the developed system is that it is also designed to run on a fixed-point DSP, requiring less than 36.2 KB of program memory and 2 KB of model space, and consuming fewer than 2 million cycles per second on


a TI C62xx.

Inventor of 3 patented solutions (patents pending):
1. Voice Activity Detection Based on Cepstral Features.
2. Dynamic Time Warping (DTW) Matching Using Reverse Ordered Feature Vectors.
3. Rescoring Using Distribution Distortion Measurements of Dynamic Time Warping Match.

Working on generalized scoring using reversed and normal ordered features for any pattern matching method (e.g., DTW, HMM), to be filed for patent.

Designed and assisted in the development of a voice data collection system necessary for research, development, testing, and evaluation of the key-word recognition system.

Performed and managed 2 data collections over various calling environments (noisy, quiet, public, car, etc.) using various calling devices (cellular, landline, speaker phone). Created 2 corpora from the recorded data, used for: building models of a particular key word; testing and evaluation of the system; and research, development, and refinement of the key-word recognition system.

Transcribed and/or supervised the transcription process of the recorded data. Set up conventions and standards so that all tools to be developed that use data from the created corpora comply with a clear set of standards.

Converted other corpora (CallHome and PhoneBook) to this set of standards for easy and consistent use.

Directed and supervised the code conversion and porting of the key-word spotting technology from floating point to fixed point.

Developed an automated process using a combination of perl scripts and perl configuration files controlling the various parameters affecting each step of the complex process of: generating features from a voice data corpus; building a model of a key word (e.g., "Operator", "Help", "MapQuest", "Verizon", etc.) from the features; using the built model to test and evaluate the key-word recognition system; and generating performance plots, charts, and graphs. Those scripts use numerous executables, gnuplot (a graph plotting tool), and other perl scripts. The end result of this process is the automatic generation of a number of plots, charts, and graphs that depict the performance of the system for easy evaluation and comparison.

Trained and supervised a DSP engineer to port, test, and evaluate the key-word technology.

Worked with application developers to integrate the key-word spotting technology into a viable demo and a potentially viable product.

Wrote the technical documentation and manual for this technology.

Consulted the CTO in the decision-making process regarding speech recognition, text-to-speech, and key-word spotting technologies.

1999 – 2001 Speech Recognition Scientist – SpeechWorks International, Inc., Product Group,


695 Atlantic Ave., Boston, MA 02111, USA.

Developed a noise compensation algorithm to increase recognition robustness against noise and varying channel characteristics.

Conducted a study of wireless/cellular vs. wireline/landline signal differences and their effect on recognition performance.

Developed nonlinear front-end signal processing.

Performed comparative studies of various speech recognition technologies (e.g., the AT&T, Nuance, and SpeechWorks recognizers).

Developed algorithms to investigate various features (confidence score, acoustic score, etc.) and their optimal use for combining N-best lists produced by different features (MFCC, LPC, etc.) and different recognizers (segmental, HMM, Watson). The combining algorithm achieved significant error reduction compared to the best individual recognizer.

Developed diphone clustering for HMM models to minimize model size.

Involved in the re-alignment of acoustic segments for text-to-speech (TTS) model-building data. Developed a framework for modular expansion and refinement of the re-alignment process using perl scripts combined with perl configuration files. Implemented various heuristic rules to improve alignments generated by the speech recognizer to better fit TTS.

Developed a data collection program for the Dialogic JCT board that supports CSP. Developed, ran, digested, and processed a "Call Environment Data Collection" using this application.

1997 - 1999 Scientist - GTE, BBN Technologies, Speech Solutions Group, 70 Fawcett St., Cambridge, MA 02138. USA.

Compiled and analyzed BYBLOS (research speech recognition technology) and BBN HARK (commercial technology) system differences; analyzed possible BYBLOS technologies for porting into BBN HARK; developed and coded a voice model filter that loads BYBLOS and/or BBN HARK training files and converts them into new-format files in compliance with designed specifications; ran various tests (BYBLOS and BBN HARK) of continuous-density BBN HARK for benchmarking.

Peer-reviewed a paper for the Speech Communication journal.

1993 - 1997 Speech Scientist - Voice Processing Corporation/Voice Control Systems, Advanced Technology Development Group, One Main Street, Cambridge, MA 02142, USA.

Enhanced the performance of the existing front end of the speech recognition system implemented in the VPro line of products by designing a non-linear smoothing algorithm based on median filtering.

Developed and implemented dynamic features that augmented the existing front-end features.

Developed a universal preprocessing module of the front end that enables run-time front-end configuration, decompression, and sample-rate transformations of the original wave file.

Performed numerous tests that provided critical insights into the enhancement and debugging of the VProFlex technology.

Invented, developed, and integrated a very efficient novel code book search strategy (internally named Fickle Search).

Compiled a condensed internal report of a literature review study on different


ways to perform fast FFTs of a real-valued sequence.

Developed, tested, and integrated a split-radix FFT algorithm; the function can handle real-valued FFTs of any power-of-2 size.

Modified the front end to take advantage of a higher FFT size and increased frequency resolution: analyzed the conflicting effects of window size and type (higher frequency resolution causing breakdown of the enhancement due to harmonics); analyzed several possible modifications of the enhancement algorithm to accommodate higher frequency resolution; and proposed the elimination of pitch harmonics from the spectrum with homomorphic filtering or an LPC-based spectrum.

Implemented the LPC-based spectrum, integrating it with the existing spectral enhancement module of the front end.

Initiated a study toward enhanced composition of boundary and internal acoustic-phonemic features.

Invented, developed, ported, and extensively tested a novel noise compensation with speech enhancement algorithm. Also invented several integration strategies that take further advantage of the algorithm through better interaction of the front end with the API, and that take advantage of calibration when feasible. The default mode of operation is fully unsupervised, in real time.

Developed an ANN software tool currently supporting five different feed-forward back-propagation types of learning.

Developed a pitch tracking algorithm based on an enhanced super-resolution pitch determination algorithm.

1990 – 1993 Post-Doctoral Research Associate - Swiss Federal Institute of Technology, IGP, ETH-Hönggerberg, CH-8093 Zürich, Switzerland.

Swiss National Science Foundation Research Project in Image Understanding - Design and Analysis of Spatial Image Sequences

1985 – 1990 Teaching Assistant – Electrical and Computer Engineering Department. Clemson University.

Digital Processing of the Speech Signals, Digital Systems, Digital Circuit Design and Microprocessor Applications, Electronics, Programming.

1987 - 1990 Consultant - Engineering Research and Computer Services Department, Clemson University, Electrical and Computer Engineering Department, Clemson, SC 29634-0915, USA.

Designed and developed a database system for processing the expenditures of the College of Engineering, Clemson University.

Designed and developed a database system prototype for the automation of: management of repair and maintenance orders; task allocation and duty assignment; time-table management of the assigned personnel; and generation of relevant statistical data.

1985 - 1986 Software Engineer (summer job) - Keiltronix: Textile Control Systems Inc., 2910 Horseshoe Lane,


P.O. Box 1923, Charlotte, NC 28219.

Developed software for polling and analyzing data from peripheral machine controllers.

Developed software for graphical display of the status of a manufacturing dyeing process in real time.

Developed a software package using REGIS as a low-level software tool for dynamic display of the state of the technological process in real time.

1981 - 1984 Assistant Lecturer - Electrical Engineering Faculty, University of Prishtina, Republic of Kosova.

Taught courses in Control Theory, Systems Theory, Algorithms, Digital Communications, Boolean Algebra, Digital Systems, and Programming.

Contributed to the publication of the first Automatic Control Theory textbook in the Albanian language.

Key member of the commission that prepared a detailed proposal for the advancement of the curricula of the Electrical and Electronics Engineering Faculty.

PATENTS:

Voice Activity Detection Based on Cepstral Features.
Dynamic Time Warping (DTW) Matching Using Reverse Ordered Feature Vectors.
Rescoring Using Distribution Distortion Measurements of Dynamic Time Warping Match.

JOURNAL PUBLICATIONS:

Këpuska, V. and Mason, S., A Neural Network Approach to Signalized Point Recognition in Aerial Photographs, Photogrammetric Engineering & Remote Sensing, Vol. 61, No. 7, pp. 917-925, July 1995.
Mason, S. and Këpuska, V., CONSENS: An Expert System for Photogrammetric Network Design, Allgemeine Vermessungs-Nachrichten, pp. 384-393, September 1992.

CONFERENCE PUBLICATIONS:

Mason, S. and Këpuska, V., On the Representation of Close-Range Network Design Knowledge, XVII ISPRS Congress, Washington D.C., August 1992.
Këpuska, V. and Mason, S., Automatic Signalized Point Recognition with Feed-Forward Neural Network, IEE Second International Conference on Artificial Neural Networks, Bournemouth, U.K., November 1991.
Mason, S., Beyer, H., and Këpuska, V., An AI-based Photogrammetric Network Design System, First Australian Photogrammetric Conference, University of Newcastle, Australia, November 1991.
Këpuska, V. and Mason, S., Artificial Neural Network Approach to Signalized Point Recognition in Aerial Photographs, First Australian Photogrammetric Conference, University of Newcastle, Australia, November 1991.
Këpuska, V., Beyer, H. and Mason, S., Artificial Neural Networks for Calibration of CCD-Cameras, Workshop on Industrial Applications of Neural Networks, Ascona, Switzerland, September 1991.
Këpuska, V. and Gowdy, J., On the Effect of Topological Structure of the Kohonen Network on the Performance of the Hierarchical Two-Layered Isolated Word Recognition System, IEEE Southeastcon Symposium, New Orleans, April 1990.
Këpuska, V. and Gowdy, J., Investigation of Phonemic Context in Speech Using Self-Organizing Feature Maps, IEEE International Conference on Acoustics, Speech and Signal


Processing - ICASSP'89, Glasgow, Scotland, May 1989.
Këpuska, V. and Gowdy, J., Phonemic Speech Recognition Based on Neural Network, IEEE Southeastcon Symposium, Columbia, April 1989.
Këpuska, V. and Gowdy, J., The Kohonen Net for Speaker Dependent Isolated Word Recognition, IEEE Southeastern Symposium on Systems Theory, UNCC Charlotte, March 1988.
Këpuska, V. and Gowdy, J., Evaluation of Digital Signal Processing Chips for Speech Processing Applications, IEEE Southeastern Symposium on Systems Theory, Clemson University, Clemson, March 1987.
Këpuska, V. and Gacaferri, J., The Determination of the Polynomial Coefficients for Approximation of the EKG with Computer (in Serbo-Croatian), Symposium JUREMA, Zagreb, 1979.

PUBLIC REPORTS:

Këpuska, V. and Mason, S., NFP23: Design and Analysis of Spatial Image Sequences, Wissenschaftlicher Bericht zum Schweizerischen Nationalfonds zur Förderung der wissenschaftlichen Forschung, 1992.
Këpuska, V. and Mason, S., Design and Analysis of Spatial Image Sequences, NFP 23 Third Annual Status Report, Bern, July 6, 1992.
Këpuska, V. and Mason, S., NFP23: Design and Analysis of Spatial Image Sequences, Wissenschaftlicher Bericht zum Schweizerischen Nationalfonds zur Förderung der wissenschaftlichen Forschung, 1991.
Këpuska, V. and Mason, S., Design and Analysis of Spatial Image Sequences, NFP 23 Second Annual Status Report, Bern, June 5, 1992.
Mason, S. and Këpuska, V., NFP 23: Design and Analysis of Spatial Image Sequences (Project Summary), SGAICO Newsletter, Swiss Group for Artificial Intelligence and Cognitive Science, 1991.

SPECIAL SKILLS:

Languages: Albanian (Mother Tongue), English, Serbo-Croatian, German (beginner), Turkish (beginner).
