

Universidad Autónoma de San Luis Potosí

Detection, Conditioning, and Processing of Acoustic Wave Signals in Human Tissue for Wearable Computer Applications

Héctor Raúl Moncada González, M. E. E.
José Luis Tecpanecatl Xihuitl, Ph. D.
Carlos Adrián Gutiérrez Díaz de León, Ph. D.

Introduction

The miniaturization of ultra-low-power microcontrollers, sensors, and actuators has enabled a new era of computing: devices embedded into everyday objects that can adapt to a person's needs, time, or context of use. These devices, known as wearable devices, have been attracting companies such as Apple, Samsung, and Google. Indeed, the wearable-device market is growing, and some analysts estimate that by 2018 it will reach anywhere from $30Bn to $50Bn in revenues [1]. The development of wearable-related technologies is therefore in demand, and novel human-interface and device-to-device networking technologies are crucial for the development of upcoming wearable devices and applications.

Wearable devices take advantage of one or several sensors, such as cameras, accelerometers, and gyroscopes. Alternatively, acoustic transducers offer advantages such as requiring no supply voltage to sense or actuate, small size, and low price. However, the application of acoustic transducers to wearable devices has not been explored thoroughly, especially in the gesture recognition field, where the purpose is to identify gestures (hand and finger movements) and relate them to commands that control a system.

In this regard, beyond their use in medical imaging and diagnostics, acoustic waves have recently been proposed for different applications related to wearable computers and human-machine interaction, including user input detection, user feedback, and body-centric communications.

Proposal

The proposal consists of a gesture recognition system. The prototype is a wristband with six piezoelectric sensors connected to an audio amplifier, which is in turn connected to a computer on which all signals are recorded; henceforth, each recorded signal is called an intrabody acoustic wave signal (IAWS).


Figure 1 shows the positions of sensors three to five on the wrist, on the palm side. Figure 2 shows the positions of sensors one, two, and six on the wrist, on the back-of-the-palm side. A database with 18 participants was created; each participant made 30 repetitions of 22 different gestures. The users are uniformly distributed in age and gender.

Figure 1. Sensors on the wrist, palm side.
Figure 2. Sensors on the wrist, back-of-palm side.

Previous report

In the previous report, the preprocessing stage, the feature extraction process, and the results from gesture classification were described. Figure 3 shows a block diagram containing these stages.

Figure 3. Complete process block diagram.

Because of the classification results obtained with the previous features (50% accuracy), the conclusions were:

- Use features related to audio processing.
- Use information from five sensors instead of three.
- Reduce the set of gestures.

All recommendations were followed.


Progress

Feature extraction process.

In the pattern recognition field, a feature is a characteristic that can uniquely identify a pattern, such as the energy of a signal. The feature extraction process corresponds to the stage in which informative and non-redundant features are selected, facilitating the subsequent machine learning. Patterns are formed with a series of features, and a good selection of these features is critical to obtain acceptable results in the classification stage.

Several papers analyze the classification of acoustic signals. Subramanian [2] uses several features to classify audio signals, while Chmulik and Jarina [3] took a bio-inspired approach to generic sound recognition. Both works used the spectral flux and the spectral centroid as classification features. Subramanian [2] suggested analyzing the signal using time windows of N samples; that suggestion is followed in the present work using 34 ms windows.
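To make the windowing concrete, a minimal Python sketch is given below. The report does not state the sampling rate, so fs = 44100 Hz, and the helper name split_windows, are assumptions of this example.

```python
import numpy as np

FS = 44100                    # assumed sampling rate (Hz); not stated in the report
WIN = int(round(0.034 * FS))  # 34 ms window length in samples (~1500 at 44.1 kHz)

def split_windows(signal, n_windows=30, win_len=WIN):
    """Split one IAWS recording into n_windows non-overlapping 34 ms windows."""
    sig = np.asarray(signal, dtype=float)
    needed = n_windows * win_len
    if sig.size < needed:                      # zero-pad recordings that fall short
        sig = np.pad(sig, (0, needed - sig.size))
    return sig[:needed].reshape(n_windows, win_len)
```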

Spectral flux measures how fast the power spectrum of an audio signal is changing and is related to the tone quality of a musical note. To calculate the spectral flux, the signal is divided into 30 windows of 34 ms each; the power spectrum is then obtained for every window, and the current window is compared with the previous one using the Euclidean distance [5]. The expression for the spectral flux [2] is:

$$SF(r) = \sum_{n=1}^{N} \Big( \big|\mathrm{fft}(y(r))[n]\big| - \big|\mathrm{fft}(y(r-1))[n]\big| \Big)^{2} \qquad (1)$$

where r is the current window, y(r) is the IAWS in the current window, y(r-1) is the IAWS in the previous window, N is the number of samples in each window, and fft denotes the transform from the time domain to the frequency domain.

The spectral centroid [2] determines the center of the spectrum of a signal and is commonly associated with the brightness of a sound. This measure is obtained by evaluating the "center of gravity" of the spectrum using the frequency and magnitude information of the Fourier transform.

$$SC(r) = \frac{\sum_{n=1}^{N} f_{r}[n] \, \big|Y_{f_r}[n]\big|}{\sum_{n=1}^{N} \big|Y_{f_r}[n]\big|} \qquad (2)$$

where r is the current window, N is the window length, f_r[n] is the frequency value at sample n, and Y_{f_r}[n] is the FFT coefficient value at sample n.


In a similar way, Jalil et al. [4] propose the short-time energy as a method to separate voiced and unvoiced segments of speech signals. The short-time energy is the energy of a short speech segment, calculated by windows. The energy of a signal in the time domain [4][7] is defined as:

$$E(r) = \sum_{n=1}^{N} \big|y_{r}(n)\big|^{2} \qquad (3)$$

where y_r is the current window of the signal under study, N is the number of samples in the window, and n is the current sample.

The features selected for this application are the spectral flux, the short-time energy, and the spectral centroid. The signals were analyzed in windows of 34 ms. The maximum value of each feature is recorded into a feature vector. Additionally, the sum of each feature over the 30 windows is used as a component of the feature vector, as shown in Figure 4. The maximum weights the largest contribution of a single window, while the sum accounts for the variations across all windows.

Figure 4. Pattern from one sensor, formed with six features.
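As an illustration of Figure 4, the six-feature pattern of one sensor could be assembled as sketched below, reusing the helpers above; the ordering of the components inside the vector is an assumption of the example.

```python
def sensor_pattern(signal):
    """Six features per sensor (cf. Figure 4): the maximum and the sum over
    the 30 windows of spectral flux, short-time energy and spectral centroid."""
    w = split_windows(signal)
    feats = (spectral_flux(w), short_time_energy(w), spectral_centroid(w))
    return np.array([v for f in feats for v in (f.max(), f.sum())])

def full_pattern(sensor_signals):
    """Concatenate per-sensor features: 5 sensors -> 30 features, 3 -> 18."""
    return np.concatenate([sensor_pattern(s) for s in sensor_signals])
```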

The classification experiments in this work include patterns formed with five sensors (30 features) and patterns formed with three sensors (18 features). A total of 2700 patterns distributed in five classes are used: 50% for training and 50% for testing.
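The shuffled 50/50 split can be sketched as follows (the random seed is arbitrary):

```python
import numpy as np

def split_half(patterns, labels, seed=0):
    """Shuffle the 2700 patterns, then use half for training and half for test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(patterns))
    half = len(patterns) // 2
    train, test = idx[:half], idx[half:]
    return patterns[train], labels[train], patterns[test], labels[test]
```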

Reducing the number of classes.

The original set of gestures in the database contains 22 classes, grouped in four categories: "Flick", "Taps", "Various", and "Drags". The objective is to use these gestures as commands to control smartphones, TVs, computers, lamps, or any object in a smart house. Since 22 gestures are excessive, the most representative gestures were selected.

Previous results suggest that gestures in the same category have similar features, except in the "Various" category; in consequence, the "Various" category must be separated into different classes. A new set of six gestures is proposed. The gestures selected from the "Various" category are: snap, clap, flex, click, open, and close. After several classification experiments, it was concluded that the open and close gestures behave similarly under the proposed features; hence, they were eliminated. Additional experiments showed that the "index tap" gesture obtains better performance with the proposed features. Therefore, the five gestures that present the best classification results are: index tap, snap, clap, flex, and click. The selected gestures are illustrated in Figure 5.

Figure 5. Gestures selected. From left to right: index tap, snap, clap, flex, and click.

With the gestures and features selected, the patterns can now be classified. A Bayesian classifier, KNN, and a neural network were used to classify the selected feature vectors.

Classification

Pattern recognition is an ability that human beings use to identify the voice of a friend in a group of voices, or the handwriting of a specific student in a stack of exams. These complex tasks are achieved by the human brain effortlessly; however, the same tasks are a challenge for a machine [8].

Machine learning is the discipline of "teaching" a machine to identify a specific voice or, more generally, a specific signal in a group of different signals. Training a classifier requires a set of signals; when these signals are labeled in advance with the category or class to which each one belongs, the learning is called supervised learning.

One of the most useful ways to represent pattern classifiers is in terms of discriminant functions g_i(x), i = 1, …, c. The classifier assigns a feature vector x to class ω_i if g_i(x) > g_j(x) for all j ≠ i. Such a classifier can be viewed as a network, as in Figure 6, computing c discriminant functions and selecting the category with the largest one [8].


Figure 6. Classifier Representation [8].

The Bayesian classifier (BC) is based on Bayesian decision theory. The number of classes to distinguish equals the number of discriminant functions involved in the BC. The discriminant function for the BC [8] is defined as

$$g_{i}(x) = -\tfrac{1}{2}(x-\mu_{i})^{t}\,\Sigma_{i}^{-1}(x-\mu_{i}) - \tfrac{d}{2}\ln 2\pi - \tfrac{1}{2}\ln|\Sigma_{i}| + \ln P(\omega_{i}) \qquad (4)$$

where g_i(x) is the discriminant function of the i-th class, x is the pattern to classify, d is the dimension of pattern x (the number of features), µ_i is a d-dimensional vector containing the mean values calculated from the training data of class i, Σ_i is the d × d covariance matrix calculated from the training data of class i, and P(ω_i) is the prior probability of class i. A more detailed explanation of the BC can be found in [8].

KNN is a non-parametric classifier that uses the distance between patterns as the discriminator. After computing the distances from a new point x to all training points, the k points nearest to x decide to which class pattern x belongs. The Euclidean distance is used in this approach. The mathematical expression for the Euclidean distance in d dimensions is

$$ed = \sqrt{\sum_{i=1}^{d}(x_{i}-tr_{i})^{2}} \qquad (5)$$

where x_i is the i-th component of the pattern, tr_i is the corresponding component of the training point, and ed is the Euclidean distance between the vectors x and tr. The number of neighbors k was varied from one to nine; the best performance was reached with k = 5. More detailed information about KNN can be found in [8].


Artificial neural networks (NN) are generally presented as systems of interconnected "neurons" which exchange messages with each other. The connections have numeric weights that can be tuned based on experience, making neural networks adaptive to inputs and capable of learning. An NN is a non-parametric technique whose learning is supervised. An NN was implemented using nprtool (Neural Pattern Recognition Tool) from the Matlab toolbox. The network uses 100 neurons in total: the first layer contains 30 neurons, the last layer contains 5 neurons, and the remaining neurons are in internal layers. The training architecture is backpropagation. A detailed section on NNs can be found in [8].
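The network itself was built with Matlab's nprtool; purely as a rough analogue, a Python sketch using scikit-learn is shown below. The single hidden layer of 65 neurons is inferred from the 100-neuron total (30 input + 65 internal + 5 output) and is an assumption, as is the use of MLPClassifier in place of nprtool.

```python
from sklearn.neural_network import MLPClassifier

# Assumed layout: 30 inputs (features), 65 hidden neurons, 5 outputs (classes),
# trained with backpropagation as in the report; hyperparameters are illustrative.
net = MLPClassifier(hidden_layer_sizes=(65,), max_iter=1000, random_state=0)
# net.fit(train_x, train_y)
# accuracy = net.score(test_x, test_y)
```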

Results

A total of 2700 patterns from five classes were shuffled; 50% of the patterns were then selected randomly for training, and the remaining 50% were used for testing. Each classifier was trained for two experiments: patterns with 30 features (5 sensors), with results in Figure 7, and patterns with 18 features (3 sensors), with results in Figure 8. The results show the best performance using 5 sensors, but after eliminating 2 sensors the classification accuracy remains acceptable. An important observation is that, using 18 features, the sensor combinations with the best accuracy involve sensors one and four in all cases; it is therefore worth trying to use only these sensors and analyzing the results.

Figure 7. Results classifying patterns with 30 features: BC 83.56%, KNN (k = 5) 90.52%, NN 90.37%.
Figure 8. Results classifying patterns with 18 features: BC 80.59% (C = 124), KNN (k = 5, C = 124) 84.30%, NN (C = 134) 84.44%.
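The observation about sensors one and four could be verified by scoring every three-sensor subset; a hypothetical sketch follows, where evaluate stands for any train-and-score routine (for instance the KNN above) and is not part of the original work.

```python
import numpy as np
from itertools import combinations

def rank_sensor_triples(sensor_feats, labels, evaluate):
    """sensor_feats maps sensor id -> (n_patterns, 6) feature block.
    Returns 3-sensor subsets sorted by the accuracy `evaluate` reports."""
    scores = {}
    for trio in combinations(sorted(sensor_feats), 3):
        x = np.hstack([sensor_feats[s] for s in trio])  # 18-feature patterns
        scores[trio] = evaluate(x, labels)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```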

Conclusions

Classification accuracy was improved by following the suggestions from the previous report: selecting features related to speech recognition, using five sensors in classification, and reducing the number of gestures. The next step consists in using an SVM and comparing the results with the previous classifiers.

Harrison et al. [6] employed 10 sensors and more than 180 features to reach an accuracy of 91% classifying four different datasets. Miura et al. [9] obtained an accuracy of 95% differentiating between eight classes, but using signals from a single user. This proposal employs five sensors and 30 features, reaching an accuracy of 90.52% separating signals from five different classes. Additionally, data from 18 users is used, supporting the hypothesis that classification is independent of the user.

Calendar activities

Planned activities for 2016:

- Literature review
- Statistical analysis of the different patterns for gesture recognition
- Course: "Oral expression"
- Study of SVM for pattern classification towards gesture recognition
- Course: "Optimization"
- Course: Seminar III
- Study of the implementation platform
- Writing a conference paper
- Start preparing a journal paper


References

[1] Walker, S. (2013, September). Wearable technology – market assessment. An IHS whitepaper. IHS Electronics & Media.

[2] Subramanian, H. (2004, November). Audio signal classification. Credit seminar report, M. Tech (pp. 1-16).

[3] Chmulik, M., & Jarina, R. (2012, April). Bio-inspired optimization of acoustic features for generic sound recognition. 19th International Conference on Systems, Signals and Image Processing (IWSSIP), 2012 (pp. 629-632). IEEE.

[4] Jalil, M., Butt, F. A., & Malik, A. (2013, May). Short-time energy, magnitude, zero crossing rate and autocorrelation measurement for discriminating voiced and unvoiced segments of speech signals. International Conference on Technological Advances in Electrical, Electronics and Computer Engineering (TAEECE), 2013 (pp. 208-212). IEEE.

[5] Giannoulis, D., Massberg, M., & Reiss, J. D. (2013). Parameter automation in a dynamic range compressor. Journal of the Audio Engineering Society, 61(10), 716-726.

[6] Harrison, C., Tan, D., & Morris, D. (2011). Skinput: appropriating the skin as an interactive canvas. Communications of the ACM, 54(8), 111-118.

[7] Mitra, S. K., & Kuo, Y. (2006). Digital signal processing: a computer-based approach (Vol. 2). New York: McGraw-Hill.

[8] Duda, R. O., Hart, P. E., & Stork, D. G. (2012). Pattern classification. John Wiley & Sons.

[9] Miura, K., Jiang, S., Hada, Y., & Okabayashi, K. (2015, July). Recognition of hand action using body-conducted sounds. 54th Annual Conference of the Society of Instrument and Control Engineers of Japan (SICE), 2015 (pp. 246-251). IEEE.