
King-Kopetzky Syndrome: An Approach for a Solution

Wissam Nawfal, Nizar Al-Aawar and Bassam Moslem

College of Engineering

Rafik Hariri University

Meshref, Lebanon

[email protected]; [email protected], [email protected]

Abstract— King-Kopetzky syndrome (KKS) is recognized as a clinical disorder of the human hearing system. It is defined as the condition in which an individual possesses a normal hearing threshold on pure tone audiometry yet complains of an inability or difficulty in understanding speech in the presence of background noise. What follows is a proposed approach to a solution for KKS consisting of three main blocks: separating the mixed signals, identifying the speaker of each speech signal, and finally applying a selectivity method for choosing the intended speaker. Convolutive Blind Source Separation (CoBliSS) was used for signal separation, while speaker identification relied on Mel Frequency Cepstral Coefficient Correlations as features and a Multilayer Perceptron (MLP) as a classifier. The results show that the proposed technique can serve as a possible solution for KKS.

Keywords—King-Kopetzky Syndrome (KKS); Selective Attention; Cocktail Party Problem; Blind Source Separation (BSS); Text-Independent Speaker Recognition; Classification

I. INTRODUCTION

Human beings normally possess a high ability to separate the sound signals arriving at their ears and to selectively divert their attention toward one sound source at a time, which is known as selective hearing [1]. The problem of demixing sound signals into independent components is known as the cocktail party problem [2]. Thus, when a normal person is in a room with two or more people speaking simultaneously alongside some background music, he can concentrate on one sound source and attenuate or ignore the rest. However, some people lack the ability to separate and concentrate on one sound source. This is known as King-Kopetzky syndrome (KKS).

KKS is recognized as a clinical disorder that can affect the hearing system of some human beings. It is defined as the condition in which an individual possesses a normal hearing threshold on pure tone audiometry (PTA) yet complains of an inability or difficulty in understanding speech, particularly in noisy environments [3]. An individual diagnosed with KKS therefore cannot focus his attention on one source. Since attention cannot be focused on more than one thing at a time [1], he will hear nothing but noise and several sound sources all mixed together. KKS deprives a human being of the ability to separate sound signals and concentrate on one at a time.

The term King-Kopetzky was introduced by Hinchcliffe in [4]. The syndrome has also been called "Auditory Disability with Normal Hearing" by Stephens and Rendell and "Obscure Auditory Dysfunction" (OAD) by Saunders and Haggard [5]. KKS has various contributing causes, ranging from psychological and physiological to psychoacoustical factors. Causes include, but are not limited to, central dysfunction, impaired lip-reading skills or audiovisual integration, emotional and work problems, and difficulties in the use of a second language [3-4]. Furthermore, several studies have shown that many patients with KKS have a family history of hearing problems [6].

Most previous research has focused on the psychological aspects and reasons behind KKS, and individuals diagnosed with KKS were not treated as real patients. Consequently, the effort in this domain was limited to familiarizing individuals with the syndrome and empowering them to manage their own hearing difficulties [6-8]. There is therefore a need for a concrete solution for patients with physiological disorders such as central dysfunction, impaired lip-reading skills, etc.

In this paper, a solution to the problem described above is proposed. The approach consists of three main building blocks. It starts by applying a Blind Source Separation (BSS) technique to separate the mixture of sound signals into independent components. Next, each of the available speech signals is identified; the best way to identify a speech signal is by recognizing its speaker, so text-independent speaker recognition is used to identify each speaker present. Finally, the patient has the privilege of choosing the speaker he wants to listen to. The output is the speech signal corresponding to the chosen person, while the other speech signals are attenuated and ignored. Fig. 1 illustrates the block diagram of the proposed approach.

II. METHODOLOGY

A. Convolutive Blind Source Separation

Blind source separation (BSS) deals with the problem of recovering independent signals using only mixtures of these signals, without any prior knowledge about the mixing process [9]. Instantaneous BSS provides a solution for mixing processes that are free of time delays. However, this assumption

17th IEEE Mediterranean Electrotechnical Conference, Beirut, Lebanon, 13-16 April 2014.

978-1-4799-2337-3/14/$31.00 ©2014 IEEE 252


is inadequate for some mixtures, especially mixed sound signals, where reflections and echoes introduce time delays into the mixing process.

Convolutive Blind Source Separation (CoBliSS), introduced by Schobben, applies a Multi-Channel Finite Impulse Response (MC-FIR) filter to these signals and depends entirely on Second Order Statistics (SOS) [10]. Optimization is achieved by minimizing the cross-correlations among the outputs of the multi-channel separating filter in the frequency domain. However, an unconstrained frequency-domain solution can destroy the time-domain solution, so a certain time-frequency compliance is needed to hold the time-domain solution of the filters intact. In real-world signals such as speech, the energy decays significantly at higher frequencies. Applied naively, CoBliSS boosts the energy at frequencies where the signal is weak and lowers it where the signal is strong. This leads to unwanted equalization in which the low frequencies are suppressed and the high frequencies are boosted, damaging the recovered sound signal. The solution is to normalize the weights, which ensures that all filter coefficients are of the same order of magnitude after normalization, leaving the timbre of the signal unaffected [10].
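One way to read this weight normalization is as a per-frequency-bin magnitude normalization of the separating filter that preserves phase. The sketch below is a minimal illustration of that idea, not Schobben's exact scheme:

```python
import numpy as np

def normalize_weights(W):
    """Normalize frequency-domain filter coefficients to equal magnitude,
    keeping each coefficient's phase, so that no frequency band is
    disproportionately boosted or suppressed (hypothetical sketch)."""
    mag = np.abs(W)
    mag[mag == 0] = 1.0  # leave exactly-zero coefficients untouched
    return W / mag
```

After such a normalization, every nonzero coefficient has unit magnitude, so the separating filter's gain no longer alters the spectral balance (the timbre) of the recovered signal.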

B. Text Independent Speaker Recognition

Speaker recognition is the act of verifying or identifying a speaker's identity using a set of features extracted from the corresponding speech segment [11]. The system used here is a text-independent speaker identification system: it uses Mel Frequency Cepstral Coefficient Correlations as features, and classification is performed by a multilayer perceptron [12]. The extraction of the Mel Frequency Cepstral Coefficients is described in Fig. 2.

Extraction starts with pre-emphasis, which boosts the magnitude of certain frequency bands relative to others [13]. Speech signals are quasi-stationary, so traditional spectral evaluation techniques cannot be applied directly; however, when the signal is viewed through a small enough window frame (usually about 20-40 ms), it is almost stationary [14-15]. The signal is therefore analyzed by windowing: each sample is multiplied by the value of a window function, which is then shifted along the signal [16]. Next, the signal is transformed from the time domain to the frequency domain using the Discrete Fourier Transform (DFT). With the signal in the frequency domain, mel filter-bank filtering initiates the MFCC feature extraction. The mel filter bank consists of 20-40 (26 is standard) triangular mel filters, and the log energy of each filter output is calculated [14-15]. Finally, the Discrete Cosine Transform (DCT) is applied to convert the log mel spectrum into the cepstral domain; the result of this transformation is the set of Mel Frequency Cepstral Coefficients (MFCC) [17].

It was shown in [12] that the correlations between the MFCCs yield a reliable and significant text-independent speaker identification system, and Mel Frequency Cepstral Coefficient Correlations were therefore adopted here. From the 12 MFCCs, a correlation matrix of 144 elements is created, of which only 66 elements are useful owing to the matrix's symmetry and the fact that its diagonal elements are nothing but autocorrelation coefficients [12]. The mean vector and the standard deviation vector of these elements are then calculated, and the result is a 24-element feature vector for the text-independent speaker identification system.
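The reduction from the 144-element correlation matrix to its 66 unique off-diagonal entries can be sketched as follows (a minimal NumPy illustration; the MFCC matrix is assumed to be precomputed, with one 12-coefficient row per frame):

```python
import numpy as np

def correlation_features(mfcc):
    """mfcc: array of shape (n_frames, 12) holding per-frame MFCCs.
    Returns the 66 unique off-diagonal correlation coefficients."""
    corr = np.corrcoef(mfcc.T)                # 12 x 12 = 144 elements
    iu = np.triu_indices(corr.shape[0], k=1)  # upper triangle, diagonal excluded
    return corr[iu]                           # 12 * 11 / 2 = 66 elements
```

The diagonal (autocorrelation) entries and the lower triangle carry no extra information, which is why only 66 of the 144 elements survive.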

Fig. 1 Proposed solution for King-Kopetzky syndrome



Fig. 2 MFCC Calculation

III. RESULTS

In the previous section, the theory behind CoBliSS and the text-independent speaker recognition system was presented. To test their effectiveness in solving KKS, both algorithms were applied to real signals and the results were interpreted.

BSS was tested on real-world signals mixed in different environments. The first test was speech-music separation: a speaker was recorded with two distant-talking microphones in a normal office room with loud music playing in the background, as described in fig. 3. The distance between the speaker, the cassette player and the microphones was about 60 cm in a square ordering [18]. The second experiment, described in fig. 4, was speech-speech separation, in which the cassette player was replaced with a second speaker [18]. Finally, CoBliSS's blind separation ability was tested in a difficult environment, described in fig. 5, in which two speakers were recorded in a 5.5 x 8 m conference room. The conference room had some air-conditioning noise, and the microphones were placed 120 cm away from the speakers [18].

The three experiments served as a good test of the chosen BSS technique. Because no a priori information about the original unmixed signals was available, the effectiveness of the separation algorithm could not be measured objectively; however, expert listeners judged the separation to be ample.

Fig. 4 Speech-Speech Separation

Fig. 5 Speech-Speech (difficult environment) separation

Next, text-independent speaker recognition was tested on the Australian National Database of Spoken Language (ANDOSL) [19]. The database consisted of a total of 200 speech samples divided equally between 2 different speakers. The difficulty of this database is that the utterances come from 2 males of nearly the same age group, so the algorithm was tested on a hard identification case. Several tests were conducted to determine the ratio of training signals, and it was found that 10% of the signals (20 utterances) were enough to train the classifier. Feature vectors were extracted from the 20 training utterances, and a unique target vector was set for each. The feature vectors and targets were then fed as inputs and target outputs, respectively, to train the MLP. The remaining 180 signals served as testing and validation signals. The recognition rate was high enough for the proposed solution. The algorithm was tested on three different random combinations of training and test signals, and the recognition rates are presented in Table 1.

After demonstrating the effectiveness of each of the tools used throughout the study, the proposed solution was tested as a complete system. Two random speech utterances were chosen from the ANDOSL database and artificially mixed using a random square mixing matrix A.

Fig. 3 Speech-Music Separation



It should be noted that the elements of the mixing matrix A were constrained between 0 and 1 to replicate the amplitude attenuation that affects a sound signal as it moves away from its source. At this point, the obtained mixed signals are a version of the signals that arrive at the human ear. The first step is to apply CoBliSS to the mixed signals. Since the original unmixed signals were known this time, an evaluation of the CoBliSS algorithm was possible. The first evaluation test was the cross-correlation between the separated signals and the original signals. The results are presented in Table 2.
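The artificial mixing step can be sketched as follows (a minimal NumPy illustration; the random stand-in signals, their length, and the fixed seed are assumptions in place of the actual ANDOSL utterances):

```python
import numpy as np

rng = np.random.default_rng(0)
s1 = rng.standard_normal(16000)  # stand-ins for the two speech utterances
s2 = rng.standard_normal(16000)
S = np.vstack([s1, s2])          # sources as rows: shape (2, n_samples)

# Random square mixing matrix with entries in [0, 1], mimicking the
# amplitude attenuation of a sound signal moving away from its source.
A = rng.uniform(0.0, 1.0, size=(2, 2))
X = A @ S                        # the two "ear" mixtures fed to CoBliSS
```

The separation stage then has to recover estimates of the rows of S from X alone, without knowledge of A.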

The calculated correlation coefficients clearly show that each separated signal is essentially uncorrelated with one of the original signals, with correlation values of 0.07 and 0.04. However, the correlation with the other original signal is not high enough to establish the efficiency of the separation. This result is predictable: since the separation is not ideally complete, a slight contribution from the non-intended signal acts as interference and lowers the correlation with the original signal. A way to objectively evaluate Blind Audio Source Separation (BASS) algorithms is presented by Vincent et al. in [20]. The numerical performance criteria that help evaluate a BSS are the Signal to Distortion Ratio (SDR), Signal to Interference Ratio (SIR), Signal to Noise Ratio (SNR), and Signal to Artifact Ratio (SAR). These four measures are inspired by the usual definition of SNR, with a few modifications, and are defined as follows:

$$\mathrm{SDR} := 10\log_{10}\frac{\|s_{\mathrm{target}}\|^2}{\|e_{\mathrm{interf}}+e_{\mathrm{noise}}+e_{\mathrm{artif}}\|^2} \qquad (1)$$

$$\mathrm{SIR} := 10\log_{10}\frac{\|s_{\mathrm{target}}\|^2}{\|e_{\mathrm{interf}}\|^2} \qquad (2)$$

$$\mathrm{SNR} := 10\log_{10}\frac{\|s_{\mathrm{target}}+e_{\mathrm{interf}}\|^2}{\|e_{\mathrm{noise}}\|^2} \qquad (3)$$

$$\mathrm{SAR} := 10\log_{10}\frac{\|s_{\mathrm{target}}+e_{\mathrm{interf}}+e_{\mathrm{noise}}\|^2}{\|e_{\mathrm{artif}}\|^2} \qquad (4)$$

The principle of these performance measures is to decompose a given estimate $\hat{s}(t)$ such that

$$\hat{s}(t) = s_{\mathrm{target}}(t) + e_{\mathrm{interf}}(t) + e_{\mathrm{noise}}(t) + e_{\mathrm{artif}}(t) \qquad (5)$$

Here, $s_{\mathrm{target}}$ is defined as an allowed deformation of the target source; $e_{\mathrm{interf}}$ is an allowed deformation of the sources that accounts for interference from the unwanted sources; $e_{\mathrm{noise}}$ is an allowed deformation of the perturbing noise (but not of the signals); and $e_{\mathrm{artif}}$ is an artifact term that may correspond to artifacts of the separation algorithm, such as musical noise, or simply to deformations induced by the separation algorithm that are not allowed [21].

The evaluation parameter of interest is SIR, since it compares the separated signals against the original sources: a high SIR value corresponds to a high ratio between the target signal and the interference. The resulting SIRs were 12.22 dB and 9.3 dB, respectively, demonstrating the efficiency of the BSS algorithm.
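Under the simplifying assumption that the noise and artifact terms of Eq. (5) are negligible and the sources are roughly orthogonal, the SIR of Eq. (2) can be sketched by projecting the estimate onto the known sources (a minimal NumPy illustration, not the full BSS_EVAL decomposition of [21]):

```python
import numpy as np

def sir_db(estimate, target_src, interf_src):
    """Simplified SIR of Eq. (2): project the estimated signal onto the
    known target and interfering sources, then compare their energies.
    Assumes negligible noise/artifact terms and near-orthogonal sources."""
    s_target = (estimate @ target_src) / (target_src @ target_src) * target_src
    e_interf = (estimate @ interf_src) / (interf_src @ interf_src) * interf_src
    return 10 * np.log10(np.sum(s_target ** 2) / np.sum(e_interf ** 2))
```

For a perfect separation the interference projection vanishes and the SIR tends to infinity; residual leakage from the other source lowers it toward 0 dB.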

Now that the mixed signals have been broken down into independent components, each speech utterance must be identified; text-independent speaker recognition was used for this. The MLP was trained as previously stated using random signals from the ANDOSL database. The separated signals were then fed into the MLP and the speakers were identified. The final, and most important, step is giving the patient the privilege of selecting the speaker he/she wants to hear. This was done through a user-friendly program in which the user chooses either "speaker 1" or "speaker 2". Once the selection is made, the algorithm plays the speech utterance of interest. With this, a complete preliminary solution for KKS was tested from top to bottom.

IV. DISCUSSION

In everyday life, the ears receive numerous sound signals all mixed together. People are often interested in focusing on a single sound source, which introduces the problem of demixing the sound signals in order to hear the one of interest, known as the cocktail party problem. The cocktail party problem refers to the task of acquiring a sound or speech segment of interest from a mixture of sounds. A listener faces two main challenges in an environment with two or more speech and sound signals: the first is the problem of sound segregation, and the second is directing the listener's attention toward the intended sound source while ignoring the others, and continuously switching attention between different sound sources [2]. The proposed solution offers a way to overcome these difficulties and thus solve the cocktail party problem for people diagnosed with KKS.

Starting with Blind Source Separation, instantaneous BSS failed to deliver the required separation, as the separated signals were severely damaged. Convolutive Blind Source Separation was the solution for separating sound signals mixed in real

TABLE 1. SPEAKER RECOGNITION SUCCESS RATE

Combination         1st combination   2nd combination   3rd combination
Recognition rate    99.4%             96.1%             86.1%

TABLE 2. CORRELATION COEFFICIENTS

                   Original 1   Original 2
Separated 1        0.5986       0.0679
Separated 2        0.0352       0.7226



environments, owing to the reflections and time delays imposed on the mixtures. Moreover, the CoBliSS algorithm respects the fact that a speech signal's energy decays at higher frequencies. CoBliSS was tried in the most representative real-environment scenarios: it succeeded in separating speech-speech mixtures, speech-music mixtures, and different sound signals in the presence of background noise, demonstrating the algorithm's ability and robustness in the proposed application. A problem faced while evaluating the efficiency of BSS on real-world mixed signals is that the data acquisition system could record either the individual signals or the mixed signals, but not both. With only one of the two available, the only way to evaluate was to rely on the subjective yet effective human hearing sense.

As for the text-independent speaker recognition system, the algorithm succeeded with recognition rates ranging between 86.11% and 99.44% and an average of 93.9%. One possible explanation for the lowest recognition rate (86.11%) is a bad combination of training signals used for the MLP. Signal acquisition is subject to various artifacts caused by the recording environment, data acquisition equipment, background noise, etc., which normally affect classification results. In addition, the ANDOSL database used for evaluating the proposed algorithm proved to be a hard one: consisting of speech utterances from two male adults of the same age group, it presents one of the most severe cases for speaker recognition.

V. CONCLUSION AND FUTURE RECOMMENDATION

Two signal processing techniques were combined with a selectivity method to achieve a possible preliminary solution for KKS. By focusing on the physiological causes behind KKS, this work can offer patients a dim light inside a dark tunnel. KKS breaks down communication and the ability to follow other people, especially in crowded environments, disconnecting the patient from society. Addressing this pathology will certainly help patients blend in again and re-establish lost communication with their environment.

The presented work proposed a software solution for the indicated pathology. The next step should be a hardware implementation in a portable and compact device capable of helping patients fight KKS. This device could be further developed into a cochlear prosthesis, making it much more compact and hidden and allowing the full and easy integration of the patient into society. In addition, the proposed selectivity technique could be further developed so that the patient selects the person he wants to hear by calling that person's name through a dedicated microphone. This would make the system more user-friendly by avoiding tags and codes: the pre-recorded database would be tagged with the name of the speaker corresponding to each speech utterance through a speech recognition technique. The patient then calls the name of the intended speaker through the dedicated microphone, and the system analyzes the spoken word, finds its match, and extracts the corresponding speech utterance.

On the other hand, although this research was aimed at a solution for KKS, it can also be applied to the development of cognitive robots. The proposed system can be implemented in robots to restrict orders to programmers or owners: if a robot operates in a crowded environment, it will have no problem identifying its operator and abiding by his instructions.

REFERENCES

[1] R.A. Dewey, "Psychology: An Introduction", Wadsworth Publishing, 2004.

[2] J.H. McDermott, "The cocktail party problem", Current Biology, vol. 19, no. 22, 2009.

[3] F. Zhao and D. Stephens, "Subcategories of patients with King-Kopetzky syndrome", British Journal of Audiology, vol. 34, pp. 241-256, 2000.

[4] R. Hinchcliffe, "King-Kopetzky syndrome: An auditory stress disorder?", J Audiol Med, vol. 1, pp. 89-98, 1992.

[5] G. Saunders and H. Haggard, "The clinical assessment of obscure auditory dysfunction-1. Auditory and psychological factors", Ear and Hearing, vol. 10, no. 3, pp. 200-208, 1989.

[6] F. Zhao and D. Stephens, "The role of a family history in King-Kopetzky syndrome (obscure auditory dysfunction)", Acta Oto-Laryngologica, vol. 120, no. 2, pp. 197-200, 2000.

[7] F. Zhao and D. Stephens, "Audioscan testing in patients with King-Kopetzky syndrome", Acta Oto-Laryngologica, vol. 119, no. 3, pp. 306-310, 1999.

[8] H. Pryce, "The process of coping in King-Kopetzky syndrome", Audiological Medicine, vol. 4, pp. 60-67, 2006.

[9] K. Matsuoka, "Independent component analysis and its applications to sound signal separation", International Workshop on Acoustic Echo and Noise Control, pp. 15-18, Sept. 2003.

[10] D.W.E. Schobben and P.C.W. Sommen, "A new blind signal separation algorithm based on second order statistics", Proceedings of the IASTED International Conference on Signal and Image Processing, USA, Oct. 1998.

[11] D.A. Reynolds, "An overview of automatic speaker recognition technology", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 4, pp. IV-4072, IV-4075, 2002.

[12] R. Soria and E. Cabral, "Speaker recognition with artificial neural networks and Mel-Frequency Cepstral Coefficients correlations", EUSIPCO, vol. 2, pp. 1051-1054, 1996.

[13] A. Bala, A. Kumar and N. Birla, "Voice command recognition system based on MFCC and DTW", International Journal of Engineering Science and Technology, vol. 2, no. 12, pp. 7335-7342, 2010.

[14] S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences", IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 4, pp. 357-366, 1980.

[15] X. Huang, A. Acero and H. Hon, "Spoken Language Processing: A Guide to Theory, Algorithm, and System Development", Prentice Hall, 2001.

[16] T.H. Park, "Introduction to Digital Signal Processing: Computer Musically Speaking", World Scientific Publishing Co. Pte. Ltd., Singapore, 2010.

[17] L.R. Rabiner and B.H. Juang, "Fundamentals of Speech Recognition", Prentice-Hall, Englewood Cliffs, N.J., 1993.

[18] T.-W. Lee, A. Ziehe, R. Orglmeister and T.J. Sejnowski, "Combining time-delayed decorrelation and ICA: towards solving the cocktail party problem", Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Seattle, vol. 2, pp. 1249-1252, 1998.

[19] Australian National Database of Spoken Language (ANDOSL), www.andosl.rsise.anu.edu.au/andosl/

[20] E. Vincent, R. Gribonval and C. Févotte, "Performance measurement in blind audio source separation", IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 4, pp. 1462-1469, 2006.

[21] C. Févotte, R. Gribonval and E. Vincent, "BSS_EVAL Toolbox User Guide", rev. 2, Technical Report 1706, IRISA, 2005.
