
Person Recognition using Humming, Singing and Speech

Hemant A. Patil, Maulik C. Madhavi, Nirav H. Chhayani Dhirubhai Ambani Institute of Information and Communication Technology (DA-IICT)

Gandhinagar, India e-mail: {hemant_patil,madhavi_maulik,chhayani_nirav}@daiict.ac.in

Abstract – Speaker recognition deals with designing systems that recognize a person from his or her speech with the help of computers. In this paper, various biometric signals produced by humans, viz., speech, singing and humming, are considered for the person recognition task. A corpus has been developed from 28 subjects in real-life settings. For the person recognition task, a state-of-the-art feature set, viz., Mel Frequency Cepstral Coefficients (MFCC), and a discriminatively-trained polynomial classifier with 2nd order approximation are used as the spectral feature and classification technique, respectively. Our experimental results indicate that person recognition using humming outperforms the other biometric patterns (i.e., speech and singing) by 9 % in EER and 9 % in identification rate. We believe this may be because person-specific characteristics are better captured in humming (which consists of nasalized sounds) than in speech and singing.

Keywords- Biometric; Humming; Corpus development; Speaker recognition; Singer recognition

I. INTRODUCTION

Speaker recognition has been extensively studied using normal speech signals; speaker recognition evaluations and methodologies are given in [1]. Research on speaker recognition in Indian languages is reported in [2]. There has been recent development in using a person's hum as a biometric pattern. A hum is a sound created by a person by closing the mouth and forcing the air flow to pass through the nose. It is very significant for person recognition due to the following reasons [3],[8]:

• Nasal cavities are almost fixed during production of nasals as compared to oral sounds.

• Nasal sounds have less variation within the person and more variations across different persons [3].

In terms of the universality criterion for a biometric pattern, humming is applicable to deaf persons as well as to infants who have disorders in the speech production mechanism [4]. Humming-based person detection and identification were proposed initially by Patil et al. [5] and subsequently by Jin et al. [6]. Both used spectral features, viz., Mel Frequency Cepstral Coefficients (MFCC), in their work [7]. Patil et al. studied static and dynamic features for person recognition using humming [8].

The singing voice is the oldest musical instrument and one with which almost everyone has a great deal of familiarity. Given the importance and usefulness of vocal communication, it is not surprising that our auditory physiology and perceptual apparatus have evolved to a high level of sensitivity to the human voice. This is primarily due to the rapid acoustic variation involved in the singing process: in order to pronounce different words, a singer must move his or her jaw, tongue, teeth, etc., changing the shape and thus the acoustic properties of the vocal tract. No other instrument exhibits the physical variation of the human voice. This complexity has affected research in both analysis and synthesis of singing [9]. For example, Figure 1 shows the time-domain waveform and corresponding spectrogram for humming, singing and speech produced by a male speaker of age 23 years. It is evident from the plots that the pattern and frequency range of the spectral energy distribution varies for each of these signals. It has been observed that speaker-discriminative information is mostly concentrated in the frequency region between 4 kHz and 5.5 kHz, while the glottis contributes important speaker information in the low frequency region [10].
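Panels like those in Figure 1 are straightforward to reproduce. The following is a minimal sketch, assuming SciPy and Matplotlib and hypothetical mono 16-bit .wav file names, of plotting a waveform and its spectrogram for each signal type.

```python
# Minimal sketch: waveform + spectrogram panels as in Figure 1.
# Assumes mono 16-bit PCM .wav files; the file names are hypothetical.
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
from scipy.signal import spectrogram

files = ["hum.wav", "sing.wav", "speech.wav"]   # hypothetical paths
fig, axes = plt.subplots(len(files), 2, figsize=(10, 8))

for row, path in zip(axes, files):
    fs, x = wavfile.read(path)                  # fs in Hz, int16 samples
    x = x.astype(np.float64) / 32768.0          # normalize 16-bit PCM
    t = np.arange(len(x)) / fs
    row[0].plot(t, x, linewidth=0.5)            # time-domain waveform
    row[0].set(xlabel="Time (s)", ylabel="Amplitude", title=path)
    f, tt, Sxx = spectrogram(x, fs=fs, nperseg=1024, noverlap=512)
    row[1].pcolormesh(tt, f, 10 * np.log10(Sxx + 1e-12))  # power in dB
    row[1].set(xlabel="Time (s)", ylabel="Frequency (Hz)")

plt.tight_layout()
plt.show()
```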

Thus, in this study, we consider speech, humming and singing as person-specific attributes that can be useful in person recognition applications. In this work, we propose person recognition using three different patterns, viz., speech, singing and humming. The idea behind using all three patterns is to study the discrimination ability of these biometric patterns.

The organization of the paper is as follows. Section II describes the corpus development. Section III discusses the experimental setup. The experimental results along with their evaluation are described in Section IV. Finally, Section V concludes the work with future research directions.

II. CORPUS DEVELOPMENT AND DESIGN

Corpus development is the key and primary part of this research work on person recognition. The speech corpus and songs are in Hindi (an Indian language). The database was developed with the help of 28 persons from the western part of India, viz., Gujarat state. In this region, Gujarati is spoken as the primary language and Hindi as a secondary language. The database was recorded with Sennheiser PC 3 Chat microphones. To obtain spontaneous speech, the interviewer asked a few questions regarding personal information of the speaker, such as his or her name, age, education, profession, etc. The speech part of the corpus, which is used for speech-based person recognition only, contains five questions, isolated words, digits, combination-lock phrases, sentences and a paragraph [2]. To obtain read speech, a list of text material was prepared and given to the subject to read in his or her own way. The language used in all three types of recording (i.e., speech, humming and singing) was Hindi.

As we know, for every speaker the speech cannot be the same each time he or she speaks. Thus, we have to consider all possible variations in pronunciation of a particular text material. Hence, we recorded the data with a few repetitions. We observed that, as the number of repetitions increases, the speaker starts with more loudness and, toward the end, tries to complete the recording as soon as possible, so subjects speak very fast.

The data was recorded with 5 repetitions, except for digits and combination-lock phrases (3 repetitions) and contextual speech (no repetition). These repetitions were taken to obtain more realizations of the random process (as the speech production mechanism is also a random process).

Figure 1: (a) waveform of humming signal, (b) spectrogram of humming signal, (c) waveform of singing signal, (d) spectrogram of singing signal, (e) waveform of speech signal and (f) spectrogram of speech signal.

For the corpus development of humming and singing, we chose Hindi Bollywood songs. Similar songs were used to build the corpora in [11]. There are 14 Hindi songs for training and 6 songs for testing; the songs used in testing are different from those used in training. For each session, the three types of recordings were performed one after another.

A. Data Acquisition

The recorded voice was stored on a computer (Core 2 Duo, 1.40 GHz, 2 GB RAM) through its sound card. The speech files were stored in '.wav' format with the Audacity cross-platform sound editor at 44.1 kHz sampling frequency and 16-bit resolution. Table 1 gives a detailed description of the database. The age of the subjects varies from 13 to 45 years. The database also contains two identical twin pairs, i.e., four speakers (one male pair and one female pair).
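To illustrate the acquisition format and the segment durations listed in Table 1, here is a minimal loading-and-slicing sketch, assuming SciPy and a hypothetical file name.

```python
# Sketch: load a 44.1 kHz, 16-bit mono recording and slice the
# training/testing segment durations of Table 1. File name is hypothetical.
import numpy as np
from scipy.io import wavfile

fs, x = wavfile.read("subject01_speech.wav")    # hypothetical file
assert fs == 44100, "corpus recordings use a 44.1 kHz sampling rate"
x = x.astype(np.float64) / 32768.0              # 16-bit PCM -> [-1, 1)

def segment(sig, start_s, dur_s, fs):
    """Return dur_s seconds of audio starting at start_s seconds."""
    a = int(start_s * fs)
    return sig[a:a + int(dur_s * fs)]

train_durs = [30, 45, 60, 75, 90]               # 5 training segments (s)
test_durs = np.arange(5.0, 30.5, 0.5)           # 51 testing segments (s)
train_segs = [segment(x, 0.0, d, fs) for d in train_durs]
test_segs = [segment(x, 0.0, d, fs) for d in test_durs]
```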

B. Issues in Data Collection

The following are typical issues and observations noted while building the corpus for the proposed study.

1) The database was built with people from Gujarat, India, where Gujarati is the primary language and Hindi a secondary language. Subjects were therefore chosen who are literate and know Hindi sufficiently well.

2) Sometimes laughter, throat clearing and poor articulation were recorded along with the normal speech.

3) We took samples of humming, speech and singing one after another from each subject and took the utmost care that subjects felt comfortable while recording; because of this, the recording required a long time. At first, speech samples were taken to make the subject comfortable with the recording environment.

Table 1: Database Description

  No. of Persons:          28 (21 female and 7 male)
  No. of Sessions:         1
  Data Type:               Read speech, singing and humming
  Sampling Rate:           44.1 kHz
  Sampling Format:         1 channel, 16-bit resolution
  Type of Speech:          Read sentences, isolated words and digits, combination-lock phrases, questions, contextual speech of considerable duration
  Application:             Text-independent automatic speaker recognition, Query-By-Humming (QBH), person recognition using humming, singer identification
  Language:                Hindi
  No. of Songs (Singing):  14 songs for training and 6 songs for testing
  No. of Songs (Humming):  14 songs for training and 6 songs for testing (same songs as used for singing)
  No. of Repetitions:      5 for sentences and isolated words; 3 for digits and combination-lock phrases; none for questions, contextual speech, singing and humming
  Training Segments:       30 s, 45 s, 60 s, 75 s, 90 s (5 training segments)
  Testing Segments:        5 s, 5.5 s, 6 s, 6.5 s, ..., 30 s (51 testing segments)
  Genuine Trials:          28 (persons) x 5 (training segments) x 51 (testing segments) = 7,140
  Impostor Trials:         28 (persons) x 28 (persons) x 5 (training segments) x 51 (testing segments) - 7,140 (genuine trials) = 192,780
  Microphone:              Sennheiser PC 3 Chat (frequency response: 90 Hz - 15 kHz)
  Acoustic Environment:    Room environment

Then singing samples were recorded, which is a somewhat difficult part of the recording. Then subjects were asked to hum the same songs, ensuring that subjects were already aware of each song. Humming is the most exhausting part of the recording, so this session was taken last.

4) It was also observed that, due to lack of speaking experience in Hindi, some words and combination-lock phrases were not pronounced correctly; in such cases the interviewer first read the text and the subject then repeated it.

5) Initially, subjects were reluctant to record their voice. To relieve this nervousness, we started recording with a general introductory question-answer session, which helped subjects become familiar with the recording environment.

6) Because singing is part of the corpus, nervousness also increased in the subjects. To overcome this, we first had to make them realize that they sing and hum well; sometimes we made dummy recordings of the singing and humming. This was useful to motivate them and boost their confidence for recording songs and humming.

7) Singing and humming are part of the corpus, and subjects sometimes did not know the lyrics or the tune of a song. To avoid this difficulty, we let them listen to the actual songs sung by the original singers, so that they became familiar with each song before recording the singing and humming.

8) The humming was done with closed lips, so subjects were informed beforehand about the difficulty of recording humming. It is always difficult to ask a subject to hum for a long duration, because after some time breathing noise may be recorded. Subjects could therefore interrupt the recording at their convenience and comfort level.

9) Since speech and singing involve both oral and nasal sounds, the microphone was positioned below the lips, to the left-hand side. Humming consists of nasal sounds, so for recording humming the microphone was moved near the nose, on the left-hand side of the subject. Thus the microphone position is also important for the different kinds of recording.
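As a quick arithmetic check on the genuine and impostor trial counts given in Table 1 above, the following short sketch reproduces them directly.

```python
# Quick check of the genuine and impostor trial counts in Table 1.
persons, train_segs, test_segs = 28, 5, 51

# Genuine: each person's 51 test segments scored against his/her own 5 models.
genuine = persons * train_segs * test_segs                       # = 7,140
# Impostor: every test against every other person's models, i.e., all
# person-vs-person trials minus the genuine ones.
impostor = persons * persons * train_segs * test_segs - genuine  # = 192,780

print(genuine, impostor)  # 7140 192780
```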

III. EXPERIMENTAL SETUP

A. Mel Frequency Cepstral Coefficients (MFCC)

MFCC features are among the most popular, state-of-the-art features in speech and speaker recognition applications. In 1980, Davis and Mermelstein used triangular-shaped filters in their proposed Mel-filterbank structure to identify monosyllabic words in a speech recognition application [7]. For MFCC computation, the input sequence (hum, speech or singing) is first passed through a pre-processing block (frame blocking, Hamming windowing and pre-emphasis filtering). Each pre-processed frame is then passed through the Mel filterbank, followed by subband energy and logarithm operations. Finally, the Discrete Cosine Transform is applied to obtain 12-dimensional MFCC feature vectors. The MFCC feature extraction scheme is described by the block diagram in Figure 2.
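The chain described above can be made concrete with a short sketch. The following is a minimal NumPy/SciPy implementation of that pipeline; the frame length, FFT size, filter count and the convention of dropping c0 are illustrative assumptions, not the paper's exact settings.

```python
# Minimal sketch of the MFCC chain described above: pre-emphasis,
# frame blocking, Hamming window, magnitude spectrum, Mel filterbank,
# log subband energies, DCT. Parameter values are assumptions.
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters spaced uniformly on the Mel scale [7]."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    return fbank

def mfcc(x, fs, n_coeff=12, frame_len=0.025, frame_shift=0.010,
         n_fft=2048, n_filters=26):
    x = np.append(x[0], x[1:] - 0.97 * x[:-1])            # pre-emphasis
    flen, fshift = int(frame_len * fs), int(frame_shift * fs)
    n_frames = 1 + max(0, (len(x) - flen) // fshift)
    idx = np.arange(flen)[None, :] + fshift * np.arange(n_frames)[:, None]
    frames = x[idx] * np.hamming(flen)                    # blocking + window
    mag = np.abs(np.fft.rfft(frames, n_fft))              # magnitude spectrum
    energies = mag @ mel_filterbank(n_filters, n_fft, fs).T
    log_e = np.log(np.maximum(energies, 1e-10))           # log subband energy
    # DCT decorrelates the log energies; keeping coefficients 1..12
    # (dropping c0) is a common convention and an assumption here.
    return dct(log_e, type=2, axis=1, norm="ortho")[:, 1:n_coeff + 1]
```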

B. Polynomial Classifier

In this paper, a discriminatively-trained polynomial classifier is used as the basis for all person recognition experiments. Campbell et al. first proposed the polynomial classifier for the speaker verification application [12]. This classifier approximates the optimal Bayes classifier, allows the addition of new classes, and has an efficient multiply-and-add structure suitable for Digital Signal Processors (DSPs). It uses out-of-class data to optimize performance, as opposed to other statistical methods such as the Hidden Markov Model (HMM), Gaussian Mixture Model (GMM), etc. [12]. For the two-class problem, let $\mathbf{w}$ be the speaker model, $\omega$ the class label, and $y(\omega)$ the ideal output, i.e., $y(\mathrm{spk}) = 1$ and $y(\mathrm{imp}) = 0$. The resulting optimization problem using the Mean Square Error (MSE) criterion is

$$ \mathbf{w}^{*} = \arg\min_{\mathbf{w}} \, E\left\{ \left| \mathbf{w}^{T} p(\mathbf{x}) - y(\omega) \right|^{2} \right\}, \qquad (1) $$

where $E\{\cdot\}$ denotes expectation over $\mathbf{x}$ and $\omega$, and $p(\mathbf{x})$ is the vector of polynomial basis terms of the input feature vector [12]. The basic structure of the classifier is shown in Figure 3.

IV. EXPERIMENTAL RESULTS

In this paper, a state-of-the-art feature set, viz., Mel Frequency Cepstral Coefficients (MFCC), and a discriminatively-trained polynomial classifier are used. It is worthwhile to observe how each pattern performs for the person recognition task. A DET curve is the trade-off curve between two types of errors, viz., false alarm and miss detection. The point at which both of these errors are equal is known as the Equal Error Rate (EER) [13].

Figure 2. MFCC feature extraction (after [7]): speech/hum/singing input -> pre-processing -> magnitude spectrum -> Mel filterbank -> subband energy -> log -> DCT -> MFCC.

Figure 3. The polynomial classifier structure [12]: M testing feature vectors x_1, ..., x_M are scored against the person-specific models w_1, ..., w_N as (1/M) * sum_i w_j^T p(x_i), and the maximum score gives the identified person (% ID).

The Detection Cost Function (DCF) is the weighted average of the costs associated with the false acceptance and miss detection probabilities. The success rate is the percentage of correctly identified persons out of the total persons tested. Table 2 and Figure 4 show the performance obtained on the corpus developed in this work.
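For completeness, the following sketch shows one common way to compute EER and minimum DCF from sets of genuine and impostor scores; the cost weights and target prior are illustrative assumptions, since the paper does not state the exact DCF parameters used.

```python
# Sketch: EER and minimum DCF from genuine/impostor score sets.
# c_miss, c_fa and p_tgt are assumed values, not the paper's settings.
import numpy as np

def eer_and_min_dcf(genuine, impostor, c_miss=1.0, c_fa=1.0, p_tgt=0.5):
    scores = np.concatenate([genuine, impostor])
    labels = np.concatenate([np.ones(len(genuine)), np.zeros(len(impostor))])
    order = np.argsort(scores)          # sweep the threshold from low to high
    labels = labels[order]
    # After the k-th sorted score: misses = genuine at or below the threshold,
    # false alarms = impostors still above it.
    p_miss = np.cumsum(labels) / len(genuine)
    p_fa = 1.0 - np.cumsum(1.0 - labels) / len(impostor)
    k = np.argmin(np.abs(p_miss - p_fa))
    eer = (p_miss[k] + p_fa[k]) / 2.0   # EER: where the two error rates cross
    dcf = c_miss * p_tgt * p_miss + c_fa * (1.0 - p_tgt) * p_fa
    return eer, float(dcf.min())
```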

Performance is obtained over a large number of genuine and impostor trials. For a statistically meaningful evaluation, we considered 95 % confidence intervals (as defined in [14]). It is evident from Figure 4 and Table 2 that humming-based person recognition performs better than the other two biometric patterns, viz., singing and conventional speech. Singing-based recognition is better than speech but worse than humming. It is evident from Figure 4 that the DET curve obtained using humming outperforms those for speech and singing at most of the operating points. This may be due to the fact that humming sounds are nasalized sounds, which are produced with less movement of the articulators in the vocal tract. In addition, during the production of nasal sounds, the nasal cavity remains almost steady. These two properties of nasal sounds (in this case, humming) make them person-specific. Furthermore, nasal sounds (in particular, humming) have spectral energy concentrated mostly below 2 kHz, whereas for speech and singing it extends up to 5 kHz or even more. Since the Mel filterbank has good spectral resolution below 1 kHz, person-specific information is better captured by MFCC features in humming than in singing and normal speech.

V. SUMMARY AND CONCLUSIONS

Experimental results show that humming gives better performance than singing and speech for the speaker recognition system. One limitation of the present work is that results are reported on single-session data. Our immediate goal is to collect samples and evaluate the performance on a larger population with more than one training session and four testing sessions. Another goal is to design features that give better performance and algorithms in which speaker recognition is based on all three patterns, i.e., speech, singing and humming.

REFERENCES

[1] G. R. Doddington, M. A. Przybocki, A. F. Martin and D. A. Reynolds, “The NIST speaker recognition evaluation - Overview, methodology, systems, results, perspective,” Speech Communication, vol. 31, no. 2-3, pp. 225-254, 2000.

[2] H. A. Patil and T. K. Basu, “Development of speech corpora for speaker recognition research and evaluation in Indian languages,” Int. J. of Speech Tech., vol. 11, no. 1, pp. 17-32, 2008.

[3] K. Amino, T. Sugawara and T. Arai, “The correspondences between the perception of the speaker individualities contained in speech sounds and their acoustic properties,” Proc. INTERSPEECH’05, pp. 2025-2028, 2005.

[4] K. Anderson and L. Schalén, “Etiology and treatment of psychogenic voice disorder: Results of a follow-up study of thirty patients,” Journal of Voice, vol. 12, no. 1, pp. 96-106, 1998.

[5] H. A. Patil, P. Jain and R. Jain, “A Novel Approach to Identification of Speakers from Their Hum,” 7th Int. Conf. on Advances in Pattern Recognition, ICAPR, pp. 167-170, 2009.

[6] M. Jin, J. Kim and C. D. Yoo, “Humming-based human verification and identification,” Proc. Int. Conf. on Acoustic, Speech and Signal Processing, ICASSP’09, pp. 1453-1456, 2009.

[7] S. B. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE Trans. Acoust., Speech and Signal Processing, vol. ASSP-28, no.4, pp. 357-366, August 1980.

[8] H. A. Patil, M. C. Madhavi and Keshab K. Parhi, “Static and dynamic information derived from source and system features for person recognition from humming,” Int. J. Speech Tech., IJST’12, Springer, vol. 15, no. 3, pp. 393-406, Sept., 2012.

[9] Y. E. Kim, "Excitation codebook design for coding of the singing voice," 2001 IEEE Workshop Applications of Signal Processing to Audio and Acoustics, pp.155-158, 2001.

[10] X. Lu and J. Dang, “An investigation of dependencies between frequency components and speaker characteristics for text independent speaker identification,” Speech Communication, vol. 50, no. 4, pp. 312–322, 2008.

[11] H. Patil, M. Madhavi and Keshab K. Parhi, “Combining evidence from spectral and source like features for person recognition from humming,” Proc. INTERSPEECH’11, pp. 369-372, 2011.

[12] W. M. Campbell, K. T. Assaleh and C. C. Broun, “Speaker recognition with polynomial classifiers,” IEEE Trans. on Speech and Audio Processing, vol. 10, no.4, pp. 205-212, May 2002.

[13] A. F. Martin, G. Doddington, T. Kamm, M. Ordowski and M. Przybocki, “The DET curve in assessment of detection task performance,” Proc. EUROSPEECH’97, vol. 4, pp. 1899-1903, 1997.

[14] N. A. Weiss and M. J. Hasset, “Introductory Statistics,” 3rd Ed. Reading, Massachusetts: Addison-Wesley Publishing Company, 1991.

Table 2: Performance of person recognition experiment using MFCC and 2nd order polynomial classifier.

  Type of biometric signature | EER (%) | Min. DCF | Success Rate (%) (95 % confidence interval)
  Humming                     |  7.90   | 0.0787   | 95.61 (95.14, 96.09)
  Singing                     | 15.04   | 0.1499   | 84.41 (83.57, 85.25)
  Speech                      | 17.73   | 0.1760   | 86.38 (85.59, 87.18)

Figure 4. DET curves for all three biometric signatures.
