
Building a Speaker Recognition System with one Sample

Mansour Alsulaiman, Ghulam Muhammad, Yousef Alotaibi, Awais Mahmood, and Mohamed Abdelkader Bencherif

Computer Engineering Dept., College of Computer and Information Sciences, King Saud University, Saudi Arabia

[email protected]; [email protected]; [email protected]; [email protected]; [email protected]

Abstract— Speaker recognition is the process of automatically recognizing a person from his/her speech. To recognize a speaker correctly, the system needs many speech samples recorded at different times from each speaker. However, in some applications, such as forensics, the number of samples per speaker is very limited. In this paper, a method is proposed to train a speaker recognition system with only one speech sample: from that single sample, additional samples are generated. The intent is to provide a complete speaker recognition system without requiring the speaker to record speech samples at different times. For this purpose, the speech samples are modified without altering the pitch or the speaker-dependent features. Several techniques are used to generate new samples for a recognition system based on the hidden Markov model. The system is built using the HTK hidden Markov model toolkit, and the best recognition rate is 85.86%.

Index Terms— Speaker recognition; sample generation; hidden Markov model.

I. INTRODUCTION

Biometric systems are roughly divided into behavioral and physical pattern measurements. Many countries are producing valuable scientific reports on feasible and viable methods for access and recognition systems. The complexity stems from several issues, the most important of which are: the inability to reproduce the registered pattern without error; a low data collection error rate; high user acceptability [1,2]; the size of the database [9]; and the technology needed at the terminal capture points. These strategic points tend to classify the different biometric modalities and to favor some techniques over others. Among the dynamic methods, sometimes considered as changing over time, speech, like fingerprints, is strongly affected by data collection error: many sessions must be carried out to obtain a candidate set of samples.

The beauty of speech is its non-invasive nature: it can be recorded without the person's consent, or sometimes without his/her physical presence. This depends on the available recording technology, whether sophisticated microphones, channels such as landline or mobile phones, or TV interviews and radio broadcasts. Unfortunately, sometimes there is not enough recorded speech, which leaves too little training data for the model to be trained correctly and results in a very low recognition rate.

Speech lengthening methods have been used in human speech recognition for the benefit of speech perception, as a source of information for understanding the prosodic organization of speech [3], and for word discrimination by children in kindergartens [4].

To address the above-mentioned problem, this paper proposes different techniques for generating new samples from a single sample. One method is expansion, a meaningful lengthening obtained by modifying one of the existing samples in order to strengthen the template established during training. All the other original samples are used for testing the system.

The paper is organized as follows: Section II describes the database and the selection of data; Section III defines the modeling technique used in this paper; Section IV introduces the front-end processing part of the system; Section V illustrates the different sample generation methods explored in this paper; Section VI describes the experiments, with the results given in Section VII; Section VIII analyzes the results. Finally, Section IX concludes the paper and gives suggestions for future work.

II. DATABASE

This research was conducted with a local dataset recorded at King Saud University, College of Computer and Information Sciences (CCIS), during 2007 [5]. The dataset consists of 91 speakers pronouncing the word “نعم”, which stands for the English word “yes”, in five different occurrences.

The speaker recognition system is phoneme based and uses the phonemes of the word “نعم” to recognize the speaker. This word has two main characteristics. First, almost all Arab speakers frequently say "yes" (in Arabic) in any discussion. Second, the word is rich in phonetic structure: it contains the nasal phoneme [ن], a very pertinent phoneme [ع], and finally a bilabial phoneme [م], allowing the capture of the energy of the whole word. It also contains two occurrences of the vowel fatha (فتحة). This richness, plus the fact that the word is commonly pronounced, makes it a good choice for our investigation.

The samples will be denoted as:

• First original sample: O1.


• Four other original samples, used for testing: O2, O3, O4, O5.

In this work, a part of the database is used, consisting of 25 male speakers (20 adults and 5 children), all native Arabic speakers. Each speaker uttered the same isolated word نعم five times and recorded the speech samples in one or two sessions.

III. MODELING TECHNIQUE AND SPEECH FEATURES

In text-dependent applications, where there is strong prior knowledge of the spoken text, additional temporal knowledge can be incorporated by using hidden Markov models (HMMs). An HMM is a stochastic modeling approach used for speech/speaker recognition, similar to a finite state machine. Each state (node) has an associated probability density function (PDF) for the feature vectors, and moving from one state to another is governed by a transition probability. The first and last states are non-emitting: the first state is always where the state machine starts, and the last state is where it always ends, i.e. there are no incoming transitions into the start state and no outgoing transitions from the end state. Every emitting state has a set of outgoing transitions whose probabilities sum to one, since a transition from a non-final state must always occur [6, 7, 8].

The HMM system is built using the HTK (Hidden Markov Model Toolkit) software, developed by Steve Young at Cambridge University in 1989.

Our system uses three active (emitting) states in a left-to-right topology, with one Gaussian mixture component per state. Each phoneme in the keyword is modeled by one model per speaker; that is, for a given speaker each phoneme has its own model, even when it is the same linguistic sound as for another speaker. These models are used to find the speaker identity. The silence model is also included in the model set; in a later step, a short-pause model is created from, and tied to, the silence model. This system is similar to our original work presented in [5].
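This topology can be summarized in code. The following is a minimal sketch using the hmmlearn Python library as a stand-in for HTK (which is what the paper actually uses); hmmlearn has no non-emitting entry/exit states, so only the three emitting states are modeled, and the left-to-right structure is imposed through the initial transition matrix, whose zeros are preserved by Baum-Welch re-estimation.

```python
import numpy as np
from hmmlearn import hmm  # assumed stand-in for HTK

def make_phoneme_model():
    # 3 emitting states, left-to-right, one Gaussian per state,
    # mirroring the HTK configuration described above.
    model = hmm.GaussianHMM(
        n_components=3,
        covariance_type="diag",
        init_params="mc",   # initialize only means/covariances from data
        params="stmc",      # re-estimate everything during training
    )
    model.startprob_ = np.array([1.0, 0.0, 0.0])  # always start in state 0
    model.transmat_ = np.array([                  # self-loop or move right
        [0.5, 0.5, 0.0],
        [0.0, 0.5, 0.5],
        [0.0, 0.0, 1.0],
    ])
    return model

# One such model is trained per (speaker, phoneme) pair, e.g.
# make_phoneme_model().fit(feats) with feats an (n_frames, 12) MFCC array.
```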

IV. FRONT-END PROCESSING

This step deals with the extraction of features, where speech is reduced to a smaller set of important characteristics represented by vectors, such as the Mel frequency cepstral coefficients (MFCC). Cepstral features are the most widely used in speaker recognition for several reasons: their robustness to noise distortion, their ability to filter sound in a way similar to the human cochlear system, and their degree of de-correlation. The system parameters are a 16 kHz sampling rate with 16-bit sample resolution, a 25 ms Hamming window with a 10 ms step size, and 12 MFCC coefficients as features.
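As an illustration, these parameters translate directly into a feature extractor. Below is a hedged sketch using the librosa library (the paper uses HTK's own front end, whose MFCCs differ numerically from librosa's); the file path is hypothetical.

```python
import librosa  # assumed library; the paper's actual front end is HTK

def extract_mfcc(path):
    # 16 kHz sampling, 25 ms Hamming window, 10 ms step, 12 coefficients,
    # matching the parameters listed above.
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=12,
        n_fft=512,
        win_length=400,   # 25 ms at 16 kHz
        hop_length=160,   # 10 ms step
        window="hamming",
    )
    return mfcc.T  # shape (n_frames, 12): one feature vector per frame
```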

V. PROPOSED METHODS

This section proposes methods to generate new speech samples from one original speech sample. These new samples can be used to train speaker recognition systems without altering the speaker identity, i.e. without modifying the pitch or the speaker-dependent features. All the samples are generated by modifying the first speech sample of each speaker in the time domain using the PRAAT software. The new samples are generated by any one, or a combination, of the following methods:

A. Copying a part of speech and concatenating it
The samples are generated by copying a small part of the initial speech sample and inserting it just after the selection. This is done on the first, middle, and last parts of the sample, resulting in three additional samples. The copied part is around 20 to 30 ms for the first group of three new samples, and 40 to 60 ms for the second group.
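As a sketch of this operation (the authors performed it in PRAAT), the following snippet duplicates a selected slice of the waveform in place; the file names and segment position are hypothetical.

```python
import numpy as np
import soundfile as sf  # assumed I/O library; the authors used PRAAT

def duplicate_segment(x, sr, start_s, dur_s):
    """Copy `dur_s` seconds starting at `start_s` and re-insert the copy
    immediately after the selection, locally lengthening the word
    without altering the pitch of the rest of the signal."""
    a = int(start_s * sr)
    b = a + int(dur_s * sr)
    return np.concatenate([x[:b], x[a:b], x[b:]])

# Hypothetical usage: lengthen a 25 ms slice inside the first phoneme.
x, sr = sf.read("O1.wav")
sf.write("S5.wav", duplicate_segment(x, sr, start_s=0.10, dur_s=0.025), sr)
```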

B. Reversing the word
Four kinds of samples are generated in this category. The first is generated by reversing the original sample. The second, third, and fourth are generated by copying a small part (approximately 20 to 30 ms) of the phonemes “م”, “ع”, and “ن” and inserting it just after the selection in the reversed word.
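Reversal itself is a single array operation. A minimal sketch, with hypothetical file names, soundfile assumed as the I/O library, and a mono signal assumed:

```python
import soundfile as sf  # assumed I/O library; the authors used PRAAT

x, sr = sf.read("O1.wav")         # original sample (mono assumed)
sf.write("S11.wav", x[::-1], sr)  # time-reversed word (group rev1)
# S12-S14 would then re-apply the segment duplication sketch above to S11.
```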

C. Adding noise at different SNRs
A total of six samples are generated. The first three are generated by adding babble noise at 5, 10, and 20 dB SNR, respectively; the other three are generated by adding train noise at 5, 10, and 20 dB SNR, respectively.
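Adding noise at a prescribed SNR amounts to scaling the noise so that the signal-to-noise power ratio matches the target before mixing. A minimal sketch, assuming a noise recording (babble or train) at least as long as the speech:

```python
import numpy as np

def add_noise(x, noise, snr_db):
    """Mix `noise` into `x` at the requested SNR (in dB)."""
    n = noise[:len(x)]            # trim noise to the speech length
    p_sig = np.mean(x ** 2)       # signal power
    p_noise = np.mean(n ** 2)     # noise power before scaling
    # Choose gain so that p_sig / (gain^2 * p_noise) = 10^(snr_db/10).
    gain = np.sqrt(p_sig / (p_noise * 10 ** (snr_db / 10.0)))
    return x + gain * n

# e.g. add_noise(x, babble, 5) would correspond to sample S15.
```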

VI. EXPERIMENTS

In order to confirm that the newly generated samples contain supplementary information about the speakers, a preliminary experiment is first performed in which the system is trained and tested with the same original sample.

[Figure 1. Original and generated samples using concatenation: (a) original, (b) lengthening “ن”, (c) lengthening “ع”, (d) lengthening “م”.]


This experiment is named E1. The recognition rate is 10%, which, as expected, is very low; there is simply not enough information in one sample. Next, the system is trained and tested with the original and generated samples (experiments E2 and E3, using the methods of Section V.A), and 100% accuracy is obtained. This high recognition rate is due to the supplementary information contributed during training by the newly generated samples.

However, this is not a realistic test, because the system must be tested with other original samples, as in the following experiments.
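For concreteness, the decision rule used in testing can be sketched as follows. This is a simplification assuming one trained HMM per speaker (see the sketch in Section III); in the actual system, HTK's Viterbi decoder scores the per-phoneme models.

```python
def identify_speaker(models, feats):
    """models: dict mapping speaker id -> trained HMM (hmmlearn sketch above)
    feats:  (n_frames, 12) MFCC matrix of one test utterance.
    Returns the speaker whose model gives the highest log-likelihood."""
    return max(models, key=lambda spk: models[spk].score(feats))
```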

A. Concatenation
Samples S5, S6, and S7 are generated by copying the central part, approximately 20 to 30 ms, of each phoneme “ن”, “ع”, and “م” of the original sample (O1) and inserting it just after the selected part, respectively. The vertical dotted lines in Fig. 1 show the inserted part. This group is named conc1.

Samples S8, S9, and S10 are generated by copying a longer part, 40 to 60 ms, of each phoneme “ن”, “ع”, and “م” of the original sample O1 and inserting it just after the selected part. This group is named conc2, as listed in Table 1.

B. Generating samples by reversing
The first sample in this group, S11, is generated by reversing sample O1. The second, third, and fourth generated samples, S12, S13, and S14, are generated by copying a part of approximately 20 to 30 ms of each phoneme “م”, “ع”, and “ن” of sample S11 and inserting it just after the selected part, respectively.

Note that the order of the phonemes is reversed, leading to a new word meaning "all together". This group is named rev4; S11 alone is named rev1.

C. Generating samples by adding noise
A total of six samples are generated in this last category. Samples S15, S16, and S17 are generated by adding babble noise at 5, 10, and 20 dB SNR, respectively; this group is named nois1. The other three samples, S18, S19, and S20, are generated by adding train noise at 5, 10, and 20 dB SNR, respectively; this group is named nois2.

VII. RESULTS

Table 2 summarizes the results of the experiments, performed using the 25 speakers of the database (Section II). In all these experiments, the training samples are O1 plus the groups of generated samples listed in Table 1.

A. Effect of concatenation
Three experiments are conducted in this part:

1. E4: the system is trained using the samples of conc1; the recognition rate is 50%.
2. E5: the system is trained using the samples of conc2; the recognition rate is 40%, 10 percentage points lower than E4.
3. E6: when both types of concatenation (more information) are included, the recognition rate increases to 83%.

These results indicate that combining different types of concatenation, i.e. more samples, improves the recognition rate compared with a single type of concatenation.

Table 2. Experimental results

Exp. | Technique           | Training samples                                | Test samples   | Rec. rate
E4   | conc1               | O1, S5, S6, S7                                  | O2, O3         | 50%
E5   | conc2               | O1, S8, S9, S10                                 | O2, O3         | 40%
E6   | conc1, conc2        | O1, S5, S6, S7, S8, S9, S10                     | O2, O3, O4, O5 | 83%
E7   | conc1, nois1        | O1, S5, S6, S7, S8, S15, S16, S17               | O2, O3, O4, O5 | 82.11%
E8   | nois1, nois2, conc1 | O1, S15, S16, S17, S18, S19, S20                | O2, O3, O4, O5 | 82%
E9   | conc1, rev4         | O1, S5, S6, S7, S11, S12, S13, S14              | O1, O2, O6, O7 | 74%
E10  | conc1, conc2, rev4  | O1, S5, S6, S7, S8, S9, S10, S11, S12, S13, S14 | O2, O3, O4, O5 | 76.77%
E11  | conc1, conc2, rev1  | O1, S5, S6, S7, S8, S9, S10, S11                | O2, O3, O4, O5 | 85.86%

Table 1. Techniques for generating samples

Sample code    | Category | Method to generate the new sample
O1             | —        | Original sample.
S5, S6, S7     | conc1    | A small part (approx. 0.02 to 0.03 s) of the first, second, and third phonemes (“ن”, “ع”, and “م”) is copied and inserted just after the selection.
S8, S9, S10    | conc2    | A small part (approx. 0.05 to 0.06 s) of the first, second, and third phonemes (“ن”, “ع”, and “م”) is copied and inserted just after the selection.
S11            | rev1     | Reverse of O1.
S12, S13, S14  | rev4     | A small part (approx. 0.02 to 0.03 s) of the first, second, and third phonemes of S11 (“م”, “ع”, and “ن”) is copied and inserted just after the selection, respectively.
S15, S16, S17  | nois1    | Babble noise added at 5, 10, and 20 dB SNR to the original speech signal O1.
S18, S19, S20  | nois2    | Train noise added at 5, 10, and 20 dB SNR to the original speech signal O1.
O2, O3, O4, O5 | —        | Original samples.
O6, O7         | —        | Reverse of O2 and O3, respectively.


B. Effect of noise
Two experiments are conducted in this part:

1. E7: the system is trained using the samples of conc1 and nois1; the recognition rate is 82.11%.
2. E8: the system is trained using the samples of nois1 and nois2; the recognition rate is 82%.

C. Effect of reverse
In this part, the samples are generated as described in Section V.B. Three experiments use them:

1. E9: the system is trained using the samples of conc1 and rev4; the recognition rate is 74% (Table 2).
2. E10: the system is trained using the samples of conc1, conc2, and rev4; the recognition rate is 76.77%.
3. E11: the samples of conc1, conc2, and rev1 are used; the recognition rate increases to 85.86%.

VIII. DISCUSSION

Experiment E1 sets the baseline for this work: it shows that without enough information from different samples, the HMM cannot build a usable model. Repeating the same sample contributes no new information. Experiments E4 and E5 then show that, by careful modification of a sample, new samples can be generated that give the HMM more information and allow it to build an improved model with higher recognition rates. These three experiments (E1, E4, and E5) are the bootstrap of our work.

[Figure 2. Recognition rates per experiment (E6-E11).]

Experiment E6 shows that by complementing one generation method with another, the recognition rate increases from 40-50% to 83%. A similar conclusion follows from E7, where concatenation is complemented with added noise. Experiment E8 not only reinforces this conclusion but is also a major result, since it gives a high recognition rate with samples generated by adding noise and only a small alteration of the original sample (conc1).

These results lead to the following conclusions: lengthening the vowel duration may increase the recognition rate, and adding noise together with lengthening vowels may reduce the error rate. E9 and E10 do not give results as good as E6-E8. This can be attributed to co-articulation: phonemes are context dependent and are affected by the preceding and following phonemes. In E11, conc1, conc2, and rev1 are used; this experiment outperforms the other methods, attaining a recognition rate of 85.86%. Although conc1 and conc2 are generated with the same technique, they use different lengths and therefore provide different information, while rev1 is generated by reversing the word. Hence three methods complement each other and produce the best recognition rate; in other words, reversing the sample in the time domain may have a positive effect on the recognition rate. The idea that complementary methods give better results can be clarified by an analogy: by looking at a view from different angles, we can produce a better picture of it, or even a complete 360-degree picture. Similarly, using different methods of generating new samples gives the HMM a better representation, so a better model is built. This suggests investigating other ways of generating speech samples and exploring different combinations.

IX. CONCLUSION

Different techniques are proposed to generate new samples from an original sample, to overcome the problem of a limited database. Experimental results showed that even adding different types of noise at different SNR levels to the original sample during training significantly improved the recognition accuracy. The experiments also demonstrated that using different methods that complement each other leads to an increase in the recognition rate. The highest obtained recognition rate, 85.86%, used samples from three different methods. In the future, we will investigate these techniques on a larger number of speakers and improve the accuracy using other possible techniques; initial results with 50 speakers are encouraging. The work could also be extended to find the minimum number of words, with specific phonemes, that keeps the recognition rate as high as possible. This could be used to select a very accurate set of words (phonemes) that characterizes the speaker without long recording sessions; from this selected set of basic words (phonemes), new samples can be generated using the methods presented in this paper.

REFERENCES

[1]. J. Wayman, A. Jain, D. Maltoni, and D. Maio, Biometric Systems: Technology, Design and Performance Evaluation, Springer.

[2]. T. Ruggles, "Comparison of Biometric Techniques," http://www.bioconsulting.com/bio.htm.

[3]. J. Cao, "Restudy of segmental lengthening in Mandarin Chinese," ISCA, 2004.

[4]. E. Segers and L. Verhoeven, "Effects of Lengthening the Speech Signal on Auditory Word Discrimination in


Kindergartners with SLI," Journal of Communication Disorders, vol. 38, no. 6, pp. 499-514, Nov.-Dec. 2005.

[5]. S.S. Al-Dahri, Y.H. Al-Jassar, Y.A. Alotaibi, M.M. Alsulaiman, and K.A.B. Abdullah-Al-Mamun, "A Word-Dependent Automatic Arabic Speaker Identification System," Proc. IEEE Int. Symp. on Signal Processing and Information Technology (ISSPIT 2008), pp. 198-202.

[6]. L. R. Rabiner and B. H. Juang, "An Introduction to Hidden Markov Models," IEEE ASSP Magazine, vol. 3, no. 1, pp. 4-16, Jan. 1986.

[7]. L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proc. IEEE, vol. 77, no. 2, pp. 257-286, Feb. 1989.

[8]. J. Olsson, "Text Dependent Speaker Verification with a Hybrid HMM/ANN System," Thesis Project in Speech Technology, 2002. http://www.speech.kth.se/ctt/publications/exjobb/exjobb_jolsson.pdf.

[9]. F. Botti, A. Alexander, and A. Drygajlo, "An interpretation framework for the evaluation of evidence in forensic automatic speaker recognition with limited suspect data," in Proc. Odyssey 2004: The Speaker and Language Recognition Workshop, pp. 63-68, Toledo, Spain, 2004.