The Robustness of MFCCs in Phoneme-Based Speaker Recognition using TIMIT
Rio Akasaka '09, Youngmoo Kim, Ph.D.*
Department of Linguistics/Engineering, Swarthmore College; *Drexel University

Introduction
Mel-Frequency Cepstral Coefficients (MFCCs) are quantitative representations of speech and are commonly used to label sound files. They are derived by taking the Fourier transform of a signal and mapping the result onto the mel scale, an auditory perception-based scale of pitch differences. With these unique labels on speech files, the similarity between two files can be measured by the Kullback–Leibler (KL) distance, which compares probability distributions; given a training set upon which to base one's decisions, the corresponding speaker can be identified.
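The poster does not specify its implementation; below is a minimal sketch of this pipeline, assuming librosa for feature extraction, audio converted to standard wav, and each file modeled as a single diagonal-covariance Gaussian over its MFCC frames (the function names are illustrative, not the poster's actual code):

```python
import numpy as np
import librosa

def mfcc_features(path, sr=16000, n_mfcc=13):
    """Load a 16 kHz utterance and return its MFCC frames (n_frames x n_mfcc)."""
    y, sr = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def gaussian_model(frames):
    """Summarize a file as a diagonal Gaussian over its MFCC frames."""
    return frames.mean(axis=0), frames.var(axis=0) + 1e-8  # variance floor

def symmetric_kl(m1, v1, m2, v2):
    """Symmetric KL divergence between two diagonal Gaussians."""
    kl12 = 0.5 * np.sum(np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)
    kl21 = 0.5 * np.sum(np.log(v1 / v2) + (v2 + (m2 - m1) ** 2) / v1 - 1.0)
    return kl12 + kl21
```

A test file would then be attributed to whichever training speaker's model lies nearest in this distance.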

The goal of this research is to test the robustness of MFCCs in speaker detection by varying the testing and training parameters with the following methods:

1) using segments of a whole speech file,
2) varying the number of speech files used, and
3) splicing together the vowel phones of a speech segment.

The TIMIT Corpus
The TIMIT corpus was created as a joint effort between Texas Instruments (TI) and MIT and consists of time-aligned orthographic, phonetic, and word transcriptions for each of its 6300 16-bit, 16 kHz speech files. 630 speakers from the 8 major dialects of American English each read 10 'phonetically rich' texts, 2 of which are common across all speakers.

PHONETIC
10160 10733 y
10733 11880 axr

WORD
10160 11880 your

ORTHOGRAPHIC
0 57140 She had your dark suit in greasy wash water all year.
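As an illustration of how these alignments can be used, the sketch below parses a .PHN file and splices the vowel segments of an utterance into a single vowel-only ('V') file. The vowel label set and the soundfile/librosa tooling are assumptions for this sketch, not details taken from the poster:

```python
import numpy as np
import librosa
import soundfile as sf

# Assumed subset of TIMIT's vowel labels, for illustration; the corpus
# documentation lists the full phone inventory.
VOWELS = {"iy", "ih", "eh", "ae", "aa", "ah", "ao", "uh", "uw", "ux",
          "er", "ax", "ix", "axr", "ey", "ay", "oy", "aw", "ow"}

def read_phn(path):
    """Parse a TIMIT .PHN file: one 'start end label' triple per line,
    where start and end are sample offsets into the 16 kHz wav."""
    with open(path) as f:
        return [(int(s), int(e), lab)
                for s, e, lab in (line.split() for line in f if line.strip())]

def splice_vowels(wav_path, phn_path, out_path, sr=16000):
    """Concatenate only the vowel segments of one utterance (a 'V' file)."""
    y, sr = librosa.load(wav_path, sr=sr)
    segments = [y[s:e] for s, e, lab in read_phn(phn_path) if lab in VOWELS]
    sf.write(out_path, np.concatenate(segments), sr)
```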

To investigate the distribution of the phonemes in TIMIT, a plot of phone counts across the corpus was generated.

The individual texts may be phonetically rich, but taken as a whole the distribution of the phonemes is unbalanced.
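A distribution like the one plotted can be tallied in a few lines; this sketch reuses the hypothetical read_phn helper above (note that file-name casing varies between TIMIT distributions):

```python
from collections import Counter
from pathlib import Path

def phoneme_distribution(timit_root):
    """Count phone labels over every .PHN file under the corpus root."""
    counts = Counter()
    for phn in Path(timit_root).rglob("*.PHN"):  # some copies use lowercase .phn
        for _, _, lab in read_phn(phn):
            counts[lab] += 1
    return counts
```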

Acknowledgments
Grateful acknowledgment is made to Youngmoo Kim for providing insight and direction throughout my research, and to Jiahong Yuan for encouraging my pursuit of corpus phonetics.

Conclusions
While optimal performance in speaker recognition is expected with a larger training set, the amount of testing material did not seem to affect performance provided that at least three files are used and that the number of training files is equal to or greater than the number of testing files. This might be expected to extend to the length of the wav files, but that was not the case: testing with half a file consistently produced poor results.

Most importantly, testing and training with vowel phones only yielded impressive recognition rates of approximately 93%, meriting further study.

With regard to individual phone contributions to recognition, a single phoneme was not found to predict a speaker more effectively when the same phoneme was used to train than when any other phoneme was.

However, two phonemes consistently outperform the others in predicting a speaker: 'ae' and 'ay'. Of the five trials, 'ae' was the most highly recognized phoneme three times and 'ay' twice, and both were among the top two in four of the trials. More tests are being run to reach a statistically significant conclusion.


Literature cited
Cole, Ronald A., et al. 1996. The Contribution of Consonants Versus Vowels to Word Recognition in Fluent Speech.
Van Heerden, C.J., and E. Barnard. 2008. Speaker-Specific Variability of Phoneme Durations.
Fattah, Mohamed, Fuji Ren, and Shingo Kuroiwa. 2006. Phoneme Based Speaker Modeling to Improve Speaker Identification.

Figure 3. To test whether one speaker is consistently retrieved as the ideal candidate for a particular phoneme, the predicted speaker was plotted against the actual speaker by speaker ID. Speaker 183 is selected most often in this scenario.


Figure 4. Speaker recognition based on 144 vowel-based files.

Figure 2. Speaker prediction based on individual phonemes. While recognition based on a single phoneme is quite low (μ = 3.60%, σ = 2.34), the diagonal, where the same phoneme is used for training and testing, does show slightly higher recognition rates, as would be expected.

Results

Vowels only
Vowel-spliced files are exceedingly difficult to use for predicting a speaker when training is done on entire wav files (see the nomenclature below).

Train: 5V, Test: 3F
Evaluated: 342, Correct: 242 (70.8%)

Train: 5F, Test: 5V
Evaluated: 570, Correct: 53 (9.3%)


Control
Train: 5F, Test: 5F (excluding the two SA sentences common to all speakers)
Evaluated: 570, Correct: 493 (86.5%)
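A control run like this one can be scored with a simple nearest-model loop. The sketch below reuses the hypothetical helpers from the Introduction (mfcc_features, gaussian_model, symmetric_kl) and is an assumed reconstruction, not the poster's actual code:

```python
import numpy as np

def evaluate(train_sets, test_files):
    """Score one experiment. train_sets maps speaker -> list of wav paths;
    test_files is a list of (true_speaker, wav_path) pairs. Each speaker is
    modeled by one Gaussian pooled over the MFCC frames of their training
    files; each test file is assigned to the nearest model in symmetric KL."""
    models = {spk: gaussian_model(np.vstack([mfcc_features(p) for p in paths]))
              for spk, paths in train_sets.items()}
    correct = 0
    for true_spk, path in test_files:
        m, v = gaussian_model(mfcc_features(path))
        guess = min(models, key=lambda s: symmetric_kl(m, v, *models[s]))
        correct += guess == true_spk
    return correct / len(test_files)
```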

Difference in number
Reducing the number of training and testing files in tandem maintains a near-optimal success rate (~84.8%), but only down to three files.

Train: 3F, Test: 5F
Evaluated: 570, Correct: 440 (77.2%)

Train: 5F, Test: 3F
Evaluated: 342, Correct: 292 (85.4%)

Train: 3F, Test: 3F
Evaluated: 342, Correct: 290 (84.8%)

The following nomenclature is adopted in this poster:
F: full (complete) speech file
H: speech file segmented at the middle
V: file consisting of vowel phones only
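For concreteness, an 'H' file might be produced as follows; this is a sketch, with the choice of first or second half left as a parameter since both halves were reportedly tried:

```python
import librosa
import soundfile as sf

def half_file(wav_path, out_path, first_half=True, sr=16000):
    """Produce an 'H' file by cutting the utterance at its midpoint."""
    y, sr = librosa.load(wav_path, sr=sr)
    mid = len(y) // 2
    sf.write(out_path, y[:mid] if first_half else y[mid:], sr)
```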

Difference in size
Performance is considerably better when full speech files are used during testing, regardless of which half of the file is used.

Train: 5F, Test: 5H
Evaluated: 570, Correct: 327 (57.4%)

Train: 5H, Test: 5F
Evaluated: 570, Correct: 442 (77.5%)

Train: 5H, Test: 3F
Evaluated: 342, Correct: 277 (81.0%)

Figure 1. The predicted speaker ID plotted against the actual speaker, for 144 full speech files.

Vowels only, cont.
If vowel-only files are used for both training and testing, the results are impressive.

Train: 5V, Test: 5V
Evaluated: 570, Correct: 533 (93.5%)

Train: 3V, Test: 3V
Evaluated: 342, Correct: 278 (81.3%)

Further examination
To extract more information about the role individual phones play in speaker recognition, the same algorithm was applied to recognition based on individual phonemes extracted from each speaker. The training set consists of files containing only the segments for a particular phoneme, and these files are then tested individually.
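A sketch of the phoneme-specific splicing this requires, reusing the hypothetical read_phn helper from the TIMIT section; a train/test matrix like Figure 2 would then pair each training phone with each testing phone:

```python
import numpy as np
import librosa
import soundfile as sf

def phoneme_file(wav_path, phn_path, phone, out_path, sr=16000):
    """Splice every occurrence of one phone in an utterance into a file."""
    y, sr = librosa.load(wav_path, sr=sr)
    segs = [y[s:e] for s, e, lab in read_phn(phn_path) if lab == phone]
    if segs:  # skip utterances where the phone never occurs
        sf.write(out_path, np.concatenate(segs), sr)

# The Figure 2 matrix pairs a training phone p with a testing phone q: build
# per-speaker files with phoneme_file(..., p, ...) for training and
# phoneme_file(..., q, ...) for testing, then score each pair with evaluate().
```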

For further information
Please contact [email protected]. Further details about the methodology may be read online at wiki.rioleo.org.