[IEEE 2010 10th International Conference on Information Sciences, Signal Processing and their Applications (ISSPA) - Kuala Lumpur, Malaysia, 10-13 May 2010]


10th International Conference on Information Science, Signal Processing and their Applications (ISSPA 2010)

INVESTIGATING THE ADAPTATION OF ARABIC SPEECH RECOGNITION SYSTEMS TO FOREIGN ACCENTED SPEAKERS

Yousef Ajami Alotaibi 1 and Sid-Ahmed Selouani 2

1 Computer Engineering Department, King Saud University, Saudi Arabia

2 LARIHS Lab., Université de Moncton, Campus de Shippagan, Canada

[email protected], [email protected]

ABSTRACT

This paper investigates the adaptation of Arabic speech recognition systems to foreign-accented speakers. The adaptation is accomplished by using Maximum Likelihood Linear Regression (MLLR), Maximum a posteriori (MAP), and then a combination of the MLLR and MAP techniques. The HTK speech recognition toolkit is used throughout all experiments. The systems were evaluated at both the word and phoneme levels. The LDC West Point Modern Standard Arabic (MSA) corpus is used throughout the experiments. Results show that particular Arabic phonemes, such as the pharyngeal and emphatic consonants that are hard for non-native speakers to pronounce, benefit from the adaptation process using the MLLR and MAP combination. An overall improvement of 7.37% has been obtained.

Keywords: Modern Standard Arabic, Adaptation, Foreign accents, MLLR, MAP, HMMs.

1. INTRODUCTION

Numerous studies have been carried out to improve the automatic recognition of speech uttered by non-native speakers. Fakotakis [4] worked on adapting standard Greek speech recognition systems to the Cypriot dialect by using the HTK toolkit [12] together with MLLR, MAP, and combined MLLR and MAP techniques [7][8]. The best accuracy improvement, about 2%, was obtained with the digit-strings database and the combined MLLR and MAP technique. Bartkova and Jouvet proposed, in [3], multiple models for improved speech recognition of non-native French speakers. They addressed the problem of foreign accent by using acoustic models of the target-language phonemes (French phonemes in their case) adapted with speech data from three other languages: English, German, and Spanish. Their results, obtained for eleven language groups of speakers, showed that the error rate can be significantly reduced when standard acoustic models of phonemes are adapted using speech data from different languages. The highest error rate reduction, 40%, was obtained for native English speakers. They showed that recognition performance improved for almost all language groups, even though only three foreign languages were available in their study for acoustic

978-1-4244-7167-6/10/$26.00 ©2010 IEEE

model adaptation. In [5], Hui et al. proposed a speaker adaptation method that modifies the principal mixture for improved continuous speech recognition. This method reduces the complexity of the Hidden Markov Models (HMMs) by choosing only the principal mixtures corresponding to a particular speaker's characteristics. They showed that the new method improved both recognition accuracy (by 31%) and recognition speed (by 30%) when compared to full-mixture speaker adaptation models. Other recent techniques only require the learners' utterances in their native language for adapting ASR to non-native speech [10]. In this context, it is important to note that, compared to other languages, the Arabic language benefits from a very limited number of research initiatives.

In a previous work [1], we analyzed, at the phonemic level, the results of both native and non-native speech recognition in order to determine the phonemes that have a significant impact on recognition accuracy. In this paper, we extend that work [1] by investigating how adaptation techniques can improve a trained recognition system so that it can be used by non-native Arabic speakers with a minimal amount of degradation in accuracy. This adaptation is accomplished through the use of MLLR, MAP, and a combination of the MLLR and MAP techniques. The original (baseline) recognition system was trained with native Arabic speakers' speech. Before adaptation, the system was tested with non-native Arabic speakers' speech, and its performance serves as the reference for comparison with the adapted systems.

The organization of this paper is as follows. In Section 2, a basic background on the Arabic language is given. In Section 3, the adaptation methods are briefly presented. Then, Section 4 presents the experimental framework, and Section 5 proceeds with a discussion of the obtained results. Finally, in Section 6 we conclude and give indications about future work.

2. BASIC ARABIC LANGUAGE BACKGROUND

Modern Standard Arabic (MSA) has six vowels and 28 consonants. Many differences are noticed when MSA is compared with European languages such as English. Among these differences, we can cite the uniqueness of some Arabic phonemes, particular phonetic features, and complicated morphological structures. A major difference lies in Arabic text, which is written without any information indicating short vowels, gemination, and pharyngealization. This can lead to many identical-looking forms in a great variety of contexts, which reduces predictability in correct word pronunciation, sentence sense, and language-model rules. Hence, accurate acoustic model testing, which depends on the Arabic text, is difficult when the identity and location of short vowels, for example, are unknown [2][6].

The Arabic language has three long and three short vowels. Permissible syllables in the Arabic language are CV, CVC, and CVCC, where C indicates a consonant and V a long or short vowel. This is not the case for English, where there is no such restriction on the number of syllables and their valid forms. Any Arabic word or syllable can only start with a consonant and must contain at least one vowel. The Arabic language is characterized by the presence of emphatic and pharyngeal phonemes. There are a total of five pharyngeal phonemes, two of which are fricatives: /H/ (ح) and /C/ (ع). The main characteristic of these phonemes is the constriction existing between the tongue and the lower pharynx, together with a rising of the larynx. There are also three uvular pharyngeal phonemes, /x/ (خ), /G/ (غ), and /q/ (ق), characterized by a constriction formed between the tongue and the upper pharynx for /x/ and /G/, and a complete closure at the same level for /q/. These five consonants are considered the Arabic pharyngeal phonemes [11]. On the other hand, there are four emphatic phonemes: /S/ (ص), /D/ (ض), /T/ (ط), and /Z/ (ظ). These phonemes are emphatic versions of the oral dental consonants /s/ (س), /d/ (د), /t/ (ت), and /TH/ (ذ).
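As a quick reference, the consonant classes just described can be collected into a small structure. This is a sketch of ours using the romanized symbols of the text; it is not part of the paper's recognition system:

```python
# Illustrative grouping of the MSA consonant classes described above.
pharyngeal = ["H", "C", "x", "G", "q"]   # the five pharyngeal phonemes
emphatic_to_plain = {                     # emphatic -> plain dental counterpart
    "S": "s",
    "D": "d",
    "T": "t",
    "Z": "TH",
}

assert len(pharyngeal) == 5
assert len(emphatic_to_plain) == 4
```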

3. SYSTEM ADAPTATION METHODS

MLLR is a widely used parameter transformation technique that has proven successful when only a small amount of adaptation data is available [8]. It aims at reducing the mismatch between the initial reference models and the adaptation data through the use of a set of transformations. In our experiments, we use MLLR to determine a set of linear transformations for only the means of the HMM Gaussian mixtures. The goal of these transformations is to linearly modify the mean components so that each HMM state of the initial system is more likely to produce the new adaptation data. The new estimate of the adapted mean obtained through the MLLR transformation matrix is stated as:

μ̂ = W μ′,   (1)

where W is the n × (n + 1) transformation matrix, n is the dimensionality of the data, and μ′ is the extended mean vector, defined as follows:

μ′ = [ω, μ_1, μ_2, ..., μ_n]ᵀ,   (2)

where ω represents a bias offset whose value here is fixed at one. Hence W can be decomposed into:

W = [b_c A_c],   (3)

where A_c represents an n × n regression matrix and b_c is an additive bias vector associated with the broad class c. The adapted k-th mean vector for each state i can then be written as follows:

μ̂_ik = A_c μ_ik + b_c.   (4)
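To make the transform concrete, here is a toy numerical sketch of Eqs. (1)-(4) in two dimensions; all numbers are made up for illustration and do not come from the paper:

```python
# Toy 2-D illustration of the MLLR mean update.
A = [[1.1, 0.0],
     [0.0, 0.9]]             # regression matrix A_c (n x n)
b = [0.5, -0.2]              # additive bias vector b_c
mu = [1.0, 2.0]              # original Gaussian mean

# Extended mean vector of Eq. (2): [omega, mu_1, ..., mu_n] with omega = 1.
mu_ext = [1.0] + mu

# W = [b_c A_c] as in Eq. (3); applying it as in Eq. (1) reproduces
# Eq. (4): mu_hat = A_c mu + b_c.
W = [[b[i]] + A[i] for i in range(2)]
mu_hat = [sum(W[i][j] * mu_ext[j] for j in range(3)) for i in range(2)]

assert all(abs(v - 1.6) < 1e-9 for v in mu_hat)
```

Because ω is fixed at one, multiplying by W simultaneously applies the rotation/scaling A_c and adds the bias b_c in a single matrix operation.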

System adaptation can also be accomplished using the Maximum a posteriori (MAP) technique [7]. For MAP adaptation, the re-estimation formula for a Gaussian mean is a weighted sum of the prior mean and the maximum likelihood mean estimate. It is formulated as:

μ̂_ik = ( τ_ik μ_ik + Σ_{t=1..T} γ_t(i,k) x_t ) / ( τ_ik + Σ_{t=1..T} γ_t(i,k) ),   (5)

where τ_ik is the weighting parameter for the k-th Gaussian component in state i, and γ_t(i,k) is the occupation likelihood of the observed adaptation data x_t.
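A scalar toy example of the MAP update in Eq. (5), with illustrative numbers only:

```python
# Toy MAP update of a scalar Gaussian mean, Eq. (5).
tau = 10.0                      # weighting parameter tau_ik
mu_prior = 1.0                  # prior mean mu_ik
x = [1.4, 1.6, 1.5, 1.5]        # adaptation observations x_t
gamma = [0.5, 0.5, 0.5, 0.5]    # occupation likelihoods gamma_t(i, k)

num = tau * mu_prior + sum(g * xt for g, xt in zip(gamma, x))
den = tau + sum(gamma)
mu_map = num / den

# With tau large relative to the occupation mass, the estimate stays
# close to the prior; as more frames accumulate it approaches the
# data mean (here 1.5).
assert mu_prior < mu_map < 1.5
```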

One of the drawbacks of MAP adaptation, when compared with MLLR, is that it needs more adaptation data to be accurate. When MLLR is combined with MAP, we can benefit from both techniques. Theoretically, the combination offers the compact transformations of MLLR for rapid adaptation when only a limited amount of data is available, and the asymptotic efficiency of MAP adaptation as the amount of data increases. There are many ways to combine MLLR and MAP. We choose to apply the MLLR transform prior to the MAP adaptation. Hence, the adapted means can be written as:

μ̂_ik = ( τ_ik μ̄_ik + Σ_{t=1..T} γ_t(i,k) x_t ) / ( τ_ik + Σ_{t=1..T} γ_t(i,k) ),   (6)

where μ̄_ik is the MLLR-adapted mean of Eq. (4). The principal difficulty in MAP adaptation is to determine the mixing parameters. As is commonly done, we chose a single mixing parameter for each model that we built, i.e., τ_ik = τ.
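The two updates can be chained as described, with the MLLR-adapted mean serving as the MAP prior. The following is a toy scalar sketch of that combination, not the HTK implementation, and all numbers are made up:

```python
# Sketch of MLLR-then-MAP: the MAP prior in Eq. (6) is the
# MLLR-adapted mean of Eq. (4).

def mllr(a, b, mu):
    """Eq. (4), scalar case: mu_bar = a * mu + b."""
    return a * mu + b

def map_update(tau, mu_prior, gamma, x):
    """Eq. (5)/(6): weighted sum of prior mean and adaptation data."""
    num = tau * mu_prior + sum(g * xt for g, xt in zip(gamma, x))
    return num / (tau + sum(gamma))

mu = 1.0
mu_bar = mllr(1.1, 0.5, mu)                        # MLLR-adapted prior
mu_hat = map_update(10.0, mu_bar, [1.0, 1.0], [1.7, 1.9])

# MLLR moves the mean quickly with little data; MAP then refines it
# toward the adaptation data as more frames accumulate.
assert mu_bar == 1.1 * 1.0 + 0.5
```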

4. EXPERIMENTAL SETUP

4.1. Data

The LDC corpus consists of 8,516 speech files, totaling 1.7 gigabytes or 11.42 hours of speech data. Each speech file represents one person uttering one prompt. The files were recorded using 16-bit PCM low-byte-first ("little-endian") coding, with a sampling rate of 22.05 kHz, and were then converted to the NIST SPHERE format. Approximately 7,200 of the recordings are from native speakers and 1,200 files are from non-native speakers [9].

From the West Point corpus we selected four different and disjoint lists, all chosen randomly from non-native Arabic speakers only: AD100 contains 100 utterances, AD150 contains 150 utterances, AD200 contains 200 utterances, and AD250 contains 250 utterances. The four lists were drawn randomly from all available scripts, speakers, and genders. These lists were used to adapt a system trained on native Arabic speakers so that it can deal with non-native Arabic speakers. For this purpose, three different adaptation techniques were used: MLLR, MAP, and a combination of MLLR and MAP. The performance is analyzed both at the word level (by incorporating a language model) and at the phoneme level. The phoneme level permits us to investigate the improvement (if any) of system accuracy on individual phonemes, hence giving us the chance to analyze the weaknesses of non-native Arabic speakers' pronunciation phoneme by phoneme. Based on the above, we refer to our experiments as AD100/MLLR, AD100/MAP, AD100/MLLRMAP, etc.
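The construction of the four disjoint adaptation lists could be sketched as follows. The file names and the selection code here are hypothetical; the paper only specifies the list sizes and random selection:

```python
import random

# Hypothetical sketch: carve four disjoint adaptation lists out of the
# non-native utterance files (file names are made up).
random.seed(0)
nonnative_files = [f"utt_{i:04d}.sph" for i in range(1200)]  # ~1,200 non-native files

random.shuffle(nonnative_files)
sizes = {"AD100": 100, "AD150": 150, "AD200": 200, "AD250": 250}

lists, start = {}, 0
for name, size in sizes.items():
    lists[name] = nonnative_files[start:start + size]
    start += size

# The four lists are pairwise disjoint by construction.
assert len(set().union(*lists.values())) == sum(sizes.values())
```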

4.2. Recognition platform and parameters

In our experiments, we used the HTK toolkit [12] to design and test the developed speech recognition systems. A phoneme-level recognizer is considered as the baseline system. It is based on continuous, left-to-right HMM models with three active states. The reference models are generated for the MSA phones as given by the LDC catalog. Since most of the LDC words consist of more than two phonemes, context-dependent triphone models were used instead of monophone models. In the training step, the HMM models are re-estimated through the use of the Baum-Welch algorithm. For better accuracy, the decision-tree method is used to align and tie some models. We have shown in [1] that context-dependent phoneme models (triphones) are useful for characterizing formant transition information; this helps the HMM models to discriminate effectively between confusable speech units and thus obtain better accuracy. The parameters of the system consisted of a 22 kHz sampling frequency with a sample resolution of 16 bits. A Hamming window of 25 ms duration with a step size of 10 ms is used. Each frame is represented by 13 static MFCCs and their first and second derivatives.
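In HTK terms, the front-end described above roughly corresponds to a configuration of the following shape. This is a sketch of ours; the exact keys and values used by the authors are not given in the paper:

```
# HTK front-end configuration sketch (illustrative, not from the paper)
SOURCEFORMAT = NIST          # West Point files in NIST SPHERE format
TARGETKIND   = MFCC_0_D_A    # 13 static MFCCs (12 + C0) + deltas + accelerations
TARGETRATE   = 100000.0      # 10 ms frame step (HTK units of 100 ns)
WINDOWSIZE   = 250000.0      # 25 ms analysis window
USEHAMMING   = T             # Hamming window
NUMCEPS      = 12            # 12 cepstral coefficients per frame
```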


5. RESULTS AND DISCUSSION

All native Arabic speakers' data provided by the LDC West Point corpus was used for training the original recognition system. Then, all non-native Arabic speakers' data provided by the same corpus was used for testing the system. As a result of that test, the accuracy (correctness) of the system was 89.02% at the word level and 93.19% at the phoneme level. This performance is relatively low compared to that of the same system tested with data taken from native Arabic speakers. Table 1 shows the system performance for the four adaptation lists and for the three adaptation techniques. This performance is compared to the accuracy of the system prior to any adaptation, namely 89.02% at the word level and 93.19% at the phoneme level.

As can be inferred from the results, the improvement in system performance increases with the size of the adaptation list. The performance improves rapidly, reaching its best value of 96.39%, which represents a 7.37% improvement (at the word level) over the original system. This result is obtained with the adaptation list AD250 and the adaptation combining the MLLR and MAP techniques (i.e., experiment AD250/MLLRMAP). We conclude that for AD150 and AD250, the merged MLLR and MAP adaptation techniques gave better performance than the others. We notice that there is no fixed rule governing the comparison of MLLR, MAP, and their combination: in some experiments MLLR gave a better improvement than MAP, while in others MAP gave the better accuracy improvement, and the combined MLLR and MAP technique sometimes gave less improvement than either MLLR or MAP alone. We believe that this is due to the random choice of sentences used in adaptation: in some cases, more of the relevant, specifically Arabic phonemes are included in the adaptation data, while in other cases the adaptation set contains fewer of these phonemes.
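The headline figure is an absolute difference, in percentage points, between the adapted and baseline word-level accuracies:

```python
# Word-level accuracies reported above.
baseline = 89.02          # accuracy before adaptation
best = 96.39              # AD250/MLLRMAP accuracy after adaptation

improvement = round(best - baseline, 2)
assert improvement == 7.37   # percentage points, as reported
```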

By investigating the system performance for individual phonemes, we notice that the phonemes /H/ (ح), /TH/, /g/, /q/ (ق) and /z/ (ز) gained the largest improvements in performance across all experiments. Table 2 shows the increases in performance for these phonemes for all conducted experiments. Except for the phoneme /z/ (ز), these are Arabic phonemes that cannot be found in English.

It is worth noting that the adaptation process makes it possible to enhance the performance on phonemes that are hard for non-native Arabic speakers to pronounce, especially /H/ (ح), a fricative unvoiced non-emphatic pharyngeal sound. In the non-adapted system, these sounds and other particular Arabic phonemes produced errors due to the phonological and acoustical changes induced by the pronunciation of non-native speakers. The obtained results therefore confirm that, by means of the adaptation process, the performance of automatic recognition of Arabic foreign-accented speech can be significantly improved.

Table 1. Accuracy improvement using adaptation techniques with different sizes of adaptation data

Adaptation List   Level       MLLR     MAP      MLLRMAP
AD100             Word        2.37%    3.16%    2.56%
                  Tri-phone   1.88%    2.65%    1.97%
AD150             Word        5.27%    5.37%    5.74%
                  Tri-phone   3.47%    3.57%    3.64%
AD200             Word        6.12%    5.58%    6.08%
                  Tri-phone   4.02%    3.92%    4.06%
AD250             Word        6.95%    6.77%    7.37%
                  Tri-phone   4.45%    4.64%    4.80%

Table 2. Accuracy improvement for Arabic-specific phonemes

Adapt. List   Adapt. Technique   Improvement in Accuracies (%)
                                 /H/    /TH/   /g/    /q/    /z/
AD100         MLLR               6.9    4.5    11.1   6      8.8
              MAP                6.1    1      10.7   5.6    8.8
              MLLRMAP            6.9    5.4    6.2    6.9    10.3
AD150         MLLR               6.8    5.8    18.4   7.5    8.8
              MAP                6.8    4.5    3.8    7.7    9.1
              MLLRMAP            8.2    5.4    13.6   7.3    10.3
AD200         MLLR               10.4   5.4    8.6    8      10.7
              MAP                8.2    3.6    8.4    8      10.3
              MLLRMAP            10.6   5.8    6.2    8.6    11
AD250         MLLR               9.6    6.7    16     7.3    11.7
              MAP                11.2   4.9    13.6   9.1    12.1
              MLLRMAP            10.4   6.7    16     7.8    11.4

6. CONCLUSION

An automatic Arabic speech recognition system was trained using native Arabic speakers' speech data provided by the LDC West Point corpus for MSA. The system was then adapted to non-native Arabic speakers' speech data provided by the same corpus. The adaptation techniques were MLLR, MAP, and a combination of the two. The accuracies of the non-adapted system were 89.02% and 93.19% at the word and phoneme levels, respectively. The best accuracy improvement was 7.37%, obtained in experiment AD250/MLLRMAP. The specific Arabic phonemes /H/ (ح), /TH/, and /q/ (ق), known to be hard to pronounce for a non-native Arabic speaker, obtained better accuracy improvements in all experiments. This work will be continued by investigating an evolutionary-based technique in order to give the Arabic speech recognition system an auto-adaptation capability in the context of more foreign accents.


REFERENCES

[1] Y. A. Alotaibi, S.-A. Selouani, and D. O'Shaughnessy, "Experiments on Automatic Recognition of Nonnative Arabic Speech," EURASIP Journal on Audio, Speech, and Music Processing, vol. 2008, Article ID 679831, 9 pages, doi:10.1155/2008/679831, 2008.

[2] M. Elshafei, "Toward an Arabic Text-to-Speech System," The Arabian Journal for Science and Engineering, vol. 16, no. 4B, pp. 565-583, Oct. 1991.

[3] K. Bartkova and D. Jouvet, "Multiple Models for Improved Speech Recognition for Non-Native Speakers," SPECOM'2004: 9th Conference on Speech and Computer, St. Petersburg, Russia, 2004.

[4] N. Fakotakis, "Cypriot Speech Database: Data Collection and Greek to Cypriot Dialect Adaptation," 4th International Conference on Language Resources and Evaluation (LREC), May 24-30, 2004.

[5] Y. Hui, P. Fung, and H. Taiyi, "Principal Mixture Speaker Adaptation for Improved Continuous Speech Recognition," International Conference on Spoken Language Processing (ICSLP), Beijing, China, 2000.

[6] K. Kirchhoff, J. Bilmes, S. Das, N. Duta, M. Egan, J. Gang, H. Feng, J. Henderson, L. Daben, M. Noamany, P. Schone, R. Schwartz, and D. Vergyri, "Novel Approaches to Arabic Speech Recognition: Johns Hopkins University Summer Research Workshop 2002, Final Report," http://www.clsp.jhu.edu/ws02, 2002.

[7] C.-H. Lee and J.-L. Gauvain, "Speaker Adaptation Based on MAP Estimation of HMM Parameters," Proceedings of ICASSP, Minneapolis, Minnesota, pp. 558-561, 1993.

[8] C. Leggetter and P. Woodland, "Maximum Likelihood Linear Regression for Speaker Adaptation of Continuous Density Hidden Markov Models," Computer Speech and Language, vol. 9, pp. 171-185, 1995.

[9] Linguistic Data Consortium (LDC), Catalog Number LDC2002S02, http://www.ldc.upenn.edu/, 2002.

[10] Y. Ohkawa, M. Suzuki, H. Ogasawara, A. Ito, and S. Makino, "A speaker adaptation method for non-native speech using learners' native utterances for computer-assisted language learning systems," Speech Communication, pp. 875-882, May 2009.

[11] S. Ouni, M. Cohen, and W. Massaro, "Training Baldi to be Multilingual: A Case Study for an Arabic Badr," Speech Communication, vol. 45, pp. 115-137, 2005.

[12] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, "HTK Version 3.4," Cambridge University Engineering Department, 2006.

[13] Y. Zheng, R. Sproat, L. Gu, I. Shafran, H. Zhou, Y. Su, D. Jurafsky, R. Starr, and S. Yoon, "Accent Detection and Speech Recognition for Shanghai-Accented Mandarin," 9th European Conference on Speech Communication and Technology (Eurospeech), Lisbon, Portugal, 2005.