

Speech Communication 53 (2011) 62–74

A scalable architecture for multilingual speech recognition on embedded devices

Martin Raab a,b,*, Rainer Gruhn a, Elmar Nöth b

a Harman Becker Automotive Systems, Speech Dialog Systems, Ulm, Germany
b University of Erlangen-Nuremberg, Chair of Pattern Recognition, Erlangen, Germany

Received 9 February 2009; received in revised form 24 July 2010; accepted 30 July 2010

* Corresponding author at: Harman Becker Automotive Systems, Speech Dialog Systems, Ulm, Germany. Tel.: +49 (0)731 15239 441. E-mail addresses: [email protected] (M. Raab), [email protected] (R. Gruhn).

Abstract

In-car infotainment and navigation devices are typical examples where speech based interfaces are successfully applied. While classical applications are monolingual, such as voice commands or monolingual destination input, the trend goes towards multilingual applications. Examples are music player control or multilingual destination input. As soon as more languages are considered, the training and decoding complexity of the speech recognizer increases. For large multilingual systems, some kind of parameter tying is needed to keep the decoding task feasible on embedded systems with limited resources. A traditional technique for this is to use a semi-continuous Hidden Markov Model as the acoustic model. The monolingual codebook on which such a system relies is not appropriate for multilingual recognition. We introduce Multilingual Weighted Codebooks that give good results with low decoding complexity. These codebooks depend on the actual language combination and increase the training complexity. Therefore an algorithm is needed that can reduce the training complexity. Our first proposal is a set of mathematically motivated projections between Hidden Markov Models defined in Gaussian spaces. Although theoretically optimal, these projections were difficult to employ directly in speech decoders. We found approximated projections to be most effective for practical application, giving good performance without requiring major modifications to the common speech recognizer architecture. With a combination of the Multilingual Weighted Codebooks and Gaussian Mixture Model projections we create an efficient and scalable architecture for non-native speech recognition. Our new architecture offers a solution to the combinatoric problems of training and decoding for multiple languages. It builds new multilingual systems in only 0.002% of the time of a traditional HMM training, and achieves comparable performance on foreign languages.

© 2010 Elsevier B.V. All rights reserved.

Keywords: Multilingual speech recognition; Non-native speech; Projections between Gaussian spaces; Gaussian Mixture Model distances

1. Introduction

Current state of the art systems already provide speech control, but with the limited processing power and memory of these systems it is difficult to provide speech recognition for many languages. There are situations where it is necessary to recognize multilingual speech. One example is when users drive to other countries and need to input navigation destinations. Another example is speech controlled music selection. The artists and titles in music collections can be from many different languages, and the system has to allow the selection of all of them via speech.

The issue becomes more complicated as the user utters many of these additional speech items with non-native accent. For the dialog in the car navigation and infotainment system, this means that there is a distinguished main language of the system and some additional languages. The main language of the system is the native language of the user. The additional languages are the languages that are dependent on the task.

In the first part of our literature review we analyze previous approaches for multilingual speech recognition. An approach that is used in many works to reduce the decoding complexity is knowledge based model sharing. In this approach, phonemes from different languages share one acoustic model when they have the same IPA (International Phonetic Alphabet, Ladefoged, 1990) symbol. Examples are Weng et al. (1997), Koehler (2001), Uebler (2001), Schultz and Waibel (2001), Wang et al. (2002), Niesler (2006). The works vary in the degree to which they enforce the clustering between languages.

There are fewer works that experimented with data driven model sharing in the acoustic model. Koehler (2001) and Dalsgaard et al. (1998) measure the log-likelihood difference on development data to determine the similarity of phonemes, as motivated by Juang and Rabiner (1985). Wang et al. (2002) trains phones from different languages on the same codebook and measures the distances between phones by the Euclidean distance between the mixture weight vectors of the Hidden Markov Models (HMMs).

The knowledge based and the data driven approaches are well suited for the recognition of many languages if there are no additional knowledge sources. In our case, we know the native language of the speaker from the graphical user interface language of the system. This is the main language of interaction between the user and the system, and a user usually utters commands, spellings and digit sequences in that language. Hence it is vital for a commercial system to recognize this main language with maximum performance. Therefore we introduced Multilingual Weighted Codebooks (MWCs) as a technique that does not deteriorate the performance in the main language. MWCs are basically a main language codebook that is enriched with some additional Gaussians to better cover all languages. We were able to show the benefits of MWCs for both native speakers and non-native speakers (Raab et al., 2008a; Raab et al., 2008b).

There are also works that propose techniques for the efficient handling of multilingual language models. Examples are Harbeck et al. (1998), Nöth et al. (1999), Fuegen (2003). For our work, this is less relevant, as we focus on a command and control application or selection-from-list type applications with little room for the user to make non-native grammar mistakes.

The second part of our literature review focuses on non-native speech. Tomokiyo and Waibel (2001) present several results with different adaptation techniques like MAP and MLLR and achieve up to 30% WER improvement. Bouselmi et al. (2007) introduce confusion based acoustic model integration that allows additional HMM structures for frequently confused phoneme models. They report improvements of up to 70% WER and an absolute WA of up to 98.0% without speaker adaptation on the Hiwire test data (Segura et al., 2007) that we also use. However, using the Hiwire data for adaptation and testing is likely to give good results, as the lexicon size is very limited and the same speakers are in the adaptation and test set. This was analyzed by Lang (2009), where it was shown that standard Baum–Welch re-estimation gives results comparable to (Bouselmi et al., 2007). Lang (2009) also proved the overfitting problem, as an adaptation on Hiwire did not lead to improvements on ISLE (Menzel et al., 2000), another non-native corpus. Lang (2009) used the same recognizer and the same training data as the work in this paper.

These acoustic model adaptation methods have the drawback that they need adaptation data from the corresponding accents. The biggest database known to the authors covers almost 30 different accents (Schaden, 2006); an overview of existing collections is given in (Raab et al., 2007). But there are a lot more accents that are not covered. Other techniques try to circumvent the need for special training or adaptation data. Bartkova and Jouvet (2006) and Goronzy et al. (2001) use manually derived pronunciation rules for the modification of lexicons. However, their approaches require expensive human work and achieve more moderate improvements in the range of 15% to 30% WER.

There are also methods that try to extract information about non-native accents from a comparison of the native language of the speaker and the spoken language. Witt (1999) proposes three different algorithms for this, amongst others Model Merging. Improvements of up to 27% WER are reported for the methods without online adaptation. However, the work of Witt (1999) was performed on continuous HMMs and cannot directly be applied to a semi-continuous HMM. Witt's (1999) algorithms also benefit from adding Gaussians from other languages, so there is the question to what extent, for example, Model Merging can add on top of MWCs. The same question arises with work from (Tan and Besacier, 2007).

Finally, we have to deal with the limited resources of an in-car system. We use a semi-continuous speech recognizer (Huang et al., 1990) as a technique to keep the memory and processing demand of the system relatively low (Koch, 2004). A similar system was proposed in (Park and Ko, 2004). Such a semi-continuous speech recognizer achieves parameter tying through one single set of Gaussians for all phoneme models.

We combine our semi-continuous system with MWCs, as this data driven model sharing technique has the advantage that it does not degrade performance on the main language. The problem with MWCs is that they depend on the actual language combination. This leads to an unacceptably high training effort for more than a couple of languages.

A solution to avoid these unacceptably high numbers of systems is to provide just the right system, instead of providing all possible systems. While making this decision is impossible in the offline part of the training of speech recognizers, it is possible on the actual embedded system in the car of the user.

Fig. 1 depicts how such a process can look for the two applications that we have in mind, multilingual destination input and music selection. In the destination selection, the system determines the language of nearby destinations. In the music selection, the languages of interest can be determined from the language distribution in the music database. There are three tasks that are common to both examples: language identification, MWC based codebook creation and the generation of HMMs on top of the generated codebook. All the tasks should be fast enough to run on an embedded system.

[Fig. 1. Generating a user adapted system on an embedded system. The flow chart shows two entry paths: country detection for destination input, and media device connection followed by language detection and playlist generation for music selection. Both paths lead into MWC codebook creation, projection to the codebook, and recognition of utterances.]

We do not go into detail about language identification of text in this paper, as it is widely used and there are freely available tools like TextCat (Noord, 2009) for 69 different languages. One approach is, for example, that the languages are recognized based on n-gram frequencies of letter sequences that are specific for each language. Language recognition rates are in the range of 90% or higher for 30 letter sequences (Ueda and Nakagawa, 1990); a toy sketch of this n-gram approach is given below. The MWC task was already discussed. The last task is to provide HMMs that use only Gaussians from this codebook. Due to the runtime constraints, we do not consider a common Baum–Welch training. Instead, we project the Gaussian Mixture Models (GMMs) from their different monolingual codebooks to the previously generated MWC. In this paper we present seven different methods for this projection, three of which were presented before (Raab et al., 2009).
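As an illustration of the letter n-gram idea mentioned above, the following Python fragment ranks letter trigrams and compares rank profiles with the classic out-of-place measure. It is a minimal sketch, not the actual TextCat implementation; the tiny training texts, the profile size of 300 and the penalty value are purely illustrative assumptions.

```python
# Minimal sketch of letter n-gram language identification (not TextCat itself).
from collections import Counter

def trigram_profile(text, top=300):
    """Rank the most frequent letter trigrams of a text (rank 0 = most frequent)."""
    text = " ".join(text.lower().split())
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    return {g: r for r, (g, _) in enumerate(counts.most_common(top))}

def out_of_place(profile, other, penalty=300):
    """Sum of rank differences ('out-of-place' measure); lower means more similar."""
    return sum(abs(r - other.get(g, penalty)) for g, r in profile.items())

def identify(snippet, profiles):
    """Pick the training language whose trigram profile is closest to the snippet."""
    p = trigram_profile(snippet)
    return min(profiles, key=lambda lang: out_of_place(p, profiles[lang]))

# Hypothetical usage with stand-in training texts:
profiles = {lang: trigram_profile(text) for lang, text in {
    "de": "die straße führt durch die stadt zum bahnhof",
    "en": "the street leads through the town to the station",
}.items()}
print(identify("the driver enters the town", profiles))  # -> "en"
```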

The remainder of this paper is organized as follows. In the next section we present our multilingual baseline system. Section 3 describes our algorithms and introduces the concept of the scalable architecture. In Section 4 the experimental setup is described. Section 5 presents the experimental results. Finally, a conclusion is drawn in Section 6.

2. Benchmark and baseline systems

The starting point for our comparison systems are trained monolingual semi-continuous HMM speech recognizers. This means that we have trained triphone models for all languages.

The benchmark system for the recognition of multiple languages combines all triphone models in one large model set. This is nothing else than evaluating all monolingual recognizers in parallel. Thus this system can achieve monolingual performance in all languages. However, in this approach all Gaussians from all languages that are currently set active for recognition have to be evaluated. This violates the motivation for the use of a semi-continuous system, as no longer only one fixed number of Gaussians has to be evaluated for all models. To summarize, this approach can be considered as an upper bound in performance, but requires a linear increase of resources on the embedded system with the number of considered languages.

Our baseline systems reduce the resource need through only using Gaussians of the current main language of the system. This gives monolingual performance for the native language of the user and does not increase the number of Gaussians that have to be evaluated. The drawbacks of this approach are significantly reduced performance on the additional languages and a training effort that is quadratic with respect to the number of languages considered. This also leads to the fact that a quadratic number of systems has to be deployed on the embedded system.

The following describes the necessary steps for the generation of our baseline system for one given main language. The HMM models of all additional languages are added to the model set of the main language recognizer. However, these additional models have to be trained again, as the Gaussians in the codebook have changed. Therefore, each phoneme model of each of the additional languages is rebuilt with data from the corresponding language, but this time the HMMs can only model their output distribution with Gaussians from the main language codebook. Fig. 2 sketches the procedure for an example bilingual German/English system.

3. Algorithm description

3.1. Multilingual Weighted Codebooks

To improve the performance on the additional languages of our baseline system, the monolingual codebook is replaced by a Multilingual Weighted Codebook (MWC). The MWC is basically the main language codebook plus some additional Gaussians. Fig. 3 depicts an example for the extension of a codebook to cover an additional language. From left to right, one iteration of the generation of MWCs is represented.

[Fig. 2. Baseline system for an example German/English bilingual system. Each HMM is trained with speech from its corresponding language. All HMMs use only Gaussians from the main language (German) codebook.]

[Fig. 3. The idea of MWCs. The three pictures present one iteration of the MWC algorithm. On the left, the initial situation is depicted. The nearest neighbor calculation is shown in the middle. The rightmost picture presents the final situation, in which the coverage of the MWC has been extended through the addition of one extra Gaussian.]

The picture to the left shows the initial situation. The Xs are mean vectors from the main language codebook, and the area that is roughly covered by them is indicated by the dotted line. Additionally, the numbered Os are mean vectors from the second language codebook. Supposing that both Xs and Os are optimal for the language they were created for, it is clear that the second language contains sound patterns that are not typical for the first language (Os 1, 2 and 3).

The middle picture shows the distance calculation. For each of the second language codebook vectors, the nearest neighbor among the main language Gaussians is determined. These nearest neighbor connections are indicated by the dotted lines. Our previous experiments showed that using the Mahalanobis distance produces the best results (Raab et al., 2008a).

The right picture presents the outcome of one iteration. Of the nearest neighbor connections, the largest one (O number 2) was chosen, as this is obviously the mean vector which causes the largest vector quantization error. Thus, the Gaussian O number 2 was added to the main language codebook.
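To make the iteration above concrete, the following Python sketch implements the greedy nearest-neighbor codebook extension. It simplifies in two ways that the paper does not: it assumes diagonal covariances (the actual system uses full covariances), and the function and parameter names are illustrative, not the authors' implementation.

```python
# A sketch of the MWC extension loop under simplifying assumptions.
import numpy as np

def extend_codebook(main_means, main_vars, add_means, add_vars, n_extra):
    """Greedily move the n_extra additional-language Gaussians with the
    largest nearest-neighbor Mahalanobis distance into the main codebook."""
    means = list(main_means)
    variances = list(main_vars)
    cand_means, cand_vars = list(add_means), list(add_vars)
    for _ in range(n_extra):
        # Mahalanobis distance of every candidate to its nearest codebook
        # entry, using the candidate's own (diagonal) covariance.
        dists = [min(np.sum((m - c) ** 2 / v) for c in means)
                 for m, v in zip(cand_means, cand_vars)]
        worst = int(np.argmax(dists))      # largest quantization error
        means.append(cand_means.pop(worst))
        variances.append(cand_vars.pop(worst))
    return np.array(means), np.array(variances)
```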

In (Raab et al., 2008a) we have shown that Multilingual Weighted Codebooks (MWCs) increase performance on the additional languages for fluent non-natives without affecting performance on the main language. Raab et al. (2008b) proves that MWCs also help for the recognition of less fluent non-native speakers.

A negative aspect of MWCs is that they depend on the languages that are added. In fact, the number of different systems grows exponentially with the number of languages the system has to support.

3.2. Distance between GMMs

In the literature many distances between Gaussian Mixture Models have been proposed. Examples are an approximated Kullback Leibler divergence (Hershey and Olsen, 2007), the likelihood difference on a development set (Juang and Rabiner, 1985; Koehler, 2001) or the L2 distance (Jian and Vemuri, 2005; Jensen et al., 2007). The likelihood difference on a development set has the disadvantage that development data is necessary. We use the L2 distance between Gaussians, as a closed solution exists for this distance, which is not the case for the Kullback–Leibler distance.

The L2 distance (Lieb and Loss, 2001) between two Gaussian Mixture Models A and B is defined by

D_{L2}(A,B) = \int \left( a^T a(x) - b^T b(x) \right)^2 dx,    (1)

where a and b are the weight vectors of the Gaussian vectors a(x) and b(x):

a = \begin{pmatrix} w^a_1 \\ w^a_2 \\ \vdots \\ w^a_n \end{pmatrix}, \quad a(x) = \begin{pmatrix} \mathcal{N}(x; \mu^a_1, \Sigma^a_1) \\ \mathcal{N}(x; \mu^a_2, \Sigma^a_2) \\ \vdots \\ \mathcal{N}(x; \mu^a_n, \Sigma^a_n) \end{pmatrix},    (2)

b = \begin{pmatrix} w^b_1 \\ w^b_2 \\ \vdots \\ w^b_m \end{pmatrix}, \quad b(x) = \begin{pmatrix} \mathcal{N}(x; \mu^b_1, \Sigma^b_1) \\ \mathcal{N}(x; \mu^b_2, \Sigma^b_2) \\ \vdots \\ \mathcal{N}(x; \mu^b_m, \Sigma^b_m) \end{pmatrix},    (3)

where μ and Σ are the mean and the covariance of the Gaussians. The distance D_L2 can be calculated as follows:

D_{L2}(A,B) = \int \left( a^T a(x) - b^T b(x) \right)^2 dx = \sum_i \sum_j a_i a_j \int a_i(x) a_j(x)\,dx - 2 \sum_i \sum_j a_i b_j \int a_i(x) b_j(x)\,dx + \sum_i \sum_j b_i b_j \int b_i(x) b_j(x)\,dx,    (4)

with a_i(x) = N(x; μ^a_i, Σ^a_i) and b_i(x) = N(x; μ^b_i, Σ^b_i). In order to solve this problem, the correlation \int \mathcal{N}(x; \mu_1, \Sigma_1)\,\mathcal{N}(x; \mu_2, \Sigma_2)\,dx between the Gaussians needs to be calculated. Petersen and Pedersen (2008) state that

\mathcal{N}(x; \mu_1, \Sigma_1)\,\mathcal{N}(x; \mu_2, \Sigma_2) = c_c\,\mathcal{N}(x; \mu_c, \Sigma_c),    (5)

with

c_c = \mathcal{N}(\mu_1; \mu_2, \Sigma_1 + \Sigma_2),    (6)
\mu_c = (\Sigma_1^{-1} + \Sigma_2^{-1})^{-1} (\Sigma_1^{-1}\mu_1 + \Sigma_2^{-1}\mu_2),    (7)
\Sigma_c = (\Sigma_1^{-1} + \Sigma_2^{-1})^{-1}.    (8)

Thus all correlations between all Gaussians can be calculated and written in three matrices M^AA, M^AB and M^BB:

M^{AA}_{ij} = \int a_i(x) a_j(x)\,dx,    (9)
M^{AB}_{ij} = \int a_i(x) b_j(x)\,dx,    (10)
M^{BB}_{ij} = \int b_i(x) b_j(x)\,dx.    (11)

Hence Eq. (4) can be written as

D_{L2}(A,B) = a^T M^{AA} a - 2 a^T M^{AB} b + b^T M^{BB} b.    (12)
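For illustration, the closed-form distance of Eqs. (5)-(12) can be written down directly. The following is a minimal sketch in Python, assuming NumPy/SciPy and treating the GMM parameters as plain arrays; the function names are illustrative.

```python
# Sketch of the closed-form L2 distance between two GMMs, Eqs. (5)-(12).
import numpy as np
from scipy.stats import multivariate_normal

def correlation(mu1, cov1, mu2, cov2):
    # Eq. (6): the integral of a product of two Gaussians is a Gaussian
    # density of one mean evaluated with the sum of the covariances.
    return multivariate_normal.pdf(mu1, mean=mu2, cov=cov1 + cov2)

def corr_matrix(mus1, covs1, mus2, covs2):
    # Eqs. (9)-(11): matrix of pairwise Gaussian correlations.
    return np.array([[correlation(m1, c1, m2, c2)
                      for m2, c2 in zip(mus2, covs2)]
                     for m1, c1 in zip(mus1, covs1)])

def l2_distance(wa, mua, cova, wb, mub, covb):
    # Eq. (12): D_L2(A,B) = a^T M^AA a - 2 a^T M^AB b + b^T M^BB b.
    MAA = corr_matrix(mua, cova, mua, cova)
    MAB = corr_matrix(mua, cova, mub, covb)
    MBB = corr_matrix(mub, covb, mub, covb)
    return wa @ MAA @ wa - 2 * wa @ MAB @ wb + wb @ MBB @ wb
```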

3.3. Optimal projections between Gaussian spaces

The purpose of Eq. (12) is to measure distances between different given Gaussian mixtures. In this work it is more interesting to find an a_min that minimizes D_L2(A,B). The solutions from this section were first presented in (Raab et al., 2009).

To obtain the minimum we differentiate D_L2 with respect to a:

\frac{\partial D_{L2}}{\partial a} = 2 M^{AA} a - 2 M^{AB} b.    (13)

In order to find the minimum, we set the gradient to \vec{0} = (0, 0, \ldots, 0)^T. This leads to the optimal weights a_min:

a_{min} = (M^{AA})^{-1} M^{AB} b.    (14)

This a_min is a true minimum when the second derivative of D_L2 is positive definite. The second derivative is 2M^AA. M^AA is a correlation matrix and therefore positive semidefinite. As long as none of the Gaussians is linearly dependent on the other Gaussians, this matrix is positive definite and a_min a true minimum.

Projection 1. An optimal projection from GMM A to B that minimizes the D_L2 error D_L2(A,B). The projection creates negative weights for Gaussians, and there is no normalization of the sum of the Gaussian weights.
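A minimal sketch of Projection 1, reusing the correlation matrices from the distance computation above; again, the naming is an assumption of this illustration.

```python
# Sketch of Projection 1, Eq. (14): unconstrained optimal L2 weights.
import numpy as np

def project_unconstrained(MAA, MAB, b):
    """a_min = (M^AA)^{-1} M^AB b; the result may contain negative weights
    and need not sum to one, which is exactly the problem discussed below."""
    return np.linalg.solve(MAA, MAB @ b)
```

Solving the linear system instead of explicitly inverting M^AA is numerically preferable but otherwise equivalent to Eq. (14).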

Despite the fact that the proposed projection is optimal with regard to the L2 distance, it is likely to be suboptimal for the use in a common speech recognizer. The reasons are that

1. The elements of a_min do not sum to one, thus some states can always have higher scores than others.
2. Some weights for Gaussians are negative. In our decoder the corresponding log probabilities are replaced by a threshold.

The first problem can be solved with the Lagrange constraint that all weights have to sum to one. The Lagrange function to minimize can be stated as

L(a, \lambda) = a^T M^{AA} a - 2 a^T M^{AB} b + b^T M^{BB} b + \lambda \left( \sum_i a_i - 1 \right),    (15)

with the additional Lagrange multiplier λ. Differentiating this function gives

\frac{\partial L}{\partial (a, \lambda)} = \begin{pmatrix} 2 M^{AA} & \vec{1} \\ \vec{1}^T & 0 \end{pmatrix} \begin{pmatrix} a \\ \lambda \end{pmatrix} - \begin{pmatrix} 2 M^{AB} & \vec{0} \\ \vec{0}^T & 1/\lambda \end{pmatrix} \begin{pmatrix} b \\ \lambda \end{pmatrix},    (16)

where \vec{1} = (1, 1, \ldots, 1)^T.

Setting the derivative to \vec{0} and removing λ from the second matrix leads to

\begin{pmatrix} a_{min} \\ \lambda \end{pmatrix} = \begin{pmatrix} 2 M^{AA} & \vec{1} \\ \vec{1}^T & 0 \end{pmatrix}^{-1} \begin{pmatrix} 2 M^{AB} & \vec{0} \\ \vec{0}^T & 1 \end{pmatrix} \begin{pmatrix} b \\ 1 \end{pmatrix},    (17)

resulting in an a vector that sums up to one. When (M^AA)^{-1} is known, the inverse of the complete matrix can be computed efficiently with the Schur complement (Zhang, 2005).

Projection 2. An optimal projection from GMM A to B that minimizes the D_L2 error D_L2(A,B). Additionally, the constraint that all Gaussian weights have to sum to one is enforced. There are negative weights for Gaussians after the projection.
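Projection 2 can be sketched by solving the bordered (Lagrange) system of Eq. (17) directly; the naming is again illustrative, and the Schur complement shortcut mentioned above is omitted here for clarity.

```python
# Sketch of Projection 2, Eq. (17): L2-optimal weights with sum-to-one.
import numpy as np

def project_sum_to_one(MAA, MAB, b):
    n = MAA.shape[0]
    ones = np.ones((n, 1))
    # Left-hand side [[2 M^AA, 1], [1^T, 0]]; right-hand side [2 M^AB b, 1].
    lhs = np.block([[2 * MAA, ones], [ones.T, np.zeros((1, 1))]])
    rhs = np.concatenate([2 * MAB @ b, [1.0]])
    solution = np.linalg.solve(lhs, rhs)
    return solution[:n]   # a_min; the last entry is the Lagrange multiplier
```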

Solving the issue of negative weights is a more difficult convex optimization problem (Boyd and Vandenberghe, 2004). A common method to solve it is via the Karush–Kuhn–Tucker constraints (Kuhn and Tucker, 1951). These are basically a generalization of the Lagrange constraints and can work with inequalities by introducing slack variables s that transform every inequality into an equality, which can be solved as any Lagrange constraint. In the case here, an inequality constraint has to be introduced for every element of a. This gives the new function KKT for the distance between the mixture distributions A and B:

KKT(a, \lambda, \gamma) = a^T M^{AA} a - 2 a^T M^{AB} b + b^T M^{BB} b + \lambda \left( \sum_i a_i - 1 \right) + \sum_{i=1}^{n} \gamma_i (-a_i + s_i^2),    (18)

with γ = (γ_1, γ_2, ..., γ_n) and s = (s_1, s_2, ..., s_n). When a_i is zero, constraint i is said to be active, otherwise the constraint is inactive. If constraint i is active, γ_i is greater than 0. To find the optimal solution, all possible combinations of active constraints and inactive constraints need to be evaluated.

In practice it is not possible to check all the possible combinations for the optimal value. Similar problems have to be solved for Neural Networks (Platt and Bar, 1988; Biehl et al., 1990). Basically, the idea is to perform a gradient descent on the optimization criterion and a gradient ascent on the equality constraint. Biehl et al. (1990) show that a quadratic optimization problem that ignores negative values converges with gradient descent. In our case, the actual implementation needed well-tuned update weights to prevent oscillations caused by the opposed equality and inequality constraints. Nevertheless, the sequential iterative optimization algorithm achieved on average almost the same L2 distance as Projection 1 with only three iterations.

Projection 3. An "almost optimal" projection from GMM A to B that minimizes the D_L2 error D_L2(A,B). The weights of the projected distribution sum to one and there are no negative weights.

When applying these projections to our recognizer, not all Gaussians are comparable, as the different languages have different LDAs (Linear Discriminant Analyses). Therefore each Gaussian was saved before it was modified by an LDA. Thus we can make our comparisons with comparable Gaussians. These Gaussians are also used for the approximated projections in the next section.

3.4. Approximated projections between Gaussian spaces

In the previous section, three different projections with different constraints were introduced. Each of them has some disadvantages for employment in an embedded speech recognition system. Therefore, we propose some experimentally motivated projections.

The goal of each projection is to map all HMMs of all L languages to one fixed set of N Gaussians (= Recognition Codebook, RC), which can be either mono- or multilingual. Such a mapping can be achieved by mapping all M_l Gaussians of each additional language codebook (= Monolingual Codebook, MC_l) to the RC. Each Gaussian N is represented by its mean μ and covariance matrix Σ. The Mahalanobis distance measures the distance between Gaussians (D_G).

Projection 4. An approximated projection that only compares individual Gaussians in the different codebooks to derive a mapping. Each additional language Gaussian is replaced by another Gaussian according to map_G:

map_G(\mathcal{N}^i_{MC_l}) = \mathcal{N}^j_{RC}, \quad (0 \le i < M_l,\ 0 \le j < N,\ 0 \le l < L),
j = \arg\min_k D_G(\mu^i_{MC_l}, \mu^k_{RC}, \Sigma^i_{MC_l}).    (19)
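A sketch of the Gaussian mapping of Eq. (19), again under the simplifying assumption of diagonal covariances; the array layout is an assumption of this illustration.

```python
# Sketch of Projection 4, Eq. (19): map each additional-language Gaussian
# to its Mahalanobis-nearest Gaussian in the recognition codebook RC.
import numpy as np

def gaussian_map(mc_means, mc_vars, rc_means):
    """Return, for each Gaussian of one additional-language codebook,
    the index of the nearest RC Gaussian."""
    mapping = []
    for mu, var in zip(mc_means, mc_vars):
        d = np.sum((rc_means - mu) ** 2 / var, axis=1)  # D_G to every RC Gaussian
        mapping.append(int(np.argmin(d)))
    return mapping
```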

When all Gaussians from the main language are in the RC, there are further possibilities of how HMMs from other languages can be linked to the RC. All states from the main language map only to Gaussians from the RC. Thus, when all S states are mapped to RS main language states, only Gaussians from the RC are used. The same is true when all HMMs H are mapped to main language HMMs RH. Both of these additional mappings have the advantage that they consider the combination of Gaussians in their distance.

We map states based on the minimum Mahalanobis distance (D_S) between the expected values E of their Gaussian Mixture Models. The covariance which is needed for the Mahalanobis distance is a global diagonal covariance Σ_All estimated on all training samples. This covariance can also be calculated from the Gaussians in the codebook, thus there is no need for the actual training data. With D_S we define our state based mapping as Projection 5.

Projection 5. An approximated projection that compares states from additional languages to main language HMM states to derive a mapping. Each individual HMM state is replaced by another HMM state according to map_S:

map_S(s^i_l) = s^j_{RS}, \quad (0 \le i < S_l,\ 0 \le j < RS,\ 0 \le l < L),
j = \arg\min_k D_S(E[s^i_l], E[s^k_{RS}], \Sigma_{All}).    (20)
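Correspondingly, a sketch of the state mapping of Eq. (20). Here states are assumed to be given as (weight vector, mean matrix) pairs and Σ_All as a diagonal variance vector; these representations are assumptions of the illustration.

```python
# Sketch of Projection 5, Eq. (20): state mapping via expected values.
import numpy as np

def state_expectation(weights, means):
    """E[s] of a GMM state: the weight-averaged mean, sum_i w_i mu_i."""
    return weights @ means

def state_map(add_states, main_states, global_var):
    """Map each additional-language state to the Mahalanobis-nearest
    main-language state under the global diagonal covariance."""
    main_E = np.array([state_expectation(w, m) for w, m in main_states])
    mapping = []
    for w, m in add_states:
        d = np.sum((main_E - state_expectation(w, m)) ** 2 / global_var, axis=1)
        mapping.append(int(np.argmin(d)))
    return mapping
```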

Based on D_S we can define a distance between HMMs (D_H). In our system each context dependent phoneme is represented through a three state HMM model. In this case the distance between two phonemes q_1 and q_2 is

D_H(q_1, q_2) = \sum_{i=1}^{3} D_S(s^i_{q_1}, s^i_{q_2}).    (21)

With D_H we can define Projection 6.

Projection 6. An approximated projection that compares HMMs from additional languages to main language HMMs to derive a mapping. Each additional language HMM is replaced by a main language HMM according to map_H:

map_H(q^i_l) = q^j_{RH}, \quad (0 \le i < H_l,\ 0 \le j < RH,\ 0 \le l < L),
j = \arg\min_k D_H(q^i_l, q^k_{RH}).    (22)

D_G and D_S provide consistently good performance for different tests, while they use rather different information for their calculation. Therefore we also test a combined map_{G+S}.

Projection 7. An approximated projection that compares both Gaussians and HMM states to derive a mapping. Each additional language state gets a new output distribution probability according to map_{G+S}:

map_{G+S}(s^i_l) = c_{G+S}\, map_S(s^i_l) + (1 - c_{G+S}) \begin{pmatrix} w^{s^i_l}_1 map_G(\mathcal{N}^1_{MC_l}) \\ w^{s^i_l}_2 map_G(\mathcal{N}^2_{MC_l}) \\ \vdots \\ w^{s^i_l}_{M_l} map_G(\mathcal{N}^{M_l}_{MC_l}) \end{pmatrix}, \quad (0 \le l < L,\ 0 \le i < S_l),    (23)

with the combination weight c_{G+S}. c_{G+S} has to be determined in experiments. An additional retraining after each of the projections would probably increase the performance. In our experiments no retraining was performed, as this keeps the creation of new multilingual systems as simple as possible and on-demand acoustic model creation feasible.
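Putting the two mappings together, Eq. (23) can be sketched as follows. The representation of main-language states as weight vectors over the RC, and the default weight of 0.5 (the value used in Section 5.3), are assumptions of this illustration.

```python
# Sketch of Projection 7, Eq. (23): combined Gaussian + state mapping.
import numpy as np

def project_combined(add_state_weights, state_idx, gauss_idx,
                     main_state_weights, n_rc, c=0.5):
    """New RC weight vector for one additional-language state: a convex
    combination of its mapped main-language state (map_S) and its own
    weights pushed onto the mapped RC Gaussians (map_G)."""
    via_gaussians = np.zeros(n_rc)
    for i, w in enumerate(add_state_weights):
        via_gaussians[gauss_idx[i]] += w        # accumulate w_i on map_G(N^i)
    via_state = main_state_weights[state_idx]   # weights of map_S(s) over RC
    return c * via_state + (1.0 - c) * via_gaussians
```

Here state_idx would come from the state mapping and gauss_idx from the Gaussian mapping sketched earlier; weights mapped to the same RC Gaussian simply accumulate.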

3.5. Overview of projections

In the previous two sections, several methods for the projection of HMMs from one language to another were proposed. Table 1 summarizes the main information about them. The method column describes which information is used for the projection. The probability column indicates whether the result of the projection is a correct probability distribution.

Table 1. Comparison of projection methods.

Projection  Method            Probability
Pro1        L2 minimization   No
Pro2        L2 minimization   No
Pro3        L2 minimization   Yes
Pro4        Gaussian mapping  Yes
Pro5        State mapping     Yes
Pro6        HMM mapping       Yes
Pro7        Pro4+Pro5         Yes

3.6. Scalable architecture

In Sections 3.3 and 3.4 several projections between Gaussian spaces were defined. Each of these projections allows the use of only one codebook for all languages, which keeps the decoding feasible on an embedded system. Only at the moment of application is it known which languages have to be recognized. Therefore, if the projections can be calculated on the embedded system, there is no combinatoric problem for the training algorithms.

Thus, the defined projections generate a speech recognizer for every language combination without increasing the training effort. To actually have a scalable architecture, an algorithm is needed that can improve the performance. This can be achieved with the MWC algorithm defined in Section 3.1. This increases the number of Gaussians in our system and hence the memory demand, but the decoding complexity can be kept much lower than with monolingual recognizers that run in parallel. A graphical representation of the overall process was given in Fig. 1 of the introduction.

4. Experimental setup

Our semi-continuous HMM speech recognizer uses 11 MFCCs with their first and second derivatives per frame and LDA for feature space transformation. Monolingual recognizers for English, French, German, Spanish and Italian are trained on 200 hours of Speecon data (Iskra et al., 2002) with 1024 Gaussians with full covariance in the codebook (L = 5, M_l = 1024, 0 ≤ l < L). The HMMs are context dependent and the codebook for each language is different. We have between 2000 and 3000 triphone models for each language, each represented by a 3-state HMM. The language model is specified as a context free grammar.

Table 2 describes the native test sets for these five languages. The test sets are all from proprietary in-car data, but some of them are cleaner than others and match the training data better. Due to this, some languages have higher recognition rates than other languages. Each test set contains city names. The number of different city names in our context free grammars is specified in the fourth column of Table 2. As some city names can be repeated, the number of words can be higher than the number of entries in the vocabulary.

Table 3 shows the non-native test sets, mostly from the Hiwire database (Segura et al., 2007).

Table 2. Descriptions of the native test set for each language.

Testset  Language  Words  Vocab.
GE_City  German    2005   2498
US_City  English    852    500
IT_City  Italian   2000   2000
FR_City  French    3308   2000
SP_City  Spanish   5143   3672


Table 3. Description of the non-native test sets.

Testset    Accent   Words  Vocab.
Hiwire_FR  French   5192   140
Hiwire_SP  Spanish  1759   140
Hiwire_IT  Italian  3482   140
IFS_MP3    German    831    63

Table 4. Comparison between monolingual codebooks and a multilingual codebook (word accuracy).

Codebook  1024 Benchmark  1424 Multilingual
German    84.1            80.8
English   75.5            70.5
Italian   92.3            90.6
French    76.1            72.2
Spanish   91.9            91.4

The spoken language in the Hiwire tests is English. The native language background of the speaker varies, as indicated in Column 2. The Hiwire test sets are as specified in the distribution of the Hiwire database and contain command and control utterances in an aeronautic scenario. The MP3 test is performed on data that was especially collected for this work and contains Italian, French and Spanish artists and song names. Depending on which information is more interesting, either the spoken language of the test is indicated before the name, or the native language of the speakers is indicated after the name.

5. Experiments

We motivate our new approaches by evaluating the state of the art approach for multilingual speech recognition in Section 5.1. Section 5.2 shows that MWCs perform well for both native and non-native speech. We always test our systems on native speech as well as on non-native speech, as we expect that many people that use, for example, navigation systems for foreign destination input are quite fluent in the spoken language. Therefore our system also has to recognize fluent speech of the spoken language well. Section 5.3 compares the different projection methods that we have proposed in order to reduce the exponentially increased training effort which comes from the application of MWCs. Section 5.4 evaluates the combination of the MWC algorithm and the best projection, which allows efficient recognition of any language combination on embedded systems.

5.1. State of the art

The literature review about multilingual speech recognition indicated that a global phoneme model is the preferred solution for dealing with many languages. If there is only one phoneme model, there is also only one codebook for all languages in a semi-continuous system. Thus the question arises how well a global codebook can model phonemes from different languages. Therefore we built a codebook with training data from five languages, 200 hours of Speecon data for each language. The phonemes from each language are trained with speech from the corresponding language and this global codebook. The global codebook contains 1424 Gaussians. Table 4 shows that the performance in all languages is decreased. The loss is language dependent; for example German and English suffer more than Spanish and Italian. Nevertheless, these results are sufficient for the statement that a multilingual codebook performs worse than a monolingual codebook for each language, even if it is allowed to be a little larger. As the global phoneme model induces a global codebook, the same conclusions can be drawn for this approach. This conclusion is concordant with (Koehler, 2001; Wang et al., 2002).

5.2. Multilingual weighted codebooks

The performance is evaluated on German, English, Italian, French and Spanish test sets. German is chosen as main language for the MWC construction. The MWC algorithm can only take two codebooks as input. Therefore we put all Gaussians from the additional languages in a large codebook with 4096 Gaussians. Together with the German codebook, this is the input to the MWC algorithm. Fig. 4 shows the results of the baseline and several MWC systems. The baseline experiment uses the 1024 German Gaussians as codebook. The other systems add 200, 400 and 800 Gaussians from the additional languages. Thus, the total codebook sizes are 1224, 1424 and 1824. With these codebooks, the same retraining as for the baseline systems was performed. This means each language got a different HMM set, and this HMM set was trained with speech from the corresponding language.

For German the benchmark and baseline systems are identical, therefore there is only one line visible in the graph. The MWC performance on the German test set also varies only insignificantly. This indicates that the extensions to the codebook do not hurt the performance on the main language, which is a benefit compared to the state of the art approach discussed in Section 5.1. The performance on the other tests shows that MWCs improve significantly over our baseline system. For Spanish, the MWC with 1424 Gaussians almost achieves the benchmark performance. The differences between the different test sets are not relevant, as they are mainly due to the match between training and testing data, which is higher for example for Spanish and Italian, and lower for English and French.

To some extent the improvements of the MWCs can also be due to the fact that the MWCs contain more Gaussians than the baseline system. Therefore we also tested a system with an exclusively German codebook that contains 1824 Gaussians and compared it to the MWC with 1824 Gaussians.

Table 5. Comparison of an MWC to a monolingual codebook of the same size (word accuracy).

          GE_City  US_City  IT_City  FR_City  SP_City
GE 1824   83.8     67.6     81.4     70.2     89.8
MWC 1824  84.3     72.0     89.7     72.9     91.0

[Fig. 4. MWCs on native speech of five languages. All MWCs contain the full German codebook. The chart plots word accuracy (WA) over codebook sizes 1024, 1224, 1424 and 1824 for the GE_City, US_City, IT_City, FR_City and SP_City test sets, comparing the 1024 benchmark, the German MWCs and the 1024 German baseline.]

Table 5 demonstrates that Gaussians from other languages help the recognition of the additional languages more than additional German Gaussians do.

Table 6 presents the performance of MWCs on non-native accents. The benchmark system for Hiwire is the monolingual English system. For the four-lingual MP3 test no benchmark performance is given, as there are utterances that contain more than one language and no monolingual system can recognize such utterances. The baseline systems and MWCs are different for each column. The reason is that it makes, for example, more sense to recognize Spanish accented English with an MWC that contains the full Spanish codebook.

Table 6. Word accuracies with MWCs on the non-native accented tests. All MWCs contain the full codebook of the native language of the speaker.

Codebook        Hiwire_SP  Hiwire_FR  Hiwire_IT  IFS_MP3wK
Benchmark 1024  82.5       83.9       81.6       –
Baseline 1024   86.6       86.0       86.2       60.5
MWC 1224        86.6       86.4       86.7       59.7
MWC 1424        85.7       86.1       86.0       61.3
MWC 1824        86.0       85.8       85.1       59.9

That this is the right approach for non-native accented speech is proven by the fact that the baseline systems outperform the benchmark system significantly in all cases. In Word Error Rate (WER), the native language codebook actually gives improvements in the range of 25% relative WER, thus very similar to what the literature about non-native speech recognition could achieve without non-native adaptation data. The fact that a baseline system is better than the benchmark system can occur in these tests, as the tested speech differs strongly from the native training speech. In general the MWCs keep the performance; there are no significant improvements when additional Gaussians from other languages are added.

The absolute performance of the systems in Fig. 4 and Table 6 is actually quite similar. Of course, the non-native accented speech is harder to recognize by the speech recognizer, but the vocabulary size is smaller for the non-native tests, and together these two factors lead to rather similar performance for our native and non-native tests.

To summarize, Table 6 shows that training the spoken language on native language codebooks of the speakers helps significantly for the recognition of strongly accented speakers. However, such systems do not perform well for the recognition of more fluent speakers of the language, as shown in Table 4. For such speakers, it is necessary to add some additional Gaussians to the codebook to allow a better modeling of the spoken language. These additional Gaussians do not diminish the benefit of using the native language codebook of the speakers (Table 6).

5.3. Comparison of optimal and approximated projections

There are two attributes our projection must have. First, it must be executable on the embedded system, and second, it should be as efficient as possible. Table 7 presents Word Accuracies (WA) on the native US cities task, the degree of optimality according to the distance proposed in Section 3.2 and an indication of the time needed for the projection on an Intel PC with 3.6 GHz.

Table 7. Comparing optimal and approximated projections. The first column shows the word accuracy on the native US City test. The second column gives distances to the monolingual US English HMM models. The third column shows the runtime in seconds for precomputations. The fourth column shows the actual runtime of the estimation of the output probabilities of the HMM models. The runtime is given for the projection of one language with 1800 phoneme models to another codebook.

Projection  WA    Distance L2  Precomp. (s)  Runtime (s)
Pro1         5.2  4.08e-9      330           30
Pro2        49.7  4.08e-9      330           30
Pro3        55.5  4.10e-9      330           90
Pro4        44.8  6.80e-8        2            0.2
Pro5        44.5  6.64e-8       12            0.1
Pro6        31.2  5.29e-8        4            0.1
Pro7        55.1  5.07e-8       14            0.3
Baseline    65.6  4.13e-8        –        14,400
Benchmark   75.5  0              –        14,400

Where possible, we tried to precompute elements that have to be computed only once for every language and do not depend on the actual combination. Examples are the distances between states in Projection 5 and Projection 7, as well as the distances between all HMMs in Projection 6. The runtime for these additional precomputations is given in column 4. For Projections 1–3 the correlations between Gaussians are precomputed.

As expected, the optimal Projections 1–3 give by far the lowest error in L2 distance. However, Projection 1 results in a weight vector for the HMM states that is so different from regular probability distributions that a standard recognizer achieves only very low recognition rates. Projection 2 adds the normalization that weights have to sum to one, and this already leads to a reasonable recognition performance. Compared to other projections it is clear that the negative weights for some Gaussians still pose a problem for the decoding. Both Projection 1 and 2 are also quite slow, as the projection of each of the 5400 HMM states requires the multiplication with a large matrix.

[Fig. 5. Effect of the combination weight in Projection 7 for three different test sets (US_City, FR_City, Hiwire_FR). All weight values between 0.3 and 0.8 perform significantly better than only one of the projections alone. The chart plots the relative WA improvement against the state weight from 0.0 to 1.0.]

Projection 3 gives the best overall performance, but is significantly slower than all other projections. This is due to the fact that a sequential, iterative gradient descent is performed. Furthermore, after each update of a weight all other weights are adjusted to keep the constraint that the sum equals one at every step. This is repeated three times for each weight. The total number of changes to each weight leads to the high runtime and consequently to the fact that Projection 3 is not applicable for the proposed scalable architecture.

Among the approximated projections, both Projection 4 (Gaussian mapping) and Projection 5 (state mapping) achieve good performance in spite of their simplicity. Finally, Projection 7 (combined Gaussian+state mapping) has the best overall performance with both good recognition rate and fast runtime. The results also show that the projections alone reduce the performance significantly, both compared to the benchmark and the baseline. However, for practical application the projections are an interesting alternative as they allow multilingual recognition with no additional training and decoding effort.

In the above discussion Projection 7 was used with a weight of 0.5. This combination weight was determined in a grid search where we investigated values between 0 and 1 in 0.1 steps. Fig. 5 shows that a wide range of values for the combination weight is acceptable; all values between 0.3 and 0.8 led to good results.

5.4. Scalable architecture with approximated projections and MWCs

In Section 5.2 we have shown that MWCs can improve the speech recognition performance across languages. Section 5.3 demonstrated that the training complexity can be reduced with approximated projections. This section evaluates the performance of the combination of approximated projections and MWCs.

Fig. 6 depicts that the projections are as good as a retrained system when 200 more Gaussians are used, for four of the five languages. Of course, to some extent this is an unfair comparison, as we compare a system with more Gaussians to our baseline system. However, we are convinced that this is the fairest possible comparison regarding the actual behavior of our embedded target system. The reason is that the larger codebook depends on the combination of languages; in some cases we may want to have 50 Italian and 30 French additional Gaussians, in other cases we would prefer to have 60 Japanese and 20 English Gaussians. It all depends on the test set, and in our scenario we first know the test set on the embedded system itself. With the traditional training approach, we cannot react to the different test sets by training more Gaussians for some languages. With the proposed scalable architecture, we can react and provide the right system.

[Fig. 6. Scalable architecture on five different native language test sets. The chart plots WA over codebook sizes 1024, 1124, 1224 and 1424 for the GE_City, US_City, IT_City, FR_City and SP_City test sets, comparing the 1024 benchmark, the German Projection 7 systems and the 1024 German baseline.]

[Fig. 7. Scalable architecture on the non-native accented tests. The chart plots WA over codebook sizes 1024, 1124, 1224 and 1424 for the Hiwire_SP, Hiwire_FR, Hiwire_IT and IFS_MP3wK test sets, comparing the 1024 benchmark, the Projection 7 systems and the 1024 baseline.]

Fig. 7 depicts the performance of the scalable architecture on our non-native speech tests. As in Table 6, the MWCs used are actually different, and each test is run with an MWC that contains all Gaussians from the native language codebook of the speakers. The baseline and benchmark systems are also the same as in Table 6, which means that the upper line indicates the performance of our baseline systems. The performance of Projection 7 is significantly improved for the Hiwire tests when more Gaussians are added, but the MP3 test changes only slightly. We believe that this is due to the fact that the speakers have so little knowledge of the Italian, Spanish and French song names that they are really using German sounds to pronounce them. The figure shows that the systems generated by the scalable architecture perform slightly


worse than both the benchmark and baseline systems, but given the fact that the benchmark systems require more resources for each additional language, and the training effort of the baseline systems increases with the number of languages, the scalable architecture is the method of choice if many language combinations are possible, and most of them will be needed rarely.

6. Conclusion

In this paper we have explained the combinatoric problems that come with the provision of multilingual speech recognition for many languages. For the efficient introduction of multilingual knowledge, we use Multilingual Weighted Codebooks that have low decoding complexity and good recognition performance for both almost fluent and less fluent non-native speakers. To keep the training effort reasonable, we have defined several projections between Gaussian spaces. From these projections, Projection 7 proved to be the most suitable one for speech recognition, as it is either better or faster than the other proposed projections. However, we think that in other, non-speech applications the more exact L2 based projection might be more appropriate.

A combination of the proposed algorithms leads to an architecture with both low training and decoding effort. The scalable architecture outperforms our baseline systems by up to 5.4% absolute word accuracy, and performs almost as well as the monolingual benchmark systems on non-native accented speech. Additionally, there are several advantages of our new scalable architecture for commercial application. First, it is more customer friendly, as it can recognize speech from all language combinations. Second, it is easier to provide and maintain due to the reduced redundancy. Third, the performance is better than that of our baseline system for fluent speakers of foreign languages. Fourth, it is cheaper, as it is not necessary to train speech recognizers for many different language combinations.

In a final comparison to the state of the art as identified in the literature, we can say that our approach is more suitable if the native language of the user is known and maximum performance in this language is paramount. In other cases, where the native language of the speaker is not known, or many speakers have to be recognized simultaneously, the global phoneme model remains the architecture of choice.

Acknowledgements

The authors thank Olaf Schreiner, Tobias Herbig and Dr. Volker Schubert for help with mathematical problems. Dr. Franz Gerl was a great help for the iterative solution of the optimization problem. We also thank Raymond Brueckner and Dr. Guillermo Aradilla for advice and discussion. Finally, we thank the reviewers for their valuable comments on initial versions of the paper.

References

Bartkova, K., Jouvet, D., 2006. Using multilingual units for improved modeling of pronunciation variants. In: Proc. ICASSP. Toulouse, France, pp. 1037–1040.

Biehl, M., Anlauf, J.K., Kinzel, W., 1990. Perceptron learning by constrained optimization: the AdaTron algorithm. In: Proc. ASI Summer Workshop Neurodynamics. Clausthal, Germany.

Bouselmi, G., Fohr, D., Illina, I., 2007. Combined acoustic and pronunciation modeling for non-native speech recognition. In: Proc. Interspeech. Antwerp, Belgium, pp. 1449–1552.

Boyd, S., Vandenberghe, L., 2004. Convex Optimization. Cambridge University Press.

Dalsgaard, P., Andersen, O., Barry, W., 1998. Cross-language merged speech units and their descriptive phonetic correlates. In: Proc. ICSLP. Sydney, Australia, pp. 2623–2626.

Fuegen, C., 2003. Efficient handling of multilingual language models. In: Proc. ASRU. St. Thomas, USA, pp. 441–446.

Goronzy, S., Sahakyan, M., Wokurek, W., 2001. Is non-native pronunciation modeling necessary? In: Proc. Interspeech. Aalborg, Denmark, pp. 309–312.

Harbeck, S., Nöth, E., Niemann, H., 1998. Multilingual speech recognition in the context of multilingual information retrieval dialogues. In: Proc. TSD, pp. 375–380.

Hershey, J.R., Olsen, P.A., 2007. Approximating the Kullback Leibler divergence between Gaussian mixture models. In: Proc. ICASSP. Honolulu, Hawaii, pp. 317–320.

Huang, X., Lee, K.F., Hon, H.W., 1990. On semi-continuous hidden Markov modeling. In: Proc. ICASSP. Albuquerque, USA, pp. 689–692.

Iskra, D., Grosskopf, B., Marasek, K., van den Heuvel, H., Diehl, F., Kiessling, A., 2002. Speecon – speech databases for consumer devices: database specification and validation. In: Proc. LREC. Las Palmas de Gran Canaria, Spain, pp. 329–333.

Jensen, J.H., Ellis, D.P.W., Christensen, M., Jensen, S.H., 2007. Evaluation of distance measures between Gaussian mixture models of MFCCs. In: Proc. ISMIR. Vienna, Austria, pp. 107–108.

Jian, B., Vemuri, B.C., 2005. A robust algorithm for point set registration using mixture of Gaussians. In: Proc. IEEE Internat. Conf. on Computer Vision. Beijing, China, pp. 1246–1251.

Juang, B.H., Rabiner, L.R., 1985. A probabilistic distance measure for hidden Markov models. AT&T Technical Journal 64 (2), 391–408.

Koch, W., 2004. Optimierungsverfahren für einen universellen Spracherkenner mit robusten, effizienten Algorithmen. Ph.D. thesis, University of Kiel, Kiel, Germany.

Koehler, J., 2001. Multilingual phone models for vocabulary-independent speech recognition tasks. Speech Communication Journal 35 (1–2), 21–30.

Kuhn, H.W., Tucker, A.W., 1951. Nonlinear programming. In: Proc. 2nd Berkeley Symp. Berkeley, USA, pp. 481–492.

Ladefoged, P., 1990. The revised international phonetic alphabet. Language 66 (3), 550–552.

Lang, H., 2009. Methods for the Adaptation of Acoustic Models to Non-Native Speakers. Diplomarbeit, Institute of Information Technology, University of Ulm, Ulm, Germany.

Lieb, E.H., Loss, M., 2001. Analysis. American Mathematical Society.

Menzel, W., Atwell, E., Bonaventura, P., Herron, D., Howarth, P., Morton, R., Souter, C., 2000. The ISLE corpus of non-native spoken English. In: Proc. LREC. Athens, Greece, pp. 957–963.

Niesler, T., 2006. Language-dependent state clustering for multilingual speech recognition in Afrikaans, South African English, Xhosa and Zulu. In: Proc. ITRW. Stellenbosch, South Africa.

Noord, G., 2009. TextCat. <http://odur.let.rug.nl/vannoord/TextCat/>.

Nöth, E., Harbeck, S., Niemann, H., 1999. Multilingual speech recognition. In: Ponting, K. (Ed.), Computational Models of Speech Pattern Processing. NATO ASI Series F. Berlin, Germany, pp. 363–375.

Park, J., Ko, H., 2004. Compact acoustic model for embedded implementation. In: Proc. Interspeech. Jeju Island, Korea, pp. 693–696.

Petersen, K., Pedersen, M., 2008. The Matrix Cookbook. <http://matrixcookbook.com>.

Platt, J.C., Bar, A.H., 1988. Constrained Differential Optimization for Neural Networks. Technical Report, Caltech, USA.

Raab, M., Gruhn, R., Nöth, E., 2007. Non-native speech databases. In: Proc. ASRU. Kyoto, Japan, pp. 413–418.

Raab, M., Gruhn, R., Nöth, E., 2008a. Multilingual weighted codebooks. In: Proc. ICASSP. Las Vegas, USA, pp. 4257–4260.

Raab, M., Gruhn, R., Nöth, E., 2008b. Multilingual weighted codebooks for non-native speech recognition. In: Proc. TSD. Brno, Czech Republic, pp. 485–492.

Raab, M., Schreiner, O., Herbig, T., Gruhn, R., Nöth, E., 2009. Optimal projections between Gaussian mixture feature spaces for multilingual speech recognition. In: Proc. DAGA. Rotterdam, Netherlands, pp. 411–414.

Schaden, S., 2006. Regelbasierte Modellierung fremdsprachlich akzentbehafteter Aussprachevarianten. Ph.D. thesis, University of Duisburg-Essen, Duisburg, Germany.

Schultz, T., Waibel, A., 2001. Language-independent and language-adaptive acoustic modeling for speech recognition. Speech Communication 35, 31–51.

Segura, J., Ehrette, T., Potamianos, A., Fohr, D., Illina, I., Breton, P.-A., Clot, V., Gemello, R., Matassoni, M., Maragos, P., 2007. The HIWIRE database, a noisy and non-native English speech corpus for cockpit communication. <http://www.hiwire.org/>.

Tan, T.P., Besacier, L., 2007. Acoustic model interpolation for non-native speech recognition. In: Proc. ICASSP. Honolulu, Hawaii, pp. 1009–1013.

Tomokiyo, L.M., Waibel, A., 2001. Adaptation methods for non-native speech. In: Proc. MSLP. Aalborg, Denmark, pp. 39–44.

Uebler, U., 2001. Multilingual speech recognition in seven languages. Speech Communication 35, 53–69.

Ueda, Y., Nakagawa, S., 1990. Prediction for phoneme/syllable/word-category and identification of language using HMM. In: Proc. ICSLP. Kobe, Japan, pp. 1209–1212.

Wang, Z., Topkara, U., Schultz, T., Waibel, A., 2002. Towards universal speech recognition. In: Proc. ICMI. Pittsburgh, USA, pp. 247–252.

Weng, F., Bratt, H., Neumeyer, L., Stolcke, A., 1997. A study of multilingual speech recognition. In: Proc. Eurospeech. Rhodes, Greece, pp. 359–362.

Witt, S., 1999. Use of Speech Recognition in Computer-Assisted Language Learning. Ph.D. thesis, Cambridge University Engineering Department, Cambridge, UK.

Zhang, F., 2005. The Schur Complement and Its Applications. Springer.