8
Cent. Eur. J. Eng. • 1(2) • 2011 • 181-188 DOI: 10.2478/s13531-011-0017-6 Central European Journal of Engineering Recognition of voice commands using adaptation of foreign language speech recognizer via selection of phonetic transcriptions Research article Rytis Maskeliunas 1* , Vytautas Rudzionis 2 1 Kaunas University of Technology, Lithuania, 2 Vilnius University, Lithuania Received 14 February 2011; accepted 31 March 2011 Abstract: In recent years various commercial speech recognizers have become available. These recognizers provide the possibility to develop applications incorporating various speech recognition techniques easily and quickly. All of these commercial recognizers are typically targeted to widely spoken languages having large market potential; however, it may be possible to adapt available commercial recognizers for use in environments where less widely spoken languages are used. Since most commercial recognition engines are closed systems the single avenue for the adaptation is to try set ways for the selection of proper phonetic transcription methods between the two languages. This paper deals with the methods to find the phonetic transcriptions for Lithuanian voice commands to be recognized using English speech engines. The experimental evaluation showed that it is possible to find phonetic transcriptions that will enable the recognition of Lithuanian voice commands with recognition accuracy of over 90% . Keywords: Lithuanian • Multilingual • Speech recognition • Transcriptions © Versita Sp. z o.o. 1. Introduction From the advent of speech recognition research and the appearance of the first commercial applications, the main efforts were devoted to the recognition of widely used lan- guages, particularly the English language. The reason for such behavior is very clear – popular widely used lan- guages have a larger market potential for practical appli- cations. Many other less widely used languages remain out of the scope of interest for the major speech recognition * E-mail: [email protected] solution providers. In countries were such less popular languages are used as a main source of spoken language communication, businesses and state institutions face a challenge in development of their own speech recognition tools. The two major solutions are as follows: • to develop their own speech recognition engine from scratch; • to adapt a foreign-language-based engine for the recognition of their native language. The first approach has the potential to exploit the pecu- liarities of the selected language and hence to achieve a higher recognition accuracy; however the drawback of 181

Recognition of voice commands using adaptation of foreign language speech recognizer via selection of phonetic transcriptions

Embed Size (px)

Citation preview

Page 1: Recognition of voice commands using adaptation of foreign language speech recognizer via selection of phonetic transcriptions

Cent. Eur. J. Eng. • 1(2) • 2011 • 181-188DOI: 10.2478/s13531-011-0017-6

Central European Journal of Engineering

Recognition of voice commands using adaptation offoreign language speech recognizer via selection ofphonetic transcriptions

Research article

Rytis Maskeliunas1∗, Vytautas Rudzionis2

1 Kaunas University of Technology, Lithuania,

2 Vilnius University, Lithuania

Received 14 February 2011; accepted 31 March 2011

Abstract: In recent years various commercial speech recognizers have become available. These recognizers provide thepossibility to develop applications incorporating various speech recognition techniques easily and quickly. All ofthese commercial recognizers are typically targeted to widely spoken languages having large market potential;however, it may be possible to adapt available commercial recognizers for use in environments where less widelyspoken languages are used. Since most commercial recognition engines are closed systems the single avenuefor the adaptation is to try set ways for the selection of proper phonetic transcription methods between the twolanguages. This paper deals with the methods to find the phonetic transcriptions for Lithuanian voice commandsto be recognized using English speech engines. The experimental evaluation showed that it is possible to findphonetic transcriptions that will enable the recognition of Lithuanian voice commands with recognition accuracyof over 90% .

Keywords: Lithuanian • Multilingual • Speech recognition • Transcriptions© Versita Sp. z o.o.

1. Introduction

From the advent of speech recognition research and theappearance of the first commercial applications, the mainefforts were devoted to the recognition of widely used lan-guages, particularly the English language. The reasonfor such behavior is very clear – popular widely used lan-guages have a larger market potential for practical appli-cations. Many other less widely used languages remainout of the scope of interest for the major speech recognition

∗E-mail: [email protected]

solution providers. In countries were such less popularlanguages are used as a main source of spoken languagecommunication, businesses and state institutions face achallenge in development of their own speech recognitiontools. The two major solutions are as follows:

• to develop their own speech recognition engine fromscratch;

• to adapt a foreign-language-based engine for therecognition of their native language.

The first approach has the potential to exploit the pecu-liarities of the selected language and hence to achievea higher recognition accuracy; however the drawback of

181

Page 2: Recognition of voice commands using adaptation of foreign language speech recognizer via selection of phonetic transcriptions

Recognition of voice commands using adaptation of foreign language speech recognizer via selection of phonetic transcriptions

such an approach is that the providers of the major speechtechnologies avoid the implementation of such languagesin their products, which generally leads to higher costs.The second approach has the potential to achieve somepractically-acceptable results faster than developing anentirely new speech recognition engine. Another advan-tage of this approach is the potential to achieve fastercompatibility with existing technological platforms. Theidea behind this approach is to transfer the existing sourceacoustic models from a source language to the target lan-guage without using the speech corpora in that languageand without full retraining of the speech recognition sys-tem.There were various attempts to investigate and apply prin-ciples of multilingual recognition. In [1], basic terms re-lated with multilingual recognition (polyphones and mono-phones) were discussed. The authors applied a data-driven algorithm in order to find similar phonemes in fourlanguages. Experiments showed that the method allowedsuccessfully recognizing language. In [2], common pho-netic and syntactic models for English and Swedish weredeveloped. An IBM team [3] investigated possibilities totransfer an English recognition system for French. Theyselected 25 common polyphones and 24 specific Englishand 9 specific French monophones. Working with Euro-pean multilingual SpeechDat corpora [4], researchers triedto use models obtained from French, Italian, Portuguese,Spanish, and English speech recordings for recognition ofa new language (German). It was observed that recogni-tion accuracy was only slightly worse than using the lan-guage specific recognizer (85% against 89%). Possibilitiesto use models from several languages [5] (English, Span-ish, Russian, Chinese) for large vocabulary Czech recogni-tion were investigated. Using only foreign language mod-els, a significant number of errors were obtained (about80%) but after introduction of some Czech data recogni-tion the error rate dropped to about 30%. The MASPERinitiative [6] was the first initiative to create cross-lingualrecognition models from Central and Eastern Europeanlanguages. Concerning Lithuanian speech recognition, thework done in [7] is important. This was the first attempt touse an English recognizer for Lithuanian recognition. Theauthor proposed rules for the transcription of Lithuanianwords for English recognizer. 85% recognition accuracyfor 500 names was achieved, but only 2-4 speakers wereused for evaluation.As in [7] we also used the commercial Microsoft Englishspeech recognition engine in our experiments. This en-gine was selected because it is the most commonly usedengine (preinstalled in MS Windows XP and above) andpossesses well established software toolkits for develop-ment of practical applications.

2. The implementation of multipletranscriptions for the recognition oflithuanian voice commandsOne of the main tasks of this work is to propose themethodology for utilization of foreign language recogni-tion tools for the creation of a Lithuanian spoken languagespeech recognition system. Naturally, such a recognizerwon’t be universal, but may work for limited vocabularytasks with high recognition accuracy. The methods of-fered here are for a “third party” closed source recognizer,so it is impossible to modify or replace the recognitionalgorithms or acoustic – phonetic models. There is onlya single parameter to modify: the system input, in thiscase the transcriptions for the analyzed word (or otherlinguistic unit).In the adaptation methods a similarity measure is appliedduring mapping. The similarity measure itself is obtainedfrom some data applying some algorithm. The idea behindthis method is that similar phonemes are confused duringspeech recognition by a phoneme recognizer. The basiccharacteristic of such a recognizer is that it recognizesphoneme sequences instead of words from a vocabulary.For generating a crosslingual confusion matrix, acousticmodels of one of the source languages were applied onspeech utterances of the target language. The recognizedsequence of the source phonemes was then aligned to thereference sequence of the target phonemes. The outputof this alignment was the crosslingual phoneme confusionmatrix M. At this stage for each target phoneme ftrg thebest corresponding source phoneme fsrc should be lookedfor. As similarity measure, the number of phoneme confu-sions c(ftrg, fsrc) is often selected.Thus, for each target phoneme ftrg the source phoneme fsrc

with the highest number of confusions c is selected in thisschema. If two or more source phonemes has the samehighest number of confusions it was proposed to leavethe decision for the expert as to which one of the sourcephonemes should represent the target phoneme ftrg. Thesame procedure could be applied if no confusions betweensource and target phonemes were observed.In experiments described below the task has been lim-ited to the possibilities of adapting phonetic transcriptionsgiven in one language to transcriptions in another. Theselimitations were set up due to the initial formulation ofthe task; since the main task is to adapt “closed” speechrecognition engines the single opportunity is to adapt thephonetic transcriptions. The proposed methods are newsince most other researchers tried to transfer the nativelanguage acoustic or phonetic model to the foreign lan-guage recognizer, while in this work the recognizer wasunmodified. The main advantage of the proposed meth-

182

Page 3: Recognition of voice commands using adaptation of foreign language speech recognizer via selection of phonetic transcriptions

Rytis Maskeliunas, Vytautas Rudzionis

ods is that they may be adapted to any foreign languagerecognizer, if enough acoustic resources are available.

3. Algorithms for the transcriptioncreationFor the task of the Lithuanian voice commands recognitionan English recognizer was chosen – this way trying to in-terpret how the English recognizer will react to the spokenLithuanian voice commands. Two fixed-structure long andshort voice command vocabularies with potential practicaluses were chosen: the Lithuanian family names and theLithuanian digit titles. The first vocabulary has consid-erably long duration voice commands, while the secondcontains short duration voice commands.

3.1. Creation of the transcriptions for the longvoice commandsSome known transcription selection principles may beused to select the transcriptions necessary for the longvoice commands recognition: data (statistical) based crite-ria, expert knowledge based criteria, and perception basedcriteria. Some initial selection procedures must be foundto form and iteratively select the most suitable transcrip-tions. Expert knowledge based transcription criteria maybe described as using the linguistic knowledge to form thetranscriptions. This way the Lithuanian consonants andvowels are replaced with the similar English alternatives.Perception criteria are based on the “intuitive” selection oftranscriptions - similar to the native Lithuanian languagetranscriptions - for evaluation using the foreign languageTTS engine.The creation of the long voice commands transcription set(illustrated in Fig. 1) may be separated into the followingstages:1) The initial transcriptions are created in this stage foreach of the analyzed words, utilizing the principles notedin section 3.1. The recognition accuracy of this set isthen verified, using a large multi speaker corpus. Thepoorly recognized transcriptions are selected (recognitionaccuracy < 90 %) and forwarded to step two.2) The recognition accuracy is improved for the poorlyrecognized transcriptions in this stage. First all possi-ble transcription variations are generated for each of thepoorly recognized words. The transcription set is thenpassed to the recognizer using the same corpus as in step1. The selected transcriptions are noted and the experi-ment is iteratively repeated until a few best transcriptionsremain, in this way forming the different size transcriptionssets. These new transcriptions are then verified using a

Figure 1. The algorithm for long voice command transcription cre-ation.

new corpus (containing much larger number of utterancesfor each word). The most accurately recognized set is thentransferred to the original recognition vocabulary.

3.2. Creation of transcriptions for the shortvoice commands

The task of the short voice commands recognition is sig-nificantly more difficult than the task of the long voicecommands recognition. The principles used for the longvoice commands are not suitable here in most cases. Thetranscription forming may be separated into three stages(illustrated in Fig. 2):1) The transcriptions for the smaller parts of the analyzedword (syllables) are formed in this stage. Only the mostsimple consonant – vowel type syllables are analyzed.2) The recognition frequencies of the analyzed syllabletranscriptions are noted using the specific multispeakersyllable corpus. The syllable transcriptions for furtherwhole-word transcription forming are then selected, basedon the phonetic (the most frequently selected phonemesare used for forming the syllable transcription set) or syl-labic criterion (the most frequently recognized syllabletranscriptions are used).3) The best transcriptions out of a set formed in the previ-ous stage are iteratively selected in the third stage. Thepossible poorer recognition accuracy of some voice com-

183

Page 4: Recognition of voice commands using adaptation of foreign language speech recognizer via selection of phonetic transcriptions

Recognition of voice commands using adaptation of foreign language speech recognizer via selection of phonetic transcriptions

Figure 2. The algorithm for the: a) syllable transcription set creation; b) short voice command transcription set creation.

mands may be further improved by modifying the tran-scriptions using the other methods, such as expert lin-guistic knowledge. In the end the recognition results ofthe whole vocabulary are verified using the new corpusnot used for the initial transcription selection.

4. Recognition accuracy of the longlithuanian voice commands

The idea behind the successful recognition of long Lithua-nian voice commands was the use of more than one tran-scription for each of the analyzed words. Obviously theuse of the foreign language phoneme recognizer is quitecomplicated. It is impossible to find the direct equiva-lent for every sound unit (spoken by a different speaker).One of the possible ways to overcome this problem wasthe usage of multiple transcriptions for each of the an-alyzed voice commands. The assumption was made thatonly one transcription fits one concrete case. Theoreti-

cally the number of such transcriptions is obviously verylarge, so some optimum limit must be found. The optimumnumber depends on such factors as the size of the vocab-ulary, the similarity of the words, speakers number (inde-pendent or not), etc. Currently there exist no theoreticalalgorithms describing how to do this, so the research mustbe done experimentally. The usage of multiple transcrip-tions proves quite useful for some practical non-dictationapplications, as it may be realized at somewhat lowercosts than a full blown native Lithuanian recognizer.

The initial group of experiments was conducted using the100 Lithuanian first and last names corpus (35 speakers,3500 phrases total) utilizing the two Lithuanian vowelbased transcriptions [7] (hereafter referred to as LBT)(Fig. 3).

The recognition system correctly recognized 90.0 ± 0.2 %of the corpus, unidentified 5.8 ± 0.1 %, and substituted4.2 ± 0.1 %. The substitution errors were the worst typeof practical recognition errors, because the system offereda false answer as the correct one.

184

Page 5: Recognition of voice commands using adaptation of foreign language speech recognizer via selection of phonetic transcriptions

Rytis Maskeliunas, Vytautas Rudzionis

Figure 3. Long voice commands recognition results using the initialcorpus and 2 LBT based transcriptions.

Figure 4. Long voice commands recognition results using the veri-fied corpus and 2 LBT based transcriptions.

The analysis of the recognition errors was performed inorder to improve the recognizer performance and optimizethe adaptation procedure. There were 351 total substitu-tions and unidentified voice commands errors in the firstgroup of the experiments, so it was natural to expect thatnot all the names would produce the equal number of er-rors. The 5 most confusing names produced ~ 30 % ofall the substitution and the indeterminacy errors, so the“concentration” of the errors was large and more attentionto the names that resulted in larger amounts of errors wasnecessary. Some of the errors could not be explainedstraightforwardly. For example, the name “Gudas” wasoften confused with the name “Butkus”.A similar experiment was conducted, only this time utiliz-ing the verified (error free) corpus (Fig. 4).The recognition accuracy improvement was insignificant(1.1 %). The results show that the recognizer is quite ro-bust in the case of long command recognition – the utter-ance quality has a low impact on the recognition accuracy.One of the transcription adaptation procedures incor-porated multiple transcriptions for one Lithuanian word.Several experiments were performed. The idea was togenerate multiple transcriptions of the same word and tocheck which transcriptions will be recognized more of-ten for different speakers. Two family names from thesame list of 100 Lithuanian names – “Beliukeviciute” and“Varanauskas” – were selected. In the case of the familyname “Beliukeviciute” 1152 transcriptions were generated

for this experiment and for the family name “Varanauskas”188 transcriptions were obtained. Then, the two expe-rienced speakers pronounced each of the family names100 times and the recognition system was coded to selectwhich transcription is the most likely for each speakerand each name. A notable observation in table 1 is that alarge number of the transcriptions were recognized as themost similar ones for each of the speaker. These resultsallowed concluding that the use of multiple transcriptionsis a reasonable step and it is worth further investigation.

Table 1. The four most frequently recognized transcriptions, for thetwo speakers and for the two Lithuanian names.

NameBeliukeviciute Varanauskas

1st. speaker 2nd. speaker 1st. speaker 2nd. speakertranscr. occur. transcr. occur. transcr. occur. transcr. occur.111 23 505 18 67 24 19 1099 20 121 12 130 11 166 815 13 504 11 4 8 144 83 10 507 11 70 8 6 8

Overall 27 Overall 21 Overall 28 Overall 27transcriptions transcriptions transcriptions transcriptions

The phrase with the worst recognition accuracy “GudasAudrius” was chosen for further multiple transcription us-age analysis. 252 possible transcriptions were created forthis purpose. The recognition accuracy was verified usingthe same 35 speaker corpus.Iteratively the worst transcriptions have been removed(only 22, 10, 4 and 2 best transcriptions were left in eachstage).In the first stage the recognition accuracy was verified us-ing the two original transcriptions based on the LBT prin-ciples (selected using a 2 speaker corpus). For the 14 newspeakers we found only 26.21 ± 1 %. 24.14 ± 1 % of thespoken commands were unidentified, while 49.64 ± 1.6 %were recognized incorrectly.In the second stage the experiment was repeated, usingonly the two transcriptions obtained from the iterative se-lection method (out of 252 possible). The average recog-nition accuracy was 64.5 ± 1.2 % – almost 2.5 times morethan in the case of the 2 LBT transcriptions. Similarly,the number of substitution errors and unrecognized wordsdecreased almost 2 times, to 25.6 ± 0.6 % and 9.9 ± 1.0 %,respectively.In the further stages the larger sets of the transcriptionswere used (4, 10 and 22), in trying to evaluate the as-sumption that the bigger number of transcriptions would

185

Page 6: Recognition of voice commands using adaptation of foreign language speech recognizer via selection of phonetic transcriptions

Recognition of voice commands using adaptation of foreign language speech recognizer via selection of phonetic transcriptions

Figure 5. The comparison of the recognition accuracy when 22, 10,4 and 2 iteratively selected transcriptions and 2 originalLBT based transcriptions were used.

Figure 6. The comparison of the original transcription list and the listwith added 22 problematic phrase transcriptions recogni-tion results: a) recognition accuracy, %; b) unrecognizedwords, %; c) wrongly recognized words, %.

account for a larger variety of voices, thus increasing therecognition accuracy. The experiments were repeated withthe same 14 speaker corpus. In the case of 4 transcrip-tions the recognition accuracy increased about 10 % (to73.8 ± 1.0 %), and the number of substitution errors de-creased by 10 % (to 18.4 ± 0.8 %). The use of 10 tran-scriptions led to another 5 % increase in the recognitionaccuracy (78.9 ± 0.8 %) and 5 % decrease of the substi-tution errors (to 13.9 ± 0.7 %). The use of the 22 tran-scriptions increased the recognition accuracy by a smallamount (~ 1 % to 79.8 ± 0.8 %) and similarly decreasedthe substitution errors by a small margin (to 13.1 ± 0.7 %).The results show (Fig. 5) that one way to achieve highrecognition accuracy is to create the transcriptions basedon as the largest possible number of speakers – i.e., over-lapping the variety of voices as high as possible.The research on the use of multiple transcriptions for theproblematic phrase proved successful. Since those experi-ments were applied to analyzing only one phrase (“GudasAudrius”) vocabulary, it was necessary to determine if theadditional number of transcriptions (22 in this case) wouldworsen the overall recognition accuracy of the 100 namesdictionary.The results of the experiment (Fig. 6) showed that the useof the larger number of transcriptions for the problem-atic phrase does not deteriorate the recognition accuracy;in fact, the overall recognition accuracy increased to a

Figure 7. The recognition accuracy comparison, between all LBTbased transcriptions, and the two best and single best it-eratively selected transcriptions.

91.3 ± 0.1 %, because the previously problematic phrasewas recognized for the majority of the speakers.

5. Recognition accuracy of theshort lithuanian voice commandsThe task of short voice commands recognition is one ofthe most difficult and the most significant tasks of speechrecognition. The experiments described were aimed to de-termine the degree of the possibility to adapt the Englishrecognition engine to the practical Lithuanian languagerecognition based application.At the initial stage of the experiments, 70 transcriptionvariations were created for the selected vocabulary, uti-lizing the LBT [7] principles. In these experiments (Fig. 7)all possible transcriptions (noted as LBT 70), the two besttranscriptions (noted as LBT 2) and the single best tran-scription (noted as LBT 1) were used. The best transcrip-tions were selected using the iterative selection method.The best results (81.1 ± 0.7 %.) were achieved utilizingonly the single best transcription.

5.1. Recognition experiments of the set of theconsonant vowel syllablesInitially, the syllable transcriptions were constructed outof 7 consonants (P, T, K, B, D, G, R) and all possible16 vowel and diphthong combinations. The recognitionaccuracy of the corpus, containing the syllables, con-structed using the consonants in the open, semi open andclosed vowel context (word “Keturi” syllables „Ke, Tu, Ri”),was analyzed. The results of this experiment are furthernoted as “C7V16”. The recognition result analysis (table 3)showed that the syllable “Ke” was recognized as sometranscription variation 98.1 % of the time, syllable “Tu” –94.3 %, and syllable “Ri” – 66.3 %. The largest number ofdifferent answers (out of 112 possible transcriptions) was

186

Page 7: Recognition of voice commands using adaptation of foreign language speech recognizer via selection of phonetic transcriptions

Rytis Maskeliunas, Vytautas Rudzionis

found to be selected for the syllable “Ri”, although itsrecognition accuracy was the worst of all three. It seemsthat the large dispersion of answers correlates with thelow recognition accuracy in the case of very short voicecommands recognition.A similar experiment (noted as “C24V16”) was undertaken.384 syllable transcription variations (using all 24 conso-nants) were created for the second group of the experi-ments. The comparison of both types of experiments isoffered in table 2.

Table 2. Syllable recognition results

Recognition results, %Syllable The type of the

experimentAnswers Unrecognized Omissions

“Ke” “C7V16” 98.1 0.3 1.6“C24V16” 97.8 0.4 1.8

“Tu” “C7V16” 94.3 1.2 4.5“C24V16” 91.1 1.8 7.1

“Ri” “C7V16” 66.3 16.0 17.7“C24V16” 59.1 21.3 19.6

The majority of the most frequently recognized syllabletranscriptions (Table 3) were phonetically dissimilar to thedirect alternatives of the Lithuanian word “Keturi”. Forexample, in the case of “C7V16”, the most often selectedalternatives for the syllable “Ke” were: G AE (pronouncedge), and T AX (ta); for the syllable “Tu”, D OW (dou),T OW (tou), D AO (do), and B OW (bou); for the syllable“Ri”, D AX (da), G EY (gei), D EH (de), and T UW (tû).The assumption may be made that accurately recognizedtranscription sets for especially short voice commands maybe created only using experimental data.The detailed result analysis proved that the recognizer iscapable of correctly recognizing the syllables, based onthe open / semi open vowel: “Ke” and “Tu”. Syllable “Ke”was recognized ~ 98 % in both types of the experiments.The number of omissions was larger for the syllable “Tu”.The syllable “Ri” was the most poorly recognized syllable:~ 66.3 % and 59.1 %. The recognition accuracy was a bithigher in the case of “C7V16” possibly due to a more limitedset of transcriptions used.Small dispersion was noted for the syllable made fromconsonants in the open vowel context (“Ke”) in both ofthe experiments (in the case of “C7V16” four answers over-lapped 89.8 %, in the case of “C24V16” – 75.2 %). In bothgroups of the experiments the answer dispersion for thesyllable made from consonants in the semi open vowelcontext (“Tu”) was higher (in the second – as much as20 %). Very significant answer dispersion was noticed for

Table 3. The most frequently selected syllable transcriptions

Syllable The type of theexperiment

The most frequently recognizedtranscription (%)

“Ke” “C7V16” G AE(53.5)

K AE(21.8)

G EH(10.2)

T AX(2.6)

“C24V16” JH AE(29,4)

G AE(29.0)

JH EH(8.1)

K AE(8.3)

“Tu” “C7V16” D OW(24,1)

T OW(18.8)

D AO(13.7)

B OW(9.7)

“C24V16” DOW(15.3)

S OW(12.4)

TH OW(9.2)

D AO(9.2)

“Ri” “C7V16” D AX(8.5)

G EY(7.1)

D EH(7.0)

T UW(7.1)

“C24V16” F IH(8.1)

D AX(4.6)

JH EY(3.7)

S UW(5.2)

the syllable made from consonants in the closed vowelcontext (“Ri”).Most of the frequently recognized syllable transcriptionswere quite phonetically different from their Lithuaniancounterparts. The phonetically similar alternatives wereselected very rarely, proving a disadvantage on the use ofsuch linguistic knowledge based methods as LBT.The phonetic syllable transcription selection criterion the-oretically was useful only in the case of syllables withsmall answer dispersion (“Ke and Tu”). The syllabic tran-scription selection criterion should be more useful in thiscase, as only the most frequently recognized transcriptionswere selected.

5.2. Combinative method based transcriptionrecognition experiments

At the first stage the transcriptions were created, using thesyllable recognition results, and the best of those were se-lected using the iterative selection method. Trying to fur-ther improve the recognition accuracy, the most dissimilarsyllable transcriptions were replaced with more phoneti-cally close, though less frequently recognized, transcrip-tions. For example, the second syllable’s transcriptionfrom the word “Keturi” – D OW (pronounced dau) – wasreplaced with a more phonetically similar alternative –T UH (pronounced tu). This led to a larger improvement(Table 4).The best transcriptions from that modified set were itera-tively selected (Table 5). The overall recognition accuracyachieved (91 %) was almost 3 times higher than for the LBTbased transcriptions.Similarly the transcriptions were created for the other

187

Page 8: Recognition of voice commands using adaptation of foreign language speech recognizer via selection of phonetic transcriptions

Recognition of voice commands using adaptation of foreign language speech recognizer via selection of phonetic transcriptions

Table 4. Some of the accurately recognized word “Keturi” transcrip-tions, where one syllable was replaced with a more phonet-ically similar alternative

Transcription Recognition accuracy, %G EH T UH D IH 1G AE T UH D IH 1K EH T UH D IH 1G EH T OW D IH 1G EH D OW D IH 1G EH T UW D IH 1G EH T UH D IY 1

89.089.087.076.050.089.585.5

Table 5. The best iteratively selected word “Keturi” transcriptions

Transcription Recognition accuracy, %G EH T UH D IHG EH T UW D IYG EH T UW D IHG EH T OW G EYK EH T UH D IYAll:

40.520.519.03.57.591.0

words in the short voice command vocabulary. Comparisonof the recognition accuracy for the original LBT methodbased transcriptions and the new transcriptions is foundin figure 8.

Figure 8. The comparison of the recognition accuracy of the orig-inal LBT transcriptions and the new combinative methodbased transcriptions .

6. ConclusionsThe main conclusion of this study is that high recogni-tion accuracy could be achieved for some applications and

specific command sets through the adaptation of phonetictranscriptions for foreign language speech recognition en-gines. This approach opens the way for the development ofeconomically feasible speech recognition applications forsmaller enterprises working in markets where less popu-lar languages are used as the main means of interpersonalcommunication.

The studies showed that the recognition accuracy couldbe significantly increased (often to error rates of 5-10%)by selecting proper phonetic transcriptions for the foreignlanguage trained recognizer. At the same time the selec-tion of the proper transcriptions is not a trivial task andvarious expert-driven and data-driven techniques shouldbe employed in order to achieve high recognition accuracy.

It should be emphasized that the success of the adaptationwill depend on the vocabulary used in the application andon the availability of a speech recognition engine trainedfor the language which has a more similar phonetic struc-ture to the target language for the application.

References

[1] Anderson O., Dalsgaard P., Barry W., On the useof data-driven clustering technique for identificationof poly- and mono phonemes for four European lan-guages, Proc. Of ICASSP., vol. 1:121-124, 1994

[2] Weng F., Bratt H., Stolcke A., A Study of MultilingualSpeech Recognition, Proc. Of Eurospeech, vol. 1:359-362, 1997

[3] Cohen P., et al., Towards a Universal Speech Recog-nizer for Multiple Languages, Proc. ASRU, 591-598,1997

[4] Kohler J., Language Adaptation of Multilingual PhoneModels for Vocabulary Independent Speech Recogni-tion Tasks, Proc. ICASSP, 417-420, 1998

[5] Byrne W. et al., Towards Language Independent Acous-tic Modeling, Proc. ICASSP, vol. 2:1029-1032, 2000

[6] Zgank A., et al., The COST278 MASPER Initiative- Crosslingual Speech Recognition With Large Tele-phone Databases, Proc. LREC, 2107-2110, 2004

[7] Kasparaitis P., Lithuanian Speech Recognition Us-ing English Recognizer INFORMATICA, vol. 19(4):505-516, 2008

188