
Recognition Unit Determination of Interactive Chinese Speech Recognition for Embedded Devices




Contributed Paper. Manuscript received 10/14/12; current version published 12/28/12; electronic version published 12/28/12. 0098-3063/12/$20.00 © 2012 IEEE

Gil-Jin Jang, Member, IEEE, Chunghsi Pan, Jae-Hyun Park, Jeong-sik Park, and Ji-Hwan Kim, Member, IEEE

Abstract — The work in this paper concerns the determination of a recognition unit for a small-footprint Chinese large-vocabulary word recognition system for embedded devices. This paper proposes a method of extending the conventional initial/final phonetic units of the Chinese language for use as recognition units. The word recognition performance of the proposed extended initial/final is compared with that of other recognition units: syllable, phoneme, triphone and the original initial/final. Under similar memory usage limitations, the word recognition rates of the different recognition units are evaluated. The proposed extended initial/final shows the best performance under a fixed memory usage condition: 93.74%, 94.76%, and 95.42% accuracy for a 50K-word recognizer under memory restrictions of around 500KB, 1MB, and 5MB, respectively.¹

Index Terms — Recognition unit, Chinese speech recognition, Small footprint, Extended initial/final

I. INTRODUCTION

As the most common mode of human-human interaction, speech has been considered an ideal medium for human-machine interaction [1]. Verbal communication between a human and an embedded device can often be categorized as either commanding the device or conversing with it. Understanding spoken dialogue requires extensive computation and memory. As a result, spoken dialogue comprehension for embedded devices often requires a cloud computing environment. However, understanding a simple command can be implemented using only a word recognizer if its use scenario is well defined. For example, an office location enquiry directed to a robot receptionist can be handled by speaking the name of the person that a guest is looking for. Destination entry for an autonomous vehicle can also be executed by speaking the location name.

¹ This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MEST) (2011-0027537).

Gil-Jin Jang is with the School of Electrical and Computer Engineering, Ulsan National Institute of Science and Technology, Ulsan, Korea (e-mail: [email protected]).

Chunghsi Pan is with the Dept. of Computer Science and Engineering, Sogang University, Seoul, Korea (e-mail: [email protected]).

Jae-Hyun Park is with the Dept. of Computer Science and Engineering, Sogang University, Seoul, Korea (e-mail: [email protected]).

Jeong-sik Park is with the Dept. of Intelligent Robot Engineering, Mokwon University, Daejon, Korea (e-mail: [email protected]).

Ji-Hwan Kim is with the Dept. of Computer Science and Engineering, Sogang University, Seoul, Korea (e-mail: [email protected]).

A Chinese character is a pictograph expressing a conceptual idea. There are about 2,500 commonly used characters in modern Chinese. Pinyin is the official system for transcribing Chinese characters into the Latin script; it is often used to teach Standard Chinese and as an input method for entering Chinese characters into computers. For example, consider the word ‘Beijing’. It consists of two Chinese characters: the first is pronounced ‘bei’ (meaning north) and the second ‘jing’ (meaning capital).

All Chinese characters are monosyllabic, with four tones. In contrast, English words are mostly polysyllabic. Each Chinese monosyllable can be spelled with exactly one initial followed by one final, except for the special syllable ‘er’ and the cases where a trailing ‘-r’ is considered part of a syllable. In the previous example, ‘Beijing’, ‘b’ is the initial and ‘ei’ the final of ‘bei’, and ‘j’ is the initial and ‘ing’ the final of ‘jing’.

Unlike in European languages, initials are mostly single consonants, whereas finals are not always single vowels; a compound final contains a medial. In general, the phoneme set of a European-language speech recognizer consists of consonants and vowels, and the phoneme or triphone is used as the base recognition unit. Chinese speech recognizers, however, normally use the initial/final as the recognition unit. Therefore, the best phoneme set and base recognition unit should be determined by performance comparisons in terms of speech recognition rate and memory requirement.

This paper aims to determine a proper recognition unit for implementing a memory-efficient command-understanding system. The paper consists of five sections. Section II introduces previous work on Chinese word recognition. Section III describes five recognition units. Section IV presents the experimental setup and results. Section V concludes the paper.

II. PREVIOUS WORK

Much research on Chinese word recognition is in progress. However, much of it uses the initial/final as the base recognition unit without a detailed performance and memory usage analysis of alternative recognition units.




Generally, Chinese word recognition systems using the initial/final as a recognition unit show a little over 90% top-1 recognition for a roughly 10K-word vocabulary and slightly less than 95% for a 5K-word vocabulary. Word recognizers showed a 93% recognition rate for a 5K-word vocabulary and a 90% recognition rate for a 10K-word vocabulary [2]. The word recognition rate for a 10K-word vocabulary was measured as 91.4% (top-1) and 99.2% (top-10) [3].

The above studies provide no performance analysis with respect to the memory usage of their speech recognizers. Implementations of speech recognition systems in embedded environments have also been presented. Feature masking was applied to reduce computational complexity and memory consumption in an embedded system [4]; the speech recognition rate was reported as 87.88% for a vocabulary composed of 64 pairs of tone-confusable names, but no specific details regarding memory usage or recognition units were given. A multi-lingual system for isolated word recognition, whose base recognition unit was the monophone, was also presented [5]; its recognition rate was reported as 94.36% for a vocabulary of 100 Chinese names. However, as in the aforementioned studies, no performance analysis with respect to memory usage was described.

Performance comparison results were presented among four different recognition units [6]: context-independent models, intra-syllable context-dependent models, right context-dependent models and left context-dependent models.

For the context-independent models, an initial/final system consisting of 21 initial models and 38 final models was used. The intra-syllable context-dependent models reflected only the co-articulation effects between initials and finals: the 21 initials were expanded into 100 right-context-dependent initials by considering the co-articulation effects between the initials and the finals. As a result, the intra-syllable context-dependent models were composed of 100 right-context-dependent initial models and 38 final models.

The right-context-dependent models were inter-syllable diphone models. In these models, the 38 final models were further classified by considering the co-articulation effects from finals to the following initials, so right-context final models were added to the intra-syllable models. The initials of the following syllables were divided into 7 groups according to their manner of articulation. If the following syllable has a null initial, the right context-dependency from finals becomes a dependency from finals to finals; these finals with null initials were divided into 6 groups based on their first vowels. Consequently, these models were composed of 100 right-context-dependent initial models and 38×13 right-context-dependent final models.

In the case of the left-context-dependent models, left context-dependencies from initials were divided into 10 groups based on the last phoneme in the final of the preceding syllable. Left context-dependencies from finals were classified into 7 groups according to consonant initials, with one more group representing the left context-dependency to null initials. As a result, the left-context-dependent models consisted of 21×10 left-context-dependent initial models and 38×8 left-context-dependent final models.

Speech recognition rates were measured at 85.7%, 91.0%, 93.2% and 88.6% for the context-independent, intra-syllable context-dependent, right-context-dependent and left-context-dependent models, respectively [6]. These results showed that right context-dependency is stronger than left context-dependency. This research performed a systematic comparison of recognition units by speech recognition rate; however, no analysis of the memory efficiency of the units was presented, and the evaluation corpus consisted of only 500 words.

III. RECOGNITION UNITS

To determine the best recognition unit, five different units (syllable, phoneme, triphone, initial/final and extended initial/final) were evaluated in terms of speech recognition rate and memory usage. This section describes how we designed the units in detail.

A. Syllable

The linguistic features of Chinese are well represented by syllables, which show good performance in Chinese speech recognition [7]. Modern Chinese comprises more than 400 syllables, where each syllable has three structural constituents: initial, final, and tone. Each Chinese character can be represented by a single syllable, and each syllable can be further broken down according to its tone, in which case the number of syllables exceeds 1,400. This paper uses the GB2312 code [8]. According to our analysis, there are 6,763 Chinese characters in this code, from which 405 unique syllables can be identified if tone is not considered. Table I shows some examples of syllables present in GB2312. If each syllable were broken down according to its tone, the number of recognition units would increase by up to fivefold, but it is reported that the improvement in speech recognition rate remains minimal [9]. In this paper, only mono-syllables were compared with the other recognition units, because di-syllable or tri-syllable units would increase the number of recognition units dramatically.

TABLE I
EXAMPLES OF SYLLABLES IN GB2312 (TOTAL NUMBER OF SYLLABLES: 405; TONE IS NOT CONSIDERED)

Syllables: a ai an ang ao ba bai ban bang bao bei ben beng bi bian biao bie bin bing bo bu ca cai can

B. Phoneme

Phonemes are the most common recognition units in speech recognition. Phonemes normally consist of consonants and vowels. Because the phoneme is the basic element of a spoken language, the number of units is smaller than for the other unit types. We use the 36 phonemes presented in [10], in which ‘a’, ‘e’, ‘i’ and ‘o’ are subcategorized according to their left and right contexts.



TABLE II
EXPLANATION OF VOWEL PHONEMES WITH CONTEXT-DEPENDENCY (EXCERPTED FROM [10])

Subcategorized Phoneme   Explanation
aI     the phoneme ‘a’ in the context ‘ai’
a      the phoneme ‘a’ in other context conditions
Ie     the phoneme ‘e’ in the context ‘ie’
eI     the phoneme ‘e’ in the context ‘ei’
eN     the phoneme ‘e’ in the context ‘en’
e      the phoneme ‘e’ in other context conditions
Ci     the phoneme ‘i’ in the context ‘ci’, ‘si’, or ‘zi’
CHi    the phoneme ‘i’ in the context ‘chi’, ‘shi’, or ‘zhi’
Bi     the phoneme ‘i’ in other context conditions
oU     the phoneme ‘o’ in the context ‘ou’
o      the phoneme ‘o’ in other context conditions

TABLE III
SET OF PHONEMES (TOTAL NUMBER OF PHONEMES: 36)

Group            Phonemes
Consonant (22)   b p m f d t n l g k h j q x zh ch sh r z c s ng
Vowel (14)       a aI Ie eI eN e Ci CHi Bi oU o u v er

Table II explains these subcategorized phonemes. As a result, 22 consonants and 14 vowels are used in this paper. Table III presents the set of phonemes.
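To make these rules concrete, the following minimal sketch (our own illustration, not code from the paper) encodes the subcategorization of Table II directly; the `context` argument is assumed to be the pinyin substring in which the vowel occurs.

```python
# Minimal sketch (our illustration, not the paper's implementation) of the
# vowel subcategorization rules in Table II.

def subcategorize(vowel: str, context: str) -> str:
    """Map 'a', 'e', 'i', 'o' to a Table II label given its pinyin context."""
    if vowel == 'a':
        return 'aI' if context == 'ai' else 'a'
    if vowel == 'e':
        return {'ie': 'Ie', 'ei': 'eI', 'en': 'eN'}.get(context, 'e')
    if vowel == 'i':
        if context in ('ci', 'si', 'zi'):
            return 'Ci'
        if context in ('chi', 'shi', 'zhi'):
            return 'CHi'
        return 'Bi'
    if vowel == 'o':
        return 'oU' if context == 'ou' else 'o'
    return vowel  # other phonemes are not subcategorized

assert subcategorize('i', 'zhi') == 'CHi'
assert subcategorize('e', 'en') == 'eN'
```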

C. Triphone

Triphones were generated based on the phonemes described in Table III. For 37 phonemes (36 phonemes plus silence), about 50K (37³) possible triphones exist. When triphones are used as recognition units, similar triphone states must be clustered in order to reduce the memory requirements. Decision tree-based clustering is one of the standard approaches for state clustering [11]. A decision tree is constructed for each phone to cluster all of the corresponding states. All statistics for the different contexts of a phoneme are pooled at the root node. Binary decision trees are grown in a top-down fashion: a node is split by the question, drawn from a predefined question set, that best partitions the contexts at that node, and the goodness of a split is evaluated by a measure function. Tree growing stops when the gain from splitting a node no longer exceeds a stopping criterion.

Vowels and consonants are categorized into 39 groups according to their phonological characteristics [10]. The 14 vowel groups with their corresponding phonemes are listed in Table IV, and the 25 consonant groups in Table V. Our question set is generated according to context-dependency position (left, right, and center). For each position, 39 questions were generated according to the vowel/consonant groups, giving 117 (39+39+39) questions in total. Table VI shows examples of these questions. For example, ‘L_High’ means that the left context of a phone is in the ‘High’ group, which contains five vowels: Bi, Ci, CHi, u, and v. Therefore ‘L_High’ is represented as {Bi-*, Ci-*, CHi-*, u-*, v-*}.

TABLE IV
VOWEL GROUPS WITH CORRESPONDING PHONEMES (EXCERPTED FROM [10])

Vowel Group (14)   Corresponding Phonemes
High               Bi Ci CHi u v
Medium             e Ie eI eN er o oU
Low                a aI
Top Vowel          a aI e Ie eI
Front              aI a Bi v Ie
End                a e eI eN u o oU
Unrounded          a aI Bi e Ie eI eN
Rounded            u v o oU
Apical Vowel       Ci CHi
Evowel             e Ie eI eN
Evowel 2           e eI eN
Ivowel             Bi Ci CHi
Ovowel             o oU
u and v            u v

TABLE V
CONSONANT GROUPS WITH CORRESPONDING PHONEMES (EXCERPTED FROM [10])

Consonant Group (25)    Corresponding Phonemes
Stop                    b d g p t k
Aspirated Stop          b d g
Unaspirated Stop        p t k
Affricate               z zh j c ch q
Aspirated Affricate     z zh j
Unaspirated Affricate   c ch q
Fricative               f s sh x h r
Fricative 2             f s sh x h r k
Voiceless Fricative     f s sh h
Voiced Fricative        r k
Nasal                   m n ng
Nasal 2                 m n l
Nasal 3                 m n l ng
Labial                  b p m
Labial 2                b p m f
Apical                  z c s d t n l zh ch sh r
Apical Front            z c s
Apical 1                d t n l
Apical 2                d t
Apical 3                n l
Apical End              zh ch sh t
Apical End 2            zh ch sh
Tongue Top              j q x
Tongue Root             g k h ng
Tongue Root 2           g k h

TABLE VI
EXAMPLE OF QUESTION SETS

QS "L_High" {Bi-*, Ci-*, CHi-*, u-*, v-*}
QS "R_High" {*+Bi, *+Ci, *+CHi, *+u, *+v}
QS "C_High" {*-Bi+*, *-Ci+*, *-CHi+*, *-u+*, *-v+*, *-Bi, *-Ci, *-CHi, *-u, *-v, Bi+*, Ci+*, CHi+*, u+*, v+*, Bi, Ci, CHi, v, u}



The number of clustered triphone states depends on the size of the decision tree. As the size of the decision tree increases, the number of clustered triphone states increases. The size of the decision tree is determined by two parameters: the minimum number of state occupations and the threshold for increase in likelihood after splitting. These two parameters are determined according to memory requirements which are proportional to the number of clustered states.
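The following sketch shows this greedy top-down splitting loop under our own simplifying assumptions: `loglike` stands in for the measure function and `occupancy` for the state-occupation count, both of which would be computed from trained model statistics, and the two thresholds correspond to the parameters named above.

```python
# Sketch (our illustration) of decision tree-based state clustering
# (Section III.C). `loglike(states)` plays the role of the measure function
# and `occupancy(states)` the state-occupation count; both are placeholders
# that would be computed from trained model statistics [11].

def grow_tree(states, questions, loglike, occupancy,
              min_occ=100.0, min_gain=350.0):
    """Recursively split a pool of triphone states into clustered leaves."""
    base = loglike(states)
    best = None
    for q in questions:                    # q: predicate over a state's context
        yes = [s for s in states if q(s)]
        no  = [s for s in states if not q(s)]
        if not yes or not no:
            continue
        if occupancy(yes) < min_occ or occupancy(no) < min_occ:
            continue                       # minimum state-occupation parameter
        gain = loglike(yes) + loglike(no) - base
        if best is None or gain > best[0]:
            best = (gain, yes, no)
    if best is None or best[0] < min_gain: # likelihood-increase threshold
        return [states]                    # stop: this node is one cluster
    _, yes, no = best
    return (grow_tree(yes, questions, loglike, occupancy, min_occ, min_gain) +
            grow_tree(no,  questions, loglike, occupancy, min_occ, min_gain))
```

Raising either threshold yields fewer clustered states and hence a smaller model, which is how the memory budgets in Section IV are met.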

D. Initial/final

Much previous research has been based on initials/finals, since the initial/final reflects the linguistic features of Chinese well. The sets of initials/finals found in these studies are almost identical. We adopted the set of initials/finals presented in [7]. In that paper, the final ‘i’ is divided into three finals: ‘i0’, ‘i1’ and ‘i2’. ‘i1’ represents a final ‘i’ preceded by the initial ‘zh’, ‘ch’ or ‘sh’; ‘i2’ corresponds to a final ‘i’ preceded by ‘z’, ‘c’ or ‘s’; ‘i0’ represents a final ‘i’ with any other preceding initial. In this research, to reduce the number of recognition units, ‘i1’ and ‘i2’ were merged into a single final, ‘I’. Table VII presents the 21 initials and 37 finals.

TABLE VII
INITIALS AND FINALS (TOTAL NUMBER OF INITIALS/FINALS: 58)

Group        Initials/finals
Initial (21) b p m f d t n l g k h j q x zh ch sh r z c s
Final (37)   a ai ia ua ve o ei ie uo van e ao iao uai vn er ou in uei uen iou I i0 ian uan en iong an ang iang uang u eng ing ueng v ong

TABLE VIII
PROPOSED EXTENDED INITIALS AND FINALS (TOTAL NUMBER OF EXTENDED INITIALS/FINALS: 64)

Group        Extended initials/finals
Initial (27) b p m f d t n l g k h j q x zh ch sh r z c s _a _o _e _i _u _v
Final (37)   a ai ia ua ve o ei ie uo van e ao iao uai vn er ou in uei uen iou I i0 ian uan en iong an ang iang uang u eng ing ueng v ong

E. Extended Initial/final

A syllable normally consists of one initial and one final, but some syllables consist of a final only, such as ‘a’, ‘ai’, ‘an’ and ‘ang’. To distinguish finals whose left context is an initial from those whose left context is a null initial, the starting vowels of the 37 finals listed in Table VII are categorized into 6 vowels (‘a’, ‘o’, ‘e’, ‘i’, ‘u’ and ‘v’).

These 6 vowels are defined as extended initials, denoted ‘_a’, ‘_o’, ‘_e’, ‘_i’, ‘_u’ and ‘_v’. As a result, our proposed extended initials/finals comprise 27 initials and 37 finals, where a syllable consisting of only a final is written by combining the corresponding extended initial with the final. For example, the syllable consisting of only the final ‘uang’ is denoted as ‘_u’ plus ‘uang’. Table VIII illustrates the set of our proposed extended initials and finals.
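To illustrate the decomposition, here is a minimal sketch under stated assumptions: `INITIALS` follows Table VII, longest initials are matched first, the final ‘i’ handling follows the i0/I merge of Section III.D, and final-only syllables receive the extended initial derived from their first vowel. It is our own illustration, not the paper's code.

```python
# Sketch (our illustration) of the proposed extended initial/final
# decomposition. INITIALS follows Table VII; the longest-match split and
# the i0/I handling (Section III.D) are our assumptions.

INITIALS = ['zh', 'ch', 'sh',                      # two-letter initials first
            'b', 'p', 'm', 'f', 'd', 't', 'n', 'l',
            'g', 'k', 'h', 'j', 'q', 'x', 'r', 'z', 'c', 's']

def to_units(syllable: str) -> tuple:
    """Split one toneless pinyin syllable into (extended) initial and final."""
    for ini in INITIALS:
        if syllable.startswith(ini) and len(syllable) > len(ini):
            final = syllable[len(ini):]
            if final == 'i':                       # Section III.D merge
                final = 'I' if ini in ('zh', 'ch', 'sh', 'z', 'c', 's') else 'i0'
            return ini, final
    # Final-only syllable (including the special 'er'): use an extended initial.
    return '_' + syllable[0], syllable

print(to_units('jing'))   # ('j', 'ing')
print(to_units('uang'))   # ('_u', 'uang')  -> extended initial '_u'
print(to_units('shi'))    # ('sh', 'I')
```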

IV. EXPERIMENTS

Speech Information Technology & Industry Promotion Center (www.sitec.or.kr), the representative organization for creating and distributing linguistic resources, released four Chinese corpora, named Chinese01, Chinese02, Chinese03 and Chinese04. The spoken words in Chinese01, Chinese02 and Chinese03 were used as the acoustic model training corpora, and the spoken pinyin utterances in Chinese01 were used to estimate the initial parameter values of the acoustic models. Chinese04 was used as the test corpus. Table IX summarizes these corpora. HTK (Hidden Markov Model Toolkit) [12], one of the well-known software toolkits for speech recognition, was used for acoustic model training and recognition testing under a Linux environment.

TABLE IX
CORPUS DESCRIPTION

Corpus      Description of Used Data
Chinese01   No. of speakers: 300 (men: 150, women: 150); 100-105 pinyins and 60 words per speaker
Chinese02   No. of speakers: 300 (men: 150, women: 150); 150 words per speaker
Chinese03   No. of speakers: 300 (men: 150, women: 150); 150 words per speaker
Chinese04   No. of speakers: 100 (men: 50, women: 50); 120-150 Chinese persons’ names per speaker

A. Experimental Results

This section presents the recognition rates and memory usage for all of the recognition units described in Section III, according to the number of mixtures (for clustered triphones, the number of clustered states) and the size of the vocabulary.

A word list was built by extracting the unique words from all of the Chinese persons’ names in Chinese04, in which each name is pronounced twice: once by a male speaker and once by a female speaker. After this duplication is removed, the list contains about 6K words.

Two vocabularies, of 200 and 1K words respectively, were generated by random selection from the above 6K words. When the vocabulary size exceeded 6K, as for the 10K and 50K vocabularies, the vocabulary included all of the words in the 6K list, and the remaining words were randomly generated as 5-syllable words using syllables defined in the GB2312 code and their representative Chinese characters.
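This construction can be sketched as follows; it is our own illustration, and `base_words` and `syllables` are hypothetical placeholders for the 6K name list and the 405 GB2312 syllables of Table I.

```python
# Sketch (our illustration) of the test-vocabulary construction in Section IV.
# `base_words` stands for the ~6K unique names from Chinese04 and `syllables`
# for the 405 GB2312 syllables of Table I (both placeholders).

import random

def build_vocab(size: int, base_words: list, syllables: list, seed: int = 0):
    rng = random.Random(seed)
    if size <= len(base_words):
        # 200- and 1K-word vocabularies: random selection from the 6K list.
        return rng.sample(base_words, size)
    # 10K- and 50K-word vocabularies: include all 6K words, then pad with
    # randomly generated 5-syllable words.
    vocab = list(base_words)
    while len(vocab) < size:
        vocab.append(''.join(rng.choice(syllables) for _ in range(5)))
    return vocab
```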

Table X presents the recognition rates and memory usages according to the size of vocabulary and the number of mixtures for syllable-based recognition units. For the vocabulary with 50K words, the recognition rates were measured as 56.94%, 77.12% and 91.94% when the memory requirement was limited to around 500KB, 1MB and 5MB, respectively. The recognition rates of 56.94% and 77.12% seem far below the level required for commercial service.



TABLE X
EVALUATION RESULTS FOR SYLLABLE-BASED RECOGNITION UNITS

No. of     Size of Vocabulary                     Memory
Mixtures   200       1K        10K       50K      Usage
1          99.50%    88.90%    68.86%    56.94%   441KB
2          98.00%    93.70%    82.20%    77.12%   852KB
4          99.50%    96.70%    89.38%    87.02%   1.7MB
8          99.00%    97.50%    91.88%    90.58%   3.2MB
16         100%      98.10%    92.70%    91.94%   6.3MB
32         100%      98.00%    93.22%    92.46%   13MB
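As a rough plausibility check (our own back-of-envelope estimate, not the paper's accounting), the memory figures in Table X are consistent with storing diagonal-covariance Gaussians for every state, assuming 3 emitting states per syllable model and 39-dimensional features (both assumptions); each mixture then stores a mean and a variance vector.

```python
# Back-of-envelope estimate (our assumption, not the paper's accounting):
# diagonal-covariance Gaussian mixtures with 39-dimensional features and
# 3 emitting states per syllable model. Mixture weights, transition
# matrices and toolkit overhead account for the remaining difference.

def model_bytes(n_units, n_mix, n_states=3, dim=39, bytes_per_float=4):
    floats_per_mix = 2 * dim                  # mean + variance vectors
    return n_units * n_states * n_mix * floats_per_mix * bytes_per_float

print(model_bytes(405, 1) / 1024)             # ~370 KB vs. 441 KB in Table X
print(model_bytes(405, 16) / 2 ** 20)         # ~5.8 MB vs. 6.3 MB in Table X
```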

Table XI describes the performance analysis for phoneme-based recognition units. For the vocabulary with 50K words, the recognition rates were evaluated as 90.48%, 92.22% and 94.04% when the memory requirement was restricted to around 500KB, 1MB and 5MB, respectively.

TABLE XI
EVALUATION RESULTS FOR PHONEME-BASED RECOGNITION UNITS

No. of     Size of Vocabulary                     Memory
Mixtures   200       1K        10K       50K      Usage
1          88.00%    78.50%    61.46%    61.14%   41KB
2          95.00%    89.90%    74.92%    74.82%   78KB
4          98.00%    93.50%    82.84%    82.82%   151KB
8          98.50%    95.20%    87.32%    87.26%   295KB
16         99.00%    96.60%    90.52%    90.48%   585KB
32         99.50%    97.20%    92.22%    92.22%   1.2MB
64         100%      97.60%    93.26%    93.28%   2.3MB
128        100%      98.40%    94.04%    94.04%   4.6MB

Table XII presents the recognition rates and memory usage for one and two mixtures according to the size of the vocabulary for triphone-based recognition units. Even with one mixture, memory usage exceeded 5MB; at this memory requirement, the recognition rate for the 50K-word vocabulary was measured as 82.46%.

TABLE XII
EVALUATION RESULTS FOR TRIPHONE-BASED RECOGNITION UNITS

No. of     Size of Vocabulary                     Memory
Mixtures   200       1K        10K       50K      Usage
1          96.50%    90.40%    82.40%    82.46%   5.0MB
2          95.50%    91.80%    84.56%    84.66%   9.6MB

Table XIII shows the evaluation results according to the size of the vocabulary and the number of clustered states for clustered triphone-based recognition units. In this evaluation, the number of mixtures per state was fixed at one. When the total number of clustered states is fixed, the corresponding decision tree is determined by the two parameters explained in Section III.C: the minimum number of state occupations and the threshold for the increase in likelihood after splitting. The recognition rate peaks when the number of clustered states is around 6,000. The best number depends on the amount of training data: more training data leads to a larger optimal number of clustered states. For the 50K-word vocabulary, the maximum recognition rate was measured as 87.48%, which is lower than those of the other recognition units while consuming more memory.

Table XIV shows the evaluation results for initial/final-based recognition units. For the vocabulary with 50K words, the recognition rates were evaluated as 91.36%, 93% and 94.76% when the memory restriction is around 500KB, 1MB and 5MB, respectively.

Table XV illustrates the experimental results for our proposed extended initial/final-based recognition units. For the vocabulary with 50K words, the recognition rates were measured as 93.74%, 94.76% and 95.42% when the memory limitation is around 500KB, 1MB and 5MB, respectively.

TABLE XIII
EVALUATION RESULTS FOR CLUSTERED TRIPHONE-BASED RECOGNITION UNITS

No. of     Size of Vocabulary                     Memory
States     200       1K        10K       50K      Usage
1,000      97.50%    93.40%    84.92%    84.76%   1.2MB
2,000      97.50%    94.90%    86.40%    86.26%   1.5MB
3,000      98.00%    95.40%    87.12%    87.00%   1.8MB
4,000      97.50%    95.40%    87.54%    87.48%   2.2MB
5,000      98.00%    95.70%    87.50%    87.40%   2.5MB
6,000      98.00%    95.90%    87.70%    87.48%   2.8MB
7,000      97.50%    95.60%    87.62%    87.40%   3.2MB
8,000      97.50%    95.40%    87.52%    87.34%   3.5MB
9,000      98.00%    95.00%    87.32%    87.22%   3.8MB
10,000     98.00%    94.90%    87.06%    86.84%   4.1MB

TABLE XIV
EVALUATION RESULTS FOR INITIAL/FINAL-BASED RECOGNITION UNITS

No. of     Size of Vocabulary                     Memory
Mixtures   200       1K        10K       50K      Usage
1          94.00%    81.00%    73.90%    72.76%   64KB
2          98.00%    91.70%    81.52%    81.00%   122KB
4          98.50%    96.20%    88.08%    87.76%   236KB
8          99.00%    97.60%    91.52%    91.36%   463KB
16         99.00%    97.80%    93.10%    93.00%   917KB
32         100%      98.20%    93.98%    93.96%   1.8MB
64         100%      98.70%    94.80%    94.76%   3.6MB
128        100%      98.50%    95.06%    95.00%   7.2MB

TABLE XV
EVALUATION RESULTS FOR EXTENDED INITIAL/FINAL-BASED RECOGNITION UNITS

No. of     Size of Vocabulary                     Memory
Mixtures   200       1K        10K       50K      Usage
1          95.50%    92.20%    80.94%    80.12%   70KB
2          98.00%    94.50%    85.92%    85.62%   135KB
4          99.50%    97.30%    91.24%    91.16%   260KB
8          99.50%    97.70%    93.84%    93.74%   511KB
16         100%      98.10%    94.82%    94.76%   1.0MB
32         100%      98.30%    95.20%    95.20%   2.0MB
64         100%      98.50%    95.44%    95.42%   4.0MB
128        100%      98.40%    95.96%    95.96%   7.9MB

Table XVI summarizes the best recognition rates for 50K-word vocabularies achieved by all recognition units described in this paper under the approximate memory requirements of 500KB, 1MB and 5MB. In all cases, our proposed extended initials/finals outperform the other recognition units.



TABLE XVI
SUMMARY OF THE BEST RECOGNITION RATES FOR 50K-WORD VOCABULARIES ACCORDING TO RECOGNITION UNIT AND APPROXIMATE MEMORY REQUIREMENT

                         Approximate Memory Requirement
Recognition Unit         500KB     1MB       5MB
Syllable                 56.94%    77.12%    91.94%
Phoneme                  90.48%    92.22%    94.04%
Triphone                 N/A       N/A       82.46%
Clustered Triphone       N/A       84.76%    87.48%
Initial/final            91.36%    93.00%    94.76%
Extended Initial/final   93.74%    94.76%    95.42%

V. CONCLUSION

This paper proposes extended initials/finals as recognition units in order to implement a small footprint, large vocabulary Chinese word recognition system for commanding an embedded device. The performance of the proposed extended initials/finals is compared with that of other recognition units, such as syllables, phonemes, triphones and initials/finals.

A recognition unit with our extended initial/final has shown the best performance in a fixed-memory usage condition. Our extended initial/final-based recognition unit shows 93.74%, 94.76% and 95.42% accuracy for 50K-word recognizers under memory restrictions of around 500KB, 1MB and 5MB, respectively.

REFERENCES

[1] S. Han, J. Hong, S. Jeong, and M. Hahn, “Robust GSC-based Speech Enhancement for Human Machine Interface,” IEEE Trans. on Consumer Electronics, vol. 56, no. 2, pp. 965-970, 2010.

[2] H. Hon, B. Yuan, and Y. Chow, “Towards Large Vocabulary Mandarin Chinese Speech Recognition,” Proc. International Conference on Acoustics, Speech, and Signal Processing, pp. 545-548, 1994.

[3] B. Xu, T. Huang, Z. Lin, D. Xu, and B. Ma, “A General Chinese Acoustic/Phonetic Decoder for Syllable, Word and Continuous Speech Recognition,” Proc. International Symposium on Speech, Image Processing and Neural Networks, pp. 706-709, 1994.

[4] Y. Tang, X. Wang, Y. Cao, and F. Ding, “Feature Masking in an Embedded Mandarin Speech Recognition System,” Proc. International Symposium on Chinese Spoken Language Processing, pp. 245-248, 2004.

[5] X. Wang, Y. Cao, F. Ding, and Y. Tang, “An Embedded Multilingual Speech Recognition System for Mandarin, Cantonese, and English,” Proc. International Conference on Natural Language Processing and Knowledge Engineering, pp. 465-468, 2003.

[6] B. Ma, T. Huang, B. Xu, X. Zhang, and F. Qu, “Context-Dependent Acoustic Models for Chinese Speech Recognition,” Proc. International Conference on Acoustics, Speech, and Signal Processing, pp. 455-458, 1996.

[7] F. Zheng, Z. Song, P. Fung and W. Byrne, “Mandarin Pronunciation Modeling Based on the CASS Corpus,” Journal of Computer Science and Technology, vol. 17, no. 3, pp. 249-263, 2002.

[8] S. Li and K. Momoi, “A Composite Approach to Language/Encoding Detection,” Proc. International Unicode Conference, pp.1-14, 2001.

[9] L. Lee, C. Tseng, H. Gu, F. Liu, C. Chang, C. Chang, Y. Lin, Y. Lee, S. Tu, S. Hsieh, and C. Chen, “Golden Mandarin (I) - A Real Time Mandarin Speech Dictation Machine for Chinese Language with Very Large Vocabulary,” IEEE Trans. on Speech and Audio Processing, vol. 1, no. 2, pp. 158-179, 1993.

[10] B. Ma and Q. Huo, “Benchmark Results of Triphone-Based Acoustic Modeling on HKU96 and HKU99 Putonghua Corpora,” Proc. International Symposium on Chinese Spoken Language Processing, pp. 359-362, 2000.

[11] S. Young, J. Odell, and P. Woodland, “Tree-Based State Tying for High Accuracy Acoustic Modelling,” Proc. ARPA Human Language Technology Workshop, pp. 307-312, 1994.

[12] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK Book (for HTK Version 3.4), Cambridge University, 2006.

BIOGRAPHIES

Gil-Jin Jang (M’10) is an assistant professor at Ulsan National Institute of Science and Technology (UNIST), South Korea. He received his B.S., M.S., and Ph.D. degrees in computer science from the Korea Advanced Institute of Science and Technology (KAIST), Daejon, South Korea in 1997, 1999, and 2004, respectively. From 2004 to 2006 he was a research staff member at Samsung Advanced Institute of Technology, and from 2006 to 2007 he worked as a research engineer at Softmax, Inc. in San Diego. From 2008 to 2009 he was a postdoctoral employee at the Hamilton Glaucoma Center, University of California, San Diego. His research interests include acoustic signal processing, pattern recognition, speech recognition and enhancement, and biomedical signal engineering.

Chunghsi Pan received his B.E. and M.E. degree in Computer Science and Engineering from Sogang University in 2008 and 2011 respectively. He is currently working as an engineer at NCsoft, Inc. in Seoul, Korea. His research interests include spoken multimedia content search and embedded speech recognition.

Jae-Hyun Park received his B.E. degree in Computer Science and Engineering from Sogang University in 2011. He is currently pursuing an M.E. degree in Computer Science and Engineering at Sogang University. His research interests include speech recognition and spoken multimedia content search.

Jeong-sik Park received his B.E. degree in Computer Science from Ajou University, South Korea in 2001 and his M.E. and Ph.D. degrees in Computer Science from KAIST in 2003 and 2010, respectively. From 2010 to 2011, he was a Post-Doc. researcher in the Computer Science Department, KAIST (Korea Advanced Institute of Science and Technology). He is now an assistant professor in the Department of Intelligent Robot Engineering, Mokwon University. His research interests include speech emotion recognition, speech recognition, speech enhancement, and voice interfaces for human-computer interaction.

Ji-Hwan Kim (M’09) received the B.E. and M.E. degrees in Computer Science from KAIST (Korea Advanced Institute of Science and Technology) in 1996 and 1998 respectively, and the Ph.D. degree in Engineering from the University of Cambridge in 2001. From 2001 to 2007, he was a chief research engineer and senior research engineer at the LG Electronics Institute of Technology, where he was engaged in the development of speech recognizers for mobile devices. In 2005, he was a visiting scientist at the MIT Media Lab. Since 2007, he has been a faculty member in the Department of Computer Science and Engineering, Sogang University, where he is currently an associate professor. His research interests include spoken multimedia content search, speech recognition for embedded systems, and dialogue understanding.