
Contributed Paper Manuscript received October 15, 2009 0098 3063/09/$20.00 © 2009 IEEE

A Voice Trigger System using Keyword and Speaker Recognition for Mobile Devices

Hyeopwoo Lee, Sukmoon Chang, Member, IEEE, and Dongsuk Yook, Member, IEEE, Yongserk Kim

Abstract — Voice activity detection plays an important role in an efficient voice interface between humans and mobile devices, since it can be used as a trigger to activate the automatic speech recognition module of a mobile device. If the input speech signal can be recognized as a predefined magic word spoken by a legitimate user, it can be utilized as a trigger. In this paper, we propose a voice trigger system using a keyword-dependent speaker recognition technique. To properly trigger a mobile device with low computational power, the voice trigger must perform keyword recognition, as well as speaker recognition, without using a computationally demanding speech recognizer. We propose a template based method and a hidden Markov model (HMM) based method for the voice trigger. Experiments on a Korean word corpus show that the template based method runs 4.1 times faster than the HMM based method, while the HMM based method reduces the recognition error by a relative 27.8% compared to the template based method. The proposed methods are complementary and can be used selectively depending on the device of interest.1

Index Terms — Voice trigger, keyword recognition, speaker recognition, dynamic time warping, vector quantization, Gaussian mixture model, hidden Markov model

I. INTRODUCTION

The burgeoning of handheld mobile devices and home appliances in recent decades has provided us with remarkable convenience in various daily activities and communications. However, using these devices requires a high level of user attention, as well as dexterity, to activate and operate them. For example, many communication devices are equipped with multiple buttons and/or touch sensitive screens. A user must manipulate the small buttons or touch sensitive screens to activate and operate such devices. While seemingly trivial, this method is often difficult to

1 This work was supported by the Korea Research Foundation (KRF) grant funded by the Korea government (MEST) (No. 2009-0077392). It was also supported by the MKE (The Ministry of Knowledge Economy), Korea, under the ITRC (Information Technology Research Center) support program supervised by the NIPA (National IT Industry Promotion Agency) (NIPA-2009-C1090-0902-0007).

Hyeopwoo Lee and Dongsuk Yook (corresponding author) are with the Speech Information Processing Laboratory, Department of Computer and Communication Engineering, Korea University, Seoul, 136-701, Republic of Korea (e-mail: [email protected] and [email protected]). They would like to thank Samsung Electronics for their cooperation. Sukmoon Chang is with Pennsylvania State University, Middletown, PA 17057, USA (e-mail: [email protected]). Yongserk Kim is with Acoustic Technology Center, Samsung Electronics Co., LTD., Suwon, 443-742, Republic of Korea (e-mail: [email protected]).

perform, requiring the user’s careful attention, especially when the user is simultaneously carrying out another activity, such as driving a car. Moreover, the dexterity required to manipulate such user interfaces prevents users who may have little or no motor control capability from unassisted use of the devices [1][2].

This issue has been alleviated to some degree by the use of speech recognition systems for hands-free activation and operation of the devices. When a voice signal is detected, the speech recognizer is triggered to process the signal. The detection of voice signals can be performed by traditional voice activity detection (VAD) methods [3]. This approach, however, raises several concerns. The devices are expected to be used in real world environments where much of the speech around them is not directed to them. Although effective for isolating voice signals from noise, traditional VAD methods cannot differentiate the voice of a legitimate user from that of others. This causes the speech recognizer to be activated frequently to perform unnecessary tasks [4]. Such frequent activation is undesirable on small mobile devices with a limited power supply; the system should be activated only when a voice signal from a legitimate user is detected [5]. Furthermore, due to their computational cost, full-fledged speech recognizers are unsuited to devices with limited computing power. In summary, to work effectively on a device with a limited power supply and computing power, the voice trigger system must have a small computational cost and must detect only the keywords uttered by legitimate users, without a fully featured speech recognizer.

The voice trigger system can be viewed as a keyword-dependent speaker verification problem, as shown in Fig. 1. To properly trigger a mobile device using voice signals, the system must recognize the registered keywords (magic words), as well as the speaker of the voice signals, from a short

Fig. 1. A voice trigger system for mobile devices. The magic word is the keyword registered by the authorized speaker.

H. Lee et al.: A Voice Trigger System using Keyword and Speaker Recognition for Mobile Devices 2377


utterance of about one second, without a full-fledged speech recognizer. When an unregistered keyword is uttered or the voice of an impostor is detected, the voice trigger system should reject the signal. Thus, a voice trigger system may consist of two components, i.e., keyword recognition and speaker recognition [6].

Hidden Markov models (HMM) have been widely used to register words for keyword recognition [7][8]. However, to register keywords using an HMM, the system must be provided with the voice signals of the keywords along with their labels. That is, the user must not only speak the keywords but also enter them using an input device, such as a keyboard, preventing the voice trigger system from being a truly hands-free system.

Methods based on the Gaussian mixture model-universal background model (GMM-UBM) [9], as well as support vector machines [10], have been widely used for speaker recognition. Although these methods performed well in the NIST (National Institute of Standards and Technology) speaker recognition evaluation [11], they have relatively high computational costs and are text-independent. Vector quantization based methods have been developed to reduce the computational cost of the speaker recognition task [12][13]. These methods explicitly create the speaker model as a codebook obtained through vector quantization. They achieve relatively good performance at low computational cost. However, since they lack a background model, and thus a normalization process, their performance degrades rapidly when used in a condition different from that of the training data collection, e.g., with different microphones and environments.

These approaches cannot be reliably used as a voice trigger in small devices, due to the high computational cost of GMM-UBM based methods and the performance degradation of vector quantization based methods. In this paper, we propose two new methods that address these issues: a template based method and an HMM based method. The template based method overcomes the performance degradation of the vector quantization method in mismatched environments by adding a background model built with vector quantization, enabling the normalization of the voice and noise signals. The HMM based method adapts the GMM-UBM approach to small devices with limited computational power. In the proposed HMM based method, registration requires only the voice signals of the keywords, without the labels of the uttered keywords. The proposed voice trigger system consists of two steps, as shown in Fig. 2. The keyword recognition step requires two acoustic models: the keyword model and the garbage model. The speaker recognition step requires the speaker model and the background model. We introduce the proposed method in two phases, i.e., the registration phase and the verification phase, since the models are generated during the keyword and user registration process and are used for the keyword and user verification.

The remainder of this paper is organized as follows. The template based method and the HMM based method are introduced in Sections II and III, respectively. Section IV gives experimental results. Section V concludes the paper.

II. TEMPLATE BASED METHOD

The template based method is a simple approach to keyword and speaker recognition. The well-known pattern matching algorithm called dynamic time warping (DTW) [14] may be used as a voice trigger. The DTW algorithm does not require any specific knowledge other than the feature vectors of the registration speech data. It measures the distance between the input and the registered data and shows relatively good performance when used in the same condition as the one under which the registration data were collected. However, the performance of the DTW method degrades when the testing condition changes, for example, when different types of microphones are used in different environments. The main cause of this degradation is the lack of models that can be used to normalize voice data and noise. To overcome this weakness, the proposed method generates codebook-based acoustic models using vector quantization for the normalization.
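As an illustration, the DTW distance used by the template based method can be sketched as follows. This is a generic textbook implementation over feature-vector sequences, not the authors' code; the local Euclidean frame distance is an assumption.

```python
import numpy as np

def dtw_distance(x, y):
    """Dynamic time warping between two feature sequences.

    x: (T1, D) array, y: (T2, D) array. Returns the accumulated
    Euclidean distance along the optimal warping path (lower = more similar).
    """
    t1, t2 = len(x), len(y)
    # Local distances between every pair of frames.
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    acc = np.full((t1 + 1, t2 + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, t1 + 1):
        for j in range(1, t2 + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j],      # insertion
                acc[i, j - 1],      # deletion
                acc[i - 1, j - 1],  # match
            )
    return acc[t1, t2]
```

The warping step absorbs local timing differences, which is why DTW tolerates the same keyword spoken at slightly different speeds.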

A. Registration Phase

The four models in Fig. 2, generated during the keyword and user registration phase, are:

• Garbage model: The garbage model is generated as a codebook by the k-means clustering algorithm [15] using a

Fig. 2. Block diagram of a voice trigger system comprising keyword recognition and speaker recognition modules.

2378 IEEE Transactions on Consumer Electronics, Vol. 55, No. 4, NOVEMBER 2009


large amount of speech data in advance, as shown in Fig. 3 (top row). The garbage model represents all the words and describes the general acoustic space.

• Keyword model: The keyword model represents the registered keyword. The model is generated using the codebook of the garbage model and the registration data, as shown in Fig. 3 (bottom row). For each feature vector of the registration data, the best matching codeword is selected from the codebook and a new vector sequence is generated by replacing the feature vectors with the selected codewords.

• Speaker model: The speaker model represents the acoustic space of the registered speaker. The model uses the feature vectors of the registration data of the user without generating any specific model.

• Background model: The background model represents the voice data from all the speakers, other than the registered speaker. We assume that each codeword in the codebook roughly corresponds to a context-dependent sub-word unit. Under this assumption, since the keyword model generated above implies some speaker-independent property, the background model simply uses the keyword model.
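As a minimal illustration of this registration procedure, the codebook training (garbage model) and the codeword replacement (keyword model) can be sketched as follows. This is a NumPy sketch with illustrative settings, not the authors' implementation; the paper uses a 2048-codeword codebook trained on a large corpus.

```python
import numpy as np

def train_codebook(data, k, iters=20, seed=0):
    """Garbage model: train a k-codeword VQ codebook with k-means [15]."""
    rng = np.random.default_rng(seed)
    codebook = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign every training vector to its nearest codeword ...
        dists = np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=-1)
        assign = dists.argmin(axis=1)
        # ... and move each codeword to the mean of its assigned vectors.
        for j in range(k):
            if np.any(assign == j):
                codebook[j] = data[assign == j].mean(axis=0)
    return codebook

def quantize(frames, codebook):
    """Keyword model: replace each feature vector of the registration
    data with its best-matching codeword from the codebook."""
    dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=-1)
    return codebook[dists.argmin(axis=1)]
```

The quantized sequence returned by `quantize` is exactly the keyword model of Fig. 3: the registration frames with each vector replaced by its nearest codeword.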

B. Verification Phase

We first calculate the DTW score of the input voice data against the models (keyword and speaker) to verify the registered keyword and the user:

$S_m = \mathrm{DTW}(x_{1..T}, m)$ ,    (1)

where $x_{1..T}$ represents the input voice data and $m$ represents one of the acoustic models except the garbage model. The score of the garbage model is calculated as the sum, over the input feature vectors, of the distance between each vector and its minimum-distance codeword.

We apply a two-step procedure that performs the keyword and speaker recognition in sequence to determine whether the input voice data is the registered keyword spoken by the registered user:

$S_{\mathrm{keyword}} - S_{\mathrm{garbage}} > \theta_1$ ,    (2)

$S_{\mathrm{speaker}} - S_{\mathrm{background}} > \theta_2$ .    (3)

The input voice data is determined to be the registered keyword when the score difference between the keyword model and the garbage model exceeds a threshold $\theta_1$. Similarly, if the score difference between the speaker model and the background model exceeds a threshold $\theta_2$, the system determines the input voice data to be the voice of the registered speaker.

Another decision method is motivated by the fact that the background model simply uses the same model as the keyword model. Thus, by adding (2) and (3), we obtain a simpler one-step verification procedure:

$S_{\mathrm{speaker}} - S_{\mathrm{garbage}} > \theta_3$ ,    (4)

where $\theta_3 = \theta_1 + \theta_2$ is a threshold. Although the one-step procedure is faster than the two-step procedure, it can only be used when it is safe to assume that the model scores have similar distributions; otherwise, it will most likely fail. For example, if $S_{\mathrm{keyword}} - S_{\mathrm{garbage}}$ is relatively large but $S_{\mathrm{speaker}} - S_{\mathrm{background}}$ is not, it is possible that (4) is satisfied but (3) is not. As we will examine in more detail in Section IV, these two procedures may be used with consideration of the tradeoff between time and accuracy.
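The two decision procedures can be sketched as follows. The scores are assumed here to be similarity scores (higher = better match) and the threshold values are illustrative; with DTW distances, where lower is better, the sign convention would be reversed.

```python
def two_step_accept(s_keyword, s_garbage, s_speaker, s_background,
                    theta1, theta2):
    """Two-step decision: keyword check (2), then speaker check (3)."""
    keyword_ok = s_keyword - s_garbage > theta1      # rule (2)
    speaker_ok = s_speaker - s_background > theta2   # rule (3)
    return keyword_ok and speaker_ok

def one_step_accept(s_speaker, s_garbage, theta3):
    """One-step decision, rule (4), with theta3 = theta1 + theta2.
    Valid only when the background model equals the keyword model
    and the model scores are similarly distributed."""
    return s_speaker - s_garbage > theta3
```

With the background model equal to the keyword model, a large keyword margin can mask a small speaker margin, so the one-step rule may accept an input that the two-step rule rejects.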

III. HMM BASED METHOD

Although the template based method uses the four acoustic models to address the performance degradation issue of the conventional vector quantization method, the problem cannot be completely overcome. We propose a voice trigger system based on the HMM to obtain more reliable performance.

A. Registration Phase

The four acoustic models used in the HMM based method are generated as follows:

• Garbage model: The garbage model is represented using a GMM, a well-known approach in speaker recognition tasks:

$\sum_{k=1}^{M} w_k \, \mathcal{N}(x; \mu_k, \Sigma_k)$ ,    (5)

where $M$ is the number of Gaussian probability density functions (PDFs) in the mixture, $w_k$ is the weight of the $k$-th component Gaussian PDF, $x$ is the input feature vector, and $\mathcal{N}(x; \mu_k, \Sigma_k)$ represents a Gaussian PDF with mean vector $\mu_k$ and covariance matrix $\Sigma_k$. The GMM is trained in advance on a large amount of speech data using the expectation-maximization (EM) algorithm.
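A numerically stable evaluation of the garbage-model likelihood in (5) can be sketched as follows. Diagonal covariance matrices are assumed for simplicity (a common choice in speech systems; the paper does not state the covariance type):

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood log sum_k w_k N(x; mu_k, Sigma_k) of one feature
    vector under a diagonal-covariance GMM, computed in the log domain
    with the log-sum-exp trick to avoid underflow."""
    d = x.shape[0]
    diff = x - means                                           # (M, D)
    log_norm = -0.5 * (d * np.log(2.0 * np.pi)
                       + np.sum(np.log(variances), axis=1))    # (M,)
    log_gauss = log_norm - 0.5 * np.sum(diff**2 / variances, axis=1)
    a = np.log(weights) + log_gauss
    m = a.max()
    return m + np.log(np.exp(a - m).sum())
```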

Fig. 3. The garbage model (codebook) and the keyword model generation procedure in the template based method. The feature vectors of the keyword voice data are replaced by the best matching codewords in the codebook.


• Keyword model: We propose a pseudo phoneme keyword HMM generation algorithm that produces a speaker-independent keyword HMM, registering a keyword without any transcription. If the registration data consists of a single utterance, the algorithm works as follows (see also Fig. 4):

Step 1: For each input vector $x_t$ ($t = 1 \ldots T$), calculate the log likelihood $S_{k,t}$ with each Gaussian of the GMM in the garbage model:

$S_{k,t} = \log w_k \, \mathcal{N}(x_t; \mu_k, \Sigma_k)$ .    (6)

Step 2: Select the top $N$ Gaussian PDFs with the largest $S_{k,t}$ and build a Gaussian index table, $G_{g,t}$ ($g = 1 \ldots N$ and $t = 1 \ldots T$), which contains the top $N$ Gaussian PDF indices for each feature vector. Fig. 4 shows an example of a Gaussian index table with $N$ = 3 and $T$ = 6.

Step 3: Cluster the columns of the table based on the Gaussian with the largest $S_{k,t}$ in each column and the distance between adjacent columns (i.e., adjacent times). The Bhattacharyya distance, which measures the similarity of two Gaussian PDFs, can be used to cluster the adjacent columns of the Gaussian index table [16].

Step 4: For each cluster, select the top $N$ Gaussians with the largest $S_{k,t}$ from the cluster and assign them to a state of the keyword HMM. The corresponding Gaussian weights are also assigned to the state and normalized.

Step 5: The states generated in Step 4 are concatenated to form a left-to-right HMM with self-transitions.

If the registration data comprises more than one utterance, the pseudo phoneme keyword HMM generation method is modified as follows. For each utterance, we follow Steps 1 through 3 of the single utterance case to build a clustered Gaussian index table. Once the tables are built for all utterances, we find the median number of clusters amongst the tables. Then, by adjusting the clustering threshold in Step 3, we re-cluster the tables so that the number of clusters in each table equals this median. If a table fails to be re-clustered to the median number of clusters, we simply discard it. We then select the top $N$ Gaussians with the largest $S_{k,t}$ from the clusters that share the same cluster index across the tables and assign them to a state of the keyword HMM. Finally, the Gaussian weights and the transition probabilities are assigned in the same way as in Steps 4 and 5 of the single utterance case. Note that the clustering of the feature vectors is performed based on their time indices. That is, the pseudo phoneme keyword HMM generation algorithm incorporates the time information of the keyword voice data into the model. Assuming that each Gaussian in the original GMM roughly models a context-dependent sub-word unit, the keyword model generated in this way has some speaker-independent properties.
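Steps 1 through 3 above can be sketched as follows. For brevity, this illustration clusters adjacent columns by exact agreement of the best Gaussian index, whereas the paper clusters with a Bhattacharyya-distance threshold; the exact-match merge is a crude stand-in, not the authors' criterion.

```python
import numpy as np

def gaussian_index_table(frame_scores, n_top):
    """Steps 1-2: frame_scores[t, k] holds S_{k,t} (per-frame,
    per-Gaussian log likelihoods). Returns the top-N Gaussian
    indices per frame, best first, as an (N, T) table."""
    order = np.argsort(frame_scores, axis=1)[:, ::-1]  # descending
    return order[:, :n_top].T

def cluster_columns(table):
    """Step 3 (simplified): merge runs of adjacent columns whose
    best Gaussian index agrees. Each resulting cluster becomes one
    state of the left-to-right keyword HMM."""
    best = table[0]
    clusters, start = [], 0
    for t in range(1, len(best)):
        if best[t] != best[t - 1]:
            clusters.append((start, t))
            start = t
    clusters.append((start, len(best)))
    return clusters
```

Because clustering follows the time axis, each cluster covers a contiguous stretch of frames, which is what lets the resulting states act as pseudo phonemes.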

• Speaker model: As mentioned previously, the speaker model represents the acoustic characteristics of the registered speaker, whilst the keyword model is speaker-independent. We build the speaker model by adapting the keyword model using the registration data from the speaker. The adaptation is performed using the maximum a posteriori (MAP) based method [17]. We adapt only the means of the component Gaussians, since our experimental results show that good performance is achieved with mean adaptation alone.

• Background model: As in the template based method, the background model simply uses the keyword model, since we assume that the keyword model generated above contains some speaker-independent characteristics.
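Mean-only MAP adaptation of the keyword model into the speaker model can be sketched as follows. The relevance factor `tau` and the use of soft frame-to-Gaussian responsibilities follow the standard GMM-UBM recipe in the style of [17] and are assumptions here, not values taken from the paper.

```python
import numpy as np

def map_adapt_means(means, frames, responsibilities, tau=16.0):
    """MAP adaptation of Gaussian means only:

        mu_k_new = (n_k * xbar_k + tau * mu_k) / (n_k + tau),

    where n_k is the soft count of frames assigned to Gaussian k and
    xbar_k is the responsibility-weighted mean of the speaker's frames.
    Components with little data stay close to their prior means."""
    n_k = responsibilities.sum(axis=0)                    # (M,)
    weighted = responsibilities.T @ frames                # (M, D)
    xbar = weighted / np.maximum(n_k, 1e-10)[:, None]
    alpha = (n_k / (n_k + tau))[:, None]
    return alpha * xbar + (1.0 - alpha) * means
```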

B. Verification Phase

We first calculate the log likelihood $S_m$ to verify the registered keyword and user:

$S_m = \log p(x_{1..T}; m)$ ,    (7)

where $x_{1..T}$ is the input voice data and $m$ is one of the four acoustic models. We can calculate $S_m$ using the forward algorithm [18]. The verification procedures are the same as those in the template based case.
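The forward-algorithm computation of (7) can be sketched in the log domain as follows; the assumption that the HMM starts in its first state is a simplification for the left-to-right topology used here.

```python
import numpy as np

def forward_log_likelihood(log_trans, log_emit):
    """Forward algorithm in the log domain.

    log_trans: (S, S) matrix of log transition probabilities a_ij.
    log_emit:  (T, S) matrix of per-frame log emission scores b_j(x_t).
    Returns log p(x_{1..T}; model), assuming the HMM starts in state 0.
    """
    t_len = log_emit.shape[0]
    alpha = np.full(log_emit.shape[1], -np.inf)
    alpha[0] = log_emit[0, 0]
    for t in range(1, t_len):
        # alpha_j(t) = logsum_i [alpha_i(t-1) + log a_ij] + log b_j(x_t)
        alpha = (np.logaddexp.reduce(alpha[:, None] + log_trans, axis=0)
                 + log_emit[t])
    return np.logaddexp.reduce(alpha)
```

Working with `np.logaddexp` keeps the recursion stable even when some states are unreachable (their log probabilities stay at minus infinity without producing NaNs).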

Fig. 4. Pseudo phoneme keyword HMM and speaker HMM generation procedure. In the Gaussian index table in Step 3, the columns with the same color represent one cluster.


IV. EXPERIMENTS

We performed experiments using a Korean word corpus to evaluate the performance of the proposed methods. The corpus consists of six words spoken by 30 people (23 males and 7 females), each repeating each word ten times. The words contain 4–11 phonemes. We used half of the corpus to train the models and the remaining half as test data. That is, out of the ten repetitions of each word spoken by each person, five were used for keyword and user registration. The remaining five were used for testing, along with other words spoken by the same person, as well as words spoken by other people. Table I shows the composition of a test set. Each test set contains five acceptance trials (i.e., the same keyword spoken by the same person as in the training data) and fifteen rejection trials (i.e., different keywords and/or a different person). The rejection trial data were selected randomly from the corpus. We repeated this experiment 180 (= 30 × 6) times.

The voice data were partitioned into a sequence of 25 millisecond frames with a 10 millisecond advance. A Hamming window was applied to each frame. Twelve-dimensional mel-frequency cepstral coefficients (MFCCs), log energy, and their first and second order time derivatives were used as the feature vector. The equal error rate (EER), i.e., the error rate at the operating point where the false alarm rate and the false rejection rate are equal, was used as the performance measure.
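The EER can be computed by sweeping a decision threshold over the observed scores; a simple sketch, in which the threshold grid (the observed scores themselves) and the tie-breaking are illustrative choices:

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """EER: sweep the threshold over all observed scores and return the
    error rate at the point where the false acceptance rate (FAR) and
    false rejection rate (FRR) are closest."""
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    best_gap, eer = np.inf, 1.0
    for th in thresholds:
        far = np.mean(impostor_scores >= th)   # impostors accepted
        frr = np.mean(genuine_scores < th)     # genuine trials rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```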

The garbage model for the template based method (i.e., the codebook) and the garbage model for the HMM based method (i.e., the GMM) were trained on the Korean Standard 2001 corpus, a phonetically rich collection of 16433 words spoken by 200 people. To analyze the effect of the number of Gaussians, we varied the size of the GMM: 1024, 2048, 4096, and 8192 Gaussians. The codebooks were trained with the k-means algorithm; each codebook contained 2048 codewords.

Fig. 5 shows the experimental results for various values of the clustering threshold in Step 3 of Section III-A, which controls the number of HMM states in the pseudo phoneme keyword HMM generation algorithm. There were 2048 Gaussians in the GMM. The results of the template based method using 2048 codewords are also shown in the figure as a reference. In this figure, five training data sets were used to generate the keyword and speaker models, and the verification decision was made by (4). On average, the keyword HMMs had 27 to 38 states in the 2048-Gaussian case shown in Fig. 5. Ten Gaussians were selected (i.e., $N$ = 10) during Steps 2 and 4 of the keyword model generation process. As the clustering threshold increases, meaning that the clustering criterion relaxes, the verification performance degrades. That is, as the clustering criterion relaxes, the number of clusters in the Gaussian index table becomes smaller; the resulting small number of HMM states reduces the keyword and speaker representation power of the model. On average, a threshold of 4.5 shows the best performance; we use this value in the remaining experiments.

The execution times of the two methods, run on an ultra mobile personal computer (UMPC) with a 500 MHz CPU, are shown in Fig. 6. The verification decision was made by (4). With the template based method, registration of a keyword and a user takes roughly real time, and recognition takes much less than real time. In contrast, the HMM based method takes almost 4.5 times real time for registration and two times real time for recognition. The template based method is thus much faster than the HMM based method. However, the HMM based method

TABLE I. COMPOSITION OF EACH TEST SET

Index                      Keyword     User        Label

1-5   (acceptance trials)  Same        Same        SK-SU
6-10  (rejection trials)   Same        Different   SK-DU
11-15 (rejection trials)   Different   Same        DK-SU
16-20 (rejection trials)   Different   Different   DK-DU

Fig. 6. Comparison of the execution time of the template based method and the HMM based method.

Fig. 5. Performance comparison of the template based method with 2048 codewords and the HMM based method with 2048 GMM Gaussians.


reduced the verification error by 25.5% compared to the template based method.

Fig. 7 shows the effect of the number of Gaussians per HMM state. As the number of Gaussians increased, performance improved. Beyond 15 Gaussians, however, the performance degraded. Thus, the remaining experiments were conducted with 15 Gaussians per HMM state to obtain the best result.

The relationship among the EER, the execution time, and the number of Gaussians is shown in Fig. 8. Although the number of Gaussians is an important performance factor, the computation times of registration and recognition are also critical on small mobile devices, such as cellular phones. The computation time increases almost linearly with the number of Gaussians in the GMM. The case using 8192 Gaussians achieved the highest recognition accuracy. Considering the computation time along with the EER, however, 2048 Gaussians achieved a better overall result than 8192 Gaussians.

The next experiments were performed with different numbers of registration utterances for generating the keyword model and adapting the speaker model, in order to show the effect of the amount of registration data. Fig. 9 shows the results. As expected, a larger number of registration utterances achieves better performance. The amount of registration data is thus an important consideration for digital devices that use a voice trigger.

Note that, until now, only the results of the one-step decision procedure have been presented. Table II analyzes the error rates contributed by each configuration of the test set; the configuration labels used in the table are explained in Table I. As shown in Table II, the errors in keyword recognition, i.e., 'DK-SU' and 'DK-DU', contributed a large portion of the overall error rate. Since the decision procedure in (4) consists of only one step, its classification ability may be degraded, especially when the four model scores have large discrepancies. Therefore, the one-step decision procedure should only be used when the four model scores can be assumed to come from similar distributions, and after careful consideration of the tradeoff between execution time and accuracy.

Finally, to compensate for the error in keyword verification, the two-step decision procedure in (2) and (3) was tested using the HMM based method with the 2048-Gaussian GMM, as well as the template based method with the 2048 codewords.

Fig. 7. EER as a function of the number of Gaussians for the HMM based method.

Fig. 8. Performance of the registration and recognition time for various sizes of the GMM in the HMM based method. The circular marks represent EER, the diamond marks describe the real-time factor of the registration process, and the triangular marks indicate the real-time factor of the recognition process.

TABLE II. ERROR PORTION

Label      SK-SU   SK-DU   DK-SU   DK-DU
Error (%)  25.8    35.5    23.3    15.4

Fig. 9. Performance as a function of the number of registration words for the 2048-Gaussian GMM.

2382 IEEE Transactions on Consumer Electronics, Vol. 55, No. 4, NOVEMBER 2009

The two step decision procedure requires all four models to be evaluated, rather than only two models as in the one step decision scheme. The two thresholds, θ1 and θ2, were determined empirically so that the two step decision procedure keeps the miss ratio of the keyword recognition below 1.0%. Table III shows the EERs of both approaches. The two step decision procedure decreased the EER by 2.0% in the HMM based method and by 3.6% in the template based method, relative to the one step decision scheme. In addition, the HMM based method decreased the recognition error rate by 27.8% compared to the template based method (from 13.2% to 9.5%). Thus, the HMM based method can be used as a more reliable voice trigger for home appliances and digital devices, while the template based method provides a faster voice trigger at the expense of accuracy.
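The two step flow can likewise be sketched in a few lines. The concrete score combinations in (2) and (3) are not reproduced here, so the two margin tests and the function name are illustrative assumptions; theta1 gates the keyword verification step and theta2 the speaker verification step, as described above.

```python
def two_step_decision(scores, theta1, theta2):
    """Hypothetical two step trigger decision (cf. (2) and (3)).

    scores: dict mapping the Table I labels to DTW distances
    (lower distance = closer match). Step 1 verifies the keyword,
    step 2 verifies the speaker; both margin tests are assumed
    stand-ins for the actual rules in the paper.
    """
    # Step 1: keyword verification against theta1 --
    # same-keyword models must beat different-keyword models.
    keyword_margin = (min(scores["DK-SU"], scores["DK-DU"])
                      - min(scores["SK-SU"], scores["SK-DU"]))
    if keyword_margin <= theta1:
        return False  # keyword rejected
    # Step 2: speaker verification against theta2 --
    # the registered user must beat the impostor model.
    speaker_margin = scores["SK-DU"] - scores["SK-SU"]
    return speaker_margin > theta2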

V. CONCLUSION

This paper proposed voice trigger systems that use keyword and speaker recognition techniques for a hands-free interface to small mobile devices. The proposed methods, the template based method and the HMM based method, do not require a speech recognizer to register keywords and users. Unlike the traditional GMM speaker recognition method, they also utilize the temporal constraints of the voice signals. Both methods build keyword- and speaker-specific models from the registration utterances and use them to make the trigger decision. The experiments using a Korean word corpus show the effectiveness of the proposed methods. The two proposed methods are complementary: although the template based method is faster than the HMM based method, the HMM based method is considerably more accurate. Therefore, the proposed voice trigger systems for keyword and speaker verification can be used selectively, taking into account the tradeoff between speed and accuracy, depending on the device of interest.

REFERENCES

[1] M. Matsuda, T. Nonaka, and T. Hase, “AV control method using natural language understanding,” IEEE Trans. Consum. Electron., vol. 52, no. 3, pp. 990-997, 2006.

[2] H.-C. Huang, T.-C. Lin, and Y.-M. Huang, “A smart universal remote control based on audio-visual device virtualization,” IEEE Trans. Consum. Electron., vol. 55, no. 1, pp. 172-178, 2009.

[3] A. Davis, S. Nordholm, and R. Togneri, “Statistical voice activity detection using low-variance spectrum estimation and an adaptive threshold,” IEEE Trans. Speech Audio Process., vol. 14, no. 2, pp. 412-424, 2006.

[4] H. Chung and I. Chung, “Memory efficient and fast speech recognition system for low-resource mobile devices,” IEEE Trans. Consum. Electron., vol. 52, no. 3, pp. 792-796, 2006.

[5] M. Ji, S. Kim, H. Kim, and H.-S. Yoon, “Text-independent speaker identification using soft channel selection in home robot environment,” IEEE Trans. Consum. Electron., vol. 54, no. 1, pp. 140-144, 2008.

[6] Y. R. Oh, J. S. Yoon, M. Kim, and H. K. Kim, “A name recognition based call-and-come service for home robots,” IEEE Trans. Consum. Electron., vol. 54, no. 2, pp. 247-253, 2008.

[7] J. Wilpon, L. Rabiner, C. Lee, and E. Goldman, “Automatic recognition of keywords in unconstrained speech using hidden Markov models,” IEEE Trans. Acoust., Speech, Signal Process., vol. 38, no. 11, pp. 1870-1878, 1990.

[8] E. Lleida and R. Rose, “Utterance verification in continuous speech recognition: decoding and training procedures,” IEEE Trans. Speech Audio Process., vol. 8, no. 2, pp. 126-139, 2000.

[9] D. Reynolds, T. Quatieri, and R. Dunn, “Speaker verification using adapted Gaussian mixture models,” Digital Signal Process., vol. 10, pp. 19-41, 2000.

[10] W. Campbell, J. Campbell, and D. Reynolds, “Support vector machines for speaker and language recognition,” Computer Speech and Language, vol. 20, pp. 210-229, 2006.

[11] G. Doddington, M. Przybocki, A. Martin, and D. Reynolds, “The NIST speaker recognition evaluation – overview, methodology, systems, results, perspective,” Speech Comm., vol. 31, pp. 225-254, 2000.

[12] T. Kinnunen, E. Karpov, and P. Franti, “Real-time speaker identification and verification,” IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 1, pp. 277-288, 2006.

[13] V. Hautamaki, T. Kinnunen, I. Karkkainen, J. Saastamoinen, M. Tuononen, and P. Franti, “Maximum a posteriori adaptation of the centroid model for speaker verification,” IEEE Signal Process. Lett., vol. 15, pp. 162-165, 2008.

[14] H. Sakoe and S. Chiba, “Dynamic programming algorithm optimization for spoken word recognition,” IEEE Trans. Acoust., Speech, Signal Process., vol. 26, no. 1, pp. 43-49, 1978.

[15] Y. Linde, A. Buzo, and R. M. Gray, “An algorithm for vector quantizer design,” IEEE Trans. Commun., vol. COM-28, no. 1, pp. 84-95, 1980.

[16] A. Bhattacharyya, “On a measure of divergence between two statistical populations defined by their probability distribution,” Bull. Calcutta Math. Soc., vol. 35, pp. 99-110, 1943.

[17] J. Gauvain and C. Lee, “Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains,” IEEE Trans. Speech Audio Process., vol. 2, no. 2, pp. 291-298, 1994.

[18] L. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proc. IEEE, vol. 77, no. 2, pp. 257-286, 1989.

Hyeopwoo Lee received the B.S. and M.S. degrees in Computer and Communication Engineering from Korea University, Seoul, Korea, in 2006 and 2008, respectively. He is currently in the Ph.D. program at the Speech Information Processing Laboratory, Korea University. His research interests are speech and speaker recognition.

Sukmoon Chang received the M.S. degree in computer science from Indiana University, Indiana, USA, in 1995, and the Ph.D. degree in computer science from Rutgers University, New Jersey, USA, in 2002. He worked on image and signal processing at the Center for Computational Biomedicine Imaging and Modeling, Rutgers University, from 2002 to 2004. He is a professor in Computer Science, School of Science, Engineering, and Technology, Pennsylvania State University. His research interests include image and signal processing and machine learning. Dr. Chang is a member of IEEE.

Dongsuk Yook received the B.S. and M.S. degrees in computer science from Korea University, Korea, in 1990 and 1993, respectively, and the Ph.D. degree in computer science from Rutgers University, New Jersey, USA, in 1999. He worked on speech recognition at the IBM T.J. Watson Research Center, New York, USA, from 1999 to 2001. He is a professor in the Department of Computer and Communication Engineering, Korea University, Korea. His research interests include machine learning and speech processing. Dr. Yook is a member of IEEE.

TABLE III
PERFORMANCE COMPARISON OF THE DECISION METHODS IN THE TEMPLATE BASED AND HMM BASED METHODS

System          One step (%)   Two steps (%)   Improvement (%)
Template based  13.7           13.2            3.6
HMM based       9.7            9.5             2.0

Yongserk Kim received the B.S. degree in electronics engineering from Sungkyunkwan University, Korea, in 1983. He has been working on audio processing and telecommunications since 1983. He was awarded an honorary Ph.D. degree from Samsung Electronics in 2002. Currently, he is a vice president, and the director of the Acoustic Technology Center, Samsung Electronics Co., LTD.

2384 IEEE Transactions on Consumer Electronics, Vol. 55, No. 4, NOVEMBER 2009