
264 IEEE SIGNAL PROCESSING LETTERS, VOL. 16, NO. 4, APRIL 2009

Environmental Model Adaptation Based on Histogram Equalization

Youngjoo Suh, Member, IEEE, and Hoirin Kim, Member, IEEE

Abstract—In this letter, a new environmental model adaptation method is proposed for robust speech recognition in noisy environments. The proposed method adapts the initial acoustic models of a speech recognizer into environmentally matched models by utilizing the histogram equalization technique. Experiments performed on the Aurora noisy environments showed that the proposed technique provides substantial improvement over a baseline speech recognizer trained on clean speech data.

Index Terms—Environmental model adaptation, histogram equalization, robust speech recognition.

I. INTRODUCTION

Currently, most speech recognizers trained on clean speech data show serious performance degradation when they are tested in acoustically mismatched noisy environments [1], [2]. The main cause of the acoustic mismatch between training and test environments is corruption by additive noise and channel distortion. Therefore, one of the central issues in automatic speech recognition is to provide speech recognizers with robustness against the acoustic mismatch between the two environments [3]. Current robust speech recognition techniques can be categorized into feature-compensation, model-adaptation, and uncertainty-based approaches [2]. Of the three, the simplest way to provide robustness against the acoustic mismatch is feature compensation, which maps noisy test features into compensated test features that are then decoded by speech recognizers trained on clean speech data [3]. The histogram equalization (HEQ) technique is known to be one of the most computationally efficient feature compensation approaches because most of its algorithm consists of sorting and search routines with narrow depths [4]–[8]. Additionally, it is particularly effective in feature compensation because of its nonlinear transformation characteristics, which can fundamentally cope with the nonlinearity of noisy speech features in logarithmic feature spaces such as cepstral coefficients [4], [8]. However, due to some acoustic-phonetic information loss in noisy speech features, there are unavoidable discrepancies between clean speech models and compensated features. On the contrary, the

Manuscript received September 22, 2008; revised December 16, 2008. Current version published February 12, 2009. This work was supported by the IT R&D program of MIC/IITA [2008-S-001-01, Development of large vocabulary/interactive distributed/embedded VUI for new growth engine industries]. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Steve Renals.

The authors are with the Department of Information and Communications Engineering, Korea Advanced Institute of Science and Technology, Daejeon 305-732, Korea (e-mail: [email protected]; [email protected]).

Digital Object Identifier 10.1109/LSP.2009.2014109

Fig. 1. Block diagram of the histogram equalization-based model adaptation technique in speech recognition.

clean speech models can be fully adapted into acoustically matched speech models as long as the amount of adaptation data is sufficient in model adaptation [1]. Therefore, although the same information loss can occur in model adaptation, it does not cause undesirable discrepancies between acoustic models and speech features in the decoding process. For this reason, the model adaptation approach can be more effective than the feature compensation one in reducing the mismatch between acoustic environments. Two well-known model adaptation methods are parallel model combination and vector Taylor series [1], [2]. These methods use both clean speech and noise models to approximate noisy speech models and are reported to be very effective in reducing the mismatch. However, from a computational standpoint, they are not very efficient due to the transformations between the log-spectral and cepstral domains.

In this letter, we propose a new, efficient model adaptation approach based on the histogram equalization technique to take advantage of its effectiveness in providing noise robustness as well as its computational efficiency in the model space. In the proposed approach, shown in Fig. 1, the histogram equalization technique transforms the initial acoustic models of a speech recognizer into environmentally adapted acoustic models. The transformation function of the histogram equalization technique is obtained by using the reference and test cumulative distribution functions (CDFs) of the training data and the input utterance, respectively. Then, the speech recognizer based on the adapted acoustic models decodes the input utterance into the recognized output.

II. HISTOGRAM EQUALIZATION FOR FEATURE COMPENSATION

A basic assumption in utilizing HEQ for feature compensation (HEQ-FC) is that the acoustic mismatch between clean reference features and noisy test features is directly related to the discrepancy between their corresponding probability density functions (PDFs). The idea of HEQ-FC is then to convert the test PDF into the reference one to recover clean reference features

1070-9908/$25.00 © 2009 IEEE



from noisy test features. Here, HEQ-FC is applied to each feature component independently for algorithmic simplicity, under an assumption of statistical independence that is acceptable for orthogonal features such as cepstral coefficients. In this case, the algorithm of HEQ-FC is given as follows.

For given reference and test random variables $X$ and $Y$, respectively, a transformation function $x = F(y)$ by HEQ-FC mapping the test PDF $p_Y(y)$ into the reference PDF $p_X(x)$ is obtained by equating their corresponding CDFs, defined in [4], as

$$C_X(F(y)) = C_Y(y) \qquad (1)$$

$$F(y) = C_X^{-1}(C_Y(y)) \qquad (2)$$

where $C_X^{-1}$ is the inverse of the reference CDF $C_X$, $C_Y$ is the test CDF, and $F$ is an inverse transformation function of the acoustic mismatch with monotonically nondecreasing characteristics. From (2), it can be inferred that the effectiveness of HEQ-FC is directly related to the reliable estimation of both the reference and test CDFs. In practice, each CDF in (2) is approximated by its cumulative histogram, so a better CDF estimate is achieved with a larger amount of sample data. In this sense, the reference CDF can be obtained quite reliably because a large amount of sample data is available at the training phase. On the contrary, when short utterances are used as the test data in speech recognition, the amount of sample data may be insufficient for reliable estimation of the test CDF. Therefore, reliable estimation of the test CDF becomes an important issue for effective HEQ-FC in this test environment. When the amount of sample data is small, the order-statistic-based CDF estimation method can be more reliable than the classical histogram-based approach. A brief description of the order-statistic-based algorithm is given as follows [4], [6].

Let us define a sequence $Y$ consisting of $T$ frames of test feature components as

$$Y = \{y(1), y(2), \ldots, y(T)\} \qquad (3)$$

where $y(t)$ is a test feature component at the $t$-th frame. The order statistics of the sequence in (3) are represented by rearranging its elements in ascending order as

$$y(t_{(1)}) \le y(t_{(2)}) \le \cdots \le y(t_{(T)}) \qquad (4)$$

where $t_{(r)}$ represents the original frame index of the feature component whose rank is denoted by $r$. Then, the order-statistic-based test CDF estimate $\hat{C}_Y$ of the test feature component $y(t)$ is given as

$$\hat{C}_Y(y(t)) = \frac{r(y(t)) - 0.5}{T} \qquad (5)$$

where $r(y(t))$ denotes the rank of $y(t)$ among the $T$ feature components composing the sequence according to the order statistics defined in (4). From (2) and (5), an estimate $\hat{x}(t)$ of the reference feature by HEQ-FC given the test feature $y(t)$ is obtained as

$$\hat{x}(t) = C_X^{-1}\!\left(\hat{C}_Y(y(t))\right). \qquad (6)$$
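For concreteness, the HEQ-FC procedure of (5) and (6) can be sketched for a single feature component as follows. This is a minimal NumPy sketch under our own assumptions, not code from the letter: the function names are invented, the reference CDF is approximated by a 64-bin cumulative histogram (the bin count used later in Section IV), and its inverse is taken by linear interpolation.

```python
import numpy as np

def reference_cdf(train_feats, n_bins=64):
    """Cumulative-histogram estimate of the reference CDF C_X for one
    feature component, built from clean training data."""
    counts, edges = np.histogram(train_feats, bins=n_bins)
    cdf = np.cumsum(counts) / counts.sum()
    centers = 0.5 * (edges[:-1] + edges[1:])
    # drop flat (empty-bin) segments so the CDF is strictly increasing,
    # which np.interp needs for a well-defined inverse
    keep = np.concatenate(([True], np.diff(cdf) > 0))
    return centers[keep], cdf[keep]

def heq_fc(test_seq, centers, cdf):
    """HEQ-FC for one feature component: estimate the test CDF with the
    order-statistic rule of Eq. (5) and map each frame through the
    inverse reference CDF, Eq. (6)."""
    T = len(test_seq)
    ranks = np.argsort(np.argsort(test_seq)) + 1   # rank r(y(t)) in 1..T
    test_cdf = (ranks - 0.5) / T                   # Eq. (5)
    # C_X^{-1} approximated by linear interpolation on the histogram
    return np.interp(test_cdf, cdf, centers)
```

Because the mapping is built from ranks and a nondecreasing CDF, it preserves the ordering of the test features while moving their distribution toward the reference one.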

III. HISTOGRAM EQUALIZATION FOR MODEL ADAPTATION

Unlike HEQ-FC, the histogram equalization for model adaptation (HEQ-MA) technique models the acoustic mismatch between the acoustic model under the noisy test environment and that under the clean reference environment as a transformation function $y = G(x)$ in the model space. The transformation function of HEQ-MA is obtained by mapping the reference PDF $p_X(x)$ into the test PDF $p_Y(y)$ by the following rule, which is the inverse of the transformation function of HEQ-FC:

$$G(x) = F^{-1}(x) = C_Y^{-1}(C_X(x)). \qquad (7)$$

Let $\mu$ and $\Sigma$ denote the mean vector and covariance matrix of an initial acoustic model in a speech recognizer trained on clean speech data, respectively. We then assume that HEQ-MA is applied to each mean vector of all acoustic models in the speech recognizer on a component-by-component basis. Under these assumptions, the adaptation rule for HEQ-MA is given by using (7) and a linear interpolation between the two test feature components in the sequence that are nearest to the initial mean component in terms of CDF value, as

$$\hat{\mu}_k = \begin{cases} (1 - \alpha_k)\, y_k(t_{(r-1)}) + \alpha_k\, y_k(t_{(r)}), & r \le T \\ y_k(t_{(T)}) + \alpha_k \Delta, & r = T + 1 \end{cases} \qquad (8)$$

where $\hat{\mu}_k$ and $\mu_k$ denote the $k$-th components of the adapted and initial mean vectors, respectively, $C_Y^{-1}$ is the inverse of the test CDF for the $k$-th test feature component, $C_X(\mu_k)$ is the reference CDF estimate of the $k$-th mean component, $r$ is the smallest rank index satisfying the relationship $\hat{C}_Y(y_k(t_{(r)})) \ge C_X(\mu_k)$, $\Delta$ is a linear extrapolation factor for the boundary condition at the $k$-th mean component and is set to the interval between the last two order-statistic values, and $\alpha_k$ is the linear interpolation factor of the $k$-th mean component, which is based on the order-statistic-based test CDF of the sequence defined in (5) and is given by

$$\alpha_k = \frac{C_X(\mu_k) - \hat{C}_Y(y_k(t_{(r-1)}))}{\hat{C}_Y(y_k(t_{(r)})) - \hat{C}_Y(y_k(t_{(r-1)}))} \qquad (9)$$

where the test CDF estimate of the undefined feature component $y_k(t_{(0)})$ is assumed to be zero to satisfy its boundary condition.
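The mean-adaptation step described above can be sketched as follows for one mean component. This is one plausible NumPy realization of the interpolation rule around (7)–(9), under our own assumptions: the function name is invented, the boundary handling is our reading of the text (extrapolation past the largest order statistic, CDF zero below the smallest), and the reference CDF value is assumed precomputed.

```python
import numpy as np

def heq_ma_mean(ref_cdf_value, test_seq):
    """Adapt one clean-model mean component via HEQ-MA: evaluate
    C_Y^{-1}(C_X(mu_k)) on the order statistics of the test sequence,
    interpolating linearly between the two order statistics nearest in
    CDF value. ref_cdf_value = C_X(mu_k) comes from the training-data
    cumulative histogram."""
    y = np.sort(np.asarray(test_seq, dtype=float))  # order statistics, Eq. (4)
    T = len(y)
    c = (np.arange(1, T + 1) - 0.5) / T             # order-statistic CDF, Eq. (5)
    if ref_cdf_value >= c[-1]:
        # upper boundary: extrapolate using the interval between the
        # last two order statistics (the factor Delta in the text)
        delta = y[-1] - y[-2]
        alpha = (ref_cdf_value - c[-1]) * T         # fraction of one CDF step
        return y[-1] + alpha * delta
    # interior: the CDF of the undefined component below y[0] is taken
    # as zero (boundary condition discussed after Eq. (9))
    return np.interp(ref_cdf_value,
                     np.concatenate(([0.0], c)),
                     np.concatenate(([y[0]], y)))
```

In a full recognizer, this function would be applied to every component of every mean vector, using the per-component test sequence of the current utterance.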

In environmental model adaptation, it is generally known that the improvements gained using mean and variance adaptation



over mean adaptation only are especially large in noisy environments, although adapting the means has the greater effect on performance [9]. Therefore, unlike HEQ-FC, HEQ-MA needs to adapt not only the mean vectors but also the covariance matrices. For this purpose, an efficient covariance adaptation rule, which depends on the degree of noise corruption, is given by a linear interpolation of the initial covariance matrix and the sequence-level global sample covariance matrix as

$$\hat{\Sigma} = \varepsilon\, \Sigma + (1 - \varepsilon)\, S \qquad (10)$$

where $\varepsilon$ is a signal-to-noise ratio (SNR)-dependent smoothing factor given by $\varepsilon = a \cdot \overline{\mathrm{SNR}} + b$, where $\overline{\mathrm{SNR}}$ is the averaged SNR value of the sequence, $a$ and $b$ are empirically chosen slope and bias constants, respectively, and $S$ denotes the global sample covariance matrix of the sequence from the maximum-likelihood estimation criterion:

$$S = \frac{1}{T} \sum_{t=1}^{T} \left(\mathbf{y}(t) - \bar{\mathbf{y}}\right)\left(\mathbf{y}(t) - \bar{\mathbf{y}}\right)^{\top} \qquad (11)$$

where $\bar{y}_k$, the $k$-th component of $\bar{\mathbf{y}}$, represents the global sample mean of the $k$-th feature components in the sequence.
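The covariance rule of (10) and (11) reduces, for the diagonal covariances used later in Section IV, to a per-component variance interpolation. The sketch below assumes an invented function name and an illustrative slope value (the letter reports only the bias 0.9); the clipping of the smoothing factor to [0, 1] is a safety measure of ours, not stated in the letter.

```python
import numpy as np

def adapt_covariance(sigma_init, test_seq, snr_db, a=-0.02, b=0.9):
    """SNR-dependent covariance interpolation, Eq. (10):
    Sigma_hat = eps * Sigma_init + (1 - eps) * S, with eps = a*SNR + b.
    Diagonal covariances are assumed, so S is the per-component ML
    sample variance of Eq. (11).  a = -0.02 is purely illustrative."""
    eps = float(np.clip(a * snr_db + b, 0.0, 1.0))
    y_bar = test_seq.mean(axis=0)               # global sample mean, per component
    s = ((test_seq - y_bar) ** 2).mean(axis=0)  # ML sample variance, Eq. (11)
    return eps * sigma_init + (1.0 - eps) * s
```

At high SNR, eps is close to 1 and the clean-trained variances are kept; at low SNR, the utterance-level sample variances dominate, which matches the observation below that variance adaptation matters most at low SNRs.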

IV. EXPERIMENTAL RESULTS

In the performance evaluation, we used two speech databases, the Aurora2 speech database [10], converted from the TI-DIGITS database, and the KPOW (Korean phonetically optimized words) database [11], consisting of 37,993 utterances of 3,848 Korean words, to examine the effectiveness of the proposed approach in both small and large vocabulary speech recognition tasks. The initial acoustic models of two baseline speech recognizers were separately trained on the clean speech training sets of the two databases. In the evaluations, we used the three test sets of the Aurora2 noisy speech database, where test set A was corrupted by four kinds of noise, test set B was corrupted by another four types of noise, and test set C was contaminated by two kinds of noise together with channel distortion [10]. Additionally, we used two test sets of the KPOW noisy speech database, which were generated by artificially adding the same eight kinds of Aurora noise to the KPOW clean speech test set composed of 7,609 utterances. We employed the ETSI Aurora2 experimental framework following [10]. In feature extraction, speech signals are first blocked into a sequence of frames, each 25 ms in length with a 10-ms interval. Next, speech frames are pre-emphasized by a factor of 0.97, and a Hamming window is applied to each frame. From a sequence of 23 mel-scaled log filter-bank energies, 39-dimensional mel-frequency cepstral coefficient (MFCC)-based feature vectors, each consisting of 12 MFCCs, log energy, and their delta and acceleration features, are extracted.
The baseline speech recognizer for the Aurora2 task employs 13 whole-word-based hidden Markov models (HMMs), which consist of 11 digit models with 16 states each, a silence model with three states, and a short-pause model with a single state. The states of the digit models have three Gaussian mixture components each, while those of the silence and short-pause models have six Gaussian mixture components. The baseline recognizer for the KPOW task has 6776 tied-state

Fig. 2. Recognition results on the Aurora2 data with regard to various SNR conditions by the baseline speech recognizer, HEQ-FC, HEQ-MA-M (mean adaptation only), and HEQ-MA-M&V (mean and variance adaptation) techniques.

triphone-based HMMs, each with three states and eight Gaussian mixture components per state. Diagonal covariance matrices are used in all of the HMMs. In the performance evaluation, the baseline speech recognizer trained on the clean speech data, HEQ-FC, and HEQ-MA were examined. In feature compensation, HEQ-FC is applied to all of the 39-D MFCCs independently for both training and test data after estimating the reference CDFs from the training data. In model adaptation, the HEQ and proposed variance adaptation techniques are applied to the 39-D mean vectors and diagonal covariance matrices, respectively, of all initial HMMs in the baseline speech recognizers on a component-by-component basis. The number of histogram bins in the reference CDFs was empirically chosen as 64. The SNR-dependent smoothing parameters $a$ and $b$ in the adaptation of the covariance matrices are set empirically, with the bias $b$ chosen as 0.9. The averaged SNR value was estimated as the ratio of the averaged frame energy to the averaged noise energy of the initial silence region in each utterance. Histogram equalization is conducted on an utterance-by-utterance basis in both feature compensation and model adaptation.
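The utterance-level SNR estimate described above (averaged frame energy over averaged noise energy of the initial silence region) can be sketched as follows. The function name and the number of leading frames treated as silence are assumptions of this sketch; the letter does not specify how the silence region is delimited.

```python
import numpy as np

def estimate_utterance_snr(frame_energies, n_silence_frames=10):
    """Utterance-level SNR in dB: averaged frame energy of the whole
    utterance over the averaged noise energy of an assumed initial
    silence region, as described in Section IV."""
    noise = np.mean(frame_energies[:n_silence_frames])
    signal = np.mean(frame_energies)
    return 10.0 * np.log10(signal / noise)
```

The resulting value would feed the smoothing factor of (10), so that utterances estimated as noisier lean more heavily on the sample covariance.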

Fig. 2 shows the recognition results on the Aurora2 test sets at various SNR conditions in terms of word accuracy averaged over all three test sets. The figure indicates that the HEQ-MA approaches, HEQ-MA with mean adaptation only (HEQ-MA-M) and HEQ-MA with mean and variance adaptation (HEQ-MA-M&V), provide significant improvements over the baseline speech recognizer. HEQ-MA-M is better than HEQ-FC at high SNRs, but it becomes inferior at the 0-dB SNR condition. In contrast, HEQ-MA-M&V yields substantial improvements over HEQ-FC even at the lower SNRs. Therefore, we believe that the inferior result of HEQ-MA-M at the lower SNR is mainly due to the mismatch in the variance models, and thus variance adaptation plays a crucial role at the lower SNRs. Fig. 3 illustrates the recognition results on the KPOW test sets at various SNR conditions. The results indicate that the proposed techniques are also effective in the large vocabulary task, although their performance improvements are not as remarkable as those on the Aurora2 task. In Figs. 2 and 3, we observe that the



Fig. 3. Recognition results on the Korean POW data with regard to various SNR conditions by the baseline speech recognizer, HEQ-FC, HEQ-MA-M (mean adaptation only), and HEQ-MA-M&V (mean and variance adaptation) techniques.

TABLE I
WORD ERROR RATES (%) BY THE BASELINE SPEECH RECOGNIZER, MULTI-CONDITION TRAINING SCHEME, HEQ-FC, AND HEQ-MA TECHNIQUES ON THE AURORA2 TASK (RESULTS ARE AVERAGED BETWEEN 0 AND 20 dB SNRS)

performance gains obtained by HEQ-MA-M&V over HEQ-FC are more notable at the lower SNR conditions. These results support our earlier suggestion that model adaptation is more effective than feature compensation in severe noise conditions, where it becomes more difficult to compensate noisy speech features into clean speech features due to the increased loss of acoustic-phonetic information.

Tables I and II show the word error rates averaged between 0 and 20 dB SNRs [10] for the Aurora2 and KPOW tasks, respectively. For the Aurora2 test sets, HEQ-MA with mean and variance adaptation produces error reductions of 62.83% and 23.40% over the baseline recognizer and HEQ-FC, respectively. Compared to the multi-condition training scheme, we observe that the proposed approach is even better on test sets B and C, where the noise types are not exposed to the training phase. For the KPOW test sets, it also reduces errors by 43.04% and 8.55% over the two approaches. These results mean that HEQ-MA provides substantial performance gains in both small and large vocabulary speech recognition tasks, and its effectiveness is greater in the small vocabulary task due to the lower possibility of unobserved models.

TABLE II
WORD ERROR RATES (%) BY THE BASELINE SPEECH RECOGNIZER, HEQ-FC, AND HEQ-MA TECHNIQUES ON THE KOREAN POW TASK (RESULTS ARE AVERAGED BETWEEN 0 AND 20 dB SNRS)

V. CONCLUSION

In this letter, we propose a new environmental model adaptation method for robust speech recognition. The proposed approach utilizes the histogram equalization technique, which transforms the initial mean vectors of the speech recognizer into environmentally matched acoustic models on a single-utterance basis. The covariance matrices are adapted SNR-dependently by using sample covariance matrices estimated at the input utterance level. According to the experimental results, the proposed model adaptation approach is substantially effective in reducing the acoustic mismatch between the initial acoustic models of speech recognizers and the test environments.

REFERENCES

[1] X. Huang, A. Acero, and H.-W. Hon, Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Upper Saddle River, NJ: Prentice-Hall, 2001.

[2] M. Gales and S. Young, "The application of hidden Markov models in speech recognition," Found. Trends Signal Process., vol. 1, no. 3, pp. 195–304, 2007.

[3] N. S. Kim, Y. J. Kim, and H. W. Kim, "Feature compensation based on soft decision," IEEE Signal Process. Lett., vol. 11, no. 3, pp. 378–381, Mar. 2004.

[4] J. C. Segura, C. Benítez, Á. de la Torre, A. J. Rubio, and J. Ramírez, "Cepstral domain segmental nonlinear feature transformations for robust speech recognition," IEEE Signal Process. Lett., vol. 11, no. 5, pp. 517–520, May 2004.

[5] Á. de la Torre, A. M. Peinado, J. C. Segura, J. L. Pérez-Córdoba, M. C. Benítez, and A. J. Rubio, "Histogram equalization of speech representation for robust speech recognition," IEEE Trans. Speech Audio Process., vol. 13, no. 3, pp. 355–366, May 2005.

[6] J. Pelecanos and S. Sridharan, "Feature warping for robust speaker verification," in Proc. Speaker Odyssey, 2001, pp. 213–218.

[7] Y. Suh, S. Kim, and H. Kim, "Compensating acoustic mismatch using class-based histogram equalization for robust speech recognition," EURASIP J. Adv. Signal Process., vol. 2007, Article ID 67870, 9 pages, doi:10.1155/2007/67870.

[8] Y. Suh, M. Ji, and H. Kim, "Probabilistic class histogram equalization for robust speech recognition," IEEE Signal Process. Lett., vol. 14, no. 4, pp. 287–290, Apr. 2007.

[9] M. J. F. Gales and P. C. Woodland, "Mean and variance adaptation within the MLLR framework," Comput. Speech Lang., vol. 10, pp. 249–264, 1996.

[10] D. Pearce and H.-G. Hirsch, "The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions," in Proc. Int. Conf. Spoken Language Processing, Oct. 2000, pp. 29–32.

[11] Y. Lim and Y. Lee, "Implementation of the POW (phonetically optimized words) algorithm for speech database," in Proc. Int. Conf. Acoustics, Speech, and Signal Processing, May 1995, pp. 89–92.