
VI INTERNATIONAL TELECOMMUNICATIONS SYMPOSIUM (ITS2006), SEPTEMBER 3-6, 2006, FORTALEZA-CE, BRAZIL

HMM Topology in Continuous Speech Recognition Systems

Glauco F. G. Yared, Fábio Violaro and Antônio Marcos Selmini

Abstract— Nowadays, HMM-based speech recognition systems are used in many real-time processing applications, from cell phones to automobile automation. In this context, one important aspect to be considered is the HMM model size, which directly determines the computational load. So, in order to make the system practical, it is interesting to optimize the HMM model size constrained to a minimum acceptable recognition performance. Furthermore, topology optimization is also important for reliable parameter estimation. This work presents the new Gaussian Elimination Algorithm (GEA) for determining the most suitable HMM complexity in continuous speech recognition systems. The proposed method is evaluated on a small vocabulary continuous speech (SVCS) database as well as on the TIMIT corpus.

Index Terms— Gaussian Elimination Algorithm, HMM complexity, Viterbi alignment.

I. INTRODUCTION

In recent years, applications requiring speech recognition techniques have grown considerably, ranging from speaker-independent to speaker-adapted systems, in low- and high-noise environments. Moreover, speech recognition systems have been designed to achieve both high processing performance and accuracy. In order to achieve such aims, the model size is an important aspect to be analyzed, as it is directly related to the processing load and to the model classification capability. So, this work presents a new approach to determine the number of Gaussian components in continuous speech recognition systems: the new Gaussian Elimination Algorithm (GEA). The results are compared to the classical BIC and to an entropy-based method.

II. BACKGROUND

The statistical modeling problem has some aspects that do not depend on the specific task for which the model was designed. One classical problem that must be overcome is over-parameterization [1], which may occur with large models, that is, models with an excessive number of parameters. In general, such models present a low training error rate, due to their high flexibility, but their performance on a test database is almost always unsatisfactory. On the other hand, models with an insufficient number of parameters cannot even be trained. At this point there is a trade-off between robustness and trainability, which must be taken into account in order to obtain a high-performance system. In the speech recognition context, the percentage of correctly recognized words (for the SVCS corpus) or phones (for the TIMIT corpus), and/or the phone recognition accuracy (for the TIMIT corpus), are used as a performance measure, and the total number of Gaussian components is used as the model size.

The authors are with the Digital Speech Processing Laboratory, State University of Campinas, C.P. 6101, CEP: 13083-852, Campinas-SP, Brazil. E-mail: [email protected], {fabio,selmini}@decom.fee.unicamp.br

The idea of using a different number of components for each mixture is supported, first, by the fact that each acoustic unit has a certain number of samples in the training database. Depending on the amount of data available, the acoustic resolution of the HMM models should be increased or decreased in order to allow a reliable parameter estimation. In addition, the complexity of the acoustic distribution boundary also determines the number of components required for correctly modeling the different classes. In this manner, it is possible to reduce the probability that some Gaussians converge to the wrong distributions when the Maximum Likelihood Estimation (MLE) algorithm is used for training HMMs [2].

The computational cost is directly related to the number of Gaussian components present in the system. As a consequence, the number of operations and the memory requirements increase with the number of components, and thus the computational load is a practical argument that supports the optimization of the model complexity.

Finally, before starting any analysis, it is important to define the emphasis of the topology selection method, that is, whether to determine the varying-number-of-Gaussians-per-state system that gives the highest performance or the one that allows the maximum economy of parameters constrained to a minimum performance threshold.

Briefly, section III describes the speech recognition system and the database used in the experiments. Then, the two classical methods for determining the most suitable model size are summarized in sections IV and V. Section VI presents the new Gaussian Elimination Algorithm (GEA) and section VII shows the experimental results. The conclusions are given in section VIII.

III. SPEECH RECOGNITION SYSTEM AND DATABASE

A. SVCS Corpus

A Brazilian Portuguese small vocabulary continuous speech database (700 different words) [3], with a set of 200 different sentences spoken by 40 speakers (20 male and 20 female), was used in the experiments. In addition, 1200 utterances were used for training and 400 for testing [3].

A maximum likelihood estimation (MLE) based training system was implemented in order to obtain three-state left-right continuous density HMM models. As 36 different context-independent phones (including silence) are used, 108 multidimensional Gaussian mixtures, with a variable or fixed number of components in each mixture, are present in the system. In addition, each Gaussian component is represented in a 39-dimensional space (12 mel-cepstral coefficients, 1 log-energy parameter and their first- and second-order derivatives), using diagonal covariance matrices. The recognition task is performed by HTK [4] using a back-off bigram language model.

B. TIMIT Corpus

The TIMIT corpus is a benchmark for testing continuous phone recognition systems [5], [6]. The training set contains 3696 sentences and the complete test set contains 1344 sentences.

Following the usual convention [7], [8], 48 different context-independent phones are used for training the models and 39 phones for recognition. The same model and parameter configuration described for the SVCS corpus is used. The recognition task is performed by HTK [4] using a phone-level bigram language model with a scale factor of 2.

IV. ENTROPY-BASED METHOD

The method described in this section is based on the entropy measure and uses a growing procedure to determine the appropriate model size [9]. Basically, every acoustic model starts with a single-Gaussian mixture per state, and the entropy of each Gaussian is calculated by Equation (1),

H_{si} = \sum_{d=1}^{D} \frac{1}{2} \log_{10}\left(2\pi e \sigma_{sd}^{2}\right), (1)

where "s" is the state number, "i" is the Gaussian number, "D" is the Gaussian dimensionality and \sigma_{sd}^{2} is the variance, along dimension "d", of the i-th mixture component in state "s".

After that, the entropy of each state, H_s, is calculated as shown in Equation (2),

H_{s} = \sum_{i=1}^{M_{s}} N_{si} H_{si}, (2)

where M_s is the number of Gaussian components in state "s" and N_{si} is the number of data samples associated to the i-th mixture component in state "s". Then the segmental K-means algorithm is applied in order to split all Gaussian components in such a way that, after splitting, the model will have twice the number of components. The entropy of this new configuration, \hat{H}_{s}, is then calculated, and the entropy variation is obtained as in Equation (3),

d_{s} = H_{s} - \hat{H}_{s}. (3)

The idea is to arrange the states in decreasing order of the entropy variation, retain the first N states with the new configuration and then decrease the others by one Gaussian component (previous configuration). This process is repeated until the desired model size is reached.

The algorithm is summarized below (a code sketch follows the list).

• Start with one Gaussian per state and calculate H_s.
• Split all clusters so that the model will have twice the number of Gaussians.
• Apply one step of the segmental K-means algorithm.
• Calculate \hat{H}_s.
• Calculate the entropy variation d_s and arrange the states in decreasing order of d_s.
• Retain the current size of the first N states in that ordering, while returning the others to the previous configuration.
• Repeat the steps above until the expected model size is reached.
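As a rough illustration, one iteration of this growing procedure could be sketched as follows, assuming each state's mixture is stored as per-component diagonal variances and sample counts, and that the splitting/segmental K-means step is provided by a hypothetical split_and_kmeans callback (none of these names come from the paper):

```python
import numpy as np

def state_entropy(variances, counts):
    """Entropy of one state per Equations (1)-(2): H_s = sum_i N_si * H_si,
    with H_si = sum_d 0.5 * log10(2*pi*e*sigma_sd^2).
    variances: (M, D) per-component diagonal variances; counts: (M,) sample counts."""
    h_si = 0.5 * np.log10(2.0 * np.pi * np.e * variances).sum(axis=1)
    return float(np.dot(counts, h_si))

def growth_step(states, n_retain, split_and_kmeans):
    """One iteration of the growing procedure: split every state, keep the split
    only for the n_retain states with the largest entropy decrease d_s, and
    revert the remaining states to their previous configuration."""
    before = {s: state_entropy(st["var"], st["count"]) for s, st in states.items()}
    split = {s: split_and_kmeans(st) for s, st in states.items()}   # doubled mixtures
    after = {s: state_entropy(st["var"], st["count"]) for s, st in split.items()}
    d_s = {s: before[s] - after[s] for s in states}                 # entropy variation
    retained = set(sorted(d_s, key=d_s.get, reverse=True)[:n_retain])
    return {s: (split[s] if s in retained else states[s]) for s in states}
```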

V. BAYESIAN INFORMATION CRITERION

The BIC [10], [11], [12] has been widely used within statistical modeling in many different areas for model structure selection. The main concept that supports the BIC is the principle of parsimony, that is, the selected model should be the one with the smallest complexity and the highest capacity of modeling the training data. This can be observed directly from Equation (4),

BIC\left(M_{l}^{j}\right) = \sum_{t=1}^{N_{j}} \log P\left(o_{t}^{j} \mid M_{l}^{j}\right) - \lambda \frac{\nu_{l}^{j}}{2} \log N_{j}, (4)

where M_{l}^{j} is the candidate model "l" of state "j", N_j is the number of data samples of state "j", o_{t}^{j} is the t-th data sample of state "j", \nu_{l}^{j} is the number of free parameters present in M_{l}^{j}, and the parameter \lambda controls the penalization term.

According to this criterion, the selected model is the one, over all candidate models, that gives the highest BIC value. As can be noted, the topology of each state model is selected regardless of the other existing states. However, there are some proposed modifications to the BIC that take all existing states into account during the topology selection [13].
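For illustration, Equation (4) could be evaluated over a set of candidate mixture sizes of one state roughly as below; the candidate-model interface (per_sample_log_likelihood, n_components) is hypothetical and not from the paper:

```python
import numpy as np

def bic_score(log_likelihoods, n_free_params, lam):
    """BIC(M_l^j) = sum_t log P(o_t^j | M_l^j) - lam * (nu_l^j / 2) * log N_j."""
    n_j = len(log_likelihoods)
    return float(np.sum(log_likelihoods)) - lam * (n_free_params / 2.0) * np.log(n_j)

def select_state_model(state_data, candidate_models, lam=0.05):
    """Return the candidate model of this state with the highest BIC value."""
    best, best_bic = None, -np.inf
    for model in candidate_models:
        ll = model.per_sample_log_likelihood(state_data)        # hypothetical method
        # free parameters of a diagonal-covariance mixture: weights, means, variances
        dim = state_data.shape[1]
        nu = (model.n_components - 1) + 2 * model.n_components * dim
        score = bic_score(ll, nu, lam)
        if score > best_bic:
            best, best_bic = model, score
    return best
```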

VI. GAUSSIAN ELIMINATION ALGORITHM (GEA)

Previous works in this area have used likelihood measures in model topology selection criteria [13], [14], [15], [16]. The present work defines a Gaussian Importance Measure (GIM), which is first used in a new discriminative Gaussian selection algorithm and then in a Euclidean-distance-based Gaussian selection procedure.

A. Discriminative Determination of Model Complexity

In a previous work, a likelihood-based measure was used to compute the contribution of each data sample to the importance of each Gaussian, namely the Winner Gaussian Probability (WGP) [17], as defined in Equation (5),

P_{wg}^{(i;j;s)} = \frac{N_{wg}^{(i;j;s)}}{N_{frames}^{(s)}}, (5)

where N_{wg}^{(i;j;s)} corresponds to the number of times that Gaussian "i", which belongs to state model "j", gives the highest likelihood for the data associated to state "s", and N_{frames}^{(s)} is the number of frames associated to state "s".

The WGP measure is a discrete probability function, and may be inaccurate depending on the amount of data available. In this manner, it seems more appropriate to define a new continuous function for the GIM.


The proposed discriminative method for determining the most suitable number of Gaussians per state differs from others [13], [16] in the sense that the algorithm starts from a well-trained system and indicates which Gaussians should be eliminated using the new GIM, instead of likelihood measures.

Assuming for simplicity a 2-dimensional feature space, the new method is based on the hypervolume probability. Basically, in this case, the contribution of each data sample to the GIM is given by the volume outside the ellipse indicated in Figure 1.

Fig. 1. Multidimensional Gaussian.

In fact, in the present work all multidimensional Gaussians N(\mu, \Sigma) are represented in a 39-dimensional acoustic space, and the probability density function is given by Equation (6),

f(O_t) = \frac{1}{(2\pi)^{dim/2} |\Sigma|^{1/2}} e^{-(x-\mu)' \Sigma^{-1} (x-\mu)/2}, (6)

where |\Sigma| is the determinant of the covariance matrix. If the parameters are statistically independent (diagonal covariance matrix), then the pdf can be written as in Equation (7),

f(O_t) = \prod_{d=1}^{dim} \frac{1}{\sqrt{2\pi\sigma_{d}^{2}}} e^{-(x_d - \mu_d)^{2} / 2\sigma_{d}^{2}}. (7)
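For reference, Equation (7) in code form for a single diagonal-covariance Gaussian, evaluated in the log domain for numerical stability (a straightforward transcription, not the paper's implementation):

```python
import numpy as np

def diag_gaussian_pdf(o_t, mu, var):
    """f(O_t) = prod_d (2*pi*var_d)^(-1/2) * exp(-(x_d - mu_d)^2 / (2*var_d))."""
    log_f = -0.5 * np.sum(np.log(2.0 * np.pi * var) + (o_t - mu) ** 2 / var)
    return np.exp(log_f)
```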

For each Gaussian, the contribution along all dimensions to the GIM can be calculated by Equation (8),

GIM(O_t)^{(i;j;s)} = \prod_{d=1}^{dim} \left( 1 - \frac{2}{\sqrt{\pi}} \int_{0}^{\|z_d\|} e^{-(z_d)^{2}} \, dz_d \right), (8)

where "dim" is the feature vector dimension, z_d = \frac{x_d - \mu_{dij}}{\sqrt{2}\,\sigma_{dij}}, and O_t = (x_1, x_2, \ldots, x_{dim}) is the feature vector. The values \mu_{dij} and \sigma_{dij} correspond, respectively, to the mean and standard deviation along dimension "d" of Gaussian "i" that belongs to state "j".

The GIM of Gaussian "i", which belongs to state "j", is calculated for every data sample of state "s", so that the mean value of the GIM can be obtained with respect to each existing state, as shown in Equation (9),

P_{GIM}^{(i;j;s)} = \frac{\sum_{t=1}^{N_s} GIM(O_t)^{(i;j;s)}}{N_s}, (9)

where N_s is the number of data samples of state "s".

Fig. 2. Contribution along each dimension to the GIM.

The new proposed measure (GIM) is based on the probability of the feature samples being outside the interval

\mu_d - \|x_d - \mu_d\| < x < \mu_d + \|x_d - \mu_d\|,

as shown in Figure 2. Moreover, the closer the feature sample is to the Gaussian mean value, the higher is its contribution to the GIM along the analyzed dimension.
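Since the integral in Equation (8) is the error function, the per-dimension contribution is simply erfc(|z_d|); a small sketch combining Equations (8) and (9), assuming diagonal covariances and NumPy/SciPy (array layouts are assumptions):

```python
import numpy as np
from scipy.special import erfc

def gim_per_frame(frames, mu, sigma):
    """GIM(O_t)^(i;j;s) = prod_d (1 - (2/sqrt(pi)) * int_0^{|z_d|} exp(-t^2) dt)
                        = prod_d erfc(|z_d|),  z_d = (x_d - mu_dij) / (sqrt(2)*sigma_dij).
    frames: (N_s, dim) data samples of state s; mu, sigma: (dim,) of Gaussian i of model j."""
    z = np.abs(frames - mu) / (np.sqrt(2.0) * sigma)
    return np.prod(erfc(z), axis=1)          # one GIM value per frame

def p_gim(frames, mu, sigma):
    """P_GIM^(i;j;s): mean GIM over the N_s frames of state s, as in Equation (9)."""
    return float(np.mean(gim_per_frame(frames, mu, sigma)))
```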

The quantity P_{GIM}^{(i;j;s)} can be used as a measure of the importance of each Gaussian with respect to each existing state. In this manner, a discriminative Gaussian selection method can be implemented. The main objective is to maximize the discriminative relation so that each final model gives the highest P_{GIM}^{(i;j;s)} for the correct patterns and, at the same time, the smallest P_{GIM}^{(i;j;s)} for the wrong patterns. This can be done by the maximization of Equation (10),

DC(j) = \frac{\left[\sum_{i=1}^{M_j} P_{GIM}^{(i;j;j)}\right]^{K}}{\left[\sum_{s \neq j}^{N} \sum_{i=1}^{M_j} P_{GIM}^{(i;j;s)}\right] / (N-1)}, (10)

where K is the rigour exponent, M_j is the number of Gaussians in state "j" and N is the total number of states. If the logarithm of the Discriminative Constant (DC) is taken, the resulting expression in Equation (11) is similar to Equation (4), in the sense that the first term measures the capacity of modeling the correct patterns and the second is a penalty term:

\log DC(j) = K \log \sum_{i=1}^{M_j} P_{GIM}^{(i;j;j)} - \log \frac{\sum_{s \neq j}^{N} \sum_{i=1}^{M_j} P_{GIM}^{(i;j;s)}}{N-1}. (11)

The difference resides in the second term, which is a penalization due to the number of parameters in the model in Equation (4), and a penalization due to the interference in the other state models in Equation (11).

The main idea is to eliminate Gaussians from a well-trained model with a fixed number of Gaussian components per state and to observe the new DC value obtained. The Discriminative Constant (DC) given above may increase or decrease depending on the relevance of the eliminated Gaussian. In this manner, the rigour exponent plays an important role in the Gaussian selection, since it makes the discriminative criterion DC more restrictive; that is, the greater the exponent, the more rigorous the criterion is and therefore the fewer Gaussians are eliminated.
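The paper does not give pseudocode for this elimination loop, so the sketch below is only one plausible reading: each Gaussian of state "j" is tentatively removed and the removal is kept only if log DC(j) of Equation (11) does not decrease. The P_GIM matrix layout and the acceptance rule are assumptions:

```python
import numpy as np

def log_dc(p_gim, keep, j, K):
    """log DC(j) per Equation (11).  p_gim[i, s] = P_GIM^(i;j;s) for Gaussian i of
    state model j evaluated against state s; keep is a mask of surviving Gaussians."""
    n_states = p_gim.shape[1]
    own = np.sum(p_gim[keep, j])                               # fit to the correct state
    others = (np.sum(p_gim[keep, :]) - own) / (n_states - 1)   # interference in other states
    return K * np.log(own) - np.log(others)

def discriminative_selection(p_gim, j, K):
    """Tentatively drop each Gaussian of state j; keep the drop only if log DC(j)
    does not decrease.  Larger K makes the acceptance condition harder to satisfy,
    so fewer Gaussians are eliminated."""
    keep = np.ones(p_gim.shape[0], dtype=bool)
    for i in range(p_gim.shape[0]):
        trial = keep.copy()
        trial[i] = False
        if trial.any() and log_dc(p_gim, trial, j, K) >= log_dc(p_gim, keep, j, K):
            keep = trial
    return keep
```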

The procedure described above is applied to each HMM state model. Once the discriminative Gaussian selection is finished, the resulting model is trained by means of the Baum-Welch algorithm.

It is also important to observe that the discriminative algorithm only detects Gaussians that converged to the wrong distributions during training. However, there may still exist exceeding Gaussians after applying the discriminative algorithm. Although these Gaussians are not detected by the discriminative algorithm, they should be discarded, since this can be done without any degradation of the model's classification capacity. In this manner, an additional distance-based algorithm is applied to eliminate the exceeding Gaussians.

B. Gaussian Selection Based on a Euclidean Distance Measure

Initially, a distance threshold is necessary for determining groups of Gaussian components that should be replaced by just one. In addition, a different threshold should be used for Gaussians in the boundaries and for Gaussians in the central part of the distribution. Then, the Euclidean distance is calculated between all Gaussians of a specific state, and the exceeding components are replaced by the Gaussian with the highest determinant of the covariance matrix.

After that, P_{GIM}^{(i;j;s)} was used in order to give an indication of which Gaussians are in the boundaries, and also to give a different weight for these Gaussians and those in the central part of the distribution. Therefore, a modified Euclidean distance measure was defined in Equation (12),

Md_{xy} = \frac{d_{xy} \sum_{i=x,y} P_{GIM}^{(i;j;j)}}{\sum_{i=x,y} P_{GIM}^{(i;j;j)} + \sum_{s \neq j} \sum_{i=x,y} P_{GIM}^{(i;j;s)}}, (12)

where d_{xy} is given by

d_{xy} = \sqrt{(\mu_x - \mu_y) \cdot (\mu_x - \mu_y)^{T}}, (13)

and \mu_x and \mu_y are the mean vectors of Gaussian components x and y, respectively.
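A rough sketch of this distance-based stage under the same assumptions: all pairwise modified distances of Equations (12)-(13) are computed for the Gaussians of state "j", and for any pair below the threshold only the Gaussian with the larger covariance determinant is kept, as described above. The single greedy pass and the data layout are assumptions, not the paper's exact grouping strategy:

```python
import numpy as np

def modified_distance(mu_x, mu_y, w_own, w_other):
    """Md_xy per Equations (12)-(13): the Euclidean distance between the means,
    weighted by the pair's P_GIM mass on its own state relative to all states."""
    d_xy = np.linalg.norm(mu_x - mu_y)
    return d_xy * w_own / (w_own + w_other)

def merge_close_gaussians(means, diag_covars, p_gim, j, threshold):
    """Greedy pass over all pairs of Gaussians of state j: discard the Gaussian
    with the smaller covariance determinant whenever Md_xy < threshold.
    means: (M, dim); diag_covars: (M, dim); p_gim[i, s] = P_GIM^(i;j;s)."""
    m = len(means)
    keep = np.ones(m, dtype=bool)
    for x in range(m):
        for y in range(x + 1, m):
            if not (keep[x] and keep[y]):
                continue
            w_own = p_gim[x, j] + p_gim[y, j]
            w_other = p_gim[x, :].sum() + p_gim[y, :].sum() - w_own
            if modified_distance(means[x], means[y], w_own, w_other) < threshold:
                # determinant of a diagonal covariance = product of the variances
                drop = x if np.prod(diag_covars[x]) < np.prod(diag_covars[y]) else y
                keep[drop] = False
    return keep
```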

VII. EXPERIMENTS

A. SVCS Database

In the experiments, the proposed method was used to optimize the baseline system containing 12 Gaussians per state (1296 Gaussians), which gives the highest percentage of correctly recognized words (81.01%) for this database. The tests with the SVCS database have shown that the new GEA was able to outperform both the Bayesian Information Criterion [18] and the entropy-based method presented in section IV.

The highest performances given by the varying-number-of-Gaussians-per-state systems obtained from the three methods, as well as the empirical parameters used, are presented in Table I.

The GEA performance depends on the segmentation used in the discriminative analysis, and also on the effective reduction of the exceeding Gaussians present in the models. The SVCS database is not manually segmented; in this case, the Viterbi alignment against the correct transcriptions was used to obtain the reference segmentation.

TABLE I
THE FOLLOWING INCREMENTS IN THE NUMBER OF GAUSSIANS WERE TESTED: 40, 50, 60, 70, 80, 90 AND 100; THE REGULARIZATION PARAMETER (λ) ASSUMED THE VALUES 0.3, 0.2, 0.15, 0.1, 0.07, 0.05, 0.03 AND 0.01; THE DISTANCE THRESHOLDS (Md_xy) 7, 6, 5.5, 5, 4.5, 4, 3.5 AND 0 WERE USED IN THE EXPERIMENTS, WITH K = 10^6.

Method (best parameter)           | Reduced model size | Economy of Gaussians (%) | Percentage Correct (%) | Difference in performance (%)
Entropy-based (increment N = 70)  | 1040               | 19.8                     | 81.62                  | +0.61
BIC (λ = 0.05)                    | 1110               | 14.4                     | 80.75                  | -0.26
GEA (Md_xy = 4.5)                 | 1055               | 18.6                     | 82.34                  | +1.33

It therefore seems important to evaluate the quality of the segmentation used and also to investigate which recognition system is the most appropriate for the Viterbi alignment task. In order to do so, the following experiments were performed with the TIMIT corpus, which is manually segmented.

B. TIMIT Database

The first step was to obtain three baseline systems for the TIMIT corpus. The performances of these systems are shown in Table II.

TABLE II
BASELINE SYSTEMS (8, 16 AND 32 GAUSSIANS PER STATE).

Number of Gaussians | Percentage Correct (%) | Recognition Accuracy (%)
1152                | 70.94                  | 63.7
2304                | 73.03                  | 65.87
4608                | 73.11                  | 66.14

After that, the new GEA was applied to optimize the complexity of each of them. It is important to highlight that the baseline system containing 4608 Gaussians was used for generating the reference segmentation, which is used in the discriminative analysis. As there is also a manual segmentation for this database, the quality of the segmentation obtained from the Viterbi alignment will be evaluated.

1) Reduction of the Baseline Systems' Complexity: The GEA was able to determine reduced systems which presented at least the same performance as the original systems. This was accomplished by starting from each baseline system and then varying the rigour exponent K and the distance threshold Md_xy. In such experiments, the economy of Gaussians was emphasized, constrained to at least the same performance as the original system.


TABLE III
SYSTEMS OBTAINED AFTER APPLYING THE GEA USING AN HMM SEGMENTATION.

Baseline system | Md_xy | K  | Num. of Gaussians | Recog. Accuracy (%) | Economy of Gaussians (%)
1152 Gaussians  | 5     | 50 | 1114              | 63.72 (+0.02)       | 3.3
2304 Gaussians  | 0     | 5  | 2056              | 65.91 (+0.04)       | 10.8
4608 Gaussians  | 14    | 10 | 3825              | 66.51 (+0.37)       | 17.0

The results obtained from the GEA for the TIMIT corpus are shown in Table III.

As can be noted, the highest economy (17%) was achieved from the 4608-Gaussian baseline system, from which 783 Gaussians were eliminated without any loss in performance. In addition, the smaller the system, the lower the economy obtained, which seems reasonable, since the model topology selection process is constrained to at least the same performance as the original system. The lack of acoustic resolution makes it difficult to distinguish different acoustic patterns, which may increase the recognition errors and thus must be avoided.

2) The Quality of the Segmentation Used: The segmentation used in the discriminative analysis is important for labeling each acoustic vector and thus for calculating the DC value of each state model. In this manner, it is interesting to define a methodology for choosing the most appropriate recognition system for the Viterbi alignment task. The percentage of errors observed in the segmentation [19], [20] generated by the 4608-Gaussian baseline system is presented in Table IV.

TABLE IV
SEGMENTATION ERROR FOR THE 4608-GAUSSIAN BASELINE SYSTEM.

Segmentation error threshold | Percentage of errors ≤ threshold (%)
≤ 10 ms                      | 48.2
≤ 20 ms                      | 81.2
≤ 30 ms                      | 92.7
≤ 40 ms                      | 96.3
≤ 50 ms                      | 98.1

As can be noted, 98.1% of the segmentation errors are smaller than 50 ms, that is, up to 5 frames apart from the correct position. The question that arises at this point is related to the appropriate choice of the recognition system, that is, if there were several systems available, which one should be chosen for generating the segmentation. The apparently obvious solution is to choose the system that gives the highest recognition performance. In order to evaluate this approach, 20 different recognition systems with different topologies and performances were obtained from the GEA. The correlation between the recognition performance, in terms of phone recognition rate and accuracy, and the quality of the segmentation generated by the Viterbi alignment (compared against the manual segmentation of the TIMIT database) was calculated for each system. The results are shown in Figures 3 and 4 for the percentage of segmentation errors smaller than 50 ms.
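The correlation coefficients reported in Table V can be obtained directly from the 20 per-system measurements, for example as below (variable names are illustrative):

```python
import numpy as np

def segmentation_performance_correlation(seg_quality, performance):
    """seg_quality[k]: fraction of segmentation errors below a given threshold for
    system k; performance[k]: phone recognition rate or accuracy of system k.
    Returns the Pearson correlation coefficient between the two series."""
    return float(np.corrcoef(seg_quality, performance)[0, 1])
```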

[Figure: scatter plot; x-axis: percentage of segmentation errors smaller than 50 ms (0.974-0.982); y-axis: accuracy (%), 62.5-67.]
Fig. 3. Relation between the percentage of segmentation errors and the phone recognition rate.

[Figure: scatter plot; x-axis: percentage of segmentation errors smaller than 50 ms (0.974-0.982); y-axis: phone recognition rate (%), 70-74.]
Fig. 4. Relation between the percentage of segmentation errors and the recognition accuracy.

The correlation coefficients between the percentage of segmentation errors and the recognition performances, for different error thresholds, are presented in Table V.

TABLE V
CORRELATION COEFFICIENTS (C.C.) BETWEEN THE PERCENTAGE OF SEGMENTATION ERRORS AND THE RECOGNITION PERFORMANCE.

Segmentation Error | Phone Recognition Rate | Recognition Accuracy
≤ 10 ms            | c.c. = 0.28            | c.c. = 0.34
≤ 20 ms            | c.c. = 0.05            | c.c. = 0.11
≤ 30 ms            | c.c. = 0.07            | c.c. = 0.12
≤ 40 ms            | c.c. = 0.29            | c.c. = 0.34
≤ 50 ms            | c.c. = 0.58            | c.c. = 0.61

The results show that the correlation between the quality of the segmentation and the recognition performance was not high (c.c. ≤ 0.61). In addition, the percentage of segmentation errors is more related to the recognition accuracy than to the phone recognition rate.


VIII. CONCLUSION

The new GEA outperformed the BIC and the entropy-based method in determining systems with a better compromise between complexity and recognition performance. In addition, the reduced system also gave an increase of 1.33% in performance when compared to the baseline system that gives the highest recognition performance for the SVCS database used in the experiments.

The experiments with the TIMIT database have shown that the new GEA was able to determine reduced systems with at least the same performance as the original systems. It was also verified that the correlation between the percentage of segmentation errors and the recognition performance is not high (c.c. ≤ 0.61 for the conditions used in the experiments). Furthermore, the recognition accuracy seems to be more appropriate than the phone recognition rate for choosing the recognition system for the Viterbi alignment task; that is, the system that gives the highest accuracy should be used in the Viterbi alignment task in order to obtain the segmentation for the GEA.

Future work will extend the GEA concepts to HMM complexity control for context-dependent phone models and also to other pattern recognition problems.

IX. ACKNOWLEDGEMENT

We would like to thank CNPq (Conselho Nacional de Desenvolvimento Científico e Tecnológico) for funding this research.

REFERENCES

[1] L. A. Aguirre, An Introduction to Systems Identification - Linear and Non-linear Techniques Applied to Real Systems (in Portuguese). Editora UFMG, 2000.
[2] L. R. Bahl, F. Jelinek, and R. L. Mercer, "A maximum likelihood approach to continuous speech recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 5, pp. 179-190, 1983.
[3] C. A. Ynoguti, "Continuous speech recognition using hidden Markov models (in Portuguese)," Ph.D. dissertation, Universidade Estadual de Campinas, 1999.
[4] The HTK Book, Cambridge University Engineering Department, 2002.
[5] S. Kapadia, V. Valtchev, and S. J. Young, "MMI training for continuous phoneme recognition on the TIMIT database," in IEEE International Conference on Acoustics, Speech and Signal Processing, 1993.
[6] L. F. Lamel and J. L. Gauvain, "High performance speaker-independent phone recognition using CDHMM," in Eurospeech 1993, 1993.
[7] K.-F. Lee and H.-W. Hon, "Speaker-independent phone recognition using Hidden Markov Models," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 37, no. 11, pp. 1641-1648, 1989.
[8] K.-F. Lee, H.-W. Hon, and M.-Y. Hwang, "Recent progress in the SPHINX speech recognition system," in Workshop on Speech and Natural Language, 1989.
[9] Y. J. Chung and C. K. Un, "Use of different number of mixtures in continuous density hidden Markov models," Electronics Letters, vol. 29, no. 9, pp. 824-825, 1993.
[10] I. Csiszár and P. C. Shields, "The consistency of the BIC Markov order estimator," in IEEE International Symposium on Information Theory, 2000.
[11] M. A. T. Figueiredo and A. K. Jain, "Unsupervised learning of finite mixture models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 3, pp. 381-396, 2002.
[12] S. S. Chen and P. S. Gopalakrishnan, "Clustering via the Bayesian information criterion with applications in speech recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing, 1998.
[13] A. Biem, "Model selection criterion for classification: Application to HMM topology optimization," in 7th International Conference on Document Analysis and Recognition (ICDAR'03), 2003.
[14] L. R. Bahl and M. Padmanabhan, "A discriminant measure for model complexity adaptation," in IEEE International Conference on Acoustics, Speech and Signal Processing, 1998.
[15] Y. Gao, E.-E. Jan, M. Padmanabhan, and M. Picheny, "HMM training based on quality measurement," in IEEE International Conference on Acoustics, Speech and Signal Processing, 1999.
[16] M. Padmanabhan and L. R. Bahl, "Model complexity adaptation using a discriminant measure," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 2, pp. 205-208, 2000.
[17] G. F. G. Yared, F. Violaro, and L. C. Sousa, "Gaussian Elimination Algorithm for HMM complexity reduction in continuous speech recognition systems," in Interspeech - Eurospeech 2005, 2005.
[18] V. Valtchev, "Discriminative methods in HMM-based speech recognition," Ph.D. dissertation, University of Cambridge, 1995.
[19] D. T. Toledano, L. A. H. Gómez, and L. V. Grande, "Automatic phonetic segmentation," IEEE Transactions on Speech and Audio Processing, vol. 11, no. 6, pp. 617-625, 2003.
[20] D. T. Toledano and L. A. H. Gómez, "Local refinement of phonetic boundaries: A general framework and its application using different transition models," in Eurospeech 2001, 2001.
