A comparison of mel-frequency cepstral coefficient (MFCC) calculation techniques


A comparison of mel-frequency cepstral coefficient (MFCC) calculation techniques

    Amelia C. Kelly and Christer Gobl

Abstract: Unit selection speech synthesis involves concatenating segments of speech contained in a large database in such a way as to create novel utterances. The sequence of speech segments is chosen using a cost function. In particular, the join cost determines how well consecutive speech segments fit together, by extracting acoustic parameters from frames of speech on either side of a potential join point and calculating the distance between them. The mel-frequency cepstral coefficient (MFCC) is a popular numerical representation of acoustic signals and is widely used in the fields of speech synthesis and recognition. In this paper we investigate some of the parameters that affect the calculation of the MFCC, particularly (i) the window length used to examine the speech segments, (ii) the time-frequency pre-processing performed on the signal, and (iii) the shape of the filters used in the mel filter bank. We show with experimental results that the choices of (i)-(iii) have a direct impact on the MFCC values calculated, and hence on the ability of the distance measure to predict discontinuity, which in turn has a significant impact on the ability of the synthesiser to produce quality speech output. In addition, while previous research tended to focus on sonorant sounds such as vowels, diphthongs and nasals, the speech data used in this study has been classified into the following three groups according to their acoustic characteristics: 1. vowels (non-turbulent, periodic sounds), 2. voiced fricatives (turbulent, periodic sounds), 3. voiceless fricatives (turbulent, non-periodic sounds). The choice of (i) is shown to affect the calculation of MFCC values significantly, and differently for each sound group. One possible application of these findings is altering the cost function in unit selection speech synthesis so that it accounts for the type of sound being joined.

Index Terms: Feature extraction, signal analysis, mel-frequency cepstral coefficients (MFCC), speech synthesis.

    1 INTRODUCTION

THE cost function in unit selection speech synthesis [1] is a measure of how well a sequence of candidate units represents the target utterance, which is the input to the system, in the form of text. The cost function has two major components, the target cost and the join cost. In this paper we focus on the join cost, particularly on the measurement of spectral discontinuity.

Consider the speech segments /k-ae/ and /ae-t/ that make up the word cat. Calculating the spectral discontinuity between these two segments requires that the spectral characteristics of each be quantified by numerically coding the sounds as vectors of acoustic measurements. More specifically, the numeric representation is taken only on the portions of each sound that will be coming into contact, i.e. at the end of the first segment /k-ae/ and at the start of the second segment /ae-t/, and so the speech sounds are windowed before numerically coding the signals. The acoustic measurement used to represent the sound must therefore be one for which perceived changes in the sound are accurately mirrored by numerical changes in the acoustic representation.

Here we focus on the MFCC, a non-linear spectral representation based on the Fourier transform, and on the distance measures computed from it. The first step in calculating the MFCC values is to take the Fourier transform (FT) of a window of the speech signal.

The FT of a signal $x(t)$ is given by:

$$X(f) = \int_{-\infty}^{\infty} x(t)\, e^{-j 2\pi f t}\, dt$$
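As a concrete illustration of this first step, the following is a minimal sketch in Python with NumPy; the function name, framing convention and default sampling rate are our own choices, not details from the paper. It extracts a frame of a given length ending at a candidate join point, applies a Hanning window (as in Section 2) and returns the magnitude spectrum.

```python
import numpy as np

def frame_spectrum(signal, join_idx, win_len_s, fs=16000):
    """Magnitude spectrum of a Hanning-windowed frame ending at join_idx.

    The frame is taken on one side of a candidate join point, mirroring
    how the segments on either side of a join are analysed.
    """
    n = int(win_len_s * fs)                # frame length in samples
    frame = signal[join_idx - n:join_idx]  # samples just before the join
    windowed = frame * np.hanning(n)       # taper to reduce spectral leakage
    return np.abs(np.fft.rfft(windowed))   # one-sided magnitude spectrum
```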

The FT is a special case of the more general fractional Fourier transform (FRFT). The FRFT depends on a parameter $a$, which determines the angle of rotation in the time-frequency domain: $a = 0$ gives the identity transform, and $a = \pi/2$ gives the classical Fourier transform. Angles of $a$ between these two values correspond to a transform that lies somewhere between the time and frequency domains. Numerical representations of speech sounds based on time-frequency analysis have been shown to yield promising results in the area of speech recognition. The $p$-th order fractional Fourier transform of a signal is given by:

$$X_p(u) = \int_{-\infty}^{\infty} K_p(u, u')\, x(u')\, du'$$

where the kernel $K_p(u,u')$ is defined as:

$$K_p(u,u') = \sqrt{1 - j\cot\alpha}\,\exp\!\left[\, j\pi \left( u^2 \cot\alpha - 2 u u' \csc\alpha + u'^2 \cot\alpha \right) \right], \qquad \alpha = \frac{p\pi}{2}$$
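The sketch below evaluates this kernel numerically. It is a naive O(N^2) quadrature of the FRFT integral, written only to make the definition concrete; the paper uses a fast FRFT algorithm, and the grid scaling and handling of degenerate angles here are our own assumptions.

```python
import numpy as np

def frft_naive(x, a):
    """Naive discrete FRFT of order a (a = 1 gives the ordinary FT).

    Direct O(N^2) quadrature of the FRFT integral using the kernel
    K_p(u, u'); illustrative only, not a fast production algorithm.
    """
    x = np.asarray(x, dtype=complex)
    N = len(x)
    alpha = a * np.pi / 2                     # rotation angle
    if np.isclose(np.sin(alpha), 0):          # degenerate angles:
        # a = 0 (identity) or a = 2 (time reversal)
        return x.copy() if np.isclose(np.cos(alpha), 1) else x[::-1].copy()
    u = (np.arange(N) - N // 2) / np.sqrt(N)  # dimensionless sample grid
    cot, csc = np.cos(alpha) / np.sin(alpha), 1 / np.sin(alpha)
    A = np.sqrt(1 - 1j * cot)                 # amplitude factor
    # kernel matrix K[m, n] = K_p(u_m, u_n)
    K = A * np.exp(1j * np.pi * (np.add.outer(u**2, u**2) * cot
                                 - 2 * np.outer(u, u) * csc))
    return K @ x / np.sqrt(N)                 # quadrature step du' = 1/sqrt(N)
```

At a = 1 the kernel reduces to complex exponentials and the sum becomes a centred, unitary discrete Fourier transform, matching the statement that $a = \pi/2$ recovers the classical FT.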

Next the power spectrum of the windowed signal is calculated using:

$$P(k) = |X(k)|^2$$

It is convenient then to convert from the linear frequency scale to the mel frequency scale. The signal (represented on the mel scale) is then filtered by a series of uniformly spaced triangular windows of equal width. Frequency values in Hertz ($f$) are often transformed to mel frequencies ($f_{mel}$) using [2], [3]:

$$f_{mel} = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right)$$
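In code, this conversion and its inverse are one-liners (NumPy; the helper names are ours):

```python
import numpy as np

def hz_to_mel(f):
    """Hz to mel, using the O'Shaughnessy formula [2]."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Mel back to Hz (inverse of hz_to_mel)."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```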

A. C. Kelly and C. Gobl are with the Phonetics and Speech Laboratory, Centre for Language and Communication Studies (SLSCS), Trinity College Dublin, Ireland.


The mel filter bank $H_n(k)$ is constructed as described in [3], where the frequency response of the $n$-th filter is calculated using:

$$H_n(k) = \begin{cases} 0, & k < f(n-1) \\[4pt] \dfrac{k - f(n-1)}{f(n) - f(n-1)}, & f(n-1) \le k \le f(n) \\[8pt] \dfrac{f(n+1) - k}{f(n+1) - f(n)}, & f(n) \le k \le f(n+1) \\[4pt] 0, & k > f(n+1) \end{cases}$$

where the boundary points $f(n)$ of the filters are uniformly spaced on the mel scale:

$$f(n) = f_{mel}^{-1}\!\left( f_{mel}(f_{low}) + n\, \frac{f_{mel}(f_{high}) - f_{mel}(f_{low})}{N + 1} \right), \qquad n = 0, 1, \ldots, N+1$$

where $f_{high}$ and $f_{low}$ are the highest and lowest frequencies contained in the signal, and $N$ is the number of filters.

The log energy output of the $n$-th filter in the mel scale filter bank is calculated by weighting the power values given by $P(k)$ with the frequency response of the appropriate filter $H_n(k)$, such that:

$$E(n) = \log\!\left[ \sum_{k} P(k)\, H_n(k) \right]$$
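A sketch of the filter bank construction and the log energies, reusing the conversion helpers above. Mapping the boundary points onto FFT bin indices is a common implementation choice and is our assumption, not a detail given in the paper:

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, fs, f_low=0.0, f_high=None):
    """Triangular filters with boundary points uniform on the mel scale.

    Returns an (n_filters, n_fft // 2 + 1) matrix; row n is H_n(k)
    over the one-sided FFT bins.
    """
    f_high = f_high if f_high is not None else fs / 2.0
    # N + 2 boundary points, uniformly spaced on the mel axis
    mels = np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / fs).astype(int)
    H = np.zeros((n_filters, n_fft // 2 + 1))
    for n in range(1, n_filters + 1):
        lo, mid, hi = bins[n - 1], bins[n], bins[n + 1]
        H[n - 1, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)  # rising edge
        H[n - 1, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)  # falling edge
    return H

def log_filter_energies(P, H):
    """Log energy output E(n) of each filter for power spectrum P(k)."""
    return np.log(H @ P + 1e-12)  # small floor guards against log(0)
```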

The coefficients can then be extracted by calculating the Discrete Cosine Transform (DCT) of the log of the output energies of the filters, using the following equation, taken from [4]:

$$c_m = \sum_{n=1}^{N} E(n) \cos\!\left[ m \left( n - \frac{1}{2} \right) \frac{\pi}{N} \right], \qquad m = 1, 2, \ldots, M$$

where $M$ is the number of MFCCs and $N$ is the number of mel-scaled filters.
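The pieces above chain into a complete MFCC computation. A hedged end-to-end sketch, reusing mel_filterbank from the sketch above; the coefficient count of 12 follows the paper's experimental setup, while the filter count is our assumption since the paper does not state N:

```python
import numpy as np

def mfcc(frame, fs=16000, n_filters=24, n_coeffs=12):
    """MFCCs of one frame: window -> FFT -> power -> mel filter bank ->
    log -> DCT, following the pipeline described above.

    n_filters = 24 is an assumed value; the paper does not state N.
    """
    n_fft = len(frame)
    X = np.fft.rfft(frame * np.hanning(n_fft))
    P = np.abs(X) ** 2                        # power spectrum P(k)
    H = mel_filterbank(n_filters, n_fft, fs)  # triangular mel filters
    E = np.log(H @ P + 1e-12)                 # log filter energies E(n)
    N = len(E)
    n = np.arange(1, N + 1)
    # DCT from Davis and Mermelstein [4]: c_m = sum_n E(n) cos[m(n - 1/2)pi/N]
    return np.array([np.sum(E * np.cos(m * (n - 0.5) * np.pi / N))
                     for m in range(1, n_coeffs + 1)])
```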

Recent studies [5], [6], [7] attempting to compare spectral representations of sound have investigated the effects of changing the width of the windowing function used to derive the acoustic parameterisation. These studies found that windowing functions of particular widths yielded higher correlations between the acoustic representation and human perception of the join.

Clearly, window length plays a role in calculating acoustic parameters such as the MFCC: since the MFCCs are calculated by first taking the FT of the windowed speech segment, the width of the windowing function will directly affect the MFCC values extracted from the signal. This in turn affects the join cost penalty incurred by the potential concatenation. Despite this well-known fact, there is remarkable variation in the choices of window length used in the studies undertaken to find the most useful metric. Furthermore, there has been little theoretical examination of this effect through which the role of windowing could be more intuitively understood. A number of studies [8], [9], [10], [11], [12], [13] investigating which numeric representation of sound best reflects human perception have compared different numeric representations of speech sounds by assessing listeners' ability to detect a join. Although some of these and similar studies conclude that one metric outperforms the others, the highest correlation achieved between such metrics and human perception of discontinuity is roughly 0.7. Furthermore, each study cites a different method as the one that best reflects human detection of joins. For example, Wouters and Macon [8], using a 5 ms frame of speech to compare metrics, concluded that the MFCC outperformed the others, while Chen and Campbell [10] cited the bispectrum as the best metric, using 20 ms frames of speech for their calculations.

It is not possible to realistically compare the relative performance of these different metrics unless proper attention is paid to one of the first and most fundamental operations in the signal processing stage, namely what choice of window length will produce a numerical representation that best predicts discontinuities in speech.

In this paper we investigate (i) the impact of window length, (ii) the impact of the fractional Fourier transform angle $a$, and (iii) the shape of the filters used in the mel filter bank (rectangular, triangular or Gaussian) on MFCC calculation, and whether the performance of the distance measure varies for three different types of speech sounds: vowels, voiced fricatives and voiceless fricatives. The Euclidean distance was calculated between vectors of MFCC values extracted from windowed speech segments in each of the three sound categories. The measurement was first calculated for segments of speech that were naturally consecutive, and then between examples of speech from non-matching phonemes from the same sound category, in order to simulate what would be considered a bad join.

We have chosen to perform the experiments in this manner so as to remove the statistical uncertainty normally associated with small numbers of human observers, thus providing experimental evidence, the first to our knowledge, that directly measures the influence of (i)-(iii) on MFCC values for different types of speech sounds. The results show that the metric is significantly sensitive to both the type of sound being measured and the length of the function used to window the signals, with shorter window lengths giving better results for groups 1 and 2, and longer window lengths giving better results for group 3. This in turn suggests that the performance of unit selection speech synthesis systems may be improved if the cost function were designed to account for these dependencies.

    2 EXPERIMENTAL PROCEDURE

This experiment was designed to examine the effects of changing the value of:

i. the window length used to examine the speech segments,

ii. the time-frequency analysis performed on the signal, and


iii. the shape of the filters used in the mel filter bank,

on the calculation of MFCCs, and to investigate which types of speech sounds from the following groups are the most sensitive to these changes:

Group 1: vowels (non-turbulent, periodic sounds).

Group 2: voiced fricatives (turbulent, periodic sounds).

Group 3: voiceless fricatives (turbulent, non-periodic sounds).

The MFCC values were first calculated for consecutive segments of the test stimuli. The speech samples examined were 295 examples of /aa/, 531 examples of /i/ and 224 examples of /u/ for the vowel category; 590 examples of /f/, 915 examples of /s/ and 323 examples of /sh/ for the voiceless fricatives category; and 163 examples of /v/ and 326 examples of /dh/ for the voiced fricatives category. The samples were halved at the midpoint and the MFCC values were extracted for different values of (i), (ii) and (iii). The Euclidean distances between consecutive vectors of MFCCs were then calculated. If the MFCC is a good numerical representation of speech signals, then these values are expected to be low, as the segments of speech are naturally consecutive.

The MFCC values were then calculated for non-consecutive segments of the test stimuli within each sound category, in such a way as to create 'bad' joins, i.e. joining spectrally dissimilar speech sounds which still have the same acoustic characteristics as defined by the three sound groups. To create these perceptually 'bad' joins, 220 examples of each sound in the category were joined to 220 examples of the other sound in the category. Again, the samples were halved at the midpoint and the MFCC values were extracted for different values of (i), (ii) and (iii). In this case the Euclidean distance between the non-consecutive MFCC vectors was calculated. If the MFCC is a good numerical representation of speech signals, then these values are expected to be high, as the segments of speech are naturally non-consecutive.

All speech examples were taken from the RMS (American male) recordings of the CMU Arctic database [14], sampled at 16 kHz. 100 ms of each sound was examined. Each example was windowed using a Hanning windowing function on either side of the midpoint, to minimise the spectral leakage introduced by windowing, and the fast fractional Fourier transform (FRFT) was calculated on the windowed segments, with window lengths ranging from 10 ms to 50 ms in steps of 5 ms, and angles of $a$ varying from 0 to $\pi/2$ (the classical Fourier transform). MFCCs were calculated using a filter bank of triangular filters as described by Memon et al. [3]. The Euclidean distance was then calculated between the two vectors of 12 MFCCs representing the speech sounds on either side of the midpoint. The Euclidean distance is calculated using:

$$d(\mathbf{p}, \mathbf{q}) = \sqrt{ \sum_{i=1}^{12} (p_i - q_i)^2 }$$

where $\mathbf{p}$ and $\mathbf{q}$ are vectors of MFCC values.

The Euclidean distance between MFCC values was used as a measure of the efficacy of the MFCC as an objective measurement of spectral discontinuity. The difference, $D$, was measured, which is the Euclidean distance calculated between non-consecutive segments minus the Euclidean distance calculated between consecutive segments. This discrimination value $D$ is taken to be a measurement of how well the metric can distinguish between natural and joined speech, and is defined as follows:

$$D = d_{non\text{-}consecutive} - d_{consecutive}$$
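The two measurements combine into a few lines of code. A sketch follows; averaging the per-pair distances before differencing is our reading of the procedure, not an explicit statement in the paper:

```python
import numpy as np

def euclidean(p, q):
    """Euclidean distance between two vectors of 12 MFCCs."""
    p, q = np.asarray(p), np.asarray(q)
    return np.sqrt(np.sum((p - q) ** 2))

def discrimination(nonconsec_pairs, consec_pairs):
    """Discrimination value D: mean distance across simulated bad joins
    minus mean distance across naturally consecutive segments."""
    d_non = np.mean([euclidean(p, q) for p, q in nonconsec_pairs])
    d_con = np.mean([euclidean(p, q) for p, q in consec_pairs])
    return d_non - d_con
```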

The results were analysed and are presented in the following section.

    3 RESULTS

In this section, the results of the experiment are presented. The general effects of changing parameters (i), (ii) and (iii) are first examined, and then discussed in relation to the particular sound categories.

Changes in each of the parameter values impacted each sound category to a different degree. MFCC calculation was by far the most sensitive to changes in window length, as evidenced by the effect on $D$ values compared with the effect imposed by $a$. Changes in $a$ had little or no effect on the MFCC values calculated and did not significantly affect the $D$ values.

For vowel sounds, the window size significantly affected the calculated value of $D$, while changes in $a$ and filter type produced slight changes that were not significant. The MFCC best represents vowel sounds, with the $D$ values significantly higher than for the other sound groups. The metric performed best with window sizes of 25 ms and $a$ = 90 degrees (the classical FFT).

For voiced fricative sounds, the window size significantly affected the calculated value of $D$, while the effect of changing $a$ and filter type was insignificant. The MFCC performed the worst for the voiced fricative category compared to vowels and voiceless fricatives, and in this category performed best for window lengths of 10 ms and $a$ = 72 degrees.


For voiceless fricative sounds, the window size significantly affected the calculated value of $D$, while changes in $a$ and filter type produced slight changes that were not significant. Contrary to vowels and voiced fricatives, the MFCC performed best for voiceless fricatives when a large window length of 45 ms was used, with $a$ = 27 degrees. The metric did not perform as well for this category as it did for vowels, but was significantly better than for voiced fricatives.

    4 DISCUSSION

The MFCC is frequently used as a numerical representation of speech sounds. Representing a speech signal in the cepstral domain essentially performs a source-filter separation, making it possible to represent the spectral characteristics of the speech sound using the first few coefficients. Changes in these values represent changes in the spectral distribution of the signal. Theoretically, this makes the MFCC a good indicator of the similarity of two speech sounds, and provides a method of detecting discontinuities between two concatenated segments of speech. Furthermore, the MFCC is said to be a particularly faithful representation because the frequency scale is warped to a mel scale so that it more closely resembles the non-linear response of the ear. Many studies have tested the efficacy of metrics such as the MFCC, and have relied heavily on perceptual results as an indicator. These studies are rarely in agreement, however. The calculation of MFCC values relies heavily on a number of fundamental parameters, including (i) the length of the windowing function used to segment the speech signal, (ii) the time-frequency analysis performed on the signal and (iii) the shape of the filters used in the mel filter bank. In this study the MFCC values were calculated for different values of each of these parameters in order to investigate the extent to which they affect the MFCC calculation. Furthermore, the values have been tested on three different types of test stimuli, which differ based on their inherent acoustic characteristics.

Measuring how the MFCC values are affected by changes in (i), (ii) and (iii) has an important impact on unit selection speech synthesis, particularly with a view to optimising the cost function. This study demonstrated the extent to which MFCC values are affected by changes in (i), (ii) and (iii) by showing that certain combinations of these parameters yield a greater distinction between consecutive and non-consecutive speech for each group of sounds, and are therefore more adept at detecting the presence of a large spectral distance between concatenated segments. In a unit selection speech synthesis system, the MFCC values are calculated for the database and a cost is assigned to a sequence of units that reflects how well the segments would join together based on the difference between these measurements. This study has shown objectively not only the importance of carefully selecting the fundamental parameters used to calculate the MFCCs, but also that the choice of these parameters should differ depending on the type of sound being examined. One application of these findings would be to revise the traditional unit selection algorithm so that it takes into account the type of sound at which a concatenation occurs; a sketch of this idea follows. Future work in this area will be to incorporate these findings into a unit selection speech synthesiser and evaluate the resulting synthesised speech.
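As a purely illustrative sketch of that application, a join cost could look up MFCC analysis settings by sound class before measuring the distance. The class labels and settings mirror the best-performing combinations reported in Section 3, but the function signature and overall structure are hypothetical, not part of the paper:

```python
# Best-performing settings per sound class, taken from Section 3.
BEST_SETTINGS = {
    "vowel":               {"win_ms": 25, "a_deg": 90},
    "voiced_fricative":    {"win_ms": 10, "a_deg": 72},
    "voiceless_fricative": {"win_ms": 45, "a_deg": 27},
}

def class_aware_join_cost(seg_left, seg_right, sound_class,
                          mfcc_fn, distance_fn):
    """Join cost with MFCC parameters chosen by the sound class
    at the concatenation point (hypothetical interface)."""
    s = BEST_SETTINGS[sound_class]
    v_l = mfcc_fn(seg_left, win_ms=s["win_ms"], a_deg=s["a_deg"])
    v_r = mfcc_fn(seg_right, win_ms=s["win_ms"], a_deg=s["a_deg"])
    return distance_fn(v_l, v_r)
```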

5 CONCLUSION

This study was designed to investigate the efficacy of the MFCC metric as a means of objectively distinguishing between speech sounds. The MFCC values were calculated by first windowing the speech sounds, and a range of window lengths was tested in order to address the lack of consensus among previous studies investigating the performance of the metric. Furthermore, the MFCC metric was assessed not only for vowel sounds, but for voiceless and voiced fricatives as well, and accordingly adds to the literature in this important and growing area. In each sound category a few hundred examples of two phonemes were extracted from the American male RMS recordings of the CMU Arctic database. First the MFCC values were calculated on either side of the midpoint of sounds of the same phonetic label. The Euclidean distance between the vectors of MFCC values was taken as a measure of how well the MFCC can predict that two segments of speech are naturally consecutive. The MFCCs were extracted using a range of different values for (i) and (ii).


In order to measure the ability of the MFCC to predict that segments are non-consecutive, a perceptually bad join was modelled: the Euclidean distance between MFCC vectors was calculated between a few hundred examples of one phoneme type in the sound category and the other phoneme type, a situation that models a very perceptually salient artificial join. Again this was calculated for a number of window lengths and values of $a$.

From these results we can conclude the following points:

1. The ability of the distance measure to predict discontinuity differs significantly with respect to the width of the windowing function.

2. The ability of the distance measure to predict discontinuity differs with respect to the angle between the time and frequency domains to which the signal is transformed, but not significantly so for these examples.

3. The choice of filter type did not significantly affect the ability of the metric to predict discontinuity.

4. The distance measure was more effective at measuring perceptual discontinuity for vowel sounds than it was for voiceless fricatives, which in turn performed better than voiced fricatives.

The last result is of particular interest, as the cost functions in unit selection speech synthesis systems generally use one spectral discontinuity measure to calculate the join cost regardless of the speech sound being joined. The results of this paper show that there is scope for further research into detecting discontinuities in voiced and voiceless fricatives. We are currently investigating the use of other linear and bi-linear transforms, such as the fractional Fourier transform, wavelets and the Wigner-Ville distribution, to calculate MFCCs, with the intention of optimising the unit selection cost function for use with different types of speech sounds.

    ACKNOWLEDGMENT

    This work was supported in part by Foras na Gaeilge.

    REFERENCES

[1] A. Hunt and A. W. Black, "Unit selection in a concatenative speech synthesis system using a large speech database," Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1, pp. 373-376, 1996.

[2] D. O'Shaughnessy, Speech Communication: Human and Machine. Addison-Wesley, p. 150, 1987.

[3] S. Memon, M. Lech, N. Maddage and L. He, "Application of the Vector Quantization Methods and the Fused MFCC-IMFCC Features in the GMM based Speaker Recognition," in Recent Advances in Signal Processing, A. A. Zaher, Ed. InTech, 2009.

[4] S. B. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 4, pp. 357-366, 1980.

[5] B. Kirkpatrick, D. O'Brien and R. Scaife, "A comparison of spectral continuity measures as a join cost in concatenative speech synthesis," Proceedings of the IET Irish Signals and Systems Conference (ISSC), 2006.

[6] A. C. Kelly, "Join Cost Optimisation for Unit Selection Speech Synthesis," poster, Sao Paulo School of Advanced Studies in Speech Dynamics, Sao Paulo, Brazil, www.dinafon.iel.unicamp.br/spsassd_files/posterAmeliaKelly.pdf, 2010.

[7] A. C. Kelly and C. Gobl, "The effects of windowing on the calculation of MFCCs for different types of speech sounds," Proceedings of the International Conference on Non-Linear Speech Processing, Gran Canaria, 2011.

[8] J. Wouters and M. W. Macon, "Perceptual evaluation of distance measures for concatenative speech synthesis," Proceedings of the International Conference on Spoken Language Processing (ICSLP), 1998.

[9] E. Klabbers and R. Veldhuis, "On the Reduction of Concatenation Artefacts in Diphone Synthesis," Proceedings of the International Conference on Spoken Language Processing (ICSLP), 1998.

[10] J. D. Chen and N. Campbell, "Objective distance measures for assessing concatenative speech synthesis," Proceedings of Eurospeech, 1999.

[11] Y. Stylianou and A. Syrdal, "Perceptual and Objective Detection of Discontinuities in Concatenative Speech Synthesis," Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2001.

[12] J. Vepa, S. King and P. Taylor, "New objective distance measures for spectral discontinuities in concatenative speech synthesis," Proceedings of the 2002 IEEE Workshop on Speech Synthesis, 2002.

[13] Y. Pantazis, Y. Stylianou and E. Klabbers, "Discontinuity Detection in Concatenated Speech Synthesis based on Nonlinear Speech Analysis," Proceedings of Interspeech, 2005.

[14] J. Kominek and A. Black, "The CMU ARCTIC speech databases for speech synthesis research," Tech. Report CMU-LTI-03-177, Language Technologies Institute, Carnegie Mellon University, http://festvox.org/cmu_arctic/, 2003.

Amelia C. Kelly is a Ph.D. student at Trinity College Dublin.

Christer Gobl is Senior Lecturer in Speech Science at Trinity College Dublin.
