
    C.M. Travieso-González, J.B. Alonso-Hernández (Eds.): NOLISP 2011, LNAI 7015, pp. 183–189, 2011.

    © Springer-Verlag Berlin Heidelberg 2011

Combining Mel Frequency Cepstral Coefficients and Fractal Dimensions for Automatic Speech Recognition

Aitzol Ezeiza1, Karmele López de Ipiña1, Carmen Hernández1, and Nora Barroso2

1 Department of System Engineering and Automation, University of the Basque Country, Spain
{aitzol.ezeiza,mamen.hernandez,karmele.ipina}@ehu.es
2 Irunweb Enterprise, Auzolan 2B – 2, Irun, Spain
[email protected]

Abstract. Hidden Markov Models and Mel Frequency Cepstral Coefficients (MFCCs) are a de facto standard for Automatic Speech Recognition (ASR) systems, but they fail to capture the nonlinear dynamics present in speech waveforms. The extra information provided by nonlinear features could be especially useful when training data is scarce, or when the ASR task is very complex. In this work, the Fractal Dimension (FD) of the observed time series is combined with the traditional MFCCs in the feature vector in order to enhance the performance of two different ASR systems: the first is a very simple one with very few training examples, and the second is a Large Vocabulary Continuous Speech Recognition system for Broadcast News.

Keywords: Nonlinear Speech Processing, Automatic Speech Recognition, Mel Frequency Cepstral Coefficients, Fractal Dimensions.

    1 Introduction

There are strong foundations to claim that speech is a nonlinear process [1], but even though many research groups work on nonlinear enhancements for Speech Processing, most Automatic Speech Recognition (ASR) systems are based on linear models. State-of-the-art ASR systems are mostly developed using Hidden Markov Models (HMMs) and linear filtering techniques based on Fourier Transforms, such as Mel Frequency Cepstral Coefficients (MFCCs). There have been many success stories using these methods, but the development of such systems requires large corpora for training and, as a side effect, they are very language-dependent. If the appropriate corpora are available, most systems rely on Machine Learning techniques, so they do not need much extra effort to achieve their minimal goals. In contrast, ASR tasks that have to deal with a very large vocabulary, with under-resourced languages [2], or with noisy environments have to try alternative techniques. An interesting set of alternatives comes in the form of nonlinear analysis [3], and some works [4,5,6] show that combining nonlinear features with MFCCs can produce higher recognition accuracies without substituting the whole linear system with novel nonlinear approaches.

One of these alternatives is to consider the fractal dimension of the speech signal as a feature in the training process. Interest in fractals in speech dates back to the mid-80s [7], and they have been used for a variety of applications, including consonant/vowel characterization [8,9], speaker identification [10], and end-point detection [11], even for whispered speech [12]. Indeed, this metric has also been used in speech recognition, in some cases combined with MFCCs as described above [4]. Remarkably, the most notable contributions to the enhancement of ASR using fractals and other nonlinear and chaotic systems features have been made by the Computer Vision, Speech Communication, and Signal Processing Group of the National Technical University of Athens [4,13,14,15,16].

The simple approach of this work is to improve the HMM-based systems developed in our previous work [2] by augmenting the MFCC-based features with Fractal Dimensions. More precisely, an implementation of Higuchi's algorithm [17] has been applied to the same sliding window employed for the extraction of the MFCCs, in order to add this new feature to the set that feeds the training process of the HMMs of the Speech Recognition system. Given the complexity of the Broadcast News task, an initial experiment was assembled with a very simple system [18], with the aim of evaluating qualitatively the benefits of the methodology. This experiment is significant on its own because the system was developed using a very small corpus, which is one of our strands of work.

The rest of this paper is organized as follows: in Section 2, the methodology of the experiments is explained; Section 3 shows the experimental results; and finally, conclusions are presented in Section 4.

    2 Methodology

    2.1 Fractal Dimension

The Fractal Dimension is one of the most popular features for describing the complexity of a system. Most if not all fractal systems have a characteristic called self-similarity. An object is self-similar if a close-up examination of the object reveals that it is composed of smaller versions of itself. Self-similarity can be quantified as a relative measure of the number of basic building blocks that form a pattern, and this measure is defined as the Fractal Dimension. There are several algorithms to measure the Fractal Dimension, but the current work focuses on the alternatives that do not need previous modelling of the system. Two of these algorithms are Katz [19] and Higuchi [17], named after their authors. Of these two similar methods, Higuchi has been chosen because it has been reported to be more accurate [20], but the Katz algorithm will be tested in future work, since it gets better results in certain cases.

Higuchi [17] proposed an algorithm for measuring the Fractal Dimension of discrete time sequences directly from the time series x(1), x(2), ..., x(N). Without going into detail, the algorithm calculates the length L_m(k) (see Equation 1) for each value of m and k covering all the series:

L_m(k) = \frac{1}{k}\left[\left(\sum_{i=1}^{\lfloor (N-m)/k \rfloor} \bigl|x(m+ik)-x(m+(i-1)k)\bigr|\right)\frac{N-1}{\lfloor (N-m)/k \rfloor \, k}\right]   (1)

After that, the mean of the lengths L_m(k) over all m is determined for each k with Equation 2:

L(k) = \frac{1}{k}\sum_{m=1}^{k} L_m(k)   (2)

Finally, the slope of the curve of ln(L(k)) versus ln(1/k) is estimated using a least-squares linear best fit, and the result is the Higuchi Fractal Dimension (HFD).
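For illustration, a minimal NumPy sketch of this computation is given below. It is not the implementation used in the experiments; the function name, the default choice of k_max, and the 0-based indexing are our own assumptions.

```python
import numpy as np

def higuchi_fd(x, k_max=10):
    """Estimate the Higuchi Fractal Dimension (HFD) of a 1-D time series.

    Follows Equations (1)-(2): for each scale k and offset m the normalized
    curve length L_m(k) is computed, the lengths are averaged over m, and the
    HFD is the least-squares slope of ln(L(k)) against ln(1/k).
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    ln_inv_k, ln_l = [], []
    for k in range(1, k_max + 1):
        lengths = []
        for m in range(1, k + 1):
            # samples x(m), x(m+k), x(m+2k), ... (1-based m as in the paper)
            idx = np.arange(m - 1, n, k)
            if len(idx) < 2:
                continue
            intervals = len(idx) - 1  # floor((N - m) / k)
            # Equation (1): normalized length of the sub-series
            lm = (np.abs(np.diff(x[idx])).sum() * (n - 1) / (intervals * k)) / k
            lengths.append(lm)
        if lengths:
            ln_l.append(np.log(np.mean(lengths)))  # Equation (2)
            ln_inv_k.append(np.log(1.0 / k))
    # slope of the log-log curve is the fractal dimension
    slope, _ = np.polyfit(ln_inv_k, ln_l, 1)
    return slope
```

For a sampled curve the estimate lies between 1 (a smooth line) and 2 (a curve that fills the plane), which is what makes it usable as a single additional feature per frame.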

Once the HFD algorithm is implemented, the method employed in the development of the experiments described in this work has been the following (a sketch of the pipeline is given after the list):

1. The original MFCCs are extracted from the waveforms employing the standard tool available in the HMM ToolKit (HTK) [21], and they are stored in a single MFCC file for each speech recording file.

2. The same window size and time-shift is applied to the original waveform data, and each of these sub-waveforms is the input of the Higuchi Fractal Dimension (HFD) function.

3. The result of the function is appended to the original feature vector, and the complete result of the processing of the whole speech recording file is stored in a new MFCC file.
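As an illustration of steps 2 and 3, the sketch below frames the waveform with a fixed window and shift and appends one HFD value to each MFCC frame, reusing the higuchi_fd sketch above. The 25 ms / 10 ms framing values and the handling of a short final frame are assumptions, not settings taken from the original system.

```python
import numpy as np

def add_hfd_to_frames(waveform, mfcc_frames, sample_rate=16000,
                      win_ms=25.0, shift_ms=10.0):
    """Append one HFD value per frame to an (n_frames, n_mfcc) array.

    The window size and shift are assumed to match the ones used when the
    MFCCs in `mfcc_frames` were extracted (step 2 above).
    """
    win = int(sample_rate * win_ms / 1000.0)
    shift = int(sample_rate * shift_ms / 1000.0)
    hfd = []
    for i in range(mfcc_frames.shape[0]):
        frame = waveform[i * shift:i * shift + win]
        # fall back to a neutral value if the last frame is too short
        hfd.append(higuchi_fd(frame) if len(frame) >= 2 else 1.0)
    # step 3: the enhanced vector is the original MFCCs plus the HFD column
    return np.hstack([mfcc_frames, np.array(hfd)[:, None]])
```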

    2.2 Description of the ASR Tasks

With the aim of exploring the suitability of the Higuchi Fractal Dimension for ASR tasks, two separate experiments have been developed. The first one has been employed as a test bed for several analyses, and the second one as the target task of the research.

The first is a Chinese digit recognition task developed by Jang [18]. This is a simple system developed using a small corpus of 56 recordings of each of the 10 standard digits in Chinese. Since it is an isolated-word recognition task with a very small lexicon, the difficulty of the task lies in the lack of transcribed recordings. The baseline system has been trained using a feature vector of size 39 (12 MFCCs plus the C0 log-energy component, and their first and second derivatives). The enhanced system combines the previous 39 features with the HFD of each window's time series. These features have been used to train HMM models using HTK [21], and for training/testing purposes the corpus has been divided into 460 recordings for training, with the remaining 100 reserved for testing.
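For concreteness, a hedged sketch of the file handling implied by step 3 above (reading the HTK parameter file, appending the HFD column, and writing the enhanced file back) is shown below. The 12-byte big-endian HTK header layout (nSamples, sampPeriod, sampSize, parmKind) is standard, but the file names are hypothetical and the parameter kind is simply copied over as a simplification.

```python
import struct
import numpy as np
from scipy.io import wavfile

def read_htk(path):
    """Read an HTK parameter file (e.g. MFCCs produced by HCopy)."""
    with open(path, "rb") as f:
        n, period, size, kind = struct.unpack(">iihh", f.read(12))
        data = np.frombuffer(f.read(n * size), dtype=">f4")
    return data.reshape(n, size // 4), period, kind

def write_htk(path, frames, period, kind):
    """Write frames back as an HTK file with an updated vector size."""
    n, dim = frames.shape
    with open(path, "wb") as f:
        f.write(struct.pack(">iihh", n, period, dim * 4, kind))
        f.write(frames.astype(">f4").tobytes())

# usage with hypothetical file names
rate, waveform = wavfile.read("utt001.wav")
mfcc, period, kind = read_htk("utt001.mfc")
enhanced = add_hfd_to_frames(waveform, mfcc, sample_rate=rate)
write_htk("utt001_hfd.mfc", enhanced, period, kind)  # kind copied unchanged here
```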

The second task is a Broadcast News task in Spanish. The corpus available consists of 832 sentences extracted from the hourly news bulletins of the Infozazpi radio [2]. The total size of the audio content is nearly one hour (55 minutes and 38 seconds), and the corpus has these relevant characteristics:

1. It only has two speakers, because this is not an interactive radio programme but an hourly bulletin of highlights of the daily news.

2. The background noise (mostly filler music) is considerable. Two measures have been employed, NIST STNR and WADA SNR, resulting in 10.74 dB and 8.11 dB respectively, whereas common values for clean speech are about 20 dB.

3. The speed of speech is fast (16.1 phonemes per second) compared to other Broadcast News corpora (an average of 12 phonemes per second in Basque and French for the same broadcaster).

4. Cross-lingual effects: 3.9% of the words are in Basque, so it is much more difficult to use models from other Spanish corpora.

5. The size of the vocabulary is large in proportion: there are a total of 12,812 word tokens and 2,042 distinct word units.

In order to get significant results, the system has been trained using allophones and triphones. The feature vector in this second case comprises 42 parameters (13 MFCCs plus the C0 log-energy component, and their first and second derivatives). In the same way as the first system, the enhanced system uses a feature vector of size 43 (the previous 42 features plus HFD). In this case, testing has been done with 20-fold cross-validation.
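The 42- and 43-dimensional layouts described above follow the usual static + delta + acceleration arrangement. A brief sketch of that arrangement is given below using the standard regression formula for deltas; the window of ±2 frames and the edge padding are our assumptions, not values reported here.

```python
import numpy as np

def deltas(feat, theta=2):
    """Delta coefficients by linear regression over a +/- theta frame window."""
    padded = np.pad(feat, ((theta, theta), (0, 0)), mode="edge")
    denom = 2 * sum(t * t for t in range(1, theta + 1))
    num = sum(t * (padded[theta + t:theta + t + len(feat)]
                   - padded[theta - t:theta - t + len(feat)])
              for t in range(1, theta + 1))
    return num / denom

def build_vector(static, hfd_column):
    """static: (n_frames, 14) = 13 MFCCs + C0; returns (n_frames, 43)."""
    d = deltas(static)   # 14 delta coefficients
    a = deltas(d)        # 14 acceleration coefficients -> 42 so far
    return np.hstack([static, d, a, hfd_column[:, None]])  # plus HFD -> 43
```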

[Fig. 1. Waveform and Higuchi Fractal Dimension (HFD) function of the words “jiou” and “liou” in Chinese]


    3 Results of the Experiments

Some attention-grabbing results have been gathered from the experiments carried out with the systems described in the previous section. The experiment on Chinese digit recognition is very limited in terms of both training and testing, but the improvement is noteworthy (see Table 1). During the regular test, where the input of the system was a set of 100 recordings (10 for each digit), the Correct Word Rate increased by two points. Indeed, some other experiments, in which some features of the original MFCC vector were substituted with the Fractal Dimension, reached the 96% threshold, which suggests the feature selection might be revised for this case. In any case, the improvement is significant enough to be taken into account. In fact, the results confirm the conclusions of previous works, which stated that the most significant benefit of using fractals is their usefulness in distinguishing between voiced and unvoiced sounds [8], and between affricates and other sounds [9]. For example, Figure 1 shows two very similar cases that were mismatched using only MFCCs but were classified correctly using HFD. In this case, the Fractal Dimension is useful for differentiating between a liquid /l/ and an affricate /j/.

Table 1. Correct Word Rate (CWR) of the two experiments

Task name                    MFCC only    MFCC+HFD
Chinese Digit Recognition    93%          95%
Infozazpi Broadcast News     55.755%      55.738%

In actual fact, the complex Broadcast News task was a much closer contest. The Correct Word Rate was practically unchanged, but it has to be remarked that the system has a very large set of basic units and very few utterances available for each of them, which makes it difficult to extrapolate information from a single parameter, as is the case in the MFCC+HFD experiment. Nevertheless, other indicators and particular examples advise continuing to work along this line. In particular, some of the sounds had lower confusion rates, but this was not reflected in the final results because of dictionary and Language Modelling errors, which are very common in complex tasks with large vocabularies.

    4 Conclusions and Future Work

This work describes a first approach to the inclusion of nonlinear features in an already developed, state-of-the-art HMM-based ASR system. By augmenting the MFCCs with one extra feature, the useful information that was present in the original system is not affected, while the Fractal Dimension adds useful information about the dynamics of speech generation. Additionally, a quite simple method has been proposed that consists of inserting the extra features using the same window as the one used during MFCC feature extraction. This straightforward approach might be frail in terms of capturing the dynamics of the whole waveform, but it offers many advantages in terms of computability, and it also makes it easier to compare the power of the new features against the traditional ones. Overall, the results suggest that it is worth considering this and other nonlinear features in order to obtain more robust ASR systems, even if the improvement in terms of Word Error Rate is not significant in some of the tasks. From this point of view, our current work streams include trying new related features such as Lyapunov Exponents [14] and Filtered Dynamics [15]. Finally, one of our current tasks consists in developing an ontology-driven Information Retrieval system for Broadcast News [22], which employs many advanced techniques and could include the Fractal Dimension as a feature in the near future.

    References

1. Teager, H.M., Teager, S.M.: Evidence for Nonlinear Sound Production Mechanisms in the Vocal Tract. In: Speech Production and Speech Modelling, Bonas, France. NATO Advanced Study Institute Series D, vol. 55 (1989)
2. Barroso, N., López de Ipiña, K., Ezeiza, A.: Acoustic Phonetic Decoding Oriented to Multilingual Speech Recognition in the Basque Context. Advances in Intelligent and Soft Computing, vol. 71. Springer, Heidelberg (2010)
3. Faúndez, M., Kubin, G., Kleijn, W.B., Maragos, P., McLaughlin, S., Esposito, A., Hussain, A., Schoentgen, J.: Nonlinear speech processing: overview and applications. Int. J. Control Intelligent Systems 30(1), 1–10 (2002)
4. Pitsikalis, V., Maragos, P.: Analysis and Classification of Speech Signals by Generalized Fractal Dimension Features. Speech Communication 51(12), 1206–1223 (2009)
5. Indrebo, K.M., Povinelli, R.J., Johnson, M.T.: Third-Order Moments of Filtered Speech Signals for Robust Speech Recognition. In: Faundez-Zanuy, M., Janer, L., Esposito, A., Satue-Villar, A., Roure, J., Espinosa-Duro, V. (eds.) NOLISP 2005. LNCS (LNAI), vol. 3817, pp. 277–283. Springer, Heidelberg (2006)
6. Shekofteh, Y., Almasganj, F.: Using Phase Space based processing to extract proper features for ASR systems. In: Proceedings of the 5th International Symposium on Telecommunications (2010)
7. Pickover, C.A., Khorasani, A.: Fractal characterization of speech waveform graphs. Computers & Graphics (1986)
8. Martinez, F., Guillamon, A., Martinez, J.J.: Vowel and consonant characterization using fractal dimension in natural speech. In: NOLISP 2003 (2003)
9. Langi, A., Kinsner, W.: Consonant Characterization Using Correlation Fractal Dimension for Speech Recognition. In: IEEE Wescanex 1995, Communications, Power and Computing, Winnipeg, MB, vol. 1, pp. 208–213 (1995)
10. Nelwamondo, F.V., Mahola, U., Marwola, T.: Multi-Scale Fractal Dimension for Speaker Identification Systems. WSEAS Transactions on Systems 5(5), 1152–1157 (2006)
11. Li, Y., Fan, Y., Tong, Q.: Endpoint Detection in Noisy Environment Using Complexity Measure. In: Proceedings of the 2007 International Conference on Wavelet Analysis and Pattern Recognition, Beijing, China (2007)
12. Chen, X., Zhao, H.: Fractal Characteristic-Based Endpoint Detection for Whispered Speech. In: Proceedings of the 6th WSEAS International Conference on Signal, Speech and Image Processing, Lisbon, Portugal (2006)
13. Maragos, P.: Fractal Aspects of Speech Signals: Dimension and Interpolation. In: Proceedings of the 1991 International Conference on Acoustics, Speech, and Signal Processing (ICASSP 1991), Toronto, Canada, pp. 417–420 (May 1991)
14. Maragos, P., Potamianos, A.: Fractal Dimensions of Speech Sounds: Computation and Application to Automatic Speech Recognition. Journal of the Acoustical Society of America 105(3), 1925–1932 (1999)
15. Pitsikalis, V., Kokkinos, I., Maragos, P.: Nonlinear Analysis of Speech Signals: Generalized Dimensions and Lyapunov Exponents. In: Proceedings of Interspeech 2002, Santorini, Greece (2002)
16. Pitsikalis, V., Maragos, P.: Filtered Dynamics and Fractal Dimensions for Noisy Speech Recognition. IEEE Signal Processing Letters 13(11), 711–714 (2006)
17. Higuchi, T.: Approach to an irregular time series on the basis of the fractal theory. Physica D 31, 277–283 (1988)
18. Jang, J.S.R.: Audio Signal Processing and Recognition. Available at the links for on-line courses at the author's homepage, http://www.cs.nthu.edu.tw/~jang
19. Katz, M.: Fractals and the analysis of waveforms. Comput. Biol. Med. 18(3), 145–156 (1988)
20. Esteller, R., Vachtsevanos, G., Echauz, J., Litt, B.: A comparison of waveform fractal dimension algorithms. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications 48(2), 177–183 (2001)
21. Young, S., Kershaw, D., Odell, J., Ollason, D., Valtchev, V., Woodland, P.: The HTK Book 3.4. Cambridge University Press, Cambridge (2006)
22. Barroso, N., López de Ipiña, K., Ezeiza, A., Hernandez, C., Ezeiza, N., Barroso, O., Susperregi, U., Barroso, S.: GorUp: an ontology-driven Audio Information Retrieval system that suits the requirements of under-resourced languages. In: Proceedings of Interspeech 2011, Firenze (2011)