
    C.M. Travieso-González, J.B. Alonso-Hernández (Eds.): NOLISP 2011, LNAI 7015, pp. 183–189, 2011.

    © Springer-Verlag Berlin Heidelberg 2011

Combining Mel Frequency Cepstral Coefficients and Fractal Dimensions for Automatic Speech Recognition

Aitzol Ezeiza1, Karmele López de Ipiña1, Carmen Hernández1, and Nora Barroso2

1 Department of System Engineering and Automation, University of the Basque Country, Spain
{aitzol.ezeiza,mamen.hernandez,karmele.ipina}@ehu.es
2 Irunweb Enterprise, Auzolan 2B – 2, Irun, Spain
[email protected]

Abstract. Hidden Markov Models and Mel Frequency Cepstral Coefficients (MFCCs) are a de facto standard for Automatic Speech Recognition (ASR) systems, but they fail to capture the nonlinear dynamics present in speech waveforms. The extra information provided by nonlinear features could be especially useful when training data is scarce, or when the ASR task is very complex. In this work, the Fractal Dimension (FD) of the observed time series is combined with the traditional MFCCs in the feature vector in order to enhance the performance of two different ASR systems: the first is a very simple one with very few training examples, and the second is a Large Vocabulary Continuous Speech Recognition system for Broadcast News.

Keywords: Nonlinear Speech Processing, Automatic Speech Recognition, Mel Frequency Cepstral Coefficients, Fractal Dimensions.

    1 Introduction

There are strong foundations to claim that speech is a nonlinear process [1], but even though many research groups work on nonlinear enhancements for Speech Processing, most Automatic Speech Recognition (ASR) systems are based on linear models. State-of-the-art ASR systems are mostly developed using Hidden Markov Models (HMMs) and linear filtering techniques based on Fourier Transforms, such as Mel Frequency Cepstral Coefficients (MFCCs). There have been many success stories using these methods, but the development of such systems requires large corpora for training and, as a side effect, they are very language-dependent. If the appropriate corpora are available, most systems rely on Machine Learning techniques, so they do not need much extra effort to achieve their minimal goals. In contrast, ASR tasks that have to deal with a very large vocabulary, with under-resourced languages [2], or with noisy environments have to try alternative techniques. An interesting set of alternatives comes in the form of nonlinear analysis [3], and some works [4,5,6] show that combining nonlinear features with MFCCs can produce higher recognition accuracies without substituting the whole linear system with novel nonlinear approaches.

One of these alternatives is to consider the fractal dimension of the speech signal as a feature in the training process. Interest in fractals in speech dates back to the mid-80s [7], and they have been used for a variety of applications, including consonant/vowel characterization [8,9], speaker identification [10], and end-point detection [11], even for whispered speech [12]. Indeed, this metric has also been used in speech recognition, in some cases combined with MFCCs as described above [4]. Remarkably, the most notable contributions to the enhancement of ASR using fractals and other nonlinear and chaotic systems features have been made by the Computer Vision, Speech Communication, and Signal Processing Group of the National Technical University of Athens [4,13,14,15,16].

The simple approach of this work is to improve the HMM-based systems developed in our previous work [2] by augmenting the MFCC-based features with Fractal Dimensions. More precisely, an implementation of Higuchi's algorithm [17] has been applied to the same sliding window employed for the extraction of the MFCCs, in order to add this new feature to the set that feeds the training process of the HMMs of the Speech Recognition system. Given the complexity of the Broadcast News task, an initial experiment was assembled with a very simple system [18], with the aim of evaluating qualitatively the benefits of the methodology. This experiment is significant on its own because the system was developed using a very small corpus, which is one of our strands of work.

The rest of this paper is organized as follows: in Section 2, the methodology of the experiments is explained; Section 3 shows the experimental results; and finally, conclusions are presented in Section 4.

    2 Methodology

    2.1 Fractal Dimension

The Fractal Dimension is one of the most popular features for describing the complexity of a system. Most if not all fractal systems have a characteristic called self-similarity. An object is self-similar if a close-up examination of the object reveals that it is composed of smaller versions of itself. Self-similarity can be quantified as a relative measure of the number of basic building blocks that form a pattern, and this measure is defined as the Fractal Dimension. There are several algorithms to measure the Fractal Dimension, but the current work focuses on the alternatives that do not need previous modelling of the system. Two of these algorithms are Katz [19] and Higuchi [17], named after their authors. Of these two similar methods, Higuchi has been chosen because it has been reported to be more accurate [20], but the Katz algorithm will be tested in future work, since it gets better results in certain cases.

Higuchi [17] proposed an algorithm for measuring the Fractal Dimension of discrete time sequences directly from the time series x(1), x(2), ..., x(N). Without going into detail, the algorithm calculates the length L_m(k) (see Equation 1) for each value of m and k covering all the series:

L_m(k) = \frac{1}{k}\left[\left(\sum_{i=1}^{\lfloor (N-m)/k \rfloor} \bigl|x(m+ik)-x(m+(i-1)k)\bigr|\right)\frac{N-1}{\lfloor (N-m)/k \rfloor \, k}\right]   (1)

After that, the mean of the lengths L_m(k) over all m is determined for each k with Equation 2:

L(k) = \frac{1}{k}\sum_{m=1}^{k} L_m(k)   (2)

Finally, the slope of the curve of ln(L(k)) versus ln(1/k) is estimated using a least-squares linear best fit, and the result is the Higuchi Fractal Dimension (HFD).
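For illustration, a minimal NumPy sketch of this computation is given below. It is not the implementation used in the experiments; the function name, the default choice of k_max, and the 0-based indexing are our own assumptions.

```python
import numpy as np

def higuchi_fd(x, k_max=10):
    """Estimate the Higuchi Fractal Dimension (HFD) of a 1-D time series.

    Follows Equations (1)-(2): for each scale k and offset m the normalized
    curve length L_m(k) is computed, the lengths are averaged over m, and the
    HFD is the least-squares slope of ln(L(k)) against ln(1/k).
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    ln_inv_k, ln_l = [], []
    for k in range(1, k_max + 1):
        lengths = []
        for m in range(1, k + 1):
            # samples x(m), x(m+k), x(m+2k), ... (1-based m as in the paper)
            idx = np.arange(m - 1, n, k)
            if len(idx) < 2:
                continue
            intervals = len(idx) - 1  # floor((N - m) / k)
            # Equation (1): normalized length of the sub-series
            lm = (np.abs(np.diff(x[idx])).sum() * (n - 1) / (intervals * k)) / k
            lengths.append(lm)
        if lengths:
            ln_l.append(np.log(np.mean(lengths)))  # Equation (2)
            ln_inv_k.append(np.log(1.0 / k))
    # slope of the log-log curve is the fractal dimension
    slope, _ = np.polyfit(ln_inv_k, ln_l, 1)
    return slope
```

For a sampled curve the estimate lies between 1 (a smooth line) and 2 (a curve that fills the plane), which is what makes it usable as a single additional feature per frame.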

Once the HFD algorithm is implemented, the method employed in the development of the experiments described in this work has been the following (a sketch of the pipeline is given after the list):

1. The original MFCCs are extracted from the waveforms employing the standard tool available in the HMM ToolKit (HTK) [21], and they are stored in a single MFCC file for each speech recording file.

2. The same window size and time-shift is applied to the original waveform data, and each of these sub-waveforms is the input of the Higuchi Fractal Dimension (HFD) function.

3. The result of the function is appended to the original feature vector, and the complete result of the processing of the whole speech recording file is stored in a new MFCC file.
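As an illustration of steps 2 and 3, the sketch below frames the waveform with a fixed window and shift and appends one HFD value to each MFCC frame, reusing the higuchi_fd sketch above. The 25 ms / 10 ms framing values and the handling of a short final frame are assumptions, not settings taken from the original system.

```python
import numpy as np

def add_hfd_to_frames(waveform, mfcc_frames, sample_rate=16000,
                      win_ms=25.0, shift_ms=10.0):
    """Append one HFD value per frame to an (n_frames, n_mfcc) array.

    The window size and shift are assumed to match the ones used when the
    MFCCs in `mfcc_frames` were extracted (step 2 above).
    """
    win = int(sample_rate * win_ms / 1000.0)
    shift = int(sample_rate * shift_ms / 1000.0)
    hfd = []
    for i in range(mfcc_frames.shape[0]):
        frame = waveform[i * shift:i * shift + win]
        # fall back to a neutral value if the last frame is too short
        hfd.append(higuchi_fd(frame) if len(frame) >= 2 else 1.0)
    # step 3: the enhanced vector is the original MFCCs plus the HFD column
    return np.hstack([mfcc_frames, np.array(hfd)[:, None]])
```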

    2.2 Description of the ASR Tasks

With the aim of exploring the suitability of the Higuchi Fractal Dimension for ASR tasks, two separate experiments have been developed. The first one has been employed as a test bed for several analyses, and the second one as the target task of the research.

The first is a Chinese digit recognition task developed by Jang [18]. This is a simple system developed using a small corpus of 56 recordings of each of the 10 standard digits in Chinese. Since it is an isolated-word recognition task with a very small lexicon, the difficulty of the task lies in the lack of transcribed recordings. The baseline system has been trained using a feature vector of size 39 (12 MFCCs plus the C0 log-energy component, and their first and second derivatives). The enhanced system combines the previous 39 features with the HFD of each window's time series. These features have been used to train HMM models using HTK [21], and for training/testing purposes the corpus has been divided into 460 recordings for training, with the remaining 100 reserved for testing.
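For concreteness, a hedged sketch of the file handling implied by step 3 above (reading the HTK parameter file, appending the HFD column, and writing the enhanced file back) is shown below. The 12-byte big-endian HTK header layout (nSamples, sampPeriod, sampSize, parmKind) is standard, but the file names are hypothetical and the parameter kind is simply copied over as a simplification.

```python
import struct
import numpy as np
from scipy.io import wavfile

def read_htk(path):
    """Read an HTK parameter file (e.g. MFCCs produced by HCopy)."""
    with open(path, "rb") as f:
        n, period, size, kind = struct.unpack(">iihh", f.read(12))
        data = np.frombuffer(f.read(n * size), dtype=">f4")
    return data.reshape(n, size // 4), period, kind

def write_htk(path, frames, period, kind):
    """Write frames back as an HTK file with an updated vector size."""
    n, dim = frames.shape
    with open(path, "wb") as f:
        f.write(struct.pack(">iihh", n, period, dim * 4, kind))
        f.write(frames.astype(">f4").tobytes())

# usage with hypothetical file names
rate, waveform = wavfile.read("utt001.wav")
mfcc, period, kind = read_htk("utt001.mfc")
enhanced = add_hfd_to_frames(waveform, mfcc, sample_rate=rate)
write_htk("utt001_hfd.mfc", enhanced, period, kind)  # kind copied unchanged here
```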

The second task is a Broadcast News task in Spanish. The corpus available consists of 832 sentences extracted from the hourly news bulletins of the Infozazpi radio [2]. The total size of the audio content is nearly one hour (55 minutes and 38 seconds), and the corpus has these relevant characteristics:

1. It only has two speakers, because this is not an interactive radio programme but an hourly bulletin of highlights of the daily news.

2. The background noise (mostly filler music) is considerable. Two measures have been employed, NIST STNR and WADA SNR, resulting in 10.74 dB and 8.11 dB respectively, whereas common values for clean speech are about 20 dB.

3. The speed of speech is fast (16.1 phonemes per second) compared to other Broadcast News corpora (an average of 12 phonemes per second in Basque and French for the same broadcaster).

4. Cross-lingual effects: 3.9% of the words are in Basque, so it is much more difficult to use models from other Spanish corpora.

5. The size of the vocabulary is large in proportion: there are a total of 12,812 word tokens and 2,042 distinct word units.

In order to get significant results, the system has been trained using allophones and triphones. The feature vector in this second case comprises 42 parameters (13 MFCCs plus the C0 log-energy component, and their first and second derivatives). In the same way as the first system, the enhanced system uses a feature vector of size 43 (the previous 42 features plus HFD). In this case, testing has been done with 20-fold cross-validation.
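The 42- and 43-dimensional layouts described above follow the usual static + delta + acceleration arrangement. A brief sketch of that arrangement is given below using the standard regression formula for deltas; the window of ±2 frames and the edge padding are our assumptions, not values reported here.

```python
import numpy as np

def deltas(feat, theta=2):
    """Delta coefficients by linear regression over a +/- theta frame window."""
    padded = np.pad(feat, ((theta, theta), (0, 0)), mode="edge")
    denom = 2 * sum(t * t for t in range(1, theta + 1))
    num = sum(t * (padded[theta + t:theta + t + len(feat)]
                   - padded[theta - t:theta - t + len(feat)])
              for t in range(1, theta + 1))
    return num / denom

def build_vector(static, hfd_column):
    """static: (n_frames, 14) = 13 MFCCs + C0; returns (n_frames, 43)."""
    d = deltas(static)   # 14 delta coefficients
    a = deltas(d)        # 14 acceleration coefficients -> 42 so far
    return np.hstack([static, d, a, hfd_column[:, None]])  # plus HFD -> 43
```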

[Fig. 1. Waveform and Higuchi Fractal Dimension (HFD) function of the words “jiou” and “liou” in Chinese]


    3 Results of the Experiments

Some attention-grabbing results have been gathered from the experiments carried out with the systems described in the previous section. The experiment on Chinese digit recognition is very limited in terms of both training and testing, but the improvement is noteworthy (see Table 1). During the regular test, where the input of the system was a set of 100 recordings (10 for each digit), the Correct Word Rate increased by two points. Indeed, some other experiments, in which some features of the original MFCC vector were substituted with the Fractal Dimension, reached the 96% threshold, which suggests the feature selection might be revised for this case. In any case, the improvement is significant enough to be taken into account. In fact, the results confirm the conclusions of previous works, which stated that the most significant benefit of using fractals is their usefulness in distinguishing between voiced and unvoiced sounds [8], and between affricates and other sounds [9]. For example, Figure 1 shows two very similar cases that were mismatched using only MFCCs but were classified correctly using HFD. In this case, the Fractal Dimension is useful for differentiating between a liquid /l/ and an affricate /j/.

Table 1. Correct Word Rate (CWR) of the two experiments

Task name                    MFCC only    MFCC+HFD
Chinese Digit Recognition    93%          95%
Infozazpi Broadcast News     55.755%      55.738%

In actual fact, the complex Broadcast News task was a much closer contest. The Correct Word Rate was practically unchanged, but it has to be remarked that the system has a very large set of basic units and very few utterances available for each of them, which makes it difficult to extrapolate information from a single parameter, as is the case in the MFCC+HFD experiment. Nevertheless, other indicators and particular examples advise continuing to work along this line. In particular, some of the sounds had lower confusion rates, but this was not reflected in the final results because of dictionary and Language Modelling errors, which are very common in complex tasks with large vocabularies.

    4 Conclusions and Future Work

This work describes a first approach to the inclusion of nonlinear features in an already developed, state-of-the-art HMM-based ASR system. By augmenting the MFCCs with one extra feature, the useful information that was present in the original system is not affected, while the Fractal Dimension adds useful information about the dynamics of speech generation. Additionally, a quite simple method has been proposed that consists of inserting the extra features using the same window as the one used during MFCC feature extraction. This straightforward approach might be frail in terms of capturing the dynamics of the whole waveform, but it offers many advantages in terms of computability, and it also makes it easier to compare the power of the new features against the traditional ones. Overall, the results suggest that it is worth considering this and other nonlinear features in order to obtain more robust ASR systems, even if the improvement in terms of Word Error Rate is not significant in some of the tasks. From this point of view, our current work streams include trying new related features such as Lyapunov Exponents [14] and Filtered Dynamics [15]. Finally, one of our current tasks consists in developing an ontology-driven Information Retrieval system for Broadcast News [22], which employs many advanced techniques and could include the Fractal Dimension as a feature in the near future.

    References

1. Teager, H.M., Teager, S.M.: Evidence for Nonlinear Sound Production Mechanisms in the Vocal Tract. In: Speech Production and Speech Modelling, Bonas, France. NATO Advanced Study Institute Series D, vol. 55 (1989)
2. Barroso, N., López de Ipiña, K., Ezeiza, A.: Acoustic Phonetic Decoding Oriented to Multilingual Speech Recognition in the Basque Context. Advances in Intelligent and Soft Computing, vol. 71. Springer, Heidelberg (2010)
3. Faúndez, M., Kubin, G., Kleijn, W.B., Maragos, P., McLaughlin, S., Esposito, A., Hussain, A., Schoentgen, J.: Nonlinear speech processing: overview and applications. Int. J. Control Intelligent Systems 30(1), 1–10 (2002)
4. Pitsikalis, V., Maragos, P.: Analysis and Classification of Speech Signals by Generalized Fractal Dimension Features. Speech Communication 51(12), 1206–1223 (2009)
5. Indrebo, K.M., Povinelli, R.J., Johnson, M.T.: Third-Order Moments of Filtered Speech Signals for Robust Speech Recognition. In: Faundez-Zanuy, M., Janer, L., Esposito, A., Satue-Villar, A., Roure, J., Espinosa-Duro, V. (eds.) NOLISP 2005. LNCS (LNAI), vol. 3817, pp. 277–283. Springer, Heidelberg (2006)
6. Shekofteh, Y., Almasganj, F.: Using Phase Space based processing to extract proper features for ASR systems. In: Proceedings of the 5th International Symposium on Telecommunications (2010)
7. Pickover, C.A., Khorasani, A.: Fractal characterization of speech waveform graphs. Computers & Graphics (1986)
8. Martinez, F., Guillamon, A., Martinez, J.J.: Vowel and consonant characterization using fractal dimension in natural speech. In: NOLISP 2003 (2003)
9. Langi, A., Kinsner, W.: Consonant Characterization Using Correlation Fractal Dimension for Speech Recognition. In: IEEE Wescanex 1995, Communications, Power and Computing, Winnipeg, MB, vol. 1, pp. 208–213 (1995)
10. Nelwamondo, F.V., Mahola, U., Marwola, T.: Multi-Scale Fractal Dimension for Speaker Identification Systems. WSEAS Transactions on Systems 5(5), 1152–1157 (2006)
11. Li, Y., Fan, Y., Tong, Q.: Endpoint Detection in Noisy Environment Using Complexity Measure. In: Proceedings of the 2007 International Conference on Wavelet Analysis and Pattern Recognition, Beijing, China (2007)
12. Chen, X., Zhao, H.: Fractal Characteristic-Based Endpoint Detection for Whispered Speech. In: Proceedings of the 6th WSEAS International Conference on Signal, Speech and Image Processing, Lisbon, Portugal (2006)
13. Maragos, P.: Fractal Aspects of Speech Signals: Dimension and Interpolation. In: Proceedings of the 1991 International Conference on Acoustics, Speech, and Signal Processing (ICASSP 1991), Toronto, Canada, pp. 417–420 (May 1991)
14. Maragos, P., Potamianos, A.: Fractal Dimensions of Speech Sounds: Computation and Application to Automatic Speech Recognition. Journal of the Acoustical Society of America 105(3), 1925–1932 (1999)
15. Pitsikalis, V., Kokkinos, I., Maragos, P.: Nonlinear Analysis of Speech Signals: Generalized Dimensions and Lyapunov Exponents. In: Proceedings of Interspeech 2002, Santorini, Greece (2002)
16. Pitsikalis, V., Maragos, P.: Filtered Dynamics and Fractal Dimensions for Noisy Speech Recognition. IEEE Signal Processing Letters 13(11), 711–714 (2006)
17. Higuchi, T.: Approach to an irregular time series on the basis of the fractal theory. Physica D 31, 277–283 (1988)
18. Jang, J.S.R.: Audio Signal Processing and Recognition. Available at the links for on-line courses at the author's homepage, http://www.cs.nthu.edu.tw/~jang
19. Katz, M.: Fractals and the analysis of waveforms. Comput. Biol. Med. 18(3), 145–156 (1988)
20. Esteller, R., Vachtsevanos, G., Echauz, J., Litt, B.: A comparison of waveform fractal dimension algorithms. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications 48(2), 177–183 (2001)
21. Young, S., Kershaw, D., Odell, J., Ollason, D., Valtchev, V., Woodland, P.: The HTK Book 3.4. Cambridge University Press, Cambridge (2006)
22. Barroso, N., López de Ipiña, K., Ezeiza, A., Hernandez, C., Ezeiza, N., Barroso, O., Susperregi, U., Barroso, S.: GorUp: an ontology-driven Audio Information Retrieval system that suits the requirements of under-resourced languages. In: Proceedings of Interspeech 2011, Firenze (2011)