C.M. Travieso-González, J.B. Alonso-Hernández (Eds.): NOLISP 2011, LNAI 7015, pp. 183–189, 2011.
© Springer-Verlag Berlin Heidelberg 2011
Combining Mel Frequency Cepstral Coefficients and
Fractal Dimensions for Automatic Speech Recognition
Aitzol Ezeiza1, Karmele López de Ipiña1, Carmen Hernández1, and Nora Barroso2
1 Department of System Engineering and Automation, University of the Basque Country, Spain
{aitzol.ezeiza,mamen.hernandez,karmele.ipina}@ehu.es
2 Irunweb Enterprise, Auzolan 2B – 2, Irun, Spain
Abstract. Hidden Markov Models and Mel Frequency Cepstral Coefficients (MFCCs) are a de facto standard for Automatic Speech Recognition (ASR) systems, but they fail to capture the nonlinear dynamics that are present in speech waveforms. The extra information provided by nonlinear features could be especially useful when training data is scarce, or when the ASR task is very complex. In this work, the Fractal Dimension (FD) of the observed time series is combined with the traditional MFCCs in the feature vector in order to enhance the performance of two different ASR systems: the first is a very simple one, with very few training examples, and the second is a Large Vocabulary Continuous Speech Recognition system for Broadcast News.
Keywords: Nonlinear Speech Processing, Automatic Speech Recognition, Mel
Frequency Cepstral Coefficients, Fractal Dimensions.
1 Introduction
There are strong foundations for the claim that speech is a nonlinear process [1], but even though many research groups are working on nonlinear enhancements for Speech Processing, most Automatic Speech Recognition (ASR) systems are based on linear models. State-of-the-art ASR systems are mostly developed using Hidden Markov Models (HMMs) and linear filtering techniques based on Fourier Transforms, such as Mel Frequency Cepstral Coefficients (MFCCs). There have been many success stories using these methods, but the development of such systems requires large corpora for training and, as a side effect, they are very language-dependent. If the appropriate corpora are available, most systems rely on Machine Learning techniques, so they do not need much extra effort in order to achieve their minimal goals. In contrast, ASR tasks that have to deal with a very large vocabulary, with under-resourced languages [2], or with noisy environments have to try alternative techniques. An interesting set of alternatives comes in the form of nonlinear analysis [3], and some works [4,5,6] show that combining nonlinear features with MFCCs can produce higher recognition accuracies without substituting the whole linear system with novel nonlinear approaches.

One of these alternatives is to consider the fractal dimension of the speech signal as a feature in the training process. Interest in fractals in speech dates back to the
mid-1980s [7], and fractals have been used for a variety of applications, including consonant/vowel characterization [8,9], speaker identification [10], and end-point detection [11], even for whispered speech [12]. Indeed, this metric has also been used in speech recognition, in some cases combined with MFCCs as described above [4]. Remarkably, the most notable contributions to the enhancement of ASR using fractals and other features of nonlinear and chaotic systems have been made by the Computer Vision, Speech Communication, and Signal Processing Group of the National Technical University of Athens [4,13,14,15,16].
The simple approach of this work is to improve the HMM-based systems developed in our previous work [2] by augmenting the MFCC-based features with Fractal Dimensions. More precisely, an implementation of Higuchi's algorithm [17] has been applied to the same sliding window employed for the extraction of the MFCCs, in order to add this new feature to the set that feeds the training process of the HMMs of the Speech Recognition system. Given the complexity of the Broadcast News task, an initial experiment was assembled with a very simple system [18], with the aim of qualitatively evaluating the benefits of the methodology. This experiment is significant in its own right because the system was developed using a very small corpus, which is one of our strands of work.

The rest of this paper is organized as follows: in Section 2 the methodology of the experiments is explained, Section 3 shows the experimental results, and finally conclusions are presented in Section 4.
2 Methodology
2.1 Fractal Dimension
The Fractal Dimension is one of the most popular features for describing the complexity of a system. Most if not all fractal systems have a characteristic called self-similarity. An object is self-similar if a close-up examination reveals that it is composed of smaller versions of itself. Self-similarity can be quantified as a relative measure of the number of basic building blocks that form a pattern, and this measure is defined as the Fractal Dimension. There are several algorithms for measuring the Fractal Dimension, but the current work focuses on the alternatives which do not need prior modelling of the system. Two of these algorithms are Katz [19] and Higuchi [17], named after their authors. Of these two similar methods, Higuchi's has been chosen because it has been reported to be more accurate [20], but the Katz algorithm will be tested in future work, since it obtains better results in certain cases.
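As a concrete illustration of such a model-free estimator, the Katz method can be sketched in a few lines of Python. This is a hedged sketch of Katz's waveform formulation; treating the sample index as a unit-step abscissa is an assumption of this illustration, not a detail taken from the paper.

```python
import numpy as np

def katz_fd(x):
    """Katz fractal dimension of a 1-D series, treating the sample index
    as a unit-step abscissa (an assumption of this sketch).

    FD = log10(n) / (log10(n) + log10(d / L)), where L is the total curve
    length, d the maximum distance from the first point, and n = N - 1
    the number of steps.
    """
    x = np.asarray(x, dtype=float)
    N = len(x)
    # Length of each segment between successive samples (unit time step)
    dists = np.sqrt(1.0 + np.diff(x) ** 2)
    L = dists.sum()
    # Maximum distance between the first point and any other point
    d = np.sqrt(np.arange(N) ** 2 + (x - x[0]) ** 2).max()
    n = N - 1
    return np.log10(n) / (np.log10(n) + np.log10(d / L))

# A straight line has dimension 1; an irregular series scores higher.
print(katz_fd(np.linspace(0.0, 1.0, 100)))  # → 1.0
```

For a straight line the total length L equals the end-to-end distance d, so the second logarithm vanishes and the dimension is exactly 1, which is a convenient sanity check for any implementation.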
Higuchi [17] proposed an algorithm for measuring the Fractal Dimension of discrete time sequences directly from the time series x(1), x(2), ..., x(N). Without going into detail, for every delay k and every initial offset m = 1, ..., k the algorithm calculates the normalized curve length L_m(k) given by Equation 1, covering the whole series:

L_m(k) = \frac{1}{k} \left[ \left( \sum_{i=1}^{\lfloor (N-m)/k \rfloor} \left| x(m+ik) - x(m+(i-1)k) \right| \right) \frac{N-1}{\lfloor (N-m)/k \rfloor \, k} \right]    (1)
After that, the mean of the lengths L_m(k) over the k possible offsets m is determined for each k with Equation 2:

L(k) = \frac{1}{k} \sum_{m=1}^{k} L_m(k)    (2)

Finally, the slope of the curve of \ln L(k) versus \ln(1/k) is estimated using a least-squares linear fit, and the result is the Higuchi Fractal Dimension (HFD).
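The whole procedure can be sketched as follows. This is a minimal Python sketch of Equations 1 and 2 together with the log-log fit; the choice of `k_max` is illustrative, not a value prescribed by the paper.

```python
import numpy as np

def higuchi_fd(x, k_max=8):
    """Estimate the Higuchi Fractal Dimension of a 1-D time series.

    For each delay k and offset m, the normalized curve length L_m(k)
    (Equation 1) is computed, averaged over m (Equation 2), and the HFD
    is the slope of ln L(k) versus ln(1/k) from a least-squares fit.
    """
    x = np.asarray(x, dtype=float)
    N = len(x)
    ln_inv_k, ln_L = [], []
    for k in range(1, k_max + 1):
        lengths = []
        for m in range(1, k + 1):
            n_max = (N - m) // k
            if n_max < 1:
                continue
            # Indices of the sub-series x(m), x(m+k), x(m+2k), ...
            idx = m - 1 + np.arange(n_max + 1) * k
            dist = np.abs(np.diff(x[idx])).sum()
            # Normalized length L_m(k), Equation 1
            lengths.append(dist * (N - 1) / (n_max * k) / k)
        ln_inv_k.append(np.log(1.0 / k))
        ln_L.append(np.log(np.mean(lengths)))  # Equation 2 (mean over m)
    # HFD = slope of ln L(k) against ln(1/k)
    slope, _ = np.polyfit(ln_inv_k, ln_L, 1)
    return slope

# A densely sampled sine is nearly one-dimensional; white noise
# approaches dimension 2.
t = np.linspace(0.0, 4.0 * np.pi, 1000)
print(higuchi_fd(np.sin(t)))                  # close to 1
rng = np.random.default_rng(0)
print(higuchi_fd(rng.normal(size=1000)))      # close to 2
```

Since L(k) scales as k^{-D}, plotting ln L(k) against ln(1/k) yields a line whose slope is the dimension D, which is why the fit at the end directly returns the HFD.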
Once the HFD algorithm is implemented, the method employed in the development of the experiments described in this work has been the following:

1. The original MFCCs are extracted from the waveforms employing the standard tool available in the HMM ToolKit (HTK) [21], and they are stored in a single MFCC file for each speech recording file.
2. The same window size and time shift are applied to the original waveform data, and each of these sub-waveforms is the input of the Higuchi Fractal Dimension (HFD) function.
3. The result of the function is appended to the original feature vector, and the complete result of the processing of the whole speech recording file is stored in a new MFCC file.
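The three steps above can be sketched as follows. The frame length, shift, and helper names here are hypothetical, chosen to mirror a typical 25 ms / 10 ms HTK configuration at 16 kHz; any fractal-dimension estimator can be plugged in as `fd_func`.

```python
import numpy as np

# Hypothetical frame parameters (25 ms window, 10 ms shift at 16 kHz),
# chosen to mirror a common HTK MFCC setup.
WIN, SHIFT = 400, 160

def frame_signal(x, win=WIN, shift=SHIFT):
    """Slice the waveform into the same overlapping windows used for MFCCs."""
    n_frames = 1 + max(0, (len(x) - win) // shift)
    return np.stack([x[i * shift : i * shift + win] for i in range(n_frames)])

def augment_features(mfcc, waveform, fd_func):
    """Append one fractal-dimension value per frame to the MFCC matrix.

    `mfcc` is an (n_frames, 39) array as produced by the HTK extraction
    step; `fd_func` is any per-window fractal-dimension estimator.
    """
    frames = frame_signal(waveform)
    n = min(len(frames), len(mfcc))        # guard against off-by-one framing
    hfd = np.array([fd_func(f) for f in frames[:n]])
    return np.hstack([mfcc[:n], hfd[:, None]])   # shape (n, 40)
```

The key design point is that the HFD is computed on exactly the same windows as the MFCCs, so the augmented matrix keeps one row per frame and can be written back to a feature file without changing the frame rate.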
2.2 Description of the ASR Tasks
With the aim of exploring the suitability of the Higuchi Fractal Dimension for ASR tasks, two separate experiments have been developed. The first one has been employed as a test bed for several analyses, and the second one as the target task of the research.
The first is a Chinese Digit Recognition task developed by Jang [18]. This is a simple system developed using a small corpus of 56 recordings of each of the 10 standard digits in Chinese. Since it is an isolated-word recognition task with a very small lexicon, the difficulty of the task lies in the lack of transcribed recordings. The baseline system has been trained using a feature vector of size 39 (12 MFCCs + the C0 log-energy component and their first and second derivatives). The enhanced system combines the previous 39 features with the HFD of each window's time series. Those features have been used to train HMM models using HTK [21], and for training/testing purposes the corpus has been divided using 460 recordings for training, while the remaining 100 have been reserved for testing.
The second task is a Broadcast News task in Spanish. The available corpus consists of 832 sentences extracted from the hourly news bulletins of the Infozazpi radio [2]. The total size of the audio content is nearly one hour (55 minutes and 38 seconds), and the corpus has these relevant characteristics:

1. It only has two speakers, because this is not an interactive radio program, but an hourly bulletin of highlights of the daily news.
2. The background noise (mostly filling music) is also considerable. Two measures have been employed: NIST STNR and WADA SNR, resulting in 10.74 dB and 8.11 dB respectively, whilst common measures for clean speech are about 20 dB.
Fig. 1. Waveform and Higuchi Fractal Dimension (HFD) function of the words “jiou” and
“liou” in Chinese
3. The speed of speech is fast (16.1 phonemes per second) compared to other Broadcast News corpora (an average of 12 phonemes per second in Basque and French for the same broadcaster).
4. Cross-lingual effects: 3.9% of the words are in Basque, so it is much more difficult to use models from other Spanish corpora.
5. The size of the vocabulary is large in proportion: there are a total of 12,812 word utterances and 2,042 distinct word units.
In order to get significant results, the system has been trained using allophones and triphones. The feature vector in this second case comprises 42 parameters (13 MFCCs + the C0 log-energy component and their first and second derivatives). In the same way as in the first system, the enhanced system uses a feature vector of size 43 (the previous 42 features + HFD). In this case, testing has been done with 20-fold cross-validation.
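The 20-fold protocol can be sketched with a generic split helper; the helper name and seed are illustrative, not taken from the paper.

```python
import numpy as np

def k_fold_indices(n_items, k=20, seed=0):
    """Split item indices into k disjoint test folds.

    A hypothetical helper mirroring the 20-fold evaluation: each item
    appears in exactly one test fold, and the model is trained on the
    remaining k-1 folds.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_items)
    return np.array_split(order, k)

# 832 sentences in the Broadcast News corpus, split into 20 test folds.
folds = k_fold_indices(832, k=20)
```

With 832 sentences and 20 folds, each fold holds 41 or 42 sentences, so every recognition result is eventually measured on held-out data despite the small corpus.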
3 Results of the Experiments
Some attention-grabbing results have been gathered from the experiments carried out with the systems described in the previous section. The experiment on Chinese Digit Recognition is very limited in terms of both training and testing, but the improvement is noteworthy (see Table 1). During the regular test, where the input of the system was a set of 100 recordings (10 for each digit), the Correct Word Rate increased by two points. Indeed, some other experiments, in which some features of the original MFCC vector were substituted with the Fractal Dimension, reached the 96% threshold, which suggests the feature selection might be revised for this case. In any case, the improvement is significant enough to be taken into account. In fact, the results confirm the conclusions of previous works, which stated that the most significant benefit of using fractals is their usefulness in distinguishing between voiced and unvoiced sounds [8], and between affricates and other sounds [9]. For example, Figure 1 shows two very similar cases that were mismatched using only MFCCs but were classified correctly using the HFD. In this case, the Fractal Dimension is useful for differentiating between a liquid /l/ and an affricate /j/.
Table 1. Correct Word Rate (CWR) of the two experiments

Task name                    MFCC only    MFCC+HFD
Chinese Digit Recognition    93%          95%
Infozazpi Broadcast News     55.755%      55.738%
In actual fact, the complex Broadcast News task was a much closer contest. The Correct Word Rate barely changed, but it has to be remarked that the system has a very large set of basic units and very few utterances available for each of them, which makes it difficult to extrapolate information based on a single parameter, as is the case in the MFCC+HFD experiment. Nevertheless, other indicators and particular examples advise continuing to work along this line. In particular, some of the sounds had lower confusion rates, but this was not reflected in the final results because of dictionary and Language Modelling errors that are very common in complex tasks with large vocabularies.
4 Conclusions and Future Work
In this work, a first approach to the inclusion of nonlinear features in an already developed state-of-the-art HMM-based ASR system is described. By augmenting the MFCCs with one extra feature, the useful information that was present in the original system is not affected, while the Fractal Dimension adds useful information about the dynamics of speech generation. Additionally, a quite simple method has been proposed that consists of inserting the extra features using the same window as the one used during MFCC feature extraction. This straightforward approach might be frail in terms of capturing the dynamics of the whole waveform, but it offers many advantages in terms of computability, and it also makes it easier to compare the power
of the new features against the traditional ones. Overall, the results suggest that it is worth considering this and other nonlinear features in order to obtain more robust ASR systems, even if the improvement in terms of Word Error Rate is not significant in some of the tasks. From this point of view, our current work streams include trying new related features such as Lyapunov Exponents [14] and Filtered Dynamics [15]. Finally, one of our current tasks consists in developing an ontology-driven Information Retrieval system for Broadcast News [22], which employs many advanced techniques and could include the Fractal Dimension as a feature in the near future.
References
1. Teager, H.M., Teager, S.M.: Evidence for Nonlinear Sound Production Mechanisms in the Vocal Tract. In: Speech Production and Speech Modelling, Bonas, France. NATO Advanced Study Institute Series D, vol. 55 (1989)
2. Barroso, N., López de Ipiña, K., Ezeiza, A.: Acoustic Phonetic Decoding Oriented to Multilingual Speech Recognition in the Basque Context. Advances in Intelligent and Soft Computing, vol. 71. Springer, Heidelberg (2010)
3. Faúndez, M., Kubin, G., Kleijn, W.B., Maragos, P., McLaughlin, S., Esposito, A., Hussain, A., Schoentgen, J.: Nonlinear speech processing: overview and applications. Int. J. Control Intelligent Systems 30(1), 1–10 (2002)
4. Pitsikalis, V., Maragos, P.: Analysis and Classification of Speech Signals by Generalized Fractal Dimension Features. Speech Communication 51(12), 1206–1223 (2009)
5. Indrebo, K.M., Povinelli, R.J., Johnson, M.T.: Third-Order Moments of Filtered Speech Signals for Robust Speech Recognition. In: Faundez-Zanuy, M., Janer, L., Esposito, A., Satue-Villar, A., Roure, J., Espinosa-Duro, V. (eds.) NOLISP 2005. LNCS (LNAI), vol. 3817, pp. 277–283. Springer, Heidelberg (2006)
6. Shekofteh, Y., Almasganj, F.: Using Phase Space based processing to extract proper features for ASR systems. In: Proceedings of the 5th International Symposium on Telecommunications (2010)
7. Pickover, C.A., Khorasani, A.: Fractal characterization of speech waveform graphs. Computers & Graphics (1986)
8. Martinez, F., Guillamon, A., Martinez, J.J.: Vowel and consonant characterization using fractal dimension in natural speech. In: NOLISP 2003 (2003)
9. Langi, A., Kinsner, W.: Consonant Characterization Using Correlation Fractal Dimension for Speech Recognition. In: IEEE Wescanex 1995, Communications, Power and Computing, Winnipeg, MB, vol. 1, pp. 208–213 (1995)
10. Nelwamondo, F.V., Mahola, U., Marwala, T.: Multi-Scale Fractal Dimension for Speaker Identification Systems. WSEAS Transactions on Systems 5(5), 1152–1157 (2006)
11. Li, Y., Fan, Y., Tong, Q.: Endpoint Detection in Noisy Environment Using Complexity Measure. In: Proceedings of the 2007 International Conference on Wavelet Analysis and Pattern Recognition, Beijing, China (2007)
12. Chen, X., Zhao, H.: Fractal Characteristic-Based Endpoint Detection for Whispered Speech. In: Proceedings of the 6th WSEAS International Conference on Signal, Speech and Image Processing, Lisbon, Portugal (2006)
13. Maragos, P.: Fractal Aspects of Speech Signals: Dimension and Interpolation. In: Proceedings of ICASSP 1991, Toronto, Canada, pp. 417–420 (1991)
14. Maragos, P., Potamianos, A.: Fractal Dimensions of Speech Sounds: Computation and Application to Automatic Speech Recognition. Journal of the Acoustical Society of America 105(3), 1925–1932 (1999)
15. Pitsikalis, V., Kokkinos, I., Maragos, P.: Nonlinear Analysis of Speech Signals: Generalized Dimensions and Lyapunov Exponents. In: Proceedings of Interspeech 2002, Santorini, Greece (2002)
16. Pitsikalis, V., Maragos, P.: Filtered Dynamics and Fractal Dimensions for Noisy Speech Recognition. IEEE Signal Processing Letters 13(11), 711–714 (2006)
17. Higuchi, T.: Approach to an irregular time series on the basis of the fractal theory. Physica D 31, 277–283 (1988)
18. Jang, J.S.R.: Audio Signal Processing and Recognition. Available at the links for on-line courses at the author's homepage, http://www.cs.nthu.edu.tw/~jang
19. Katz, M.: Fractals and the analysis of waveforms. Comput. Biol. Med. 18(3), 145–156 (1988)
20. Esteller, R., Vachtsevanos, G., Echauz, J., Litt, B.: A comparison of waveform fractal dimension algorithms. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications 48(2), 177–183 (2001)
21. Young, S., Kershaw, D., Odell, J., Ollason, D., Valtchev, V., Woodland, P.: The HTK Book 3.4. Cambridge University Press, Cambridge (2006)
22. Barroso, N., López de Ipiña, K., Ezeiza, A., Hernandez, C., Ezeiza, N., Barroso, O., Susperregi, U., Barroso, S.: GorUp: an ontology-driven Audio Information Retrieval system that suits the requirements of under-resourced languages. In: Proceedings of Interspeech 2011, Firenze (2011)