CHAPTER 1
INTRODUCTION
1.1 OVERVIEW
Speech is the primary means of human communication, and the idea of building mechanical models that mimic human behavior, particularly the capability of speaking naturally and responding properly to spoken language, has intrigued engineers and scientists for centuries. This has made speech processing and Automatic Speech Recognition (ASR) by machines one of the most attractive areas of research over the past five decades (Biing-Hwang (Fred) Juang and Lawrence Rabiner 2005). ASR is the extraction of linguistic information from a speech utterance; it is a real-time, computer-based transcription system that converts spoken language into a sequence of words. This technology enables a computer to communicate with humans by detecting spoken words and following voice commands (Anusuya and Katti 2009).
In its early stages, speech recognition technology was used by people with physical disabilities who often find typing difficult, painful or impossible. It also helps those with spelling difficulties, including people with dyslexia, to produce words that are always correctly spelled. Now that the technology has become more sophisticated, its application areas have widened to wherever a human-machine interface is required, such as telephone networks, query-based information retrieval, document creation from dictation, medical transcription, language translation, railway reservation and supermarkets (Anusuya and Katti 2009).
Although many technological improvements have been made in ASR, recognition accuracy is still far from human levels (Alejandro Acero 1990). Generally, when speech recognition systems are applied in real-world applications, where the acoustic environment cannot be controlled, a mismatch between the testing and training conditions is likely, causing degradation in performance (Sankar et al. 1996). This happens when the system is not designed to account for the variability of the on-field environment.
Benzeghiba et al. (2007) have given a detailed review of the different kinds of variability in ASR. Two broad groups of variability can be defined: extrinsic (non-speaker-related) and intrinsic (speaker-related). Environmental noise and transmission artifacts are two examples of extrinsic variability.
Besides varying accents, speaking styles and speaking rates, age and emotional state, it is the shape of the vocal tract that intrinsically contributes to the variability of speech signals representing the same textual content (Florian Muller and Alfred Mertins 2011). The problems originating from different Vocal Tract Lengths (VTLs) become especially apparent in speaker-independent ASR systems (Benzeghiba et al. 2007). The goal of ASR is to have speech as a medium of interaction between man and machine, and it is desired that an ASR system be robust to these unwanted variabilities (Alan Oppenheim 1969).
Bearing these in mind, this research work focuses on the implementation of a noise-resilient and speaker-independent system for continuous speech recognition. Mel-Frequency Cepstral Coefficients (MFCCs) and auditory transform based features called Cochlear Filter Cepstral Coefficients (CFCCs), which resemble the processing of the peripheral auditory system, have been applied in this research work as feature extraction algorithms. To improve the noise robustness of MFCC and CFCC under mismatched training and testing conditions, these features are enhanced by the application of a wavelet based denoising algorithm called adaptive wavelet thresholding.
Also, to overcome the effects of inter-speaker variability originating from different VTLs, a feature extraction method based on the principle of invariant integration is applied in this research work. This method integrates regular nonlinear functions of the features over the transformation group for which invariance should be achieved. These features are referred to as Invariant-Integration Features (IIFs).
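To make the idea concrete, the following is a minimal sketch of how a single invariant-integration feature could be computed from the subband outputs of one frame: a monomial of selected subband values is averaged over translations along the subband index, the transformation group for which invariance is sought here. The function name, the monomial (channels and exponents) and the translation window are illustrative assumptions, not the exact parameterization used in this work; contextual (multi-frame) IIFs are omitted for brevity.

```python
import numpy as np

def invariant_integration_feature(subband_feats, channels, exponents, window=4):
    """Illustrative invariant-integration feature (IIF) for one time frame.

    subband_feats : 1-D array of subband magnitudes (e.g. mel or gammatone)
                    for a single frame.
    channels      : subband indices entering the monomial.
    exponents     : exponent of each selected subband in the monomial.
    window        : half-width of the translation range (in subband indices)
                    over which the monomial is averaged.
    """
    n_subbands = len(subband_feats)
    values = []
    for shift in range(-window, window + 1):
        # Translate the selected channels along the subband index and
        # evaluate the monomial; averaging over shifts gives (approximate)
        # invariance to VTL-related spectral translations.
        idx = np.clip(np.asarray(channels) + shift, 0, n_subbands - 1)
        values.append(np.prod(subband_feats[idx] ** np.asarray(exponents)))
    return float(np.mean(values))

# Toy usage: 23 subband energies for one frame, one example monomial.
rng = np.random.default_rng(0)
frame = np.abs(rng.normal(size=23)) + 0.1
iif = invariant_integration_feature(frame, channels=[5, 9], exponents=[1, 2])
print(f"example IIF value: {iif:.4f}")
```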
The basic steps involved in the proposed speech recognition algorithm are given in Figure 1.1. Since preprocessing of the speech signal is considered a crucial step in the development of a robust and efficient speech or speaker recognition system, preprocessing of the input speech signal is applied as an initial step. In the feature extraction stage, the MFCC features are extracted as a first step, and as a second step an auditory transform based feature extraction algorithm is applied. The features generated by the auditory transform based feature extraction are named CFCCs.
Figure 1.1 Basic steps involved in the proposed speech recognition algorithm
After feature extraction, denoising is performed to enhance the speech features. In this research work, the wavelet based denoising algorithm called adaptive wavelet thresholding is applied. The features are then made robust against VTL changes by the application of invariant integration. Finally, the features are trained/recognized using neural networks.
As the dimensionality of the resultant feature vectors after invariant integration is large, there is a need to optimize and classify the features. For classification, a Feature-Finding Neural Network (FFNN) has been
used. The FFNN consists of a feature extracting network and a linear classifier, a linear single-layer perceptron, which classifies the features in order to recognize the words. Tino Gramss and Hans Werner Strube (1990) applied this classification system to isolated word recognition and showed that the FFNN is faster in recognition than the classical Hidden Markov Model (HMM) and Dynamic Time Warping (DTW) recognizers while yielding similar recognition rates. For the optimization of the features, the substitution algorithm (Tino Gramss 1991) is applied in this work.
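As an illustration of the classifier part only, the sketch below trains a linear single-layer perceptron by least squares on word-level feature vectors. It is a minimal stand-in under simplifying assumptions, not the original FFNN of Gramss and Strube; the feature-finding network and the substitution algorithm for feature optimization are not reproduced, and all names and data are illustrative.

```python
import numpy as np

def train_linear_classifier(features, labels, n_classes):
    """Least-squares training of a linear single-layer perceptron.

    features : (n_samples, n_features) matrix of (optimized) features.
    labels   : integer class labels (word identities).
    Returns a weight matrix mapping augmented features to class scores.
    """
    X = np.hstack([features, np.ones((features.shape[0], 1))])  # bias column
    T = np.eye(n_classes)[labels]                               # one-hot targets
    W, *_ = np.linalg.lstsq(X, T, rcond=None)
    return W

def classify(features, W):
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    return np.argmax(X @ W, axis=1)

# Toy usage with random data standing in for word-level feature vectors.
rng = np.random.default_rng(1)
X_train, y_train = rng.normal(size=(200, 30)), rng.integers(0, 10, 200)
W = train_linear_classifier(X_train, y_train, n_classes=10)
print("training accuracy:", np.mean(classify(X_train, W) == y_train))
```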
Both the proposed MFCC and CFCC based speech recognition systems have been evaluated in a task where the acoustic conditions of training and testing are mismatched, i.e., the training data set was recorded under clean conditions while the testing data sets were mixed with different types of background noise at various noise levels. The systems have also been tested with matching and mismatching VTLs. In this work, the Speech Separation Challenge database has been used for training and testing. This research work has been carried out using Matlab 7.10.
1.2 NEED FOR ROBUST SPEECH RECOGNITION AND ITS
IMPORTANCE
In order to maintain good speech recognition accuracy even when the quality of the input speech is corrupted, or when there are differences in the acoustical, articulatory or phonetic characteristics of speech, ‘robustness’ is required. Some of the recognized obstacles are acoustical degradation caused by additive noise, the effects of linear filtering, nonlinearities in transduction or transmission, impulsive interfering sources, and diminished accuracy caused by changes in articulation produced by the presence of high-intensity noise sources (Richard Stern 1997). Other sources of variation are speaker-to-speaker differences, variations in speech rate, co-articulation, context, and dialect. When the training and testing conditions differ, the performance of speaker-independent systems also starts to degrade. The invariance of recognition performance under such disturbances is called robustness.
Speech recognition systems have become much more robust in recent years with respect to both speaker variability and acoustical variability. In addition to achieving speaker independence, many current systems can also automatically compensate for modest amounts of acoustical degradation caused by the effects of unknown noise and unknown linear filtering (Pedro Moreno et al. 1995).
As speech recognition and spoken language technologies are transferred to real-world applications, the need for greater robustness in recognition technology is becoming increasingly apparent (Meenakshi Sharma and Salil Khare 2009).
With the rapid development of voice communication and information systems, efficient interactions between users and terminals or remote database systems are required. Robust speaker-dependent/independent speech recognition technology makes this possible. The first systems are available that can compensate for modest amounts of acoustical degradation caused by unknown noise and unknown linear filtering. Still, the performance of even the best state-of-the-art systems deteriorates heavily in the adverse conditions mentioned above. This is one of the main reasons that prevent ASR from being used in everyday situations, so increased robustness is still a very desirable property in ASR.
There exist three different approaches to achieving this goal (Wynand Harmse 2004):
In the first approach, the disturbances are removed from the speech signal before features that carry speech-relevant information are extracted. A number of methods exist to deal with additive or convolutive noise (such as spectral subtraction, processing with the Ephraim-Malah algorithm or inverse filtering). One of the downsides of such processing is that the application of these techniques produces artifacts in the speech signal, for example due to incorrect estimation of the noise signal.
Another approach is to design a robust feature extraction in which the features are as invariant as possible under adverse acoustical conditions. This approach has been pursued in this research work. In the third approach, the classifier is designed to cope with a large variety of noise signals. This is achieved by training multiple acoustical models with speech under different noise conditions. The problems with this approach are the increase in computational cost and the demand for memory, as well as the automatic selection of the appropriate model depending on the actual acoustical situation.
1.3 GENERAL FEATURE EXTRACTION MODELS IN SPEECH
RECOGNITION
Feature extraction is the process of obtaining different features such
as power, pitch, and vocal tract configuration from the speech signal. Feature
extraction is the first crucial component in automatic speech processing.
Generally speaking, successful front-end features should carry enough
discriminative information for classification or recognition, fit well with the
back-end modeling and be robust with respect to the changes of acoustic
environments. At a high level, most speech feature extraction methods fall into the following two categories:
i. Modeling the human voice production system
ii. Modeling the peripheral auditory system.
1.3.1 Human Auditory-System Based Feature Extraction in Speech
Recognition
The imitation of the human hearing system is a promising research
direction towards improving feature robustness (Qi Li and Yan Huang 2010).
Motivated by the fact that the human auditory system outperforms current
machine-based systems for acoustic signal processing, many research works
have been done for developing high performance systems. The traveling
waves of the basilar membrane in the cochlea and its impulse response have
been measured and reported in the literature (Aage Moller 1977, Sellick et al.
1982). Moreover, the basilar membrane tuning and auditory filters have also,
been studied in the literature (Roy Patterson 1976, Brian Moore et al. 1990,
Bin Zhou 1995, Kuansan Wang and Shihab Shamma 1995, Dennis Barbour
and Xiaoqin Wang 2003). Many electronic and mathematical models have been defined to simulate the traveling wave, the auditory filters, and the frequency responses of the basilar membrane (James Flanagan 1972, Richard Lyon and Carver Mead 1988, James Kates 1991, James Kates 1993, Liu et al. 1992).
1.3.1.1 Human auditory-system
The human ear, as shown in Figure 1.2, has three sections: the outer
ear, the middle ear and the inner ear (Xuedong Huang et al. 2001). The outer
ear consists of the external visible part and the external auditory canal that
forms a tube along which sound travels. This tube is about 2.5 cm long and is
covered with the eardrum at the far end. When the air pressure variations
reach the eardrum from the outside, it vibrates, and transmits the vibrations to
bones adjacent to its opposite side (William Tecumseh Sherman Fitch 1994).
The vibration of the eardrum is at the same frequency (alternating
compression and rarefaction) as the incoming sound pressure wave. The
middle ear is an air-filled space or cavity about 1.3 cm across and about 6 cm³ in volume. The air travels to the middle ear cavity along the tube (when opened) that connects the cavity with the nose and the throat. The mammalian inner ear is a spiral structure, the cochlea (snail), consisting of three fluid-filled chambers, or scalae: the scala vestibuli, the scala media, and the scala tympani. The oval window shown in Figure 1.2 (Ben Clopton and Francis
Spelman 2003) is a small membrane at the bony interface to the inner ear
(cochlea). Since the cochlear walls are bony, the energy is transferred
by a mechanical action of the stapes into an impression on the membrane
stretching over the oval window.
Figure 1.2 The structure of the human ear
The relevant structure of the inner ear for sound perception is the cochlea, which communicates directly with the auditory nerve, conducting a representation of sound to the brain. The cochlea is a spiral tube about 3.5 cm long, which coils about 2.6 times.
The spiral is divided, primarily by the basilar membrane running lengthwise, into two fluid-filled chambers. The cochlea can be roughly regarded as a filter bank, whose outputs are ordered by location, so that a frequency-to-place transformation is accomplished. The filters closest to the cochlear base respond to the higher frequencies and those closest to its apex respond to the lower.
1.3.1.2 Existing methods
Based on the concept of the cochlea, many feature extraction
algorithms have been developed for speech recognition, such as
MFCCs, Fourier transform, wavelet transform and others (Qi Li 2009).
The Fourier transform is the most popularly used transform to
convert signals from the time domain to frequency domain. However, it has a
fixed time-frequency resolution and its frequency distribution is restricted to
be linear. These limitations generate problems in audio and speech processing
such as the pitch harmonics, computational noise, and sensitivity to
background noise. On the other hand, the wavelet transform provides flexible
time-frequency resolution, but also has notable problems. First, no existing
wavelet is capable of mimicking the impulse responses of the basilar
membrane closely, so it cannot be directly used to model the cochlea or carry
out related computation. Additionally, even though forward and inverse
continuous wavelet transforms are defined for continuous variables, there is
no numerical computational formula for real Inverse Continuous Wavelet
Transforms (ICWT). No such function exists even in a commercial wavelet
package. Discrete Wavelet Transform (DWT) has been applied in speech
processing, but the frequency distribution is limited to the dyadic scale which
is different from the scale in the cochlea.
Perceptual Linear Predictive (PLP) analysis is another peripheral
auditory-based approach. Based on the FFT output, it uses several
perceptually motivated transforms, including Bark frequency, equal-loudness
pre-emphasis, and cubic-root amplitude compression (Hynek Hermansky
1990). The Relative Spectra (RASTA) is further developed to filter the time
trajectory to suppress constant factors in the spectral component (Hynek
Hermansky and Nelson Morgan 1994). RASTA has been often cascaded with
the PLP feature extraction to form the RASTA-PLP features. Comparisons
between MFCC and RASTA-PLP have been reported by Grimaldi and
Cummins (2008). Both MFCC and RASTA-PLP features are based on the
Fourier transform. The Fourier transform has a fixed time–frequency
resolution and a well-defined inverse transform. Fast algorithms exist for both
the forward transform and the inverse transform. Despite its simplicity and
efficient computation algorithms, when applied to speech processing the
time–frequency decomposition mechanism of the Fourier transform is
different from the mechanism in the hearing system. First, it uses fixed-length
windows, which generate pitch harmonics over the entire speech bands.
Second, its individual frequency bands are distributed linearly, which is
different from the distribution in the human cochlea. Further warping is
needed to convert to the Bark, Mel, or other scales (Qi Li and Yan Huang
2011).
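For reference, a conventional MFCC front end of the kind discussed above can be sketched as follows: pre-emphasis, framing with a Hamming window, an FFT power spectrum with fixed-length windows and linearly spaced bins, triangular filters spaced evenly on the Mel scale, a logarithm and a DCT. The parameter values (frame length, number of filters, number of coefficients) below are illustrative assumptions, not the settings used later in this thesis.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, fs=16000, frame_len=400, hop=160, n_fft=512,
         n_mels=26, n_ceps=13, preemph=0.97):
    """Conventional MFCC front end (illustrative parameter values)."""
    # Pre-emphasis and framing with a Hamming window.
    sig = np.append(signal[0], signal[1:] - preemph * signal[:-1])
    n_frames = 1 + max(0, (len(sig) - frame_len) // hop)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = sig[idx] * np.hamming(frame_len)

    # Power spectrum via the FFT (fixed, linearly spaced frequency bins).
    pspec = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # Triangular filters spaced evenly on the Mel scale.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    # Log filterbank energies followed by a DCT give the cepstral coefficients.
    feats = np.log(pspec @ fbank.T + 1e-10)
    return dct(feats, type=2, axis=1, norm='ortho')[:, :n_ceps]

# Toy usage on one second of noise standing in for speech.
x = np.random.default_rng(2).normal(size=16000)
print(mfcc(x).shape)   # (n_frames, 13)
```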
Gammatone filter banks (Johannesma 1972) have been proposed to model the impulse responses of the basilar membrane and have been used to decompose time-domain signals into different frequency bands. However, there is no mathematical proof of how to synthesize the decomposed multichannel signals back into a time-domain signal. Although some suggestions on resynthesis have been given in plain language (Mitchel Weintraub 1985), or simply at the conceptual level, there remain no details or mathematical proofs to validate the accuracy and computational efficiency. In
(Hohmann 2002), a Gammatone based transform with analysis and synthesis
was presented, but the filter bank derives a complex valued output which is
not only different from the real cochlea, but further complicates its
implementation.
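For illustration, the widely used form of the gammatone impulse response, g(t) = t^(n-1) e^(-2*pi*b*t) cos(2*pi*f_c*t), can be generated as below, with the bandwidth b tied to the ERB of the center frequency (Glasberg and Moore). The sampling rate, filter order, duration and bandwidth scaling are illustrative assumptions, not values taken from this work.

```python
import numpy as np

def erb(fc):
    """Equivalent rectangular bandwidth (Glasberg and Moore) in Hz."""
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)

def gammatone_ir(fc, fs=16000, order=4, duration=0.025):
    """Impulse response t^(n-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t) of a gammatone filter."""
    t = np.arange(int(duration * fs)) / fs
    b = 1.019 * erb(fc)                      # common bandwidth scaling for a 4th-order filter
    g = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
    return g / np.max(np.abs(g))             # normalize peak amplitude

# Impulse responses for a few center frequencies.
for fc in (250.0, 1000.0, 4000.0):
    ir = gammatone_ir(fc)
    print(f"fc = {fc:6.0f} Hz, ERB = {erb(fc):6.1f} Hz, IR length = {len(ir)} samples")
```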
However, an auditory based transform with both forward and inverse transforms for digital computers is needed for many audio applications, such as noise reduction, hearing aids, coding, speech and music synthesis, and speaker and speech recognition. To apply the concept of the auditory system to audio signal processing, Qi Li (2009) has proposed an auditory-based transform, which was inspired by the traveling waves in the cochlea, and this auditory-based transform provides a new platform for research on robust feature extraction. This transform is intended to provide a simple and fast transform for real applications, as an alternative to the Fourier transform and the wavelet transform. Based on this auditory-based transform, Qi Li and Yan Huang (2010, 2011) have developed an auditory-based feature extraction algorithm for robust speaker identification. Under mismatched acoustic conditions, this feature consistently performs better than the MFCC features.
1.3.2 Human Voice Production System Based Feature Extraction in
Speech Recognition
A general problem in speaker-independent ASR is the high
variability that is inherent in human speech. The problems originating from
different VTLs become especially apparent in mismatching training-testing
conditions. For example, if children use an ASR system whose acoustic
models have only been trained with adult data, the recognition performance
degrades significantly compared to the performance of adult users. Therefore,
in speaker-independent ASR systems, one often uses speaker-adaptation
techniques to reduce the influence of speaker-related variabilities.
A common model of human speech production is the source-filter
model (Xuedong Huang et al. 2001). In this model, the source corresponds to
the air stream originating from the lungs and the filter corresponds to the
vocal tract, which is located between the glottis and lips, and is composed of
different cavities. The locations of the vocal tracts’ resonance frequencies (the
“formants”) shape the overall short-time spectrum and define the phonetic
content. The spectral effects of different VTLs have been widely studied
(Benzeghiba et al. 2007). An important observation is that, while the absolute
formant positions of individual phones are speaker specific, their relative
positions for different speakers are somewhat constant. A relation that
describes this observation is given by considering a uniform tube model with length l. Here, the resonances occur at frequencies F_i = c(2i - 1) / (4l), i = 1, 2, 3, …, where c is the speed of sound (John Deller et al. 1993). Using this model, the spectra S_A and S_B of the same utterance from two speakers A and B with different VTLs are related by a scaling factor α, which is also known as the frequency-warping factor (Florian Muller and Alfred Mertins 2011):
S_A(ω) = S_B(α ω) (2.1)
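To make Equation (2.1) concrete, the short computation below evaluates the first three resonances of the uniform tube model for tube lengths of 14 cm and 17 cm (the average female and male VTLs quoted in Section 1.3.2.1) and the resulting frequency-scaling factor between the two spectra. The speed of sound and the tube lengths are illustrative values only.

```python
import numpy as np

C = 343.0  # speed of sound in air at room temperature, m/s

def tube_resonances(length_m, n=3):
    """Resonances F_i = c*(2i-1)/(4*l) of a uniform tube closed at one end."""
    i = np.arange(1, n + 1)
    return C * (2 * i - 1) / (4.0 * length_m)

f_short = tube_resonances(0.14)   # ~14 cm vocal tract (average adult woman)
f_long = tube_resonances(0.17)    # ~17 cm vocal tract (average adult man)
alpha = f_long[0] / f_short[0]    # frequency-scaling (warping) factor = 14/17

print("14 cm tube resonances (Hz):", np.round(f_short))
print("17 cm tube resonances (Hz):", np.round(f_long))
print("scaling factor alpha:", round(alpha, 3), "reciprocal:", round(1 / alpha, 3))
```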
In a typical speaker-independent ASR task, the value of α is
between 0.8 and 1.2. This intrinsic variability has a negative effect on the
recognition rate of speaker independent ASR systems. Though Equation (2.1)
is only a rough approximation for the real relationship between spectra from
speakers with different VTLs, methods that try to achieve speaker
independency for an ASR system commonly take this relationship as their
fundamental assumption. A time-frequency analysis of the speech signal is
usually the first operation in an ASR feature extraction stage after possible
preprocessing steps such as pre-emphasis or noise cancellation. This analysis
tries to simulate the human auditory system up to a certain degree, and
different methods have been proposed. As it is done for the computation of
the well known MFCCs, a basic approach is the use of the Fast Fourier
Transformation (FFT) applied on windowed short-time signals whose output
is weighted by a set of triangular bandpass filters in the spectral domain
(Xuedong Huang et al. 2001). Another common filterbank approach uses
gammatone filters (Patterson et al. 1992). These filters were shown to fit the
impulse response of the human auditory filters well. Both types of
time-frequency analysis methods have in common that they locate the center
frequencies of the filters evenly spaced on nonlinear auditory motivated
scales. In case of the MFCCs the Mel scale is used (Steven Davis and Paul
Mermelstein 1980), and in case of a gammatone filterbank the Equivalent
Rectangular Bandwidth (ERB) scale is used (Patterson et al. 1992, Moore and
Glasberg 1996, Kentaro Ishizuka et al. 2006). Different works make use of
the observation that both the Mel and the ERB scale approximately map the
spectral scaling as described in Equation (2.1) to a translation along the
subband-index space of the time-frequency analysis.
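A quick numerical check of this observation for the Mel scale: since mel(f) = 2595 log10(1 + f/700), a frequency scaling f -> αf produces a shift mel(αf) - mel(f) that approaches the constant 2595 log10(α) for frequencies well above 700 Hz, so the scaling of Equation (2.1) becomes approximately a translation along the Mel (subband-index) axis; at lower frequencies the mapping is only approximate. The warping factor below is an illustrative value.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

alpha = 1.2                          # illustrative VTL warping factor
freqs = np.array([500.0, 1000.0, 2000.0, 4000.0, 8000.0])

shift = hz_to_mel(alpha * freqs) - hz_to_mel(freqs)
print("mel shift per frequency:", np.round(shift, 1))
print("asymptotic shift 2595*log10(alpha):", round(2595 * np.log10(alpha), 1))
```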
1.3.2.1 Human voice production system
The vocal tract is a fundamental component of the human speech
production system. The gender and age of the individual speaker are the two factors that determine the average VTL (Louis-jean Boe et al. 2006). The VTL is (in mammals) the distance from the glottis to the outer portion of the lips, shown as a dark line in Figure 1.3 (William Tecumseh Sherman Fitch 1994). The place where the vocal folds come together is called the glottis. The length of the vocal tract is strongly constrained by body size, lip size and the position of the larynx. Since the length of the vocal tract controls (all other things being equal) the dispersion of formants in the vocal tract transfer function, formant dispersion should provide a readily available acoustic cue to body size. VTL represents a source of variability between individual
speakers. On average, the VTL is about 14 cm for adult women and 17 cm for
men (Louis-jean Boe et al. 2006).
Figure 1.3 Side view of the human vocal tract
1.3.2.2 Existing methods
Methods that try to achieve speaker invariant feature extraction can
be roughly grouped into three categories (Florian Muller and Alfred Mertins
2011). These groups act on different stages of the ASR process and often may
be combined within the same ASR system. One group tries to normalize the
features after the extraction (Lee and Rose 1998), (Welling et al. 2002) by
estimating the implicit warping factors of the utterances. These techniques are
commonly referred to as VTL normalization (VTLN) methods. A second
group of methods adapts the acoustic models to the features of each utterance
(Leggetter and Woodland 1995). The use of Maximum-Likelihood Linear
Regression (MLLR) methods is part of most state-of-the-art recognition
systems nowadays. It was shown by Pitz and Ney (2005) that certain types of
VTLN methods are equivalent to constrained MLLR. Thus, a third group of
methods tries to generate features that are independent of the warping factor
(Welling et al. 1999, Rademacher et al. 2006, Monaghan et al. 2008).
The concept of computing features that are independent of the VTL
has been taken up by several works in the past, and different methods were
proposed. Leon Cohen (1993) has introduced the scale transformation which
was further investigated for its applicability in the field of ASR by Umesh
et al. (1999). Its use in ASR is motivated by the relationship given in
Equation (2.1). One property of the scale transformation is that the
magnitudes of the transformations of two scaled versions of one and the same
signal are the same. Thus, the magnitudes can be seen to be scaling invariant.
The scale Cepstrum, which has the same invariance property, was also
introduced by Umesh et al. (1999). The scale transformation is a special case
of the Mellin transformation. Roy Patterson (2000) has described a so-called
Auditory Image Model (AIM) that was extended with the Mellin
transformation by Irino and Patterson (2002). Further studies about the Mellin
transformation have been conducted, for example, by Antonio De Sena and
Davide Rocchesso (2005).
Various works rely on the assumption that the effects of inter-
speaker variability caused by VTL differences are mapped to translations
along the subband-index space of an appropriate filter bank analysis (Alfred
Mertins and Jan Rademacher 2005, 2006, Monaghan et al. 2008, Florian
Muller et al. 2009, Florian Muller and Alfred Mertins 2009, 2010,
Rademacher et al. 2006).
Alfred Mertins and Jan Rademacher (2005, 2006) have used a
wavelet transformation for the time-frequency analysis and proposed
so-called VTL Invariant (VTLI) features based on auto- and cross-
correlations of wavelet coefficients. Jan Rademacher et al. (2006) have shown
that a gammatone filterbank instead of a wavelet filterbank leads to a higher
robustness against the VTL changes.
Methods that extract a time-frequency representation of an input
signal for ASR tasks commonly locate the frequency centers of the analysis
filters on auditory motivated scales like the Mel or ERB scale. Using these scales, it has been shown (Monaghan et al. 2008, Umesh et al. 1996) that VTL changes approximately lead to translations in the subband-index space of these time-frequency representations. This can be utilized for the computation of features that are invariant to translation (Jan Rademacher et al. 2006, Monaghan et al. 2008). The invariance can lead to an increase of robustness
against VTL changes. The determination of invariants is well-founded in the
field of mathematics and physics. Practical methods for the retrieval of
invariants against rotation and translation were especially applied in the field
of pattern recognition.
The cyclic autocorrelation of a sequence and the modulus of the
Discrete Fourier Transform (DFT), fall under this type of transforms. A
general class of translation-invariant transforms, called the class CT, was introduced by Wagh and Kanetkar (1977) and further investigated by Hans Burkhardt and Xaver Muller (1980) in the field of pattern recognition. Another class of transformations, known as Generalized Cyclic Transformations (GCT), has been investigated in the field of speaker-independent speech recognition (Florian Muller et al. 2009). Instances of this class
were successfully used in the field of pattern recognition and feature
extraction in ASR systems.
One of these general methods integrates regular nonlinear functions
of the features over the transformation group for which invariance should be
achieved. This method is commonly known as “invariant integration”. It was
shown by Florian Muller and Alfred Mertins (2009), in large-vocabulary phoneme recognition experiments, that invariant integration based feature sets lead to better recognition results than the standard MFCCs under matching training and testing conditions, and that these features outperform the MFCCs in cases where the training and testing conditions differ with respect to the mean VTL.
1.4 SPEECH ENHANCEMENT BY DENOISING
In speech processing applications such as mobile communications,
speech recognition and hearing aids, speech has to be processed in the
presence of background noise. Therefore, the problem of removing uncorrelated noise components from noisy speech, i.e., speech enhancement, has been widely studied in the past, and it still remains an important issue in the field of speech research. Speech enhancement
techniques have been employed to improve the quality, and intelligibility of
the noise corrupted speech and/or the speech recognition performance. The
performance of such applications is highly dependent on how much the noise
is removed (Sumithra and Thanushkodi 2009).
1.4.1 Existing Methods
The problem of denoising consists of removing noise from
a corrupted signal without altering it. Generally, noise sources are classified as additive and convolutional. The former very often dominates in real-world applications, and the Spectral Subtraction (SS) approach has been a very popular solution for it (Steven Boll 1979, Michael Berouti et al. 1979, Sunil Kamath and Philipos Loizou 2002, Yasser Ghanbari et al. 2004, Yasser Ghanbari and Mohammad Reza Karami Mollaei 2004, Sumithra and Thanushkodi 2009). To subtract the noise components from the input noisy
speech, the SS algorithm has to estimate the statistics of the additive noise in
the frequency domain. Under low Signal-to-Noise Ratio (SNR) conditions, a spectral flooring process is usually applied to prevent over-subtraction. However, all such processes very often produce some
unnatural residual noise in the enhanced speech, the so-called musical noise,
due to the inevitable random tone peaks generated in the time-frequency
spectrogram. Previous studies have pointed out that this perceivable residual
noise can be effectively alleviated by considering the masking effect in
human auditory system (Dionysis Tsoukalas et al. 1997, Nathalie Virag 1999)
i.e., the residual noise will not be perceived if it is under the masking
thresholds in human auditory functions.
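A minimal sketch of magnitude spectral subtraction of the kind described above is given below, with the noise spectrum estimated from a few leading (assumed speech-free) frames, an over-subtraction factor, and a spectral floor to limit musical noise. All parameter values and the noise-estimation strategy are illustrative assumptions, not the settings used in this work.

```python
import numpy as np

def spectral_subtraction(noisy, frame=512, hop=256,
                         noise_frames=10, over_sub=2.0, floor=0.02):
    """Basic magnitude spectral subtraction with over-subtraction and flooring."""
    win = np.hanning(frame)
    n_frames = 1 + (len(noisy) - frame) // hop
    out = np.zeros(len(noisy))

    # Estimate the noise magnitude spectrum from the first few (speech-free) frames.
    noise_mag = np.zeros(frame // 2 + 1)
    for i in range(noise_frames):
        seg = noisy[i * hop:i * hop + frame] * win
        noise_mag += np.abs(np.fft.rfft(seg))
    noise_mag /= noise_frames

    for i in range(n_frames):
        seg = noisy[i * hop:i * hop + frame] * win
        spec = np.fft.rfft(seg)
        mag, phase = np.abs(spec), np.angle(spec)
        clean_mag = mag - over_sub * noise_mag
        clean_mag = np.maximum(clean_mag, floor * noise_mag)   # spectral floor
        out[i * hop:i * hop + frame] += np.fft.irfft(clean_mag * np.exp(1j * phase), frame) * win
    return out

# Toy usage: a sinusoid in white noise.
rng = np.random.default_rng(3)
t = np.arange(16000) / 16000
noisy = np.sin(2 * np.pi * 440 * t) + 0.5 * rng.normal(size=16000)
enhanced = spectral_subtraction(noisy)
print(enhanced.shape)
```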
In recent years, several alternative approaches such as signal
subspace methods (Yariv Ephraim and Harry Van Trees 1995), (Mark Klein
and Peter Kabal 2002) have been proposed for enhancing the degraded
speech. In the subspace method, the estimation of the signal subspace dimension is difficult for unvoiced periods and transitional regions. Existing approaches
to this task include traditional methods such as spectral subtraction and
Ephraim Malah filtering (Yariv Ephraim and David Malah 1984).
A drawback of this technique is the necessity to estimate the noise or the
SNR. This can be a strong limitation when recording with non-stationary
noise and for situations where the noise cannot be estimated. The Fourier domain has long been the method of choice for noise suppression. Recently, methods
based on the wavelet transformation have become increasingly popular.
Wavelets provide a powerful tool for non-linear filtering of signals
contaminated by noise. Stephane Mallat and Wen Liang Hwang (1992) have
shown that effective noise suppression may be achieved by transforming the
noisy signal into the wavelet domain, and preserving only the local maxima of
the transform. Alternatively, a reconstruction that uses only the
large-magnitude coefficients has been shown to approximate well the
uncorrupted signal.
In other words, noise suppression is achieved by thresholding the
wavelet transform of the contaminated signal. The method of wavelet
threshold denoising is based on the principle of the multiresolution analysis.
The discrete detail coefficients and the discrete approximation coefficients
can be obtained by multi-level wavelet decomposition. Wavelet Thresholding
is a simple, non-linear technique, which operates on one wavelet coefficient at
a time. In its most basic form, each coefficient is compared against a threshold; if the coefficient is smaller than the threshold, it is set to zero, otherwise it is kept or modified. Replacing the small noisy coefficients by zero and taking the inverse wavelet transform of the result may lead to a reconstruction that retains the essential signal characteristics with less noise.
David Donoho (1995) has introduced wavelet thresholding
(shrinking) as a powerful tool in denoising signals degraded by additive white
noise and more recently a number of attempts have been made to use
perceptually motivated wavelet decompositions coupled with various
thresholding and estimation methods (Ing Yann Soon et al. 1997, Sungwook
Chang et al. 2002, Hamid Sheikhzadeh and Hamid Reza Abutalebi 2001, Jong
Won Seok and Keun Sung Bae 1997). The most known thresholding methods
in the literature are soft and hard thresholding. Compared with hard thresholding, soft thresholding is more efficient in denoising. Although the
application of wavelet shrinking for speech enhancement has been reported in
literature (Jong Won Seok and Keun Sung Bae 1997, Yasser Ghanbari and
Mohammad Reza Karami Mollaei 2006, Michael Johnson et al. 2007,
Sumithra et al. 2009) there are many problems yet to be resolved for a
successful application of the method to speech signals degraded by real
environmental noise types.
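A minimal sketch of this kind of wavelet shrinkage is given below, using a PyWavelets decomposition, a MAD-based noise estimate, the universal threshold of Donoho, and either the hard or the soft thresholding rule. The wavelet, decomposition depth and test signal are illustrative choices, not those used in this work.

```python
import numpy as np
import pywt

def wavelet_denoise(x, wavelet='db8', level=5, mode='soft'):
    """Basic wavelet-threshold denoising with the universal threshold."""
    coeffs = pywt.wavedec(x, wavelet, level=level)
    # Noise level estimated from the finest detail coefficients (median absolute deviation).
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    thr = sigma * np.sqrt(2.0 * np.log(len(x)))            # universal threshold
    den = [coeffs[0]] + [pywt.threshold(c, thr, mode=mode) for c in coeffs[1:]]
    return pywt.waverec(den, wavelet)[:len(x)]

# Toy usage: denoise a noisy sinusoid with soft thresholding.
rng = np.random.default_rng(4)
t = np.arange(8000) / 8000
target = np.sin(2 * np.pi * 200 * t)
noisy = target + 0.3 * rng.normal(size=len(t))
clean = wavelet_denoise(noisy, mode='soft')
print("noisy error:", np.std(noisy - target), "denoised error:", np.std(clean - target))
```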
Most techniques that use wavelet thresholding for speech enhancement suffer from a main problem, namely the detection of the voiced/unvoiced segments of the speech signals (Hamid Sheikhzadeh and Hamid Reza Abutalebi 2001, Jong Won Seok and Keun Sung Bae 1997). For the incorrectly classified segments, the enhancement
performance drastically decreases. The other controversial subjects affecting
the enhancement performance are the thresholding function and the threshold
value.
In general, a small threshold value will leave behind all the noisy
coefficients, and subsequently the resultant denoised speech signal may still
be noisy. On the other hand, a large threshold value sets a larger number of coefficients to zero, which over-smooths the signal, destroys details and may introduce distortion and artifacts in the resultant signal.
There are some defects with the basic wavelet thresholding method
when it is applied to noisy speech corrupted by real-world noises. The basic method assumes that the noise spectrum is white. However, not only does white noise rarely exist in the real environment, but colored noises also have to be handled in most practical systems. Therefore, the basic wavelet shrinkage
does not result in good speech quality and cannot remove the non-stationary
noises.
The next problem is the shrinkage of the unvoiced speech segments
which contain many noise-like speech components. This leads to degradation
of the quality of the enhanced speech. The use of a single threshold for all
wavelet packet bands is not reasonable and use of the classic thresholding
functions like the Hard and Soft thresholding functions often brings about
time-frequency discontinuities. Therefore, a threshold should be found that adapts to the characteristics of the different subbands. In general, adaptive approaches have been found to be more effective than their global counterparts. Wavelet-based techniques using coefficient thresholding (Stephane Mallat and Wen Liang Hwang 1992) and adaptive thresholding (Sumithra and Thanushkodi 2009) have also been applied to speech enhancement.
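As a simple illustration of a subband-adaptive scheme (and not the Lipschitz-exponent-based method of Zhang Jie et al. described next), the sketch below estimates the noise level and threshold separately in each detail subband instead of using one global threshold; the wavelet and decomposition depth are again illustrative assumptions.

```python
import numpy as np
import pywt

def subband_adaptive_denoise(x, wavelet='db8', level=5):
    """Wavelet denoising with a separate (MAD-based) threshold per subband."""
    coeffs = pywt.wavedec(x, wavelet, level=level)
    den = [coeffs[0]]
    for c in coeffs[1:]:
        sigma = np.median(np.abs(c)) / 0.6745          # per-subband noise estimate
        thr = sigma * np.sqrt(2.0 * np.log(len(c)))    # subband-dependent threshold
        den.append(pywt.threshold(c, thr, mode='soft'))
    return pywt.waverec(den, wavelet)[:len(x)]
```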
To solve the problem of poor intelligibility of speech signals processed with a fixed wavelet threshold, a speech enhancement method with adaptive wavelet thresholds was presented by Zhang Jie et al. (2009). In this method, the type of additive noise is first ascertained according to the differences in spectrum amplitude between white noise (including colored noise with a flat spectrum amplitude) and colored noise with a varying spectrum amplitude. Since the Lipschitz exponent varies with the types of noise and speech, different adaptive threshold functions of the wavelet transform are used to enhance the noisy speech signals according to the type of noise.
1.5 PROBLEM DEFINITION
Based on the literature survey, the following problems are identified:
In the computation of Mel-Frequency Cepstral Coefficients, the Fast Fourier Transform is used to produce the spectrum on a linear scale. As the FFT has a fixed time-frequency resolution and a linear frequency distribution, the performance of MFCC based speech processing systems is affected by background noise.
There is a need for a new algorithm that works better than MFCC under mismatched conditions.
The recognition rates of noise-robust MFCCs and of the auditory transform based feature extraction algorithm are still far from human performance, and this is because of speaker variations.
Speaker-independent speech recognition systems have to be designed for diverse training-testing conditions with respect to the mean VTL.
In order to overcome these difficulties, this research work attempts to implement a speech recognition system that is robust against noise and speaker variation.
1.6 OBJECTIVES
To design and implement a new feature extraction algorithm based on the human auditory system for improving the recognition rate of the speech recognition system.
To design and implement a noise enhancement system with the human auditory system based feature extraction algorithm for improving the noise robustness of the speech recognition system.
To design and implement a speech recognition system that is robust against both noise and speaker variance.
1.7 MOTIVATION
This research work has been done with the motivation of designing
an interactive, ubiquitous teaching robot. Today, the number of people going for higher education worldwide is comparatively low (National Science Board 2012). Even after obtaining higher degrees, many are not interested in choosing teaching as their profession and are drawn instead towards the software industry (National Science Board 2012). At the same time, there is a shortage of expertise in cutting-edge technologies to educate students. These difficulties of the traditional educational field can be addressed by incorporating the emerging ubiquitous technology in education. This approach can be applied to improve the quality of higher education and to support learners. Quality enhancement of higher education, ensuring better performance, can be achieved by enabling interaction between the robot teacher and the students/learners. Based on these considerations, this research work attempts to design the speech recognition part of an interactive, ubiquitous teaching robot.
1.8 ORGANIZATION OF THE THESIS
A detailed review of literature about auditory transform based
feature extraction, wavelet thresholding for speech enhancement and VTL
Invariant algorithms in feature extraction are covered in Chapter 2.
Preliminary works done on MFCC, auditory transform based
feature extraction, VTL invariance algorithms and FFNN are included in Chapter 3.
MFCC with the adaptive wavelet thresholding and the invariant
integration algorithm have been presented in Chapter 4. Prior to that, the existing method of MFCC based feature extraction and its noise robustness have been discussed. The implementation of the Enhanced Mel-Frequency
Cepstral Coefficients Invariant Integration Features (EMFCCIIFs) based
feature extraction method has been compared with standard MFCC features,
and its results have been tabulated. Results show that, under mismatched
conditions, the EMFCCIIFs perform consistently better than the standard
MFCC features.
The existing auditory-based feature extraction algorithm and the proposed auditory-based feature extraction algorithm with adaptive wavelet thresholding and invariant integration have been explained in Chapter 5. The enhanced features have then been tested under
different noise, matching and mismatching of VTL conditions. Performances
of CFCC and Enhanced Cochlear Filter Cepstral Coefficient Invariant
Integration Features (ECFCCIIFs) have been compared. Experiments show
that, under mismatched conditions, the ECFCCIIFs perform consistently
better than the existing CFCC features. Also, performances of ECFCCIIFs
and EMFCCIIFs are compared. Results have shown that recognition
accuracies of ECFCCIIFs are higher than the corresponding EMFCCIIFs
accuracies.
Conclusions derived from this research work, along with the scope for future work, are summarized in Chapter 6.