
Mapping Based Noise Reduction

for Robust Speech Recognition

Somayeh Hosseini

Master of Science in Electrical Engineering

with emphasis on Signal Processing

Blekinge Institute of Technology

Department of Signal Processing

July 2010

Thesis Advisors: Dr. Rolf Bardeli

Dr. Nedelko Grbic

Thomas Winkler

Examiner: Dr. Nedelko Grbic

MEE10:73


Contents

Abstract

1 Introduction and Related Work
1.1 Scope of the Thesis
1.2 Overview

2 Principles of Speech Signal Processing
2.1 Time Domain Representation of Speech Signals
2.1.1 Sampling
2.1.2 Quantization and Coding
2.2 Frequency Domain Representation of Speech Signals
2.2.1 Discrete-Time Fourier Transform, DTFT
2.2.2 Discrete Fourier Transform, DFT

3 Automatic Speech Recognition, ASR
3.1 Speech Production
3.1.1 Articulatory Phonetics
3.1.2 Acoustic Phonetics
3.2 Feature Extraction
3.3 Training
3.3.1 An Introduction to Hidden Markov Models
3.3.2 Training Acoustic Models
3.4 Recognition

4 Mapping Based Noise Reduction
4.1 Locally Linear Embedding, LLE Algorithm
4.2 Principal Component Analysis
4.3 k-Means Clustering Algorithm
4.4 Mapping Procedure

5 Implementation and Results
5.1 Data Selection
5.2 Implementation by LLE
5.3 Implementation by PCA
5.4 Implementation by Clustered Vectors

6 Summary and Future Work
6.1 Conclusions
6.2 Future Work

Bibliography

List of Figures

2.1 For two sinusoids with fundamental frequencies F1 = −7/8 Hz and F2 = 1/8 Hz, selecting a sampling rate of Fs = 1 Hz results in identical samples of the two sinusoidal components; this is called aliasing distortion [8]
2.2 A digital signal x(n) with four quantization levels [8]
2.3 Power distribution of a continuous-time periodic signal with fundamental frequency F0 [8]
2.4 The magnitude of the DTFT for the sequence in Equation (2.17) with length L = 10 [8]
2.5 The magnitude of the 50-point FFT for the sequence in Equation (2.17) with length L = 10 [8]
3.1 Components of a Speech Recognition System [10]
3.2 Human organs which produce speech (the vocal tract) [9]
3.3 The structure of the larynx [3]
3.4 The spectrum of a speech waveform is a combination of harmonics and formants [14]
3.5 Overview of the short-time Fourier transform
3.6 A digital signal and its spectrogram
3.7 Common window functions used for STFT calculation and their effect on the spectrogram
3.8 Mel scale versus Hertz scale [16]
3.9 Mel-frequency triangular filter banks [17]
3.10 A graph example of Hertz spectrum, Mel spectrum and Mel cepstrum
3.11 A summary of MFCC feature extraction [17]
3.12 An example of a Markov Model λ [17]
4.1 An example that shows how mapping to a lower dimensional space leads to more accurate distance measurement and, as a result, more precise vector replacement
4.2 A summary of LLE [1]
4.3 A 3D data set mapped to 2D space using LLE [4]
4.4 Realization of a 3D data set xn (red) and principal components of the data set xn (blue)
4.5 Block diagram of the standard k-means algorithm [13]
4.6 An example of how cluster centroids (stars) move after each iteration of the k-means algorithm [13]
5.1 The spectrogram of two noisy feature vectors with SNR values of 25 and 0 dB
5.2 LLE, SNR = 10 dB, d = 11
5.3 LLE, SNR = 10 dB, d = 12
5.4 PCA, SNR = 10 dB, d = 11
5.5 PCA, SNR = 10 dB, d = 12

List of Tables

5.1 The percentage of correctly recognized phonemes for our clean and noisy feature vectors
5.2 Speaker 1, SNR = 10 dB, d = 12
5.3 Speaker 1, SNR = 10 dB, d = 11
5.4 Speaker 1, SNR = 25 dB, d = 12
5.5 Speaker 2, SNR = 10 dB, d = 11
5.6 Speaker 1, SNR = 10 dB, d = 12
5.7 Speaker 1, SNR = 10 dB, d = 11
5.8 Speaker 1, SNR = 25 dB, d = 12
5.9 Speaker 2, SNR = 10 dB, d = 11
5.10 Speaker 2, SNR = 25 dB, d = 12
5.11 2 clusters, LLE, Speaker 1, SNR = 10 dB, d = 11
5.12 2 clusters, LLE, Speaker 1, SNR = 25 dB, d = 12
5.13 2 clusters, PCA, Speaker 1, SNR = 10 dB, d = 12
5.14 2 clusters, PCA, Speaker 1, SNR = 25 dB, d = 12
5.15 2 clusters, PCA, Speaker 2, SNR = 10 dB, d = 11
5.16 2 clusters, PCA, Speaker 2, SNR = 25 dB, d = 11
5.17 3 clusters, LLE, Speaker 1, SNR = 10 dB, d = 12
5.18 3 clusters, LLE, Speaker 1, SNR = 25 dB, d = 12
5.19 3 clusters, PCA, Speaker 1, SNR = 10 dB, d = 12
5.20 3 clusters, PCA, Speaker 1, SNR = 25 dB, d = 12
5.21 3 clusters, PCA, Speaker 2, SNR = 10 dB, d = 11
5.22 3 clusters, PCA, Speaker 2, SNR = 25 dB, d = 12
5.23 4 clusters, LLE, Speaker 1, SNR = 10 dB, d = 12
5.24 4 clusters, LLE, Speaker 1, SNR = 25 dB, K = 10
5.25 4 clusters, PCA, Speaker 1, SNR = 10 dB, K = 10
5.26 4 clusters, PCA, Speaker 1, SNR = 25 dB, K = 15
5.27 5 clusters, LLE, Speaker 1, SNR = 10 dB, d = 12
5.28 5 clusters, LLE, Speaker 1, SNR = 25 dB, d = 12
5.29 5 clusters, PCA, Speaker 1, SNR = 10 dB, d = 12
5.30 5 clusters, PCA, Speaker 1, SNR = 25 dB, d = 11


Abstract

This thesis proposes a new noise reduction technique for speech recognition. The proposed method, called mapping based noise reduction, operates on the feature vectors extracted from speech signals. The dimensionality reduction capability of algorithms such as Locally Linear Embedding and Principal Component Analysis is exploited to map corrupted speech feature vectors to their corresponding noise-free feature vectors. The feature vectors are first mapped to a lower dimensional space; in this space the nearest clean vector to each noisy vector is found, mapped back to the original space and given as input to the speech recognition system. The approach is evaluated on speech signals from two different speakers with artificially added wind noise at different signal-to-noise ratios.


Chapter 1

Introduction and Related Work

A speech recognition system that converts human speech to text is considered robust if it achieves high recognition accuracy even under adverse conditions. Such conditions might be caused by the low quality of the input speech signal due to additive environmental noise, or by differences in the acoustic characteristics of the test and training speech signals, which may be recorded in different environments or articulated by different speakers.

The speech signals we deal with in this thesis are part of the MOVEON project at the Fraunhofer Institute IAIS. The goal of MOVEON is to develop a communication system that performs well even in challenging scenarios. A good instance of such a scenario is speech recorded on motorcycles, where several noise sources are present and robust speech recognition is required. Hence, in our experiments we use speech data corrupted by the same noise types that appear in recordings from a moving motorcycle, such as wind noise.

One way to achieve robustness in a speech recognition system is to enhance the corrupted speech signal by reducing the disturbance. Several speech enhancement methods have been tried to date. Spectral Subtraction [7], for instance, tries to compensate for the effects of noise by estimating the noise power spectrum and subtracting it from the power spectrum of the corrupted speech signal. Cepstral Mean Subtraction [7], Blind Equalization [7] and adaptive filters are other noise compensation methods for speech recognition. However, most of these methods are insufficient when faced with difficult domains, and consequently the need for an alternative noise reduction method arises.

1.1 Scope of the Thesis

The scope of this thesis is to present a new approach to noise reduction, called mapping based noise reduction, applied to the feature vectors of the speech signal. In this approach, the feature vectors extracted from noisy speech signals are projected onto their nearest feature vectors extracted from clean speech signals. The projected feature vectors then serve as inputs to the speech recognition system to improve its accuracy and performance. This process is carried out in a lower dimensional subspace, because in the complex higher dimensional space the projection of feature vectors is hardly feasible. In other words, each noisy feature vector is replaced by its nearest clean feature vector found in the lower dimensional space. Principal Component Analysis and Locally Linear Embedding are the dimensionality reduction algorithms we use to map the speech feature vectors to the lower dimensional space.

1.2 Overview

Chapter 2 discusses the basics of audio signal processing, including time and frequency domain properties of audio signals. Chapter 3 provides an overview of a modern automatic speech recognition system and its units. Chapter 4 describes the mapping based method for noise reduction and the dimensionality reduction algorithms we have applied. Chapter 5 presents and compares the experimental results. Finally, the conclusions of this work are given in Chapter 6.


Chapter 2

Principles of Speech Signal Processing

A speech signal is the air pressure variation generated by forcing air through the vocal cords and radiated from a speaker's mouth or nose. From a mathematical point of view, it is a function x : R → C, i.e. an analog or continuous-time function. In order to represent such a signal in a computer, we have to digitize it. The digitization process is described in the following sections.

In addition to the time domain representation, a speech (audio) signal can be represented in the frequency domain. The frequency domain representation of a signal shows its frequency content.

2.1 Time Domain Representation of Speech Signals

Speech signals, as information-bearing signals, are functions of a single independent variable, time, and are analog in nature, i.e. functions of continuous time. Such signals can be processed using analog processors. However, an alternative, and usually better, method is digital processing using a digital computer or microprocessor. In the latter case we first have to convert the analog signal to digital form, that is, to a finite sequence of numbers, in order to perform the processing digitally. An analog-to-digital (A/D) converter is used for this purpose. The process of analog-to-digital conversion, or digitization, consists of three steps: sampling, quantization and coding, which are briefly described in the following sections. The inverse process, which converts the digital signal back to analog form, is called D/A conversion and is necessary after processing in many applications such as speech processing. In the speech recognition case, however, there is no need for D/A conversion since the output of a recognizer is text, not speech.

2.1.1 Sampling

By taking samples of a continuous-time signal at discrete time instants, it is converted into a discrete-time signal. This is the first step of digitization. Thus, if xa(t) is the continuous-time signal and we take samples every T seconds, the sampled signal is given by x(n) = xa(nT), where T is called the sampling period. Moreover, the number of samples per second, Fs = 1/T (Hz), is called the sampling frequency. The relationship between the continuous and discrete time variables t and n is given by:

t = nT = n/Fs.  (2.1)

Accordingly, there is a relationship between the continuous-time and discrete-time signal frequencies F and f. It is easily shown that this relationship is:

f = F/Fs,  (2.2)

where f is called the normalized frequency and −1/2 ≤ f ≤ 1/2. Using the above relation, we obtain a range for the continuous-time signal frequency:

−Fs/2 ≤ F ≤ Fs/2,  (2.3)

which means that the maximum frequency of an analog signal that can be distinguished after sampling at Fs = 1/T is Fmax = Fs/2. In other words, only the frequencies which lie in this range can be reconstructed after sampling. If the selected sampling frequency does not fulfill the inequality in Equation (2.3), a distortion called aliasing occurs at frequencies inside and outside this range. For a better understanding of aliasing, consider an analog signal with fundamental frequency F0. If this signal is sampled at a rate Fs < 2F0, then all frequencies Fk = F0 + kFs produce sample values identical to those of F0, and as a result the parts of the analog signal related to these frequency contents cannot be reconstructed from their sample values. The frequencies Fk are called aliases of F0. Figure 2.1 shows aliasing distortion in the time domain.

Figure 2.1: For two sinusoids with fundamental frequencies F1 = −7/8 Hz and F2 = 1/8 Hz, selecting a sampling rate of Fs = 1 Hz results in identical samples of the two sinusoidal components; this is called aliasing distortion [8].

Thus, to find an appropriate sampling frequency, we apply the sampling theorem [8]. According to the sampling theorem, if the highest frequency content of an analog signal is Fmax = B (Hz) and the sampling frequency satisfies Fs > 2B, then the analog signal can be exactly reconstructed from its sample values according to:

xa(t) = Σ_{n=−∞}^{∞} x(n) sinc(2Bt − 2Bn/Fs),  (2.4)

where

sinc(t) = sin(πt)/(πt) for t ≠ 0, and sinc(0) = 1.

The minimum allowable value of the sampling frequency, Fs = 2B, is called the Nyquist rate.

Since a speech signal has relevant frequency components up to about 4 kHz, its sampling frequency is selected in the range of 8 to 16 kHz in accordance with the sampling theorem.
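To make the aliasing example of Figure 2.1 concrete, the short sketch below (a hypothetical illustration, not part of the original thesis) samples the two sinusoids with F1 = −7/8 Hz and F2 = 1/8 Hz at Fs = 1 Hz and confirms that the resulting sample sequences coincide.

```python
import numpy as np

Fs = 1.0            # sampling rate in Hz (deliberately below the Nyquist rate for F1)
n = np.arange(16)   # sample indices
t = n / Fs          # sampling instants

F1 = -7.0 / 8.0     # fundamental frequency of the first sinusoid (Hz)
F2 = 1.0 / 8.0      # alias of F1, since F2 = F1 + 1 * Fs

x1 = np.cos(2 * np.pi * F1 * t)
x2 = np.cos(2 * np.pi * F2 * t)

# The two sampled sequences are identical: the sinusoids cannot be
# distinguished from their samples (aliasing distortion).
print(np.allclose(x1, x2))   # True
```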


Figure 2.2: A digital signal x(n) with four quantization levels [8]

2.1.2 Quantization and Coding

By sampling, an analog signal is only converted into a discrete-time signal; a digital signal, however, must be discrete in both time and amplitude. Therefore, quantization is performed on the sampled signal to make it discrete-valued as well. This is done by converting the continuous sample values to discrete levels and representing them as a finite sequence of numbers. Each sample value is truncated or rounded to a predefined allowable level called a quantization level. Rounding quantization replaces each sample value by its nearest quantization level, while truncation quantization replaces each sample value by the quantization level below it. The distance between successive quantization levels is called the quantization resolution ∆. Given the maximum and minimum amplitude values xmax and xmin of a sampled signal and the number of quantization levels L, the resolution is:

∆ = (xmax − xmin) / (L − 1),  (2.5)

where xmax − xmin is called the dynamic range of the signal.

For rounding quantization, the quantization error, defined as the difference between the quantized and the sampled signal, eq(n) = xq(n) − x(n), is bounded to the range:

−∆/2 ≤ eq(n) ≤ ∆/2.  (2.6)

Therefore, from Equations (2.5) and (2.6) one can conclude that, for a fixed dynamic range, increasing the number of quantization levels decreases the quantization step ∆ (i.e. gives a finer resolution) and consequently decreases the quantization error. Thus, the higher the number of quantization levels, the more accurate the quantization.

Figure 2.2 illustrates a discrete-time quantized signal (digital signal).

Finally, coding is the representation of the quantization levels by binary numbers of length b bits. With b bits, 2^b binary numbers can be represented, so if L quantization levels are needed, the following inequality must hold for the number of bits b:

2^b ≥ L  ⟹  b ≥ log2 L.  (2.7)

For speech signals, 16-bit words are usually used.
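As a small illustration of Equations (2.5)-(2.7) (a hypothetical sketch, not code from the thesis), the following lines perform uniform rounding quantization of a sampled signal and check that the error stays within ±∆/2.

```python
import numpy as np

def quantize(x, b):
    """Uniform rounding quantization of signal x using b bits (L = 2**b levels)."""
    L = 2 ** b                                   # number of quantization levels, Eq. (2.7)
    delta = (x.max() - x.min()) / (L - 1)        # quantization resolution, Eq. (2.5)
    levels = x.min() + delta * np.arange(L)      # allowable amplitude levels
    idx = np.round((x - x.min()) / delta).astype(int)   # nearest level for each sample
    return levels[idx], delta

# a short sampled test signal
x = 0.8 * np.sin(2 * np.pi * 0.03 * np.arange(200))
xq, delta = quantize(x, b=4)

err = xq - x                                     # quantization error e_q(n)
print(delta, np.abs(err).max() <= delta / 2)     # error bounded as in Eq. (2.6)
```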

2.2 Frequency Domain Representation of Speech Signals

In the frequency domain, a signal is represented as a sum of sinusoidal components (frequency components). This is called the frequency analysis of the signal. For continuous-time signals, the Fourier series and the continuous Fourier transform represent the frequency components of periodic and aperiodic signals, respectively.

A continuous-time periodic signal x(t) that lies in the Hilbert space [15] L2([0, 1]) can be represented in terms of its Fourier coefficients Ck using the Fourier series:

x(t) = Σ_{k=−∞}^{∞} Ck e^{j2πkF0t},  (2.8)

where F0 is the fundamental frequency (the lowest frequency, or first harmonic), T0 = 1/F0 is the period of the signal, and e^{jθ} = cos θ + j sin θ is Euler's formula (j = √−1). The Hilbert space (like Euclidean space) is the space of finite-energy signals and has the following definition:

L2([0, 1]) = { x : [0, 1] → C | ||x||2 = ( ∫_0^1 |x(t)|^2 dt )^{1/2} < ∞ };  (2.9)

this means that the periodic signal has finite energy over one period (the energy of a signal is defined as ∫_R |x(t)|^2 dt).


The Fourier coefficients can be obtained by:

Ck = (1/T0) ∫_{−T0/2}^{T0/2} x(t) e^{−j2πkF0t} dt.  (2.10)

Furthermore, the average power of a periodic signal x(t), which has infinite energy but finite average power, is given by:

P = (1/T0) ∫_{−T0/2}^{T0/2} |x(t)|^2 dt = Σ_{k=−∞}^{∞} |Ck|^2;  (2.11)

this means that the total power of a periodic signal is the sum of the powers of its frequency components (harmonics).

Figure 2.3 shows the power distribution among the frequency components of a periodic signal x(t); it is called the power density spectrum of the signal. One can observe from this figure that periodic signals have line spectra. However, if the period of the signal tends toward infinity and the signal becomes aperiodic, the spectrum becomes continuous.

Thus, the Fourier transform of a finite-energy (aperiodic) signal x(t) is defined as:

X(F) = ∫_{−∞}^{∞} x(t) e^{−j2πFt} dt,  (2.12)

which is a continuous function of the frequency F.

The inverse Fourier transform is obtained by replacing the summation in Equation (2.8) with an integral over the frequency F:

x(t) = ∫_{−∞}^{∞} X(F) e^{j2πFt} dF.  (2.13)

Finally, the total energy of the finite-energy signal x(t) is given by:

E = ∫_{−∞}^{∞} |x(t)|^2 dt = ∫_{−∞}^{∞} |X(F)|^2 dF,  (2.14)

while the energy density spectrum of x(t) (the distribution of the signal's energy as a function of frequency) is defined as:

Sxx(F) = |X(F)|^2;  (2.15)

therefore, the integral of Sxx over F equals the total energy of the signal.

Figure 2.3: Power distribution of a continuous-time periodic signal with fundamental frequency F0 [8].

2.2.1 Discrete-Time Fourier Transform, DTFT

In this section frequency analysis of discrete-time signals is discussed.

The Fourier transform of a continuous-time signal can have frequency components anywhere in the range −∞ < F < ∞, while in the discrete-time case the frequency components are limited to the range −1/2 < f < 1/2, or equivalently −π < ω < π, where ω = 2πf is the angular frequency. Therefore, the Fourier transform of a discrete-time signal is unique only in this range, and frequencies outside it are equivalent to frequencies within it. This means that the Fourier transform of a discrete-time signal, the discrete-time Fourier transform (DTFT), is periodic with period 2π and is defined as:

X(ω) = Σ_{n=−∞}^{∞} x(n) e^{−jωn}.  (2.16)

Figure 2.4 illustrates the DTFT of the following finite-length sequence:

x(n) = 1 for 0 ≤ n ≤ L − 1, and x(n) = 0 otherwise.  (2.17)

Figure 2.4: The magnitude of the DTFT for the sequence in Equation (2.17) with length L = 10 [8].

The inverse discrete-time Fourier transform, which follows from the periodicity of X(ω), is:

x(n) = (1/2π) ∫_{2π} X(ω) e^{jωn} dω.  (2.18)

In addition, the energy of a discrete-time signal can be expressed in terms of time or frequency as:

E = Σ_{n=−∞}^{∞} |x(n)|^2 = (1/2π) ∫_{−π}^{π} |X(ω)|^2 dω,  (2.19)

and, as in the continuous-time case, the energy density spectrum of the discrete-time signal is given by:

Sxx(ω) = |X(ω)|^2.  (2.20)

When the discrete-time signal x(n) is real, it is easily shown that |X(ω)| and Sxx(ω) are symmetric (even) functions of frequency. As a result, the frequency range of real discrete-time signals can be further restricted to 0 ≤ ω ≤ π.

2.2.2 Discrete Fourier Transform, DFT

As in the time-domain case, to represent the spectrum of a discrete-time signal on a digital computer we need to digitize it first. By sampling the continuous spectrum X(ω) of a finite-energy sequence x(n) at equally spaced frequencies, we can perform the frequency analysis of this sequence digitally. This frequency domain representation of x(n) is called the Discrete Fourier Transform. The Discrete Fourier Transform (DFT) of a sequence x(n) of length L ≤ N is given by:

X(k) = Σ_{n=0}^{N−1} x(n) e^{−j2πkn/N},  k = 0, 1, ..., N − 1,  (2.21)

where N is the number of frequency lines in the spectrum. One can see from this equation that computing the N-point DFT directly requires N^2 complex multiplications and N(N − 1) complex additions.

Figure 2.5: The magnitude of the 50-point FFT for the sequence in Equation (2.17) with length L = 10 [8].

The inverse DFT (IDFT), which reconstructs the sequence x(n) from its frequency samples, is:

x(n) = (1/N) Σ_{k=0}^{N−1} X(k) e^{j2πkn/N},  n = 0, 1, ..., N − 1.  (2.22)

Figure 2.5 shows the frequency samples of the sequence in Equation (2.17).

Alternatively, viewing Equations (2.21) and (2.22) as linear transformations of x(n) and X(k) leads to the matrix representation of the DFT and IDFT.


We define an N × N transformation matrix W:

W = [ 1    1          1            ...   1
      1    W          W^2          ...   W^(N−1)
      1    W^2        W^4          ...   W^(2(N−1))
      ⋮    ⋮          ⋮                  ⋮
      1    W^(N−1)    W^(2(N−1))   ...   W^((N−1)(N−1)) ],  (2.23)

where W = e^{−j2π/N}, to obtain the following linear expressions for the N-point DFT:

X = Wx,  (2.24)

and IDFT:

x = (1/N) W* X.  (2.25)

The N-dimensional vectors x and X in the above equations contain the signal sequence x(n) and its frequency samples X(k), respectively, and W* denotes the complex conjugate of the transformation matrix.

By exploiting the symmetry and periodicity properties of the term W = e^{−j2π/N}, the complexity of the DFT computation can be reduced noticeably. This leads to a family of algorithms called Fast Fourier Transform (FFT) algorithms, which compute the DFT efficiently. Assuming that N is not prime and can be factorized into a product of two integers (N = LM), one such algorithm, for instance, divides the N-point DFT into a number of smaller (M-point) DFTs to reduce the computation cost. As a result, the computational complexity is reduced to N(M + L + 1) complex multiplications and N(M + L − 2) complex additions. When N is a power of 2, this decomposition can be applied recursively, yielding an algorithm whose cost for computing the DFT is on the order of N log2(N).
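As a small, hypothetical check of Equations (2.21)-(2.25) (not part of the thesis itself), the sketch below builds the transformation matrix W for N = 50, applies it to the length-10 rectangular sequence of Equation (2.17), and compares the result with a library FFT, reproducing the 50-point spectrum of Figure 2.5.

```python
import numpy as np

N, L = 50, 10                       # DFT length and sequence length (L <= N)
x = np.zeros(N)
x[:L] = 1.0                         # rectangular sequence of Eq. (2.17), zero-padded to N

# Transformation matrix of Eq. (2.23): W[k, n] = exp(-j 2 pi k n / N)
n = np.arange(N)
W = np.exp(-2j * np.pi * np.outer(n, n) / N)

X_matrix = W @ x                    # DFT via Eq. (2.24), about N^2 operations
X_fft = np.fft.fft(x, N)            # same result via an FFT algorithm

print(np.allclose(X_matrix, X_fft)) # True: both give the 50-point spectrum
x_rec = (W.conj() @ X_matrix) / N   # IDFT via Eq. (2.25)
print(np.allclose(x_rec, x))        # True: the original sequence is recovered
```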


Chapter 3

Automatic Speech Recognition, ASR

Speech technology includes several subfields, among them Speaker Identification (determining a person's identity from his or her voice), Speech Synthesis (converting text to voice in order to produce artificial human speech) and Speech Recognition (converting speech to text). In this chapter we introduce current techniques for Automatic Speech Recognition.

An automatic speech recognition system is used to label human speech and has a variety of applications such as voice dialling, call routing, data entry and automatic control of home appliances. Figure 3.1 gives an overview of a speech recognition system and its units. A modern speech recognition system is composed of three main components:

• Feature Extraction: extracting feature vectors from the digitized speech signal.

• Training Hidden Markov Models: training acoustic models for the speech feature vectors using Hidden Markov Models.

• Recognition: classifying the test speech vectors using the acoustic models obtained in the previous step.

In the rest of this chapter we first introduce phonetics and then explain each of the above components.


Figure 3.1: Components of a Speech Recognition System [10]

3.1 Speech Production

The study of human sound production and its properties is called phonetics. In this section two major branches of phonetics, articulatory phonetics and acoustic phonetics, are discussed. Articulatory phonetics focuses on how the human organs are involved in speech production, while acoustic phonetics studies the acoustic properties of speech.

3.1.1 Articulatory Phonetics

Speech is composed of phonemes as its smallest units. There are two types of phonemes, vowels and consonants, and combinations of them make up each spoken word. This section discusses the production of vowels and consonants as the units of a spoken word. Figure 3.2 shows the human organs that produce and modulate speech (the vocal tract). As shown in this figure, the sound source is the airflow provided by the lungs and diaphragm while breathing (the respiration phase). This airflow then passes through a structure called the larynx, which plays an important part in sound production. As Figure 3.3 shows, the larynx contains two bands of muscle and tissue called the vocal cords, with a gap between them called the glottis. This phase of sound production, with the larynx's contribution, is called phonation. Depending on the position of the vocal cords and the rate of airflow, different phonations are obtained. If the vocal cords are completely closed, a kind of consonantal sound named a glottal stop is produced. If the airflow causes the vocal cords to vibrate as it passes through them, the result is a voiced sound (vowels and voiced consonants) such as [b], [d], [v], [i], [r], [z]. On the other hand, if the vocal cords are completely apart (wide-open glottis), the air passes freely between them and no vibration occurs at the glottis; the result is a voiceless sound (voiceless consonants) such as [p], [t], [k], [s], [f], [sh], [th], [ch]. Moreover, if the air flows rapidly through the wide-open glottis, a whisper-like sound such as [h] (and some whispered vowels) results.

Figure 3.2: Human organs which produce speech (the vocal tract) [9]

Figure 3.3: The structure of the larynx [3]

The last phase of speech production is called articulation. It refers to the places in the vocal tract which are most constricted when a sound is produced (the place of articulation) and to how organs such as the tongue or lips contribute to producing the sound (the manner of articulation). Since vowels are produced without noise from a constriction in the vocal tract, these two terms are used only for consonants. The articulatory features that classify the quality of vowels are as follows:

• Height: vowel height describes the vertical position of the tongue. Based on this feature, vowels are classified into high vowels (tongue high in the mouth) such as [i] and [u], and low vowels (tongue low in the mouth) such as [a].

• Backness: vowel backness refers to the position of the tongue relative to the back of the mouth. This feature distinguishes front vowels (tongue forward in the mouth) such as [i] from back vowels (tongue toward the back of the mouth) such as [u].

• Roundedness: whether the lips are rounded during articulation, classifying vowels as rounded or unrounded.

• Nasalization: whether air passes through the nose during articulation, classifying vowels as nasal or oral.

A combination of these features determines the quality of each vowel sound.

The following are the most important places of articulation for the classification of consonants:

• Labial (lips)

• Coronal (tip or blade of tongue)

• Dorsal (back of tongue)

Furthermore, the major manners of articulation for consonants are as follows:

• Stop: the articulators are closed and air cannot escape through the mouth.


• Oral stop (plosive): air cannot escape through the nose ([p], [t], [k], [b], [d], [g]).

• Nasal stop: the oral cavity is closed and air escapes through the nose ([m], [n], [ng]).

• Fricative: a noisy airflow is created at the place of articulation, producing a hissing sound ([f], [v], [s], [z], [th], [dh]).

• Approximant: there is very little closure of the articulators and consequently no noise is involved ([y], [r]).

• Lateral approximant: airflow is obstructed along the center of the oral tract and the sound is produced along the sides of the tongue ([l]).

• Tap or flap: the oral cavity is closed briefly by the tongue ([dx]).

• Affricate: an oral stop immediately followed by a fricative ([ch], [jh]).

3.1.2 Acoustic Phonetics

This section describes the acoustic properties of speech signals. The amplitude (loudness) of a speech signal is the air pressure at different time instants. A complex speech signal results from adding several simple sinusoids with different frequencies. A spectrum shows each frequency component of the speech signal and its corresponding amplitude. In a spectrum diagram the horizontal axis represents frequency (in Hz) and the vertical axis represents amplitude (in dB). Spectral analysis acts on a sound wave the way a prism acts on light: just as a prism breaks light into its colors, the analysis breaks a sound wave into its frequency components. This helps us to find the frequency patterns of different vowels and consonants (or phonemes in general).

The human speech production system can be simulated by a system consisting of an excitation unit and an articulation unit. In the excitation (source) unit, a switch selects between a tone generator for voiced sounds and a noise generator for unvoiced sounds (simulating the role of the vocal cords). In the articulation (filter) unit, a variable filter shapes the sounds to form the final speech (simulating the role of the mouth and nose). Thus, the spectrum of a speech waveform consists of the actual frequencies produced by the source, mounted on an envelope that is the frequency response of the filter. The peaks of the filter's (vocal tract's) frequency response, i.e. its resonance frequencies, are called formants [6]. As demonstrated in Figure 3.4, the formants make up the overall shape (envelope) of the spectrum while the harmonics are its frequency contents. Different vowels (with different formants) can have the same harmonics, and the same vowel (with similar formants) can be spoken with different harmonics.

Figure 3.4: The spectrum of a speech waveform is a combination of harmonics and formants [14]

3.2 Feature Extraction

Feature extraction is a kind of dimensionality reduction used in pattern recognition tasks to reduce the amount of data (avoiding redundancy), to strengthen the variable parts that improve the discrimination of patterns, and to attenuate the variable parts that worsen it. For speech recognition, the variable parts that may degrade the recognition outcome are speaker variability and acoustic and channel distortions. Through this process the digitized speech signal is transformed into a sequence of feature vectors equally spaced in time. The speech signal can be considered stationary over the interval covered by each feature vector (about 10 ms).

The features that we use for speech recognition are called Mel Frequency Cepstral Coefficients (MFCC features). In this section we explain how MFCC feature vectors are derived from the digitized speech signal.

The first step in MFCC feature extraction is to perform a short-time Fourier transform (STFT) on the digitized speech signal. The short-time Fourier transform is a linear transform that determines the temporal frequency content (the time localization of frequency content) of a signal, which is exactly what we need here. Since we are dealing with digital speech signals, we only describe the discrete STFT. Figure 3.5 shows the process of calculating the discrete STFT. First, the digitized speech signal x(n) is split into finite-length frames by multiplying it by a discrete finite-length window w(n). To reduce boundary effects, overlapping frames are extracted. The FFT is then calculated for each overlapping frame according to:

STFT{x(n)} = Xi(k) = Σ_{n=−∞}^{∞} x(n) w(n − i) e^{−j2πkn/N},  (3.1)

where k = 0, 1, ..., N − 1, i denotes the frame index, and N is the number of frequency bins per frame, equal to the frame length.

Figure 3.5: Overview of the short-time Fourier transform

The squared magnitude of the resulting complex sequence is defined as the spectrogram of the signal x(n):

spectrogram{x(n)} = |Xi(k)|^2.  (3.2)

The plot of the spectrogram (with horizontal and vertical axes representing time and frequency, respectively) shows how the frequency content of a signal changes over time (a time-frequency graph). As an example, Figure 3.6 shows the spectrogram of a signal whose frequency content varies over different time intervals. One can observe that the spectrogram shows more energy in the interval where the signal has higher frequencies.

Figure 3.6: A digital signal and its spectrogram

Three factors affect the quality of the STFT: the window type, the window length and the step size. Figure 3.7 shows two commonly used window types for frame extraction and the spectrograms obtained with them. The Hamming window is the most appropriate, since it reduces the discontinuities at the window boundaries. The Hamming window is given by:

w(n) = 0.54 − 0.46 cos(2πn / (N − 1)) for 0 ≤ n ≤ N − 1, and w(n) = 0 otherwise.  (3.3)

The window length controls the time-frequency resolution of the spectrogram. Long windows give high frequency resolution (useful for resolving formants that are close together in frequency) but low temporal resolution, while short windows give low frequency resolution but high temporal resolution (useful for detecting fast variations such as short plosives). Having both high frequency resolution and high temporal resolution at the same time is therefore not possible, and not even desirable. For automatic speech recognition, however, a window length of 400 samples (at a 16 kHz sampling rate), corresponding to a frequency resolution of ∆f = 40 Hz and a temporal resolution of ∆t = 25 ms, together with a window shift of 10 ms, works well.

Figure 3.7: Common window functions used for STFT calculation and their effect on the spectrogram

Figure 3.8: Mel scale versus Hertz scale [16]
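The following sketch (an illustrative, hypothetical implementation, not code from the thesis) computes a Hamming-windowed STFT with the frame length and shift quoted above and derives the spectrogram of Equation (3.2); the function name and parameters are chosen for illustration only.

```python
import numpy as np

def stft_spectrogram(x, frame_len=400, hop=160):
    """Spectrogram |X_i(k)|^2 of a signal sampled at 16 kHz:
    400-sample (25 ms) Hamming windows shifted by 160 samples (10 ms)."""
    w = np.hamming(frame_len)                      # Hamming window, Eq. (3.3)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * w
                       for i in range(n_frames)])  # overlapping windowed frames
    X = np.fft.rfft(frames, n=frame_len, axis=1)   # FFT of each frame, Eq. (3.1)
    return np.abs(X) ** 2                          # spectrogram, Eq. (3.2)

# example: 1 second of a 16 kHz test tone
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)
S = stft_spectrogram(x)
print(S.shape)   # (number of frames, frame_len // 2 + 1 frequency bins)
```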

So far we have obtained the short-time spectrum of the digital speech signal; now we transform the frequency scale to the mel scale and take the logarithm of the squared amplitude of the spectrum. The mel is a nonlinear frequency scale on which pitches that listeners perceive as equally spaced are numerically equidistant. It models the human auditory system, which perceives pitch not linearly but roughly logarithmically. This works to our advantage because it reflects the ear's higher sensitivity at lower frequencies and smooths the spectrum. Moreover, experiments show that mapping the frequencies of the speech signal to the mel scale improves speech recognition performance. The following equation converts a frequency in Hertz to the corresponding mel value:

mel = 2595 log10(1 + f/700),  (3.4)

and Figure 3.8 illustrates this relation. As the figure shows, high frequencies are compressed on the mel scale while low frequencies are expanded. The reference point between the two scales is 1000 Hz = 1000 mel, and the relation is approximately linear below 1 kHz and logarithmic above 1 kHz. To convert the frequency scale to mel, we can apply mel-frequency triangular filter banks [11] as shown in Figure 3.9. The filters are bandpass and model auditory spectral bands. The center frequencies of the triangles are calculated according to the mel scale (Equation (3.4)), so the bandwidth and spacing of the triangles increase with frequency. For each triangular filter, the squared magnitude of each frame's spectrum is multiplied by the filter gain and the results are summed; the logarithm of the resulting values gives the scaled power spectrum of each frame:

X'i(j) = log Σ_{k=0}^{N−1} H(j, k) |Xi(k)|^2,  j = 1, ..., M.  (3.5)

In Equation (3.5), M is the number of filter banks (M < N), Xi(k) is the short-time spectrum of frame i and H(j, k) is the j-th filter bank. The logarithm of the squared-amplitude mel spectrum is taken because the loudness of a sound is perceived approximately logarithmically.

Figure 3.9: Mel-frequency triangular filter banks [17]
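A minimal, hypothetical sketch of Equations (3.4) and (3.5) (not the thesis implementation; all function names and parameter values are illustrative) is shown below: it places triangular filters at mel-spaced center frequencies and applies them to the power spectrum of one frame.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)        # Eq. (3.4)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)      # inverse of Eq. (3.4)

def mel_filterbank(n_filters=26, n_fft=400, fs=16000):
    """Triangular filters H(j, k) with mel-spaced center frequencies."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2), n_filters + 2)
    hz_pts = mel_to_hz(mel_pts)
    bins = np.floor((n_fft // 2 + 1) * hz_pts / (fs / 2)).astype(int)
    H = np.zeros((n_filters, n_fft // 2 + 1))
    for j in range(n_filters):
        lo, ce, hi = bins[j], bins[j + 1], bins[j + 2]
        H[j, lo:ce] = (np.arange(lo, ce) - lo) / max(ce - lo, 1)   # rising edge
        H[j, ce:hi] = (hi - np.arange(ce, hi)) / max(hi - ce, 1)   # falling edge
    return H

# log mel power spectrum of one frame, Eq. (3.5): X'_i(j) = log sum_k H(j,k) |X_i(k)|^2
power_frame = np.abs(np.fft.rfft(np.random.randn(400))) ** 2
H = mel_filterbank()
log_mel = np.log(H @ power_frame + 1e-10)            # small offset avoids log(0)
print(log_mel.shape)                                  # (26,)
```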

The last step of MFCC feature extraction is to calculate the Discrete Cosine Transform (DCT) of the mel log power spectrum in order to decorrelate the mel-spectrum coefficients, which are highly correlated. The DCT is closely related to the DFT; the difference is that the DFT uses both sine and cosine functions (complex exponentials), whereas the DCT uses only the cosine function. The following equation formulates this last step:

mi(l) = Σ_{j=1}^{M} X'i(j) cos( (lπ/M)(j − 1/2) ),  l = 1, ..., M,  (3.6)

which can be expressed in matrix form as:

mi = DCT(X'i),  (3.7)

where mi is the final M-dimensional feature vector. The resulting vector is called the mel power cepstrum, since we have taken a Fourier-type transform of the log-mel power spectrum. The word cepstrum refers to the Fourier transform of a log spectrum and is derived by reversing some letters of "spectrum"; the independent variable of the cepstrum is called quefrency. Figure 3.10 compares the plots of a Hertz spectrum, a mel spectrum and a mel cepstrum.

The feature vectors mi are 13-dimensional vectors called static features. The dynamic features are obtained by taking the first and second order time derivatives of the static features, giving two more 13-dimensional vectors. Thus, each frame of the speech signal is finally represented by a 39-dimensional feature vector, although we only use the static features (the first 13 MFCCs) in our experiments. Figure 3.11 summarizes the whole process of MFCC feature extraction. The MFCC feature vectors are extracted for both the training and the test data in our experiments.
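As a small, hypothetical continuation of the previous sketch (not the thesis code), the following lines apply the DCT of Equation (3.6) to a log mel spectrum to obtain the 13 static MFCCs of one frame.

```python
import numpy as np

def mfcc_from_log_mel(log_mel, n_ceps=13):
    """DCT of the log mel spectrum, Eq. (3.6): m_i(l) = sum_j X'_i(j) cos(l*pi/M*(j-1/2))."""
    M = len(log_mel)
    l = np.arange(1, n_ceps + 1)[:, None]            # cepstral indices l = 1..n_ceps
    j = np.arange(1, M + 1)[None, :]                 # filter bank indices j = 1..M
    dct_matrix = np.cos(l * np.pi / M * (j - 0.5))   # DCT basis of Eq. (3.6)
    return dct_matrix @ log_mel                      # static MFCC vector m_i

log_mel = np.random.randn(26)                        # stand-in log mel spectrum (M = 26)
mfcc = mfcc_from_log_mel(log_mel)
print(mfcc.shape)                                    # (13,) static features of one frame
```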

Figure 3.10: A graph example of Hertz spectrum, Mel spectrum and Mel cepstrum

Figure 3.11: A summary of MFCC feature extraction [17]

3.3 Training

In modern speech recognition systems, the acoustic models are trained using Hidden Markov Models (HMMs) [12]. In this section we briefly describe HMMs and their application to speech recognition. First, we introduce the HMM, its elements, and the three basic problems associated with it. Then we focus on how HMMs are used for training acoustic models and labeling the test speech signals.

3.3.1 An Introduction to Hidden Markov Models

An HMM is a statistical model of sequences of events and an extension of Markov chains. In an ordinary Markov chain the states are visible, while in an HMM only the outputs (a set of observation sequences) are visible and the states are hidden. In a first order Markov model we assume that the probability of an event depends only on the previous event. HMMs are widely used for the recognition of time-varying pattern sequences such as speech feature vectors. The elements of a first order HMM, illustrated in Figure 3.12, are as follows:

• A set of N states S = {s1, s2, ..., sN} that are not visible (denoted by circles in Figure 3.12). Only one state is active at each time instant.

• A set of K possible observation vectors V = {v1, v2, ..., vK}; the observation vectors are selected from the set V (ot ∈ V).

• The state transition probabilities (changes from one state to another at equally spaced discrete times, denoted by arcs in Figure 3.12), given by an N × N matrix A with aij = P(qt = sj | qt−1 = si), where qt ∈ S is the state at time t and Σ_{j=1}^{N} aij = 1.

• The observation probabilities, given by an N × K matrix B with bjk = P(ot = vk | qt = sj), where ot ∈ V is the observation at time t and Σ_{k=1}^{K} bjk = 1.

• The initial state probabilities, given by an N-dimensional vector π with πi = P(q1 = si), where Σ_{i=1}^{N} πi = 1.

An HMM, denoted λ, is specified by its parameters: λ = (A, B, π).

The three basic problems we may face with an HMM are:

• Evaluation problem: what is the probability that an observation sequence O = o1, o2, ..., oT is produced by a given model λ, i.e. what is P(O|λ) given O and λ?
Solution: the forward algorithm [12].

• Uncovering problem: what is the most likely state sequence Q = q1, q2, ..., qT that can produce an observation sequence O = o1, o2, ..., oT under a given model λ?
Solution: the Viterbi algorithm [12].

• Learning problem: for a given observation sequence O = o1, o2, ..., oT of length T, how should we set the parameters of the model λ = (A, B, π) to maximize the probability P(O|λ)?
Solution: the Baum-Welch (forward-backward) re-estimation algorithm [12].

The learning and evaluation problems correspond to the training and recognition phases of an automatic speech recognition system.
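To make the evaluation problem concrete, here is a short, hypothetical sketch (not from the thesis) of the forward algorithm for computing P(O|λ) for a discrete-observation HMM λ = (A, B, π).

```python
import numpy as np

def forward_probability(A, B, pi, obs):
    """Forward algorithm: P(O | lambda) for a discrete HMM lambda = (A, B, pi).

    A:   (N, N) state transition probabilities a_ij
    B:   (N, K) observation probabilities b_jk
    pi:  (N,)   initial state probabilities
    obs: sequence of observation symbol indices o_1 ... o_T
    """
    alpha = pi * B[:, obs[0]]                 # initialization: alpha_1(i) = pi_i * b_i(o_1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]         # induction: alpha_t(j) = sum_i alpha_{t-1}(i) a_ij * b_j(o_t)
    return alpha.sum()                        # termination: P(O | lambda) = sum_i alpha_T(i)

# toy 2-state, 3-symbol model
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
print(forward_probability(A, B, pi, [0, 1, 2, 1]))
```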

3.3.2 Training Acoustic Models

The goal of a speech recognition system is to map a sequence of speech vectors to its corresponding symbol. The speech signal is related either to a single symbol (word) selected from a fixed vocabulary or to a sequence of symbols (a sentence). In the former case we have isolated word (word level) recognition, while the latter case is called continuous speech (phoneme level) recognition. In isolated word recognition we have to generate a model for each word in the vocabulary using enough training samples. Consequently, words that do not appear in the training set cannot be recognized. Although isolated word recognition may sound artificial, it has a variety of practical applications.


If we represent the input speech waveform as a sequence of observation vectors O = o1, o2, ..., oT (where ot is the observation vector at time t), then solving the isolated word recognition problem amounts to computing:

arg max_i { P(wi|O) },  (3.8)

where wi is the i-th word in the vocabulary. The probability in Equation (3.8) cannot be computed directly. Hence, using Bayes' rule we turn it into a computable form:

P(wi|O) = P(O|wi) P(wi) / P(O),  (3.9)

which means that if the prior probabilities P(wi) are known, the most likely word corresponding to the observation sequence O depends only on the likelihood P(O|wi). Therefore, solving the isolated word recognition problem reduces to computing the probability P(o1, o2, ...|wi). By generating a Markov model for each word from spoken examples of it, we can replace the computation of P(O|wi) by the estimation of the Markov model parameters. In other words, if λi represents the Markov model of the word wi, the following relation holds:

P(O|wi) = P(O|λi),  (3.10)

meaning that the isolated word recognition problem is solved by finding the parameters of the Markov model of each word wi. As indicated in Figure 3.12, the sequence of observation vectors O corresponding to each word is assumed to be produced by a Markov model. At each time unit the state of the model changes, and when state j is entered at time t, a vector ot is produced according to the probability density bj(ot). Sometimes the system stays in the same state for several time units, so that more than one vector is produced in the same state by the same probability density. The transition between states is also governed by discrete probabilities aij. The state sequence that generates the vectors o1 to o6 in Figure 3.12 is Q = q1, q2, q2, q3, q4, q4, q5, q6. Given this state sequence, the probability P(o1, ..., o6, Q|λ) can be computed simply by multiplying the transition probabilities aij and the output probabilities bj(ot). In a hidden Markov model, however, the states are hidden and we have no information about the state sequence Q. Therefore, the probability P(O|λ) is approximated by finding the most probable state sequence and multiplying its corresponding transition and output probabilities, assuming that the parameters aij and bj(ot) of the model are known.

The training phase of a speech recognition system consists of finding the HMM parameters for each word. The model for each vocabulary word is trained using enough spoken examples of that word. The parameters of the models are estimated by the Baum-Welch re-estimation algorithm [17]. These parameters are the transition probabilities aij and the means µj and covariances Σj of the output probability densities bj(ot), which are continuous Gaussian mixture densities.

Figure 3.12: An example of a Markov Model λ [17]

In continuous speech recognition the models are trained on phonemes rather than words; therefore, even words that are not in the training set can be recognized. Moreover, since there is only a limited number of phonemes, we have enough training samples for the models. In continuous speech recognition, however, it is difficult to determine the boundaries between symbols from the speech signal. In this case the HMMs are concatenated to form continuous speech models, and three states are used for each phoneme.


3.4 Recognition

The recognition phase of an ASR system estimates, after the HMMs have been trained, the maximum value of P(O|λi) over the models for each word (in isolated word recognition) or sentence (in continuous speech recognition). As mentioned in Section 3.3.2, these probabilities are computed along the most likely state sequence using the Viterbi decoding algorithm [17]; an extension of the Viterbi algorithm is used for continuous recognition as well. The word (or sentence) wi corresponding to the model that produces the maximum likelihood P(O|λi) is then the recognized word (sentence).
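As an illustrative, hypothetical sketch (not the thesis implementation), the following Viterbi decoder computes the likelihood of the best state sequence for a discrete HMM; running it over the models λi of all words and taking the argmax implements the recognition rule described above. The model dictionary in the commented usage is purely hypothetical.

```python
import numpy as np

def viterbi_score(A, B, pi, obs):
    """Log-likelihood of the most probable state sequence and the sequence itself."""
    N, T = A.shape[0], len(obs)
    logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
    delta = logpi + logB[:, obs[0]]              # best log-prob of paths ending in each state
    psi = np.zeros((T, N), dtype=int)            # back-pointers
    for t in range(1, T):
        scores = delta[:, None] + logA           # extend every path by every transition
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + logB[:, obs[t]]
    best_last = int(delta.argmax())
    path = [best_last]
    for t in range(T - 1, 0, -1):                # backtrack the best state sequence
        path.append(int(psi[t, path[-1]]))
    return delta.max(), path[::-1]

# recognition: pick the word model with the highest Viterbi score
# models = {"yes": (A1, B1, pi1), "no": (A2, B2, pi2)}   # hypothetical per-word HMMs
# obs = [0, 2, 1, 1]
# best_word = max(models, key=lambda w: viterbi_score(*models[w], obs)[0])
```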


Chapter 4

Mapping Based Noise Reduction

This chapter focuses on linear and nonlinear dimensionality reduction techniques used for the purpose of noise reduction. In many speech recognition problems, environmental and natural noise corrupts the speech data. In order to perform robust speech recognition in the presence of such noise, one first has to reduce it. Several noise reduction methods have been used to improve speech recognition in noisy environments, among them Spectral Subtraction, Cepstral Mean Subtraction and Blind Equalization. In difficult domains, however, these methods prove to be insufficient.

The mapping based method, in contrast, maps the noisy feature vectors extracted from a corrupted speech signal to corresponding clean feature vectors. In other words, it substitutes each sample of a noisy data set (noisy feature vectors) Xn by the sample of a clean data set (clean feature vectors) Xc which best describes its features. The mapping, however, is done in a low dimensional space, because in the lower dimensional space the distances between vectors are measured more accurately, and consequently the most suitable substitute, the clean feature vector with the highest similarity to the noisy one, is found. Figure 4.1 illustrates this: the nonlinearity of the high dimensional space leads to inaccurate distance measurements between vectors, so the substitution does not work as expected, while mapping to the low dimensional space yields a submanifold with more linear characteristics and therefore higher precision in distance measurement. Moreover, the hope is that some dimensions representing noise are dropped in the dimensionality reduction. This nonlinear dropping of noise dimensions makes the process conceptually and mathematically different from techniques like multi-condition training.

Figure 4.1: An example that shows how mapping to a lower dimensional space leads to more accurate distance measurement and, as a result, more precise vector replacement
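A minimal, hypothetical sketch of the substitution step described above (not the thesis code) is given below: noisy and clean feature vectors are projected into a shared low dimensional space, each noisy vector is matched to its nearest clean vector there, and the corresponding clean vector from the original space is returned. The projection function is left abstract; PCA or LLE, described in the following sections, could be plugged in, and the trivial projection used in the example is for illustration only.

```python
import numpy as np

def map_noisy_to_clean(X_noisy, X_clean, project):
    """Replace each noisy feature vector by the clean vector that is nearest
    to it in the low dimensional space produced by `project` (e.g. PCA or LLE)."""
    Y_noisy = project(X_noisy)                   # noisy vectors in the reduced space
    Y_clean = project(X_clean)                   # clean vectors in the reduced space
    replaced = np.empty_like(X_noisy)
    for i, y in enumerate(Y_noisy):
        dists = np.linalg.norm(Y_clean - y, axis=1)   # distances in the reduced space
        replaced[i] = X_clean[dists.argmin()]         # substitute the nearest clean vector
    return replaced

# example with a trivial projection (keep the first d coordinates)
d = 11
project = lambda X: X[:, :d]
X_clean = np.random.randn(200, 13)                        # clean MFCC-like vectors
X_noisy = X_clean[:50] + 0.3 * np.random.randn(50, 13)    # artificially corrupted vectors
print(map_noisy_to_clean(X_noisy, X_clean, project).shape)   # (50, 13)
```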

In the following sections two dimensionality reduction methods are described in detail: Principal Component Analysis as a linear method and Locally Linear Embedding as a nonlinear method. Then a standard k-means clustering algorithm is discussed, and finally the whole noise reduction procedure is presented in the last section.

4.1 Locally Linear Embedding, LLE Algorithm

Locally Linear Embedding [5] is an unsupervised manifold learning algorithm that performs a nonlinear mapping from a high dimensional to a low dimensional space. Applications of LLE include speech analysis, feature extraction for speech recognition, speech data visualization, phone classification, and noise reduction.


The idea behind LLE is that points that are neighbours in the high dimensional space should remain neighbours in the low dimensional space. In other words, the mapping produced by LLE preserves the local neighbourhood structure of the high dimensional data in the low dimensional embedding.

Suppose X = {x_1, x_2, ..., x_n}, x_i ∈ R^D, is the high-dimensional data set lying on a nonlinear manifold. LLE maps this set of data to a low dimensional data set Y = {y_1, y_2, ..., y_n}, y_i ∈ R^d, d < D, in two steps: first, finding the K nearest neighbours of each sample x_i and calculating the reconstruction weights of each sample based on its nearest neighbours; second, finding the low dimensional mapping y_i for each x_i which minimizes a function of the reconstruction weights.

To obtain the reconstruction weights based on the nearest neighbours of x_i, the following function is minimized:

ε_1(W) = ∑_{i=1}^{n} |x_i − x̃_i|^2 = ∑_{i=1}^{n} |x_i − ∑_{j=1}^{K} w_j^{(i)} x_{N(j)}|^2,        (4.1)

x̃_i = ∑_{j=1}^{K} w_j^{(i)} x_{N(j)},

where x_{N(j)} represents the j-th nearest neighbour of x_i. The function ε_1 measures the difference between the original and reconstructed values of x_i and thus the reconstruction error. With the optimal weights (the minimum of ε_1), x_i is reconstructed most accurately by its K nearest neighbours x_{N(1)}, ..., x_{N(K)}. For a single vector x_i, the above equation becomes:

ε_1^{(i)} = ∑_{j=1}^{K} ∑_{m=1}^{K} w_j^{(i)} w_m^{(i)} Q_{jm}^{(i)},        (4.2)

Q_{jm}^{(i)} = (x_i − x_{N(j)})^T (x_i − x_{N(m)}),

where ∑_{j=1}^{K} w_j^{(i)} = 1 and Q^{(i)} contains the inner products of the neighbourhood vectors of x_i. Finally, by solving Equation (4.2) and setting R^{(i)} = (Q^{(i)})^{−1}, the optimal reconstruction weights are calculated:

w_j^{(i)} = (∑_{m=1}^{K} R_{jm}^{(i)}) / (∑_{p=1}^{K} ∑_{q=1}^{K} R_{pq}^{(i)}).        (4.3)


However, since rank(Q^{(i)}) = min(D, K), Q^{(i)} is not invertible for K > D, and as a result a regularization term r should be added before the inversion of Q^{(i)}: R^{(i)} = (Q^{(i)} + rI)^{−1}, where I is the identity matrix. Otherwise, the eigenanalysis algorithm used for finding the inverse of Q^{(i)} does not converge.
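A minimal Octave sketch of this weight computation for a single sample is given below. The function name lle_weights and its calling convention (x_i as a column vector, its K nearest neighbours as the columns of a matrix) are illustrative assumptions and not the thesis implementation.

```octave
% Sketch: reconstruction weights of one sample xi from its K nearest
% neighbours, following Eq. (4.1)-(4.3). xi is D x 1, Xn is D x K.
function w = lle_weights(xi, Xn, r)
  K = size(Xn, 2);
  Z = Xn - repmat(xi, 1, K);   % neighbours shifted so that xi is at the origin
  Q = Z' * Z;                  % local Gram matrix Q(i) of Eq. (4.2)
  Q = Q + r * eye(K);          % regularization, as in R(i) = (Q(i) + rI)^(-1)
  w = Q \ ones(K, 1);          % solving Q w = 1 and normalizing the result
  w = w / sum(w);              % is equivalent to Eq. (4.3)
end
```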

Now that the reconstruction weights are calculated, the low-dimensional mappings y_i can be found by minimizing the following function of these weights:

ε_2(Y) = ∑_{i=1}^{n} |y_i − ∑_{j=1}^{K} w_j^{(i)} y_{N(j)}|^2.        (4.4)

This equation can be expressed in matrix form as follows:

ε_2 = ∑_{i=1}^{n} ∑_{j=1}^{n} M_{ij} y_i^T y_j = trace(Y M Y^T),        (4.5)

where W is an n × n sparse matrix of reconstruction weights with W_{i,N(j)} = w_j^{(i)}, M is the n × n matrix defined as M = (I − W)^T (I − W), and Y is the matrix containing the low dimensional vectors y_i in its columns. In order to find the optimal solution, the method of Lagrange multipliers is used. The Lagrange equation is obtained by combining Equation (4.5) with a constraint that makes the covariance matrix of Y equal to the identity: (1/n) Y Y^T = I. Setting the derivative of this Lagrange equation to zero leads to the final equation (M − Λ) Y^T = 0, where Λ is a diagonal matrix containing the Lagrange multipliers. The solution to this equation consists of eigenvectors of M; however, only the eigenvectors corresponding to the (d + 1) smallest eigenvalues, excluding the very first one which is discarded, minimize ε_2. The eigenvector corresponding to the smallest eigenvalue is related to the mean of the vectors y_i, and by discarding it we enforce ∑_{i=1}^{n} y_i = 0. Figure 4.2 shows a summary of how LLE works. According to this figure, mapping a vector x_i to its corresponding vector y_i in the lower dimensional space through LLE can be summarized in three steps:

1. The nearest neighbours to xi are selected.

2. The reconstruction weights w_ij are calculated based on how well x_i can be recreated by its nearest neighbours.


Figure 4.2: A summary of LLE [1]

3. The same reconstruction weights w_ij are used to find the mapped vector y_i, based on how well y_i can be recreated by the mappings of x_i's nearest neighbours weighted by w_ij. A sketch of this embedding step is given below.
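As a rough Octave sketch of the embedding step, i.e. the eigenanalysis of M described above (the function name, the dense eigendecomposition and the handling of the scaling constraint are simplifying assumptions):

```octave
% Sketch: the embedding step of LLE. W is the n x n weight matrix with
% W(i, N(j)) = w_j^(i); the columns of Y are the low dimensional vectors y_i
% (up to the scaling fixed by the covariance constraint).
function Y = lle_embed(W, d)
  n = size(W, 1);
  M = (eye(n) - W)' * (eye(n) - W);       % M = (I - W)'(I - W), see Eq. (4.5)
  [V, L] = eig(M);
  [~, order] = sort(diag(L), 'ascend');   % eigenvalues from smallest to largest
  Y = V(:, order(2:d + 1))';              % keep the (d+1) smallest, discard the first
end
```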

To map a new sample x which is not included in the data set X, its K nearest neighbours x_{N(1)}, ..., x_{N(K)} in X are found and the new reconstruction weights w_j are calculated according to Equation (4.3). The sample is then mapped using these weights and the vectors in Y corresponding to those neighbours. The mapping equation is therefore:

y = ∑_{j=1}^{K} w_j y_{N(j)}.        (4.6)

The above mapping procedure involves three parameters which affect the results. To obtain a satisfactory mapping, these parameters should be set properly:

1. Regularization parameter r: needed for the cases where K > D to help the eigenvalue computation algorithm converge. In our experiments, r = 10^{−3} results in good convergence.

2. Number of neighbours K: low values of K produce a mapping with no global properties. On the other hand, with high values of K, the whole data set is treated as one neighbourhood, and the nonlinearity of the mapping process is lost. Additionally, high values of K make the algorithm slower due to the larger amount of computation.

3. Dimensionality d to map to: low values of d cause several data samples to be mapped to the same new point, so information is lost, while high values of d lead to a mapping with a higher amount of noise involved.

Figure 4.3 shows an example of a 3-dimensional data set mapped to a 2-dimensional space by LLE.

Figure 4.3: A 3D data set mapped to 2D space using LLE [4]

4.2 Principal Component Analysis

Principal Component Analysis [2] is an unsupervised learning algorithm which reduces the dimensionality of data points by projecting each point of a data set linearly onto a number of vectors called principal components. These principal components are ordered along the directions of maximum variance of the data and are mutually orthogonal. In other words, PCA projects the data orthogonally onto a lower dimensional space (the principal subspace) such that the variance of the projected data is maximized. In addition to dimensionality reduction, PCA has other applications such as data compression, feature extraction and data visualization.

Let x_n, n = 1, 2, ..., N, x_n ∈ R^D, represent a set of vectors with dimensionality D. Using PCA, we intend to project this set of vectors onto a new set of vectors y_n, n = 1, 2, ..., N, y_n ∈ R^d, with a lower dimensionality d < D.

We start the formulation by projecting the data onto a one dimensional space, i.e. by setting d = 1, and then generalize it to all possible values of d. We define the principal component of this one dimensional space by a vector u_1 of dimensionality D. Since we only need the direction of u_1 for the data projection, we assume that it is a unit vector, i.e. u_1^T u_1 = 1. Using the principal component u_1, the input vectors x_n are projected onto the scalar values y_n = u_1^T x_n.

To proceed, we need the mean and covariance of the input vectors x_n. The following relations give the mean vector m and covariance matrix S, respectively:

m = (1/N) ∑_{n=1}^{N} x_n,        (4.7)

S = (1/N) ∑_{n=1}^{N} (x_n − m)(x_n − m)^T.        (4.8)

Now, we can proceed by finding a relation for the variance of the projected data y_n and maximizing it in terms of u_1. Considering that the mean of the projected data is u_1^T m, the variance of the projected data is:

Var(y_n) = (1/N) ∑_{n=1}^{N} (u_1^T x_n − u_1^T m)^2 = u_1^T S u_1.        (4.9)

The goal is now to solve the following maximization problem:

arg max_{u_1}  u_1^T S u_1        (4.10)

subject to:  u_1^T u_1 = 1.

This is a constrained maximization problem in which the unit-norm property of u_1 serves as the constraint that prevents the magnitude of u_1 from tending to infinity. By introducing a Lagrange multiplier λ_1, this constraint is imposed on the maximization problem and the following unconstrained maximization problem is obtained:

arg max_{u_1}  { u_1^T S u_1 + λ_1 (1 − u_1^T u_1) }.        (4.11)

To solve this problem, we set the derivative with respect to u_1 equal to zero, which results in the following equation:

S u_1 = λ_1 u_1.        (4.12)

Equation (4.12) implies that the vector u_1 is an eigenvector of the covariance matrix S. We can left-multiply both sides of this equation by u_1^T,

u_1^T S u_1 = λ_1,        (4.13)

to draw one further conclusion: the left side of Equation (4.13) is the variance of the projected data, which is therefore equal to λ_1. This means that the variance of the projected data is maximized when u_1 is chosen as the eigenvector corresponding to the largest eigenvalue of the covariance matrix S.

So far, we have found the first principal component, which is the eigenvector related to the largest eigenvalue of S. To find the other principal components we proceed analogously, taking the eigenvectors corresponding to the second, third, ... largest eigenvalues of S. The orthogonality requirement for successive principal components is automatically fulfilled, since the eigenvectors of the symmetric matrix S are orthogonal to each other.

Finally, we generalize the algorithm to any value of d (a d-dimensional projection). The best d-dimensional projection, which maximizes the variance of the projected data, is obtained by finding the d eigenvectors u_1, ..., u_d (the d principal components) corresponding to the d largest eigenvalues λ_1, ..., λ_d of the covariance matrix S.
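A compact Octave sketch of this procedure is given below; the function name, the column-wise data layout and the use of a dense eigendecomposition are assumptions for illustration rather than a description of the thesis code.

```octave
% Sketch: project the columns of X (D x N) onto the d principal components.
function [Y, U] = pca_project(X, d)
  N = size(X, 2);
  m = mean(X, 2);                         % mean vector, Eq. (4.7)
  Xc = X - repmat(m, 1, N);               % centred data
  S = (Xc * Xc') / N;                     % covariance matrix, Eq. (4.8)
  [V, L] = eig(S);                        % eigenvectors and eigenvalues of S
  [~, order] = sort(diag(L), 'descend');  % order by decreasing eigenvalue
  U = V(:, order(1:d));                   % the d principal components u_1, ..., u_d
  Y = U' * Xc;                            % d x N matrix of projected data
end
```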

As an example, Figure 4.4 illustrates a realization of a data set with three dependent coordinates together with its principal components.

Figure 4.4: Realization of a 3D data set x_n (red) and the principal components of the data set x_n (blue)

4.3 k-Means Clustering Algorithm

k-means is an unsupervised clustering algorithm which partitions a data set into k clusters such that each data point is associated with the nearest cluster centroid. Mathematically, the k-means algorithm classifies a data set X = {x_1, ..., x_n}, x_i ∈ R^d, into k disjoint clusters S = {S_1, S_2, ..., S_k} by minimizing the sum of squared Euclidean distances between each cluster centroid and the data points inside that cluster. This can be formulated as:

arg min_S  ∑_{i=1}^{k} ∑_{x_j ∈ S_i} ||x_j − c_i||^2,        (4.14)

where c_i denotes the centroid of cluster S_i, defined as the mean of the data points inside the cluster. The standard k-means algorithm works as follows (a short Octave sketch is given after the steps):

1. The first step is to select a value for the number of clusters k.

2. The second step is to set the initial values of the k cluster centroids c_1, c_2, ..., c_k. One way is to assign the data points randomly to the k clusters and calculate the centroids; another is to assign the first k data points to the k clusters, assign the rest of the data to the cluster with the nearest centroid, and recalculate the centroid after each assignment.

3. In the third step, calculate the Euclidean distance between each data point and each of the cluster centroids, and assign each data point to the cluster with the nearest centroid. Recalculate the centroids of the clusters which have gained or lost data points. This step can be formulated as follows:

S_i^{(t)} = { x_j : ||x_j − c_i^{(t)}|| ≤ ||x_j − c_ℓ^{(t)}||  ∀ℓ = 1, ..., k },        (4.15)

c_i^{(t+1)} = (1 / |S_i^{(t)}|) ∑_{x_j ∈ S_i^{(t)}} x_j.        (4.16)

4. Iterate until the centroid locations no longer change and the algorithm converges. The algorithm will always converge, provided that after each iteration the sum of Euclidean distances between each cluster centroid and the points in that cluster decreases.
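The following is a minimal Octave sketch of this iteration; the random initialization from data points, the fixed iteration cap and all names are illustrative assumptions.

```octave
% Sketch: standard k-means on the columns of X (d x n), Eq. (4.14)-(4.16).
function [C, labels] = kmeans_simple(X, k, max_iter)
  n = size(X, 2);
  C = X(:, randperm(n, k));              % step 2: initial centroids from random data points
  for it = 1:max_iter
    % step 3: assign every point to its nearest centroid, Eq. (4.15)
    D2 = zeros(k, n);
    for i = 1:k
      D2(i, :) = sum((X - repmat(C(:, i), 1, n)).^2, 1);
    end
    [~, labels] = min(D2, [], 1);
    % step 3 (continued): recompute the centroid of every non-empty cluster, Eq. (4.16)
    C_new = C;
    for i = 1:k
      if any(labels == i)
        C_new(:, i) = mean(X(:, labels == i), 2);
      end
    end
    if isequal(C_new, C)                 % step 4: stop when the centroids no longer move
      break;
    end
    C = C_new;
  end
end
```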

Figure 4.5 shows a block diagram of the algorithm, and Figure 4.6 shows an example of how the cluster centroids (denoted by red stars) move during the iterations until all the data points (pictured by blue squares) are finally classified into two disjoint clusters.

Figure 4.5: Block Diagram of the standard k-means algorithm [13]


Figure 4.6: An example of how the cluster centroids (stars) move after each iteration of the k-means algorithm [13]

4.4 Mapping Procedure

The process of mapping the noisy feature vectors to clean ones is performed in several steps, which are presented in this section. The clean vectors are the MFCC feature vectors extracted from clean speech signals recorded for the purpose of speech recognition, and the noisy vectors are the MFCC feature vectors extracted from the same utterances by the same speakers but recorded in a noisy environment, for example on motorcycles, where different sources of noise (wind, construction, other vehicles, etc.) are present.

The following is the noise reduction process:

1. Let X be the set of clean feature vectors. First, this set is mapped to a lower dimensional set using a dimensionality reduction algorithm (LLE or PCA).

2. Let x_n be the noisy vector, which should also be mapped to the low dimensional space. Since it is considered a new, previously unseen sample, the mapping can be done using the weights obtained from the nearest neighbours of the noisy vector among the clean vectors, and the mapping of the noisy vector is found based on them. The process is as follows: first, the K nearest clean vectors to the noisy vector are found in the original manifold. Using these nearest neighbours, reconstruction weights w_i are calculated so that the following relation is fulfilled:

x_n ≈ ∑_{i=1}^{K} w_i x_i.        (4.17)

To get a mapping z of the noisy data, the same weights are used as follows:

z ≈ ∑_{i=1}^{K} w_i y_i,        (4.18)

where Y is the image of X mapped to the lower dimensional space. Equations (4.17) and (4.18) are based on minimizing the approximation (≈) error of reconstructing a vector from its nearest neighbours, in the original and the lower dimensional space respectively.

3. After mapping both the clean and the noisy vectors to the lower dimensional space, we find the mapped clean vector y_i nearest to the mapped noisy vector z.

4. Finally, the nearest neighbour of z is mapped back to the original space. In other words, the x_i related to y_i is found and given as input to the recognizer instead of the noisy vector x_n.

Lastly, to account for the possibility that the substituted vector x_i may lie far away from the original vector x_n, a linear interpolation between these two vectors can be used. A parameter λ then determines how much of x_n and x_i contributes to the output: o = (1 − λ) x_i + λ x_n.
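Putting these steps together, a possible Octave sketch of the per-frame substitution with the interpolation step is shown below; it reuses the hypothetical lle_map_new helper from Section 4.1 for step 2, and all names and the column-wise frame layout are assumptions for illustration.

```octave
% Sketch: replace each noisy MFCC frame by an interpolated clean frame,
% following steps 1-4 above. Xc: D x n clean vectors, Yc: d x n image of Xc
% in the low dimensional space (from LLE or PCA), Xn: D x m noisy vectors.
function O = map_noisy_frames(Xc, Yc, Xn, K, r, lambda)
  O = zeros(size(Xn));
  for t = 1:size(Xn, 2)
    z = lle_map_new(Xn(:, t), Xc, Yc, K, r);            % step 2: Eq. (4.17)-(4.18)
    d2 = sum((Yc - repmat(z, 1, size(Yc, 2))).^2, 1);
    [~, i] = min(d2);                                   % step 3: nearest mapped clean vector
    O(:, t) = (1 - lambda) * Xc(:, i) + lambda * Xn(:, t);  % step 4 with o = (1-λ)x_i + λx_n
  end
end
```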

In order to improve this mapping based noise reduction algorithm, the available clean and noisy data can be classified prior to the mapping process. This means that a k-means clustering is performed on the clean data set to find several cluster centroids c_i, i = 1, ..., k. Then, the cluster nearest to each noisy vector x_n is found by calculating the Euclidean distances between that noisy vector and the cluster centroids. Considering that x_n belongs to this cluster c_i, the mapping process for x_n is done using only the clean data X_{c_i} from the same cluster.

A brief description of this algorithm is as follows:

1. Let X be the clean data set; using the k-means clustering algorithm, it is classified into k clusters X_{c_1}, ..., X_{c_k}, where c_1, ..., c_k are the cluster centroids. Then each cluster of clean data is mapped to the low dimensional space separately: Y_{c_1}, ..., Y_{c_k} are the clean data sets in the low dimensional space.


2. Let x_n be the noisy vector; it is classified by calculating the Euclidean distances between x_n and all cluster centroids. Assuming that the nearest centroid to x_n is c_i, we say that x_n belongs to cluster X_{c_i}.

3. To find the best clean substitute for x_n, perform the same mapping process as before restricted to the data of cluster c_i, i.e. do the mapping for x_n using X_{c_i} and Y_{c_i}. A short sketch is given below.
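A small Octave sketch of this cluster-wise variant, reusing the hypothetical helpers sketched earlier, is given below; the use of cell arrays for the per-cluster data is an assumption for illustration.

```octave
% Sketch: cluster-wise mapping of one noisy frame xn (D x 1).
% C: D x k centroids from k-means on the clean data (in the original space);
% Xc{i}, Yc{i}: clean vectors of cluster i and their low dimensional image.
function o = map_noisy_frame_clustered(xn, C, Xc, Yc, K, r, lambda)
  d2 = sum((C - repmat(xn, 1, size(C, 2))).^2, 1);       % step 2: distances to the cluster centroids
  [~, i] = min(d2);                                      % nearest cluster c_i
  o = map_noisy_frames(Xc{i}, Yc{i}, xn, K, r, lambda);  % step 3: mapping within cluster i
end
```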


Chapter 5

Implementation and Results

In this chapter, different implementation approaches and the obtained results are presented. For each approach there are a number of parameters which affect the performance, and the results are compared in order to find the optimum values. Moreover, the results are verified for different speakers and for different levels of artificially added noise.

5.1 Data Selection

The speech data used in this experiment includes clean speech signals from two male speakers and the noisy signals obtained by artificially adding wind noise to the same clean signals at different signal-to-noise ratio (SNR) values. The feature vectors are then extracted from the clean and the distorted speech signals. As an example of how different SNR values affect the MFCC feature vectors, Figure 5.1 shows the spectrogram of distorted MFCC vectors with SNR values of 25 and 0 dB.
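As a side note, scaling a noise signal so that the mixture reaches a prescribed SNR can be done as in the following Octave sketch; this is a generic illustration of the mixing step, not a transcript of the scripts used for the thesis experiments.

```octave
% Sketch: mix a noise signal into a clean signal at a target SNR (in dB).
% Assumes the noise recording is at least as long as the clean signal.
function noisy = add_noise_at_snr(clean, noise, snr_db)
  noise = noise(1:length(clean));          % truncate the noise to the signal length
  Ps = mean(clean .^ 2);                   % average power of the clean signal
  Pn = mean(noise .^ 2);                   % average power of the noise
  g = sqrt(Ps / (Pn * 10^(snr_db / 10)));  % gain so that 10*log10(Ps / (g^2 * Pn)) = snr_db
  noisy = clean + g * noise;
end
```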

Corr (%)     Clean    10 dB    25 dB
Speaker 1    82.58    40.91    75.00
Speaker 2    87.88    60.61    86.36

Table 5.1: The percentage of correctly recognized phonemes for our clean and noisy feature vectors


Figure 5.1: The spectrogram of two noisy feature vectors with SNR values of 25 and 0 dB

We have used the Hidden Markov Model Toolkit (HTK) [17] for the continuous (phoneme level) speech recognition and QtOctave for the implementation of our noise reduction algorithm. Table 5.1 shows the recognition results for the clean vectors, the noisy vectors with SNR = 10 dB and the noisy vectors with SNR = 25 dB, for the two speakers. The value "Corr" in this and all other tables represents the percentage of correctly recognized phonemes.

In the following sections, the recognition results for the distorted speech signals after processing by our noise reduction methods are presented.

5.2 Implementation by LLE

We first implemented the noise reduction procedure described in Section 4.4 using LLE as the dimensionality reduction algorithm. The noisy feature vectors were originally 13-dimensional. Through the noise reduction process this dimensionality was reduced to values d lower than 13. The other parameters in our experiment were the number of nearest neighbours K and the interpolation parameter λ described in Section 4.4.


Corr (%)   λ = 0   λ = 0.5   λ = 0.6   λ = 0.9   λ = 0.99
K = 5      21.21   35.61     41.67     40.91     42.42
K = 15     18.94   37.12     40.15     37.12     40.91
K = 20     26.52   43.94     43.18     43.94     43.18

Table 5.2: Speaker 1, SNR = 10 dB, d = 12

Corr (%)   λ = 0   λ = 0.5   λ = 0.6   λ = 0.9   λ = 0.99
K = 5      18.94   31.82     33.33     43.18     40.91

Table 5.3: Speaker 1, SNR = 10 dB, d = 11

The results we obtained using LLE for the dimensionality reduction are given in Tables 5.2 to 5.5 for different values of d, K and λ. In addition, Figures 5.2 and 5.3 plot how these results change with λ for different values of K. Table 5.2 shows a slight improvement in speech recognition for K = 20. However, by checking our results we found that there is some randomness in the output of LLE that makes the results unstable; in other words, we obtain a new result every time we run the algorithm. This randomness might be caused by the eigenanalysis algorithm used in LLE. Moreover, the use of LLE makes our method very slow to run. Hence, we decided to replace LLE by another dimensionality reduction algorithm which does not have these drawbacks.

Corr (%)   λ = 0   λ = 0.5   λ = 0.6   λ = 0.9   λ = 0.99
K = 10     32.58   52.27     66.67     70.45     74.24
K = 15     50.76   67.42     70.45     69.70     75.00

Table 5.4: Speaker 1, SNR = 25 dB, d = 12

Corr (%)   λ = 0   λ = 0.5   λ = 0.6   λ = 0.9   λ = 0.99
K = 10     15.15   29.55     40.91     56.06     59.85

Table 5.5: Speaker 2, SNR = 10 dB, d = 11


Figure 5.2: LLE, SNR = 10 dB, d = 11

Figure 5.3: LLE, SNR = 10 dB, d = 12


Corr (%)   λ = 0   λ = 0.5   λ = 0.6   λ = 0.9   λ = 0.99
K = 5      26.52   40.91     42.42     46.97     46.97
K = 10     23.48   42.42     43.18     46.21     46.21
K = 15     37.88   43.94     42.42     44.70     44.70
K = 20     27.27   38.64     40.91     45.45     45.45

Table 5.6: Speaker 1, SNR = 10 dB, d = 12

Corr (%)   λ = 0   λ = 0.5   λ = 0.6   λ = 0.9   λ = 0.99
K = 5      26.52   40.91     40.15     45.45     45.45
K = 10     26.52   40.15     43.94     45.45     45.45
K = 15     25.00   40.15     44.70     43.94     43.94
K = 20     28.03   41.67     40.91     44.70     44.70

Table 5.7: Speaker 1, SNR = 10 dB, d = 11

5.3 Implementation by PCA

The dimensionality reduction algorithm we used instead of LLE was PCA. Using PCA led to a faster implementation of our algorithm. Besides, the randomness caused by LLE did not occur with PCA, so it produced stable outputs. Tables 5.6 to 5.10 list the results obtained by using PCA instead of LLE. One can observe from Table 5.6, for instance, that using PCA in our algorithm leads to a considerable improvement of 6 percent in the speech recognition. Figures 5.4 and 5.5 illustrate graphically how the results change with different values of λ. According to these graphs, the best values are λ = 0.9 and K = 5 or 10, which lead to the highest recognition percentage. Furthermore, d = 12 is the best value of the dimensionality to map to.

Corr (%)   λ = 0   λ = 0.5   λ = 0.6   λ = 0.9   λ = 0.99
K = 5      68.18   74.24     74.24     75.00     75.76
K = 10     67.42   72.73     72.73     72.73     75.76
K = 15     67.42   71.21     73.48     75.76     75.00
K = 20     66.67   70.45     71.97     75.76     75.00

Table 5.8: Speaker 1, SNR = 25 dB, d = 12


Corr (%)   λ = 0   λ = 0.5   λ = 0.6   λ = 0.9   λ = 0.99
K = 10     23.48   42.42     47.73     60.61     60.61

Table 5.9: Speaker 2, SNR = 10 dB, d = 11

Corr (%)   λ = 0   λ = 0.5   λ = 0.6   λ = 0.9   λ = 0.99
K = 5      72.73   84.85     84.85     84.85     86.36
K = 10     72.73   80.30     80.30     85.61     87.12

Table 5.10: Speaker 2, SNR = 25 dB, d = 12

Figure 5.4: PCA, SNR = 10 dB, d = 11


Figure 5.5: PCA, SNR = 10 dB, d = 12

Corr (%)   λ = 0   λ = 0.5   λ = 0.6   λ = 0.9   λ = 0.99
K = 15     8.33    5.30      6.06      36.36     41.67

Table 5.11: 2 clusters, LLE, Speaker 1, SNR = 10 dB, d = 11

5.4 Implementation by Clustered Vectors

This section presents the results obtained by using clustered feature vectors and performing the noise reduction on each cluster separately. In this part of our work, we classified both the clean and the noisy feature vectors into 2, 3, 4, and 5 clusters before mapping them to the lower dimensional space. Then, for each class of vectors, we performed the same noise reduction process using both LLE and PCA as the dimensionality reduction algorithm. With higher numbers of clusters the noise reduction algorithm ran faster. However, the results given in Tables 5.11 to 5.30 show no significant improvement in speech recognition. The reason for the undesirable results after clustering the feature vectors might be that we did not have enough training (clean) feature vectors to be used in our algorithm.



Corr (%)   λ = 0   λ = 0.5   λ = 0.6   λ = 0.9   λ = 0.99
K = 10     3.79    12.12     27.27     64.39     75.76

Table 5.12: 2 clusters, LLE, Speaker 1, SNR = 25 dB, d = 12

Corr (%)   λ = 0   λ = 0.5   λ = 0.6   λ = 0.9   λ = 0.99
K = 10     6.82    8.33      8.33      38.64     38.64

Table 5.13: 2 clusters, PCA, Speaker 1, SNR = 10 dB, d = 12

Corr (%)   λ = 0   λ = 0.5   λ = 0.6   λ = 0.9   λ = 0.99
K = 5      12.12   11.36     23.48     61.36     75.76
K = 10     12.12   12.88     28.03     63.64     71.21

Table 5.14: 2 clusters, PCA, Speaker 1, SNR = 25 dB, d = 12

Corr (%)   λ = 0   λ = 0.5   λ = 0.6   λ = 0.9   λ = 0.99
K = 10     6.06    6.06      10.61     51.52     51.52

Table 5.15: 2 clusters, PCA, Speaker 2, SNR = 10 dB, d = 11


Corr (%)   λ = 0   λ = 0.5   λ = 0.6   λ = 0.9   λ = 0.99
K = 10     19.70   16.67     71.21     83.33     87.12

Table 5.16: 2 clusters, PCA, Speaker 2, SNR = 25 dB, d = 11

Corr (%)   λ = 0   λ = 0.5   λ = 0.6   λ = 0.9   λ = 0.99
K = 10     3.79    2.27      4.55      37.12     37.12

Table 5.17: 3 clusters, LLE, Speaker 1, SNR = 10 dB, d = 12

Corr (%)   λ = 0   λ = 0.5   λ = 0.6   λ = 0.9   λ = 0.99
K = 5      4.55    13.64     25.00     62.12     75.76

Table 5.18: 3 clusters, LLE, Speaker 1, SNR = 25 dB, d = 12

Corr (%)   λ = 0   λ = 0.5   λ = 0.6   λ = 0.9   λ = 0.99
K = 10     6.82    3.03      6.06      34.85     40.91

Table 5.19: 3 clusters, PCA, Speaker 1, SNR = 10 dB, d = 12


Corr (%)   λ = 0   λ = 0.5   λ = 0.6   λ = 0.9   λ = 0.99
K = 10     9.09    8.33      27.27     61.36     75.76

Table 5.20: 3 clusters, PCA, Speaker 1, SNR = 25 dB, d = 12

Corr (%)   λ = 0   λ = 0.5   λ = 0.6   λ = 0.9   λ = 0.99
K = 10     7.58    6.82      9.85      53.03     53.03

Table 5.21: 3 clusters, PCA, Speaker 2, SNR = 10 dB, d = 11

Corr (%)   λ = 0   λ = 0.5   λ = 0.6   λ = 0.9   λ = 0.99
K = 5      15.91   17.42     65.15     81.82     86.36

Table 5.22: 3 clusters, PCA, Speaker 2, SNR = 25 dB, d = 12

Corr (%)   λ = 0   λ = 0.5   λ = 0.6   λ = 0.9   λ = 0.99
K = 10     4.55    4.55      8.33      31.82     40.91

Table 5.23: 4 clusters, LLE, Speaker 1, SNR = 10 dB, d = 12


Corr (%)   λ = 0   λ = 0.5   λ = 0.6   λ = 0.9   λ = 0.99
d = 6      5.30    6.82      21.97     58.33     75.76
d = 12     8.33    9.09      20.45     61.36     75.76

Table 5.24: 4 clusters, LLE, Speaker 1, SNR = 25 dB, K = 10

Corr (%)   λ = 0   λ = 0.5   λ = 0.6   λ = 0.9   λ = 0.99
d = 5      7.58    4.55      6.82      34.85     34.85
d = 12     6.06    5.30      6.06      34.85     34.85

Table 5.25: 4 clusters, PCA, Speaker 1, SNR = 10 dB, K = 10

Corr (%)   λ = 0   λ = 0.5   λ = 0.6   λ = 0.9   λ = 0.99
d = 4      6.82    10.61     23.48     58.33     76.52
d = 12     12.88   9.09      17.42     59.09     75.76

Table 5.26: 4 clusters, PCA, Speaker 1, SNR = 25 dB, K = 15


Corr (%)   λ = 0   λ = 0.5   λ = 0.6   λ = 0.9   λ = 0.99
K = 10     4.55    4.55      6.06      35.61     38.64

Table 5.27: 5 clusters, LLE, Speaker 1, SNR = 10 dB, d = 12

Corr (%)   λ = 0   λ = 0.5   λ = 0.6   λ = 0.9   λ = 0.99
K = 10     5.30    8.33      8.33      72.73     71.97

Table 5.28: 5 clusters, LLE, Speaker 1, SNR = 25 dB, d = 12

Corr (%)   λ = 0   λ = 0.5   λ = 0.6   λ = 0.9   λ = 0.99
K = 10     7.58    3.79      3.79      36.36     36.36

Table 5.29: 5 clusters, PCA, Speaker 1, SNR = 10 dB, d = 12

Corr (%)   λ = 0   λ = 0.5   λ = 0.6   λ = 0.9   λ = 0.99
K = 5      7.58    8.33      8.33      71.21     71.21

Table 5.30: 5 clusters, PCA, Speaker 1, SNR = 25 dB, d = 11


Chapter 6

Summary and future work

6.1 Conclusions

In this work, we have presented a new approach to noise reduction for speech recognition, called mapping based noise reduction. In this technique, the feature vectors extracted from noisy speech signals are mapped from a higher dimensional feature space to a lower dimensional subspace in which the clean speech feature vectors are also available. Afterwards, the noisy features are replaced by their nearest clean feature vectors in the lower dimensional space. The results obtained from our experiments, implemented in Octave, show that the recognition percentage can be raised by up to 6% for some speech signals when PCA is used to find the lower dimensional features. Nonetheless, it should be noted that in some cases this technique does not lead to a significant improvement in the outcome of the speech recognition system. One reason for such undesirable results might be the lack of training (clean) feature vectors. Another reason might be that the feature vector sequences become non-smooth after the mapping process. Consequently, this method does not yet yield promising outcomes in all cases.

6.2 Future work

Considering the results of our experiments, the mapping based noise reduction is effective when enough clean models in the lower dimensional space can be found. Accordingly, the current work can be improved to some extent by using more clean feature vectors for training the submanifold model of the clean speech vectors.


Bibliography

[1] LLE algorithm. Locally linear embedding. http://www.cs.nyu.edu/~roweis/lle/.

[2] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

[3] John Coleman. Phonetics course. http://www.phon.ox.ac.uk/jcoleman/mst_mphil_phonetics_course_index.html.

[4] Dick de Ridder and Robert P.W. Duin. Locally linear embedding for classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002(January):1–13, 2002.

[5] Andrew Errity and John McKenna. An investigation of manifold learning for speech analysis. School of Computing, Dublin City University, 2009(January):1–4, 2009.

[6] Sadaoki Furui. Digital Speech Processing, Synthesis and Recognition (2nd Edition). CRC Press, 2000.

[7] Christopher Kermorvant. A comparison of noise reduction techniques for robust speech recognition. 1999:1–16, 1999.

[8] John G. Proakis and Dimitris K. Manolakis. Digital Signal Processing: Principles, Algorithms and Applications (3rd Edition). Prentice Hall, 1995.

[9] Rabiner and Juang. Introduction to the physiology of speech and hearing. http://cobweb.ecn.purdue.edu/~ee649/notes/physiology.html.

[10] Giampiero Salvi. HTK tutorial. http://www.speech.kth.se/~matsb/speech_speaker_rec_course_2007/htk_tutorial.pdf.


[11] Sigurdur Sigurdsson, Kaare Brandt Petersen, and Tue Lehn-Schioler. Mel frequency cepstral coefficients: An evaluation of robustness of MP3 encoded music. Proceedings of the Seventh International Conference on Music Information Retrieval (ISMIR), 2006(January):1–4, 2006.

[12] Mark Stamp. A revealing introduction to hidden Markov models. (January):1–20, 2004.

[13] Kardi Teknomo. K-mean clustering tutorials. http://people.revoledu.com/kardi/tutorial/kMean/.

[14] Yale University. The source-filter model. http://www.haskins.yale.edu/featured/heads/mmsp/acoustic.html.

[15] Wikipedia. Hilbert space. http://en.wikipedia.org/wiki/Hilbert_space.

[16] Wikipedia. Mel scale. http://en.wikipedia.org/wiki/Mel_scale.

[17] Steve Young, Dan Kershaw, Julian Odell, Dave Ollason, Valtcho Valtchev, and Phil Woodland. The HTK Book (for HTK Version 3.4). Cambridge University, 2009.


Typeset September 4, 2010