MULTIPLE CLUSTERING BASED ON DEEP
NEURAL SUPPORT VECTOR MACHINE FOR
SPEECH RECOGNITION
PROJECT REPORT PHASE-II
Submitted by
VIDYA I
Register No: 14MAE018
In partial fulfilment of the requirements for the award of the degree
of
MASTER OF ENGINEERING
in APPLIED ELECTRONICS
Department of Electronics and Communication Engineering
KUMARAGURU COLLEGE OF TECHNOLOGY
(An Autonomous Institution affiliated to Anna University, Chennai)
COIMBATORE-641049
ANNA UNIVERSITY: CHENNAI 600 025
APRIL 2016
BONAFIDE CERTIFICATE
Certified that this project report titled "MULTIPLE CLUSTERING BASED ON
DEEP NEURAL SUPPORT VECTOR MACHINE FOR SPEECH
RECOGNITION" is the bonafide work of VIDYA I [Reg. No. 14MAE018], who
carried out the research under my supervision. Certified further that, to the best of my
knowledge, the work reported herein does not form part of any other project or
dissertation on the basis of which a degree or award was conferred on an earlier
occasion on this or any other candidate.
The candidate with university Register No. 14MAE018 was examined by us
in the project viva-voce examination held on ……………………
INTERNAL EXAMINER EXTERNAL EXAMINER
SIGNATURE
Mrs. S. KRITHIKA,
PROJECT SUPERVISOR
Department of ECE
Kumaraguru College of Technology
Coimbatore-641 049
SIGNATURE
Dr. A. VASUKI
HEAD OF THE DEPARTMENT
Department of ECE
Kumaraguru College of Technology
Coimbatore-641 049
ACKNOWLEDGEMENT
First, I would like to express my praise and gratitude to the Lord, who has
showered His grace and blessings, enabling me to complete this project in an
excellent manner.
I would like to express my sincere thanks to our beloved Principal,
Dr. R. S. Kumar, Ph.D., Kumaraguru College of Technology, who encouraged me
with his valuable thoughts.
I would like to thank Dr. A. Vasuki, Ph.D., Head of the Department of
Electronics and Communication Engineering, for her kind support and for
providing the necessary facilities to carry out the project work.
I am greatly privileged to express my deep sense of gratitude to the Project
Coordinator, Mrs. S. Umamaheswari, M.E., Associate Professor, for her continuous
support throughout the course.
In particular, I wish to thank and express my everlasting gratitude to my
Supervisor, Mrs. S. Krithika, M.E., Assistant Professor, for her expert counsel at
each and every step of the project work. I also wish to convey my deep sense of
gratitude to all teaching and non-teaching staff members of the ECE Department for
their help and cooperation.
Finally, I thank my parents and my dear friends for their moral support in
all my activities.
VIDYA I
ABSTRACT
A unique type of Deep Neural Network (DNN) is introduced which has achieved
significant performance gains in various tasks in large-scale Automatic Speech
Recognition (ASR). Traditional DNNs use multinomial logistic regression (softmax)
at the top layer for classification. Instead, the new DNN employs a Support Vector
Machine (SVM) at the top layer. Two training algorithms, at the frame level and at
the sequence level, are derived to jointly learn the parameters of the SVM and the
DNN under the maximum-margin criterion. With frame-level training, the new
model can be related to the multiclass SVM with DNN features; with sequence-level
training, it is related to the structured SVM with DNN features. This model is named
the DNSVM (Deep Neural Support Vector Machine).
TABLE OF CONTENTS

ABSTRACT
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS
1 INTRODUCTION
1.1 Types of Speech
1.1.1 Isolated Word
1.1.2 Connected Word
1.1.3 Continuous Word
1.1.4 Spontaneous Word
1.2 ASR System Classification
2 LITERATURE SURVEY
3 FEATURE EXTRACTION TECHNIQUE
3.1 Feature Extraction
3.2 Mel Frequency Cepstral Coefficient
3.2.1 Pre-Processing
3.2.2 Framing
3.2.3 Hamming Windowing
3.2.4 Fast Fourier Transform
3.2.5 Mel Filter Bank
3.2.6 Discrete Cosine Transform
4 CLASSIFICATION TECHNIQUES
4.1 Introduction
4.2 Deep Neural Network (DNN)
4.2.1 Structure of mDNN
4.2.2 Data Partition Based on State Clustering for mDNN
4.3 Parallel Training of mDNN
4.3.1 Frame-Level Cross Entropy Training of mDNN
4.3.2 Sequence Training of mDNN
4.4 Algorithm for Cluster-Based mDNN Method
4.4.1 Steps for Training
4.5 Support Vector Machine (SVM)
4.6 Deep Neural Support Vector Machine
4.6.1 Frame-Level Max-Margin Training
4.6.2 Sequence-Level Max-Margin Training
4.7 Performance Measures
4.8 Drawbacks of DNN
4.9 Advantages of DNSVM
5 SIMULATION RESULTS
5.1 Results and Discussion
5.2 Mean Square Error
5.3 Performance Measures
6 CONCLUSION AND FUTURE WORK
REFERENCES
PUBLICATION
LIST OF FIGURES

3.1 Block Diagram of Speech Recognition
3.2 Block Diagram of Mel Frequency Cepstral Coefficients
3.3 Mel Scale Filter Bank
4.1 Illustration of Using mDNN for Acoustic Modelling
4.2 Architecture of Deep Neural Support Vector Machine
4.3 SVM Hyperplane
5.1 Input Speech Signal
5.2 Plots of MFCC
5.3 Output Signal
LIST OF TABLES

5.1 Mean Square Error of Speech Signal
5.2 Performance Measures
LIST OF ABBREVIATIONS

ASR Automatic Speech Recognition
CE Cross Entropy
DNN Deep Neural Network
DNSVM Deep Neural Support Vector Machine
GMM Gaussian Mixture Model
HMM Hidden Markov Model
MFCC Mel Frequency Cepstral Coefficients
SVM Support Vector Machine
CHAPTER 1
INTRODUCTION
Speech is the primary mode of communication between human beings
and the most natural and efficient form of exchanging information. Speech
Recognition can be defined as the process of converting a speech signal into a
sequence of words by means of an algorithm implemented as a computer
program. Speech processing is one of the exciting areas of signal processing.
The goal of the Speech Recognition field is to develop techniques and
systems for speech input to machines. Based on major advances in statistical
modelling of speech, Automatic Speech Recognition (ASR) today finds
widespread application in tasks that require a human-machine interface, such
as automatic call processing. Since the 1960s, computer scientists have been
researching ways and means to make computers able to record, interpret and
understand human speech. Throughout the decades, this has been a daunting
task. Even the most rudimentary problem, such as digitizing (sampling) voice,
was a huge challenge in the early years. It took until the 1980s before the first
systems arrived which could actually decipher speech. Of course, these early
systems were very limited in scope and power.
The purpose of Speech Recognition is to formulate a method of providing
related services to users. Recent advances in Speech Recognition technology,
coupled with the advent of modern operating systems and high-powered,
affordable personal computers, have culminated in the first Speech Recognition
systems that can be deployed to a wide community of users.

Communication among human beings is dominated by spoken language;
therefore, it is natural for people to expect speech interfaces to a computer.
Machine recognition of speech involves generating the sequence of words that
best matches the given speech signal. Some well-known applications include
virtual reality, multimedia search, auto-attendants, travel information and
reservation, translators, natural language understanding and many more.
1.1 Types of Speech Recognition
Speech Recognition systems can be separated into different classes by
describing what types of utterances they can recognize. These classes are based
on the fact that one of the difficulties of ASR is the ability to determine when a
speaker starts and finishes an utterance.
1.1.1 Isolated Word
An isolated word recognizer usually requires each utterance to have quiet on
both sides of the sample window. It accepts single words or single utterances at a
time. Often these systems have "Listen" and "Not-Listen" states. "Isolated
utterance" might be a better name for this class.
1.1.2 Connected Word
A connected word system is similar to an isolated word system, but it allows
separate utterances to be run together with a minimal pause between them.
1.1.3 Continuous Word
Continuous speech recognizers allow users to speak almost naturally,
while the computer determines the content. Recognizers with continuous speech
capabilities are some of the most difficult to create because they must utilize
special methods to determine utterance boundaries. A continuous speech system
operates on speech in which words are connected together, i.e. not separated by
pauses. Continuous speech is more difficult to handle because of a variety of
effects. First, it is difficult to find the start and end points of words. Another
problem is "co-articulation": the production of each phoneme is affected by the
surrounding phonemes, and the start and end of words are affected by the
preceding and following words.
1.1.4 Spontaneous Word
At a basic level, spontaneous speech can be thought of as speech that is
natural sounding and not rehearsed. An ASR system with spontaneous speech
ability should be able to handle a variety of natural speech features, such as
words being run together.
1.2 ASR System Classification
Speech Recognition is a special case of pattern recognition. There are two
phases in supervised pattern recognition, viz., training and testing. The process
of extracting features relevant for classification is common to both phases
[10]. During the training phase, the parameters of the classification model are
estimated using a large number of class examples (training data). During the
testing or recognition phase, the features of a test pattern (test speech data) are
matched with the trained model of each and every class.
CHAPTER 2
LITERATURE SURVEY
2.1 Pan Zhou, Hui Jiang, Senior Member, IEEE, Li-Rong Dai, Yu Hu,
and Qing-Feng Liu “State-Clustering Based Multiple Deep Neural
Networks Modelling Approach for Speech Recognition”, IEEE/ACM
Transactions On Audio, Speech, And Language Processing, Vol. 23,
no. 4, April 2015.
A new cluster-based multiple-DNN method for acoustic modelling
in LVCSR (Large Vocabulary Continuous Speech Recognition) is
proposed. The new modelling method yields recognition performance
comparable to the regular DNN method, but the multiple DNNs can be
efficiently trained in parallel using multiple GPUs, which leads to a
dramatic speedup in training. The multiple DNNs also achieve a lower
recognition error rate.
2.2 Pratik. K. Kurzekar, Rathnadeep. R. Deshmukh, Pukhraj. P.
Shrishrimal “A Comparative Study of Feature Extraction
Techniques For Speech Recognition System,” International Journal
Of Innovative Research In Science, Engineering Technology. Vol. 3,
Dec.2014.
Mel Frequency Cepstral Coefficients (MFCC) are the most commonly
used feature extraction technique in speech recognition systems. The
reason MFCC is most commonly used for feature extraction is that it is
closest to actual human auditory speech perception. Some researchers
have proposed modifications to the basic MFCC algorithm to improve
robustness.
2.3 Y. Tang, “Deep learning using linear support vector machines,” in
International Conference on Machine Learning, December 2013.
Deep learning using linear support vector machines (DLSVM)
works better than softmax on two standard data sets and a recent dataset.
Switching from softmax to SVMs is remarkably simple and appears to be
useful for classification tasks. The paper demonstrates a small but
consistent advantage of replacing the softmax layer with a linear support
vector machine: learning minimizes a margin-based loss instead of the
cross-entropy loss. While there have been various combinations of neural
nets and SVMs, replacing softmax with linear SVMs gives significant
gains on popular deep learning data sets.
2.4 O. Abdel-Hamid, A.-R. Mohamed, H. Jiang, L. Deng, G. Penn, and
D. Yu, “Convolutional Neural Networks For Speech Recognition,”
IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 22, no. 10,
pp.1533–1545, Oct. 2014.
A Convolutional Neural Network (CNN) for speech recognition
with a reduced error rate is proposed. The paper explains the use of a
basic CNN for speech recognition and proposes a limited-weight-sharing
scheme that can better model speech features. The special structure of
CNNs, such as local connectivity, weight sharing, and pooling, exhibits
some degree of invariance to small shifts of speech features along the
frequency axis, which is important for dealing with speaker and
environment variations. The major advantage of the CNN technique is
that the same filter is used at each position in a layer, so memory size is
reduced and performance is improved.
2.5 S. Zhang, C. Zhang, Z. You, R. Zheng, and B. Xu, “Asynchronous
Stochastic Gradient Descent for DNN training,” in Proc. IEEE Int.
Conf. Acoust., Speech, Signal Process. (ICASSP), 2013.
Asynchronous SGD (Stochastic Gradient Descent) is implemented
for training DNNs using a number of GPUs. In contrast, another possible
parallelization strategy is "model parallelism", i.e., splitting the model
into several parts, each of which is computed by one computing unit. The
advantages of SGD are efficiency and ease of implementation. Its
limitations are that it requires a number of hyperparameters, such as the
regularization parameter and the number of iterations, and that it is
sensitive to feature scaling.
2.6 X. Chen, A. Eversole, G. Li, D. Yu, and F. Seide, “Pipelined Back
Propagation for Context-Dependent Deep Neural Networks,” in
Proc. Interspeech, 2012.
A pipelined implementation of Back Propagation (BP) is used for
parallel training of DNNs on multiple GPUs, where the computation
related to different layers of a DNN is distributed over several GPUs. The
limitation of this pipelined method is how to balance the computation
load among the different GPUs, particularly when the output layer
dominates the computation.
2.7 Q. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. Corrado, J.
Dean, and A. Ng, "Building High-Level Features Using Large Scale
Unsupervised Learning," in Proc. ICML, 2012.
Large-scale unsupervised learning is used for training on a cluster of
thousands of CPUs, implementing both model parallelism and
asynchronous SGD. However, no matter whether data partitioning or
model parallelism is implemented, all of the above-mentioned parallel
training methods suffer from a communication overhead problem when
collecting gradients, redistributing updated model parameters over
different units, and delivering model outputs to another unit. This cross-
unit communication is bound to become the major performance
bottleneck, especially when parallel training is scaled up to a large
number of computing units. Asynchronous SGD is used to speed up
the neural network training.
2.8 G. E. Dahl, D. Yu, L. Deng, and A. Acero, “Context-Dependent
Pretrained Deep Neural Networks For Large Vocabulary Speech
Recognition,” IEEE Trans. Audio, Speech, Lang. Process., vol. 20,
no.1, pp. 30–42, Jan. 2012.
A context-dependent (CD) pre-trained deep neural network for
Large Vocabulary Speech Recognition (LVSR) is proposed that leverages
recent advances in using deep belief networks for phone recognition. The
paper describes a pre-trained Deep Neural Network Hidden Markov
Model (DNN-HMM) hybrid architecture that trains the DNN to produce
a distribution over senones as its output. The deep belief network pre-
training algorithm is a robust and often helpful way to initialize deep
neural networks generatively, which can aid optimization and reduce
generalization error. The paper also describes the procedure for applying
CD-DNN-HMMs to LVSR and analyses the effects of various modelling
choices on performance.
2.9 D. Yu, F. Seide, G. Li, and L. Deng, “Exploiting Sparseness In Deep
Neural Networks For Large Vocabulary Speech Recognition,” in
Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP),
2012.
Exploiting sparseness in deep neural networks for large vocabulary
speech recognition shows that there is no performance loss when zeroing
out 80% of the small weights in a large DNN model. This method is good
for reducing the total DNN model size, but it gives no gain in training
speed due to the highly random memory accesses introduced by sparse
matrices.
2.10 G. Dahl, D. Yu, L. Deng, and A. Acero, “Large Vocabulary
Continuous Speech Recognition with Context-Dependent DBN-
HMMs,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process.
(ICASSP), 2011, pp. 4688–4691.
Large Vocabulary Continuous Speech Recognition with context-
dependent Deep Belief Network Hidden Markov Models (DBN-HMMs)
dramatically outperforms strong Gaussian Mixture Model Hidden
Markov Model (GMM-HMM) baselines on a challenging, large-
vocabulary, spontaneous speech recognition dataset from the Bing mobile
voice search task. Although DBN-HMMs provide dramatic improvements
in recognition accuracy, training DBN-HMMs is quite expensive
compared with training GMM-HMMs, primarily because training the
former is not easy to parallelize across computers and needs to be carried
out on a single GPU.
2.11 J. Park, F. Diehl, M. Gales, M. Tomalin, and P. Woodland,
“Efficient Generation and Use of MLP Features For Arabic Speech
Recognition,” in Proc. Interspeech, 2009.
The efficient generation and use of MLP features for Arabic speech
recognition divides the training set into N disjoint subsets in each epoch,
and a separate MLP (Multi-Layer Perceptron) is trained on each subset.
These sub-MLPs are then combined by a merger network that is trained
on another subset of data. An advantage of MLPs is the ability to classify
an unknown pattern together with other known patterns that share the
same distinguishing features; neural networks are also highly fault
tolerant. Disadvantages of the MLP technique are a computationally
expensive learning process (i.e., a large number of iterations is required
for learning), no guaranteed solution, and the presence of a scaling
problem.
CHAPTER 3
FEATURE EXTRACTION TECHNIQUES
3.1 Feature Extraction
Speech feature extraction in a categorization problem is about
reducing the dimensionality of the input vector while maintaining the
discriminating power of the signal. From the fundamental formulation of
speaker identification and verification systems [17], the number of training
and test vectors needed for the classification problem grows with the
dimension of the given input, so feature extraction of the speech signal is
required.
Fig. 3.1 Block diagram of Speech Recognition
The input signal is a speech signal in .wav format. The features of the
speech signal are extracted using Mel Frequency Cepstral Coefficients
(MFCC). Finally, the features are used to classify the signals as recognized
or not recognized using a Deep Neural Support Vector Machine (DNSVM).
(Blocks in Fig. 3.1: Input Signal (Speech) → Feature Extraction → Classification → Performance Measures.)
3.2 Mel Frequency Cepstral Coefficients (MFCC)
The first stage in the speech recognition process is feature extraction. The
use of Mel Frequency Cepstral Coefficients can be considered one of the
standard methods for feature extraction. The use of about 20 MFCC
coefficients is common in ASR, although 10-12 coefficients are often
considered sufficient for coding speech [18]. The most notable downside of
MFCC is its sensitivity to noise, due to its dependence on the spectral form.
Methods that utilize information in the periodicity of speech signals could be
used to overcome this problem, although speech also contains aperiodic
content.

MFCCs are the coefficients that together represent the short-term power
spectrum of a sound, based on a linear cosine transform of a log power
spectrum on a nonlinear Mel frequency scale [20]. MFCC is used to extract
feature vectors from the sound wave. The MFCC algorithm is based on human
hearing perception and uses a Mel-scale filter bank.

First, the voice data are divided into frames, and each frame is windowed
using a Hamming window. Second, each frame is converted to the frequency
domain using a short-time Fourier Transform. Third, a certain number of
sub-band energies are calculated using a Mel filter bank, which is a
nonlinear-scale filter bank that imitates the human aural system [19]. Fourth,
the logarithm of the sub-band energies is calculated. Finally, the MFCC is
computed by an inverse Fourier Transform (in practice, a Discrete Cosine
Transform, as described in Sec. 3.2.6).
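As an illustrative sketch only (not the implementation used in this work), assuming Python with the librosa library, a 16 kHz sampling rate and a placeholder file name, the whole pipeline above can be reproduced in a few lines; the step-by-step sketches in the following subsections show what such a routine computes internally:

    import librosa

    # "input.wav" and the 16 kHz rate are placeholder assumptions.
    signal, sr = librosa.load("input.wav", sr=16000)
    # 12 cepstral coefficients per frame, matching the setting used in Chapter 5.
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=12)  # shape: (12, n_frames)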
3.2.1 Pre-Processing
This step passes the signal through a filter which emphasizes higher
frequencies, increasing the energy of the signal at higher frequencies:
Y[n] = X[n] − 0.95 X[n − 1] (3.1)
Here a = 0.95, which makes 95% of any one sample presumed to originate
from the previous sample.
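A minimal sketch of this filter in Python with NumPy, assuming the signal has already been loaded as a one-dimensional array:

    import numpy as np

    def pre_emphasis(x, a=0.95):
        # y[n] = x[n] - a * x[n-1], Eq. (3.1); the first sample is kept unchanged.
        return np.append(x[0], x[1:] - a * x[:-1])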
Fig. 3.2. Block diagram of Mel Frequency Cepstral Coefficients (input speech signal → pre-processing → framing → Hamming window → Fast Fourier Transform → Mel filter bank → Discrete Cosine Transform → MFCC features)
3.2.2 Framing
Framing is the process of segmenting the speech samples obtained from
Analog-to-Digital Conversion (ADC) into small frames with a length in the
range of 20 to 40 ms. The voice signal is divided into frames of N samples,
with adjacent frames separated by M samples (M < N).
3.2.3 Hamming Windowing
A Hamming window is used as the window shape, considering the next block
in the feature extraction processing chain; it integrates all the closest
frequency lines. The Hamming window is defined as W(n), 0 ≤ n ≤ N−1.
The Hamming window equation is given as:
Y(n) = X(n) W(n) (3.2)
W(n) = 0.54 − 0.46 cos(2πn / (N − 1)), 0 ≤ n ≤ N − 1 (3.3)
where
N = number of samples in each frame
Y(n) = output signal
X(n) = input signal
W(n) = Hamming window
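A sketch of framing and windowing with NumPy, assuming a 16 kHz signal with 25 ms frames shifted by 10 ms (N = 400 and M = 160 samples; both values are assumptions within the 20-40 ms range given above):

    import numpy as np

    def frame_and_window(x, frame_len=400, hop=160):
        # Slice the signal into overlapping frames of N samples shifted by M (Sec. 3.2.2).
        n_frames = 1 + (len(x) - frame_len) // hop
        idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
        frames = x[idx]
        # Hamming window of Eq. (3.3), applied frame by frame as in Eq. (3.2).
        n = np.arange(frame_len)
        w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (frame_len - 1))
        return frames * w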
3.2.4 Fast Fourier Transform (FFT)
The Fast Fourier Transform is used to convert each frame of N samples
from the time domain into the frequency domain. The FFT is an efficient
algorithm for the Discrete Fourier Transform (DFT). Usually, the FFT is
performed to obtain the magnitude frequency response of each frame. Since
convolution in the time domain corresponds to multiplication in the frequency
domain:
Y(ω) = FFT[H(t) ∗ X(t)] = H(ω) X(ω) (3.4)
where X(ω), H(ω) and Y(ω) are the Fourier Transforms of X(t), H(t) and Y(t),
respectively.
3.2.5 Mel Filter Bank
The frequency range of the FFT spectrum is very wide, and the voice signal
does not follow a linear scale. The bank of filters according to the Mel scale is
shown in Fig. 3.3: a set of triangular filters is used to compute a weighted sum
of spectral components so that the output of the process approximates a Mel
scale. Each filter's magnitude frequency response is triangular in shape, equal
to unity at the centre frequency and decreasing linearly to zero at the centre
frequencies of the two adjacent filters. Each filter output is then the sum of its
filtered spectral components.
Fig.3.3 Mel Scale Filter Bank
After that, the following equation is used to compute the Mel value F(Mel)
for a given frequency f in Hz:
F(Mel) = 2595 · log10(1 + f / 700) (3.5)
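The mapping of Eq. (3.5) and the triangular filters of Fig. 3.3 can be sketched as follows (the FFT size, filter count and sampling rate are assumptions):

    import numpy as np

    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)      # Eq. (3.5)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)    # inverse of Eq. (3.5)

    def mel_filter_bank(n_filters=26, n_fft=512, sr=16000):
        # Centre frequencies are equally spaced on the Mel scale.
        mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
        bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
        fbank = np.zeros((n_filters, n_fft // 2 + 1))
        for i in range(1, n_filters + 1):
            l, c, r = bins[i - 1], bins[i], bins[i + 1]
            fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rise to unity
            fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # fall to zero
        return fbank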
3.2.6 Discrete Cosine Transform (DCT)
In this last stage, the Mel Frequency Cepstral Coefficients are obtained.
The DCT is applied to the log energies obtained from the triangular band-pass
filters to yield L Mel-scale cepstral coefficients. The formula for the DCT is
given by
C(l) = Σ_{n=1}^{N} E(n) cos( l (n − 0.5) π / N ), l = 1, 2, …, L (3.6)
where
N is the number of triangular band-pass filters,
L is the number of Mel-scale cepstral coefficients,
E(n) is the log sub-band energy of the n-th filter.
The DCT transforms the frequency domain into a time-like domain called the
quefrency domain. The obtained features are referred to as the Mel-Scale
Cepstral Coefficients or MFCC.
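Putting Secs. 3.2.4-3.2.6 together, a hedged sketch of the final stages (magnitude spectrum, Mel sub-band energies, logarithm and DCT), reusing the windowed frames and filter bank from the earlier sketches:

    import numpy as np
    from scipy.fftpack import dct

    def mfcc_from_frames(windowed_frames, fbank, n_ceps=12):
        n_fft = 2 * (fbank.shape[1] - 1)
        spectrum = np.abs(np.fft.rfft(windowed_frames, n_fft))   # FFT, Sec. 3.2.4
        energies = (spectrum ** 2) @ fbank.T                     # Mel sub-band energies
        log_e = np.log(np.maximum(energies, 1e-10))              # log, guarding log(0)
        return dct(log_e, type=2, axis=1, norm='ortho')[:, :n_ceps]  # DCT, Eq. (3.6)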
CHAPTER 4
CLASSIFICATION TECHNIQUES
4.1 INTRODUCTION
In machine learning, classification is the problem of identifying to which
set of categories a new observation belongs, on the basis of a training set of data
containing observations whose category membership is known.
Two types of classification techniques are used to classify the signals:
• Deep Neural Network
• Deep Neural Support Vector Machine
4.2 DEEP NEURAL NETWORK (DNN)
A Deep Neural Network (DNN) is an Artificial Neural Network (ANN)
with multiple hidden layers of units between the input and output layers. Here,
multiple DNNs (mDNN) are used. The hybrid Deep Neural Network-Hidden
Markov Model (DNN-HMM) architecture models the temporal nature of speech
with an HMM and uses a DNN to replace the Gaussian Mixture Models (GMMs)
in computing posterior probabilities of tied HMM states, which can be converted
to scaled likelihoods for Viterbi decoding. In large vocabulary ASR tasks,
thousands of tied HMM states are common. This results in an extremely large
output weight matrix that significantly slows down the back-propagation
process in DNN training.
Here, state-clustered multiple DNNs (mDNN) are used for acoustic modelling
instead of a single large DNN for speech recognition [1]. The HMM state
posterior probability distribution can be estimated by the multiple DNNs.
The multiple DNNs can be trained independently through a combination of
"data partition" and "model parallelism", allowing full parallelization.
4.2.1 Structure of mDNN
Fig. 4.1 illustrates how to use multiple DNNs (mDNN) for acoustic
modelling in ASR. If the whole training data is partitioned into disjoint subsets,
then there are no common state labels among these subsets. In this case, several
hierarchically structured DNNs can be trained [7]. First of all, a small DNN
(with 3 hidden layers) is trained to distinguish the different clusters in the
training data, i.e., to compute the posterior probability of each cluster given an
input. Since this is a very small Neural Network (NN), containing 3 hidden
layers and a small number of output nodes, its training is very fast, even though
it needs to access the entire training set along with the corresponding cluster
labels.
Fig 4.1. Illustration of using multiple DNNs for Acoustic Modelling.
All clustered subsets of the training data are used to train multiple DNNs
to classify the different states within each cluster, i.e., to compute the posterior
probabilities Pr(s | c_r, X) of all tied states within each cluster, where s denotes
any HMM state within cluster c_r.
Since each DNN is trained only on a subset of the training data, for certain
training criteria (such as frame-level Cross-Entropy training), all DNNs can
be separately trained in parallel using different GPUs without transferring any
data or gradients among the GPUs during the entire training process [11]. Each
DNN is much faster to train than the large joint DNN in the normal DNN-HMM
method, because it only involves the fraction of training data belonging to its
cluster, and each DNN is much smaller in size because it has fewer output
classes and probably fewer nodes in each hidden layer.
As for decoding, each observation sample, X, is fed into all of the
estimated DNNs to compute Pr(c_r | X) and Pr(s | c_r, X), as illustrated in
Fig. 4.1. In general, the posterior probability of any tied state s belonging to
cluster c_r is calculated as follows:

Pr(s | X) = Pr(s | c_r, X) · Pr(c_r | X) (4.1)

Assume all clusters are disjoint; the above equation is always used for the
following derivation unless stated explicitly.

The above posterior probabilities, Pr(s | X), are used for decoding in the
same way as in the normal hybrid DNN-HMM model. An implication of the
above factorization is that all softmax operations in all DNNs must be explicitly
calculated to derive the required posterior probabilities for decoding, which may
lead to a very small overhead in the test stage.
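A sketch of this decoding-time combination, assuming hypothetical callables top_nn and cluster_dnns[r] that return the cluster posteriors and the within-cluster state posteriors respectively:

    import numpy as np

    def mdnn_state_posteriors(x, top_nn, cluster_dnns):
        # Eq. (4.1): Pr(s | X) = Pr(s | c_r, X) * Pr(c_r | X).
        cluster_post = top_nn(x)                   # Pr(c_r | X) for each cluster r
        return np.concatenate(
            [cluster_dnns[r](x) * cluster_post[r]  # scale each cluster's state posteriors
             for r in range(len(cluster_dnns))])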
The proposed mDNN method uses a different way of factorizing
posterior probabilities, based on a hierarchy of output labels and automatically
generated clusters [6]. The purpose of this factorization is to decouple the DNN
computation into several smaller independent models for effective parallel
training. A similar hierarchical factorization has been applied to neural network
language models to reduce the huge weight matrix in the output layer caused by
a large vocabulary size.
4.2.2 Data Partition Based on State Clustering for mDNN
The first step in the mDNN modelling framework is to partition the whole
training data into multiple disjoint subsets. A data-driven unsupervised
clustering method is used to group the training data into several clusters [16].
In this way, data samples belonging to different clusters tend to be less similar,
so that the different clusters can be easily distinguished by the small and shallow
neural network at the top level. Moreover, data partitioning is performed at the
state level, not at the utterance level.

State clustering aims to divide the entire training dataset into several
subsets that contain no common state labels, so that mDNN training can be
conducted in a fully parallel fashion [9]. Furthermore, a GMM is chosen to
model each cluster during the data-driven clustering process, since a GMM can
be efficiently estimated even from a large amount of data. This approach is
called Gaussian Mixture Model (GMM) based state clustering.
Initially, a baseline GMM-HMM system consisting of tied HMM states is
built, and all training data are force-aligned using word-level transcriptions to
generate state labels for all speech frames in the training set [3]. Assume that the
training data are to be split into C clusters; a different GMM must then be
estimated for each of these clusters. There are several different methods to
perform data clustering, either top-down or bottom-up. The simplest way is to
randomly select C states as the starting point to initialize the different GMMs
using the data from these states. The remaining states are then classified into
one of these clusters, one by one, based on the total likelihood value of each
state calculated from the GMMs. In this process, all data belonging to one HMM
state are viewed as one indivisible unit when assigning them to the different
clusters.

Finally, all GMMs are re-estimated based on the new data partition. This
process is repeated until all clusters converge. The initial selection of states for
GMM estimation may not be very robust for clustering [8]. A more stable
clustering method is to use the top-down k-means method to automatically
group the training data into clusters for GMM estimation. In this case, the
number of clusters is gradually grown from one at the beginning until it reaches
the desired cluster number [13]. Each cluster is modelled by one GMM, and all
data belonging to one HMM state are assigned to a cluster as one indivisible
unit based on the total likelihood value calculated by the current GMMs. Once
the data are re-assigned, all GMMs are re-estimated. This state clustering
process is very similar to standard speaker clustering for ASR, except that all
training samples are clustered based on the HMM state labels rather than the
speaker labels.
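A simplified sketch of this GMM-based state clustering, assuming scikit-learn, random initialization, a fixed iteration count and 4-component GMMs (all assumptions); each tied state's aligned frames are treated as one indivisible unit:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def cluster_states(state_frames, n_clusters, n_iter=10):
        # state_frames: dict mapping tied-state id -> (n_i, d) array of aligned frames.
        rng = np.random.default_rng(0)
        seeds = rng.choice(list(state_frames), size=n_clusters, replace=False)
        gmms = [GaussianMixture(n_components=4).fit(state_frames[s]) for s in seeds]
        assign = {}
        for _ in range(n_iter):
            # Assign every state, as a whole unit, to the most likely cluster GMM.
            for s, frames in state_frames.items():
                assign[s] = int(np.argmax([g.score(frames) for g in gmms]))
            # Re-estimate each cluster's GMM on its newly assigned data.
            for r in range(n_clusters):
                data = [state_frames[s] for s in assign if assign[s] == r]
                if data:
                    gmms[r].fit(np.vstack(data))
        return assign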
4.3 Parallel Training of Multiple DNNs
4.3.1 Frame-level Cross Entropy Training of mDNN
For multiple DNNs (mDNN), consider the same frame-level Cross
Entropy (CE) objective function and derive the error signals in a similar way as
for the conventional DNN [9]. Assume that the clusters are disjoint and do not
contain common output labels. In this case, the mDNN output is computed as a
product of two terms, one from the top-level NN and the other from a lower-
level DNN [5]. By construction, the component networks of the mDNN can be
trained independently under the frame-level CE training criterion. In this case,
given an input feature vector along with its target label (assuming the target
label belongs to cluster c_r), consider how to compute the error signals at the
output layer for CE training of an mDNN.
For the top-level NN, the error signals at the output layer can be obtained as
follows. Let c_r denote the cluster of the target label; then

e(c_k) = ∂F_CE / ∂a(c_k) = −∂ log Pr(c_r | X) / ∂a(c_k) (4.2)

e(c_k) = Pr(c_k | X) − δ(c_k, c_r) (4.3)

where δ(·,·) is the Dirac (Kronecker) delta function:

δ(c_k, c_r) = 1 if c_k = c_r, and 0 otherwise (4.4)

Furthermore, the above error signals can be back-propagated to derive the
error signals in all other layers. Similarly, for the lower-level DNN of cluster c_r
with target state label s_j,

e(s_i) = ∂F_CE / ∂a(s_i) = −∂ log Pr(s_j | c_r, X) / ∂a(s_i) (4.5)

e(s_i) = Pr(s_i | c_r, X) − δ(s_i, s_j) (4.6)
Each training data sample makes non-zero updates only to the one DNN
containing its target class label; the gradients of all other DNNs remain zero [2].
Therefore, instead of feeding all training data to all parallel DNNs to perform
cross-entropy training, the whole training set is divided into different clusters
based on the class labels, and each subset of training data is fed only to its own
DNN.

For frame-level CE training, each DNN can be trained independently on its
own data without involving any communication traffic [10]. After splitting the
training data into different clusters, each DNN of the mDNN can be trained
independently with standard Back Propagation (BP) using its own data and the
corresponding labels [15]. This leads to a maximum degree of parallelism.
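Both Eqs. (4.3) and (4.6) say that the CE error signal at a softmax output layer is the posterior minus the one-hot target indicator; a minimal sketch:

    import numpy as np

    def ce_error_signal(posteriors, target):
        # e(k) = Pr(k | X) - delta(k, target), as in Eqs. (4.3) and (4.6);
        # posteriors is a 1-D NumPy array over the output classes.
        e = posteriors.copy()
        e[target] -= 1.0
        return e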
4.3.2 Sequence Training of mDNN
As opposed to the frame-level training criterion, sequence training of an
mDNN is no longer independent among all its parallel DNNs, since it needs to
access all state posteriors to process word graphs [4]. The error signals and the
related sequence training procedure for multiple Deep Neural Networks
(mDNN) are derived as follows. For a sequence-level objective function F, the
error signal related to any one HMM tied state s in the output layer is
computed as

e(s) = ∂F / ∂a(s) = Σ_{s'} ( ∂F / ∂ log Pr(s' | X) ) · ( ∂ log Pr(s' | X) / ∂a(s) ) (4.7)

where the sum runs over all HMM tied states in the output layer. In the mDNN,
since the output posterior Pr(s | X) is a product of two terms as in Eq. (4.1), the
partial derivative of its logarithm with respect to any activation a(c_k) in the
top-level NN is computed as

∂ log Pr(s | X) / ∂a(c_k) = ∂[ log Pr(s | c(s), X) + log Pr(c(s) | X) ] / ∂a(c_k)
= δ(c(s), c_k) − Pr(c_k | X) (4.8)

where c(s) stands for the cluster containing HMM state s.

For each lower-level DNN in the mDNN, say the one for cluster c_r, the
error signals in the output layer are calculated as

e(s_j) = ∂F / ∂ log Pr(s_j | X) − Pr(s_j | c_r, X) Σ_{s' ∈ c_r} ∂F / ∂ log Pr(s' | X) (4.9)

The above error signals at the output layer of a lower-level DNN contain two
terms, and the second term is no longer equal to zero [14], since it is summed
only over the subset of state labels within the cluster rather than over all states
in the model. These error signals can be back-propagated in the same way as in
regular Back Propagation (BP) to derive the error signals for the other layers in
all parallel DNNs. In this case, methods to prune and compress statistics and
gradients [12] may be used to improve training efficiency.

The underlying partial derivative with respect to a lower-level activation
a(s_j) of cluster c_r is given as

∂ log Pr(s | X) / ∂a(s_j) = [ δ(s, s_j) − Pr(s_j | c_r, X) ] · δ(c(s), c_r) (4.10)

In the same way, these error signals are back-propagated to derive the error
signals in all other layers.
4.4 Algorithm for Cluster-Based mDNN Method
State-clustered multiple DNNs (mDNN) are used for acoustic modelling in
place of a single DNN for speech recognition. The clustered subsets of training
data are used to train multiple DNNs to classify the different states within each
cluster [15]. The HMM state posterior probability distribution can be estimated
by the multiple DNNs, and the multiple DNNs can be trained independently.
The main steps involved in the mDNN training procedure are summarized
in the following algorithm; a small code sketch of the data routing follows the
steps.
4.4.1 Steps for Training
1. Train a MFCC baseline system with N tied states, denoted as mfcc.
2. Use MFCC to generate state level alignments of all training data.
3. Cluster all training data belonging to N tied states into several disjoint
sets.
4. Generate a mapping from each tied state to the cluster label to which it
belongs, denoting this mapping as state two cluster state.
5. Use the entire training set to train a small NN.
6. Use all clustered subsets of training data to train multiple smaller DNNs
denoted as dnn, 1≤ r ≤ C.
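A small runnable sketch of the data routing behind steps 3, 4 and 6 (the state2cluster mapping could come from a clustering procedure such as the sketch in Sec. 4.2.2; all names here are illustrative):

    def split_by_cluster(frames, labels, state2cluster, n_clusters):
        # Route each (frame, tied-state label) pair to the subset of the cluster
        # its label belongs to; subset r then trains dnn_r (step 6), while the
        # cluster ids themselves are the targets for the small NN (step 5).
        subsets = [([], []) for _ in range(n_clusters)]
        for x, s in zip(frames, labels):
            xs, ys = subsets[state2cluster[s]]
            xs.append(x)
            ys.append(s)
        return subsets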
4.5 Support Vector Machine (SVM)
A Support Vector Machine (SVM) is a supervised machine learning
algorithm which can be used for both classification and regression challenges.
It is mostly used in classification problems, using linear and non-linear
hyperplanes to classify data. It is basically a binary nonlinear classifier capable
of deciding whether an input vector x belongs to class A or class B.

For a given set of separable data, the goal is to find the optimal decision
function. This is done by maximizing the margin, the distance between the
closest samples and the decision boundary. Classification is performed by
constructing hyperplanes in a multidimensional space that separate the different
class labels, based on statistical learning theory. SVMs are applied in various
fields due to features such as (a) high accuracy and flexibility and (b) the
capacity to accommodate a large number of attributes.
4.6 Deep Neural Support Vector Machine
Most DNNs use multinomial logistic regression, also known as the softmax
activation function, at the top layer for classification. Specifically, given the
observation o_t at frame t, let h(o_t) be the output vector of the top hidden layer
in the DNN; the output of the DNN for state s can be expressed as

Pr(s | o_t) = exp( w_s^T h(o_t) ) / Σ_{s'=1}^{N} exp( w_{s'}^T h(o_t) ) (4.11)

where w_s are the weights connecting the last hidden layer to the output state s,
and N is the number of states. For example, in frame classification, given an
observation o_t, the corresponding state can be inferred by

s* = arg max_s Pr(s | o_t) (4.12)

For the multiclass SVM [8], the classification function is

s* = arg max_s w_s^T φ(o_t) (4.13)

where φ(o_t) is the predefined feature space and w_s is the weight parameter for
class/state s. Two algorithms, at the frame and sequence level, are proposed to
estimate the parameters of the SVM (in the last layer) and to update the
parameters of the DNN (in all previous layers) using maximum-margin criteria.
The resulting model is named the Deep Neural SVM (DNSVM). Its architecture
is illustrated in Fig. 4.2.
Fig.4.2. Architecture of Deep Neural Support Vector Machine
The double-headed arrows in the figure illustrate the scope of the parameters
for the DNN, the multiclass SVM and the structured SVM. For sequence-level
max-margin training, the dark straight arrows (in the trellis) represent the
reference state sequence, and the dashed arrows represent the most competing
state sequence.
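A small sketch contrasting the two decision rules (4.12) and (4.13): with shared features h(o_t), the arg max is the same for both, so the two models differ in how the weights are trained, not in the form of the decision:

    import numpy as np

    def classify_softmax(h, W):
        # Eq. (4.12): arg max of the softmax posteriors; W holds one row w_s per state.
        scores = W @ h
        post = np.exp(scores - scores.max())
        post /= post.sum()
        return int(np.argmax(post))

    def classify_svm(h, W):
        # Eq. (4.13): arg max of the linear scores w_s^T phi(o_t), with phi(o_t) = h(o_t).
        return int(np.argmax(W @ h))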
4.6.1 Frame-Level Max-Margin Training
The training observations and their corresponding state labels are
{(o_t, s_t)}, t = 1, …, T, where s_t ∈ {1, …, N}. In frame-level training, the
parameters of the DNN are normally estimated by minimizing the Cross Entropy
(CE). Taking φ(o_t) = h(o_t) as the feature space derived from the DNN, the
parameters of the last layer are first estimated using the multiclass SVM training
algorithm [21],

min (1/2) Σ_s ||w_s||² + C Σ_t ξ_t²
s.t. w_{s_t}^T φ(o_t) − max_{s ≠ s_t} w_s^T φ(o_t) ≥ 1 − ξ_t,
for every training frame t = 1, …, T (4.14)

where ξ_t ≥ 0 is the slack variable which penalizes the data points that violate
the margin requirement. The only difference from the standard formulation
comes from the constraints, which basically say that the score of the correct
state label, w_{s_t}^T φ(o_t), has to be greater than the score of any other state,
w_s^T φ(o_t), by a margin determined by the loss. According to [16], using the
squared slacks ξ_t² is slightly better than ξ_t. If the correct score,
w_{s_t}^T φ(o_t), is greater than all the competing scores, then it must be greater
than the most competing score, max_{s ≠ s_t} w_s^T φ(o_t).
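For illustration only, the frame-level step can be approximated with scikit-learn's Crammer-Singer multiclass SVM [21] trained on exported DNN features; the feature and label files are assumptions, and the squared-slack variant of Eq. (4.14) is not exposed there, so this is an approximation rather than the exact objective:

    import numpy as np
    from sklearn.svm import LinearSVC

    H = np.load("dnn_features.npy")   # (T, d) top-hidden-layer outputs h(o_t); assumed file
    y = np.load("frame_states.npy")   # (T,) frame-level state labels; assumed file

    svm = LinearSVC(multi_class="crammer_singer", C=1.0)
    svm.fit(H, y)                     # estimates the last-layer weights w_s
    pred_states = svm.predict(H)      # frame-level state decisions as in Eq. (4.13)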
4.6.2 Sequence-Level Max-Margin Training
In max-margin sequence training, for simplicity, first consider one
training utterance (O, S), where O = {o_1, …, o_T} is the observation sequence
and S = {s_1, …, s_T} is the corresponding sequence of reference states. The
parameters of the model can be estimated by maximizing the margin,

max min_{S' ≠ S} { log P(S | O) − log P(S' | O) } (4.15)
Here the margin is defined as the minimum distance between the reference
state sequence S and any competing state sequence S' in the log-posterior
domain, as illustrated in Fig. 4.3.
Fig. 4.3. SVM Hyperplane (the reference state sequence S and the most competing state sequence S' lie on opposite sides of the margin in the feature space φ(O, S)).
For DNSVM, log( P(S | O) / P(S' | O) ) can be computed via

log( P(S | O) / P(S' | O) ) = w^T ( φ(O, S) − φ(O, S') ) (4.16)

where φ(O, S) is the joint feature [22], which characterizes the dependencies
between O and S,

φ(O, S) = [ Σ_t δ(s_t, 1) log Pr(s_t = 1 | o_t) ; … ;
Σ_t δ(s_t, N) log Pr(s_t = N | o_t) ; log P(S) ] (4.17)
where δ(·,·) is the Kronecker delta (indicator) function. Here the prior, P(w),
is assumed to be a Gaussian with a zero mean and a scaled identity covariance
matrix CI; thus log P(w) = log N(w; 0, CI).

The parameters w of the DNSVM (in the last layer) can be estimated by
minimizing

(1/2) ||w||² + C Σ_i [ max_{S'} { w^T φ(O_i, S') + L(S', S_i) } − w^T φ(O_i, S_i) ] (4.18)

where L(S', S_i) is the loss between a competing state sequence S' and the
reference S_i. Objective function (4.18) for the DNSVM is the same as the
training criterion for structured SVMs with the features defined in (4.17). To
solve (4.18), the cutting plane algorithm can be applied; it requires searching
for the most competing state sequence efficiently.
The computational load during training is dominated by this search
process. To speed up the training, denominator lattices with state alignments are
used to constrain the search space. A lattice-based forward-backward search is
then applied to find the most competing state sequence S'.
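Under the reconstruction of Eq. (4.17) above, the joint feature for one utterance can be sketched as follows (per-state sums of frame log-posteriors plus one slot for the sequence log-prior):

    import numpy as np

    def joint_feature(log_post, states, log_prior, n_states):
        # log_post: (T, N) frame log-posteriors; states: length-T state sequence S.
        phi = np.zeros(n_states + 1)
        for t, s in enumerate(states):
            phi[s] += log_post[t, s]   # the delta(s_t, s) terms pick this state's slot
        phi[-1] = log_prior            # log P(S): transition / prior term
        return phi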
4.7 PERFORMANCE MEASURES
Sensitivity (also called the True Positive Rate (TPR))
It measures the proportion of positives that are correctly identified
as such (e.g., the percentage of sick people who are correctly identified as
having the condition).
Sensitivity = Number of true positives / (Number of true positives + Number of false negatives)
TPR = TP / (TP + FN) (4.19)
Specificity (also called the True Negative Rate (TNR))
It measures the proportion of negatives that are correctly identified
as such (e.g., the percentage of healthy people who are correctly
identified as not having the condition).
Specificity = Number of true negatives / (Number of true negatives + Number of false positives)
TNR = TN / (TN + FP) (4.20)
Accuracy
The accuracy is the proportion of the total number of predictions
that were correct. It can be determined by the equation,
Accuracy = (TN + TP) / (TN + TP + FN + FP) (4.21)
where TP = true positive, TN = true negative, FN = false negative, FP = false positive.
Mean Square Error
The mean square error is the average squared difference between the
original signal and the output signal. It can be calculated as
MSE = (1/N) Σ_{n=1}^{N} ( x(n) − y(n) )² (4.22)
where x(n) is the original signal, y(n) is the output signal, and N is the number
of samples.
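The measures of Eqs. (4.19)-(4.22) in a short sketch:

    import numpy as np

    def performance_measures(tp, tn, fp, fn):
        sensitivity = tp / (tp + fn)                  # Eq. (4.19)
        specificity = tn / (tn + fp)                  # Eq. (4.20)
        accuracy = (tp + tn) / (tp + tn + fp + fn)    # Eq. (4.21)
        return sensitivity, specificity, accuracy

    def mean_square_error(x, y):
        x, y = np.asarray(x, float), np.asarray(y, float)
        return float(np.mean((x - y) ** 2))           # Eq. (4.22)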
4.8 Drawbacks of DNN:
• It may converge on a local minimum rather than the global minimum.
• It overfits if training goes on too long.
• As an N-ary classifier, a Deep Neural Network must be trained for all
classes in one go.
4.9 Advantages of DNSVM:
• It overcomes the first two drawbacks of the DNN.
• As an N-ary classifier built from support vector machines, the classifiers
can be trained one by one.
CHAPTER 5
SIMULATION RESULTS
5.1 Results and Discussion
The results illustrate speech recognition based on the DNSVM approach.
The input speech signal is first converted into .wav format, and the features of
the speech signal are extracted by the Mel Frequency Cepstral Coefficient
technique. Here, feature vectors of 12 Mel Frequency Cepstral Coefficients are
obtained, and the training datasets are loaded and trained. From the trained
datasets, the given input speech signals are classified by the DNSVM.

A speech signal with a length of 13 seconds is given as input, as shown in
Fig. 5.1.
Fig. 5.1 Input Speech Signal
From the given speech signal, the features are extracted by the MFCC stage,
which yields the MFCC feature vectors.
Fig.5.2 Plots of MFCC
The Deep Neural Support Vector Machine is used for classification. The
given input speech signals are classified against the trained datasets by the
DNSVM: if a given input speech signal matches the trained datasets, the signal
is considered recognized; if it mismatches, it is not recognized.
Fig.5.3 Output Signal
Fifty samples of speech signals are extracted and trained independently,
each with its feature values. If any one of these fifty speech samples is given for
testing, the signal will be classified and recognized, and the output window will
show "signal is recognized". If any signal other than these fifty samples is given
as input, the signal will be classified but not recognized, and the output window
will show "signal is not recognized".
5.2 MEAN SQUARE ERROR (MSE)
The mean square error of the input speech signal is shown in Table 5.1.
For the Deep Neural Network, the mean square error increases as the number of
iterations increases, whereas for the Deep Neural Support Vector Machine the
mean square error decreases as the number of iterations increases. Compared to
the DNN, the DNSVM therefore shows the better mean square error.
Iterations MSE of DNN MSE of DNSVM
Iteration 1 0.141394 0.125154
Iteration 2 0.182261 0.120891
Iteration 3 0.232163 0.098774
Iteration 4 0.285192 0.078025
Iteration 5 0.331714 0.061430
Iteration 6 0.359896 0.053050
Iteration 7 0.380208 0.050005
Iteration 8 0.395030 0.048190
Iteration 9 0.404762 0.046674
Table.5.1 Mean Square Error of Speech Signal
5.3 PERFORMANCE MEASURES
Performance Metrics DNN (%) DNSVM (%)
Accuracy 92.54 96.66
Specificity 94.63 97.83
Sensitivity 90.71 95.45
Table.5.2 Performance Measures
From the above table, it is clear that Deep Neural Support Vector Machine
yields better performance in accuracy, sensitivity and specificity.
CHAPTER 6
CONCLUSION AND FUTURE WORK
A new type of DNN is introduced. Traditional DNNs use the softmax model
at the top layer for classification; the proposed DNN instead employs an SVM
at the top layer. Training algorithms are derived at the frame and sequence
levels to jointly learn the parameters of the SVM and the DNN under the
maximum-margin condition. Under frame-level training, the new model is
associated with the multiclass SVM with DNN features. Under sequence-level
training, it is associated with the structured SVM with DNN features and state
transition features.

The accuracy is better and the error rate obtained is lower when the Deep
Neural Support Vector Machine is used as the classification technique.
In this work only the top layer of the DNN is replaced by a linear SVM; the
investigation of non-linear kernels for deep SVMs can be taken up as future
work.
REFERENCES
[1] Pan Zhou, Hui Jiang, Li-Rong Dai, Yu Hu, and Qing-Feng Liu, "State-
Clustering Based Multiple Deep Neural Networks Modeling Approach for
Speech Recognition," IEEE/ACM Transactions on Audio, Speech, and
Language Processing, vol. 23, no. 4, April 2015.
[2] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu, “1-Bit Stochastic Gradient
Descent And Application To Data-Parallel Distributed Training Of Speech
DNN,” in Proc. Interspeech, 2014.
[3] O. Abdel-Hamid, A.-R. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu,
"Convolutional Neural Networks for Speech Recognition," IEEE/ACM Trans.
Audio, Speech, Lang. Process., vol. 22, no. 10, pp. 1533–1545, Oct. 2014.
[4] P. Zhou, C. Liu, Q. Liu, L. Dai, and H. Jiang, "A Cluster-Based Multiple Deep
Neural Networks Method for Large Vocabulary Continuous Speech
Recognition," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process.
(ICASSP), 2013.
[5] H.-S. Le, I. Oparin, A. Allauzen, J.-L. Gauvain, and F. Yvon, “Structured
Output Layer Neural Network Language Models For Speech Recognition,”
IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 1, pp. 197–206, Jan.
2013.
[6] S. Zhang, C. Zhang, Z. You, R. Zheng, and B. Xu, “Asynchronous Stochastic
Gradient Descent For DNN Training,” in Proc. IEEE Int. Conf. Acoust.,
Speech, Signal Process. (ICASSP), 2013.
[7] X. Chen, A. Eversole, G. Li, D. Yu, and F. Seide, "Pipelined Back-
Propagation for Context-Dependent Deep Neural Networks," in Proc.
Interspeech, 2012.
[8] D. Yu, F. Seide, G. Li, and L. Deng, "Exploiting Sparseness in Deep Neural
Networks for Large Vocabulary Speech Recognition," in Proc. IEEE Int. Conf.
Acoust., Speech, Signal Process. (ICASSP), 2012.
[9] G. E. Dahl, D. Yu, L. Deng, and A. Acero, “Context-Dependent Pretrained
Deep Neural Networks For Large Vocabulary Speech Recognition,” IEEE
Trans. Audio, Speech, Lang. Process., vol. 20, no. 1, pp. 30–42, Jan. 2012.
[10] A.-R. Mohamed, G. Hinton, and G. Penn, “Understanding How Deep
Belief Networks Perform Acoustic Modeling,” in Proc. IEEE Int. Conf.
Acoust., Speech, Signal Process. (ICASSP), 2012.
[11] F. Seide, G. Li, X. Chen, and D. Yu, “Feature Engineering In Context-
Dependent Deep Neural Networks For Conversational Speech Transcription,”
in Proc. IEEE Workshop Automatic. Speech Recognition. Understand.
(ASRU), 2011.
[12] F. Seide, G. Li, and D. Yu, “Conversational Speech Transcription Using
Context-Dependent Deep Neural Networks,” in Proc. Interspeech, 2011, pp.
437–440.
[13] G. Dahl, D. Yu, L. Deng, and A. Acero, “Large Vocabulary Continuous
Speech Recognition With Context-Dependent DBN-HMM,” in Proc. IEEE Int.
Conf. Acoust., Speech, Signal Process. (ICASSP), 2011, pp. 4688–4691.
[14] D. Yu, L. Deng, and G. Dahl, “Roles of Pre-Training And Fine-Tuning
In Context-Dependent DBN-HMM For Real-World Speech Recognition,” in
Proc. NIPS Workshop Deep Learning Unsupervised. Feat. Learn, 2010.
[15] K. Veselý, L. Burget, and F. Grézl, “Parallel Training Of Neural
Networks For Speech Recognition,” in Proc. Interspeech, 2010.
[16] J. Park, F. Diehl, M. Gales, M. Tomalin, and P. Woodland, “Efficient
Generation And Use Of MLP Features For Arabic Speech Recognition,” in
Proc. Interspeech, 2009.
[17] S. Kontár, “Parallel Training Of Neural Networks for Speech
Recognition,” in Proc. 12th Int. Conf. Soft Computing, 2006.
[18] Pratik K. Kurzekar, Rathnadeep R. Deshmukh, and Pukhraj P. Shrishrimal,
"A Comparative Study of Feature Extraction Techniques for Speech
Recognition System," International Journal of Innovative Research in Science,
Engineering and Technology, vol. 3, Dec. 2014.
[19] Y. Tang, “Deep Learning Using Linear Support Vector Machines,” in
International Conference on Machine Learning, December 2013.
[20] J. Chen, K. K. Paliwal, M. Mizumachi, and S. Nakamura, "Robust
MFCC Derived From Differentiated Power Spectrum," in Proc. Eurospeech
2001, Scandinavia, 2001.
[21] K. Crammer and Y. Singer, "On the Algorithmic Implementation of
Multiclass Kernel-Based Vector Machines," Journal of Machine Learning
Research, vol. 2, pp. 265–292, 2001.
[22] S.-X. Zhang, A. Ragni, and M. J. F. Gales, “Structured Log Linear
Models For Noise Robust Speech Recognition,” Signal Processing Letters,
IEEE, vol. 17, pp. 945–948, 2010.
PUBLICATION
International Conference
Presented a paper titled "Multiple Clustering Based on Deep Neural Support
Vector Machine for Speech Recognition" at the IEEE-sponsored 3rd
International Conference on Innovations in Information, Embedded and
Communications Systems (ICIIECS'16), vol. 4, pp. 1066–1070, held on 18th
March 2016 at Karpagam College of Technology.