MULTIPLE CLUSTERING BASED ON DEEP
NEURAL SUPPORT VECTOR MACHINE FOR
SPEECH RECOGNITION
PROJECT REPORT PHASE-II
Submitted by
VIDYA I
Register No: 14MAE018
In partial fulfilment of the requirements for the award of the degree
of
MASTER OF ENGINEERING
in APPLIED ELECTRONICS
Department of Electronics and Communication Engineering
KUMARAGURU COLLEGE OF TECHNOLOGY
(An Autonomous Institution affiliated to Anna University, Chennai)
COIMBATORE-641049
ANNA UNIVERSITY: CHENNAI 600 025
APRIL 2016
BONAFIDE CERTIFICATE
Certified that this project report titled "MULTIPLE CLUSTERING BASED ON
DEEP NEURAL SUPPORT VECTOR MACHINE FOR SPEECH
RECOGNITION" is the bonafide work of VIDYA I [Reg. No. 14MAE018], who
carried out the research under my supervision. Certified further that, to the best of my
knowledge, the work reported herein does not form part of any other project or
dissertation on the basis of which a degree or award was conferred on an earlier
occasion on this or any other candidate.
The candidate with university Register No. 14MAE018 was examined by us
in the project viva-voce examination held on ……………………
INTERNAL EXAMINER EXTERNAL EXAMINER
SIGNATURE
Mrs. S. KRITHIKA,
PROJECT SUPERVISOR
Department of ECE
Kumaraguru College of Technology
Coimbatore-641 049
SIGNATURE
Dr. A. VASUKI
HEAD OF THE DEPARTMENT
Department of ECE
Kumaraguru College of Technology
Coimbatore-641 049
ACKNOWLEDGEMENT
First, I would like to express my praise and gratitude to the Lord, who has
showered His grace and blessings, enabling me to complete this project in an
excellent manner.
I would like to express my sincere thanks to our beloved Principal,
Dr. R. S. Kumar, Ph.D., Kumaraguru College of Technology, who encouraged me
with his valuable thoughts.
I would like to thank Dr. A. Vasuki, Ph.D., Head of the Department of
Electronics and Communication Engineering, for her kind support and for
providing the necessary facilities to carry out the project work.
I am greatly privileged to express my deep sense of gratitude to the Project
Coordinator, Mrs. S. Umamaheswari, M.E., Associate Professor, for her continuous
support throughout the course.
In particular, I wish to thank and express my everlasting gratitude to my
Supervisor, Mrs. S. Krithika, M.E., Assistant Professor, for her expert counsel at
each and every step of the project work. I also wish to convey my deep sense of
gratitude to all teaching and non-teaching staff members of the ECE Department for
their help and cooperation.
Finally, I thank my parents and my dear friends for their moral support in
all my activities.
VIDYA I
ABSTRACT
A unique type of Deep Neural Network (DNN) is introduced which has achieved
significant performance gains in various tasks in large-scale Automatic Speech
Recognition (ASR). Traditional DNNs use multinomial logistic regression (softmax)
at the top layer for classification. Instead, the new DNN employs a Support Vector
Machine (SVM) at the top layer. Two training algorithms, at the frame level and at
the sequence level, are derived to jointly learn the parameters of the SVM and the
DNN under the maximum-margin criterion. With frame-level training, the new
model can be related to the multiclass SVM with DNN features; with sequence-level
training, it is related to the structured SVM with DNN features. This model is named
the DNSVM (Deep Neural Support Vector Machine).
TABLE OF CONTENTS

ABSTRACT
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS
1 INTRODUCTION
1.1 Types of Speech
1.1.1 Isolated Word
1.1.2 Connected Word
1.1.3 Continuous Word
1.1.4 Spontaneous Word
1.2 ASR System Classification
2 LITERATURE SURVEY
3 FEATURE EXTRACTION TECHNIQUE
3.1 Feature Extraction
3.2 Mel Frequency Cepstral Coefficient
3.2.1 Pre-Processing
3.2.2 Framing
3.2.3 Hamming Windowing
3.2.4 Fast Fourier Transform
3.2.5 Mel Filter Bank
3.2.6 Discrete Cosine Transform
4 CLASSIFICATION TECHNIQUES
4.1 Introduction
4.2 Deep Neural Network (DNN)
4.2.1 Structure of mDNN
4.2.2 Data Partition Based on State Clustering for mDNN
4.3 Parallel Training of mDNN
4.3.1 Frame-Level Cross Entropy Training of mDNN
4.3.2 Sequence Training of mDNN
4.4 Algorithm for Cluster-Based mDNN Method
4.4.1 Steps for Training
4.5 Support Vector Machine (SVM)
4.6 Deep Neural Support Vector Machine
4.6.1 Frame-Level Max-Margin Training
4.6.2 Sequence-Level Max-Margin Training
4.7 Performance Measures
4.8 Drawbacks of DNN
4.9 Advantages of DNSVM
5 SIMULATION RESULTS
5.1 Results and Discussion
5.2 Mean Square Error
5.3 Performance Measures
6 CONCLUSION AND FUTURE WORK
REFERENCES
PUBLICATION
LIST OF FIGURES

3.1 Block Diagram of Speech Recognition
3.2 Block Diagram of Mel Frequency Cepstral Coefficients
3.3 Mel Scale Filter Bank
4.1 Illustration of Using mDNN for Acoustic Modelling
4.2 Architecture of Deep Neural Support Vector Machine
4.3 SVM Hyperplane
5.1 Input Speech Signal
5.2 Plots of MFCC
5.3 Output Signal
LIST OF TABLES

5.1 Mean Square Error of Speech Signal
5.2 Performance Measures
LIST OF ABBREVIATIONS

ASR Automatic Speech Recognition
CE Cross Entropy
DNN Deep Neural Network
DNSVM Deep Neural Support Vector Machine
GMM Gaussian Mixture Model
HMM Hidden Markov Model
MFCC Mel Frequency Cepstral Coefficients
SVM Support Vector Machine
CHAPTER 1
INTRODUCTION
Speech is the primary mode of communication between human beings
and the most natural and efficient form of exchanging information. Speech
Recognition can be defined as the process of converting a speech signal into a
sequence of words by means of an algorithm implemented as a computer
program. Speech processing is one of the exciting areas of signal processing.
The goal of the Speech Recognition field is to develop techniques and
systems for speech input to machines. Based on major advances in statistical
modelling of speech, Automatic Speech Recognition (ASR) today finds
widespread application in tasks that require a human-machine interface, such
as automatic call processing. Since the 1960s, computer scientists have been
researching ways and means to make computers able to record, interpret and
understand human speech. Throughout the decades, this has been a daunting
task. Even the most rudimentary problem, such as digitizing (sampling) voice,
was a huge challenge in the early years. It took until the 1980s before the first
systems arrived which could actually decipher speech. Of course, these early
systems were very limited in scope and power.
The purpose of Speech Recognition is to formulate a method of providing
related services to users. Recent advances in Speech Recognition technology,
coupled with the advent of modern operating systems and high-powered,
affordable personal computers, have culminated in the first Speech Recognition
systems that can be deployed to a wide community of users.

Communication among human beings is dominated by spoken language;
therefore, it is natural for people to expect speech interfaces to a computer.
Machine recognition of speech involves generating the sequence of words that
best matches the given speech signal. Some well-known applications include
virtual reality, multimedia search, auto-attendants, travel information and
reservation, translators, natural language understanding and many more.
1.1 Types of Speech Recognition
Speech Recognition systems can be separated into different classes by
describing what types of utterances they can recognize. These classes are based
on the fact that one of the difficulties of ASR is the ability to determine when a
speaker starts and finishes an utterance.
1.1.1 Isolated Word
An isolated word recognizer usually requires each utterance to have quiet on
both sides of the sample window. It accepts single words or single utterances at a
time. Often these systems have "Listen" and "Not-Listen" states. "Isolated
utterance" might be a better name for this class.
1.1.2 Connected Word
A connected word system is similar to an isolated word system, but it allows
separate utterances to be run together with a minimal pause between them.
1.1.3 Continuous Word
Continuous speech recognizers allow users to speak almost naturally,
while the computer determines the content. Recognizers with continuous speech
capabilities are some of the most difficult to create because they must utilize
special methods to determine utterance boundaries. A continuous speech system
operates on speech in which words are connected together, i.e. not separated by
pauses. Continuous speech is more difficult to handle because of a variety of
effects. First, it is difficult to find the start and end points of words. Another
problem is "co-articulation": the production of each phoneme is affected by the
surrounding phonemes, and the start and end of words are affected by the
preceding and following words.
1.1.4 Spontaneous Word
At a basic level, spontaneous speech can be thought of as speech that is
natural sounding and not rehearsed. An ASR system with spontaneous speech
ability should be able to handle a variety of natural speech features, such as
words being run together.
1.2 ASR System Classification
Speech Recognition is a special case of pattern recognition. There are two
phases in supervised pattern recognition, viz., training and testing. The process
of extracting features relevant for classification is common to both phases
[10]. During the training phase, the parameters of the classification model are
estimated using a large number of class examples (training data). During the
testing or recognition phase, the features of a test pattern (test speech data) are
matched with the trained model of each and every class.
CHAPTER 2
LITERATURE SURVEY
2.1 Pan Zhou, Hui Jiang, Senior Member, IEEE, Li-Rong Dai, Yu Hu,
and Qing-Feng Liu “State-Clustering Based Multiple Deep Neural
Networks Modelling Approach for Speech Recognition”, IEEE/ACM
Transactions On Audio, Speech, And Language Processing, Vol. 23,
no. 4, April 2015.
A new cluster-based multiple-DNN method for acoustic modelling
in LVCSR (Large Vocabulary Continuous Speech Recognition) is
proposed. The new modelling method yields recognition performance
comparable to the regular DNN method, but the multiple DNNs can be
efficiently trained in parallel using multiple GPUs, which leads to a
dramatic speedup in training. The multiple DNNs also achieve a lower
recognition error rate.
2.2 Pratik. K. Kurzekar, Rathnadeep. R. Deshmukh, Pukhraj. P.
Shrishrimal “A Comparative Study of Feature Extraction
Techniques For Speech Recognition System,” International Journal
Of Innovative Research In Science, Engineering Technology. Vol. 3,
Dec.2014.
Mel Frequency Cepstral Coefficients (MFCC) are the most commonly
used feature extraction technique in speech recognition systems. The
reason MFCC is most commonly used for feature extraction is that it is
closest to actual human auditory speech perception. Some researchers
have proposed modifications to the basic MFCC algorithm to improve
robustness.
2.3 Y. Tang, “Deep learning using linear support vector machines,” in
International Conference on Machine Learning, December 2013.
Deep learning using linear support vector machines (DLSVM)
works better than softmax on two standard data sets and a recent dataset.
Switching from softmax to SVMs is remarkably simple and appears to be
useful for classification tasks. The paper demonstrates a small but
consistent advantage of replacing the softmax layer with a linear support
vector machine: learning minimizes a margin-based loss instead of the
cross-entropy loss. While there have been various combinations of neural
nets and SVMs, replacing softmax with linear SVMs gives significant
gains on popular deep learning data sets.
2.4 O. Abdel-Hamid, A.-R. Mohamed, H. Jiang, L. Deng, G. Penn, and
D. Yu, “Convolutional Neural Networks For Speech Recognition,”
IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 22, no. 10,
pp.1533–1545, Oct. 2014.
A Convolutional Neural Network (CNN) for speech recognition
with a reduced error rate is proposed. The paper explains the use of a
basic CNN for speech recognition and proposes a limited-weight-sharing
scheme that can better model speech features. The special structure of
CNNs, such as local connectivity, weight sharing, and pooling, exhibits
some degree of invariance to small shifts of speech features along the
frequency axis, which is important for dealing with speaker and
environment variations. The major advantage of the CNN technique is
that the same filter is used at each position in a layer, so memory size is
reduced and performance is improved.
2.5 S. Zhang, C. Zhang, Z. You, R. Zheng, and B. Xu, “Asynchronous
Stochastic Gradient Descent for DNN training,” in Proc. IEEE Int.
Conf. Acoust., Speech, Signal Process. (ICASSP), 2013.
Asynchronous SGD (Stochastic Gradient Descent) is implemented
for training DNNs using a number of GPUs. In contrast, another possible
parallelization strategy is "model parallelism", i.e., splitting the model
into several parts, each of which is computed by one computing unit. The
advantages of SGD are efficiency and ease of implementation. Its
limitations are that it requires a number of hyperparameters, such as the
regularization parameter and the number of iterations, and that it is
sensitive to feature scaling.
2.6 X. Chen, A. Eversole, G. Li, D. Yu, and F. Seide, “Pipelined Back
Propagation for Context-Dependent Deep Neural Networks,” in
Proc. Interspeech, 2012.
A pipelined implementation of Back Propagation (BP) is used for
parallel training of DNNs on multiple GPUs, where the computation
related to different layers of a DNN is distributed over several GPUs. The
limitation of this pipelined method is how to balance the computation
load among the different GPUs, particularly when the output layer
dominates the computation.
2.7 Q. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. Corrado, J.
Dean, and A. Ng, "Building High-Level Features Using Large Scale
Unsupervised Learning," in Proc. ICML, 2012.
Large-scale unsupervised learning is used for training on a cluster of
thousands of CPUs, implementing both model parallelism and
asynchronous SGD. However, no matter whether data partitioning or
model parallelism is implemented, all of the above-mentioned parallel
training methods suffer from a communication overhead problem when
collecting gradients, redistributing updated model parameters over
different units, and delivering model outputs to another unit. This cross-
unit communication is bound to become the major performance
bottleneck, especially when parallel training is scaled up to a large
number of computing units. Asynchronous SGD is used to speed up
the neural network training.
2.8 G. E. Dahl, D. Yu, L. Deng, and A. Acero, “Context-Dependent
Pretrained Deep Neural Networks For Large Vocabulary Speech
Recognition,” IEEE Trans. Audio, Speech, Lang. Process., vol. 20,
no.1, pp. 30–42, Jan. 2012.
A context-dependent (CD) pre-trained deep neural network for
Large Vocabulary Speech Recognition (LVSR) is proposed that leverages
recent advances in using deep belief networks for phone recognition. The
paper describes a pre-trained Deep Neural Network Hidden Markov
Model (DNN-HMM) hybrid architecture that trains the DNN to produce
a distribution over senones as its output. The deep belief network pre-
training algorithm is a robust and often helpful way to initialize deep
neural networks generatively, which can aid optimization and reduce
generalization error. The paper also describes the procedure for applying
CD-DNN-HMMs to LVSR and analyses the effects of various modelling
choices on performance.
2.9 D. Yu, F. Seide, G. Li, and L. Deng, “Exploiting Sparseness In Deep
Neural Networks For Large Vocabulary Speech Recognition,” in
Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP),
2012.
Exploiting sparseness in deep neural networks for large vocabulary
speech recognition shows that there is no performance loss when zeroing
out 80% of the small weights in a large DNN model. This method is good
for reducing the total DNN model size, but it gives no gain in training
speed due to the highly random memory accesses introduced by sparse
matrices.
2.10 G. Dahl, D. Yu, L. Deng, and A. Acero, “Large Vocabulary
Continuous Speech Recognition with Context-Dependent DBN-
HMMs,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process.
(ICASSP), 2011, pp. 4688–4691.
Large Vocabulary Continuous Speech Recognition with context-
dependent Deep Belief Network Hidden Markov Models (DBN-HMMs)
dramatically outperforms strong Gaussian Mixture Model Hidden
Markov Model (GMM-HMM) baselines on a challenging, large-
vocabulary, spontaneous speech recognition dataset from the Bing mobile
voice search task. Although DBN-HMMs provide dramatic improvements
in recognition accuracy, training DBN-HMMs is quite expensive
compared with training GMM-HMMs, primarily because training the
former is not easy to parallelize across computers and needs to be carried
out on a single GPU.
2.11 J. Park, F. Diehl, M. Gales, M. Tomalin, and P. Woodland,
“Efficient Generation and Use of MLP Features For Arabic Speech
Recognition,” in Proc. Interspeech, 2009.
The efficient generation and use of MLP features for Arabic speech
recognition divides the training set into N disjoint subsets in each epoch,
and a separate MLP (Multi-Layer Perceptron) is trained on each subset.
These sub-MLPs are then combined by a merger network that is trained
on another subset of data. An advantage of MLPs is the ability to classify
an unknown pattern together with other known patterns that share the
same distinguishing features; neural networks are also highly fault
tolerant. Disadvantages of the MLP technique are a computationally
expensive learning process (i.e., a large number of iterations is required
for learning), no guaranteed solution, and the presence of a scaling
problem.
CHAPTER 3
FEATURE EXTRACTION TECHNIQUES
3.1 Feature Extraction
Speech feature extraction in a categorization problem is about
reducing the dimensionality of the input vector while maintaining the
discriminating power of the signal. From the fundamental formulation of
speaker identification and verification systems [17], the number of training
and test vectors needed for the classification problem grows with the
dimension of the given input, so feature extraction of the speech signal is
required.
Fig. 3.1 Block diagram of Speech Recognition
The input signal is a speech signal in .wav format. The features of the
speech signal are extracted using Mel Frequency Cepstral Coefficients
(MFCC). Finally, the features are used to classify the signals as recognized
or not recognized using a Deep Neural Support Vector Machine (DNSVM).
(Blocks in Fig. 3.1: Input Signal (Speech) → Feature Extraction → Classification → Performance Measures.)
3.2 Mel Frequency Cepstral Coefficients (MFCC)
The first stage in the speech recognition process is feature extraction. The
use of Mel Frequency Cepstral Coefficients can be considered one of the
standard methods for feature extraction. The use of about 20 MFCC
coefficients is common in ASR, although 10-12 coefficients are often
considered sufficient for coding speech [18]. The most notable downside of
MFCC is its sensitivity to noise, due to its dependence on the spectral form.
Methods that utilize information in the periodicity of speech signals could be
used to overcome this problem, although speech also contains aperiodic
content.

MFCCs are the coefficients that together represent the short-term power
spectrum of a sound, based on a linear cosine transform of a log power
spectrum on a nonlinear Mel frequency scale [20]. MFCC is used to extract
feature vectors from the sound wave. The MFCC algorithm is based on human
hearing perception and uses a Mel-scale filter bank.

First, the voice data are divided into frames, and each frame is windowed
using a Hamming window. Second, each frame is converted to the frequency
domain using a short-time Fourier Transform. Third, a certain number of
sub-band energies are calculated using a Mel filter bank, which is a
nonlinear-scale filter bank that imitates the human aural system [19]. Fourth,
the logarithm of the sub-band energies is calculated. Finally, the MFCC is
computed by an inverse Fourier Transform (in practice, a Discrete Cosine
Transform, as described in Sec. 3.2.6).
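As an illustrative sketch only (not the implementation used in this work), assuming Python with the librosa library, a 16 kHz sampling rate and a placeholder file name, the whole pipeline above can be reproduced in a few lines; the step-by-step sketches in the following subsections show what such a routine computes internally:

    import librosa

    # "input.wav" and the 16 kHz rate are placeholder assumptions.
    signal, sr = librosa.load("input.wav", sr=16000)
    # 12 cepstral coefficients per frame, matching the setting used in Chapter 5.
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=12)  # shape: (12, n_frames)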
3.2.1 Pre-Processing
This step passes the signal through a filter which emphasizes higher
frequencies, increasing the energy of the signal at higher frequencies:
Y[n] = X[n] − 0.95 X[n − 1] (3.1)
Here a = 0.95, which makes 95% of any one sample presumed to originate
from the previous sample.
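A minimal sketch of this filter in Python with NumPy, assuming the signal has already been loaded as a one-dimensional array:

    import numpy as np

    def pre_emphasis(x, a=0.95):
        # y[n] = x[n] - a * x[n-1], Eq. (3.1); the first sample is kept unchanged.
        return np.append(x[0], x[1:] - a * x[:-1])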
Fig. 3.2. Block diagram of Mel Frequency Cepstral Coefficients (input speech signal → pre-processing → framing → Hamming window → Fast Fourier Transform → Mel filter bank → Discrete Cosine Transform → MFCC features)
3.2.2 Framing
Framing is the process of segmenting the speech samples obtained from
Analog-to-Digital Conversion (ADC) into small frames with a length in the
range of 20 to 40 ms. The voice signal is divided into frames of N samples,
with adjacent frames separated by M samples (M < N).
3.2.3 Hamming Windowing
A Hamming window is used as the window shape, considering the next block
in the feature extraction processing chain; it integrates all the closest
frequency lines. The Hamming window is defined as W(n), 0 ≤ n ≤ N−1.
The Hamming window equation is given as:
Y(n) = X(n) W(n) (3.2)
W(n) = 0.54 − 0.46 cos(2πn / (N − 1)), 0 ≤ n ≤ N − 1 (3.3)
where
N = number of samples in each frame
Y(n) = output signal
X(n) = input signal
W(n) = Hamming window
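A sketch of framing and windowing with NumPy, assuming a 16 kHz signal with 25 ms frames shifted by 10 ms (N = 400 and M = 160 samples; both values are assumptions within the 20-40 ms range given above):

    import numpy as np

    def frame_and_window(x, frame_len=400, hop=160):
        # Slice the signal into overlapping frames of N samples shifted by M (Sec. 3.2.2).
        n_frames = 1 + (len(x) - frame_len) // hop
        idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
        frames = x[idx]
        # Hamming window of Eq. (3.3), applied frame by frame as in Eq. (3.2).
        n = np.arange(frame_len)
        w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (frame_len - 1))
        return frames * w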
3.2.4 Fast Fourier Transform (FFT)
The Fast Fourier Transform is used to convert each frame of N samples
from the time domain into the frequency domain. The FFT is an efficient
algorithm for the Discrete Fourier Transform (DFT). Usually, the FFT is
performed to obtain the magnitude frequency response of each frame. Since
convolution in the time domain corresponds to multiplication in the frequency
domain:
Y(ω) = FFT[H(t) ∗ X(t)] = H(ω) X(ω) (3.4)
where X(ω), H(ω) and Y(ω) are the Fourier Transforms of X(t), H(t) and Y(t),
respectively.
3.2.5 Mel Filter Bank
The frequency range of the FFT spectrum is very wide, and the voice signal
does not follow a linear scale. The bank of filters according to the Mel scale is
shown in Fig. 3.3: a set of triangular filters is used to compute a weighted sum
of spectral components so that the output of the process approximates a Mel
scale. Each filter's magnitude frequency response is triangular in shape, equal
to unity at the centre frequency and decreasing linearly to zero at the centre
frequencies of the two adjacent filters. Each filter output is then the sum of its
filtered spectral components.
Fig.3.3 Mel Scale Filter Bank
After that, the following equation is used to compute the Mel value F(Mel)
for a given frequency f in Hz:
F(Mel) = 2595 · log10(1 + f / 700) (3.5)
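The mapping of Eq. (3.5) and the triangular filters of Fig. 3.3 can be sketched as follows (the FFT size, filter count and sampling rate are assumptions):

    import numpy as np

    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)      # Eq. (3.5)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)    # inverse of Eq. (3.5)

    def mel_filter_bank(n_filters=26, n_fft=512, sr=16000):
        # Centre frequencies are equally spaced on the Mel scale.
        mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
        bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
        fbank = np.zeros((n_filters, n_fft // 2 + 1))
        for i in range(1, n_filters + 1):
            l, c, r = bins[i - 1], bins[i], bins[i + 1]
            fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rise to unity
            fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # fall to zero
        return fbank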
3.2.6 Discrete Cosine Transform (DCT)
In this last stage, the Mel Frequency Cepstral Coefficients are obtained.
The DCT is applied to the log energies obtained from the triangular band-pass
filters to yield L Mel-scale cepstral coefficients. The formula for the DCT is
given by
C(l) = Σ_{n=1}^{N} E(n) cos( l (n − 0.5) π / N ), l = 1, 2, …, L (3.6)
where
N is the number of triangular band-pass filters,
L is the number of Mel-scale cepstral coefficients,
E(n) is the log sub-band energy of the n-th filter.
The DCT transforms the frequency domain into a time-like domain called the
quefrency domain. The obtained features are referred to as the Mel-Scale
Cepstral Coefficients or MFCC.
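Putting Secs. 3.2.4-3.2.6 together, a hedged sketch of the final stages (magnitude spectrum, Mel sub-band energies, logarithm and DCT), reusing the windowed frames and filter bank from the earlier sketches:

    import numpy as np
    from scipy.fftpack import dct

    def mfcc_from_frames(windowed_frames, fbank, n_ceps=12):
        n_fft = 2 * (fbank.shape[1] - 1)
        spectrum = np.abs(np.fft.rfft(windowed_frames, n_fft))   # FFT, Sec. 3.2.4
        energies = (spectrum ** 2) @ fbank.T                     # Mel sub-band energies
        log_e = np.log(np.maximum(energies, 1e-10))              # log, guarding log(0)
        return dct(log_e, type=2, axis=1, norm='ortho')[:, :n_ceps]  # DCT, Eq. (3.6)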
CHAPTER 4
CLASSIFICATION TECHNIQUES
4.1 INTRODUCTION
In machine learning, classification is the problem of identifying to which
set of categories a new observation belongs, on the basis of a training set of data
containing observations whose category membership is known.
Two types of classification techniques are used to classify the signals:
• Deep Neural Network
• Deep Neural Support Vector Machine
4.2 DEEP NEURAL NETWORK (DNN)
A Deep Neural Network (DNN) is an Artificial Neural Network (ANN)
with multiple hidden layers of units between the input and output layers. Here,
multiple DNNs (mDNN) are used. The hybrid Deep Neural Network-Hidden
Markov Model (DNN-HMM) architecture models the temporal nature of speech
with an HMM and uses a DNN to replace the Gaussian Mixture Models (GMMs)
in computing posterior probabilities of tied HMM states, which can be converted
to scaled likelihoods for Viterbi decoding. In large vocabulary ASR tasks,
thousands of tied HMM states are common. This results in an extremely large
output weight matrix that significantly slows down the back-propagation
process in DNN training.
Here, state-clustered multiple DNNs (mDNN) are used for acoustic modelling
instead of a single large DNN for speech recognition [1]. The HMM state
posterior probability distribution can be estimated by the multiple DNNs.
The multiple DNNs can be trained independently through a combination of
"data partition" and "model parallelism", allowing full parallelization.
4.2.1 Structure of mDNN
Fig. 4.1 illustrates how to use multiple DNNs (mDNN) for acoustic
modelling in ASR. If the whole training data is partitioned into disjoint subsets,
then there are no common state labels among these subsets. In this case, several
hierarchically structured DNNs can be trained [7]. First of all, a small DNN
(with 3 hidden layers) is trained to distinguish the different clusters in the
training data, i.e., to compute the posterior probability of each cluster given an
input. Since this is a very small Neural Network (NN), containing 3 hidden
layers and a small number of output nodes, its training is very fast, even though
it needs to access the entire training set along with the corresponding cluster
labels.
Fig 4.1. Illustration of using multiple DNNs for Acoustic Modelling.
All clustered subsets of the training data are used to train multiple DNNs
to classify the different states within each cluster, i.e., to compute the posterior
probabilities Pr(s | c_r, X) of all tied states within each cluster, where s denotes
any HMM state within cluster c_r.
Since each DNN is trained only on a subset of the training data, for certain
training criteria (such as frame-level Cross-Entropy training), all DNNs can
be separately trained in parallel using different GPUs without transferring any
data or gradients among the GPUs during the entire training process [11]. Each
DNN is much faster to train than the large joint DNN in the normal DNN-HMM
method, because it only involves the fraction of training data belonging to its
cluster, and each DNN is much smaller in size because it has fewer output
classes and probably fewer nodes in each hidden layer.
As for decoding, each observation sample, X, is fed into all of the
estimated DNNs to compute Pr(c_r | X) and Pr(s | c_r, X), as illustrated in
Fig. 4.1. In general, the posterior probability of any tied state s belonging to
cluster c_r is calculated as follows:

Pr(s | X) = Pr(s | c_r, X) · Pr(c_r | X) (4.1)

Assume all clusters are disjoint; the above equation is always used for the
following derivation unless stated explicitly.

The above posterior probabilities, Pr(s | X), are used for decoding in the
same way as in the normal hybrid DNN-HMM model. An implication of the
above factorization is that all softmax operations in all DNNs must be explicitly
calculated to derive the required posterior probabilities for decoding, which may
lead to a very small overhead in the test stage.
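A sketch of this decoding-time combination, assuming hypothetical callables top_nn and cluster_dnns[r] that return the cluster posteriors and the within-cluster state posteriors respectively:

    import numpy as np

    def mdnn_state_posteriors(x, top_nn, cluster_dnns):
        # Eq. (4.1): Pr(s | X) = Pr(s | c_r, X) * Pr(c_r | X).
        cluster_post = top_nn(x)                   # Pr(c_r | X) for each cluster r
        return np.concatenate(
            [cluster_dnns[r](x) * cluster_post[r]  # scale each cluster's state posteriors
             for r in range(len(cluster_dnns))])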
The proposed mDNN method uses a different way of factorizing
posterior probabilities, based on a hierarchy of output labels and automatically
generated clusters [6]. The purpose of this factorization is to decouple the DNN
computation into several smaller independent models for effective parallel
training. A similar hierarchical factorization has been applied to neural network
language models to reduce the huge weight matrix in the output layer caused by
a large vocabulary size.
4.2.2 Data Partition Based on State Clustering for mDNN
The first step in the mDNN modelling framework is to partition the whole
training data into multiple disjoint subsets. A data-driven unsupervised
clustering method is used to group the training data into several clusters [16].
In this way, data samples belonging to different clusters tend to be less similar,
so that the different clusters can be easily distinguished by the small and shallow
neural network at the top level. Moreover, data partitioning is performed at the
state level, not at the utterance level.

State clustering aims to divide the entire training dataset into several
subsets that contain no common state labels, so that mDNN training can be
conducted in a fully parallel fashion [9]. Furthermore, a GMM is chosen to
model each cluster during the data-driven clustering process, since a GMM can
be efficiently estimated even from a large amount of data. This approach is
called Gaussian Mixture Model (GMM) based state clustering.
Initially, a baseline GMM-HMM system consisting of tied HMM states is
built, and all training data are force-aligned using word-level transcriptions to
generate state labels for all speech frames in the training set [3]. Assume that the
training data are to be split into C clusters; a different GMM must then be
estimated for each of these clusters. There are several different methods to
perform data clustering, either top-down or bottom-up. The simplest way is to
randomly select C states as the starting point to initialize the different GMMs
using the data from these states. The remaining states are then classified into
one of these clusters, one by one, based on the total likelihood value of each
state calculated from the GMMs. In this process, all data belonging to one HMM
state are viewed as one indivisible unit when assigning them to the different
clusters.

Finally, all GMMs are re-estimated based on the new data partition. This
process is repeated until all clusters converge. The initial selection of states for
GMM estimation may not be very robust for clustering [8]. A more stable
clustering method is to use the top-down k-means method to automatically
group the training data into clusters for GMM estimation. In this case, the
number of clusters is gradually grown from one at the beginning until it reaches
the desired cluster number [13]. Each cluster is modelled by one GMM, and all
data belonging to one HMM state are assigned to a cluster as one indivisible
unit based on the total likelihood value calculated by the current GMMs. Once
the data are re-assigned, all GMMs are re-estimated. This state clustering
process is very similar to standard speaker clustering for ASR, except that all
training samples are clustered based on the HMM state labels rather than the
speaker labels.
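A simplified sketch of this GMM-based state clustering, assuming scikit-learn, random initialization, a fixed iteration count and 4-component GMMs (all assumptions); each tied state's aligned frames are treated as one indivisible unit:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def cluster_states(state_frames, n_clusters, n_iter=10):
        # state_frames: dict mapping tied-state id -> (n_i, d) array of aligned frames.
        rng = np.random.default_rng(0)
        seeds = rng.choice(list(state_frames), size=n_clusters, replace=False)
        gmms = [GaussianMixture(n_components=4).fit(state_frames[s]) for s in seeds]
        assign = {}
        for _ in range(n_iter):
            # Assign every state, as a whole unit, to the most likely cluster GMM.
            for s, frames in state_frames.items():
                assign[s] = int(np.argmax([g.score(frames) for g in gmms]))
            # Re-estimate each cluster's GMM on its newly assigned data.
            for r in range(n_clusters):
                data = [state_frames[s] for s in assign if assign[s] == r]
                if data:
                    gmms[r].fit(np.vstack(data))
        return assign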
4.3 Parallel Training of Multiple DNNs
4.3.1 Frame-level Cross Entropy Training of mDNN
For multiple DNNs (mDNN), consider the same frame-level Cross
Entropy (CE) objective function and derive the error signals in a similar way as
for the conventional DNN [9]. Assume that the clusters are disjoint and do not
contain common output labels. In this case, the mDNN output is computed as a
product of two terms, one from the top-level NN and the other from a lower-
level DNN [5]. By construction, the component networks of the mDNN can be
trained independently under the frame-level CE training criterion. In this case,
given an input feature vector along with its target label (assuming the target
label belongs to cluster c_r), consider how to compute the error signals at the
output layer for CE training of an mDNN.
For the top-level NN, the error signals at the output layer can be obtained as
follows. Let c_r denote the cluster of the target label; then

e(c_k) = ∂F_CE / ∂a(c_k) = −∂ log Pr(c_r | X) / ∂a(c_k) (4.2)

e(c_k) = Pr(c_k | X) − δ(c_k, c_r) (4.3)

where δ(·,·) is the Dirac (Kronecker) delta function:

δ(c_k, c_r) = 1 if c_k = c_r, and 0 otherwise (4.4)

Furthermore, the above error signals can be back-propagated to derive the
error signals in all other layers. Similarly, for the lower-level DNN of cluster c_r
with target state label s_j,

e(s_i) = ∂F_CE / ∂a(s_i) = −∂ log Pr(s_j | c_r, X) / ∂a(s_i) (4.5)

e(s_i) = Pr(s_i | c_r, X) − δ(s_i, s_j) (4.6)
Each training data sample makes non-zero updates only to the one DNN
containing its target class label; the gradients of all other DNNs remain zero [2].
Therefore, instead of feeding all training data to all parallel DNNs to perform
cross-entropy training, the whole training set is divided into different clusters
based on the class labels, and each subset of training data is fed only to its own
DNN.

For frame-level CE training, each DNN can be trained independently on its
own data without involving any communication traffic [10]. After splitting the
training data into different clusters, each DNN of the mDNN can be trained
independently with standard Back Propagation (BP) using its own data and the
corresponding labels [15]. This leads to a maximum degree of parallelism.
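Both Eqs. (4.3) and (4.6) say that the CE error signal at a softmax output layer is the posterior minus the one-hot target indicator; a minimal sketch:

    import numpy as np

    def ce_error_signal(posteriors, target):
        # e(k) = Pr(k | X) - delta(k, target), as in Eqs. (4.3) and (4.6);
        # posteriors is a 1-D NumPy array over the output classes.
        e = posteriors.copy()
        e[target] -= 1.0
        return e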
4.3.2 Sequence Training of mDNN
As opposed to the frame-level training criterion, sequence training of an
mDNN is no longer independent among all its parallel DNNs, since it needs to
access all state posteriors to process word graphs [4]. The error signals and the
related sequence training procedure for multiple Deep Neural Networks
(mDNN) are derived as follows. For a sequence-level objective function F, the
error signal related to any one HMM tied state s in the output layer is
computed as

e(s) = ∂F / ∂a(s) = Σ_{s'} ( ∂F / ∂ log Pr(s' | X) ) · ( ∂ log Pr(s' | X) / ∂a(s) ) (4.7)

where the sum runs over all HMM tied states in the output layer. In the mDNN,
since the output posterior Pr(s | X) is a product of two terms as in Eq. (4.1), the
partial derivative of its logarithm with respect to any activation a(c_k) in the
top-level NN is computed as

∂ log Pr(s | X) / ∂a(c_k) = ∂[ log Pr(s | c(s), X) + log Pr(c(s) | X) ] / ∂a(c_k)
= δ(c(s), c_k) − Pr(c_k | X) (4.8)

where c(s) stands for the cluster containing HMM state s.

For each lower-level DNN in the mDNN, say the one for cluster c_r, the
error signals in the output layer are calculated as

e(s_j) = ∂F / ∂ log Pr(s_j | X) − Pr(s_j | c_r, X) Σ_{s' ∈ c_r} ∂F / ∂ log Pr(s' | X) (4.9)

The above error signals at the output layer of a lower-level DNN contain two
terms, and the second term is no longer equal to zero [14], since it is summed
only over the subset of state labels within the cluster rather than over all states
in the model. These error signals can be back-propagated in the same way as in
regular Back Propagation (BP) to derive the error signals for the other layers in
all parallel DNNs. In this case, methods to prune and compress statistics and
gradients [12] may be used to improve training efficiency.

The underlying partial derivative with respect to a lower-level activation
a(s_j) of cluster c_r is given as

∂ log Pr(s | X) / ∂a(s_j) = [ δ(s, s_j) − Pr(s_j | c_r, X) ] · δ(c(s), c_r) (4.10)

In the same way, these error signals are back-propagated to derive the error
signals in all other layers.
4.4 Algorithm for Cluster-Based mDNN Method
State-clustered multiple DNNs (mDNN) are used for acoustic modelling in
place of a single DNN for speech recognition. The clustered subsets of training
data are used to train multiple DNNs to classify the different states within each
cluster [15]. The HMM state posterior probability distribution can be estimated
by the multiple DNNs, and the multiple DNNs can be trained independently.
The main steps involved in the mDNN training procedure are summarized
in the following algorithm; a small code sketch of the data routing follows the
steps.
4.4.1 Steps for Training
1. Train a MFCC baseline system with N tied states, denoted as mfcc.
2. Use MFCC to generate state level alignments of all training data.
3. Cluster all training data belonging to N tied states into several disjoint
sets.
4. Generate a mapping from each tied state to the cluster label to which it
belongs, denoting this mapping as state two cluster state.
5. Use the entire training set to train a small NN.
6. Use all clustered subsets of training data to train multiple smaller DNNs
denoted as dnn, 1≤ r ≤ C.
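A small runnable sketch of the data routing behind steps 3, 4 and 6 (the state2cluster mapping could come from a clustering procedure such as the sketch in Sec. 4.2.2; all names here are illustrative):

    def split_by_cluster(frames, labels, state2cluster, n_clusters):
        # Route each (frame, tied-state label) pair to the subset of the cluster
        # its label belongs to; subset r then trains dnn_r (step 6), while the
        # cluster ids themselves are the targets for the small NN (step 5).
        subsets = [([], []) for _ in range(n_clusters)]
        for x, s in zip(frames, labels):
            xs, ys = subsets[state2cluster[s]]
            xs.append(x)
            ys.append(s)
        return subsets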
4.5 Support Vector Machine (SVM)
A Support Vector Machine (SVM) is a supervised machine learning
algorithm which can be used for both classification and regression challenges.
It is mostly used in classification problems, using linear and non-linear
hyperplanes to classify data. It is basically a binary nonlinear classifier capable
of deciding whether an input vector x belongs to class A or class B.

For a given set of separable data, the goal is to find the optimal decision
function. This is done by maximizing the margin, the distance between the
closest samples and the decision boundary. Classification is performed by
constructing hyperplanes in a multidimensional space that separate the different
class labels, based on statistical learning theory. SVMs are applied in various
fields due to features such as (a) high accuracy and flexibility and (b) the
capacity to accommodate a large number of attributes.
4.6 Deep Neural Support Vector Machine
Most DNNs use multinomial logistic regression, also known as the softmax
activation function, at the top layer for classification. Specifically, given the
observation o_t at frame t, let h(o_t) be the output vector of the top hidden layer
in the DNN; the output of the DNN for state s can be expressed as

Pr(s | o_t) = exp( w_s^T h(o_t) ) / Σ_{s'=1}^{N} exp( w_{s'}^T h(o_t) ) (4.11)

where w_s are the weights connecting the last hidden layer to the output state s,
and N is the number of states. For example, in frame classification, given an
observation o_t, the corresponding state can be inferred by

s* = arg max_s Pr(s | o_t) (4.12)

For the multiclass SVM [8], the classification function is

s* = arg max_s w_s^T φ(o_t) (4.13)

where φ(o_t) is the predefined feature space and w_s is the weight parameter for
class/state s. Two algorithms, at the frame and sequence level, are proposed to
estimate the parameters of the SVM (in the last layer) and to update the
parameters of the DNN (in all previous layers) using maximum-margin criteria.
The resulting model is named the Deep Neural SVM (DNSVM). Its architecture
is illustrated in Fig. 4.2.
Fig.4.2. Architecture of Deep Neural Support Vector Machine
The double-headed arrows in the figure illustrate the scope of the parameters
for the DNN, the multiclass SVM and the structured SVM. For sequence-level
max-margin training, the dark straight arrows (in the trellis) represent the
reference state sequence, and the dashed arrows represent the most competing
state sequence.
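A small sketch contrasting the two decision rules (4.12) and (4.13): with shared features h(o_t), the arg max is the same for both, so the two models differ in how the weights are trained, not in the form of the decision:

    import numpy as np

    def classify_softmax(h, W):
        # Eq. (4.12): arg max of the softmax posteriors; W holds one row w_s per state.
        scores = W @ h
        post = np.exp(scores - scores.max())
        post /= post.sum()
        return int(np.argmax(post))

    def classify_svm(h, W):
        # Eq. (4.13): arg max of the linear scores w_s^T phi(o_t), with phi(o_t) = h(o_t).
        return int(np.argmax(W @ h))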
4.6.1 Frame-Level Max-Margin Training
The training observations and their corresponding state labels are
{(o_t, s_t)}, t = 1, …, T, where s_t ∈ {1, …, N}. In frame-level training, the
parameters of the DNN are normally estimated by minimizing the Cross Entropy
(CE). Taking φ(o_t) = h(o_t) as the feature space derived from the DNN, the
parameters of the last layer are first estimated using the multiclass SVM training
algorithm [21],

min (1/2) Σ_s ||w_s||² + C Σ_t ξ_t²
s.t. w_{s_t}^T φ(o_t) − max_{s ≠ s_t} w_s^T φ(o_t) ≥ 1 − ξ_t,
for every training frame t = 1, …, T (4.14)

where ξ_t ≥ 0 is the slack variable which penalizes the data points that violate
the margin requirement. The only difference from the standard formulation
comes from the constraints, which basically say that the score of the correct
state label, w_{s_t}^T φ(o_t), has to be greater than the score of any other state,
w_s^T φ(o_t), by a margin determined by the loss. According to [16], using the
squared slacks ξ_t² is slightly better than ξ_t. If the correct score,
w_{s_t}^T φ(o_t), is greater than all the competing scores, then it must be greater
than the most competing score, max_{s ≠ s_t} w_s^T φ(o_t).
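For illustration only, the frame-level step can be approximated with scikit-learn's Crammer-Singer multiclass SVM [21] trained on exported DNN features; the feature and label files are assumptions, and the squared-slack variant of Eq. (4.14) is not exposed there, so this is an approximation rather than the exact objective:

    import numpy as np
    from sklearn.svm import LinearSVC

    H = np.load("dnn_features.npy")   # (T, d) top-hidden-layer outputs h(o_t); assumed file
    y = np.load("frame_states.npy")   # (T,) frame-level state labels; assumed file

    svm = LinearSVC(multi_class="crammer_singer", C=1.0)
    svm.fit(H, y)                     # estimates the last-layer weights w_s
    pred_states = svm.predict(H)      # frame-level state decisions as in Eq. (4.13)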
4.6.2 Sequence-Level Max-Margin Training
In max-margin sequence training, for simplicity, first consider one
training utterance (O, S), where O = {o_1, …, o_T} is the observation sequence
and S = {s_1, …, s_T} is the corresponding sequence of reference states. The
parameters of the model can be estimated by maximizing the margin,

max min_{S' ≠ S} { log P(S | O) − log P(S' | O) } (4.15)
Here the margin is defined as the minimum distance between the reference
state sequence S and any competing state sequence S' in the log-posterior
domain, as illustrated in Fig. 4.3.
Fig. 4.3. SVM Hyperplane (the reference state sequence S and the most competing state sequence S' lie on opposite sides of the margin in the feature space φ(O, S)).
For DNSVM, log( P(S | O) / P(S' | O) ) can be computed via

log( P(S | O) / P(S' | O) ) = w^T ( φ(O, S) − φ(O, S') ) (4.16)

where φ(O, S) is the joint feature [22], which characterizes the dependencies
between O and S,

φ(O, S) = [ Σ_t δ(s_t, 1) log Pr(s_t = 1 | o_t) ; … ;
Σ_t δ(s_t, N) log Pr(s_t = N | o_t) ; log P(S) ] (4.17)
where δ(·,·) is the Kronecker delta (indicator) function. Here the prior, P(w),
is assumed to be a Gaussian with a zero mean and a scaled identity covariance
matrix CI; thus log P(w) = log N(w; 0, CI).

The parameters w of the DNSVM (in the last layer) can be estimated by
minimizing

(1/2) ||w||² + C Σ_i [ max_{S'} { w^T φ(O_i, S') + L(S', S_i) } − w^T φ(O_i, S_i) ] (4.18)

where L(S', S_i) is the loss between a competing state sequence S' and the
reference S_i. Objective function (4.18) for the DNSVM is the same as the
training criterion for structured SVMs with the features defined in (4.17). To
solve (4.18), the cutting plane algorithm can be applied; it requires searching
for the most competing state sequence efficiently.
The computational load during training is dominated by this search
process. To speed up the training, denominator lattices with state alignments are
used to constrain the search space. A lattice-based forward-backward search is
then applied to find the most competing state sequence S'.
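Under the reconstruction of Eq. (4.17) above, the joint feature for one utterance can be sketched as follows (per-state sums of frame log-posteriors plus one slot for the sequence log-prior):

    import numpy as np

    def joint_feature(log_post, states, log_prior, n_states):
        # log_post: (T, N) frame log-posteriors; states: length-T state sequence S.
        phi = np.zeros(n_states + 1)
        for t, s in enumerate(states):
            phi[s] += log_post[t, s]   # the delta(s_t, s) terms pick this state's slot
        phi[-1] = log_prior            # log P(S): transition / prior term
        return phi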
4.7 PERFORMANCE MEASURES
Sensitivity (also called the True Positive Rate (TPR))
It measures the proportion of positives that are correctly identified
as such (e.g., the percentage of sick people who are correctly identified as
having the condition).
Sensitivity = Number of true positives / (Number of true positives + Number of false negatives)
TPR = TP / (TP + FN) (4.19)
Specificity (also called the True Negative Rate (TNR))
It measures the proportion of negatives that are correctly identified
as such (e.g., the percentage of healthy people who are correctly
identified as not having the condition).
Specificity = Number of true negatives / (Number of true negatives + Number of false positives)
TNR = TN / (TN + FP) (4.20)
Accuracy
The accuracy is the proportion of the total number of predictions
that were correct. It can be determined by the equation,
Accuracy = (TN + TP) / (TN + TP + FN + FP) (4.21)
where TP = true positive, TN = true negative, FN = false negative, FP = false positive.
Mean Square Error
The mean square error is the average squared difference between the
original signal and the output signal. It can be calculated as
MSE = (1/N) Σ_{n=1}^{N} ( x(n) − y(n) )² (4.22)
where x(n) is the original signal, y(n) is the output signal, and N is the number
of samples.
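The measures of Eqs. (4.19)-(4.22) in a short sketch:

    import numpy as np

    def performance_measures(tp, tn, fp, fn):
        sensitivity = tp / (tp + fn)                  # Eq. (4.19)
        specificity = tn / (tn + fp)                  # Eq. (4.20)
        accuracy = (tp + tn) / (tp + tn + fp + fn)    # Eq. (4.21)
        return sensitivity, specificity, accuracy

    def mean_square_error(x, y):
        x, y = np.asarray(x, float), np.asarray(y, float)
        return float(np.mean((x - y) ** 2))           # Eq. (4.22)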
4.8 Drawbacks of DNN:
• It may converge on a local minimum rather than the global minimum.
• It overfits if training goes on too long.
• As an N-ary classifier, a Deep Neural Network must be trained for all
classes in one go.
4.9 Advantages of DNSVM:
• It overcomes the first two drawbacks of the DNN.
• As an N-ary classifier built from support vector machines, the classifiers
can be trained one by one.
CHAPTER 5
SIMULATION RESULTS
5.1 Results and Discussion
The results illustrate speech recognition based on the DNSVM approach.
The input speech signal is first converted into .wav format, and the features of
the speech signal are extracted by the Mel Frequency Cepstral Coefficient
technique. Here, feature vectors of 12 Mel Frequency Cepstral Coefficients are
obtained, and the training datasets are loaded and trained. From the trained
datasets, the given input speech signals are classified by the DNSVM.

A speech signal with a length of 13 seconds is given as input, as shown in
Fig. 5.1.
Fig. 5.1 Input Speech Signal
From the given speech signal, the features are extracted by the MFCC stage,
which yields the MFCC feature vectors.
Fig.5.2 Plots of MFCC
The Deep Neural Support Vector Machine is used for classification. The
given input speech signals are classified against the trained datasets by the
DNSVM: if a given input speech signal matches the trained datasets, the signal
is considered recognized; if it mismatches, it is not recognized.
Fig.5.3 Output Signal
Fifty samples of speech signals are extracted and trained independently,
each with its feature values. If any one of these fifty speech samples is given for
testing, the signal will be classified and recognized, and the output window will
show "signal is recognized". If any signal other than these fifty samples is given
as input, the signal will be classified but not recognized, and the output window
will show "signal is not recognized".
5.2 MEAN SQUARE ERROR (MSE)
The mean square error of the input speech signal is shown in Table 5.1.
For the Deep Neural Network, the mean square error increases as the number of
iterations increases, whereas for the Deep Neural Support Vector Machine the
mean square error decreases as the number of iterations increases. Compared to
the DNN, the DNSVM therefore shows the better mean square error.
Iterations MSE of DNN MSE of DNSVM
Iteration 1 0.141394 0.125154
Iteration 2 0.182261 0.120891
Iteration 3 0.232163 0.098774
Iteration 4 0.285192 0.078025
Iteration 5 0.331714 0.061430
Iteration 6 0.359896 0.053050
Iteration 7 0.380208 0.050005
Iteration 8 0.395030 0.048190
Iteration 9 0.404762 0.046674
Table.5.1 Mean Square Error of Speech Signal
5.3 PERFORMANCE MEASURES
Performance Metrics DNN (%) DNSVM (%)
Accuracy 92.54 96.66
Specificity 94.63 97.83
Sensitivity 90.71 95.45
Table.5.2 Performance Measures
From the above table, it is clear that Deep Neural Support Vector Machine
yields better performance in accuracy, sensitivity and specificity.
CHAPTER 6
CONCLUSION AND FUTURE WORK
A new type of DNN is introduced. Traditional DNNs use the softmax model
at the top layer for classification; the proposed DNN instead employs an SVM
at the top layer. Training algorithms are derived at the frame and sequence
levels to jointly learn the parameters of the SVM and the DNN under the
maximum-margin condition. Under frame-level training, the new model is
associated with the multiclass SVM with DNN features. Under sequence-level
training, it is associated with the structured SVM with DNN features and state
transition features.

The accuracy is better and the error rate obtained is lower when the Deep
Neural Support Vector Machine is used as the classification technique.
In this work only the top layer of the DNN is replaced by a linear SVM; the
investigation of non-linear kernels for deep SVMs can be taken up as future
work.
REFERENCES
[1] Pan Zhou, Hui Jiang, Li-Rong Dai, Yu Hu, and Qing-Feng Liu, "State-
Clustering Based Multiple Deep Neural Networks Modeling Approach for
Speech Recognition," IEEE/ACM Transactions on Audio, Speech, and
Language Processing, vol. 23, no. 4, April 2015.
[2] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu, “1-Bit Stochastic Gradient
Descent And Application To Data-Parallel Distributed Training Of Speech
DNN,” in Proc. Interspeech, 2014.
[3] O. Abdel-Hamid, A.-R. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu,
"Convolutional Neural Networks for Speech Recognition," IEEE/ACM Trans.
Audio, Speech, Lang. Process., vol. 22, no. 10, pp. 1533–1545, Oct. 2014.
[4] P. Zhou, C. Liu, Q. Liu, L. Dai, and H. Jiang, "A Cluster-Based Multiple Deep
Neural Networks Method for Large Vocabulary Continuous Speech
Recognition," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process.
(ICASSP), 2013.
[5] H.-S. Le, I. Oparin, A. Allauzen, J.-L. Gauvain, and F. Yvon, “Structured
Output Layer Neural Network Language Models For Speech Recognition,”
IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 1, pp. 197–206, Jan.
2013.
[6] S. Zhang, C. Zhang, Z. You, R. Zheng, and B. Xu, “Asynchronous Stochastic
Gradient Descent For DNN Training,” in Proc. IEEE Int. Conf. Acoust.,
Speech, Signal Process. (ICASSP), 2013.
[7] X. Chen, A. Eversole, G. Li, D. Yu, and F. Seide, "Pipelined Back-
Propagation for Context-Dependent Deep Neural Networks," in Proc.
Interspeech, 2012.
[8] D. Yu, F. Seide, G. Li, and L. Deng, "Exploiting Sparseness in Deep Neural
Networks for Large Vocabulary Speech Recognition," in Proc. IEEE Int. Conf.
Acoust., Speech, Signal Process. (ICASSP), 2012.
[9] G. E. Dahl, D. Yu, L. Deng, and A. Acero, “Context-Dependent Pretrained
Deep Neural Networks For Large Vocabulary Speech Recognition,” IEEE
Trans. Audio, Speech, Lang. Process., vol. 20, no. 1, pp. 30–42, Jan. 2012.
[10] A.-R. Mohamed, G. Hinton, and G. Penn, “Understanding How Deep
Belief Networks Perform Acoustic Modeling,” in Proc. IEEE Int. Conf.
Acoust., Speech, Signal Process. (ICASSP), 2012.
[11] F. Seide, G. Li, X. Chen, and D. Yu, “Feature Engineering In Context-
Dependent Deep Neural Networks For Conversational Speech Transcription,”
in Proc. IEEE Workshop Automatic. Speech Recognition. Understand.
(ASRU), 2011.
[12] F. Seide, G. Li, and D. Yu, “Conversational Speech Transcription Using
Context-Dependent Deep Neural Networks,” in Proc. Interspeech, 2011, pp.
437–440.
[13] G. Dahl, D. Yu, L. Deng, and A. Acero, “Large Vocabulary Continuous
Speech Recognition With Context-Dependent DBN-HMM,” in Proc. IEEE Int.
Conf. Acoust., Speech, Signal Process. (ICASSP), 2011, pp. 4688–4691.
[14] D. Yu, L. Deng, and G. Dahl, “Roles of Pre-Training And Fine-Tuning
In Context-Dependent DBN-HMM For Real-World Speech Recognition,” in
Proc. NIPS Workshop Deep Learning Unsupervised. Feat. Learn, 2010.
[15] K. Veselý, L. Burget, and F. Grézl, “Parallel Training Of Neural
Networks For Speech Recognition,” in Proc. Interspeech, 2010.
[16] J. Park, F. Diehl, M. Gales, M. Tomalin, and P. Woodland, “Efficient
Generation And Use Of MLP Features For Arabic Speech Recognition,” in
Proc. Interspeech, 2009.
[17] S. Kontár, “Parallel Training Of Neural Networks for Speech
Recognition,” in Proc. 12th Int. Conf. Soft Computing, 2006.
[18] Pratik K. Kurzekar, Rathnadeep R. Deshmukh, and Pukhraj P. Shrishrimal,
"A Comparative Study of Feature Extraction Techniques for Speech
Recognition System," International Journal of Innovative Research in Science,
Engineering and Technology, vol. 3, Dec. 2014.
[19] Y. Tang, “Deep Learning Using Linear Support Vector Machines,” in
International Conference on Machine Learning, December 2013.
[20] J. Chen, K. K. Paliwal, M. Mizumachi, and S. Nakamura, "Robust
MFCC Derived From Differentiated Power Spectrum," in Proc. Eurospeech
2001, Scandinavia, 2001.
[21] K. Crammer and Y. Singer, "On the Algorithmic Implementation of
Multiclass Kernel-Based Vector Machines," Journal of Machine Learning
Research, vol. 2, pp. 265–292, 2001.
[22] S.-X. Zhang, A. Ragni, and M. J. F. Gales, “Structured Log Linear
Models For Noise Robust Speech Recognition,” Signal Processing Letters,
IEEE, vol. 17, pp. 945–948, 2010.
PUBLICATION
International Conference
Presented a paper titled "Multiple Clustering Based on Deep Neural Support
Vector Machine for Speech Recognition" at the IEEE-sponsored 3rd
International Conference on Innovations in Information, Embedded and
Communications Systems (ICIIECS'16), vol. 4, pp. 1066–1070, held on 18th
March 2016 at Karpagam College of Technology.