Isolated Digit Recognizer using GMMs ECE5526 FINAL PROJECT
SPRING 2011 JIM BRYAN
Slide 2
Abstract
This presentation provides an in-depth look at how GMMs can be used for word recognition with Matlab's Statistics Toolbox. The isolated digit recognizer is built on a voice activity detector that uses energy thresholding and zero-crossing detection. Moreover, the recognizer uses MFCCs as the acoustic representation of speech. These are standard voice processing techniques with which the reader is assumed to be familiar. The focus of this presentation is on the details of the GMM implementation in Matlab, with the idea that a good understanding of the Matlab approach will yield insight into other system implementations such as Sphinx and HTK.
Word recognition has two components, model training and model testing. The Statistics Toolbox function gmdistribution.fit is used for training. The Statistics Toolbox function posterior is used for testing. The purpose of this effort is to train and run the recognizer, and to understand the basic functionality of the gmdistribution.fit and posterior function calls.
Slide 3
Introduction
Based on the MATLAB Digest article (January 2010) "Developing an Isolated Word Recognition System in MATLAB" by Daryl Ning.
Describe the Matlab GUI-based recognizer application.
Provide introductory material on GMMs using a simple 2-mixture example with 2 models.
Discuss in detail the algorithms used to determine the best model match.
Show examples of the Matlab Statistics Toolbox representation of GMMs.
Run the simulation.
Discuss simulation results and show possible improvements.
Summary.
Conclusions.
Areas for further study.
Slide 4
Isolated digit recognizer overview
Uses one 8-mixture GMM per digit to train on and recognize an individual user's voice.
The Matlab GUI-based digit recognizer uses the following toolboxes:
  The Signal Processing Toolbox provides filtering and signal processing functions.
  The Statistics Toolbox implements the GMM Expectation Maximization algorithm used to build the GMMs, and computes the Mahalanobis distance during recognition.
  The Data Acquisition Toolbox streams the microphone input to Matlab for continuous recognition.
The single-digit recognizer is implemented using a dictionary of the digits 0-9.
Training is done with 30-second captures of repeated utterances of a given digit, read using the wavread function in Matlab.
Data is input continuously to Matlab via the Data Acquisition Toolbox while the GUI recognizer is running.
The recognized digit is displayed on the GUI.
Slide 5
Overview continued
Uses the laptop's internal microphone.
The sample rate is 8 ksps.
Uses 20 msec frames (160 samples per frame) with a 10 msec overlap.
Uses a simple voice activity detector based on an energy threshold and zero crossings per second, for both training and the recognizer.
The voice activity energy and zero-crossing thresholds are programmable and must be the same for training and recognition.
There is no model for silence or a missed digit, so the recognizer displays the closest digit.
Slide 6
GMM training and recognizer Matlab function calls
The recognizer computes the posterior probabilities using the Statistics Toolbox function posterior.
posterior accepts a gmm object (model) and an input data set, and returns the posterior probabilities together with a negative log-likelihood number that measures how well the data set matches the model.
The smallest negative log-likelihood corresponds to the highest likelihood, that is, the best match.
The recognizer scores the current word against each model in the dictionary. The model with the lowest negative log-likelihood is the recognized digit.
A gmm object is created during training for each dictionary entry, in this case the digits 0-9, using the function call gmdistribution.fit.
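A minimal sketch of this train-then-score flow, using toy 2-D data in place of MFCC vectors. It assumes the Statistics Toolbox is installed and uses fitgmdist, the replacement that newer releases recommend for gmdistribution.fit. The variable names, the data and the regularization value are illustrative assumptions, not taken from the recognizer.

```matlab
% Toy stand-ins for the MFCC vectors of two dictionary entries (assumed data).
trainA = [randn(200, 2);     randn(200, 2) + 2];    % "digit A" training vectors
trainB = [randn(200, 2) + 8; randn(200, 2) + 10];   % "digit B" training vectors

% Train one 2-mixture, diagonal-covariance GMM per dictionary entry.
% A small regularization value guards against ill-conditioned covariances.
gmA = fitgmdist(trainA, 2, 'CovarianceType', 'diagonal', 'RegularizationValue', 1e-6);
gmB = fitgmdist(trainB, 2, 'CovarianceType', 'diagonal', 'RegularizationValue', 1e-6);

% Score an unseen utterance (drawn like "digit A") against every model.
testA = [randn(20, 2); randn(20, 2) + 2];
[~, nloglA] = posterior(gmA, testA);   % second output: negative log-likelihood
[~, nloglB] = posterior(gmB, testA);

% The model with the smallest negative log-likelihood wins.
[~, best] = min([nloglA nloglB]);
fprintf('nlogl under A = %.1f, under B = %.1f, best model index = %d\n', ...
        nloglA, nloglB, best);
```

In the recognizer the same comparison would run over ten models instead of two.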
Slide 7
Example using 2 GMMs with 2 mixtures
Slide 8
Posterior
posterior extracts the gmdistribution object parameters necessary to call wdensity.
wdensity performs the actual log-likelihood calculation for the GMM, given the data set.
wdensity returns two arrays:
  log_lh is an array of size length(data) x order(GMM).
  mahalaD is an array of size length(data) x order(GMM). This is not the actual Mahalanobis distance but its square, the quadratic form $\text{mahalaD} = (x - \mu)\,\Sigma^{-1}\,(x - \mu)^T$.
estep calculates the log-likelihood from the log_lh array and returns ll, which is the log-likelihood of the data x given the model.
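The two arrays can be illustrated with a short sketch for the diagonal-covariance case. This is not the toolbox's internal code: it is a hand-written version of the standard weighted Gaussian log-density, and all numbers and variable names below are made-up toy values.

```matlab
% Toy GMM with k = 2 mixtures in d = 2 dimensions (assumed values).
X     = [0.0 0.0; 1.0 2.0; 3.0 3.0];   % n-by-d data, one row per frame
mu    = [0.0 0.0; 3.0 3.0];            % k-by-d mean vectors
sigma = [1.0 1.0; 0.5 2.0];            % k-by-d variances (diagonals of the covariances)
p     = [0.4 0.6];                     % 1-by-k mixture weights

[n, d] = size(X);
k = size(mu, 1);
log_lh  = zeros(n, k);
mahalaD = zeros(n, k);
for j = 1:k
    diffs = bsxfun(@minus, X, mu(j, :));                 % x - mu_j for every row
    % Quadratic form (x - mu) * inv(Sigma) * (x - mu)' for a diagonal Sigma.
    mahalaD(:, j) = sum(bsxfun(@rdivide, diffs.^2, sigma(j, :)), 2);
    logDetSigma = sum(log(sigma(j, :)));                 % log-determinant of a diagonal matrix
    % Weighted log-density: log(p_j) + log N(x; mu_j, Sigma_j).
    log_lh(:, j) = log(p(j)) - 0.5 * (mahalaD(:, j) + logDetSigma + d * log(2 * pi));
end
disp(log_lh);
disp(mahalaD);
```

Both outputs have one row per data point and one column per mixture, matching the length(data) x order(GMM) shape described above.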
Slide 9
wdensity function description
Example function call: [log_lh,mahalaD]=wdensity(X, mu, Sigma, p, sharedCov, CovType)
where:
  X is the input data.
  mu is an array of means, with (j,:) corresponding to the jth mean vector.
  Sigma is an array of arrays, with (:,:,j) corresponding to the jth sigma in the model.
  p are the mixture weights.
  sharedCov indicates that the covariance matrices may be common to all mixtures.
  CovType may be either diagonal or full.
estep
[ll, post, logpdf]=estep(log_lh)
Find the max of each row of the log_lh matrix (maxll). This identifies the closest mixture for each data point.
Convert the log_lh values to relative probabilities using post = exp(bsxfun(@minus, log_lh, maxll)). There will always be a 1 in the column of the maximum value.
Sum across the rows to get the normalizer for the relative probabilities, density = sum(post,2). Because each row contains a 1, density is always >= 1.
Normalize the posteriors: post = bsxfun(@rdivide, post, density).
Calculate logpdf = log(density) + maxll and ll = sum(logpdf).
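The estep steps above amount to the usual log-sum-exp trick. The sketch below strings them together on a toy log_lh matrix (assumed values, chosen so that a direct exp() underflows) and compares the result with the naive calculation.

```matlab
% Toy input: 3 data points, 2 mixtures (assumed values).
log_lh = [-800.0 -802.0;
            -3.0  -20.0;
          -750.5 -750.5];

maxll   = max(log_lh, [], 2);                   % row-wise maximum
post    = exp(bsxfun(@minus, log_lh, maxll));   % each row now contains a 1
density = sum(post, 2);                         % therefore always >= 1
post    = bsxfun(@rdivide, post, density);      % rows sum to 1
logpdf  = log(density) + maxll;                 % per-point log-likelihood
ll      = sum(logpdf);                          % log-likelihood of the data set

% Naive version for comparison: exp(-800) underflows to 0, so log gives -Inf.
naive = log(sum(exp(log_lh), 2));

disp([logpdf naive]);
```

Subtracting maxll before exponentiating keeps every row finite, which matters because per-frame log-likelihoods of 39-dimensional MFCC vectors can be large negative numbers.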
Slide 13
Estep example showing log_lh inputs for two Gaussian mixtures and the maximum value of the log_lh
P11, data from model
log_lh (mixture 1)   log_lh (mixture 2)   maxll
-18.6236             -3.0708              -3.0708
-36.2569             -3.0821              -3.0821
-24.1669             -2.2514              -2.2514
-33.8821             -3.2357              -3.2357
-18.4447             -3.2818              -3.2818
 -5.8488             -4.2339              -4.2339
-18.4529             -2.5661              -2.5661
-14.7058             -3.5421              -3.5421
 -2.7563            -19.3866              -2.7563
 -3.0744            -21.2154              -3.0744
 -2.4251            -14.8179              -2.4251
 -4.1699            -12.7317              -4.1699
 -2.5825            -16.8520              -2.5825
 -4.4938             -8.5847              -4.4938
 -3.7883            -13.7861              -3.7883
 -2.8691             -7.2573              -2.8691
Slide 14
Estep example showing post and density; density is used to normalize post
P11, data from model
post = exp(bsxfun(@minus, log_lh, maxll));
1.0000 0.0000
0.0000 1.0000
1.0000 0.0000
1.0000 0.0001
0.0832 1.0000
1.0000 0.0109
1.0000 0.0008
0.0000 1.0000
0.0003 1.0000
0.0001 1.0000
0.0000 1.0000
density = sum(post,2)
1.0000 1.0001 1.0832 1.0109 1.0008 1.0000 1.0003 1.0001 1.0000
Slide 15
Estep example showing post after normalization and logpdf
P11, data from model
post = bsxfun(@rdivide, post, density)
1.0000 0.0000
0.0000 1.0000
1.0000 0.0000
0.9999 0.0001
0.0768 0.9232
0.9892 0.0108
0.9992 0.0008
0.0000 1.0000
0.0003 0.9997
0.0001 0.9999
0.0000 1.0000
logpdf = log(density) + maxll;
-3.6490 -4.6937 -2.3765 -3.3219 -3.1317 -4.4911 -4.0361 -3.8076
-2.7171 -2.5739 -2.3359 -2.6023 -2.1502 -5.5963 -2.2777 -3.9857
ll = sum(logpdf) = -53.7464
Slide 16
Estep example showing log_lh inputs for two Gaussian mixtures and the maximum value of the log_lh
P12, data not from model
log_lh (mixture 1)   log_lh (mixture 2)   maxll
 -6.2916             -6.2281              -6.2281
 -6.1189             -7.3603              -6.1189
-12.5238             -2.5414              -2.5414
 -7.3336            -24.5710              -7.3336
 -7.0679            -14.3058              -7.0679
 -5.7049             -7.7255              -5.7049
 -7.8564            -23.6082              -7.8564
 -6.8128             -4.4655              -4.4655
-27.4139            -19.2832             -19.2832
-20.1139            -14.0730             -14.0730
-27.0048            -11.4791             -11.4791
-17.2614             -8.2714              -8.2714
-33.8912            -15.5351             -15.5351
-26.0666             -9.9934              -9.9934
-20.4353             -9.9218              -9.9218
-15.9387            -13.2732             -13.2732
Slide 17
Estep example showing post and density; density is used to normalize post
P12, data not from model
post = exp(bsxfun(@minus, log_lh, maxll));
0.9384 1.0000
1.0000 0.2890
0.0000 1.0000
1.0000 0.0000
1.0000 0.0007
1.0000 0.1326
1.0000 0.0000
0.0956 1.0000
0.0003 1.0000
0.0024 1.0000
0.0000 1.0000
0.0001 1.0000
0.0000 1.0000
0.0696 1.0000
density = sum(post,2)
1.9384 1.2890 1.0000 1.0007 1.1326 1.0000 1.0956 1.0003 1.0024 1.0000 1.0001 1.0000 1.0696
Slide 18
Estep example showing post after normalization and logpdf
P12, data not from model
post = bsxfun(@rdivide, post, density)
0.4841 0.5159
0.7758 0.2242
0.0000 1.0000
1.0000 0.0000
0.9993 0.0007
0.8829 0.1171
1.0000 0.0000
0.0873 0.9127
0.0003 0.9997
0.0024 0.9976
0.0000 1.0000
0.0001 0.9999
0.0000 1.0000
0.0650 0.9350
logpdf = log(density) + maxll;
-5.5662 -5.8650 -2.5414 -7.3336 -7.0671 -5.5804 -7.8564 -4.3742
-19.2829 -14.0706 -11.4791 -8.2713 -15.5351 -9.9934 -9.9218 -13.2060
ll = sum(logpdf) = -147.9445
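As a cross-check, the P12 figures can be recomputed from the log_lh values listed on Slide 16. The sketch below types those 16 rows in by hand and applies the estep formulas. Because the inputs are rounded to four decimals, the result should agree with the slide values only to about three decimals.

```matlab
% log_lh for the P12 case, copied from Slide 16 (mixture 1, mixture 2).
log_lh = [ -6.2916  -6.2281;  -6.1189  -7.3603; -12.5238  -2.5414;  -7.3336 -24.5710;
           -7.0679 -14.3058;  -5.7049  -7.7255;  -7.8564 -23.6082;  -6.8128  -4.4655;
          -27.4139 -19.2832; -20.1139 -14.0730; -27.0048 -11.4791; -17.2614  -8.2714;
          -33.8912 -15.5351; -26.0666  -9.9934; -20.4353  -9.9218; -15.9387 -13.2732];

maxll   = max(log_lh, [], 2);
density = sum(exp(bsxfun(@minus, log_lh, maxll)), 2);
logpdf  = log(density) + maxll;
ll      = sum(logpdf);

fprintf('density(1) = %.4f\n', density(1));   % compare with 1.9384 on Slide 17
fprintf('ll         = %.4f\n', ll);           % compare with -147.9445 on Slide 18
```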
Slide 19
Log-likelihood for 2 mixture example
P = Nlogl = -ll
 55.3416   109.3820
184.7868    42.8043
The diagonal terms are the cases where the data came from the model. The off-diagonal terms are the cases where the data came from the other model.
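The recognition decision can be read straight off this matrix. The sketch below assumes that rows index the data set and columns index the model being scored (the slide does not state the orientation). For each data set it picks the model with the smallest negative log-likelihood.

```matlab
% Negative log-likelihoods from the 2-model example.
% Assumed layout: rows = data sets, columns = models.
nlogl = [ 55.3416  109.3820;
         184.7868   42.8043];

% Smallest negative log-likelihood per row = best-matching model.
[bestScore, bestModel] = min(nlogl, [], 2);

for i = 1:size(nlogl, 1)
    fprintf('data set %d -> model %d (nlogl = %.4f)\n', i, bestModel(i), bestScore(i));
end
```

Here each data set selects its own model, which is the diagonal of the matrix. With the opposite orientation the minimum would be taken down each column and would give the same diagonal.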
Slide 20
Gaussian Models in Matlab
Slide 21
Model for one
Slide 22
Gaussian Mixture Distribution Structure one
Slide 23
8 Gaussian model means 8x39 one
Slide 24
Diagonal Covariance Matrix
Slide 25
Training the GMMs
Before recording can begin, it is necessary to set up the laptop's internal microphone.
Training involves finding a quiet environment and recording 30 seconds of utterances for each digit.
These are captured using Matlab's wavrecord:
y = wavrecord(30*8000,8000);
A supplied utility allows viewing the output of the voice activity detection algorithm in order to confirm that the training data was captured correctly:
speechdetect(y);
Slide 26
Trainmodels overview
Generates frames of speech based on 160 samples/frame with an 80-sample overlap.
Uses the same energy-detect and zero-crossing thresholds as the recognizer.
Determines portions of voiced speech based on these thresholds, as well as a minimum duration of 250 msec for each word.
A minimum of 100 msec is required between words.
Frames are marked as VA (voice active) and stored in a buffer called ALLdata. ALLdata is arranged so that the frames are in columns; its dimensions are 160 x numFRAMES.
Once all the words are captured, MFCC is called with the ALLdata buffer for Mel cepstral coefficient processing.
MFCC returns MFCC vectors of 39 coefficients per frame.
gmdistribution.fit is passed the MFCC vectors and runs an EM algorithm on them to generate an 8-mixture GMM for each digit.
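The framing step can be sketched in base Matlab as follows. The input is a one-second random stand-in for a recording, and the sketch keeps every frame, whereas trainmodels keeps only the frames marked voice active. Variable names other than the 160 x numFrames layout are illustrative.

```matlab
fs       = 8000;   % sample rate in Hz
frameLen = 160;    % samples per frame (20 msec)
hop      = 80;     % frame advance in samples (80-sample overlap)

y = randn(fs, 1);  % stand-in for a recorded utterance (assumed data)

% Number of complete frames that fit in the signal.
numFrames = floor((length(y) - frameLen) / hop) + 1;

% Index matrix: column m holds the sample indices of frame m.
idx    = bsxfun(@plus, (1:frameLen)', (0:numFrames - 1) * hop);
frames = y(idx);   % 160 x numFrames, one frame per column

fprintf('%d frames of %d samples\n', size(frames, 2), size(frames, 1));
```

Consecutive columns share 80 samples: the second half of one frame is the first half of the next.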
Slide 27
MFCC credits
Derived from the original function 'mfcc.m' in the Auditory Toolbox, written by:
Malcolm Slaney
Interval Research Corporation
http://cobweb.ecn.purdue.edu/~malcolm/interval/1998-010/
Also uses the 'deltacoeff.m' function written by:
Olutope Foluso Omogbenigun
London Metropolitan University
http://www.mathworks.com/matlabcentral/fileexchange/19298
Slide 28
MFCC overview
Pre-filter the data using a pre-emphasis filter:
preEmphasized = filter([1 -.97], 1, input);
Window the data with a Hamming window:
preEmphasized = preEmphasized.*repmat(hamWindow(:),1,frames);
fftMag = abs(fft(preEmphasized,fftSize));
earMag = log10(mfccFilterWeights * fftMag);
ceps = mfccDCTMatrix * earMag;
meanceps = mean(ceps,2);
ceps = ceps - repmat(meanceps,1,frames);
d = (deltacoeff(ceps')).*0.6; %Computes delta-mfcc
d1 = (deltacoeff(d)).*0.4; %as above for delta-delta-mfcc
ceps = [ceps; d'; d1']; %concatenates all together
Returns a vector of 13 cep, 13 diff and 13 diff-diff coefficients per frame.
Slide 29
Sound Settings for Microphone on Windows 7 laptop
Slide 30
Voice Activity Detector Overview
Voice activity detection is based on energy detection and zero-crossing rate.
std_zxings is the zero-crossing threshold, default = .5
std_energy is the energy-detect threshold, default = .5
The first 500 msec of the training recording are used to measure the background (silence) energy and zero-crossing rate, from which the energy and zero-crossing thresholds are determined.
The same threshold settings must be used for all digit recordings.
Once a good recording has been made, save it to the hard drive using:
wavwrite(y,8000,'one.wav');
Repeat for all the digits.
Run transcript to train the GMMs.
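A minimal sketch of an energy and zero-crossing detector along these lines. The threshold rule (background mean plus std_energy or std_zxings times the background standard deviation) and the OR combination of the two tests are assumptions made for illustration; the slides do not spell out how the detector uses these parameters. The signal is synthetic: 500 msec of low-level noise followed by a 200 Hz tone.

```matlab
fs = 8000; frameLen = 160; hop = 80;
std_energy = 0.5;   % energy threshold parameter (default from the slides)
std_zxings = 0.5;   % zero-crossing threshold parameter (default from the slides)

% Synthetic recording (assumed data): background noise, then a "voiced" tone.
t = (0:fs/2 - 1)' / fs;
y = [0.01 * randn(fs/2, 1); 0.5 * sin(2 * pi * 200 * t)];

% Split into 160-sample frames with an 80-sample hop, one frame per column.
numFrames = floor((length(y) - frameLen) / hop) + 1;
idx    = bsxfun(@plus, (1:frameLen)', (0:numFrames - 1) * hop);
frames = y(idx);

energy = sum(frames.^2, 1);                              % energy per frame
zxings = sum(abs(diff(double(frames >= 0), 1, 1)), 1);   % sign changes per frame

% Background statistics from the frames that lie fully inside the first 500 msec.
nBg  = floor((0.5 * fs - frameLen) / hop) + 1;
thrE = mean(energy(1:nBg)) + std_energy * std(energy(1:nBg));   % assumed rule
thrZ = mean(zxings(1:nBg)) + std_zxings * std(zxings(1:nBg));   % assumed rule

% Assumed decision: a frame is voice active if either measure exceeds its threshold.
active = (energy > thrE) | (zxings > thrZ);
fprintf('%d of %d frames marked voice active\n', sum(active), numFrames);
```

With multipliers as low as 0.5, some pure-noise frames are also expected to cross the thresholds, which is one reason the threshold settings matter.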
Slide 31
Authors Ideal Voice Activity detector
Slide 32
Voice Detect using default thresholds digit = one
Slide 33
Voice Detect using default thresholds 1,1 digit = one
Slide 34
Voice Detect using default thresholds 1.5,1.5 digit = one
Slide 35
Transcript reads each model and calls trainmodels
y = wavread('one.wav'); trainmodels(y,'one');
y = wavread('two.wav'); trainmodels(y,'two');
y = wavread('three.wav'); trainmodels(y,'three');
y = wavread('four.wav'); trainmodels(y,'four');
y = wavread('five.wav'); trainmodels(y,'five');
y = wavread('six.wav'); trainmodels(y,'six');
y = wavread('seven.wav'); trainmodels(y,'seven');
y = wavread('eight.wav'); trainmodels(y,'eight');
y = wavread('nine.wav'); trainmodels(y,'nine');
y = wavread('zero.wav'); trainmodels(y,'zero');
Slide 36
GMM dimensions for typical utterance
Assume the average digit length is 300 msec.
Fs = 8000 Hz, so 1/Fs = 125 µsec.
160 samples/Fs = 20 msec.
Since the frames use a Hamming window with 50% overlap, 1 frame occurs every 10 msec.
Average number of frames per word: 300/10 = 30.
MFCC takes in 30x160 samples and produces 30x39 MFCC vectors on average.
Average size of the log_lh array per word for 8 Gaussian mixtures = 30x8.
The log-likelihood is based on an average 30x8 matrix.
Voice activity detect filter implemented as a 128-tap FIR filter based on a Chebyshev window with 40 dB sidelobe attenuation
Slide 38
Voice detector using a 125-750 Hz, 128-tap Chebyshev bandpass filter with 40 dB sidelobe suppression, a 20 msec pre one-shot and a 40 msec post one-shot
digit = one
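One way to design a filter with these parameters is the window method from the Signal Processing Toolbox. The sketch below assumes fir1 and chebwin are available and is only a plausible reconstruction from the figures on the slide, not the project's actual design code. The input signal is a random stand-in.

```matlab
% Requires the Signal Processing Toolbox (fir1, chebwin).
fs    = 8000;         % sample rate in Hz
nTaps = 128;          % filter length
band  = [125 750];    % passband edges in Hz

win = chebwin(nTaps, 40);                              % Chebyshev window, 40 dB sidelobes
b   = fir1(nTaps - 1, band / (fs / 2), 'bandpass', win);   % order nTaps-1 gives nTaps coefficients

y         = randn(fs, 1);         % stand-in for a recorded utterance (assumed data)
yFiltered = filter(b, 1, y);      % band-limited signal for the voice activity detector

fprintf('designed %d-tap bandpass filter\n', numel(b));
```

Restricting the detector input to this band removes low-frequency hum and much of the high-frequency noise before the energy test is applied.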
Slide 39
Training Vector for digit one after modified VA detection
Slide 40
Scoring
It is difficult to score based on the real-time recognizer.
The recognizer fires on ambient noise.
The recognizer is slow, as it has to perform the GMM calculation for all dictionary entries.
A recorded test set, counting 1-9, 0, produced 70% accuracy; two, seven and eight were not classified correctly.
The zero-crossing threshold had to be lowered for the test in order to collect all the utterances.
The low accuracy might be due to insufficient training data.
There could be bad models for some of the classes.
Hand scoring is difficult because each utterance must be labeled correctly for the classifier. Seven had a null portion in the middle.
The laptop computer's fan kicked on during training. This caused ambient noise during training, so the data set was not perfect.
Slide 41
Test Set counting 1-9,0 and repeat frame based with silence
removed
Slide 42
Summary
8-mixture GMMs for speech recognition were demonstrated.
Digit recognition was demonstrated using only a small training set, a laptop microphone and an 8000 Hz sample rate.
Care and feeding of the GMMs is very important for a successful implementation. Garbage in, garbage out is especially true for speech recognition.
Background noise is a very big problem for accurate speech recognition. Adaptive noise cancellation using a second microphone for just the background noise should improve accuracy.
The voice activity detector is a critical component of the recognizer.
Scoring is also a difficult problem, as the acoustic data must be synchronized with the dictionary to provide accurate results.
Marking the speech pattern and isolating words is not without difficulties, as pauses between syllables occur during a single utterance.
Slide 43
Conclusion
GMMs are very powerful models for speech recognition.
Scoring the models is difficult.
The EM algorithm will produce different models depending on the random seeding of the starting conditions.
About 15 repetitions of a simple utterance are not sufficient for good GMM accuracy.
The voice activity detector plays a significant part in the training and testing of the data.
A new voice activity detector did not magically produce 100 percent scoring accuracy with a recorded test wav file.
Noise cancellation techniques and sophisticated voice detection algorithms are necessary for good performance, as is model optimization.
Slide 44
Areas for further investigation
Automate the scoring process.
Improve the voice activity detector in the real-time recognizer.
Add a second microphone for adaptive noise cancellation.
Convert the GMMs to a combination of GMMs and HMMs so the dictionary search isn't so computationally intensive.
Modify the number of mixtures of the GMMs with an HMM phonetic implementation.
HMMs will allow for continuous digit recognition.