(IJCSIS) International Journal of Computer Science and Information Security,
Vol. 9, No. 1, 2011
Comparative Analysis of Speaker Identification using
row mean of DFT, DCT, DST and Walsh Transforms
Dr. H B Kekre
Senior Professor, Computer Department,
MPSTME, NMIMS University,
Mumbai, India
Vaishali Kulkarni
Associate Professor, Electronics & Telecommunication,
MPSTME, NMIMS University,
Mumbai, India
Abstract — In this paper we propose speaker identification using four different transform techniques. The feature vectors are the row means of the transforms for different groupings. Experiments were performed on the Discrete Fourier Transform (DFT), Discrete Cosine Transform (DCT), Discrete Sine Transform (DST) and Walsh Transform (WHT). All the transforms give an accuracy of more than 80% for the different groupings considered. Accuracy increases as the number of samples grouped is increased from 64 onwards, but for groupings of more than 1024 samples the accuracy starts decreasing again. The results show that DST performs best: the maximum accuracy obtained for DST is 96% for a grouping of 1024 samples while taking the transform.
Keywords - Euclidean distance, Row mean, Speaker Identification,
Speaker Recognition
I. INTRODUCTION
Human speech conveys an abundance of information, from the language and gender to the identity of the person speaking. The purpose of a speaker recognition system is thus to extract the unique characteristics of a speech signal that identify a particular speaker [1, 2, 3]. Speaker recognition systems are usually classified into two subdivisions, speaker identification and speaker verification. Speaker identification (also known as closed set identification) is a 1:N matching process where the identity of a person must be determined from a set of known speakers [3 - 5]. Speaker verification (also known as open set identification) serves to establish whether the speaker is who he claims to be [6]. Speaker recognition can be further classified into text-dependent and text-independent systems. In a text-dependent system, the system knows what utterances to expect from the speaker. However, in a text-independent system, no assumptions about the text can be made, and the system must be more flexible than a text-dependent system.
Speaker recognition technology has made it possible to use the speaker's voice to control access to restricted services, for example, giving commands to a computer, phone access to banking, database services, shopping or voice mail, and access to secure equipment. Speaker recognition systems have been developed for a wide range of applications [7 - 10].
Although many new techniques have been developed, widespread deployment of applications and services is still not possible, as none of these systems gives accurate and reliable results. We have proposed speaker recognition using vector quantization in the time domain by using the LBG (Linde-Buzo-Gray), KFCG (Kekre's Fast Codebook Generation) and KMCG (Kekre's Median Codebook Generation) algorithms [11], [12], [13], and in the transform domain using DFT, DCT and DST [14].
The concept of the row mean of transform techniques has been used for content based image retrieval (CBIR) [15 - 18]. This technique has also been applied to speaker identification by first converting the speech signal into a spectrogram [19].
For the purposes of this paper, we will be considering a speaker identification system that is text-dependent. For the identification purpose, the feature vectors are extracted by taking the row mean of the transforms (which is a column vector). The technique is used as shown in figure 1. Here a speech signal of 15 samples is divided into 3 blocks of 5 samples each, and these 3 blocks form the columns of the matrix whose transform is taken. Then the mean of the absolute value of each row of the transform matrix is taken, and this forms the column vector of means.
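As a concrete illustration, the following minimal Python sketch (our own illustrative example using NumPy; the 15 sample values are placeholders, and the DFT stands in for any of the four transforms) reproduces the steps of figure 1:

    import numpy as np

    # Toy speech signal of 15 samples (placeholder values 1..15, as in figure 1)
    signal = np.arange(1, 16, dtype=float)

    # Divide into 3 blocks of 5 samples; the blocks become columns of a 5 x 3 matrix
    matrix = signal.reshape(3, 5).T

    # Transform each column (the DFT is shown; DCT, DST or Walsh work the same way)
    transformed = np.fft.fft(matrix, axis=0)

    # Mean of the absolute value of each row gives the length-5 vector of row means
    feature_vector = np.mean(np.abs(transformed), axis=1)
    print(feature_vector)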
The rest of the paper is organized as follows: Section II explains the transform techniques used for feature generation, Section III deals with feature extraction, the results (including feature matching) are explained in Section IV, and the conclusion is given in Section V.
II. TRANSFORM TECHNIQUES
A. Discrete Fourier Transform
Spectral analysis is the process of identifying component frequencies in data. For discrete data, the computational basis of spectral analysis is the discrete Fourier transform (DFT). The DFT transforms time- or space-based data into frequency-based data. The DFT allows you to efficiently estimate component frequencies in data from a discrete set of values sampled at a fixed rate. If the speech signal is represented by $y(t)$, then the DFT of the time series or samples $y_0, y_1, y_2, \dots, y_{N-1}$ is defined as:
$Y_k = \sum_{n=0}^{N-1} y_n e^{-2j\pi kn/N}$    (1)

where $y_n = y_s(n\Delta t)$; $k = 0, 1, 2, \dots, N-1$; and $\Delta t$ is the sampling interval.
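As a quick sanity check (an illustrative aside, not part of the original experiments), equation (1) can be evaluated directly and compared against a library FFT, which implements the same definition:

    import numpy as np

    y = np.random.randn(8)   # an arbitrary frame of N = 8 samples
    N = len(y)
    n = np.arange(N)

    # Direct evaluation of equation (1): Y_k = sum_n y_n * exp(-2j*pi*k*n/N)
    Y = np.array([np.sum(y * np.exp(-2j * np.pi * k * n / N)) for k in range(N)])

    # NumPy's FFT uses the same convention, so the two agree to rounding error
    assert np.allclose(Y, np.fft.fft(y))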
Figure 1. Row Mean Generation Technique
B. Discrete Cosine Transform
A discrete cosine transform (DCT) expresses a sequence of finitely many data points in terms of a sum of cosine functions oscillating at different frequencies.
$Y(k) = w(k) \sum_{n=1}^{N} y(n) \cos\left(\frac{\pi (2n-1)(k-1)}{2N}\right), \quad k = 1, \dots, N$    (2)

where $Y(k)$ is the cosine transform of the signal $y(n)$ and

$w(k) = \begin{cases} 1/\sqrt{N}, & k = 1 \\ \sqrt{2/N}, & 2 \le k \le N \end{cases}$
The DCT is closely related to the discrete Fourier transform.
You can often reconstruct a sequence very accurately from
only a few DCT coefficients, a useful property for applications
requiring data reduction [20 – 22].
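For reference, the orthonormal DCT in equation (2) coincides with the DCT-II as provided by SciPy with norm='ortho' (our assumption about the intended convention, which the check below confirms numerically):

    import numpy as np
    from scipy.fft import dct

    y = np.random.randn(16)
    N = len(y)
    k = np.arange(1, N + 1).reshape(-1, 1)   # 1-based output index (rows)
    n = np.arange(1, N + 1).reshape(1, -1)   # 1-based input index (columns)

    # Weights from equation (2): w(1) = 1/sqrt(N), w(k) = sqrt(2/N) for 2 <= k <= N
    w = np.full(N, np.sqrt(2.0 / N))
    w[0] = 1.0 / np.sqrt(N)

    # Direct evaluation of equation (2)
    Y = w * np.sum(y * np.cos(np.pi * (2 * n - 1) * (k - 1) / (2 * N)), axis=1)

    assert np.allclose(Y, dct(y, type=2, norm='ortho'))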
C. Discrete Sine Transform
A discrete sine transform (DST) expresses a sequence of
finitely many data points in terms of a sum of sine functions.
$Y(k) = \sum_{n=1}^{N} y(n) \sin\left(\frac{\pi k n}{N+1}\right), \quad k = 1, \dots, N$    (3)

where $Y(k)$ is the sine transform.
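Equation (3) is the DST-I; SciPy's unnormalized dst returns exactly twice this value, so a direct evaluation can be checked as follows (again an illustrative aside):

    import numpy as np
    from scipy.fft import dst

    y = np.random.randn(16)
    N = len(y)
    k = np.arange(1, N + 1).reshape(-1, 1)
    n = np.arange(1, N + 1).reshape(1, -1)

    # Direct evaluation of equation (3)
    Y = np.sum(y * np.sin(np.pi * k * n / (N + 1)), axis=1)

    # SciPy's unnormalized DST-I carries an extra factor of 2
    assert np.allclose(2 * Y, dst(y, type=1))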
D. Walsh Transform
The Walsh transform or Walsh–Hadamard transform is a non-sinusoidal, orthogonal transformation technique that decomposes a signal into a set of basis functions. These basis functions are Walsh functions, which are rectangular or square waves with values of +1 or -1. The Walsh–Hadamard transform returns sequency values. Sequency is a more generalized notion of frequency and is defined as one half of the average number of zero-crossings per unit time interval. Each Walsh function has a unique sequency value. You can use the returned sequency values to estimate the signal frequencies in the original signal. The Walsh–Hadamard transform is used in a number of applications, such as image processing, speech processing, filtering, and power spectrum analysis. It is very useful for reducing bandwidth storage requirements and for spread-spectrum analysis. Like the FFT, the Walsh–Hadamard transform has a fast version, the fast Walsh–Hadamard transform (FWHT). Compared to the FFT, the FWHT requires less storage space and is faster to calculate because it uses only real additions and subtractions, while the FFT requires complex values. The FWHT is able to represent signals with sharp discontinuities more accurately using fewer coefficients than the FFT. The Hadamard-ordered FWHT (FWHTh) is a divide-and-conquer algorithm that recursively breaks down a WHT of size N into two smaller WHTs of size N/2. This implementation follows the recursive definition of the Hadamard matrix $H_N$:

$H_N = \frac{1}{\sqrt{2}} \begin{pmatrix} H_{N/2} & H_{N/2} \\ H_{N/2} & -H_{N/2} \end{pmatrix}$    (4)

The normalization factors for each stage may be grouped together or even omitted. The sequency-ordered (also known as Walsh-ordered) fast Walsh–Hadamard transform, FWHTw, is obtained by computing the FWHTh as above and then rearranging the outputs [23].
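Equation (4) translates directly into a short recursive implementation; the sketch below (our illustration, with the per-stage 1/√2 normalization omitted as the text permits) computes the Hadamard-ordered FWHT of a power-of-two-length block:

    import numpy as np

    def fwht(x):
        # Fast Walsh-Hadamard transform, Hadamard order, unnormalized.
        # Splits the input in half and recurses, following equation (4):
        # the first half of the output is H(x1) + H(x2), the second H(x1) - H(x2).
        x = np.asarray(x, dtype=float)
        if len(x) == 1:
            return x
        half = len(x) // 2
        a = fwht(x[:half])
        b = fwht(x[half:])
        return np.concatenate([a + b, a - b])

    print(fwht([1, 0, 1, 0, 0, 1, 1, 0]))   # length must be a power of 2

The sequency-ordered FWHTw is then obtained by rearranging these outputs, as noted above.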
III. FEATURE EXTRACTION
The procedure for feature vector extraction is given below (a code sketch follows the list):
1. The speech signal is divided into groups of n samples, where n takes the values 64, 128, 256, 512, 1024, 2048, and 4096.
2. These blocks are then arranged as columns of a matrix, and the different transforms given in Section II are taken.
3. The mean of the absolute values of the rows of the transform matrix is then calculated.
4. These row means form a column vector (1 × n, where n is the number of rows in the transform matrix).
5. This column vector forms the feature vector for the speech sample.
6. The feature vectors for all the speech samples are calculated for different values of n and stored in the database.
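A compact Python sketch of the steps above (our illustration; the transform is a parameter, and we assume that trailing samples which do not fill a complete block of n are discarded, since the paper does not say how they are handled):

    import numpy as np

    def row_mean_features(signal, n, transform=np.fft.fft):
        # Steps 1-2: split the signal into blocks of n samples, arrange the
        # blocks as columns of an n x m matrix, and transform each column.
        m = len(signal) // n                       # number of complete blocks
        matrix = np.reshape(signal[:m * n], (m, n)).T
        transformed = transform(matrix, axis=0)
        # Steps 3-5: the mean of the absolute values of each row is the
        # n-element feature vector for this signal.
        return np.mean(np.abs(transformed), axis=1)

    # Step 6: compute and store the feature vector of every reference sample
    # for every group size n, e.g. database[speaker] = row_mean_features(speech, n)

This works with any transform accepting an axis keyword (np.fft.fft, scipy.fft.dct, scipy.fft.dst); the fwht sketch above would need to be applied column by column.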
Figure 2 shows the row mean generated for the four transforms for a grouping of 64 samples for one of the speech signals in the database. These 64 row means form the feature vector for the particular sample considered. In a similar fashion, the feature vectors for other speech signals were also calculated. This process was repeated for all values of n. As can be seen from figure 2, the 64 mean values form a 1 × 64 feature vector.
IV. RESULTS
A. Basics of speech signal
The speech samples used in this work are recorded using Sound Forge 4.5. The sampling frequency is 8000 Hz (8-bit, mono PCM samples). Table I shows the database description. The samples are collected from different speakers. Samples are taken from each speaker in two sessions so that training and testing data can be created. Twelve samples per speaker are taken. The samples recorded in one session are kept in the database and the samples recorded in the second session are used for testing.
[Figure 2: four panels showing the mean of the absolute value of each row of the transform matrix, plotted as amplitude against row index, for DFT, DST, Walsh and DCT, each for a grouping of 64 samples.]

Figure 2. Row mean generation for a grouping of 64 samples for one of the speech signals
TABLE I. DATABASE DESCRIPTION
Parameter            | Sample characteristics
---------------------|-----------------------
Language             | English
No. of speakers      | 105
Speech type          | Read speech
Recording conditions | Normal (a silent room)
Sampling frequency   | 8000 Hz
Resolution           | 8 bps
B. Experimental Results
The feature vectors of all the reference speech samples are stored in the database in the training phase. In the matching phase, the test sample that is to be identified is processed in the same way as in the training phase to form its feature vector. The speaker whose stored feature vector gives the minimum Euclidean distance to the feature vector of the input sample is declared as the identified speaker.
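In code, the matching phase is a nearest-neighbor search under the Euclidean distance. A minimal sketch (assuming a database dictionary of stored feature vectors, as built in the feature-extraction sketch of Section III):

    import numpy as np

    def identify(test_vector, database):
        # Return the speaker whose stored feature vector has the minimum
        # Euclidean distance to the feature vector of the test sample.
        return min(database,
                   key=lambda s: np.linalg.norm(database[s] - test_vector))

The identification accuracy of equation (5) below then follows by counting how often the returned speaker matches the true one over all test samples.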
Table II gives the number of matches for the four different transforms. The matching has been calculated by considering the minimum Euclidean distance between the feature vector of the test speech signal and the feature vectors of the speech signals stored in the database. The rows of Table II show the number of samples of each speech signal grouped together to form the columns of a matrix whose transform is then taken. For each grouping, the transform which gives the maximum matches is marked with an asterisk. We can see that for groupings of 64, 128 and 256, DST gives the best matching, i.e. 86, 98 and 99 (out of 105) respectively. For a grouping of 512, DCT gives the best matching, i.e. 99. For a grouping of 1024 samples, DST gives the maximum matches, i.e. 101. It can also be seen that as the number of samples grouped is increased beyond 1024, the number of matches is reduced for all the transforms.
TABLE II. NO. OF MATCHES FOR DIFFERENT GROUPINGS

No. of samples grouped | Number of matches (out of 105)
                       | FFT  | DCT | DST  | Walsh
                    64 |  78  |  85 |  86* |  76
                   128 |  87  |  92 |  98* |  79
                   256 |  96  |  98 |  99* |  82
                   512 |  97  | 99* |  98  |  85
                  1024 | 100  |  97 | 101* |  89
                  2048 | 100* |  96 |  97  |  85
                  4096 |  98  |  96 |  99* |  83
                  8192 |  96* |  90 |  90  |  67

(* marks the best transform for each grouping.)

C. Accuracy of Identification

The accuracy of the identification system is calculated as given by equation (5):

$\text{Accuracy} = \frac{\text{Number of correct matches}}{\text{Total number of test samples}} \times 100\%$    (5)

The accuracy for the different groupings of the four transforms was calculated and is shown in Figure 3.
Figure 3. Accuracy for the four transforms by varying the groupings of samples
The results show that the accuracy increases as the feature vector size is increased from 64 to 512 for the transforms; only for DST does the accuracy decrease, from 94.28% to 93.33%, as the feature vector size is increased from 256 to 512. The feature vector size of 1024 gives the best result for all the transforms except DCT; for DCT, the best result is obtained for a feature vector size of 512. For DFT, the maximum accuracy obtained is 95.2381% for a feature vector size of 1024. The Walsh transform gives a maximum accuracy of around 84.7619%. DST performs best, giving a maximum accuracy of 96.1905% for a feature vector size of 1024.
V. CONCLUSION
In this paper we have compared the performance of four different transforms for speaker identification. All the transforms give an accuracy of more than 80% for the feature vector sizes considered. Accuracy increases as the feature vector size is increased from 64 onwards, but for feature vector sizes of more than 1024 the accuracy starts decreasing again. The results show that DST performs best; the maximum accuracy obtained for DST is around 96% for a feature vector size of 1024. The present study is ongoing, and we are analyzing the performance of other transforms.
REFERENCES
[1] Lawrence Rabiner, Biing-Hwang Juang and B. Yegnanarayana, "Fundamentals of Speech Recognition", Prentice-Hall, Englewood Cliffs, 2009.
[2] S. Furui, "50 years of progress in speech and speaker recognition research", ECTI Transactions on Computer and Information Technology, Vol. 1, No. 2, November 2005.
[3] D. A. Reynolds, "An overview of automatic speaker recognition technology," Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP).
[4] Joseph P. Campbell, Jr., "Speaker Recognition: A Tutorial", Proceedings of the IEEE, vol. 85, no. 9, pp. 1437-1462, September 1997.
[5] S. Furui, "Recent advances in speaker recognition", AVBPA'97, pp. 237-251, 1997.
[6] F. Bimbot, J.-F. Bonastre, C. Fredouille, G. Gravier, I. Magrin-Chagnolleau, S. Meignier, T. Merlin, J. Ortega-García, D. Petrovska-Delacrétaz, and D. A. Reynolds, "A tutorial on text-independent speaker verification," EURASIP J. Appl. Signal Process., vol. 2004, no. 1, pp. 430-451, 2004.
[7] D. A. Reynolds, "Experimental evaluation of features for robust speaker identification," IEEE Trans. Speech Audio Process., vol. 2, no. 4, pp. 639-643, Oct. 1994.
[8] Tomi Kinnunen, Evgeny Karpov, and Pasi Fränti, "Real-time Speaker Identification", ICSLP 2004.
[9] Marco Grimaldi and Fred Cummins, "Speaker Identification using Instantaneous Frequencies", IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 6, August 2008.
[10] Zhong-Xuan Yuan, Bo-Ling Xu and Chong-Zhi Yu, "Binary Quantization of Feature Vectors for Robust Text-Independent Speaker Identification", IEEE Transactions on Speech and Audio Processing, Vol. 7, No. 1, January 1999.
[11] H. B. Kekre, Vaishali Kulkarni, "Speaker Identification by using Vector Quantization", International Journal of Engineering Science and Technology, May 2010.
[12] H. B. Kekre, Vaishali Kulkarni, "Performance Comparison of Speaker Recognition using Vector Quantization by LBG and KFCG", International Journal of Computer Applications, vol. 3, July 2010.
[13] H. B. Kekre, Vaishali Kulkarni, "Performance Comparison of Automatic Speaker Recognition using Vector Quantization by LBG, KFCG and KMCG", International Journal of Computer Science and Security, Vol. 4, Issue 4, 2010.
[14] H. B. Kekre, Vaishali Kulkarni, "Comparative Analysis of Automatic Speaker Recognition using Kekre's Fast Codebook Generation Algorithm in Time Domain and Transform Domain", International Journal of Computer Applications, Volume 7, No. 1, September 2010.
[15] H. B. Kekre, Sudeep D. Thepade, Akshay Maloo, "Performance Comparison of Image Retrieval using Row Mean of Transformed Column Image", International Journal on Computer Science and Engineering, Vol. 02, No. 05, 2010, pp. 1908-1912.
[16] H. B. Kekre, Sudeep Thepade, "Edge Texture Based CBIR using Row Mean of Transformed Column Gradient Image", International Journal of Computer Applications (0975-8887), Volume 7, No. 10, October 2010.
[17] H. B. Kekre, Sudeep D. Thepade, Akshay Maloo, "Eigenvectors of Covariance Matrix using Row Mean and Column Mean Sequences for Face Recognition", International Journal of Biometrics and Bioinformatics (IJBB), Volume 4, Issue 2.
[18] H. B. Kekre, Sudeep Thepade, Archana Athawale, "Grayscale Image Retrieval using DCT on Row Mean, Column Mean and Combination", Journal of Sci., Engg. & Tech. Mgt., Vol. 2 (1), January 2010.
[19] H. B. Kekre, T. K. Sarode, Shachi J. Natu, Prachi J. Natu, "Performance Comparison of Speaker Identification Using DCT, Walsh, Haar on Full and Row Mean of Spectrogram", International Journal of Computer Applications (0975-8887), Volume 5, No. 6, August 2010.
[20] N. Ahmed, T. Natarajan, and K. R. Rao, "Discrete Cosine Transform", IEEE Trans. Computers, pp. 90-93, Jan. 1974.
[21] N. Ahmed, "How I came up with the Discrete Cosine Transform", Digital Signal Processing, Vol. 1, 1991.
[22] G. Strang, "The Discrete Cosine Transform," SIAM Review, Volume 41, Number 1, 1999.
[23] B. J. Fino and V. R. Algazi, "Unified Matrix Treatment of the Fast Walsh-Hadamard Transform," IEEE Transactions on Computers, vol. 25, pp. 1142-1146, 1976.
AUTHORS PROFILE
Dr. H. B. Kekre has received B.E. (Hons.) in Telecomm. Engg. from Jabalpur University in 1958, M.Tech (Industrial Electronics) from IIT Bombay in 1960, M.S.Engg. (Electrical Engg.) from the University of Ottawa in 1965 and Ph.D. (System Identification) from IIT Bombay in 1970. He has worked over 35 years as Faculty of Electrical Engineering and then HOD of Computer Science and Engg. at IIT Bombay. For the last 13 years he worked as a Professor in the Department of Computer Engg. at Thadomal Shahani Engineering College, Mumbai. He is currently Senior Professor working with Mukesh Patel School of Technology Management and Engineering, SVKM's NMIMS University, Vile Parle (W), Mumbai, India. He has guided 17 Ph.D.s, 150 M.E./M.Tech projects and several B.E./B.Tech projects. His areas of interest are Digital Signal Processing, Image Processing and Computer Networks. He has more than 300 papers in national/international conferences/journals to his credit. Recently, twelve students working under his guidance have received best paper awards, and two research scholars have received Ph.D. degrees from NMIMS University. Currently he is guiding ten Ph.D. students. He is a member of ISTE and IETE.
Vaishali Kulkarni has received B.E. in Electronics Engg. from Mumbai University in 1997 and M.E. (Electronics and Telecom) from Mumbai University in 2006. Presently she is pursuing a Ph.D. from NMIMS University. She has a teaching experience of more than 8 years and is Associate Professor in the Telecom Department at MPSTME, NMIMS University. Her areas of interest include speech processing: speech and speaker recognition. She has 8 papers in national/international conferences/journals to her credit.