(IJCSIS) International Journal of Computer Science and Information Security,
Vol. 9, No. 1, 2011
Comparative Analysis of Speaker Identification using
row mean of DFT, DCT, DST and Walsh Transforms
Dr. H B Kekre
Senior Professor, Computer Department,
MPSTME, NMIMS University,
Mumbai, India
Vaishali Kulkarni
Associate Professor, Electronics & Telecommunication,
MPSTME, NMIMS University,
Mumbai, India
Abstract — In this paper we propose speaker identification using four different transform techniques. The feature vectors are the row means of the transforms for different groupings. Experiments were performed on the Discrete Fourier Transform (DFT), Discrete Cosine Transform (DCT), Discrete Sine Transform (DST) and Walsh Transform (WHT). All the transforms give an accuracy of more than 80% for the different groupings considered. Accuracy increases as the number of samples grouped is increased from 64 onwards, but for groupings of more than 1024 samples the accuracy starts decreasing again. The results show that DST performs best: the maximum accuracy obtained for DST is 96% for a grouping of 1024 samples while taking the transform.
Keywords - Euclidean distance, Row mean, Speaker Identification,
Speaker Recognition
I. INTRODUCTION
Human speech conveys an abundance of information, from the language and gender to the identity of the person speaking. The purpose of a speaker recognition system is thus to extract the unique characteristics of a speech signal that identify a particular speaker [1, 2, 3]. Speaker recognition systems are usually classified into two subdivisions, speaker identification and speaker verification. Speaker identification (also known as closed set identification) is a 1:N matching process where the identity of a person must be determined from a set of known speakers [3 - 5]. Speaker verification (also known as open set identification) serves to establish whether the speaker is who he claims to be [6]. Speaker recognition can be further classified into text-dependent and text-independent systems. In a text-dependent system, the system knows what utterances to expect from the speaker. However, in a text-independent system, no assumptions about the text can be made, and the system must be more flexible than a text-dependent system.
Speaker recognition technology has made it possible to use the speaker's voice to control access to restricted services, for example, giving commands to a computer, phone access to banking, database services, shopping or voice mail, and access to secure equipment. Speaker recognition systems have been developed for a wide range of applications [7 - 10].
Although many new techniques have been developed, widespread deployment of applications and services is still not possible, as none of these systems gives accurate and reliable results. We have proposed speaker recognition using vector quantization in the time domain by using the LBG (Linde-Buzo-Gray), KFCG (Kekre's Fast Codebook Generation) and KMCG (Kekre's Median Codebook Generation) algorithms [11], [12], [13], and in the transform domain using DFT, DCT and DST [14].
The concept of the row mean of transform techniques has been used for content based image retrieval (CBIR) [15 - 18]. This technique has also been applied to speaker identification by first converting the speech signal into a spectrogram [19].
For the purposes of this paper, we will be considering a speaker identification system that is text-dependent. For the identification purpose, the feature vectors are extracted by taking the row mean of the transforms (which is a column vector). The technique is used as shown in figure 1. Here a speech signal of 15 samples is divided into 3 blocks of 5 samples each, and these 3 blocks form the columns of the matrix whose transform is taken. Then the mean of the absolute value of each row of the transform matrix is taken, and this forms the column vector of means.
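As a concrete illustration, the following minimal Python sketch (our own illustrative example using NumPy; the 15 sample values are placeholders, and the DFT stands in for any of the four transforms) reproduces the steps of figure 1:

    import numpy as np

    # Toy speech signal of 15 samples (placeholder values 1..15, as in figure 1)
    signal = np.arange(1, 16, dtype=float)

    # Divide into 3 blocks of 5 samples; the blocks become columns of a 5 x 3 matrix
    matrix = signal.reshape(3, 5).T

    # Transform each column (the DFT is shown; DCT, DST or Walsh work the same way)
    transformed = np.fft.fft(matrix, axis=0)

    # Mean of the absolute value of each row gives the length-5 vector of row means
    feature_vector = np.mean(np.abs(transformed), axis=1)
    print(feature_vector)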
The rest of the paper is organized as follows: Section II explains the transform techniques used for feature generation, Section III deals with feature extraction, the results (including feature matching) are explained in Section IV, and the conclusion is given in Section V.
II. TRANSFORM TECHNIQUES
A. Discrete Fourier Transform
Spectral analysis is the process of identifying component frequencies in data. For discrete data, the computational basis of spectral analysis is the discrete Fourier transform (DFT). The DFT transforms time- or space-based data into frequency-based data. The DFT allows you to efficiently estimate component frequencies in data from a discrete set of values sampled at a fixed rate. If the speech signal is represented by $y(t)$, then the DFT of the time series or samples $y_0, y_1, y_2, \dots, y_{N-1}$ is defined as:
$Y_k = \sum_{n=0}^{N-1} y_n e^{-2j\pi kn/N}$    (1)

where $y_n = y_s(n\Delta t)$; $k = 0, 1, 2, \dots, N-1$; and $\Delta t$ is the sampling interval.
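As a quick sanity check (an illustrative aside, not part of the original experiments), equation (1) can be evaluated directly and compared against a library FFT, which implements the same definition:

    import numpy as np

    y = np.random.randn(8)   # an arbitrary frame of N = 8 samples
    N = len(y)
    n = np.arange(N)

    # Direct evaluation of equation (1): Y_k = sum_n y_n * exp(-2j*pi*k*n/N)
    Y = np.array([np.sum(y * np.exp(-2j * np.pi * k * n / N)) for k in range(N)])

    # NumPy's FFT uses the same convention, so the two agree to rounding error
    assert np.allclose(Y, np.fft.fft(y))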
Figure 1. Row Mean Generation Technique
B. Discrete Cosine Transform
A discrete cosine transform (DCT) expresses a sequence of finitely many data points in terms of a sum of cosine functions oscillating at different frequencies.
$Y(k) = w(k) \sum_{n=1}^{N} y(n) \cos\left(\frac{\pi (2n-1)(k-1)}{2N}\right), \quad k = 1, \dots, N$    (2)

where $Y(k)$ is the cosine transform of the signal $y(n)$ and

$w(k) = \begin{cases} 1/\sqrt{N}, & k = 1 \\ \sqrt{2/N}, & 2 \le k \le N \end{cases}$
The DCT is closely related to the discrete Fourier transform.
You can often reconstruct a sequence very accurately from
only a few DCT coefficients, a useful property for applications
requiring data reduction [20 – 22].
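For reference, the orthonormal DCT in equation (2) coincides with the DCT-II as provided by SciPy with norm='ortho' (our assumption about the intended convention, which the check below confirms numerically):

    import numpy as np
    from scipy.fft import dct

    y = np.random.randn(16)
    N = len(y)
    k = np.arange(1, N + 1).reshape(-1, 1)   # 1-based output index (rows)
    n = np.arange(1, N + 1).reshape(1, -1)   # 1-based input index (columns)

    # Weights from equation (2): w(1) = 1/sqrt(N), w(k) = sqrt(2/N) for 2 <= k <= N
    w = np.full(N, np.sqrt(2.0 / N))
    w[0] = 1.0 / np.sqrt(N)

    # Direct evaluation of equation (2)
    Y = w * np.sum(y * np.cos(np.pi * (2 * n - 1) * (k - 1) / (2 * N)), axis=1)

    assert np.allclose(Y, dct(y, type=2, norm='ortho'))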
C. Discrete Sine Transform
A discrete sine transform (DST) expresses a sequence of
finitely many data points in terms of a sum of sine functions.
$Y(k) = \sum_{n=1}^{N} y(n) \sin\left(\frac{\pi k n}{N+1}\right), \quad k = 1, \dots, N$    (3)

where $Y(k)$ is the sine transform.
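Equation (3) is the DST-I; SciPy's unnormalized dst returns exactly twice this value, so a direct evaluation can be checked as follows (again an illustrative aside):

    import numpy as np
    from scipy.fft import dst

    y = np.random.randn(16)
    N = len(y)
    k = np.arange(1, N + 1).reshape(-1, 1)
    n = np.arange(1, N + 1).reshape(1, -1)

    # Direct evaluation of equation (3)
    Y = np.sum(y * np.sin(np.pi * k * n / (N + 1)), axis=1)

    # SciPy's unnormalized DST-I carries an extra factor of 2
    assert np.allclose(2 * Y, dst(y, type=1))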
D. Walsh Transform
The Walsh transform or Walsh–Hadamard transform is a non-sinusoidal, orthogonal transformation technique that decomposes a signal into a set of basis functions. These basis functions are Walsh functions, which are rectangular or square waves with values of +1 or -1. The Walsh–Hadamard transform returns sequency values. Sequency is a more generalized notion of frequency and is defined as one half of the average number of zero-crossings per unit time interval. Each Walsh function has a unique sequency value. You can use the returned sequency values to estimate the signal frequencies in the original signal. The Walsh–Hadamard transform is used in a number of applications, such as image processing, speech processing, filtering, and power spectrum analysis. It is very useful for reducing bandwidth storage requirements and for spread-spectrum analysis. Like the FFT, the Walsh–Hadamard transform has a fast version, the fast Walsh–Hadamard transform (FWHT). Compared to the FFT, the FWHT requires less storage space and is faster to calculate because it uses only real additions and subtractions, while the FFT requires complex values. The FWHT is able to represent signals with sharp discontinuities more accurately using fewer coefficients than the FFT. The Hadamard-ordered FWHT (FWHTh) is a divide-and-conquer algorithm that recursively breaks down a WHT of size N into two smaller WHTs of size N/2. This implementation follows the recursive definition of the Hadamard matrix $H_N$:

$H_N = \frac{1}{\sqrt{2}} \begin{pmatrix} H_{N/2} & H_{N/2} \\ H_{N/2} & -H_{N/2} \end{pmatrix}$    (4)

The normalization factors for each stage may be grouped together or even omitted. The sequency-ordered (also known as Walsh-ordered) fast Walsh–Hadamard transform, FWHTw, is obtained by computing the FWHTh as above and then rearranging the outputs [23].
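Equation (4) translates directly into a short recursive implementation; the sketch below (our illustration, with the per-stage 1/√2 normalization omitted as the text permits) computes the Hadamard-ordered FWHT of a power-of-two-length block:

    import numpy as np

    def fwht(x):
        # Fast Walsh-Hadamard transform, Hadamard order, unnormalized.
        # Splits the input in half and recurses, following equation (4):
        # the first half of the output is H(x1) + H(x2), the second H(x1) - H(x2).
        x = np.asarray(x, dtype=float)
        if len(x) == 1:
            return x
        half = len(x) // 2
        a = fwht(x[:half])
        b = fwht(x[half:])
        return np.concatenate([a + b, a - b])

    print(fwht([1, 0, 1, 0, 0, 1, 1, 0]))   # length must be a power of 2

The sequency-ordered FWHTw is then obtained by rearranging these outputs, as noted above.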
III. FEATURE EXTRACTION
The procedure for feature vector extraction is given below (a code sketch follows the list):
1. The speech signal is divided into groups of n samples, where n takes the values 64, 128, 256, 512, 1024, 2048, and 4096.
2. These blocks are then arranged as columns of a matrix, and the different transforms given in Section II are taken.
3. The mean of the absolute values of the rows of the transform matrix is then calculated.
4. These row means form a column vector (1 × n, where n is the number of rows in the transform matrix).
5. This column vector forms the feature vector for the speech sample.
6. The feature vectors for all the speech samples are calculated for different values of n and stored in the database.
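A compact Python sketch of the steps above (our illustration; the transform is a parameter, and we assume that trailing samples which do not fill a complete block of n are discarded, since the paper does not say how they are handled):

    import numpy as np

    def row_mean_features(signal, n, transform=np.fft.fft):
        # Steps 1-2: split the signal into blocks of n samples, arrange the
        # blocks as columns of an n x m matrix, and transform each column.
        m = len(signal) // n                       # number of complete blocks
        matrix = np.reshape(signal[:m * n], (m, n)).T
        transformed = transform(matrix, axis=0)
        # Steps 3-5: the mean of the absolute values of each row is the
        # n-element feature vector for this signal.
        return np.mean(np.abs(transformed), axis=1)

    # Step 6: compute and store the feature vector of every reference sample
    # for every group size n, e.g. database[speaker] = row_mean_features(speech, n)

This works with any transform accepting an axis keyword (np.fft.fft, scipy.fft.dct, scipy.fft.dst); the fwht sketch above would need to be applied column by column.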
Figure 2 shows the row mean generated for the four transforms for a grouping of 64 samples for one of the speech signals in the database. These 64 row means form the feature vector for the particular sample considered. In a similar fashion, the feature vectors for other speech signals were also calculated. This process was repeated for all values of n. As can be seen from figure 2, the 64 mean values form a 1 × 64 feature vector.
IV. RESULTS
A. Basics of speech signal
The speech samples used in this work are recorded using Sound Forge 4.5. The sampling frequency is 8000 Hz (8-bit, mono PCM samples). Table I shows the database description. The samples are collected from different speakers. Samples are taken from each speaker in two sessions so that training and testing data can be created. Twelve samples per speaker are taken. The samples recorded in one session are kept in the database and the samples recorded in the second session are used for testing.
[Figure 2: four panels showing the mean of the absolute value of each row of the transform matrix, plotted as amplitude against row index, for DFT, DST, Walsh and DCT, each for a grouping of 64 samples.]

Figure 2. Row mean generation for a grouping of 64 samples for one of the speech signals
TABLE I. DATABASE DESCRIPTION
Parameter            | Sample characteristics
---------------------|-----------------------
Language             | English
No. of speakers      | 105
Speech type          | Read speech
Recording conditions | Normal (a silent room)
Sampling frequency   | 8000 Hz
Resolution           | 8 bps
B. Experimental Results
The feature vectors of all the reference speech samples are stored in the database in the training phase. In the matching phase, the test sample that is to be identified is processed in the same way as in the training phase to form its feature vector. The speaker whose stored feature vector gives the minimum Euclidean distance to the feature vector of the input sample is declared as the identified speaker.
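In code, the matching phase is a nearest-neighbor search under the Euclidean distance. A minimal sketch (assuming a database dictionary of stored feature vectors, as built in the feature-extraction sketch of Section III):

    import numpy as np

    def identify(test_vector, database):
        # Return the speaker whose stored feature vector has the minimum
        # Euclidean distance to the feature vector of the test sample.
        return min(database,
                   key=lambda s: np.linalg.norm(database[s] - test_vector))

The identification accuracy of equation (5) below then follows by counting how often the returned speaker matches the true one over all test samples.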
Table II gives the number of matches for the four different transforms. The matching has been calculated by considering the minimum Euclidean distance between the feature vector of the test speech signal and the feature vectors of the speech signals stored in the database. The rows of Table II show the number of samples of each speech signal grouped together to form the columns of a matrix whose transform is then taken. For each grouping, the transform which gives the maximum matches is marked with an asterisk. We can see that for groupings of 64, 128 and 256, DST gives the best matching, i.e. 86, 98 and 99 (out of 105) respectively. For a grouping of 512, DCT gives the best matching, i.e. 99. For a grouping of 1024 samples, DST gives the maximum matches, i.e. 101. It can also be seen that as the number of samples grouped is increased beyond 1024, the number of matches is reduced for all the transforms.
TABLE II. NO. OF MATCHES FOR DIFFERENT GROUPINGS

No. of samples grouped | Number of matches (out of 105)
                       | FFT  | DCT | DST  | Walsh
                    64 |  78  |  85 |  86* |  76
                   128 |  87  |  92 |  98* |  79
                   256 |  96  |  98 |  99* |  82
                   512 |  97  | 99* |  98  |  85
                  1024 | 100  |  97 | 101* |  89
                  2048 | 100* |  96 |  97  |  85
                  4096 |  98  |  96 |  99* |  83
                  8192 |  96* |  90 |  90  |  67

(* marks the best transform for each grouping.)

C. Accuracy of Identification

The accuracy of the identification system is calculated as given by equation (5):

$\text{Accuracy} = \frac{\text{Number of correct matches}}{\text{Total number of test samples}} \times 100\%$    (5)

The accuracy for the different groupings of the four transforms was calculated and is shown in Figure 3.
Figure 3. Accuracy for the four transforms by varying the groupings of samples
The results show that the accuracy increases as the feature vector size is increased from 64 to 512 for the transforms; only for DST does the accuracy decrease, from 94.28% to 93.33%, as the feature vector size is increased from 256 to 512. The feature vector size of 1024 gives the best result for all the transforms except DCT; for DCT, the best result is obtained for a feature vector size of 512. For DFT, the maximum accuracy obtained is 95.2381% for a feature vector size of 1024. The Walsh transform gives a maximum accuracy of around 84.7619%. DST performs best, giving a maximum accuracy of 96.1905% for a feature vector size of 1024.
V. CONCLUSION
In this paper we have compared the performance of four different transforms for speaker identification. All the transforms give an accuracy of more than 80% for the feature vector sizes considered. Accuracy increases as the feature vector size is increased from 64 onwards, but for feature vector sizes of more than 1024 the accuracy starts decreasing again. The results show that DST performs best; the maximum accuracy obtained for DST is around 96% for a feature vector size of 1024. The present study is ongoing, and we are analyzing the performance of other transforms.
REFERENCES
[1] Lawrence Rabiner, Biing-Hwang Juang and B. Yegnanarayana, "Fundamentals of Speech Recognition", Prentice-Hall, Englewood Cliffs, 2009.
[2] S. Furui, "50 years of progress in speech and speaker recognition research", ECTI Transactions on Computer and Information Technology, Vol. 1, No. 2, November 2005.
[3] D. A. Reynolds, "An overview of automatic speaker recognition technology," Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP).
[4] Joseph P. Campbell, Jr., "Speaker Recognition: A Tutorial", Proceedings of the IEEE, vol. 85, no. 9, pp. 1437-1462, September 1997.
[5] S. Furui, "Recent advances in speaker recognition", AVBPA'97, pp. 237-251, 1997.
[6] F. Bimbot, J.-F. Bonastre, C. Fredouille, G. Gravier, I. Magrin-Chagnolleau, S. Meignier, T. Merlin, J. Ortega-García, D. Petrovska-Delacrétaz, and D. A. Reynolds, "A tutorial on text-independent speaker verification," EURASIP J. Appl. Signal Process., vol. 2004, no. 1, pp. 430-451, 2004.
[7] D. A. Reynolds, "Experimental evaluation of features for robust speaker identification," IEEE Trans. Speech Audio Process., vol. 2, no. 4, pp. 639-643, Oct. 1994.
[8] Tomi Kinnunen, Evgeny Karpov, and Pasi Fränti, "Real-time Speaker Identification", ICSLP 2004.
[9] Marco Grimaldi and Fred Cummins, "Speaker Identification using Instantaneous Frequencies", IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 6, August 2008.
[10] Zhong-Xuan Yuan, Bo-Ling Xu and Chong-Zhi Yu, "Binary Quantization of Feature Vectors for Robust Text-Independent Speaker Identification", IEEE Transactions on Speech and Audio Processing, Vol. 7, No. 1, January 1999.
[11] H. B. Kekre, Vaishali Kulkarni, "Speaker Identification by using Vector Quantization", International Journal of Engineering Science and Technology, May 2010.
[12] H. B. Kekre, Vaishali Kulkarni, "Performance Comparison of Speaker Recognition using Vector Quantization by LBG and KFCG", International Journal of Computer Applications, vol. 3, July 2010.
[13] H. B. Kekre, Vaishali Kulkarni, "Performance Comparison of Automatic Speaker Recognition using Vector Quantization by LBG, KFCG and KMCG", International Journal of Computer Science and Security, Vol. 4, Issue 4, 2010.
[14] H. B. Kekre, Vaishali Kulkarni, "Comparative Analysis of Automatic Speaker Recognition using Kekre's Fast Codebook Generation Algorithm in Time Domain and Transform Domain", International Journal of Computer Applications, Volume 7, No. 1, September 2010.
[15] H. B. Kekre, Sudeep D. Thepade, Akshay Maloo, "Performance Comparison of Image Retrieval using Row Mean of Transformed Column Image", International Journal on Computer Science and Engineering, Vol. 02, No. 05, 2010, pp. 1908-1912.
[16] H. B. Kekre, Sudeep Thepade, "Edge Texture Based CBIR using Row Mean of Transformed Column Gradient Image", International Journal of Computer Applications (0975-8887), Volume 7, No. 10, October 2010.
[17] H. B. Kekre, Sudeep D. Thepade, Akshay Maloo, "Eigenvectors of Covariance Matrix using Row Mean and Column Mean Sequences for Face Recognition", International Journal of Biometrics and Bioinformatics (IJBB), Volume 4, Issue 2.
[18] H. B. Kekre, Sudeep Thepade, Archana Athawale, "Grayscale Image Retrieval using DCT on Row Mean, Column Mean and Combination", Journal of Sci., Engg. & Tech. Mgt., Vol. 2 (1), January 2010.
[19] H. B. Kekre, T. K. Sarode, Shachi J. Natu, Prachi J. Natu, "Performance Comparison of Speaker Identification Using DCT, Walsh, Haar on Full and Row Mean of Spectrogram", International Journal of Computer Applications (0975-8887), Volume 5, No. 6, August 2010.
[20] N. Ahmed, T. Natarajan, and K. R. Rao, "Discrete Cosine Transform", IEEE Trans. Computers, pp. 90-93, Jan. 1974.
[21] N. Ahmed, "How I came up with the Discrete Cosine Transform", Digital Signal Processing, Vol. 1, 1991.
[22] G. Strang, "The Discrete Cosine Transform," SIAM Review, Volume 41, Number 1, 1999.
[23] B. J. Fino and V. R. Algazi, "Unified Matrix Treatment of the Fast Walsh-Hadamard Transform," IEEE Transactions on Computers, vol. 25, pp. 1142-1146, 1976.
AUTHORS PROFILE
Dr. H. B. Kekre has received B.E. (Hons.) in Telecomm. Engg. from Jabalpur University in 1958, M.Tech (Industrial Electronics) from IIT Bombay in 1960, M.S.Engg. (Electrical Engg.) from the University of Ottawa in 1965 and Ph.D. (System Identification) from IIT Bombay in 1970. He has worked over 35 years as Faculty of Electrical Engineering and then HOD of Computer Science and Engg. at IIT Bombay. For the last 13 years he worked as a Professor in the Department of Computer Engg. at Thadomal Shahani Engineering College, Mumbai. He is currently Senior Professor working with Mukesh Patel School of Technology Management and Engineering, SVKM's NMIMS University, Vile Parle (W), Mumbai, India. He has guided 17 Ph.D.s, 150 M.E./M.Tech projects and several B.E./B.Tech projects. His areas of interest are Digital Signal Processing, Image Processing and Computer Networks. He has more than 300 papers in national/international conferences/journals to his credit. Recently, twelve students working under his guidance have received best paper awards, and two research scholars have received Ph.D. degrees from NMIMS University. Currently he is guiding ten Ph.D. students. He is a member of ISTE and IETE.
Vaishali Kulkarni has received B.E. in Electronics Engg. from Mumbai University in 1997 and M.E. (Electronics and Telecom) from Mumbai University in 2006. Presently she is pursuing a Ph.D. from NMIMS University. She has a teaching experience of more than 8 years and is Associate Professor in the Telecom Department at MPSTME, NMIMS University. Her areas of interest include speech processing: speech and speaker recognition. She has 8 papers in national/international conferences/journals to her credit.