Journal of Computer Science and Engineering, ISSN 2043-9091, Volume 12, Issue 2, April 2012 http://www.journalcse.co.uk


JOURNAL OF COMPUTER SCIENCE AND ENGINEERING, VOLUME 12, ISSUE 2, APRIL 2012

© 2012 JCSE www.journalcse.co.uk

Text-Independent Speaker Verification Based On Ensemble Classifiers

F.Forootan*, M.Mosleh and S.Setayeshi

Abstract—Speaker verification system and the process of accepting or rejecting the claimed identity are based on sound features. A verification system can be used for numerous security systems and is more economical compared with the other biometric methods like finger print and face identity. In this method, there is no need for speaker’s presence and the identity can be verified from afar. Here, an ensemble classification algorithm is presented for identification and text-independent speaker verification which is more accurate than the current classifiers. Because of differences in speakers’sound samples, mapping techniques were used to make the samples of equal length, and in order to shorten the verification process, we chose just a subset of speakers’sound features after evaluating the extracted features.

Index Terms—biometric, speaker verification, pattern matching, support vector machine, decision trees, ensemble classifier.

—————————— ◆ ——————————

1 INTRODUCTION

Today, more than ever, the importance of security in the cyber setting and its influence on the computer world is understood. Without strong protection, data and

documents are not safe and can be infiltrated by hackers. That is why researchers have recently focused on discovering more reliable methods of protecting data and controlling access to them. One major way of controlling access to data is recognizing and verifying people's identity. In identity verification, the objective is to make sure that an individual is the same person he or she claims to be and no one else. In other words, identity verification is the process of confirming a user's identity from personal characteristics, a practice called biometrics. Biometric systems store data about a person and are preferred to other systems because they do not require people to memorize passwords and because they can be accessed and tested at any moment. A biometric system functions based on pattern matching: it verifies people's identity using their biological information. The first step is saving this biological information in the system's data bank. Once features have been extracted, a feature vector can be obtained. Then, using intelligent algorithms such as neural networks [1], support vector machines [2], and decision trees, or combinations of these models, a model is developed and saved for identifying people.

Compared with other biometric methods such as fingerprint and face identification, speaker identification requires no expensive or special equipment and is therefore more economical. Moreover, the speaker does not need to be physically present to be identified, and his or her identity can be verified from a distance. However, such systems have weaknesses, including deliberate identity disguise, synthesis of a speaker's speech, and the effect of noise on the sound signal. All of these make identification a difficult process.

Speaker identification was introduced as a research field in the early 1960s, and in the early 1990s research on these systems and their implementation began in earnest. The vector quantization method was first used in 1987 for text-dependent speaker verification [3].

The Gaussian Mixture Model (GMM) has recently been used as a reference method for creating speaker models in speaker verification applications. It can be considered a generalization of the vector quantization model in which clusters overlap. In 1995, Reynolds used Gaussian Mixture Models for modeling speakers in text-independent verification [4]. In 2000, Reynolds and his colleagues presented the adapted GMM-UBM model for speaker verification, in which maximum a posteriori estimation is used during training [5]. In 2008, Kinnunen and Hautamaki used maximum a posteriori estimation in systems based on vector quantization, named VQ-UBM, for speaker verification [6, 7]. In both the GMM-UBM and VQ-UBM models, speaker models are built by adapting the UBM model parameters using training speech. The main idea, compared with the plain UBM model, is to adapt speaker models by updating the UBM parameters, which improves performance compared with independently trained text-independent models.

Another modeling method for speaker verification is the support vector machine classifier. The SVM is a binary classifier whose objective is to find the hyperplane that best separates two classes in feature space.


————————————————
• F. Forootan is with the Department of Computer Engineering, Faculty of Postgraduate Studies, Dezful Branch, Islamic Azad University, Dezful, Iran.
• M. Mosleh is with the Department of Computer Engineering, Faculty of Postgraduate Studies, Dezful Branch, Islamic Azad University, Dezful, Iran.
• S. Setayeshi is with the Faculty of Physics and Nuclear Engineering, Amirkabir University, Tehran, Iran.
* This paper has been extracted from the M.Sc. thesis.


In 2006, the generalized linear discriminant sequence kernel was proposed to increase the efficiency of SVM in speaker verification [2].

GMM is one of the most frequently used methods for speaker verification, especially in text-independent cases. Since GMM modeling time is long, a combination of this method and the VQ method is used to reduce the time needed for training and building models.
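The GMM/UBM idea above can be sketched with scikit-learn's GaussianMixture. This is a minimal illustration on synthetic feature frames, not the paper's FARSDAT setup: the data, component count, and model sizes are assumptions, and a real UBM would be trained on a large pooled population and then adapted, rather than trained independently as here.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Toy "feature frames" standing in for MFCC/LPC vectors (not real speech data).
speaker_frames = rng.normal(loc=0.5, scale=1.0, size=(500, 13))
background_frames = rng.normal(loc=0.0, scale=1.5, size=(2000, 13))

# Background (UBM-like) model trained on pooled data; speaker model on the speaker's own frames.
ubm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0).fit(background_frames)
spk = GaussianMixture(n_components=8, covariance_type="diag", random_state=0).fit(speaker_frames)

# Verification score: mean log-likelihood ratio of a test utterance.
test = rng.normal(loc=0.5, scale=1.0, size=(200, 13))
llr = spk.score(test) - ubm.score(test)   # score() returns mean log-likelihood per frame
print(f"log-likelihood ratio: {llr:.2f}")  # positive -> frames better explained by speaker model
```

In a deployed system the ratio would be compared against a decision threshold tuned on held-out data.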

2 SUPPORT VECTOR MACHINE

The support vector machine is a fairly new method used for binary classification and belongs to the kernel methods of machine learning. It was first introduced by Vladimir Vapnik and his colleagues in the 1990s [8, 9].

Assume that we have L observations, each consisting of a pair (x_i, y_i), where i = 1, 2, …, L, x_i ∈ ℝ^n is the input vector, and y_i is a binary label (+1 or −1). The support vector machine tries to find the hyperplane in this space that best separates samples of the different classes. This hyperplane can be written as [10]:

w^T x + b = 0    (1)

For a hyperplane with weight vector w and bias b, the margin is the distance between the hyperplane defined by equation (1) and the feature vectors closest to it. The objective of the support vector machine is to find the hyperplane with the largest margin; its main task is to determine the w0 and b0 of this optimal hyperplane from the given training vectors. The general form is shown in Fig. 1: speaker sound models are created from the samples in the training phase, and in the test phase each observation is mapped to one of the two existing classes, so that a person's identity is either verified or rejected. Assuming the samples are linearly separable, hyperplanes with the maximum separating margin are the most effective.
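A minimal sketch of this construction, assuming scikit-learn and synthetic 2-D data in place of the paper's speech features: a linear SVC exposes the w and b of the separating hyperplane w^T x + b = 0, and a new sample is classified by the sign of w^T x + b.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Two linearly separable toy classes (stand-ins for the two verification outcomes).
X = np.vstack([rng.normal(-2, 0.5, size=(50, 2)), rng.normal(2, 0.5, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

w = clf.coef_[0]                   # weight vector w of the hyperplane w^T x + b = 0
b = clf.intercept_[0]              # bias b
margin = 2.0 / np.linalg.norm(w)   # width of the separating band the SVM maximizes

# A new observation is assigned to a class by the sign of w^T x + b.
x_new = np.array([1.5, 1.8])
print(int(np.sign(w @ x_new + b)))   # -> 1
print(len(clf.support_vectors_))     # only a few training points define the plane
```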

Fig. 1. Optimal separating hyperplane and support vectors

Among the reasons for choosing the support vector machine, we can mention the following:

1. Its training phase is simple.
2. It generalizes well even with little data.
3. Data can be mapped into higher dimensions. Moreover, it does not suffer from the overfitting and instability problems common to other classifiers, and because of the maximum margin it is less affected by noise.
4. Unlike neural networks, it yields a global optimum rather than a local one.

Its drawbacks include long training time and the lack of a rule for choosing the kernel function.

3 DECISION TREES

One of the efficient ways of classifying data is to build a decision tree. The decision tree is among the best-known algorithms of inductive learning and has been successfully applied to many different problems. Samples travel from the root downwards until they finally reach the leaf nodes. Each internal node poses a question about the input example and has as many branches as there are answers to that question. Each leaf of the tree indicates a class or group. The algorithm is called a 'decision tree' because the tree shows the process of deciding the class of an input example [11]. A training example is classified by the decision tree as follows: starting at the root, the feature specified by that node is tested, and according to the feature's value in the example, the process moves down along the corresponding branch. The process is repeated for the nodes below, as illustrated in Fig. 2.

Fig. 2. A simple decision tree design

Decision trees are applied to problems that can be represented so that a single response, the name of a group or class, is the answer. They can be used where the objective function has discrete output values, for example a question with a 'yes' or 'no' response. A decision tree has the following features: it can approximate discrete-valued functions (classification); it is robust to noise in the input data; and it handles high-volume data, which is why it is used in data mining.


The tree can also be expressed as 'if-then' rules, which are easy to understand, and it can be used even when examples lack some feature values. Most decision tree learning algorithms perform a greedy top-down search through the space of possible trees. The basic algorithm, the Concept Learning System (CLS), was introduced in the 1950s and was developed more comprehensively by Ross Quinlan in 1986 under the name ID3 (Induction of Decision Trees). Later, a more complete algorithm called C4.5 was presented, which removed some of ID3's deficiencies. The algorithm used in the implementation for this article is C4.5, which builds a decision tree to perform the classification.
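scikit-learn does not implement C4.5 itself, but setting criterion="entropy" in DecisionTreeClassifier uses the same information-gain splitting idea behind ID3/C4.5. A sketch on the Iris dataset, a stand-in for the speech features used in the paper:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# criterion="entropy" uses information gain, the splitting measure behind ID3/C4.5.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0).fit(X_tr, y_tr)

print(f"accuracy: {tree.score(X_te, y_te):.2f}")
print(export_text(tree))   # the learned tree printed as readable if-then rules
```

The export_text output makes concrete the point above that a tree can be read as a set of 'if-then' rules.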

4 ENSEMBLE CLASSIFIERS AND THEIR METHODS

Generally, supervised learning algorithms search a hypothesis space to find a solution to a particular problem. An ensemble classifier is a supervised learning algorithm that combines different hypotheses to build a better one; in other words, it combines weak learners to create a strong learner. Fast algorithms, such as decision trees, are commonly used with ensemble classifiers. Observations show that diverse ensembles perform more effectively, so different methods have been proposed to introduce variation among the combined models. The best-known methods are Bagging and Boosting [12]. In the Bagging method, classifiers built on different versions of the data are combined, and a majority vote is taken over the individual classifiers' decisions; the method's full name, bootstrap aggregating, is shortened to Bagging. Random Forest is among the classifiers that use Bagging: it contains several decision trees and obtains its output from the individual trees. The method combines Bagging with random feature selection to create a controlled group of different decision trees. High precision is one of its advantages, and it can handle many inputs [13]. The second well-known method, Boosting, reweights the training samples to emphasize hard examples and thereby changes the ensemble. In some cases this method is more precise than Bagging; its drawback is the long training phase. AdaBoost is one of the most famous Boosting methods.
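The variants named above can be compared side by side. A sketch assuming scikit-learn, with synthetic data in place of the paper's speaker features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for speaker feature vectors.
X, y = make_classification(n_samples=600, n_features=20, n_informative=8, random_state=0)

models = {
    "single entropy tree": DecisionTreeClassifier(criterion="entropy", random_state=0),
    "Bagging of trees": BaggingClassifier(
        DecisionTreeClassifier(criterion="entropy"), n_estimators=50, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=0),
}

# 5-fold cross-validated accuracy for each model.
results = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}
for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```

On data like this the bagged and forest ensembles typically edge out the single tree, mirroring the weak-learner-to-strong-learner argument above; exact numbers depend on the synthetic data.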

5 DIMENSION REDUCTION

We prefer to choose a subset of the more effective features instead of using all of them. This is done by eliminating features that contribute little to classification precision. By choosing an optimized subset of features as the training input, we can perform the classification process more effectively, reduce the dimensionality of the data samples, and remove the redundancy and ambiguity introduced by the extra features. We can then apply the classifier to just the selected features and complete the training process.

One of the best algorithms for selecting a feature subset is the genetic algorithm, which belongs to the family of evolutionary computation models [14]. Studies of Darwinian evolution led to the creation of a wide range of models for computer optimization. Genetic algorithms comprise a series of evolution-based techniques that rely on selection, variation, and repeated recombination of candidate solutions to a problem. These algorithms are parallel, iterative optimizers; they perform the optimization process well in verification and classification models and can be used with different classifiers, such as support vector machines.
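A toy genetic algorithm for feature-subset selection in the spirit described above. All details here (population size, mutation rate, fitness penalty, the decision-tree fitness classifier) are illustrative assumptions, not the paper's configuration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# 20 features, only a few informative -- the GA should favour small useful subsets.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

def fitness(mask):
    """Cross-validated accuracy on the selected columns, minus a small size penalty."""
    if not mask.any():
        return 0.0
    acc = cross_val_score(DecisionTreeClassifier(random_state=0), X[:, mask], y, cv=3).mean()
    return acc - 0.002 * mask.sum()

pop = rng.random((12, 20)) < 0.5          # population of random bitmask "chromosomes"
for generation in range(15):
    scores = np.array([fitness(ind) for ind in pop])
    parents = pop[np.argsort(scores)[::-1][:6]]   # selection: keep the fitter half
    children = []
    for _ in range(6):
        a, b = parents[rng.integers(6)], parents[rng.integers(6)]
        cut = rng.integers(1, 20)
        child = np.concatenate([a[:cut], b[cut:]])   # one-point crossover
        flip = rng.random(20) < 0.05                 # mutation: rare bit flips
        children.append(child ^ flip)
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(ind) for ind in pop])]
print(f"selected {best.sum()} of 20 features")
```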

6 PROPOSED METHOD

The proposed speaker verification method is presented in Fig. 3, which shows the different phases of identity verification, from pre-processing the input signal to acceptance or rejection of the claimed identity. Let us look at the phases more closely. After the speaker's sound signal is input and pre-processing steps such as silence removal are performed, we come to the speech feature extraction phase. In our implementation we first used both the MFCC and LPC methods to extract features; later, based on the obtained results, we used LPC for feature extraction. Having analyzed the features and identified those effective for speaker verification, we can use a subset of features and base the modeling on them. Selecting a feature subset reduces processing time and increases verification precision, because only effective features are used. After analyzing and selecting the feature subset comes the phase of comparing the input sound signal with the saved sound models to determine their similarity. Finally, we have the verification phase.

We tried numerous classifiers in the modeling and, based on the obtained results, chose the Bagging ensemble classifier with Random Forest decision trees, which had high precision. Having analyzed the degree of similarity, the system either accepted or rejected the speaker. In this phase, if the speaker was accepted, we could also obtain the speaker's name.
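The accept/reject step can be sketched as thresholding the ensemble's similarity score for the claimed speaker. The speaker names, threshold, and synthetic data below are hypothetical, and scikit-learn's Random Forest stands in for the paper's Bagging ensemble:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic "enrolled speakers": each class stands in for one speaker's model.
X, y = make_classification(n_samples=400, n_features=20, n_informative=10,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
names = ["spk_A", "spk_B", "spk_C", "spk_D"]   # hypothetical speaker labels

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

def verify(sample, claimed, threshold=0.5):
    """Accept the claim if the forest's similarity score for the claimed
    speaker clears the threshold; on acceptance the name is also known."""
    probs = forest.predict_proba(sample.reshape(1, -1))[0]
    score = probs[names.index(claimed)]
    return ("accept" if score >= threshold else "reject"), score

decision, score = verify(X[0], names[y[0]])   # a true claim on an enrolled sample
print(decision, round(score, 2))
```

Lowering the threshold trades false rejections for false acceptances, which is the usual operating-point choice in verification systems.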

7 IMPLEMENTATION AND EXPERIMENTAL RESULTS

To run the experiments and compare the efficiency of the intelligent speaker verification algorithms, we used FARSDAT. This database contains numerous words and expressions uttered by female and male speakers.

We used the expressions uttered by a total of 135 speakers, 42 women and 93 men. In the training phase, we used five different samples of varying length (1 to 4) for each speaker. In the test phase, we used two further samples of different lengths, which also differed from the training samples. We used MATLAB for the implementation, with support vector machines, C4.5 decision trees, and Random Forest. The verification results are given in Table 1. After implementing these methods, we repeated the modeling with the Bagging ensemble classifier; those results are presented in Table 2. In modeling the speakers' sound, we used the two current


methods of MFCC and LPC to extract features. The results are given separately in Tables 1 and 2.

Fig. 3. The proposed system for speaker verification

Given the varying length of the sound samples, the fact that each sample consists of several frames, that 39 features are extracted from each frame, and that we had several samples per speaker in the modeling, we used a hypervector method to map samples of different dimensions into a common space. In this approach, all samples are projected by methods such as PCA and KPCA into one space with a fixed number of dimensions. For our implementation we used the KPCA method with a Gaussian kernel function. The results obtained with the hypervector method are presented in Tables 1 and 2.
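A sketch of mapping variable-length utterances into one fixed-dimension space and then applying KPCA with a Gaussian (RBF) kernel, assuming scikit-learn. The per-utterance mean/std summary used here to build the fixed-length "hypervector" is an illustrative stand-in, since the paper does not spell out its exact construction:

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)

# Utterances with varying numbers of frames, 39 features per frame.
utterances = [rng.normal(size=(rng.integers(50, 200), 39)) for _ in range(30)]

# Illustrative fixed-length "hypervector" per utterance: per-feature mean and std.
hyper = np.array([np.concatenate([u.mean(axis=0), u.std(axis=0)]) for u in utterances])
print(hyper.shape)   # (30, 78): every sample now lives in one fixed-dimension space

# KPCA with a Gaussian (RBF) kernel projects the hypervectors to a common low dimension.
kpca = KernelPCA(n_components=10, kernel="rbf", gamma=0.01)
embedded = kpca.fit_transform(kpca_input := hyper)
print(embedded.shape)   # (30, 10)
```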

Table 1. Verification precision (%) with support vector machine and decision trees

        SVM     C4.5    Random Forest
MFCC    71.43   73.93   74.05
LPC     71.43   77.62   79.29

Table 2. Verification precision (%) when combining support vector machine and decision trees with the Bagging ensemble algorithm

        SVM     C4.5    Random Forest
MFCC    71.43   76.43   75.24
LPC     71.43   78.10   81.55

The tables above give the precision of the classifiers for speaker verification. As noted, we used the hypervector method with dimension mapping in the implementation, and we evaluated the samples with classifiers trained on features obtained from both the MFCC and LPC methods.

8 CONCLUSION

As Tables 1 and 2 (summarized in Fig. 4) show, verification precision with LPC feature extraction is higher than with MFCC. We also see that for speaker verification, classifiers based on decision trees are more precise than the support vector machine, and that the Random Forest classifier is the most precise of all.

Fig. 4. Comparison of precision and efficiency of the classification algorithms

As applied in this study, existing classifiers can be used inside an ensemble classifier such as Bagging to increase verification precision. In another experiment, we obtained a precision of 83.21% when we reduced the 39 features to just 12 using a genetic algorithm with the Bagging classifier and a Random Forest kernel.

Using feature evaluation methods and selecting a subset of the features, we reduced both training time and test duration and also obtained higher precision in speaker verification.

REFERENCES

[1] K. Farrell, R. Mammone, K. Assaleh, "Speaker recognition using neural networks and conventional classifiers", IEEE Trans. Speech Audio Process. 2(1), 194–205, 1994.
[2] W. Campbell, J. Campbell, D. Reynolds, E. Singer, P. Torres-Carrasquillo, "Support vector machines for speaker and language recognition", Comput. Speech Lang. 20(2–3), 210–229, 2006.
[3] D. Burton, "Text-dependent speaker verification using vector quantization source coding", IEEE Trans. Acoustics, Speech, Signal Process. 35(2), 133–143, 1987.
[4] D. Reynolds, "Speaker identification and verification using Gaussian mixture speaker models", Speech Comm. 17, 91–108, 1995.
[5] D. Reynolds, T. Quatieri, R. Dunn, "Speaker verification using adapted Gaussian mixture models", Digital Signal Process. 10(1), 19–41, 2000.
[6] V. Hautamaki, T. Kinnunen, I. Karkkainen, M. Tuononen, J. Saastamoinen, P. Franti, "Maximum a posteriori estimation of the centroid model for speaker verification", IEEE Signal Process. Lett. 15, 162–165, 2008.
[7] T. Kinnunen, J. Saastamoinen, V. Hautamaki, M. Vinni, P. Franti, "Comparative evaluation of maximum a posteriori vector quantization and Gaussian mixture models in speaker verification", Pattern Recognition Lett. 30(4), 341–347, 2009.
[8] V. Vapnik, Statistical Learning Theory, John Wiley & Sons, New York, 1998.
[9] V. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, 1995.
[10] C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition", Data Mining and Knowledge Discovery 2(2), 121–167, 1998.
[11] S. Ahmed, F. Coenen, P.H. Leng, "Tree-based partitioning of data for association rule mining", Knowl. Inf. Syst. 10(3), 315–331, 2006.
[12] T.G. Dietterich, "An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization", Machine Learning 40(2), 139–158, 2000.
[13] T.M. Khoshgoftaar, M. Golawala, J. Van Hulse, "An Empirical Study of Learning from Imbalanced Data Using Random Forest", Proceedings of the 19th IEEE Conference on Tools with Artificial Intelligence, pp. 310–317, 2007.
[14] K.M. Faraoun, A. Rabhi, "Data dimensionality reduction based on genetic selection of feature subsets", 2007.

F. Forootan received the B.S. degree in Computer Software Engineering from the Islamic Azad University of Shiraz in 2002 and the M.S. degree in computer systems architecture from the Islamic Azad University of Dezful in 2012. His main research activities cover intelligent systems and speech and speaker recognition systems.

M. Mosleh received his B.S. in computer engineering from Islamic Azad University, Dezful Branch, in 2003, his M.S. in computer engineering from Islamic Azad University, Tehran, in 2006, and his PhD in computer engineering from Islamic Azad University, Tehran, in 2010. He is an assistant professor in the Department of Computer Engineering at the Islamic Azad University, Dezful Branch. His main research interests are Speech Processing, Machine Learning, Intelligent Systems, and Audio Watermarking.

S. Setayeshi, B.Sc., M.Sc., M.A., M.A.Sc., Ph.D. (Electrical & Computer Eng., TUNS, Canada, 1993), is an Associate Professor teaching in the Faculty of Nuclear Engineering and Physics, Amirkabir University of Technology (Tehran Polytechnic). He also presents courses in the Computer Eng. Department of the Research & Science Branch of IAU as an Invited Professor. He founded the AL and Complex System research groups there for the first time and has supervised many graduate students in the areas of CA and LA. He has published more than 100 papers in ISI journals and conferences. His research interests are AI & AL, Intelligent Control (Neural–Fuzzy–Expert–GA–CA–LA), Adaptive Signal Processing, Agent-Based Modeling, Artificial Society, Social Evolution, Wealth Distribution, Knowledge-Based Systems, and Dynamics of Complex Systems.