A new method for speaker identification based on Learning Automata

JOURNAL OF COMPUTER SCIENCE AND ENGINEERING, VOLUME 11, ISSUE 1, JANUARY 2012


© 2012 JCSE

www.journalcse.co.uk


S. Afshar*, M. Mosleh and M. Kheirandish

Abstract: In recent years, many methods have been introduced for speaker identification, each built on high-level features or novel classifiers. In this paper we present a speaker identification system whose classifier is based on Learning Automata: the Learning Automata classifier is used both to model the speakers and to distinguish between them. Classifiers based on Learning Automata have been discussed in several papers and have shown good results in some classification tasks. However, such a classifier has difficulties with convergence speed and therefore cannot be used unchanged in a speaker identification system, so we modify it to overcome this problem. In the proposed system, feature vectors are generated from the voice signal using Linear Prediction Coefficients (LPC) and reduced with the Principal Component Analysis (PCA) method. The FARSDAT speech database is used to evaluate the effectiveness of the proposed method. The experimental results show that the proposed speaker identification scheme achieves a better recognition rate than speaker identification systems based on the Gaussian Mixture Model (GMM) and the Support Vector Machine (SVM).

Index Terms— Speaker identification, Speaker Modeling, Learning Automata.


1 INTRODUCTION

Speaker recognition, including both speaker identification and verification, has been an active research area for several decades [1].

Speaker identification aims to identify an unknown speaker from his or her voice sample. Speaker identification can be text-dependent or text-independent. In a text-dependent system, the speech used to train and test the system is constrained to be the same word or phrase. In a text-independent system, the training and testing speech are completely unconstrained [2]. This paper focuses on text-independent speaker identification. Such a system consists of three important components: feature extraction, speaker modeling and the matching algorithm. Feature extraction is the process that produces speaker-specific parameters from a given voice sample. A speaker model is generated for each speaker from these features. The matching algorithm performs the comparison against the speaker models. A general speaker identification system is shown in Fig. 1. The speaker identification process consists of two phases: a training phase and a recognition phase. As seen from the figure, both the training and the recognition (testing) phases include the feature extraction component. In the training phase, a model is built for each speaker and these models are stored for use in the recognition phase. In the recognition phase, the feature vectors of an unknown speaker's input are computed and compared with each model stored in the database, and the speaker whose model produces the maximum similarity is reported as the identity of the unknown speaker.

Various algorithms have been developed to model speakers; these include the Gaussian Mixture Model (GMM) [3], the Support Vector Machine (SVM) [4], Vector Quantization (VQ) [5] and Neural Networks [6]. In this paper we propose a new speaker identification system that uses learning automata (LA) as the speaker modeling algorithm. The outline of this paper is as follows. Section 2 describes learning automata in detail. Section 3 describes the LA based identification system. Section 4 presents the implementation of the proposed speaker identification system and the experimental results. Finally, Section 5 concludes the paper.

2 BACKGROUND

Before explaining the method, it is necessary to take an

————————————————

S. Afshar is with the Department of Computer Engineering, Dezful Branch, Islamic Azad University, Dezful, Iran.

M. Mosleh is with the Department of Computer Engineering, Dezful Branch, Islamic Azad University, Dezful, Iran.

M. Kheirandish is with the Department of Computer Engineering, Dezful Branch, Islamic Azad University, Dezful, Iran.

*This paper has been extracted from the M.Sc. thesis.


Fig. 1. General speaker identification system. Training phase: training utterance, feature extraction, speaker modeling, speaker models. Recognition phase: test utterance, feature extraction, matching algorithm (using the stored speaker models), decision.


overview of some important background, including Learning Automata and the classifier based on Learning Automata presented in [9].

2.1 Learning Automata

An automaton can be regarded as an abstract object with a finite number of actions [7]. An action is chosen at random according to a probability distribution kept over the action set, and at each instant the chosen action serves as the input to the random environment. In turn, the environment responds to the chosen action with a reinforcement signal. The action probability vector is updated based on this reinforcement feedback from the environment. The objective of a learning automaton is to find the optimal action in the action set, so that the average penalty received from the environment is minimized. The environment can be described by a triple

$E = \{\alpha, \beta, D\}$, where $\alpha = \{\alpha_1, \alpha_2, \ldots, \alpha_r\}$ represents the finite set of the inputs, $\beta = \{\beta_1, \beta_2, \ldots, \beta_m\}$ denotes the set of the values that can be taken by the reinforcement signal, and $D = \{d_1, d_2, \ldots, d_r\}$ denotes the set of the reward probabilities, where the element $d_i$ is associated with the given action $\alpha_i$. The relationship between the learning automata and its random environment is shown in Fig. 2.

Learning automata have been shown to perform well in systems where incomplete information about the environment exists. They have a wide variety of applications in combinatorial optimization problems and can be classified into fixed-structure and variable-structure automata [8]. A variable-structure learning automaton is represented by $\{\alpha, \beta, P, T\}$, where $\beta$ is the set of inputs, $\alpha$ is the set of actions, $P(k) = (p_1(k), p_2(k), \ldots, p_r(k))$ is the action probability distribution and $T$ is the learning algorithm. The learning algorithm is a recurrence relation used to modify the action probability vector. At each instant $k$, the automaton chooses an action $\alpha(k)$ at random according to its current action probability distribution $P(k)$, where $p_i(k) = \mathrm{prob}(\alpha(k) = \alpha_i)$ and $\sum_{i=1}^{r} p_i(k) = 1$ for all $k$.

The action chosen by the automaton is the input to the environment, which responds with a reaction or reinforcement signal $\beta(k)$. If the input set is binary ($\beta_i \in \{0, 1\}$), the environment is known as a P-model. In such an environment, we set $\beta_i(k) = 0$ for penalty and $\beta_i(k) = 1$ for reward. The environment then sends back a random response $\beta(k)$ to the automaton, whose expected value is $d_i(k)$ if $\alpha(k) = \alpha_i$. The automaton then applies the reinforcement scheme $T$ to update its action probability vector. Define the index $m$ by $d_m = \max_i \{d_i\}$. Then the action $\alpha_m$ is called the optimal action. This procedure continues until the optimal action for the environment is found.
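The interaction loop of Section 2.1 can be illustrated with a minimal Python sketch. The text does not fix a particular reinforcement scheme $T$, so the linear reward-inaction update used here is an assumption chosen for illustration, as are the names `run_lri`, `rate`, `steps` and `seed` and the example reward probabilities.

```python
import random

def run_lri(reward_probs, steps=5000, rate=0.05, seed=0):
    """Variable-structure automaton in a P-model environment.

    reward_probs[i] plays the role of d_i. The update rule (scheme T) is
    linear reward-inaction, an assumption made for this sketch only.
    """
    rng = random.Random(seed)
    r = len(reward_probs)
    p = [1.0 / r] * r                     # P(0): uniform over the r actions
    for _ in range(steps):
        # Choose action alpha(k) at random according to P(k).
        i = rng.choices(range(r), weights=p)[0]
        # Environment response beta(k): 1 = reward, 0 = penalty.
        beta = 1 if rng.random() < reward_probs[i] else 0
        if beta == 1:
            # Reward: shift probability mass toward the chosen action.
            p = [pj + rate * (1.0 - pj) if j == i else pj * (1.0 - rate)
                 for j, pj in enumerate(p)]
        # Penalty: P(k) is left unchanged (the "inaction" part).
    return p

# Hypothetical environment with three actions; action 1 has the largest d_i.
probs = run_lri([0.2, 0.8, 0.4])
print([round(x, 3) for x in probs])
```

The update keeps the entries of $P(k)$ summing to one at every step. With a small `rate` the probability mass usually concentrates on the action with the largest $d_i$, although this scheme does not guarantee it on every run.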

2.2 A Classifier Based On Learning Automata

In [9] a classifier has been designed based on Learning Automata. The goal of this method is to find one or more hyperplanes that efficiently separate a space into distinct classes. To find these hyperplanes, the solution space is first divided uniformly into $r$ hypercubes, each corresponding to one action of the learning automaton. Then, using the continuous Pursuit algorithm, the action probabilities and the estimates of the reward probabilities are updated at each period. This is done by calculating the function value of a randomly selected sample corresponding to the current action. If the estimate of a reward probability is smaller than a predefined threshold, the corresponding hypercube is evaluated according to the samples whose function values have been calculated. If both the mean and the variance of these function values are small enough, the hypercube is considered stable and useless; it is removed and the optimization continues with the remaining $r-1$ hypercubes. Otherwise, the hypercube is considered unstable, and the rises and falls (peaks and valleys) of the function within it are estimated from the samples inside it. The hypercube is then divided into a number of sub-hypercubes, each containing only ascending or only descending samples, and the original hypercube is replaced by the best-rewarded sub-hypercube. The other sub-hypercubes are considered useless and are removed, so the number of actions is unchanged. This procedure is repeated until a predefined precision condition is satisfied. At that point, each original hypercube has either been removed or has converged to one of several values that include a quasi-global optimum, i.e. a solution whose function value is close to that of the global optimum.

In [9], the Learning Automata based classifier was also compared to other known classifiers, such as KNN (K-Nearest Neighbor) and neural networks, and showed good results in cases where the data are not linearly separable.
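To make the update step of the continuous Pursuit algorithm concrete, the following Python sketch shows one common form of it: reward estimates are running averages, and the action probabilities are moved a small step toward the action with the best current estimate. It is an illustration only and does not reproduce the hypercube partitioning, removal and splitting of [9]; the names `pursuit_step`, `run_pursuit` and `lam`, and the zero-initialized estimates, are assumptions.

```python
import random

def pursuit_step(p, d_hat, counts, rewards, action, beta, lam=0.02):
    """One continuous pursuit update; returns the new probability vector."""
    counts[action] += 1
    rewards[action] += beta
    d_hat[action] = rewards[action] / counts[action]   # reward estimate
    m = max(range(len(p)), key=d_hat.__getitem__)      # best estimate so far
    # Move P(k) toward the unit vector of action m.
    return [(1.0 - lam) * pj + (lam if j == m else 0.0)
            for j, pj in enumerate(p)]

def run_pursuit(reward_probs, steps=3000, lam=0.02, seed=1):
    """Run the automaton against a P-model environment (hypothetical d_i)."""
    rng = random.Random(seed)
    r = len(reward_probs)
    p = [1.0 / r] * r
    d_hat, counts, rewards = [0.0] * r, [0] * r, [0] * r
    for _ in range(steps):
        a = rng.choices(range(r), weights=p)[0]
        beta = 1 if rng.random() < reward_probs[a] else 0
        p = pursuit_step(p, d_hat, counts, rewards, a, beta, lam)
    return p, d_hat

p_final, estimates = run_pursuit([0.3, 0.7, 0.5])
print([round(x, 3) for x in p_final], [round(x, 3) for x in estimates])
```

In the classifier of [9] each action corresponds to a hypercube of the solution space, so the estimate vector `d_hat` is the quantity that would be compared against the predefined threshold.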

3 THE PROPOSED METHOD

Identifying a person is a classification problem. In an identification system each person corresponds to a class and the system must choose the correct one, so the task can be solved with a classification system. In this paper we design a voice identification system based on an LA classifier and investigate its performance on a known database. First, the system is trained with some voice samples from each person; the remaining samples are used to verify the system. Such a system can have many classes (proportional to the number of persons), and our observations show that the LA classifier in [9] has difficulties in such cases: its performance

Fig. 2. Relationship between environment and automata (blocks: Environment $\{\alpha, \beta, D\}$; Learning Automata $\{\alpha, \beta, P, T\}$).


decreases rapidly as the number of classes increases. However, the performance of the LA classifier can be improved through a one-against-one strategy, which we adopt. To train the system, a feature vector is first extracted for each training voice sample, as described in Section 4.2. Then the LA classifier performs binary classification for every binary combination of the existing classes (the number of these combinations is $C(C-1)/2$, where $C$ is the number of classes). The outputs are the hyperplanes selected for each binary classification. The number of hyperplanes, $H$, need not be the same for all binary classifications and may vary. For each binary classification, we start with the minimum $H = 1$, and if the training success rate is less than a threshold, $H$ is increased by one. There is also a maximum value that $H$ may not exceed. This completes the training phase.

The remaining voice samples are then used to verify the accuracy of the identification system. A feature vector is computed for each test voice sample, and the system must choose a person (corresponding to a class) for this test sample. Consider a feature vector for which a class must be chosen. First, we prepare a table with one entry per class, all initialized to zero. Then, for each binary combination of the classes, the corresponding hyperplane (stored in the training phase) decides which class is the winner, that is, which of the two classes suits the feature vector better. One point is added to the winner's entry in the table. If the feature vector lies exactly on a hyperplane, the entries of both classes are increased by one point (both are assumed to be winners). Finally, the class whose entry has the maximum score in the table is the class chosen for the feature vector. Fig. 3 shows the overall process of choosing the corresponding speaker for each input feature vector.
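The training-time search over the number of hyperplanes $H$ for one binary classification can be sketched as follows. The LA classifier itself is not implemented here: `train_pair` is a hypothetical callback that trains one pairwise classifier with a given $H$ and returns the model together with its training success rate, and the stand-in trainer and its rates are invented purely to make the sketch runnable.

```python
def choose_num_hyperplanes(train_pair, threshold, h_max):
    """Increase H from 1 until the training success rate reaches the threshold.

    train_pair(h) is a hypothetical callback returning (model, success_rate).
    If the threshold is never reached, the result for H = h_max is returned.
    """
    for h in range(1, h_max + 1):
        model, rate = train_pair(h)
        if rate >= threshold:
            break
    return h, model, rate

# Stand-in trainer with made-up success rates, for illustration only.
def fake_train_pair(h):
    rates = {1: 0.70, 2: 0.85, 3: 0.95}
    return "model-with-%d-hyperplanes" % h, rates[h]

print(choose_num_hyperplanes(fake_train_pair, threshold=0.90, h_max=3))

# Number of binary classifications C(C-1)/2 for the system sizes used later.
for c in (3, 6, 9, 12):
    print(c, c * (c - 1) // 2)
```

The last loop evaluates $C(C-1)/2$ for 3, 6, 9 and 12 speakers, giving 3, 15, 36 and 66 pairwise classifiers respectively.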

Procedure: Corresponding speaker selection
Begin
  Get a feature vector to choose a class for it.
  Prepare table H and set H_i = 0, i = 1, 2, …, C (C is the number of classes).
  For each binary combination of classes do:
    According to the hyperplane of this combination, decide which one of these two classes is the winner.
    H_winner_class = H_winner_class + 1
  End For
  The chosen class for the input feature vector is the argument of H which has the maximum value (if there is more than one, select one of them randomly).
End

The test process is repeated for all test voice samples. Then, from the number of correctly recognized samples of each class, the success rate for that class is computed, and the overall success rate of the system is the average of these per-class success rates.
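As a small worked example of this scoring rule, the sketch below computes per-class success rates and their average. The function name and the toy labels are illustrative assumptions, not data from the experiments.

```python
def overall_success_rate(true_labels, predicted_labels):
    """Return (average of per-class success rates, per-class rates), in percent."""
    classes = sorted(set(true_labels))
    rates = []
    for c in classes:
        idx = [i for i, t in enumerate(true_labels) if t == c]
        correct = sum(1 for i in idx if predicted_labels[i] == c)
        rates.append(100.0 * correct / len(idx))
    return sum(rates) / len(rates), dict(zip(classes, rates))

# Toy labels: class 0 has 3 of 4 correct, class 1 has 2 of 4 correct.
truth = [0, 0, 0, 0, 1, 1, 1, 1]
guess = [0, 0, 0, 1, 1, 1, 0, 0]
print(overall_success_rate(truth, guess))   # (62.5, {0: 75.0, 1: 50.0})
```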

4 EXPERIMENTAL RESULTS

4.1 Database

To evaluate the proposed method, the FARSDAT database is used. This database includes a variety of Persian speech data uttered by 304 native speakers who differ from each other in age, gender, dialect and educational level. Each speaker uttered 20 sentences in two sessions. The database was provided by the Research Center of Intelligent Signal Processing and collected in the acoustic booth of the Linguistics Laboratory of the University of Tehran. As mentioned, there are 20 audio sentence samples for each person in the database (each sentence is about 5 seconds long). The final experiments were done for systems with 3, 6, 9 and 12 different speakers. For each of these experiments, each audio sentence in the dataset was divided into two parts, giving 40 samples per speaker, and 10-fold cross-validation was used to obtain a more accurate result. For example, in the experiment with an identification system of 6 persons, there are 6 classes, each consisting of 40 samples. The samples of each class are divided into ten folds, and in each repetition 9 folds (36 samples per class) are used for training and the remaining fold (4 samples per class) is used for testing.
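The fold arithmetic of this protocol can be checked with a short Python sketch. The text does not say how samples are assigned to folds, so contiguous, unshuffled folds within each class are an assumption, as is the function name.

```python
def ten_fold_splits(samples_per_class=40, folds=10):
    """Yield (train_indices, test_indices) for the samples of one class.

    Folds are contiguous and unshuffled; this assignment is an assumption.
    """
    fold_size = samples_per_class // folds
    indices = list(range(samples_per_class))
    for f in range(folds):
        test = indices[f * fold_size:(f + 1) * fold_size]
        train = indices[:f * fold_size] + indices[(f + 1) * fold_size:]
        yield train, test

for train, test in ten_fold_splits():
    print(len(train), len(test), test)
```

Each repetition uses 36 samples of a class for training and 4 for testing, and every sample is tested exactly once over the ten repetitions.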

4.2 Feature Vector Extraction

It is typical to apply some pre-processing to the input signal before feature extraction. However, since the Signal to Noise Ratio (SNR) of all voice samples in FARSDAT is greater than 22 dB, there is no need for heavy pre-processing such as low-pass filtering or signal enhancement. For more robustness against environmental effects, however, transformation (1) is applied to the voice signal $X$ to get a new signal $X'$ of the same length:

$$X' = \frac{X - \bar{X}}{\mathrm{STD}(X)} \qquad (1)$$

In the above relation, $\bar{X}$ and $\mathrm{STD}(X)$ are the mean and the standard deviation of signal $X$, respectively. Hence, after this transformation, regardless of the speaker and of the environmental conditions, the new signal $X'$ always has a mean of 0 and a standard deviation of 1. In feature extraction, we want to extract 12 values as the features of an audio sample. For this purpose, in the first step, the Linear Prediction Coefficients (LPC) of signal $X'$ are calculated. The LPC order is 12, each frame is 500 samples long and consecutive frames overlap by 250 samples. The output therefore consists of 12 coefficients for each frame of signal $X'$, so for a signal with $m$ frames the output can be considered as a matrix with $m$ rows and 12 columns. Then the Principal Component Analysis (PCA) method is applied to reduce this matrix to a vector with only 12 coefficients, which is the final feature vector for the original voice signal $X$. PCA, also known as the Karhunen-Loève transform, is a linear orthogonal transform with two main properties: it finds the uncorrelated directions of maximum variance in the data space, and it provides the optimal linear projection in the least-squares sense [10]. Setting the theory aside, PCA yields a linear mapping for the LPC coefficients and generates a matrix of dimension $12 \times 12$ in which each column is a vector of the space and all of these vectors are orthogonal to each other. It is assumed that the original space can be reconstructed from these 12 vectors. Finally, according to the eigenvalues, we select the column (vector) corresponding to the greatest eigenvalue, and its 12 coefficients form the final feature vector.

Fig. 3. Corresponding speaker selection procedure
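The feature extraction steps of this section (transformation (1), framing, order-12 LPC, and selection of the principal direction) can be sketched in plain Python as follows. Several details are assumptions, since the text does not specify them: the population standard deviation, the autocorrelation method with the Levinson-Durbin recursion for LPC, the predictor sign convention, power iteration for the leading eigenvector, and the synthetic test signal. The sign of an eigenvector is arbitrary, and the sketch does not normalize it.

```python
import math
import random

def normalize(x):
    """Transformation (1): zero mean, unit (population) standard deviation."""
    mean = sum(x) / len(x)
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / len(x))
    return [(v - mean) / std for v in x]

def frames(x, length=500, overlap=250):
    """Split the signal into frames of 500 samples overlapping by 250."""
    step = length - overlap
    return [x[s:s + length] for s in range(0, len(x) - length + 1, step)]

def lpc(frame, order=12):
    """Predictor coefficients a_1..a_order via Levinson-Durbin (assumed method)."""
    n = len(frame)
    r = [sum(frame[t] * frame[t + k] for t in range(n - k))
         for k in range(order + 1)]             # autocorrelation lags 0..order
    a, err = [0.0] * (order + 1), r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                            # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a, err = new_a, err * (1.0 - k * k)      # updated prediction error
    return a[1:]

def top_principal_direction(rows, iters=200):
    """Eigenvector of the covariance matrix with the greatest eigenvalue."""
    m, d = len(rows), len(rows[0])
    mean = [sum(row[j] for row in rows) / m for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (m - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):                       # power iteration
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return v

def feature_vector(signal):
    """m x 12 LPC matrix reduced to a single 12-coefficient vector."""
    lpc_matrix = [lpc(f) for f in frames(normalize(signal))]
    return top_principal_direction(lpc_matrix)

# Synthetic stand-in for a voice signal (2000 samples, hence 7 frames).
rng = random.Random(0)
signal = [math.sin(0.05 * t) + 0.5 * math.sin(0.3 * t) + 0.3 * rng.gauss(0, 1)
          for t in range(2000)]
features = feature_vector(signal)
print(len(features))
```

The sketch needs at least two frames to form a covariance matrix and a frame with nonzero energy for the recursion to be defined; a production implementation would guard against both cases.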

4.3 Results

To compare with other known identification systems, we also ran the above test on the same identification system with an SVM classifier and with a GMM classifier. The SVM classifier uses a polynomial kernel, and the GMM classifier is set to five Gaussian mixtures with at least 1000 iterations. The SVM and GMM classifiers were simulated using the MATLAB Arsenal toolbox. Table 1 shows the results of the simulations for identification systems with 3, 6, 9 and 12 different speakers on FARSDAT.

Table 1. Comparative accuracy rate (%) of the proposed speaker identification method against GMM and SVM based systems on the FARSDAT database.

Number of speakers | The proposed method | GMM based system | SVM based system
3                  | 97.5                | 95               | 95
6                  | 86                  | 85.83            | 85.83
9                  | 77.91               | 77               | 77
12                 | 71.76               | 71.75            | 71.75

As can be seen, the accuracy of the proposed speaker identification system based on the LA classifier is higher than that of the GMM and SVM based systems in the experiments with 3, 6 and 9 speakers. In the 12-speaker experiment the three systems are practically tied (71.76% for the proposed method versus 71.75% for the GMM and SVM based systems).

5 CONCLUSION

This paper discusses a speaker identification system based on an LA classifier. The LA classifier is discussed in [9] and has shown good results in classification tasks, but LA had not yet been used in speaker recognition systems. In this paper we build an identification system based on the LA classifier and, to reach better performance, modify it with a one-against-one strategy. To evaluate the proposed method, we used FARSDAT, a well-known Persian speech database provided by the Research Center of Intelligent Signal Processing and collected in the acoustic booth of the Linguistics Laboratory of the University of Tehran. Experiments were carried out for four identification systems with 3, 6, 9 and 12 different speakers, and the results, compared to systems with GMM and SVM classifiers, show acceptable accuracy for the proposed identification system. These results also suggest that the LA classifier can be a good basis for speaker recognition systems.

REFERENCES

[1] J.P. Campbell, "Speaker recognition: a tutorial", Proc. IEEE, vol. 85, no. 9, pp. 1437-1462, Sep. 1997.

[2] T. Kinnunen, H. Li, "An overview of text-independent speaker recognition: From features to supervectors", Speech Communication, vol. 52, no. 1, pp. 12-40, Jan. 2010.

[3] D.A. Reynolds, R.C. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models", IEEE Trans. Speech Audio Process., vol. 3, no. 1, pp. 72-83, Jan. 1995, doi: 10.1109/89.365379.

[4] V. Wan, W.M. Campbell, "Support vector machines for speaker verification and identification", IEEE Signal Process. Society Workshop, vol. 2, pp. 775-784, Dec. 2000.

[5] H.B. Kekre, V. Kulkarni, "Speaker identification by using vector quantization", Int. J. Eng. Sci. Tech., vol. 2, no. 5, pp. 1325-1331, May 2010.

[6] R.V. Pawer, P.P. Kajave, S.N. Mali, "Speaker identification using Neural Networks", World Academy Sci. Eng. Tech., vol. 12, 2005.

[7] M.A.L. Thathachar, P.S. Sastry, "Varieties of learning automata: An overview", IEEE Trans. Systems, Man, and Cybernetics, Part B, vol. 32, no. 6, pp. 711-722, Dec. 2002, doi: 10.1109/TSMCB.2002.1049606.

[8] A.S. Poznyak, K. Najim, "Learning automata and stochastic optimization", Berlin: Springer-Verlag, 1997.

[9] S.H. Zahiri, "Learning automata based classifier", Patt. Recog. Letters, vol. 29, no. 1, pp. 40-48, Jan. 2008.

[10] A. Weingessel, K. Hornik, "Local PCA algorithms", IEEE Trans. Neural Networks, vol. 11, no. 6, pp. 1242-1250, Nov. 2000, doi: 10.1109/72.883408.

S. Afshar received the B.S. degree in Computer Software Engineering from the Islamic Azad University of Mahshahr in 2007. She also received the M.S. degree in architecture of computer systems from the Islamic Azad University of Dezful in 2011. Her main research activities cover the areas of intelligent systems and speech and speaker recognition systems.

Mohammad Mosleh received his B.S. in computer engineering from Islamic Azad University, Dezful Branch, in 2003, the M.S. in computer engineering from Islamic Azad University, Tehran, in 2006 and the PhD degree in computer engineering from Islamic Azad University, Tehran, in 2010. He is an assistant professor in the Department of Computer Engineering at Islamic Azad University, Dezful Branch. His main research interests are in the areas of Speech Processing, Machine Learning, Intelligent Systems, and Audio Watermarking.

M. Kheyrandish received the B.S. degree in Computer Hardware Engineering from the Islamic Azad University, Dezful branch, Dezful, Iran (2002). He then received the M.S. degree in Architecture of Computer Systems from the Islamic Azad University, Science Research branch, Tehran, Iran (2005). He is now a PhD candidate in Computer Engineering at the Islamic Azad University, Science Research branch. His main research activities cover the areas of Natural Language Processing and Speech Recognition and Reconstruction.
