
Speaker Identification from Voice Using Neural Networks



Journal of Scientific & Industrial Research

Vol. 61, August 2002, pp 599-606

Speaker Identification from Voice Using Neural Networks

Bijita Biswas* and Amit Konar

Department of Electronics and Telecommunication Engineering, Jadavpur University, Kolkata 700 032

Received: 18 February 2002; accepted: 02 May 2002

The paper provides three different schemes for identification of personnel from their voice using artificial neural networks. The first scheme recognizes speakers by employing the classical back-propagation algorithm pre-trained with known voice samples of the persons. The second scheme provides a framework for classifying the known training samples of the voice features using a hierarchical architecture realized with a self-organizing feature map neural net. The first scheme is highly robust, as it is capable of identifying personnel from their noisy voice samples, but because of its excessive training time it has limited applications for a large voice database. The second scheme, though not as robust as the former, can classify an unknown voice sample to its nearest class. The time needed for classification by the first scheme is always the same irrespective of the voice sample; it is proportional to the number of feed-forward layers in the network. The time requirement of the second classification scheme, however, depends on the voice features and is proportional to the number of 2-D arrays traversed by the algorithm on the hierarchical structure. The third scheme is highly robust, with mis-classification as low as 0.2 per cent; it combines the composite benefits of a radial basis function neural net and a back-propagation trained neural net.

1 Introduction

Several techniques for personnel identification from voice are available in the literature on phonetics [1-3]. Most of these techniques discriminate the speakers from the pitch and the formants of the respective voice samples. Many of these methods depend on the voice contents or on a specific format such as the CVC sequence (Consonant followed by a Vowel followed by a Consonant) and thus are not generic in the true sense. When the subjects are hostile to the environment, collection of such well-formatted speech sequences is difficult. For instance, one typical application of speaker identification is in criminal investigation, where the subjects may not at all be amenable to the scientific methods adopted for the system. The paper presents a novel scheme for speaker identification from generic, unformatted voice samples of the personnel and thus is unique in all respects. Soft computing tools, such as fuzzy logic and neuro-computing, are gaining importance in pattern recognition. Speaker identification from the voice of the subjects also belongs to this category of problems, and thus most of the available tools of soft computing may be utilized to handle this problem. Here three distinct models of artificial neural nets, namely the back-propagation (BP) algorithm, the self-organizing feature map (SOFM) neural net, and a radial basis function (RBF) neural net combined with BP, are employed to handle the present pattern recognition problem.

*Corresponding author: Women's Polytechnic, Jodhpur Park, Kolkata 700 068

The work also compares the performance of the two neural techniques mentioned above with respect to the application of neural nets [4-6] in the specific problem of speaker identification. An analysis of the comparison reveals that the BP algorithm [7-9] with a momentum term in the gradient descent learning expression has good convergence to the global minima of the error surface and thus is amenable for speaker identification even with small Gaussian noise around the measured input voice instance. One practical difficulty faced in the experimentation phase with the back-propagation algorithm is the excessively large training time. In fact, the training time is a higher-order polynomial of the number of sample input-output instances. In our case the training input instance is the features of the voice samples, and the output instance is the speaker number. The experimental results also demonstrate that when the training samples exceed 700 instances, the training time is of the order of several days for arriving at an RMS error margin of 0.001 or lower.

The alternative approach of classifying the input training samples by a topology-sensitive adaptation scheme fortunately requires only a few minutes to handle a problem of the same dimension mentioned above, and thus, of course, draws much attention. The SOFM based scheme has experimentally been found to have a relatively poorer classification accuracy than the BP based scheme. Failure in classification takes place when the voice samples of two or more patterns do not have a significant difference. In our implementation the per cent error in classification has been found to be significantly low with respect to the work already done with SOFM [10]. This could be realized because of one specialized feature, pitch, defined differently [11], that can isolate significant spatial differences in the topological clustering process on the 2-D planes.

Formal definitions of the speech features used for speaker identification are given in Section 2. A scheme for speaker identification using the classical BP algorithm is presented in Section 3. The SOFM based scheme for speaker classification is presented in Section 4. The RBF-BP scheme for speaker recognition is introduced in Section 5. The experimental analysis for the optimum choice of features for speaker recognition is presented in Section 6. The conclusions are listed in Section 7.

2 Voice Feature Extraction

Voice samples from 500 speakers within a given locality in their native language have been collected and processed by a digital sonogram (Model No. 5500, maintained by Indian Statistical Institute, Calcutta). The sonogram samples the voice signal at a fixed rate of 10240 Hz, and the samples are quantized to binary numbers with a truncation noise of 1/256 of the dynamic range of the signal or less. The voice features have then been directly analyzed by the sonogram, which provides the first few formants F1, F2, ..., F5, the pitch of the voice signal defined below, and the maximum power spectrum of the voice signal. The power spectrum at the selected sample frequencies is also treated as a voice feature. The following definitions are needed to explain the subsequent part of the paper.

Definition 1 - A formant is defined as the average of the fundamental (harmonic) frequencies obtained in a voice spectrum over a given interval of time.

Formally, if fi is the fundamental (harmonic) frequency of the voice at time i, then the j-th formant Fj is defined as:

Fj = (1/T) Σ_{i=0}^{T} fi ... (1)

Usually the first formant F1 corresponds to the fundamental frequency, while the second, third, ..., nth formants refer to the corresponding harmonic signals present in the voice. The averaging in the definition of the formant is needed from the computational standpoint, as the fundamental and harmonic frequencies have a small deviation over the speech sample. Figure 1 explains the frequency deviation in the respective fundamental and harmonic signals of a voice spectrum.
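The averaging in Eq. (1) can be sketched as follows. This is an illustrative fragment, not the authors' code; the tracked frequency values are hypothetical:

```python
import numpy as np

def formant(freq_track):
    """Eq. (1): a formant Fj as the time-average of the tracked
    (fundamental or harmonic) frequency samples f_i, i = 0..T."""
    return float(np.mean(freq_track))

# hypothetical tracked fundamental frequencies (Hz) over one utterance;
# averaging smooths the small per-frame deviation mentioned above
f1_track = [118.0, 121.0, 119.5, 120.5, 121.0]
F1 = formant(f1_track)
```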

Definition 2 - Power spectral density of a voice signal is defined as the instantaneous root mean square power of the signal spread over the available frequency range. It is usually denoted by p²(f).

For the computation of the total power in a frequency interval f1 to f2, we usually need to integrate p²(f) df within the said bounds of f1 and f2.

Definition 3 - The peak power spectral density is defined as the peak amplitude(s) of the power spectrum within a pre-defined frequency interval of f1 to f2. In case the power spectrum has several peaks, they are sorted according to their amplitudes in descending order, and the first three elements of the sorted array are used as the voice feature.
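Definition 3 can be sketched as below, assuming a simple periodogram estimate of the power spectral density; the paper itself obtains these values from the sonogram, so this is only an illustrative reconstruction:

```python
import numpy as np

def top_psd_peaks(signal, fs, fmin, fmax, n_peaks=3):
    """Estimate the power spectral density with a periodogram and return
    the n_peaks largest local maxima inside [fmin, fmax] as (power, freq)
    pairs, sorted in descending order of amplitude (Definition 3)."""
    spectrum = np.fft.rfft(signal)
    psd = (np.abs(spectrum) ** 2) / len(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    band = (freqs >= fmin) & (freqs <= fmax)
    p, f = psd[band], freqs[band]
    # a local maximum is a bin strictly larger than both neighbours
    peaks = [(p[i], f[i]) for i in range(1, len(p) - 1)
             if p[i] > p[i - 1] and p[i] > p[i + 1]]
    peaks.sort(reverse=True)
    return peaks[:n_peaks]
```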

Figure 1 - The first 3 formants in the power spectrum (spectral power plotted against the time over which the voice sample is uttered)


Definition 4 - The pitch is a measure of the large amplitude changes in a signal within a small time-span. For convenience of measurement, a voice signal is first sampled at a high rate, typically of the order of 10 kHz, and then quantized. The count of differences between successive quantized levels that exceed a predefined threshold over a time duration of 1 s is expressed as the measure of pitch (Figure 2).
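The pitch measure of Definition 4 can be sketched as follows; the quantization depth and threshold are assumptions for illustration, not values given in the paper:

```python
import numpy as np

def pitch_measure(samples, fs=10000, n_levels=256, threshold=8):
    """Definition 4: quantize the sampled signal to n_levels, then count
    the jumps between successive quantized levels whose absolute
    difference exceeds a threshold, normalized to a 1 s window."""
    lo, hi = np.min(samples), np.max(samples)
    q = np.round((samples - lo) / (hi - lo) * (n_levels - 1)).astype(int)
    jumps = np.abs(np.diff(q))
    count = int(np.sum(jumps > threshold))
    return count * fs / len(samples)   # scale the count to a 1 s duration
```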

The voice signal after digitization can also be analyzed in a MATLAB environment for the evaluation of the above features. The estimated values thus obtained may directly be ported to the neural net toolbox under MATLAB, or the networks can alternatively be realized with the user's own C-codes. In the present context we first restricted our scheme for speaker identification to MATLAB and then generated our own C-codes for neural net simulation for on-line speaker recognition.

3 Speaker Identification Using The Back-Propagation Algorithm

3.1 The Training Cycle

A neural network model with four layers, having ten neurons in the input layer, 12 neurons in each hidden layer and nine neurons in the output layer, has been used for the computer simulation. It may be noted that the output of the neural net here generates an encoded output, which needs to be decoded externally to identify the speaker. This was done so as to reduce the number of output lines even for a speaker set of 500.

Figure 2 - A scheme for pitch estimation by measuring large changes in a time-varying sampled and quantized signal over a duration of 1 s. The block arrow points to the position of a large change in the signal

In the simulation under MATLAB the network has been configured to use the LOGSIGMOID function, which generates outputs between 0 and 1. The training has been done by TRAINBPX, which uses both momentum and an adaptive learning rate to speed up the learning. The learning rate and the number of iterations (epochs) in the present context have been selected judiciously so as to minimize the sum-squared error criterion accurately. The TRAINBPX algorithm provides an autonomous adaptation of the learning rate.

The algorithm performs a forward pass through the network and subsequently determines the error at the nodes of the output layer by comparing them with the supplied target values. The error thus generated at the output layer is then propagated back following the steps of the back-propagation algorithm. Consequently, the weights of the nodes preceding the layer to which the error has been back-propagated are adapted by the gradient descent learning principle. The process is repeated from the output layer to the last hidden layer, then from the last hidden layer to the first hidden layer, and finally to the input layer. The whole sequence of weight adaptation steps together is called one learning epoch. Usually the learning rate and the parameters a and b used in the weight adaptation expressions [12] are kept fixed within a learning epoch.

Thus, at the end of a learning epoch, the errors at the output-layer nodes are calculated. If the new error exceeds the old error by more than a predefined ratio (typically 1.04), the new weights and error are discarded and the learning rate is decreased (typically by multiplying by 0.7); otherwise the new parameters are saved. If the new error is less than the old error, the learning rate is increased (typically by multiplying by 1.05).
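The adaptive learning-rate rule described above can be expressed compactly. This is a sketch of the rule's control logic only, using the typical constants quoted in the text:

```python
def adapt_learning_rate(old_error, new_error, lr,
                        max_ratio=1.04, lr_dec=0.7, lr_inc=1.05):
    """One end-of-epoch step of the adaptive rule described above:
    reject the epoch and shrink lr if the error grew by more than
    max_ratio; grow lr if the error decreased.
    Returns (accept_new_weights, new_lr)."""
    if new_error > old_error * max_ratio:
        return False, lr * lr_dec      # discard new weights, slow down
    if new_error < old_error:
        return True, lr * lr_inc       # keep new weights, speed up
    return True, lr                    # keep new weights, lr unchanged
```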

3.2 The Recognition Cycle

Once the neural net is trained, it may be used for recognition of a speaker from the voice. In the case of back-propagation learning the recognition process is very fast; it only needs a forward pass over the network. The outcome at the output layer thus obtained is then decoded to get the exact speaker number. In case an output is close to but less than 1, we consider it as 1. Similarly, if one or more outputs of a given network are close to but not equal to zero, we call them logic 0.
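The thresholding and external decoding step can be sketched as below. The 0.5 threshold and most-significant-bit-first ordering are assumptions for illustration; the paper only states that near-1 outputs are read as 1 and near-0 outputs as 0:

```python
def decode_speaker(outputs):
    """Threshold the analog outputs to logic 0/1 as described above,
    then decode the resulting binary code (assumed MSB first) into a
    speaker number.  Nine output lines cover up to 512 speakers."""
    bits = [1 if y >= 0.5 else 0 for y in outputs]
    speaker = 0
    for b in bits:
        speaker = speaker * 2 + b
    return speaker
```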

3.3 Results

Experiments with a large voice database reveal that the BP algorithm can safely recognize more than 98 per cent of the voice data. The remaining 2 per cent of the voice data could not be recognized because of their poor and ambiguous measurements. The training time of the algorithm is very high, of the order of several hours. An alternative neural topology thus may be used to solve the same problem.

4 Speaker Identification Using Self-organizing Feature Map

4.1 The Training Cycle

The second approach to solving the speaker identification problem is to employ a hierarchical self-organizing feature map (SOFM). The topology of a hierarchical SOFM is a tree where each node corresponds to one 2-D array onto which the input vectors are mapped.

The SOFM works on the following principle. The weights of all neurons in the 2-D planes are first randomized. The input feature vectors are then mapped onto the first 2-D plane of neurons by the following criterion.

Criterion - Search for the neuron on the given 2-D array whose weight vector has the least Euclidean distance with respect to the given input feature vector.

Once the neuron in the 2-D plane is identified, a small zone around it is fixed, within which the weight vectors of all the neurons are adjusted by the following learning rule:

Learning rule: Wnew ← Wold + η (X − Wold) ... (2)

where Wold is the weight vector of a given neuron, Wnew is the adapted value of the weight vector, X is the input feature vector, and η is a small learning rate. After each weight adaptation the old weight vector is replaced by the new vector.

The above process corresponds to mapping an input vector onto the neuronal plane. The process is repeated for all the input training vectors. In case more than one input vector is mapped to the same neuron, those vectors need to be re-mapped onto a new 2-D plane. The process is thus repeated by constructing newer planes until every input vector is mapped to a distinct neuron on some plane of the network. Thus the hierarchical organization of the tree is justified (Figure 3).
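One mapping step on a single 2-D plane, combining the matching criterion and learning rule (2), can be sketched as follows. The square neighbourhood and learning rate are illustrative assumptions:

```python
import numpy as np

def sofm_step(weights, x, eta=0.1, radius=1):
    """One SOFM mapping step on a 2-D plane of neurons.  The winner is
    the neuron whose weight vector has the least Euclidean distance to
    the input x (the criterion above); the winner and its neighbourhood
    are then pulled toward x, reading rule (2) as
    w_new = w_old + eta * (x - w_old).  Returns the winner's indices."""
    rows, cols, _ = weights.shape
    d = np.linalg.norm(weights - x, axis=2)           # distance at every neuron
    wi, wj = np.unravel_index(np.argmin(d), d.shape)  # winning neuron
    for i in range(max(0, wi - radius), min(rows, wi + radius + 1)):
        for j in range(max(0, wj - radius), min(cols, wj + radius + 1)):
            weights[i, j] += eta * (x - weights[i, j])
    return (int(wi), int(wj))
```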

4.2 The Recognition Cycle

It may be mentioned here that during the training phase the input vectors used for training the network comprise all the components mentioned in the BP algorithm plus one, the last one being the speaker number. This is needed because in the recognition phase, when the unknown input vector is searched against the hierarchy of nodes, only the first ten fields, corresponding to the features, are compared. When a successful match is found on the first plane, it is checked whether more than one input vector had been mapped there. If not, then the 11th field of the stored weight vector is the result to be submitted to the user. If so, then we need to search for the input vector in the next level of the tree. The search process thus continues until a leaf node of the tree is reached. In this case, on completion of the search process, the 11th component of the last found node is regarded as the speaker number.
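The recognition-phase search down the hierarchy can be sketched as below. The `node` dictionary layout (`'vectors'` holding the stored 11-field training vectors, `'children'` mapping a collided neuron index to its child plane) is a hypothetical representation, not the paper's data structure:

```python
import numpy as np

def identify(node, features):
    """Search an unknown 10-field feature vector down the hierarchy of
    planes: match only the first ten fields of the stored 11-field
    vectors; if the matched neuron had a collision (it owns a child
    plane), descend; otherwise the 11th field is the speaker number."""
    stored = np.array(node['vectors'], dtype=float)
    d = np.linalg.norm(stored[:, :10] - np.asarray(features), axis=1)
    best = int(np.argmin(d))
    child = node['children'].get(best)
    if child is not None:                 # more than one vector mapped here:
        return identify(child, features)  # continue at the next level
    return int(stored[best][10])          # 11th field = speaker number
```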

5 Speaker Identification by Composite Use of RBF Neural Net and Back-propagation Algorithm

The back-propagation algorithm is a good technique for speaker identification from the voice features of the speakers. But as the number of speakers increases, the training time also increases to a great extent. For example, when the number of speakers increases from 50 to 500, the training time increases by a factor of 20 (approximately) for a prescribed level of mean square error sum at the output layer of the order of 0.001. Typically, the training time of the back-propagation algorithm for approximately 500 speakers on an IBM Pentium type PC is around 40 h for a mean square error sum of the order of 0.001. Thus, back-propagation is not a good choice for a large number of training instances. Because of the small training time of the SOFM or the radial basis function (RBF) type neural nets, they are a good choice for speaker identification. The SOFM and RBF networks, however, usually classify the input instances into clusters; thus a one-to-one mapping from input to output instances is not achievable by these networks. A combination of an RBF or an SOFM network with a back-propagation type neural net in cascade seems to


Figure 3 - Mapping of the input vectors onto a topologically sensitive hierarchical 2-D array (input pattern; first-, second- and third-level 2-D planar arrays; non-terminals). The supplied input patterns are all mapped to the first-layer 2-D array. When more than one input vector maps to the same neuron of the i-th layer 2-D plane, the corresponding input vectors are re-mapped onto the (i + 1)th layer plane, until no neuron receives multiple mappings

be an ideal choice for speaker recognition. In this section, the scope of the composite use of the RBF and the back-propagation type neural nets in the speaker identification problem is examined.

For our experiment we considered 100 speakers. An RBF type neural net comprising two layers only (the input layer and the RBF layer) is employed to select one of ten three-layered feed-forward neural nets trained with the back-propagation algorithm. Each of these ten neural nets is trained with ten input-output instances that are sufficiently close to form a cluster. The RBF layer of the radial basis function neural net comprises ten neurons, each tuned to the cluster center of the ten close patterns that together form a cluster. Such training requires insignificantly small computational time, but the trained network can correctly identify one out of 100 speakers at the output layer of the back-propagation neural net assembly (Figure 4).

The RBF function in the present context is a typical Gaussian function G(μ, σ²), whose mean μ is the component-wise average of the patterns within a cluster, and whose variance σ² is the variance of the patterns themselves. Consequently, the cluster centers are sufficiently different to identify the class to which an unknown speaker belongs.


Figure 4 - A 2-stage RBF-BP neural net in tandem (the feature vector feeds an RBF control network that selects one of back-propagation neural nets 1 to 10). The RBF neural net determines the cluster center and then passes the input instance to the input of the appropriate BP neural net for speaker identification within that class

The mean and variance of the cluster centers for each node in the RBF layer are computed prior to the application phase. Thus, when a new instance appears at the input layer of the RBF-type neural net, the network can instantaneously determine the class of the unknown speaker. The back-propagation neural nets are also pre-trained with the ten voice features as input and the speaker identification number as the output.
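The cluster-selection stage of the cascade can be sketched as follows. This assumes a spherical Gaussian per cluster; the dispatch to a trained BP subnet (`bp_nets`) is indicated only in a comment, since those networks are outside the scope of this sketch:

```python
import numpy as np

def rbf_select(x, centers, variances):
    """Stage 1 of the RBF-BP cascade: each RBF neuron computes a
    Gaussian response G(mu, sigma^2) around one cluster center; the
    strongest response selects which pre-trained BP subnet receives x."""
    x = np.asarray(x, dtype=float)
    responses = [np.exp(-np.sum((x - np.asarray(mu)) ** 2) / (2.0 * var))
                 for mu, var in zip(centers, variances)]
    return int(np.argmax(responses))

# dispatch (bp_nets is a hypothetical list of trained BP forward passes):
# speaker = bp_nets[rbf_select(x, centers, variances)](x)
```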

6 Optimal Choice of Features in Speaker Identification

Selection of features is an important consideration in the speaker identification problem. A series of experiments has been carried out to determine the minimal but best possible set of features suitable for the speaker recognition problem. The experiments reveal that the first few formant


frequencies, pitch, total speech power, the peaks of the power spectral density, and a speaking-time vector comprising two components ST1 and ST2 are unique features that can correctly identify the speakers from the measured feature space.

During the experiment [12] the above features were extracted from the sample speech of 100 speakers, each uttering a given Bengali sentence: Dinubabur Deshn namm Phalakata. The speech samples were digitized, and parameters such as the formants, total speech power and peak power spectral densities were measured by a sonogram. For the measurement of speaking time, we ourselves designed a system that can measure the total speaking time ST1 and the talker's speaking rate ST2, which denotes the number of phonemes produced when the acoustic energy exceeds a pre-defined threshold.

A comparative study [13-14] of the above features is presented in Table 1. It is evident from Table 1 that the formant frequencies F4 and F5 do not have a significant contribution to speaker identification. Omission of pitch alone results in about 5 per cent mis-identification. The total spectral power and the peak power spectral density also have a significant contribution to speaker identification: it is observed from Table 1 that approximately 4-5 per cent mis-identification takes place in the absence of each of these two features. ST1 and ST2 have a major contribution to speaker identification; around 10 per cent mis-identification takes place in the absence of these two features in the training instance.

It may be added here that to represent the absence of a particular parameter, we set the corresponding field of the input training instance to a small positive number = 0.85. The other fields of the input instance are normalized by dividing their absolute values by the maximum value under the given field. Such a normalization process prohibits saturation of the sigmoid function in the back-propagation algorithm.

Table 1 - Accuracy of correct identification of speakers on feature selection

Selected features                                Percentage of speakers correctly identified
F1 F2 F3 F4 F5 Pitch  TSP  PPSD  ST1  ST2        99.8
F1 F2 F3 F4 x  Pitch  TSP  PPSD  ST1  ST2        99.7
F1 F2 F3 x  x  Pitch  TSP  PPSD  ST1  ST2        99.2
F1 F2 F3 x  x  x      TSP  PPSD  ST1  ST2        95.0
F1 F2 F3 x  x  x      TSP  x     ST1  ST2        94.0
F1 F2 F3 x  x  x      TSP  x     ST1  x          90.0
F1 F2 F3 x  x  x      TSP  x     x    ST2        91.2
F1 F2 F3 x  x  x      x    x     x    ST2        66.3
F1 F2 F3 F4 F5 Pitch  TSP  PPSD  x    x          88.2
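The normalization and missing-feature substitution described above can be sketched as follows; the column-wise treatment of the sentinel is an illustrative reading of the text:

```python
import numpy as np

MISSING = 0.85   # sentinel from the text for an absent feature field

def prepare_instances(X, present):
    """Normalize each feature column by its maximum absolute value (so
    the sigmoid in the BP net does not saturate), then overwrite the
    columns of de-selected features with the sentinel value."""
    X = np.asarray(X, dtype=float)
    col_max = np.max(np.abs(X), axis=0)
    col_max[col_max == 0] = 1.0            # guard all-zero columns
    Xn = X / col_max
    Xn[:, ~np.asarray(present)] = MISSING  # mark absent features
    return Xn
```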

7 Conclusions

The paper presented a new approach to speaker identification using the composite benefits of an RBF neural net and a number of three-layered feed-forward neural nets trained with the back-propagation algorithm. The RBF neural net classifies the input feature space of the speakers into clusters and subsequently selects the appropriate back-propagation neural net for determining the speaker within a class.

Selection of features has been given main consideration in the speaker identification problem. Experimental results demonstrate that the speaking-time vector, the first three formants, the peak power spectral density and the total power delivered in the speech signal together can identify the speaker with high accuracy, i.e. around 99 per cent.

References

1 Bahl L R, Jelinek F & Mercer R L, A maximum likelihood approach to continuous speech recognition, IEEE Trans Pattern Analysis Mach Intellig, 5(2) (March 1983) 179-189.

2 Doherty E T, An evaluation of selected acoustic parameters for use in speaker identification, J Phonet, 4 (1976) 321-326.

3 Ladefoged P, Review: The phonetic bases of speaker recognition, by F J Nolan, J Phonet, 12 (1982) 85-89.

4 Hassoun M H, Fundamentals of artificial neural networks (Prentice-Hall of India, New Delhi) 1998.

5 Recent advances in artificial neural networks: design and applications, edited by L C Jain and A M Fanelli (March 2000).

6 Practical applications of computational intelligence techniques, edited by L C Jain and P De Wilde (Kluwer Academic Publishers, Boston) 2001.

7 The handbook of brain theory and neural networks, edited by M A Arbib (MIT Press, Cambridge, MA) 1995.

8 Handbook of neural computation, Part C, edited by E Fiesler and R Beale (Institute of Physics Publishing and Oxford University Press, New York) 1997.

9 Gupta M M & Knopf G K, Neuro-vision systems: principles and applications (IEEE Press, New York) 1994.

10 Yegnanarayana B, Artificial neural networks (Prentice-Hall of India, New Delhi) 1989.

11 Subramanian R V, A fuzzy intelligent system for criminal investigation, ME Thesis, Jadavpur University, 1997.

12 Konar A, Artificial intelligence and soft computing: behavioral and cognitive modeling of the human brain (CRC Press, Boca Raton, Florida) Ch 23, 1999.

13 Biswas B, Fuzzy decision support system for criminal investigation, PhD Thesis, Jadavpur University, 2002.

14 Biswas B & Konar A, A neuro-fuzzy computational approach to criminal investigation (Physica-Verlag, Heidelberg) 2002.