
Page 1: ASR_final

“Development of Some Techniques for Text-Independent Speaker Recognition from Audio Signals”

By
Bidhan Barai

Under the guidance of
Dr. Nibaran Das and Dr. Subhadip Basu

Assistant Professors of Computer Science & Engineering
Jadavpur University
Kolkata – 700 032

Page 2: ASR_final

Overview

● Introduction

● Types of Speaker Recognition

● Principles of Automatic Speaker Recognition (ASR)

● Steps of Speaker Recognition:

1> Voice Recording

2> Feature Extraction

3> Modeling

4> Pattern Matching

5> Decision (accept / reject) (for Verification)

● Conclusion

● References

Page 3: ASR_final

Introduction

● Speaker recognition is the identification of a person from the characteristics of his or her voice (voice biometrics); it is also called voice recognition. There is a difference between speaker recognition (recognizing who is speaking) and speech recognition (recognizing what is being said).

● In addition, there is a difference between the act of authentication (commonly referred to as speaker verification or speaker authentication) and identification.

Page 4: ASR_final

Types of Speaker Recognition

● Text-Dependent:

If the text must be the same for enrollment and verification, this is called text-dependent recognition. In a text-dependent system, prompts can either be common across all speakers (e.g., a common pass phrase) or unique to each speaker.

● Text-Independent:

Text-independent systems are most often used for speaker identification, as they require very little, if any, cooperation from the speaker. In this case the text used during enrollment and testing is different.

Page 5: ASR_final

Types of Speaker Identification

● Closed-Set: the speaker is assumed to be in the database.

In closed-set identification, the audio of the test speaker is compared against all available speaker models, and the speaker ID of the closest-matching model is returned. The result is the best-matching speaker.

● Open-Set: the speaker may not be in the database.

Open-set identification may be viewed as a combination of closed-set identification and speaker verification. Result can be a speaker or a no-match result.

Page 6: ASR_final

Principles of Automatic Speaker Recognition

● Speaker recognition can be classified into identification and verification.

● Speaker identification is the process of determining which registered speaker provides a given utterance.

● Speaker verification, on the other hand, is the process of accepting or rejecting the identity claim of a speaker.

● The following figures show the basic structures of speaker identification and verification systems. The system described here is classified as a text-independent speaker identification system, since its task is to identify the person who is speaking regardless of what is being said.

Page 7: ASR_final

Principles of Automatic Speaker Recognition ... Contd.

Figure 1

Block Diagram of Speaker Recognition System

Page 8: ASR_final

Principles of Automatic Speaker Recognition ... Contd.

● Speaker identification

Figure 2: Block diagram of speaker identification. The input speech passes through feature extraction; similarity is computed against the reference models of Speaker #1 … Speaker #N; maximum selection yields the identification result (speaker ID).

Page 9: ASR_final

Principles of Automatic Speaker Recognition ... Contd.

● Speaker verification

Figure 3: Block diagram of speaker verification. The input speech passes through feature extraction; similarity is computed against the reference model of the claimed speaker (ID #M); the score is compared against a threshold to produce the verification result (accept/reject).

Page 10: ASR_final

Principles of Automatic Speaker Recognition ... Contd.

● All speaker recognition systems operate in two distinct phases.

The first is referred to as the enrollment or training phase, while the second is referred to as the operational or testing phase.

● In the training phase, each registered speaker has to provide samples of his or her speech so that the system can build or train a reference model for that speaker. In the case of speaker verification systems, a speaker-specific threshold is additionally computed from the training samples.

● In the testing phase, the input speech is matched with stored reference model(s) and a recognition decision is made.

Page 11: ASR_final

Steps of Speaker Recognition

1> Voice Recording

2> Feature Extraction

3> Modeling

4> Pattern Matching

5> Decision (accept / reject) (for Verification)

Page 12: ASR_final

Step 1: Voice Recording

● The speech input is typically recorded at a sampling rate above 10000 Hz (10 kHz).

● This sampling frequency was chosen to minimize the effects of aliasing in the analog-to-digital conversion. The sampled signal can capture all frequencies up to 5 kHz, which covers most of the energy of sounds generated by humans.

● The relationship between the sampling rate (10 kHz) and the highest representable frequency (5 kHz) follows from the Nyquist sampling theorem.
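As a small illustration of this relationship, a minimal sketch assuming SciPy is available and a WAV recording exists (the file name "speech.wav" is a hypothetical example, not from the slides):

    import numpy as np
    from scipy.io import wavfile

    # "speech.wav" is a hypothetical example file.
    rate, samples = wavfile.read("speech.wav")
    nyquist = rate / 2.0          # highest frequency the samples can represent
    print(f"sampling rate = {rate} Hz, Nyquist limit = {nyquist} Hz")
    # e.g. a 10 kHz sampling rate captures content only up to 5 kHz.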

Page 13: ASR_final

Step 2: Speech Feature Extraction

● The purpose of this module is to convert the speech waveform, using digital signal processing (DSP) tools, to a set of features (at a considerably lower information rate) for further analysis. This is often referred to as the signal-processing front end.

● The speech signal is a slowly time-varying signal (it is called quasi-stationary). When examined over a sufficiently short period of time (between 5 and 100 ms), its characteristics are fairly stationary. However, over longer periods of time (on the order of 1/5 of a second or more) the signal characteristics change to reflect the different speech sounds being spoken.

● Therefore, short-time spectral analysis is the most common way to characterize the speech signal.

Page 14: ASR_final

Speech Feature Extraction...Contd

Examples of Speech Signals:

A wide range of possibilities exist for parametrically representing the speech signal for the speaker recognition task, such as Linear Prediction Coding (LPC), Mel-Frequency Cepstrum Coefficients (MFCC), Gammatone Frequency Cepstral Coefficients (GFCC), Group Delay Features (GDF) and others. MFCC is perhaps the best known and most popular, and will be described in this project.

Figures 4 and 5: Examples of speech signals.

Page 15: ASR_final

Speech Feature Extraction...Contd

● Mel-frequency Cepstrum Coefficients Processor:

A block diagram of the structure of an MFCC processor is given in Figure 6.

Figure 6

Page 16: ASR_final

Speech Feature Extraction...Contd

● Steps for extracting features from the speech signal (a library-based sketch follows this list):

1> Pre-emphasis

2> Frame Blocking

3> Windowing

4> Fast Fourier Transform (FFT)

5> Mel-frequency Wrapping

6> Cepstrum: Logarithmic Compression and Discrete Cosine Transform (DCT)
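The six steps above are what off-the-shelf toolkits implement end to end. As a cross-check of the pipeline only (not the implementation used in this work, and with librosa's own defaults for window type and filter count), a minimal sketch assuming librosa is installed, a hypothetical input file, and 13 coefficients:

    import librosa

    # "speech.wav" and n_mfcc=13 are illustrative assumptions.
    samples, rate = librosa.load("speech.wav", sr=None)      # keep native sampling rate
    mfcc = librosa.feature.mfcc(y=samples, sr=rate, n_mfcc=13)
    print(mfcc.shape)                                        # (13, number_of_frames)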

Page 17: ASR_final

Speech Feature Extraction...Contd

● Pre-emphasis: In speech processing, the original signal usually has too much low-frequency energy, so the signal is processed to emphasize the higher-frequency content. To perform pre-emphasis, we choose some value α between 0.9 and 1, and each sample of the signal is re-evaluated using:

y[n] = x[n] − α·x[n−1],  where 0.9 < α < 1

This is effectively a first-order high-pass filter.
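A minimal NumPy sketch of this filter; the coefficient value 0.97 is an assumed, commonly used choice rather than one taken from the slides:

    import numpy as np

    def pre_emphasis(signal: np.ndarray, alpha: float = 0.97) -> np.ndarray:
        """Apply y[n] = x[n] - alpha * x[n-1]; the first sample is kept unchanged."""
        return np.append(signal[0], signal[1:] - alpha * signal[:-1])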

Page 18: ASR_final

Speech Feature Extraction...Contd

Figure 7

Page 19: ASR_final

Speech Feature Extraction...Contd

● Frame Blocking: The input speech signal is segmented into frames of 20–30 ms with an optional overlap of 1/3–1/2 of the frame size. Usually the frame size (in sample points) is a power of two in order to facilitate the use of the FFT; if this is not the case, the frame is zero-padded to the nearest power-of-two length.

● Windowing: Each frame is multiplied by a Hamming window in order to keep the continuity of the first and the last points in the frame. If the signal in a frame is denoted by s(n), n = 0, …, N−1, then the signal after Hamming windowing is s(n)·w(n), where w(n) is the Hamming window defined by:

w(n, α) = (1 − α) − α·cos(2πn / (N−1)),  0 ≤ n ≤ N−1

Different values of α correspond to different curves for the Hamming window, shown next.
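A small NumPy sketch of frame blocking followed by Hamming windowing; the 25 ms frame length and 10 ms hop are assumed example values, and NumPy's built-in window corresponds to α = 0.46 in the formula above:

    import numpy as np

    def frame_and_window(signal, rate, frame_ms=25, hop_ms=10):
        """Split a 1-D signal into overlapping frames and apply a Hamming window."""
        frame_len = int(rate * frame_ms / 1000)    # e.g. 400 samples at 16 kHz
        hop_len = int(rate * hop_ms / 1000)        # e.g. 160 samples at 16 kHz
        # assumes len(signal) >= frame_len
        n_frames = 1 + (len(signal) - frame_len) // hop_len
        window = np.hamming(frame_len)             # (1 - 0.46) - 0.46*cos(2*pi*n/(N-1))
        frames = np.stack([signal[i * hop_len : i * hop_len + frame_len]
                           for i in range(n_frames)])
        return frames * window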

Page 20: ASR_final

Speech Feature Extraction...Contd

Figure 8

Page 21: ASR_final

Speech Feature Extraction...Contd

Figure 9

Page 22: ASR_final

Speech Feature Extraction...Contd

● Fast Fourier Transform (FFT): The Discrete Fourier Transform (DFT) of a discrete-time signal x(nT) is given by:

X(k) = Σ_{n=0}^{N−1} x[n] e^{−j(2π/N)nk},  k = 0, 1, …, N−1

where x(nT) = x[n].

Page 23: ASR_final

Speech Feature Extraction...Contd

● If we let W_N = e^{−j2π/N}, then:

X(k) = Σ_{n=0}^{N−1} x[n] W_N^{nk}

Figure 10: A sampled time-domain signal (amplitude vs. sample index) and its frequency-domain representation (magnitude vs. normalised frequency).

Page 24: ASR_final

Speech Feature Extraction...Contd

● x[n] = x[0], x[1], …, x[N−1]

X(k) = Σ_{n=0}^{N−1} x[n] W_N^{nk},  0 ≤ k ≤ N−1    [1]

Let us divide the sequence x[n] into even and odd sub-sequences:
x[2n] = x[0], x[2], …, x[N−2]
x[2n+1] = x[1], x[3], …, x[N−1]

Page 25: ASR_final

Speech Feature Extraction...Contd

● Equation 1 can be rewritten as:

X(k) = Σ_{n=0}^{N/2−1} x[2n] W_N^{2nk} + Σ_{n=0}^{N/2−1} x[2n+1] W_N^{(2n+1)k}    [2]

Since:

W_N^{2nk} = e^{−j(2π/N)·2nk} = e^{−j(2π/(N/2))·nk} = W_{N/2}^{nk}   and   W_N^{(2n+1)k} = W_N^{k} · W_{N/2}^{nk}

Then:

X(k) = Σ_{n=0}^{N/2−1} x[2n] W_{N/2}^{nk} + W_N^{k} Σ_{n=0}^{N/2−1} x[2n+1] W_{N/2}^{nk} = Y(k) + W_N^{k} Z(k)

Page 26: ASR_final

Speech Feature Extraction...Contd

● The result is that an N-point DFT can be divided into two N/2-point DFTs:

X(k) = Σ_{n=0}^{N−1} x[n] W_N^{nk},  0 ≤ k ≤ N−1    (N-point DFT)

● where Y(k) and Z(k) are the two N/2-point DFTs operating on the even and odd samples respectively:

X(k) = Σ_{n=0}^{N/2−1} x₁[n] W_{N/2}^{nk} + W_N^{k} Σ_{n=0}^{N/2−1} x₂[n] W_{N/2}^{nk} = Y(k) + W_N^{k} Z(k)    (two N/2-point DFTs)

Page 27: ASR_final

Speech Feature Extraction...Contd

● The periodicity and symmetry of W can be exploited to simplify the DFT further:

X(k) = Σ_{n=0}^{N/2−1} x₁[n] W_{N/2}^{nk} + W_N^{k} Σ_{n=0}^{N/2−1} x₂[n] W_{N/2}^{nk}

X(k + N/2) = Σ_{n=0}^{N/2−1} x₁[n] W_{N/2}^{n(k+N/2)} + W_N^{k+N/2} Σ_{n=0}^{N/2−1} x₂[n] W_{N/2}^{n(k+N/2)}    [3]

where:

W_N^{k+N/2} = e^{−j(2π/N)k} · e^{−j(2π/N)(N/2)} = e^{−j(2π/N)k} · e^{−jπ} = −e^{−j(2π/N)k} = −W_N^{k}    (symmetry)

and:

W_{N/2}^{k+N/2} = e^{−j(2π/(N/2))k} · e^{−j(2π/(N/2))(N/2)} = e^{−j(2π/(N/2))k} = W_{N/2}^{k}    (periodicity)

Page 28: ASR_final

Speech Feature Extraction...Contd

● Finally, by exploiting the symmetry and periodicity, Equation 3 can be written as:

X(k + N/2) = Σ_{n=0}^{N/2−1} x₁[n] W_{N/2}^{nk} − W_N^{k} Σ_{n=0}^{N/2−1} x₂[n] W_{N/2}^{nk} = Y(k) − W_N^{k} Z(k)    [4]

● Hence the complete equations for the FFT are:

X(k) = Y(k) + W_N^{k} Z(k),  k = 0, …, N/2 − 1
X(k + N/2) = Y(k) − W_N^{k} Z(k),  k = 0, …, N/2 − 1
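A direct, recursive rendering of these butterfly equations, offered only as a teaching sketch (it assumes N is a power of two and is far slower than the library FFT it is checked against):

    import numpy as np

    def fft_radix2(x):
        """Recursive radix-2 FFT: X(k) = Y(k) + W^k Z(k), X(k+N/2) = Y(k) - W^k Z(k)."""
        x = np.asarray(x, dtype=complex)
        N = len(x)
        if N == 1:
            return x
        Y = fft_radix2(x[0::2])                               # N/2-point DFT of even samples
        Z = fft_radix2(x[1::2])                               # N/2-point DFT of odd samples
        W = np.exp(-2j * np.pi * np.arange(N // 2) / N)       # twiddle factors W_N^k
        return np.concatenate([Y + W * Z, Y - W * Z])

    # sanity check against NumPy's FFT
    x = np.random.randn(8)
    assert np.allclose(fft_radix2(x), np.fft.fft(x))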

Page 29: ASR_final

Speech Feature Extraction...Contd

● Schematic diagram of the FFT: radix-2 butterfly diagram. The even samples x[0], x[2], x[4], …, x[N−2] feed one N/2-point DFT (outputs y[0], …, y[N/2−1]) and the odd samples x[1], x[3], x[5], …, x[N−1] feed another (outputs z[0], …, z[N/2−1]); the butterflies then combine them, e.g. X[0] = y[0] + W⁰z[0], X[1] = y[1] + W¹z[1], X[N/2] = y[0] − W⁰z[0], X[N/2+1] = y[1] − W¹z[1].

Page 30: ASR_final

Speech Feature Extraction...Contd

● Mel-frequency Wrapping: Psychophysical studies have shown that human perception of the frequency content of sounds does not follow a linear scale. This research led to the concept of subjective frequency: for each sound with an actual frequency f, measured in Hz, a subjective frequency is measured on a scale called the Mel scale. The Mel frequency can be approximated by:

Mel(f) = 2595 · log₁₀(1 + f/700)

Page 31: ASR_final

Speech Feature Extraction...Contd

Mel Frequency Plot:

Figure 11

Page 32: ASR_final

Speech Feature Extraction...Contd

● In the Mel-frequency scale there is linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz.

● Triangular Filter Bank: The human ear acts essentially like a bank of overlapping band-pass filters, and human perception is based on the Mel scale. Thus, the approach to simulating human perception is to build a filter bank with bandwidths given by the Mel scale, pass the magnitudes of the spectra through these filters, and obtain the Mel-frequency spectrum.

Page 33: ASR_final

Speech Feature Extraction...Contd

● Equally spaced Mel values: we define a triangular filter bank with M filters (m = 1, 2, …, M), where H_m[k] is the magnitude (frequency response) of the m-th filter, given by:

H_m(k) = 0,                                  k < f(m−1)
H_m(k) = (k − f(m−1)) / (f(m) − f(m−1)),     f(m−1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) − k) / (f(m+1) − f(m)),     f(m) ≤ k ≤ f(m+1)
H_m(k) = 0,                                  k > f(m+1)
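A NumPy sketch of such a triangular filter bank, under assumed illustrative parameters (M = 26 filters, a 512-point FFT and a 16 kHz sampling rate, none of which are specified in the slides):

    import numpy as np

    def mel_filterbank(n_filters=26, n_fft=512, rate=16000):
        """Triangular filters H_m(k), with centre frequencies equally spaced on the Mel scale."""
        hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
        mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
        # M + 2 equally spaced points on the Mel axis, mapped back to FFT bin indices
        mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(rate / 2.0), n_filters + 2)
        bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / rate).astype(int)
        H = np.zeros((n_filters, n_fft // 2 + 1))
        for m in range(1, n_filters + 1):
            left, center, right = bins[m - 1], bins[m], bins[m + 1]
            for k in range(left, center):                     # rising edge of the triangle
                H[m - 1, k] = (k - left) / (center - left)
            for k in range(center, right):                    # falling edge of the triangle
                H[m - 1, k] = (right - k) / (right - center)
        return H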

Page 34: ASR_final

Speech Feature Extraction...Contd

● Mel Filter Bank:

Page 35: ASR_final

Speech Feature Extraction...Contd

● Given the FFT of the input signal x[n]:

X[k] = Σ_{n=0}^{N−1} x[n] e^{−j2πnk/N},  0 ≤ k ≤ N−1

● The values of the FFT are weighted by the triangular filters. The result is called the Mel-frequency power spectrum, defined as:

S[m] = Σ_{k=1}^{N} |X_a[k]|² H_m[k],  0 < m ≤ M

where |X_a[k]|² is the power spectrum.

Page 36: ASR_final

Speech Feature Extraction...Contd

● Schematic diagram of the filter-bank energies:

● Finally, a discrete cosine transform (DCT) of the logarithm of S[m] is computed to form the MFCCs:

mfcc[i] = Σ_{m=1}^{M} log(S[m]) · cos[ i (m − 1/2) π / M ],  i = 1, 2, …, L
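Given the filter-bank energies S[m], this final step is a log compression followed by a DCT. A minimal sketch using SciPy's DCT; keeping L = 13 coefficients is an assumed, commonly used choice:

    import numpy as np
    from scipy.fft import dct

    def mfcc_from_filterbank(S, n_ceps=13):
        """log(S[m]) followed by a type-II DCT; keep the first n_ceps coefficients."""
        log_energies = np.log(S + 1e-10)          # small floor avoids log(0)
        return dct(log_energies, type=2, norm='ortho')[..., :n_ceps]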

Page 37: ASR_final

Step 3: Modeling

● State-of-the-Art Modeling Techniques:

1> Gaussian Mixture Model (GMM)

2> Hidden Markov Model (HMM)

Page 38: ASR_final

GMM

● A mixture model is a probabilistic model which assumes that the underlying data belong to a mixture distribution.

● A Gaussian is the characteristic symmetric “bell curve” distribution.

Page 39: ASR_final

GMM...Contd

● Mathematical description of the GMM:

p(x) = Σ_{i=1}^{n} w_i p_i(x)

where p(x) = mixed density function, w_i = mixture weight (mixture coefficient), p_i(x) = component density function.


Page 41: ASR_final

GMM...Contd

● Image showing the best-fit Gaussian curve:

Page 42: ASR_final

GMM...Contd

● Hence the component density function is:

p_i(x) = N(x | μ_i, Σ_i)

● The description of the GMM becomes:

p(x) = Σ_{i=1}^{n} w_i N(x | μ_i, Σ_i)

where the μ_i are the means and the Σ_i are the covariance matrices of the individual components (probability density functions).

(Figure: data fitted by five Gaussian components G1, …, G5 with mixture weights w1, …, w5.)

Page 43: ASR_final

GMM...Contd

● The most common mixture distribution is the Gaussian (normal) mixture, in which each mixture component is a Gaussian distribution with its own mean and covariance parameters.

The feature vectors are assumed to follow a Gaussian distribution; hence X is distributed normally:

X ∼ N(x | μ, Σ)    (multivariate normal distribution)

where μ = mean vector and Σ = covariance matrix.

Page 44: ASR_final

GMM...Contd

● The GMM for a speaker is denoted by:

λ = {w_i, μ_i, Σ_i},  i = 1, 2, …, M

Here a speaker is represented by a mixture of M Gaussian components.

● The Gaussian mixture density is:

p(x⃗ | λ) = Σ_{i=1}^{M} w_i p_i(x⃗)

where x⃗ is a D-dimensional random vector.

Page 45: ASR_final

GMM...Contd

● The component density is given by:

p_i(x⃗) = 1 / ((2π)^{D/2} |Σ_i|^{1/2}) · exp{ −(1/2) (x⃗ − μ_i)ᵀ Σ_i⁻¹ (x⃗ − μ_i) }

● Schematic diagram of the GMM of a speaker: the input vector x⃗ is evaluated by the M component densities p_1(·), …, p_M(·) with parameters (μ_1, Σ_1), …, (μ_M, Σ_M); the outputs are weighted by w_1, …, w_M and summed to give p(x⃗ | λ).
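A direct NumPy rendering of the component density and the mixture sum above, offered as a sketch only (full covariance matrices are assumed; in practice diagonal covariances are often used):

    import numpy as np

    def gaussian_density(x, mu, sigma):
        """Multivariate normal density p_i(x) for one mixture component."""
        D = len(mu)
        diff = x - mu
        norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(sigma))
        return np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff) / norm

    def gmm_density(x, weights, means, covariances):
        """Mixture density p(x | lambda) = sum_i w_i * N(x | mu_i, Sigma_i)."""
        return sum(w * gaussian_density(x, mu, S)
                   for w, mu, S in zip(weights, means, covariances))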

Page 46: ASR_final

Model Parameter Estimation

● To create a GMM we need to find the numerical values of the model parameters w_i, μ_i and Σ_i.

● To obtain an optimum model representing each speaker we need a good estimate of the GMM parameters. A very efficient method for this is the Maximum-Likelihood Estimation (MLE) approach. For speaker identification, each speaker is represented by a GMM and is referred to by his/her model. The EM algorithm is a very useful tool for finding the optimum model parameters under the MLE approach.
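In practice the EM iterations are usually delegated to a library. A sketch using scikit-learn's GaussianMixture, where the component count M = 16, the diagonal covariances, and the helper name train_speaker_model are illustrative assumptions rather than values from the slides:

    from sklearn.mixture import GaussianMixture

    def train_speaker_model(features, n_components=16):
        """Fit a GMM (lambda = {w_i, mu_i, Sigma_i}) to one speaker's feature vectors.

        features: array of shape (n_frames, n_mfcc) extracted from enrollment speech.
        """
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type='diag',   # diagonal Sigma_i
                              max_iter=200)             # EM iterations
        gmm.fit(features)                               # MLE via the EM algorithm
        return gmm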

Page 47: ASR_final

Step 4: Pattern Matching: Classification

● In this stage, a series of input vectors is compared against the speaker models, and a decision is made as to which speaker in the set is the most likely to have spoken the test data. The input to the classification system is denoted as:

x⃗ = {x_1, x_2, x_3, …, x_T}

● Using the model of each speaker and the unknown vectors x⃗, fitness values are calculated with the help of the posterior probability. The vectors are classified to the speaker whose model gives the maximum fitness value.
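A sketch of this matching step, assuming the speaker models are the scikit-learn GMMs fitted as in the previous sketch (score_samples returns per-frame log-likelihoods, which are summed under the independence assumption; identify_speaker is a hypothetical helper name):

    def identify_speaker(test_features, speaker_models):
        """Return the speaker ID whose GMM gives the highest total log-likelihood.

        speaker_models: dict mapping speaker ID -> fitted GaussianMixture.
        """
        scores = {speaker_id: model.score_samples(test_features).sum()
                  for speaker_id, model in speaker_models.items()}
        return max(scores, key=scores.get)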

Page 48: ASR_final

Conclusion

● Modifications can be made in the following areas:

1> Feature Extraction

2> MFCC Feature

3> Filter Bank

4> Modeling Techniques

5> Pattern Matching

Page 49: ASR_final

Conclusion...Contd

● Feature Extraction: In the MFCC feature the phase information is not taken into account; only the magnitude is considered. New feature vectors can therefore be derived by using the phase information along with the MFCC feature.

● Pattern Matching: In the pattern-matching step it is assumed that the feature vectors of the unknown speaker are independent, and the posterior probability is calculated under this assumption. However, an orthogonal transformation can be applied to map the set of vectors into a new set of orthogonal vectors; after the transformation the vectors are uncorrelated and can be treated as independent, and we can then proceed as before.

Page 50: ASR_final

References

● [1] Molau, S., Pitz, M., Schlüter, R. & Ney, H. (2001), Computing Mel-Frequency Cepstral Coefficients on the Power Spectrum, IEEE International Conference on Acoustics, Speech and Signal Processing, Germany, 2001: 73-76.

● [2] Huang, X., Acero, A. & Hon, H. (2001), Spoken Language Processing - A Guide to Theory, Algorithm, and System Development, Prentice Hall PTR, New Jersey.

● [3] Homayoon Beigi, (2011), Fundamentals of Speaker Recognition, Springer.

● [4] Daniel J. Mashao, Marshalleno Skosan, Combining classifier decisions for robust speaker identification, ELSEVIER 2006.

● [5] W.M. Campbell , J.P. Campbell, D.A. Reynolds, E. Singer, P.A. Torres-Carrasquillo, Support vector machines for speaker and language recognition, ELSEVIER, 2006.

● [6] Seiichi Nakagawa, Kouhei Asakawa, Longbiao Wang, Speaker Recognition by Combining MFCC and Phase Information, INTERSPEECH 2007.

● [7] Nilsson, M. & Ejnarsson, M, Speech Recognition Using Hidden Markov Model Performance Evaluation in Noisy Environment, Blekinge Institute of Technology Sweden, 2002.

Page 51: ASR_final

References...Contd

● [8] Stevens, S. S. & Volkman, J. (1940), The Relation of the Pitch to Frequency, Journal of Psychology, 1940(53): 329.

● [9] A. Jain, A. Ross, and S. Prabhakar, “An introduction to biometric recognition,” IEEE Trans. Circuits Systems Video Technol., vol. 14, no. 1, pp. 4–20, 2004.

● [10] D. Reynolds, “An overview of automatic speaker recognition technology,” in Proc. IEEE Int. Conf. Acoustics Speech Signal Processing (ICASSP), 2002, vol. 4, pp. 4072–4075.

● [11] S. Furui, “Cepstral analysis technique for automatic speaker verification,” IEEE Trans. Acoustics Speech Signal Process., vol. 29, no. 2, pp. 254–272, 1981.

● [12] D. Reynolds and R. Rose, “Robust text-independent speaker identification using Gaussian mixture speaker models,” IEEE Trans. Speech Audio Process., vol. 3, no. 1, pp. 72–83, 1995.

● [13] D. Reynolds, “Speaker identification and verification using Gaussian mixture speaker models,” Speech Commun., vol. 17, no. 1–2, pp. 91–108, 1995.

Page 52: ASR_final

References...Contd

● [14] Man-Wai Mak, Wei Rao, Utterance partitioning with acoustic vector resampling for GMM–SVM speaker verification, ELSEVIER, 2011.

● [15] Md. Sahidullah, Goutam Saha, Design, analysis and experimental evaluation of block based transformation in MFCC computation for speaker recognition, ELSEVIER, 2011.

● [16] Qi Li, and Yan Huang, An Auditory-Based Feature Extraction Algorithm for Robust Speaker Identification Under Mismatched Conditions , IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 19, NO. 6, AUGUST 2011.

● [17] Alfredo Maesa, Fabio Garzia, Michele Scarpiniti, Roberto Cusani, Text Independent Automatic Speaker Recognition System Using Mel-Frequency Cepstrum Coefficient and Gaussian Mixture Models, Journal of Information Security, 2012.

● [18] Ming Li, Kyu J. Han, Shrikanth Narayanan, Automatic speaker age and gender recognition using acoustic and prosodic level information fusion, ELSEVIER, 2013.

Page 53: ASR_final

Thank You