Text Independent Speaker Recognition

8/2/2019 Text Independent Speaker Recognition

Speaker Recognition

Prepared by: Pravin Gondaliya [08BEC029], Surendra Jalu [08BEC034]

Guided by: Dr. Tanish H. Zave


Our Goal:

To understand digital speech signal processing and implement it on the Spartan-3A DSP kit.


Today's Agenda:

Basics of speech processing

What is speech enhancement?

Speech enhancement algorithm

Spartan 3A DSP kit

ISE tool for Designing


Introduction to Speech Processing

Speech processing is the application of digital signal processing (DSP) techniques to the processing and/or analysis of speech signals.

Applications of speech processing include:

- Speech coding
- Speech recognition
- Speaker verification / identification
- Speech enhancement
- Speech synthesis (text-to-speech conversion)


The figure shows a schematic diagram of the speech production / speech perception process in human beings.

The speech production process begins when the talker formulates a message in his/her mind to transmit to the listener via speech.

The next step in the process is the conversion of the message into a language code. This corresponds to converting the message into a set of phoneme sequences corresponding to the sounds that make up the words, along with prosody (syntax) markers denoting the duration of sounds, the loudness of sounds, and the pitch associated with the sounds.


Information Rate of the Speech Signal

The discrete symbol information rate in the raw message text is rather low (about 50 bits per second, corresponding to about 8 sounds per second, where each sound is one of about 50 distinct symbols).

After the language code conversion, with the inclusion of prosody information, the information rate rises to about 200 bps.
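
The "about 50 bps" figure can be checked with a short calculation. The sketch below assumes the 50 symbols are equally likely (an assumption, not stated on the slide), so each sound carries log2(50) bits. The result, roughly 45 bps, is consistent with the rounded figure quoted above.

```python
import math

SYMBOLS = 50            # distinct sounds, figure taken from the slide
SOUNDS_PER_SECOND = 8   # speaking rate, figure taken from the slide

# Assumption: all symbols are equally likely, so each carries log2(50) bits.
bits_per_symbol = math.log2(SYMBOLS)
text_rate_bps = bits_per_symbol * SOUNDS_PER_SECOND

print(f"{bits_per_symbol:.2f} bits per sound, {text_rate_bps:.1f} bits per second")
```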


The Mechanism of Speech Production

In order to apply DSP techniques to speech processing problems, it is important to understand the fundamentals of the speech production process.

Speech signals are composed of a sequence of sounds, and these sounds are produced as a result of acoustical excitation of the vocal tract when air is expelled from the lungs.


Speech Production Mechanism

The vocal tract begins at the opening between the vocal cords and ends at the lips.

In the average male, the total length of the vocal tract is about 17 cm.

The cross-sectional area of the vocal tract, determined by the positions of the tongue, lips, jaw and velum, varies from zero (complete closure) to about 20 cm².

Classification of Speech Sounds

In speech processing, speech sounds are divided into TWO broad classes, which depend on the role of the vocal cords in the speech production mechanism.

- VOICED speech is produced when the vocal cords play an active role (i.e. vibrate) in the production of a sound. Examples: voiced sounds /a/, /e/, /i/
- UNVOICED speech is produced when the vocal cords are inactive. Examples: unvoiced sounds /s/, /f/


11/31

Voiced Speech

Voiced speech occurs when air flows through the vocal cords into the vocal tract in discrete puffs rather than as a continuous flow.

The vocal cords vibrate at a particular frequency, which is called the fundamental frequency of the sound:

- 50 to 200 Hz for male speakers
- 150 to 300 Hz for female speakers
- 200 to 400 Hz for child speakers


12/31

Unvoiced Speech

For unvoiced speech, the vocal cords are held open and air flows continuously through them.

The vocal tract, however, is narrowed, resulting in a turbulent flow of air along the tract.

Examples include the unvoiced fricatives /f/ and /s/.

Unvoiced speech is characterized by high-frequency components.


13/31

Other Sound Classes

Nasal Sounds

- Vocal tract coupled acoustically with the nasal cavity through the velar opening
- Sound radiated from the nostrils as well as the lips
- Examples include m, n, ing

Plosive Sounds

- Characterized by complete closure/constriction towards the front of the vocal tract
- Build-up of pressure behind the closure, followed by sudden release
- Examples include p, t, k


14/31

Speech Enhancement

Speech enhancement is concerned with improving some perceptual aspect of speech that has been degraded by additive noise.

Different kinds of noise affect the quality of speech. Different speech enhancement techniques are used to improve the quality of speech and reduce the specific noise coming from different sources at different SNRs.

Block Diagram of the MFCC Algorithm


16/31

Preprocessing & Frame Blocking

Continuous human speech is recorded and preprocessed. In preprocessing, silence detection and amplification take place.

The preprocessed output is then fed to the frame blocking section.

In frame blocking, the continuous speech signal is blocked into frames of some number of samples. This process continues until all the speech is accounted for within one or more frames.
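
A minimal sketch of frame blocking is given below. The function name `frame_blocking`, the overlap between consecutive frames (a hop smaller than the frame length), the zero-padding of the final frame, and the toy sizes are illustrative assumptions. The slides do not state the frame length or whether frames overlap.

```python
def frame_blocking(signal, frame_len, hop):
    """Split `signal` into frames of `frame_len` samples, advancing by `hop`.

    Assumption: the last frame is zero-padded so that every sample is
    accounted for in at least one full-length frame.
    """
    frames = []
    start = 0
    while signal:  # an empty signal yields no frames
        frame = list(signal[start:start + frame_len])
        frame += [0.0] * (frame_len - len(frame))  # pad a short final frame
        frames.append(frame)
        if start + frame_len >= len(signal):
            break  # every sample is now inside some frame
        start += hop
    return frames


# Toy example: 11 samples, frames of 4 samples, hop of 2 (50% overlap).
samples = [float(n) for n in range(11)]
frames = frame_blocking(samples, frame_len=4, hop=2)
for f in frames:
    print(f)
```

In a real recording the frame length would be chosen so that each frame spans a short, roughly stationary stretch of speech. The values here are only for readability.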


17/31

Windowing

The next step in the processing is to window each individual frame so as to minimize the signal discontinuities at the beginning and end of each frame. The concept here is to minimize the spectral distortion by using the window to taper the signal toward zero at the beginning and end of each frame.

If we define the window as w(n), 0 ≤ n ≤ N-1, where N is the number of samples in each frame, then the resulting signal is

y(n) = x(n) w(n), 0 ≤ n ≤ N-1

Typically the Hamming window is used, which has the form

w(n) = 0.54 - 0.46 cos(2πn / (N-1)), 0 ≤ n ≤ N-1
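
The windowing step can be sketched directly from the Hamming formula. The helper names `hamming` and `apply_window` and the 9-sample constant frame are hypothetical, chosen only to make the taper visible. Note that the Hamming window tapers the frame ends to 0.08 rather than exactly to zero.

```python
import math


def hamming(N):
    """Hamming window w(n) = 0.54 - 0.46 cos(2*pi*n / (N - 1)), 0 <= n <= N-1."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]


def apply_window(frame):
    """Return y(n) = x(n) * w(n) for one frame (frame length must be >= 2)."""
    w = hamming(len(frame))
    return [x * wn for x, wn in zip(frame, w)]


# A constant frame makes the window shape itself visible in the output.
frame = [1.0] * 9
windowed = apply_window(frame)
print([round(v, 3) for v in windowed])
```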


18/31

Mel Frequency Cepstrum

The power cepstrum (of a signal) is the squared magnitude of the Fourier transform of the logarithm of the squared magnitude of the Fourier transform of a signal.

Mathematically: power cepstrum of signal = |F{log(|F{Y(t)}|²)}|²

Algorithmically:

signal → FT → abs() → square → log → FT → abs() → square → power cepstrum

The cepstrum can be seen as information about the rate of change in the different spectrum bands. It was originally invented for characterizing the seismic echoes resulting from earthquakes and bomb explosions. It has also been used to determine the fundamental frequency of human speech and to analyze radar signal returns. Cepstrum pitch determination is particularly effective because the effects of the vocal excitation (pitch) and vocal tract (formants) are additive in the logarithm of the power spectrum and thus clearly separate.


19/31

The independent variable of a cepstral graph is called the quefrency. The quefrency is a measure of time, though not in the sense of a signal in the time domain. For example, if the sampling rate of an audio signal is 44100 Hz and there is a large peak in the cepstrum whose quefrency is 100 samples, the peak indicates the presence of a pitch of 44100/100 = 441 Hz. This peak occurs in the cepstrum because the harmonics in the spectrum are periodic, and the period corresponds to the pitch.
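
The 441 Hz example can be reproduced with a small experiment. The sketch below follows the chain FT, abs, square, log, FT, abs, square from the cepstrum definition. It uses a synthetic pulse train with a period of 100 samples as a stand-in for voiced excitation, a naive pure-Python DFT in place of an FFT, a small `eps` to avoid log(0), and a quefrency search range of 20 to 150 samples. All of these choices are illustrative assumptions, not part of the original presentation.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) discrete Fourier transform, adequate for a short demo."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]


def power_cepstrum(x, eps=1e-12):
    """signal -> FT -> abs -> square -> log -> FT -> abs -> square.

    `eps` is an assumed small constant that keeps log() finite when a
    spectral bin is zero.
    """
    log_power = [math.log(abs(X) ** 2 + eps) for X in dft(x)]
    return [abs(C) ** 2 for C in dft(log_power)]


FS = 44100      # sampling rate in Hz, as in the example above
PERIOD = 100    # pulse spacing in samples (the quefrency we expect to find)
N = 400         # four full periods

# Synthetic "voiced" excitation: one unit pulse every PERIOD samples.
pulses = [1.0 if n % PERIOD == 0 else 0.0 for n in range(N)]

cep = power_cepstrum(pulses)

# Search an assumed plausible quefrency range, skipping the lowest quefrencies.
lo, hi = 20, 150
peak_quefrency = max(range(lo, hi), key=lambda q: cep[q])
pitch_hz = FS / peak_quefrency

print(peak_quefrency, pitch_hz)
```

For real speech the low quefrencies are dominated by the vocal tract envelope, which is why a pitch detector normally restricts the search to a range like the one assumed here.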

Mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency.

So, our next step is the FFT (Fast Fourier Transform) of the speech signal, which is then fed to mel

Difference Between Normal and Mel Cepstrum

Mel-frequency cepstral coefficients (MFCCs) are coefficients that collectively make up an MFC. They are derived from a type of cepstral representation of the audio clip (a nonlinear "spectrum-of-a-spectrum"). The difference between the cepstrum and the mel-frequency cepstrum is that in the MFC, the frequency bands are equally spaced on the mel scale, which approximates the human auditory system's response more closely than the linearly-spaced frequency bands used in the normal cepstrum. This frequency warping can allow


Why MEL Scale?

Psychophysical studies have shown that human perception of the frequency content of sounds for speech signals does not follow a linear scale. Thus for each tone with an actual frequency, f, measured in Hz, a subjective pitch is measured on a scale called the mel scale.

The mel-frequency scale has linear frequency spacing below 1000 Hz and logarithmic spacing above.


MEL Scale

The mel scale is a perceptual scale of pitches judged by listeners to be equal in distance from one another. The name mel comes from the word melody to indicate that the scale is based on pitch comparisons.

A popular formula to convert f hertz into m mel is:

m = 2595 log10(1 + f/700)
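
The conversion formula is easy to check numerically. The sketch below implements it together with its algebraic inverse. The function names `hz_to_mel` and `mel_to_hz` are illustrative, and the inverse is derived from the formula rather than taken from the slides. With these constants, 1000 Hz maps to almost exactly 1000 mel.

```python
import math


def hz_to_mel(f_hz):
    """m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel, obtained by solving the formula for f."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


for f in (100, 500, 1000, 2000, 4000):
    print(f, "Hz ->", round(hz_to_mel(f), 1), "mel")
```

Doubling the frequency from 2000 Hz to 4000 Hz adds far fewer mels than doubling from 500 Hz to 1000 Hz would suggest on a linear scale, which is the compression of high frequencies the mel scale is meant to capture.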


MFCC

MFCCs are commonly derived as follows:

- Take the Fourier transform of (a windowed excerpt of) a signal.
- Map the powers of the spectrum obtained above onto the mel scale, using triangular overlapping windows.
- Take the logs of the powers at each of the mel frequencies.
- Take the discrete cosine transform of the list of mel log powers, as if it were a signal.
- The MFCCs are the amplitudes of the resulting spectrum.
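
The derivation above can be sketched for a single frame in plain Python. Everything numeric here is an assumption for illustration: the 8000 Hz sampling rate, the 256-sample frame, 12 triangular filters, 8 output coefficients, the `eps` guard inside the logarithm, and the unnormalised DCT-II. A naive DFT is used instead of an FFT to keep the example dependency-free. It is a sketch under those assumptions, not the presenters' implementation.

```python
import cmath
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def power_spectrum(frame):
    """Hamming-window the frame, then return |DFT|^2 for bins 0..N/2."""
    N = len(frame)
    w = [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]
    xw = [x * wn for x, wn in zip(frame, w)]
    return [abs(sum(xw[n] * cmath.exp(-2j * math.pi * k * n / N)
                    for n in range(N))) ** 2
            for k in range(N // 2 + 1)]


def mel_filterbank(n_filters, n_bins, fs):
    """Triangular overlapping filters whose edges are equally spaced in mel."""
    top = hz_to_mel(fs / 2)
    edges_hz = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    edges = [f / (fs / 2) * (n_bins - 1) for f in edges_hz]  # fractional bin index
    bank = []
    for j in range(1, n_filters + 1):
        left, centre, right = edges[j - 1], edges[j], edges[j + 1]
        row = []
        for k in range(n_bins):
            if left < k <= centre:
                row.append((k - left) / (centre - left))    # rising slope
            elif centre < k < right:
                row.append((right - k) / (right - centre))  # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mfcc(frame, fs, n_filters=12, n_coeffs=8, eps=1e-10):
    """Power spectrum -> mel filterbank -> log -> DCT-II (all sizes assumed)."""
    power = power_spectrum(frame)
    bank = mel_filterbank(n_filters, len(power), fs)
    log_energies = [math.log(sum(h * p for h, p in zip(row, power)) + eps)
                    for row in bank]
    # Unnormalised DCT-II of the mel log powers, "as if it were a signal".
    return [sum(log_energies[j] * math.cos(math.pi * c * (j + 0.5) / n_filters)
                for j in range(n_filters))
            for c in range(n_coeffs)]


# Demo frame: a 440 Hz tone sampled at an assumed 8000 Hz.
FS = 8000
frame = [math.sin(2 * math.pi * 440 * n / FS) for n in range(256)]
coeffs = mfcc(frame, FS)
print([round(c, 2) for c in coeffs])
```

For a whole utterance this function would be applied to every windowed frame, giving one coefficient vector per frame.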


Implementation

So, most of the work has been done. Now, for each speaker we record 5 samples of speech. Each speech sample undergoes mel-frequency cepstral analysis, and MFCCs are calculated for each sample. The computed values are then stored in the DB.mat database.

Then pattern matching takes place. The system asks the user to provide speech for testing and compares the computed MFCCs of this test speech with those in the DB.mat database. If it matches then user will


Pattern Matching

In this process, the centroid of the values for the five samples is computed as shown in the figure.

Then, for each speaker, the test speech is compared with each of that speaker's samples, including the centroid. The best match is selected on the basis of the maximum number of values matched in a particular sample.

So if, for any speaker, any one of the five samples matches the test speech, that user is identified.
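
One way to picture the matching step is the sketch below. It is only an illustration under stated assumptions. Each utterance is reduced to a single short feature vector, the vectors are made-up numbers rather than real MFCCs, and closeness is measured with Euclidean distance, which is swapped in here because the slides do not define the "values matched" criterion precisely. The names `centroid`, `distance` and `identify` are hypothetical.

```python
import math


def centroid(vectors):
    """Element-wise mean of equally long feature vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def distance(a, b):
    """Euclidean distance (an assumed similarity measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def identify(test_vector, database):
    """Return (speaker, distance) for the closest template in the database.

    Each speaker's templates are the enrolled samples plus their centroid.
    """
    best = None
    for speaker, samples in database.items():
        templates = samples + [centroid(samples)]
        d = min(distance(test_vector, t) for t in templates)
        if best is None or d < best[1]:
            best = (speaker, d)
    return best


# Hypothetical 3-dimensional feature vectors, five enrolled samples per speaker.
database = {
    "speaker_a": [[1.0, 2.0, 3.0], [1.1, 2.1, 2.9], [0.9, 1.9, 3.1],
                  [1.0, 2.2, 3.0], [1.0, 1.8, 3.0]],
    "speaker_b": [[4.0, 0.0, 1.0], [4.2, 0.1, 0.9], [3.8, -0.1, 1.1],
                  [4.0, 0.2, 1.0], [4.0, -0.2, 1.0]],
}

speaker, dist = identify([1.05, 2.0, 3.0], database)
print(speaker, round(dist, 3))
```

A practical system would also need a rejection threshold on the distance so that an unknown speaker is not forced onto the nearest enrolled one. That threshold is outside what the slides describe.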


Waiting for Your Valuable Suggestions

Thank You

Resonant Frequencies of Vocal Tract

The vocal tract is a non-uniform acoustic tube that is terminated at one end by the vocal cords and at the other end by the lips.

The cross-sectional area of the vocal tract is determined by the positions of the tongue, lips, jaw and velum.

The spectrum of the vocal tract response consists of a number of resonant frequencies of the vocal tract.

These frequencies are called formants.

Three to four formants are present below 4 kHz in speech.
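
A rough feel for where these formants fall comes from a simplified model. The sketch below treats the vocal tract as a uniform tube, closed at the vocal cords and open at the lips, whose resonances are F_k = (2k - 1) c / (4L). The uniform-tube simplification and the speed of sound of about 340 m/s are assumptions added here. The real tract is non-uniform, so actual formants shift with articulation. The 17 cm length is the average male figure quoted earlier in the presentation.

```python
SPEED_OF_SOUND_CM_PER_S = 34000.0  # assumed, roughly 340 m/s in air
TRACT_LENGTH_CM = 17.0             # average male vocal tract length


def tube_resonances(length_cm, max_hz, c=SPEED_OF_SOUND_CM_PER_S):
    """Resonances F_k = (2k - 1) * c / (4 * L) below `max_hz`.

    Model assumption: a uniform tube closed at one end and open at the other.
    """
    freqs = []
    k = 1
    while True:
        f = (2 * k - 1) * c / (4.0 * length_cm)
        if f >= max_hz:
            break
        freqs.append(f)
        k += 1
    return freqs


formants = tube_resonances(TRACT_LENGTH_CM, 4000.0)
print(formants)
```

Under these assumptions the model gives four resonances below 4 kHz, spaced 1 kHz apart, which agrees with the rule of thumb of roughly one formant per kilohertz.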


Formant Frequencies

Speech normally exhibits one formant frequency in every 1 kHz.

For VOICED speech, the magnitudes of the lower formant frequencies are successively larger than the magnitudes of the higher formant frequencies.

For UNVOICED speech, the magnitudes of the higher formant frequencies are successively larger than the magnitudes of the lower formant frequencies.

