Text Independent Speaker Recognition

  • 8/2/2019 Text Independent Speaker Recognition

    Speaker Recognition

    Prepared by: Pravin Gondaliya [08BEC029], Surendra Jalu [08BEC034]

    Guided by: Dr. Tanish H. Zave

    Our Goal:

    To understand digital speech signal processing and implement it on the Spartan-3A DSP kit.

    Today's Agenda:

    - Basics of speech processing
    - What is speech enhancement?
    - Speech enhancement algorithm
    - Spartan-3A DSP kit
    - ISE tool for designing

    Introduction to speech processing

    Speech processing is the application of digital signal processing (DSP) techniques to the processing and/or analysis of speech signals.

    Applications of speech processing include:

    - Speech coding
    - Speech recognition
    - Speaker verification/identification
    - Speech enhancement
    - Speech synthesis (text-to-speech conversion)

    The figure shows a schematic diagram of the speech production/speech perception process in human beings.

    The speech production process begins when the talker formulates a message in his/her mind to transmit to the listener via speech.

    The next step in the process is the conversion of the message into a language code. This corresponds to converting the message into a set of phoneme sequences corresponding to the sounds that make up the words, along with prosody (syntax) markers denoting the duration of sounds, the loudness of sounds, and the pitch associated with the sounds.

    Information Rate of the Speech Signal

    The discrete symbol information rate in the raw message text is rather low: about 50 bits per second, corresponding to about 8 sounds per second, where each sound is one of about 50 distinct symbols.

    After the language code conversion, with the inclusion of prosody information, the information rate rises to about 200 bps.
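The "about 50 bits per second" figure can be checked with a short calculation. This is only a sketch: it assumes the 50 symbols are equally likely, which the slide does not state, so each sound carries log2(50) bits.

```python
import math

SOUNDS_PER_SECOND = 8   # from the slide
DISTINCT_SYMBOLS = 50   # from the slide

# Assumption: all symbols are equally likely, so each carries log2(50) bits.
bits_per_sound = math.log2(DISTINCT_SYMBOLS)
bits_per_second = SOUNDS_PER_SECOND * bits_per_sound

print(f"{bits_per_sound:.2f} bits per sound, {bits_per_second:.1f} bits per second")
```

Under that assumption the rate comes out a little under 50 bits per second, which agrees with the rounded figure quoted on the slide.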

    The Mechanism of Speech Production

    In order to apply DSP techniques to speech processing problems, it is important to understand the fundamentals of the speech production process.

    Speech signals are composed of a sequence of sounds, and these sounds are produced as a result of acoustical excitation of the vocal tract when air is expelled from the lungs.

    Speech Production Mechanism

    The vocal tract begins at the opening between the vocal cords and ends at the lips.

    In the average male, the total length of the vocal tract is about 17 cm.

    The cross-sectional area of the vocal tract, determined by the positions of the tongue, lips, jaw and velum, varies from zero (complete closure) to about 20 cm².

    Classification of Speech Sounds

    In speech processing, speech sounds are divided into two broad classes, depending on the role of the vocal cords in the speech production mechanism.

    - VOICED speech is produced when the vocal cords play an active role (i.e. vibrate) in the production of a sound. Examples: voiced sounds /a/, /e/, /i/
    - UNVOICED speech is produced when the vocal cords are inactive. Examples: unvoiced sounds /s/, /f/

    Voiced Speech

    Voiced speech occurs when air flows through the vocal cords into the vocal tract in discrete puffs rather than as a continuous flow.

    The vocal cords vibrate at a particular frequency, which is called the fundamental frequency of the sound:

    - 50 to 200 Hz for male speakers
    - 150 to 300 Hz for female speakers
    - 200 to 400 Hz for child speakers

    Unvoiced Speech

    For unvoiced speech, the vocal cords are held open and air flows continuously through them.

    The vocal tract, however, is narrowed, resulting in a turbulent flow of air along the tract.

    Examples include the unvoiced fricatives /f/ and /s/.

    Unvoiced speech is characterized by high-frequency components.
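One simple way to see the difference in high-frequency content is the zero-crossing rate of a frame. The slides do not mention this measure, so the sketch below is an illustration and not part of the presented system. The 8000 Hz sampling rate, the 120 Hz tone standing in for voiced speech, and the white noise standing in for unvoiced speech are all assumptions.

```python
import math
import random

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)

FS = 8000  # assumed sampling rate in Hz
N = 400    # 50 ms frame at the assumed rate

# Stand-in for voiced speech: a 120 Hz tone (inside the male pitch range).
voiced_like = [math.sin(2 * math.pi * 120 * n / FS) for n in range(N)]

# Stand-in for unvoiced speech: white noise, rich in high frequencies.
rng = random.Random(0)
unvoiced_like = [rng.uniform(-1.0, 1.0) for _ in range(N)]

zcr_voiced = zero_crossing_rate(voiced_like)
zcr_unvoiced = zero_crossing_rate(unvoiced_like)
print(f"voiced-like ZCR:   {zcr_voiced:.3f}")
print(f"unvoiced-like ZCR: {zcr_unvoiced:.3f}")
```

The low-frequency tone changes sign only a few times per frame, while the noise changes sign on roughly half of all sample pairs.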

    Other Sound Classes

    Nasal sounds

    - Vocal tract coupled acoustically with the nasal cavity through the velar opening
    - Sound radiated from the nostrils as well as the lips
    - Examples include m, n, ing

    Plosive sounds

    - Characterized by complete closure/constriction towards the front of the vocal tract
    - Build-up of pressure behind the closure, followed by sudden release
    - Examples include p, t, k

    Speech Enhancement

    Speech enhancement is concerned with improving some perceptual aspect of speech that has been degraded by additive noise.

    Different kinds of noise affect the quality of speech. Different speech enhancement techniques are used to improve the quality of speech and reduce the specific noise coming from different sources at different SNRs.

    [Figure: Block diagram of the MFCC algorithm]

    Preprocessing & Frame Blocking

    Continuous human speech is recorded and preprocessed. In preprocessing, silence detection and amplification take place.

    The preprocessed output is then fed to the frame blocking section.

    In frame blocking, the continuous speech signal is blocked into frames of N samples each. This process continues until all the speech is accounted for within one or more frames.
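Frame blocking as described above can be sketched in a few lines. The function name `frame_signal`, the `hop` parameter (the advance between frame starts, which lets frames overlap), and the zero-padding of the last frame are assumptions made for illustration; the slides do not say whether frames overlap or how a short final frame is handled.

```python
def frame_signal(samples, frame_len, hop):
    """Split samples into frames of frame_len samples, advancing by hop.

    The final frame is zero-padded so every sample lands in some frame.
    """
    if frame_len <= 0 or not 0 < hop <= frame_len:
        raise ValueError("need frame_len > 0 and 0 < hop <= frame_len")
    frames = []
    start = 0
    while True:
        frame = list(samples[start:start + frame_len])
        frame += [0.0] * (frame_len - len(frame))  # pad a short final frame
        frames.append(frame)
        if start + frame_len >= len(samples):
            break  # all samples are now accounted for
        start += hop
    return frames

signal = [float(n) for n in range(10)]
frames = frame_signal(signal, frame_len=4, hop=2)
for frame in frames:
    print(frame)
```

With `hop` equal to `frame_len` the frames do not overlap; a smaller `hop` makes adjacent frames share samples.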

    Windowing

    The next step in the processing is to window each individual frame so as to minimize the signal discontinuities at the beginning and end of each frame. The concept here is to minimize the spectral distortion by using the window to taper the signal to zero at the beginning and end of each frame.

    If we define the window as $w(n)$, $0 \le n \le N-1$, where $N$ is the number of samples in each frame, then the result of windowing is the signal

    $$y(n) = x(n)\,w(n), \quad 0 \le n \le N-1$$

    Typically the Hamming window is used, which has the form

    $$w(n) = 0.54 - 0.46 \cos\!\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1$$
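The Hamming window and the product y(n) = x(n)w(n) translate directly into code. This is a minimal sketch; the function names `hamming` and `apply_window` are illustrative and not taken from the presentation.

```python
import math

def hamming(N):
    """Hamming window w(n) = 0.54 - 0.46 cos(2 pi n / (N - 1)), 0 <= n <= N - 1."""
    if N < 1:
        raise ValueError("N must be at least 1")
    if N == 1:
        return [1.0]  # avoid dividing by zero for a one-sample frame
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def apply_window(frame, window):
    """Compute y(n) = x(n) * w(n) sample by sample."""
    return [x * w for x, w in zip(frame, window)]

w = hamming(5)
y = apply_window([1.0, 1.0, 1.0, 1.0, 1.0], w)
print([round(v, 3) for v in y])
```

Note that the Hamming window tapers the frame edges to 0.08 rather than exactly to zero, so the discontinuity at the frame boundaries is strongly reduced but not removed entirely.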

    Mel Frequency Cepstrum

    The power cepstrum of a signal is the squared magnitude of the Fourier transform of the logarithm of the squared magnitude of the Fourier transform of the signal.

    Mathematically, the power cepstrum of a signal $y(t)$ is

    $$\left|\mathcal{F}\left\{\log\left(\left|\mathcal{F}\{y(t)\}\right|^{2}\right)\right\}\right|^{2}$$

    Algorithmically:

    signal → FT → abs() → square → log → FT → abs() → square → power cepstrum

    The cepstrum can be seen as information about the rate of change in the different spectrum bands. It was originally invented for characterizing the seismic echoes resulting from earthquakes and bomb explosions. It has also been used to determine the fundamental frequency of human speech and to analyze radar signal returns. Cepstrum pitch determination is particularly effective because the effects of the vocal excitation (pitch) and vocal tract (formants) are additive in the logarithm of the power spectrum and thus clearly separate.
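The chain of operations above can be sketched with a plain discrete Fourier transform. This is an illustrative sketch only: it uses a slow O(N²) DFT so that it needs no external library, adds a small `eps` before the logarithm to avoid log(0), and uses a toy test signal (a unit click plus one attenuated echo) that is an assumption chosen to echo the seismic example, not data from the presentation.

```python
import cmath
import math

def dft(x):
    """Plain O(N^2) discrete Fourier transform."""
    N = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
        for k in range(N)
    ]

def power_cepstrum(x, eps=1e-12):
    """signal -> FT -> abs -> square -> log -> FT -> abs -> square."""
    power = [abs(X) ** 2 for X in dft(x)]
    log_power = [math.log(p + eps) for p in power]  # eps guards against log(0)
    return [abs(C) ** 2 for C in dft(log_power)]

# Toy signal: a unit click followed by an echo 20 samples later.
N = 128
DELAY = 20
ECHO_GAIN = 0.5
x = [0.0] * N
x[0] = 1.0
x[DELAY] = ECHO_GAIN

cep = power_cepstrum(x)
# Skip quefrency 0 and look for the strongest peak in the first half.
peak_quefrency = max(range(1, N // 2 + 1), key=lambda q: cep[q])
print("strongest cepstral peak at quefrency", peak_quefrency)
```

The echo adds a periodic ripple to the log power spectrum, and the second transform turns that ripple into a peak at the echo delay. A real implementation would normally use an FFT instead of the direct sum.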

    The independent variable of a cepstral graph is called the quefrency. The quefrency is a measure of time, though not in the sense of a signal in the time domain. For example, if the sampling rate of an audio signal is 44100 Hz and there is a large peak in the cepstrum whose quefrency is 100 samples, the peak indicates the presence of a pitch that is 44100/100 = 441 Hz. This peak occurs in the cepstrum because the harmonics in the spectrum are periodic, and the period corresponds to the pitch.
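The conversion in this example is a single division, and the same relation gives the range of quefrencies worth searching for a pitch peak. In the sketch below the function names are illustrative, and the default 50 to 400 Hz search range is an assumption taken from the fundamental-frequency ranges listed for male, female and child speakers.

```python
import math

def quefrency_to_pitch_hz(quefrency_samples, sample_rate_hz):
    """Pitch implied by a cepstral peak at the given quefrency (in samples)."""
    if quefrency_samples <= 0:
        raise ValueError("quefrency must be positive")
    return sample_rate_hz / quefrency_samples

def pitch_search_range(sample_rate_hz, f_min_hz=50.0, f_max_hz=400.0):
    """Quefrency bounds (in samples) covering pitches from f_min_hz to f_max_hz."""
    q_min = math.ceil(sample_rate_hz / f_max_hz)
    q_max = math.floor(sample_rate_hz / f_min_hz)
    return q_min, q_max

print(quefrency_to_pitch_hz(100, 44100))  # the example from the text
print(pitch_search_range(44100))
```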

    The mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency.

    So our next step is to take the FFT (Fast Fourier Transform) of the speech signal; the result is then mapped onto the mel scale.

    Difference between normal and mel cepstrum

    Mel-frequency cepstral coefficients (MFCCs) are coefficients that collectively make up an MFC. They are derived from a type of cepstral representation of the audio clip (a nonlinear "spectrum-of-a-spectrum"). The difference between the cepstrum and the mel-frequency cepstrum is that in the MFC, the frequency bands are equally spaced on the mel scale, which approximates the human auditory system's response more closely than the linearly-spaced frequency bands used in the normal cepstrum. This frequency warping can allow

    Why the mel scale?

    Psychophysical studies have shown that human perception of the frequency content of sounds for speech signals does not follow a linear scale. Thus for each tone with an actual frequency f, measured in Hz, a subjective pitch is measured on a scale called the mel scale.

    The mel-frequency scale has a linear frequency spacing below 1000 Hz and a logarithmic spacing above 1000 Hz.

    Mel scale

    The mel scale is a perceptual scale of pitches judged by listeners to be equal in distance from one another. The name mel comes from the word melody to indicate that the scale is based on pitch comparisons.

    A popular formula to convert f hertz into m mel is:

    $$m = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right)$$
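The formula and its algebraic inverse can be written as two small functions. The names `hz_to_mel` and `mel_to_hz` are illustrative; the inverse is obtained by solving the formula above for f and does not appear on the slide.

```python
import math

def hz_to_mel(f_hz):
    """m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel: f = 700 * (10 ** (m / 2595) - 1)."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

for f in (100.0, 500.0, 1000.0, 4000.0):
    print(f"{f:7.1f} Hz -> {hz_to_mel(f):7.1f} mel")
```

The constants are chosen so that 1000 Hz maps to approximately 1000 mel, while higher frequencies are compressed more and more strongly.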

    MFCC

    MFCCs are commonly derived as follows:

    - Take the Fourier transform of (a windowed excerpt of) a signal.
    - Map the powers of the spectrum obtained above onto the mel scale, using triangular overlapping windows.
    - Take the logs of the powers at each of the mel frequencies.
    - Take the discrete cosine transform of the list of mel log powers, as if it were a signal.
    - The MFCCs are the amplitudes of the resulting spectrum.
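The steps listed above can be put together in a compact sketch for a single frame. This is one possible reading of the steps under stated assumptions, not the implementation used in the project. The 20 filters, 12 coefficients, the `eps` guard before the logarithm, the DCT-II form, and the 8000 Hz test tone are all assumptions, and a slow direct DFT is used to keep the example free of external libraries.

```python
import cmath
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def power_spectrum(frame):
    """Step 1: Hamming-window the frame, then |DFT|^2 for bins 0..N/2."""
    N = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)))
                for n, x in enumerate(frame)]
    spectrum = []
    for k in range(N // 2 + 1):
        X = sum(windowed[n] * cmath.exp(-2j * math.pi * k * n / N)
                for n in range(N))
        spectrum.append(abs(X) ** 2)
    return spectrum

def mel_filterbank(num_filters, N, sample_rate):
    """Triangular overlapping filters with edges equally spaced in mel."""
    mel_max = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(mel_max * i / (num_filters + 1))
                for i in range(num_filters + 2)]
    bin_hz = sample_rate / N
    bank = []
    for j in range(1, num_filters + 1):
        lo, mid, hi = edges_hz[j - 1], edges_hz[j], edges_hz[j + 1]
        weights = []
        for k in range(N // 2 + 1):
            f = k * bin_hz
            if lo < f <= mid:
                weights.append((f - lo) / (mid - lo))   # rising edge
            elif mid < f < hi:
                weights.append((hi - f) / (hi - mid))   # falling edge
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank

def mfcc(frame, sample_rate, num_filters=20, num_coeffs=12, eps=1e-10):
    power = power_spectrum(frame)
    bank = mel_filterbank(num_filters, len(frame), sample_rate)
    # Steps 2 and 3: mel band energies, then their logarithms.
    log_energies = [math.log(sum(w * p for w, p in zip(weights, power)) + eps)
                    for weights in bank]
    # Step 4: DCT-II of the log energies; step 5: the results are the MFCCs.
    M = num_filters
    return [sum(log_energies[m] * math.cos(math.pi * c * (m + 0.5) / M)
                for m in range(M))
            for c in range(num_coeffs)]

FS = 8000  # assumed sampling rate in Hz
frame = [math.sin(2 * math.pi * 440 * n / FS) for n in range(256)]
coeffs = mfcc(frame, FS)
print([round(c, 2) for c in coeffs])
```

Coefficient 0 reflects the overall log energy of the frame, so some systems drop it and keep only the higher coefficients. Whether that was done here is not stated in the slides.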

    Implementation

    So, most of the work has been done. Now, for each speaker we record 5 samples of speech. Each speech sample undergoes mel-frequency cepstral analysis, and the MFCCs are calculated for each sample. The computed values are then stored in the DB.mat database.

    Then pattern matching takes place. The system asks the user to provide his/her speech for testing and compares the computed MFCCs of this test speech with those in the DB.mat database. If it matches, the user is identified.

    Pattern matching

    In this process, the centroid of the values for the five samples is computed, as shown in the figure.

    Then, for each speaker, the test speech is compared with each of that speaker's samples, including the centroid. The best match is selected on the basis of the maximum number of values matched in the particular sample.

    So if, for any speaker, any one of the five samples matches the test speech, then that user will be identified.
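The matching rule can be sketched as follows. The slides do not define when two values count as "matched", so this sketch assumes a value matches when it lies within a tolerance `tol` of the stored value, and it assumes a speaker is accepted only if at least `min_fraction` of the values match. Both parameters, the function names, and the toy database (two short vectors per speaker instead of five MFCC vectors) are hypothetical.

```python
def centroid(vectors):
    """Element-wise mean of a list of equal-length feature vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def count_matches(test, template, tol):
    """Number of positions where the two vectors agree within tol."""
    return sum(1 for a, b in zip(test, template) if abs(a - b) <= tol)

def identify(test, database, tol=0.5, min_fraction=0.8):
    """Return (speaker, score) for the best match, or (None, score) if too weak."""
    best_speaker, best_score = None, -1
    for speaker, samples in database.items():
        # Compare against every stored sample and against their centroid.
        templates = list(samples) + [centroid(samples)]
        score = max(count_matches(test, t, tol) for t in templates)
        if score > best_score:
            best_speaker, best_score = speaker, score
    if best_score < min_fraction * len(test):
        return None, best_score
    return best_speaker, best_score

# Toy database with made-up feature vectors.
database = {
    "speaker_a": [[1.0, 2.0, 3.0], [1.2, 2.1, 2.9]],
    "speaker_b": [[5.0, 0.0, -1.0], [5.1, 0.2, -0.8]],
}

print(identify([1.1, 2.0, 3.1], database))
print(identify([10.0, 10.0, 10.0], database))
```

A distance measure such as the Euclidean distance to each centroid would be a common alternative to counting matched values, but that is a different technique from the one described on the slide.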

    Waiting for Your Valuable Suggestions

    Thank You

    Resonant Frequencies of the Vocal Tract

    The vocal tract is a non-uniform acoustic tube that is terminated at one end by the vocal cords and at the other end by the lips.

    The cross-sectional area of the vocal tract is determined by the positions of the tongue, lips, jaw and velum.

    The spectrum of the vocal tract response consists of a number of resonant frequencies of the vocal tract. These frequencies are called formants.

    Three to four formants are present below 4 kHz in speech.

    Formant Frequencies

    Speech normally exhibits one formant frequency in every 1 kHz.

    For VOICED speech, the magnitudes of the lower formant frequencies are successively larger than the magnitudes of the higher formant frequencies.

    For UNVOICED speech, the magnitudes of the higher formant frequencies are successively larger than the magnitudes of the lower formant frequencies.
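The "one formant in every 1 kHz" rule of thumb can be checked against a simple physical model. The sketch below treats the vocal tract as a uniform tube, closed at the vocal cords and open at the lips, whose resonances are F_n = (2n - 1) c / (4 L). The uniform tube is a simplifying assumption (the real tract is non-uniform, as noted earlier), the speed of sound of 340 m/s is assumed, and the 17 cm length is the average male value quoted in the slides.

```python
SPEED_OF_SOUND_M_S = 340.0  # assumed value for air
TRACT_LENGTH_M = 0.17       # average male vocal tract length from the slides

def tube_resonances(length_m, max_hz, c=SPEED_OF_SOUND_M_S):
    """Resonances F_n = (2n - 1) c / (4 L) of a closed-open uniform tube."""
    resonances = []
    n = 1
    while True:
        f = (2 * n - 1) * c / (4.0 * length_m)
        if f > max_hz:
            break
        resonances.append(f)
        n += 1
    return resonances

formants = tube_resonances(TRACT_LENGTH_M, 4000.0)
print([round(f) for f in formants])
```

Under these assumptions the model places resonances near 500, 1500, 2500 and 3500 Hz, which is consistent with roughly one formant per kHz and with three to four formants below 4 kHz.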
