Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology

Mark Hasegawa-Johnson [email protected]
University of Illinois at Urbana-Champaign, USA

Lecture 3: Spectral Dynamics and the Production of Consonants

• International Phonetic Alphabet
• Events in the Closure of a Nasal Consonant
  – Formant transitions: a perturbation model
  – Nasalized vowel
  – Nasal murmur
• Events in the Release of a Stop Consonant
  – Pre-voicing (voiced stops in carefully read English)
  – Transient (stops and affricates)
  – Frication (stops, affricates, and fricatives)
  – Aspiration (aspirated stops and /h/)
  – Formant Transitions (any consonant-vowel transition)
• Formant Tracking
  – Does it help Speech Recognition?
  – Methods for Vowels, and for Aspiration & Nasals
• Reminder – lab 1 due Monday!

International Phonetic Alphabet: Purpose and Brief History

• Purpose of the alphabet: to provide a universal notation for the sounds of the world’s languages
  – “Universal” = If any language on Earth distinguishes two phonemes, IPA must also distinguish them
  – “Distinguish” = Meaning of a word changes when the phoneme changes, e.g. “cat” vs. “bat.”
• Very Brief History:
  – 1867: Alexander Melville Bell publishes a distinctive-feature-based phonetic notation in “Visible Speech: The Science of Universal Alphabetics.” His notation is rejected as being too expensive to print
  – 1886: International Phonetic Association founded in Paris by phoneticians from across Europe
  – 1991: Unicode provides a standard method for including IPA notation in computer documents

International Phonetic Alphabet: Vowels

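[Figure: IPA vowel chart annotated with approximate Pinyin and ARPABET equivalents. The IPA symbols themselves are not reproduced here.]

Front vowels, high to low (Pinyin : ARPABET):
  i / u (xu) : IY / UX
  (none given) : EY
  (none given) : EH
  a (zhang) : AE
  a (ma) : (none given)

Back vowels, high to low (Pinyin : ARPABET):
  / u (zhu) : / UW
  o : UH
  / oa : / OW
  / o : AH / AO
  a (ma) : AA

Central vowel (Pinyin : ARPABET):
  e : AX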

IPA: Regular Consonants

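[Figure: IPA chart of pulmonic consonants, with the ARPABET labels NG, DX, R, HH/HV, Q, and Y marked on the chart, and the Tongue Blade and Tongue Body regions indicated.]

ARPABET: F/V (labiodental), TH/DH (dental), S/Z (alveolar), SH/ZH (postalveolar or palatal)
Pinyin: s (alveolar), x (postalveolar), sh/r (retroflex)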

Affricates and Doubly-Articulated Consonants

Affricates in English and Chinese:

                 Pinyin   ARPABET   IPA
Alveolar:        c/z      (none)    ts/dz
Post-alveolar:   q/j      CH/JH     tʃ/dʒ
Retroflex:       ch/zh    (none)    ʈʂ/ɖʐ

[Figure: doubly-articulated consonants, with the ARPABET labels WH and W marked.]

Non-Pulmonic Consonants

Events in the Closure of a Syllable-Final Nasal Consonant

Events in the Closure of a Nasal Consonant

Vowel Nasalization

Formant Transitions

Nasal Murmur

Formant Transitions: A Perturbation Theory Model

Formant Transitions: Labial Consonants

“the mom”

“the bug”

Formant Transitions: Alveolar Consonants

“the tug”

“the supper”

Formant Transitions: Post-alveolar Consonants

“the shoe”

“the zsazsa”

Formant Transitions: Velar Consonants

“the gut”

“sing a song”

Formant Transitions: A Perceptual Study

The study: (1) synthesize speech with different formant patterns, (2) record subject responses. Delattre, Liberman and Cooper, J. Acoust. Soc. Am. 1955.

Perception of Formant Transitions: Conclusions

Vowel Nasalization

Vowel Nasalization

Additive Terms in the Log Spectrum
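
The figure for this slide is not reproduced here. As a reminder of why pole and zero contributions add in the log spectrum, the standard identity for a rational transfer function is sketched below. The symbols G, z_k, and p_k are generic notation assumed for illustration, not taken from the lecture.

```latex
H(z) = G \, \frac{\prod_{k=1}^{M} \left(1 - z_k z^{-1}\right)}{\prod_{k=1}^{N} \left(1 - p_k z^{-1}\right)}
\quad \Longrightarrow \quad
\log \left| H(e^{j\omega}) \right|
  = \log |G|
  + \sum_{k=1}^{M} \log \left| 1 - z_k e^{-j\omega} \right|
  - \sum_{k=1}^{N} \log \left| 1 - p_k e^{-j\omega} \right|
```

Each pole contributes a peak and each zero a dip, and on a log-magnitude (dB) scale these contributions simply sum.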

Transfer Function of a Nasalized Vowel

Nasal Murmur

“the mug” “the nut” “sing a song”

Observations:
• Low-frequency resonance (about 300 Hz) always present
• Low-frequency resonance has wide bandwidth (about 150 Hz)
• Energy of low-frequency resonance is very constant
• Most high-frequency resonances cancelled by zeros
• Different places of articulation have different high-frequency spectra
• High-frequency spectrum is talker-dependent and variable
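
To give a feel for what a 300 Hz resonance with a 150 Hz bandwidth looks like, here is a minimal sketch that models it as a single digital pole pair and evaluates its gain at a few frequencies. It is only an illustration: the 16 kHz sampling rate and the function names are assumptions, and the model deliberately ignores the zeros and higher resonances listed in the observations.

```python
import numpy as np


def resonator_coefficients(center_hz, bandwidth_hz, fs):
    """Denominator [1, a1, a2] of a two-pole resonator.

    Pole radius comes from the bandwidth, pole angle from the center frequency.
    """
    radius = np.exp(-np.pi * bandwidth_hz / fs)
    theta = 2.0 * np.pi * center_hz / fs
    return np.array([1.0, -2.0 * radius * np.cos(theta), radius ** 2])


def magnitude_db(a, freqs_hz, fs):
    """Gain in dB of the all-pole filter 1/A(z) at the given frequencies."""
    w = 2.0 * np.pi * np.asarray(freqs_hz, dtype=float) / fs
    z_inv = np.exp(-1j * w)
    denominator = a[0] + a[1] * z_inv + a[2] * z_inv ** 2
    return -20.0 * np.log10(np.abs(denominator))


if __name__ == "__main__":
    fs = 16000  # assumed sampling rate, not from the lecture
    a = resonator_coefficients(300.0, 150.0, fs)
    for f, g in zip([100, 300, 1000, 2000], magnitude_db(a, [100, 300, 1000, 2000], fs)):
        print(f"{f:5d} Hz: {g:6.1f} dB")
```

The gain is highest near 300 Hz and falls off toward higher frequencies, which matches the low-frequency energy concentration described above.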

Resonances of a Nasal Consonant

Reference: Fujimura, JASA 1962

Anti-Resonances of a Nasal Consonant

Events in the Release of a Stop (Plosive) Consonant

Events in the Release of a Stop

“Burst” = transient + frication (the part of the spectrogram whose transfer function has poles only at the front cavity resonance frequencies, not at the back cavity resonances).

Events in the Release of a Stop

Unaspirated (/b/)   Aspirated (/t/)

Transient Frication Aspiration Voicing

Pre-voicing during Closure

To make a voiced stop in most European languages:

Tongue root is relaxed, allowing the vocal tract to expand, so that the vocal folds can continue vibrating for a little while after oral closure.

Result is a low-frequency “voice bar” that may continue well into closure.

In English, closure voicing is typical of read speech, but not casual speech.

“the bug”

Transient: The Release of Pressure

Transfer Function During Transient and Frication: Poles

Front cavity resonance frequency: F_R = c/(4 L_f), where c is the speed of sound and L_f is the length of the front cavity

Turbulence striking an obstacle makes noise
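
As a rough worked example of the quarter-wavelength formula for the front cavity resonance, the sketch below evaluates c/(4 L_f) for a few cavity lengths. The speed of sound (35400 cm/s, a commonly used value for warm, moist air) and the cavity lengths are illustrative assumptions, not values from the lecture.

```python
SPEED_OF_SOUND_CM_PER_S = 35400.0  # assumed value, not from the lecture


def front_cavity_resonance(length_cm, c=SPEED_OF_SOUND_CM_PER_S):
    """Lowest resonance (Hz) of a front cavity of the given length.

    Treats the cavity as a uniform tube, closed at the constriction
    and open at the lips (quarter-wavelength resonator).
    """
    if length_cm <= 0:
        raise ValueError("cavity length must be positive")
    return c / (4.0 * length_cm)


if __name__ == "__main__":
    # Hypothetical front-cavity lengths in cm, chosen only for illustration.
    for length_cm in (1.0, 2.0, 4.0):
        print(f"L_f = {length_cm:.1f} cm -> F_R = {front_cavity_resonance(length_cm):.0f} Hz")
```

The shorter the front cavity, the higher the resonance, so a constriction closer to the lips pushes the burst and frication energy toward higher frequencies.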

Transfer Function During Frication: An Important Zero

Transfer Function During Frication: An Important Zero

Transfer Function During Aspiration

Are Formant Frequencies Useful for Speech Recognition?

• Kopec and Bush (1992): WER(formants alone) > WER(cepstrum alone) > WER(formants and cepstrum together)
• How should we track formants?
  – In vowels: Autoregressive (AR) modeling (also known as LPC)
  – In aspiration, nasals: Autoregressive Moving Average (ARMA) modeling. Problem: no closed-form solution
  – In aspiration, nasals: Exponentially Weighted Autoregressive (EWAR; Zheng and Hasegawa-Johnson, ICASSP 2004)

Formant Tracking for Vowels: Autoregressive Model (LPC)
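
The slide's derivation is not reproduced here. As a minimal sketch of the general idea, the code below fits an all-pole (AR) model by the autocorrelation method and reads formant candidates off the pole angles. The function names, the thresholds for discarding poles, and the synthetic test signal are all assumptions for illustration. This is a textbook formulation, not necessarily the exact procedure used in the lecture, and real speech frames would normally be pre-emphasized and windowed first.

```python
import numpy as np


def lpc_coefficients(frame, order):
    """Autocorrelation-method LPC. Returns [1, a_1, ..., a_order]."""
    x = np.asarray(frame, dtype=float)
    n = len(x)
    # Autocorrelation at lags 0..order.
    r = np.correlate(x, x, mode="full")[n - 1:n + order]
    # Toeplitz normal equations: sum_j a_j r[|i-j|] = -r[i], i = 1..order.
    lags = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    a = np.linalg.solve(r[lags], -r[1:order + 1])
    return np.concatenate(([1.0], a))


def formants_from_lpc(a, fs, min_freq_hz=90.0, max_bandwidth_hz=400.0):
    """Formant candidates (Hz) from the poles of 1/A(z).

    The frequency and bandwidth thresholds are illustrative assumptions.
    """
    poles = np.roots(a)
    poles = poles[np.imag(poles) > 0]  # keep one pole per conjugate pair
    freqs = np.angle(poles) * fs / (2.0 * np.pi)
    bandwidths = -np.log(np.abs(poles)) * fs / np.pi
    keep = (freqs > min_freq_hz) & (bandwidths < max_bandwidth_hz)
    return np.sort(freqs[keep])


def synthesize_all_pole(formants_hz, bandwidths_hz, fs, n_samples):
    """Impulse response of an all-pole filter (synthetic test signal only)."""
    a = np.array([1.0])
    for f, b in zip(formants_hz, bandwidths_hz):
        radius = np.exp(-np.pi * b / fs)
        section = [1.0, -2.0 * radius * np.cos(2.0 * np.pi * f / fs), radius ** 2]
        a = np.convolve(a, section)
    y = np.zeros(n_samples)
    for t in range(n_samples):
        excitation = 1.0 if t == 0 else 0.0
        feedback = sum(a[k] * y[t - k] for k in range(1, len(a)) if t - k >= 0)
        y[t] = excitation - feedback
    return y


if __name__ == "__main__":
    fs = 8000
    signal = synthesize_all_pole([500.0, 1500.0], [80.0, 80.0], fs, 2000)
    print(formants_from_lpc(lpc_coefficients(signal, order=4), fs))
```

On the synthetic two-resonance signal, the estimated frequencies should land close to the 500 Hz and 1500 Hz resonances that were built in. An all-pole model fits vowels well because their transfer functions have no zeros, which is the reason the lecture turns to other models for aspiration and nasals.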

Formant Tracking for Aspiration: “Auto-Regressive Moving Average” Model (ARMA)

Formant Tracking for Aspiration: “Exponentially Weighted Auto-Regressive” Model (EWAR) (Zheng and Hasegawa-Johnson, ICSLP 2004)

Solving the EWAR Model

Results: Stop Classification, MFCC alone vs. MFCC+formants

Results: Stop Classification, MFCC alone vs. MFCC+formants

Summary

• International Phonetic Alphabet:
  – Useful on any computer with Unicode
  – International encoding for all sounds of the world’s languages
• Events in a nasal closure:
  – Formant transitions (perturbation model)
  – Vowel nasalization (sum of TFs)
  – Nasal murmur (impedance match at juncture)
• Events in release of a stop:
  – Pre-voicing in English voiced stops (read speech)
  – Transient (dp/dt ~ dA/dt)
  – Frication ((zero at f=0)/(front cavity resonances))
  – Aspiration ((zero at f=0)/(same poles as the vowel))
• Formant tracking
  – In a vowel: use LPC
  – In aspiration, frication, or nasal murmur: ARMA is theoretically optimal, but computationally expensive
  – Aspiration, etc.: EWAR can be a good approximation to ARMA