
Page 1: CS 545 Lecture XI: Speech (pages.cs.wisc.edu/~bsnyder/cs545-S12/lectures/lec11.pdf)

CS 545 Lecture XI: Speech

Benjamin Snyder

(some slides courtesy Jurafsky&Martin)

[email protected]

Announcements

Office hours change for today and next week: 1pm - 1:45pm

or by appointment -- but please schedule ahead

HW 4 / 5? will be out soon

• Frequency gives pitch; amplitude gives volume

• Frequencies at each time slice are processed into observation vectors

Speech in a Slide

[Figure: spectrogram of the phrase "s p ee ch l a b", with time on the horizontal axis, frequency on the vertical axis, and amplitude shown as intensity; each time slice is processed into an observation vector, annotated with HMM transition probabilities a12, a13, a14, ...]

Page 2

The Noisy-Channel Model

ASR System Components

[Diagram: a source generates words w ~ P(w) (the language model); the channel turns them into acoustic observations o ~ P(o|w) (the acoustic model); the decoder sees the observed o and outputs the best word sequence w.]

w_best = argmax_w P(w|o) = argmax_w P(o|w) P(w)
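In code, the decoding rule is just an argmax over combined log scores; a toy sketch (the candidate transcriptions and scores below are made up for illustration):

```python
import math

# Hypothetical candidate transcriptions with made-up model scores.
# log_p_w is the language model log P(w); log_p_o_given_w is the acoustic log P(o|w).
candidates = {
    "recognize speech":   {"log_p_w": math.log(0.01),   "log_p_o_given_w": math.log(0.30)},
    "wreck a nice beach": {"log_p_w": math.log(0.0001), "log_p_o_given_w": math.log(0.35)},
}

def decode(candidates):
    """argmax_w P(o|w) P(w), computed in log space for numerical stability."""
    return max(candidates,
               key=lambda w: candidates[w]["log_p_o_given_w"] + candidates[w]["log_p_w"])
```

Here the much higher language-model score outweighs the slightly better acoustic fit of the second candidate.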

Phoneme Inventories

Phoneme: a sound used as a building block in words

Some phonemes occur in most languages (b, p, m, n, s)

But substantial variation occurs in size and scope of phoneme inventories across languages

Consonants characterized by

1. place of articulation

2. manner of articulation

3. voicing

Page 3

Vowels

Phonotactics

Languages exhibit phonotactics

Some phoneme sequences are favored, others are forbidden

Phonotactics are largely language-specific, but often shared within language families

And some sound sequences are anatomically difficult for everyone: “kgvrsatr”

Page 4

Speech Recognition

Applications of Speech Recognition (ASR):

• Dictation
• Telephone-based Information (directions, air travel, banking, etc.)
• Hands-free (in car)
• Speaker Identification
• Language Identification
• Second language ('L2') (accent reduction)
• Audio archive searching

7/30/08 3 Speech and Language Processing Jurafsky and Martin

LVCSR

• Large Vocabulary Continuous Speech Recognition
• ~20,000-64,000 words
• Speaker independent (vs. speaker-dependent)
• Continuous speech (vs. isolated-word)


Current error rates

Task                      Vocabulary  Error Rate (%)
Digits                    11          0.5
WSJ read speech           5K          3
WSJ read speech           20K         3
Broadcast news            64,000+     10
Conversational Telephone  64,000+     20

Ballpark numbers; exact numbers depend very much on the specific corpus.


Page 5

HSR versus ASR

Task               Vocab  ASR  Human SR
Continuous digits  11     0.5  0.009
WSJ 1995 clean     5K     3    0.9
WSJ 1995 w/noise   5K     9    1.1
SWBD 2004          65K    20   4

Conclusions:
• Machines about 5 times worse than humans
• Gap increases with noisy speech
• These numbers are rough; take with a grain of salt


Issues

• Pronunciation: error 3-4 times higher for native Spanish and Japanese speakers
• Car noise: error 2-4 times higher
• Multiple speakers

LVCSR Design Intuition

•  Build a statistical model of the speech-to-words process

•  Collect lots and lots of speech, and transcribe all the words.

•  Train the model on the labeled speech

•  Paradigm: Supervised Machine Learning + Search


Page 6: CS 545 Lecture XI: Speechpages.cs.wisc.edu/~bsnyder/cs545-S12/lectures/lec11.pdf · Lecture XI: Speech Benjamin Snyder (some slides courtesy Jurafsky&Martin) brownies_choco81@yahoo.com

Speech Recognition Architecture


Architecture: Five easy pieces (only 3-4 for today)

• HMMs, Lexicons, and Pronunciation
• Feature extraction
• Acoustic Modeling
• Decoding
• Language Modeling (seen this already)


Noisy Channel Part I: Words to Phonemes

(transitions in HMM)

Page 7

Lexicon

• A list of words
• Each one with a pronunciation in terms of phones
• We get these from an on-line pronunciation dictionary
• CMU dictionary: 127K words (http://www.speech.cs.cmu.edu/cgi-bin/cmudict)
• We'll represent the lexicon as an HMM


HMMs for speech: the word “six”


Phones are not homogeneous!

[Figure: spectrogram (0-5000 Hz) of the phones "ay" and "k" between 0.48 and 0.94 s; the spectrum changes over the course of a single phone]


Page 8: CS 545 Lecture XI: Speechpages.cs.wisc.edu/~bsnyder/cs545-S12/lectures/lec11.pdf · Lecture XI: Speech Benjamin Snyder (some slides courtesy Jurafsky&Martin) brownies_choco81@yahoo.com

Each phone has 3 subphones


Resulting HMM word model for “six” with its subphones


Noisy Channel Part II: Phonemes to Sounds

(emissions in HMM)

Page 9

George Miller figure


And also, human acoustic perception....

We care about the filter, not the source

• Most characteristics of the source (F0, details of the glottal pulse) don't matter for phone detection
• What we care about is the filter: the exact position of the articulators in the oral tract
• So we want a way to separate these, and use only the filter function


Mel-scale

• Human hearing is not equally sensitive to all frequency bands
• Less sensitive at higher frequencies, roughly > 1000 Hz
• I.e. human perception of frequency is non-linear

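For reference, one standard formula for the mel scale (not shown on the slide; these particular constants are the common variant used in many toolkits, so treat the exact numbers as an assumption):

```python
import math

def hz_to_mel(f_hz):
    """Map a frequency in Hz onto the (approximately) perceptually linear mel scale.

    The mapping is near-linear below ~1000 Hz and logarithmic above,
    matching the non-linear sensitivity described on the slide."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)
```

Note that equal steps in Hz above 1000 Hz map to ever smaller steps in mels, which is exactly the compression of high frequencies the slide describes.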

Page 10: CS 545 Lecture XI: Speechpages.cs.wisc.edu/~bsnyder/cs545-S12/lectures/lec11.pdf · Lecture XI: Speech Benjamin Snyder (some slides courtesy Jurafsky&Martin) brownies_choco81@yahoo.com

MFCC: Mel-Frequency Cepstral Coefficients


Final Feature Vector

39 features per 10 ms frame:

• 12 MFCC features
• 12 Delta MFCC features
• 12 Delta-Delta MFCC features
• 1 (log) frame energy
• 1 Delta (log) frame energy
• 1 Delta-Delta (log) frame energy

So each frame is represented by a 39D vector.

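The delta and delta-delta features are local time-derivatives of the base features; a minimal sketch using a simple two-frame difference (real systems often use a wider regression window, so this is illustrative only):

```python
def delta(frames):
    """Per-frame slope of each coefficient: (c[t+1] - c[t-1]) / 2,
    clamping indices at the edges of the utterance.

    frames: list of frames, each a list of coefficients (e.g. 12 MFCCs)."""
    T = len(frames)
    out = []
    for t in range(T):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, T - 1)]
        out.append([(n - p) / 2.0 for p, n in zip(prev, nxt)])
    return out

# Delta-delta features are the same operation applied to the deltas: delta(delta(frames)).
```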

Acoustic Modeling (= Phone detection)

• Given a 39-dimensional vector oi corresponding to the observation of one frame
• And given a phone q we want to detect
• Compute p(oi|q)
• Most popular method: GMMs (Gaussian mixture models)
• Other methods: neural nets, CRFs, SVMs, etc.


Page 11: CS 545 Lecture XI: Speechpages.cs.wisc.edu/~bsnyder/cs545-S12/lectures/lec11.pdf · Lecture XI: Speech Benjamin Snyder (some slides courtesy Jurafsky&Martin) brownies_choco81@yahoo.com

Gaussian Mixture Models

• Also called “fully-continuous HMMs”
• P(o|q) computed by a Gaussian:

p(o|q) = (1 / (σ √(2π))) · exp( −(o − µ)² / (2σ²) )
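The density above, as a quick sketch in code:

```python
import math

def gaussian_pdf(o, mu, sigma):
    """p(o|q) for a single Gaussian with mean mu and standard deviation sigma."""
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return norm * math.exp(-((o - mu) ** 2) / (2.0 * sigma ** 2))
```

At the mean the density is simply 1/(σ√(2π)), about 0.399 for σ = 1.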


Gaussians for Acoustic Modeling

A Gaussian is parameterized by a mean and a variance.

[Figure: Gaussian densities P(o|q) over observations o, with different means; P(o|q) is highest at the mean and low far from the mean]


Training Gaussians

• A (single) Gaussian is characterized by a mean and a variance
• Imagine that we had some training data in which each phone was labeled
• And imagine that we were just computing 1 single spectral value (real-valued number) as our acoustic observation
• We could just compute the mean and variance from the data:

µi = (1/T) Σt=1..T ot      s.t. ot is phone i

σi² = (1/T) Σt=1..T (ot − µi)²      s.t. ot is phone i
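Under those assumptions (fully labeled frames, a single real-valued observation per frame), training is just per-phone averaging; a sketch with made-up data:

```python
from collections import defaultdict

def train_gaussians(labeled_frames):
    """labeled_frames: list of (observation, phone) pairs.
    Returns {phone: (mean, variance)} using the ML estimates above."""
    by_phone = defaultdict(list)
    for o, phone in labeled_frames:
        by_phone[phone].append(o)
    params = {}
    for phone, obs in by_phone.items():
        mu = sum(obs) / len(obs)              # mean of this phone's frames
        var = sum((o - mu) ** 2 for o in obs) / len(obs)  # ML (biased) variance
        params[phone] = (mu, var)
    return params
```

(The phone labels 'iy' and 'k' in the test are arbitrary examples.)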


Page 12

But we need 39 Gaussians, not 1!

• The observation o is really a vector of length 39
• So we need a vector of Gaussians (one per dimension, i.e. a multivariate Gaussian with diagonal covariance):

p(o|q) = [ 1 / ( (2π)^(D/2) · √( ∏d=1..D σ²[d] ) ) ] · exp( −(1/2) Σd=1..D (o[d] − µ[d])² / σ²[d] )
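With a diagonal covariance the density factors across dimensions, and in practice it is evaluated in log space to avoid underflow; a minimal sketch:

```python
import math

def diag_gaussian_logpdf(o, mu, var):
    """log p(o|q) for a D-dimensional Gaussian with diagonal covariance.

    o, mu, var are length-D lists; var holds the per-dimension variances."""
    D = len(o)
    # Log of the normalizing constant: -(D/2) log(2*pi) - (1/2) sum(log sigma^2[d]).
    log_norm = -0.5 * (D * math.log(2.0 * math.pi) + sum(math.log(v) for v in var))
    # Quadratic term: -(1/2) sum((o[d] - mu[d])^2 / sigma^2[d]).
    quad = -0.5 * sum((x - m) ** 2 / v for x, m, v in zip(o, mu, var))
    return log_norm + quad
```

For D = 1 this reduces exactly to the log of the single-Gaussian density given earlier.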


Gaussian Intuitions: Size of Σ

[Figure: three 2-D Gaussians, each with µ = [0 0], and Σ = I, Σ = 0.6I, Σ = 2I respectively]

As Σ becomes larger, the Gaussian becomes more spread out; as Σ becomes smaller, more compressed.

Text and figures from Andrew Ng’s lecture notes for CS229

Actually, a mixture of Gaussians

• Each phone is modeled by a sum of different Gaussians
• Hence able to model complex facts about the data

[Figure: example mixture-of-Gaussians densities for Phone A and Phone B]
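A mixture density is a weighted sum of component Gaussians; a 1-D sketch (real acoustic models use 39-dimensional diagonal-covariance components):

```python
import math

def gaussian_pdf(o, mu, sigma):
    """Single-component Gaussian density."""
    return math.exp(-((o - mu) ** 2) / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))

def gmm_pdf(o, weights, mus, sigmas):
    """p(o|q) = sum_m w_m * N(o; mu_m, sigma_m); the weights must sum to 1."""
    return sum(w * gaussian_pdf(o, mu, s) for w, mu, s in zip(weights, mus, sigmas))
```

With well-separated component means, the mixture can place probability mass in several regions at once, which a single Gaussian cannot.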


Page 13

Gaussian acoustic modeling

Summary: each phone is represented by a GMM parameterized by:

• M mixture weights
• M mean vectors
• M covariance matrices

Usually we assume the covariance matrix is diagonal, i.e. we just keep a separate variance for each cepstral feature.


HMMs for speech


HMM for digit recognition task


Page 14

Training and Decoding

Training:
• Would be easy if the phones were observed (Maximum Likelihood)
• But they are not... and neither are the mixture weights
• So use the EM algorithm (Expectation Maximization)

Decoding:
• Basic idea: Viterbi algorithm from last time
• But many little details...
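As a reminder, the basic Viterbi recursion can be sketched as follows (a generic HMM decoder in log space; in a real system the states, transitions, and emission scores would come from the lexicon HMM and the GMM acoustic model, so everything concrete here is illustrative):

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Most likely state sequence for obs.
    log_start[s], log_trans[s][s2], log_emit[s][o] are log probabilities."""
    # Initialization: start probability times first emission.
    V = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = []  # backpointers, one dict per time step after the first
    for o in obs[1:]:
        prev, col, ptr = V[-1], {}, {}
        for s in states:
            # Best predecessor state for s at this time step.
            best_prev = max(states, key=lambda r: prev[r] + log_trans[r][s])
            col[s] = prev[best_prev] + log_trans[best_prev][s] + log_emit[s][o]
            ptr[s] = best_prev
        V.append(col)
        back.append(ptr)
    # Follow backpointers from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

The dynamic program keeps only the best-scoring path into each state at each frame, which is what makes decoding tractable.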

Summary: ASR Architecture

Five easy pieces: the ASR Noisy Channel architecture

1) Feature Extraction: 39 “MFCC” features
2) Acoustic Model: Gaussians for computing p(o|q)
3) Lexicon/Pronunciation Model: an HMM of what phones can follow each other
4) Language Model: N-grams for computing p(wi|wi-1)
5) Decoder: Viterbi algorithm, dynamic programming for combining all these to get the word sequence from the speech!
