
Page 1: CS 545 Lecture XI: Speech (pages.cs.wisc.edu/~bsnyder/cs545-S12/lectures/lec11.pdf)

CS 545 Lecture XI: Speech

Benjamin Snyder

(some slides courtesy Jurafsky&Martin)

[email protected]

Announcements

Office hours change for today and next week: 1pm - 1:45pm

or by appointment -- but please schedule ahead

HW 4 / 5? will be out soon

• Frequency gives pitch; amplitude gives volume

• Frequencies at each time slice are processed into observation vectors

Speech in a Slide

[Figure: spectrogram of the phrase "s p ee ch l a b", with time on the horizontal axis, frequency on the vertical axis, and amplitude shown as intensity; each time slice is processed into an observation vector, annotated with HMM transition probabilities a12, a13, a14, ...]

Page 2

The Noisy-Channel Model

ASR System Components

[Diagram: a source generates words w ~ P(w) (the language model); the channel turns them into acoustic observations o ~ P(o|w) (the acoustic model); the decoder sees the observed o and outputs the best word sequence w.]

w_best = argmax_w P(w|o) = argmax_w P(o|w) P(w)
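In code, the decoding rule is just an argmax over combined log scores; a toy sketch (the candidate transcriptions and scores below are made up for illustration):

```python
import math

# Hypothetical candidate transcriptions with made-up model scores.
# log_p_w is the language model log P(w); log_p_o_given_w is the acoustic log P(o|w).
candidates = {
    "recognize speech":   {"log_p_w": math.log(0.01),   "log_p_o_given_w": math.log(0.30)},
    "wreck a nice beach": {"log_p_w": math.log(0.0001), "log_p_o_given_w": math.log(0.35)},
}

def decode(candidates):
    """argmax_w P(o|w) P(w), computed in log space for numerical stability."""
    return max(candidates,
               key=lambda w: candidates[w]["log_p_o_given_w"] + candidates[w]["log_p_w"])
```

Here the much higher language-model score outweighs the slightly better acoustic fit of the second candidate.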

Phoneme Inventories

Phoneme: a sound used as a building block in words

Some phonemes occur in most languages (b, p, m, n, s)

But substantial variation occurs in size and scope of phoneme inventories across languages

Consonants characterized by

1. place of articulation

2. manner of articulation

3. voicing

Page 3

Vowels

Phonotactics

Languages exhibit phonotactics

Some phoneme sequences are favored, others are forbidden

Phonotactics are largely language-specific, but often shared within language families

And some sound sequences are anatomically difficult for everyone: “kgvrsatr”

Page 4

Speech Recognition

Applications of Speech Recognition (ASR):

• Dictation
• Telephone-based Information (directions, air travel, banking, etc.)
• Hands-free (in car)
• Speaker Identification
• Language Identification
• Second language ('L2') (accent reduction)
• Audio archive searching

7/30/08 3 Speech and Language Processing Jurafsky and Martin

LVCSR

• Large Vocabulary Continuous Speech Recognition
• ~20,000-64,000 words
• Speaker independent (vs. speaker-dependent)
• Continuous speech (vs. isolated-word)


Current error rates

Task                      Vocabulary  Error Rate (%)
Digits                    11          0.5
WSJ read speech           5K          3
WSJ read speech           20K         3
Broadcast news            64,000+     10
Conversational Telephone  64,000+     20

Ballpark numbers; exact numbers depend very much on the specific corpus.


Page 5

HSR versus ASR

Task               Vocab  ASR  Human SR
Continuous digits  11     0.5  0.009
WSJ 1995 clean     5K     3    0.9
WSJ 1995 w/noise   5K     9    1.1
SWBD 2004          65K    20   4

Conclusions:
• Machines about 5 times worse than humans
• Gap increases with noisy speech
• These numbers are rough; take with a grain of salt


Issues

• Pronunciation: error 3-4 times higher for native Spanish and Japanese speakers
• Car noise: error 2-4 times higher
• Multiple speakers

LVCSR Design Intuition

•  Build a statistical model of the speech-to-words process

•  Collect lots and lots of speech, and transcribe all the words.

•  Train the model on the labeled speech

•  Paradigm: Supervised Machine Learning + Search


Page 6: CS 545 Lecture XI: Speechpages.cs.wisc.edu/~bsnyder/cs545-S12/lectures/lec11.pdf · Lecture XI: Speech Benjamin Snyder (some slides courtesy Jurafsky&Martin) brownies_choco81@yahoo.com

Speech Recognition Architecture


Architecture: Five easy pieces (only 3-4 for today)

• HMMs, Lexicons, and Pronunciation
• Feature extraction
• Acoustic Modeling
• Decoding
• Language Modeling (seen this already)


Noisy Channel Part I: Words to Phonemes

(transitions in HMM)

Page 7

Lexicon

• A list of words
• Each one with a pronunciation in terms of phones
• We get these from an on-line pronunciation dictionary
• CMU dictionary: 127K words (http://www.speech.cs.cmu.edu/cgi-bin/cmudict)
• We'll represent the lexicon as an HMM


HMMs for speech: the word “six”


Phones are not homogeneous!

[Figure: spectrogram (0-5000 Hz) of the phones "ay" and "k" between 0.48 and 0.94 s; the spectrum changes over the course of a single phone]


Page 8: CS 545 Lecture XI: Speechpages.cs.wisc.edu/~bsnyder/cs545-S12/lectures/lec11.pdf · Lecture XI: Speech Benjamin Snyder (some slides courtesy Jurafsky&Martin) brownies_choco81@yahoo.com

Each phone has 3 subphones


Resulting HMM word model for “six” with its subphones


Noisy Channel Part II: Phonemes to Sounds

(emissions in HMM)

Page 9

George Miller figure


And also, human acoustic perception....

We care about the filter, not the source

• Most characteristics of the source (F0, details of the glottal pulse) don't matter for phone detection
• What we care about is the filter: the exact position of the articulators in the oral tract
• So we want a way to separate these, and use only the filter function


Mel-scale

• Human hearing is not equally sensitive to all frequency bands
• Less sensitive at higher frequencies, roughly > 1000 Hz
• I.e. human perception of frequency is non-linear

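For reference, one standard formula for the mel scale (not shown on the slide; these particular constants are the common variant used in many toolkits, so treat the exact numbers as an assumption):

```python
import math

def hz_to_mel(f_hz):
    """Map a frequency in Hz onto the (approximately) perceptually linear mel scale.

    The mapping is near-linear below ~1000 Hz and logarithmic above,
    matching the non-linear sensitivity described on the slide."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)
```

Note that equal steps in Hz above 1000 Hz map to ever smaller steps in mels, which is exactly the compression of high frequencies the slide describes.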

Page 10: CS 545 Lecture XI: Speechpages.cs.wisc.edu/~bsnyder/cs545-S12/lectures/lec11.pdf · Lecture XI: Speech Benjamin Snyder (some slides courtesy Jurafsky&Martin) brownies_choco81@yahoo.com

MFCC: Mel-Frequency Cepstral Coefficients


Final Feature Vector

39 features per 10 ms frame:

• 12 MFCC features
• 12 Delta MFCC features
• 12 Delta-Delta MFCC features
• 1 (log) frame energy
• 1 Delta (log) frame energy
• 1 Delta-Delta (log) frame energy

So each frame is represented by a 39D vector.

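The delta and delta-delta features are local time-derivatives of the base features; a minimal sketch using a simple two-frame difference (real systems often use a wider regression window, so this is illustrative only):

```python
def delta(frames):
    """Per-frame slope of each coefficient: (c[t+1] - c[t-1]) / 2,
    clamping indices at the edges of the utterance.

    frames: list of frames, each a list of coefficients (e.g. 12 MFCCs)."""
    T = len(frames)
    out = []
    for t in range(T):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, T - 1)]
        out.append([(n - p) / 2.0 for p, n in zip(prev, nxt)])
    return out

# Delta-delta features are the same operation applied to the deltas: delta(delta(frames)).
```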

Acoustic Modeling (= Phone detection)

• Given a 39-dimensional vector oi corresponding to the observation of one frame
• And given a phone q we want to detect
• Compute p(oi|q)
• Most popular method: GMMs (Gaussian mixture models)
• Other methods: neural nets, CRFs, SVMs, etc.


Page 11: CS 545 Lecture XI: Speechpages.cs.wisc.edu/~bsnyder/cs545-S12/lectures/lec11.pdf · Lecture XI: Speech Benjamin Snyder (some slides courtesy Jurafsky&Martin) brownies_choco81@yahoo.com

Gaussian Mixture Models

• Also called “fully-continuous HMMs”
• P(o|q) computed by a Gaussian:

p(o|q) = (1 / (σ √(2π))) · exp( −(o − µ)² / (2σ²) )
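The density above, as a quick sketch in code:

```python
import math

def gaussian_pdf(o, mu, sigma):
    """p(o|q) for a single Gaussian with mean mu and standard deviation sigma."""
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return norm * math.exp(-((o - mu) ** 2) / (2.0 * sigma ** 2))
```

At the mean the density is simply 1/(σ√(2π)), about 0.399 for σ = 1.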


Gaussians for Acoustic Modeling

A Gaussian is parameterized by a mean and a variance.

[Figure: Gaussian densities P(o|q) over observations o, with different means; P(o|q) is highest at the mean and low far from the mean]


Training Gaussians

• A (single) Gaussian is characterized by a mean and a variance
• Imagine that we had some training data in which each phone was labeled
• And imagine that we were just computing 1 single spectral value (real-valued number) as our acoustic observation
• We could just compute the mean and variance from the data:

µi = (1/T) Σt=1..T ot      s.t. ot is phone i

σi² = (1/T) Σt=1..T (ot − µi)²      s.t. ot is phone i
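Under those assumptions (fully labeled frames, a single real-valued observation per frame), training is just per-phone averaging; a sketch with made-up data:

```python
from collections import defaultdict

def train_gaussians(labeled_frames):
    """labeled_frames: list of (observation, phone) pairs.
    Returns {phone: (mean, variance)} using the ML estimates above."""
    by_phone = defaultdict(list)
    for o, phone in labeled_frames:
        by_phone[phone].append(o)
    params = {}
    for phone, obs in by_phone.items():
        mu = sum(obs) / len(obs)              # mean of this phone's frames
        var = sum((o - mu) ** 2 for o in obs) / len(obs)  # ML (biased) variance
        params[phone] = (mu, var)
    return params
```

(The phone labels 'iy' and 'k' in the test are arbitrary examples.)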


Page 12

But we need 39 Gaussians, not 1!

• The observation o is really a vector of length 39
• So we need a vector of Gaussians (one per dimension, i.e. a multivariate Gaussian with diagonal covariance):

p(o|q) = [ 1 / ( (2π)^(D/2) · √( ∏d=1..D σ²[d] ) ) ] · exp( −(1/2) Σd=1..D (o[d] − µ[d])² / σ²[d] )
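With a diagonal covariance the density factors across dimensions, and in practice it is evaluated in log space to avoid underflow; a minimal sketch:

```python
import math

def diag_gaussian_logpdf(o, mu, var):
    """log p(o|q) for a D-dimensional Gaussian with diagonal covariance.

    o, mu, var are length-D lists; var holds the per-dimension variances."""
    D = len(o)
    # Log of the normalizing constant: -(D/2) log(2*pi) - (1/2) sum(log sigma^2[d]).
    log_norm = -0.5 * (D * math.log(2.0 * math.pi) + sum(math.log(v) for v in var))
    # Quadratic term: -(1/2) sum((o[d] - mu[d])^2 / sigma^2[d]).
    quad = -0.5 * sum((x - m) ** 2 / v for x, m, v in zip(o, mu, var))
    return log_norm + quad
```

For D = 1 this reduces exactly to the log of the single-Gaussian density given earlier.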


Gaussian Intuitions: Size of Σ

[Figure: three 2-D Gaussians, each with µ = [0 0], and Σ = I, Σ = 0.6I, Σ = 2I respectively]

As Σ becomes larger, the Gaussian becomes more spread out; as Σ becomes smaller, more compressed.

Text and figures from Andrew Ng’s lecture notes for CS229

Actually, a mixture of Gaussians

• Each phone is modeled by a sum of different Gaussians
• Hence able to model complex facts about the data

[Figure: example mixture-of-Gaussians densities for Phone A and Phone B]
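A mixture density is a weighted sum of component Gaussians; a 1-D sketch (real acoustic models use 39-dimensional diagonal-covariance components):

```python
import math

def gaussian_pdf(o, mu, sigma):
    """Single-component Gaussian density."""
    return math.exp(-((o - mu) ** 2) / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))

def gmm_pdf(o, weights, mus, sigmas):
    """p(o|q) = sum_m w_m * N(o; mu_m, sigma_m); the weights must sum to 1."""
    return sum(w * gaussian_pdf(o, mu, s) for w, mu, s in zip(weights, mus, sigmas))
```

With well-separated component means, the mixture can place probability mass in several regions at once, which a single Gaussian cannot.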


Page 13

Gaussian acoustic modeling

Summary: each phone is represented by a GMM parameterized by:

• M mixture weights
• M mean vectors
• M covariance matrices

Usually we assume the covariance matrix is diagonal, i.e. we just keep a separate variance for each cepstral feature.


HMMs for speech


HMM for digit recognition task


Page 14

Training and Decoding

Training:
• Would be easy if the phones were observed (Maximum Likelihood)
• But they are not... and neither are the mixture weights
• So use the EM algorithm (Expectation Maximization)

Decoding:
• Basic idea: Viterbi algorithm from last time
• But many little details...
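As a reminder, the basic Viterbi recursion can be sketched as follows (a generic HMM decoder in log space; in a real system the states, transitions, and emission scores would come from the lexicon HMM and the GMM acoustic model, so everything concrete here is illustrative):

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Most likely state sequence for obs.
    log_start[s], log_trans[s][s2], log_emit[s][o] are log probabilities."""
    # Initialization: start probability times first emission.
    V = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = []  # backpointers, one dict per time step after the first
    for o in obs[1:]:
        prev, col, ptr = V[-1], {}, {}
        for s in states:
            # Best predecessor state for s at this time step.
            best_prev = max(states, key=lambda r: prev[r] + log_trans[r][s])
            col[s] = prev[best_prev] + log_trans[best_prev][s] + log_emit[s][o]
            ptr[s] = best_prev
        V.append(col)
        back.append(ptr)
    # Follow backpointers from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

The dynamic program keeps only the best-scoring path into each state at each frame, which is what makes decoding tractable.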

Summary: ASR Architecture

Five easy pieces: the ASR Noisy Channel architecture

1) Feature Extraction: 39 “MFCC” features
2) Acoustic Model: Gaussians for computing p(o|q)
3) Lexicon/Pronunciation Model: an HMM of what phones can follow each other
4) Language Model: N-grams for computing p(wi|wi-1)
5) Decoder: Viterbi algorithm, dynamic programming for combining all these to get the word sequence from the speech!
