Speech Recognition (Part 2) T. J. Hazen MIT Computer Science and Artificial Intelligence Laboratory


Page 1: Speech Recognition (Part 2)

Speech Recognition(Part 2)

T. J. Hazen

MIT Computer Science and Artificial Intelligence Laboratory

Page 2: Speech Recognition (Part 2)

Lecture Overview

• Probabilistic framework

• Pronunciation modeling

• Language modeling

• Finite state transducers

• Search

• System demonstrations (time permitting)

Page 3: Speech Recognition (Part 2)

Probabilistic Framework

• Speech recognition is typically performed using a probabilistic modeling approach

• Goal is to find the most likely string of words, W, given the acoustic observations, A:

W* = argmax P( W | A )
          W

• The expression is rewritten using Bayes’ Rule:

W* = argmax P( A | W ) P( W ) / P( A ) = argmax P( A | W ) P( W )
          W                                    W

Page 4: Speech Recognition (Part 2)


Probabilistic Framework

• Words are represented as sequences of phonetic units.

• Using phonetic units, U, expression expands to:

• Pronunciation and language models provide constraint

• Pronunciation and language models encoded in network

• Search must efficiently find most likely U and W

argmax P( A | U ) P( U | W ) P( W )
 U,W

( P(A|U): acoustic model, P(U|W): pronunciation model, P(W): language model )

Page 5: Speech Recognition (Part 2)

Phonemes

• Phonemes are the basic linguistic units used to construct morphemes, words, and sentences.

– Phonemes represent unique canonical acoustic sounds

– When constructing words, changing a single phoneme changes the word.

• Example phonemic mappings:

– pin → /p ih n/

– thought → /th ao t/

– saves → /s ey v z/

• English spelling is not (exactly) phonemic

– Pronunciation cannot always be determined from spelling

– Homophones have the same phonemes but different spellings

* two vs. to vs. too, bear vs. bare, queue vs. cue, etc.

– Same spelling can have different pronunciations

* read, record, either, etc.
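The phoneme-to-word relationship above can be illustrated with a toy lookup. The word list and phoneme symbols below are a small hypothetical sample, not a real lexicon:

```python
# Toy phonemic dictionary; entries follow the slide's examples.
lexicon = {
    "pin":     ["p", "ih", "n"],
    "thought": ["th", "ao", "t"],
    "saves":   ["s", "ey", "v", "z"],
    "two":     ["t", "uw"],
    "to":      ["t", "uw"],
    "too":     ["t", "uw"],
}

def homophones(word, lexicon):
    """Return all words sharing the same phoneme sequence as `word`."""
    target = lexicon[word]
    return sorted(w for w, phones in lexicon.items() if phones == target)

print(homophones("two", lexicon))  # ['to', 'too', 'two']
```

Because homophones share a phoneme string, the acoustic and pronunciation models alone cannot distinguish them; the language model must.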

Page 6: Speech Recognition (Part 2)

Phonemic Units and Classes

Vowels
  aa : pot     er : bert
  ae : bat     ey : bait
  ah : but     ih : bit
  ao : bought  iy : beat
  aw : bout    ow : boat
  ax : about   oy : boy
  ay : buy     uh : book
  eh : bet     uw : boot

Semivowels
  l : light    w : wet
  r : right    y : yet

Fricatives
  s : sue      f : fee
  z : zoo      v : vee
  sh : shoe    th : thesis
  zh : azure   dh : that
  hh : hat

Nasals
  m : might    n : night    ng : sing

Affricates
  ch : chew    jh : Joe

Stops
  p : pay      b : bay
  t : tea      d : day
  k : key      g : go

Page 7: Speech Recognition (Part 2)

Phones

• Phones (or phonetic units) are used to represent the actual acoustic realization of phonemes.

• Examples:

– Stops contain a closure and a release

* /t/ → [tcl t]

* /k/ → [kcl k]

– The /t/ and /d/ phonemes can be flapped

* utter /ah t er/ → [ah dx er]

* udder /ah d er/ → [ah dx er]

– Vowels can be fronted:

* Tuesday /t uw z d ey/ → [tcl t ux z d ey]

Page 8: Speech Recognition (Part 2)

Enhanced Phoneme Labels

Stops
  p : pay      b : bay
  t : tea      d : day
  k : key      g : go

Special sequences
  nt : interview
  tq en : Clinton

Stops w/ optional release
  pd : tap     bd : tab
  td : pat     dd : bad
  kd : pack    gd : dog

Unaspirated stops
  p- : speed
  t- : steep
  k- : ski

Stops w/ optional flap
  tf : batter
  df : badder

Retroflexed stops
  tr : tree
  dr : drop

Page 9: Speech Recognition (Part 2)

Example Phonemic Baseform File

<hangup> : _h1 +                                    ← _h1 is a special noise model symbol;
<noise> : _n1 +                                       "+" repeats the previous symbol
<uh> : ah_fp                                        ← ah_fp is a special filled-pause vowel
<um> : ah_fp m
adder : ae df er
atlanta : ( ae | ax ) td l ae nt ax                 ← ( a | b ) marks alternate pronunciations
either : ( iy | ay ) th er
laptop : l ae pd t aa pd
northwest : n ao r th w eh s td
speech : s p- iy ch
temperature : t eh m p ( r ? ax | er ax ? ) ch er   ← "?" marks optional phonemes
trenton : tr r eh n tq en
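A baseform entry with `( a | b )` alternations and `?` optionals compactly encodes several pronunciations. As a rough sketch of expanding such an entry (the parser below assumes space-separated symbols and handles only these two operators, not the full baseform syntax):

```python
def expand(tokens):
    """Expand a baseform with ( a | b ) alternations and '?' optional
    markers into the list of phoneme strings it licenses."""
    def parse_seq(i, stop):
        # Parse a sequence until a stop token; return (alternatives, next index).
        alts = [[]]
        while i < len(tokens) and tokens[i] not in stop:
            if tokens[i] == "(":
                group, i = parse_group(i + 1)
            else:
                group = [[tokens[i]]]
                i += 1
            if i < len(tokens) and tokens[i] == "?":
                group = group + [[]]   # the preceding unit may be skipped
                i += 1
            alts = [a + g for a in alts for g in group]
        return alts, i
    def parse_group(i):
        # Parse '|'-separated choices up to the closing ')'.
        choices = []
        while True:
            seqs, i = parse_seq(i, {"|", ")"})
            choices.extend(seqs)
            if tokens[i] == ")":
                return choices, i + 1
            i += 1   # skip '|'
    seqs, _ = parse_seq(0, set())
    return sorted(" ".join(s) for s in seqs)

print(expand("( iy | ay ) th er".split()))
# ['ay th er', 'iy th er']
```

Applied to the "temperature" entry above, the same expansion yields four pronunciations.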

Page 10: Speech Recognition (Part 2)

Applying Phonological Rules

• Multiple phonetic realizations of a phoneme can be generated by applying phonological rules.

• Example:

butter : b ah tf er

This can be realized phonetically as:

bcl b ah tcl t er     (standard /t/)

or as:

bcl b ah dx er        (flapped /t/)

• Phonological rewrite rules can be used to generate this:

butter : bcl b ah ( tcl t | dx ) er

Page 11: Speech Recognition (Part 2)

Example Phonological Rules

• Example rule for /t/ deletion (“destination”):

{s} t {ax ix} => [tcl t];

(format: {left context} phoneme {right context} => phonetic realization)

• Example rule for palatalization of /s/ (“miss you”):

{} s {y} => s | sh;
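A rule of this form can be applied with a single scan over a phoneme string. The sketch below assumes rules encoded as (left-context set, phoneme, right-context set, alternative realizations), which is one plausible encoding rather than the system's actual rule engine:

```python
def apply_rule(phonemes, rule):
    """Apply one context-dependent rewrite rule, producing every
    phone string licensed by the rule's alternative realizations."""
    left_ctx, target, right_ctx, outputs = rule
    seqs = [[]]
    for i, p in enumerate(phonemes):
        prev = phonemes[i - 1] if i > 0 else None
        nxt = phonemes[i + 1] if i + 1 < len(phonemes) else None
        if (p == target
                and (not left_ctx or prev in left_ctx)    # empty set = any context
                and (not right_ctx or nxt in right_ctx)):
            choices = outputs          # rule fires: branch on each realization
        else:
            choices = [[p]]            # rule does not apply: copy the phoneme
        seqs = [s + c for s in seqs for c in choices]
    return [" ".join(s) for s in seqs]

# {} s {y} => s | sh;   ("miss you")
palatalization = (set(), "s", {"y"}, [["s"], ["sh"]])
print(apply_rule("m ih s y uw".split(), palatalization))
# ['m ih s y uw', 'm ih sh y uw']
```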

Page 12: Speech Recognition (Part 2)

Contractions and Reductions

• Examples of contractions:

– what’s → what is

– isn’t → is not

– won’t → will not

– i’d → i would | i had

– today’s → today is | today’s

• Examples of multi-word reductions:

– gimme → give me

– gonna → going to

– ave → avenue

– ‘bout → about

– d’y’ave → do you have

• Contracted and reduced forms entered in lexical dictionary

Page 13: Speech Recognition (Part 2)

Language Modeling

• A language model constrains hypothesized word sequences

• A finite state grammar (FSG) example:

• Probabilities can be added to arcs for additional constraint

• FSGs work well when users stay within grammar…

• …but FSGs can’t cover everything that might be spoken.

[FSG diagram: ( tell me | what is ) the ( forecast | weather ) ( in | for ) ( baltimore | boston )]

Page 14: Speech Recognition (Part 2)

N-gram Language Modeling

• An n-gram model is a statistical language model

• Predicts current word based on previous n-1 words

• Trigram model expression:

• Examples

• An n-gram model allows any sequence of words…

• …but prefers sequences common in training data.

P( wn | wn-2 , wn-1 )

P( boston | arriving in )

P( seventeenth | tuesday march )
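Estimating such probabilities from counts can be sketched as follows (maximum-likelihood only, no smoothing; the sentence markers and the tiny training set are assumptions for illustration):

```python
from collections import Counter

def train_trigram(sentences):
    """MLE trigram model:
    P(wn | wn-2, wn-1) = c(wn-2 wn-1 wn) / c(wn-2 wn-1)."""
    tri, bi = Counter(), Counter()
    for s in sentences:
        words = ["<s>", "<s>"] + s.split() + ["</s>"]
        for i in range(2, len(words)):
            tri[tuple(words[i - 2:i + 1])] += 1   # trigram count
            bi[tuple(words[i - 2:i])] += 1        # its history count
    return lambda w, hist: tri[hist + (w,)] / bi[hist] if bi[hist] else 0.0

p = train_trigram(["flights arriving in boston",
                   "flights arriving in denver",
                   "arriving in boston today"])
print(p("boston", ("arriving", "in")))  # 2/3: two of three "arriving in" contexts
```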

Page 15: Speech Recognition (Part 2)

N-gram Model Smoothing

• For a bigram model, what if p( wn | wn-1 ) = 0 ?

• To avoid sparse training data problems, we can use an interpolated bigram:

p̃( wn | wn-1 ) = λwn-1 p( wn | wn-1 ) + ( 1 − λwn-1 ) p̃( wn )

• One method for determining the interpolation weight:

λwn-1 = c( wn-1 ) / ( c( wn-1 ) + K )
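The interpolation scheme can be sketched directly. Here p̃(w) is taken to be the raw unigram relative frequency and K is a tunable constant; both the corpus and K below are invented for illustration:

```python
from collections import Counter

class InterpolatedBigram:
    """p~(w|v) = lam(v) * p(w|v) + (1 - lam(v)) * p~(w),
    with lam(v) = c(v) / (c(v) + K): frequent histories trust the bigram."""
    def __init__(self, corpus, K=5.0):
        self.uni, self.bi = Counter(), Counter()
        for s in corpus:
            words = ["<s>"] + s.split()
            self.uni.update(words)
            self.bi.update(zip(words, words[1:]))
        self.total = sum(self.uni.values())
        self.K = K

    def prob(self, w, v):
        p_uni = self.uni[w] / self.total
        c_v = self.uni[v]
        lam = c_v / (c_v + self.K)
        p_bi = self.bi[(v, w)] / c_v if c_v else 0.0
        return lam * p_bi + (1 - lam) * p_uni

lm = InterpolatedBigram(["what is the weather", "what is the forecast"], K=5.0)
print(lm.prob("forecast", "the"))   # seen bigram
print(lm.prob("weather", "what"))   # unseen bigram still gets unigram mass
```

The unseen bigram ("what", "weather") gets a nonzero probability through the unigram term, which is exactly the sparse-data fix the slide describes.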

Page 16: Speech Recognition (Part 2)

Class N-gram Language Modeling

• Class n-gram models can also help sparse data problems

• Class trigram expression:

• Example:

P(class(wn) | class(wn-2) , class(wn-1)) P(wn | class(wn))

P( seventeenth | tuesday march ) ≈ P( NTH | WEEKDAY MONTH ) P( seventeenth | NTH )
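The factorization multiplies a class-level trigram by a class-membership probability. A sketch with a hypothetical class table and made-up probabilities (none of these numbers come from a trained model):

```python
# Hypothetical word-to-class table, following the slide's example classes.
word_class = {"tuesday": "WEEKDAY", "march": "MONTH", "seventeenth": "NTH"}

# Made-up probabilities for illustration only.
p_class = {("NTH", ("WEEKDAY", "MONTH")): 0.4}    # P(class(wn) | class history)
p_word_in_class = {("seventeenth", "NTH"): 0.03}  # P(wn | class(wn))

def class_trigram(w, hist):
    """P(wn | wn-2, wn-1) under the class n-gram factorization."""
    c = word_class[w]
    chist = tuple(word_class[h] for h in hist)
    return p_class[(c, chist)] * p_word_in_class[(w, c)]

print(class_trigram("seventeenth", ("tuesday", "march")))  # 0.4 * 0.03
```

Any WEEKDAY/MONTH pair shares the class-level statistics, so a date never seen verbatim in training still gets a sensible probability.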

Page 17: Speech Recognition (Part 2)

Multi-Word N-gram Units

• Common multi-word units can be treated as a single unit within an N-gram language model

• Common uses of compound units:– Common multi-word phrases:

* thank_you , good_bye , excuse_me

– Multi word sequences that act as a single semantic unit:

* new_york , labor_day , wind_speed

– Letter sequences or initials:

* j_f_k , t_w_a , washington_d_c

Page 18: Speech Recognition (Part 2)

Finite-State Transducer (FST) Motivation

• Most speech recognition constraints and results can be represented as finite-state automata:

– Language models (e.g., n-grams and word networks)

– Lexicons

– Phonological rules

– N-best lists

– Word graphs

– Recognition paths

• Common representation and algorithms desirable

– Consistency

– Powerful algorithms can be employed throughout system

– Flexibility to combine or factor in unforeseen ways

Page 19: Speech Recognition (Part 2)

What is an FST?

• One initial state

• One or more final states

• Transitions between states: input : output / weight

– input requires an input symbol to match

– output indicates symbol to output when transition taken

– epsilon (ε) consumes no input or produces no output

– weight is the cost (e.g., -log probability) of taking transition

• An FST defines a weighted relationship between regular languages

• A generalization of the classic finite-state acceptor (FSA)
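A minimal FST along these lines might look as follows. Epsilon transitions are omitted for brevity, and the symbols, words, and weights are illustrative, not taken from a real recognizer:

```python
class FST:
    """Weighted FST: one initial state, a set of final states,
    and arcs (input symbol, output symbol, weight, destination)."""
    def __init__(self, initial, finals):
        self.initial, self.finals = initial, finals
        self.arcs = {}   # state -> list of (in_sym, out_sym, weight, dst)

    def add_arc(self, src, isym, osym, weight, dst):
        self.arcs.setdefault(src, []).append((isym, osym, weight, dst))

    def transduce(self, inputs):
        """Yield (output sequence, total weight) for each accepting path."""
        def walk(state, i, out, w):
            if i == len(inputs) and state in self.finals:
                yield out, w
            if i < len(inputs):
                for isym, osym, aw, dst in self.arcs.get(state, []):
                    if isym == inputs[i]:
                        yield from walk(dst, i + 1, out + [osym], w + aw)
        yield from walk(self.initial, 0, [], 0.0)

# Map the phoneme string /t uw/ to homophonous words; weights act as costs.
f = FST(0, {2})
f.add_arc(0, "t", "two", 0.7, 1)
f.add_arc(0, "t", "to", 0.4, 1)
f.add_arc(1, "uw", "-", 0.0, 2)   # '-' stands in for an epsilon output
print(sorted(f.transduce(["t", "uw"])))
# [(['to', '-'], 0.4), (['two', '-'], 0.7)]
```

The two accepting paths realize the weighted relationship between the phoneme string and the competing words.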

Page 20: Speech Recognition (Part 2)

FST Example: Lexicon

• Lexicon maps /phonemes/ to ‘words’

• Words can share parts of pronunciations

• Sharing at the beginning is beneficial to recognition speed, because a single pruning decision can remove many words at once

Page 21: Speech Recognition (Part 2)

FST Composition

• Composition (o) combines two FSTs to produce a single FST that performs both mappings in single step

( words → /phonemes/ ) o ( /phonemes/ → [phones] ) = ( words → [phones] )
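For epsilon-free machines, composition is a product construction: pair up the states, keep an arc wherever the first machine's output symbol equals the second machine's input symbol, and add the weights. A sketch under those simplifying assumptions (no epsilon handling, no trimming of unreachable pair-states):

```python
def compose(arcs1, finals1, arcs2, finals2):
    """Compose two epsilon-free FSTs given as arc lists
    (src, in_sym, out_sym, weight, dst); initial state is 0 in both."""
    arcs = []
    for s1, i1, o1, w1, d1 in arcs1:
        for s2, i2, o2, w2, d2 in arcs2:
            if o1 == i2:   # output of machine 1 feeds input of machine 2
                arcs.append(((s1, s2), i1, o2, w1 + w2, (d1, d2)))
    finals = {(f1, f2) for f1 in finals1 for f2 in finals2}
    return arcs, finals

# Machine 1: phoneme /t/ may surface as "t" or a flap "dx" (toy weights).
flap = [(0, "t", "t", 0.6, 1), (0, "t", "dx", 0.4, 1), (1, "er", "er", 0.0, 2)]
# Machine 2: realize "t" as closure+release, pass other phones through.
real = [(0, "t", "tcl t", 0.1, 1), (0, "dx", "dx", 0.0, 1), (1, "er", "er", 0.0, 2)]

arcs, finals = compose(flap, {2}, real, {2})
for a in arcs:
    print(a)
```

The composed machine maps /t er/ directly to [tcl t er] or [dx er] in a single step, which is exactly why recognizers build one composed search network offline.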

Page 22: Speech Recognition (Part 2)

FST Optimization Example

letter to word lexicon

Page 23: Speech Recognition (Part 2)

FST Optimization Example: Determinization

• Determinization turns lexicon into tree

• Words share common prefix

Page 24: Speech Recognition (Part 2)

FST Optimization Example: Minimization

• Minimization enables sharing at the ends

Page 25: Speech Recognition (Part 2)

A Cascaded FST Recognizer

G : Language Model
   ↓
Multi-Word Units
   ↓  M : Multi-word Mapping
Canonical Words
   ↓  R : Reductions Model
Spoken Words
   ↓  L : Lexical Model
Phonemic Units
   ↓  P : Phonological Model
Phonetic Units
   ↓  C : CD Model Mapping
Acoustic Model Labels

(G, M, and R together form the language model; L and P the pronunciation model.)

Page 26: Speech Recognition (Part 2)

A Cascaded FST Recognizer

• Example cascade for “give me new_york_city”:

give me new_york_city                       (multi-word units)
   ↓  M : Multi-word Mapping
give me new york city                       (canonical words)
   ↓  R : Reductions Model
gimme new york city                         (spoken words)
   ↓  L : Lexical Model
g ih m iy n uw y ao r kd s ih tf iy         (phonemic units)
   ↓  P : Phonological Model
gcl g ih m iy n uw y ao r kcl s ih dx iy    (phonetic units)

Page 27: Speech Recognition (Part 2)

Search

• Once again, the probabilistic expression is:

• Pronunciation and language models encoded in FST

• Search must efficiently find most likely U and W

argmax P( A | U ) P( U | W ) P( W )
 U,W

( P(A|U): acoustic model; P(U|W) P(W): encoded in the lexical FST )

Page 28: Speech Recognition (Part 2)

Viterbi Search

• Viterbi search: a time-synchronous breadth-first search

[Diagram: lexical nodes h#, m, a, r, z plotted against time frames t0–t8, with the best path spelling out “h# m a r z h#”]
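The time-synchronous recursion can be sketched over a tiny hand-built lexical network. The three-node network, transition weights, and frame labels below are invented for illustration:

```python
import math

def viterbi(obs, states, trans, emit, init, final):
    """Time-synchronous Viterbi search over a lexical network.
    trans[(p, s)] and emit[s][frame] are log probabilities."""
    # scores[s] = (log prob of best path ending in s, that path)
    scores = {s: (init.get(s, -math.inf), [s]) for s in states}
    for frame in obs:                       # advance all hypotheses one frame
        new = {}
        for s in states:
            best, path = -math.inf, []
            for p in states:                # best predecessor of s
                w = scores[p][0] + trans.get((p, s), -math.inf)
                if w > best:
                    best, path = w, scores[p][1] + [s]
            new[s] = (best + emit.get(s, {}).get(frame, -math.inf), path)
        scores = new
    return max((scores[s] for s in final), key=lambda x: x[0])

l5, l9 = math.log(0.5), math.log(0.9)
states = ["h#", "m", "a"]
trans = {("h#", "h#"): l5, ("h#", "m"): l5,
         ("m", "m"): l5, ("m", "a"): l5, ("a", "a"): l5}
emit = {"h#": {"sil": l9}, "m": {"m": l9}, "a": {"aa": l9}}
score, path = viterbi(["sil", "m", "aa"], states, trans, emit, {"h#": 0.0}, {"a"})
print(path)  # ['h#', 'h#', 'm', 'a']
```

Every state is extended at every frame (breadth-first), and only the single best path into each state is kept, which is what makes the search tractable.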

Page 29: Speech Recognition (Part 2)

Viterbi Search Pruning

• Search efficiency can be improved with pruning

– Score-based: Don’t extend low scoring hypotheses

– Count-based: Extend only a fixed number of hypotheses

[Diagram: the same lexical-node/time lattice, with low-scoring hypotheses marked “x” and pruned]
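Both pruning strategies can be sketched as one function applied to a frame's active hypotheses; the beam width, count limit, and scores below are arbitrary illustrative values:

```python
def prune(hyps, beam_width=10.0, max_count=3):
    """hyps: dict mapping hypothesis -> log score.
    Score-based: keep hypotheses within beam_width of the best score.
    Count-based: then keep only the top max_count of those."""
    best = max(hyps.values())
    kept = {h: w for h, w in hyps.items() if w >= best - beam_width}
    top = sorted(kept, key=kept.get, reverse=True)[:max_count]
    return {h: kept[h] for h in top}

hyps = {"a": -1.0, "b": -3.0, "c": -20.0, "d": -2.0, "e": -2.5}
print(prune(hyps))  # 'c' falls outside the beam; 'b' falls outside the count
```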

Page 30: Speech Recognition (Part 2)

Search Pruning Example

• Count-based pruning can effectively reduce search

• Example: Fix beam size (count) and vary beam width (score)


Page 31: Speech Recognition (Part 2)

N-best Computation with Backwards A* Search

• Backwards A* search can be used to find the N-best paths

• The Viterbi backtrace is used as the future estimate for path scores

[Diagram: lexical nodes h#, m, a, r, z against time frames t0–t8]

Page 32: Speech Recognition (Part 2)

Street Address Recognition

• Street address recognition is difficult

– 6.2M unique street, city, state pairs in the US (283K unique words)

– High confusion rate among similar street names

– Very large search space for recognition

• Commercial solution: directed dialogue

– Breaks the problem into a set of smaller recognition tasks

– Simple for first-time users, but tedious with repeated use

C: Main menu. Please say one of the following:
C: “directions”, “restaurants”, “gas stations”, or “more options”.
H: Directions.
C: Okay. Directions. What state are you going to?
H: Massachusetts.
C: Okay. Massachusetts. What city are you going to?
H: Cambridge.
C: Okay. Cambridge. What is the street address?
H: 32 Vassar Street.
C: Okay. 32 Vassar Street in Cambridge, Massachusetts.
C: From your current location, continue straight on…

Page 33: Speech Recognition (Part 2)

Street Address Recognition

• Research goal: mixed-initiative dialogue

– More difficult to predict what users will say

– Far more natural for repeat or expert users

C: How can I help you?
H: I need directions to 32 Vassar Street in Cambridge, Mass.

• Recognition approach: dynamically adapt the recognition vocabulary

– 3 recognition passes over one utterance

– 1st pass: Detect state and activate relevant cities

– 2nd pass: Detect cities and activate relevant streets

– 3rd pass: Recognize full street address

Page 34: Speech Recognition (Part 2)

Dynamic Vocabulary Recognition