Speech Recognition (Part 2) T. J. Hazen MIT Computer Science and Artificial Intelligence Laboratory


Page 1: Speech Recognition (Part 2)

Speech Recognition(Part 2)

T. J. Hazen

MIT Computer Science and Artificial Intelligence Laboratory

Page 2: Speech Recognition (Part 2)

Lecture Overview

• Probabilistic framework

• Pronunciation modeling

• Language modeling

• Finite state transducers

• Search

• System demonstrations (time permitting)

Page 3: Speech Recognition (Part 2)

Probabilistic Framework

• Speech recognition is typically performed using a probabilistic modeling approach

• Goal is to find the most likely string of words, W, given the acoustic observations, A:

W* = argmax P( W | A )
          W

• The expression is rewritten using Bayes’ Rule:

W* = argmax P( A | W ) P( W ) / P( A ) = argmax P( A | W ) P( W )
          W                                    W

Page 4: Speech Recognition (Part 2)


Probabilistic Framework

• Words are represented as sequences of phonetic units.

• Using phonetic units, U, expression expands to:

• Pronunciation and language models provide constraint

• Pronunciation and language models encoded in network

• Search must efficiently find most likely U and W

argmax P( A | U ) P( U | W ) P( W )
 U,W

( P(A|U): acoustic model, P(U|W): pronunciation model, P(W): language model )

Page 5: Speech Recognition (Part 2)

Phonemes

• Phonemes are the basic linguistic units used to construct morphemes, words, and sentences.

– Phonemes represent unique canonical acoustic sounds

– When constructing words, changing a single phoneme changes the word.

• Example phonemic mappings:

– pin → /p ih n/

– thought → /th ao t/

– saves → /s ey v z/

• English spelling is not (exactly) phonemic

– Pronunciation cannot always be determined from spelling

– Homophones have the same phonemes but different spellings

* two vs. to vs. too, bear vs. bare, queue vs. cue, etc.

– Same spelling can have different pronunciations

* read, record, either, etc.
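The phoneme-to-word relationship above can be illustrated with a toy lookup. The word list and phoneme symbols below are a small hypothetical sample, not a real lexicon:

```python
# Toy phonemic dictionary; entries follow the slide's examples.
lexicon = {
    "pin":     ["p", "ih", "n"],
    "thought": ["th", "ao", "t"],
    "saves":   ["s", "ey", "v", "z"],
    "two":     ["t", "uw"],
    "to":      ["t", "uw"],
    "too":     ["t", "uw"],
}

def homophones(word, lexicon):
    """Return all words sharing the same phoneme sequence as `word`."""
    target = lexicon[word]
    return sorted(w for w, phones in lexicon.items() if phones == target)

print(homophones("two", lexicon))  # ['to', 'too', 'two']
```

Because homophones share a phoneme string, the acoustic and pronunciation models alone cannot distinguish them; the language model must.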

Page 6: Speech Recognition (Part 2)

Phonemic Units and Classes

Vowels
  aa : pot     er : bert
  ae : bat     ey : bait
  ah : but     ih : bit
  ao : bought  iy : beat
  aw : bout    ow : boat
  ax : about   oy : boy
  ay : buy     uh : book
  eh : bet     uw : boot

Semivowels
  l : light    w : wet
  r : right    y : yet

Fricatives
  s : sue      f : fee
  z : zoo      v : vee
  sh : shoe    th : thesis
  zh : azure   dh : that
  hh : hat

Nasals
  m : might    n : night    ng : sing

Affricates
  ch : chew    jh : Joe

Stops
  p : pay      b : bay
  t : tea      d : day
  k : key      g : go

Page 7: Speech Recognition (Part 2)

Phones

• Phones (or phonetic units) are used to represent the actual acoustic realization of phonemes.

• Examples:

– Stops contain a closure and a release

* /t/ → [tcl t]

* /k/ → [kcl k]

– The /t/ and /d/ phonemes can be flapped

* utter /ah t er/ → [ah dx er]

* udder /ah d er/ → [ah dx er]

– Vowels can be fronted:

* Tuesday /t uw z d ey/ → [tcl t ux z d ey]

Page 8: Speech Recognition (Part 2)

Enhanced Phoneme Labels

Stops
  p : pay      b : bay
  t : tea      d : day
  k : key      g : go

Special sequences
  nt : interview
  tq en : Clinton

Stops w/ optional release
  pd : tap     bd : tab
  td : pat     dd : bad
  kd : pack    gd : dog

Unaspirated stops
  p- : speed
  t- : steep
  k- : ski

Stops w/ optional flap
  tf : batter
  df : badder

Retroflexed stops
  tr : tree
  dr : drop

Page 9: Speech Recognition (Part 2)

Example Phonemic Baseform File

<hangup> : _h1 +                                    ← _h1 is a special noise model symbol;
<noise> : _n1 +                                       "+" repeats the previous symbol
<uh> : ah_fp                                        ← ah_fp is a special filled-pause vowel
<um> : ah_fp m
adder : ae df er
atlanta : ( ae | ax ) td l ae nt ax                 ← ( a | b ) marks alternate pronunciations
either : ( iy | ay ) th er
laptop : l ae pd t aa pd
northwest : n ao r th w eh s td
speech : s p- iy ch
temperature : t eh m p ( r ? ax | er ax ? ) ch er   ← "?" marks optional phonemes
trenton : tr r eh n tq en
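A baseform entry with `( a | b )` alternations and `?` optionals compactly encodes several pronunciations. As a rough sketch of expanding such an entry (the parser below assumes space-separated symbols and handles only these two operators, not the full baseform syntax):

```python
def expand(tokens):
    """Expand a baseform with ( a | b ) alternations and '?' optional
    markers into the list of phoneme strings it licenses."""
    def parse_seq(i, stop):
        # Parse a sequence until a stop token; return (alternatives, next index).
        alts = [[]]
        while i < len(tokens) and tokens[i] not in stop:
            if tokens[i] == "(":
                group, i = parse_group(i + 1)
            else:
                group = [[tokens[i]]]
                i += 1
            if i < len(tokens) and tokens[i] == "?":
                group = group + [[]]   # the preceding unit may be skipped
                i += 1
            alts = [a + g for a in alts for g in group]
        return alts, i
    def parse_group(i):
        # Parse '|'-separated choices up to the closing ')'.
        choices = []
        while True:
            seqs, i = parse_seq(i, {"|", ")"})
            choices.extend(seqs)
            if tokens[i] == ")":
                return choices, i + 1
            i += 1   # skip '|'
    seqs, _ = parse_seq(0, set())
    return sorted(" ".join(s) for s in seqs)

print(expand("( iy | ay ) th er".split()))
# ['ay th er', 'iy th er']
```

Applied to the "temperature" entry above, the same expansion yields four pronunciations.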

Page 10: Speech Recognition (Part 2)

Applying Phonological Rules

• Multiple phonetic realizations of a phoneme can be generated by applying phonological rules.

• Example:

butter : b ah tf er

This can be realized phonetically as:

bcl b ah tcl t er     (standard /t/)

or as:

bcl b ah dx er        (flapped /t/)

• Phonological rewrite rules can be used to generate this:

butter : bcl b ah ( tcl t | dx ) er

Page 11: Speech Recognition (Part 2)

Example Phonological Rules

• Example rule for /t/ deletion (“destination”):

{s} t {ax ix} => [tcl t];

(format: {left context} phoneme {right context} => phonetic realization)

• Example rule for palatalization of /s/ (“miss you”):

{} s {y} => s | sh;
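A rule of this form can be applied with a single scan over a phoneme string. The sketch below assumes rules encoded as (left-context set, phoneme, right-context set, alternative realizations), which is one plausible encoding rather than the system's actual rule engine:

```python
def apply_rule(phonemes, rule):
    """Apply one context-dependent rewrite rule, producing every
    phone string licensed by the rule's alternative realizations."""
    left_ctx, target, right_ctx, outputs = rule
    seqs = [[]]
    for i, p in enumerate(phonemes):
        prev = phonemes[i - 1] if i > 0 else None
        nxt = phonemes[i + 1] if i + 1 < len(phonemes) else None
        if (p == target
                and (not left_ctx or prev in left_ctx)    # empty set = any context
                and (not right_ctx or nxt in right_ctx)):
            choices = outputs          # rule fires: branch on each realization
        else:
            choices = [[p]]            # rule does not apply: copy the phoneme
        seqs = [s + c for s in seqs for c in choices]
    return [" ".join(s) for s in seqs]

# {} s {y} => s | sh;   ("miss you")
palatalization = (set(), "s", {"y"}, [["s"], ["sh"]])
print(apply_rule("m ih s y uw".split(), palatalization))
# ['m ih s y uw', 'm ih sh y uw']
```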

Page 12: Speech Recognition (Part 2)

Contractions and Reductions

• Examples of contractions:

– what’s → what is

– isn’t → is not

– won’t → will not

– i’d → i would | i had

– today’s → today is | today’s

• Examples of multi-word reductions:

– gimme → give me

– gonna → going to

– ave → avenue

– ‘bout → about

– d’y’ave → do you have

• Contracted and reduced forms entered in lexical dictionary

Page 13: Speech Recognition (Part 2)

Language Modeling

• A language model constrains hypothesized word sequences

• A finite state grammar (FSG) example:

• Probabilities can be added to arcs for additional constraint

• FSGs work well when users stay within grammar…

• …but FSGs can’t cover everything that might be spoken.

[FSG diagram: ( tell me | what is ) the ( forecast | weather ) ( in | for ) ( baltimore | boston )]

Page 14: Speech Recognition (Part 2)

N-gram Language Modeling

• An n-gram model is a statistical language model

• Predicts current word based on previous n-1 words

• Trigram model expression:

• Examples

• An n-gram model allows any sequence of words…

• …but prefers sequences common in training data.

P( wn | wn-2 , wn-1 )

P( boston | arriving in )

P( seventeenth | tuesday march )
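Estimating such probabilities from counts can be sketched as follows (maximum-likelihood only, no smoothing; the sentence markers and the tiny training set are assumptions for illustration):

```python
from collections import Counter

def train_trigram(sentences):
    """MLE trigram model:
    P(wn | wn-2, wn-1) = c(wn-2 wn-1 wn) / c(wn-2 wn-1)."""
    tri, bi = Counter(), Counter()
    for s in sentences:
        words = ["<s>", "<s>"] + s.split() + ["</s>"]
        for i in range(2, len(words)):
            tri[tuple(words[i - 2:i + 1])] += 1   # trigram count
            bi[tuple(words[i - 2:i])] += 1        # its history count
    return lambda w, hist: tri[hist + (w,)] / bi[hist] if bi[hist] else 0.0

p = train_trigram(["flights arriving in boston",
                   "flights arriving in denver",
                   "arriving in boston today"])
print(p("boston", ("arriving", "in")))  # 2/3: two of three "arriving in" contexts
```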

Page 15: Speech Recognition (Part 2)

N-gram Model Smoothing

• For a bigram model, what if p( wn | wn-1 ) = 0 ?

• To avoid sparse training data problems, we can use an interpolated bigram:

p̃( wn | wn-1 ) = λwn-1 p( wn | wn-1 ) + ( 1 − λwn-1 ) p̃( wn )

• One method for determining the interpolation weight:

λwn-1 = c( wn-1 ) / ( c( wn-1 ) + K )
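The interpolation scheme can be sketched directly. Here p̃(w) is taken to be the raw unigram relative frequency and K is a tunable constant; both the corpus and K below are invented for illustration:

```python
from collections import Counter

class InterpolatedBigram:
    """p~(w|v) = lam(v) * p(w|v) + (1 - lam(v)) * p~(w),
    with lam(v) = c(v) / (c(v) + K): frequent histories trust the bigram."""
    def __init__(self, corpus, K=5.0):
        self.uni, self.bi = Counter(), Counter()
        for s in corpus:
            words = ["<s>"] + s.split()
            self.uni.update(words)
            self.bi.update(zip(words, words[1:]))
        self.total = sum(self.uni.values())
        self.K = K

    def prob(self, w, v):
        p_uni = self.uni[w] / self.total
        c_v = self.uni[v]
        lam = c_v / (c_v + self.K)
        p_bi = self.bi[(v, w)] / c_v if c_v else 0.0
        return lam * p_bi + (1 - lam) * p_uni

lm = InterpolatedBigram(["what is the weather", "what is the forecast"], K=5.0)
print(lm.prob("forecast", "the"))   # seen bigram
print(lm.prob("weather", "what"))   # unseen bigram still gets unigram mass
```

The unseen bigram ("what", "weather") gets a nonzero probability through the unigram term, which is exactly the sparse-data fix the slide describes.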

Page 16: Speech Recognition (Part 2)

Class N-gram Language Modeling

• Class n-gram models can also help sparse data problems

• Class trigram expression:

• Example:

P(class(wn) | class(wn-2) , class(wn-1)) P(wn | class(wn))

P( seventeenth | tuesday march ) ≈ P( NTH | WEEKDAY MONTH ) P( seventeenth | NTH )
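The factorization multiplies a class-level trigram by a class-membership probability. A sketch with a hypothetical class table and made-up probabilities (none of these numbers come from a trained model):

```python
# Hypothetical word-to-class table, following the slide's example classes.
word_class = {"tuesday": "WEEKDAY", "march": "MONTH", "seventeenth": "NTH"}

# Made-up probabilities for illustration only.
p_class = {("NTH", ("WEEKDAY", "MONTH")): 0.4}    # P(class(wn) | class history)
p_word_in_class = {("seventeenth", "NTH"): 0.03}  # P(wn | class(wn))

def class_trigram(w, hist):
    """P(wn | wn-2, wn-1) under the class n-gram factorization."""
    c = word_class[w]
    chist = tuple(word_class[h] for h in hist)
    return p_class[(c, chist)] * p_word_in_class[(w, c)]

print(class_trigram("seventeenth", ("tuesday", "march")))  # 0.4 * 0.03
```

Any WEEKDAY/MONTH pair shares the class-level statistics, so a date never seen verbatim in training still gets a sensible probability.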

Page 17: Speech Recognition (Part 2)

Multi-Word N-gram Units

• Common multi-word units can be treated as a single unit within an N-gram language model

• Common uses of compound units:– Common multi-word phrases:

* thank_you , good_bye , excuse_me

– Multi word sequences that act as a single semantic unit:

* new_york , labor_day , wind_speed

– Letter sequences or initials:

* j_f_k , t_w_a , washington_d_c

Page 18: Speech Recognition (Part 2)

Finite-State Transducer (FST) Motivation

• Most speech recognition constraints and results can be represented as finite-state automata:

– Language models (e.g., n-grams and word networks)

– Lexicons

– Phonological rules

– N-best lists

– Word graphs

– Recognition paths

• Common representation and algorithms desirable

– Consistency

– Powerful algorithms can be employed throughout system

– Flexibility to combine or factor in unforeseen ways

Page 19: Speech Recognition (Part 2)

What is an FST?

• One initial state

• One or more final states

• Transitions between states: input : output / weight

– input requires an input symbol to match

– output indicates symbol to output when transition taken

– epsilon (ε) consumes no input or produces no output

– weight is the cost (e.g., -log probability) of taking transition

• An FST defines a weighted relationship between regular languages

• A generalization of the classic finite-state acceptor (FSA)
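A minimal FST along these lines might look as follows. Epsilon transitions are omitted for brevity, and the symbols, words, and weights are illustrative, not taken from a real recognizer:

```python
class FST:
    """Weighted FST: one initial state, a set of final states,
    and arcs (input symbol, output symbol, weight, destination)."""
    def __init__(self, initial, finals):
        self.initial, self.finals = initial, finals
        self.arcs = {}   # state -> list of (in_sym, out_sym, weight, dst)

    def add_arc(self, src, isym, osym, weight, dst):
        self.arcs.setdefault(src, []).append((isym, osym, weight, dst))

    def transduce(self, inputs):
        """Yield (output sequence, total weight) for each accepting path."""
        def walk(state, i, out, w):
            if i == len(inputs) and state in self.finals:
                yield out, w
            if i < len(inputs):
                for isym, osym, aw, dst in self.arcs.get(state, []):
                    if isym == inputs[i]:
                        yield from walk(dst, i + 1, out + [osym], w + aw)
        yield from walk(self.initial, 0, [], 0.0)

# Map the phoneme string /t uw/ to homophonous words; weights act as costs.
f = FST(0, {2})
f.add_arc(0, "t", "two", 0.7, 1)
f.add_arc(0, "t", "to", 0.4, 1)
f.add_arc(1, "uw", "-", 0.0, 2)   # '-' stands in for an epsilon output
print(sorted(f.transduce(["t", "uw"])))
# [(['to', '-'], 0.4), (['two', '-'], 0.7)]
```

The two accepting paths realize the weighted relationship between the phoneme string and the competing words.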

Page 20: Speech Recognition (Part 2)

FST Example: Lexicon

• Lexicon maps /phonemes/ to ‘words’

• Words can share parts of pronunciations

• Sharing at the beginning is beneficial to recognition speed, because a single pruning decision can remove many words at once

Page 21: Speech Recognition (Part 2)

FST Composition

• Composition (o) combines two FSTs to produce a single FST that performs both mappings in single step

( words → /phonemes/ ) o ( /phonemes/ → [phones] ) = ( words → [phones] )
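For epsilon-free machines, composition is a product construction: pair up the states, keep an arc wherever the first machine's output symbol equals the second machine's input symbol, and add the weights. A sketch under those simplifying assumptions (no epsilon handling, no trimming of unreachable pair-states):

```python
def compose(arcs1, finals1, arcs2, finals2):
    """Compose two epsilon-free FSTs given as arc lists
    (src, in_sym, out_sym, weight, dst); initial state is 0 in both."""
    arcs = []
    for s1, i1, o1, w1, d1 in arcs1:
        for s2, i2, o2, w2, d2 in arcs2:
            if o1 == i2:   # output of machine 1 feeds input of machine 2
                arcs.append(((s1, s2), i1, o2, w1 + w2, (d1, d2)))
    finals = {(f1, f2) for f1 in finals1 for f2 in finals2}
    return arcs, finals

# Machine 1: phoneme /t/ may surface as "t" or a flap "dx" (toy weights).
flap = [(0, "t", "t", 0.6, 1), (0, "t", "dx", 0.4, 1), (1, "er", "er", 0.0, 2)]
# Machine 2: realize "t" as closure+release, pass other phones through.
real = [(0, "t", "tcl t", 0.1, 1), (0, "dx", "dx", 0.0, 1), (1, "er", "er", 0.0, 2)]

arcs, finals = compose(flap, {2}, real, {2})
for a in arcs:
    print(a)
```

The composed machine maps /t er/ directly to [tcl t er] or [dx er] in a single step, which is exactly why recognizers build one composed search network offline.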

Page 22: Speech Recognition (Part 2)

FST Optimization Example

letter to word lexicon

Page 23: Speech Recognition (Part 2)

FST Optimization Example: Determinization

• Determinization turns lexicon into tree

• Words share common prefix

Page 24: Speech Recognition (Part 2)

FST Optimization Example: Minimization

• Minimization enables sharing at the ends

Page 25: Speech Recognition (Part 2)

A Cascaded FST Recognizer

G : Language Model
   ↓
Multi-Word Units
   ↓  M : Multi-word Mapping
Canonical Words
   ↓  R : Reductions Model
Spoken Words
   ↓  L : Lexical Model
Phonemic Units
   ↓  P : Phonological Model
Phonetic Units
   ↓  C : CD Model Mapping
Acoustic Model Labels

(G, M, and R together form the language model; L and P the pronunciation model.)

Page 26: Speech Recognition (Part 2)

A Cascaded FST Recognizer

• Example cascade for “give me new_york_city”:

give me new_york_city                       (multi-word units)
   ↓  M : Multi-word Mapping
give me new york city                       (canonical words)
   ↓  R : Reductions Model
gimme new york city                         (spoken words)
   ↓  L : Lexical Model
g ih m iy n uw y ao r kd s ih tf iy         (phonemic units)
   ↓  P : Phonological Model
gcl g ih m iy n uw y ao r kcl s ih dx iy    (phonetic units)

Page 27: Speech Recognition (Part 2)

Search

• Once again, the probabilistic expression is:

• Pronunciation and language models encoded in FST

• Search must efficiently find most likely U and W

argmax P( A | U ) P( U | W ) P( W )
 U,W

( P(A|U): acoustic model; P(U|W) P(W): encoded in the lexical FST )

Page 28: Speech Recognition (Part 2)

Viterbi Search

• Viterbi search: a time-synchronous breadth-first search

[Diagram: lexical nodes h#, m, a, r, z plotted against time frames t0–t8, with the best path spelling out “h# m a r z h#”]
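The time-synchronous recursion can be sketched over a tiny hand-built lexical network. The three-node network, transition weights, and frame labels below are invented for illustration:

```python
import math

def viterbi(obs, states, trans, emit, init, final):
    """Time-synchronous Viterbi search over a lexical network.
    trans[(p, s)] and emit[s][frame] are log probabilities."""
    # scores[s] = (log prob of best path ending in s, that path)
    scores = {s: (init.get(s, -math.inf), [s]) for s in states}
    for frame in obs:                       # advance all hypotheses one frame
        new = {}
        for s in states:
            best, path = -math.inf, []
            for p in states:                # best predecessor of s
                w = scores[p][0] + trans.get((p, s), -math.inf)
                if w > best:
                    best, path = w, scores[p][1] + [s]
            new[s] = (best + emit.get(s, {}).get(frame, -math.inf), path)
        scores = new
    return max((scores[s] for s in final), key=lambda x: x[0])

l5, l9 = math.log(0.5), math.log(0.9)
states = ["h#", "m", "a"]
trans = {("h#", "h#"): l5, ("h#", "m"): l5,
         ("m", "m"): l5, ("m", "a"): l5, ("a", "a"): l5}
emit = {"h#": {"sil": l9}, "m": {"m": l9}, "a": {"aa": l9}}
score, path = viterbi(["sil", "m", "aa"], states, trans, emit, {"h#": 0.0}, {"a"})
print(path)  # ['h#', 'h#', 'm', 'a']
```

Every state is extended at every frame (breadth-first), and only the single best path into each state is kept, which is what makes the search tractable.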

Page 29: Speech Recognition (Part 2)

Viterbi Search Pruning

• Search efficiency can be improved with pruning

– Score-based: Don’t extend low scoring hypotheses

– Count-based: Extend only a fixed number of hypotheses

[Diagram: the same lexical-node/time lattice, with low-scoring hypotheses marked “x” and pruned]
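Both pruning strategies can be sketched as one function applied to a frame's active hypotheses; the beam width, count limit, and scores below are arbitrary illustrative values:

```python
def prune(hyps, beam_width=10.0, max_count=3):
    """hyps: dict mapping hypothesis -> log score.
    Score-based: keep hypotheses within beam_width of the best score.
    Count-based: then keep only the top max_count of those."""
    best = max(hyps.values())
    kept = {h: w for h, w in hyps.items() if w >= best - beam_width}
    top = sorted(kept, key=kept.get, reverse=True)[:max_count]
    return {h: kept[h] for h in top}

hyps = {"a": -1.0, "b": -3.0, "c": -20.0, "d": -2.0, "e": -2.5}
print(prune(hyps))  # 'c' falls outside the beam; 'b' falls outside the count
```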

Page 30: Speech Recognition (Part 2)

Search Pruning Example

• Count-based pruning can effectively reduce search

• Example: Fix beam size (count) and vary beam width (score)


Page 31: Speech Recognition (Part 2)

N-best Computation with Backwards A* Search

• Backwards A* search can be used to find the N-best paths

• The Viterbi backtrace is used as the future estimate for path scores

[Diagram: lexical nodes h#, m, a, r, z against time frames t0–t8]

Page 32: Speech Recognition (Part 2)

Street Address Recognition

• Street address recognition is difficult

– 6.2M unique street, city, state pairs in the US (283K unique words)

– High confusion rate among similar street names

– Very large search space for recognition

• Commercial solution: directed dialogue

– Breaks the problem into a set of smaller recognition tasks

– Simple for first-time users, but tedious with repeated use

C: Main menu. Please say one of the following:
C: “directions”, “restaurants”, “gas stations”, or “more options”.
H: Directions.
C: Okay. Directions. What state are you going to?
H: Massachusetts.
C: Okay. Massachusetts. What city are you going to?
H: Cambridge.
C: Okay. Cambridge. What is the street address?
H: 32 Vassar Street.
C: Okay. 32 Vassar Street in Cambridge, Massachusetts.
C: From your current location, continue straight on…

Page 33: Speech Recognition (Part 2)

Street Address Recognition

• Research goal: mixed-initiative dialogue

– More difficult to predict what users will say

– Far more natural for repeat or expert users

C: How can I help you?
H: I need directions to 32 Vassar Street in Cambridge, Mass.

• Recognition approach: dynamically adapt the recognition vocabulary

– 3 recognition passes over one utterance

– 1st pass: Detect state and activate relevant cities

– 2nd pass: Detect cities and activate relevant streets

– 3rd pass: Recognize full street address

Page 34: Speech Recognition (Part 2)

Dynamic Vocabulary Recognition