Speech Recognition(Part 2)
T. J. Hazen
MIT Computer Science and Artificial Intelligence Laboratory
Lecture Overview
• Probabilistic framework
• Pronunciation modeling
• Language modeling
• Finite state transducers
• Search
• System demonstrations (time permitting)
Probabilistic Framework
• Speech recognition is typically performed using a probabilistic modeling approach
• Goal is to find the most likely string of words, W, given the acoustic observations, A:
W* = argmax_W P( W | A )
• The expression is rewritten using Bayes' rule:

W* = argmax_W [ P( A | W ) P( W ) / P( A ) ] = argmax_W P( A | W ) P( W )

(P(A) is the same for every hypothesized W, so it can be dropped from the maximization.)
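A toy numeric sketch of this decision rule, working in log space; all word strings and scores below are hypothetical illustration values, not real recognizer output:

```python
# Toy illustration: pick the word string W that maximizes P(A|W) * P(W).
# P(A) is identical for every hypothesis, so it is ignored.
# Log probabilities avoid numeric underflow from long products.
hypotheses = {
    "what is the weather": (-120.5, -8.2),   # (log P(A|W), log P(W))
    "what is the whether": (-121.0, -15.9),
    "watt is the weather": (-119.8, -17.3),
}

best = max(hypotheses, key=lambda w: sum(hypotheses[w]))
print(best)  # what is the weather
```

Note how the acoustically best-scoring string ("watt is …") loses once the language model score P(W) is added in.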
Probabilistic Framework
• Words are represented as sequences of phonetic units.
• Using phonetic units, U, expression expands to:
• Pronunciation and language models provide constraint
• Pronunciation and language models encoded in network
• Search must efficiently find most likely U and W
max_{U,W} P( A | U ) P( U | W ) P( W )

where P(A|U) is the acoustic model, P(U|W) is the pronunciation model, and P(W) is the language model.
Phonemes
• Phonemes are the basic linguistic units used to construct morphemes, words, and sentences.
– Phonemes represent unique canonical acoustic sounds
– When constructing words, changing a single phoneme changes the word.
• Example phonemic mappings:
– pin /p ih n/
– thought /th ao t/
– saves /s ey v z/
• English spelling is not (exactly) phonemic
– Pronunciation cannot always be determined from spelling
– Homophones have same phonemes but different spellings
* Two vs. to vs. too, bear vs. bare, queue vs. cue, etc.
– Same spelling can have different pronunciations
* read, record, either, etc.
Phonemic Units and Classes
Vowels:     aa : pot, ae : bat, ah : but, ao : bought, aw : bout, ax : about, ay : buy, eh : bet, er : bert, ey : bait, ih : bit, iy : beat, ow : boat, oy : boy, uh : book, uw : boot
Semivowels: l : light, r : right, w : wet, y : yet
Fricatives: s : sue, z : zoo, sh : shoe, zh : azure, hh : hat, f : fee, v : vee, th : thesis, dh : that
Nasals:     m : might, n : night, ng : sing
Affricates: ch : chew, jh : Joe
Stops:      p : pay, t : tea, k : key, b : bay, d : day, g : go
Phones
• Phones (or phonetic units) are used to represent the actual acoustic realization of phonemes.
• Examples:
– Stops contain a closure and a release
* /t/ [tcl t]
* /k/ [kcl k]
– The /t/ and /d/ phonemes can be flapped
* utter /ah t er/ [ah dx er]
* udder /ah d er/ [ah dx er]
– Vowels can be fronted:
* Tuesday /t uw z d ey/ [tcl t ux z d ey]
Enhanced Phoneme Labels
Stops:                     p : pay, t : tea, k : key, b : bay, d : day, g : go
Stops w/ optional release: pd : tap, td : pat, kd : pack, bd : tab, dd : bad, gd : dog
Unaspirated stops:         p- : speed, t- : steep, k- : ski
Stops w/ optional flap:    tf : batter, df : badder
Retroflexed stops:         tr : tree, dr : drop
Special sequences:         nt : interview, tq en : Clinton
Example Phonemic Baseform File
<hangup> : _h1 +
<noise> : _n1 +
<uh> : ah_fp
<um> : ah_fp m
adder : ae df er
atlanta : ( ae | ax ) td l ae nt ax
either : ( iy | ay ) th er
laptop : l ae pd t aa pd
northwest : n ao r th w eh s td
speech : s p- iy ch
temperature : t eh m p ( r ? ax | er ax ? ) ch er
trenton : tr r eh n tq en
special filledpause vowel
alternatepronunciations
repeat previous symbol
special noise model symbol
optionalphonemes
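A sketch of expanding such baseform entries into their concrete pronunciations. The `expand` function and its limited syntax support are assumptions for illustration, not the actual MIT tools: it handles `( a | b )` alternates and `?` after a single phoneme, but not the `+` repeat operator.

```python
def expand(tokens):
    # Expand a phonemic baseform token list into all concrete pronunciations.
    # Supports ( a | b ) alternates and '?' marking the previous single
    # phoneme optional; the '+' repeat operator is not handled.
    def parse_seq(i, stop):
        seqs = [[]]
        while i < len(tokens) and tokens[i] not in stop:
            tok = tokens[i]
            if tok == "(":
                alts, i = parse_alts(i + 1)
                seqs = [s + a for s in seqs for a in alts]
            elif tok == "?":
                seqs = seqs + [s[:-1] for s in seqs]  # branch without previous phoneme
                i += 1
            else:
                seqs = [s + [tok] for s in seqs]
                i += 1
        return seqs, i

    def parse_alts(i):
        alts = []
        while True:
            seqs, i = parse_seq(i, {"|", ")"})
            alts.extend(seqs)
            if tokens[i] == ")":
                return alts, i + 1
            i += 1  # skip '|'

    seqs, _ = parse_seq(0, set())
    return sorted(set(" ".join(s) for s in seqs))

print(expand("( iy | ay ) th er".split()))
# ['ay th er', 'iy th er']
print(len(expand("t eh m p ( r ? ax | er ax ? ) ch er".split())))
# 4
```

The "temperature" entry expands to four pronunciations, matching the optional /r/ and /ax/ alternatives in the baseform.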
Applying Phonological Rules
• Multiple phonetic realizations of phonemes can be generated by applying phonological rules.
• Example:

butter : b ah tf er

This can be realized phonetically with a standard /t/:

bcl b ah tcl t er

or with a flapped /t/:

bcl b ah dx er

• Phonological rewrite rules can be used to generate both:

butter : bcl b ah ( tcl t | dx ) er
Example Phonological Rules
• Example rule for /t/ deletion ("destination"):

{s} t {ax ix} => [tcl t];

(left context {s}, phoneme t, right context {ax ix}, phonetic realization [tcl t])

• Example rule for palatalization of /s/ ("miss you"):

{} s {y} => s | sh;
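A sketch of applying such a context-dependent rule to a phoneme string; `apply_rule` and its rule encoding are illustrative assumptions, not the actual rule compiler:

```python
def apply_rule(phonemes, left, target, right, outputs):
    # Apply one context-dependent rewrite rule: when `target` occurs with a
    # left-neighbor in `left` (empty set = any context) and a right-neighbor
    # in `right`, replace it with each alternative in `outputs`.
    # Returns every resulting phonetic realization.
    results = [[]]
    for i, p in enumerate(phonemes):
        prev = phonemes[i - 1] if i > 0 else None
        nxt = phonemes[i + 1] if i + 1 < len(phonemes) else None
        if (p == target
                and (not left or prev in left)
                and (not right or nxt in right)):
            results = [r + [o] for r in results for o in outputs]
        else:
            results = [r + [p] for r in results]
    return [" ".join(r) for r in results]

# Palatalization of /s/ before /y/ ("miss you"): {} s {y} => s | sh;
print(apply_rule("m ih s y uw".split(), set(), "s", {"y"}, ["s", "sh"]))
# ['m ih s y uw', 'm ih sh y uw']
```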
Contractions and Reductions
• Examples of contractions:
– what's → what is
– isn't → is not
– won't → will not
– i'd → i would | i had
– today's → today is | today's
• Examples of multi-word reductions:
– gimme → give me
– gonna → going to
– ave → avenue
– 'bout → about
– d'y'ave → do you have
• Contracted and reduced forms are entered in the lexical dictionary
Language Modeling
• A language model constrains hypothesized word sequences
• A finite state grammar (FSG) example:
• Probabilities can be added to arcs for additional constraint
• FSGs work well when users stay within grammar…
• …but FSGs can’t cover everything that might be spoken.
[FSG network: (tell me | what is) → the → (forecast for | weather in) → (baltimore | boston)]
N-gram Language Modeling
• An n-gram model is a statistical language model
• Predicts current word based on previous n-1 words
• Trigram model expression:

P( wn | wn-2 , wn-1 )

• Examples:

P( boston | arriving in )
P( seventeenth | tuesday march )

• An n-gram model allows any sequence of words…
• …but prefers sequences common in training data.
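The trigram probabilities above can be estimated from counts. A maximum-likelihood sketch on a toy corpus; the function name and data are illustrative:

```python
from collections import Counter

def train_trigram(sentences):
    # Maximum-likelihood trigram estimates from a toy corpus:
    # P(wn | wn-2, wn-1) = c(wn-2, wn-1, wn) / c(wn-2, wn-1)
    tri, bi = Counter(), Counter()
    for s in sentences:
        words = ["<s>", "<s>"] + s.split() + ["</s>"]
        for i in range(2, len(words)):
            tri[tuple(words[i - 2:i + 1])] += 1
            bi[tuple(words[i - 2:i])] += 1
    return lambda w, h2, h1: tri[(h2, h1, w)] / bi[(h2, h1)]

p = train_trigram(["arriving in boston", "arriving in denver", "arriving in boston"])
print(round(p("boston", "arriving", "in"), 3))  # 0.667
```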
N-gram Model Smoothing
• For a bigram model, what if p( wn | wn-1 ) = 0 for a word pair never seen in training?
• To avoid sparse training data problems, we can use an interpolated bigram:

p~( wn | wn-1 ) = λwn-1 · p( wn | wn-1 ) + ( 1 - λwn-1 ) · p~( wn )

• One method for determining the interpolation weight:

λwn-1 = c( wn-1 ) / ( c( wn-1 ) + K )
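A minimal sketch of this interpolation; the toy corpus and the value of K are hypothetical:

```python
from collections import Counter

def interpolated_bigram(corpus, K=5.0):
    # Interpolated bigram following the slide:
    #   p~(wn|wn-1) = lam * p(wn|wn-1) + (1 - lam) * p~(wn)
    #   lam = c(wn-1) / (c(wn-1) + K)
    # Frequent histories trust their bigram estimate; rare histories
    # back off toward the unigram.
    words = corpus.split()
    uni = Counter(words)
    bi = Counter(zip(words, words[1:]))
    total = len(words)

    def p(wn, prev):
        lam = uni[prev] / (uni[prev] + K)
        p_bi = bi[(prev, wn)] / uni[prev] if uni[prev] else 0.0
        return lam * p_bi + (1 - lam) * uni[wn] / total

    return p

p = interpolated_bigram("a b a b a c")
print(round(p("b", "a"), 3))  # 0.458
print(p("c", "b") > 0)        # True: an unseen bigram still gets probability
```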
Class N-gram Language Modeling
• Class n-gram models can also help with sparse data problems
• Class trigram expression:

P( wn | wn-2 , wn-1 ) ≈ P( class(wn) | class(wn-2) , class(wn-1) ) · P( wn | class(wn) )

• Example:

P( seventeenth | tuesday march ) ≈ P( NTH | WEEKDAY MONTH ) · P( seventeenth | NTH )
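A toy sketch of the class-trigram factorization; the class inventory and all probability values are hypothetical:

```python
# Class-trigram sketch: back off from words to word classes:
# P(wn | wn-2, wn-1) ~ P(class(wn) | class(wn-2), class(wn-1)) * P(wn | class(wn))
word_class = {"tuesday": "WEEKDAY", "march": "MONTH", "seventeenth": "NTH"}
p_class = {("WEEKDAY", "MONTH", "NTH"): 0.30}   # hypothetical class trigram prob
p_word = {("seventeenth", "NTH"): 0.02}          # hypothetical word-given-class prob

def class_trigram(wn, h2, h1):
    # Product of the class sequence probability and the class-membership prob.
    key = (word_class[h2], word_class[h1], word_class[wn])
    return p_class.get(key, 0.0) * p_word.get((wn, word_class[wn]), 0.0)

print(round(class_trigram("seventeenth", "tuesday", "march"), 3))  # 0.006
```

Class counts are far less sparse than word counts, so both factors can be estimated reliably from much less data.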
Multi-Word N-gram Units
• Common multi-word units can be treated as a single unit within an N-gram language model
• Common uses of compound units:
– Common multi-word phrases:
* thank_you , good_bye , excuse_me
– Multi-word sequences that act as a single semantic unit:
* new_york , labor_day , wind_speed
– Letter sequences or initials:
* j_f_k , t_w_a , washington_d_c
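A sketch of folding such multi-word units into a token stream before n-gram training, using greedy longest-match; the function name and compound list are illustrative:

```python
def merge_multiwords(words, compounds):
    # Greedily replace known multi-word sequences with single compound
    # units, preferring the longest match at each position.
    out, i = [], 0
    while i < len(words):
        for n in range(min(4, len(words) - i), 1, -1):
            cand = "_".join(words[i:i + n])
            if cand in compounds:
                out.append(cand)
                i += n
                break
        else:
            out.append(words[i])
            i += 1
    return out

compounds = {"new_york", "thank_you", "washington_d_c"}
print(merge_multiwords("thank you for calling new york".split(), compounds))
# ['thank_you', 'for', 'calling', 'new_york']
```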
Finite-State Transducer (FST) Motivation
• Most speech recognition constraints and results can be represented as finite-state automata:
– Language models (e.g., n-grams and word networks)
– Lexicons
– Phonological rules
– N-best lists
– Word graphs
– Recognition paths
• A common representation and common algorithms are desirable:
– Consistency
– Powerful algorithms can be employed throughout system
– Flexibility to combine or factor in unforeseen ways
What is an FST?
• One initial state
• One or more final states
• Transitions between states: input : output / weight
– input requires an input symbol to match
– output indicates symbol to output when transition taken
– epsilon (ε) consumes no input or produces no output
– weight is the cost (e.g., -log probability) of taking transition
• An FST defines a weighted relationship between regular languages
• A generalization of the classic finite-state acceptor (FSA)
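A minimal weighted-FST sketch under these definitions. The toy lexicon fragment and function name are assumptions; epsilon input transitions are omitted for brevity:

```python
def transduce(fst, finals, state, inputs, out=(), cost=0.0):
    # Exhaustively follow transitions, consuming one input symbol each
    # (fine for tiny FSTs); return the (cost, output) of the cheapest
    # accepting path, or None if the input is rejected.
    if not inputs:
        return (cost, out) if state in finals else None
    best = None
    for nxt, sym, w in fst.get((state, inputs[0]), []):
        r = transduce(fst, finals, nxt, inputs[1:], out + (sym,), cost + w)
        if r is not None and (best is None or r < best):
            best = r
    return best

# Hypothetical toy lexicon fragment: /t uw/ can surface as "two" or "to",
# with weights acting as costs (e.g. -log probabilities).
fst = {
    (0, "t"): [(1, "", 0.1)],                          # output epsilon, keep going
    (1, "uw"): [(2, "two", 0.2), (2, "to", 0.4)],       # cheaper path wins
}
cost, out = transduce(fst, finals={2}, state=0, inputs=["t", "uw"])
print(round(cost, 1), "".join(out))  # 0.3 two
```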
FST Example: Lexicon
• Lexicon maps /phonemes/ to ‘words’
• Words can share parts of pronunciations
• Sharing at the beginning of words benefits recognition speed, because a single pruning decision can remove many words at once
FST Composition
• Composition (∘) combines two FSTs to produce a single FST that performs both mappings in a single step:

( words → /phonemes/ ) ∘ ( /phonemes/ → [phones] ) = ( words → [phones] )
FST Optimization Example
• Example: a letter-to-word lexicon
FST Optimization Example: Determinization
• Determinization turns lexicon into tree
• Words share common prefix
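The tree that determinization produces can be sketched as a phoneme trie. The pronunciation entry for "speed" below is an assumed example (only "speech" appears in the baseform file above):

```python
def build_prefix_tree(lexicon):
    # Share common pronunciation prefixes across words: a nested dict
    # keyed by phoneme, with completed words stored under the key None.
    root = {}
    for word, pron in lexicon.items():
        node = root
        for ph in pron.split():
            node = node.setdefault(ph, {})
        node.setdefault(None, []).append(word)
    return root

lex = {"speed": "s p- iy dd", "speech": "s p- iy ch", "key": "k iy"}
tree = build_prefix_tree(lex)
# "speed" and "speech" share the prefix s p- iy and only branch at the end:
print(sorted(tree["s"]["p-"]["iy"]))  # ['ch', 'dd']
```

Pruning any node of this tree removes every word below it, which is exactly why prefix sharing speeds up search.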
FST Optimization Example: Minimization
• Minimization enables sharing at the ends
A Cascaded FST Recognizer
Acoustic Model Labels
Phonetic Units
Phonemic Units
Spoken Words
Multi-Word Units
Canonical Words
C : CD Model Mapping
P: Phonological Model
L : Lexical Model
G : Language Model
R : Reductions Model
M : Multi-word Mapping
(P and L make up the pronunciation model; G, R, and M make up the language model)
A Cascaded FST Recognizer
give me new_york_city (multi-word units)
give me new york city (canonical words)
gimme new york city (spoken words)
g ih m iy n uw y ao r kd s ih tf iy (phonemic units)
gcl g ih m iy n uw y ao r kcl s ih dx iy (phonetic units)
Search
• Once again, the probabilistic expression is:
• Pronunciation and language models encoded in FST
• Search must efficiently find most likely U and W
max_{U,W} P( A | U ) P( U | W ) P( W )

where P(A|U) is the acoustic model, and P(U|W) P(W) is encoded in the lexical FST.
Viterbi Search
• Viterbi search: a time-synchronous breadth-first search

[Figure: search lattice with lexical nodes (h#, m, a, r, z) on the vertical axis and time frames t0–t8 on the horizontal axis; the best path spells h# m a r z h#]
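A compact sketch of the time-synchronous Viterbi recursion; the two-state model and all probabilities below are hypothetical toy values:

```python
import math

def viterbi(obs, trans, start):
    # Time-synchronous Viterbi: at each frame every state keeps only its
    # single best-scoring predecessor, working in the log domain.
    #   obs:   list over frames of {state: log P(observation | state)}
    #   trans: {(prev_state, state): log transition probability}
    scores, back = {start: 0.0}, []
    for frame in obs:
        new, ptr = {}, {}
        for (p, s), lp in trans.items():
            if p in scores and s in frame:
                cand = scores[p] + lp + frame[s]
                if s not in new or cand > new[s]:
                    new[s], ptr[s] = cand, p
        scores = new
        back.append(ptr)
    state = max(scores, key=scores.get)   # best final state
    path = [state]
    for ptr in reversed(back):            # follow backpointers
        state = ptr[state]
        path.append(state)
    return path[::-1][1:]                 # drop the initial state

LOG = math.log
trans = {("<s>", "a"): LOG(0.9), ("<s>", "b"): LOG(0.1),
         ("a", "a"): LOG(0.6), ("a", "b"): LOG(0.4),
         ("b", "a"): LOG(0.2), ("b", "b"): LOG(0.8)}
obs = [{"a": LOG(0.7), "b": LOG(0.1)},
       {"a": LOG(0.2), "b": LOG(0.6)},
       {"a": LOG(0.1), "b": LOG(0.7)}]
print(viterbi(obs, trans, "<s>"))  # ['a', 'b', 'b']
```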
Viterbi Search Pruning
• Search efficiency can be improved with pruning
– Score-based: Don't extend low-scoring hypotheses
– Count-based: Extend only a fixed number of hypotheses

[Figure: the same lattice of lexical nodes (h#, m, a, r, z) over time frames t0–t8, with pruned hypotheses marked x]
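The two pruning strategies can be sketched together as one per-frame step; the threshold values and hypothesis scores below are hypothetical:

```python
def prune(scores, beam_width=10.0, max_count=3):
    # Combine both strategies from the slide:
    # 1. score-based: drop hypotheses more than `beam_width` below the best
    # 2. count-based: keep at most `max_count` survivors
    # `scores` maps hypothesis -> log score.
    best = max(scores.values())
    survivors = {h: s for h, s in scores.items() if s >= best - beam_width}
    return dict(sorted(survivors.items(), key=lambda kv: -kv[1])[:max_count])

scores = {"h#": -5.0, "m": -7.5, "a": -30.0, "r": -9.0, "z": -12.0}
print(prune(scores))  # {'h#': -5.0, 'm': -7.5, 'r': -9.0}
```

Here "a" falls outside the score beam and "z" is cut by the count limit, mirroring the x marks in the lattice figure.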
Search Pruning Example
• Count-based pruning can effectively reduce search
• Example: Fix beam size (count) and vary beam width (score)
[Figure: lattice of lexical nodes (h#, m, a, r, z) over time frames t0–t8, illustrating the pruned search region]
N-best Computation with Backwards A* Search
• A backwards A* search can be used to find the N best paths
• The Viterbi backtrace score serves as the A* future estimate for partial path scores
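A sketch of the N-best idea: with exact Viterbi scores-to-goal as the A* look-ahead, complete paths pop off the queue in order of total cost. The graph, costs, and function name are hypothetical:

```python
import heapq

def nbest(arcs, start, goal, backward, n):
    # Backwards-A*-style N-best: expand partial paths best-first, using the
    # exact remaining cost-to-goal (`backward`, from a first Viterbi pass)
    # as the A* estimate, so complete paths are found in best-first order.
    heap = [(backward[start], 0.0, [start])]  # (estimated total, cost so far, path)
    results = []
    while heap and len(results) < n:
        est, cost, path = heapq.heappop(heap)
        if path[-1] == goal:
            results.append((cost, path))
            continue
        for nxt, w in arcs.get(path[-1], []):
            heapq.heappush(heap, (cost + w + backward[nxt], cost + w, path + [nxt]))
    return results

# Hypothetical word graph with arc costs, plus precomputed cost-to-goal.
arcs = {"s": [("a", 1.0), ("b", 2.0)], "a": [("g", 3.0)], "b": [("g", 1.0)]}
backward = {"s": 3.0, "a": 3.0, "b": 1.0, "g": 0.0}
print(nbest(arcs, "s", "g", backward, n=2))
# [(3.0, ['s', 'b', 'g']), (4.0, ['s', 'a', 'g'])]
```

Because the estimate is exact rather than merely admissible, no completed path is ever displaced by a later, cheaper one.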
Street Address Recognition
• Street address recognition is difficult
– 6.2M unique street, city, state pairs in the US (283K unique words)
– High confusion rate among similar street names
– Very large search space for recognition
• Commercial solution: directed dialogue
– Breaks the problem into a set of smaller recognition tasks
– Simple for first-time users, but tedious with repeated use

C: Main menu. Please say one of the following:
C: “directions”, “restaurants”, “gas stations”, or “more options”.
H: Directions.
C: Okay. Directions. What state are you going to?
H: Massachusetts.
C: Okay. Massachusetts. What city are you going to?
H: Cambridge.
C: Okay. Cambridge. What is the street address?
H: 32 Vassar Street.
C: Okay. 32 Vassar Street in Cambridge, Massachusetts.
C: From your current location, continue straight on…
Street Address Recognition
• Research goal: mixed-initiative dialogue
– More difficult to predict what users will say
– Far more natural for repeat or expert users

C: How can I help you?
H: I need directions to 32 Vassar Street in Cambridge, Mass.

• Recognition approach: dynamically adapt the recognition vocabulary
– 3 recognition passes over one utterance
– 1st pass: Detect state and activate relevant cities
– 2nd pass: Detect cities and activate relevant streets
– 3rd pass: Recognize full street address
Dynamic Vocabulary Recognition