MIT Computer Science and Artificial Intelligence Laboratory
SPOKEN LANGUAGE SYSTEMS
Feature-based Pronunciation ModelingUsing Dynamic Bayesian Networks
Karen Livescu
JHU Workshop Planning Meeting
April 16, 2004
Joint work with Jim Glass
MIT CSAIL
Preview
• The problem of pronunciation variation for automatic speech recognition (ASR)
• Traditional methods: phone-based pronunciation modeling
• Proposed approach: pronunciation modeling via multiple sequences of linguistic features
• A natural framework: dynamic Bayesian networks (DBNs)
• A feature-based pronunciation model using DBNs
• Proof-of-concept experiments
• Ongoing/future work
• Integration with SVM feature classifiers
The problem of pronunciation variation
• Conversation from the Switchboard speech database:
• “neither one of them”:
• “decided”:
• “never really”:
• “probably”:
• Noted as an obstacle for ASR (e.g., [McAllester et al. 1998])
The problem of pronunciation variation (2)
[Figure: number of pronunciations per word vs. word frequency, for read vs. casual speech]
• More acute in casual/conversational than in read speech:
Observed pronunciations of “probably” (count):
p r aa b iy (2)
p r ay (1)
p r aw l uh (1)
p r ah b iy (1)
p r aa lg iy (1)
p r aa b uw (1)
p ow ih (1)
p aa iy (1)
p aa b uh b l iy (1)
p aa ah iy (1)
Traditional solution: phone-based pronunciation modeling
• Transformation rules are typically of the form p1 → p2 / p3 __ p4 (where pi may be null)
– E.g. Ø → p / m __ {non-labial}
• Rules are derived from:
– Linguistic knowledge (e.g. [Hazen et al. 2002])
– Data (e.g. [Riley & Ljolje 1996])
• Powerful, but:
– Sparse data issues
– Increased inter-word confusability
– Some pronunciation changes not well described
– Limited success in recognition experiments
Example: warmth, dictionary /w ao r m th/ → surface [w ao r m p th] ([p] insertion rule)
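A rule of this form can be sketched as a rewrite over a phone sequence. The following is a minimal, hypothetical illustration of the [p]-insertion example; the function and the phone set standing in for the {non-labial} class are not part of the original system:

```python
def apply_insertion_rule(phones, insert, left_context, right_class):
    """Sketch of an insertion rule: Ø -> insert / left_context __ {right_class}.
    phones is a list of phone symbols; right_class is a set of phones
    standing in for a phonological class such as {non-labial}."""
    out = []
    for i, ph in enumerate(phones):
        out.append(ph)
        nxt = phones[i + 1] if i + 1 < len(phones) else None
        if ph == left_context and nxt in right_class:
            out.append(insert)
    return out

# warmth: dictionary /w ao r m th/ -> surface [w ao r m p th]
surface = apply_insertion_rule(["w", "ao", "r", "m", "th"],
                               "p", "m", {"th", "t", "s", "k"})
```

Real systems compile such rules into finite-state transducers rather than applying them one at a time; this only illustrates the rewrite mechanics.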
A feature-based approach
• Speech can alternatively be described using sub-phonetic features
Feature set: LIP-OP, TT-OPEN, TT-LOC, TB-LOC, TB-OPEN, VELUM, VOICING
• (This feature set based on articulatory phonology [Browman & Goldstein 1990])
Feature-based pronunciation modeling
• instruments → [ih_n s ch em ih_n n s]
[Table: per-frame feature values — lip opening: Nar, Mid, Mid, Clo, Mid; velum: Clo, Clo, Clo, Op, Clo; voicing: V, V, V, V, !V — the lips & velum desynchronize]
• warmth: dictionary /w ao r m th/ → [w ao r m p th]
• wants → [w aa_n t s] -- Phone deletion??
• several → [s eh r v ax l] -- Exchange of two phones???
• everybody → [eh r uw ay]
Related work
• Much work on classifying features:
– [King et al. 1998]
– [Kirchhoff 2002]
– [Chang, Greenberg, & Wester 2001]
– [Juneja & Espy-Wilson 2003]
– [Omar & Hasegawa-Johnson 2002]
– [Niyogi & Burges 2002]
• Less work on the “non-phonetic” relationship between words and features:
– [Deng et al. 1997], [Richardson & Bilmes 2000]: “fully-connected” state space via hidden Markov model
– [Kirchhoff 1996]: features independent, except for synchronization at syllable boundaries
– [Carson-Berndsen 1998]: bottom-up, constraint-based approach
• Goal: Develop a general feature-based pronunciation model
– Capable of using known independence assumptions
– Without overly strong assumptions
Approach: Main Ideas ([HLT/NAACL-2004])
• Begin with usual assumption: Each word has one or more underlying pronunciations, given by a dictionary
E.g. warmth: dictionary /w ao r m th/
• Surface (actual) feature values can stray from underlying values via:
1) Substitution – modeled by confusion matrices P(s|u)
2) Asynchrony
– Assign an index (counter) to each feature, and allow index values to differ
– Apply constraints on the difference between the mean indices of feature subsets
• Natural to implement using graphical models, in particular dynamic Bayesian networks (DBNs)
[Table: per-frame feature values, index 0-4 — lip opening: Nar, Mid, Mid, Clo, Mid; velum: Off, Off, Off, On, Off; voicing: V, V, V, V, !V]
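The asynchrony mechanism can be sketched as a conditional probability of a synchronization variable given two features' indices. The numeric values below follow the constraint described in the talk (fully aligned preferred, one position apart allowed, further apart forbidden); the function name is hypothetical:

```python
def p_sync_given_indices(ind1, ind2):
    """Pr(sync = 1 | ind1, ind2): 1.0 when the two features point at the
    same dictionary position, 0.5 when they are one position apart, and
    0.0 otherwise (a hard constraint against larger desynchronization)."""
    diff = abs(ind1 - ind2)
    if diff == 0:
        return 1.0
    elif diff == 1:
        return 0.5
    return 0.0
```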
Aside: Dynamic Bayesian networks
[Figure: an HMM drawn as a DBN — state S and observation O in frame i-1 and frame i, with edges S→S and S→O]
• Bayesian network (BN): Directed-graph representation of a distribution over a set of variables
– Graph node ↔ a variable plus its distribution given its parents
– Graph edge ↔ a “dependency”
• Dynamic Bayesian network (DBN): BN with a repeating structure
• Example: HMM
• Uniform algorithms for (among other things):
– Finding the most likely values of a subset of the variables, given the rest (analogous to the Viterbi algorithm for HMMs)
– Learning model parameters via EM
CPTs: p(s_0), p(s_i | s_{i-1}), p(o_i | s_i), giving
p(o_{0:L}, s_{0:L}) = p(s_0) p(o_0 | s_0) ∏_{i=1…L} p(s_i | s_{i-1}) p(o_i | s_i)
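The HMM factorization above can be evaluated directly from its three conditional probability tables. A minimal sketch, using a hypothetical two-state toy model (all numbers illustrative):

```python
import math

def hmm_joint_log_prob(states, obs, pi, A, B):
    """log p(o_{0:L}, s_{0:L}) = log p(s0) + log p(o0 | s0)
    + sum over i = 1..L of [log p(s_i | s_{i-1}) + log p(o_i | s_i)]."""
    lp = math.log(pi[states[0]]) + math.log(B[states[0]][obs[0]])
    for i in range(1, len(states)):
        lp += math.log(A[states[i - 1]][states[i]])
        lp += math.log(B[states[i]][obs[i]])
    return lp

# Toy two-state model (illustrative parameters, not from the talk)
pi = [0.6, 0.4]                # p(s0)
A = [[0.7, 0.3], [0.4, 0.6]]   # p(s_i | s_{i-1})
B = [[0.9, 0.1], [0.2, 0.8]]   # p(o_i | s_i)
lp = hmm_joint_log_prob([0, 1], [0, 1], pi, A, B)  # log(0.6 * 0.9 * 0.3 * 0.8)
```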
Approach: A DBN-based Model
• Example DBN using 3 features:
Substitution confusion matrix P(S | U) for one feature (rows = underlying value; entries as recovered from the slide, shown for illustration):

      CLO   CRI   NAR   M-N   …
CLO   .7    .2    .1    0     …
CRI   0     .7    .2    .1    …
NAR   0     0     .7    .2    …
MID   0     0     .1    …     …

(The word and index variables encode the baseform pronunciations.)
Variables in frame t: word_t; synchronization variables sync^{1;2}_t and sync^{1,2;3}_t; per-feature index variables ind^1_t, ind^2_t, ind^3_t; underlying feature values U^1_t, U^2_t, U^3_t; and surface feature values S^1_t, S^2_t, S^3_t.

Pr(sync^{1;2}_t = 1 | ind^1, ind^2) = 1 if ind^1 = ind^2; 0.5 if |ind^1 - ind^2| = 1; 0 otherwise
• (Simplified to show important properties! Implemented model has additional variables.)
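The substitution step corresponds to a per-feature confusion matrix P(S | U). A sketch for one feature follows; the entries mirror the near-diagonal structure of the slide's matrix but are illustrative (in the real model they are learned), and "MID" stands in for the remaining feature values:

```python
# Hypothetical P(surface | underlying) for one feature stream.
# Each row sums to 1; mass concentrates near the diagonal, so a
# surface value close to the underlying one is most likely.
CONFUSION = {
    "CLO": {"CLO": 0.7, "CRI": 0.2, "NAR": 0.1, "MID": 0.0},
    "CRI": {"CLO": 0.0, "CRI": 0.7, "NAR": 0.2, "MID": 0.1},
    "NAR": {"CLO": 0.0, "CRI": 0.1, "NAR": 0.7, "MID": 0.2},
    "MID": {"CLO": 0.0, "CRI": 0.0, "NAR": 0.2, "MID": 0.8},
}

def p_surface_given_underlying(s, u):
    """P(S_t = s | U_t = u) for one feature stream."""
    return CONFUSION[u].get(s, 0.0)
```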
Approach: A DBN-based Model (2)
• “Unrolled” DBN:
[Figure: the DBN unrolled over frames 0, 1, …, T, with word_t, the sync variables, ind^k_t, U^k_t, and S^k_t replicated in each frame]
• Parameter learning via Expectation Maximization (EM)
• Training data:
– Articulatory databases
– Detailed phonetic transcriptions
A proof-of-concept experiment
• Task: classify an isolated word from the Switchboard corpus, given a detailed phonetic transcription (from ICSI Berkeley, [Greenberg et al. 1996])
– Convert transcription into feature vectors Si, one per 10ms
– For each word w in a 3k+ word vocabulary, compute P(w|Si)
– Output w* = arg maxw P(w|Si)
– Used GMTK [Bilmes & Zweig 2002] for inference and EM parameter training
– Note: ICSI transcription is somewhere between phones and features—not ideal, but as good as we have
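The decision rule w* = arg max_w P(w | S) reduces to a max over the vocabulary of per-word scores. A schematic sketch, where score_word and the toy scorer are hypothetical stand-ins for the DBN inference:

```python
def classify_word(feature_frames, vocabulary, score_word):
    """Return arg max over w of score_word(w, feature_frames);
    score_word stands in for log P(S | w) computed by DBN inference
    (plus a log word prior, if the prior is non-uniform)."""
    return max(vocabulary, key=lambda w: score_word(w, feature_frames))

# Toy stand-in scorer: counts frame-level matches to a canonical string.
def toy_score(word, frames):
    canon = {"warmth": "wrmth", "once": "wns"}[word]
    return sum(1 for a, b in zip(canon, frames) if a == b)

best = classify_word("wrmth", ["warmth", "once"], toy_score)
```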
Results (development set)
Model                                   Error rate (%)   Failure rate (%)
Baseforms only (1.7 prons/word)         63.6             61.2
+ phonological rules (4 prons/word)     50.3             47.9
Synchronous feature-based               35.2             24.8
Asynchronous feature-based              29.7             16.4
Asynch. + segmental constraint          32.7             19.4
Asynch. + segmental constraint + EM     27.8             19.4
• When did asynchrony matter?
– Vowel nasalization & rounding
– Nasal + stop → nasal
– Some schwa deletions
– instruments → [ih_n s ch em ih_n n s]
– everybody → [eh r uw ay]
• What didn’t work?
– Some deletions ([ax], [t])
– Vowel retroflexion
– Alveolar + [y] → palatal
– (Cross-word effects)
– (Speech/transcription errors…)
Sample Viterbi path
everybody → [ eh r uw ay ]
Ongoing/future work
• Trainable synchrony constraints ([ICSLP 2004?])
• Context-dependent distributions for underlying (Ui) and surface (Si) feature values
• Extension to more complex tasks (multi-word sequences, larger vocabularies)
• Implementation in a complete recognizer (cf. [Eurospeech 2003])
• Articulatory databases for parameter learning/testing
• Can we use such a model to learn something about speech?
Integration with feature classifier outputs
• Use (hard) classifier decisions as observations for Si
[Figure: classifier outputs O^1_t, O^2_t, O^3_t attached as observations of the surface variables S^1_t, S^2_t, S^3_t (rest of model above)]

P(O^i_t = 1 | S^i_t = s) = P_SVM(S^i_t = s)
it
• Landmark-based classifier outputs to DBN Si’s:
– Convert landmark-based features to one feature vector per frame
– (Possibly) convert from the SVM feature set to the DBN feature set
• Convert classifier scores to posterior probabilities and use as “soft evidence” for Si
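Converting raw SVM scores to posterior probabilities is commonly done by fitting a sigmoid to the decision values (Platt scaling). A minimal sketch; the parameters a and b here are illustrative constants, whereas in practice they would be fit by maximum likelihood on held-out data:

```python
import math

def svm_score_to_posterior(margin, a=-1.5, b=0.0):
    """Platt-style calibration: P(feature present | margin) =
    1 / (1 + exp(a * margin + b)). With a < 0, larger (more confident)
    SVM margins map to posteriors closer to 1."""
    return 1.0 / (1.0 + math.exp(a * margin + b))
```

The resulting posteriors can then serve as the “soft evidence” values attached to the Si variables.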
Acknowledgment
• Jeff Bilmes, U. Washington
Thank you!
GRAVEYARD
Background: Continuous Speech Recognition
• Given a waveform with acoustic features A, find the most likely word string W* = {w1, w2, …, wM}:

W* = argmax_W P(W | A)
   = argmax_W Σ_U P(W, U | A)

where U ranges over possible pronunciations (typically phone strings).

• Assuming U* is much more likely than all other U, and applying Bayes’ rule:

{W*, U*} = argmax_{W,U} P(A | W, U) · P(U | W) · P(W)
           (acoustic model · pronunciation model · language model)
Example: “warmth” → “warmpth”
• Phone-based view:
Brain: Give me a [ ]!
Lips, tongue, velum, glottis: Right on it, sir!
• (Articulatory) feature-based view:
Brain: Give me a [ ]!
Velum, glottis: Right on it, sir!
Tongue: Umm… yeah, OK.
Lips: Huh?
Graphical models for hidden feature modeling
• Most ASR approaches use hidden Markov models (HMMs) and/or finite-state transducers (FSTs)
– Efficient and powerful, but limited
– Only one state variable per time frame
• Graphical models (GMs) allow for:
– Arbitrary numbers of variables and dependencies
– Standard algorithms over large classes of models
→ Straightforward mapping between feature-based models and GMs
→ Potentially large reduction in number of parameters
• GMs for ASR:
– Zweig (e.g. PhD thesis, 1998), Bilmes (e.g. PhD thesis, 1999), Stephenson (e.g. Eurospeech 2001)
– Feature-based ASR with GMs suggested by Zweig, but not previously investigated
Background
• Brief intro to ASR:
– Words are written in terms of sub-word units; acoustic models compute the probability of acoustic (spectral) features given sub-word units, or vice versa
• Pronunciation model: a mapping between words and strings of sub-word units
Possible solution?
• Allow every pronunciation in some large database?
– Unreliable probability estimation due to sparse data
– Unseen words
– Increased confusability
Phone-based pronunciation modeling (2)
• Generalize across words
• But:
– Data still sparse
– Still increased confusability
– Some pronunciation changes not well described by phonetic rules
– Limited gains in speech recognition experiments
Approach
• Begin with usual assumption that each word has one or more “target” pronunciations, given by the dictionary
• Model the evolution of multiple feature streams, allowing for:
– Feature changes on a frame-by-frame basis
– Feature desynchronization
– Control of asynchrony—more “synchronous” feature configurations are preferable
• Dynamic Bayesian networks (DBNs): Efficient parameterization and computation when state can be factored