
Page 1:

Landmark-Based Speech Recognition

The Marriage of High-Dimensional Machine Learning Techniques with Modern Linguistic Representations

Mark Hasegawa-Johnson
jhasegaw@uiuc

Research performed in collaboration with James Baker (Carnegie Mellon), Sarah Borys (Illinois), Ken Chen (Illinois), Emily Coogan (Illinois), Steven Greenberg (Berkeley), Amit Juneja (Maryland), Katrin Kirchhoff (Washington), Karen Livescu (MIT), Srividya Mohan (Johns Hopkins), Jen Muller (Dept. of Defense), Kemal Sonmez (SRI), and Tianyu Wang (Georgia Tech)

Page 2:

Goal of this Talk

1. Experiments with human subjects (since 1910 at Bell Labs, since 1950 at Harvard) give us detailed knowledge of human speech perception.
   • Human speech perception is multi-resolution, like progressive JPEG: syllables and prosody → distinctive features → words

2. Automatic speech recognition (ASR) works best if all parameters in the system can be simultaneously learned in order to adjust a global optimality criterion.
   • In 1967, it became possible to globally optimize all parameters of a very simple recognition model called the hidden Markov model.
   • Multi-resolution speech models could not be globally optimized.
   • Therefore, from 1985-1999, standard ASR ignored results from speech psychology.

3. In the 1990s, new results in machine learning made it possible to globally optimize a multi-resolution model of speech psychology, and to use the resulting model as an automatic speech recognizer.
   • We do not yet know how best to "marry" speech psychology with new machine learning technology.
   • Goal of this talk: to test globally optimized computational models of speech psychology as automatic speech recognizers.

Page 3:

Talk Outline

History and Overview

1. Acoustics → Landmarks
   a. Psychological Results: Landmark-Based Speech Perception
   b. Psychological Results: Perceptual Space ≠ Acoustic Space
   c. Computational Model: Landmark Detection and Classification
   d. Algorithm: Support Vector Machines

2. Landmarks → Words
   a. The Pronunciation Modeling Problem
   b. Psychological Model #1: An Underspecified Distinctive Feature Lexicon
      • Computational Model: Discriminative Selection of Landmarks
   c. Psychological Model #2: Articulatory Phonology
      • Computational Model: Dynamic Bayesian Network (DBN)

3. Technological Evaluation
   a. Landmark Detection and Classification
   b. Forced Alignment using the DBN
   c. Rescoring of word lattice output from an HMM-based recognizer
   d. Error Analysis and Future Plans

Page 4:

History

• Human Speech Recognition Models
  – 1955, Miller and Nicely: Distinctive Features
  – 1955, Delattre, Liberman, and Cooper: Landmarks
  – 1975, Goldsmith: Underspecified Lexicon
  – 1992, Stevens: Landmark-Based Speech Perception Model
  – 1990, Browman and Goldstein: Articulatory Phonology

• Automatic Speech Recognition
  – 1999, Niyogi and Ramesh: Support Vector Machines for Landmark Detection
  – 2003, Livescu and Glass: Dynamic Bayesian Network implementation of Articulatory Phonology
  – 2004, Hasegawa-Johnson et al., WS04 Summer Workshop at the Johns Hopkins Center for Language and Speech Processing
    • Underspecified Lexicon with Discriminative Landmark Selection
    • Hybrid SVM-DBN implementation of Articulatory Phonology

Page 5:

Landmark-Based Speech Recognition

[Figure: word lattice annotated with syllable structure (onset, nucleus, coda) over the hypothesized words.]

Pronunciation variants: … backed up … / … backtup … / … back up … / … backt ihp … / … wackt ihp …

Lattice hypothesis (scores, words, times): … backed up …

Page 6:

Acoustics → Landmarks: Results and Models from Psychology and Linguistics

Page 7:

Spectral Dynamics
(Delattre, Liberman and Cooper, 1955)

• To recognize a stop consonant, one spectrum is not enough.
• Recognition depends on the pattern of spectral change over a 50ms period following the release "landmark."

Page 8:

Landmarks are Redundant
(Many authors, including Stevens, 1999)

To recognize a stop consonant, it is necessary and sufficient to hear any one of these:

• Release into vowel
• Closure from vowel
• "Ejective" burst

… three "acoustic landmarks" with very different spectral patterns.

[Figure: example spectrogram, "backed"]

Page 9:

Recognition Depends on Rhythm
(Warren, Healy, and Chalikia, 1996)

• Heard as one voice saying "aa iy uw ow ae iy"
• Heard as two voices: one says "hi uw," one says "iowa"

Page 10:

Nonlinear Map from Acoustic Features to Perceptual Features
(Kuhl et al., 1992)

Page 11:

In the Perceptual Space, Distinctive Feature Errors are Independent
(Miller and Nicely, 1955)

• Experimental Method:
  – Subjects listen to nonsense syllables mixed with noise (white noise or BPF)
  – Subjects write the consonant they hear

• Results:

  p(q* | q, SNR, BPF) ≈ ∏i p(fi* | fi, SNR, BPF)

  q* = consonant label heard by the listener
  q = true consonant label
  F* = [f1*, …, f6*] = perceived distinctive feature labels
  F = [f1, …, f6] = true distinctive feature labels
  [±nasal, ±voiced, ±fricated, ±strident, ±lips, ±blade]
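As a concrete illustration of the independence result above, the sketch below multiplies per-feature confusion probabilities to approximate a consonant confusion probability. The feature values and per-feature accuracies are invented for illustration; they are not Miller and Nicely's data.

```python
# Sketch of the Miller-Nicely independence model:
# p(q*|q, SNR, BPF) ~= product over features i of p(fi*|fi, SNR, BPF).
# Feature values and per-feature confusion probabilities below are
# hypothetical, chosen only to illustrate the computation.

# Distinctive feature values for two consonants
# (order: nasal, voiced, fricated, strident, lips, blade).
FEATURES = {
    "p": (0, 0, 0, 0, 1, 0),
    "t": (0, 0, 0, 0, 0, 1),
}

# Hypothetical probability that feature i is heard correctly at some SNR.
P_CORRECT = (0.95, 0.80, 0.85, 0.90, 0.70, 0.65)


def confusion_prob(spoken: str, heard: str) -> float:
    """Approximate p(heard | spoken) as a product of per-feature terms."""
    p = 1.0
    for f_true, f_heard, p_corr in zip(FEATURES[spoken], FEATURES[heard], P_CORRECT):
        p *= p_corr if f_true == f_heard else (1.0 - p_corr)
    return p


if __name__ == "__main__":
    # /p/ and /t/ differ only in the lips and blade features, so the
    # predicted confusion probability is (1-0.70)*(1-0.65) times the
    # probability that the other four features are heard correctly.
    print(confusion_prob("p", "t"))
```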

Page 12:

Consonant Confusions at -6dB SNR
(Rows: consonant spoken; columns, in the order listed below: consonant heard; cell values are response counts.)

P T K F TH S SH B D G V DH Z ZH M N

P 80 43 64 17 14 6 2 1 1 1 1 2

T 71 84 55 5 9 3 8 1 1 1

K 66 76 107 12 8 9 4 1 1

F 18 12 9 175 48 11 1 7 2 1 2 2

TH 19 17 16 104 64 32 7 5 4 5 6 4 5

S 8 5 4 23 39 107 45 4 2 3 1 1 3 2 1

SH 1 6 3 4 6 29 195 3 1

B 1 5 4 4 136 10 9 47 16 6 1 5 4

D 8 5 80 45 11 20 20 26 1

G 2 3 63 66 3 19 37 56 3

V 2 2 48 5 5 145 45 12 4

DH 6 31 6 17 86 58 21 5 6 4

Z 1 1 17 20 27 16 28 94 44 1

ZH 1 26 18 3 8 45 129 2

M 1 4 4 1 3 177 46

N 4 1 5 2 7 1 6 47 163

Distinctive Features: ±nasal, ±voiced, ±fricative, ±strident

Page 13:

In the Acoustic Space, Distinctive Features are Not Independent
(Volaitis and Miller, 1992)

Page 14:

Acoustics → Landmarks: A Computational Model

Page 15:

Landmark Detection and Explanation
(based on Stevens, Manuel, Shattuck-Hufnagel, and Liu, 1992)

[Figure: detected landmarks grouped into syllable structure (onset, nucleus, coda).]

Search space: … buck up … / … big dope … / … backed up … / … bagged up … / … big doowop … / …

MAP understanding: … backed up …

Page 16:

Landmark Detector Inputs: Acoustic, Phonetic, and Auditory Features
Total Feature Vector Dimension: 483/frame

• MFCCs, 25ms window (standard ASR features)
• Spectral shape: energy, spectral tilt, and spectral compactness, once/millisecond
• Noise-robust MUSIC-based formant frequencies, amplitudes, and bandwidths (Zheng & Hasegawa-Johnson, ICSLP 2004)
• Acoustic-phonetic parameters (formant-based relative spectral measures and time-domain measures; Bitar & Espy-Wilson, 1996)
• Rate-place model of neural response fields in the cat auditory cortex (Carlyon & Shamma, JASA 2003)
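A minimal sketch of how such per-frame streams might be concatenated into one observation matrix. The stream names, dimensions, and contents below are placeholders; they do not reproduce the actual 483-dimensional layout used in the workshop system.

```python
import numpy as np

# Sketch: assemble a per-frame observation vector by concatenating
# several acoustic feature streams, as listed on this slide. The
# streams here are random placeholders with hypothetical dimensions.

def stack_streams(streams, n_frames):
    """Concatenate per-frame feature streams into one (n_frames, D) matrix."""
    # Each stream: array of shape (n_frames, dim_i).
    return np.concatenate([np.asarray(s)[:n_frames] for s in streams], axis=1)

if __name__ == "__main__":
    n_frames = 100
    mfcc      = np.random.randn(n_frames, 13)   # placeholder MFCCs
    shape     = np.random.randn(n_frames, 3)    # energy, tilt, compactness
    formants  = np.random.randn(n_frames, 9)    # freqs, amps, bandwidths
    apparams  = np.random.randn(n_frames, 10)   # acoustic-phonetic params
    ratescale = np.random.randn(n_frames, 15)   # auditory-model features

    X = stack_streams([mfcc, shape, formants, apparams, ratescale], n_frames)
    print(X.shape)  # (100, 50) in this toy example
```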

Page 17:

Cues for Place of Articulation:
MFCCs + formants + rate-scale, within 150ms of the landmark

Page 18:

Landmark Detection using Support Vector Machines (SVMs)
(Niyogi, Ramesh & Burges, 1999, 2002)

False acceptance vs. false rejection errors, TIMIT, per 10ms frame. SVM stop release detector: half the error of an HMM.

(1) Delta-Energy ("Deriv"): Equal Error Rate = 0.2%
(2) HMM (*): False Rejection Error = 0.3%
(3) Linear SVM: Equal Error Rate = 0.15%
(4) Radial Basis Function SVM: Equal Error Rate = 0.13%

Page 19:

What is a Support Vector Machine?

• SVM = hyperplane, RBF, or kernel-based classifier, trained to minimize an upper bound on the EXPECTED TEST CORPUS ERROR.
• EXPECTED TEST CORPUS ERROR ≤ (TRAINING CORPUS ERROR) + (distance between the hyperplane and the nearest data point)^(-2)
• Classifier on right: higher TRAINING CORPUS ERROR, but lower EXPECTED TEST CORPUS ERROR.
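Stated slightly more formally, the bound on this slide has the familiar margin form. The expression below is a paraphrase of the standard margin bound with constants and capacity factors omitted, not necessarily the exact expression used in the talk.

```latex
% Margin-style generalization bound, paraphrasing the slide:
% expected test error <= training error plus a term that shrinks
% as the margin (distance from the hyperplane to the nearest
% training point) grows. Constants and capacity factors omitted.
\[
  \mathbb{E}[\text{test error}] \;\le\; \text{training error}
  \;+\; O\!\left(\frac{1}{\gamma^{2}}\right),
  \qquad
  \gamma \;=\; \min_{i}\; \frac{y_i\,(w^{\top} x_i + b)}{\lVert w \rVert}.
\]
```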

Page 20:

What are the SVMs trained to detect?

Simple answer: any binary distinction that will be useful to the recognizer.

Hard answer (as implemented in July 2004):
• SVMs trained to be correct in every frame
  – Articulatory-free features: speech vs. silence, vowel vs. consonant, sonorant vs. obstruent, nasal vs. non-nasal, fricative vs. non-fricative
  – Landmarks: stop release vs. any other frame, fricative release vs. any other frame, stop closure vs. any other frame, …
• SVMs trained to be correct given a specified context, and meaningless otherwise:
  – Primary articulator: lips vs. tongue blade vs. tongue body
  – Secondary articulators: voiced vs. unvoiced, nasal vs. not

Page 21:

Why are we studying binary distinctive features?

By focusing on binary distinctions, and using regularized learners (SVMs), we can "push the limit" of classifier complexity … in order to get high binary classification accuracy.

NTIMIT landmark vs. non-landmark accuracy:

Feature        Multiwindow MFCCs   MFCCs + Formants
–+continuant   94.1%               94.8%
+–continuant   94.9%               95.6%
–+sonorant     97.2%               97.0%
+–sonorant     96.4%               97.4%
–+syllabic     95.3%               96.1%
+–syllabic     90.1%               94.3%

TIMIT stop consonant releases:

blade   83.3%
lips    90.5%
body    88.1%

Page 22:

Perceptual Space Encodes Distinctive Features: Errors Independent even if Acoustics Not

Nonlinear transform implicit in the SVM kernel.

Page 23:

"Phonetic Features" = Nonlinear Transform followed by a One-Dimensional Cut

• Nonlinear transform: implicit in the SVM kernel
• The SVM extracts a discriminant dimension
• SVM discriminant dimension = argmin( error(margin) + 1/width(margin) )
• Niyogi & Burges, 2002: posterior PDF = sigmoid model in the discriminant dimension
• An equivalent model: likelihoods = Gaussian in the discriminant dimension

Page 24:

Soft Decisions once/5ms:

p( manner feature d(t) | Y(t) )
p( place feature d(t) | Y(t), t is a landmark )

Pipeline: 2000-dimensional acoustic feature vector → SVM → discriminant yi(t) → histogram → posterior probability of distinctive feature p( di(t)=1 | yi(t) )
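The mapping from SVM discriminant values to posterior probabilities can be illustrated with a one-dimensional sigmoid (Platt-style) fit; the workshop system used a histogram, and the data below are synthetic, so this is only a sketch of the idea.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Sketch: convert raw SVM discriminant outputs y_i(t) into posterior
# probabilities p(d_i(t)=1 | y_i(t)) by fitting a 1-D sigmoid on
# held-out frames (a Platt-style alternative to the histogram mapping
# described on this slide). Labels and discriminants are synthetic.

rng = np.random.default_rng(1)
n = 5000
labels = (rng.random(n) < 0.3).astype(int)           # true feature values d_i(t)
# Synthetic discriminants: the positive class tends to score higher.
disc = rng.normal(loc=labels * 2.0 - 1.0, scale=1.0)

calib = LogisticRegression()
calib.fit(disc.reshape(-1, 1), labels)               # 1-D sigmoid fit

new_scores = np.array([[-2.0], [0.0], [2.0]])
print("p(d=1 | y):", calib.predict_proba(new_scores)[:, 1])
```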

Page 25:

Landmarks → Words: The Problem of Pronunciation Variability

Page 26:

The Problem of Pronunciation Variability
(Livescu, 2004)

[Figure: number of pronunciations per word vs. word frequency, for read and casual speech.]

Observed pronunciations of "probably" (with counts):
  p r aa b iy        2
  p r ay             1
  p r aw l uh        1
  p r ah b iy        1
  p r aa lg iy       1
  p r aa b uw        1
  p ow ih            1
  p aa iy            1
  p aa b uh b l iy   1
  p aa ah iy         1

Page 27:

Landmarks → Words, Phonological Model #1: Underspecified Lexicon

Page 28:

Underspecified Lexicon
(Goldsmith, 1975)

Feature matrix for the words SAP, NAPS, PAT, SCAT (unspecified features are left blank):

Features    S A P | N A P S | P A T | S C A T
Vowel       – + – | – + – – | – + – | – – + –
sonorant    – – + – – – – –
continuant  + – – – – – – –
strident    + – – + – – + – –
lips        – + – + + – – –
blade       + – + – – + – –
voiced      – – – – – – –

• Once the listener hears [+vowel], the features sonorant, continuant, strident, lips, blade, and voiced are meaningless and redundant.

• There are no [+sonorant, +strident] or [+sonorant, -voiced] phonemes. Given [+strident], the features [sonorant, strident] are meaningless and redundant.

• If /s/ is in a consonant cluster, the listener only needs to hear [+strident] --- no other features are necessary, because no word could have anything but an /s/ in this position.
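To make the idea concrete, here is a small sketch of lexical access with underspecified entries: a candidate word matches if every feature its entry does specify agrees with the observed feature bundle, and unspecified features are simply ignored. The mini-lexicon and feature names are invented for illustration.

```python
# Sketch of lexical matching with an underspecified feature lexicon.
# Words and their (partial) feature specifications are invented;
# a feature that is absent from a segment's dict is "unspecified"
# and is never checked.

LEXICON = {
    # word: list of segments, each a dict of *specified* features only
    "sap": [{"strident": True},                   # /s/: [+strident] suffices
            {"vowel": True},                      # /a/
            {"vowel": False, "lips": True}],      # /p/: labial
    "pat": [{"vowel": False, "lips": True},
            {"vowel": True},
            {"vowel": False, "blade": True}],
}

def matches(entry, observed):
    """True if every specified feature of every segment agrees."""
    if len(entry) != len(observed):
        return False
    return all(obs.get(feat) == val
               for seg, obs in zip(entry, observed)
               for feat, val in seg.items())

if __name__ == "__main__":
    # Observed feature bundles (e.g. from landmark detectors); they may
    # contain more features than the lexicon bothers to specify.
    observed = [{"strident": True, "vowel": False, "voiced": False},
                {"vowel": True},
                {"vowel": False, "lips": True, "voiced": False}]
    for word, entry in LEXICON.items():
        print(word, matches(entry, observed))   # sap: True, pat: False
```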

Page 29:

Computational Model: Select Landmarks to Distinguish Confusable Word Pairs

• Rationale: the baseline HMM-based system already provides high-quality hypotheses
  – 1-best error rate from N-best lists: 24.4% (RT-03 dev set)
  – oracle error rate: 16.2%

• Method:
  1. Use an HMM-NN hybrid system to generate a first-pass word lattice
  2. Use landmark detection only where necessary, to correct errors made by the baseline recognition system

Example (fsh_60386_1_0105420_0108380):
  Ref: that cannot be that hard to sneak onto an airplane
  Hyp: they can be a that hard to speak on an airplane

Page 30:

Identifying Confusable Hypotheses

• Use existing alignment algorithms for converting lattices into confusion networks (Mangu, Brill & Stolcke 2000)
• Hypotheses ranked by posterior probability
• Generated from n-best lists without 4-gram or pronunciation model scores (→ higher WER compared to lattices)
• Multi-words ("I_don't_know") were split prior to generating confusion networks

[Figure: confusion network for the example above, with competing words such as they/that, can/can't, be/a/*DEL*, that, hard, to, sneak/speak, onto/on/an, airplane.]

Page 31:

Identifying Confusable Hypotheses

• How much can be gained from fixing confusions?
• Baseline error rate: 25.8%
• Oracle error rates when selecting the correct word from the confusion set:

# hypotheses to select from   Including homophones   Not including homophones
2                             23.9%                  23.9%
3                             23.0%                  23.0%
4                             22.4%                  22.5%
5                             22.0%                  22.1%

Page 32:

Selecting Relevant Landmarks

• Convert each word into a fixed-length vector (see the table on the next page and the sketch that follows it)
• Dimensions of the vector = frequencies of occurrence, in the word, of selected binary landmark-pair relationships:
  – Manner landmarks: precedence, e.g. V ≺ Son. Cons.
  – Manner & place features: overlap, e.g. Stop ○ +blade
• Not all possible relations are used; the dimensionality of the feature space is 40-60
• The vector for each word
  – … should be derived from actual pronunciation data, e.g., from landmarks automatically detected in a very large speech corpus
  – … unfortunately, due to time constraints, that experiment hasn't been run yet.
  – In the meantime, the vector for each word was derived from a standard pronunciation dictionary (pronlex).

Page 33:

Vector-Space Word Representation

Word    Start<Fric  Fric<Stop  Fric<Son  Fric<Vowel  Stop<Vowel  Vowel○high  Vowel○front  Fric○strident
speak   1           1          0         0           1           1           1            1
sneak   1           0          1         0           0           1           1            1
seek    1           0          0         1           0           1           1            1
he      1           0          0         1           0           1           1            0
she     1           0          0         1           0           1           1            1
steak   1           1          0         0           1           0           1            1
…
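Below is a sketch of how such a vector could be computed from a word's landmark sequence. The landmark transcriptions, the relation inventory, and the assumption that "A < B" means "A immediately precedes B" are simplifications made for this illustration, not the workshop's exact definitions.

```python
# Sketch: map a word's landmark sequence to a fixed-length vector of
# binary precedence ("A < B") and overlap ("A o B") relations, as in
# the table above. Landmark transcriptions are simplified, hand-made
# examples; precedence is treated here as immediate precedence.

# Each landmark: (manner, set of place/feature labels overlapping it).
WORDS = {
    "speak": [("Fric", {"strident"}), ("Stop", set()),
              ("Vowel", {"high", "front"}), ("Stop", set())],
    "sneak": [("Fric", {"strident"}), ("Son", set()),
              ("Vowel", {"high", "front"}), ("Stop", set())],
}

PRECEDENCE = [("Fric", "Stop"), ("Fric", "Son"), ("Fric", "Vowel"), ("Stop", "Vowel")]
OVERLAP = [("Vowel", "high"), ("Vowel", "front"), ("Fric", "strident")]

def word_vector(landmarks):
    manners = [m for m, _ in landmarks]
    vec = []
    # Precedence relations: does an A landmark immediately precede a B?
    for a, b in PRECEDENCE:
        vec.append(int(any(ma == a and mb == b
                           for ma, mb in zip(manners, manners[1:]))))
    # Overlap relations: does any A landmark overlap feature B?
    for a, feat in OVERLAP:
        vec.append(int(any(m == a and feat in feats for m, feats in landmarks)))
    return vec

for w, lms in WORDS.items():
    print(w, word_vector(lms))
```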

Page 34:

Maximum-Entropy Discrimination

Use a maxent classifier:

  P(y|x) = (1/Z(x)) exp( Σk λk fk(x,y) )

• Here: y = words, x = acoustics, f = landmark relationships
• Why a maxent classifier?
  – Discriminative classifier
  – Possibly large set of confusable words
  – Later addition of non-binary features
• Training: ideally on real landmark detection output
  – Here: on entries from the lexicon (includes pronunciation variants)
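As a rough illustration of this setup (not the workshop's implementation), the sketch below fits a logistic-regression model, which has the same exponential form as the maxent classifier above, to the binary landmark-relation vectors for "speak" and "sneak" from the earlier table. The duplication of training rows is only to make the toy fit well-behaved.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Sketch: a maxent model P(y|x) = (1/Z(x)) exp(sum_k lambda_k f_k(x,y)),
# realized as logistic regression over binary landmark-relation features.
# Feature vectors are the dictionary-derived rows from the table above.

FEATURES = ["Start<Fric", "Fric<Stop", "Fric<Son", "Fric<Vowel",
            "Stop<Vowel", "Vowel o high", "Vowel o front", "Fric o strident"]

speak = [1, 1, 0, 0, 1, 1, 1, 1]
sneak = [1, 0, 1, 0, 0, 1, 1, 1]
X = np.array([speak, sneak] * 5)            # tiny toy training set
y = np.array(["speak", "sneak"] * 5)

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

# Relations that actually distinguish the pair (Fric<Stop, Fric<Son,
# Stop<Vowel) receive the largest-magnitude weights, mirroring the
# per-confusion-set weights shown on the next slide.
for name, w in zip(FEATURES, clf.coef_[0]):
    print(f"{name:16s} {w:+.2f}")

# Query with an observed relation vector closer to "sneak".
print(clf.classes_, clf.predict_proba([[1, 0, 1, 0, 0, 1, 1, 1]]))
```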

Page 35:

Maximum-Entropy Discrimination

• Example: sneak vs. speak
• A different model is trained for each confusion set, so landmarks can have different weights in different contexts

Learned weights:

              speak    sneak
SC ○ +blade   -2.47     2.47
FR < SC       -2.47     2.47
FR < SIL       2.11    -2.11
SIL < ST       1.75    -1.75
…

Page 36:

Landmark Queries

• Select the N landmarks with the highest weights
• Ask the landmark detection module to produce scores for the selected landmarks within the word boundaries given by the baseline system

• Example:
  Confusion network → query:       sneak  1.70  1.99  SC ○ +blade  ?
  Landmark detectors → response:   sneak  1.70  1.99  SC ○ +blade  0.75  0.56

Page 37:

Landmarks → Words, Phonological Model #2: Articulatory Phonology

Page 38:

Articulatory Phonology: Lips and Tongue Have Different Variability

Page 39:

Articulatory Phonology
(Browman and Goldstein, 1990; slide from Livescu and Glass, 2004)

[Figure: gestural score with tract variables LIP-OP, TT-LOC, TT-OPEN, TB-LOC, TB-OPEN, VELUM, VOICING.]

• Many pronunciation phenomena can be parsimoniously described as resulting from asynchrony and reduction of quasi-independent speech articulators.
• warmth [w ao r m p th] - phone insertion?
• I don't know [ah dx uh_n ow_n] - phone deletion??
• several [s eh r v ax l] - exchange of two phones???
• instruments [ih_n s ch em ih_n n s], everybody [eh r uw ay]

Page 40:

Brief Review: Bayesian Networks

• Each node in the graph is a random variable
• An arrow represents dependent probabilities
• Probability distributions: one per variable
  – Number of columns = number of different values the variable can take
  – Number of rows = number of different values the variable's parents can take
• Modularity of the graph → modularity of computation; very complicated models can be used for speech recognition with not-so-bad computational cost

Example graph: G → H, G → L

G         0    1
P(G)      0.5  0.5

H         0    1
P(H|G=0)  0.7  0.3
P(H|G=1)  0.3  0.7

L         0    1
P(L|G=0)  0.1  0.9
P(L|G=1)  0.9  0.1
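A small sketch of inference in this toy network, assuming the structure G → H and G → L with the CPTs above: the joint factorizes as P(G, H, L) = P(G) P(H|G) P(L|G), and marginals or posteriors follow by summing out variables.

```python
# Toy Bayesian network G -> H, G -> L with the CPTs from the slide.
# Joint distribution: P(G,H,L) = P(G) * P(H|G) * P(L|G).

P_G = {0: 0.5, 1: 0.5}
P_H_given_G = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.3, 1: 0.7}}
P_L_given_G = {0: {0: 0.1, 1: 0.9}, 1: {0: 0.9, 1: 0.1}}

def joint(g, h, l):
    return P_G[g] * P_H_given_G[g][h] * P_L_given_G[g][l]

# Marginal P(H=1): sum the joint over G and L.
p_h1 = sum(joint(g, 1, l) for g in (0, 1) for l in (0, 1))

# Posterior P(G=1 | H=1, L=0) by Bayes' rule.
num = joint(1, 1, 0)
den = sum(joint(g, 1, 0) for g in (0, 1))

print("P(H=1) =", p_h1)                 # 0.5 by symmetry
print("P(G=1 | H=1, L=0) =", num / den) # about 0.95
```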

Page 41:

Dynamic Bayesian Network Model
(Livescu and Glass, 2004)

• The model is implemented as a dynamic Bayesian network (DBN): a representation, via a directed graph, of a distribution over a set of variables that evolve through time
• Example DBN with three articulators:

[Figure: DBN for one word, unrolled over frames t = 0, 1, …, T. Per-frame variables: word_t; for each articulator i = 1, 2, 3, an index ind_t^i into the canonical gesture sequence, a canonical setting U_t^i (given by the baseform pronunciation), and a surface setting S_t^i; asynchrony variables async_t^{1;2} and async_t^{1,2;3}, with checkSync_t^{1;2} and checkSync_t^{1,2;3} enforcing consistency, e.g. checkSync^{1;2} = 1 if |ind^1 − ind^2| is within the degree of asynchrony allowed by async^{1;2}. The asynchrony CPT assigns decreasing probability (e.g. 0.7, 0.2, 0.1, 0) to increasing degrees of asynchrony, so that Pr(async^{1;2} = a) models Pr(|ind^1 − ind^2| = a).]

Page 42:

A DBN Model of Articulatory Phonology for Speech Recognition
(Livescu and Glass, 2004)

• word_t: word ID at frame #t
• wdTr_t: word transition?
• ind_t^i: which gesture, from the canonical word model, should articulator i be trying to implement?
• async_t^{i;j}: how asynchronous are articulators i and j?
• U_t^i: canonical setting of articulator #i
• S_t^i: surface setting of articulator #i
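The sketch below illustrates the asynchrony mechanism described above under simplifying assumptions (two articulators, a hand-made gesture sequence, and an invented asynchrony prior); it is not Livescu and Glass's actual DBN, only a toy of the same idea: each articulator keeps its own index into the word's gesture sequence, and a prior over the index difference scores how far apart they may drift.

```python
import itertools

# Sketch of the per-frame asynchrony constraint in an articulatory DBN.
# Two articulators (e.g. lips and tongue) each track their own index
# ind^i into the word's canonical gesture sequence; a prior over the
# absolute index difference weights how asynchronous they may be.
# Gesture strings and probabilities are illustrative only.

WORD_GESTURES = {
    "lips":   ["closed", "open", "open", "closed"],   # canonical U^1 sequence
    "tongue": ["low",    "low",  "high", "high"],     # canonical U^2 sequence
}

# Pr(async = a): most mass on synchrony, a little on small asynchrony.
P_ASYNC = {0: 0.7, 1: 0.2, 2: 0.1}

def config_prob(ind_lips, ind_tongue):
    """Probability weight for one joint index configuration."""
    return P_ASYNC.get(abs(ind_lips - ind_tongue), 0.0)

if __name__ == "__main__":
    n = len(WORD_GESTURES["lips"])
    for i, j in itertools.product(range(n), repeat=2):
        p = config_prob(i, j)
        if p > 0:
            print(f"lips ind={i} ({WORD_GESTURES['lips'][i]:6s}) "
                  f"tongue ind={j} ({WORD_GESTURES['tongue'][j]:4s}) weight={p}")
```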

Page 43:

Incorporating the SVMs: An SVM-DBN Hybrid Model

[Figure: hybrid model for the word LIKE. The word maps to a canonical form (e.g. tongue front / palatal, tongue closed / semi-closed), which maps to a surface form (e.g. glide, front vowel, tongue mid / open); manner and place variables then connect to SVM outputs such as p( g_PGR(x) | palatal glide release ) and p( g_GR(x) | glide release ), computed from a multi-frame observation x including spectrum, formants, and the auditory model.]

Page 44:

Technological Evaluation

Page 45:

Acoustic Feature Selection

1. Accuracy per frame (%), stop releases only, NTIMIT:

             MFCCs+Shape       MFCCs+Formants
Kernel       Linear   RBF      Linear   RBF
+/- lips     78.3     90.7     92.7     95.0
+/- blade    73.4     87.1     79.6     85.1
+/- body     73.0     85.2     85.7     87.2

2. Word error rate: lattice rescoring, RT03-devel, one talker (WARNING: this talker is atypical.)
   Baseline: 15.0% (113/755)
   Rescoring, place based on MFCCs + formant-based params: 14.6% (110/755)
   Rescoring, place based on rate-scale + formant-based params: 14.3% (108/755)

Page 46:

SVM Training: Mixed vs. Targeted Data

Train                NTIMIT          NTIMIT&SWB      NTIMIT          Switchboard
Test                 NTIMIT          NTIMIT&SWB      Switchboard     Switchboard
Kernel               Linear  RBF     Linear  RBF     Linear  RBF     Linear  RBF
speech onset         95.1    96.2    86.9    89.9    71.4    62.2    81.6    81.6
speech offset        79.6    88.5    76.3    86.4    65.3    78.6    68.4    83.7
consonant onset      94.5    95.5    91.4    93.5    70.3    72.7    95.8    97.7
consonant offset     91.7    93.7    94.3    96.8    80.3    86.2    92.8    96.8
continuant onset     89.4    94.1    87.3    95.0    69.1    81.9    86.2    92.0
continuant offset    90.8    94.9    90.4    94.6    69.3    68.8    89.6    94.3
sonorant onset       95.6    97.2    97.8    96.7    85.2    86.5    96.3    96.3
sonorant offset      95.3    96.4    94.0    97.4    75.6    75.2    95.2    96.4
syllabic onset       90.7    95.2    91.4    95.5    69.5    78.9    87.9    92.6
syllabic offset      90.1    88.9    87.1    92.9    54.4    60.8    88.2    89.7

Page 47:

DBN-SVM: Models Nonstandard Phones

Example: "I don't know"
• /d/ becomes a flap
• /n/ becomes a creaky nasal glide

Page 48:

DBN-SVM Design Decisions

• What kind of SVM outputs should be used in the DBN?
  – Method 1 (EBS/DBN): Generate a landmark segmentation with EBS using manner SVMs, then apply place SVMs at appropriate points in the segmentation
    • Force the DBN to use the EBS segmentation
    • Allow the DBN to stray from the EBS segmentation, using place/voicing SVM outputs whenever available
  – Method 2 (SVM/DBN): Apply all SVMs in all frames, allow the DBN to consider all possible segmentations
    • In a single pass
    • In two passes: (1) manner-based segmentation; (2) place+manner scoring
• How should we take into account the distinctive feature hierarchy?
• How do we avoid "over-counting" evidence?
• How do we train the DBN (feature transcriptions vs. SVM outputs)?

Page 49:

DBN-SVM Rescoring Experiments

• For each lattice edge:
  – SVM probabilities are computed over the edge duration and used as soft evidence in the DBN
  – The DBN computes a score S ∝ P(word | evidence)
  – The final edge score is a weighted interpolation of the baseline scores and the EBS/DBN or SVM/DBN score

Date       Experimental setup                                              3-speaker WER (# errors)   RT03 dev WER
-          Baseline                                                        27.7 (550)                 26.8
Jul31_0    EBS/DBN, "hierarchically-normalized" SVM output probabilities,  27.6 (549)                 26.8
           DBN trained on subset of ICSI transcriptions
Aug1_19    + improved silence modeling                                     27.6 (549)
Aug2_19    EBS/DBN, unnormalized SVM probs + fricative lip feature         27.3 (543)                 26.8
Aug4_2     + DBN trained using SVM outputs                                 27.3 (543)
Aug6_20    + full feature hierarchy in DBN                                 27.4 (545)
Aug7_3     + reduction probabilities depend on word frequency              27.4 (544)
Aug8_19    + retrained SVMs + nasal classifier + DBN bug fixes             27.4 (544)
Aug11_19   SVM/DBN, 1 pass                                                 Miserable failure!
Aug14_0    SVM/DBN, 2 pass                                                 27.3 (542)
Aug14_20   SVM/DBN, 2 pass, using only high-accuracy SVMs                  27.2 (541)

Page 50:

Discriminative Pronunciation Model

            WER     Insertions    Deletions     Substitutions
Baseline    25.8%   2.6% (982)    9.2% (3526)   14.1% (5417)
Rescored    25.8%   2.6% (984)    9.2% (3524)   14.1% (5408)

RT-03 dev set, 35497 words, 2930 segments, 36 speakers (Switchboard and Fisher data)

Rescored: product combination of old and new probability distributions, weights 0.8 (old), 0.2 (new)

• The correct/incorrect decision changed in about 8% of all cases
• Slightly higher number of fixed errors vs. new errors
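A sketch of the weighted product (log-linear) combination described above. The probabilities below are invented; the 0.8/0.2 weights are the ones quoted on this slide.

```python
import math

# Sketch: combine baseline and landmark-based word posteriors as a
# weighted product, p_combined ~ p_old^0.8 * p_new^0.2, renormalized
# over the competing hypotheses in a confusion set.

W_OLD, W_NEW = 0.8, 0.2

def combine(p_old, p_new):
    """Log-linear interpolation of two posteriors over the same words."""
    scores = {w: W_OLD * math.log(p_old[w]) + W_NEW * math.log(p_new[w])
              for w in p_old}
    z = sum(math.exp(s) for s in scores.values())
    return {w: math.exp(s) / z for w, s in scores.items()}

if __name__ == "__main__":
    p_old = {"speak": 0.6, "sneak": 0.4}   # baseline confusion-set posteriors
    p_new = {"speak": 0.2, "sneak": 0.8}   # landmark-model posteriors
    print(combine(p_old, p_new))
```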

Page 51:

Analysis

• When does it work?
  – The detectors give high probability for the correct distinguishing feature

• When does it not work?
  – Problems in lexicon representation
  – Landmark detectors are confident but wrong

Examples (detector scores for the distinguishing landmark):
  once (correct) vs. what (false):          Sil ○ +blade  0.87
  like (correct) vs. liked (false):         Sil ○ +blade  0.95
  can't [kæ̃P t] (correct) vs. cat (false):  SC ○ +nasal   0.26
  mean (correct) vs. me (false):            V < +nasal    0.76

Page 52:

Analysis

• Incorrect landmark scores are often due to word boundary effects, e.g.:

  [Figure: example word boundaries for "much" followed by "he"/"she".]

• Word boundaries given by the baseline system may exclude relevant landmarks or include parts of neighbouring words
• The DBN-SVM system also failed when word boundaries were grossly misaligned

Page 53:

Conclusions

• SVMs work best when:
  – Mixed training data, at least 3000 landmarks per class
  – Manner classification: small acoustic feature vectors are OK (3-20 dimensions)
  – Place classification: large acoustic feature vectors are best (~2000 dimensions)

• DBN-SVM correctly models non-canonical pronunciations
  – The DBN is able to match a nasalized glide in place of /n/
  – One talker laughed a lot while speaking; DBN-SVM reduced WER for that talker

• Both DBN-SVM and MaxEnt need more training data
  – Our training data: 3.5 hours. Baseline HMM training data: 3000 hours.
  – DBN-SVM novel errors: mostly unexpected pronunciations
  – The MaxEnt model currently defines a "lexically discriminative" feature by comparing dictionary entries; therefore it fails most frequently when observing pronunciation variants.
  – The MaxEnt model should instead be trained using automatic landmark transcriptions of confusable words from a large training corpus.

• Both DBN-SVM and MaxEnt are sensitive to word boundary time errors. Solution: probabilistic word boundary times?