
Page 1: OSU ASAT Status Report

OSU ASAT Status Report

Jeremy Morris

Yu Wang

Ilana Bromberg

Eric Fosler-Lussier

Keith Johnson

13 October 2006

Page 2: OSU ASAT Status Report

Personnel changes

• Jeremy and Yu are not currently on the project
  – Jeremy is being funded on AFRL/DAGSI project
    • Lexicon learning from orthography
    • However, he is continuing to help in spare time
  – Yu is currently in transition
• New student (to some!): Ilana Bromberg
  – Technically funded as of 10/1 but did some experiments for an ICASSP paper in Sept.
  – Still sorting out project for this year

Page 3: OSU ASAT Status Report

Future potential changes

• May transition in another student in WI 06
  – Carry on further with some of Jeremy's experiments

Page 4: OSU ASAT Status Report

What’s new?

• First pass on the parsing framework
  – Last time: talked about different models
    • Naïve Bayes, Dirichlet modeling, MaxEnt models
  – This time: settled on Conditional Random Fields framework
    • Monophone CRF phone recognition beating triphone HTK recognition using attribute detectors
    • Ready for your inputs!
• More boundary work
  – Small improvements seen in integrating boundary information into HMM recognition
  – Still to be seen if it helps CRFs

Page 5: OSU ASAT Status Report

Parsing

• Desired: ability to combine the output of multiple, correlated attribute detectors to produce
  – Phone sequences
  – Word sequences
• Handle both semi-static & dynamic events
  – Traditional phonological features
  – Landmarks, boundaries, etc.
• CRFs are a good bet for this

Page 6: OSU ASAT Status Report

Conditional Random Fields

• A form of discriminative modelling
  – Has been used successfully in various domains such as part-of-speech tagging and other Natural Language Processing tasks
• Processes evidence bottom-up
  – Combines multiple features of the data
  – Builds the probability P(sequence | data)
    • Computes the joint probability of the label sequence given the data
• Minimal assumptions about input
  – Inputs don't need to be decorrelated
    • cf. diagonal-covariance HMMs

Page 7: OSU ASAT Status Report

Conditional Random Fields

• CRFs are based on the idea of Markov Random Fields
  – Modelled as an undirected graph connecting labels with observations
  – Observations in a CRF are not modelled as random variables

[Figure: label chain /k/ /k/ /iy/ /iy/ /iy/, with each label connected to an observation X]

Transition functions add associations between transitions from one label to another; state functions help determine the identity of the state.

Page 8: OSU ASAT Status Report

Conditional Random Fields

$$P(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_t \Big( \sum_i \lambda_i f_i(x, y_t) + \sum_j \mu_j g_j(x, y_t, y_{t-1}) \Big) \right)$$

• The Hammersley-Clifford theorem states that a random field is an MRF iff it can be described in the above form
  – The exponential is the sum of the clique potentials of the undirected graph

State feature function: f([x is stop], /t/)
  – One possible state feature function for our attributes and labels

State feature weight: λ = 10 (strong)
  – One possible weight value for this state feature

Transition feature function: g(x, /iy/, /k/)
  – One possible transition feature function; indicates /k/ followed by /iy/

Transition feature weight: μ = 4
  – One possible weight value for this transition feature
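To make the formula and the example feature functions above concrete, here is a minimal Python sketch (illustrative only; the project's actual implementation is the Java CRF code described later) of how the unnormalized score of a label sequence combines state features f_i weighted by λ_i and transition features g_j weighted by μ_j. The specific functions and weight values (λ = 10, μ = 4) are the toy examples from the slide.

```python
import math

# Hypothetical state feature function f_i(x_t, y_t): fires when the "stop"
# attribute is detected in the current frame and the proposed label is /t/.
def f_stop_t(x_t, y_t):
    return 1.0 if x_t.get("stop", 0.0) > 0.5 and y_t == "/t/" else 0.0

# Hypothetical transition feature function g_j(x_t, y_t, y_prev):
# fires when /k/ is followed by /iy/.
def g_k_iy(x_t, y_t, y_prev):
    return 1.0 if y_prev == "/k/" and y_t == "/iy/" else 0.0

state_feats = [(f_stop_t, 10.0)]   # (f_i, lambda_i); lambda = 10 (strong) from the slide
trans_feats = [(g_k_iy, 4.0)]      # (g_j, mu_j);     mu = 4 from the slide

def crf_score(x, y):
    """Unnormalized log-score: sum_t (sum_i lambda_i*f_i + sum_j mu_j*g_j)."""
    total = 0.0
    for t, (x_t, y_t) in enumerate(zip(x, y)):
        total += sum(lam * f(x_t, y_t) for f, lam in state_feats)
        if t > 0:
            total += sum(mu * g(x_t, y_t, y[t - 1]) for g, mu in trans_feats)
    return total

# P(y | x) = exp(crf_score(x, y)) / Z(x), where Z(x) sums the exp-scores over
# all possible label sequences (computed with a forward recursion in practice).
x = [{"stop": 0.9}, {"stop": 0.8}, {"stop": 0.05}]   # toy per-frame attribute outputs
y = ["/t/", "/k/", "/iy/"]
print(math.exp(crf_score(x, y)))   # exponent here is 10 + 4 = 14 for this toy sequence
```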

Page 9: OSU ASAT Status Report

Conditional Random Fields

• Conceptual overview
  – Each attribute of the data we are trying to model fits into a feature function that associates the attribute and a possible label
    • A positive value if the attribute appears in the data
    • A zero value if the attribute is not in the data
  – Each feature function carries a weight that gives the strength of that feature function for the proposed label
    • High positive weights indicate a good association between the feature and the proposed label
    • High negative weights indicate a negative association between the feature and the proposed label
    • Weights close to zero indicate the feature has little or no impact on the identity of the label

Page 10: OSU ASAT Status Report

Experimental Setup

• Attribute Detectors
  – ICSI QuickNet Neural Networks
• Two different types of attributes
  – Phonological feature detectors
    • Place, Manner, Voicing, Vowel Height, Backness, etc.
    • Features are grouped into eight classes, with each class having a variable number of possible values based on the IPA phonetic chart
  – Phone detectors
    • Neural network outputs based on the phone labels – one output per label
  – Classifiers were applied to 2960 utterances from the TIMIT training set

Page 11: OSU ASAT Status Report

Experimental Setup

• Outputs from the neural nets are themselves treated as feature functions for the observed sequence – each attribute/label combination gives us a value for one feature function
  – Note that this makes the feature functions non-binary features (see the sketch below)
    • Different from most NLP uses of CRFs
    • Along the lines of Gaussian-based CRFs (e.g., Microsoft)
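As a rough illustration of such non-binary feature functions (a sketch with hypothetical attribute names, labels, and posterior values, not the project's code), each frame's detector outputs can be expanded into one real-valued feature per attribute/label combination:

```python
# Sketch: expand one frame's detector posteriors into real-valued CRF state
# features, one per (attribute, label) pair. All names and values are hypothetical.
def state_features(frame_posteriors, labels):
    feats = {}
    for label in labels:
        for attr, post in frame_posteriors.items():
            # Non-binary: the feature value is the detector output itself,
            # rather than the 0/1 indicator common in NLP CRFs.
            feats[(attr, label)] = post
    return feats

frame = {"stop": 0.82, "voiced": 0.10, "vowel": 0.05}
feats = state_features(frame, ["/t/", "/d/", "/iy/"])
print(feats[("stop", "/t/")])   # -> 0.82
```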

Page 12: OSU ASAT Status Report

Experiment 1

• Goal: Implement a Conditional Random Field model on ASAT-style phonological feature data
  – Perform phone recognition
  – Compare results to those obtained via a Tandem HMM system

Page 13: OSU ASAT Status Report

Experiment 1 - Results

Model                 Phone Accuracy   Phone Correct
Tandem [monophone]    61.48%           63.50%
Tandem [triphone]     66.69%           72.52%
CRF [monophone]       65.29%           66.81%

• CRF system trained on monophones with these features achieves accuracy superior to the HMM on monophones
  – CRF comes close to achieving HMM triphone accuracy
  – CRF uses many fewer parameters

Page 14: OSU ASAT Status Report

Experiment 2

• Goals:
  – Apply CRF model to phone classifier data
  – Apply CRF model to combined phonological feature classifier data and phone classifier data
    • Perform phone recognition
    • Compare results to those obtained via a Tandem HMM system

Page 15: OSU ASAT Status Report

Experiment 2 - Results

Model                         Phone Accuracy   Phone Correct
Tandem [mono] (phones)        60.48%           63.30%
Tandem [tri] (phones)         67.32%           73.81%
CRF [mono] (phones)           66.89%           68.49%
Tandem [mono] (phones/feas)   61.78%           63.68%
Tandem [tri] (phones/feas)    67.96%           73.40%
CRF [mono] (phones/feas)      68.00%           69.58%

Note that the Tandem HMM result is its best result, using only the top 39 features following a principal components analysis

Page 16: OSU ASAT Status Report

Experiment 3

• Goal:
  – Previous CRF experiments used phone posteriors for the CRF, and linear outputs transformed via a Karhunen-Loeve (KL) transform for the HMM system
    • This transformation is needed to improve the HMM performance through decorrelation of the inputs (see the sketch below)
  – Using the same linear outputs as the HMM system, do our results change?
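For reference, a KL transform amounts to projecting the mean-centered detector outputs onto the eigenvectors of their covariance matrix. A minimal numpy sketch, assuming a frames-by-dimensions matrix of linear outputs (the function name, random stand-in data, and dimensions are illustrative):

```python
import numpy as np

def kl_transform(X, keep=None):
    """Decorrelate rows of X (frames x dims) via the Karhunen-Loeve transform.

    Returns the projected data and the eigenvector basis. 'keep' optionally
    truncates to the top components (e.g. 39, as in the Tandem HMM setup).
    """
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]          # sort descending by variance
    basis = eigvecs[:, order][:, :keep] if keep else eigvecs[:, order]
    return Xc @ basis, basis

# Example with random stand-in data (real inputs would be MLP linear outputs).
X = np.random.randn(1000, 61)
Y, basis = kl_transform(X, keep=39)
print(Y.shape)   # (1000, 39) decorrelated features
```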

Page 17: OSU ASAT Status Report

Experiment 3 - Results

Model                            Phone Accuracy   Phone Correct
CRF (phones) posteriors          67.27%           68.77%
CRF (phones) linear KL           66.60%           68.25%
CRF (phones) post. + linear      68.18%           69.87%
CRF (features) posteriors        65.25%           66.65%
CRF (features) linear KL         66.32%           67.95%
CRF (features) post. + linear    66.89%           68.48%
CRF (features) linear (no KL)    65.89%           68.46%

Also shown – Adding both feature sets together and giving the system supposedly redundant information leads to a gain in accuracy

Page 18: OSU ASAT Status Report

Experiment 4

• Goal:
  – Previous CRF experiments did not allow for realignment of the training labels
    • Boundaries for labels provided by TIMIT hand transcribers used throughout training
    • HMM systems allowed to shift boundaries during EM learning
  – If we allow for realignment in our training process, can we improve the CRF results?

Page 19: OSU ASAT Status Report

Experiment 4 - Results

Model                        Phone Accuracy   Phone Correct
Tandem [tri] (phones)        67.32%           73.81%
CRF (phones) no realign      67.27%           68.77%
CRF (phones) realign         69.63%           72.40%
Tandem [tri] (features)      66.69%           72.52%
CRF (features) no realign    65.25%           66.65%
CRF (features) realign       67.52%           70.13%

Allowing realignment gives accuracy results for a monophone trained CRF that are superior to a triphone trained HMM, with fewer parameters

Page 20: OSU ASAT Status Report

Code status

• Current version: Java-based, multithreaded
• TIMIT training takes a few days on an 8-processor machine
• At test time, the CRF generates an AT&T FSM lattice
  – Use AT&T FSM tools to decode
  – Will (hopefully) make it easier to decode words
• Code is stable enough to try different kinds of experiments quickly
  – Ilana joined the group and ran an experiment within a month

Page 21: OSU ASAT Status Report

Joint models of attributes

• Monica's work showed that modeling attribute detection with joint detectors worked better
  – e.g. modeling manner/place jointly is better
  – cf. Chang et al.: hierarchical detectors work better
• This study: can we improve phonetic attribute-based detection by using phone classifiers and summing?
  – Phone classifier: the ultimate joint modeling

Page 22: OSU ASAT Status Report

Independent vs Joint Feature Modeling

• Baseline 1:
  – 61 phone posteriors (joint modeling)
• Baseline 2:
  – 44 feature posteriors (independent modeling)
• Experiment:
  – Feature posteriors derived from the 61 phone posteriors
  – In each frame: weight for each feature = summed weight of each phone exhibiting that feature
    • e.g. P(stop) = P(/p/) + P(/t/) + P(/k/) + P(/b/) + P(/d/) + P(/g/) (see the sketch below)
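The derived-posterior experiment can be illustrated with a short Python sketch. The phone-to-feature map below is a tiny hypothetical fragment for illustration, not the full 61-phone / 44-feature inventory used in the experiments.

```python
# Sketch: per frame, a phonological feature's posterior is the sum of the
# posteriors of all phones exhibiting that feature.
FEATURE_PHONES = {
    "stop":  ["p", "t", "k", "b", "d", "g"],
    "nasal": ["m", "n", "ng"],
}

def derive_feature_posteriors(phone_posteriors):
    """phone_posteriors: dict phone -> P(phone | frame). Returns feature posteriors."""
    return {
        feat: sum(phone_posteriors.get(ph, 0.0) for ph in phones)
        for feat, phones in FEATURE_PHONES.items()
    }

frame = {"p": 0.05, "t": 0.60, "k": 0.10, "b": 0.01, "d": 0.02, "g": 0.01, "m": 0.03}
print(derive_feature_posteriors(frame))   # e.g. P(stop) = 0.79, P(nasal) = 0.03
```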

Page 23: OSU ASAT Status Report

Results: Joint vs Independent Modeling

Posterior Type                   Number of Weights   Accuracy   % Correct
Phonemes                         61                  67.27      68.77
Features                         44                  65.25      66.65
Features derived from Phonemes   44                  66.45      67.94

Page 24: OSU ASAT Status Report

Removal of Feature Classes

Page 25: OSU ASAT Status Report

Results of Feature Class Removal

Page 26: OSU ASAT Status Report

Continued work on phone boundary detection

• Basic idea: eventually we want to use these as transition functions in CRF

• CRF was still under development when this study was done
  – Added features corresponding to P(boundary | data) to the HMM

Page 27: OSU ASAT Status Report

Phone Boundary Detection

Input features: phonological features, acoustic features (PLP), and phone classifier outputs
Classification methods: MLP, metric-based method

Evaluation and Results
Using phonological features as an input representation was modestly better than the phone posterior estimates themselves. Phonological feature representations also seemed to edge out direct acoustic representations. Phonological feature MLPs are more complex to train. The nonlinear representations learned by the MLP were better for boundary detection than metric-based methods.

Page 28: OSU ASAT Status Report

How to incorporate phone boundaries, estimated by a multi-layer perceptron (MLP), into an HMM system.

Five-state HMM phone model to capture boundary information

In order to integrate phone boundary information in speech recognition, phone boundary information was concatenated to the MFCCs as additional input features. We explicitly modeled the entering and exiting states of a phone as separate, one-frame distributions. The proposed 5-state HMM phone model is introduced below.

The two additional boundary states were intended to catch phone-boundary transitions, while the three self-looped states in the center can model phone-internal information. Escape arcs were also included to bypass the boundary states for short phones.
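As a rough sketch of what such a topology could look like in transition-matrix form (the exact structure and all probabilities below are assumptions for illustration; the actual HTK model defines and re-estimates its own transitions during training):

```python
import numpy as np

# Illustrative 5-state phone-model topology (emitting states 0..4):
#   state 0 = entry boundary state (one frame, no self-loop)
#   states 1-3 = self-looped center states (phone-internal information)
#   state 4 = exit boundary state (one frame, no self-loop)
# Escape arcs let short phones bypass the boundary states: the entry
# distribution can skip state 0, and state 3 can exit without visiting state 4.
N = 5
pi = np.array([0.9, 0.1, 0.0, 0.0, 0.0])       # entry; 0.1 = escape past the entry boundary
A = np.zeros((N, N + 1))                        # last column = leaving the phone
A[0, 1] = 1.0                                   # entry boundary -> first center state
A[1, 1] = 0.6;  A[1, 2] = 0.4                   # self-looped center states
A[2, 2] = 0.6;  A[2, 3] = 0.4
A[3, 3] = 0.6;  A[3, 4] = 0.3;  A[3, N] = 0.1   # escape arc skipping the exit boundary
A[4, N] = 1.0                                   # exit boundary -> leave the phone

assert np.isclose(pi.sum(), 1.0) and np.allclose(A.sum(axis=1), 1.0)
print(A)
```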

Experiments

• For simplicity, the linear outputs from the PLP+MLP detector were used as the phone boundary features, instead of the ones from the features+MLP detector. Several experiments were conducted:

0) Baseline system: standard 39 MFCCs

1) MFCCs + phone boundary features (no KLT)

2) MFCCs + phone boundary features, which were decorrelated using a Karhunen-Loeve transformation (KLT)

3) MFCCs + phone boundary features, with a KL transformation over all features

4) MFCCs (KLTed), to show the effect of the KL transformation on the MFCCs

• The training and recognition were conducted with the HTK toolkit on the TIMIT data set. When reaching the 4-mixture stage, some experiments failed due to data sparsity. We adopted a hybrid 2/4-mixture strategy, promoting triphones to 4 mixtures when the data was sufficient.

[Figure: Proposed five-state HMM phone model]

Page 29: OSU ASAT Status Report

Results & Conclusion

Results

The proposed 5-state HMM models performed better than their conventional 3-state counterparts on all training datasets.

Decorrelation improved the accuracy of recognition on binary boundaries.

Including MFCCs in the decorrelation improved recognition further.

For comparison, several experiments were also conducted on a 5-state HMM with a traditional, left-to-right, all-self-loops transition matrix. The results showed vastly increased deletions, indicating a bias against short-duration phones, whereas the proposed model is balanced between insertions and deletions.

Recently, I modified the decision tree questions in the tied-state triphone step, and pushed the model to 16-mixture Gaussians. Some of those results are also shown in the table.

Inputs                       3-state phone recognition acc.   5-state phone recognition acc.
Baseline: MFCC               62.37%                           63.41%
1) MFCC + Boundaries         61.25                            62.79
2) MFCC + KLT(Boundaries)    62.47 (67.22 16-mix)             63.78 (68.02 16-mix)
3) KLT(MFCC + Boundaries)    63.20                            64.38
4) KLT(MFCC)                 62.70                            –

Conclusion

Phonological features perform better as inputs to phone boundary classifiers than acoustic features. The results suggest that the pattern changes in the phonological feature space may lead to robust boundary detection.

By exploring the potential space of representations of boundaries, we argue that phonetic transitions are very important for automatic speech recognition.

HMMs can be attuned to phone boundary transitions by explicitly modeling phone transition states. Also, the combined strategy of binary boundary features, KLT, and the 5-state representation gives almost a 2% absolute improvement in phone recognition. Considering that the boundary information we integrated is one of the simplest representations, the result is rather encouraging.

In future work, we hope to integrate phone boundary information as additional features to the CRF.